Resolve the MEOS surface through the unified GeneratedFunctions jar #18

Closed
estebanzimanyi wants to merge 144 commits into MobilityDB:main from estebanzimanyi:feat/jmeos-generatedfunctions-foundation

@estebanzimanyi

Member

MobilitySpark resolves the MEOS native surface through the unified functions.GeneratedFunctions facade — the MEOS-API/meos-idl.json codegen output shared with MobilityFlink — bundled as libs/JMEOS-1.4.jar, replacing the flat functions.functions generator. The bundled lib/libmeos.so carries the matching surface: th3index, the mul_ temporal multiplication naming, the always-covers spatial relations, the JVM-safe noexit error handler, and reentrant GEOS. Owned char* returns are freed inside the facade, so String-returning UDFs leave no native allocation behind, and the cbuffer ordering check follows the canonical distinct-count form. The full UDF surface across the base, temporal, geo, cbuffer, npoint, pose, rgeo, th3index and portable families builds and passes 926/926 tests against this pairing.

Luis Alfredo Leon Villapun and others added 30 commits August 7, 2023 12:05
…ily flags

Each optional extended temporal-type family (cbuffer, npoint, pose, rgeo, h3)
is included or excluded at build time through a Maven flag whose name mirrors
the MEOS/MobilityDB CMake options: -DCBUFFER=OFF, -DNPOINT=OFF, -DPOSE=OFF,
-DRGEO=OFF and -DH3=OFF drop that family's package from compilation. RGEO
depends on POSE so disabling POSE also drops rgeo, and disabling H3 also drops
the BerlinMOD demo and examples that materialise the th3index trip column.
MobilitySparkSession registers the families reflectively, so an excluded
package's absent registrar class is skipped with zero residue while the
remaining families stay intact. The CI workflow builds the fully excluded
variant and asserts the dropped packages produce no classes.
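The reflective registration described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the MobilitySparkSession code: the class name FamilyRegistrationSketch, the method presentRegistrars and the registrar class names are hypothetical, and a JDK class stands in for a compiled-in family so the sketch runs without the project on the classpath.

```java
import java.util.ArrayList;
import java.util.List;

public class FamilyRegistrationSketch {

    /** Returns the registrar class names that are present on the classpath. */
    public static List<String> presentRegistrars(List<String> registrarClassNames) {
        List<String> present = new ArrayList<>();
        ClassLoader loader = FamilyRegistrationSketch.class.getClassLoader();
        for (String name : registrarClassNames) {
            try {
                // initialize=false: only probe for presence, run no static initializers.
                Class.forName(name, false, loader);
                present.add(name);
            } catch (ClassNotFoundException e) {
                // The family was excluded at build time: skip it and keep the rest.
            }
        }
        return present;
    }

    public static void main(String[] args) {
        List<String> found = presentRegistrars(List.of(
            "java.util.ArrayList",                        // stands in for a compiled-in family
            "org.example.excluded.HypotheticalRegistrar"  // stands in for an excluded family
        ));
        System.out.println(found);
    }
}
```

A real registrar would be instantiated and invoked after the presence check; the sketch stops at the lookup because that is the step that makes an excluded package harmless.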
…acc/all

# Conflicts:
#	src/test/java/org/mobilitydb/spark/temporal/ConstructorUDFsExtTest.java
…/all

# Conflicts:
#	.github/workflows/maven.yml
#	src/main/java/org/mobilitydb/spark/MobilitySparkSession.java
#	src/test/java/org/mobilitydb/spark/temporal/MathUDFsExtTest.java
… into acc/all

# Conflicts:
#	src/main/java/org/mobilitydb/spark/cbuffer/CbufferUDFs.java

cbuffer_cmp is not a stable total order under standalone MEOS: the embedded
geometry carries uninitialized padding (the same reason cbuffer_hash and
cbuffer_hash_extended are unbound), so a pairwise gt/cmp scalar invariant on
two values is non-deterministic across calls. MobilityDB exercises cbuffer
ordering through 151_cbufferset_tbl as numValues(set(array_agg(DISTINCT cb
ORDER BY cb))), an order-independent distinct count where cbuffer_cmp only
feeds the aggregate's ORDER BY. The Spark test mirrors that form: it deduplicates
and orders the values (TreeSet) and asserts the count of distinct cbuffers, which
is deterministic because equal cbuffers serialize to identical hex-WKB.
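The distinct-count form can be illustrated with a short sketch. It assumes the values are compared as hex-WKB strings in natural String order; the class and method names are hypothetical and the strings are placeholders, not real cbuffer encodings.

```java
import java.util.List;
import java.util.TreeSet;

public class CbufferDistinctCountSketch {

    /** Counts distinct values after ordering and deduplicating their hex-WKB text. */
    public static int distinctCount(List<String> hexWkbValues) {
        // TreeSet orders and deduplicates; its size does not depend on input order.
        return new TreeSet<>(hexWkbValues).size();
    }

    public static void main(String[] args) {
        // Placeholder strings, not real cbuffer hex-WKB.
        List<String> a = List.of("0101AA", "0101BB", "0101AA", "0101CC");
        List<String> b = List.of("0101CC", "0101AA", "0101BB", "0101AA");
        System.out.println(distinctCount(a));                     // 3
        System.out.println(distinctCount(a) == distinctCount(b)); // true
    }
}
```

Because the assertion is on the set size rather than on the result of a single pairwise comparison, it holds regardless of the order in which the values arrive.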

The MobilitySpark UDF layer resolves the MEOS native surface through the unified
functions.GeneratedFunctions facade (the MEOS-API/meos-idl.json codegen output
shared with MobilityFlink), bundled as libs/JMEOS-1.4.jar, and the bundled
lib/libmeos.so the CI copies into place carries the matching surface: th3index,
the mul_ temporal multiplication naming, the always-covers spatial relations,
the JVM-safe noexit error handler, and reentrant GEOS. Owned char* returns are
freed in the facade, so String-returning UDFs leave no native allocation behind.
The per-worktree isolated Maven repository is dropped from tracking and ignored.

The native-leak tests sample VmRSS after System.gc(), which never returns
glibc's freed malloc arenas to the OS, so a UDF that allocates and frees a
large transient buffer (the merged trajectory geometry in trajectory()) leaves
the freed chunks resident in the arena. That retention is glibc-version
dependent, making the RSS proxy unstable across environments. forceGc() now
calls malloc_trim(0) through a minimal libc binding before sampling, so VmRSS
reflects genuinely retained native memory; a real leak survives the trim and
still trips the assert. The trim is glibc-only: elsewhere the binding is null
and the call is skipped.
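The sampling side of this check can be sketched with the JDK alone. The sketch below only reads VmRSS from /proc/self/status; the malloc_trim(0) call is left out because it needs a native binding, and the class and method names are hypothetical rather than taken from the test suite.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

public class VmRssSketch {

    /** Extracts the VmRSS value in kB from /proc/<pid>/status lines, or -1 if absent. */
    public static long parseVmRssKb(List<String> statusLines) {
        for (String line : statusLines) {
            if (line.startsWith("VmRSS:")) {
                // Line format: "VmRSS:     12345 kB"
                String[] parts = line.substring("VmRSS:".length()).trim().split("\\s+");
                return Long.parseLong(parts[0]);
            }
        }
        return -1;
    }

    /** Reads the current process's VmRSS; returns -1 where /proc is unavailable. */
    public static long currentVmRssKb() {
        Path status = Path.of("/proc/self/status");
        if (!Files.isReadable(status)) {
            return -1;
        }
        try {
            return parseVmRssKb(Files.readAllLines(status));
        } catch (IOException e) {
            return -1;
        }
    }

    public static void main(String[] args) {
        System.out.println(parseVmRssKb(List.of("Name:\tjava", "VmRSS:\t   20480 kB"))); // 20480
        System.out.println(currentVmRssKb()); // platform dependent
    }
}
```

A leak probe would take one sample before and one after the call loop and compare the difference against the bound.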

minDistance(tgeompoint[], tgeompoint[]) is registered under the SQL name
minDistance, backed by GeneratedFunctions.mindistance_tgeoarr_tgeoarr. The UDF
marshals each array of hex trips into a native Temporal*[] and keeps the
buffers strongly reachable across the native call with reachabilityFence, so
the kernel never reads memory the JVM reclaimed mid-call. BerlinMOD Q5 uses the
set-set form minDistance(array_agg(trips), array_agg(trips)) over the licence
pairs. The SQL is identical across MobilityDB, MobilityDuck and MobilitySpark,
with the N-by-N comparison resolved inside the aggregate by the STBox prune, so
the Spark-specific q05_spark variant is removed. The bundled jar and libmeos
carry the mindistance_tgeoarr_tgeoarr name and reentrant GEOS.
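The reachability pattern can be sketched with the JDK alone. Here direct ByteBuffers stand in for the native Temporal*[] and fakeNativeCall is a hypothetical placeholder for the MEOS kernel; only Reference.reachabilityFence is the real API the commit message names.

```java
import java.lang.ref.Reference;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class ReachabilityFenceSketch {

    /** Hypothetical stand-in for a native kernel that reads the marshalled buffers. */
    static long fakeNativeCall(List<ByteBuffer> buffers) {
        long total = 0;
        for (ByteBuffer b : buffers) {
            total += b.remaining();
        }
        return total;
    }

    /** Marshals each string into off-heap memory and keeps it reachable across the call. */
    public static long marshalAndCall(String[] hexTrips) {
        List<ByteBuffer> buffers = new ArrayList<>();
        for (String hex : hexTrips) {
            byte[] bytes = hex.getBytes(StandardCharsets.US_ASCII);
            ByteBuffer direct = ByteBuffer.allocateDirect(bytes.length);
            direct.put(bytes);
            direct.flip();
            buffers.add(direct);
        }
        try {
            return fakeNativeCall(buffers);
        } finally {
            // The list, and through it every buffer, stays strongly reachable
            // until this point, so the GC cannot reclaim the memory mid-call.
            Reference.reachabilityFence(buffers);
        }
    }

    public static void main(String[] args) {
        System.out.println(marshalAndCall(new String[] {"AB", "CDEF"})); // 6
    }
}
```

Placing the fence in a finally block keeps the guarantee on the exceptional path as well.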

queries.sql is the one `-- @query`-delimited source that all three
runners (PostgreSQL, DuckDB, MobilitySpark) split, so the benchmark SQL
cannot drift between platforms; the Spark runner applies preprocessForSpark
as a dialect transform on each section. The bench measures each query with
the noop sink, which forces full materialisation of projection-only
expressions such as the set-set minDistance(tgeompoint[], tgeompoint[]) in
Q5. Query geometries parse with the configurable GEOM_SRID
(-Dberlinmod.srid) so they match the SRID the trips carry, avoiding
mixed-SRID spatial operators in Q11/Q12/Q15. The trip_h3 column
materialises through the registered tgeompointToTh3Index UDF under the
berlinmod.bench.th3index.enable flag.
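Splitting one delimited SQL source into named sections might look like the sketch below. It assumes each section starts with a line of the form `-- @query <name>`; the exact delimiter syntax, the class name and the sample queries are assumptions for illustration, not the runner's implementation.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class QuerySplitterSketch {

    static final String MARKER = "-- @query";

    /** Splits text into name -> SQL sections, preserving file order. */
    public static Map<String, String> split(String text) {
        Map<String, String> sections = new LinkedHashMap<>();
        String name = null;
        StringBuilder body = new StringBuilder();
        for (String line : text.split("\\R")) {
            if (line.startsWith(MARKER)) {
                if (name != null) {
                    sections.put(name, body.toString().trim());
                }
                // Assumed convention: the section name follows the marker.
                name = line.substring(MARKER.length()).trim();
                body.setLength(0);
            } else if (name != null) {
                body.append(line).append('\n');
            }
        }
        if (name != null) {
            sections.put(name, body.toString().trim());
        }
        return sections;
    }

    public static void main(String[] args) {
        String text = "-- @query q01\nSELECT 1;\n-- @query q02\nSELECT 2\nFROM t;\n";
        System.out.println(split(text).keySet()); // [q01, q02]
    }
}
```

A per-engine dialect transform would then be applied to each section value, leaving the shared file untouched.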

Spark registers a name as exclusively scalar or aggregate, so a name
backing both fails to load. The bare merge stays the scalar form
(AccessorUDFs) and the column aggregate is mergeAgg, tracking the
upstream Agg-suffix rename of the aggregate forms.
…e type

The canonical overloaded aggregates tmin/tmax (renamed tminAgg/tmaxAgg by
MobilityDB #828) and tsum span tint, tfloat and ttext signatures. Spark
registers a UDAF name as exclusively one aggregate and cannot overload by
signature, so each becomes a single UDAF that dispatches on the base type
returned by temporal_basetype_name (#1139) — int4, float8 or text — to the
matching per-type transfn. This replaces the invented per-type tIntMin /
tFloatMin / tTextMin … surface with the canonical names; tsum keeps the
bare name since no box accessor collides with it. The vendored libmeos.so
and JMEOS jar export temporal_basetype_name.
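The single-name dispatch can be sketched as a switch on the base-type name. The three names int4, float8 and text come from the commit message; the Family enum and the class are hypothetical, and the per-type transition functions themselves are not shown.

```java
public class BaseTypeDispatchSketch {

    /** Hypothetical labels for the per-type transition-function families. */
    public enum Family { TINT, TFLOAT, TTEXT }

    /** Maps the base-type name of a temporal value to the family that handles it. */
    public static Family familyFor(String baseTypeName) {
        switch (baseTypeName) {
            case "int4":
                return Family.TINT;
            case "float8":
                return Family.TFLOAT;
            case "text":
                return Family.TTEXT;
            default:
                throw new IllegalArgumentException("unsupported base type: " + baseTypeName);
        }
    }

    public static void main(String[] args) {
        System.out.println(familyFor("int4"));   // TINT
        System.out.println(familyFor("float8")); // TFLOAT
        System.out.println(familyFor("text"));   // TTEXT
    }
}
```

Failing loudly on an unknown base type keeps a new temporal type from being silently aggregated by the wrong transition function.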

VmRSS-based leak detection bounds native-heap growth, but glibc 2.39 on
the ubuntu-noble runner retains ~15-21 MB of freed malloc arena across the
5 000-call probe even after malloc_trim(0), so a 10 MB bound flags
freed-but-unreturned memory rather than a leak. The 50 MB bound clears
that arena floor while still catching real Temporal* leaks (≥100 KB/call →
≥500 MB) with a 10x margin.
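The arithmetic behind the bound can be checked in a few lines. The figures are the ones quoted in the commit message (5 000 calls, 100 KB per call, a 50 MB bound, an arena floor of up to about 21 MB); decimal megabytes are assumed, and the class and method names are hypothetical.

```java
public class LeakBoundSketch {

    /** Native growth in decimal MB if every call leaked kbPerCall kilobytes. */
    public static long leakedMb(long calls, long kbPerCall) {
        return calls * kbPerCall / 1_000;
    }

    public static void main(String[] args) {
        long boundMb = 50;      // bound chosen in the commit message
        long arenaFloorMb = 21; // upper end of the retained-arena range quoted above
        long leakMb = leakedMb(5_000, 100);
        System.out.println(leakMb);                 // 500
        System.out.println(leakMb / boundMb);       // 10
        System.out.println(arenaFloorMb < boundMb); // true
    }
}
```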
@estebanzimanyi

Member Author

Superseded by the canonical-surface decomposition: #22 (canonical UDF library) + #23 (BerlinMOD benchmark), derived from the green integration branch #16 and rebuilt on the regenerated single-source functions.GeneratedFunctions surface at the current ecosystem pin (the legacy functions.functions facade is retired). This PR's content lives in #22/#23 on the canonical surface, so closing as replaced.

estebanzimanyi added a commit to estebanzimanyi/MobilitySpark that referenced this pull request Jun 12, 2026
Adds the BerlinMOD benchmark harness (BerlinMODBench, per-engine runners,
report/chart tooling) and runs the single canonical MobilityDB SQL suite:
berlinmod/q*.sql are byte-identical to doc/rfc/sql-portability/berlinmod, so
MobilityDB / MobilityDuck / MobilitySpark execute one SQL file. q14 uses the
spatiotemporal eContains(geometry, tgeompoint); q09 uses round(double, int);
the th3index prefilter lives in the canonical SQL via the trip_h3 column.

No Spark-local SQL variants or preprocessing: the q0N_spark.sql reimplementations
and the BerlinMODBench preprocessForSpark rewriter / trip_h3 re-materialization
are removed; the bench executes the canonical SQL verbatim. The Trips x Trips
prefilter queries (q11-q16) call overlaps(tgeompoint, stbox) / stbox(geometry,
instant), resolved once the generated UDF surface exposes them (catalog @sqlfn
map -- MEOS-API MobilityDB#18 + MobilityDB #1200).
estebanzimanyi added a commit to estebanzimanyi/MobilitySpark that referenced this pull request Jun 12, 2026
tools/codegen_spark_udfs.py emits MobilitySpark UDF-registration classes from the
MEOS-API catalog (output/meos-idl.json), resolving each SQL name to its MEOS-C
backing via the @sqlfn / @sqlop map (MEOS-API MobilityDB#18). Two modes:
- SINGLE: one backing -> a 1:1 UDF (type-marshalling rules map each MEOS C type
  to its parse-from-String / serialize-to-String form).
- DISPATCH: an overloaded SQL name / operator (overlaps via &&, stbox(geom,time))
  -> ONE UDF that classifies each arg by its MEOS type (parser-driven: MEOS
  decides) and routes to the catalog-determined backing. Emitted lambdas call only
  static GeneratedFunctions (no captured state -> Spark-serializable). Zero hand
  heuristics, zero new MEOS functions.
estebanzimanyi added a commit to estebanzimanyi/MobilitySpark that referenced this pull request Jun 13, 2026
tools/codegen_spark_udfs.py emits MobilitySpark UDF-registration classes from the
MEOS-API catalog (output/meos-idl.json), resolving each SQL name to its MEOS-C
backing via the @sqlfn / @sqlop map (MEOS-API MobilityDB#18). Two modes:
- SINGLE: one backing -> a 1:1 UDF (type-marshalling: each MEOS C type <-> its
  parse-from-String / serialize-to-String form).
- DISPATCH: an overloaded SQL name / operator (overlaps via &&, stbox(geom,time),
  timeSpan) -> ONE UDF that classifies each arg by its MEOS type and routes to the
  catalog-determined backing. Classification is MEOS-driven and wire-format-safe:
  spans/stboxes/geometries travel as TEXT, only temporals as hex, so the leading
  token disambiguates ('['/'(' span, STBOX stbox, hex temporal, else geometry) and
  temporal_from_hexwkb is never fed a non-temporal. Emitted lambdas call only static
  GeneratedFunctions (no captured state -> Spark-serializable). Zero hand heuristics,
  zero new MEOS functions.
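The leading-token classification described for DISPATCH mode can be sketched as below. This is an illustration of the rule as stated in the commit message, not the generator's emitted code: the Kind enum, the class name and the even-length hex check are assumptions.

```java
public class ArgClassifierSketch {

    /** Hypothetical labels for the four wire forms named in the commit message. */
    public enum Kind { SPAN, STBOX, TEMPORAL, GEOMETRY }

    /** Classifies a serialized argument by its leading token. */
    public static Kind classify(String arg) {
        String s = arg.trim();
        if (s.isEmpty()) {
            throw new IllegalArgumentException("empty argument");
        }
        char first = s.charAt(0);
        if (first == '[' || first == '(') {
            return Kind.SPAN;      // span text starts with a bound bracket
        }
        if (s.regionMatches(true, 0, "STBOX", 0, 5)) {
            return Kind.STBOX;     // stbox text starts with the STBOX keyword
        }
        if (isHex(s)) {
            return Kind.TEMPORAL;  // only temporals travel as hex
        }
        return Kind.GEOMETRY;      // everything else is treated as geometry text
    }

    /** True when the string is a non-empty, even-length run of hex digits. */
    static boolean isHex(String s) {
        if (s.length() % 2 != 0) {
            return false;
        }
        for (int i = 0; i < s.length(); i++) {
            if (Character.digit(s.charAt(i), 16) < 0) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(classify("[2020-01-01, 2020-01-02]")); // SPAN
        System.out.println(classify("STBOX X((1,1),(2,2))"));     // STBOX
        System.out.println(classify("0123ABCDEF"));               // TEMPORAL
        System.out.println(classify("POINT(1 1)"));               // GEOMETRY
    }
}
```

Checking the hex case last but before the geometry fallback means a hex parser is only ever handed strings that are entirely hex digits.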
estebanzimanyi added a commit to estebanzimanyi/MobilitySpark that referenced this pull request Jun 13, 2026
Adds the BerlinMOD benchmark harness (BerlinMODBench, per-engine runners,
report/chart tooling) and runs the single canonical MobilityDB SQL suite:
berlinmod/q*.sql are byte-identical to doc/rfc/sql-portability/berlinmod, so
MobilityDB / MobilityDuck / MobilitySpark execute one SQL file. q14 uses the
spatiotemporal eContains(geometry, tgeompoint); q09 uses round(double, int);
the th3index prefilter lives in the canonical SQL via the trip_h3 column.

No Spark-local SQL variants or preprocessing: the q0N_spark.sql reimplementations
and the BerlinMODBench preprocessForSpark rewriter / trip_h3 re-materialization
are removed; the bench executes the canonical SQL verbatim. The Trips x Trips
prefilter queries (q11-q16) call overlaps(tgeompoint, stbox) / stbox(geometry,
instant), resolved once the generated UDF surface exposes them (catalog @sqlfn
map -- MEOS-API MobilityDB#18 + MobilityDB #1200).

Accessor leak/wire-format fixes for leak-free, dispatch-compatible execution:
valueAtTimestamp frees the parsed trip (geomPtr is the borrowed value-into-temporal,
not freed); expandSpace emits the STBox as text (stbox_out) so the generated overlaps
classifier never feeds it to temporal_from_hexwkb.