
Add the catalog-driven Spark UDF generator (foundation)#27

Open
estebanzimanyi wants to merge 2 commits into
MobilityDB:main from
estebanzimanyi:feat/spark-udf-generator


@estebanzimanyi

Member

North-Star step ([meos-api-codegen-regularity]): MobilitySpark's UDF layer (*UDFs.java) is hand-written (~1334 registrations = debt). This adds tools/codegen_spark_udfs.py, which generates that surface from the single MEOS-API catalog:

  • resolves each target SQL name → MEOS-C backing via the catalog @sqlfn map (MEOS-API #18, "Resolve the MEOS surface through the unified GeneratedFunctions jar")
  • type-marshalling rules map each MEOS C type to its parse-from-String / serialize-to-String form (Temporal↔hex, GSERIALIZED↔WKT, Span, STBox, TimestampTz)
  • emits the UDF body + spark.udf().register(...)

Single-signature UDFs emit and compile (proven on whenTrue); overloaded SQL names (eContains with 6 signatures, overlaps, stbox) are deliberately skipped pending runtime type-dispatch, which is the next feature. After that, create() registers the generated class and the bench hand-stubs are deleted, so the bench runs on a fully generated surface. The regenerated .java output is not committed (it is a build artifact), matching how MEOS-API #18 commits the parser, not the catalog.
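The single-signature path described above can be sketched roughly as follows. This is a minimal illustration, not the code in tools/codegen_spark_udfs.py: the catalog row layout, the backing names, and the `Marshal.*` Java helpers are all hypothetical placeholders, and the emitted registration assumes Spark's Java `UDFn` interfaces with every value carried as a String.

```python
from collections import defaultdict

# Hypothetical catalog rows; the real output/meos-idl.json schema may differ.
CATALOG = [
    {"sqlfn": "whenTrue", "backing": "when_true_backing",
     "args": ["Temporal"], "ret": "SpanSet"},
    {"sqlfn": "eContains", "backing": "econtains_backing_1",
     "args": ["GSERIALIZED", "Temporal"], "ret": "Temporal"},
    {"sqlfn": "eContains", "backing": "econtains_backing_2",
     "args": ["Temporal", "GSERIALIZED"], "ret": "Temporal"},
]

# Hypothetical marshalling templates: MEOS C type -> Java expression text.
PARSE = {
    "Temporal": "Marshal.temporalFromHex({arg})",
    "GSERIALIZED": "Marshal.geomFromWkt({arg})",
}
SERIALIZE = {
    "SpanSet": "Marshal.spanSetToText({expr})",
    "Temporal": "Marshal.temporalToHex({expr})",
}


def emit_single(sql_name, entry):
    """Build the Java registration statement for a 1:1 UDF."""
    n = len(entry["args"])
    params = [f"a{i}" for i in range(n)]
    parsed = [PARSE[t].format(arg=p) for t, p in zip(entry["args"], params)]
    call = f"GeneratedFunctions.{entry['backing']}({', '.join(parsed)})"
    body = SERIALIZE[entry["ret"]].format(expr=call)
    generics = ", ".join(["String"] * (n + 1))  # n inputs plus the result
    lam = f"(UDF{n}<{generics}>) ({', '.join(params)}) -> {body}"
    return f'spark.udf().register("{sql_name}", {lam}, DataTypes.StringType);'


def generate(catalog):
    """Emit single-signature SQL names; report overloaded ones as skipped."""
    by_name = defaultdict(list)
    for entry in catalog:
        by_name[entry["sqlfn"]].append(entry)
    emitted, skipped = {}, []
    for sql_name, entries in by_name.items():
        if len(entries) == 1:
            emitted[sql_name] = emit_single(sql_name, entries[0])
        else:
            skipped.append(sql_name)  # needs runtime type-dispatch
    return emitted, sorted(skipped)


emitted, skipped = generate(CATALOG)
for statement in emitted.values():
    print(statement)
print("skipped:", skipped)
```

Grouping by SQL name first makes the single-versus-overloaded decision a simple length check, which is why overloaded names can be skipped without any per-function special cases.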

@estebanzimanyi estebanzimanyi force-pushed the feat/spark-udf-generator branch from 9bb3956 to c5dd081 Compare June 12, 2026 21:04
tools/codegen_spark_udfs.py emits MobilitySpark UDF-registration classes from the
MEOS-API catalog (output/meos-idl.json), resolving each SQL name to its MEOS-C
backing via the @sqlfn / @sqlop map (MEOS-API MobilityDB#18). Two modes:
- SINGLE: one backing -> a 1:1 UDF (type-marshalling: each MEOS C type <-> its
  parse-from-String / serialize-to-String form).
- DISPATCH: an overloaded SQL name / operator (overlaps via &&, stbox(geom,time),
  timeSpan) -> ONE UDF that classifies each arg by its MEOS type and routes to the
  catalog-determined backing. Classification is MEOS-driven and wire-format-safe:
  spans/stboxes/geometries travel as TEXT, only temporals as hex, so the leading
  token disambiguates ('['/'(' span, STBOX stbox, hex temporal, else geometry) and
  temporal_from_hexwkb is never fed a non-temporal. Emitted lambdas call only static
  GeneratedFunctions (no captured state -> Spark-serializable). Zero hand heuristics,
  zero new MEOS functions.
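The leading-token classification in DISPATCH mode might look like the sketch below. It mirrors the rule stated in the commit message in Python for readability; the generator itself emits Java, and the route table and backing names here are hypothetical placeholders rather than catalog contents.

```python
import string

HEX_DIGITS = set(string.hexdigits)


def classify(arg: str) -> str:
    """Classify a wire-format argument by how its text starts."""
    s = arg.lstrip()
    if s[:1] in ("[", "("):
        return "span"        # spans travel as text with a bracket first
    if s.upper().startswith("STBOX"):
        return "stbox"       # stboxes travel as text with an STBOX prefix
    if s and all(c in HEX_DIGITS for c in s):
        return "temporal"    # only temporals travel as hex
    return "geometry"        # everything else is treated as geometry text


# Hypothetical route table for one overloaded SQL name; in the generator the
# routes come from the catalog, not from a hand-written dictionary.
ROUTES = {
    ("temporal", "temporal"): "overlaps_temporal_temporal_backing",
    ("temporal", "stbox"): "overlaps_temporal_stbox_backing",
    ("span", "span"): "overlaps_span_span_backing",
}


def route(args):
    """Pick the backing for a call from the classified argument kinds."""
    key = tuple(classify(a) for a in args)
    if key not in ROUTES:
        raise ValueError(f"no backing for argument kinds {key}")
    return ROUTES[key]


print(route(["0102AB", "STBOX X((1,1),(2,2))"]))
```

Checking the bracket and STBOX prefixes before the hex test is what keeps a non-temporal from ever reaching the hex parser, which is the safety property the commit message claims.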
…roup

Generalize the generator over the whole JMEOS public surface (was a 4-UDF POC):
mirror JMEOS FunctionsGenerator's marshalling conventions — temporals / spans /
sets / boxes / jsonb as hex-WKB or type text, TimestampTz as OffsetDateTime,
DateADT as int, and bool f(.., result) out-params dropped with their value
returned. Cross-check every emission against the JMEOS jar signatures (arity +
return kind) so a collapsed catalog type can never miscompile. Organize the
emitted UDFs into one class per doxygen @ingroup module (the reference-manual
structure, so a function is found in the same place across tools), excluding
meos_internal_*, and splitting oversized groups to stay under the JVM class
limits. Emits ~2200 1:1 UDFs, compiling green.
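One way to picture the per-group organization is the sketch below. It assumes that meos_internal_* refers to doxygen group names, that each function record carries a hypothetical `ingroup` field, and that the split threshold is a plain function count; the class naming scheme is also invented for illustration.

```python
from collections import defaultdict


def class_name(group: str) -> str:
    """Hypothetical naming: meos_temporal_comp -> MeosTemporalCompUDFs."""
    return "".join(part.capitalize() for part in group.split("_")) + "UDFs"


def plan_classes(functions, max_per_class):
    """Map each emitted class name to the functions it will register."""
    groups = defaultdict(list)
    for fn in functions:
        if fn["ingroup"].startswith("meos_internal_"):
            continue  # internal groups are excluded from the UDF surface
        groups[fn["ingroup"]].append(fn["name"])

    classes = {}
    for group, names in sorted(groups.items()):
        names.sort()
        # Split an oversized group into numbered chunks.
        chunks = [names[i:i + max_per_class]
                  for i in range(0, len(names), max_per_class)]
        for idx, chunk in enumerate(chunks, start=1):
            suffix = "" if len(chunks) == 1 else str(idx)
            classes[class_name(group) + suffix] = chunk
    return classes


functions = (
    [{"name": f"f{i}", "ingroup": "meos_temporal_comp"} for i in range(5)]
    + [{"name": "g", "ingroup": "meos_internal_temporal"},
       {"name": "h", "ingroup": "meos_geo_base"}]
)
for name, members in plan_classes(functions, max_per_class=2).items():
    print(name, members)
```

A count-based threshold is only a proxy: the actual JVM limits concern method bytecode size and constant-pool entries, so a real generator may need to size chunks by emitted code rather than by number of functions.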