[python] Add system tables by TheR1sing3un · Pull Request #7908 · apache/paimon

TheR1sing3un · 2026-05-19T16:21:30Z

Summary

Adds native PyPaimon access to eight core system tables, matching the Java implementation column for column.

- New `pypaimon.table.system` package: `SystemTable` base + `SystemTableLoader` registry + in-memory read pipeline (`SystemReadBuilder` / `SystemTableScan` / `SystemTableRead`).
- `FilesystemCatalog.get_table` and `RESTCatalog.get_table` route `$`-suffixed identifiers to the loader; non-system requests are unchanged.
- Tables implemented: `snapshots`, `schemas`, `options`, `manifests`, `files`, `partitions`, `tags`, `branches`. Schema, nullability and primary keys match Java's `TABLE_TYPE`.
- Manager helpers added where needed: `SnapshotManager.list_snapshots`, `SchemaManager.list_all`, `BranchManager.branch_create_time`.
- Predicate pushdown is not implemented yet: `with_filter` raises `NotImplementedError` rather than dropping the filter silently. A few columns are emitted as NULL/0 placeholders (documented).
- User docs: `docs/content/pypaimon/system-tables.md`.

Test plan

- `pytest pypaimon/tests/system/` (73 tests, all green)
- Regression: `pytest pypaimon/tests/filesystem_catalog_test.py pypaimon/tests/filesystem_catalog_branch_test.py pypaimon/tests/filesystem_catalog_tag_test.py pypaimon/tests/snapshot_manager_test.py pypaimon/tests/schema_manager_test.py pypaimon/tests/branch_manager_test.py pypaimon/tests/branch/ pypaimon/tests/rest/rest_catalog_test.py` (102 tests, all green)
- `bash dev/lint-python.sh -i license,flake8`

Adds an abstract SystemTable subclass of Table that exposes a base data table's metadata as a read-only table. Write and search builders raise NotImplementedError; subclasses implement system_table_name(), row_type(), and _build_arrow_table() to materialise their contents. The Identifier encodes the system table suffix (and branch segment when present) so downstream callers see a stable on-wire shape.
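
A minimal sketch of that shape, using only the standard library. The method names `system_table_name`, `row_type` and `_build_arrow_table` come from the description above; the class names, the `new_write_builder` method name, the dict used as a base table, and the list of dicts standing in for an Arrow table are assumptions made for illustration, not the code in this PR.

```python
from abc import ABC, abstractmethod


class SystemTableSketch(ABC):
    """Read-only view over a base table's metadata (illustrative only)."""

    def __init__(self, base_table):
        self.base_table = base_table

    @abstractmethod
    def system_table_name(self):
        """Suffix that follows '$' in the identifier, e.g. 'options'."""

    @abstractmethod
    def row_type(self):
        """Column names; stands in for a real row type."""

    @abstractmethod
    def _build_arrow_table(self):
        """Materialise the rows; a list of dicts stands in for an Arrow table."""

    def new_write_builder(self):  # hypothetical method name
        raise NotImplementedError(
            f"system table '{self.system_table_name()}' is read-only")


class OptionsSketch(SystemTableSketch):
    """One row per (key, value) pair of the base table's options."""

    def system_table_name(self):
        return "options"

    def row_type(self):
        return ["key", "value"]

    def _build_arrow_table(self):
        return [{"key": k, "value": v}
                for k, v in self.base_table["options"].items()]


table = OptionsSketch({"options": {"bucket": "4"}})
rows = table._build_arrow_table()
```

A subclass only has to supply the three abstract methods; the read-only guard lives once in the base class.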

Introduces SYSTEM_TABLE_LOADERS, a name-to-factory dictionary that mirrors the subset of Java's SystemTableLoader supported by the Python SDK. Each factory is a lazy import so a new system table only requires its own module plus a registry entry. The eight registered names are snapshots, schemas, options, manifests, files, partitions, tags, and branches; deferred Java entries (audit_log, binlog, read_optimized, consumers, statistics, aggregation_fields, buckets, file_key_ranges, table_indexes, row_tracking, all_tables, all_partitions, all_table_options, catalog_options) are listed in the module docstring so their omission stays visible.
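
One way to picture a name-to-factory registry with lazy imports is sketched below. This is an assumption about the general technique, not the module's actual contents: `SYSTEM_TABLE_LOADERS_SKETCH` and `load_system_table_factory` are hypothetical names, and standard-library classes stand in for the per-table modules so the snippet runs anywhere.

```python
import importlib

# name -> (module path, attribute). Stdlib targets are placeholders for the
# real per-table modules, which this sketch does not know.
SYSTEM_TABLE_LOADERS_SKETCH = {
    "snapshots": ("collections", "OrderedDict"),
    "options": ("collections", "Counter"),
}


def load_system_table_factory(name):
    """Resolve a factory at lookup time; return None for unregistered names."""
    target = SYSTEM_TABLE_LOADERS_SKETCH.get(name)
    if target is None:
        return None
    module_path, attr = target
    # The import is deferred until a caller asks for this table.
    module = importlib.import_module(module_path)
    return getattr(module, attr)
```

Registering another table in this layout is one new dictionary entry plus the module it points at.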

…temCatalog FilesystemCatalog.get_table now detects system-table identifiers, loads the underlying data table from the bare name, and asks SystemTableLoader to wrap it. Unknown system names raise TableNotExistException so callers see a consistent contract whether the base table or the system view is missing. The existing data-table flow moves into a private _load_data_table helper to avoid recursion through the dispatching entry point.

…alog Mirrors the dispatch change in FilesystemCatalog: RESTCatalog.get_table now branches on identifier.is_system_table(). System-table requests fetch the underlying data table by its bare name, hand it to SystemTableLoader, and surface TableNotExistException when no implementation matches. The data-table flow stays unchanged behind a new _load_data_table helper so existing callers (and the dispatch path) share the same metadata-loading code.
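
The dispatch shared by both catalogs can be sketched roughly as follows. Everything here is a simplified stand-in: the exception class is redefined locally, `split_system_identifier` and the `data_tables` dict are hypothetical, and branch segments inside the identifier are deliberately ignored.

```python
class TableNotExistException(Exception):
    """Local stand-in for the catalog exception named in the text."""


SYSTEM_TABLES = {"snapshots", "schemas", "options", "manifests",
                 "files", "partitions", "tags", "branches"}


def split_system_identifier(object_name):
    """Split 't$snapshots' into ('t', 'snapshots'); plain names give (name, None)."""
    base, sep, suffix = object_name.partition("$")
    return (base, suffix) if sep else (object_name, None)


def get_table(data_tables, object_name):
    """Return the data table, or a (system_name, data_table) pair for '$' names."""
    base, system_name = split_system_identifier(object_name)
    if base not in data_tables:
        raise TableNotExistException(object_name)
    if system_name is None:
        return data_tables[base]             # unchanged data-table flow
    if system_name not in SYSTEM_TABLES:
        raise TableNotExistException(object_name)
    return (system_name, data_tables[base])  # stand-in for the wrapped table
```

Both failure modes (missing base table, unknown system name) raise the same exception type, which is the contract the commit messages describe.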

Wires SystemTable.new_read_builder to a SystemReadBuilder whose new_scan / new_read pair materialises the entire table as a single PyArrow-backed split, then exposes to_arrow, to_pandas, to_iterator, to_record_batch_iterator and to_duckdb so users reach for the same API as on a regular data table. Subclasses override _build_arrow_table(); everything else (projection, limit, predicate-builder construction) is shared. with_filter is preserved on the builder but the read raises NotImplementedError when a predicate is set, so filters never get silently dropped.
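
A rough illustration of the "keep the filter, refuse the read" behaviour, with projection and limit applied to in-memory rows. `SystemReadBuilderSketch` and its single `read` method are hypothetical simplifications; the real builder splits this work across scan and read objects and returns Arrow data.

```python
class SystemReadBuilderSketch:
    """Records projection, limit and filter; applies them to in-memory rows."""

    def __init__(self, rows):
        self._rows = rows
        self._projection = None
        self._limit = None
        self._predicate = None

    def with_projection(self, columns):
        self._projection = list(columns)
        return self

    def with_limit(self, limit):
        self._limit = limit
        return self

    def with_filter(self, predicate):
        self._predicate = predicate  # kept, so the read can refuse it loudly
        return self

    def read(self):
        if self._predicate is not None:
            raise NotImplementedError("predicate pushdown is not supported")
        rows = self._rows if self._limit is None else self._rows[:self._limit]
        if self._projection is None:
            return [dict(r) for r in rows]
        return [{c: r[c] for c in self._projection} for r in rows]
```

Raising at read time rather than ignoring the predicate means a caller can never mistake an unfiltered result for a filtered one.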

SnapshotManager.list_snapshots enumerates every persisted snapshot in ID order, skipping IDs whose file has been expired. SchemaManager.list_all returns every committed schema in ID order. BranchManager grows a default branch_create_time accessor that returns None; the filesystem implementation overrides it to expose the branch directory's mtime in epoch milliseconds, falling back to None when the underlying file status can't supply one.
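
For intuition, enumerating snapshots in ID order while tolerating gaps could look like the sketch below. It assumes snapshot files are named `snapshot-<id>` in a local directory and uses a plain directory listing; `list_snapshot_ids` is a hypothetical helper, and the actual `SnapshotManager.list_snapshots` may walk the ID range differently and goes through the table's file IO.

```python
import os
import re

_SNAPSHOT_FILE = re.compile(r"^snapshot-(\d+)$")


def list_snapshot_ids(snapshot_dir):
    """Return the IDs of snapshot files present in snapshot_dir, ascending."""
    ids = []
    for name in os.listdir(snapshot_dir):
        match = _SNAPSHOT_FILE.match(name)
        if match:  # anything not named snapshot-<id> is skipped
            ids.append(int(match.group(1)))
    # Numeric sort, so snapshot-10 comes after snapshot-3.
    return sorted(ids)
```

Expired snapshots drop out naturally here because their files are no longer present.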

OptionsTable returns one row per (key, value) of the latest table schema's options, matching the Java OptionsTable column layout (both columns NOT NULL, "key" as the primary key).

BranchesTable lists every named branch with the branch directory's modification time. When the underlying store cannot supply an mtime (some remote object stores, REST-managed branches) the value falls back to epoch 0 so the NOT NULL contract from Java's TABLE_TYPE holds.

TagsTable surfaces every tag's snapshot id, schema id, commit time and record count. create_time and time_retained are emitted as NULL because pypaimon's Tag dataclass does not yet carry those fields — the same compromise as FileSystemCatalog.get_tag. Schema (including NOT NULL / NULL distinctions and the tag_name primary key) matches Java's TagsTable column for column.

SchemasTable returns one row per committed schema version, with the fields / partition_keys / primary_keys / options columns encoded as compact JSON strings (matching the column layout and nullability of Java's SchemasTable). update_time is the schema's own time_millis.
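
As an illustration of "compact JSON strings", the sketch below serialises the structured columns without whitespace between tokens. The exact serialisation settings used by the PR are not shown in this message, so the `separators` choice, the `compact_json` helper and the sample field values are assumptions.

```python
import json


def compact_json(value):
    """Serialise without the default spaces after ',' and ':'."""
    return json.dumps(value, separators=(",", ":"))


# Hypothetical schema contents, one row of the schemas table.
schema_row = {
    "schema_id": 0,
    "fields": compact_json(
        [{"id": 0, "name": "user_id", "type": "BIGINT NOT NULL"}]),
    "partition_keys": compact_json(["dt"]),
    "primary_keys": compact_json(["dt", "user_id"]),
    "options": compact_json({"bucket": "4"}),
}
```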

SnapshotsTable returns one row per persisted snapshot in ascending ID order, matching the Java SnapshotsTable column layout. NOT NULL columns (snapshot_id, schema_id, commit_user, ...) and NULLABLE columns (changelog_manifest_list, watermark, next_row_id, ...) line up with Java's TABLE_TYPE. snapshot_id is the primary key.

ManifestsTable lists every manifest referenced by the latest snapshot (base + delta + changelog), matching Java's column layout: file_name, file_size, num_added_files, num_deleted_files, schema_id, min/max partition stats, and min/max row id. The two partition-stat columns are emitted as NULL placeholders until pypaimon grows a shared partition row-to-string helper; the column shape is preserved so the schema contract stays bit-equal with Java.

PartitionsTable aggregates the latest snapshot's manifest entries by partition spec, returning record_count, file_size_in_bytes, file_count, last_update_time, and total_buckets. Catalog-owned columns (created_at, created_by, updated_by, options, done) are filled with placeholders for the filesystem path; REST-backed catalogs will populate them via the catalog API in a later phase.
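
The aggregation step can be sketched as a group-by over file entries. The entry keys (`partition`, `row_count`, `file_size`, `creation_time`) are hypothetical, `total_buckets` is left out, and taking the maximum creation time as `last_update_time` is an assumption rather than something this message states.

```python
from collections import defaultdict


def aggregate_partitions(entries):
    """Group file entries by partition and sum the per-file metrics."""
    totals = defaultdict(lambda: {"record_count": 0, "file_size_in_bytes": 0,
                                  "file_count": 0, "last_update_time": 0})
    for entry in entries:
        row = totals[entry["partition"]]
        row["record_count"] += entry["row_count"]
        row["file_size_in_bytes"] += entry["file_size"]
        row["file_count"] += 1
        # Assumed rule: the newest file decides the partition's update time.
        row["last_update_time"] = max(row["last_update_time"],
                                      entry["creation_time"])
    return dict(totals)
```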

FilesTable emits one row per ADD entry surviving the latest snapshot's manifests. Columns match Java's FilesTable 1:1 including the camelCase "deleteRowCount" wire name and the trailing ARRAY<STRING> write_cols. null_value_counts, min_value_stats and max_value_stats are serialised as compact JSON dictionaries keyed by column name (using the file's value_stats_cols when present), partition is rendered "k=v/k2=v2", and min/max_key fall back to NULL for tables without a primary key.
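
The two rendering conventions mentioned above can be illustrated as follows. Both helpers are hypothetical; in particular, falling back to the full column list when `value_stats_cols` is absent is an assumption about what "when present" implies.

```python
import json


def render_partition(spec):
    """Render an ordered partition spec as 'k=v/k2=v2'."""
    return "/".join(f"{key}={value}" for key, value in spec.items())


def render_stats(values, all_columns, value_stats_cols=None):
    """Key per-column stats by column name and emit compact JSON."""
    # Prefer the file's own stats columns; otherwise assume every column.
    columns = value_stats_cols if value_stats_cols else all_columns
    return json.dumps(dict(zip(columns, values)), separators=(",", ":"))
```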

Pins the AbstractCatalog-equivalent contract: list_tables('db') returns the base table names with no '$'-suffixed entries, but get_table('db.t$<name>') still resolves every registered system table. Catches future regressions where a listing implementation might start exposing internal directories.

Adds docs/content/pypaimon/system-tables.md covering the eight tables shipped in phase 1 (snapshots, schemas, options, manifests, files, partitions, tags, branches): column layout (with nullability and primary keys), the rendering conventions for partitions and stats-JSON, and the known limitations relative to the Java runtime (predicate pushdown unsupported, several placeholder columns).

…d docs Replaces "phase 1" / "phase 2" / "later phase" wording with neutral descriptions of current behaviour, and renames test fixtures / identifiers that embedded the same internal vocabulary. No behaviour change.

The original snippet built two separate read builders, which silently discards any with_projection / with_limit set on the first one. Reuse a single builder for the scan and the read, call to_pandas(splits) directly, and add a small example showing projection + limit chained on the same builder.
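
The corrected pattern, one builder feeding both the scan and the read, might be wrapped like this. `read_system_table` is a hypothetical helper written against duck-typed objects; `with_projection`, `with_limit`, `new_scan`, `new_read` and `to_pandas(splits)` are named in this PR, while `plan().splits()` is assumed from the regular pypaimon read API.

```python
def read_system_table(table, columns=None, limit=None):
    """Scan and read through one builder so projection and limit both apply."""
    builder = table.new_read_builder()
    if columns is not None:
        builder = builder.with_projection(columns)
    if limit is not None:
        builder = builder.with_limit(limit)
    # Same builder for both halves; a second new_read_builder() call would
    # start from a clean state and lose the settings above.
    splits = builder.new_scan().plan().splits()
    return builder.new_read().to_pandas(splits)
```

With a real catalog this would be called on the result of `get_table` for a `$`-suffixed identifier, for example the `snapshots` system table of some table `db.t`.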

Java's BranchesTable reads branch directory mtimes through FileIO and the static BranchManager.branchPath helper; the BranchManager interface itself has no branch_create_time method. Mirror that: inline the mtime read in BranchesTable, drop the BranchManager.branch_create_time API together with its FileSystemBranchManager override and its dedicated tests. The previous shape returned None from CatalogBranchManager (the REST binding) which surfaced epoch 0 for every branch under a REST catalog. With this change FS and REST share the same code path: the table's configured FileIO returns the real mtime in both cases.
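
For reference, reading a directory's modification time as epoch milliseconds looks like this on a local filesystem. `directory_mtime_millis` is a hypothetical helper and `os.stat` is only a stand-in: the table itself goes through its configured FileIO, whose API is not shown here.

```python
import os


def directory_mtime_millis(path):
    """Return the path's modification time in epoch milliseconds."""
    # st_mtime_ns is an integer, so the division avoids float rounding.
    return os.stat(path).st_mtime_ns // 1_000_000
```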

The class docstrings, column comments, doc page and test names described every table as "mirroring" or "matching" the other runtime. Rewrite them to describe what each table is on its own; rename test_schema_matches_*_table to test_schema_column_layout. No behaviour change.

TheR1sing3un added 17 commits May 19, 2026 16:03

[python] Implement $options system table

4e47569


TheR1sing3un changed the title ~~[python] Add system tables (phase 1)~~ [python] Add system tables May 19, 2026

TheR1sing3un added 3 commits May 20, 2026 00:49

TheR1sing3un wants to merge 20 commits into apache:master from TheR1sing3un:py-pypaimon-system-tables-phase-1
