Skip to content

[python] PVFS supports jindo backend for OSS reads/writes#7902

Open
shyjsarah wants to merge 4 commits into
apache:masterfrom
shyjsarah:fix/pvfs-jindo-oss-write
Open

[python] PVFS supports jindo backend for OSS reads/writes#7902
shyjsarah wants to merge 4 commits into
apache:masterfrom
shyjsarah:fix/pvfs-jindo-oss-write

Conversation

@shyjsarah
Copy link
Copy Markdown
Contributor

Purpose

PaimonVirtualFileSystem (PVFS) backed every OSS open() with ossfs, whose fsspec buffered file flushes each block via OSS AppendObject. When a result exceeds one block, a later AppendObject can fail with PositionNotEqualToLength (409) on the OSS data-acceleration endpoint due to file-length cache lag.

This PR makes the PVFS OSS backend selectable via fs.oss.impl — the same option PyArrowFileIO already honors:

  • jindo (default): native JindoSDK, writes via PutObject / multipart upload, so AppendObject is never used.
  • legacy: ossfs (the previous behavior).

When fs.oss.impl=jindo but pyjindosdk is not installed, PVFS falls back to ossfs, consistent with PyArrowFileIO.

Notes:

  • build_jindo_config is now shared by JindoFileSystemHandler (the PyArrow FileIO path) and the new create_jindo_oss_filesystem (the PVFS path), so both jindo entry points
    consume identical OSS credential options.
  • pyjindo's fsspec filesystem needs the oss:// scheme on paths while ossfs needs it stripped, so _strip_storage_protocol keeps or strips the scheme according to the active backend.
  • No storage format change; no user-facing API change — fs.oss.impl already exists, this PR only makes PVFS honor it.

Tests

  • New pvfs_oss_filesystem_test.py — CI unit tests, no DLF / OSS / pyjindosdk required (all stubbed):
    • fs.oss.impl dispatch: jindo / legacy / default / fallback when pyjindosdk is absent / invalid value
    • _get_filesystem forwards the table's OSS storage path to the backend
    • _strip_storage_protocol keeps oss:// for jindo, strips it for ossfs
    • _extract_oss_bucket parsing and error cases
  • Verified end-to-end against a real DLF REST catalog with both backends (legacy and jindo) on pyjindo 6.10.2: open/read/write (including an 8 MB multipart write), info, ls, cat_file, cp_file, mv, rm, get_file all succeed on the jindo backend.

望峤 and others added 3 commits May 19, 2026 05:03
PaimonVirtualFileSystem backed every OSS open() with ossfs, whose
fsspec buffered file flushes each block via OSS AppendObject. A
multi-chunk write can then fail with PositionNotEqualToLength (409)
on the OSS data-acceleration endpoint.

_get_oss_filesystem now honors fs.oss.impl (the same option
PyArrowFileIO uses): jindo backs OSS with the native JindoSDK
(PutObject / multipart upload), legacy keeps ossfs. jindo is the
default and falls back to ossfs when pyjindosdk is absent.

The jindo backend needs the oss:// scheme on paths while ossfs needs
it stripped, so _strip_storage_protocol keeps or removes the scheme
per the active backend. The jindo config builder is shared with
JindoFileSystemHandler so both jindo entry points consume identical
credential options.

Adds pvfs_oss_filesystem_test: CI unit tests for the fs.oss.impl
dispatch, the _get_filesystem wiring and the path-scheme handling
(no DLF / OSS / pyjindosdk required).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
PaimonVirtualFileSystem caches a per-table fs in _fs_cache and
rebuilds it when the DataToken nears expiry. Previously the old fs
was just dereferenced; for the jindo backend that would have leaked
the underlying native JindoSDK connection -- fsspec's _Cached
metaclass keeps the instance alive globally, and JindoOssFileSystem
has no close() method of its own.

create_jindo_oss_filesystem now passes skip_instance_cache=True so
PVFS's _fs_cache holds the sole reference to each JindoOssFileSystem.
_get_filesystem then closes the previous filesystem (best-effort,
no-op when fs has no close) before returning the rebuilt one, mirroring
how the Java PVFS releases the FileIO it evicts from FILE_IO_CACHE.

Adds CloseStaleFilesystemOnRefreshTest covering: close is called when
the cached entry is stale, no close on first build, and a filesystem
without close() is tolerated.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…write lock

The OSS rebuild branch of _get_filesystem was issuing a second REST
round-trip (rest_api.get_table) inside the _table_cache_lock write
critical section, only to resolve the table path -- it could not reuse
_get_table_store because re-entering the same non-reentrant write lock
would deadlock. Every caller already has the path in hand from
_get_table_store immediately before calling _get_filesystem, so thread
it through as storage_location and drop the inline REST call. Removes
one DLF round-trip from every cold start and 1h token-refresh path.

_close_filesystem_quietly used to run inside the same write critical
section. For the jindo backend, native handle teardown (flushing
buffers, terminating worker threads) can block; holding the write lock
through it would stall every other OSS filesystem rebuild. Capture the
stale fs inside the lock, release the lock, then close in finally.

Add test_close_runs_after_write_lock_released as a regression guard.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@shyjsarah shyjsarah changed the title [WIP][python] PVFS supports jindo backend for OSS reads/writes [python] PVFS supports jindo backend for OSS reads/writes May 20, 2026
Copy link
Copy Markdown
Contributor

@jerry-024 jerry-024 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the fix — the fs.oss.impl dispatch and the shared build_jindo_config refactor are clean, and the unit-test coverage is solid.

Two compatibility concerns on the jindo import surface, left inline. One other thing worth calling out in the PR description: with fs.oss.impl defaulting to jindo, the effective OSS backend for PVFS silently changes from ossfs to JindoSDK in any environment that already has pyjindosdk installed — so it is a behavior change for existing deployments, not strictly a no-op as "no user-facing API change" might suggest.

Comment thread paimon-python/pypaimon/filesystem/jindo_file_system_handler.py Outdated
Comment thread paimon-python/pypaimon/filesystem/jindo_file_system_handler.py
… older builds without pyjindo.ossfs

Two concerns raised on the PR review map to one root cause: pyjindosdk
ships independent surfaces and the PVFS path was assuming both were
always present.

(1) Min version: declare it via setup.py's extras_require. pyjindosdk is
on PyPI only as 6.10.4 and its `JindoOssFileSystem(uri, config,
auto_mkdir, **kwargs)` accepts the additional `skip_instance_cache`
kwarg through `**kwargs`, so the constructor signature stays compatible
with later 6.x bumps. `pip install pypaimon[jindo]` now pulls the right
version.

(2) Compatibility regression risk: `pyjindo.ossfs` was sharing the
`pyjindo.fs` / `pyjindo.util` try-block, so a pyjindosdk build that
ships fs+util but no ossfs (older internal builds) would flip
JINDO_AVAILABLE to False and silently disable the previously-working
PyArrow `JindoFileSystemHandler` path. Split into two flags:
JINDO_AVAILABLE (fs + util, gates the PyArrow path) and
JINDO_OSSFS_AVAILABLE (gates the new PVFS jindo backend).
`_use_jindo_oss_backend` requires both. `create_jindo_oss_filesystem`
also requires both and raises a precise ImportError referencing the
required version.

Update unit tests to patch JINDO_OSSFS_AVAILABLE for the missing-ossfs
fallback scenario and pin JINDO_AVAILABLE=True so the test result no
longer depends on whether pyjindosdk is installed in the CI image.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@shyjsarah shyjsarah closed this May 20, 2026
@shyjsarah shyjsarah reopened this May 20, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants