[python] PVFS supports jindo backend for OSS reads/writes#7902
Open
shyjsarah wants to merge 4 commits into
Open
[python] PVFS supports jindo backend for OSS reads/writes#7902shyjsarah wants to merge 4 commits into
shyjsarah wants to merge 4 commits into
Conversation
PaimonVirtualFileSystem backed every OSS open() with ossfs, whose fsspec buffered file flushes each block via OSS AppendObject. A multi-chunk write can then fail with PositionNotEqualToLength (409) on the OSS data-acceleration endpoint. _get_oss_filesystem now honors fs.oss.impl (the same option PyArrowFileIO uses): jindo backs OSS with the native JindoSDK (PutObject / multipart upload), legacy keeps ossfs. jindo is the default and falls back to ossfs when pyjindosdk is absent. The jindo backend needs the oss:// scheme on paths while ossfs needs it stripped, so _strip_storage_protocol keeps or removes the scheme per the active backend. The jindo config builder is shared with JindoFileSystemHandler so both jindo entry points consume identical credential options. Adds pvfs_oss_filesystem_test: CI unit tests for the fs.oss.impl dispatch, the _get_filesystem wiring and the path-scheme handling (no DLF / OSS / pyjindosdk required). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
PaimonVirtualFileSystem caches a per-table fs in _fs_cache and rebuilds it when the DataToken nears expiry. Previously the old fs was just dereferenced; for the jindo backend that would have leaked the underlying native JindoSDK connection -- fsspec's _Cached metaclass keeps the instance alive globally, and JindoOssFileSystem has no close() method of its own. create_jindo_oss_filesystem now passes skip_instance_cache=True so PVFS's _fs_cache holds the sole reference to each JindoOssFileSystem. _get_filesystem then closes the previous filesystem (best-effort, no-op when fs has no close) before returning the rebuilt one, mirroring how the Java PVFS releases the FileIO it evicts from FILE_IO_CACHE. Adds CloseStaleFilesystemOnRefreshTest covering: close is called when the cached entry is stale, no close on first build, and a filesystem without close() is tolerated. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…write lock The OSS rebuild branch of _get_filesystem was issuing a second REST round-trip (rest_api.get_table) inside the _table_cache_lock write critical section, only to resolve the table path -- it could not reuse _get_table_store because re-entering the same non-reentrant write lock would deadlock. Every caller already has the path in hand from _get_table_store immediately before calling _get_filesystem, so thread it through as storage_location and drop the inline REST call. Removes one DLF round-trip from every cold start and 1h token-refresh path. _close_filesystem_quietly used to run inside the same write critical section. For the jindo backend, native handle teardown (flushing buffers, terminating worker threads) can block; holding the write lock through it would stall every other OSS filesystem rebuild. Capture the stale fs inside the lock, release the lock, then close in finally. Add test_close_runs_after_write_lock_released as a regression guard. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
jerry-024
reviewed
May 20, 2026
Contributor
jerry-024
left a comment
There was a problem hiding this comment.
Thanks for the fix — the fs.oss.impl dispatch and the shared build_jindo_config refactor are clean, and the unit-test coverage is solid.
Two compatibility concerns on the jindo import surface, left inline. One other thing worth calling out in the PR description: with fs.oss.impl defaulting to jindo, the effective OSS backend for PVFS silently changes from ossfs to JindoSDK in any environment that already has pyjindosdk installed — so it is a behavior change for existing deployments, not strictly a no-op as "no user-facing API change" might suggest.
… older builds without pyjindo.ossfs Two concerns raised on the PR review map to one root cause: pyjindosdk ships independent surfaces and the PVFS path was assuming both were always present. (1) Min version: declare it via setup.py's extras_require. pyjindosdk is on PyPI only as 6.10.4 and its `JindoOssFileSystem(uri, config, auto_mkdir, **kwargs)` accepts the additional `skip_instance_cache` kwarg through `**kwargs`, so the constructor signature stays compatible with later 6.x bumps. `pip install pypaimon[jindo]` now pulls the right version. (2) Compatibility regression risk: `pyjindo.ossfs` was sharing the `pyjindo.fs` / `pyjindo.util` try-block, so a pyjindosdk build that ships fs+util but no ossfs (older internal builds) would flip JINDO_AVAILABLE to False and silently disable the previously-working PyArrow `JindoFileSystemHandler` path. Split into two flags: JINDO_AVAILABLE (fs + util, gates the PyArrow path) and JINDO_OSSFS_AVAILABLE (gates the new PVFS jindo backend). `_use_jindo_oss_backend` requires both. `create_jindo_oss_filesystem` also requires both and raises a precise ImportError referencing the required version. Update unit tests to patch JINDO_OSSFS_AVAILABLE for the missing-ossfs fallback scenario and pin JINDO_AVAILABLE=True so the test result no longer depends on whether pyjindosdk is installed in the CI image. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Purpose
PaimonVirtualFileSystem(PVFS) backed every OSSopen()withossfs, whose fsspec buffered file flushes each block via OSSAppendObject. When a result exceeds one block, a laterAppendObjectcan fail withPositionNotEqualToLength(409) on the OSS data-acceleration endpoint due to file-length cache lag.This PR makes the PVFS OSS backend selectable via
fs.oss.impl— the same optionPyArrowFileIOalready honors:jindo(default): native JindoSDK, writes viaPutObject/ multipart upload, soAppendObjectis never used.legacy:ossfs(the previous behavior).When
fs.oss.impl=jindobutpyjindosdkis not installed, PVFS falls back toossfs, consistent withPyArrowFileIO.Notes:
build_jindo_configis now shared byJindoFileSystemHandler(the PyArrow FileIO path) and the newcreate_jindo_oss_filesystem(the PVFS path), so both jindo entry pointsconsume identical OSS credential options.
pyjindo's fsspec filesystem needs theoss://scheme on paths whileossfsneeds it stripped, so_strip_storage_protocolkeeps or strips the scheme according to the active backend.fs.oss.implalready exists, this PR only makes PVFS honor it.Tests
pvfs_oss_filesystem_test.py— CI unit tests, no DLF / OSS / pyjindosdk required (all stubbed):fs.oss.impldispatch:jindo/legacy/ default / fallback when pyjindosdk is absent / invalid value_get_filesystemforwards the table's OSS storage path to the backend_strip_storage_protocolkeepsoss://for jindo, strips it for ossfs_extract_oss_bucketparsing and error caseslegacyandjindo) on pyjindo 6.10.2: open/read/write (including an 8 MB multipart write),info,ls,cat_file,cp_file,mv,rm,get_fileall succeed on the jindo backend.