fix: filter out datasets with inconsistent database and LakeFS records#5171

Open
xuang7 wants to merge 5 commits into
apache:mainfrom
xuang7:fix/filter-mismatched-datasets

Conversation

@xuang7
Contributor

@xuang7 xuang7 commented May 24, 2026

What changes were proposed in this PR?

This PR fixes an issue where dataset listings fail when dataset records in the database and LakeFS repositories are inconsistent. This breaks the workflow dataset picker and can also affect Hub dataset listings. The fix wraps the per-row retrieveRepositorySize call in a try/catch for ApiException, logs the orphan, and drops it from the response.
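A minimal sketch of the filtering pattern described above, not the actual diff. `ApiException` here is a local stand-in for the LakeFS client exception, and `DatasetRow`, `DatasetWithSize`, `repositories`, and `attachSizes` are hypothetical names introduced only for illustration; `retrieveRepositorySize` is simulated with an in-memory map.

```scala
object OrphanFilterSketch {
  // Hypothetical stand-in for the LakeFS client's ApiException.
  final class ApiException(message: String) extends Exception(message)

  // Hypothetical minimal shapes; the real records carry more fields.
  final case class DatasetRow(did: Int, repositoryName: String)
  final case class DatasetWithSize(row: DatasetRow, size: Long)

  // Simulated LakeFS state: repository name -> size in bytes.
  val repositories: Map[String, Long] = Map("repo-a" -> 1024L, "repo-c" -> 2048L)

  // Stand-in for retrieveRepositorySize: throws when the repository is missing.
  def retrieveRepositorySize(repositoryName: String): Long =
    repositories.getOrElse(
      repositoryName,
      throw new ApiException(s"repository not found: $repositoryName")
    )

  // Keep rows whose size lookup succeeds; log and drop rows whose lookup
  // throws ApiException, so one orphan no longer fails the whole listing.
  def attachSizes(rows: Seq[DatasetRow]): Seq[DatasetWithSize] =
    rows.flatMap { row =>
      try {
        Some(DatasetWithSize(row, retrieveRepositorySize(row.repositoryName)))
      } catch {
        case e: ApiException =>
          Console.err.println(s"Skipping dataset ${row.did}: ${e.getMessage}")
          None
      }
    }
}
```

Catching only `ApiException` keeps unrelated failures visible: any other exception still propagates instead of silently shrinking the list.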

Demo:

Before (screenshot): dataset listing error
After (screenshot): dataset picker loads valid datasets

Any related issues, documentation, discussions?

Closes #5106

How was this PR tested?

Added two tests.

Was this PR authored or co-authored using generative AI tooling?

Generated-by: Claude Opus 4.7

@github-actions github-actions Bot added engine fix common platform Non-amber Scala service paths labels May 24, 2026
@codecov-commenter

codecov-commenter commented May 24, 2026

Codecov Report

❌ Patch coverage is 0% with 3 lines in your changes missing coverage. Please review.
✅ Project coverage is 45.80%. Comparing base (c435aa7) to head (de46600).

| Files with missing lines | Patch % | Lines |
|---|---|---|
| ...exera/web/resource/dashboard/hub/HubResource.scala | 0.00% | 2 Missing ⚠️ |
| ...ache/texera/service/resource/DatasetResource.scala | 0.00% | 0 Missing and 1 partial ⚠️ |
Additional details and impacted files
@@             Coverage Diff              @@
##               main    #5171      +/-   ##
============================================
- Coverage     47.15%   45.80%   -1.35%     
+ Complexity     2348     2342       -6     
============================================
  Files          1042     1046       +4     
  Lines         39989    40029      +40     
  Branches       4260     4259       -1     
============================================
- Hits          18855    18335     -520     
- Misses        20012    20580     +568     
+ Partials       1122     1114       -8     
| Flag | Coverage Δ | *Carryforward flag |
|---|---|---|
| access-control-service | 39.53% <ø> (ø) | |
| agent-service | 33.74% <ø> (-0.03%) ⬇️ | Carriedforward from c4a945d |
| amber | 50.29% <0.00%> (-0.07%) ⬇️ | |
| computing-unit-managing-service | 0.00% <ø> (ø) | |
| config-service | 0.00% <ø> (ø) | |
| file-service | 32.45% <0.00%> (+0.26%) ⬆️ | |
| frontend | 34.62% <ø> (-3.20%) ⬇️ | Carriedforward from c4a945d |
| python | 90.50% <ø> (ø) | Carriedforward from c4a945d |
| workflow-compiling-service | 56.81% <ø> (ø) | |

*This pull request uses carry forward flags. Click here to find out more.

@xuang7 xuang7 requested a review from aicam May 24, 2026 00:44
@chenlica chenlica requested a review from mengw15 May 25, 2026 07:08
Contributor

@mengw15 mengw15 left a comment

Left one comment

@github-actions github-actions Bot removed the common label May 25, 2026
@xuang7 xuang7 requested a review from mengw15 May 25, 2026 20:52
Contributor

@mengw15 mengw15 left a comment

LGTM! Thanks for the fix! Before merging, could you run one last manual test? Thanks!
One behavior worth flagging: an owner with explicit access to an orphan dataset may still see it in their list with size=0. The accessible-datasets path in listDatasets doesn't call LakeFS, so the try-catch can't fire there; maybe only the publicDatasets path drops orphans. This is probably fine: the owner sees a "broken dataset" instead of "my dataset silently disappeared", which is arguably more informative.
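The difference between the two paths can be sketched as follows. This is a simplified illustration of the behavior described in this comment, not the real `listDatasets` code: `Listed`, `accessibleDatasets`, and the simulated `retrieveRepositorySize` are hypothetical, `ApiException` is a local stand-in for the LakeFS client exception, and the accessible path is assumed to leave size at a default of 0 because it makes no LakeFS call.

```scala
object ListingPathsSketch {
  // Hypothetical stand-in for the LakeFS client's ApiException.
  final class ApiException(message: String) extends Exception(message)

  final case class Listed(name: String, size: Long)

  // Simulated LakeFS: only the dataset named "healthy" has a repository.
  def retrieveRepositorySize(name: String): Long =
    if (name == "healthy") 512L
    else throw new ApiException(s"repository not found: $name")

  // Accessible path (assumed): no LakeFS call is made, so nothing can throw
  // and an orphan is listed with the default size of 0.
  def accessibleDatasets(names: Seq[String]): Seq[Listed] =
    names.map(name => Listed(name, 0L))

  // Public path (assumed): the size lookup runs per row, so an orphan
  // triggers ApiException and is dropped from the result.
  def publicDatasets(names: Seq[String]): Seq[Listed] =
    names.flatMap { name =>
      try Some(Listed(name, retrieveRepositorySize(name)))
      catch { case _: ApiException => None }
    }
}
```

Under these assumptions, the same orphan stays visible to its owner through the accessible path but disappears from the public listing.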

@xuang7
Contributor Author

xuang7 commented May 25, 2026

> LGTM! Thanks for the fix! Before merging, could you run one last manual test? Thanks! One behavior worth flagging: an owner with explicit access to an orphan dataset may still see it in their list with size=0. The accessible-datasets path in listDatasets doesn't call LakeFS, so the try-catch can't fire there; maybe only the publicDatasets path drops orphans. This is probably fine: the owner sees a "broken dataset" instead of "my dataset silently disappeared", which is arguably more informative.

Sounds good! In this version, the owner can still see the broken dataset in the list. I think it may be okay to keep this behavior for now, since keeping it visible can serve as a reminder that something is inconsistent.

Labels

engine fix platform Non-amber Scala service paths

Development

Successfully merging this pull request may close these issues.

Dataset file selection fails when LakeFS repository and database records are inconsistent

3 participants