Skip to content

[DE-7859] Expose pHash on DatasetItem (v0.18.3)#461

Open
vinay553 wants to merge 3 commits into
masterfrom
vinayparakala/expose-phash-on-dataset-item
Open

[DE-7859] Expose pHash on DatasetItem (v0.18.3)#461
vinay553 wants to merge 3 commits into
masterfrom
vinayparakala/expose-phash-on-dataset-item

Conversation

@vinay553
Copy link
Copy Markdown
Contributor

@vinay553 vinay553 commented May 18, 2026

Summary

Expose the perceptual-hash (pHash) of dataset items through the SDK so ML workflows (dedup, near-duplicate detection) can access it without a separate fetch.

  • Adds a new phash: Optional[str] field to the DatasetItem dataclass — 64-character "0/1" binary string when backfilled by the backend, None otherwise.
  • Threads phash=payload.get(PHASH_KEY) into DatasetItem.from_json. Because every SDK method that returns a DatasetItem goes through from_json, this single change exposes item.phash on:
    • items_and_annotation_generator
    • items_generator / dataset.items
    • query_items
    • iloc / refloc / loc
  • Bumps version 0.18.2 → 0.18.3 and adds a CHANGELOG entry per the project's Keep-a-Changelog convention.
  • Adds a top-level CLAUDE.md capturing release workflow, branch/PR conventions, and the from_json-centralization insight for future agent sessions.

Test plan

  • Local install: poetry install && poetry run python -c "from nucleus.dataset_item import DatasetItem; print(DatasetItem.from_json({'reference_id':'r', 'image_url':'x.jpg', 'phash':'1'*64}).phash)" prints the hash.
  • DatasetItem.from_json falls back to None when the backend omits phash (existing test fixtures).
  • Smoke test against fedramp-prod once the paired backend PR is deployed: client.get_dataset(...).items_and_annotation_generator(...) yields items with item.phash populated.
  • CHANGELOG renders correctly on the release tag.

🤖 Generated with Claude Code

Greptile Summary

This PR exposes a new read-only phash field on DatasetItem, threading it through the single from_json deserialization entry-point so it becomes available on every SDK method that yields items. The change is additive and backwards-compatible.

  • Adds PHASH_KEY = \"phash\" constant, phash: Optional[str] = None dataclass field, and phash=payload.get(PHASH_KEY) in from_json; to_payload correctly omits it since pHash is server-computed.
  • Bumps version to 0.18.3 with a matching CHANGELOG entry and adds a CLAUDE.md file documenting repo conventions.
  • Relaxes an async-job test to compare completed_steps == total_steps instead of asserting a hardcoded count of 5, decoupling the test from backend pipeline internals.

Confidence Score: 5/5

Safe to merge — the change is a purely additive, backwards-compatible field addition that touches only deserialization and has no impact on upload payloads or existing behavior.

The implementation is minimal and correct: a new constant, an Optional[str] field with a None default, and a single payload.get(PHASH_KEY) call in the one deserialization method all items go through. Existing callers are unaffected, and the field is intentionally absent from to_payload.

No files require special attention.

Important Files Changed

Filename Overview
nucleus/constants.py Adds PHASH_KEY = 'phash' constant in alphabetical order alongside existing keys — clean, minimal addition.
nucleus/dataset_item.py Adds phash: Optional[str] = None to the DatasetItem dataclass and threads it into from_json; to_payload correctly omits it since phash is server-computed. phash is absent from the class docstring's Args section.
tests/test_dataset.py Removes specific completed_steps/total_steps count assertions in test_dataset_append_async, replacing with a looser equality check; no new test coverage for the phash field deserialization.
CHANGELOG.md Prepends a correct 0.18.3 entry following Keep-a-Changelog conventions.
pyproject.toml Bumps version from 0.18.2 to 0.18.3 — correct patch bump for a backwards-compatible additive change.
CLAUDE.md New documentation file capturing release workflow, repo conventions, and architecture notes for future agent sessions — purely informational.

Sequence Diagram

sequenceDiagram
    participant Backend as Nucleus Backend
    participant SDK as nucleus SDK
    participant User as User Code

    Backend->>SDK: JSON payload (includes "phash" when computed)
    Note over SDK: DatasetItem.from_json(payload)
    SDK->>SDK: payload.get(PHASH_KEY) → phash or None
    SDK->>User: "DatasetItem(phash="101...010" or None)"
    User->>User: "item.phash  # 64-char binary string or None"

    Note over SDK,User: All item-returning methods route through from_json:<br/>items_generator, items_and_annotation_generator,<br/>query_items, iloc / refloc / loc
Loading

Fix All in Cursor Fix All in Claude Code Fix All in Codex

Prompt To Fix All With AI
Fix the following 2 code review issues. Work through them one at a time, proposing concise fixes.

---

### Issue 1 of 2
nucleus/dataset_item.py:107-118
The `phash` field is missing from the class docstring's `Args` section. All other public constructor parameters (`image_location`, `pointcloud_location`, `reference_id`, `metadata`) are documented there. Omitting `phash` means users relying on `help(DatasetItem)` or generated API docs won't learn what the field contains or when it is populated.

```suggestion
          Coordinate metadata may be provided to enable the Map Chart in the Nucleus Dataset charts page.
          These values can be specified as `{ "lat": 52.5, "lon": 13.3, ... }`.

          Context Attachments may be provided to display the attachments side by side with the dataset
          item in the Detail View by specifying
          `{ "context_attachments": [ { "attachment": 'https://example.com/1' }, { "attachment": 'https://example.com/2' }, ... ] }`.

          .. todo ::
              Shorten this once we have a guide migrated for metadata, or maybe link
              from other places to here.

        phash (Optional[str]): Perceptual hash of the underlying image as a
          64-character "0/1" binary string. Populated by the Nucleus backend
          when the pHash has been computed; ``None`` otherwise. Read-only —
          this field is not included in upload payloads.

    """
```

### Issue 2 of 2
tests/test_dataset.py:332-348
**No unit test for `phash` deserialization**

The PR adds `phash` to `DatasetItem.from_json` but doesn't add a fast, offline unit test to cover it. A minimal parametrized test — calling `DatasetItem.from_json` with and without the `"phash"` key in the payload and asserting `item.phash` equals the expected value — would catch any future regression without requiring a live API key. The existing test suite has no such coverage.

Reviews (3): Last reviewed commit: "Loosen test_dataset_append_async — don't..." | Re-trigger Greptile

vinay553 and others added 3 commits May 18, 2026 18:50
Add a `phash` field to the DatasetItem dataclass and thread it through
`from_json`. Because every SDK method that returns a DatasetItem
(items_and_annotation_generator, items_generator, query_items,
dataset.items, iloc/refloc/loc) deserializes through DatasetItem.from_json,
exposing the field there is sufficient — no per-method changes required.

Also adds a top-level CLAUDE.md with release/branch conventions and
architecture pointers for future Claude Code sessions.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The upload job pipeline plans with N total_steps initially, then
dynamically collapses to a single step once it knows how to
short-circuit (small input → batched upload). By the time
sleep_until_complete() returns, status() always reports total_steps=1,
completed_steps=1 — so the hard-coded expectation of 5/5 deterministically
fails on the current backend.

Drop the step-count assertions and keep the meaningful invariants:
job completed successfully, progress is 1.00, and
completed_steps == total_steps (whatever they are).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Comment thread tests/test_dataset.py
Comment on lines +335 to 346
# Pinning specific step counts couples this test to the backend pipeline
# shape (the upload job is planned with N steps, then dynamically
# collapses to fewer once it knows how to short-circuit). Only check the
# outcomes that the SDK contract guarantees.
expected = {
"job_id": job.job_id,
"status": "Completed",
"job_progress": "1.00",
"completed_steps": 5,
"total_steps": 5,
}
assert_partial_equality(expected, status)
assert status["completed_steps"] == status["total_steps"]

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fix flaky test.

@vinay553
Copy link
Copy Markdown
Contributor Author

build_test should pass once api.scale.com is redeployed with https://github.com/scaleapi/scaleapi/pull/143842.

@vinay553 vinay553 requested review from a team and edwinpav May 19, 2026 20:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant