feat: per-layer package attribution (opt-in) by ashokn1 · Pull Request #793 · snyk/snyk-docker-plugin

ashokn1 · 2026-04-18T22:32:55Z

Summary

Adds the ability to attribute each OS package to the specific image layer that first introduced it, following the same approach as Trivy's `Layer { DiffID, Digest }`. Attribution is opt-in via a new `layer-attribution` plugin option so there is no performance impact on existing callers.

How layer attribution is computed

Docker images are built from an ordered stack of layers. Each layer is a filesystem delta produced by one Dockerfile instruction. When a package manager installs or removes packages, it rewrites its database in full (e.g. `/lib/apk/db/installed`, `/var/lib/dpkg/status`). This property makes diff-based attribution possible: if you parse the package DB from each layer in isolation and compare successive snapshots, you can pinpoint exactly which layer introduced (or removed) each package.

Algorithm (`lib/analyzer/layer-attribution.ts`)

History alignment. The image config's `history` array contains one entry per Dockerfile instruction, some marked `empty_layer: true` (metadata instructions like `ENV`, `LABEL`, `EXPOSE` that produce no filesystem delta). These are filtered out to produce an aligned array where index `i` maps to `rootFsLayers[i]` and its instruction text.
Per-layer parse. For each layer in order, the package DB is read from that layer's file map alone — not the merged view used for the normal scan. Two cases are distinguished:
- DB file absent in the layer (e.g. a `COPY` or `WORKDIR` instruction): `parseLayerPackages` returns `null` and the layer is skipped entirely. `previousPkgs` is left unchanged.
- DB file present but empty (e.g. `apk del $(apk info)`): returns an empty `Set`. This is treated as "all packages removed" and the layer is recorded.
Set diff. Each DB-writing layer's package set is diffed against the previous one:
- Keys in `currentPkgs` not in `previousPkgs` → added (`packages[]`)
- Keys in `previousPkgs` not in `currentPkgs` → removed (`removedPackages[]`)
A `LayerAttributionEntry` is emitted for any layer with at least one addition or removal. The `pkgLayerMap` records the layer where each `name@version` key first appeared.
Multi-manager support. `computeLayerAttribution` is called once per unique `AnalysisType` (APK, APT, RPM, Chisel). Results are cached by type so duplicate entries — APT regular + APT distroless, RPM BDB + RPM SQLite — share one parse pass and reuse the cached `pkgLayerMap`. Entries from all managers are merged per-layer by `mergeLayerAttributionEntries`.
Package annotation. Each `AnalyzedPackage` is stamped with `layerIndex` and `layerDiffId` by looking up its key in `pkgLayerMap`. These propagate to dep-graph node labels via `lib/dependency-tree/index.ts`.
Fact emission. `lib/response-builder.ts` assembles the entries into a `layerPackageAttribution` fact on the OS scan result.

Output

New fact (`layerPackageAttribution`):
```json
{
"type": "layerPackageAttribution",
"data": [
{
"layerIndex": 0,
"diffID": "sha256:abc...",
"instruction": "FROM ubuntu:22.04",
"packages": ["libc6@2.35-0ubuntu3", "curl@7.81.0"]
},
{
"layerIndex": 2,
"diffID": "sha256:ghi...",
"digest": "sha256:def...",
"instruction": "RUN apt-get install -y nginx",
"packages": ["nginx@1.18.0"],
"removedPackages": ["curl@7.81.0"]
}
]
}
```

New dep-graph node labels (additive alongside existing `dockerLayerId`):
```json
"labels": {
"dockerLayerId": "UnVOIGFwdC1nZXQ...",
"layerDiffId": "sha256:ghi...",
"layerIndex": "2"
}
```

Edge cases

Scenario	Behaviour
Layer doesn't touch the package DB	Skipped; `previousPkgs` unchanged for next diff
DB file present but empty (`apk del` all packages)	Recorded as `removedPackages`; package set reset to empty
Package deleted then reinstalled (different version)	Deletion layer records `removedPackages`; reinstall layer records new version in `packages`
`rootFsLayers` shorter than `orderedLayers`	Loop capped at `Math.min(...)`
Scratch image / no package DB	Returns empty entries; fact omitted
No history / empty history	Instructions omitted from entries; diff still runs
Multiple managers with same `AnalysisType`	Single parse pass, cached `pkgLayerMap` reused

Changes

File	Change
`lib/extractor/types.ts`	Add `orderedLayers?: ExtractedLayers[]` (optional) to `ExtractionResult`
`lib/extractor/index.ts`	Populate `orderedLayers` only when `layer-attribution` is enabled (avoids holding all per-layer buffers unconditionally)
`lib/facts.ts`	Add `LayerAttributionEntry`, `LayerPackageAttributionFact`
`lib/types.ts`	Add `"layer-attribution"` to `PluginOptions`; `"layerPackageAttribution"` to `FactType`
`lib/analyzer/types.ts`	Add `layerIndex?`, `layerDiffId?` to `AnalyzedPackage`; `layerPackageAttribution?` to `StaticAnalysis`
`lib/analyzer/layer-attribution.ts`	New — `computeLayerAttribution()`, `mergeLayerAttributionEntries()`
`lib/analyzer/static-analyzer.ts`	Call attribution (gated on option); cache results by `AnalysisType`; annotate packages
`lib/dependency-tree/index.ts`	Propagate `layerDiffId`/`layerIndex` to `DepTreeDep.labels`
`lib/response-builder.ts`	Emit `LayerPackageAttributionFact`
`test/lib/analyzer/layer-attribution.spec.ts`	New — 19 unit tests (APK + APT, including deletion/reinstall/empty-DB scenarios)
`test/harness/run.ts`	New — CLI harness wrapping `scan()` for manual testing
`test/system/docker.spec.ts`	Fix stale assertions: remove fragile sha256 round-trip comparison; fix invalid-image-name test to exercise the 404 path
`test/system/plugin.spec.ts`	Update nginx:1.19.0 manifest layer digests (re-published on Docker Hub with different compression)

Test plan

`npx jest test/lib/analyzer/layer-attribution.spec.ts` — all 19 unit tests pass
`npm run test:system` — all archive-based system tests pass; remaining failures are unauthenticated Docker Hub rate limits or require a live Docker daemon (pre-existing, unrelated to this PR)
Manual verification against real images via `npx ts-node test/harness/run.ts --layer-attribution`

🤖 Generated with Claude Code

Introduces `computeLayerAttribution` in `lib/analyzer/layer-attribution.ts` and wires it through the full pipeline. Enabled with `--layer-attribution`. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

- Memory: make orderedLayers optional in ExtractionResult; only populate it when layer-attribution option is enabled, avoiding holding all per-layer file buffers unconditionally - Performance: cache computeLayerAttribution results by AnalysisType so duplicate manager types (APT regular + distroless, RPM BDB + SQLite) share a single expensive layer-parsing pass - Clarity: add JSDoc to buildHistoryInstructions explaining why it differs from getUserInstructionLayersFromConfig (all-layers vs user-layers) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

- docker.spec.ts: remove fragile sha256 checksum comparison in the hello-world round-trip test; Docker's tar format varies across versions so the normalised checksums no longer match the fixture. Existence of the output file is still verified. - docker.spec.ts: change 'someImage' (uppercase → HTTP 400) to a valid lowercase name so the "image doesn't exist" test exercises the intended 404 code path ("not found") rather than a name-validation error. - plugin.spec.ts: update nginx:1.19.0 manifest layer digests; the compressed layer blobs were re-published on Docker Hub with different compression, changing the manifest digests while the image config (and therefore imageId) remained the same. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…key generation - Thread osRelease through computeLayerAttribution → parseLayerPackages so aptAnalyze uses the same normalization as the main analysis path, preventing pkgLayerMap lookup misses on distros where osRelease affects package version strings (e.g. Ubuntu epoch stripping) - Pass redHatRepositories through the same chain so rpmAnalyze and mapRpmSqlitePackages receive the same repository list as the main path - Fix RPM SQLite branch: SQLite packages now go through mapRpmSqlitePackages (sync helper matching the main path) instead of being merged into the rpmAnalyze call; BDB+NDB and SQLite results are combined in a single Set - Update computeLayerAttribution call in static-analyzer.ts to supply the already-computed osRelease and redHatRepositories - Update unit tests to pass the new required parameters Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

- [sev 8] static-analyzer: document cache assumption — analyzers sharing an AnalysisType parse the same DB format, so a single pass covers all of them; add comment explaining why allEntries.push is cache-miss-only - [sev 7] layer-attribution: buildHistoryInstructions now returns Array<string | undefined> using `?.trim() || undefined` so empty and whitespace-only created_by values are treated as absent; tighten `if (instruction)` to `if (instruction !== undefined)` to make the intent explicit - [sev 6] layer-attribution: add comment to RPM branch clarifying that BDB/NDB and SQLite paths are independent and intentionally use separate analyzers to match the main analysis path - [sev 6] dependency-tree: extract buildLayerLabels() helper used by both the tooFrequentDeps path and buildTreeRecursive, eliminating the duplicate inline label-building blocks and the inconsistent freqLabels variable name - [sev 5] static-analyzer: set attributionCache to an empty Map on error so subsequent results of the same AnalysisType skip recomputation instead of triggering O(n) retry attempts for a broken type - [sev 4] facts: add JSDoc to LayerAttributionEntry.packages and removedPackages documenting the "name@version" key format - [sev 3] harness: remove startsWith("--") guard from next() so option values that begin with "--" (e.g. passwords) are accepted Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

snyk-pr-review-bot · 2026-04-20T21:10:22Z