ROX-32949: reactive DNF/RPM scanning in roxagent (Phase 1) #21545

Draft: vikin91 wants to merge 7 commits into piotr/ROX-35195-v2-vsock-pull-poc from piotr/ROX-32949-reactive-scanning

vikin91 commented Jul 3, 2026

Description

Phase 1 of reactive DNF/RPM package-change scanning (ROX-32949). Today, roxagent serve rescans only on a fixed 4-hour timer, so a package change can take up to 4 hours to reach Central. This PR adds an fsnotify-based watcher on the RPM database directory that triggers an immediate rescan on change, and makes Sensor prioritize delivery of these reactive reports so that the ~5-minute update SLA holds under load.

Design doc: docs/superpowers/specs/2026-07-03-reactive-dnf-scanning-design.md
Implementation plan: docs/superpowers/plans/2026-07-03-reactive-dnf-scanning-phase1.md

What changed

  • roxagent: the new watch package watches /usr/lib/sysimage/rpm (falling back to /var/lib/rpm) via fsnotify and coalesces event bursts with a trivial buffered channel. It is wired into runServe's existing select loop as a third case alongside the periodic ticker, with no protocol changes. Watcher setup is best-effort: on failure, roxagent logs a warning and falls back to periodic-only scanning.
  • Every cached report is now tagged with a scan_trigger fact (scheduled/reactive) via the existing ResponseMeta.facts map.
  • Sensor's VMScraper classifies each pulled report via this fact and routes reactive ones through a new Handler.SendReactive path instead of the routine Send path.
  • handlerImpl gains a per-VM "latest pending reactive report" map (bounded by fleet size, not an arbitrary channel size) that's drained ahead of the routine indexReports channel, so a reactive update can't get stuck behind a backlog of routine traffic.
  • New Prometheus metrics: virtual_machine_reactive_index_report_latency_seconds (end-to-end SLA measurement) and virtual_machine_discovered_data_scan_trigger_total (scheduled/reactive/unknown volume).
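For reviewers, the roxagent-side behavior in the list above can be sketched roughly as follows. This is a minimal, standard-library-only illustration and not the code in this PR: notify, factsFor, and serveLoop are hypothetical names, the real watcher is driven by fsnotify, and the real loop lives in runServe. It assumes that "trivial buffered-channel coalescing" amounts to a non-blocking send on a capacity-1 channel.

```go
package main

import (
	"fmt"
	"time"
)

// Values of the scan_trigger fact named in the PR description.
const (
	triggerScheduled = "scheduled"
	triggerReactive  = "reactive"
)

// notify performs a non-blocking send on a capacity-1 channel, so a burst
// of filesystem events collapses into at most one pending rescan signal.
func notify(changed chan<- struct{}) {
	select {
	case changed <- struct{}{}:
	default: // a signal is already pending; drop the duplicate
	}
}

// factsFor (hypothetical) builds the facts map attached to a cached report.
func factsFor(trigger string) map[string]string {
	return map[string]string{"scan_trigger": trigger}
}

// serveLoop is a hypothetical stand-in for runServe's select loop. Both the
// ticker and the watcher feed the same rescan callback; only the trigger
// value differs. The loop exits when stop is closed.
func serveLoop(tick <-chan time.Time, changed <-chan struct{}, stop <-chan struct{}, rescan func(trigger string)) {
	for {
		select {
		case <-stop:
			return
		case <-tick:
			rescan(triggerScheduled)
		case <-changed:
			rescan(triggerReactive)
		}
	}
}

func main() {
	changed := make(chan struct{}, 1)
	for i := 0; i < 100; i++ { // simulate a burst of RPM database events
		notify(changed)
	}
	fmt.Println("pending signals after burst:", len(changed)) // 1

	tick := make(chan time.Time) // driven by hand here instead of a real ticker
	stop := make(chan struct{})
	scans := make(chan map[string]string)
	done := make(chan struct{})
	go func() {
		defer close(done)
		serveLoop(tick, changed, stop, func(trigger string) { scans <- factsFor(trigger) })
	}()

	fmt.Println(<-scans) // map[scan_trigger:reactive]
	tick <- time.Now()
	fmt.Println(<-scans) // map[scan_trigger:scheduled]
	close(stop)
	<-done
}
```

Because the channel holds at most one signal, any number of events arriving during a scan results in exactly one follow-up rescan. That is coalescing only, not the debounce deferred to Phase 2.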

Considered alternatives

  • A second fixed-capacity channel for reactive reports (rejected — any constant is either too small at fleet scale or an unjustified memory-growth knob; the per-VM map is self-bounded by the number of VMs Sensor already tracks).
  • A new VSOCK protocol field for the trigger type (rejected for now — the existing generation-counter + facts map is sufficient; a proto field is deferred to Phase 2 if scale testing shows it's needed).
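To make the first trade-off concrete, here is a rough sketch of the per-VM "latest pending" idea that was preferred over a second channel. It is illustrative only: report, pendingReactive, and next are hypothetical names rather than the types in handlerImpl, and the sketch assumes that a later reactive report for the same VM simply replaces the earlier one.

```go
package main

import (
	"fmt"
	"sync"
)

// report is a hypothetical stand-in for an index report.
type report struct {
	vmID       string
	generation int
}

// pendingReactive keeps only the newest reactive report per VM, so its
// size is bounded by the number of VMs rather than by a channel capacity.
type pendingReactive struct {
	mu     sync.Mutex
	latest map[string]report
}

func newPendingReactive() *pendingReactive {
	return &pendingReactive{latest: make(map[string]report)}
}

// put upserts: a later report for the same VM replaces the earlier one.
func (p *pendingReactive) put(r report) {
	p.mu.Lock()
	defer p.mu.Unlock()
	p.latest[r.vmID] = r
}

// drain removes and returns all pending reports, in unspecified order.
func (p *pendingReactive) drain() []report {
	p.mu.Lock()
	defer p.mu.Unlock()
	out := make([]report, 0, len(p.latest))
	for id, r := range p.latest {
		out = append(out, r)
		delete(p.latest, id)
	}
	return out
}

// next returns pending reactive reports first. Only when none are pending
// does it take one report from the routine channel (non-blocking here).
func next(p *pendingReactive, routine <-chan report) []report {
	if rs := p.drain(); len(rs) > 0 {
		return rs
	}
	select {
	case r := <-routine:
		return []report{r}
	default:
		return nil
	}
}

func main() {
	routine := make(chan report, 2)
	routine <- report{vmID: "vm-a", generation: 1}
	routine <- report{vmID: "vm-b", generation: 1}

	p := newPendingReactive()
	p.put(report{vmID: "vm-c", generation: 7})
	p.put(report{vmID: "vm-c", generation: 8}) // replaces generation 7

	fmt.Println(next(p, routine)) // [{vm-c 8}]
	fmt.Println(next(p, routine)) // [{vm-a 1}]
}
```

The map never holds more than one entry per VM, which is why no capacity constant is needed. The reactive report is returned even though two routine reports were queued before it.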

Explicitly out of scope (Phase 2, see design doc)

  • Debounce/quiet-period timer (Phase 1 only does trivial event coalescing, no debounce).
  • Storm/rate-limit protection and --reactive-scan/--reactive-debounce CLI flags.
  • Alerting on the reactive-latency metric.
  • 10,000-VM delivery scale testing and any changes to the generic sensor/common/sensor/buffered_stream.go gRPC buffer (flagged in the design doc as a known, deliberately-unaddressed risk at scale).

User-facing documentation

Testing and quality

  • the change is production ready: the change is GA, or otherwise the functionality is gated by a feature flag. Note: this ships unconditionally enabled, per the Phase 1 design (no feature flag added); please confirm this is acceptable.
  • CI results are inspected

Automated testing

  • added unit tests
  • added e2e tests
  • added regression tests
  • added compatibility tests
  • modified existing tests

How I validated my change

  • New unit tests for every new/changed unit: the watch package (directory-preference resolution, event coalescing, event-type filtering), roxagent serve's discoverFacts/scan_trigger tagging, Sensor's isReactiveTrigger classification and scan_trigger/reactive-latency metrics, and handlerImpl's priority path (delivery-ordering, per-VM upsert-replace semantics under concurrent access, fallback-to-scheduled-queue when a VM isn't yet resolvable, and the SLA-latency histogram observation).
  • All touched packages pass go test -race: compliance/virtualmachines/roxagent/..., sensor/common/virtualmachine/..., sensor/kubernetes/sensor/....
  • go build for both roxagent and sensor, gofmt, go vet, and golangci-lint are clean; go mod tidy produces no unexpected diff beyond promoting fsnotify to a direct dependency.
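As an illustration of the classification cases listed above, the following is a small sketch of a trigger classifier with a table of inputs. It is an assumption-laden stand-in, not the PR's isReactiveTrigger: classifyTrigger and isReactive are hypothetical names and signatures, and treating a missing or unrecognized scan_trigger value as "unknown" (and therefore non-reactive) is an assumption based on the three metric label values named in the description.

```go
package main

import "fmt"

// scanTriggerFact is the fact key named in the PR description.
const scanTriggerFact = "scan_trigger"

// classifyTrigger (hypothetical) maps a report's facts to one of the three
// label values: "scheduled", "reactive", or "unknown".
func classifyTrigger(facts map[string]string) string {
	switch facts[scanTriggerFact] { // reading from a nil map is safe in Go
	case "reactive":
		return "reactive"
	case "scheduled":
		return "scheduled"
	default:
		return "unknown"
	}
}

// isReactive (hypothetical) decides whether a report takes the priority path.
// Assumption: anything not explicitly reactive uses the routine path.
func isReactive(facts map[string]string) bool {
	return classifyTrigger(facts) == "reactive"
}

func main() {
	cases := []struct {
		name  string
		facts map[string]string
		want  string
	}{
		{"reactive", map[string]string{scanTriggerFact: "reactive"}, "reactive"},
		{"scheduled", map[string]string{scanTriggerFact: "scheduled"}, "scheduled"},
		{"missing fact", map[string]string{}, "unknown"},
		{"nil facts", nil, "unknown"},
		{"unexpected value", map[string]string{scanTriggerFact: "manual"}, "unknown"},
	}
	for _, c := range cases {
		got := classifyTrigger(c.facts)
		fmt.Printf("%-17s got=%-9s ok=%t priority=%t\n", c.name, got, got == c.want, isReactive(c.facts))
	}
}
```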

vikin91 and others added 7 commits July 3, 2026 15:10

Detects package-change events on candidate RPM database directories with
trivial buffered-channel coalescing (no debounce yet — Phase 2). Part of
ROX-32949 reactive DNF scanning; see
docs/superpowers/specs/2026-07-03-reactive-dnf-scanning-design.md.

Co-authored-by: Cursor <cursoragent@cursor.com>

Wires the new watch.Watcher into runServe's select loop as a third case
alongside the periodic ticker, sharing one rescan/cache/generation-bump
code path. Tags every cached report's facts with scan_trigger
(scheduled|reactive) so Sensor can prioritize delivery. Part of
ROX-32949; see docs/superpowers/specs/2026-07-03-reactive-dnf-scanning-design.md.

Co-authored-by: Cursor <cursoragent@cursor.com>

Adds virtual_machine_reactive_index_report_latency_seconds (SLA
measurement) and virtual_machine_discovered_data_scan_trigger_total
(observability), used by the reactive-scan handling added in the next
commits. Part of ROX-32949.

Co-authored-by: Cursor <cursoragent@cursor.com>

Adds Handler.SendReactive, backed by a per-VM 'latest pending' map
(reactivePending) that handlerImpl.run() drains ahead of the routine
indexReports channel on every iteration. Bounds memory by fleet size
instead of an arbitrary channel capacity. Falls back to the normal queue
if the VM isn't resolvable yet, so a reactive update is never silently
dropped. Part of ROX-32949; see
docs/superpowers/specs/2026-07-03-reactive-dnf-scanning-design.md.

Co-authored-by: Cursor <cursoragent@cursor.com>

Adds isReactiveTrigger (consumed by the scraper routing change in the
next commit) and a virtual_machine_discovered_data_scan_trigger_total
metric label, following the existing dnf_status pattern. Part of
ROX-32949.

Co-authored-by: Cursor <cursoragent@cursor.com>

VMScraper.scrapeVM now classifies each report via the scan_trigger fact
and calls IndexReportSender.SendReactive instead of Send for reactive
ones, passing report_generated_at through for SLA latency measurement.
Completes the ROX-32949 Phase 1 delivery path from roxagent's watcher
through to Sensor's priority queue.

Co-authored-by: Cursor <cursoragent@cursor.com>

Extends TestSendReactive_DeliveredBeforeQueuedScheduledBacklog to assert
VirtualMachineReactiveIndexReportLatencySeconds observes exactly one
sample for the reactive report and none for the two scheduled reports
delivered alongside it, closing a gap flagged in code review: the SLA
latency histogram had no direct test coverage.

Co-authored-by: Cursor <cursoragent@cursor.com>
vikin91 commented Jul 3, 2026

This change is part of the following stack:

Change managed by git-spice.

openshift-ci Bot commented Jul 3, 2026

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

github-actions Bot commented Jul 3, 2026

🚀 Build Images Ready

Images are ready for commit 85ffbf7. To use with deploy scripts:

export MAIN_IMAGE_TAG=4.12.x-414-g85ffbf7c26

codecov Bot commented Jul 3, 2026

Codecov Report

❌ Patch coverage is 73.12500% with 43 lines in your changes missing coverage. Please review.
✅ Project coverage is 50.36%. Comparing base (de6d977) to head (85ffbf7).

| Files with missing lines | Patch % | Lines |
|---|---|---|
| compliance/virtualmachines/roxagent/cmd/serve.go | 7.40% | 25 Missing ⚠️ |
| compliance/virtualmachines/roxagent/watch/watch.go | 82.00% | 6 Missing and 3 partials ⚠️ |
| sensor/common/virtualmachine/vmscraper/scraper.go | 25.00% | 4 Missing and 2 partials ⚠️ |
| sensor/common/virtualmachine/index/handler_impl.go | 95.23% | 2 Missing and 1 partial ⚠️ |

Additional details and impacted files
@@                          Coverage Diff                          @@
##           piotr/ROX-35195-v2-vsock-pull-poc   #21545      +/-   ##
=====================================================================
+ Coverage                              50.33%   50.36%   +0.02%     
=====================================================================
  Files                                   2853     2854       +1     
  Lines                                 219121   219264     +143     
=====================================================================
+ Hits                                  110290   110427     +137     
- Misses                                100815   100821       +6     
  Partials                                8016     8016              

| Flag | Coverage Δ |
|---|---|
| go-unit-tests | 50.36% <73.12%> (+0.02%) ⬆️ |
