Skip to content

sonic: enforce full config_db.json ownership#2297

Merged
berendt merged 1 commit into
mainfrom
doc-sonic-generator-ownership-model
Jun 2, 2026
Merged

sonic: enforce full config_db.json ownership#2297
berendt merged 1 commit into
mainfrom
doc-sonic-generator-ownership-model

Conversation

@ideaship
Copy link
Copy Markdown
Contributor

@ideaship ideaship commented May 20, 2026

Motivation

Regenerating a SONiC config from NetBox repeatedly raised the same question: when generate_sonic_config() rewrites config_db.json, which sections does it own, and may an operator hand-edit anything directly? PRs #2279 and #2290 both ran into it. The policy is now settled: the generator owns and generates config_db.json in full, and operators make no manual adjustments to it.

What this PR does

Documents that ownership model in the generate_sonic_config() docstring and brings the code in line with it.

  • Full ownership, documented. The docstring states whole-file ownership: every section the generator manages is rebuilt from NetBox data and hardcoded SONiC policy on each regen; no sections pass operator edits through unchanged. Two fields the generator does not itself emit — the DEVICE_METADATA localhost device attributes (e.g. device type) and the DATABASE schema VERSION — are inherited from the device's image-provided base config, not from operator hand-edits.
  • Owned tables rebuilt from scratch. Previously the generator deep-copied the base config and most helpers assigned only the keys they generate this run, so an entry removed from NetBox but still present in config_db.json (e.g. an old VLAN_MEMBER, BGP_GLOBALS VRF, VXLAN_TUNNEL_MAP, SNMP user, or syslog host) survived as stale, active config. The owned tables (OWNED_TABLE_KEYS — every generated table except the inherited DEVICE_METADATA and VERSIONS) are now cleared up front, so only regenerated entries remain.
  • BGP_GLOBALS['default'] replaced wholesale. It was written with key-level updates that merged into a pre-existing entry, so operator fields survived regen. It is now replaced wholesale, like every other generated VRF.
  • STATIC_ROUTE no longer preserves hand-edited routes. Clearing the owned tables drops pre-existing routes (e.g. an operator's custom blackhole or VRF route) before the management default route is written.

Tests

  • Illustrative helper tests (test_config_generator_ownership.py): _add_vrf_configuration and _add_vlan_configuration replace BGP_GLOBALS[vrf], VLAN entries, and ROUTE_REDISTRIBUTE keys wholesale.
  • Orchestrator tests (test_config_generator_orchestrator.py): pre-existing BGP_GLOBALS['default'] fields, STATIC_ROUTE entries, and stale owned-table entries are dropped on regen, while DEVICE_METADATA and VERSIONS survive.

@ideaship ideaship force-pushed the doc-sonic-generator-ownership-model branch from cae57c6 to 9952d93 Compare May 20, 2026 06:44
@ideaship ideaship requested a review from berendt May 20, 2026 06:45
@ideaship ideaship moved this from Ready to In review in Human Board May 20, 2026
@berendt berendt moved this from In review to In progress in Human Board May 26, 2026
@ideaship ideaship force-pushed the doc-sonic-generator-ownership-model branch from 9952d93 to f4c34e0 Compare May 28, 2026 08:25
@ideaship ideaship changed the title sonic: document and test config generator ownership model sonic: enforce full config_db.json ownership May 28, 2026
@ideaship ideaship moved this from In progress to In review in Human Board May 28, 2026
@ideaship ideaship marked this pull request as ready for review May 29, 2026 06:41
Copy link
Copy Markdown

@sourcery-ai sourcery-ai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey - I've left some high level feedback:

  • The ownership model is currently encoded via INHERITED_TABLE_KEYS + OWNED_TABLE_KEYS and duplicated table-name literals; consider centralizing these names (e.g., a single authoritative enum/set) and/or adding a small invariant check so new scaffold keys or ownership changes can’t silently drift from the documented model.
  • Given that OWNED_TABLE_KEYS is used to pop sections before scaffolding, it may be worth explicitly asserting in code (or a small helper) that every TOP_LEVEL_SCAFFOLD_KEY other than INHERITED_TABLE_KEYS is indeed in OWNED_TABLE_KEYS, so future additions don’t accidentally become neither owned nor inherited.
Prompt for AI Agents
Please address the comments from this code review:

## Overall Comments
- The ownership model is currently encoded via INHERITED_TABLE_KEYS + OWNED_TABLE_KEYS and duplicated table-name literals; consider centralizing these names (e.g., a single authoritative enum/set) and/or adding a small invariant check so new scaffold keys or ownership changes can’t silently drift from the documented model.
- Given that OWNED_TABLE_KEYS is used to pop sections before scaffolding, it may be worth explicitly asserting in code (or a small helper) that every TOP_LEVEL_SCAFFOLD_KEY other than INHERITED_TABLE_KEYS is indeed in OWNED_TABLE_KEYS, so future additions don’t accidentally become neither owned nor inherited.

Sourcery is free for open source - if you like our reviews please consider sharing them ✨
Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.

@ideaship ideaship marked this pull request as draft May 29, 2026 08:38
@ideaship ideaship force-pushed the doc-sonic-generator-ownership-model branch from f4c34e0 to ba31c99 Compare May 29, 2026 08:51
@ideaship
Copy link
Copy Markdown
Contributor Author

Updated with constant split + taxonomy invariant tests. No functional change.

@ideaship ideaship marked this pull request as ready for review May 29, 2026 08:51
Copy link
Copy Markdown

@sourcery-ai sourcery-ai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey - I've left some high level feedback:

  • The ownership taxonomy (INHERITED_TABLE_KEYS, SCAFFOLDED_OWNED_TABLE_KEYS, ON_DEMAND_OWNED_TABLE_KEYS) is currently maintained by hand; consider centralizing the table names in a single source (e.g., a mapping from table → classification) so that adding/removing tables in helpers can’t silently fall out of sync with these constants and the docstring.
  • The generate_sonic_config docstring hard-codes several table names (BGP_GLOBALS, VLAN, VLAN_INTERFACE, etc.) that are also captured in OWNED_TABLE_KEYS; to avoid drift, it may be cleaner to either reference OWNED_TABLE_KEYS symbolically in the docstring or reduce the explicit list there and rely on the constants for the authoritative set.
Prompt for AI Agents
Please address the comments from this code review:

## Overall Comments
- The ownership taxonomy (INHERITED_TABLE_KEYS, SCAFFOLDED_OWNED_TABLE_KEYS, ON_DEMAND_OWNED_TABLE_KEYS) is currently maintained by hand; consider centralizing the table names in a single source (e.g., a mapping from table → classification) so that adding/removing tables in helpers can’t silently fall out of sync with these constants and the docstring.
- The generate_sonic_config docstring hard-codes several table names (BGP_GLOBALS, VLAN, VLAN_INTERFACE, etc.) that are also captured in OWNED_TABLE_KEYS; to avoid drift, it may be cleaner to either reference OWNED_TABLE_KEYS symbolically in the docstring or reduce the explicit list there and rely on the constants for the authoritative set.

Sourcery is free for open source - if you like our reviews please consider sharing them ✨
Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.

@ideaship ideaship force-pushed the doc-sonic-generator-ownership-model branch from ba31c99 to bb263fa Compare May 29, 2026 09:06
@berendt
Copy link
Copy Markdown
Member

berendt commented Jun 2, 2026

WARNING (1)

W-001: SNMP_SERVER misclassified as on-demand

File: osism/tasks/conductor/sonic/config_generator.py:98-110Fix: ASK — Confidence: verified

Problem: The comment on ON_DEMAND_OWNED_TABLE_KEYS claims these tables exist 'only when NetBox carries the corresponding data'. This is true for SYSLOG_SERVER, SNMP_SERVER_USER, SNMP_SERVER_GROUP_MEMBER, SNMP_SERVER_PARAMS, SNMP_SERVER_TARGET, SNMP_AGENT_ADDRESS_CONFIG, and ROUTE_REDISTRIBUTE. But SNMP_SERVER itself is created unconditionally by _add_snmp_configuration at line 2180 with hardcoded defaults (location='Data Center', contact='info@example.com'). The existing test test_add_snmp_configuration_defaults confirms SNMP_SERVER['SYSTEM'] is always present even with no NetBox data. So SNMP_SERVER is 'always recreated by the helper', not 'on demand'. The runtime effect is harmless (dropped and recreated either way), but a reader following the classification to predict behavior would expect SNMP_SERVER to be absent on a device with no SNMP config — and would be wrong.

Code:

# Owned tables that are not scaffolded: helpers create and assign them on
# demand (only when NetBox carries the corresponding data), so when that data
# is absent the table is simply left out of the generated config.
ON_DEMAND_OWNED_TABLE_KEYS = (
    "ROUTE_REDISTRIBUTE",
    "SNMP_SERVER",
    "SNMP_AGENT_ADDRESS_CONFIG",
    "SNMP_SERVER_GROUP_MEMBER",
    "SNMP_SERVER_USER",
    "SNMP_SERVER_PARAMS",
    "SNMP_SERVER_TARGET",
    "SYSLOG_SERVER",
)

Action Required: Tighten the comment to describe what's actually true, noting that some tables (notably SNMP_SERVER) are created unconditionally with hardcoded defaults rather than only when NetBox data is present.

Suggested Fix:

Update the comment above ON_DEMAND_OWNED_TABLE_KEYS to say: helpers create and assign these on demand (often, but not always, only when NetBox carries the corresponding data); some are created unconditionally with hardcoded defaults.

Fix Options:

Option Approach Pros Cons Effort Risk if skipped
A Tighten the comment to acknowledge that some tables (notably SNMP_SERVER) are created unconditionally with hardcoded defaults. Documentation matches reality; no behavior change; minimal scope. Comment loses some precision; classification still groups unconditional and conditional helpers together. LOW A future reader misreads the ownership model and either adds a duplicate scaffold for SNMP_SERVER or removes the helper's unconditional assignment.
B Move SNMP_SERVER into TOP_LEVEL_SCAFFOLD_KEYS so it joins the scaffolded-owned set, matching its actual always-created behavior. Classification accurately reflects helper behavior; structural fix rather than a doc patch. Doesn't materially change runtime behavior; adds one more key the orchestrator scaffolds. LOW Same as A: future readers misinterpret the ownership model.
C Make _add_snmp_configuration truly on-demand by guarding the SNMP_SERVER block on segment_snmp_server* config_context being present. Implementation matches the existing comment; consistent on-demand semantics for the entire SNMP family. Changes user-visible behavior (devices with no SNMP config_context lose SNMP_SERVER entirely); requires updating SNMP tests; expands PR scope. MED Same as A; behavior change is not required by the PR's stated goal.

Recommended: A — The PR scope is documentation + drop-on-regen, not a rewrite of helper logic. Reclassifying or rewriting _add_snmp_configuration would expand scope unnecessarily; the comment fix is the minimal change that aligns docs with reality.

Related: I-001


INFO (2)

I-001: "Owns config_db.json in full" overstates the actual implementation

File: osism/tasks/conductor/sonic/config_generator.py:137-144Fix: ASK — Confidence: likely

Problem: The opening sentence states 'owns config_db.json in full'. The body then scopes that to OWNED_TABLE_KEYS. The two are not the same — any SONiC table not enumerated in OWNED_TABLE_KEYS (and not in INHERITED_TABLE_KEYS) will silently survive from the on-disk base config. In practice, SONiC base images carry many tables that aren't in either list: PFC_WD, FEATURE, COPP_TRAP, WRED_PROFILE, SCHEDULER, TC_TO_QUEUE_MAP, FLEX_COUNTER_TABLE, and various custom tables. The current drop loop ignores them, so they pass through unchanged into the generated config. An operator could (technically) edit those tables in /etc/sonic/config_db.json and the edits would survive regen — directly contradicting the 'operators must not make manual adjustments' rule. This isn't a runtime bug today, but it creates a slow drift: as SONiC evolves and tables are added that the generator should emit, they'll start getting written and then silently merging with leftover base-config content from older images instead of being authoritative.

Code:

    Config ownership model:
        This generator owns config_db.json in full; operators must not make
        manual adjustments to it. On every regen the tables listed in
        OWNED_TABLE_KEYS are rebuilt from scratch from NetBox data and
        hardcoded SONiC policy: their base content is dropped up front, so
        neither pre-existing values nor entries removed from NetBox survive.

Action Required: Reword the docstring's opening sentence so it accurately reflects that the generator owns only the enumerated tables, not the entire config_db.json file.

Suggested Fix:

Reword the opening of the Config ownership model paragraph from 'owns config_db.json in full' to 'owns the tables listed in OWNED_TABLE_KEYS' so the docstring matches what the implementation actually does.

Fix Options:

Option Approach Pros Cons Effort Risk if skipped
A Reword the docstring to say 'owns the tables listed in OWNED_TABLE_KEYS' rather than 'owns config_db.json in full'. Docs accurate; no behavior change; weakest scope. Weaker policy statement; operators could still edit unlisted tables and have edits survive regen. LOW Future readers and maintainers over-trust the ownership claim and skip enumerating new tables in OWNED_TABLE_KEYS.
B Drop everything from config except INHERITED_TABLE_KEYS — invert the logic to pop all keys that aren't inherited. Makes ownership truly full and matches the original docstring claim. Aggressive — could break workflows that depend on unlisted base-config tables (PFC_WD, FEATURE, COPP_TRAP, etc.) surviving in real deployments. MED Operator edits to unlisted tables continue to leak through.
C Add an explicit pass-through allow-list and assert/warn when the base config has tables outside OWNED_TABLE_KEYS ∪ INHERITED_TABLE_KEYS ∪ PASS_THROUGH. Makes the surface area explicit; future SONiC table additions surface as warnings instead of silently merging. Significant scope expansion; needs a curated list maintained over time. HIGH Same as A: silent drift as SONiC adds tables.

Recommended: A — The PR's value is in the explicit table model and the drop loop. Option B's behavior change is risky for an operational tool running against real SONiC images and could remove operator-set features the team depends on; Option C is over-engineered for this PR's scope.

Related: W-001


I-002: "Stale owned entries" test does not seed SNMP_SERVER or STATIC_ROUTE

File: tests/unit/tasks/conductor/sonic/test_config_generator_orchestrator.py:510-515Fix: ASK — Confidence: likely

Problem: OWNED_TABLE_KEYS contains 32+ tables. This test seeds 5. The drop is correct by construction (for _owned_key in OWNED_TABLE_KEYS: config.pop(_owned_key, None)), so seeding more tables would not uncover a different bug — but the test name (test_generate_sonic_config_stale_owned_entries_dropped_on_regen) implies broader coverage than it has. In particular, it skips STATIC_ROUTE (the PR's most user-visible behavior change), INTERFACE, PORT, LOOPBACK, LOOPBACK_INTERFACE, PORTCHANNEL*, NTP_SERVER, DNS_NAMESERVER, BREAKOUT_CFG, BREAKOUT_PORTS, BGP_NEIGHBOR, BGP_NEIGHBOR_AF, BGP_GLOBALS_AF*, VRF, VXLAN_TUNNEL, VXLAN_EVPN_NVO, and the rest of the SNMP family. Not a bug; a test-strength gap. The taxonomy-invariants tests (TestOwnershipTaxonomyInvariants) and the orchestrator's pop-loop together make the behavior correct by construction, so this is INFO rather than WARNING.

Code:

    base = make_base_config()
    # Stale entries an operator/earlier run left behind, now absent from NetBox.
    base["BGP_GLOBALS"]["old-vrf"] = {"router_id": "1.1.1.1"}
    base["VLAN_MEMBER"]["Vlan999|Ethernet0"] = {"tagging_mode": "tagged"}
    base["VXLAN_TUNNEL_MAP"]["vtepServ|map_999"] = {"vlan": "Vlan999", "vni": "999"}
    base["SNMP_SERVER_USER"] = {"olduser": {"shaKey": "x"}}
    base["SYSLOG_SERVER"] = {"10.9.9.9": {"severity": "info"}}

Action Required: Replace the hand-picked sample with an exhaustive sweep — iterate OWNED_TABLE_KEYS, seed each with a sentinel entry, regen, assert each is either empty or absent. One paragraph of test logic covers every table without growing as the list does.

Suggested Fix:

Iterate over every key in OWNED_TABLE_KEYS and seed each with a sentinel entry in the base config before regen, then assert each owned table is either empty or absent after regen. This covers all 32+ owned tables with constant test code and stays in sync as the list grows.

Fix Options:

Option Approach Pros Cons Effort Risk if skipped
A Replace the hand-picked sample with an exhaustive sweep — iterate OWNED_TABLE_KEYS, seed each with a sentinel entry, regen, assert each is either empty or absent. Test name matches coverage; survives future additions to OWNED_TABLE_KEYS automatically; covers STATIC_ROUTE (the PR's headline behavior change). Generic sentinel may not exercise table-specific schema validation; per-key seeding may be tricky for tables with required structure. LOW Test name overpromises; if the drop-loop is ever changed to skip a table, the existing 5-table sample may not catch it.
B Leave the test as a hand-picked smoke test and rely on TestOwnershipTaxonomyInvariants + the construction-correct pop-loop to guarantee full coverage. Zero churn; current invariants tests already prevent the most likely regressions. Test name continues to overpromise; STATIC_ROUTE (the PR's user-visible change) remains untested as a stale-drop case. LOW If the pop-loop or OWNED_TABLE_KEYS is restructured, regressions for unsampled tables may slip through review.

Recommended: A — STATIC_ROUTE is the PR's most user-visible behavior change and currently has no stale-drop coverage; an exhaustive sweep closes that gap and keeps the test honest as OWNED_TABLE_KEYS evolves. Project preference is for test completeness commensurate with the scope of the change.


Summary

This is a clean, well-scoped PR that codifies the config_db.json ownership rule with a clear docstring, a drop-loop on OWNED_TABLE_KEYS, and a new TestOwnershipTaxonomyInvariants class that prevents classification constants from drifting at construction time. Both commits stay on-scope (ownership enforcement + black formatting), and the breaking STATIC_ROUTE change is correctly called out in the PR description and renamed test. The findings are documentation/test-strength issues only: SNMP_SERVER is grouped with on-demand tables but is actually created unconditionally with hardcoded defaults, and the docstring's 'owns config_db.json in full' opening overstates what the implementation does (only OWNED_TABLE_KEYS are dropped). No runtime bugs were identified.

Severity Count
BLOCKING 0
CRITICAL 0
WARNING 1
INFO 2

Important

Recommendation: Approve with the recommended comment/docstring fixes from Findings 1 and 2, and the suggested test-strength improvement in Finding 3. None of the findings block merge.

The SONiC config generator owns the config_db.json tables it manages: it
generates them from NetBox data and hardcoded SONiC policy, and operators
must not make manual adjustments to them.  This ownership model was
implicit and undocumented, so each PR touching the generator relitigated
which sections it owns (surfaced in review of #2279 and #2290).

Document the model in the generate_sonic_config() docstring: the
generator owns the tables listed in OWNED_TABLE_KEYS and rebuilds them
from scratch on every regen.  Tables that are neither owned nor inherited
are not emitted by the generator and pass through unchanged from the
device's image-provided base config_db.json; they are not policed, so
they are not a supported place for operator customizations either.  Note
the two fields the generator does not itself emit and instead inherits
from the base config (the DEVICE_METADATA localhost device attributes and
the DATABASE schema VERSION); these come from the image, not from operator
hand-edits.

Bring the code into line with the documented model:

  - Rebuild owned tables from scratch instead of overwriting only the
    entries touched in the current run.  generate_sonic_config() deep-
    copies the base config and most helpers assign only the keys they
    currently generate, so an entry removed from NetBox but still present
    in config_db.json (e.g. an old VLAN_MEMBER, BGP_GLOBALS VRF, or
    VXLAN_TUNNEL_MAP) survived regen as stale, active config.  Drop the
    content of every owned table up front and only regenerated entries
    remain.  OWNED_TABLE_KEYS is composed from two named sub-categories:
    the scaffolded-owned tables (every scaffold key except the inherited
    DEVICE_METADATA and VERSIONS, created up front via setdefault) and
    the on-demand owned tables (ROUTE_REDISTRIBUTE, the SNMP_* tables and
    SYSLOG_SERVER, which helpers assign directly; most only when NetBox
    carries their data, except SNMP_SERVER, which is always emitted with
    hardcoded defaults).
  - BGP_GLOBALS['default'] was written with key-level updates that merged
    into any pre-existing entry, so operator fields survived regen.
    Replace the entry wholesale when a primary IP is present, like every
    other generated VRF.
  - STATIC_ROUTE merged the management default route into the dict loaded
    from config_db.json, so pre-existing routes (e.g. an operator's
    custom blackhole or VRF route) survived regen.  Clearing the owned
    tables now drops them before the management route is written.

Add tests pinning the model.  Illustrative helper tests
(test_config_generator_ownership.py) show that _add_vrf_configuration and
_add_vlan_configuration replace BGP_GLOBALS[vrf], VLAN entries and
ROUTE_REDISTRIBUTE keys wholesale.  Orchestrator tests
(test_config_generator_orchestrator.py) assert that pre-existing
BGP_GLOBALS['default'] fields, STATIC_ROUTE entries, and stale owned-table
entries are dropped on regen while DEVICE_METADATA and VERSIONS survive: a
sampled case keeps the assertions readable, and an exhaustive sweep seeds
a sentinel into every OWNED_TABLE_KEYS table and asserts none survive
regen, so the guarantee covers the whole owned set as it grows.

The ownership tests also pin the table-classification taxonomy: that the
scaffold set partitions into scaffolded-owned and inherited tables, that
owned and inherited tables are disjoint, that OWNED_TABLE_KEYS has no
duplicates, and that the scaffolded-owned and on-demand owned sets do not
overlap.  Most of the taxonomy holds by construction (OWNED_TABLE_KEYS is
derived), so these checks exist to fail when the hand-written
INHERITED_TABLE_KEYS or ON_DEMAND_OWNED_TABLE_KEYS literals are edited in
a way that would silently break the model.

AI-assisted: Claude Code
Signed-off-by: Roger Luethi <luethi@osism.tech>
@ideaship ideaship force-pushed the doc-sonic-generator-ownership-model branch from 3d111d1 to 8a95260 Compare June 2, 2026 09:07
@berendt berendt merged commit 953be8a into main Jun 2, 2026
3 checks passed
@berendt berendt deleted the doc-sonic-generator-ownership-model branch June 2, 2026 09:21
@github-project-automation github-project-automation Bot moved this from In review to Done in Human Board Jun 2, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

3 participants