sonic: enforce full config_db.json ownership#2297
Conversation
cae57c6 to
9952d93
Compare
9952d93 to
f4c34e0
Compare
There was a problem hiding this comment.
Hey - I've left some high level feedback:
- The ownership model is currently encoded via INHERITED_TABLE_KEYS + OWNED_TABLE_KEYS and duplicated table-name literals; consider centralizing these names (e.g., a single authoritative enum/set) and/or adding a small invariant check so new scaffold keys or ownership changes can’t silently drift from the documented model.
- Given that OWNED_TABLE_KEYS is used to pop sections before scaffolding, it may be worth explicitly asserting in code (or a small helper) that every TOP_LEVEL_SCAFFOLD_KEY other than INHERITED_TABLE_KEYS is indeed in OWNED_TABLE_KEYS, so future additions don’t accidentally become neither owned nor inherited.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- The ownership model is currently encoded via INHERITED_TABLE_KEYS + OWNED_TABLE_KEYS and duplicated table-name literals; consider centralizing these names (e.g., a single authoritative enum/set) and/or adding a small invariant check so new scaffold keys or ownership changes can’t silently drift from the documented model.
- Given that OWNED_TABLE_KEYS is used to pop sections before scaffolding, it may be worth explicitly asserting in code (or a small helper) that every TOP_LEVEL_SCAFFOLD_KEY other than INHERITED_TABLE_KEYS is indeed in OWNED_TABLE_KEYS, so future additions don’t accidentally become neither owned nor inherited.Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.
f4c34e0 to
ba31c99
Compare
|
Updated with constant split + taxonomy invariant tests. No functional change. |
There was a problem hiding this comment.
Hey - I've left some high level feedback:
- The ownership taxonomy (INHERITED_TABLE_KEYS, SCAFFOLDED_OWNED_TABLE_KEYS, ON_DEMAND_OWNED_TABLE_KEYS) is currently maintained by hand; consider centralizing the table names in a single source (e.g., a mapping from table → classification) so that adding/removing tables in helpers can’t silently fall out of sync with these constants and the docstring.
- The generate_sonic_config docstring hard-codes several table names (BGP_GLOBALS, VLAN, VLAN_INTERFACE, etc.) that are also captured in OWNED_TABLE_KEYS; to avoid drift, it may be cleaner to either reference OWNED_TABLE_KEYS symbolically in the docstring or reduce the explicit list there and rely on the constants for the authoritative set.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- The ownership taxonomy (INHERITED_TABLE_KEYS, SCAFFOLDED_OWNED_TABLE_KEYS, ON_DEMAND_OWNED_TABLE_KEYS) is currently maintained by hand; consider centralizing the table names in a single source (e.g., a mapping from table → classification) so that adding/removing tables in helpers can’t silently fall out of sync with these constants and the docstring.
- The generate_sonic_config docstring hard-codes several table names (BGP_GLOBALS, VLAN, VLAN_INTERFACE, etc.) that are also captured in OWNED_TABLE_KEYS; to avoid drift, it may be cleaner to either reference OWNED_TABLE_KEYS symbolically in the docstring or reduce the explicit list there and rely on the constants for the authoritative set.Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.
ba31c99 to
bb263fa
Compare
WARNING (1)W-001: SNMP_SERVER misclassified as on-demandFile: Problem: The comment on ON_DEMAND_OWNED_TABLE_KEYS claims these tables exist 'only when NetBox carries the corresponding data'. This is true for SYSLOG_SERVER, SNMP_SERVER_USER, SNMP_SERVER_GROUP_MEMBER, SNMP_SERVER_PARAMS, SNMP_SERVER_TARGET, SNMP_AGENT_ADDRESS_CONFIG, and ROUTE_REDISTRIBUTE. But SNMP_SERVER itself is created unconditionally by _add_snmp_configuration at line 2180 with hardcoded defaults (location='Data Center', contact='info@example.com'). The existing test test_add_snmp_configuration_defaults confirms SNMP_SERVER['SYSTEM'] is always present even with no NetBox data. So SNMP_SERVER is 'always recreated by the helper', not 'on demand'. The runtime effect is harmless (dropped and recreated either way), but a reader following the classification to predict behavior would expect SNMP_SERVER to be absent on a device with no SNMP config — and would be wrong. Code: Action Required: Tighten the comment to describe what's actually true, noting that some tables (notably SNMP_SERVER) are created unconditionally with hardcoded defaults rather than only when NetBox data is present. Suggested Fix: Fix Options:
Recommended: A — The PR scope is documentation + drop-on-regen, not a rewrite of helper logic. Reclassifying or rewriting _add_snmp_configuration would expand scope unnecessarily; the comment fix is the minimal change that aligns docs with reality. Related: I-001 INFO (2)I-001: "Owns config_db.json in full" overstates the actual implementationFile: Problem: The opening sentence states 'owns config_db.json in full'. The body then scopes that to OWNED_TABLE_KEYS. The two are not the same — any SONiC table not enumerated in OWNED_TABLE_KEYS (and not in INHERITED_TABLE_KEYS) will silently survive from the on-disk base config. In practice, SONiC base images carry many tables that aren't in either list: PFC_WD, FEATURE, COPP_TRAP, WRED_PROFILE, SCHEDULER, TC_TO_QUEUE_MAP, FLEX_COUNTER_TABLE, and various custom tables. The current drop loop ignores them, so they pass through unchanged into the generated config. An operator could (technically) edit those tables in /etc/sonic/config_db.json and the edits would survive regen — directly contradicting the 'operators must not make manual adjustments' rule. This isn't a runtime bug today, but it creates a slow drift: as SONiC evolves and tables are added that the generator should emit, they'll start getting written and then silently merging with leftover base-config content from older images instead of being authoritative. Code: Action Required: Reword the docstring's opening sentence so it accurately reflects that the generator owns only the enumerated tables, not the entire config_db.json file. Suggested Fix: Fix Options:
Recommended: A — The PR's value is in the explicit table model and the drop loop. Option B's behavior change is risky for an operational tool running against real SONiC images and could remove operator-set features the team depends on; Option C is over-engineered for this PR's scope. Related: W-001 I-002: "Stale owned entries" test does not seed SNMP_SERVER or STATIC_ROUTEFile: Problem: OWNED_TABLE_KEYS contains 32+ tables. This test seeds 5. The drop is correct by construction (for _owned_key in OWNED_TABLE_KEYS: config.pop(_owned_key, None)), so seeding more tables would not uncover a different bug — but the test name (test_generate_sonic_config_stale_owned_entries_dropped_on_regen) implies broader coverage than it has. In particular, it skips STATIC_ROUTE (the PR's most user-visible behavior change), INTERFACE, PORT, LOOPBACK, LOOPBACK_INTERFACE, PORTCHANNEL*, NTP_SERVER, DNS_NAMESERVER, BREAKOUT_CFG, BREAKOUT_PORTS, BGP_NEIGHBOR, BGP_NEIGHBOR_AF, BGP_GLOBALS_AF*, VRF, VXLAN_TUNNEL, VXLAN_EVPN_NVO, and the rest of the SNMP family. Not a bug; a test-strength gap. The taxonomy-invariants tests (TestOwnershipTaxonomyInvariants) and the orchestrator's pop-loop together make the behavior correct by construction, so this is INFO rather than WARNING. Code: Action Required: Replace the hand-picked sample with an exhaustive sweep — iterate OWNED_TABLE_KEYS, seed each with a sentinel entry, regen, assert each is either empty or absent. One paragraph of test logic covers every table without growing as the list does. Suggested Fix: Fix Options:
Recommended: A — STATIC_ROUTE is the PR's most user-visible behavior change and currently has no stale-drop coverage; an exhaustive sweep closes that gap and keeps the test honest as OWNED_TABLE_KEYS evolves. Project preference is for test completeness commensurate with the scope of the change. SummaryThis is a clean, well-scoped PR that codifies the config_db.json ownership rule with a clear docstring, a drop-loop on OWNED_TABLE_KEYS, and a new TestOwnershipTaxonomyInvariants class that prevents classification constants from drifting at construction time. Both commits stay on-scope (ownership enforcement + black formatting), and the breaking STATIC_ROUTE change is correctly called out in the PR description and renamed test. The findings are documentation/test-strength issues only: SNMP_SERVER is grouped with on-demand tables but is actually created unconditionally with hardcoded defaults, and the docstring's 'owns config_db.json in full' opening overstates what the implementation does (only OWNED_TABLE_KEYS are dropped). No runtime bugs were identified.
Important Recommendation: Approve with the recommended comment/docstring fixes from Findings 1 and 2, and the suggested test-strength improvement in Finding 3. None of the findings block merge. |
The SONiC config generator owns the config_db.json tables it manages: it generates them from NetBox data and hardcoded SONiC policy, and operators must not make manual adjustments to them. This ownership model was implicit and undocumented, so each PR touching the generator relitigated which sections it owns (surfaced in review of #2279 and #2290). Document the model in the generate_sonic_config() docstring: the generator owns the tables listed in OWNED_TABLE_KEYS and rebuilds them from scratch on every regen. Tables that are neither owned nor inherited are not emitted by the generator and pass through unchanged from the device's image-provided base config_db.json; they are not policed, so they are not a supported place for operator customizations either. Note the two fields the generator does not itself emit and instead inherits from the base config (the DEVICE_METADATA localhost device attributes and the DATABASE schema VERSION); these come from the image, not from operator hand-edits. Bring the code into line with the documented model: - Rebuild owned tables from scratch instead of overwriting only the entries touched in the current run. generate_sonic_config() deep- copies the base config and most helpers assign only the keys they currently generate, so an entry removed from NetBox but still present in config_db.json (e.g. an old VLAN_MEMBER, BGP_GLOBALS VRF, or VXLAN_TUNNEL_MAP) survived regen as stale, active config. Drop the content of every owned table up front and only regenerated entries remain. OWNED_TABLE_KEYS is composed from two named sub-categories: the scaffolded-owned tables (every scaffold key except the inherited DEVICE_METADATA and VERSIONS, created up front via setdefault) and the on-demand owned tables (ROUTE_REDISTRIBUTE, the SNMP_* tables and SYSLOG_SERVER, which helpers assign directly; most only when NetBox carries their data, except SNMP_SERVER, which is always emitted with hardcoded defaults). - BGP_GLOBALS['default'] was written with key-level updates that merged into any pre-existing entry, so operator fields survived regen. Replace the entry wholesale when a primary IP is present, like every other generated VRF. - STATIC_ROUTE merged the management default route into the dict loaded from config_db.json, so pre-existing routes (e.g. an operator's custom blackhole or VRF route) survived regen. Clearing the owned tables now drops them before the management route is written. Add tests pinning the model. Illustrative helper tests (test_config_generator_ownership.py) show that _add_vrf_configuration and _add_vlan_configuration replace BGP_GLOBALS[vrf], VLAN entries and ROUTE_REDISTRIBUTE keys wholesale. Orchestrator tests (test_config_generator_orchestrator.py) assert that pre-existing BGP_GLOBALS['default'] fields, STATIC_ROUTE entries, and stale owned-table entries are dropped on regen while DEVICE_METADATA and VERSIONS survive: a sampled case keeps the assertions readable, and an exhaustive sweep seeds a sentinel into every OWNED_TABLE_KEYS table and asserts none survive regen, so the guarantee covers the whole owned set as it grows. The ownership tests also pin the table-classification taxonomy: that the scaffold set partitions into scaffolded-owned and inherited tables, that owned and inherited tables are disjoint, that OWNED_TABLE_KEYS has no duplicates, and that the scaffolded-owned and on-demand owned sets do not overlap. Most of the taxonomy holds by construction (OWNED_TABLE_KEYS is derived), so these checks exist to fail when the hand-written INHERITED_TABLE_KEYS or ON_DEMAND_OWNED_TABLE_KEYS literals are edited in a way that would silently break the model. AI-assisted: Claude Code Signed-off-by: Roger Luethi <luethi@osism.tech>
3d111d1 to
8a95260
Compare
Motivation
Regenerating a SONiC config from NetBox repeatedly raised the same question: when
generate_sonic_config()rewritesconfig_db.json, which sections does it own, and may an operator hand-edit anything directly? PRs #2279 and #2290 both ran into it. The policy is now settled: the generator owns and generatesconfig_db.jsonin full, and operators make no manual adjustments to it.What this PR does
Documents that ownership model in the
generate_sonic_config()docstring and brings the code in line with it.DEVICE_METADATAlocalhost device attributes (e.g. devicetype) and theDATABASEschemaVERSION— are inherited from the device's image-provided base config, not from operator hand-edits.config_db.json(e.g. an oldVLAN_MEMBER,BGP_GLOBALSVRF,VXLAN_TUNNEL_MAP, SNMP user, or syslog host) survived as stale, active config. The owned tables (OWNED_TABLE_KEYS— every generated table except the inheritedDEVICE_METADATAandVERSIONS) are now cleared up front, so only regenerated entries remain.BGP_GLOBALS['default']replaced wholesale. It was written with key-level updates that merged into a pre-existing entry, so operator fields survived regen. It is now replaced wholesale, like every other generated VRF.STATIC_ROUTEno longer preserves hand-edited routes. Clearing the owned tables drops pre-existing routes (e.g. an operator's custom blackhole or VRF route) before the management default route is written.Tests
test_config_generator_ownership.py):_add_vrf_configurationand_add_vlan_configurationreplaceBGP_GLOBALS[vrf],VLANentries, andROUTE_REDISTRIBUTEkeys wholesale.test_config_generator_orchestrator.py): pre-existingBGP_GLOBALS['default']fields,STATIC_ROUTEentries, and stale owned-table entries are dropped on regen, whileDEVICE_METADATAandVERSIONSsurvive.