Performance optimizations and bug fixes #2816
Open
DL6ER wants to merge 172 commits into development from
Conversation
The _addstr() function used memmem() to scan the entire shared-memory string pool for duplicate strings before appending. This is O(pool_size) per call — on a typical installation with tens of thousands of unique domains, this means scanning hundreds of kilobytes to megabytes of data on every call, all while holding the SHM lock. Replace with a process-local open-addressing hash table that maps string hashes to pool offsets. This gives O(1) average-case lookups while preserving the existing string deduplication behavior. The hash table uses the same hashing algorithm (Jenkins One-at-a-Time) already used elsewhere in FTL. It starts at 4096 slots, doubles at 75% load, and incrementally syncs any strings added by forked TCP children via str_hash_sync(). This primarily benefits: - Database import at startup (every imported domain hits addstr()) - Post-GC domain re-registration (recycled domains call addstr() again) - Any workload with many unique domains (ad-heavy or many-user networks) Signed-off-by: Dominik <dl6er@dl6er.de>
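The deduplication index described above can be sketched as a minimal open-addressing table with linear probing. Only the Jenkins One-at-a-Time hash is taken from the commit; all other names (joaat, str_find, str_insert, struct str_index) and the pool layout are illustrative, not FTL's actual internals, and resizing at 75% load is omitted for brevity.

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Jenkins One-at-a-Time hash, the algorithm named in the commit */
static uint32_t joaat(const char *key)
{
	uint32_t hash = 0;
	for(; *key != '\0'; key++)
	{
		hash += (unsigned char)*key;
		hash += hash << 10;
		hash ^= hash >> 6;
	}
	hash += hash << 3;
	hash ^= hash >> 11;
	hash += hash << 15;
	return hash;
}

/* One slot: string hash plus offset into the string pool.
 * offset == 0 marks an empty slot (offset 0 holds the empty string). */
struct str_slot { uint32_t hash; size_t offset; };
struct str_index { struct str_slot *slots; size_t size; /* power of two */ size_t used; };

/* O(1) average lookup with linear probing; equal hashes fall back to a
 * strcmp against the pool to rule out collisions. Returns the pool
 * offset of a duplicate, or (size_t)-1 on miss. */
static size_t str_find(const struct str_index *idx, const char *pool, const char *str)
{
	const uint32_t h = joaat(str);
	for(size_t i = 0; i < idx->size; i++)
	{
		const struct str_slot *s = &idx->slots[(h + i) & (idx->size - 1)];
		if(s->offset == 0)
			return (size_t)-1;   /* hit an empty slot: not present */
		if(s->hash == h && strcmp(pool + s->offset, str) == 0)
			return s->offset;    /* duplicate found */
	}
	return (size_t)-1;
}

/* Record a newly appended string; a real implementation would grow the
 * table when it passes 75% load */
static void str_insert(struct str_index *idx, const char *str, size_t offset)
{
	const uint32_t h = joaat(str);
	for(size_t i = 0; i < idx->size; i++)
	{
		struct str_slot *s = &idx->slots[(h + i) & (idx->size - 1)];
		if(s->offset == 0)
		{
			s->hash = h;
			s->offset = offset;
			idx->used++;
			return;
		}
	}
}
```

The key property is that a miss terminates at the first empty slot, so the common "string already in pool" case never touches more than a handful of slots regardless of pool size.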
Replace the O(MAXITER=1000) reverse linear scan in findQueryID() with a 4096-entry direct-mapped cache keyed on the (monotonically incrementing!) dnsmasq query ID. Each DNS reply/forward/CNAME callback should now hit the cache for an O(1) lookup instead of scanning up to (in the worst case!) 1000 queriesData structs in shared memory. Since dnsmasq IDs increment monotonically, new entries naturally evict the oldest at each slot, keeping the cache fresh without explicit eviction logic. The cache is cleared after GC memmove (which shifts all query indices). On a miss the original linear scan is used as fallback. Signed-off-by: Dominik <dl6er@dl6er.de>
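A direct-mapped cache of this shape might look like the following sketch. The 4096-entry size and the monotonic-ID eviction property come from the commit; the entry layout and all names (qid_entry, qid_cache_find, etc.) are illustrative, not FTL's actual types.

```c
#include <stdint.h>

#define QID_CACHE_SIZE 4096  /* power of two, as in the commit */

/* One slot per low-12-bits of the dnsmasq query ID. Because IDs
 * increment monotonically, a new entry landing in an occupied slot is
 * always newer than what it evicts — no explicit eviction logic. */
struct qid_entry { uint32_t id; int query_idx; /* -1 = empty */ };
static struct qid_entry qid_cache[QID_CACHE_SIZE];

/* Cleared after GC compaction shifts all query indices */
static void qid_cache_clear(void)
{
	for(int i = 0; i < QID_CACHE_SIZE; i++)
		qid_cache[i].query_idx = -1;
}

static void qid_cache_insert(uint32_t id, int query_idx)
{
	struct qid_entry *e = &qid_cache[id & (QID_CACHE_SIZE - 1)];
	e->id = id;
	e->query_idx = query_idx;
}

/* O(1) lookup; on -1 the caller falls back to the linear scan */
static int qid_cache_find(uint32_t id)
{
	const struct qid_entry *e = &qid_cache[id & (QID_CACHE_SIZE - 1)];
	return (e->query_idx >= 0 && e->id == id) ? e->query_idx : -1;
}
```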
Replace the per-GC-cycle memmove of the entire queries array with a queries_offset that advances when old queries are removed. _getQuery() adds this offset to translate logical indices to physical positions, making the change transparent to all callers. Previously, every GC run copied (remaining_queries * 96) bytes under the SHM lock — potentially tens of megabytes on a busy resolver. Now GC simply increments the offset immediately (O(1)). Compaction (the actual memmove) is deferred to shm_ensure_size() and only triggered when the physical array runs out of room, amortizing the cost over many GC cycles. Signed-off-by: Dominik <dl6er@dl6er.de>
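The offset-translation idea with deferred compaction can be sketched as follows, using a deliberately tiny physical array. All names (getQuery, gc_drop_oldest, compact_if_needed) and the struct layout are illustrative stand-ins for FTL's shared-memory structures.

```c
#include <string.h>

#define PHYS_CAP 16  /* tiny physical array for illustration */

struct query { int id; };
static struct query queries[PHYS_CAP];
static unsigned int queries_offset = 0;  /* physical position of logical index 0 */
static unsigned int next_physical = 0;   /* next physical append slot */

/* Logical -> physical translation, transparent to callers */
static struct query *getQuery(unsigned int logical)
{
	return &queries[queries_offset + logical];
}

/* O(1) GC: drop the n oldest queries by advancing the offset */
static void gc_drop_oldest(unsigned int n)
{
	queries_offset += n;
}

/* Deferred compaction: the memmove only happens when the physical
 * array runs out of room, amortizing the cost over many GC cycles */
static void compact_if_needed(void)
{
	if(next_physical < PHYS_CAP)
		return;
	memmove(queries, queries + queries_offset,
	        (next_physical - queries_offset) * sizeof(queries[0]));
	next_physical -= queries_offset;
	queries_offset = 0;
}

static void append_query(int id)
{
	compact_if_needed();
	queries[next_physical++].id = id;
}
```

Note how callers keep using logical indices throughout: the compaction changes the physical layout and resets the offset in one step, so getQuery() results are unchanged before and after.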
…ounts The recycle() function previously allocated three bool arrays (clients, domains, dns_cache) and scanned ALL active queries to mark which entries were still in use before recycling unreferenced ones. This O(queries) scan ran under the shared-memory lock every GC cycle. Replace with incrementally-maintained reference counts: - client->count / domain->count (already maintained) track direct query references - domain->cname_refcount (new) tracks CNAME chain references from both queries and cache entries - cache->refcount (new) tracks how many queries reference each DNS cache entry Refcounts are incremented at query creation (new query, DB import, CNAME chain resolution, query duplication) and decremented during GC removal. recycle() now checks these fields directly instead of scanning, reducing its complexity from O(queries) to O(clients + domains + cache). Initialize DNSCacheData.CNAME_domainID to (unsigned int)-1 as a sentinel to distinguish "never set" from domain ID 0, ensuring correct refcount decrements even after gravity reloads that reset blocking_status without clearing CNAME_domainID. Signed-off-by: Dominik <dl6er@dl6er.de>
…te loop queries_to_database() called asprintf() to format the "IP#port" upstream string for every query written to the database, causing a malloc+free cycle per query. On systems with thousands of queries per DB flush, this adds up to significant heap churn. Replace with a fixed 64-byte stack buffer and snprintf. The maximum formatted length is 52 bytes (45-char IPv6 + '#' + 5-digit port + NUL), so 64 bytes provides ample headroom with zero heap allocation. Signed-off-by: Dominik <dl6er@dl6er.de>
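The replacement pattern is a plain snprintf into a caller-provided stack buffer; the function name and signature below are illustrative, and the 52-byte worst case is the one quoted in the commit.

```c
#include <stdio.h>

/* Worst case is 45-char IPv6 + '#' + 5-digit port + NUL = 52 bytes,
 * so 64 bytes gives ample headroom with zero heap allocation. */
#define UPSTREAM_BUFLEN 64

static const char *format_upstream(char buf[UPSTREAM_BUFLEN],
                                   const char *ip, unsigned int port)
{
	snprintf(buf, UPSTREAM_BUFLEN, "%s#%u", ip, port);
	return buf;
}
```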
GC recycle() previously called lookup_remove() for each recycled client, domain, and cache entry. Each removal did a binary search followed by memmove to shift all subsequent entries left, making each removal O(N) in the table size. With K entries recycled, the total cost was O(K*N). Replace with lookup_compact(): after each recycling loop, a single O(N) pass over the sorted lookup table copies surviving entries forward, skipping entries whose backing data structure has been memset to zero (magic byte == 0x00). The sorted order is preserved since only removals occur, so binary search continues to work. For a DNS cache lookup table with 20K entries and 100 recycled entries, this reduces data movement from ~8MB (100 individual memmoves) to ~160KB (one sequential scan). Signed-off-by: Dominik <dl6er@dl6er.de>
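The single-pass compaction can be sketched like this. The entry layout is illustrative — FTL's lookup tables hold hashes and IDs rather than raw pointers — but the core move (copy survivors forward, skip entries whose backing struct was zeroed) is the one described above.

```c
#include <stddef.h>

/* Each lookup entry points at a backing struct whose first byte is a
 * magic marker; recycled structs are memset to zero, so magic == 0x00
 * identifies dead entries. */
struct entry { unsigned char magic; };

/* Single O(N) pass: copy surviving entries forward, preserving sorted
 * order (only removals occur), and return the new table length. */
static size_t lookup_compact(struct entry **table, size_t len)
{
	size_t out = 0;
	for(size_t in = 0; in < len; in++)
		if(table[in]->magic != 0x00)
			table[out++] = table[in];
	return out;
}
```

Because relative order is preserved, binary search over the compacted table keeps working without re-sorting.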
GC recycle() previously iterated data arrays to decide what to recycle, then made a second pass over each lookup table to compact out dead entries. Merge both operations into a single pass: iterate over the sorted lookup table directly, check the recycling criteria for each entry's backing data, and either recycle it (memset + add to recycle table) or shift it forward in the compacted output. This halves the number of iterations and eliminates the lookup_compact() function, its three callback wrappers, and the associated function pointer call overhead per lookup entry. Signed-off-by: Dominik <dl6er@dl6er.de>
Move the query_type bitmask check ahead of the expensive regexec()/tre_regexec() call. When a regex has a type filter (e.g., only match AAAA queries), queries of non-matching types now skip the engine entirely via a single integer AND operation, avoiding unnecessary regex evaluations. Signed-off-by: Dominik <dl6er@dl6er.de>
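The reordering amounts to a one-instruction pre-filter before the regex engine; the enum values and names below are illustrative, not FTL's actual definitions.

```c
#include <stdbool.h>

/* Query types as a bitmask (illustrative values) */
enum { TYPE_A = 1 << 0, TYPE_AAAA = 1 << 1, TYPE_HTTPS = 1 << 2 };

struct regex_filter {
	unsigned int query_type_mask;  /* 0 = no type filter */
};

/* Cheap pre-filter: a single integer AND decides whether the expensive
 * regexec()/tre_regexec() call can be skipped entirely */
static bool regex_may_match(const struct regex_filter *f, unsigned int qtype)
{
	if(f->query_type_mask != 0 && (f->query_type_mask & qtype) == 0)
		return false;      /* type can never match: skip the engine */
	return true;           /* caller proceeds to regexec() */
}
```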
gen_abp_patterns() uses cJSON, allocating 9+ heap objects per blocked-domain check (1 array + N items + N string copies). Most gravity lookups hit the exact-match check, making those allocations wasted work. Change in_gravity() to accept a cJSON** and lazily generate ABP patterns only when the exact-match returns NOT_FOUND and gravity_abp_format is enabled. The patterns are shared across the antigravity and gravity calls via the caller's pointer. Signed-off-by: Dominik <dl6er@dl6er.de>
The server_cookie calloc/memcpy/free ran on every DNS query carrying EDNS cookies (most modern queries), but the data was only used for debug logging. Move cookie parsing entirely inside the debug check and replace both heap allocations (server_cookie, pretty_server_cookie) with fixed-size stack buffers. Zero heap allocations in the non-debug path. Signed-off-by: Dominik <dl6er@dl6er.de>
_FTL_new_query: the gettimeofday(&request, 0) result was never used — remove the wasted syscall on every new query. FTL_reply: an identical DNSSEC proxy check block was duplicated (copy-paste bug) — remove the second copy. Signed-off-by: Dominik <dl6er@dl6er.de>
Replace calloc(N) + qsort(N) with a bounded min-heap of size 4*K (where K is the requested count, typically 10). During the O(N) scan, only the top entries are retained in the heap. The final sort runs on at most 4*K entries instead of all N. For 100k domains requesting top 10: memory drops from ~5MB to ~2KB, sort from O(N log N) to O(K log K) where K << N, and zero-count entries are skipped early during collection. Also fix lock leak: calloc failure now unlocks shm before returning. Signed-off-by: Dominik <dl6er@dl6er.de>
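A bounded min-heap for top-K selection works as sketched below: the root always holds the smallest retained count, so a candidate only enters when it beats the current minimum. The struct and function names are illustrative; FTL's actual entry type and final-sort step are omitted.

```c
#include <stddef.h>

struct topk_entry { unsigned int count; unsigned int id; };
struct topk { struct topk_entry *heap; size_t cap, len; };

static void sift_down(struct topk *t, size_t i)
{
	for(;;)
	{
		size_t min = i, l = 2*i + 1, r = 2*i + 2;
		if(l < t->len && t->heap[l].count < t->heap[min].count) min = l;
		if(r < t->len && t->heap[r].count < t->heap[min].count) min = r;
		if(min == i) return;
		struct topk_entry tmp = t->heap[i];
		t->heap[i] = t->heap[min];
		t->heap[min] = tmp;
		i = min;
	}
}

/* Offer one candidate during the O(N) scan; only the top t->cap
 * entries by count survive */
static void topk_offer(struct topk *t, unsigned int count, unsigned int id)
{
	if(count == 0)
		return;  /* skip zero-count entries early */
	if(t->len < t->cap)
	{
		/* heap not yet full: append and sift up */
		size_t i = t->len++;
		t->heap[i] = (struct topk_entry){ count, id };
		while(i > 0 && t->heap[(i-1)/2].count > t->heap[i].count)
		{
			struct topk_entry tmp = t->heap[i];
			t->heap[i] = t->heap[(i-1)/2];
			t->heap[(i-1)/2] = tmp;
			i = (i-1)/2;
		}
	}
	else if(count > t->heap[0].count)
	{
		/* beats the retained minimum: replace the root and restore */
		t->heap[0] = (struct topk_entry){ count, id };
		sift_down(t, 0);
	}
}
```

With cap = 4*K, the final qsort runs over at most 4*K retained entries instead of all N, which is where the O(N log N) → O(K log K) claim above comes from.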
- Remove strtolower() on upstream IPs in FTL_forwarded and FTL_retried: inet_ntop already produces lowercase output. - Fix double strlen() call in special_domain: compute domain length once and reuse for the suffix comparison. - Remove unnecessary sqlite3_clear_bindings() in domain_in_list: the next sqlite3_bind_text() overwrites parameter 1 directly, making the clear redundant. Signed-off-by: Dominik <dl6er@dl6er.de>
Signed-off-by: Dominik <dl6er@dl6er.de>
In queries_to_database(), the same domain/client/forward typically appears in many queries within a single 1-second batch. Each was triggering a separate INSERT OR IGNORE sqlite3_step into the linking tables (domain_by_id, client_by_id, forward_by_id) even though only the first INSERT per unique entry has any effect. Add per-batch dedup caches using pointer comparison (valid because getstr() returns stable shared-memory pointers where identical strings get identical addresses) to skip redundant sqlite3_step calls. Cache sizes are tuned per table type: - Domains: 256 (media-heavy pages produce 50-100+ unique domains) - Clients: 32 (typical home networks have 5-30 devices) - Upstreams: 8 (rarely more than 2-4 configured) Also remove the last 4 sqlite3_clear_bindings() calls (forward_stmt, addinfo_stmt x2, query_stmt) - unnecessary per SQLite docs since all bindings are overwritten before each step. Signed-off-by: Dominik <dl6er@dl6er.de>
Simplify linking table dedup: since the in-memory database persists for the process lifetime, a domain/client/upstream only needs to be inserted once ever, not once per batch. Replace the unsigned int db_batch generation counter with a single-bit in_database flag in each struct's flags bitfield. Reset in_database when client hostname changes in resolve.c so the new (ip, name) pair gets inserted into client_by_id. Bump SHARED_MEMORY_VERSION 14 -> 15 for the struct layout change. Signed-off-by: Dominik <dl6er@dl6er.de>
The REPLACE INTO query_storage statement previously used 4 correlated subqueries to resolve domain/client/forward/addinfo text strings to their linking table IDs on every single query upsert. This required 4 B-tree index lookups with string comparison per query. Cache the SQLite row IDs (db_id) directly on the domainsData, clientsData, and upstreamsData shared-memory structs. After the first INSERT OR IGNORE (already guarded by in_database flag), fetch the ID via sqlite3_last_insert_rowid (new rows) or a one-time SELECT (rows imported from disk), then bind it as a direct integer in the REPLACE. This eliminates 3 of 4 subqueries from the hottest SQLite statement in the system. Only the addinfo subquery remains (it only fires for blocked/CNAME queries, not every query) and cannot be cached easily (we have no matching SHM data structure for it and I don't want to add one just for this - the overall complexity is not worth it). Signed-off-by: Dominik <dl6er@dl6er.de>
The REPLACE INTO query_storage statement had one remaining correlated subquery: (SELECT id FROM addinfo_by_id WHERE type=? AND content=?). This B-tree index probe executed for every query stored to the in-memory database, even non-blocked queries where both parameters are NULL. Cache addinfo_by_id row IDs in a process-local open-addressing hash table (1024 slots, 8 KB). The key packs the addinfo type into the low bits and the actual value (CNAME_domainID or list_id) above, then applies Knuth multiplicative hashing to spread dense small integers across the power-of-two table. ADDINFO_TYPE_BITS is derived at compile time from the ADDINFO_TYPE_MAX sentinel in enum addinfo_type. After the first INSERT OR IGNORE, fetch the row ID via sqlite3_last_insert_rowid (new rows) or a one-time SELECT (rows pre-existing from disk import), then cache it. Subsequent queries with the same addinfo skip all SQLite operations entirely. The REPLACE statement is now a plain VALUES (?1,...,?13) with zero correlated subqueries, down from the original 4. Signed-off-by: Dominik <dl6er@dl6er.de>
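The key-packing and slot computation can be sketched as below. The 1024-slot size and the Knuth multiplicative constant match the description; ADDINFO_TYPE_BITS is a fixed illustrative value here, whereas FTL derives it at compile time from the ADDINFO_TYPE_MAX sentinel.

```c
#include <stdint.h>

#define ADDINFO_TYPE_BITS 2u     /* illustrative; FTL derives this */
#define ADDINFO_SLOTS 1024u      /* power of two, 8 KB table */

/* Pack the addinfo type into the low bits and the value above, then
 * spread with Knuth's multiplicative constant so dense small integers
 * (CNAME_domainID, list_id) land in different slots. */
static uint32_t addinfo_slot(uint32_t type, uint32_t value)
{
	const uint64_t key = ((uint64_t)value << ADDINFO_TYPE_BITS) | type;
	/* 2654435761 ~= 2^32 / golden ratio (Knuth multiplicative hashing) */
	const uint32_t h = (uint32_t)(key * 2654435761u);
	return h & (ADDINFO_SLOTS - 1);
}
```

Without the multiplication, consecutive domain IDs would fill consecutive slots and collide badly once masked; the multiplicative step scatters them across the whole power-of-two table.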
in_gravity() is called twice per DNS cache miss in check_domain_blocked: once for gravity (block) and once for antigravity (allow-adlists). For the majority of Pi-hole users who don't use allow-adlists, the antigravity call always returns NOT_FOUND after a wasted SQLite bind/step/reset. Check once at gravity DB open time whether the antigravity table has any rows. When empty, short-circuit the in_gravity(antigravity=true) call immediately, saving one full prepared-statement round trip per uncached domain lookup. This is rechecked after gravity runs are performed when the new gravity database is re-opened. Signed-off-by: Dominik <dl6er@dl6er.de>
…k time Restructure queries_to_database() from a single massive SHM-locked loop into three phases: Phase 1 (SHM locked): Snapshot scalar query fields into a local array and handle linking table INSERTs (domain/client/forward/addinfo). The linking table work stays here because it reads SHM strings and writes back db_id/in_database flags. The changed flag is cleared immediately after snapshotting so that DNS thread updates during phase 2 are not lost. Phase 2 (SHM unlocked): Bind and step the main REPLACE statement for each snapshot. This is the bulk of the SQLite work and no longer blocks DNS processing or API threads. Phase 3 (SHM locked, brief): Write back db indices for newly inserted queries and update counters. This reduces SHM lock hold time from O(queries × SQLite REPLACE) to O(queries × field copy) + O(queries × index writeback), keeping the lock-free window over the heaviest database operation. Signed-off-by: Dominik <dl6er@dl6er.de>
find_mac() can trigger a netlink kernel call (iface_enumerate) to refresh the full ARP table when a client's MAC address isn't in dnsmasq's cache. Previously this ran while holding the global SHM mutex, blocking all other threads (API, database, GC, DNS workers). Release the SHM lock before calling find_mac(), perform the lookup into stack-local buffers, then reacquire the lock and write back the result. Both client and query pointers are re-fetched after reacquiring since SHM may have been remapped while unlocked. Only affects new-client paths (hwlen < 1); existing clients already have their MAC cached and skip this code entirely. Signed-off-by: Dominik <dl6er@dl6er.de>
_FTL_iface() is called on every incoming UDP query and iterates daemon->interfaces (potentially 2-3 times) to collect the IPv4/IPv6 addresses of the receiving interface. On systems with many interfaces (Docker hosts, VLANs, VPNs), this is O(n_interfaces) per query. Cache the resolved recviface pointer and skip the expensive address- collection loop when the same interface is seen again. The pointer is stable within dnsmasq's interface list and naturally invalidates on SIGHUP when the list is rebuilt. Also moves the next_iface invalidation (memset) to only run on cache miss or when no interface is found, avoiding unnecessary zeroing on every query. Signed-off-by: Dominik <dl6er@dl6er.de>
…s are empty Refactor the one-off gravity_check_antigravity() into a generic gravity_table_has_entries() helper and a consolidated gravity_check_list_presence() that probes antigravity, vw_allowlist and vw_denylist at gravity DB open time. When the exact allowlist or exact denylist contains no entries (common for most Pi-hole users who rely on regex or adlist-based blocking), in_allowlist() and in_denylist() now return NOT_FOUND immediately, saving one SQLite bind/step/reset per cache-miss query for each empty table. Signed-off-by: Dominik <dl6er@dl6er.de>
…tterns The per-cache-miss gravity check generates ABP wildcard patterns by iterating over domain label suffixes. Previously each pattern was heap-allocated as a cJSON string node inside a cJSON array, adding N+1 malloc/free round-trips per uncached query when ABP format is enabled (one array plus one string per domain component). Replace the cJSON array with a lightweight struct abp_patterns that stores only the suffix start offsets into the original domain string. The actual "||suffix^" / "@@||suffix^" pattern strings are built on the fly in a small stack buffer right before the SQLite bind call. This eliminates all heap allocation and deallocation in the ABP pattern path, removes the cJSON iteration overhead, and avoids the multiple memmove operations that the old gen_abp_patterns() used to build pattern strings incrementally. The original gen_abp_patterns() returning cJSON is retained for the cold API search path which needs the patterns as JSON for debug output. Signed-off-by: Dominik <dl6er@dl6er.de>
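The offsets-only representation might look like this sketch. The struct name matches the one mentioned above, but its fields, the MAX_LABELS bound, and the helper functions are illustrative assumptions, not FTL's actual layout.

```c
#include <stdio.h>

#define MAX_LABELS 16

/* Store only suffix start offsets into the original domain string —
 * no per-pattern heap allocation */
struct abp_patterns {
	const char *domain;
	unsigned int offsets[MAX_LABELS];
	unsigned int n;
};

static void abp_collect(struct abp_patterns *p, const char *domain)
{
	p->domain = domain;
	p->n = 0;
	p->offsets[p->n++] = 0;  /* the full domain is always a pattern */
	for(unsigned int i = 0; domain[i] != '\0' && p->n < MAX_LABELS; i++)
		if(domain[i] == '.')
			p->offsets[p->n++] = i + 1;  /* suffix starts after the dot */
}

/* Build the i-th "||suffix^" pattern in a small stack buffer right
 * before it is needed (e.g. for an SQLite bind) */
static const char *abp_pattern(const struct abp_patterns *p, unsigned int i,
                               char *buf, size_t buflen)
{
	snprintf(buf, buflen, "||%s^", p->domain + p->offsets[i]);
	return buf;
}
```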
…ttime, reuse query timestamp - Add strcpy_tolower() that copies and lowercases in a single pass, replacing the strncpy()+strtolower() pattern. This avoids both the double traversal and strncpy's zero-padding of the remaining buffer (typically ~236 wasted bytes for a 20-char domain in a 256-byte buffer). Applied in _FTL_new_query() and FTL_CNAME(). - Guard clock_gettime(CLOCK_MONOTONIC) calls in lock_shm()/unlock_shm() behind config.debug.timing.v.b. Previously the syscall was executed on every SHM lock/unlock even when debug timing was disabled. - Replace time(NULL) syscall in FTL_check_blocking() cache expiry check with (time_t)query->timestamp which is already available, eliminating a redundant syscall per blocked-query lookup. Signed-off-by: Dominik <dl6er@dl6er.de>
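The single-pass copy-and-lowercase described in the first bullet can be sketched as follows; the exact signature FTL uses (e.g. a bounded variant) may differ.

```c
#include <ctype.h>
#include <stddef.h>

/* Copy and lowercase in one pass, avoiding both the double traversal
 * of strncpy()+strtolower() and strncpy's zero-padding of the
 * remaining buffer. Returns the length as a free by-product. */
static size_t strcpy_tolower(char *dst, const char *src)
{
	size_t i = 0;
	for(; src[i] != '\0'; i++)
		dst[i] = (char)tolower((unsigned char)src[i]);
	dst[i] = '\0';
	return i;
}
```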
- Replace 50-line query type switch in _FTL_new_query() with a 256-byte direct-mapped lookup table. Maps DNS wire types to FTL enum values via a single array load instead of a multi-case switch/jump table, avoiding indirect branch overhead and BTB pollution. - Replace strncpy() with strcpy_tolower() in FTL_check_blocking() for the local domain copy. Domains are already lowercase in shared memory, so tolower() is a no-op; the key gain is avoiding strncpy's zero- padding of the remaining ~236 bytes (MAXDOMAINLEN=256, typical domain ~20 chars). - Guard querystr() evaluation and debug logging in FTL_hook() behind config.debug.flags.v.b check. Previously querystr() (which does a linear search + sprintf) was called unconditionally on every DNS event even when debug output was disabled. Signed-off-by: Dominik <dl6er@dl6er.de>
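The direct-mapped type table from the first bullet amounts to a 256-byte array indexed by the DNS wire type; the enum values and the subset of mapped types below are illustrative, not FTL's full mapping.

```c
#include <stdint.h>

/* Illustrative FTL-side enum */
enum qtype { TYPE_OTHER = 0, TYPE_A, TYPE_AAAA, TYPE_PTR, TYPE_TXT, TYPE_HTTPS };

/* DNS wire type -> enum in a single array load; designated
 * initializers leave every unmapped index at TYPE_OTHER (0) */
static const uint8_t qtype_map[256] = {
	[1]  = TYPE_A,
	[12] = TYPE_PTR,
	[16] = TYPE_TXT,
	[28] = TYPE_AAAA,
	[65] = TYPE_HTTPS,
};

static enum qtype wire_to_qtype(unsigned int wiretype)
{
	return wiretype < 256 ? (enum qtype)qtype_map[wiretype] : TYPE_OTHER;
}
```

Compared with a 50-case switch, the array load has no data-dependent branches, which is the BTB-pollution argument made above.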
The result of the ternary selecting between antigravity_stmt and gravity_stmt was discarded instead of assigned back to stmt, leaving it NULL when domain_in_list() was subsequently called. This is a latent crash bug triggered whenever a client's gravity prepared statement needed on-demand initialization. Matches the correct pattern already used in in_allowlist(). Signed-off-by: Dominik <dl6er@dl6er.de>
blockDomain is a bool (return value of check_domain_blocked()), not an enum db_result. The comparison worked by coincidence since NOT_FOUND == 0 == false, but was semantically incorrect. Replace with the idiomatic boolean negation. Signed-off-by: Dominik <dl6er@dl6er.de>
…lients After NUM_RECHECKS group rechecks the inner condition is permanently false, but time(NULL) was still invoked on every call. This function is called 3-4 times per cache-miss query (once each for allowlist, denylist, gravity, and antigravity). Adding an early-exit guard eliminates all time() calls for clients past their first ~3 minutes, which covers the vast majority of a client's lifetime. Signed-off-by: Dominik <dl6er@dl6er.de>
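The guard is a counter check placed before the syscall; the state struct and field names below are illustrative stand-ins for FTL's per-client data.

```c
#include <stdbool.h>
#include <time.h>

#define NUM_RECHECKS 3

struct recheck_state {
	unsigned int count;   /* rechecks performed so far */
	time_t next_due;      /* when the next recheck is due */
};

static bool client_check_again(struct recheck_state *s)
{
	/* Early exit: once all rechecks are done the condition below is
	 * permanently false, so the time(NULL) syscall is skipped forever */
	if(s->count >= NUM_RECHECKS)
		return false;
	const time_t now = time(NULL);
	if(now < s->next_due)
		return false;
	s->count++;
	s->next_due = now + 60;  /* 60/120/180-second boundaries */
	return true;
}
```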
…casecmp is_pihole_domain() is called for every non-TYPE_NONE query. In the common case the domain is not pi.hole/hostname, so all four strcasecmp calls ran to completion unnecessarily. Cache the lengths of the four comparison targets as static variables and pre-filter with a single strlen(domain) + integer comparison before each strcasecmp, turning typical misses into O(1) rejections. Signed-off-by: Dominik <dl6er@dl6er.de>
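The length pre-filter pattern looks like the sketch below. FTL caches the lengths of four comparison targets in statics; this illustration shows the same idea with two targets ("pi.hole" plus a hostname), and the function body is an assumption, not the real implementation.

```c
#include <string.h>
#include <strings.h>
#include <stdbool.h>

/* One strlen(domain), then integer length compares gate every
 * strcasecmp: typical misses are rejected without any character-wise
 * case-insensitive comparison. */
static bool is_pihole_domain(const char *domain, const char *hostname)
{
	static const char pihole[] = "pi.hole";
	const size_t dlen = strlen(domain);  /* computed once */
	if(dlen == sizeof(pihole) - 1 && strcasecmp(domain, pihole) == 0)
		return true;
	if(hostname != NULL && dlen == strlen(hostname) &&
	   strcasecmp(domain, hostname) == 0)
		return true;
	return false;
}
```

In the real code the target lengths are cached in static variables rather than recomputed, so each miss costs one strlen plus a few integer compares.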
Conflicts have been resolved.
Signed-off-by: Dominik <dl6er@dl6er.de>
… last development merge Signed-off-by: Dominik <dl6er@dl6er.de>
Depends on #2845
This pull request has conflicts, please resolve those before we can evaluate the pull request.
Signed-off-by: Dominik <dl6er@dl6er.de>
Conflicts have been resolved.
…the earlier merge Signed-off-by: Dominik <dl6er@dl6er.de>
Signed-off-by: Dominik <dl6er@dl6er.de>
Set SQLITE_DEFAULT_JOURNAL_SIZE_LIMIT=33554432 as a compile-time default so every connection inherits it. Without this cap, the -wal file on the on-disk query database can grow unbounded between checkpoints on busy resolvers, increasing commit latency and disk usage. A compile-time default is preferred over a runtime PRAGMA because it also covers attached databases and any future connection added without having to remember to re-issue the PRAGMA. Signed-off-by: Dominik <dl6er@dl6er.de>
get_memdb_size() previously issued two separate prepare/step/finalize cycles for "PRAGMA page_count" and "PRAGMA page_size". Replace them with a single SELECT that cross-joins the pragma_page_count() and pragma_page_size() table-valued functions. This is a code-simplification change, not a hot-path optimisation — get_memdb_size() is only called from the debug-gated memory-usage log. The motivation is to stop pretending this needs two statements when SQLite can return both values from one prepare. Also fix a misleading log prefix: the function is get_memdb_size(), not init_memory_database(). Signed-off-by: Dominik <dl6er@dl6er.de>
bind_client_groups() binds the group_ids array to the carray() TVF with SQLITE_STATIC, meaning SQLite will NOT copy the buffer and will dereference it directly during each sqlite3_step(). That is only correct because getintarray() returns a pointer into the SHM integer-array region, which is stable for the statement's execution scope. This change does three things: - Document the lifetime contract next to the binding so anyone refactoring getintarray() is forced to consider the SQLITE_STATIC implication (TRANSIENT copy, or explicit rebind before each step). - Add a cheap assert() to catch a future refactor that could leave group_ids NULL while group_count > 0 — the early return already handles NULL, but the assert documents the post-guard invariant at the bind site. - Pure documentation + debug-only guard; no runtime impact in release builds. Signed-off-by: Dominik <dl6er@dl6er.de>
Pass SQLITE_OPEN_NOMUTEX to sqlite3_open_v2() in _dbopen(). Every dbopen() caller assigns the result to a local variable, uses it within one function, and calls dbclose() before returning — the connection never crosses a thread boundary. In SQLite's default "serialized" threading mode, each of the dozens of sqlite3_* calls per API request acquires a per-connection mutex; with NOMUTEX those acquisitions are skipped. Left untouched on purpose: - _memdb (query-table.c): accessed by both the DNS thread and the background database thread, must stay serialized. - gravity_db: accessed by multiple threads during gravity reload, must stay serialized. The long-lived shared connections continue to rely on SQLite's own locking; this change only applies to the short-lived per-request connections that are inherently single-threaded. Signed-off-by: Dominik <dl6er@dl6er.de>
Apply the same treatment as the previous _dbopen() change to the remaining sqlite3_open_v2() call sites that open a connection which lives entirely within one thread's scope: - network-table.c: macvendor.db readonly lookup - gravity-db.c: gravity_updated() readonly check - tools/gravity-parseList.c: CLI tool (single-process) - api/teleporter.c: per-request gravity.db writer - zip/teleporter.c: two :memory: database opens used for serialization and validation Each of these creates a sqlite3* in a local variable, uses it, and closes it before the function returns — the connection is never shared across threads. The in-memory databases in zip/teleporter.c have no file-level locking to care about at all. The teleporter's extra connection to gravity.db coexists with the long-lived gravity_db handle; NOMUTEX on one connection does not affect the other, and SQLite's file-level locking handles cross- connection serialization as usual. No functional change; avoids per-call SQLite mutex acquisitions on these paths. Signed-off-by: Dominik <dl6er@dl6er.de>
SQLITE_DEFAULT_CACHE_SIZE was set to 16384, which is a positive value and therefore means pages rather than kiB: at the default 4 kiB page size that works out to ~64 MiB of private page cache for every sqlite3 connection. FTL keeps several long-lived connections open (_memdb in query-table.c, gravity_db in gravity-db.c, plus briefly-scoped dbopen() handles), so the compiled-in default dominates RSS. Drop it to 4096 pages (~16 MiB). The mmap-backed on-disk query database (PRAGMA mmap_size) lets the kernel page cache serve cold reads for free, so large per-connection caches mostly duplicate the kernel's work. 16 MiB comfortably covers the hot working set for both gravity lookups and the in-memory query ring, while saving ~48 MiB per live connection at steady state. Also clean up the adjacent comment which previously described a "16 MiB" cache that the actual define did not deliver; the comment now matches what the compile-time default actually configures. No behaviour change. If a specific workload ever needs more cache on a specific connection, PRAGMA cache_size on that connection is the right localized override. Signed-off-by: Dominik <dl6er@dl6er.de>
The debug.performance 5-minute rollup has so far only tracked
FTL_check_blocking() as one aggregate slot. Real-world deployments
show a small but recurring tail: ~1-2 events per 5 minute window
where check_blocking takes 1.2-2.3 ms instead of the usual tens of
microseconds. Because the entire function lives in one PERF_STAT
slot, we cannot tell whether the spike lives in the allowlist pass
or the denylist/gravity pass.
Add two nested PERF_STAT slots that only fire on the cache-miss
path (the fast cache-hit path is already dominated by a single
switch and is not the source of the tail):
PERF_STAT_CB_ALLOWLIST wraps in_allowlist() + in_regex(ALLOW)
PERF_STAT_CB_DENYLIST wraps check_domain_blocked() (primary
call and the _esni fallback are timed
into the same slot, so the rollup
reflects total denylist/gravity work
per query)
special_domain() is not instrumented separately; it is a handful
of strcmp()s and cannot be the source of a multi-millisecond spike.
No behaviour change; the new scopes compile out to nothing when
debug.performance is disabled (existing PERF_START / PERF_END
macros already gate on config.debug.performance.v.b). The rollup
in FTL_dump_cache_stats() picks up the new slots via the existing
loop, reporting them with descriptive names that make their
nesting under check_blocking obvious in the log.
Signed-off-by: Dominik <dl6er@dl6er.de>
Background. The previous commit added nested PERF_STAT slots inside
FTL_check_blocking (allowlist, denylist/gravity). First ~8 h of
production data under debug.performance showed an oddity: the
denylist/gravity child slot accumulated ~69 slow (>1 ms) events
while the parent PERF_STAT_CHECK_BLOCKING slot saw only 3.
Two things followed from that:
1. FTL_check_blocking has two callers. FTL_new_query wraps its call
in PERF_STAT_CHECK_BLOCKING; FTL_CNAME does not. The nested child
slots fire regardless, so CNAME deep-inspection time was invisible
at the parent level. One user-facing DNS query against a 5-hop
CNAME chain drives 5 separate calls through FTL_check_blocking,
each capable of a full denylist/gravity miss.
2. The existing CB_DENYLIST slot wraps check_domain_blocked() as a
whole, but that function runs three independent engines in
sequence — exact SQLite denylist, gravity/antigravity SQLite
probe, and TRE regex — each with a different potential failure
mode, and we currently cannot tell them apart in the rollup.
Changes. Still instrumentation-only; PERF macros compile out when
debug.performance is disabled.
(a) Split the parent CHECK_BLOCKING into primary vs CNAME callers.
Add PERF_STAT_CB_CNAME and wrap the FTL_CNAME-side call. Rename
the existing slot's human-readable label to "check_blocking
(primary path)" so its scope is unambiguous in the log.
(b) Split CB_DENYLIST's body into three nested sub-slots:
CDB_EXACT — in_denylist() exact-match SQLite
CDB_GRAVITY — in_gravity() for both antigravity and gravity;
timed into one slot so the rollup reflects total
gravity-DB work per query regardless of which
sub-call hit
CDB_REGEX — in_regex(REGEX_DENY) over the TRE regex engine
The regex call is now stored in a local bool before the `if`
instead of being evaluated inline, so the PERF_END runs after the
engine returns. Same evaluation order, same short-circuit semantics,
no behaviour change.
After this lands, the 5-minute rollup reports:
check_blocking (primary path) # FTL_new_query side
check_blocking (CNAME path, per-hop) # FTL_CNAME side
check_blocking -> allowlist # nested, both callers
check_blocking -> denylist/gravity # nested, both callers
check_domain_blocked -> exact denylist
check_domain_blocked -> gravity+antigravity
check_domain_blocked -> regex denylist
The CNAME-to-primary call ratio gives the amplification factor, and
the three new sub-engine slots localize which one owns the tail. If
CDB_GRAVITY concentrates the slow events, the fix domain is SQLite
and mmap (page-fault on cold regions, WAL contention); if CDB_REGEX
does, it is the TRE engine; if CDB_EXACT does, it is the
prepared-statement hot path. Each has a different next step, and
right now we
cannot tell them apart.
Signed-off-by: Dominik <dl6er@dl6er.de>
Production data from the previous PERF_STAT round narrowed the
multi-millisecond tail on cache-miss queries to CDB_GRAVITY
(check_domain_blocked → gravity+antigravity): 45 slow (>1 ms)
events across ~7600 calls in ~4.75 h, with a peak of 2147 µs.
That same data, cross-referenced with the pre-existing
gravity-db.c "gravity (exact)" slot which wraps domain_in_list()
(the bind + sqlite3_step + reset cycle), shows the step itself
never exceeded ~210 µs in the same window. The step is therefore
not the culprit.
The time must live in in_gravity()'s wrapper code around the
step:
- gravityDB_client_check_again() — group membership recheck,
which can reload groups from the gravity DB on demand and
may page-fault on cold mmap regions;
- gravityDB_prepare_client_statements() — per-client statement
prepare, runs only on first call per client;
- bind_client_groups() — carray bind of the client's group-id
array against a shared statement.
Add a span-style sibling to the existing expression-wrapping
GRAVITY_TIMED_LOOKUP macro (GRAVITY_PERF_START / GRAVITY_PERF_END),
extend gravity_perf[] from 6 to 7 slots, and use the new pair to
time the setup block inside in_gravity() from entry through
bind_client_groups(). The 5-minute rollup now includes:
Gravity lookup stats [in_gravity wrapper (non-step)]:
N calls, avg X us, max Y us, Z slow (>1ms, W%)
If Z matches the slow count in CDB_GRAVITY, the wrapper IS the
tail source — next fix domain is likely mmap warm-up (the existing
gravity pre-warm from 38f7322 may need to re-touch pages on group
reloads) or carray bind. If Z stays at zero while CDB_GRAVITY
keeps going slow, the tail is in the ABP post-step path or the
two in_denylist() / in_regex() neighbors that CDB_GRAVITY was
never actually covering (in which case the earlier attribution
needs re-examining).
Either outcome narrows the search meaningfully. No behaviour
change; both macros compile out when debug.performance is off.
Early-return paths inside in_gravity() (LIST_NOT_AVAILABLE,
bind_client_groups failure) skip the GRAVITY_PERF_END call so
they don't pollute the histogram — they are rare error paths,
not the hot path we want to characterize.
Signed-off-by: Dominik <dl6er@dl6er.de>
Production perf data (debug.performance enabled, ~8 h of rollups) localized the 1-2 ms tail in cache-miss DNS queries entirely to the wrapper code inside in_gravity() — specifically the inline call to gravityDB_reload_groups(client) made from gravityDB_client_check_again() at the 60/120/180-second boundaries after a client was first seen. 27/27 slow events in the new "in_gravity wrapper (non-step)" slot matched the 27 slow events in CDB_GRAVITY, and the bimodal distribution across windows (21 of 32 windows had sub-µs wrapper time; the other 11 each had 1-7 slow events) mapped cleanly onto "any client in its 3-minute recheck window during this 5-minute window".

gravityDB_reload_groups() does three heavy things — finalize the client's prepared statements, re-prepare them against the current group set (which runs get_client_groupids() SQL against gravity.db), and rebuild the per-client regex lookup. All synchronously on the DNS hot path, under the SHM lock, on whichever query happened to land inside the recheck window. One user-facing query was therefore effectively charged ~1-2 ms whenever it lost this lottery, and CNAME recursion amplified the probability by 2.2x because every CNAME hop is another chance to cross the boundary.

The recheck mechanism itself is correct — clients *should* have their identity re-evaluated as hostname/MAC/interface information arrives — but the reload does not need to happen synchronously with the triggering query. Between rechecks the per-client state is already "stale" by up to 60 seconds by design; deferring by ≤1 second is immaterial.

Change: gravityDB_client_check_again() now just advances the recheck counter and sets a new per-client flag (clientsData.flags.reload_pending, a single bit inside the existing client_flags bitfield — zero size impact). The DB thread, in its once-per-second cadence alongside queries_to_database() and the gravity_updated() check, calls a new gravityDB_process_pending_reloads() that scans clients, takes the SHM lock, and runs gravityDB_reload_groups() for any client whose flag is set, then clears it.

The SHM lock is still held during the reload, but the hold is now on the DB thread. A DNS query only ever waits on that lock if it happens to be in flight exactly when the DB thread is processing — a small fraction of the time at typical query rates, and the blocked query absorbs one reload's worth of latency rather than every query near a boundary triggering its own.

The second caller of gravityDB_reload_groups() at dnsmasq_interface.c:1088 (client changed network interface) is left inline intentionally. Interface changes are rare events and we have no evidence they contribute to the tail; routing that path through the same flag can be a follow-up if it ever surfaces.

No behaviour change from the user's perspective: group re-checks still happen at the same 60/120/180-second marks, clients are still re-identified over their first three minutes, and the eventual group set a client is assigned is unchanged.

Signed-off-by: Dominik <dl6er@dl6er.de>
Production perf data (debug.performance enabled, ~8 h of rollups)
localized the recurring 1-2 ms tail on cache-miss DNS queries entirely
to one inline call: gravityDB_client_check_again() → gravityDB_reload_groups()
from inside in_gravity() / in_allowlist() / in_denylist(). The 27 slow
events in the "in_gravity wrapper (non-step)" PERF_STAT slot matched the
27 slow events in CDB_GRAVITY one-for-one, and the bimodal window
distribution (21 quiet, 11 spiky) mapped cleanly onto "did any client
cross its 60/120/180-second recheck boundary during this 5-minute
window".
The recheck mechanism exists for a legitimate reason: a client's
identity (MAC, hostname) is discovered asynchronously via
parse_neighbor_cache() (ARP) and the resolver thread, so a client whose
group assignment depends on MAC or hostname cannot be matched correctly
on its very first query. The three timed rechecks at 60/120/180 s
re-run get_client_groupids() once enough async identity data has
arrived.
That's periodic maintenance, though, not per-query work. The old
placement made every DNS query eat the "is this client due for a
recheck" conditional, and whichever query happened to land inside a
boundary absorbed the full ~1-2 ms reload (SQL re-fetch of groups +
per-client statement re-prepare + per-client regex rebuild) under the
SHM lock. CNAME recursion amplified the exposure ~2.2×.
Rework:
- Remove the three hot-path calls to gravityDB_client_check_again()
from in_gravity(), in_allowlist(), in_denylist() entirely. No more
per-query check for anything recheck-related.
- Replace the per-client gravityDB_client_check_again() with a
gravityDB_recheck_clients() that iterates all clients under the SHM
lock, short-circuits mature ones (reread_groups >= NUM_RECHECKS),
and runs the reload directly for any client currently within its
3-minute window whose next boundary has passed.
- Call that function from the database thread's existing once-per-
second block in run_db_thread(), next to queries_to_database() and
the gravity_updated() check.
Behaviour is preserved: clients are still re-identified at the same
60/120/180 s marks after first-seen, with at most ~1 s of additional
latency between the boundary and the actual reload — well within the
design's pre-existing between-rechecks staleness window.
The other caller of gravityDB_reload_groups() at
dnsmasq_interface.c:1088 (client changed network interface) is left
inline intentionally. Interface changes are rare events and we have
no evidence they contribute to the tail.
Removes from scope the per-client reload_pending flag that an earlier
iteration of this work introduced: with the full check moved to the DB
thread, there is nothing for the DNS thread to signal, and recomputing
the "is this client due for a recheck" condition during the once-per-
second scan is cheaper than reading/writing a shared flag bit.
Net: -23 lines, no new state on clientsData, zero DNS-path overhead
from the recheck subsystem. The CDB_GRAVITY and in_gravity wrapper
slow-event counts should drop to essentially zero once this ships.
Signed-off-by: Dominik <dl6er@dl6er.de>
yubiuser
reviewed
Apr 20, 2026
Signed-off-by: Dominik <dl6er@dl6er.de>
I've been running this for several days now without issues. I think after the next patch release this is good to go into
What does this implement/fix?
Summary
Broad sweep across the FTL engine covering performance work, correctness fixes, database schema improvements, signal handler hardening, and crash diagnostics.
145 commits across 56 files (+2851/−1395 lines): 78 bug fixes, 45 performance optimizations, and supporting refactors.
Testing showed approximately 15% query processing speedup: ~92 µs → ~80 µs for each new query.
Performance: Gravity Database
- `carray()` binding — replace per-client statement vectors with one shared statement per process, rebound via `sqlite3_carray_bind()` before each lookup. Query plan goes from 4 SEARCH operations (through 3 LEFT JOINs) to a single covering index seek for gravity/antigravity.
- `addintarray()` — TCP forks get group data for free without recomputation.
- `(group_id, ...)` indexes — enable index seeks for `carray()`-filtered queries on `adlist_by_group` and `domainlist_by_group` instead of full scans.
- `WITHOUT ROWID` on junction tables — `adlist_by_group`, `domainlist_by_group`, `client_by_group`, and `info` tables use the composite PK directly as the B-tree storage key, eliminating one level of indirection per JOIN.
- `bind_client_groups()` helper tracks the last-bound SHM `groupspos` per statement. Since `addintarray()` deduplicates, clients sharing the same group set skip the rebind entirely.
- `PRAGMA mmap_size = 256 MiB` on gravity.db and the on-disk query database. Eliminates `pread()` syscall overhead for all domain lookups.
- `posix_fadvise(POSIX_FADV_WILLNEED)` on gravity.db at startup — pre-warms the kernel page cache asynchronously, eliminating cold-start page faults.
- `getTable()` case, SHM counter, and API exposure for antigravity domain counts.
- `adlist.enabled` and `group.enabled` no longer re-read on every lookup.

Performance: Query Processing Hot Path
- `FTL_check_blocking()` — pass pre-fetched domain/client/cache pointers instead of re-looking up by ID.
- `gravity_has_exact_allowlist`, `gravity_has_exact_denylist`, `gravity_has_antigravity` flags skip entire functions when the corresponding lists are empty.
- Type check before `regexec()`, skipping the expensive regex engine when the type doesn't match.
- Replace `tolower()` with a branchless ASCII bit-trick in `strcpy_tolower()`.
- Replace `time(NULL)` calls with the query's existing timestamp where applicable.
- `find_mac()` outside SHM lock — netlink kernel I/O no longer blocks all other threads.
- `_FTL_iface()` — skip repeated interface list iteration when the pointer hasn't changed.

Performance: Garbage Collection
- Instead of `memmove()`-ing all surviving queries on every GC run, advance `queries_offset` and compact lazily only when the physical array runs out of room.
- `domain->count`, `client->count`, `cache->refcount`, `domain->cname_refcount` are maintained per-query instead of recomputed during GC, eliminating the O(queries) scan in `recycle()`.

Performance: Database Writes
- `queries_to_database()` — Phase 1 (under SHM lock): snapshot query data. Phase 2 (no lock): bulk SQLite writes. Phase 3 (brief lock): write-back of DB indices. Minimizes lock contention.
- `INSERT OR IGNORE` with `sqlite3_changes()` check instead of SELECT-then-INSERT.
- `PRAGMA synchronous=NORMAL` on the WAL-mode disk database — skip per-commit fsync, relying on WAL checksums for crash safety.

Performance: Data Structures
- Shrink `DNSCacheData` from 48 to 36 bytes on all architectures.
- Reorder `queriesData` and `clientsData` for cache-line alignment — hot fields in the first 64-byte line, cold `overTime[]` at the end.
- O(1) string deduplication — replace the `memmem()` linear scan in `addstr()` with a process-local open-addressing hash table (FNV-1a, via `hashStr()`).
- `/api/stats/top_domains` and `/api/stats/top_clients` — O(N log K) instead of full sort.

Database Schema & Query Improvements
- `queries` VIEW migration (v22) — replace 4 correlated subqueries with LEFT JOINs. Dramatically faster for users querying `pihole-FTL.db` via the sqlite3 CLI or third-party tools.
- `ORDER BY ... DESC LIMIT` on database top items — `/api/stats/database/top_domains` and `top_clients` now return correctly sorted results (was returning arbitrary GROUP BY order) and push the LIMIT into SQL instead of scanning all rows.
- `NOL NULL` → `NOT NULL` typo on `client.ip` in the gravity schema — SQLite silently ignored the misspelled constraint, allowing NULL IPs.
- `tr_adlist_update`, `tr_client_update`, `tr_domainlist_update` now use `WHERE id = NEW.id` instead of a UNIQUE text column lookup.
- `index_creation` array — C string literal concatenation merged two index definitions into one entry.

Signal Handler Hardening
- `SA_ONSTACK` — crash handlers run on a dedicated 16 KiB alternate signal stack, enabling backtrace generation even on stack overflow faults.
- `SA_RESETHAND` — crash signal disposition resets to `SIG_DFL` after first invocation, preventing infinite re-entry if the handler itself faults.
- Store `si_pid`/`si_uid` in volatile globals, defer the expensive sender lookup (`/proc/<pid>/cmdline`, `getpwuid()`, logging) to `log_sigterm_info()` called from `cleanup()` outside signal context. Eliminates a potential deadlock if SIGTERM arrives while the malloc lock is held.
- Drop the async-signal-unsafe `log_info(strsignal())` call; events are logged when processed in the main loop.
- Guard `delete_shm()` against uninitialized SHM — early return when `name == NULL`, preventing garbage munmap/close/shm_unlink during `pihole-FTL crashtest`.
- Fix leaked `FILE*` in the original SIGTERM handler (missing `fclose` on the success path).

Crash Diagnostics
- Frames printed as `#N 0xADDR in func () at file:line`, matching GDB's `bt` convention. Library frames show `from libname`. Unresolved frames show `in ?? ()`.
- `_Unwind_Backtrace` via GCC's libgcc — works on glibc AND musl, static AND dynamic, all architectures. Falls back to `dladdr()` for library symbols, then `/proc/self/maps` for mapping names.
- Subtract the `__ehdr_start` load base from instruction pointers for correct source line resolution.
- `--- end of backtrace (N frames) ---` for clear visual termination.

Correctness Fixes (78 commits)
Critical:
- `gravity_updated()` closed the global `gravity_db` instead of the local handle — broke all DNS queries on gravity check error
- `new_sqlite3_stmt_vec()` allocated n² instead of n items — massive memory waste
- `process_alive()` used `strcmp` on a substring — never matched
- `parse_neighbor_cache()` held the lock across `continue` on NULL client
- `add_to_fifo_buffer()` OOB write, `process_arp_reply()` OOB device access
- `db_update_disk_counter` — on-disk counters never updated
- `% 0xFF` vs `& 0xFF` for the EDNS VERSION byte — corrupted version 0 detection
- `find_mac()` unlock/relock — use-after-free risk
- `realloc` in teleporter upload — NULL deref + memory leak on failure

Resource leaks (30+ fixes):
- Missing `dbclose()` on error paths across all API database endpoints
- Missing `sqlite3_finalize()` on prepare/bind failures
- `ROLLBACK` before early returns inside open transactions
- Missing `free()` for cJSON objects, punycode buffers, query strings
- Missing `fclose()` for file handles

Y2038 fixes:
- `sqlite3_column_int64()` for all `time_t` database fields
- `sqlite3_bind_int64()` for time fields in network table inserts

Test Infrastructure
- `test-fast` build option — skips performance tests for faster CI iteration

Related issue or feature (if applicable): N/A
Pull request in docs with documentation (if applicable): N/A
By submitting this pull request, I confirm the following:
- Insignificant commits have been squashed (`git rebase`)

Checklist:
- This PR is filed against the `development` branch.