Performance optimizations and bug fixes #2816
Open
DL6ER wants to merge 172 commits into development from
Conversation
The _addstr() function used memmem() to scan the entire shared-memory string pool for duplicate strings before appending. This is O(pool_size) per call — on a typical installation with tens of thousands of unique domains, this means scanning hundreds of kilobytes to megabytes of data on every call, all while holding the SHM lock. Replace with a process-local open-addressing hash table that maps string hashes to pool offsets. This gives O(1) average-case lookups while preserving the existing string deduplication behavior. The hash table uses the same hashing algorithm (Jenkins One-at-a-Time) already used elsewhere in FTL. It starts at 4096 slots, doubles at 75% load, and incrementally syncs any strings added by forked TCP children via str_hash_sync(). This primarily benefits: - Database import at startup (every imported domain hits addstr()) - Post-GC domain re-registration (recycled domains call addstr() again) - Any workload with many unique domains (ad-heavy or many-user networks) Signed-off-by: Dominik <dl6er@dl6er.de>
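The deduplication index described above can be sketched as a minimal open-addressing table with linear probing. Only the Jenkins One-at-a-Time hash is taken from the commit; all other names (joaat, str_find, str_insert, struct str_index) and the pool layout are illustrative, not FTL's actual internals, and resizing at 75% load is omitted for brevity.

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Jenkins One-at-a-Time hash, the algorithm named in the commit */
static uint32_t joaat(const char *key)
{
	uint32_t hash = 0;
	for(; *key != '\0'; key++)
	{
		hash += (unsigned char)*key;
		hash += hash << 10;
		hash ^= hash >> 6;
	}
	hash += hash << 3;
	hash ^= hash >> 11;
	hash += hash << 15;
	return hash;
}

/* One slot: string hash plus offset into the string pool.
 * offset == 0 marks an empty slot (offset 0 holds the empty string). */
struct str_slot { uint32_t hash; size_t offset; };
struct str_index { struct str_slot *slots; size_t size; /* power of two */ size_t used; };

/* O(1) average lookup with linear probing; equal hashes fall back to a
 * strcmp against the pool to rule out collisions. Returns the pool
 * offset of a duplicate, or (size_t)-1 on miss. */
static size_t str_find(const struct str_index *idx, const char *pool, const char *str)
{
	const uint32_t h = joaat(str);
	for(size_t i = 0; i < idx->size; i++)
	{
		const struct str_slot *s = &idx->slots[(h + i) & (idx->size - 1)];
		if(s->offset == 0)
			return (size_t)-1;   /* hit an empty slot: not present */
		if(s->hash == h && strcmp(pool + s->offset, str) == 0)
			return s->offset;    /* duplicate found */
	}
	return (size_t)-1;
}

/* Record a newly appended string; a real implementation would grow the
 * table when it passes 75% load */
static void str_insert(struct str_index *idx, const char *str, size_t offset)
{
	const uint32_t h = joaat(str);
	for(size_t i = 0; i < idx->size; i++)
	{
		struct str_slot *s = &idx->slots[(h + i) & (idx->size - 1)];
		if(s->offset == 0)
		{
			s->hash = h;
			s->offset = offset;
			idx->used++;
			return;
		}
	}
}
```

The key property is that a miss terminates at the first empty slot, so the common "string already in pool" case never touches more than a handful of slots regardless of pool size.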
Replace the O(MAXITER=1000) reverse linear scan in findQueryID() with a 4096-entry direct-mapped cache keyed on the (monotonically incrementing!) dnsmasq query ID. Each DNS reply/forward/CNAME callback should now hit the cache for an O(1) lookup instead of scanning up to (in the worst case!) 1000 queriesData structs in shared memory. Since dnsmasq IDs increment monotonically, new entries naturally evict the oldest at each slot, keeping the cache fresh without explicit eviction logic. The cache is cleared after GC memmove (which shifts all query indices). On a miss the original linear scan is used as fallback. Signed-off-by: Dominik <dl6er@dl6er.de>
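A direct-mapped cache of this shape might look like the following sketch. The 4096-entry size and the monotonic-ID eviction property come from the commit; the entry layout and all names (qid_entry, qid_cache_find, etc.) are illustrative, not FTL's actual types.

```c
#include <stdint.h>

#define QID_CACHE_SIZE 4096  /* power of two, as in the commit */

/* One slot per low-12-bits of the dnsmasq query ID. Because IDs
 * increment monotonically, a new entry landing in an occupied slot is
 * always newer than what it evicts — no explicit eviction logic. */
struct qid_entry { uint32_t id; int query_idx; /* -1 = empty */ };
static struct qid_entry qid_cache[QID_CACHE_SIZE];

/* Cleared after GC compaction shifts all query indices */
static void qid_cache_clear(void)
{
	for(int i = 0; i < QID_CACHE_SIZE; i++)
		qid_cache[i].query_idx = -1;
}

static void qid_cache_insert(uint32_t id, int query_idx)
{
	struct qid_entry *e = &qid_cache[id & (QID_CACHE_SIZE - 1)];
	e->id = id;
	e->query_idx = query_idx;
}

/* O(1) lookup; on -1 the caller falls back to the linear scan */
static int qid_cache_find(uint32_t id)
{
	const struct qid_entry *e = &qid_cache[id & (QID_CACHE_SIZE - 1)];
	return (e->query_idx >= 0 && e->id == id) ? e->query_idx : -1;
}
```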
Replace the per-GC-cycle memmove of the entire queries array with a queries_offset that advances when old queries are removed. _getQuery() adds this offset to translate logical indices to physical positions, making the change transparent to all callers. Previously, every GC run copied (remaining_queries * 96) bytes under the SHM lock — potentially tens of megabytes on a busy resolver. Now GC simply increments the offset immediately (O(1)). Compaction (the actual memmove) is deferred to shm_ensure_size() and only triggered when the physical array runs out of room, amortizing the cost over many GC cycles. Signed-off-by: Dominik <dl6er@dl6er.de>
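The offset-translation idea with deferred compaction can be sketched as follows, using a deliberately tiny physical array. All names (getQuery, gc_drop_oldest, compact_if_needed) and the struct layout are illustrative stand-ins for FTL's shared-memory structures.

```c
#include <string.h>

#define PHYS_CAP 16  /* tiny physical array for illustration */

struct query { int id; };
static struct query queries[PHYS_CAP];
static unsigned int queries_offset = 0;  /* physical position of logical index 0 */
static unsigned int next_physical = 0;   /* next physical append slot */

/* Logical -> physical translation, transparent to callers */
static struct query *getQuery(unsigned int logical)
{
	return &queries[queries_offset + logical];
}

/* O(1) GC: drop the n oldest queries by advancing the offset */
static void gc_drop_oldest(unsigned int n)
{
	queries_offset += n;
}

/* Deferred compaction: the memmove only happens when the physical
 * array runs out of room, amortizing the cost over many GC cycles */
static void compact_if_needed(void)
{
	if(next_physical < PHYS_CAP)
		return;
	memmove(queries, queries + queries_offset,
	        (next_physical - queries_offset) * sizeof(queries[0]));
	next_physical -= queries_offset;
	queries_offset = 0;
}

static void append_query(int id)
{
	compact_if_needed();
	queries[next_physical++].id = id;
}
```

Note how callers keep using logical indices throughout: the compaction changes the physical layout and resets the offset in one step, so getQuery() results are unchanged before and after.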
…ounts The recycle() function previously allocated three bool arrays (clients, domains, dns_cache) and scanned ALL active queries to mark which entries were still in use before recycling unreferenced ones. This O(queries) scan ran under the shared-memory lock every GC cycle. Replace with incrementally-maintained reference counts: - client->count / domain->count (already maintained) track direct query references - domain->cname_refcount (new) tracks CNAME chain references from both queries and cache entries - cache->refcount (new) tracks how many queries reference each DNS cache entry Refcounts are incremented at query creation (new query, DB import, CNAME chain resolution, query duplication) and decremented during GC removal. recycle() now checks these fields directly instead of scanning, reducing its complexity from O(queries) to O(clients + domains + cache). Initialize DNSCacheData.CNAME_domainID to (unsigned int)-1 as a sentinel to distinguish "never set" from domain ID 0, ensuring correct refcount decrements even after gravity reloads that reset blocking_status without clearing CNAME_domainID. Signed-off-by: Dominik <dl6er@dl6er.de>
…te loop queries_to_database() called asprintf() to format the "IP#port" upstream string for every query written to the database, causing a malloc+free cycle per query. On systems with thousands of queries per DB flush, this adds up to significant heap churn. Replace with a fixed 64-byte stack buffer and snprintf. The maximum formatted length is 52 bytes (45-char IPv6 + '#' + 5-digit port + NUL), so 64 bytes provides ample headroom with zero heap allocation. Signed-off-by: Dominik <dl6er@dl6er.de>
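The replacement pattern is a plain snprintf into a caller-provided stack buffer; the function name and signature below are illustrative, and the 52-byte worst case is the one quoted in the commit.

```c
#include <stdio.h>

/* Worst case is 45-char IPv6 + '#' + 5-digit port + NUL = 52 bytes,
 * so 64 bytes gives ample headroom with zero heap allocation. */
#define UPSTREAM_BUFLEN 64

static const char *format_upstream(char buf[UPSTREAM_BUFLEN],
                                   const char *ip, unsigned int port)
{
	snprintf(buf, UPSTREAM_BUFLEN, "%s#%u", ip, port);
	return buf;
}
```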
GC recycle() previously called lookup_remove() for each recycled client, domain, and cache entry. Each removal did a binary search followed by memmove to shift all subsequent entries left, making each removal O(N) in the table size. With K entries recycled, the total cost was O(K*N). Replace with lookup_compact(): after each recycling loop, a single O(N) pass over the sorted lookup table copies surviving entries forward, skipping entries whose backing data structure has been memset to zero (magic byte == 0x00). The sorted order is preserved since only removals occur, so binary search continues to work. For a DNS cache lookup table with 20K entries and 100 recycled entries, this reduces data movement from ~8MB (100 individual memmoves) to ~160KB (one sequential scan). Signed-off-by: Dominik <dl6er@dl6er.de>
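The single-pass compaction can be sketched like this. The entry layout is illustrative — FTL's lookup tables hold hashes and IDs rather than raw pointers — but the core move (copy survivors forward, skip entries whose backing struct was zeroed) is the one described above.

```c
#include <stddef.h>

/* Each lookup entry points at a backing struct whose first byte is a
 * magic marker; recycled structs are memset to zero, so magic == 0x00
 * identifies dead entries. */
struct entry { unsigned char magic; };

/* Single O(N) pass: copy surviving entries forward, preserving sorted
 * order (only removals occur), and return the new table length. */
static size_t lookup_compact(struct entry **table, size_t len)
{
	size_t out = 0;
	for(size_t in = 0; in < len; in++)
		if(table[in]->magic != 0x00)
			table[out++] = table[in];
	return out;
}
```

Because relative order is preserved, binary search over the compacted table keeps working without re-sorting.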
GC recycle() previously iterated data arrays to decide what to recycle, then made a second pass over each lookup table to compact out dead entries. Merge both operations into a single pass: iterate over the sorted lookup table directly, check the recycling criteria for each entry's backing data, and either recycle it (memset + add to recycle table) or shift it forward in the compacted output. This halves the number of iterations and eliminates the lookup_compact() function, its three callback wrappers, and the associated function pointer call overhead per lookup entry. Signed-off-by: Dominik <dl6er@dl6er.de>
Move the query_type bitmask check ahead of the expensive regexec()/tre_regexec() call. When a regex has a type filter (e.g., only match AAAA queries), queries of non-matching types now skip the engine entirely via a single integer AND operation, avoiding unnecessary regex evaluations. Signed-off-by: Dominik <dl6er@dl6er.de>
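The reordering amounts to a one-instruction pre-filter before the regex engine; the enum values and names below are illustrative, not FTL's actual definitions.

```c
#include <stdbool.h>

/* Query types as a bitmask (illustrative values) */
enum { TYPE_A = 1 << 0, TYPE_AAAA = 1 << 1, TYPE_HTTPS = 1 << 2 };

struct regex_filter {
	unsigned int query_type_mask;  /* 0 = no type filter */
};

/* Cheap pre-filter: a single integer AND decides whether the expensive
 * regexec()/tre_regexec() call can be skipped entirely */
static bool regex_may_match(const struct regex_filter *f, unsigned int qtype)
{
	if(f->query_type_mask != 0 && (f->query_type_mask & qtype) == 0)
		return false;      /* type can never match: skip the engine */
	return true;           /* caller proceeds to regexec() */
}
```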
gen_abp_patterns() uses cJSON, allocating 9+ heap objects per blocked-domain check (1 array + N items + N string copies). Most gravity lookups hit the exact-match check, making those allocations wasted work. Change in_gravity() to accept a cJSON** and lazily generate ABP patterns only when the exact-match returns NOT_FOUND and gravity_abp_format is enabled. The patterns are shared across the antigravity and gravity calls via the caller's pointer. Signed-off-by: Dominik <dl6er@dl6er.de>
The server_cookie calloc/memcpy/free ran on every DNS query carrying EDNS cookies (most modern queries), but the data was only used for debug logging. Move cookie parsing entirely inside the debug check and replace both heap allocations (server_cookie, pretty_server_cookie) with fixed-size stack buffers. Zero heap allocations in the non-debug path. Signed-off-by: Dominik <dl6er@dl6er.de>
_FTL_new_query: the gettimeofday(&request, 0) result was never used — remove the wasted syscall on every new query. FTL_reply: an identical DNSSEC proxy check block was duplicated (copy-paste bug) — remove the second copy. Signed-off-by: Dominik <dl6er@dl6er.de>
Replace calloc(N) + qsort(N) with a bounded min-heap of size 4*K (where K is the requested count, typically 10). During the O(N) scan, only the top entries are retained in the heap. The final sort runs on at most 4*K entries instead of all N. For 100k domains requesting top 10: memory drops from ~5MB to ~2KB, sort from O(N log N) to O(K log K) where K << N, and zero-count entries are skipped early during collection. Also fix lock leak: calloc failure now unlocks shm before returning. Signed-off-by: Dominik <dl6er@dl6er.de>
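A bounded min-heap for top-K selection works as sketched below: the root always holds the smallest retained count, so a candidate only enters when it beats the current minimum. The struct and function names are illustrative; FTL's actual entry type and final-sort step are omitted.

```c
#include <stddef.h>

struct topk_entry { unsigned int count; unsigned int id; };
struct topk { struct topk_entry *heap; size_t cap, len; };

static void sift_down(struct topk *t, size_t i)
{
	for(;;)
	{
		size_t min = i, l = 2*i + 1, r = 2*i + 2;
		if(l < t->len && t->heap[l].count < t->heap[min].count) min = l;
		if(r < t->len && t->heap[r].count < t->heap[min].count) min = r;
		if(min == i) return;
		struct topk_entry tmp = t->heap[i];
		t->heap[i] = t->heap[min];
		t->heap[min] = tmp;
		i = min;
	}
}

/* Offer one candidate during the O(N) scan; only the top t->cap
 * entries by count survive */
static void topk_offer(struct topk *t, unsigned int count, unsigned int id)
{
	if(count == 0)
		return;  /* skip zero-count entries early */
	if(t->len < t->cap)
	{
		/* heap not yet full: append and sift up */
		size_t i = t->len++;
		t->heap[i] = (struct topk_entry){ count, id };
		while(i > 0 && t->heap[(i-1)/2].count > t->heap[i].count)
		{
			struct topk_entry tmp = t->heap[i];
			t->heap[i] = t->heap[(i-1)/2];
			t->heap[(i-1)/2] = tmp;
			i = (i-1)/2;
		}
	}
	else if(count > t->heap[0].count)
	{
		/* beats the retained minimum: replace the root and restore */
		t->heap[0] = (struct topk_entry){ count, id };
		sift_down(t, 0);
	}
}
```

With cap = 4*K, the final qsort runs over at most 4*K retained entries instead of all N, which is where the O(N log N) → O(K log K) claim above comes from.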
- Remove strtolower() on upstream IPs in FTL_forwarded and FTL_retried: inet_ntop already produces lowercase output. - Fix double strlen() call in special_domain: compute domain length once and reuse for the suffix comparison. - Remove unnecessary sqlite3_clear_bindings() in domain_in_list: the next sqlite3_bind_text() overwrites parameter 1 directly, making the clear redundant. Signed-off-by: Dominik <dl6er@dl6er.de>
Signed-off-by: Dominik <dl6er@dl6er.de>
In queries_to_database(), the same domain/client/forward typically appears in many queries within a single 1-second batch. Each was triggering a separate INSERT OR IGNORE sqlite3_step into the linking tables (domain_by_id, client_by_id, forward_by_id) even though only the first INSERT per unique entry has any effect. Add per-batch dedup caches using pointer comparison (valid because getstr() returns stable shared-memory pointers where identical strings get identical addresses) to skip redundant sqlite3_step calls. Cache sizes are tuned per table type: - Domains: 256 (media-heavy pages produce 50-100+ unique domains) - Clients: 32 (typical home networks have 5-30 devices) - Upstreams: 8 (rarely more than 2-4 configured) Also remove the last 4 sqlite3_clear_bindings() calls (forward_stmt, addinfo_stmt x2, query_stmt) - unnecessary per SQLite docs since all bindings are overwritten before each step. Signed-off-by: Dominik <dl6er@dl6er.de>
Simplify linking table dedup: since the in-memory database persists for the process lifetime, a domain/client/upstream only needs to be inserted once ever, not once per batch. Replace the unsigned int db_batch generation counter with a single-bit in_database flag in each struct's flags bitfield. Reset in_database when client hostname changes in resolve.c so the new (ip, name) pair gets inserted into client_by_id. Bump SHARED_MEMORY_VERSION 14 -> 15 for the struct layout change. Signed-off-by: Dominik <dl6er@dl6er.de>
The REPLACE INTO query_storage statement previously used 4 correlated subqueries to resolve domain/client/forward/addinfo text strings to their linking table IDs on every single query upsert. This required 4 B-tree index lookups with string comparison per query. Cache the SQLite row IDs (db_id) directly on the domainsData, clientsData, and upstreamsData shared-memory structs. After the first INSERT OR IGNORE (already guarded by in_database flag), fetch the ID via sqlite3_last_insert_rowid (new rows) or a one-time SELECT (rows imported from disk), then bind it as a direct integer in the REPLACE. This eliminates 3 of 4 subqueries from the hottest SQLite statement in the system. Only the addinfo subquery remains (it only fires for blocked/CNAME queries, not every query) and cannot be cached easily (we have no matching SHM data structure for it and I don't want to add one just for this - the overall complexity is not worth it). Signed-off-by: Dominik <dl6er@dl6er.de>
The REPLACE INTO query_storage statement had one remaining correlated subquery: (SELECT id FROM addinfo_by_id WHERE type=? AND content=?). This B-tree index probe executed for every query stored to the in-memory database, even non-blocked queries where both parameters are NULL. Cache addinfo_by_id row IDs in a process-local open-addressing hash table (1024 slots, 8 KB). The key packs the addinfo type into the low bits and the actual value (CNAME_domainID or list_id) above, then applies Knuth multiplicative hashing to spread dense small integers across the power-of-two table. ADDINFO_TYPE_BITS is derived at compile time from the ADDINFO_TYPE_MAX sentinel in enum addinfo_type. After the first INSERT OR IGNORE, fetch the row ID via sqlite3_last_insert_rowid (new rows) or a one-time SELECT (rows pre-existing from disk import), then cache it. Subsequent queries with the same addinfo skip all SQLite operations entirely. The REPLACE statement is now a plain VALUES (?1,...,?13) with zero correlated subqueries, down from the original 4. Signed-off-by: Dominik <dl6er@dl6er.de>
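The key-packing and slot computation can be sketched as below. The 1024-slot size and the Knuth multiplicative constant match the description; ADDINFO_TYPE_BITS is a fixed illustrative value here, whereas FTL derives it at compile time from the ADDINFO_TYPE_MAX sentinel.

```c
#include <stdint.h>

#define ADDINFO_TYPE_BITS 2u     /* illustrative; FTL derives this */
#define ADDINFO_SLOTS 1024u      /* power of two, 8 KB table */

/* Pack the addinfo type into the low bits and the value above, then
 * spread with Knuth's multiplicative constant so dense small integers
 * (CNAME_domainID, list_id) land in different slots. */
static uint32_t addinfo_slot(uint32_t type, uint32_t value)
{
	const uint64_t key = ((uint64_t)value << ADDINFO_TYPE_BITS) | type;
	/* 2654435761 ~= 2^32 / golden ratio (Knuth multiplicative hashing) */
	const uint32_t h = (uint32_t)(key * 2654435761u);
	return h & (ADDINFO_SLOTS - 1);
}
```

Without the multiplication, consecutive domain IDs would fill consecutive slots and collide badly once masked; the multiplicative step scatters them across the whole power-of-two table.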
in_gravity() is called twice per DNS cache miss in check_domain_blocked: once for gravity (block) and once for antigravity (allow-adlists). For the majority of Pi-hole users who don't use allow-adlists, the antigravity call always returns NOT_FOUND after a wasted SQLite bind/step/reset. Check once at gravity DB open time whether the antigravity table has any rows. When empty, short-circuit the in_gravity(antigravity=true) call immediately, saving one full prepared-statement round trip per uncached domain lookup. This is rechecked after gravity runs are performed when the new gravity database is re-opened. Signed-off-by: Dominik <dl6er@dl6er.de>
…k time Restructure queries_to_database() from a single massive SHM-locked loop into three phases: Phase 1 (SHM locked): Snapshot scalar query fields into a local array and handle linking table INSERTs (domain/client/forward/addinfo). The linking table work stays here because it reads SHM strings and writes back db_id/in_database flags. The changed flag is cleared immediately after snapshotting so that DNS thread updates during phase 2 are not lost. Phase 2 (SHM unlocked): Bind and step the main REPLACE statement for each snapshot. This is the bulk of the SQLite work and no longer blocks DNS processing or API threads. Phase 3 (SHM locked, brief): Write back db indices for newly inserted queries and update counters. This reduces SHM lock hold time from O(queries × SQLite REPLACE) to O(queries × field copy) + O(queries × index writeback), keeping the lock-free window over the heaviest database operation. Signed-off-by: Dominik <dl6er@dl6er.de>
find_mac() can trigger a netlink kernel call (iface_enumerate) to refresh the full ARP table when a client's MAC address isn't in dnsmasq's cache. Previously this ran while holding the global SHM mutex, blocking all other threads (API, database, GC, DNS workers). Release the SHM lock before calling find_mac(), perform the lookup into stack-local buffers, then reacquire the lock and write back the result. Both client and query pointers are re-fetched after reacquiring since SHM may have been remapped while unlocked. Only affects new-client paths (hwlen < 1); existing clients already have their MAC cached and skip this code entirely. Signed-off-by: Dominik <dl6er@dl6er.de>
_FTL_iface() is called on every incoming UDP query and iterates daemon->interfaces (potentially 2-3 times) to collect the IPv4/IPv6 addresses of the receiving interface. On systems with many interfaces (Docker hosts, VLANs, VPNs), this is O(n_interfaces) per query. Cache the resolved recviface pointer and skip the expensive address- collection loop when the same interface is seen again. The pointer is stable within dnsmasq's interface list and naturally invalidates on SIGHUP when the list is rebuilt. Also moves the next_iface invalidation (memset) to only run on cache miss or when no interface is found, avoiding unnecessary zeroing on every query. Signed-off-by: Dominik <dl6er@dl6er.de>
…s are empty Refactor the one-off gravity_check_antigravity() into a generic gravity_table_has_entries() helper and a consolidated gravity_check_list_presence() that probes antigravity, vw_allowlist and vw_denylist at gravity DB open time. When the exact allowlist or exact denylist contains no entries (common for most Pi-hole users who rely on regex or adlist-based blocking), in_allowlist() and in_denylist() now return NOT_FOUND immediately, saving one SQLite bind/step/reset per cache-miss query for each empty table. Signed-off-by: Dominik <dl6er@dl6er.de>
…tterns The per-cache-miss gravity check generates ABP wildcard patterns by iterating over domain label suffixes. Previously each pattern was heap-allocated as a cJSON string node inside a cJSON array, adding N+1 malloc/free round-trips per uncached query when ABP format is enabled (one array plus one string per domain component). Replace the cJSON array with a lightweight struct abp_patterns that stores only the suffix start offsets into the original domain string. The actual "||suffix^" / "@@||suffix^" pattern strings are built on the fly in a small stack buffer right before the SQLite bind call. This eliminates all heap allocation and deallocation in the ABP pattern path, removes the cJSON iteration overhead, and avoids the multiple memmove operations that the old gen_abp_patterns() used to build pattern strings incrementally. The original gen_abp_patterns() returning cJSON is retained for the cold API search path which needs the patterns as JSON for debug output. Signed-off-by: Dominik <dl6er@dl6er.de>
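The offsets-only representation might look like this sketch. The struct name matches the one mentioned above, but its fields, the MAX_LABELS bound, and the helper functions are illustrative assumptions, not FTL's actual layout.

```c
#include <stdio.h>

#define MAX_LABELS 16

/* Store only suffix start offsets into the original domain string —
 * no per-pattern heap allocation */
struct abp_patterns {
	const char *domain;
	unsigned int offsets[MAX_LABELS];
	unsigned int n;
};

static void abp_collect(struct abp_patterns *p, const char *domain)
{
	p->domain = domain;
	p->n = 0;
	p->offsets[p->n++] = 0;  /* the full domain is always a pattern */
	for(unsigned int i = 0; domain[i] != '\0' && p->n < MAX_LABELS; i++)
		if(domain[i] == '.')
			p->offsets[p->n++] = i + 1;  /* suffix starts after the dot */
}

/* Build the i-th "||suffix^" pattern in a small stack buffer right
 * before it is needed (e.g. for an SQLite bind) */
static const char *abp_pattern(const struct abp_patterns *p, unsigned int i,
                               char *buf, size_t buflen)
{
	snprintf(buf, buflen, "||%s^", p->domain + p->offsets[i]);
	return buf;
}
```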
…ttime, reuse query timestamp - Add strcpy_tolower() that copies and lowercases in a single pass, replacing the strncpy()+strtolower() pattern. This avoids both the double traversal and strncpy's zero-padding of the remaining buffer (typically ~236 wasted bytes for a 20-char domain in a 256-byte buffer). Applied in _FTL_new_query() and FTL_CNAME(). - Guard clock_gettime(CLOCK_MONOTONIC) calls in lock_shm()/unlock_shm() behind config.debug.timing.v.b. Previously the syscall was executed on every SHM lock/unlock even when debug timing was disabled. - Replace time(NULL) syscall in FTL_check_blocking() cache expiry check with (time_t)query->timestamp which is already available, eliminating a redundant syscall per blocked-query lookup. Signed-off-by: Dominik <dl6er@dl6er.de>
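The single-pass copy-and-lowercase described in the first bullet can be sketched as follows; the exact signature FTL uses (e.g. a bounded variant) may differ.

```c
#include <ctype.h>
#include <stddef.h>

/* Copy and lowercase in one pass, avoiding both the double traversal
 * of strncpy()+strtolower() and strncpy's zero-padding of the
 * remaining buffer. Returns the length as a free by-product. */
static size_t strcpy_tolower(char *dst, const char *src)
{
	size_t i = 0;
	for(; src[i] != '\0'; i++)
		dst[i] = (char)tolower((unsigned char)src[i]);
	dst[i] = '\0';
	return i;
}
```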
- Replace 50-line query type switch in _FTL_new_query() with a 256-byte direct-mapped lookup table. Maps DNS wire types to FTL enum values via a single array load instead of a multi-case switch/jump table, avoiding indirect branch overhead and BTB pollution. - Replace strncpy() with strcpy_tolower() in FTL_check_blocking() for the local domain copy. Domains are already lowercase in shared memory, so tolower() is a no-op; the key gain is avoiding strncpy's zero- padding of the remaining ~236 bytes (MAXDOMAINLEN=256, typical domain ~20 chars). - Guard querystr() evaluation and debug logging in FTL_hook() behind config.debug.flags.v.b check. Previously querystr() (which does a linear search + sprintf) was called unconditionally on every DNS event even when debug output was disabled. Signed-off-by: Dominik <dl6er@dl6er.de>
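The direct-mapped type table from the first bullet amounts to a 256-byte array indexed by the DNS wire type; the enum values and the subset of mapped types below are illustrative, not FTL's full mapping.

```c
#include <stdint.h>

/* Illustrative FTL-side enum */
enum qtype { TYPE_OTHER = 0, TYPE_A, TYPE_AAAA, TYPE_PTR, TYPE_TXT, TYPE_HTTPS };

/* DNS wire type -> enum in a single array load; designated
 * initializers leave every unmapped index at TYPE_OTHER (0) */
static const uint8_t qtype_map[256] = {
	[1]  = TYPE_A,
	[12] = TYPE_PTR,
	[16] = TYPE_TXT,
	[28] = TYPE_AAAA,
	[65] = TYPE_HTTPS,
};

static enum qtype wire_to_qtype(unsigned int wiretype)
{
	return wiretype < 256 ? (enum qtype)qtype_map[wiretype] : TYPE_OTHER;
}
```

Compared with a 50-case switch, the array load has no data-dependent branches, which is the BTB-pollution argument made above.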
The result of the ternary selecting between antigravity_stmt and gravity_stmt was discarded instead of assigned back to stmt, leaving it NULL when domain_in_list() was subsequently called. This is a latent crash bug triggered whenever a client's gravity prepared statement needed on-demand initialization. Matches the correct pattern already used in in_allowlist(). Signed-off-by: Dominik <dl6er@dl6er.de>
blockDomain is a bool (return value of check_domain_blocked()), not an enum db_result. The comparison worked by coincidence since NOT_FOUND == 0 == false, but was semantically incorrect. Replace with the idiomatic boolean negation. Signed-off-by: Dominik <dl6er@dl6er.de>
…lients After NUM_RECHECKS group rechecks the inner condition is permanently false, but time(NULL) was still invoked on every call. This function is called 3-4 times per cache-miss query (once each for allowlist, denylist, gravity, and antigravity). Adding an early-exit guard eliminates all time() calls for clients past their first ~3 minutes, which covers the vast majority of a client's lifetime. Signed-off-by: Dominik <dl6er@dl6er.de>
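The guard is a counter check placed before the syscall; the state struct and field names below are illustrative stand-ins for FTL's per-client data.

```c
#include <stdbool.h>
#include <time.h>

#define NUM_RECHECKS 3

struct recheck_state {
	unsigned int count;   /* rechecks performed so far */
	time_t next_due;      /* when the next recheck is due */
};

static bool client_check_again(struct recheck_state *s)
{
	/* Early exit: once all rechecks are done the condition below is
	 * permanently false, so the time(NULL) syscall is skipped forever */
	if(s->count >= NUM_RECHECKS)
		return false;
	const time_t now = time(NULL);
	if(now < s->next_due)
		return false;
	s->count++;
	s->next_due = now + 60;  /* 60/120/180-second boundaries */
	return true;
}
```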
…casecmp is_pihole_domain() is called for every non-TYPE_NONE query. In the common case the domain is not pi.hole/hostname, so all four strcasecmp calls ran to completion unnecessarily. Cache the lengths of the four comparison targets as static variables and pre-filter with a single strlen(domain) + integer comparison before each strcasecmp, turning typical misses into O(1) rejections. Signed-off-by: Dominik <dl6er@dl6er.de>
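The length pre-filter pattern looks like the sketch below. FTL caches the lengths of four comparison targets in statics; this illustration shows the same idea with two targets ("pi.hole" plus a hostname), and the function body is an assumption, not the real implementation.

```c
#include <string.h>
#include <strings.h>
#include <stdbool.h>

/* One strlen(domain), then integer length compares gate every
 * strcasecmp: typical misses are rejected without any character-wise
 * case-insensitive comparison. */
static bool is_pihole_domain(const char *domain, const char *hostname)
{
	static const char pihole[] = "pi.hole";
	const size_t dlen = strlen(domain);  /* computed once */
	if(dlen == sizeof(pihole) - 1 && strcasecmp(domain, pihole) == 0)
		return true;
	if(hostname != NULL && dlen == strlen(hostname) &&
	   strcasecmp(domain, hostname) == 0)
		return true;
	return false;
}
```

In the real code the target lengths are cached in static variables rather than recomputed, so each miss costs one strlen plus a few integer compares.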
Conflicts have been resolved.
Signed-off-by: Dominik <dl6er@dl6er.de>
… last development merge Signed-off-by: Dominik <dl6er@dl6er.de>
Depends on #2845
This pull request has conflicts, please resolve those before we can evaluate the pull request.
Signed-off-by: Dominik <dl6er@dl6er.de>
Conflicts have been resolved.
…the earlier merge Signed-off-by: Dominik <dl6er@dl6er.de>
Signed-off-by: Dominik <dl6er@dl6er.de>
Set SQLITE_DEFAULT_JOURNAL_SIZE_LIMIT=33554432 as a compile-time default so every connection inherits it. Without this cap, the -wal file on the on-disk query database can grow unbounded between checkpoints on busy resolvers, increasing commit latency and disk usage. A compile-time default is preferred over a runtime PRAGMA because it also covers attached databases and any future connection added without having to remember to re-issue the PRAGMA. Signed-off-by: Dominik <dl6er@dl6er.de>
get_memdb_size() previously issued two separate prepare/step/finalize cycles for "PRAGMA page_count" and "PRAGMA page_size". Replace them with a single SELECT that cross-joins the pragma_page_count() and pragma_page_size() table-valued functions. This is a code-simplification change, not a hot-path optimisation — get_memdb_size() is only called from the debug-gated memory-usage log. The motivation is to stop pretending this needs two statements when SQLite can return both values from one prepare. Also fix a misleading log prefix: the function is get_memdb_size(), not init_memory_database(). Signed-off-by: Dominik <dl6er@dl6er.de>
bind_client_groups() binds the group_ids array to the carray() TVF with SQLITE_STATIC, meaning SQLite will NOT copy the buffer and will dereference it directly during each sqlite3_step(). That is only correct because getintarray() returns a pointer into the SHM integer-array region, which is stable for the statement's execution scope. This change does three things: - Document the lifetime contract next to the binding so anyone refactoring getintarray() is forced to consider the SQLITE_STATIC implication (TRANSIENT copy, or explicit rebind before each step). - Add a cheap assert() to catch a future refactor that could leave group_ids NULL while group_count > 0 — the early return already handles NULL, but the assert documents the post-guard invariant at the bind site. - Pure documentation + debug-only guard; no runtime impact in release builds. Signed-off-by: Dominik <dl6er@dl6er.de>
Pass SQLITE_OPEN_NOMUTEX to sqlite3_open_v2() in _dbopen(). Every dbopen() caller assigns the result to a local variable, uses it within one function, and calls dbclose() before returning — the connection never crosses a thread boundary. In SQLite's default "serialized" threading mode, each of the dozens of sqlite3_* calls per API request acquires a per-connection mutex; with NOMUTEX those acquisitions are skipped. Left untouched on purpose: - _memdb (query-table.c): accessed by both the DNS thread and the background database thread, must stay serialized. - gravity_db: accessed by multiple threads during gravity reload, must stay serialized. The long-lived shared connections continue to rely on SQLite's own locking; this change only applies to the short-lived per-request connections that are inherently single-threaded. Signed-off-by: Dominik <dl6er@dl6er.de>
Apply the same treatment as the previous _dbopen() change to the remaining sqlite3_open_v2() call sites that open a connection which lives entirely within one thread's scope: - network-table.c: macvendor.db readonly lookup - gravity-db.c: gravity_updated() readonly check - tools/gravity-parseList.c: CLI tool (single-process) - api/teleporter.c: per-request gravity.db writer - zip/teleporter.c: two :memory: database opens used for serialization and validation Each of these creates a sqlite3* in a local variable, uses it, and closes it before the function returns — the connection is never shared across threads. The in-memory databases in zip/teleporter.c have no file-level locking to care about at all. The teleporter's extra connection to gravity.db coexists with the long-lived gravity_db handle; NOMUTEX on one connection does not affect the other, and SQLite's file-level locking handles cross- connection serialization as usual. No functional change; avoids per-call SQLite mutex acquisitions on these paths. Signed-off-by: Dominik <dl6er@dl6er.de>
SQLITE_DEFAULT_CACHE_SIZE was set to 16384, which is a positive value and therefore means pages rather than kiB: at the default 4 kiB page size that works out to ~64 MiB of private page cache for every sqlite3 connection. FTL keeps several long-lived connections open (_memdb in query-table.c, gravity_db in gravity-db.c, plus briefly-scoped dbopen() handles), so the compiled-in default dominates RSS. Drop it to 4096 pages (~16 MiB). The mmap-backed on-disk query database (PRAGMA mmap_size) lets the kernel page cache serve cold reads for free, so large per-connection caches mostly duplicate the kernel's work. 16 MiB comfortably covers the hot working set for both gravity lookups and the in-memory query ring, while saving ~48 MiB per live connection at steady state. Also clean up the adjacent comment which previously described a "16 MiB" cache that the actual define did not deliver; the comment now matches what the compile-time default actually configures. No behaviour change. If a specific workload ever needs more cache on a specific connection, PRAGMA cache_size on that connection is the right localized override. Signed-off-by: Dominik <dl6er@dl6er.de>
The debug.performance 5-minute rollup has so far only tracked
FTL_check_blocking() as one aggregate slot. Real-world deployments
show a small but recurring tail: ~1-2 events per 5 minute window
where check_blocking takes 1.2-2.3 ms instead of the usual tens of
microseconds. Because the entire function lives in one PERF_STAT
slot, we cannot tell whether the spike lives in the allowlist pass
or the denylist/gravity pass.
Add two nested PERF_STAT slots that only fire on the cache-miss
path (the fast cache-hit path is already dominated by a single
switch and is not the source of the tail):
PERF_STAT_CB_ALLOWLIST wraps in_allowlist() + in_regex(ALLOW)
PERF_STAT_CB_DENYLIST wraps check_domain_blocked() (primary
call and the _esni fallback are timed
into the same slot, so the rollup
reflects total denylist/gravity work
per query)
special_domain() is not instrumented separately; it is a handful
of strcmp()s and cannot be the source of a multi-millisecond spike.
No behaviour change; the new scopes compile out to nothing when
debug.performance is disabled (existing PERF_START / PERF_END
macros already gate on config.debug.performance.v.b). The rollup
in FTL_dump_cache_stats() picks up the new slots via the existing
loop, reporting them with descriptive names that make their
nesting under check_blocking obvious in the log.
Signed-off-by: Dominik <dl6er@dl6er.de>
Background. The previous commit added nested PERF_STAT slots inside
FTL_check_blocking (allowlist, denylist/gravity). First ~8 h of
production data under debug.performance showed an oddity: the
denylist/gravity child slot accumulated ~69 slow (>1 ms) events
while the parent PERF_STAT_CHECK_BLOCKING slot saw only 3.
Two things followed from that:
1. FTL_check_blocking has two callers. FTL_new_query wraps its call
in PERF_STAT_CHECK_BLOCKING; FTL_CNAME does not. The nested child
slots fire regardless, so CNAME deep-inspection time was invisible
at the parent level. One user-facing DNS query against a 5-hop
CNAME chain drives 5 separate calls through FTL_check_blocking,
each capable of a full denylist/gravity miss.
2. The existing CB_DENYLIST slot wraps check_domain_blocked() as a
whole, but that function runs three independent engines in
sequence — exact SQLite denylist, gravity/antigravity SQLite
probe, and TRE regex — each with a different potential failure
mode, and we currently cannot tell them apart in the rollup.
Changes. Still instrumentation-only; PERF macros compile out when
debug.performance is disabled.
(a) Split the parent CHECK_BLOCKING into primary vs CNAME callers.
Add PERF_STAT_CB_CNAME and wrap the FTL_CNAME-side call. Rename
the existing slot's human-readable label to "check_blocking
(primary path)" so its scope is unambiguous in the log.
(b) Split CB_DENYLIST's body into three nested sub-slots:
CDB_EXACT — in_denylist() exact-match SQLite
CDB_GRAVITY — in_gravity() for both antigravity and gravity;
timed into one slot so the rollup reflects total
gravity-DB work per query regardless of which
sub-call hit
CDB_REGEX — in_regex(REGEX_DENY) over the TRE regex engine
The regex call is now stored in a local bool before the `if`
instead of being evaluated inline, so the PERF_END runs after the
engine returns. Same evaluation order, same short-circuit semantics,
no behaviour change.
After this lands, the 5-minute rollup reports:
check_blocking (primary path) # FTL_new_query side
check_blocking (CNAME path, per-hop) # FTL_CNAME side
check_blocking -> allowlist # nested, both callers
check_blocking -> denylist/gravity # nested, both callers
check_domain_blocked -> exact denylist
check_domain_blocked -> gravity+antigravity
check_domain_blocked -> regex denylist
The CNAME-to-primary call ratio gives the amplification factor, and
the three new sub-engine slots localize which one owns the tail. If
CDB_GRAVITY concentrates the slow events, the fix domain is SQLite
and mmap (page-fault on cold regions, WAL contention); if CDB_REGEX
does, it is the TRE engine; if CDB_EXACT does, it is the
prepared-statement hot path. Each has a different next step, and
right now we
cannot tell them apart.
Signed-off-by: Dominik <dl6er@dl6er.de>
Production data from the previous PERF_STAT round narrowed the
multi-millisecond tail on cache-miss queries to CDB_GRAVITY
(check_domain_blocked → gravity+antigravity): 45 slow (>1 ms)
events across ~7600 calls in ~4.75 h, with a peak of 2147 µs.
That same data, cross-referenced with the pre-existing
gravity-db.c "gravity (exact)" slot which wraps domain_in_list()
(the bind + sqlite3_step + reset cycle), shows the step itself
never exceeded ~210 µs in the same window. The step is therefore
not the culprit.
The time must live in in_gravity()'s wrapper code around the
step:
- gravityDB_client_check_again() — group membership recheck,
which can reload groups from the gravity DB on demand and
may page-fault on cold mmap regions;
- gravityDB_prepare_client_statements() — per-client statement
prepare, runs only on first call per client;
- bind_client_groups() — carray bind of the client's group-id
array against a shared statement.
Add a span-style sibling to the existing expression-wrapping
GRAVITY_TIMED_LOOKUP macro (GRAVITY_PERF_START / GRAVITY_PERF_END),
extend gravity_perf[] from 6 to 7 slots, and use the new pair to
time the setup block inside in_gravity() from entry through
bind_client_groups(). The 5-minute rollup now includes:
Gravity lookup stats [in_gravity wrapper (non-step)]:
N calls, avg X us, max Y us, Z slow (>1ms, W%)
If Z matches the slow count in CDB_GRAVITY, the wrapper IS the
tail source — next fix domain is likely mmap warm-up (the existing
gravity pre-warm from 38f7322 may need to re-touch pages on group
reloads) or carray bind. If Z stays at zero while CDB_GRAVITY
keeps going slow, the tail is in the ABP post-step path or the
two in_denylist() / in_regex() neighbors that CDB_GRAVITY was
never actually covering (in which case the earlier attribution
needs re-examining).
Either outcome narrows the search meaningfully. No behaviour
change; both macros compile out when debug.performance is off.
Early-return paths inside in_gravity() (LIST_NOT_AVAILABLE,
bind_client_groups failure) skip the GRAVITY_PERF_END call so
they don't pollute the histogram — they are rare error paths,
not the hot path we want to characterize.
Signed-off-by: Dominik <dl6er@dl6er.de>
Production perf data (debug.performance enabled, ~8 h of rollups) localized the 1-2 ms tail in cache-miss DNS queries entirely to the wrapper code inside in_gravity() — specifically the inline call to gravityDB_reload_groups(client) made from gravityDB_client_check_again() at the 60/120/180-second boundaries after a client was first seen. 27/27 slow events in the new "in_gravity wrapper (non-step)" slot matched the 27 slow events in CDB_GRAVITY, and the bimodal distribution across windows (21 of 32 windows had sub-µs wrapper time; the other 11 each had 1-7 slow events) mapped cleanly onto "any client in its 3-minute recheck window during this 5-minute window".

gravityDB_reload_groups() does three heavy things — finalize the client's prepared statements, re-prepare them against the current group set (which runs get_client_groupids() SQL against gravity.db), and rebuild the per-client regex lookup. All synchronously on the DNS hot path, under the SHM lock, on whichever query happened to land inside the recheck window. One user-facing query was therefore effectively charged ~1-2 ms whenever it lost this lottery, and CNAME recursion amplified the probability by 2.2x because every CNAME hop is another chance to cross the boundary.

The recheck mechanism itself is correct — clients *should* have their identity re-evaluated as hostname/MAC/interface information arrives — but the reload does not need to happen synchronously with the triggering query. Between rechecks the per-client state is already "stale" by up to 60 seconds by design; deferring by ≤1 second is immaterial.

Change: gravityDB_client_check_again() now just advances the recheck counter and sets a new per-client flag (clientsData.flags.reload_pending, a single bit inside the existing client_flags bitfield — zero size impact). The DB thread, in its once-per-second cadence alongside queries_to_database() and the gravity_updated() check, calls a new gravityDB_process_pending_reloads() that scans clients, takes the SHM lock, and runs gravityDB_reload_groups() for any client whose flag is set, then clears it.

The SHM lock is still held during the reload, but the hold is now on the DB thread. A DNS query only ever waits on that lock if it happens to be in flight exactly when the DB thread is processing — a small fraction of the time at typical query rates, and the blocked query absorbs one reload's worth of latency rather than every query near a boundary triggering its own.

The second caller of gravityDB_reload_groups() at dnsmasq_interface.c:1088 (client changed network interface) is left inline intentionally. Interface changes are rare events and we have no evidence they contribute to the tail; routing that path through the same flag can be a follow-up if it ever surfaces.

No behaviour change from the user's perspective: group re-checks still happen at the same 60/120/180-second marks, clients are still re-identified over their first three minutes, and the eventual group set a client is assigned is unchanged.

Signed-off-by: Dominik <dl6er@dl6er.de>
Production perf data (debug.performance enabled, ~8 h of rollups)
localized the recurring 1-2 ms tail on cache-miss DNS queries entirely
to one inline call: gravityDB_client_check_again() → gravityDB_reload_groups()
from inside in_gravity() / in_allowlist() / in_denylist(). The 27 slow
events in the "in_gravity wrapper (non-step)" PERF_STAT slot matched the
27 slow events in CDB_GRAVITY one-for-one, and the bimodal window
distribution (21 quiet, 11 spiky) mapped cleanly onto "did any client
cross its 60/120/180-second recheck boundary during this 5-minute
window".
The recheck mechanism exists for a legitimate reason: a client's
identity (MAC, hostname) is discovered asynchronously via
parse_neighbor_cache() (ARP) and the resolver thread, so a client whose
group assignment depends on MAC or hostname cannot be matched correctly
on its very first query. The three timed rechecks at 60/120/180 s
re-run get_client_groupids() once enough async identity data has
arrived.
That's periodic maintenance, though, not per-query work. The old
placement made every DNS query eat the "is this client due for a
recheck" conditional, and whichever query happened to land inside a
boundary absorbed the full ~1-2 ms reload (SQL re-fetch of groups +
per-client statement re-prepare + per-client regex rebuild) under the
SHM lock. CNAME recursion amplified the exposure ~2.2×.
Rework:
- Remove the three hot-path calls to gravityDB_client_check_again()
from in_gravity(), in_allowlist(), in_denylist() entirely. No more
per-query check for anything recheck-related.
- Replace the per-client gravityDB_client_check_again() with a
gravityDB_recheck_clients() that iterates all clients under the SHM
lock, short-circuits mature ones (reread_groups >= NUM_RECHECKS),
and runs the reload directly for any client currently within its
3-minute window whose next boundary has passed.
- Call that function from the database thread's existing once-per-
second block in run_db_thread(), next to queries_to_database() and
the gravity_updated() check.
Behaviour is preserved: clients are still re-identified at the same
60/120/180 s marks after first-seen, with at most ~1 s of additional
latency between the boundary and the actual reload — well within the
design's pre-existing between-rechecks staleness window.
The other caller of gravityDB_reload_groups() at
dnsmasq_interface.c:1088 (client changed network interface) is left
inline intentionally. Interface changes are rare events and we have
no evidence they contribute to the tail.
Removes from scope the per-client reload_pending flag that an earlier
iteration of this work introduced: with the full check moved to the DB
thread, there is nothing for the DNS thread to signal, and recomputing
the "is this client due for a recheck" condition during the once-per-
second scan is cheaper than reading/writing a shared flag bit.
Net: -23 lines, no new state on clientsData, zero DNS-path overhead
from the recheck subsystem. The CDB_GRAVITY and in_gravity wrapper
slow-event counts should drop to essentially zero once this ships.
Signed-off-by: Dominik <dl6er@dl6er.de>
yubiuser
reviewed
Apr 20, 2026
Signed-off-by: Dominik <dl6er@dl6er.de>
I've been running this for several days now without issues. I think after the next patch release this is good to go into
What does this implement/fix?
Summary
Broad sweep across the FTL engine covering performance work, correctness fixes, database schema improvements, signal handler hardening, and crash diagnostics.
145 commits across 56 files (+2851/−1395 lines): 78 bug fixes, 45 performance optimizations, and supporting refactors.
Testing showed approximately 15% query processing speedup: ~92 µs → ~80 µs for each new query.
Performance: Gravity Database
- `carray()` binding — replace per-client statement vectors with one shared statement per process, rebound via `sqlite3_carray_bind()` before each lookup. Query plan goes from 4 SEARCH operations (through 3 LEFT JOINs) to a single covering index seek for gravity/antigravity.
- `addintarray()` — TCP forks get group data for free without recomputation.
- `(group_id, ...)` indexes — enable index seeks for `carray()`-filtered queries on `adlist_by_group` and `domainlist_by_group` instead of full scans.
- `WITHOUT ROWID` on junction tables — `adlist_by_group`, `domainlist_by_group`, `client_by_group`, and `info` tables use the composite PK directly as the B-tree storage key, eliminating one level of indirection per JOIN.
- `bind_client_groups()` helper tracks the last-bound SHM `groupspos` per statement. Since `addintarray()` deduplicates, clients sharing the same group set skip the rebind entirely.
- `PRAGMA mmap_size = 256 MiB` on gravity.db and the on-disk query database. Eliminates `pread()` syscall overhead for all domain lookups.
- `posix_fadvise(POSIX_FADV_WILLNEED)` on gravity.db at startup — pre-warms the kernel page cache asynchronously, eliminating cold-start page faults.
- `getTable()` case, SHM counter, and API exposure for antigravity domain counts.
- `adlist.enabled` and `group.enabled` no longer re-read on every lookup.

Performance: Query Processing Hot Path
- `FTL_check_blocking()` — pass pre-fetched domain/client/cache pointers instead of re-looking up by ID.
- `gravity_has_exact_allowlist`, `gravity_has_exact_denylist`, `gravity_has_antigravity` flags skip entire functions when the corresponding lists are empty.
- Type check before `regexec()`, skipping the expensive regex engine when the type doesn't match.
- Replace `tolower()` with a branchless ASCII bit-trick in `strcpy_tolower()`.
- Replace `time(NULL)` calls with the query's existing timestamp where applicable.
- `find_mac()` outside SHM lock — netlink kernel I/O no longer blocks all other threads.
- `_FTL_iface()` — skip repeated interface list iteration when the pointer hasn't changed.

Performance: Garbage Collection
- Instead of `memmove()`-ing all surviving queries on every GC run, advance `queries_offset` and compact lazily only when the physical array runs out of room.
- `domain->count`, `client->count`, `cache->refcount`, `domain->cname_refcount` are maintained per-query instead of recomputed during GC, eliminating the O(queries) scan in `recycle()`.

Performance: Database Writes
- `queries_to_database()` — Phase 1 (under SHM lock): snapshot query data. Phase 2 (no lock): bulk SQLite writes. Phase 3 (brief lock): write-back of DB indices. Minimizes lock contention.
- `INSERT OR IGNORE` with `sqlite3_changes()` check instead of SELECT-then-INSERT.
- `PRAGMA synchronous=NORMAL` on the WAL-mode disk database — skip per-commit fsync, relying on WAL checksums for crash safety.

Performance: Data Structures
- Shrink `DNSCacheData` from 48 to 36 bytes on all architectures.
- Reorder `queriesData` and `clientsData` for cache-line alignment — hot fields in the first 64-byte line, cold `overTime[]` at the end.
- O(1) string deduplication — replace the `memmem()` linear scan in `addstr()` with a process-local open-addressing hash table (FNV-1a, via `hashStr()`).
- `/api/stats/top_domains` and `/api/stats/top_clients` — O(N log K) instead of full sort.

Database Schema & Query Improvements
- `queries` VIEW migration (v22) — replace 4 correlated subqueries with LEFT JOINs. Dramatically faster for users querying `pihole-FTL.db` via the sqlite3 CLI or third-party tools.
- `ORDER BY ... DESC LIMIT` on database top items — `/api/stats/database/top_domains` and `top_clients` now return correctly sorted results (was returning arbitrary GROUP BY order) and push the LIMIT into SQL instead of scanning all rows.
- `NOL NULL` → `NOT NULL` typo on `client.ip` in the gravity schema — SQLite silently ignored the misspelled constraint, allowing NULL IPs.
- `tr_adlist_update`, `tr_client_update`, `tr_domainlist_update` now use `WHERE id = NEW.id` instead of a UNIQUE text column lookup.
- `index_creation` array — C string literal concatenation merged two index definitions into one entry.

Signal Handler Hardening
- `SA_ONSTACK` — crash handlers run on a dedicated 16 KiB alternate signal stack, enabling backtrace generation even on stack overflow faults.
- `SA_RESETHAND` — crash signal disposition resets to `SIG_DFL` after first invocation, preventing infinite re-entry if the handler itself faults.
- Store `si_pid`/`si_uid` in volatile globals, defer the expensive sender lookup (`/proc/<pid>/cmdline`, `getpwuid()`, logging) to `log_sigterm_info()` called from `cleanup()` outside signal context. Eliminates a potential deadlock if SIGTERM arrives while the malloc lock is held.
- Drop the async-signal-unsafe `log_info(strsignal())` call; events are logged when processed in the main loop.
- Guard `delete_shm()` against uninitialized SHM — early return when `name == NULL`, preventing garbage munmap/close/shm_unlink during `pihole-FTL crashtest`.
- Fix leaked `FILE*` in the original SIGTERM handler (missing `fclose` on the success path).

Crash Diagnostics
- Frames printed as `#N 0xADDR in func () at file:line`, matching GDB's `bt` convention. Library frames show `from libname`. Unresolved frames show `in ?? ()`.
- `_Unwind_Backtrace` via GCC's libgcc — works on glibc AND musl, static AND dynamic, all architectures. Falls back to `dladdr()` for library symbols, then `/proc/self/maps` for mapping names.
- Subtract the `__ehdr_start` load base from instruction pointers for correct source line resolution.
- `--- end of backtrace (N frames) ---` for clear visual termination.

Correctness Fixes (78 commits)
Critical:
- `gravity_updated()` closed the global `gravity_db` instead of the local handle — broke all DNS queries on gravity check error
- `new_sqlite3_stmt_vec()` allocated n² instead of n items — massive memory waste
- `process_alive()` used `strcmp` on a substring — never matched
- `parse_neighbor_cache()` held the lock across `continue` on NULL client
- `add_to_fifo_buffer()` OOB write, `process_arp_reply()` OOB device access
- `db_update_disk_counter` — on-disk counters never updated
- `% 0xFF` vs `& 0xFF` for the EDNS VERSION byte — corrupted version 0 detection
- `find_mac()` unlock/relock — use-after-free risk
- `realloc` in teleporter upload — NULL deref + memory leak on failure

Resource leaks (30+ fixes):
- Missing `dbclose()` on error paths across all API database endpoints
- Missing `sqlite3_finalize()` on prepare/bind failures
- `ROLLBACK` before early returns inside open transactions
- Missing `free()` for cJSON objects, punycode buffers, query strings
- Missing `fclose()` for file handles

Y2038 fixes:
- `sqlite3_column_int64()` for all `time_t` database fields
- `sqlite3_bind_int64()` for time fields in network table inserts

Test Infrastructure
- `test-fast` build option — skips performance tests for faster CI iteration

Related issue or feature (if applicable): N/A
Pull request in docs with documentation (if applicable): N/A
By submitting this pull request, I confirm the following:
- Insignificant commits have been squashed (`git rebase`)

Checklist:
- This PR is filed against the `development` branch.