
Performance optimizations and bug fixes #2816

Open
DL6ER wants to merge 172 commits into development from tweak/performance

Conversation

Member

@DL6ER DL6ER commented Mar 20, 2026

What does this implement/fix?

Summary

Broad sweep across the FTL engine covering performance work, correctness fixes, database schema improvements, signal handler hardening, and crash diagnostics.

145 commits across 56 files (+2851/−1395 lines): 78 bug fixes, 45 performance optimizations, and supporting refactors.

Testing showed approximately 15% query processing speedup: ~92 µs → ~80 µs for each new query.


Performance: Gravity Database

  • Shared prepared statements with carray() binding — replace per-client statement vectors with one shared statement per process, rebound via sqlite3_carray_bind() before each lookup. Query plan goes from 4 SEARCH operations (through 3 LEFT JOINs) to a single covering index seek for gravity/antigravity.
  • SHM intarray storage for client group IDs — store group ID arrays in shared memory with deduplication via addintarray(). TCP forks get group data for free without recomputation.
  • Junction table (group_id, ...) indexes — enable index seeks for carray()-filtered queries on adlist_by_group and domainlist_by_group instead of full scans.
  • WITHOUT ROWID on junction tables — adlist_by_group, domainlist_by_group, client_by_group, and info tables use the composite PK directly as the B-tree storage key, eliminating one level of indirection per JOIN.
  • carray rebind deduplication — bind_client_groups() helper tracks the last-bound SHM groupspos per statement. Since addintarray() deduplicates, clients sharing the same group set skip the rebind entirely.
  • Memory-mapped I/O — PRAGMA mmap_size = 256 MiB on gravity.db and the on-disk query database. Eliminates pread() syscall overhead for all domain lookups.
  • posix_fadvise(POSIX_FADV_WILLNEED) on gravity.db at startup — pre-warms the kernel page cache asynchronously, eliminating cold-start page faults.
  • ANTIGRAVITY_TABLE support — new enum, count query, getTable() case, SHM counter, and API exposure for antigravity domain counts.
  • Correctness: reverted adlist ID caching — pre-computed per-client adlist ID arrays bypassed view JOINs for ~4x speedup but became stale when users modified groups/adlists at runtime. Reverted to live view queries that check adlist.enabled and group.enabled on every lookup.

Performance: Query Processing Hot Path

  • Eliminate redundant SHM lookups in FTL_check_blocking() — pass pre-fetched domain/client/cache pointers instead of re-looking up by ID.
  • Stack-based ABP pattern generation — replace cJSON heap allocations with a stack struct containing pre-computed suffix offsets. Zero heap allocation in the blocking check path.
  • Lazy ABP offset computation — defer suffix offset calculation until after the exact-match check fails, avoiding all ABP work on exact matches (the common case).
  • Skip empty list checks — gravity_has_exact_allowlist, gravity_has_exact_denylist, gravity_has_antigravity flags skip entire functions when the corresponding lists are empty.
  • Regex query-type pre-filtering — check the query-type bitmask before running regexec(), skipping the expensive regex engine when the type doesn't match.
  • Replace tolower() with branchless ASCII bit-trick in strcpy_tolower().
  • Replace time(NULL) calls with the query's existing timestamp where applicable.
  • Move find_mac() outside SHM lock — netlink kernel I/O no longer blocks all other threads.
  • Interface caching in _FTL_iface() — skip repeated interface list iteration when the pointer hasn't changed.

Performance: Garbage Collection

  • Deferred compaction — instead of memmove()-ing all surviving queries on every GC run, advance queries_offset and compact lazily only when the physical array runs out of room.
  • O(1) query ID lookup — direct-mapped 4096-entry cache for dnsmasq ID → FTL query index, replacing O(N) linear scan.
  • Incremental reference counting — domain->count, client->count, cache->refcount, domain->cname_refcount are maintained per-query instead of recomputed during GC, eliminating the O(queries) scan in recycle().
  • Single-pass lookup table compaction — GC recycles clients/domains/upstreams and compacts lookup tables in one pass instead of per-entry removal.

Performance: Database Writes

  • Three-phase queries_to_database() — Phase 1 (under SHM lock): snapshot query data. Phase 2 (no lock): bulk SQLite writes. Phase 3 (brief lock): write-back of DB indices. Minimizes lock contention.
  • Linking table ID caching — hash table caches domain/client/forward/addinfo row IDs, eliminating 4 REPLACE subqueries per query.
  • Batched dedup linking table INSERTs — INSERT OR IGNORE with sqlite3_changes() check instead of SELECT-then-INSERT.
  • PRAGMA synchronous=NORMAL on WAL-mode disk database — skip per-commit fsync, relying on WAL checksums for crash safety.

Performance: Data Structures

  • Shrink DNSCacheData from 48 to 36 bytes on all architectures.
  • Reorder queriesData and clientsData for cache-line alignment — hot fields in the first 64-byte line, cold overTime[] at the end.
  • O(1) string deduplication — replace memmem() linear scan in addstr() with a process-local open-addressing hash table (FNV-1a).
  • FNV-1a hash replaces Jenkins' One-at-a-Time in hashStr().
  • Heap-based top-K selection for /api/stats/top_domains and /api/stats/top_clients — O(N log K) instead of full sort.

Database Schema & Query Improvements

  • queries VIEW migration (v22) — replace 4 correlated subqueries with LEFT JOINs. Dramatically faster for users querying pihole-FTL.db via sqlite3 CLI or third-party tools.
  • ORDER BY ... DESC LIMIT on database top items — /api/stats/database/top_domains and top_clients now return correctly sorted results (they previously returned arbitrary GROUP BY order) and push the LIMIT into SQL instead of scanning all rows.
  • Fix NOL NULL → NOT NULL typo on client.ip in gravity schema — SQLite silently ignored the misspelled constraint, allowing NULL IPs.
  • Update triggers use PK lookup — tr_adlist_update, tr_client_update, tr_domainlist_update now use WHERE id = NEW.id instead of a UNIQUE text column lookup.
  • Fix missing comma in index_creation array — C string literal concatenation merged two index definitions into one entry.
  • MEMDB_VERSION bump to 22 and expected schema test updates.

Signal Handler Hardening

  • SA_ONSTACK — crash handlers run on a dedicated 16 KiB alternate signal stack, enabling backtrace generation even on stack overflow faults.
  • SA_RESETHAND — crash signal disposition resets to SIG_DFL after first invocation, preventing infinite re-entry if the handler itself faults.
  • Async-signal-safe SIGTERM handler — save si_pid/si_uid in volatile globals, defer the expensive sender lookup (/proc/<pid>/cmdline, getpwuid(), logging) to log_sigterm_info() called from cleanup() outside signal context. Eliminates potential deadlock if SIGTERM arrives while malloc lock is held.
  • Async-signal-safe RT signal handler — remove log_info(strsignal()) call; events are logged when processed in the main loop.
  • Guard delete_shm() against uninitialized SHM — early return when name == NULL, preventing garbage munmap/close/shm_unlink during pihole-FTL crash test.
  • Fix leaked FILE* in original SIGTERM handler (missing fclose on success path).

Crash Diagnostics

  • GDB-style backtrace output — frames show #N 0xADDR in func () at file:line matching GDB's bt convention. Library frames show from libname. Unresolved frames show in ?? ().
  • _Unwind_Backtrace via GCC's libgcc — works on glibc AND musl, static AND dynamic, all architectures. Falls back to dladdr() for library symbols, then /proc/self/maps for mapping names.
  • PIE/ASLR-aware addr2line — subtracts the __ehdr_start load base from instruction pointers for correct source line resolution.
  • Backtrace end marker — --- end of backtrace (N frames) --- for clear visual termination.

Correctness Fixes (78 commits)

Critical:

  • gravity_updated() closed global gravity_db instead of local handle — broke all DNS queries on gravity check error
  • new_sqlite3_stmt_vec() allocated n² instead of n items — massive memory waste
  • Session cookie used compile-time default timeout instead of configured value
  • Zombie detection in process_alive() used strcmp on a substring — never matched
  • SHM deadlock: parse_neighbor_cache() held lock across continue on NULL client
  • Heap buffer overflows: add_to_fifo_buffer() OOB write, process_arp_reply() OOB device access
  • Swapped bind parameters in db_update_disk_counter — on-disk counters never updated
  • % 0xFF vs & 0xFF for EDNS VERSION byte — corrupted version 0 detection
  • EDE info-code byte-order: double-swap on little-endian, broken on big-endian
  • Stale SHM pointers after find_mac() unlock/relock — use-after-free risk
  • Unchecked realloc in teleporter upload — NULL deref + memory leak on failure

Resource leaks (30+ fixes):

  • Missing dbclose() on error paths across all API database endpoints
  • Missing sqlite3_finalize() on prepare/bind failures
  • Missing ROLLBACK before early returns inside open transactions
  • Missing free() for cJSON objects, punycode buffers, query strings
  • Missing fclose() for file handles

Y2038 fixes:

  • sqlite3_column_int64() for all time_t database fields
  • sqlite3_bind_int64() for time fields in network table inserts

Test Infrastructure

  • test-fast build option — skips performance tests for faster CI iteration
  • Updated test expectations for removed log messages, new schema version, new VIEW definition

Related issue or feature (if applicable): N/A

Pull request in docs with documentation (if applicable): N/A


By submitting this pull request, I confirm the following:

  1. I have read and understood the contributors guide, as well as this entire template. I understand which branch to base my commits and Pull Requests against.
  2. I have commented my proposed changes within the code.
  3. I am willing to help maintain this change if there are issues with it later.
  4. It is compatible with the EUPL 1.2 license
  5. I have squashed any insignificant commits. (git rebase)

Checklist:

  • The code change is tested and works locally.
  • I based my code and PRs against the repositories development branch.
  • I signed off all commits. Pi-hole enforces the DCO for all contributions
  • I signed all my commits. Pi-hole requires signatures to verify authorship
  • I have read the above and my PR is ready for review.

DL6ER added 30 commits March 16, 2026 20:28
The _addstr() function used memmem() to scan the entire shared-memory string pool for duplicate strings before appending. This is O(pool_size) per call — on a typical installation with tens of thousands of unique domains, this means scanning hundreds of kilobytes to megabytes of data on every call, all while holding the SHM lock.

Replace with a process-local open-addressing hash table that maps string hashes to pool offsets. This gives O(1) average-case lookups while preserving the existing string deduplication behavior.

The hash table uses the same hashing algorithm (Jenkins One-at-a-Time) already used elsewhere in FTL. It starts at 4096 slots, doubles at 75% load/usage, and incrementally syncs any strings added by forked TCP children via str_hash_sync().

This primarily benefits:
- Database import at startup (every imported domain hits addstr())
- Post-GC domain re-registration (recycled domains call addstr() again)
- Any workload with many unique domains (ad-heavy or many-user networks)

Signed-off-by: Dominik <dl6er@dl6er.de>
Replace the O(MAXITER=1000) reverse linear scan in findQueryID() with a 4096-entry direct-mapped cache keyed on the (monotonically incrementing!) dnsmasq query ID. Each DNS reply/forward/CNAME callback should now hit the cache for an O(1) lookup instead of scanning up to (in the worst case!) 1000 queriesData structs in shared memory.

Since dnsmasq IDs increment monotonically, new entries naturally evict the oldest at each slot, keeping the cache fresh without explicit eviction logic. The cache is cleared after GC memmove (which shifts all query indices). On a miss the original linear scan is used as fallback.

Signed-off-by: Dominik <dl6er@dl6er.de>
Replace the per-GC-cycle memmove of the entire queries array with a queries_offset that advances when old queries are removed. _getQuery() adds this offset to translate logical indices to physical positions, making the change transparent to all callers.

Previously, every GC run copied (remaining_queries * 96) bytes under the SHM lock — potentially tens of megabytes on a busy resolver. Now GC simply increments the offset immediately (O(1)).

Compaction (the actual memmove) is deferred to shm_ensure_size() and only triggered when the physical array runs out of room, amortizing the cost over many GC cycles.

Signed-off-by: Dominik <dl6er@dl6er.de>
…ounts

The recycle() function previously allocated three bool arrays (clients, domains, dns_cache) and scanned ALL active queries to mark which entries were still in use before recycling unreferenced ones. This O(queries) scan ran under the shared-memory lock every GC cycle.

Replace with incrementally-maintained reference counts:

- client->count / domain->count (already maintained) track direct query references
- domain->cname_refcount (new) tracks CNAME chain references from both queries and cache entries
- cache->refcount (new) tracks how many queries reference each DNS cache entry

Refcounts are incremented at query creation (new query, DB import, CNAME chain resolution, query duplication) and decremented during GC removal. recycle() now checks these fields directly instead of scanning, reducing its complexity from O(queries) to O(clients + domains + cache).

Initialize DNSCacheData.CNAME_domainID to (unsigned int)-1 as a sentinel to distinguish "never set" from domain ID 0, ensuring correct refcount decrements even after gravity reloads that reset blocking_status without clearing CNAME_domainID.

Signed-off-by: Dominik <dl6er@dl6er.de>
…te loop

queries_to_database() called asprintf() to format the "IP#port" upstream string for every query written to the database, causing a malloc+free cycle per query. On systems with thousands of queries per DB flush, this adds up to significant heap churn.

Replace with a fixed 64-byte stack buffer and snprintf. The maximum formatted length is 52 bytes (45-char IPv6 + '#' + 5-digit port + NUL), so 64 bytes provides ample headroom with zero heap allocation.

Signed-off-by: Dominik <dl6er@dl6er.de>
GC recycle() previously called lookup_remove() for each recycled client, domain, and cache entry. Each removal did a binary search followed by memmove to shift all subsequent entries left, making each removal O(N) in the table size. With K entries recycled, the total cost was O(K*N).

Replace with lookup_compact(): after each recycling loop, a single O(N) pass over the sorted lookup table copies surviving entries forward, skipping entries whose backing data structure has been memset to zero (magic byte == 0x00). The sorted order is preserved since only removals occur, so binary search continues to work.

For a DNS cache lookup table with 20K entries and 100 recycled entries, this reduces data movement from ~8MB (100 individual memmoves) to ~160KB (one sequential scan).

Signed-off-by: Dominik <dl6er@dl6er.de>
GC recycle() previously iterated data arrays to decide what to recycle, then made a second pass over each lookup table to compact out dead entries. Merge both operations into a single pass: iterate over the sorted lookup table directly, check the recycling criteria for each entry's backing data, and either recycle it (memset + add to recycle table) or shift it forward in the compacted output.

This halves the number of iterations and eliminates the lookup_compact() function, its three callback wrappers, and the associated function pointer call overhead per lookup entry.

Signed-off-by: Dominik <dl6er@dl6er.de>
Move the query_type bitmask check ahead of the expensive regexec()/tre_regexec() call. When a regex has a type filter (e.g., only match AAAA queries), queries of non-matching types now skip the engine entirely via a single integer AND operation, avoiding unnecessary regex evaluations.

Signed-off-by: Dominik <dl6er@dl6er.de>
gen_abp_patterns() uses cJSON, allocating 9+ heap objects per blocked-domain check (1 array + N items + N string copies). Most gravity lookups hit the exact-match check, making those allocations wasted work.

Change in_gravity() to accept a cJSON** and lazily generate ABP patterns only when the exact-match returns NOT_FOUND and gravity_abp_format is enabled. The patterns are shared across the antigravity and gravity calls via the caller's pointer.

Signed-off-by: Dominik <dl6er@dl6er.de>
The server_cookie calloc/memcpy/free ran on every DNS query carrying EDNS cookies (most modern queries), but the data was only used for debug logging. Move cookie parsing entirely inside the debug check and replace both heap allocations (server_cookie, pretty_server_cookie) with fixed-size stack buffers. Zero heap allocations in the non-debug path.

Signed-off-by: Dominik <dl6er@dl6er.de>
_FTL_new_query: the gettimeofday(&request, 0) result was never used — remove the wasted syscall on every new query.

FTL_reply: identical DNSSEC proxy check block was duplicated (copy-paste bug) — remove the second copy

Signed-off-by: Dominik <dl6er@dl6er.de>
Replace calloc(N) + qsort(N) with a bounded min-heap of size 4*K (where K is the requested count, typically 10). During the O(N) scan, only the top entries are retained in the heap. The final sort runs on at most 4*K entries instead of all N.

For 100k domains requesting top 10: memory drops from ~5MB to ~2KB, sort from O(N log N) to O(K log K) where K << N, and zero-count entries are skipped early during collection.

Also fix lock leak: calloc failure now unlocks shm before returning.

Signed-off-by: Dominik <dl6er@dl6er.de>
- Remove strtolower() on upstream IPs in FTL_forwarded and FTL_retried: inet_ntop already produces lowercase output.
- Fix double strlen() call in special_domain: compute domain length once and reuse for the suffix comparison.
- Remove unnecessary sqlite3_clear_bindings() in domain_in_list: the next sqlite3_bind_text() overwrites parameter 1 directly, making the clear redundant.

Signed-off-by: Dominik <dl6er@dl6er.de>
Signed-off-by: Dominik <dl6er@dl6er.de>
In queries_to_database(), the same domain/client/forward typically
appears in many queries within a single 1-second batch. Each was
triggering a separate INSERT OR IGNORE sqlite3_step into the linking
tables (domain_by_id, client_by_id, forward_by_id) even though only
the first INSERT per unique entry has any effect.

Add per-batch dedup caches using pointer comparison (valid because
getstr() returns stable shared-memory pointers where identical strings
get identical addresses) to skip redundant sqlite3_step calls.
Cache sizes are tuned per table type:
- Domains: 256 (media-heavy pages produce 50-100+ unique domains)
- Clients: 32 (typical home networks have 5-30 devices)
- Upstreams: 8 (rarely more than 2-4 configured)

Also remove the last 4 sqlite3_clear_bindings() calls (forward_stmt,
addinfo_stmt x2, query_stmt) - unnecessary per SQLite docs since all
bindings are overwritten before each step.

Signed-off-by: Dominik <dl6er@dl6er.de>
Simplify linking table dedup: since the in-memory database persists
for the process lifetime, a domain/client/upstream only needs to be
inserted once ever, not once per batch. Replace the unsigned int
db_batch generation counter with a single-bit in_database flag in
each struct's flags bitfield.

Reset in_database when client hostname changes in resolve.c so the
new (ip, name) pair gets inserted into client_by_id.

Bump SHARED_MEMORY_VERSION 14 -> 15 for the struct layout change.

Signed-off-by: Dominik <dl6er@dl6er.de>
The REPLACE INTO query_storage statement previously used 4 correlated
subqueries to resolve domain/client/forward/addinfo text strings to
their linking table IDs on every single query upsert. This required
4 B-tree index lookups with string comparison per query.

Cache the SQLite row IDs (db_id) directly on the domainsData,
clientsData, and upstreamsData shared-memory structs. After the first
INSERT OR IGNORE (already guarded by in_database flag), fetch the ID
via sqlite3_last_insert_rowid (new rows) or a one-time SELECT (rows
imported from disk), then bind it as a direct integer in the REPLACE.

This eliminates 3 of 4 subqueries from the hottest SQLite statement
in the system. Only the addinfo subquery remains (it only fires for
blocked/CNAME queries, not every query) and cannot be cached easily
(we have no matching SHM data structure for it and I don't want to
add one just for this - the overall complexity is not worth it).

Signed-off-by: Dominik <dl6er@dl6er.de>
The REPLACE INTO query_storage statement had one remaining correlated
subquery: (SELECT id FROM addinfo_by_id WHERE type=? AND content=?).
This B-tree index probe executed for every query stored to the in-memory
database, even non-blocked queries where both parameters are NULL.

Cache addinfo_by_id row IDs in a process-local open-addressing hash
table (1024 slots, 8 KB). The key packs the addinfo type into the low
bits and the actual value (CNAME_domainID or list_id) above, then
applies Knuth multiplicative hashing to spread dense small integers
across the power-of-two table. ADDINFO_TYPE_BITS is derived at compile
time from the ADDINFO_TYPE_MAX sentinel in enum addinfo_type.

After the first INSERT OR IGNORE, fetch the row ID via
sqlite3_last_insert_rowid (new rows) or a one-time SELECT (rows
pre-existing from disk import), then cache it. Subsequent queries
with the same addinfo skip all SQLite operations entirely.

The REPLACE statement is now a plain VALUES (?1,...,?13) with zero
correlated subqueries, down from the original 4.

Signed-off-by: Dominik <dl6er@dl6er.de>
in_gravity() is called twice per DNS cache miss in check_domain_blocked:
once for gravity (block) and once for antigravity (allow-adlists). For
the majority of Pi-hole users who don't use allow-adlists, the antigravity
call always returns NOT_FOUND after a wasted SQLite bind/step/reset.

Check once at gravity DB open time whether the antigravity table has any
rows. When empty, short-circuit the in_gravity(antigravity=true) call
immediately, saving one full prepared-statement round trip per uncached
domain lookup. This is rechecked after gravity runs are performed when the
new gravity database is re-opened.

Signed-off-by: Dominik <dl6er@dl6er.de>
…k time

Restructure queries_to_database() from a single massive SHM-locked loop into
three phases:

  Phase 1 (SHM locked): Snapshot scalar query fields into a local array
  and handle linking table INSERTs (domain/client/forward/addinfo).
  The linking table work stays here because it reads SHM strings and
  writes back db_id/in_database flags. The changed flag is cleared
  immediately after snapshotting so that DNS thread updates during
  phase 2 are not lost.

  Phase 2 (SHM unlocked): Bind and step the main REPLACE statement for
  each snapshot. This is the bulk of the SQLite work and no longer
  blocks DNS processing or API threads.

  Phase 3 (SHM locked, brief): Write back db indices for newly inserted
  queries and update counters.

This reduces SHM lock hold time from O(queries × SQLite REPLACE) to
O(queries × field copy) + O(queries × index writeback), keeping the
lock-free window over the heaviest database operation.

Signed-off-by: Dominik <dl6er@dl6er.de>
find_mac() can trigger a netlink kernel call (iface_enumerate) to
refresh the full ARP table when a client's MAC address isn't in
dnsmasq's cache. Previously this ran while holding the global SHM
mutex, blocking all other threads (API, database, GC, DNS workers).

Release the SHM lock before calling find_mac(), perform the lookup
into stack-local buffers, then reacquire the lock and write back the
result. Both client and query pointers are re-fetched after reacquiring
since SHM may have been remapped while unlocked.

Only affects new-client paths (hwlen < 1); existing clients already
have their MAC cached and skip this code entirely.

Signed-off-by: Dominik <dl6er@dl6er.de>
_FTL_iface() is called on every incoming UDP query and iterates
daemon->interfaces (potentially 2-3 times) to collect the IPv4/IPv6
addresses of the receiving interface. On systems with many interfaces
(Docker hosts, VLANs, VPNs), this is O(n_interfaces) per query.

Cache the resolved recviface pointer and skip the expensive address-
collection loop when the same interface is seen again. The pointer is
stable within dnsmasq's interface list and naturally invalidates on
SIGHUP when the list is rebuilt.

Also moves the next_iface invalidation (memset) to only run on cache
miss or when no interface is found, avoiding unnecessary zeroing on
every query.

Signed-off-by: Dominik <dl6er@dl6er.de>
…s are empty

Refactor the one-off gravity_check_antigravity() into a generic
gravity_table_has_entries() helper and a consolidated
gravity_check_list_presence() that probes antigravity, vw_allowlist
and vw_denylist at gravity DB open time.

When the exact allowlist or exact denylist contains no entries (common
for most Pi-hole users who rely on regex or adlist-based blocking),
in_allowlist() and in_denylist() now return NOT_FOUND immediately,
saving one SQLite bind/step/reset per cache-miss query for each empty
table.

Signed-off-by: Dominik <dl6er@dl6er.de>
…tterns

The per-cache-miss gravity check generates ABP wildcard
patterns by iterating over domain label suffixes.  Previously each
pattern was heap-allocated as a cJSON string node inside a cJSON array,
adding N+1 malloc/free round-trips per uncached query when ABP format
is enabled (one array plus one string per domain component).

Replace the cJSON array with a lightweight struct abp_patterns that
stores only the suffix start offsets into the original domain string.
The actual "||suffix^" / "@@||suffix^" pattern strings are built on
the fly in a small stack buffer right before the SQLite bind call.

This eliminates all heap allocation and deallocation in the ABP pattern
path, removes the cJSON iteration overhead, and avoids the multiple
memmove operations that the old gen_abp_patterns() used to build
pattern strings incrementally.

The original gen_abp_patterns() returning cJSON is retained for the
cold API search path which needs the patterns as JSON for debug output.

Signed-off-by: Dominik <dl6er@dl6er.de>
…ttime, reuse query timestamp

- Add strcpy_tolower() that copies and lowercases in a single pass,
  replacing the strncpy()+strtolower() pattern. This avoids both the
  double traversal and strncpy's zero-padding of the remaining buffer
  (typically ~236 wasted bytes for a 20-char domain in a 256-byte buffer).
  Applied in _FTL_new_query() and FTL_CNAME().

- Guard clock_gettime(CLOCK_MONOTONIC) calls in lock_shm()/unlock_shm()
  behind config.debug.timing.v.b. Previously the syscall was executed on
  every SHM lock/unlock even when debug timing was disabled.

- Replace time(NULL) syscall in FTL_check_blocking() cache expiry check
  with (time_t)query->timestamp which is already available, eliminating
  a redundant syscall per blocked-query lookup.

Signed-off-by: Dominik <dl6er@dl6er.de>
- Replace 50-line query type switch in _FTL_new_query() with a 256-byte
  direct-mapped lookup table. Maps DNS wire types to FTL enum values via
  a single array load instead of a multi-case switch/jump table, avoiding
  indirect branch overhead and BTB pollution.

- Replace strncpy() with strcpy_tolower() in FTL_check_blocking() for
  the local domain copy. Domains are already lowercase in shared memory,
  so tolower() is a no-op; the key gain is avoiding strncpy's zero-
  padding of the remaining ~236 bytes (MAXDOMAINLEN=256, typical domain
  ~20 chars).

- Guard querystr() evaluation and debug logging in FTL_hook() behind
  config.debug.flags.v.b check. Previously querystr() (which does a
  linear search + sprintf) was called unconditionally on every DNS event
  even when debug output was disabled.

Signed-off-by: Dominik <dl6er@dl6er.de>
The result of the ternary selecting between antigravity_stmt and
gravity_stmt was discarded instead of assigned back to stmt, leaving
it NULL when domain_in_list() was subsequently called. This is a
latent crash bug triggered whenever a client's gravity prepared
statement needed on-demand initialization. Matches the correct
pattern already used in in_allowlist().

Signed-off-by: Dominik <dl6er@dl6er.de>
blockDomain is a bool (return value of check_domain_blocked()), not
an enum db_result. The comparison worked by coincidence since
NOT_FOUND == 0 == false, but was semantically incorrect. Replace with
the idiomatic boolean negation.

Signed-off-by: Dominik <dl6er@dl6er.de>
…lients

After NUM_RECHECKS group rechecks the inner condition is permanently
false, but time(NULL) was still invoked on every call. This function
is called 3-4 times per cache-miss query (once each for allowlist,
denylist, gravity, and antigravity). Adding an early-exit guard
eliminates all time() calls for clients past their first ~3 minutes,
i.e. the vast majority of a client's lifetime.

Signed-off-by: Dominik <dl6er@dl6er.de>
…casecmp

is_pihole_domain() is called for every non-TYPE_NONE query. In the
common case the domain is not pi.hole/hostname, so all four strcasecmp
calls ran to completion unnecessarily. Cache the lengths of the four
comparison targets as static variables and pre-filter with a single
strlen(domain) + integer comparison before each strcasecmp, turning
typical misses into O(1) rejections.

Signed-off-by: Dominik <dl6er@dl6er.de>
@github-actions

Conflicts have been resolved.

DL6ER added 2 commits April 11, 2026 14:39
Signed-off-by: Dominik <dl6er@dl6er.de>
… last development merge

Signed-off-by: Dominik <dl6er@dl6er.de>
@DL6ER
Member Author

DL6ER commented Apr 12, 2026

Depends on #2845

@github-actions

This pull request has conflicts, please resolve those before we can evaluate the pull request.

@DL6ER DL6ER marked this pull request as draft April 12, 2026 09:59
Signed-off-by: Dominik <dl6er@dl6er.de>
@github-actions

Conflicts have been resolved.

@DL6ER DL6ER marked this pull request as ready for review April 12, 2026 13:45
DL6ER added 14 commits April 12, 2026 18:02
…the earlier merge

Signed-off-by: Dominik <dl6er@dl6er.de>
Signed-off-by: Dominik <dl6er@dl6er.de>
Set SQLITE_DEFAULT_JOURNAL_SIZE_LIMIT=33554432 as a compile-time default
so every connection inherits it. Without this cap, the -wal file on the
on-disk query database can grow unbounded between checkpoints on busy
resolvers, increasing commit latency and disk usage. A compile-time
default is preferred over a runtime PRAGMA because it also covers
attached databases and any future connection added without having to
remember to re-issue the PRAGMA.

Signed-off-by: Dominik <dl6er@dl6er.de>
get_memdb_size() previously issued two separate prepare/step/finalize
cycles for "PRAGMA page_count" and "PRAGMA page_size". Replace them
with a single SELECT that cross-joins the pragma_page_count() and
pragma_page_size() table-valued functions.

This is a code-simplification change, not a hot-path optimisation —
get_memdb_size() is only called from the debug-gated memory-usage
log. The motivation is to stop pretending this needs two statements
when SQLite can return both values from one prepare.

Also fix a misleading log prefix: the function is get_memdb_size(),
not init_memory_database().

Signed-off-by: Dominik <dl6er@dl6er.de>
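The combined query can be demonstrated with the table-valued pragma functions directly (sketched here in Python's stdlib `sqlite3`; FTL does the same with a single prepare/step cycle in C):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t(x)")  # ensure at least one page is allocated

# One statement instead of two: cross-join the one-row TVFs
# pragma_page_count() and pragma_page_size().
size, page_size = con.execute(
    "SELECT page_count * page_size, page_size"
    " FROM pragma_page_count(), pragma_page_size()"
).fetchone()
```

Each pragma function yields exactly one row, so the comma join produces a single row containing both values.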
bind_client_groups() binds the group_ids array to the carray() TVF
with SQLITE_STATIC, meaning SQLite will NOT copy the buffer and will
dereference it directly during each sqlite3_step(). That is only
correct because getintarray() returns a pointer into the SHM
integer-array region, which is stable for the statement's execution
scope.

This change does three things:

- Document the lifetime contract next to the binding so anyone
  refactoring getintarray() is forced to consider the SQLITE_STATIC
  implication (TRANSIENT copy, or explicit rebind before each step).

- Add a cheap assert() to catch a future refactor that could leave
  group_ids NULL while group_count > 0 — the early return already
  handles NULL, but the assert documents the post-guard invariant
  at the bind site.

- Pure documentation + debug-only guard; no runtime impact in
  release builds.

Signed-off-by: Dominik <dl6er@dl6er.de>
Pass SQLITE_OPEN_NOMUTEX to sqlite3_open_v2() in _dbopen(). Every
dbopen() caller assigns the result to a local variable, uses it
within one function, and calls dbclose() before returning — the
connection never crosses a thread boundary. In SQLite's default
"serialized" threading mode, each of the dozens of sqlite3_* calls
per API request acquires a per-connection mutex; with NOMUTEX those
acquisitions are skipped.

Left untouched on purpose:
- _memdb (query-table.c): accessed by both the DNS thread and the
  background database thread, must stay serialized.
- gravity_db: accessed by multiple threads during gravity reload,
  must stay serialized.

The long-lived shared connections continue to rely on SQLite's own
locking; this change only applies to the short-lived per-request
connections that are inherently single-threaded.

Signed-off-by: Dominik <dl6er@dl6er.de>
Apply the same treatment as the previous _dbopen() change to the
remaining sqlite3_open_v2() call sites that open a connection which
lives entirely within one thread's scope:

- network-table.c: macvendor.db readonly lookup
- gravity-db.c: gravity_updated() readonly check
- tools/gravity-parseList.c: CLI tool (single-process)
- api/teleporter.c: per-request gravity.db writer
- zip/teleporter.c: two :memory: database opens used for
  serialization and validation

Each of these creates a sqlite3* in a local variable, uses it, and
closes it before the function returns — the connection is never
shared across threads. The in-memory databases in zip/teleporter.c
have no file-level locking to care about at all.

The teleporter's extra connection to gravity.db coexists with the
long-lived gravity_db handle; NOMUTEX on one connection does not
affect the other, and SQLite's file-level locking handles cross-
connection serialization as usual.

No functional change; avoids per-call SQLite mutex acquisitions on
these paths.

Signed-off-by: Dominik <dl6er@dl6er.de>
SQLITE_DEFAULT_CACHE_SIZE was set to 16384, which is a positive value
and therefore means pages rather than kiB: at the default 4 kiB page
size that works out to ~64 MiB of private page cache for every
sqlite3 connection. FTL keeps several long-lived connections open
(_memdb in query-table.c, gravity_db in gravity-db.c, plus briefly-
scoped dbopen() handles), so the compiled-in default dominates RSS.

Drop it to 4096 pages (~16 MiB). The mmap-backed on-disk query
database (PRAGMA mmap_size) lets the kernel page cache serve cold
reads for free, so large per-connection caches mostly duplicate the
kernel's work. 16 MiB comfortably covers the hot working set for
both gravity lookups and the in-memory query ring, while saving
~48 MiB per live connection at steady state.

Also clean up the adjacent comment which previously described a
"16 MiB" cache that the actual define did not deliver; the comment
now matches what the compile-time default actually configures.

No behaviour change. If a specific workload ever needs more cache
on a specific connection, PRAGMA cache_size on that connection is
the right localized override.

Signed-off-by: Dominik <dl6er@dl6er.de>
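The sign convention that caused the original misconfiguration is easy to verify (Python's stdlib `sqlite3` used for illustration only): a positive `cache_size` counts pages, a negative one counts KiB.

```python
import sqlite3

con = sqlite3.connect(":memory:")
page_size, = con.execute("PRAGMA page_size").fetchone()

con.execute("PRAGMA cache_size = 16384")            # positive: PAGES
pages, = con.execute("PRAGMA cache_size").fetchone()
cache_bytes = pages * page_size                     # 64 MiB at 4 KiB pages

con.execute("PRAGMA cache_size = -16384")           # negative: KiB
kib, = con.execute("PRAGMA cache_size").fetchone()  # stays -16384 (16 MiB)
```

This is why `SQLITE_DEFAULT_CACHE_SIZE=16384` delivered ~64 MiB per connection rather than the 16 MiB the adjacent comment claimed.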
The debug.performance 5-minute rollup has so far only tracked
FTL_check_blocking() as one aggregate slot. Real-world deployments
show a small but recurring tail: ~1-2 events per 5 minute window
where check_blocking takes 1.2-2.3 ms instead of the usual tens of
microseconds. Because the entire function lives in one PERF_STAT
slot, we cannot tell whether the spike lives in the allowlist pass
or the denylist/gravity pass.

Add two nested PERF_STAT slots that only fire on the cache-miss
path (the fast cache-hit path is already dominated by a single
switch and is not the source of the tail):

  PERF_STAT_CB_ALLOWLIST  wraps in_allowlist() + in_regex(ALLOW)
  PERF_STAT_CB_DENYLIST   wraps check_domain_blocked() (primary
                          call and the _esni fallback are timed
                          into the same slot, so the rollup
                          reflects total denylist/gravity work
                          per query)

special_domain() is not instrumented separately; it is a handful
of strcmp()s and cannot be the source of a multi-millisecond spike.

No behaviour change; the new scopes compile out to nothing when
debug.performance is disabled (existing PERF_START / PERF_END
macros already gate on config.debug.performance.v.b). The rollup
in FTL_dump_cache_stats() picks up the new slots via the existing
loop, reporting them with descriptive names that make their
nesting under check_blocking obvious in the log.

Signed-off-by: Dominik <dl6er@dl6er.de>
Background. The previous commit added nested PERF_STAT slots inside
FTL_check_blocking (allowlist, denylist/gravity). First ~8 h of
production data under debug.performance showed an oddity: the
denylist/gravity child slot accumulated ~69 slow (>1 ms) events
while the parent PERF_STAT_CHECK_BLOCKING slot saw only 3.

Two things followed from that:

1. FTL_check_blocking has two callers. FTL_new_query wraps its call
   in PERF_STAT_CHECK_BLOCKING; FTL_CNAME does not. The nested child
   slots fire regardless, so CNAME deep-inspection time was invisible
   at the parent level. One user-facing DNS query against a 5-hop
   CNAME chain drives 5 separate calls through FTL_check_blocking,
   each capable of a full denylist/gravity miss.
2. The existing CB_DENYLIST slot wraps check_domain_blocked() as a
   whole, but that function runs three independent engines in
   sequence — exact SQLite denylist, gravity/antigravity SQLite
   probe, and TRE regex — each with a different potential failure
   mode, and we currently cannot tell them apart in the rollup.

Changes. Still instrumentation-only; PERF macros compile out when
debug.performance is disabled.

(a) Split the parent CHECK_BLOCKING into primary vs CNAME callers.
Add PERF_STAT_CB_CNAME and wrap the FTL_CNAME-side call. Rename
the existing slot's human-readable label to "check_blocking
(primary path)" so its scope is unambiguous in the log.

(b) Split CB_DENYLIST's body into three nested sub-slots:
    CDB_EXACT   — in_denylist() exact-match SQLite
    CDB_GRAVITY — in_gravity() for both antigravity and gravity;
                  timed into one slot so the rollup reflects total
                  gravity-DB work per query regardless of which
                  sub-call hit
    CDB_REGEX   — in_regex(REGEX_DENY) over the TRE regex engine

The regex call is now stored in a local bool before the `if`
instead of being evaluated inline, so the PERF_END runs after the
engine returns. Same evaluation order, same short-circuit semantics,
no behaviour change.

After this lands, the 5-minute rollup reports:

    check_blocking (primary path)            # FTL_new_query side
    check_blocking (CNAME path, per-hop)     # FTL_CNAME side
    check_blocking -> allowlist              # nested, both callers
    check_blocking -> denylist/gravity       # nested, both callers
    check_domain_blocked -> exact denylist
    check_domain_blocked -> gravity+antigravity
    check_domain_blocked -> regex denylist

The CNAME-to-primary call ratio gives the amplification factor, and
the three new sub-engine slots localize which one owns the tail. If
CDB_GRAVITY concentrates the slow events, the fix domain is SQLite
and mmap (page-fault on cold regions, WAL contention); if CDB_REGEX
does, it is the TRE engine; if CDB_EXACT does, it is the prepared-
statement hot path. Each has a different next step, and right now we
cannot tell them apart.

Signed-off-by: Dominik <dl6er@dl6er.de>
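The slot mechanics described above can be sketched in a few lines of Python (hypothetical names; FTL's `PERF_START`/`PERF_END` are C macros): each slot independently tracks calls, average, max, and slow (>1 ms) events, which is exactly why a nested child slot can accumulate slow events the parent never sees when one caller lacks the parent wrapper.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

class PerfSlot:
    SLOW_US = 1000.0  # "slow" threshold: 1 ms, as in the 5-minute rollup
    def __init__(self):
        self.calls = 0
        self.total_us = 0.0
        self.max_us = 0.0
        self.slow = 0
    def record(self, us):
        self.calls += 1
        self.total_us += us
        self.max_us = max(self.max_us, us)
        if us > self.SLOW_US:
            self.slow += 1

slots = defaultdict(PerfSlot)

@contextmanager
def perf(name):
    start = time.perf_counter()
    try:
        yield
    finally:
        slots[name].record((time.perf_counter() - start) * 1e6)

# Nesting mirrors check_blocking -> denylist/gravity: a slow child
# event is only mirrored in the parent where the parent slot actually
# wraps the call (FTL_new_query does; FTL_CNAME previously did not).
with perf("check_blocking (primary path)"):
    with perf("check_domain_blocked -> gravity+antigravity"):
        time.sleep(0.002)  # simulate one slow gravity probe
```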
Production data from the previous PERF_STAT round narrowed the
multi-millisecond tail on cache-miss queries to CDB_GRAVITY
(check_domain_blocked → gravity+antigravity): 45 slow (>1 ms)
events across ~7600 calls in ~4.75 h, with a peak of 2147 µs.

That same data, cross-referenced with the pre-existing
gravity-db.c "gravity (exact)" slot which wraps domain_in_list()
(the bind + sqlite3_step + reset cycle), shows the step itself
never exceeded ~210 µs in the same window. The step is therefore
not the culprit.

The time must live in in_gravity()'s wrapper code around the
step:
  - gravityDB_client_check_again() — group membership recheck,
    which can reload groups from the gravity DB on demand and
    may page-fault on cold mmap regions;
  - gravityDB_prepare_client_statements() — per-client statement
    prepare, runs only on first call per client;
  - bind_client_groups() — carray bind of the client's group-id
    array against a shared statement.

Add a span-style sibling to the existing expression-wrapping
GRAVITY_TIMED_LOOKUP macro (GRAVITY_PERF_START / GRAVITY_PERF_END),
extend gravity_perf[] from 6 to 7 slots, and use the new pair to
time the setup block inside in_gravity() from entry through
bind_client_groups(). The 5-minute rollup now includes:

  Gravity lookup stats [in_gravity wrapper (non-step)]:
      N calls, avg X us, max Y us, Z slow (>1ms, W%)

If Z matches the slow count in CDB_GRAVITY, the wrapper IS the
tail source — next fix domain is likely mmap warm-up (the existing
gravity pre-warm from 38f7322 may need to re-touch pages on group
reloads) or carray bind. If Z stays at zero while CDB_GRAVITY
keeps going slow, the tail is in the ABP post-step path or the
two in_denylist() / in_regex() neighbors that CDB_GRAVITY was
never actually covering (in which case the earlier attribution
needs re-examining).

Either outcome narrows the search meaningfully. No behaviour
change; both macros compile out when debug.performance is off.
Early-return paths inside in_gravity() (LIST_NOT_AVAILABLE,
bind_client_groups failure) skip the GRAVITY_PERF_END call so
they don't pollute the histogram — they are rare error paths,
not the hot path we want to characterize.

Signed-off-by: Dominik <dl6er@dl6er.de>
Production perf data (debug.performance enabled, ~8 h of rollups)
localized the 1-2 ms tail in cache-miss DNS queries entirely to
the wrapper code inside in_gravity() — specifically the inline
call to gravityDB_reload_groups(client) made from
gravityDB_client_check_again() at the 60/120/180-second boundaries
after a client was first seen. 27/27 slow events in the new
"in_gravity wrapper (non-step)" slot matched the 27 slow events in
CDB_GRAVITY, and the bimodal distribution across windows (21 of
32 windows had sub-µs wrapper time; the other 11 each had 1-7
slow events) mapped cleanly onto "any client in its 3-minute
recheck window during this 5-minute window".

gravityDB_reload_groups() does three heavy things — finalize the
client's prepared statements, re-prepare them against the current
group set (which runs get_client_groupids() SQL against
gravity.db), and rebuild the per-client regex lookup. All
synchronously on the DNS hot path, under the SHM lock, on
whichever query happened to land inside the recheck window. One
user-facing query was therefore effectively charged ~1-2 ms
whenever it lost this lottery, and CNAME recursion amplified the
probability by 2.2x because every CNAME hop is another chance to
cross the boundary.

The recheck mechanism itself is correct — clients *should* have
their identity re-evaluated as hostname/MAC/interface information
arrives — but the reload does not need to happen synchronously
with the triggering query. Between rechecks the per-client state
is already "stale" by up to 60 seconds by design; deferring by
≤1 second is immaterial.

Change: gravityDB_client_check_again() now just advances the
recheck counter and sets a new per-client flag
(clientsData.flags.reload_pending, a single bit inside the
existing client_flags bitfield — zero size impact). The DB
thread, in its once-per-second cadence alongside
queries_to_database() and the gravity_updated() check, calls a
new gravityDB_process_pending_reloads() that scans clients,
takes the SHM lock, and runs gravityDB_reload_groups() for any
client whose flag is set, then clears it.

The SHM lock is still held during the reload, but the hold is
now on the DB thread. A DNS query only ever waits on that lock
if it happens to be in flight exactly when the DB thread is
processing — a small fraction of the time at typical query
rates, and the blocked query absorbs one reload's worth of
latency rather than every query near a boundary triggering its
own.

The second caller of gravityDB_reload_groups() at
dnsmasq_interface.c:1088 (client changed network interface) is
left inline intentionally. Interface changes are rare events
and we have no evidence they contribute to the tail; routing
that path through the same flag can be a follow-up if it ever
surfaces.

No behaviour change from the user's perspective: group re-checks
still happen at the same 60/120/180-second marks, clients are
still re-identified over their first three minutes, and the
eventual group set a client is assigned is unchanged.

Signed-off-by: Dominik <dl6er@dl6er.de>
Production perf data (debug.performance enabled, ~8 h of rollups)
localized the recurring 1-2 ms tail on cache-miss DNS queries entirely
to one inline call: gravityDB_client_check_again() → gravityDB_reload_groups()
from inside in_gravity() / in_allowlist() / in_denylist(). The 27 slow
events in the "in_gravity wrapper (non-step)" PERF_STAT slot matched the
27 slow events in CDB_GRAVITY one-for-one, and the bimodal window
distribution (21 quiet, 11 spiky) mapped cleanly onto "did any client
cross its 60/120/180-second recheck boundary during this 5-minute
window".

The recheck mechanism exists for a legitimate reason: a client's
identity (MAC, hostname) is discovered asynchronously via
parse_neighbor_cache() (ARP) and the resolver thread, so a client whose
group assignment depends on MAC or hostname cannot be matched correctly
on its very first query. The three timed rechecks at 60/120/180 s
re-run get_client_groupids() once enough async identity data has
arrived.

That's periodic maintenance, though, not per-query work. The old
placement made every DNS query eat the "is this client due for a
recheck" conditional, and whichever query happened to land inside a
boundary absorbed the full ~1-2 ms reload (SQL re-fetch of groups +
per-client statement re-prepare + per-client regex rebuild) under the
SHM lock. CNAME recursion amplified the exposure ~2.2×.

Rework:
  - Remove the three hot-path calls to gravityDB_client_check_again()
    from in_gravity(), in_allowlist(), in_denylist() entirely. No more
    per-query check for anything recheck-related.
  - Replace the per-client gravityDB_client_check_again() with a
    gravityDB_recheck_clients() that iterates all clients under the SHM
    lock, short-circuits mature ones (reread_groups >= NUM_RECHECKS),
    and runs the reload directly for any client currently within its
    3-minute window whose next boundary has passed.
  - Call that function from the database thread's existing once-per-
    second block in run_db_thread(), next to queries_to_database() and
    the gravity_updated() check.

Behaviour is preserved: clients are still re-identified at the same
60/120/180 s marks after first-seen, with at most ~1 s of additional
latency between the boundary and the actual reload — well within the
existing design's pre-existing between-rechecks staleness window.

The other caller of gravityDB_reload_groups() at
dnsmasq_interface.c:1088 (client changed network interface) is left
inline intentionally. Interface changes are rare events and we have
no evidence they contribute to the tail.

Removes from scope the per-client reload_pending flag that an earlier
iteration of this work introduced: with the full check moved to the DB
thread, there is nothing for the DNS thread to signal, and recomputing
the "is this client due for a recheck" condition during the once-per-
second scan is cheaper than reading/writing a shared flag bit.

Net: -23 lines, no new state on clientsData, zero DNS-path overhead
from the recheck subsystem. The CDB_GRAVITY and in_gravity wrapper
slow-event counts should drop to essentially zero once this ships.

Signed-off-by: Dominik <dl6er@dl6er.de>
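The once-per-second scan can be sketched as follows (hypothetical names and simplified types; the real `gravityDB_recheck_clients()` is C, runs on the database thread, and holds the SHM lock): mature clients are skipped, and any immature client whose next 60/120/180 s boundary has passed gets its groups reloaded off the DNS hot path.

```python
NUM_RECHECKS = 3       # rechecks at 60 s, 120 s, 180 s after first-seen
RECHECK_DELAY = 60.0   # seconds between recheck boundaries

class Client:
    def __init__(self, firstseen):
        self.firstseen = firstseen
        self.reread_groups = 0  # how many rechecks have already run

def reload_groups(client):
    # stand-in for the real work: SQL group re-fetch, per-client
    # statement re-prepare, per-client regex rebuild
    client.reread_groups += 1

def recheck_clients(clients, now):
    """Once-per-second DB-thread scan replacing the per-query check."""
    for c in clients:
        if c.reread_groups >= NUM_RECHECKS:
            continue  # mature client: the recheck subsystem is done with it
        next_boundary = c.firstseen + (c.reread_groups + 1) * RECHECK_DELAY
        if now >= next_boundary:
            reload_groups(c)  # at most ~1 s later than the old inline call

clients = [Client(firstseen=0.0)]
recheck_clients(clients, now=61.0)   # past the 60 s boundary -> 1st reload
recheck_clients(clients, now=62.0)   # 120 s boundary not reached -> no-op
recheck_clients(clients, now=181.0)  # past the 120 s boundary -> 2nd reload
```

The scan recomputes the due-check from `firstseen` and `reread_groups` alone, which is what makes the earlier `reload_pending` flag unnecessary.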
Comment thread src/api/stats.c
Signed-off-by: Dominik <dl6er@dl6er.de>
@yubiuser
Member

I've been running this for several days now without issues. I think after the next patch release this is good to go into development for an extended testing period.
