Project remains listed but becomes unqueryable after MCP restart with stale WAL/SHM sidecars #790

Description

@humais

Environment

  • Source tag: DeusData/codebase-memory-mcp v0.8.1
  • Resolved source commit: f0c9be19c5d74b84f418d807bfdce7b5d6a261ff (lightweight tag; verified by git ls-remote and the GitHub API git/refs/tags/v0.8.1 endpoint)
  • Binary variant: standard non-UI
  • Build: locally compiled on macOS 10.15.8 Catalina x86_64 with Apple Clang 12.0.0 (clang-1200.0.32.29); MACOSX_DEPLOYMENT_TARGET=10.15; CC=cc; CXX=c++; the resulting LC_BUILD_VERSION is minos=10.15, sdk=10.15.6 (Catalina-compatible).
  • Binary SHA-256: 5a859337d243a0f4be913d764bb73a0e01f83e68225cf63c0f000e468df166de
  • Cache: isolated CBM_CACHE_DIR per test (no shared or prior-pilot caches were used)
  • Fixture: a generic 3-file React project at /Users/test/cbm-repro/sample-react-project/ (package.json, src/main.tsx, src/App.tsx) so that no Astryx or other private project names appear in the cache filenames or in the indexer's output
  • No installer script, no install.sh, no auto-configuration of any agent. The binary is driven exclusively over raw stdio MCP (bin-source-built/codebase-memory-mcp invoked as a subprocess; initialize, notifications/initialized, tools/list, tools/call; nothing else)

Expected behavior

After the MCP process exits (any reason) and is restarted with the same CBM_CACHE_DIR:

  1. The previously-indexed project remains registered.
  2. index_status, get_graph_schema, get_architecture, search_graph, search_code, and get_code_snippet all return the previously-indexed data.
  3. Either the WAL is checkpointed on graceful close, or the WAL is recovered on next start, so that the project record is visible in the main .db file after restart.
  4. The project's .db-wal and .db-shm sidecar files (if any) are either checkpointed (truncated to zero) on close, or they reflect a state that the next start can recover from.
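Expectations 3 and 4 can be illustrated with a minimal sketch that uses Python's built-in sqlite3 module rather than the project's C store. The projects table and the file name demo.db are hypothetical and stand in for the real schema. The sketch only shows that a TRUNCATE checkpoint issued before close leaves a zero-byte WAL while the committed row stays readable from the main file.

```python
import os
import sqlite3
import tempfile


def wal_size(db_path):
    """Size in bytes of the -wal sidecar next to db_path (0 if absent)."""
    wal = db_path + "-wal"
    return os.path.getsize(wal) if os.path.exists(wal) else 0


def checkpoint_demo():
    """Write one row in WAL mode, then checkpoint with TRUNCATE before closing.

    Returns (wal_bytes_before, busy_flag, wal_bytes_after).
    """
    with tempfile.TemporaryDirectory() as tmp:
        db_path = os.path.join(tmp, "demo.db")  # hypothetical file name
        # isolation_level=None puts the connection in autocommit mode.
        conn = sqlite3.connect(db_path, isolation_level=None)
        conn.execute("PRAGMA journal_mode=WAL").fetchall()
        conn.execute("CREATE TABLE projects (name TEXT)")  # hypothetical schema
        conn.execute("INSERT INTO projects VALUES ('sample-react-project')")
        before = wal_size(db_path)  # the committed row currently lives in the WAL
        # TRUNCATE copies WAL frames into the main file and resets the WAL to zero bytes.
        busy = conn.execute("PRAGMA wal_checkpoint(TRUNCATE)").fetchall()[0][0]
        after = wal_size(db_path)
        conn.close()
        return before, busy, after


if __name__ == "__main__":
    print(checkpoint_demo())
```

A busy flag of 0 means the checkpoint was not blocked by another connection. A non-zero value at close time would be one way for a sidecar to survive with unmerged frames.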

Actual behavior

Before termination (in the same process):

  • initialize, notifications/initialized, tools/list succeed.
  • index_repository succeeds with status:"indexed", nodes:8, edges:10 for the 3-file sample-react-project fixture.
  • list_projects returns the project with name, root_path, nodes:8, edges:10, size_bytes:1769472.
  • index_status returns {"project":"…","nodes":8,"edges":10,"status":"ready"}.
  • search_code returns the project file matches.

After termination and restart (separate fresh process, same CBM_CACHE_DIR):

  • list_projects returns the same project entry (same name, nodes, edges, size_bytes).

  • index_status returns:

    {
      "error": "project not found or not indexed",
      "hint": "Use list_projects to see all indexed projects, then pass the project name.",
      "available_projects": ["…sample-react-project"],
      "count": 1
    }
  • search_code returns the same "project not found or not indexed" error.

  • The project's .db file remains on disk with the same size (1,769,472 bytes) and the same SHA-256 as immediately before termination.

  • After the restart sequence runs queries against the same cache, .db-wal and .db-shm sidecar files appear in the cache directory and persist.

The project's file is on disk and the project appears in list_projects, but the query tools return "project not found": the project is registered yet unreachable through queries.
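The inconsistency above can be detected mechanically from the two tool responses. The following is a minimal sketch, not part of the binary or of any client library: classify_restart is a hypothetical helper, it treats the list_projects text as opaque (the exact response shape is not assumed), and it relies only on the "error" key shown in the index_status response above.

```python
import json


def classify_restart(project, list_projects_text, index_status_text):
    """Classify a restart result from the raw text of two tool responses."""
    # The list_projects payload is treated as opaque text; only the name is searched.
    listed = project in list_projects_text
    try:
        status = json.loads(index_status_text)
    except ValueError:
        status = None  # empty or non-JSON response counts as not queryable
    queryable = isinstance(status, dict) and "error" not in status
    if listed and not queryable:
        return "listed-but-unqueryable"
    if listed and queryable:
        return "ok"
    return "not-listed"


if __name__ == "__main__":
    failing = '{"error": "project not found or not indexed", "count": 1}'
    print(classify_restart("sample-react-project",
                           '["sample-react-project"]', failing))
```

Feeding it the pass 2 outputs from the reproduction script would flag the state reported in this issue as "listed-but-unqueryable".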

Process shutdown behavior (kept separate from the persistence defect)

The lifecycle tests used four independent caches, each starting empty, each receiving one termination event. The exit-behavior was observed as follows:

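| Test | Termination sent | Process exited within 15 s window? | Exit code | Time to exit |
|------|------------------|------------------------------------|-----------|--------------|
| A | close client stdin cleanly (no signal) | yes | 0 | 0.0 s |
| B | SIGINT | yes | 0 | 0.41 s |
| C | SIGTERM | yes | 0 | 0.21 s |
| D | SIGKILL | yes | -9 | 0.2 s |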

In the four isolated tests above, the process always exited within the test window. A separate controlled run (in a different prior pilot) did observe the binary remaining in S (sleeping) state after SIGTERM was sent; that observation is out of scope for this issue and will be reported separately if it proves relevant. The persistence defect observed here is independent of the termination mode: in all four tests, the .db file remained on disk with the same size, and after the restart all query tools reported "project not found" while list_projects continued to see the project.
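A measurement of this kind can be sketched as follows. This is an illustrative harness, not the one that produced the table: terminate_and_time and STAND_IN are hypothetical names, and the stand-in child is a small Python process that blocks on stdin, used here so the sketch runs without the binary. It is POSIX-only because it uses SIGKILL. Negative exit codes follow the Python subprocess convention (-9 means killed by signal 9).

```python
import signal
import subprocess
import sys
import time

SIGNALS = {"int": signal.SIGINT, "term": signal.SIGTERM, "kill": signal.SIGKILL}

# Hypothetical stand-in for the MCP binary: blocks on stdin until EOF.
STAND_IN = [sys.executable, "-c", "import sys; sys.stdin.read()"]


def terminate_and_time(argv, mode, window=15.0):
    """Start argv, send one termination event, and report how the process exited."""
    proc = subprocess.Popen(argv, stdin=subprocess.PIPE,
                            stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    start = time.monotonic()
    if mode == "stdin":
        proc.stdin.close()               # EOF on stdin, no signal
    else:
        proc.send_signal(SIGNALS[mode])
    try:
        code = proc.wait(timeout=window)
        exited = True
    except subprocess.TimeoutExpired:
        proc.kill()                      # clean up a child that ignored the event
        code = proc.wait()
        exited = False
    elapsed = time.monotonic() - start
    if mode != "stdin":
        proc.stdin.close()               # release our end of the pipe
    return {"mode": mode, "exited": exited, "code": code, "seconds": round(elapsed, 2)}


if __name__ == "__main__":
    for mode in ("stdin", "int", "term", "kill"):
        print(terminate_and_time(STAND_IN, mode))
```

Replacing STAND_IN with the binary path and a fresh CBM_CACHE_DIR per call would approximate the four tests A to D.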

Diagnostic result (disposable cache only)

In an isolated disposable cache, after the restart sequence returned the "project not found" responses, removing the .db-wal and .db-shm sidecar files caused the project's queries to succeed again on the next restart (without re-indexing). The .db file itself was not deleted and was not modified.

This is a diagnostic action performed on a disposable test cache only to isolate the failure to the sidecar files. It is not proposed as a production workaround. The implication is that the sidecar files, when present, cause the query open path (cbm_store_open_path_query in mcp.c's resolve_store) to either fail the integrity check or fail the project-record lookup against the main .db file. A production fix should make the open path tolerate (or actively recover from) the sidecar state, not require users to delete sidecars.
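For anyone repeating the diagnostic, moving the sidecars aside instead of deleting them keeps the evidence for later inspection. A minimal sketch, assuming the server process is stopped and the cache is disposable; quarantine_sidecars and the quarantine directory are hypothetical and not part of the project. If a WAL holds committed frames that were never checkpointed, removing it discards those frames, so this remains a diagnostic step only.

```python
import shutil
from pathlib import Path


def quarantine_sidecars(cache_dir, quarantine_dir):
    """Move every *.db-wal and *.db-shm file out of cache_dir; leave *.db untouched."""
    cache = Path(cache_dir)
    dest = Path(quarantine_dir)
    dest.mkdir(parents=True, exist_ok=True)
    moved = []
    for suffix in ("-wal", "-shm"):
        for sidecar in sorted(cache.glob("*.db" + suffix)):
            shutil.move(str(sidecar), str(dest / sidecar.name))
            moved.append(sidecar.name)
    return moved


if __name__ == "__main__":
    import sys
    # Usage: python quarantine.py <cache_dir> <quarantine_dir>
    print(quarantine_sidecars(sys.argv[1], sys.argv[2]))
```

The returned list names the files that were moved, which makes it easy to record in a test log which sidecars were present when the queries failed.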

Compact before/after cache inventory

This is the cache inventory observed in one of the four tests (the others are identical except for the project-DB SHA-256, which is unique per test). All numbers are exact and reproducible; see evidence/upstream-prep/lifecycle-*/test.log for the full per-test logs.

| Stage | File | Size (bytes) | SHA-256 (first 16 hex) |
|-------|------|--------------|------------------------|
| Before termination | …sample-react-project.db | 1,769,472 | b7e13d8b157b09a7 (test B; varies per test) |
| Before termination | _config.db | 12,288 | e8f7566de75fe3f6 (constant across all tests) |
| Immediately after exit | …sample-react-project.db | 1,769,472 | b7e13d8b157b09a7 (unchanged) |
| Immediately after exit | _config.db | 12,288 | e8f7566de75fe3f6 (unchanged) |
| After restart sequence (queries failed) | …sample-react-project.db | 1,769,472 | b7e13d8b157b09a7 (unchanged) |
| After restart sequence (queries failed) | _config.db | 12,288 | e8f7566de75fe3f6 (unchanged) |

The project DB is preserved across the lifecycle event in all four tests. Across the four tests, the project DB SHA-256s differ (each indexing run produces a fresh DB) but the sizes are identical (1,769,472 bytes), and within each test the project DB SHA-256 is identical before, after exit, and after restart.
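An inventory like the one above can be captured with a short script. This is a sketch of one way to do it, not the script used for the evidence logs; inventory and changed are hypothetical helper names. It reads each file fully into memory, which is fine for caches of this size.

```python
import hashlib
from pathlib import Path


def inventory(cache_dir):
    """Map each file name in cache_dir to (size in bytes, first 16 hex of SHA-256)."""
    rows = {}
    for entry in sorted(Path(cache_dir).iterdir()):
        if entry.is_file():
            data = entry.read_bytes()
            rows[entry.name] = (len(data), hashlib.sha256(data).hexdigest()[:16])
    return rows


def changed(before, after):
    """Names that were added, removed, or whose size or hash differs between snapshots."""
    names = set(before) | set(after)
    return sorted(n for n in names if before.get(n) != after.get(n))


if __name__ == "__main__":
    import sys
    for name, (size, digest) in inventory(sys.argv[1]).items():
        print(f"{name}\t{size}\t{digest}")
```

Taking one snapshot before termination, one after exit, and one after the restart sequence, then calling changed on consecutive pairs, would surface newly created .db-wal and .db-shm files while confirming the .db entries are unchanged.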

Relationship to PR #387 (merged) — context only

PR #387 ("fix(store): checkpoint WAL on close and startup to prevent orphan accumulation") was intended to:

  1. Checkpoint the WAL before close, so the next open sees the same rows the previous process wrote.
  2. Recover stale WAL at startup (PRAGMA wal_checkpoint(PASSIVE)), so an ungraceful exit doesn't poison the next open.
  3. Prevent orphan WAL/SHM sidecar accumulation.

The v0.8.1 reproduction above indicates that, under the tested stdio lifecycle, at least one of these three is incomplete or has regressed: after the restart the sidecar files are present in the cache and the .db is on disk, yet the query open path does not return the project's data. The diagnostic above was run in a fresh cache in which the indexer wrote the data and exited cleanly. It shows that the sidecar state left by a failed restart can be cleared by deleting the sidecars, but the user's expectation is that the binary's own close/restart logic handles this.

PR #387 should be checked for whether the close-time checkpoint covers the project-record write path and whether the startup-time recovery handles the case where the .db is on disk but the sidecars are present.
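The startup-time case in the last sentence (main .db on disk, sidecar present) can be simulated with Python's sqlite3 module. This is a sketch under stated assumptions, not the project's C code: the projects table, the file names, and both helper functions are hypothetical. It copies the .db and .db-wal while a writer is still open, which approximates what an ungraceful exit leaves behind, and then opens the copy with a PASSIVE checkpoint.

```python
import os
import shutil
import sqlite3
import tempfile


def leave_stale_wal(work_dir):
    """Return the path of a DB copy whose only committed row still sits in the WAL."""
    src = os.path.join(work_dir, "live.db")
    conn = sqlite3.connect(src, isolation_level=None)  # autocommit mode
    conn.execute("CREATE TABLE projects (name TEXT)")  # written to the main file
    conn.execute("PRAGMA journal_mode=WAL").fetchall()
    conn.execute("INSERT INTO projects VALUES ('sample-react-project')")  # WAL only
    dst = os.path.join(work_dir, "crashed.db")
    # Copy while the writer is still open: no close-time checkpoint has happened.
    shutil.copy(src, dst)
    shutil.copy(src + "-wal", dst + "-wal")
    conn.close()
    return dst


def open_with_recovery(db_path):
    """Open db_path, run a PASSIVE checkpoint, and report what is visible."""
    had_wal = os.path.exists(db_path + "-wal")
    conn = sqlite3.connect(db_path, isolation_level=None)
    busy = conn.execute("PRAGMA wal_checkpoint(PASSIVE)").fetchall()[0][0]
    names = [row[0] for row in conn.execute("SELECT name FROM projects").fetchall()]
    conn.close()
    return {"had_wal": had_wal, "busy": busy, "names": names}


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        print(open_with_recovery(leave_stale_wal(tmp)))
```

In this simulation SQLite recovers the row from the leftover WAL on open. If the binary's open path behaves differently with sidecars present, comparing its open flags and checks against this baseline may help narrow down where the lookup fails.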

Relationship to issue #277 (open) — context only

Issue #277 ("New files not indexed — WAL-checkpoint blocked on successfully-indexed project") is in the same WAL/recovery family. The defect observed here is not the same symptom (here: queryable in the same process, unqueryable after restart; #277: new files not detected). However, the underlying root-cause family (WAL state not being committed before close, or not being recovered on next open) is plausibly the same.

Relationship to issue #557 (open) — context only

Issue #557 describes a different defect: the binary unlinks the project DB on a corruption determination (deleting corrupt db path in resolve_store). The v0.8.1 reproduction in this issue does not reproduce that symptom: in all four tests above, the .db file is preserved with the same size and the same SHA-256. The defect in this issue is a lesser manifestation in the same WAL/recovery area: the sidecar files are present and the next open cannot read the project, but the file is not unlinked. The controlled lifecycle tests in this issue did not reach the "delete corrupt db" code path.

Minimal reproduction

All commands below are public-safe; no private names, paths, or model names appear. The fixture is generic; the cache is disposable; the reproduction is one shell command plus one Python script.

# Variables (set BINARY, CACHE, FIXTURE; create an empty CACHE; use a small generic fixture).
# They are exported so that the Python script below can read them from os.environ.
export BINARY=/Users/test/cbm-repro/bin-source-built/codebase-memory-mcp   # locally compiled v0.8.1
export CACHE=/Users/test/cbm-repro/cache-wal-restart-XXXX                    # fresh, empty
export FIXTURE=/Users/test/cbm-repro/sample-react-project                    # 3 files
mkdir -p "$CACHE"

# Step 1 — start the binary, index, and confirm queryability in the same process.
# Step 2 — terminate (any method; here we use stdin close which is the most reproducible).
# Step 3 — restart with the same CACHE.
# Step 4 — list_projects still works; index_status and search_code return "project not found".
# Step 5 — diagnostic: remove sidecars, restart, queries succeed again.
#
# All four lifecycle tests (close-stdin / SIGINT / SIGTERM / SIGKILL) produce the same Step 4 result.

python3 - <<'PY'
import os, json, select, subprocess, time, signal
BINARY, CACHE, FIXTURE = os.environ['BINARY'], os.environ['CACHE'], os.environ['FIXTURE']

def send(p, r):
    p.stdin.write((json.dumps(r) + "\n").encode())
    p.stdin.flush()

def recv(p, rid, t=15):
    end = time.time() + t
    buf = b""
    while time.time() < end:
        r, _, _ = select.select([p.stdout], [], [], 0.3)
        if r:
            c = os.read(p.stdout.fileno(), 4096)
            if not c: return None
            buf += c
            while b"\n" in buf:
                l, buf = buf.split(b"\n", 1)
                if not l.strip(): continue
                try: x = json.loads(l.decode())
                except ValueError: continue  # skip lines that are not valid JSON
                if x.get("id") == rid: return x
    return None

def text_of(resp):
    if not resp or "result" not in resp: return ""
    return resp["result"].get("content", [{}])[0].get("text", "")

# --- Pass 1: index + query, then exit via stdin close ---
p = subprocess.Popen([BINARY], stdin=subprocess.PIPE, stdout=subprocess.PIPE, bufsize=0,
                     env={**os.environ, "CBM_CACHE_DIR": CACHE})
send(p, {"jsonrpc":"2.0","id":1,"method":"initialize",
         "params":{"protocolVersion":"2024-11-05","capabilities":{},
                  "clientInfo":{"name":"r","version":"1"}}})
recv(p, 1)
send(p, {"jsonrpc":"2.0","method":"notifications/initialized","params":{}})
send(p, {"jsonrpc":"2.0","id":2,"method":"tools/call",
         "params":{"name":"index_repository","arguments":{"repo_path": FIXTURE}}})
recv(p, 2, 60)
send(p, {"jsonrpc":"2.0","id":3,"method":"tools/call",
         "params":{"name":"index_status","arguments":{"project":"sample-react-project"}}})
print("index_status in pass 1:", text_of(recv(p, 3)))
p.stdin.close()  # close stdin; this is the "EOF" termination mode
p.wait(timeout=15)  # ensure pass 1 has fully exited before pass 2 opens the same cache

# --- Pass 2: restart with the same CACHE ---
p = subprocess.Popen([BINARY], stdin=subprocess.PIPE, stdout=subprocess.PIPE, bufsize=0,
                     env={**os.environ, "CBM_CACHE_DIR": CACHE})
send(p, {"jsonrpc":"2.0","id":1,"method":"initialize",
         "params":{"protocolVersion":"2024-11-05","capabilities":{},
                  "clientInfo":{"name":"r","version":"1"}}})
recv(p, 1)
send(p, {"jsonrpc":"2.0","method":"notifications/initialized","params":{}})
send(p, {"jsonrpc":"2.0","id":2,"method":"tools/call",
         "params":{"name":"list_projects","arguments":{}}})
print("list_projects in pass 2:", text_of(recv(p, 2)))
send(p, {"jsonrpc":"2.0","id":3,"method":"tools/call",
         "params":{"name":"index_status","arguments":{"project":"sample-react-project"}}})
print("index_status in pass 2:", text_of(recv(p, 3)))
send(p, {"jsonrpc":"2.0","id":4,"method":"tools/call",
         "params":{"name":"search_code","arguments":{"pattern":"App","project":"sample-react-project"}}})
print("search_code in pass 2:", text_of(recv(p, 4)))
p.stdin.close()
p.wait(timeout=15)  # let pass 2 exit before the cache is inspected below
PY

# Inspect the cache; the project .db is on disk; the .db-wal and .db-shm may also be present.
ls -la "$CACHE"

# Diagnostic only — disposable cache. Do NOT do this in production.
rm -f "$CACHE"/*.db-wal "$CACHE"/*.db-shm

# Restart again; index_status and search_code now succeed without re-indexing.

Why this is not just a CLI-wrapper or field-name issue

The four raw-MCP query tools tested in this issue (index_status, search_code, list_projects, get_architecture) all use the live schema's canonical project field (the field name is project, not project_name; the related CLI vs MCP schema mismatch is reported in a separate draft). The query calls in this issue's reproduction succeed in the same process and fail in the restart process with the exact same field values. The defect is in the server-side state at the moment of cbm_store_open_path_query in resolve_store, not in the client's field name.

Local verdict

The local controlled pilot verdict is CODEBASE MEMORY PILOT BLOCKED — NO OPENCLAW CONFIG CHANGES RETAINED. The locally-compiled v0.8.1 binary has not been registered as an MCP server in the test environment; openclaw.json was not modified; no skills, hooks, or instructions were installed. No part of this controlled pilot was performed against an authoritative or production repository. The standard non-UI binary and a 3-file generic React fixture were used. All cache directories used by the test were disposable and isolated per test.


Labels: bug, priority/high, stability/performance
