From b49dea3d7d199bc194e0851a62bdee1aaa7aeabc Mon Sep 17 00:00:00 2001
From: Yvette Carlisle
Date: Tue, 9 Jun 2026 14:55:22 +0800
Subject: [PATCH] {"schema":"decodex/commit/1","summary":"Add single-user production recovery runbook","authority":"XY-825"}

---
 README.md | 9 +-
 docker-compose.yml | 2 +-
 .../2026-06-09-live-baseline-report.md | 5 +
 .../2026-06-09-production-corpus-report.md | 5 +
 docs/guide/benchmarking/index.md | 4 +
 .../benchmarking/live_baseline_benchmark.md | 4 +
 docs/guide/single_user_production.md | 325 +++++++++++++++++-
 7 files changed, 337 insertions(+), 17 deletions(-)

diff --git a/README.md b/README.md
index e77f3344..182ac2b5 100644
--- a/README.md
+++ b/README.md
@@ -39,6 +39,8 @@ ELF is a memory service for LLM agents that stores short, evidence-linked facts
 
 Use the canonical setup guide:
 - `docs/guide/getting_started.md`
+- For single-user production operation, backup, restore, and Qdrant rebuild, use
+  [docs/guide/single_user_production.md](docs/guide/single_user_production.md).
 
 Fast path:
 
@@ -56,6 +58,9 @@ curl -fsS http://127.0.0.1:51892/health
 ```
 
 For provider-backed development, copy `elf.example.toml` to `elf.toml` and fill the provider blocks.
+For production use, do not rely on these quickstart commands; follow the single-user
+production runbook linked above so backup, restore, rollback, and provider config
+handling are explicit.
 
 ## Architecture
 
@@ -136,6 +141,7 @@ Detailed evidence and interpretation:
 - [Live Baseline Benchmark Report - June 9, 2026](docs/guide/benchmarking/2026-06-09-live-baseline-report.md)
 - [Synthetic Production Corpus Benchmark Report - June 9, 2026](docs/guide/benchmarking/2026-06-09-production-corpus-report.md)
 - [Live Baseline Benchmark Runbook](docs/guide/benchmarking/live_baseline_benchmark.md)
+- [Single-User Production Runbook](docs/guide/single_user_production.md)
 
 Quick comparison snapshot (objective/high-level). This table compares capability coverage, not overall project quality.
 
@@ -191,7 +197,8 @@ Latest external research refresh: June 8, 2026.
 
 - Start here: `docs/index.md`
 - Operational guide index: `docs/guide/index.md`
-- Single-user production runbook: `docs/guide/single_user_production.md`
+- Single-user production runbook:
+  [docs/guide/single_user_production.md](docs/guide/single_user_production.md)
 - Benchmarking guides and reports: `docs/guide/benchmarking/index.md`
 - Research index: `docs/guide/research/index.md`
 - Specifications: `docs/spec/index.md`
diff --git a/docker-compose.yml b/docker-compose.yml
index ef0a17c7..6a5009fa 100644
--- a/docker-compose.yml
+++ b/docker-compose.yml
@@ -15,7 +15,7 @@ services:
       timeout: 5s
       retries: 10
     volumes:
-      - elf-postgres-data:/var/lib/postgresql/data
+      - elf-postgres-data:/var/lib/postgresql
 
   qdrant:
     image: qdrant/qdrant:v1.16.3
diff --git a/docs/guide/benchmarking/2026-06-09-live-baseline-report.md b/docs/guide/benchmarking/2026-06-09-live-baseline-report.md
index a609ff90..ed94704f 100644
--- a/docs/guide/benchmarking/2026-06-09-live-baseline-report.md
+++ b/docs/guide/benchmarking/2026-06-09-live-baseline-report.md
@@ -178,6 +178,11 @@ calls, worker indexing, Qdrant rebuild/search, lifecycle checks, soak, and conta
 overhead. Whether that is acceptable depends on the production workflow: it is a
 cold/backfill measurement, not an interactive-ingest target.
 
+This report is benchmark evidence, not the production operating procedure. Use
+`docs/guide/single_user_production.md` for Docker Compose production start, stop,
+health, backup, restore, Qdrant rebuild, rollback, provider config handling, and
+cleanup commands.
+
 Throughput work should focus on:
 
 - micro-batching provider embedding requests;
diff --git a/docs/guide/benchmarking/2026-06-09-production-corpus-report.md b/docs/guide/benchmarking/2026-06-09-production-corpus-report.md
index 8d1505c8..b050f1df 100644
--- a/docs/guide/benchmarking/2026-06-09-production-corpus-report.md
+++ b/docs/guide/benchmarking/2026-06-09-production-corpus-report.md
@@ -23,6 +23,11 @@ Verification: Compare this Markdown summary with the source JSON before committi
 - Same-corpus summary: `1 pass`, `0 fail`, `0 incomplete`
 - Full check summary: `7/7 pass`
 
+This report is production-corpus benchmark evidence only. Use
+`docs/guide/single_user_production.md` for the single-user Docker Compose production
+runbook, including backup, restore, Qdrant rebuild, rollback, provider config
+handling, and cleanup commands.
+
 ## Projects
 
 | Project | Status | Retrieval | Checks | Elapsed | Reason |
diff --git a/docs/guide/benchmarking/index.md b/docs/guide/benchmarking/index.md
index 3fcd0143..8d3f7506 100644
--- a/docs/guide/benchmarking/index.md
+++ b/docs/guide/benchmarking/index.md
@@ -17,6 +17,10 @@ Outputs: The smallest benchmarking guide or report needed to continue.
 - You need to extend the benchmark matrix with new projects, profiles, or lifecycle
   checks.
 
+Do not use benchmark commands as the production operating procedure. For single-user
+Docker Compose production start, stop, backup, restore, Qdrant rebuild, rollback, and
+cleanup, use `docs/guide/single_user_production.md`.
+
 ## Guides And Reports
 
 - `live_baseline_benchmark.md`: run, clean up, publish, and interpret the live
diff --git a/docs/guide/benchmarking/live_baseline_benchmark.md b/docs/guide/benchmarking/live_baseline_benchmark.md
index f6db3637..05108f19 100644
--- a/docs/guide/benchmarking/live_baseline_benchmark.md
+++ b/docs/guide/benchmarking/live_baseline_benchmark.md
@@ -10,6 +10,10 @@ Verification: `cargo make baseline-live-docker` writes `tmp/live-baseline/live-b
 
 ## Scope
 
+This guide is for benchmark evidence, not for operating a personal production ELF service. For
+single-user Docker Compose production start, stop, health, backup, restore, Qdrant rebuild,
+rollback, and cleanup commands, use `docs/guide/single_user_production.md`.
+
 The runner covers ELF plus the six external projects in the README comparison table:
 
 - ELF
diff --git a/docs/guide/single_user_production.md b/docs/guide/single_user_production.md
index 33d21784..4322236e 100644
--- a/docs/guide/single_user_production.md
+++ b/docs/guide/single_user_production.md
@@ -9,7 +9,8 @@ binaries, and provider credentials for production embeddings/rerank/extraction.
 Depends on: `docker-compose.yml`, `elf.example.toml`, `docs/spec/system_elf_memory_service_v2.md`,
 `docs/guide/getting_started.md`, and `docs/guide/integration-testing.md`.
 Verification: Health succeeds, a note can be ingested and found, Postgres backup restores notes,
-and Qdrant search state can be rebuilt from Postgres.
+Qdrant search state can be rebuilt from Postgres, and the clean-volume proof path below can run
+without host-global service installs.
 
 ## Operating Boundary
 
@@ -33,8 +34,8 @@ binds when auth is off.
 
 ## 1. Create Local Secrets
 
-Create `.env` for Docker Compose only. Docker Compose loads it automatically; ELF itself does not
-read provider credentials or required config fields from environment variables.
+Create `.env` for Docker Compose storage settings only. Docker Compose loads it automatically; ELF
+itself does not read provider credentials or required config fields from environment variables.
 
 ```sh
 cat > .env <<'EOF'
@@ -83,6 +84,12 @@ Edit `elf.production.toml`:
 - If you run `elf-mcp`, keep `[mcp]` present and ensure exactly one static key matches its tenant,
   project, agent, and read profile.
 
+Do not put provider credentials, bearer tokens, or static-key secrets in the Compose `.env` file.
+Production provider settings belong in the untracked ELF config file, or in a local secret-rendering
+step that writes that untracked config before startup. ELF fails closed when provider keys are empty,
+required provider fields are absent, the embedding dimension does not match the Qdrant vector
+dimension, or the config path is missing.
+
 Do not commit `.env`, `elf.production.toml`, backups, provider keys, bearer tokens, or database
 dumps. `.env*`, root ELF config files, and `backups/` are ignored for this reason.
 
@@ -105,6 +112,31 @@ docker compose -f docker-compose.yml exec -T postgres \
 curl -fsS "http://127.0.0.1:${ELF_QDRANT_REST_PORT}/collections" >/dev/null
 ```
 
+Stop storage without deleting data:
+
+```sh
+docker compose -f docker-compose.yml stop postgres qdrant
+```
+
+Start it again:
+
+```sh
+docker compose -f docker-compose.yml up -d postgres qdrant
+```
+
+Remove stopped containers while keeping volumes:
+
+```sh
+docker compose -f docker-compose.yml down
+```
+
+Delete all Compose-managed storage only when you have a verified backup or are running the
+clean-volume proof below:
+
+```sh
+docker compose -f docker-compose.yml down -v
+```
+
 ## 3. Build And Start ELF Services
 
 Build once, then run the binaries directly to avoid multiple `cargo run` processes contending for
@@ -132,6 +164,15 @@ Optional: start MCP in a third terminal when a client needs the MCP adapter:
 target/debug/elf-mcp -c elf.production.toml
 ```
 
+Stop ELF services by sending Ctrl-C in each service terminal. If you started them in the background,
+stop those exact processes before backup, restore, upgrade, or rollback:
+
+```sh
+pkill -f "target/debug/elf-api -c elf.production.toml" || true
+pkill -f "target/debug/elf-worker -c elf.production.toml" || true
+pkill -f "target/debug/elf-mcp -c elf.production.toml" || true
+```
+
 On startup, `elf-api` and `elf-worker` initialize the Postgres schema and ensure the Qdrant
 collections and docs payload indexes exist. Startup fails closed if the config file is missing,
 required config is absent, `security.reject_non_english` is false, vector dimensions mismatch, or
@@ -157,7 +198,62 @@ Before upgrading ELF binaries or changing config, take a Postgres backup. There
 migration command in the minimum runbook; rollback means stopping ELF, restoring the previous
 Postgres backup, starting the previous known-good binary/config, and rebuilding Qdrant.
 
-## 5. Back Up Postgres
+## 5. Restart, Upgrade, And Roll Back
+
+For a config-only restart:
+
+```sh
+pkill -f "target/debug/elf-api -c elf.production.toml" || true
+pkill -f "target/debug/elf-worker -c elf.production.toml" || true
+pkill -f "target/debug/elf-mcp -c elf.production.toml" || true
+```
+
+Then start the worker and API again in separate terminals:
+
+```sh
+target/debug/elf-worker -c elf.production.toml
+```
+
+```sh
+target/debug/elf-api -c elf.production.toml
+```
+
+For an ELF binary upgrade:
+
+```sh
+# 1. Run Section 6 and keep the backup path.
+# 2. Stop ELF service processes.
+pkill -f "target/debug/elf-api -c elf.production.toml" || true
+pkill -f "target/debug/elf-worker -c elf.production.toml" || true
+pkill -f "target/debug/elf-mcp -c elf.production.toml" || true
+
+# 3. Move the checkout to the desired release or commit, then rebuild.
+cargo build -p elf-api -p elf-worker -p elf-mcp
+
+# 4. Start worker in one terminal.
+target/debug/elf-worker -c elf.production.toml
+```
+
+```sh
+# 5. Start API in another terminal, then run Section 4 health and migration checks.
+target/debug/elf-api -c elf.production.toml
+```
+
+For rollback, restore the pre-upgrade backup and rebuild Qdrant:
+
+```sh
+# 1. Stop ELF service processes.
+pkill -f "target/debug/elf-api -c elf.production.toml" || true
+pkill -f "target/debug/elf-worker -c elf.production.toml" || true
+pkill -f "target/debug/elf-mcp -c elf.production.toml" || true
+
+# 2. Move the checkout and elf.production.toml back to the previous known-good version.
+# 3. Run Section 7 restore.
+# 4. Run Section 8 Qdrant rebuild.
+# 5. Start the previous known-good worker and API, then run Section 4 health checks.
+```
+
+## 6. Back Up Postgres
 
 Stop or pause writers first. For this single-user runbook, that means stop `elf-api`, `elf-worker`,
 and `elf-mcp` with Ctrl-C in their terminals. Leave the `postgres` container running.
@@ -177,7 +273,7 @@ printf 'Wrote %s\n' "${BACKUP}"
 
 Copy the backup to your normal encrypted backup location. Do not commit it.
 
-## 6. Restore Postgres
+## 7. Restore Postgres
 
 Use this path for a fresh machine restore or rollback. Stop `elf-api`, `elf-worker`, and `elf-mcp`
 before restoring. Start only storage:
@@ -210,7 +306,7 @@ docker compose -f docker-compose.yml exec -T postgres \
   -c "SELECT COUNT(*) AS notes FROM memory_notes;"
 ```
 
-## 7. Rebuild Qdrant From Postgres
+## 8. Rebuild Qdrant From Postgres
 
 Qdrant is rebuildable. If the Qdrant volume or memory-note collection is missing, stale, or
 restored from the wrong point in time, discard the memory-note collection and rebuild it from
@@ -255,7 +351,7 @@ call the embedding provider.
 This endpoint rebuilds memory-note chunks. Do not treat it as a Doc Extension rebuild procedure for
 `storage.qdrant.docs_collection`.
 
-## 8. Smoke And Restore Proof
+## 9. Smoke And Restore Proof
 
 With `elf-worker` and `elf-api` running, ingest one deterministic note. If auth is off, omit the
 `Authorization` header. If static-key auth is on, use a token whose configured context matches the
@@ -303,16 +399,215 @@ curl -fsS -X POST http://127.0.0.1:51892/v2/searches \
   }'
 ```
 
-To prove restore and rebuild:
+### Clean-Volume Proof Path
+
+Run this from the repository root when you need a local proof that backup, clean-volume restore,
+Qdrant rebuild, and search recovery work without host-global service installs. It uses the
+checked-in deterministic local providers, a temporary config under `tmp/`, ports `51988-51993`,
+and isolated Docker volume names.
+
+```sh
+bash <<'EOF'
+set -euo pipefail
+
+PROOF_DIR="tmp/single-user-restore-proof"
+PROOF_CONFIG="${PROOF_DIR}/elf.restore-proof.toml"
+mkdir -p "${PROOF_DIR}/backups"
+cp config/local/elf.docker.toml "${PROOF_CONFIG}"
+perl -0pi -e 's/127\.0\.0\.1:51888/127.0.0.1:51988/g; s/127\.0\.0\.1:51889/127.0.0.1:51989/g; s/127\.0\.0\.1:51890/127.0.0.1:51990/g; s/127\.0\.0\.1:51891/127.0.0.1:51991/g; s/127\.0\.0\.1:51892/127.0.0.1:51992/g; s/127\.0\.0\.1:51893/127.0.0.1:51993/g; s/elf_local_notes/elf_restore_proof_notes/g; s/elf_local_doc_chunks/elf_restore_proof_doc_chunks/g' "${PROOF_CONFIG}"
+
+export ELF_COMPOSE_PROJECT=elf-restore-proof
+export ELF_POSTGRES_DB=elf_local
+export ELF_POSTGRES_USER=elf_dev
+export ELF_POSTGRES_PASSWORD=elf_dev_password
+export ELF_POSTGRES_PORT=51988
+export ELF_POSTGRES_VOLUME=elf-restore-proof-postgres-data
+export ELF_QDRANT_REST_PORT=51989
+export ELF_QDRANT_GRPC_PORT=51990
+export ELF_QDRANT_VOLUME=elf-restore-proof-qdrant-data
+
+API_PID=""
+WORKER_PID=""
+cleanup() {
+  for pid in ${API_PID:-} ${WORKER_PID:-}; do
+    if [ -n "${pid}" ]; then
+      kill "${pid}" 2>/dev/null || true
+      wait "${pid}" 2>/dev/null || true
+    fi
+  done
+  docker compose -f docker-compose.yml down -v --remove-orphans >/dev/null 2>&1 || true
+}
+trap cleanup EXIT
+
+docker compose -f docker-compose.yml down -v --remove-orphans
+docker compose -f docker-compose.yml config >/dev/null
+docker compose -f docker-compose.yml up -d postgres qdrant
+for _ in $(seq 1 60); do
+  docker compose -f docker-compose.yml exec -T postgres \
+    pg_isready -U "${ELF_POSTGRES_USER}" -d "${ELF_POSTGRES_DB}" >/dev/null 2>&1 && break
+  sleep 1
+done
+docker compose -f docker-compose.yml exec -T postgres \
+  pg_isready -U "${ELF_POSTGRES_USER}" -d "${ELF_POSTGRES_DB}"
+for _ in $(seq 1 60); do
+  curl -fsS "http://127.0.0.1:${ELF_QDRANT_REST_PORT}/collections" >/dev/null && break
+  sleep 1
+done
+curl -fsS "http://127.0.0.1:${ELF_QDRANT_REST_PORT}/collections" >/dev/null
+
+cargo build -p elf-api -p elf-worker
 
-1. Run the backup step.
-2. Stop `elf-api`, `elf-worker`, and `elf-mcp`.
-3. Restore the backup into Postgres.
-4. Delete the Qdrant memory-note collection.
-5. Start `elf-api`, call `/v2/admin/qdrant/rebuild`, then start `elf-worker`.
-6. Re-run the search command and confirm the restored note appears.
+target/debug/elf-worker -c "${PROOF_CONFIG}" > "${PROOF_DIR}/worker-before.log" 2>&1 &
+WORKER_PID="$!"
+target/debug/elf-api -c "${PROOF_CONFIG}" > "${PROOF_DIR}/api-before.log" 2>&1 &
+API_PID="$!"
+
+for _ in $(seq 1 60); do
+  curl -fsS http://127.0.0.1:51992/health >/dev/null && break
+  sleep 1
+done
+curl -fsS http://127.0.0.1:51992/health | tee "${PROOF_DIR}/health-before.json"
+
+curl -fsS -X POST http://127.0.0.1:51992/v2/notes/ingest \
+  -H 'content-type: application/json' \
+  -H 'X-ELF-Tenant-Id: local-tenant' \
+  -H 'X-ELF-Project-Id: local-project' \
+  -H 'X-ELF-Agent-Id: local-agent' \
+  -d '{
+    "scope": "agent_private",
+    "notes": [
+      {
+        "type": "fact",
+        "key": "single_user_restore_probe",
+        "text": "The single-user production restore proof note is stored in Postgres and searchable after Qdrant rebuild.",
+        "importance": 0.8,
+        "confidence": 0.95,
+        "ttl_days": 14,
+        "source_ref": {"schema": "single_user_runbook/v1", "ref": {"step": "clean_volume_restore_proof"}}
+      }
+    ]
+  }' | tee "${PROOF_DIR}/add-note.json"
+
+for _ in $(seq 1 60); do
+  OPEN_OUTBOX="$(docker compose -f docker-compose.yml exec -T postgres \
+    psql -U "${ELF_POSTGRES_USER}" -d "${ELF_POSTGRES_DB}" -At \
+    -c "SELECT COUNT(*) FROM indexing_outbox WHERE status <> 'DONE';")"
+  [ "${OPEN_OUTBOX}" = "0" ] && break
+  sleep 1
+done
+test "${OPEN_OUTBOX}" = "0"
+
+curl -fsS -X POST http://127.0.0.1:51992/v2/searches \
+  -H 'content-type: application/json' \
+  -H 'X-ELF-Tenant-Id: local-tenant' \
+  -H 'X-ELF-Project-Id: local-project' \
+  -H 'X-ELF-Agent-Id: local-agent' \
+  -H 'X-ELF-Read-Profile: private_only' \
+  -d '{
+    "mode": "quick_find",
+    "query": "Where is the single-user production restore proof note stored?",
+    "top_k": 5,
+    "candidate_k": 20,
+    "payload_level": "l0"
+  }' | tee "${PROOF_DIR}/search-before.json"
+grep -F "single-user production restore proof note" "${PROOF_DIR}/search-before.json"
+
+BACKUP="${PROOF_DIR}/backups/elf-proof.dump"
+docker compose -f docker-compose.yml exec -T postgres \
+  pg_dump -U "${ELF_POSTGRES_USER}" -d "${ELF_POSTGRES_DB}" -Fc > "${BACKUP}"
+test -s "${BACKUP}"
+
+kill "${API_PID}" "${WORKER_PID}" 2>/dev/null || true
+wait "${API_PID}" "${WORKER_PID}" 2>/dev/null || true
+API_PID=""
+WORKER_PID=""
+
+docker compose -f docker-compose.yml down -v --remove-orphans
+docker compose -f docker-compose.yml up -d postgres qdrant
+for _ in $(seq 1 60); do
+  docker compose -f docker-compose.yml exec -T postgres \
+    pg_isready -U "${ELF_POSTGRES_USER}" -d "${ELF_POSTGRES_DB}" >/dev/null 2>&1 && break
+  sleep 1
+done
+docker compose -f docker-compose.yml exec -T postgres \
+  pg_isready -U "${ELF_POSTGRES_USER}" -d "${ELF_POSTGRES_DB}"
+for _ in $(seq 1 60); do
+  curl -fsS "http://127.0.0.1:${ELF_QDRANT_REST_PORT}/collections" >/dev/null && break
+  sleep 1
+done
+
+docker compose -f docker-compose.yml exec -T postgres \
+  dropdb -U "${ELF_POSTGRES_USER}" --force --if-exists "${ELF_POSTGRES_DB}"
+docker compose -f docker-compose.yml exec -T postgres \
+  createdb -U "${ELF_POSTGRES_USER}" "${ELF_POSTGRES_DB}"
+docker compose -f docker-compose.yml exec -T postgres \
+  pg_restore -U "${ELF_POSTGRES_USER}" -d "${ELF_POSTGRES_DB}" \
+  --no-owner --role="${ELF_POSTGRES_USER}" < "${BACKUP}"
+
+RESTORED_NOTES="$(docker compose -f docker-compose.yml exec -T postgres \
+  psql -U "${ELF_POSTGRES_USER}" -d "${ELF_POSTGRES_DB}" -At \
+  -c "SELECT COUNT(*) FROM memory_notes WHERE key = 'single_user_restore_probe';")"
+test "${RESTORED_NOTES}" = "1"
+
+target/debug/elf-api -c "${PROOF_CONFIG}" > "${PROOF_DIR}/api-after.log" 2>&1 &
+API_PID="$!"
+for _ in $(seq 1 60); do
+  curl -fsS http://127.0.0.1:51992/health >/dev/null && break
+  sleep 1
+done
+
+curl -fsS -X POST http://127.0.0.1:51991/v2/admin/qdrant/rebuild \
+  | tee "${PROOF_DIR}/qdrant-rebuild.json"
+grep -F '"missing_vector_count":0' "${PROOF_DIR}/qdrant-rebuild.json"
+grep -F '"error_count":0' "${PROOF_DIR}/qdrant-rebuild.json"
+
+curl -fsS -X POST http://127.0.0.1:51992/v2/searches \
+  -H 'content-type: application/json' \
+  -H 'X-ELF-Tenant-Id: local-tenant' \
+  -H 'X-ELF-Project-Id: local-project' \
+  -H 'X-ELF-Agent-Id: local-agent' \
+  -H 'X-ELF-Read-Profile: private_only' \
+  -d '{
+    "mode": "quick_find",
+    "query": "Where is the single-user production restore proof note stored?",
+    "top_k": 5,
+    "candidate_k": 20,
+    "payload_level": "l0"
+  }' | tee "${PROOF_DIR}/search-after.json"
+grep -F "single-user production restore proof note" "${PROOF_DIR}/search-after.json"
+
+printf 'Single-user restore proof passed. Evidence files remain under %s.\n' "${PROOF_DIR}"
+EOF
+```
 
-## 9. Failure And Secret Rules
+The proof fails closed on missing Docker services, occupied ports, failed service health, undrained
+indexing outbox rows, an empty backup, missing restored source rows, non-zero Qdrant rebuild errors,
+or a search response that does not contain the restored note.
+
+### Recorded Local Proof - June 9, 2026
+
+The clean-volume proof path above was executed locally against this worktree after aligning
+`docker-compose.yml` with the PostgreSQL 18 volume layout. It used the checked-in local deterministic
+providers, isolated Compose volumes, and ports `51988-51993`.
+
+Recorded evidence:
+
+- Compose storage started cleanly with Postgres accepting connections.
+- `cargo build -p elf-api -p elf-worker` completed.
+- `POST /v2/notes/ingest` returned `op = "ADD"` and `policy_decision = "remember"` for
+  `key = "single_user_restore_probe"`.
+- Search before backup returned the note summary:
+  "The single-user production restore proof note is stored in Postgres and searchable after Qdrant
+  rebuild."
+- The custom-format Postgres backup was non-empty (`88K` in the local proof run).
+- The proof destroyed and recreated the isolated Compose volumes, restored Postgres with
+  `pg_restore`, and verified one restored source row for `single_user_restore_probe`.
+- `POST /v2/admin/qdrant/rebuild` returned
+  `{"error_count":0,"missing_vector_count":0,"rebuilt_count":1}`.
+- Search after restore and Qdrant rebuild returned the same restored note.
+- Cleanup removed the isolated proof containers and volumes.
+
+## 10. Failure And Secret Rules
 
 - Missing or invalid config fails startup.
 - `security.reject_non_english = false` fails config validation.