An Agentic RAG system for engineering standards and technical documentation.
Doc Agent ingests complex regulatory PDFs (GOST, PUE, SNiP, etc.), normalises their content, and makes it searchable through a hybrid vector‑database. It then answers engineering questions with verbatim citations, without hallucinations.
- PDF slicing & physical rotation correction (ONNX orientation model).
- Layout analysis with
Docling(IBM) – extracts tables, formulas, figures. - Semantic normalisation via a Vision‑Language Model (VLM) that “heals” raw Markdown, fixes headings, merges split tables, and converts formulas to LaTeX.
- Structural chunking based on document hierarchy (H1–H4).
- Embedding with local
BGE‑M3(ONNX, CPU‑only) and indexing into a hybridQdrantcollection (dense + BM25).
- Hybrid search (dense + BM25, fused with Reciprocal Rank Fusion).
- Agentic reasoning loop – question decomposition, relevance judging, iterative retrieval, faithfulness verification.
- Streamlit UI – chat interface with clickable source citations.
All components are vendor‑agnostic and can be switched from cloud APIs to local models when needed.
├── pipelines/ingestion_pipeline.py ← entry point for batch processing
├── ui/app.py ← Streamlit chat interface
├── src/doc_agent/
│ ├── configs/ ← Pydantic settings
│ ├── schemas/ ← all data contracts (Pydantic)
│ ├── core/ ← storage & state machine
│ ├── data/ ← “dumb” parsers (Docling, PDF slicing, tagging)
│ ├── agents/ ← LLM/VLM agents (OCR healer, answer generator)
│ ├── indexing/ ← structural chunker, embedder, Qdrant writer
│ ├── retrieval/ ← hybrid searcher
│ ├── reasoning/ ← decomposer, judge, faithfulness checker
│ ├── prompts/ ← version‑controlled system prompts
│ └── orchestrator.py ← main pipeline conductor
├── models/ ← downloaded ONNX models (BGE‑M3, orientation)
├── data/ ← working directory (git‑ignored)
│ ├── 01_raw/
│ ├── 02_interim/
│ └── 03_processed/
└── docker-compose.yml ← local Qdrant service
- Idempotent pipeline – every processing step is tracked in a
manifest.json; re‑running only processes new or failed pages. - Strict data contracts – all LLM/VLM outputs are parsed into Pydantic models (no free‑form JSON).
- Hybrid retrieval – combines semantic search with exact lexical matching (BM25) for article numbers and codes.
- Faithfulness checker – post‑generation verification that removes unsupported claims.
- Thread‑safe document processing – pages are normalised concurrently, with proper locking.
-
Start Qdrant
docker compose up -d
-
Install dependencies
pip install -e . -
Set environment variables – create
.envwith:NANOGPT_API_KEY=your_key NANOGPT_BASE_URL=https://api.openai.com/v1 # or any OpenAI‑compatible endpoint -
Ingest documents
python pipelines/ingestion_pipeline.py
PDFs from
data/01_raw/will be processed, healed, chunked, and indexed into Qdrant. -
Launch the UI
streamlit run ui/app.py
Each notebook documents a specific milestone in the project, evolving from raw physical extraction to advanced semantic normalization and retrieval integration.
| Notebook | Focus Area | Key Deliverables & Technical Outcomes |
|---|---|---|
| 001_docling_pdf_parsing.ipynb | Stage 1: Physical Layout & Extraction | Evaluated Docling against complex engineering standards. Catalogued 7 critical parsing artifacts (ghost tables, OCR overreach in schematics) and confirmed that disabling native OCR prevents “spaghetti text” while revealing the limits of heuristic parsing. |
| 002_semantic_normalization.ipynb | Stage 2: Semantic Normalization & VLM Healing | Developed a multimodal VLM agent to “heal” Markdown using high‑res page images. Implemented strict Pydantic data contracts for deterministic JSON responses. Achieved reconstruction of run‑in headings, borderless tables, and LaTeX formulas with zero hallucinations. |
| 003_docling_ast_xml_anchoring.ipynb | XML Tagging & Attention Boundaries | Intercepted the Docling AST to wrap elements in XML tags, creating ironclad attention boundaries that prevent context blending and mitigate the “middle‑drop” effect in large‑context VLMs. |
| 004_additional_testing_cases... | Edge Cases Stress Testing | Batch‑tested the VLM on vertical/horizontal tables, rotated pages, multiple drawings per page, and embedded images. Refined the Native CoT prompt to handle complex headers and placeholders without duplicating text or splitting tables. |
| 005_orchestrator_test.ipynb | Pipeline Orchestration Mock | Built the DocumentOrchestrator state machine. Validated LocalStorageManager and LocalManifestManager for idempotent page transitions (PENDING → TAGGED → CLEANED) via manifest.json. |
| 006_pdf_processor_test.ipynb | Physical Normalization | Implemented pypdfium2 + ONNX (PP‑LCNet) rotation detection. Proved that physically rotating pages to 0° before parsing eliminates structural errors. |
| 007_pdf_to_markdown_test.ipynb | End‑to‑End Integration | Validated the full extraction loop in src/doc_agent/: Slicing → AST Tagging → VLM Healing. Confirmed production code matches notebook R&D exactly. |
| 008_aggregation_and_chunking_test.ipynb | Semantic Chunking & Aggregation | Merged healed pages with HTML anchors. Developed StructuralChunker to slice by H1–H4 headers and map physical PNG artifacts into VectorPayload metadata for Qdrant. |
| 009_retrieval_stack_rationale.ipynb | Retrieval Stack Rationale & Integration Testing | Documented the architectural decision for Qdrant (native hybrid search, automatic BM25, sub‑10 ms latency, strong payload filtering) and BGE‑M3 ONNX (dense 1024‑dim vectors). Performed hands‑on integration: downloaded and tested the BGE‑M3 ONNX model, created Qdrant collections (dense‑only and hybrid), upserted real payloads, executed dense and hybrid search with Reciprocal Rank Fusion, and validated the full retrieval chain through HybridSearcher, AnswerGeneratorAgent, and AgenticAnswerAgent. Confirmed end‑to‑end correctness and benchmarked hybrid retrieval quality |