Open-source toolkit for reliable RAG pipelines: convert PDFs to Markdown, clean documents, inspect chunks, compare chunking strategies, and enrich metadata for LLM applications.
Updated Jun 6, 2026 (Python)
A Python CLI to test, benchmark, and find the best RAG chunking strategy for your Markdown documents.
One library to split them all: Sentence, Code, Docs. Chunk smarter, not harder — built for LLMs, RAG pipelines, and beyond.
Production-ready Snowflake RAG system with type-specific chunking.
A lightweight Python library for metadata-rich document chunking in Retrieval-Augmented Generation (RAG) workflows. It leverages Azure AI Document Intelligence to enhance chunking by retaining hierarchical structure, page numbers, and bounding boxes for seamless integration with PDF viewers.
A Controlled Natural Language (CNL) for AI designed to "minify" language and make AI context denser.
📝 Parse, chunk, and evaluate Markdown for RAG pipelines with token-accurate support and flexible strategies for optimal context management.
This repository provides a fully modular implementation of a Retrieval-Augmented Generation (RAG) pipeline tailored for Italian legal-domain documents.
Astra Vector DB is a Python package that stores documents in the DataStax Astra DB vector database and performs semantic search.
Building a CPU-only PDF Q&A system using Hugging Face, ChromaDB vector search, and Python.
My complete LangChain learning journey: from basics to advanced RAG, LCEL, LangGraph, LangServe, and LangSmith, with hands-on code examples.
FastAPI service for document chunking and sentence-transformer embeddings for RAG, semantic search, and vector database ingestion.
KChunker is a lightweight, ultra-fast document parsing and chunking engine designed for RAG systems. It structures native and scanned PDFs, Excel files, Word documents, and email trails by preserving layout hierarchy, extracting tables, and generating dense vector embeddings for local search databases (ChromaDB and FAISS).
Smart text chunking tool for RAG systems. Splits long texts into sentence-based chunks with ~10%-15% overlap for better context retention. Runs fully in-browser with a clean UI and copyable outputs.
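The sentence-based chunking with roughly 10-15% overlap described in the entry above can be sketched as follows. This is a minimal Python illustration, not that tool's actual code (which runs in the browser); the function names `split_sentences` and `chunk_sentences`, the `max_chars` limit, and the regex-based sentence splitter are all assumptions made for the example.

```python
import re


def split_sentences(text):
    """Naive splitter: break on whitespace that follows '.', '!' or '?'."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def chunk_sentences(text, max_chars=500, overlap_ratio=0.125):
    """Group sentences into chunks of about max_chars characters.

    When a chunk is closed, its trailing sentences are carried into the
    next chunk, up to max_chars * overlap_ratio characters (hypothetical
    default of 12.5%, the middle of the 10-15% range). A single sentence
    longer than max_chars is kept whole, so chunks can exceed the limit.
    """
    chunks = []
    current = []
    for sentence in split_sentences(text):
        candidate = " ".join(current + [sentence])
        if current and len(candidate) > max_chars:
            chunks.append(" ".join(current))
            # Walk backwards through the closed chunk and keep whole
            # sentences until the overlap budget is used up.
            budget = int(max_chars * overlap_ratio)
            carried, used = [], 0
            for previous in reversed(current):
                if used + len(previous) > budget:
                    break
                carried.insert(0, previous)
                used += len(previous)
            current = carried
        current.append(sentence)
    if current:
        chunks.append(" ".join(current))
    return chunks


if __name__ == "__main__":
    sample = "One. Two. Three. Four."
    for chunk in chunk_sentences(sample, max_chars=10, overlap_ratio=0.5):
        print(repr(chunk))
```

Carrying over whole sentences rather than a fixed character count keeps each chunk readable on its own, at the cost of the overlap being only approximately the requested ratio.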