pdf-to-chroma
Hierarchical PDF ingestion for RAG
A RAG pipeline that converts PDF books into a searchable vector database while preserving document hierarchy. Chapter structure becomes queryable metadata.
Problem
PDFs are black boxes for LLMs. Naive chunking loses context—you get fragments without knowing which chapter or section they came from. “Summarize chapter 3” becomes impossible when your vectors don’t know what a chapter is.
Architecture
Multi-stage ingestion with hierarchy preservation:
┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│     PDF     │────▶│  Markdown   │────▶│  Chunks +   │
│   (Books)   │     │  (marker)   │     │  Hierarchy  │
└─────────────┘     └─────────────┘     └─────────────┘
                                               │
                                               ▼
┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│  ChromaDB   │◀────│  Voyage AI  │◀────│ LlamaIndex  │
│  (persist)  │     │   (embed)   │     │   (parse)   │
└─────────────┘     └─────────────┘     └─────────────┘
Desktop/Laptop Split: GPU-heavy PDF conversion runs on desktop, embedding and storage on laptop. rsync keeps them in sync.
# Desktop: Convert PDFs
just convert
# Laptop: Pull markdown, ingest to Chroma
just sync
just ingest
Key Pattern: Hierarchical Metadata
The HierarchicalMarkdownProcessor preserves structure:
# Each chunk carries its location in the document
metadata = {
    "book": "Design Patterns",
    "chapter": "3. Creational Patterns",
    "section": "3.1 Abstract Factory",
    "page_range": "87-95",
}
This enables:
- “Find mentions of factories in chapter 3”
- “Summarize the introduction”
- Precise citations with page numbers
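As a rough illustration of how this kind of metadata can be derived, the sketch below walks a markdown file and remembers the most recent headings while it splits the text. It is not the project's HierarchicalMarkdownProcessor: the `chunk_markdown` function and `Chunk` class are hypothetical names, and it assumes `#` headings mark chapters and `##` headings mark sections. Page ranges are left out because plain markdown does not carry them.

```python
import re
from dataclasses import dataclass, field

# Matches ATX headings such as "# Title" or "## Title".
HEADING = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


@dataclass
class Chunk:
    text: str
    metadata: dict = field(default_factory=dict)


def chunk_markdown(markdown: str, book: str) -> list[Chunk]:
    """Split at chapter/section headings and tag each chunk with its location.

    Assumption: level-1 headings are chapters, level-2 headings are sections.
    Limitation: "#" lines inside fenced code would be misread as headings.
    """
    chapter = None
    section = None
    buffer: list[str] = []
    chunks: list[Chunk] = []

    def flush() -> None:
        # Emit the buffered body text under the headings seen so far.
        text = "\n".join(buffer).strip()
        if text:
            meta = {"book": book}
            if chapter:
                meta["chapter"] = chapter
            if section:
                meta["section"] = section
            chunks.append(Chunk(text, meta))
        buffer.clear()

    for line in markdown.splitlines():
        match = HEADING.match(line)
        # Body lines and deeper headings (### and below) stay in the chunk text.
        if match is None or len(match.group(1)) > 2:
            buffer.append(line)
            continue
        flush()
        level, title = len(match.group(1)), match.group(2)
        if level == 1:
            chapter, section = title, None  # a new chapter resets the section
        else:
            section = title
    flush()
    return chunks


sample = (
    "# 3. Creational Patterns\n"
    "Intro text.\n"
    "## 3.1 Abstract Factory\n"
    "Factory text.\n"
)
for chunk in chunk_markdown(sample, book="Design Patterns"):
    print(chunk.metadata, "->", chunk.text)
```

Resetting `section` whenever a new chapter starts keeps a chapter's introduction from inheriting the previous chapter's last section. A real processor would likely also cap chunk size and split long sections further.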
Technologies
| Component | Tool | Purpose |
|---|---|---|
| PDF → Markdown | marker-pdf | High-quality conversion preserving structure |
| Parsing | LlamaIndex | Hierarchical markdown processing |
| Embeddings | Voyage AI | Semantic vector generation |
| Storage | ChromaDB | Persistent vector database |
Why This Exists
RAG quality depends on retrieval quality. Retrieval quality depends on chunking strategy. Most chunking strategies throw away the structure that makes documents navigable. This pipeline keeps the hierarchy as metadata, so retrieval can be scoped to a chapter or section and a request like “summarize chapter 3” actually works.
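To make the chapter-scoped retrieval idea concrete, here is a minimal sketch that filters chunks on metadata before ranking them by cosine similarity. It is a pure-Python stand-in, not the pipeline's code: `scoped_search`, the record layout, and the two-dimensional vectors are all assumptions for illustration (real vectors would come from the embedding model). ChromaDB offers a comparable metadata filter through the `where` argument of a collection query.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def scoped_search(records, query_vector, where, top_k=3):
    """Keep records whose metadata matches every key in `where`, then rank."""
    candidates = [
        r for r in records
        if all(r["metadata"].get(key) == value for key, value in where.items())
    ]
    ranked = sorted(
        candidates,
        key=lambda r: cosine(r["vector"], query_vector),
        reverse=True,
    )
    return ranked[:top_k]


# Toy records; the vectors are hand-written placeholders, not real embeddings.
records = [
    {"text": "Abstract Factory overview", "vector": [1.0, 0.0],
     "metadata": {"chapter": "3. Creational Patterns"}},
    {"text": "Builder overview", "vector": [0.6, 0.8],
     "metadata": {"chapter": "3. Creational Patterns"}},
    {"text": "Adapter overview", "vector": [1.0, 0.0],
     "metadata": {"chapter": "4. Structural Patterns"}},
]

hits = scoped_search(records, [1.0, 0.0], {"chapter": "3. Creational Patterns"})
for hit in hits:
    print(hit["text"])
```

Filtering first means a chunk from another chapter cannot outrank the requested chapter's content, however similar its vector is to the query.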