pdf-to-chroma

A RAG pipeline that converts PDF books into a searchable vector database while preserving document hierarchy. Chapter structure becomes queryable metadata.

Problem

PDFs are black boxes for LLMs. Naive chunking loses context: you get fragments without knowing which chapter or section they came from. “Summarize chapter 3” becomes impossible when your vectors don’t know what a chapter is.

Architecture

Multi-stage ingestion with hierarchy preservation:

┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│   PDF       │────▶│  Markdown   │────▶│  Chunks +   │
│  (Books)    │     │  (marker)   │     │  Hierarchy  │
└─────────────┘     └─────────────┘     └─────────────┘
                                               │
                                               ▼
┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│  ChromaDB   │◀────│  Voyage AI  │◀────│  LlamaIndex │
│  (persist)  │     │  (embed)    │     │  (parse)    │
└─────────────┘     └─────────────┘     └─────────────┘

Desktop/Laptop Split: GPU-heavy PDF conversion runs on the desktop; embedding and storage run on the laptop. rsync keeps the converted markdown in sync between the two machines.

# Desktop: Convert PDFs
just convert

# Laptop: Pull markdown, ingest to Chroma
just sync
just ingest
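
The laptop-side ingest step might look roughly like the sketch below: embed chunks in batches, then store text, vectors, and metadata together. This is an illustration under assumptions, not the project's actual code. The names ingest, batched, id_prefix, and batch_size are hypothetical, and the chunk shape (a dict with "text" and "metadata") is assumed. The embedder and collection are passed in so the logic can be exercised without network access.

```python
from typing import Callable, Iterator, Sequence


def batched(items: Sequence[dict], size: int) -> Iterator[Sequence[dict]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def ingest(
    chunks: Sequence[dict],
    embed: Callable[[list], list],
    collection,
    batch_size: int = 64,      # hypothetical default, tune to the embedding API limits
    id_prefix: str = "chunk",  # hypothetical ID scheme
) -> int:
    """Embed chunks batch by batch and add them to a Chroma-style collection.

    `collection` only needs an `add(ids=, documents=, embeddings=, metadatas=)`
    method, which is the shape of ChromaDB's Collection.add.
    """
    stored = 0
    for batch in batched(chunks, batch_size):
        texts = [chunk["text"] for chunk in batch]
        vectors = embed(texts)  # one vector per text, in the same order
        collection.add(
            ids=[f"{id_prefix}-{stored + i}" for i in range(len(batch))],
            documents=texts,
            embeddings=vectors,
            metadatas=[chunk["metadata"] for chunk in batch],
        )
        stored += len(batch)
    return stored


# Assumed wiring with the real libraries (model name, path, and collection
# name are placeholders, not taken from this project):
#
#   import chromadb, voyageai
#   vo = voyageai.Client()  # reads VOYAGE_API_KEY from the environment
#   embed = lambda texts: vo.embed(
#       texts, model="voyage-3", input_type="document"
#   ).embeddings
#   collection = chromadb.PersistentClient(path="./chroma").get_or_create_collection("books")
#   ingest(chunks, embed, collection)
```

Passing the embedder in as a plain callable keeps the storage logic independent of the embedding provider, which also makes it easy to test with a stub.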

Key Pattern: Hierarchical Metadata

The HierarchicalMarkdownProcessor preserves structure:

# Each chunk carries its location in the document
metadata = {
    "book": "Design Patterns",
    "chapter": "3. Creational Patterns",
    "section": "3.1 Abstract Factory",
    "page_range": "87-95"
}
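
How HierarchicalMarkdownProcessor works internally is not shown here, so the following is only a minimal standard-library sketch of the general idea: walk the markdown line by line, remember the most recent headings, and stamp each chunk with them. It assumes level-1 headings are chapters and level-2 headings are sections, which may not match what marker emits for a given book. The function name chunk_with_hierarchy is hypothetical.

```python
import re

# Matches ATX headings such as "# Title" or "## Title".
HEADING = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


def chunk_with_hierarchy(markdown: str, book: str) -> list:
    """Split markdown at headings and tag each chunk with its chapter/section."""
    chapter, section = "", ""
    buffer = []
    chunks = []

    def flush() -> None:
        # Emit the buffered text under the headings that were active for it.
        text = "\n".join(buffer).strip()
        if text:
            chunks.append({
                "text": text,
                "metadata": {"book": book, "chapter": chapter, "section": section},
            })
        buffer.clear()

    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match is None:
            buffer.append(line)
            continue
        flush()  # text so far belongs to the previous heading
        level, title = len(match.group(1)), match.group(2)
        if level == 1:        # assumption: "#" marks a chapter
            chapter, section = title, ""
        elif level == 2:      # assumption: "##" marks a section
            section = title
        else:
            buffer.append(line)  # deeper headings stay in the chunk text
    flush()
    return chunks
```

This sketch does not produce page_range, since page numbers are not recoverable from heading lines alone, and it does not skip "#" lines inside fenced code blocks. Either way, the point is the same: every chunk carries its place in the hierarchy as metadata.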

This enables:

“Find mentions of factories in chapter 3”
“Summarize the introduction”
Precise citations with page numbers
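
Queries like these can be expressed as a metadata filter combined with a similarity search. The sketch below builds a filter in the shape ChromaDB accepts for its where argument: a single field as a plain dict, several fields combined under "$and". The helper name build_where is hypothetical, and the metadata keys are the ones from the example above.

```python
from typing import Optional


def build_where(**fields: str) -> Optional[dict]:
    """Build a Chroma-style `where` filter from exact-match metadata fields.

    Empty values are ignored. Returns None when there is nothing to filter on.
    """
    conditions = [{key: value} for key, value in fields.items() if value]
    if not conditions:
        return None
    if len(conditions) == 1:
        return conditions[0]
    return {"$and": conditions}


# Assumed usage with an existing Chroma collection; `query_vector` would come
# from the same embedding model that was used at ingest time:
#
#   results = collection.query(
#       query_embeddings=[query_vector],
#       n_results=5,
#       where=build_where(book="Design Patterns", chapter="3. Creational Patterns"),
#   )
#   for text, meta in zip(results["documents"][0], results["metadatas"][0]):
#       print(meta["section"], meta["page_range"], text[:80])
```

Filtering first and ranking second is what lets a chapter-scoped request stay inside that chapter, and the returned metadata supplies the section and page range for citations.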

Technologies

Component	Tool	Purpose
PDF → Markdown	marker-pdf	High-quality conversion preserving structure
Parsing	LlamaIndex	Hierarchical markdown processing
Embeddings	Voyage AI	Semantic vector generation
Storage	ChromaDB	Persistent vector database

Current Status

Why This Exists

RAG quality depends on retrieval quality, and retrieval quality depends on chunking strategy. Most chunking strategies throw away the structure that makes documents navigable. This pipeline shows that you can keep the hierarchy: because every chunk is tagged with its chapter and section, a request like “summarize chapter 3” actually works, since retrieval can be scoped to the chunks that belong to that chapter.