λ ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ λ

✧ Homebrew RAG ✧

Andrew Gallant et al. (ripgrep)
λ ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ λ

The thesis

For a personal corpus, an embedding pipeline is overkill in nearly every case. The filesystem is the index; a controlled vocabulary is the schema; ripgrep is the query engine; the Linux page cache is the database. Discipline at write time is the consistency guarantee.

Three corpuses, three retrieval shapes, all on the same Talos host:

  1. Ripgrep-based — over my org notes (~/notes) and over the personal reference text corpus (~/ref/text/). Pure rg + consult, no embeddings, no daemons.
  2. PDF-based — academic papers under ~/ref/pdf/. pdftotext locally + rg as the default; sqlite-vec + BGE-M3 + a reranker as a deferred Layer 3, built only if keyword search demonstrably fails.
  3. Zotero workflow — the three-node star topology that feeds the paper corpus. Office curates, CHPC replicates, Talos reads. git-annex everywhere.

The PDF corpus and the rg corpuses live in different silos, and the silos map onto the gptel backend boundary. ~/ref/pdf/ is the workspace silo and only the Anthropic-backed gptel sees it; ~/ref/text/ and ~/notes are the personal silo and only the local gpt-oss-120b sees them. The path alone does not tell you which side of the boundary you are on — enforcement is at the tool surface, not the directory tree.

Ripgrep-based

The notes corpus

Org-mode + Denote at ~/notes. The architecture has five layers, each independently replaceable:

  1. Completion (minibuffer narrowing): vertico + orderless + marginalia + consult.
  2. Vocabulary (single source of truth for tags): ~/notes/00-vocabulary.org — one heading per tag domain, one bullet per tag with a one-line gloss.
  3. Maintenance (drift detection): a monthly audit walks the corpus and lists orphan tags (in corpus, not in vocabulary) and unused tags (in vocabulary, not in corpus).
  4. Query bindings (C-c n prefix map): find / heading / grep / tag / recent — see below.
  5. Habits (tag at write time, link liberally, act on the audit output): practice, not config. The system stays precise only because the writer keeps it precise.

Layer 2 is doing what BGE-M3 would have done. Tagging a Sanskrit-content note with :sanskrit: means a query of C-c n t sanskrit retrieves it regardless of whether the body language is English, Tamil, or Sanskrit. The English tag does the cross-language work. Embeddings would be more general — but for a personal corpus authored by one person, controlled vocabulary is dramatically less infrastructure for nearly identical recall.
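The shell-level equivalent is worth seeing once, because it shows how little machinery the tag layer needs. A sketch, assuming Denote's default convention of a #+filetags: line in each note's front matter; the tag is just the example from above:

# every note tagged :sanskrit:, whatever language the body is in
rg -l --glob '*.org' '^#\+filetags:.*:sanskrit:' ~/notes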

The query map under C-c n:

  Key        Action
  C-c n c    Notebook capture (append kill ring to notebook.org)
  C-c n f    Find note by title (denote-open-or-create)
  C-c n h    Heading search across the corpus (consult-org-heading, scoped recursively)
  C-c n g    Ripgrep across the corpus (consult-ripgrep)
  C-c n t    Find by tag (org-ql-search with (tags TAG); tag completes against the canonical vocabulary)
  C-c n r    Recent notes (org-ql-search with (ts :from -DAYS), default 14 days)
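C-c n r also has a trivial shell analogue, because Denote filenames begin with the note's creation timestamp (YYYYMMDDTHHMMSS--title__tags.org), so a reverse lexical sort is a reverse chronological sort. A sketch:

# the twenty newest notes, no org-ql required
rg --files --glob '*.org' ~/notes | sort -r | head -20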

The personal text corpus

Reference material at ~/ref/text/. All UTF-8, all rg-searchable. No binary formats, no databases, no indexing infrastructure beyond what the kernel page cache gives for free.

  Corpus      Size     Source / shape
  dicts/      ~3.5 GB  StarDict (6 languages) via pyglossary, plus Wiktionary (English, German, Tamil, Sanskrit, Japanese, Mandarin) extracted from kaikki.org Wiktextract. Entry format is "@ headword [POS]" on its own line.
  gutenberg/  ~15 GB   Plain-text literature from the Project Gutenberg rsync mirror. Filtered to UTF-8 .txt only.
  wikipedia/  ~22 GB   English Wikipedia plain text via wikiextractor on the latest enwiki-pages-articles dump.
  rfc/        ~1 GB    All numbered IETF RFCs from rfc-editor.org rsync. Already clean UTF-8.
  manpages/   ~200 MB  Rendered man pages from Debian + Guix profiles, via man -Tutf8 | col -bx.
  info/       ~100 MB  Texinfo manuals exported to text via info --output.
  unicode/    ~30 MB   Unicode Character Database — UnicodeData.txt, Blocks.txt, Scripts.txt, NameAliases.txt.

Total around 42 GB — about 7% of free RAM after the local LLM is loaded. Cold read from the Optane-backed disk takes ~17 seconds; after first touch, the kernel keeps it cached, and rg across the full corpus runs in under two seconds.
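The first touch doesn't have to be waited for. A minimal warm-up sketch, run once after boot:

# pull the whole corpus through the page cache (~17 s from Optane);
# every rg after this runs from RAM
find ~/ref/text/ -type f -exec cat {} + > /dev/null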

Conversion scripts (convert-stardict.sh, convert-wiktionary.py) are disposable — run once per quarterly refresh inside a temporary venv, then the venv is deleted. Nothing persists as infrastructure. The only permanent tools are rg, consult, rsync, curl, man, info — all already in Guix or the Debian base.
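The refresh itself, sketched; the venv path is arbitrary and the script invocations are the ones named above:

# quarterly refresh: tools exist only for the duration of the run
python3 -m venv /tmp/refresh && . /tmp/refresh/bin/activate
pip install pyglossary
./convert-stardict.sh
python convert-wiktionary.py
deactivate && rm -rf /tmp/refresh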

Emacs entry point is C-c d d (my/dict-lookup-at-point) which calls consult-ripgrep scoped to the corpus root. Optional per-corpus bindings (C-c d w for Wiktionary, C-c d p for Wikipedia, C-c d r for RFCs) get added only after demonstrated need.
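Under the binding it's an anchored rg over the entry format from the table above. A shell-level equivalent, with the headword as an invented example:

# '@ headword [POS]' on its own line makes lookups exact, not fuzzy
rg --no-heading '^@ dharma\b' ~/ref/text/dicts/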

PDF-based

Layer 1: filesystem

Academic papers under ~/ref/pdf/, organized as a git-annex working tree:

~/ref/pdf/
├── papers/                 # git-annex working tree (symlinks → annex objects)
├── references.bib          # symlink into refs-bib/ sibling repo
├── text/                   # pdftotext output, rebuilt locally
├── title-index.txt         # filename → title lookup
└── rag/                    # deferred Layer 3
    ├── chunks.db
    └── embed.py

Roughly 10k papers, 50 GB of PDFs. Authoritative copy is the git-annex repo; the working tree on each host is symlinks into content-addressable storage. numcopies=2 with three replicas (office, CHPC, Talos) — dropping content on Talos is always safe.
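"Always safe" is mechanical, not aspirational: git-annex refuses any drop that would violate numcopies. Reclaiming disk on Talos looks like this:

cd ~/ref/pdf/papers
# contacts the remotes and refuses to drop anything with fewer than
# two verified copies elsewhere
git annex drop .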

Layer 2: text + rg

Talos's 44 cores chew through pdftotext in about 15 minutes for 10k papers — what would be 8 hours on the office workstation. The output goes under ~/ref/pdf/text/ and is not replicated through the annex; replicating derived artifacts would double bandwidth for nothing. Each node regenerates its own.

find ~/ref/pdf/papers/ -name '*.pdf' -type f -print0 | \
    xargs -0 -P 32 -I{} sh -c '
        src="$1"
        # $HOME, not a quoted ~ (which the shell would not expand)
        txt="$HOME/ref/pdf/text/$(basename "${src%.pdf}").txt"
        # skip already-converted files; silence pdftotext chatter
        [ -f "$txt" ] || pdftotext "$src" "$txt" 2>/dev/null
    ' _ {}
find ~/ref/pdf/text/ -empty -delete

Image-only PDFs (scanned books, old journals) produce empty text files and get deleted. OCR recovery, when needed, is a separate pipeline with a vision-language model on demand — not part of the default flow.

Title index for human-readable rg results:

# crude but serviceable: the first five lines of the extracted text,
# flattened to one line, almost always contain the title
for f in ~/ref/pdf/text/*.txt; do
    title=$(head -5 "$f" | tr '\n' ' ' | sed 's/  */ /g' | cut -c1-120)
    printf "%s\t%s\n" "$(basename "$f")" "$title"
done > ~/ref/pdf/title-index.txt

Performance on warm cache: references.bib (~5 MB) — instant. title-index.txt (~2 MB) — instant. Full text (~3–5 GB) — under one second. For a "papers I cite but don't read" workflow over a corpus with stable terminology (economics, here), this is sufficient. The signal that you've outgrown it is repeated failures of the form "I know a paper exists in my corpus, I search for it, rg doesn't find it because the paper uses different terminology" — at which point Layer 3 earns its keep.
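In practice a lookup is two rg invocations, title index first, full text second. The query string is an invented example:

# stage 1: is the paper in the corpus at all?
rg -i 'shadow price' ~/ref/pdf/title-index.txt
# stage 2: which papers discuss it, and where?
rg -in 'shadow price' ~/ref/pdf/text/ | head -40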

Layer 3: deferred sqlite-vec + BGE-M3 + reranker

Built only if keyword search demonstrably fails, i.e. when the terminology-mismatch misses described above recur often enough to justify new infrastructure.

Architecture when built: chunk each paper at ~500 tokens with ~50 token overlap, embed with BGE-M3 (Q8_0, ~2 GB VRAM), store in sqlite-vec with metadata (filename, chunk_idx, preview, silo). Query path is embed → kNN against chunks_vec → rerank with bge-reranker-v2-m3 → return top-10 chunks as text.

CREATE TABLE chunks (
    id        INTEGER PRIMARY KEY,
    filename  TEXT NOT NULL,
    chunk_idx INTEGER NOT NULL,
    preview   TEXT NOT NULL,
    text      TEXT NOT NULL,
    silo      TEXT NOT NULL DEFAULT 'workspace'
);

CREATE INDEX idx_chunks_silo ON chunks(silo);

CREATE VIRTUAL TABLE chunks_vec USING vec0(
    id        INTEGER PRIMARY KEY,
    embedding FLOAT[1024]
);

The silo column is the load-bearing safety boundary at this layer. The corresponding gptel tool (rag_search) hard-codes WHERE silo='workspace' — silo is never a user-controllable parameter. If personal-silo material ever needs semantic search, it lives in the same DB with silo='personal' and is queried by a different tool, never commingled at query time.
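For concreteness, what the tool's query would look like through the sqlite3 CLI, rerank step omitted. The vec0 extension path and the embed-query helper (anything that prints the BGE-M3 query vector as a JSON array) are assumptions, not existing tools:

# hypothetical helper: prints the query embedding as a JSON array
qvec=$(embed-query "elasticity of substitution")
sqlite3 ~/ref/pdf/rag/chunks.db <<SQL
.load ./vec0
SELECT c.filename, c.chunk_idx, c.preview
FROM chunks_vec v JOIN chunks c ON c.id = v.id
WHERE v.embedding MATCH '$qvec' AND k = 10
  AND c.silo = 'workspace'   -- hard-coded, never a parameter
ORDER BY v.distance;
SQL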

This layer is Talos-only by design. Office has no GPU; CHPC is archival; the W7900 (48 GB VRAM) and the resident BGE-M3 model live on Talos. Embedding only makes sense where the embedder runs.

Zotero workflow (three nodes)

Three nodes, distinct roles, star topology through CHPC. Office and Talos never talk to each other directly; both go through CHPC. This matches the network reality — office is on the institutional network, Talos is at home, and they have no reason to trust each other.

  Node    Role               OS / arch                 Zotero  GPU
  Office  Writer / curator   Debian x86                yes     none
  CHPC    Replication hub    RHEL / institutional      no      n/a
  Talos   Reader / compute   Debian + Guix / ppc64le   no      W7900 48 GB
              [office workstation]
              Zotero + Zotmoov + BBT
              Claude Desktop (MCP filesystem)
                      │
                      │ git-annex sync --content
                      │ git push refs.bib
                      ▼
              [CHPC project storage]
              bare annex repo
              bare refs.bib repo
              (+ Zotero Sync independently, metadata only)
                      │
                      │ git-annex sync --content
                      │ git pull refs.bib
                      ▼
              [Talos home workstation]
              pdftotext → text/
              (optional) BGE-M3 embeddings → sqlite-vec
              Claude via gptel (workspace silo only)
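The star is plain git remotes; each leaf knows only CHPC. One-time wiring on office and on Talos, with the CHPC path as a placeholder:

# the host alias and repo path are placeholders, not real paths
git remote add chpc chpc:/projects/<group>/papers-annex.git
git annex sync --content chpc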

Replication policy

numcopies=2, enforced by git-annex across the three replicas: office,
CHPC, Talos. CHPC is the hub copy; Talos can drop content freely, since
a drop succeeds only when two verified copies remain elsewhere.

Office-side plugins

Office runs Zotero with PDF full-text indexing disabled (or capped at ~50k chars per file) — that's Talos's job. Zotero's full-text index would just duplicate what pdftotext does, on slower hardware.

Operational rhythm

  Cadence                What runs where
  Per-session (office)   Drag PDFs in from browser. Zotmoov relocates them. Better BibTeX keeps refs.bib current. Data sync happens automatically.
  Weekly (office)        git annex sync --content in ~/papers/; git push in ~/writing/. Pushes new PDFs and bib changes to CHPC; enforces numcopies=2.
  Monthly (Talos)        pdf-sync alias (annex sync + bib pull), then regenerate text/ and title-index.txt for new files. Combined with the notes-weeding pass.
  Monthly (office)       VACUUM zotero.sqlite; restic / borg snapshot to CHPC outside the annex; git annex fsck locally.
  Quarterly              git annex fsck on the CHPC remote — catches silent bit rot on institutional storage. Trustworthy because content is checksummed and content-addressable.
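The pdf-sync alias from the monthly Talos row, sketched; the refs-bib checkout location is an assumption:

# annex sync + bib pull, as one alias
pdf-sync() {
    (cd ~/ref/pdf/papers && git annex sync --content)
    (cd ~/ref/refs-bib && git pull)
}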

What this design is not

It's not a generic ingestion pipeline. The asymmetry — office curates, Talos computes — is by design. Replicating the embedding pipeline on x86 office hardware would gain nothing: office has no GPU, and economics terminology is stable enough that office's keyword search via MCP filesystem covers nearly all queries. When office-side work needs semantic retrieval (rare), the answer is "ssh to Talos, or ask Claude via gptel on Talos" — not "build a second RAG pipeline."