The thesis
For a personal corpus, an embedding pipeline is overkill in nearly every case. The filesystem is the index; a controlled vocabulary is the schema; ripgrep is the query engine; the Linux page cache is the database. Discipline at write time is the consistency guarantee.
Three corpuses, three retrieval shapes, all on the same Talos host:
- Ripgrep-based — over my org notes (~/notes) and over the personal reference text corpus (~/ref/text/). Pure rg + consult, no embeddings, no daemons.
- PDF-based — academic papers under ~/ref/pdf/. pdftotext locally + rg as the default; sqlite-vec + BGE-M3 + a reranker as a deferred Layer 3, built only if keyword search demonstrably fails.
- Zotero workflow — the three-node star topology that feeds the paper corpus. Office curates, CHPC replicates, Talos reads. git-annex everywhere.
The PDF and text corpuses live in different silos that map onto the gptel backend boundary. ~/ref/pdf/ is the workspace silo, and only the Anthropic-backed gptel sees it; ~/ref/text/ and ~/notes are the personal silo, and only the local gpt-oss-120b sees them. The path alone does not tell you which side of the boundary you are on — enforcement happens at the tool surface, not in the directory tree.
Ripgrep-based
The notes corpus
Org-mode + Denote at ~/notes. The architecture has five layers, each independently replaceable:
| Layer | What | Mechanism |
|---|---|---|
| 1. Completion | Minibuffer narrowing | vertico + orderless + marginalia + consult |
| 2. Vocabulary | Single source of truth for tags | ~/notes/00-vocabulary.org — one heading per tag domain, one bullet per tag with a one-line gloss |
| 3. Maintenance | Drift detection | Monthly audit walks the corpus, lists orphan tags (in corpus, not in vocabulary) and unused tags (in vocabulary, not in corpus) |
| 4. Query bindings | C-c n prefix-map | find / heading / grep / tag / recent — see below |
| 5. Habits | Tag at write time, link liberally, act on the audit output | Practice, not config. The system stays precise only because the writer keeps it precise. |
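Layer 3's audit reduces to a set difference over two tag lists. A minimal sketch, assuming Denote-style filenames (id--title__tag1_tag2.org) and one "- tag :: gloss" bullet per vocabulary entry; the mktemp fixture stands in for ~/notes:

```shell
# Fixture corpus: one vocabulary file, one tagged note.
notes=$(mktemp -d)
cat > "$notes/00-vocabulary.org" <<'EOF'
* Language
- sanskrit :: notes in or about Sanskrit
- tamil :: notes in or about Tamil
EOF
touch "$notes/20240101T090000--sandhi__sanskrit_grammar.org"

# Tags actually used, pulled from the __tag1_tag2 filename segment.
used=$(ls "$notes" | grep -o '__[a-z_]*' | tr '_' '\n' | sed '/^$/d' | sort -u)
# Tags declared in the vocabulary bullets.
declared=$(sed -n 's/^- \([a-z]*\) ::.*/\1/p' "$notes/00-vocabulary.org" | sort -u)

# Set differences, exactly as the audit reports them.
orphans=$(comm -23 <(printf '%s\n' "$used") <(printf '%s\n' "$declared"))
unused=$(comm -13 <(printf '%s\n' "$used") <(printf '%s\n' "$declared"))
echo "orphan tags (in corpus, not vocabulary): $orphans"
echo "unused tags (in vocabulary, not corpus): $unused"
```

In the fixture, grammar is used but never declared (orphan) and tamil is declared but never used (unused) — the two failure modes the monthly pass acts on.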
Layer 2 is doing what BGE-M3 would have done. Tagging a Sanskrit-content note with :sanskrit: means a query of C-c n t sanskrit retrieves it regardless of whether the body language is English, Tamil, or Sanskrit. The English tag does the cross-language work. Embeddings would be more general — but for a personal corpus authored by one person, controlled vocabulary is dramatically less infrastructure for nearly identical recall.
The query map under C-c n:
| Key | Action |
|---|---|
| C-c n c | Notebook capture (append kill ring to notebook.org) |
| C-c n f | Find note by title (denote-open-or-create) |
| C-c n h | Heading search across the corpus (consult-org-heading, scoped recursively) |
| C-c n g | Ripgrep across the corpus (consult-ripgrep) |
| C-c n t | Find by tag (org-ql-search with (tags TAG); tag completes against the canonical vocabulary) |
| C-c n r | Recent notes (org-ql-search with (ts :from -DAYS), default 14 days) |
The personal text corpus
Reference material at ~/ref/text/. All UTF-8, all rg-searchable. No binary formats, no databases, no indexing infrastructure beyond what the kernel page cache gives for free.
| Corpus | Size | Source / shape |
|---|---|---|
| dicts/ | ~3.5 GB | StarDict (6 languages) via pyglossary, plus Wiktionary (English, German, Tamil, Sanskrit, Japanese, Mandarin) extracted from kaikki.org Wiktextract. Entry format is @ headword [POS] on its own line. |
| gutenberg/ | ~15 GB | Plain-text literature from the Project Gutenberg rsync mirror. Filtered to UTF-8 .txt only. |
| wikipedia/ | ~22 GB | English Wikipedia plain text via wikiextractor on the latest enwiki-pages-articles dump. |
| rfc/ | ~1 GB | All numbered IETF RFCs from rfc-editor.org rsync. Already clean UTF-8. |
| manpages/ | ~200 MB | Rendered man pages from Debian + Guix profiles, via man -Tutf8 \| col -bx. |
| info/ | ~100 MB | Texinfo manuals exported to text via info --output. |
| unicode/ | ~30 MB | Unicode Character Database — UnicodeData.txt, Blocks.txt, Scripts.txt, NameAliases.txt. |
Total around 42 GB — about 7% of free RAM after the local LLM is loaded. Cold read from the Optane-backed disk takes ~17 seconds; after first touch, the kernel keeps it cached, and rg across the full corpus runs in under two seconds.
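The cold-read cost can be paid once, deliberately, instead of on the first query of the day. A sketch of a boot-time warm-up; the mktemp fixture stands in for ~/ref/text/, and grep stands in for rg:

```shell
# One sequential pass pulls the tree into the kernel page cache;
# every grep/rg after that is RAM-speed. Fixture corpus for illustration.
corpus=$(mktemp -d)                 # point at ~/ref/text/ in practice
printf 'sandhi appears here\n' > "$corpus/sample.txt"
find "$corpus" -type f -print0 | xargs -0 cat > /dev/null   # warm-up pass
hits=$(grep -rl 'sandhi' "$corpus" | wc -l)                 # now warm
echo "$hits"
```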
Conversion scripts (convert-stardict.sh, convert-wiktionary.py) are disposable — run once per quarterly refresh inside a temporary venv, then the venv is deleted. Nothing persists as infrastructure. The only permanent tools are rg, consult, rsync, curl, man, info — all already in Guix or the Debian base.
Emacs entry point is C-c d d (my/dict-lookup-at-point) which calls consult-ripgrep scoped to the corpus root. Optional per-corpus bindings (C-c d w for Wiktionary, C-c d p for Wikipedia, C-c d r for RFCs) get added only after demonstrated need.
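Under the hood, C-c d d is just an anchored regex over the entry-header format. A sketch with grep standing in for rg (same pattern), over a fixture entry in place of the real dicts/ tree:

```shell
# "@ headword [POS]" on its own line is the format the converters emit;
# anchoring on it turns a plain text search into a headword index.
dicts=$(mktemp -d)                  # fixture standing in for ~/ref/text/dicts/
cat > "$dicts/wiktionary-en.txt" <<'EOF'
@ sandhi [noun]
(linguistics) Sound change at morpheme or word boundaries.
@ sandbar [noun]
A ridge of sand built up by currents.
EOF

lookup() {  # lookup HEADWORD — prints the entry header plus one definition line
  grep -rh -A 1 "^@ $1 \[" "$dicts"
}
result=$(lookup sandhi)
echo "$result"
```

The anchored `^@ ` prefix is what keeps a lookup for sandhi from also returning every definition body that merely mentions the word.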
PDF-based
Layer 1: filesystem
Academic papers under ~/ref/pdf/, organized as a git-annex working tree:
~/ref/pdf/
├── papers/ # git-annex working tree (symlinks → annex objects)
├── references.bib # symlink into refs-bib/ sibling repo
├── text/ # pdftotext output, rebuilt locally
├── title-index.txt # filename → title lookup
└── rag/ # deferred Layer 3
├── chunks.db
└── embed.py
Roughly 10k papers, 50 GB of PDFs. Authoritative copy is the git-annex repo; the working tree on each host is symlinks into content-addressable storage. numcopies=2 with three replicas (office, CHPC, Talos) — dropping content on Talos is always safe.
Layer 2: text + rg
Talos's 44 cores chew through pdftotext in about 15 minutes for 10k papers — what would be 8 hours on the office workstation. The output goes under ~/ref/pdf/text/ and is not replicated through the annex; replicating derived artifacts would double bandwidth for nothing. Each node regenerates its own.
find ~/ref/pdf/papers/ -name '*.pdf' -type f -print0 | \
xargs -0 -P 32 -I{} sh -c '
src="$1"
txt="$HOME/ref/pdf/text/$(basename "${src%.pdf}").txt"
[ -f "$txt" ] || pdftotext "$src" "$txt" 2>/dev/null
' _ {}
find ~/ref/pdf/text/ -empty -delete
Image-only PDFs (scanned books, old journals) produce empty text files and get deleted. OCR recovery, when needed, is a separate pipeline with a vision-language model on demand — not part of the default flow.
Title index for human-readable rg results:
for f in ~/ref/pdf/text/*.txt; do
title=$(head -5 "$f" | tr '\n' ' ' | tr -s ' ' | cut -c1-120)
printf "%s\t%s\n" "$(basename "$f")" "$title"
done > ~/ref/pdf/title-index.txt
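One payoff of keeping the index tab-separated: search hits can be decorated with titles in a line of shell. A sketch over a fixture (the filename smith2019 and its title are invented for illustration; grep -rl stands in for `rg -l`):

```shell
# Fixture standing in for ~/ref/pdf/: one extracted text file and its
# title-index.txt row (filename<TAB>title).
root=$(mktemp -d); mkdir "$root/text"
printf 'We study instrumental variable methods.\n' > "$root/text/smith2019.txt"
printf 'smith2019.txt\tIV Estimation in Panel Data\n' > "$root/title-index.txt"

# Map each full-text hit to its human-readable index row.
decorated=$(grep -rl 'instrumental variable' "$root/text" | while read -r hit; do
  grep -F "$(basename "$hit")" "$root/title-index.txt"
done)
echo "$decorated"
```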
Performance on warm cache: references.bib (~5 MB) — instant. title-index.txt (~2 MB) — instant. Full text (~3–5 GB) — under one second. For a "papers I cite but don't read" workflow over a corpus with stable terminology (economics, here), this is sufficient. The signal that you've outgrown it is repeated failures of the form "I know a paper exists in my corpus, I search for it, rg doesn't find it because the paper uses different terminology" — at which point Layer 3 earns its keep.
Layer 3: deferred sqlite-vec + BGE-M3 + reranker
Built only if keyword search demonstrably fails. Three signals warrant building it:
- "Papers with a similar identification strategy to mine" — conceptual, not lexical. rg can't match this.
- Cross-subfield search where terminology diverges (e.g., "endogeneity" in economics vs. "confounding" in statistics).
- Systematic literature review with recall requirements.
Architecture when built: chunk each paper at ~500 tokens with ~50 token overlap, embed with BGE-M3 (Q8_0, ~2 GB VRAM), store in sqlite-vec with metadata (filename, chunk_idx, preview, silo). Query path is embed → kNN against chunks_vec → rerank with bge-reranker-v2-m3 → return top-10 chunks as text.
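The chunking step is the only part of that path with a real parameter choice. A word-count approximation of the 500/50 scheme — real tokenization would use the BGE-M3 tokenizer; whitespace words merely stand in for tokens here:

```shell
chunk() {  # chunk FILE SIZE OVERLAP — prints one chunk per output line
  tr -s '[:space:]' ' ' < "$1" | tr ' ' '\n' | awk -v n="$2" -v o="$3" '
    { w[NR] = $0 }
    END {
      step = n - o                      # 500 - 50 = 450 in the real config
      for (s = 1; s <= NR; s += step) {
        line = ""
        for (i = s; i < s + n && i <= NR; i++)
          line = line (i > s ? " " : "") w[i]
        print line
        if (s + n > NR) break           # last window reached the end
      }
    }'
}

# Tiny demo: 8 words, window 5, overlap 2 -> two overlapping chunks.
demo=$(mktemp)
printf 'one two three four five six seven eight\n' > "$demo"
chunk "$demo" 5 2
# -> one two three four five
# -> four five six seven eight
```

The overlap is what keeps a sentence that straddles a chunk boundary retrievable from at least one side.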
CREATE TABLE chunks (
id INTEGER PRIMARY KEY,
filename TEXT NOT NULL,
chunk_idx INTEGER NOT NULL,
preview TEXT NOT NULL,
text TEXT NOT NULL,
silo TEXT NOT NULL DEFAULT 'workspace'
);
CREATE INDEX idx_chunks_silo ON chunks(silo);
CREATE VIRTUAL TABLE chunks_vec USING vec0(
id INTEGER PRIMARY KEY,
embedding FLOAT[1024]
);
The silo column is the load-bearing safety boundary at this layer. The corresponding gptel tool (rag_search) hard-codes WHERE silo='workspace' — silo is never a user-controllable parameter. If personal-silo material ever needs semantic search, it lives in the same DB with silo='personal' and is queried by a different tool, never commingled at query time.
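A sketch of the one query shape rag_search would issue, assuming sqlite-vec's kNN MATCH syntax; k is over-fetched because the silo filter applies after the vector stage has ranked its candidates:

```sql
-- Hypothetical rag_search query. :query_embedding is the BGE-M3 vector
-- of the question; the silo value is hard-coded, never a parameter.
SELECT c.filename, c.chunk_idx, c.preview
FROM chunks_vec v
JOIN chunks c ON c.id = v.id
WHERE v.embedding MATCH :query_embedding
  AND k = 50                 -- over-fetch before the silo filter
  AND c.silo = 'workspace'   -- the load-bearing WHERE clause
ORDER BY v.distance
LIMIT 10;
```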
This layer is Talos-only by design. Office has no GPU; CHPC is archival; the W7900 (48 GB VRAM) and the resident BGE-M3 model live on Talos. Embedding only makes sense where the embedder runs.
Zotero workflow (three nodes)
Three nodes, distinct roles, star topology through CHPC. Office and Talos never talk to each other directly; both go through CHPC. This matches the network reality — office is on the institutional network, Talos is at home, and they have no reason to trust each other.
| Node | Role | OS / arch | Zotero | GPU |
|---|---|---|---|---|
| Office | Writer / curator | Debian x86 | yes | none |
| CHPC | Replication hub | RHEL / institutional | no | n/a |
| Talos | Reader / compute | Debian + Guix / ppc64le | no | W7900 48 GB |
[office workstation]
Zotero + Zotmoov + BBT
Claude Desktop (MCP filesystem)
│
│ git-annex sync --content
│ git push refs.bib
▼
[CHPC project storage]
bare annex repo
bare refs.bib repo
(+ Zotero Sync independently, metadata only)
│
│ git-annex sync --content
│ git pull refs.bib
▼
[Talos home workstation]
pdftotext → text/
(optional) BGE-M3 embeddings → sqlite-vec
Claude via gptel (workspace silo only)
Four layers of the workflow
- Filesystem layer. ~/papers/ on office, ~/ref/pdf/papers/ on Talos. PDFs organized as <journal>/<year>/<citekey>.pdf. Authoritative copy is the git-annex repo; working trees on each host are symlinks into content-addressable storage.
- Metadata layer. Zotero on office, with linked attachments pointing into ~/papers/. Items, tags, collections, notes, annotations, citation keys. Synced to zotero.org as a data-only off-site backup. Talos never sees Zotero directly.
- Citation layer. Better BibTeX auto-exports ~/writing/refs.bib on office. A small sibling git repo pushes that file to CHPC. Talos pulls and symlinks it to ~/ref/pdf/references.bib. Single source of truth for citation keys.
- Search / retrieval layer. Claude Desktop on office uses MCP filesystem + ripgrep against ~/papers/. Talos uses rg over pdftotext output, optionally Layer 3 above.
Replication policy
- git-annex numcopies=2 floor. With three active replicas (office, CHPC, Talos), the invariant holds even if any one host is offline or being rebuilt.
- Zotero cloud is not counted as a replica for PDFs. It holds metadata only.
- restic / borg snapshots of zotero.sqlite land on CHPC in a path outside the papers annex. The database is the irreplaceable artifact (tags, notes, citation keys, annotations); treat its backup with more care than the PDFs themselves.
Office-side plugins
- Zotmoov — replaces the deprecated ZotFile for Zotero 7. Moves attachments out of storage/ into ~/papers/ on import, deriving the path from item metadata.
- Better BibTeX — exports refs.bib on every change, with stable citation keys.
Office runs Zotero with PDF full-text indexing disabled (or capped at ~50k chars per file) — that's Talos's job. Zotero's full-text index would just duplicate what pdftotext does, on slower hardware.
Operational rhythm
| Cadence | What runs where |
|---|---|
| Per-session (office) | Drag PDFs in from browser. Zotmoov relocates them. Better BibTeX keeps refs.bib current. Data sync happens automatically. |
| Weekly (office) | git annex sync --content in ~/papers/; git push in ~/writing/. Pushes new PDFs and bib changes to CHPC; enforces numcopies=2. |
| Monthly (Talos) | pdf-sync alias (annex sync + bib pull) + regenerate text/ and title-index.txt for new files. Combined with the notes-weeding pass. |
| Monthly (office) | VACUUM zotero.sqlite; restic / borg snapshot to CHPC outside the annex; git annex fsck locally. |
| Quarterly | git annex fsck on the CHPC remote — catches silent bit rot on institutional storage. Trustworthy because content is checksummed and content-addressable. |
What this design is not
It's not a generic ingestion pipeline. The asymmetry — office curates, Talos computes — is by design. Replicating the embedding pipeline on x86 office hardware would gain nothing: office has no GPU, and economics terminology is stable enough that office's keyword search via MCP filesystem covers nearly all queries. When office-side work needs semantic retrieval (rare), the answer is "ssh to Talos, or ask Claude via gptel on Talos" — not "build a second RAG pipeline."