The thesis
For a personal corpus, an embedding pipeline is overkill in nearly every case. The filesystem is the index; a controlled vocabulary is the schema; ripgrep is the query engine; the Linux page cache is the database. Discipline at write time is the consistency guarantee.
Three corpuses, three retrieval shapes, all on the same Talos host:
- Ripgrep-based — over my org notes (~/notes) and over the personal reference text corpus (~/ref/text/). Pure rg + consult, no embeddings, no daemons.
- PDF-based — academic papers under ~/ref/pdf/. pdftotext locally + rg as the default; sqlite-vec + BGE-M3 + a reranker as a deferred Layer 3, built only if keyword search demonstrably fails.
- Zotero workflow — the three-node star topology that feeds the paper corpus. Office curates, CHPC replicates, Talos reads. git-annex everywhere.
The PDF and text corpuses live in different silos that map onto the gptel backend boundary. ~/ref/pdf/ is the workspace silo, and only the Anthropic-backed gptel sees it; ~/ref/text/ and ~/notes are the personal silo, and only the local gpt-oss-120b sees them. The path alone does not tell you which side of the boundary you are on — enforcement happens at the tool surface, not in the directory tree.
Ripgrep-based
The notes corpus
Org-mode + Denote at ~/notes. The architecture has five layers, each independently replaceable:
| Layer | What | Mechanism |
|---|---|---|
| 1. Completion | Minibuffer narrowing | vertico + orderless + marginalia + consult |
| 2. Vocabulary | Single source of truth for tags | ~/notes/00-vocabulary.org — one heading per tag domain, one bullet per tag with a one-line gloss |
| 3. Maintenance | Drift detection | Monthly audit walks the corpus, lists orphan tags (in corpus, not in vocabulary) and unused tags (in vocabulary, not in corpus) |
| 4. Query bindings | C-c n prefix-map | find / heading / grep / tag / recent — see below |
| 5. Habits | Tag at write time, link liberally, act on the audit output | Practice, not config. The system stays precise only because the writer keeps it precise. |
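Layer 3's audit reduces to a set difference over two tag lists. A minimal sketch, assuming Denote-style filenames (id--title__tag1_tag2.org) and one "- tag :: gloss" bullet per vocabulary entry; the mktemp fixture stands in for ~/notes:

```shell
# Fixture corpus: one vocabulary file, one tagged note.
notes=$(mktemp -d)
cat > "$notes/00-vocabulary.org" <<'EOF'
* Language
- sanskrit :: notes in or about Sanskrit
- tamil :: notes in or about Tamil
EOF
touch "$notes/20240101T090000--sandhi__sanskrit_grammar.org"

# Tags actually used, pulled from the __tag1_tag2 filename segment.
used=$(ls "$notes" | grep -o '__[a-z_]*' | tr '_' '\n' | sed '/^$/d' | sort -u)
# Tags declared in the vocabulary bullets.
declared=$(sed -n 's/^- \([a-z]*\) ::.*/\1/p' "$notes/00-vocabulary.org" | sort -u)

# Set differences, exactly as the audit reports them.
orphans=$(comm -23 <(printf '%s\n' "$used") <(printf '%s\n' "$declared"))
unused=$(comm -13 <(printf '%s\n' "$used") <(printf '%s\n' "$declared"))
echo "orphan tags (in corpus, not vocabulary): $orphans"
echo "unused tags (in vocabulary, not corpus): $unused"
```

In the fixture, grammar is used but never declared (orphan) and tamil is declared but never used (unused) — the two failure modes the monthly pass acts on.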
Layer 2 is doing what BGE-M3 would have done. Tagging a Sanskrit-content note with :sanskrit: means a query of C-c n t sanskrit retrieves it regardless of whether the body language is English, Tamil, or Sanskrit. The English tag does the cross-language work. Embeddings would be more general — but for a personal corpus authored by one person, controlled vocabulary is dramatically less infrastructure for nearly identical recall.
The query map under C-c n:
| Key | Action |
|---|---|
| C-c n c | Notebook capture (append kill ring to notebook.org) |
| C-c n f | Find note by title (denote-open-or-create) |
| C-c n h | Heading search across the corpus (consult-org-heading, scoped recursively) |
| C-c n g | Ripgrep across the corpus (consult-ripgrep) |
| C-c n t | Find by tag (org-ql-search with (tags TAG); tag completes against the canonical vocabulary) |
| C-c n r | Recent notes (org-ql-search with (ts :from -DAYS), default 14 days) |
The personal text corpus
Reference material at ~/ref/text/. All UTF-8, all rg-searchable. No binary formats, no databases, no indexing infrastructure beyond what the kernel page cache gives for free.
| Corpus | Size | Source / shape |
|---|---|---|
| dicts/ | ~3.5 GB | StarDict (6 languages) via pyglossary, plus Wiktionary (English, German, Tamil, Sanskrit, Japanese, Mandarin) extracted from kaikki.org Wiktextract. Entry format is @ headword [POS] on its own line. |
| gutenberg/ | ~15 GB | Plain-text literature from the Project Gutenberg rsync mirror. Filtered to UTF-8 .txt only. |
| wikipedia/ | ~22 GB | English Wikipedia plain text via wikiextractor on the latest enwiki-pages-articles dump. |
| rfc/ | ~1 GB | All numbered IETF RFCs from rfc-editor.org rsync. Already clean UTF-8. |
| manpages/ | ~200 MB | Rendered man pages from Debian + Guix profiles, via man -Tutf8 \| col -bx. |
| info/ | ~100 MB | Texinfo manuals exported to text via info --output. |
| unicode/ | ~30 MB | Unicode Character Database — UnicodeData.txt, Blocks.txt, Scripts.txt, NameAliases.txt. |
Total around 42 GB — about 7% of free RAM after the local LLM is loaded. Cold read from the Optane-backed disk takes ~17 seconds; after first touch, the kernel keeps it cached, and rg across the full corpus runs in under two seconds.
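The cold-read cost can be paid once, deliberately, instead of on the first query of the day. A sketch of a boot-time warm-up; the mktemp fixture stands in for ~/ref/text/, and grep stands in for rg:

```shell
# One sequential pass pulls the tree into the kernel page cache;
# every grep/rg after that is RAM-speed. Fixture corpus for illustration.
corpus=$(mktemp -d)                 # point at ~/ref/text/ in practice
printf 'sandhi appears here\n' > "$corpus/sample.txt"
find "$corpus" -type f -print0 | xargs -0 cat > /dev/null   # warm-up pass
hits=$(grep -rl 'sandhi' "$corpus" | wc -l)                 # now warm
echo "$hits"
```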
Conversion scripts (convert-stardict.sh, convert-wiktionary.py) are disposable — run once per quarterly refresh inside a temporary venv, then the venv is deleted. Nothing persists as infrastructure. The only permanent tools are rg, consult, rsync, curl, man, info — all already in Guix or the Debian base.
Emacs entry point is C-c d d (my/dict-lookup-at-point) which calls consult-ripgrep scoped to the corpus root. Optional per-corpus bindings (C-c d w for Wiktionary, C-c d p for Wikipedia, C-c d r for RFCs) get added only after demonstrated need.
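Under the hood, C-c d d is just an anchored regex over the entry-header format. A sketch with grep standing in for rg (same pattern), over a fixture entry in place of the real dicts/ tree:

```shell
# "@ headword [POS]" on its own line is the format the converters emit;
# anchoring on it turns a plain text search into a headword index.
dicts=$(mktemp -d)                  # fixture standing in for ~/ref/text/dicts/
cat > "$dicts/wiktionary-en.txt" <<'EOF'
@ sandhi [noun]
(linguistics) Sound change at morpheme or word boundaries.
@ sandbar [noun]
A ridge of sand built up by currents.
EOF

lookup() {  # lookup HEADWORD — prints the entry header plus one definition line
  grep -rh -A 1 "^@ $1 \[" "$dicts"
}
result=$(lookup sandhi)
echo "$result"
```

The anchored `^@ ` prefix is what keeps a lookup for sandhi from also returning every definition body that merely mentions the word.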
PDF-based
Layer 1: filesystem
Academic papers under ~/ref/pdf/, organized as a git-annex working tree:
~/ref/pdf/
├── papers/ # git-annex working tree (symlinks → annex objects)
├── references.bib # symlink into refs-bib/ sibling repo
├── text/ # pdftotext output, rebuilt locally
├── title-index.txt # filename → title lookup
└── rag/ # deferred Layer 3
├── chunks.db
└── embed.py
Roughly 10k papers, 50 GB of PDFs. Authoritative copy is the git-annex repo; the working tree on each host is symlinks into content-addressable storage. numcopies=2 with three replicas (office, CHPC, Talos) — dropping content on Talos is always safe.
Layer 2: text + rg
Talos's 44 cores chew through pdftotext in about 15 minutes for 10k papers — what would be 8 hours on the office workstation. The output goes under ~/ref/pdf/text/ and is not replicated through the annex; replicating derived artifacts would double bandwidth for nothing. Each node regenerates its own.
find ~/ref/pdf/papers/ -name '*.pdf' -type f -print0 | \
xargs -0 -P 32 -I{} sh -c '
src="$1"
txt="$HOME/ref/pdf/text/$(basename "${src%.pdf}").txt"
[ -f "$txt" ] || pdftotext "$src" "$txt" 2>/dev/null
' _ {}
find ~/ref/pdf/text/ -empty -delete
Image-only PDFs (scanned books, old journals) produce empty text files and get deleted. OCR recovery, when needed, is a separate pipeline with a vision-language model on demand — not part of the default flow.
Title index for human-readable rg results:
for f in ~/ref/pdf/text/*.txt; do
title=$(head -5 "$f" | tr '\n' ' ' | tr -s ' ' | cut -c1-120)
printf "%s\t%s\n" "$(basename "$f")" "$title"
done > ~/ref/pdf/title-index.txt
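One payoff of keeping the index tab-separated: search hits can be decorated with titles in a line of shell. A sketch over a fixture (the filename smith2019 and its title are invented for illustration; grep -rl stands in for `rg -l`):

```shell
# Fixture standing in for ~/ref/pdf/: one extracted text file and its
# title-index.txt row (filename<TAB>title).
root=$(mktemp -d); mkdir "$root/text"
printf 'We study instrumental variable methods.\n' > "$root/text/smith2019.txt"
printf 'smith2019.txt\tIV Estimation in Panel Data\n' > "$root/title-index.txt"

# Map each full-text hit to its human-readable index row.
decorated=$(grep -rl 'instrumental variable' "$root/text" | while read -r hit; do
  grep -F "$(basename "$hit")" "$root/title-index.txt"
done)
echo "$decorated"
```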
Performance on warm cache: references.bib (~5 MB) — instant. title-index.txt (~2 MB) — instant. Full text (~3–5 GB) — under one second. For a "papers I cite but don't read" workflow over a corpus with stable terminology (economics, here), this is sufficient. The signal that you've outgrown it is repeated failures of the form "I know a paper exists in my corpus, I search for it, rg doesn't find it because the paper uses different terminology" — at which point Layer 3 earns its keep.
Layer 3: deferred sqlite-vec + BGE-M3 + reranker
Built only if keyword search demonstrably fails. Three signals warrant building it:
- "Papers with a similar identification strategy to mine" — conceptual, not lexical. rg can't match this.
- Cross-subfield search where terminology diverges (e.g., "endogeneity" in economics vs. "confounding" in statistics).
- Systematic literature review with recall requirements.
Architecture when built: chunk each paper at ~500 tokens with ~50 token overlap, embed with BGE-M3 (Q8_0, ~2 GB VRAM), store in sqlite-vec with metadata (filename, chunk_idx, preview, silo). Query path is embed → kNN against chunks_vec → rerank with bge-reranker-v2-m3 → return top-10 chunks as text.
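The chunking step is the only part of that path with a real parameter choice. A word-count approximation of the 500/50 scheme — real tokenization would use the BGE-M3 tokenizer; whitespace words merely stand in for tokens here:

```shell
chunk() {  # chunk FILE SIZE OVERLAP — prints one chunk per output line
  tr -s '[:space:]' ' ' < "$1" | tr ' ' '\n' | awk -v n="$2" -v o="$3" '
    { w[NR] = $0 }
    END {
      step = n - o                      # 500 - 50 = 450 in the real config
      for (s = 1; s <= NR; s += step) {
        line = ""
        for (i = s; i < s + n && i <= NR; i++)
          line = line (i > s ? " " : "") w[i]
        print line
        if (s + n > NR) break           # last window reached the end
      }
    }'
}

# Tiny demo: 8 words, window 5, overlap 2 -> two overlapping chunks.
demo=$(mktemp)
printf 'one two three four five six seven eight\n' > "$demo"
chunk "$demo" 5 2
# -> one two three four five
# -> four five six seven eight
```

The overlap is what keeps a sentence that straddles a chunk boundary retrievable from at least one side.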
CREATE TABLE chunks (
id INTEGER PRIMARY KEY,
filename TEXT NOT NULL,
chunk_idx INTEGER NOT NULL,
preview TEXT NOT NULL,
text TEXT NOT NULL,
silo TEXT NOT NULL DEFAULT 'workspace'
);
CREATE INDEX idx_chunks_silo ON chunks(silo);
CREATE VIRTUAL TABLE chunks_vec USING vec0(
id INTEGER PRIMARY KEY,
embedding FLOAT[1024]
);
The silo column is the load-bearing safety boundary at this layer. The corresponding gptel tool (rag_search) hard-codes WHERE silo='workspace' — silo is never a user-controllable parameter. If personal-silo material ever needs semantic search, it lives in the same DB with silo='personal' and is queried by a different tool, never commingled at query time.
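A sketch of the one query shape rag_search would issue, assuming sqlite-vec's kNN MATCH syntax; k is over-fetched because the silo filter applies after the vector stage has ranked its candidates:

```sql
-- Hypothetical rag_search query. :query_embedding is the BGE-M3 vector
-- of the question; the silo value is hard-coded, never a parameter.
SELECT c.filename, c.chunk_idx, c.preview
FROM chunks_vec v
JOIN chunks c ON c.id = v.id
WHERE v.embedding MATCH :query_embedding
  AND k = 50                 -- over-fetch before the silo filter
  AND c.silo = 'workspace'   -- the load-bearing WHERE clause
ORDER BY v.distance
LIMIT 10;
```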
This layer is Talos-only by design. Office has no GPU; CHPC is archival; the W7900 (48 GB VRAM) and the resident BGE-M3 model live on Talos. Embedding only makes sense where the embedder runs.
Zotero workflow (three nodes)
Three nodes, distinct roles, star topology through CHPC. Office and Talos never talk to each other directly; both go through CHPC. This matches the network reality — office is on the institutional network, Talos is at home, and they have no reason to trust each other.
| Node | Role | OS / arch | Zotero | GPU |
|---|---|---|---|---|
| Office | Writer / curator | Debian x86 | yes | none |
| CHPC | Replication hub | RHEL / institutional | no | n/a |
| Talos | Reader / compute | Debian + Guix / ppc64le | no | W7900 48 GB |
[office workstation]
Zotero + Zotmoov + BBT
Claude Desktop (MCP filesystem)
│
│ git-annex sync --content
│ git push refs.bib
▼
[CHPC project storage]
bare annex repo
bare refs.bib repo
(+ Zotero Sync independently, metadata only)
│
│ git-annex sync --content
│ git pull refs.bib
▼
[Talos home workstation]
pdftotext → text/
(optional) BGE-M3 embeddings → sqlite-vec
Claude via gptel (workspace silo only)
Four layers of the workflow
- Filesystem layer. ~/papers/ on office, ~/ref/pdf/papers/ on Talos. PDFs organized as <journal>/<year>/<citekey>.pdf. Authoritative copy is the git-annex repo; working trees on each host are symlinks into content-addressable storage.
- Metadata layer. Zotero on office, with linked attachments pointing into ~/papers/. Items, tags, collections, notes, annotations, citation keys. Synced to zotero.org as a data-only off-site backup. Talos never sees Zotero directly.
- Citation layer. Better BibTeX auto-exports ~/writing/refs.bib on office. A small sibling git repo pushes that file to CHPC. Talos pulls and symlinks it to ~/ref/pdf/references.bib. Single source of truth for citation keys.
- Search / retrieval layer. Claude Desktop on office uses MCP filesystem + ripgrep against ~/papers/. Talos uses rg over pdftotext output, optionally Layer 3 above.
Replication policy
- git-annex numcopies=2 floor. With three active replicas (office, CHPC, Talos), the invariant holds even if any one host is offline or being rebuilt.
- Zotero cloud is not counted as a replica for PDFs. It holds metadata only.
- restic / borg snapshots of zotero.sqlite land on CHPC in a path outside the papers annex. The database is the irreplaceable artifact (tags, notes, citation keys, annotations); treat its backup with more care than the PDFs themselves.
Office-side plugins
- Zotmoov — replaces the deprecated ZotFile for Zotero 7. Moves attachments out of storage/ into ~/papers/ on import, deriving the path from item metadata.
- Better BibTeX — exports refs.bib on every change, with stable citation keys.
Office runs Zotero with PDF full-text indexing disabled (or capped at ~50k chars per file) — that's Talos's job. Zotero's full-text index would just duplicate what pdftotext does, on slower hardware.
Operational rhythm
| Cadence | What runs where |
|---|---|
| Per-session (office) | Drag PDFs in from browser. Zotmoov relocates them. Better BibTeX keeps refs.bib current. Data sync happens automatically. |
| Weekly (office) | git annex sync --content in ~/papers/; git push in ~/writing/. Pushes new PDFs and bib changes to CHPC; enforces numcopies=2. |
| Monthly (Talos) | pdf-sync alias (annex sync + bib pull) + regenerate text/ and title-index.txt for new files. Combined with the notes-weeding pass. |
| Monthly (office) | VACUUM zotero.sqlite; restic / borg snapshot to CHPC outside the annex; git annex fsck locally. |
| Quarterly | git annex fsck on the CHPC remote — catches silent bit rot on institutional storage. Trustworthy because content is checksummed and content-addressable. |
What this design is not
It's not a generic ingestion pipeline. The asymmetry — office curates, Talos computes — is by design. Replicating the embedding pipeline on x86 office hardware would gain nothing: office has no GPU, and economics terminology is stable enough that office's keyword search via MCP filesystem covers nearly all queries. When office-side work needs semantic retrieval (rare), the answer is "ssh to Talos, or ask Claude via gptel on Talos" — not "build a second RAG pipeline."