Sovereign RAG on air-gapped clusters: what actually moves the needle
Six months of engineering notes from a production retrieval pipeline running entirely behind the wire — what we kept, what we rewrote twice, and why the embedding model matters less than the chunking strategy.

This isn't a "five things you should know about RAG" post. We've been operating a retrieval pipeline for a regulated buyer on an air-gapped cluster since late last year, and the playbook the public internet sells you is at best half-true in that setting.
The embedding model isn't where the value is
Swapping between three popular open-weight embedding models moved our retrieval quality by less than 3 percentage points on the rubric we co-defined with the buyer. Changing the chunking strategy, on the other hand, moved it by 17 points. Spend your week on the latter.
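For readers who want to run the same kind of comparison, here is a minimal sketch of how a rubric delta in percentage points could be computed. It assumes each rubric question is graded pass/fail and tagged with a category. The function names and the (category, passed) record shape are our illustration, not the harness described in this post.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical record shape: (rubric category, whether the answer passed).
Result = Tuple[str, bool]


def pass_rate_by_category(results: List[Result]) -> Dict[str, float]:
    """Return the percentage of passed questions for each category."""
    totals: Dict[str, int] = defaultdict(int)
    passes: Dict[str, int] = defaultdict(int)
    for category, passed in results:
        totals[category] += 1
        passes[category] += int(passed)
    return {c: 100.0 * passes[c] / totals[c] for c in totals}


def overall_pass_rate(results: List[Result]) -> float:
    """Return the percentage of passed questions across the whole rubric."""
    return 100.0 * sum(int(passed) for _, passed in results) / len(results)


def delta_points(baseline: List[Result], candidate: List[Result]) -> float:
    """Difference between two runs, in percentage points (not relative percent)."""
    return overall_pass_rate(candidate) - overall_pass_rate(baseline)
```

Reporting the per-category breakdown alongside the overall delta makes it easier to see whether a change helped in the categories the buyer actually cares about or only in aggregate.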
What broke in production
The first system was clean in evals and embarrassing in front of a real user. We rebuilt three things:
- The chunker: moved from fixed-size to semantic-boundary chunks with a 15% overlap.
- The reranker: added a cross-encoder pass over the top 40 hits, which cost some recall but lifted precision exactly where the buyer cared.
- The eval harness: replaced our golden set with a buyer-curated rubric of 240 questions across five judgment categories.
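The first two rebuilds can be sketched roughly as follows. This is an illustration under stated assumptions, not our production code: a sentence-splitting regex stands in for real semantic-boundary detection, `max_chars` and `keep` are made-up parameters, and `score_pair` is a hypothetical callable standing in for whatever cross-encoder you host locally.

```python
import re
from typing import Callable, List, Sequence


def split_sentences(text: str) -> List[str]:
    # Naive boundary detection (assumption): split after ., ! or ? plus whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def chunk_on_boundaries(
    text: str, max_chars: int = 800, overlap_ratio: float = 0.15
) -> List[str]:
    """Group whole sentences into chunks, carrying trailing sentences forward.

    Sentences are never split, so a chunk can exceed max_chars when a single
    sentence is long. The overlap budget is overlap_ratio * max_chars.
    """
    budget = int(max_chars * overlap_ratio)
    chunks: List[str] = []
    current: List[str] = []
    current_len = 0
    for sentence in split_sentences(text):
        if current and current_len + len(sentence) + 1 > max_chars:
            chunks.append(" ".join(current))
            # Carry over as many trailing sentences as fit in the overlap budget.
            carried: List[str] = []
            carried_len = 0
            for prev in reversed(current):
                if carried_len + len(prev) > budget:
                    break
                carried.insert(0, prev)
                carried_len += len(prev) + 1
            current, current_len = carried, carried_len
        current.append(sentence)
        current_len += len(sentence) + 1
    if current:
        chunks.append(" ".join(current))
    return chunks


def rerank(
    query: str,
    hits: Sequence[str],
    score_pair: Callable[[str, str], float],
    top_k: int = 40,
    keep: int = 5,
) -> List[str]:
    """Rescore the first top_k retrieved hits and keep the best `keep`.

    score_pair(query, passage) is a stand-in for a cross-encoder relevance score.
    """
    candidates = list(hits[:top_k])
    ranked = sorted(candidates, key=lambda doc: score_pair(query, doc), reverse=True)
    return ranked[:keep]
```

Truncating to `keep` after rescoring is where recall can drop: anything the cross-encoder scores low is no longer passed downstream, even if first-stage retrieval found it.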
Sovereignty isn't a feature flag
Operating air-gapped means every dependency is a deployment risk. We replaced four hosted services with self-hosted alternatives over the project's first quarter — the embedding model server, the vector store, the reranker host, and the eval harness. None of those swaps were technically interesting; all of them mattered for sign-off.
Write to us if you're building toward an air-gapped deployment and want a second pair of eyes.
