Start from failure modes, not benchmarks
Most RAG initiatives begin with a leaderboard score and end with operators who do not trust the product. The gap is almost always operational: ambiguous chunk boundaries, silent ingestion failures, and retrieval that looks fine in aggregate metrics but breaks on the documents your users actually upload.
Treat the first milestone as a catalog of real queries that failed—then instrument the stack so you can reproduce them. That discipline matters more than swapping embedding models in the first month.
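One way to make that catalog reproducible is to store each failure as a structured record next to the index version that produced it. The sketch below is a minimal illustration, not a prescribed schema: the `FailedQuery` class, its field names, and the failure-mode labels are all assumptions made for this example.

```python
import json
from dataclasses import dataclass, asdict
from typing import List


@dataclass
class FailedQuery:
    """One reproducible failure. All field names here are hypothetical."""
    query: str
    expected_doc_ids: List[str]
    retrieved_doc_ids: List[str]
    index_version: str   # which index build produced the bad result
    failure_mode: str    # e.g. "chunk_boundary", "ingestion", "retrieval"
    notes: str = ""


def to_jsonl(cases: List[FailedQuery]) -> str:
    """Serialize the catalog as one JSON object per line."""
    return "\n".join(json.dumps(asdict(c), sort_keys=True) for c in cases)


def from_jsonl(text: str) -> List[FailedQuery]:
    """Load the catalog back, skipping blank lines."""
    return [FailedQuery(**json.loads(line))
            for line in text.splitlines() if line.strip()]
```

Keeping the index version on every record is the design choice that matters: without it, a failure that disappears after a rebuild cannot be told apart from one that was actually fixed.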
Chunking is a product decision
Tables, headers, and scanned PDFs punish naive token windows. Layout-aware parsing, metadata enrichment, and domain-aware chunking are not optional polish—they define whether citations line up with what a reviewer can defend.
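As a rough illustration of chunking that follows document structure rather than a token window, the sketch below splits already-parsed text on heading lines and carries the heading along as metadata. It assumes a hypothetical upstream parser that emits headings as lines starting with "# "; real layout-aware parsing of tables and scanned PDFs is considerably more involved.

```python
from typing import Dict, List, Optional


def _emit(chunks: List[Dict], doc_id: str,
          heading: Optional[str], body: List[str]) -> None:
    """Append a chunk if the current section has any body text."""
    if body:
        chunks.append({
            "doc_id": doc_id,
            "section": heading,          # metadata a citation can point at
            "text": "\n".join(body),
        })


def chunk_by_heading(lines: List[str], doc_id: str) -> List[Dict]:
    """Split parsed lines into one chunk per heading-delimited section.

    Assumption: headings arrive as lines starting with "# ".
    Blank lines are dropped; text before the first heading gets
    section=None.
    """
    chunks: List[Dict] = []
    heading: Optional[str] = None
    body: List[str] = []
    for line in lines:
        if line.startswith("# "):
            _emit(chunks, doc_id, heading, body)
            heading = line[2:].strip()
            body = []
        elif line.strip():
            body.append(line)
    _emit(chunks, doc_id, heading, body)
    return chunks
```

Because each chunk records its section, a reviewer checking a citation can see which part of the source document it came from.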
Where possible, keep chunk boundaries stable across re-index jobs. Nothing erodes trust faster than citations that drift after a routine rebuild.
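One common way to detect drift is to derive chunk IDs from content rather than from position, then compare ID sets across rebuilds. The sketch below is one possible approach under stated assumptions: the function names, the whitespace normalization, and the 16-character truncation are illustrative choices, not a standard.

```python
import hashlib
from typing import Iterable, Set


def chunk_id(doc_id: str, text: str) -> str:
    """Content-derived ID: same document and same text give the same ID.

    Whitespace is collapsed first so that cosmetic reflows in the
    parser do not change the ID (an assumption; adjust to taste).
    """
    normalized = " ".join(text.split())
    payload = f"{doc_id}\x00{normalized}".encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:16]


def drifted_ids(doc_id: str,
                old_chunks: Iterable[str],
                new_chunks: Iterable[str]) -> Set[str]:
    """IDs present before the rebuild but missing after it.

    Any citation that points at one of these IDs no longer resolves.
    """
    old = {chunk_id(doc_id, c) for c in old_chunks}
    new = {chunk_id(doc_id, c) for c in new_chunks}
    return old - new
```

Running `drifted_ids` as part of every re-index job turns citation drift from a user complaint into a build-time report.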
Hybrid retrieval and reranking are table stakes
Dense vectors alone rarely win on short proper nouns, internal codes, and policy numbers. Pair lexical signals with dense retrieval, then rerank with a cross-encoder or, when you have click or thumbs-up/down data of your own, a small supervised ranker trained on it.
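The text above does not say how to combine the lexical and dense result lists. One widely used option is reciprocal rank fusion (RRF), which needs only ranks, not comparable scores. The sketch below assumes each retriever returns an ordered list of document IDs; the function name and the default `k=60` are conventional choices, not something this article specifies.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def reciprocal_rank_fusion(rankings: Sequence[Sequence[str]],
                           k: int = 60) -> List[str]:
    """Fuse several ranked lists of doc IDs into one.

    Each document scores sum(1 / (k + rank)) over the lists it
    appears in, with rank starting at 1. Ties break on doc ID so
    the output is deterministic.
    """
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda d: (-scores[d], d))


lexical = ["policy-42", "b", "a"]   # e.g. from a keyword index
dense = ["c", "policy-42", "b"]     # e.g. from a vector index
print(reciprocal_rank_fusion([lexical, dense]))
# ['policy-42', 'b', 'c', 'a']
```

The fused list is then the candidate set handed to the reranker.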
Reranking adds latency, so budget for it explicitly and measure p95 latency alongside nDCG@k. If you cannot afford reranking at peak, design tiered paths: rerank only above a confidence threshold or only for premium tenants.
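A minimal sketch of both halves of that advice follows: a nearest-rank p95 over recorded latencies, and a gate that decides whether a request takes the reranking path. The names, the tier label "premium", and the default threshold are assumptions for illustration. How first-stage confidence is computed is left open, and the comparison direction follows the text as written.

```python
from typing import FrozenSet, Sequence


def p95(latencies_ms: Sequence[float]) -> float:
    """95th percentile by the nearest-rank method."""
    if not latencies_ms:
        raise ValueError("need at least one latency sample")
    ordered = sorted(latencies_ms)
    # ceil(0.95 * n) in integer arithmetic, to avoid float rounding.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def should_rerank(first_stage_confidence: float,
                  tenant_tier: str,
                  confidence_threshold: float = 0.5,
                  premium_tiers: FrozenSet[str] = frozenset({"premium"})) -> bool:
    """Tiered path: premium tenants always rerank; everyone else
    reranks only when confidence meets the threshold.

    The threshold value and tier names are hypothetical.
    """
    if tenant_tier in premium_tiers:
        return True
    return first_stage_confidence >= confidence_threshold
```

Tracking `p95` separately for the reranked and non-reranked paths shows what the reranker actually costs at the tail.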
Close the loop with evals your PM can read
Offline suites should include stratified slices: multilingual pages, scanned attachments, and long tables. Online, sample production traffic with redaction, and route low-confidence answers to review queues.
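Reporting one number per slice, instead of a single average, can be as simple as the sketch below. It uses hit rate at k (the share of queries with at least one relevant document in the top k) as a stand-in metric. The example record keys `slice`, `retrieved`, and `relevant` are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List


def hit_rate_by_slice(examples: List[dict], k: int = 5) -> Dict[str, float]:
    """Hit rate at k, computed separately for each slice.

    Each example is a dict with hypothetical keys:
      "slice":     slice label, e.g. "scanned" or "long_tables"
      "retrieved": ranked list of retrieved doc IDs
      "relevant":  list of doc IDs judged relevant
    """
    hits: Dict[str, int] = defaultdict(int)
    totals: Dict[str, int] = defaultdict(int)
    for ex in examples:
        totals[ex["slice"]] += 1
        if set(ex["retrieved"][:k]) & set(ex["relevant"]):
            hits[ex["slice"]] += 1
    return {name: hits[name] / totals[name] for name in totals}
```

A per-slice table like this is something a PM can read directly: one row per document type, one number per row.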
Ship a promotion policy: no release without passing regression gates on the slices that burned you last quarter. That is how RAG becomes an engineering system—not a demo.
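A regression gate of this kind can be sketched as a comparison of per-slice scores between the current baseline and a release candidate. The function name, the tolerance, and the score dictionaries below are assumptions for illustration. A real gate would read them from your eval runs.

```python
from typing import Dict, List


def gate_failures(baseline: Dict[str, float],
                  candidate: Dict[str, float],
                  tolerance: float = 0.01) -> List[str]:
    """Return human-readable reasons to block a release.

    A slice fails if the candidate has no score for it, or if its
    score drops more than `tolerance` below the baseline. An empty
    list means the candidate may be promoted.
    """
    failures: List[str] = []
    for slice_name, base_score in sorted(baseline.items()):
        cand_score = candidate.get(slice_name)
        if cand_score is None:
            failures.append(f"{slice_name}: missing from candidate run")
        elif cand_score < base_score - tolerance:
            failures.append(
                f"{slice_name}: {cand_score:.3f} < baseline {base_score:.3f}"
            )
    return failures


baseline = {"scanned": 0.80, "tables": 0.70}
candidate = {"scanned": 0.81, "tables": 0.60}
print(gate_failures(baseline, candidate))
# ['tables: 0.600 < baseline 0.700']
```

Wiring the non-empty case to a failing CI step is what turns the policy into something enforced rather than remembered.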