Production RAG: Best Practices and Deployment
Part of the RAG series.
Running RAG in production takes more than accuracy: you need reliability, observability, and sensible defaults when things fail. This post outlines practices that keep production RAG systems stable and debuggable.
Monitoring
Instrument the full pipeline so you can see where time and errors occur:
- Latency. Track retrieval latency (embed + search) and generation latency (LLM call) separately. Set alerts on p95 or p99.
- Errors. Log and alert on embedding failures, vector store timeouts, LLM API errors and empty retrieval.
- Usage. Count queries, tokens (embedding + LLM), and cost per query or per day. This helps with capacity planning and budgeting.
- Quality signals. Where possible, log faithfulness or relevance scores (from evals or lightweight checkers) or sample answers for human review. Use them to detect regressions.
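A minimal sketch of per-stage instrumentation, assuming stubbed `retrieve` and `generate` steps (swap in your embedder, vector store, and LLM client). The point is the shape: each stage gets its own timing and error log line, so latency and failures are attributable to retrieval or generation separately.

```python
import time
import logging

logger = logging.getLogger("rag")

def timed(stage, fn, *args, **kwargs):
    """Run fn, logging its latency and any error under the given stage name."""
    start = time.perf_counter()
    try:
        result = fn(*args, **kwargs)
    except Exception:
        logger.exception("stage=%s status=error", stage)
        raise
    elapsed_ms = (time.perf_counter() - start) * 1000
    logger.info("stage=%s status=ok latency_ms=%.1f", stage, elapsed_ms)
    return result, elapsed_ms

# Stub pipeline steps; replace with real embedding/search and an LLM call.
def retrieve(query):
    return [f"chunk about {query}"]

def generate(query, chunks):
    return f"answer using {len(chunks)} chunk(s)"

def answer(query):
    chunks, retrieval_ms = timed("retrieval", retrieve, query)
    if not chunks:
        logger.warning("stage=retrieval status=empty query=%r", query)
    text, generation_ms = timed("generation", generate, query, chunks)
    return {"answer": text, "retrieval_ms": retrieval_ms, "generation_ms": generation_ms}
```

The per-stage timings are what feed your p95/p99 alerts; in production you would ship them to a metrics backend rather than only logging them.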
Caching
Cache to reduce cost and latency:
- Embedding cache. Cache embeddings for repeated queries or for chunks you re-use. Key by (model, text) or a hash. Reduces embedding API calls during indexing and for frequent queries.
- Retrieval cache. For identical or near-identical queries, cache the top-k chunks. Key by normalized query or query embedding. Speeds up repeated questions (e.g. support bots).
- Response cache. For exact duplicate queries, return the last (query, answer) pair. Easiest win; be careful with TTL and invalidation when the index or prompt changes.
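The embedding cache above can be sketched like this; `EmbeddingCache` and its in-memory dict are illustrative (in production you would back it with Redis or disk), and the key is (model, content hash) so the same text embedded under a different model is never served a stale vector.

```python
import hashlib

def _key(model, text):
    # Key by (model, content hash): same text + same model -> same vector.
    return (model, hashlib.sha256(text.encode("utf-8")).hexdigest())

class EmbeddingCache:
    def __init__(self, embed_fn):
        self._embed = embed_fn   # e.g. a call to your embedding API
        self._store = {}         # in-memory; swap for Redis/disk in production
        self.hits = 0
        self.misses = 0

    def embed(self, model, text):
        key = _key(model, text)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        vector = self._embed(model, text)
        self._store[key] = vector
        return vector
```

A retrieval or response cache follows the same pattern with a different key (normalized query) and a TTL, so it can be invalidated when the index or prompt changes.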
Fallbacks
Define what happens when parts of the pipeline fail:
- Empty retrieval. If no chunks are returned, either return a clear "I couldn't find relevant information" message or fall back to a general answer path (e.g. no context, or a limited context). Avoid inventing sources.
- LLM failure. Retry with backoff; if still failing, return a short error message or route to a human. Do not silently serve stale or wrong answers.
- Vector store or embedder down. Degrade to keyword-only search if you have it, or return a maintenance message. Log and alert so you can fix the dependency.
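The empty-retrieval and LLM-failure paths can be combined into one small wrapper. This is a sketch with hypothetical `retrieve` and `generate` callables; the messages and retry counts are placeholders to tune for your product.

```python
import time

NO_CONTEXT_MESSAGE = "I couldn't find relevant information for that question."
ERROR_MESSAGE = "Something went wrong. Please try again in a moment."

def call_with_backoff(fn, retries=3, base_delay=0.1):
    """Retry a flaky call with exponential backoff; re-raise after the last try."""
    for attempt in range(retries):
        try:
            return fn()
        except Exception:
            if attempt == retries - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))

def answer(query, retrieve, generate):
    chunks = retrieve(query)
    if not chunks:
        # Empty retrieval: say so plainly rather than inventing sources.
        return NO_CONTEXT_MESSAGE
    try:
        return call_with_backoff(lambda: generate(query, chunks))
    except Exception:
        # LLM still failing after retries: short error, never a stale answer.
        return ERROR_MESSAGE
```

The keyword-only degradation path would slot in where `retrieve` raises, behind the same kind of try/except.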
Index Freshness
Keep the index aligned with the source of truth:
- Re-index on a schedule (e.g. nightly) or on webhooks when docs change. For real-time requirements, consider streaming ingestion and incremental updates.
- Version or timestamp chunks so you can detect stale context and trigger re-index when schemas or sources change.
Security and Cost
- Input validation. Sanitize and length-limit user queries to avoid prompt injection or abuse. Do not pass raw user text into system prompts without control.
- Access control. Ensure retrieval is scoped to the tenant or user so users only see chunks they are allowed to see. Use metadata filters and consistent identity.
- Cost controls. Rate-limit or cap tokens per user or per day. Monitor spend and set alerts so you catch runaway usage early.
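Input validation and per-user cost caps can be sketched as below. The limits and the in-memory `TokenBudget` are placeholders (a multi-process deployment would keep counters in Redis with a daily expiry), and real prompt-injection defense needs more than length limits, but this is the baseline hygiene.

```python
MAX_QUERY_CHARS = 2000

def validate_query(text):
    """Basic input hygiene: drop control characters, trim, cap length."""
    cleaned = "".join(ch for ch in text if ch.isprintable() or ch in "\n\t")
    cleaned = cleaned.strip()
    if not cleaned:
        raise ValueError("empty query")
    return cleaned[:MAX_QUERY_CHARS]

class TokenBudget:
    """Per-user daily token cap (in-memory; use a shared store in production)."""
    def __init__(self, daily_limit):
        self.daily_limit = daily_limit
        self.used = {}

    def charge(self, user_id, tokens):
        total = self.used.get(user_id, 0) + tokens
        if total > self.daily_limit:
            return False   # over budget: reject, queue, or downgrade the request
        self.used[user_id] = total
        return True
```

Access control sits alongside this: pass the authenticated user's tenant id as a metadata filter on every retrieval call, never as user-supplied input.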
Quick Checklist
- Log and alert on retrieval and generation latency and errors.
- Cache embeddings and retrieval where it saves cost and latency.
- Define fallbacks for empty retrieval and LLM failure.
- Keep the index updated and versioned.
- Validate input, scope retrieval by identity and cap cost.
For evaluation and metrics to monitor, see Evaluating RAG Systems. For the full RAG stack, start with RAG Introduction.
Get in touch
Questions about RAG or AI knowledge systems? Tell us about your project.