18 minute read

Retrieval-Augmented Generation Architecture Logo

Retrieval-Augmented Generation Architecture


Introduction

Retrieval-Augmented Generation, usually shortened to RAG, is the dominant engineering pattern for making large language models useful against enterprise knowledge that is private, frequently changing, operationally scoped, or too large to place reliably in every prompt.

The idea is simple: retrieve relevant evidence first, then ask the model to answer using that evidence. The implementation is not simple. A production RAG system is a distributed information-retrieval system with ingestion pipelines, parsers, chunkers, embedding models, vector indexes, lexical indexes, metadata filters, rerankers, prompt builders, model endpoints, caches, traces, authorization controls, and evaluation loops.

RAG is often introduced as “a vector database attached to an LLM.” That is a useful demo, but a poor production architecture. Enterprise RAG has to answer harder questions:

  • Which data sources are allowed for this user, tenant, region, and workflow?
  • How are documents parsed, normalized, chunked, versioned, and deleted?
  • Which retrieval modes are used for semantic meaning, exact identifiers, and metadata?
  • How are candidate passages reranked and filtered before generation?
  • How does the system avoid stale, poisoned, or unauthorized evidence?
  • How does it prove which sources supported the answer?
  • How are costs, latency, hallucination rate, and retrieval quality measured?
  • How are regulated records deleted from every index and cache?

The practical conclusion is that RAG should be treated as AI infrastructure, not a prompt-engineering trick. Its quality is bounded by data engineering, retrieval design, access control, and operational discipline at least as much as by the model.

RAG as Enterprise AI Infrastructure 1. Enterprise Sources Documents, tickets, source code, databases, SaaS systems 2. Ingestion and Indexing Parse, normalize, chunk, enrich metadata, embed, index 3. Retrieval Dense + sparse search Metadata filters + reranking 4. Generation Grounded prompt + LLM Answer with citations Control Plane Identity, tenant scoping, policy, observability, evaluation, retention The model writes the answer; the RAG system controls the evidence.

Figure 1 - Enterprise RAG is a layered infrastructure system, not a single model call.

Why Organizations Use RAG

RAG is useful when a model must answer from knowledge that is too dynamic, private, domain-specific, regulated, or extensive to trust the model’s training memory alone.

Common drivers:

  • Freshness: product policies, customer records, code, incidents, and procedures change faster than model training cycles.
  • Private knowledge: internal documents, tickets, source code, contracts, and operational logs are not part of the base model.
  • Provenance: users and auditors need to see the evidence behind factual claims.
  • Deletion and retention: enterprise data must be removable or scoped without retraining a model.
  • Tenant isolation: each tenant, region, business unit, or role may have a different permissible knowledge boundary.
  • Cost and practicality: re-indexing is usually cheaper and safer than fine-tuning for factual updates.

RAG is not a guarantee of correctness. It changes the failure surface. Instead of relying only on model memory, the system can fail through bad parsing, bad chunking, wrong metadata, missed retrieval, stale indexes, weak reranking, over-broad top-k selection, prompt injection, or insecure handling of retrieved content.

The useful engineering mindset is: RAG is evidence logistics. The system has to move the right evidence, for the right user, from the right sources, into the right prompt, with the right controls, at the right time.

How RAG Works End to End

A RAG system has two major planes:

  • Offline ingestion plane: collects, parses, normalizes, chunks, enriches, embeds, and indexes source data.
  • Online serving plane: authenticates the user, scopes the query, retrieves candidates, reranks evidence, builds a grounded prompt, calls the model, formats the answer, and logs the trace.

The offline plane determines what can be found. The online plane determines what is allowed, selected, and shown.

RAG Offline and Online Planes Offline Ingestion Plane Sources docs, code, data Parse preserve structure Chunk split + metadata Embed dense vectors Index vector + keyword Online Serving Plane Gateway auth + policy Retrieve search indexes Rerank choose evidence Prompt + LLM generate answer Trace scores + citations query reads indexes

Figure 2 - RAG reliability depends on both ingestion quality and online serving controls.

Retrieval Modes

Dense Retrieval

Dense retrieval uses embeddings to map queries and chunks into a vector space. Chunks with similar meaning should land close to each other. This is the retrieval mode most people mean when they say “semantic search.”

Dense retrieval is strong when:

  • Users paraphrase source material.
  • Source documents use varied wording.
  • The task is conceptual or semantic.
  • Search should find similar ideas, not only exact terms.

Dense retrieval is weaker when exact identifiers matter. Error codes, product SKUs, policy numbers, legal references, function names, table names, and log fragments can be blurred by embedding similarity.

Sparse Retrieval

Sparse retrieval uses lexical features, usually term-based ranking such as BM25 or related methods. It remains essential for enterprise corpora.

Sparse retrieval is strong when:

  • Exact terms matter.
  • Queries contain identifiers.
  • Source text includes code, logs, product names, or error messages.
  • Users expect deterministic keyword-like behavior.

Sparse retrieval is weaker when users ask conceptually similar questions with different words from the source documents.

Hybrid Retrieval

Hybrid retrieval blends dense semantic matches with sparse lexical matches. It is usually the best default for enterprise RAG because business corpora contain both natural-language concepts and brittle exact anchors.

A typical hybrid path:

  1. Run dense vector search for semantic candidates.
  2. Run sparse or BM25 search for lexical candidates.
  3. Merge candidates with reciprocal rank fusion or a similar method.
  4. Apply metadata and access filters.
  5. Rerank the merged candidate set.
  6. Pass only the strongest evidence into the prompt.

Reranking

Reranking is a second-stage ranking step. First-stage retrieval quickly finds a broad candidate set. A reranker then scores query-document pairs more carefully and selects the best evidence.

Reranking is valuable because first-stage retrieval optimizes speed and recall. Generation needs precision. Passing too many mediocre chunks into the model can increase token cost and reduce answer quality.

The usual pattern is:

retrieve 40 to 100 candidates -> rerank -> keep 5 to 12 evidence chunks

The exact numbers should be tuned per corpus and task.

Chunking and Context Design

Chunking is one of the highest-leverage decisions in RAG. The retriever can only find what the chunker preserves.

A chunk should be:

  • Small enough to retrieve precisely.
  • Large enough to make sense alone.
  • Connected to its parent document.
  • Enriched with useful metadata.
  • Stable enough to support citations and deletion.

Naive fixed-size chunking is often acceptable for simple prose, but it breaks down when documents are structured, referential, tabular, legal, or code-heavy.

Better chunking patterns include:

  • Section-aware chunking: preserve headings and subsection hierarchy.
  • Sentence-window chunking: retrieve a sentence but include surrounding context.
  • Hierarchical chunking: index small chunks but retain parent sections for prompt assembly.
  • Contextual chunking: add document-level context to each chunk before embedding.
  • Table-aware chunking: preserve row, column, caption, and source relationships.
  • Code-aware chunking: preserve function, class, file, and import context.

Context windows are not a substitute for retrieval quality. Larger windows can help, but adding weak or irrelevant context can confuse the model, increase cost, and make citations harder to trust. The goal is not maximum context. The goal is sufficient evidence.

Reference Architecture

A production RAG architecture should separate source ingestion, search state, online serving, and observability.

Core components:

  • Source connectors: fetch data from file shares, SaaS systems, wikis, tickets, code repositories, databases, and streams.
  • Parser and normalizer: preserve useful structure from PDFs, HTML, Markdown, office documents, tables, and code.
  • Chunker: splits content using a policy appropriate for the source type.
  • Metadata enricher: attaches tenant, region, business unit, source, effective date, version, sensitivity, and access attributes.
  • Embedding service: creates dense vectors.
  • Sparse indexer: creates lexical or BM25-style search state.
  • Vector store: stores embeddings and supports approximate nearest-neighbor search.
  • Keyword index: supports exact and sparse retrieval.
  • Metadata store: supports filters, source lineage, deletion, and policy decisions.
  • Retriever: runs dense, sparse, and metadata-scoped search.
  • Reranker: scores candidates more precisely.
  • Prompt builder: constructs an evidence-grounded prompt with citations policy and abstention rules.
  • LLM endpoint: generates an answer from evidence.
  • Response formatter: returns answer, sources, warnings, and evidence links.
  • Trace store: captures retrieval scores, prompt versions, latency, token usage, and user identity.
Enterprise RAG Reference Architecture Sources Files, wikis, tickets, code repositories, databases, SaaS APIs Ingestion Plane Parse and normalize source content Chunk, enrich metadata, embed, and build indexes Search State Vector index Keyword index + metadata store Serving Plane Authorize, retrieve, rerank Build prompt, answer, cite Operational Control Layer CI/CD, evaluations, tenant isolation, retention, audit, metrics, tracing, versioning

Figure 3 - A reference RAG architecture separates ingestion, search state, online serving, and operational controls.

Data Pipeline and Index Lifecycle

The ingestion side usually needs more engineering than the query side.

Parsing

Parsing determines whether useful structure survives. A weak parser can flatten tables, discard captions, remove headings, break code blocks, or lose page numbers. A strong parser preserves the source structure needed for retrieval and citations.

Important parser outputs:

  • Clean text.
  • Document hierarchy.
  • Tables and captions.
  • Page or section anchors.
  • Source URI.
  • Author, owner, date, version, and effective date.
  • Sensitivity and tenant metadata.

Metadata

Metadata is the control surface for enterprise RAG. It enables filtering, authorization, deletion, retention, citations, and evaluation.

Useful fields include:

  • tenant_id
  • source_system
  • source_id
  • document_version
  • chunk_id
  • parent_id
  • business_unit
  • region
  • language
  • effective_date
  • expiry_date
  • sensitivity
  • allowed_groups
  • record_type
  • ingested_at
  • parser_version
  • chunker_version
  • embedding_model

Poor metadata creates security and quality failures. A relevant chunk from the wrong tenant is worse than no chunk at all.

Index Releases

Index changes should be released like software changes. A change in chunking, parser behavior, embedding model, sparse analyzer, reranker, or prompt template can change production answers.

Useful release patterns:

  • Batch upsert for normal document updates.
  • Near-real-time upsert for urgent operational sources.
  • Scheduled rebuild for large controlled corpora.
  • Shadow index for major embedding or chunking changes.
  • Alias swap to promote a rebuilt index atomically.
  • Emergency purge for sensitive, poisoned, or legally deleted content.

Keep index versioning explicit. An answer trace should show which document snapshot, chunker, embedding model, retriever, reranker, and prompt version participated.

Query API Shape

A production RAG query API should expose retrieval controls explicitly enough to test and operate them.

POST /rag/query
{
  "tenant_id": "acme-ca",
  "user_id": "user-123",
  "query": "What changed in the travel policy for meals?",
  "filters": {
    "category": "policy",
    "region": "ca",
    "effective_on": "2026-06-08"
  },
  "retrieval": {
    "dense_k": 40,
    "sparse_k": 40,
    "fusion": "reciprocal_rank_fusion",
    "minimum_score": 0.18
  },
  "rerank": {
    "enabled": true,
    "top_n": 10
  },
  "generation": {
    "citations": true,
    "abstain_if_evidence_is_weak": true,
    "max_context_chunks": 8
  }
}

The service response should include more than a final answer:

{
  "answer": "The meal allowance changed for domestic travel...",
  "citations": [
    {
      "source_id": "policy-2026-04",
      "chunk_id": "policy-2026-04:section-3.2",
      "title": "Travel Policy",
      "uri": "https://kb.example/policies/travel",
      "score": 0.86
    }
  ],
  "trace_id": "rag-trace-20260608-001",
  "warnings": [],
  "retrieval": {
    "retrieved_candidates": 80,
    "reranked_candidates": 10,
    "context_chunks_used": 6
  }
}

This response shape supports debugging, user trust, and audit.

Deployment Topologies

Managed Retrieval

Managed retrieval is appropriate when speed matters more than low-level index control.

Characteristics:

  • Hosted vector store or file-search service.
  • Managed parsing or indexing.
  • Simpler operations.
  • Fast delivery.
  • Less control over internal index behavior.

Best fit:

  • Departmental assistants.
  • Internal knowledge bases.
  • Early production systems.
  • Teams without search-infrastructure expertise.

Self-Hosted Vector Infrastructure

Self-hosted infrastructure is appropriate when data residency, private networking, custom ranking, cost control, or low-level index tuning matter.

Characteristics:

  • Vector database or ANN service deployed on Kubernetes or VMs.
  • Separate embedding and reranking services.
  • Explicit object storage for raw documents and snapshots.
  • Operational ownership for scaling, replication, backup, and upgrades.

Best fit:

  • Regulated environments.
  • Sovereign deployments.
  • High-volume platforms.
  • Deeply customized retrieval.
  • Sensitive corpora that cannot leave a private network.

Hybrid Deployment

Hybrid deployment is common. Source systems and indexes may remain private while model inference is cloud-hosted, or retrieval may be hosted while generation uses a private model endpoint.

The boundary should be explicit:

  • What leaves the private environment?
  • Are retrieved chunks sent to a third-party model provider?
  • Are chunks redacted or summarized first?
  • Are prompts and responses retained by the model provider?
  • Which audit logs prove the data path?

Edge and Cache Patterns

Caching helps, but it must respect data sensitivity.

Good cache targets:

  • Static documents.
  • Public citations.
  • Repeated system prompts.
  • Stable retrieval results for non-sensitive corpora.
  • Embeddings for unchanged content.

Risky cache targets:

  • Tenant-specific answers.
  • Personal data.
  • Regulated records.
  • Security incident context.
  • Generated responses with mixed sensitive evidence.

Use private cache keys that include tenant, user scope, policy version, source version, and prompt version where needed.

Security Architecture and Threat Model

RAG security starts with a blunt assumption: every external input is untrusted. That includes user queries, uploaded files, retrieved passages, connector results, and tool outputs.

The model sees text. Attackers can put instructions in text. Therefore retrieved evidence must not be treated as trusted instructions.

Assets

Important RAG assets:

  • Source documents.
  • Parsed text.
  • Chunks.
  • Embeddings.
  • Sparse terms.
  • Metadata.
  • Vector indexes.
  • Keyword indexes.
  • User queries.
  • Prompt templates.
  • Retrieved evidence.
  • Generated answers.
  • Citations.
  • Access tokens.
  • Trace logs.
  • Evaluation datasets.

Threats

Threat Description Primary Controls
Prompt injection Retrieved content contains instructions that try to override system behavior. Delimit evidence, treat retrieval as data, keep policy outside the model, test against injection samples.
Index poisoning Malicious or low-quality content is ingested into the search index. Source allowlists, ingestion review, malware scanning, content classification, provenance metadata.
Tenant cross-talk One tenant retrieves another tenant’s data. Tenant partitioning, metadata filters, authorization checks before retrieval, tests for isolation.
Stale retrieval Old or superseded content is retrieved as current. Effective dates, expiry dates, versioning, freshness checks, source lifecycle policies.
Sensitive disclosure Retrieved content reveals personal, regulated, or confidential data. Data minimization, redaction, field-level policy, result classification, DLP controls.
Insecure output handling Model output is passed into tools, APIs, or UIs without validation. Structured output validation, escaping, approval gates, downstream parameter checks.
Supply-chain compromise Parser, embedding, vector DB, connector, or framework dependency is compromised. Dependency scanning, pinning, SBOMs, signed images, vendor review.
Denial of service Expensive retrieval, reranking, or context assembly is abused. Rate limits, query budgets, token caps, timeout limits, top-k caps.
Audit gaps The system cannot reconstruct evidence and actions. Correlation IDs, immutable logs, retrieval traces, prompt and model version logging.
RAG Security Model and Control Points 1. Untrusted Content Enters Uploads, web pages, documents, tickets, emails, logs 2. Ingestion Guard Scan and normalize Classify and tag metadata 3. Scoped Retrieval Tenant, role, purpose filters Least-privilege evidence access 4. Prompt Boundary Instructions stay separate Retrieved text is evidence only 5. Output Guard Validate, cite, redact Block unsafe downstream use Audit Trace Identity, filters, scores, sources, prompt version, model version, policy decisions Security principle: retrieved content is evidence, not instruction.

Figure 4 - RAG security controls must protect ingestion, retrieval scope, prompt construction, output handling, and auditability.

Control Baseline

Production RAG should include at least these controls:

Control Area Baseline
Authentication Every query is bound to a user or workload identity.
Authorization Retrieval is scoped before search using tenant, role, purpose, and metadata policy.
Tenant isolation Indexes, namespaces, shards, or filters enforce tenant boundaries.
Data classification Chunks carry sensitivity metadata used at query and answer time.
Ingestion security Files are scanned, normalized, classified, and sourced from approved connectors.
Prompt isolation Retrieved evidence is clearly separated from system and developer instructions.
Output validation Generated outputs are checked before display, storage, or tool execution.
Secrets handling Credentials never enter model-visible context.
Encryption Data is encrypted in transit and at rest; keys are externally managed where required.
Audit Logs capture identity, filters, retrieved chunks, scores, prompt version, citations, and model version.
Retention Source, chunk, index, cache, and trace retention match legal and business policy.
Deletion Erasure workflows propagate to raw stores, chunks, embeddings, indexes, caches, and replicas.

Evaluation and CI/CD

RAG requires testing for both code and knowledge.

Retrieval Metrics

Useful retrieval metrics:

  • Recall at k: whether relevant evidence appears in the top k results.
  • Precision at k: how much of the retrieved set is relevant.
  • MRR: how early the first relevant result appears.
  • nDCG: whether highly relevant results are ranked higher.
  • Filter correctness: whether tenant, region, date, and role filters work.
  • Freshness: how long source updates take to become searchable.

If retrieval misses the right evidence, generation cannot reliably recover.

Answer Metrics

Useful answer metrics:

  • Groundedness.
  • Citation coverage.
  • Hallucination rate.
  • Abstention accuracy.
  • Contradiction handling.
  • Policy compliance.
  • User task success.
  • Latency.
  • Token cost.

Evaluation sets should include normal questions, ambiguous questions, stale-source cases, access-control cases, and adversarial prompt-injection samples.

Release Gates

Before promoting a new parser, chunker, embedding model, reranker, prompt, or index:

  • Run retrieval evals.
  • Run answer evals.
  • Run access-control tests.
  • Run prompt-injection tests.
  • Compare latency and cost.
  • Verify deletion behavior.
  • Validate citations.
  • Keep rollback paths.

Operations and Observability

Minimum production telemetry:

  • Query ID and correlation ID.
  • User identity and tenant.
  • Authorization decision.
  • Retrieval filters.
  • Dense and sparse candidate counts.
  • Reranker scores.
  • Chunks used in context.
  • Source versions.
  • Prompt template version.
  • Model name and version.
  • Token usage.
  • Latency by stage.
  • Cache hit or miss.
  • Generated answer ID.
  • Citation coverage.
  • Policy denials and redactions.

Operational playbooks should cover:

  • Ingestion backfill.
  • Index rebuild.
  • Embedding model migration.
  • Reranker migration.
  • Emergency source removal.
  • Tenant isolation incident.
  • Prompt injection incident.
  • Data deletion request.
  • Vector database outage.
  • Model provider outage.

The incident-response question is simple: can the team reconstruct why the model answered the way it did? If not, the RAG system is under-instrumented.

Product and Framework Landscape

The RAG ecosystem has several layers.

Layer Examples Role
Hosted retrieval OpenAI Vector Stores and File Search, model-provider retrieval tools Fastest path to managed retrieval and citations.
Managed vector databases Pinecone, Weaviate Cloud, Zilliz Cloud Scalable vector and hybrid search with managed operations.
Open-source vector databases Weaviate, Milvus, Qdrant Self-hosted search infrastructure and deployment control.
ANN libraries Faiss, Annoy, ScaNN Low-level similarity search building blocks.
RAG frameworks LlamaIndex, Haystack, LangChain/LangGraph, Semantic Kernel Pipelines, retrievers, connectors, orchestration, evaluation.
Embedding serving Hugging Face Text Embeddings Inference, provider embedding APIs Embedding generation and batching.
Integration protocols Model Context Protocol Standardized access to tools, resources, and enterprise context.

The right choice depends on constraints:

  • Use hosted retrieval for speed and small teams.
  • Use managed vector databases for scalable production without owning every operational detail.
  • Use self-hosted vector infrastructure for data residency, private networking, and deeper control.
  • Use libraries when embedding search inside a specialized application.
  • Use frameworks when the application needs complex ingestion, retrieval, evaluation, or workflow logic.

Business Workflow Patterns

Customer Support

RAG can retrieve approved product documentation, policy documents, previous tickets, and account context. Hybrid retrieval is important because users describe issues in natural language while systems often use exact SKUs, error codes, and plan names.

Controls:

  • Use approved support corpora.
  • Show citations.
  • Separate internal notes from customer-visible answers.
  • Avoid exposing one customer record to another.
  • Draft responses before sending.

Internal Knowledge Assistant

RAG can help employees find policies, procedures, architecture decisions, and team documentation.

Controls:

  • Scope by user group.
  • Track stale documents.
  • Surface source age.
  • Prefer abstention when evidence conflicts.
  • Make feedback easy.

Engineering Assistant

RAG can retrieve code, design docs, API references, CI logs, and incident history.

Controls:

  • Preserve repository and symbol metadata.
  • Use code-aware chunking.
  • Keep read-only retrieval separate from code modification tools.
  • Protect secrets in source and logs.

Decision Support

RAG can support regulated analysis by assembling relevant evidence and producing cited summaries.

Controls:

  • Evidence-forward UI.
  • Strong citation requirements.
  • Review workflow.
  • Data retention policy.
  • Audit logs.
  • No autonomous high-impact decisions without human approval.

Common Failure Modes

  1. Chunking destroys context.

    The retrieved passage is relevant but impossible to interpret without its heading, table, parent section, or date.

  2. Dense-only retrieval misses exact anchors.

    Error codes, policy IDs, and function names disappear behind semantic similarity.

  3. Metadata is incomplete.

    The system cannot filter by tenant, region, date, or sensitivity.

  4. Top-k is too high.

    The model receives too much weak context and produces diluted or contradictory answers.

  5. Citations are decorative.

    The answer lists sources but the claims are not actually supported by them.

  6. Deletion stops at the source system.

    Deleted data remains in chunks, embeddings, vector indexes, caches, or traces.

  7. Prompt injection is treated as a model problem only.

    The real fix requires data labeling, prompt boundaries, tool controls, and output validation.

  8. Evaluation only tests happy paths.

    Production failures often come from stale data, access boundaries, ambiguous questions, and adversarial content.

Implementation Checklist

Before production launch:

  • Identify source owners.
  • Define data classification.
  • Define tenant and role boundaries.
  • Choose chunking policies by source type.
  • Preserve parent document and section metadata.
  • Use hybrid retrieval unless the corpus is clearly unsuitable.
  • Add reranking for high-value workflows.
  • Require citations for factual answers.
  • Add abstention behavior.
  • Build retrieval and answer evals.
  • Add prompt-injection test cases.
  • Log retrieval scores and source IDs.
  • Version parser, chunker, embedding model, reranker, prompt, and index.
  • Test deletion propagation.
  • Define incident-response playbooks.
  • Add cost and latency budgets.

Conclusion

RAG is the practical bridge between general-purpose language models and enterprise knowledge. Its strength is that it lets teams update, scope, cite, and govern knowledge without retraining the base model. Its weakness is that it introduces a full retrieval and data-governance system that must be engineered seriously.

The best enterprise RAG systems are not the ones with the most elaborate prompt. They are the ones with clean source ownership, reliable parsing, sensible chunking, strong metadata, hybrid retrieval, reranking, grounded prompts, strict tenant isolation, complete observability, and repeatable evaluation.

The model writes the answer. The RAG system decides what evidence the model is allowed to see. That is the architectural responsibility.

References


Applying This in Practice

If you are applying these ideas to a regulated product, certification target, or production system, I can help turn the analysis into a threat model, architecture review, migration roadmap, or remediation plan.

Discuss an AI security architecture challenge