Standard Retrieval-Augmented Generation has a dirty secret: it works well for simple lookups and falls apart for anything complex. When a user asks a question that requires synthesizing information from multiple documents, resolving contradictions between sources, or reasoning across several steps, the basic query-retrieve-generate pipeline produces mediocre results. The retrieval step fires once, returns whatever the vector search considers "similar," and the LLM is left to make sense of whatever it gets. There is no feedback loop, no quality check, no ability to try a different search strategy. Agentic RAG fixes this by putting an AI agent in control of the retrieval process, giving it the ability to reason about what to retrieve, evaluate what it found, and adapt its approach in real time.
Key Takeaways
- Standard RAG follows a fixed pipeline that cannot handle multi-hop queries, ambiguous phrasing, or conflicting sources
- Agentic RAG wraps retrieval in an agent loop that decides when, what, and how to retrieve
- Three architecture patterns dominate: single-agent with tool use, multi-agent with specialized retrievers, and self-correcting RAG with reflection
- LangGraph is a natural fit for building agentic RAG, modeling the retrieval loop as a stateful graph with conditional routing
- The tradeoff is latency and cost - agentic RAG uses 2-5x more LLM calls per query than standard RAG
The Limits of Standard RAG
Standard RAG assumes a single retrieval step will return the right context for any query. That assumption breaks down quickly in enterprise environments where data is messy, distributed, and complex. The typical RAG pipeline has three stages:
- Embed the user's query into a vector
- Retrieve the top-k most similar chunks from the vector store
- Generate an answer from the query plus the retrieved context
This works well enough for direct factual questions against a single, well-chunked knowledge base. Ask "What is our parental leave policy?" against an HR document collection, and standard RAG will likely find the right chunk and produce a correct answer.
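Those three stages can be sketched as one straight-line function. This is a minimal illustration, not a production pipeline; `retrieve` and `generate` are placeholders standing in for your vector search and LLM call:

```python
from typing import Callable, List

def standard_rag(
    question: str,
    retrieve: Callable[[str], List[str]],        # embed + top-k similarity search
    generate: Callable[[str, List[str]], str],   # LLM answer from question + chunks
) -> str:
    """One-shot pipeline: a single retrieval, then generation.
    No feedback loop, no quality check, no second attempt."""
    chunks = retrieve(question)
    return generate(question, chunks)
```

Everything agentic RAG adds happens around these two calls: deciding whether to make them, checking what they return, and retrying when the result is weak.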
But enterprise queries are rarely that clean. Here are the failure modes that show up consistently in production:
Multi-Hop Reasoning
Some questions cannot be answered from a single document chunk. Consider: "Which of our healthcare clients in the Northeast region had compliance incidents in Q3 that were resolved using the updated protocol from the July policy change?"
Answering this requires:
- Retrieving the list of healthcare clients in the Northeast
- Finding Q3 compliance incident records for those clients
- Locating the July policy change document
- Cross-referencing which incidents were resolved using that specific protocol
Query Ambiguity and Reformulation
Users rarely phrase queries in the same vocabulary that documents use. A clinician asking "What should I do when a patient's eGFR drops below 30?" might be looking for a clinical protocol document titled "Stage 4 CKD Management Guidelines." The semantic gap between the query and the document can cause poor retrieval, a problem we explored in detail in our comparison of HyDE vs traditional RAG.
Standard RAG has no mechanism to detect that retrieval failed or to reformulate the query. It returns whatever the vector search found and moves on.
Conflicting Sources
Enterprise knowledge bases contain documents from different time periods, departments, and authors. A policy document from 2024 might contradict an updated memo from 2025. Standard RAG treats all retrieved chunks equally - it has no way to resolve conflicts, check document recency, or weigh source authority.
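One lightweight mitigation is to attach an effective date to each chunk's metadata at ingestion time and re-rank retrieved chunks newest-first before generation. A sketch, assuming a hypothetical `effective_date` metadata field:

```python
from datetime import date
from typing import Any, Dict, List

def prefer_recent(
    chunks: List[Dict[str, Any]],
    date_key: str = "effective_date",  # assumed metadata field, set at ingestion
) -> List[Dict[str, Any]]:
    """Order chunks newest-first so the generator sees current policy
    before superseded documents; undated chunks sort last."""
    return sorted(
        chunks,
        key=lambda c: c["metadata"].get(date_key, date.min),
        reverse=True,
    )
```

This only biases context ordering; it does not resolve genuine contradictions, which is exactly the gap the agentic evaluation loop is meant to close.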
Queries That Need Routing
In many enterprise deployments, data lives in multiple backends: a vector database for unstructured documents, a SQL database for structured records, an API for real-time data, and a graph database for relationships. Standard RAG is hardwired to one retrieval mechanism. It cannot decide that a particular query should go to a SQL database instead of (or in addition to) a vector store.
These limitations are not edge cases. In a 2025 survey by LlamaIndex, 67% of enterprise teams reported that their initial RAG deployments failed to meet accuracy requirements for complex queries. The response was not to abandon RAG but to make it agentic.
What Makes RAG "Agentic"
Agentic RAG is a retrieval-augmented generation architecture where an AI agent controls the entire retrieval process, deciding when to retrieve, what queries to issue, whether the results are good enough, and what follow-up actions to take.

The shift from standard RAG to agentic RAG is the shift from a pipeline to a loop. In standard RAG, data flows in one direction: query enters, chunks come out, response is generated. In agentic RAG, an AI agent sits at the center and orchestrates retrieval as one of several tools it can use.
The agent makes four key decisions that standard RAG cannot:
1. Whether to Retrieve at All
Not every query needs retrieval. "What is 2 + 2?" does not require a database lookup. An agentic RAG system can recognize when the LLM's parametric knowledge is sufficient and skip retrieval entirely, saving latency and cost. Conversely, it can recognize when a question absolutely requires grounding in source documents and trigger retrieval accordingly.
2. What to Retrieve
The agent can decompose a complex query into sub-queries, each targeting specific information. For the multi-hop healthcare compliance example above, the agent might generate three separate retrieval queries:
- "Healthcare clients in Northeast region"
- "Q3 compliance incidents [client names]"
- "July policy change updated protocol"
3. Whether Results Are Good Enough
After retrieval, the agent evaluates the returned documents for relevance. If the chunks do not actually answer the question, the agent can:
- Reformulate the query with different terms
- Expand the search to additional data sources
- Adjust retrieval parameters (increase top-k, change similarity threshold)
- Fall back to a web search or API call
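The "adjust retrieval parameters" option can be as simple as widening top-k and relaxing the similarity threshold on each attempt. The specific numbers below are illustrative, not recommendations:

```python
def widen_search(attempt: int, base_k: int = 5, max_k: int = 20) -> dict:
    """Progressively relax retrieval on each retry: fetch more chunks
    and lower the similarity bar before falling back to other sources."""
    return {
        "k": min(base_k * (attempt + 1), max_k),
        "score_threshold": max(0.7 - 0.1 * attempt, 0.3),
    }
```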
4. What Actions to Take Next
Retrieval might be just one step in a larger workflow. After gathering relevant documents, the agent might need to perform calculations on extracted data, compare values across documents, generate structured output, or route the result to another system. The agent treats retrieval as a tool in its toolkit, not the entire pipeline.
This is fundamentally the same architecture pattern that defines AI agents in general: perception, reasoning, action, and memory applied specifically to the retrieval problem.
Architecture Patterns for Agentic RAG
Three architecture patterns cover the majority of agentic RAG implementations: single-agent with tool use, multi-agent with specialized retrievers, and self-correcting RAG with reflection.

Pattern 1: Single-Agent RAG with Tool Use
The simplest agentic RAG architecture gives a single agent access to retrieval as a tool alongside other tools. The agent receives a query, reasons about how to answer it, and calls the retrieval tool when it determines external knowledge is needed.
from langchain_openai import ChatOpenAI
from langchain_core.tools import tool

# vector_store and db are assumed to be initialized elsewhere

@tool
def search_knowledge_base(query: str) -> str:
    """Search the internal knowledge base for relevant documents."""
    docs = vector_store.similarity_search(query, k=5)
    return "\n\n".join([doc.page_content for doc in docs])

@tool
def search_sql_database(sql_query: str) -> str:
    """Execute a SQL query against the structured database."""
    result = db.execute(sql_query)
    return str(result.fetchall())

llm = ChatOpenAI(model="gpt-4o")
agent = llm.bind_tools([search_knowledge_base, search_sql_database])
This pattern works well when:
- Queries vary in type (some need vector search, some need SQL, some need no retrieval)
- The reasoning required to select and use tools fits within a single LLM call
- You want a minimal architecture with low operational complexity
Pattern 2: Multi-Agent RAG with Specialized Retrievers
For enterprise systems with multiple data sources, a multi-agent architecture assigns specialized retriever agents to different knowledge domains. A router agent analyzes the incoming query and delegates to the appropriate specialist.
Each retriever agent is optimized for its domain:
- The policy agent uses vector search over policy documents with metadata filtering by date and department
- The clinical agent queries a medical knowledge graph and clinical trial databases
- The financial agent runs SQL queries against structured financial data
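The routing step itself can stay simple: classify the query's domain, then hand off to the matching specialist. A sketch with `classify` standing in for an LLM classification call; the domain labels and specialist mapping are illustrative:

```python
from typing import Callable, Dict

def answer_with_specialists(
    question: str,
    classify: Callable[[str], str],                # LLM call returning a domain label
    specialists: Dict[str, Callable[[str], str]],  # domain -> retriever agent
    default_domain: str = "policy",
) -> str:
    """Router agent: delegate to the specialist for the classified
    domain, falling back to a default for unknown labels."""
    domain = classify(question)
    agent = specialists.get(domain, specialists[default_domain])
    return agent(question)
```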
This pattern scales well but introduces coordination overhead. Each agent call is an LLM invocation, so costs multiply. The design borrows heavily from the supervisor pattern described in our multi-agent systems guide.
Pattern 3: Self-Correcting RAG with Reflection
Self-correcting RAG adds an explicit evaluation step after retrieval. The agent retrieves documents, grades their relevance, and decides whether to proceed with generation or loop back for another retrieval attempt.
The flow works as follows: a query triggers retrieval, retrieved documents are graded for relevance, and if enough relevant documents are found the agent generates an answer and checks it for groundedness. If documents are insufficient, the agent reformulates the query and retrieves again. If the generated answer is not grounded in the source documents, the agent retries.
This is the most common agentic RAG pattern in production because it directly addresses the biggest weakness of standard RAG: no quality check on retrieved context.
Implementation with LangGraph
LangGraph is the natural fit for agentic RAG because it models the retrieve-evaluate-retry loop as a stateful graph with conditional edges.

Let us build a self-correcting RAG pipeline that retrieves documents, grades them for relevance, reformulates the query if needed, generates an answer, and validates that the answer is grounded in the retrieved context.
Step 1: Define the State and Retrieval
from typing import TypedDict, List, Optional
from langchain_core.documents import Document
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings

class AgenticRAGState(TypedDict):
    question: str
    generation: Optional[str]
    documents: List[Document]
    query: str
    retry_count: int
    max_retries: int

embeddings = OpenAIEmbeddings(model="text-embedding-3-large")
vector_store = Chroma(
    collection_name="enterprise_docs",
    embedding_function=embeddings,
    persist_directory="./chroma_db"
)
retriever = vector_store.as_retriever(search_kwargs={"k": 6})

def retrieve(state: AgenticRAGState) -> AgenticRAGState:
    """Retrieve documents using the current query."""
    query = state.get("query", state["question"])
    documents = retriever.invoke(query)
    # LangGraph merges this partial update into the graph state
    return {"documents": documents, "query": query}
Step 2: Grade Retrieved Documents
This is the node that standard RAG lacks entirely. We use the LLM to evaluate whether each retrieved document is actually relevant to the query:
from langchain_openai import ChatOpenAI
from pydantic import BaseModel, Field

class RelevanceGrade(BaseModel):
    """Grade the relevance of a retrieved document."""
    is_relevant: bool = Field(
        description="Whether the document is relevant to the query"
    )
    reasoning: str = Field(
        description="Brief explanation of relevance assessment"
    )

grading_llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
grader = grading_llm.with_structured_output(RelevanceGrade)

def grade_documents(state: AgenticRAGState) -> AgenticRAGState:
    """Grade each retrieved document for relevance to the question."""
    question = state["question"]
    documents = state["documents"]
    relevant_docs = []
    for doc in documents:
        grade = grader.invoke(
            f"Question: {question}\n\n"
            f"Document: {doc.page_content}\n\n"
            f"Is this document relevant to answering the question?"
        )
        if grade.is_relevant:
            relevant_docs.append(doc)
    return {"documents": relevant_docs}
Using gpt-4o-mini for the grading step is a deliberate cost optimization. The grading task is relatively simple (binary relevance assessment), so a smaller, faster model handles it well. Save the more capable model for generation.
Step 3: Route, Reformulate, and Generate
The conditional edge decides whether to generate an answer or reformulate the query for another retrieval attempt:
def should_generate(state: AgenticRAGState) -> str:
    """Decide whether to generate an answer or reformulate the query."""
    if len(state["documents"]) >= 2:
        return "generate"
    elif state.get("retry_count", 0) >= state.get("max_retries", 3):
        return "generate"
    return "reformulate"

reformulation_llm = ChatOpenAI(model="gpt-4o", temperature=0.3)

def reformulate_query(state: AgenticRAGState) -> AgenticRAGState:
    """Reformulate the query to improve retrieval results."""
    question = state["question"]
    current_query = state.get("query", question)
    retry_count = state.get("retry_count", 0)
    reformulated = reformulation_llm.invoke(
        f"The following search query did not return sufficiently "
        f"relevant results:\n\n"
        f"Original question: {question}\n"
        f"Search query used: {current_query}\n\n"
        f"Rewrite the search query to better match documents that "
        f"would contain the answer. Use different terminology, "
        f"be more specific, or try a different angle. "
        f"Return ONLY the new search query, nothing else."
    )
    return {
        "query": reformulated.content,
        "retry_count": retry_count + 1,
    }
Step 4: Generate and Validate
The generation node produces an answer, and a groundedness check verifies that the answer is supported by the retrieved documents:
generation_llm = ChatOpenAI(model="gpt-4o", temperature=0.2)

def generate(state: AgenticRAGState) -> AgenticRAGState:
    """Generate an answer from relevant documents."""
    question = state["question"]
    documents = state["documents"]
    context = "\n\n---\n\n".join([doc.page_content for doc in documents])
    response = generation_llm.invoke(
        f"Answer the following question based on the provided context. "
        f"If the context does not contain enough information to fully "
        f"answer the question, say so explicitly. Do not fabricate "
        f"information.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )
    return {"generation": response.content}

class GroundednessGrade(BaseModel):
    """Evaluate whether the answer is grounded in the documents."""
    is_grounded: bool = Field(
        description="Whether all claims are supported by the documents"
    )
    unsupported_claims: List[str] = Field(
        default_factory=list,
        description="List of claims not supported by the documents"
    )

groundedness_grader = grading_llm.with_structured_output(GroundednessGrade)

def check_groundedness(state: AgenticRAGState) -> str:
    """Check if the generated answer is grounded in retrieved documents."""
    context = "\n\n".join([doc.page_content for doc in state["documents"]])
    grade = groundedness_grader.invoke(
        f"Documents:\n{context}\n\n"
        f"Answer:\n{state['generation']}\n\n"
        f"Are all claims in the answer supported by the documents?"
    )
    return "grounded" if grade.is_grounded else "not_grounded"
Step 5: Assemble the Graph
from langgraph.graph import StateGraph, END

workflow = StateGraph(AgenticRAGState)
workflow.add_node("retrieve", retrieve)
workflow.add_node("grade_documents", grade_documents)
workflow.add_node("reformulate", reformulate_query)
workflow.add_node("generate", generate)

workflow.set_entry_point("retrieve")
workflow.add_edge("retrieve", "grade_documents")
workflow.add_conditional_edges(
    "grade_documents",
    should_generate,
    {"generate": "generate", "reformulate": "reformulate"}
)
workflow.add_edge("reformulate", "retrieve")
workflow.add_conditional_edges(
    "generate",
    check_groundedness,
    {"grounded": END, "not_grounded": "reformulate"}
)

app = workflow.compile()
Running the Pipeline
result = app.invoke({
    "question": "What compliance requirements apply to our healthcare AI deployments in states with strict data residency laws?",
    "query": "What compliance requirements apply to our healthcare AI deployments in states with strict data residency laws?",
    "documents": [],
    "generation": None,
    "retry_count": 0,
    "max_retries": 3,
})
print(result["generation"])
A query like this will likely trigger the reformulation loop at least once. The initial retrieval might return general compliance documents, but the grading step will flag them as insufficient because they do not address state-specific data residency laws. The agent then reformulates and retrieves more targeted documents on the second pass.
This adaptive behavior is what makes agentic RAG valuable in enterprise settings. For a real-world example, see our Hello Kidney case study where conversational AI had to retrieve from clinical guidelines, patient records, and educational materials simultaneously.
When to Use Standard RAG vs Agentic RAG
Standard RAG and agentic RAG are not competing approaches. They serve different complexity levels, and the right choice depends on your query patterns, accuracy requirements, and latency budget. Here is a decision framework:
Use Standard RAG When:
- Queries are direct lookups: "What is our refund policy?" - single document, single retrieval
- The knowledge base is well-structured: Clean, consistently chunked documents with good metadata
- Latency is critical: Standard RAG completes in 1-2 seconds. Agentic RAG can take 5-15 seconds for complex queries
- Cost sensitivity is high: Standard RAG uses one embedding call and one LLM call per query
- Error tolerance is acceptable: If getting an imperfect answer 15-20% of the time is fine for your use case
Use Agentic RAG When:
- Queries require multi-hop reasoning: Answers span multiple documents or require connecting information across sources
- Data sources are heterogeneous: You need to route queries to different backends (vector, SQL, API, graph)
- Query phrasing is unpredictable: Users in your system phrase questions in ways that do not match document vocabulary
- Accuracy requirements are high: Healthcare, legal, and financial applications where wrong answers carry real risk
- Conflicting or versioned documents: Your knowledge base contains documents that may contradict each other, and recency or authority matters
The Hybrid Approach
Most production systems end up somewhere in between. A practical pattern is to start with standard RAG and add agentic capabilities selectively:
class QueryComplexity(BaseModel):
    """Classify query complexity to determine routing."""
    complexity: str = Field(
        description="'simple' for direct lookups, 'complex' for multi-step reasoning"
    )
    reasoning: str = Field(description="Why this classification was chosen")

classifier = ChatOpenAI(model="gpt-4o-mini", temperature=0).with_structured_output(
    QueryComplexity
)

def route_query(question: str) -> str:
    result = classifier.invoke(
        f"Classify this query as 'simple' (can be answered from a single "
        f"document chunk) or 'complex' (requires multi-step reasoning, "
        f"multiple sources, or query reformulation):\n\n{question}"
    )
    if result.complexity == "simple":
        return standard_rag_pipeline(question)
    else:
        return agentic_rag_pipeline(question)
This keeps latency low for simple queries while providing the accuracy benefits of agentic RAG for complex ones. The classifier itself is cheap - a single gpt-4o-mini call with structured output.
Production Considerations
Moving agentic RAG from a prototype to a production system requires careful attention to latency, cost, evaluation, and failure modes. The agentic loop that makes retrieval smarter also makes it slower and more expensive, so production deployments need deliberate optimization.

Latency and Cost
Standard RAG typically completes in 1-2 seconds: one embedding call (~100ms), one vector search (~50ms), one LLM generation (~1-2s). Agentic RAG multiplies this because each step in the agent loop involves at least one additional LLM call:
Latency by step:
- Initial retrieval: 150ms (1 call)
- Document grading: 200-500ms (1 call, batched)
- Query reformulation: 500-800ms (0-3 calls)
- Additional retrievals: 150ms each (0-3 calls)
- Generation: 1-2s (1 call)
- Groundedness check: 300-500ms (1 call)
Cost per query:
- Standard RAG: 1 LLM call, ~$0.005-0.02 per query
- Agentic RAG (no retry): 3-4 LLM calls, ~$0.015-0.06 per query
- Agentic RAG (with retries): 5-8 LLM calls, ~$0.03-0.12 per query
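Figures like these are easy to sanity-check with a back-of-envelope helper. The prices and token counts below are placeholders; substitute your model's current rates:

```python
def estimate_query_cost(
    llm_calls: int,
    avg_tokens_in: int,
    avg_tokens_out: int,
    price_in_per_1k: float,   # $ per 1K input tokens (model-specific)
    price_out_per_1k: float,  # $ per 1K output tokens (model-specific)
) -> float:
    """Rough per-query cost: every call pays for its prompt and completion."""
    per_call = (
        avg_tokens_in / 1000 * price_in_per_1k
        + avg_tokens_out / 1000 * price_out_per_1k
    )
    return llm_calls * per_call
```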
Mitigation strategies:
- Parallel retrieval: When the agent decomposes a query into sub-queries, execute all retrievals concurrently rather than sequentially
- Streaming: Stream the final generation so the user sees output immediately once the retrieval loop completes
- Async grading: Grade documents asynchronously and filter during generation rather than blocking the pipeline
- Tiered models: Use gpt-4o-mini or Claude Haiku for grading and classification steps. Reserve larger models for generation only
- Cache reformulated queries: If the same query pattern triggers the same reformulation, cache the mapping to skip the reformulation LLM call
- Monitor retry rates: If more than 30% of queries trigger reformulation, the underlying retrieval pipeline needs improvement (better chunking, better embeddings, metadata filtering)
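The parallel-retrieval strategy is straightforward with asyncio. LangChain retrievers expose an async `ainvoke`, so the `aretrieve` parameter below could be `retriever.ainvoke`; the callable typing keeps the sketch self-contained:

```python
import asyncio
from typing import Awaitable, Callable, List

async def retrieve_parallel(
    sub_queries: List[str],
    aretrieve: Callable[[str], Awaitable[List[str]]],  # e.g. retriever.ainvoke
) -> List[List[str]]:
    """Fan out all sub-query retrievals concurrently: total latency is
    roughly the slowest single call, not the sum of all of them."""
    return list(await asyncio.gather(*(aretrieve(q) for q in sub_queries)))
```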
Evaluation Metrics
Standard RAG evaluation metrics (retrieval recall, generation faithfulness, answer relevance) still apply, but agentic RAG introduces additional dimensions you need to track. Without proper evaluation, you are flying blind - the added complexity of the agentic loop means more places where things can go wrong silently:
- Retrieval efficiency: How many retrieval rounds does the agent need on average? Track this as a distribution, not just a mean. If 90% of queries resolve in one round but 10% hit the max retry limit, that tail matters
- Grading accuracy: Is the relevance grader correctly identifying irrelevant documents? Sample grading decisions and have humans verify them. A grader that is too aggressive (rejecting relevant documents) causes unnecessary retries. A grader that is too lenient defeats the purpose
- Reformulation quality: When the agent reformulates a query, does the new query actually return better results? Compare retrieval recall before and after reformulation
- End-to-end accuracy: Answer correctness, groundedness, and completeness using both automated evaluations (LLM-as-judge) and periodic human evaluation on a held-out test set
from dataclasses import dataclass

@dataclass
class AgenticRAGMetrics:
    query_id: str
    total_retrieval_rounds: int
    documents_retrieved: int
    documents_after_grading: int
    reformulations_triggered: int
    total_latency_ms: float
    answer_relevance: float     # 0-1, LLM-graded
    answer_groundedness: float  # 0-1, LLM-graded
    answer_completeness: float  # 0-1, LLM-graded

def log_metrics(metrics: AgenticRAGMetrics):
    """Log metrics to your observability platform."""
    print(f"Query {metrics.query_id}: "
          f"{metrics.total_retrieval_rounds} rounds, "
          f"{metrics.reformulations_triggered} reformulations, "
          f"{metrics.total_latency_ms:.0f}ms, "
          f"relevance={metrics.answer_relevance:.2f}")
Failure Modes and Guardrails
Agentic RAG introduces failure modes that standard RAG does not have:
Infinite loops: If the grader consistently rejects documents and the reformulator keeps generating similar queries, the agent can loop indefinitely. Always set a max_retries limit (3 is a good default) and have a fallback behavior - either generate with whatever context is available or return an honest "I could not find sufficient information to answer this question."
Grader hallucination: The grading LLM itself can make mistakes. It might mark a highly relevant document as irrelevant (false negative) or approve a tangentially related document (false positive). Structured output with binary classification reduces this, but monitor grading accuracy over time.
Query drift: Through multiple reformulations, the agent's search query can drift away from the user's original intent. Include the original question in every reformulation prompt so the agent stays anchored.
Cost explosion: A bug in the retry logic or a systematic retrieval failure can cause a single query to burn through many LLM calls. Set hard limits on total LLM calls per query and alert when queries exceed expected cost thresholds.
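A hard per-query budget is cheap insurance against that failure mode. A minimal sketch; in practice you would wire this into whatever callback or middleware layer wraps your LLM calls:

```python
class CallBudget:
    """Hard cap on LLM calls for a single query: failing loudly beats
    silently burning through retries."""

    def __init__(self, max_calls: int = 10):
        self.max_calls = max_calls
        self.used = 0

    def spend(self) -> None:
        """Record one LLM call; raise once the budget is exhausted."""
        self.used += 1
        if self.used > self.max_calls:
            raise RuntimeError(
                f"LLM call budget exceeded: {self.used} > {self.max_calls}"
            )
```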
Cost Optimization Tactics
Beyond the mitigation strategies listed above, a few cost-specific tactics deserve attention:
- Tune the relevance threshold: A stricter grading threshold means more retries. Find the sweet spot where you filter out genuinely irrelevant documents without triggering unnecessary reformulations
- Batch grading calls: Instead of grading documents one by one, grade them all in a single LLM call with structured output
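A sketch of batched grading. The `llm_grade_all` callable is an assumption standing in for a structured-output call that returns one verdict per numbered document, replacing the per-document loop in the grading node:

```python
from typing import Callable, List

def grade_batch(
    question: str,
    docs: List[str],
    llm_grade_all: Callable[[str], List[bool]],  # one verdict per numbered doc
) -> List[str]:
    """Grade every candidate document in a single LLM call instead of
    one call per document, then keep only the relevant ones."""
    prompt = (
        f"Question: {question}\n\n"
        + "\n\n".join(f"[{i}] {d}" for i, d in enumerate(docs))
        + "\n\nFor each numbered document, is it relevant to the question?"
    )
    verdicts = llm_grade_all(prompt)
    return [d for d, keep in zip(docs, verdicts) if keep]
```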
Building an Agentic RAG Strategy
Implementing agentic RAG is not just a code problem. It is a strategic decision about how your organization retrieves and reasons over its knowledge. Before writing code, answer the scoping questions this article has raised: which query patterns actually need agentic behavior, what accuracy and latency budgets apply, which data sources must be routed to, and how you will evaluate the system.
The organizations getting the most value from agentic RAG are the ones that treat retrieval as an engineering problem, not a plug-and-play feature. They invest in evaluation, iterate on chunking strategies, and build feedback loops between retrieval quality metrics and system improvements. If you are planning an agentic RAG implementation or looking to improve an existing retrieval pipeline, the questions above will help you scope the effort realistically and avoid over-engineering solutions for problems that do not require them.
Conclusion
Standard RAG is a good starting point, but it hits a ceiling quickly in enterprise environments where queries are complex, data sources are heterogeneous, and accuracy requirements are high. Agentic RAG addresses these limitations by putting an AI agent in control of the retrieval process, enabling query decomposition, relevance grading, iterative reformulation, and groundedness verification.
The key architectural patterns - single-agent tool use, multi-agent specialized retrievers, and self-correcting RAG with reflection - each suit different complexity levels. Start with the self-correcting pattern (it delivers the most value for the least architectural complexity), add multi-agent routing when you have multiple data backends, and use the hybrid routing approach to keep costs and latency manageable.
The implementation is well within reach for engineering teams using frameworks like LangGraph. The real challenge is not the code - it is building the evaluation infrastructure to know whether your agentic RAG system is actually better than the standard pipeline it replaces. Measure retrieval efficiency, grading accuracy, reformulation quality, and end-to-end answer correctness. Let the metrics guide your architecture decisions, not assumptions about what "should" work better.
We build agentic RAG systems for enterprise clients in healthcare, finance, and government. Learn about our development process or talk to us.
BeyondScale Team
AI/ML Team
AI/ML Team at BeyondScale Technologies, an ISO 27001 certified AI consulting firm and AWS Partner. Specializing in enterprise AI agents, multi-agent systems, and cloud architecture.


