
How to Build a Self-Evaluating Agentic AI System with LlamaIndex and OpenAI Using Retrieval, Tool Use, and Automated Quality Checks

In this tutorial, we build an advanced agentic AI workflow using LlamaIndex and OpenAI models. We focus on designing a reliable retrieval-augmented generation (RAG) agent that can reflect on evidence, use tools deliberately, and evaluate its own output for quality. By structuring the system around retrieval, answer synthesis, and self-evaluation, we show how agentic patterns move beyond simple chatbots toward more trustworthy, controllable AI systems suited to research and analytical use cases.

!pip -q install -U llama-index llama-index-llms-openai llama-index-embeddings-openai nest_asyncio


import os
import asyncio
import nest_asyncio
nest_asyncio.apply()


from getpass import getpass


if not os.environ.get("OPENAI_API_KEY"):
   os.environ["OPENAI_API_KEY"] = getpass("Enter OPENAI_API_KEY: ")

We set up the environment and install all the dependencies required to run the agentic AI workflow. We load the OpenAI API key securely at runtime, so credentials are never hardcoded into the notebook, and we apply nest_asyncio so the notebook can handle asynchronous execution seamlessly.
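If you prefer not to type the key interactively, a minimal alternative is to load it from a local .env file with python-dotenv. This is an assumption on our part (the .env file and the python-dotenv dependency are not part of the original setup):

# Optional alternative: load the key from a local .env file (assumed to exist)
# pip install python-dotenv
from dotenv import load_dotenv
import os

load_dotenv()  # reads OPENAI_API_KEY=... from .env into the environment
assert os.environ.get("OPENAI_API_KEY"), "OPENAI_API_KEY is still missing"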

from llama_index.core import Document, VectorStoreIndex, Settings
from llama_index.llms.openai import OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding


Settings.llm = OpenAI(model="gpt-4o-mini", temperature=0.2)
Settings.embed_model = OpenAIEmbedding(model="text-embedding-3-small")


texts = [
   "Reliable RAG systems separate retrieval, synthesis, and verification. Common failures include hallucination and shallow retrieval.",
   "RAG evaluation focuses on faithfulness, answer relevancy, and retrieval quality.",
   "Tool-using agents require constrained tools, validation, and self-review loops.",
   "A robust workflow follows retrieve, answer, evaluate, and revise steps."
]


docs = [Document(text=t) for t in texts]
index = VectorStoreIndex.from_documents(docs)
query_engine = index.as_query_engine(similarity_top_k=4)

We configure the OpenAI language model and embedding model and build a small in-memory knowledge base for our agent. We convert raw text into indexed documents so that the agent can retrieve relevant evidence at inference time.
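As a quick sanity check (not part of the original flow), we can query the index directly and inspect which snippets it retrieves before wiring it into the agent:

# Optional sanity check: query the index directly before handing it to the agent
resp = query_engine.query("What are common failure modes of RAG systems?")
print(resp)  # synthesized answer
for node in resp.source_nodes:
    print("-", node.node.get_content()[:80], "| score:", node.score)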

from llama_index.core.evaluation import FaithfulnessEvaluator, RelevancyEvaluator


faith_eval = FaithfulnessEvaluator(llm=Settings.llm)
rel_eval = RelevancyEvaluator(llm=Settings.llm)


def retrieve_evidence(q: str) -> str:
    """Retrieve the top supporting snippets from the knowledge base for a query."""
    r = query_engine.query(q)
    out = []
    for i, n in enumerate(r.source_nodes or []):
        out.append(f"[{i+1}] {n.node.get_content()[:300]}")
    return "\n".join(out)


def score_answer(q: str, a: str) -> str:
    """Score an answer for faithfulness and relevancy against retrieved context."""
    retrieval = query_engine.query(q)
    ctx = [n.node.get_content() for n in retrieval.source_nodes or []]
    faith = faith_eval.evaluate(query=q, response=a, contexts=ctx)
    rel = rel_eval.evaluate(query=q, response=a, contexts=ctx)
    return f"Faithfulness: {faith.score}\nRelevancy: {rel.score}"

We define the two primary tools the agent will use: one for retrieving evidence and one for evaluating answers. We implement automatic scoring for faithfulness and relevancy so that the agent can judge the quality of its own responses.
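To see what these tools return, we can call them directly on a hand-written draft answer. This is only an illustrative check, not part of the agent loop:

# Optional: exercise the tools directly before giving them to the agent
question = "How should a reliable RAG workflow be structured?"
draft = "A reliable RAG workflow separates retrieval, synthesis, and verification steps."

print(retrieve_evidence(question))    # numbered evidence snippets
print(score_answer(question, draft))  # prints the Faithfulness and Relevancy scores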

from llama_index.core.agent.workflow import ReActAgent
from llama_index.core.workflow import Context


agent = ReActAgent(
   tools=[retrieve_evidence, score_answer],
   llm=Settings.llm,
   system_prompt="""
Always retrieve evidence first.
Produce a structured answer.
Evaluate the answer and revise once if scores are low.
""",
   verbose=True
)


ctx = Context(agent)

We create the ReAct-based agent and define its system behavior, directing how it retrieves evidence, generates answers, and reviews results. We also initialize an execution context that maintains the agent’s state across interactions. This step combines tool use and reasoning into a single agent workflow.
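If you want explicit tool names and descriptions rather than relying on function docstrings, the callables can also be wrapped as FunctionTool objects. This is a minimal, optional sketch of an equivalent setup, not a required part of the tutorial:

# Optional: wrap the callables explicitly so the agent sees clear tool descriptions
from llama_index.core.tools import FunctionTool

retrieve_tool = FunctionTool.from_defaults(
    fn=retrieve_evidence,
    name="retrieve_evidence",
    description="Fetch top-k supporting snippets from the knowledge base.",
)
score_tool = FunctionTool.from_defaults(
    fn=score_answer,
    name="score_answer",
    description="Evaluate a draft answer for faithfulness and relevancy.",
)

# The agent could then be built from these instead of the raw functions:
# agent = ReActAgent(tools=[retrieve_tool, score_tool], llm=Settings.llm, ...)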

async def run_brief(topic: str):
   q = f"Design a reliable RAG + tool-using agent workflow and how to evaluate it. Topic: {topic}"
   handler = agent.run(q, ctx=ctx)
   async for ev in handler.stream_events():
       print(getattr(ev, "delta", ""), end="")
   res = await handler
   return str(res)


topic = "RAG agent reliability and evaluation"
loop = asyncio.get_event_loop()
result = loop.run_until_complete(run_brief(topic))


print("\n\nFINAL OUTPUT\n")
print(result)

We run the complete agent loop by passing a topic to the system and streaming the agent’s reasoning and output as it works. We let the agent complete the retrieval, generation, and evaluation cycle asynchronously.
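Outside a notebook, where no event loop is already running and nest_asyncio is unnecessary, the same brief can be driven with asyncio.run. A minimal script-style sketch, assuming the code above lives in a plain Python module:

# Optional: script-style entry point (no notebook event loop assumed)
import asyncio

async def main():
    topics = ["RAG agent reliability and evaluation", "tool-use validation loops"]
    for t in topics:
        print(f"\n=== {t} ===")
        print(await run_brief(t))

if __name__ == "__main__":
    asyncio.run(main())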

In conclusion, we showed how an agent can retrieve supporting evidence, produce a structured response, and evaluate its faithfulness and relevancy before finalizing the answer. We kept the design modular and transparent, making it easy to extend the workflow with additional tools, evaluators, or domain-specific knowledge sources. This approach demonstrates how we can use agentic AI with LlamaIndex and OpenAI models to build systems that are more capable, more reliable, and more self-aware in their reasoning and responses.
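As one example of such an extension, an additional evaluator could be exposed as a third tool. The sketch below uses LlamaIndex's GuidelineEvaluator; the guideline text and the check_guidelines helper are our own illustrative assumptions, not part of the tutorial above:

# Optional extension: a third tool that checks answers against editorial guidelines
from llama_index.core.evaluation import GuidelineEvaluator

guideline_eval = GuidelineEvaluator(
    llm=Settings.llm,
    guidelines="The answer must be structured, grounded in retrieved evidence, and avoid speculation.",
)

def check_guidelines(q: str, a: str) -> str:
    """Check a draft answer against the guidelines and return the evaluator feedback."""
    result = guideline_eval.evaluate(query=q, response=a)
    return f"Passing: {result.passing}\nFeedback: {result.feedback}"

# agent = ReActAgent(tools=[retrieve_evidence, score_answer, check_guidelines], ...)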




