
A Hands-On Tutorial: Build a Modular LLM Evaluation Pipeline with Google Generative AI and LangChain

Evaluating large language models (LLMs) has become a pivotal challenge for advancing the reliability and utility of artificial intelligence in both academic and industrial settings. As the capabilities of these models expand, so does the need for rigorous, reproducible, and multi-faceted evaluation methodologies. In this tutorial, we take a comprehensive look at one of the field's most pressing frontiers: systematically assessing the strengths and limitations of LLMs across different dimensions of performance. Using Google's Generative AI models as the systems under test and the LangChain library as our orchestration tool, we present a robust, modular evaluation pipeline designed to run in Google Colab. The framework integrates criteria-based scoring, covering correctness, relevance, coherence, and conciseness, with pairwise model comparison and rich visual analytics to deliver nuanced, actionable insights. Grounded in a set of curated questions paired with objective ground-truth answers, this approach balances quantitative rigor with practical usability, giving researchers and developers a ready-to-use toolkit for high-resolution LLM evaluation.

!pip install langchain langchain-google-genai ragas pandas matplotlib

We install the core Python libraries needed to build and run the evaluation workflow: LangChain to orchestrate LLM interactions (with the langchain-google-genai extension for Google's Generative AI models), Ragas for evaluating retrieval-augmented generation pipelines, and Pandas plus Matplotlib for data manipulation and visualization.

import os
import pandas as pd
import matplotlib.pyplot as plt
from langchain_google_genai import ChatGoogleGenerativeAI
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain
from langchain.evaluation import load_evaluator
from langchain.schema import HumanMessage

We import the core Python tools, including os for environment management, pandas for handling data frames, and matplotlib.pyplot for plotting, along with LangChain's Google Generative AI client, prompt templating, chain construction, evaluator loading, and the HumanMessage schema, everything needed to build and evaluate LLM pipelines.

os.environ["GOOGLE_API_KEY"] = "Use Your API Key"

Here, we configure the environment by storing the Google API key in the GOOGLE_API_KEY environment variable, allowing the ChatGoogleGenerativeAI client to authenticate securely.
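
If you prefer not to hardcode the key in the notebook, a minimal alternative sketch (assuming an interactive Colab or Jupyter session) prompts for it at runtime instead:

import os
from getpass import getpass

# Prompt for the key only if it is not already set in the environment.
if not os.environ.get("GOOGLE_API_KEY"):
    os.environ["GOOGLE_API_KEY"] = getpass("Enter your Google API key: ")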

def create_evaluation_dataset():
    """Create a simple dataset for evaluation."""
    questions = [
        "Explain the concept of quantum computing in simple terms.",
        "How does a neural network learn?",
        "What are the main differences between SQL and NoSQL databases?",
        "Explain how blockchain technology works.",
        "What is the difference between supervised and unsupervised learning?"
    ]
   
    ground_truth = [
        "Quantum computing uses quantum bits or qubits that can exist in multiple states simultaneously, unlike classical bits. This allows quantum computers to process certain types of information much faster than classical computers for specific problems.",
        "Neural networks learn through a process called backpropagation where they adjust the weights between neurons based on the error between predicted and actual outputs, gradually minimizing this error through many iterations of training data.",
        "SQL databases are relational with structured schemas, fixed tables, and use SQL for queries. NoSQL databases are non-relational, schema-flexible, and designed for specific data models like document, key-value, wide-column, or graph formats.",
        "Blockchain is a distributed ledger technology where data is stored in blocks that are linked cryptographically. Each block contains transaction data and a timestamp, creating an immutable chain. Consensus mechanisms verify transactions without central authority.",
        "Supervised learning uses labeled data where the algorithm learns to predict outputs based on input-output pairs. Unsupervised learning works with unlabeled data to find patterns or structures without predefined outputs."
    ]
   
    return pd.DataFrame({"question": questions, "ground_truth": ground_truth})

We build a small evaluation dataset by pairing five example questions on AI and database concepts with their corresponding ground-truth answers, making it easy to score LLM responses against well-defined reference outputs.
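
As a quick sanity check, you can inspect the DataFrame the function returns; the expected values shown in the comments follow directly from the five question/answer pairs defined above:

dataset = create_evaluation_dataset()
print(dataset.shape)              # (5, 2)
print(dataset.columns.tolist())   # ['question', 'ground_truth']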

def setup_models():
    """Set up different Google Generative AI models for comparison."""
    models = {
        "gemini-2.0-flash-lite": ChatGoogleGenerativeAI(model="gemini-2.0-flash-lite", temperature=0),
        "gemini-2.0-flash": ChatGoogleGenerativeAI(model="gemini-2.0-flash", temperature=0)
    }
    return models

This function instantiates two ChatGoogleGenerativeAI clients, one backed by the gemini-2.0-flash-lite model and the other by gemini-2.0-flash, both with temperature set to 0 for deterministic outputs, so you can compare their answers side by side.
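
Before running the full pipeline, a minimal smoke test (assuming the API key is set, and reusing the HumanMessage pattern from the rest of the tutorial) confirms that both clients respond:

models = setup_models()
for name, model in models.items():
    reply = model.invoke([HumanMessage(content="Reply with the single word: ready")])
    print(name, "->", reply.content)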

def generate_responses(models, dataset):
    """Generate responses from each model for the questions in the dataset."""
    responses = {}
   
    for model_name, model in models.items():
        model_responses = []
        for question in dataset["question"]:
            try:
                response = model.invoke([HumanMessage(content=question)])
                model_responses.append(response.content)
            except Exception as e:
                print(f"Error with model {model_name} on question: {question}")
                print(f"Error: {e}")
                model_responses.append("Error generating response")
       
        responses[model_name] = model_responses
   
    return responses

This function loops over every configured model and every question in the dataset, invokes the model to generate an answer, catches any errors (logging them and inserting a placeholder string), and returns a dictionary mapping each model name to its list of generated responses.
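
Assuming models and dataset from the previous steps, the structure of the returned dictionary is easy to verify with a short inspection:

responses = generate_responses(models, dataset)
print(list(responses.keys()))                  # the two model names
print(len(responses["gemini-2.0-flash"]))      # one response per question (5)
print(responses["gemini-2.0-flash"][0][:200])  # preview of the first answer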

def evaluate_responses(models, dataset, responses):
    """Evaluate model responses using different evaluation criteria."""
    evaluator_model = ChatGoogleGenerativeAI(model="gemini-2.0-flash-lite", temperature=0)
   
    reference_criteria = ["correctness"]
    reference_free_criteria = [
        "relevance",  
        "coherence",    
        "conciseness"  
    ]
   
    results = {model_name: {criterion: [] for criterion in reference_criteria + reference_free_criteria}
               for model_name in models.keys()}
   
    for criterion in reference_criteria:
        evaluator = load_evaluator("labeled_criteria", criteria=criterion, llm=evaluator_model)
       
        for model_name in models.keys():
            for i, question in enumerate(dataset["question"]):
                ground_truth = dataset["ground_truth"][i]
                response = responses[model_name][i]
               
                if response != "Error generating response":
                    eval_result = evaluator.evaluate_strings(
                        prediction=response,
                        reference=ground_truth,
                        input=question
                    )
                    normalized_score = float(eval_result.get('score', 0)) * 2
                    results[model_name][criterion].append(normalized_score)
                else:
                    results[model_name][criterion].append(0)  
   
    for criterion in reference_free_criteria:
        evaluator = load_evaluator("criteria", criteria=criterion, llm=evaluator_model)
       
        for model_name in models.keys():
            for i, question in enumerate(dataset["question"]):
                response = responses[model_name][i]
               
                if response != "Error generating response":
                    eval_result = evaluator.evaluate_strings(
                        prediction=response,
                        input=question
                    )
                    normalized_score = float(eval_result.get('score', 0)) * 2
                    results[model_name][criterion].append(normalized_score)
                else:
                    results[model_name][criterion].append(0)  
    return results

This function uses a gemini-2.0-flash-lite judge to score each model's answers on both a reference-based criterion (correctness, checked against the ground truth) and reference-free criteria (relevance, coherence, conciseness), normalizes the scores, and returns a nested dictionary mapping each model and criterion to its list of evaluation results.
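
To see what the underlying evaluator produces for a single answer, here is a minimal sketch using the same load_evaluator call as above; LangChain's criteria evaluators typically return a dictionary with a binary 'score', a 'value' of 'Y' or 'N', and a written 'reasoning', which is what the *2 normalization in evaluate_responses scales.

judge = ChatGoogleGenerativeAI(model="gemini-2.0-flash-lite", temperature=0)
relevance_eval = load_evaluator("criteria", criteria="relevance", llm=judge)

single = relevance_eval.evaluate_strings(
    prediction="Qubits can exist in superposition, which enables certain speedups.",
    input="Explain the concept of quantum computing in simple terms."
)
print(single)  # e.g. {'reasoning': '...', 'value': 'Y', 'score': 1}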

def calculate_average_scores(evaluation_results):
    """Calculate average scores for each model and criterion."""
    avg_scores = {}
   
    for model_name, criteria in evaluation_results.items():
        avg_scores[model_name] = {}
       
        for criterion, scores in criteria.items():
            if scores:
                avg_scores[model_name][criterion] = sum(scores) / len(scores)
            else:
                avg_scores[model_name][criterion] = 0
               
        all_scores = [score for criterion_scores in criteria.values() for score in criterion_scores if score is not None]
        if all_scores:
            avg_scores[model_name]["overall"] = sum(all_scores) / len(all_scores)
        else:
            avg_scores[model_name]["overall"] = 0
           
    return avg_scores

This function walks through the nested evaluation results to compute the average score for each criterion across all questions for every model. It also computes an overall average by pooling all of the individual criterion scores. The returned dictionary maps each model to its per-criterion averages plus an aggregate "overall" performance score.
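
A tiny worked example with hypothetical scores makes the aggregation concrete:

sample_results = {"model-a": {"correctness": [2, 0, 2], "relevance": [2, 2, 2]}}
print(calculate_average_scores(sample_results))
# {'model-a': {'correctness': 1.333..., 'relevance': 2.0, 'overall': 1.666...}}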

def visualize_results(avg_scores):
    """Visualize evaluation results with bar charts."""
    models = list(avg_scores.keys())
    criteria = list(avg_scores[models[0]].keys())
   
    plt.figure(figsize=(14, 8))
   
    bar_width = 0.8 / len(models)
   
    positions = range(len(criteria))
   
    for i, model in enumerate(models):
        model_scores = [avg_scores[model][criterion] for criterion in criteria]
        plt.bar([p + i * bar_width for p in positions], model_scores,
                width=bar_width, label=model)
   
    plt.xlabel('Evaluation Criteria', fontsize=12)
    plt.ylabel('Average Score', fontsize=12)
    plt.title('LLM Model Comparison by Evaluation Criteria', fontsize=14)
    plt.xticks([p + bar_width * (len(models) - 1) / 2 for p in positions], criteria)
    plt.legend()
    plt.grid(axis="y", linestyle="--", alpha=0.7)
   
    plt.tight_layout()
    plt.show()
   
    plt.figure(figsize=(10, 8))
   
    categories = [c for c in criteria if c != 'overall']
    N = len(categories)
   
    angles = [n / float(N) * 2 * 3.14159 for n in range(N)]
    angles += angles[:1]  
   
    plt.polar(angles, [0] * (N + 1))
    plt.xticks(angles[:-1], categories)
   
    for model in models:
        values = [avg_scores[model][c] for c in categories]
        values += values[:1]  
        plt.polar(angles, values, label=model)
   
    plt.legend(loc="upper right")
    plt.title('LLM Model Comparison - Radar Chart', fontsize=14)
    plt.tight_layout()
    plt.show()

This function draws grouped bar charts to compare each model's average score across all evaluation criteria, then renders a radar chart to visualize the models' performance profiles, making relative strengths and weaknesses easy to spot at a glance.
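
You can preview both charts without making any API calls by passing in a hand-built scores dictionary; the numbers below are invented purely for illustration:

sample_avg = {
    "gemini-2.0-flash-lite": {"correctness": 1.6, "relevance": 2.0, "coherence": 2.0, "conciseness": 1.2, "overall": 1.7},
    "gemini-2.0-flash": {"correctness": 1.8, "relevance": 2.0, "coherence": 2.0, "conciseness": 1.0, "overall": 1.7},
}
visualize_results(sample_avg)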

def main():
    print("Creating evaluation dataset...")
    dataset = create_evaluation_dataset()
   
    print("Setting up models...")
    models = setup_models()
   
    print("Generating responses...")
    responses = generate_responses(models, dataset)
   
    print("Evaluating responses...")
    evaluation_results = evaluate_responses(models, dataset, responses)
   
    print("Calculating average scores...")
    avg_scores = calculate_average_scores(evaluation_results)
   
    print("Average scores:")
    for model, scores in avg_scores.items():
        print(f"\n{model}:")
        for criterion, score in scores.items():
            print(f"  {criterion}: {score:.2f}")
   
    print("\nVisualizing results...")
    visualize_results(avg_scores)
   
    print("Saving results to CSV...")
    results_df = pd.DataFrame(columns=["Model", "Criterion", "Score"])
    for model, criteria in avg_scores.items():
        for criterion, score in criteria.items():
            results_df = pd.concat([results_df, pd.DataFrame([{"Model": model, "Criterion": criterion, "Score": score}])],
                                  ignore_index=True)
   
    results_df.to_csv("llm_evaluation_results.csv", index=False)
    print("Results saved to llm_evaluation_results.csv")
   
    detailed_df = pd.DataFrame(columns=["Question", "Ground Truth"] + list(models.keys()))
   
    for i, question in enumerate(dataset["question"]):
        row = {
            "Question": question,
            "Ground Truth": dataset["ground_truth"][i]
        }
       
        for model_name in models.keys():
            row[model_name] = responses[model_name][i]
       
        detailed_df = pd.concat([detailed_df, pd.DataFrame([row])], ignore_index=True)
   
    detailed_df.to_csv("llm_response_comparison.csv", index=False)
    print("Detailed responses saved to llm_response_comparison.csv")

The main function orchestrates the full end-to-end evaluation workflow: it creates the dataset, sets up the models, generates and scores responses, prints the average score for each criterion, visualizes performance with charts, and finally exports both the summary and the detailed results as CSV files.

def pairwise_model_comparison(models, dataset, responses):
    """Compare two models side by side using an LLM as judge."""
    evaluator_model = ChatGoogleGenerativeAI(model="gemini-2.0-flash-lite", temperature=0)
   
    pairwise_template = """
    Question: {question}
   
    Response A: {response_a}
   
    Response B: {response_b}
   
    Which response better answers the user's question? Consider factors like accuracy,
    helpfulness, clarity, and completeness.
   
    First, analyze each response point by point. Then conclude with your choice of either:
    A is better, B is better, or They are equally good/bad.
   
    Your analysis:
    """
   
    pairwise_prompt = PromptTemplate(
        input_variables=["question", "response_a", "response_b"],
        template=pairwise_template
    )
   
    pairwise_chain = LLMChain(llm=evaluator_model, prompt=pairwise_prompt)
   
    model_names = list(models.keys())
   
    pairwise_results = {f"{model_a} vs {model_b}": [] for model_a in model_names for model_b in model_names if model_a != model_b}
   
    for i, question in enumerate(dataset["question"]):
        for j, model_a in enumerate(model_names):
            for model_b in model_names[j+1:]:  
                response_a = responses[model_a][i]
                response_b = responses[model_b][i]
               
                if response_a != "Error generating response" and response_b != "Error generating response":
                    comparison_result = pairwise_chain.run(
                        question=question,
                        response_a=response_a,
                        response_b=response_b
                    )
                   
                    key_ab = f"{model_a} vs {model_b}"
                    pairwise_results[key_ab].append({
                        "question": question,
                        "result": comparison_result
                    })
   
    return pairwise_results

This function runs head-to-head comparisons for every unique pair of models by prompting a gemini-2.0-flash-lite judge to analyze their responses for accuracy, helpfulness, clarity, and completeness, and it collects the verdicts for all questions into an organized dictionary for side-by-side review.
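
Because the judge's verdict comes back as free text, you may want to tally wins per pairing. The helper below is a hypothetical, heuristic sketch that simply scans each analysis for the concluding phrases requested in the prompt; it is not part of the original pipeline and will miscount if the judge words its conclusion differently.

def tally_pairwise_wins(pairwise_results):
    """Heuristically count judge verdicts per model pairing (hypothetical helper)."""
    tallies = {}
    for comparison, items in pairwise_results.items():
        counts = {"A is better": 0, "B is better": 0, "tie/unclear": 0}
        for item in items:
            verdict = item["result"].lower()
            if "a is better" in verdict:
                counts["A is better"] += 1
            elif "b is better" in verdict:
                counts["B is better"] += 1
            else:
                counts["tie/unclear"] += 1
        tallies[comparison] = counts
    return tallies

# Example usage: print(tally_pairwise_wins(pairwise_results))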

def enhanced_main():
    """Enhanced main function with additional evaluations."""
    print("Creating evaluation dataset...")
    dataset = create_evaluation_dataset()
   
    print("Setting up models...")
    models = setup_models()
   
    print("Generating responses...")
    responses = generate_responses(models, dataset)
   
    print("Evaluating responses...")
    evaluation_results = evaluate_responses(models, dataset, responses)
   
    print("Calculating average scores...")
    avg_scores = calculate_average_scores(evaluation_results)
   
    print("Average scores:")
    for model, scores in avg_scores.items():
        print(f"\n{model}:")
        for criterion, score in scores.items():
            print(f"  {criterion}: {score:.2f}")
   
    print("\nVisualizing results...")
    visualize_results(avg_scores)
   
    print("\nPerforming pairwise model comparison...")
    pairwise_results = pairwise_model_comparison(models, dataset, responses)
   
    print("\nPairwise comparison results:")
    for comparison, results in pairwise_results.items():
        print(f"\n{comparison}:")
        for i, result in enumerate(results[:2]):
            print(f"  Question {i+1}: {result['question']}")
            print(f"  Analysis: {result['result'][:100]}...")
   
    print("\nSaving all results...")
    results_df = pd.DataFrame(columns=["Model", "Criterion", "Score"])
    for model, criteria in avg_scores.items():
        for criterion, score in criteria.items():
            results_df = pd.concat([results_df, pd.DataFrame([{"Model": model, "Criterion": criterion, "Score": score}])],
                                  ignore_index=True)
   
    results_df.to_csv("llm_evaluation_results.csv", index=False)
   
    detailed_df = pd.DataFrame(columns=["Question", "Ground Truth"] + list(models.keys()))
   
    for i, question in enumerate(dataset["question"]):
        row = {
            "Question": question,
            "Ground Truth": dataset["ground_truth"][i]
        }
       
        for model_name in models.keys():
            row[model_name] = responses[model_name][i]
       
        detailed_df = pd.concat([detailed_df, pd.DataFrame([row])], ignore_index=True)
   
    detailed_df.to_csv("llm_response_comparison.csv", index=False)
   
    pairwise_df = pd.DataFrame(columns=["Comparison", "Question", "Analysis"])
   
    for comparison, results in pairwise_results.items():
        for result in results:
            pairwise_df = pd.concat([pairwise_df, pd.DataFrame([{
                "Comparison": comparison,
                "Question": result["question"],
                "Analysis": result["result"]
            }])], ignore_index=True)
   
    pairwise_df.to_csv("llm_pairwise_comparison.csv", index=False)
   
    print("All results saved to CSV files.")

The enhanced_main function extends the basic evaluation pipeline by adding automated pairwise model comparisons, printing brief progress updates at each stage, and exporting three CSV files, summary scores, detailed responses, and pairwise analyses, so you end up with a complete side-by-side evaluation record.

if __name__ == "__main__":
    enhanced_main()

Finally, this guard ensures that when the script is executed directly (rather than imported), it calls enhanced_main() to run the full end-to-end evaluation and comparison pipeline.

In conclusion, this tutorial has presented a versatile, extensible framework for evaluating and comparing LLM performance, leveraging Google's Generative AI models alongside the LangChain library for orchestration. Unlike simple accuracy benchmarks, the methodology presented here embraces the multidimensional nature of language understanding, combining granular criteria-based scoring, structured pairwise model comparison, and intuitive visualizations. By capturing key attributes, including correctness, relevance, coherence, and conciseness, our evaluation pipeline gives practitioners visibility into the fine-grained performance differences that directly affect downstream applications. The outputs, including CSV-based reports, radar charts, and grouped bar graphs, not only support transparent benchmarking but also guide data-driven decisions about model selection and deployment.


