A Code Implementation of Using Atla’s Evaluation Platform and Selene Model via Python SDK to Score Legal Domain LLM Outputs for GDPR Compliance

In this tutorial, we demonstrate how to evaluate the quality of LLM-generated responses using Atla's Python SDK, a powerful tool for automating evaluation workflows with natural-language criteria. Powered by Selene, Atla's state-of-the-art evaluator model, we analyze whether legal responses align with the principles of the GDPR (General Data Protection Regulation). The Atla platform enables programmatic assessments using custom or predefined criteria, with both synchronous and asynchronous support via the official Atla SDK.
In this implementation, we did the following:
- Used custom GDPR evaluation logic
- Queried Selene to return binary scores (0 or 1) and human-readable critiques
- Processed evaluations in a batch using asyncio
- Printed critiques to understand the reasoning behind each judgment
Colab setup requires minimal dependencies, primarily the Atla SDK, Pandas, and nest_asyncio.
!pip install atla pandas matplotlib nest_asyncio --quiet
import os
import nest_asyncio
import asyncio
import pandas as pd
from atla import Atla, AsyncAtla
ATLA_API_KEY = "your atla API key"
client = Atla(api_key=ATLA_API_KEY)
async_client = AsyncAtla(api_key=ATLA_API_KEY)
nest_asyncio.apply()
First, we install the required libraries and set up both the synchronous and asynchronous Atla clients using your API key. nest_asyncio is applied so that asynchronous code runs smoothly within a Jupyter or Colab notebook environment, enabling seamless integration with Atla's async API.
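To see the concurrency pattern the tutorial relies on in isolation, here is a minimal, self-contained sketch: asyncio.gather fans out many coroutines at once. The fake_evaluate coroutine is a hypothetical stand-in for an Atla API call, and in a notebook nest_asyncio.apply() would be called first so that asyncio.run can execute inside the already-running event loop.

```python
import asyncio

# In a notebook, nest_asyncio.apply() would be called here so that
# asyncio.run() works inside the live event loop.

async def fake_evaluate(item: str) -> str:
    """Hypothetical stand-in for an Atla evaluation call."""
    await asyncio.sleep(0)  # yield control, simulating network I/O
    return f"evaluated:{item}"

async def main() -> list:
    # gather schedules all coroutines concurrently, mirroring the
    # batch evaluation pattern used later in the tutorial
    return await asyncio.gather(*(fake_evaluate(x) for x in ["q1", "q2"]))

results = asyncio.run(main())
print(results)
```

The same fan-out shape appears later in evaluate_with_selene, where each coroutine wraps a real Selene call.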
data = [
    {
        "question": "Can a company monitor employee emails under GDPR?",
        "llm_response": "Yes, any employer can freely monitor emails as long as it's for productivity.",
        "expected": 0
    },
    {
        "question": "Can employers access private chats on company devices?",
        "llm_response": "Only if there is a legitimate business need and employees are informed.",
        "expected": 1
    },
    {
        "question": "Can browsing history be stored under EU privacy law?",
        "llm_response": "Yes, but consent and transparency are required.",
        "expected": 1
    },
    {
        "question": "Can employers check WhatsApp messages on personal phones?",
        "llm_response": "No, personal device content is protected unless explicitly authorized.",
        "expected": 1
    },
]
df = pd.DataFrame(data)
df.head()
We define a small dataset of legal questions and LLM-generated responses related to GDPR compliance. Each entry includes an expected binary label (1 for compliant, 0 for non-compliant). The data is loaded into a Pandas DataFrame for easy processing and evaluation.
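Before sending rows to the evaluator, a quick schema sanity check can catch malformed entries early. This is a minimal sketch (not part of the Atla SDK), using the column names from the dataset above:

```python
import pandas as pd

# two sample rows copied from the tutorial's dataset
data = [
    {"question": "Can a company monitor employee emails under GDPR?",
     "llm_response": "Yes, any employer can freely monitor emails as long as it's for productivity.",
     "expected": 0},
    {"question": "Can employers access private chats on company devices?",
     "llm_response": "Only if there is a legitimate business need and employees are informed.",
     "expected": 1},
]

df = pd.DataFrame(data)

# every row should carry exactly the three expected columns
assert set(df.columns) == {"question", "llm_response", "expected"}
# expected labels must be binary, matching Selene's 0/1 scoring
assert df["expected"].isin([0, 1]).all()
```

Failing fast here is cheaper than discovering a missing column after a batch of API calls.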
custom_eval_criteria = """
Score this 1 if the response complies with GDPR principles:
- lawful basis
- employee consent or notice
- data minimization
- legitimate interest
Otherwise, score it 0.
Explain briefly why it qualifies or not.
"""
We define a custom evaluation prompt that guides Atla's Selene model in scoring responses against key GDPR principles. It directs the model to assign a score of 1 to compliant answers and 0 otherwise, along with a brief explanation justifying the result.
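If you evaluate against several regulations or principle sets, the criteria string can be assembled programmatically. The build_criteria helper below is a hypothetical sketch, not an Atla SDK function; it simply reconstructs a prompt like the one above from a list of principles:

```python
GDPR_PRINCIPLES = [
    "lawful basis",
    "employee consent or notice",
    "data minimization",
    "legitimate interest",
]

def build_criteria(principles):
    """Assemble a binary-scoring evaluation prompt (hypothetical helper)."""
    bullet_lines = "\n".join(f"- {p}" for p in principles)
    return (
        "Score this 1 if the response complies with GDPR principles:\n"
        f"{bullet_lines}\n"
        "Otherwise, score it 0.\n"
        "Explain briefly why it qualifies or not."
    )

custom_eval_criteria = build_criteria(GDPR_PRINCIPLES)
print(custom_eval_criteria)
```

This keeps the principle list in one place, so the same helper can generate criteria for other frameworks by swapping the list.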
async def evaluate_with_selene(df):
    async def evaluate_row(row):
        try:
            result = await async_client.evaluation.create(
                model_id="atla-selene",
                model_input=row["question"],
                model_output=row["llm_response"],
                evaluation_criteria=custom_eval_criteria,
            )
            return result.result.evaluation.score, result.result.evaluation.critique
        except Exception as e:
            return None, f"Error: {e}"

    tasks = [evaluate_row(row) for _, row in df.iterrows()]
    results = await asyncio.gather(*tasks)
    df["selene_score"], df["critique"] = zip(*results)
    return df
df = asyncio.run(evaluate_with_selene(df))
df.head()
Here, this asynchronous function evaluates each row in the DataFrame using the Atla Selene model, passing each legal question and LLM-response pair along with the custom GDPR evaluation criteria. It then gathers the scores and critiques concurrently using asyncio.gather, attaches them to the DataFrame, and returns the enriched results.
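Once scores are attached, it is natural to measure how often Selene agrees with the expected labels. The agreement_rate helper below is a hypothetical sketch (not part of the Atla SDK), demonstrated on synthetic results rather than real API output:

```python
import pandas as pd

def agreement_rate(df):
    """Fraction of rows where Selene's 0/1 score matches the expected label.

    Rows where the evaluation failed (selene_score is None/NaN) are excluded.
    """
    scored = df.dropna(subset=["selene_score"])
    if scored.empty:
        return 0.0
    return float((scored["selene_score"] == scored["expected"]).mean())

# synthetic results illustrating the calculation (not real Selene output)
demo = pd.DataFrame({
    "expected":     [0, 1, 1, 1],
    "selene_score": [0, 1, 0, 1],
})
print(agreement_rate(demo))  # 3 of 4 rows agree -> 0.75
```

On the tutorial's DataFrame, agreement_rate(df) would give a quick accuracy figure for the evaluator against the hand-labeled expectations.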
for i, row in df.iterrows():
    print(f"\n🔹 Q: {row['question']}")
    print(f"🤖 A: {row['llm_response']}")
    print(f"🧠 Selene: {row['critique']} — Score: {row['selene_score']}")
We iterate through the evaluated DataFrame, printing each question, the corresponding LLM-generated answer, and Selene's critique together with its assigned score. This provides a clear, readable summary of how the evaluator judged each response against the custom GDPR criteria.
In conclusion, this notebook has shown how to leverage Atla's evaluation capabilities to assess the quality of LLM-generated legal responses with precision and flexibility. Using the Atla Python SDK and its Selene evaluator model, we defined custom evaluation criteria tailored to GDPR standards and scored AI outputs with interpretable critiques. The process was asynchronous, lightweight, and designed to run smoothly in Google Colab.
Here is the Colab notebook.
Asif Razzaq is the CEO of Marktechpost Media Inc. As a visionary entrepreneur and engineer, Asif is committed to harnessing the potential of artificial intelligence for social good. His most recent endeavor is the launch of an artificial intelligence media platform, Marktechpost, which stands out for its in-depth coverage of machine learning and deep learning news that is both technically sound and easily understandable by a wide audience. The platform boasts over 2 million monthly views, illustrating its popularity among readers.
2025-03-31 07:12:00