Create a Simple Experiment
This guide walks you through creating and running a simple evaluation experiment in Evalap.
Creating an Experiment via the API
An experiment evaluates a model on a specific dataset using defined metrics.
When choosing a model, there are typically two scenarios:
- Provider models (e.g., OpenAI, Albert): EvalAP generates answers from the dataset. The dataset must contain at least a query column representing the model inputs.
- Custom models: You generate the model outputs yourself and pass them to the API for metric computation.
This guide covers both scenarios.
Selecting Metrics
You need to specify which metrics to compute for your experiment. You can explore available metrics through the interface or API.
A typical metric for evaluating LLMs is "LLM-as-a-judge," which uses another LLM to assess answer quality. When you have ground-truth answers in your dataset, you can use LLM-as-a-judge to verify whether the model output contains the correct answer. In EvalAP, the judge_precision metric performs this function.
Here are some key metrics offered by EvalAP:
Name | Description | Type | Requires |
---|---|---|---|
judge_precision | Binary precision of output_true. Returns 1 if the correct answer is contained in the given answer | llm | [output, output_true, query] |
qcm_exactness | Binary equality between output and output_true | llm | [output, output_true] |
bias | See https://docs.confident-ai.com/docs/metrics-introduction | deepeval | [output, query] |
hallucination | See https://docs.confident-ai.com/docs/metrics-introduction | deepeval | [context, output, query] |
contextual_relevancy | See https://docs.confident-ai.com/docs/metrics-introduction | deepeval | [output, query, retrieval_context] |
ocr_v1 | Levenshtein distance between output and ground-truth markdown | ocr | [output, output_true] |
output_length | Number of words in the output | ops | [output] |
generation_time | Time taken to generate the answer/output | ops | [output] |
energy_consumption | Energy consumption (kWh); environmental impact estimated by the ecologits library | ops | [output] |
nb_tool_calls | Number of tools called during generation | ops | [output] |
Query the complete metrics list from the v1/metrics API route.
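For example, here is a quick way to list the available metrics and the fields each one requires. The shape of the JSON response, including the "name" and "require" fields, is an assumption here; inspect the raw response or the API documentation for the exact schema.
import requests
API_URL = "https://evalap.etalab.gouv.fr/v1"
# List every available metric; the response field names used below are assumed, not guaranteed
response = requests.get(f"{API_URL}/metrics")
for metric in response.json():
    print(metric.get("name"), "->", metric.get("require"))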
When selecting metrics, ensure the required fields match your dataset columns. For example, judge_precision requires the output, output_true, and query fields. Note that the output field is generated by EvalAP during evaluation, so it doesn't need to be present in your dataset initially.
Additional metrics provide general measurements like generation time and output size.
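As a purely hypothetical illustration, a dataset suitable for judge_precision only needs query and output_true columns, since the output column is filled in by EvalAP at evaluation time:
# Hypothetical dataset rows: "query" is the model input, "output_true" the ground-truth answer.
# No "output" column is needed; EvalAP generates it when the model answers each query.
dataset_rows = [
    {"query": "What is the capital of France?", "output_true": "Paris"},
    {"query": "How many days are in a leap year?", "output_true": "366"},
]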
Creating an Experiment with a Model Provider
Here's how to create a simple experiment evaluating an OpenAI model:
import os
import requests
# Replace with your Evalap API endpoint
API_URL = "https://evalap.etalab.gouv.fr/v1"
# Replace with your API key or authentication token (or None if running locally)
HEADERS = {
"Authorization": "Bearer YOUR_API_KEY",
"Content-Type": "application/json"
}
# Design the experiment
experiment = {
"name": "my_experiment_name",
"dataset": "my_dataset", # name identifier of the dataset
"model": {"name": "gpt-4o", "base_url": "https://api.openai.com/v1", "api_key": os.getenv("OPENAI_API_KEY")},
"metrics": ["judge_precision", "generation_time", "output_length"],
}
# Run the experiment
response = requests.post(f'{API_URL}/experiment', json=experiment, headers=HEADERS)
experiment_id = response.json()["id"]
print(f"Experiment {experiment_id} is running")
The model schema supports passing sampling parameters, such as the temperature, like "model": {..., "sampling_params": {"temperature": 0.2}}, as well as any extra parameters supported by the OpenAI API. Check the experiment creation endpoint for the full list of supported parameters.
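As a sketch, a provider-model experiment with a lower temperature could look like the following. The specific values are illustrative only; check the endpoint documentation for which extra parameters are accepted.
# Same experiment as above, with sampling parameters forwarded to the provider API
experiment = {
    "name": "my_experiment_low_temp",
    "dataset": "my_dataset",
    "model": {
        "name": "gpt-4o",
        "base_url": "https://api.openai.com/v1",
        "api_key": os.getenv("OPENAI_API_KEY"),
        "sampling_params": {"temperature": 0.2},  # OpenAI-compatible sampling parameters
    },
    "metrics": ["judge_precision", "generation_time", "output_length"],
}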
Creating an Experiment with a Custom Model
For the second scenario, where you have your own model outputs, you'll need to provide those outputs in your API call. Here's how to create an experiment with a custom model:
import os
import requests
# Replace with your Evalap API endpoint
API_URL = "https://evalap.etalab.gouv.fr/v1"
# Replace with your API key or authentication token (or None if running locally)
HEADERS = {
"Authorization": "Bearer YOUR_API_KEY",
"Content-Type": "application/json"
}
# Design the experiment with a custom model
experiment = {
"name": "my_custom_model_experiment",
"dataset": "my_dataset", # name identifier of the dataset
"model": {
"aliased_name": "my-custom-model", # A name to identify this model
"output": ["answer1", "answer2", "answer3"] # Array of model outputs corresponding to dataset rows
},
"metrics": ["judge_precision", "generation_time", "output_length"],
}
# Run the experiment
response = requests.post(f'{API_URL}/experiment', json=experiment, headers=HEADERS)
experiment_id = response.json()["id"]
print(f"Experiment {experiment_id} is running")
In this scenario, the model schema is different:
Field | Type | Description |
---|---|---|
output | Array of strings | The sequence of answers generated by your model, ordered to match the 'rows' of the dataset you are evaluating |
aliased_name | string | A name to identify this model. Different from the 'name' parameter used with provider models |
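As a minimal sketch of how the output array passed above can be kept aligned with the dataset rows, assuming a hypothetical generate_answer function that wraps your own model and a local copy of the dataset queries:
# Hypothetical helper wrapping your own model; replace with your actual inference code
def generate_answer(query: str) -> str:
    return "stub answer for: " + query

# Queries in the same order as the rows of "my_dataset"
queries = ["What is the capital of France?", "How many days are in a leap year?"]

# Build the output array row by row so it stays aligned with the dataset
experiment["model"]["output"] = [generate_answer(q) for q in queries]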
After running the experiment, the API returns a success response if it starts without errors. EvalAP manages experiments asynchronously, and you can check the status and results through the interface or by querying the API directly.
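For example, here is a minimal sketch of polling the API until the experiment finishes, reusing API_URL, HEADERS, and experiment_id from the snippets above. Both the GET /experiment/{id} route and the "experiment_status" field are assumptions; check the API documentation for the exact route and field names.
import time

# Hypothetical polling loop; route, field name, and status values are assumptions
while True:
    result = requests.get(f"{API_URL}/experiment/{experiment_id}", headers=HEADERS).json()
    status = result.get("experiment_status")
    print(f"Experiment {experiment_id}: {status}")
    if status not in ("pending", "running"):  # assumed status values
        break
    time.sleep(10)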
Viewing Experiment Results and Progress
After launching an experiment:
- Navigate to the experiment details page
- View summary results showing:
  - Overall performance metrics for each model
  - A support table displaying the number of experiments used for score averaging
- Explore detailed results:
  - Number of successful and failed attempts per experiment
  - Detailed results for each experiment
After creating your first experiment, consider using Experiment Sets to compare multiple models or configurations. Experiment sets allow you to run related experiments together, making it easier to draw meaningful comparisons and conclusions. They're essential for robust evaluations that account for model variability and provide comparative insights. Learn more in our Create an Experiment Set guide.