Create a Simple Experiment

This guide walks you through creating and running a simple evaluation experiment in Evalap.

Creating an Experiment via the API

An experiment evaluates a model on a specific dataset using defined metrics.

When choosing a model, there are typically two scenarios:

  1. Provider models (e.g., OpenAI, Albert): EvalAP generates answers from the dataset. The dataset must contain at least a query column representing the model inputs.
  2. Custom models: You generate the model outputs yourself and pass them to the API for metric computation.

This guide covers both scenarios.

Selecting Metrics

You need to specify which metrics to compute for your experiment. You can explore available metrics through the interface or API.

A typical metric for evaluating LLMs is "LLM-as-a-judge," which uses another LLM to assess answer quality. When you have ground-truth answers in your dataset, you can use LLM-as-a-judge to verify if the model output contains the correct answer. In EvalAP, the judge_precision metric performs this function.

Here are some key metrics offered by EvalAP:

| Name | Description | Type | Requires |
|---|---|---|---|
| judge_precision | Binary precision of output_true. Returns 1 if the correct answer is contained in the given answer | llm | [output, output_true, query] |
| qcm_exactness | Binary equality between output and output_true | llm | [output, output_true] |
| bias | See https://docs.confident-ai.com/docs/metrics-introduction | deepeval | [output, query] |
| hallucination | See https://docs.confident-ai.com/docs/metrics-introduction | deepeval | [context, output, query] |
| contextual_relevancy | See https://docs.confident-ai.com/docs/metrics-introduction | deepeval | [output, query, retrieval_context] |
| ocr_v1 | Levenshtein distance between output and ground-truth markdown | ocr | [output, output_true] |
| output_length | Number of words in the output | ops | [output] |
| generation_time | Time taken to generate the answer/output | ops | [output] |
| energy_consumption | Energy consumption (kWh), environmental impact calculated by the ecologits library | ops | [output] |
| nb_tool_calls | Number of tools called during generation | ops | [output] |

Info: Query the complete metrics list from the v1/metrics API route.
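
For example, the list can be fetched programmatically; a minimal sketch using Python requests, assuming the route returns a JSON array whose entries mirror the columns of the table above:

import requests

API_URL = "https://evalap.etalab.gouv.fr/v1"
HEADERS = {
    "Authorization": "Bearer YOUR_API_KEY",
    "Content-Type": "application/json"
}

# List the available metrics and the fields each one requires
# (the "name", "type" and "require" keys are assumed to mirror the table above)
response = requests.get(f"{API_URL}/metrics", headers=HEADERS)
for metric in response.json():
    print(metric.get("name"), metric.get("type"), metric.get("require"))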

When selecting metrics, ensure the required fields match your dataset columns. For example, judge_precision requires output, output_true, and query fields. Note that the output field is generated by EvalAP during evaluation, so it doesn't need to be present in your dataset initially.
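
For illustration, here is a sketch of dataset rows that satisfy the judge_precision requirements; the output column is left out because EvalAP generates it during evaluation:

# Hypothetical dataset rows for judge_precision:
# "query" is the model input, "output_true" the ground-truth answer.
rows = [
    {"query": "What is the capital of France?", "output_true": "Paris"},
    {"query": "Who wrote Les Misérables?", "output_true": "Victor Hugo"},
]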

Additional metrics provide general measurements like generation time and output size.

Creating an Experiment with a Model Provider

Here's how to create a simple experiment evaluating an OpenAI model:

import os
import requests

# Replace with your Evalap API endpoint
API_URL = "https://evalap.etalab.gouv.fr/v1"

# Replace with your API key or authentication token (or None if running locally)
HEADERS = {
    "Authorization": "Bearer YOUR_API_KEY",
    "Content-Type": "application/json"
}

# Design the experiment
experiment = {
    "name": "my_experiment_name",
    "dataset": "my_dataset",  # name identifier of the dataset
    "model": {"name": "gpt-4o", "base_url": "https://api.openai.com/v1", "api_key": os.getenv("OPENAI_API_KEY")},
    "metrics": ["judge_precision", "generation_time", "output_length"],
}

# Run the experiment
response = requests.post(f'{API_URL}/experiment', json=experiment, headers=HEADERS)
experiment_id = response.json()["id"]
print(f"Experiment {experiment_id} is running")

Info: The model schema supports passing sampling parameters, such as the temperature, e.g. "model": {..., "sampling_params": {"temperature": 0.2}}, as well as extra parameters, provided they are supported by the OpenAI-compatible API in use. Check the experiment creation endpoint for the full list of supported parameters.
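
For instance, a model definition with sampling parameters might look like the sketch below; temperature is shown as in the note above, and any additional keys are only meaningful if the target OpenAI-compatible API accepts them:

import os

model = {
    "name": "gpt-4o",
    "base_url": "https://api.openai.com/v1",
    "api_key": os.getenv("OPENAI_API_KEY"),
    # Sampling parameters forwarded to the generation API
    "sampling_params": {"temperature": 0.2},
}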

Creating an Experiment with a Custom Model

For the second scenario, where you have your own model outputs, you'll need to provide those outputs in your API call. Here's how to create an experiment with a custom model:

import os
import requests

# Replace with your Evalap API endpoint
API_URL = "https://evalap.etalab.gouv.fr/v1"

# Replace with your API key or authentication token (or None if running locally)
HEADERS = {
    "Authorization": "Bearer YOUR_API_KEY",
    "Content-Type": "application/json"
}

# Design the experiment with a custom model
experiment = {
    "name": "my_custom_model_experiment",
    "dataset": "my_dataset",  # name identifier of the dataset
    "model": {
        "aliased_name": "my-custom-model",  # A name to identify this model
        "output": ["answer1", "answer2", "answer3"],  # Array of model outputs corresponding to dataset rows
    },
    "metrics": ["judge_precision", "generation_time", "output_length"],
}

# Run the experiment
response = requests.post(f'{API_URL}/experiment', json=experiment, headers=HEADERS)
experiment_id = response.json()["id"]
print(f"Experiment {experiment_id} is running")

In this scenario, the model schema is different:

| Field | Type | Description |
|---|---|---|
| output | Array of strings | The sequence of answers generated by your model, ordered to match the rows of the dataset you are evaluating |
| aliased_name | string | A name to identify this model. Different from the name parameter used with provider models |
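
If you build the output array yourself, the only constraint is that it follows the row order of your dataset. A minimal sketch, assuming your queries are stored locally in a CSV file with a query column and that call_my_model is a hypothetical stand-in for your own inference code:

import csv

def call_my_model(query: str) -> str:
    # Hypothetical: replace with your own model inference
    raise NotImplementedError

# Load the queries in the same order as the dataset rows
with open("my_dataset.csv", newline="") as f:
    queries = [row["query"] for row in csv.DictReader(f)]

# One output per dataset row, preserving order
outputs = [call_my_model(q) for q in queries]

# This list goes into the experiment payload as model["output"]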

After running the experiment, the API returns a success response if it starts without errors. EvalAP manages experiments asynchronously, and you can check the status and results through the interface or by querying the API directly.
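
For example, you can poll an experiment from a script; a minimal sketch, assuming the API exposes a GET /experiment/{id} route and that the status field name shown below matches your deployment:

import requests

API_URL = "https://evalap.etalab.gouv.fr/v1"
HEADERS = {
    "Authorization": "Bearer YOUR_API_KEY",
    "Content-Type": "application/json"
}
experiment_id = 42  # id returned when the experiment was created

# Fetch the current state of the experiment
response = requests.get(f"{API_URL}/experiment/{experiment_id}", headers=HEADERS)
experiment = response.json()
# Field name is an assumption; inspect the response to see the actual schema
print(experiment.get("experiment_status"))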

Viewing Experiment Results and Progress

After launching an experiment:

  1. Navigate to the experiment details page
  2. View summary results showing:
    • Overall performance metrics for each model
    • Support table displaying the number of experiments used for score averaging
  3. Explore detailed results:
    • Number of successful and failed attempts per experiment
    • Detailed results for each experiment

Next Steps: Experiment Sets

After creating your first experiment, consider using Experiment Sets to compare multiple models or configurations. Experiment sets allow you to run related experiments together, making it easier to draw meaningful comparisons and conclusions. They're essential for robust evaluations that account for model variability and provide comparative insights. Learn more in our Create an Experiment Set guide.