Create a Simple Experiment
This guide walks you through creating and running a simple evaluation experiment in Evalap.
Creating an Experiment via the API
An experiment evaluates a model on a specific dataset using defined metrics.
When choosing a model, there are typically two scenarios:
- Provider models (e.g., OpenAI, Albert): EvalAP generates answers from the dataset. The dataset must contain at least a query column representing the model inputs.
- Custom models: You generate the model outputs yourself and pass them to the API for metric computation.
This guide covers both scenarios.
Selecting Metrics
You need to specify which metrics to compute for your experiment. You can explore available metrics through the interface or API.
A typical metric for evaluating LLMs is "LLM-as-a-judge," which uses another LLM to assess answer quality. When you have ground-truth answers in your dataset, you can use LLM-as-a-judge to verify whether the model output contains the correct answer. In EvalAP, the judge_precision metric performs this function.
Here are some key metrics offered by EvalAP:
Name | Description | Type | Requires |
---|---|---|---|
judge_precision | Binary precision of output_true. Returns 1 if the correct answer is contained in the given answer | llm | [output, output_true, query] |
qcm_exactness | Binary equality between output and output_true | llm | [output, output_true] |
bias | See https://docs.confident-ai.com/docs/metrics-introduction | deepeval | [output, query] |
hallucination | See https://docs.confident-ai.com/docs/metrics-introduction | deepeval | [context, output, query] |
contextual_relevancy | See https://docs.confident-ai.com/docs/metrics-introduction | deepeval | [output, query, retrieval_context] |
ocr_v1 | Levenshtein distance between output and ground-truth markdown | ocr | [output, output_true] |
output_length | Number of words in the output | ops | [output] |
generation_time | Time taken to generate the answer/output | ops | [output] |
energy_consumption | Energy consumption (kWh); environmental impact estimated by the ecologits library | ops | [output] |
nb_tool_calls | Number of tools called during generation | ops | [output] |
Query the complete metrics list from the v1/metrics API route.
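For example, here is a quick way to list the available metrics and the fields each one requires. The shape of the JSON response, including the "name" and "require" fields, is an assumption here; inspect the raw response or the API documentation for the exact schema.
import requests
API_URL = "https://evalap.etalab.gouv.fr/v1"
# List every available metric; the response field names used below are assumed, not guaranteed
response = requests.get(f"{API_URL}/metrics")
for metric in response.json():
    print(metric.get("name"), "->", metric.get("require"))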
When selecting metrics, ensure the required fields match your dataset columns. For example, judge_precision requires the output, output_true, and query fields. Note that the output field is generated by EvalAP during evaluation, so it doesn't need to be present in your dataset initially.
Additional metrics provide general measurements like generation time and output size.
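As a purely hypothetical illustration, a dataset suitable for judge_precision only needs query and output_true columns, since the output column is filled in by EvalAP at evaluation time:
# Hypothetical dataset rows: "query" is the model input, "output_true" the ground-truth answer.
# No "output" column is needed; EvalAP generates it when the model answers each query.
dataset_rows = [
    {"query": "What is the capital of France?", "output_true": "Paris"},
    {"query": "How many days are in a leap year?", "output_true": "366"},
]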
Creating an Experiment with a Model Provider
Here's how to create a simple experiment evaluating an OpenAI model:
import os
import requests
# Replace with your Evalap API endpoint
API_URL = "https://evalap.etalab.gouv.fr/v1"
# Replace with your API key or authentication token (or None if running locally)
HEADERS = {
"Authorization": "Bearer YOUR_API_KEY",
"Content-Type": "application/json"
}
# Design the experiment
experiment = {
"name": "my_experiment_name",
"dataset": "my_dataset", # name identifier of the dataset
"model": {"name": "gpt-4o", "base_url": "https://api.openai.com/v1", "api_key": os.getenv("OPENAI_API_KEY")},
"metrics": ["judge_precision", "generation_time", "output_length"],
}
# Run the experiment
response = requests.post(f'{API_URL}/experiment', json=experiment, headers=HEADERS)
experiment_id = response.json()["id"]
print(f"Experiment {experiment_id} is running")
The model schema supports passing sampling parameters, such as the temperature, like "model": {..., "sampling_params": {"temperature": 0.2}}, as well as any extra parameters supported by the OpenAI API. Check the experiment creation endpoint for the full list of supported parameters.
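As a sketch, a provider-model experiment with a lower temperature could look like the following. The specific values are illustrative only; check the endpoint documentation for which extra parameters are accepted.
# Same experiment as above, with sampling parameters forwarded to the provider API
experiment = {
    "name": "my_experiment_low_temp",
    "dataset": "my_dataset",
    "model": {
        "name": "gpt-4o",
        "base_url": "https://api.openai.com/v1",
        "api_key": os.getenv("OPENAI_API_KEY"),
        "sampling_params": {"temperature": 0.2},  # OpenAI-compatible sampling parameters
    },
    "metrics": ["judge_precision", "generation_time", "output_length"],
}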
Creating an Experiment with a Custom Model
For the second scenario, where you have your own model outputs, you'll need to provide those outputs in your API call. Here's how to create an experiment with a custom model:
import os
import requests
# Replace with your Evalap API endpoint
API_URL = "https://evalap.etalab.gouv.fr/v1"
# Replace with your API key or authentication token (or None if running locally)
HEADERS = {
"Authorization": "Bearer YOUR_API_KEY",
"Content-Type": "application/json"
}
# Design the experiment with a custom model
experiment = {
"name": "my_custom_model_experiment",
"dataset": "my_dataset", # name identifier of the dataset
"model": {
"aliased_name": "my-custom-model", # A name to identify this model
"output": ["answer1", "answer2", "answer3"] # Array of model outputs corresponding to dataset rows
},
"metrics": ["judge_precision", "generation_time", "output_length"],
}
# Run the experiment
response = requests.post(f'{API_URL}/experiment', json=experiment, headers=HEADERS)
experiment_id = response.json()["id"]
print(f"Experiment {experiment_id} is running")
In this scenario, the model schema is different:
Field | Type | Description |
---|---|---|
output | Array of strings | The sequence of answers generated by your model, ordered to match the 'rows' of the dataset you are evaluating |
aliased_name | string | A name to identify this model. Different from the 'name' parameter used with provider models |
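As a minimal sketch of how the output array passed above can be kept aligned with the dataset rows, assuming a hypothetical generate_answer function that wraps your own model and a local copy of the dataset queries:
# Hypothetical helper wrapping your own model; replace with your actual inference code
def generate_answer(query: str) -> str:
    return "stub answer for: " + query

# Queries in the same order as the rows of "my_dataset"
queries = ["What is the capital of France?", "How many days are in a leap year?"]

# Build the output array row by row so it stays aligned with the dataset
experiment["model"]["output"] = [generate_answer(q) for q in queries]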
After running the experiment, the API returns a success response if it starts without errors. EvalAP manages experiments asynchronously, and you can check the status and results through the interface or by querying the API directly.
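For example, here is a minimal sketch of polling the API until the experiment finishes, reusing API_URL, HEADERS, and experiment_id from the snippets above. Both the GET /experiment/{id} route and the "experiment_status" field are assumptions; check the API documentation for the exact route and field names.
import time

# Hypothetical polling loop; route, field name, and status values are assumptions
while True:
    result = requests.get(f"{API_URL}/experiment/{experiment_id}", headers=HEADERS).json()
    status = result.get("experiment_status")
    print(f"Experiment {experiment_id}: {status}")
    if status not in ("pending", "running"):  # assumed status values
        break
    time.sleep(10)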
Viewing Experiment Results and Progress
After launching an experiment:
- Navigate to the experiment details page
- View summary results showing:
  - Overall performance metrics for each model
  - A support table displaying the number of experiments used for score averaging
- Explore detailed results:
  - Number of successful and failed attempts per experiment
  - Detailed results for each experiment
After creating your first experiment, consider using Experiment Sets to compare multiple models or configurations. Experiment sets allow you to run related experiments together, making it easier to draw meaningful comparisons and conclusions. They're essential for robust evaluations that account for model variability and provide comparative insights. Learn more in our Create an Experiment Set guide.