Publish a Dataset

This guide walks you through adding a new dataset to Evalap for model evaluation. Datasets are added programmatically through the Evalap API.

Two formats are supported:

  • CSV-like data (dataframes)
  • Parquet format (for larger datasets)

Column Mapping

Evalap uses a standard column naming convention. When adding your dataset, you need to either name your columns accordingly or map your dataset columns to these standard names using the columns_map parameter:

  • query: (str) the input query.
  • output_true: (str) the ground truth answer.
  • img: (PIL/bytes) an image. Supported in Parquet format only.
  • any other field, which you can use in ad hoc metrics.

In columns_map, each key is an Evalap convention name and each value is the corresponding column name in your dataset. Renaming the columns yourself before adding the dataset has the same effect.

For example, if your dataset has a column named "question" that holds the input query, map it like this:

"columns_map": {"query": "question"}

See the API reference for more usage details.
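The rename-before-upload alternative can be sketched with pandas. This is a minimal illustration, not part of Evalap: the helper name rename_to_evalap_columns and the sample column "answer" are hypothetical. It takes a mapping in the same direction as columns_map (Evalap name to your column name) and applies it locally.

```python
import pandas as pd


def rename_to_evalap_columns(df: pd.DataFrame, columns_map: dict) -> pd.DataFrame:
    """Return a copy of df whose columns follow the Evalap naming convention.

    columns_map uses the same direction as the request parameter:
    {evalap_name: your_column_name}. This helper is hypothetical and only
    illustrates the local-rename alternative.
    """
    # Invert the mapping because DataFrame.rename expects {old_name: new_name}.
    rename = {yours: evalap for evalap, yours in columns_map.items()}

    # Fail early if a mapped column does not exist in the dataset.
    missing = [col for col in rename if col not in df.columns]
    if missing:
        raise KeyError(f"Columns not found in dataset: {missing}")

    return df.rename(columns=rename)


# Example data; the column names "question" and "answer" are made up.
raw_df = pd.DataFrame({"question": ["What is 2 + 2?"], "answer": ["4"]})
evalap_df = rename_to_evalap_columns(
    raw_df, {"query": "question", "output_true": "answer"}
)
```

After this step the dataframe already uses query and output_true, so no columns_map should be needed in the request.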

From CSV-like Dataset

The following code shows how to upload a dataset to Evalap from a CSV file.

import requests
import pandas as pd

# Replace with your Evalap API endpoint
API_URL = "https://evalap.etalab.gouv.fr/v1"

# Replace with your API key or authentication token (or None if Evalap is launched locally)
HEADERS = {
    "Authorization": "Bearer YOUR_EVALAP_KEY",
    "Content-Type": "application/json"
}

# Load the dataset from a CSV file
dataset_df = pd.read_csv("my_dataset.csv")  # pandas uses "," as the default delimiter.

# Prepare dataset metadata
dataset = {
    "name": "My domain specific dataset",
    "readme": "A dataset for evaluating question answering capabilities",
    "default_metric": "judge_precision",
    "df": dataset_df.to_json()
}

# Create the dataset
response = requests.post(f"{API_URL}/datasets", headers=HEADERS, json=dataset)

# Stop here with an HTTPError if the API rejected the request
response.raise_for_status()

dataset_id = response.json()["id"]

print(f"Dataset created with ID: {dataset_id}")
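When the CSV columns do not follow the Evalap convention, the mapping can travel with the request instead of renaming locally. The sketch below only builds the request body and sends nothing. It assumes that columns_map sits at the top level of the payload next to name and df; the guide says only that it is a parameter "in the request", so check the API reference for the exact placement. The helper build_dataset_payload is hypothetical.

```python
import json

import pandas as pd


def build_dataset_payload(df, name, readme, default_metric, columns_map=None):
    """Assemble the JSON body for the dataset creation request.

    Hypothetical helper. Assumption: columns_map is a top-level field of the
    payload, mapping Evalap names to the dataset's own column names.
    """
    payload = {
        "name": name,
        "readme": readme,
        "default_metric": default_metric,
        "df": df.to_json(),  # same serialization as in the example above
    }
    if columns_map:
        payload["columns_map"] = dict(columns_map)
    return payload


# Made-up sample data: "question" needs mapping, "output_true" already matches.
example_df = pd.DataFrame(
    {"question": ["What is the capital of France?"], "output_true": ["Paris"]}
)

payload = build_dataset_payload(
    example_df,
    name="My domain specific dataset",
    readme="A dataset for evaluating question answering capabilities",
    default_metric="judge_precision",
    columns_map={"query": "question"},
)

# Inspect the body before sending it with requests.post(..., json=payload)
print(json.dumps(payload, indent=2))
```

Building the payload in a separate function makes it easy to inspect or log the body before it is posted.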

From Parquet Dataset

See the demo tutorial to add an OCR dataset provided by the Marker library: create_marker_dataset.ipynb

Next Steps

After adding your dataset, you can: