Run Answers Locally with vLLM
This guide walks you through the process of generating model answers locally using GPU-enabled machines and then submitting experiments to Evalap. We provide two utility scripts to help accomplish this:
run_answers.py: Generate model responses for an Evalap datasetrun_expe.py: Create or update experiment sets in Evalap
Prerequisites
- Access to a machine with GPU capabilities
 - SSH access configured with your public key (if needed)
 - Python environment with virtual environment support
 - Sufficient disk space for model downloads
 - vLLM installed
 
Step 1: Connect to GPU Machine (if needed)
Connect to your GPU-enabled VM or machine using SSH:
# Add your SSH key to the agent
ssh-add ~/.ssh/your_key
# Connect to the machine
ssh user@gpu-machine-address
Step 2: Check Available Disk Space
Before downloading models, ensure you have sufficient disk space:
# Check disk usage
df -Th
# If needed, clean up old model cache
# Models are stored in ~/.cache/huggingface/hub/ by default
rm -rf ~/.cache/huggingface/hub/old_models/
⚠️ Note: Large language models can require significant disk space (10-100GB per model). Plan accordingly.
Step 3: Launch Model with vLLM
Start the model server using vLLM. Here's an example with Gemma-3:
vllm serve google/gemma-3-27b-it \
  --gpu-memory-utilization 1 \
  --tensor-parallel-size 1 \
  --max-model-len 32768 \
  --port 9191
Common vLLM Parameters:
--gpu-memory-utilization: Fraction of GPU memory to use (0-1)--tensor-parallel-size: Number of GPUs for tensor parallelism--max-model-len: Maximum sequence length--port: Port for the API server
Step 4: Install Evalap Framework
Clone and install the Evalap repository:
# Clone the repository
git clone https://github.com/etalab/evalap.git
cd evalap
# Create and activate virtual environment (recommended)
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
# Install in development mode
pip install -e ./
This installation provides access to the command-line tools located in the evalap/scripts/ directory.
Step 5: Generate Answers
Use the run_answers.py script to generate responses from your model:
# Set your API keys
export EVALAP_API_KEY="your-evalap-token"
export OPENAI_API_KEY="your-openai-key"  # Optional, if not using --auth-token
# View available options
python -m evalap.scripts.run_answers.run_answers --help
# Example: Generate answers for MFS questions with Gemma-3
python -m evalap.scripts.run_answers.run_answers \
  --run-name gemma-3-27b_mfs \
  --base-url http://localhost:9191/v1 \
  --model google/gemma-3-27b-it \
  --dataset MFS_questions_v01 \
  --repeat 4 \
  --max-workers 8
Key Parameters:
--run-name: Unique identifier for this generation run--base-url: URL of the vLLM server (e.g., http://localhost:9191/v1)--model: Model name/identifier--dataset: Name of the Evalap dataset to use--repeat: Number of times to run the dataset (default: 1)--max-workers: Maximum concurrent requests (default: 8)--system-prompt: Optional system prompt to prepend to queries--sampling-params: Optional JSON string with sampling parameters (e.g.,'{"temperature": 0.7, "max_tokens": 1024}')
The script will:
- Download the specified dataset from Evalap
 - Generate responses for each query in the dataset
 - Save results to 
results/{run_name}__{repetition}.json - Save model details to 
results/{run_name}__details.json 
Step 6: Create and Run Experiments
Use run_expe.py to create experiment sets and submit them to Evalap:
# View available options
python -m evalap.scripts.run_expe.run_expe --help
# Create a new experiment
python -m evalap.scripts.run_expe.run_expe \
  --run-name gemma-3-27b_mfs \
  --expe-name "Gemma-3 27B MFS Evaluation"
# Update an existing experiment set
python -m evalap.scripts.run_expe.run_expe \
  --run-name gemma-3-27b_mfs \
  --expset existing-experiment-id
Key Parameters:
--run-name: Name of the model generation to load (must match files inresults/directory)--expe-name: Display name for the experiment set (optional, defaults to run-name)--expset: Existing experiment set ID to update (optional)
The script will:
- Load all result files matching 
results/{run_name}*.json - Create an experiment set with metrics: answer_relevancy, judge_exactness, judge_notator, output_length, generation_time
 - Submit the experiment set to Evalap for evaluation
 
Complete Example Workflow
# 1. Start vLLM server
vllm serve google/gemma-3-27b-it --gpu-memory-utilization 0.9 --port 9191
# 2. Generate answers (in another terminal)
python -m evalap.scripts.run_answers.run_answers \
  --run-name gemma3_test \
  --base-url http://localhost:9191/v1 \
  --model google/gemma-3-27b-it \
  --dataset MFS_questions_v01 \
  --repeat 3
# 3. Submit experiment to Evalap
python -m evalap.scripts.run_expe.run_expe \
  --run-name gemma3_test \
  --expe-name "Gemma-3 Test Run"
Best Practices
- Resource Management: Monitor GPU memory usage and adjust 
--gpu-memory-utilizationaccordingly - Concurrent Requests: Adjust 
--max-workersbased on your model's capacity and dataset size - Experiment Tracking: Use meaningful experiment names and maintain metadata for reproducibility
 - Multiple Runs: Use 
--repeatto generate multiple runs for statistical significance - API Keys: Store your API keys in environment variables for security
 
Troubleshooting
Common Issues:
Out of Memory Error
# Reduce memory utilization
--gpu-memory-utilization 0.8
Connection Refused
# Check if vLLM server is running
curl http://localhost:9191/v1/models
Slow Generation
# Adjust batch size and parallelism
--tensor-parallel-size 2  # If multiple GPUs available
Missing Results Files
# Check that result files were generated
ls results/{run_name}*.json
API Authentication Issues
# Ensure API keys are set
echo $EVALAP_API_KEY
echo $OPENAI_API_KEY