Source Code: `src/gaia/eval/`

Overview
The GAIA evaluation framework provides systematic tools for testing and comparing AI model performance across different deployment scenarios. This framework enables generation of synthetic test data, creation of evaluation standards, and automated performance comparison.

Quick Example: Email Summarization
Here’s what happens when you evaluate email summarization:

1. Generate Test Data → Creates `customer_support_email.txt` with realistic email content
   - Why needed: Real emails contain sensitive data; synthetic data provides realistic scenarios without privacy concerns
2. Create Ground Truth → Produces `customer_support_email.summarization.groundtruth.json` with the expected summary
   - Why needed: Models need objective standards to measure against; human experts define what “good” looks like
3. Run Experiments → Generates `Claude-Email-Summary.experiment.json` with model responses
   - Why needed: Captures actual model outputs under controlled conditions for repeatable testing
4. Evaluate Results → Creates `Claude-Email-Summary.experiment.eval.json` with scored comparisons
   - Why needed: Converts subjective quality into measurable metrics (accuracy, completeness, relevance)
5. Generate Reports → Outputs `email_evaluation_report.md` with human-readable analysis
   - Why needed: Raw scores don’t tell the story; reports reveal patterns, strengths, and improvement areas
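The exact CLI invocations depend on your GAIA version. The sketch below is a hedged outline: `gaia batch-experiment`, `gaia eval`, and `gaia visualize` are documented later in this guide, while the generation and groundtruth subcommand names, the `-o` output flag, and the consolidated groundtruth filename are assumptions to verify against `gaia --help`.

```bash
# Hedged end-to-end sketch; confirm subcommand names and flags with `gaia --help`.

# 1. Generate synthetic test data (subcommand name assumed)
gaia generate --count-per-type 1

# 2. Create ground truth standards from the test data (subcommand, -d and -o flags assumed)
gaia groundtruth -d ./output/test_data/emails -o ./output/groundtruth

# 3. Run batch experiments with the groundtruth as input (embedded-groundtruth method)
gaia batch-experiment -i ./output/groundtruth/consolidated_email_groundtruth.json   # filename illustrative

# 4. Evaluate the experiment outputs (no -g flag needed; groundtruth is embedded)
gaia eval -d ./output/experiments

# 5. Explore scored results and reports interactively
gaia visualize
```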
System Architecture
Prerequisites
Before using the evaluation framework, ensure you have completed the initial GAIA setup:

- Initial Setup: Follow the GAIA Development Guide to:
  - Install Python 3.12
  - Create and activate the virtual environment (`.venv`)
  - Install base GAIA dependencies
- Continue below to add evaluation-specific dependencies
Installation
- Node.js (with npm): Required for the interactive visualizer (`gaia visualize` command)
  - Download from nodejs.org or install via a package manager
- Verify both are available; you may need to restart your shell (see the check after this list)
- If either command fails:
  - Windows: `winget install OpenJS.NodeJS.LTS`, then restart your shell
  - Linux: use your distro package manager (e.g., `apt install nodejs npm`)
- The visualizer will automatically install its webapp dependencies on first launch
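A quick way to verify both tools, using the standard version checks:

```bash
# Confirm Node.js and npm are on your PATH (restart your shell first if you just installed them)
node --version
npm --version
```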
Quick Start
Try your first evaluation in 5 minutes with minimal commands. The ultra-simplified workflow uses all defaults and writes generated data to:

- Meetings → `./output/test_data/meetings/`
- Emails → `./output/test_data/emails/`

Prerequisites for the Quick Start (see the setup sketch after this list):

- Claude API key: `export ANTHROPIC_API_KEY=your_key_here`
- Lemonade server running: `lemonade-server serve`
- Node.js installed for the visualizer
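For example, the prerequisite setup from the list above might look like this in a terminal (the API key value is a placeholder):

```bash
# Claude API key used for groundtruth generation and Claude-based evaluation
export ANTHROPIC_API_KEY=your_key_here

# Start the local Lemonade server in a separate terminal and leave it running
lemonade-server serve
```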
Understanding the Output
After running the Quick Start, you’ll have these directories:

- `./output/test_data/` - Synthetic meeting transcript(s) for testing
- `./output/groundtruth/` - Expected “correct” summaries and evaluation criteria
- `./output/experiments/` - Raw responses from each model tested
- `./output/evaluations/` - Scored comparisons showing which model performed best
- `./reports/` - Human-readable markdown report with analysis and insights
Step-by-Step Workflows
Follow these complete workflows to reproduce evaluation experiments from start to finish.

Workflow 1: Meeting Transcript Summarization

Complete example: Evaluate how well different models summarize meeting transcripts.

Step 1: Generate synthetic meeting transcripts. Note: Edit ./src/gaia/eval/configs/basic_summarization.json to customize models or create your own config.

What you’ll get:

- Synthetic meeting transcripts in `./output/test_data/meetings/`
- Ground truth standards in `./output/groundtruth/`
- Model responses in `./output/experiments/` (with embedded groundtruth for Comparative Evaluation)
- Evaluation scores in `./output/evaluations/` (objective ratings based on groundtruth comparison)
- Summary report in `./reports/meeting_summarization_report.md`
- Interactive web interface for comparison
Workflow 2: Email Summarization
Complete example: Compare models on business email summarization.

Step 1: Generate synthetic business emails. Note: Edit ./src/gaia/eval/configs/basic_summarization.json to customize models or create your own config.
Workflow 3: Document Q&A
Complete example: Test question-answering capabilities on documents.

Step 1: Create Q&A evaluation standards directly from an existing PDF directory. Important: use `--num-samples 3` to match the `qa_config.num_qa_pairs: 3` setting in basic_qa.json (use basic_qa.json for Q&A experiments).

Note: Edit ./src/gaia/eval/configs/basic_qa.json to customize models or create your own config.

When visualizing the results, include the test-data-dir and groundtruth-dir locations to view the source PDFs and Q&A pairs. A hedged command sketch follows.
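The `gaia batch-experiment -i .../consolidated_qa_groundtruth.json` invocation and the `--num-samples` flag come from this guide; the groundtruth subcommand name, its `-d` flag, and the PDF directory path are assumptions to confirm with `gaia --help`.

```bash
# Step 1: create Q&A groundtruth directly from a directory of PDFs
# (subcommand name and -d flag assumed; --num-samples matches qa_config.num_qa_pairs in basic_qa.json)
gaia groundtruth -d ./data/pdfs --num-samples 3

# Step 2: run batch experiments with the consolidated Q&A groundtruth as input (Method 1, embedded groundtruth)
gaia batch-experiment -i ./output/groundtruth/consolidated_qa_groundtruth.json

# Step 3: evaluate the results; no -g flag is needed because the groundtruth is embedded
gaia eval -d ./output/experiments
```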
Workflow 4: Third-Party Model Testing
Complete example: Test external models (OpenAI, etc.) that can’t be integrated directly.

Step 1: Generate test data and standards.

Fill in the generated ./templates/*.template.json files - paste each model’s responses into the “response” fields.

Step 4: Evaluate the completed template (hedged sketch below).
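Once a template’s “response” fields are filled in, evaluating it should follow the same single-file `gaia eval` pattern described later; the template filename and the external groundtruth path below are illustrative assumptions.

```bash
# Step 4 (sketch): evaluate one completed third-party template
gaia eval -f ./templates/GPT-4-Email-Summary.template.json \
  -g ./output/groundtruth/consolidated_email_groundtruth.json   # external groundtruth; path assumed
```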
Important: Two Evaluation Workflows
The evaluation system supports two different workflows for handling groundtruth data:

Method 1: Embedded Groundtruth (Recommended)

Use groundtruth files as input to batch-experiment; no `-g` flag is needed at eval time.

Advantages: Self-contained experiment files, no risk of missing groundtruth, simpler eval command.

Method 2: External Groundtruth

Use test data as input to batch-experiment; the `-g` flag is required at eval time.

Advantages: Separate concerns, easier to swap different groundtruth files for comparison. A hedged sketch of both methods follows.
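A hedged sketch of the two methods using the flags documented in this guide; the input paths are illustrative, and whether `-i` accepts a directory is an assumption.

```bash
# Method 1 (embedded groundtruth): groundtruth files are the batch-experiment input,
# so evaluation later needs no -g flag
gaia batch-experiment -i ./output/groundtruth/consolidated_qa_groundtruth.json

# Method 2 (external groundtruth): test data files are the batch-experiment input,
# so evaluation later must pass -g with the groundtruth file
gaia batch-experiment -i ./output/test_data/meetings   # directory input assumed
```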
⚠️ Important: Q&A RAG Use Cases

For Q&A and RAG evaluations, you MUST use Method 1 (Embedded Groundtruth). This is required because the Q&A groundtruth contains the input queries. If the groundtruth is not embedded, you must pass the `-g/--groundtruth` flag to the eval command; otherwise evaluation will fall back to standalone quality assessment (see Evaluation Modes below).
Evaluation Modes: Comparative vs Standalone
The evaluation system operates in two distinct modes depending on groundtruth availability:

🎯 Comparative Evaluation (Recommended)

Used when groundtruth data is available (embedded or via the `-g` flag). Example: comparative evaluation with embedded groundtruth, as in Method 1 above.

What it does:
- Compares generated summaries against known-correct reference summaries
- Measures accuracy by checking if key facts, decisions, and action items match
- Provides objective scoring based on how well the output matches the expected result
- Enables reliable model comparison with consistent benchmarks
Evaluation criteria:

- Executive Summary Accuracy: How well does it match the reference?
- Completeness: Are all important details from groundtruth covered?
- Action Items Accuracy: Are action items correctly identified vs reference?
- Key Decisions Accuracy: Are decisions properly captured vs reference?
- Participant Identification: Are participants correctly identified vs reference?
- Topic Coverage: Are all discussed topics included vs reference?
🔍 Standalone Quality Assessment (Fallback)
Used when no groundtruth data is available: the groundtruth is not embedded and no `-g` flag is provided. Example: standalone assessment (not recommended for production evaluation).
What it does:
- Analyzes summary quality based on general best practices
- Subjective assessment using Claude’s judgment alone
- No accuracy verification (Claude doesn’t know if facts are correct)
- Inconsistent standards (Claude’s judgment may vary between runs)
Evaluation criteria:

- Executive Summary Quality: Is it clear and high-level?
- Detail Completeness: Does it provide sufficient context?
- Action Items Structure: Are items specific and actionable?
- Key Decisions Clarity: Are decisions clearly stated?
- Participant Information: Are participants properly identified?
- Topic Organization: Are topics well-organized and comprehensive?
📊 Reliability Comparison
| Aspect | Comparative Evaluation | Standalone Assessment |
|---|---|---|
| Accuracy Measurement | ✅ Verified against known facts | ❌ No fact verification |
| Model Comparison | ✅ Objective benchmarking | ⚠️ Subjective, inconsistent |
| Production Readiness | ✅ Reliable for decisions | ❌ Not recommended |
| Consistency | ✅ Repeatable results | ⚠️ May vary between runs |
| Cost | 💰 Higher (groundtruth generation) | 💰 Lower (evaluation only) |
💡 Best Practices
For Production Evaluation:

- Always use Comparative Evaluation with groundtruth data
- Generate comprehensive groundtruth covering all expected use cases
- Use consolidated groundtruth files for batch experiments

For Development and Prototyping:

- Standalone assessment is acceptable for rapid prototyping
- Useful for general quality validation during development
- Not suitable for final model selection or production deployment decisions

Cost Considerations:

- Initial groundtruth generation has an upfront cost but provides objective evaluation
- Standalone assessment is cheaper but less reliable for critical decisions
- Comparative evaluation pays for itself in model selection accuracy
- Individual prompts (default) prioritize quality over cost for production reliability
Command Reference
Synthetic Data Generation
Generate realistic test scenarios for evaluation purposes.

Meeting Transcripts

Generate meeting transcripts with full options. Available meeting types:

- standup - Daily standup meetings
- planning - Sprint/project planning sessions
- client_call - Client communication meetings
- design_review - Technical design reviews
- performance_review - Employee performance discussions
- all_hands - Company-wide meetings
- budget_planning - Financial planning sessions
- product_roadmap - Product strategy meetings
Business Emails
Generate business emails with full options. Available email types:

- project_update - Project status communications
- meeting_request - Meeting scheduling emails
- customer_support - Customer service interactions
- sales_outreach - Sales and marketing communications
- internal_announcement - Company announcements
- technical_discussion - Technical team communications
- vendor_communication - External vendor interactions
- performance_feedback - Performance review communications
Ground Truth Creation
Transform synthetic data into evaluation standards. Create evaluation standards with all options. Supported use cases:

- qa - Question-answer pair generation (requires groundtruth input for batch-experiment)
- summarization - Summary generation tasks (supports both embedded and external groundtruth)
- email - Email processing tasks (supports both embedded and external groundtruth)
Skip Existing Ground Truth Files
Ground truth generation automatically skips files that already have ground truth data to avoid redundant processing and API costs.

Default Behavior: Automatically skip files where ground truth JSON already exists in the output directory.

How it works:

- Checks for existing `.groundtruth.json` files in the output directory
- Skips files where ground truth has already been generated
- Includes skipped files in the consolidated report
- Logs which files were skipped vs newly processed

CLI Options:

- `--force`: Force regeneration of all ground truth files, even if they already exist (overrides the default skip behavior)

Benefits:

- Cost Optimization: Avoid redundant Claude API calls for expensive ground truth generation
- Reliability: Resume interrupted runs without losing progress
- Efficiency: Only process new files when adding to existing ground truth
- Development Speed: Faster iteration during prompt and configuration development
Batch Experimentation
Run systematic model comparisons. Create and run batch experiments (you can also use your own config; see ./src/gaia/eval/configs for examples).

Alternative: use test data files as the batch-experiment input instead of groundtruth files (this requires the `-g` flag during eval).
Resuming Interrupted Experiments
Default Behavior: Batch experiments automatically skip already completed experiments to avoid redundant processing. This is particularly useful when:

- API Rate Limits: Your batch run was interrupted due to rate limiting
- System Interruptions: Power outage, network issues, or manual cancellation
- Incremental Testing: Adding new models to an existing experiment set
- Cost Optimization: Avoiding redundant API calls for expensive cloud models
How it works:

- Checks for existing `.experiment.json` files in the output directory
- Skips experiments where the output file already exists
- Includes skipped experiments in the consolidated report
- Logs which experiments were skipped vs newly generated
Third-Party LLM Templates
Create standardized templates for external model evaluation. Generate templates for manual testing.

Evaluation and Reporting

Analyze experiment results and generate reports.

Skip Existing Evaluations and Incremental Updates

The evaluation system supports skipping existing evaluations and incremental report consolidation to improve efficiency and reduce redundant processing.

CLI Options:

- `--force`: Force regeneration of all evaluations, even if they already exist (overrides the default skip behavior)
- `--incremental-update`: Update the consolidated report incrementally with only new evaluations
- `--regenerate-report`: Force complete regeneration of the consolidated report
Benefits:

- Performance: Skip redundant evaluations, avoiding expensive API calls
- Reliability: Resume interrupted runs without losing progress
- Efficiency: Incremental updates process only new files vs regenerating entire reports
- Development Speed: Faster iteration during experiment development
How it works:

- Skip Logic: Checks for existing `.eval.json` files in both flat and hierarchical structures
- Change Detection: Uses file modification time plus file size as a fingerprint
- Incremental Updates: Only processes files not present in consolidated report metadata
Evaluation Options
The `gaia eval` command supports multiple input modes and groundtruth handling (a hedged command sketch follows this section):

Single File Mode (`-f`): Evaluate a single experiment file (with embedded groundtruth).

Directory Mode (`-d`): Evaluate all JSON files in a directory (with embedded groundtruth).

External Groundtruth (when the batch-experiment input was test_data):

- No `-g` flag: Uses groundtruth embedded in experiment files (when the batch-experiment input was groundtruth files)
- With `-g` flag: Uses an external groundtruth file (when the batch-experiment input was test_data files)

Advantages of external groundtruth:

- ✅ Flexibility: Easy to test different groundtruth versions against the same experiments
- ✅ Separation of Concerns: Keep experiments and evaluation standards separate
- ✅ Storage Efficiency: No groundtruth duplication in experiment files

Directory mode processes all .json files in the specified directory, providing individual evaluation results for each file plus consolidated usage and cost information.
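A hedged sketch of the documented modes and flags; the file and directory paths are illustrative, and the external groundtruth filename is an assumption.

```bash
# Single file mode: one experiment file with embedded groundtruth
gaia eval -f ./output/experiments/Claude-Email-Summary.experiment.json

# Directory mode: all experiment JSON files, embedded groundtruth
gaia eval -d ./output/experiments

# Directory mode with external groundtruth (batch-experiment input was test_data)
gaia eval -d ./output/experiments -g ./output/groundtruth/consolidated_groundtruth.json   # filename assumed

# Skip/refresh behavior
gaia eval -d ./output/experiments --force                # re-evaluate even if .eval.json files exist
gaia eval -d ./output/experiments --incremental-update   # add only new evaluations to the consolidated report
```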
Interactive Results Visualizer
The evaluation framework includes a web-based visualizer for interactive comparison of experiment results.

Overview
The visualizer provides a user-friendly interface to:

- Compare Multiple Experiments: Side-by-side comparison of different model configurations
- Analyze Key Metrics: Cost breakdowns, token usage, and quality scores
- Inspect Quality Ratings: Detailed analysis of evaluation criteria performance
- Track Performance Trends: Visual indicators for model improvement areas
Launch the Visualizer
Launch with default settings (looks for the ./output/experiments, ./output/evaluations, ./output/test_data, and ./output/groundtruth directories):
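With defaults, launching is a single command (the webapp dependencies are installed automatically on first run, per the installation notes above):

```bash
# Starts the web-based visualizer using the default ./output/* directories listed above
gaia visualize
```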
Visualizer Features
Data Loading:

- Automatically discovers `.experiment.json` files in the experiments directory
- Loads corresponding `.experiment.eval.json` files from the evaluations directory
- Displays test data files (emails, meeting transcripts) with generation metadata and costs
- Shows groundtruth files with evaluation criteria, expected summaries, and generation details
- Real-time file system monitoring for new results
Comparison Interface:

- Grid layout for side-by-side experiment comparison
- Expandable sections for detailed metrics review
- Color-coded quality indicators (Excellent/Good/Fair/Poor)
Metrics and Analysis:

- Cost analysis with input/output token breakdown
- Quality score distributions and averages
- Model configuration comparison (temperature, max tokens, etc.)
- Experiment metadata and error tracking
Integrated Workflow
The visualizer integrates seamlessly with the evaluation pipeline. After running any of the Step-by-Step Workflows above, simply launch the visualizer to explore your results interactively, pointing it at all of your data directories if they differ from the defaults.

System Requirements
- Node.js: Required for running the web server (auto-installs dependencies)
- Modern Browser: Chrome, Firefox, Safari, or Edge with JavaScript enabled
- File System Access: Read permissions for experiment and evaluation directories
Configuration Options
Combined Prompt Optimization (Optional)
The evaluation framework supports a combined prompt as an optional optimization to reduce costs and execution time, but uses individual prompts by default for maximum reliability.

Individual Prompts (Default):

- Higher Reliability: Each summary style gets dedicated attention and analysis
- Better Quality: More focused generation per style type
- Easier Debugging: Can isolate issues to specific summary styles
- Production Recommended: Default choice for critical evaluations
Combined Prompt (Optional):

- Single API Call: All requested summary styles generated in one call
- Cost Reduction: ~83% cost savings for cloud APIs (1 call vs 6 for all styles)
- Time Savings: 3-5x faster execution (5-7 seconds vs 18-30 seconds)
- Quality Trade-off: May produce less focused results for individual styles
- ⚠️ Model Limitations: Smaller LLMs (e.g., less than 3B parameters) may struggle with complex combined prompts, leading to incomplete outputs or format errors
How to enable (hedged sketch below):

- CLI: pass the `--combined-prompt` flag to enable the optimization
- Config: set `"combined_prompt": true` in the experiment parameters
When to use the combined prompt:

- ✅ Development/Testing: Quick iterations and prototyping
- ✅ Cost-Conscious Evaluation: Budget constraints with acceptable quality trade-off
- ✅ High-Volume Experiments: Processing many documents where speed matters
- ✅ Large Models Only: Use with models ≥7B parameters (Claude, GPT-4, large Llama models)
When to use individual prompts (default):

- ✅ Production Evaluation: Critical model selection decisions
- ✅ Quality-First: When accuracy and reliability are paramount
- ✅ Detailed Analysis: Need to understand performance per summary style
- ✅ Smaller Models: Required for models less than 7B parameters (Qwen3-0.6B, small Llama models) to avoid instruction-following issues
Note: The provided example configs (e.g., basic_summarization.json) use `"combined_prompt": false` by default to ensure maximum evaluation reliability.
Model Providers
- anthropic - Claude models via Anthropic API
- lemonade - Local models via Lemonade server
Evaluation Metrics
The framework evaluates responses across multiple dimensions:

- Correctness - Factual accuracy relative to source content
- Completeness - Comprehensive coverage of key information
- Conciseness - Appropriate brevity while maintaining accuracy
- Relevance - Direct alignment with query requirements
Model Performance Grading System
The evaluation framework includes a comprehensive grading system that converts qualitative ratings into numerical quality scores and percentage grades for easy comparison across models.

Grade Calculation Formula
Quality scores are calculated using a weighted-average approach:

- Weighting: Excellent=4, Good=3, Fair=2, Poor=1 points
- Average: Divides by total summaries to get weighted average (1.0 to 4.0 scale)
- Normalization: Subtracts 1 to convert to 0-3 scale
- Percentage: Multiplies by 100/3 to get 0-100% scale
Examples:

- All Excellent ratings: (4-1) × 100/3 = 100%
- All Good ratings: (3-1) × 100/3 = 66.7%
- All Fair ratings: (2-1) × 100/3 = 33.3%
- All Poor ratings: (1-1) × 100/3 = 0%
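Written as a formula (notation mine, matching the steps above), where n_E, n_G, n_F, n_P are the counts of Excellent, Good, Fair, and Poor ratings:

```latex
\bar{s} = \frac{4n_E + 3n_G + 2n_F + n_P}{n_E + n_G + n_F + n_P},
\qquad
\text{grade} = (\bar{s} - 1) \times \frac{100}{3}\ \%
```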
Model Performance Summary Table
The system generates a consolidated Model Performance Summary table that displays:

- Model: Model name and configuration
- Grade: Calculated percentage score (0-100%)
- Individual Criteria: Performance ratings across different evaluation aspects, including:
- Executive Summary Quality
- Detail Completeness
- Action Items Structure
- Key Decisions Clarity
- Participant Information
- Topic Organization
Grade Display Locations
Quality scores and grades are displayed in multiple locations:

- Web Interface: Interactive visualizer (`gaia visualize`)
- Markdown Reports: Generated evaluation reports
- JSON Files: Raw evaluation data in `.eval.json` files
- Consolidated Reports: Cross-model comparison summaries
Interpreting Quality Scores
Quality Score Ranges:

- 85-100%: Excellent performance, production ready
- 67-84%: Good performance, minor improvements needed
- 34-66%: Fair performance, significant improvements required
- 0-33%: Poor performance, major revisions needed
Important notes:

- Scores are based on qualitative ratings from Claude AI evaluation
- Each model’s score reflects performance across multiple test cases
- Model consolidation may aggregate results from different experiments
- Always review individual evaluation details alongside summary grades
Output Formats
Results are generated in multiple formats:

- JSON files for programmatic analysis
- Markdown reports for human review
- CSV exports for spreadsheet analysis
Troubleshooting
Common Issues
Evaluation Mode Confusion: If your `gaia eval` command runs successfully but produces inconsistent results:
- ❓ Check: Are you using Comparative Evaluation or Standalone Assessment mode?
- ✅ Fix: Ensure groundtruth data is either embedded (batch-experiment input was groundtruth files) or provided via the `-g` flag
- 🔍 Verify: Look for the log message “Loaded ground truth data from:” vs “No ground truth file provided”
Q&A Experiments Answering Inconsistent Questions:

- ❓ Check: Are you using groundtruth files as input to batch-experiment?
- ✅ Fix: Use `gaia batch-experiment -i ./output/groundtruth/consolidated_qa_groundtruth.json` (not test data)
- 🔍 Why: Q&A groundtruth contains the specific questions models need to answer consistently
NumPy/Pandas Binary Compatibility Errors:

- ❓ Symptoms: `ValueError: numpy.dtype size changed` or `ImportError: cannot import name 'ComplexWarning'`
- ✅ Fix: Reinstall dependencies to rebuild against the current numpy version (see the sketch below)
- 🔍 Why: Binary incompatibility between numpy 2.x and older pandas/sklearn versions
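One common fix, assuming a pip-managed virtual environment (adjust the package list to your project’s requirements):

```bash
# Reinstall the compiled packages so wheels compatible with the installed numpy are used
pip install --force-reinstall --upgrade numpy pandas scikit-learn
```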
File Path Issues:

- Use absolute paths when relative paths fail
- Ensure output directories exist or can be created
- Check file permissions for read/write access
Performance Tips
- Start with small datasets to validate configuration
- Use `--count-per-type 1` for initial testing
- Monitor token usage and API costs during generation
- Parallelize experiments when testing multiple models
- Default behavior automatically skips existing experiments to avoid redundant processing
Timing Metrics
The evaluation system captures detailed timing information.

Available Metrics:

- Total experiment execution time
- Per-item processing times (with min/max/average)
- Per-question evaluation times
- Report generation time
Cost Tracking
The system tracks inference costs, distinguishing between cloud and local models.

Cloud Inference (e.g., Claude):

- Tracks actual API costs per request
- Shows cost breakdown by input/output tokens
- Displays per-item cost analysis
Local Inference (e.g., Lemonade):

- Automatically set to $0.00 (FREE)
- No token costs since it runs on your hardware
- Clear visual indication in webapp that inference is local and free
Viewing cost and timing data:

- Via Webapp Visualizer (Recommended):
  - Inference Type: Clear indicators for Cloud (☁️) vs Local (🖥️) models
  - Cost Information:
    - Cloud models show actual costs with breakdown
    - Local models show “FREE” with a celebratory banner
  - Timing Metrics:
    - Main metrics grid (Total Time, Avg/Item, Eval Time)
    - Dedicated “Performance Timing” section with detailed statistics
    - Min/Max/Average breakdowns for each processing phase
- Via JSON Files: raw cost and timing fields are also available in the `.experiment.json` and `.eval.json` outputs for programmatic analysis