Source Code: `src/gaia/eval/fix_code_testbench/`

Overview
A testbench for evaluating code-fixing prompts against LLM models (local or Claude). Use it to validate and iterate on prompts that automatically fix code errors.

Purpose
The testbench simulates the agent flow for fixing code:
- Takes a buggy source file and an error description
- Sends a formatted prompt to an LLM (local endpoint or Claude Sonnet 4.5)
- Extracts the model’s suggested fix
- Outputs the corrected code and displays a diff
Two modes are supported:
- `fix_code` mode (default): Requests the full corrected code snippet
- `edit_file` mode (`--use-edit-file`): Requests an `edit_file` tool call with old/new content pairs (sketched below)
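For orientation, an `edit_file` call with old/new content pairs might look roughly like the sketch below; the field names and the edit itself are assumptions for illustration, not the testbench's documented schema.

```python
# Hypothetical shape of an edit_file tool call; field names and the edit
# contents are illustrative assumptions, not the testbench's actual schema.
edit_file_call = {
    "tool": "edit_file",
    "path": "examples/sum.py",
    "edits": [
        {
            "old_content": "for i in range(len(numbers) - 1):",
            "new_content": "for i in range(len(numbers)):",
        }
    ],
}
```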
Usage
Basic example

The same options (`--use-claude`, `--use-edit-file`, `--context`, `--start-line`, etc.) also work with `gaia eval fix-code`.
With Claude

Pass `--use-claude` to run against Claude Sonnet 4.5 instead of the local endpoint.
Example Output
Testing if a model can fix a simple off-by-one error:

Testing if a model can fix a variable name typo, where `number` is used instead of `numbers` on line 12:
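The kind of bug behind the typo case looks roughly like this (a sketch only, not the actual contents of `examples/average-calc.py`):

```python
# Illustrative sketch; the real examples/average-calc.py may differ.
def average(numbers):
    total = 0
    for n in numbers:
        total += n
    return total / len(number)  # bug: `number` used instead of `numbers`
```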
Testing if a model can fix a mutable default argument bug:
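The mutable default argument case follows the classic Python pitfall; a generic sketch (not the actual `examples/append.py`) and its conventional fix:

```python
# Generic mutable-default-argument bug (illustrative, not examples/append.py).
def append_item(item, items=[]):  # bug: the default list is shared across calls
    items.append(item)
    return items

# Conventional fix: use None as a sentinel and create a fresh list per call.
def append_item_fixed(item, items=None):
    if items is None:
        items = []
    items.append(item)
    return items
```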
Real-World TypeScript Scenario
Earlier versions of the web dev agent failed TypeScript validation in `examples/workout_web_app_component/WorkoutForm.tsx`. Claude handles this case when run with `--use-claude`, while the local model does not fix it reliably; the testbench helps explore prompt/context tweaks that let local models match Claude's reliability.
Improvements for the TypeScript Scenario
Two approaches improved the local model's success rate.

Solution 1: More specific prompt engineering. Without extra guidance the local model falls back to a cast (`as unknown as never`), while Claude properly types `normalized`; injecting targeted guidance (`--use-prompt-engineering`) pushes the local model toward the properly typed fix.
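A minimal sketch of what `--use-prompt-engineering` could inject, assuming a simple prompt builder; the guidance text and function name here are illustrative assumptions, not the testbench's actual implementation:

```python
# Hypothetical prompt-engineering injection; GUIDANCE and build_prompt() are
# illustrative assumptions, not the testbench's actual code.
GUIDANCE = (
    "Fix the reported TypeScript error by giving the value a proper type. "
    "Do not silence the compiler with casts such as `as unknown as never`."
)

def build_prompt(source: str, error: str, use_prompt_engineering: bool = False) -> str:
    prompt = f"Fix the following error:\n{error}\n\nSource code:\n{source}"
    if use_prompt_engineering:
        prompt = f"{GUIDANCE}\n\n{prompt}"
    return prompt
```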
Common Options
- `--model MODEL`: Local model identifier (default: `Qwen3-Coder-30B-A3B-Instruct-GGUF`)
- `--use-claude`: Use Claude Sonnet 4.5 instead of local endpoint
- `--language LANG`: Override language detection (python/typescript/etc.)
- `--context TEXT`: Add additional context to the prompt
- `--use-prompt-engineering`: Inject targeted guidance into the prompt
- `--use-edit-file`: Request an `edit_file` tool call format instead of full code replacement
- `--start-line N`: Include only lines N onwards (default: 1)
- `--end-line N`: Include only up to line N (default: end of file)
- `--temperature FLOAT`: Sampling temperature (default: 0.2)
- `--timeout SECONDS`: HTTP timeout (default: 600)
Test Cases
Example bug scenarios:
- `examples/average-calc.py`: Variable name typo
- `examples/max-value.py`: Function missing return statement
- `examples/append.py`: Mutable default argument issue
- `examples/sum.py`: Off-by-one loop error
Requirements
- Python 3.x
- `openai` package (for local models)
- `anthropic` package (for Claude when using `--use-claude`)
- Local LLM endpoint at `http://localhost:8000/api/v1` (or set `ANTHROPIC_API_KEY` for Claude)
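A minimal sketch of how the two backends can be reached, assuming the standard `openai` and `anthropic` Python clients; the client construction shown here is illustrative, not the testbench's actual code:

```python
# Illustrative backend setup; not the testbench's actual implementation.
import os

from openai import OpenAI        # local endpoint backend
from anthropic import Anthropic  # Claude backend (--use-claude)

local_client = OpenAI(base_url="http://localhost:8000/api/v1", api_key="not-needed")
claude_client = Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])
```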