
Overview

A testbench for evaluating code-fixing prompts against LLMs (a local endpoint or Claude). Use it to validate and iterate on prompts that automatically fix code errors.

Purpose

The testbench simulates the agent flow for fixing code:
  1. Takes a buggy source file and an error description
  2. Sends a formatted prompt to an LLM (local endpoint or Claude Sonnet 4.5)
  3. Extracts the model’s suggested fix
  4. Outputs the corrected code and displays a diff
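Steps 3 and 4 of this loop can be sketched in a few lines of Python (the helper names are illustrative, not the testbench's actual internals):

```python
import difflib
import re

def extract_code(response: str) -> str:
    """Step 3: pull the first fenced code block out of the raw model response.

    Falls back to the whole response if the model did not use a fence.
    """
    match = re.search(r"```[\w+-]*\n(.*?)```", response, re.DOTALL)
    return match.group(1) if match else response

def render_diff(original: str, fixed: str, src: str, dst: str) -> str:
    """Step 4: show the cleaned fix against the original file."""
    return "".join(difflib.unified_diff(
        original.splitlines(keepends=True),
        fixed.splitlines(keepends=True),
        fromfile=src,
        tofile=dst,
    ))
```

Extraction is deliberately forgiving: models sometimes reply with bare code and no fence, in which case the raw response is used as-is.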
It supports two modes:
  • fix_code mode (default): Requests the full corrected code snippet
  • edit_file mode (--use-edit-file): Requests an edit_file tool call with old/new content pairs
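In edit_file mode the model returns a list of old/new content pairs rather than the whole file. The exact tool-call schema isn't reproduced here, but applying such pairs amounts to the following (the field names `old_content`/`new_content` are hypothetical):

```python
def apply_edits(source: str, edits: list) -> str:
    """Apply old/new content pairs to a file's text.

    Each edit replaces exactly one occurrence of its old content; an
    ambiguous or missing match is rejected rather than silently applied.
    """
    for edit in edits:
        old, new = edit["old_content"], edit["new_content"]
        if source.count(old) != 1:
            raise ValueError(f"old_content must match exactly once: {old!r}")
        source = source.replace(old, new, 1)
    return source
```

Requiring a unique match is the usual safeguard in edit-based flows: it keeps a vague snippet from rewriting the wrong occurrence.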

Usage

Basic example
gaia eval fix-code path/to/buggy_file.py "Error message" output.py
With a local model
gaia eval fix-code \
  examples/average-calc.py \
  "NameError: name 'number' is not defined" \
  examples/average-calc-fixed.py \
  --model Qwen3-Coder-30B-A3B-Instruct-GGUF
All script flags (--use-claude, --use-edit-file, --context, --start-line, etc.) work with gaia eval fix-code.

With Claude
export ANTHROPIC_API_KEY=your_key_here
gaia eval fix-code \
  examples/max-value.py \
  "TypeError: 'NoneType' object is not callable" \
  examples/max-value-fixed.py \
  --use-claude
Using edit_file mode
gaia eval fix-code \
  examples/sum.py \
  "This is resulting in 10 but the correct answer is 15. Analyze and fix what is wrong." \
  examples/sum-fixed.py \
  --use-edit-file

Example Output

Testing if a model can fix a simple off-by-one error:
gaia eval fix-code \
  examples/sum.py \
  "This is resulting in 10 but the correct answer is 15. Analyze and fix what is wrong." \
  examples/sum-fixed.py
The tool displays the prompt sent to the model, the raw response, and a diff showing the fix:
=== Prompt ===
Fix the following Python code error:
File path: examples/sum.py
Error: This is resulting in 10 but the correct answer is 15. Analyze and fix what is wrong.
Code:

```python

"""Compute a numeric series."""
def sum(n):
    if n < 0:
        raise ValueError("n must be non-negative")
    total = 0
    for i in range(n):
        total += i
    return total

print(f"Sum through 5: {sum(5)}")
```

Return ONLY the corrected code, no explanations.

=== RAW RESPONSE ===

```python
"""Compute a numeric series."""
def sum(n):
    if n < 0:
        raise ValueError("n must be non-negative")
    total = 0
    for i in range(n + 1):
        total += i
    return total
print(f"Sum through 5: {sum(5)}")
```

=== Diff (cleaned vs original) ===

--- examples/sum.py:1-16
+++ examples/sum-fixed.py:1-16
@@ -5,7 +5,7 @@
     if n < 0:
         raise ValueError("n must be non-negative")
     total = 0
-    for i in range(n):
+    for i in range(n + 1):
         total += i
     return total
Testing if a model can fix a variable name typo:
gaia eval fix-code \
  examples/average-calc.py \
  "File \"/scratch/eddier/gaia/src/gaia/eval/fix_code_testbench/examples/average-calc.py\", line 16, in <module>" \
  examples/average-calc-fixed.py
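The buggy file isn't reproduced here; a minimal sketch of the kind of typo involved (hypothetical contents, the real examples/average-calc.py differs):

```python
def average_buggy(numbers):
    total = 0
    for value in number:   # BUG: `number` instead of `numbers` -> NameError
        total += value
    return total / len(numbers)

def average_fixed(numbers):
    total = 0
    for value in numbers:  # fix: iterate over the actual parameter
        total += value
    return total / len(numbers)
```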
This checks whether the model can fix a typo where number is used instead of numbers on line 12.

Testing if a model can fix a mutable default argument bug:
gaia eval fix-code \
  examples/append.py \
  "Default list is shared across calls. Fix this." \
  examples/append-fixed.py
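For reference, the mutable-default-argument bug takes this classic Python form (a sketch, not the literal contents of examples/append.py):

```python
def append_buggy(item, items=[]):   # BUG: the default list is created once,
    items.append(item)              # at definition time, and shared by every
    return items                    # call that omits `items`

def append_fixed(item, items=None):  # fix: use a None sentinel and build a
    if items is None:                # fresh list inside each call
        items = []
    items.append(item)
    return items
```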

Real-World TypeScript Scenario

Earlier versions of the web dev agent failed TypeScript validation in examples/workout_web_app_component/WorkoutForm.tsx:
src/components/WorkoutForm.tsx:43:9 - error TS2322: Type 'string' is not assignable to type 'never'.

40         normalized[field as keyof typeof normalized] = parsedValue.toISOString().
Two common failure modes appeared with a local LLM.

Scenario 1: Giving the entire error message to the agent
gaia eval fix-code \
  examples/workout_web_app_component/WorkoutForm.tsx \
  "src/components/WorkoutForm.tsx:33:21 - error TS7053: Element implicitly has an 'any' type because expression of type 'string' can't be used to index type '{ name: string; duration: number; date: string; goal: string; }'. No index signature with a parameter of type 'string' was found on type '{ name: string; duration: number; date: string; goal: string; }'. 33  const value = normalized[field]; src/components/WorkoutForm.tsx:43:9 - error TS2322: Type 'string' is not assignable to type 'never'. 43 normalized[field as keyof typeof normalized] = parsedValue.toISOString()" \
  examples/workout_web_app_component/WorkoutForm-new.tsx
Resulting diff (does not fix the issue):
--- examples/workout_web_app_component/WorkoutForm.tsx:1-184
+++ examples/workout_web_app_component/WorkoutForm-new.tsx:1-184
@@ -40,7 +40,7 @@
       const parsedValue = new Date(value as string | number | Date);
       if (!Number.isNaN(parsedValue.getTime())) {
-        normalized[field as keyof typeof normalized] = parsedValue.toISOString();
+        normalized[field as keyof typeof normalized] = parsedValue.toISOString() as any;
       }
     });
Scenario 2: Not providing the exact line of failure
gaia eval fix-code \
  examples/workout_web_app_component/WorkoutForm.tsx \
  "src/components/WorkoutForm.tsx:43:9 - error TS2322: Type 'string' is not assignable to type 'never'. " \
  examples/workout_web_app_component/WorkoutForm-new.tsx
This typically produces no diff because the model cannot locate the failing line.

Scenario 3: Providing a single error with the appropriate failing line
gaia eval fix-code \
  examples/workout_web_app_component/WorkoutForm.tsx \
  "src/components/WorkoutForm.tsx:43:9 - error TS2322: Type 'string' is not assignable to type 'never'. 43 normalized[field as keyof typeof normalized] = parsedValue.toISOString();" \
  examples/workout_web_app_component/WorkoutForm-new.tsx
This fixes the issue about half the time; other times it yields an incorrect diff:
=== Diff (cleaned vs original) ===
--- examples/workout_web_app_component/WorkoutForm.tsx:1-184
+++ examples/workout_web_app_component/WorkoutForm-new.tsx:1-184
@@ -40,7 +40,7 @@
 
       const parsedValue = new Date(value as string | number | Date);
       if (!Number.isNaN(parsedValue.getTime())) {
-        normalized[field as keyof typeof normalized] = parsedValue.toISOString();
+        normalized[field as keyof typeof normalized] = parsedValue.toISOString() as unknown as (string | number);
       }
     });
Scenario 4: Running with a frontier model

All three prompts above resolve when run with --use-claude. The testbench helps explore prompt and context tweaks that let local models match Claude’s reliability.

Improvements for the TypeScript Scenario

Two approaches improved the local model’s success rate.

Solution 1: More specific prompt engineering
gaia eval fix-code \
  examples/workout_web_app_component/WorkoutForm.tsx \
  "src/components/WorkoutForm.tsx:43:9 - error TS2322: Type 'string' is not assignable to type 'never'. 43 normalized[field as keyof typeof normalized] = parsedValue.toISOString();" \
  examples/workout_web_app_component/WorkoutForm-new.tsx \
  --use-prompt-engineering
Solution 2: Homing in on where the issue is occurring
gaia eval fix-code \
  examples/workout_web_app_component/WorkoutForm.tsx \
  "src/components/WorkoutForm.tsx:43:9 - error TS2322: Type 'string' is not assignable to type 'never'. 43 normalized[field as keyof typeof normalized] = parsedValue.toISOString();" \
  examples/workout_web_app_component/WorkoutForm-new.tsx \
  --start-line 35 \
  --end-line 45
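The --start-line and --end-line flags window the code that goes into the prompt; the slicing behaves roughly like this (an illustrative sketch, assuming inclusive 1-indexed bounds as the flag defaults suggest):

```python
def window(code, start_line=1, end_line=None):
    """Return only the inclusive, 1-indexed [start_line, end_line] slice."""
    lines = code.splitlines(keepends=True)
    end = len(lines) if end_line is None else end_line
    return "".join(lines[start_line - 1:end])
```

Narrowing the window both shrinks the prompt and, as the scenario above shows, points a weaker model directly at the failing code.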
Both approaches can produce a fix; local models may use type assertions (as unknown as never) while Claude properly types normalized.

Common Options

  • --model MODEL: Local model identifier (default: Qwen3-Coder-30B-A3B-Instruct-GGUF)
  • --use-claude: Use Claude Sonnet 4.5 instead of local endpoint
  • --language LANG: Override language detection (python/typescript/etc.)
  • --context TEXT: Add additional context to the prompt
  • --use-prompt-engineering: Inject targeted guidance into the prompt
  • --use-edit-file: Request an edit_file tool call format instead of full code replacement
  • --start-line N: Include only lines N onwards (default: 1)
  • --end-line N: Include only up to line N (default: end of file)
  • --temperature FLOAT: Sampling temperature (default: 0.2)
  • --timeout SECONDS: HTTP timeout (default: 600)

Test Cases

Example bug scenarios:
  • examples/average-calc.py: Variable name typo
  • examples/max-value.py: Function missing return statement
  • examples/append.py: Mutable default argument issue
  • examples/sum.py: Off-by-one loop error
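As a sketch of one of these scenarios (hypothetical contents, the real file may differ), the missing-return bug in examples/max-value.py looks like:

```python
def max_value_buggy(values):
    best = values[0]
    for v in values[1:]:
        if v > best:
            best = v
    # BUG: no return statement, so the caller silently gets None

def max_value_fixed(values):
    best = values[0]
    for v in values[1:]:
        if v > best:
            best = v
    return best
```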

Requirements

  • Python 3.x
  • openai package (for local models)
  • anthropic package (for Claude when using --use-claude)
  • Local LLM endpoint at http://localhost:8000/api/v1 (or set ANTHROPIC_API_KEY for Claude)