
Overview

A testbench for evaluating code-fixing prompts against LLMs (a local endpoint or Claude). Use it to validate and iterate on prompts that automatically fix code errors.

Purpose

The testbench simulates the agent flow for fixing code:
  1. Takes a buggy source file and an error description
  2. Sends a formatted prompt to an LLM (local endpoint or Claude Sonnet 4.5)
  3. Extracts the model’s suggested fix
  4. Outputs the corrected code and displays a diff
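Steps 3 and 4 of this loop can be sketched in a few lines of Python (the helper names are illustrative, not the testbench's actual internals):

```python
import difflib
import re

def extract_code(response: str) -> str:
    """Step 3: pull the first fenced code block out of the raw model response.

    Falls back to the whole response if the model did not use a fence.
    """
    match = re.search(r"```[\w+-]*\n(.*?)```", response, re.DOTALL)
    return match.group(1) if match else response

def render_diff(original: str, fixed: str, src: str, dst: str) -> str:
    """Step 4: show the cleaned fix against the original file."""
    return "".join(difflib.unified_diff(
        original.splitlines(keepends=True),
        fixed.splitlines(keepends=True),
        fromfile=src,
        tofile=dst,
    ))
```

Extraction is deliberately forgiving: models sometimes reply with bare code and no fence, in which case the raw response is used as-is.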
It supports two modes:
  • fix_code mode (default): Requests the full corrected code snippet
  • edit_file mode (--use-edit-file): Requests an edit_file tool call with old/new content pairs
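In edit_file mode the model returns a list of old/new content pairs rather than the whole file. The exact tool-call schema isn't reproduced here, but applying such pairs amounts to the following (the field names `old_content`/`new_content` are hypothetical):

```python
def apply_edits(source: str, edits: list) -> str:
    """Apply old/new content pairs to a file's text.

    Each edit replaces exactly one occurrence of its old content; an
    ambiguous or missing match is rejected rather than silently applied.
    """
    for edit in edits:
        old, new = edit["old_content"], edit["new_content"]
        if source.count(old) != 1:
            raise ValueError(f"old_content must match exactly once: {old!r}")
        source = source.replace(old, new, 1)
    return source
```

Requiring a unique match is the usual safeguard in edit-based flows: it keeps a vague snippet from rewriting the wrong occurrence.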

Usage

Basic example
gaia eval fix-code path/to/buggy_file.py "Error message" output.py
With a local model
gaia eval fix-code \
  examples/average-calc.py \
  "NameError: name 'number' is not defined" \
  examples/average-calc-fixed.py \
  --model Qwen3-Coder-30B-A3B-Instruct-GGUF
All script flags (--use-claude, --use-edit-file, --context, --start-line, etc.) work with gaia eval fix-code.

With Claude
export ANTHROPIC_API_KEY=your_key_here
gaia eval fix-code \
  examples/max-value.py \
  "TypeError: 'NoneType' object is not callable" \
  examples/max-value-fixed.py \
  --use-claude
Using edit_file mode
gaia eval fix-code \
  examples/sum.py \
  "This is resulting in 10 but the correct answer is 15. Analyze and fix what is wrong." \
  examples/sum-fixed.py \
  --use-edit-file

Example Output

Testing if a model can fix a simple off-by-one error:
gaia eval fix-code \
  examples/sum.py \
  "This is resulting in 10 but the correct answer is 15. Analyze and fix what is wrong." \
  examples/sum-fixed.py
The tool displays the prompt sent to the model, the raw response, and a diff showing the fix:
=== Prompt ===
Fix the following Python code error:
File path: examples/sum.py
Error: This is resulting in 10 but the correct answer is 15. Analyze and fix what is wrong.
Code:

```python

"""Compute a numeric series."""
def sum(n):
    if n < 0:
        raise ValueError("n must be non-negative")
    total = 0
    for i in range(n):
        total += i
    return total

print(f"Sum through 5: {sum(5)}")
```

Return ONLY the corrected code, no explanations.

=== RAW RESPONSE ===

```python
"""Compute a numeric series."""
def sum(n):
    if n < 0:
        raise ValueError("n must be non-negative")
    total = 0
    for i in range(n + 1):
        total += i
    return total
print(f"Sum through 5: {sum(5)}")
```

=== Diff (cleaned vs original) ===

--- examples/sum.py:1-16
+++ examples/sum-fixed.py:1-16
@@ -5,7 +5,7 @@
     if n < 0:
         raise ValueError("n must be non-negative")
     total = 0
-    for i in range(n):
+    for i in range(n + 1):
         total += i
     return total
Testing if a model can fix a variable name typo:
gaia eval fix-code \
  examples/average-calc.py \
  "File \"/scratch/eddier/gaia/src/gaia/eval/fix_code_testbench/examples/average-calc.py\", line 16, in <module>" \
  examples/average-calc-fixed.py
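The buggy file isn't reproduced here; a minimal sketch of the kind of typo involved (hypothetical contents, the real examples/average-calc.py differs):

```python
def average_buggy(numbers):
    total = 0
    for value in number:   # BUG: `number` instead of `numbers` -> NameError
        total += value
    return total / len(numbers)

def average_fixed(numbers):
    total = 0
    for value in numbers:  # fix: iterate over the actual parameter
        total += value
    return total / len(numbers)
```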
This checks whether the model can fix a typo where number is used instead of numbers on line 12.

Testing if a model can fix a mutable default argument bug:
gaia eval fix-code \
  examples/append.py \
  "Default list is shared across calls. Fix this." \
  examples/append-fixed.py
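For reference, the mutable-default-argument bug takes this classic Python form (a sketch, not the literal contents of examples/append.py):

```python
def append_buggy(item, items=[]):   # BUG: the default list is created once,
    items.append(item)              # at definition time, and shared by every
    return items                    # call that omits `items`

def append_fixed(item, items=None):  # fix: use a None sentinel and build a
    if items is None:                # fresh list inside each call
        items = []
    items.append(item)
    return items
```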

Real-World TypeScript Scenario

Earlier versions of the web dev agent failed TypeScript validation in examples/workout_web_app_component/WorkoutForm.tsx:
src/components/WorkoutForm.tsx:43:9 - error TS2322: Type 'string' is not assignable to type 'never'.

40         normalized[field as keyof typeof normalized] = parsedValue.toISOString().
Two common failure modes appeared with a local LLM.

Scenario 1: Giving the entire error message to the agent
gaia eval fix-code \
  examples/workout_web_app_component/WorkoutForm.tsx \
  "src/components/WorkoutForm.tsx:33:21 - error TS7053: Element implicitly has an 'any' type because expression of type 'string' can't be used to index type '{ name: string; duration: number; date: string; goal: string; }'. No index signature with a parameter of type 'string' was found on type '{ name: string; duration: number; date: string; goal: string; }'. 33  const value = normalized[field]; src/components/WorkoutForm.tsx:43:9 - error TS2322: Type 'string' is not assignable to type 'never'. 43 normalized[field as keyof typeof normalized] = parsedValue.toISOString()" \
  examples/workout_web_app_component/WorkoutForm-new.tsx
Resulting diff (does not fix the issue):
--- examples/workout_web_app_component/WorkoutForm.tsx:1-184
+++ examples/workout_web_app_component/WorkoutForm-new.tsx:1-184
@@ -40,7 +40,7 @@
       const parsedValue = new Date(value as string | number | Date);
       if (!Number.isNaN(parsedValue.getTime())) {
-        normalized[field as keyof typeof normalized] = parsedValue.toISOString();
+        normalized[field as keyof typeof normalized] = parsedValue.toISOString() as any;
       }
     });
Scenario 2: Not providing the exact line of failure
gaia eval fix-code \
  examples/workout_web_app_component/WorkoutForm.tsx \
  "src/components/WorkoutForm.tsx:43:9 - error TS2322: Type 'string' is not assignable to type 'never'. " \
  examples/workout_web_app_component/WorkoutForm-new.tsx
This typically produces no diff because the model cannot locate the failing line.

Scenario 3: Providing a single error with the appropriate failing line
gaia eval fix-code \
  examples/workout_web_app_component/WorkoutForm.tsx \
  "src/components/WorkoutForm.tsx:43:9 - error TS2322: Type 'string' is not assignable to type 'never'. 43 normalized[field as keyof typeof normalized] = parsedValue.toISOString();" \
  examples/workout_web_app_component/WorkoutForm-new.tsx
This fixes the issue about half the time; other times it yields an incorrect diff:
=== Diff (cleaned vs original) ===
--- examples/workout_web_app_component/WorkoutForm.tsx:1-184
+++ examples/workout_web_app_component/WorkoutForm-new.tsx:1-184
@@ -40,7 +40,7 @@
 
       const parsedValue = new Date(value as string | number | Date);
       if (!Number.isNaN(parsedValue.getTime())) {
-        normalized[field as keyof typeof normalized] = parsedValue.toISOString();
+        normalized[field as keyof typeof normalized] = parsedValue.toISOString() as unknown as (string | number);
       }
     });
Scenario 4: Running with a frontier model

All three prompts above resolve when run with --use-claude. The testbench helps explore prompt and context tweaks that let local models match Claude’s reliability.

Improvements for the TypeScript Scenario

Two approaches improved the local model’s success rate.

Solution 1: More specific prompt engineering
gaia eval fix-code \
  examples/workout_web_app_component/WorkoutForm.tsx \
  "src/components/WorkoutForm.tsx:43:9 - error TS2322: Type 'string' is not assignable to type 'never'. 43 normalized[field as keyof typeof normalized] = parsedValue.toISOString();" \
  examples/workout_web_app_component/WorkoutForm-new.tsx \
  --use-prompt-engineering
Solution 2: Homing in on where the issue is occurring
gaia eval fix-code \
  examples/workout_web_app_component/WorkoutForm.tsx \
  "src/components/WorkoutForm.tsx:43:9 - error TS2322: Type 'string' is not assignable to type 'never'. 43 normalized[field as keyof typeof normalized] = parsedValue.toISOString();" \
  examples/workout_web_app_component/WorkoutForm-new.tsx \
  --start-line 35 \
  --end-line 45
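The --start-line and --end-line flags window the code that goes into the prompt; the slicing behaves roughly like this (an illustrative sketch, assuming inclusive 1-indexed bounds as the flag defaults suggest):

```python
def window(code, start_line=1, end_line=None):
    """Return only the inclusive, 1-indexed [start_line, end_line] slice."""
    lines = code.splitlines(keepends=True)
    end = len(lines) if end_line is None else end_line
    return "".join(lines[start_line - 1:end])
```

Narrowing the window both shrinks the prompt and, as the scenario above shows, points a weaker model directly at the failing code.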
Both approaches can produce a fix; local models may use type assertions (as unknown as never) while Claude properly types normalized.

Common Options

  • --model MODEL: Local model identifier (default: Qwen3-Coder-30B-A3B-Instruct-GGUF)
  • --use-claude: Use Claude Sonnet 4.5 instead of local endpoint
  • --language LANG: Override language detection (python/typescript/etc.)
  • --context TEXT: Add additional context to the prompt
  • --use-prompt-engineering: Inject targeted guidance into the prompt
  • --use-edit-file: Request an edit_file tool call format instead of full code replacement
  • --start-line N: Include only lines N onwards (default: 1)
  • --end-line N: Include only up to line N (default: end of file)
  • --temperature FLOAT: Sampling temperature (default: 0.2)
  • --timeout SECONDS: HTTP timeout (default: 600)

Test Cases

Example bug scenarios:
  • examples/average-calc.py: Variable name typo
  • examples/max-value.py: Function missing return statement
  • examples/append.py: Mutable default argument issue
  • examples/sum.py: Off-by-one loop error
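As a sketch of one of these scenarios (hypothetical contents, the real file may differ), the missing-return bug in examples/max-value.py looks like:

```python
def max_value_buggy(values):
    best = values[0]
    for v in values[1:]:
        if v > best:
            best = v
    # BUG: no return statement, so the caller silently gets None

def max_value_fixed(values):
    best = values[0]
    for v in values[1:]:
        if v > best:
            best = v
    return best
```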

Requirements

  • Python 3.x
  • openai package (for local models)
  • anthropic package (for Claude when using --use-claude)
  • Local LLM endpoint at http://localhost:8000/api/v1 (or set ANTHROPIC_API_KEY for Claude)