Benchmark results

Our internal benchmark tests the cost efficiency of coding agents across various models, measuring performance per dollar on realistic engineering tasks.

Methodology

20 tasks. 5 trials per task. 100 trials total.
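
As a rough illustration of how the trial count and a performance-per-dollar figure could be tallied, here is a minimal sketch. The `Trial` record, its field names, and the "passes per dollar" definition are assumptions made for this example, not the benchmark's actual schema or metric, and the data is placeholder only.

```python
from dataclasses import dataclass


@dataclass
class Trial:
    task_id: str     # hypothetical task identifier
    passed: bool     # assumed pass/fail outcome for the trial
    cost_usd: float  # assumed total spend for the trial, in dollars


def summarize(trials):
    """Return trial count, pass rate, total cost, and passes per dollar."""
    total = len(trials)
    passes = sum(t.passed for t in trials)
    cost = sum(t.cost_usd for t in trials)
    return {
        "trials": total,
        "pass_rate": passes / total if total else 0.0,
        "total_cost_usd": cost,
        # One possible reading of "performance per dollar" (an assumption).
        "passes_per_dollar": passes / cost if cost > 0 else 0.0,
    }


TASKS = 20
TRIALS_PER_TASK = 5

# Placeholder data, not real results: every trial passes and costs $0.50.
trials = [
    Trial(task_id=f"task-{i}", passed=True, cost_usd=0.5)
    for i in range(TASKS)
    for _ in range(TRIALS_PER_TASK)
]
summary = summarize(trials)
```

With 20 tasks and 5 trials each, `summary["trials"]` comes out to 100, matching the total stated above. Repeating each task several times averages out run-to-run variance in agent behavior before cost is compared.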

Task sets

Rain

Event-sourced DSL built in Zig. A purpose-built language with a variety of sophisticated modules, making for a realistic engineering playground free of pretraining bias.
  • 4 feature tasks
  • 3 bug tasks
  • 3 refactor tasks (including a full Rust rewrite)

pgx

PostgreSQL driver in Go, a real GitHub project. A larger codebase with real-world complexity and scale to navigate.
  • 3 feature tasks
  • 5 bug tasks
  • 2 refactor tasks

Results

Both task sets consist of realistic engineering tasks ranging from small to very large in scope, with varying difficulty and complexity to model real-world variation in engineering work.