Benchmark - Magnitude

Our internal benchmark tests the cost efficiency of coding agents across various models, measuring performance per dollar on realistic engineering tasks.
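One plausible reading of "performance per dollar", offered as a rough sketch rather than the benchmark's actual scoring: count the trials that passed and divide by the total dollar spend. The `Trial` record, its field names, the pass/fail notion of performance, and the sample values are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Trial:
    """Hypothetical record for one trial; not the benchmark's real schema."""
    passed: bool      # assumed pass/fail outcome of the trial
    cost_usd: float   # assumed total model spend for the trial, in dollars


def performance_per_dollar(trials: list[Trial]) -> float:
    """Passed trials divided by total spend (an assumed definition)."""
    total_cost = sum(t.cost_usd for t in trials)
    if total_cost <= 0:
        raise ValueError("total cost must be positive")
    passes = sum(1 for t in trials if t.passed)
    return passes / total_cost


if __name__ == "__main__":
    # Made-up values for illustration only; these are not benchmark results.
    sample = [Trial(True, 0.50), Trial(False, 0.25), Trial(True, 0.25)]
    print(performance_per_dollar(sample))  # 2 passes / $1.00 = 2.0
```

Under this definition, failed trials still count toward cost, so an agent that spends heavily on attempts that do not pass scores lower than one that fails cheaply.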

Methodology

20 tasks. 5 trials per task. 100 trials total.

Task sets

Rain

An event-sourced DSL built in Zig. It is a purpose-built language whose variety of sophisticated modules makes for a realistic engineering playground free of pretraining bias.

4 feature tasks
3 bug tasks
3 refactor tasks (including a full Rust rewrite)

pgx

A PostgreSQL driver written in Go and an existing open-source project on GitHub. A larger codebase with real-world complexity and scale to navigate.

3 feature tasks
5 bug tasks
2 refactor tasks
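As a quick consistency check, the task counts listed above can be tallied against the totals stated under Methodology. This is a minimal sketch: the counts are copied from the lists, while the dictionary layout and function names are illustrative.

```python
# Task counts copied from the Rain and pgx lists; the structure is illustrative.
TASK_SETS = {
    "Rain": {"feature": 4, "bug": 3, "refactor": 3},
    "pgx": {"feature": 3, "bug": 5, "refactor": 2},
}
TRIALS_PER_TASK = 5  # "5 trials per task" from Methodology


def tasks_in(task_set: str) -> int:
    """Number of tasks in one task set."""
    return sum(TASK_SETS[task_set].values())


def total_tasks() -> int:
    """Number of tasks across all task sets."""
    return sum(tasks_in(name) for name in TASK_SETS)


def total_trials() -> int:
    """Number of trials across all tasks."""
    return total_tasks() * TRIALS_PER_TASK


if __name__ == "__main__":
    for name in TASK_SETS:
        print(f"{name}: {tasks_in(name)} tasks")
    print(f"total: {total_tasks()} tasks, {total_trials()} trials")
```

Each task set contributes 10 tasks, which gives the 20 tasks and 100 trials stated under Methodology.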

Results

Both task sets consist of realistic engineering tasks that range from small to very large in scope, with varying difficulty and complexity, to model the variation found in real-world engineering work.
