Technical benchmark · Coding models
Evaluating top-tier coding models for ephemeral code generation
Ten frontier models, forty single-purpose tasks, executed and graded on whether the code actually runs, actually works, and what it costs. The finding: for small programs, correctness is nearly solved — and a cheap model plus one self-review pass matches the best models at a fraction of the price.
The category
What "ephemeral code" is
Ephemeral code is single-purpose, short-lived code generated on demand to do one job — process a file, call an API, transform a payload, drive a browser — and then discarded or regenerated rather than maintained as a codebase.
These programs are small: almost always under a thousand lines, usually under a hundred. They have a narrow contract — some inputs, a side effect or a result — and no need for the architecture, abstraction, or longevity of application code. They are the plumbing that connects one system to another for a single run.
That size is exactly the range where today's models are strongest. A task that fits in one file, needs no cross-module reasoning, and can be checked by running it is close to the sweet spot of current code generation. The interesting question is no longer can a top model write it — several can, nearly every time — but which model does it reliably, cheaply, and correctly enough to trust unattended.
The argument
Keep the prompt, not the code
If a model can regenerate a correct 80-line integration from a paragraph of intent, the code stops being the asset. The specification is. It is smaller, it is readable, it doesn't rot, and it isn't bound to a language or a library version.
Maintain compact natural-language specs — instruction plus spec, a “micro-prompt” — and generate the code on the fly, rather than storing and maintaining the code itself.
A micro-prompt is a fraction of the size of the code it produces, survives model and language changes untouched, and carries no dependency drift, no dead branches, and no stale comments. The generated artifact lives for one execution.
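As a rough illustration of what gets stored instead of code, a micro-prompt can be modeled as two text fields and a renderer. The class name `MicroPrompt`, its field names, and the prompt layout are hypothetical choices for this sketch, not a format the benchmark prescribes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MicroPrompt:
    """Hypothetical container: the stored asset is text, not code."""

    instruction: str  # what the program should do, in one or two sentences
    spec: str         # inputs, outputs, and constraints; no implementation detail

    def render(self) -> str:
        # Layout is an assumption; any stable, readable format would do.
        return (
            f"Instruction:\n{self.instruction}\n\n"
            f"Spec:\n{self.spec}\n\n"
            "Return one complete program and its input schema."
        )


prompt = MicroPrompt(
    instruction="Sum the 'amount' column of a CSV file.",
    spec="Input: path to a UTF-8 CSV with a header row. Output: JSON {\"total\": number}.",
)
print(prompt.render())
```

The rendered text is what would be sent to a model on each run; the generated program itself is never stored.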
This only works if generation is dependable. A micro-prompt that produces working code 70% of the time is a liability; at 99%+, with a cheap automatic repair pass to cover the rest, it becomes infrastructure. So the practical question this benchmark answers is: how dependable is each model on exactly this class of work, and at what cost?
Method
How the tasks were measured
Each model was given a task's instruction and spec — nothing about implementation — and asked to produce a complete program plus its input schema. The program was then executed in an isolated local sandbox and graded on four axes.
- Runs — does it execute without crashing and emit a well-formed result?
- Correct — does the output match the expected result, and — for integration tasks — did a mock service confirm the request was authenticated correctly? A hard-coded answer cannot pass.
- Quality — static contract checks plus an LLM reviewer scoring error handling, cleanup, and robustness.
- Cost — real API spend, reported per model and per correct answer.
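The first two axes can be sketched as a small harness. This is a minimal sketch, assuming the candidate is a Python script that reads a JSON payload on stdin and prints a JSON result; the names `Grade` and `grade` are hypothetical, and a plain subprocess with a timeout stands in for the isolated sandbox, which it does not replace.

```python
import json
import subprocess
import sys
import tempfile
from dataclasses import dataclass
from pathlib import Path


@dataclass
class Grade:
    runs: bool     # exited cleanly and printed well-formed JSON
    correct: bool  # parsed result equals the expected result


def grade(program_source: str, payload: dict, expected: object,
          timeout_s: float = 10.0) -> Grade:
    """Run a candidate script once and grade the 'Runs' and 'Correct' axes."""
    with tempfile.TemporaryDirectory() as workdir:
        script = Path(workdir) / "candidate.py"
        script.write_text(program_source, encoding="utf-8")
        try:
            proc = subprocess.run(
                [sys.executable, str(script)],
                input=json.dumps(payload),  # assumed contract: JSON on stdin
                capture_output=True,
                text=True,
                timeout=timeout_s,
                cwd=workdir,
            )
        except subprocess.TimeoutExpired:
            return Grade(runs=False, correct=False)
    if proc.returncode != 0:
        return Grade(runs=False, correct=False)
    try:
        result = json.loads(proc.stdout)
    except json.JSONDecodeError:
        return Grade(runs=False, correct=False)
    return Grade(runs=True, correct=(result == expected))
```

A program that prints a fixed, well-formed answer would score `runs=True` here but `correct=False` whenever the expected result depends on the input, which is why the two axes are kept separate.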
The 40 tasks span three families: file & data processing (parse, transform, aggregate), external API integration across five real authentication schemes — bearer token, HTTP basic, API-key header, OAuth2 client-credentials, and mutual TLS with both PEM and PKCS#12 client certificates — and bug-fixing of planted defects. Each contender ran three samples; a refinement study ran one.
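To show how a mock service can confirm authentication independently of the program's output, here is a minimal sketch for the bearer-token scheme only, built on the Python standard library. The token value, the class names, and the `authenticated_requests` counter are hypothetical; the benchmark's actual mock services are not described in this detail.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

EXPECTED_TOKEN = "test-token-123"  # hypothetical secret used only in this sketch


class _Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        # The service, not the candidate program, decides whether auth was correct.
        ok = self.headers.get("Authorization") == f"Bearer {EXPECTED_TOKEN}"
        if ok:
            self.server.authenticated_requests += 1
        body = b'{"value": 42}' if ok else b'{"error": "unauthorized"}'
        self.send_response(200 if ok else 401)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the grader's output quiet


class MockService(HTTPServer):
    def __init__(self, address):
        super().__init__(address, _Handler)
        self.authenticated_requests = 0  # read by the grader after the run


def start_mock() -> MockService:
    """Start the mock on a free loopback port in a background thread."""
    server = MockService(("127.0.0.1", 0))
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

A grader would start the mock, pass `server.server_address` to the generated program, and afterwards require `authenticated_requests` to be at least 1 in addition to a matching result.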
Results · single-shot
The leaderboard
Correctness is a solved problem at the top: four models are perfect across all 40 tasks. What separates them is cost per correct answer — a 30× spread — and how much robustness the reviewer sees in the code.
| # | Model | Correct | Refused | Judge | Cost | Cost / correct |
|---|---|---|---|---|---|---|
| 1 | gemini-3.1-pro | 100.0% | 0 | 0.81 | $5.55 | $0.046 |
| 2 | claude-haiku-4.5 | 100.0% | 0 | 0.79 | $1.94 | $0.016 |
| 3 | minimax-m3 (best value) | 100.0% | 0 | 0.77 | $0.82 | $0.007 |
| 4 | claude-sonnet-4.6 | 100.0% | 0 | 0.70 | $4.53 | $0.038 |
| 5 | gpt-5.5 | 99.2% | 0 | 0.80 | $5.35 | $0.045 |
| 6 | glm-5.1 | 96.7% | 0 | 0.67 | $1.63 | $0.014 |
| 7 | deepseek-v4-pro | 95.8% | 0 | 0.73 | $1.82 | $0.016 |
| 8 | claude-fable-5 | 95.0% | 2 ⚑ | 0.84 | $6.61 | $0.174 |
| 9 | kimi-k2.6 | 93.3% | 0 | 0.80 | $2.12 | $0.019 |
| 10 | qwen3-coder-next (cheapest) | 93.3% | 0 | 0.60 | $0.60 | $0.005 |
40 tasks × 3 samples. Correct requires verified authentication, not just matching output. Judge is an independent reviewer's robustness score (0–1). ⚑ Refused = the model's content filter declined to generate the task. minimax-m3 is the value winner: perfect correctness at roughly 2–7× lower cost than the other flawless models (5–7× against gemini-3.1-pro and claude-sonnet-4.6). qwen3-coder-next is the outright cheapest.
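The cost-per-correct column can be approximated from the other columns. The formula below is inferred rather than stated in the source: total cost divided by the number of correct runs, taking runs as 40 tasks × 3 samples. It reproduces most rows to three decimals; the claude-fable-5 row does not fit it, so that figure was presumably computed on a different basis.

```python
def cost_per_correct(total_cost_usd: float, tasks: int, samples: int,
                     correct_rate: float) -> float:
    """Inferred formula: spend divided by the number of correct runs."""
    correct_runs = tasks * samples * correct_rate
    return total_cost_usd / correct_runs


# Rows taken from the leaderboard: (total cost in USD, correct rate).
rows = {
    "gemini-3.1-pro": (5.55, 1.000),
    "minimax-m3": (0.82, 1.000),
    "glm-5.1": (1.63, 0.967),
}
for model, (cost, rate) in rows.items():
    print(model, round(cost_per_correct(cost, 40, 3, rate), 3))
```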
Results · cost-to-correct
One repair pass closes the gap
The models that miss single-shot rarely miss by much. Feeding a failed attempt back for one revision recovers almost everything — and it works two ways. With execution feedback (the real error message) every model reaches 100% by the second pass. With self-review only — no test run, the model just re-reads its own code against the spec — most still get there. That second mode is the one you can run anywhere, without an oracle.
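Both modes fit one loop that differs only in where the feedback comes from. This is a minimal sketch under assumptions: `generate`, `revise`, and `feedback` are hypothetical callables standing in for model calls and a test run, and `max_passes=2` corresponds to one generation plus one revision.

```python
from typing import Callable, Optional

# feedback(spec, code) returns None when there is nothing to fix, else a note.
# Execution mode: run the code and return the real error message.
# Self-review mode: ask the model to re-read its code against the spec.
Feedback = Callable[[str, str], Optional[str]]


def generate_with_repair(
    spec: str,
    generate: Callable[[str], str],
    revise: Callable[[str, str, str], str],
    feedback: Feedback,
    max_passes: int = 2,
) -> str:
    """Generate once, then revise up to max_passes - 1 times while feedback finds issues."""
    code = generate(spec)
    for _ in range(max_passes - 1):
        note = feedback(spec, code)
        if note is None:
            return code  # nothing reported, keep the current attempt
        code = revise(spec, code, note)
    return code
```

Swapping the `feedback` callable is the only change needed to move between the oracle and oracle-free variants, which is what makes the self-review mode easy to embed in a generation pipeline.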
| Model | exec p@1 | exec p@2 | exec p@3 | review p@1 | review p@2 | review p@3 | Cost / correct |
|---|---|---|---|---|---|---|---|
| gemini-3.1-pro | 100 | 100 | 100 | 100 | 100 | 100 | $0.048 |
| claude-haiku-4.5 | 100 | 100 | 100 | 100 | 100 | 100 | $0.016 |
| minimax-m3 | 98 | 100 | 100 | 100 | 100 | 100 | $0.007 |
| claude-sonnet-4.6 | 100 | 100 | 100 | 98 | 100 | 100 | $0.038 |
| gpt-5.5 | 100 | 100 | 100 | 98 | 100 | 100 | $0.031 |
| glm-5.1 | 92 | 100 | 100 | 98 | 100 | 100 | $0.015 |
| deepseek-v4-pro | 100 | 100 | 100 | 100 | 100 | 100 | $0.017 |
| kimi-k2.6 | 100 | 100 | 100 | 100 | 100 | 100 | $0.017 |
| qwen3-coder-next | 100 | 100 | 100 | 92 | 95 | 95 | $0.005 |
p@k = share correct within k passes (% correct). exec and review are separate single-sample runs, so their pass@1 differs by noise; the convergence by pass@2 is the signal. Only qwen3-coder-next plateaus under self-review — it can't always self-diagnose its last bug without running the code. Everyone else reaches 100% with a single revision, oracle or not.
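Under the definition used here (share of tasks correct within k passes, which differs from the sampling-based pass@k estimator used in some other benchmarks), the metric can be computed from the pass on which each task first succeeded. The function name and the example data are hypothetical.

```python
from typing import Optional, Sequence


def pass_within_k(first_success: Sequence[Optional[int]], k: int) -> float:
    """Percent of tasks whose first correct attempt came on pass k or earlier.

    Each entry is the 1-based pass on which the task was first correct,
    or None if it never became correct.
    """
    solved = sum(1 for p in first_success if p is not None and p <= k)
    return 100.0 * solved / len(first_success)


# Made-up outcomes for four tasks: two solved at once, one after a revision, one never.
outcomes = [1, 1, 2, None]
print([pass_within_k(outcomes, k) for k in (1, 2, 3)])  # [50.0, 75.0, 75.0]
```

A model that plateaus, as in the made-up data above, shows the same value at p@2 and p@3: further passes do not help once it can no longer diagnose the remaining failure.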
The value frontier
Put the two results together. A cheap model with one repair pass (minimax-m3 at ~$0.007, qwen at ~$0.005 per correct program) lands at the same correctness as frontier models that cost 5–10× more. The one caveat is qwen, which needs execution feedback to get there; under self-review alone it plateaus at 95%. For work that is generated on the fly and thrown away, paying frontier prices per call buys almost nothing; the cheap-model-plus-repair loop is the efficient design.
Results · difficulty
Where the models actually break
Aggregate scores hide the shape of the difficulty. File and data tasks are effectively free points for nearly everyone; kimi-k2.6, at 88%, is the one exception. Almost all of the remaining separation is in credential handling, specifically mutual-TLS certificates and OAuth flows, the exact places real integrations get hard.
| Model | File / data | Bug-fix | API (all) | Mutual TLS | OAuth2 |
|---|---|---|---|---|---|
| gemini-3.1-pro | 100 | 100 | 100 | 100 | 100 |
| claude-haiku-4.5 | 100 | 100 | 100 | 100 | 100 |
| minimax-m3 | 100 | 100 | 100 | 100 | 100 |
| claude-sonnet-4.6 | 100 | 100 | 100 | 100 | 100 |
| gpt-5.5 | 100 | 100 | 99 | 93 | 100 |
| glm-5.1 | 100 | 94 | 95 | 93 | 92 |
| deepseek-v4-pro | 100 | 100 | 93 | 73 | 100 |
| qwen3-coder-next | 100 | 97 | 89 | 73 | 75 |
| kimi-k2.6 | 88 | 100 | 93 | 80 | 100 |
| claude-fable-5 | 100 | 100 | 92 | 80 | 75 |
Single-shot correctness by task family (%). The two most common real failures: passing certificate contents where a library expects file paths, and importing a PKCS#12 helper without importing its submodule. Both are the kind of mistake a single execution-feedback pass fixes immediately.
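The first failure mode is easy to reproduce in Python, where `ssl.SSLContext.load_cert_chain` takes file paths rather than PEM text. A minimal sketch of one way around it, assuming the credentials arrive as strings: write them to short-lived files, load them, and clean up. The helper names `pem_files` and `client_context` are hypothetical, and this illustrates the pattern rather than the fix any benchmarked model produced.

```python
import os
import ssl
import tempfile
from contextlib import contextmanager


@contextmanager
def pem_files(cert_pem: str, key_pem: str):
    """Write PEM contents to temporary files and yield their paths."""
    paths = []
    try:
        for content in (cert_pem, key_pem):
            fd, path = tempfile.mkstemp(suffix=".pem")  # created readable by owner only
            paths.append(path)
            with os.fdopen(fd, "w") as handle:
                handle.write(content)
        yield tuple(paths)
    finally:
        for path in paths:
            os.remove(path)  # credentials do not outlive the block


def client_context(cert_pem: str, key_pem: str) -> ssl.SSLContext:
    """Build a client-side TLS context from in-memory PEM strings."""
    ctx = ssl.create_default_context()
    with pem_files(cert_pem, key_pem) as (cert_path, key_path):
        # load_cert_chain expects paths; passing the PEM text itself fails.
        ctx.load_cert_chain(certfile=cert_path, keyfile=key_path)
    return ctx
```

The certificate and key are read into the context when `load_cert_chain` is called, so removing the temporary files afterwards is safe.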
What we learned
Findings
Correctness is saturating
On sub-1000-line tasks, four of ten models were flawless single-shot and the other six missed by less than seven percentage points. The frontier of difficulty for this work has moved from "can it" to "at what cost and reliability."
Refinement beats raw capability
One repair pass lifts every model in the refinement study to 100% with execution feedback, and all but one of them with self-review alone. Cheap-model-plus-loop dominates expensive-model-single-shot on both cost and reliability.
Auth is the real test
File and data work is trivial for nearly every model. Mutual TLS and OAuth separate the field: four models drop to 73–80% on client certificates, and two drop to 75% on OAuth2. If you benchmark on toy tasks, every model looks equal.
Robustness ≠ correctness
The reviewer rewarded defensive, well-guarded code; the terse-but-correct models scored lower on that axis despite perfect functional results. Correctness and code quality are genuinely separate signals.
Price spans 30×
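Total spend on the same 40 tasks ranged from $0.60 on the cheapest model to $6.61 on the priciest, about 11×; cost per correct answer ranged from $0.005 to $0.174, roughly 30×. The cheap end gave up little or nothing in correctness: minimax-m3 was perfect at $0.82. Per-correct cost, not raw capability, is the metric that should drive model choice here.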
Watch the safety filter
One model's content filter refused two credential-and-payment tasks outright. For automation that legitimately handles secrets and money, a model that declines the work is a hard constraint, not a footnote.
Takeaway
Generate on the fly, cheaply
For ephemeral code — the small, single-purpose programs that connect systems for one run — the economics have flipped. The code is no longer worth keeping; the prompt that produces it is.
The benchmark makes the practical case concrete. Reliability on this class of work is high across every top model and effectively perfect once you add a single automatic repair pass. That repair pass works even without the ability to run the code, so it fits inside a generation pipeline. And the cost difference between the frontier and the value tier is an order of magnitude with no correctness penalty.
So the efficient architecture is not a library of maintained utilities, nor a frontier model called once and trusted blindly. It is a compact micro-prompt, a cheap capable model, and a self-review loop — code summoned when needed, verified, and discarded. You maintain intent. The model maintains everything else.
Where this points
Extend the idea and the tooling itself changes shape. Version control and build systems treat code as the source of truth — but if code is ephemeral, the source of truth moves up a level, to the prompts and the project's current state. The precedent already exists: database migration frameworks don't store the database, they store the ordered instructions that build it plus a marker for where you are in the sequence. The database is derived; the migrations are the asset.
It is reasonable to expect coding frameworks built on the same principle — ones that version a project as a sequence of micro-prompts and a resolved state, and regenerate the actual code on demand from them. Code becomes a build artifact, like a compiled binary is today. The hard, durable engineering shifts to how those prompts are captured, ordered, replayed, and stored — that is where the value settles once the code stops being worth keeping.
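The migration analogy can be made concrete with a very small data structure: an ordered list of micro-prompts plus a marker for how many have been applied. This is a speculative sketch of the idea only; `PromptLedger` and its methods are hypothetical and do not describe any existing tool.

```python
from dataclasses import dataclass, field


@dataclass
class PromptLedger:
    """Hypothetical analogue of a migration table for micro-prompts."""

    prompts: list[str] = field(default_factory=list)  # ordered, append-only
    applied: int = 0  # marker: how many prompts the current state reflects

    def add(self, prompt: str) -> None:
        self.prompts.append(prompt)

    def pending(self) -> list[str]:
        # Prompts that still need code generated and run, in order.
        return self.prompts[self.applied:]

    def mark_applied(self) -> None:
        self.applied = len(self.prompts)


ledger = PromptLedger()
ledger.add("Create a script that exports orders to CSV.")
ledger.add("Add a column with the order total in EUR.")
ledger.mark_applied()
ledger.add("Skip orders flagged as test data.")
print(ledger.pending())  # ['Skip orders flagged as test data.']
```

What gets versioned is the ledger; the code generated for each pending prompt is rebuilt on demand and never checked in.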