GPT-5 Benchmarking for Agentic Coding Tasks: Loses to GPT-4.1

Sam Farid
CTO/Founder

TL;DR: We benchmarked GPT-5 on our agentic coding tasks and found it falls short on both quality and speed, so we're sticking with GPT-4.1 for now.
The initial GPT-5 backlash was mostly about response tone and the forced model selection in ChatGPT.
However, take a data-driven approach and there are plenty of measurable reasons to be unimpressed. We used our self-built model testing pipeline to put GPT-5 and GPT-5-mini through their paces, and found specific reasons why we're sticking with GPT-4.1 for most agentic coding tasks for the foreseeable future.
Methodology
Flightcrew uses OpenAI models for a variety of tightly-scoped tasks that fall into three major categories:
- File Indexing: discovering context via connections between config files
- Code Generation: carrying out config changes in multi-language IaC repos
- Change Explanation: describing methodology and justification for each change
If you're interested in specifics, we gave a talk on this at GitHub HQ.
To ensure the quality and consistency of our agent, we have an internal testing pipeline that we use to monitor performance and test model upgrades. This test suite is good for evaluating models because these test cases:
- Vary in complexity and function, and therefore should be applicable to many coding-related tasks we all ask LLMs to perform
- Are scored on a mixture of deterministic validation (e.g. string matching), validation prompts (other LLMs verifying output), and human-generated ground truth (see the sketch at the end of this section).
- Were initially written when we were using GPT-4o and o1-mini, so they should not be biased towards either of the current model options in this evaluation, nor towards reasoning vs. non-reasoning models.
- Consist of particularly challenging cases that the models previously failed, so hitting 100% correctness is extremely rare and therefore highlights tradeoffs between models.
And to note the limitations, these tests do not cover:
- Interpretation of human intent. Our agent takes instructions from our internal statistics-based recommenders, so there's little room for interpretation.
- Long (3+ minute) agentic tasks where the model is given free rein to reason and perform multiple steps
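To make the scoring approach concrete, here's a minimal sketch of a three-way harness along those lines. The names and structure are hypothetical, not our actual pipeline, and the LLM judge is abstracted behind a callable:

```python
# Hypothetical scoring harness: each test case declares which validators apply.
from __future__ import annotations
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class TestCase:
    name: str
    prompt: str
    expected_substrings: list[str] = field(default_factory=list)  # deterministic checks
    ground_truth: str | None = None   # human-written reference output
    judge_prompt: str | None = None   # validation prompt for an LLM judge

def score_case(case: TestCase, output: str, ask_judge: Callable[[str], bool]) -> dict:
    """Score one model output against whichever validators the case declares."""
    scores: dict[str, bool] = {}
    if case.expected_substrings:
        scores["deterministic"] = all(s in output for s in case.expected_substrings)
    if case.ground_truth is not None:
        scores["ground_truth"] = output.strip() == case.ground_truth.strip()
    if case.judge_prompt is not None:
        # The judge model sees the validation prompt plus the candidate output
        scores["judge"] = ask_judge(f"{case.judge_prompt}\n\nCandidate output:\n{output}")
    scores["passed"] = bool(scores) and all(scores.values())
    return scores
```

The mix matters: deterministic checks keep scoring cheap and reproducible, while the judge and ground-truth paths catch outputs that are technically different but semantically fine (or vice versa).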
Results
Across 50 file indexing tests, 29 code generation tests, and 41 change explanation tests, we found that GPT-5 not only falters on the most complex code generation tasks, it's also much slower across the board:
For simplicity, we're only showing the top-performing GPT-5 setup, which we arrived at after following the prompting guide and tuning reasoning_effort. We're also including all test cases, whether they call the full-sized or the mini models.
And the average time of completion for each suite:
Common Failure Patterns for GPT-5
Digging into the test failures, we found these common patterns that GPT-5 struggled with:
- Retaining information in multi-round prompts with large context windows. For example, when carrying out a code change, GPT-5 would often ask for context files we had already passed in, or hallucinate file paths that were not in the given list.
- Assuming the worst. This is hard to quantify, but GPT-5 would often bail out rather than just carry out its given task. These example snippets show it being overly skeptical:
> The potential issue is not an extra unrelated change but an omission: leaving GenerateKedaTrigger(..., 20) unchanged could leave an unintended 20 value used somewhere...
> The new module directory and files are consistent with Terraform style, but there is no evidence that the variable names used by the module match the repository-level variables in the existing configs...
- Speed. Even with reasoning_effort set to low where possible, GPT-5 is so much slower that we needed to account for potential page timeouts and cronjob deadlines. It's unclear if this is due to the reasoning process or OpenAI's backend (or both). A sketch of these guardrails follows below.
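To make the last two patterns concrete, here's a minimal sketch of the kind of guardrails they push you toward: low reasoning effort, a hard per-request timeout, and a cheap check for hallucinated file paths. It assumes the current OpenAI Python SDK; the helper name, regex, and timeout value are illustrative rather than production code:

```python
# Hypothetical guardrails around a GPT-5 code-generation call.
import re
from openai import OpenAI, APITimeoutError

client = OpenAI()

def generate_change(instructions: str, context_files: dict[str, str]) -> str | None:
    """Ask for a config change, bounding latency and flagging hallucinated paths."""
    file_list = "\n".join(context_files)
    messages = [
        {"role": "system", "content": "Only reference files from the provided list."},
        {"role": "user", "content": f"Files:\n{file_list}\n\nTask:\n{instructions}"},
    ]
    try:
        resp = client.chat.completions.create(
            model="gpt-5",
            messages=messages,
            reasoning_effort="low",  # still slower than GPT-4.1 in our tests
            timeout=60,              # hard bound so pages and cronjobs don't hang
        )
    except APITimeoutError:
        return None  # caller retries with a smaller prompt or falls back

    output = resp.choices[0].message.content or ""
    # Cheap hallucination check: any path-looking token not in the provided list
    referenced = set(re.findall(r"[\w./-]+\.(?:tf|ya?ml|json)", output))
    unknown = referenced - set(context_files)
    if unknown:
        print(f"Model referenced files not in context: {sorted(unknown)}")
    return output
```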
Takeaways
GPT-4.1 beats GPT-5 in every dimension we care about: quality, speed, and predictability. And the prompt tuning required just to bring GPT-5 up to par means it loses on engineering time spent as well.
So our recommendation is: right now, GPT-4.1 is the best choice for your agent. And perhaps this should have been expected, since GPT-4.1 is still the more expensive of the two.
Again, this is a tightly-scoped A/B test of two models and won't apply to every use case. We feel confident we'll find use cases where upgrading to GPT-5 makes sense (its performance on the validation prompts is quite promising), but for us that time is not now.
If you're building AI-powered developer tools, we're always interested to chat and share approaches. Reach out to us at hello@flightcrew.io.

Sam Farid
CTO/Founder
Before founding Flightcrew, Sam was a tech lead at Google, ensuring the integrity of YouTube view counts and then advancing network throughput and isolation at Google Cloud Serverless. A Dartmouth College graduate, he began his career at Index (acquired by Stripe), where he wrote foundational infrastructure code that still powers Stripe servers. Find him on Bluesky or at holosam.dev.