
AI Makes Senior Developers Slower. The Study That Shouldn't Surprise You.



METR recruited 16 experienced contributors to large open-source repositories — repos averaging 22k+ stars and over a million lines of code. These are not beginners picking up Copilot for the first time. The study gave them full access to any AI tools they wanted, primarily Cursor Pro with Claude 3.5/3.7 Sonnet. The result: AI increased task completion time by 19%.

Before the study, developers predicted AI would cut their time by 24%. After completing their tasks, they estimated AI had saved them around 20%. The actual measured result was the opposite.

The developers were wrong about their own productivity. They felt faster. They were slower.

The measurement that matters

METR didn't study beginners in tutorial repos. They studied experienced engineers doing real work — bug fixes, features, refactors — in codebases they had contributed to for years. This is the population that claims to benefit most from AI coding tools. This is the population that showed a 19% slowdown.

The reason isn't mysterious. Less than 44% of AI suggestions were accepted. The rest required review, rejection, and cleanup. In a million-line codebase with ten years of accumulated design decisions, AI has no real context. It produces plausible-looking code that violates constraints it cannot see. The developer has to read it, work out why it doesn't fit, and either repair it or throw it away. That's not assistance. That's a second reviewer who never read the spec.

Net negative.

What senior developers actually do

The narrative around AI productivity assumes that developer work is primarily token generation — that the bottleneck is the speed of writing code, and AI removes that bottleneck. This is accurate for a certain class of tasks: boilerplate, known patterns, documented APIs, greenfield features in well-structured repos.

But experienced engineers in mature codebases are not bottlenecked by token generation. They're doing something else: navigating unfamiliar constraint intersections, resolving conflicts between subsystems with incompatible assumptions, making tradeoffs that won't create untraceable bugs eighteen months from now.

That work is judgment-intensive. It requires context that lives in the engineer's head, built over years of watching systems fail in specific ways. AI cannot compress that context. It has no representation of the implicit conventions, the architectural decisions that weren't documented because they seemed obvious at the time, the failure modes that were fixed quietly and then forgotten.

When you apply AI to judgment-intensive work, you don't get speed. You get confident-sounding suggestions that don't fit, and you spend time cleaning them up.

I maintain this distinction deliberately

I write the skeleton and interfaces manually. I let AI fill the interiors of patterns I've already understood. That's not a craft preference — it's a structural decision.

The parts where AI adds speed are not the parts where my judgment is load-bearing. When I'm generating implementation that fits a pattern I fully understand, AI is genuinely fast and mostly correct. That's useful. But when I'm designing the interface, deciding what the system should and shouldn't do, figuring out where future debt will accumulate — AI has nothing to offer there. And if I try to use it, I spend more time correcting it than I would have spent writing from scratch.
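A hypothetical sketch of what this split looks like in practice (the names and the contract below are invented for illustration, not taken from any real project): the hand-written skeleton encodes the judgment calls, such as what the interface refuses to do, while the interior is a pattern simple enough that generated code is easy to verify.

```python
from abc import ABC, abstractmethod

# Hand-written skeleton: the judgment lives here — what the system
# exposes, what it deliberately refuses to do, where errors surface.
class RateStore(ABC):
    @abstractmethod
    def get_rate(self, currency: str) -> float:
        """Return the cached rate; raise KeyError if unknown.

        Deliberately no network fallback — callers must handle misses.
        That constraint is a design decision, not something a
        pattern-completion tool could infer from the surrounding code.
        """

# AI-fillable interior: a dict-backed cache is a pattern the author
# already understands completely, so any generated body is cheap to
# review against the contract above.
class InMemoryRateStore(RateStore):
    def __init__(self) -> None:
        self._rates: dict[str, float] = {}

    def set_rate(self, currency: str, rate: float) -> None:
        self._rates[currency] = rate

    def get_rate(self, currency: str) -> float:
        # KeyError on a miss, exactly as the interface promises.
        return self._rates[currency]
```

The point of the split: reviewing the interior against a contract I wrote myself is fast; reviewing AI output against constraints that exist only in my head is where the 19% goes.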

This distinction requires honest self-assessment in the moment. Is this a pattern I understand completely? Or am I navigating unfamiliar territory where the constraints aren't fully explicit? That's not always easy to answer. The METR study documented the failure mode precisely: developers couldn't tell in the moment whether they were doing median work or judgment work. Their confidence was inverted. They believed they were faster.

That's the more significant finding. Not the 19% slowdown — the wrong self-assessment.

What AI coding tools are actually calibrated for

They're calibrated for the median task in the median codebase. New files, clear requirements, well-documented libraries. That's most of what most developers do most of the time, which is why adoption is real and the reported productivity gains are not fabricated — they're measuring a real effect on a real subset of work.

The problem is that the tasks determining whether a system survives its fifth year are not median tasks. Architectural decisions, debugging failures that cross subsystem boundaries, resolving conflicting constraints imposed by different teams at different times — these require context AI cannot have. When developers apply AI-grade confidence to these tasks, they're not moving faster. They're accumulating invisible debt.

AI-generated code is working code. It is not necessarily comprehended code. I treat these as separate properties. Code that works but that I don't understand is a liability. When the system fails in some unexpected way, I need to be able to reason from structure. If large parts of the codebase were filled by a pattern-completion tool I applied without full understanding, that structure doesn't exist in my head. The failure becomes untraceable.

This is what the METR study was measuring, indirectly: the cost of misapplied tools. Not AI failure in the abstract. A specific mismatch between what AI is good at and what experienced developers in complex codebases need.

The practical conclusion

The 19% slowdown is not an argument against AI coding tools. It's an argument for knowing when not to use them.

For experienced developers: the work that defines your value is exactly the work where AI makes you slower. Apply it to the rest. Write the hard parts yourself — not because it's more authentic, but because you need to understand them, and writing is how understanding gets built.

Speed means the velocity of hypothesis validation, not code generation. The two are almost unrelated. The METR result is evidence of that, measured directly on the people most convinced they had already figured it out.