Anthropic ran an experiment: give Claude a real C compiler codebase and see what happens. On straightforward tasks, such as bug fixes, refactors, and features that follow existing patterns, it performed well. Then it hit a wall. Tasks that required holding the full architecture in mind, reasoning about second-order effects, or making judgment calls with incomplete information: those didn't go as well.
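What does a second-order effect look like in practice? Here's a hypothetical C sketch, mine rather than anything from Anthropic's experiment: a fix that is locally correct but silently breaks callers the diff never touches.

```c
/* Hypothetical illustration of a second-order effect (not from the
 * Anthropic experiment). A symbol table grows on demand; callers
 * elsewhere in the codebase hold pointers into it. */
#include <stdlib.h>
#include <string.h>

typedef struct { char name[32]; int type; } Symbol;

static Symbol *table = NULL;
static size_t count = 0, capacity = 0;

Symbol *add_symbol(const char *name, int type) {
    if (count == capacity) {
        /* The "obvious" local fix: grow the table when it fills up.
         * First-order effect: no more overflow. Second-order effect:
         * realloc may move the whole block, invalidating every Symbol*
         * that a caller saved from an earlier add_symbol() call. */
        size_t new_cap = capacity ? capacity * 2 : 16;
        Symbol *grown = realloc(table, new_cap * sizeof(Symbol));
        if (!grown) return NULL;  /* out of memory */
        table = grown;
        capacity = new_cap;
    }
    strncpy(table[count].name, name, sizeof table[count].name - 1);
    table[count].name[sizeof table[count].name - 1] = '\0';
    table[count].type = type;
    return &table[count++];
}
```

Every changed line looks right in isolation, so a patch like this sails through review; the dangling pointers only surface at runtime, in code the diff never mentioned.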
This matters more than any speed benchmark. At Zendrop, our developers don't write code anymore — they supervise, orchestrate, and review while AI handles generation. That's a real shift. But when your team trusts AI output without understanding what it can't do, you get code that passes review and fails in production.
Don't measure AI success by the percentage of code it generates. Measure it by what your senior engineers are now free to do. If they're spending more time on architecture and system design, you're doing it right. If they're just reviewing more PRs, you have a process problem, not an AI win. The ceiling isn't a limitation to be frustrated by. It's the clearest signal for where to invest your human talent.