We tested Claude Opus 4.5 vs GPT-5.2 Max on a real RBAC implementation with strict TDD constraints. One completed the task. One didn't.
Most teams didn't "reject" TDD. They just stopped paying the tax.
TDD is great when it's done properly: tight feedback loops, clear contracts, refactoring with confidence. But in real delivery environments, a few things happen predictably. Deadlines show up. Requirements wobble. Engineers ship code and "will add tests after". The test suite becomes a mix of high-signal checks and legacy noise. Eventually TDD feels like process, not leverage.
So TDD became aspirational. People still like the idea. Fewer teams consistently run it.
The Experiment: RBAC Implementation
To understand how AI changes this, we ran an experiment. We took a realistic scenario — not a toy problem — because toy problems flatter models.
The starting app had only basic auth: "admin" and "user". We introduced RBAC with org-scoped custom roles and a permission matrix. We required backwards compatibility, because that's where most RBAC changes die in the real world: migrations and implicit assumptions in code.
We gave both Claude Opus 4.5 and GPT-5.2 Max the same objective, same repo, same markdown contract, same coding rules. The comparison wasn't "who writes prettier code in a vacuum". It was: who can take a non-trivial change, respect constraints, and converge to green without supervision.
That's the bar that matters if you want AI to carry long-running engineering tasks.
What AI Changes About the TDD Loop
AI doesn't magically make design decisions better. What it does change is the economics of iteration.
If you can keep a model inside the red–green loop, it will grind through the boring parts: wiring, edge cases, refactors, updating call sites, rerunning tests, fixing regressions, repeating. Humans can do this too, but it's cognitively expensive and easy to abandon halfway through.
The key isn't "prompting". The key is making success mechanically checkable and then forcing the model to stay until it's true.
Writing Tests as the Source of Truth
Stop writing "requirements" as prose and instead write them as a contract the model can't wriggle out of. We used a single markdown doc that says:
- what the feature is
- what cannot break
- what must be explicit
- what counts as "done"
- and the termination rule
This is not a PRD. It's closer to a build spec. Here's a high-level version of the RBAC spec we used (trimmed to the essentials):
```markdown
# Feature: RBAC Expansion

## Context
App currently has basic auth only:
- admin
- user

## Objective
Introduce real role-based access control (RBAC) without breaking existing users.

## Scope
- Custom roles per org
- Permission matrix: read / write / admin
- Backwards compatibility for existing admin/user accounts
- Migration strategy that preserves access
- Admin UI to manage org roles + assignments

## Testable outputs
1) Permission matrix is explicit and queryable (DB or JSON)
2) No existing user loses access after migration
3) Attempts to exceed permissions are blocked (API + UI)
4) Admin UI can:
   - create/edit roles within org
   - assign users to roles
   - preview effective permissions

## Hard checks
- Snapshot tests for permission decisions (deterministic)
- A misconfigured role must not escalate privileges (deny-by-default + no implicit inheritance)

## Termination rule
Do not stop. Do not respond with partial completion. Run tests repeatedly and continue until all tests pass.
```
This is the shape that matters: the spec is short, but every line is testable.
The "Do Not Exit Until Tests Pass" Constraint
This constraint sounds obvious, but it's where most AI-assisted coding falls apart.
If you let the model stop when it "thinks it's done", you get plausible code and a pile of untested assumptions. If you explicitly disallow exit until green, you convert the workflow into a deterministic loop: run the suite, read the failures, change the code, run again, and only stop when everything passes.
That loop is basically what good engineers do — but now the model is the one doing the grinding.
In practice, you want wording that prevents the classic failure modes: stopping early, hand-waving a failing test, or "I can't run tests here". Your harness should be runnable in the environment the model is operating in, and your instruction should be unambiguous.
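Mechanically, the enforcement can be as simple as a wrapper that refuses to report success until the suite is green. The sketch below is illustrative, not the harness from the experiment; the `npm test` command and the attempt cap are assumptions.

```ts
// Minimal sketch of a "do not exit until green" wrapper. Assumes a Node project
// where `npm test` runs the full suite. The model (or engineer) edits code between
// iterations; this wrapper only decides when "done" is allowed.
import { spawnSync } from "node:child_process";

function suiteIsGreen(): boolean {
  const result = spawnSync("npm", ["test", "--silent"], { stdio: "inherit" });
  return result.status === 0;
}

let attempts = 0;
while (!suiteIsGreen()) {
  attempts += 1;
  console.log(`Suite still red after run ${attempts}; keep iterating.`);
  if (attempts >= 50) {
    // Hypothetical safety valve so a stuck run surfaces to a human instead of looping forever.
    throw new Error("Suite never converged; needs human review.");
  }
}
console.log("Suite is green. Only now is the task allowed to be 'done'.");
```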
The Test Harness
The harness asserted behavior across three layers:

- DB state: after migration, no existing user loses access
- API: authorization returns the correct status codes per role (sketched below)
- UI: no unauthorized actions are exposed
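For a sense of what the API layer looked like, here is a trimmed sketch. It assumes an Express-style `app` export, supertest, and a hypothetical `tokenFor` helper that mints a session for a given role; the `/api/projects` route is illustrative, not the experiment's actual endpoint.

```ts
// Sketch of API-layer authorization checks (supertest + Jest). The app export,
// tokenFor helper, and /api/projects route are assumptions for illustration.
import request from "supertest";
import { app } from "../app";
import { tokenFor } from "./helpers/auth";

test("viewer cannot write projects", async () => {
  const res = await request(app)
    .post("/api/projects")
    .set("Authorization", `Bearer ${await tokenFor("viewer")}`)
    .send({ name: "should be rejected" });
  expect(res.status).toBe(403); // blocked, not silently ignored
});

test("editor can write projects", async () => {
  const res = await request(app)
    .post("/api/projects")
    .set("Authorization", `Bearer ${await tokenFor("editor")}`)
    .send({ name: "allowed" });
  expect(res.status).toBe(201);
});
```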
And we pinned the tricky requirement: one misconfigured role must not escalate privileges. That means deny-by-default, explicit permission evaluation, and no "helpful" fallbacks.
To make permission decisions testable, we required a single permission-evaluation function with snapshot tests. The goal: if permission logic changes, the diff is visible and reviewable.
```ts
import { decide } from "../auth/decide";

test("permission decisions are stable", () => {
  const cases = [
    { role: "viewer", action: "project:read", expected: true },
    { role: "viewer", action: "project:write", expected: false },
    { role: "editor", action: "project:write", expected: true },
    { role: "misconfigured", action: "org:admin", expected: false },
  ];
  for (const c of cases) {
    expect(decide({ role: c.role, action: c.action })).toBe(c.expected);
  }
});
```

The real suite was broader (API + UI + migration), but the point is the same: permission logic is explicit and test-driven.
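For context, a deny-by-default evaluator of the kind the spec demanded can be very small. This is a sketch under those constraints, not the implementation the models produced; the matrix contents are illustrative.

```ts
// Sketch of a deny-by-default decide(): every permission is listed explicitly,
// and anything not listed (including unknown or misconfigured roles) is denied.
// The matrix here is illustrative, not the one generated in the experiment.
type Action = string;

const PERMISSION_MATRIX: Record<string, ReadonlySet<Action>> = {
  viewer: new Set(["project:read"]),
  editor: new Set(["project:read", "project:write"]),
  admin: new Set(["project:read", "project:write", "org:admin"]),
};

export function decide({ role, action }: { role: string; action: Action }): boolean {
  // No implicit inheritance, no "helpful" fallbacks: unknown roles fall through to deny.
  const allowed = PERMISSION_MATRIX[role];
  return allowed ? allowed.has(action) : false;
}
```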
Results
Claude Opus 4.5: completed successfully in ~40 minutes, at a cost of ~$40.

- Stayed in the loop continuously
- Full suite passed: DB, API, UI
- Migration preserved access
- Permission matrix explicit

GPT-5.2 Max: did not complete, across two attempts of ~15 minutes each, ~$30 total.

- Exited while tests were failing
- UI flows didn't match role behavior
- API checks inconsistent
- Org-scoped roles incomplete
What Broke (and Why)
The failure modes were not subtle. They're the ones you've probably already seen if you've tried to use models for serious work.
GPT-5.2 Max had four recurring issues:
It quit
Not always, but often enough that you can't rely on it for long-running work unless you're supervising. The termination instruction helped, but didn't fully constrain it.
It wrote broken code and relied on tests to notice
That sounds fine (that's TDD), except it then struggled to converge cleanly. You'd see fixes that created new failures elsewhere. It was playing whack-a-mole.
It drifted from coding rules
Even with explicit constraints, it regularly fell back to `any`/`unknown` when told not to, and produced large, sprawling files even when asked to keep modules small.
It dropped context
It would read a file, incorporate it, then behave later as if it never existed — which is fatal for cross-cutting changes like RBAC.
Opus 4.5 wasn't perfect, but it was materially better on the things that matter for this style of work: it stayed in the loop, followed constraints more consistently, and held the necessary context across domains.
The trade-off is cost. Opus got it right, but it was expensive by model standards.
What This Means for Your Teams
If you've avoided AI for "real engineering" because it can't hold a long thread, this is the first time the answer is: it depends on the model, but also on your discipline.
The winning pattern is not "ask the model to implement RBAC".
The winning pattern is:

- write the requirements as a testable contract, not prose
- make success mechanically checkable with a harness that covers DB, API, and UI
- give an unambiguous termination rule: do not exit until the suite is green
With that setup, you can hand the model a long-running task and get back something that's actually verifiable. And yes, the compute can look expensive — $40 for one run — until you compare it to a day or two of a strong engineer doing a careful RBAC migration with UI + API + DB coverage. That's not a cheap change. You're paying for concentrated iteration and test-driven convergence.
The return of TDD isn't philosophical. It's practical. AI makes the red–green loop cheap enough that you can afford to make tests the center of the workflow again — and, crucially, you can enforce that workflow mechanically.
If your team wants to use AI beyond "draft a function", this is the line to draw: no tests, no trust. Tests first, and don't let it exit early.
