We tested Claude Opus 4.5 vs GPT-5.2 Max on a real RBAC implementation with strict TDD constraints. One completed the task. One didn't.
Most teams didn't "reject" TDD. They just stopped paying the tax.
TDD is great when it's done properly: tight feedback loops, clear contracts, refactoring with confidence. But in real delivery environments, a few things happen predictably. Deadlines show up. Requirements wobble. Engineers ship code and "will add tests after". The test suite becomes a mix of high-signal checks and legacy noise. Eventually TDD feels like process, not leverage.
So TDD became aspirational. People still like the idea. Fewer teams consistently run it.
The Experiment: RBAC Implementation
To understand how AI changes this, we ran an experiment. We took a realistic scenario — not a toy problem — because toy problems flatter models.
The starting app had only basic auth: "admin" and "user". We introduced RBAC with org-scoped custom roles and a permission matrix. We required backwards compatibility, because that's where most RBAC changes die in the real world: migrations and implicit assumptions in code.
We gave both Claude Opus 4.5 and GPT-5.2 Max the same objective, same repo, same markdown contract, same coding rules. The comparison wasn't "who writes prettier code in a vacuum". It was: who can take a non-trivial change, respect constraints, and converge to green without supervision.
That's the bar that matters if you want AI to carry long-running engineering tasks.
What AI Changes About the TDD Loop
AI doesn't magically make design decisions better. What it does change is the economics of iteration.
If you can keep a model inside the red–green loop, it will grind through the boring parts: wiring, edge cases, refactors, updating call sites, rerunning tests, fixing regressions, repeating. Humans can do this too, but it's cognitively expensive and easy to abandon halfway through.
The key isn't "prompting". The key is making success mechanically checkable and then forcing the model to stay until it's true.
Writing Tests as the Source of Truth
Stop writing "requirements" as prose and instead write them as a contract the model can't wriggle out of. We used a single markdown doc that says:
- what the feature is
- what cannot break
- what must be explicit
- what counts as "done"
- and the termination rule
This is not a PRD. It's closer to a build spec. Here's a high-level version of the RBAC spec we used (trimmed to the essentials):
```markdown
# Feature: RBAC Expansion

## Context
App currently has basic auth only:
- admin
- user

## Objective
Introduce real role-based access control (RBAC) without breaking existing users.

## Scope
- Custom roles per org
- Permission matrix: read / write / admin
- Backwards compatibility for existing admin/user accounts
- Migration strategy that preserves access
- Admin UI to manage org roles + assignments

## Testable outputs
1) Permission matrix is explicit and queryable (DB or JSON)
2) No existing user loses access after migration
3) Attempts to exceed permissions are blocked (API + UI)
4) Admin UI can:
   - create/edit roles within org
   - assign users to roles
   - preview effective permissions

## Hard checks
- Snapshot tests for permission decisions (deterministic)
- A misconfigured role must not escalate privileges (deny-by-default + no implicit inheritance)

## Termination rule
Do not stop. Do not respond with partial completion. Run tests repeatedly and continue until all tests pass.
```
This is the shape that matters: the spec is short, but every line is testable.
The "Do Not Exit Until Tests Pass" Constraint
This constraint sounds obvious, but it's where most AI-assisted coding falls apart.
If you let the model stop when it "thinks it's done", you get plausible code and a pile of untested assumptions. If you explicitly disallow exit until green, you convert the workflow into a deterministic loop: run the suite, read the failures, change the code, run again, and only stop when everything passes.
That loop is basically what good engineers do — but now the model is the one doing the grinding.
In practice, you want wording that prevents the classic failure modes: stopping early, hand-waving a failing test, or "I can't run tests here". Your harness should be runnable in the environment the model is operating in, and your instruction should be unambiguous.
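Mechanically, the enforcement can be as simple as a wrapper that refuses to report success until the suite is green. The sketch below is illustrative, not the harness from the experiment; the `npm test` command and the attempt cap are assumptions.

```ts
// Minimal sketch of a "do not exit until green" wrapper. Assumes a Node project
// where `npm test` runs the full suite. The model (or engineer) edits code between
// iterations; this wrapper only decides when "done" is allowed.
import { spawnSync } from "node:child_process";

function suiteIsGreen(): boolean {
  const result = spawnSync("npm", ["test", "--silent"], { stdio: "inherit" });
  return result.status === 0;
}

let attempts = 0;
while (!suiteIsGreen()) {
  attempts += 1;
  console.log(`Suite still red after run ${attempts}; keep iterating.`);
  if (attempts >= 50) {
    // Hypothetical safety valve so a stuck run surfaces to a human instead of looping forever.
    throw new Error("Suite never converged; needs human review.");
  }
}
console.log("Suite is green. Only now is the task allowed to be 'done'.");
```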
The Test Harness
The harness asserted behavior across three layers:

- DB state: after migration, no existing user loses access
- API: authorization returns the correct status codes per role (sketched below)
- UI: no unauthorized actions are exposed
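For a sense of what the API layer looked like, here is a trimmed sketch. It assumes an Express-style `app` export, supertest, and a hypothetical `tokenFor` helper that mints a session for a given role; the `/api/projects` route is illustrative, not the experiment's actual endpoint.

```ts
// Sketch of API-layer authorization checks (supertest + Jest). The app export,
// tokenFor helper, and /api/projects route are assumptions for illustration.
import request from "supertest";
import { app } from "../app";
import { tokenFor } from "./helpers/auth";

test("viewer cannot write projects", async () => {
  const res = await request(app)
    .post("/api/projects")
    .set("Authorization", `Bearer ${await tokenFor("viewer")}`)
    .send({ name: "should be rejected" });
  expect(res.status).toBe(403); // blocked, not silently ignored
});

test("editor can write projects", async () => {
  const res = await request(app)
    .post("/api/projects")
    .set("Authorization", `Bearer ${await tokenFor("editor")}`)
    .send({ name: "allowed" });
  expect(res.status).toBe(201);
});
```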
And we pinned the tricky requirement: one misconfigured role must not escalate privileges. That means deny-by-default, explicit permission evaluation, and no "helpful" fallbacks.
To make permission decisions testable, we required a single permission-evaluation function with snapshot tests. The goal: if permission logic changes, the diff is visible and reviewable.
```ts
import { decide } from "../auth/decide";

test("permission decisions are stable", () => {
  const cases = [
    { role: "viewer", action: "project:read", expected: true },
    { role: "viewer", action: "project:write", expected: false },
    { role: "editor", action: "project:write", expected: true },
    { role: "misconfigured", action: "org:admin", expected: false },
  ];
  for (const c of cases) {
    expect(decide({ role: c.role, action: c.action })).toBe(c.expected);
  }
});
```

The real suite was broader (API + UI + migration), but the point is the same: permission logic is explicit and test-driven.
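For context, a deny-by-default evaluator of the kind the spec demanded can be very small. This is a sketch under those constraints, not the implementation the models produced; the matrix contents are illustrative.

```ts
// Sketch of a deny-by-default decide(): every permission is listed explicitly,
// and anything not listed (including unknown or misconfigured roles) is denied.
// The matrix here is illustrative, not the one generated in the experiment.
type Action = string;

const PERMISSION_MATRIX: Record<string, ReadonlySet<Action>> = {
  viewer: new Set(["project:read"]),
  editor: new Set(["project:read", "project:write"]),
  admin: new Set(["project:read", "project:write", "org:admin"]),
};

export function decide({ role, action }: { role: string; action: Action }): boolean {
  // No implicit inheritance, no "helpful" fallbacks: unknown roles fall through to deny.
  const allowed = PERMISSION_MATRIX[role];
  return allowed ? allowed.has(action) : false;
}
```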
Results
Claude Opus 4.5: completed successfully in ~40 minutes, at a cost of ~$40.

- Stayed in the loop continuously
- Full suite passed: DB, API, UI
- Migration preserved access
- Permission matrix explicit

GPT-5.2 Max: did not complete, across two attempts of ~15 minutes each, ~$30 total.

- Exited while tests were failing
- UI flows didn't match role behavior
- API checks inconsistent
- Org-scoped roles incomplete
What Broke (and Why)
The failure modes were not subtle. They're the ones you've probably already seen if you've tried to use models for serious work.
GPT-5.2 Max had four recurring issues:
It quit
Not always, but often enough that you can't rely on it for long-running work unless you're supervising. The termination instruction helped, but didn't fully constrain it.
It wrote broken code and relied on tests to notice
That sounds fine (that's TDD), except it then struggled to converge cleanly. You'd see fixes that created new failures elsewhere. It was playing whack-a-mole.
It drifted from coding rules
Even with explicit constraints, it regularly fell back to `any`/`unknown` when told not to, and produced large, sprawling files even when asked to keep modules small.
It dropped context
It would read a file, incorporate it, then behave later as if it never existed — which is fatal for cross-cutting changes like RBAC.
Opus 4.5 wasn't perfect, but it was materially better on the things that matter for this style of work: it stayed in the loop, followed constraints more consistently, and held the necessary context across domains.
The trade-off is cost. Opus got it right, but it was expensive by model standards.
What This Means for Your Teams
If you've avoided AI for "real engineering" because it can't hold a long thread, this is the first time the answer is: it depends on the model, but also on your discipline.
The winning pattern is not "ask the model to implement RBAC".
The winning pattern is:

- write the requirements as a testable contract, not prose
- make success mechanically checkable with a harness that covers DB, API, and UI
- give an unambiguous termination rule: do not exit until the suite is green
With that setup, you can hand the model a long-running task and get back something that's actually verifiable. And yes, the compute can look expensive — $40 for one run — until you compare it to a day or two of a strong engineer doing a careful RBAC migration with UI + API + DB coverage. That's not a cheap change. You're paying for concentrated iteration and test-driven convergence.
The return of TDD isn't philosophical. It's practical. AI makes the red–green loop cheap enough that you can afford to make tests the center of the workflow again — and, crucially, you can enforce that workflow mechanically.
If your team wants to use AI beyond "draft a function", this is the line to draw: no tests, no trust. Tests first, and don't let it exit early.
