Working Is Not a Quality Metric

Why producing working code faster with AI raises the stakes on architectural quality, and why passing tests is not the same as having sound boundaries.

There is a phase in building where everything looks fine.

The happy path works. The UI renders correctly. The flow completes. The database writes succeed. The tests pass. The demo goes smoothly.

And that creates a false sense of confidence that becomes more dangerous, not less, when you are building with AI.

What AI changes about this

AI-assisted development makes working code much easier to produce.

A feature that might have taken two days can take two hours. The boilerplate is handled. The standard patterns are applied. The common cases are covered. The tests for the obvious scenarios are written.

For a while I treated that acceleration as straightforwardly good: more output, faster pace, same quality.

That assumption turned out to be wrong in an important way.

The faster you can produce code that works on the happy path, the more important it becomes to ask what "works" is actually telling you.

Because what it is telling you is: this code behaved correctly in expected circumstances.

What it is not telling you is anything about what happens in unexpected ones.

The hidden questions

The hidden question in any SaaS system is not "does the feature work?" but "what are the boundaries, and are they sound?"

  • Can tenant scope be forged on any endpoint?
  • Can authorization logic be bypassed through a less-tested path?
  • Are there operations that should be atomic but are not?
  • Are there dev-mode escape hatches that could survive into production?
  • Are there implicit assumptions that hold under normal use but fail under adversarial use?

Working code does not answer any of these.

In fact, it obscures them. A system that works correctly in all standard cases provides very little evidence about its behavior at the edges. And in a multi-tenant SaaS, the edges are exactly where the important failures live.
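To make the first question concrete, here is a minimal sketch of two handlers that are indistinguishable on the happy path. Everything in it is hypothetical and for illustration only: the `INVOICES` store, the `Session` type, and both handler names are assumptions, not part of any real framework.

```python
from dataclasses import dataclass

# Hypothetical in-memory store: invoices keyed by tenant.
INVOICES = {
    "acme": [{"id": 1, "total": 120}],
    "globex": [{"id": 2, "total": 950}],
}


@dataclass(frozen=True)
class Session:
    """What authentication established about the caller (assumed shape)."""
    user: str
    tenant_id: str


def list_invoices_naive(session: Session, params: dict) -> list:
    # Trusts a client-supplied tenant_id. This works in every demo,
    # because the UI always sends the caller's own tenant.
    return INVOICES.get(params["tenant_id"], [])


def list_invoices_scoped(session: Session, params: dict) -> list:
    # Ignores any tenant_id in the request; scope comes from the session.
    return INVOICES.get(session.tenant_id, [])


alice = Session(user="alice", tenant_id="acme")

# Happy path: both versions agree, so "it works" cannot tell them apart.
honest = {"tenant_id": "acme"}
assert list_invoices_naive(alice, honest) == list_invoices_scoped(alice, honest)

# Forged request: only now does the difference show.
forged = {"tenant_id": "globex"}
leaked = list_invoices_naive(alice, forged)   # another tenant's invoices
safe = list_invoices_scoped(alice, forged)    # still the caller's own invoices
```

A demo, a UI walkthrough, and a success-case test would all pass against either handler. Only the forged request separates them.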

Why tests give you less than you think

Unit tests tend to pass because they exercise success cases. Integration tests tend to pass because they use valid, well-formed inputs. End-to-end tests tend to pass because they simulate real users doing normal things.

Far fewer tests ask: what if the tenant ID in this request is wrong? What if an authorization check is missing on this endpoint? What if a user from one organization tries to access another's data?

Those tests are harder to write because they require thinking about failure modes, not just success modes. And when AI is generating the test scaffolding, it tends to generate tests that mirror the implementation: tests that verify the code does what the code does, not tests that verify the system is secure.

The result is a codebase with high apparent test coverage and genuinely weak boundary guarantees.
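The difference between a test that mirrors the implementation and a test that probes the boundary can be sketched as follows. This is an illustrative example under stated assumptions: the `DOCS` data, the `Forbidden` exception, and both accessor functions are hypothetical, and one accessor is deliberately flawed.

```python
# Hypothetical documents owned by two organizations.
DOCS = {
    "d1": {"org": "acme", "body": "Q3 plan"},
    "d2": {"org": "globex", "body": "Payroll"},
}


class Forbidden(Exception):
    """Raised when a caller asks for a document outside their organization."""


def get_doc_unchecked(user_org: str, doc_id: str) -> str:
    # Deliberately flawed: no ownership check, returns any document by ID.
    return DOCS[doc_id]["body"]


def get_doc_checked(user_org: str, doc_id: str) -> str:
    doc = DOCS[doc_id]
    if doc["org"] != user_org:
        raise Forbidden(doc_id)
    return doc["body"]


def mirror_test(get_doc) -> bool:
    # Success case only: own organization, valid ID.
    return get_doc("acme", "d1") == "Q3 plan"


def adversarial_test(get_doc) -> bool:
    # Boundary case: another organization's document must be refused.
    try:
        get_doc("acme", "d2")
    except Forbidden:
        return True
    return False


# (mirror result, adversarial result) for each implementation.
results = {
    fn.__name__: (mirror_test(fn), adversarial_test(fn))
    for fn in (get_doc_unchecked, get_doc_checked)
}
```

Both implementations pass the mirror test, so a suite made only of such tests reports the flawed accessor as healthy. The adversarial test is the only one whose result differs between the two.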

What the acceleration reveals

Building with AI makes one thing visible that was easy to miss before: where the bottleneck actually sits.

When development was slower, the bottleneck was implementation. You had time to think while you were writing, and the friction of writing forced some implicit architectural consideration.

When the bottleneck moves to judgment, when the agent can implement faster than you can evaluate, the quality of your thinking about boundaries matters more than it did before.

AI makes it easy to produce working code. It makes it easy to produce working code with fragile boundaries. It makes it easy to produce a lot of that code very quickly.

The acceleration does not change what sound architecture requires. It changes how fast you can get into trouble without it.

What to look for instead

The quality signals I trust now are different from "it works."

I trust a system more when:

  • authorization is enforced by structure, not by convention
  • tenant scope comes from authentication, not from request parameters
  • dangerous operations are hard or impossible by design, not merely discouraged
  • business-critical actions are coherent under failure, not just under success

Those things are not visible on the happy path. They require deliberately testing the wrong case, the missing permission, the forged tenant, the partial failure.
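As one possible reading of "enforced by structure" and "tenant scope comes from authentication", the sketch below binds a data-access object to an authenticated principal at construction time. The `Principal` and `TenantScopedProjects` names and the row layout are assumptions made up for this example, not a reference to any particular framework.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Principal:
    """Hypothetical result of authentication, e.g. a verified token."""
    user_id: str
    tenant_id: str


class TenantScopedProjects:
    """Data access bound to one tenant when it is constructed.

    No method accepts a tenant_id, so a handler cannot pass a forged
    one, even by mistake.
    """

    def __init__(self, principal: Principal, rows: list):
        self._tenant_id = principal.tenant_id
        self._rows = rows

    def all_projects(self) -> list:
        return [r for r in self._rows if r["tenant_id"] == self._tenant_id]

    def get(self, project_id: int) -> Optional[dict]:
        for row in self.all_projects():
            if row["id"] == project_id:
                return row
        # Another tenant's row looks the same as a missing row.
        return None


ROWS = [
    {"id": 1, "tenant_id": "acme", "name": "Roadmap"},
    {"id": 2, "tenant_id": "globex", "name": "Merger"},
]

repo = TenantScopedProjects(Principal("alice", "acme"), ROWS)
own = repo.get(1)       # the caller's own project
foreign = repo.get(2)   # exists, but belongs to another tenant
```

The design choice here is that the check is not something each endpoint has to remember. A handler that only receives `repo` has no way to express a cross-tenant query.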

That kind of testing is less natural than testing that the feature works. It requires thinking adversarially about a system you built and probably trust.
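One way to practice that adversarial thinking is to inject a failure halfway through a flow you trust and check what state is left behind. The sketch below uses Python's standard `sqlite3` module with an in-memory database. The `accounts` table, the credit transfer, and the `fail_midway` switch are hypothetical and exist only to illustrate the idea.

```python
import sqlite3

DEBIT = "UPDATE accounts SET credits = credits - ? WHERE name = ?"
CREDIT = "UPDATE accounts SET credits = credits + ? WHERE name = ?"


def make_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, credits INTEGER)")
    conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("a", 10), ("b", 0)])
    conn.commit()
    return conn


def transfer_stepwise(conn, src, dst, amount, fail_midway=False):
    # Commits each step on its own. Fine on the happy path.
    conn.execute(DEBIT, (amount, src))
    conn.commit()
    if fail_midway:
        raise RuntimeError("injected failure")
    conn.execute(CREDIT, (amount, dst))
    conn.commit()


def transfer_atomic(conn, src, dst, amount, fail_midway=False):
    # "with conn" commits on success and rolls back if the block raises.
    with conn:
        conn.execute(DEBIT, (amount, src))
        if fail_midway:
            raise RuntimeError("injected failure")
        conn.execute(CREDIT, (amount, dst))


def total_credits(transfer, fail_midway: bool) -> int:
    """Run one transfer on a fresh database and return the credits that remain."""
    conn = make_db()
    try:
        transfer(conn, "a", "b", 5, fail_midway=fail_midway)
    except RuntimeError:
        pass
    return conn.execute("SELECT SUM(credits) FROM accounts").fetchone()[0]
```

On the happy path both functions conserve the total, so a success-case test cannot separate them. With the injected failure, the stepwise version loses the debited credits while the atomic version leaves the balances untouched.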

But in a system where the boundaries matter, and in a multi-tenant SaaS they always do, appearing to work is not the same as being sound.

The gap between those two things is where the real risk lives.
