Coverage Up, Confidence Down
AI-written unit tests verify the implementation, not the business logic. Code coverage has become a source of false confidence.
A couple of years ago, before AI, unit tests were meant to validate the intent behind the code. We wrote them to confirm that the code satisfied the business requirements, but that’s no longer the case.
Of course, some people were already writing tests just for the sake of it. AI didn’t create that problem, it just sped it up.
Tests as a checkbox
Today, unit tests have become nothing more than a checkbox to tick. Since developers are encouraged to write everything using AI, the unit tests are also written by AI.
The unit tests that took a day to write are now ready in minutes, but faster doesn’t mean better. The problem lies in what’s being measured.
With AI speeding everything up, the expectation shifted. Timelines now only account for building the features.
Everything needs to be delivered in a single day, and anything beyond the feature itself becomes an afterthought in favor of speed. Unit tests and code reviews are two of the most affected.
AI tests mirror the implementation
Look at unit tests written by AI and you’ll see that all they do is test the code as written. Sometimes they even cheat, mocking the very thing under test. The tests are derived from the implementation, with no regard for whether the implementation is wrong.
What’s the point of testing the implementation when it can’t tell what’s right or wrong? What does it verify?
To answer that, you have to go back to what a good test is supposed to do.
What makes a good unit test
A good unit test should verify that the implementation matches the business requirements, meaning the expected behavior. If, at any point, we change the code so it no longer satisfies the business requirements, it should fail.
The unit tests created by AI only test the implementation; they don’t know what’s actually right or wrong. But can we really blame AI for this? It tested exactly what we pointed it at.
We just pointed it at the wrong thing.
Coverage rewards the wrong thing
This keeps happening because that’s what’s being measured. Measuring code coverage is easier than measuring quality. You can set code coverage targets as a KPI, but what about code quality?
Developers are rewarded for writing more tests, more coverage, regardless of what the tests actually verify.
High code coverage also gives the higher-ups a false sense of security. After all, AI can easily keep your code coverage above 90%. But the developers who get their hands dirty with these AI shenanigans lose confidence in quality and system reliability.
How many bugs do you think an application with over 90% code coverage has? The bugs usually happen when a developer changes something that breaks other features. This is common since AI only focuses on the task at hand, without considering the side effects.
Flip the order
Despite all of this, the goal hasn’t changed. We still want the code to behave the way we expect. What I keep coming back to is whether we’re even using AI the right way.
Maybe the real issue isn’t speed. It’s that AI doesn’t know what the code is supposed to do.
We hand it the implementation and say “write tests for this.” So it reverse-engineers the intent from the code, the exact thing we’re trying to verify. No wonder the tests just mirror the implementation.
What if we flipped it? You write the scenario in plain English: “when a user withdraws more than their balance, reject the transaction.” That’s the spec. AI turns that into test code, then writes the implementation to make it pass.
Now AI can’t cheat. There’s no implementation to mirror, because it doesn’t exist yet. It has to write code that actually satisfies what you described.
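A minimal sketch of what this flipped flow looks like, using the withdrawal scenario above. The `Account` and `Result` names are illustrative, not from any real codebase; the point is that the test encodes the plain-English rule first, and the implementation is written afterward to satisfy it:

```python
from dataclasses import dataclass

# The scenario, written before any implementation existed:
# "when a user withdraws more than their balance, reject the transaction."

def test_withdrawal_over_balance_is_rejected():
    account = Account(balance=100)
    result = account.withdraw(150)
    assert result.rejected
    assert account.balance == 100  # a failed withdrawal must not change state

# The implementation is then written to make the test pass,
# not the other way around. Names here are illustrative.

@dataclass
class Result:
    rejected: bool

@dataclass
class Account:
    balance: int

    def withdraw(self, amount: int) -> Result:
        # Business rule: never allow the balance to go negative.
        if amount > self.balance:
            return Result(rejected=True)
        self.balance -= amount
        return Result(rejected=False)
```

Notice that nothing in the test refers to how `withdraw` is built internally, only to the behavior the scenario demands.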
Mutation testing over coverage
Even then, a passing test doesn’t mean the test is good. You can flip a > to a < in the source and see if any test catches it. If they all still pass, those tests weren’t really testing anything. That’s mutation testing.
Coverage tells you the line ran. Mutation testing tells you the test would actually catch a bug on that line. One of these is useful. The other is what we’ve been measuring.
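The idea can be shown with a hand-rolled mutation check. Below, the same "suites" run against the original function and against a mutant with `>` flipped to `<`; a suite that survives the mutant wasn't really testing the comparison. (Real tools such as mutmut for Python or PIT for Java automate this; the functions here are illustrative.)

```python
def is_overdraft(amount, balance):
    return amount > balance          # original comparison

def is_overdraft_mutant(amount, balance):
    return amount < balance          # mutated: > flipped to <

def weak_suite(fn):
    # Only exercises the equal case, where > and < agree,
    # so both lines get 100% coverage and the suite proves nothing.
    return fn(100, 100) is False

def strong_suite(fn):
    # Exercises both sides of the boundary, so the mutant is "killed".
    return fn(150, 100) is True and fn(50, 100) is False
```

`weak_suite` passes on both the original and the mutant; `strong_suite` passes on the original but fails on the mutant, which is exactly what separates a test that pins down behavior from one that merely executes lines.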
The spec is the source of truth
There’s still a gap, though. Once AI turns your plain English scenario into test code, you lose the connection. Six months later a test fails, and you’re back to reverse-engineering the business rule from generated code. That’s the same trap we were trying to get out of.
So don’t treat the test code as the thing you maintain. The scenario is the source of truth. The test code is just an artifact, like compiled bytecode. If the scenario changes, you regenerate the test. If a test fails, you read the scenario, not the test, to understand the intent.
What actually changes is what we own. We stop writing assertions and start writing intent. AI handles the translation. Mutation testing handles the verification. Coverage drops out of the picture, because it was never measuring what mattered.
This works at other layers too. Playwright is a good example: it’s deterministic and repeatable. AI calls the tool; it doesn’t invent its own browser behavior.
The mechanical part is fine to delegate, as long as something deterministic holds the line.