We’re Writing Too Many Tests

Creating and maintaining a test automation suite is a time-consuming endeavor, yet the value of those efforts is often questionable.

Creating and maintaining a test automation suite is a time-consuming endeavor, often rivaling the time spent developing the code changes they are there to support. Yet the value of those efforts is often questionable. It’s hard to judge the value of test automation, and it’s surprisingly easy to invest a lot of time for no, or even negative, return.

Now, I don’t want to suggest that we shouldn’t write tests. Done carefully, they are a vital part of the development process; they increase confidence, help us catch bugs earlier in the development process, and make future improvements easier. Instead, I propose that we be more careful about how we write tests and what tests we choose to write. We need to ensure the effort in creating tests is balanced against the value we get in return.

Last year, I deleted our team’s end-to-end test suite. This suite had all the same problems as the last two e2e test suites I deleted. It wasn’t finding any valuable regressions. It was burning time investigating and fixing failures that all turned out to be flakes and had become difficult to add new tests to. A few months later, we started transitioning the app to a continuous delivery model as part of an extensive modernization project. But now we had a problem — did we have sufficient test automation to give us confidence in our releases? While the e2e tests had provided little confidence, it was hard to gauge whether we could rely on our unit tests either. Sure, we had a lot of tests and a high code coverage number, but many of the tests didn’t appear to actually be testing anything. Without being able to trust the existing tests, we retrofitted a suite of highly reliable integration tests that covered the scenarios most important to the business and customer. Even better, we were able to write these tests in a tiny amount of time.

Three anti-patterns plague the test suites of almost every application I have worked on: tests that aren’t actually testing anything, tests that repeat the same coverage across locations and layers of tests, and tests that ignore the context of the code they’re testing. These anti-patterns waste time, particularly when maintaining existing tests. They also make it hard to measure confidence, dilute the value of our test suite, and take away from time better spent on more effective improvements to both the quality of our application and the developer experience.

Anti-pattern 1: Trivial Tests

A test failure is only helpful if it’s unexpected. However, many unit tests I have encountered never fail or only fail when we expect them to. Trivial tests look useful until you realize they will never fail in situations that will provide the developer with information they don’t already have.

Consider the following example:

// example.ts
import { method1, method2 } from './utils';

export function example(param1: any, param2: any) {
    const result1 = method1(param1);
    const result2 = method2(param2);
    return `${result1}: ${result2}`;
}


// example.test.ts
import { example } from './example';

jest.mock('./utils', () => ({
   method1: jest.fn(() => 'baz'),
   method2: jest.fn(() => 'qux'),
}));

describe('example', () => {
    test('example', () => {
        const result = example('foo', 'bar');
        expect(result).toEqual('baz: qux');
    });
});

This test will only fail if we’ve updated the method being tested. But if we update this method, we expect the test to fail, so the failure hasn’t told us anything new. With such a simple method, there are no edge cases we need to be wary of; meaning tests aren’t preventing an accidental regression.

Most codebases consist primarily of methods that don’t have the complexity where unit tests like this add value. Unit tests typically aren’t required for methods that lack branching logic, loops, state, or error handling.

Anti-pattern 2: Applying coverage equally

A common line of thinking is that all code across an application is equal and deserves equal test coverage. Teams typically attempt to apply the same test strategies equally across all code in the codebase.

In reality, our applications vary wildly in how likely we are to see regressions and the cost of those regressions. For example, we might have an area of high importance to the customer that was rushed out and of questionable quality. However, the changes are isolated from the rest of the codebase and unlikely to see further development. This area might not be a suitable candidate for reducing technical debt or backfilling test automation, as it is unlikely to see future regressions. At most, it might benefit from a handful of integration tests covering happy-path scenarios.

The value of a test varies not only over the codebase, but over time as well:

A complex algorithm will benefit from early tests to aid development and may not see many iterations in the short to medium term. However, it will benefit from those tests if the algorithm needs to be extended years later.

Many parts of our applications contain simple logic that doesn’t benefit from tests while writing. However, once it’s written, multiple people may iterate on it, causing churn that would benefit from test coverage.

Deciding where to apply effort is often a best guess, and parts of the codebase we thought complete will get unexpected customer requests for additional features. But when we have limited resources, we need some form of a decision-making framework to help us decide where we allocate effort.

Your application will have knots of complexity, but it will also have a lot of boilerplate and a lot of simple CRUD interactions. The tests for particularly complex or essential areas should not inadvertently set the standard across the application.

Anti-pattern 3: Duplicate Tests

While we have plenty of tools that measure the lines of code covered by our tests, we have nothing to measure whether a line has been over-covered. By duplicating scenarios across multiple tests, we’re also duplicating effort, increasing our maintenance burden and reducing the value of each test. Duplication means that tests won’t fail in a way that provides us with new information.

Duplicate tests arise from testing the same application code across multiple types of tests or test suites. There are few situations where we get value from testing the same code path multiple times, and typically, we only need the first failure from the lowest-cost test.

Duplication across test layers Most applications have multiple approaches for ensuring correctness and quality, for example, end-to-end tests, integration tests, and unit tests. Chances are that we are replicating test scenarios across some of those layers. For example, we might have the same happy-path scenarios covered by both our integration and unit tests. Duplication means that tests that run later in our process cannot fail, as we have already validated the scenarios they cover. Alternatively, a test that runs later, but is easier to write and maintain, could render swathes of earlier tests redundant, as it provides the same coverage more efficiently.

By default, the first test we write for a new feature is almost always a unit test. Individually, these tests are the easiest to write. For those developers writing tests alongside new code, an integration test isn’t immediately helpful because it’s a step too far removed. It may need code not yet written, requiring multiple iterations. However, it can take dozens of unit tests to equal the value of one good integration test. By writing unit tests first, we guarantee duplication between the coverage of happy path scenarios.

My preferred model for writing tests is to focus unit tests specifically on the areas with knots of complexity and with a high number of routes through the code. Above this, I create a backbone of a few happy-path integration tests. For the front end, the integration tests cover user input through to API requests. For APIs, they cover from the request being received to a response returned, mocking out any external services or databases. Meanwhile, static analysis provides a baseline level of quality checking. Focusing each type of quality tooling on the scenarios where they are strongest reduces duplication between each type of test.

As an alternative to the testing pyramid, I prefer the stalatite model. Rather than broad coverage across the application, unit tests are limited, and focused around on the areas of high complexity.

Duplication across different locations Sometimes, when we have a shared component, we end up duplicating effort by testing the functionality of that component both within its tests and within the tests of its consumers. While there is some value in checking our code integrates with a common component, any tests in our application that exhaustively check that shared code are unlikely to add any additional value. Only the first failure gives us helpful information, and that will typically be the test closest to the affected code change.

We should avoid the temptation of testing our middleware or cross-cutting concerns in every location we use them. Particularly when dealing with slow or flaky tests, re-tracing well-covered code paths increases our maintenance burden for a small amount of new coverage. For example, if we’re writing tests against an API, we won’t need e2e tests for every scenario — we rarely need to go to the database to test an edge case in our business logic.

Finally, some tests, particularly e2e tests, allow us to test the inner workings of other applications. For example, a front-end team might write a test that asserts business logic contained within the APIs they’re calling off to. Again, this is redundant, as the API’s integration tests should cover that behavior.

The problem with low-value tests

Maintenance burden

As developers, we’re often pretty good at stamping out new tests quickly. But just like the rest of our codebase, the effort of tests comes not from creation but maintenance. Whether from a change to a feature or a sweeping refactor, many tests are brittle enough that any nearby change will require understanding and updating our tests.

Debugging test failures is hard. Many tests don’t fail for obvious reasons and require debugging to determine what went wrong. Our applications have many types of tests and a variety of test patterns within each type of test. Each different type of test that fails must be understood separately. Even getting the test suite to run for legacy applications can consume a lot of time. Creating new tests and updating old ones often consumes 70–90% of the time of a feature change.

Occasionally test failures are helpful, revealing issues we couldn’t have anticipated through reasoning alone. But more often, test failures are false positives, with the test failure occurring because the test code has fallen out of sync with the application code. When the fix for a failed test only results in changes to the test code itself, that test has provided no value.

Similarly to gold-plating and technical debt, poorly thought-out tests are a tax on all future development efforts. Even if a test is easy to write, it can have a net-negative impact if we spend more time investigating and updating the test than we receive in value.

Confidence Gap

Perhaps contrary to expectations, a test suite’s value doesn’t come from failing tests. Instead, it comes from tests succeeding, as successful test runs give us confidence in the correctness of our code. However, that’s only a meaningful distinction if our tests can fail, and, as described earlier, a significant percentage of our test suites don’t meet this simple criterion. Without it, it’s hard to tell whether our tests are contributing to our confidence levels.

Confidence is a vague term that many teams either struggle to define or misunderstand. It’s much more than just tests, encompassing how we architect our applications, invest in addressing technical debt, release new functionality, monitor our application, and respond to problems when they arise. I’ve worked in many teams that have misattributed confidence to low-value tests while underestimating the importance of the above capabilities.

There are teams with large test suites that detect nothing in development while regressions still make it through to production. These teams believe the solution to their problem is increasing test coverage, often doubling down on low-value tests. Typically, regressions come from scenarios not considered rather than tests not written. Sometimes, writing tests helps identify edge cases. But just as often, it gets in the way — locking us into our initial assumptions, burning time better spent considering requirements, or giving us false confidence that we have already enumerated every possibility.

We only have so much time to spend on the tooling that supports our applications, and writing tests aren’t always the best way to spend that time. Personally, I put more stock in the team’s ability to detect issues and roll forward than I do in having thousands of tests. Confidence comes from strong CI/CD, good observability and monitoring in production, strong team culture, good architecture, limited technical debt, and a reasonable set of integration tests supported by a few unit and end-to-end tests.

Opportunity Cost

Instead of writing low-value tests to catch bugs, we could spend that time making bugs less likely to occur. We could spend that time paying down technical debt, increasing automation, investing in a better architecture, modernizing the application, updating our frameworks and libraries, or improving the development team’s capabilities. Thinking we can do everything often leads to teams running out of time and missing the opportunity to make essential improvements. Teams are so busy with the bread and butter of writing features and tests they don’t spot possibilities that could increase their capabilities or quality of life.

When we invest too much effort in getting things right up-front, we risk spending a lot of time building something that no one wants. We end up with a negative feedback loop — work takes so long that we can’t iterate, so we need to get it right the first time, slowing everything down further.

Instead, we could adjust our processes to help us release faster. Done carefully, increasing velocity can, by itself, improve quality. By shortening the tasks’ lifetime, reducing the team’s cognitive load, and increasing engagement, developers are less likely to make mistakes and more likely to identify edge cases. With a tighter feedback loop, we can develop iteratively and base decisions on metrics and data, pivoting easier and quicker. This effect is particularly prominent when considering that much of the time spent on test automation is of marginal value at best.

As a note of caution, we should be mindful of the feature factory anti-pattern, where velocity gets traded for quality. The team reshapes its processes to ship buggy low-quality features quickly. Rather, we want to carefully drop time that didn’t meaningfully contribute to quality, with the goal of an overall increase in quality.

We don’t know how bad the test suite is until it’s too late

Because the impact of tests co-mingles with feature development, it’s often hard to judge the usefulness of the test suite with anything but anecdotal data. How much time goes towards maintaining existing tests and adding new ones? How often are tests providing the developer with helpful information? Would we have identified the regressions without tests as part of normal feature development?

It’s tough to take an inventory of what unit tests cover. Code coverage is a flawed metric: not accounting for the scenarios not considered and treating the application as an equal surface. When there are so many tests written across an extended period, we’re just left trusting that the team wrote appropriate tests at every point in the application’s history.

We can discover that the test automation suite they’ve spent so much effort building is worthless and will not give them the confidence they need. Once we’re in this situation, it can be hard to escape. We have to deal with the sunk-cost fallacy and the politics of how so much effort got wasted. We’ll limp along with a worthless test suite pretending it is providing value, making up the difference with additional manual effort bridging the gap. In this situation, we have the worst of both worlds, maintaining our test suite while still performing the manual testing we were trying to avoid.

Backfilling tests is hard, and it takes a lot of investment to retrofit a better set of tests into an existing application. Tests written after the fact will not have the understanding of the edge cases and intricacies of the code that were present during the initial development. Often the best we can do when backfilling is to cover the happy-path scenarios with integration tests.

Judging the value of the tests we all spend so much time writing is hard. Most of the regressions and bugs we identify just aren’t that important, whether discovered by a test or found in production after slipping through our quality checks. A developer only has to check their team’s backlog to see the number of bug tickets languishing, never becoming significant enough to rise above feature work. Some bugs have an impact so small as to be negligible or require a combination of circumstances too improbable to affect a meaningful number of users. Others occur in barely used parts of the application or are too difficult to fix to justify the time investment.

The Bugs That Shouldn’t Be in Your Bug Backlog

There is never enough time to fix all of our bugs, so why do we keep them around?

medium.com

Tests don’t need to fail to provide confidence and peace of mind. However, confidence is often in the eye of the beholder. Bad tests can give us false confidence, while good tests go underestimated and underappreciated. We need to understand whether we have already covered the scenario somewhere else, whether the test is appropriate to the code under test, and whether a test would fail in a way we couldn’t have predicted. My rule of thumb for any new test is to ask: “If this test failed, would it provide me with any information I didn’t already have?”

Summarize