Better System Tests – Increasing Testability without Sacrificing Elegance

Writing tests for your software is a great idea! I’m glad we’re all doing it. And I’m glad we’ve all put a lot of thought into categorizing the tests–by unit, integration, system, etc. Of all the tests you can write, my professional experience has shown repeatedly that system tests are the most valuable. Nothing gives you as much proof as validating the entire system working in concert.

Unfortunately, I also hate system tests. It’s hard to write them–you have to treat your whole system as a black box, then come up with ways to automate interacting with the black box as your end user would. Since you’re doing TDD, you have the added complexity of trying to do this before the UI has even been made. To make matters worse, every system test suite I’ve ever seen has been plagued with long run times and intermittent, spurious test failures. Does it have to be this way?

What Makes System Tests So Hard?

Let’s consider the case of the prototypical worst offender: a Ruby on Rails app with some client-side interactivity sprinkled in. We typically test such an application by running a headless browser with which we can script interactions. We’ll write a test that finds some form elements, fills in values, and then clicks a submit button. When the form has been submitted and a new page has rendered, we’ll assert that we see some stuff on screen, or perhaps we’ll peek into the database and see that something desirable happened. This is the ideal case for such a test, because it’s easy to know when to make our assertions: Clicking on the submit button caused the headless browser to enter a loading state, and we can simply wait for it to confirm that it’s got new content.
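In Capybara (a common Ruby DSL for driving a headless browser in Rails system tests), that happy-path flow might look something like this. The route, field labels, and expected copy here are all hypothetical:

```ruby
# A sketch of a classic full-page-load system test using Capybara's DSL.
# The route, labels, and expected text are made up for illustration.
require "capybara/rspec"

RSpec.describe "signing up", type: :feature do
  it "creates an account and shows a welcome message" do
    visit "/signup"

    fill_in "Email", with: "user@example.com"
    fill_in "Password", with: "secret123"
    click_button "Sign up" # triggers a full page load...

    # ...so the driver knows to wait for the new page before asserting
    expect(page).to have_content("Welcome")
  end
end
```

Because the button click causes a real page load, there's an unambiguous moment when it's safe to assert.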

The thing is, nobody's written that kind of Rails app in years. In reality, the application is either partially dynamic (some JavaScript here and there) or fully dynamic (e.g., built with Ember, React, etc.) on the client side. When you click a button, instead of loading a new page, it might do some client-side processing and transform the DOM. Or it might make an Ajax request. Or both.

It’s easy for us to ask the headless browser whether it’s finished loading a page, but it’s much harder to know for sure that all of the client-side JavaScript or DOM processing has settled down. It’s so hard that the usual solution is to be flexible with your timing. You’ll click a button, wait a fraction of a second…and then keep trying to make your assertions, repeating them if they fail, for, say, five more seconds.
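Under the hood, that "be flexible with your timing" strategy is just a polling loop. A stripped-down version, with names and defaults of my own invention rather than from any particular framework, looks like this:

```ruby
# A minimal retry-until-it-passes loop, the core of most "flexible timing"
# assertion helpers. Names and defaults are illustrative.
def retry_assertion(timeout: 5.0, interval: 0.1)
  deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + timeout
  begin
    yield # run the assertion; return its value if it passes
  rescue StandardError => e
    # Out of time: surface the last failure as the test result.
    raise e if Process.clock_gettime(Process::CLOCK_MONOTONIC) >= deadline
    sleep interval # otherwise wait a beat and try again
    retry
  end
end
```

Note the shape of the failure mode: a passing assertion returns quickly, but a genuinely failing one burns the entire timeout before reporting anything.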

I hope this sounds terrifying to you! In essence, to write system tests in this style is to create a giant bag of race conditions. What if the process takes an extra 200ms once in a while? Your test fails! So you’ll wait six seconds instead, then bump it again later when you realize that your continuous integration server isn’t quite as fast as your fancy MacBook Pro.

Let’s also consider the case where the test is catching a legitimate failure. Now it’s going to end up waiting the full six or so seconds over and over again, so your system tests are going to take forever to tell you what’s actually going on. And in the end, you won’t be certain if they’re indicating a legitimate or spurious failure. You’ll run them a few times, maybe leaving for lunch while you do. Or if you’re lucky, you can use CircleCI to split your tests into twelve different subsets and run them in parallel.

Frankly, I find this depressing. Enough so that I wanted to delete all of the system tests on projects I’ve worked on in the past. Ironically, despite all of the frustration and madness and wasted time, our system tests were still providing more than enough value to justify maintaining and adding to them. System tests are really important, you guys.

Is There a Better Way?

After analyzing the source of my frustrations, I think I can answer my question: No, it doesn’t have to be this way.

At the start of my most recent project, I gave a lot of thought to how to design our system tests so that they don’t suffer from this affliction. It’s not an insurmountable problem, actually. You just need your system tests (or, preferably, the framework you’re using to write your system tests) to have a little bit of insight into what your application is doing while it’s running.

In particular, you need to know whether it’s busy updating the DOM, performing some computation, or fetching data from your server. If you can wait until those have settled down, you won’t run into issues where elements or information you’re trying to assert on are not quite ready yet–and you won’t keep wasting time just to be safe.
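The core idea behind such framework support is simply to count in-flight work and refuse to proceed until the count hits zero. This sketch uses made-up names and is not any framework's actual implementation (Ember's real helpers track its run loop and pending requests):

```ruby
# The essence of "wait until the app has settled": count work as it starts
# and finishes, and only let the test proceed when nothing is in flight.
# Hypothetical names, for illustration only.
class ActivityTracker
  def initialize
    @pending = 0
    @mutex = Mutex.new
  end

  # Instrumented application code wraps each operation in track { ... }.
  def track
    @mutex.synchronize { @pending += 1 }
    yield
  ensure
    @mutex.synchronize { @pending -= 1 }
  end

  def settled?
    @mutex.synchronize { @pending.zero? }
  end

  # What the test framework calls before running any assertions.
  def wait_until_settled(timeout: 5.0, interval: 0.01)
    deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + timeout
    until settled?
      raise "app never settled" if Process.clock_gettime(Process::CLOCK_MONOTONIC) >= deadline
      sleep interval
    end
  end
end
```

The difference from the retry-loop approach is that the test waits on a positive signal that work has finished, instead of guessing with timeouts.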

For this project, we’d chosen to use Ember for the frontend and Clojure for the backend. Ember has a testing framework that has all of that knowledge built in. It calls these acceptance tests, and it appears that people normally run them against a mocked-out or fake API.

Naturally, writing tests for your UI that don’t talk to your actual backend won’t prove that the two are actually capable of talking to each other, but there was nothing keeping me from running these tests against my real API. I set them up to run against a “test” API, which is just the real API with some extra REST endpoints for poking at the database or other state. Just like that, Ember’s acceptance tests became our project’s full-system black box tests.
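The extra endpoints amounted to a thin test-only layer over the real API. This toy dispatcher shows the shape of the idea; the store, paths, and payloads are invented (our real backend was Clojure), and routes like these must obviously never ship in production:

```ruby
# A toy sketch of test-only endpoints layered on top of a real API.
# Paths and payloads are invented for illustration.
require "json"

class TestOnlyEndpoints
  def initialize(store)
    @store = store # stands in for the application's database
  end

  # Dispatch on HTTP method + path, returning a status and body.
  def handle(method, path, body = nil)
    case [method, path]
    when ["POST", "/test/reset"] # wipe state between tests
      @store.clear
      { status: 200, body: { ok: true } }
    when ["POST", "/test/seed"]  # insert fixture data directly
      JSON.parse(body).each { |key, value| @store[key] = value }
      { status: 200, body: { ok: true } }
    when ["GET", "/test/db"]     # peek at state to make assertions
      { status: 200, body: @store.dup }
    else
      { status: 404, body: { error: "not found" } }
    end
  end
end
```

With endpoints like these, the UI tests can set up and inspect backend state directly, while every user-facing interaction still exercises the real API.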

This approach worked extremely well for our project. The downside? You have to manage a few extra processes on your development machine (the test API server and Ember test runner). The upside? Ridiculously fast tests that never give us any intermittent or spurious failures! We’re now concluding development on the first release, and it still feels dreamy to see how fast and reliable our tests are.

Balancing the Need for Testability

The realization I’ve reluctantly arrived at is that good system tests require your system to be designed for testability. In the past, I’ve always aimed for code that is beautiful and elegant on its own terms, serving its purpose and nothing more. When you alter a design to accommodate testability, complexity can increase and elegance can suffer.

To find a good approach for increasing testability without sacrificing elegance requires careful thought and time, both of which are being spent on a goal that is only ancillary. As an illustration of just how hard it can be to get right, consider that we couldn’t write a test that checks for a particular refresh indicator in our app: Ember’s acceptance tests wouldn’t run our assertions until the refresh request had completed, by which point the indicator was gone. As another illustration, consider that what’s necessary to system-test a mobile app or device firmware will be wildly different from what’s necessary for a web app. I suspect these costs have discouraged a very large number of developers from even bothering.

At the end of the day, if your goals are quality and a velocity that doesn’t grind to a halt, the extra time and engineering you’ll spend to achieve good testability will be an excellent investment.