Evaluating Property-Based Testing Through a Random Walk

Lately, I’ve been interested in property-based testing. It’s a Monte Carlo-style approach in which you execute your application randomly (rather than according to strict scripts) and check that it never reaches an invalid state.

It has proven useful for lower-level software (such as implementations of data structures), but I’ve been wondering whether it could be applied at a higher level. I’d like to apply it to a web app to test its domain objects, and possibly to drive the app from further out, such as through a REST API.

I haven’t been sure of the best way to apply property-based testing at this level. My concern kept coming back to just how large the state-space could be. Today, I tested out my theories by modeling a property-based test as a random walk and seeing how well it could cover the state-space.

Setting Up the Test

I planned to use one dimension for each operation, executing many random walks through this n-dimensional space and recording which points (i.e. states) they reached.

The next thing I needed to figure out was how to define a “well covered” state-space. Since operations could be repeated, the space was infinite, so I couldn’t just do a straight-out percentage. I decided that I would pick a volume to be considered “interesting,” then measure which points within that volume were reached by at least one walk.

For this back-of-the-napkin calculation, I defined the “interesting” volume to be an n-cube extending from the origin n units in both the positive and negative directions on every axis. So for two dimensions, I wanted to know about the points above, below, left, and right of the origin, plus the diagonals. This extends into three, four, or more dimensions as well. In other words, as we linearly increase the number of operations, the potentially interesting volume increases exponentially. And, obviously, as the state-space gets bigger, coverage gets correspondingly poorer.
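To make that growth concrete, here is a small Python sketch that counts the integer points inside such a cube. It assumes the cube spans -radius to +radius on every axis, so it holds (2 * radius + 1) ** dimensions points. The function name and the radius parameter are mine, for illustration only; radius=n follows the “n units” wording, while radius=1 would count just the origin and its immediate neighbors.

```python
def cube_points(dimensions: int, radius: int) -> int:
    """Count integer lattice points in a cube spanning [-radius, radius] on each axis."""
    return (2 * radius + 1) ** dimensions


# Assumption: the cube's radius equals the number of dimensions ("n units").
for n in range(2, 8):
    print(n, cube_points(n, radius=n))
```

Under that reading, the cube grows from 25 points in two dimensions to 170,859,375 points in seven.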

The more interesting part is the rate at which the coverage percentage changes.

Test Results

Below is a summary of some trials. Because they’re all random and use the system rand(), my results may be less than scientific, but I think they’re still useful for a very rough guesstimate.

For each number of operations (i.e. dimensions), I performed 10,000 random walks, each of length 10. Here’s what I found:

Dimensions   Coverage
2            100%
3            100%
4            80%
5            12%
6            0.8%
7            0.03%
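An experiment of this kind can be sketched in a few lines of Python. This is not the script behind the numbers above, and it will not reproduce them exactly. It rests on assumptions of mine: every walk starts at the origin, each step moves one unit in a random direction along one randomly chosen axis, the origin counts as reached, and the cube’s radius defaults to the number of dimensions. The function and parameter names are hypothetical.

```python
import random


def coverage(dimensions, walks=10_000, length=10, radius=None, seed=None):
    """Fraction of cube points reached by at least one random walk."""
    rng = random.Random(seed)  # seedable, so a run can be repeated
    if radius is None:
        radius = dimensions  # assumption: cube extends n units on each axis
    reached = {(0,) * dimensions}  # assumption: the starting point counts
    for _ in range(walks):
        point = [0] * dimensions
        for _ in range(length):
            # Assumed step rule: one unit along one random axis.
            axis = rng.randrange(dimensions)
            point[axis] += rng.choice((-1, 1))
            # Only points inside the "interesting" cube are recorded.
            if all(abs(c) <= radius for c in point):
                reached.add(tuple(point))
    total = (2 * radius + 1) ** dimensions
    return len(reached) / total


if __name__ == "__main__":
    for n in range(2, 8):
        print(n, f"{coverage(n, seed=0):.4%}")
```

Changing the step rule, the walk length, or the radius changes the percentages, so the exact values matter less than how quickly they fall.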

With this model, coverage is good only while the number of operations is small: complete at 2 or 3 and still 80% at 4. Beyond that, it falls from “good” to “smoke test” to “at least it compiles” levels of assurance very, very quickly: 12% at 5 operations and 0.03% at 7.

The difficulty in interpreting these results is that this is just a model. Not every operation is equally likely to trigger a bug, and not every operation can be done at every time. This means that we don’t really have the uniform probabilities that this model expects. So, take my results with a grain of salt, but I still think they provide enough illumination to be useful.

My Conclusions

After running my tests, I can suggest some guidelines:

  1. It won’t work to test your whole app at once. If you have 10 different kinds of records and simply support CRUD operations, then you have at least 40 different operations. If you allow all of them, you’re not going to cover any measurable amount of your state-space.
  2. Property-based testing likely won’t be very effective when done through a browser for a whole application. The 100,000 steps I simulated took seconds to run, but with 100ms delays waiting for browsers to complete operations, the same steps would take about 10,000 seconds (nearly three hours), which isn’t feasible for a routine test run.
  3. Property-based testing is definitely useful as long as we can keep our state-space focused. For example, testing a single form through a browser may be useful (it’s slow, but we search a smaller space). Alternatively, we might get a good effect out of testing at a lower level if we focus on just business logic or testing via a REST API. This might speed up tests enough to make somewhat larger spaces feasible.
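As an illustration of the third guideline, here is a hand-rolled Python sketch of a focused test: a single domain object, three operations, and an invariant checked after every random step. The Cart class, its methods, and the invariant are all hypothetical and exist only for this example. A real project would more likely use a property-based testing library than a loop like this.

```python
import random


class Cart:
    """Hypothetical domain object, used only for illustration."""

    def __init__(self):
        self.items = {}  # sku -> quantity

    def add(self, sku, qty):
        self.items[sku] = self.items.get(sku, 0) + qty

    def remove(self, sku, qty):
        remaining = self.items.get(sku, 0) - qty
        if remaining > 0:
            self.items[sku] = remaining
        else:
            self.items.pop(sku, None)  # never keep zero or negative entries

    def clear(self):
        self.items.clear()


def check_invariants(cart, history):
    # The property: every stored quantity is strictly positive.
    assert all(q > 0 for q in cart.items.values()), f"broken after {history}"


def random_walk_test(steps=1_000, seed=None):
    """Apply random operations and check the invariant after each one."""
    rng = random.Random(seed)
    cart = Cart()
    history = []  # kept so a failure reports the sequence that caused it
    for _ in range(steps):
        op = rng.choice(("add", "remove", "clear"))
        sku = rng.choice(("A", "B", "C"))
        qty = rng.randint(1, 3)
        history.append((op, sku, qty))
        if op == "add":
            cart.add(sku, qty)
        elif op == "remove":
            cart.remove(sku, qty)
        else:
            cart.clear()
        check_invariants(cart, history)
    return len(history)


if __name__ == "__main__":
    print(random_walk_test(seed=0), "steps passed")
```

With only three operations in play, even a modest number of steps revisits the same states many times, which is the sense in which the space stays focused.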

I am very excited about the possibilities that property-based testing provides, and I hope to make significant use of it soon.

I’ll close this post with one prediction: in a property-based testing world, I think continuous integration will be just that: continuous. It’ll run 24/7 and only be restarted when people check in, a constant sentry searching for new problems.