Debugging a Complex Problem? Think Like an Epidemiologist

I’ve always felt that code should be absolute, with concrete yes or no answers. It is like that on the small scale: An “if” statement runs one branch or the other. But on the larger scale, it’s very hard to treat our code that way.

Some bugs are caused by math errors (e.g. typing in the wrong formula). These kinds of issues are often predictable and easy to resolve.

But there are other problems—problems that take up more time, may be hard to reproduce, or are caused by a confluence of multiple issues. These are the kinds of problems that caused notorious failures like the Therac-25 and the large 2003 blackouts.

Even if the consequences of errors in my code are smaller than in those examples, modern stacks have gotten so complicated that the problems are still very challenging. These problems are fuzzier, less well bounded, and more organic. Symptoms and causes are often far apart and interact in non-obvious ways.

Problem-Solving Like Epidemiologists

It turns out that the idea of problems being “organic” is actually useful. We can’t solve them just by reading code top to bottom. We need to use other approaches—indirect approaches.

This has long been a challenge in clinical medical research, where it’s hard to know things. You are trying to answer questions like: “Did the medication actually bring the fever down, or did the fever happen to break that particular day for some other reason?” To answer that, you need to collect data and evaluate it. And you’re never going to be 100% sure.

I’ve found that it helps to consider software in similar terms. There’s a problem we need to diagnose, and (setting aside limited tools like X-rays and debuggers) we can’t see inside. We know, in theory, how things work, but it’s too complex to model in our heads. So instead, we devise experiments, collate data points, and slowly narrow our focus until we can fix the problem.

Generating Ideas with the Bradford Hill Criteria

One approach we can borrow from medical research is the Bradford Hill Criteria—nine principles to help us determine if a causal relationship is present. Hill’s criteria aren’t absolutes, but they can be useful heuristics.

When attacking complex software problems, the hardest part can be just getting past the blank page. The Bradford Hill Criteria can help by giving us specific questions to ask.

Here are some inspirations I took from four of the criteria:

Consistency

Does your problem occur every time you do a specific action? Does it behave the same across multiple environments? These questions won’t solve your problems on their own, but at least they provide some data points.

That’s an important thing to remember: having data points is valuable, even though not all of them will end up mattering. If you really have no idea, focus on gathering information first; worry about filtering and setting direction afterward.
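If the symptom can be triggered from a script or the command line, even a crude tally turns “it sometimes fails” into a number. Here’s a minimal sketch in Python; the repro script name and run count are placeholders for whatever actually exercises your bug:

```python
# Run a repro command several times and tally the outcomes.
import subprocess
from collections import Counter

REPRO_COMMAND = ["./reproduce_bug.sh"]  # hypothetical repro script
RUNS = 20

outcomes = Counter()
for _ in range(RUNS):
    result = subprocess.run(REPRO_COMMAND, capture_output=True)
    outcomes["failed" if result.returncode != 0 else "passed"] += 1

print(dict(outcomes))
```

Twenty failures out of twenty suggests a deterministic trigger; three out of twenty points toward timing, state, or environment. Either way, you now have a data point you didn’t have before.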

Specificity

Is this symptom specific? To a specific user? To a user role? To a platform?

You have a symptom that involves some subset of the entities of your system. You may not understand your problem, but you probably understand your data. Use that understanding to explore whether your problem is specific to some particular cross-section of your configuration.
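If you can export the affected records or error reports, a few lines of grouping code will show whether the symptom clusters. This sketch assumes a hypothetical CSV export with user_role and platform columns; swap in whatever dimensions your data actually has:

```python
# Count error reports per (role, platform) slice of the configuration.
import csv
from collections import Counter

counts = Counter()
with open("error_reports.csv", newline="") as f:  # hypothetical export
    for row in csv.DictReader(f):
        counts[(row["user_role"], row["platform"])] += 1

for (role, platform), n in counts.most_common():
    print(f"{role:>12} / {platform:<10} {n}")
```

If most of the reports pile up under a single slice, the bug is probably specific to that cross-section, and that narrows where to look next.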

Temporality

What else happens at the same time as your problem? It could be an AJAX request. A database query. Perhaps a mutation. A background job. Don’t forget informational things either: Do you get log output at the same time as your problem?
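One concrete way to explore this is to take the timestamp of a single occurrence and dump everything else your logs recorded around it. The sketch below assumes a hypothetical log file whose lines start with an ISO-8601 timestamp; adjust the parsing to match your own format:

```python
# Print every log line within a small window around one known failure.
from datetime import datetime, timedelta

FAILURE_AT = datetime(2024, 3, 7, 14, 32, 5)  # when the symptom appeared
WINDOW = timedelta(seconds=10)

with open("app.log") as f:  # hypothetical log file
    for line in f:
        # Assumes lines like "2024-03-07T14:32:03 worker enqueued job 1234"
        stamp = datetime.fromisoformat(line.split(" ", 1)[0])
        if abs(stamp - FAILURE_AT) <= WINDOW:
            print(line.rstrip())
```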

Biological gradient

This one is my favorite. It seems like it shouldn’t be useful in a digital world, but it’s often a great inspiration. When I apply this concept to software, I usually phrase it as the question, “Can I make this problem worse?”

For binary problems like “it crashes when I do X,” the answer may be “no.” But for problems of lag, processing time, performance, or resource utilization, you can often make the situation worse through actions such as enqueueing a bigger job, uploading a bigger file, or generating a larger dataset.
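Here’s a rough sketch of that kind of dose-response experiment: time a stand-in operation against growing inputs and see whether the cost grows in step. make_dataset() and process() are hypothetical placeholders for the slow path you actually suspect:

```python
# Time the suspect operation over progressively larger inputs.
import time

def make_dataset(n):
    return list(range(n))

def process(data):
    # Deliberately quadratic placeholder so the gradient is visible.
    return sum(x * y for x in data for y in data)

for n in (500, 1_000, 2_000, 4_000):
    start = time.perf_counter()
    process(make_dataset(n))
    print(f"n={n:>5}  {time.perf_counter() - start:.3f}s")
```

If doubling the input roughly quadruples the time, you’ve learned something about the shape of the problem even before you’ve found the cause.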

Rather than starting by inventing a possible cause, it’s often much easier to gather data. Hill provides inspiration for the different kinds of data we might want to have.

When You’re in the Dark

Finding direction when you’re stuck is hard. But by arming yourself with data, you can start to think about the sequence of events that might have caused your issue.

Hill’s criteria provide one possible starting point to attack the problem, and I was tickled by the idea that something from the squishy world of medical research could be so close to the debugging instincts I’ve developed over the years.

Do you have any strategies that you use to break through when faced with the inexplicable? Let me know!