Debugging a recent project has been surprisingly challenging. It’s a complicated product with multiple components, but that’s nothing new. The customer’s QA department has done great work, but it still feels like this is harder than it should be.
Last week, I think I figured out what’s different. It struck me that a single metric like “complexity” is too one-dimensional to represent the whole picture. What matters isn’t overall complexity. It’s the ratio:
complexity of functionality : visibility into that functionality
What Do I Mean?
My project has a native client, distributed devices, and a central control node, plus an administrative interface. Additionally, most of the state is ephemeral and not committed to disk. There are internal communications channels between most of the pieces.
Yet…the client UI has one number input and three buttons. That’s pretty much it. This means that failures mostly look the same, even though there are many possible causes. In other words, we have lots of complexity and little visibility.
Here are a couple of other sample applications:
- A simple todo list app may have low complexity (since it doesn’t do much) and high visibility (since it’s all user-facing). This project would be easy to debug.
- On the other end of the spectrum, a line of business application might be very complicated, but it’s likely that there are many screens and reports to provide visibility into the data and the internals of the system. With high complexity and high visibility, this project should be manageable.
I don’t think that high complexity/low visibility systems are as uncommon as they might seem at first glance. Any IoT project will likely have a number of units with little UI and stateful behavior, for example. Workflow systems, where multiple people move work through a process, are also likely candidates. Each individual app UI might be simple, but they may interact in a complex process.
Debugging Complex, Low-Visibility Systems
If you’re dealing with a system like this, here are a few tips for staying sane.
1. Buffer in the project schedule.
The less that system state is reflected in your user interface, the less information is available for users to include in bug reports. That means that each report takes correspondingly longer to really understand, triage, fix, validate, and ship.
Leave extra time in the schedule for this digging. Even if you decide a particular bug is low-impact and fixes can be put off until later, you’ll still need to verify that you understand the bug and its impact. That takes time.
2. Logs are a must-have.
For most systems, diagnostic output is handy. For systems where logs provide the majority of the information about your project’s internal state, they’re an absolute must-have. This [isn’t new](https://spin.atomicobject.com/2017/01/17/logs-are-features-too/), but it’s more important here.
I’m not referring to just diagnostic info, “got here” messages, or stack traces. You need to know what kinds of business transactions are going through the system, how long your queues are, and when messages and scheduled jobs are dropped or overridden.
And you must ensure that your logs make sense together. If you have multiple systems, make sure that your timestamps correlate across them. If possible, include transaction IDs that follow data through your system.
If your UI doesn’t give an excellent view into the system, then you need to make sure that your logs do; they’re all you’ve got!
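To make the logging advice concrete, here’s a minimal sketch of what structured, correlated log lines might look like. The service names, event names, and `log_event` helper are all hypothetical, but the ideas are the ones above: machine-parseable records, comparable timestamps, and a transaction ID that follows data across components.

```python
import json
import logging
import time
import uuid

def make_logger(service):
    # One logger per component, writing plain JSON lines.
    logger = logging.getLogger(service)
    logger.setLevel(logging.INFO)
    handler = logging.StreamHandler()
    handler.setFormatter(logging.Formatter("%(message)s"))
    logger.addHandler(handler)
    return logger

def log_event(logger, service, event, txn_id, **fields):
    # UTC epoch timestamps correlate trivially across machines
    # (assuming clocks are synced, e.g. via NTP).
    record = {"ts": time.time(), "service": service,
              "txn": txn_id, "event": event, **fields}
    logger.info(json.dumps(record))
    return record

# A transaction ID is minted at the edge and passed along with the data,
# so one business transaction can be traced through every component.
txn = uuid.uuid4().hex
client_log = make_logger("client")
control_log = make_logger("control-node")
log_event(client_log, "client", "job_submitted", txn, queue_depth=3)
log_event(control_log, "control-node", "job_dispatched", txn, device="unit-7")
```

Because every line carries the same `txn` value, grepping one ID reconstructs the whole path of a transaction, even across processes and machines.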
3. Use simulators.
Systems like this often make it hard to reach a particular state on demand.
You may need to build in some simulation capabilities for some parts. This step is frequently needed for load testing, but it can be useful beyond that, too. For example, you can make sure you send certain messages, respond slowly (or badly!), and otherwise validate things in ways that you might not be able to do through a production user interface.
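As a sketch of what such a simulation capability might look like, here’s a hypothetical stand-in for a remote device. The class, its knobs, and the message shapes are invented for illustration; the point is that a simulator can force states a production UI never could, like slow replies, garbage responses, or dropped messages.

```python
import random
import time

class SimulatedDevice:
    """Fake device endpoint with configurable misbehavior."""

    def __init__(self, latency_s=0.0, failure_mode=None, seed=None):
        self.latency_s = latency_s        # artificial delay per request
        self.failure_mode = failure_mode  # None, "garbage", or "drop"
        self.rng = random.Random(seed)    # seeded for reproducible tests

    def handle(self, message):
        time.sleep(self.latency_s)        # respond slowly on demand
        if self.failure_mode == "drop":
            return None                   # silently lose the message
        if self.failure_mode == "garbage":
            # Reply with a malformed status the real device never sends.
            return {"status": self.rng.choice(["???", "", "0xDEAD"])}
        return {"status": "ok", "echo": message}

# Exercise control logic against a slow, misbehaving device:
flaky = SimulatedDevice(latency_s=0.05, failure_mode="garbage", seed=42)
print(flaky.handle({"cmd": "set_level", "value": 7}))
```

Seeding the random generator keeps failures reproducible, which matters when you’re trying to pin down a bug rather than just provoke one.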
4. Validation requires the whole team.
Exploratory testing is incredibly valuable, but it can be hard to apply if your project has very few obvious inputs. When you have an address input field, there are well-defined ways to test it: valid values, long values, international values, etc. But when you have few inputs for many back-end processes, it can be hard to understand how to explore the state-space.
In cases like this, the whole team will need to come together. Developers can provide details on how the systems interact and where inputs are being routed. Testers can help come up with error cases based on the system.
It’s a Spectrum
What is your project like? Mine may be at the extreme, but every project falls somewhere on this spectrum. Take a look at how transparent your systems and tools are, and adjust your estimates, schedule, and features to match the risk they present.