I’ve been working on a multi-threaded, distributed system, and there have been some bugs that only manifested when things lined up exactly right. While the project’s deterministic elements have lots of unit and system tests, every once in a while mysterious failures have appeared.
On top of the other tests, the code is full of asserts for numerous boundary conditions, and stress tests intentionally overload the system in several ways, trying to make those asserts trigger. While the system is generally stable, every once in a while something still goes wrong due to unusual thread interleaving or network timing, and these issues can be extraordinarily difficult to reproduce.
Streamlining Info Gathering
Early efforts to fix rare bugs like these tend to center on gathering information: narrowing down the code paths involved and finding patterns in the interactions that cause them. This process can take quite a while, so I wrote a tool to streamline things. I named it “autoclave” because it applies prolonged heat and pressure to eliminate bugs: a pressure cooker for programs.
autoclave runs programs over and over, to the point of failure. It handles logging and log rotation and can call a failure handler script to send a push notification, analyze a core dump, or freeze a process and attach a debugger. When investigating issues that may take 10 hours to reproduce, it can be tremendously helpful to know there will be logs and an active debugger connection to hit the ground running the next morning, or to get an e-mail with logs attached when a failure is found.
For example, we had a network request with a race condition in its error handling. When the request succeeded, everything was fine, but if it failed with a specific kind of recoverable error, there was a narrow window in which one thread could free() memory associated with the request before another had finished using it to handle the error. This led to a crash, or was caught by an assert shortly afterward. By stressing it with autoclave, we were able to get logs from a couple of failures, spot a common pattern, and narrow the root cause down to that error handling path.
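To make the shape of that bug concrete, here is a compilable sketch of the same class of race (with made-up names; this is not the project's actual code): one thread frees a request while another thread is still using it to handle an error. Run it enough times under stress and it will eventually corrupt memory, crash, or trip an assert.

/* use_after_free_sketch.c -- hypothetical example, not the real code.
 * Build with: cc -pthread use_after_free_sketch.c */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

typedef struct { char endpoint[64]; int attempts; } request;

/* Thread B: handles a recoverable error using the request. */
static void *error_handler(void *arg) {
    request *req = (request *)arg;
    usleep(rand() % 1000);                  /* scheduling jitter */
    req->attempts++;                        /* use-after-free if main won the race */
    printf("retrying %s (attempt %d)\n", req->endpoint, req->attempts);
    return NULL;
}

int main(void) {
    request *req = calloc(1, sizeof(*req));
    strcpy(req->endpoint, "/example");

    pthread_t handler;
    pthread_create(&handler, NULL, error_handler, req);

    /* Thread A: tears the request down without waiting for the handler. */
    usleep(rand() % 1000);
    free(req);                              /* BUG: handler may still be reading req */

    pthread_join(handler, NULL);
    return 0;
}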
Basic Usage
To run a stress test program over and over until it fails:
$ autoclave stress_test_program_name
To log standard output and standard error, and just print run counts, use -s:
$ autoclave -s stress_test_program_name
To rotate log files, use -c (here, keeping 10):
$ autoclave -s -c 10 stress_test_program_name
To specify a handler script/program for failure events, use -x:
$ autoclave -s -c 10 -x gdb_it stress_test_program_name
While the program doesn’t need any special changes to work with autoclave, it can be useful to test for a failure condition and then spin in an infinite loop, which keeps the process alive with its state intact. Running autoclave with a timeout then becomes an easy way to attach a debugger at the moment things go wrong. For example, to freeze the process and attach a debugger if the stress test doesn’t complete within 60 seconds:
// in the program, add:
if (some_condition) { for (;;) {} } // infinite loop => timeout
$ autoclave -s -c 10 -x gdb_it -t 60 stress_test_program_name
(Attaching a debugger on timeout is also useful for investigating deadlocks.)
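As an illustration, here is a minimal, hypothetical lock-ordering deadlock (assuming pthreads), the kind of bug that hangs rather than crashes. Run under autoclave with -t, the unlucky interleaving gets frozen with both locks held, and the handler can attach a debugger to see which thread is stuck on which lock.

/* deadlock_sketch.c -- hypothetical example of a classic lock-ordering bug.
 * Build with: cc -pthread deadlock_sketch.c */
#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t lock_a = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t lock_b = PTHREAD_MUTEX_INITIALIZER;

/* Worker takes B then A... */
static void *worker(void *arg) {
    (void)arg;
    pthread_mutex_lock(&lock_b);
    pthread_mutex_lock(&lock_a);
    pthread_mutex_unlock(&lock_a);
    pthread_mutex_unlock(&lock_b);
    return NULL;
}

int main(void) {
    pthread_t t;
    pthread_create(&t, NULL, worker, NULL);

    /* ...while main takes A then B: the wrong interleaving blocks both forever. */
    pthread_mutex_lock(&lock_a);
    pthread_mutex_lock(&lock_b);
    pthread_mutex_unlock(&lock_b);
    pthread_mutex_unlock(&lock_a);

    pthread_join(t, NULL);
    printf("no deadlock this run\n");
    return 0;
}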
While automated testing is usually a better use of time than debugging issues after the fact, some bugs are inevitable, particularly in multi-threaded systems. It’s good to be prepared when surprises occur. Before a bug can be captured in a failing test, it’s often necessary to gather more raw data, and tools can help automate this as well.