So you’re staring down a big product launch, and you’ve got a long list of anxieties. In part one of this three-part series, we discussed planning to prevent the various categories of things that could go wrong. Perhaps you’ve done your best to test and dry-run, but you’re still feeling uncertain. To quell your fears and do your due diligence, take your test-driven development practice to a more meta level. Instead of watching your [unit] tests fail, how might you monitor the way your launch issues manifest?
What symptoms should I look for?
For each problem scenario, start by asking yourself: “What symptoms would we see if this happened?”
- Collect your stats: Ask yourself what kinds of signals might alert you to the potential problems you’ve identified. An incomplete starter list might include overall site hits, hits by endpoint, account creations/abandonments, error rates, DB reads and writes, cache hit/miss rates, time to first meaningful paint, and so on.
- Know your stats: What do you expect your hit rate, read/write rate, and response times to look like? Run some back-of-the-napkin math and write down your expectations. These will help you understand what’s cause for alarm and what’s just an interesting data point.
- Set some invariants: If it’s a brand new product, maybe you have no idea what to expect in terms of hit rates and the like. In that case, develop estimates of how you expect your statistics to scale relative to each other. Maybe you expect the number of database reads to scale linearly with the number of accounts. Then, if your reads start skyrocketing, you won’t be left wondering whether there’s an issue or just a healthy surge in signups.
- Identify the guard: Having a bunch of data collecting in an S3 bucket doesn’t do you any good if nobody is keeping an eye on it. Explicitly identify who should monitor these stats, during what time, and how frequently. For the problem thresholds you’ve identified, what alarms can you automate? Many hosting platforms and monitoring tools provide features that can alert you to issues via email or text.
- Test your tests: This step is the most important. Once you have a theory for how problems will manifest and how you will catch them, validate it. Simulate N > threshold errors and confirm that your team is notified. Crank down your server resources or write some wildly expensive queries. Make sure that when your app does crack under pressure, your monitoring tools let you know. Taking these steps ensures that you don’t go into launch day with a false sense of security, only to be unpleasantly surprised by angry phone calls while your monitoring dashboards are giving the all-clear.
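The invariant idea above can be sketched as a simple ratio check. The metric names, expected ratio, and tolerance here are all hypothetical placeholders; in practice you’d plug in your own back-of-the-napkin numbers:

```python
# Sketch of an invariant check: DB reads should scale roughly linearly
# with account count. The expected ratio and tolerance are made up.

def reads_per_account_ok(db_reads: int, accounts: int,
                         expected_ratio: float = 20.0,
                         tolerance: float = 0.5) -> bool:
    """Return True if reads/account is within tolerance of the expectation."""
    if accounts == 0:
        # No accounts yet: any reads at all would be surprising.
        return db_reads == 0
    ratio = db_reads / accounts
    lower = expected_ratio * (1 - tolerance)
    upper = expected_ratio * (1 + tolerance)
    return lower <= ratio <= upper
```

The payoff of checking the ratio instead of the raw count: a healthy signup surge keeps the ratio stable, while a runaway query pattern breaks it.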
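The “test your tests” drill can itself be rehearsed in code. This is a minimal sketch, not a real monitoring integration: the `ERROR_THRESHOLD` value and the `notify` hook are hypothetical stand-ins for whatever your alerting tool provides. The point is to inject N > threshold errors and confirm that the alert path actually fires:

```python
# Sketch: simulate crossing an error threshold and confirm the alert
# path fires. ERROR_THRESHOLD and the notify hook are hypothetical.

ERROR_THRESHOLD = 50

class AlertMonitor:
    def __init__(self, notify):
        self.errors = 0
        self.notify = notify  # in real life: sends an email or text

    def record_error(self):
        self.errors += 1
        if self.errors > ERROR_THRESHOLD:
            self.notify(f"error count {self.errors} exceeds {ERROR_THRESHOLD}")

# The drill: inject one more error than the threshold allows,
# then check that someone actually got paged.
alerts = []
monitor = AlertMonitor(notify=alerts.append)
for _ in range(ERROR_THRESHOLD + 1):
    monitor.record_error()
```

If `alerts` comes back empty during a drill like this, you’ve found a broken pager path before launch day found it for you.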
Okay, so what do I do about all of this?
You’ve got your dashboards and alarms ready, and you’ve identified all of your problem vectors. You’re as prepared as you can be, right? But what will you do when something does go wrong? In the next post, we’ll discuss wartime debugging, testing, and deployment so that you’re ready for whatever comes down the pipe on the big day.