It’s finally here — the day of go-live. You’ve prepared yourself, practically and emotionally, to be on the lookout for problems and able to identify them when they arise. In part one of this three-part series, we discussed what could go wrong, and in part two, we discussed how you’ll know when something is going wrong.
So what happens when an alarm does go off? The answer to this question is going to vary wildly depending on your specific project, so you’ll need to have this conversation with your team. Here are some questions to ask to get it started.
A note about triage: It’s very easy during a public launch to panic when your first Rollbar notification hits. Before you drop your Friday night plans to debug production issues, take a close look at the scale of the problems you’re seeing and determine, in concert with any stakeholders, how you want to prioritize them. If an issue is only happening when someone accesses your site from the default browser on their 2007-era phone, it might be able to wait until later.
Debugging and fixing prod issues can be high-stakes and high-stress. Negotiate these questions well in advance to save yourself lost sleep and grumpy feelings later.
- Who is on the hook for keeping an eye on their notifications, and during what hours? Humans have to sleep sometimes. If you don’t have round-the-clock support, are you prepared to accept the risk of letting an issue go until the morning?
- Who needs to be updated when there’s an issue? Set expectations for where this conversation will happen and who will be involved. It’s better to have a frank all-team conversation about this early rather than suffering heart palpitations at every text message that hits your personal cell while trying to enjoy your weekend.
- Who should resolve a given issue? Based on your team structure, there are lots of potential answers. Maybe the whole dev team swarms on a given issue until the problem area is identified, or maybe it’s assumed that whoever is on-call can handle anything that comes up. Be wary of parts of the system that “only Person X can touch.”
- Consider arranging for pairs of developers to be on-call together. The benefits of pair programming are heightened under high-stakes circumstances, and a little camaraderie can go a long way, too.
Once you’ve identified the issue, be ready to get in there and work quickly. A little prep work can prevent you from feeling like you’re shooting from the hip.
- Have a staging environment ready that closely matches prod, so you can test fixes there first. Could you set up a script to duplicate the prod environment at will so that you have representative data?
- If the issue is client-side, can you test your fixed local client against the prod API? Doing this can help give you some certainty that your fix really will solve the issue.
- Consider outside-the-box solutions. It might be better to turn off a problem feature entirely until there’s more time to debug, or you might be able to creatively sidestep a UI issue by replacing a problem component with something simpler.
- Get in a first-aid mindset. Save the elegant architectural adjustments for when the team can afford them; you want low-risk interventions that stop the bleeding and get that dashboard back up with minimal downtime.
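One common shape for that kind of low-risk intervention is a kill switch: a flag you can flip in configuration to disable a problem feature without a code deploy. Here's a minimal sketch in Python; the environment-variable naming scheme and the `export_widget` feature are hypothetical, so adapt the idea to whatever config system your project already uses.

```python
import os


def feature_enabled(name: str) -> bool:
    """Kill-switch check: a feature is on unless config says otherwise.

    Flags are read from environment variables (e.g. FEATURE_EXPORT_WIDGET=off),
    so a misbehaving feature can be disabled with a config change and restart
    rather than an emergency code deploy. Names here are illustrative.
    """
    value = os.environ.get(f"FEATURE_{name.upper()}", "on")
    return value.lower() not in ("off", "false", "0")


def render_dashboard() -> str:
    # Sidestep the problem component with something simpler until
    # there's time to debug it properly.
    if feature_enabled("export_widget"):
        return "dashboard + export widget"
    return "dashboard (export temporarily disabled)"
```

The same pattern works with a feature-flag service or a database-backed config table; the point is that turning the feature off is a one-line operational change, not a Friday-night deploy.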
## Deployment & Validation
You’ve got a fix. Great! Now, don’t just slap it on prod and hope for the best…
- Agree in advance on what your branching policy will be. Negotiating at 9 p.m. over who’s merging what to which branch after whose PR review is not how anyone wants to spend a Saturday.
- Similarly, agree on your policy for testing fixes. Who needs to look at a change before it goes live? Do you need QA approval, and if so, what’s the expected turnaround time on that? How much power should a developer have to deploy experience-changing fixes in the absence of the product owner?
- Consider a deployment style that allows you to soft-launch your fix and give it a final test before committing to it. It’s common to use blue/green deployments for this purpose, but there are lots of ways to protect yourself against risk here.
- After launching your fix, are there metrics that you expect to see improve? Are there new metrics that you want to track to catch similar problems in the future?
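The blue/green idea boils down to: deploy the fix to the idle environment, run your checks against it, and only then flip live traffic over, so a bad fix never reaches users. Here's an illustrative sketch; the `router` dict stands in for whatever actually directs traffic in your setup (a load balancer target group, a DNS record), and the names are assumptions, not a real deployment API.

```python
def promote_if_healthy(router: dict, candidate: str, health_checks: list) -> bool:
    """Flip live traffic to `candidate` only if every health check passes.

    `router` is a stand-in for your real traffic director; `health_checks`
    is a list of callables that take an environment name and return True
    on success (e.g. an HTTP check on /healthz, a login smoke test).
    """
    if all(check(candidate) for check in health_checks):
        router["live"] = candidate
        return True
    # Checks failed: live traffic stays on the current environment untouched.
    return False


# Soft-launch: the fix is deployed to the idle "green" stack, smoke-tested,
# and promoted only if the checks pass.
router = {"live": "blue"}
smoke_checks = [lambda env: True]  # replace with real checks against `env`
promote_if_healthy(router, "green", smoke_checks)
```

Rolling back is the same operation in reverse: the old environment is still running, so flipping the pointer back is cheap and fast.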
Remember that no matter how thorough and prepared you are, there will be problems you didn’t see coming. Software is built by humans, and humans make mistakes. Your team will be stronger for it if you can discuss the problems that do come up with a sense of grace and an eye for improvement rather than shame. For truly big issues, after (and only after) everyone’s had a chance to decompress and high-five over the fix, plan a team discussion. As a team, look objectively at what went wrong and what you’ll watch out for next time. There are few things more valuable to a team than hard-earned experience.
Maybe you can even write a short blog series about what you learned! 😉