A while back, I wrote a post about how important it is to make mistakes, and how I could never have become the developer I am today without making many of them. There were a good number of comments discussing it, and they were generally in agreement. However, a few pointed out a big area that my post did not cover at all. It is absolutely important to be allowed to make mistakes in order to learn. But…I only discussed the value of making those mistakes, small or big. I didn't go into detail on their consequences. That implicitly assumes the cost is always worth it, and that is absolutely NOT always true. So if mistakes are really that valuable, do we just give up on building a culture for them when the stakes are too high?
An Example Problem
Let’s imagine you’re at a company where every minute of production downtime costs thousands of dollars. After all, if customers can’t use your app, they can’t spend money there. Some of them may just try again later, but others may find an alternative and use that instead. In many cases the cost is not so easily measurable, such as reputational damage. There is always a cost, though, so let’s start from there.
Your boss comes to you and tells you the company is losing too much revenue from this situation. It’s now a better investment to improve reliability than to launch a few extra features. So, you buckle down and look at the common causes of these outages. You do detailed postmortems, create tickets to improve those areas of your software, and get the fixes merged. Reliability is drastically improved, and everyone is happy! You cut lost outage revenue by 50%. Job well done, time to go back to features! What more could your boss ask for?
Hidden Costs
There are a few problems with this approach, however. First, you’re making incremental progress in an uphill battle. As new features are released, new bugs will occasionally sneak in, so reliability will slowly decrease and outages will slowly increase. Eventually, you will have to repeat this whole process just to get back to exactly where you are now. In fact, if your company is growing, you may have to repeat it more frequently, and it could take longer each time. And as your company’s revenue grows, each outage costs more, shrinking your tolerance for outages on top of everything else.
The second problem is the more nuanced of the two: the cost of individual incidents hasn’t changed. By only addressing existing problems, you give new problems no additional protection. I talked in my previous post about how valuable it is to make mistakes; if reliability work only ever reacts to past failures, you risk pushing the culture to become more risk-averse. Eventually, large architecture changes get de-prioritized: they pose a lot of risk, which shifts the cost-benefit analysis used to prioritize work. Developers can end up afraid to change or fix things because they don’t want to be chastised for causing another outage.
The Alternative
What if, instead of improving the reliability of your system…you made it more fault tolerant? No matter what you do or how you do it, your system WILL go down sometimes. It is as inevitable as the passage of time. But what happens when it does? How long does it take to bring back up? How often is it happening? There are a lot of ways to tackle this problem more holistically. Make smaller deployments that are easier to roll back. Split your app into services (or smaller services) so that when one service fails, the rest of the app is less affected. Start doing release testing in a prod-like staging environment. There are so many more options. Grab the experts on your architecture, delivery process, and deployed environment and gather some ideas.
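To make the "isolate failures between services" idea concrete, here is a minimal sketch of one common pattern for it, a circuit breaker. This is my own illustrative example, not from any particular library: after a few consecutive failures it stops calling the struggling service for a cooldown period and serves a fallback instead, so one broken dependency degrades a feature rather than taking down the whole app.

```python
import time


class CircuitBreaker:
    """Toy circuit breaker: after max_failures consecutive errors, calls
    fail fast to a fallback for reset_after seconds instead of hammering
    a downstream service that is already struggling."""

    def __init__(self, max_failures=3, reset_after=30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed (healthy)

    def call(self, fn, fallback):
        # While the circuit is open, skip the real call and degrade gracefully.
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                return fallback()
            # Cooldown elapsed: close the circuit and try the real call again.
            self.opened_at = None
            self.failures = 0
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()  # open the circuit
            return fallback()
        self.failures = 0  # success resets the failure streak
        return result
```

If, say, a hypothetical recommendations service is down, the fallback might return an empty list: the homepage loses one widget, but checkout keeps working, and the outage stops costing you thousands of dollars per minute. Real implementations add half-open probing, per-endpoint state, and metrics, but the shape is the same.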
This has the added benefit of solving both of the hidden costs stated above. It makes a lasting impact on reliability, while *also* making the environment safer for mistakes. That means lower costs for keeping up with reliability and bug fixes. It also means less fear of the big architecture changes that could drastically improve your software. Finally…it means your developers can feel free to try things and learn from them, even when they make mistakes that have negative consequences.
There Is No Bulletproof Solution
It’s important to remember that nothing is perfect. No matter how much you improve your process or fault tolerance, sometimes it will fail. AWS still goes down, and the amount of money they spend trying to make sure that never happens is unimaginable. This is not about removing all consequences from mistakes. It’s about mitigating them as best you can, so that the culture stays healthy. Ask me how many times I’ve heard, “No, we shouldn’t change that. It’s super risky.” I’ve lost count. Don’t let that become your culture.
Push for systemic, lasting ways to build safety nets into your software and processes. Then, once those are in place, reap the benefits as you and your developers make mistakes and learn valuable lessons from them, becoming the best developers you can be.