Don’t Forget to Kick the Tires – On Automatic Monitoring & Human Intervention

As much as we rely on modern technology to automatically monitor, observe, and report on so many different systems, I think it is important to manually check the viability of those same systems from time to time.

While the quality of sensors and metrics always seems to be improving, there are still loopholes and lapses that make human oversight and investigation important. In the past few months I have experienced a few situations where an electronic or computer-based monitoring solution failed in ways that could have been particularly costly; yet human intuition and manual intervention ensured that the situation was addressed appropriately, avoiding real damage or loss.

I’d like to share two examples of situations where sophisticated automatic monitoring systems failed and human oversight was critical to diagnosing and correcting the problem. I think this highlights the importance of not blindly relying on our modern electronics and computers, but instead continuously appraising and challenging those systems, something that becomes ever more important as we enter the new era of autonomous vehicles and algorithmic high-frequency trading.

Example 1: The Door

Our building has an electronic security and access control system. Typically, doors are in a physically locked position (bolt extended), with entry only allowed via an electronic strike-plate which releases when an appropriate keycard is swiped at an electronic sensor. Exit from these same doors is accomplished via a deadlatch exit paddle which causes the bolt to retract when pushed, then restores the bolt to the extended position.

This is a very typical solution for securing commercial entry doors, and it is backed by a sophisticated computer system that monitors whether the doors are closed and the electronic strike-plate has been released to open the door. The sensors and controls work as designed, and everything seems very secure.

However, a recent incident revealed a flaw in the system: there was no verification that the door’s bolt was actually extended. A fault in the deadlatch exit paddle caused the bolt to remain retracted. From the perspective of the security and access control system, the door was locked and secured. In reality, it was completely unlocked and could be opened by anyone, without any key or keycard.

The problem was noticed only because someone physically tugged on the door handle when it should have been locked. While the system had not failed explicitly, the sense of security provided by the modern electronic system nearly led us to overlook the most basic element of security: making sure the door is actually locked. Our new policy is to physically check that every door is properly secured at the end of the day.

Example 2: The Website Deployment

I typically employ automated monitoring systems to check our websites and applications for availability. The availability check is straightforward: every minute, the monitoring system sends an HTTP GET request to our site, checks that the HTTP status code is “200 OK”, and verifies that the response contains a certain text string corresponding to content on the page. This indicates with a high degree of certainty that the target website or application is “up” and available to handle requests.
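
For illustration, the check amounts to something like the following minimal sketch in Python, using the requests library; the URL and the expected text are placeholders for this example, not our actual configuration:

    import requests

    # Placeholder values for illustration (not our real site or marker text).
    URL = "https://example.com/"
    EXPECTED_TEXT = "Welcome to Example"

    def site_is_up(url=URL, expected_text=EXPECTED_TEXT):
        """Return True if the site answers 200 OK and the body contains the marker text."""
        try:
            response = requests.get(url, timeout=10)
        except requests.RequestException:
            # Timeouts, DNS failures, connection errors, etc. all count as "down".
            return False
        return response.status_code == 200 and expected_text in response.text

A check like this runs every minute, and anything other than a 200 response containing the marker text triggers an alert.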

When deploying minor updates to a web application, I had often relied on our automated monitoring and alerting system to notify me if there was an issue with the deployment. For large or complex updates, I would usually verify manually that nothing had broken. For small changes, like updating a copyright date, I often skipped immediate validation and trusted our monitoring system to alert me if something went wrong.

During a recent deployment of a small update, the web application became unavailable, but our monitoring system did not alert me to the issue. Due to a bug, the application hit an error but still returned a “200 OK” HTTP status code, even though the application itself was entirely unavailable (visitors saw a very prominent error message). Worse, the error response still contained the text string that the monitoring system used to verify the health of the application, even though that text was not visible on the page.
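
As a simplified illustration of how that false positive can happen (the hidden markup below is an assumption for the example; all that is certain is that the marker text remained somewhere in the response without being visible on the page):

    # The error page still returns 200 OK, and its HTML still contains the
    # marker string (here, inside markup that is never rendered visibly),
    # so a "status 200 + substring" check passes even though users see an error.
    ERROR_PAGE_HTML = """
    <html>
      <body>
        <h1>An unexpected error has occurred.</h1>
        <div style="display:none">Welcome to Example</div>
      </body>
    </html>
    """

    EXPECTED_TEXT = "Welcome to Example"
    status_code = 200  # the buggy application still returned 200 OK

    # The naive availability check passes, so no alert is ever raised.
    assert status_code == 200 and EXPECTED_TEXT in ERROR_PAGE_HTML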

Shortly after the deployment, I happened to load the web application to look something up. I was surprised by the prominent error and scrambled to fix the problem. Afterwards, I investigated why there had been no alert from our monitoring system about the downtime and found that its checks for an error state had been confounded by the bug in the web application.

I now make it my standard practice to always do a manual check of applications whenever I deploy any changes, no matter how small. Depending on the application, even a brief glance can provide a higher degree of certainty than relying on the monitoring system alone.

Lessons

While these are not particularly staggering examples of technology failure, I think the two instances provide good evidence of why it’s important for humans to critically challenge modern electronic and computer-based monitoring systems by continually checking assumptions, validating state, and probing for weaknesses.

Monitoring systems can only be programmed for a finite set of situations, and they can only predict a limited set of outcomes. There will always be things overlooked, situations not previously encountered, or complex interactions not anticipated. Algorithms, sensors, and heuristics will continually improve, but they will never be a substitute for the intuition, common sense, and observational prowess of humans.