If you experience sudden, drastic changes in the behavior of a piece of software, you should immediately suspect reaching some limit, threshold, or quota.
I spend a great deal of my time supporting and troubleshooting software deployments. Generally, if a piece of software has been running without issue for a substantial length of time, any drastic behavior change that’s not related to a service interruption is due to reaching a threshold or limit. These thresholds could be very simple (no more disk space), or more subtle (an API starting to rate-limit requests). Below, I’ll mention a few of the different limits I’ve run across, and some common patterns of observed behavior.
No Free Disk Space
Running out of free disk space will almost inevitably cause any application to grind to a halt. Whether it’s too many uploads, the latest round of kernel updates, or simply log files–a full disk will cause problems.
In such a situation, applications generally start returning errors and refusing to function. The root cause tends to be that the application cannot write temporary files to serve the request, or it cannot log details about the request (even the detail that a file couldn’t be written due to the lack of free disk space).
Error messages from running out of free disk space are obvious, generally following the form `No space left on device`. However, these errors are not always accessible unless logs are sent to a remote service, or you have a terminal on the affected server.
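When logs aren't reachable, it helps if the application itself surfaces this condition clearly. Here's a minimal sketch in Python (the helper names are hypothetical; `errno.ENOSPC` is the error code behind `No space left on device`):

```python
import errno


def is_disk_full_error(exc: OSError) -> bool:
    """Return True if the OSError indicates the filesystem is out of space."""
    return exc.errno == errno.ENOSPC


def write_safely(path: str, data: str) -> None:
    """Hypothetical wrapper: re-raise disk-full errors with a clearer message."""
    try:
        with open(path, "w") as f:
            f.write(data)
    except OSError as exc:
        if is_disk_full_error(exc):
            raise RuntimeError(
                f"No space left on device while writing {path}"
            ) from exc
        raise
```

Distinguishing `ENOSPC` from other I/O failures lets a deployment alert on "disk full" specifically, rather than burying it among generic write errors.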
Out of Available Memory
Running out of free memory tends to cause odd behavior and specific patterns of failure, rather than the outright lock-up you get from a lack of free disk space.
In most cases, running out of memory causes applications to time out on certain requests or activities, or to return empty or malformed responses. This ordinarily happens because the process handling the request or activity (or a dependent process) has been terminated by the operating system's out-of-memory (OOM) killer for consuming too much memory. Most applications have multiple components or threads; when one of these is killed by the OOM killer, the affected components or requests tend to behave oddly, while other aspects of the application may appear to continue unaffected.
On Linux, the system log provides strong evidence when the OOM killer has been at work–messages identifying the process(es) selected for termination and the memory freed clearly indicate that the system ran out of available memory. If you're working directly on an affected server, foreground processes or jobs are often terminated with just the parting message `Killed`.
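A quick way to confirm this is to scan the system log for the OOM killer's telltale lines. A small sketch (the exact kernel message format varies by kernel version, so treat these patterns as assumptions to adjust):

```python
import re

# Typical kernel log lines left behind by the OOM killer.
# Formats vary across kernel versions; adjust patterns as needed.
OOM_PATTERNS = [
    re.compile(r"Out of memory: Killed process (\d+) \((\S+)\)"),
    re.compile(r"invoked oom-killer"),
]


def find_oom_events(log_lines):
    """Return the log lines that suggest the OOM killer terminated a process."""
    return [
        line
        for line in log_lines
        if any(pattern.search(line) for pattern in OOM_PATTERNS)
    ]
```

In practice you'd feed this the output of `dmesg` or `journalctl -k`; any matches are a strong hint that mysterious process deaths were memory-related.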
Exceeded Quotas
Exceeding a quota for services or API requests tends to result in cyclical patterns of failure, occurring at certain times of day or after a certain number of days in a month.
When the quota is exceeded, the application or its components begin to return errors or become nonfunctional. When the quota resets, everything begins to function normally again. On days or in months when the quota is not exceeded, the issue does not manifest at all.
This can be a particularly troublesome issue to track down if the application itself does not receive feedback that it has exceeded its quota, or does not properly report that feedback. In such cases, inexplicable application outages or failures start and recur irregularly; once the quota is consistently exceeded, the outages or failures become consistently periodic.
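When the upstream service's feedback is opaque, tracking usage client-side at least makes quota exhaustion visible in your own logs. A minimal sketch (the class and limit are hypothetical, assuming a quota that resets each UTC day):

```python
import datetime


class DailyQuotaTracker:
    """Client-side sketch: count requests per UTC day so quota exhaustion
    shows up in your own telemetry even when the API's errors are opaque."""

    def __init__(self, daily_limit: int):
        self.daily_limit = daily_limit
        self.day = None
        self.count = 0

    def record(self, when=None) -> bool:
        """Record one request; return False once today's quota is exceeded."""
        when = when or datetime.datetime.now(datetime.timezone.utc)
        day = when.date()
        if day != self.day:
            # New day: the quota has reset, so restart the counter.
            self.day, self.count = day, 0
        self.count += 1
        return self.count <= self.daily_limit
```

Logging a warning whenever `record()` returns `False` turns an "inexplicable periodic outage" into an explicit, searchable event.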
Rate-Limiting
Having requests or services rate-limited is similar to exceeding a quota. However, rather than all requests being uniformly rejected or denied until some reset, only some are limited. For example, a rate limit may only affect requests after the first 1,000 such operations in a given second.
For requests or operations which are rate-limited, the application may behave slowly or strangely: timeouts, empty responses, generic errors, etc. When the rate-limit is no longer in effect (perhaps application load or traffic has decreased), the issue evaporates.
Similar to exceeding quotas, it is critical that a service or API provide feedback that rate-limiting is in effect (for example, making use of an ‘HTTP 429 Too Many Requests’ response, with a ‘Retry-After’ header). Further, the application must properly report information about such responses from dependent services or APIs.
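On the client side, honoring that feedback looks roughly like this. A sketch, assuming a hypothetical `do_request` callable that returns `(status_code, headers, body)`:

```python
import time


def request_with_retry(do_request, max_retries=3, sleep=time.sleep):
    """Retry on HTTP 429, waiting as long as the Retry-After header asks.

    `do_request` is a hypothetical callable returning (status, headers, body);
    `sleep` is injectable so the backoff can be tested without real delays.
    """
    for attempt in range(max_retries + 1):
        status, headers, body = do_request()
        if status != 429:
            return status, body
        # Honor the server's Retry-After hint (seconds); default to 1s.
        retry_after = int(headers.get("Retry-After", 1))
        if attempt < max_retries:
            sleep(retry_after)
    raise RuntimeError("rate-limited: retries exhausted")
```

The key design point is that the client treats 429 as a first-class, reportable condition rather than a generic failure–which is exactly the visibility that makes these issues diagnosable.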
Troubleshooting software and software deployments seems to be an odd mix of science, art, and luck. You need an understanding of the software, how it operates, and what affects it. You also need to recognize patterns of behavior and intuit the cause of what you observe. And sometimes, you just get lucky and stumble upon unexpected clues or evidence. The general limits and thresholds outlined above help me make an initial assessment of the best avenue of investigation as I dig into issues with applications and services.