You’ve built your application. It does everything the customer asked for and everything he needs. Your designer’s polish has impressed everyone and your tool is a dream to use. Your comprehensive system test suite runs clean. Your exploratory tester can’t find any bugs you’ve missed. And you’ve got it deployed and running in its final home. So you must be done, right?
Wrong. It might be deployed and running – but now you have to keep it that way.
As tempting as it is to ship it and forget it, there are many less-than-glamorous tasks remaining. At this point you’ll have done everything you can to ensure that things go smoothly. Now it’s time to make sure you can deal with the problems that come up when things stop going so smoothly. Here are a few things I consider whenever I ship an application, so that when (not “if”) something goes wrong I will at least know what happened, why it happened, and how to fix it. This list is a bit server-centric, but the ideas apply to all types of software and it’s a good starting point.
You probably don’t need to write a comprehensive user manual, but if you can get some basic information about the installation recorded, the person who has to fix the next issue will thank you. Here are some things to record:
- How do I build the application?
- How do I run the automated tests to make sure I didn’t break it?
- Server information: where is it installed? What is the username and password?
- Application information: what is the test username and password? What is the administrator’s username and password? Where do I log in?
- Contact information for everyone important: The hosting company or organization, the customer, etc.
- Some basic use: How do I start the app? How do I make sure it’s running? How do I stop and restart it?
While perfect, up-to-date, comprehensive documentation is always great to have (albeit not so great to write and maintain), having just the information above can be an incredible boon to the confused and tired tech getting a call at 2 AM.
There are a few simple things that can be very easy to miss when deploying an app — and you won’t notice them missing until it’s too late:
- Log rotation. A high traffic site can generate a large quantity of logs very quickly. Make sure they’re getting compressed and discarded when appropriate.
- Start on boot. This is distressingly easy to forget due to the high uptimes most of us enjoy, but when you get nailed by a power failure and are forced to reboot, it’s important that all your services come up clean.
- Versioning. Make sure you track what version of your application is running in production so that your developers can reproduce errors.
- Backups. Database backups, filesystem backups, source code backups. In particular, make sure things like your private keys are dealt with somehow. Get your backups off site in case of disaster. And remember, off-site means far away. Getting a hard drive across the street doesn’t do much for you when a Hurricane Katrina happens.
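As an illustration of the log rotation item above, here is a minimal sketch using Python’s standard `logging` module. It assumes the application writes its own log file (rather than relying on a system tool such as logrotate); the logger name, size limit, and backup count are placeholder values, not recommendations.

```python
import gzip
import logging
import logging.handlers
import os
import shutil


def gzip_rotator(source: str, dest: str) -> None:
    """Compress a rotated log file to `dest`, then delete the uncompressed copy."""
    with open(source, "rb") as src, gzip.open(dest, "wb") as dst:
        shutil.copyfileobj(src, dst)
    os.remove(source)


def build_rotating_logger(path: str, max_bytes: int = 10_000_000,
                          keep: int = 5, name: str = "app") -> logging.Logger:
    """Return a logger that rotates at about `max_bytes` and keeps `keep` gzipped backups.

    The default values and the logger name are assumptions for this sketch.
    """
    handler = logging.handlers.RotatingFileHandler(
        path, maxBytes=max_bytes, backupCount=keep
    )
    handler.rotator = gzip_rotator              # compress each file as it is rotated out
    handler.namer = lambda fname: fname + ".gz"  # app.log.1 becomes app.log.1.gz
    handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))

    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.addHandler(handler)
    return logger
```

Older backups beyond `keep` are discarded automatically by the handler, which covers the “compressed and discarded” part of the item in one place.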
Monitoring only really applies to server-side applications, but for these it really is critical. Here are a few things to consider monitoring:
- Basic accessibility: is the server up and responding to requests from the outside world? This is a good place to use a 3rd-party monitoring service like Alerta, or a tool like Nagios running in a different datacenter.
- Application status: a healthcheck page built into your application is valuable if your install is complicated.
- Server monitoring: Is Apache running? Is Passenger running? Is memcache running? Watch each of these individually for uptime, memory & CPU, etc. A tool like Nagios or Monit is very useful here. Additionally, consider monitoring basic server health: disk space, swap, and similar metrics.
- Reactive monitoring: It can be useful for some applications to restart some or all of the components automatically if they begin consuming excess memory or CPU. It is always best to write well-behaved applications, but this is a way of dealing with ones that aren’t.
The basic idea here follows the same testing principles you used in your automated test suite (you do have an automated test suite, don’t you?). Monitoring the complete application stack from the outside gives you an effective way to answer the question “is it up?”, while internal server monitoring is more like a unit test suite that tells you which piece isn’t working right. There’s some redundancy there, of course, but if you hit each of them you have an excellent chance of knowing about any given failure.
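To make the “unit test” analogy concrete, here is a rough sketch of what the data behind a healthcheck page might look like in Python. The function names and the 10% free-disk threshold are made up for illustration; a tool like Nagios or Monit would normally do this job, and this only shows the shape of the idea.

```python
import shutil
import socket


def check_disk(path: str = "/", min_free_fraction: float = 0.10) -> bool:
    """True if at least `min_free_fraction` of the disk holding `path` is free."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        return False
    return usage.free / usage.total >= min_free_fraction


def check_port(host: str, port: int, timeout: float = 2.0) -> bool:
    """True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def health_report(checks: dict) -> dict:
    """Run each named check; a check that raises is counted as failed."""
    results = {}
    for name, check in checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            results[name] = False
    # "ok" answers "is it up?"; "checks" says which piece is not working.
    return {"ok": all(results.values()), "checks": results}
```

A healthcheck page could then return something like `health_report({"disk": check_disk, "memcached": lambda: check_port("127.0.0.1", 11211)})` as JSON, where 11211 is memcached’s default port. The outside monitor only needs to look at `ok`, while the person debugging can read the individual checks.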
So something went wrong (as something always does). Now what? Try these tools and techniques:
- For web sites, install New Relic. It can monitor performance, uptime, slow queries, and on some platforms (such as Rails) capture things like stack traces when errors occur.
- Logging. Many web frameworks come with good logging built in, but for other kinds of applications and more complicated apps, it is well worth making sure your app’s basic operations are well recorded. You’ll be grateful when you get an error report that reads little more than “It’s broken.” and you can pull up a session log for that user and that time period.
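For the logging item, one way to make “pull up a session log for that user and that time period” possible is to stamp every log line with the user and session. The sketch below uses Python’s standard `logging.LoggerAdapter`; the logger name, field names, and log format are assumptions for illustration, not a prescribed layout.

```python
import logging
import sys

# Every line carries a timestamp, the user, and the session, so a plain
# text search can later isolate one user's activity in a time window.
LOG_FORMAT = "%(asctime)s %(levelname)s user=%(user)s session=%(session)s %(message)s"


def build_app_logger(stream=sys.stderr) -> logging.Logger:
    """Create the base logger (the name 'app.sessions' is a placeholder)."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter(LOG_FORMAT))
    logger = logging.getLogger("app.sessions")
    logger.setLevel(logging.INFO)
    logger.addHandler(handler)
    return logger


def for_session(logger: logging.Logger, user: str, session_id: str) -> logging.LoggerAdapter:
    """Wrap the logger so each message is tagged with this user and session.

    Messages logged directly on the base logger, without the adapter, would
    lack the user and session fields this format expects.
    """
    return logging.LoggerAdapter(logger, {"user": user, "session": session_id})
```

In a request handler you would call something like `log = for_session(base_logger, current_user, session_id)` once and then use `log.info(...)` as usual.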
That’s my attempt at a checklist for deployment. There are a lot of things to do beyond “cap deploy” – have I forgotten anything? What do your deployment checklists look like?