Historically, our support and maintenance agreements have been on-demand and initiated by the customer observing a particular problem
that needs to be addressed.
A relatively new type of project at Atomic is the Continuous Improvement Engagement, or CIE. This type of engagement reserves a
developer’s time on a regular cadence, typically one full day each week (or month). This new model seeks to be more proactive in
sustaining the performance and overall health of our clients’ software.
From a developer’s perspective, this is quite a change. Typically we have a plan to work through. Instead, on a CIE, you’ll be discovering
and adapting the plan as each week passes. In this post, I’ll outline some of the essential activities a developer should be performing to
ensure a successful CIE.
Gathering Intelligence
The first critical activity is more of a prerequisite: make sure that you have good tools for collecting data and monitoring.
- Do you have something like New Relic that will monitor how long your requests take? Can you use it to individually identify slow database queries, API endpoints, or external services?
- Do you have a service like Rollbar that will collect errors and exceptions? Does it capture enough information to allow you to debug and reproduce these errors? Is it capturing from all relevant pieces of the system: web server, web workers, JavaScript frontend, mobile apps?
- Do you have good logging? Can you easily search the logs? Can you easily cross-reference the logs with individual errors or slow requests recorded by your monitoring services?
- Do you have alerts established, at least for egregious issues?
If the answer to any of these questions is no, closing that gap should be your first priority.
Performance Monitoring
Once you have tools in place to monitor the performance of your system, make it a weekly or monthly habit to actually look at
them. See if you can identify patterns or trends before they become urgent.
As the system grows, performance issues will inevitably crop up. It’s impossible to predict where and why they will occur, so the best you can do is keep a close eye on the system.
This can be as simple as spending a few minutes looking through New Relic. Are there any egregiously slow web requests or database
queries? Are there requests that aren’t that slow, but happen all of the time?
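One way to spot the second kind of request is to rank endpoints by the total time they consume rather than by their average. The following is a minimal sketch, assuming you can export request timings from your monitoring tool or logs; the endpoint names and durations are made up for illustration.

```python
from collections import defaultdict

# Hypothetical request samples: (endpoint, duration in milliseconds).
# Real data would come from your monitoring tool's export or your logs.
samples = (
    [("/reports/annual", 4200)]
    + [("/api/notifications", 200)] * 30
    + [("/login", 90)] * 5
)

def rank_by_total_time(samples):
    """Return (endpoint, count, mean_ms, total_ms) tuples, largest total first."""
    totals = defaultdict(lambda: [0, 0])  # endpoint -> [count, total_ms]
    for endpoint, duration_ms in samples:
        totals[endpoint][0] += 1
        totals[endpoint][1] += duration_ms
    rows = [(ep, count, total / count, total) for ep, (count, total) in totals.items()]
    return sorted(rows, key=lambda row: row[3], reverse=True)

ranked = rank_by_total_time(samples)
for endpoint, count, mean_ms, total_ms in ranked:
    print(f"{endpoint}: {count} requests, {mean_ms:.0f} ms mean, {total_ms} ms total")
```

With this sample data, the 200 ms notifications endpoint tops the list because its 30 calls add up to more time than the single 4.2 second report.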
Sometimes all it takes to speed things up is adding or tweaking a database index. Or maybe you have an accidental N+1 query problem that
you can fix without adding any new lines of code.
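For anyone who hasn’t run into an N+1 query before, here is a small sketch of the pattern using Python’s built-in sqlite3 module and an in-memory database. The tables and data are hypothetical, and the example uses plain SQL rather than an ORM. In an ORM such as Active Record, the fix is often a one-line change to eager-load the association.

```python
import sqlite3

# Hypothetical schema and data, held in memory for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE authors (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE posts (id INTEGER PRIMARY KEY, author_id INTEGER, title TEXT);
    INSERT INTO authors VALUES (1, 'Ada'), (2, 'Grace');
    INSERT INTO posts VALUES (1, 1, 'First'), (2, 2, 'Second'), (3, 1, 'Third');
""")

def titles_with_authors_n_plus_one(conn):
    """One query for the posts, then one more query per post (N+1 in total)."""
    results = []
    posts = conn.execute("SELECT author_id, title FROM posts ORDER BY id").fetchall()
    for author_id, title in posts:
        (name,) = conn.execute(
            "SELECT name FROM authors WHERE id = ?", (author_id,)
        ).fetchone()
        results.append((title, name))
    return results

def titles_with_authors_joined(conn):
    """The same data fetched with a single JOIN."""
    return conn.execute(
        "SELECT posts.title, authors.name FROM posts "
        "JOIN authors ON authors.id = posts.author_id ORDER BY posts.id"
    ).fetchall()

slow = titles_with_authors_n_plus_one(conn)
fast = titles_with_authors_joined(conn)
```

Both functions return the same rows, but the first issues four queries for three posts while the second issues one. The gap widens as the table grows.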
Other times it may require significantly more investment to address. You might need to rewrite some code. Or you may want to bring
in a tool like GraphQL to help reduce the amount of data you’re fetching from the server in a single request.
In the latter case, it’ll be much nicer to be able to plan for and prioritize this work with the customer before it becomes an urgent
problem. If the customer is hearing complaints from their users, you’ve waited too long.
Error Triage
Similar to performance, you can’t predict how your system will fail once real people are using it.
Some facts about errors:
- The only way your customer will know about an error is if one of their own customers complains or they hit it themselves.
- Not all errors are actually worth fixing.
- An error that occurs very infrequently can still be very bad.
It’s important to have a tool that will group errors automatically and let you silence alerts about those that aren’t important. Pay
particular attention to new kinds of errors, and be sure to triage and prioritize them regularly.
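To make the idea of grouping concrete, here is a rough sketch of how errors might be bucketed by a fingerprint and checked against groups you have already triaged. This is not how Rollbar or any particular service does it. The record format, the `fingerprint` rule, and the sample messages are all assumptions for illustration.

```python
import re
from collections import Counter

def fingerprint(error_type, message):
    """Collapse digits so that messages differing only by IDs share a group."""
    normalized = re.sub(r"\d+", "<n>", message)
    return (error_type, normalized)

def triage(errors, known_fingerprints):
    """Count occurrences per group and return (counts, new_groups)."""
    counts = Counter(fingerprint(e["type"], e["message"]) for e in errors)
    new_groups = sorted(fp for fp in counts if fp not in known_fingerprints)
    return counts, new_groups

# Hypothetical error records, as they might be exported from an error tracker.
errors = [
    {"type": "TimeoutError", "message": "request 4821 timed out after 30s"},
    {"type": "TimeoutError", "message": "request 977 timed out after 30s"},
    {"type": "KeyError", "message": "missing field 'email' for user 12"},
]

# Groups that have already been reviewed in a previous session.
known = {("TimeoutError", "request <n> timed out after <n>s")}

counts, new_groups = triage(errors, known)
```

Here the two timeouts collapse into one known group, and the `KeyError` shows up as the only new group that needs a decision.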
Infrastructure Tuning
Every so often, take a quick look at your hosting, third-party services, and costs. Some examples:
- Are you getting close to a limit on one of your third-party services? You don’t want to find yourself suddenly unable to send an email or save user data.
- Are your servers running out of RAM?
- Is your AWS bill trending higher?
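A check like the first one can be as simple as comparing usage against limits each time you sit down for a CIE session. The sketch below assumes you have gathered the numbers by hand or from each service’s dashboard. The service names, figures, and the 80% threshold are hypothetical.

```python
def services_near_limit(usage, threshold=0.8):
    """Return (name, fraction_used) for services at or above the threshold."""
    flagged = []
    for name, (used, limit) in usage.items():
        fraction = used / limit
        if fraction >= threshold:
            flagged.append((name, fraction))
    # Most urgent first.
    return sorted(flagged, key=lambda item: item[1], reverse=True)

# Hypothetical usage figures: service -> (amount used, plan limit).
usage = {
    "transactional_email": (46_000, 50_000),
    "database_storage_gb": (12, 100),
    "error_tracking_events": (8_500, 10_000),
}

near_limit = services_near_limit(usage)
for name, fraction in near_limit:
    print(f"{name}: {fraction:.0%} of limit used")
```

Anything this flags is a good candidate to raise with the customer before the limit is actually reached.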
Updating
It’s always unpleasant to begin work on a project and discover that, for example, it’s running old versions of Ruby and Rails.
The challenge of updating these dependencies does not grow linearly over time; the further behind you fall, the harder each upgrade becomes.
Eventually, it’s easy to find yourself in a catch-22: you can’t fix bugs because your tools are so out of date, and you can’t upgrade your tools without introducing bugs.
The CIE time gives you a regular opportunity to take stock of where you are and ask yourself these questions:
- Have new versions of our frameworks or tools been released? What things are new, and what will it take to upgrade?
- Have any dependencies been deprecated? How do we migrate to something else?
- Are there new security vulnerabilities that need to be fixed?
Keeping the Test Suite Healthy
The test suite is your first line of protection for ensuring your software is stable. If the test suite is not in good health, then it
will be much harder to perform your other duties: fix bugs, update dependencies, improve performance, deploy, etc.
If the project you are supporting has a healthy test suite, congratulations! Do everything you can to keep it that way. Write new
tests for the bugs you fix and the features you add. Keep CI running.
All too often, in my experience, you are not so lucky. Don’t let that stop you. Inheriting a bad (or nonexistent) test suite is not a death
sentence, and you should make it a priority to remediate it:
- Fix tests that fail intermittently.
- Make sure to keep CI running.
- Are there slow tests? Can you make them faster?
- What parts of the system are missing coverage? How can you write tests to fill the gaps?
- Are there old tests that have been disabled? Why? Can you fix or rewrite them?
- Add tests for changes you make.
Document
Make it a priority to document the system as you work, whether for the benefit of your future self or the next developer.
As you work, make it a practice to add comments explaining:
- Why does the system work like this?
- How does a particular piece of code unexpectedly interact with other parts of the system?
- If you discover a potential problem, why is it there and why aren’t you fixing it?
- Is there a relevant vision or roadmap for the code that has not yet come to fruition?
Make sure to include your name and the date, too. Git history alone isn’t definitive.
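As an illustration, a comment in this spirit might look something like the following. The function, the reasoning in the comments, and the placeholder name and date are all invented for the example.

```python
def order_total_cents(item_prices_cents, discount_percent=0):
    # WHY (<your name>, <YYYY-MM-DD>): the discount is applied to the whole
    # order rather than per item so that rounding happens exactly once.
    #
    # KNOWN ISSUE (<your name>, <YYYY-MM-DD>): a discount above 100 would
    # produce a negative total. Left unfixed in this sketch because callers
    # are assumed to validate the percentage first.
    subtotal = sum(item_prices_cents)
    return subtotal - (subtotal * discount_percent) // 100

total = order_total_cents([1000, 250], discount_percent=10)
```

The comments answer the "why" and "why not fixed" questions directly in the code, where the next developer will actually see them.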
The CIE model is all about being proactive and staying ahead of issues before they become big problems.
Naturally, it’s more difficult than it sounds. Part of this difficulty stems from the fact that you’ll be dividing your time and attention
between multiple projects, which is unusual at Atomic.
The rest of the challenge comes from the drastic difference in daily activities when compared to a more typical project that’s focused on
quickly developing and delivering new features.
I hope the next time you find yourself as the developer on a CIE, these checklists serve as a starting point for success.