Article summary
- Understand how access/permissions work locally vs. in deployed environments.
- Understand what other migrations or long-running processes will run at the same time.
- Understand the impact of a migration that fails partway through the process.
- Understand the impact of your migration to old instances of your application.
- Consider retryable, smaller-scale migrations.
When I hear the word migration, my mind immediately jumps to relational database migrations. In that model, the migration runs before the application is deployed, and the application code is updated to handle only the new schema. Should the migration or deployment fail, the migration tool typically has good support for rolling back the change.
On a recent project, our team was tasked with transitioning data encrypted with a secret to data encrypted with a key from a cloud key management system (KMS). This involved writing our own migration script without any of the guardrails of the relational DB migrations I described above. Since the volume of already-encrypted data was fairly small, and to save implementation time, we decided to implement a big bang migration process. This process, which ran on application startup, decrypted existing data, re-encrypted it with the new key, and stored the newly encrypted values. When our migration process eventually ran, it failed, took down our deployed application, and required us to roll back our change. The next day, we fixed the issue and redeployed the change, only for it to fail again.
Here, I’ll outline some factors that may complicate a data or service migration process you may have initially considered simple. I’ll also walk through a few important factors to consider when implementing a migration.
Understand how access/permissions work locally vs. in deployed environments.
The first major issue we ran into was permission related. Encrypting and decrypting data with a KMS required new permissions. When running the app locally, we saw that the permissions were missing from our user accounts, so we assigned them to the user group our individual accounts belonged to. However, the service account the application uses didn’t belong to that user group. As a result, the deployed application lacked the required permissions and its migration process crashed.
If your migration involves changing service providers and your deployed application and locally running application use different accounts to authenticate to cloud services, make sure any new roles/permissions are applied to the correct users and user groups.
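One way to catch this kind of mismatch early is to have the deployed application prove it can use the new key before any real data is touched. The sketch below is a minimal illustration, not what our team shipped: `Cipher` is a hypothetical interface standing in for whatever KMS client your application uses, and `preflight_check` is a name invented for this example.

```python
from typing import Protocol


class Cipher(Protocol):
    """Hypothetical interface; stands in for your real KMS client wrapper."""

    def encrypt(self, plaintext: bytes) -> bytes: ...

    def decrypt(self, ciphertext: bytes) -> bytes: ...


def preflight_check(cipher: Cipher) -> bool:
    """Return True only if the current identity can round-trip a probe value."""
    probe = b"migration-preflight"
    try:
        return cipher.decrypt(cipher.encrypt(probe)) == probe
    except Exception as exc:
        # Permission errors differ by provider, so this sketch catches broadly
        # and reports the failure instead of crashing the application.
        print(f"preflight failed: {exc}")
        return False
```

Running a check like this under the service account, in the deployed environment, would surface a missing role as a log line rather than as a crashed migration.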
Understand what other migrations or long-running processes will run at the same time.
Our initial implementation ran the migration process on application startup. The trouble with this was that a handful of other processes were being initialized at the same time, some of which used the same encryption/decryption code and storage that the migration was modifying.
When deciding on the complexity needed for your custom migration implementation, catalog what parts of the system will be affected by the change and when and how those components are run.
Understand the impact of a migration that fails partway through the process.
On our initial failed migration, not a single record successfully migrated. This made rolling back the change simple: we just redeployed the old code, which knew how to deal with the unmigrated data. On our second failed attempt, some but not all records migrated before the migration process failed. Dealing with this partial failure was trickier since we now had a bunch of data that the old application didn’t know how to handle.
If rolling back quickly isn’t an option, or if the migration has started but not completed, then the migration code should handle errors, and the code that interacts with the data or services undergoing the migration needs to handle both records that were migrated successfully and records that have not been migrated.
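As a rough illustration of reading both kinds of records, the sketch below tags migrated values with a marker and picks the decryption path based on it. Everything here is an assumption made for the example: the `v2:` prefix, the function names, and the idea that the old and new decryption routines are passed in as plain callables.

```python
from typing import Callable

# Hypothetical marker written in front of values that have been migrated.
NEW_PREFIX = b"v2:"

Decrypt = Callable[[bytes], bytes]


def is_migrated(stored: bytes) -> bool:
    """Report whether a stored value carries the new-format marker."""
    return stored.startswith(NEW_PREFIX)


def read_value(stored: bytes, decrypt_old: Decrypt, decrypt_new: Decrypt) -> bytes:
    """Decrypt a stored value whether or not it has been migrated yet."""
    if is_migrated(stored):
        # Strip the marker before handing the ciphertext to the new key.
        return decrypt_new(stored[len(NEW_PREFIX):])
    return decrypt_old(stored)
```

A prefix is only safe if legacy ciphertext can never start with the same bytes. If you can’t guarantee that, a separate column or metadata field that records the format is the more robust choice.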
Understand the impact of your migration to old instances of your application.
This next consideration is similar to the previous one. Depending on how your deployment infrastructure is set up, multiple instances of the application may be running simultaneously. If the old application sticks around while the new one is still waiting to come up and be in a healthy state, how will the old instance handle the change to the underlying data’s values, schema, or storage provider? This question is also relevant if you must revert the code change and roll back to a previous deployment. If the old instance won’t be able to handle the changes gracefully, it might be worth temporarily keeping data around in both formats.
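Keeping both formats around could look something like the sketch below, where each record carries the legacy ciphertext alongside the new one. The `StoredSecret` shape and both function names are hypothetical, and the encryption routines are passed in as callables so the example stays independent of any particular provider.

```python
from dataclasses import dataclass
from typing import Callable, Optional

Transform = Callable[[bytes], bytes]


@dataclass
class StoredSecret:
    legacy_ciphertext: bytes                # the field old instances read
    kms_ciphertext: Optional[bytes] = None  # only new instances know about this


def write_both(plaintext: bytes, encrypt_old: Transform, encrypt_new: Transform) -> StoredSecret:
    """Store the value in both formats so old and new instances can read it."""
    return StoredSecret(encrypt_old(plaintext), encrypt_new(plaintext))


def read_preferring_new(record: StoredSecret, decrypt_old: Transform, decrypt_new: Transform) -> bytes:
    """Use the new format when present, falling back to the legacy one."""
    if record.kms_ciphertext is not None:
        return decrypt_new(record.kms_ciphertext)
    return decrypt_old(record.legacy_ciphertext)
```

The trade-off is that the legacy secret stays in use for as long as the old field exists, so the dual-format period should end, and the legacy field be dropped, once no old instances or rollbacks depend on it.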
Consider retryable, smaller-scale migrations.
With all the trouble we ran into, we wanted more control over when and how the migration was run. Given this, we decided not to run this process as an action on startup and instead invoke it manually so that we could pay closer attention to the results and retry as needed. We also introduced the ability to run the migration on individual records one at a time to minimize the impact of something going wrong.
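To make the idea concrete, here is a minimal sketch of a per-record migration that can be re-run safely. It is an illustration under stated assumptions rather than our actual script: the store is modeled as a plain dict of record ID to bytes, migrated values are tagged with a hypothetical `v2:` prefix, and `migrate_record`, `migrate_all`, and `MigrationReport` are names made up for this example.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical marker written in front of values that have been migrated.
NEW_PREFIX = b"v2:"

Transform = Callable[[bytes], bytes]


@dataclass
class MigrationReport:
    migrated: List[str] = field(default_factory=list)
    skipped: List[str] = field(default_factory=list)
    failed: Dict[str, str] = field(default_factory=dict)


def migrate_record(store: Dict[str, bytes], record_id: str,
                   decrypt_old: Transform, encrypt_new: Transform) -> bool:
    """Migrate one record. Return False if it was already migrated."""
    stored = store[record_id]
    if stored.startswith(NEW_PREFIX):
        return False
    plaintext = decrypt_old(stored)
    # The write happens only after decrypt and encrypt both succeed, so a
    # failure leaves the record in its original, still-readable form.
    store[record_id] = NEW_PREFIX + encrypt_new(plaintext)
    return True


def migrate_all(store: Dict[str, bytes],
                decrypt_old: Transform, encrypt_new: Transform) -> MigrationReport:
    """Attempt every record, recording failures instead of stopping."""
    report = MigrationReport()
    for record_id in list(store):
        try:
            if migrate_record(store, record_id, decrypt_old, encrypt_new):
                report.migrated.append(record_id)
            else:
                report.skipped.append(record_id)
        except Exception as exc:
            report.failed[record_id] = str(exc)
    return report
```

Because already-migrated records are skipped, running `migrate_all` again after fixing a problem only touches the records listed in `failed`, and `migrate_record` can be invoked by hand for a single ID when you want to watch one record closely.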