Store Batch Import Metadata to Save Yourself Time Later

Batch data imports are tricky to build. You need to make sure the data is valid and, most importantly, you need to map that data correctly onto your domain. In my experience, that mapping is the hardest part to get right.

Recently, I ran into an issue where having a robust import saved me a lot of headaches and manual work. Our import created several related models in the system. One subtle but important model was not created, preventing users from accessing a critical part of the application. Luckily, I was able to debug and fix this issue quickly because our import did two very important things:

  1. It stored the import file.
  2. It stored import metadata.

Storing the Import File

When a new import was created, our system stored the CSV containing the data in S3. This allowed me to use the same file when developing a fix, and it enabled me to write an automated test case that exercised the fix using that data. Now, we can quickly catch any regressions that occur with that particular extensive data set.
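
As a rough illustration of the idea (not our production code), the sketch below archives an uploaded file before any rows are processed. It assumes a local directory as a stand-in for the S3 bucket, and `archive_import_file` is a hypothetical helper name.

```ruby
require "digest"
require "fileutils"
require "tmpdir"

# Hypothetical helper: copies an uploaded CSV into an archive directory before
# any rows are processed. A local directory stands in for the S3 bucket here.
def archive_import_file(source_path, archive_dir)
  FileUtils.mkdir_p(archive_dir)
  checksum = Digest::SHA256.file(source_path).hexdigest
  # Prefix the name with part of the checksum so that a later upload with the
  # same file name but different content does not overwrite this one.
  archived_name = "#{checksum[0, 12]}-#{File.basename(source_path)}"
  archived_path = File.join(archive_dir, archived_name)
  FileUtils.cp(source_path, archived_path)
  { path: archived_path, sha256: checksum }
end

Dir.mktmpdir do |dir|
  upload = File.join(dir, "programs.csv")
  File.write(upload, "name,owner\nOnboarding,alice\n")

  archived = archive_import_file(upload, File.join(dir, "archive"))
  puts "archived to #{File.basename(archived[:path])}"
end
```

Keeping the checksum alongside the stored path makes it easy to confirm later that the file used in a regression test is the exact file that was imported.
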

Storing Import Metadata

Storing the import metadata was by far the biggest timesaver. The import was complicated; it created a big tree of new resources in the system. If we hadn't tracked which resources were created or modified by the import, working out which entities needed the fix would have been very time-consuming.

Tracking the metadata was actually pretty straightforward. In our Rails back end, we had two models that represented an import. BatchImport was the top-level model that stored metadata about the import. For each action taken, we stored information about that action in BatchImportEntry. Here is what those two models look like:


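    # Top-level record for one import run.
    # (Inherit from ActiveRecord::Base on Rails versions before 5.)
    class BatchImport < ApplicationRecord
      has_many :batch_import_entries
      # also store started_at, finished_at, status, notes
    end

    # One row per resource the import created or modified.
    class BatchImportEntry < ApplicationRecord
      belongs_to :batch_import
      # Polymorphic: backed by resource_type and resource_id columns.
      belongs_to :resource, polymorphic: true
    end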
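
To make the bookkeeping concrete, here is a plain-Ruby sketch of how an import might record an entry for every resource it touches. The Struct classes stand in for the ActiveRecord models above. `ImportLog`, `record`, `run_import`, and the `action` field are illustrative assumptions, not our actual code.

```ruby
# Plain-Ruby stand-ins for the two models; in Rails these would be
# ActiveRecord models and the resource a polymorphic association.
ImportEntry = Struct.new(:resource_type, :resource_id, :action)

ImportLog = Struct.new(:entries, :started_at, :finished_at) do
  # Record one entry per resource the import creates or modifies,
  # then hand the resource back so calls can be chained inline.
  def record(resource, action)
    entries << ImportEntry.new(resource.class.name, resource.id, action)
    resource
  end
end

# Hypothetical imported resource.
Program = Struct.new(:id, :name)

# rows is an array of hashes, as a CSV parser with headers would produce.
def run_import(rows)
  log = ImportLog.new([], Time.now, nil)
  rows.each_with_index do |row, index|
    # Ids are made up here; a database would normally assign them.
    program = Program.new(index + 1, row["name"])
    log.record(program, :created)
  end
  log.finished_at = Time.now
  log
end

log = run_import([{ "name" => "Onboarding" }, { "name" => "Safety" }])
log.entries.each do |entry|
  puts "#{entry.action} #{entry.resource_type} #{entry.resource_id}"
end
```

The key design choice is that recording happens at the point of creation, so nothing the import touches can escape the audit trail.
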
  

To fix the issue, I wrote a script that iterated through the top-level entry of each tree the import had created and made sure the missing model was created beneath it. It was as simple as this:


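    # Only look at imports that ran to completion.
    BatchImport.where.not(finished_at: nil).find_each do |import|
      # resource_type is the column behind the polymorphic :resource association.
      import.batch_import_entries.where(resource_type: 'Program').find_each do |entry|
        program = entry.resource
        # create new model related to this program
      end
    end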
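
A fix script like this is easier to trust if it can be run more than once without creating duplicates. The following plain-Ruby sketch shows one way to do that under simplified assumptions: the Struct classes stand in for the real models, and `backfill_missing_child`, `children`, and the child name are hypothetical.

```ruby
# Plain-Ruby stand-ins; all names below are hypothetical.
Entry   = Struct.new(:resource_type, :resource)
Import  = Struct.new(:finished_at, :entries)
Program = Struct.new(:name, :children)

# Adds the missing child to every imported Program that lacks it.
# Returns the programs that were changed so the run can be reviewed.
def backfill_missing_child(imports, child_name)
  fixed = []
  imports.each do |import|
    next if import.finished_at.nil?                # skip unfinished imports
    import.entries.each do |entry|
      next unless entry.resource_type == "Program"
      program = entry.resource
      next if program.children.include?(child_name) # already fixed
      program.children << child_name
      fixed << program
    end
  end
  fixed
end

broken  = Program.new("Onboarding", [])
healthy = Program.new("Safety", ["settings"])
imports = [
  Import.new(Time.now, [Entry.new("Program", broken), Entry.new("Program", healthy)])
]

puts backfill_missing_child(imports, "settings").map(&:name).inspect
puts backfill_missing_child(imports, "settings").map(&:name).inspect
```

The first call changes only the program that was missing its child; the second call finds nothing left to do.
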
  

Batch imports are tricky, and debugging them when things go wrong can be very difficult if you don’t store information about the import. You can save yourself a headache and a lot of time by building some import auditing into your next batch import.