Case Study: Treating an ETL as a Software Design Problem

Atomic recently partnered on a large-scale data migration project, tasked with safely and accurately translating data from a long-standing legacy system into a newer, more structured one. On the surface, the work looked familiar: a large-scale migration out of an old, patchwork system, with high correctness requirements.

Context

Previous efforts had approached the problem in ways that made sense given that framing—leaning on established ETL tooling, orchestration frameworks, and automated translation strategies. But despite reasonable choices and significant investment, those efforts struggled to produce usable results.

The difficulty with this project wasn’t volume or performance. It was meaning.

The source system reflected decades of real-world use, workarounds, and adaptations. The destination system, while still a legacy system in its own right, enforced more explicit rules and assumptions about how final scenarios should behave. Bridging that gap required more than moving data. It required rebuilding system state in a way that preserved intent. It required custom software!

Reframing the Problem

One of the most important shifts we made was reframing the work.

Rather than asking, “How do we translate this data efficiently?” we asked:

  • What does a valid scenario look like in the destination system?
  • What assumptions does it make about events, relationships, and ordering?
  • What information do end users actually rely on in their day-to-day work?

This reframing revealed a core challenge: data that appeared complete in the source system could still produce broken results in the destination.

For example, certain values—like financial information—only make sense if corresponding events exist. In the source system, those relationships weren’t always explicitly modeled or enforced. In the destination system, they were foundational. Migrating values without their conceptual context led to invalid system states.

At that point, it was clear this wasn’t a straightforward ETL problem. It was a software design problem constrained by legacy data.

A Design-Thinking-Led Approach

We approached the project as consultants first, technologists second.

Working closely with daily users of the source and destination systems, as well as the developers of the destination system, we focused on understanding:

  • Which data drove the reports end users relied on
  • Which invariants the destination system depended on
  • What kinds of inconsistencies would cause downstream failures
  • Which data had no home in the destination system, and how we might communicate its former existence

These conversations weren’t a one-time discovery phase. They informed decisions continuously as we encountered real data and real edge cases.

Design thinking, in this context, meant grounding every technical decision in how the system was actually used—and validating those decisions early and often.

Why Custom Software Was the Right Tool

Once we accepted that the problem was fundamentally about reconstructing meaning, custom software became essential.

Owning the translation code gave us the flexibility to respond to what we were learning instead of forcing the problem into the shape of existing tools. As new scenarios surfaced, we could evolve the system intentionally rather than layering fragile exceptions.

Several custom features proved especially impactful.

Translation Logic That Mirrors the Destination System

Rather than building one-off mappings, we structured the translation layer to re-implement the destination system’s core rules. This meant encoding domain concepts explicitly and centralizing logic that would otherwise be duplicated.

The result was a system that behaved more like the destination application itself—and one that could be extended to support additional source systems without starting over.
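One way to picture "centralizing logic that would otherwise be duplicated" is a single registry of destination-system rules that every mapping runs through. This is an illustrative sketch, not the project's actual code; the rules and field names are invented.

```python
from typing import Callable, Optional

# A rule inspects a candidate destination record and returns an error
# message, or None if the record satisfies that rule.
Rule = Callable[[dict], Optional[str]]

# Encoded once, reused by every mapping, instead of re-checked ad hoc
# in each one-off translation.
DESTINATION_RULES: list[Rule] = [
    lambda rec: None if rec.get("events") else "scenario has no events",
    lambda rec: None if not rec.get("amounts") or rec.get("events")
                else "amounts present without supporting events",
]

def validate_destination_record(record: dict) -> list[str]:
    """Apply every destination-system rule; an empty list means valid."""
    return [msg for rule in DESTINATION_RULES
            if (msg := rule(record)) is not None]
```

Because the rules live in one place, adding a new source system means writing new mappings against the same validator, not re-deriving the destination's assumptions.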

Snapshot Testing for Confidence

As translation logic grew more complex, we introduced snapshot tests around critical scenarios and behaviors. These tests ensured that as we added new capabilities, we didn’t inadvertently break existing translations.

This allowed the codebase to evolve safely and gave the team confidence to refactor and improve structure over time.
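The core of a snapshot test is small enough to sketch. This hand-rolled version (assumed for illustration; teams often reach for a library instead) stores the translated output as JSON and fails whenever the output drifts from the approved snapshot.

```python
import json
from pathlib import Path

def check_snapshot(name: str, actual: dict, snapshot_dir: Path,
                   update: bool = False) -> bool:
    """Compare a translation result against its stored snapshot.

    The first run (or update=True) writes the snapshot; after that, any
    change to the translated output fails the check until a human
    approves it by regenerating the snapshot.
    """
    path = snapshot_dir / f"{name}.json"
    rendered = json.dumps(actual, indent=2, sort_keys=True)
    if update or not path.exists():
        path.write_text(rendered)
        return True
    return path.read_text() == rendered
```

The value is less in the mechanism than in the workflow: refactors that preserve behavior pass untouched, while behavior changes force an explicit review.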

Custom Report Comparison Tooling

Manual report review was both time-consuming and error-prone, so we built tooling to compare key reports generated from the source and destination systems.

Instead of debating whether a migration “looked right,” we could point to concrete differences and decide whether they were expected, acceptable, or indicative of missing logic.
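A sketch of the comparison idea, assuming each report can be reduced to rows keyed by a shared identifier (the key and fields here are hypothetical):

```python
def compare_reports(source_rows: list[dict], dest_rows: list[dict],
                    key: str = "id") -> list[tuple]:
    """Diff two reports row by row, keyed by a shared identifier.

    Returns a list of (key, description) pairs so reviewers can decide
    whether each difference is expected, acceptable, or a missing rule.
    """
    src = {r[key]: r for r in source_rows}
    dst = {r[key]: r for r in dest_rows}
    diffs = []
    for k in sorted(src.keys() | dst.keys()):
        if k not in dst:
            diffs.append((k, "missing in destination"))
        elif k not in src:
            diffs.append((k, "unexpected in destination"))
        else:
            changed = [f for f in src[k] if src[k][f] != dst[k].get(f)]
            if changed:
                diffs.append((k, f"fields differ: {changed}"))
    return diffs
```

A concrete diff list replaces "does this look right?" with a finite set of discrepancies to adjudicate.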

Built-In Visibility Into Migration Issues

We intentionally designed the system to surface problems rather than hide them. Each migration run produces detailed reports highlighting:

  • Values that required truncation
  • Events that needed to be created to complete system state
  • Source data inconsistencies that warrant review

These reports became shared artifacts for discussion with subject matter experts, turning ambiguity into actionable feedback.
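The "surface problems rather than hide them" pattern can be sketched as a small issue log threaded through the migration run. The categories below mirror the bullet list above; the class and names are illustrative, not the project's actual API.

```python
from dataclasses import dataclass

@dataclass
class MigrationIssue:
    category: str   # e.g. "truncation", "created-event", "inconsistency"
    record_id: str
    detail: str

class IssueLog:
    """Collects problems during a run instead of silently papering over them."""

    def __init__(self) -> None:
        self.issues: list[MigrationIssue] = []

    def record(self, category: str, record_id: str, detail: str) -> None:
        self.issues.append(MigrationIssue(category, record_id, detail))

    def summary(self) -> dict[str, int]:
        """Per-category counts, suitable for the top of a run report."""
        counts: dict[str, int] = {}
        for issue in self.issues:
            counts[issue.category] = counts.get(issue.category, 0) + 1
        return counts
```

Each run's log becomes the shared artifact: subject matter experts review the categorized issues rather than hunting for silent data loss.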

Testing With Real, Anonymized Data

To validate assumptions against reality, we built tooling to download real data sets, fully anonymize them, and run translations in tests. This allowed us to see how complex, real-world scenarios played out before committing changes—and to change course early when they didn’t behave as expected.
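One common anonymization approach, sketched here under the assumption that the sensitive fields are known up front (the field names are invented): salted hashing replaces real values with stable pseudonyms, so relationships between records survive while the data itself does not.

```python
import hashlib

# Illustrative list; the real set depends on the systems involved.
SENSITIVE_FIELDS = {"name", "email", "account_number"}

def anonymize(record: dict, salt: str) -> dict:
    """Replace sensitive values with stable pseudonyms.

    Hashing with a salt keeps cross-record relationships intact (the same
    source value always maps to the same token) while removing real data.
    """
    out = dict(record)
    for field in SENSITIVE_FIELDS & record.keys():
        digest = hashlib.sha256((salt + str(record[field])).encode()).hexdigest()
        out[field] = f"anon-{digest[:12]}"
    return out
```

Stability matters for this use case: if the same person appears in two records, both copies must anonymize to the same token, or the translated scenario falls apart.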

Designing for Iteration and Feedback

Rather than aiming for a single large migration, we designed the system to support small, repeatable runs. Each run produced tangible output we could review with stakeholders on short notice, and each review informed the next iteration.

This incremental approach reduced risk and created a steady feedback loop.

Outcomes and Lessons Learned

The success of this project didn’t come from choosing the “right” ETL tool or applying more automation. It came from recognizing that the hardest problems weren’t technical in the narrow sense; they were about modeling human workflows and historical context.

In other words, the most important step was realizing this ETL project wasn’t “just ETL” at all.
