Generate CTEs for a Legacy Database Using an LLM

My software development team is working on a project involving migrating data from a legacy database. The tables in this legacy database all have columns with confusing names, and they use composite primary keys that are made up of three or four columns. I need to write a lot of ad-hoc queries while analyzing the data in the existing system, but writing any SQL queries by hand is a pain because of the column names and composite keys.

In our application, we’d already created C# Entity Framework models that map to the legacy database. The joins are still a pain, but working with the legacy database programmatically isn’t that difficult.

However, I’m providing an LLM (in this case ChatGPT-4o right now) with the C# models. These will give the LLM the table name, column types, and column names, along with mappings to the friendlier names. That way, I can quickly get Common Table Expressions (CTEs) to use as the basis for writing greatly simplified ad-hoc SQL queries.

The Schema

For example, here are two tables from an imaginary legacy database for tracking orders.


CREATE TABLE WHIBM5 (         -- Orders table
  WHIKJH DATE NOT NULL,       -- OrderDate
  WHI8UJ INT NOT NULL,        -- CustomerNumber
  WHI2QR VARCHAR(2) NOT NULL, -- StoreState
  WHI9XZ INT NOT NULL,        -- StoreNumber
  WHIXYH TEXT                 -- Notes
);

CREATE TABLE TKLJH4 (         -- LineItems table
  TKLKJH DATE NOT NULL,       -- OrderDate
  TKL8UJ INT NOT NULL,        -- CustomerNumber
  TKL2QR VARCHAR(2) NOT NULL, -- StoreState
  TKL9XZ INT NOT NULL,        -- StoreNumber
  TKL1NH INT NOT NULL,        -- LineItemNumber
  TKL2RG INT NOT NULL,        -- ProductNumber
  TKLU7F INT NOT NULL         -- Quantity
);

The Models

Entity Framework models for these tables might look like this:


[Table("WHIBM5")]
public class Order
{
    [Column("WHIKJH")]
    public DateOnly OrderDate { get; set; }

    [Column("WHI8UJ")]
    public int CustomerNumber { get; set; }

    [Column("WHI2QR")]
    [StringLength(2)]
    public string StoreState { get; set; }

    [Column("WHI9XZ")]
    public int StoreNumber { get; set; }

    [Column("WHIXYH")]
    public string Notes { get; set; }

    public static void SetupModel(ModelBuilder modelBuilder)
    {
        modelBuilder.Entity((entity) =>
        {
            entity.HasKey(e => new { e.OrderDate, e.CustomerNumber, e.StoreState, e.StoreNumber });
        });
    }
}

[Table("TKLJH4")]
public class LineItem
{
    [Column("TKLKJH")]
    public DateOnly OrderDate { get; set; }

    [Column("TKL8UJ")]
    public int CustomerNumber { get; set; }

    [Column("TKL2QR")]
    [StringLength(2)]
    public string StoreState { get; set; }

    [Column("TKL9XZ")]
    public int StoreNumber { get; set; }

    [Column("TKL1NH")]
    public int LineItemNumber { get; set; }

    [Column("TKL2RG")]
    public int ProductNumber { get; set; }

    [Column("TKLU7F")]
    public int Quantity { get; set; }

    public static void SetupModel(ModelBuilder modelBuilder)
    {
        modelBuilder.Entity((entity) =>
        {
            entity.HasKey(e => new { e.OrderDate, e.CustomerNumber, e.StoreState, e.StoreNumber, e.LineItemNumber });
        });
    }
}

The Prompt

With these models and an LLM prompt, we can let the LLM do the tedious work of generating the CTEs.  That includes providing a nice identifier to make for easier joins! Here’s an example prompt that got the output I wanted for this example schema:

Given the following C# Entity Framework models, generate CTEs for each table that maps the actual column name to the friendlier property names in the models, and maps the table name to the Model name (but plural). In addition to mapping the column names, please provide a unique Id column for each table, and if the tables can join, provide a foreign key to the Id column as well. For example, Order would have Id and LineItem would have both Id and OrderId. The Id can just be a row number, but sort by all of the columns that make up the primary key. To figure out the OrderId you’ll need to do and inner join from the LineItems to the Orders table using the columns in the composite primary key of the Order table.

[contents of the C# classes above goes here]

The Common Table Expressions

And the generated CTEs:


WITH Orders AS (
    SELECT
        ROW_NUMBER() OVER (ORDER BY WHIKJH, WHI8UJ, WHI2QR, WHI9XZ) AS Id,
        WHIKJH AS OrderDate,
        WHI8UJ AS CustomerNumber,
        WHI2QR AS StoreState,
        WHI9XZ AS StoreNumber,
        WHIXYH AS Notes
    FROM WHIBM5
),
LineItems AS (
    SELECT
        ROW_NUMBER() OVER (ORDER BY TKLKJH, TKL8UJ, TKL2QR, TKL9XZ, TKL1NH) AS Id,
        Orders.Id AS OrderId,
        TKLKJH AS OrderDate,
        TKL8UJ AS CustomerNumber,
        TKL2QR AS StoreState,
        TKL9XZ AS StoreNumber,
        TKL1NH AS LineItemNumber,
        TKL2RG AS ProductNumber,
        TKLU7F AS Quantity
    FROM TKLJH4
    INNER JOIN Orders
        ON Orders.OrderDate = TKLKJH
        AND Orders.CustomerNumber = TKL8UJ
        AND Orders.StoreState = TKL2QR
        AND Orders.StoreNumber = TKL9XZ
)

Now I can just drop this at the top of a SQL query window (in something like TablePlus, or DataGrip, etc.) and then write my ad-hoc queries:


[CTEs go here]

SELECT *
FROM LineItems
INNER JOIN Orders ON Orders.Id = LineItems.OrderId
WHERE Orders.OrderDate > '2024-01-01'

I’ve found that it’s helpful to start with just a couple of tables like this. Then, I keep the LLM conversation around and just request an additional CTE for each new table as I need to include them in my queries.

Conversation

Join the conversation

Your email address will not be published. Required fields are marked *