Article summary
My software development team is working on a project involving migrating data from a legacy database. The tables in this legacy database all have columns with confusing names, and they use composite primary keys that are made up of three or four columns. I need to write a lot of ad-hoc queries while analyzing the data in the existing system, but writing any SQL queries by hand is a pain because of the column names and composite keys.
In our application, we’d already created C# Entity Framework models that map to the legacy database. The joins are still a pain, but working with the legacy database programmatically isn’t that difficult.
However, I’m providing an LLM (in this case ChatGPT-4o right now) with the C# models. These will give the LLM the table name, column types, and column names, along with mappings to the friendlier names. That way, I can quickly get Common Table Expressions (CTEs) to use as the basis for writing greatly simplified ad-hoc SQL queries.
The Schema
For example, here are two tables from an imaginary legacy database for tracking orders.
CREATE TABLE WHIBM5 ( -- Orders table
WHIKJH DATE NOT NULL, -- OrderDate
WHI8UJ INT NOT NULL, -- CustomerNumber
WHI2QR VARCHAR(2) NOT NULL, -- StoreState
WHI9XZ INT NOT NULL, -- StoreNumber
WHIXYH TEXT -- Notes
);
CREATE TABLE TKLJH4 ( -- LineItems table
TKLKJH DATE NOT NULL, -- OrderDate
TKL8UJ INT NOT NULL, -- CustomerNumber
TKL2QR VARCHAR(2) NOT NULL, -- StoreState
TKL9XZ INT NOT NULL, -- StoreNumber
TKL1NH INT NOT NULL, -- LineItemNumber
TKL2RG INT NOT NULL, -- ProductNumber
TKLU7F INT NOT NULL -- Quantity
);
The Models
Entity Framework models for these tables might look like this:
[Table("WHIBM5")]
public class Order
{
[Column("WHIKJH")]
public DateOnly OrderDate { get; set; }
[Column("WHI8UJ")]
public int CustomerNumber { get; set; }
[Column("WHI2QR")]
[StringLength(2)]
public string StoreState { get; set; }
[Column("WHI9XZ")]
public int StoreNumber { get; set; }
[Column("WHIXYH")]
public string Notes { get; set; }
public static void SetupModel(ModelBuilder modelBuilder)
{
modelBuilder.Entity((entity) =>
{
entity.HasKey(e => new { e.OrderDate, e.CustomerNumber, e.StoreState, e.StoreNumber });
});
}
}
[Table("TKLJH4")]
public class LineItem
{
[Column("TKLKJH")]
public DateOnly OrderDate { get; set; }
[Column("TKL8UJ")]
public int CustomerNumber { get; set; }
[Column("TKL2QR")]
[StringLength(2)]
public string StoreState { get; set; }
[Column("TKL9XZ")]
public int StoreNumber { get; set; }
[Column("TKL1NH")]
public int LineItemNumber { get; set; }
[Column("TKL2RG")]
public int ProductNumber { get; set; }
[Column("TKLU7F")]
public int Quantity { get; set; }
public static void SetupModel(ModelBuilder modelBuilder)
{
modelBuilder.Entity((entity) =>
{
entity.HasKey(e => new { e.OrderDate, e.CustomerNumber, e.StoreState, e.StoreNumber, e.LineItemNumber });
});
}
}
The Prompt
With these models and an LLM prompt, we can let the LLM do the tedious work of generating the CTEs. That includes providing a nice identifier to make for easier joins! Here’s an example prompt that got the output I wanted for this example schema:
Given the following C# Entity Framework models, generate CTEs for each table that maps the actual column name to the friendlier property names in the models, and maps the table name to the Model name (but plural). In addition to mapping the column names, please provide a unique Id column for each table, and if the tables can join, provide a foreign key to the Id column as well. For example, Order would have Id and LineItem would have both Id and OrderId. The Id can just be a row number, but sort by all of the columns that make up the primary key. To figure out the OrderId you’ll need to do and inner join from the LineItems to the Orders table using the columns in the composite primary key of the Order table.
[contents of the C# classes above goes here]
The Common Table Expressions
And the generated CTEs:
WITH Orders AS (
SELECT
ROW_NUMBER() OVER (ORDER BY WHIKJH, WHI8UJ, WHI2QR, WHI9XZ) AS Id,
WHIKJH AS OrderDate,
WHI8UJ AS CustomerNumber,
WHI2QR AS StoreState,
WHI9XZ AS StoreNumber,
WHIXYH AS Notes
FROM WHIBM5
),
LineItems AS (
SELECT
ROW_NUMBER() OVER (ORDER BY TKLKJH, TKL8UJ, TKL2QR, TKL9XZ, TKL1NH) AS Id,
Orders.Id AS OrderId,
TKLKJH AS OrderDate,
TKL8UJ AS CustomerNumber,
TKL2QR AS StoreState,
TKL9XZ AS StoreNumber,
TKL1NH AS LineItemNumber,
TKL2RG AS ProductNumber,
TKLU7F AS Quantity
FROM TKLJH4
INNER JOIN Orders
ON Orders.OrderDate = TKLKJH
AND Orders.CustomerNumber = TKL8UJ
AND Orders.StoreState = TKL2QR
AND Orders.StoreNumber = TKL9XZ
)
Now I can just drop this at the top of a SQL query window (in something like TablePlus, or DataGrip, etc.) and then write my ad-hoc queries:
[CTEs go here]
SELECT *
FROM LineItems
INNER JOIN Orders ON Orders.Id = LineItems.OrderId
WHERE Orders.OrderDate > '2024-01-01'
I’ve found that it’s helpful to start with just a couple of tables like this. Then, I keep the LLM conversation around and just request an additional CTE for each new table as I need to include them in my queries.