Ingest Inconsistent Tables Quickly Using Postgres’s JSON Columns

On a recent project, I found myself needing to ingest a few hundred tables that were almost the same schema into one larger table in our PostgreSQL database. Here’s how you can leverage JSON columns to help smooth out the distinctions between tables.

Why You Might Need This

Ingesting a bunch of tables that are exactly the same shape is an easy thing to Google. Just write a plpgsql function that loops over the tables, one by one, and inserts their data into a destination table. You can find a dozen good examples online.

The process gets trickier if you need to deal with incoming tables that are only mostly the same. I don’t think there’s any way to, in regular SQL, tell Postgres to select a value from this column that may or may not exist. (By the way, if I’m wrong here, please let me know in the comments.)

Since the SQL that your ingestion function executes can’t handle varying fields, you need to handle that variation elsewhere. You could write a giant select…case… statement that used some indicator to conditionally execute a few different SQL queries, but all that gets really gnarly really fast as the number of tables you’re ingesting and the number of shapes that they’re in goes up.

Using Postgres’s JSON Columns

First, you’ll need a temporary table. For my case, I wanted to know the schema, table name, and row value for each row of data. In SQL, that looks like this:

create table if not exists temporary_json_table (
  id integer generated always as identity primary key,
  schema text,
  table_name text,
  row_value json

Next, you’ll need a plpgsql function that iterates over your input tables and inserts each row it finds into your temporary JSON table. That looks something like this.

create or replace function
import_tables(target_table text, source_schema text, table_pattern text)
returns int
as $$
declare input_table regclass;
  for input_table in
    select c.oid::regclass from pg_class c
    join pg_namespace n on n.oid = c.relnamespace
    where c.relkind = 'r'
    and n.nspname = source_schema
    and lower(c.relname) similar to table_pattern
      execute format('insert into %I (table_name, schema, row_value) select (parse_ident(a.tableoid::regclass::varchar, false))[2], (parse_ident(a.tableoid::regclass::varchar, false))[1], row_to_json(a.*) from %s a', target_table, input_spreadsheet);
    end loop;
    return 1;
$$ language plpgsql;

That’s a lot, but the result should be that, if you call that function using the name of your temporary table, the schema that holds your input tables, and the regex pattern for which tables you’d like to include, you should wind up with a JSON table that houses all of your data.

select import_tables('temporary_json_table', 'data_inbox', 'monthly_logs_\d+');

Now that you have all of your data in one table, you can leverage a regular Postgres insert…select… statement to convert the data into the shape of your target table. In my case, I knew that all of the input tables shared most of their columns. But the schema for the table had changed over time, and some earlier tables were missing a few columns. In this case, inserting a null value was an acceptable way to proceed for some values. That looked something like this:

insert into destination_table (
  row_value ->> 'Fiscal' as fiscal_period,
  row_value ->> 'Amount' as amount,
  row_value ->> 'TaxAmount' as tax,
  row_value ->> 'Check' as check_number
from temporary_json_table;

Why This Works

JSON is a dynamic, potentially schemaless format as far as Postgres is concerned. So, when you use a JSON operator like ->>, the engine knows that the referenced key may or may not exist. If it doesn’t exist, then Postgres just inserts a null. If it does, you get the value. In my use case, I was dealing with a large number of tables whose schemas had changed over time. I was okay with null values for fields that didn’t exist when each table was created.

Thanks to the safe nature of Postgres’s JSON operators, I was able to quickly import a whole mess of old data without resorting to a more complicated ETL. There are a ton of ways to solve this problem. I didn’t see this one getting a lot of press, and it was really handy.