6 Comments

There’s a Relational Database in Your Unix CLI

Did you know that there’s a relational database hiding in your Unix shell? There really is, it turns out.

A friend of mine was recently telling me about his discovery of the join command, which allows you to combine data from multiple files that contain tabular data. Let’s take a closer look.

Examples

For these examples, I’ll be using two files.

courses.csv:

CIS 162,Computer Science I
CIS 163,Computer Science II
CIS 263,Data Structures and Algorithms
MTH 225,Discrete Structures I
MTH 325,Discrete Structures II
CIS 351,Computer Organization
STA 215,Introductory Applied Statistics

and enrollments.csv:

Alice,CIS 351
Alice,CIS 263
Alice,MTH 325
Bob,CIS 163
Bob,MTH 225
Ryan,CIS 351
Ryan,MTH 325
Ryan,WRT 350
James,WRT 350

Inner Join

By default, the join command behaves as an INNER JOIN does in SQL. That is, each pair of matching lines from both files will be printed, but no additional lines from either file that may have matched.

To illustrate this, let’s find the names of all of the courses where each student is enrolled:

# join -t , -1 2 -2 1 <(sort -t , -k 2 enrollments.csv) <(sort -t , -k 1 courses.csv)

CIS 163,Bob,Computer Science II
CIS 263,Alice,Data Structures and Algorithms
CIS 351,Alice,Computer Organization
CIS 351,Ryan,Computer Organization
MTH 225,Bob,Discrete Structures I
MTH 325,Alice,Discrete Structures II
MTH 325,Ryan,Discrete Structures II
# 

You'll notice that it doesn't include anything about WRT 350, which has no description in courses.csv. Similarly, it doesn't list anything about CIS 162, which nobody is taking.

Since you're likely wondering, I think now's a good time to describe what those flags mean.

FlagDescription
-t ,Use commas as the field separator
-1 2Join using the second column of the first input file
-2 1And the first column of the second file
<(sort -t , -k 2 enrollments.csv)The `join` command wants its inputs to be sorted by the join key. This command sorts the enrollments file based on course number.
<(sort -t , -k 1 courses.csv)Also sort the courses file

I'll also point out the <(sort ...) syntax, in case you aren't familiar with it. It's similar to other shell piping operators, except it doesn't pipe the output of the sort command to the standard input of join. Instead, it creates a new file handle, and a UNIX path is substituted into its place for the arguments to join. It looks something like /dev/fd/11 on my machine.

Outer join

Instead, imagine that we want to output all of the lines from enrollments.csv, regardless of whether we have a course description in courses.csv. In SQL, we'd use an outer join for this. Sure enough, join supports this too, via another flag.

The flag to use is -a 1, which indicates that we should show all lines from File 1. (We could alternatively specify -a 2 if we wanted to include all lines from the second file.)

# join -t , -1 2 -2 1 -a 1 <(sort -t , -k 2 enrollments.csv) <(sort -t , -k 1 courses.csv)


CIS 163,Bob,Computer Science II
CIS 263,Alice,Data Structures and Algorithms
CIS 351,Alice,Computer Organization
CIS 351,Ryan,Computer Organization
MTH 225,Bob,Discrete Structures I
MTH 325,Alice,Discrete Structures II
MTH 325,Ryan,Discrete Structures II
James,WRT 350
Ryan,WRT 350
# 

Excluding matches

What if we actually wanted to find all the courses where nobody has enrolled? In SQL, this would be akin to doing an outer join and constraining one table's ID to null.

To achieve this, join provides us with the -v flag. You specify a file number along with it, and then join will only print out the lines from that file that could not be paired with any lines in the other file.

Here it goes:

# join -t , -1 2 -2 1 -v 2 <(sort -t , -k 2 enrollments.csv) <(sort -t , -k 1 courses.csv)

CIS 162,Computer Science I STA 215,Introductory Applied Statistics #

Side note: It's actually possible to specify both files with multiple -v flags. The output is probably less useful to a machine.

Other SQL-like tools in the Unix toolbox

Unix offers other tools that have analogs to what SQL gives us. I've thought of a few:

CommandSQL analog
awkSELECT
grepWHERE
uniqSELECT DISTINCT
sumSUM()

Can you think of others? Leave some suggestions in the comments!

Conclusion

I'm not really sure when or why I'll end up using this, but it's pretty cool to know it's there. Even after over a decade of using the Unix command line, I'm still surprised by it.