Did you know that there’s a relational database hiding in your Unix shell? There really is, it turns out.
A friend of mine was recently telling me about his discovery of the
join command, which allows you to combine data from multiple files that contain tabular data. Let’s take a closer look.
For these examples, I’ll be using two files.
CIS 162,Computer Science I CIS 163,Computer Science II CIS 263,Data Structures and Algorithms MTH 225,Discrete Structures I MTH 325,Discrete Structures II CIS 351,Computer Organization STA 215,Introductory Applied Statistics
Alice,CIS 351 Alice,CIS 263 Alice,MTH 325 Bob,CIS 163 Bob,MTH 225 Ryan,CIS 351 Ryan,MTH 325 Ryan,WRT 350 James,WRT 350
By default, the
join command behaves as an
INNER JOIN does in SQL. That is, each pair of matching lines from both files will be printed, but no additional lines from either file that may have matched.
To illustrate this, let’s find the names of all of the courses where each student is enrolled:
# join -t , -1 2 -2 1 <(sort -t , -k 2 enrollments.csv) <(sort -t , -k 1 courses.csv) CIS 163,Bob,Computer Science II CIS 263,Alice,Data Structures and Algorithms CIS 351,Alice,Computer Organization CIS 351,Ryan,Computer Organization MTH 225,Bob,Discrete Structures I MTH 325,Alice,Discrete Structures II MTH 325,Ryan,Discrete Structures II #
You'll notice that it doesn't include anything about
WRT 350, which has no description in
courses.csv. Similarly, it doesn't list anything about
CIS 162, which nobody is taking.
Since you're likely wondering, I think now's a good time to describe what those flags mean.
|Use commas as the field separator|
|Join using the second column of the first input file|
|And the first column of the second file|
|The `join` command wants its inputs to be sorted by the join key. This command sorts the enrollments file based on course number.|
|Also sort the courses file|
I'll also point out the
<(sort ...) syntax, in case you aren't familiar with it. It's similar to other shell piping operators, except it doesn't pipe the output of the
sort command to the standard input of
join. Instead, it creates a new file handle, and a UNIX path is substituted into its place for the arguments to
join. It looks something like
/dev/fd/11 on my machine.
Instead, imagine that we want to output all of the lines from
enrollments.csv, regardless of whether we have a course description in
courses.csv. In SQL, we'd use an outer join for this. Sure enough,
join supports this too, via another flag.
The flag to use is
-a 1, which indicates that we should show all lines from File 1. (We could alternatively specify
-a 2 if we wanted to include all lines from the second file.)
# join -t , -1 2 -2 1 -a 1 <(sort -t , -k 2 enrollments.csv) <(sort -t , -k 1 courses.csv) CIS 163,Bob,Computer Science II CIS 263,Alice,Data Structures and Algorithms CIS 351,Alice,Computer Organization CIS 351,Ryan,Computer Organization MTH 225,Bob,Discrete Structures I MTH 325,Alice,Discrete Structures II MTH 325,Ryan,Discrete Structures II James,WRT 350 Ryan,WRT 350 #
What if we actually wanted to find all the courses where nobody has enrolled? In SQL, this would be akin to doing an outer join and constraining one table's ID to null.
To achieve this,
join provides us with the
-v flag. You specify a file number along with it, and then
join will only print out the lines from that file that could not be paired with any lines in the other file.
Here it goes:
# join -t , -1 2 -2 1 -v 2 <(sort -t , -k 2 enrollments.csv) <(sort -t , -k 1 courses.csv)
CIS 162,Computer Science I STA 215,Introductory Applied Statistics #
Side note: It's actually possible to specify both files with multiple
-v flags. The output is probably less useful to a machine.
Other SQL-like tools in the Unix toolbox
Unix offers other tools that have analogs to what SQL gives us. I've thought of a few:
Can you think of others? Leave some suggestions in the comments!
I'm not really sure when or why I'll end up using this, but it's pretty cool to know it's there. Even after over a decade of using the Unix command line, I'm still surprised by it.