There’s a Relational Database in Your Unix CLI

Did you know that there’s a relational database hiding in your Unix shell? There really is, it turns out.

A friend of mine was recently telling me about his discovery of the join command, which allows you to combine data from multiple files that contain tabular data. Let’s take a closer look.

Examples

For these examples, I’ll be using two files.

courses.csv:

CIS 162,Computer Science I
CIS 163,Computer Science II
CIS 263,Data Structures and Algorithms
MTH 225,Discrete Structures I
MTH 325,Discrete Structures II
CIS 351,Computer Organization
STA 215,Introductory Applied Statistics

and enrollments.csv:

Alice,CIS 351
Alice,CIS 263
Alice,MTH 325
Bob,CIS 163
Bob,MTH 225
Ryan,CIS 351
Ryan,MTH 325
Ryan,WRT 350
James,WRT 350

Inner Join

By default, the join command behaves as an INNER JOIN does in SQL. That is, each pair of matching lines from both files will be printed, but no additional lines from either file that may have matched.

To illustrate this, let’s find the names of all of the courses where each student is enrolled:

# join -t , -1 2 -2 1 <(sort -t , -k 2 enrollments.csv) <(sort -t , -k 1 courses.csv)

CIS 163,Bob,Computer Science II
CIS 263,Alice,Data Structures and Algorithms
CIS 351,Alice,Computer Organization
CIS 351,Ryan,Computer Organization
MTH 225,Bob,Discrete Structures I
MTH 325,Alice,Discrete Structures II
MTH 325,Ryan,Discrete Structures II
# 

You’ll notice that it doesn’t include anything about WRT 350, which has no description in courses.csv. Similarly, it doesn’t list anything about CIS 162, which nobody is taking.

Since you’re likely wondering, I think now’s a good time to describe what those flags mean.

Flag Description
-t , Use commas as the field separator
-1 2 Join using the second column of the first input file
-2 1 And the first column of the second file
<(sort -t , -k 2 enrollments.csv) The `join` command wants its inputs to be sorted by the join key. This command sorts the enrollments file based on course number.
<(sort -t , -k 1 courses.csv) Also sort the courses file

I’ll also point out the <(sort ...) syntax, in case you aren’t familiar with it. It’s similar to other shell piping operators, except it doesn’t pipe the output of the sort command to the standard input of join. Instead, it creates a new file handle, and a UNIX path is substituted into its place for the arguments to join. It looks something like /dev/fd/11 on my machine.

Outer join

Instead, imagine that we want to output all of the lines from enrollments.csv, regardless of whether we have a course description in courses.csv. In SQL, we’d use an outer join for this. Sure enough, join supports this too, via another flag.

The flag to use is -a 1, which indicates that we should show all lines from File 1. (We could alternatively specify -a 2 if we wanted to include all lines from the second file.)

# join -t , -1 2 -2 1 -a 1 <(sort -t , -k 2 enrollments.csv) <(sort -t , -k 1 courses.csv)


CIS 163,Bob,Computer Science II
CIS 263,Alice,Data Structures and Algorithms
CIS 351,Alice,Computer Organization
CIS 351,Ryan,Computer Organization
MTH 225,Bob,Discrete Structures I
MTH 325,Alice,Discrete Structures II
MTH 325,Ryan,Discrete Structures II
James,WRT 350
Ryan,WRT 350
# 

Excluding matches

What if we actually wanted to find all the courses where nobody has enrolled? In SQL, this would be akin to doing an outer join and constraining one table’s ID to null.

To achieve this, join provides us with the -v flag. You specify a file number along with it, and then join will only print out the lines from that file that could not be paired with any lines in the other file.

Here it goes:

# join -t , -1 2 -2 1 -v 2 <(sort -t , -k 2 enrollments.csv) <(sort -t , -k 1 courses.csv)

CIS 162,Computer Science I
STA 215,Introductory Applied Statistics
#

Side note: It’s actually possible to specify both files with multiple -v flags. The output is probably less useful to a machine.

Other SQL-like tools in the Unix toolbox

Unix offers other tools that have analogs to what SQL gives us. I’ve thought of a few:

Command SQL analog
awk SELECT
grep WHERE
uniq SELECT DISTINCT
sum SUM()

Can you think of others? Leave some suggestions in the comments!

Conclusion

I’m not really sure when or why I’ll end up using this, but it’s pretty cool to know it’s there. Even after over a decade of using the Unix command line, I’m still surprised by it.

Conversation
  • Colin says:

    Unix command line is really rich in text processing tools, not many know about it. In my line of work, I almost use them on a daily basis. Try out sed, xargs, sort, cut , comm

    • Chris Farber Chris Farber says:

      I’ve heard of most of those, but `comm` is new to me. Thanks for pointing that out!

    • Justin swanhart says:

      I have been using Unix and Unix-like operating systems for 27 years, and I just learned about “tac” which reverses the order of lines from std input. It is cat spelled backwards.

  • kgh says:

    sum(1) does not do what you think it does.

  • Mr X says:

    Hey, looks cool, but I am unable to replicate it with pdksh.

    I’ve tried:
    join -t , -1 2 -2 1 <(sort -t , -k 2 enroll.csv) <(sort -t , -k 1 courses.csv)
    join -t , -1 2 -2 1 <$(sort -t , -k 2 enroll.csv) <$(sort -t , -k 1 courses.csv)
    join -t , -1 2 -2 1 $(sort -t , -k 2 enroll.csv) $(sort -t , -k 1 courses.csv)
    join -t , -1 2 -2 1 <`sort -t , -k 2 enroll.csv` <`sort -t , -k 1 courses.csv`
    join -t , -1 2 -2 1 `sort -t , -k 2 enroll.csv` `sort -t , -k 1 courses.csv`

    and maybe something else – with no result. Or maybe it is only possible in bash?

    • Chris Farber Chris Farber says:

      Interesting. I’ve tested this with both zsh and bash, but I’m not familiar with ksh. Presuming that join and sort are both present on your system, I’d suggest reading its documentation and looking for a way to pipe the contents of a command to a temporary file. It’s possible the syntax for how to do that is different in pdksh.

  • Take a look at this:
    http://crn.hopto.org/unix/#rdb

    /rdb made Unix into a 4gl.

  • Comments are closed.