Parsing Comma-Separated Values in Ruby

Parsing comma-separated values (CSV) data is a relatively common task. Ruby makes it easier because it comes with its own CSV class, available after a require 'csv'.

Parsing CSVs in Ruby: The Basics

You can process CSV data either all at once or row by row. According to the Ruby docs, reading the whole file at once looks like this:

CSV.read(filepath)
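
As a minimal sketch of what that call returns, the snippet below writes a small sample file to the temp directory and reads it back. The file name and its contents are made up for illustration. With no options, the result is an array of arrays, and the header line is simply the first element.

```ruby
require "csv"
require "tmpdir"

# Hypothetical sample file, created only so the example is self-contained.
SAMPLE_PATH = File.join(Dir.tmpdir, "csv_read_example.csv")
File.write(SAMPLE_PATH, "Email,First_Name,Last_Name\nada@example.com,Ada,Lovelace\n")

# With no options, CSV.read returns one inner array per line of the file.
rows = CSV.read(SAMPLE_PATH)
rows.each { |row| p row }
```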

But what if we want to do something to each row? There are a few ways to do it, but I think the most readable code combines the headers option with header converters. With headers: true, the first line is treated as the header row, and each row's fields can be looked up by column name. A header converter then normalizes those names, for example by turning them into symbols. Essentially, by having access to both the column and the row, you can pick out specific cells in the CSV with ease.

Let’s say we have a CSV with the headers: Email, First_Name, Last_Name.

To read each row:

CSV.foreach(filepath, headers: true, header_converters: :symbol) do |row|
  ...
end

Because header_converters is set to :symbol, each header is downcased and converted to a symbol, so we can reference the Email column of a row with row[:email]. Similarly, the First_Name column is available as row[:first_name].
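
Here is a small, self-contained sketch of that loop. The sample file, its contents, and the contact_lines helper are all hypothetical and exist only to make the example runnable.

```ruby
require "csv"
require "tmpdir"

# Hypothetical sample file using the headers from the example above.
PEOPLE_PATH = File.join(Dir.tmpdir, "csv_foreach_example.csv")
File.write(PEOPLE_PATH, <<~DATA)
  Email,First_Name,Last_Name
  ada@example.com,Ada,Lovelace
  alan@example.com,Alan,Turing
DATA

# Builds one "First Last <email>" string per row, reading a row at a time.
def contact_lines(path)
  lines = []
  CSV.foreach(path, headers: true, header_converters: :symbol) do |row|
    # The :symbol converter turns "First_Name" into :first_name, and so on.
    lines << "#{row[:first_name]} #{row[:last_name]} <#{row[:email]}>"
  end
  lines
end

puts contact_lines(PEOPLE_PATH)
```

Because CSV.foreach yields one row at a time, this shape also suits files that are too large to load in one go.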

The Problem

Let’s say you want to read the CSV and validate that the incoming headers are what you expect. You read the headers the usual way:

CSV.read(filepath, headers: true).headers

Output:

actual_headers = [ HEADER_1, HEADER_2, HEADER_3 ]

You want to compare this with the expected input:

expected_headers = [ HEADER_1, HEADER_2, HEADER_3 ]

You perform a standard comparison on these two arrays.

actual_headers == expected_headers

They look exactly the same, right? They aren’t. The comparison evaluates to false. Oh no!

This is where I went down the rabbit hole of frantic debugging. Admittedly, this had me stumped for a while. I mean, what do you even begin to search for?

The Solution

Unless you instinctively check the first element of each array and work from there, this can be a real head-scratcher. It turns out that the first header starts with a zero-width no-break space (U+FEFF), an invisible character that sits at the very beginning of the file. In that position it is known as a byte-order mark (BOM). It does not show up when you print the headers, but it is there, and that is why the comparison is false even though the two arrays look identical.
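
One way to make the invisible character show up is to look at the codepoints of the first header. The sketch below is an illustration under a few assumptions: the sample file is made up, and the file is opened explicitly as plain UTF-8 ("r:utf-8") and handed to CSV.new so that the result does not depend on the locale or on how a given csv version opens files.

```ruby
require "csv"
require "tmpdir"

# Hypothetical file that starts with a UTF-8 BOM (U+FEFF).
BOM_PATH = File.join(Dir.tmpdir, "csv_bom_example.csv")
File.write(BOM_PATH, "\uFEFFEmail,First_Name,Last_Name\nada@example.com,Ada,Lovelace\n")

# Reads only the headers, treating the file as plain UTF-8 (BOM kept as-is).
def headers_with_bom(path)
  File.open(path, "r:utf-8") do |file|
    CSV.new(file, headers: true).read.headers
  end
end

expected_headers = ["Email", "First_Name", "Last_Name"]
actual_headers = headers_with_bom(BOM_PATH)

puts actual_headers == expected_headers                 # the arrays differ
puts actual_headers.first.codepoints.first == 0xFEFF    # the first character is the BOM
p actual_headers.first                                  # inspect output makes the extra character visible
```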

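To fix this, you simply have to account for the BOM. Ruby has a built-in way to do this: read the file with the bom|utf-8 encoding, which skips a leading BOM if one is present. Recent versions of the csv library take it as the encoding option (older versions also accepted the mode string 'r:bom|utf-8' as a second argument):

actual_headers = CSV.read(filepath, encoding: 'bom|utf-8', headers: true).headers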

Using this modification, the actual and expected headers comparison from the above example should evaluate to true. Now that this invisible character is gone, they are identical.
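
To see the fix end to end, here is a hedged sketch with a made-up sample file. To stay independent of how a particular csv version forwards file modes, it opens the file itself with the BOM-aware mode and passes the resulting IO to CSV.new. The headers_without_bom helper is hypothetical.

```ruby
require "csv"
require "tmpdir"

# Hypothetical file that starts with a UTF-8 BOM (U+FEFF).
FIXED_PATH = File.join(Dir.tmpdir, "csv_bom_fix_example.csv")
File.write(FIXED_PATH, "\uFEFFEmail,First_Name,Last_Name\nada@example.com,Ada,Lovelace\n")

# Reads only the headers; "bom|utf-8" makes Ruby skip a leading BOM if present.
def headers_without_bom(path)
  File.open(path, "r:bom|utf-8") do |file|
    CSV.new(file, headers: true).read.headers
  end
end

expected_headers = ["Email", "First_Name", "Last_Name"]
puts headers_without_bom(FIXED_PATH) == expected_headers
```

The same mode is harmless for files that have no BOM, so it can be used as the default way to open UTF-8 CSVs.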

Using Ruby to Parse CSVs

Using Ruby to parse CSVs can be extremely helpful. With a few tweaks and know-how, anyone can process and evaluate data efficiently and accurately.