4 Comments

Parsing Excel Files with Ruby

In this post, I will review options for parsing Excel files using Ruby. I’ll discuss the different types of Excel files and introduce some of the Ruby libraries that exist for working with them. Note that I focus mostly on reading Excel files in this post, but there is some discussion around writing/updating them as well.

If you’d like to go straight to the code, I’ve put a project on GitHub with example snippets for reading an Excel file with each library discussed in this post.

Excel File Types

Before we get into the different Ruby libraries, let’s talk about Excel files. It is important to identify the type of Excel files that you are going to be using. There are two main types: legacy files and the newer OOXML file format introduced in Microsoft Office 2007.

There is a nice description of the differences on Wikipedia. The tldr; version is that the legacy file format includes files with the following extensions:

Extension Description
.xls legacy Excel file
.xlt legacy Excel file template
.xlm legacy Excel file with macros

Microsoft Excel 2007 abandoned the legacy binary format and switched to the Open Office XML (OOXML) format that is used today. These files use the following extensions:

Extension Description
.xlsx OOXML Excel file
.xlst OOXML Excel file template
.xlsm OOXML Excel file with macros

It’s important to determine if you are working with legacy or OOXML Excel files. You can usually convert back and forth between the different formats if you have a copy of Excel, but chances are you don’t want to rely on a manual file format conversion. In my scenario, I was receiving Excel files from an external source and couldn’t control the file format.

It is also worth noting that the modern .xlsx files usually work in other spreadsheet programs that can read the OOXML format, such as Numbers and LibreOffice.

Ruby Excel Libraries

There are many Ruby libraries for working with Excel–perhaps too many. When researching the different libraries, it took some effort to understand their features and limitations. I found the following questions helpful when reviewing each library:

  1. What types of Excel files does it support?
  2. Does it support reading, writing, or both?
  3. Can it handle large files? Is it fast?
  4. Does it require a file as input, or can it use a stream?

Depending on your application, some or all of those questions may be important.

Choosing The Right Library

Below is a table detailing the functionality of six different Ruby Excel libraries.

Library License Supports .xlsx Supports .xls Capability
axlsx MIT yes no write
rubyXL MIT yes no read/write
roo MIT yes yes read
creek MIT yes no read
spreadsheet GPLv3 no yes read/write
simple_xlsx_reader MIT yes no read

Based on your needs, one or more of the above libraries may be helpful. Consider the following usage scenarios.

Writing .xlsx Files

If you need to write .xlsx files, axslx is a good place to start. It allows writing cell values, along with generating charts and graphs. If you need something lighter-weight, rubyXL is a nice option.

Reading .xlsx Files

If you only need to read .xlsx files, you can choose between rubyXL, roo, creek, and simple_xlsx_reader. roo is very popular–likely because it also supports the legacy .xls format. However, if you read on to the performance section, creek and simple_xlsx_reader are much better at handling larger files. If you want to read from an IO stream instead of a file, rubyXL is the only library that supports that.

Reading and Writing .xlsx Files

If you need to read and write .xlsx files, you have a couple of options. You can use rubyXL, which supports reading and writing. Alternatively, you could use two different libraries–one for reading and the other for writing.

Reading and Writing Legacy Files

Support for the legacy .xls format is more limited. If you only need to support .xls, I would recommend spreadsheet. It supports reading and writing. If you need to also support .xlsx files, I would probably recommend a second gem…unless you only need to read the files, in which case, roo allows reading legacy and modern Excel files.

The good news is that regardless of which library you end up choosing, the code to open a file and read is it pretty simple, and it looks very similar across the different libraries. As an example, the below code reads all of the rows in an Excel file using creek.


require 'creek'

workbook = Creek::Book.new 'path/to/file.xlsx'
worksheets = workbook.sheets

worksheets.each do |worksheet|
  worksheet.rows.each do |row|
    row_cells = row.values
    # do something with row_cells
  end
end

I have samples for reading .xlsx files using each of the above libraries in the linked GitHub project.

Performance

If you need to read large Excel files, you may want to compare the performance between the different libraries. I created a quick and dirty benchmark comparing the four libraries that can read .xlsx files in the above table.

I created sample .xlsx files with 500, 10000, 50000, 200000, and 500000 rows. Then, I ran a code snippet using each library that parsed each file (e.g., read every row in the file). The snippets used for each library and the sample Excel files are available here.

I ran each library on each file three times and recorded the median time (the times didn’t vary much at all).

Both rubyXL and roo performed about the same, requiring over two minutes to parse a 500000-row Excel file. creek and simple_xlsx_reader were both much faster, requiring less than a minute to parse the 500000-row file.

I hope this post gave you some direction for working with Excel files in Ruby. If you are using a library that I didn’t cover and you like it, please let me know.