Enumerating Ruby: Handling Memory Constraints and Recovering from Errors

This is the fourth post in a series on enumeration in Ruby. In the previous post, we saw how we could easily extend the built-in Enumerable methods to work with lazy collections. This post will cover some of the added benefits of using enumerators to build the interface to your collection, namely the ability to deal with memory constraints and transparently handle errors.

Handling Memory Constraints

A nice aspect of using enumerators is that you keep the appearance of a collection without having to hold your entire data set in memory at once.

Suppose I have a large CSV file that I want to import into a database. A naive solution might read the entire CSV data set into memory, and then create the database records.

require "csv"

class AwesomeCsvReader
  def read
    CSV.read "large.csv"
  end
end

class AwesomeRecordImporter
  def import(data)
    data.each do |datum|
      puts "Imported #{datum.inspect}"
    end
  end
end

reader = AwesomeCsvReader.new
importer = AwesomeRecordImporter.new

importer.import reader.read

However, if we are limited by how much memory we can use, our CSV file might be too large to load all at once. The obvious solution is to read and import the data one line at a time. If we didn’t know about Object#enum_for or Enumerator.new, we might be tempted to shoehorn the importer into the reader and have it be responsible for calling #import on every line.

For example:

require "csv"

class AwesomeCsvReader
  def initialize(importer)
    @importer = importer
  end

  def read
    CSV.foreach "large.csv" do |line|
      @importer.import line
    end
  end
end

class AwesomeRecordImporter
  def import(datum)
    puts "Imported #{datum.inspect}"
  end
end

importer = AwesomeRecordImporter.new
reader = AwesomeCsvReader.new importer

reader.read

But we can do much better, and without changing the flow of our code, simply by changing the implementation of AwesomeCsvReader#read.

class AwesomeCsvReader
  def read
    CSV.enum_for :foreach, "large.csv"
  end
end

Now we only maintain a reference to a single line at a time, leaving previously-read lines eligible for garbage collection and keeping our memory usage low no matter how big our input is.
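To see that behavior concretely, here is a small self-contained sketch (the file name and sample rows are invented for illustration) that writes a tiny CSV to a temp file and pulls rows from the enumerator one at a time:

```ruby
require "csv"
require "tempfile"

# Create a small CSV file just for demonstration.
file = Tempfile.new(["sample", ".csv"])
file.write("a,1\nb,2\nc,3\n")
file.close

# CSV.enum_for(:foreach, path) returns an Enumerator; no rows are
# read from disk until we ask for them.
rows = CSV.enum_for(:foreach, file.path)

puts rows.class        # => Enumerator
puts rows.next.inspect # reads only the first row: ["a", "1"]
puts rows.next.inspect # then the second: ["b", "2"]
```

External iteration with #next makes the one-row-at-a-time behavior visible, but the same property holds when client code calls #each on the enumerator.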

Recovering from Errors

When you use Enumerator.new, you’re essentially writing an #each method, which means that you can control how the sequence is traversed by client code. This is particularly useful when you want to keep going even if you encounter an error.
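As a quick refresher before the import example, here is a minimal sketch of that idea: the block passed to Enumerator.new plays the role of a hand-written #each, pushing values to the yielder on its own terms (even from an infinite loop):

```ruby
# An enumerator over the even numbers; the block acts as an #each
# implementation, deciding when and what to yield.
evens = Enumerator.new do |yielder|
  n = 0
  loop do
    yielder << n
    n += 2
  end
end

p evens.take(4) # => [0, 2, 4, 6]
```

Because the block controls traversal, it can also wrap each `yielder <<` call in error handling, which is exactly what the solution below does.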

Given:

def import_data_source
  [
    :good_data,
    :bad_data,
    :good_data
  ]
end

def import(data)
  raise "Error!" if data == :bad_data
  puts "Successfully imported #{data}"
end

import_data_source.each do |data|
  import data
end

Problem:

Recover from errors and continue the import process even if a single element fails to import.

Solution:

def import_data_source
  data = [
    :good_data,
    :bad_data,
    :good_data
  ]
  Enumerator.new do |yielder|
    data.each do |datum|
      begin
        yielder << datum
      rescue
        # an error raised by the consumer lands here; skip and continue
      end
    end
  end
end

def import(data)
  raise "Error!" if data == :bad_data
  puts "Successfully imported #{data}"
end

import_data_source.each do |data|
  import data
end

As you can see, any error in the client code will bubble up to our use of yielder <<, so simply putting a rescue there allows us to keep going. A more realistic implementation might maintain a list of the errors and raise them after the fact, or store them in a database.
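One possible sketch of that error-collecting variant (the method and variable names here are hypothetical, not from the post's source):

```ruby
# Wraps a collection in an Enumerator that records, rather than
# discards, errors raised by the consumer of each element.
def import_data_source_with_error_log(data, errors)
  Enumerator.new do |yielder|
    data.each do |datum|
      begin
        yielder << datum
      rescue => e
        errors << [datum, e] # remember what failed, and why
      end
    end
  end
end

errors = []
source = import_data_source_with_error_log([:good, :bad, :good], errors)

source.each do |datum|
  raise "Error!" if datum == :bad
  puts "Successfully imported #{datum}"
end

# Report failures after the run instead of aborting mid-import.
errors.each { |datum, e| puts "Failed to import #{datum}: #{e.message}" }
```

The import loop stays untouched; only the enumerator's construction changes, which is the same separation of concerns the memory-constraint example relied on.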

The full source for this post is available here.
