Article summary
This is the fourth post in a series on enumeration in Ruby. In the previous post, we saw how we could easily extend the built-in Enumerable
methods to work with lazy collections. This post will cover some of the added benefits of using enumerators to build the interface to your collection, namely the ability to deal with memory constraints and transparently handle errors.
Handling Memory Constraints
A nice aspect of using enumerators is that you keep the appearance of a collection without needing your entire data set loaded at once.
Suppose I have a large CSV file that I want to import into a database. A naive solution might read the entire CSV data set into memory, and then create the database records.
```ruby
require "csv"

class AwesomeCsvReader
  def read
    CSV.read "large.csv"
  end
end

class AwesomeRecordImporter
  def import(data)
    data.each do |datum|
      puts "Imported #{datum.inspect}"
    end
  end
end

reader   = AwesomeCsvReader.new
importer = AwesomeRecordImporter.new

importer.import reader.read
```
However, if we are limited by how much memory we can use, our CSV file might be too large to load all at once. The obvious solution is to read and import the data one line at a time. If we didn't know about Object#enum_for or Enumerator.new, we might be tempted to shoehorn the importer into the reader and have it be responsible for calling #import on every line. For example:
```ruby
require "csv"

class AwesomeCsvReader
  def initialize(importer)
    @importer = importer
  end

  def read
    CSV.foreach "large.csv" do |line|
      @importer.import line
    end
  end
end

class AwesomeRecordImporter
  def import(datum)
    puts "Imported #{datum.inspect}"
  end
end

importer = AwesomeRecordImporter.new
reader   = AwesomeCsvReader.new importer

reader.read
```
But we can do much better, without changing the flow of our code, by simply changing the implementation of AwesomeCsvReader#read.
```ruby
class AwesomeCsvReader
  def read
    CSV.enum_for :foreach, "large.csv"
  end
end
```
Now we only maintain a reference to a single line at a time, leaving previously-read lines eligible for garbage collection and keeping our memory usage low no matter how big our input is.
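To see the same pattern without a large file on disk, here is a small sketch of line-by-line reading through enum_for. The LineReader class and the sample data are my own illustration, not from the original post; the idiom of returning an enumerator when no block is given is the standard one.

```ruby
require "stringio"

# A hypothetical reader that exposes line-by-line access to any IO object.
class LineReader
  def initialize(io)
    @io = io
  end

  # Yields each line when given a block; without a block, returns an
  # Enumerator so callers keep a collection-like interface while we
  # only ever hold one line at a time.
  def each_line
    return enum_for(:each_line) unless block_given?
    @io.each_line { |line| yield line.chomp }
  end
end

reader = LineReader.new(StringIO.new("a,1\nb,2\nc,3\n"))
lines  = reader.each_line   # an Enumerator; nothing has been read yet
puts lines.next             # pulls only the first line: "a,1"
puts lines.next             # then the second: "b,2"
```

Calling #next pulls elements on demand, so the consumer decides how much of the data set is ever materialized.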
Recovering from Errors
When you use Enumerator.new, you're essentially writing an #each method, which means that you can control how the sequence is traversed by client code. This is particularly useful when you want to keep going even if you encounter an error.
Given:
```ruby
def import_data_source
  [
    :good_data,
    :bad_data,
    :good_data
  ]
end

def import(data)
  raise "Error!" if data == :bad_data
  puts "Successfully imported #{data}"
end

import_data_source.each do |data|
  import data
end
```
Problem:
Recover from errors and continue the import process even if a single element fails to import.
Solution:
```ruby
def import_data_source
  data = [
    :good_data,
    :bad_data,
    :good_data
  ]

  Enumerator.new do |yielder|
    data.each do |datum|
      begin
        yielder << datum
      rescue
        # Swallow the error and move on to the next element
      end
    end
  end
end

def import(data)
  raise "Error!" if data == :bad_data
  puts "Successfully imported #{data}"
end

import_data_source.each do |data|
  import data
end
```
As you can see, any error raised in the client code bubbles up through our call to yielder <<, so simply placing a rescue there allows us to keep going. A more realistic implementation might collect the errors and raise them after the fact, or store them in a database.
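One possible sketch of that "collect the errors" idea, assuming names of my own invention (ResilientImporter and its #errors accessor are not from the post):

```ruby
# A sketch of an importer that records failures instead of halting.
class ResilientImporter
  attr_reader :errors

  def initialize
    @errors = []
  end

  # Applies the block to each element of the enumerable; failed
  # elements are recorded with their error rather than aborting
  # the whole run.
  def import_each(enum)
    enum.each do |datum|
      begin
        yield datum
      rescue => e
        @errors << [datum, e]
      end
    end
  end
end

importer = ResilientImporter.new
importer.import_each([:good_data, :bad_data, :good_data]) do |data|
  raise "Error!" if data == :bad_data
  puts "Successfully imported #{data}"
end

# Report the failures after the fact instead of losing them
importer.errors.each { |datum, e| puts "Failed: #{datum} (#{e.message})" }
```

After the run, the errors array holds each failed element paired with the exception it raised, so the caller can re-raise, retry, or persist them.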
The full source for this post is available here.
Enumerating Ruby Series:
- Object#enum_for
- Lazy Chains
- Handling Memory Constraints and Recovering from Errors