ActiveRecord Join Models Hash Themselves into Bad Performance

ActiveRecord’s implementation of #hash is lacking, if not outright broken. I came across this issue while fixing an N+1 problem in my current Rails application. I’ll use the proverbial Author/Book example to explain. I was doing something like:

@authors = Author.where(some_cond: true)

Later on in my code, I was looping over each author’s books (classic N+1). To fix, just sprinkle on some includes:

@authors = Author.where(some_cond: true).includes(:books)

There, fixed it. Page now loads in… 2 minutes! How did that happen?

Let’s look at the setup:

  class Book < ActiveRecord::Base
    has_many :compositions
  end

  class Author < ActiveRecord::Base
    has_many :compositions
    has_many :books, through: :compositions
  end

  class Composition < ActiveRecord::Base
    belongs_to :book
    belongs_to :author
  end

In my application there was quite a bit of data, but nothing outrageous. Using pry I was able to find a call to uniq inside of ActiveRecord that was unexpectedly slow. Under the covers, uniq places all the items of your array as keys in a hash and converts that hash back into a new array. Ruby’s hash implementation relies on each object’s #hash method to decide where a key goes, and on #eql? to tell apart keys that hash to the same value.
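To make that dependency concrete, here is a small plain-Ruby sketch. The Tag class is a made-up example and has nothing to do with ActiveRecord; it only shows that Array#uniq decides what counts as a duplicate through #hash and #eql?.

```ruby
# Hypothetical value class, used only to illustrate how uniq works.
class Tag
  attr_reader :name

  def initialize(name)
    @name = name
  end

  # uniq uses this value to decide which hash bucket the object lands in.
  def hash
    name.hash
  end

  # Objects that land in the same bucket are then compared with eql?.
  def eql?(other)
    other.is_a?(Tag) && name == other.name
  end
end

tags = [Tag.new("ruby"), Tag.new("rails"), Tag.new("ruby")]

# The second "ruby" tag has the same hash and is eql? to the first,
# so uniq drops it and keeps the first occurrence.
unique_names = tags.uniq.map(&:name)
```

Without the two overrides, each Tag would fall back to Object’s identity-based behaviour and uniq would keep all three.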

ActiveRecord::Base has it defined as:

  def hash
    id.hash
  end

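This works well for models with an id, but join models that do not have an id all return nil for id, so every one of them is hashed as nil.hash. If every key in the hash has the same hash value, each insertion still has to be compared against all the entries already stored. So the hash doesn’t speed up the uniq call at all; the work grows roughly with the square of the number of records.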
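The effect is easy to reproduce outside of Rails. The sketch below is an assumption-laden stand-in, not ActiveRecord code: IdlessRow is a hypothetical class whose #hash is the same for every instance, and a counter records how often Ruby falls back to #eql? while uniq runs.

```ruby
# Hypothetical stand-in for a join record that has no id.
class IdlessRow
  @eql_calls = 0

  class << self
    attr_accessor :eql_calls
  end

  # Every instance returns the same value, like id.hash when id is nil.
  def hash
    nil.hash
  end

  # Count how often the hash table has to fall back to a direct comparison.
  def eql?(other)
    self.class.eql_calls += 1
    equal?(other)
  end
end

rows = Array.new(200) { IdlessRow.new }
unique_rows = rows.uniq

# No two rows are eql?, so nothing is removed, but every new row had to be
# compared against the rows already stored under the same hash value.
comparisons = IdlessRow.eql_calls
puts "#{unique_rows.size} rows kept after #{comparisons} comparisons"
```

The exact comparison count depends on the Ruby version, but it should land far above the number of rows, which is the opposite of what a hash lookup is supposed to buy you.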

Solution #1

Define a custom #hash method in your join model. This feels like we’re fighting Rails a little too much. The Rails guides subtly recommend that you have an id on “has many through” join tables, but do not mention that it’s basically required.

  def hash
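As a rough sketch of what such an override could look like, assuming a join row is identified by its two foreign keys: CompositionRow below is a hypothetical plain-Ruby stand-in for the Composition model, and hashing on [book_id, author_id] is an illustrative choice rather than the code from the application.

```ruby
# Hypothetical stand-in for the Composition join model (no ActiveRecord).
class CompositionRow
  attr_reader :book_id, :author_id

  def initialize(book_id, author_id)
    @book_id = book_id
    @author_id = author_id
  end

  # Assumption: the pair of foreign keys identifies the row, so hash on both
  # instead of on a missing id.
  def hash
    [book_id, author_id].hash
  end
end

rows = [
  CompositionRow.new(1, 10),
  CompositionRow.new(1, 11),
  CompositionRow.new(2, 10)
]

# Rows with different key pairs now get different hash values, so they no
# longer pile up under a single value.
distinct_hashes = rows.map(&:hash).uniq.size
```

In a real model only the hash method would be added to the class; equality is left untouched here, so uniq still keeps every row and simply stops comparing each one against all the others.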

Solution #2

Add an id field. It appears that we migrated from a “has and belongs to many” to a “has many through” and never added an id field. Simply adding id would also work around this issue.

We went with solution #2, and our page load times in development are now around one second. I’ll be watching my #hash a little bit closer from now on.