3 Comments

Selecting Top N Per Group in PostgreSQL

For a recent project, I found myself needing to get certain data out of our database. I wanted to select the top two items, grouped by color, and sorted by when they were created. My project was being built on PostgreSQL, so I turned to my handy dandy window functions.

Window functions are similar to aggregate functions, but instead of operating on groups of records to produce a single output record, they operate on records related to the current row. We can use that to add a column to our result set that represents the rank of each item in its respective group. Here’s the completed query:


    SELECT rank_filter.* FROM (
        SELECT items.*, 
        rank() OVER (
            PARTITION BY color
            ORDER BY created_at DESC
        )
        FROM items
        WHERE items.cost < 50
    ) rank_filter WHERE RANK = 1

Let’s break down what’s going on in each piece. The most interesting piece is the inner query that utilizes the OVER clause.


    SELECT items.*, 
      rank() OVER (
          PARTITION BY color
          ORDER BY created_at DESC
      )
    FROM items
    WHERE items.other_value < 50

OVER describes how to window the results. PARTITION BY splits, or partitions, the data into groups with the same value for the specified column. In this case, it builds groups of the same color. ORDER BY within an OVER clause tells PostgreSQL how to order each group.

Once you’ve built your window, you’re ready for the window function: rank() returns the position of the record within its group or window. The above results might look like:

color other_value created_at rank
red 12 2016-01-22 1
red 18 2016-01-23 2
red 19 2016-01-24 3
blue 12 2016-01-22 1
blue 18 2016-01-23 2
blue 19 2016-01-24 3

Now we’re getting close to the goal. However, we want just the top two from each group. We cannot use a simple LIMIT because LIMIT doesn’t know about our partitions. If we use a sub-select from these results, we can now filter by rank.


    SELECT rank_filter.* FROM (
        ... codevious results here ...
    ) rank_filter WHERE RANK <= 2

This gives us just what we wanted:

color other_value created_at rank
red 12 2016-01-22 1
red 18 2016-01-23 2
blue 12 2016-01-22 1
blue 18 2016-01-23 2

This is such a common pattern that PostgreSQL has a built-in to shortcut the process, which is great if you’re only interested in the top ranking record from each group. It’s called DISTINCT ON.
Here’s a quick example:


    SELECT DISTINCT ON (color), items.*
    FROM items
    WHERE items.other_value < 50
    ORDER BY items.created_at DESC

So next time you need to grab the top N records from each group, reach for a rank()/PARTITION query.