For a recent project, I found myself needing to get certain data out of our database. I wanted to select the top two items, grouped by color, and sorted by when they were created. My project was being built on PostgreSQL, so I turned to my handy dandy window functions.
Window functions are similar to aggregate functions, but instead of operating on groups of records to produce a single output record, they operate on records related to the current row. We can use that to add a column to our result set that represents the rank of each item in its respective group. Here’s the completed query:
SELECT rank_filter.* FROM (
SELECT items.*,
rank() OVER (
PARTITION BY color
ORDER BY created_at DESC
)
FROM items
WHERE items.cost < 50
) rank_filter WHERE RANK = 1
Let’s break down what’s going on in each piece. The most interesting piece is the inner query that utilizes the OVER
clause.
SELECT items.*,
rank() OVER (
PARTITION BY color
ORDER BY created_at DESC
)
FROM items
WHERE items.other_value < 50
OVER
describes how to window the results. PARTITION BY
splits, or partitions, the data into groups with the same value for the specified column. In this case, it builds groups of the same color. ORDER BY
within an OVER
clause tells PostgreSQL how to order each group.
Once you’ve built your window, you’re ready for the window function: rank()
returns the position of the record within its group or window. The above results might look like:
color | other_value | created_at | rank |
red | 12 | 2016-01-22 | 1 |
red | 18 | 2016-01-23 | 2 |
red | 19 | 2016-01-24 | 3 |
blue | 12 | 2016-01-22 | 1 |
blue | 18 | 2016-01-23 | 2 |
blue | 19 | 2016-01-24 | 3 |
Now we’re getting close to the goal. However, we want just the top two from each group. We cannot use a simple LIMIT
because LIMIT
doesn’t know about our partitions. If we use a sub-select from these results, we can now filter by rank.
SELECT rank_filter.* FROM (
... codevious results here ...
) rank_filter WHERE RANK <= 2
This gives us just what we wanted:
color | other_value | created_at | rank |
red | 12 | 2016-01-22 | 1 |
red | 18 | 2016-01-23 | 2 |
blue | 12 | 2016-01-22 | 1 |
blue | 18 | 2016-01-23 | 2 |
This is such a common pattern that PostgreSQL has a built-in to shortcut the process, which is great if you’re only interested in the top ranking record from each group. It’s called DISTINCT ON
.
Here’s a quick example:
SELECT DISTINCT ON (color), items.*
FROM items
WHERE items.other_value < 50
ORDER BY items.created_at DESC
So next time you need to grab the top N records from each group, reach for a rank()/PARTITION
query.
Very helpful article! Very easy! Thanks.
Thanks Roie
I’m glad you found it useful!
It’s important to note rank() will give all equivalent rows the same value. In your example, if created_at is the same for many rows, they will all have the same rank, so your query could return more than 2 rows for the case of RANK <= 2. You can use row_number() to get around this and only return up to a hard limit.