We’ve all hit that point on a project: features are working, and you’re starting to really expand the application’s capabilities, but the expansion is steadily leading to slower performance. It’s time for some refactoring around your database interactions.
Maybe the queries run by the application have become more and more complex. Maybe there are things going on in the TypeScript code that you suspect could be pushed down closer to the database. Or maybe you just know your next feature will need special attention to maintain the excellent responsiveness you’ve managed to deliver so far.
Whichever situation you’re in, database functions should be a tool in your toolbox. Most modern databases support some type of function; I’m going to look specifically at Postgres’s support for functions as a tool to help you achieve your performance goals.
Two Types of Postgres Functions
Postgres supports two types of functions: SQL Functions and PL/pgSQL Functions. Both types can have input and output parameters and return various kinds of data (a single value, a single row-like structure, a table-like structure, void, etc.).
A SQL function lets you perform a query or series of queries inside a function, returning the data from the final query. Functions can return VOID if there is no useful data to return.
Here’s a trivial example that returns the full set of Employees matching a given name:
```sql
CREATE FUNCTION get_employee_by_name(name text) RETURNS SETOF employee AS $$
  SELECT * FROM employee WHERE name = $1;
$$ LANGUAGE SQL;

-- Called as: SELECT * FROM get_employee_by_name('George');
```
See the Postgres documentation on SQL functions for more information: PostgreSQL: Documentation: 11: 38.5. Query Language (SQL) Functions.
PL/pgSQL functions can use Postgres’s procedural language, which provides tools like conditionals and loops rather than just the SQL query language features. You can do more than execute a sequence of queries (e.g., loop through results and take subsequent actions, work with cursors, other things I won’t get into here). But I’m going to stick with how to use simpler SQL Functions to improve application performance.
If you want to learn more about PL/pgSQL, see the Postgres documentation here: PostgreSQL: Documentation: 11: Chapter 43. PL/pgSQL – SQL Procedural Language.
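To give a taste of what the procedural language adds, here’s a hypothetical sketch (the employee table and its name column come from the earlier example; the function itself is made up for illustration) that uses a loop and a conditional — things a plain SQL function can’t express:

```sql
-- Hypothetical PL/pgSQL sketch: count employees whose names are at
-- least min_length characters long, using an explicit loop and IF.
CREATE FUNCTION count_long_names(min_length int) RETURNS int AS $$
DECLARE
  emp employee%ROWTYPE;
  total int := 0;
BEGIN
  FOR emp IN SELECT * FROM employee LOOP
    IF length(emp.name) >= min_length THEN
      total := total + 1;
    END IF;
  END LOOP;
  RETURN total;
END;
$$ LANGUAGE plpgsql;
```

(In real code you’d write this particular count as a single SQL aggregate; the loop is only there to show the procedural constructs.)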
Table SQL Functions
Today, I’m going to focus on a particular type of function: one that runs a query and returns a table-like structure that you can use in other queries, just like a table. Table Function isn’t an official term, as far as I can tell, but I’ll use it in this post to mean:
A function that returns a SETOF some type and can be used the same as a table in other SQL queries.
That sounds a lot like a view, right? It’s true that they can be used in most of the same ways; they both behave a lot like tables. But there are some important differences.
Table SQL Functions vs. Views
Views are a good way to reduce code duplication. They can be composed into other queries just like tables and, in many cases, suffer no performance penalty compared to writing the query in-line. They’re very convenient if you frequently write custom queries against your database and don’t want to re-type all those common joins and WHERE clauses repeatedly.
Views experience more challenges if you include aggregate functions like SUM(), set operations like UNION, or common table expressions (CTEs) in your view. In those situations, the query optimizer isn’t able to apply your WHERE clause predicates to sub-parts of the query (also called “predicate pushdown”). This can have a significant impact on performance by, for example, forcing the view to compute its aggregate measures for all rows in a table when your WHERE clause should restrict it to a small subset.
If you’re seeing slow performance from a view that does any of those things, consider using a function instead. A function’s input parameters can be used in multiple places in the function body, which allows you to put the right predicates in the right places (e.g., filter down both queries on either side of a UNION). That gives you a lot of control.
Functions can also insert, delete, and update records, which is handy if you have derivative values to compute and want to avoid data transfer out of the database. Postgres triggers are a great example of this.
Improving Performance with Table SQL Functions
We’ve significantly improved performance in our Postgres-backed applications by using functions to:
- Apply predicates to multiple places in a query.
- Compute derivative values without a round-trip out of the database.
Simple Aggregate Example
Let’s say you have this view:
```sql
CREATE OR REPLACE VIEW customer_order_average AS (
  SELECT AVG(total_amt), customer_id
  FROM "order"  -- "order" is a reserved word, so it must be quoted
  GROUP BY customer_id
);
```
And now you want to get the result for a specific list of customers, so you query it like this:
```sql
SELECT * FROM customer_order_average
WHERE customer_id IN (4312, 283, 4478, 2346, 9246);
```
Because of the AVG function inside the view, Postgres won’t be able to push the WHERE clause down into the view, and it will compute AVG(total_amt) for every customer. That might be okay if you have a sufficiently small number of customers and orders, or if you’re querying to get them all, but it’s not ideal. Let’s turn it into a function:
```sql
CREATE FUNCTION customer_order_average(VARIADIC ids int[])
RETURNS TABLE(customer_id int, order_total numeric) AS $$
  SELECT customer_id, AVG(total_amt)
  FROM "order"
  WHERE customer_id = ANY($1)
  GROUP BY customer_id
$$ LANGUAGE SQL;
```
When we call this function, the predicate will be applied in a place where it can reduce the set of data that the database has to work with to the minimum necessary. This should perform much better.
```sql
SELECT * FROM customer_order_average(4312, 283, 4478, 2346, 9246);
```
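One quick way to see the difference is to inspect the query plans. The view’s plan will typically show the aggregate running over the whole order table, with the customer_id filter applied afterward; the function call shows up as a single opaque Function Scan node, so the EXPLAIN ANALYZE timings are the more direct comparison. A sketch, using the tables from this example:

```sql
-- Plan for the view: expect an aggregate over the full "order" table,
-- with the customer_id filter applied after the fact.
EXPLAIN ANALYZE
SELECT * FROM customer_order_average
WHERE customer_id IN (4312, 283, 4478, 2346, 9246);

-- Plan for the function: appears as a Function Scan; compare its
-- total execution time against the view's plan above.
EXPLAIN ANALYZE
SELECT * FROM customer_order_average(4312, 283, 4478, 2346, 9246);
```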
UNION Example
Let’s say you have this view involving a SQL UNION ALL:
```sql
-- Get a complete set of preferences for a user,
-- combining defaults and their temporary overrides.
CREATE OR REPLACE VIEW user_preferences AS (
  SELECT upo.user_id, upo.preference_id, upo.preference_value
  FROM user_preference_override upo
  UNION ALL
  SELECT pb.user_id, pb.preference_id, pb.preference_value
  FROM preference_base pb
  WHERE NOT EXISTS (
    SELECT 1 FROM user_preference_override o
    WHERE pb.preference_id = o.preference_id
      AND pb.user_id = o.user_id
  )
);
```
Querying this view as-is will force Postgres to deal with the full set of preferences for all users, even if we really only care about preferences for one user. Our WHERE clause will be stuck on the outside of the view query instead of down inside (within each query joined by the UNION ALL) where it could do more good.
```sql
SELECT * FROM user_preferences WHERE user_id = 22342;
```
We can write this as a function that will pull that predicate down into a better place for performance.
```sql
CREATE FUNCTION user_preferences(id int)
RETURNS TABLE(user_id int, preference_id int, preference_value text) AS $$
  SELECT upo.user_id, upo.preference_id, upo.preference_value
  FROM user_preference_override upo
  WHERE upo.user_id = $1
  UNION ALL
  SELECT pb.user_id, pb.preference_id, pb.preference_value
  FROM preference_base pb
  WHERE pb.user_id = $1
    AND NOT EXISTS (
      SELECT 1 FROM user_preference_override o
      WHERE pb.preference_id = o.preference_id
        AND o.user_id = $1
    )
$$ LANGUAGE SQL;
```
Note the three instances of $1 where the query is filtering down to the single user we care about. This is a simplification of a real situation we had recently on my current project (domain terms changed to protect the innocent). The performance difference of this approach was significant compared to the view approach. Now when we query this function, it only works with data for the specific user we care about.
```sql
SELECT * FROM user_preferences(22342);
```
Computing Derivative Values
The other place where functions can provide a helpful performance advantage is when computing derivative values. Doing the calculation in the database avoids round-trip data transfer time.
Imagine that we have an Order table and that we update a set of aggregate statistics on a CustomerOrderStatistics table any time an Order record is updated (because of reasons). It looks something like this:
```
order
- id (int)
- customer_id (int)
- total_amt (numeric)

customer_order_statistics
- customer_id (int)
- lifetime_order_amt (numeric)
- average_order_amt (numeric)
```
Instead of calculating the values in TypeScript, Python, C#, or whatever favorite server-side tools you’re using, we can write a function to insert or update the aggregate row for the customer. This function could be triggered as a Postgres trigger or simply called in a query from the already-mentioned server-side tools after changing some data in the order table.
The function could look something like this:
```sql
CREATE FUNCTION update_customer_order_aggregate(id int)
RETURNS void AS $$
  INSERT INTO customer_order_statistics
    (customer_id, lifetime_order_amt, average_order_amt)
  SELECT customer_id, SUM(total_amt), AVG(total_amt)
  FROM "order"
  WHERE customer_id = $1
  GROUP BY customer_id
  ON CONFLICT (customer_id) DO UPDATE SET
    lifetime_order_amt = EXCLUDED.lifetime_order_amt,
    average_order_amt = EXCLUDED.average_order_amt;
$$ LANGUAGE SQL;
```
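If you want the statistics kept up to date automatically, the function can be wired up as a trigger. Here’s a minimal sketch, assuming the tables above (the trigger function and trigger names are hypothetical, and the EXECUTE FUNCTION spelling requires Postgres 11 or later; the ON CONFLICT clause also assumes a unique constraint on customer_id):

```sql
-- Hypothetical trigger wiring: recompute a customer's statistics
-- whenever one of their orders is inserted or updated.
CREATE FUNCTION refresh_customer_order_aggregate() RETURNS trigger AS $$
BEGIN
  PERFORM update_customer_order_aggregate(NEW.customer_id);
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER order_statistics_refresh
AFTER INSERT OR UPDATE ON "order"
FOR EACH ROW EXECUTE FUNCTION refresh_customer_order_aggregate();
```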
The tradeoff here is the location of business logic. We keep most of our business logic in the application codebase (TypeScript for my current project), so it’s a sacrifice to spread some of it into other places like the database. But it can be worthwhile if the performance benefit is sufficient, the logic isn’t duplicated elsewhere, or it’s in part of the application where we’re pushing a lot down into the database.
Use SQL Functions!
Database functions are fun, helpful tools to have in your toolbox. Keep them in mind the next time you need to improve the performance of your application’s database interactions.