I recently worked on a task researching options for a search service to integrate into our application. I initially wrote off traditional relational databases in favor of more trendy options like ElasticSearch or AWS CloudSearch. These products left me feeling overwhelmed: each came with a steep learning curve and significant service management overhead. Reluctantly, I began investigating what would be possible with PostgreSQL’s full-text search capabilities. My research left me pleasantly surprised.
As I alluded to, the main factor that pushed me away from ElasticSearch and similar products was the management overhead. My impression was that supporting search indices in ElasticSearch requires a lot of knowledge and active maintenance. Many products have a dedicated search team, a luxury we didn’t have on this project. AWS offers a service called OpenSearch that claims to manage some of these variables for you, but even that felt more heavyweight than I needed.
Postgres full-text search capabilities also have a bit of a learning curve, but it was at least predicated on SQL concepts I’m comfortable with. My research into Postgres made me feel like maintaining a rudimentary search engine on the platform would be no more complicated than the basic database administration already common on applications like this one. The cherry on top was that the performance metrics I was seeing were much faster than I expected. Even with a load test consisting of a million records and nearly 200,000 unique words, queries against the dataset completed in less than 10 milliseconds.
The primary drawback of Postgres search is the limitation of its search result relevance algorithms. I think in the past decade Google has polluted the perception of what a good search engine is. If you’re trying to build a search engine with personalized results based on external factors, user engagement, and advanced metrics like Google, then Postgres is not the right fit.
But if you want to find words, phrases, or moderately-complex search queries in a set of documents, then consider following me into the nuts and bolts of Postgres full-text search.
Let’s first cover some terminology that comes up frequently in full-text search discussions and documentation.
Lexeme – A lexeme is a fundamental unit of language. Simply put, it is the root of a word. For example run, running, and ran all share a lexeme of “run.” The process of converting to lexemes is often referred to as “stemming.”
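You can watch stemming happen directly in psql with the `ts_lexize` function, which applies a single dictionary to a single word. A quick sketch using the built-in English stemmer dictionary:

```sql
-- ts_lexize applies one dictionary to one word and returns the lexeme(s):
select ts_lexize('english_stem', 'running');  -- {run}
select ts_lexize('english_stem', 'runs');     -- {run}
```

Note that the stemmer is algorithmic rather than a true lemmatizer, so irregular forms like “ran” are left unchanged even though they share the same linguistic lexeme.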
Stop Words – Stop words are words that carry no definitive value in a search context. Common words like “a” and “the” will occur in nearly every searchable document, so you can typically omit them from the search space.
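A quick example of stop word removal, using a sentence of my own. Note that removed stop words still consume a position in the resulting vector:

```sql
-- Stop words are dropped from the vector but still occupy a position:
select to_tsvector('english', 'the cat and the hat');
-- 'cat':2 'hat':5
```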
Text Search Dictionaries – Text search dictionaries are a Postgres concept that defines how the words within a document are normalized into lexemes. You can set options for things like synonyms, stemming rules, and stop words. Postgres offers several reasonable defaults, like the minimal “simple” dictionary, which uses no stop words or stemming, and the more standard “english” dictionary.
Text Search Vectors – Text search vectors in Postgres represent a document stored in a way optimized for search operations. They’re commonly referred to by their datatype, “tsvector.” They come in many forms, but a common approach is to pair lexemes with positions and weights. For example, a document with the title “Interesting Post” and body “The post is great” might produce a tsvector like
interest:1A post:2A,4B great:6B. In this tsvector, stop words have been removed, the numbers represent positions in the document, and “A” and “B” represent weights associated with the title and body respectively.
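The example tsvector above can be reproduced directly in psql using `setweight` and the `||` concatenation operator (which shifts the second vector’s positions past the first):

```sql
-- Weight the title as 'A' and the body as 'B', then concatenate:
select setweight(to_tsvector('english', 'Interesting Post'), 'A') ||
       setweight(to_tsvector('english', 'The post is great'), 'B');
-- 'great':6B 'interest':1A 'post':2A,4B
```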
Text Search Queries – Postgres’ text search queries are a way to represent complex search requests. A query might consist of multiple words, lexemes, or tsvectors combined using Boolean operators or the “followed by” operator. For example, searching for documents containing the words “interesting” and “post” but not “boring” might use the tsquery
interesting & post & !boring.
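To get a feel for how a tsquery matches a tsvector, you can evaluate the `@@` match operator against an ad hoc document (the sample sentence here is my own):

```sql
-- @@ returns true when the vector satisfies the query:
select to_tsvector('english', 'What an interesting post!')
       @@ to_tsquery('english', 'interesting & post & !boring');
-- true
```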
Preparing the Database for Search
Before we can start querying search terms, we have to optimize our database for fast, accurate searches. First, optionally, you can configure a custom text search dictionary. In the first SQL block below, I set up a custom dictionary that uses English stop words and the simple template.
Next, we construct a generated column of tsvectors that will act as the keywords we can search by. You accomplish this in the second SQL block. Individual columns are converted to tsvectors with to_tsvector and weighted with setweight. They’re then concatenated into a single document with the || operator. Finally, in the third SQL block, we create a GIN index on the new column to make searching super fast.
```sql
-- 1.
create text search dictionary simple_tsd (
    template = pg_catalog.simple,
    stopwords = english
);
create text search configuration simple_tsc (copy = simple);
alter text search configuration simple_tsc
    alter mapping for asciiword with simple_tsd;

-- 2.
alter table SearchablePosts
    add column if not exists weighted_keywords tsvector
    generated always as (
        setweight(to_tsvector('english', coalesce(SearchablePosts.title, '')), 'A') ||
        setweight(to_tsvector('english', coalesce(SearchablePosts.description, '')), 'B')
    ) stored;

-- 3.
create index if not exists keywords_idx
    on SearchablePosts using gin (weighted_keywords);
```
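If you opted into the custom configuration from the first block, you can sanity-check it with to_tsvector. Because simple_tsc copies the simple template and only adds English stop words, it lowercases and drops stop words but performs no stemming:

```sql
-- The custom configuration removes stop words without stemming:
select to_tsvector('simple_tsc', 'The Interesting Post');
-- 'interesting':2 'post':3
```

Note that the generated column above uses the standard 'english' configuration; swap in 'simple_tsc' there if you prefer the custom behavior.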
Search by Keyword
Finally, we can write some SQL to search for queries in our search space. There are many ways to search over the keywords column, but the best approach I’ve found is in the first SQL block below. The @@ operator is a Boolean operator that determines whether a tsvector matches a tsquery. We use the ts_rank_cd function to generate a score for how closely a vector matches the query, which is perfect for ordering our results. The to_tsquery function generates the tsquery; it expects an input already formatted with the appropriate operators. Alternatively, as shown in the second SQL block, you can use websearch_to_tsquery to generate queries from more typical search-engine-style user input.
```sql
-- 1.
select *,
       ts_rank_cd(weighted_keywords,
                  to_tsquery('english', 'interesting & post')) as ranking
from SearchablePosts
where weighted_keywords @@ to_tsquery('english', 'interesting & post')
order by ranking desc;

-- 2.
select *,
       ts_rank_cd(weighted_keywords,
                  websearch_to_tsquery('english', 'interesting post -boring')) as ranking,
       websearch_to_tsquery('english', 'interesting post -boring') as original_query
from SearchablePosts
where weighted_keywords @@ websearch_to_tsquery('english', 'interesting post -boring')
order by ranking desc;
```
Results of Query 1

| id  | title            | description         | weighted_keywords                   | ranking   |
| --- | ---------------- | ------------------- | ----------------------------------- | --------- |
| 1   | Interesting Post | The post is great   | great:6B interest:1A post:2A,4B     | 1         |
| 2   | Less Interesting | This post is boring | bore:6B interest:2A less:1A post:4B | 0.2857143 |

Results of Query 2

| id  | title            | description       | weighted_keywords               | ranking | original_query          |
| --- | ---------------- | ----------------- | ------------------------------- | ------- | ----------------------- |
| 1   | Interesting Post | The post is great | great:6B interest:1A post:2A,4B | 1       | interest & post & !bore |
You can see in the results of the first query that two posts contain both the words “interesting” and “post.” However, the post with id 1 ranks higher than the post with id 2 because it has both terms in its title, which the search gives more weight.
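That title boost is tunable. ts_rank_cd accepts an optional leading array of four floats giving the multiplier applied to weights D, C, B, and A (the default is {0.1, 0.2, 0.4, 1.0}). A sketch that boosts title matches even further relative to the body:

```sql
-- Custom weight multipliers, ordered {D, C, B, A}:
select *,
       ts_rank_cd('{0.1, 0.2, 0.2, 1.0}', weighted_keywords,
                  to_tsquery('english', 'interesting & post')) as ranking
from SearchablePosts
where weighted_keywords @@ to_tsquery('english', 'interesting & post')
order by ranking desc;
```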
In the results of the second query, you can see how we can translate a user’s search-engine-like input into a tsquery. You can also see why the modified query excludes the post with id 2 from the results.
Full-Text Search Capabilities in Postgres
I have had a lot of fun researching and experimenting with Postgres full-text search. It provides a highly performant and capable search engine without the maintenance burden or steep learning curve associated with dedicated search platforms.