Case-Insensitive Unicode Sorting in SQLite: Pre-collated Strings

Article summary

Pre-Collation
Java's Collator
Example Code
Creating a Descending Order Key
Wrapping Up

I recently worked on a project that was a perfect fit for using SQLite as part of its production architecture. The system was a CMS that would allow content to be created, edited, reviewed, and then finally published. The published content was written to a readonly SQLite database, which the website would then switch over to when the content was scheduled to go live. The content was published in multiple languages, with each language getting their own SQLite file.

During development, we ran into issues on index pages where article titles were not sorted correctly in non-English languages. The issue was that SQLite only supports case-insensitive comparisons of ASCII characters, not Unicode characters. This can lead to incorrect results when sorting strings that include characters like á and a, which we wanted to be treated equally when sorting a list of articles.

Pre-Collation

There are several ways to handle this, but we decided the simplest would be to pre-collate the strings – encoding the locale-sensitive rules into ASCII strings that SQLite could properly (and efficiently) index and sort. The CMS was written in Java, and it turns out that Java has a built-in Collator class that can be used to do exactly what we needed.

Java’s Collator

Java’s Collator provides locale-sensitive comparison of strings. By configuring a few properties (the decomposition and strength) we were able to get the case-insensitive, locale-specific sort order we were looking for. The method getCollationKey() converts a string to a byte array that encapsulates its sorted order according to the locale-specific rules. We needed an ASCII string to give to SQLite, so we used the Apache Commons Codec library to encode the byte array as a hex string.

The resulting “sort key” was then stored in an indexed column in the generated SQLite database alongside the actual Unicode article title. When retrieving the list of articles, the ORDER BY clause would use the collated sort key to sort the articles correctly.

Example Code

The following Java function demonstrates how to generate an ASCII hex string from an input string using a Collator.


import java.text.Collator;
import java.util.Locale;
import org.apache.commons.codec.binary.Hex;

public class CollationUtil {
  public static String getCollatedKey(String input, Locale locale) {
    Collator collator = Collator.getInstance(locale);

    // The combination of these two settings results in a case-insenstive sort,
    // that ignores differences between accented characters.
    collator.setDecomposition(Collator.CANONICAL_DECOMPOSITION);
    collator.setStrength(Collator.PRIMARY);

    byte[] bytes = collator.getCollationKey(input).toByteArray();
    return Hex.encodeHexString(bytes);
  }
}

Creating a Descending Order Key

The user interface allowed users to sort the articles in either ascending or descending order. But the pre-collated sort key we generated above only encodes the ascending order. We ended up adding an additional column to the database to store a separate descending sort key. You can generate such a key by inverting each byte of the collation key. This produces a transformed key that, when sorted in ascending order, effectively results in descending order for the original strings.

In other words, the SQL query would end with ORDER BY desc_sort_key to get the articles in descending order. It’s A tad confusing not seeing the DESC keyword in the query. However, hopefully, the column name makes it clear what’s going on. Here’s an updated method that can handle generating both ascending and descending sort keys:


import java.text.Collator;
import java.util.Locale;
import org.apache.commons.codec.binary.Hex;

public class CollationUtil {
  public static String getCollatedKey(String input, Locale locale, boolean ascending) {
    Collator collator = Collator.getInstance(locale);

    // The combination of these two settings results in a case-insenstive sort,
    // that ignores differences between accented characters.
    collator.setDecomposition(Collator.CANONICAL_DECOMPOSITION);
    collator.setStrength(Collator.PRIMARY);

    byte[] bytes = collator.getCollationKey(input).toByteArray();
    if (!ascending) {
      for (int i = 0; i < bytes.length; i++) {
        bytes[i] = (byte) ~bytes[i];
      }
    }
    return Hex.encodeHexString(bytes);
  }
}

Wrapping Up

This approach won’t fit all use-cases, but it worked really well for our project. We overcame SQLite’s limitations in sorting Unicode without introducing any additional dependencies or compile options. All of the “work” was done during the publishing process, keeping the “frontend” system simple and easy to maintain – and very fast.

Try Case-Insensitive Unicode Sorting in SQLite with Pre-collated Strings

Article summary

Pre-Collation

Java’s Collator

Example Code

Creating a Descending Order Key

Wrapping Up

Join the conversation Cancel reply

Tell Us About Your Project

Article summary

Pre-Collation

Java’s Collator

Example Code

Creating a Descending Order Key

Wrapping Up

Related Posts

Zip Archives with the Java Stream API

Programmatically Configure Persistent Disk Storage with Ehcache and Spring Boot

Java’s Discriminated Union: Sealed Interfaces

Keep up with our latest posts.

Join the conversation Cancel reply

Tell Us About Your Project