I recently worked on a project that was a perfect fit for using SQLite as part of its production architecture. The system was a CMS that would allow content to be created, edited, reviewed, and then finally published. The published content was written to a readonly SQLite database, which the website would then switch over to when the content was scheduled to go live. The content was published in multiple languages, with each language getting their own SQLite file.
During development, we ran into issues on index pages where article titles were not sorted correctly in non-English languages. The issue was that SQLite only supports case-insensitive comparisons of ASCII characters, not Unicode characters. This can lead to incorrect results when sorting strings that include characters like á and a, which we wanted to be treated equally when sorting a list of articles.
Pre-Collation
There are several ways to handle this, but we decided the simplest would be to pre-collate the strings – encoding the locale-sensitive rules into ASCII strings that SQLite could properly (and efficiently) index and sort. The CMS was written in Java, and it turns out that Java has a built-in Collator
class that can be used to do exactly what we needed.
Java’s Collator
Java’s Collator
provides locale-sensitive comparison of strings. By configuring a few properties (the decomposition and strength) we were able to get the case-insensitive, locale-specific sort order we were looking for. The method getCollationKey()
converts a string to a byte array that encapsulates its sorted order according to the locale-specific rules. We needed an ASCII string to give to SQLite, so we used the Apache Commons Codec library to encode the byte array as a hex string.
The resulting “sort key” was then stored in an indexed column in the generated SQLite database alongside the actual Unicode article title. When retrieving the list of articles, the ORDER BY
clause would use the collated sort key to sort the articles correctly.
Example Code
The following Java function demonstrates how to generate an ASCII hex string from an input string using a Collator
.
import java.text.Collator;
import java.util.Locale;
import org.apache.commons.codec.binary.Hex;
public class CollationUtil {
public static String getCollatedKey(String input, Locale locale) {
Collator collator = Collator.getInstance(locale);
// The combination of these two settings results in a case-insenstive sort,
// that ignores differences between accented characters.
collator.setDecomposition(Collator.CANONICAL_DECOMPOSITION);
collator.setStrength(Collator.PRIMARY);
byte[] bytes = collator.getCollationKey(input).toByteArray();
return Hex.encodeHexString(bytes);
}
}
Creating a Descending Order Key
The user interface allowed users to sort the articles in either ascending or descending order. But the pre-collated sort key we generated above only encodes the ascending order. We ended up adding an additional column to the database to store a separate descending sort key. You can generate such a key by inverting each byte of the collation key. This produces a transformed key that, when sorted in ascending order, effectively results in descending order for the original strings.
In other words, the SQL query would end with ORDER BY desc_sort_key
to get the articles in descending order. It’s A tad confusing not seeing the DESC
keyword in the query. However, hopefully, the column name makes it clear what’s going on. Here’s an updated method that can handle generating both ascending and descending sort keys:
import java.text.Collator;
import java.util.Locale;
import org.apache.commons.codec.binary.Hex;
public class CollationUtil {
public static String getCollatedKey(String input, Locale locale, boolean ascending) {
Collator collator = Collator.getInstance(locale);
// The combination of these two settings results in a case-insenstive sort,
// that ignores differences between accented characters.
collator.setDecomposition(Collator.CANONICAL_DECOMPOSITION);
collator.setStrength(Collator.PRIMARY);
byte[] bytes = collator.getCollationKey(input).toByteArray();
if (!ascending) {
for (int i = 0; i < bytes.length; i++) {
bytes[i] = (byte) ~bytes[i];
}
}
return Hex.encodeHexString(bytes);
}
}
Wrapping Up
This approach won’t fit all use-cases, but it worked really well for our project. We overcame SQLite’s limitations in sorting Unicode without introducing any additional dependencies or compile options. All of the “work” was done during the publishing process, keeping the “frontend” system simple and easy to maintain – and very fast.