As I gain experience programming, I find myself becoming more and more mindful of the implicit assumptions I might be making while solving problems. Spending hours debugging timezone issues or unexpectedly time-sensitive date calculations has revealed that everything I thought I knew about time (and many other subjects) is wrong. Such experiences have led me to take more cautious approaches to new problems.
One area where this has paid off handsomely is the field of text rendering. During a recent project, one of my primary goals was to enable good localization across the globe. I immediately started reading about various writing systems and how support for them is managed in software. It turns out that the general answer is, “It’s complicated.” As soon as I started looking at Indic text shaping, my head began to spin—but my investigations eventually turned up a simpler solution.
Ideograph Idiosyncrasies
One particular surprise that turned up in my reading was Han Unification. In an effort to save Unicode codepoints, the ideographs shared by Traditional Chinese, Simplified Chinese, Japanese, and Korean have been “deduplicated.” In essence, the drafters of the standard mapped the common ideographs of each language down to a single set of codepoints. This sounds like a good solution, but each language has significant variations in how the ideographs are drawn. The character 直 (U+76F4), for example, is conventionally drawn differently in Japanese than in Chinese, so the codepoint alone does not tell a renderer which form to draw; that decision falls to the font.
Thankfully, many people much smarter than I have devoted a great deal of time to this subject, and there are fonts which simplify the problem significantly. For my project, I selected a typeface developed jointly by Adobe and Google, which Adobe releases as Source Han Sans and Google as Noto Sans CJK. This font is distributed under a very liberal license, and it includes glyphs for Japanese, Korean, Simplified Chinese, and Traditional Chinese. All you need to do to ensure proper ideograph use is to choose the correct language variant.
Simplified Font Selection
My project is shipping its own Linux-based operating system, so I figured it would be handy to get the underlying font selection engine to do this for me. After a few hours of FontConfig research, I figured out this method:
First, install the all-in-one Noto Sans CJK super OTC font. Next, you’ll have to tell your Linux system’s font selection engine (FontConfig) what rules it should apply.
lang zh-CN, family sans-serif -> Noto Sans CJK SC
lang zh-TW, family sans-serif -> Noto Sans CJK TC
lang ja, family sans-serif -> Noto Sans CJK JP
lang ko, family sans-serif -> Noto Sans CJK KR
family sans-serif -> Noto Sans CJK SC
true
Noto Sans CJK KR, Noto Sans CJK SC, Noto Sans CJK JP, Noto Sans CJK TC
Noto Sans Mono CJK KR, Noto Sans Mono CJK SC, Noto Sans Mono CJK JP, Noto Sans Mono CJK TC
false, true, true, false, hintfull, true, none, false
When you place this configuration file in /etc/fonts/local.conf, FontConfig will rely on the user’s current locale to automatically choose the correct glyph set from the Noto Sans CJK OTC you installed earlier.
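As a rough sketch, one of the per-language rules and the language-independent fallback might be written in FontConfig's XML syntax as follows. The match, test, and edit elements are standard FontConfig, but the target, compare, and mode attributes shown here are my assumptions rather than values taken from the original file, so treat this as an illustration and not the exact configuration.

```xml
<?xml version="1.0"?>
<!DOCTYPE fontconfig SYSTEM "fonts.dtd">
<fontconfig>
  <!-- Sketch: when the requested language is Japanese and the requested
       family is sans-serif, put the JP variant at the front of the list.
       compare="contains" and mode="prepend" are assumed, not confirmed. -->
  <match target="pattern">
    <test name="lang" compare="contains">
      <string>ja</string>
    </test>
    <test name="family">
      <string>sans-serif</string>
    </test>
    <edit name="family" mode="prepend">
      <string>Noto Sans CJK JP</string>
    </edit>
  </match>

  <!-- Sketch: a rule with no language test, so sans-serif still resolves to
       a CJK-capable font when none of the language rules apply.
       mode="append" is assumed, not confirmed. -->
  <match target="pattern">
    <test name="family">
      <string>sans-serif</string>
    </test>
    <edit name="family" mode="append">
      <string>Noto Sans CJK SC</string>
    </edit>
  </match>
</fontconfig>
```

The rules for zh-CN, zh-TW, and ko would repeat the first block with the language tag and family name swapped. The second block looks much like the first but omits the language test, which is presumably what lets it act as a default when the locale is not one of the four CJK languages.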
My colleague Jesse has written about some more generally applicable translation pitfalls you can avoid here. What interesting problems have you fixed or avoided while localizing a program?
Hello Mr Johnson,
Thanks for this great post sharing the configuration of locale fonts in Linux. There is a block in your config file, line 48-55. It looks similar to the first match-block starting at line 5. May I ask what’s the difference of their purposes?
Best wishes,
Haoxian