As I gain experience programming, I find myself becoming more and more mindful of the implicit assumptions I might be making while solving problems. Hours spent debugging timezone issues and unexpectedly time-sensitive date calculations have taught me that everything I thought I knew about time (and many other subjects) is wrong. Such experiences have led me to approach new problems more cautiously.
One area where this has paid off handsomely is the field of text rendering. During a recent project, one of my primary goals was to enable good localization across the globe. I immediately started reading about various writing systems and how support for them is managed in software. It turns out that the general answer is, “It’s complicated.” As soon as I started looking at Indic text shaping, my head began to spin—but my investigations eventually turned up a simpler solution.
One particular surprise that turned up in my reading was Han Unification. In an effort to save Unicode codepoints, the ideographs of Traditional Chinese, Simplified Chinese, Japanese, and Korean have been “deduplicated.” In essence, drafters of the standard mapped the common ideographs of each language down to a single set of codepoints. This sounds like a good solution, but each language has significant variations in how the ideographs are drawn. A codepoint alone therefore does not tell the renderer which shape to draw; the software also has to know the language of the text and pick glyphs to match.
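To make the consequence of unification concrete, here is a minimal Python sketch. The character U+76F4 is my own choice of example (it is often cited as an ideograph whose preferred shape differs between Chinese and Japanese typography), and the helper name describe is hypothetical. The sketch only shows that the encoded text is identical in either language, so nothing in the string itself tells a renderer which regional shape to draw.

```python
import unicodedata

# U+76F4 is an illustrative choice: a unified ideograph that Chinese and
# Japanese fonts commonly draw with different shapes.
SAMPLE = "\u76f4"


def describe(text):
    """Return (code point, Unicode name) pairs for each character in text."""
    return [(f"U+{ord(c):04X}", unicodedata.name(c)) for c in text]


if __name__ == "__main__":
    # The same code point and the same bytes are stored whether the text
    # was written as Chinese or as Japanese.
    print(describe(SAMPLE))
    print(SAMPLE.encode("utf-8").hex())
```

Because the bytes carry no language information, that information has to come from somewhere else, such as markup, application settings, or the user's locale.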
Thankfully, many people much smarter than I have devoted a great deal of time to this subject, and there are fonts which simplify the problem significantly. For my project, I selected a typeface developed jointly by Adobe and Google, which Adobe publishes as Source Han Sans and Google publishes as Noto Sans CJK. This font is distributed under a very liberal license, and it includes glyphs for Japanese, Korean, Simplified Chinese, and Traditional Chinese. All you need to do to ensure proper ideograph use is to choose the font variant that matches the language of the text.
Simplified Font Selection
My project is shipping its own Linux-based operating system, so I figured it would be handy to get the underlying font selection engine to do this for me. After a few hours of FontConfig research, I figured out this method:
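<?xml version="1.0"?>
<!DOCTYPE fontconfig SYSTEM "fonts.dtd">
<fontconfig>
  <!-- The language tags, family names, and values come from the original
       listing. The element and attribute names around them are assumptions
       based on standard FontConfig syntax. -->

  <!-- Prefer the regional Noto Sans CJK variant that matches the requested language. -->
  <match target="pattern">
    <test name="lang"><string>zh-CN</string></test>
    <test name="family"><string>sans-serif</string></test>
    <edit name="family" mode="prepend"><string>Noto Sans CJK SC</string></edit>
  </match>
  <match target="pattern">
    <test name="lang"><string>zh-TW</string></test>
    <test name="family"><string>sans-serif</string></test>
    <edit name="family" mode="prepend"><string>Noto Sans CJK TC</string></edit>
  </match>
  <match target="pattern">
    <test name="lang"><string>ja</string></test>
    <test name="family"><string>sans-serif</string></test>
    <edit name="family" mode="prepend"><string>Noto Sans CJK JP</string></edit>
  </match>
  <match target="pattern">
    <test name="lang"><string>ko</string></test>
    <test name="family"><string>sans-serif</string></test>
    <edit name="family" mode="prepend"><string>Noto Sans CJK KR</string></edit>
  </match>

  <!-- Default for sans-serif when none of the language rules applies. -->
  <alias>
    <family>sans-serif</family>
    <prefer><family>Noto Sans CJK SC</family></prefer>
  </alias>

  <!-- Rendering settings for Noto Sans CJK KR, SC, JP, TC and
       Noto Sans Mono CJK KR, SC, JP, TC. Property names are not shown;
       the values, in order, are: true (before the family list), then
       false, true, true, false, hintfull, true, none, false. -->
</fontconfig>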
When you place this configuration file in /etc/fonts/local.conf, FontConfig will rely on the user’s current locale to automatically choose the correct glyph set from the Noto Sans CJK TTC you installed earlier.
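One way to check that the rules behave as intended is to ask FontConfig directly with its fc-match utility, passing the language in the pattern instead of changing the system locale. The following Python sketch is only an illustration: the helper names fc_match_args and matched_family are hypothetical, the expected families are copied from the language rules in the configuration above, and it assumes fc-match is on the PATH (it returns None otherwise).

```python
import shutil
import subprocess

# Expected family per language tag, copied from the configuration's rules.
EXPECTED_FAMILY = {
    "zh-CN": "Noto Sans CJK SC",
    "zh-TW": "Noto Sans CJK TC",
    "ja": "Noto Sans CJK JP",
    "ko": "Noto Sans CJK KR",
}


def fc_match_args(lang, family="sans-serif"):
    """Build an fc-match command asking for `family` in language `lang`."""
    # -f with "%{family}" makes fc-match print only the matched family name.
    return ["fc-match", "-f", "%{family}", f"{family}:lang={lang}"]


def matched_family(lang):
    """Return the family FontConfig selects, or None if fc-match is missing."""
    if shutil.which("fc-match") is None:
        return None
    result = subprocess.run(fc_match_args(lang), capture_output=True, text=True)
    return result.stdout.strip()


if __name__ == "__main__":
    for lang, expected in EXPECTED_FAMILY.items():
        print(f"{lang}: expected {expected!r}, got {matched_family(lang)!r}")
```

If a language reports a different family than expected, the usual suspects are a font that is not installed or a configuration file that FontConfig is not reading.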
My colleague Jesse has written about some more generally applicable translation pitfalls you can avoid here. What interesting problems have you fixed or avoided while localizing a program?