Article summary
Recently the SME Toolkit, a project sponsored by the International Finance Corporation (a member of the World Bank Group), attempted to send international SMS messages, which gave everyone on the team a good lesson in character encodings. We had previously used UTF-16 to send our SMS messages to the telephone company we were partnering with for international messages. However, when we tried to send SMS messages to a telco in Sri Lanka, they requested that we use a predecessor to UTF-16 known as UCS-2.
Since we knew UTF-16 and UCS-2 were very similar, we didn’t anticipate any problems. (The major differences between UTF-16 and UCS-2 concern characters outside the Basic Multilingual Plane, which UTF-16 encodes as surrogate pairs and UCS-2 cannot represent; we were not concerned with those.) We dutifully converted our messages to UCS-2 and sent them off, only to find that no one in Sri Lanka could read them. They showed up as gibberish on our testers’ phones, or were rejected as malformed by the telco’s SMS gateway. What had happened? UTF-16 and UCS-2 are supposed to be similar, right? Well, they are similar. Unfortunately, character encodings are widely misunderstood, and implementations differ widely. A particular program may handle even similar encodings in very different fashions.
After discovering we were sending nonsense messages, we decided to examine the logs. To our surprise, we discovered that not only was the telco receiving messages from us without a byte order mark (BOM), they were receiving our messages as if they were little endian — with the least significant byte first! This was a major surprise to us as we had anticipated that our messages would be sent just like UTF-16: with a BOM so that the actual byte order was not an issue.
It turns out that there is a snag with the Unix Iconv library (libiconv) on certain systems, depending on the system’s endianness. We converted our messages in Ruby using the Iconv library, which relies on the local system’s iconv implementation. It seems that Iconv silently omits the BOM when converting messages to UCS-2, but does include the BOM when converting messages to UTF-16. This is surprising (and somewhat concerning), as the UCS-2 encoding is byte order sensitive, just like UTF-16. In addition, the default byte order is supposed to be big endian. Unfortunately, we couldn’t find a good explanation for this behavior, so we simply adapted.
We recognized the problem when doing a trial conversion from UTF-8 to UCS-2 with the Iconv library in IRB:
- On a local system:
irb(main):001:0> Iconv.conv("UCS-2","UTF-8","a") => "\000a"
- On our production server:
irb(main):001:0> Iconv.conv("UCS-2","UTF-8","a") => "a\000"
For comparison, the Iconv library in IRB does include the BOM when converting from UTF-8 to UTF-16:
- On a local system:
irb(main):002:0> Iconv.conv("UTF-16","UTF-8","a")
=> "\376\377\000a"
- On our production server:
irb(main):002:0> Iconv.conv("UTF-16","UTF-8","a")
=> "\377\376a\000"
While this type of behavior in Iconv may be intended, it certainly is confusing and unhelpful since it isn’t documented. I could find no information regarding intended behavior of Iconv other than the odd forum post which referred to UCS-2 conversion in Iconv being “broken.” The UCS-2 standard allows a BOM, but it can be omitted. This wouldn’t be such a problem if Iconv always converted into the same variant of UCS-2 rather than switching between big endian and little endian depending on the current architecture.
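The byte layouts discussed above can be reproduced without Iconv. The following is a minimal sketch, assuming Ruby 1.9 or later, that uses Ruby's built-in `String#encode` rather than the Iconv library used in this post. The helpers `hex_bytes` and `utf16_variants` are hypothetical names introduced only for illustration. For characters in the Basic Multilingual Plane, the UTF-16BE and UTF-16LE byte sequences are the same as the big endian and little endian forms of UCS-2.

```ruby
# Hypothetical helper (not part of the original post): show a string's
# bytes as space-separated hex instead of Ruby's octal escapes.
def hex_bytes(str)
  str.bytes.map { |b| format("%02x", b) }.join(" ")
end

# Hypothetical helper: encode UTF-8 text in each UTF-16 variant and
# return the resulting bytes as hex strings, keyed by encoding name.
def utf16_variants(text)
  {
    "UTF-16BE" => hex_bytes(text.encode("UTF-16BE")), # big endian, no BOM
    "UTF-16LE" => hex_bytes(text.encode("UTF-16LE")), # little endian, no BOM
    "UTF-16"   => hex_bytes(text.encode("UTF-16"))    # BOM prepended
  }
end

utf16_variants("a").each { |name, hex| puts "#{name}: #{hex}" }
```

Because the byte order is named explicitly in the first two cases, the result should not depend on the architecture of the machine running the code, which is the property the plain "UCS-2" conversion lacked.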
Somewhat cryptically, Ruby displays each byte as its ASCII character when the byte is printable, and as an octal escape otherwise. So, in our case, the encoded character “a” gets displayed as “\000a” on one machine, and “a\000” on the other machine. This is endianness in action. One machine is clearly putting the most significant byte first, the other the least significant byte first. This is a problem since the UCS-2 standard specifies that in the absence of a BOM, the bytes should be interpreted as big endian.
Instead of receiving the character “a” (U+0061, bytes 0x00 0x61), the telco would have read the swapped bytes 0x61 0x00 as the character “愀” (U+6100), a Han ideograph. No wonder the Sri Lankan phones couldn’t display it!
Discussion with the telco in Sri Lanka revealed what we had come to suspect: they wanted the UCS-2 messages to be big endian. I suppose it would have been helpful if they had specified this in the first place, but they might have (rightfully) expected that we would use the default, big endian.
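The misreading described above can be sketched from the receiver's side. This is only an illustration of the rule that bytes without a BOM are read as big endian; it is an assumption about how a gateway might behave, not the telco's actual code. The function name `decode_ucs2` is hypothetical, and the sketch assumes Ruby 2.0 or later (for `String#b`).

```ruby
# Hypothetical receiver-side decoder: honor a BOM if one is present,
# otherwise assume big endian, then return the text as UTF-8.
def decode_ucs2(bytes)
  data = bytes.dup.force_encoding(Encoding::BINARY)

  if data.start_with?("\xFE\xFF".b)      # big endian BOM
    encoding = "UTF-16BE"
    payload = data.byteslice(2..-1)
  elsif data.start_with?("\xFF\xFE".b)   # little endian BOM
    encoding = "UTF-16LE"
    payload = data.byteslice(2..-1)
  else                                   # no BOM: default to big endian
    encoding = "UTF-16BE"
    payload = data
  end

  payload.force_encoding(encoding).encode("UTF-8")
end

# Bytes as produced on the two machines in the transcripts above.
local      = decode_ucs2("\x00a".b)  # big endian bytes for "a"
production = decode_ucs2("a\x00".b)  # little endian bytes for "a"

puts format("U+%04X", local.ord)
puts format("U+%04X", production.ord)
```

With no BOM to signal the byte order, the little endian bytes decode to a different code point than the one that was sent, which matches the gibberish seen on the test phones.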
In order to force our server to use big endian, we specified it explicitly when doing our conversions with Iconv:
- On our production server:
irb(main):003:0> Iconv.conv("UCS-2BE","UTF-8","a") => "\000a"
This effectively solved our problem. No longer did we have to worry about whether our machine was representing characters differently than the telco’s machines — we specified exactly which encoding and byte order to use. Sadly, documentation of character encoding issues like this is very sparse and we had to do much research and testing ourselves before coming to this conclusion.
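For readers on a Ruby version that no longer bundles Iconv (it was removed from the standard library in Ruby 2.0), the same fix can be approximated with `String#encode`. The sketch below is an assumption-laden illustration, not the code we ran in production: `to_ucs2_be` is a hypothetical helper, and it uses UTF-16BE as a stand-in for big endian UCS-2, rejecting any character outside the Basic Multilingual Plane because UCS-2 cannot represent it.

```ruby
# Hypothetical helper: produce big endian UCS-2 bytes (no BOM) from text.
# UTF-16BE is byte-identical to UCS-2BE for code points up to U+FFFF,
# so anything above that range is rejected rather than encoded.
def to_ucs2_be(text)
  utf8 = text.encode("UTF-8")

  if utf8.each_char.any? { |ch| ch.ord > 0xFFFF }
    raise ArgumentError, "text contains characters outside the BMP"
  end

  utf8.encode("UTF-16BE")
end

payload = to_ucs2_be("a")
puts payload.bytes.map { |b| format("%02x", b) }.join(" ")
```

Naming the byte order in the target encoding keeps the output independent of the host's endianness, which is the same reason "UCS-2BE" worked where "UCS-2" did not.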
Further Reading
- An excellent introduction to character encodings, by Joel Spolsky: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
- The Unicode standard and associated documentation, by the Unicode Consortium: What is Unicode? (Be careful not to read the full standard, or you will fall asleep.)