Converting UTF-8 to the 7-bit GSM Default Alphabet

7-Bit GSM Default Alphabet

As blogged about previously, the SME Toolkit has a feature which allows integration with telephone companies to send SMS messages via the SMPP protocol.

Recently, we needed to send SMS messages to a telco that required the 7-bit GSM Default Alphabet (GSM-7). This was a challenge because, internally, the SME Toolkit uses the UTF-8 encoding scheme, and the messages we were sending were in Portuguese.

The Portuguese language utilizes characters which are not represented in the 7-bit GSM Default Alphabet, yet GSM-7 was the only way we could communicate SMS messages to the telephone company. The challenge was to send Portuguese language SMS messages from the SME Toolkit to end-user mobile handsets via the telephone company without rendering the SMS message unreadable.

The Problem

We could not very well ask people to avoid the Portuguese characters that do not exist in GSM-7; that would be tacky and very user-unfriendly. We could not use other encodings which do support the Portuguese language, such as UCS-2/UTF-16, as the telephone company did not support them. Finally, we could not use one of the newer features of GSM, national language shift tables, as the telephone company did not support those either.

Doing a standard conversion from UTF-8 to GSM-7 would be ‘lossy’, in that characters which exist in UTF-8 but not in GSM-7 could not be represented. A replacement character would be used to indicate that no valid character existed in the GSM-7 encoding to represent the given UTF-8 character.

The Solution

After some discussion with target Portuguese speakers who had mobile handsets with the telephone company, we discovered that it was acceptable to use a close replacement character for the Portuguese-language specific characters which the telco did not support.

For example, for the purposes of SMS messages, it was acceptable to replace ‘á’ with ‘a’, ‘ó’ with ‘o’, and ‘ã’ with ‘a’. An SMS message such as ‘PME Negócios’ could be displayed as ‘PME Negocios’ without rendering the message unintelligible.

This was a much better alternative than a ‘lossy’ conversion from UTF-8 to GSM-7, which would likely result in the replacement character ‘?’ being used for the Portuguese-language characters which did not exist in GSM-7. For instance, ‘PME Negócios’ would become ‘PME Neg?cios’. While not ideal, a close replacement character is clearly preferable to a question mark.
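To make the contrast concrete, here is a minimal Python sketch of the two outcomes. Python is used purely for illustration, and the function names `lossy_ascii` and `strip_accents` are hypothetical. It also targets plain ASCII rather than GSM-7, so it only demonstrates the difference between a generic replacement and a close replacement, not the actual conversion.

```python
import unicodedata


def lossy_ascii(text: str) -> str:
    """Replace every character that has no ASCII equivalent with '?'."""
    return text.encode("ascii", errors="replace").decode("ascii")


def strip_accents(text: str) -> str:
    """Decompose accented letters and drop the combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


message = "PME Neg\u00f3cios"  # 'PME Negócios'
print(lossy_ascii(message))    # PME Neg?cios
print(strip_accents(message))  # PME Negocios
```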

But how to achieve this conversion? Very few resources exist to convert UTF-8 to GSM-7, much less do so with the sort of flexibility we had in mind. Normally, I would opt to use something like iconv to convert between encodings, but iconv does not natively support GSM-7.

The Implementation

In the end, I decided to write a custom library to handle conversion from UTF-8 to GSM-7, which would allow for replacement characters which were as close as possible to the original UTF-8 character. This was achieved through direct 1-to-1 conversion of GSM-7 supported characters, and then transliteration to handle the remainder.

The library takes a UTF-8 string and examines each character separately. If the character has a GSM-7 equivalent, the GSM-7 value of that character is looked up in a mapping. If it does not, the character is transliterated to ASCII. (Nearly all printable ASCII characters are representable in GSM-7, although a few, such as ‘@’, ‘$’, and ‘_’, have different values than in ASCII, and others, such as ‘[’ and ‘{’, need an escape sequence from the GSM-7 extension table.) Once the character has been transliterated to ASCII, the GSM-7 value of that ASCII character is looked up in the same mapping.

In this way, a UTF-8 string can be converted to GSM-7, using the best possible replacement character instead of simply dropping the character altogether, or using a generic replacement. If, for some reason, there is no way to transliterate a UTF-8 character to ASCII, then a replacement character must be used as there is no alternative. While definitely not ideal, this solution was sufficient for our needs.
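The logic described above can be sketched roughly as follows. This is not the library itself, just a small Python illustration under a few assumptions: the names `GSM7_BASIC`, `GSM7_CODE`, `transliterate`, and `to_gsm7` are hypothetical, transliteration is approximated with Unicode decomposition from the standard `unicodedata` module, the GSM-7 extension table is ignored (so characters such as ‘[’ or ‘€’ fall back to ‘?’), and the result is one unpacked GSM-7 code per byte. Python strings are already decoded, so UTF-8 bytes would need a `.decode("utf-8")` first.

```python
import unicodedata

# GSM 03.38 basic character set; a character's index is its 7-bit code.
GSM7_BASIC = (
    "@£$¥èéùìòÇ\nØø\rÅå"
    "Δ_ΦΓΛΩΠΨΣΘΞ\x1bÆæßÉ"
    " !\"#¤%&'()*+,-./"
    "0123456789:;<=>?"
    "¡ABCDEFGHIJKLMNO"
    "PQRSTUVWXYZÄÖÑÜ§"
    "¿abcdefghijklmno"
    "pqrstuvwxyzäöñüà"
)

# 0x1B is the escape code for the extension table, which this sketch ignores.
GSM7_CODE = {ch: code for code, ch in enumerate(GSM7_BASIC) if ch != "\x1b"}


def transliterate(ch: str) -> str:
    """Approximate one character by dropping its combining marks."""
    decomposed = unicodedata.normalize("NFKD", ch)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def to_gsm7(text: str, replacement: str = "?") -> bytes:
    """Return one unpacked GSM-7 code per byte for `text`."""
    out = bytearray()
    for ch in unicodedata.normalize("NFC", text):
        if ch in GSM7_CODE:  # direct 1-to-1 mapping
            out.append(GSM7_CODE[ch])
            continue
        # No GSM-7 equivalent: transliterate, then map each resulting character.
        # If nothing usable comes back, fall back to the replacement character.
        for approx in transliterate(ch) or replacement:
            out.append(GSM7_CODE.get(approx, GSM7_CODE[replacement]))
    return bytes(out)


print(to_gsm7("PME Negócios"))  # b'PME Negocios'
```

Note that ‘@’ comes out as 0x00 and ‘$’ as 0x02 here, which is why a lookup table is needed even for characters that GSM-7 shares with ASCII.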

For instance, with the UTF-8 string Seqüência de teste em Português, there are many characters which can be directly converted to GSM-7 without any issue, but a few which need to be handled specially. Let’s go through the logic.

  • The characters:
    a, c, d, e, g, i, m, n, o, P, q, r, S, s, t, u
    all have direct mappings to GSM-7 (which, conveniently, have the same hex values as in UTF-8), as does the space character.
  • ‘ü’ does exist in GSM-7, and has a value of 0x7E. So that’s easy: we provide that hex value for the conversion.
  • ‘ê’ does not exist in GSM-7. So… we transliterate to ASCII, which gives ‘e’.
  • ‘e’ does exist in GSM-7, and has a value of 0x65. So, we provide that hex value.

We now have a string which, if rendered in GSM-7, would look like ‘Seqüencia de teste em Portugues’. The actual hex values in GSM-7 would be:

\x53\x65\x71\x7e\x65\x6e\x63\x69\x61\x20\x64\x65\x20\x74\x65\x73\x74\x65\x20\x65\x6d\x20\x50\x6f\x72\x74\x75\x67\x75\x65\x73
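As a quick sanity check, the bytes above can be turned back into readable text. The Python snippet below is only a sketch for this one message: `GSM7_OVERRIDES` and `decode_example` are hypothetical names, and the table is deliberately partial. It relies on the fact that the letters and the space in this message have the same values in GSM-7 as in ASCII, with 0x7E (‘ü’) as the only exception, so it is not a general GSM-7 decoder.

```python
# The GSM-7 codes from the worked example, one code per byte (unpacked).
gsm7_bytes = (
    b"\x53\x65\x71\x7e\x65\x6e\x63\x69\x61\x20\x64\x65\x20\x74\x65\x73"
    b"\x74\x65\x20\x65\x6d\x20\x50\x6f\x72\x74\x75\x67\x75\x65\x73"
)

# Partial table: the only code in this message that differs from ASCII.
GSM7_OVERRIDES = {0x7E: "\u00fc"}  # 0x7E is 'ü' in GSM-7


def decode_example(data: bytes) -> str:
    """Decode the example bytes; valid only for letters, spaces, and 0x7E."""
    return "".join(GSM7_OVERRIDES.get(code, chr(code)) for code in data)


print(decode_example(gsm7_bytes))  # Seqüencia de teste em Portugues
```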

Conclusion

This strategy for converting a UTF-8 string to the 7-bit GSM Default Alphabet allowed us to communicate with the telephone company using the required encoding, without fully sacrificing the ability to send meaningful messages in Portuguese. It certainly isn’t as good as sending messages encoded as UTF-8 or UCS-2/UTF-16, but it suffices when no other option is available.