Character Encoding Fun with SMS Messages

While trying to send SMS messages via the SME Toolkit application using the SMPP protocol, I encountered some bizarre behavior.

I would send a seemingly ordinary message via SMPP, only to receive a slightly altered message on my mobile device. Only certain symbols were altered; Latin letters appeared normally. What is more, the altered symbols were not unusual: they were everyday symbols such as $ and @.

This was clearly a character encoding issue, but not one I had encountered before. Most character encoding issues either produce completely mangled, unreadable text or affect only non-Latin letters and non-standard symbols.

The reason for my surprise was that most standard character encodings used in the U.S. share the ASCII characters. ASCII includes the Latin alphabet and other commonly used symbols, such as those on a standard U.S. keyboard. For instance, every ASCII character has the same value in the Latin-1 and UTF-8 character encodings.
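That shared range is easy to check. Here is a minimal Python sketch (the helper name same_in_all is mine, purely for illustration) that encodes all 128 ASCII characters three ways and compares the resulting bytes:

```python
def same_in_all(text):
    """Return True if text encodes to identical bytes in ASCII, Latin-1, and UTF-8."""
    return text.encode("ascii") == text.encode("latin-1") == text.encode("utf-8")


# All 128 ASCII code points, 0x00 through 0x7F.
ascii_chars = "".join(chr(i) for i in range(128))

print(same_in_all(ascii_chars))      # prints True
print("$".encode("latin-1").hex())   # prints 24
print("@".encode("utf-8").hex())     # prints 40
```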

Mobile devices tend to use the GSM character encoding for SMS messages. However, the application was sending messages encoded in Latin-1. The GSM character encoding, like ASCII, uses 7 bits to represent each character. Unfortunately, GSM maps values to characters differently than ASCII does, so certain characters have different values.

There is a large overlap between the values for characters in GSM and ASCII and, by extension, Latin-1. However, certain important symbols such as the dollar sign ($) and the “at” sign (@) are not included in this overlap.

Here’s a look at some of the similarities and differences between GSM and ASCII values:

Hex Value    ASCII Character    GSM Character
21           !                  !
41           A                  A
61           a                  a
24           $                  ¤
40           @                  ¡
5E           ^                  Ü

Notice, specifically, the difference for the hex value of 24, which in ASCII is the dollar sign ($), but in GSM is the generic currency symbol (¤). For the hex value of 40, the character in ASCII is the commercial “at” symbol (@), but the character in GSM is an inverted exclamation mark (¡). For the hex value of 5E, ASCII uses the circumflex accent (^), but GSM uses the capital U with diaeresis (Ü).
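To see the effect on a whole message, here is a rough Python sketch of what happens when Latin-1 bytes are read back as GSM values. The function name and the sample message are made up for illustration, and the lookup is deliberately partial: it holds only the three differing values from the table above and assumes every other byte falls in the overlap, which is not true for all values.

```python
# Only the three differing values from the table above; NOT the full GSM alphabet.
GSM_DIFFERENCES = {0x24: "¤", 0x40: "¡", 0x5E: "Ü"}


def as_seen_on_handset(text):
    """Encode text as Latin-1, then read each byte as if it were a GSM value."""
    raw = text.encode("latin-1")
    # Bytes not listed in GSM_DIFFERENCES are assumed to mean the same in both.
    return "".join(GSM_DIFFERENCES.get(byte, chr(byte)) for byte in raw)


print(as_seen_on_handset("Pay $5 to bob@example.com!"))
# prints: Pay ¤5 to bob¡example.com!
```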

Once the issue was recognized, the solution was simple: convert messages sent from the application into the GSM character encoding. The issue surfaced largely because the local SMPP gateway only supported the Latin-1 character encoding, while the gateway used by the telco only supported the GSM character encoding. Often, the SMPP gateway will recognize and automatically convert SMS messages sent via SMPP into the GSM character encoding and vice versa, seamlessly masking the underlying difference.
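The conversion itself can be sketched in a few lines of Python. This is an illustration rather than the code the application actually used: the function name latin1_to_gsm is hypothetical, the table covers only the GSM basic character set (the escape-prefixed extension characters, which include ^, are left out), and it emits one 7-bit value per byte. Whether a given gateway expects that unpacked form is something to verify against its documentation.

```python
# GSM default alphabet, basic set: a character's index in the string is its value.
# The extension table (ESC-prefixed characters such as ^, {, }) is omitted here.
GSM_BASIC = (
    "@£$¥èéùìòÇ\nØø\rÅåΔ_ΦΓΛΩΠΨΣΘΞ\x1bÆæßÉ !\"#¤%&'()*+,-./0123456789:;<=>?"
    "¡ABCDEFGHIJKLMNOPQRSTUVWXYZÄÖÑÜ§¿abcdefghijklmnopqrstuvwxyzäöñüà"
)
assert len(GSM_BASIC) == 128

# Reverse lookup: character -> GSM value.
GSM_VALUE = {ch: value for value, ch in enumerate(GSM_BASIC)}


def latin1_to_gsm(data: bytes) -> bytes:
    """Re-encode Latin-1 bytes as GSM values, one 7-bit value per output byte."""
    text = data.decode("latin-1")
    try:
        return bytes(GSM_VALUE[ch] for ch in text)
    except KeyError as err:
        # Raised for characters outside the basic set, e.g. ^ or Latin-1 ©.
        raise ValueError(f"no GSM basic-set value for {err.args[0]!r}") from None


print(latin1_to_gsm(b"$@").hex())  # prints 0200
```

Raising an error for unmappable characters is one possible choice; substituting a placeholder such as ? would be another, depending on how strict the application needs to be.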