Rendering UTF-8 characters in Rich Text Format with PHP

One of the requirements for a project that I’ve been working on was to dynamically generate a document using information in a database. RTF was chosen for its versatility and compatibility across platforms.

While implementing this feature, I discovered that some characters would not render properly. These were multibyte UTF-8 characters, which cannot legally be embedded directly into RTF output.

After some research, I learned that it is possible to render a specific character by specifying its 16-bit code point. The RTF sequence looks as follows: \u####?. The #’s represent the decimal code point value. The question mark acts as a replacement character for legacy RTF viewers that do not support rendering by code point.

Let’s use the left double quotation mark “ as an example:

  • UTF-8: \xE2\x80\x9C
  • UTF-16 hexadecimal: 201C
  • UTF-16 decimal: 8220
  • RTF: \u8220?
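The values in the list above can be reproduced with a few lines of PHP. This is a minimal sketch, assuming the mbstring extension and PHP 7.2 or later (for mb_ord); rtf_escape_char is a hypothetical helper name, and it ignores code points above 32767, which RTF expects as negative numbers.

```php
<?php
// Hypothetical helper: builds the RTF escape for one UTF-8 character.
// Assumes a code point no greater than 32767.
function rtf_escape_char(string $char): string {
    $code_point = mb_ord($char, 'UTF-8'); // decimal code point
    return '\\u' . $code_point . '?';
}

$quote = "\xE2\x80\x9C"; // left double quotation mark, U+201C

echo bin2hex($quote), "\n";                 // UTF-8 bytes: e2809c
echo dechex(mb_ord($quote, 'UTF-8')), "\n"; // hexadecimal: 201c
echo mb_ord($quote, 'UTF-8'), "\n";         // decimal: 8220
echo rtf_escape_char($quote), "\n";         // RTF: \u8220?
```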

I needed to implement a mechanism in PHP for locating, isolating, and converting sequences of UTF-8 characters.

The PHP function mb_convert_encoding appeared to satisfy what I required:


mb_convert_encoding("\xE2\x80\x9C", 'UTF-16', 'UTF-8') == "\x20\x1C" // True

Unfortunately, applying this to an entire block of text converts all of the text to UTF-16, including the ASCII characters, which was not the desired result.


mb_convert_encoding("a\xE2\x80\x9Cb", 'UTF-16', 'UTF-8') == "a\x20\x1Cb" // False
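To see why the comparison fails, it helps to dump the converted bytes. The sketch below uses 'UTF-16BE' rather than 'UTF-16' so that the byte order is explicit; that substitution is my assumption, not part of the original code.

```php
<?php
// Convert the whole string: every character, including the ASCII
// letters, becomes a two-byte UTF-16 unit.
$utf16 = mb_convert_encoding("a\xE2\x80\x9Cb", 'UTF-16BE', 'UTF-8');

echo bin2hex($utf16), "\n"; // 0061201c0062
```

The letters a and b are now 0061 and 0062, each padded with a zero byte, so the result no longer matches a string in which only the quotation mark was converted.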

I needed to isolate the multibyte UTF-8 sequences and convert them individually.

The specification for UTF-8 indicates that a byte sequence is one to four bytes long. Single-byte UTF-8 “sequences” map directly to US-ASCII and therefore do not need to be converted. The first byte in a multibyte sequence determines how many bytes the sequence has. For example, if the first byte falls within the range \xE0 to \xEF, two additional bytes follow, each in the range \x80 to \xBF.
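The lead-byte rule can be written out as a small function. This is only an illustration of the ranges involved; utf8_sequence_length is a hypothetical name, and it assumes a non-empty input string.

```php
<?php
// Hypothetical helper: returns the total length in bytes of the UTF-8
// sequence that starts with the first byte of $lead_byte, or 0 if that
// byte cannot start a sequence.
function utf8_sequence_length(string $lead_byte): int {
    $value = ord($lead_byte[0]);
    if ($value <= 0x7F) {
        return 1; // US-ASCII, no conversion needed
    }
    if ($value >= 0xC2 && $value <= 0xDF) {
        return 2; // one continuation byte follows
    }
    if ($value >= 0xE0 && $value <= 0xEF) {
        return 3; // two continuation bytes follow
    }
    if ($value >= 0xF0 && $value <= 0xF4) {
        return 4; // three continuation bytes follow
    }
    return 0; // continuation byte (0x80-0xBF) or invalid lead byte
}
```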

These patterns could be easily represented in a regular expression array:

$patterns = array(
"[\xC2-\xDF][\x80-\xBF]",    // Two byte sequence
"[\xE0-\xEF][\x80-\xBF]{2}", // Three bytes
"[\xF0-\xF4][\x80-\xBF]{3}", // Four bytes
);
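As a quick check that these patterns isolate the multibyte sequences, they can be joined into one expression and passed to preg_match_all. This is a sketch: the sample string and the $regex variable are my own, and the expression deliberately omits the /u modifier so that PCRE matches byte by byte.

```php
<?php
$patterns = array(
    "[\xC2-\xDF][\x80-\xBF]",    // Two byte sequence
    "[\xE0-\xEF][\x80-\xBF]{2}", // Three bytes
    "[\xF0-\xF4][\x80-\xBF]{3}", // Four bytes
);

// Join the alternatives into a single byte-oriented expression.
$regex = '/' . implode('|', $patterns) . '/';

// Sample text: "a", U+201C, "b", U+00E9
preg_match_all($regex, "a\xE2\x80\x9Cb\xC3\xA9", $matches);

// $matches[0] holds the sequences found, in order of appearance.
foreach ($matches[0] as $sequence) {
    echo bin2hex($sequence), "\n";
}
// e2809c
// c3a9
```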

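PHP has a plethora of regular expression functions, and the solution ultimately came from preg_replace. What made preg_replace especially convenient was the ability to pass PHP code directly to the replacement parameter using the /e modifier. Note that the /e modifier was deprecated in PHP 5.5.0 and removed in PHP 7.0.0, so the code below requires an older PHP version.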

function utf8_to_rtf($utf8_text) {
    // Byte patterns for multibyte UTF-8 sequences
    $utf8_patterns = array(
        "[\xC2-\xDF][\x80-\xBF]",    // Two byte sequence
        "[\xE0-\xEF][\x80-\xBF]{2}", // Three bytes
        "[\xF0-\xF4][\x80-\xBF]{3}", // Four bytes
    );
    $new_str = $utf8_text;
    foreach ($utf8_patterns as $pattern) {
        // The /e modifier evaluates the replacement string as PHP code
        $new_str = preg_replace("/($pattern)/e",
            "'\u'.hexdec(bin2hex(mb_convert_encoding('$1', 'UTF-16', 'UTF-8'))).'?'",
            $new_str);
    }
    return $new_str;
}

The key bit of code above is the replacement string sent to preg_replace:


'\u' . hexdec( bin2hex( mb_convert_encoding('$1', 'UTF-16', 'UTF-8'))) . '?'

The first matched grouping (denoted by $1) is converted from UTF-8 to UTF-16 using mb_convert_encoding. The binary result is converted to hexadecimal, and then to decimal. Finally, a \u is prepended and a ? is appended.
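For PHP versions without the /e modifier, the same steps can be expressed with preg_replace_callback. The following is a sketch rather than a drop-in replacement: utf8_to_rtf_callback is a hypothetical name, 'UTF-16BE' is used to make the byte order explicit, and two behaviours are my own additions. RTF reads the \u argument as a signed 16-bit integer, so values above 32767 are written as negative numbers. A four-byte UTF-8 sequence converts to two UTF-16 units, which hexdec would otherwise merge into one large number; writing each unit as its own \u keyword is an assumption about how RTF readers handle such characters.

```php
<?php
// Sketch of the same conversion without the /e modifier.
// Requires the mbstring extension.
function utf8_to_rtf_callback(string $utf8_text): string {
    // The three byte patterns from the article, joined as alternatives.
    $pattern = "/[\xC2-\xDF][\x80-\xBF]"
             . "|[\xE0-\xEF][\x80-\xBF]{2}"
             . "|[\xF0-\xF4][\x80-\xBF]{3}/";

    return preg_replace_callback($pattern, function (array $match): string {
        // Big-endian UTF-16: one 16-bit unit, or two for a surrogate pair.
        $utf16 = mb_convert_encoding($match[0], 'UTF-16BE', 'UTF-8');
        $rtf = '';
        // 'n*' unpacks every unsigned 16-bit big-endian unit.
        foreach (unpack('n*', $utf16) as $unit) {
            // Map values above 32767 into the signed 16-bit range.
            if ($unit > 32767) {
                $unit -= 65536;
            }
            $rtf .= '\\u' . $unit . '?';
        }
        return $rtf;
    }, $utf8_text);
}

echo utf8_to_rtf_callback("a\xE2\x80\x9Cb"), "\n"; // a\u8220?b
```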