One of the requirements for a project that I’ve been working on was to dynamically generate a document using information in a database. RTF was chosen for its versatility and compatibility across platforms.
While implementing this feature, I discovered that some characters would not render properly. These were multibyte UTF-8 characters, which cannot legally be embedded directly into RTF output.
After some research, I learned that it was possible to render a specific character by specifying its 16-bit code point. The RTF sequence looks as follows: \u####? The #'s represent the decimal code point value. The question mark acts as a replacement character for legacy RTF viewers that do not support rendering by code point.
Let’s use the left double quotation mark “ as an example:
- UTF-8: \xE2\x80\x9C
- UTF-16 hexadecimal: 201C
- UTF-16 decimal: 8220
- RTF: \u8220?
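The four representations above can be reproduced step by step in PHP. This is a minimal sketch, assuming the mbstring extension is loaded; the variable names are illustrative only.

```php
<?php
// Left double quotation mark (U+201C) as raw UTF-8 bytes.
$utf8 = "\xE2\x80\x9C";

// Re-encode as big-endian UTF-16 and view the two resulting bytes as hex.
$utf16_hex = bin2hex(mb_convert_encoding($utf8, 'UTF-16BE', 'UTF-8'));

// Hex to decimal gives the number that follows \u in RTF.
$decimal = hexdec($utf16_hex);

// Single quotes keep the backslash literal.
$rtf = '\u' . $decimal . '?';

echo bin2hex($utf8), "\n"; // e2809c
echo $utf16_hex, "\n";     // 201c
echo $decimal, "\n";       // 8220
echo $rtf, "\n";           // \u8220?
```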
I needed to implement a mechanism in PHP for locating, isolating, and converting sequences of UTF-8 characters.
The PHP function mb_convert_encoding appeared to satisfy what I required:
mb_convert_encoding("\xE2\x80\x9C", 'UTF-16', 'UTF-8') == "\x20\x1C" // True
Unfortunately, applying this to an entire block of text converts all of the text to UTF-16, including the ASCII characters, which was not the desired result.
mb_convert_encoding("a\xE2\x80\x9Cb", 'UTF-16', 'UTF-8') == "a\x20\x1Cb" // False
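Dumping the converted bytes shows why the comparison fails. A small sketch, assuming mbstring is available and using an explicit UTF-16BE target so the byte order is unambiguous:

```php
<?php
// "a", left double quotation mark, "b" as UTF-8.
$text = "a\xE2\x80\x9Cb";

// Every character, ASCII included, becomes a two-byte UTF-16 unit.
$converted_hex = bin2hex(mb_convert_encoding($text, 'UTF-16BE', 'UTF-8'));

echo $converted_hex, "\n"; // 0061201c0062
```

The "a" and "b" are widened to 0061 and 0062, so the result can no longer be mixed with plain ASCII RTF markup.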
I needed to isolate the multibyte UTF-8 sequences and convert them individually.
The specification for UTF-8 indicates that a byte sequence is one to four bytes long. Single byte UTF-8 “sequences” map directly to US-ASCII and therefore do not need to be converted. The first byte in the multibyte sequence is used to determine how many bytes the sequence has. For example, if the first byte falls within the range \xE0 to \xEF, two additional bytes follow in the range \x80 to \xBF.
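The lead-byte rule can be written out as a small lookup. This is only an illustration of the ranges described above; utf8_sequence_length() is a hypothetical helper, not part of PHP or of the solution that follows.

```php
<?php
// Hypothetical helper: number of bytes in the UTF-8 sequence that starts
// with $byte. Returns 0 for bytes that cannot start a sequence.
function utf8_sequence_length($byte)
{
    $value = ord($byte);
    if ($value <= 0x7F) {
        return 1; // US-ASCII, no conversion needed
    }
    if ($value >= 0xC2 && $value <= 0xDF) {
        return 2; // one continuation byte follows
    }
    if ($value >= 0xE0 && $value <= 0xEF) {
        return 3; // two continuation bytes follow
    }
    if ($value >= 0xF0 && $value <= 0xF4) {
        return 4; // three continuation bytes follow
    }
    return 0; // 0x80-0xC1 and 0xF5-0xFF are never valid lead bytes
}

echo utf8_sequence_length("\xE2"), "\n"; // 3
```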
These patterns could be easily represented in a regular expression array:
$patterns = array(
    "[\xC2-\xDF][\x80-\xBF]",    // Two byte sequence
    "[\xE0-\xEF][\x80-\xBF]{2}", // Three bytes
    "[\xF0-\xF4][\x80-\xBF]{3}", // Four bytes
);
PHP has a plethora of regular expression functions, and the solution ultimately came from preg_replace. What made preg_replace especially convenient was the ability to pass PHP code directly to the replacement parameter using the /e modifier.
function utf8_to_rtf($utf8_text) {
    $utf8_patterns = array(
        "[\xC2-\xDF][\x80-\xBF]",
        "[\xE0-\xEF][\x80-\xBF]{2}",
        "[\xF0-\xF4][\x80-\xBF]{3}",
    );
    $new_str = $utf8_text;
    foreach($utf8_patterns as $pattern) {
        $new_str = preg_replace("/($pattern)/e",
            "'\u'.hexdec(bin2hex(mb_convert_encoding('$1', 'UTF-16', 'UTF-8'))).'?'",
            $new_str);
    }
    return $new_str;
}
The key bit of code from above is the replacement string sent to preg_replace:
'\u' . hexdec( bin2hex( mb_convert_encoding('$1', 'UTF-16', 'UTF-8'))) . '?'
The first matched grouping (denoted by $1) is converted from UTF-8 to UTF-16 using mb_convert_encoding. The binary result is converted to hexadecimal, and then to decimal. Finally, a \u is prepended and a ? is appended.
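The /e modifier was deprecated in PHP 5.5 and removed in PHP 7.0, so the function above will not run on current PHP versions. The following is a sketch of the same idea using a callback, not the original implementation. It also emits one \u escape per 16-bit UTF-16 unit, on the assumption (based on my reading of the RTF specification) that the number after \u is a signed 16-bit value and that characters needing four UTF-8 bytes are written as two surrogate escapes. The name utf8_to_rtf_units() is hypothetical.

```php
<?php
// Sketch only: utf8_to_rtf_units() is a hypothetical name.
function utf8_to_rtf_units($utf8_text)
{
    return preg_replace_callback(
        // One lead byte followed by its continuation bytes.
        '/[\xC2-\xF4][\x80-\xBF]+/',
        function ($matches) {
            // Big-endian UTF-16: each pair of bytes is one 16-bit unit.
            $utf16 = mb_convert_encoding($matches[0], 'UTF-16BE', 'UTF-8');
            $rtf = '';
            foreach (unpack('n*', $utf16) as $unit) {
                // Assumption: values above 32767 are written as negatives.
                if ($unit > 32767) {
                    $unit -= 65536;
                }
                $rtf .= '\u' . $unit . '?';
            }
            return $rtf;
        },
        $utf8_text
    );
}

echo utf8_to_rtf_units("a\xE2\x80\x9Cb"), "\n";   // a\u8220?b
echo utf8_to_rtf_units("\xF0\x9F\x98\x80"), "\n"; // \u-10179?\u-8704?
```

The second call shows U+1F600, whose UTF-16 form is the surrogate pair D83D DE00. Feeding all four bytes through hexdec(bin2hex(...)) at once would instead produce a single number far outside the 16-bit range.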
Thank You! Helped a lot
Thanks for the code! Here’s a slightly modified version that:
1) uses iconv instead of mb_convert_encoding.
2) uses a single match pattern that matches UTF-8 byte sequences of all lengths (as long as the UTF-8 text is valid, this will work).
3) ensures that the byte order of the UTF-16 code is correct (it has to be big endian — iconv doesn’t default to the same byte order on all systems).
return preg_replace('/([\\xC2-\\xF4][\\x80-\\xBF]+)/e', '"\\u".hexdec(bin2hex(iconv("UTF-8","UTF-16BE","\\1")))."?"', $utf8_text);
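The byte-order point is easy to see by converting the same character both ways. A small sketch, assuming the iconv extension is enabled; the variable names are illustrative:

```php
<?php
$quote = "\xE2\x80\x9C"; // U+201C in UTF-8

// Big endian yields the bytes 20 1C, little endian yields 1C 20.
$big_endian    = hexdec(bin2hex(iconv('UTF-8', 'UTF-16BE', $quote)));
$little_endian = hexdec(bin2hex(iconv('UTF-8', 'UTF-16LE', $quote)));

echo $big_endian, "\n";    // 8220
echo $little_endian, "\n"; // 7200
```

Only the big-endian result matches the code point, which is why the target encoding is spelled out as UTF-16BE.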
I've adapted this a little. Theoretically, code injection should be impossible with this specific regular expression, but I still don't feel good about leaving the PCRE_EVAL modifier in there. There is a perfectly good alternative in preg_replace_callback.
function Utf8ToRtf($utf8_text) {
    // Normalize line endings to "\n", then emit an RTF \par before each one.
    $utf8_text = str_replace("\n", "\\par\n", str_replace("\r", "\n", str_replace("\r\n", "\n", $utf8_text)));
    return preg_replace_callback("/([\xC2-\xF4][\x80-\xBF]+)/", 'FixUnicodeForRtf', $utf8_text);
}
function FixUnicodeForRtf($matches) {
    return '\u'.hexdec(bin2hex(iconv('UTF-8', 'UTF-16BE', $matches[1]))).'?';
}
Hey, thanks for the nice work here, but I encountered problems with the '§' character.
It might be necessary to convert any character that's above 127 into an escape code.
I'm not familiar with regexp, but I changed "/([\\xC2-\\xF4][\\x80-\\xBF]+)/" into "/([\\x80-\\xF4][\\x80-\\xBF]+)/".
Tell me if I'm wrong.
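A quick check of the '§' case, offered as a sketch rather than a diagnosis. In UTF-8, '§' (U+00A7) is the two bytes C2 A7, which the unmodified pattern already matches. If the input is actually ISO-8859-1, '§' is the single byte A7 and nothing matches. That the problem text was ISO-8859-1 is only an assumption here; in that case, converting the input to UTF-8 first may be safer than widening the lead-byte range, since 80 to BF are continuation bytes in UTF-8.

```php
<?php
$pattern = '/[\xC2-\xF4][\x80-\xBF]+/';

$utf8_section   = "\xC2\xA7"; // "§" encoded as UTF-8
$latin1_section = "\xA7";     // "§" encoded as ISO-8859-1

var_dump(preg_match($pattern, $utf8_section));   // int(1)
var_dump(preg_match($pattern, $latin1_section)); // int(0)

// Assumption: the source text is ISO-8859-1, so convert it before escaping.
$converted = mb_convert_encoding($latin1_section, 'UTF-8', 'ISO-8859-1');
var_dump($converted === $utf8_section); // bool(true)
```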