Some Useful Iconv Functionality

Much of the work I do involves heavy manipulation of documents and text data for display on the web, and in various file formats. Necessarily, I deal with character encodings and related issues on nearly a daily basis.

One of the programs I rely upon for this work is iconv, which is available both as a command-line utility and as a library (the standalone GNU implementation is called libiconv). iconv has some quirks (see my post on Converting to UTF-16 and UCS-2 with Iconv) and does not always behave as expected in some environments (the iconv gem for Ruby, for instance, does not support transliteration), but it is still useful. Its primary use is converting text from one character encoding to another, for example from Latin-1 to UTF-16.

The reasons for converting between encodings vary. Sometimes it is necessary to convert a legacy character encoding to the more modern and universal UTF-8. Other times text must be in a particular encoding so that a certain program can use it properly.

Whatever the reason, iconv is often the quickest solution for simple conversion of character encodings. It ships with most GNU/Linux distributions as part of the GNU C Library (glibc). Many programming languages, including Ruby and PHP, provide wrappers that interface with iconv.

iconv is simple enough to use. On the command line, I specify the encoding an existing file uses with -f (from), and the encoding I would like to convert it to with -t (to). The converted text is written to standard output.

$ iconv [-t <new_encoding>] [-f <old_encoding>] [filename]

For example, to convert a text document from Latin-1 to UTF-8, I could execute:

[kulesza@home ~]$ iconv -t UTF-8 -f LATIN1 myfile.txt
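Because iconv writes the converted text to standard output, saving the result means redirecting it to a file. Here is a minimal sketch that builds a small Latin-1 file, converts it, and dumps the resulting bytes. The file names latin1_sample.txt and utf8_sample.txt are made up for illustration.

```shell
# Build a tiny Latin-1 file: "café", with é stored as the single byte 0xE9 (octal 351).
printf 'caf\351\n' > latin1_sample.txt

# Convert to UTF-8; iconv writes to standard output, so redirect into a new file.
iconv -f LATIN1 -t UTF-8 latin1_sample.txt > utf8_sample.txt

# Inspect the bytes: é should now be the two-byte UTF-8 sequence c3 a9.
od -An -tx1 utf8_sample.txt
```

Note that redirecting into the input file itself would truncate it before iconv has read it, so the output should go to a separate file.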

I can list all character encodings that iconv can work with by executing:

[kulesza@home ~]$ iconv -l
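The list is long, and its layout is not the same across iconv implementations, so in a script it can be easier to ask iconv directly whether it knows an encoding. The sketch below does this by converting empty input and checking the exit status. The function name supports_encoding is hypothetical, and NO-SUCH-ENCODING is a deliberately bogus name.

```shell
# Hypothetical helper: succeeds if this iconv can convert from the named encoding.
# It converts empty input, so only the encoding lookup is exercised.
supports_encoding() {
    iconv -f "$1" -t UTF-8 < /dev/null > /dev/null 2>&1
}

for enc in UTF-8 LATIN1 NO-SUCH-ENCODING; do
    if supports_encoding "$enc"; then
        echo "$enc: supported"
    else
        echo "$enc: not supported"
    fi
done
```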

There are many other options, which are best left to the man page, but this covers the basic usage.

Some of iconv’s features aren’t well documented. These include the //IGNORE and //TRANSLIT extensions.

//IGNORE tells iconv to ignore invalid character sequences that it encounters during conversion. This is useful if I am converting a UTF-8 text file to another encoding, but the file contains an invalid UTF-8 sequence such as a stray 0x80 byte (a continuation byte with no lead byte before it). Ordinarily, iconv would exit and not complete the conversion. However, appending //IGNORE to the ‘to’ character encoding causes iconv to simply discard any invalid sequences and attempt to continue the conversion.

For example, say I have some string with an invalid UTF-8 character sequence, but many other valid characters I want converted.

[kulesza@home ~]$ echo -e "\x80iconv\x20\x80will\x20\x80not\x20\x80like\x20\x80this" | iconv -t LATIN1 -f UTF-8
iconv: illegal input sequence at position 0

*boom*

But if I use //IGNORE…

[kulesza@home ~]$ echo -e "\x80iconv\x20\x80will\x20\x80not\x20\x80like\x20\x80this" | iconv -t LATIN1//IGNORE -f UTF-8
iconv will not like this
iconv: illegal input sequence at position 30

iconv may still complain about the invalid sequences, but will complete the conversion. Alternatively, I can use the -c option for iconv to omit invalid characters from output, which will also suppress warnings.

[kulesza@training ~]$ echo -e "\x80iconv\x20\x80will\x20\x80not\x20\x80like\x20\x80this" | iconv -c -t LATIN1 -f UTF-8
iconv will not like this
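Since iconv exits with an error on invalid input when neither //IGNORE nor -c is given, its exit status can also serve as a quick validity check before deciding whether any cleaning is needed. This is a minimal sketch under that assumption. The function name is_valid_utf8 and the files good.txt and bad.txt are hypothetical.

```shell
# Hypothetical helper: succeeds only if the file is entirely valid UTF-8.
# No //IGNORE or -c here, so iconv fails on the first invalid sequence.
is_valid_utf8() {
    iconv -f UTF-8 -t UTF-8 "$1" > /dev/null 2>&1
}

# One clean file and one containing a stray 0x80 byte (octal 200).
printf 'plain text\n' > good.txt
printf 'bad \200 byte\n' > bad.txt

for f in good.txt bad.txt; do
    if is_valid_utf8 "$f"; then
        echo "$f: valid UTF-8"
    else
        echo "$f: invalid UTF-8"
    fi
done
```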

//TRANSLIT tells iconv to transliterate, that is, to replace a character that cannot be represented in the target encoding with the closest available match. This may be necessary when converting from something like UTF-8 (which can encode the full range of Unicode characters) to ASCII (which supports only a very limited character repertoire). Ordinarily, iconv would exit at the first unrepresentable character and not complete the conversion. With //TRANSLIT, it substitutes the closest match instead. If no suitable match is available, the character is replaced with the ‘replacement character’, which is often a question mark.

For example, the characters “ç”, “ß”, and “∑” do not exist in ASCII, but do exist in UTF-8.

[kulesza@home ~]$ echo -e "ç ß ∑" | iconv -t ASCII -f UTF-8
iconv: illegal input sequence at position 0

But if I use //TRANSLIT…

[kulesza@home ~]$ echo -e "ç ß ∑" | iconv -t ASCII//TRANSLIT -f UTF-8
c ss ?
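In a script, the same conversion can be wrapped in a small function. This is only a sketch: the function name to_ascii is hypothetical, and the exact substitutions may differ from the output shown above depending on the iconv implementation and environment.

```shell
# Hypothetical helper: transliterate UTF-8 text on standard input to ASCII.
# "|| true" is a precaution (an assumption, not documented behavior): it keeps
# a non-zero exit status from aborting a script that runs under "set -e".
to_ascii() {
    iconv -f UTF-8 -t ASCII//TRANSLIT || true
}

ascii=$(printf 'ç ß ∑\n' | to_ascii)
echo "$ascii"
```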

Clearly iconv can be a very powerful tool for converting text between different encodings, and for cleaning invalid character sequences from files. On my current project, we are using it for a variety of purposes, including cleaning invalid UTF-8 sequences from user-submitted files before saving them in the database. This is accomplished by having iconv convert from UTF-8 to UTF-8 with the //IGNORE extension.
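Here is a minimal sketch of that cleaning step. It uses the -c option described earlier rather than //IGNORE, since -c also suppresses the warnings. The sample string and the variable name cleaned are illustrative and not taken from the project.

```shell
# Sample input: valid text with two stray 0x80 bytes (octal 200) mixed in.
# -c drops sequences that are not valid UTF-8; "|| true" tolerates an
# iconv that still exits non-zero after discarding input.
cleaned=$(printf 'valid\200 text\200 more\n' | iconv -c -f UTF-8 -t UTF-8 || true)

echo "$cleaned"
```

As with everything else here, it is worth confirming on the target system that the stray bytes really are removed.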

While iconv can be very powerful, it should be tested extensively in the environment in which it will be used. As mentioned previously, I have observed it to behave differently depending on the server environment. If development takes place in a different environment from production, the behavior of iconv should be tested early to avoid potential issues.