Character Encoding Tricks for Vim

Manipulating large amounts of data can often be a challenge, especially when the data utilizes complex character encodings or requires a change of character encoding. Fortunately, certain text editors such as Vim are well suited to handle this type of work.

Vim supports many character encodings, and provides enhanced functionality to work with some of them. For example, Vim allows you to change the character encoding of a particular file, check for characters that aren’t valid for a particular encoding, and find the code value for a particular character.

This functionality is valuable when converting a text file from one character encoding to another, when identifying unprintable or invalid characters, and when determining the raw binary contents of a file.

Below are a few helpful commands for working with character encodings in Vim. Note that arguments in square brackets, [ ], are optional, while arguments in angle brackets, < >, are required.

:edit ++enc=<encoding> [filename]

This command allows you to open (or re-open) a file for editing in Vim using the specified encoding. If no filename is given, the current file is re-read. This is useful if, for example, you are editing a file encoded in UTF-8, but Vim has auto-detected it as Latin-1.
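To see why a wrongly detected encoding matters, the following Python sketch decodes the same bytes twice, once as Latin-1 and once as UTF-8. It only illustrates the underlying effect and does not model how Vim detects encodings; the sample string is an arbitrary choice.

```python
# Text containing non-ASCII characters, as it would be stored in a UTF-8 file.
original = "caf\u00e9 \u03a9"          # "café Ω"
raw_bytes = original.encode("utf-8")

# Decoding as Latin-1 never fails, because every byte value maps to some
# Latin-1 character, but multi-byte UTF-8 sequences come out garbled.
misread = raw_bytes.decode("latin-1")

# Decoding with the correct encoding restores the text. This is the effect
# of re-opening the file with :edit ++enc=utf-8.
reread = raw_bytes.decode("utf-8")

print(repr(misread))
print(repr(reread))
```

The bytes never change between the two reads; only their interpretation does, which is why re-opening with the right encoding repairs the display without touching the file.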

:write ++enc=<encoding> [filename]

Just as ‘:edit ++enc’ opens a file in a specified encoding, this command saves a file in a specified encoding. This is useful if you’d like to save a file as something other than the default; for example, if you typed a file as UTF-8 in Vim but want to save it as Latin-1. For the save to succeed without loss, every character in the current buffer must exist in the target encoding. Characters that cannot be represented there may be lost. (For example, the Greek letter Omega, Ω, U+03A9, can be represented in UTF-8 but not in Latin-1.)
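The Omega example can be checked outside Vim. The sketch below uses Python's built-in codecs to show that the character encodes in UTF-8 but has no Latin-1 representation. How Vim itself reports or handles a failed conversion is not modeled here; the "replace" step only illustrates what data loss looks like.

```python
omega = "\u03a9"  # Greek capital letter Omega

# UTF-8 can represent every Unicode code point.
utf8_bytes = omega.encode("utf-8")

# Latin-1 only covers code points 0 through 255, so Omega (937) is rejected.
try:
    omega.encode("latin-1")
    representable = True
except UnicodeEncodeError:
    representable = False

# Forcing the conversion substitutes a placeholder: the original character
# cannot be recovered from the result.
lossy = omega.encode("latin-1", errors="replace")

print(utf8_bytes, representable, lossy)
```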

:set encoding[=<encoding>]

This command specifies the character encoding that Vim will use internally for input, buffers, registers, etc. The default depends on your environment: Vim derives it from the locale and falls back to Latin-1, while Neovim always uses UTF-8. If no encoding is specified, the current encoding will be displayed.

:set fileencoding[=<encoding>]

This command specifies the character encoding that should be used when saving the file in the current buffer. If the encoding specified by ‘:set fileencoding’ differs from ‘:set encoding’, Vim will attempt to convert the contents of the buffer from the internal encoding (‘:set encoding’) to the target encoding (‘:set fileencoding’) when writing. As with ‘:write ++enc’, every character in the buffer must exist in the target encoding; characters that cannot be represented there may be lost. (For example, the Greek letter Omega, Ω, U+03A9, can be represented in UTF-8 but not in Latin-1.)

:as or keystroke: ga

This command (‘:as’ is short for ‘:ascii’) displays the code point of the character under the cursor. You may also use the shorter keyboard shortcut ‘ga’. The decimal, hexadecimal, and octal code point values will be displayed. For example, the Greek letter Omega, in UTF-8, provides: “<Ω> 937, Hex 03a9, Octal 1651”. ‘03a9’ is the hexadecimal code point value, in Unicode, for Omega.
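The three values are the same number written in different bases. As a quick cross-check, this Python sketch formats a character's code point in the layout quoted above. The helper name describe_code_point is hypothetical, and the exact wording Vim prints may vary between versions.

```python
def describe_code_point(ch):
    """Format one character's code point as decimal, hex, and octal."""
    cp = ord(ch)  # the Unicode code point as an integer
    return "<{}> {}, Hex {:04x}, Octal {:o}".format(ch, cp, cp, cp)

print(describe_code_point("\u03a9"))
```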

keystroke: g8

This command displays the hexadecimal values of the bytes used to represent the character under the cursor in UTF-8. This differs from ‘ga’, which displays the code point of the character: the value a character has within a coded character set. ‘g8’ instead displays the byte sequence, the value used to represent a character within a character encoding form. These are the bytes stored on disk when the file is saved as UTF-8. For example, the Greek letter Omega, in UTF-8, provides: “ce a9”. “ce a9” is the hexadecimal value of Omega in UTF-8, which differs from the code point value of “03 a9”.
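The relationship between the code point 03a9 and the bytes ce a9 follows from the two-byte UTF-8 layout (110xxxxx 10xxxxxx). The Python sketch below derives the bytes by hand and compares them with the built-in encoder. The helper name utf8_hex is hypothetical, and the manual calculation applies only to code points in the two-byte range, U+0080 through U+07FF.

```python
def utf8_hex(ch):
    """Return the UTF-8 bytes of a character as space-separated hex."""
    return " ".join("{:02x}".format(b) for b in ch.encode("utf-8"))

# Manual two-byte encoding of Omega (valid for U+0080..U+07FF only).
cp = 0x03A9
lead = 0xC0 | (cp >> 6)     # 110xxxxx: carries the upper 5 bits
trail = 0x80 | (cp & 0x3F)  # 10xxxxxx: carries the lower 6 bits
manual = "{:02x} {:02x}".format(lead, trail)

print(utf8_hex("\u03a9"), manual)
```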

keystroke: 8g8

This command finds the next invalid UTF-8 byte sequence at or after the cursor in the current file. For example, if the file is encoded as UTF-8 but contains a byte or set of bytes that does not represent a valid UTF-8 character, this command will position the cursor over that location in the file. Generally, Vim will display the invalid sequence as the hexadecimal value of the byte or bytes enclosed in angle brackets, such as “<93>”.
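The same kind of check can be sketched outside Vim. The Python function below reports the byte offset of the first invalid UTF-8 sequence in a byte string, or None if the data is valid. The function name and sample data are hypothetical, and this is not how Vim implements ‘8g8’; the sample uses the stray byte 0x93 from the example above.

```python
def first_invalid_utf8(data):
    """Return the byte offset of the first invalid UTF-8 sequence, or None."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        return err.start  # offset where decoding failed
    return None

# 0x93 is a continuation byte with no lead byte before it, so it is invalid.
sample = b"quote: \x93hello"

print(first_invalid_utf8(sample))
print(first_invalid_utf8("\u03a9".encode("utf-8")))
```

A stray byte like this often means part of the file was written in a single-byte encoding, in which case re-opening the file with the appropriate ‘++enc’ value may be a better fix than deleting the byte.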

The commands above represent some of the more useful (and easy to use) features which Vim has to offer when working with character encodings. Searching through the Vim manual reveals several more features which can also be handy, such as the ability to edit text files by hexadecimal value using the xxd utility, support for the Unicode byte-order mark (BOM), and modifying how the keyboard encodes what is typed.