On the Importance of Character Sets & Character Encodings in MySQL

Article summary

When transmitting and storing digital data, one of the most important considerations should be the character encoding. Unfortunately, this rarely seems to be on anyone’s mind when setting up a database or making a database connection. For the most part, the defaults are just expected to work and provide the best set of options. With regards to character encodings (in any context), this is a dangerous approach.

In MySQL, the default character set is Latin-1. As a reminder, Latin-1 is an 8-bit, single byte, character encoding capable of representing 255 values. This would be awesome if you only ever had to represent characters from the Latin alphabet, and would never store or retrieve characters outside of the Latin-1 character set. Unfortunately, in a world driven by the Internet, this is almost never the case, and it causes problems.

Why? Well, because the default MySQL character set is Latin-1, any characters not within that character set may not be properly stored (or retrieved). This often doesn’t occur to developers in the U.S. because nearly everything is represented in characters from the Latin alphabet anyway. However, should you try to store (or retrieve) something not in the standard Latin-1 character set, there are often problems.

For instance, let’s create a sample database on a new MySQL server installation from a UTF-8 client:

mysql> SET NAMES utf8;
mysql> CREATE DATABASE mydatabase;
mysql> USE mydatabase;
mysql> CREATE TABLE `mytable` (`id` int(11) NOT NULL AUTO_INCREMENT,
     `name` text, PRIMARY KEY (`id`));

Our database has been created, and is using the Latin-1 character set, as expected:

mysql> SHOW CREATE DATABASE mydatabase;
+------------+-----------------------------------------------------------------------+
| Database   | Create Database     |
+------------+-----------------------------------------------------------------------+
| mydatabase | CREATE DATABASE `mydatabase` /*!40100 DEFAULT CHARACTER SET latin1 */ |
+------------+-----------------------------------------------------------------------+

mysql> SHOW CREATE TABLE mytable;
+---------+--------------------------------------------------------------------------+
| Table   | Create Table     |
+---------+--------------------------------------------------------------------------+
| mytable | CREATE TABLE `mytable` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `name` text,
  PRIMARY KEY (`id`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1     |
+---------+--------------------------------------------------------------------------+

Now let’s try to insert some data:

  • Standard Latin-1:
    >mysql> INSERT mytable (name) VALUES ("abc");
  • UTF-8 (Greek):
    >mysql> INSERT mytable (name) VALUES ("αβγ");

Now let’s try to retrieve our data:

mysql> SELECT * FROM mytable;
+----+------+
| id | name |
+----+------+
|  1 | abc  |
|  2 | ???  |
+----+------+

Well, our UTF-8 data doesn’t look right, does it? No, not at all. In fact, our data is gone. Permanently. Because the database is Latin-1, and our UTF-8 characters don’t exist in Latin-1, MySQL simply replaced all of our UTF-8 characters with the “replacement character” — which is supposed to signify that the character was not understood, and not properly converted. However, this is little help when you are trying to retrieve data from the database at a later time.

In practice, rarely will developers set up a MySQL database and send all non-Latin-1 characters to it. Usually most of the characters will be Latin-1, with an odd UTF-8 character thrown in. These UTF-8 characters may be forever lost or corrupted, but the Latin-1 characters are just fine. Because these UTF-8 characters may be rarely used, their loss or corruption may not be noticed for some time (if at all). This varied behavior contributes to the lack of awareness and understanding about properly configuring character encodings in general. It should be noted that this is only one specific example of the data corruption and loss that can occur due to improperly configured character encodings — many different variants can and do occur.

So, what is one to do about this problem? The answer is really very simple: always use the correct character set all the time. From a practical perspective, this should mean always using UTF-8 for everything. Why? Because that is the way the world is trending — the Internet is international, and nearly all locales except the U.S. and Western Europe rely upon UTF-8 (or some other form of Unicode) to represent characters all the time. If anyone hopes to serve an international or Internet audience, the character encoding of choice is UTF-8.

So, how is this accomplished in MySQL? Generally, the MySQL server itself should be configured to use UTF-8 as the default character set.

  • This can be done by inserting the following line into the MySQL configuration file, usually /etc/my.conf :
    [mysqld]
    ...
    default-character-set = utf8
    ...
    
  • If the server configuration file isn’t accessible, you must specify the correct character set at database creation:
    >mysql> CREATE DATABASE mytest2 DEFAULT CHARACTER SET utf8;
  • You can also specify the correct character set at table creation:
    mysql> CREATE TABLE `mytable2` (`id` int(11) NOT NULL AUTO_INCREMENT, `name` text, PRIMARY KEY (`id`)) CHARACTER SET utf8;
  • Alternatively, you can specify the correct character set on a per column basis at table creation:
    mysql> CREATE TABLE `mytable2` (`id` int(11) NOT NULL AUTO_INCREMENT, `name` text CHARACTER SET utf8, PRIMARY KEY (`id`));

Existing character sets for servers, databases, tables, and columns can be altered, but this poses a risk for further corrupting or damaging existing data.

Our database and table have both been created, using the UTF-8 character set as specified:

mysql> SHOW CREATE DATABASE mytest2;
+----------+------------------------------------------------------------------+
| Database | Create Database     |
+----------+------------------------------------------------------------------+
| mytest2  | CREATE DATABASE `mytest2` /*!40100 DEFAULT CHARACTER SET utf8 */ |
+----------+------------------------------------------------------------------+

mysql> SHOW CREATE TABLE mytable2;
+----------+------------------------------------------------------------------+
| Table    | Create Table     |
+----------+------------------------------------------------------------------+
| mytable2 | CREATE TABLE `mytable2` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `name` text CHARACTER SET utf8,
  PRIMARY KEY (`id`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1 |
+----------+------------------------------------------------------------------+

We can now insert UTF-8 data into the database without a problem:

mysql> INSERT mytable2 (name) VALUES ("αβγ");
mysql> SELECT * FROM mytable2;
+----+--------+
| id | name   |
+----+--------+
|  1 | αβγ   |
+----+--------+

It is important to note that clients and their connections to MySQL server also have their own character sets. These should also always be the same as the server, database, and table: UTF-8. The MySQL client will often try and establish a connection to the MySQL server using the default character set (Latin-1), so it must sometimes be specifically set to UTF-8.

  • On the MySQL client command line, this can be accomplished by setting the following variable:
    >mysql> SET NAMES utf8;
  • Certain other MySQL clients must also specifically be told to use UTF-8. For example, in Ruby on Rails, the database.yml file should specify UTF-8:
    production:
      adapter: mysql
      database: mydatabase
      username: myuser
      password: mypass
      host:  mydb
      encoding: utf8
    

The SQL samples shown here are really intended to illustrate the importance of using the proper character set on MySQL server, and on MySQL clients. These are just examples, and not total solutions. You should do proper research on the character encoding of the server and clients that you utilize. Always backup data and use caution when trying to change the character encoding used in a production database

Further Reading

Conversation
  • John Smith says:

    This should be considered carefully. Double byte characters could also double the size of the database. With nonclustered indexes in the mix it would actually more than double the size. Although that may not seem like much it would be if there were even moderate write IO involved. The default of 1 byte is there for a reason.

  • Actually, utf8 is a variable width character encoding and not a double byte character encoding. In MySQL, utf8 will be stored on disk using between one and three bytes depending on the character being represented. Standard Latin alphabet characters (essentially those which exist in the ASCII character set) will be represented with one byte, while those outside that range will be represented by two or more bytes. If no non-Latin-1 characters are used, the sizes of MySQL tables encoded in utf8 and latin1 will be the same. Concern arises not with disk operations, but memory (RAM) storage. MySQL will utilize three bytes for utf8 characters in memory, even if that character only occupies one byte on disk. If memory constraints are not a problem, then utf8 can be safely used. Actual multi-byte character sets in MySQL (such as ucs2), however, are of concern for greatly increased database and index size.

  • Comments are closed.