Kristof Kovacs

Software architect, consultant

How not to mess up the encoding?


I'm not the kind of pedantic guy who always fights for the right language, the right syntax, or the right accents. In the twentieth century, when there was no standard way to use accents, I wrote my emails, documents, etc. without accents, even in Hungarian, rather than using the half-baked solutions that were around, like the multitude of character mappings (for us Hungarians, there were ISO-8859-2 aka Latin-2, ISO-8859-16 aka Latin-10, Windows-1250, CP852, Mac CentralEurRoman, and CP437, which had almost all of our characters). I guess there are nations where it was even worse.

(This may get a little long. But the main point is the checklist later in the entry. You can just go there if you are in a hurry.)

For those who are not familiar with this mess: while the letter "A" is always character #65, my letter "Ő" could be assigned more than one number, depending on which code table I used. If someone read my text using a different table, she would see a different letter in its place. And if, for example, someone in Russia was looking at my text, the tables she used did not even include my letters!
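A small sketch in Python (chosen here only for illustration, it is not part of the original setup) makes this concrete. It uses only codecs from the Python standard library and prints character names instead of the characters themselves, so the output does not depend on your terminal's encoding.

```python
import unicodedata

letter = "\u0150"  # LATIN CAPITAL LETTER O WITH DOUBLE ACUTE

# The same letter gets a different byte value in different code tables.
for codec in ("iso8859_2", "cp1250", "cp852"):
    print(codec, letter.encode(codec).hex())

# The same byte is a different letter when read with another table.
latin2_byte = letter.encode("iso8859_2")
seen_as_latin1 = latin2_byte.decode("latin_1")
print(unicodedata.name(letter), "->", unicodedata.name(seen_as_latin1))

# A Cyrillic table such as KOI8-R has no slot for the letter at all.
try:
    letter.encode("koi8_r")
    in_koi8r = True
except UnicodeEncodeError:
    in_koi8r = False
print("representable in KOI8-R:", in_koi8r)
```

The point is not the exact numbers, but that the byte alone does not tell you which letter was meant.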

Then came Unicode

I was a bit suspicious at first. But when Java started to support Unicode natively, I felt I was in character heaven. At first it was scary to waste two bytes for each character, but what the hell. This was called UTF-16, or UCS-2. If you look at a Windows EXE with a binary viewer, you can see the string information encoded this way: you see a "00" byte (shown as empty space) between characters.

Unicode is the name of the character table that includes all the characters. You can view it best at unicodemap.org. UTF-8 and UTF-16 (and others) are methods for storing these characters in files. UTF-16 uses two bytes (16 bits) for most characters, and four bytes for the rarer ones that do not fit into 16 bits (the older UCS-2 was strictly two bytes). UTF-8 is trickier, using one to four bytes per character, but the advantage is that (1) if you don't use extended characters, the result is identical to ordinary ASCII, and (2) for text that is mostly ASCII, it takes up much less space.
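To see the size difference between the two storage methods, here is a minimal Python sketch. The helper name byte_lengths and the sample characters are just illustrative choices; "utf-16-le" is used so that no byte order mark is added to the count.

```python
def byte_lengths(ch):
    """Return (UTF-8 length, UTF-16 length) in bytes for one character."""
    return len(ch.encode("utf-8")), len(ch.encode("utf-16-le"))

samples = {
    "A": "plain ASCII",
    "\u0150": "Hungarian O with double acute",
    "\u20ac": "euro sign",
    "\U0001F600": "emoji outside the 16-bit range",
}

for ch, label in samples.items():
    u8, u16 = byte_lengths(ch)
    print(f"U+{ord(ch):04X} ({label}): UTF-8 {u8} bytes, UTF-16 {u16} bytes")

# Pure ASCII text is byte-for-byte identical in UTF-8.
ascii_is_same = "Hello".encode("utf-8") == "Hello".encode("ascii")
print(ascii_is_same)
```

So UTF-8 wins on mostly-ASCII text, while some characters (like the euro sign) actually take one byte more in UTF-8 than in UTF-16.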

Let me introduce you to the one tool that never lies to you

It's called xxd, and it's available on practically every Unix and Linux system (it ships with Vim). It's very simple. Suppose I have a file called "tresbien.txt" that has the phrase "Très bien, merci" written in it (using UTF-8 encoding).

If you click that link, your browser may or may not show it right. Anyway, just play a bit with setting the encoding of the browser (it's usually in the "View" menu). If you set it to UTF-8, it displays right. If you set it to Latin-1 (or something else), it does not.

Now, this is how we use xxd on this file:

$ xxd tresbien.txt
0000000: 5472 c3a8 7320 6269 656e 2c20 6d65 7263  Tr..s bien, merc
0000010: 690a                                     i.
$ _

You can see that between "Tr" and "s" there are two dots, meaning non-ASCII bytes. On the left side, you see the hexadecimal values of the bytes. This is because the letter "è" is Unicode 00E8 (called "LATIN SMALL LETTER E WITH GRAVE"), and UTF-8 stores characters in this range using two bytes (here: c3 a8).
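If you don't have xxd at hand, the same view can be approximated in a few lines of Python. This is only a rough sketch of the xxd layout shown above (the hexdump function is a made-up helper, not a standard one), but it shows the same bytes.

```python
def hexdump(data, width=16):
    """Format bytes roughly like xxd: offset, hex pairs, printable ASCII."""
    lines = []
    for offset in range(0, len(data), width):
        chunk = data[offset:offset + width]
        # Group the bytes two by two, like xxd does by default.
        hexpart = " ".join(chunk[i:i + 2].hex() for i in range(0, len(chunk), 2))
        # Anything outside printable ASCII is shown as a dot.
        textpart = "".join(chr(b) if 32 <= b < 127 else "." for b in chunk)
        lines.append(f"{offset:07x}: {hexpart:<39}  {textpart}")
    return lines

text = "Tr\u00e8s bien, merci\n"
lines = hexdump(text.encode("utf-8"))
for line in lines:
    print(line)
```

The two dots in the right-hand column are the bytes c3 and a8, which together make up the single letter "è".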

Now, if I open VIM on my Mac, it's very likely that I'll see the characters right, because it detects that the file is encoded in UTF-8, and because natively on a Mac it uses UTF-8 to display characters too. Then I ":set fileencoding=latin1", and save the file. Let's see what happened:

$ xxd tresbien.txt
0000000: 5472 e873 2062 6965 6e2c 206d 6572 6369  Tr.s bien, merci
0000010: 0a                                       .
$ _

The letter "è" is now only 1 byte long! It's because Latin-1 (ISO-8859-1) has an "è". And the scary thing: if I quit VIM now, and then open the file again, the "è" will be there! All I will see on open is "[converted]". But other editors may not auto-detect the encoding.
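The same experiment can be reproduced without an editor. The Python sketch below encodes the phrase both ways and then tries to read the Latin-1 bytes as UTF-8. A strict check like this is roughly the kind of test an auto-detecting editor can apply, although how a particular editor actually decides is an assumption here.

```python
text = "Tr\u00e8s bien, merci\n"

as_utf8 = text.encode("utf-8")
as_latin1 = text.encode("latin-1")

# One byte shorter, because the accented letter shrank from two bytes to one.
print(len(as_utf8), len(as_latin1))
print(as_utf8[2:4].hex(), as_latin1[2:3].hex())

# The Latin-1 bytes are not valid UTF-8, so a strict decoder rejects them.
try:
    as_latin1.decode("utf-8")
    latin1_passes_as_utf8 = True
except UnicodeDecodeError:
    latin1_passes_as_utf8 = False
print("Latin-1 file readable as UTF-8:", latin1_passes_as_utf8)
```

A program that does not check, and simply assumes UTF-8 or Latin-1, will show garbage or fail on one of the two files.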

If you see "[converted]" in VIM, treat it as a red flag every time.

How to fix encoding problems (or even prevent them)

If we take a regular LAMP project, start right from the beginning. This means setting the encoding of your MySQL tables (and all their fields) to UTF-8. (Collation does not really matter at this point; that may be a different blog post.) Like this:

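-- Note: MySQL's "utf8" stores at most 3 bytes per character;
-- "utf8mb4" is the charset that covers all of Unicode.
CREATE TABLE `table` (
  `ID` bigint(20) unsigned NOT NULL auto_increment,
  `TEXT` longtext NOT NULL,
  PRIMARY KEY (`ID`)
) ENGINE=MyISAM AUTO_INCREMENT=1 DEFAULT CHARSET=utf8;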

From your PHP code, set your MySQL connection up to use UTF-8 all the time (unless it is already configured so by default, which it usually is not). This means running the following SQL right after connecting to the DB:

set character_set_results=utf8, character_set_client=utf8, character_set_connection=utf8;

Next, set the HTTP connection to use UTF-8. This means adding this code early on (in your head.inc or similar):

header('Content-Type: text/html; charset=utf-8');

and, for safety and consistency, also adding this to your html-head output:

<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />

Also, make sure that the editor you use works in UTF-8 both when viewing and when saving your files. In VIM, there are two commands that you can use to check if everything is all right (and set it right if not). For displaying, you can ask what it currently uses with:

:set encoding?

I recommend setting it to UTF-8 as early as in your .vimrc file:

:set encoding=utf8

When you open a file, you can see its encoding (auto-detected) by typing:

:set fileencoding?

If it's not UTF-8, then you can tell VIM to use UTF-8 encoding on the next save by typing:

:set fileencoding=utf8

You can also use this last one to save your file in various other encodings.

It may also be worth looking through the preferences of your terminal program and your SSH client and setting everything to UTF-8.

We are all in Nirvāṇa then

If you do all these steps, then you will not be burdened anymore by any of the character encoding drama. You can generate pages from your program that display names from different nations on the same page, something that was unimaginable when we used code tables. (This is one of the main reasons code tables must die.) Now you can have clients in Russia, India, and China! You are not ignoring the 80% of the world's population with non-English names anymore.

It's not always what it seems

And remember: you can use xxd to check the actual file content, instead of relying on the unlikely cooperation of your operating system, windowing system, terminal, SSH client and editor to show you things. For example, if your file is in UTF-8 but your editor thinks it is not, the editor will still send the right bytes to your terminal, and the terminal will show you the right characters! Only when you edit the text do things go wrong: you can overtype just one of the two bytes that represent your character, and the file is messed up.
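The same "look at the bytes, not at the screen" check can be automated. Below is a minimal Python sketch; looks_like_utf8 is a hypothetical helper name, and the two temporary files only stand in for your real files. Note the limits: a pure ASCII file also passes, and in rare cases a file in another encoding can happen to be valid UTF-8 too.

```python
import os
import tempfile

def looks_like_utf8(path):
    """Return True if the raw bytes of the file decode cleanly as UTF-8."""
    with open(path, "rb") as f:
        raw = f.read()
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True

# Demo: write the same phrase in two encodings and check both files.
text = "Tr\u00e8s bien, merci\n"
results = {}
for codec in ("utf-8", "latin-1"):
    fd, path = tempfile.mkstemp(suffix=".txt")
    with os.fdopen(fd, "wb") as f:
        f.write(text.encode(codec))
    results[codec] = looks_like_utf8(path)
    os.remove(path)

print(results)
```

Running something like this over a source tree is a quick way to find the files that were saved in the wrong encoding.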

The same thing can happen the other way: if your operating system sends Latin-1, and your VIM is (fortunately, but incorrectly) set to display as Latin-1, and the files are UTF-8, then what you type in VIM will come out right, but when you type an SQL command into the "mysql" command-line client, VIM will not be there to convert for you, and you will not understand why you don't get the expected results. (Or, if this was an insert, you mess up half of your database.)

It's especially scary when the HTTP encoding is right, and the database is not. MySQL will store UTF-8 encoded text in a Latin-1 encoded table (or field), but it will look wrong in about half the tools you try to use. And your SQL LIKEs may not work anymore on these rows, since the character is not what it seems to be...
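What happens to such a row can be simulated in plain Python. This is only a model of the effect (UTF-8 bytes being interpreted as Latin-1 characters), not a description of what MySQL does internally, and the variable names are made up for the illustration.

```python
original = "Tr\u00e8s bien"

# UTF-8 bytes arrive, but the storage layer believes they are Latin-1.
stored = original.encode("utf-8").decode("latin-1")

# The accented letter has turned into two unrelated characters.
print([f"U+{ord(c):04X}" for c in stored[:4]])
print(len(original), len(stored))

# A search for the real letter no longer matches the stored text.
found = "\u00e8" in stored
print("search finds the letter:", found)

# If nothing else touched the data, reversing the mistake restores it.
repaired = stored.encode("latin-1").decode("utf-8")
print("repaired correctly:", repaired == original)
```

The repair step only works as long as the data went through the wrong conversion exactly once, which is one more reason to get the encoding right from the beginning.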

A rule of thumb

If you ever have to use "utf8_encode", "utf8_decode" or "iconv" in your PHP application (besides maybe in your interfaces, where incoming and outgoing files may have non-UTF-8 encodings and you can do nothing about it), then you do have a problem. Look for it right now, because it will be harder to straighten out the code later, when your clients Mr. Åberg from Sweden and Mr. Jürgen from Germany come complaining.

Further reading

"Why does Joomla! 1.5 use UTF-8 encoding?" (Short and to the meat, worth reading...)
"Configuring PuTTY to use UTF-8 character encoding" (For you Windows users)
"Advanced Q&A (Google Webmaster Central Blog)" (Related stuff is around the end of the article)