Technical reasons to use UTF-8

Anne’s Weblog • 4 days ago

Via Fronteers I discovered that even now not everyone is convinced of the merits of UTF-8. A little over five years ago I wrote a quick guide to UTF-8, and it seems worthwhile to set out some technical points I have become aware of since then as to why using UTF-8 is a good idea.

I tend to think it is pretty self-explanatory that you want to use an encoding that can encode all of Unicode. After all, at some point you might want to sell your software abroad, accept comments in any given language, or accept any kind of user-contributed content for that matter, and you simply do not want to keep an encoding label around every time you deal with a string. Given that people are trying to phase out UTF-7 (security issues) and UTF-32 (bloat) (both gone in Opera 10), this leaves UTF-8 and UTF-16 as options.

Here are two reasons to use UTF-8:

  • The encoding of URLs is UTF-8. The path component is always encoded in UTF-8 (when the request is made), whereas the encoding of the query component depends on the page encoding if the link is embedded inside HTML. For XMLHttpRequest, however, the query component is always encoded in UTF-8, which can cause confusion if you have the same link in the page and in a script. If you need to process such links on the server or want to link to external pages, the easiest approach is simply to align with the encoding of URLs, i.e. to use UTF-8.
  • The encoding XMLHttpRequest uses for text strings when sending data to the server is always UTF-8. This means your server had better deal with UTF-8 input correctly. Always using UTF-8 again means less work for you, since you do not have to figure out whether the request came from a form element or an XMLHttpRequest object.
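
To make the mismatch concrete, here is a minimal Python sketch of what the server sees when the same query value arrives once from a link in a legacy-encoded page and once from XMLHttpRequest. The helper name `encode_query_value`, the sample value, and the choice of windows-1252 as the page encoding are assumptions for illustration. Real browsers also handle characters the page encoding cannot represent in their own way, which this sketch ignores.

```python
from urllib.parse import quote, unquote


def encode_query_value(value: str, encoding: str) -> str:
    """Percent-encode a query value using the given character encoding.

    Hypothetical helper: it approximates what ends up on the wire,
    not the full browser algorithm.
    """
    return quote(value, safe="", encoding=encoding)


value = "café"

# Link embedded in an HTML page served as windows-1252 (assumed page encoding).
from_html_link = encode_query_value(value, "windows-1252")
# The same link requested through XMLHttpRequest, which always uses UTF-8.
from_xhr = encode_query_value(value, "utf-8")

print(from_html_link)  # caf%E9
print(from_xhr)        # caf%C3%A9

# A server that assumes UTF-8 decodes only one of the two correctly;
# the legacy-encoded byte becomes U+FFFD (replacement character).
print(unquote(from_xhr, encoding="utf-8"))        # café
print(unquote(from_html_link, encoding="utf-8"))  # caf\ufffd, shown as a replacement glyph
```

If the page itself is UTF-8, both requests carry identical bytes and the server needs only one decoding path.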

Here are two reasons to use UTF-8 over UTF-16:

  • In a thread, Erik van der Poel (Google) briefly describes a security issue with UTF-16 on the Web and says: "Google Web Search has stopped serving UTF-16." To be fair, the issue is actually in Internet Explorer, though obviously that does not make UTF-16 any less dangerous given IE’s market share.
  • Contrary to popular belief, UTF-16 often takes up more space, even for CJK pages. UTF-8 is often accused of having a Western bias, and while this may be true, content on the Web uses a fair amount of markup and whitespace, which compresses a whole lot better in UTF-8. roc (Mozilla) explains the details and concludes with: "We’ve seen no data showing that UTF-16 is useful in practice on the real Web … except as a legacy encoding of course."
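
The size argument can be checked with a rough Python sketch that compares raw byte counts for bare CJK text and for the same text wrapped in markup. The sample strings and the helper `encoded_sizes` are made up for illustration, and the ratio on a real page depends on how much markup it carries. Compression is left out here.

```python
def encoded_sizes(text: str) -> dict[str, int]:
    """Return the byte length of text in UTF-8 and in UTF-16 (without a BOM)."""
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-le")),
    }


# Eight CJK characters: 3 bytes each in UTF-8, 2 bytes each in UTF-16.
text_only = "日本語のテキスト"

# The same text inside a small, made-up HTML fragment. Every ASCII
# markup character costs 1 byte in UTF-8 but 2 bytes in UTF-16.
page_fragment = '<div class="entry"><p>' + text_only + "</p></div>\n"

print(encoded_sizes(text_only))      # {'utf-8': 24, 'utf-16': 16}
print(encoded_sizes(page_fragment))  # UTF-8 is the smaller of the two here
```

For the bare text UTF-16 wins, but once the ASCII markup outweighs the CJK characters the balance tips toward UTF-8.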