Via Fronteers I discovered that even now not everyone is convinced of the merits of UTF-8. A little over five years ago I wrote a quick guide to UTF-8, and it seemed worthwhile to set out some technical points I have become aware of since then as to why using UTF-8 is a good idea.
I tend to think it is pretty self-explanatory that you want to use an encoding that can encode all of Unicode. After all, at some point you might want to sell your software abroad, you might want to accept comments in any given language, accept any kind of user-contributed content for that matter, and you simply do not want to keep an encoding label around every time you deal with a string. Given that people are trying to phase out UTF-7 (security issues) and UTF-32 (bloat) (both gone in Opera 10), this leaves UTF-8 and UTF-16 as options.
Here are two reasons to use UTF-8:
- The encoding of URLs is UTF-8. The path component is always encoded in UTF-8 (when the request is made), while the encoding of the query component depends on the page encoding if the link is embedded inside HTML. However, for XMLHttpRequest the query component is always encoded in UTF-8, which could result in confusion if you have the same link in the page and in a script. If you need to process such links on the server or want to link to external pages, the easiest approach is to simply align with the encoding of URLs, i.e. to use UTF-8.
- The encoding XMLHttpRequest uses for text strings when sending data to the server is always UTF-8. This means your server had better deal with UTF-8 input correctly. Always using UTF-8 again means less work for you, since you do not have to figure out whether the request came from a form element or an XMLHttpRequest object.
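To make the query-component mismatch concrete, here is a minimal sketch in Python. It only illustrates the byte-level difference; it does not reproduce what any particular browser does. The helper names `percent_encode` and `decode_as_utf8`, the sample term, and the choice of windows-1252 as the legacy page encoding are all assumptions made for illustration.

```python
from urllib.parse import quote, unquote


def percent_encode(text: str, encoding: str) -> str:
    """Percent-encode text after converting it to bytes in the given encoding.

    Hypothetical helper: it stands in for whatever produces the query
    component (a link in a page, or a script).
    """
    return quote(text, safe="", encoding=encoding)


def decode_as_utf8(component: str) -> str:
    """Decode a percent-encoded component the way a UTF-8-only server might.

    Bytes that are not valid UTF-8 become U+FFFD (the replacement character).
    """
    return unquote(component, encoding="utf-8", errors="replace")


term = "café"

# Same text, two encodings: the resulting query strings differ.
from_script = percent_encode(term, "utf-8")          # "caf%C3%A9"
from_legacy_page = percent_encode(term, "windows-1252")  # "caf%E9"

print(from_script, "->", decode_as_utf8(from_script))
print(from_legacy_page, "->", decode_as_utf8(from_legacy_page))
```

If the page itself is served as UTF-8, both strings come out identical and the server never has to guess which encoding a given request used.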
Here are two reasons to use UTF-8 over UTF-16:
- In a thread on firstname.lastname@example.org, Erik van der Poel (Google) briefly describes a security issue with UTF-16 on the Web and says: « Google Web Search has stopped serving UTF-16. » To be fair, the issue is actually in Internet Explorer, though obviously that does not make UTF-16 less dangerous given IE’s market share.
- Contrary to popular belief, UTF-16 often takes up more space, even for CJK pages. UTF-8 is often accused of having a Western bias and while this may be true, content on the Web does use a fair amount of markup and whitespace, which compresses a whole lot better in UTF-8. roc (Mozilla) explains the details and concludes with: « We’ve seen no data showing that UTF-16 is useful in practice on the real Web … except as a legacy encoding of course. »
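The size argument is easy to check on a toy example. The sketch below, in Python, compares raw byte counts for a bare Japanese string and for the same string wrapped in a little markup. The sample strings and the helper name `encoded_sizes` are made up for illustration, and the sketch does not measure real pages or compressed sizes.

```python
def encoded_sizes(text: str) -> dict:
    """Return the number of bytes text occupies in UTF-8 and UTF-16.

    UTF-16 is measured without a byte order mark ("utf-16-le") so that
    only the text itself is counted.
    """
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-le")),
    }


# Hypothetical sample content: seven CJK characters, bare and marked up.
bare_text = "こんにちは世界"
marked_up = '<p class="greeting">' + bare_text + "</p>"

bare_sizes = encoded_sizes(bare_text)
page_sizes = encoded_sizes(marked_up)

# Bare CJK text favours UTF-16 (2 bytes per character versus 3 in UTF-8).
print("bare text:", bare_sizes)
# Once ASCII markup outnumbers the CJK characters, UTF-8 comes out smaller.
print("with markup:", page_sizes)
```

Every ASCII character costs one byte in UTF-8 and two in UTF-16, so raw UTF-8 is smaller whenever a page has more ASCII characters (tags, attributes, whitespace) than CJK ones.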