Unicode in URIs makes my head hurt
I’ve read very little about Unicode before but today I had the questionable pleasure of delving a bit deeper into it. Mind you, it still feels like I’ve just dipped a foot in the water, but before today I had only dipped a single toe.
I was especially interested in how URI encoding (“percent-encoding”) deals with Unicode. According to RFC 3986:
When a new URI scheme defines a component that represents textual data consisting of characters from the Universal Character Set [UCS], the data should first be encoded as octets according to the UTF-8 character encoding [STD63]; then only those octets that do not correspond to characters in the unreserved set should be percent-encoded.
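To make the two steps in that quote concrete, here is a minimal sketch in Python. It assumes the standard library's `urllib.parse.quote` is an acceptable stand-in for "a URI codec"; the helper name `percent_encode` is mine and purely illustrative, not something from the RFC.

```python
from urllib.parse import quote, unquote


def percent_encode(text: str) -> str:
    """Encode text as UTF-8, then percent-encode everything that is not
    an unreserved character (letters, digits, '-', '.', '_', '~').

    Hypothetical helper name; it only wraps urllib.parse.quote.
    """
    # safe="" stops quote() from leaving '/' untouched, which is its default.
    return quote(text, safe="", encoding="utf-8")


# Step 1 on its own: 'å' (U+00E5) becomes two octets in UTF-8.
print("å".encode("utf-8"))

# Steps 1 and 2 together: each of those octets is percent-encoded.
print(percent_encode("å"))

# Unreserved characters pass through unchanged.
print(percent_encode("A-b_c.d~e"))

# Decoding reverses the process, again assuming UTF-8.
print(unquote("%C3%A5"))
```

Note that ‘A’ comes out as plain `A` here; `%41` is still a valid way to write it, just not one an encoder following the RFC needs to produce.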
Of course this particular document is fairly new (January 2005), so I bet there are quite a few URI codecs out there that don’t behave this way yet. Another interesting detail is that Microsoft has long supported a non-standard URI encoding especially suited to UCS-2 [i], which takes the form %uhhhh, where hhhh is a 16-bit value written as four hexadecimal digits. E.g. the character ‘A’ is %41 in the standard encoding, while in Microsoft’s encoding it looks like %u0041.
So far it’s quite straightforward, but then something strange in Unicode enters the picture: characters that can be represented in more than one way. It makes a certain sense when a character is a combination of a base character and a combining mark, e.g. the character ‘å’ can be constructed in two ways, either as the single precomposed code point U+00E5 or as an ‘a’ (U+0061) followed by the combining diacritical mark ‘ ̊’ (U+030A, combining ring above). Of course, comparing these two sequences, which are encoded completely differently while having exactly the same semantics, is a bit of a problem. That’s solved by normalisation, for which Unicode defines four forms: two canonical ones (NFC and NFD) and two compatibility ones (NFKC and NFKD).
I didn’t bother going further into that, because my real problem, the reason I started all of this, was that there are compatibility characters in a block called “Halfwidth and Fullwidth Forms” (FF00–FFEF). This block contains some non-Latin characters, and there it makes sense, but for some strange reason all printable characters in the Basic Latin block (0000–007F) are present as “fullwidth forms” as well (at FF01–FF5E). The reason for this is unclear to me and I’d really love an explanation. The result is that there is apparently some confusion about just what to do with these “fullwidth forms” when decoding them; in some cases they are treated just like their “halfwidth” cousins in the Basic Latin block. The end result is that on Microsoft products ‘A’ can also be encoded as %uff21.
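Here is a rough sketch of the three things above in Python: decoding the %uhhhh form, comparing the two spellings of ‘å’, and folding a fullwidth ‘A’ back to the ordinary one. The `decode_percent_u` helper is hypothetical and written just for this post; it is not how any Microsoft product actually does it, and it ignores surrogate pairs. The normalisation calls are the standard library's `unicodedata.normalize`.

```python
import re
import unicodedata

# Matches the non-standard %uhhhh form: exactly four hex digits.
_PERCENT_U = re.compile(r"%u([0-9a-fA-F]{4})")


def decode_percent_u(s: str) -> str:
    """Replace each %uhhhh with the character whose code point is hhhh.

    Hypothetical helper for illustration only. It treats each 16-bit
    value as a code point and does not combine surrogate pairs.
    """
    return _PERCENT_U.sub(lambda m: chr(int(m.group(1), 16)), s)


# The two spellings of 'å': one precomposed code point, or 'a' plus a
# combining ring above.
precomposed = "\u00e5"
combined = "a\u030a"
print(precomposed == combined)                                 # False
print(unicodedata.normalize("NFC", combined) == precomposed)   # True
print(unicodedata.normalize("NFD", precomposed) == combined)   # True

# %u0041 is the ordinary 'A'; %uff21 is the fullwidth one.
plain_a = decode_percent_u("%u0041")
fullwidth_a = decode_percent_u("%uff21")
print(plain_a == "A")                                          # True
print(fullwidth_a == "A")                                      # False

# Canonical normalisation leaves the fullwidth form alone; only the
# compatibility forms (NFKC/NFKD) fold it to the Basic Latin 'A'.
print(unicodedata.normalize("NFC", fullwidth_a) == "A")        # False
print(unicodedata.normalize("NFKC", fullwidth_a) == "A")       # True
```

So whether %uff21 ends up being "the same as" ‘A’ depends entirely on whether the decoder applies a compatibility mapping afterwards, which would explain the confusion.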
While reading about Unicode I always have to remind myself that “for every complex problem, there is a solution that is simple, neat, and wrong”. I simply can’t help but think “this is so complicated, there must be an easier solution”…
Re-reading this post I realise there isn’t much of a point to it, besides possibly that writing (or talking) about something always helps my understanding of it. Please let me know if my understanding of Unicode or URI encoding is wrong…
[i] I suspect this is connected to Microsoft’s love for UCS-2 in other areas of their operating system.