Posts tagged ‘microsoft’

Unicode in URIs makes my head hurt

I’ve read very little about Unicode before but today I had the questionable pleasure of delving a bit deeper into it. Mind you, it still feels like I’ve just dipped a foot in the water, but before today I had only dipped a single toe.

I was especially interested in URI encoding (“percent-encoding”) and Unicode. According to RFC 3986:

When a new URI scheme defines a component that represents textual data consisting of characters from the Universal Character Set [UCS], the data should first be encoded as octets according to the UTF-8 character encoding [STD63]; then only those octets that do not correspond to characters in the unreserved set should be percent-encoded.
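To make the quoted rule concrete, here is a minimal Python sketch of it. The helper name `percent_encode_component` is my own invention for illustration; it leans on the standard library's `urllib.parse.quote`, which encodes text as UTF-8 and leaves the unreserved characters untouched.

```python
from urllib.parse import quote, unquote


def percent_encode_component(text: str) -> str:
    """Encode text as UTF-8 octets, then percent-encode everything
    outside the unreserved set (hypothetical helper, for illustration)."""
    # safe="" means reserved characters such as "/" are encoded too;
    # letters, digits and "-._~" are never encoded by quote().
    return quote(text, safe="", encoding="utf-8")


print(percent_encode_component("A"))      # unreserved, stays as it is
print(percent_encode_component("å"))      # two UTF-8 octets: %C3%A5
print(unquote("%C3%A5"))                  # decodes back to 'å'
```

Note that a single non-ASCII character turns into one escape per UTF-8 octet, so ‘å’ takes two escapes.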

Of course this particular document is fairly new (January 2005), so I bet there are quite a few URI codecs out there that don’t behave this way yet. Another interesting detail is that Microsoft has long supported a special URI encoding especially suited for dealing with UCS-2[1], which takes the form %uhhhh. E.g. the character ‘A’ would be %41 according to the standard encoding; using Microsoft’s encoding it looks like %u0041.

So far it’s quite straightforward, but then something strange in Unicode enters: compatibility characters. They do make a certain sense when they are combinations of a base character and some sort of marker (I’m not sure I’m using the right terminology here), e.g. the character ‘å’ can be constructed in two ways, either using the code point U+00E5 or by combining an ‘a’ (U+0061) and the “combining diacritical mark” ‘ ̊’ (U+030A). Of course, comparing these two representations, which are encoded completely differently while having exactly the same semantics, is a bit of a problem. That’s solved by canonicalisation, for which there are two standards.

I didn’t bother going further into that, because my real problem, the reason I started all of this, was that there are compatibility characters for something called “Halfwidth and Fullwidth Forms” (block FF00–FFEF). This block contains some non-Latin characters, and there it makes sense, but for some strange reason all printable characters in the Basic Latin block (0000–007F) are present as “fullwidth forms” as well. The reason for this is unclear to me and I’d really love an explanation. The result is that there apparently is some confusion about just what to do with these “fullwidth forms” when decoding them; in some cases they are treated just like their “halfwidth form” cousins in the Basic Latin block. The end result is that on Microsoft products ‘A’ can also be encoded as %uff21.
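A rough Python sketch of the two effects described above, as far as I understand them. The decoder `decode_percent_u` is a hypothetical helper of mine, not how any Microsoft product actually does it; it only shows what accepting %uhhhh alongside %hh could look like. The folding step uses the standard `unicodedata.normalize`, with NFC for the ‘å’ case and NFKC for the fullwidth case.

```python
import re
import unicodedata
from urllib.parse import unquote

# Either one non-standard %uhhhh escape, or a run of standard %hh escapes.
_ESCAPE = re.compile(r"%u([0-9a-fA-F]{4})|((?:%[0-9a-fA-F]{2})+)")


def _replace(match: re.Match) -> str:
    if match.group(1) is not None:
        # %uhhhh: the four hex digits are a UCS-2 code unit.
        return chr(int(match.group(1), 16))
    # %hh run: octets interpreted as UTF-8, as RFC 3986 suggests.
    return unquote(match.group(2))


def decode_percent_u(text: str) -> str:
    """Decode %hh and %uhhhh escapes in a single pass (illustrative only)."""
    return _ESCAPE.sub(_replace, text)


# Two spellings of 'å' compare unequal until they are normalised.
precomposed = "\u00e5"
combined = "a\u030a"
print(precomposed == combined)                                # False
print(unicodedata.normalize("NFC", combined) == precomposed)  # True

# A fullwidth 'A' only becomes a plain 'A' under compatibility folding.
fullwidth_a = decode_percent_u("%uff21")
print(fullwidth_a == "A")                                     # False
print(unicodedata.normalize("NFKC", fullwidth_a) == "A")      # True
```

Decoding in a single pass is deliberate: expanding %uhhhh first and then running a normal decoder over the result would decode some input twice.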

While reading about Unicode I always have to remind myself that “for every complex problem, there is a solution that is simple, neat, and wrong”. I simply can’t help but think “this is so complicated, there must be an easier solution”…

Re-reading this post I realise there isn’t much of a point to it, besides possibly that writing (or talking) about something always helps my understanding of it. Please let me know if my understanding of Unicode or URI encoding is wrong…

  1. I suspect this is connected to Microsoft’s love for UCS-2 in other areas of their operating system.

M$ grasping at straws…

On Windows programming

I always find myself going through the same motions when trying to program on Windows: excitement, bewilderment, frustration, relief. It’s exciting to find new libraries and frameworks that seem to deliver exactly the functionality I require. I feel bewildered because I don’t think I’ve ever come across a Windows C/C++ API that immediately makes sense to me. Then follows a time of frustration, often a rather long period too, when I try to use the library/framework to solve the problem I have. At some point I hit that stage where my little project is debugged into behaving properly[1] and a sense of relief comes over me.

One thing that never ceases to amaze me is how many small surprising things there are lurking just under the hood in Windows:

  1. Want to print an error message? GetLastError gives you an error value and FormatMessage whips it into a nice printable string. Take a good long look at FormatMessage. Where is the convenience function à la strerror?

  2. Another thing is the surprising order in which paths are searched for DLLs. By putting PATH so far down the list and completely leaving out an equivalent of LD_LIBRARY_PATH they actively encourage developers to copy DLLs into the home dirs of executables. I suspect this is inevitable given the DLL-hell phenomenon on Windows. It’s nonetheless extremely irritating when developing against a non-standard DLL (i.e. one that isn’t installed in \windows\system32).

  3. The utter confusion I experience when trying to figure out just where to find the correct framework to use. There is considerable overlap between Visual Studio and the Windows Platform SDK. To add more confusion there are sometimes other frameworks that overlap both of them, e.g. Debugging Tools for Windows provides dbghelp.{dll,h}, both of which are provided in slightly different versions in the other places[2].

  4. The lack of fixes for known issues, e.g. the version of dbghelp.h included in Debugging Tools for Windows can’t be included as is because it lacks the definition of a macro. The webpage announcing version 6.6.7.5 was updated 18 July, 2006. One would think that gives Microsoft ample time to address the issue, but no such luck.
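On the first point in the list, this is roughly the kind of strerror-like wrapper I’d want. It is only a sketch in Python: `windows_error_text` is a hypothetical name of mine, and it relies on `ctypes.FormatError`, which exists only on Windows and wraps FormatMessage. The non-Windows branch is there purely so the sketch runs elsewhere; it interprets the number as an errno value, which is not the same code space.

```python
import ctypes
import os
import sys


def windows_error_text(code: int) -> str:
    """Return a printable message for an error code (hypothetical helper)."""
    if sys.platform == "win32":
        # ctypes.FormatError calls FormatMessage under the hood;
        # the message usually carries trailing whitespace, so strip it.
        return ctypes.FormatError(code).strip()
    # Assumption: outside Windows there is no FormatMessage, so fall back
    # to strerror just to keep the example runnable.
    return os.strerror(code)


print(windows_error_text(2))
```

On Windows, calling `ctypes.FormatError()` without an argument uses the value of GetLastError, which is about as close to strerror(errno) as it gets.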

Well, that’s enough of ranting for one night…

  1. Through experience I’ve come to the conclusion that it isn’t worth the time and effort to try to fit Microsoft solutions into some logical framework. I’d argue that’s true for most closed-source solutions.
  2. A tip: make sure to use the ones that come from Debugging Tools for Windows!

It is fair, at least for now…

I think it’d be better if Microsoft’s security specialists concentrated on improving security in their products (and possibly wrote about how they do it) rather than trying to make the rest of the world feel sorry for them. I’m sorry, but full disclosure is the fairest option we have at the moment.

Microsoft sits on a reported vulnerability for months, then releases a patch when it becomes a 0-day. As I write this, Microsoft is sitting on a few publicly known vulnerabilities in Office (0-days as well) that have been known for a while now.

So, until companies start behaving, I think full disclosure is fair. It seems to be the only way of forcing delivery of security to paying customers at the moment. When there’s a sign that the business as a whole can function without FD I’ll be the first to argue against it; at the moment, though, it seems to be our only hope.

FD ⇒ bad PR ⇒ declining share price and sales ⇒ security fixes

Some companies seem to be on the verge of understanding this and taking it to heart. (Microsoft has understood it, but doesn’t seem to have found its heart just yet.)

Paying your users?