perl-unicode

Re: Encode::Guess fails on UTF-16BE string w/ newline characters

2003-04-14 21:30:05
On Tuesday, April 15, 2003, at 06:42  AM, Jay Lawrence wrote:
eeeeee! What were they thinking? Unicode encodings are a nightmare!!!

IMHO it seems that UTF-16/32 should mandate BOMs at the start of a file to save everybody a world of hurt. I get the impression that ISO-10646 is also a big pain in the a**.

You mean for guessing encodings? I don't think UTF-(8|16|32) is a mess in terms of how they map Unicode to a series of bytes. I do think, however, that Unicode itself, or rather how the consortium gives each character a number, is a mess, especially CJK unification, which later led to the mess of surrogate pairs. They first tried to be reductionists (trying to squeeze everything into 16 bits), found that it did not work, and became expansionists (32 bits, but in reality it is (16 + 1) * 65536 code points because of the surrogate pair mess). They should've been expansionists in the first place.
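To make the (16 + 1) * 65536 arithmetic concrete, here is a minimal sketch of how a code point above U+FFFF is split into a surrogate pair. The name surrogate_pair is just something I made up for illustration; it is not part of Encode.

```perl
use strict;
use warnings;

# 17 planes (the BMP plus 16 supplementary planes) of 65536 code points each.
my $total = (16 + 1) * 65536;
printf "%d code points, highest is U+%X\n", $total, $total - 1;

# Hypothetical helper: split a supplementary code point (U+10000..U+10FFFF)
# into its UTF-16 high and low surrogates.
sub surrogate_pair {
    my ($cp) = @_;
    die "not a supplementary code point"
        if $cp < 0x10000 || $cp > 0x10FFFF;
    my $v = $cp - 0x10000;                  # 20 bits remain
    return (0xD800 + ($v >> 10),            # top 10 bits
            0xDC00 + ($v & 0x3FF));         # bottom 10 bits
}

printf "U+%X => %04X %04X\n", 0x1D11E, surrogate_pair(0x1D11E);
```

The 20 bits carried by a pair give 16 * 65536 supplementary code points, and the BMP supplies the "+ 1".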

I agree - my prototype language for Unicode support is Arabic (I took a class in it for this very reason) - it is entirely possible that there are no Latin characters in a stream, therefore making the encoding impossible to detect!

You mean not even \n?

some reactive thoughts on the guessing process:
        a - make declared list of candidates position significant
                - first suspect is what will be used in case of multiple matches
                - that is what I'd expect

For the time being Encode::Guess is strict in that it allows no ambiguity. Maybe we can relax this by adding hints, say by implementing Encode::Guess->set_policy("loose") or something.
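Until something like that exists, a position-significant fallback can be approximated outside the module. The following is only a sketch: guess_loose is a hypothetical helper, not part of Encode::Guess. It tries the strict guess first and, if that fails, returns the first suspect (in the order given) that decodes the octets cleanly.

```perl
use strict;
use warnings;
use Encode ();
use Encode::Guess;    # exports guess_encoding

# Hypothetical helper, not an Encode::Guess API.
sub guess_loose {
    my ($octets, @suspects) = @_;

    # guess_encoding returns an encoding object on success,
    # or a plain error string when it cannot decide.
    my $enc = guess_encoding($octets, @suspects);
    return $enc->name if ref $enc;

    # Fall back: the order of @suspects decides the winner.
    for my $name (@suspects) {
        my $copy = $octets;    # decode() with a CHECK value may modify its input
        return $name
            if eval { Encode::decode($name, $copy, Encode::FB_CROAK()); 1 };
    }
    return;    # nothing matched
}

print guess_loose("caf\xE9", qw(iso-8859-1 iso-8859-2)), "\n";
```

The byte \xE9 is valid in both suspects, so the strict guess should not settle on one of them, and the helper then prefers whichever suspect was listed first.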

        b - consider language-specific test feature
                - Kanji exists in \x5x & \x9x ranges?
                - Arabic
                - Simplified Chinese
                - etc.

The problem is that it takes vast knowledge of each language. Plus I don't want to complicate Encode::Guess too much. Be my guest.

        c - can we open a file with encoding type Guess?

You mean "open F, '<:encoding(Guess)'"? The answer is no, because in order to guess the encoding you have to actually read the data. And to allow file access after the guess you have to rewind the filehandle, something you can't always do (consider a socket).
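What does work is reading the raw octets first and guessing on the buffer, which avoids the rewind entirely. A minimal sketch, assuming the whole file fits in memory (slurp_guessed is a hypothetical name):

```perl
use strict;
use warnings;
use Encode::Guess;    # exports guess_encoding

# Hypothetical helper: read a file as raw octets, guess its encoding,
# and return the decoded text. Dies if the guess fails or is ambiguous.
sub slurp_guessed {
    my ($path, @suspects) = @_;
    open my $fh, '<:raw', $path or die "Can't open $path: $!";
    my $octets = do { local $/; <$fh> };    # slurp everything
    close $fh;

    my $enc = guess_encoding($octets, @suspects);
    ref $enc or die "Can't guess encoding of $path: $enc";
    return $enc->decode($octets);
}
```

Since the guess runs on an in-memory copy, the same approach works for data read from a socket, as long as you are willing to buffer it.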

I am not certain if I mentioned it, but my entire reason for foraying into this UTF-16 zone is Apple's address book. When it exports a vCard file containing characters outside the ASCII range (0-127), it suddenly switches from UTF-8 to UTF-16BE with no warning. Most annoying. I think I will encourage them to include a BOM at the start of their files - and other UTF-16/32 exporters should be encouraged to do the same!

IMHO they should use UTF-8 instead, but is it Apple that sets the standard for vCard? Let's see.... RFC2425 <http://www.ietf.org/rfc/rfc2425.txt> does not state that (but says UTF-8 is preferred).

Also don't forget that UTF-8 is identical to ASCII when the string contains only \x00-\x7f, so a more appropriate way to put it is 'it suddenly switches from US-ASCII to UTF-16BE', since saying UTF-8 implies that the data contains \x{80} and above.
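You can check this for yourself with Encode. A small sketch (the sample strings are made up):

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: show octets as space-separated hex.
sub hex_bytes {
    return join ' ', map { sprintf '%02X', ord } split //, $_[0];
}

# Only \x00-\x7f: US-ASCII and UTF-8 produce the same octets.
my $plain = "BEGIN:VCARD\n";
print encode('US-ASCII', $plain) eq encode('UTF-8', $plain)
    ? "identical\n"
    : "different\n";

# Once \x{80} and above appear, UTF-8 needs more than one octet per character.
print hex_bytes(encode('UTF-8', "caf\x{e9}")), "\n";    # 63 61 66 C3 A9
```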

If you like some of my suggestions above, would you like me to write up a patch and send it your way?

Your patch is also welcome :)

Dan the Encode Maintainer