Re: Encode::Guess fails on UTF-16BE string w/ newline characters

On Tuesday, April 15, 2003, at 06:42  AM, Jay Lawrence wrote:

eeeeee! What were they thinking? Unicode encodings are a nightmare!!!
IMHO it seems that UTF-16/32 should mandate BOMs at the start of a file to save everybody a world of hurt. I get the impression that ISO-10646 is also a big pain in the a**.

You mean for guessing encodings? I don't think UTF-(8|16|32) is a mess in terms of how they map Unicode to a series of bytes. I do think, however, that Unicode itself, or how the consortium gives each character a number, is a mess, especially CJK unification, which later led to the mess of surrogate pairs. They first tried to be reductionist (try to squeeze everything into 16 bits), found that does not work, and became expansionist (32 bits, but in reality it is (16 + 1)*65536 because of the surrogate pair mess). They should've been expansionist in the first place.
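To make the (16 + 1)*65536 figure concrete, here is a minimal sketch of the surrogate pair arithmetic. The code point U+1F600 is only an illustrative choice, and the variable names are mine; the result is cross-checked against what Encode itself produces.

```perl
use strict;
use warnings;
use Encode qw(encode);

# 17 planes of 65536 code points each: the BMP plus 16 supplementary planes.
my $total = (16 + 1) * 65536;

# Split a supplementary code point into a UTF-16 surrogate pair by hand.
my $cp     = 0x1F600;                     # illustrative code point above U+FFFF
my $offset = $cp - 0x10000;               # 20-bit value
my $high   = 0xD800 + ($offset >> 10);    # top 10 bits
my $low    = 0xDC00 + ($offset & 0x3FF);  # bottom 10 bits

# Cross-check against Encode's own UTF-16BE output.
my @units = unpack 'n*', encode('UTF-16BE', chr $cp);

printf "total code points: %d (0x%X)\n", $total, $total;
printf "by hand: %04X %04X\n", $high, $low;
printf "Encode:  %04X %04X\n", @units;
```

The two 16-bit units carry 10 bits each, which is where the 16 extra planes come from.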

I agree - my prototype language for Unicode support is Arabic (I took a class in it for this very reason) - it is entirely possible that there are no Latin characters in a stream, therefore making the encoding impossible to detect!


You mean not even \n?
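A rough sketch of why a lone newline already helps: in UTF-16BE "\n" is the byte pair 00 0A, in UTF-16LE it is 0A 00, so the position of the NUL byte hints at the byte order even when every other character is Arabic. The helper name guess_utf16_endianness is hypothetical and this is only a heuristic for illustration, not a description of how Encode::Guess is implemented.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical heuristic: count NUL bytes at even and odd offsets.
# A NUL at an even offset suggests big-endian, at an odd offset little-endian.
sub guess_utf16_endianness {
    my ($octets) = @_;
    my ($even, $odd) = (0, 0);
    my @bytes = unpack 'C*', $octets;
    for my $i (0 .. $#bytes) {
        next if $bytes[$i] != 0;
        if ($i % 2) { $odd++ } else { $even++ }
    }
    return 'UTF-16BE' if $even > $odd;
    return 'UTF-16LE' if $odd > $even;
    return;    # no NUL bytes, or a tie: cannot tell
}

# Arabic text with a trailing newline and no Latin letters.
my $text = "\x{0645}\x{0631}\x{062D}\x{0628}\x{0627}\n";

my $be_guess = guess_utf16_endianness(encode('UTF-16BE', $text));
my $le_guess = guess_utf16_endianness(encode('UTF-16LE', $text));
print "BE data guessed as: $be_guess\n";
print "LE data guessed as: $le_guess\n";
```

Without the newline the same Arabic string contains no NUL bytes at all, and the helper returns nothing.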

some reactive thoughts on the guessing process:
        a - make declared list of candidates position significant
                - first suspect is what will be used in case of multiple matches
                - that is what I'd expect

For the time being Encode::Guess is strict in the sense that it allows no ambiguity. Maybe we can relax this by adding hints, for example by implementing Encode::Guess->set_policy("loose") or something.
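A minimal sketch of the difference between the strict behaviour and a position-significant one. guess_encoding() is the real Encode::Guess function; guess_first_match() is a hypothetical helper written for this illustration (set_policy does not exist), and the Latin-1/Latin-2 sample is just a convenient ambiguous input.

```perl
use strict;
use warnings;
use Encode qw(find_encoding);
use Encode::Guess;    # exports guess_encoding

# Hypothetical helper, not part of Encode::Guess: return the first
# candidate, in the order given, that decodes the octets without error.
sub guess_first_match {
    my ($octets, @candidates) = @_;
    for my $name (@candidates) {
        my $enc = find_encoding($name) or next;
        my $copy = $octets;    # decode() may modify its argument when CHECK is set
        my $ok = eval { $enc->decode($copy, Encode::FB_CROAK()); 1 };
        return $enc if $ok;
    }
    return;
}

my $octets = "caf\xE9\n";    # 0xE9 is a valid byte in both Latin-1 and Latin-2

# Strict: on ambiguity guess_encoding returns an error string, not an object.
my $strict = guess_encoding($octets, qw(iso-8859-1 iso-8859-2));
print "strict: ", (ref $strict ? $strict->name : $strict), "\n";

# Loose: the first listed candidate wins.
my $loose = guess_first_match($octets, qw(iso-8859-1 iso-8859-2));
print "loose:  ", ($loose ? $loose->name : "no match"), "\n";
```

The loose variant trades correctness for convenience: it will happily pick the wrong single-byte encoding, which is the reason for keeping the default strict.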

        b - consider language-specific test feature
                - Kanji exists in \x5x & \x9x ranges?
                - Arabic
                - Simplified Chinese
                - etc.

The problem is that it takes vast knowledge of the language. Plus I don't want to complicate Encode::Guess too much. Be my guest.

        c - can we open a file with encoding type Guess?

You mean "open F, '<:encoding(Guess)'" ? The answer is no, because in order to guess the encoding you have to actually read the data. And to allow file access after the guess you have to rewind the filehandle, something you can't always do (consider a socket).
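One way around the rewind problem, sketched under the assumption that the whole input fits in memory: read the raw octets once, guess, and decode the buffer you already hold. An in-memory filehandle stands in for a real file or socket here, and the sample data (with a BOM) is made up for the example.

```perl
use strict;
use warnings;
use Encode qw(encode);
use Encode::Guess;    # exports guess_encoding

# Stand-in for a file or socket: UTF-16BE data preceded by a BOM.
my $bytes = "\xFE\xFF" . encode('UTF-16BE', "hello\n");
open my $fh, '<:raw', \$bytes or die "open failed: $!";

# Read everything as raw octets; no rewind is needed afterwards.
my $octets = do { local $/; <$fh> };
close $fh;

# guess_encoding returns an encoding object, or an error string on failure.
my $enc = guess_encoding($octets);
die "Cannot guess: $enc" unless ref $enc;

my $text = $enc->decode($octets);
printf "guessed %s, decoded %d characters\n", $enc->name, length $text;
```

For a stream too large to slurp, the same idea would need a buffer of the bytes already consumed, which is more than this sketch attempts.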

I am not certain if I mentioned it, but my entire reason for foraying into this UTF-16 zone is Apple's address book. When it exports a vCard file with characters outside the ASCII 0-127 range it suddenly switches from UTF-8 to UTF-16BE with no warning. Most annoying. I think I will encourage them to include a BOM at the start of their file - and other UTF-16/32 exporters should be encouraged to do the same!

IMHO they should use UTF-8 instead, but is it Apple that sets standards for vCard? Let's see.... RFC2425 <http://www.ietf.org/rfc/rfc2425.txt> does not state that (but says UTF-8 is preferred).

Also don't forget that UTF-8 is identical to ASCII when the string contains only \x00-\x7f, so a more appropriate way to state that is 'it suddenly switches from US-ASCII to UTF-16BE', since saying UTF-8 implies it contains \x{80} and above.
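A quick check of that point, with sample strings invented for the illustration: an ASCII-only string encodes to the same octets either way, and only a character above \x7f makes the UTF-8 form differ.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $ascii_only = "BEGIN:VCARD\n";    # every character is in \x00-\x7f
my $accented   = "caf\x{E9}\n";      # contains U+00E9

# Identical octets for ASCII-only text.
my $same = encode('UTF-8', $ascii_only) eq encode('US-ASCII', $ascii_only);

# U+00E9 takes two octets in UTF-8, so 5 characters become 6 octets.
my $utf8_len = length encode('UTF-8', $accented);

print "ASCII-only text identical in both encodings: ", ($same ? "yes" : "no"), "\n";
print "characters: ", length($accented), ", UTF-8 octets: $utf8_len\n";
```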

If you like some of my suggestions above would you like me to propose a patch and send it your way?


Your patch is also welcome :)

Dan the Encode Maintainer