Re: Encode::Guess fails on UTF-16BE string w/ newline characters

I agree - my prototype language for Unicode support is Arabic (I took a class in it for this very reason) - it is entirely possible that there are no Latin characters in a stream, therefore making the encoding impossible to detect!
You mean not even \n?

Good point. I was fixating on the text. But actually spaces and newlines would be good candidates for sniffing the encoding style. Would this apply across all languages in Unicode? If so then there are a few predictable cases to signal non-BOM encoded UTF-16 & 32. As an end developer, however, we could run into a problem with line endings. If the file were encoded "LE" then reading a line would stop at the \n and leave the \x00 (or \x00\x00\x00) in the buffer for the next read. Really the file needs to be guessed on blocks of 4 bytes when guessing UTF-16 & 32.
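To make the idea concrete, here is a rough sketch of that kind of sniffing. It is only a heuristic under my own assumptions: sniff_utf_by_whitespace is a made-up helper, not anything in Encode::Guess, and it can be fooled by characters such as U+2000 or U+0A00 whose bytes look like a space or newline in the opposite byte order. The second half shows the line-ending trap with a plain byte-oriented readline.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper (not part of Encode::Guess): guess a BOM-less
# UTF-16/32 flavour from the encoded forms of space (U+0020) and
# newline (U+000A), scanning the buffer in aligned 4-byte blocks.
sub sniff_utf_by_whitespace {
    my ($bytes) = @_;
    my %count = map { $_ => 0 } qw(UTF-32BE UTF-32LE UTF-16BE UTF-16LE);

    for (my $i = 0; $i < length($bytes); $i += 4) {
        my $quad = substr $bytes, $i, 4;
        if    ($quad =~ /\A\x00\x00\x00[\x0A\x20]\z/) { $count{'UTF-32BE'}++ }
        elsif ($quad =~ /\A[\x0A\x20]\x00\x00\x00\z/) { $count{'UTF-32LE'}++ }
        else {
            # Not a UTF-32 space/newline: test each 2-byte half as UTF-16.
            for my $pair ($quad =~ /(..)/sg) {
                if    ($pair =~ /\A\x00[\x0A\x20]\z/) { $count{'UTF-16BE'}++ }
                elsif ($pair =~ /\A[\x0A\x20]\x00\z/) { $count{'UTF-16LE'}++ }
            }
        }
    }

    # Only answer when one candidate clearly leads; otherwise give up.
    my ($best, $next) = sort { $count{$b} <=> $count{$a} } keys %count;
    return $count{$best} > $count{$next} ? $best : undef;
}

# Arabic sample with one space and a trailing newline, no Latin letters.
my $text = "\x{0645}\x{0631}\x{062D}\x{0628}\x{0627} \x{0628}\x{0643}\n";
for my $enc (qw(UTF-16BE UTF-16LE UTF-32BE UTF-32LE)) {
    my $guess = sniff_utf_by_whitespace(encode($enc, $text));
    printf "%-8s sniffed as %s\n", $enc, $guess // 'unknown';
}

# The line-ending trap: a byte-oriented readline on UTF-16LE stops at
# the 0x0A byte and leaves the trailing 0x00 for the next read.
my $le = encode('UTF-16LE', "a\nb\n");
open my $raw, '<:raw', \$le or die "in-memory open failed: $!";
my @raw_lines = <$raw>;
close $raw;
printf "raw line %d: %s\n", $_ + 1, unpack('H*', $raw_lines[$_])
    for 0 .. $#raw_lines;
```

The hex dump makes the problem visible: every "line" after the first starts with the \x00 left over from the previous newline, which is why the guess has to work on aligned blocks rather than on lines.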

        c - can we open a file with encoding type Guess?
You mean "open F, '<:encoding(Guess)'" ? The answer is no because in order to guess encoding you have to actually read it. And to allow file access after the guess you have to rewind the filehandle, something you can't always do (consider a socket).

This may be worth further investigation - the decoder is indeterminate at initial read. First read from filehandle does the guess and sets the nature of decoding to take place....
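For the seekable case, the read-guess-rewind sequence described above can already be done by hand. A minimal sketch, assuming a regular file and a sample that starts with a BOM: open_guessed is a name I made up, and the 4096-byte sample size is arbitrary (a sample that ends mid-character could make the guess fail).

```perl
use strict;
use warnings;
use Encode qw(encode);
use Encode::Guess;               # exports guess_encoding()
use File::Temp qw(tempfile);

# Hypothetical helper: guess the encoding from the first block of a
# file, rewind, and return a filehandle that decodes on read.
sub open_guessed {
    my ($path, @suspects) = @_;
    open my $fh, '<:raw', $path or die "open $path: $!";
    read($fh, my $head, 4096) // die "read $path: $!";

    # guess_encoding() returns an encoding object on success and a
    # plain error string on failure.
    my $enc = guess_encoding($head, @suspects);
    ref $enc or die "cannot guess encoding of $path: $enc";

    # This is the step that fails on sockets and pipes.
    seek($fh, 0, 0) or die "cannot rewind $path: $!";
    binmode($fh, ':encoding(' . $enc->name . ')')
        or die "cannot set encoding layer on $path: $!";
    return ($fh, $enc->name);
}

# Demo: write a small UTF-16 file (encode() adds the BOM for 'UTF-16').
my ($tmp, $path) = tempfile(UNLINK => 1);
binmode $tmp;
print {$tmp} encode('UTF-16', "\x{0645}\x{0631}\x{062D}\x{0628}\x{0627}\nvCard\n");
close $tmp or die "close $path: $!";

my ($fh, $name) = open_guessed($path);
my @lines = <$fh>;
close $fh;
printf "guessed %s, read %d lines\n", $name, scalar @lines;
```

A guess-on-first-read layer would have to do the same thing without the seek, by holding on to the sampled bytes and decoding them itself before reading further.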

I am not certain if I mentioned it, but my entire reason for foraying into this UTF-16 zone is Apple's address book. When it exports a vCard file containing characters outside the ASCII range (0-127), it suddenly switches from UTF-8 to UTF-16BE with no warning. Most annoying. I think I will encourage them to include a BOM at the start of their file - and other UTF-16/32 exporters should be encouraged to do the same!
IMHO I think they should use UTF-8 instead, but is it Apple that sets standards for vCard? Let's see.... RFC2425 <http://www.ietf.org/rfc/rfc2425.txt> does not state that (but says UTF-8 preferred).

I gave Apple feedback to that effect. They retorted that the sometimes UTF-16 and sometimes UTF-8 was a feature not a bad idea. My read of the RFC is that they should use UTF-8 unless they have a very good reason not to.

If you like some of my suggestions above, would you like me to propose a patch and send it your way?
Your patch is also welcome :)

Unless someone is itching to do the filehandle guess feature - I'll have a look at that as time permits.