perl-unicode

Re: Encode::Guess fails on UTF-16BE string w/ newline characters

2003-04-15 16:30:05
I agree. My prototype language for Unicode support is Arabic (I took a class in it for this very reason), and it is entirely possible for a stream to contain no Latin characters at all, which would make the encoding impossible to detect!

You mean not even \n?

Good point. I was fixating on the text. But spaces and newlines would actually be good candidates for sniffing the encoding. Would this apply across all languages in Unicode? If so, there are a few predictable patterns that could signal BOM-less UTF-16 and UTF-32. As end developers, however, we could run into a problem with line endings. If the file were encoded "LE", a byte-oriented line read would stop at the \n byte and leave the trailing \x00 (or \x00\x00\x00) in the buffer for the next read. Really the data needs to be examined in blocks of 4 bytes when guessing UTF-16 and UTF-32.
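To make the whitespace-sniffing idea concrete, here is a rough sketch. It is only an illustration, not anything Encode::Guess does today: sniff_utf_by_whitespace is a hypothetical helper name, and the rule "count code units equal to U+0020 or U+000A under each byte order, try UTF-32 before UTF-16" is my own assumption about how such a check might work. It looks only at whole 4-byte blocks so that no code unit is split.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Encode::Guess): guess a BOM-less
# UTF-16/32 flavour by counting code units that would be a space
# (U+0020) or a newline (U+000A) under each candidate byte order.
sub sniff_utf_by_whitespace {
    my ($bytes) = @_;

    # Use only whole 4-byte blocks so no code unit is cut in half.
    my $usable = length($bytes) - length($bytes) % 4;
    return unless $usable;
    my $buf = substr($bytes, 0, $usable);

    # unpack templates: N/V = 32-bit big/little endian,
    #                   n/v = 16-bit big/little endian.
    my %template = (
        'UTF-32BE' => 'N*', 'UTF-32LE' => 'V*',
        'UTF-16BE' => 'n*', 'UTF-16LE' => 'v*',
    );

    my %hits;
    for my $enc (keys %template) {
        # grep in scalar context returns the number of matches.
        $hits{$enc} = grep { $_ == 0x0A || $_ == 0x20 }
                      unpack($template{$enc}, $buf);
    }

    # Try UTF-32 first: UTF-32 whitespace also looks like a NUL plus
    # whitespace when read as UTF-16, so the wider match wins.
    for my $pair (['UTF-32BE', 'UTF-32LE'], ['UTF-16BE', 'UTF-16LE']) {
        my ($be, $le) = @$pair;
        return $be if $hits{$be} > $hits{$le};
        return $le if $hits{$le} > $hits{$be};
    }
    return;    # no evidence either way
}
```

For example, the Arabic bytes "\x06\x27\x00\x20\x06\x28\x00\x0A" (two letters, a space and a newline in UTF-16BE) contain no Latin text at all, yet the space and newline are enough for the sketch to pick a byte order. Text with no whitespace in the sampled block would still go undetected.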

        c - can we open a file with encoding type Guess?

You mean "open F, '<:encoding(Guess)'"? The answer is no, because to guess the encoding you have to actually read the data. And to allow file access after the guess you have to rewind the filehandle, something you can't always do (consider a socket).
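For a plain file, where reading everything up front is acceptable, the usual workaround is to read the raw bytes first and guess afterwards. A minimal sketch, assuming the whole file fits in memory; slurp_and_decode is a hypothetical name, while guess_encoding is the function Encode::Guess exports and it returns an error string rather than an object when it cannot decide:

```perl
use strict;
use warnings;
use Encode::Guess;    # exports guess_encoding()

# Hypothetical helper: read a file as raw bytes, guess, then decode.
# @suspects is optional and is passed straight to guess_encoding().
sub slurp_and_decode {
    my ($path, @suspects) = @_;

    open my $fh, '<:raw', $path or die "Cannot open $path: $!";
    my $bytes = do { local $/; <$fh> };    # slurp the whole file
    close $fh;
    $bytes = '' unless defined $bytes;

    # On success we get an encoding object; on failure a plain string
    # describing why the guess failed.
    my $decoder = guess_encoding($bytes, @suspects);
    die "Cannot guess encoding of $path: $decoder" unless ref $decoder;

    return $decoder->decode($bytes);
}
```

No rewind is needed because the bytes that were used for the guess are the same bytes that get decoded.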

This may be worth further investigation: the decoder would be indeterminate until the initial read. The first read from the filehandle would do the guess and set the decoding used from then on....
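Roughly what I have in mind, as a sketch only: read one block, guess from that block alone, and keep the block so that nothing has to be rewound. decode_stream_by_first_block is a hypothetical name and the 4096-byte default is an arbitrary assumption (chosen as a multiple of 4). This version still reads the rest of the stream before decoding, and it ignores the case where the first block ends in the middle of a multi-byte character, which could make the guess fail.

```perl
use strict;
use warnings;
use Encode::Guess;    # exports guess_encoding()

# Hypothetical helper: guess from the first block of a handle that may
# not be seekable (for example a socket), then decode the buffered
# block together with whatever follows.
sub decode_stream_by_first_block {
    my ($fh, $block_size) = @_;
    $block_size ||= 4096;    # assumed default, a multiple of 4

    binmode $fh;             # we want raw bytes, no layers
    my $got = read($fh, my $head, $block_size);
    die "read failed: $!" unless defined $got;

    # The guess is based on the first block only.
    my $decoder = guess_encoding($head);
    die "Cannot guess encoding: $decoder" unless ref $decoder;

    # Read the remainder; the first block stays in $head, so the
    # handle never needs to be rewound.
    my $rest = do { local $/; <$fh> };
    $rest = '' unless defined $rest;

    return ($decoder->name, $decoder->decode($head . $rest));
}
```

A real filehandle layer would have to decode incrementally rather than collect the whole stream, which is the part that needs the investigation.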

I am not certain whether I mentioned it, but my entire reason for this foray into the UTF-16 zone is Apple's Address Book. When it exports a vCard file containing characters outside the ASCII range (0-127), it suddenly switches from UTF-8 to UTF-16BE with no warning. Most annoying. I think I will encourage them to include a BOM at the start of their files, and other UTF-16/32 exporters should be encouraged to do the same!

IMHO they should use UTF-8 instead, but is it Apple that sets the standard for vCard? Let's see.... RFC 2425 <http://www.ietf.org/rfc/rfc2425.txt> does not state that (but says UTF-8 is preferred).

I gave Apple feedback to that effect. They retorted that being sometimes UTF-16 and sometimes UTF-8 was a feature, not a bad idea. My reading of the RFC is that they should use UTF-8 unless they have a very good reason not to.

If you like some of my suggestions above, would you like me to write a patch and send it your way?

Your patch is also welcome :)

Unless someone is itching to do the filehandle guess feature - I'll have a look at that as time permits.
J