I agree - my prototype language for Unicode support is Arabic (I took
a class in it for this very reason) - it is entirely possible that
there are no Latin characters in a stream, which would make the
encoding impossible to detect!
You mean not even \n?
Good point. I was fixating on the text. But actually spaces and
newlines would be good candidates for sniffing the encoding style.
Would this apply across all languages in Unicode? If so then there are
a few predictable cases that signal BOM-less UTF-16 and UTF-32. As end
developers, however, we could run into a problem with line endings.
If the file were encoded "LE" then a byte-oriented line read would stop
at the \n and leave the \x00 (or \x00\x00\x00) in the buffer for the
next read. Really the file needs to be examined in blocks of 4 bytes
when guessing UTF-16 and UTF-32.
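
A minimal sketch of the space/newline sniffing idea, written in Python
purely for illustration. The function name sniff_bomless_utf is
hypothetical, and the approach is a heuristic: a character such as
U+0A00 or U+2000 in the opposite byte order would fool it.

```python
def sniff_bomless_utf(data: bytes):
    """Guess a BOM-less UTF-16/32 variant from spaces and newlines.

    Hypothetical helper: returns a codec name, or None if nothing matches.
    """
    # UTF-32 is tried first because the UTF-32LE form of "\n"
    # (0A 00 00 00) also contains the UTF-16LE form (0A 00).
    candidates = ("utf-32-be", "utf-32-le", "utf-16-be", "utf-16-le")
    for name in candidates:
        width = 4 if "32" in name else 2
        # Byte patterns for U+000A and U+0020 in this candidate encoding.
        markers = {ch.encode(name) for ch in ("\n", " ")}
        # Walk the data in aligned units so a marker is never matched
        # across a character boundary.
        usable = len(data) - len(data) % width
        for i in range(0, usable, width):
            if data[i:i + width] in markers:
                return name
    return None


# Arabic text with no Latin letters, only a space and a newline.
sample = "مرحبا بكم\n"
print(sniff_bomless_utf(sample.encode("utf-16-le")))
```

Because the data is examined in whole 2- or 4-byte units, the stray
\x00 left behind by a byte-oriented line read never comes into play.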
c - can we open a file with encoding type Guess?
You mean "open F, '<:encoding(Guess)'" ? The answer is no because in
order to guess the encoding you have to actually read the data. And to
allow file access after the guess you have to rewind the filehandle,
something you can't always do (consider a socket).
This may be worth further investigation - the decoder would stay
undetermined until the initial read. The first read from the filehandle
does the guess and sets the decoding for everything that follows....
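
One way the guess-on-first-read idea could look, sketched in Python for
illustration only (this is not how Encode or PerlIO work). The class
GuessingReader and the toy guesser simple_guess are hypothetical names.
The bytes used for the guess are kept and decoded, so the stream never
needs rewinding.

```python
import codecs
import io


def simple_guess(block: bytes) -> str:
    """Toy guesser (an assumption for this sketch): NUL bytes imply UTF-16."""
    if b"\x00" not in block:
        return "utf-8"
    # In UTF-16BE the high (often zero) byte sits at even offsets.
    even_nuls = block[0::2].count(0)
    odd_nuls = block[1::2].count(0)
    return "utf-16-be" if even_nuls >= odd_nuls else "utf-16-le"


class GuessingReader:
    """Decode a byte stream whose encoding is guessed from the first block."""

    def __init__(self, raw, guess, block_size=4096):
        self._raw = raw                # any object with read(n) -> bytes
        self._guess = guess            # callable: bytes -> codec name
        self._block_size = block_size
        self._decoder = None           # undetermined until the first read
        self.encoding = None

    def read_text(self):
        """Return the next decoded chunk, or '' only at end of stream."""
        while True:
            block = self._raw.read(self._block_size)
            if self._decoder is None:
                # First read: guess from the bytes already in hand.
                self.encoding = self._guess(block)
                self._decoder = codecs.getincrementaldecoder(self.encoding)()
            # The incremental decoder buffers characters split across blocks;
            # final=True at EOF turns a truncated character into an error.
            text = self._decoder.decode(block, final=not block)
            if text or not block:
                return text


stream = io.BytesIO("Jos\u00e9\n".encode("utf-16-be"))
reader = GuessingReader(stream, simple_guess, block_size=3)
print(repr(reader.read_text()), reader.encoding)
```

Anything with a read(n) method returning bytes would do as the raw
stream, including a non-seekable one such as a socket wrapped with
makefile("rb").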
I am not certain if I mentioned, but my entire reason for foraying
into this UTF-16 zone is because of Apple's address book. When it
exports a vCard file containing characters outside the ASCII range
(0-127) it suddenly switches from UTF-8 to UTF-16BE with no warning.
Most annoying. I think I will encourage them to include a BOM at the
start of their file - and other UTF-16/32 exporters should be
encouraged to do the same!
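
To illustrate why a BOM helps, here is a small Python sketch of an
exporter that writes one and a reader that checks for it. The helper
names detect_bom and export_with_bom are hypothetical, and the FN: line
is only sample data.

```python
import codecs

# UTF-32 must be checked before UTF-16: the UTF-32LE BOM (FF FE 00 00)
# begins with the UTF-16LE BOM (FF FE).
BOMS = (
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
)


def detect_bom(data: bytes):
    """Return (codec name, BOM length), or (None, 0) if there is no BOM."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0


def export_with_bom(text: str) -> bytes:
    """Encode as UTF-16BE with an explicit leading BOM."""
    return codecs.BOM_UTF16_BE + text.encode("utf-16-be")


payload = export_with_bom("FN:Jos\u00e9\n")
name, skip = detect_bom(payload)
print(name, repr(payload[skip:].decode(name)))
```

With the BOM present the reader needs no guessing at all; without it,
detect_bom returns (None, 0) and some heuristic has to take over.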
IMHO they should use UTF-8 instead, but is it Apple that sets the
standards for vCard? Let's see.... RFC 2425
<http://www.ietf.org/rfc/rfc2425.txt> does not state that (but says
UTF-8 is preferred).
I gave Apple feedback to that effect. They retorted that sometimes
using UTF-16 and sometimes UTF-8 was a feature, not a bad idea. My read
of the
RFC is that they should use UTF-8 unless they have a very good reason
not to.
If you like some of my suggestions above, would you like me to propose
a patch and send it your way?
Your patch is also welcome :)
Unless someone is itching to do the filehandle guess feature - I'll
have a look at that as time permits.
J