perl-unicode

Re: BOM and principle of least surprise

2004-03-31 15:30:09
Erland Sommarskog <sommar(_at_)algonet(_dot_)se> writes:

It seems that the only way out is to first open the file in plain mode,

binmode I suspect.

look at the first three bytes, and if they are a BOM, close the file, open
it again with the appropriate options and discard the BOM.

You don't have to close it; just seek() past the BOM and twiddle the 
layer stack. If you close it, there is a possibility someone will 
overwrite it with another file in a different format.
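
A minimal sketch of the seek()-and-binmode() approach described above, using only core Perl. The helper name open_with_bom and its default of UTF-8 for BOM-less files are assumptions made for illustration, not an existing module or API.

```perl
use strict;
use warnings;

# Known BOM byte sequences. The 4-byte UTF-32 marks come first so that
# a UTF-32LE BOM (FF FE 00 00) is not mistaken for UTF-16LE (FF FE).
my @BOMS = (
    [ "\x00\x00\xFE\xFF", 'UTF-32BE' ],
    [ "\xFF\xFE\x00\x00", 'UTF-32LE' ],
    [ "\xEF\xBB\xBF",     'UTF-8'    ],
    [ "\xFE\xFF",         'UTF-16BE' ],
    [ "\xFF\xFE",         'UTF-16LE' ],
);

# Hypothetical helper: open $path as raw bytes, sniff a BOM, then push a
# matching :encoding layer onto the same handle without closing it.
sub open_with_bom {
    my ($path, $default) = @_;
    $default = 'UTF-8' unless defined $default;   # assumed fallback

    open(my $fh, '<:raw', $path) or die "Cannot open $path: $!";

    my $head = '';
    read($fh, $head, 4);   # short files simply yield fewer bytes

    my ($encoding, $skip) = ($default, 0);
    for my $bom (@BOMS) {
        my ($bytes, $name) = @$bom;
        if (substr($head, 0, length $bytes) eq $bytes) {
            ($encoding, $skip) = ($name, length $bytes);
            last;
        }
    }

    # Reposition just past the BOM (or at the start if there was none),
    # then add the decoding layer on top of the raw byte stream.
    seek($fh, $skip, 0) or die "Cannot seek in $path: $!";
    binmode($fh, ":encoding($encoding)")
        or die "Cannot set encoding $encoding on $path: $!";

    return ($fh, $encoding);
}
```

The returned encoding name can be reused when opening an output file, which is one way to approximate "same mode as that handle". Note that the sniffing shares the ambiguity discussed in this thread: a file of raw bytes that happens to begin with FF FE will be treated as text.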


I would really expect someone to have done this already, but I see no
reference to such a module. Or a layer directive like "<:use-bom" to open
the file. And then some way to open an output file in the "same mode as that
handle". 

Seems you are the 1st (at least to care) - so in true OpenSource 
spirit you would write the module and contribute it.

Including correct handling of newline in UTF-16LE.
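
On the newline point: a sketch of one layer ordering that is commonly used for UTF-16LE text with CRLF line ends. The idea is that :crlf must sit above the :encoding layer so that it translates characters rather than raw bytes. The temporary file and the sample text are illustrative assumptions.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Scratch file for the demonstration.
my ($tmp, $path) = tempfile(UNLINK => 1);
close $tmp;

# :raw removes any platform default layers, :encoding turns characters
# into UTF-16LE bytes, and :crlf on top maps "\n" to "\r\n" before the
# text is encoded.
open(my $out, '>:raw:encoding(UTF-16LE):crlf', $path)
    or die "Cannot write $path: $!";
print {$out} "one\ntwo\n";
close $out or die "Cannot close $path: $!";

# Look at the bytes that actually reached the file.
open(my $raw, '<:raw', $path) or die "Cannot read $path: $!";
my $bytes = do { local $/; <$raw> };
close $raw;

# Reading back through the same stack yields plain "\n" line ends again.
open(my $in, '<:raw:encoding(UTF-16LE):crlf', $path)
    or die "Cannot read $path: $!";
my @lines = <$in>;
close $in;

printf "%d bytes on disk, %d lines read back\n", length $bytes, scalar @lines;
```

With :crlf below the encoding layer instead, the translation would operate on individual bytes and would not recognise the two-byte 0D 00 0A 00 sequences.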

If you have a stream of bytes Perl cannot start blindly guessing
what data it might be.  

Why not? It seems that many programs on Windows do precisely this.

Many _programs_, yes. So when you write a perl _program_ you can 
handle it. The C++ language doesn't do this for you, so why should Perl?
Now there may well be a C++ _library_ which does this, so there 
could be a perl _library_ (module) which did it too.


If from a BOM Perl should guess that input is in UTF-16, that would make
it impossible to read the same file in as binary.  

Not sure I understand. Could you elaborate?

Suppose a binary file (say of packed floating-point numbers)
happens to start with the bytes of a UTF-16LE BOM.
If perl blindly honoured the BOM it would start mangling the numbers.
That said, on Win32 a program would need to assert binmode() on such 
a file anyway. So I don't see a fundamental reason why Win32 in text
mode could not honour BOMs. 
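
To make the binary-file concern concrete, here is a small sketch. The record layout (a 16-bit little-endian integer followed by a 32-bit little-endian float) is an assumption chosen so that the first two bytes coincide with a UTF-16LE BOM.

```perl
use strict;
use warnings;
use Encode qw(decode);

# A packed binary record: the integer 0xFEFF, then the float 1.5.
# In little-endian order the integer is stored as the bytes FF FE.
my $binary = pack('v f<', 0xFEFF, 1.5);

# To a BOM sniffer this is indistinguishable from UTF-16LE text.
my $looks_like_bom = substr($binary, 0, 2) eq "\xFF\xFE";

# Read as binary, the values come back intact.
my ($marker, $float) = unpack('v f<', $binary);

# Read as UTF-16LE text after dropping the "BOM", the four bytes of the
# float are reinterpreted as two unrelated 16-bit characters.
my $payload = substr($binary, 2);
my $text    = decode('UTF-16LE', $payload);

printf "looks like BOM: %s, float: %s, decoded length: %d\n",
    ($looks_like_bom ? 'yes' : 'no'), $float, length $text;
```

This is why any BOM handling has to be limited to handles the program treats as text, with binmode() opting out.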

It would seem the best place to do this would be to change 
the initial layer on Win32 to a new layer (say :bomcrlf).
This layer would get popped on binmode(), fixing the problem above.
It would look at the first few bytes it got from the OS and then, if 
they were a BOM, push an encoding() layer beneath itself and mutate 
into a :crlf layer with the UTF8 flag set.


(Perl does recognize BOMs in Perl scripts, since a script is kind of a 
known format...)

Again, I don't understand what you are talking about. Maybe it is that
funny #! you use on Unix, but as I said, I mainly work on Windows. In 
any case, I fail to see why first looking for a BOM, and then for #!
once you have deduced the encoding, would be impossible. 

Perl doesn't look at the #!; UNIX does. Unix uses the first few bytes
of the file to decide what kind of file it is. (Windows uses
the extension to know that "C:\bin\perl.exe" is an executable; UNIX 
uses a "magic number" in the file to know '/usr/bin/perl' is an executable.)  
As far as I know, no UNIX yet can cope with a BOM before the #!.


Is there at all any possibility to feed Perl a script that has been
saved in UTF-16? (Which is how you normally save Unicode on Windows.)

Perl5's assumption is that Unicode is saved as UTF-8.
For Perl6 ask elsewhere ...
