perl-unicode

Re: BOM and principle of least surprise

2004-03-29 13:30:05
Jarkko Hietaniemi (jhi(_at_)iki(_dot_)fi) writes:
I said the principle of least surprise, because having read perluniintro
my impression was that I should not really have to care which format the
string was in.

You should not need to care *once* the data has been read into Perl.

Before that, in the input phase, Perl needs your help.

That is quite a serious restriction. It's like having a great car that
can do 200 km/h on the motorway, but you have to push it down a hill
to start it.

In many situations, I as the author of a Perl script have no idea what
encoding the input might be in. I might be writing a general tool that reads
some file, and performs some processing. I have no idea of what input
the user might be feeding me. And putting the burden on the user is 
not a good solution. Not in an environment where he never has to bother
with other tools.

It seems that the only way out is to first open the file in plain mode,
look at the first two or three bytes, and if they are a BOM, close the file,
open it again with the appropriate options and discard the BOM.
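
A minimal sketch of that approach, for illustration only. The name
open_with_bom and the UTF-8 fallback are assumptions of this sketch, not
an existing module or anything Perl does by itself. Instead of closing
and reopening, it seeks back to just past the BOM and pushes an
:encoding layer onto the same handle.

```perl
use strict;
use warnings;

# Hypothetical helper: sniff a BOM on a raw handle, then return the
# handle with a matching :encoding layer, positioned after the BOM.
sub open_with_bom {
    my ($path, $default) = @_;
    $default = 'UTF-8' unless defined $default;   # assumed fallback

    open(my $fh, '<:raw', $path) or die "Cannot open $path: $!";
    my $got = read($fh, my $head, 4);
    die "Cannot read $path: $!" unless defined $got;

    # Longest signatures first: the UTF-32LE BOM starts with the
    # same two bytes as the UTF-16LE one.
    my @boms = (
        [ "\x00\x00\xFE\xFF", 'UTF-32BE' ],
        [ "\xFF\xFE\x00\x00", 'UTF-32LE' ],
        [ "\xEF\xBB\xBF",     'UTF-8'    ],
        [ "\xFE\xFF",         'UTF-16BE' ],
        [ "\xFF\xFE",         'UTF-16LE' ],
    );

    my ($encoding, $skip) = ($default, 0);
    for my $bom (@boms) {
        my ($sig, $name) = @$bom;
        if (substr($head, 0, length $sig) eq $sig) {
            ($encoding, $skip) = ($name, length $sig);
            last;
        }
    }

    # Rewind to just past the BOM (or to the start if none was found),
    # then decode everything that follows.
    seek($fh, $skip, 0) or die "Cannot seek in $path: $!";
    binmode($fh, ":encoding($encoding)") or die "Cannot set layer: $!";
    return ($fh, $encoding);
}
```

Since the detected encoding name is returned, an output file could be
opened "in the same mode" with something like
open(my $out, ">:raw:encoding($encoding)", $outpath), followed by
print {$out} "\x{FEFF}" if a BOM is wanted there too. Newline translation
is deliberately left out of the sketch; on Windows a :crlf layer would
have to be added on top of the encoding layer.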

I would really expect someone to have done this already, but I see no
reference to such a module. Or layer-directive like "<:use-bom" to open
the file. And then some way to open an output file "same mode as that
handle". Including correct handling of newline in UTF-16LE.

If you have a stream of bytes Perl cannot start blindly guessing
what data it might be.  

Why not? It seems that many programs on Windows do precisely this.

If from a BOM Perl should guess that input is in UTF-16, that would make
it impossible to read the same file in as binary.  

Not sure I understand. Could you elaborate?

(Perl does recognize BOMs in Perl scripts, since it kind of has to: it's
a known format...)

Again, I don't understand what you are talking about. Maybe it is that
funny #! you use on Unix, but as I said, I mainly work on Windows. In
any case, I fail to see why first looking for a BOM, and then for #!
once you have deduced the encoding, would be impossible.

Is there at all any possibility to feed Perl a script that has been
saved in UTF-16? (Which is how you normally save Unicode on Windows.)
-- 
Erland Sommarskog, Stockholm, sommar(_at_)algonet(_dot_)se
