Re: BOM and principle of least surprise

Erland Sommarskog <sommar(_at_)algonet(_dot_)se> writes:

Nick Ing-Simmons (nick(_at_)ing-simmons(_dot_)net) writes:

Erland Sommarskog <sommar(_at_)algonet(_dot_)se> writes:

I would really expect someone to have done this already, but I see no
reference to such a module, or to a layer directive like "<:use-bom" for
opening the file, along with some way to open an output file in the
"same mode as that handle".


Seems you are the first (at least to care) - so in true Open Source
spirit you would write the module and contribute it.


Unfortunately my field of expertise is not in the area of C++ programming
or Perl internals. Believe me, you would not want to see my miserable
code entered into the Perl code base. :-)


Well you only learn by trying - but that is your choice.


I guess that if I want to write a utility which can handle Unicode
files, I will implement the file opening in Perl in some private
module.


That would be a reasonable way to prototype stuff for core anyway.
With perl5.7+'s "layers" it should be possible to do this as a module.
(Which was at least part of the motivation for inventing them.)

Many _programs_ yes. So when you write a perl _program_ you can
handle it. The C++ language doesn't do this for you, so why should Perl?
Now there may well be a C++ _library_ which does this, so there
could be a perl _library_ (module) which did it too.


But Perl is not C++. C++ is a strongly typed language where you use
different functions for 8-bit and Unicode data. Perl is also a higher-
level language that does more work for me.


But there is a limit - or there would be just one perl program:

#!/usr/bin/perl
exit(do_what_I_mean(@ARGV));

I'd say that it would be
perfectly in the spirit of Perl to magically handle file as ASCII or
Unicode without me having to bother.


Agreed - but magic doesn't create itself.

It would seem the best place to do this would be to change
the initial layer on Win32 to a new layer (say :bomcrlf).
This layer would get popped on binmode() - fixing the above.
It would look at the first few bytes it got from the OS and then, if
they were a BOM, push an encoding() layer beneath itself and mutate into
a :crlf layer with the UTF8 flag set.
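
Until such a layer exists, the same sniffing can be approximated in plain
Perl. What follows is only a rough sketch under stated assumptions, not the
proposed :bomcrlf layer: open_with_bom is a made-up name, the file must be
seekable, and only the UTF-8 and UTF-16 BOMs are recognised (a UTF-32LE
BOM also starts with FF FE and would be misread as UTF-16LE).

```perl
use strict;
use warnings;

# Hypothetical helper (not an existing module): open $path for reading
# and choose an encoding layer from the BOM, if there is one.
sub open_with_bom {
    my ($path) = @_;

    # Start with raw bytes so the BOM can be inspected undecoded.
    open(my $fh, '<:raw', $path) or die "Cannot open $path: $!";

    my $head = '';
    defined(read($fh, $head, 3)) or die "Cannot read $path: $!";

    my ($encoding, $bom_len) = (undef, 0);
    if    ($head =~ /^\xEF\xBB\xBF/) { ($encoding, $bom_len) = ('UTF-8',    3) }
    elsif ($head =~ /^\xFF\xFE/)     { ($encoding, $bom_len) = ('UTF-16LE', 2) }
    elsif ($head =~ /^\xFE\xFF/)     { ($encoding, $bom_len) = ('UTF-16BE', 2) }

    # Step back to just after the BOM (or to the start if there was none),
    # then decode everything from that point on.
    seek($fh, $bom_len, 0) or die "Cannot seek in $path: $!";
    binmode($fh, ":encoding($encoding)") if defined $encoding;

    # Without a BOM the handle stays raw and $encoding is undef.
    return ($fh, $encoding);
}
```

The caller gets back the handle and the detected encoding name, so an
output file could be opened with the same ":encoding(...)" layer. Newline
translation is left out here; pushing :crlf on top of the encoding layer
would presumably be the place to add it.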


Yes, that sounds like a good way that would ensure compatibility and
still give me what I want. When is Santa coming to town? :-)


Implied timescale sounds viable ;-)


However, that does not really help when the Perl script itself is in
UTF-16 or UTF-8.


Yes it does - I _think_ one or more of 

perl -MWin32BOM UTF-16_script

or 

set PERL5OPT=-MWin32BOM

or 

set PERLIO=:bomcrlf
(with magical autoload)

could be made to work.

If it happens in core-perl it can certainly work.


Anyway, thanks for all the replies. This is not really a big deal for
me at the moment. I was just puzzled by the results of my tests. Since
I am working on a module that will support Unicode data, I'm a little
nervous that I will get questions from users about the topic.