perl-unicode

Re: BOM and principle of least surprise

2004-04-26 09:30:07
Erland Sommarskog <sommar(_at_)algonet(_dot_)se> writes:
Nick Ing-Simmons (nick(_at_)ing-simmons(_dot_)net) writes:
Erland Sommarskog <sommar(_at_)algonet(_dot_)se> writes:
I would really expect someone to have done this already, but I see no
reference to such a module. Or layer-directive like "<:use-bom" to open
the file. And then some way to open an output file "same mode as that
handle". 

Seems you are the 1st (at least to care) - so in true OpenSource 
spirit you would write the module and contribute it.

Unfortunately my field of expertise is not in the area of C++ programming
or Perl internals. Believe me, you would not want to see my miserable
code entered into the Perl code base. :-)

Well you only learn by trying - but that is your choice.


I guess that if I want to write a utility which can handle Unicode
files, I will implement the file opening in Perl in some private
module.

That would be a reasonable way to prototype stuff for core anyway.
With perl5.7+'s "layers" it should be possible to do this as a module.
(Which was at least part of the motivation for inventing them.)
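
As a rough illustration of what such a private module might do, here is a
minimal sketch in plain Perl. It is not the core layer discussed in this
thread, and the names detect_bom and open_with_bom are hypothetical. It
assumes the file can be opened in :raw mode and seeked, peeks at the first
bytes, and pushes an :encoding() layer only when a BOM is found.

```perl
use strict;
use warnings;

# Known BOMs, longest first so that UTF-32LE is tried before UTF-16LE.
# (FF FE 00 00 is ambiguous in principle: it could also be UTF-16LE text
# starting with a NUL character. This sketch picks UTF-32LE.)
my @BOMS = (
    [ "\x00\x00\xFE\xFF", 'UTF-32BE' ],
    [ "\xFF\xFE\x00\x00", 'UTF-32LE' ],
    [ "\xEF\xBB\xBF",     'UTF-8'    ],
    [ "\xFE\xFF",         'UTF-16BE' ],
    [ "\xFF\xFE",         'UTF-16LE' ],
);

# Hypothetical helper: given the first bytes of a file, return
# (encoding name, BOM length in bytes), or an empty list if there is no BOM.
sub detect_bom {
    my ($head) = @_;
    for my $entry (@BOMS) {
        my ($bom, $name) = @$entry;
        return ($name, length $bom) if index($head, $bom) == 0;
    }
    return;
}

# Hypothetical helper: open $path for reading and decode it according to
# its BOM. A file without a BOM is returned as a plain byte handle.
sub open_with_bom {
    my ($path) = @_;
    open my $fh, '<:raw', $path or die "Cannot open $path: $!";
    my $head = '';
    defined read($fh, $head, 4) or die "Cannot read $path: $!";
    my ($enc, $skip) = detect_bom($head);
    # Rewind to just past the BOM (or to the start if there was none).
    seek($fh, defined $skip ? $skip : 0, 0) or die "Cannot seek in $path: $!";
    binmode($fh, ":encoding($enc)") if defined $enc;
    return wantarray ? ($fh, $enc) : $fh;
}
```

In list context the sketch also returns the detected encoding name, so a
caller could open an output file with ">:encoding($enc)" to get the "same
mode as that handle" behaviour asked for earlier. Note that the
explicit-endian encodings do not write a BOM, so the caller would have to
print one itself.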


Many _programs_ yes. So when you write a perl _program_ you can 
handle it. C++ language doesn't do this for you, why should Perl?
Now there may well be a C++ _library_ which does this, so there 
could be a perl _library_ (module) which did it too.

But Perl is not C++. C++ is a strongly typed language where you use
different functions for 8-bit and Unicode data. Perl is also a higher-
level language that does more work for me. 

But there is a limit - or there would be just one perl program:

#!/usr/bin/perl
exit(do_what_I_mean(@ARGV));

I'd say that it would be perfectly in the spirit of Perl to magically
handle a file as ASCII or Unicode without me having to bother.

Agreed - but magic doesn't create itself.


It would seem the best place to do this would be to change
the initial layer on Win32 to a new layer (say :bomcrlf).
This layer would get popped on binmode() - fixing above.
It would look at the first few bytes it got from the OS and then, if they
were a BOM, push an encoding() layer beneath itself and mutate into
a :crlf layer with the UTF8 flag set.
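
The stack such a layer would leave behind (an encoding layer underneath,
CRLF translation on top, UTF8 flag set) can be approximated from user code
by naming the layers explicitly once the encoding is known. The following
is only a sketch of that end state, not of the proposed layer itself. The
helpers layers_for_bom and read_lines are hypothetical, and the trailing
:utf8 is there on the assumption that the top layer may need to be marked
as character data explicitly.

```perl
use strict;
use warnings;

# Hypothetical helper: the layer string corresponding to the stack that the
# proposed :bomcrlf layer would end up as. $enc is an encoding name such as
# 'UTF-16LE', or undef when no BOM was seen.
sub layers_for_bom {
    my ($enc) = @_;
    return ':crlf' unless defined $enc;          # no BOM: ordinary text mode
    # Decode first, translate CRLF on the decoded characters, and mark the
    # top layer as carrying characters (assumption: this may be redundant).
    return ":raw:encoding($enc):crlf:utf8";
}

# Hypothetical helper: read all lines of a file whose encoding is known.
sub read_lines {
    my ($path, $enc) = @_;
    open my $fh, '<' . layers_for_bom($enc), $path
        or die "Cannot open $path: $!";
    my @lines = <$fh>;
    close $fh;
    # The explicit-endian encodings do not consume the BOM, so drop the
    # decoded BOM character from the first line if it is present.
    $lines[0] =~ s/^\x{FEFF}// if @lines;
    return @lines;
}
```

Putting :crlf above the encoding layer matters for UTF-16: the CR and LF
are two-byte code units in the file, so the translation has to happen on
decoded characters rather than on raw bytes.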

Yes, that sounds like a good way that would ensure compatibility and
still give me what I want. When is Santa coming to town? :-)

Implied timescale sounds viable ;-)


However, that does not really help when the Perl script itself is in
UTF-16 or UTF-8.

Yes it does - I _think_ one or more of 

perl -MWin32BOM UTF-16_script

or 

set PERL5OPT=-MWin32BOM

or 

set PERLIO=bomcrlf
(with magical autoload) 

could be made to work.

If it happens in core-perl it can certainly work. 


Anyway, thanks for all the replies. This is not really a big deal for
me at the moment. I was just puzzled by the results of my tests. Since
I am working with a module that will support Unicode data, I'm a little
nervous that I will get questions from users about the topic.
