perl-unicode

Keeping byte-wise processing as an option

2004-01-02 14:30:08
Dear Perl Unicode experts,

http://www.perldoc.com/perl5.8.0/pod/perlunicode.html says:

"In future, Perl-level operations will be expected to work with characters rather than bytes."

I very much appreciate all your hard work on the internationalization of Perl.
However, recently I have been working on some things that lead me to think
that the above statement, if taken literally, may be going somewhat too far.

It is in some cases very useful to use Perl for simple byte-oriented
processing. Some examples that I have are:

1) charlint (see http://www.w3.org/International/charlint/). Among other things,
   this checks for various 'not-quite-UTF-8' cases such as overlong encodings.
   Although both input and output are UTF-8, the program works on these
   byte-by-byte.

2) some simple input checking code such as the example at
   http://www.w3.org/International/questions/qa-forms-utf-8.html

3) The following simple script (due to Jonathan Coxhead) that
   removes a BOM at the start of a UTF-8 file:
      #!/usr/bin/perl -pi~ -0777
      # program to remove a leading UTF-8 BOM from a file
      # works both STDIN -> STDOUT and on the spot (with filename as argument)
      s/^\xEF\xBB\xBF//s;

All these were written assuming a simple bytes-in-bytes-out model.
At least the last one fails with Perl 5.8.1 when the PERL_UNICODE
environment variable is set. Jungshik has also reported that
it fails with Perl 5.8.0 in a UTF-8 locale; I have not been
able to confirm this. The first two examples are probably affected
in a similar way, in which case I would need to patch them soon.
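To illustrate the failure as I understand it: once the input has been
decoded into characters, the BOM is the single character U+FEFF, so a
pattern written for the three bytes EF BB BF no longer matches. The
sketch below imitates the decoding with utf8::decode(), which I assume
is available (Perl 5.8.1 or later); the helper name has_byte_bom is
made up for this example.

```perl
use strict;

# Returns 1 if the string starts with the three bytes EF BB BF, else 0.
# (has_byte_bom is a hypothetical helper name, used only for illustration.)
sub has_byte_bom {
    return $_[0] =~ /^\xEF\xBB\xBF/ ? 1 : 0;
}

# What a byte-oriented handle delivers: the BOM as three bytes, then "abc".
my $bytes = "\xEF\xBB\xBFabc";

# What a handle with a UTF-8 layer would deliver: the same data decoded
# into characters. utf8::decode() is used here only to imitate that.
my $chars = $bytes;
utf8::decode($chars);

printf "bytes: length %d, byte pattern %s\n",
    length($bytes), has_byte_bom($bytes) ? "matches" : "does not match";
printf "chars: length %d, byte pattern %s\n",
    length($chars), has_byte_bom($chars) ? "matches" : "does not match";
```

In the decoded case the string is four characters long and only a
pattern such as /^\x{FEFF}/ would find the BOM.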

What I'm looking for is a very simple way to write Perl programs
that work on byte streams. This should not depend on the Perl
version: it should work on very old versions as well as on future
ones.
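One candidate I can think of is an explicit binmode() on every handle
before any I/O. This is only a sketch: I assume that binmode() without a
layer argument resets a handle to raw bytes regardless of PERL_UNICODE
or the locale, and I have not checked this on all versions. The names
strip_utf8_bom and filter_handle are made up for this example.

```perl
#!/usr/bin/perl
use strict;

# Remove a leading UTF-8 BOM (bytes EF BB BF) from a byte string.
sub strip_utf8_bom {
    my ($data) = @_;
    $data =~ s/^\xEF\xBB\xBF//;
    return $data;
}

# Copy everything from $in to $out as raw bytes, dropping a leading BOM.
sub filter_handle {
    my ($in, $out) = @_;
    binmode($in);     # no layer argument: treat the stream as plain bytes
    binmode($out);
    local $/;         # slurp mode: read the whole input in one go
    my $data = <$in>;
    $data = '' unless defined $data;   # empty input
    print {$out} strip_utf8_bom($data);
}

# Run as a filter (STDIN -> STDOUT) when executed directly.
filter_handle(\*STDIN, \*STDOUT) unless caller;
```

Unlike the -pi~ one-liner, this only works as a filter and does not edit
files in place; that would need explicit open() calls, each followed by
binmode().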

Many thanks in advance for your help.        Regards,   Martin.
