perl-unicode

Re: Conversion-free switching between binary and character strings in Perl

2007-06-01 00:58:32
Markus Kuhn wrote:
Let's say I live in a completely ISO 8859/etc.-free world, that I don't
care about the existence of any other character representation than
UTF-8, and that I am therefore absolutely not interested in any form of
character encoding conversion function.

How can I then switch between a "byte string" and a "character string"
in Perl without ever actually touching the stored bytes of the string?
All I want to change is the UTF-8 flag associated with a string that
tells the regular expression engine, for example, whether /./ matches
just a single byte or an entire UTF-8 character.

Sounds like Encode::_utf8_on() and Encode::_utf8_off() are what you want, although they are documented as "INTERNAL" ("efficient but may change") and obviously involve loading the Encode package...
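
A minimal sketch of what that looks like, assuming a Perl with the Encode module available (it has been in core since 5.8). The sample string and variable names are only for illustration. Note that _utf8_on() does no validity check at all, so it is only safe if the bytes are already well-formed UTF-8:

```perl
use strict;
use warnings;
use Encode ();    # load Encode without importing any of its functions

# Illustrative data: the two bytes of U+00E9 encoded as UTF-8.
my $s = "\xC3\xA9";

# As a byte string, the regex engine sees two one-byte characters.
my $len_as_bytes  = length $s;
my $dot_on_bytes  = ($s =~ /^.$/) ? 1 : 0;

# Turn the flag on: the stored bytes stay the same, but Perl now
# treats them as one UTF-8 encoded character.
Encode::_utf8_on($s);
my $len_as_chars  = length $s;
my $dot_on_chars  = ($s =~ /^.$/) ? 1 : 0;
my $codepoint     = ord $s;

# Turn the flag off again: back to two bytes.
Encode::_utf8_off($s);
my $len_after_off = length $s;

printf "bytes: length=%d  /^.\$/ matches: %d\n", $len_as_bytes, $dot_on_bytes;
printf "chars: length=%d  /^.\$/ matches: %d  ord=U+%04X\n",
    $len_as_chars, $dot_on_chars, $codepoint;
printf "bytes again: length=%d\n", $len_after_off;
```

Writing "use Encode ();" with the empty list at least avoids importing anything into your namespace, although the module itself is still loaded.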



It seems the low-level Perl functions utf8::upgrade(),
utf8::downgrade(), utf8::encode(), and utf8::decode() (see "man 3 utf8")
are not usable, because they interpret and convert any binary string as
if it were an ISO 8859-1 string. I don't want to load any huge encoding
packages such as "use encode 'utf8';" or "use Encoding;", because I
neither need nor want any character encoding conversion functions. All I
want to change is a simple flag. Unfortunately, the documentation is far
from clear on how to do this, and my experimentation leads to strange
results that look like strings going through several ISO 8859-1 to UTF-8
conversion steps (whereas I want zero of these).
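
For comparison, here is a small sketch of how the built-in utf8:: functions behave on the same kind of data, as far as I understand them. The sample bytes and variable names are only illustrative. utf8::upgrade() does read each byte as an ISO 8859-1 character, which would explain the repeated-conversion effect, whereas utf8::decode() and utf8::encode() interpret the bytes as UTF-8, so for well-formed input they leave the byte sequence as it is and change only how Perl counts the characters:

```perl
use strict;
use warnings;

# Illustrative data: the two bytes of U+00E9 encoded as UTF-8.
my $bytes = "\xC3\xA9";

# utf8::upgrade() treats each byte as a Latin-1 character, so the
# result still has two characters (U+00C3 and U+00A9).
my $up = $bytes;
utf8::upgrade($up);
my $up_len = length $up;

# utf8::decode() treats the bytes as UTF-8 and works in place.
# It returns false if the bytes are not well-formed UTF-8.
my $dec     = $bytes;
my $ok      = utf8::decode($dec);
my $dec_len = length $dec;

# utf8::encode() is the reverse: back to a two-byte string.
my $enc = $dec;
utf8::encode($enc);
my $roundtrip = ($enc eq $bytes) ? 1 : 0;

printf "upgrade: length=%d\n", $up_len;
printf "decode:  ok=%d length=%d\n", $ok ? 1 : 0, $dec_len;
printf "encode:  length=%d same bytes as input: %d\n", length $enc, $roundtrip;
```

These functions live in the utf8:: namespace and are available without loading any module, so they may be worth another try if the validity check done by utf8::decode() is acceptable.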

Any help?

Markus

--

Current thread:
  • Re: Conversion-free switching between binary and character strings in Perl, Steve Hay