Re: converting between utf8 and bytes

I have found out how to create a utf8 string: insert something with a code
> 255 (a BOM should do it) and then strip it off later. Hacky, but works.


Interesting. That shouldn't work at all. Perl has no way of knowing that the
BOM on the start of your data file is a BOM rather than the ordinary characters
255 254. That's for UTF16; for UTF8 the BOM is the bytes 239 187 191. But you
get the point: files are made up of bytes, and bytes can only hold values up
to 255. There's no difference on the disk between the bytes 196, 172 and the
UTF8 character 300. How can Perl say if you want one rather than the other?
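
To make the ambiguity concrete, here is a minimal sketch. It assumes a later perl that ships the Encode module (it was not available in 5.6.0), and the variable names are only for illustration. The same two bytes come out as two characters or as one, depending purely on what the program is told:

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Character 300 (U+012C) encoded as UTF-8 is the two bytes 196, 172.
my $bytes      = encode("UTF-8", chr(300));
my @utf8_bytes = unpack("C*", $bytes);

# The same two bytes, declared to be Latin-1, are two separate characters.
my $as_latin1    = decode("ISO-8859-1", $bytes);
my @latin1_chars = map { ord } split //, $as_latin1;

# Declared to be UTF-8, they are a single character again.
my $as_utf8    = decode("UTF-8", $bytes);
my @utf8_chars = map { ord } split //, $as_utf8;

print "bytes on disk: @utf8_bytes\n";     # 196 172
print "as Latin-1:    @latin1_chars\n";   # 196 172
print "as UTF-8:      @utf8_chars\n";     # 300
```

Nothing in the bytes themselves selects one reading over the other; the choice is in the call to decode().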

#    $str =~ tr///CC;    # This crashes Perl 5.6.0 (ActivePerl)

This has been fixed. Well, it's been eliminated. But that won't do what you
want anyway: it should convert between characters (Latin1) and characters
(Latin1). No Unicode here, sir!

#    use bytes;          # This does nothing

No, this does the exact opposite of what you want: it says "interpret this
string as bytes, not as UTF8". See, for instance:

% perl -le '$a=chr(300); {print ord for split //, $a}'
300
% perl -le '$a=chr(300); {use bytes; print ord for split //, $a}'
196
172

Any suggestions?


In cfgperl - which means *possibly* in 5.6.1 - you'll be able to say:

     $utf8 = pack("U*", unpack("C*",$input));

It's horrible, but it'll work. The problem is as above: Perl has no way
to know whether incoming data is a string of bytes or a
UTF-encoded file. What if the first 30 bytes look like UTF, and
then the next character is malformed? Should Perl have guessed
that it was UTF, or left it alone? When you hit the malformed character,
what should you do? Maybe we should assume that after 30 bytes,
if it looks like UTF so far, it should all be UTF. But why 30 bytes?
Why not 20...? etc. It's *impossible* for Perl to automatically
detect UTF. You have to tell it that incoming data is UTF, and
there's currently no good way to do that.
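
As a sketch of what "telling Perl" can look like, and of one possible answer to the malformed-character question, the following assumes a later perl with the Encode module. The helper name decode_utf8_strict is made up for this example. The caller states the encoding, and malformed input is rejected outright rather than guessed at:

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK LEAVE_SRC);

# Hypothetical helper: decode $octets as UTF-8, or return undef if they
# are not well-formed. The caller states the encoding; nothing is guessed.
sub decode_utf8_strict {
    my ($octets) = @_;
    # FB_CROAK makes decode() die on malformed input; LEAVE_SRC keeps
    # the caller's copy of the bytes untouched.
    my $chars = eval { decode("UTF-8", $octets, FB_CROAK | LEAVE_SRC) };
    return $chars;    # undef when decode() croaked
}

my $good = "caf\xC3\xA9";        # well-formed UTF-8, 5 bytes
my $bad  = "caf\xC3\xA9 \xFF";   # 0xFF can never appear in UTF-8

my $decoded = decode_utf8_strict($good);
my $verdict = defined decode_utf8_strict($bad) ? "accepted" : "rejected";

print "good: ", length($decoded), " characters\n";   # good: 4 characters
print "bad:  $verdict\n";                            # bad:  rejected
```

Rejecting the whole string is only one policy; substituting a replacement character is another, and which is right depends on the application.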

This is a problem I am working on fixing. The solution will probably
be something like this:

     open (FH, ":utf8", $filename) or die $!;
     @array = <FH>; # Data has been checked and has SvUTF8 on here.

I'd appreciate suggestions as to how this should extend to other
information coming from outside Perl: environment variables and so
on. Maybe "use utf8" should mean "any external data gets checked for
valid UTF8-ness and has the SvUTF8 bit set". Would that be sufficiently
intuitive?
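
For comparison, here is a minimal sketch of how this can be written on later perls (5.8 and up), where the layer goes into the mode argument of a three-argument open. The scratch file and the DEMO_NAME environment variable are made up for the example, and the environment value is decoded by hand on the assumption that it holds UTF-8:

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);
use Encode qw(decode);

# Write two raw bytes (196, 172) plus a newline to a scratch file.
my ($out, $filename) = tempfile(UNLINK => 1);
binmode($out);
print {$out} "\xC4\xAC\n";
close($out) or die $!;

# Reading through an explicit layer tells Perl the file is UTF-8.
open(my $fh, "<:encoding(UTF-8)", $filename) or die $!;
my $line = <$fh>;
close($fh);
chomp $line;
print ord($line), "\n";      # 300

# Environment variables still arrive as bytes and are decoded by hand.
# DEMO_NAME is a made-up variable for this example.
$ENV{DEMO_NAME} = "\xC4\xAC";
my $name = decode("UTF-8", $ENV{DEMO_NAME});
print length($name), "\n";   # 1
```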

How do I make a UTF8 string containing codes 127<x<256 without
having to insert a BOM in the front and then strip it off?


I hope the above answers your question. Don't forget that you can always
use tuplets (v-strings) to create UTF8 strings from inside Perl:

     $string = v400.500.600;

Don't think you can then concatenate an incoming string to that and
take off the first three characters: in 5.6.0 that won't work, and
in 5.6.1 that should upgrade the non-UTF8 encoded string.
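
A minimal sketch of the upgrade behaviour, assuming a later perl (5.8 and up) where utf8::upgrade is available as a built-in function. The single byte 0xE9 stands in for incoming data:

```perl
use strict;
use warnings;

my $wide  = v400.500.600;   # three characters, all above 255
my $input = "\xE9";         # one byte, as if read from outside

# The hack under discussion: concatenate, then drop the wide prefix.
# The byte string is upgraded as Latin-1 during the concatenation.
my $hacked = substr($wide . $input, 3);

# The direct route: utf8::upgrade changes the internal representation
# in place without changing which characters the string holds.
my $upgraded = $input;
utf8::upgrade($upgraded);

print ord($hacked), "\n";                          # 233
print ord($upgraded), "\n";                        # 233
print utf8::is_utf8($upgraded) ? "utf8" : "bytes", "\n";   # utf8
```

Note that the upgrade treats the incoming byte as the Latin-1 character 233. It does not decode UTF-8 sequences, so it only fits the 127<x<256 case asked about.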

I have an XS module, which I've been meaning to put onto CPAN, that
just flicks on the SvUTF8 bit on a variable. That'll do what you want
but it's a sick and disgusting hack.
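
For readers on later perls (5.8 and up), a less drastic relative of that hack is the built-in utf8::decode. This is a sketch of that alternative, not the XS module described above. It reinterprets bytes as UTF-8 in place, but only after checking that they are well-formed:

```perl
use strict;
use warnings;

my $octets = "\xC4\xAC";              # two bytes: 196, 172
my $before = length($octets);         # 2

# utf8::decode works in place and returns true only if the bytes were
# well-formed UTF-8.
my $ok    = utf8::decode($octets);
my $after = length($octets);          # 1

my $junk    = "\xC4";                 # truncated sequence
my $junk_ok = utf8::decode($junk);    # false

print "$before -> $after, ord ", ord($octets), "\n";   # 2 -> 1, ord 300
print "junk ", ($junk_ok ? "accepted" : "rejected"), "\n";   # junk rejected
```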

Simon