Re: converting between utf8 and bytes

       Dear Simon,

       > I have found out how to create a utf8 string: insert
       > something with a code > 255 (a BOM should do it) and then
       > strip it off later. Hacky, but works.

       >Interesting. That shouldn't work at all.

       It works because any constant string containing \x{nnnn} where
       nnnn > 0xFF, within the scope of use utf8;, causes the string
       to be interpreted as utf8. From then on, things propagate. I
       chose the BOM as an arbitrary nnnn with no meaning of its own
       to trigger the behaviour. Mind you, I had to dig around in the
       Perl source to find such a trick.
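
       On a later Perl the same effect can be observed directly. The
       following is a minimal sketch, assuming Perl 5.8 or newer,
       where utf8::is_utf8() is available; it is not the 5.6
       mechanism described above, and the variable names are
       illustrative only.

```perl
use strict;
use warnings;

# Assumption: Perl 5.8+, where utf8::is_utf8() reports the internal flag.
my $s = "caf\xe9";                        # one byte per character, flag off
my $flag_before = utf8::is_utf8($s) ? 1 : 0;

my $t = "\x{FEFF}" . $s;                  # a code point > 255 forces the utf8 form
substr($t, 0, 1, "");                     # strip the marker off again
my $flag_after = utf8::is_utf8($t) ? 1 : 0;

# The characters are unchanged; only the internal representation differs.
my $same = ($t eq $s) ? 1 : 0;
print "before=$flag_before after=$flag_after same=$same\n";
```

       The flag stays on after the marker is removed, which is why
       the hack "works", and also why it is only a side effect rather
       than a way of asking for the conversion.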

       >#    use bytes;          # This does nothing
       >No, this does the exact opposite of what you want: it says
       >"interpret this string as bytes, not as UTF8". See, for
       >instance:

       No, it doesn't. That is exactly what I am trying to get Perl
       to do. But having use utf8 at the start of the file (necessary
       to get extended regexps to compile) seems to mask the use
       bytes, so the reinterpretation doesn't happen. It's horrible.
       Hence my question:

       How do we change the interpretation of a string?

       It is all very well trying to make the type of a string
       implicit information that Perl works out by clever insight.
       But it is also important to give programmers control over
       their data by making the information explicit too. Perl is
       never going to get it right automagically all the time, and
       my case above, wanting to interpret a utf8 string as bytes, is
       an example of that.

       % perl -le '$a=chr(300); {print ord for split //, $a}'
       300
       % perl -le '$a=chr(300); {use bytes; print ord for split //, $a}'
       196
       172

       And what about:

       % perl -le 'use utf8; $a=chr(300); {use bytes; print ord for split //, $a}'
       196
       % perl -le 'use utf8; $a=chr(300); {print ord for split //, $a}'
       196

       Which are decidedly odd!

       >In cfgperl - which means *possibly* in 5.6.1 - you'll be able
       >to say:
       >     $utf8 = pack("U*", unpack("A*",$input));
       >It's horrible, but it'll work.

       This is disgusting. I don't want to have to unpack my string
       and then repack it. That takes forever. All I need is some way
       of interacting with the SvUTF bit. The offered module is the
       right way to go, but it should be in the core and not in a
       module.
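
       For reference, the quoted pack/unpack conversion can be
       sketched as below. The sketch uses unpack("C*") rather than
       the quoted "A*", since "U*" expects a list of code points;
       that substitution is an assumption about what was intended.
       The utf8::upgrade() call is likewise an assumption that a
       Perl 5.8 or newer is in use.

```perl
use strict;
use warnings;

my $input = "na\xefve";                   # 5 characters, flag off

# "C*" yields one number per byte; "U*" packs them as code points.
# Assumption: "C*" is what the quoted "A*" was meant to be.
my $utf8 = pack("U*", unpack("C*", $input));

# Assumption: Perl 5.8+, where utf8::upgrade() converts in place
# without the round trip through a list.
my $copy = $input;
utf8::upgrade($copy);

printf "packed flag=%d, upgraded flag=%d, equal=%d\n",
    utf8::is_utf8($utf8) ? 1 : 0,
    utf8::is_utf8($copy) ? 1 : 0,
    ($utf8 eq $input && $copy eq $input) ? 1 : 0;
```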

       There is talk of an is_utf8() function. If we make this
       assignable, we can then have the 3 commands we need:

       if (is_utf8($str)){}          # what is it
       is_utf8($str) = 0;       # make it bytes
       is_utf8($str) = 1;       # make it utf8
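
       An assignable is_utf8() is only a proposal here. As a rough
       sketch of the same three operations, assuming a Perl 5.8 or
       newer with the Encode module: utf8::is_utf8() reads the flag,
       and Encode::_utf8_off() / Encode::_utf8_on() clear and set it
       without touching the data. Encode documents the latter two as
       internal functions, so this is an illustration rather than a
       recommendation.

```perl
use strict;
use warnings;
use Encode ();

my $str = chr(300);                       # stored internally as bytes C4 AC

my $was_utf8 = utf8::is_utf8($str) ? 1 : 0;      # "what is it"

Encode::_utf8_off($str);                  # "make it bytes": flag cleared
my @as_bytes = map { ord } split //, $str;

Encode::_utf8_on($str);                   # "make it utf8": flag set again
my @as_chars = map { ord } split //, $str;

print "was_utf8=$was_utf8 bytes=@as_bytes chars=@as_chars\n";
```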

       >The problem is as above: Perl isn't to know whether incoming
       >data is a string of bytes rather than a UTF-encoded file. What
       >if the first 30 bytes looks like UTF, and then the next
       >character is malformed? Should Perl have guessed that it was
       >UTF, or left it alone? When you hit the malformed character,
       >what should you do? Maybe we should assume that after 30 bytes,
       >if it looks like UTF so far, it should all be UTF. But why 30
       >bytes? Why not 20...? etc. It's *impossible* for Perl to
       >automatically detect UTF. You have to tell it that incoming
       >data is UTF, and there's currently no good way to do that.

       Correct. Hence the need for explicit control, with Perl making
       a good guess for DWIMNWIS.
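
       One way to make the "tell it explicitly" step concrete is to
       decode strictly and treat malformed input as an error instead
       of guessing. This is a sketch only, assuming a Perl 5.8 or
       newer with the Encode module; decode_if_valid is a
       hypothetical helper name.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: returns the decoded string, or undef if the
# bytes are not well-formed UTF-8.
sub decode_if_valid {
    my ($bytes) = @_;
    my $copy = $bytes;                    # decode() may modify its argument
    return eval { Encode::decode("UTF-8", $copy, Encode::FB_CROAK) };
}

my $good = decode_if_valid("\xc4\xac");           # well-formed
my $bad  = decode_if_valid("\xc4\xac\xff");       # malformed tail

print defined $good ? "good: " . ord($good) . "\n" : "good: rejected\n";
print defined $bad  ? "bad: decoded\n"            : "bad: rejected\n";
```

       The caller decides what a rejection means; nothing is inferred
       from the first N bytes.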

       >This is a problem I am working on fixing. The solution will
       >probably be something like this:
       >     open (FH, ":utf8", $filename) or die $!;
       >     @array = <FH>; # Data has been checked and has SvUTF on here.
       >I'd appreciate suggestions as to how this should extend to
       >other information coming from outside Perl: environment
       >variables and so on. Maybe "use utf8" should mean "any external
       >data gets checked for valid UTF8-ness and has the SvUTF bit
       >set". Would that be sufficiently intuitive?

       Perhaps explicit control over the SvUTFness of a string would
       solve your problems. The open() you have is an excellent way to
       go for DWIM, and we need :utf16, :utf16le and :utf16se as well.
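
       The quoted open() syntax is a proposal. As a hedged sketch of
       the same idea, assuming a Perl 5.8 or newer with I/O layers
       and the three-argument open, the following writes raw bytes to
       a scratch file and reads them back as characters. MY_NAME is a
       hypothetical environment variable, shown only to illustrate
       that other outside data would need an explicit decode.

```perl
use strict;
use warnings;
use Encode ();
use File::Temp qw(tempfile);

# Write two raw bytes (the UTF-8 encoding of U+012C) to a scratch file.
my ($out, $filename) = tempfile(UNLINK => 1);
binmode($out);
print {$out} "\xc4\xac\n";
close($out) or die "close: $!";

# Assumption: Perl 5.8+ three-argument open with an encoding layer.
open(my $fh, "<:encoding(UTF-8)", $filename) or die "open: $!";
my @array = <$fh>;
close($fh);

chomp(my $line = $array[0]);
printf "length=%d ord=%d flag=%d\n",
    length($line), ord($line), utf8::is_utf8($line) ? 1 : 0;

# Environment variables arrive as bytes and are decoded by hand.
# MY_NAME is a hypothetical variable name.
my $name = defined $ENV{MY_NAME}
    ? Encode::decode("UTF-8", $ENV{MY_NAME})
    : undef;
```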

       > How do I make a UTF8 string containing codes 127<x<256
       > without having to insert a BOM in the front and then strip
       > it off?

       >I hope the above answers your question. Don't forget that you
       >can always use tuplets to create UTF8 strings from inside Perl:
       >     $string = v400.500.600;
       >Don't think you can then concatenate an incoming string to that
       >and take off the first three characters: in 5.6.0 that won't
       >work, and in 5.6.1 that should upgrade the non-UTF8 encoded
       >string.

       Yes. This is a horrible hack too.
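
       For anyone unfamiliar with the v400.500.600 notation, a small
       sketch of what it produces, assuming a Perl that still accepts
       v-string literals; the variable names are illustrative.

```perl
use strict;
use warnings;

my $string = v400.500.600;                # a three-character string

# Each dotted number becomes one character with that code point.
my @codes = map { ord } split //, $string;

# It is the same string as joining chr() of each number.
my $same = ($string eq join("", map { chr } 400, 500, 600)) ? 1 : 0;

print "codes=@codes same=$same\n";
```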

       I'm worried that if we don't get this right soon, there will be
       a lot of nasty version-specific hacks out there, which will
       damage the quality of Perl's backward support, etc.

       Summary: The effort to make utf8 support implicit is excellent,
       but we also need explicit control for those situations where
       Perl is not going to make it, or isn't there yet.

       Martin Hosken