perl-unicode

Re: original_string method

1999-12-20 10:19:33
Matt Sergeant writes:
: ...I _believe_ Larry was talking on p5p about supporting multiple 
: encodings in a single perl file (so you could write your script in utf8, but 
: have a heredoc with Chinese writing embedded in your script - what a 
: nightmare...) but I don't think you need to worry about that.

I wasn't going that far, though it may be possible in practice.  Generally,
if the script is in utf8, all the literals would default to being in utf8
as well.

That being said, 5.6 is going to track both utf8-ness and
OEM-character-set-ness on a per-scalar basis and just Do The Right
Thing at runtime, so if you did something fancy like

    $foo = do {
        use bytes;
        markoem <<"END";
    (Big5 stuff here.)
    END
    };

    print $foo;

then you could conceivably embed Big5 in a utf8 script, and $foo would
automatically be converted lazily to utf8 if STDOUT wants utf8.
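
For comparison, the same round trip can be spelled out by hand. This is a minimal sketch, assuming the Encode module (bundled with perl since 5.8) in place of the proposed markoem(), so the conversion is explicit rather than lazy. The sample octets, written as hex escapes so the file stays ASCII, are assumed to be the Big5 encoding of U+4E2D U+6587.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Assumed sample data: Big5 octets for the two characters U+4E2D U+6587.
my $big5_octets = "\xA4\xA4\xA4\xE5";

# Explicit stand-in for markoem(): state the legacy encoding of the
# octets and get a character string back.
my $foo = decode('big5-eten', $big5_octets);

# Explicit stand-in for "STDOUT wants utf8": encode on the way out and
# write the resulting octets through a raw handle.
binmode(STDOUT, ':raw');
print encode('UTF-8', $foo), "\n";
```

The difference from the scheme described here is that nothing is tagged or deferred: the caller names the source encoding once, and the output encoding is chosen at the point of printing.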

Ordinarily you wouldn't have to use a function like markoem() to mark
scalars explicitly as in the OEM character set, since that would be set
by the input filehandle.  But "use bytes" will probably assume that
any literals are simply binary byte data, and not set the OEM bit.
This makes no difference on machines where byte == character, but
it does make a difference in CJKV languages.

Actually, I don't like the term "OEM" character set, though that seems to
be the standard name on Windows, so that's what Sarathy and I have been
calling it.  I'm open to suggestions for a better name.  It's not
exactly "national" character set either.  Maybe "local" character set,
though that is confusing in a Perl context.

Anyway, at the moment we're planning on allowing only one local
character set along with utf8.  We're not interested in tagging strings
with their character set, beyond the three-way distinction:

    utf8 string
    octet string
    oem string (basically, octets that need extra local conversion)

Even this should be hidden from most scripts, unless you say "use bytes".
Most scripts won't have to do much more than make sure the I/O handles are
marked correctly, if they don't default reasonably.
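
Marking the I/O handles can be sketched with PerlIO encoding layers. That mechanism is an assumption here, not something this post specifies, and the in-memory filehandles and Big5 sample octets are only for illustration. The script body never mentions an encoding; only the two open() calls do.

```perl
use strict;
use warnings;

# Assumed sample data: one line of Big5 octets (U+4E2D U+6587) plus newline.
my $big5_octets = "\xA4\xA4\xA4\xE5\n";

# Input handle marked as Big5: reads return character strings.
open(my $in, '<:encoding(big5-eten)', \$big5_octets) or die "open in: $!";
my $line = <$in>;
close($in);
chomp($line);

# Output handle marked as UTF-8: characters are encoded as they are written.
my $utf8_octets = '';
open(my $out, '>:encoding(UTF-8)', \$utf8_octets) or die "open out: $!";
print {$out} $line;
close($out);

# Show the octets that reached the output handle, in hex.
print join(' ', map { sprintf '%02X', ord } split //, $utf8_octets), "\n";
```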

Sorry for the stream-of-consciousness writing--I'm in a hurry...

Larry
