perl-unicode

C<use utf8> dynamic scope?

1999-05-09 20:55:06

Well, the discussion on C<use utf8> has gone dark for a while, and I am
here to revive it. I was confused about the scoping of the pragma, and in
looking at the current UTF-8 support, I would like to make some suggestions
as I explore how to take advantage of Unicode support that is built into
Windows NT. For example, when opening files, it would make sense that a
UTF-8 string could be handed to open(), and to be consistent with how
strlen() works, open() would know which mode the user wanted to be in and
do_the_right_thing (in this case, convert the UTF-8 to UCS-2 and make a W
API call rather than the A API call). Similarly, when working with
opendir(), readdir(), etc., a user would expect filenames to be returned
encoded as UTF-8 if C<use utf8> is in effect. Following this logic, if a module
is handed a filename encoded as UTF-8 and that module calls open() on that
filename, the module would want to know the encoding that the caller had
been expecting. Given situations like these, I would suggest the following:

1. C<use bytechar> pragma
        This would leave room for a C<use utf16> or other character encoding
        in the future, i.e. what would C<no utf8> mean after C<use utf16>?
        
2. C<use bytechar> is default

3. C<use utf8> is dynamically scoped rather than lexically scoped
        a) char_encoding is a local var
        b) called routines inherit the encoding of the caller
        c) user added functions work the same as built-ins like strlen()
        
        See below for discussion
        
4. &encoding([encoding])
        returns the current encoding, or a boolean indicating whether the
        passed encoding is the current one
        <char_encoding> = encoding;
        <1|0> = encoding(<char_encoding>);
        
        if (encoding(utf8)) ...
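
A minimal sketch of how items 3 and 4 could behave, using an ordinary
package variable and local() to stand in for the pragma. The names
$char_encoding, encoding(), report() and with_utf8() are hypothetical and
exist only for illustration; nothing like this is built into perl.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the proposed char_encoding setting.
our $char_encoding = 'bytechar';    # item 2: bytechar is the default

# Hypothetical encoding(): with no argument, returns the current encoding;
# with an argument, returns 1 or 0 depending on whether it matches.
sub encoding {
    my ($want) = @_;
    return $char_encoding unless defined $want;
    return $want eq $char_encoding ? 1 : 0;
}

# A "user added function" that never sets the encoding itself.
sub report { return encoding() }

sub with_utf8 {
    # local() gives dynamic scope: the new value is seen by every
    # routine called from here and is restored when this sub returns.
    local $char_encoding = 'utf8';
    return report();
}

my $inside  = with_utf8();   # report() inherits the caller's encoding
my $outside = report();      # back to the default

print "inside: $inside, outside: $outside\n";
# prints "inside: utf8, outside: bytechar"
```

The point of the sketch is that report() needs no declaration of its own:
it sees whatever encoding its caller was running under, which is the
behaviour asked for in 3b and 3c.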

Notes
-----
1. Existing scripts will not break. Unless C<use utf8> is added to a
script, nothing new happens.

2. In new scripts that C<use utf8>, modules that depend on bytechar
encoding will break. A module that uses regular expressions to
create/modify GIFs would be an example. There are two solutions to this:

A) calling code can C<use bytechar> in a scope around the call

        {
        use bytechar;
        BinaryStrings::MakeGIF();
        }
        
B) called code can C<use bytechar> at entry to the subroutine

        package BinaryStrings;
        sub MakeGIF {
                use bytechar;
                # code that works with strings as bytes... 
        }
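
For comparison, later perl releases ship a lexically scoped C<use bytes>
pragma. It is not the C<use bytechar> proposed here, but it gives a rough
feel for what wrapping a block in byte semantics looks like. The variable
names below are illustrative only.

```perl
use strict;
use warnings;

my $smiley = "\x{263A}";     # one character, three bytes when held as UTF-8

# Character semantics: length() counts characters.
my $chars = length($smiley);

# Byte semantics, limited to this block because the pragma is lexical.
my $bytes = do { use bytes; length($smiley) };

print "chars=$chars bytes=$bytes\n";
# prints "chars=1 bytes=3"
```

Because C<use bytes> is lexical, a routine called from inside the block
does not inherit it, which is exactly the difference from the dynamic
scoping suggested in item 3.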



Here is some sample code
------------------------

use utf8;
use MyModule;
use BinaryStrings;


# .... other code

$len0 = strlen($foo);

$len1 = strlen_plus_one($foo);

$len2 = MyModule::MyStrlen($foo);

FH = open('

sub strlen_plus_one {
    return (strlen($_[0]) + 1);
}



#
# MyModule.pm
#

package MyModule;


sub MyStrlen {
    # want MyStrlen to have strlen() work appropriately for the passed string
    return (strlen($_[0]));
}

sub GetUserName {
    # want to know what mode we are in so that we can do_the_right_thing
    $uname = Utf8EncodedNameCall();
    return $uname if encoding('utf8');
    return utf8_to_bytechar($uname) if encoding('bytechar');
    # unknown encoding
    return undef;
}    
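
A rough sketch of the conversion step GetUserName needs in the bytechar
case, written with the Encode module that ships with later perl releases.
The helper name utf8_to_bytechar is hypothetical, and treating "bytechar"
as ISO-8859-1 is an assumption made only for this example.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: takes UTF-8 encoded octets and returns the same
# text as single-byte octets (assumed here to be ISO-8859-1).
sub utf8_to_bytechar {
    my ($utf8_octets) = @_;
    my $chars = decode('UTF-8', $utf8_octets);   # octets -> characters
    return encode('ISO-8859-1', $chars);         # characters -> one byte each
}

# "cafe" with an acute accent on the e, given as UTF-8 octets (5 bytes).
my $latin1 = utf8_to_bytechar("caf\xC3\xA9");

print join(' ', map { ord } split //, $latin1), "\n";
# prints "99 97 102 233"
```

The same two-step shape (decode, then encode to the target) would apply
to the UCS-2 conversion mentioned for open(), with a different target
encoding name.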




