Well, the discussion on C<use utf8> has gone dark for a while, and I am
here to revive it. I was confused about the scoping of the pragma, and
after looking at the current UTF-8 support I would like to make some
suggestions as I explore how to take advantage of the Unicode support that
is built into Windows NT.

For example, when opening files, it would make sense that a UTF-8 string
could be handed to open(). To be consistent with how strlen() works, open()
would know which mode the user wanted to be in and do_the_right_thing (in
this case, convert the UTF-8 to UCS-2 and make the W API call rather than
the A API call). Similarly, when working with opendir(), readdir(), etc., a
user would expect filenames to be returned encoded as UTF-8 under
C<use utf8>. Following this logic, if a module is handed a filename encoded
as UTF-8 and that module calls open() on that filename, the module would
want to know which encoding the caller had been expecting. Given situations
like these, I would suggest the following:
1. C<use bytechar> pragma
   This would enable a C<use utf16> or other character encoding in the
   future, i.e. what would C<no utf8> mean after C<use utf16>?
2. C<use bytechar> is the default
3. C<use utf8> is dynamically scoped rather than lexically scoped
   a) char_encoding is a local var
   b) called routines inherit the encoding of the caller
   c) user-added functions work the same as built-ins like strlen()
   See below for discussion.
4. &encoding([encoding])
   Returns the current encoding, or a boolean saying whether the passed
   encoding is the current one:
     <char_encoding> = encoding;
     <1|0> = encoding(<char_encoding>);
     if (encoding(utf8)) ...
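
To make points 3 and 4 concrete, here is a rough sketch of how the behaviour could be emulated in today's Perl. It is only an illustration, not the proposed implementation: the package variable $CHAR_ENCODING and the helper report() are names I made up, and C<local> stands in for what the pragma itself would do.

```perl
use strict;
use warnings;

# Assumption: a package variable stands in for the proposed char_encoding.
our $CHAR_ENCODING = 'bytechar';    # point 2: bytechar is the default

# With no argument, return the current encoding; with one, test against it.
sub encoding {
    my ($want) = @_;
    return $CHAR_ENCODING unless defined $want;
    return $CHAR_ENCODING eq $want ? 1 : 0;
}

# Hypothetical called routine: it sets nothing itself and simply sees
# whatever encoding its caller is running under (point 3b).
sub report { return encoding() }

my $outside = report();
my $inside;
{
    local $CHAR_ENCODING = 'utf8';  # dynamic scope, restored at block exit
    $inside = report();
}
my $after = report();

print "$outside $inside $after\n";  # prints "bytechar utf8 bytechar"
```

The point of using C<local> here is that the setting follows the call chain rather than the source text, which is exactly the difference between dynamic and lexical scoping.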
Notes
-----
1. Existing scripts will not break. Unless C<use utf8> is added to a
   script, nothing new happens.
2. In new scripts that C<use utf8>, modules that depend on bytechar
   encoding will break. A module that uses regular expressions to
   create/modify GIFs would be an example. There are two solutions to this:
   A) The calling code can C<use bytechar> in a scope around the call:
        {
            use bytechar;
            BinaryStrings::MakeGIF();
        }
   B) The called code can C<use bytechar> at entry to the subroutine:
        package BinaryStrings;
        sub MakeGIF {
            use bytechar;
            # code that works with strings as bytes...
        }
Here is some sample code
------------------------
use utf8;
use MyModule;
use BinaryStrings;
# .... other code
$len0 = strlen($foo);
$len1 = strlen_plus_one($foo);
$len2 = MyModule::MyStrlen($foo);
FH = open('
sub strlen_plus_one {
    return (strlen($_[0]) + 1);
}
#
# MyModule.pm
#
package MyModule;
sub MyStrlen {
    # want MyStrlen to have strlen() work appropriately for the passed string
    return (strlen($_[0]));
}
sub GetUserName {
    # want to know what mode we are in so that we can do_the_right_thing
    my $uname = Utf8EncodedNameCall();
    return $uname if encoding(utf8);
    # utf8_to_bytechar() is a hypothetical conversion routine
    return utf8_to_bytechar($uname) if encoding(bytechar);
    # unknown encoding
    return undef;
}
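
For anyone who wants to play with the idea, the MyStrlen case above can be approximated in current Perl. This is a sketch under my own assumptions, not the proposal itself: $CHAR_ENCODING is an invented stand-in for the pragma state, and strlen() is written here as an ordinary sub that counts either bytes or decoded characters.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Assumption: a package variable stands in for the proposed pragma state.
our $CHAR_ENCODING = 'bytechar';

# Stand-in for the proposed strlen(): bytes or characters, by current mode.
sub strlen {
    my ($bytes) = @_;
    return length($bytes) if $CHAR_ENCODING eq 'bytechar';
    return length(decode('UTF-8', $bytes));   # count decoded characters
}

package MyModule;
# The module declares no encoding of its own; it inherits the caller's.
sub MyStrlen { return main::strlen($_[0]) }

package main;
my $foo = "na\xC3\xAFve";           # "naive" with i-diaeresis, as UTF-8 bytes

my $as_bytes = MyModule::MyStrlen($foo);
my $as_chars = do { local $CHAR_ENCODING = 'utf8'; MyModule::MyStrlen($foo) };

print "$as_bytes $as_chars\n";      # prints "6 5"
```

The same module call gives two answers depending only on the mode the caller was in, which is the behaviour I am asking for from user-added functions.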