Re: A question on utf8

Niral Trivedi wrote:


Isn't it the case that, if we have the 'use utf8' pragma in a block and we
call the length function on a variable, it should return the number of
characters and not the number of bytes?


That's true, if perl knows that the variable you're length'ing is utf8.  Last I
heard (early July), if you're reading strings from an external source like a
file, you need to explicitly mark them as utf8.  The way to do that is to use
Simon's UTF8::Hack module (attached), and then you can do:

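use UTF8::Hack;         # assuming it exports utf8on/utf8off, as used below

# $a = some bytes
print length( $a );     # number of bytes
utf8on( $a );           # mark $a as utf8
print length( $a );     # number of characters
utf8off( $a );          # clear the utf8 mark
print length( $a );     # back to bytes again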

There was also some talk about piggy-backing this trick on the taint mechanism,
but I'm not sure what came of that.
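
For what it's worth, here is a rough sketch of the same bytes-versus-characters
distinction without UTF8::Hack. It assumes a later perl (5.8 or newer) where the
Encode module ships in core, so it does not apply to 5.6 as-is, and the helper
name char_length is made up for illustration:

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: count the characters in a string of UTF-8 bytes.
# decode() returns a new character string; the original bytes are untouched.
sub char_length {
    my ($bytes) = @_;
    return length( decode( 'UTF-8', $bytes ) );
}

my $bytes = "caf\xC3\xA9";            # UTF-8 bytes for "caf" plus e-acute
print length($bytes), "\n";           # number of bytes: 5
print char_length($bytes), "\n";      # number of characters: 4
```

Unlike utf8on/utf8off, this leaves $bytes alone and gives you a separate
decoded string, so there is no flag to remember to switch back.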

And a second question: we have perl5.6 installed on our box, as I've said.
So, do we still need to install the Unicode::Map, Unicode::Map8 or
Unicode::String modules from CPAN? What is the difference between
using any of those modules and using the 'use utf8' pragma?


As I understand it, the Unicode::* modules are needed because perl prior to 5.6
didn't have a datatype for (or operations on) multibyte strings; internally it
was all single byte.  As of Perl 5.6, strings are utf8 encoded internally, so you
can use all of the built-in string manipulation facilities on Unicode strings.
So at least for basic string operations, those modules should be obsolete.
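
A small sketch of what that looks like in practice, using \x{...} escapes to
build the strings (the sample strings are just illustrations). I'd expect this
on any reasonably recent perl; 5.6.0 itself may have rough edges here. Code
points above 255 force perl to store the string as utf8 internally, and the
built-ins then work per character:

```perl
use strict;
use warnings;

my $s = "\x{263A}abc";                  # WHITE SMILING FACE followed by "abc"

print length($s), "\n";                 # 4 characters, not 6 bytes
print ord( substr( $s, 0, 1 ) ), "\n";  # 9786, i.e. 0x263A
print ord( scalar reverse $s ), "\n";   # 99, since "c" now comes first

my $alpha = "\x{3B1}";                  # GREEK SMALL LETTER ALPHA
printf "%X\n", ord( uc $alpha );        # 391, GREEK CAPITAL LETTER ALPHA
```

The script only prints numbers, so it avoids "Wide character" warnings on
output; writing the strings themselves to a file is a separate encoding question.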

-- 
Neil Gower - Developer
iNAGO Incorporated
Toronto, Ontario.

UTF8-Hack-0.01.tar.gz
Description: GNU Zip compressed data