perl-unicode

Re: Am I correct in thinking that the only way to get ord() to return a value over 256 is to send the character as a Unicode string instead of a byte string?

2010-10-29 02:34:28
* Dan Muey <dan(_at_)cpanel(_dot_)net> [2010-10-28 21:55]:
> For example, note the differences in output between a unicode
> string and a byte string regarding character 257, as a unicode
> string it is 257, as a byte string it is 196.

That is not what’s going on.

    $ perl -E'say ord "1234"'
    49

When you pass a multi-character string to `ord`, you get the code
point of the first character.

    $ perl -E'say chr 49'
    1

In your case you get 196. That is 0xC4, or the character Ä. It is
not the character ā (U+0101 = code point 257).

0xC4 is the value of the first byte in the two-byte UTF-8
sequence (0xC4 0x81) that encodes code point 257. You are passing
`ord` a string that holds those two bytes as two separate
characters, and `ord` is giving you the code point of the first
byte-as-character.

You are missing the rest of the bytes from the UTF-8 encoding.

You are losing data.
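
A minimal sketch of the difference, assuming the core Encode
module is available. The variable names are only for illustration
and are not taken from your code.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

my $char  = chr(257);                 # one character: U+0101
my $bytes = encode('UTF-8', $char);   # its UTF-8 encoding as a byte string

# The character string is one character long and ord sees 257.
printf "characters: %d, ord: %d\n", length($char), ord($char);

# The byte string is two characters long; ord only sees the first one.
printf "bytes: %d, first: 0x%X, second: 0x%X\n",
    length($bytes), ord($bytes), ord(substr($bytes, 1, 1));

# Only decoding the whole sequence gets the code point back.
printf "decoded: %d\n", ord(decode('UTF-8', $bytes));
```

This should print 1 and 257 for the character string, 2 with
0xC4 and 0x81 for the byte string, and 257 again after decoding.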

If you try this on more code points you will find that there are
*lots* of different characters that are reported as 196, because
they get encoded as multi-byte sequences that all start with the
byte value 0xC4. That covers every code point from U+0100 through
U+013F, 64 in total.
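
To see the collisions for yourself, something along these lines
should do. It is a sketch that assumes the core Encode module and
checks only the code points that UTF-8 encodes in two bytes
(U+0080 through U+07FF); `@lead_c4` is just an illustrative name.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Keep every code point whose UTF-8 encoding starts with the byte 0xC4,
# i.e. every character for which ord on the byte string reports 196.
my @lead_c4 = grep { ord(encode('UTF-8', chr($_))) == 0xC4 } 0x80 .. 0x7FF;

printf "%d code points, U+%04X through U+%04X\n",
    scalar(@lead_c4), $lead_c4[0], $lead_c4[-1];
```

This prints `64 code points, U+0100 through U+013F`.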

-- 
*AUTOLOAD=*_;sub _{s/::([^:]*)$/print$1,(",$\/"," ")[defined 
wantarray]/e;chop;$_}
&Just->another->Perl->hack;
#Aristotle Pagaltzis // <http://plasmasturm.org/>