On Thu, Apr 29, 2004 at 09:23:45PM +0300, Jarkko Hietaniemi wrote:
: Tim Bunce wrote:
:
: > Am I right in thinking that perl's internal utf8 representation
: > represents surrogates as a single (4 byte) code point and not as
: > two separate code points?
:
: Mmmh. Right and wrong... as a single code point, yes, since the real
: UTF-8 doesn't do surrogates which are only a UTF-16 thing. 4 bytes, no,
: 3 bytes.

No, Tim's right--they're four bytes. It's only the individual
surrogates that would come out to three bytes. The break between
three and four bytes is between \x{ffff} and \x{10000}.

Larry
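
The byte counts in this exchange can be checked with a small sketch. It is written in Python rather than Perl and only illustrates the UTF-8 arithmetic, not how perl stores strings internally. The helper names utf8_len and combine_surrogates are made up for this example, and the "surrogatepass" error handler is used only so that Python will agree to encode a lone surrogate at all.

```python
def utf8_len(code_point: int) -> int:
    """Return how many bytes UTF-8 needs for a single code point.

    "surrogatepass" lets Python encode lone surrogates (U+D800..U+DFFF),
    which strict UTF-8 would otherwise reject.
    """
    return len(chr(code_point).encode("utf-8", errors="surrogatepass"))


def combine_surrogates(high: int, low: int) -> int:
    """Combine a UTF-16 surrogate pair into the single code point it denotes."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid high/low surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


if __name__ == "__main__":
    # The break between three and four bytes sits between U+FFFF and U+10000.
    print(utf8_len(0xFFFF))    # 3
    print(utf8_len(0x10000))   # 4

    # An individual surrogate, encoded on its own, comes out to three bytes.
    print(utf8_len(0xD800))    # 3

    # The code point denoted by a surrogate pair is always >= U+10000,
    # so as a single code point it takes four bytes.
    print(utf8_len(combine_surrogates(0xD800, 0xDC00)))  # 4
```

Since every surrogate pair maps to a code point at or above U+10000, the combined character always lands on the four-byte side of the break, while each half taken alone falls in the three-byte range.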