perl-unicode

Re: Character (or byte?) escapes under utf8 pragma

2010-03-10 03:38:18
Hi Juerd,

On 08.03.2010 at 16:15, Juerd Waalboer wrote:

> Michael Ludwig wrote on 2010-03-08 15:55 (+0100):
>> Okay. But unless I'm completely misled, you can tell whether a
>> string is supposed to contain characters (<- Encode::decode) or
>> bytes (<- Encode::encode)
>
> The result of decode is a character string.
>
> The result of encode is a byte string.

Thanks for confirming.

> However, apart from looking at the source code and deducing the
> intentions of the programmer, there is no way to tell whether a given
> string is meant as a character or byte string, simply because there is
> no technical representation of this intent in the string or its
> metadata.
>
> Note that "characters" are the general case: a string is made of
> characters. When every character value fits in a single byte, the string
> can be used as a byte string.

And clarifying further.
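
The round trip described above can be sketched as follows. This is only an illustration: the sample text and the variable names ($bytes, $chars, $again) are made up for this note, and only Encode's documented decode and encode functions are used.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical input: the UTF-8 encoding of "Kaese" spelled with a-umlaut,
# written with escapes so the example does not depend on the file encoding.
my $bytes = "K\xC3\xA4se";               # 5 values, each fits in a byte

my $chars = decode('UTF-8', $bytes);     # character string: 4 characters
my $again = encode('UTF-8', $chars);     # byte string again: 5 bytes

printf "bytes: %d, characters: %d, re-encoded: %d\n",
    length $bytes, length $chars, length $again;

# Nothing in $bytes itself records that it was meant as bytes: to Perl it is
# simply a string of five characters whose values happen to be below 256.
```

Only the programmer knows that $bytes holds encoded data and $chars holds text; the strings themselves carry no such label.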

> This bug forces us to look at the internal encoding and flags to come to
> the conclusion that it is indeed a bug. Don't mistake this as a sign
> that looking at the internal encoding or flags should ever happen in
> actual code. Even if you work around the bug, make sure that you don't
> make anything conditional on the current internal format of the string.
>
> Instead, coerce it to whatever you need by using utf8::downgrade or
> utf8::upgrade. In your specific case, concatenation of two separate
> parts is probably the most sane thing to do.

Good.
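
A minimal sketch of the coercion advice, assuming nothing beyond the built-in utf8::upgrade and utf8::downgrade functions; the variable names are invented for illustration. Neither call changes what the string contains as far as Perl code is concerned, only how it is stored.

```perl
use strict;
use warnings;

my $s = "\xa0";                 # one character, U+00A0
my $t = $s;
utf8::upgrade($t);              # change the internal storage of the copy only

# Both copies are still the same one-character string.
my $same = ($s eq $t) && length($s) == 1 && length($t) == 1;
print "upgrade left the string unchanged\n" if $same;

# Coerce back before handing the string to byte-oriented code.
# The second argument asks for a false return value instead of an exception.
my $ok = utf8::downgrade($t, 1);
print "downgrade succeeded\n" if $ok;

# A string holding a character above U+00FF cannot be used as bytes at all.
my $wide = "\x{263A}";
my $ok_wide = utf8::downgrade($wide, 1);
print "downgrade refused for a wide character\n" unless $ok_wide;
```

The code never asks which internal format a string currently has; it states the format it needs and lets Perl convert.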

>> Am I mistaken in my expectation that while "\xa0" should be
>> a byte, "\x{a0}" and "\x{00a0}" should be characters?
>
> Yes. These three escapes are supposed to be exactly the same. They
> create a U+00A0 character, which happens to be perfectly usable as the
> A0 byte when used as such, in a string that doesn't contain any
> character greater than U+00FF.
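
The equivalence of the three escapes can be checked with a short sketch like the one below (variable names are illustrative only). Each escape is placed in its own literal, so the example does not touch the mixed-literal case this thread is about.

```perl
use strict;
use warnings;

my @forms = ("\xa0", "\x{a0}", "\x{00a0}");

# Each form should be a single character with code point 0xA0,
# and all three should compare equal.
my $all_equal = 1;
for my $f (@forms) {
    $all_equal = 0
        unless $f eq $forms[0] && length($f) == 1 && ord($f) == 0xA0;
}
print $all_equal ? "all three escapes give the same string\n"
                 : "the escapes differ\n";

# Joined with a character above U+00FF, the first character is still U+00A0;
# the combined string is simply no longer usable as a byte string.
my $mixed = "\xa0" . "\x{263A}";
my @cps   = map { ord } split //, $mixed;
printf "U+%04X U+%04X\n", @cps;
```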

Okay. Let me try to see if I have understood correctly. Without the utf8
pragma in scope, "so\xa0ein\xa0Käse" with a-Umlaut stored as a sequence
of two bytes in my source code will be stored internally as a sequence
of 12 integers. With the utf8 pragma in scope, only 11 integers.
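
The two counts can be checked with a sketch like this one. It assumes the script itself is saved as UTF-8, so that the a-umlaut occupies two bytes on disk, and it builds the string by concatenating two literals, as suggested earlier in the thread, so that the escapes and the literal non-ASCII character never share one literal. The variable names are made up for illustration.

```perl
use strict;
use warnings;

# Assumption: this file is saved as UTF-8.
my ($len_without, $len_with);

{
    # No utf8 pragma in this block: every source byte of the literal
    # becomes one character, so the a-umlaut counts twice.
    my $s = "so\xa0ein\xa0" . "Käse";
    $len_without = length $s;
}

{
    use utf8;   # lexically scoped: literals in this block are decoded as UTF-8
    my $s = "so\xa0ein\xa0" . "Käse";
    $len_with = length $s;
}

print "without utf8: $len_without\n";
print "with utf8:    $len_with\n";
```

Under the stated assumption this should report 12 and 11, matching the counts above.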

I know I shouldn't care about the internals, but sometimes grokking the
internals is helpful as an aide-mémoire, because it puts things into
perspective that otherwise seem more arbitrary.

-- 
Michael.Ludwig (#) XING.com
