perl-unicode

Re: [EXPERIMENTAL] 1st draft of Encode

2000-09-12 02:15:12
Jarkko Hietaniemi <jhi(_at_)iki(_dot_)fi> writes:

I would like to see these convert perl strings to bytes:

  to_utf8

And these convert a sequence of bytes to perl strings:

  from_utf8

You seem to want to define these functions the opposite way.  Perhaps
the names are just too confusing.

Even on second reading I do not follow your naming logic.  Sorry, I
must be slow today.

Perhaps I just have a screwed mind :-)

First of all, what do you mean by "sequence of bytes"?

A perl string where ord() of every char < 256.

As opposed to "perl strings"?

No restriction.  ord() has the same range as ints (at least 0 .. 2**32-1).

That difference makes little sense at Perl level,
where the user only has "perl strings".

True.

If I have a perl string containing this char:

  "\xA9"

and then perform the to_utf8() function on it I expect to get a string
containing the UTF8 representation of that char, i.e. two chars (or
bytes if you want):

  "\xC2\xA9"

If I say to_utf8() once more on it I expect to get a string containing
4 chars:

  "\xC3\x82\xC2\xA9"

and I expect from_utf8() to go the other way.
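
A minimal sketch of the behaviour described above.  The names to_utf8()
and from_utf8() come from this thread; implementing them as thin wrappers
around encode_utf8()/decode_utf8() from the Encode module as it was later
released is an assumption made only for illustration, not the draft API
under discussion:

```perl
use strict;
use warnings;
use Encode qw(encode_utf8 decode_utf8);

# Hypothetical wrappers: names from the thread, bodies assumed.
sub to_utf8   { return encode_utf8($_[0]) }   # characters -> UTF-8 bytes
sub from_utf8 { return decode_utf8($_[0]) }   # UTF-8 bytes -> characters

# Render a string as space-separated hex code points for inspection.
sub hex_dump { return join " ", map { sprintf "%02X", ord } split //, $_[0] }

my $char  = "\xA9";            # one char
my $once  = to_utf8($char);    # two chars (bytes)
my $twice = to_utf8($once);    # four chars (bytes)

print hex_dump($char),  "\n";
print hex_dump($once),  "\n";
print hex_dump($twice), "\n";

# Each from_utf8() call undoes exactly one to_utf8() call.
print "round trip ok\n"
    if from_utf8($twice) eq $once and from_utf8($once) eq $char;
```

The point of the sketch is that both functions act on the characters of
the string, whatever the internal representation happens to be.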

Your to_utf8() seems to be named after "turn-on-the-utf8-flag", which I
think of as just an internal implementation detail.  What if we change
the internal representation to be always UTF-32?  Do you want to
rename your functions then?

What if we decide to remove the UTF8 flag and just use UTF8 as the
way strings are always represented?

To me the UTF8 flag is just a trick to improve the performance of
dealing with binary data (as we don't have to convert in/out of UTF8
all the time).  There should never be a semantic difference between a
string just because it is up/downgraded to/from UTF8 representation
internally.  Unfortunately, since we are missing line disciplines, the
internal representation is exposed on printing currently.
That is a bug.
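
A small sketch of the "no semantic difference" point, using the
utf8::upgrade() and utf8::is_utf8() functions found in later perls (their
use here is an illustration, not something proposed in this thread).
Upgrading changes only the internal representation; the string still
compares equal and has the same length:

```perl
use strict;
use warnings;

my $plain    = "\xA9";     # stored internally as a single byte
my $upgraded = $plain;
utf8::upgrade($upgraded);  # same character, now stored as UTF-8 internally

# The internal flag differs between the two copies...
my $flag_differs = (utf8::is_utf8($plain) ? 1 : 0) != (utf8::is_utf8($upgraded) ? 1 : 0);

# ...but nothing visible at the Perl level should.
my $same_value  = $plain eq $upgraded;
my $same_length = length($plain) == length($upgraded);
my $same_ord    = ord($plain) == ord($upgraded);

print "flag differs: ", ($flag_differs ? "yes" : "no"), "\n";
print "same value:   ", ($same_value   ? "yes" : "no"), "\n";
print "same length:  ", ($same_length  ? "yes" : "no"), "\n";
```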

Regards,
Gisle