perl-unicode

character widths [Was: format ellipses(...) bug]

1998-09-30 10:32:43
maeda(_at_)src(_dot_)ricoh(_dot_)co(_dot_)jp writes:
: P.S. How does Unicode impact formats, Larry?  Jperl
: (Japanized version of perl) assumes that a CJK character takes
: two bytes and occupies two columns.  But there ARE characters
: which take two bytes and only require one column.  Jperl can't
: handle these *odd* ones properly :<

At the moment, 5.005_52 presumes that all characters are of width 1.
This is probably wrong.  I suppose it wouldn't be too hard to throw a
table out into lib/unicode that distinguishes narrow and wide
characters.  It seems like just another boolean character property.
Making formats (and presumably printf) understand that wouldn't seem
too difficult, though it might slow down width calculations somewhat,
and should maybe be optional, if there's any demand for the current
semantics.  Is anyone going to want all characters to be treated as the
same width?  (Note that column width has nothing to do with length()
calculations in Perl, where a Unicode character is always 1 long (under
utf8, that is).)
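
To make the length()-versus-columns distinction concrete, here is a
minimal sketch in present-day Perl (not 5.005_52). It assumes the
East_Asian_Width property is available through \p{Ea=...}, which is
true of modern perls. The helper names column_width and pad_columns
are hypothetical and are not part of any existing module. Combining
marks are ignored here.

```perl
use strict;
use warnings;

# Hypothetical helper: 2 columns for East Asian Wide/Fullwidth characters,
# 1 column for everything else (combining marks are ignored in this sketch).
sub column_width {
    my ($str) = @_;
    my $wide = () = $str =~ /[\p{Ea=W}\p{Ea=F}]/g;   # count wide characters
    return length($str) + $wide;
}

# Hypothetical helper: left-justify $str in a field of $cols display columns.
sub pad_columns {
    my ($str, $cols) = @_;
    my $pad = $cols - column_width($str);
    return $str . (' ' x ($pad > 0 ? $pad : 0));
}

my $kanji    = "\x{65E5}\x{672C}";   # two wide characters
my $halfkana = "\x{FF76}\x{FF85}";   # two halfwidth katakana

printf "%d chars, %d columns\n", length($kanji),    column_width($kanji);
printf "%d chars, %d columns\n", length($halfkana), column_width($halfkana);
```

This prints "2 chars, 4 columns" and then "2 chars, 2 columns". The
halfwidth katakana are the sort of *odd* characters mentioned above:
multibyte in some encodings, but one column each. A width-aware format
or printf would pad with something like pad_columns() rather than by
character count.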

Hmm, it occurs to me that we should also be dealing with zero-width
combining characters, so maybe it isn't a boolean table after all.  But
translation to small numbers is no problem in the current lib/unicode
scheme.  That's just what tr/// and case conversions do, after all.
The resulting small number just happens to be interpreted as another
Unicode character in those cases, but the swash mechanism doesn't care
about that.
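
As a rough illustration of the single-table idea (one lookup that
yields 0, 1, or 2), here is a sketch using a sorted range table and a
binary search. This is not how the swash mechanism is implemented. The
table below is a hand-picked, deliberately incomplete sample, and the
names @WIDTH_RANGES, columns_for, and string_columns are hypothetical.

```perl
use strict;
use warnings;

# Illustrative, deliberately incomplete table: [first, last, columns],
# sorted by code point. Anything not listed defaults to 1 column.
my @WIDTH_RANGES = (
    [ 0x0300, 0x036F, 0 ],   # Combining Diacritical Marks
    [ 0x3041, 0x3096, 2 ],   # Hiragana letters
    [ 0x4E00, 0x9FFF, 2 ],   # CJK Unified Ideographs
    [ 0xFF01, 0xFF60, 2 ],   # Fullwidth ASCII variants and punctuation
);

# One lookup per character: binary search over the range table.
sub columns_for {
    my ($cp) = @_;
    my ($lo, $hi) = (0, $#WIDTH_RANGES);
    while ($lo <= $hi) {
        my $mid = int(($lo + $hi) / 2);
        my ($first, $last, $cols) = @{ $WIDTH_RANGES[$mid] };
        if    ($cp < $first) { $hi = $mid - 1 }
        elsif ($cp > $last)  { $lo = $mid + 1 }
        else                 { return $cols }
    }
    return 1;   # default: ordinary narrow character
}

# Sum the per-character column counts for a whole string.
sub string_columns {
    my ($str) = @_;
    my $total = 0;
    $total += columns_for(ord $_) for split //, $str;
    return $total;
}

print string_columns("e\x{0301}"), "\n";          # 'e' + combining acute: 1
print string_columns("\x{65E5}\x{672C}"), "\n";   # two ideographs: 4
```

A real table would be generated from the Unicode data files rather
than typed in by hand.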

Alternately, one could go with the boolean double-width table, and
combine that with info from the existing tables of character classes,
but why do multiple lookups when one will do?
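
For comparison, the two-lookup alternative might look roughly like the
sketch below in present-day Perl: one test against the existing
character classes (the combining marks, \p{Mn} and \p{Me}) and a second
against a boolean double-width property (here East_Asian_Width Wide or
Fullwidth). The function name columns_two_lookups is hypothetical.

```perl
use strict;
use warnings;

# Takes a single character and returns its column count via two tests.
sub columns_two_lookups {
    my ($ch) = @_;
    return 0 if $ch =~ /\A[\p{Mn}\p{Me}]\z/;       # lookup 1: combining mark class
    return 2 if $ch =~ /\A[\p{Ea=W}\p{Ea=F}]\z/;   # lookup 2: double-width
    return 1;                                      # everything else
}

print columns_two_lookups("\x{0301}"), "\n";   # combining acute accent: 0
print columns_two_lookups("\x{65E5}"), "\n";   # CJK ideograph: 2
print columns_two_lookups("a"), "\n";          # plain ASCII: 1
```

It gives the same answers, but each character may be tested twice,
which is the cost the question above is pointing at.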

Larry
