[approved] [bugs #12314] linebreak not utf-8 aware


Follow-up Comment #1, bugs #12314 (project mhonarc):

This one is interesting. We have told MHonArc to break lines
once they hit 80 characters. This was for English language users
who don't know how to hit the return key. Our CSS layout isn't
happy when the message body gets too wide.

However, it looks to me that the linebreak is having trouble with
Japanese in two places. First, it doesn't seem smart enough to
understand that a linebreak can occur between any two Japanese
characters (instead it is looking for ASCII space characters).


Checking mhtxtplain.pl, the line-breaking code is charset unaware.
It assumes that any octet matching an ASCII white-space character
is a valid place to break a line.

For most charsets, this is generally not a problem, with the following
exceptions:

* Charsets that are not a superset of ASCII.

* Charsets that have white space characters represented by non-ASCII
 octet equivalents.

* Multi-byte charsets: Well-designed multi-byte charsets normally
 avoid using octets between 0 and 127 within multi-byte sequences to
 be friendly to old C string functions and non-multi-byte-aware
 software.  However, this is not always the case.

 Also, multi-byte characters will throw off the "max-width" line
 breaking: the line-breaking code counts octets, _not_ characters.
 Therefore, multi-byte character text may end up with "short" lines
 after line breaking, since each character uses up several octets of
 the allowed width.
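
To make the three exceptions above concrete, here is a small sketch.
It is written in Python purely for illustration; it is not MHonArc
code, and the helper name octet_break_points is made up for this
example.  It mimics a breaker that only looks for the ASCII space
octet (0x20).

```python
# Illustration only: Python, not the Perl code in mhtxtplain.pl.
# octet_break_points() is a hypothetical helper, not a MHonArc routine.

ASCII_SPACE = 0x20

def octet_break_points(data):
    """Return the offsets a charset-unaware breaker would treat as spaces."""
    return [i for i, octet in enumerate(data) if octet == ASCII_SPACE]

# 1. Charset that is not a superset of ASCII: in UTF-16BE the single
#    character U+2020 (DAGGER) is the two octets 0x20 0x20.
dagger = "\u2020".encode("utf-16-be")
print(octet_break_points(dagger))        # [0, 1]

# 2. White space with no ASCII octet: U+3000 (IDEOGRAPHIC SPACE) in
#    UTF-8 is 0xE3 0x80 0x80, so the octet scan never sees it.
wide_space = "\u3000".encode("utf-8")
print(octet_break_points(wide_space))    # []

# 3. Width counted in octets instead of characters.
sample = "日本語のテキスト"
print(len(sample), len(sample.encode("utf-8")))   # 8 24
```

With a limit of 80 counted in octets, a line of UTF-8 Japanese text
like the sample would be cut after roughly 26 characters.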

Earl, do you have any thoughts?


The maxwidth option has limitations with multi-byte encodings :)

(Note: maxwidth is not applicable for flowed text messages.)

Some code analysis would need to be done to see what the best
approach is for dealing with the issue.  Whether such an analysis
gets done depends on how problematic the maxwidth limitations are.
Also, line breaking is done with flowed text messages, so any
solution would have to be generalized for all line-breaking
operations.

My initial thought would be to do something similar to the
TEXTCLIPFUNC route, i.e., have the ability to customize how line
breaking is done, with the routine given a charset argument so that
it can be charset aware.
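
As a rough sketch of what such a charset-aware routine might look
like (in Python for illustration only; the names charset_aware_break,
find_break and is_wide are hypothetical, and this does not reflect
how TEXTCLIPFUNC or any existing MHonArc routine is actually called):
the routine receives the raw octets plus the charset, decodes, breaks
on character counts, and re-encodes.  It assumes a break is allowed
next to any East Asian wide character, counts characters rather than
display columns, and ignores finer line-breaking rules.

```python
import unicodedata

def is_wide(ch):
    """True for East Asian wide/fullwidth characters (CJK, kana, ...)."""
    return unicodedata.east_asian_width(ch) in ("W", "F")

def find_break(text, max_width):
    """Return (end_of_line, start_of_next) for text longer than max_width."""
    for i in range(max_width, 0, -1):
        if text[i].isspace():
            return i, i + 1          # break at the space and drop it
        if is_wide(text[i - 1]) or is_wide(text[i]):
            return i, i              # break between two characters
    return max_width, max_width      # no opportunity: hard break

def charset_aware_break(line, charset, max_width):
    """Split one line of octets into pieces of at most max_width characters."""
    text = line.decode(charset, errors="replace")
    pieces = []
    while len(text) > max_width:
        end, start = find_break(text, max_width)
        pieces.append(text[:end])
        text = text[start:]
    pieces.append(text)
    return [piece.encode(charset, errors="replace") for piece in pieces]

japanese = ("日本語" * 10).encode("utf-8")     # 30 characters, 90 octets
for piece in charset_aware_break(japanese, "utf-8", 10):
    print(len(piece.decode("utf-8")), len(piece))   # 10 30, three times
```

Because the width is measured after decoding, the pieces come out at
the full 10 characters each instead of being cut short, and a break
never lands inside a multi-byte sequence.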

(Personal side note: TEXTCLIPFUNC may need to support a charset
argument, but I do not know if this will be easy to do.)

Also Earl, do you have a preferred venue for this type of question?
Private email, mhonarc-user, mhonarc-dev, etc?


mhonarc-user or mhonarc-dev is okay.  If technical in nature and
related to how mhonarc code may need to be changed, dev is the
better list.  When in doubt, send to mhonarc-user, and I can
redirect the discussion to mhonarc-dev if the discussion warrants it.

--ewh


    _______________________________________________________

Reply to this item at:

  <http://savannah.nongnu.org/bugs/?func=detailitem&item_id=12314>

_______________________________________________
  Message sent via/by Savannah
  http://savannah.nongnu.org/
