Re: Authority For Western Line Breaking Rules


I pointed this thread out to Asmus Freytag, who is an author of UAX#14.
What follows is his reply, which I have copied here with his
permission.

Mr. Eliot:
In thinking about it, I think that the Annex 14 rules are stated in such 
a way that the rules are appropriate for languages that do not use space 
to determine line breaks without explicitly disallowing Western-style 
line breaking behavior.


Kobayashi:

The question is whether UAX#14 is appropriate for Western languages or not.


(Freytag's reply runs from here to the end of the body of this mail:)

The answer to that is YES. The whole idea about UAX#14 is to have a single 
default algorithm that does well in a Western (space based) and East Asian 
environment, by giving special treatment to characters that are of concern 
in both environments.

The results should be usable in standard text handling, perhaps with minor 
tailoring as suggested in the document.

High-end publishing systems may need to apply some additional tailoring.
These systems often give users a choice of line-breaking rules. There may 
be some languages that require tailoring in specific situations.

In the message you pointed me to, the following statements were made:

For background, Annex 14 is very permissive, implicitly allowing line 
breaks wherever they are not explicitly disallowed and does not, for 
example, disallow breaks following closing punctuation, allowing for 
example, this break:


"e.
g., a thing"

That is, Annex 14 allows this break, even though it would be wrong in any 
Western language I'm familiar with.


However, the statement is incorrect. UAX#14 allows breaks after closing 
punctuation, but not when that punctuation is immediately followed by 
alphabetic characters.

There are no breaks in "e.g.", but there is a break in "...tailoring.
These...", since there is a space after the ".".
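As a rough illustration of the two cases above, here is a minimal Python
sketch. It is not the UAX#14 algorithm, which works on line-break classes
and a much longer rule list; it assumes a toy rule in which a break
opportunity exists only directly after a space. The function names
break_opportunities and segments are hypothetical.

```python
def break_opportunities(text):
    """Return indices i where a line may break before text[i].

    Toy model (an assumption, not the UAX#14 rule set): a break is
    allowed only after a space, and only if the next character is
    not itself a space.
    """
    return [
        i
        for i in range(1, len(text))
        if text[i - 1] == " " and text[i] != " "
    ]


def segments(text):
    """Split text into the unbreakable pieces between opportunities."""
    cuts = [0] + break_opportunities(text) + [len(text)]
    return [text[a:b] for a, b in zip(cuts, cuts[1:])]


# "e.g.," stays in one piece: no space follows either full stop.
print(segments("e.g., a thing"))
# The full stop here is followed by a space, so a break is possible.
print(segments("tailoring. These"))
```

Under this toy rule "e.g.," is never split, while "tailoring. These" can
break after the space that follows the full stop.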

Annex 14 is also informative--it does not require conforming Unicode 
implementations to implement the Annex 14 rules except for those 
characters that have normative line breaking properties, such as line 
separator and soft hyphen.


This statement is correct. The rules in UAX#14 define what I would like to 
call for the purpose of this discussion a 'best default practice with 
normative nucleus'.

Some of the rules (and the properties they are based on) describe behavior 
that is required. Usually, this is limited to special behaviors, such as 
the non-breaking behavior of NO-BREAK SPACE. Without such requirements, 
users would not be able to rely on NO-BREAK SPACE to express the kinds of 
line-break behavior for which it has always been intended.
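To make the NO-BREAK SPACE point concrete, here is a small Python sketch.
It assumes a deliberately simplified breaker (the hypothetical function
split_at_breaks) that offers a break only after U+0020 SPACE; it is not
an implementation of UAX#14. The only thing it is meant to show is that
U+00A0 never yields a break opportunity, whatever else gets tailored.

```python
SPACE = "\u0020"  # ordinary space: a break may follow it
NBSP = "\u00a0"   # NO-BREAK SPACE: must never allow a break


def split_at_breaks(text):
    """Split text wherever this simplified breaker would allow a break.

    Assumption for illustration: a break is possible only after
    SPACE, and only if the next character is not another SPACE.
    NBSP is treated like any other non-space character.
    """
    pieces, start = [], 0
    for i in range(1, len(text)):
        if text[i - 1] == SPACE and text[i] != SPACE:
            pieces.append(text[start:i])
            start = i
    pieces.append(text[start:])
    return pieces


print(split_at_breaks("10" + SPACE + "km"))  # two pieces
print(split_at_breaks("10" + NBSP + "km"))   # one piece
```

With an ordinary space the number and its unit may land on different
lines; with NO-BREAK SPACE they stay together.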

However, many of the other rules are subject to customization (tailoring) 
to fit the requirements of particular languages more precisely, or to match 
the needs of a particular in-house style at a large publisher's. In other 
words, the main reason that those rules are informative is that there is no 
single set of rules for line breaking, often not even a single one for a 
given language.

However, using UAX#14 as the starting point will allow an implementation to 
cover all Unicode characters, so that texts with foreign material inserted 
will behave quite reasonably, without the need for all implementers to 
become experts in *all* languages. In some instances, a small amount of 
tailoring will be useful if texts are known to be predominantly in a given 
language which has special requirements.

--------------Up to here------
Best regards,

Tokushige Kobayashi
Antenna House, Inc.
E-mail koba(_at_)antenna(_dot_)co(_dot_)jp
WWW    http://www.antenna.co.jp/XML/xml-top.htm
WWW    http://www.antennahouse.com/xslformatter.html (English)
TEL    +81-3-3234-1361(direct call)
FAX    +81-3-3221-9975

Antenna House XSL School
http://www.antenna.co.jp/XML/school/xslday.htm



 XSL-List info and archive:  http://www.mulberrytech.com/xsl/xsl-list