Re: Troubles with UTF-8

2005-12-29 18:13:50
"Tom" == Tom Petch <sisyphus(_at_)dial(_dot_)pipex(_dot_)com> writes:

Tom> You've lost me here.  I don't understand the use of state in the
Tom> context of Unicode

Masataka was referring to the fact that the universal character set
contains combining characters and some characters that otherwise alter
how subsequent and/or previous characters are treated.

As an example, the sequence of the two characters:

,----
| U+0061 LATIN SMALL LETTER A
| U+030B COMBINING DOUBLE ACUTE ACCENT
`----

which is encoded in UTF-8 as:

,----
| a̋
`----

has state between the base letter and the accent.  If the a is lost,
the accent will be added to whatever was before the a.
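
A minimal Python sketch of that failure mode, assuming only the
standard unicodedata module.  The leading "x" and the way the base
letter gets dropped are made up for illustration:

```python
import unicodedata

base = "\u0061"    # LATIN SMALL LETTER A
accent = "\u030B"  # COMBINING DOUBLE ACUTE ACCENT
seq = base + accent

# Two code points, three bytes in UTF-8: 0x61 for the a, 0xCC 0x8B for the accent.
encoded = seq.encode("utf-8")
print(encoded)  # b'a\xcc\x8b'

# A non-zero canonical combining class marks the accent as depending on
# the character that precedes it.
print(unicodedata.name(accent), unicodedata.combining(accent))

# Hypothetical damage: the base letter is dropped from "x" + a + accent.
original = "x" + seq
damaged = original.replace(base, "", 1)

# The accent now directly follows the x, so a renderer attaches it there.
print([unicodedata.name(c) for c in damaged])
```

Nothing in the damaged string is invalid UTF-8; the bytes decode
cleanly, which is why the loss cannot be detected at the encoding layer.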

Similarly, U+200E LEFT-TO-RIGHT MARK and U+200F RIGHT-TO-LEFT MARK
affect how anything after them is displayed.  Their existence in the
standard, therefore, makes the standard stateful.
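
As a rough illustration, the properties of the two marks can be
inspected with Python's unicodedata module.  This is only a sketch: it
shows the character properties, not the display algorithm itself, and
the sample strings are made up:

```python
import unicodedata

LRM = "\u200E"  # LEFT-TO-RIGHT MARK
RLM = "\u200F"  # RIGHT-TO-LEFT MARK

for mark in (LRM, RLM):
    print(
        unicodedata.name(mark),
        unicodedata.category(mark),       # "Cf": an invisible format character
        unicodedata.bidirectional(mark),  # "L" or "R": its directional class
        mark.encode("utf-8"),             # three bytes each in UTF-8
    )

# Hypothetical strings: identical visible characters, one with a mark inserted.
plain = "abc!"
marked = "abc" + RLM + "!"

# The mark has no glyph of its own, yet the two strings are different data.
print(plain == marked, len(plain), len(marked))
```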

The combining accent characters can be added to the base characters in
arbitrary number and sequence.  Not all combinations are currently in
use by any written language, of course, but they remain open ended.
You can even have multiple instances of a single combining character
in a sequence of combining characters.  (Consider, e.g., a stack of
accents where there is a circumflex above a dieresis above a
circumflex above the base character.  Probably not used by anyone,
but it /could/ be.)
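
That stack can be built and inspected in a few lines of Python.  This
is a sketch only; the base letter "a" and the helper count_marks are
my own choices for illustration:

```python
import unicodedata

CIRCUMFLEX = "\u0302"  # COMBINING CIRCUMFLEX ACCENT
DIAERESIS = "\u0308"   # COMBINING DIAERESIS

# Base letter followed by circumflex, dieresis, circumflex.
stack = "a" + CIRCUMFLEX + DIAERESIS + CIRCUMFLEX

def count_marks(s):
    """Count code points with a non-zero canonical combining class."""
    return sum(1 for c in s if unicodedata.combining(c) != 0)

print(count_marks(stack))  # 3

# Normalization cannot fold the whole stack into one precomposed
# character, so some combining marks always remain in the result.
nfc = unicodedata.normalize("NFC", stack)
print([unicodedata.name(c) for c in nfc])
```

Since no limit applies to the length of such a sequence, a consumer
cannot know it has seen a complete character until it reads the next
non-combining code point.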

-JimC
-- 
James H. Cloos, Jr. <cloos(_at_)jhcloos(_dot_)com>

_______________________________________________
Ietf mailing list
Ietf(_at_)ietf(_dot_)org
https://www1.ietf.org/mailman/listinfo/ietf