Tim Bray wrote:
>> That problem is that Unicode is stateful with complex and
>> indefinitely long term states
> Has this ever caused a real problem to a real programmer in real life?
Yes, of course. State information preserved between lines is
really annoying.
But you miss the point of my original mail:
: Unicode is not even finite state, which means some pattern
: matching and normalization problems are hard or insolvable.
That is, with Unicode, you cannot search strings in a reasonable
amount of time.
> I have written a whole bunch of mission-critical code that reads and
> generates UTF-8, and any correct implementation will have to deal with
> the fact that there is no necessary connection between the number of
> glyphs on the screen and bytes in its encoding.
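
To illustrate the quoted point about glyphs and bytes, here is a minimal Python sketch. It is not code from either correspondent; the helper name `utf8_len` and the sample strings are assumptions chosen for the example. Both strings normally render as a single glyph, yet they differ in code point count and in encoded length.

```python
# "e" followed by U+0301 COMBINING ACUTE ACCENT: usually drawn as one glyph
decomposed = "e\u0301"
# U+00E9 LATIN SMALL LETTER E WITH ACUTE: the same glyph, precomposed
precomposed = "\u00e9"


def utf8_len(s: str) -> int:
    """Return the number of bytes in the UTF-8 encoding of s (hypothetical helper)."""
    return len(s.encode("utf-8"))


# Code point count and byte count for each spelling of the same glyph
print(len(decomposed), utf8_len(decomposed))    # 2 code points, 3 bytes
print(len(precomposed), utf8_len(precomposed))  # 1 code point, 2 bytes
```

Neither count tells a program how many glyphs will appear on screen; that depends on how the renderer groups code points.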
You completely miss the point. It has nothing to do with the long
term state.
> It would be perfectly
> reasonable for an implementation to declare a limitation, for example
> that it will not process more than 32 trailing modifiers on any character,
> and this would not cause problems in production because sequences of
> such a length do not occur in the encoding of any known text.
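
The kind of declared limitation described in the quoted paragraph could be sketched as follows. This is only an illustration under stated assumptions: "trailing modifiers" is approximated here by code points whose Unicode general category begins with "M" (combining marks), and the names `MAX_TRAILING_MARKS` and `exceeds_mark_limit` are hypothetical.

```python
import unicodedata

# Limit taken from the quoted example; any real implementation would pick its own.
MAX_TRAILING_MARKS = 32


def exceeds_mark_limit(text: str, limit: int = MAX_TRAILING_MARKS) -> bool:
    """Return True if any run of consecutive combining marks is longer than limit."""
    run = 0
    for ch in text:
        # Assumption: a "modifier" is any code point in category Mn, Mc, or Me.
        if unicodedata.category(ch).startswith("M"):
            run += 1
            if run > limit:
                return True
        else:
            # A non-mark code point starts a new base character; reset the run.
            run = 0
    return False


print(exceeds_mark_limit("e" + "\u0301" * 32))  # within the limit
print(exceeds_mark_limit("e" + "\u0301" * 33))  # over the limit
```

A check like this bounds only the state attached to one base character. It does not address state that spans many characters, which is the distinction drawn in the reply that follows.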
I said "long term state", which, of course, is not confined to a
single character, with or without modifiers.
> Which is to say, Ohta's statement about statefulness is true, but the
> conclusion that this is a "problem" is erroneous. -Tim
Instead, it is your statement, "I have written a whole bunch of
mission-critical code that reads and generates UTF-8", that is
untrustworthy.
Of course, it is perfectly reasonable for an implementation to
declare a limitation, for example, that it will not process
non-ASCII characters, which may also be the assumption of your
code.
Masataka Ohta
_______________________________________________
Ietf mailing list
Ietf(_at_)ietf(_dot_)org
https://www1.ietf.org/mailman/listinfo/ietf