Re: Troubles with UTF-8

Tim Bray wrote:

The problem is that Unicode is stateful with complex and
indefinitely long term states

Has this ever caused a real problem to a real programmer in real life?


Yes, of course. State information preserved between lines is
really annoying.

But you miss the point of my original mail:

: Unicode is not even finite state, which means some pattern
: matching and normalization problems are hard or insolvable.

That is, with Unicode, you cannot search strings in a reasonable
amount of time.
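
As a small illustration of the normalization issue mentioned above, here is a minimal Python sketch. It assumes the standard unicodedata module; the helper name contains_normalized is hypothetical. It shows only that the same text can have two different code point sequences, so a plain substring test can miss a match. It does not settle the complexity question argued in this thread.

```python
import unicodedata

def contains_normalized(haystack: str, needle: str) -> bool:
    """Substring test after converting both strings to NFC (hypothetical helper)."""
    nfc_haystack = unicodedata.normalize("NFC", haystack)
    nfc_needle = unicodedata.normalize("NFC", needle)
    return nfc_needle in nfc_haystack

composed = "caf\u00e9"             # precomposed U+00E9
decomposed = "cafe\u0301 au lait"  # 'e' followed by U+0301 COMBINING ACUTE ACCENT

# A plain substring test misses the match; the normalized test finds it.
print(composed in decomposed)
print(contains_normalized(decomposed, composed))
```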

I have written a whole bunch of mission-critical code that reads and
generates UTF-8, and any correct implementation will have to deal with
the fact that there is no necessary connection between the number of
glyphs on the screen and bytes in its encoding.
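
To make the point about glyphs versus bytes concrete, here is a minimal Python sketch. The function name describe is hypothetical, and counting non-combining code points is only a rough stand-in for "glyphs on the screen"; real grapheme segmentation is more involved.

```python
import unicodedata

def describe(text: str) -> dict:
    """Count one string three ways (hypothetical helper)."""
    encoded = text.encode("utf-8")
    # Code points with canonical combining class 0 are counted as base
    # characters. This only approximates what a user sees as one glyph.
    base = sum(1 for ch in text if unicodedata.combining(ch) == 0)
    return {
        "bytes": len(encoded),
        "code_points": len(text),
        "base_characters": base,
    }

# 'e' followed by U+0301 COMBINING ACUTE ACCENT: one glyph on screen,
# two code points, three bytes in UTF-8.
print(describe("e\u0301"))
```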


You completely miss the point. It has nothing to do with the long
term state.

It would be perfectly
reasonable for an implementation to declare a limitation, for example
that it will not process more than 32 trailing modifiers on any character,
and this would not cause problems in production because sequences of
such a length do not occur in the encoding of any known text.
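
A declared limitation of this kind could be checked roughly as follows. This is a sketch under assumptions: "trailing modifiers" is read here as code points in the Unicode general category M (combining marks), and the names MAX_TRAILING_MARKS and exceeds_mark_limit are hypothetical.

```python
import unicodedata

MAX_TRAILING_MARKS = 32  # the example limit from the paragraph above

def exceeds_mark_limit(text: str, limit: int = MAX_TRAILING_MARKS) -> bool:
    """Return True if any run of consecutive combining marks is longer than limit."""
    run = 0
    for ch in text:
        # General categories Mn, Mc and Me all start with "M".
        if unicodedata.category(ch).startswith("M"):
            run += 1
            if run > limit:
                return True
        else:
            # Any non-mark code point ends the current run.
            run = 0
    return False

print(exceeds_mark_limit("a" + "\u0301" * 32))  # exactly at the limit
print(exceeds_mark_limit("a" + "\u0301" * 33))  # one over the limit
```

An implementation could reject or truncate input for which this check returns True; which of the two is appropriate is a policy choice outside this sketch.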


I said "long term state", which, of course, is not confined to a
character with or without modifiers.

Which is to say, Ohta's statement about statefulness is true, but the
conclusion that this is a "problem" is erroneous. -Tim


Instead, your statement: "I have written a whole bunch of mission-
critical code that reads and generates UTF-8" is untrustworthy.

Of course, it is perfectly reasonable for an implementation to
declare a limitation, for example, that it will not process
non-ASCII characters, which may also be the assumption of your
code.

                                                Masataka Ohta 


