Re: Troubles with UTF-8

From: "Ned Freed" <ned(_dot_)freed(_at_)mrochek(_dot_)com>
To: "TomPetch" <sisyphus(_at_)dial(_dot_)pipex(_dot_)com>
Cc: "ietf" <ietf(_at_)ietf(_dot_)org>
Sent: Friday, December 23, 2005 7:13 PM
Subject: Re: Troubles with UTF-8
<snip>

(Unicode lacks a no-op, a meaningless octet, one that could be added or removed
without causing any change to the meaning of the text).


NBSP is used for this purpose.

Thank you for that; it is not something I have seen documented before.

Other protocols use a terminating sequence.  NUL is widely used in *ix; some
protocols specify that NUL must terminate the text, some specify that it must
not, one at least specifies that embedded NUL means that text after a NUL must
not be displayed (interesting for security).  Since UTF-8 encompasses so much,
there is no natural terminating sequence.


This simply isn't true. NUL is present in Unicode and is commonly used as a
terminator.

Not sure which bit isn't true.  I agree NUL is present in Unicode and agree that
some protocols use it as a terminator and prohibit its use in the text.  But
some allow it in the text in which case another form of termination is needed or
else the NUL must be escaped/encoded.


None of this differs in any material way from the situation with plain
ASCII text. I fail to see why we have to do something with Unicode to deal
with a situation that's existed with ASCII for decades.

Presented with a comparable problem where XML is in use, one WG has chosen to
use an illegal XML sequence as a terminator, so what I was fishing for is
whether there were any parallels with UTF-8, which has many illegal sequences
of octets, so it would be easy to choose one as a terminator.


Using a construct that's syntactically illegal at a higher protocol level
is one thing - I still wouldn't do it, but it is arguably OK. Using a sequence
of octets that's not allowed by the underlying charset, OTOH, is a really
bad idea. For one thing, various agents do perform syntax checks on charset
data, so this is bound to cause major problems. And for another, such sequences
are going to be specific to a particular character encoding scheme, which
will make agents that transcode from, say, UTF-8 to UTF-16 pretty unhappy.

If Unicode data needs to be self-terminated I strongly recommend using
NUL to do it.

                                Ned
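
The two points above (NUL survives transcoding, while an octet sequence that
is illegal in UTF-8 does not) can be illustrated with a minimal Python sketch.
The helper names frame() and transcode_utf8_to_utf16() are hypothetical and
chosen only for this illustration, as is the choice of the octet 0xFF to stand
in for "an illegal sequence used as a terminator". It assumes a strict
decoder, which is one possible agent behaviour, not the only one.

```python
NUL = "\u0000"  # U+0000, a valid character in every Unicode encoding form


def frame(text: str) -> bytes:
    """Append a NUL terminator and encode as UTF-8 (hypothetical helper).

    Assumes a protocol that prohibits NUL inside the text itself.
    """
    if NUL in text:
        raise ValueError("embedded NUL not allowed when NUL is the terminator")
    return (text + NUL).encode("utf-8")


def transcode_utf8_to_utf16(data: bytes) -> bytes:
    """Strictly decode UTF-8 and re-encode as UTF-16LE (hypothetical helper)."""
    return data.decode("utf-8").encode("utf-16-le")


# Case 1: NUL as terminator. It is an ordinary character, so it passes
# through the transcoder and is still the last character afterwards.
utf16 = transcode_utf8_to_utf16(frame("héllo"))
nul_survived = utf16.decode("utf-16-le").endswith(NUL)

# Case 2: an octet that can never occur in well-formed UTF-8 (0xFF) used as
# a terminator. A strict decoder rejects the whole buffer, and there is no
# character to carry over into UTF-16 in any case.
bad = "héllo".encode("utf-8") + b"\xff"
try:
    transcode_utf8_to_utf16(bad)
    rejected = False
except UnicodeDecodeError:
    rejected = True

print("NUL terminator survived transcoding:", nul_survived)
print("0xFF terminator rejected by strict decoder:", rejected)
```

A lenient decoder might substitute U+FFFD for the bad octet instead of
raising an error, but either way the terminator does not arrive intact on the
UTF-16 side.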

