On Sun, 16 Nov 2003 14:54:54 -0800
Randall Gellens <randy(_at_)qualcomm(_dot_)com> wrote:
One thing I'm not totally sure about is the encoding of space and
">". I added a note to the ABNF section that says that these
characters are encoded according to the charset, but as I recall,
during the discussions on how f=f can work with non-Western
languages/charsets, especially Chinese, Japanese and Korean, it was
mentioned that ASCII space is sometimes used in some of these
languages. So perhaps the statement needs to include the possibility
that space might be encoded in ASCII as well?
I guess the question is what to do when there is more than one way to
represent the space character and/or ">", as can certainly happen
when using iso-2022 code-switching.
This is one of the reasons why charsets are defined as a mapping from octets to
characters, not the other way around: It lets us talk about the processing of
various characters without having to worry about whether there is one way to
represent them or fifty. It follows that simply scanning the content for 0x20
and 0x3E is unacceptable; the transformation from octets to characters must be
performed first, then the resulting sequence of *characters* can be checked for
space, greater-than, etc.
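
To make the difference concrete, here is a minimal sketch in Python. It is only an illustration and rests on assumptions of mine: the sample octets, the helper names naive_count and decoded_count, and the use of Python's built-in "iso2022_jp" codec are not taken from the draft. The sample contains one two-octet JIS X 0208 character whose second octet happens to be 0x3E, followed by a real ASCII greater-than.

```python
# Sketch only: the sample data and helper names are hypothetical.
ESC = b"\x1b"

# ESC $ B switches to JIS X 0208 (two octets per character);
# ESC ( B switches back to ASCII.
RAW = ESC + b"$B" + b"\x30\x3e" + ESC + b"(B" + b"> quoted text"


def naive_count(raw: bytes, octet: int) -> int:
    """Count occurrences of an octet value without decoding (the wrong way)."""
    return raw.count(bytes([octet]))


def decoded_count(raw: bytes, charset: str, char: str) -> int:
    """Apply the charset to the octets first, then count characters."""
    text = raw.decode(charset)
    return text.count(char)


naive = naive_count(RAW, 0x3E)                    # also counts the octet inside the kanji
correct = decoded_count(RAW, "iso2022_jp", ">")   # counts only real greater-than characters

print("octet scan:", naive)
print("character scan:", correct)
```

The octet scan reports one more 0x3E than there are greater-than characters, because it cannot tell that the first 0x3E is half of a two-octet character.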
We got this wrong in text/richtext but we fixed it in text/enriched, although
the language in RFC 1896 isn't as clean as I would like. Let's please not
repeat this whole argument yet again. As long as the document makes it clear
we're dealing with the characters that result from the application of the
charset to the sequence of octets we should be good to go.
Ned
P.S. I note in passing that while iso-2022 does allow for very general things,
including the ability to have the same character bound to multiple different
octet values at the same time and for common characters like space and
greater-than to be bound to less than obvious octet values, in practice real
charsets defined in terms of iso-2022 tend to restrict themselves to a very
small subset of iso-2022's capabilities. The result is that while you cannot
assume that, say, 0x3E is always a greater-than, when a greater-than does
appear it will be represented as 0x3E. I haven't found this characteristic to
be particularly
helpful when coding support for this stuff, but it sure helps a lot when
inspecting input iso-2022-based text manually.