
Re: quoted-printable

1992-02-15 07:11:52
This is all fine and dandy if you know what character set is being
used. You definitely should know this if you're talking about
text/plain messages. But there are lots of other sorts of things we
send via mail. Many of them don't break down cleanly into a single
character set. Others involve multiple character sets and cannot be
characterized by a single external character set.

Could you provide a few examples?

Actually, it is very difficult to find an example of a content type that works
any other way. Let's see -- handy examples of types that don't reduce to a
single character set include PostScript, SGML, TeX, LaTeX, and of course
richtext.

All of these types internally document the character sets that they use, and
none of them allow for easy extraction of this information. All of these types
support the use of multiple character sets within the same document; only one
of them (SGML) uses ISO 2022 conventions for switching between character sets,
and even that one is not inherently limited to just ISO 2022 switching.

All of these types are potentially readable by humans and hence the use of
quoted-printable is preferable to the use of base64. (Even PostScript can be
made to be readable; in fact, it is possible to write a PostScript preamble
that processes the rest of a document as plain text, richtext, or whatever.)

Once you leave the area of potentially readable information things really
explode. For example, many graphic file formats are composed of printable
characters but aren't necessarily in a single character set. The clear text
encoding of CGM is one such example. (Note that there are three ways to encode
CGM -- one is essentially graphics commands written out as readable text, the
next is a pure binary format, and the third is a compressed thing that ends up
being printable but gibberish. I'm only talking about the first of these three
here.) I don't know enough about GKS metafiles to be able to tell you for sure
what they look like, but I'm sure that there are loads of other examples.

But mnemonic is only
appropriate when a message has a 1:1 mapping from its bytes into a
single character set definition.

I think Keld's system can support messages that mix e.g. Latin-1 and
Latin-2 quite well. (What do you mean by "a single character set
definition"? Something like 10646?)

Suppose there's a byte with value 174 in the midst of a TeX document. I want
to translate this into mnemonic. What character does it represent? How do
you tell what character it represents?
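
To make the ambiguity concrete, here is a small sketch (in Python, purely for
illustration): the same byte value names a different character under each of
several ISO 8859 parts.

    # Byte 174 (0xAE) decoded under three different ISO 8859 parts.
    # Nothing in the byte itself says which one was intended.
    b = bytes([174])
    for charset in ("latin-1", "iso8859-2", "iso8859-5"):
        print(charset, "->", b.decode(charset))
    # latin-1   -> registered sign
    # iso8859-2 -> capital Z with caron
    # iso8859-5 -> Cyrillic capital short U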

TeX is an especially good example since it is a complete programming language.
As a result of this it is _impossible_ to find out what a given byte in a TeX
input stream means without executing the stream as a program. I don't use the
term impossible lightly -- this is a thinly disguised variant on the halting
problem.

You can of course just decide to assume a character set and be done with it.
Conversion of a specific character set into mnemonics is a perfectly invertible
operation. There may be some characters that are now encoded with the wrong
mnemonic, but so what? It is invertible, so who cares?
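
Here is a sketch of that, in Python for illustration, with a made-up
two-character mnemonic table standing in for Keld's (the intro character and
table entries are hypothetical, in the spirit of RFC-CHAR rather than copied
from it):

    # Toy mnemonic encoding under an *assumed* character set.
    # The table is a hypothetical stand-in, not RFC-CHAR's.
    INTRO = "&"
    TABLE = {0xE4: "a:", 0xF6: "o:", 0xFC: "u:"}   # assumed Latin-1 meanings
    INVERSE = {v: k for k, v in TABLE.items()}

    def encode(data: bytes) -> str:
        out = []
        for b in data:
            if b in TABLE:
                out.append(INTRO + TABLE[b])
            elif b == ord(INTRO):
                out.append(INTRO + INTRO)   # escape the intro character itself
            else:
                out.append(chr(b))          # other bytes pass through untouched
        return "".join(out)

    def decode(text: str) -> bytes:
        out, i = bytearray(), 0
        while i < len(text):
            if text[i] == INTRO and text[i + 1] == INTRO:
                out.append(ord(INTRO)); i += 2
            elif text[i] == INTRO:
                out.append(INVERSE[text[i + 1:i + 3]]); i += 3
            else:
                out.append(ord(text[i])); i += 1
        return bytes(out)

    data = bytes([0xE4, 0x26, 0x41])        # whatever the bytes "really" are
    assert decode(encode(data)) == data     # the round trip is exact

Even if 0xE4 was "really" something other than a-diaeresis, the decoder hands
back exactly the bytes that went in.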

But there's a problem of intent here. The very notion of mnemonic as a message
part representation is that it provides valuable new functionality --
specifically, that representing characters using sequences that can be read as
the real thing with a little practice makes things a little nicer for the end
user. By using mnemonic in situations where this intent is specifically
violated you are doing a great disservice to mnemonic. In fact, you are making
things harder to figure out, not easier, if you encode using mnemonic but
assume the input material is in a character set that it does not match up
with.

Another point: Mnemonic does not include all the facilities necessary to be a
totally invertible encoding. (This is not a flaw -- mnemonic is not intended
for use when this is an important consideration.) As such, my earlier claim that
mnemonic encoding can be perfectly decoded to produce the original material is
not true in all cases.

You don't have to go very far past
plain text before this condition no longer holds.

I will admit that I had plain text in mind. Perhaps mnemonic is not
very useful for things other than plain text.

Yes indeed. Most mail is plain text currently, but I am not prepared to
maintain that this is going to be the case in the future.

(I see escape sequences that involve the use of 8-bit
characters quite frequently, so the impact of this on encoding
methodologies is obvious.)

The escape sequences that I have in mind for a multilingual encoding
are based on ISO 2022, and do not use 8-bit characters.

All I was trying to say here is that escape sequence usage is far from uniform.
Given a random document in an unknown format I think the use of ANSI/ISO
conventions for escape sequence parsing is just asking for trouble. It is much
safer to just encode the whole thing as either base64 or quoted-printable so
that the entire content is preserved bit-for-bit.
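
A quick check of the bit-for-bit claim (Python again, just as a convenient way
to exercise both encodings):

    import base64, quopri

    # Arbitrary content: an ISO 2022 escape sequence plus some 8-bit bytes.
    raw = b"plain text \x1b$B@8\x1b(B and 8-bit data: \xae\xe9"

    assert base64.b64decode(base64.b64encode(raw)) == raw
    assert quopri.decodestring(quopri.encodestring(raw)) == raw
    # Both round trips are exact: no character set assumption and no
    # escape sequence parsing is needed to preserve the content.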

If you want to experiment with a system
that tries to exclude escape sequences from conversion you might want
to look at Kermit.

As far as I know, Kermit uses many of the escape sequences specified
in ISO 2022. However, I think that it uses too many of 2022's
features. It is possible to do the same thing with far fewer different
types of escape sequences.

You're probably right that things could be done with fewer 2022 features. But
as I understand it Kermit is not trying to generate documents; it is trying to
provide a full-featured conversion facility. A conversion facility is in no
position to mandate what sequences are used.

If you want to define a new format that uses some subset of the available
facilities that's fine. But Kermit has to deal with all sorts of usage that
won't be restricted to some subset.

But mnemonic
in no way obviates the need for quoted-printable, which is the
encoding of choice for text objects that cannot be conveniently
categorized as being in a given character set.

Sorry, I didn't mean to say that quoted-printable is unneeded. I
simply meant to say that I don't think that quoted-printable will
catch on for certain types of messages, such as German plain text.

It will if there's no alternative :-) I certainly agree that an
alternative is needed.

I'm not sure what the size of Keld's proposal has to do with anything.

What I meant by "humungous set" is that Keld's proposal includes
mnemonics for languages that are currently encoded very differently.
For example, RFC-CHAR also contains specifications for Japanese. Most of the
Japanese text in messages on the Internet and beyond is encoded in a subset of
ISO 2022, which is far more compact than Keld's
mnemonics for Japanese (i.e. 2 bytes vs 8 bytes).
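
For scale, here is a quick check with Python's codecs, whose iso-2022-jp
implements that 2022 subset (the 8-bytes-per-character mnemonic cost is the
figure cited above, not something measured from RFC-CHAR itself):

    # Three kanji under the ISO 2022 subset versus a hypothetical
    # 8-bytes-per-character mnemonic scheme.
    text = "\u65e5\u672c\u8a9e"        # "nihongo", three kanji
    encoded = text.encode("iso-2022-jp")
    print(len(encoded))                # 12: two escape sequences + 2 bytes/char
    print(len(text) * 8)               # 24: assumed mnemonic cost

The escape overhead is fixed, so the gap only widens over longer runs of
Japanese text.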

I think you're missing the point of having a definition for these things.
Mnemonic serves two purposes at the same time. First, it is a facility for
representing characters in a format that's intrinsically a bit more readable than
just having hexadecimal goo scattered throughout your document. Second,
it is a facility that provides a mechanism for converting from one character
set to another.
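
Here is a sketch of that second purpose (the byte values are real, but the
table entries are illustrative stand-ins, not RFC-CHAR's actual mnemonics):

    # Using a mnemonic as the pivot when converting between character
    # sets. The tables are illustrative stand-ins, not RFC-CHAR's.
    TO_MNEMONIC_LATIN1 = {0xE9: "e'"}     # e-acute sits at 0xE9 in Latin-1...
    FROM_MNEMONIC_CP437 = {"e'": 0x82}    # ...but at 0x82 in IBM code page 437

    def latin1_to_cp437(data: bytes) -> bytes:
        out = bytearray()
        for b in data:
            mnemonic = TO_MNEMONIC_LATIN1.get(b)
            # ASCII agrees between the two sets, so unmapped bytes pass through.
            out.append(FROM_MNEMONIC_CP437[mnemonic] if mnemonic else b)
        return bytes(out)

    assert latin1_to_cp437(b"caf\xe9") == b"caf\x82"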

The inclusion of Japanese is, as I see it, only done for the sake of the second
purpose. The way Keld's mnemonics for Japanese work is totally unreadable and
cannot possibly aid in making Japanese characters readable on terminals that
cannot support them. (Please note that I have only said that Keld's approach
does not help. I have been told by people who are knowledgeable in this area
that this problem is basically intractable, and thus Keld's labelling is as
good as any when it comes to readability on terminals that don't support the
proper character set. If there is a better way I'd like to see it used. If
there isn't, Keld's scheme is as good as any.)

Apart from the obvious size problem, hardly any Japanese users have
software that understands Keld's mnemonics. Most of the software
understands the 2022 subset. So Japanese encoded in Keld's mnemonics
would be extremely unreadable.

This is correct, and in fact I would never expect to see such software.
However, this does not obviate the need for conversion tables between
different character sets that include Japanese characters as a subset.

It is quite clear that the Japanese will not use Keld's mnemonics for
their usual email. So the question is: What would Keld's Japanese
mnemonics be used for? For use in other countries? Wouldn't this be a
rather minor usage, in terms of volume in characters per day? Also,
wouldn't it be less confusing if Japanese was encoded in one way (i.e.
the Japanese way) instead of two ways?

I would not expect anyone to actually try to read Japanese encoded in this way.
And I certainly expect that most people will continue to use the facilities
that they are used to. But there are other character sets on the horizon (I
already have to cope with two for Japanese, and that's before the arrival of
10646), and the problem of how to convert from one to the other is not that far
away. RFC-CHAR is attempting to address the need to support existing practice
while allowing for conversion to/from future practice.

If you have additional problems with RFC-CHAR I'd like to hear what
they are. But issues of scope are not a valid area of concern for the
Working Group, in my opinion.

You say (later) that you are reluctant to pursue two mnemonic
approaches at once. In much the same way, I am reluctant to pursue two
approaches for encoding Japanese at once. Since there is already an
established encoding for Japanese, the Japanese mnemonics should be
removed from RFC-CHAR.

Keld is not proposing a new encoding for Japanese. In fact, he explicitly
states that he is not doing this -- that the existing iso-2022-jp encoding is
preferred.

All Keld is doing is documenting and assigning names to existing encodings, with
the notion that conversion between encodings is going to be needed.

But having two mnemonic formats is an entirely different kettle of fish. We
don't need two, we need one that has the input of the entire community going
into its design.

If it is true that issues of scope are not a valid area of concern for
the Working Group, I would like to hear the Chair himself say so.

This Working Group has a charter that as far as I can see covers most of the
technology in RFC-CHAR. The Working Group can and has elected to extend its
charter to cover a lot of other stuff. But if we don't deal with the material
that's covered in RFC-CHAR we will have in essence decided not to fulfill the
terms of our charter.

I don't even have a problem with the notion that a Working Group can decide to
do something other than what its charter says. But I don't see how the desire
of a subset of the group to deliberately fail to deal with all the issues the
Working Group was formed to handle can be respected.

Stay tuned for the next version of the multilingual encoding draft,
which will take into account some of the realities that we have bumped
into lately.

I won't comment on this apart from saying that I'm reluctant to pursue
two mnemonic approaches at once.

I'm not necessarily advocating two different mnemonic approaches. We
may well end up including some of Keld's work in the new document,
either by a reference or by explicit inclusion if that is felt to be
desirable.

This makes the case for having only one document even stronger, doesn't it?

If you could work out your
differences with what Keld has proposed and come up with a unified
result I think we'd all be a lot happier. (I have found that Keld is
more than willing to listen to suggestions on how to modify RFC-CHAR
to make it a better specification.)

Well, I'm sorry to say that I have not found Keld at all willing to
make changes that I propose.

Hmm, well, I guess I'm going to have to ask you to back this up with examples. 
Examples with technical substance, if you please -- I don't want to continue
the debate on whether RFC-CHAR should be restricted to only a subset of the
world's character sets. I want to hear about technical changes that have been
suggested but not adopted. The notion of gutting the document is not the sort
of change that Keld has any business honoring based on a request from a single
reader of the list.

I also feel that this group has basically given Keld the go-ahead to
continue the development of RFC-CHAR, with the stated goal that it
will become a standard.

As far as I can tell, this group has not made any such decision. You
yourself were complaining about the lack of comment on RFC-CHAR a
little while ago. Silence does not mean agreement.

Better read the minutes of the IETF meetings. If this is not clearly stated in
the minutes it sure should be. I attended all the meetings and I heard it
with these two ears.

I quite frankly
don't like what I see happening here -- I see a possibility that
RFC-CHAR will be abandoned, and I think this is a huge mistake.

I also don't want RFC-CHAR to be abandoned. I think that it might be
possible to reach consensus on the Latin-1 part quite quickly.

I guess I don't have a problem with the notion of taking the parts of RFC-CHAR
that are not controversial and standardizing them quickly. But I want to see
some real evidence that there are parts of the document that either need
prolonged work or need to be removed completely. My
biggest problem with all this hinges on this point -- I want to see some honest
technical review, and I'm very disturbed and disappointed to see so little of
it in conjunction with RFC-CHAR on this list.

                                Ned
