ietf-822

Re: New-ish idea on non-ascii headers

1991-09-23 08:03:12
I suspect that
similar problems arise with Arabic, but I don't know Arabic myself,
so I'm not entirely sure.

Let me try a taxonomy here in the hope of lowering the
temperature a bit and at least reaching agreement on what we are talking 
about.  I hope.

Language group 1: Languages which use Roman characters, plus or minus 
diacriticals and small variations (e.g., hooks and slashes).  Systems 
like mnemonic are going to do a really good job, and people will be able 
to read them by intuition alone if their designers are 
careful and at least half-sensible.  Keld has done an outstanding job.

Language group 2: Languages which use alphabetic writing derived from
Latin/Greek/North Semitic (which are inexorably tied up with each 
other), but which are not Roman characters.  Greek and Cyrillic fit in 
this category.  Mnemonic is still going to work well, but one is going to
have to take it as what its name says: a set of mnemonic conventions that permit
deducing the original character by looking at it.  Intuitive reading is 
a little less likely to work.  Keld has, IMHO, done an outstanding job 
here, too.

Language group 3: Languages that are alphabetic, but whose characters 
are not derived from the Latin/Greek/North Semitic set.  Most of these 
are not European; many of them use character glyphs derived from 
Sanskrit.  Mnemonic is going to rely more on "memory" and less on "image
and intuition" than it does for languages in the second group.  Thai and
Hindi clearly fall into this category.  Arabic probably does, but it is a
little between the second and third groups.  But the character systems
represent closed sets (which is a result of their being alphabetic 
writing systems), so it will/can work if one accepts the lack of obvious 
imagery.

Language group 4: Languages that are not written alphabetically or 
phonetically and that (perhaps as a natural consequence) have a very 
large number of glyphs.  Relationships among glyphs are based on 
historical meaning relationships, not on pronunciation.  The glyph
collection tends to not be a closed set: what goes into things like 
10646 is the N "most important" symbols, not all of them.  And
increasing the value of N does not close the set.
  With one exception, systems like mnemonic are not going to be much 
easier to use than quoted-printable: essentially, one is going to have 
to look at a symbol sequence that represents a glyph and then go look it 
up in a table--the classic "decoding" operation.
  The exception is that I gather that, at least for Chinese, it might be 
possible to design a mnemonic-like system that would encode things using 
some sequence of designators for radicals and strokes.  For some 
languages, one might even build on a few centuries of tradition and 
research in how to do precisely that.  But it is a slightly different 
type of system.  And, for other languages, one would have to do the 
research and do the lexicographic work first.

Why is this type of breakdown important?  Because it makes an a priori 
prediction about how well mnemonic might possibly do in different 
situations.  Our expectations should be different for each, and the 
reasons have more to do with the various natures of "character" than 
they do with what Keld is doing.  RFC-MNEMONIC can approach the limits
this model imposes either well or poorly (and I think it does it very
well), but the limits on strongly intuitive imagery-based codings are
such that, all things being equal, things are going to be worse with 
increasing group numbers.  One is going to need to rely more on the aid 
to memory than on deduction from the coded-character images.  And, as 
one moves to neglected (by the whole community, not by Keld) ideographic 
languages, "mnemonic" isn't going to be very suggestive at all, but may 
still be as good as any of the alternatives.

So...

 In short, mnemonic encoding is useful for alphabetic languages
PRIMARILY because of the ease with which someone lacking an enhanced
MUA can figure out the real letters intended.  This feature does NOT
carry over to non-alphabetic languages.
   True.  And since the mnemonic value of the codes partially depends on 
imagery, it works better for alphabetical languages that share the base 
characters of restricted-ASCII than those that don't.

 However, it is NOT a general world-wide solution and should not be
represented as such. 
   And, even if all other problems could be solved, the "always one more 
glyph" problem gets you.  Characters don't need to be in DISbis10646 to 
be valid in their languages.
   On the other hand, quoted-printable fans, that scheme just encodes the bit
pattern, so you have to have some reference to some computer-coded 
character set with which to interpret it.  To a considerable extent, 
mnemonic represents glyphs (regardless of how easy its representations 
are to understand); quoted-printable represents code points in specific 
character sets.
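
   To make the glyph-versus-code-point distinction concrete, here is a
minimal Python sketch.  The two-entry mnemonic table is an illustrative
assumption, not taken from Keld's draft, and the choice of ISO 8859-1
and IBM code page 437 as the two character sets is only for
demonstration.  The mnemonic form of a word stays the same whichever
character set the sender uses, while the quoted-printable form changes
with it and cannot be read back without knowing which set was meant.

```python
import quopri

# Hypothetical mnemonic table: illustrative entries only, not the
# actual RFC-MNEMONIC assignments.
MNEMONIC = {"é": "e'", "ü": "u:"}


def to_mnemonic(text):
    """Replace each known glyph by its mnemonic; pass the rest through.

    Naive on purpose: a real scheme needs an escape or introducer so the
    result can be decoded unambiguously.
    """
    return "".join(MNEMONIC.get(ch, ch) for ch in text)


def to_qp(text, charset):
    """Quoted-printable encodes the bytes of one specific character set."""
    return quopri.encodestring(text.encode(charset)).decode("ascii")


word = "café"
mnemonic_form = to_mnemonic(word)      # independent of any character set
qp_latin1 = to_qp(word, "latin-1")     # e-acute is byte E9 here
qp_cp437 = to_qp(word, "cp437")        # e-acute is byte 82 here

# The same quoted-printable text read under the wrong character set
# yields a different glyph.
misread = quopri.decodestring(qp_latin1.encode("ascii")).decode("cp437")
```

   The point of the last line is that "caf=E9" by itself says nothing
about which glyph was intended; the character set has to travel with it.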

Now, since I have my flameproof suit on already, let me suggest in very 
general terms that we might think about patching a kludge onto mnemonic 
in the hope of making this problem go away.  Note that this is "make go away"
not "solve"--I think the gist of both Ran's and my comments is that 
there isn't going to be a neat solution.  We ask Keld to think about an 
escape convention that would permit, within mnemonic, representing a 
character by a notational pair consisting of 
    { character-set-designation, code point }
The set of "character-set-designation"s would be equivalent to the 
candidates for "character set" in a separate header or a content-type 
text subfield or...
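
   A rough Python sketch of what such a pair might look like in
practice.  The "&[charset:HEX]" syntax is invented here purely for
illustration (nothing like it appears in the mnemonic draft), and the
character-set names are simply whatever Python's codec registry
accepts, standing in for a registered set of designations.

```python
import re

# Hypothetical escape syntax: &[charset-designation:hex-bytes]
PAIR = re.compile(r"&\[([A-Za-z0-9_.-]+):((?:[0-9A-Fa-f]{2})+)\]")


def encode_pair(ch, charset):
    """Write one character as { character-set-designation, code point }."""
    code_point = ch.encode(charset).hex().upper()
    return "&[{}:{}]".format(charset, code_point)


def decode_pairs(text):
    """Replace every escaped pair by the character it designates."""
    def lookup(match):
        charset, hex_bytes = match.group(1), match.group(2)
        # Raises LookupError if the designation is not a known set.
        return bytes.fromhex(hex_bytes).decode(charset)
    return PAIR.sub(lookup, text)


escaped = encode_pair("α", "iso-8859-7")
decoded = decode_pairs("&[iso-8859-7:E1]&[iso-8859-7:E2]c")
```

   The receiver needs a table for every designated character set, which
is the "go look it up" operation again.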

  As an even more obnoxious variation, one could provide for 2022-like 
shifting in and out of this mode, designating the character set as part 
of the shifting activity.  At that point, by a little more magic and 
handwaving, we could say "the following types of header fields are in 
mnemonic when RFC-XXX is in use" and treat quoted-printable as a subset
of mnemonic.  That really cleans up the inter-header referencing mess. 
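
   The shifting variation could be sketched along these lines, again in
Python and again with invented syntax: "&+charset;" shifts in and
designates the character set, "&-;" shifts out, and inside the shifted
region bytes are written as in quoted-printable.  All three conventions
are assumptions made for the sake of the example.

```python
import quopri
import re

SHIFT_IN = re.compile(r"&\+([A-Za-z0-9_.-]+);")   # hypothetical shift-in
SHIFT_OUT = "&-;"                                 # hypothetical shift-out


def decode_shifted(text):
    """Decode text containing shifted, quoted-printable-style regions."""
    out = []
    pos = 0
    while pos < len(text):
        start = SHIFT_IN.search(text, pos)
        if start is None:
            out.append(text[pos:])          # no more shifted regions
            break
        out.append(text[pos:start.start()])
        end = text.find(SHIFT_OUT, start.end())
        if end == -1:
            raise ValueError("shift-in without matching shift-out")
        body = text[start.end():end]
        # Inside the region, =XX stands for one byte of the designated set.
        raw = quopri.decodestring(body.encode("ascii"))
        out.append(raw.decode(start.group(1)))
        pos = end + len(SHIFT_OUT)
    return "".join(out)


result = decode_shifted("abc &+iso-8859-7;=E1=E2&-; def")
```

   Unlike the pair form, this one carries state: a reader who loses the
shift-in loses the meaning of everything up to the shift-out.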

  Now, these paired values have no mnemonic significance at all.   Too
bad. But it is possible to designate *anything* at the simple cost of
finding it in some standardized or registered character set, or by
dashing off to ECMA to register another one. 

   These are terrible kludges.  Maybe they let us get on with our lives.

   -john