Re: Language Tags - Re: How to do UTF-8

Drat. I seem to be in the minority here in my worry that the IETF won't get
language tags in UTF-8. I'm again assured they will, so I agree with John and
others to take out the optional language tag. My proposed privacy mark is now:

I'm still not so sure, and even less sure that if IETF does get
them in that they will solve more problems than they create. The
more I study this issue the more prone I am to side with Paul's
earlier suggestion of including an optional languageTag in human
readable S/MIME type definitions.


This approach is appropriate when charsets other than UTF-8 or UTF-16 must be
accommodated. It is inappropriate otherwise.

This issue arose when reference was made to the I-D by Whistler and
Adams, <draft-whistler-plane14-01.txt>, published on February 15.
From this work I note the following:

   1) mechanism for language tagging in [UNICODE] plain text

Strictly speaking, I question whether what we're using is
really an appropriate environment for employing this
proposed technique. I tend to think that we're not
really a "plain text" environment when Unicode is
already embedded within an ASN.1 encoded structure.


The distinction here is between environments that support markup tags and
those which do not. Unless you propose to make your strings into HTML
documents you're talking about a plain text environment.

I note below remarks from the Whistler draft that
the HTML folks are not "plain text" and are unlikely to
adopt this proposed embedded language tag mechanism.


Exactly.

   2) One tag identification character and one cancel tag
      character are also proposed.

I note that X.690 states clearly that for type UTF8String,
neither escape characters nor announcers are allowed. While
I'm unsure whether the plane14 proposal uses either, it sure
sounds like it does.


No, the Whistler proposal creates new Unicode codepoints. These are neither
escape characters nor announcers, both of which are things at odds with the
design principles of Unicode.
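
To make the point concrete, here is a minimal Python sketch of how a language
tag is formed under the Whistler proposal. It assumes the code point
assignments as I understand them from the draft (tag identification character
at U+E0001, tag characters mirroring printable ASCII at U+E0020 through
U+E007E). The helper name make_language_tag is hypothetical and not part of
any specification.

```python
# Assumed assignments from the Plane 14 draft; treat them as illustrative.
LANGUAGE_TAG = 0xE0001   # tag identification character
TAG_OFFSET = 0xE0000     # tag characters mirror ASCII 0x20..0x7E at this offset


def make_language_tag(lang: str) -> str:
    """Return the Plane 14 tag sequence for an ASCII language tag such as 'fr'."""
    if not all(0x20 <= ord(c) <= 0x7E for c in lang):
        raise ValueError("language tag must be printable ASCII")
    return chr(LANGUAGE_TAG) + "".join(chr(TAG_OFFSET + ord(c)) for c in lang)


# A tagged string is just a sequence of ordinary code points.
tagged = make_language_tag("fr") + "bonjour"

# Encoding it needs nothing beyond plain UTF-8; each tag character becomes
# a regular four-byte sequence.
encoded = tagged.encode("utf-8")
```

The encoded bytes contain no ESC byte and no announcer sequence, and they
round-trip through any conforming UTF-8 decoder unchanged.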

Perhaps Bancroft or others will comment, but I am aware of
no defined mechanism by which a 'coder' would transmit such
embedded tag information elegantly to a using application.
It would seem most likely to me, that at best a 'coder'
would merely decode the UTF8String encoding, and hand the
value portion to the application and leave it up to the
application to determine which parts of the string were
language control information and which parts were the
actual human displayable content.


Yes, and this is exactly what is supposed to happen. The application is where
the language information is needed.
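
As a rough illustration of the application-side work, the following Python
sketch splits an already decoded string into (language, text) runs. It
assumes the draft's assignments (U+E0001 as the tag identification character,
U+E0020 through U+E007E as tag characters, U+E007F as the cancel tag). The
function name split_language_runs is hypothetical, and real handling would
need to follow whatever the final specification says.

```python
from typing import List, Optional, Tuple

# Assumed assignments from the Plane 14 draft; treat them as illustrative.
LANGUAGE_TAG = 0xE0001                   # tag identification character
CANCEL_TAG = 0xE007F                     # cancel tag character
TAG_FIRST, TAG_LAST = 0xE0020, 0xE007E   # tag characters mirroring ASCII
TAG_OFFSET = 0xE0000


def split_language_runs(s: str) -> List[Tuple[Optional[str], str]]:
    """Split a decoded string into (language, text) runs.

    The language is None for text not covered by any language tag.
    """
    runs: List[Tuple[Optional[str], str]] = []
    lang: Optional[str] = None
    text: List[str] = []

    def flush() -> None:
        # Emit the pending text under the language currently in effect.
        if text:
            runs.append((lang, "".join(text)))
            text.clear()

    i = 0
    while i < len(s):
        cp = ord(s[i])
        if cp == LANGUAGE_TAG:
            flush()
            i += 1
            tag = []
            # Collect the tag characters and map them back to ASCII.
            while i < len(s) and TAG_FIRST <= ord(s[i]) <= TAG_LAST:
                tag.append(chr(ord(s[i]) - TAG_OFFSET))
                i += 1
            lang = "".join(tag) or None
        elif cp == CANCEL_TAG:
            flush()
            lang = None
            i += 1
        elif TAG_FIRST <= cp <= TAG_LAST:
            i += 1  # stray tag character outside a tag: skip it
        else:
            text.append(s[i])
            i += 1
    flush()
    return runs
```

The coder never needs to know any of this; it hands over the decoded string
and the application decides what to do with each run.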

This seems likely to me, since I can imagine an application
wishing, perhaps, to send a message in several languages at
once, all contained in a single ASN.1 encoding (say in English,
French, Italian, and Spanish for a target European message). If
this were the case, the number of language tags included in a
given ASN.1 encoding would be indeterminate, and very difficult
for a 'coder' (particularly if hand coding) to handle.


Adding appropriate language tag information isn't easy. But this is true
regardless of whether the tag is internal to the string or an external field.

When I consider the impact of non-content language control
characters on the SIZE constraint aspect of this issue, it
also raises a concern. If the number of embedded language
tags is open ended, how will an application ever be able to
correctly anticipate how many characters need to be handled
in its buffer? This thought makes Paul's single optional
languageTag component idea even more appealing. Its use
effectively limits the application to only having to handle
at most one language at a time, while still meeting the need
to support national languages.
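
The arithmetic behind this concern can be sketched in a few lines of Python.
The sketch assumes the tag characters occupy U+E0000 through U+E007F as in
the draft, and count_characters is a hypothetical helper. Whether a SIZE
constraint should count tag characters is exactly the open question, so both
numbers are reported.

```python
# Assumed range of the Plane 14 tag block; treat it as illustrative.
TAG_BLOCK_FIRST, TAG_BLOCK_LAST = 0xE0000, 0xE007F


def count_characters(s: str):
    """Return (total code points, code points that are not tag characters)."""
    total = len(s)
    tags = sum(1 for c in s if TAG_BLOCK_FIRST <= ord(c) <= TAG_BLOCK_LAST)
    return total, total - tags


# "hello" preceded by a language tag for "en": three tag characters
# plus five content characters.
sample = "\U000E0001\U000E0065\U000E006Ehello"
total, visible = count_characters(sample)
```

Each additional language switch adds one tag identification character plus
the length of the language code to the total, without changing the visible
count.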

   3) A set of special-use tag characters on Plane 14 ...
      using characters which can be strictly separated from
      ordinary text content characters in ISO10646 (or UNICODE)

I can look into this further, but I can find no reference at
all in X.690 to Plane 14 characters.


Nor should there be. I realize that there's a bunch of stuff in X.680 and so on
about profiling Unicode, but the IETF already has requirements in place in this
regard, and a standard that doesn't follow them is unlikely to make it through
the IETF process.

   4) much discussion over the last 8 years of language tagging

      great deal of controversy regarding the appropriate placement
      of language tags

      implementation of this decision awaits formal acceptance by
      ISO JTC1/SC2/WG2, the working group responsible for ISO10646.
      Potential implementers should be aware that until this formal
      acceptance occurs, any usage of the characters proposed herein
      is strictly experimental and not sanctioned for standardized
      character data interchange.

I am alarmed at the use of "strictly experimental" here. Didn't
someone once remark to me that ASN.1:1994 has not yet been embraced
by the IETF, despite the numerous documented X.208 bugs it corrects,
because the IETF only uses technology recognized as stable?


The referenced prose explains why this is experimental -- it hasn't been
accepted by the ISO. That's the _only_ reason it is experimental at the
present time.

The current situation is that the IETF's position on charset requires that
language tagging be possible. And various protocols are already on the
standards track which use UTF-8 and which rely on language tagging facilities
being part of UTF-8. So there are basically two possible outcomes here:

(1) The ISO approves what the UTC has already approved and we have a
    language tagging facility blessed by the UTC, ISO, and the IETF.

(2) The ISO doesn't approve the present facility and the IETF creates its own
    facility instead. This could be tags in the private use range, it could
    be the MLSF proposal, or it could be something else entirely.

I believe these are the _only_ possible outcomes. Adding external language tags
is not a viable solution in far too many cases for the IETF not to address this
problem.

I think that S/MIME is too important a standard to warrant adoption
of such a risky proposition as this. Paul's out-of-band solution seems
much safer, and, being optional, can easily be replaced in a later
revision of S/MIME if necessary.


I think S/MIME is too important a standard to embrace a mechanism which will be
obsoleted shortly and will remain as a major wart on the protocol for all time.

   5) It is fully anticipated that implementations of Unicode which
      already make use of out-of-band mechanisms for language tagging
      or "heavy-weight" in-band mechanisms such as HTML will continue
      to do exactly what they are doing and will ignore Plane 14
      tag characters completely.

I believe that S/MIME should do likewise. Paul's out-of-band
solution aligns nicely with HTML practice, which seems natural
for this specification.


Actually it does nothing of the sort. Paul's proposal allows for a single
language tag for a given string. HTML, on the other hand, allows for language
changes at any point. Restricting each string to one language could be a major
limitation.

                                Ned