Language Tags - Re: How to do UTF-8

Paul Hoffman / IMC wrote:

snip


Drat. I seem to be in the minority here in my worry that the IETF won't get
language tags in UTF-8. I'm again assured they will, so I agree with John and
others to take out the optional language tag. My proposed privacy mark is now:


I'm still not so sure, and even less sure that, if the IETF does
get them in, they will solve more problems than they create. The
more I study this issue, the more inclined I am to side with Paul's
earlier suggestion of including an optional languageTag in
human-readable S/MIME type definitions.

This issue arose when reference was made to the I-D by Whistler and
Adams, <draft-whistler-plane14-01.txt>, published on February 15.

From this work I note the following:


   1) mechanism for language tagging in [UNICODE] plain text

Strictly speaking, I question whether what we're using is
really an appropriate environment for employing this
proposed technique. I tend to think that we're not
really in a "plain text" environment when Unicode is
already embedded within an ASN.1 encoded structure.
I note below remarks from the Whistler draft that
the HTML folks are not "plain text" and are unlikely to
adopt this proposed embedded language tag mechanism.

   2) One tag identification character and one cancel tag 
      character are also proposed.

I note that X.690 states clearly that for type UTF8String,
neither escape characters nor announcers are allowed. While
I'm unsure whether the plane14 proposal uses either, it sure
sounds like it does. 

Perhaps Bancroft or others will comment, but I am aware of
no defined mechanism by which a 'coder' would transmit such
embedded tag information elegantly to a using application. 
It seems most likely to me that, at best, a 'coder'
would merely decode the UTF8String encoding, hand the
value portion to the application, and leave it up to the
application to determine which parts of the string were
language control information and which parts were the
actual human-displayable content.

This seems likely to me, since I can imagine an application
wishing, perhaps, to send a message in several languages at
once, all contained in a single ASN.1 encoding (say in English,
French, Italian, and Spanish for a target European message). If
this were the case, the number of language tags included in a
given ASN.1 encoding would be indeterminate, and very difficult
for a 'coder' (particularly if hand coding) to handle.
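
To make the burden on the application concrete, here is a rough
Python sketch of the separation step described above. It is only an
illustration, not anything taken from the draft: the code point
values (tag identifier at U+E0001, cancel tag at U+E007F, tag
characters mirroring ASCII at U+E0020..U+E007E) are assumptions,
and make_tag and split_runs are hypothetical helper names.

```python
# Assumed code points for the Plane 14 tag characters. The proposal is
# still experimental, so treat these values as illustrative only.
LANGUAGE_TAG = 0xE0001                  # tag identification character
CANCEL_TAG = 0xE007F                    # cancel tag character
TAG_FIRST, TAG_LAST = 0xE0020, 0xE007E  # tag characters mirroring ASCII


def make_tag(lang):
    """Build an in-band tag for an ASCII language code (hypothetical helper)."""
    return chr(LANGUAGE_TAG) + "".join(chr(0xE0000 + ord(c)) for c in lang)


def split_runs(text):
    """Split an already decoded string into (language, content) runs.

    language is None for content that precedes any tag or follows a cancel.
    """
    runs, content = [], []
    lang = None       # language in effect for the content being collected
    pending = None    # list of tag letters while a tag is being read

    for ch in text:
        cp = ord(ch)
        if cp == LANGUAGE_TAG or cp == CANCEL_TAG:
            # A new tag or a cancel closes the current run of content.
            if content:
                runs.append((lang, "".join(content)))
                content = []
            if cp == CANCEL_TAG:
                lang, pending = None, None
            else:
                pending = []
        elif TAG_FIRST <= cp <= TAG_LAST:
            # Tag characters are control information, never content.
            if pending is not None:
                pending.append(chr(cp - 0xE0000))
        else:
            # First ordinary character after a tag fixes the language.
            if pending is not None:
                lang, pending = "".join(pending), None
            content.append(ch)

    if content:
        runs.append((lang, "".join(content)))
    return runs


message = make_tag("en") + "Hello" + make_tag("fr") + "Bonjour"
print(split_runs(message))
```

Note that the number of runs is known only after the whole string
has been scanned, which is the indeterminacy I am worried about.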

When I consider the impact of non-content language control
characters on the SIZE constraint aspect of this issue, it
also raises a concern. If the number of embedded language
tags is open ended, how will an application ever be able to
correctly anticipate how many characters need to be handled
in its buffer? This thought makes Paul's single optional
languageTag component idea even more appealing. Its use
effectively limits the application to only having to handle
at most one language at a time, while still meeting the need
to support national languages.
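
A small Python sketch of the SIZE concern follows. It assumes that
the tag characters all fall in the range U+E0000..U+E007F and that a
SIZE constraint counts characters rather than octets; both are my
assumptions for illustration, and is_tag_char and sizes are
hypothetical names.

```python
def is_tag_char(ch):
    """True for a character in the assumed Plane 14 tag range."""
    return 0xE0000 <= ord(ch) <= 0xE007F


def sizes(text):
    """Compare what the user sees with what the buffer must hold."""
    content = [c for c in text if not is_tag_char(c)]
    return {
        "content_chars": len(content),            # displayable characters
        "total_chars": len(text),                 # characters incl. tags
        "utf8_octets": len(text.encode("utf-8")), # octets on the wire
    }


# One in-band "en" tag (identifier plus two tag letters) before "Hello".
tagged = chr(0xE0001) + chr(0xE0065) + chr(0xE006E) + "Hello"
print(sizes(tagged))
# With a limit of five characters, the content fits but the string does not.
print(sizes(tagged)["content_chars"] <= 5, sizes(tagged)["total_chars"] <= 5)
```

Every additional embedded tag adds at least three more characters,
each taking four octets in UTF-8, whereas an out-of-band languageTag
leaves the character count of the content untouched.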

   3) A set of special-use tag characters on Plane 14 ...
      using characters which can be strictly separated from
      ordinary text content characters in ISO10646 (or UNICODE)

I can look into this further, but I can find no reference at
all in X.690 to Plane 14 characters. 

   4) much discussion over the last 8 years of language tagging

      great deal of controversy regarding the appropriate placement 
      of language tags

      implementation of this decision awaits formal acceptance by 
      ISO JTC1/SC2/WG2, the working group responsible for ISO10646. 
      Potential implementers should be aware that until this formal 
      acceptance occurs, any usage of the characters proposed herein
      is strictly experimental and not sanctioned for standardized 
      character data interchange.

I am alarmed at the use of "strictly experimental" here. Didn't
someone once remark to me that ASN.1:1994 has not yet been embraced
by the IETF, despite the numerous documented X.208 bugs it corrects,
because the IETF only uses technology recognized as stable? I think
that S/MIME is too important a standard to warrant adoption of such
a risky proposition as this. Paul's out-of-band solution seems much
safer and, being optional, can easily be replaced in a later
revision of S/MIME if necessary.

   5) It is fully anticipated that implementations of Unicode which
      already make use of out-of-band mechanisms for language tagging
      or "heavy-weight" in-band mechanisms such as HTML will continue
      to do exactly what they are doing and will ignore Plane 14
      tag characters completely.

I believe that S/MIME should do likewise. Paul's out-of-band 
solution aligns nicely with HTML practice, which seems natural
for this specification.

Phil
-- 
Phillip H. Griffin         Griffin Consulting
asn1(_at_)mindspring(_dot_)com        ASN.1-SET-Java-Security
919.828.7114               1625 Glenwood Avenue
919.832.7008 [mail]        Raleigh, North Carolina 27608 USA
------------------------------------------------------------
          Visit  http://www.fivepointsfestival.com
                 http://www.five-points.com
------------------------------------------------------------