Re: Language Tags - Re: How to do UTF-8

Ned Freed wrote:

Drat. I seem to be in the minority here in my worry that the IETF won't get
language tags in UTF-8. I'm again assured they will, so I agree with John and
others to take out the optional language tag. My proposed privacy mark is now:

I'm still not so sure, and even less sure that if IETF does get
them in that they will solve more problems than they create. The
more I study this issue the more prone I am to side with Paul's
earlier suggestion of including an optional languageTag in human
readable S/MIME type definitions.


This approach is appropriate when charsets other than UTF-8 or UTF-16 must be
accommodated. It is inappropriate otherwise.


That's debatable, of course, but my arguments are merely 
technical while, as you admit below, yours are based on 
insight into IETF politics. (You win!) I agree that here,
technical concerns should not be allowed to get in the way 
of IETF standards approval. Technical quality issues like 
backwards compatibility or adherence to international
standards have little merit if a proposed IETF standard 
fails to win IESG approval.

This issue arose when reference was made to the I-D by Whistler and
Adams, <draft-whistler-plane14-01.txt>, published on February 15.
From this work I note the following:

   1) mechanism for language tagging in [UNICODE] plain text

Strictly speaking, I question that what we're using is
really an appropriate environment for employing this
proposed technique. I tend to think that we're not
really a "plain text" environment when Unicode is
already embedded within an ASN.1 encoded structure.


The distinction here is between environments that support markup tags and
those which do not. Unless you propose to make your strings into HTML
documents you're talking about a plain text environment.


Really? The last time I checked (this morning) I was getting

   (3) <underline>Road Kill on the Information Highway</underline>,
   Myhrvold, Nathan, 1993, Internal Microsoft Memorandum.  Would 
   love a copy of this.

in my mail. But I'm comfortable with this, and have even
come to expect such.

I note below remarks from the Whistler draft that
the HTML folks are not "plain text" and are unlikely to
adopt this proposed embedded language tag mechanism.


Exactly.

   2) One tag identification character and one cancel tag
      character are also proposed.

I note that X.690 states clearly that for type UTF8String,
neither escape characters nor announcers are allowed. While
I'm unsure whether the plane14 proposal uses either, it sure
sounds like it does.


No, the Whistler proposal creates new Unicode codepoints. These are neither
escape characters nor announcers, both of which are things at odds with the
design principles of Unicode.
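
As a rough illustration of the point above, the sketch below builds a
tagged string out of ordinary code points and encodes it as UTF-8. It
assumes the tag characters sit at U+E0001 (LANGUAGE TAG), U+E0020 through
U+E007E (tag copies of printable ASCII) and U+E007F (CANCEL TAG). Those
values and the helper name make_language_tag are assumptions made for
illustration and should be checked against the draft.

```python
# Assumed code point assignments for the Plane 14 tag characters.
LANGUAGE_TAG = 0xE0001  # introduces a language tag
CANCEL_TAG = 0xE007F    # cancels tags (not used in this sketch)
TAG_OFFSET = 0xE0000    # tag characters mirror ASCII 0x20..0x7E at this offset


def make_language_tag(lang: str) -> str:
    """Return the tag sequence for a printable-ASCII language code such as 'fr'."""
    if not all(0x20 <= ord(c) <= 0x7E for c in lang):
        raise ValueError("language code must be printable ASCII")
    return chr(LANGUAGE_TAG) + "".join(chr(TAG_OFFSET + ord(c)) for c in lang)


tagged = make_language_tag("fr") + "bonjour"
encoded = tagged.encode("utf-8")

print(len(tagged))      # number of characters, tags included
print(len(encoded))     # number of UTF-8 octets
print(encoded[:4])      # octets of the LANGUAGE TAG character itself
print(0x1B in encoded)  # is an ESC octet present?
```

Under these assumptions every tag character is just another code point
that takes four octets in UTF-8, and no ESC octet appears anywhere in the
encoding.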


That's good. They should pass through 'coders' nicely,
and should be easy enough to filter, for those who don't
want them, by manipulating the ASN.1 in
bits-on-the-wire compatible ways.

In fact, I suppose that's the rub for IESG. It's quite
easy to eliminate national language support from UTF8String,
or to exclude any unwanted languages if you're adept at 
manipulating the ASN.1. I suspect that lots of implementors
will end up only supporting the ASCII characters here to
save money and reduce time to market.

Perhaps Bancroft or others will comment, but I am aware of
no defined mechanism by which a 'coder' would transmit such
embedded tag information elegantly to a using application.
It would seem most likely to me, that at best a 'coder'
would merely decode the UTF8String encoding, and hand the
value portion to the application and leave it up to the
application to determine which parts of the string were
language control information and which parts were the
actual human displayable content.


Yes, and this is exactly what is supposed to happen. The 
application is where the language information is needed.
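
A minimal sketch of what that application-side step might look like,
assuming the 'coder' hands over the decoded string untouched and that the
tag characters use the code points U+E0001, U+E0020..U+E007E and U+E007F.
The function names, and the simplification that CANCEL TAG just resets
the language to "none", are assumptions made for illustration.

```python
LANGUAGE_TAG = 0xE0001                  # assumed: introduces a language tag
CANCEL_TAG = 0xE007F                    # assumed: cancels the current tag
TAG_FIRST, TAG_LAST = 0xE0020, 0xE007E  # assumed: tag copies of ASCII
TAG_OFFSET = 0xE0000


def tag(lang: str) -> str:
    """Hypothetical helper: build the tag sequence for an ASCII language code."""
    return chr(LANGUAGE_TAG) + "".join(chr(TAG_OFFSET + ord(c)) for c in lang)


def split_language_runs(value: str):
    """Split a decoded string into (language, text) runs; language is None if untagged."""
    runs, lang, text = [], None, []
    i = 0
    while i < len(value):
        cp = ord(value[i])
        if cp == LANGUAGE_TAG or cp == CANCEL_TAG:
            if text:  # close the run collected so far
                runs.append((lang, "".join(text)))
                text = []
            i += 1
            code = []
            if cp == LANGUAGE_TAG:
                # Collect the tag characters that spell the language code.
                while i < len(value) and TAG_FIRST <= ord(value[i]) <= TAG_LAST:
                    code.append(chr(ord(value[i]) - TAG_OFFSET))
                    i += 1
            lang = "".join(code) or None
        elif TAG_FIRST <= cp <= TAG_LAST:
            i += 1  # stray tag character outside a tag: drop it
        else:
            text.append(value[i])
            i += 1
    if text:
        runs.append((lang, "".join(text)))
    return runs


value = tag("en") + "Hello " + tag("fr") + "Bonjour" + chr(CANCEL_TAG) + "!"
print(split_language_runs(value))
```

The number of runs is not known in advance, so the result is a list
rather than a fixed set of fields.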


I see your point. This is good news for programmers.

This seems likely to me, since I can imagine an application
wishing, perhaps, to send a message in several languages at
once, all contained in a single ASN.1 encoding (say in English,
French, Italian, and Spanish for a target European message). If
this were the case, the number of language tags included in a
given ASN.1 encoding would be indeterminate, and very difficult
for a 'coder' (particularly if hand coding) to handle.


Adding appropriate language tag information isn't easy. But this is true
regardless of whether the tag is internal to the string or an external field.


Agreed.

When I consider the impact of non content, language control
characters on the SIZE constraint aspect of this issue, it
also raises a concern. If the number of embedded language
tags is open ended, how will an application ever be able to
correctly anticipate how many characters need to be handled
in his buffer? This thought makes Paul's single optional
languageTag component idea even more appealing. Its use
effectively limits the application to only having to handle
at most one language at a time, while still meeting the need
to support national languages.


Though I notice no comment here, let me note that your
solution presents yet another interesting test case for
implementors, besides the ones pointed out recently by
DavidK: 
   An application receives an unconstrained ietf-UTF8String 
   in an optional component, which contains nothing but 32k 
   of language tags (no displayable content). GUI guys must
   take care to handle this case.
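
That test case can be put together in a few lines. This is only a sketch:
it assumes the tag characters occupy U+E0001 through U+E007F, that a SIZE
constraint would count characters rather than octets, and the repetition
count is chosen simply to land near 32k octets. The name displayable is
hypothetical.

```python
LANGUAGE_TAG = 0xE0001               # assumed code point for LANGUAGE TAG
TAG_OFFSET = 0xE0000                 # assumed offset of the tag copies of ASCII
TAG_BLOCK = range(0xE0001, 0xE0080)  # assumed range of all tag characters


def displayable(value: str) -> str:
    """Return the string with every tag character removed."""
    return "".join(c for c in value if ord(c) not in TAG_BLOCK)


# One "en" language tag: LANGUAGE TAG plus two tag characters.
one_tag = chr(LANGUAGE_TAG) + "".join(chr(TAG_OFFSET + ord(c)) for c in "en")

# Repeat it until the encoding is just under 32k octets.
tags_only = one_tag * 2730
encoded = tags_only.encode("utf-8")

print(len(encoded))                 # octets on the wire
print(len(tags_only))               # characters, tags included
print(len(displayable(tags_only)))  # characters left to display
```

With these numbers the string is 32760 octets and 8190 characters long,
and nothing is left to display once the tags are removed.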

   3) A set of special-use tag characters on Plane 14 ...
      using characters which can be strictly separated from
      ordinary text content characters in ISO10646 (or UNICODE)

I can look into this further, but I can find no reference at
all in X.690 to Plane 14 characters.


Nor should there be. I realize that there's a bunch of stuff in X.680 and so on
about profiling Unicode, but the IETF already has requirements in place in this
regard, and a standard that doesn't follow them is unlikely to make it through
the IETF process.


True, a bunch of stuff. But that's not relevant here, as
S/MIME is using an ISO-invalid X.208 derivative, not ASN.1
1994 or 1997. Technically, it doesn't even matter what
X.208 requires. For S/MIME, that standard is really just a
guideline.

   4) much discussion over the last 8 years of language tagging

      great deal of controversy regarding the appropriate placement
      of language tags

      implementation of this decision awaits formal acceptance by
      ISO JTC1/SC2/WG2, the working group responsible for ISO10646.
      Potential implementers should be aware that until this formal
      acceptance occurs, any usage of the characters proposed herein
      is strictly experimental and not sanctioned for standardized
      character data interchange.

I am alarmed at the use of "strictly experimental" here. Didn't
someone once remark to me that ASN.1:1994 has not yet been embraced
by the IETF, despite the numerous documented X.208 bugs it corrects,
because the IETF only uses technology recognized as stable?


The referenced prose explains why this is experimental -- it hasn't been
accepted by the ISO. That's the _only_ reason it is experimental at the
present time.

The current situation is that the IETF's position on charset requires that
language tagging be possible. And various protocols are already on the
standards track which use UTF-8 and which rely on language tagging facilities
being part of UTF-8. So there are basically two possible outcomes here:

(1) The ISO approves what the UTC has already approved and we have a
    language tagging facility blessed by the UTC, ISO, and the IETF.

(2) The ISO doesn't approve the present facility and the IETF creates its own
    facility instead. This could be tags in the private use range, it could
    be the MLSF proposal, or it could be something else entirely.


If the goal is to break other standards, we should 
all hope for the latter. Since ISO is known for 
moving cautiously, (8 years it's taken them already)
the latter will likely come true.

I believe these are the _only_ possible outcomes. Adding external language tags
is not a viable solution in far too many cases for the IETF not to address this
problem.


I agree with your last statement, and IESG leadership should be
commended for their efforts at trying to find a solution. Though
one size does not fit all in this case, your solution does 
present an interesting case for X.5* Distinguished Names: Is an all 
UTF8String( ASCII, French ) DN <=> UTF8String( ASCII, English )?
Not likely, at least not in any backwards compatible way. But this 
is probably out of scope for S/MIME, and not a problem. Once the 
language tags are stripped and the true plain text displayed, I
can imagine a few programmers may be a bit perplexed.
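
To make the DN question concrete, here is a small sketch comparing two
strings that differ only in their embedded language tag. It assumes the
tag characters occupy U+E0001 through U+E007F. The helper names and the
sample name are made up for illustration, and nothing here says what the
X.500 matching rules actually require.

```python
LANGUAGE_TAG = 0xE0001               # assumed code point for LANGUAGE TAG
TAG_OFFSET = 0xE0000                 # assumed offset of the tag copies of ASCII
TAG_BLOCK = range(0xE0001, 0xE0080)  # assumed range of all tag characters


def tag(lang: str) -> str:
    """Hypothetical helper: build the tag sequence for an ASCII language code."""
    return chr(LANGUAGE_TAG) + "".join(chr(TAG_OFFSET + ord(c)) for c in lang)


def strip_tags(value: str) -> str:
    """Return the string with every tag character removed."""
    return "".join(c for c in value if ord(c) not in TAG_BLOCK)


french = tag("fr") + "Jean Dupont"
english = tag("en") + "Jean Dupont"

print(french == english)                                  # raw comparison
print(french.encode("utf-8") == english.encode("utf-8"))  # octet comparison
print(strip_tags(french) == strip_tags(english))          # after stripping tags
```

The raw and octet comparisons both fail while the stripped comparison
succeeds, so the two values display identically but do not match.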

BTW. UTF8String has now been added to DistinguishedName{} and
all it awaits is to pass ballot. No language tag issues are 
included as yet though, but that's an issue for LDAP/X.500
to work I'd think.

I think
that S/MIME is too important a standard to warrant adoption of such
a risky proposition as this. Paul's out-of-band solution seems much
safer, and can be replaced easily (optional) in a later revision
of S/MIME if necessary.


I think S/MIME is too important a standard to embrace a mechanism which will be
obsoleted shortly and will remain as a major wart on the protocol for all time.

   5) It is fully anticipated that implementations of Unicode which
      already make use of out-of-band mechanisms for language tagging
      or "heavy-weight" in-band mechanisms such as HTML will continue
      to do exactly what they are doing and will ignore Plane 14
      tag characters completely.

I believe that S/MIME should do likewise. Paul's out-of-band
solution aligns nicely with HTML practice, which seems natural
for this specification.


Actually it does nothing of the sort. Paul's proposal allows for a single
language tag for a given string. HTML, on the other hand, allows for language
changes at any point. This could potentially be a major limitation.


No. Paul's proposal is backwards compatible while yours is not.
Paul's "plain text" string will work fine for the hordes of
users relying on programs currently deployed that know nothing 
of your experimental "plain text" solution. Unfortunately, most
will not behave in an HTML manner and simply ignore what they
do not understand (a great concept).

I find your logic circular. Once you place tags such as you suggest 
in a sender's ordinary text, it's no longer "plain text". It is
tagged text - text that contains tags. But I suppose it will not 
matter. Already users have grown accustomed to seeing the likes 
of 

  - -----END PGP SIGNATURE-----

  <HTML><META HTTP-EQUIV="Content-Type:text/html"> <SCRIPT>
  function X() {var Text = "HTML is not acceptable for using in mail " +
  "or usenet so your browser will stop."; alert(Text); parent.close();};
  </SCRIPT> </HEAD><BODY onLoad="X();return true">Hi</HTML>

splattered all over their screens. Truly, "plain text" is in the
eye of the beholder, the meaning debatable at best, and a few embedded 
language tags are hardly anything to get excited over.

Ned


Thanks for explaining this to me. Hope to see you in LA.

Phil
-- 
Phillip H. Griffin         Griffin Consulting
asn1(_at_)mindspring(_dot_)com        ASN.1-SET-Java-Security
919.828.7114               1625 Glenwood Avenue
919.832.7008 [mail]        Raleigh, North Carolina 27608 USA
------------------------------------------------------------
          Visit  http://www.fivepointsfestival.com
                 http://www.five-points.com
------------------------------------------------------------