Re: Language Tags - Re: How to do UTF-8

The distinction here is between environments that support markup tags and
those which do not. Unless you propose to make your strings into HTML
documents you're talking about a plain text environment.

Really? The last time I checked (this morning) I was getting

   (3) <underline>Road Kill on the Information Highway</underline>,
   Myhrvold, Nathan, 1993, Internal Microsoft Memorandum.  Would
   love a copy of this.

in my mail. But I'm comfortable with this, and have even
come to expect such.


But message bodies _are_ an environment that supports multiple media types,
many of which provide markup tag facilities. This is quite different than
string fields in S/MIME information objects, where I doubt very much that
you want to try and support full HTML (or worse, MHTML).

No, the Whistler proposal creates new Unicode codepoints. These are neither
escape characters nor announcers, both of which are at odds with the
design principles of Unicode.

That's good. They should pass through 'coders' nicely,
and should be easy enough to filter for those who don't
want them by manipulating the ASN.1 in bits on the wire
compatible ways.


There really should be little reason to do this. Unicode contains lots of
codepoints that don't get displayed; these are simply more of them. Anything
that supports Unicode has to be prepared to deal with the presence of this
sort of thing. (Full Unicode support even when you have a complete BMP
font available isn't exactly a piece of cake...)

In fact, I suppose that's the rub for IESG. It's quite
easy to eliminate national language support from UTF8String,
or to exclude any unwanted languages if you're adept at
manipulating the ASN.1. I suspect that lots of implementors
will end up only supporting the ASCII characters here to
save money and reduce time to market.


As far as I know the IETF does not at present have any requirements that say
you must be able to display full Unicode or even some selected subset. However,
the best way to create such a requirement is for a significant number of
vendors not to meet such requirements and cause problems by not doing so.
Interoperability problems are every bit as much of an anathema to the IETF as
markup and escape sequences are to the Unicode Technical Committee.

I note in passing that in many cases vendors will be using platforms with
built-in support for Unicode. In such cases not handling a reasonably full
repertoire is inexcusable.

The impact of non-content language control characters on
the SIZE constraint aspect of this issue also raises a
concern. If the number of embedded language tags is open
ended, how will an application ever be able to correctly
anticipate how many characters need to be handled in its
buffer? This thought makes Paul's single optional
languageTag component idea even more appealing. Its use
effectively limits the application to handling at most
one language at a time, while still meeting the need
to support national languages.
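
To put rough numbers on the buffer concern, here is a minimal Python sketch.
It assumes the Plane 14 layout in which U+E0001 (LANGUAGE TAG) introduces a
tag and each ASCII character of the tag value is carried at its ASCII value
plus 0xE0000; the thread itself gives no codepoint values, and the helper
names plane14_tag and size_report are made up for illustration.

```python
from typing import Tuple

# Assumed codepoint layout (not stated anywhere in this thread).
LANGUAGE_TAG = 0xE0001   # U+E0001 LANGUAGE TAG
TAG_OFFSET = 0xE0000     # tag character = ASCII value + 0xE0000


def plane14_tag(lang: str) -> str:
    """Hypothetical helper: spell an ASCII language tag in tag characters."""
    if not all(0x20 <= ord(c) <= 0x7E for c in lang):
        raise ValueError("language tag must be printable ASCII")
    return chr(LANGUAGE_TAG) + "".join(chr(TAG_OFFSET + ord(c)) for c in lang)


def size_report(text: str) -> Tuple[int, int]:
    """Return (number of code points, number of UTF-8 octets)."""
    return len(text), len(text.encode("utf-8"))


plain = "Bonjour"
tagged = plane14_tag("fr") + plain

print(size_report(plain))    # (7, 7)
print(size_report(tagged))   # (10, 19)
```

Under these assumptions every tag character costs four octets in UTF-8, so
a single two-letter tag adds 12 octets ahead of 7 octets of content. A SIZE
constraint counted in characters and a buffer sized in octets will diverge
quickly once tags are allowed to repeat.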

Though I notice no comment here, let me note that your
solution presents yet another interesting test case for
implementors, besides the ones pointed out recently by
DavidK:

   An application receives an unconstrained ietf-UTF8String
   in an optional component, which contains nothing but 32k
   of language tags (no displayable content). GUI guys must
   take care to handle this case.


Well, they must also be prepared to deal with 32K worth of directionality
switching codepoints. Or 32K of combining diacritical codepoints.

Language tags are trivially easy compared to some other aspects of Unicode.
Heck, I've read through the text describing directionality carefully four or
five times and I'm still not sure I have it all straight. (And while you can
ignore tag characters if you want to and things will probably turn out OK, the
same cannot be said for the other stuff.)
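
As a rough illustration of both points (tag characters can simply be
dropped, and a string can be long yet contain nothing to draw), here is a
small Python sketch. It assumes the tag characters occupy U+E0000..U+E007F
and uses Unicode general categories as a crude stand-in for "displayable";
the function names are invented for the example, and real rendering logic
is considerably more involved than this.

```python
import unicodedata

# Assumption: tag characters live in the Plane 14 block U+E0000..U+E007F.
TAG_BLOCK = range(0xE0000, 0xE0080)

# General categories treated here as having nothing to draw on their own:
# format controls (Cf, which covers the bidi controls), other controls (Cc),
# unassigned (Cn), combining marks (Mn, Me) and separators (Zs, Zl, Zp).
NON_DISPLAYABLE = {"Cf", "Cc", "Cn", "Mn", "Me", "Zs", "Zl", "Zp"}


def strip_tag_characters(text: str) -> str:
    """Drop every code point in the assumed tag block; keep the rest."""
    return "".join(ch for ch in text if ord(ch) not in TAG_BLOCK)


def has_displayable_content(text: str) -> bool:
    """True if anything remains that is not a control, mark or separator."""
    return any(
        unicodedata.category(ch) not in NON_DISPLAYABLE
        for ch in strip_tag_characters(text)
    )


only_tags = (chr(0xE0001) + chr(0xE0065) + chr(0xE006E)) * 1000
only_bidi = "\u202e" * 1000    # RIGHT-TO-LEFT OVERRIDE
only_marks = "\u0301" * 1000   # COMBINING ACUTE ACCENT

print(len(strip_tag_characters(only_tags)))    # 0
print(has_displayable_content(only_tags))      # False
print(has_displayable_content(only_bidi))      # False
print(has_displayable_content(only_marks))     # False
print(has_displayable_content(only_tags + "Road Kill"))  # True
```

The tag case is the easy one: the filter is a range check. The bidi and
combining-mark cases fall out of the same test here only because the sketch
ignores what those codepoints actually do to the surrounding text.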

I can look into this further, but I can find no reference at
all in X.690 to Plane 14 characters.

Nor should there be. I realize that there's a bunch of stuff in X.680 and
so on about profiling Unicode, but the IETF already has requirements in
place in this regard, and a standard that doesn't follow them is unlikely
to make it through the IETF process.

True, a bunch of stuff. But that's not relevant here, as
S/MIME is using an ISO-invalid X.208 derivative, not ASN.1
1994 or 1997. Technically, it doesn't even matter what
X.208 requires. For S/MIME, that standard is really just a
guideline.


Point taken.

The current situation is that the IETF's position on charset requires that
language tagging be possible. And various protocols are already on the
standards track which use UTF-8 and which rely on language tagging
facilities being part of UTF-8. So there are basically two possible
outcomes here:

(1) The ISO approves what the UTC has already approved and we have a
    language tagging facility blessed by the UTC, ISO, and the IETF.

(2) The ISO doesn't approve the present facility and the IETF creates its
    own facility instead. This could be tags in the private use range, it
    could be the MLSF proposal, or it could be something else entirely.

If the goal is to break other standards, we should
all hope for the latter. Since ISO is known for
moving cautiously (8 years it's taken them already),
the latter will likely come true.


I'm really not qualified to comment on this, but from what I've heard the
codepoint assignment process is somewhat more streamlined than this. I suspect
it has to be so for political reasons.

I believe these are the _only_ possible outcomes. Adding external language
tags is not a viable solution in far too many cases for the IETF not to
address this problem.

I agree with your last statement, and IESG leadership should be
commended for their efforts at trying to find a solution. Though
one size does not fit all in this case, your solution does
present an interesting case for X.5* Distinguished Names: Is an all
UTF8String( ASCII, French ) DN <=> UTF8String( ASCII, English )?


DNs and X.400 ORnames are a _huge_ mess, but I think the addition of Unicode is
small beer compared to the rest of it. Canonicalization and comparators are
significant issues for Unicode, but language tags really don't add much
complexity to what's already there, and there are strategies in place for
dealing with most of this.
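
To make the comparator point concrete, here is a minimal Python sketch of
one possible policy: ignore tag characters, canonicalize with NFC, then
compare. This is only an assumption for illustration; nothing in this
thread or in the standards mentioned here prescribes it, and the names
tag, comparison_key and values_match are hypothetical.

```python
import unicodedata


def tag(lang: str) -> str:
    """Hypothetical helper: U+E0001 plus tag characters (assumed layout)."""
    return chr(0xE0001) + "".join(chr(0xE0000 + ord(c)) for c in lang)


def comparison_key(value: str) -> str:
    """Strip assumed tag characters, then apply canonical composition."""
    untagged = "".join(
        ch for ch in value if not 0xE0000 <= ord(ch) <= 0xE007F
    )
    return unicodedata.normalize("NFC", untagged)


def values_match(a: str, b: str) -> bool:
    """Compare two attribute values under the tag-blind policy above."""
    return comparison_key(a) == comparison_key(b)


french = tag("fr") + "Andre\u0301"    # decomposed e + combining acute
english = tag("en") + "Andr\u00e9"    # precomposed e-acute

print(french == english)              # False
print(values_match(french, english))  # True
```

Note that the normalization step does most of the work here; stripping the
tags is a single range check. Whether a French-tagged and an English-tagged
value should compare equal at all remains a policy question that this
sketch simply answers "yes" to.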

No, the bigger problem for DNs and X.400 ORNames is the myriad different
charsets that are allowed, the allowance for fields in multiple forms, the fact
that some of these things are imprecisely and incompletely specified, and
finally that actual use is markedly at odds with what the standards require.
(Unfortunately I have to deal with this stuff on a daily basis so I'm all too
familiar with how sticky it can get.)

Not likely, at least not in any backwards compatible way. But this
is probably out of scope for S/MIME, and not a problem. Once the
language tags are stripped and the true plain text displayed, I
can imagine a few programmers may be a bit perplexed.

BTW. UTF8String has now been added to DistinguishedName{} and
all it awaits is to pass ballot. No language tag issues are
included as yet though, but that's an issue for LDAP/X.500
to work I'd think.


I've found that in practice these sorts of issues must be dealt with in the
field, with little or no guidance from standards bodies.

I find your logic circular. Once you place tags such as you suggest
in a sender's ordinary text, it's no longer "plain text". It is
tagged text - text that contains tags. But I suppose it will not
matter. Already users have grown accustomed to seeing the likes
of

...


Again, I'm really not enough of an expert to be arguing this point, but for the
experts in this area there are "tags" and then there are "tags". A clear
distinction is made between markup languages and their tags and the various
codepoints in Unicode that do not cause something to be displayed but instead
affect the manner in which display operations are carried out. The former makes
text into something more than "plain", whereas the latter doesn't.

I understand the technical differences here, of course, but I'm not
sufficiently expert at this to argue about where the line should be drawn and
why. For that you really need to be talking to someone like Glenn Adams.

                                Ned