How to do UTF-8

Here's a summary of the discussion so far, with my tentative conclusions:

** Need for UTF-8 in the privacy mark **

The privacy mark is free text that may be shown to a human, and the human
may need to make a decision based on the text shown. Thus, the text must be
in a charset that is understandable by the human. The previous specification
had this as a PrintableString because that's what's in X.411.

** Using UTF-8 **

The IETF has decided to go with UTF-8, where possible, as its charset of
choice in things that will get shown to human users. See
<http://www.imc.org/rfc2277> for more information on this. See
<http://www.imc.org/rfc2279> for a description of how to do UTF-8.

** How to specify UTF-8 in 1988 ASN.1 **

Things got a bit hairier here. Three proposals had general approval:

(1) Use an OCTET STRING that has a comment saying that it has to be filled
with a UTF-8 string
(2) Use "[0] IMPLICIT UTF8String", and people whose compilers don't know
what a UTF8String is can change this to "[0] IMPLICIT OCTET STRING" and
still have the same bits on the wire
(3) Define UTF8String at the beginning of the module as "UTF8String ::=
[UNIVERSAL 12] IMPLICIT OCTET STRING" and then just use UTF8String in the
code (with no IMPLICIT)
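
To make the "same bits on the wire" point concrete, here is a rough sketch
in Python of what each proposal would put on the wire for a short mark. The
encode() helper and the sample text are my own illustration, not part of
any ASN.1 toolkit, and it only handles primitive types with single-byte
tags.

```python
from typing import Optional

OCTET_STRING = 0x04   # UNIVERSAL 4, primitive
UTF8_STRING = 0x0C    # UNIVERSAL 12, primitive
CONTEXT_0 = 0x80      # context-specific [0], primitive


def encode(base_tag: int, content: bytes,
           implicit_tag: Optional[int] = None) -> bytes:
    """Hypothetical helper: DER-encode one primitive value.

    An IMPLICIT tag replaces the tag of the underlying type, so only the
    first octet changes; length and content octets stay the same.
    """
    tag = base_tag if implicit_tag is None else implicit_tag
    n = len(content)
    if n < 0x80:
        length = bytes([n])                      # short form
    else:
        size = (n.bit_length() + 7) // 8         # long form
        length = bytes([0x80 | size]) + n.to_bytes(size, "big")
    return bytes([tag]) + length + content


mark = "Priv\u00e9".encode("utf-8")              # sample text, 6 octets

option1 = encode(OCTET_STRING, mark)                               # (1)
option2_new = encode(UTF8_STRING, mark, implicit_tag=CONTEXT_0)    # (2)
option2_old = encode(OCTET_STRING, mark, implicit_tag=CONTEXT_0)   # (2), edited
option3 = encode(OCTET_STRING, mark, implicit_tag=UTF8_STRING)     # (3)
native_1997 = encode(UTF8_STRING, mark)          # compiler that knows UTF8String

for name, value in [("option1", option1), ("option2_new", option2_new),
                    ("option2_old", option2_old), ("option3", option3),
                    ("native_1997", native_1997)]:
    print(f"{name:12} {value.hex()}")
```

In this sketch both spellings of (2) give identical octets, and (3) gives
the same octets as a compiler that knows UTF8String natively. Only (1)
leaves a plain OCTET STRING tag on the wire.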

I have chosen to go with (3) because it is the cleanest, and the ASN.1
compilers already must be able to do something like this because they have
to be able to do it for BMPString. Although this violates the ASN.1
standard, it is quite easy to implement.

The reason I want to get away from (1), which I put in the current version,
is that people tend not to read comments, and might stuff the OCTET STRING
with, well, octets, and not UTF-8 characters. There is nothing in that
construct that would help a good compiler do UTF-8 checking.
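
As a rough illustration of the kind of check a tool could apply once it
knows the field is supposed to hold UTF-8, here is a small Python sketch.
The function name is made up for this example; it leans on Python's strict
UTF-8 decoder rather than on any ASN.1 compiler feature.

```python
def is_valid_utf8(octets: bytes) -> bool:
    """Return True if the octets decode cleanly as UTF-8 (strict mode)."""
    try:
        octets.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


good = "Priv\u00e9".encode("utf-8")   # e-acute as the two octets C3 A9
bad = b"Priv\xe9"                     # the same text as Latin-1 octets

print(is_valid_utf8(good))
print(is_valid_utf8(bad))
```

A bare OCTET STRING would accept both values; only something that knows the
type is UTF8String has a reason to reject the second one.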

In choosing between (2) and (3), I figured that only a few compilers even
know how to do UTF8String today (it's in the 1997 spec, not even the 1994
spec), and those that do can simply comment out the definition line (which
purists would happily want to do anyways). Further, the changes needed for
the few people who do have 1997 compilers are far easier than the changes
needed for the rest of us using method (2). And, again, once we made the
change in (2), someone later reading the code would not know that this had
to be a UTF-8 string.

** Length limiting **

People like to have lengths specified for strings so they can tell how much
memory to allocate for buffers. However, the memory needed for a UTF-8
string cannot be computed exactly from its character count, since the
number of bytes per character varies. The current privacy mark string is
limited to 128 characters, but it is easy to imagine privacy marks that
could legitimately be longer than this. Thus, I'm choosing not to put a
size limit on the UTF8String.
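
To show how far the byte count can drift from the character count, here is
a small Python sketch. The sample characters and the 128-character figure
are just for illustration; utf8_len() is a made-up helper.

```python
def utf8_len(text: str) -> int:
    """Number of octets the text occupies once encoded as UTF-8."""
    return len(text.encode("utf-8"))


samples = {
    "LATIN CAPITAL LETTER A": "A",
    "LATIN SMALL LETTER E WITH ACUTE": "\u00e9",
    "EURO SIGN": "\u20ac",
    "MUSICAL SYMBOL G CLEF": "\U0001d11e",
}

for name, ch in samples.items():
    print(f"{name}: {utf8_len(ch)} octet(s)")

LIMIT = 128  # characters, as in the current privacy mark limit
smallest = utf8_len("A" * LIMIT)            # all one-octet characters
largest = utf8_len("\U0001d11e" * LIMIT)    # all four-octet characters
print(f"{LIMIT} characters need between {smallest} and {largest} octets")
```

Python's encoder stops at four octets per character. RFC 2279 allows
sequences of up to six octets, so the worst case under that document is
larger still.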

** Language tagging **

Ned Freed pointed out that the current direction we're supposed to be
pursuing is language tags embedded in UTF-8. However, the draft for that
functionality has just been proposed this week in the IETF, and is unlikely
to be finalized before S/MIME. I believe strongly in language tagging, and
want to get people in the habit of doing it from the start. Thus, I'm going
to use an optional language tag with a comment that says if you have
embedded the language tag in the UTF8String, do not include the language
tag. This lets us have language tags now, and lets people move to embedded
language tags when that spec is final.

** And what I propose is **

-- At the beginning of the module
UTF8String ::= [UNIVERSAL 12] IMPLICIT OCTET STRING

-- Then
ESSPrivacyMark ::= CHOICE {
    pString      PrintableString (SIZE (1..ub-privacy-mark-length)),
    freeText     ESSFreeText
}

ESSFreeText ::= SEQUENCE {
    languageTag  PrintableString OPTIONAL,
    text         UTF8String
    --languageTag is a language tag defined in [RFC1766]; it SHOULD NOT
    --   be used if the text string has an embedded UTF-8 language tag
}
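
For anyone who wants to see the octets, here is a rough Python sketch of
how an ESSFreeText value might come out in DER, with and without the
optional languageTag. The helper names and sample values are mine, the
sketch assumes no extra tagging is applied to the SEQUENCE fields, and it
only handles short-form lengths.

```python
from typing import Optional

SEQUENCE = 0x30          # UNIVERSAL 16, constructed
PRINTABLE_STRING = 0x13  # UNIVERSAL 19
UTF8_STRING = 0x0C       # UNIVERSAL 12, per the definition above


def tlv(tag: int, content: bytes) -> bytes:
    """Hypothetical helper: tag, short-form length, content."""
    if len(content) >= 0x80:
        raise ValueError("this sketch only handles short-form lengths")
    return bytes([tag, len(content)]) + content


def encode_free_text(text: str, language_tag: Optional[str] = None) -> bytes:
    """Encode ESSFreeText; languageTag is left out entirely when absent."""
    fields = b""
    if language_tag is not None:
        fields += tlv(PRINTABLE_STRING, language_tag.encode("ascii"))
    fields += tlv(UTF8_STRING, text.encode("utf-8"))
    return tlv(SEQUENCE, fields)


with_tag = encode_free_text("Priv\u00e9", language_tag="fr")
without_tag = encode_free_text("Priv\u00e9")

print(with_tag.hex())
print(without_tag.hex())
```

Because languageTag is OPTIONAL and the two fields carry different tags, a
decoder can tell from the first tag inside the SEQUENCE whether a language
tag is present.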


--Paul Hoffman, Director
--Internet Mail Consortium