ietf-smime

Re: How to do UTF-8

1998-02-21 00:36:37
On Thu, 19 Feb 1998, Paul Hoffman / IMC wrote:

** Using UTF-8 **

The IETF has decided to go with UTF-8, where possible, as its charset of
choice in things that will get shown to human users. See
<http://www.imc.org/rfc2277> for more information on this. See
<http://www.imc.org/rfc2279> for a description of how to do UTF-8.

** How to specify UTF-8 in 1988 ASN.1 **

Things got a bit hairier here. Three proposals had general approval:

(1) Use an OCTET STRING that has a comment saying that it has to be filled
with a UTF-8 string
(2) Use "[0] IMPLICIT UTF8String", and people whose compilers don't know
what a UTF8String is can change this to "[0] IMPLICIT OCTET STRING" and
still have the same bits on the wire
(3) Define UTF8String at the beginning of the module as "UTF8String ::=
[UNIVERSAL 12] IMPLICIT OCTET STRING" and then just use UTF8String in the
code (with no IMPLICIT)
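
To make the "same bits on the wire" point concrete, here is a minimal
Python sketch of the identifier octet each proposal produces. The helper
der_tlv and the sample string are made up for illustration; it assumes a
primitive encoding, a single-octet tag, and a short-form length.

```python
def der_tlv(tag: int, content: bytes) -> bytes:
    """Encode a primitive TLV with a one-octet tag and a short-form length."""
    if len(content) > 127:
        raise ValueError("sketch only handles short-form lengths (< 128 octets)")
    return bytes([tag, len(content)]) + content


# Hypothetical privacy mark, already encoded as UTF-8 (6 octets).
text = "Privé".encode("utf-8")

octet_string = der_tlv(0x04, text)  # (1) OCTET STRING, UNIVERSAL 4
context_zero = der_tlv(0x80, text)  # (2) [0] IMPLICIT ..., context-specific 0
utf8_string = der_tlv(0x0C, text)   # (3) [UNIVERSAL 12] IMPLICIT OCTET STRING

for name, encoded in [("(1)", octet_string),
                      ("(2)", context_zero),
                      ("(3)", utf8_string)]:
    print(name, encoded.hex())
```

Only the first octet differs between the three. For (2) the tag is 0x80
whether the underlying type is declared as UTF8String or OCTET STRING, and
for (3) the tag 0x0C is the one a 1997 compiler emits for a native
UTF8String.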

I have chosen to go with (3) because it is the cleanest, and the ASN.1
compilers already must be able to do something like this because they have
to be able to do it for BMPString. Although this violates the ASN.1
standard, it is quite easy to implement.

The reason I want to get away from (1), which I put in the current version,
is that people tend not to read comments, and might stuff the OCTET STRING
with, well, octets, and not UTF8 characters. There is nothing in that
construct that would help a good compiler do UTF-8 checking.
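
As a rough illustration of the kind of check a UTF8String-aware tool could
apply and a bare OCTET STRING cannot, here is a Python sketch. The function
name is_valid_utf8 is hypothetical. Note that Python's decoder follows the
later, stricter UTF-8 definition, so it rejects some sequences that
RFC 2279 permitted.

```python
def is_valid_utf8(octets: bytes) -> bool:
    """Return True if the octets decode as well-formed UTF-8."""
    try:
        octets.decode("utf-8")  # strict error handling is the default
    except UnicodeDecodeError:
        return False
    return True


print(is_valid_utf8("Privé".encode("utf-8")))  # well-formed UTF-8
print(is_valid_utf8(b"\xff\xfe\x00"))          # arbitrary octets, not UTF-8
```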

In choosing between (2) and (3), I figured that only a few compilers even
know how to do UTF8String today (it's in the 1997 spec, not even the 1994
spec), and those that do can simply comment out the definition line (which
purists would happily want to do anyways).

Paul, what you are doing is not a good thing.  People who don't know any
better will look at the "S/MIME Standard" and see its invalid use of ASN.1
and conclude that since it is an approved standard the ASN.1 usage must be
valid.  And so they follow suit in writing their own standard, no doubt
introducing their own invalidity.  For example, I recently saw on this
list a proposal by someone to use OCTET STRING to carry UTF8String, with
the justification being that it was done in LDAP.  This is not a matter of
"purity" versus pragmatism, but of the need for standards to be followed
wherever possible.

Further, the changes needed for the few people who do have 1997
compilers are far easier than the changes needed for the rest of us
using method (2).

Isn't it just as easy for those who don't use a compiler that supports
UTF8String to add the following at the top of their code?

        UTF8String ::= [UNIVERSAL 12] IMPLICIT OCTET STRING

You probably would not welcome it if another standard that draws upon
S/MIME were to bastardize it for the convenience of a particular group of
tools, for such bastardization accumulates over time and eventually yields
interoperability problems.
If there were an overwhelming reason to violate a standard you would have
no choice, but here you are choosing to violate the ASN.1 standard
further than S/MIME already does simply so that users of particular
tools do not have to insert the single statement shown above.  Sigh.

** Length limiting **

People like to have lengths specified for strings so they can tell how much
memory to allocate for buffers. However, accurately figuring out the memory
use for a UTF-8 string of some number of characters is impossible since the
number of bytes per character changes. The current privacy mark string is
limited to 128 characters, but it is easy to imagine privacy marks that could
legitimately be longer than this. Thus, I'm choosing not to put a size
limit on the UTF8String.

If you multiply the upper bound on a Unicode size constraint by 6 you will
get the maximum number of bytes possible in a value.  More practically,
multiply by 3 and you get the maximum number of bytes needed to support
all the characters supported by BMPString.  I say "practical"  because my
hunch is that nobody will soon be having signatures of values that lie
outside of the ISO/IEC 10646-1 Basic Multilingual Plane :-).
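
The arithmetic can be sketched in Python as follows. The function name
max_utf8_bytes and the 128-character bound (taken from the privacy mark
limit mentioned above) are illustrative only. The factor of 6 is the
RFC 2279 maximum; Python's own codec follows the later 4-byte limit, so
only the BMP case is checked against a real encoding here.

```python
def max_utf8_bytes(max_chars: int, bmp_only: bool = True) -> int:
    """Worst-case UTF-8 size in bytes for a string of at most max_chars characters.

    Assumes 3 bytes per character for the Basic Multilingual Plane and
    6 bytes per character for the full range allowed by RFC 2279.
    """
    bytes_per_char = 3 if bmp_only else 6
    return max_chars * bytes_per_char


# The highest BMP code point, U+FFFF, takes 3 bytes in UTF-8.
assert len("\uffff".encode("utf-8")) == 3

print(max_utf8_bytes(128))                  # BMP-only bound
print(max_utf8_bytes(128, bmp_only=False))  # RFC 2279 worst case
```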
 
--------------------------------------------------------------------------
Bancroft Scott                                Toll Free    :1-888-OSS-ASN1
Open Systems Solutions, Inc.                  International:1-609-987-9073
baos(_at_)oss(_dot_)com                       Tech Support :1-732-249-5107
http://www.oss.com                            Fax          :1-732-249-4636

