There has been some discussion on the pem-dev list regarding the appropriate
choice of character sets for DirectoryString, especially X.500 and X.509
organizational names and common names, that has raised several interesting
questions.
According to X.520 (1993), DirectoryString (which is the base attribute used
for all X.500 Names, including organization, organizationalUnit,
commonName, etc.) includes a choice of TeletexString, PrintableString,
and UNIVERSAL STRING.
It appears that UNIVERSAL STRING is not yet stable, as defect reports are
being submitted and there is not yet firm agreement as to whether it should
be a 16-bit or 32-bit code, although there is emerging recognition that it
will be necessary to deal with languages other than the western European
languages in the near future.
As I currently understand it, PrintableString includes:
Capital letters A,B,...Z; Small letters a,b,...z; Digits 0,1,...9; Space (space);
Apostrophe '; Left parenthesis (; Right parenthesis ); Plus sign +; Comma ,;
Hyphen -; Full stop .; Solidus /; Colon :; Equal sign =; Question mark ?
The commonly used characters missing from this list include all of the national
currency symbols (Dollar, Pound sterling, Yen), the @-sign frequently used
for e-mail addresses, and the & sign frequently used in corporate names.
Semicolon, asterisk, pound/number (#), per cent, caret, underscore, vertical
bar, backslash, tilde, reverse accent, and a whole host of diacritical marks
and special symbols are also missing, but these are less commonly used, at
least in English names.
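To make the restriction concrete, here is a minimal sketch in Python that checks a candidate name against the PrintableString repertoire exactly as enumerated above. The function names and the sample names are hypothetical, chosen only for illustration.

```python
# Characters permitted in PrintableString, as enumerated in the list above.
PRINTABLE_STRING_CHARS = frozenset(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789"
    " '()+,-./:=?"
)


def non_printable_string_chars(name):
    """Return the characters of `name` that PrintableString does not allow,
    in order of first appearance and without repeats."""
    rejected = []
    for ch in name:
        if ch not in PRINTABLE_STRING_CHARS and ch not in rejected:
            rejected.append(ch)
    return rejected


def is_printable_string(name):
    """True if every character of `name` is in the PrintableString set."""
    return not non_printable_string_chars(name)


if __name__ == "__main__":
    # Hypothetical names, not taken from any registry.
    for candidate in ["Example Widgets, Inc.", "Smith & Sons", "user@example.com"]:
        print(candidate, "->", non_printable_string_chars(candidate) or "OK")
```

Any name that comes back with a non-empty list would have to be carried in some other string type, or altered.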
It therefore appears that the Teletex string would be the most appropriate
attribute type to be used for names in X.500 directories, including
Distinguished Names included within X.509 certificates.
According to the information available to me at present, Teletex (previously
T.61) includes the following characters within the primary set of graphic
characters:
[space] ! " [note 4] [note 4] % & ' ( ) * + , - . /
0 1 2 3 4 5 6 7 8 9 : ; < = > ?
@ A B C D E F G H I J K L M N O
P Q R S T U V W X Y Z [ [nochar] ] [nochar] [note 1 _]
[nochar] a b c d e f g h i j k l m n o
p q r s t u v w x y z [nochar] | [nochar] [nochar] [nochar]
Note 1: "When interworking with Videotex, this code shall have
the meaning _delimiter_."
Note 4: "Teletex terminals should only send the codes 10/6
and 10/8 for graphic characters [not equal] and [lozenge].
When receiving codes 2/3 and 2/4 terminals should interpret
them as # and [lozenge]." [Position 2/4 is the international currency
symbol - RRJ]
The secondary set of graphic characters includes characters for inverted
exclamation point, cent, pound sterling, dollar, yen, pound, section symbol,
lozenge, <<, degree, plus-minus, superscript-2, superscript-3, times, micron,
paragraph symbol, middle-dot, divide sign, >>, 1/4, 1/2, 3/4, inverted
question mark, and some 32 miscellaneous characters and diphthongs for
languages like Icelandic, German, and French, in positions 10/2 through
11/15, and 14/0 through 15/14.
The T.61 text goes on to say,
"The supplementary set contains 13 diacritical marks that are used
in combination with the letters of the basic Latin alphabet in the primary
set to constitute the coded representations of accented letters and umlauts.
These diacritical marks, and their coded representations, are:
Acute accent 12/2
Grave accent 12/1
Circumflex accent 12/3
Diaeresis or umlaut mark 12/8
Tilde 12/4
Caron 12/15
Breve 12/6
Double acute accent 12/13
Ring 12/10
Dot 12/7
Macron 12/5
Cedilla 12/11
Ogonek 12/14"
According to Richard Ankney the diacritical marks are _nonspacing_
characters which precede the spacing characters and form a
composite character. Of course not all combinations of diacritical marks
and alphabetic characters are allowed -- one cannot create a 2-umlaut
or an X-tilde, for example. The allowable combinations are shown in
table B-1/T.61, which I cannot include in this message for obvious
reasons.
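As an illustration of the "mark first, letter second" rule, the following Python sketch decodes a sequence of T.61 octets into composed characters. Several things here are my assumptions and not part of T.61: the mapping of each mark to a Unicode combining character, the treatment of all octets below 128 as plain ASCII (which ignores Notes 1 and 4 above), and the use of "a precomposed character exists" as a crude stand-in for table B-1/T.61, which it certainly does not reproduce. The name decode_t61_letters is hypothetical.

```python
import unicodedata

# Nonspacing diacritical marks at the column/row positions listed above
# (octet = 16 * column + row). The Unicode combining characters on the
# right-hand side are an assumption of this sketch.
T61_DIACRITICS = {
    0xC1: "\u0300",  # 12/1  grave accent
    0xC2: "\u0301",  # 12/2  acute accent
    0xC3: "\u0302",  # 12/3  circumflex accent
    0xC4: "\u0303",  # 12/4  tilde
    0xC5: "\u0304",  # 12/5  macron
    0xC6: "\u0306",  # 12/6  breve
    0xC7: "\u0307",  # 12/7  dot
    0xC8: "\u0308",  # 12/8  diaeresis or umlaut mark
    0xCA: "\u030A",  # 12/10 ring
    0xCB: "\u0327",  # 12/11 cedilla
    0xCD: "\u030B",  # 12/13 double acute accent
    0xCE: "\u0328",  # 12/14 ogonek
    0xCF: "\u030C",  # 12/15 caron
}


def decode_t61_letters(data):
    """Decode primary-set octets and diacritic+letter pairs into a str.

    The diacritical mark comes first and the base letter second, so the
    pair is swapped before composing. Other secondary-set octets are not
    handled and raise ValueError.
    """
    out = []
    i = 0
    while i < len(data):
        octet = data[i]
        if octet in T61_DIACRITICS:
            if i + 1 >= len(data):
                raise ValueError("diacritical mark with no following letter")
            base = chr(data[i + 1])
            if not (base.isascii() and base.isalpha()):
                raise ValueError("diacritical mark must precede a basic Latin letter")
            composed = unicodedata.normalize("NFC", base + T61_DIACRITICS[octet])
            # Stand-in for table B-1/T.61 (an assumption): accept the pair
            # only if it composes to a single precomposed character.
            if len(composed) != 1:
                raise ValueError("combination not accepted by this sketch")
            out.append(composed)
            i += 2
        elif octet < 0x80:
            out.append(chr(octet))  # assumption: primary set treated as ASCII
            i += 1
        else:
            raise ValueError("secondary-set octet not handled by this sketch")
    return "".join(out)


if __name__ == "__main__":
    # 12/8 (umlaut mark) followed by "u" yields u-umlaut.
    print(decode_t61_letters(b"M" + bytes([0xC8]) + b"uller"))
```

With this stand-in, a 2-umlaut and an X-tilde are both rejected, which matches the two examples above, but agreement with table B-1 on every other combination should not be assumed.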
At present, if an organization chooses to register its name with ANSI,
the instructions for doing this specify that "the characters must be
taken from the set defined in registration 102, the Teletex Set of Primary
Graphic Characters of the ISO International Register of Coded
Character Sets to be used with Escape Sequences plus space. The Escape
Sequences follow:
G0: ESC 2/8 7/5
G1: ESC 2/9 7/5
G2: ESC 2/10 7/5
G3: ESC 2/11 7/5
C0: -
C1: -
A copy of the allowable characters as found in the Teletex
Primary Set of Graphic Characters is attached. Please note,
the international currency symbol (position 02/04) is not supported."
(At this time, I do not know how the escape sequences are intended to be
used. At one time it appeared that T.61 was intended to be extensible
to other character sets, but this may have been overtaken by the
UNIVERSAL STRING and/or BMP specification effort. More research
is needed.)
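For readers not used to the column/row notation, this small Python sketch converts the positions quoted above into octet values. It shows only what bytes the four designating sequences consist of; it says nothing about how they are intended to be used, which remains an open question. The function names are hypothetical.

```python
ESC = 0x1B  # the ESC control character, position 1/11


def position_to_byte(position):
    """Convert a 'column/row' position such as '2/8' to its octet value."""
    column_text, row_text = position.split("/")
    column, row = int(column_text), int(row_text)
    if not (0 <= column <= 15 and 0 <= row <= 15):
        raise ValueError("column and row must each be in the range 0..15")
    return 16 * column + row


def designation_bytes(intermediate, final):
    """Build the three octets ESC, intermediate, final from positions."""
    return bytes([ESC, position_to_byte(intermediate), position_to_byte(final)])


if __name__ == "__main__":
    # The four sequences quoted in the ANSI instructions above.
    for name, intermediate in [("G0", "2/8"), ("G1", "2/9"),
                               ("G2", "2/10"), ("G3", "2/11")]:
        print(name, designation_bytes(intermediate, "7/5").hex(" "))
```

The same conversion applies to the diacritical mark positions, so 12/2 (acute accent) is octet 0xC2.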
At present, therefore, it appears that ANSI will only register names
containing characters from the primary set, without the diacritical marks
or special characters such as dollar sign, despite the fact that X.500
would presumably allow the additional secondary characters and
diacritical marks within a DirectoryString. On the other hand, ANSI will
allow a 100 character name to be registered, although X.500 has a
"recommended" limit of 64 characters on such names.
A quick check with the office of the Secretary of the Commonwealth of
Massachusetts revealed that at one time the organizations "$50 Club, Inc."
and "$aving$ Express, Inc." were registered corporate names (not just
trademarks). (Even more surprising, "Toys "R" Us of Massachusetts, Inc"
was registered, using the backwards R! I suspect that this _was_ a
trademark, not their corporate name, but that is what I was told. I don't
know how they got it to display on their Wang terminal, but it did.)
It is therefore obvious that such characters will be required to
support already existing corporate names, even if ANSI wouldn't
register them at the national level under their current policy.
In addition, considering the impact or at least the spirit of NAFTA (but
without having read the legislation for any particular requirements), it
would appear that we should be prepared to support English, French,
and Spanish names at a minimum. And given the national sensitivities
involved, suggesting that French or Spanish names be transliterated
into their English equivalents for the convenience of various
standards organizations is probably a political nonstarter.
Likewise, common names in Mexico in particular are often quite long,
and including the full legal name of organizations operating under various
Doing Business As, Trading As, and Fictitious Name statutes within a name
may also generate lengthy names. In general, telling people and/or
organizations that they must change their name or truncate it is met
with considerable resistance.
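The mismatch between the two length limits mentioned earlier (64 characters recommended by X.500, 100 characters accepted by ANSI) can be shown with a trivial Python sketch. The constant and function names are hypothetical, and counting with len() assumes one list element per character, which glosses over how a nonspacing diacritical mark should be counted.

```python
X500_RECOMMENDED_LIMIT = 64   # "recommended" X.500 bound on such names
ANSI_REGISTRATION_LIMIT = 100  # length ANSI will accept for registration


def length_report(name):
    """Report whether `name` fits each of the two limits."""
    length = len(name)
    return {
        "length": length,
        "fits_ansi": length <= ANSI_REGISTRATION_LIMIT,
        "fits_x500": length <= X500_RECOMMENDED_LIMIT,
    }


if __name__ == "__main__":
    # A made-up 80-character name: registrable with ANSI, but over the
    # X.500 recommended bound.
    print(length_report("N" * 80))
```

Any name between 65 and 100 characters falls into the gap: it can be registered but exceeds the recommended directory bound.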
I would therefore like to make the following recommendations:
1. That the PEM community, the NADF, and the US Joint Registration
Authority Committee (which provides policy direction to ANSI in this
area) give immediate consideration to adopting the TeletexString, including
both the primary and secondary sets of graphic characters, as their
primary recommended string type for X.500 and X.509 names (at least
until UNIVERSAL STRING is better defined and more widely supported).
2. That the TeletexString support within various X.500 and X.509
implementations, including Directory System Agents, Directory User Agents,
and PEM and similar agents, be extended if and as necessary to provide
explicit support for the entire secondary set of graphic characters,
specifically including the nonspacing diacritical marks, and that these
implementations properly input, display, and print such composite characters.
3. That X.500 DSA and DUA implementations within the NADF, and
X.509 implementations within PEM and elsewhere support upper
bounds of 128 characters (instead of 64) for ub-name,
ub-common-name, ub-organization-name, ub-organizational-unit-name, and
ub-title to support existing national registration procedures and to
conform to existing upper bounds for state, locality, street address,
etc., and that a defect report be submitted to X.500 to make this
change a permanent part of the nonbinding recommendations in X.520.
4. That the standards bodies concerned with X.400 and similar message
transport systems also consider these issues.
As part of the process of evaluating these suggestions, I would
appreciate comments from the developers of various implementations
of X.500 and X.509 as to whether their systems would currently
support these recommendations, and the impact on their products
of adding these capabilities.
Of course, corrections to my current understanding of these various
issues would be most welcome.
Robert R. Jueneman
GTE Laboratories
40 Sylvan Road
Waltham, MA 02254
617/466-2820
617/466-2603 FAX
Internet: Jueneman(_at_)GTE(_dot_)COM