pem-dev

Recommendations for DirectoryString character set

1994-03-02 13:27:00
There has been some discussion on the pem-dev list regarding the appropriate
choice of character sets for DirectoryString, especially X.500 and X.509 
organizational names and common names, that has raised several interesting 
questions.

According to X.520 (1993), DirectoryString (which is the base attribute used
 for all X.500 Names, including organization, organizationalUnit, 
commonName, etc.) includes a choice of TeletexString, PrintableString, 
and UNIVERSAL STRING.

It appears that UNIVERSAL STRING is not yet stable, as defect reports are 
being submitted and there is not yet firm agreement as to whether it should 
be a 16 bit or 32 bit code, although there is emerging recognition that it
will be necessary to deal with languages other than the western European 
languages in the near future.

As I currently understand it, PrintableString includes

Capital letters A,B,...Z; Small letters a,b,...z; Digits 0,1,...9;
Space (space); Apostrophe '; Left parenthesis (; Right parenthesis );
Plus sign +; Comma ,; Hyphen -; Full stop .; Solidus /; Colon :;
Equal sign =; Question mark ?

The commonly used characters missing from this list include all of the
national currency symbols (Dollar, Pound sterling, Yen), the @-sign 
frequently used for e-mail addresses, and the & sign frequently used in 
corporate names. Semicolon, asterisk, pound/number (#), per cent, caret, 
underscore, vertical bar, back-slash, tilde, reverse accent, and a whole 
host of diacritical marks and special symbols are also missing, but these 
are less commonly used, at least in English names.
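
As an illustration of the point above, here is a minimal Python sketch 
that tests a candidate name against the PrintableString repertoire as 
listed in this message. The names PRINTABLE_STRING_CHARS, 
non_printable_chars, and is_printable_string are made up for this 
example and are not taken from any X.500 implementation.

```python
import string

# PrintableString repertoire as listed above: letters, digits, space,
# and the punctuation marks ' ( ) + , - . / : = ?
PRINTABLE_STRING_CHARS = frozenset(
    string.ascii_letters + string.digits + " '()+,-./:=?"
)


def non_printable_chars(name):
    """Return the characters of name that fall outside PrintableString,
    in order of first appearance and without repeats."""
    offending = []
    for ch in name:
        if ch not in PRINTABLE_STRING_CHARS and ch not in offending:
            offending.append(ch)
    return offending


def is_printable_string(name):
    """True if every character of name is in the PrintableString set."""
    return not non_printable_chars(name)


if __name__ == "__main__":
    for candidate in ("GTE Laboratories", "$50 Club, Inc.", "AT&T"):
        print(candidate, "->", non_printable_chars(candidate) or "ok")
```

A name such as "$50 Club, Inc." is rejected solely because of the 
dollar sign, which is the kind of case that motivates a larger 
character set.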

It therefore appears that the Teletex string would be the most appropriate 
attribute type to be used for names in X.500 directories, including 
Distinguished Names included within X.509 certificates.

According to the information available to me at present, Teletex 
(previously T.61) includes the following characters within the primary 
set of graphic characters:

[space] ! " [note 4] [note 4] % & ' ( ) * + , - . /
0 1 2 3 4 5 6 7 8 9 : ; < = > ?
@ A B C D E F G H I J K L M N O
P Q R S T U V W X Y Z [ [nochar] ] [nochar] [note 1 _]
[nochar] a b c d e f g h i j k l m n o
p q r s t u v w x y z [nochar] | [nochar] [nochar] [nochar]

Note 1: "When interworking with Videotex, this code shall have 
the meaning _delimiter_."

Note 4: "Teletex terminals should only send the codes 10/6 
and 10/8 for graphic characters [not equal] and [lozenge]. 
When receiving codes 2/3 and 2/4 terminals should interpret 
them as # and [lozenge]." [Position 2/4 is the international currency
symbol - RRJ]

The secondary set of graphics characters includes characters for inverted 
exclamation point, cent, pound sterling, dollar, yen, pound, section symbol,
lozenge, <<, degree, plus-minus, superscript-2, superscript-3, times, micron, 
paragraph symbol, middle-dot, divide sign, >>, 1/4, 1/2, 3/4, inverted 
question mark, and some 32 miscellaneous characters and diphthongs for
languages like Icelandic, German, and French, in positions 10/2 through 
11/15, and 14/0 through 15/14.

The T.61 text goes on to say, 

"The supplementary set contains 13 diacritical marks that are used 
in combination with the letters of the basic Latin alphabet in the primary 
set to constitute the coded representations of accented letters and umlauts.
These diacritical marks, and their coded representations, are:

Acute accent 12/2
Grave accent 12/1
Circumflex accent 12/3
Diaeresis or umlaut mark 12/8
Tilde 12/4
Caron 12/15
Breve 12/6
Double acute accent 12/13
Ring 12/10
Dot 12/7
Macron 12/5
Cedilla 12/11
Ogonek 12/14"

According to Richard Ankney, the diacritical marks are _nonspacing_ 
characters which precede the spacing characters and form a 
composite character. Of course not all combinations of diacritical marks 
and alphabetic characters are allowed -- one cannot create a 2-umlaut 
or an X-tilde, for example. The allowable combinations are shown in 
table B-1/T.61, which I cannot include in this message for obvious
reasons.
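
To make the composite-character mechanism concrete, here is a rough 
Python sketch that turns a T.61-style octet string into Unicode text. 
It rests on several assumptions that are not from T.61 itself: a 
position written column/row corresponds to the octet value 
16*column + row; each diacritical mark corresponds to the Unicode 
combining character named in the comments; and octets 2/0 through 7/14 
are treated as plain ASCII, which is a simplification because positions 
such as 2/3 and 2/4 differ in the primary set. The function names are 
invented for the example, and the sketch does not enforce the 
allowable combinations of table B-1/T.61.

```python
import unicodedata

# Diacritical marks and positions quoted above, mapped to the Unicode
# combining characters assumed (for this sketch) to correspond to them.
T61_DIACRITICS = {
    (12, 1): "\u0300",   # grave accent
    (12, 2): "\u0301",   # acute accent
    (12, 3): "\u0302",   # circumflex accent
    (12, 4): "\u0303",   # tilde
    (12, 5): "\u0304",   # macron
    (12, 6): "\u0306",   # breve
    (12, 7): "\u0307",   # dot
    (12, 8): "\u0308",   # diaeresis or umlaut mark
    (12, 10): "\u030A",  # ring
    (12, 11): "\u0327",  # cedilla
    (12, 13): "\u030B",  # double acute accent
    (12, 14): "\u0328",  # ogonek
    (12, 15): "\u030C",  # caron
}


def position_to_octet(column, row):
    """Convert column/row notation (e.g. 12/2) to an octet value."""
    return column * 16 + row


COMBINING_BY_OCTET = {
    position_to_octet(col, row): mark
    for (col, row), mark in T61_DIACRITICS.items()
}


def decode_composites(data):
    """Decode octets in which a nonspacing mark PRECEDES its base letter."""
    out = []
    i = 0
    while i < len(data):
        octet = data[i]
        if octet in COMBINING_BY_OCTET:
            if i + 1 >= len(data):
                raise ValueError("diacritical mark with no following letter")
            base = data[i + 1]
            is_letter = 0x41 <= base <= 0x5A or 0x61 <= base <= 0x7A
            if not is_letter:
                raise ValueError("diacritical mark must precede a Latin letter")
            # Unicode puts the combining mark AFTER the base letter, so
            # swap the order, then compose where a precomposed form exists.
            pair = chr(base) + COMBINING_BY_OCTET[octet]
            out.append(unicodedata.normalize("NFC", pair))
            i += 2
        elif 0x20 <= octet <= 0x7E:
            out.append(chr(octet))  # simplification: treated as ASCII
            i += 1
        else:
            raise ValueError("octet 0x%02X not handled by this sketch" % octet)
    return "".join(out)


if __name__ == "__main__":
    # 12/8 (umlaut mark) followed by "u", inside the name "Muller"
    print(decode_composites(b"M" + bytes([0xC8]) + b"uller"))
```

Note that the order is the reverse of what Unicode uses: the mark comes 
first in the octet stream, so any software that inputs, displays, or 
prints such names has to look ahead one octet.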

At present, if an organization chooses to register its name with ANSI,
the instructions for doing this specify that "the characters must be
taken from the set defined in registration 102, the Teletex Set of Primary 
Graphic Characters of the ISO International Register of Coded 
Character Sets to be used with Escape Sequences plus space. The Escape
Sequences follow:

G0:   ESC 2/8       7/5
G1:   ESC 2/9       7/5
G2:   ESC 2/10     7/5
G3:   ESC 2/11     7/5
C0:         -
C1:         -

A copy of the allowable characters as found in the Teletex
Primary Set of Graphic Character Sets is attached. Please note,
the international currency symbol (position 02/04) is not supported."

(At this time, I do not know how the escape sequences are intended to be 
used. At one time it appeared that T.61 was intended to be extensible 
to other character sets, but this may have been overtaken by the 
UNIVERSAL STRING and/or BMP specification effort. More research 
is needed.)
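
For readers unfamiliar with the column/row notation, the following 
Python sketch shows only which octets the escape sequences quoted above 
denote, assuming the usual convention that position c/r is the octet 
16*c + r and that ESC is octet 1/11. It says nothing about how ANSI 
intends the sequences to be used, which remains an open question. The 
helper names are hypothetical.

```python
ESC = 0x1B  # the ESC control character, position 1/11


def position_to_octet(column, row):
    """Convert column/row notation (e.g. 2/8) to an octet value."""
    if not (0 <= column <= 15 and 0 <= row <= 15):
        raise ValueError("column and row must each be in 0..15")
    return (column << 4) | row


def designation_octets(intermediate, final):
    """Build the three-octet sequence ESC <intermediate> <final>."""
    return bytes([
        ESC,
        position_to_octet(*intermediate),
        position_to_octet(*final),
    ])


# The four sequences from the ANSI instructions quoted above.
DESIGNATIONS = {
    "G0": designation_octets((2, 8), (7, 5)),
    "G1": designation_octets((2, 9), (7, 5)),
    "G2": designation_octets((2, 10), (7, 5)),
    "G3": designation_octets((2, 11), (7, 5)),
}

if __name__ == "__main__":
    for name, seq in DESIGNATIONS.items():
        print(name, " ".join("%02X" % octet for octet in seq))
```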

At present, therefore, it appears that ANSI will only register names 
containing characters from the primary set, without the diacritical marks 
or special characters such as dollar sign, despite the fact that X.500 
would presumably allow the additional secondary characters and 
diacritical marks within a DirectoryString. On the other hand, ANSI will 
allow a 100 character name to be registered, although X.500 has a 
"recommended" limit of 64 characters on such names.

A quick check with the office of the Secretary of the Commonwealth of 
Massachusetts revealed that at one time the organizations "$50 Club, Inc."
and "$aving$ Express, Inc." were registered corporate names (not just 
trademarks). (Even more surprising, "Toys "R" Us of Massachusetts, Inc" 
was registered, using the backwards R! I suspect that this _was_ a 
trademark, not their corporate name, but that is what I was told. I don't 
know how they got it to display on their Wang terminal, but it did.)

It is therefore obvious that such characters will be required to 
support already existing corporate names, even if ANSI wouldn't 
register them at the national level under their current policy.

In addition, considering the impact or at least the spirit of NAFTA (but 
without having read the legislation for any particular requirements), it 
would appear that we should be prepared to support English, French, 
and Spanish names at a minimum. And given the national sensitivities 
involved, suggesting that French or Spanish names be transliterated 
into their English equivalents for the convenience of various
standards organizations is probably a political nonstarter.

Likewise, common names in Mexico in particular are often quite long,
and including the full legal name of organizations operating under various
Doing Business As, Trading As, and Fictitious Name statutes within a name
may also generate lengthy names. In general, telling people and/or 
organizations that they must change their name or truncate it is met 
with considerable resistance.

I would therefore like to make the following recommendations:

1. That the PEM community, the NADF, and the US Joint Registration 
Authority Committee (which provides policy direction to ANSI in this 
area) give immediate consideration to adopting the TeletexString, including
both the primary and secondary sets of graphic characters, as their 
primary recommended string type for X.500 and X.509 names (at least 
until UNIVERSAL STRING is better defined and more widely supported).

2. That the TeletexString support within various X.500 and X.509 
implementations, including Directory Service Agents, Directory User 
Agents, and PEM and similar agents, be extended if and as necessary to 
provide explicit support for the entire secondary set of graphics 
characters, specifically including the nonspacing diacritical marks, 
and that these implementations properly input, display, and print such 
composite characters.

3. That X.500 DSA and DUA implementations within the NADF, and
X.509 implementations within PEM and elsewhere support upper 
bounds of 128 characters (instead of 64) for ub-name, 
ub-common-name, ub-organization-name, ub-organizational-unit-name, and
ub-title to support existing national registration procedures and to 
conform to existing upper bounds for state, locality, street address,
etc., and that a defect report be submitted to X.500 to make this 
change a permanent part of the nonbinding recommendations in X.520.

4. That the standards bodies concerned with X.400 and similar message 
transport systems also consider these issues.
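
As a small worked example of recommendation 3, the Python sketch below 
compares a name's length against the current recommended bound of 64, 
the 100 characters ANSI will register, and the proposed bound of 128. 
The constant and function names are invented for illustration, and the 
sketch counts characters rather than octets, so a composite character 
counts once here even though it occupies two octets on the wire.

```python
UB_CURRENT = 64      # "recommended" X.500 limit cited above
ANSI_MAX = 100       # length ANSI will register, per the text above
UB_PROPOSED = 128    # upper bound proposed in recommendation 3


def bound_report(name):
    """Report which of the three limits a name fits within."""
    length = len(name)
    return {
        "length": length,
        "fits_current_ub": length <= UB_CURRENT,
        "fits_ansi": length <= ANSI_MAX,
        "fits_proposed_ub": length <= UB_PROPOSED,
    }


if __name__ == "__main__":
    # A made-up 100-character name: registrable with ANSI, but too long
    # for the current bound of 64.
    long_name = "X" * 100
    print(bound_report(long_name))
```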

As part of the process of evaluating these suggestions, I would 
appreciate comments from the developers of various implementations 
of X.500 and X.509 as to whether their systems would currently 
support these recommendations, and what the impact on their products 
of adding these capabilities would be.

Of course, corrections to my current understanding of these various
issues would be most welcome.

Robert R. Jueneman
GTE Laboratories
40 Sylvan Road
Waltham, MA 02254
617/466-2820
617/466-2603 FAX
Internet: Jueneman(_at_)GTE(_dot_)COM
