ietf-822

Re: restrictions when defining charsets

1993-02-05 15:26:45
One of them is ASCII!  The ASCII standard is most carefully weasel-worded
so that it does *not* demand that you map bytes to glyphs in the way you
might think it required.  This was to accommodate an obscure historical
problem:  it was politically necessary that a 64-character subset of
ASCII -- codes 32 through 95 -- accommodate PL/I.  This causes two
problems:  ASCII has no "not" symbol, and the ASCII or-bar is not in
the 64-character subset.  The result is a standard which is vague enough
about the appearance of the glyphs that it is legitimate to print code 33
(exclamation point) as or-bar and code 94 (circumflex) as not-sign.

The weasel-wording is certainly present, but the cause is a myth of the
"PL/I is everyone's favorite bugaboo, it must be responsible for this
too" class.

Weasel-words appear in ASCII for a multitude of reasons.  For example,
when the standard first came along, interoperability with the (6-bit and
uppercase-only) BCD character code was very important, but a number of
prominent BCD devices, such as very old teletypes, had "uparrow" and
"right arrow" characters on their printing fonts.  So the standard was
worded to have a weak enough binding between character name and glyph to
permit the ASCII character normally stylized as caret to be printed as
uparrow and the one normally stylized as underbar to be printed as
right-pointing arrow.
   The first version was also extremely clear about the meaning of the
vertical motion [control] characters LF (NL), VT, and FF (NP): they
implied the first character position on the target line.  CR implied
first character position on the current line; there was no way to
accomplish the "same character position, next line" function sometimes
called "index" without sending LF followed by horizontal motion
characters.   This usage was consistent with the behavior of the
Teletype 37 (the first true-ASCII/designed-for-ASCII device I remember)
and the Multics usage (which ultimately migrated into UNIX).  Other
systems (including Tenex and TOPS-10) came along between the first
version and the initial revision and used CR to designate "first
character on next line" or LF to designate an "index" function, and the
initial revision was changed to make either the LF-as-NL or the
LF-as-index interpretation valid.

----
Mythology story follows; those who are uninterested should stop reading
here.

The original PL/I definition was made against the prevalent EBCDIC on
the IBM System/360 / OS/360 line of the mid-1960s and without reference
to ASCII.  That EBCDIC has both a not-sign and an or-symbol (solid thin
vertical bar).   Because of concerns about BCD hosts, the original
definition also included two-character sequences for some of the special
characters required by PL/I, which assumed a larger character repertoire
than, say, Fortran, Cobol, or implementations of Algol-60.
  When the time came to do PL/I implementations on machines that were
essentially ASCII (or at least not native EBCDIC) in character (a few
while ASCII was being finalized) -- Multics, the PDP-11, Control Data,
Burroughs, and Univac mainframes -- the implementors made their own
guesses.  For example, some mapped the PL/I "or" symbol onto ASCII "|"
(broken vertical bar), while others used "!" (exclamation mark).
   With this background, when work started on the Standard for PL/I (a
joint ANSI/ECMA effort from the very beginning), weasel words were put
into what is now ISO 6160 that leave the mapping of the language
characters of the PL/I definition onto physical character codings and
bits completely in the hands of the implementation as long as certain
distinguishability and reversibility criteria were met.  That text
(and some other special rules about, e.g., collating sequences and
prohibitions against overlaying characters on things that would make
the length of a character in bits detectable to a conforming
implementation) was intended to permit conforming PL/I implementations
to be built in either ASCII or EBCDIC and to avoid taking a stand on !
vs | for "or" and ^ vs ~ for "not".  Curiously, it makes it possible to
build conforming implementations whose verbs are in, e.g., ISO 10646 as
long as the mapping of graphic "characters" to "print positions" in a
fixed-pitch type font is one-to-one.
  Several PL/I implementations in ASCII-ish environments today will
accept ! and | and ^ and ~ interchangeably, even within the same
program.
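
(As a rough illustration only -- a made-up scanner fragment in Python, not
taken from any real compiler -- that kind of acceptance amounts to folding
the alternate glyphs into one abstract operator before anything else looks
at the text:)

    # Made-up sketch: map either spelling of "or" and "not" to a single
    # abstract operator, the way a tolerant PL/I scanner might.
    OPERATOR_SPELLINGS = {
        "!": "OR",  "|": "OR",    # either glyph accepted for "or"
        "^": "NOT", "~": "NOT",   # either glyph accepted for "not"
    }

    def normalize(ch):
        return OPERATOR_SPELLINGS.get(ch, ch)

    # "A!B" and "A|B" reduce to the same abstract tokens:
    print([normalize(c) for c in "A!B"])   # ['A', 'OR', 'B']
    print([normalize(c) for c in "A|B"])   # ['A', 'OR', 'B']
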
  Some years after ANSI X3.53-1976, the ECMA standard whose number I
don't remember, and ISO 6160 were all solidly in place, the ISO working
group that was then responsible for 6160 maintenance got a questionnaire
from what is now JTC1/SC2 asking about character set use and, in
particular, use of "national use" positions in ISO 646.  The answer that
came back explained the usage of those characters but also indicated a
"not our problem" WG consensus, as a consequence of the binding between
language-abstract-characters and character codings being explicitly left
to implementations and other standards.

Why do I know all this history?  Well, one of my badly kept secrets is
that I'm ISO/IEC JTC1/SC22 Project Editor responsible for ISO 6160 and
6522, former convenor of the relevant WG, and chair of the associated
ASC X3 technical committee.  Woe is me.  :-)

    --john