ietf-822

FYI: Vietnamese Document Draft

1992-04-10 04:42:44

  Attached below is the document from the Vietnamese Standards Group
that has been publicly released describing current conventions for
Vietnamese usage on the Internet/BITNET/USENET and other proposals or
de facto standards for Vietnamese.  It represents the consensus of the
people who have been working on these issues for the past several years.

  Anh Nguye^~n Tha`nh has indicated that he intends to work on
converting it into a format suitable for an RFC and publishing it on
behalf of the Vietnamese Standards Group as an informational RFC
documenting conventions and usages for the Vietnamese language.  I'm
not sure when that might be finished and submitted to the RFC Editor.

  If my understanding is correct, that informational RFC would not be
focused on MIME in any way and would not be proposing registration of
a token for Vietnamese for usage in MIME.  The VSG would prefer to
have at least a unified mechanism for mnemonic usages or possibly a
single mnemonic convention, provided that such a single unified
mnemonic convention's representation of Vietnamese glyphs were no less
readable than the existing Vietnamese convention.

  I would also like to take this opportunity to disclaim the credit
that the document gives to me.  The vast majority of the work has been
done by others in the working group and they deserve the lion's share
of the credit for the document content.  I am however quite pleased
that we in the VSG working group have been able to produce such a
document.

  Ran
  atkinson(_at_)itd(_dot_)nrl(_dot_)navy(_dot_)mil

----cut here, omitting this line and above---


Vietnamese Mnemonic Notes:

In the following ASCII text, Vietnamese letters with diacritics
are represented as a vowel followed by the diacritics, with the
following mappings:

        (       =       breve, as in "a(n na(n"
        ^       =       circumflex, as in "nha^n co^ng"
        +       =       horn, as in "tu+o+ng tu+"

        '       =       acute, as in "choa'ng va'ng"
        `       =       grave, as in "lu` khu`"
        ?       =       hook above, as in "ho?i tha(m"
        ~       =       tilde, as in "ky~ ca`ng"
        .       =       dot below, as in "Tra.ng Nguye^n"

        dd      =       lower case d-bar, as in "dda ti`nh"
        DD      =       upper case D-bar, as in "DDo^ng So+n"

The  diacritics are interspersed freely in the text and  should
be  clear from the context, for example, "The  Vietnamese  call
themselves `Con Cha'u Hu`ng Vu+o+ng',  or `Descendants  of King
Hu`ng'."  However there are instances where it is necessary  to
differentiate   between  a  single   Vietnamese   letter   with
diacritics and a sequence of characters, for example, "a^'". In
such cases, when the single Vietnamese letter is meant,  it  is
enclosed in angle brackets, e.g., "<a^'>"; without the brackets
the  string  "a^'"  should  be understood to be the sequence of
characters "a", "^", and "'".  It should be clear from  context
how the text should be read.
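
As an illustration only, the mapping above can be sketched in a few
lines of Python. The function name mnemonic_to_unicode is hypothetical,
and the use of Unicode combining marks as the output form is an
assumption made for this sketch; it is not part of the convention
described in these notes. The sketch is deliberately naive: it treats
any listed mark that directly follows a vowel as a diacritic, so it
cannot resolve the context-dependent cases mentioned above (for
example, a sentence-ending period after a vowel, or "dd" inside an
English word).

```python
import unicodedata

# Marks that change the base letter (breve, circumflex, horn).
MODIFIERS = {"(": "\u0306", "^": "\u0302", "+": "\u031b"}
# Tone marks (acute, grave, hook above, tilde, dot below).
TONES = {"'": "\u0301", "`": "\u0300", "?": "\u0309",
         "~": "\u0303", ".": "\u0323"}
VOWELS = set("aeiouyAEIOUY")


def mnemonic_to_unicode(text):
    """Naively convert mnemonic Vietnamese text to precomposed Unicode."""
    out = []
    i = 0
    while i < len(text):
        pair = text[i:i + 2]
        if pair == "dd":            # lower case d-bar
            out.append("\u0111")
            i += 2
            continue
        if pair == "DD":            # upper case D-bar
            out.append("\u0110")
            i += 2
            continue
        ch = text[i]
        out.append(ch)
        i += 1
        if ch in VOWELS:
            # At most one modifier, then at most one tone mark.
            if i < len(text) and text[i] in MODIFIERS:
                out.append(MODIFIERS[text[i]])
                i += 1
            if i < len(text) and text[i] in TONES:
                out.append(TONES[text[i]])
                i += 1
    # Compose base letters and combining marks into single characters.
    return unicodedata.normalize("NFC", "".join(out))


print(mnemonic_to_unicode("Tra.ng Nguye^n"))
```

Applied to the examples in the list above, the same function handles
"tu+o+ng tu+" and "DDo^ng So+n" without any special cases.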

The  text  was  generated  with "dvi2tty" and "nroff" with con-
siderable hand-editing, but the formatting still leaves much to
be desired.  A  much  more readable  version  is  available  in
PostScript  form from various archive sites to be announced  by
the archivists themselves. If you have no means  of  retrieving
or printing the PostScript file, you may obtain a  printed copy
by sending a self-addressed, stamped envelope to "Cuong T. Nguyen
P. O. Box  9934,  Stanford, CA 94309-1634".  Please use two (2)
29-cent stamps and a letter-sized envelope.

Please forward typos & comments to Viet-Std(_at_)Haydn(_dot_)Stanford(_dot_)EDU.

Acknowledgments:
----------------

We acknowledge the direct authorship/contribution by the following people:

            Atkinson, Randall (atkinson(_at_)itd(_dot_)nrl(_dot_)navy(_dot_)mil)
            Bu`i Cu+o+ng (bui(_at_)berlioz(_dot_)nsc(_dot_)com)
            Ho^` Khie^m (khiem(_at_)hpinddm(_dot_)cup(_dot_)hp(_dot_)com)
            Lu+o+ng V. Tu+o+'c (tluong(_at_)borland(_dot_)com)
            Ngo^ DDi`nh Ho.c (hoc%vri280(_at_)uunet(_dot_)uu(_dot_)net)
            Nguye^~n T. Cu+o+`ng (cuong(_at_)Haydn(_dot_)Stanford(_dot_)EDU)
            Nguye^~n Tha`nh (thanh(_at_)ipesun(_dot_)e-technik(_dot_)uni-stuttgart(_dot_)de)
            To^n Khoa (khoa(_at_)hpda(_dot_)hp(_dot_)com)
            Tra^`n Nha^n (tran(_at_)peora(_dot_)sdc(_dot_)ccur(_dot_)com)
    
  We also acknowledge the many insightful comments, arguments, and ideas
contributed by the people on Viet-Std(_at_)Haydn(_dot_)Stanford(_dot_)EDU, who
are too numerous to acknowledge individually but whose contributions are
nevertheless important, as well as the people of Viet-Net and
Soc.Culture.Vietnamese, including those who proposed, discussed, and
propagated the Viet-Net readable mnemonic convention.


                        Viet-Std List
                        -------------
                Atkinson, Randall (atkinson(_at_)itd(_dot_)nrl(_dot_)navy(_dot_)mil)
                BINGO(_at_)MTUS5(_dot_)cts(_dot_)mtu(_dot_)edu
                Bu`i Cu+o+ng (bui(_at_)berlioz(_dot_)nsc(_dot_)com)
                DDa(.ng, Oliver (Oliver_Dang(_dot_)Washington_CSD(_at_)Xerox(_dot_)com)
                DDinh Hoa`n (hdinh(_at_)ihlpx(_dot_)att(_dot_)com)
                DDo^~, James (jDo(_at_)sjc(_dot_)mentorg(_dot_)com)
                Du+o+ng, Christie (chrisd(_at_)works(_dot_)sun(_dot_)com)
                Dung Trung (trung(_at_)CS(_dot_)BU(_dot_)EDU)
                Ho^.i Chuye^n Gia Vie^.t Nam (hcgvn(_at_)netcom(_dot_)com)
                Ho^` Khie^m (khiem(_at_)hpinddm(_dot_)hp(_dot_)com)
                Ho^` Phi Hu`ng (hho%aludra.usc.edu, Archivist)
                JFT%NCCIBM1(_dot_)BITNET(_at_)Forsythe(_dot_)Stanford(_dot_)EDU
                Le^ Quang (quangl(_at_)tabasco(_dot_)sps(_dot_)mot(_dot_)com)
                Le^ Ti'n (tin(_at_)smsc(_dot_)sony(_dot_)com, Archivist)
                Lu+o+ng V. Tu+o+'c (tluong(_at_)borland(_dot_)com)
                Ngo^ DDi`nh Ho.c (ngo(_at_)amelia(_dot_)nas(_dot_)nasa(_dot_)gov)
                Ngo^ Quang (quang(_at_)csufres(_dot_)csufresno(_dot_)edu)
                Ngo^ Thanh Nha`n (nhan(_at_)LSP5(_dot_)CS(_dot_)NYU(_dot_)EDU)
                Nguye^~n DDu+'c Long (long(_at_)ireq-num(_dot_)hydro(_dot_)qc(_dot_)ca)
                Nguye^~n Du (nguyen(_at_)zariski(_dot_)harvard(_dot_)edu)
                Nguye^~n Gia Hoa` (nguyenh(_at_)eng(_dot_)umd(_dot_)edu)
                Nguye^~n Hoa`ng (Hoang_Nguyen(_dot_)LAX1B(_at_)xerox(_dot_)com)
                Nguye^~n Kinh (Kinh_Nguyen(_dot_)ESXFC(_at_)Xerox(_dot_)COM)
                Nguye^~n T. Cu+o+`ng (cuong(_at_)haydn(_dot_)stanford(_dot_)edu)
                Nguye^~n Tha`nh (thanh(_at_)ipesun(_dot_)e-technik(_dot_)uni-stuttgart(_dot_)de)
                Nguye^~n Vu+o+ng (Vuong(_dot_)Nguyen(_at_)szebra(_dot_)saigon(_dot_)com)
                Pha.m Tha.ch (thach(_dot_)pham(_at_)Eng(_dot_)Sun(_dot_)COM)
                To^n Khoa (khoa(_at_)hpda(_dot_)hp(_dot_)com)
                Tra^`n Ha?i (htran(_at_)dash(_dot_)mitre(_dot_)org)
                Tra^`n Kha (KTRAN%APLVM(_dot_)BITNET(_at_)Forsythe(_dot_)Stanford(_dot_)EDU)
                Tra^`n Nha^n (tran(_at_)peora(_dot_)sdc(_dot_)ccur(_dot_)com)
                Vie^.t Anh (anh(_at_)media(_dot_)mit(_dot_)edu)
                Vu~ Tie^'n Giao (pyramid!infmx!grizzly!giao)
                ccicpg!al!ngo
                cdh1(_at_)homxc(_dot_)att(_dot_)com
                chi(_at_)markv(_dot_)com
                ctt(_at_)alux2(_dot_)att(_dot_)com
                cvu(_at_)ic(_dot_)sunysb(_dot_)edu
                gregg(_at_)eoc(_dot_)com
                hnguyen(_at_)sc9(_dot_)intel(_dot_)com
                k(_dot_)ly(_at_)trl(_dot_)OZ(_dot_)AU
                kpv(_at_)ulysses(_dot_)att(_dot_)com
                lphung(_at_)ihbhk(_dot_)att(_dot_)com
                lphung(_at_)nike(_dot_)calpoly(_dot_)edu
                ltpham(_at_)netcom(_dot_)com
                ndoduc(_at_)framentec(_dot_)fr
                shg(_at_)rock(_dot_)concert(_dot_)net
                thanht(_at_)hls(_dot_)com
                thu(_at_)friendship(_dot_)sun(_dot_)com
                trinh(_at_)paulus(_dot_)enet(_dot_)dec(_dot_)com
                tuan(_at_)lgc(_dot_)com
                v(_dot_)mai(_at_)uow(_dot_)edu(_dot_)au
                vtpham(_at_)ap040(_dot_)csc(_dot_)ti(_dot_)com
                vtt(_at_)turing(_dot_)scs(_dot_)uiuc(_dot_)edu



    A UNIFIED FRAMEWORK FOR VIETNAMESE INFORMATION PROCESSING

           Vietnamese Standardization Working Group 
                 (Viet-Std(_at_)Haydn(_dot_)Stanford(_dot_)EDU)

                          January 1992
                                
                            ABSTRACT

   Increasing demand for Vietnamese electronic information processing
   has been met by a wide array of Vietnamese-capable
   applications.  The inevitable need  for  integration
   of Vietnamese into existing environments and the exchange of
   data among them point to the necessity  of  standardization.
   This  paper  presents  the strategic and pragmatic technical
   considerations that must go into such a  standard,  and  re-
   views existing conventions/proposals in these important con-
   texts. A  full  description  of  the  Viet-Std  proposal  is
   presented,  including  1)  an 8-bit, fully precomposed Viet-
   namese encoding table, 2) a 7-bit quoted-readable Vietnamese
   standard  for  data  interchange over 7-bit channels, with a
   seamless interface to the 8-bit encoding, and 3) a  keyboard
   user-interface  specification  that works transparently with
   both 1 and 2. Together, these provide a truly unified
   framework  for  a Vietnamese information processing environ-
   ment with simplicity, efficiency,  and  straightforward  in-
   tegration. The real-world construction of this framework has
   proven quite successful in an array  of  compliant  applica-
   tions  from  a  number  of  group  and individual developers
   across a number of platforms, including Unix and  its  vari-
   ants, the X window system, MS-DOS, Windows, and with ongoing
   work elsewhere.



1 INTRODUCTION

With the growing Vietnamese population abroad  and   the   proli-
feration  of  computer  usage  within  Viet  Nam,  the Vietnamese
language has seen  rapidly  increasing  representation  in  elec-
tronic  information  processing. The concomitant growth in demand
for Vietnamese-capable software  has   resulted   in   successful
launches  of  myriad vendors in the U.S. and elsewhere, mainly in
the area of Vietnamese word processing.  In addition,  individual
and   group  efforts  have  also  been  productive  in  providing
Vietnamese-language users with high-quality public-domain applications.
In Viet Nam, centers such as the Institute of Informatics
have reported impressive progress on many  fronts,   among
which is the Vietnamization of standard software packages [1].

       All of the above illustrate two important points: 1) There
are   growing   market  demands for Vietnamese-capable processing
engines, and 2) There is no shortage of   technical   talent   to
fulfill  those demands. Unfortunately, therein lies a large prob-
lem: most existing Vietnamese applications have been designed  to
operate   in   the   exclusive  framework  or  environment of the
developer, and all are incompatible with one another. As long  as
this   trend   continues, the application base for Vietnamese can
never keep reasonable pace with demand. Users  want  to  do  more
with Vietnamese than mere word processing, and to expect one sin-
gle vendor to provide all potential applications across all plat-
forms is to dream the impossible. Technicians providing these ap-
plications are limited to the Vietnamese tools  they  must  them-
selves  learn  and develop from the ground up. Standardization is
necessary. Anyone who has had to deal  with  the  incompatibility
between  ASCII  and EBCDIC can try to imagine a world where every
machine is using a different character set, and  appreciate   how
limited  that  world  would  be  in  its application base and how
cumbersome in its data interchange. A  uniform   framework   will
greatly benefit both the user and the technician alike.

       The proposal for any Vietnamese data standardization  must
take several important points into account in their proper contexts. First and
foremost, since this discussion is geared  toward existing 7- and
8-bit   environments,   the  prime  goal  is  straightforward and
direct integration onto current platforms.  The   standard   must
work here and now. This implies the use of precomposed Vietnamese
characters, because the handling of floating diacritics will nev-
er  see  full or simple support outside of specific contexts. The
standard must be designed so as to  take  advantage  of  existing
applications  as much  as possible. The familiar ``don't reinvent
the wheel'' rule is not only an advantage---but a necessity---if
a  meaningful  application base is to be established in any
reasonable length of time. Furthermore, it is known that  overall
efficiency  both  in  time  and  space  is  greater in processing
precomposed character units  when  compared  with  the  floating-
diacritic   approach  [2]. Floating diacritics therefore must be
limited to only where they are necessary and   inevitable,   such
as   in   keyboard  entry or 7-bit data transmission. There is no
reason to require that all applications must deal with  the  com-
plexities   and  inefficiencies of floating diacritics, for exam-
ple, in 8-bit  data  processing,  storage,  transmission,  screen
rendering, or printing.
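
The space argument can be illustrated with a short Python sketch. It
uses Unicode normalization as a stand-in, which is an assumption of
this example: NFC plays the role of a precomposed encoding and NFD the
role of floating diacritics. It is not the 8-bit encoding proposed in
this paper, and the function name storage_units is hypothetical.

```python
import unicodedata


def storage_units(text):
    """Return (precomposed, floating) character counts for text."""
    precomposed = unicodedata.normalize("NFC", text)   # one unit per letter
    floating = unicodedata.normalize("NFD", text)      # base + separate marks
    return len(precomposed), len(floating)


# "Vie^.t Nam": the e carries both a circumflex and a dot below.
print(storage_units("Vi\u1ec7t Nam"))
```

Every letter that carries one or two diacritics costs one or two extra
units in the floating form, and each application that touches the data
has to know how to group those units back into letters.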

       The second major context points to the pragmatic and vital
consideration   of   existing  precedents  set  in the Vietnamese
software base. Standardization necessarily  requires  adaptation,
but  it makes little sense to propose to change the world so sig-
nificantly that the inertia against large changes greatly  delays
adoption   of   the standard. Sixteen-bit and wider data standards
are just around the corner [3 , 4]; an 8-bit Vietnamese standard
must  not  ignore existing software precedents if it is not to be
useful only ``after its time,'' when it is no longer relevant.

       Thirdly, the standard must address the issue of user interface:
if it does not define the interface, it must at least consider the
possible effects on the end-user. This relates primarily to the 7-bit
keyboarding and representation of Vietnamese--in both instances
diacritics  are  necessarily  floating,  and  represented
mnemonically   by  existing 7-bit characters with similar appear-
ance. With keyboarding, one must preserve where  possible  exist-
ing   practices   such  as  that defined for the Viet-Net mailing
list and the Usenet newsgroup Soc.Culture.Vietnamese,  both  with
members  worldwide.   For 7-bit readable representation, the key-
word is ``readable.'' The goals here are to  maintain   a   short
learning  time  and  to promote a uniform interface so that it is
not necessary for a user to re-learn  the  particulars  of  every
software installation before being able to use it effectively.

       Finally, to every extent possible, the standard must  stay
within the framework of international standards, e.g., ISO-8859/x
[5], in order to ensure compatibility   with  existing   environ-
ments.   For   example, this goal means preservation of the ASCII
encoding. It should extend also to the  encoding  into  the  same
8859/Latin-1  slots  those Vietnamese characters that are already
defined, thus ensuring that 8859/Latin-1  keyboards   will   work
transparently for those Vietnamese characters. However, there are
many standards requirements that
are  obsolete  from a practical viewpoint. For example, in recent
Unicode/ISO-10646 decisions, the prohibition on use of the
available control character space---those with encodings between
xx00h and xx1Fh, except for  C0  itself---was  discarded  on  the
grounds  that  it  was a waste of encoding space. As will be dis-
cussed later, the encoding of Vietnamese into the   existing   8-
bit  space  presents some well-known trade-offs. Where trade-offs
are made, they must be  justified  with  good  reason---pragmatic
preferred over theoretical.

       These primary requirements are summarized as follows:

         R1. Straightforward and direct integration into ex-
               isting platforms.

         R2. Ease of adaptation for existing software.

         R3. User-friendly mnemonic encoding scheme and inter-
               face.

         R4. Adherence to international standards.

         R5. Trade-offs made only on practical usage consid-
               erations and with good reason.

       In the following section we present a brief review of  the
strengths   and  weaknesses of different approaches to Vietnamese
encoding. Section 3 will describe the  proposed   8-bit  encoding
table  in  detail. A quoted-readable encoding scheme encompassing
7-bit data streams, including electronic mail and  keyboard   in-
put,  is  presented in Section 4. Finally, Section 5 outlines the
particular rules and conventions relevant in  some   application-
specific contexts.

2 REVIEW OF CURRENT CONVENTIONS

A review of current conventions used by software vendors  reveals
one  distinct  feature:  virtually all realize the strengths of a
precomposed encoding and adopt it as a primary  requirement.  The
complications  arise  from a familiar fact: apart from the alpha-
betics already available in the ASCII  standard,  Vietnamese  re-
quires  an additional 134 unique characters. Of these, 128 can be
coded in the C1 and G1 areas. The allocation of the  remaining  6
characters  in  the lower C0 and G0 space is handled with differ-
ing approaches:

         A1. Encode into 6 of the ``least-used'' G0 characters
               in the context of Vietnamese data processing.



         A2. Encode into 6 of the 12 National Replacement char-
               acters in G0.

         A3. Drop 6 of the ``least-used''(1) Vietnamese charac-
               ters, typically accented capitals such as A(?, A(~,
                A^~, Y?, Y~, and Y. .

         A4. Map accented ``y'' combinations into corresponding
               ``i'' combinations, e.g., ``ky~ su+'' is replaced
               with ``ki~ su+''.

         A5. Encode into the ASCII control space C0.
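
The figures quoted at the start of this section (134 additional
characters, of which 128 fit in C1 and G1) can be checked with a small
Python sketch. The enumeration below, twelve base vowels times six
tones plus d-bar in both cases, is our reading of the Vietnamese
alphabet and is expressed in Unicode purely for convenience; the
function name vietnamese_letters is hypothetical.

```python
import unicodedata

# Twelve base vowels: a a( a^ e e^ i o o^ o+ u u+ y
BASE_VOWELS = ["a", "a\u0306", "a\u0302", "e", "e\u0302", "i",
               "o", "o\u0302", "o\u031b", "u", "u\u031b", "y"]
# Six tones: unmarked, acute, grave, hook above, tilde, dot below
TONES = ["", "\u0301", "\u0300", "\u0309", "\u0303", "\u0323"]


def vietnamese_letters():
    """Return every vowel/tone combination plus d-bar, in both cases."""
    lower = {unicodedata.normalize("NFC", v + t)
             for v in BASE_VOWELS for t in TONES}
    lower.add("\u0111")                      # dd
    return lower | {c.upper() for c in lower}


letters = vietnamese_letters()
beyond_ascii = {c for c in letters if not c.isascii()}
print(len(letters), len(beyond_ascii), len(beyond_ascii) - 128)
```

Of the 146 letters, 12 are the plain ASCII vowels in both cases, which
leaves 134 to be placed; after filling the 128 positions of C1 and G1,
6 remain, and these are the characters that approaches A1 through A5
handle differently.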

       Approaches A1 and A2 both satisfy the typical needs of the
word  processing  environments in which rarely used ASCII charac-
ters can be avoided, or employed by font shifting.  However  they
both  eliminate  prospects for integration of Vietnamese into ex-
isting ASCII environments where all graphic  characters   in   G0
are needed. A character that already serves one purpose cannot be
re-used for another. First, it makes rendering of the  needed  G0
character incorrect, as it would now look like a Vietnamese char-
acter. The frequency of use of G0 characters in   an   integrated
environment  is  far  too high for this conflict to be tolerable.
While font shifting may be employed to remedy this in  some  sit-
uations,  a more serious problem occurs when the Vietnamese char-
acter is needed. The environment would  typically  have  assigned
some   specific   meaning  to the G0 character, particularly with
those in the National Replacement set. Consider,   for   example,
using  the  backslash  character ``\'' for a Vietnamese character
under Unix. The backslash is used for many Unix escape mechanisms
so  that  the Vietnamese character cannot simply be used but must
be escaped in one way or another. This is more than an inconveni-
ence;   it  means data interchange is now complicated by the fact
that the escape mechanism will not be understood on another plat-
form,  and data integrity has thus not been preserved. A standard
employing this approach fails at its basic mission:  to   provide
cross-platform  transparency.  A similar case can be made for the
other G0 characters.

       Both A3 and A4 propose to limit Vietnamese  language  data
in  one way or another. Most agree that elimination of some Vietnamese
characters is simply unacceptable; indeed, this point  is
so   fundamental   that we have in the foregoing chosen to assume
it as a technical requirement without elaboration.  However,   it
must be said that A4 is not a proposal without rationale. A school
of thought exists that believes y's existing in words as a single
vowel should be mapped to corresponding i's, as their pronunciations
are indeed identical. The concept dates as far back as 1948 [6, 7].
However, it is not the function of an encoding standard to settle a
linguistic issue, and hence A4 is also a bad choice.
_________________________________________
  (1) Least-used because they (a) rarely begin words  and  there-
fore do not often get capitalized, and (b) appear in fewer words.

       The immediate objection to A5 is primarily  in  data  com-
munication  channels  where  many  C0 characters are used as data
control. In addition, it also presents problems  for  integration
into  environments  where some C0 characters are used in the key-
board interface and in data format controls,   similar   to   the
problem  facing A1 and A2. However, as will be discussed further,
judicious choice of the 6 C0 characters to be used has  in  prac-
tice   been  shown successfully to avoid characters that are sig-
nificant in data communication. Furthermore, most  data  channels
provide for clean transfer of binary data, and there is no reason
to worry that arbitrary data bits cannot be employed  over  these
binary routes.

       With those particular cases where C0 is used in  the  key-
board  interface,  judicious  choice as well as remapping of keys
can  minimize  conflict.  Data  format  control  is  application-
specific but is typically scattered in C0 and C1. It is therefore
a universal problem for integration  because  C1  is  necessarily
densely  encoded, but, again, conflict can be avoided by studying
significant applications. Finally, the choice can be made  for  6
least-used   Vietnamese   characters   so that the probability of
conflict is greatly reduced.

       It should be noted here that the foregoing discussion  has
subjected   the   alternatives to the requirements of integration
into existing applications and platforms, as outlined   in   Sec-
tion  1. The importance of this goal cannot be overstated, and it
does present complications that result in the  following  Pragma-
tism  Principle:  it is obviously impossible to define a standard
that would operate seamlessly with all   existing   applications,
therefore  pragmatic  considerations must be made to make a stan-
dard workable in as many important applications and  on  as  many
platforms as possible, with emphasis on the word ``workable.''



3 VISCII: 8-BIT ENCODING SPECIFICATION FOR VIETNAMESE

3.1 MOTIVATION

The  available  body  of  evidence  shows  that  alternative   A5
described  in  the  previous  section,  encoding into 6 of the C0
characters, has the greatest chance of  success   in   fulfilling
the  requirements  outlined  in Section 1. The choice of the 6 C0
codes and the 6 least-used Vietnamese capital letters to  encode,
when  made carefully, greatly reduces the probability of conflict
for all practical purposes. Concerns regarding  data   communica-
tions  are  well  addressed by avoiding C0 codes that are in fact
often used for data control. Indeed,  data   communication   con-
cerns   are   more  applicable to C1 and G1 encoding; a prominent
example is electronic mail transfer through 7-bit  gateways   and
mail  agents.   Communication failure here has in most cases been
due to the use of the eighth bit and not because of C0  encoding.
In  any  event,  the  option  exists  for data to be sent in some
``binary'' mode, or to employ the Vietnamese Quoted-Readable for-
mat to be described in Section 4.
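
A minimal Python sketch of what a 7-bit gateway does to 8-bit data
makes the point concrete. The function name strip_eighth_bit is
hypothetical, and the sample byte relies on Table 2, which places a'
at position E1 as in 8859/Latin-1.

```python
def strip_eighth_bit(data: bytes) -> bytes:
    """Simulate a 7-bit channel by clearing the high bit of each byte."""
    return bytes(b & 0x7F for b in data)


# 0xE1 (a' in Table 2) collapses to 0x61, a plain ASCII "a".
print(strip_eighth_bit(b"c\xe1"))

# C0 codes such as STX (0x02) already fit in 7 bits and pass unchanged.
print(strip_eighth_bit(b"\x02") == b"\x02")
```

The damage is silent: the output is still valid ASCII, so the receiver
has no way to tell that diacritics were lost in transit.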

       The overwhelming advantage of this approach is that it  is
readily  and easily integrated into existing environments without
many of the problems plaguing the other alternatives,   if   they
can  at  all be integrated. As a testimony to the approach's suc-
cessful application, this document itself  was   prepared   using
the   TeX  system under Unix. The text source was edited in an 8-
bit X terminal window, using a  minimally  modified(2) version of
Elvis,   a   public-domain 8-bit version of Unix's Vi text editor.
Both  TeX  (a  document  preparation  system)   and   Dvi2ps   (a
PostScript  generator)  readily accepted and processed Vietnamese
(8-bit) data transparently. Many other applications  including  a
spreadsheet,   various   text  viewers, PostScript and dot-matrix
printing, DOS's WordPerfect, Word, PC Tools,  etc.,   have   been
tested and seen to operate well with Vietnamese text. Modifications,
if any, were primarily in making these applications
accept 8-bit   data.  An educational teaching tool for Vietnamese
has also been produced using the C  programming   language   with
8-bit  Vietnamese  strings  embedded in the source code. With in-
creasing system internationalization, applications and tools  are
being  made  8-bit ``clean,'' further facilitating integration of
this Vietnamese encoding.

_________________________________________
  (2) The modifications provided the keyboard interface described
in later sections.



3.2 ENCODING RATIONALE

A basic requirement is to preserve  the   7-bit   ASCII   graphic
characters   (G0)   layout, since the emphasis is on integration.
G0 was therefore left unchanged. For  the  6  C0  characters,  we
first  lay  out  the  code  space  and  consider typical usage, a
sampler of which is in Table 1. The codes selected, STX (2),  ENQ
(5),  ACK  (6),  DC4 (20), EM (25), and RS (30) present the least
possible problems with data communication and   significant   ap-
plications  considered.  The use of ACK, for example, is actually
context-dependent. In those protocols we  have  reviewed,  it  is
only  considered a ``control'' character outside of a data frame;
within a data frame it is transferred without special interpretation.
To reduce the probability of conflict even further, the
6 least-often used Vietnamese capital letters, A(?, A(~, A^~, Y?,
Y~, and Y., are encoded into these slots.
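
The selection can be cross-checked against Table 1 with a short Python
sketch. The set GENERAL_USE below is transcribed from the GENERAL
column of Table 1. Pairing the six letters with the six codes in the
order they are listed in the text is an assumption of this sketch;
Table 3 remains the authoritative assignment. The helper names
conflicts and assign are hypothetical.

```python
# The six C0 codes selected in the text, by decimal value.
SELECTED = {2: "STX", 5: "ENQ", 6: "ACK", 20: "DC4", 25: "EM", 30: "RS"}

# Codes that have an entry in the GENERAL column of Table 1.
GENERAL_USE = {0, 3, 4, 7, 8, 9, 10, 12, 13, 17, 19, 27, 29}

# The six least-often used capitals, in mnemonic notation.
CAPITALS = ["A(?", "A(~", "A^~", "Y?", "Y~", "Y."]


def conflicts(selected, in_use):
    """Return the selected codes that also appear in the in-use set."""
    return sorted(set(selected) & set(in_use))


def assign(selected, letters):
    """ASSUMPTION: pair codes and letters in their listed order."""
    return dict(zip(sorted(selected), letters))


print(conflicts(SELECTED, GENERAL_USE))
for code, letter in assign(SELECTED, CAPITALS).items():
    print(f"0x{code:02X} {SELECTED[code]:<3} {letter}")
```

An empty conflict list confirms only that none of the six codes has a
general-purpose meaning in Table 1; the application-specific columns
(for example the Vi editor) still show some overlap, which is the
trade-off discussed above.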

       The encoding  of  C1  is  less  troublesome,  although  in
application-specific contexts   it   has been found that some C1
characters are employed with special meanings. A review  of   on-
going   work   on  8-bit mail transport standardization indicates
that C1 characters will be fully supported  as  graphic   charac-
ters   without   special interpretation. Nevertheless, it is pru-
dent to encode only upper-case characters into the C1 space.

       For G1, the aim is to adhere to the  8859/Latin-1  mapping
where Vietnamese-specific characters are already encoded. Table 2
lists the subset of 8859/Latin-1 characters  in   G1   that   are
also   Vietnamese (3).  The  motivation behind this choice is the
predominant and increasing availability  of   8859/Latin-1   key-
boards  and  font sets, e.g., Digital's VT-terminal series, Xterm
keymaps, and Microsoft's Windows. It is  natural  and  reasonable
for a user in France to expect that the same keystrokes producing
"e'" on the screen for French will do the same for Vietnamese.

       With the above guidelines, the task is then to lay out the
remaining   Vietnamese   characters in some fashion, perhaps even
arbitrary. This has been done in such a way so as to provide some
degree   of   symmetry simply for aesthetics. Note that the Viet-
namese collating order cannot in any case be preserved, but  this
is  not a major issue since collation for non-ASCII characters is
well accepted to be a table-lookup problem.
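
To show what table-lookup collation looks like in practice, here is a
hedged Python sketch. It works on Unicode strings rather than on the
8-bit codes of this proposal, and the tone order used for tie-breaking
(unmarked, grave, hook above, tilde, acute, dot below) is an
assumption chosen for the example, not something this paper
specifies. The function name collation_key is hypothetical.

```python
import unicodedata

# Base-letter order: a a( a^ b c d dd e e^ g h i k l m n o o^ o+ p q r s t u u+ v x y
ALPHABET = ("a\u0103\u00e2bcd\u0111e\u00eaghiklmno\u00f4\u01a1"
            "pqrstu\u01b0vxy")
LETTER_RANK = {ch: rank for rank, ch in enumerate(ALPHABET)}

# ASSUMPTION: tone order used only to break ties between equal base letters.
TONE_RANK = {"\u0300": 1, "\u0309": 2, "\u0303": 3, "\u0301": 4, "\u0323": 5}


def collation_key(word):
    """Build a two-level sort key: base-letter ranks, then tone ranks."""
    primary, secondary = [], []
    for ch in unicodedata.normalize("NFC", word.lower()):
        parts = unicodedata.normalize("NFD", ch)
        tone = [p for p in parts if p in TONE_RANK]
        base = unicodedata.normalize(
            "NFC", "".join(p for p in parts if p not in TONE_RANK))
        # Letters outside the table sort after all Vietnamese letters.
        primary.append(LETTER_RANK.get(base, len(ALPHABET)))
        secondary.append(TONE_RANK[tone[0]] if tone else 0)
    return primary, secondary


words = ["\u0111a", "da", "ba", "\u0103n", "an"]
print(sorted(words, key=collation_key))
```

Sorting the same list by raw code value would place "a(n" after "da",
which is why the layout of the code table does not need to follow the
alphabet: the lookup table carries the ordering instead.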

_________________________________________
(3) Note that the ``dd'' in Table 2 is actually a similar-looking
Icelandic  ``edh'' in 8859/Latin-1; the Vietnamese rendering form
is better reflected in 8859/Latin-2.



       Table 1: A sampler of possible C0 usage conflicts.  Codes selected
               for this standard proposal are noted with a +.
 -----------------------------------------------------------------------------
 | CODE  COMM   CTRL  GENERAL   PRINTER (PC)    PC      UNIX     VI (Unix)   |
 |===========================================================================|
 |   0    NUL    @    C string                          strings              |
 |   1    SOH    A                                                           |
 |   2+   STX    B                                               back screen |
 |   3    ETX    C    INTR                              INTR     INTR        |
 |   4    EOT    D    EOF                               EOF      back tab    |
 |   5+   ENQ    E                                                           |
 |   6+   ACK    F                                               forw.screen |
 |   7    BEL    G    BEL       BEL                     BEL                  |
 |   8    BS     H    BS        BS              BS      BS       BS          |
 |   9    HT     I    HT        HT              HT      HT       HT          |
 |  10    LF     J    LF        LF              LF      LF       LF          |
 |  11    VT     K              VT                                           |
 |  12    FF     L    FF        FF                      FF       redraw      |
 |  13    CR     M    CR        CR              CR      CR       CR          |
 |  14    SO     N              wide on (IBM)                                |
 |  15    SI     O              comp.on (IBM)                                |
 |  16    DLE    P                              Prt.on/off                   |
 |  17    DC1    Q    XON       XON             XON     XON                  |
 |  18    DC2    R              comp.off(IBM)           retype               |
 |  19    DC3    S    XOFF      XOFF            XOFF    XOFF                 |
 |  20+   DC4    T              wide off(IBM)                   forw. tab    |
 |  21    NAK    U              clr. buf(IBM)           kill    kill         |
 |  22    SYN    V                                      literal literal      |
 |  23    ETB    W                                      werase  werase       |
 |  24    CAN    X                                      kill                 |
 |  25+   EM     Y                                      suspend              |
 |  26    SUB    Z                              EOF     suspend              |
 |  27    ESC    [    ESC       ESC sequence    ESC     ESC     ESC          |
 |  28    FS     \                                      quit                 |
 |  29    GS     ]    Telnet ESC                                             |
 |  30+   RS     ^                                                           |
 |  31    US     _                              Windows                      |
 -----------------------------------------------------------------------------

   Table 2: Vietnamese-specific characters already present in 8859/Latin-1.
    ----------------------------------------------------------------------
    |    | 0   1   2   3   4   5   6   7   8   9   A   B   C   D   E   F |
    |====|===============================================================|
    | Cx | A`  A'  A^  A~                  E`  E'  E^      I`  I'        |
    ----------------------------------------------------------------------
    | Dx | DD      O`  O'  O^  O~              U`  U'          Y'        |
    ----------------------------------------------------------------------
    | Ex | a`  a'  a^  a~                  e`  e'  e^      i`  i'        |
    ----------------------------------------------------------------------
    | Fx | dd      o`  o'  o^  o~              u`  u'          y'        |
    ----------------------------------------------------------------------

       Experience in development of this encoding  on  the  MSDOS
platform  motivates  the  consideration of line-drawing glyphs in
the PC character set (code page 850). Code positions  occupied by
single- and double-line drawing characters should be  populated
with upper case letters. It  is  possible  to  do  this   without
violating  the  major  guidelines already established above. With
this provision, the MSDOS user can be supplied with  code   pages
containing  either  PC line-drawing or Vietnamese glyphs. For ex-
isting applications, the user can choose  the  code   page   most
appropriate   for   her   purpose.  Where the code page with line
drawing characters must be used, the penalty from  missing  Viet-
namese  characters has been minimized by the choice of the infre-
quently used ones. For new applications, code page switching  can
easily be done on the fly, if it is desired.

       The preceding guidelines have resulted in the  VISCII   8-
bit Vietnamese encoding proposal listed in Table 3. It is intend-
ed to be a single table that applies to Vietnamese data  handling
including  storage,  processing, transmission, and font encoding.
This greatly simplifies the  integration,   implementation,   and
usage  processes  and is indeed one of the major strengths of the
proposal.

4 VIQR: MNEMONIC ENCODING SPECIFICATION FOR VIETNAMESE

4.1 MOTIVATION

While the  8-bit  specification  attempts  to  standardize  Viet-
namese  encoding  in  8-bit  environments, much remains to be ad-
dressed in important 7-bit environments such as  electronic  mail
transport  and other 7-bit data lines, as well as in keyboard en-
try  applications  where the interface for generating  Vietnamese


       Table 3: VISCII 8-bit Encoding Standard Proposal for Vietnamese.
    -----------------------------------------------------------------------
    |    |  0   1   2   3   4   5   6   7   8   9   A   B   C   D   E   F |
    |====|================================================================|
    | 0x | nul A(? stx etx eot A(~ A^~ bel bs  ht  lf  vt  ff  cr  so  si |
    |----|----------------------------------------------------------------|
    | 1x | dle dc1 dc2 dc3 Y?  nak syn etb can Y~  sub esc fs  gs  Y.  us |
    |----|----------------------------------------------------------------|
    | 2x |     !   "   #   $   %   &   '   (   )   *   +   ,   -   .   /  |
    |----|----------------------------------------------------------------|
    | 3x | 0   1   2   3   4   5   6   7   8   9   :   ;   <   =   >   ?  |
    |----|----------------------------------------------------------------|
    | 4x | @   A   B   C   D   E   F   G   H   I   J   K   L   M   N   O  |
    |----|----------------------------------------------------------------|
    | 5x | P   Q   R   S   T   U   V   W   X   Y   Z   [   \   ]   ^   _  |
    |----|----------------------------------------------------------------|
    | 6x | `   a   b   c   d   e   f   g   h   i   j   k   l   m   n   o  |
    |----|----------------------------------------------------------------|
    | 7x | p   q   r   s   t   u   v   w   x   y   z   {   |   }   ~   del|
    |----|----------------------------------------------------------------|
    | 8x | A.  A(' A(` A(. A^' A^` A^? A^. E~  E.  E^' E^` E^? E^~ E^. O^'|
    |----|----------------------------------------------------------------|
    | 9x | O^` O^? O^~ O^. O+. O+' O+` O+? I.  O?  O.  I?  U?  U~  U.  Y` |
    |----|----------------------------------------------------------------|
    | Ax | a.  a(' a(` a(. a^' a^` a^? a^. e~  e.  e^' e^` e^? e^~ e^. o^'|
    |----|----------------------------------------------------------------|
    | Bx | o^` o^? o^~ O+~ O+  o^. o+` o+? i.  U+. U+' U+` U+? o+  o+' U+ |
    |----|----------------------------------------------------------------|
    | Cx | A`  A'  A^  A~  A?  A(  a(? a(~ E`  E'  E^  E?  I`  I'  I~  y` |
    |----|----------------------------------------------------------------|
    | Dx | DD  u+' O`  O'  O^  O~  y?  u+` u+? U`  U'  y~  y.  Y'  o+~ u+ |
    |----|----------------------------------------------------------------|
    | Ex | a`  a'  a^  a~  a?  a(  u+~ a^~ e`  e'  e^  e?  i`  i'  i~  i? |
    |----|----------------------------------------------------------------|
    | Fx | dd  u+. o`  o'  o^  o~  o?  o.  u.  u`  u'  u~  u?  y'  o+. U+~|
    -----------------------------------------------------------------------


characters needs to be standardized.

       Transporting more than 128 unique symbols over 7-bit  data
channels  is  not  a problem specific to the Vietnamese language.
Since its proposal in 1982, the Internet Simple Mail Transfer
Protocol (``SMTP'', [8]) has seen unrelenting efforts to extend
it to accommodate 8-bit and wider-word data in European Latin
scripts and Oriental ideographic characters (see, e.g., [9]).
While clean 8-bit transport is highly desirable, not all mail
gateways are going to be converted overnight. For the foreseeable
future there is a need for unambiguous transport of Vietnamese
text over existing 7-bit channels.

       Indeed there is an ad-hoc standard in use on  the  VietNet
mailing  list  and  the  Usenet newsgroup Soc.Culture.Vietnamese,
where mnemonic use of appropriate characters to  follow  a  vowel
proves to be quite readable; for example, ``Vi<e^.>t Nam'' would be
written as ``Vie^.t Nam''. However, this is troubled by  the  am-
biguity  in the multiple roles played by the mnemonic diacritical
marks; for example, does ``tha?'' mean ``tha?'' or ``th<a?>''?

       The Viet-Net convention is not far in concept from a
quoted-readable format proposed by K. Simonsen [10, 11], which
disambiguates such texts by specifying text states at both the
character and character set levels. Unfortunately, in its attempt
to provide a universal solution to mnemonic encoding, the
proposal does not provide a good answer for Vietnamese text.
First, it restricts the use of mnemonics to the 83 invariant
ISO-646 [12] graphic characters, which is a good idea in
principle, but sacrifices readability in the process. For example,
the counter-intuitive mnemonics for hook-above (da^'u ho?i) and
tilde (da^'u nga~) are ``2'' and ``?'', respectively, in order to
avoid ``~'' itself, which is not an invariant. The wide availability of
ASCII keyboards to the great majority of Vietnamese  users  makes
this  too  unreasonable a limitation in the context of Vietnamese
processing. It should be noted that we are in fact   arguing   in
favor   of   ``readability  for most'' against ``illegibility for
all.'' Furthermore, with ongoing progress on keyboard and display
internationalization,   e.g.,   in  graphical window environments
where keyboard mapping and font switching are easily implemented,
this availability is on the increase, further obsoleting the res-
triction.

       The greater difficulty is that  the  two-character  fixed-
length encoding(4) cannot provide a readable or mnemonic rep-

_________________________________________
  (4) The convention is ``&xy'', where x is a  literal  character
and y represents some combining form.



resentation of all Vietnamese  characters,  in  particular  those
with 2 diacritical  marks.  The variable-length mnemonics(5) have
been extended to include  all  Vietnamese  characters,  but  this
scheme  is  so cluttered with announcers and delimiters that rea-
dability and efficiency are near nil,  keeping   in   mind   that
diacritics  are  heavily  used in Vietnamese.  While machine data
translators  will  have  little  trouble  with  any  ``mnemonic''
scheme,  one  that is directly accessible to human users, who are
in many cases typing mail messages using 7-bit editors, needs  to
be  more  user-friendly. A Vietnamese user will not want to learn
or remember among all possible combinations that,   say,   ``a5''
stands  for  "a('", nor will she like typing sequences as long as
``&_a('_'' for some letter in every word.

       To satisfy the readability and flexibility requirements, a
separate   specification   is necessary. It is better to adopt an
approach like code-page switching under ISO-2022 [13] to  switch
the text into ``Vietnamese'' mode and optimize encoding according
to the language state. Recently, van der Poel put forth a mnemon-
ic   proposal   [14] which emphasizes language-specific conven-
tions for these reasons. This proposal provides a means to speci-
fy  the  language  state,  each with its own (efficient) encoding
method. Its strength lies in the flexible specification that con-
formant   implementations   ``need  not be able to display all of
the character sets specified''; they have the option  of  stating
messages  such  as  ``undisplayable Greek appeared here'' for un-
supported languages (for a more precise  specification,  see  [14]).
This allows networking communities to determine the best ap-
proach for encoding their own languages. The VIQR  convention  is
compatible  with  this approach and should easily be incorporated
into this framework.

       The specification here encompasses all  data  streams  in-
cluding text transfer, file I/O, and keyboard entry. This princi-
ple has been the major reason for success in  operating   systems
such   as   Unix,  in which device-specific details are hidden as
much as possible from the applications  programmer,   leaving   a
uniform  interface  above which tools such as common library rou-
tines can be shared. Indeed as the keyboard example above has im-
plied,   the  characters actually typed by the user are often not
different from  the  text  data  that  is  eventually  stored  or
transmitted.  It  is therefore desirable to provide a common base
on which  to  build  data  interpreters  for  all  data  streams,
independent of the input source. In actual implementation, this
_________________________________________
  (5) The convention is ``&_xxxx_'' where xxxx can  be  an  arbi-
trary mnemonic sequence.



has greatly facilitated  development  of  the  Vietnamese-capable
software base.

       In addition, the user stands to benefit tremendously  from
standardization  of  keyboard entry. One does not need to learn a
different keyboard entry technique for   each   different   Viet-
namese  application. If one standard keyboard model is fully sup-
ported by all Vietnamese software, a user   familiar   with   the
standard   can  sit down and start typing Vietnamese immediately.
This standard defines the minimum expected behavior   from   com-
pliant   software;  any additional input techniques can of course
be incorporated as a superset of the standard behavior.  This  is
discussed   further  in   Section  5.2 on Vietnamese keyboarding.

4.2 QUOTED-READABLE SPECIFICATION (VIQR)

The mnemonic model from Viet-Net is fully employed in the specif-
ication.   The   Vietnamese  QR  comprises  three  major  states:
Literal, English, and Vietnamese. The Literal state  is  intended
for   completely  transparent handling of literal data (except of
course for the escape sequences into and out of  Literal  state).
The  English  and Vietnamese states are designed for mixed use of
English and Vietnamese, with each optimized in appearance as well
as  data size for texts containing mostly English and Vietnamese,
respectively. In either state there exist methods  for  composing
Vietnamese-specific characters,  using  a base vowel followed by
one or two diacritics.

       We first introduce the concept of  implicit  and  explicit
composition,  then  discuss  how  they  are  used  in each of the
states.

4.2.1 Implicit Composition

Implicit composition is useful for data containing a  large  per-
centage of Vietnamese characters.

       With implicit composition, a sequence of  a   base   vowel
followed   by   one or two diacritical marks is combined into one
Vietnamese letter as long as it is grammatically legal.  This  is
best illustrated by examples:



                       a^       --> <a^>
                       o+?      --> <o+?>
                       <o+>?    --> <o+?>
                       Vie^.t   --> Vi<e^.>t
                       Vi<e^>.t --> Vi<e^.>t
                       la'^n    --> l<a'>^n (not l<a^'>n)
                       l<a'>^n  --> l<a'>^n (not l<a^'>n)

       Note in the last two examples that the sequence a^' is not
grammatically equivalent to a'^  or  <a'>^. In general a modifier
("(", "^", "+") must immediately follow the appropriate  vowel in
order to be combined.

       The special sequence "dd" is  composed into  "<dd>"; "DD",
"dD", and "Dd" all represent "<DD>".

       The base vowels are: a, a(, a^, e, e^, i, o, o^, o+, u, u+,
y, and their corresponding capitals. The encoding values are those
listed in Table 3, the 8-bit VISCII proposed standard.
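
       The implicit composition rule above can be sketched  in  a
few lines of Python. This is only an illustration under stated
assumptions, not part of the specification: the names MODIFIERS,
TONES and compose_implicit are hypothetical, the input is taken to
be 7-bit text only (a precomposed 8-bit vowel followed by a tone
mark is not handled), and composed letters are shown in the <...>
notation used throughout this text.

```python
# Modifiers that each plain vowel may take, giving the base vowels above.
MODIFIERS = {"a": "(^", "e": "^", "o": "^+", "u": "+", "i": "", "y": ""}
TONES = "'`?~."  # acute, grave, hook above, tilde, dot below


def compose_implicit(text):
    """Mark each composed letter as <...>, following the examples above."""
    out = []
    i = 0
    while i < len(text):
        ch = text[i]
        pair = text[i:i + 2]
        if pair.lower() == "dd":
            # "dd" gives <dd>; "DD", "dD" and "Dd" all give <DD>.
            out.append("<dd>" if pair == "dd" else "<DD>")
            i += 2
            continue
        allowed = MODIFIERS.get(ch.lower())
        if allowed is None:
            out.append(ch)  # not a vowel: copy through unchanged
            i += 1
            continue
        j = i + 1
        # A modifier combines only if it immediately follows its vowel.
        if j < len(text) and text[j] in allowed:
            j += 1
        # At most one tone mark may then follow.
        if j < len(text) and text[j] in TONES:
            j += 1
        unit = text[i:j]
        out.append("<" + unit + ">" if len(unit) > 1 else unit)
        i = j
    return "".join(out)


for sample in ("a^", "o+?", "Vie^.t", "la'^n", "ddi dda^u"):
    print(sample, "-->", compose_implicit(sample))
```

       Because the modifier is only accepted directly after the
vowel, "la'^n" stops composing at the acute accent and leaves the
circumflex as an ordinary character, as in the examples above.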

       The diacritical marks are represented by ASCII characters
having correspondingly similar appearances. Table 4 lists the 8
ASCII characters used as mnemonic replacements for the Vietnamese
diacritics: the first three are modifiers, and the remaining five
are tone marks.

           Table 4: ASCII Mnemonics for Vietnamese Diacritics 
        --------------------------------------------------------
        | Diacritic  |  Char  |  ASCII Code      | Da^'u       |
        |============|========|==================|=============|
        | breve      |   (    | 0x28, left paren | tra(ng (()  |
        | circumflex |   ^    | 0x5E, caret      | mu~ (^)     |
        | horn       |   +    | 0x2B, plus sign  | mo'c (+)    |
        |------------|--------|------------------|-------------|
        | acute      |   '    | 0x27, apostrophe | sa('c (')   |
        | grave      |   `    | 0x60, backquote  | huye^`n (`) |
        | hook above |   ?    | 0x3F, question   | ho?i (?)    |
        | tilde      |   ~    | 0x7E, tilde      | nga~ (~)    |
        | dot below  |   .    | 0x2E, period     | na(.ng (.)  |
        --------------------------------------------------------
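
       As an illustration of Table 4, the sketch below maps  each
mnemonic to a Unicode combining mark and lets Unicode
normalization produce the precomposed letter. The Unicode code
points are an assumption added here for illustration; the
specification itself targets the 8-bit table (Table 3), and the
names COMBINING and viqr_unit_to_unicode are hypothetical.

```python
import unicodedata

# Table 4 mnemonics mapped to Unicode combining marks (assumed mapping).
COMBINING = {
    "(": "\u0306",  # breve
    "^": "\u0302",  # circumflex
    "+": "\u031b",  # horn
    "'": "\u0301",  # acute
    "`": "\u0300",  # grave
    "?": "\u0309",  # hook above
    "~": "\u0303",  # tilde
    ".": "\u0323",  # dot below
}


def viqr_unit_to_unicode(unit):
    """Convert one composed unit such as "e^." to a precomposed letter."""
    if unit in ("dd", "DD"):
        # The d-bar letters are separate characters, not d plus a mark.
        return "\u0111" if unit == "dd" else "\u0110"
    base, marks = unit[0], unit[1:]
    decomposed = base + "".join(COMBINING[m] for m in marks)
    # NFC reorders the marks canonically and composes them with the base.
    return unicodedata.normalize("NFC", decomposed)


for unit in ("e^.", "o+?", "a('", "u+~", "dd"):
    print(unit, "-->", viqr_unit_to_unicode(unit))
```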

4.2.2 Explicit Composition

Explicit composition is associated with the concept of a  leading
character   which   explicitly announces the composition. The an-
nouncer character is the backslash ("\",   ASCII   0x5C),   known
here   as  <COM>. The subsequent combining characters are defined
in the same way as those in implicit composition. Thus  the   ex-
amples given above would appear in explicit composition mode as:



                              \a^     --> <a^> 
                              \o+?    --> <o+?>
                              Vi\e^.t --> Vi<e^.>t

       Explicit composition is useful for data  containing  main-
ly   English  text, as well as for maintaining real-time compati-
bility with keyboard character events, as will be  discussed   in
Section  5.2  on  Vietnamese  keyboarding.   With the composition
methods described, we are now ready to discuss how they  are  em-
ployed   in   each  of  the  three  states. The state of the data
stream is specified by the two character sequence <COM>x, where x
is specified below.

4.2.3 Literal State

The appearance of <COM>L or <COM>l in the data stream initiates
the Literal state. This state is intended for near-perfect
transparent literal data transfer. Neither implicit nor explicit
composition is available here, nor is the <COM> character
special, except when it is followed by one of the six characters
l, L, v, V, m or M, which initiates one of the three states (6).

4.2.4 English State

The sequence <COM>M or <COM>m sets the data stream state  to  En-
glish.  In English state, only explicit composition is supported.
This means that in order to generate a Vietnamese   letter,   the
announcer  character  <COM>  must be used.  A ``composition'' se-
quence not preceded by <COM> will be  left  uninterpreted.  Exam-
ples:

        \mD\u~ng, how are you? --> D<u~>ng, how are you?
        \mKho\e? kh\o^ng?      --> Kho<e?> kh<o^>ng?

       As noted, the sequence "you?" above  was   not   converted
into "yo<u?>" because no composition was specified.

4.2.5 Vietnamese State

The data stream state is set  to  Vietnamese  when  the  sequence
<COM>V or <COM>v is encountered. In Vietnamese mode, both
_________________________________________
  (6) To effect <COM>L, <COM>M,  and  <COM>V  themselves,  it  is
necessary to switch to either English or Vietnamese state and use
the Character Literal feature available there.



explicit and implicit compositions are in effect.  The  following
examples  assume  that  the  data  stream is initially in English
state:

                  \vCh\u+~ Vi\e^.t --> Ch<u+~> Vi<e^.>t
                  \vChu+~ Vie^.t   --> Ch<u+~> Vi<e^.>t
                  Chu+~ \vVie^.t   --> Chu+~ Vi<e^.>t

       The availability of  implicit  composition  in  Vietnamese
state   ensures   that the text is not cluttered with unnecessary
<COM>s, as would be the case in Vietnamese text  using   explicit
composition. Explicit composition is included to maintain
compatibility with the English state so that there is no need to
define additional meanings for the <COM> sequences. The real-time
keyboard compatibility mentioned previously is also available in
Vietnamese state through explicit composition.

4.2.6 Character Literals in English and Vietnamese States

Consider the following example:

         \vDu~ng, how are you? --> D<u~>ng, how are yo<u?>

       In this example, the sequence "you?"  was  interpreted  as
"yo<u?>" because the data stream was still in Vietnamese state. Thus
it is sometimes desirable to  suppress   composition   altogether
without   having   to  switch states. The literal property of the
<COM> character conveniently accomplishes this. In  either  Viet-
namese   or   English state, whenever <COM> is followed by a non-
combining character c the result is the literal character  c  it-
self.   The   <COM> is discarded from the data stream. To get the
<COM> character literally, use <COM><COM>. Consider the following
examples:

                       \vddi dda^u?  --> <dd>i <dd><a^><u?>
                       \vddi dda^u\? --> <dd>i <dd><a^>u?
                       \vddi v\o^?   --> <dd>i v<o^?>
                       \vddi v\o^\?  --> <dd>i v<o^>?
                       \h\e\l\l\o    --> hello
                       \\            --> \
                       \\V           --> \V
                       \\M           --> \M
                       \\L           --> \L
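
       The three states, the two composition methods,  and  the
character literal rule can be combined into one small decoder.
The Python sketch below is a minimal illustration, not a
reference implementation. Its assumptions: the stream starts in
English state unless told otherwise; output uses the <...>
notation of this text; closure (Section 4.2.7) is not handled; and
a state-switch sequence is recognized in every state, so <COM>l
always enters Literal state rather than yielding a literal "l".
The names scan_unit, mark and decode are hypothetical.

```python
MODIFIERS = {"a": "(^", "e": "^", "o": "^+", "u": "+", "i": "", "y": ""}
TONES = "'`?~."
COM = "\\"


def scan_unit(text, i):
    """Return the end index of a composition unit starting at i, or i."""
    ch = text[i]
    if ch in "dD":
        return i + 2 if text[i + 1:i + 2] in ("d", "D") else i
    allowed = MODIFIERS.get(ch.lower())
    if allowed is None:
        return i
    j = i + 1
    if j < len(text) and text[j] in allowed:
        j += 1                       # modifier directly after the vowel
    if j < len(text) and text[j] in TONES:
        j += 1                       # at most one tone mark
    return j


def mark(unit):
    """Render a unit in the <...> notation; a bare vowel stays as it is."""
    if unit.lower() == "dd":
        return "<dd>" if unit == "dd" else "<DD>"
    return "<" + unit + ">" if len(unit) > 1 else unit


def decode(text, state="M"):
    """Decode VIQR text; state is "L", "M" (English) or "V"."""
    out = []
    i = 0
    while i < len(text):
        ch = text[i]
        nxt = text[i + 1:i + 2]
        if ch == COM and nxt and nxt in "lLmMvV":
            state = nxt.upper()      # state switch (assumed to win everywhere)
            i += 2
        elif state == "L":
            out.append(ch)           # Literal state: pass everything through
            i += 1
        elif ch == COM and nxt:
            end = scan_unit(text, i + 1)
            if end > i + 1:          # explicit composition
                out.append(mark(text[i + 1:end]))
                i = end
            else:                    # character literal: drop <COM>, keep c
                out.append(nxt)
                i += 2
        elif state == "V" and scan_unit(text, i) > i:
            end = scan_unit(text, i)  # implicit composition
            out.append(mark(text[i:end]))
            i = end
        else:
            out.append(ch)
            i += 1
    return "".join(out)


print(decode("\\vDu~ng, how are you\\?"))
print(decode("\\mKho\\e? kh\\o^ng?"))
```

       The decoder makes a single left-to-right pass and keeps
only the current state, which is what allows the same logic to
serve file data, mail text, and keyboard input alike.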



4.2.7 Closure

The data stream supports another special character used  to  gen-
erate  explicit  closure.  The closure character is CTRL-A (ASCII
0x01), known here as <CLS>. When <CLS> is  encountered   in   the
data   stream,  it immediately terminates any ongoing composition
sequence. The <CLS> itself is always discarded, unless   it   ap-
pears in the literal sequence \<CLS>.

       Explicit closure is useful in real-time  character  appli-
cations  such  as keyboard entry, when it is necessary to specify
that a composition sequence has in fact ended and the  input  en-
gine should not stay hanging and wait for more data.

5 SPECIFIC APPLICATIONS

This section outlines application-specific guidelines and conven-
tions   that  have evolved in the software development community.
It is intended to be a live and growing documentation   of   such
discussions  as  more experience is gathered. Readers are welcome
to participate in these  discussions  and   contribute   to   the
development  of  these guidelines in particular, and to the stan-
dards in general.

5.1 ELECTRONIC MAIL OVER 7-BIT CHANNELS

Many of the available channels for  electronic   mail   currently
still  enforce  the 7-bit limitation. The 8-bit character set de-
fined in Section 3 cannot be transported  verbatim   over   these
channels.  VIQR  plays an important role here, as it provides for
7-bit transport of Vietnamese text without the  ambiguity   prob-
lem   of  deciding  what  to  do  with  the  double  usage  of  a
diacritical/punctuation mark, e.g., the  hook-above  or  question
mark,   "?".  Because of the 7-bit nature of these communications
channels, mail  agents  will  typically   not   encounter   those
Vietnamese-specific    base  vowels that  are  encoded  in the G1
area,  namely:  a(,  A(, a^,  A^, e^, E^, o^, O^, o+, O+, u+, and
U+. However, mail agents designed to work with 8-bit channels are
still  expected  to  handle  the  occurrence  of these characters
according to the  complete  VIQR,  namely to combine base  vowels
and diacritical marks as appropriate.
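
       A filter from 8-bit data to 7-bit VIQR might look like the
following Python sketch. It is a hedged illustration only: the
list UPPER_HALF is copied from rows 8x through Fx of Table 3 (with
position C3 taken as A~, as in Table 2), the six capitals placed
in the C0 rows are not handled, and the function name
viscii_to_viqr is hypothetical. A punctuation mark that would
otherwise be absorbed by a preceding vowel is written as a
character literal.

```python
# Rows 8x through Fx of Table 3, written as VIQR mnemonics.
UPPER_HALF = """
A.  A(' A(` A(. A^' A^` A^? A^. E~  E.  E^' E^` E^? E^~ E^. O^'
O^` O^? O^~ O^. O+. O+' O+` O+? I.  O?  O.  I?  U?  U~  U.  Y`
a.  a(' a(` a(. a^' a^` a^? a^. e~  e.  e^' e^` e^? e^~ e^. o^'
o^` o^? o^~ O+~ O+  o^. o+` o+? i.  U+. U+' U+` U+? o+  o+' U+
A`  A'  A^  A~  A?  A(  a(? a(~ E`  E'  E^  E?  I`  I'  I~  y`
DD  u+' O`  O'  O^  O~  y?  u+` u+? U`  U'  y~  y.  Y'  o+~ u+
a`  a'  a^  a~  a?  a(  u+~ a^~ e`  e'  e^  e?  i`  i'  i~  i?
dd  u+. o`  o'  o^  o~  o?  o.  u.  u`  u'  u~  u?  y'  o+. U+~
""".split()

VOWELS = set("aeiouyAEIOUY")
MARKS = set("(^+'`?~.")


def viscii_to_viqr(data):
    """Encode 8-bit bytes as 7-bit VIQR text in Vietnamese state."""
    out = ["\\V"]          # set the state explicitly at the start
    prev = ""              # previous unit emitted, without any escape
    for byte in data:
        if byte >= 0x80:
            unit = UPPER_HALF[byte - 0x80]
        else:
            unit = chr(byte)
            # Would this ASCII character compose with what came before?
            absorbed = (prev[:1] in VOWELS and unit in MARKS) or (
                prev in ("d", "D") and unit in ("d", "D"))
            if unit == "\\" or absorbed:
                out.append("\\")   # character literal keeps it separate
        out.append(unit)
        prev = unit
    out.append("\\L")      # leave the stream in Literal state
    return "".join(out)


print(viscii_to_viqr(bytes([0x56, 0x69, 0xAE, 0x74])))
```

       The escape test is deliberately generous: it may emit a
<COM> that is not strictly needed, which is harmless because
<COM> followed by a non-combining character yields that character.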

       In order to be correctly interpreted, electronic mail mes-
sages  must  explicitly  set  the  language  state  either in the
headers or text body. One cannot assume what state the  receiving
input  engine  is  in at the start of the message, since messages
are not always read in message units, e.g., when a file  contain-
ing multiple mail messages is scanned.



       Furthermore, if a language state specification (\L, \V  or
\M)  is  present in a mail message, it is highly recommended that
the message end in the Literal state.  This  helps   applications
reading   multiple   mail  messages in one data stream, such as a
terminal application. It is useful because mail  headers  do  not
adhere to the VIQR, and they are more adversely affected when in-
terpreted in non-Literal states.
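
       The two recommendations above (set the state  explicitly,
and end in Literal state) can be checked mechanically. The Python
sketch below is an illustration under assumptions: the function
names final_state and wrap_for_mail are hypothetical, and the
message body is assumed to be well-formed VIQR that does not end
in a lone <COM>.

```python
def final_state(text, state="L"):
    """Return the VIQR state ("L", "M" or "V") in effect after text."""
    i = 0
    while i < len(text):
        nxt = text[i + 1:i + 2]
        if text[i] != "\\" or not nxt:
            i += 1
        elif nxt in "lLmMvV":
            state = nxt.upper()   # \L, \M or \V switches state
            i += 2
        elif state == "L":
            i += 1                # in Literal state a lone <COM> is ordinary
        else:
            i += 2                # \c literal or start of explicit composition
    return state


def wrap_for_mail(body, state="V"):
    """Set the state explicitly and make sure the message ends in Literal."""
    message = "\\" + state + body
    if final_state(message) != "L":
        message += "\\L"
    return message


print(wrap_for_mail("Vie^.t Nam"))
```

       Skipping two characters after <COM> in the English and
Vietnamese states matters: it keeps an escaped <COM><COM> followed
by "v" from being misread as a state switch.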

5.2 VIETNAMESE KEYBOARDING

Keyboards are becoming increasingly  internationalized.  As  men-
tioned  in  the 8-bit specification, this is the major reason for
using the same code positions for those Vietnamese characters al-
ready  present  in ISO 8859/Latin-1. A Vietnamese keyboard driver
designed to work in the 7-bit-only environment can assume that it
will  not  encounter  Vietnamese  base  vowels  residing  in  G1.
Keyboard drivers for the 8-bit environments, like 8-bit electron-
ic mail agents (Section 5.1), must be prepared to accept any base
vowel, including those encoded in G1.

       The real-time echoing behavior of  keyboard  input  during
composition requires further specification. The options are to
report the character only after the composition sequence has
finished, or to report all intermediate forms, backspacing over
each as it is superseded. Each has its own useful context as
described below.

5.2.1 Immediate Echo for Implicit Composition

Implicit composition is designed to be convenient  for   a   user
processing  data  that is mostly Vietnamese. As such it is desir-
able for the keyboarding user to  get   immediate   feedback   on
typed   keys.   With  implicit composition, the keyboard works in
immediate-echo  mode.  Keypresses   immediately   generate    key
events.  If  a character is subsequently composed with a diacrit-
ical mark, a backspace (typically BS, ASCII 0x08)  is  sent  fol-
lowed   by   the  new composed character. This cycle continues as
long as composition is possible. The sequence of events  for  the
key sequence "a^'n" under immediate echo is:


   1. user types a, a is sent to the application

   2. user types ^, BS and <a^> are sent

   3. user types ', BS and <a^'> are sent

   4. user types n, the single key n is sent



       The actual backspace character code may vary depending on
the system, application, and user settings. The keyboard interface
should use the appropriate code, and/or allow the user to
specify the preferred backspace character.
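
       The immediate-echo cycle can be  sketched  as  a  function
from keypresses to the events sent to the application. This is a
simplified Python illustration: the name immediate_echo is
hypothetical, the "dd" sequence is left out, and the backspace
code is a parameter so that it can follow the user's setting.

```python
MODIFIERS = {"a": "(^", "e": "^", "o": "^+", "u": "+", "i": "", "y": ""}
TONES = "'`?~."
BS = "\x08"  # default backspace code; may differ per system


def immediate_echo(keys, backspace=BS):
    """Return, for each keypress, the list of events sent on."""
    events = []
    pending = ""  # composition unit echoed so far, e.g. "a^"
    for key in keys:
        base = pending[:1].lower()
        has_tone = len(pending) > 1 and pending[-1] in TONES
        can_modify = len(pending) == 1 and key in MODIFIERS.get(base, "")
        can_tone = bool(pending) and not has_tone and key in TONES
        if can_modify or can_tone:
            pending += key
            # Erase the form already echoed, then send the new one.
            events.append([backspace, "<" + pending + ">"])
        else:
            # Only a vowel can start a new composition.
            pending = key if key.lower() in MODIFIERS else ""
            events.append([key])
    return events


for step in immediate_echo("a^'n"):
    print(step)
```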

5.2.2 Delayed Echo for Explicit Composition

When a composition sequence is started,  the  keyboard  interface
must   not  send any key events to the application expecting key-
board input until the sequence is  terminated.   Composition  may
end   either   naturally  when the interface receives a character
that cannot be composed into the sequence, or  when  the  closure
character   <CLS>   is  received. A single key event for the com-
posed character is then sent to the application above. Subsequent
processing  can proceed naturally. Consider what happens when the
user types the sequence "\a^'n" under delayed echo:

   1. user types \, no key is sent to the application

   2. user types a, no key is sent

   3. user types ^, no key is sent

   4. user types ', the single key <a^'> is sent

   5. user types n, the single key n is sent

Or an example involving closure, "t\o+<CLS>":

   1. user types t, the key t is sent

   2. user types \, no key is sent

   3. user types o, no key is sent

   4. user types +, no key is sent

   5. user types CTRL-A, the single key <o+> is sent

       Note that without the closure key the  keyboard  interface
would  still  be left hanging after the "+" key has been pressed,
because the user can still enter a tone mark as part of the  com-
position sequence.
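
       Delayed echo can be sketched the same way. In the Python
illustration below the name delayed_echo is hypothetical, and
several simplifications are assumed: state-switch sequences and
"dd" are not treated specially, and a tone mark is taken to
complete the sequence at once, as in step 4 of the first example.

```python
MODIFIERS = {"a": "(^", "e": "^", "o": "^+", "u": "+", "i": "", "y": ""}
TONES = "'`?~."
COM = "\\"
CLS = "\x01"  # CTRL-A, the closure character


def delayed_echo(keys):
    """Return, for each keypress, the list of key events sent on."""
    events = []
    pending = None  # None = idle, "" = <COM> seen, otherwise unit so far
    for key in keys:
        sent = []
        if pending is None:
            if key == COM:
                pending = ""          # start composing, send nothing yet
            else:
                sent.append(key)
        elif pending == "":
            if key.lower() in MODIFIERS:
                pending = key         # base vowel: keep waiting
            else:
                sent.append(key)      # not a vowel: character literal
                pending = None
        elif len(pending) == 1 and key in MODIFIERS[pending.lower()]:
            pending += key            # modifier: a tone mark may follow
        elif key in TONES:
            sent.append("<" + pending + key + ">")  # tone completes it
            pending = None
        else:
            # Natural end or closure: flush what has been composed.
            sent.append("<" + pending + ">" if len(pending) > 1 else pending)
            if key not in (CLS, COM):
                sent.append(key)      # <CLS> is discarded
            pending = "" if key == COM else None
        events.append(sent)
    return events


print(delayed_echo("\\a^'n"))
print(delayed_echo("t\\o+" + CLS))
```

       An empty list marks a keypress for which nothing is sent,
which makes the "hanging" state after the "+" key visible until
the closure key arrives.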

       This delayed-echo behavior  for  explicit  composition  is
designed   to   ensure  compatibility with applications expecting
single key events for each character, particularly in the English
state   where   only explicit composition is available.  While it
is certainly possible to have immediate-echo in explicit composi-
tion  or  delayed-echo in implicit composition, these options are
not useful and serve only to confuse the user learning   how   to
use a Vietnamese keyboard.



It is therefore simplest to associate  delayed-echo  with  expli-
cit   composition,  and immediate-echo with implicit composition.
These options make natural sense.

       This standard defines the  minimal  ``look-and-feel''  be-
havior   a   user can expect from a compliant Vietnamese software
package. A standardized interface decreases the  required  learn-
ing   time  for each new application. This standard does not pre-
clude other input mechanisms to improve user-friendliness,  e.g.,
intelligent   menu-driven  diacritics, or to assist in speed typ-
ing, e.g., through the use  of  CONTROL  or  FUNCTION  keys.  Any
enhancement   in  compliant applications is a bonus for the user,
so long as such enhancements do not adversely conflict  with  the
minimum   expected  behavior described here.

5.3 ADAPTING EXISTING VIETNAMESE APPLICATIONS

A realistic approach to standardization provides for the  inertia
against  change  in  existing software applications.  While it is
desirable that the standard 8-bit encoding  described   here   be
fully  supported, an alternative exists which is more amenable to
rapid adoption. All applications should provide a means  for  im-
porting  and exporting data encoded using the VISCII 8-bit encod-
ing table. At the same time, the VIQR keyboard  interface  should
be  implemented, at least as an optional entry method. Such moves
are highly desirable for the user and  the   vendor   alike.
The  user will be able to use the software immediately because of
the uniform keyboard interface, as well as process the  same  da-
ta in different applications and on different platforms, with in-
creased productivity and interactivity among users.   This   ease
of   use   means  greater acceptance and a correspondingly larger
customer base for the vendor.

6 SUMMARY & CONCLUSIONS

This paper has presented a proposal for standardization of  Viet-
namese  information  processing.  A  case  has  been made for the
necessity of standardization; we hope to  have  encouraged   ven-
dors  and  users of Vietnamese alike to work together toward this
goal to benefit everyone involved. Various  encoding   approaches
were  discussed, leading to the choice of the VISCII 8-bit encod-
ing proposal. A single encoding table was presented that has been
shown  in  actual  practice to work well for Vietnamese including
editing, processing, storage, transfer, font encoding, and
printing. Where 8-bit data handling was not available or
reliable, e.g., electronic mail transport, the Vietnamese
Quoted-Readable specification (VIQR) was introduced to provide a
seamless filtering
gateway. VIQR was  defined  to  be  input-source-independent  and
hence  has  been designed to be applicable to Vietnamese keyboard
input as well as machine data filters. All of this was  shown  to
have   been   integrated  into existing environments facilitating
the use of existing tools and applications--a major  strength  of
the   encoding.   Finally,  these specifications have been linked
together  seamlessly  to  include  every  point  in  the   input-
process/transfer-output  cycle of data handling and provide for a
truly unified framework for  Vietnamese  information  processing.

References

  [1] Ba.ch Hu+ng Khang. ``Institute of Informatics,''. Ha`
         No^.i, Vie^.t Nam, February 1991.

  [2] B. Jerman-Blazic, ``Will the Multi-octet Standard
         Character Set Code Solve the World Coding Problems
         for Information Interchange?,'' Computer Standards
         & Interfaces, vol. 8, pages 127--136, 1988.

  [3] The Unicode Consortium. The Unicode Standard:
         Worldwide Character Encoding Version 1.0. Addison-
         Wesley, Reading, MA, first edition, October 1991.

  [4] ISO Technical Committee, ``Universal
         Multiple-Octet Coded Character Set (UCS), ISO/IEC
         DIS 10646-1.2,'' Draft standard, International
         Organization for Standardization, 1992.

  [5] International Organization for Stan-
         dardization. ISO 8859/x: 8-bit International Code
         Sets. ISO, 1977.

  [6] Famjxuaen Thais. Vie^.t Ngu+~ Ca?i Ca'ch. Tu+' Ha?i, Ha` No^.i,
         Vie^.t Nam, March 1948.

  [7] Pha.m Xua^n Tha'i. Chu+~ Vie^.t Ho+.p Li'. Ti'n DDu+'c Thu+ Xa~
         Vie^.t Nam, April 1958.

  [8] J. Postel, ``Simple Mail Transfer Protocol,'' RFC
         821, USC Information Sciences Institute, August
         1982.

  [9] J. C. Klensin et al., ``SMTP Extensions for
         Transport of Text-Based Messages Containing 8-bit
         Characters,'' Internet draft, Massachusetts
         Institute of Technology, July 1991.

[10] K. Simonsen, ``Character Mnemonics & Character
         Sets,'' Internet draft, Danish Unix Users Group,
         January 1992.

[11] K. Simonsen, ``Mnemonic Text Format,'' Internet
         draft, Danish Unix Users Group, August 1991.



[12] International Organization for Standardization. ISO 646: 7-bit Cod-
         ed Character Set for Information Interchange. ISO,
         third edition, 1991.

[13] International Organization for
         Standardization. ISO 2022: 7-bit and 8-bit Coded
         Character Sets---Code Extension Techniques. ISO,
         third edition, 1986.

[14] E. M. van der Poel, ``Multilingual Character
         Encoding for Internet Messages,'' Internet draft,
         Software Research Associates, Japan, January 1992.

[15] IBM. System/370 Reference Summary--GX20-1850-5,
         sixth edition, 1984.

[16] C.E. Mackenzie. Coded-Character Sets: History and
         Development. Addison-Wesley, Reading, MA, 1980.

[17] D.E. Knuth. The TeXbook. Addison-Wesley, Reading,
         MA, 1984.

Glossary of Terms

Announcer: A character or sequence of  characters  appearing   in
the  data  that  signifies the start of some special sequence. In
this text, it announces a Vietnamese composition sequence.

ASCII: American Standard  Code  for  Information  Interchange,  a
128-character   code   used  almost  universally by computers for
representing and transmitting character data, in which each
character corresponds to a decimal number between 0 and 127.
Eight- or nine-bit codes of which the first 128 characters
correspond   to   ASCII are called Extended ASCII; the additional
characters are used to provide graphic characters for  roman  al-
phabets  with diacritics, non-roman alphabets, special screen ef-
fects, etc.

Base Vowel: In this text, the unaccented Vietnamese vowels: a a(
a^ e e^ i o o^ o+ u u+ y (and their capitals). Contrast this with Vowel.

C0 Space: ``Control characters'' at code positions with hex values
00 through 1F.

C1 Space: ``Control characters'' at code positions with hex values
80 through 9F.

Code: In data communication, the numeric  or  internal  represen-
tation for a character, e.g., in ASCII.

Code Page: Name used to denote glyph sets on the IBM PC.
Abbreviated as CP. CP850 is the multilingual code page, CP860 is
for Portugal, CP863 is for French Canada, CP865 is for Norway.



Control Character: An ASCII character in the range 0 to 31,  plus
ASCII  character  127, contrasted with the printable, or graphic,
characters in the range 32 to 126. It is  produced  on  an  ASCII
terminal   by   holding  down the CTRL key and typing the desired
character.
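
As a rough illustration of the ranges above, the following Python
sketch tests whether a code is a control character and computes the
code that CTRL plus a key would produce. The function names are
hypothetical, and the five-bit masking rule is the conventional
terminal behavior, assumed here rather than taken from this text.

```python
def is_control(code: int) -> bool:
    """True for ASCII codes 0 to 31 and for 127, per the glossary entry."""
    return 0 <= code <= 31 or code == 127


def ctrl_key(key: str) -> int:
    """Code sent by CTRL+key, assuming the usual rule of keeping only
    the low five bits of the (uppercased) key's ASCII code."""
    return ord(key.upper()) & 0x1F
```

Under that assumption, CTRL+G yields code 7 (BEL) and CTRL+[ yields
code 27 (ESC), both of which is_control() accepts.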

EBCDIC: Extended Binary Coded Decimal Interchange Code. The char-
acter  code  used  on  IBM  mainframes. Not covered by any formal
standards but described definitively in [15] and  discussed   at
length in [16].

Floating Diacritics: A multiple-unit encoding approach for  Viet-
namese  that  treats the vowel and its diacritics as separate un-
its. The diacritics may either precede or follow  the  vowel,  or
even the word. Contrast this with Precomposed Characters.

Glyph: The physical appearance of a character as displayed on the
screen or printed on paper.

G0 Space: ``Graphic characters'' at code positions with hex values
20 through 7F.

G1 Space: ``Graphic characters'' at code positions with hex values
A0 through FF.
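
The four regions defined in this glossary (C0, G0, C1, and G1)
partition the 8-bit code space. A minimal Python sketch of that
partition follows; the function name is hypothetical, and the
boundaries are exactly the hex ranges given in the glossary entries
(so hex 7F is counted in G0, as stated there).

```python
def code_region(code: int) -> str:
    """Name the region (C0, G0, C1 or G1) holding an 8-bit code value."""
    if not 0 <= code <= 0xFF:
        raise ValueError("expected an 8-bit code value")
    if code <= 0x1F:
        return "C0"   # hex 00 through 1F
    if code <= 0x7F:
        return "G0"   # hex 20 through 7F
    if code <= 0x9F:
        return "C1"   # hex 80 through 9F
    return "G1"       # hex A0 through FF
```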

ISO: International Organization for  Standardization.  A   volun-
tary   international   group  of national standards organizations
that issues standards in all areas, including  computers,  infor-
mation processing, and character sets.

ISO 646: The standard 7-bit code set,  equivalent to  ASCII [12]. 

ISO Standard 8859: An ISO standard specifying a series of 8-bit
computer character sets that include characters from many
languages. Parts 1-9 of the standard include ISO Latin Alphabets
1-5, which cover most of the written languages based on Roman
letters, plus special character sets for Cyrillic, Arabic, Greek,
and Hebrew [5].

ISO 8859/1: ISO Standard 8859 Latin Alphabet Number  1.  Supports
at  least the following languages: Latin, Danish, Dutch, English,
Faeroese, Finnish, French,  German,  Icelandic,  Irish,  Italian,
Norwegian, Portuguese, Spanish, and Swedish [5].

ISO 2022 and ISO 4873: ISO standards for switching code pages [13].

ISO DIS 10646: The prospective 16- and 32-bit Universal Coded
Set (Draft International Standard) [4].

Latin: Referring to the Latin, or Roman, alphabet,  comprised  of
the letters A through Z, or to any alphabet based upon it.



MS-DOS: Microsoft's Disk Operating  System   for   microcomputers
based on the Intel 80x86 family of CPU chips.

Modifier: A phonetic diacritical mark. The Vietnamese modifiers
are: breve (tra(ng, (), circumflex (mu~, ^), horn (mo'c, +).

PC: Personal Computer. In this text, the term PC refers  to   the
entire  IBM  PC and PS/2 families and compatibles, which includes
the AT, 286, 386, and 486 PC's.

PostScript: A page description language with  graphics  capabili-
ties   designed   for  electronic  printing.  The  description is
high-level and device-independent. PostScript is a  trademark  of
Adobe Systems Incorporated.

Precomposed Characters: An encoding approach for Vietnamese  that
treats   all   vowel  combinations as single units. Contrast this
with Floating Diacritics.
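
The contrast between floating diacritics and precomposed characters
can be illustrated with present-day Unicode normalization. The Python
sketch below is only an analogy: it uses the standard unicodedata
module and Unicode combining marks, neither of which is part of the
encodings discussed in this text.

```python
import unicodedata

# Floating form: base letter "a", then combining breve (U+0306),
# then combining acute (U+0301), kept as three separate units.
floating = "a\u0306\u0301"

# Precomposed form: NFC normalization folds the three units into
# one code point wherever a precomposed character exists.
precomposed = unicodedata.normalize("NFC", floating)

# NFD normalization goes the other way, back to separate units.
decomposed = unicodedata.normalize("NFD", precomposed)
```

Here floating has length 3 and precomposed has length 1, yet both
represent the same vowel, written a(' in the mnemonic notation used
in this text.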

TeX: A computerized typesetting  system   developed   by   Donald
Knuth  [17], providing nearly everything needed for high-quality
typesetting of mathematical notations  as  well  as  of  ordinary
text. TeX is a trademark of the American Mathematical Society.

Tone  Mark:  A  tonal  diacritical  mark   that   indicates   the
tone/accent.   The  Vietnamese  tone marks are: acute (sa('c),
grave (huye^`n), hook above (ho?i), tilde (nga~), dot below (na(.ng).

Unicode: A 16-bit multilingual character code proposed by the Un-
icode Consortium [3].

Unix: A popular operating system developed at AT&T  Bell  Labora-
tories and noted for its portability.

Usenet: A worldwide network available to users for  sending  mes-
sages   (or   ``news articles'') that can be read and responded to
by other users. Participating in Usenet is like subscribing to  a
collection  of  electronic  magazines. These ``magazines,'' called
newsgroups,   are   devoted    to    particular    topics.    The
``Soc.Culture.Vietnamese''  newsgroup  is  very popular among both
Vietnamese and non-Vietnamese worldwide.

Viet-Std: A non-profit group of overseas  Vietnamese  profession-
als  working  on software & hardware standards for the Vietnamese
language. Members of the group exchange ideas via electronic mail
and meetings.

Vowel: In this text, a generic term applying  to  all  Vietnamese
vowels  and  their  various combining forms, e.g., a, a(, and a('.
See Base Vowel.
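
The relationship between Vowel, Base Vowel, Modifier, and Tone Mark
can be sketched in Python as a small splitter for one vowel written
in the mnemonic notation. The function and table names are
hypothetical. The tone mark characters are inferred from the examples
in the Tone Mark entry, and the modifier-before-tone order is assumed
from spellings such as sa('c and huye^`n. The sketch does not check
that a modifier is valid for its letter.

```python
LETTERS = set("aeiouyAEIOUY")
MODIFIERS = {"(": "breve", "^": "circumflex", "+": "horn"}
# Assumed from the glossary examples sa('c, huye^`n, ho?i, nga~, na(.ng
TONE_MARKS = {"'": "acute", "`": "grave", "?": "hook above",
              "~": "tilde", ".": "dot below"}


def split_vowel(token: str):
    """Split a mnemonic vowel such as "a('" into a tuple of
    (letter, modifier name or None, tone mark name or None)."""
    if not token or token[0] not in LETTERS:
        raise ValueError(f"not a vowel: {token!r}")
    letter, rest = token[0], token[1:]
    modifier = tone = None
    if rest and rest[0] in MODIFIERS:
        modifier, rest = MODIFIERS[rest[0]], rest[1:]
    if rest and rest[0] in TONE_MARKS:
        tone, rest = TONE_MARKS[rest[0]], rest[1:]
    if rest:
        raise ValueError(f"unexpected trailing characters: {rest!r}")
    return letter, modifier, tone
```

For the three forms named in the Vowel entry, split_vowel("a")
reports no diacritics, split_vowel("a(") reports a breve only, and
split_vowel("a('") reports a breve plus an acute tone mark. The
letter together with its modifier is the Base Vowel.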



