Attached below is the document from the Vietnamese Standards Group
that has been publicly released describing current conventions for
Vietnamese usage on the Internet/BITNET/USENET and other proposals or
de facto standards for Vietnamese. It represents the consensus of the
people who have been working on these issues for the past several years.
Anh Nguye^~n Tha`nh has indicated that he intends to work on
converting it into a format suitable for an RFC and publishing it on
behalf of the Vietnamese Standards Group as an informational RFC
documenting conventions and usages for the Vietnamese language. I'm
not sure when that might be finished and submitted to the RFC Editor.
If my understanding is correct, that informational RFC would not be
focused on MIME in any way and would not be proposing registration of
a token for Vietnamese for usage in MIME. The VSG would prefer to
have at least a unified mechanism for mnemonic usages or possibly a
single mnemonic convention, provided that such a single unified
mnemonic convention's representation of Vietnamese glyphs were no less
readable than the existing Vietnamese convention.
I would also like to take this opportunity to disclaim the credit
that the document gives to me. The vast majority of the work has been
done by others in the working group and they deserve the lion's share
of the credit for the document content. I am however quite pleased
that we in the VSG working group have been able to produce such a
document.
Ran
atkinson(_at_)itd(_dot_)nrl(_dot_)navy(_dot_)mil
----cut here, omitting this line and above---
Vietnamese Mnemonic Notes:
In the following ASCII text, Vietnamese letters with diacritics
are represented as a vowel followed by the diacritics, with the
following mappings:
( = breve, as in "a(n na(n"
^ = circumflex, as in "nha^n co^ng"
+ = horn, as in "tu+o+ng tu+"
' = acute, as in "choa'ng va'ng"
` = grave, as in "lu` khu`"
? = hook above, as in "ho?i tha(m"
~ = tilde, as in "ky~ ca`ng"
. = dot below, as in "Tra.ng Nguye^n"
dd = lower case d-bar, as in "dda ti`nh"
DD = upper case D-bar, as in "DDo^ng So+n"
The diacritics are interspersed freely in the text and should
be clear from the context, for example, "The Vietnamese call
themselves `Con Cha'u Hu`ng Vu+o+ng', or `Descendants of King
Hu`ng'." However there are instances where it is necessary to
differentiate between a single Vietnamese letter with
diacritics and a sequence of characters, for example, "a^'". In
such cases, when the single Vietnamese letter is meant, it is
enclosed in angle brackets, e.g., "<a^'>"; without the brackets
the string "a^'" should be understood to be the sequence of
characters "a", "^", and "'". It should be clear from context
how the text should be read.
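As a rough illustration of the mappings listed above, the following Python sketch turns mnemonic text into precomposed Unicode letters. It is not part of the proposal: the function name, the greedy strategy, and the use of Unicode combining characters are assumptions made for this example only.

```python
import unicodedata

# Mnemonic mark -> Unicode combining character, following the list above.
MARKS = {
    "(": "\u0306",  # breve
    "^": "\u0302",  # circumflex
    "+": "\u031b",  # horn
    "'": "\u0301",  # acute
    "`": "\u0300",  # grave
    "?": "\u0309",  # hook above
    "~": "\u0303",  # tilde
    ".": "\u0323",  # dot below
}
VOWELS = set("aeiouyAEIOUY")


def mnemonic_to_unicode(text):
    """Greedy decoder (an assumed strategy): every mark that directly
    follows a vowel is attached to that vowel."""
    out = []
    i = 0
    while i < len(text):
        ch = text[i]
        if text[i:i + 2] in ("dd", "DD"):
            # d-bar, lower or upper case
            out.append("\u0111" if ch == "d" else "\u0110")
            i += 2
        elif ch in VOWELS:
            j = i + 1
            while j < len(text) and text[j] in MARKS:
                j += 1
            combined = ch + "".join(MARKS[m] for m in text[i + 1:j])
            # NFC composes the vowel and its marks into one letter
            out.append(unicodedata.normalize("NFC", combined))
            i = j
        else:
            out.append(ch)
            i += 1
    return "".join(out)
```

Because the decoder is greedy, it always reads "tha?" as the single letter <a?> and turns any English "dd" into d-bar; it does not implement the angle-bracket convention, which is exactly the ambiguity the notes above warn about.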
The text was generated with "dvi2tty" and "nroff" with con-
siderable hand-editing, but the formatting still leaves much to
be desired. A much more readable version is available in
PostScript form from various archive sites to be announced by
the archivists themselves. If you have no means of retrieving
or printing the PostScript file, you may obtain a printed copy
by sending a self-addressed, stamped envelope to "Cuong T. Nguyen
P. O. Box 9934, Stanford, CA 94309-1634". Please use two (2)
29-cent stamps and a letter-sized envelope.
Please forward typos & comments to
Viet-Std(_at_)Haydn(_dot_)Stanford(_dot_)EDU.
Acknowledgments:
----------------
We acknowledge the direct authorship/contribution by the following people:
Atkinson, Randall (atkinson(_at_)itd(_dot_)nrl(_dot_)navy(_dot_)mil)
Bu`i Cu+o+ng (bui(_at_)berlioz(_dot_)nsc(_dot_)com)
Ho^` Khie^m (khiem(_at_)hpinddm(_dot_)cup(_dot_)hp(_dot_)com)
Lu+o+ng V. Tu+o+'c (tluong(_at_)borland(_dot_)com)
Ngo^ DDi`nh Ho.c (hoc%vri280(_at_)uunet(_dot_)uu(_dot_)net)
Nguye^~n T. Cu+o+`ng (cuong(_at_)Haydn(_dot_)Stanford(_dot_)EDU)
Nguye^~n Tha`nh
(thanh(_at_)ipesun(_dot_)e-technik(_dot_)uni-stuttgart(_dot_)de)
To^n Khoa (khoa(_at_)hpda(_dot_)hp(_dot_)com)
Tra^`n Nha^n (tran(_at_)peora(_dot_)sdc(_dot_)ccur(_dot_)com)
We also acknowledge the many insightful comments, arguments, and ideas
contributed by the people on Viet-Std(_at_)Haydn(_dot_)Stanford(_dot_)EDU,
who are too numerous to acknowledge individually but whose contributions
are nevertheless important, as well as the people of Viet-Net and
Soc.Culture.Vietnamese, including those who proposed, discussed, and
propagated the Viet-Net readable mnemonic convention.
Viet-Std List
-------------
Atkinson, Randall
(atkinson(_at_)itd(_dot_)nrl(_dot_)navy(_dot_)mil)
BINGO(_at_)MTUS5(_dot_)cts(_dot_)mtu(_dot_)edu
Bu`i Cu+o+ng (bui(_at_)berlioz(_dot_)nsc(_dot_)com)
DDa(.ng, Oliver
(Oliver_Dang(_dot_)Washington_CSD(_at_)Xerox(_dot_)com)
DDinh Hoa`n (hdinh(_at_)ihlpx(_dot_)att(_dot_)com)
DDo^~, James (jDo(_at_)sjc(_dot_)mentorg(_dot_)com)
Du+o+ng, Christie (chrisd(_at_)works(_dot_)sun(_dot_)com)
Dung Trung (trung(_at_)CS(_dot_)BU(_dot_)EDU)
Ho^.i Chuye^n Gia Vie^.t Nam (hcgvn(_at_)netcom(_dot_)com)
Ho^` Khie^m (khiem(_at_)hpinddm(_dot_)hp(_dot_)com)
Ho^` Phi Hu`ng (hho%aludra.usc.edu, Archivist)
JFT%NCCIBM1(_dot_)BITNET(_at_)Forsythe(_dot_)Stanford(_dot_)EDU
Le^ Quang (quangl(_at_)tabasco(_dot_)sps(_dot_)mot(_dot_)com)
Le^ Ti'n (tin(_at_)smsc(_dot_)sony(_dot_)com, Archivist)
Lu+o+ng V. Tu+o+'c (tluong(_at_)borland(_dot_)com)
Ngo^ DDi`nh Ho.c
(ngo(_at_)amelia(_dot_)nas(_dot_)nasa(_dot_)gov)
Ngo^ Quang (quang(_at_)csufres(_dot_)csufresno(_dot_)edu)
Ngo^ Thanh Nha`n (nhan(_at_)LSP5(_dot_)CS(_dot_)NYU(_dot_)EDU)
Nguye^~n DDu+'c Long
(long(_at_)ireq-num(_dot_)hydro(_dot_)qc(_dot_)ca)
Nguye^~n Du (nguyen(_at_)zariski(_dot_)harvard(_dot_)edu)
Nguye^~n Gia Hoa` (nguyenh(_at_)eng(_dot_)umd(_dot_)edu)
Nguye^~n Hoa`ng (Hoang_Nguyen(_dot_)LAX1B(_at_)xerox(_dot_)com)
Nguye^~n Kinh (Kinh_Nguyen(_dot_)ESXFC(_at_)Xerox(_dot_)COM)
Nguye^~n T. Cu+o+`ng (cuong(_at_)haydn(_dot_)stanford(_dot_)edu)
Nguye^~n Tha`nh
(thanh(_at_)ipesun(_dot_)e-technik(_dot_)uni-stuttgart(_dot_)de)
Nguye^~n Vu+o+ng
(Vuong(_dot_)Nguyen(_at_)szebra(_dot_)saigon(_dot_)com)
Pha.m Tha.ch (thach(_dot_)pham(_at_)Eng(_dot_)Sun(_dot_)COM)
To^n Khoa (khoa(_at_)hpda(_dot_)hp(_dot_)com)
Tra^`n Ha?i (htran(_at_)dash(_dot_)mitre(_dot_)org)
Tra^`n Kha
(KTRAN%APLVM(_dot_)BITNET(_at_)Forsythe(_dot_)Stanford(_dot_)EDU)
Tra^`n Nha^n (tran(_at_)peora(_dot_)sdc(_dot_)ccur(_dot_)com)
Vie^.t Anh (anh(_at_)media(_dot_)mit(_dot_)edu)
Vu~ Tie^'n Giao (pyramid!infmx!grizzly!giao)
ccicpg!al!ngo
cdh1(_at_)homxc(_dot_)att(_dot_)com
chi(_at_)markv(_dot_)com
ctt(_at_)alux2(_dot_)att(_dot_)com
cvu(_at_)ic(_dot_)sunysb(_dot_)edu
gregg(_at_)eoc(_dot_)com
hnguyen(_at_)sc9(_dot_)intel(_dot_)com
k(_dot_)ly(_at_)trl(_dot_)OZ(_dot_)AU
kpv(_at_)ulysses(_dot_)att(_dot_)com
lphung(_at_)ihbhk(_dot_)att(_dot_)com
lphung(_at_)nike(_dot_)calpoly(_dot_)edu
ltpham(_at_)netcom(_dot_)com
ndoduc(_at_)framentec(_dot_)fr
shg(_at_)rock(_dot_)concert(_dot_)net
thanht(_at_)hls(_dot_)com
thu(_at_)friendship(_dot_)sun(_dot_)com
trinh(_at_)paulus(_dot_)enet(_dot_)dec(_dot_)com
tuan(_at_)lgc(_dot_)com
v(_dot_)mai(_at_)uow(_dot_)edu(_dot_)au
vtpham(_at_)ap040(_dot_)csc(_dot_)ti(_dot_)com
vtt(_at_)turing(_dot_)scs(_dot_)uiuc(_dot_)edu
A UNIFIED FRAMEWORK FOR VIETNAMESE INFORMATION PROCESSING
Vietnamese Standardization Working Group
(Viet-Std(_at_)Haydn(_dot_)Stanford(_dot_)EDU)
January 1992
ABSTRACT
Increasing demand for Vietnamese electronic information pro-
cessing has seen answer in a wide array of Vietnamese-
capable applications. The inevitable need for integration
of Vietnamese into existing environments and the exchange of
data among them point to the necessity of standardization.
This paper presents the strategic and pragmatic technical
considerations that must go into such a standard, and re-
views existing conventions/proposals in these important con-
texts. A full description of the Viet-Std proposal is
presented, including 1) an 8-bit, fully precomposed Viet-
namese encoding table, 2) a 7-bit quoted-readable Vietnamese
standard for data interchange over 7-bit channels, with a
seamless interface to the 8-bit encoding, and 3) a keyboard
user-interface specification that works transparently with
both 1 and 2. Together, these provide a truly unified
framework for a Vietnamese information processing environ-
ment with simplicity, efficiency, and straightforward in-
tegration. The real-world construction of this framework has
proven quite successful in an array of compliant applica-
tions from a number of group and individual developers
across a number of platforms, including Unix and its vari-
ants, the X window system, MS-DOS, Windows, and with ongoing
work elsewhere.
1 INTRODUCTION
With the growing Vietnamese population abroad and the proli-
feration of computer usage within Viet Nam, the Vietnamese
language has seen rapidly increasing representation in elec-
tronic information processing. The concomitant growth in demand
for Vietnamese-capable software has resulted in successful
launches of myriad vendors in the U.S. and elsewhere, mainly in
the area of Vietnamese word processing. In addition, individual
and group efforts have also been productive in providing
Vietnamese-language users with high-quality public-domain
applications. In Viet Nam, centers such as the Institute of
Informatics have reported impressive progress on many fronts, among
which is the Vietnamization of standard software packages [1].
All of the above illustrate two important points: 1) There
are growing market demands for Vietnamese-capable processing
engines, and 2) There is no shortage of technical talent to
fulfill those demands. Unfortunately, therein lies a large prob-
lem: most existing Vietnamese applications have been designed to
operate in the exclusive framework or environment of the
developer, and all are incompatible with one another. As long as
this trend continues, the application base for Vietnamese can
never keep reasonable pace with demand. Users want to do more
with Vietnamese than mere word processing, and to expect one sin-
gle vendor to provide all potential applications across all plat-
forms is to dream the impossible. Technicians providing these ap-
plications are limited to the Vietnamese tools they must them-
selves learn and develop from the ground up. Standardization is
necessary. Anyone who has had to deal with the incompatibility
between ASCII and EBCDIC can try to imagine a world where every
machine is using a different character set, and appreciate how
limited that world would be in its application base and how
cumbersome in its data interchange. A uniform framework will
greatly benefit both the user and the technician alike.
Any proposal for Vietnamese data standardization must
consider several important points in their proper contexts. First and
foremost, since this discussion is geared toward existing 7- and
8-bit environments, the prime goal is straightforward and
direct integration onto current platforms. The standard must
work here and now. This implies the use of precomposed Vietnamese
characters, because the handling of floating diacritics will nev-
er see full or simple support outside of specific contexts. The
standard must be designed so as to take advantage of existing
applications as much as possible. The familiar ``don't reinvent
the wheel'' rule is not only an advantage---but a necessity---if
a meaningful application base is to be established in any
reasonable length of time. Furthermore, it is known that overall
efficiency both in time and space is greater in processing
precomposed character units when compared with the floating-
diacritic approach [2]. Floating diacritics therefore must be
limited to only where they are necessary and inevitable, such
as in keyboard entry or 7-bit data transmission. There is no
reason to require that all applications must deal with the com-
plexities and inefficiencies of floating diacritics, for exam-
ple, in 8-bit data processing, storage, transmission, screen
rendering, or printing.
The second major context points to the pragmatic and vital
consideration of existing precedents set in the Vietnamese
software base. Standardization necessarily requires adaptation,
but it makes little sense to propose to change the world so sig-
nificantly that the inertia against large changes greatly delays
adoption of the standard. Sixteen-bit and wider data standards
are just around the corner [3, 4]; an 8-bit Vietnamese standard
must not ignore existing software precedents if it is not to be
useful only ``after its time,'' when it is no longer relevant.
Thirdly, the standard must address the issue of user
interface; if not defining it, then at least considering its
possible effects on the end-user. This relates primarily to the
7-bit keyboarding and representation of Vietnamese--in both
instances diacritics are necessarily floating, and represented
mnemonically by existing 7-bit characters with similar
appearance. With keyboarding, one must preserve where possible
existing practices such as that defined for the Viet-Net mailing
list and the Usenet newsgroup Soc.Culture.Vietnamese, both with
members worldwide. For 7-bit readable representation, the
keyword is ``readable.'' The goals here are to maintain a short
learning time and to promote a uniform interface so that it is
not necessary for a user to re-learn the particulars of every
software installation before being able to use it effectively.
Finally, to every extent possible, the standard must stay
within the framework of international standards, e.g., ISO-8859/x
[5], in order to ensure compatibility with existing environments.
For example, this goal means preservation of the ASCII encoding.
It should also extend to encoding those Vietnamese characters that
are already defined in 8859/Latin-1 into the same slots, thus
ensuring that 8859/Latin-1 keyboards will work transparently for
those Vietnamese characters. However, there are many standards
requirements that are obsolete from a practical viewpoint. For
example, in recent Unicode/ISO-10646 decisions, the prohibition on
the use of the available control character space---those with
encodings between xx00h and xx1Fh, except for C0 itself---was
discarded on the grounds that it was a waste of encoding space. As
will be discussed later, the encoding of Vietnamese into the
existing 8-bit space presents some well-known trade-offs. Where
trade-offs are made, they must be justified with good reason---
pragmatic preferred over theoretical.
These primary requirements are summarized as follows:
R1. Straightforward and direct integration into ex-
isting platforms.
R2. Ease of adaptation for existing software.
R3. User-friendly mnemonic encoding scheme and inter-
face.
R4. Adherence to international standards.
R5. Trade-offs made only on practical usage consid-
erations and with good reason.
In the following section we present a brief review of the
strengths and weaknesses of different approaches to Vietnamese
encoding. Section 3 will describe the proposed 8-bit encoding
table in detail. A quoted-readable encoding scheme encompassing
7-bit data streams, including electronic mail and keyboard in-
put, is presented in Section 4. Finally, Section 5 outlines the
particular rules and conventions relevant in some application-
specific contexts.
2 REVIEW OF CURRENT CONVENTIONS
A review of current conventions used by software vendors reveals
one distinct feature: virtually all realize the strengths of a
precomposed encoding and adopt it as a primary requirement. The
complications arise from a familiar fact: apart from the alpha-
betics already available in the ASCII standard, Vietnamese re-
quires an additional 134 unique characters. Of these, 128 can be
coded in the C1 and G1 areas. The allocation of the remaining 6
characters in the lower C0 and G0 space is handled with differ-
ing approaches:
A1. Encode into 6 of the ``least-used'' G0 characters
in the context of Vietnamese data processing.
A2. Encode into 6 of the 12 National Replacement char-
acters in G0.
A3. Drop 6 of the ``least-used''(1) Vietnamese charac-
ters, typically accented capitals such as A(?, A(~,
A^~, Y?, Y~, and Y. .
A4. Map accented ``y'' combinations into corresponding
``i'' combinations, e.g., ``ky~ su+'' is replaced
with ``ki~ su+''.
A5. Encode into the ASCII control space C0.
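The figure of 134 additional characters quoted above can be reproduced with a little arithmetic. The sketch below assumes the usual inventory of Vietnamese orthography (12 vowel shapes, each with no tone mark or one of 5 tone marks, plus d-bar); that breakdown is supplied here for illustration and is not spelled out in the text.

```python
from itertools import product

# Assumed inventory: 12 vowel shapes (written in the mnemonic notation)
# and 6 tone states (no mark plus the 5 tone marks).
BASES = ["a", "a(", "a^", "e", "e^", "i", "o", "o^", "o+", "u", "u+", "y"]
TONES = ["", "'", "`", "?", "~", "."]

lower = {base + tone for base, tone in product(BASES, TONES)}
lower.add("dd")  # d-bar

# Plain vowels already exist in ASCII and need no new code point.
ascii_plain = {"a", "e", "i", "o", "u", "y"}
extra_lower = lower - ascii_plain

extra = len(extra_lower) * 2      # lower and upper case
high_half = 32 + 96               # C1 (80h-9Fh) plus G1 (A0h-FFh)
left_over = extra - high_half     # characters that do not fit above 7Fh

print(extra, high_half, left_over)
```

Under these assumptions the script reports 134 extra characters, 128 slots in C1 and G1, and 6 characters left over, which is the allocation problem that approaches A1 through A5 each try to solve.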
Approaches A1 and A2 both satisfy the typical needs of the
word processing environments in which rarely used ASCII charac-
ters can be avoided, or employed by font shifting. However they
both eliminate prospects for integration of Vietnamese into ex-
isting ASCII environments where all graphic characters in G0
are needed. A character that already serves one purpose cannot be
re-used for another. First, it makes rendering of the needed G0
character incorrect, as it would now look like a Vietnamese char-
acter. The frequency of use of G0 characters in an integrated
environment is far too high for this conflict to be tolerable.
While font shifting may be employed to remedy this in some sit-
uations, a more serious problem occurs when the Vietnamese char-
acter is needed. The environment would typically have assigned
some specific meaning to the G0 character, particularly with
those in the National Replacement set. Consider, for example,
using the backslash character ``\'' for a Vietnamese character
under Unix. The backslash is used for many Unix escape mechanisms
so that the Vietnamese character cannot simply be used but must
be escaped in one way or another. This is more than an inconveni-
ence; it means data interchange is now complicated by the fact
that the escape mechanism will not be understood on another plat-
form, and data integrity has thus not been preserved. A standard
employing this approach fails at its basic mission: to provide
cross-platform transparency. A similar case can be made for the
other G0 characters.
Both A3 and A4 propose to limit Vietnamese language data
in one way or another. Most agree that elimination of some
Vietnamese characters is simply unacceptable; indeed, this point is
so fundamental that we have in the foregoing chosen to assume
it as a technical requirement without elaboration. However, it
must be said that A4 is not a proposal without rationale. A school
of thought exists that believes y's existing in words as a single
vowel should be mapped to corresponding i's, as their
pronunciations are indeed identical. The concept dates as far back
as 1948 [6, 7]. However, it is not the function of an encoding
standard to settle a linguistic issue, and hence A4 is also a bad
choice.
_________________________________________
(1) Least-used because they (a) rarely begin words and therefore
do not often get capitalized, and (b) appear in fewer words.
The immediate objection to A5 is primarily in data com-
munication channels where many C0 characters are used as data
control. In addition, it also presents problems for integration
into environments where some C0 characters are used in the key-
board interface and in data format controls, similar to the
problem facing A1 and A2. However, as will be discussed further,
judicious choice of the 6 C0 characters to be used has in prac-
tice been shown successfully to avoid characters that are sig-
nificant in data communication. Furthermore, most data channels
provide for clean transfer of binary data, and there is no reason
to worry that arbitrary data bits cannot be employed over these
binary routes.
With those particular cases where C0 is used in the key-
board interface, judicious choice as well as remapping of keys
can minimize conflict. Data format control is application-
specific but is typically scattered in C0 and C1. It is therefore
a universal problem for integration because C1 is necessarily
densely encoded, but, again, conflict can be avoided by studying
significant applications. Finally, the choice can be made for 6
least-used Vietnamese characters so that the probability of
conflict is greatly reduced.
It should be noted here that the foregoing discussion has
subjected the alternatives to the requirements of integration
into existing applications and platforms, as outlined in Sec-
tion 1. The importance of this goal cannot be overstated, and it
does present complications that result in the following Pragma-
tism Principle: it is obviously impossible to define a standard
that would operate seamlessly with all existing applications,
therefore pragmatic considerations must be made to make a stan-
dard workable in as many important applications and on as many
platforms as possible, with emphasis on the word ``workable.''
3 VISCII: 8-BIT ENCODING SPECIFICATION FOR VIETNAMESE
3.1 MOTIVATION
The available body of evidence shows that alternative A5
described in the previous section, encoding into 6 of the C0
characters, has the greatest chance of success in fulfilling
the requirements outlined in Section 1. The choice of the 6 C0
codes and the 6 least-used Vietnamese capital letters to encode,
when made carefully, greatly reduces the probability of conflict
for all practical purposes. Concerns regarding data communica-
tions are well addressed by avoiding C0 codes that are in fact
often used for data control. Indeed, data communication con-
cerns are more applicable to C1 and G1 encoding; a prominent
example is electronic mail transfer through 7-bit gateways and
mail agents. Communication failure here has in most cases been
due to the use of the eighth bit and not because of C0 encoding.
In any event, the option exists for data to be sent in some
``binary'' mode, or to employ the Vietnamese Quoted-Readable for-
mat to be described in Section 4.
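The point about the eighth bit can be made concrete with a small Python sketch. It simulates a 7-bit gateway by clearing the top bit of every byte; the function name is invented for this example, and the byte value AEh for <e^.> is read from Table 3 of this proposal.

```python
def strip_eighth_bit(data: bytes) -> bytes:
    """Simulate a 7-bit channel that clears the top bit of each byte."""
    return bytes(b & 0x7F for b in data)


# "Vi<e^.>t" with <e^.> stored as the single byte AEh (see Table 3).
word = b"Vi\xaet"
damaged = strip_eighth_bit(word)   # AEh & 7Fh = 2Eh, the ASCII period

# The six C0 slots used by the proposal all lie below 80h,
# so the same channel passes them through unchanged.
c0_bytes = bytes([0x02, 0x05, 0x06, 0x14, 0x19, 0x1E])
survives = strip_eighth_bit(c0_bytes) == c0_bytes
```

In this simulation the C1/G1 letter is silently turned into a period while the C0-encoded capitals are untouched, which matches the observation that mail failures stem from the eighth bit rather than from the C0 encoding.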
The overwhelming advantage of this approach is that it is
readily and easily integrated into existing environments without
many of the problems plaguing the other alternatives, if they
can at all be integrated. As a testimony to the approach's suc-
cessful application, this document itself was prepared using
the TeX system under Unix. The text source was edited in an 8-
bit X terminal window, using a minimally modified(2) version of
Elvis, a public-domain 8-bit version of Unix's Vi text editor.
Both TeX (a document preparation system) and Dvi2ps (a
PostScript generator) readily accepted and processed Vietnamese
(8-bit) data transparently. Many other applications including a
spreadsheet, various text viewers, PostScript and dot-matrix
printing, DOS's WordPerfect, Word, PC Tools, etc., have been
tested and seen to operate well with Vietnamese text.
Modifications, if any, were primarily in making these applications
accept 8-bit data. An educational teaching tool for Vietnamese
has also been produced using the C programming language with
8-bit Vietnamese strings embedded in the source code. With in-
creasing system internationalization, applications and tools are
being made 8-bit ``clean,'' further facilitating integration of
this Vietnamese encoding.
_________________________________________
(2) The modifications provided the keyboard interface described
in later sections.
3.2 ENCODING RATIONALE
A basic requirement is to preserve the 7-bit ASCII graphic
characters (G0) layout, since the emphasis is on integration.
G0 was therefore left unchanged. For the 6 C0 characters, we
first lay out the code space and consider typical usage, a
sampler of which is in Table 1. The codes selected, STX (2), ENQ
(5), ACK (6), DC4 (20), EM (25), and RS (30) present the least
possible problems with data communication and significant ap-
plications considered. The use of ACK, for example, is actually
context-dependent. In those protocols we have reviewed, it is
only considered a ``control'' character outside of a data frame;
within a data frame it is transferred without special
interpretation. To reduce the probability of conflict even further, the
6 least-often used Vietnamese capital letters, A(?, A(~, A^~, Y?,
Y~, and Y., are encoded into these slots.
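The slot assignment described above can be written down as a small lookup table. In the Python sketch below, the six codes and letters come from the text, while the set of "busy" controls is only an illustrative assumption about which C0 codes text tools commonly act on; it is not a list taken from the proposal.

```python
# The six C0 slots named in the text and the capitals assigned to them.
C0_SLOTS = {
    0x02: ("STX", "A(?"),
    0x05: ("ENQ", "A(~"),
    0x06: ("ACK", "A^~"),
    0x14: ("DC4", "Y?"),
    0x19: ("EM", "Y~"),
    0x1E: ("RS", "Y."),
}

# Assumed for illustration: NUL, BEL, BS, HT, LF, VT, FF, CR,
# the DC1/DC3 flow-control pair, and ESC.
BUSY_CONTROLS = {0x00, 0x07, 0x08, 0x09, 0x0A, 0x0B, 0x0C, 0x0D,
                 0x11, 0x13, 0x1B}


def conflicts(slots, busy):
    """Return the codes that are both reassigned and commonly acted on."""
    return sorted(set(slots) & set(busy))


def describe(code):
    """Say what a C0 code stands for under this assignment."""
    if code in C0_SLOTS:
        name, letter = C0_SLOTS[code]
        return f"{code:#04x} ({name}) carries {letter}"
    return f"{code:#04x} keeps its control meaning"
```

With this particular "busy" set, `conflicts(C0_SLOTS, BUSY_CONTROLS)` comes back empty; a different application mix would call for a different set, which is the kind of survey Table 1 samples.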
The encoding of C1 is less troublesome, although in
application-specific contexts it has been found that some C1
characters are employed with special meanings. A review of on-
going work on 8-bit mail transport standardization indicates
that C1 characters will be fully supported as graphic charac-
ters without special interpretation. Nevertheless, it is pru-
dent to encode only upper-case characters into the C1 space.
For G1, the aim is to adhere to the 8859/Latin-1 mapping
where Vietnamese-specific characters are already encoded. Table 2
lists the subset of 8859/Latin-1 characters in G1 that are
also Vietnamese (3). The motivation behind this choice is the
predominant and increasing availability of 8859/Latin-1 key-
boards and font sets, e.g., Digital's VT-terminal series, Xterm
keymaps, and Microsoft's Windows. It is natural and reasonable
for a user in France to expect that the same keystrokes producing
"e'" on the screen for French will do the same for Vietnamese.
With the above guidelines, the task is then to lay out the
remaining Vietnamese characters in some fashion, perhaps even
arbitrary. This has been done in such a way so as to provide some
degree of symmetry simply for aesthetics. Note that the Viet-
namese collating order cannot in any case be preserved, but this
is not a major issue since collation for non-ASCII characters is
well accepted to be a table-lookup problem.
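To show what "a table-lookup problem" can mean in practice, here is a minimal Python sketch of a lookup-based sort key. It works on Unicode strings rather than on the 8-bit codes, and both the alphabet order and the tone order used are assumptions chosen for the example; the proposal itself does not define a collation.

```python
import unicodedata

# Assumed letter order (a common dictionary order) and assumed tone order:
# none, grave, hook above, tilde, acute, dot below.
ALPHABET = unicodedata.normalize("NFC", "aăâbcdđeêghiklmnoôơpqrstuưvxy")
LETTER_RANK = {ch: i for i, ch in enumerate(ALPHABET)}
TONE_RANK = {"": 0, "\u0300": 1, "\u0309": 2, "\u0303": 3,
             "\u0301": 4, "\u0323": 5}


def split_tone(ch):
    """Split one letter into (base letter, tone mark)."""
    parts = unicodedata.normalize("NFD", ch)
    tone = "".join(c for c in parts if c in TONE_RANK)
    base = "".join(c for c in parts if c not in TONE_RANK)
    return unicodedata.normalize("NFC", base), tone


def collation_key(word):
    """Compare base letters first, then tones, both by table lookup."""
    letters, tones = [], []
    for ch in unicodedata.normalize("NFC", word.lower()):
        base, tone = split_tone(ch)
        letters.append(LETTER_RANK.get(base, len(ALPHABET)))
        tones.append(TONE_RANK.get(tone, 0))
    return (letters, tones)
```

Passing `collation_key` as the `key` argument of `sorted` orders words by the tables alone, so the numeric layout of the code page never enters the comparison.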
_________________________________________
(3) Note that the ``dd'' in Table 2 is actually a similar-looking
Icelandic ``edh'' in 8859/Latin-1; the Vietnamese rendering form
is better reflected in 8859/Latin-2.
Table 1: A sampler of possible C0 usage conflicts. Codes selected
for this standard proposal are noted with a +.
-----------------------------------------------------------------------------
| CODE COMM CTRL GENERAL PRINTER (PC) PC UNIX VI (Unix) |
|===========================================================================|
| 0 NUL @ C string strings |
| 1 SOH A |
| 2+ STX B back screen |
| 3 ETX C INTR INTR INTR |
| 4 EOT D EOF EOF back tab |
| 5+ ENQ E |
| 6+ ACK F forw.screen |
| 7 BEL G BEL BEL BEL |
| 8 BS H BS BS BS BS BS |
| 9 HT I HT HT HT HT HT |
| 10 LF J LF LF LF LF LF |
| 11 VT K VT |
| 12 FF L FF FF FF redraw |
| 13 CR M CR CR CR CR CR |
| 14 SO N wide on (IBM) |
| 15 SI O comp.on (IBM) |
| 16 DLE P Prt.on/off |
| 17 DC1 Q XON XON XON XON |
| 18 DC2 R comp.off(IBM) retype |
| 19 DC3 S XOFF XOFF XOFF XOFF |
| 20+ DC4 T wide off(IBM) forw. tab |
| 21 NAK U clr. buf(IBM) kill kill |
| 22 SYN V literal literal |
| 23 ETB W werase werase |
| 24 CAN X kill |
| 25+ EM Y suspend |
| 26 SUB Z EOF suspend |
| 27 ESC [ ESC ESC sequence ESC ESC ESC |
| 28 FS \ quit |
| 29 GS ] Telnet ESC |
| 30+ RS ^ |
| 31 US _ Windows |
-----------------------------------------------------------------------------
Table 2: Vietnamese-specific characters already present in 8859/Latin-1.
-----------------------------------------------------------------------
|    | 0   1   2   3   4   5   6   7   8   9   A   B   C   D   E   F  |
|====|================================================================|
| Cx | A`  A'  A^  A~                  E`  E'  E^      I`  I'         |
-----------------------------------------------------------------------
| Dx | DD      O`  O'  O^  O~              U`  U'          Y'         |
-----------------------------------------------------------------------
| Ex | a`  a'  a^  a~                  e`  e'  e^      i`  i'         |
-----------------------------------------------------------------------
| Fx | dd      o`  o'  o^  o~              u`  u'          y'         |
-----------------------------------------------------------------------
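The overlap with 8859/Latin-1 can be checked mechanically. The Python sketch below decodes each listed code position with the standard `latin-1` codec and compares it with the letter built from its mnemonic; the positions are those of the ISO 8859-1 code chart, and the helper names are made up for this example. D0h and F0h are left out because, as footnote (3) notes, Latin-1 has the Icelandic edh there rather than d-bar.

```python
import unicodedata

# Upper-case positions; the lower-case letters sit 20h higher.
TABLE2 = {
    0xC0: "A`", 0xC1: "A'", 0xC2: "A^", 0xC3: "A~",
    0xC8: "E`", 0xC9: "E'", 0xCA: "E^",
    0xCC: "I`", 0xCD: "I'",
    0xD2: "O`", 0xD3: "O'", 0xD4: "O^", 0xD5: "O~",
    0xD9: "U`", 0xDA: "U'", 0xDD: "Y'",
}
MARKS = {"`": "\u0300", "'": "\u0301", "^": "\u0302", "~": "\u0303"}


def latin1_matches(code, mnemonic):
    """True if Latin-1 has the letter named by `mnemonic` at `code`."""
    actual = bytes([code]).decode("latin-1")
    expected = unicodedata.normalize("NFC", mnemonic[0] + MARKS[mnemonic[1]])
    return actual == expected


def check_both_cases():
    """Check every upper-case entry and its lower-case partner."""
    for code, mnemonic in TABLE2.items():
        if not latin1_matches(code, mnemonic):
            return False
        if not latin1_matches(code + 0x20, mnemonic.lower()):
            return False
    return True
```

A check of this kind is a cheap way to confirm that a Latin-1 keyboard or font would produce the intended Vietnamese letter for each of these positions.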
Experience in development of this encoding on the MS-DOS
platform motivates the consideration of line-drawing glyphs in
the PC character set (code page 850). Code positions occupied by
single- and double-line drawing characters should be populated
with upper case letters. It is possible to do this without
violating the major guidelines already established above. With
this provision, the MS-DOS user can be supplied with code pages
containing either PC line-drawing or Vietnamese glyphs. For
existing applications, the user can choose the code page most
appropriate for her purpose. Where the code page with line
drawing characters must be used, the penalty from missing
Vietnamese characters has been minimized by the choice of the
infrequently used ones. For new applications, code page switching
can easily be done on the fly, if it is desired.
The preceding guidelines have resulted in the VISCII 8-
bit Vietnamese encoding proposal listed in Table 3. It is intend-
ed to be a single table that applies to Vietnamese data handling
including storage, processing, transmission, and font encoding.
This greatly simplifies the integration, implementation, and
usage processes and is indeed one of the major strengths of the
proposal.
Table 3: VISCII 8-bit Encoding Standard Proposal for Vietnamese.
-----------------------------------------------------------------------
|    | 0   1   2   3   4   5   6   7   8   9   A   B   C   D   E   F  |
|====|================================================================|
| 0x | nul soh A(? etx eot A(~ A^~ bel bs  ht  lf  vt  ff  cr  so  si |
|----|----------------------------------------------------------------|
| 1x | dle dc1 dc2 dc3 Y?  nak syn etb can Y~  sub esc fs  gs  Y.  us |
|----|----------------------------------------------------------------|
| 2x |     !   "   #   $   %   &   '   (   )   *   +   ,   -   .   /  |
|----|----------------------------------------------------------------|
| 3x | 0   1   2   3   4   5   6   7   8   9   :   ;   <   =   >   ?  |
|----|----------------------------------------------------------------|
| 4x | @   A   B   C   D   E   F   G   H   I   J   K   L   M   N   O  |
|----|----------------------------------------------------------------|
| 5x | P   Q   R   S   T   U   V   W   X   Y   Z   [   \   ]   ^   _  |
|----|----------------------------------------------------------------|
| 6x | `   a   b   c   d   e   f   g   h   i   j   k   l   m   n   o  |
|----|----------------------------------------------------------------|
| 7x | p   q   r   s   t   u   v   w   x   y   z   {   |   }   ~   del|
|----|----------------------------------------------------------------|
| 8x | A.  A(' A(` A(. A^' A^` A^? A^. E~  E.  E^' E^` E^? E^~ E^. O^'|
|----|----------------------------------------------------------------|
| 9x | O^` O^? O^~ O^. O+. O+' O+` O+? I.  O?  O.  I?  U?  U~  U.  Y` |
|----|----------------------------------------------------------------|
| Ax | a.  a(' a(` a(. a^' a^` a^? a^. e~  e.  e^' e^` e^? e^~ e^. o^'|
|----|----------------------------------------------------------------|
| Bx | o^` o^? o^~ O+~ O+  o^. o+` o+? i.  U+. U+' U+` U+? o+  o+' U+ |
|----|----------------------------------------------------------------|
| Cx | A`  A'  A^  A~  A?  A(  a(? a(~ E`  E'  E^  E?  I`  I'  I~  y` |
|----|----------------------------------------------------------------|
| Dx | DD  u+' O`  O'  O^  O~  y?  u+` u+? U`  U'  y~  y.  Y'  o+~ u+ |
|----|----------------------------------------------------------------|
| Ex | a`  a'  a^  a~  a?  a(  u+~ a^~ e`  e'  e^  e?  i`  i'  i~  i? |
|----|----------------------------------------------------------------|
| Fx | dd  u+. o`  o'  o^  o~  o?  o.  u.  u`  u'  u~  u?  y'  o+. U+~|
-----------------------------------------------------------------------
4 VIQR: MNEMONIC ENCODING SPECIFICATION FOR VIETNAMESE
4.1 MOTIVATION
While the 8-bit specification attempts to standardize
Vietnamese encoding in 8-bit environments, much remains to be
addressed in important 7-bit environments such as electronic mail
transport and other 7-bit data lines, as well as in keyboard
entry applications where the interface for generating Vietnamese
characters needs to be standardized.
Transporting more than 128 unique symbols over 7-bit data
channels is not a problem specific to the Vietnamese language.
Since its proposal in 1982, the Internet Simple Mail Transfer
Protocol (``SMTP'', [8]) has seen unrelenting efforts to extend
it to accommodate 8-bit and wider-word data in European Latin
scripts and Oriental ideographic characters (see, e.g., [9]).
While clean 8-bit transport is highly desirable, not all mail
gateways are going to be converted overnight. For the
foreseeable future there is a need for unambiguous transport of
Vietnamese text over existing 7-bit channels.
Indeed there is an ad-hoc standard in use on the Viet-Net
mailing list and the Usenet newsgroup Soc.Culture.Vietnamese,
where mnemonic use of appropriate characters to follow a vowel
proves to be quite readable; for example, ``Vi<e^.>t Nam'' would be
written as ``Vie^.t Nam''. However, this is troubled by the am-
biguity in the multiple roles played by the mnemonic diacritical
marks; for example, does ``tha?'' mean ``tha?'' or ``th<a?>''?
The Viet-Net convention is not far in concept from a
quoted-readable format proposed by K. Simonsen [10, 11], which
disambiguates such texts by specifying text states at both the
character and character set levels. Unfortunately, in its
attempt to provide a universal solution to mnemonic encoding, the
proposal does not provide a good answer for Vietnamese text.
First, it restricts the use of mnemonics to the 83 invariant
ISO-646 [12] graphic characters, which is a good idea in
principle, but sacrifices readability in the process. For example,
the counter-intuitive mnemonics for hook-above (da^'u ho?i) and
tilde (da^'u nga~) are ``2'' and ``?'', respectively, in order to
avoid ``~'' itself, which is not an invariant. The wide availability of
ASCII keyboards to the great majority of Vietnamese users makes
this too unreasonable a limitation in the context of Vietnamese
processing. It should be noted that we are in fact arguing in
favor of ``readability for most'' against ``illegibility for
all.'' Furthermore, with ongoing progress on keyboard and display
internationalization, e.g., in graphical window environments
where keyboard mapping and font switching are easily implemented,
this availability is on the increase, further obsoleting the res-
triction.
The greater difficulty is that the two-character fixed-length
encoding(4) cannot provide a readable or mnemonic representation
of all Vietnamese characters, in particular those with 2
diacritical marks. The variable-length mnemonics(5) have been
extended to include all Vietnamese characters, but this scheme is
so cluttered with announcers and delimiters that readability and
efficiency are near nil, keeping in mind that diacritics are
heavily used in Vietnamese. While machine data translators will
have little trouble with any ``mnemonic'' scheme, one that is
directly accessible to human users, who are in many cases typing
mail messages using 7-bit editors, needs to be more
user-friendly. A Vietnamese user will not want to learn or
remember among all possible combinations that, say, ``a5'' stands
for "a('", nor will she like typing sequences as long as
``&_a('_'' for some letter in every word.
_________________________________________
(4) The convention is ``&xy'', where x is a literal character
and y represents some combining form.
To satisfy the readability and flexibility requirements, a
separate specification is necessary. It is better to adopt an
approach like code-page switching under ISO-2022 [13] to switch
the text into ``Vietnamese'' mode and optimize encoding according
to the language state. Recently, van der Poel put forth a mnemon-
ic proposal [14] which emphasizes language-specific conven-
tions for these reasons. This proposal provides a means to speci-
fy the language state, each with its own (efficient) encoding
method. Its strength lies in the flexible specification that con-
formant implementations ``need not be able to display all of
the character sets specified''; they have the option of stating
messages such as ``undisplayable Greek appeared here'' for un-
supported languages (for a more precise specification, see [14]).
This allows networking communities to determine the best ap-
proach for encoding their own languages. The VIQR convention is
compatible with this approach and should easily be incorporated
into this framework.
The specification here encompasses all data streams, including
text transfer, file I/O, and keyboard entry. This principle has
been the major reason for success in operating systems such as
Unix, in which device-specific details are hidden as much as
possible from the applications programmer, leaving a uniform
interface above which tools such as common library routines can
be shared. Indeed as the keyboard example above has implied, the
characters actually typed by the user are often not different
from the text data that is eventually stored or transmitted. It
is therefore desirable to provide a common base on which to build
data interpreters for all data streams, independent of the input
source. In actual implementation, this has greatly facilitated
development of the Vietnamese-capable software base.
_________________________________________
(5) The convention is ``&_xxxx_'' where xxxx can be an arbitrary
mnemonic sequence.
In addition, the user stands to benefit tremendously from
standardization of keyboard entry. One does not need to learn a
different keyboard entry technique for each different Viet-
namese application. If one standard keyboard model is fully sup-
ported by all Vietnamese software, a user familiar with the
standard can sit down and start typing Vietnamese immediately.
This standard defines the minimum expected behavior from com-
pliant software; any additional input techniques can of course
be incorporated as a superset of the standard behavior. This is
discussed further in Section 5.2 on Vietnamese keyboarding.
4.2 QUOTED-READABLE SPECIFICATION (VIQR)
The mnemonic model from Viet-Net is fully employed in the specif-
ication. The Vietnamese QR comprises three major states:
Literal, English, and Vietnamese. The Literal state is intended
for completely transparent handling of literal data (except of
course for the escape sequences into and out of Literal state).
The English and Vietnamese states are designed for mixed use of
English and Vietnamese, with each optimized in appearance as well
as data size for texts containing mostly English and Vietnamese,
respectively. In either state there exist methods for composing
Vietnamese-specific characters, using a base vowel followed by
one or two diacritics.
We first introduce the concept of implicit and explicit
composition, then discuss how they are used in each of the
states.
4.2.1 Implicit Composition
Implicit composition is useful for data containing a large per-
centage of Vietnamese characters.
With implicit composition, a sequence of a base vowel
followed by one or two diacritical marks is combined into one
Vietnamese letter as long as it is grammatically legal. This is
best illustrated by examples:
a^ --> <a^>
o+? --> <o+?>
<o+>? --> <o+?>
Vie^.t --> Vi<e^.>t
Vi<e^>.t --> Vi<e^.>t
la'^n --> l<a'>^n (not l<a^'>n)
l<a'>^n --> l<a'>^n (not l<a^'>n)
Note in the last two examples that the sequence a^' is not
grammatically equivalent to a'^ or <a'>^. In general a modifier
("(", "^", "+") must immediately follow the appropriate vowel in
order to be combined.
The special sequence "dd" is composed into "<dd>"; "DD",
"dD", and "Dd" all represent "<DD>".
The base vowels are: a, a(, a^, e, e^, i, o, o^, o+, u, u+,
y, and their corresponding capitals. The encoding values are those
listed in Table 3, the 8-bit VISCII proposed standard.
The diacritical marks are represented by ASCII characters
having correspondingly similar appearances. Table 4 lists
the 8 ASCII characters used as mnemonic replacements for the
Vietnamese diacritics: the first three are modifiers, and the
remaining five are tone marks.
Table 4: ASCII Mnemonics for Vietnamese Diacritics
---------------------------------------------------------
| Diacritic  | Char | ASCII Code          | Da^'u       |
|============|======|=====================|=============|
| breve      | (    | 0x28, left paren    | tra(ng (()  |
| circumflex | ^    | 0x5E, caret         | mu~ (^)     |
| horn       | +    | 0x2B, plus sign     | mo'c (+)    |
|------------|------|---------------------|-------------|
| acute      | '    | 0x27, apostrophe    | sa('c (')   |
| grave      | `    | 0x60, backquote     | huye^`n (`) |
| hook above | ?    | 0x3F, question mark | ho?i (?)    |
| tilde      | ~    | 0x7E, tilde         | nga~ (~)    |
| dot below  | .    | 0x2E, period        | na(.ng (.)  |
---------------------------------------------------------
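The implicit composition rules above can be sketched in a few
lines of Python. This is only an illustration under stated
assumptions, not a reference implementation: the function name
compose_implicit is hypothetical, the input is assumed to be pure
7-bit text (no precomposed 8-bit letters such as <o+>), and
composed letters are written in the <...> notation used in this
text rather than as VISCII code values.

```python
# Which base vowels each modifier may immediately follow (see Table 4).
MODIFIERS = {"(": "aA", "^": "aAeEoO", "+": "oOuU"}
TONES = "'`?~."            # acute, grave, hook above, tilde, dot below
VOWELS = "aeiouyAEIOUY"


def compose_implicit(text):
    """Combine vowel + optional modifier + optional tone mark into one letter.

    Composed letters are rendered as "<...>"; a bare vowel stays as it is.
    """
    out = []
    i, n = 0, len(text)
    while i < n:
        ch = text[i]
        # "dd" gives <dd>; "DD", "dD" and "Dd" all give <DD>.
        if ch in "dD" and i + 1 < n and text[i + 1] in "dD":
            out.append("<dd>" if text[i:i + 2] == "dd" else "<DD>")
            i += 2
            continue
        if ch in VOWELS:
            j = i + 1
            # A modifier must immediately follow an appropriate vowel.
            if j < n and text[j] in MODIFIERS and ch in MODIFIERS[text[j]]:
                j += 1
            # At most one tone mark; nothing combines after it.
            if j < n and text[j] in TONES:
                j += 1
            seq = text[i:j]
            out.append(seq if len(seq) == 1 else "<" + seq + ">")
            i = j
            continue
        out.append(ch)
        i += 1
    return "".join(out)


print(compose_implicit("Vie^.t"))   # Vi<e^.>t
print(compose_implicit("la'^n"))    # l<a'>^n
```

Because the tone mark closes the sequence, "la'^n" leaves the
trailing "^" uncombined, as in the examples above.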
4.2.2 Explicit Composition
Explicit composition is associated with the concept of a leading
character which explicitly announces the composition. The an-
nouncer character is the backslash ("\", ASCII 0x5C), known
here as <COM>. The subsequent combining characters are defined
in the same way as those in implicit composition. Thus the ex-
amples given above would appear in explicit composition mode as:
\a^ --> <a^>
\o+? --> <o+?>
Vi\e^.t --> Vi<e^.>t
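As a rough sketch of explicit composition, the following Python
fragment composes only those sequences announced by <COM>. The
name compose_explicit and the regular expression are assumptions
made for illustration; the sketch handles vowels only (the "dd"
sequence, state switches, and character literals are left out),
and it writes composed letters in the <...> notation of this text.

```python
import re

# <COM> (backslash), then a base vowel, then an optional modifier that is
# valid for that vowel, then an optional tone mark.
EXPLICIT = re.compile(
    r"\\((?:[aA][(^]?|[eE]\^?|[oO][\^+]?|[uU]\+?|[iIyY])['`?~.]?)"
)


def compose_explicit(text):
    """Compose only the sequences announced by a backslash; leave the rest."""
    def render(match):
        seq = match.group(1)
        # A bare vowel after <COM> is just that vowel.
        return seq if len(seq) == 1 else "<" + seq + ">"
    return EXPLICIT.sub(render, text)


print(compose_explicit(r"Vi\e^.t"))               # Vi<e^.>t
print(compose_explicit(r"D\u~ng, how are you?"))  # D<u~>ng, how are you?
```

The unannounced "you?" is left alone, which is what makes this
mode suitable for text that is mostly English.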
Explicit composition is useful for data containing main-
ly English text, as well as for maintaining real-time compati-
bility with keyboard character events, as will be discussed in
Section 5.2 on Vietnamese keyboarding. With the composition
methods described, we are now ready to discuss how they are
employed in each of the three states. The state of the data
stream is specified by the two-character sequence <COM>x, where x
is specified below.
4.2.3 Literal State
The appearance of <COM>L or <COM>l in the data stream initiates
the Literal state. This state is intended for near-perfect
transparent literal data transfer. Neither implicit nor explicit
composition is available here, nor is the <COM> character
special, except when it is followed by one of the six characters
l, L, v, V, m or M, which initiates one of the three states (6).
4.2.4 English State
The sequence <COM>M or <COM>m sets the data stream state to En-
glish. In English state, only explicit composition is supported.
This means that in order to generate a Vietnamese letter, the
announcer character <COM> must be used. A ``composition'' se-
quence not preceded by <COM> will be left uninterpreted. Exam-
ples:
\mD\u~ng, how are you? --> D<u~>ng, how are you?
\mKho\e? kh\o^ng? --> Kho<e?> kh<o^>ng?
As noted, the sequence "you?" above was not converted
into "yo<u?>" because no composition was specified.
4.2.5 Vietnamese State
The data stream state is set to Vietnamese when the sequence
<COM>V or <COM>v is encountered. In Vietnamese mode, both
explicit and implicit compositions are in effect. The following
examples assume that the data stream is initially in English
state:
\vCh\u+~ Vi\e^.t --> Ch<u+~> Vi<e^.>t
\vChu+~ Vie^.t --> Ch<u+~> Vi<e^.>t
Chu+~ \vVie^.t --> Chu+~ Vi<e^.>t
_________________________________________
(6) To effect <COM>L, <COM>M, and <COM>V themselves, it is
necessary to switch to either English or Vietnamese state and use
the Character Literal feature available there.
The availability of implicit composition in Vietnamese
state ensures that the text is not cluttered with unnecessary
<COM>s, as would be the case in Vietnamese text using explicit
composition. Explicit composition is included to maintain
compatibility with the English state so that there is no need to
define additional meanings for the <COM> sequences. The
real-time keyboard compatibility mentioned previously is also
available in Vietnamese state through explicit composition.
4.2.6 Character Literals in English and Vietnamese States
Consider the following example:
\vDu~ng, how are you? --> D<u~>ng, how are yo<u?>
In this example, the sequence "you?" was interpreted as
"yo<u?>" because the data stream was still in Vietnamese state. Thus
it is sometimes desirable to suppress composition altogether
without having to switch states. The literal property of the
<COM> character conveniently accomplishes this. In either Viet-
namese or English state, whenever <COM> is followed by a non-
combining character c the result is the literal character c it-
self. The <COM> is discarded from the data stream. To get the
<COM> character literally, use <COM><COM>. Consider the following
examples:
\vddi dda^u? --> <dd>i <dd><a^><u?>
\vddi dda^u\? --> <dd>i <dd><a^>u?
\vddi v\o^? --> <dd>i v<o^?>
\vddi v\o^\? --> <dd>i v<o^>?
\h\e\l\l\o --> hello
\\ --> \
\\V --> \V
\\M --> \M
\\L --> \L
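The three states and the character-literal rule can be combined
into one small decoder. The sketch below is an illustration under
assumptions, not a definitive implementation: the names
decode_viqr and take_letter are hypothetical, the starting state
is a parameter defaulting to English, "\dd" is assumed to compose
like a vowel sequence, <CLS> is not modelled, and output uses the
<...> notation of this text. Under this reading <COM>l always
switches to Literal state, so the \h\e\l\l\o example is not
reproduced by the sketch.

```python
MODIFIERS = {"(": "aA", "^": "aAeEoO", "+": "oOuU"}
TONES = "'`?~."
VOWELS = "aeiouyAEIOUY"
COM = "\\"


def take_letter(text, i):
    """Return (rendered letter, next index), or (None, i) if text[i] cannot
    start a composition."""
    ch = text[i]
    if ch in "dD" and text[i + 1:i + 2] in ("d", "D"):
        return ("<dd>" if text[i:i + 2] == "dd" else "<DD>"), i + 2
    if ch not in VOWELS:
        return None, i
    j = i + 1
    if j < len(text) and text[j] in MODIFIERS and ch in MODIFIERS[text[j]]:
        j += 1
    if j < len(text) and text[j] in TONES:
        j += 1
    seq = text[i:j]
    return (seq if len(seq) == 1 else "<" + seq + ">"), j


def decode_viqr(text, state="M"):
    """Decode VIQR text; state is 'L' (Literal), 'M' (English), 'V' (Vietnamese)."""
    out = []
    i = 0
    while i < len(text):
        ch = text[i]
        nxt = text[i + 1:i + 2]
        if ch == COM and nxt and nxt in "lLmMvV":
            state = nxt.upper()            # state switch, valid in every state
            i += 2
        elif state == "L":
            out.append(ch)                 # Literal state: pass through
            i += 1
        elif ch == COM and nxt:
            letter, j = take_letter(text, i + 1)
            if letter is None:             # non-combining: character literal
                out.append(nxt)
                i += 2
            else:                          # explicit composition
                out.append(letter)
                i = j
        elif state == "V":
            letter, j = take_letter(text, i)   # implicit composition
            if letter is None:
                out.append(ch)
                i += 1
            else:
                out.append(letter)
                i = j
        else:
            out.append(ch)                 # English state, no announcer
            i += 1
    return "".join(out)


print(decode_viqr(r"\vddi dda^u\?"))   # <dd>i <dd><a^>u?
```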
4.2.7 Closure
The data stream supports another special character used to gen-
erate explicit closure. The closure character is CTRL-A (ASCII
0x01), known here as <CLS>. When <CLS> is encountered in the
data stream, it immediately terminates any ongoing composition
sequence. The <CLS> itself is always discarded, unless it ap-
pears in the literal sequence \<CLS>.
Explicit closure is useful in real-time character appli-
cations such as keyboard entry, when it is necessary to specify
that a composition sequence has in fact ended and the input en-
gine should not stay hanging and wait for more data.
5 SPECIFIC APPLICATIONS
This section outlines application-specific guidelines and
conventions that have evolved in the software development
community. It is intended to be a living, growing record of such
discussions as more experience is gathered. Readers are welcome
to participate in these discussions and contribute to the
development of these guidelines in particular, and to the
standards in general.
5.1 ELECTRONIC MAIL OVER 7-BIT CHANNELS
Many of the available channels for electronic mail currently
still enforce the 7-bit limitation. The 8-bit character set de-
fined in Section 3 cannot be transported verbatim over these
channels. VIQR plays an important role here, as it provides for
7-bit transport of Vietnamese text without the ambiguity prob-
lem of deciding what to do with the double usage of a
diacritical/punctuation mark, e.g., the hook-above or question
mark, "?". Because of the 7-bit nature of these communications
channels, mail agents will typically not encounter those
Vietnamese-specific base vowels that are encoded in the G1
area, namely: a(, A(, a^, A^, e^, E^, o^, O^, o+, O+, u+, and
U+. However, mail agents designed to work with 8-bit channels are
still expected to handle the occurrence of these characters
according to the complete VIQR, namely to combine base vowels
and diacritical marks as appropriate.
In order to be correctly interpreted, electronic mail mes-
sages must explicitly set the language state either in the
headers or text body. One cannot assume what state the receiving
input engine is in at the start of the message, since messages
are not always read in message units, e.g., when a file contain-
ing multiple mail messages is scanned.
Furthermore, if a language state specification (\L, \V or
\M) is present in a mail message, it is highly recommended that
the message end in the Literal state. This helps applications
reading multiple mail messages in one data stream, such as a
terminal application. It is useful because mail headers do not
adhere to the VIQR, and they are more adversely affected when in-
terpreted in non-Literal states.
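The two recommendations above (set the state explicitly, and end
in the Literal state) amount to bracketing the message body. A
minimal sketch follows, assuming a hypothetical helper named
wrap_for_mail and assuming the markers are placed directly
against the body text; where exactly a mail agent puts them is
not specified here.

```python
def wrap_for_mail(body, state="V"):
    """Prefix the body with an explicit state marker and end in Literal state.

    state is 'V' (Vietnamese), 'M' (English) or 'L' (Literal).
    """
    if state not in ("V", "M", "L"):
        raise ValueError("state must be 'V', 'M' or 'L'")
    # The leading marker tells the reader which state applies; the trailing
    # \L keeps following headers from being interpreted as VIQR.
    return "\\" + state + body + "\\L"


print(wrap_for_mail("Vie^.t Nam"))   # \VVie^.t Nam\L
```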
5.2 VIETNAMESE KEYBOARDING
Keyboards are becoming increasingly internationalized. As men-
tioned in the 8-bit specification, this is the major reason for
using the same code positions for those Vietnamese characters al-
ready present in ISO 8859/Latin-1. A Vietnamese keyboard driver
designed to work in the 7-bit-only environment can assume that it
will not encounter Vietnamese base vowels residing in G1.
Keyboard drivers for the 8-bit environments, like 8-bit electron-
ic mail agents (Section 5.1), must be prepared to accept any base
vowel, including those encoded in G1.
The real-time echoing behavior of keyboard input during
composition requires further specification. The options are to
report the character only after the composition sequence has
finished, or to report all intermediate forms and backspace
over them. Each has its own useful context as described below.
5.2.1 Immediate Echo for Implicit Composition
Implicit composition is designed to be convenient for a user
processing data that is mostly Vietnamese. As such it is desir-
able for the keyboarding user to get immediate feedback on
typed keys. With implicit composition, the keyboard works in
immediate-echo mode. Keypresses immediately generate key
events. If a character is subsequently composed with a diacrit-
ical mark, a backspace (typically BS, ASCII 0x08) is sent fol-
lowed by the new composed character. This cycle continues as
long as composition is possible. The sequence of events for the
key sequence "a^'n" under immediate echo is:
1. user types a, a is sent to the application
2. user types ^, BS and <a^> are sent
3. user types ', BS and <a^'> are sent
4. user types n, the single key n is sent
The actual backspace character code may vary depending on
the system, application, and user settings. The keyboard
interface should use the appropriate code, and/or allow the user
to specify the preferred backspace character.
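The immediate-echo cycle can be simulated as a function from
keypresses to the events sent to the application. This is a
sketch only: the name immediate_echo is hypothetical, BS is
assumed to be ASCII 0x08, the "dd" sequence is left out, and
composed letters appear in the <...> notation of this text.

```python
MODIFIERS = {"(": "aA", "^": "aAeEoO", "+": "oOuU"}
TONES = "'`?~."
VOWELS = "aeiouyAEIOUY"
BS = "\b"                      # ASCII 0x08; real systems may differ


def immediate_echo(keys):
    """Return the list of events sent to the application for the given keys."""
    events = []
    current = ""               # composition in progress, "" if none
    for key in keys:
        if len(current) == 1 and key in MODIFIERS and current in MODIFIERS[key]:
            current += key                     # vowel + modifier
            events += [BS, "<" + current + ">"]
        elif current and key in TONES:
            current += key                     # tone mark ends the sequence
            events += [BS, "<" + current + ">"]
            current = ""
        else:
            events.append(key)                 # ordinary key, echoed at once
            current = key if key in VOWELS else ""
    return events


print(immediate_echo("a^'n"))
# ['a', '\x08', '<a^>', '\x08', "<a^'>", 'n']
```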
5.2.2 Delayed Echo for Explicit Composition
When a composition sequence is started, the keyboard interface
must not send any key events to the application expecting key-
board input until the sequence is terminated. Composition may
end either naturally when the interface receives a character
that cannot be composed into the sequence, or when the closure
character <CLS> is received. A single key event for the com-
posed character is then sent to the application above. Subsequent
processing can proceed naturally. Consider what happens when the
user types the sequence "\a^'n" under delayed echo:
1. user types \, no key is sent to the application
2. user types a, no key is sent
3. user types ^, no key is sent
4. user types ', the single key <a^'> is sent
5. user types n, the single key n is sent
Or an example involving closure, "t\o+<CLS>":
1. user types t, the key t is sent
2. user types \, no key is sent
3. user types o, no key is sent
4. user types +, no key is sent
5. user types CTRL-A, the single key <o+> is sent
Note that without the closure key the keyboard interface
would still be left hanging after the "+" key has been pressed,
because the user can still enter a tone mark as part of the com-
position sequence.
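Delayed echo with closure can be simulated in the same style. The
sketch below assumes a hypothetical function delayed_echo, works
in English state only (state switches such as \v are not
modelled), leaves out the "dd" sequence, and reports composed
letters in the <...> notation of this text. A sequence still open
at the end of input is deliberately not flushed, mirroring the
"left hanging" behavior described above.

```python
MODIFIERS = {"(": "aA", "^": "aAeEoO", "+": "oOuU"}
TONES = "'`?~."
VOWELS = "aeiouyAEIOUY"
COM, CLS = "\\", "\x01"


def render(seq):
    return seq if len(seq) == 1 else "<" + seq + ">"


def delayed_echo(keys):
    """Return the events sent to the application under delayed echo."""
    events = []
    pending = None     # None: idle; "": <COM> seen; otherwise sequence so far
    for key in keys:
        if pending is None:
            if key == COM:
                pending = ""               # announcer: hold everything back
            elif key != CLS:               # a stray <CLS> is discarded
                events.append(key)
        elif pending == "":
            if key in VOWELS:
                pending = key
            else:
                events.append(key)         # character literal, e.g. \?
                pending = None
        elif len(pending) == 1 and key in MODIFIERS and pending in MODIFIERS[key]:
            pending += key                 # still waiting for a tone mark
        elif key in TONES:
            events.append(render(pending + key))   # sequence complete
            pending = None
        else:
            events.append(render(pending)) # ended by <CLS> or another key
            pending = "" if key == COM else None
            if key not in (COM, CLS):
                events.append(key)
    return events


print(delayed_echo("\\a^'n"))       # ["<a^'>", 'n']
print(delayed_echo("t\\o+\x01"))    # ['t', '<o+>']
```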
This delayed-echo behavior for explicit composition is
designed to ensure compatibility with applications expecting
single key events for each character, particularly in the English
state where only explicit composition is available. While it
is certainly possible to have immediate-echo in explicit composi-
tion or delayed-echo in implicit composition, these options are
not useful and serve only to confuse the user learning how to
use a Vietnamese keyboard.
It is therefore simplest to associate delayed-echo with expli-
cit composition, and immediate-echo with implicit composition.
These options make natural sense.
This standard defines the minimal ``look-and-feel'' be-
havior a user can expect from a compliant Vietnamese software
package. A standardized interface decreases the required learn-
ing time for each new application. This standard does not pre-
clude other input mechanisms to improve user-friendliness, e.g.,
intelligent menu-driven diacritics, or to assist in speed typ-
ing, e.g., through the use of CONTROL or FUNCTION keys. Any
enhancement in compliant applications is a bonus for the user,
so long as such enhancements do not adversely conflict with the
minimum expected behavior described here.
5.3 ADAPTING EXISTING VIETNAMESE APPLICATIONS
A realistic approach to standardization provides for the inertia
against change in existing software applications. While it is
desirable that the standard 8-bit encoding described here be
fully supported, an alternative exists which is more amenable to
rapid adoption. All applications should provide a means for im-
porting and exporting data encoded using the VISCII 8-bit encod-
ing table. At the same time, the VIQR keyboard interface should
be implemented, at least as an optional entry method. Such moves
are highly desirable both for the user and the vendor alike.
The user will be able to use the software immediately because of
the uniform keyboard interface, as well as process the same da-
ta in different applications and on different platforms, with in-
creased productivity and interactivity among users. This ease
of use means greater acceptance and a correspondingly larger
customer base for the vendor.
6 SUMMARY & CONCLUSIONS
This paper has presented a proposal for standardization of Viet-
namese information processing. A case has been made for the
necessity of standardization; we hope to have encouraged ven-
dors and users of Vietnamese alike to work together toward this
goal to benefit everyone involved. Various encoding approaches
were discussed, leading to the choice of the VISCII 8-bit
encoding proposal. A single encoding table was presented that has
been shown in actual practice to work well for Vietnamese,
including editing, processing, storage, transfer, font encoding,
and printing. Where 8-bit data handling was not available or
reliable, e.g., electronic mail transport, the Vietnamese
Quoted-Readable specification (VIQR) was introduced to provide a
seamless filtering gateway. VIQR was defined to be
input-source-independent and
hence has been designed to be applicable to Vietnamese keyboard
input as well as machine data filters. All of this was shown to
have been integrated into existing environments facilitating
the use of existing tools and applications--a major strength of
the encoding. Finally, these specifications have been linked
together seamlessly to include every point in the input-
process/transfer-output cycle of data handling and provide for a
truly unified framework for Vietnamese information processing.
References
[1] Ba.ch Hu+ng Khang. ``Institute of Informatics,''. Ha`
No^.i, Vie^.t Nam, February 1991.
[2] B. Jerman-Blazic, ``Will the Multi-octet Standard
Character Set Code Solve the World Coding Problems
for Information Interchange?,'' Computer Standards
& Interfaces, vol. 8, pages 127--136, 1988.
[3] The Unicode Consortium. The Unicode Standard:
Worldwide Character Encoding Version 1.0. Addison-
Wesley, Reading, MA, first edition, October 1991.
[4] ISO Technical Committee, ``Universal
Multiple-Octet Coded Character Set (UCS), ISO/IEC
DIS 10646-1.2,'' Draft standard, International
Organization for Standardization, 1992.
[5] International Organization for Standardization.
ISO 8859/x: 8-bit International Code
Sets. ISO, 1987.
[6] Famjxuaen Thais. Vie^.t Ngu+~ Ca?i Ca'ch. Tu+' Ha?i, Ha` No^.i,
Vie^.t Nam, March 1948.
[7] Pha.m Xua^n Tha'i. Chu+~ Vie^.t Ho+.p Li'. Ti'n DDu+'c Thu+ Xa~
Vie^.t Nam, April 1958.
[8] J. Postel, ``Simple Mail Transfer Protocol,'' RFC
821, USC Information Sciences Institute, August
1982.
[9] J. C. Klensin et al., ``SMTP Extensions for
Transport of Text-Based Messages Containing 8-bit
Characters,'' Internet draft, Massachusetts
Institute of Technology, July 1991.
[10] K. Simonsen, ``Character Mnemonics & Character
Sets,'' Internet draft, Danish Unix Users Group,
January 1992.
[11] K. Simonsen, ``Mnemonic Text Format,'' Internet
draft, Danish Unix Users Group, August 1991.
[12] International Organization for Standardization.
ISO 646: 7-bit Coded Character Set for
Information Interchange. ISO, third edition, 1991.
[13] International Organization for
Standardization. ISO 2022: 7-bit and 8-bit Coded
Character Sets---Code Extension Techniques. ISO,
third edition, 1986.
[14] E. M. van der Poel, ``Multilingual Character
Encoding for Internet Messages,'' Internet draft,
Software Research Associates, Japan, January 1992.
[15] IBM. System/370 Reference Summary--GX20-1850-5,
sixth edition, 1984.
[16] C.E. Mackenzie. Coded-Character Sets: History and
Development. Addison-Wesley, Reading, MA, 1980.
[17] D.E. Knuth. The TeXbook. Addison-Wesley, Reading,
MA, 1984.
Glossary of Terms
Announcer: A character or sequence of characters appearing in
the data that signifies the start of some special sequence. In
this text, it announces a Vietnamese composition sequence.
ASCII: American Standard Code for Information Interchange, a
128-character code used almost universally by computers for
representing and transmitting character data, in which each
character corresponds to a decimal number between 0 and 127.
Eight- or nine-bit codes of which the first 128 characters
correspond to ASCII are called Extended ASCII; the additional
characters are used to provide graphic characters for roman
alphabets with diacritics, non-roman alphabets, special screen
effects, etc.
Base Vowel: In this text, the unaccented Vietnamese vowels: a a(
a^ e e^ i o o^ o+ u u+ y (and their capitals). Contrast this with Vowel.
C0 Space: ``Control characters'' at code positions with hex values
00 through 1F.
C1 Space: ``Control characters'' at code positions with hex values
80 through 9F.
Code: In data communication, the numeric or internal represen-
tation for a character, e.g., in ASCII.
Code Page: Name used to denote glyph sets on the IBM PC. Abbre-
viated as CP. CP 850 is the multilingual code page, CP860 is for
Portugal, CP863 is for French Canada, CP865 is for Norway.
Control Character: An ASCII character in the range 0 to 31, plus
ASCII character 127, contrasted with the printable, or graphic,
characters in the range 32 to 126. It is produced on an ASCII
terminal by holding down the CTRL key and typing the desired
character.
EBCDIC: Extended Binary Coded Decimal Interchange Code. The char-
acter code used on IBM mainframes. Not covered by any formal
standards but described definitively in [15] and discussed at
length in [16].
Floating Diacritics: A multiple-unit encoding approach for Viet-
namese that treats the vowel and its diacritics as separate un-
its. The diacritics may either precede or follow the vowel, or
even the word. Contrast this with Precomposed Character.
Glyph: The physical appearance of a character as displayed on the
screen or printed on paper.
G0 Space: ``Graphic characters'' at code positions with hex values
20 through 7F.
G1 Space: ``Graphic characters'' at code positions with hex values
A0 through FF.
ISO: International Organization for Standardization. A volun-
tary international group of national standards organizations
that issues standards in all areas, including computers, infor-
mation processing, and character sets.
ISO 646: The standard 7-bit code set, equivalent to ASCII [12].
ISO Standard 8859: An ISO standard specifying a series of 8-bit
computer character sets that include characters from many
languages. These include ISO Latin Alphabets 1-9, which cover
most of the written languages based on Roman letters, plus spe-
cial character sets for Cyrillic, Greek, Arabic, and Hebrew [5].
ISO 8859/1: ISO Standard 8859 Latin Alphabet Number 1. Supports
at least the following languages: Latin, Danish, Dutch, English,
Faeroese, Finnish, French, German, Icelandic, Irish, Italian,
Norwegian, Portuguese, Spanish, and Swedish [5].
ISO 2022 and ISO 4873: ISO standards for switching code pages [13].
ISO DIS 10646: The prospective 16- and 32-bit Universal Coded
Set (Draft International Standard) [4].
Latin: Referring to the Latin, or Roman, alphabet, comprised of
the letters A through Z, or to any alphabet based upon it.
MS-DOS: Microsoft's Disk Operating System for microcomputers
based on the Intel 80x86 family of CPU chips.
Modifier: A phonetic diacritical mark. The Vietnamese modifiers
are: breve (tra(ng, (), circumflex (mu~, ^), horn (mo'c, +).
PC: Personal Computer. In this text, the term PC refers to the
entire IBM PC and PS/2 families and compatibles, which includes
the AT, 286, 386, and 486 PC's.
PostScript: A page description language with graphics capabili-
ties designed for electronic printing. The description is
high-level and device-independent. PostScript is a trademark of
Adobe Systems Incorporated.
Precomposed Characters: An encoding approach for Vietnamese that
treats all vowel combinations as single units. Contrast this
with Floating Diacritics.
TeX: A computerized typesetting system developed by Donald
Knuth [17], providing nearly everything needed for high-quality
typesetting of mathematical notations as well as of ordinary
text. TeX is a trademark of the American Mathematical Society.
Tone Mark: A tonal diacritical mark that indicates the
tone/accent. The Vietnamese tone marks are: acute (sa('c),
grave (huye^`n), hook above (ho?i), tilde (nga~), dot below (na(.ng).
Unicode: A 16-bit multilingual character code proposed by the Un-
icode Consortium [3].
Unix: A popular operating system developed at AT&T Bell Labora-
tories and noted for its portability.
Usenet: A worldwide network available to users for sending mes-
sages (or ``news articles'') that can be read and responded to
by other users. Participating in Usenet is like subscribing to a
collection of electronic magazines. These ``magazines,'' called
newsgroups, are devoted to particular topics. The
``Soc.Culture.Vietnamese'' newsgroup is very popular among both
Vietnamese and non-Vietnamese worldwide.
Viet-Std: A non-profit group of overseas Vietnamese profession-
als working on software & hardware standards for the Vietnamese
language. Members of the group exchange ideas via electronic mail
and meetings.
Vowel: In this text, a generic term applying to all Vietnamese
vowels and their various combining forms, e.g., a, a(, and a('.
See Base Vowel.