Re: Announce: Perl, Unicode and I18N FAQ

Q24. How do other programming languages implement Unicode and I18N?

Add:

Ada95 language

Ada95 was designed for Unicode support and the Ada95 standard library
features special ISO 10646-1 data types Wide_Character and Wide_String,
as well as numerous associated procedures and functions. The GNU Ada95
compiler (gnat-3.11 or newer) supports UTF-8 as the external encoding of
wide characters. This allows you to use UTF-8 in both source code and
application I/O. To activate it in the application, use "WCEM=8" in the
FORM string when opening a file, and use compiler option "-gnatW8" if
the source code is in UTF-8. See the
<A HREF="ftp://cs.nyu.edu/pub/gnat/">GNAT</A> and
<A HREF="http://wuarchive.wustl.edu/languages/ada/userdocs/docadalt/rm95/index.htm">Ada95</A>
reference manuals for details.


Change:

C language

C provides the wchar_t and wchar_t * types for handling Unicode
characters and strings. On modern C implementations that offer ISO
10646/UTF-8 locale support, wchar_t is typically a signed 32-bit
integer type. <A HREF=
"http://www.unix-systems.org/version2/whatsnew/login_mse.html">ISO C
Amendment 1</A> and ISO C 99 both added a rich set of new library
functions for handling wchar_t strings to the language. C can also
handle Unicode by using UTF-8 as the multi-byte encoding in the char *
type, so that upgrades from ASCII to UTF-8 can be implemented with
relatively minor changes in most existing software.
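
To illustrate why the char * route needs only minor changes, here is a
small sketch (not part of the proposed FAQ text). The function name
utf8_strlen is made up for this example, and it assumes the input is
well-formed UTF-8. Byte-oriented functions such as strlen() and strcpy()
keep working unchanged; only code that counts or indexes characters has
to skip the continuation bytes (bit pattern 10xxxxxx):

```c
#include <stddef.h>

/* Hypothetical helper: count the characters (code points) in a
 * NUL-terminated string, assuming it is well-formed UTF-8.
 * Every byte that is not a continuation byte (10xxxxxx) starts
 * a new character. */
size_t utf8_strlen(const char *s)
{
    size_t n = 0;

    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    }
    return n;
}
```

For the UTF-8 string "gr\xc3\xbc\xc3\x9f" "e" (the German word for
"greetings"), strlen() reports 7 bytes while utf8_strlen() reports 5
characters.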

Change:

Linux

The kernel remains mostly agnostic of what character encoding is used in
files and file names, as long as it is ASCII compatible. File content,
pipe content, file names, environment variables, source code, etc. can
all be in UTF-8. Linux (like Unix) does not provide any per-file or
per-syscall tagging of character sets; instead, the preferred system
character set can be specified per process using LC_CTYPE. Users should
aim to use only a single character set throughout their applications.
Today this is mostly the respective regional ISO 8859 variant; in the
future it will be UTF-8. Work is being done on making most
applications usable with UTF-8, and there is hope that Linux will be able
to switch over completely from ASCII and ISO 8859 to UTF-8 in only a few
years. Full UTF-8 locale support will be available starting with glibc
2.2. UTF-8 support for xterm will be available with XFree86 4.0. There
are no plans in the Linux/POSIX world to duplicate the entire API for
16-bit Unicode as was done for Win32. UTF-8 will eventually simply
replace ASCII at most levels in any inter-process communication. UCS-4 in
the form of wchar_t might be used internally by a few applications for
which UTF-8 is inconvenient to process.
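
As a rough sketch of the per-process LC_CTYPE selection described above
(again not part of the proposed FAQ text): an application can adopt the
locale from its environment with setlocale() and ask for the resulting
character set with the POSIX function nl_langinfo(CODESET). The two
helper names are made up for this example, and the comparison assumes
the codeset is spelled "UTF-8", as glibc reports it; other systems may
use a different spelling.

```c
#include <langinfo.h>
#include <locale.h>
#include <string.h>

/* Hypothetical helper: does a codeset name denote UTF-8?
 * Assumes the spelling "UTF-8"; other systems may differ. */
int codeset_is_utf8(const char *codeset)
{
    return codeset != NULL && strcmp(codeset, "UTF-8") == 0;
}

/* Hypothetical helper: adopt the character set chosen through the
 * environment (LC_ALL, LC_CTYPE, LANG) for this process and report
 * whether it is UTF-8. */
int process_locale_is_utf8(void)
{
    if (setlocale(LC_CTYPE, "") == NULL)
        return 0;   /* the environment names an unavailable locale */
    return codeset_is_utf8(nl_langinfo(CODESET));
}
```

An application would call process_locale_is_utf8() once at startup and
then decide, for instance, whether terminal output needs conversion.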

In Q5/Linux, please also change "Marcus Kuhn" to "Markus Kuhn" (with k).
You might also add Bruno Haible's Linux Unicode HOWTO:
<ftp://ftp.ilog.fr/pub/Users/haible/utf8/Unicode-HOWTO.html>

Thanks!

Markus

-- 
Markus G. Kuhn, Computer Laboratory, University of Cambridge, UK
Email: mkuhn at acm.org,  WWW: <http://www.cl.cam.ac.uk/~mgk25/>