perl-unicode

Re: Japanese text search problem

2001-08-07 13:30:43
On 7 Aug 2001, Andreas Marcel Riechert wrote:
Why should Unicode be the "de facto standard for internal
representation"? ...or "internal standard" to whom, or what?

Because every new system designed after around 1995 has based
its character encoding completely on ISO 10646. All the rest
will slowly be phased out over the next decade or so.
Because having more than one encoding is a huge pain and completely
unnecessary. Win32 has been fully Unicode based for years; Unix
is slowly following now.

E.g. if I were going to write one of the bigger Kanwa-Jiten
(Chinese/Japanese Character Dictionary) databases, I would rather
use TAD (TRON encoding) than its competitor Unicode.

There might always be special encodings for very special applications with
very special software around, but everything points towards Unicode being
well on its way to becoming the standard bread-and-butter character set,
just as ASCII became in the late 1960s (though there are still
non-ASCII systems around for special applications).

For much other stuff I am quite happy with euc-jp.

Only if you think that double-width Cyrillic and the lack of many scripts
are in any way nice or adequate. EUC-JP remains a little regional
single-language encoding and will never attract interest outside Japan
in the way that UCS already has. EUC-JP is round-trip compatible with
Unicode, so you don't lose information when you convert.
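
The round-trip claim can be checked mechanically. The following is a
minimal sketch in Python, assuming the conversion table of Python's
built-in "euc_jp" codec; the function name survives_round_trip is made up
for this illustration, and other converters may map a few characters
differently.

```python
def survives_round_trip(data: bytes, encoding: str = "euc_jp") -> bool:
    """Return True if bytes -> Unicode -> bytes reproduces the input exactly.

    Hypothetical helper for illustration; raises UnicodeDecodeError if
    `data` is not valid in the given encoding.
    """
    text = data.decode(encoding)          # legacy bytes to a Unicode string
    return text.encode(encoding) == data  # convert back and compare


# Sample EUC-JP input, built here from a Unicode literal for convenience.
sample = "日本語のテキスト".encode("euc_jp")
print(survives_round_trip(sample))
```

Running the same check over a whole file of EUC-JP text is one way to
confirm that nothing is lost before committing to a conversion.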

Perl's decision to go Unicode is *very* mainstream; Python, Tcl,
Java, C#, Ada95, etc. all did the same.

Maybe I am old-fashioned, but I still use euc-jp or sjis for
most of the processing/output I do. And I am quite happy with
them.

EUC-JP is fine (it has properties similar to those of UTF-8), but SJIS is
not suitable as a general locale encoding for POSIX systems. It is not ASCII
compatible and has state, breaking numerous assumptions inherent in the
design of Unix tools. SJIS is really just a common email format, nothing
more.
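
One concrete way to see the ASCII-compatibility problem: in SJIS, bytes in
the ASCII range can occur inside a double-byte character, so a tool that
scans for an ASCII byte such as the backslash (0x5C) can hit the middle of
a kanji. A small sketch in Python, assuming its built-in "shift_jis",
"euc_jp" and "utf-8" codecs; the helper name is invented for this example.

```python
def ascii_range_bytes_in_multibyte(text: str, encoding: str) -> bool:
    """Return True if any multi-byte character in `text` contains a byte
    below 0x80 when encoded with `encoding` (hypothetical helper)."""
    for ch in text:
        encoded = ch.encode(encoding)
        # Only multi-byte sequences matter; plain ASCII is expected < 0x80.
        if len(encoded) > 1 and any(b < 0x80 for b in encoded):
            return True
    return False


# U+8868 is a common kanji whose SJIS form ends in 0x5C (ASCII backslash).
for enc in ("shift_jis", "euc_jp", "utf-8"):
    print(enc, ascii_range_bytes_in_multibyte("表", enc))
```

For EUC-JP and UTF-8 every byte of a multi-byte character has the high bit
set, which is why byte-oriented Unix tools cope with them far better.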

Sticking with EUC-JP is advisable at the moment if you have applications
that depend on EUC-JP's traditional terminal emulator character width
conventions. Unicode terminal emulators won't give you double-width Greek,
Cyrillic, math symbols, etc., which requires slight reformatting of some
column-aligned EUC-JP plaintext files.
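
The width difference can be estimated from the Unicode East Asian Width
property. The sketch below is a rough approximation in Python, not a full
wcwidth implementation: it ignores combining and control characters, and
the function name and the ambiguous_wide flag are assumptions made for
this example. Setting ambiguous_wide=True mimics the traditional EUC-JP
convention of showing Greek and Cyrillic in two columns.

```python
import unicodedata


def column_width(text: str, ambiguous_wide: bool = False) -> int:
    """Approximate the number of terminal columns `text` occupies.

    Wide (W) and fullwidth (F) characters count as two columns. Characters
    of ambiguous width (A), such as Greek and Cyrillic letters, count as
    two columns only when `ambiguous_wide` is set.
    """
    width = 0
    for ch in text:
        eaw = unicodedata.east_asian_width(ch)
        if eaw in ("W", "F") or (eaw == "A" and ambiguous_wide):
            width += 2
        else:
            width += 1
    return width


print(column_width("Яα"))                       # Unicode terminal convention
print(column_width("Яα", ambiguous_wide=True))  # traditional EUC-JP terminal
```

Lines for which the two calls disagree are the ones that would need
realigning in a column-aligned plaintext file.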

Markus

-- 
Markus G. Kuhn, Computer Laboratory, University of Cambridge, UK
Email: mkuhn at acm.org,  WWW: <http://www.cl.cam.ac.uk/~mgk25/>
