perl-unicode

[Encode] Encode::Supported revised

2002-04-03 07:04:06
Folks,

Encode is near completion. I am still building a djgpp environment for possible fixes, but everything else is done. Meanwhile, please have a look at the revised Encode::Supported for the added encodings (Encode now comes with all encodings covered by http://www.unicode.org/Public/MAPPINGS/ -- except for Indics, which are beyond the capability of the current encengine; algorithmic approaches are still possible. Porters wanted. See below). Enjoy.

Dan the Encode Maintainer

=head1 NAME

Encode::Supported -- Supported encodings by Encode

=head1 DESCRIPTION

=head2 Encoding Names

Encoding names are case insensitive. White space in names
is ignored.  In addition an encoding may have aliases.
Each encoding has one "canonical" name.  The "canonical"
name is chosen from the names of the encoding by picking
the first in the following sequence:

       o The MIME name as defined in IETF RFCs.
       o The name in the IANA registry.
       o The name used by the organization that defined it.

Because of all the alias issues, and because in the general case
encodings have state, "Encode" uses the encoding object internally
once an operation is in progress.
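The name resolution described above can be observed with a minimal sketch like the one below. It assumes the C<find_encoding> function and the C<name> method of the encoding object as documented in Encode; the spellings tried are only examples, and the canonical name printed may differ between Encode versions.

```perl
use strict;
use warnings;
use Encode qw(find_encoding);

# Look up several spellings of the same encoding and print the
# canonical name that Encode resolves each of them to.
for my $name (qw(latin1 Latin1 LATIN1 iso-8859-1 ISO-8859-1)) {
    my $enc = find_encoding($name);    # encoding object, or undef
    printf "%-12s => %s\n", $name, defined $enc ? $enc->name : '(unknown)';
}

# An unknown name yields undef rather than an exception.
print "no such encoding\n" unless defined find_encoding('no-such-encoding');
```

Comparing the C<name> of two lookups is a simple way to check whether two labels refer to the same encoding.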

=head1 Supported Encodings

As of Perl 5.8.0, at least the following encodings are recognized.
Note that unless otherwise specified, they are all case insensitive
(via alias) and all occurrences of spaces are replaced with '-'.  In
other words, "ISO 8859 1" and "iso-8859-1" are identical.

Encodings are categorized and implemented in several different modules,
but in most cases you don't have to C<use Encode::XX> to make them
available.  Encode.pm automatically loads those modules on demand.
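As a rough illustration of on-demand loading, the sketch below encodes to C<euc-jp> without ever saying C<use Encode::JP>. It assumes a standard Encode installation; the sample character is arbitrary.

```perl
use strict;
use warnings;
use Encode qw(encode decode);    # note: no "use Encode::JP" here

# HIRAGANA LETTER A (U+3042); euc-jp lives in Encode::JP, which
# Encode is expected to load when the name is first used.
my $bytes = encode('euc-jp', "\x{3042}");
printf "euc-jp bytes: %s\n",
    join ' ', map { sprintf '%02X', $_ } unpack 'C*', $bytes;

# Round trip back to a Perl character string.
my $chars = decode('euc-jp', $bytes);
print "round trip ok\n" if $chars eq "\x{3042}";

# List every encoding name Encode knows about, loaded or not.
print join(', ', Encode->encodings(':all')), "\n";
```

C<< Encode->encodings(':all') >> is handy for checking what a given installation actually provides before relying on a name from the tables below.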

=head2 Built-in Encodings

The following encodings are always available.

  Canonical     Aliases                      Comments & References
  ----------------------------------------------------------------
  US-ascii      ascii                                       [ECMA]
  iso-8859-1    latin1                                       [ISO]
  UCS-2         ucs2, iso-10646-1                    [IANA, et al]
  UCS-2le
  UTF-8         utf8                                     [RFC2279]
  ----------------------------------------------------------------

=head2 Encode::Byte -- Extended ASCII

Encode::Byte implements most single-byte encodings except for
Symbols and EBCDIC.  The following encodings are single-byte
encodings implemented as extended ASCII.  In most cases they use
\x80-\xff (the upper half) to map non-ASCII characters.
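A small sketch of what "extended ASCII" means in practice: the same string is encoded with a few of the single-byte encodings listed below, and only the non-ASCII character changes its byte value. The encoding names and the sample string are picked for illustration.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Plain ASCII followed by LATIN SMALL LETTER E WITH ACUTE (U+00E9).
my $text = "caf\x{e9}";

for my $name (qw(iso-8859-1 cp1252 cp437 MacRoman)) {
    my $bytes = encode($name, $text);
    # The ASCII letters keep their values; only the last byte differs
    # between encodings, and it falls in the \x80-\xff range.
    printf "%-10s %s\n", $name,
        join ' ', map { sprintf '%02X', $_ } unpack 'C*', $bytes;
}
```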

=over 2

=item ISO-8859 and corresponding vendor mappings

Since there are so many, they are presented in table format with
languages and the corresponding encoding names by vendor.  Note that the
table is sorted in order of ISO-8859 and that the corresponding vendor
mappings are slightly different from those of ISO.  See
L<http://czyborra.com/charsets/iso8859.html> for details.

  Lang/Regions  ISO/Other Std.  DOS     Windows Macintosh  Others
  ----------------------------------------------------------------
  U.S           (ASCII)         cp437        AdobeStandardEncoding
                                cp863 (DOSCanadaF)
  W.  Europe    (iso-8859-1)    cp850   cp1252  MacRoman  nextstep
                                                         hp-roman8
                                cp860 (DOSPortuguese)
  CE. Europe    iso-8859-2      cp852   cp1250  MacCentralEurRoman
                                                MacCroatian
                                                MacRomanian
                                                MacRumanian
  Latin3(*3)    iso-8859-3
  Latin4(*4)    iso-8859-4
  Cyrillics     iso-8859-5      cp855   cp1251  MacCyrillic
    (Also see next section)     cp866           MacUkrainian
  Arabic        iso-8859-6      cp864   cp1256  MacArabic
                                cp1006          MacFarsi
  Greek         iso-8859-7      cp737   cp1253  MacGreek
                                cp869 (DOSGreek2)
  Hebrew        iso-8859-8      cp862   cp1255  MacHebrew
  Turkish       iso-8859-9      cp857   cp1254  MacTurkish
  Nordics       iso-8859-10     cp865
                                cp861           MacIcelandic
                                                MacSami
  Thai          iso-8859-11     cp874           MacThai
  (iso-8859-12 is nonexistent. Reserved for Indics?)
  Baltics       iso-8859-13     cp775   cp1257
  Celtics       iso-8859-14
  Latin9(*15)   iso-8859-15
  Latin10       iso-8859-16
  Vietnamese    viscii                  cp1258  MacVietnamese
  ----------------------------------------------------------------

  (*3) Esperanto, Maltese, and Turkish. Turkish is now on 8859-9
  (*4) Baltics.  Now on 8859-10
  (*15) Nicknamed Latin0; the Euro sign as well as French and Finnish
        letters that are missing from 8859-1 are added.

All cp* are also available as ibm-*, ms-*, and windows-* .  See also
L<http://czyborra.com/charsets/codepages.html>.

Macintosh encodings don't seem to be registered with bodies such as
IANA.  "Canonical" names in Encode are based upon Apple's Tech Note
1150.  See L<http://developer.apple.com/technotes/tn/tn1150.html>
for details.

=item KOI8 - De Facto Standard for Cyrillic world

Though ISO-8859 does have ISO-8859-5, the KOI8 series is far more popular
on the Net.  L<Encode> comes with the following KOI charsets.  For
gory details, see L<http://czyborra.com/charsets/cyrillic.html>.

  ----------------------------------------------------------------
  koi8-f
  koi8-r        cp878                                    [RFC1489]
  koi8-u                                                 [RFC2319]
  ----------------------------------------------------------------


=item gsm0338 - Hentai Latin 1

GSM0338 is for GSM handsets.  Though it shares alphanumerics with ASCII,
control character ranges and other parts are mapped very differently,
mainly to store Greek characters.  This one is also covered in
Encode::Byte even though it does not comply with extended ASCII.

=back

=head2 The CJK: Chinese, Japanese, Korean (Multibyte)

Note that Vietnamese is listed above.  Also read "Encoding vs. Charset"
below.  Also note that these are implemented in a distinct module per
language, due to size concerns.  Please also refer to their
respective documentation pages.

=over 4

=item Encode::CN -- Continental China

  Standard      DOS/Win Macintosh       Comment
  ----------------------------------------------------------------
  euc-cn                MacChineseSimp  GB2312 is aliased to this
  (gbk)         cp936                   GBK is aliased to this
  gb12345-raw                           GB12345 as is
  gb2312-raw                            GB2312 as is
  hz
  iso-ir-165
  ----------------------------------------------------------------

=item Encode::JP -- Japan

  Standard      DOS/Win Macintosh       Comment/Reference
  ----------------------------------------------------------------
  euc-jp                                ujis is aliased to this
  shiftjis      cp932   macJapanese
  7bit-jis                              jis is aliased to this
  iso-2022-jp                           [RFC1468]
  iso-2022-jp-1                         [RFC2237]
  ----------------------------------------------------------------

=item Encode::KR -- Korea

  ----------------------------------------------------------------
  euc-kr                MacKorean
                cp949                   ks_c_5601-1987
  iso-2022-kr                           [RFC1557]
  johab
  ksc5601-raw                           KSC5601 as is
  ----------------------------------------------------------------

=item Encode::TW -- Taiwan

  ----------------------------------------------------------------
  big5          cp950   MacChineseTrad
  big5-hkscs
  ----------------------------------------------------------------

=item Encode::HanExtra -- More Chinese via CPAN

Due to size concerns, additional Chinese encodings below are
distributed separately on CPAN, under the name Encode::HanExtra.

  ----------------------------------------------------------------
  gb18030
  euc-tw
  big5plus
  ----------------------------------------------------------------

=back

=head2 Miscellaneous encodings

=over 4

=item Encode::EBCDIC

See perlebcdic for details.

  ----------------------------------------------------------------
  cp1047
  cp37
  posix-bc
  ----------------------------------------------------------------

=item Encode::Symbols

For symbols and dingbats.

  ----------------------------------------------------------------
  symbol
  dingbats
  MacDingbats
  AdobeZdingbat
  AdobeSymbol
  ----------------------------------------------------------------

=back

=head1 Unsupported encodings

The following are not supported as yet, some because they are rarely
used and some because of technical difficulty.  They may be supported by
external modules via CPAN in the future, however.

=over 4

=item   ISO-2022-JP-2 [RFC1554]

Not very popular yet.  Needs the Unicode Database or an equivalent to
implement encode(), because it includes JIS X 0208/0212, KSC5601, and
GB2312 simultaneously, whose code points in Unicode overlap.  You
therefore need to look up the database to determine which character set
a given Unicode character should belong to.

=item   ISO-2022-CN [RFC1922]

Not very popular.  Needs CNS 11643-1 and 2, which are not available in
this module.  CNS 11643 is supported (via euc-tw) in
Encode::HanExtra.  Autrijus may add support for this encoding in his
module in the future.

=item various HP-UX encodings

The following are unsupported due to the lack of mapping data.

  '8'  - arabic8, greek8, hebrew8, kana8, thai8, and turkish8
  '15' - japanese15, korean15, and  roi15

=item Cyrillic encoding ISO-IR-111

Anton doubts its usefulness.

=item ISO-8859-8-1 [Hebrew]

None of the Encode team knows Hebrew well enough (ISO-8859-8, cp1255 and
MacHebrew are supported only because mappings were
available at L<http://www.unicode.org/>).  Contributions welcome.

=item Thai encoding TCVN

Ditto.

=item Vietnamese encodings VPS

Ditto.

=item various Mac encodings

The following are unsupported due to the lack of mapping data.

  MacArmenian,  MacBengali,   MacBurmese,   MacEthiopic
  MacExtArabic, MacGeorgian,  MacKannada,   MacKhmer
  MacLaotian,   MacMalayalam, MacMongolian, MacOriya
  MacSinhalese, MacTamil,     MacTelugu,    MacTibetan
  MacVietnamese

The rest, which are already available, are based upon the vendor mappings
available at L<http://www.unicode.org/>.

=item (Mac) Indic encodings

The maps for the following are available at L<http://www.unicode.org/>
but remain unsupported because those encodings need an algorithmic
approach, which is not supported by F<enc2xs>.

  MacDevanagari
  MacGurmukhi
  MacGujarati

For details, please see C<Unicode mapping issues and notes:> at
L<http://www.unicode.org/Public/MAPPINGS/VENDORS/APPLE/DEVANAGA.TXT>.

I believe this issue is prevalent not only for Mac Indics but also in
other Indic encodings, but those mentioned above were the only Indic
encoding maps that I could find at L<http://www.unicode.org/>.

=back

=head1 Encoding vs. Charset

Character encoding (or just "encoding") and Character Set (or just
"charset") are often used interchangeably but they are different
concepts.

=over 2

=item Character I<Set> (I<charset> for short)

Is a collection of characters in which each character is distinguished
by a unique ID (in most cases, the ID is a number).

=item Character I<Encoding>

Is a way to represent character set(s) in a stream of bits.

=back

A character encoding may contain a single character set
(e.g. US-ascii) or multiple character sets (e.g. EUC-JP, which contains
US-ascii, JIS X 0201 Kana, JIS X 0208 and JIS X 0212).

A character encoding may also encode a character set as-is (also called
a I<raw> encoding, e.g. US-ascii) or processed (e.g. EUC-JP: US-ascii is
as-is, JIS X 0201 is prefixed with \x8E, JIS X 0208 has 0x8080 added,
and JIS X 0212 has 0x8080 added and is then prefixed with \x8F).
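The EUC-JP rules above can be checked with a short sketch. The sample characters and the helper C<hex_bytes> are chosen here for illustration only; the JIS code values in the comments (0xB1 and 0x2422) are those of the two sample characters.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: show the bytes of a string as hex.
sub hex_bytes { join ' ', map { sprintf '%02X', $_ } unpack 'C*', shift }

# US-ascii is passed through as-is.
print 'ASCII "A":       ', hex_bytes(encode('euc-jp', 'A')), "\n";

# JIS X 0201 kana: HALFWIDTH KATAKANA LETTER A (U+FF71) is 0xB1 in
# JIS X 0201 and is prefixed with \x8E.
print 'JIS X 0201 kana: ', hex_bytes(encode('euc-jp', "\x{FF71}")), "\n";

# JIS X 0208: HIRAGANA LETTER A (U+3042) is 0x2422; adding 0x8080
# sets the high bit of both bytes.
printf "0x2422 + 0x8080 = 0x%04X\n", 0x2422 + 0x8080;
print 'JIS X 0208:      ', hex_bytes(encode('euc-jp', "\x{3042}")), "\n";
```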

As the name suggests, the Encode module supports encodings, not
individual charsets.

However, the word I<charset> is casually used, even by the Internet
Assigned Numbers Authority (IANA), to actually mean I<encoding>.  Encode
tries to smooth over this misconception via aliases.  For instance,
C<gb2312> is aliased to C<euc-cn>, while the "raw" encoded version is
available as C<gb2312-raw>.
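A minimal sketch of the difference between the aliased name and the raw encoding, assuming Encode::CN is installed alongside Encode. The sample character and the helper C<hex_bytes> are illustrative.

```perl
use strict;
use warnings;
use Encode qw(encode find_encoding);

# Hypothetical helper: show the bytes of a string as hex.
sub hex_bytes { join ' ', map { sprintf '%02X', $_ } unpack 'C*', shift }

# The casual "charset" name resolves to the EUC-coded encoding.
print 'gb2312 resolves to: ', find_encoding('gb2312')->name, "\n";

# CJK ideograph U+4E2D, encoded both ways.
my $char = "\x{4E2D}";
print 'euc-cn:     ', hex_bytes(encode('euc-cn',     $char)), "\n";
print 'gb2312-raw: ', hex_bytes(encode('gb2312-raw', $char)), "\n";
```

The two byte sequences are expected to differ only in the high bit of each byte, which is the "processing" that turns the raw charset into EUC-CN.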

=head1 Encoding Classification (by Anton Tagunov and Dan Kogai)

This section tries to classify the supported encodings by their
applicability for information exchange over the Internet and to
choose the most suitable aliases to name them in the context of
such communication.

=over 2

=item *

To (en|de)code encodings marked with C<(*)>, you need C<Encode::HanExtra>,
available from CPAN.

=back

Encoding names

  US-ASCII    UTF-8     ISO-8859-*  KOI8-R
  Shift_JIS   EUC-JP  ISO-2022-JP ISO-2022-JP-1
  EUC-KR      Big5

are registered with IANA as preferred MIME names and can probably be used over the Internet.

C<Shift_JIS> is no longer Microsoft proprietary since it has been
made official by JIS X 0208-1997.

  EUC-CN

has not been registered with IANA (as of March 2002) but
seems to be supported by major web browsers.  In Encode, GB2312
is aliased to EUC-CN, with the "uncooked" version of GB2312 canonicalized
as gb2312-raw.  See L<Encode::CN> for details.

  KS_C_5601-1987

is registered with IANA, but in practice text labeled with this name is
EUC-coded.  The Internet community in Korea is not happy with this,
so C<KS_C_5601-1987> is aliased to C<cp949>, an enhanced version
of C<euc-kr>, with ksc5601-raw for the "uncooked" version.

  UTF-16
  KOI8-U        (http://www.faqs.org/rfcs/rfc2319.html)

are IANA-registered (C<UTF-16> even as a preferred MIME name)
but should probably be avoided as encodings for web pages due to
the lack of browser support.

  ISO-IR-165    (http://www.faqs.org/rfcs/rfc1345.html)
  GBK
  VISCII
  GB 12345
  GB 18030 (*)  (see links below)
  EUC-TW   (*)

are totally valid encodings but not registered at IANA.
The names under which they are listed here are probably the
most widely-known names for these encodings and are recommended
names.

  BIG5PLUS (*)

is a somewhat proprietary name.

=head1 Bookmarks

=over 2

=item czyborra.com

L<http://czyborra.com/>

Contains a lot of useful information, especially the gory details of ISO
vs. vendor mappings.

=item Assigned Charset Names by IANA

L<http://www.iana.org/assignments/character-sets>

Most of the C<canonical names> in Encode derive from this list,
so you can directly apply the string you have extracted from the MIME
headers of mail messages and web pages.

=item CJK.inf

L<http://www.oreilly.com/people/authors/lunde/cjk_inf.html>

Somewhat obsolete (last update in 1996), but still useful.  Also try

L<ftp://ftp.oreilly.com/pub/examples/nutshell/cjkv/pdf/GB18030_Summary.pdf>

You will find brief info on C<EUC-CN> and C<GBK>, and mostly on C<GB 18030>.

=item ECMA-035 (eq C<ISO-2022>)

L<http://www.ecma.ch/ecma1/STAND/ECMA-035.HTM>

The very specification of ISO-2022 is available from the link above.

=back
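As noted under the IANA bookmark above, a charset string taken from a MIME header can be handed to Encode directly. The sketch below shows one way this might look; the header, the body, and the simplified regular expression are all hypothetical stand-ins, not a full MIME parser.

```perl
use strict;
use warnings;
use Encode qw(decode find_encoding);

# Hypothetical header and body, standing in for data read from a message.
my $header = 'Content-Type: text/plain; charset="ISO-8859-1"';
my $body   = "caf\xE9";    # raw bytes as they arrived

# Pull the charset parameter out of the header (simplified pattern).
my ($charset) = $header =~ /charset="?([^\s";]+)"?/i;

if (defined $charset and defined find_encoding($charset)) {
    my $text = decode($charset, $body);    # bytes -> Perl characters
    printf "decoded %d characters using %s\n", length $text, $charset;
}
else {
    print "unknown or missing charset\n";
}
```

Checking C<find_encoding> first keeps an unrecognized label from turning into a fatal error inside C<decode>.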

=head1 See Also

L<Encode>,
L<Encode::Byte>,
L<Encode::CN>, L<Encode::JP>, L<Encode::KR>, L<Encode::TW>,
L<Encode::EBCDIC>, L<Encode::Symbol>

=cut

I could not find this page because the hostname doesn't resolve!

 Brief description for most of the mentioned CJK encodings
L<http://www.debian.org.ru/doc/manuals/intro-i18n/ch-codes.html>