Folks,
Encode is near completion. I am still building a djgpp environment for
possible fixes, but everything else is done.
Meanwhile, please have a look at Encode::Supported, revised for the
added encodings. Encode now comes with all encodings covered by
http://www.unicode.org/Public/MAPPINGS/ except for the Indic ones, which
are beyond the capability of the current encoding engine; algorithmic
approaches are still possible. Porters wanted. See below. Enjoy.
Dan the Encode Maintainer
=head1 NAME
Encode::Supported -- Encodings supported by Encode
=head1 DESCRIPTION
=head2 Encoding Names
Encoding names are case insensitive. White space in names
is ignored. In addition an encoding may have aliases.
Each encoding has one "canonical" name. The "canonical"
name is chosen from the names of the encoding by picking
the first in the following sequence:
o The MIME name as defined in IETF RFCs.
o The name in the IANA registry.
o The name used by the organization that defined it.
Because of all the alias issues, and because in the general case
encodings have state, "Encode" uses the encoding object internally
once an operation is in progress.
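As a minimal sketch of how name resolution behaves, the following uses
C<find_encoding> from Encode to look up a few spellings of one encoding
and print the canonical name of each. The helper C<canonical_name> is a
hypothetical name introduced only for this illustration, and the exact
alias set may differ between Encode versions.

```perl
use strict;
use warnings;
use Encode qw(find_encoding);

# Hypothetical helper: return the canonical name for any spelling,
# or undef when Encode does not know the name.
sub canonical_name {
    my ($name) = @_;
    my $obj = find_encoding($name);    # the encoding object Encode works with
    return defined $obj ? $obj->name : undef;
}

for my $name ("latin1", "Latin1", "ISO-8859-1", "no-such-encoding") {
    my $canon = canonical_name($name);
    printf "%-18s => %s\n", $name, defined $canon ? $canon : "(unknown)";
}

# Once looked up, the object can be used directly for conversions.
my $latin1 = find_encoding("latin1");
my $text   = "caf\x{E9}";
my $bytes  = $latin1->encode($text);    # "caf" followed by the byte 0xE9
```

Holding on to the object avoids repeating the alias lookup for every
conversion.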
=head1 Supported Encodings
As of Perl 5.8.0, at least the following encodings are recognized.
Note that unless otherwise specified, they are all case insensitive
(via alias) and all occurrences of spaces are replaced with '-'. In
other words, "ISO 8859 1" and "iso-8859-1" are identical.
Encodings are categorized and implemented in several different modules,
but in most cases you don't have to C<use Encode::XX> to make them
available. Encode.pm will automatically load those modules on demand.
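The on-demand loading can be observed with a short sketch like the one
below. It assumes a standard Encode installation that ships Encode::JP;
the variable names are only illustrative.

```perl
use strict;
use warnings;
use Encode qw(find_encoding);

# Every encoding name Encode knows about, including those whose
# implementing module has not been loaded yet.
my @all = Encode->encodings(":all");

# Looking up euc-jp should pull in Encode::JP behind the scenes,
# even though this script never says "use Encode::JP".
my $euc_jp    = find_encoding("euc-jp");
my $jp_loaded = exists $INC{"Encode/JP.pm"} ? 1 : 0;

printf "known encodings:   %d\n", scalar @all;
printf "euc-jp resolved:   %s\n", $euc_jp->name;
printf "Encode::JP loaded: %s\n", $jp_loaded ? "yes" : "no";
```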
=head2 Built-in Encodings
The following encodings are always available.
  Canonical     Aliases                      Comments & References
  ----------------------------------------------------------------
  US-ascii      ascii                        [ECMA]
  iso-8859-1    latin1                       [ISO]
  UCS-2         ucs2, iso-10646-1            [IANA, et al]
  UCS-2le
  UTF-8         utf8                         [RFC2279]
  ----------------------------------------------------------------
=head2 Encode::Byte -- Extended ASCII
Encode::Byte implements most single-byte encodings except for
Symbols and EBCDIC. The following encodings are single-byte
encodings implemented as extended ASCII. In most cases they use
\x80-\xff (the upper half) to map non-ASCII characters.
=over 2
=item ISO-8859 and corresponding vendor mappings
Since there are so many, they are presented in table format with
languages and the corresponding encoding names by vendor. Note that the
table is sorted in ISO-8859 order and that the corresponding vendor
mappings differ slightly from the ISO ones. See
L<http://czyborra.com/charsets/iso8859.html> for details.
  Lang/Regions  ISO/Other Std.  DOS     Windows Macintosh  Others
  ----------------------------------------------------------------
  U.S           (ASCII)         cp437        AdobeStandardEncoding
                                cp863 (DOSCanadaF)
  W. Europe     (iso-8859-1)    cp850   cp1252  MacRoman   nextstep
                                                           hp-roman8
                                cp860 (DOSPortuguese)
  CE. Europe    iso-8859-2      cp852   cp1250  MacCentralEurRoman
                                                MacCroatian
                                                MacRomanian
                                                MacRumanian
  Latin3(*3)    iso-8859-3
  Latin4(*4)    iso-8859-4
  Cyrillics     iso-8859-5      cp855   cp1251  MacCyrillic
  (Also see next section)       cp866           MacUkrainian
  Arabic        iso-8859-6      cp864   cp1256  MacArabic
                                cp1006          MacFarsi
  Greek         iso-8859-7      cp737   cp1253  MacGreek
                                cp869 (DOSGreek2)
  Hebrew        iso-8859-8      cp862   cp1255  MacHebrew
  Turkish       iso-8859-9      cp857   cp1254  MacTurkish
  Nordics       iso-8859-10     cp865
                                cp861           MacIcelandic
                                                MacSami
  Thai          iso-8859-11     cp874           MacThai
  (iso-8859-12 is nonexistent. Reserved for Indics?)
  Baltics       iso-8859-13     cp775   cp1257
  Celtics       iso-8859-14
  Latin9(*15)   iso-8859-15
  Latin10       iso-8859-16
  Vietnamese    viscii                  cp1258  MacVietnamese
  ----------------------------------------------------------------
(*3) Esperanto, Maltese, and Turkish. Turkish is now on 8859-9.
(*4) Baltics. Now on 8859-10.
(*15) Nicknamed Latin0; the Euro sign as well as French and Finnish
letters that are missing from 8859-1 are added.
All cp* are also available as ibm-*, ms-*, and windows-* . See also
L<http://czyborra.com/charsets/codepages.html>.
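A quick way to check the vendor prefixes is to resolve each spelling and
compare the results, as in this sketch. It uses cp1251 purely as an
example code page, and it assumes the alias rules of the Encode version
at hand match the description above.

```perl
use strict;
use warnings;
use Encode qw(encode find_encoding);

# Resolve each vendor-prefixed spelling to its canonical name.
my %resolved;
for my $alias (qw(cp1251 ibm-1251 ms-1251 windows-1251)) {
    my $obj = find_encoding($alias);
    $resolved{$alias} = defined $obj ? $obj->name : "(unknown)";
    print "$alias => $resolved{$alias}\n";
}

# Encoding through an alias gives the same bytes as the canonical name.
my $cyr_a       = "\x{0410}";    # CYRILLIC CAPITAL LETTER A
my $via_cp      = encode("cp1251",       $cyr_a);
my $via_windows = encode("windows-1251", $cyr_a);
printf "cp1251: %02X  windows-1251: %02X\n", ord $via_cp, ord $via_windows;
```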
Macintosh encodings don't seem to be registered with bodies such as
IANA. "Canonical" names in Encode are based upon Apple's Tech Note
1150. See L<http://developer.apple.com/technotes/tn/tn1150.html>
for details.
=item KOI8 - De Facto Standard for the Cyrillic world
Though ISO-8859 does have ISO-8859-5, the KOI8 series is far more
popular on the Net. L<Encode> comes with the following KOI charsets.
For the gory details, see L<http://czyborra.com/charsets/cyrillic.html>.
----------------------------------------------------------------
koi8-f
koi8-r cp878 [RFC1489]
koi8-u [RFC2319]
=item gsm0338 - Hentai Latin 1
GSM0338 is for GSM handsets. Though it shares alphanumerics with ASCII,
control character ranges and other parts are mapped very differently,
presumably to store Cyrillics. This one is also covered in
Encode::Byte even though it does not comply with extended ASCII.
=back
=head2 The CJK: Chinese, Japanese, Korean (Multibyte)
Note that Vietnamese is listed above. Also read "Encoding vs. Charset"
below. Also note that these are implemented in distinct modules by
language, due to size concerns. Please also refer to their
respective documentation pages.
=over 4
=item Encode::CN -- Continental China
  Standard     DOS/Win  Macintosh       Comment
  ----------------------------------------------------------------
  euc-cn                MacChineseSimp  GB2312 is aliased to this
  (gbk)        cp936                    GBK is aliased to this
  gb12345-raw                           GB12345 as is
  gb2312-raw                            GB2312 as is
  hz
  iso-ir-165
  ----------------------------------------------------------------
=item Encode::JP -- Japan
Standard DOS/Win Macintosh Comment/Reference
----------------------------------------------------------------
euc-jp
shiftjis cp932 macJapanese
7bit-jis jis
euc-jp ujis
iso-2022-jp [RFC1468]
iso-2022-jp-1 [RFC2237]
----------------------------------------------------------------
=item Encode::KR -- Korea
----------------------------------------------------------------
euc-kr MacKorean
cp949 ks_c_5601-1987
iso-2022-kr [RFC1557]
johab
ksc5601-raw KSC5601 as is
----------------------------------------------------------------
=item Encode::TW -- Taiwan
----------------------------------------------------------------
big5 cp950 MacChineseTrad
big5-hkscs
----------------------------------------------------------------
=item Encode::HanExtra -- More Chinese via CPAN
Due to size concerns, additional Chinese encodings below are
distributed separately on CPAN, under the name Encode::HanExtra.
----------------------------------------------------------------
gb18030
euc-tw
big5plus
----------------------------------------------------------------
=back
=head2 Miscellaneous encodings
=over 4
=item Encode::EBCDIC
See L<perlebcdic> for details.
----------------------------------------------------------------
cp1047
cp37
posix-bc
----------------------------------------------------------------
=item Encode::Symbol
For symbols and dingbats.
----------------------------------------------------------------
symbol
dingbats
MacDingbats
AdobeZdingbat
AdobeSymbol
----------------------------------------------------------------
=back
=head1 Unsupported encodings
The following are not supported as yet, some because they are rarely
used, some because of technical difficulty. They may, however, be
supported by external modules via CPAN in the future.
=over 4
=item ISO-2022-JP-2 [RFC1554]
Not very popular yet. Implementing encode() needs the Unicode Database
or an equivalent, because this encoding includes JIS X 0208/0212,
KSC5601, and GB2312 simultaneously, and their code points overlap in
Unicode. You therefore need to look up the database to determine which
character set a given Unicode character should belong to.
=item ISO-2022-CN [RFC1922]
Not very popular. Needs CNS 11643-1 and 2, which are not available in
this module. CNS 11643 is supported (via euc-tw) in
Encode::HanExtra. Autrijus may add support for this encoding in his
module in the future.
=item various HP-UX encodings
The following are unsupported due to the lack of mapping data.
'8' - arabic8, greek8, hebrew8, kana8, thai8, and turkish8
'15' - japanese15, korean15, and roi15
=item Cyrillic encoding ISO-IR-111
Anton doubts its usefulness.
=item ISO-8859-8-1 [Hebrew]
None of the Encode team knows Hebrew well enough (ISO-8859-8, cp1255
and MacHebrew are supported only because there were mappings
available at L<http://www.unicode.org/>). Contributions welcome.
=item Thai encoding TCVN
Ditto.
=item Vietnamese encodings VPS
Ditto.
=item various Mac encodings
The following are unsupported due to the lack of mapping data.
MacArmenian, MacBengali, MacBurmese, MacEthiopic
MacExtArabic, MacGeorgian, MacKannada, MacKhmer
MacLaotian, MacMalayalam, MacMongolian, MacOriya
MacSinhalese, MacTamil, MacTelugu, MacTibetan
MacVietnamese
The Mac encodings that are already available are based upon the vendor
mappings available at L<http://www.unicode.org/>.
=item (Mac) Indic encodings
The maps for the following are available at L<http://www.unicode.org/>
but they remain unsupported because those encodings need an algorithmic
approach, which F<enc2xs> does not support.
MacDevanagari
MacGurmukhi
MacGujarati
For details, please see C<Unicode mapping issues and notes:> at
L<http://www.unicode.org/Public/MAPPINGS/VENDORS/APPLE/DEVANAGA.TXT> .
I believe this issue applies not only to the Mac Indics but also to
other Indic encodings; those mentioned were the only Indic encoding
maps that I could find at L<http://www.unicode.org/> .
=back
=head1 Encoding vs. Charset
Character encoding (or just "encoding") and Character Set (or just
"charset") are often used interchangeably but they are different
concepts.
=over 2
=item Character I<Set> (I<charset> for short)
Is a collection of characters in which each character is distinguished
by a unique ID (in most cases, the ID is a number).
=item Character I<Encoding>
Is a way to represent character set(s) in a stream of bits.
=back
A character encoding may contain a single character set
(e.g. US-ascii) or multiple character sets (e.g. EUC-JP, which contains
US-ascii, JIS X 0201 Kana, JIS X 0208 and JIS X 0212).
A character encoding may also encode a character set as-is (called
a I<raw> encoding, e.g. US-ascii) or processed (e.g. EUC-JP: US-ascii is
as-is, JIS X 0201 Kana is prefixed with \x8E, JIS X 0208 has 0x8080
added, and JIS X 0212 has 0x8080 added and is then prefixed with \x8F).
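The EUC-JP arithmetic can be checked by hand with a small sketch such as
the following. It covers only the JIS X 0208 and JIS X 0201 Kana cases;
the two sample characters and the variable names are chosen just for
illustration.

```perl
use strict;
use warnings;
use Encode qw(encode);

# HIRAGANA LETTER A (U+3042) is 0x2422 in JIS X 0208.
my $hiragana_a = "\x{3042}";
my $euc_0208   = encode("euc-jp", $hiragana_a);

# Undo the processing: read the two bytes as one 16-bit number
# and subtract 0x8080 to get back the charset code point.
my $jis_code = unpack("n", $euc_0208) - 0x8080;
printf "EUC-JP %s -> JIS X 0208 %04X\n",
    uc unpack("H*", $euc_0208), $jis_code;

# HALFWIDTH KATAKANA LETTER A (U+FF71) is 0xB1 in JIS X 0201 Kana;
# EUC-JP stores it as that byte prefixed with 0x8E.
my $kana_a   = "\x{FF71}";
my $euc_0201 = encode("euc-jp", $kana_a);
printf "EUC-JP %s -> JIS X 0201 Kana %02X\n",
    uc unpack("H*", $euc_0201), ord substr($euc_0201, 1, 1);
```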
As the name suggests, the Encode module supports encodings, not
individual charsets.
However, the word I<charset> is casually used even by the Internet
Assigned Numbers Authority (IANA) to actually mean I<encoding>. Encode
tries to smooth over this confusion via aliases. For instance,
C<gb2312> is aliased to C<euc-cn>, while the "raw" encoded version is
available as C<gb2312-raw>.
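To see the difference between the aliased name and the raw charset, a
sketch along these lines can help. It assumes Encode::CN is installed
(it ships with Encode) and uses one sample character picked for
illustration.

```perl
use strict;
use warnings;
use Encode qw(encode find_encoding);

my $zhong = "\x{4E2D}";    # U+4E2D, code point 0x5650 in GB2312

# The charset name resolves to the EUC-CN encoding.
my $alias_target = find_encoding("gb2312")->name;

my $cooked = encode("gb2312",     $zhong);    # EUC-CN: 0x5650 + 0x8080
my $raw    = encode("gb2312-raw", $zhong);    # code point stored as is

printf "gb2312 is really %s\n", $alias_target;
printf "gb2312:     %s\n", uc unpack("H*", $cooked);
printf "gb2312-raw: %s\n", uc unpack("H*", $raw);
```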
=head1 Encoding Classification (by Anton Tagunov and Dan Kogai)
This section tries to classify the supported encodings by their
applicability for information exchange over the Internet and to
choose the most suitable aliases to name them in the context of
such communication.
=over 2
=item *
To (en|de)code encodings marked with C<(*)>, you need
C<Encode::HanExtra>, available from CPAN.
=back
Encoding names
US-ASCII UTF-8 ISO-8859-* KOI8-R
Shift_JIS EUC-JP ISO-2022-JP ISO-2022-JP-1
EUC-KR Big5
are registered with IANA as preferred MIME names and can probably be
used over the Internet.
C<Shift_JIS> is no longer Microsoft proprietary since it has been
made official by JIS X 0208-1997.
EUC-CN
has not been registered with IANA (as of March 2002) but
seems to be supported by major web browsers. In Encode, GB2312
is aliased to EUC-CN, with "uncooked" version of GB2312 canonicalized
as gb2312-raw. See L<Encode::CN> for details.
KS_C_5601-1987
has been registered with IANA, but in actual use it is
EUC-coded. The Internet community in Korea is not happy with this,
so C<KS_C_5601-1987> is aliased to C<cp949>, an enhanced version
of C<euc-kr>, with ksc5601-raw for the "uncooked" version.
UTF-16
KOI8-U (http://www.faqs.org/rfcs/rfc2319.html)
are IANA-registered (C<UTF-16> even as a preferred MIME name)
but should probably be avoided as encodings for web pages due to
the lack of browser support.
ISO-IR-165 (http://www.faqs.org/rfcs/rfc1345.html)
GBK
VISCII
GB 12345
GB 18030 (*) (see links below)
EUC-TW (*)
are totally valid encodings but not registered at IANA.
The names under which they are listed here are probably the
most widely-known names for these encodings and are recommended
names.
BIG5PLUS (*)
is a somewhat proprietary name.
=head1 Bookmarks
=over 2
=item czyborra.com
L<http://czyborra.com/>
Contains a lot of useful information, especially the gory details of ISO
vs. vendor mappings.
=item Assigned Charset Names by IANA
L<http://www.iana.org/assignments/character-sets>
Most of the C<canonical names> in Encode derive from this list,
so you can directly apply the string you have extracted from the MIME
headers of mails and web pages.
=item CJK.inf
L<http://www.oreilly.com/people/authors/lunde/cjk_inf.html>
Somewhat obsolete (last updated in 1996), but still useful. Also try
L<ftp://ftp.oreilly.com/pub/examples/nutshell/cjkv/pdf/GB18030_Summary.pdf>
where you will find brief info on C<EUC-CN> and C<GBK>, and mostly on
C<GB 18030>.
=item ECMA-035 (eq C<ISO-2022>)
L<http://www.ecma.ch/ecma1/STAND/ECMA-035.HTM>
The very specification of ISO-2022 is available from the link above.
=back
=head1 See Also
L<Encode>,
L<Encode::Byte>,
L<Encode::CN>, L<Encode::JP>, L<Encode::KR>, L<Encode::TW>,
L<Encode::EBCDIC>, L<Encode::Symbol>
=cut
I could not find this page because the hostname doesn't resolve!
Brief description for most of the mentioned CJK encodings
L<http://www.debian.org.ru/doc/manuals/intro-i18n/ch-codes.html>