perl-unicode

Re: iso-2022-jp, adding encodings..

2001-06-15 14:13:30


On 15 Jun 2001, Andreas Marcel Riechert wrote:

Benjamin Franz <snowhare(_at_)nihongo(_dot_)org> writes:
On Thu, 14 Jun 2001, Edward Peschko wrote:


All I'm trying to do is convert from UTF8 to iso-2022-jp ( the form of shift
jis that is used in email...) any help on how to do this would be greatly
appreciated...

Don't mix up JIS encoding (=former JUNET-encoding; iso-2022-jp), which
is a 7-bit encoding that switches character sets with escape sequences,
with Shift-JIS (sjis), which uses 8 bits and no escapes. For email,
usually iso-2022-jp (JIS encoding) is used. For internal processing sane
people usually don't use JIS encoding.

Install 'Unicode::MapUTF8' - it probably does what you want:

use Unicode::MapUTF8 qw(from_utf8);
my $sjis_string = from_utf8({ -string => $utf8_string, 
                             -charset => 'iso-2022-jp' });

I hope I will never have to maintain such code. I could spend hours
finding out whether the author intended to use "sjis" (Shift-JIS) or
"iso-2022-jp" (JIS) encoding.

Alternatively, install the 'Jcode' module (Unicode::MapUTF8 forms a
'wrapper' around that and other Unicode modules to provide a single
consistent interface for _all_ Unicode charset convertors).

(ps - the charset that I'm talking about can be found at:

http://java.sun.com/j2se/1.3/docs/guide/intl/encoding.doc.html

It would be really, really cool if perl had the same charset codes, or at
least an alias to them. That way, one wouldn't have to go through this 'is
the charset there' junk. Unfortunately there seem to be 10 aliases for
charsets all over the place.)

If Japanese information processing is your main concern I would go
for Jcode.pm. BTW, last week the SJIS-string module was released
on CPAN. I don't know how reliable it is, but maybe it's worth a try.

It does not seem to have been mentioned yet but bleedperl kits have a new
Encode module.  Here is an excerpt:

=head1 DESCRIPTION

The C<Encode> module provides the interfaces between Perl's strings
and the rest of the system.  Perl strings are sequences of B<characters>.

<snip>

=head2 Encoding Names

Encoding names are case insensitive. White space in names is ignored.
In addition an encoding may have aliases. Each encoding has one
"canonical" name.  The "canonical" name is chosen from the names of
the encoding by picking the first in the following sequence:

=over 4

=item * The MIME name as defined in IETF RFC-XXXX.

=item * The name in the IANA registry.

=item * The name used by the organization that defined it.

=back

=head2 Generic Encoding Interface

=over 4

=item *

        $bytes  = encode(ENCODING, $string[, CHECK])

Encodes string from Perl's internal form into I<ENCODING> and returns
a sequence of octets.  For CHECK see L</"Handling Malformed Data">.

=item *

        $string = decode(ENCODING, $bytes[, CHECK])

Decodes a sequence of octets assumed to be in I<ENCODING> into Perl's
internal form and returns the resulting string.  For CHECK see
L</"Handling Malformed Data">.

etcetera.  It was the work of Nick Ing-Simmons and is in bleedperl.
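
For the original question, the two calls above might be combined roughly as
follows. This is only a sketch: it assumes an Encode release that ships the
'iso-2022-jp' and 'shiftjis' encodings (current CPAN releases do, via
Encode::JP; I have not checked which encodings the bleedperl kit includes),
and the sample input and variable names are made up for illustration.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Hypothetical input: the UTF-8 octets for two hiragana, U+3042 and U+3044.
my $utf8_bytes = "\xE3\x81\x82\xE3\x81\x84";

# Step 1: octets -> Perl character string.
my $string = decode('UTF-8', $utf8_bytes);

# Step 2: character string -> octets in the target encoding.
# iso-2022-jp (JIS) is the 7-bit, escape-sequence form used in email.
my $jis_bytes = encode('iso-2022-jp', $string);

# For contrast only: Shift-JIS is a different, 8-bit encoding without escapes.
my $sjis_bytes = encode('shiftjis', $string);

# Print both results as hex so the difference is visible.
printf "iso-2022-jp: %s\n", join ' ', map { sprintf '%02x', ord } split //, $jis_bytes;
printf "shiftjis:    %s\n", join ' ', map { sprintf '%02x', ord } split //, $sjis_bytes;
```

Naming the variables after the encoding they actually hold avoids the
sjis-versus-JIS confusion discussed earlier in the thread.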

Peter Prymmer
