perl-unicode

Encode, take four

2000-09-12 13:59:27
=head1 NAME

Encode - character encodings

=head2 TERMINOLOGY

        byte    a B<number> in the range 0..255
        char    a B<character> in the range 0..maxint (at least 2**32-1)

The marker [INTERNAL] marks Internal Implementation Details, which are
in general meant only for those who think they know what they are
doing.  Such details may change in future releases.

=head2 bytes

        bytes_to_utf8(STRING)

The bytes in STRING are encoded in-place into UTF-8.  Returns the new
size of STRING, or undef if there's a failure.  [INTERNAL] Also the
UTF-8 flag is turned on.

        utf8_to_bytes(STRING [, STRICT])

The UTF-8 in STRING is decoded in-place into bytes.  Returns the new
size of STRING, or undef if there's a failure, or dies if STRICT is
true and the UTF-8 in STRING is malformed.  [INTERNAL] The UTF-8 flag
of STRING is not checked.
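
As a rough illustration of the in-place byte/UTF-8 round trip described
above, here is a minimal sketch.  It does not use the functions proposed
in this draft; it assumes that Perl's built-in utf8::upgrade() and
utf8::downgrade() are close enough analogues, and the variable names are
made up for the example.

```perl
use strict;
use warnings;

# Four Latin-1 bytes; the last one (0xE9, e-acute) is above 127.
my $string = "caf\xE9";

# In-place switch to the UTF-8 internal form.  The return value is the
# number of octets the string needs as UTF-8 (here 3 + 2).
my $size = utf8::upgrade($string);
my $flag_after_upgrade = utf8::is_utf8($string) ? 1 : 0;
print "upgrade: $size octets, UTF-8 flag $flag_after_upgrade\n";

# In-place switch back to one byte per character.  A true second argument
# asks for a false return value instead of an exception on failure.
my $ok = utf8::downgrade($string, 1) ? 1 : 0;
my $flag_after_downgrade = utf8::is_utf8($string) ? 1 : 0;
print "downgrade: ok=$ok, UTF-8 flag $flag_after_downgrade\n";

# U+263A does not fit in a byte, so the lenient call reports failure ...
my $wide = "\x{263A}";
my $lenient = utf8::downgrade($wide, 1) ? 1 : 0;

# ... and the default (strict) call dies.
my $died = eval { utf8::downgrade($wide); 1 } ? 0 : 1;
print "wide: lenient=$lenient, strict died=$died\n";
```

The lenient and strict calls at the end loosely mirror the difference
between returning undef and dying when STRICT is true.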

=head2 chars

        chars_to_utf8(STRING [, STRICT])

The chars in STRING are encoded in-place into UTF-8.  The chars are
assumed to be encoded in US-ASCII.  Returns the new size of STRING, or
undef if there's a failure, or dies if there are characters > 127.
[INTERNAL] Also the UTF-8 flag of STRING is turned on.

        utf8_to_chars(STRING [, STRICT])

The UTF-8 in STRING is decoded in-place into chars.  The chars are
assumed to be encoded in US-ASCII.  Returns the new size of STRING,
or undef if there's a failure, or dies if there are characters > 127.
[INTERNAL] The UTF-8 flag of STRING is not checked.

        utf8_to_chars_strict(STRING)

The UTF-8 in STRING is decoded in-place into chars.  Returns the new
size of STRING, or dies if the UTF-8 in STRING is malformed.  Note
that this interface is exceptionally named since a two-argument
utf8_to_chars() has different semantics.  [INTERNAL] The UTF-8 flag of
STRING is not checked.
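
To make "dies if the UTF-8 in STRING is malformed" concrete, the sketch
below uses decode() from the Encode module with the FB_CROAK check as a
stand-in.  This is an assumption for illustration only: decode() returns
a new string rather than working in place, and the sample data is
invented.

```perl
use strict;
use warnings;
use Encode qw(decode);

my $good = "caf\xC3\xA9";   # well-formed UTF-8 octets
my $bad  = "caf\xE9!";      # 0xE9 opens a sequence that "!" cannot continue

# FB_CROAK makes decode() die on malformed input; LEAVE_SRC keeps it from
# consuming the source string.
my $check = Encode::FB_CROAK() | Encode::LEAVE_SRC();

# Well-formed input decodes to four characters.
my $chars = decode("UTF-8", $good, $check);
my $count = length $chars;

# Malformed input raises an exception, which eval catches here.
my $died = eval { decode("UTF-8", $bad, $check); 1 } ? 0 : 1;
print "decoded $count chars, malformed input died=$died\n";
```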

=head2 chars With Encoding

        chars_to_utf8(STRING, ENCODING)

The chars in STRING encoded in ENCODING are recoded in-place into
UTF-8.  Returns the new size of STRING, or undef if there's a failure.
[INTERNAL] Also the UTF-8 flag of STRING is turned on.

        utf8_to_chars(STRING, ENCODING [, STRICT])

The UTF-8 in STRING is decoded in-place into chars encoded in
ENCODING.  Returns the new size of STRING, or undef if there's a
failure, or dies if STRICT is true and the UTF-8 in STRING is
malformed.  [INTERNAL] The UTF-8 flag of STRING is not checked.

        from_to(STRING, FROM_ENCODING, TO_ENCODING [, STRICT])

The chars in STRING encoded in FROM_ENCODING are recoded in-place into
TO_ENCODING.  Returns the new size of STRING, or undef if there's a
failure, or dies if STRICT is true and a mapping between the encodings
is impossible.
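
A minimal sketch of in-place recoding, assuming the from_to() shipped
with the Encode module behaves as described here.  Its optional fourth
argument is a check value such as FB_CROAK rather than a plain boolean,
so the correspondence with STRICT is only approximate.

```perl
use strict;
use warnings;
use Encode qw(from_to);

# ISO-8859-1 octets, recoded in place to UTF-8.
my $string = "caf\xE9";
my $size = from_to($string, "iso-8859-1", "UTF-8");
my $recoded_ok = $string eq "caf\xC3\xA9" ? 1 : 0;
print "recoded to $size octets\n";

# U+263A (UTF-8: E2 98 BA) has no ISO-8859-1 mapping, so with FB_CROAK
# the conversion dies instead of substituting a replacement.
my $smiley = "\xE2\x98\xBA";
my $died = eval {
    from_to($smiley, "UTF-8", "iso-8859-1", Encode::FB_CROAK());
    1;
} ? 0 : 1;
print "impossible mapping died=$died\n";
```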

=head2 Testing For UTF-8

        is_utf8(STRING [, STRICT])

[INTERNAL] Test whether the UTF-8 flag is turned on in the STRING.
If STRICT is true, also checks the data in STRING for being
well-formed UTF-8.  Returns true if successful, false otherwise.
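
The flag test can be sketched with is_utf8() from the Encode module,
whose optional second argument also asks for a well-formedness check.
Treating it as equivalent to the function described here is an
assumption, and the sample strings are invented.

```perl
use strict;
use warnings;
use Encode qw(decode is_utf8);

my $octets = "caf\xC3\xA9";             # raw octets, flag off
my $chars  = decode("UTF-8", $octets);  # decoded characters, flag on

my $octets_flag = is_utf8($octets)   ? 1 : 0;
my $chars_flag  = is_utf8($chars)    ? 1 : 0;
my $chars_valid = is_utf8($chars, 1) ? 1 : 0;   # flag on and data well-formed
print "octets=$octets_flag chars=$chars_flag valid=$chars_valid\n";
```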

=head2 Toggling UTF-8-ness

        on_utf8(STRING)

[INTERNAL] Turn on the UTF-8 flag in STRING.  The data in STRING is
B<not> checked for being well-formed UTF-8.  Do not use unless you
B<know> that the STRING is well-formed UTF-8.  Returns the previous
state of the UTF-8 flag (so please do I<not> treat the return value as
an indication of success or failure).

        off_utf8(STRING)

[INTERNAL] Turn off the UTF-8 flag in STRING.  Do not use frivolously.
Returns the previous state of the UTF-8 flag (so please do I<not>
treat the return value as an indication of success or failure).
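
The effect of toggling the flag without touching the data can be
sketched with _utf8_on() and _utf8_off() from the Encode module, taken
here as assumed stand-ins for the two functions above.  The same five
octets count as five or four characters depending on the flag, and
nothing stops the flag from being set on malformed data.

```perl
use strict;
use warnings;
use Encode qw(is_utf8 _utf8_on _utf8_off);

my $string = "caf\xC3\xA9";        # five octets of well-formed UTF-8
my $len_off = length $string;      # counted as bytes

_utf8_on($string);                 # same octets, now read as UTF-8
my $len_on = length $string;       # counted as characters
my $valid  = is_utf8($string, 1) ? 1 : 0;

_utf8_off($string);
my $len_back = length $string;
print "length: off=$len_off on=$len_on back=$len_back\n";

# The flag can be forced on for malformed data; only a checking test
# notices the problem.
my $broken = "caf\xE9!";
_utf8_on($broken);
my $broken_valid = is_utf8($broken, 1) ? 1 : 0;
_utf8_off($broken);                # restore a sane state
print "broken data reported valid=$broken_valid\n";
```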

=head2 UTF-16 and UTF-32 Encodings

        utf_to_utf(STRING, FROM, TO [, STRICT])

The data in STRING is converted from Unicode Transformation Format FROM
to Unicode Transformation Format TO.  Both FROM and TO may be any of the
following tags (case-insensitive):

        tag             meaning

        '7'             UTF-7
        '8'             UTF-8
        '16be'          UTF-16 big-endian
        '16le'          UTF-16 little-endian
        '16ne'          UTF-16 native-endian
        '32be'          UTF-32 big-endian
        '32le'          UTF-32 little-endian
        '32ne'          UTF-32 native-endian

UTF-16 is also loosely known as UCS-2 (16-bit or 2-byte chunks;
strictly speaking UCS-2 lacks surrogate pairs), and UTF-32 as UCS-4
(32-bit or 4-byte chunks).  Returns the new size of STRING, or undef
if there's a failure, or dies if STRICT is true, FROM is '8', and the
UTF-8 in STRING is malformed.  [INTERNAL] Even if
STRICT is true and FROM is '8' the UTF-8 flag of STRING is not
checked.  If TO is '8' also the UTF-8 flag of STRING is turned on.
Identical FROM and TO are fine.
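
The byte layouts behind the big-endian and little-endian tags can be
sketched with from_to() from the Encode module, assuming its encoding
names "UTF-16BE", "UTF-16LE" and "UTF-32BE" correspond to the '16be',
'16le' and '32be' tags.  The native-endian tags and UTF-7 are not
covered.  U+1F600 is chosen because it lies outside the 16-bit range
and therefore needs a surrogate pair in UTF-16.

```perl
use strict;
use warnings;
use Encode qw(from_to);

# U+1F600 as UTF-8 octets.
my $utf8 = "\xF0\x9F\x98\x80";

# Each conversion works in place on a copy and returns the new size.
my $utf16be = $utf8;
my $size16  = from_to($utf16be, "UTF-8", "UTF-16BE");

my $utf16le = $utf8;
from_to($utf16le, "UTF-8", "UTF-16LE");

my $utf32be = $utf8;
my $size32  = from_to($utf32be, "UTF-8", "UTF-32BE");

# Hex dumps make the byte order visible.
my $hex16be = unpack "H*", $utf16be;
my $hex16le = unpack "H*", $utf16le;
my $hex32be = unpack "H*", $utf32be;
print "UTF-16BE $hex16be ($size16 octets)\n";
print "UTF-16LE $hex16le\n";
print "UTF-32BE $hex32be ($size32 octets)\n";

# Converting back recovers the original UTF-8 octets.
my $round_trip = $utf16be;
from_to($round_trip, "UTF-16BE", "UTF-8");
my $same = $round_trip eq $utf8 ? 1 : 0;
```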

=cut


-- 
$jhi++; # http://www.iki.fi/jhi/
        # There is this special biologist word we use for 'stable'.
        # It is 'dead'. -- Jack Cohen
