chomp() under encoding.pm

Hello.

Recently (after 5.8.3, i.e. in what will become 5.8.4/5.9.1) chomp() was
enhanced to cope with Unicode (utf8).
Consequently, chomp() should also work properly under encoding.pm.

In perl-current, chomp() under encoding.pm behaves as follows:

(0) In slurp mode ($/ is undef) or record mode ($/ is a reference to
    an integer): encoding.pm is not taken into account;
    # the arguments are not modified here.
(1) Otherwise, an argument is upgraded to Unicode if it is in bytes.
(2) In paragraph mode ($/ is an empty string): encoding.pm is not
    taken into account;
    # encodings whose newline is not "\n" are not considered here.
(3) In normal mode ($/ is a non-empty string):
   a. $/ is upgraded to Unicode if it is in bytes.
   b. $/ is compared with the argument in Unicode.
   c. If $/ matches at the end of the argument, the argument is
      chomped and the length of the removed part, counted in
      Unicode characters, is returned.
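
The normal-mode steps above can be modelled in pure Perl roughly as
follows. This is only a sketch, not the actual implementation inside
perl: sketch_chomp() and its explicit $sep and $enc parameters are
hypothetical names (they stand in for $/ and for the encoding named in
"use encoding"), and Encode::decode() is assumed here as a stand-in for
the upgrade that happens under encoding.pm.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical model of chomp() in normal mode; not the real implementation.
# $ref: reference to the argument, $sep: stands in for $/,
# $enc: stands in for the encoding given to "use encoding".
sub sketch_chomp {
    my ($ref, $sep, $enc) = @_;

    # (1) upgrade the argument if it is still in bytes
    $$ref = decode($enc, $$ref) unless utf8::is_utf8($$ref);
    # (3a) upgrade the separator the same way
    $sep = decode($enc, $sep) unless utf8::is_utf8($sep);

    # (3b) compare at the end of the argument, character by character
    my $len = length $sep;
    return 0 unless $len && length($$ref) >= $len
                 && substr($$ref, -$len) eq $sep;

    # (3c) remove the match and return its length in characters
    substr($$ref, -$len) = '';
    return $len;
}

# Shift_JIS bytes for U+8868, followed by a newline.
my $line    = "\x95\x5C\n";
my $removed = sketch_chomp(\$line, "\n", 'Shift_JIS');
printf "removed=%d length=%d\n", $removed, length $line;
```

Note that the sketch modifies the argument in step (1) before it knows
whether the separator matches, as the described algorithm does.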

Both the argument and $/ are upgraded to Unicode because, if $/ were
compared with an argument byte by byte, arguments in a multibyte
encoding would often be chomped erroneously (for example, when the
trailing byte of a multibyte character happens to equal $/).
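
A small illustration of that hazard, independent of encoding.pm: in
Shift_JIS the character U+8868 is encoded as the bytes 0x95 0x5C, and
0x5C is also the ASCII backslash. The helper chomp_with() and the
choice of a backslash as separator are assumptions made only for this
example.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: chomp a copy of $str with $/ temporarily set to $sep.
sub chomp_with {
    my ($sep, $str) = @_;
    local $/ = $sep;
    my $removed = chomp($str);
    return ($removed, $str);
}

my $bytes = "\x95\x5C";                    # one Shift_JIS character, in bytes
my $chars = decode('Shift_JIS', $bytes);   # the same character in Unicode

# Byte by byte: the trailing byte 0x5C is taken for the separator.
my ($n_bytes, $raw) = chomp_with("\\", $bytes);

# In Unicode: the single character U+8868 does not end with a backslash.
my ($n_chars, $dec) = chomp_with("\\", $chars);

printf "bytes: removed=%d left=%d byte(s)\n", $n_bytes, length $raw;
printf "chars: removed=%d left=%d char(s)\n", $n_chars, length $dec;
```

In the byte-wise case the remaining 0x95 is only half of a character,
which is the kind of erroneous chomp the upgrade is meant to avoid.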

In normal mode and paragraph mode, this algorithm has a side effect:
an argument is modified (upgraded to Unicode) even if $/ does not
match at its end.

If people do not want any side effect when chomp() turns out to be a
no-op, a copy of the argument could be used for the comparison with $/
(but that may be wasteful).
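
A rough sketch of that copy-based alternative, under the same
assumptions as before (chomp_via_copy(), $sep and $enc are hypothetical
names, and Encode::decode() stands in for the upgrade under
encoding.pm). The argument is only written back when the separator
actually matches, at the cost of an extra copy on every call.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical variant: compare on a copy, touch the argument only on a match.
sub chomp_via_copy {
    my ($ref, $sep, $enc) = @_;

    # Work on upgraded copies; the original argument is left alone for now.
    my $copy = utf8::is_utf8($$ref) ? $$ref : decode($enc, $$ref);
    my $usep = utf8::is_utf8($sep)  ? $sep  : decode($enc, $sep);

    my $len = length $usep;
    return 0 unless $len && length($copy) >= $len
                 && substr($copy, -$len) eq $usep;

    substr($copy, -$len) = '';
    $$ref = $copy;          # only now is the argument modified
    return $len;
}

my $no_match = "\x95\x5C";  # Shift_JIS bytes without a trailing newline
my $removed  = chomp_via_copy(\$no_match, "\n", 'Shift_JIS');
printf "removed=%d still_bytes=%s\n",
    $removed, utf8::is_utf8($no_match) ? "no" : "yes";
```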

The side effect may nevertheless be useful, for example in the
following case: even if a filehandle lacks a proper encoding layer,
chomp() upgrades each line to Unicode, whether or not the line ends
with $/.

use encoding "something";
while (<>) {
    chomp;
    # ... operations here can be performed in Unicode ...
    print;
}

regards,
SADAHIRO Tomoyuki
