perl-unicode

chomp() under encoding.pm

2004-02-10 08:30:05
Hello.

Recently (after 5.8.3, i.e. in what will become 5.8.4/5.9.1) chomp() has been
enhanced to cope with Unicode (utf8).
So chomp() should now also work properly under encoding.pm.

In perl-current, chomp() under encoding.pm behaves as follows:

(0) in slurp mode or record mode: encoding.pm is not taken into account;
    # here arguments are not modified.
(1) the argument is upgraded to Unicode if it is in bytes.
(2) in paragraph mode: encoding.pm is not taken into account;
    # here encodings whose newline is not "\n" are not considered.
(3) in normal mode ($/ is a non-empty string):
   a. $/ is upgraded to Unicode if it is in bytes.
   b. $/ is compared with the end of the argument, in Unicode.
   c. If they match, the argument is chomped and the length of
      the removed part, counted in Unicode, is returned.
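
As a rough illustration of the modes listed above, the value of $/ selects
the mode. This is only a minimal sketch: it does not load encoding.pm, and
the variable names are made up for the example.

```perl
use strict;
use warnings;

# Slurp mode: $/ is undef, chomp() removes nothing.
my $slurp   = "abc\n";
my $n_slurp = do { local $/ = undef; chomp($slurp) };

# Record mode: $/ is a reference to an integer, chomp() removes nothing.
my $record   = "abc\n";
my $n_record = do { local $/ = \16; chomp($record) };

# Paragraph mode: $/ is the empty string, all trailing "\n" are removed.
my $para   = "abc\n\n\n";
my $n_para = do { local $/ = ""; chomp($para) };

# Normal mode: $/ is a non-empty string, compared with the end of the argument.
my $normal   = "abc\r\n";
my $n_normal = do { local $/ = "\r\n"; chomp($normal) };

# Normal mode with a non-ASCII separator (U+2028 LINE SEPARATOR):
# the return value counts characters, not UTF-8 bytes.
my $wide   = "abc\x{2028}";
my $n_wide = do { local $/ = "\x{2028}"; chomp($wide) };

printf "%d %d %d %d\n", $n_slurp, $n_record, $n_normal, $n_wide;
```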

The reason why both the argument and $/ are upgraded to Unicode is
that, if $/ were compared with the argument byte by byte,
arguments in a multibyte encoding would often be chomped erroneously.
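
A small sketch of such an erroneous chomp, assuming Shift_JIS data and a
backslash as $/ (both chosen only for illustration). It uses the Encode
module to decode explicitly instead of relying on encoding.pm:

```perl
use strict;
use warnings;
use Encode qw(decode);

# In Shift_JIS, U+8868 is encoded as the two bytes 0x95 0x5C;
# the trail byte 0x5C is also the code of "\" on its own.
my $octets = "\x95\x5C";

local $/ = "\x5C";                       # separator: a single backslash

# Byte-by-byte comparison: the trail byte looks like the separator.
my $as_bytes = $octets;
my $n_bytes  = chomp($as_bytes);         # the trail byte is stripped

# Comparison in Unicode: one character, U+8868, which is not "\".
my $as_chars = decode('shiftjis', $octets);
my $n_chars  = chomp($as_chars);         # nothing is removed

printf "bytes: %d removed, chars: %d removed\n", $n_bytes, $n_chars;
```

In the byte-wise case the string is left holding a lone lead byte, which is
no longer valid Shift_JIS.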

In normal mode and paragraph mode, this algorithm has the side effect
that the argument is modified (upgraded to Unicode)
even if $/ does not match at its end.

If people do not want any side effect in the no-op case,
a copy of the argument could be used for the comparison with $/
(but that may be wasteful).
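
The copy-based alternative could be sketched in pure Perl roughly as below.
The helper name chomp_via_copy is hypothetical, and the real chomp() is
implemented in C inside perl; this only illustrates leaving the argument
untouched when nothing matches.

```perl
use strict;
use warnings;

# Hypothetical helper: chomp a copy, write back only if something was removed.
sub chomp_via_copy {
    my $copy = $_[0];              # the copy costs time and memory
    utf8::upgrade($copy);          # compare in Unicode, as in step (1)
    my $removed = chomp($copy);
    $_[0] = $copy if $removed;     # $_[0] aliases the caller's variable
    return $removed;
}

my $no_match   = "caf\xE9";        # byte string without a trailing newline
my $n_no_match = chomp_via_copy($no_match);

my $match   = "abc\n";             # ends with the default $/ ("\n")
my $n_match = chomp_via_copy($match);
```

Here $no_match stays a byte string, while $match is replaced by the chomped
(and upgraded) copy.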

The above side effect may be useful, say, in the following case:
even if a filehandle was opened without a proper encoding layer, chomp()
upgrades lines to Unicode, whether or not a line ends with $/.

use encoding "something";
while (<>) {
    chomp;
    # .....
    # operations can be performed in unicode
    # .....
    print;
}
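
For comparison, a minimal sketch of the case where the filehandle does have
a proper encoding layer, so lines are already in Unicode before chomp() sees
them. The Shift_JIS data and the in-memory filehandle are assumptions made
only to keep the example self-contained:

```perl
use strict;
use warnings;

# Shift_JIS octets for U+8868 followed by a newline.
my $octets = "\x95\x5C\n";

# In-memory filehandle with an explicit decoding layer.
open(my $fh, '<:encoding(shiftjis)', \$octets) or die "open failed: $!";
my $line = <$fh>;                  # already decoded to characters
close($fh);

my $removed = chomp($line);        # removes only the trailing "\n"
```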

regards,
SADAHIRO Tomoyuki
