perl-unicode

Re: Performance and interface of Encode(3pm) in perl 5.8.0-RC1

2002-07-11 04:30:04
Guido Flohr <guido(_at_)imperia(_dot_)net> writes:
Until this morning I didn't know about the new Encode interface.  Last
weekend I had started something quite similar which is entirely written
in Perl (no C code).  I have chosen a slightly different interface,
however, and maybe you are interested to learn why.

My interface looks roughly like this:

      my $cd = Locale::Iconv->new (from => 'Windows-1250',
                                   to   => 'iso-8859-2');

      my $success = $cd->recode ($input);

I always convert "in-place", that's the first difference.  The main
drawback of this is the possibility of run-time errors ("attempt to
modify read-only values") when called with constant arguments.  But the
memory footprint is a lot better and besides even in C you cannot
copy large memory areas for free.  Why create unnecessary copies?

For my Tk application of Encode the in-place form causes unnecessary
copies, e.g. I need both the original and the form encoded into the
encoding required by the font, so I have to copy the input arg to the
return location first.

In-place conversion is very hard to do when converting between two
variable-length encodings. I suspect your "all perl" version is not
_really_ doing it "in place" but just in the same scalar, with different
PV "buffers". The Encode API is written to allow the core of encodings to
be written in C. Keeping return value and source separate is very useful
for C.


The second (and IMHO more important) difference is the object-oriented
interface.  The object returned by the constructor can be re-used
(and the conversion has to be initialized only once), I can pass
it around to other objects (important in large modularized projects),
and I can still offer a procedural interface at almost no cost

Sounds exactly like the way Encode is implemented!
I suspect you are only using Encode via its procedural interface.

:

      sub Locale::Iconv::iconv_open
      {
              # Call new() as a class method so it receives the class name.
              Locale::Iconv->new (from => $_[0], to => $_[1]);
      }

And now I can do

      my $cd = iconv_open ('Windows-1250' => 'iso-8859-2');

and say in an iconv(3) fashion

      my $success = recode ($cd, $input);

Internally my objects of type Locale::Iconv contain an encoding
chain that leads from the source (from) encoding to the destination
(to) encoding.  Theoretically this chain can have an arbitrary
length (like in the GNU libc iconv implementation) but I either
know a direct path (all conversion modules are capable of converting
into UTF-8 or my internal format) or I take an intermediate step
via the internal representation.

My internal representation is simply a reference to an array of
the corresponding ISO 10646 codes, which allows me to use map()
instead of operating on strings.

After I learned about Encode(3pm) I wrote a test script that
compares the three different conversion techniques I know: my own
one, Text::Iconv which uses the iconv(3) implementation of libc
or libiconv, and finally Encode::from_to from
perl 5.8.0.

For each implementation I convert a tiny (10 bytes), a small
(100 bytes), and a large (100 k) buffer from Windows-1250 to
ISO-8859-2.  The buffers do not contain any characters in the
range from \x80 to \x9f so that the conversions can never fail
and actually do not change anything.

For those implementations (mine and Text::Iconv) that allow
reusing a conversion handle, two flavors of the test exist: one
that creates that handle once, and then converts in a loop,
another that creates that handle anew in every round.

On my system (GNU/Linux, glibc 2.2.2) I approximately get the
following results (number of iterations in parentheses, results
in seconds):

              | tiny (2000000) | small (200000) | large (200)
--------------+----------------+----------------+-------------
Locale::Iconv |         510    |          120   |       120
(cached)      |         160    |           90   |       120
Text::Iconv   |          56    |            7   |         1.3 
(cached)      |          18    |            3   |         1.3
Encode        |         120    |            1.5 |         0.4

Nice to see that Encode is a lot faster than iconv() when operating
on large buffers.  But the result for very small buffers is
disappointing.  My pure Perl version takes only 33 % longer for
the same job (160 s compared to 120 s) because it doesn't have
the overhead to resolve the aliases, find the correct encodings
and initialize its state information for each call.  

I would use Encode that way as well.

  my $enc = find_encoding('cp1250');
  my $string  = decode($enc,$octets); 
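A minimal sketch of that usage, assuming the same Windows-1250 to
ISO-8859-2 job as in the benchmark: both encoding objects are looked up
once and then reused for every chunk.  The helper name recode_chunk and
the sample chunks are made up for illustration.

```perl
use strict;
use warnings;
use Encode qw(find_encoding);

# Look the encodings up once; alias resolution happens here, not per call.
my $from = find_encoding('cp1250')     or die "cp1250 not available";
my $to   = find_encoding('iso-8859-2') or die "iso-8859-2 not available";

# Hypothetical helper: recode one chunk of octets via perl's internal form.
sub recode_chunk {
    my ($octets) = @_;
    my $string = $from->decode($octets);   # octets -> characters
    return $to->encode($string);           # characters -> octets
}

# Illustrative input only; a real script would read these from somewhere.
my @chunks  = ("abc", "0123456789", "hello world");
my @recoded = map { recode_chunk($_) } @chunks;
```

The loop body never names an encoding, so the per-call lookup cost that
dominates the "tiny" column should not apply to it.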

For that trivial
encoding (Windows 1250 and iso-8859-2 are more or less the same)
I could actually write a specialized module that omits the 
intermediate ISO 10646 representation and I wouldn't be surprised
if that conversion module outperformed the C version included
in Encode.

For trivial translations between 8-bit encodings a canned tr///
will do the job just fine.
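One way such a canned tr/// could be built, as a sketch only: the table
is derived once from Encode's own cp1250 and iso-8859-2 mappings and
then compiled into a closure.  The variable names are illustrative, and
bytes that have no iso-8859-2 equivalent are simply left alone here,
which a real converter would have to handle.

```perl
use strict;
use warnings;
use Encode qw(find_encoding FB_CROAK);

my $from_enc = find_encoding('cp1250')     or die "cp1250 not available";
my $to_enc   = find_encoding('iso-8859-2') or die "iso-8859-2 not available";

# Collect the high-half bytes whose value differs between the two encodings.
my ($from_list, $to_list) = ('', '');
for my $byte (0x80 .. 0xFF) {
    my $char = $from_enc->decode(chr $byte);
    # FB_CROAK makes encode() die for characters iso-8859-2 cannot represent.
    my $out = eval { $to_enc->encode($char, FB_CROAK) };
    next unless defined $out && length($out) == 1;
    next if ord($out) == $byte;            # same byte in both, nothing to do
    $from_list .= sprintf '\\x%02X', $byte;
    $to_list   .= sprintf '\\x%02X', ord $out;
}

# Compile the table once; unmappable bytes are left untouched by this sketch.
my $recode = eval "sub { \$_[0] =~ tr/$from_list/$to_list/ }" or die $@;

# In-place conversion of a byte buffer (assumed to be cp1250 octets).
my $buffer = "\x8Akoda \xA5 \xB9";
$recode->($buffer);
```

After the one-time setup, each call is a single tr/// over the buffer
with no intermediate character string.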

Encode is mainly about getting external data to/from perl's internal
form so you can manipulate it. If you just want to transform between
encodings then dedicated tools like iconv will out-perform the perl
version. However in my limited experience the problems arise when
things do not map. As soon as that happens you want the
perl script to "look at it and decide what to do", and then
the conversion to internal form is a win.



One could argue that the above described test case is pathological.
I don't think so.  The current interface of Encode is ok when you
operate on strings.  But there are situations where you operate
on data _streams_ and then the difference may become very significant.

Which is why we have the :encoding layer in the PerlIO system.
That:
  A. Caches the encoding object.
  B. Buffers the IO and works on whole buffers to avoid
     small string effects.
  C. Handles partial characters when the stream gets broken across pipe
     boundaries etc.
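A short sketch of the layer in use, assuming the same cp1250 to
iso-8859-2 conversion.  In-memory filehandles stand in for real files
or pipes so the example is self-contained; the sample octets are made
up.

```perl
use strict;
use warnings;

# Hypothetical input: a few cp1250 octets held in memory instead of a file.
my $input  = "\x8Akoda\n\x9Elut\xFD\n";
my $output = '';

# The layers decode on read and encode on write.
open my $in,  '<:encoding(cp1250)',     \$input  or die "open in: $!";
open my $out, '>:encoding(iso-8859-2)', \$output or die "open out: $!";

while (my $line = <$in>) {
    # $line holds characters here, so it can be inspected or edited.
    print {$out} $line;
}

close $in;
close $out or die "close out: $!";
```

The script itself never calls decode or encode; the conversion happens
inside the two layers as data passes through the handles.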

Besides, IMHO both the object-oriented and the "handle" approach
are cleaner in design.

I quite agree - which is why Encode works the same way :-)


-- 
Nick Ing-Simmons
http://www.ni-s.u-net.com/