perl-unicode

Performance and interface of Encode(3pm) in perl 5.8.0-RC1

2002-07-10 14:30:04
Hi,

I'm new to this list and hope not to reopen a discussion that has already
taken place.

Until this morning I didn't know about the new Encode interface.  Last
weekend I had started something quite similar, written entirely in
Perl (no C code).  I chose a slightly different interface,
however, and maybe you are interested in why.

My interface looks roughly like this:

        my $cd = Locale::Iconv->new (from => 'Windows-1250',
                                     to   => 'iso-8859-2');

        my $success = $cd->recode ($input);

I always convert "in-place"; that's the first difference.  The main
drawback of this is the possibility of run-time errors ("Modification
of a read-only value attempted") when called with constant arguments.
But the memory footprint is a lot better, and besides, even in C you
cannot copy large memory areas for free.  Why create unnecessary copies?
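To make the trade-off concrete, here is a sketch of how a caller deals
with the read-only pitfall (the Locale::Iconv API is mine as described
above; the copy-first idiom is just one way around the error, and the
sample string is made up):

```perl
# Hypothetical usage of the in-place interface described above.
my $cd = Locale::Iconv->new (from => 'Windows-1250',
                             to   => 'iso-8859-2');

# This would die at run time, because the literal is read-only:
# $cd->recode ('constant string');

# Copy into a variable first; only callers that really need a
# copy pay for one.
my $buffer = 'some Windows-1250 data';
$cd->recode ($buffer);          # $buffer now holds iso-8859-2 data
```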

The second (and IMHO more important) difference is the object-oriented
interface.  The object returned by the constructor can be reused
(so the conversion has to be initialized only once), I can pass
it around to other objects (important in large, modularized projects),
and I can still offer a procedural interface at almost no cost:

        sub Locale::Iconv::iconv_open
        {
                Locale::Iconv->new (from => $_[0], to => $_[1]);
        }

And now I can do

        my $cd = iconv_open ('Windows-1250' => 'iso-8859-2');

and say in an iconv(3) fashion

        my $success = recode ($cd, $input);

Internally, my objects of type Locale::Iconv contain an encoding
chain that leads from the source (from) encoding to the destination
(to) encoding.  In theory this chain can have arbitrary
length (as in the GNU libc iconv implementation), but I either
know a direct path (all conversion modules are capable of converting
into UTF-8 or my internal format) or I take an intermediate step
via the internal representation.

My internal representation is simply a reference to an array of
the corresponding ISO 10646 codes, which allows me to use map()
instead of operating on strings.
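As an illustration of that idea (this is only a sketch, not the actual
module code; the identity tables stand in for real per-encoding
conversion tables):

```perl
# Sketch of the code-point pipeline: unpack the input into an array
# of ISO 10646 code points, map each step, pack the result.
my %to_ucs4   = map { $_ => $_ } 0 .. 255;   # source byte -> ISO 10646
my %from_ucs4 = map { $_ => $_ } 0 .. 255;   # ISO 10646 -> target byte

my $input = 'abc';
my @ucs4  = map { $to_ucs4{$_} } unpack 'C*', $input;    # decode step
my $out   = pack 'C*', map { $from_ucs4{$_} } @ucs4;     # encode step
```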

After learning about Encode(3pm), I wrote a test script that
compares the three different conversion techniques I know: my own,
Text::Iconv (which uses the iconv(3) implementation of the libc
or of libiconv), and finally Encode::from_to from perl 5.8.0.

For each implementation I convert a tiny (10 bytes), a small
(100 bytes), and a large (100 KB) buffer from Windows-1250 to
ISO-8859-2.  The buffers do not contain any bytes in the
range from \x80 to \x9f, so the conversions can never fail
and actually do not change anything.

For the two implementations (mine and Text::Iconv) that allow
reusing a conversion handle, two flavors of the test exist: one
that creates the handle once and then converts in a loop, and
another that creates the handle anew in every round.
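The timing loop for the small buffer looks roughly like this (a
sketch only; whether to use the standard Benchmark(3pm) module or
plain timing is a detail, and the buffer content is made up):

```perl
use Benchmark qw(timeit timestr);

my $buffer = 'x' x 100;            # small buffer, no \x80-\x9f bytes

# Uncached flavor: create a new conversion handle in every round.
my $t1 = timeit (200_000, sub {
        my $cd = Locale::Iconv->new (from => 'Windows-1250',
                                     to   => 'iso-8859-2');
        my $copy = $buffer;
        $cd->recode ($copy);
});

# Cached flavor: create the handle once, then convert in a loop.
my $cd = Locale::Iconv->new (from => 'Windows-1250',
                             to   => 'iso-8859-2');
my $t2 = timeit (200_000, sub {
        my $copy = $buffer;
        $cd->recode ($copy);
});

print 'uncached: ', timestr ($t1), "\n";
print 'cached:   ', timestr ($t2), "\n";
```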

On my system (GNU-Linux, glibc 2.2.2) I approximately get the
following results (number of iterations in parentheses, results
in seconds):

              | tiny (2000000) | small (200000) | large (200)
--------------+----------------+----------------+-------------
Locale::Iconv |         510    |          120   |       120
(cached)      |         160    |           90   |       120
Text::Iconv   |          56    |            7   |         1.3 
(cached)      |          18    |            3   |         1.3
Encode        |         120    |            1.5 |         0.4

Nice to see that Encode is a lot faster than iconv() when operating
on large buffers.  But the result for very small buffers is
disappointing.  My pure Perl version takes only 33 % longer for
the same job (160 s compared to 120 s) because it doesn't have
the overhead of resolving the aliases, finding the correct encodings,
and initializing its state information on every call.  For that trivial
encoding (Windows-1250 and ISO-8859-2 are more or less the same)
I could actually write a specialized module that omits the
intermediate ISO 10646 representation, and I wouldn't be surprised
if that conversion module outperformed the C version included
in Encode.
 
One could argue that the test case described above is pathological.
I don't think so.  The current interface of Encode is OK when you
operate on strings.  But there are situations where you operate
on data _streams_, and then the difference may become very significant.
Besides, IMHO both the object-oriented and the "handle" approach
are cleaner in design.
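What I mean by the stream case, as a sketch (the file name and chunk
size are made up): with a handle, the alias resolution and state setup
happen exactly once, no matter how many chunks follow.

```perl
my $cd = Locale::Iconv->new (from => 'Windows-1250',
                             to   => 'iso-8859-2');

open my $fh, '<', 'input.txt' or die "open: $!";
while (read $fh, my $chunk, 512) {
        $cd->recode ($chunk);    # conversion state set up only once
        print $chunk;
}
close $fh;
```

With a per-call interface like Encode::from_to, that setup cost is
paid once per chunk instead.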

If you want to test yourself, download

        http://ml.imperia.org/tw/users/guido/libintl-0.03.tar.gz

(no directory listing, download exactly that URL).  The package
also contains a pure Perl version of XPG4 gettext that is
(hopefully) fully compatible with recent versions of GNU gettext
and the GNU libc, i.e. it can read their .mo format.  The function
bind_textdomain_codeset() is currently a dummy; that's what I
wrote my conversion stuff for.  And don't report bugs: it's
work in progress at an early stage, and interfaces and names will
most definitely change in the near future.

Flame me if I have overlooked the feature I am asking for in Encode,
or if it has already been changed (I have only looked at RC1, not at
RC2 of 5.8.0).

Thanks for your attention!

Guido
--  
Imperia AG, Development
Leyboldstr. 10 - D-50354 Hürth - http://www.imperia.de/
