perl-unicode

Re: Performance and interface of Encode(3pm) in perl 5.8.0-RC1

2002-07-11 15:30:05
Guido Flohr <guido(_at_)imperia(_dot_)net> writes:
Hi,

On Thu, Jul 11, 2002 at 12:15:30PM +0100, Nick Ing-Simmons wrote:
For my Tk use of Encode the in-place form causes unnecessary
copies: I need both the original and the form encoded into the encoding
required by the font, so I would have to copy the input arg to the return
location first.

But whether the caller or the callee makes the copy should make no 
difference in performance.  I personally prefer to make copies as
late as possible.

So do I in general terms. 

Due to the magic of perl's internals (a legacy from pre-perl5),

   $foo = function($bar);

may not copy anything to $foo. Instead the "assign" will (probably)
just change $foo's PV to point at the string buffer for the 
temporary returned from the call. But it cannot do that if 
I have to write 

   my $foo = $bar; # must copy
   inplace($foo); 
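
The two calling shapes above can be sketched as follows. This is only an
illustration of the interfaces, not of the internal buffer handling, which
cannot be seen from Perl code alone. encode_inplace() is a hypothetical
helper invented for the comparison; it is not part of Encode.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $text = "\x{160}koda";    # "Skoda" with S-caron, as Perl characters

# Return-value form: the source is untouched, the result is a new scalar.
my $octets = encode('cp1250', $text);

# In-place form. encode_inplace() is a hypothetical helper, not an Encode API.
sub encode_inplace {
    my $encoding = $_[1];
    $_[0] = encode($encoding, $_[0]);    # overwrite the caller's scalar via @_
    return;
}

# A caller that still needs the original has to copy it before the call.
my $copy = $text;
encode_inplace($copy, 'cp1250');
```

Both paths end with the same octets; the difference is that the second one
forces an explicit copy whenever the original is still needed.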


Doing in-place is very hard when converting between two variable
length encodings. I suspect your "all perl" version is not _really_
doing it "in place" but just in the same scalar, in different PV "buffers".

Correct.  But (see your own example below) I could also write
something like

      my $replace = $subchar x 128;
      $_[n] =~ y/\x80-\xff/$replace/;

I think you can tune that some.


for many 8bit to ascii encodings and leave the decision whether a
copy of the original is left to the caller.
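
One detail worth noting about the snippet above: y/// (tr///) builds its
table at compile time and does not interpolate variables, so $replace would
be taken literally. A minimal sketch of two working alternatives, assuming
the substitution character is '?' (the value of $subchar is not given in
the thread):

```perl
use strict;
use warnings;

my $octets = "caf\x{e9} na\x{ef}ve";    # Latin-1 octets with two high bytes

# Fixed replacement: tr/// repeats the last replacement character, so a
# single '?' covers the whole \x80-\xff range.
(my $ascii = $octets) =~ tr/\x80-\xff/?/;

# Replacement chosen at run time: s///g interpolates, tr/// does not.
my $subchar = '?';                      # assumed value
(my $ascii2 = $octets) =~ s/[\x80-\xff]/$subchar/g;
```

In both cases the caller decides whether to keep the original, here by
copying before the substitution.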

To ASCII with '?' as a replacement char is very uninteresting - it is
horribly lossy. Representing failed octets as \x{HH} or HTML-like (which
Encode can do) is much more fun.
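
A small sketch of those fallback modes, using the CHECK constants
documented in Encode (FB_PERLQQ, FB_HTMLCREF, LEAVE_SRC). The sample string
is made up, and the exact number of hex digits in the \x{...} form may vary
between Encode versions, so treat the comments as indicative.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $text = "caf\x{e9}";    # e-acute has no ASCII equivalent

# LEAVE_SRC keeps $text intact; without it a non-zero CHECK value lets
# encode() overwrite the source scalar.
my $perlqq = encode('ascii', $text, Encode::FB_PERLQQ   | Encode::LEAVE_SRC);
my $html   = encode('ascii', $text, Encode::FB_HTMLCREF | Encode::LEAVE_SRC);

print "$perlqq\n";    # the unmappable character appears as a \x{...} escape
print "$html\n";      # the unmappable character appears as &#233;
```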


The Encode API is written to allow the core of encodings to be written in C.
Keeping return value and source separate is very useful for C.

However, do you need witchcraft to copy a string buffer in C if the
need for it arises?

No, but all the testing for the case arising slows the whole thing down.


I would use Encode that way as well.

  my $enc = find_encoding('cp1250');
  my $string  = decode($enc,$octets); 

That's it. ;-)

Provided that it is safe to call decode() and encode() as many times
as I want, even after an error, that's exactly what I was looking
for.
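
A minimal sketch of that usage pattern: look the encoding up once, then
call decode() and encode() repeatedly, including after a failed call. The
sample strings are made up, and FB_CROAK is used here only to provoke an
error that can be caught.

```perl
use strict;
use warnings;
use Encode qw(find_encoding encode decode);

my $enc = find_encoding('cp1250') or die "cp1250 not available";

my $first  = "\x{8a}koda";                  # cp1250 octets
my $before = decode($enc, $first);

# U+4E2D has no cp1250 mapping, so this encode() croaks under FB_CROAK.
my $unmappable = "\x{4e2d}";
my $failed = !eval { encode($enc, $unmappable, Encode::FB_CROAK); 1 };

# The same encoding object is still usable after the error.
my $second = "\x{9e}ena";                   # cp1250 octets
my $after  = decode($enc, $second);
```

cp1250 is a stateless single-byte encoding, which is the case described as
safe in the reply that follows.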

It is safe to do so for non-stateful encodings (and if one treats 
partial characters as a special case there are more of those).
If partial characters are an issue (as they are in the :encoding IO layer)
then things get messy, and more trickery is required to avoid bulk
copies. By separating source and destination that trickery can be
confined to the "driver" and not done over and over in each encoding.

A fairly safe way to avoid partial chars is to process line-by-line
but as there are normally many lines per IO buffer that is slower 
(due to call overheads) than doing whole buffer and trickery for partials.
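
As a rough illustration of the "whole buffer plus trickery" approach, the
sketch below follows the idiom from the Encode documentation: with
FB_QUIET, decode() returns what it could process and leaves the unprocessed
tail in the source buffer, so a partial character is carried over to the
next chunk. The chunk contents are made up, and this is not the :encoding
layer's actual code. Genuinely malformed input would also be left in the
buffer and needs separate handling.

```perl
use strict;
use warnings;
use Encode qw(decode);

# UTF-8 for "cafe ok" with e-acute, split in the middle of the two-byte
# sequence \xc3\xa9.
my @chunks = ("caf\xc3", "\xa9 ok");

my ($buffer, $string) = ('', '');
for my $chunk (@chunks) {
    $buffer .= $chunk;
    # FB_QUIET: stop quietly at the first problem; $buffer is overwritten
    # with whatever was not consumed.
    $string .= decode('UTF-8', $buffer, Encode::FB_QUIET);
}
```

Only the driver loop knows about the leftover bytes; the encoding itself
just sees a source and returns a result.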

-- 
Nick Ing-Simmons
http://www.ni-s.u-net.com/