perl-unicode

Re: Encode UTF-8 optimizations

2016-08-29 10:00:57
On 08/25/2016 01:48 AM, pali(_at_)cpan(_dot_)org wrote:
Anyway, if you need some help with Encode module or something different,
let me know. As I want to have UTF-8 support in Encode correctly
working...

I now have a branch with my proposed changes at:
http://perl5.git.perl.org/perl.git/shortlog/refs/heads/smoke-me/khw-encode

If you'd be willing to test this out, especially the performance parts, that would be great!

There are a bunch of commits. Most significantly, I inlined several of the short functions formerly in utf8.c. I also realized that isUTF8_CHAR could be implemented differently for large code points, in a way that never has to decode the UTF-8 into a UV. I presume that is faster. The function valid_utf8_to_uvchr() is now documented, inlined, and used in the Encode code where appropriate. This avoids the overhead of checking for errors while decoding when they've already been ruled out.
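
To illustrate the idea of checking well-formedness without ever building the code point, here is a minimal sketch. It is not the code in the branch: the name sketch_utf8_char_len is made up for this example, and it follows the Unicode table of well-formed byte sequences (up to U+10FFFF, no surrogates), whereas Perl's own macro also has to cope with Perl's extended UTF-8, which this sketch does not attempt.

```c
#include <stddef.h>

/* True if b is a UTF-8 continuation byte (10xxxxxx). */
#define SKETCH_IS_CONT(b) (((b) & 0xC0) == 0x80)

/* Hypothetical helper, NOT Perl's isUTF8_CHAR.  Returns the length in bytes
 * of the well-formed UTF-8 sequence starting at s (and ending before e),
 * or 0 if there is none.  Only byte ranges are compared; no code point is
 * ever computed. */
static size_t sketch_utf8_char_len(const unsigned char *s,
                                   const unsigned char *e)
{
    if (s >= e) return 0;

    size_t avail = (size_t)(e - s);
    unsigned char b0 = s[0];
    unsigned char lo = 0x80, hi = 0xBF;   /* allowed range of 2nd byte */

    if (b0 < 0x80) return 1;              /* ASCII */

    if (b0 >= 0xC2 && b0 <= 0xDF)         /* C0 and C1 would be overlong */
        return (avail >= 2 && SKETCH_IS_CONT(s[1])) ? 2 : 0;

    if (b0 >= 0xE0 && b0 <= 0xEF) {
        if (avail < 3) return 0;
        if (b0 == 0xE0) lo = 0xA0;        /* rules out overlongs */
        if (b0 == 0xED) hi = 0x9F;        /* rules out surrogates */
        if (s[1] < lo || s[1] > hi) return 0;
        return SKETCH_IS_CONT(s[2]) ? 3 : 0;
    }

    if (b0 >= 0xF0 && b0 <= 0xF4) {
        if (avail < 4) return 0;
        if (b0 == 0xF0) lo = 0x90;        /* rules out overlongs */
        if (b0 == 0xF4) hi = 0x8F;        /* rules out > U+10FFFF */
        if (s[1] < lo || s[1] > hi) return 0;
        return (SKETCH_IS_CONT(s[2]) && SKETCH_IS_CONT(s[3])) ? 4 : 0;
    }

    return 0;                             /* stray continuation or F5..FF */
}
```

The point is that overlongs, surrogates and out-of-range values can all be rejected by looking at the first two bytes alone, so nothing needs to be shifted and OR'ed together just to decide validity.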

There are 2 experimental performance commits. If you want to see whether they actually improve performance by doing a before/after comparison, that would be nice.

One unrolls the loop in valid_utf8_to_uvchr(). I tried unrolling it some time ago, though not this way, and it made no appreciable difference. But now that the function is inlined, that may be different. On the other hand, the unrolling makes the function bigger, and the compiler may now refuse to inline it.
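
For anyone wanting a picture of what unrolling means here, the following is a rough sketch of a looped and an unrolled decoder for input that is already known to be well-formed. Both function names are invented for the example, the code handles only sequences of 1 to 4 bytes, and it is not the code from the commit.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical looped decoder.  Assumes s[0..len-1] is already known to be
 * one well-formed UTF-8 sequence with 1 <= len <= 4, so it does no checks. */
static uint32_t sketch_decode_loop(const unsigned char *s, size_t len)
{
    if (len == 1) return s[0];

    /* Keep only the payload bits of the start byte:
     * len 2 -> mask 0x1F, len 3 -> mask 0x0F, len 4 -> mask 0x07. */
    uint32_t cp = s[0] & (0xFFu >> (len + 1));

    for (size_t i = 1; i < len; i++)
        cp = (cp << 6) | (s[i] & 0x3Fu);  /* 6 payload bits per byte */

    return cp;
}

/* The same computation with the loop unrolled into one case per length.
 * Same assumptions as above. */
static uint32_t sketch_decode_unrolled(const unsigned char *s, size_t len)
{
    switch (len) {
    case 1:
        return s[0];
    case 2:
        return ((s[0] & 0x1Fu) << 6)
             |  (s[1] & 0x3Fu);
    case 3:
        return ((s[0] & 0x0Fu) << 12)
             | ((s[1] & 0x3Fu) << 6)
             |  (s[2] & 0x3Fu);
    default: /* len == 4 */
        return ((s[0] & 0x07u) << 18)
             | ((s[1] & 0x3Fu) << 12)
             | ((s[2] & 0x3Fu) << 6)
             |  (s[3] & 0x3Fu);
    }
}
```

The unrolled form trades the loop counter and branch for more code, which is exactly the size-versus-inlining trade-off mentioned above; only a before/after measurement can say which wins.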

The other commit changes utf8n_to_uvchr() to first call isUTF8_CHAR() and, if that passes, call valid_utf8_to_uvchr(). Again, there should be less overhead for the normal case where the input is well-formed, at the expense of somewhat slowing down malformed input. It's unclear whether this is worth changing.
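
The shape of that change, validate cheaply first and only fall into the expensive error handling when validation fails, might look roughly like the sketch below. Every name in it is hypothetical, the status codes are invented for illustration, and it covers only standard UTF-8 up to U+10FFFF rather than what utf8n_to_uvchr() really handles.

```c
#include <stddef.h>
#include <stdint.h>

/* Invented status codes for this sketch. */
enum sketch_status {
    SK_OK, SK_EMPTY, SK_BAD_START, SK_TRUNCATED, SK_BAD_CONT, SK_DISALLOWED
};

/* Cheap check, no decoding: length of the well-formed sequence at s, or 0. */
static size_t sketch_valid_len(const unsigned char *s, size_t avail)
{
    if (avail == 0) return 0;
    unsigned char b0 = s[0], lo = 0x80, hi = 0xBF;
    size_t len;

    if (b0 < 0x80) return 1;
    else if (b0 >= 0xC2 && b0 <= 0xDF) len = 2;
    else if (b0 >= 0xE0 && b0 <= 0xEF) {
        len = 3;
        if (b0 == 0xE0) lo = 0xA0;        /* no overlongs */
        if (b0 == 0xED) hi = 0x9F;        /* no surrogates */
    } else if (b0 >= 0xF0 && b0 <= 0xF4) {
        len = 4;
        if (b0 == 0xF0) lo = 0x90;        /* no overlongs */
        if (b0 == 0xF4) hi = 0x8F;        /* nothing above U+10FFFF */
    } else return 0;

    if (avail < len || s[1] < lo || s[1] > hi) return 0;
    for (size_t i = 2; i < len; i++)
        if ((s[i] & 0xC0) != 0x80) return 0;
    return len;
}

/* Unchecked decode; only safe after sketch_valid_len() returned len. */
static uint32_t sketch_decode_valid(const unsigned char *s, size_t len)
{
    if (len == 1) return s[0];
    uint32_t cp = s[0] & (0xFFu >> (len + 1));
    for (size_t i = 1; i < len; i++)
        cp = (cp << 6) | (s[i] & 0x3Fu);
    return cp;
}

/* Slow path, reached only for malformed input: work out what went wrong. */
static enum sketch_status sketch_diagnose(const unsigned char *s, size_t avail)
{
    if (avail == 0) return SK_EMPTY;
    unsigned char b0 = s[0];
    if (b0 < 0xC0 || b0 > 0xF7) return SK_BAD_START;
    size_t len = b0 >= 0xF0 ? 4 : b0 >= 0xE0 ? 3 : 2;
    for (size_t i = 1; i < len; i++) {
        if (i >= avail) return SK_TRUNCATED;
        if ((s[i] & 0xC0) != 0x80) return SK_BAD_CONT;
    }
    return SK_DISALLOWED;   /* overlong, surrogate, or above U+10FFFF */
}

/* Fast path first; malformed input pays for both the failed check and
 * the diagnosis. */
static enum sketch_status sketch_decode_checked(const unsigned char *s,
                                                size_t avail,
                                                uint32_t *cp, size_t *len)
{
    size_t n = sketch_valid_len(s, avail);
    if (n != 0) {
        *cp = sketch_decode_valid(s, n);
        *len = n;
        return SK_OK;
    }
    *len = 0;
    return sketch_diagnose(s, avail);
}
```

Well-formed input never enters sketch_diagnose(), while malformed input is scanned twice, which is the trade-off in question.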
