Re: Encode UTF-8 optimizations

On 08/25/2016 01:48 AM, pali(_at_)cpan(_dot_)org wrote:

Anyway, if you need some help with Encode module or something different,
let me know. As I want to have UTF-8 support in Encode correctly
working...


I now have a branch with my proposed changes at:
http://perl5.git.perl.org/perl.git/shortlog/refs/heads/smoke-me/khw-encode

If you'd be willing to test this out, especially the performance parts, that would be great!

There are a bunch of commits. Most significantly, I inlined several of the short functions formerly in utf8.c. I realized that isUTF8_CHAR could be implemented differently for large code points, in a way that never has to decode the UTF-8 into a UV. I presume that is faster. The function valid_utf8_to_uvchr() is now documented, inlined, and used in the Encode code where appropriate. This avoids the overhead of checking for errors while decoding when they've already been ruled out.
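To illustrate why skipping the error checks helps, here is a minimal sketch of a decoder that trusts its input. It is not the code from the branch: decode_valid_utf8() is a hypothetical name, and it assumes standard 1 to 4 byte UTF-8 that the caller has already verified to be well-formed.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not the Perl core function: decode one code point
 * from input that the caller has ALREADY verified to be well-formed.
 * Assumes standard UTF-8 (1 to 4 bytes); no malformation checks are made. */
static uint32_t decode_valid_utf8(const unsigned char *s, size_t *len)
{
    unsigned char b = s[0];
    size_t n;
    uint32_t cp;

    /* The start byte alone determines the length and the payload bits. */
    if (b < 0x80)      { n = 1; cp = b; }
    else if (b < 0xE0) { n = 2; cp = b & 0x1F; }
    else if (b < 0xF0) { n = 3; cp = b & 0x0F; }
    else               { n = 4; cp = b & 0x07; }

    /* Each continuation byte contributes its low 6 bits. */
    for (size_t i = 1; i < n; i++)
        cp = (cp << 6) | (uint32_t)(s[i] & 0x3F);

    if (len)
        *len = n;
    return cp;
}
```

There are no branches for truncated input, bad continuation bytes, or overlongs, which is only safe because validation happened earlier.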

There are 2 experimental performance commits. If you want to see if they actually improve performance by doing a before/after compare, that would be nice.

One unrolls the loop in valid_utf8_to_uvchr(). I tried doing this, but not this way, some time ago, and it made no appreciable difference. But now that it is inlined, that may be different. On the other hand, the unrolling makes the function bigger, and the compiler may now refuse to inline it.
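For anyone unfamiliar with the idea, unrolling here means replacing the per-byte loop with one straight-line expression per sequence length. The following is only a sketch of the general technique, not the commit itself; decode_valid_utf8_unrolled() is a hypothetical name, and it assumes well-formed standard UTF-8 of 1 to 4 bytes.

```c
#include <stdint.h>

/* Hypothetical unrolled decoder for already-validated standard UTF-8.
 * One branch per sequence length, no loop over continuation bytes. */
static uint32_t decode_valid_utf8_unrolled(const unsigned char *s)
{
    unsigned char b = s[0];

    if (b < 0x80)                       /* 1 byte: ASCII */
        return b;

    if (b < 0xE0)                       /* 2 bytes */
        return ((uint32_t)(b    & 0x1F) << 6)
             |  (uint32_t)(s[1] & 0x3F);

    if (b < 0xF0)                       /* 3 bytes */
        return ((uint32_t)(b    & 0x0F) << 12)
             | ((uint32_t)(s[1] & 0x3F) << 6)
             |  (uint32_t)(s[2] & 0x3F);

    /* 4 bytes */
    return ((uint32_t)(b    & 0x07) << 18)
         | ((uint32_t)(s[1] & 0x3F) << 12)
         | ((uint32_t)(s[2] & 0x3F) << 6)
         |  (uint32_t)(s[3] & 0x3F);
}
```

The trade-off is visible even in this small version: fewer loop-control instructions per character, but more code at every site where the function gets inlined.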

And the other commit changes utf8n_to_uvchr() to first call isUTF8_CHAR(), and if it passes, call valid_utf8_to_uvchr(). Again, there should be less overhead for the normal case where the input is well-formed, at the expense of somewhat slowing down malformed input. It's unclear if this is worth changing or not.
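The shape of that change is "validate first, then take the unchecked fast path". Here is a rough, self-contained sketch of the pattern. All names are hypothetical, the validator is a strict standard-UTF-8 one (unlike Perl's, it rejects surrogates and anything above U+10FFFF), and the slow path is reduced to returning U+FFFD, where the real function would diagnose the malformation.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical validator: length (1 to 4) of the well-formed sequence
 * starting at s, or 0 if it is malformed or runs past end. */
static size_t utf8_char_len(const unsigned char *s, const unsigned char *end)
{
    size_t avail = (size_t)(end - s);
    size_t n;
    unsigned char b;

    if (avail == 0)
        return 0;
    b = s[0];
    if (b < 0x80)                     return 1;
    else if (b >= 0xC2 && b <= 0xDF)  n = 2;
    else if (b >= 0xE0 && b <= 0xEF)  n = 3;
    else if (b >= 0xF0 && b <= 0xF4)  n = 4;
    else                              return 0;   /* stray or invalid start byte */

    if (avail < n)
        return 0;                                 /* truncated */
    for (size_t i = 1; i < n; i++)
        if ((s[i] & 0xC0) != 0x80)
            return 0;                             /* not a continuation byte */

    if (b == 0xE0 && s[1] < 0xA0) return 0;       /* overlong 3-byte form */
    if (b == 0xED && s[1] > 0x9F) return 0;       /* surrogate */
    if (b == 0xF0 && s[1] < 0x90) return 0;       /* overlong 4-byte form */
    if (b == 0xF4 && s[1] > 0x8F) return 0;       /* above U+10FFFF */
    return n;
}

/* Fast path: no checks, valid only after utf8_char_len() returned n. */
static uint32_t decode_known_valid(const unsigned char *s, size_t n)
{
    static const unsigned char lead_mask[5] = { 0, 0x7F, 0x1F, 0x0F, 0x07 };
    uint32_t cp = (uint32_t)(s[0] & lead_mask[n]);

    for (size_t i = 1; i < n; i++)
        cp = (cp << 6) | (uint32_t)(s[i] & 0x3F);
    return cp;
}

/* Checked entry point: validate, then decode without rechecking. */
static uint32_t decode_checked(const unsigned char *s,
                               const unsigned char *end, size_t *len)
{
    size_t n = utf8_char_len(s, end);

    if (n != 0) {                     /* normal case: well-formed input */
        *len = n;
        return decode_known_valid(s, n);
    }
    /* Slow path placeholder: consume one byte (if any) and return U+FFFD. */
    *len = (s < end) ? 1 : 0;
    return 0xFFFD;
}
```

Malformed input pays for the failed validation and then for the slow path on top of it, which is the cost in question.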