On 08/25/2016 01:48 AM, pali(_at_)cpan(_dot_)org wrote:
Anyway, if you need some help with Encode module or something different,
let me know. As I want to have UTF-8 support in Encode correctly
working...
I now have a branch with my proposed changes at:
http://perl5.git.perl.org/perl.git/shortlog/refs/heads/smoke-me/khw-encode
If you'd be willing to test this out, especially the performance parts,
that would be great!
There are a bunch of commits. Most significantly, I inlined several of
the short functions formerly in utf8.c. I realized that isUTF8_CHAR
could be implemented differently for large code points, in a way that
never has to decode the UTF-8 into a UV. I presume that is faster.
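
To illustrate the idea of validating without decoding, here is a minimal
sketch. It is not the code in the branch: sketch_is_utf8_char() is a
hypothetical name, and it assumes strict standard UTF-8 (U+0000 to
U+10FFFF, no surrogates), so it does not attempt to handle the large
code points mentioned above. It only compares bytes against ranges and
never accumulates a code point value.

```c
#include <stddef.h>

/* A continuation byte has the bit pattern 10xxxxxx. */
#define IS_CONT(b) (((b) & 0xC0) == 0x80)

/* Hypothetical helper, not Perl's isUTF8_CHAR: returns the length in
 * bytes of the well-formed character starting at s (e points one past
 * the end of the buffer), or 0 if there is none.  No code point is
 * ever computed; only byte ranges are checked. */
static size_t sketch_is_utf8_char(const unsigned char *s,
                                  const unsigned char *e)
{
    size_t avail;

    if (s >= e) return 0;
    avail = (size_t)(e - s);

    if (s[0] < 0x80) return 1;                      /* ASCII */

    if (s[0] >= 0xC2 && s[0] <= 0xDF)               /* 2-byte form */
        return (avail >= 2 && IS_CONT(s[1])) ? 2 : 0;

    if (s[0] >= 0xE0 && s[0] <= 0xEF) {             /* 3-byte form */
        if (avail < 3 || !IS_CONT(s[1]) || !IS_CONT(s[2])) return 0;
        if (s[0] == 0xE0 && s[1] < 0xA0) return 0;  /* overlong */
        if (s[0] == 0xED && s[1] > 0x9F) return 0;  /* surrogate */
        return 3;
    }

    if (s[0] >= 0xF0 && s[0] <= 0xF4) {             /* 4-byte form */
        if (avail < 4 || !IS_CONT(s[1]) || !IS_CONT(s[2])
                      || !IS_CONT(s[3])) return 0;
        if (s[0] == 0xF0 && s[1] < 0x90) return 0;  /* overlong */
        if (s[0] == 0xF4 && s[1] > 0x8F) return 0;  /* above U+10FFFF */
        return 4;
    }

    return 0;   /* stray continuation byte, 0xC0, 0xC1, or 0xF5..0xFF */
}
```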
The function valid_utf8_to_uvchr() is now documented, inlined, and used
in the Encode code where appropriate. This avoids the overhead of
checking for errors while decoding when they've already been ruled out.
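
As a rough picture of what decoding looks like once errors have been
ruled out, here is a sketch. sketch_decode_valid_utf8() is a
hypothetical name, not the real valid_utf8_to_uvchr(), and it assumes
standard UTF-8 of at most 4 bytes that the caller has already
validated. Note that it has no bounds checks and no malformation
checks at all.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: decodes one character from input that the
 * caller guarantees is well-formed UTF-8 of 1 to 4 bytes.  The length
 * is derived from the lead byte alone, and continuation bytes are
 * trusted without inspection of their top bits. */
static uint32_t sketch_decode_valid_utf8(const unsigned char *s,
                                         size_t *len)
{
    /* Payload mask for the lead byte, indexed by sequence length. */
    static const unsigned char lead_mask[5] = {0, 0x7F, 0x1F, 0x0F, 0x07};
    size_t n = s[0] < 0x80 ? 1 : s[0] < 0xE0 ? 2 : s[0] < 0xF0 ? 3 : 4;
    uint32_t cp = s[0] & lead_mask[n];
    size_t i;

    /* Each continuation byte contributes its low 6 bits. */
    for (i = 1; i < n; i++)
        cp = (cp << 6) | (s[i] & 0x3F);

    if (len) *len = n;
    return cp;
}
```

Calling this on input that has not been validated would read past the
end of a truncated sequence, which is why it only belongs after a
validity check.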
There are 2 experimental performance commits. If you could check whether
they actually improve performance with a before/after comparison, that
would be nice.
One unrolls the loop in valid_utf8_to_uvchr(). I tried unrolling some
time ago, though not this way, and it made no appreciable difference.
But now that the function is inlined, that may be different. On the
other hand, the unrolling makes the function bigger, and the compiler
may now refuse to inline it.
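
For anyone who wants to see the general shape of such an unrolling,
here is a sketch. It is not the commit itself, which may unroll
differently; sketch_decode_valid_utf8_unrolled() is a hypothetical
name, and the code assumes already-validated standard UTF-8 of at most
4 bytes. Each length gets its own straight-line expression, so there
is no loop counter, at the cost of more code per call site once
inlined.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: loop-free decode of one character from input
 * the caller guarantees is well-formed UTF-8 of 1 to 4 bytes. */
static uint32_t sketch_decode_valid_utf8_unrolled(const unsigned char *s,
                                                  size_t *len)
{
    if (s[0] < 0x80) {                  /* 0xxxxxxx */
        *len = 1;
        return s[0];
    }
    if (s[0] < 0xE0) {                  /* 110xxxxx 10xxxxxx */
        *len = 2;
        return ((uint32_t)(s[0] & 0x1F) << 6)
             |  (uint32_t)(s[1] & 0x3F);
    }
    if (s[0] < 0xF0) {                  /* 1110xxxx 10xxxxxx 10xxxxxx */
        *len = 3;
        return ((uint32_t)(s[0] & 0x0F) << 12)
             | ((uint32_t)(s[1] & 0x3F) << 6)
             |  (uint32_t)(s[2] & 0x3F);
    }
    *len = 4;                           /* 11110xxx + 3 continuations */
    return ((uint32_t)(s[0] & 0x07) << 18)
         | ((uint32_t)(s[1] & 0x3F) << 12)
         | ((uint32_t)(s[2] & 0x3F) << 6)
         |  (uint32_t)(s[3] & 0x3F);
}
```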
The other commit changes utf8n_to_uvchr() to first call isUTF8_CHAR()
and, if that passes, call valid_utf8_to_uvchr(). Again, there should be
less overhead for the normal case where the input is well-formed, at
the expense of somewhat slowing down malformed input, which now gets
checked twice. It's unclear whether this is worth changing.
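
The control flow of that second commit can be sketched roughly as
follows. All of the fp_* names are hypothetical stand-ins rather than
the real Perl functions, the validator assumes strict standard UTF-8,
and the slow path here is only a placeholder that returns U+FFFD and
skips one byte, where the real function would diagnose the
malformation.

```c
#include <stddef.h>
#include <stdint.h>

/* Stand-in validator: length of the well-formed character at s, or 0. */
static size_t fp_valid_len(const unsigned char *s, size_t avail)
{
    size_t n, i;

    if (avail == 0) return 0;
    if (s[0] < 0x80) return 1;
    if (s[0] < 0xC2 || s[0] > 0xF4) return 0;   /* never a valid lead */
    n = s[0] < 0xE0 ? 2 : s[0] < 0xF0 ? 3 : 4;
    if (avail < n) return 0;                    /* truncated */
    for (i = 1; i < n; i++)
        if ((s[i] & 0xC0) != 0x80) return 0;    /* not a continuation */
    if (s[0] == 0xE0 && s[1] < 0xA0) return 0;  /* overlong */
    if (s[0] == 0xED && s[1] > 0x9F) return 0;  /* surrogate */
    if (s[0] == 0xF0 && s[1] < 0x90) return 0;  /* overlong */
    if (s[0] == 0xF4 && s[1] > 0x8F) return 0;  /* above U+10FFFF */
    return n;
}

/* Stand-in unchecked decoder: n must come from fp_valid_len(). */
static uint32_t fp_decode_valid(const unsigned char *s, size_t n)
{
    static const unsigned char lead_mask[5] = {0, 0x7F, 0x1F, 0x0F, 0x07};
    uint32_t cp = s[0] & lead_mask[n];
    size_t i;

    for (i = 1; i < n; i++)
        cp = (cp << 6) | (s[i] & 0x3F);
    return cp;
}

/* Placeholder slow path: a real one would work out what is wrong. */
static uint32_t fp_slow_path(const unsigned char *s, size_t avail,
                             size_t *len)
{
    (void)s;
    *len = avail ? 1 : 0;
    return 0xFFFD;                              /* REPLACEMENT CHARACTER */
}

/* Validate first; decode without checks when the input is well-formed,
 * and fall back to the slow path otherwise. */
static uint32_t fp_utf8n_to_cp(const unsigned char *s, size_t avail,
                               size_t *len)
{
    size_t n = fp_valid_len(s, avail);

    if (n) {
        *len = n;
        return fp_decode_valid(s, n);
    }
    return fp_slow_path(s, avail, len);
}
```

Well-formed input takes the short branch, while malformed input pays
for the failed validation on top of the slow path, which is the
trade-off described above.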