perl-unicode

Re: Encode UTF-8 optimizations

2016-08-22 16:38:26
On 08/22/2016 03:19 PM, Karl Williamson wrote:
On 08/22/2016 02:47 PM, pali(_at_)cpan(_dot_)org wrote:
> And I think you misunderstand when is_utf8_char_slow() is called. It is
> called only when the next byte in the input indicates that the only
> legal UTF-8 that might follow would be for a code point that is at least
> U+200000, almost twice as high as the highest legal Unicode code point.
> It is a Perl extension to handle such code points, unlike other
> languages.  But the Perl core is not optimized for them, nor will it be.
> My point is that is_utf8_char_slow() will only be called in very
> specialized cases, and we need not make those cases have as good a
> performance as normal ones.
In strict mode, there is absolutely no need to call is_utf8_char_slow().
As in strict mode such sequence must be always invalid (it is above last
valid Unicode character). This is what I tried to tell.

And currently is_strict_utf8_string_loc() first calls isUTF8_CHAR()
(which could call is_utf8_char_slow()) and after that is check for
UTF8_IS_SUPER().

I only have time to respond to this portion just now.

The code could be tweaked to call UTF8_IS_SUPER first, but I'm asserting
that an optimizing compiler will see that any call to
is_utf8_char_slow() is pointless, and will optimize it out.


Now, I'm realizing I'm wrong. It can't be optimized out by the compiler because it is not declared (nor can it be) to be a pure function. And, I'd rather not tweak it to call UTF8_IS_SUPER first, because that relies on knowing what the current internal implementation is.
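
To make the ordering question concrete, here is a minimal sketch in C. It is not the Perl core code: slow_validate(), char_ok(), starts_above_unicode() and the two strict_* functions are hypothetical stand-ins, and the validation itself is deliberately simplified to a continuation-byte check. The call counter models a side effect, which is the kind of thing that stops a compiler from treating a call as removable.

```c
#include <stddef.h>

/* Counts calls to the slow path. Updating it is a side effect, so
 * slow_validate() below is not a pure function. */
static int slow_calls = 0;

/* Simplified check: every byte after the first is a continuation byte. */
static int continuation_ok(const unsigned char *s, size_t len)
{
    for (size_t i = 1; i < len; i++)
        if ((s[i] & 0xC0) != 0x80)
            return 0;
    return 1;
}

/* Hypothetical stand-in for the out-of-line slow path. */
static int slow_validate(const unsigned char *s, size_t len)
{
    slow_calls++;
    return continuation_ok(s, len);
}

/* Hypothetical stand-in for the per-character check: start bytes of
 * 0xF8 and up (code points of at least U+200000) go to the slow path. */
static int char_ok(const unsigned char *s, size_t len)
{
    if (len == 0)
        return 0;
    return s[0] >= 0xF8 ? slow_validate(s, len) : continuation_ok(s, len);
}

/* Does the sequence start a code point above U+10FFFF? */
static int starts_above_unicode(const unsigned char *s, size_t len)
{
    if (len == 0)
        return 0;
    if (s[0] >= 0xF5)
        return 1;
    return s[0] == 0xF4 && len > 1 && s[1] >= 0x90;
}

/* Validate first, then reject above-Unicode input: the slow path still
 * runs even though its answer cannot change the result. */
static int strict_validate_first(const unsigned char *s, size_t len)
{
    if (!char_ok(s, len))
        return 0;
    return !starts_above_unicode(s, len);
}

/* Reject above-Unicode input first: the slow path is never reached. */
static int strict_check_first(const unsigned char *s, size_t len)
{
    if (starts_above_unicode(s, len))
        return 0;
    return char_ok(s, len);
}
```

Both orderings give the same answer for every input; they differ only in whether the slow path is entered for sequences that strict mode rejects anyway. The second ordering is the tweak I would rather avoid, since it bakes in knowledge of which start bytes the per-character check sends to the slow path.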

But I still argue that it is fine the way it is. It will only get called for code points far above the Unicode range, and the performance on those should not affect our decisions in any way.