perl-unicode

Re: Encode UTF-8 optimizations

2016-08-22 16:40:24
(this only applies for strict UTF-8)

On Monday 22 August 2016 23:19:51 Karl Williamson wrote:
The code could be tweaked to call UTF8_IS_SUPER first, but I'm
asserting that an optimizing compiler will see that any call to
is_utf8_char_slow() is pointless, and will optimize it out.

Such an optimization cannot be done; the compiler cannot know such a thing...

You have this code:

+        const STRLEN char_len = isUTF8_CHAR(x, send);
+
+        if (    UNLIKELY(! char_len)
+            || (    UNLIKELY(isUTF8_POSSIBLY_PROBLEMATIC(*x))
+                && (   UNLIKELY(UTF8_IS_SURROGATE(x, send))
+                    || UNLIKELY(UTF8_IS_SUPER(x, send))
+                    || UNLIKELY(UTF8_IS_NONCHAR(x, send)))))
+        {
+            *ep = x;
+            return FALSE;
+        }

Here the isUTF8_CHAR() macro calls the function is_utf8_char_slow() when
the condition IS_UTF8_CHAR_FAST(UTF8SKIP(x)) is false. And because
is_utf8_char_slow() is an external library function, the compiler has
absolutely no idea what that function does. Outside a purely functional
language such a function could have side effects, so the compiler really
cannot eliminate that call.
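
To make the side-effect argument concrete, here is a minimal sketch. The names demo_opaque_check(), demo_valid() and demo_calls are made up for illustration and are not part of perl. In the real case the callee lives in another translation unit, so the compiler sees only a prototype; here the body is shown in the same file just to make the effect observable.

```c
#include <stddef.h>

/* Hypothetical counter standing in for "any observable side effect". */
static unsigned long demo_calls = 0;

/* Hypothetical stand-in for an external function such as
 * is_utf8_char_slow().  Imagine the compiler sees only the prototype. */
size_t demo_opaque_check(const unsigned char *s, const unsigned char *e)
{
    demo_calls++;               /* side effect the caller cannot rule out */
    return (size_t)(e - s);
}

/* Same shape as the quoted code: the length check is evaluated first,
 * even when a later condition (is_super) already decides the result. */
int demo_valid(const unsigned char *s, const unsigned char *e, int is_super)
{
    const size_t char_len = demo_opaque_check(s, e);

    if (char_len == 0 || is_super)
        return 0;
    return 1;
}
```

Even when is_super is set and the return value is already decided, the call has still happened and demo_calls has changed, which is exactly what the compiler must preserve.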

Moving UTF8_IS_SUPER before isUTF8_CHAR might help, but I'm sceptical
that gcc can really propagate constants from the PL_utf8skip[] array back
and prove that IS_UTF8_CHAR_FAST must always be true whenever
UTF8_IS_SUPER is false...

Rather, add an IS_UTF8_CHAR_FAST(UTF8SKIP(x)) check (or similar) before
the isUTF8_CHAR() call. That should completely avoid generating code
with a call to the is_utf8_char_slow() function.
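
A rough sketch of that guard, using simplified stand-ins rather than the real perl macros: demo_skip(), DEMO_IS_CHAR_FAST, demo_is_char(), demo_strict_ok() and slow_calls are all hypothetical names. The length table is a simplification for ASCII platforms, and the validity checks only look at continuation bytes. Whether a given compiler then actually drops the slow call depends on inlining and optimization level.

```c
#include <stddef.h>

static int slow_calls = 0;      /* counts entries into the slow path */

/* Simplified stand-in for UTF8SKIP(): length implied by the first byte. */
static size_t demo_skip(unsigned char b)
{
    if (b < 0xC0) return 1;     /* ASCII, or a stray continuation byte */
    if (b < 0xE0) return 2;
    if (b < 0xF0) return 3;
    if (b < 0xF8) return 4;
    if (b < 0xFC) return 5;
    if (b < 0xFE) return 6;
    return 7;
}

#define DEMO_IS_CHAR_FAST(len) ((len) <= 4)

/* All bytes after the first must look like 10xxxxxx. */
static int demo_tail_ok(const unsigned char *s, size_t len)
{
    size_t i;
    for (i = 1; i < len; i++)
        if ((s[i] & 0xC0) != 0x80) return 0;
    return 1;
}

static size_t demo_char_fast(const unsigned char *s, size_t len)
{
    if (len == 1) return *s < 0x80 ? 1 : 0;
    return demo_tail_ok(s, len) ? len : 0;
}

/* Stand-in for is_utf8_char_slow(): only reached for long forms. */
static size_t demo_char_slow(const unsigned char *s, size_t len)
{
    slow_calls++;
    return demo_tail_ok(s, len) ? len : 0;
}

/* Stand-in for isUTF8_CHAR(): fast path if short, slow path otherwise. */
static size_t demo_is_char(const unsigned char *s, const unsigned char *e)
{
    size_t len;
    if (e <= s) return 0;
    len = demo_skip(*s);
    if ((size_t)(e - s) < len) return 0;
    return DEMO_IS_CHAR_FAST(len) ? demo_char_fast(s, len)
                                  : demo_char_slow(s, len);
}

/* Strict check with the guard first: sequences longer than 4 bytes can
 * never be valid strict UTF-8, so they are rejected before the generic
 * check and the slow path is unreachable from here. */
static int demo_strict_ok(const unsigned char *s, const unsigned char *e)
{
    if (e <= s) return 0;
    if (!DEMO_IS_CHAR_FAST(demo_skip(*s))) return 0;
    return demo_is_char(s, e) != 0;
}
```

With the guard in place, a 5-byte sequence is rejected by demo_strict_ok() without ever touching demo_char_slow(), while the unguarded demo_is_char() still goes through it.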

With UTF8_IS_SUPER there can be a branch in the binary code that will
never be executed.