perl-unicode

Re: Use case for utf8::upgrade?

2010-04-08 03:08:42
* Michael Ludwig <michael(_dot_)ludwig(_at_)xing(_dot_)com> [2010-04-08 09:25]:
since upgrading a string increases memory consumption and can
significantly slow down regex matches against it.

Is it some copying behind the scenes that increases memory
consumption?

Just the simple fact that some characters take multiple bytes to
encode in the UTF8-based format.
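
A minimal sketch of that growth, assuming a string whose characters all fit in Latin-1. The helper internal_bytes is a made-up name for illustration; it relies on the bytes pragma, which is meant for this kind of inspection rather than for everyday code.

```perl
use strict;
use warnings;

# Hypothetical helper: number of bytes perl uses internally for the string.
sub internal_bytes {
    my ($s) = @_;
    use bytes;                      # lexically scoped: length() counts bytes here
    return length $s;
}

my $s = "caf\xe9";                  # 4 characters, all within Latin-1
my $before = internal_bytes($s);    # one byte per character

utf8::upgrade($s);                  # same characters, UTF8-based internal format
my $after = internal_bytes($s);     # the e-acute now occupies two bytes

printf "chars=%d before=%d after=%d\n", length($s), $before, $after;
# prints "chars=4 before=4 after=5"
```

The character count is unchanged; only the storage grows, and only for characters above 0x7F.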

Why does that have the potential to significantly slow down
regex matches?

Because one byte and one character is no longer the same thing,
so if you know you want the 17th character in the string, you
can’t say where in memory that is. You have to scan the string.
This sort of access pattern is rare in practice – most
operations either just copy the entire string or scan over it one
character at a time. But the regex engine is one of those things
that sometimes needs to jump around in the string rather than
merely scanning linearly. (Perl’s regex engine does some caching
to avoid the worst penalties with this, but that in itself also
causes slowdown, so there’s a balance to strike.)
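
To illustrate the scanning, here is a rough sketch that locates the Nth character in UTF-8 encoded octets. It is not how perl's internals are written; byte_offset_of_char is a hypothetical function, and it assumes well-formed UTF-8 and an index that lies within the string.

```perl
use strict;
use warnings;

# Hypothetical helper: byte offset of the character at index $n in
# well-formed UTF-8 octets. The offset cannot be computed directly;
# every earlier character has to be stepped over, since each one may
# be 1 to 4 bytes long.
sub byte_offset_of_char {
    my ($octets, $n) = @_;
    my $pos = 0;
    for (1 .. $n) {
        my $lead = ord substr($octets, $pos, 1);
        $pos += $lead < 0x80 ? 1    # ASCII
              : $lead < 0xE0 ? 2    # lead byte of a 2-byte sequence
              : $lead < 0xF0 ? 3    # lead byte of a 3-byte sequence
              :                4;   # lead byte of a 4-byte sequence
    }
    return $pos;
}

my $octets = "na\xefve caf\xe9";    # 10 characters
utf8::encode($octets);              # now 12 UTF-8 octets

# In a one-byte-per-character string, character 9 sits at byte 9.
# Here it has to be found by walking past the two-byte i-diaeresis.
print byte_offset_of_char($octets, 9), "\n";    # prints 10
```

The loop runs once per preceding character, so the cost of reaching a position grows with the position itself instead of being constant.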

Does that mean that when doing lots of matching, it might be
preferable to use byte strings and byte semantics, not
character strings and character semantics?

Almost all of the time the performance cost is negligible and not
worth sweating at the application code level.

Trying to work on text using byte semantics is a recipe for
massive headaches, and an invitation for bugs. It’s doable if you
are careful and disciplined, absolutely. But why punish yourself?
You gain little, at significant effort.

On 5.12, though, you can get a tiny potential improvement en
passant, with basically zero effort.

In that case – and only in that case: why not? The gain is small,
but so is the cost.

In the other direction, that doesn’t translate. Don’t go micro-
optimising your code for this.

Under older perls, it’s a question of getting the wrong
results in less time and memory, so that’s not an option.

Wrong results? Could you clarify? Thanks :-)

Well, a non-upgraded string gets ASCII-only semantics, e.g.
upper-/lowercasing will ignore accented characters, even those
that fall within the Latin-1 charset.
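
A small sketch of the difference, assuming a script that enables neither the unicode_strings feature nor a locale (the variable names are only for illustration):

```perl
use strict;
use warnings;
# Deliberately no "use feature 'unicode_strings'" and no "use locale".

my $plain    = "\xe9";              # e-acute, one byte per character
my $upgraded = "\xe9";
utf8::upgrade($upgraded);           # same character, UTF8-based format

my $uc_plain    = uc $plain;        # accented character is left alone
my $uc_upgraded = uc $upgraded;     # Unicode rules apply: E-acute

printf "not upgraded: %02X\n", ord $uc_plain;       # prints E9
printf "upgraded:     %02X\n", ord $uc_upgraded;    # prints C9
```

The two strings compare equal with eq before the call to uc, yet they uppercase differently, which is the kind of wrong result meant here.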

Regards,
-- 
Aristotle Pagaltzis // <http://plasmasturm.org/>
