
[Namazu-users-en] Re: Problems with mknmz and Perl 5.8.6

2005-06-11 16:07:52
This is a "+from:earl" search.  Notice how the subject links in the
results are clipped.  The first part of the subject text is not
printed.  However, examining NMZ.fields.subject shows that the complete
subjects are present.

After some further analysis (and many hours), I have determined
what triggers the malformed UTF-8 errors.

The problem is in html::decode_numbered_entity, which is invoked in the
regexes used by html::decode_entity.  If it is passed a number >= 160,
the call to sprintf() causes $$contref to get the UTF-8 flag set on
it (even though the locale is set to 'C'), causing Perl to do
subsequent UTF-8 checks.
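
To illustrate the mechanism, here is a minimal sketch, not Namazu's
actual code.  decode_num_sketch and flag_state are hypothetical names I
made up for the example, and I am assuming the decoder boils down to
sprintf("%c", $num).  It only demonstrates the case I am sure of: a code
point above 255 comes back as a UTF-8 flagged string, and appending it
to a plain byte string flags the whole buffer.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the entity decoder (assumption: the real
# routine ends in a sprintf("%c", ...) call like this one).
sub decode_num_sketch {
    my ($num) = @_;
    return sprintf("%c", $num);
}

# Report whether Perl has marked a string as UTF-8 internally.
sub flag_state {
    my ($str) = @_;
    return utf8::is_utf8($str) ? "utf8" : "bytes";
}

my $content = "Subject: ";                 # ordinary byte string
my $before  = flag_state($content);
$content   .= decode_num_sketch(12395);    # as if decoding &#12395;
my $after   = flag_state($content);

# Only the flag states are printed, so no wide characters reach STDOUT.
print "before: $before, after: $after\n";
```

Once the buffer is flagged, every later operation on it works in
characters rather than octets, which is where the extra UTF-8 checks
come from.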

For example, the input data contains strings like:

  
について助けてください!

When I add the following to html::decode_numbered_entity:

      return ""
          if $num >= 255;

all problems go away.  Search results do not clip subjects anymore,
and searching for "PHP" provides hits.

I believe the subject clipping occurs due to length offsets being wrong.
I think the offset written by mknmz does not take into account the
UTF-8 encoding of the text written (due to the UTF-8 flag getting set).
I.e., when computing the "size" it is actually getting the number
of _characters_ in the data and not the number of _octets_ that are
actually written.  This could cause the funny "clipping" of subjects,
or the wrong subjects being listed in search results.
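
A small sketch of the character/octet mismatch I suspect, using a
made-up subject string.  char_and_octet_counts is a hypothetical helper,
not something in mknmz, and I am assuming the text ends up on disk as
UTF-8 octets.

```perl
use strict;
use warnings;
use Encode qw(encode_utf8);

# Hypothetical helper: return (characters, octets) for a decoded string.
sub char_and_octet_counts {
    my ($str) = @_;
    my $chars  = length($str);                 # counts characters
    my $octets = length(encode_utf8($str));    # counts bytes as UTF-8
    return ($chars, $octets);
}

# "Re: " followed by four Japanese characters, written as escapes so
# the example does not depend on the encoding of the source file.
my $subject = "Re: \x{306B}\x{3064}\x{3044}\x{3066}";
my ($chars, $octets) = char_and_octet_counts($subject);

print "characters: $chars, octets: $octets\n";
# prints "characters: 8, octets: 16"
```

If an offset table is built from the first number while the file grows
by the second, every entry after the first non-ASCII subject would
point too early, which would match the clipping I am seeing.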

Now, the proper fix for this must be determined.  Is
the above hack sufficient, or is something more robust needed?

--ewh