Re: Malformed UTF-8 character

(cc'ed to namazu-users-en)

On June 14, 2005 at 17:40, Marie-Noelle Dauphin wrote:

I have a problem when files are indexed by /usr/bin/mknmz.
It's a known problem on this list, mharc-users(_at_)mhonarc(_dot_)org:

Malformed UTF-8 character (unexpected continuation byte 0xb8, with no 
prece.............


I saw on this list that I had to change $LANG to C before running the mharc
scripts ...
I did this, but nothing changed; I still get the same error.


This problem has recently been discussed on the namazu-users-en
list.  See thread starting at
<http://www.mhonarc.org/archive/cgi-bin/mesg.cgi?a=namazu-users-en&i=200506110226.j5B2QsG17211%40gator.earlhood.com>.

I posted a recommended fix to the namazu list, but it appears my
fix may not be adopted, or a lesser, inferior fix will be done.  That
is my understanding, at least; I think language has been a barrier in
my communication with a namazu developer.

I have been told that namazu is not designed for 8-bit charsets.
I find this odd since it is known that there are users of Namazu in
locales with 8-bit sets (e.g. DE/German and PL/Polish).

The problem is in the HTML filter of mknmz, with character
entity references (e.g. &#x306B;) whose values are greater than 255.
The file is html.pl in the filters directory (usually
/usr/local/share/namazu/filter).  The problem routine is
decode_numbered_entity().  The sprintf() will cause Perl to internally
auto-tag strings as UTF-8 text in these cases.
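
The auto-tagging can be observed outside of namazu.  The following is
a minimal sketch, not code from html.pl: it only assumes that the
filter builds characters with sprintf("%c", ...), and it uses the core
utf8::is_utf8() function to inspect Perl's internal flag.  The
variable names are made up for illustration.

```perl
use strict;
use warnings;

# Build one-character strings with sprintf("%c"), as the filter does.
my $latin1 = sprintf("%c", 0xE9);    # fits in a single byte
my $kana   = sprintf("%c", 0x306B);  # the &#x306B; example, over 255

# utf8::is_utf8() reports whether Perl has flagged a string as UTF-8.
my $latin1_flagged = utf8::is_utf8($latin1) ? 1 : 0;
my $kana_flagged   = utf8::is_utf8($kana)   ? 1 : 0;

# Appending a flagged string to plain text flags the whole result.
my $mixed         = "abc" . $kana;
my $mixed_flagged = utf8::is_utf8($mixed) ? 1 : 0;

print "0xE9 flagged:   $latin1_flagged\n";   # 0
print "0x306B flagged: $kana_flagged\n";     # 1
print "mixed flagged:  $mixed_flagged\n";    # 1
```

The last case is the important one: a single decoded reference over
255 is enough to change how Perl treats the rest of the text it is
joined with.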

My initial recommendation is to replace the routine with the following
version:

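sub decode_numbered_entity ($) {
    my ($num) = @_;
    # Replace control characters and anything that does not fit in one byte.
    return "?"
        if ($num >= 0 && $num <= 31) || ($num >= 127 && $num <= 159) ||
           ($num > 255);
    # For Japanese, also replace all 8-bit characters.
    return "?"
        if $num >= 127 && util::islang('ja');
    sprintf ("%c",$num);
}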

My fix basically "zeros" out entity references for chars over 255,
preventing Perl from internally auto-tagging strings as UTF-8 (which
leads to the "Malformed ..." warnings and, if you have not noticed yet,
some serious behavioral problems), regardless of what the locale is
set to.
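
To try the routine outside of mknmz, something like the following
sketch could be used.  It makes two assumptions that are not part of
namazu: util::islang() is replaced by a stub that always answers "not
Japanese", and decode_refs() is a hypothetical driver of my own that
finds decimal and hex references with a regex (the real filter may
call the routine differently).

```perl
use strict;
use warnings;

# Hypothetical stand-in for namazu's util::islang(); assumed to report
# that the current language is not Japanese.
{ package util; sub islang { return 0 } }

# The recommended routine, written with "> 255" so that only values
# over 255 are replaced.
sub decode_numbered_entity ($) {
    my ($num) = @_;
    return "?"
        if ($num >= 0 && $num <= 31) || ($num >= 127 && $num <= 159) ||
           ($num > 255);
    return "?"
        if $num >= 127 && util::islang('ja');
    sprintf("%c", $num);
}

# Hypothetical driver: decode hex and decimal references in a string.
sub decode_refs {
    my ($text) = @_;
    $text =~ s/&#x([0-9A-Fa-f]+);/decode_numbered_entity(hex($1))/ge;
    $text =~ s/&#(\d+);/decode_numbered_entity($1)/ge;
    return $text;
}

# &#233; is kept as one 8-bit byte; &#x306B; and &#7; become "?".
my $decoded = decode_refs("caf&#233; &#x306B; &#7;");
my $flagged = utf8::is_utf8($decoded) ? 1 : 0;
printf "length=%d flagged=%d\n", length($decoded), $flagged;
```

With the stub in place this should print "length=8 flagged=0": the
8-bit character survives, and the result is not tagged as UTF-8.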

I have yet to see adverse effects from this fix, but I have been
told that 8-bit characters are a problem for namazu.  I have NOT been
told exactly why or what the actual problems are: Do they cause
data corruption?  Will searching not work properly?

If 8-bit chars are a problem, you could use the following version of
the routine:

sub decode_numbered_entity ($) {
    my ($num) = @_;
    # Keep only printable 7-bit ASCII (32 through 126).
    return "?"
        if ($num >= 0 && $num <= 31) || ($num >= 127);
    sprintf ("%c",$num);
}

This "zeros" out any HTML character entity references of 127 and above.

It is interesting that the original version of the routine excluded
8-bit character entity references only for the JA locale.
So if 8-bit chars are not desirable, why did decode_numbered_entity()
allow them initially?

It is also worth noting that the above only addresses HTML character
entity references (the underlying culprit causing the "Malformed
UTF-8..." messages).  Raw character data could be in 8-bit (but this
should NOT trigger the malformed UTF-8 message).  I would like to
hear from users who use namazu on data encoded with 8-bit charsets
(e.g. ISO-8859-*) about whether they are able to perform searches on
such data.

--ewh
