namazu-users-en

[Namazu-users-en] Re: Malformed UTF-8 character

2005-06-19 20:43:31
Thank you for your description about the issue.

At Fri, 17 Jun 2005 12:04:21 -0500,
Earl Hood wrote:
Wrt mknmz, with a locale of "C" or "en_US", by default, the strings are
_not_ utf-8.  Even the mknmz code invokes binmode() on filehandles to
prevent Perl from applying any character encoding semantics (Perl 5.8.x
supports character encoding/decoding on file handles similar to Java).

binmode() was originally added for Win32; I didn't know it had such a side effect.

The problem trigger is in decode_numbered_entity() in html.pl and
the statement:

  sprintf("%c",$num);

If $num is > 255, Perl ends up creating a utf-8 sequence (because
of the "%c" format), which causes the string containing the decoded
entity to get its utf-8 flag set (regardless of the current locale setting).
Subsequently, any character-based operations (like regexes or file
writes) cause Perl to generate warnings.  It also causes mis-behavior
and probably corruption in Namazu.
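
The effect described above can be reproduced outside Namazu. The following is only a minimal sketch, not code from mknmz or html.pl; the sample code points (233 and 8364) and variable names are chosen for illustration, and utf8::is_utf8() is used purely to inspect Perl's internal flag.

```perl
use strict;
use warnings;

# A code point up to 255 fits in one byte, so the result stays a byte string.
my $low = sprintf("%c", 233);

# A code point above 255 cannot fit in one byte, so Perl stores the
# result in its internal UTF-8 form and sets the utf-8 flag.
my $high = sprintf("%c", 8364);

printf "233:  utf-8 flag %s\n", utf8::is_utf8($low)  ? "on" : "off";
printf "8364: utf-8 flag %s\n", utf8::is_utf8($high) ? "on" : "off";

# The flag spreads: concatenating a flagged string with a plain one
# yields a flagged string, whatever the locale is.
my $text = "price: " . $high;
printf "joined: utf-8 flag %s\n", utf8::is_utf8($text) ? "on" : "off";
```

Once a flagged string like $text reaches a filehandle that has no encoding layer, Perl may emit "Wide character" warnings, which is consistent with the warnings mentioned above.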

Therefore, my initial fix was to drop any $num >= 255.  This would
preserve the 8-bit agnostic behavior of namazu.
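
A fix of this kind might look roughly like the sketch below. It is not the actual patch or the real decode_numbered_entity() from html.pl: the function name decode_numbered_entity_sketch and the regex are hypothetical, and the cutoff here (drop anything above 255) is an assumption, since the exact boundary used in the real change may differ.

```perl
use strict;
use warnings;

# Hypothetical helper for illustration; not the code from html.pl.
sub decode_numbered_entity_sketch {
    my ($text) = @_;
    # Decode decimal references such as &#233; into the single byte they
    # name. References above 255 are dropped, so the string is never
    # upgraded to Perl's internal UTF-8 form.
    $text =~ s/&#(\d+);/$1 <= 255 ? chr($1) : ""/ge;
    return $text;
}

my $out = decode_numbered_entity_sketch("caf&#233; costs 3&#8364;");
printf "utf-8 flag %s, length %d\n",
    utf8::is_utf8($out) ? "on" : "off", length($out);
```

The trade-off is that characters outside the 8-bit range are silently lost from the indexed text, in exchange for keeping every string a plain byte string.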

Hmm, that seems sufficient to me. I want to apply it to the stable
branch and HEAD.

Do you have any objection to that, Teranishi-san?
-- 
NOKUBI Takatsugu
E-mail: knok(_at_)daionet(_dot_)gr(_dot_)jp
        knok(_at_)namazu(_dot_)org / knok(_at_)debian(_dot_)org
_______________________________________________
Namazu-users-en mailing list
Namazu-users-en(_at_)namazu(_dot_)org
http://www.namazu.org/cgi-bin/mailman/listinfo/namazu-users-en