[Namazu-users-en] Re: Problems with mknmz and Perl 5.8.6

2005-06-13 07:54:42
On June 13, 2005 at 06:09, Tadamasa Teranishi wrote:

> Even if it is ASCII character, it is not good according to the 
> character entity references. 

This is what I discovered.

> Namazu corresponds to a pure ASCII-only text alone without the 
> character entity references. 
>
> Please use it by pure ASCII text-only.

I understand this.  What I am trying to say is that this is not
necessarily an easy task for users.

Namazu should gracefully handle code points beyond what it supports.
Otherwise, users will get incorrect behaviour and not understand why.
Requiring every user to pre-filter their data pushes onto them a job
that Namazu itself should do.

> The decode_numbered_entity subroutine of filter/html.pl is rewritten 
> as follows. 
>
> sub decode_numbered_entity ($) {
>     my ($num) = @_;
>     return "?";
> }

I believe this is a bad implementation, because it neutralizes all
character entity references that namazu does support.

I recommend the following:

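sub decode_numbered_entity ($) {
    my ($num) = @_;
    # Replace control characters (0-31, 127-159) and anything that
    # does not fit in 8 bits with a placeholder.
    return "?"
        if ($num >= 0 && $num <= 31) || ($num >= 127 && $num <= 159) ||
           ($num > 255);
    # For Japanese, keep only 7-bit ASCII.
    return "?"
        if $num >= 127 && util::islang('ja');
    # What remains is in the range 32-255.
    return sprintf("%c", $num);
}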

This allows 8-bit character entity references, which is needed
for 8-bit character sets (e.g. ISO-8859-* family).
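
As a rough, self-contained sketch of how the routine above behaves
outside Namazu (this is not Namazu's actual code): islang() here is a
hypothetical stand-in for util::islang, $LANG and decode_text() are
made up for illustration, and the upper bound is written as > 255 on
the assumption that code point 255 is meant to pass through.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical stand-in for Namazu's util::islang(); the language is
# simply held in a variable so the sketch runs on its own.
our $LANG = 'en';
sub islang { my ($l) = @_; return $LANG eq $l; }

# Same logic as the recommended routine. The "> 255" bound is an
# assumption about the intent (keep code point 255).
sub decode_numbered_entity {
    my ($num) = @_;
    return "?"
        if ($num >= 0 && $num <= 31) || ($num >= 127 && $num <= 159) ||
           ($num > 255);
    return "?"
        if $num >= 127 && islang('ja');
    return sprintf("%c", $num);
}

# Hypothetical helper: replace decimal references such as &#233;
# wherever they occur in a string.
sub decode_text {
    my ($text) = @_;
    $text =~ s/&#(\d+);/decode_numbered_entity($1)/ge;
    return $text;
}

# &#233; is kept as one 8-bit character, &#8364; becomes "?",
# and &#65; becomes "A".
print decode_text("caf&#233; &#8364;5 &#65;\n");
```

With $LANG set to 'ja', the same &#233; reference would come back as
"?" instead, since only 7-bit values are kept in that case.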

The above version also avoids the problem of Perl automatically
setting the utf-8 flag on text, since it never creates a character
with a code point above 255.

If you are not familiar with how Perl handles Unicode, see the
perlunicode and related manual pages.  Namazu needs to be coded
to avoid causing Perl (v5.8.x and later) to set the utf-8 flag
on strings.  Just setting the LC_ALL=C environment variable
is NOT enough.
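
To see the flag behaviour for yourself, here is a minimal sketch
(it assumes Perl 5.8.1 or later, where utf8::is_utf8 is available;
the variable names are only for illustration). It shows that a code
point below 256 leaves the flag off, that one above 255 turns it on,
and that the flag then spreads to anything concatenated with it.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# A code point below 256 is stored as a plain 8-bit string.
my $latin1 = sprintf("%c", 233);

# A code point above 255 forces Perl to store the string as UTF-8
# internally and to set the utf-8 flag on it.
my $wide = sprintf("%c", 8364);

printf "233:   flag %s\n", utf8::is_utf8($latin1) ? "on" : "off";
printf "8364:  flag %s\n", utf8::is_utf8($wide)   ? "on" : "off";

# Concatenating a flagged string with an unflagged one yields a
# flagged result, so one wide character affects the whole text.
my $mixed = "caf" . $latin1 . $wide;
printf "mixed: flag %s\n", utf8::is_utf8($mixed) ? "on" : "off";
```

None of this depends on the locale, which is why LC_ALL=C does not
help here.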

--ewh

_______________________________________________
Namazu-users-en mailing list
Namazu-users-en(_at_)namazu(_dot_)org
http://www.namazu.org/cgi-bin/mailman/listinfo/namazu-users-en