
[Namazu-users-en] Re: Problems with mknmz and Perl 5.8.6

2005-06-13 09:33:38
On June 14, 2005 at 00:46, Tadamasa Teranishi wrote:

> > I believe this is a bad implementation, because it neutralizes all
> > character entity references that namazu does support.

> First of all, limit the input text to an appropriate character set.

You do realize this greatly reduces the usability of namazu.  As
I will reiterate _again_, limiting the input text is not necessarily
easy, especially when indexing email.  If you do not believe me,
I can provide you with megabytes of email data.

In email, it is hard to control what charset you will get.  My
program, MHonArc, does a fairly good job of "normalizing" the data,
but it does (by default) use Unicode character entity references to
capture non-ASCII characters.  Otherwise, I would lose a lot of data
if I were to take your viewpoint on this.
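
For the curious, here is a minimal sketch of that kind of
normalization (an illustration of the idea only, NOT MHonArc's actual
code; the helper name is mine):

use strict;
use warnings;

# Hypothetical helper (not from MHonArc): replace every non-ASCII
# character with a decimal numeric character reference, so the stored
# text is pure ASCII but no information is discarded.
sub to_ascii_with_refs {
    my ($text) = @_;
    $text =~ s/([^\x00-\x7F])/sprintf('&#%d;', ord($1))/ge;
    return $text;
}

print to_ascii_with_refs("caf\x{E9}"), "\n";   # prints "caf&#233;"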

> Please apply the correction yourself if you need it.  However,
> because this correction is not recommended, it is not reflected in
> stable-2-0 either.

I fail to see why it is not recommended.  It is better than what
you provided, which causes the loss of ALL character entity
references, even those in the ASCII range.

My fix is better than what currently exists in 2.0.14, and it at least
gracefully deals with unexpected input, something good software
is supposed to do.

> Trying to support the ISO-8859-* family etc. halfway is a problem.
> Whether 8-bit data is ISO-8859-*, UTF-8, EUC-JP, Shift_JIS, etc.
> cannot be easily judged from the byte values alone.  That causes
> the problem.

I do not care, because I do not want to deal with Japanese text
with the data set in question.  Namazu should NOT be doing
JP processing if the locale is not set to JP.  Therefore, things
like multi-byte and wide characters are irrelevant.
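
To be fair, the ambiguity described above is real: the same bytes can
be legal in several charsets, so any detection is guesswork.  An
illustrative snippet (mine, not namazu code):

use strict;
use warnings;
use Encode qw(decode);

my $bytes = "\xC3\xA9";                     # the same two 8-bit bytes...
print decode('UTF-8',      $bytes), "\n";   # ...are one char (e-acute) in UTF-8
print decode('ISO-8859-1', $bytes), "\n";   # ...but two chars in iso-8859-1

But my point stands: that guesswork should simply not happen when the
locale does not call for it.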

> The problem is avoided by limiting input to 7-bit ASCII characters.
> If 8-bit is permitted, many corrections are needed.
>
> The ISO-8859-* family should not be permitted if the character set
> of the input text cannot be determined.

You must realize the difference between reality and theory.  For
example, MHonArc breaks some MIME conformance, mainly with respect
to character set processing, due to what works practically for users
in the real world versus what is the "standard" thing to do.

But, taking your argument, your decode_numbered_entity() implementation
is bad even when the data is definable.  For example, if the locale is
PL (Polish, iso-8859-2), your implementation will exclude all Polish
characters from getting indexed in HTML data whenever character entity
references are used for them.  Why?  All of the Polish-specific
characters in the iso-8859-2 charset have 8-bit code points.

My implementation of decode_numbered_entity() will not cause such
problems.
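
To make the contrast concrete, the stable-2-0 behavior, as I
understand it, can be modeled in one line (my paraphrase, NOT the
literal namazu code):

# Hypothetical model of the stable-2-0 behavior: every reference lost.
sub decode_numbered_entity_20 ($) { return "?"; }

# In a PL (iso-8859-2) locale, "&#177;" is a-ogonek under namazu's
# charset-local reading of references, yet:
print decode_numbered_entity_20(177), "\n";  # "?" -- the Polish letter is lost
print decode_numbered_entity_20(65),  "\n";  # "?" -- even ASCII "A" is lost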

I still do not understand why you are averse to 8-bit characters, when
it is known that there are users of Namazu who run it on iso-8859-*
data.  Here is my implementation of decode_numbered_entity():

sub decode_numbered_entity ($) {
    my ($num) = @_;
    # Neutralize C0/C1 control codes and anything beyond the 8-bit range.
    return "?"
        if ($num >= 0 && $num <= 31) || ($num >= 127 && $num <= 159) ||
           ($num >= 255);
    # Under a Japanese locale, neutralize all non-ASCII references.
    return "?"
        if $num >= 127 && util::islang('ja');
    sprintf ("%c", $num);
}

It is just like what currently exists, but it also excludes all code
points at 255 and above.  Your implementation excludes ALL code
points (BAD).
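
Concretely, with util::islang('ja') false (a non-JP locale), my
version behaves like this:

print decode_numbered_entity(65),  "\n";  # "A" -- ASCII passes through
print decode_numbered_entity(177), "\n";  # byte 177 -- a-ogonek in iso-8859-2
print decode_numbered_entity(10),  "\n";  # "?" -- control code, neutralized
print decode_numbered_entity(300), "\n";  # "?" -- beyond 8-bit, neutralized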

Now, if there is some valid technical reason for not allowing 8-bit
data in a non-JP locale, then just change the above to "?"-out
all character entity references greater than 127.  However, I
think this will reduce the usability of namazu in many locales,
like those in Europe.
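
If that is the route the developers prefer, the stricter variant is a
one-line change (a sketch, untested):

sub decode_numbered_entity ($) {
    my ($num) = @_;
    # Neutralize control codes and ALL non-ASCII references, in every locale.
    return "?" if ($num >= 0 && $num <= 31) || $num >= 127;
    sprintf ("%c", $num);
}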

If it is the decision of Namazu developers to exclude all character
entity references in data, then I will just have to apply my patch
for each new release until I am able to find a replacement search
engine.

--ewh