[Namazu-users-en] Re: Malformed UTF-8 character

On June 15, 2005 at 03:00, Tadamasa Teranishi wrote:

The only open question is whether values of 127 or more should be turned into "?".

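# Decode a numeric character reference, given its number.
sub decode_numbered_entity ($) {
    my ($num) = @_;
    # Control characters (0-31) are dropped entirely.
    return ""
        if $num >= 0 && $num <= 31;
    # DEL (127) and everything above it becomes "?".
    return "?"
        if $num >= 127;
    return sprintf("%c", $num);
}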


So non-printable characters and some whitespace characters do not
constitute word boundaries?  You realize that characters like tab
(ASCII 9) and form-feed (ASCII 12) are not being treated as word
boundaries.  I think this is a mistake.

The code you have will combine two words into one.  For example:

  hello&#9;there

Will get filtered to:

  hellothere

Using '?' as the replacement would give:

  hello?there

which, hopefully, will cause mknmz to treat "hello" and "there"
as two separate words.
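
As a minimal sketch of the difference, the following compares the quoted
behavior with a variant that also maps control codes to '?'.  The names
decode_quoted, decode_with_placeholder, and filter_entities are made up
for illustration, and the substitution regex is an assumption about how
the decoder gets applied; it is not taken from the Namazu filter code.

```perl
use strict;
use warnings;

# Same behavior as the quoted sub: codes 0-31 vanish, 127 and up become "?".
sub decode_quoted {
    my ($num) = @_;
    return ""  if $num >= 0 && $num <= 31;
    return "?" if $num >= 127;
    return sprintf("%c", $num);
}

# Hypothetical variant: control codes also become "?", so the two
# neighbouring words stay separated by a placeholder character.
sub decode_with_placeholder {
    my ($num) = @_;
    return "?" if ($num >= 0 && $num <= 31) || $num >= 127;
    return sprintf("%c", $num);
}

# Assumed helper: replace every decimal entity using the given decoder.
sub filter_entities {
    my ($text, $decoder) = @_;
    $text =~ s/&#(\d+);/$decoder->($1)/ge;
    return $text;
}

print filter_entities("hello&#9;there", \&decode_quoted), "\n";           # hellothere
print filter_entities("hello&#9;there", \&decode_with_placeholder), "\n"; # hello?there
```

Whether mknmz actually splits on '?' is not verified here; the sketch
only shows what each decoder emits.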

This could cause problems if the input text is not restricted.
As for 8-bit characters, the Namazu processing code is written on
the assumption that they are EUC-JP.


If I understand you correctly, Namazu uses EUC-JP internally, even
if the locale is not JP.  Am I correct?

If so, EUC-JP has code point equivalents for many characters in the
ISO-8859-* charsets.  Examining the ucm file for euc-jp, I see encodings
for Greek, Cyrillic, and Latin characters.
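
One way to spot-check this is with Perl's Encode module.  The sketch
below is only an illustration: roundtrips_euc_jp is a made-up helper,
the three sample characters are arbitrary picks, and the result reflects
the "euc-jp" mapping shipped with the installed Encode, not anything
Namazu does internally.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Returns 1 if $char survives a round trip through Encode's "euc-jp"
# mapping, 0 if it cannot be encoded (FB_CROAK makes encode die then).
sub roundtrips_euc_jp {
    my ($char) = @_;
    my $ok = eval {
        my $bytes = encode("euc-jp", $char,
                           Encode::FB_CROAK() | Encode::LEAVE_SRC());
        decode("euc-jp", $bytes, Encode::FB_CROAK()) eq $char;
    };
    return $ok ? 1 : 0;
}

my %samples = (
    "Greek alpha (U+03B1)"   => "\x{03B1}",
    "Cyrillic a (U+0430)"    => "\x{0430}",
    "Latin e-acute (U+00E9)" => "\x{00E9}",
);

for my $name (sort keys %samples) {
    printf "%-24s %s\n", $name,
        roundtrips_euc_jp($samples{$name}) ? "mapped" : "not mapped";
}
```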

--ewh
_______________________________________________
Namazu-users-en mailing list
Namazu-users-en(_at_)namazu(_dot_)org
http://www.namazu.org/cgi-bin/mailman/listinfo/namazu-users-en
