namazu-users-en

[Namazu-users-en] Re: Malformed UTF-8 character

2005-06-14 12:04:11
On June 15, 2005 at 03:00, Tadamasa Teranishi wrote:

The only open question is whether values of 127 or more should be converted to "?".

sub decode_numbered_entity ($) {
    my ($num) = @_;
    return ""
        if $num >= 0 && $num <= 31;
    return "?"
        if $num >= 127;
    sprintf ("%c",$num);
}

So non-printable characters and some whitespace characters do not
constitute word boundaries?  You realize that characters like tab
(ASCII 9) and form-feed (ASCII 12) are not being treated as word
boundaries.  I think this is a mistake.

The code you have will combine two words into one.  For example:

  hello&#9;there

Will get filtered to:

  hellothere

Using '?' as the replacement would instead give:

  hello?there

which, hopefully, will cause mknmz to treat "hello" and "there"
as two separate words.
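
To make the difference concrete, here is a minimal sketch comparing the two
behaviours. The names decode_dropping, decode_replacing, and filter_entities
are hypothetical and are not part of Namazu; the first sub only mirrors the
code quoted above, and the regex-based entity substitution is an assumption
about how the decoder gets called.

```perl
use strict;
use warnings;

# Mirrors the quoted code: control characters (0-31) are dropped.
sub decode_dropping {
    my ($num) = @_;
    return ""  if $num >= 0 && $num <= 31;
    return "?" if $num >= 127;
    return sprintf("%c", $num);
}

# Hypothetical alternative: control characters become "?" as well,
# so the words on either side stay separated.
sub decode_replacing {
    my ($num) = @_;
    return "?" if ($num >= 0 && $num <= 31) || $num >= 127;
    return sprintf("%c", $num);
}

# Hypothetical helper: run a decoder over every decimal entity in a string.
sub filter_entities {
    my ($text, $decoder) = @_;
    $text =~ s/&#(\d+);/$decoder->($1)/ge;
    return $text;
}

print filter_entities("hello&#9;there", \&decode_dropping),  "\n";  # hellothere
print filter_entities("hello&#9;there", \&decode_replacing), "\n";  # hello?there
```

Returning a space instead of "?" would be another option; either way the
tab no longer vanishes.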


Problems may occur if the input text is not restricted.
As for 8-bit characters, Namazu's processing is written on the
assumption that they are EUC-JP.

If I understand you correctly, Namazu uses EUC-JP internally, even
if the locale is not JP.  Am I correct?

If so, EUC-JP has code point equivalents for many characters in the
ISO-8859-* charsets.  Examining the ucm file for euc-jp, I see encodings
for Greek, Cyrillic, and Latin characters.
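
As a quick check of this, the sketch below uses Perl's Encode module to show
the EUC-JP bytes for a Greek and a Cyrillic letter. The helper name eucjp_hex
is hypothetical, and the sketch assumes Encode's euc-jp table; the conversion
Namazu itself performs may differ.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: hex dump of one character's EUC-JP encoding.
# FB_CROAK makes encode() die if the character has no EUC-JP mapping.
sub eucjp_hex {
    my ($char) = @_;
    my $bytes = encode("euc-jp", $char, Encode::FB_CROAK);
    return join " ", map { sprintf("%02X", ord($_)) } split //, $bytes;
}

printf "Greek small alpha (U+03B1): %s\n", eucjp_hex("\x{03B1}");  # A6 C1
printf "Cyrillic capital A (U+0410): %s\n", eucjp_hex("\x{0410}"); # A7 A1
```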

--ewh
_______________________________________________
Namazu-users-en mailing list
Namazu-users-en(_at_)namazu(_dot_)org
http://www.namazu.org/cgi-bin/mailman/listinfo/namazu-users-en