namazu-users-en

[Namazu-users-en] Re: Malformed UTF-8 character

2005-06-14 11:00:28
Earl Hood wrote:

I have been told that namazu is not designed for 8-bit charsets.
I find this odd since it is known that there are users of Namazu in
locales with 8-bit sets (e.g. DE/German and PL/Polish).

To repeat myself: Namazu is not designed for 8-bit charsets, and it
has not been tested with them. That some people use it with 8-bit
charsets does not change this.

Volunteers have contributed message translations for Namazu in 8-bit
charsets. However, message translation and text processing are
separate matters.

If 8-bit chars are a problem, you could use the following version of
the routine:

sub decode_numbered_entity ($) {
    my ($num) = @_;
    return "?"
        if ($num >= 0 && $num <= 31) || ($num >= 127 && $num <= 159) ||
           ($num >= 127);
    sprintf ("%c",$num);
}

It does not matter much, but the condition can be shortened as follows.

sub decode_numbered_entity ($) {
    my ($num) = @_;
    return "?"
        if $num >= 0 && $num <= 31 || $num >= 127;
    sprintf ("%c",$num);
}

Or should only values of 127 or more be turned into "?", as follows?

sub decode_numbered_entity ($) {
    my ($num) = @_;
    return ""
        if $num >= 0 && $num <= 31;
    return "?"
        if $num >= 127;
    sprintf ("%c",$num);
}
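
To compare the three variants above side by side, here is a small
sketch. The names variant_long, variant_short and variant_split are
made up for this illustration and are not part of Namazu; the bodies
follow the routines as posted, without the ($) prototype.

```perl
use strict;
use warnings;

# Variant 1: the condition as first posted, including the range test
# (127..159) that is already covered by ($num >= 127).
sub variant_long {
    my ($num) = @_;
    return "?"
        if ($num >= 0 && $num <= 31) || ($num >= 127 && $num <= 159) ||
           ($num >= 127);
    return sprintf("%c", $num);
}

# Variant 2: the shortened condition.
sub variant_short {
    my ($num) = @_;
    return "?" if $num >= 0 && $num <= 31 || $num >= 127;
    return sprintf("%c", $num);
}

# Variant 3: control characters are dropped, 127 and above become "?".
sub variant_split {
    my ($num) = @_;
    return ""  if $num >= 0 && $num <= 31;
    return "?" if $num >= 127;
    return sprintf("%c", $num);
}

# Count the code points in 0..255 on which variants 1 and 2 disagree.
my $mismatches = grep { variant_long($_) ne variant_short($_) } 0 .. 255;
print "variants 1 and 2 disagree on $mismatches code points\n";

# Show how each variant treats a tab (9), "A" (65), DEL (127) and 233.
for my $num (9, 65, 127, 233) {
    printf "%3d: [%s] [%s] [%s]\n", $num,
        variant_long($num), variant_short($num), variant_split($num);
}
```

Variants 1 and 2 should agree on every code point, so the only
behavioral choice is whether control characters become "?" or are
dropped entirely, as in variant 3.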

It is interesting that the original version of the routine did not
exclude 8-bit character entity references, only for the locale of JA.
So if 8-bit chars are not desirable, why did decode_numbered_entity()
allow it initially?

There is a problem in the program.
decode_numbered_entity() was certainly implemented with 8-bit
charsets in mind.
However, Namazu as a whole has not been designed for 8-bit charsets.

See tips.html:
http://www.namazu.org/doc/tips.html.en#html

What is written in tips.html is the basic design, even though the
implementation differs from what is written there.

Problems may occur if the input text is not restricted.
Namazu's text processing is written on the assumption that any 8-bit
character is EUC-JP.
Japanese processing should be done only in a Japanese environment,
but parts of the source where this is not the case may still remain.
(It is already known that text passes through chomp_eucjp() even
outside a Japanese environment, which is not intended. This is
corrected in Namazu 2.0.15.)
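
As an illustration of why the EUC-JP assumption matters for other
8-bit charsets, here is a minimal sketch. The helper
looks_like_eucjp() is hypothetical and is not Namazu code; it only
checks whether a byte string is well-formed as EUC-JP. Some
ISO-8859-1 text happens to pass such a check, so two Latin-1 letters
can be taken for one Japanese character.

```perl
use strict;
use warnings;

# Hypothetical helper (not from Namazu): returns 1 if the byte string
# consists only of ASCII bytes and well-formed EUC-JP sequences.
sub looks_like_eucjp {
    my ($bytes) = @_;
    return $bytes =~ m{\A(?:
          [\x00-\x7F]                   # ASCII
        | [\xA1-\xFE][\xA1-\xFE]        # two-byte JIS X 0208 character
        | \x8E[\xA1-\xDF]               # half-width katakana
        | \x8F[\xA1-\xFE][\xA1-\xFE]    # three-byte JIS X 0212 character
    )*\z}x ? 1 : 0;
}

# ISO-8859-1 bytes of the German word "Gruesse" written with u-umlaut
# (0xFC) and sharp s (0xDF): the two bytes form a valid EUC-JP pair.
my $latin1_pair = "Gr\xFC\xDFe";

# ISO-8859-1 bytes of "fuer" written with u-umlaut: here 0xFC is
# followed by an ASCII byte, which is not a valid EUC-JP sequence.
my $latin1_single = "f\xFCr";

print looks_like_eucjp($latin1_pair), "\n";    # prints 1
print looks_like_eucjp($latin1_single), "\n";  # prints 0
```

Code that assumes EUC-JP would treat the first string as containing
one two-byte character and the second as malformed, although both
are ordinary Latin-1 text.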
-- 
=====================================================================
TADAMASA TERANISHI
http://www.asahi-net.or.jp/~yw3t-trns/index.htm
Key fingerprint =  474E 4D93 8E97 11F6 662D  8A42 17F5 52F4 10E7 D14E
