[Namazu-users-en] Re: Malformed UTF-8 character

2005-06-14 12:19:45
Earl Hood wrote:

So non-printable characters and some whitespace characters do not
constitute word boundaries?  You realize that characters like tab
(ASCII 9) and form-feed (ASCII 12) are not being treated as word
boundaries.  I think this is a mistake.

It might depend on the case.
Maybe there is no single correct answer.

In the sample, the control codes from 0 to 31 are deleted, so they
do not pass through.

The code you have will combine two words into one.  For example:

  hello	there

Will get filtered to:

  hellothere

Using '?' for the replacement will have:

  hello?there

which, hopefully, will cause mknmz to treat "hello" and "there"
as two separate words.

If you want to do this, it would be neater to convert them into TAB
rather than "?".
Or perhaps converting them into SPACE is better; I do not know.
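
The difference between deleting the control codes and replacing them
can be sketched roughly as below. This is only an illustration in
Python, not the actual mknmz filter code; the function names and the
exact character range (0 to 31, taken from the discussion above) are
assumptions. Note that this range also covers newline and carriage
return, so a real filter may want to treat those separately.

```python
import re

# Control codes 0-31 (assumed range, as mentioned in this thread).
CONTROL_CHARS = re.compile(r"[\x00-\x1f]")


def delete_controls(text: str) -> str:
    """Delete control codes outright; neighbouring words get joined."""
    return CONTROL_CHARS.sub("", text)


def replace_controls(text: str, replacement: str = " ") -> str:
    """Replace each control code so that a word boundary survives."""
    return CONTROL_CHARS.sub(replacement, text)


sample = "hello\tthere"
print(delete_controls(sample))        # hellothere
print(replace_controls(sample, "?"))  # hello?there
print(replace_controls(sample))       # hello there
```

Whether "?" is then treated as a word boundary depends on the indexer,
so SPACE is probably the safer replacement.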
-- 
=====================================================================
TADAMASA TERANISHI
http://www.asahi-net.or.jp/~yw3t-trns/index.htm
Key fingerprint =  474E 4D93 8E97 11F6 662D  8A42 17F5 52F4 10E7 D14E

_______________________________________________
Namazu-users-en mailing list
Namazu-users-en(_at_)namazu(_dot_)org
http://www.namazu.org/cgi-bin/mailman/listinfo/namazu-users-en