Hi,
I'm having a few serious problems with namazu. I'm using Linux (glibc 2.3.3, perl
5.008004) and:
- with vanilla namazu there are a bunch of
Malformed UTF-8 character (unexpected continuation byte
XYZ, with no preceding start byte) messages.
There are other reports about this, too:
http://www.mhonarc.org/archive/html/mharc-users/2004-05/msg00008.html
A patch from Fedora fixed these
(http://cvs.pld-linux.org/cgi-bin/cvsweb/SOURCES/namazu-fixinutf8.patch?rev=1.1)
- unfortunately that's not all; even with the patch above I'm getting:
65/74 - /var/spool/mailman/archives/public/feedback/2002-July/003643.html
[text/html; x-type=pipermail]
66/74 - /var/spool/mailman/archives/public/feedback/2002-July/003644.html
[text/html; x-type=pipermail]
Wide character in print at /usr/bin/mknmz line 710, <GEN3> line 66.
67/74 - /var/spool/mailman/archives/public/feedback/2002-July/003645.html
[text/html; x-type=pipermail]
Wide character in print at /usr/bin/mknmz line 710, <GEN3> line 67.
68/74 - /var/spool/mailman/archives/public/feedback/2002-July/003646.html
[text/html; x-type=pipermail]
Wide character in print at /usr/bin/mknmz line 710, <GEN3> line 68.
69/74 - /var/spool/mailman/archives/public/feedback/2002-July/003647.html
[text/html; x-type=pipermail]
70/74 - /var/spool/mailman/archives/public/feedback/2002-July/author.html is
Pipermail's index file! skipped.
70/73 - /var/spool/mailman/archives/public/feedback/2002-July/date.html is
Pipermail's index file! skipped.
70/72 - /var/spool/mailman/archives/public/feedback/2002-July/index.html is
Pipermail's index file! skipped.
70/71 - /var/spool/mailman/archives/public/feedback/2002-July/subject.html is
Pipermail's index file! skipped.
70/70 - /var/spool/mailman/archives/public/feedback/2002-July/thread.html is
Pipermail's index file! skipped.
Wide character in print at /usr/bin/mknmz line 2475.
Wide character in print at /usr/bin/mknmz line 2475.
Wide character in print at /usr/bin/mknmz line 2475.
Wide character in print at /usr/bin/mknmz line 2475.
Wide character in print at /usr/bin/mknmz line 2475.
Wide character in print at /usr/bin/mknmz line 2475.
(tons of these)
I can work around these by placing "use bytes;" before and "no bytes;" after the
print on line 2475 (and on the other lines where this occurs).
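
For anyone hitting the same warning, the workaround can be sketched in
isolation. This is only a minimal illustration, not the actual mknmz code:
print_raw is a hypothetical helper, the in-memory filehandle stands in for
whatever mknmz writes to at line 2475, and the sample string is made up.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not part of mknmz): print a string as raw bytes,
# the same way the "use bytes" / "no bytes" workaround does.
sub print_raw {
    my ($fh, $text) = @_;
    use bytes;            # lexically scoped: affects only this sub body
    print {$fh} $text;    # writes Perl's internal bytes without the warning
    no bytes;
    return;
}

# A made-up string containing a character above 0xFF (U+0142).
my $wide = "Wroc\x{142}aw\n";

# Collect warnings so both code paths can be compared.
my @warnings;
local $SIG{__WARN__} = sub { push @warnings, $_[0] };

# Write to an in-memory handle instead of a real index file.
my $buffer = '';
open(my $fh, '>', \$buffer) or die "cannot open in-memory handle: $!";

print {$fh} $wide;        # plain print: raises "Wide character in print"
my $plain_warnings = scalar @warnings;

print_raw($fh, $wide);    # workaround: should add no further warning
my $raw_warnings = scalar(@warnings) - $plain_warnings;
close($fh);

print "plain print warnings: $plain_warnings\n";
print "use bytes warnings:   $raw_warnings\n";
```

The bytes pragma is lexically scoped, so it only needs to wrap the print
itself. It silences the warning by writing Perl's internal representation
as-is; whether those bytes are in the encoding namazu expects (mknmz -C
reports euc as the coding system) is a separate question that this sketch
does not answer.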
Finally, once I have all the indexes built:
[Base]
Date: Sat Jun 26 23:37:08 2004
Added Documents: 3,730
Size (bytes): 11,472,325
Total Documents: 3,730
Added Keywords: 84,273
Total Keywords: 84,273
Wakati: module_kakasi -ieuc -oeuc -w
Time (sec): 77
File/Sec: 48.44
System: linux
Perl: 5.008004
Namazu: 2.0.13
namazu doesn't find anything in them:
[root(_at_)anduril /root]# namazu pld
Results:
References: [ pld: 0 ]
No document matching your query.
[root(_at_)anduril /root]# namazu linux
Results:
References: [ linux: 0 ]
No document matching your query.
[root(_at_)anduril /root]# namazu -C
Loaded rcfile: /etc/namazu/namazurc
--
Index: /var/lib/namazu/index
Logging: off
Lang: C
Scoring: tfidf
Template: /var/lib/namazu/index
MaxHit: 10000
MaxMatch: 1000
EmphasisTags: <strong class="keyword"> </strong>
Replace: /var/spool/mailman/archives/private/
http://lists.pld-linux.org/pipermail/
[root(_at_)anduril /root]# mknmz -C
Loaded rcfile: /etc/namazu/mknmzrc
System: linux
Namazu: 2.0.13
Perl: 5.008004
File-MMagic: 1.22
NKF: module_nkf
KAKASI: module_kakasi -ieuc -oeuc -w
ChaSen: module_chasen -j -F '%m '
Wakati: module_kakasi -ieuc -oeuc -w
Lang_Msg: C
Lang: C
Coding System: euc
CONFDIR: /etc/namazu
LIBDIR: /usr/share/namazu/pl
FILTERDIR: /usr/share/namazu/filter
TEMPLATEDIR: /usr/share/namazu/template
Supported media types: (18)
Unsupported media types: (16) marked with minus (-) probably missing
application in your $path.
- application/excel: excel.pl
application/ichitaro5: taro56.pl
application/ichitaro6: taro56.pl
- application/ichitaro7: taro7_10.pl
application/macbinary: macbinary.pl
- application/msword: msword.pl
- application/pdf: pdf.pl
- application/postscript: postscript.pl
- application/powerpoint: powerpoint.pl
- application/rtf: rtf.pl
- application/vnd.sun.xml.calc: ooo.pl
- application/vnd.sun.xml.draw: ooo.pl
- application/vnd.sun.xml.impress: ooo.pl
- application/vnd.sun.xml.writer: ooo.pl
application/x-apache-cache: apachecache.pl
application/x-bzip2: bzip2.pl
application/x-compress: compress.pl
- application/x-deb: deb.pl
- application/x-dvi: dvi.pl
application/x-gzip: gzip.pl
- application/x-js-taro: taro7_10.pl
application/x-rpm: rpm.pl
- application/x-tex: tex.pl
- audio/mpeg: mp3.pl
message/news: mailnews.pl
message/rfc822: mailnews.pl
text/hnf: hnf.pl
text/html: html.pl
text/html; x-type=mhonarc: mhonarc.pl
text/html; x-type=pipermail: pipermail.pl
text/plain
text/plain; x-type=rfc: rfc.pl
text/x-hdml: hdml.pl
text/x-roff: man.pl
I'm also using the pipermail.pl filter from http://mm.tkikuchi.net/pipermail.pl.
It has the same license as the filters in the namazu tarball, so it would be
nice if it also made it into the official namazu tarball.
There is also a patch that fixes German support:
http://cvs.pld-linux.org/cgi-bin/cvsweb/SOURCES/namazu-de.patch?rev=1.1
--
Arkadiusz Miśkiewicz CS at FoE, Wroclaw University of Technology
arekm.pld-linux.org, 1024/3DB19BBD, JID: arekm.jabber.org, PLD/Linux