Re: Namazu problem? (long)

On May 26, 2004 at 16:03, David L. Dewey wrote:

Check your *LANG* environment settings.  Before running mharc scripts,
or mknmz, set them to the C locale.  Namazu does not support UTF-8
locale settings.


Thanks, Earl, but that didn't seem to work... LANG is now
set to C.  I began the reindex and it ran for a long time
w/o error, but then suddenly blew up with tens of thousands
of these again:


There may be multiple language-related environment settings.  Run
printenv and check which environment variables need to be fixed.
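
A minimal sketch of that check, assuming a POSIX shell.  The helper
names show_locale_vars and run_in_c_locale are hypothetical, not part
of mharc or Namazu.  LC_ALL is set as well as LANG because LC_ALL
overrides LANG and every individual LC_* category:

```shell
# List every locale-related variable currently set in the environment.
# "|| true" keeps the function from failing when none are set.
show_locale_vars() {
    printenv | grep -E '^(LANG|LANGUAGE|LC_[A-Z_]+)=' || true
}

# Run an external command with the locale forced to C for that
# command only; the calling shell's environment is left unchanged.
run_in_c_locale() {
    env LC_ALL=C LANG=C "$@"
}

# Example: inspect the current settings, then confirm the override.
show_locale_vars
run_in_c_locale printenv LC_ALL    # prints "C"
```

The same wrapper could be put in front of the mharc scripts or mknmz
(with your usual arguments) so that a stray LC_CTYPE or LC_ALL does
not undo the LANG=C setting.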

Malformed UTF-8 character (unexpected continuation byte
0xb8, with no preceding start byte) in pattern match (m//)
at /usr/local/share/namazu/filter/mailnews.pl line 216,
<GEN3> line 45191.


This problem occurs because perl is treating the source character
encoding as UTF-8, but the source contains 8-bit octets that should
not be treated as UTF-8.
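
To illustrate why the message calls 0xb8 a "continuation byte": in
UTF-8, any octet whose top two bits are 10 can only appear after a
start byte.  A small sketch in shell arithmetic (the function name
is_utf8_continuation is made up for this example):

```shell
# Succeeds if the given octet has the bit pattern 10xxxxxx,
# i.e. it is only valid in UTF-8 when it follows a start byte.
is_utf8_continuation() {
    [ $(( ($1 & 0xC0) == 0x80 )) -eq 1 ]
}

# 0xb8 is 10111000 in binary, so on its own it is malformed UTF-8.
is_utf8_continuation 0xb8 && echo "0xb8 is a continuation byte"
```

In a legacy 8-bit encoding such an octet is an ordinary character, so
it can show up anywhere in a message body, which is what trips up the
regex once perl believes the data is UTF-8.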

As a workaround, I added a 'use bytes' pragma within the block that
is causing problems, to force perl to treat data in the offending
regex as bytes instead of characters:

--- mailnews.pl.20040505        2004-05-05 14:52:23.000000000 -0700
+++ mailnews.pl 2004-05-05 15:03:43.000000000 -0700
@@ -209,6 +209,7 @@ sub mailnews_citation_filter ($$) {
     $$contref = "";
     my $i = 0;
     for my $line (@tmp) {
+       use bytes;
        # Complete excluding is impossible. I tnink it's good enough.
         # Process only first five paragrahs.
        # And don't handle the paragrah which has five or longer lines.


--ewh
