
[Namazu-users-en] Re: Problems with date sorts

2007-01-12 16:19:09
Thus spake Tadamasa Teranishi on Fri, Jan 12, 2007 at 09:18:26AM CST
Lindsay Haisley wrote:

I'm running into a rather nasty problem with date sorting on Mailman pipermail
archives.  When I sort on date:early or date:late there appears to be some
other sort being applied, although if I do a date sort in the reverse order the
order of the messages is indeed reversed, indicating that the sort is working,
albeit with an incorrect algorithm.

Does the date information in all of the Mailman documents accurately follow
the RFC 2822 format?

Is there any mail with an illegal Date: field?

Please show the Date: field of the mail.

OK, here is an example.  I used the following query:

http://www.kca-tx.org/mailman/kca/namazu.cgi?query=Laptop&submit=Search%21&idxname=kca&max=100&result=short&sort=date%3Aearly

Here's the result:

1. win 98SE (score: 2)
    /pipermail/kca/2002-September/000192.html (4,152 bytes)

2. Linux install plus a note on jedit (score: 2)
    /pipermail/kca/2002-July/000052.html (4,432 bytes)

3. Canon BJC-2100, Restart in DOS mode (score: 2)
    /pipermail/kca/2002-August/000103.html (3,073 bytes)

4. March Newscard (score: 2)
    /pipermail/kca/2003-March/000353.html (4,296 bytes)

5. New TurboTax "feature" (score: 2)
    /pipermail/kca/2003-January/000331.html (6,865 bytes)

You can see from the path names that these are out of order.  Here are the
Date fields in each of these, copy-n-pasted from the files themselves:

Sun Sep 22 14:07:51 CDT 2002

Fri Jul 19 11:54:00 CDT 2002

Fri Aug 23 08:44:57 CDT 2002

Mon Mar  3 10:14:12 CST 2003

Tue Jan 14 17:37:11 CST 2003
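
For comparison, here is a minimal Python sketch of the order a date:early sort
would be expected to produce for these five messages. It is only an
illustration, not how namazu sorts: the function name parse_pipermail_date and
the TZ_OFFSETS table are made up for this example, and the table assumes that
CDT and CST mean US Central time (UTC-5 and UTC-6).

```python
from datetime import datetime, timedelta, timezone

# Assumption: the zone abbreviations in these archives are US Central time.
TZ_OFFSETS = {"CDT": -5, "CST": -6}


def parse_pipermail_date(text):
    """Parse a date such as 'Mon Mar  3 10:14:12 CST 2003'."""
    # split() also copes with the double space before single-digit days.
    _weekday, month, day, clock, zone, year = text.split()
    naive = datetime.strptime(
        f"{month} {day} {clock} {year}", "%b %d %H:%M:%S %Y"
    )
    offset = timezone(timedelta(hours=TZ_OFFSETS[zone]))
    return naive.replace(tzinfo=offset)


# Path and date of each search result, in the order namazu returned them.
results = [
    ("2002-September/000192.html", "Sun Sep 22 14:07:51 CDT 2002"),
    ("2002-July/000052.html", "Fri Jul 19 11:54:00 CDT 2002"),
    ("2002-August/000103.html", "Fri Aug 23 08:44:57 CDT 2002"),
    ("2003-March/000353.html", "Mon Mar  3 10:14:12 CST 2003"),
    ("2003-January/000331.html", "Tue Jan 14 17:37:11 CST 2003"),
]

# Earliest first, which is what date:early should give.
ordered = sorted(results, key=lambda item: parse_pipermail_date(item[1]))

for path, date_text in ordered:
    print(path, "|", date_text)
```

Sorted this way the messages run July, August, September 2002, then January
and March 2003, which is not the order in the search result above.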


The date information isn't in a standard RFC2822 header format once the files 
are in a pipermail archive, but embedded in HTML markup, e.g.:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2//EN">
<HTML>
 <HEAD>
   <TITLE> New TurboTax &quot;feature&quot;
   </TITLE>
   <LINK REL="Index" HREF="index.html" >
   <LINK REL="made" HREF="mailto:kca%40lists.kca-tx.org?Subject=New%20TurboTax%20%22feature%22&In-Reply-To=20030114.100528.-140453.1.bstrohm%40juno.com">
   <META NAME="robots" CONTENT="index,nofollow">
   <META http-equiv="Content-Type" content="text/html; charset=us-ascii">
   <LINK REL="Previous"  HREF="000330.html">
   <LINK REL="Next"  HREF="000332.html">
 </HEAD>
 <BODY BGCOLOR="#ffffff">
   <H1>New TurboTax &quot;feature&quot;</H1>
    <B>Dale Cockle</B> <A HREF="mailto:kca%40lists.kca-tx.org?Subject=New%20TurboTax%20%22feature%22&In-Reply-To=20030114.100528.-140453.1.bstrohm%40juno.com"
       TITLE="New TurboTax &quot;feature&quot;">k5jic at kca-tx.org
       </A><BR>
    <I>Tue Jan 14 17:37:11 CST 2003</I>

etc....
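
To show where the date sits in such a page, here is a rough Python sketch that
pulls it out of the markup. It assumes the date is always the text of an <I>
element in the form shown above, which I have only checked against this one
sample; the names DATE_IN_ITALICS and extract_pipermail_date are invented for
the example and are not part of namazu or Mailman.

```python
import re

# Assumption: the date appears as <I>Www Mmm [D]D HH:MM:SS ZONE YYYY</I>.
DATE_IN_ITALICS = re.compile(
    r"<I>\s*("
    r"[A-Z][a-z]{2} [A-Z][a-z]{2}\s+\d{1,2} "  # weekday, month, day
    r"\d{2}:\d{2}:\d{2} [A-Z]{2,5} \d{4}"      # time, zone, year
    r")\s*</I>"
)


def extract_pipermail_date(html):
    """Return the date string from a pipermail page, or None if absent."""
    match = DATE_IN_ITALICS.search(html)
    return match.group(1) if match else None


sample = """
    <H1>New TurboTax &quot;feature&quot;</H1>
    <B>Dale Cockle</B>
    <I>Tue Jan 14 17:37:11 CST 2003</I>
"""

print(extract_pipermail_date(sample))
```

Nothing in the page carries the date as an RFC2822 "Date:" header line, so
anything that looks only for that header would not find it here.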

Could that be a problem?  Should I perhaps be indexing the mbox file?  Would 
namazu understand that better?

-- 
Lindsay Haisley       | "Fighting against human |     PGP public key
FMP Computer Services |    creativity is like   |      available at
512-259-1190          |    trying to eradicate  | <http://pubkeys.fmp.com>
http://www.fmp.com    |        dandelions"      |
                      |      (Pamela Jones)     |
_______________________________________________
Namazu-users-en mailing list
Namazu-users-en(_at_)namazu(_dot_)org
http://www.namazu.org/cgi-bin/mailman/listinfo/namazu-users-en
