nmh-workers

Re: what's happening?

2002-05-29 20:30:08
Shantonu Sen <ssen(_at_)mit(_dot_)edu> writes:
On Fri, 12 Apr 2002, Ken Hornstein wrote:
If I remember correctly, weren't there still some problems remaining with
the code in CVS?  I thought I remembered some problems with date
processing.

IMHO, the only problem was with Dan's perception of the date processing.
I thought the changes were fine.

Er, no, it wasn't just "my perception".  The new date parsing code
incorrectly interprets many unofficial but widely-used textual timezones,
like "JST" (Japanese Standard Time).

At the time when I updated the code, it worked fine for my purposes, and
differed significantly only in:
1) parse failures for spam, which often doesn't have legal (or legible)
headers
2) Date: lines which did not include the timezone. I think previously it
defaulted to assuming the local time zone, which I thought was bogus. Now
it defaults to UTC, which is arguably also bogus, but I think less so.

Of course, I may be significantly misremembering. Another problem with the
dtimep.c from 1.0.4 is that it doesn't compile. In order to build it, you
need to run it through a sed script. End-users shouldn't necessarily need
to regenerate the c file from the lex file, so it was an awkward
situation. The new sbr/dtimep.c actually compiles on most platforms.

There were a lot of other differences not in the categories you mention.
I've attached my primary mail on the subject.

Now, I still feel strongly that as much of the old date-parsing capability
as possible should be implemented in the new parser, but while I felt this
was important enough to delay a release of 1.0.5 (or whatever 1.0.4+dev was
to become) for a short time, I wouldn't be on solid ground to insist that
it's important enough to delay 1.0.4+dev now that the release has lagged for
so long.

I'd personally be slightly happier if the date parser were temporarily
rolled back for the 1.0.5 release as has been proposed, but I could also
live with leaving it as-is for now and then starting to tackle
re-implementation of the lost parsing ability in 1.0.5+dev.

--
Dan Harkless    
nmh(_at_)harkless(_dot_)org
http://harkless.org/dan/



--- Begin Message ---

Shantonu Sen <ssen(_at_)mit(_dot_)edu> writes:
| I've taken out military zones, and this appears to have fixed up some
| of the errant parsing when compared to 1.0.4. Dan, can you do
| another diff to check this (I just checked in the changes). Also, you
| will still see many differing lines because dtimep no longer treats
| timezone-unqualified mails as having originated in the current zone.
| Those emails will show up as GMT, which I think is totally reasonable.

Well, the last time I ran the ad-hoc test, it was against my inbox, which of
course has changed since then, so this isn't all that meaningful, but now
1.0.4 and 1.0.4+dev produce the same output for it.

Therefore this time I ran the test against all my mail folders.  There are
still lots of differences between 1.0.4 and 1.0.4+dev.

You mention in ChangeLog:

        * Took out bad time textual time zones like BST and JST.
        I found them online somewhere, but am not sure if they're
        correct.

Were they causing a problem or did you just remove them because you were
unsure of them?  The first difference in my folders is a mail from someone
in Japan with "JST" in the date.  Now, it looks like what 1.0.4 printed for
that date was incorrect as well (it comes up with an offset of +02, which
doesn't seem to make sense), but 1.0.4+dev proclaims it to be GMT.

The next diff I come across is an "NZS" timezone, which previously printed
out as +01 (again, seems wrong) but now says GMT.  

This one's not a huge deal (the new output is right, I guess, just
unfriendly), but "Thu, 25 May 2000 20:19:10 -800" used to come out as
"20:19PST" in the scan.time output, but is now "20:19-08".

Okay, here's another one that was wrong before and is wrong in a different
way now.  "Mon, 3 Jul 2000 12:40:54 CEST" previously printed as "12:40EST"
(wonder if it really thought it was U.S. Eastern Standard Time or if the
initial "C" was getting cropped due to an erroneous assumption that all
textual timezones were 3 characters), and now prints as "12:40GMT".

Here's an interesting one.  "Wed, 26 Jul 2000 09:52:40 +1000 (EST)"
previously printed as "09:52+10" but is now "09:52EDT".

Okay, the next several differences are the "no timezone -> GMT instead of
local timezone" change you already mentioned.  I know that logically,
interpreting no zone as GMT makes more sense, but I wonder which type of
timezoneless date comes up more _often_.  It may have been the way it was
for a good reason (e.g. common old versions of sendmail that gave no
timezone on local mails or something).
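
The two policies being compared here (treat a zoneless date as GMT, or as
local time) can be put side by side in a small Python sketch. This is an
illustration only, not nmh's dtimep code; the function name attach_zone and
its parameters are hypothetical.

```python
from datetime import datetime, timedelta, timezone

def attach_zone(naive, parsed_offset_minutes=None, assume_local=False):
    """Attach a zone to a parsed wall-clock time (hypothetical helper).

    parsed_offset_minutes is the offset found in the Date: header, or None
    when the header carried no zone at all.
    """
    if parsed_offset_minutes is not None:
        # The header said what zone it was in; always honor that.
        tz = timezone(timedelta(minutes=parsed_offset_minutes))
        return naive.replace(tzinfo=tz)
    if assume_local:
        # Older behavior: assume the mail was written in the reader's zone.
        return naive.astimezone()
    # Newer behavior: assume GMT/UTC.
    return naive.replace(tzinfo=timezone.utc)
```

Making the fallback a parameter keeps the choice visible, so whichever
default turns out to match real-world zoneless mail more often can be
switched without touching the parser itself.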

The next differences seem to be a good change.  "JST +900" was previously
output as that +02 again, but is now +09.

Wow, here's a date format I don't recall seeing before.  An automated email
from eBay with the date "Wed, 01 Dec 1999 20:55:20 Pacific Standard Time"
previously was incorrectly "20:55EST" and is now incorrectly "20:55GMT".
Dunno if that date format is RFC-legal, but at least it's unambiguous...

The next one may be an OK change.  "Wed, 29 Mar 2000 15:11:23 -0600 (EST)"
(the wrong offset for EST, no?) previously printed as "15:11CST" (right per
the offset?) but now blindly trusts the "EST", which I guess is okay.
However, I've heard that there are duplicate ASCII timezones, so perhaps we
ought to trust the numeric offset, if present, over the timezone strings.

Okay, this one's way bogus.  "Mon, 3 Apr 2000 21:11:21 +0000 (GMT)" was
previously correctly "21:35-00" and is now "21:35BST".

Here's another exotic one (Anchorage Daylight Time perhaps??).  "Tue, 11 Apr
2000 04:58:07 AKDT" was "04:58+07" (correct?) but is now "04:58GMT".

Here's an interesting one.  A mail from someone in Australia with the date
"Thu, 25 May 2000 16:35:02 +0930 (CST)" (a duplicate "CST", I assume?)
previously was "16:35+09" (best we can do in 8 characters' width) but is now
"16:35CDT".

Here's one that probably breaks the RFCs but was previously interpreted
correctly.  "28 Jul 2000 11:4:6 GMT" was previously "11:04GMT" but is now
"00:00GMT".
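
A lenient time-of-day parser that accepts one-digit fields such as "11:4:6"
might look like the Python sketch below. It only illustrates the tolerant
behavior described above and is not the nmh lexer; the name parse_time is
hypothetical.

```python
import re

# One or two digits per field; seconds are optional.
_TIME = re.compile(r"(\d{1,2}):(\d{1,2})(?::(\d{1,2}))?")

def parse_time(text):
    """Return (hour, minute, second), or None if text is not a usable time."""
    m = _TIME.fullmatch(text.strip())
    if m is None:
        return None
    hour = int(m.group(1))
    minute = int(m.group(2))
    second = int(m.group(3)) if m.group(3) is not None else 0
    # Reject out-of-range values; 60 is allowed for a leap second.
    if hour > 23 or minute > 59 or second > 60:
        return None
    return hour, minute, second
```

Returning None on failure, rather than silently falling back to 00:00, lets
the caller tell a parse failure apart from a real midnight timestamp.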

Okay, here's another timezone string the Australians apparently co-opted.
"Wed, 6 Sep 2000 08:52:50 +1100 (EST)" was previously "08:52+11" but is now
"08:52EDT".  Indeed it appears that we need to pay attention to the numeric
offset over the textual one, if both are present.
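
One way to read that suggestion is as a lookup order: use the numeric offset
when the header carries one, and consult the zone name only otherwise.  A
minimal Python sketch under that assumption follows.  The ZONE_OFFSETS table
and the function name zone_offset_minutes are hypothetical, and the table is
deliberately tiny; this is not nmh's actual zone list.

```python
import re

# Hypothetical table for illustration: zone name -> minutes east of UTC.
ZONE_OFFSETS = {
    "GMT": 0, "UT": 0,
    "EST": -5 * 60, "EDT": -4 * 60,
    "CST": -6 * 60, "CDT": -5 * 60,
    "PST": -8 * 60, "PDT": -7 * 60,
}

# Signed offset such as +1100, -0600, or the three-digit -800.
_NUMERIC = re.compile(r"(?<!\S)([+-])(\d{1,2})(\d{2})(?!\S)")
# Whole upper-case tokens, so "CEST" is never cropped to "EST".
_NAME = re.compile(r"\b([A-Z]{2,5})\b")

def zone_offset_minutes(zone_part):
    """Return the UTC offset in minutes for the zone portion of a Date:
    header, or None if no zone could be determined."""
    m = _NUMERIC.search(zone_part)
    if m:
        # A numeric offset is unambiguous, so it wins over any name.
        sign = -1 if m.group(1) == "-" else 1
        return sign * (int(m.group(2)) * 60 + int(m.group(3)))
    for name in _NAME.findall(zone_part):
        if name in ZONE_OFFSETS:
            return ZONE_OFFSETS[name]
    return None
```

With this ordering, "+1100 (EST)" gives 660 minutes instead of the U.S.
Eastern value of -300, and an unknown name such as "AKDT" gives None so the
caller can decide what to fall back to.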

I think that's all the differences in my folders.  There were a lot more
instances, but I think they were all duplicates of these cases.

As far as finding the right offsets to use for timezones that nmh doesn't
grok (like NZS), one possible reference is:

    http://www.bsdi.com/date

-----------------------------------------------------------------------
Dan Harkless                   | To prevent SPAM contamination, please
dan-nmh(_at_)dilvish(_dot_)speed(_dot_)net | do not post this private email address
SpeedGate Communications, Inc. | to the USENET or WWW.  Thank you.


--- End Message ---