nmh-workers

Re: [Nmh-workers] nmh internals: full MIME integration

2014-07-26 14:08:14
Hi Ken,

> > If we're having lazy evaluation of MIME parts, which is good, can it
> > also cover the headers?  `pick --list-id <foo@bar.com>' isn't
> > concerned with decoding Subject and all those Received headers.  It
> > may not sound like much, but we have folders with tens of thousands
> > of emails.  get_header() could note minimal details of each header
> > it comes across whilst searching for the List-ID but not bother too
> > much about their contents.

> I wasn't actually thinking of decoding the headers for things like
> MIME content, at least upon read (I assume you're talking about RFC
> 2047 encoding

No, less than that.  I'm hoping this change will also improve searching
for split-line headers.

    $ grep -A 1 '^foo:' `mhpath .`
    foo: bar
     xyzzy
    $ pick --foo 'bar xyzzy' .
    pick: no messages match specification
    $ pick --foo 'bar  xyzzy' .
    1 hit
    $

pick may have changed a bit since the above version, but I still
shouldn't have to care how much whitespace continuation lines are
indented.  Shouldn't pick be matching against a logical view of a single
line, with `CRLF WS*' becoming a single space?
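To make the "logical view" concrete, here is a minimal sketch in Python rather than nmh's C.  The names unfold() and header_matches() are made up for illustration and are not nmh functions;  the sketch assumes a fold is a line break followed by at least one space or tab, and that it should match as one space.

```python
import re

# Hypothetical helpers, not part of nmh.  A fold is taken to be a line
# break followed by one or more spaces or tabs.
_FOLD = re.compile(r"\r?\n[ \t]+")


def unfold(value: str) -> str:
    """Return the logical single-line view of a header field body."""
    return _FOLD.sub(" ", value).strip()


def header_matches(raw_value: str, pattern: str) -> bool:
    """Search the unfolded body, so indentation depth cannot affect a match."""
    return re.search(pattern, unfold(raw_value)) is not None


if __name__ == "__main__":
    shallow = "bar\n xyzzy"       # continuation indented by one space
    deep = "bar\r\n\t   xyzzy"    # CRLF, then a tab and three spaces
    print(header_matches(shallow, "bar xyzzy"))
    print(header_matches(deep, "bar xyzzy"))
```

With matching done against unfold()'s result, the same `bar xyzzy' pattern applies however the sender's software chose to indent the continuation line.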

> Okay, I guess I could see that.  The normal case would be to decode
> the contents completely

Yep, to UTF-8 single lines?

> the kind of overhead that would be nice to see done only on demand.

> I'm still skeptical that you'd even notice (it isn't 1988 anymore!),
> but I think if the API was well designed it should be easy to
> implement.

Well, you may be thinking that the 2047-decoding won't make much
difference, whereas I'm thinking of something cheaper still:  read a
block into a page-aligned buffer that has an \n beyond it as a sentinel,
check for /foo[ \t]*:/i, skip any non-foo header, hunt for the next \n,
and repeat if it's not the sentinel;  otherwise read another block and
try again.  Stop when there are no more blocks or at \n\n.  The detail's
a bit more complex, but there's no allocation or copying for headers
seen along the way;  they'll be found when they're looked for in turn.
The file's blocks aren't being modified, so no copy-on-write occurs.
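A rough sketch of that scan's control flow, in Python for brevity.  Python can't express the page-aligned buffer or the sentinel byte, and this version does copy as it splits lines, so it only shows the block-at-a-time search, the skipping of uninteresting headers, and the stop at the blank line.  find_header() is a hypothetical name, not nmh's get_header(), and the 4096-byte block size is an assumption.

```python
import os
import re
import tempfile
from typing import Optional


def find_header(path: str, name: str, block_size: int = 4096) -> Optional[bytes]:
    """Return the raw body of the first `name` field (still folded, not
    decoded), or None if the header section ends without one."""
    wanted = re.compile(rb"(?i)" + re.escape(name.encode("ascii")) + rb"[ \t]*:")
    value = None   # body lines of the wanted field, once it has been seen
    tail = b""     # partial line carried over from the previous block
    with open(path, "rb", buffering=0) as f:
        while True:
            block = f.read(block_size)
            lines = (tail + block).split(b"\n")
            # The last piece is incomplete unless we have hit end of file.
            tail = lines.pop() if block else b""
            for line in lines:
                line = line.rstrip(b"\r")
                if value is not None:
                    if line[:1] in (b" ", b"\t"):
                        value.append(line)       # continuation of our field
                        continue
                    return b"\n".join(value)     # next field or blank line
                if not line:
                    return None                  # blank line: headers are over
                m = wanted.match(line)
                if m:
                    value = [line[m.end():]]
                # Any other field is skipped without being stored or decoded.
            if not block:
                return b"\n".join(value) if value is not None else None


if __name__ == "__main__":
    msg = (b"Received: by a\n\twith b\nFoo: bar\n xyzzy\n"
           b"Subject: hi\n\nfoo: in the body\n")
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(msg)
    print(find_header(tmp.name, "foo"))
    print(find_header(tmp.name, "list-id"))
    os.unlink(tmp.name)
```

The returned body is left folded and undecoded on purpose:  unfolding and any 2047-decoding would happen only for the one header the caller asked about.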

I agree modern machines are quick;  this is on about 22,500 emails.

    $ LC_ALL=C \time -v perl -e 'for (<[0-9]*>) {sysopen F, $_, 0 and sysread F, $b, 4096 or die}'
            Command being timed: "perl -e for (<[0-9]*>) {sysopen F, $_, 0 and sysread F, $b, 4096 or die}"
            User time (seconds): 0.40
            System time (seconds): 0.52
            Percent of CPU this job got: 98%
            Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.93
            Average shared text size (kbytes): 0
            Average unshared data size (kbytes): 0
            Average stack size (kbytes): 0
            Average total size (kbytes): 0
            Maximum resident set size (kbytes): 24112
            Average resident set size (kbytes): 0
            Major (requiring I/O) page faults: 0
            Minor (reclaiming a frame) page faults: 1688
            Voluntary context switches: 1
            Involuntary context switches: 19
            Swaps: 0
            File system inputs: 0
            File system outputs: 0
            Socket messages sent: 0
            Socket messages received: 0
            Signals delivered: 0
            Page size (bytes): 4096
            Exit status: 0
    $

It would be nice if a simple pick didn't add much to that roughly
one-second 100%-CPU-utilisation wall-clock time.  :-)  Running pick
tends to be an iterative process where the query is honed.

Cheers, Ralph.
