Re: [Nmh-workers] MH-W intro/help request

2014-12-04 08:46:03
> Is there any way I can completely avoid the giant folder check?  I
> can't think of why it is being done time after time for simple program
> invocations that, for example, refer to a specifically enumerated
> message.  Obviously *asking* for some relative message list ID like
> "last" would need to check the directory to find which message number
> that is referring to, but it would be easy to do that in one step,
> always referring to the number after that.

Sigh.  I suspect the original authors of MH simply did not envision
100K messages in a single folder.  The short answer is, no, you cannot
avoid it in nmh programs, at least not without a TON of work.

When I said it would require "some" surgery, I was kind of underestimating
the amount of work it would take.  I'll try to explain it in greater
detail.

What folder_read() returns is a "struct msgs".  That has the following
elements:

struct msgs {
    int lowmsg;         /* Lowest msg number                 */
    int hghmsg;         /* Highest msg number                */
    int nummsg;         /* Actual Number of msgs             */

    int lowsel;         /* Lowest selected msg number        */
    int hghsel;         /* Highest selected msg number       */
    int numsel;         /* Number of msgs selected           */

    int curmsg;         /* Number of current msg if any      */

    int msgflags;       /* Folder attributes (READONLY, etc) */
    char *foldpath;     /* Pathname of folder                */

    /*
     * Name of sequences in this folder.
     */
    svector_t msgattrs;

[...]

    /*
     * This is an array of bvector_t which we allocate dynamically.
     * Each bvector_t is a set of bits flags for a particular message.
     * These bit flags represent general attributes such as
     * EXISTS, SELECTED, etc. as well as track if message is
     * in a particular sequence.
     */
    size_t num_msgstats;
    bvector_t *msgstats;        /* msg status */

[...]

    /*
     * These represent the lowest and highest possible
     * message numbers we can put in the message status
     * area, without calling folder_realloc().
     */
    int lowoff;
    int hghoff;

There are more, but you get the idea.  Essentially ALL of these things,
with the exception of "foldpath", require a full readdir() of the
directory.  This is especially true of the sequence structure; it's
indexed based on the value of "lowoff", which is set by default to the
lowest message number in a folder.  Every single MH program has, for
approximately forever, had direct access to these fields and assumes
they'll be valid.
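
To make the indexing point concrete, here is a minimal sketch of a
status array addressed relative to "lowoff".  Everything prefixed with
sk_ or SK_ is hypothetical and much simpler than the real nmh code: it
uses one unsigned word per message instead of a bvector_t, and sets
"hghoff" to the highest message number with no spare room.  The point
is only that the offsets cannot be chosen until every message number
in the directory is known.

```c
#include <stdlib.h>

#define SK_EXISTS 0x1u   /* hypothetical flag bit: message file is present */

/* Hypothetical cut-down stand-in for struct msgs. */
struct sk_msgs {
    int lowmsg, hghmsg, nummsg;
    int lowoff, hghoff;
    unsigned *msgstats;  /* hghoff - lowoff + 1 entries */
};

/* Status word for message number msgnum; valid only for lowoff..hghoff. */
static unsigned *sk_stat(struct sk_msgs *mp, int msgnum)
{
    return &mp->msgstats[msgnum - mp->lowoff];
}

/*
 * Fill in mp from the message numbers found by a directory scan.
 * Returns 0 on success, -1 on an empty list or allocation failure.
 */
static int sk_build(struct sk_msgs *mp, const int *nums, size_t count)
{
    size_t i;

    if (count == 0)
        return -1;

    /* The lowest and highest numbers need a pass over every entry. */
    mp->lowmsg = mp->hghmsg = nums[0];
    for (i = 1; i < count; i++) {
        if (nums[i] < mp->lowmsg)
            mp->lowmsg = nums[i];
        if (nums[i] > mp->hghmsg)
            mp->hghmsg = nums[i];
    }
    mp->nummsg = (int)count;
    mp->lowoff = mp->lowmsg;
    mp->hghoff = mp->hghmsg;

    mp->msgstats = calloc((size_t)(mp->hghoff - mp->lowoff + 1),
                          sizeof *mp->msgstats);
    if (mp->msgstats == NULL)
        return -1;

    for (i = 0; i < count; i++)
        *sk_stat(mp, nums[i]) |= SK_EXISTS;
    return 0;
}
```

With messages 3, 7 and 12 in a folder, sk_stat(mp, 3) is slot 0 of the
array, so any code that indexes the array before the scan has nothing
valid to subtract.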

The "struct msgs" is used by a lot of routines.  The routine that
converts a message name into a filename is m_convert().  It assumes
that all of the fields in struct msgs are valid, as it checks to see
if the given message number is within the range of messages in the
folder.  The code to check to see if it's a sequence happens right
at the beginning of m_convert() (I wonder if a sequence can be completely
numeric?  Maybe) and THAT code assumes that the sequence information
has been completely populated, which again requires a full readdir().
That's also completely distinct from all of the code at a higher level
that assumes that everything in struct msgs is valid.  These decisions
in terms of API and code structure were all made 30+ years ago.
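
As a rough illustration of the range check described above (not the
actual m_convert() code), the sketch below validates a numeric message
name against the folder's lowest and highest message numbers.  The
names sk_range and sk_check_number are made up for this example.  Even
this simple test depends on two values that only a directory scan can
supply.

```c
#include <errno.h>
#include <stdlib.h>

/* Hypothetical cut-down folder summary; struct msgs has many more fields. */
struct sk_range {
    int lowmsg;   /* lowest message number found by the scan  */
    int hghmsg;   /* highest message number found by the scan */
};

/*
 * Interpret name as a plain message number and check it against the
 * folder's range.  Returns the number if it is in range, 0 if name is
 * not a positive decimal number (it might be a sequence name instead),
 * and -1 if it is a number outside lowmsg..hghmsg.
 */
static int sk_check_number(const struct sk_range *mp, const char *name)
{
    char *end;
    long n;

    errno = 0;
    n = strtol(name, &end, 10);
    if (end == name || *end != '\0' || errno == ERANGE || n <= 0)
        return 0;

    /* Both bounds come from the directory scan. */
    if (n < mp->lowmsg || n > mp->hghmsg)
        return -1;

    return (int)n;
}
```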

So, you're going to say, "Hey, it still SHOULDN'T be necessary to read
the whole directory just to get one message that is given explicitly
by number".  The answer to that is, "yes, but ... reality intrudes."

The number of cases where the scan really could be skipped is actually
very narrow; the normal case, where you'd want to check and clear the
unseen sequence and possibly update the previous sequence, is much more
common.  You'd have
to really work at your setup to create a situation where scanning the
whole directory wasn't required.  Okay, fine, clearly that's possible
... but the code is not structured to take advantage of that.  Either
things would have to be seriously reshuffled to allow a number of
short-circuit tests, with some careful rejiggering to make sure nobody
accesses "struct msgs" while it isn't valid, or we'd need to wrap every
access to "struct msgs" in accessor functions that perform a folder
scan if needed (I think you'd still need to do a lot of reshuffling).
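
The accessor-function option might look something like the sketch
below.  Nothing like this exists in nmh as far as this message says;
sk_folder, sk_ensure_scanned() and the other sk_ names are invented for
illustration, and sk_fake_scan() merely stands in for the real
readdir() pass so the laziness can be observed.

```c
#include <stddef.h>

/* Hypothetical folder handle whose scan-derived fields are filled lazily. */
struct sk_folder {
    const char *foldpath;             /* known without any scan           */
    int scanned;                      /* nonzero once the fields are valid */
    int lowmsg;
    int hghmsg;
    int (*scan)(struct sk_folder *);  /* hook that does the directory pass */
};

static int sk_scan_calls;             /* demo counter for the fake scan    */

/* Stand-in for the expensive readdir() pass; returns 0 on success. */
static int sk_fake_scan(struct sk_folder *fp)
{
    sk_scan_calls++;
    fp->lowmsg = 1;
    fp->hghmsg = 42;
    return 0;
}

/* Run the scan the first time a scan-derived field is needed. */
static int sk_ensure_scanned(struct sk_folder *fp)
{
    if (!fp->scanned) {
        if (fp->scan(fp) != 0)
            return -1;
        fp->scanned = 1;
    }
    return 0;
}

/* No scan needed: the path is known up front. */
static const char *sk_foldpath(const struct sk_folder *fp)
{
    return fp->foldpath;
}

/* These trigger the scan on first use; -1 means the scan failed. */
static int sk_lowmsg(struct sk_folder *fp)
{
    return sk_ensure_scanned(fp) == 0 ? fp->lowmsg : -1;
}

static int sk_hghmsg(struct sk_folder *fp)
{
    return sk_ensure_scanned(fp) == 0 ? fp->hghmsg : -1;
}
```

The catch is the one already described: every place that reads a field
directly today would have to be found and switched to a call like
these.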

I don't want to discourage anyone who wants to take this work on, but
I do want to make sure everyone understands the scope of work required.
It does strike me that 100K entries in a single folder is probably a
couple of orders of magnitude too big for any tool to effectively
deal with.  If it was me, I'd work on speeding up readdir(); you could
crib some ideas from here:

  
https://www.olark.com/developers-corner/you-can-list-a-directory-with-8-million-files-but-not-with-ls

Although I'd just make my own copy of the readdir() implementation from
libc and up the buffer size.
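
For anyone curious what a bigger buffer might look like, here is a
Linux-only sketch.  It does not copy libc's readdir(); instead it calls
the getdents64 system call directly with a large buffer, which is a
different route to the same goal.  The 4 MB buffer size, the
sk_dirent64 structure name and the sk_count_numeric() function are all
assumptions made for this example, and nothing here has been measured
against a real 100K-message folder.

```c
#define _GNU_SOURCE
#include <ctype.h>
#include <fcntl.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Layout of the records that Linux getdents64 writes into the buffer. */
struct sk_dirent64 {
    uint64_t       d_ino;
    int64_t        d_off;
    unsigned short d_reclen;
    unsigned char  d_type;
    char           d_name[];
};

#define SK_BUFSIZE (4 * 1024 * 1024)  /* assumed size, not a tuned value */

/*
 * Count the entries in dir whose names consist only of digits, which is
 * how MH message files are named.  Returns the count, or -1 on error.
 */
static long sk_count_numeric(const char *dir)
{
    long count = 0;
    long nread;
    char *buf;
    int fd;

    fd = open(dir, O_RDONLY | O_DIRECTORY);
    if (fd < 0)
        return -1;

    buf = malloc(SK_BUFSIZE);
    if (buf == NULL) {
        close(fd);
        return -1;
    }

    /* Each call fills as much of the buffer as the kernel can manage. */
    while ((nread = syscall(SYS_getdents64, fd, buf, SK_BUFSIZE)) > 0) {
        long pos = 0;

        while (pos < nread) {
            struct sk_dirent64 *d = (struct sk_dirent64 *)(buf + pos);
            const char *p = d->d_name;

            if (*p != '\0') {
                while (isdigit((unsigned char)*p))
                    p++;
                if (*p == '\0')
                    count++;
            }
            pos += d->d_reclen;
        }
    }

    free(buf);
    close(fd);
    return nread < 0 ? -1 : count;
}
```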

--Ken
