procmail

more-more or fewer-fewer?

1997-08-31 12:52:30
One night when I couldn't sleep and had to occupy my mind somehow, thinking
about how procmail's \< and \> are not zero-width brought some other things
to mind.

Say we're looking to count the number of times the word "very" occurs in the
body of a letter.  The problem is that the writer uses "very very" on
occasion, and

 * 1^1 \<very\>

will count "very very" as one, because the \> that ends the first match
consumes the space that \< needs before the second.  (It would find two in
"very, very" or "very
very" with a newline between them instead of a space.)
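
To see the miscount outside procmail, here is a rough Python model.  It is
only a sketch: LEFT and RIGHT are my own stand-ins for \< and \>, on the
assumption that each one either consumes a single non-word character or
matches at a line edge, and count() is a made-up helper that adds up
non-overlapping matches the way a 1^1 weight would.

```python
import re

# Assumed model of procmail's \< and \>: each consumes one non-word
# character, or matches at a line edge.  This is an approximation.
LEFT = r"(?:^|[^A-Za-z0-9_])"
RIGHT = r"(?:[^A-Za-z0-9_]|$)"

def count(pattern, text):
    """Count non-overlapping matches, as a 1^1 weight would add them up."""
    return sum(1 for _ in re.finditer(pattern, text, re.MULTILINE | re.IGNORECASE))

consuming = LEFT + "very" + RIGHT   # models  \<very\>
zero_width = r"\bvery\b"            # what a zero-width boundary would do

# Each line shows the sample, the consuming count, and the zero-width count.
for sample in ("very very", "very, very", "very\nvery"):
    print(repr(sample), count(consuming, sample), count(zero_width, sample))
```

Under that model the consuming pattern finds one match in "very very" and
two in the other two samples, while the zero-width version finds two in all
three.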

I thought of a solution, but it came in two forms, both of which should give
the right result:

 :0B
 * 1^1 very\>
 * -1^1 [a-z]very\>
 * -1^1 [a-z]-$[        ]*very\>

or

 :0B
 * 1^1 \<very
 * -1^1 \<very[a-z]
 * -1^1 [a-z]-$[        ]*very\>

The first condition of the first method will include every occurrence of words
like bravery, delivery, discovery, every, livery, recovery, and thievery;
then its second condition will subtract them all, leaving only the number of
times "very" appears.  (The third condition corrects for the case where `bra-
very' is broken across two lines with a hyphen or that where breaking `thie-
very' has the same effect; by the way, grep -w was fooled by both.)

The second method's first condition probably won't find any false positives
(what other word starts with "very"?), so its second condition will almost
never find anything to subtract.
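
Both recipes can be mimicked in Python to check that they agree.  Again this
is a sketch under assumptions, not procmail itself: LEFT and RIGHT are my
stand-ins for \< and \>, the "$" of the third condition is modeled as a
literal newline, and hits(), method_one(), method_two(), and the sample
sentence are all invented for the illustration.

```python
import re

# Assumed stand-ins for procmail's \< and \> (each eats one non-word
# character or matches at a line edge); "$" in the recipe is a newline.
LEFT = r"(?:^|[^A-Za-z0-9_])"
RIGHT = r"(?:[^A-Za-z0-9_]|$)"
FLAGS = re.MULTILINE | re.IGNORECASE  # procmail ignores case by default

def hits(pattern, text):
    """Number of non-overlapping matches, i.e. what a 1^1 weight adds."""
    return len(re.findall(pattern, text, FLAGS))

BROKEN = r"[a-z]-\n[ \t]*very" + RIGHT  # third condition of both recipes

def method_one(text):
    # Count everything ending in "very", then subtract the longer words.
    return (hits("very" + RIGHT, text)
            - hits("[a-z]very" + RIGHT, text)
            - hits(BROKEN, text))

def method_two(text):
    # Count everything starting with "very", then subtract the longer words.
    return (hits(LEFT + "very", text)
            - hits(LEFT + "very[a-z]", text)
            - hits(BROKEN, text))

sample = ("Every delivery was very very slow, and the bra-\n"
          "very of the crew was very, very rare.")
print(method_one(sample), method_two(sample))  # prints: 4 4
```

The sample holds four real occurrences of "very", one hyphen-broken
"bravery", and two longer words ending in "very", so both methods should
come out at four.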

Let's say for example that a text has a valid "very" nine times, "every"
four times, and "delivery" twice ... and that the broken word problem does
not appear.  Then the first method scores 15-6+0=9, while the second method
scores 9+0+0=9.  Is the second method therefore more efficient than the
first, because it found six fewer occurrences to include and six fewer to
subtract?
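
The arithmetic can be reproduced with a small Python tally.  This is a
sketch only: LEFT and RIGHT are assumed models of \< and \>, tallies() is a
hypothetical helper, and the text is made up to contain exactly the mix
described.  It counts how many matches each condition finds; it does not
measure how long procmail itself would take.

```python
import re

LEFT = r"(?:^|[^A-Za-z0-9_])"    # assumed model of procmail's \<
RIGHT = r"(?:[^A-Za-z0-9_]|$)"   # assumed model of procmail's \>
FLAGS = re.MULTILINE | re.IGNORECASE

def tallies(patterns, text):
    """Matches found by each condition, in recipe order."""
    return [len(re.findall(p, text, FLAGS)) for p in patterns]

BROKEN = r"[a-z]-\n[ \t]*very" + RIGHT
METHOD_ONE = ["very" + RIGHT, "[a-z]very" + RIGHT, BROKEN]
METHOD_TWO = [LEFT + "very", LEFT + "very[a-z]", BROKEN]

# Made-up text with the mix from the example: 9 "very", 4 "every", 2 "delivery".
text = " ".join(["very"] * 9 + ["every"] * 4 + ["delivery"] * 2)

for name, patterns in (("first", METHOD_ONE), ("second", METHOD_TWO)):
    plus, minus, broken = tallies(patterns, text)
    print(name, "finds", plus, "subtracts", minus + broken,
          "scores", plus - minus - broken)
```

With that text the first method's conditions find 15, 6, and 0 matches and
the second method's find 9, 0, and 0, so both score 9.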

If so, then it makes sense to use the first method for words that are more
likely to appear at the starts of other words and the second for those that
are more likely to appear at the ends of other words.

  • more-more or fewer-fewer?, David W. Tamkin <=