Re: Alleviating Duplicates

On Sat, 8 May 1999 14:47:14 +0300,
era eriksson <era(_at_)iki(_dot_)fi> said:

 
E> The solution amounts to finding a "canonical" format which neutralizes
E> all the kinds of changes that you want to disregard (in princple,
E> all changes people might make before they resend a message, but we're
E> already up in the higher spheres) and converting all messages into
E> this format before comparing them.

   Apologies for the length of this posting.  I was messing around with
   this idea some time ago, and seeing this thread made me want to write
   some code.

-- 
Karl Vogel
ASC/YCOA, Wright-Patterson AFB, OH 45433, USA
vogelke(_at_)c17mis(_dot_)region2(_dot_)wpafb(_dot_)af(_dot_)mil  or  
kvogel(_at_)sumaria(_dot_)com

-----------------------------------------------------------------------------
INTRODUCTION

    We need something to find mail duplicates in mail messages.  It has
    to be fast and simple, because this would have to run on the body
    of every incoming message.

DESCRIPTION

    Make a histogram of the entire body, putting all characters in
    lowercase and ignoring anything but alphanumerics.  You get a total
    of 36 values.  If the alphabet is 0-9 followed by a-z, and
    results are stored in the 36-char array "r", then

        r[0]        number of occurrences of digit 0
        r[9]        number of occurrences of digit 9
        r[10]       number of occurrences of letter a
        ...
        r[35]       number of occurrences of letter z

    Normalize them to hold values of 1-255, so each bin fits in one
    byte.  You now have 36 bytes describing the alphanumeric content of
    a given message body.  Citations will only change the value of a
    few characters, and neither hyphenation nor formatting will make
    any difference.

    If the output is to be written to a file or a pipe, use "write"
    because a 0 value for a given bin will make this look like a
    null-terminated string.

    To check messages on the fly, use something like "hashd" with
    histograms from the last few thousand messages in sorted order.

    When a new message comes in, generate a histogram and check it
    against the db.  Look for an entry that has the fewest differences;
    if only one or two character positions differ in count, we have a
    probable duplicate.

    If less than half the positions differ by a small amount, we may have
    a message that includes a chunk of quoted text.

REDUCING THE NUMBER OF COMPARISONS

    Create a search key for this array of 36-byte values that consists of
    the sum of all the values.  Write the 4-byte key first, then the
    36-byte data.  Sort the file by the 4-byte key, then use something
    similar to "look".

    Say the sum of the 36 values is 150.  Look for keys between 148 and
    152 inclusive in this database.  You'll probably find duplicates
    for each key if the database has a non-trivial number of entries.
    These records should be the best candidates for either exact duplicates
    or close matches.

    You could also normalize the array to a smaller value than 255, but
    you run the risk of making every message look similar to every other.
    A lot depends on the size of the message; if there aren't enough
    characters, no normalization takes place.

CODE

#ifndef lint
static char    *chisto_c_rcsid =
"$Id: chisto.c,v 1.2 1999/05/10 17:59:23 vogelke Exp $";

static char    *chisto_c_source =
"$Source: /src/text/maildups/RCS/chisto.c,v $";
#endif  /* lint */

/*
 * NAME:
 *      chisto
 *
 * SYNOPSIS:
 *      #include <stdio.h>
 *
 *      int chisto (fp, work, r, len)
 *      FILE           *fp;
 *      int            *work;
 *      char           *r;
 *      int             len;
 *
 * DESCRIPTION:
 *      Accept an email message from a file pointer, skip the header,
 *      and create a character histogram from the body.  Keep only
 *      alphanumeric characters (lowercased if necessary).
 *
 *      Return max char count as the function value.
 *
 * ARGUMENTS:
 *      "fp" is the source of the mail message.
 *      "work" is for initial sums: length "len".
 *      "r" is the final histogram: length "len".
 *
 * AUTHOR:
 *      Karl Vogel <kvogel(_at_)sumaria(_dot_)com>
 *      Sumaria Systems, Inc.
 */

#include        <stdio.h>
#include        <ctype.h>
#include        <strings.h>

static char    *alphabet = "0123456789abcdefghijklmnopqrstuvwxyz";

int
chisto (fp, work, r, len)
FILE           *fp;
int            *work;
unsigned char  *r;
int             len;
{

/*
 *      Variables.
 */

        char           *p;
        char            c;

        double          factor;

        int             ic;
        int             k;
        int             max;
        int             prev;

/*
 *      Clear the arrays, and read the message body.
 */

        memset (work, 0, len * sizeof (int));
        memset (r, 0, len);

        prev = 0;
        while ((ic = getc (fp)) != EOF)
        {
                if (ic == '\n' && ic == prev)
                        break;
                prev = ic;
        }

        while ((ic = getc (fp)) != EOF)
        {
                c = (char) ic;
                if (isupper (c))
                        c = tolower (c);

                /*
                 * We use strchr() because this is ANSI C code,
                 * so it has to work for any character set, not
                 * just ones which put the digits together in order.
                 */

                if (p = strchr (alphabet, c))
                        work[p - alphabet]++;
        }

/*
 *      Normalize the array so the largest count is 255.
 */

        for (max = 0, k = 0; k < len; ++k)
                if (work[k] > max)
                        max = work[k];

        if (max > 255)
        {
                factor = (double) 255 / (double) max;
                for (k = 0; k < len; ++k)
                        r[k] = (char) ((double) work[k] * factor);
        }
        else
        {
                for (k = 0; k < len; ++k)
                        r[k] = (char) work[k];
        }

        return (max);
}

/* ------------------------------------------------------------------- */

/*
 *      "-d" prints debugging info for input via stdin.
 *      "-c file1 file2" compares results for two files.
 *      default reads stdin, writes character array (possibly with
 *              nulls) to stdout.
 */

#ifdef  TEST_CHISTO

#define         MAXHISTO 36

main (argc, argv)
int             argc;
char          **argv;
{

        FILE           *fp;

        unsigned char   r[MAXHISTO];
        unsigned char   s[MAXHISTO];

        int             work[MAXHISTO];
        int             k;
        int             max;

/*
 *      Debugging.
 */

        if (argc > 1 && strcmp (argv[1], "-d") == 0)
        {
                fp = stdin;
                max = chisto (fp, work, r, MAXHISTO);
                printf ("max=%d\n", max);

                for (k = 0; k < MAXHISTO; ++k)
                {
                        printf ("work[%d]=%d\tr[%d]=%d\n",
                                k, work[k], k, r[k]);
                }
        }

/*
 *      Compare two files.
 */

        else if (argc > 3 && strcmp (argv[1], "-c") == 0)
        {
                if (fp = fopen (argv[2], "r"))
                {
                        max = chisto (fp, work, r, MAXHISTO);
                        fclose (fp);
                }
                else
                {
                        perror (argv[2]);
                        exit (1);
                }

                if (fp = fopen (argv[3], "r"))
                {
                        max = chisto (fp, work, s, MAXHISTO);
                        fclose (fp);
                }
                else
                {
                        perror (argv[3]);
                        exit (1);
                }

                for (k = 0; k < MAXHISTO; ++k)
                {
                        printf ("r[%2d]=%6d\ts[%2d]=%6d",
                                k, r[k], k, s[k]);

                        if (r[k] == s[k])
                                printf ("\n");
                        else
                                printf ("\tDIFF\n");
                }
        }

/*
 *      Debugging.
 */

        else
        {
                fp = stdin;
                max = chisto (fp, work, r, MAXHISTO);
                write (fileno(stdout), r, MAXHISTO);
        }

        exit (0);
}

#endif  /* TEST_CHISTO */