ietf-asrg

[Asrg] Do end users really use HTML?

2003-06-21 12:48:32
Frankly, I think the question is pointless.  Regardless of the answer, 
we aren't going to change the default. Which raises the question of why
I just spent three hours attempting to answer it.  Oh well.  Someone
asked.

The Sample
Messages sent to wormalert(_at_)somewhere(_dot_)com over a period of about 15 
months.  These are messages sent by non-technical (pretty much by 
definition) users who thought they were protecting their email from 
viruses.  They put the wormalert address in their address book 
because someone forwarded a hoax to them, and they believed it.  They 
sent mail to it because they were sending email to everyone in 
their address book, were replying to someone who did, or mistakenly thought 
the hoax said that they should cc/bcc the address every time.  The 
main point is that these are *not* mailing list or technical 
messages.  They tend to be forwarded jokes, conversations with 
relatives, and other end-user to end-user email.  Needless to say, 
some of these messages are very private, so I'm not going to make the 
repository publicly available.

These messages have been filtered to remove spam.  Initially I did it 
by hand (I used to reply to the hoaxes, and also hand reply to people 
who ignored our automatic request to stop sending us email).  For the 
past six months they've been filtered by Messagefire's spam filtering 
service.  The false negative (missed spam) rate for Messagefire is 
about 0.2%.

The Technique
I wrote a Perl program (included below) that parses the Eudora 
mailboxes that hold the mail.  It looks at each message and checks to 
see if Eudora thought it was HTML (Eudora does some rather nasty 
munging of email, tossing multipart alternative and other message 
parts).  The script is pretty generic; if you want to test it on a 
non-Eudora mailbox, you'll want to change the line ending setting and 
have it use the Content-Type to check the type of the message instead 
of the <x-html> tag.
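
For a non-Eudora mailbox, the Content-Type check mentioned above might look roughly like the following. This is a minimal sketch and not part of the original script: the helper name is_html_content_type is hypothetical, and it assumes one unfolded header per line. Multipart messages would need real MIME parsing, which this does not attempt.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not in the original script): returns 1 if a single
# header line declares an HTML body, 0 otherwise.
# Assumption: headers are unfolded, one per line.
sub is_html_content_type {
    my ($header) = @_;
    return $header =~ m{^content-type:\s*text/html\b}i ? 1 : 0;
}

print is_html_content_type("Content-Type: text/html; charset=us-ascii"), "\n";  # prints 1
print is_html_content_type("Content-Type: text/plain"), "\n";                   # prints 0
```

In the script, a check like this would replace the test for <x-html> and would have to run while the header block is still being read.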

The script picks out HTML tags and counts them.  Pretty simple.  I 
kept feeding it my email, having it print out the lines the tags were 
on, until I had a rough idea of which tags were auto-generated and 
which seemed to indicate manual generation.  (Note that "manual" also 
includes "pasted from a web page".)  I made a couple of special cases. 
End tags aren't counted.  A tag isn't counted if text matching [a-z]+: 
follows it on the same line (e.g. formatting of forwarded messages).  
Font tags are counted only if they change the size or face of a 
previous font tag.
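
The core of that counting step can be sketched as follows. This is a simplified illustration, not the script itself: the helper name count_start_tags is hypothetical, and the font rule, the mail-field exception, and the auto-generated/hand-generated tag lists are all left out.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: count start tags in one chunk of HTML.
# End tags are skipped, attributes are dropped, names are lowercased.
sub count_start_tags {
    my ($html) = @_;
    my %count;
    while ($html =~ /<([^>]+)>/g) {
        my $tag = lc $1;
        next if $tag =~ m{^/};                   # end tags aren't counted
        $tag =~ s/^[\s!]+//;                     # leading whitespace or "!"
        $tag =~ s/\s.*//s;                       # drop attributes
        next unless $tag =~ /^[a-z][a-z\d]*$/;   # ignore odd-looking tags
        ++$count{$tag};
    }
    return \%count;
}

my $c = count_start_tags('<b>Hi</b> <FONT size="2">there</font> <b>!</b>');
printf "%s=%d\n", $_, $c->{$_} for sort keys %$c;
# prints:
# b=2
# font=1
```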

These numbers may differ somewhat from my previous quick summary.
The previous attempt was very quick and dirty with grep, and was 
prone to double-counting messages that contained forwarded messages.
That inflated the number of HTML messages in the count.

The Numbers
74,405 total messages
28,295 plain text          (38%)
24,429 auto-generated html (33%)
21,681 hand-generated html (29%)

796 of the auto-generated html messages had attachments
3,750 of the hand-generated html messages had attachments
(That's a little misleading, because I counted an img tag as hand-generated.)
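
As a quick cross-check, the shares can be recomputed from the three message counts above. This is only a sketch: the percent helper is hypothetical, and the only data used are the counts already listed.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Message counts copied from the summary above.
my %counts = (
    'plain text'          => 28_295,
    'auto-generated html' => 24_429,
    'hand-generated html' => 21_681,
);

my $total = 0;
$total += $_ for values %counts;

# Hypothetical helper: share of $n in $total, rounded to a whole percent.
sub percent {
    my ($n, $total) = @_;
    return sprintf('%.0f', 100 * $n / $total);
}

for my $type (sort { $counts{$b} <=> $counts{$a} } keys %counts) {
    printf "%-20s %6d (%s%%)\n", $type, $counts{$type},
        percent($counts{$type}, $total);
}
print "total $total\n";
```

The three categories sum to the 74,405 total, so no message is counted in more than one category.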

Raw Information
Here are the tags, with counts, that I included in the 
"hand-generated" category, so long as they didn't fall under any of the 
above exceptions.
13619   b
8067    font
5682    tr
5564    td
5518    table
5489    i
3541    strong
3428    img
3329    u
3320    tbody
1398    em
1026    center
248     ul
238     tt
202     li
164     h1
109     sup
80      code
72      h3
71      h2
70      map
67      small
55      area
48      bgsound
45      big
43      ol
38      base
35      h5
35      h4
34      nobr
30      h6
25      th
24      embed
22      link
13      param
12      caption
6       col
5       strike
4       fontfamily
3       noindex
3       excerpt
3       bigger
2       object
2       noscript
2       colgroup
2       blink
1       paraindent
1       noembed
1       label
1       headline
1       embossed
1       cite
1       byline
1       blackface

And here, for what it's worth, is a count of the X-Mailer 
information, with version information stripped out.  I've skipped 
anything with fewer than 10 entries.
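
The version stripping is done by a single substitution in the script below. Pulled out on its own, it behaves roughly like this. The wrapper normalize_mailer is hypothetical and the mailer strings are made-up examples, not values from the sample.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical wrapper around the script's substitution: lowercase the
# X-Mailer value, then drop everything from the first digit, comma, slash,
# or open parenthesis onward, along with any whitespace just before it.
sub normalize_mailer {
    my ($value) = @_;
    my $mailer = lc $value;
    $mailer =~ s/\s*[\d,\/\(].*//;
    return $mailer;
}

# Made-up example values, for illustration only.
print normalize_mailer('Example Mailer 1.2 (build 3)'), "\n";  # prints "example mailer"
print normalize_mailer('SomeClient/4.0'), "\n";                # prints "someclient"
print normalize_mailer('Plain Mailer'), "\n";                  # prints "plain mailer"
```

One consequence of this rule is that a product name containing a digit is cut off at that digit, so different products can collapse into one entry.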

And finally, here's the script.
#!/usr/bin/perl

use strict;

# Eudora does some odd mailbox munging.
# If you are using this on something other than Eudora for Mac, you'll want
# to change the line ending here, and you'll want to change the check for
# attachments (search for "Related" below) and the way I tell if something
# is html (<x-html>).
#
$/ = "\r";

my ($inmsg, $inhdr, %mailer, %tags, %msg, $inhtml, %type, $cnt);
my ($face, $size, $attachments, $msgs, $line);

$inmsg = 0;
while (<>) {
    ++$line;
    if (/^From /) {
        last if (++$cnt > 999999);
        if ($inmsg) {
            ++$msgs;
            if ($inhtml) {
                my ($hastag, $hasrel);
                foreach my $tag (keys %msg) {
                    if ($tag eq 'Related') {
                        ++$hasrel;
                        ++$attachments;
                    } else {
                        ++$hastag;
                        ++$tags{$tag};
                    }
                }
                if ($hastag) {
                    if ($hasrel) {
                        ++$type{handhtmlattach};
                    } else {
                        ++$type{handhtml};
                    }
                } else {
                    if ($hasrel) {
                        ++$type{autohtmlattach};
                    } else {
                        ++$type{autohtml};
                    }
                }
            } else {
                ++$type{text};
            }
        } else {
            $inmsg = 1;
        }
        $inhdr = 1;
        $inhtml = 0;
        $face = $size = undef;
        %msg = ();
    } elsif (/^\s*$/) {
        $inhdr = 0;
    } else {
        if ($inhdr) {
            if (/^x-mailer:\s*(.*)/i) {
                my $mailer = lc($1);
                chomp $mailer;
                $mailer =~ s/\s*[\d,\/\(].*//;  # nuke everything after a number
                ++$mailer{$mailer};
            }
        } elsif (/<x-html>/) {
            $inhtml = 1;
        }
        if ($inhtml) {
            next if (/IncrediMail/);
            ++$msg{Related} if (/^Related:/);
            chomp;
            s/\0//g;    # unicode
            s/[\r\n]/ /g;
            my $orig = $_;
            while (/<[^>]+>/) {
                my ($nface, $nsize);
                s/<([^>]+)>(.*)/$2/;    # strip the first tag; $1 = tag, $2 = rest
                my ($tag, $rest) = (lc($1), lc($2));
                next if ($tag =~ /\@/);
                $tag =~ s/^[\s!]+//;
                # we only count font tags if they change
                if ($tag =~ /^font.*face="?([a-z]+)/) {
                    $nface = $1;
                }
                if ($tag =~ /^font.*size="?([-\d]+)/) {
                    $nsize = $1;
                }
                $tag =~ s/\s.*//;

                # weird tags
                next if ($tag =~ /\d/ && $tag !~ /^h\d$/);
                next if ($tag =~ /[^a-z\d]/);

                # probably spam or virus slipped through, don't count it
                if (grep(/^$tag$/, qw(iframe form input select option script
                        textarea))) {
                    $inmsg = 0;
                    last;
                }

                # weird stuff
                next if (grep(/^$tag$/, qw(fwd left pm refcode)));
                next if ($tag =~ /^\s*$/);

                # tags we think are usually auto generated
                next if (grep(/^$tag$/, qw(span body div head html xmeta meta
                        style doctype xbody dl dd dt marquee address clock dir
                        comment spacer basefont bold xeta
                        br a p blockquote hr xml pre title xmp defanged_meta)));

                # probably highlighting a mail field
                next if ($rest =~ /[a-z]+:/);

                if ($nface) {
                    if ($face && $nface ne $face) {
                        ++$msg{$tag};
                        #print "## $orig\n";
                    }
                } elsif ($nsize) {
                    if ($size && $nsize != $size) {
                        ++$msg{$tag};
                        #print "## $orig\n";
                    }
                } else {
                    next if ($tag eq 'font');
                    #print "# $line: $tag\t$orig\n";
                    # Anything else we ignore
                    if (grep(/^$tag$/, qw(b font h1 h2 h3 h4 h5 h6 strong sup
                            tt big bgsound embed small li noscript area map ol
                            code base nobr caption th label col dir byline
                            headline param strike cite excerpt paraindent blink
                            object bigger blackface colgroup embossed fontfamily
                            noembed noindex
                            table tbody td tr u em i link img center ul))) {
                        ++$msg{$tag};
                    }
                }
                $face = $nface if ($nface);
                $size = $nsize if ($nsize);
            }
        }
    }
}

print "Messages: $msgs\n";
print "HTML Attachments: $attachments\n";
print "Types:\n";
open(S, "|sort -rn");
foreach my $type (sort keys %type) {
    print S "$type{$type}\t$type\n";
}
close(S);
print "\nTags:\n";
open(S, "|sort -rn");
foreach my $tag (sort keys %tags) {
    print S "$tags{$tag}\t$tag\n";
}
close(S);
print "\nMailers\n";
open(S, "|sort -rn");
foreach my $mailer (sort keys %mailer) {
    print S "$mailer{$mailer}\t$mailer\n";
}
close(S);

-- 
Kee Hinckley
http://www.messagefire.com/          Anti-Spam Service for your POP Account
http://commons.somewhere.com/buzz/   Writings on Technology and Society

I'm not sure which upsets me more: that people are so unwilling to accept
responsibility for their own actions, or that they are so eager to regulate
everyone else's.

_______________________________________________
Asrg mailing list
Asrg(_at_)ietf(_dot_)org
https://www1.ietf.org/mailman/listinfo/asrg


