Frankly, I think the question is pointless. Regardless of the answer,
we aren't going to change the default. Which raises the question of why
I just spent three hours attempting to answer it. Oh well. Someone
asked.
The Sample
Messages sent to wormalert(_at_)somewhere(_dot_)com over the period of about 15
months. These are messages sent by non-technical (pretty much by
definition) users who thought they were protecting their email from
viruses. They put the wormalert address in their address book
because someone forwarded a hoax to them, and they believed it. They
sent mail to it because they were emailing everyone in their address
book, were replying to someone who did, or mistakenly thought the hoax
said that they should cc/bcc the address every time. The
main point is that these are *not* mailing list or technical
messages. They tend to be forwarded jokes, conversations with
relatives, and other end-user to end-user email. Needless to say,
some of these messages are very private, so I'm not going to make the
repository publicly available.
These messages have been filtered to remove spam. Initially I did it
by hand (I used to reply to the hoaxes, and also hand reply to people
who ignored our automatic request to stop sending us email). For the
past six months they've been filtered by Messagefire's spam filtering
service. The false negative (missed spam) rate for Messagefire is
about 0.2%.
The Technique
I wrote a Perl program (included below) that parses the Eudora
mailboxes that hold the mail. It looks at each message and checks to
see if Eudora thought it was HTML (Eudora does some rather nasty
munging of email, tossing multipart alternative and other message
parts). The script is pretty generic. If you want to test it on a
non-Eudora mailbox, you'll want to change the line ending setting and
have it check the Content-Type header instead of the <x-html> tag to
determine the type of the message.
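For anyone adapting the script to a non-Eudora mailbox, here is a minimal sketch of the Content-Type check. The helper name is_html_message is hypothetical and not part of the script below, and treating multipart/alternative as HTML is an assumption (such messages usually, but not always, carry a text/html part).

```perl
use strict;
use warnings;

# Hypothetical helper: decide whether a message is HTML from its header
# block instead of from Eudora's <x-html> marker.
sub is_html_message {
    my ($headers) = @_;
    # Unfold continuation lines so a wrapped Content-Type becomes one line.
    $headers =~ s/\r?\n[ \t]+/ /g;
    return 0 unless $headers =~ /^content-type:\s*([^;\s]+)/mi;
    my $type = lc $1;
    # Assumption: multipart/alternative is counted as HTML.
    return ($type eq 'text/html' || $type eq 'multipart/alternative') ? 1 : 0;
}

print is_html_message("Content-Type: text/html; charset=us-ascii\n")
    ? "html\n" : "plain\n";
```

For a standard Unix mbox you would also set $/ to "\n" rather than "\r".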
The script picks out HTML tags and counts them. Pretty simple. I
kept feeding it my email and having it print out the lines the tags
were on until I had a rough idea of which tags were auto-generated and
which seemed to indicate manual generation. (Note that "manual" also
includes "pasted from a web page".) I made a few special cases.
End tags aren't counted. A tag isn't counted if the text following it
on the same line matches [a-z]+: (e.g. formatting of forwarded
message headers). Font tags are counted only if they change the size
or face set by a previous font tag.
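The first two special cases can be sketched in isolation. This is a simplified illustration, not the code from the full script: count_start_tags is a hypothetical helper, it works on a single line, and it skips font tags entirely instead of doing the size/face change check.

```perl
use strict;
use warnings;

# Hypothetical helper: count start tags on one line of text.
sub count_start_tags {
    my ($text) = @_;
    my %count;
    $text = lc $text;
    while ($text =~ /<([^>]+)>/g) {
        my $tag  = $1;
        my $rest = substr($text, pos($text));   # text after this tag
        next if $tag =~ m{^/};                  # end tags aren't counted
        next if $rest =~ /[a-z]+:/;             # "From:", "Subject:", ...
        $tag =~ s/\s.*//s;                      # drop attributes
        next if $tag eq 'font';                 # needs the change check
        next unless $tag =~ /^[a-z][a-z\d]*$/;  # skip odd or malformed tags
        ++$count{$tag};
    }
    return \%count;
}

my $counts = count_start_tags('<b>hello</b> <i>there</i>');
print "$_: $counts->{$_}\n" for sort keys %$counts;
```

Note that the "word followed by a colon" test looks at everything after the tag, so a URL such as http: later on the same line also suppresses the count. The full script behaves the same way.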
These numbers may differ somewhat from my previous quick summary.
The previous attempt was very quick and dirty with grep, and was
prone to double-counting messages that contained forwarded messages.
That inflated the number of HTML messages in the count.
The Numbers
74,405 total messages
28,295 plain text (38%)
24,429 auto-generated html (33%)
21,681 hand-generated html (29%)
796 of the auto-generated html messages had attachments
3,750 of the hand-generated html messages had attachments
(That's a little misleading, because I counted an img tag as hand-generated.)
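The shares can be recomputed from the three counts above. A small sketch (the percentages helper is hypothetical and only does the arithmetic):

```perl
use strict;
use warnings;

# Counts taken from the summary above.
my %count = (
    'plain text'          => 28_295,
    'auto-generated html' => 24_429,
    'hand-generated html' => 21_681,
);

# Hypothetical helper: return the total and each category's share of it.
sub percentages {
    my (%c) = @_;
    my $total = 0;
    $total += $_ for values %c;
    my %pct = map { $_ => sprintf('%.1f', 100 * $c{$_} / $total) } keys %c;
    return ($total, \%pct);
}

my ($total, $pct) = percentages(%count);
printf "%-20s %6d  %5s%%\n", $_, $count{$_}, $pct->{$_} for sort keys %count;
print "total $total\n";
```

The three categories add up to the 74,405 total, and to one decimal place the shares are 38.0%, 32.8%, and 29.1%.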
Raw Information
Here are the tags with counts which I included in the
"hand-generated" category, so long as they didn't violate any of the
above exceptions.
13619 b
8067 font
5682 tr
5564 td
5518 table
5489 i
3541 strong
3428 img
3329 u
3320 tbody
1398 em
1026 center
248 ul
238 tt
202 li
164 h1
109 sup
80 code
72 h3
71 h2
70 map
67 small
55 area
48 bgsound
45 big
43 ol
38 base
35 h5
35 h4
34 nobr
30 h6
25 th
24 embed
22 link
13 param
12 caption
6 col
5 strike
4 fontfamily
3 noindex
3 excerpt
3 bigger
2 object
2 noscript
2 colgroup
2 blink
1 paraindent
1 noembed
1 label
1 headline
1 embossed
1 cite
1 byline
1 blackface
And here, for what it's worth, is a count of the X-Mailer
information, with version information stripped out. I've skipped
anything with fewer than 10 entries.
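The version stripping and the cutoff can be sketched on their own. The substitution is the one the script uses; normalize_mailer and tally_mailers are hypothetical helper names, and the sample header values are made up for illustration.

```perl
use strict;
use warnings;

# Lowercase an X-Mailer value and drop everything from the first digit,
# comma, slash, or open paren onward (same substitution as the script).
sub normalize_mailer {
    my ($value) = @_;
    my $mailer = lc $value;
    $mailer =~ s/\s+$//;
    $mailer =~ s/\s*[\d,\/\(].*//;
    return $mailer;
}

# Count normalized mailers and keep those seen at least $min times.
sub tally_mailers {
    my ($min, @values) = @_;
    my %seen;
    ++$seen{ normalize_mailer($_) } for @values;
    return { map { $_ => $seen{$_} } grep { $seen{$_} >= $min } keys %seen };
}

# Illustrative values only, not taken from the sample.
my $kept = tally_mailers(2, 'Foo Mail 1.0', 'Foo Mail 2.5 (beta)', 'Bar 3');
print "$kept->{$_}\t$_\n" for sort keys %$kept;
```

Because the cut happens at the first digit, a mailer whose name itself contains a number is truncated there too.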
And finally, here's the script.
#!/usr/bin/perl
use strict;

# Eudora does some odd mailbox munging.
# If you are using this on something other than Eudora Mac, you'll want to
# change the line ending here, and you'll want to change the check for
# attachments (search for "Related" below) and the way I tell if something
# is html (<x-html>).
#
$/ = "\r";

my ($inmsg, $inhdr, %mailer, %tags, %msg, $inhtml, %type, $cnt);
my ($face, $size, $attachments, $msgs, $line);

$inmsg = 0;
while (<>) {
    ++$line;
    if (/^From /) {
        # Start of a new message: tally the one we just finished.
        last if (++$cnt > 999999);
        if ($inmsg) {
            ++$msgs;
            if ($inhtml) {
                my ($hastag, $hasrel);
                foreach my $tag (keys %msg) {
                    if ($tag eq 'Related') {
                        ++$hasrel;
                        ++$attachments;
                    } else {
                        ++$hastag;
                        ++$tags{$tag};
                    }
                }
                if ($hastag) {
                    if ($hasrel) {
                        ++$type{handhtmlattach};
                    } else {
                        ++$type{handhtml};
                    }
                } else {
                    if ($hasrel) {
                        ++$type{autohtmlattach};
                    } else {
                        ++$type{autohtml};
                    }
                }
            } else {
                ++$type{text};
            }
        } else {
            $inmsg = 1;
        }
        # Reset per-message state.
        $inhdr = 1;
        $inhtml = 0;
        $face = $size = undef;
        %msg = ();
    } elsif (/^\s*$/) {
        # A blank line ends the header block.
        $inhdr = 0;
    } else {
        if ($inhdr) {
            if (/^x-mailer:\s*(.*)/i) {
                my $mailer = lc($1);
                chomp $mailer;
                $mailer =~ s/\s*[\d,\/\(].*//; # nuke everything after a number
                ++$mailer{$mailer};
            }
        } elsif (/<x-html>/) {
            $inhtml = 1;
        }
        if ($inhtml) {
            next if (/IncrediMail/);
            ++$msg{Related} if (/^Related:/);
            chomp;
            s/\0//g; # unicode
            s/[\r\n]/ /g;
            my $orig = $_;
            # Strip one tag per pass; $rest is the text following it.
            while (/<[^>]+>/) {
                my ($nface, $nsize);
                s/<([^>]+)>(.*)/$2/;
                my ($tag, $rest) = (lc($1), lc($2));
                next if ($tag =~ /\@/);
                $tag =~ s/^[\s!]+//;
                # we only count font tags if they change
                if ($tag =~ /^font.*face="?([a-z]+)/) {
                    $nface = $1;
                }
                if ($tag =~ /^font.*size="?([-\d]+)/) {
                    $nsize = $1;
                }
                $tag =~ s/\s.*//;
                # weird tags (this also drops end tags, which start with "/")
                next if ($tag =~ /\d/ && $tag !~ /^h\d$/);
                next if ($tag =~ /[^a-z\d]/);
                # probably spam or virus slipped through, don't count it
                if (grep(/^$tag$/, qw(iframe form input select option script
                                      textarea))) {
                    $inmsg = 0;
                    last;
                }
                # weird stuff
                next if (grep(/^$tag$/, qw(fwd left pm refcode)));
                next if ($tag =~ /^\s*$/);
                # tags we think are usually auto generated
                next if (grep(/^$tag$/, qw(span body div head html xmeta meta
                    style
                    doctype xbody dl dd dt marquee address clock dir comment
                    spacer basefont bold xeta
                    br a p blockquote hr xml pre title xmp defanged_meta)));
                # probably highlighting a mail field
                next if ($rest =~ /[a-z]+:/);
                if ($nface) {
                    if ($face && $nface ne $face) {
                        ++$msg{$tag};
                        #print "## $orig\n";
                    }
                } elsif ($nsize) {
                    if ($size && $nsize != $size) {
                        ++$msg{$tag};
                        #print "## $orig\n";
                    }
                } else {
                    next if ($tag eq 'font');
                    #print "# $line: $tag\t$orig\n";
                    # Anything else we ignore
                    if (grep(/^$tag$/, qw(b font h1 h2 h3 h4 h5 h6 strong sup
                        tt big
                        bgsound embed small li noscript area map ol code
                        base
                        nobr base caption th label col dir byline headline
                        param strike cite excerpt paraindent blink object
                        bigger blackface colgroup embossed fontfamily
                        noembed noindex
                        table tbody td tr u em i link img center ul))) {
                        ++$msg{$tag};
                    }
                }
                $face = $nface if ($nface);
                $size = $nsize if ($nsize);
            }
        }
    }
}

# Note: a final message with no following "From " line is not tallied.
print "Messages: $msgs\n";
print "HTML Attachments: $attachments\n";

print "Types:\n";
open(S, "|sort -rn");
foreach my $type (sort keys %type) {
    print S "$type{$type}\t$type\n";
}
close(S);

print "\nTags:\n";
open(S, "|sort -rn");
foreach my $tag (sort keys %tags) {
    print S "$tags{$tag}\t$tag\n";
}
close(S);

print "\nMailers\n";
open(S, "|sort -rn");
foreach my $mailer (sort keys %mailer) {
    print S "$mailer{$mailer}\t$mailer\n";
}
close(S);
--
Kee Hinckley
http://www.messagefire.com/ Anti-Spam Service for your POP Account
http://commons.somewhere.com/buzz/ Writings on Technology and Society
I'm not sure which upsets me more: that people are so unwilling to accept
responsibility for their own actions, or that they are so eager to regulate
everyone else's.
_______________________________________________
Asrg mailing list
Asrg(_at_)ietf(_dot_)org
https://www1.ietf.org/mailman/listinfo/asrg