----- Original Message ----
Ok, I'm just kind of throwing this idea out. It has been well over a
year since I last looked into the hashing systems used by DCC/Razor
and I have not seen any of the design work that went into the nowsp
method, so maybe this has already been done.
Though I don't know for sure, I'd guess that both of these schemes try to
ignore non-human-readable message body text. If they don't, then they
probably lost their effectiveness in 2002 or 2003. The spammers have spent a lot
of time thinking up cool ways to insert hashbusting goop into a message
without alerting the user:
- text in zero point font
- white font on white background
- $color font on $color background
- $color font on $similar_color background
- text in invalid html tags
- text in html comments
- text in the alt attribute of a 1x1 invisible gif
etc. Also, I'd similarly guess that these algorithms do some sort of
sub-document-level processing -- a la shingling or fuzzy hashing, where only
parts of documents are chosen to perform a match.
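
To make the two guesses above concrete, here is a rough sketch in Python. It is
not how DCC, Razor, or nowsp actually work (I haven't checked); every name in it
(VisibleText, visible_words, shingles, jaccard, similarity) is made up for
illustration. It only drops HTML comments and tag attributes (so alt text and
goop inside invalid tags disappear), then compares messages by overlapping
3-word shingles. Zero-point fonts and same-color text would need real style
analysis, which this does not attempt.

```python
from html.parser import HTMLParser


class VisibleText(HTMLParser):
    """Collect text nodes only; comments and tag attributes are dropped."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Comments go to handle_comment (a no-op by default) and attribute
        # values such as alt="..." never reach this method.
        self.chunks.append(data)


def visible_words(html):
    """Return the lowercased words found in the text nodes of html."""
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks).lower().split()


def shingles(words, k=3):
    """Return the set of overlapping k-word tuples (assumed shingle size 3)."""
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def similarity(html_a, html_b, k=3):
    """Shingle-based similarity of two HTML bodies, between 0.0 and 1.0."""
    return jaccard(shingles(visible_words(html_a), k),
                   shingles(visible_words(html_b), k))
```

Two copies of the same message that differ only in comment or alt-text goop
come out identical under this measure, while a copy with some extra visible
words still scores well above zero. A whole-message hash would treat all of
them as different.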
I do think it's clever to think about re-using proven technology here though,
unless there is going to be a real test of different canonicalization schemes.
miles