On Thu, 9 Feb 2006, Jon Kyme wrote:
William Leibzon:
I actually have considerably larger plans for all this, and a reputation
database of single IPs is just the first step. One of the things I ran
into is deciding what algorithm to use for calculating the mean value
in real time.
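For reference, one common way to keep a mean "in real time" is an incremental update that needs only the previous mean and a count, so no score history has to be stored. The sketch below is only an illustration under that assumption; the class name RunningMean and its fields are hypothetical and are not taken from the reputation database discussed in this thread.

```python
class RunningMean:
    """Incrementally updated arithmetic mean; no score history is kept."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0

    def update(self, score):
        # Standard incremental form: new_mean = old_mean + (score - old_mean) / n
        self.count += 1
        self.mean += (score - self.mean) / self.count
        return self.mean


rep = RunningMean()
for s in [5.1, 5.0, 5.2, 0.1]:
    rep.update(s)
print(round(rep.mean, 2))
```

Each update costs constant time and memory per IP, which is why this form is often preferred over re-summing all stored scores.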
I am not a statistician, or indeed, any kind of mathematician, but I wonder
if a mean is really what you want? Aren't you kind of assuming that scores
are cardinal-ish if you're taking a mean?
That is why I said it looks like a stochastic process and that I was not
sure whether using a mean function is appropriate. It should also be noted
that since I'm using quantized (x.y) scoring data, the sample space can be
considered a finite set, though I can't yet decide if or how this would help.
Are they? Do you really want a
set of scores like [5.1, 5.0, 5.2, 0.1] to give the same rep. (arithmetic
mean = 3.85) as the set [3.8, 3.9, 4.0, 3.7]?
Testing will show if this concept works. But for now, yes, I do want
them to give the same or a similar score; I think that with a larger
sample this would give fairly accurate information.
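To make the comparison concrete, the two example sets can be checked with Python's standard statistics module. This is only an illustrative sketch: the median and population standard deviation are shown purely for contrast with the mean and are not proposed anywhere in this thread as the scoring method.

```python
from statistics import mean, median, pstdev

# The two score sets from the example above
a = [5.1, 5.0, 5.2, 0.1]
b = [3.8, 3.9, 4.0, 3.7]

for name, scores in (("a", a), ("b", b)):
    # Same arithmetic mean for both sets, but different median and spread
    print(name,
          "mean=%.2f" % mean(scores),
          "median=%.2f" % median(scores),
          "stdev=%.2f" % pstdev(scores))
```

Both sets come out with the same mean, while the single 0.1 outlier in the first set shows up only in the median and the spread.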
I know you could pick another average to get something better looking, but
I wonder if it would be more useful to refer to your threshold and count
overs/unders, spam/ham, whatever.
Counting spam/ham (and sometimes using spam/ham probabilities) is what
Bayesian filters do. While this is somewhat similar, I'm not trying to
build something based on fuzzy logic but on an arithmetic score. The mean
score can thereafter be used as part of the filtering process, but not as
a yes/no answer.
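For contrast with the mean, the over/under counting suggested above could look roughly like the sketch below. The THRESHOLD value and the function name over_under are hypothetical placeholders chosen for illustration; no threshold is given in this thread, and the sketch does not say which side of it would count as spam.

```python
THRESHOLD = 3.0  # hypothetical cut-off; not a value from this thread


def over_under(scores, threshold=THRESHOLD):
    """Return (overs, unders): counts of scores at/above and below threshold."""
    overs = sum(1 for s in scores if s >= threshold)
    return overs, len(scores) - overs


print(over_under([5.1, 5.0, 5.2, 0.1]))  # (3, 1)
print(over_under([3.8, 3.9, 4.0, 3.7]))  # (4, 0)
```

With this placeholder threshold the two example sets, which share a mean of 3.85, give different counts.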
[BTW - it may well be appropriate to think of every email as not being
either spam or ham, but to use fuzzy logic and apply it to every email,
i.e. an email can be considered ham for some and spam for others. Some new
Bayesian filters (SpamBayes in particular) appear to do that. This could
be a good research topic for discussion on this list]
I don't know if you ever saw Mark Langston's (abandoned?) GOSSiP stuff? I
thought that there were some good ideas in there, although I guess you
wouldn't be interested in the distributed/co-operative aspects. I'm sure
there's some stuff on sourceforge.
Distributed operation is one of the next steps after the initial trial.
The WPBL work could also prove useful there.
---
William Leibzon
mailto: william(_at_)completewhois(_dot_)com
Anti-Spam and Email Security Research Worksite:
http://www.elan.net/~william/emailsecurity/
Whois & DNS Network Investigation Tools:
http://www.completewhois.com
_______________________________________________
Asrg mailing list
Asrg(_at_)ietf(_dot_)org
https://www1.ietf.org/mailman/listinfo/asrg