ietf

Re: Proposal to define a simple architecture to differentiate legitimate bulk email from Spam (UBE)

2003-09-08 20:20:34

After this issue, I am probably moving the thread to IRTF (as suggested) if 
possible (but probably after taking a break to do some other work).


Information theory says that such things are impossible.  One can not
construct a spam-free protocol because this is the same problem as
constructing a system free of covert channels, which information theory
says is impossible.


But information theory also says you can optimize signal-to-noise ratio,
but only if you know what the characteristics of your signal are.

It actually doesn't say that precisely. It says that you can transmit a
signal with an arbitrarily low error rate at a speed below the channel
capacity.
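
For reference, one concrete instance of that statement is the Shannon-Hartley theorem. It assumes a band-limited channel with additive white Gaussian noise, which is an assumption nobody in this thread has made about email. Here C is the capacity in bits per second, B the bandwidth in hertz, and S/N the linear (not dB) signal-to-noise power ratio:

```latex
C = B \log_2\!\left(1 + \frac{S}{N}\right)
```

Any rate R < C is achievable with arbitrarily small error probability; no rate above C is.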

The concrete task of altering the signal to noise ratio is accomplished by
enhancing the signal with a harmonic oscillator, so that it is stronger
than the noise.


Agreed.

And thus on a "conceptual level", you have to have some idea about the signal 
characteristics in order to enhance it.

Actually if I remember correctly, your example is how it applies to periodic 
signals.  The general case is more abstract.


This is then described as a set of differential equations
that can be optimized with variational methods.  The limits of this
process are indicated by information theory, the Nyquist theorem, etc.


Add Shannon entropy, chaos, etc...


If the channel isn't described by a Fourier series, then the differential
equations may not be solvable, and it may be impossible to optimize its
signal to noise ratio. (Well, there are other mathematical methods, but
you get the point.)


Yes, that is what I meant by saying the general case is more abstract; I was 
talking on a "conceptual" or abstract level.


You are borrowing the concepts by metaphor, but the
concrete methods don't transfer well.


I was only using it to say we must define the signal as it appears in the 
channel before we can do any research on it in the channel.

The way spam is currently defined, as UBE (instead of my proposed *BE), means 
you can only model the signal at the end point.  Given that the end point is 
the receiver's subjective mind, that is not all that useful for research, 
unless you want to get into very fuzzy science such as psychology.  If you want 
to make the point about practicality, then that is a very strong one!


My point is not to discourage you from trying to stop spam,


You are only 1 of 3 people so far at IETF who has said that to me.  The rest 
who have commented have tried to discourage me.  So thank you.


but to focus
your attention on detection, rather than protocol alteration.  It is
impossible to alter the protocol in any way that will force the spammer to
identify themselves a-priori as a spammer.


Disagree strongly.  The first benefit is that once you define spam == *BE 
(instead of UBE), it is easier to model spam and do research on it, because you 
can model it at any node in the channel, not only at the receiver end point.  
That was my whole point about "enforcers".

However, there is a problem.  Some *BE is solicited.  Which is why I proposed 
moving the solicited *BE to another channel ("pull").

Your point is that it is futile to define a protocol that will separate the 
solicited from the unsolicited, because spammers will always be able to subvert 
the protocol.  And you go on to say that there are thus no benefits to 
detection.  I strongly disagree.  There are two aspects to my response:

1. Spam coming thru the alternate "pull" channel can be modeled differently 
than spam defined as *BE.  This separation of models provides benefits over 
trying to model spam as UBE in the receiver's mind (end point).  Another person 
in this thread has provided one specific example, which is that the "pull" 
delay gives a whole new dynamic to detection.  Also, I have pointed out that 
the membership quality of the solicited channel gives it unique modeling 
advantages.

2. Spam coming thru the existing channel can then be modeled as *BE at any node 
of the channel, instead of as UBE.  Some nodes have a much better model of spam 
in this definition, than the one at the end point.  For example, ISPs can see a 
lot more abuse data in real-time, than a single receiver or the current 
inherently more clumsy attempts to group or poll receivers.
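
To make point 2 a little more concrete, here is a minimal sketch of how a relay node might flag bulk email without any judgment about whether it was solicited. It is only an illustration, not a proposed implementation: the function names, the normalization step, and the threshold of 3 are all hypothetical, and a real node would need something far more robust than an exact hash of the body.

```python
import hashlib
from collections import defaultdict


def body_fingerprint(body: str) -> str:
    """Hash a case- and whitespace-normalized body (hypothetical normalization)."""
    normalized = " ".join(body.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def find_bulk(messages, threshold=3):
    """Map each fingerprint to its distinct-recipient count, if >= threshold.

    `messages` is an iterable of (recipient, body) pairs as a relay node
    might observe them; `threshold` is an arbitrary illustrative cutoff.
    """
    recipients = defaultdict(set)
    for recipient, body in messages:
        recipients[body_fingerprint(body)].add(recipient)
    return {
        fp: len(rcpts)
        for fp, rcpts in recipients.items()
        if len(rcpts) >= threshold
    }


# Made-up traffic: three copies of one message, one unrelated message.
traffic = [
    ("alice@example.org", "Buy   NOW, limited offer"),
    ("bob@example.org", "buy now, limited offer"),
    ("carol@example.org", "Buy now, limited offer "),
    ("dave@example.org", "Lunch on Friday?"),
]
bulk = find_bulk(traffic, threshold=3)
```

The point of the sketch is that "bulk" is measurable at the node from traffic alone, whereas "unsolicited" is only known at the end point.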

Hopefully that will set the record straight that I am thinking about spam in 
new conceptual ways...and not rehashing as others have claimed...


You could ask for spammers to cooperatively self-mark their messages.
But this hasn't been terribly productive.


Obviously I am not asking for that or anything like that.  See above.


 It is also pointless to ask for
cooperative identification of non-spammers and identify spammers as those
not in the set of non-spammers.


I am also not asking for this, and it is instructive to understand how I am not.

I am only making a definition, so that one can model under the benefits of that 
definition.  What people actually do is a different matter, but as I pointed 
out previously in this thread, once you model spam the way I have proposed, 
then solicited *BE will gain a distinct advantage by adopting the model.  And as I 
point out above, it doesn't matter what spammers do, because the improved model 
is helpful for advancing detection in both cases.

And my other point has been that when a channel gets so saturated with noise 
that you can no longer find the original signal reliably (as you say above the 
S/N ratio will depend on Nyquist, which is a very crucial point), then 
solicited *BE and receivers are going to need a different model, else 
information transmission will no longer occur reliably.


So given a set of unmarked messages, some spam, some not-spam, the task is
to have a program mark them in the same way that a human would if a human
were reading the messages. Since humans have different definitions of
spam, it would be useful if the program could accept different definitions
as well.  This is the realm of content analysis.


You see this is the crux of the whole stagnation of anti-spam in my view.  
Content has nothing to do with what makes spam annoying.  It is the S/N factor, 
i.e. that it only gets a 0.005% response rate.

I am trying to shift the whole paradigm from thinking about psychology (which 
will always give a fuzzy result), to thinking about and modeling the noise factor.

It is a profound paradigm shift that gets you closer to a more robust solution 
for detection.


Thus my whole motivation for an unambiguous definition (spam == all bulk
email) along the channel and not just a definition at the end points
(UBE).

You may need a precise definition before you can begin implementation
(just like you need a definition of voltage, current, etc to begin
building a transmitter),


Exactly.  You need a definition before you can model.


but you do not need a precise definition to talk
about the theoretical aspects.


Yes you do.


 Spam could be defined as UCE, CE, UBE, or
BE.  I have also a more complete and detailed taxonomy of spam:


Those are all definitions.



There are 3 types of email that we generally call spam:


This goes down the psychological line of modeling, which I am trying to shift 
the paradigm away from, because it is not very well correlated with what makes 
spam a problem.  If spam had a 5% response rate, it would no longer be a 
problem.  Modeling the psychology is something other people are working on 
already.
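
To put the two response rates mentioned in this message (0.005% and 5%) side by side, here is a small arithmetic sketch. The function name is hypothetical, and the only thing it computes is how many messages are sent, on average, per response at a given rate.

```python
from fractions import Fraction


def messages_per_response(rate_percent: str) -> int:
    """Average number of messages sent per response, for a rate given in percent.

    Fraction is used instead of float so the division is exact.
    """
    rate = Fraction(rate_percent) / 100
    return int(1 / rate)


print(messages_per_response("0.005"))  # 20000
print(messages_per_response("5"))      # 20
```

So at 0.005%, roughly 19,999 of every 20,000 messages are noise to their recipients, versus 19 of every 20 at 5%.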

[snip]

Thanks,
Shelby Moore
http://AntiViotic.com



