Re: Poll: Should mail archives hide mail addresses

2004-01-02 01:27:19
Finally, Chuq had a good point about requirements changing over
time. In the future, MHonArc may want to move towards encouraging more
semantic markup

The problem with this approach is that it won't work with text-based
browsers.  Accessibility is something I try to maintain,

Sure it will. Jeffrey Zeldman has a lot of useful information on how to be accessible and compliant by degrading gracefully. You can start here: to get a first cut on this. The idea is to build things that use XHTML/CSS such that if certain features aren't supported by a browser, the site does the "right thing" instead of simply breaking, and does it without building multiple versions with browser sniffing. And accessible means more than sight-limited; it means alternative browsing tools, like my phone's mini-browser, and search engines like Google.

So accessibility is good. CSS/XHTML is good. And since MHonArc gets used on so many sites where people have to skin an interface onto it, I think moving to those models is a great idea (and basically a no-brainer), once you get past a bunch of the myths about those tools.

I first thought of using libgd to have addresses changed into CGI
links that generate an image on the fly showing the email address.
I.e., harvesters would have to use OCR to get the address.

And there's evidence that some harvesters are experimenting in that direction. After all, it's only CPU time, and they're infinitely patient. Even if they only get a 10-15% hit rate on OCR conversions, that merely means they have to hit the site 10 times to get everything. That was the ultimate failure of the Slashdot "random" obfuscation tool: spammers didn't have to break all of the variants, just enough of them to get useful data, and then cycle through the site enough times to get around the versions they didn't crack. It took about a week.
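
As a back-of-the-envelope sketch of that arithmetic: assume (my assumption, not something measured in this thread) that every visit sees a freshly randomized obfuscation and that each address is cracked independently with the same per-visit hit rate. The function names are made up for illustration.

```python
import math


def fraction_recovered(hit_rate: float, passes: int) -> float:
    """Expected share of addresses cracked at least once after `passes` visits."""
    return 1.0 - (1.0 - hit_rate) ** passes


def passes_needed(hit_rate: float, target: float) -> int:
    """Smallest number of visits whose expected coverage reaches `target`."""
    return math.ceil(math.log(1.0 - target) / math.log(1.0 - hit_rate))


# Hit rates taken from the 10-15% range mentioned above.
for rate in (0.10, 0.15):
    print(rate, round(fraction_recovered(rate, 10), 3), passes_needed(rate, 0.99))
```

Under those assumptions, ten passes at a 10% hit rate recover roughly two thirds of the addresses, and near-complete coverage takes a few dozen passes. Either way it is only a matter of repeat visits, which costs a patient harvester very little.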

Another alternative is to remove linking of addresses, and then
use an obfuscation technique like:


This way the address renders like "earl(_at_)example(_dot_)com" (and can be
copy-n-pasted by readers to their MUA), but a harvester may not
catch it.  Of course, a smart harvester that expands entity references
and deletes comment declarations would.
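
To show how little work that "smart harvester" takes, here is a minimal sketch. The obfuscated sample string is hypothetical (the example markup from the original message is not preserved here), and the `(_at_)` / `(_dot_)` substitutions are simply the rendered forms quoted above.

```python
import html
import re
from typing import List

# Hypothetical obfuscated markup: entity-encoded parentheses plus a comment
# declaration dropped into the middle of the address.
OBFUSCATED = "earl&#40;_at_&#41;<!-- nospam -->example&#40;_dot_&#41;com"

COMMENT_RE = re.compile(r"<!--.*?-->", re.DOTALL)
ADDRESS_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def harvest(markup: str) -> List[str]:
    """Undo comment/entity obfuscation, then pull out anything address-shaped."""
    text = COMMENT_RE.sub("", markup)   # delete comment declarations
    text = html.unescape(text)          # expand entity references
    text = text.replace("(_at_)", "@").replace("(_dot_)", ".")
    return ADDRESS_RE.findall(text)


print(harvest(OBFUSCATED))
```

Three string operations and a regular expression are enough to recover the address from this kind of markup.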

Be very wary of "fixes" that merely make the problem more difficult. As soon as the spammers have a financial incentive to crack them, they'll be cracked. You're basically trying to implement the "I don't have to outrun the bear, I just have to outrun you" solution, meaning you make your site tough enough to crack that they go harvest someone else's site.

In the case of MHonArc especially, that's a bad design choice. Since so many sites use MHonArc, any change you make to MHonArc will be a focus for the spammers to crack. MHonArc doesn't have the option of making it tough enough for the spammers to go elsewhere. So you risk putting energy into things that won't fix the problem for long (if at all), and worse, might create a false sense of security for developers and users of the tools.

My suggestion: don't get involved in any "solution" that merely makes it "harder" or "causes more work", because those only solve things as long as the spammers don't feel it's worth it. And if you get into an arms race with them, you'll lose. So you have to fix things in ways they can't crack, or you probably shouldn't fix them at all. Half-measures waste time and energy and give people a sense of comfort that is worse than doing nothing.

I don't believe any obfuscation setup is safe. Period. They may work today, but if they ever get adopted widely enough to annoy the spammers, they'll be broken. The spammers keep building huge farms of zombied machines for delivery (which is what's hosed the RBLs: the spammers have figured out how to get around them by changing their delivery methods and using stolen system access). If they can use a machine for zombie delivery of spam, they can use that machine for computational work, too, so you should assume the spammers have a roughly infinitely large cluster of machines they can use to throw cycles at whatever you build. Because they do.

I read a study dated March 2003 that showed that simple obfuscation
techniques actually work, but I think (and the study even states)
that it is likely only a matter of time before spammers adapt.

Most of them are broken now. Basically useless.

uses a POST form to obfuscate addresses, but it is straightforward
to customize a harvester to defeat it.

Anything with a large enough data set to warrant the spammers' attention will get it. MHonArc, sort of by definition, will be high on their lists.

Obfuscation is a waste of energy. It works only as long as the spammers don't bother worrying about it. Graphic representations are non-accessible, crackable (via OCR) and not easily used by end-users, so they not only don't solve the problem, they create new ones. JavaScript-based and POST-based stuff, ditto -- you break things on all sorts of systems in use today (like phone browsers) where people want access to that data, and it only holds off the spammers as long as they don't bother implementing a workaround. Those aren't solutions, just delaying tactics. Bad use of time.

Since text-only
browsers can still read the messages in the archives, is it okay that
they will not have the ability to determine the author's address if
an image-based solution is adopted?  Is this an acceptable limitation
weighed against the problem of spam?

I think a "guest" has no demand on access to sensitive data. I don't allow "guests" open access to private mail lists, for instance, and I see no reason why they should assume they should have access to it.

I think it's safe to extend that to data I consider sensitive or private. Just because we've always been open and that data has been accessible doesn't mean there's any requirement it remain so. After all, there was a time when few houses had locks on them, too. Times change. Not only do we lock doors and windows, we build gated communities.

I think the only safe way to do this is to make sure that this sensitive data is simply never in the data stream -- it's edited out before a user can get to it. If it's not there, it can't be de-obfuscated, it can't be reconstructed, it can't be reverse-engineered, because it's not there.
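
A minimal sketch of what "never in the data stream" could look like, assuming a filter that runs over message text before it is written to the public archive. The function name, placeholder text and address pattern are all made up for illustration; this is not an existing MHonArc feature, and the pattern is deliberately simple rather than a full RFC 2822 parser.

```python
import re

# Simple address-shaped pattern (an assumption for this sketch).
ADDRESS_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PLACEHOLDER = "[address removed]"


def strip_addresses(text: str) -> str:
    """Replace every address-shaped token so nothing is left to de-obfuscate."""
    return ADDRESS_RE.sub(PLACEHOLDER, text)


message = "From: Earl <earl@example.com>\nReply to earl@example.com for details."
print(strip_addresses(message))
```

The point of the design is that the replacement is lossy: there is no encoding of the address left in the page for a harvester to reverse.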

If people want more access, including that restricted data, then build a system to let them authenticate and be granted access. I think that's more or less beyond the scope of MHonArc, but strongly related to it. In a perfect world, however you authenticate yourself to the mailing list to prove that "you are you" for purposes of posting or accepting list mail is how you'd authenticate into the archives, too. That implies this is probably a list-server operation which pulls data out of MHonArc, not a MHonArc operation, unless you want to start tightly coupling all of these different pieces together. Which has advantages and disadvantages...

I'd probably argue against building data-stripping into MHonArc, but perhaps a group of MHonArc folks would be interested in building a separate-but-equal project (similar to mharc) to handle the delivery/stripping/authentication piece, with hooks that allow it to interface with other systems for authentication data, so it could, perhaps, use Mailman email addresses and passwords, or Sympa user data, to simplify things for the users a bit.