The weight of spam
As of this writing, I've received 71,120 unsolicited commercial e-mail messages in 2003, and that count is growing by approximately 500 messages per day. Keep in mind that I'm only tracking mail on my primary, personal e-mail account; the total would be a bit higher were I tracking receipts on all domains and addresses which I manage. (Some poor sots get 3,000 spam messages every day.)
Since January 1, I've not only been counting my spam receipts, I've been saving them. Right now, I've got an archive totalling some 297 megabytes of crap. My original thought was that the aggregate file would be handy to use as fodder for a spam filter, as a means of training it to recognize the dreck. But, truly, SpamAssassin has been doing a bang-up job for me since I began using it, and I can't fathom that feeding it 71,000 junk messages is going to make it appreciably better.
So, what should I do with all this spam? Come the end of the year, I'm likely to have 100,000 messages or more. Are there any fun analyses you'd like me to do? Any word frequency charts? Weird pattern searches? I'm open to suggestions (and have provided a comments link for this post), but keep in mind that my programming chops are pretty meager, so if you can't explain how to do what you're proposing in BBEdit or a few lines of script, it's probably beyond my ken.
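For what it's worth, a word-frequency chart really can be done in a few lines of script. Here is a minimal sketch in Python, assuming the archive is a single mbox file (the post doesn't say what format the spam is saved in). The function names `count_words` and `subject_word_counts` are made up for illustration, and it only looks at Subject lines to keep things small.

```python
import mailbox
import re
from collections import Counter

# Runs of three or more letters (apostrophes allowed) count as a "word".
WORD_RE = re.compile(r"[a-z']{3,}")


def count_words(text, counter=None):
    """Add the words found in text to counter (a new Counter if none given)."""
    counter = Counter() if counter is None else counter
    counter.update(WORD_RE.findall(text.lower()))
    return counter


def subject_word_counts(mbox_path):
    """Tally words across the Subject lines of every message in an mbox file.

    Assumption: mbox_path points at a single mbox-format archive.
    """
    counts = Counter()
    box = mailbox.mbox(mbox_path)
    try:
        for message in box:
            # str() copes with subjects that come back as Header objects.
            count_words(str(message.get("Subject", "")), counts)
    finally:
        box.close()
    return counts
```

Calling `subject_word_counts("spam.mbox").most_common(25)` (with `spam.mbox` standing in for wherever the archive actually lives) would return the 25 most frequent subject-line words along with their counts.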
Comments:
What's your spam-to-virus ratio these days? Does that say anything about the economics of spam?
What's the proportion where you'd say "no WAY would anyone ever BUY this"? Has that proportion gone up recently? Does that say anything about the economics of spam?
Of course, there is the trivially easy "How long and/or thick would Little Brad be if all of the manhood-enhancement spam were answered and actually worked"... but you might have done that analysis already.
I think Chris has pegged it. That's the data inquiring minds are clamouring for.
I'm the original developer of SpamAssassin (now just one of several). And I must say, wow, that's a lot of spam.
Have you got an always-on UNIX machine handy, with plenty of CPU time free? And do you use a sensible mail app, like pretty much anything apart from Outlook? Because a great way to give back to SA, especially with a collection of mail like that, is to run "mass-check" over it and help out with our rescoring runs. It's pretty easy -- but very UNIX-based.
The bonus, of course, is that SA optimises itself to handle your mail.
It's more or less impossible for anyone to figure out from the uploaded results what your mail might be about -- pretty much the only thing made visible is the names of your mail folders.
More details are at:
http://spamassassin.org/dist/masses/README
http://spamassassin.org/dist/masses/CORPUS_SUBMIT
We also do nightly runs -- i.e. you set up a crontab to run SA over your mail collection once per night, and it uploads the results for ongoing rule QA...
http://spamassassin.org/dist/masses/CORPUS_SUBMIT_NIGHTLY
cheers!
The only spam I seem to get is from someone trying to sell me either a new mortgage or a penis enlargement. I can't help but wonder whether one could finance a penis enlargement with a second mortgage?
Would your programming skills allow you to extract from your large spam collection additional synergies that today's e-marketers haven't yet thought of?