Fighting Spam May Be Easier Than You Think

advertisement
Fighting Spam May
Be Easier Than You
Think
Cynthia Dwork
Microsoft Research SVC
Why?

Huge problem
Industry: costs in worker attention, infrastructure
Individuals: increased ISP fees
Hotmail: huge storage costs, (was?) 65-85%
FTC: fraud, confidence crimes
Ruining e-mail, devaluing the Internet
2
Principal Techniques

Filtering






Everyone: text-based
Brightmail: decoys; rules updates
Microsoft Research: (seeded) trainable filters
SpamCloud: collaborative filtering
SpamCop, Osirusoft, etc: IP addresses, proxies, …
Make Sender Pay
Computation [Dwork-Naor’92; Back’97]
 Human Attention [Naor’96, SRC patent]
 Money [Gates’96] and Bonds [Vanquish]

3
Outline

The Computational Approach (“puzzles”)



Overview; economic considerations
Burrows’ suggestion: memory-bound functions
Social and Technical issues

Turing Tests

Architectures
Point-to-Point & Amanuensis
Centralized Ticket Server

Deeper Discussion of Memory-Bound Functions
3 Approaches and Small-Space Cryptanalyses
New Proposal (joint with A. Goldberg and M. Naor)
4
Computational Approach

If I don’t know you:
Prove you spent ten seconds CPU time,
just for me, and just for this message

User Experience:
Automatically and in the background
Checking proof extremely easy
 All
unsolicited mail treated equally
5
Economics

(80,000 s/day) / (10s/message) = 8,000 msgs/day

Hotmail’s billion daily spams:
 125,000
CPUs
 Up front capital cost just for HM: circa $150,000,000

The spammers can’t afford it.
6
Who Are the Spammers?
"Most of the spammers are not wealthy
people," said Stephen Kline, a lawyer for
the New York State attorney general's
office.
7
8
Who Are the Spammers?

Spamhaus ROKSO 100/90%

Circa 300 people total; very top few spammers
make a few million/year (F. Krueger, SMN; also,
see the recent articles about Alan Ralsky)

Comparison: FastClick, with 30% of popunder
market, has profit of $2 mil/yr (income of $4
mil/yr)

Biggest publicly traded company: eUniverse ($6
mil profit); list of 108 “good” names

Hit ratio: 10-6 to 10-5
9
Cryptographic Puzzles

Hard to compute; f(S,R,t,nonce) can’t be amortized
lots of work for the sender

Easy to check “z = f(S,R,t,nonce)”
little work for receiver

Parameterized to scale with Moore's Law
easy to exponentially increase computational cost, while
barely increasing checking cost

Can be based on (carefully) weakened signature
schemes, hash collisions

Can arrange a “shortcut” for post office
10
Burrows’ Suggestion

CPU speeds vary widely across machines, but
memory latencies vary much less (10+ vs 4)

So: design a puzzle leading to a large number of
cache misses
11
Social Issues

Who chooses f?
One global f? Who sets the price?
Autonomously chosen f’s?

How is f distributed (ultimately)?
Global f built into all mail clients? (1-pass)
Directory? Query-Response? (3-pass)
12
Technical Issues

Distribution Lists (!)

Awkward Introductory Period
Old versions of mail programs; bounces

Very Slow/Small-Memory Machines
Can implement “post office” (CPU), but:
Who gets to be the Post Office? Trust?

Cache Thrashing (memory-bound)

The Subverters
13
Turing Tests: Payment in Kind [N’96;SRC]

CAPTCHAs (Completely Automated Public T. test
for telling Computers and Humans Apart)
Defeat automated account generation
5-10% drop in subscription rate
Teams of conjectured-humans (8-hour shifts)

Yes: Distorted images of simple word

No: ``Find 3 words in image with multiple
overlapping images of words''

Others: subject classification, audio

M. Blum: people have done preprocessing
14
Social and Technical Issues

Social (especially in enterprise setting)





ADA, S.508 (blind, dyslexic)
Not ``professional''
Productivity cost: context switch (may be mitigated in
architecture supporting pre-computation)
Wrong direction: we should offload crap onto machines
Technical



No theory; if broken, can’t scale.
Idrive, AltaVista, broken [J]
EZ Gimpy (Yahoo!) broken [M+]
15
Turing Test Economic Issues

Human time in poor country has roughly
same cost as computer time (even if 8/5 wk)
 Also
need computer, but it can be slow, cheap
 10s same ballpark as some Turing tests

Suppose we’re wrong about cost (a few
hundredths of a cent), and really the price
should be 1 cent
 1-cent
challenge costs 2 minutes. No one not
being paid would dream of submitting to this.
16
Cycle Stealing

Stealing cycles from humans: Pornography
companies require random users to solve a
CAPTCHA before seeing the next image [vABL]

Worse for computational challenges

Economics: lots of currency means currency
is devalued. Prices go up.

There are lots of cycles, but anyone can buy
them. Can go into business brokering cycles.
17
Point-to-Point Architecture
Sender client
S
m, f(S,R,t,nonce)
Recipient client
R
(Ideal Message Flow)
Single-pass “send-and-forget”
Can augment with amanuensis to handle slow machines
Can add post office / pricing authority to handle money payments
Time mostly used as nonce for avoiding replays (cache tags,
discard duplicates; time controls size of cache)
18
Here to There (and There’)
m
Ignorant Sender
S



bounce
Spam-Protected
Recipient
release m
R
Three e-mail messages
R’s mail client caches m, S, R, t, nonce
Bounce (viral marketing opportunity)
html attachment with Java Script for f
contains parameters for f (S, R, t,nonce)
clicking on link causes computation, sending e-mail
(optional) link for download of client software
19
Point-to-Point: Issues

Social
Senders trust code (transition period)?
(Who gets to be a pricing authority?)

Technical
No pre-computation possible (slow machines)
Function update requires new download
Sender’s browser must be configured for
sending e-mail (transition period)
20
Remark: Web-Based Mail

Computation done by client (applet or
JavaScript)

Verification and safelist checking done by
server
21
Ticket Server
Ticket Server
[ABBW]
4,5
HTTP
Ticket
OK?
HTTP
Get
Ticket
Kit
1,2
Recipient
Server
3
(Ideal Message Flow)
Ticket kit = (#, puzzle)
Ticket = (#, response)
 Any payment method
 Tickets may be accumulated in
advance (pre-computation).
Useful if price is high.
 Refunds (not shown)
 Centralization eases updates
SMTP
MSG + Ticket
Sender
22
Social and Technical Issues

Social
Who gets to be a ticket server? Trust? Privacy?
Federation?
Trust code (transition period)

Technical
Complex; 5 flows (7 with refund)
T may be on different continent than S, R
Target of choice for subverters
23
Memory-Bound Functions [ABMW’02]

Devilish model!
assume cache size (of spammer’s big
machine) is half the size of memory (of
weakest sending machine)
 must avoid exploitation of locality, cache lines
(factors of 64 or even 16 matter)
 watch out for low-space crypto attacks

24
Meet in Middle (eg, using DES)
Input: x, y partially specified K1, K2
Output: K1, K2

Si = {keys consistent with Ki }

Create table T with all EK1(x),
(so |T| = |S1|)

8 K2 2 S2:
x
DES
K1
compute z=(EK2)-1 (y)
check if z 2 T

DES
y
K2
Optimism: need lots of space
to store T; won’t fit in cache;
hard to generate the z’s in
useful order
25
Analysis of Meet-in-the-Middle

Tradeoff: cache size c < |S1| yields time
proportional to |S1|¢|S2|/c with no cache misses

Sequentiality attacks [AB]: Once T is prepared,
cache-sized blocks can be loaded sequentially.

Luck: challenge may be resolved in cache.
Example: if c = |S1|/4 then there is a ¼ chance of
resolution with no further cache misses once the
cache has been loaded with the first block of T.
26
Subset Sum
Input: Random 2k-bit numbers a1, … , a2k, T
 Output: subset of ai’s summing to T mod 22k
 Optimism:

 LLL
lattice basis reduction
 dynamic programming: simultaneously O(2k)
time and space
Analysis: killed by Schroeppel and Shamir,
Simultaneous T=O(2k), S=O(2k/2)
27
Easy Functions [ABMW’02]
0 1 2 ...
(simplified construction)
 f: n bits to n bits, easy
 Given xk 2 range(f(k)), find a
pre-image with certain
properties
 Hope: best solved by building
table for f-1 and working back
from xk
 Choose n=22 so f -1 fits in
small memory, but not in
cache
 Optimism: xk is root of tree of
expected size k2
X0
...
...
2 n-1
Xk-1
Xk
28
Analysis of Easy Functions

Vulnerable to T/S tradeoffs for inverting
functions [Hellman; Fiat-Naor]
 TS2
= O(N2)cpuf
 Cost of inverting f without memory accesses is
only O(4) times cost of computing f (S=N/2); no
cache misses!
 As clock speeds increase, must increase cpuf
29
Interpretation:

Easy-Function challenges are cpu-bound, but
their structure (the cpu-avoiding, table-lookup
method) dampens the effect of Moore’s Law.

Verifier’s costs increase with Moore’s Law
k ¢ cpuf cycles
30
Our Approach
The sender makes a sequence of random
memory accesses and sends a proof of having
done so to the receiver, who needs a small
number of accesses for verification. The
memory access pattern leads to many cache
misses, making the computation memory-bound.
How to generate random-looking access pattern?
What is an easily verifiable proof?
31
High Level Description

Shared array of random numbers T

A path is a sequence of inherently sequential
accesses to T

Sender must search a set S of paths to find a
desired path

Process is message/recipient dependent

Sender sends a proof of finding desired path to
receiver, who must verify

Ratio of sender/receiver work is S
32
Incompressibility and RC4

Big (222) Fixed-Forever Truly Random T

Model walk on prg of RC4 stream cipher

Intuition:
Use the pr bit stream (seeded by message) to choose
entries of T to access.
These, in turn, affect the stream (defending against preprocessing to order T so as to exploit locality).
33
RC4 (Partial Description)
Initialize A to pseudorandom permutation
of {1, 2, …, 256}
using a seed K

Initialize Indices:
i = 0; j = 0

PR generation loop:
i=i+1
j = j + A[i]
Swap (A[i], A[j]);
Output z = A[A[i] + A[j]]
34
DGN: Initializing A and Basic Step

Fixed-Forever truly
random A of size 256;
32-bit words

Given m and trial
number i, compute
strong cryptographic
256-word mask M.

Set A = M © A.

c = A mod 222

Initialize Indices:
d=0; e=0

Walk:
d=d+1
e = e + A[d]
A[d] = A[d] © T[c]
Swap (A[d], A[e])
c = T[c] © A[A[e] + A[d]]
35
Side by Side
Generate pseudorandom permutation
of {1, 2, …, 256}
using a seed K

Fixed-Forever truly
random A of size 256;
32-bit words

Given m and trial
number i, compute
strong cryptographic
256-word mask M.

Set A = M © A.

c = A mod 222
36
Side by Side

Initialize Indices:

i = 0; j = 0

PR generation loop:
i=i+1
j = j + A[i]
Swap (A[i], A[j])
Output z = A[A[i]+A[j]]
Initialize Indices:
d=0; e=0

Walk:
d=d+1
e = e + A[d]
A[d] = A[d] © T[c]
Swap (A[d], A[e])
c = T[c] © A[A[e] + A[d]]
37
Full DGN

E: factor by which computation cost
exceeds verification

L: length of walk

Trial i succeeds if hash of A after ith walk
ends in 1+log2E zeroes (expected number
of trials is E).

Verification: trace the walk (L memory
accesses)
38
Parallel Attack [ABMW]
Process many paths in parallel; do
memory accesses in sorted order to take
advantage of cache lines
 Say attack succeeds if the number of
occupied cache lines is ¾ the number of
parallel paths (ie, reduce #lines by 1/3)
 Probabilistic analysis shows attack needs
over 217 paths, but only 214 different A’s fit
into 16MB cache

39
Parameters

Want ELt=10s, where
L = path length
t = memory latency = 0.2 microseconds

Reasonable choices:
E = 24,000, L = 2048
40
Intuition for Why it Works

Fight tradeoffs using incompressible T

Fight preprocessing of T by using message+trial
dependent A in computing walk

Fight preprocessing of A by modifying using T

Fight locality: choice of E and trial-dependence
of A

Good statistics for walks by analogy with
statistics for heavily-studied RC4 (?)
41
Lower Bound for Abstract Version

General formal model

Abstraction of our algorithm using
idealized hash functions

Proof that amortized number of cache
misses is linear in desired number
42
Summary

Discussed computational approach, Turing
tests, economics, cycle-stealing

Briefly mentioned two architectures

Examined difficulties of constructing
memory-bound pricing functions and
proposed a new one designed to avoid
these difficulties

Lower bounds for idealization.
43
Future Work and Open Questions

Stanford undergrads just completed
simple Outlook proxy and Unix filters; webbased implementation in progress; next
steps: handle bounces, add signatures

Distribution Lists

Good MB functions with short descriptions
(will subset product work)?
44
Download