Correlation Immune Functions and Learning
Lisa Hellerstein
Polytechnic Institute of NYU
Brooklyn, NY
Includes joint work with Bernard Rosell (AT&T), Eric Bach and David Page (U. of Wisconsin), and Soumya Ray (Case Western)
Identifying relevant variables from
random examples
f(x1, x2, ..., x10) = x2 ∨ x7 ∨ x10

x                        f(x)
(1,1,0,0,0,1,1,0,1,0)    1
(0,1,0,0,1,0,1,1,0,1)    1
(1,0,0,1,0,1,0,0,1,0)    0
2
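As a quick sanity check, here is the example function and the three labeled examples in Python (assuming f is the disjunction x2 ∨ x7 ∨ x10, which matches the labels shown):

```python
# The example function above: f(x1,...,x10) = x2 OR x7 OR x10.
# Variable names are 1-based as on the slide; list indices are 0-based.
def f(x):
    return x[1] | x[6] | x[9]   # x2, x7, x10

examples = [
    (1, 1, 0, 0, 0, 1, 1, 0, 1, 0),
    (0, 1, 0, 0, 1, 0, 1, 1, 0, 1),
    (1, 0, 0, 1, 0, 1, 0, 0, 1, 0),
]
print([f(x) for x in examples])  # → [1, 1, 0]
```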
Technicalities
• Assume random examples drawn from
uniform distribution over {0,1}n
• Have access to source of random
examples
3
Detecting that a variable is relevant
• Look for dependence between input variables and output:
If xi is irrelevant: P(f=1|xi=1) = P(f=1|xi=0)
If xi is relevant:   P(f=1|xi=1) ≠ P(f=1|xi=0)  (true for the previous function f)
4
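The dependence test above can be run empirically. A small Monte-Carlo sketch for the earlier example function (the helper name `estimate_gap` is illustrative, not from the talk):

```python
import random

# Estimate P(f=1 | xi=1) - P(f=1 | xi=0) from uniform random examples.
# Target here: f = x2 OR x7 OR x10 on n = 10 bits.
def f(x):
    return x[1] | x[6] | x[9]

def estimate_gap(f, n, i, m=50_000, rng=random.Random(0)):
    ones = zeros = pos_ones = pos_zeros = 0
    for _ in range(m):
        x = [rng.randint(0, 1) for _ in range(n)]
        y = f(x)
        if x[i]:
            ones += 1; pos_ones += y
        else:
            zeros += 1; pos_zeros += y
    return pos_ones / ones - pos_zeros / zeros

# A relevant variable (x2) shows a clear gap; an irrelevant one (x1) hovers near 0.
for i in (1, 0):
    print(i, round(estimate_gap(f, 10, i), 3))
```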
Unfortunately…
f(x1, x2, ..., x10) = parity(x2, x7, x10)

If xi is relevant:   P(f=1|xi=1) = 1/2 = P(f=1|xi=0)
If xi is irrelevant: P(f=1|xi=1) = 1/2 = P(f=1|xi=0)
Finding a relevant variable is easy for some functions, not so easy for others.
5
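The failure above can be verified exactly rather than by sampling. A short exhaustive check for parity(x2, x7, x10) on n = 10 bits:

```python
from itertools import product

# For each variable i, count the positive points of f on each half-cube
# x_i = 0 and x_i = 1. Equal counts mean the dependence test sees nothing.
def side_counts(f, n, i):
    counts = {0: 0, 1: 0}
    for x in product([0, 1], repeat=n):
        counts[x[i]] += f(x)
    return counts

f = lambda x: x[1] ^ x[6] ^ x[9]   # parity(x2, x7, x10)
# Every variable, relevant or not, has 256 positives on each side.
print([side_counts(f, 10, i) for i in (0, 1)])
```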
How to find the relevant variables
• Suppose you know r (the number of relevant variables)
  Assume r << n (think of r = log n)
• Get m random examples, where m = poly(2^r, log n, 1/δ)
• With probability > 1-δ, have enough info to
determine which r variables are relevant
– All other sets of r variables can be ruled out
6
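The rule-out idea on the next few slides can be sketched directly (the helper name `consistent_sets` is illustrative): a candidate set S of r variables is eliminated as soon as two examples agree on S but carry different labels.

```python
from itertools import combinations
import random

# Keep exactly those r-subsets S on which the examples are consistent:
# no two examples may agree on S yet have different labels.
def consistent_sets(examples, n, r):
    good = []
    for S in combinations(range(n), r):
        seen = {}
        if all(seen.setdefault(tuple(x[i] for i in S), y) == y
               for x, y in examples):
            good.append(S)
    return good

rng = random.Random(1)
n, r = 10, 3
f = lambda x: x[1] ^ x[6] ^ x[9]      # relevant variables: x2, x7, x10
examples = [(x, f(x))
            for x in ([rng.randint(0, 1) for _ in range(n)] for _ in range(200))]
# With enough random examples, only the true set (indices 1, 6, 9) survives.
print(consistent_sets(examples, n, r))
```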
x1 x2 x3 x4 x5 x6 x7 x8 x9 x10   f
(1, 1, 0, 1, 1, 0, 1, 0, 1, 0)   1
(0, 1, 1, 1, 1, 0, 1, 1, 0, 0)   0
(1, 1, 1, 0, 0, 0, 0, 0, 0, 0)   1
(0, 0, 0, 1, 1, 0, 0, 0, 0, 0)   0
(1, 1, 1, 0, 0, 0, 1, 1, 1, 1)   0
7
x1 x2 x3 x4 x5 x6 x7 x8 x9 x10   f
(1, 1, 0, 1, 1, 0, 1, 0, 1, 0)   1
(0, 1, 1, 1, 1, 0, 1, 1, 0, 0)   0
(1, 1, 1, 0, 0, 0, 0, 0, 0, 0)   1
(0, 0, 0, 1, 1, 0, 0, 0, 0, 0)   0
(1, 1, 1, 0, 0, 0, 1, 1, 0, 1)   0
8
x1 x2 x3 x4 x5 x6 x7 x8 x9 x10   f
(1, 1, 0, 1, 1, 0, 1, 0, 1, 0)   1
(0, 1, 1, 1, 1, 0, 1, 1, 0, 0)   0
(1, 1, 1, 0, 0, 0, 0, 0, 0, 0)   1
(0, 0, 0, 1, 1, 0, 0, 0, 0, 0)   0
(1, 1, 1, 0, 0, 0, 1, 1, 0, 1)   0

x3, x5, x9 can’t be the relevant variables
9
x1 x2 x3 x4 x5 x6 x7 x8 x9 x10   f
(1, 1, 0, 1, 1, 0, 1, 0, 1, 0)   1
(0, 1, 1, 1, 1, 0, 1, 1, 0, 0)   0
(1, 1, 1, 0, 0, 0, 0, 0, 0, 0)   1
(0, 0, 0, 1, 1, 0, 0, 0, 0, 0)   0
(1, 1, 1, 0, 0, 0, 1, 1, 1, 1)   0

x1, x3, x10 ok
10
• Naïve algorithm: Try all combinations of r variables. Time ≈ n^r
• Mossel, O’Donnell, Servedio [STOC 2003]
  – Algorithm that takes time ≈ n^(cr), where c ≈ .704
  – Subroutine: find a single relevant variable
Still open: Can this bound be improved?
11
If the output of f is dependent on xi, we can detect the dependence (whp) in time poly(n, 2^r) and identify xi as relevant.
Problematic Functions
Every variable is independent of output of f
P[f=1|xi=0] = P[f=1|xi=1] for all xi
Equivalently, all degree 1 Fourier coeffs = 0
Functions with this property said to be
CORRELATION-IMMUNE
12
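The Fourier characterization above can be checked numerically. A small sketch computing the degree-1 coefficients as E[(-1)^(f(x)+x_i)] under the uniform distribution:

```python
from itertools import product

# Degree-1 Fourier coefficient for variable i: f_hat({i}) = E[(-1)^(f(x) XOR x_i)]
# over uniform x. Correlation immunity <=> all degree-1 coefficients are 0.
def degree1_coeffs(f, n):
    coeffs = []
    for i in range(n):
        total = sum((-1) ** (f(x) ^ x[i]) for x in product([0, 1], repeat=n))
        coeffs.append(total / 2**n)
    return coeffs

parity = lambda x: x[0] ^ x[1] ^ x[2]
majority = lambda x: int(sum(x) >= 2)
print(degree1_coeffs(parity, 3))    # → [0.0, 0.0, 0.0]  (correlation-immune)
print(degree1_coeffs(majority, 3))  # → [0.5, 0.5, 0.5]  (not correlation-immune)
```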
P[f=1|xi=0] = P[f=1|xi=1] for all xi
Geometrically (e.g. n = 2):
[figure: the Boolean square with vertices 00, 01, 10, 11]
13
P[f=1|xi=0] = P[f=1|xi=1] for all xi
Geometrically:
[figure: Parity(x1, x2) on the square: f = 1 at vertices 01 and 10, f = 0 at 00 and 11]
14
P[f=1|xi=0] = P[f=1|xi=1] for all xi
Geometrically:
[figure: the halves x1 = 0 (vertices 00, 01) and x1 = 1 (vertices 10, 11) each contain exactly one positive vertex of Parity(x1, x2)]
15
P[f=1|xi=0] = P[f=1|xi=1] for all xi
[figure: likewise the halves x2 = 0 (vertices 00, 10) and x2 = 1 (vertices 01, 11) each contain exactly one positive vertex]
16
• Other correlation-immune functions besides
parity?
– f(x1,…,xn) = 1 iff x1 = x2 = … = xn
17
• Other correlation-immune functions besides
parity?
– All reflexive functions:
f(x) = f(¬x) for all x  (¬x = bitwise complement of x)
18
• Other correlation-immune functions besides
parity?
– All reflexive functions:
f(x) = f(¬x) for all x
– More…
19
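The reflexivity claim is easy to verify exhaustively for small n. A sketch using the equality function from the previous slide as the example:

```python
from itertools import product

# A function is correlation-immune iff every variable has equally many
# positive points on its two half-cubes.
def is_correlation_immune(f, n):
    for i in range(n):
        side = {0: 0, 1: 0}
        for x in product([0, 1], repeat=n):
            side[x[i]] += f(x)
        if side[0] != side[1]:
            return False
    return True

# f = 1 iff x1 = x2 = ... = xn; it is reflexive (f(x) = f(complement of x)).
all_equal = lambda x: int(len(set(x)) == 1)
assert all(all_equal(x) == all_equal(tuple(1 - b for b in x))
           for x in product([0, 1], repeat=4))
print(is_correlation_immune(all_equal, 4))  # → True
```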
Correlation-immune functions and
decision tree learners
• Decision tree learners in ML
– Popular machine learning approach (CART, C4.5)
– Given set of examples of Boolean function, build a
decision tree
• Heuristics for decision tree learning
  – Greedy, top-down
  – Differ in the way they choose which variable to put in a node
  – Pick the variable having highest “gain”
  – P[f=1|xi=1] = P[f=1|xi=0] means 0 gain
• Correlation-immune functions are problematic for decision tree learners
20
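The zero-gain observation can be computed directly. A sketch of the root-level information-gain calculation (standard entropy-based gain, as in C4.5) for parity(x1, x2, x3):

```python
from itertools import product
from math import log2

def entropy(p):
    return 0.0 if p in (0.0, 1.0) else -p * log2(p) - (1 - p) * log2(1 - p)

# Information gain of splitting on variable i at the root,
# computed exactly over the uniform distribution on {0,1}^n.
def gain(f, n, i):
    xs = list(product([0, 1], repeat=n))
    p = sum(f(x) for x in xs) / len(xs)
    halves = {0: [], 1: []}
    for x in xs:
        halves[x[i]].append(f(x))
    cond = sum(entropy(sum(h) / len(h)) for h in halves.values()) / 2
    return entropy(p) - cond

parity = lambda x: x[0] ^ x[1] ^ x[2]
print([gain(parity, 3, i) for i in range(3)])  # → [0.0, 0.0, 0.0]
```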
• Lookahead
• Skewing: An efficient alternative to lookahead for decision tree induction. IJCAI 2003 [Page, Ray]
• Why skewing works: Learning difficult Boolean functions with greedy tree learners. ICML 2005 [Rosell, Hellerstein, Ray, Page]
21
Story
Part One
22
• How many difficult functions?

n       0   1   2   3    4     5
# fns   2   2   4   18   648   3140062

• More than 2^(2^(n-1))
23
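The counts in the table can be reproduced by brute force over truth tables (feasible for small n; the slide's value for n = 4, 648, can be reproduced the same way but takes noticeably longer):

```python
from itertools import product

# Count correlation-immune functions of n variables by enumerating all
# 2^(2^n) truth tables. A truth table is correlation-immune iff for every
# variable i the number of 1-outputs with x_i = 1 equals that with x_i = 0.
def count_ci(n):
    points = list(product([0, 1], repeat=n))
    count = 0
    for tt in product([0, 1], repeat=2**n):
        if all(sum(y for x, y in zip(points, tt) if x[i] == 1) ==
               sum(y for x, y in zip(points, tt) if x[i] == 0)
               for i in range(n)):
            count += 1
    return count

print([count_ci(n) for n in range(1, 4)])  # → [2, 4, 18], matching the table
```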
• How many difficult functions? More than 2^(2^(n-1)) of them.

SOMEONE MUST HAVE STUDIED THESE FUNCTIONS BEFORE…
24
Story
Part Two
27
• I had lunch with Eric Bach
28
Roy, B. K. 2002. A brief outline of research on correlation immune functions. In Proceedings of the 7th Australasian Conference on Information Security and Privacy (ACISP 2002, July 3-5, 2002), L. M. Batten and J. Seberry, Eds. Lecture Notes in Computer Science, vol. 2384. Springer-Verlag, London, 379-394.
29
Correlation-immune functions
• k-correlation immune function
  – For every subset S of the input variables s.t. 1 ≤ |S| ≤ k, the output of f is independent of the variables in S:
    P[f = 1 | assignment to S] = P[f = 1] for every assignment to S
  – [Xiao, Massey 1988] Equivalently, all Fourier coefficients of degree i are 0, for 1 ≤ i ≤ k
30
Siegenthaler’s Theorem [1984]
If f is k-correlation immune, then the GF[2] polynomial for f has degree at most n-k.
Algorithm of Mossel, O’Donnell, Servedio [STOC
2003] based on this theorem
32
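Siegenthaler's bound can be checked on examples by computing the algebraic normal form (GF(2) polynomial) via the Möbius transform. Parity of 3 variables is 2-correlation immune, so its degree must be at most 3 - 2 = 1:

```python
from itertools import product

# GF(2) degree of f via the Moebius transform: the coefficient of the
# monomial with support m is the XOR of f over all points below m
# in the subset order.
def anf_degree(f, n):
    coeff = {}
    for m in product([0, 1], repeat=n):
        s = 0
        for x in product([0, 1], repeat=n):
            if all(xi <= mi for xi, mi in zip(x, m)):
                s ^= f(x)
        coeff[m] = s
    return max((sum(m) for m, c in coeff.items() if c), default=0)

parity3 = lambda x: x[0] ^ x[1] ^ x[2]   # 2-correlation immune, n = 3
print(anf_degree(parity3, 3))            # → 1, which is <= n - k = 1
```

By contrast, AND of 3 variables (not correlation-immune) has full degree 3, so the bound is consistent in both directions on these examples.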
End of Story
33
Non-uniform distributions
• Correlation-immune functions are defined
wrt the uniform distribution
• What if distribution is biased?
e.g. each bit 1 with probability ¾
34
f(x1,x2) = parity(x1,x2)
each bit 1 with probability 3/4

x    parity(x)   P[x]
00   0           1/16
01   1           3/16
10   1           3/16
11   0           9/16

P[f=1|x1=1] ≠ P[f=1|x1=0]
35
f(x1,x2) = parity(x1,x2)
each bit 1 with probability 3/4

x    parity(x)   P[x]
00   0           1/16
01   1           3/16
10   1           3/16
11   0           9/16

P[f=1|x1=1] ≠ P[f=1|x1=0]
For added irrelevant variables, the two probabilities would be equal
36
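The table's conditional probabilities can be computed exactly with rational arithmetic:

```python
from itertools import product
from fractions import Fraction

# Probability of a point x under the p-biased product distribution D_p.
def prob(x, p):
    q = Fraction(1)
    for b in x:
        q *= p if b else 1 - p
    return q

# Exact conditional probability P_Dp[f = 1 | x_i = v].
def cond(f, n, i, v, p):
    pts = list(product([0, 1], repeat=n))
    num = sum(prob(x, p) for x in pts if x[i] == v and f(x))
    den = sum(prob(x, p) for x in pts if x[i] == v)
    return num / den

p = Fraction(3, 4)
parity = lambda x: x[0] ^ x[1]
print(cond(parity, 2, 0, 1, p), cond(parity, 2, 0, 0, p))  # → 1/4 3/4
```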
Correlation-immunity wrt p-biased
distributions
Definitions
• f is correlation-immune wrt distribution D if
PD[f=1|xi=1] = PD[f=1|xi=0]
for all xi
• p-biased distribution Dp: each bit set to 1
independently with probability p
– For all p-biased distributions D,
PD[f=1|xi=1] = PD[f=1|xi=0]
for all irrelevant xi
37
Lemma: Let f(x1,…,xn) be a Boolean function with r relevant variables. Then f is correlation immune w.r.t. Dp for at most r-1 values of p.

Proof: Correlation immunity wrt Dp means
P[f=1|xi=1] – P[f=1|xi=0] = 0    (*)
for all xi.
Consider fixed f and xi. The lhs of (*) can be written as a polynomial h(p).
38
• e.g. f(x1, x2, x3) = parity(x1, x2, x3), p-biased distribution Dp
h(p) = P_Dp[f=1|x1=1] - P_Dp[f=1|x1=0]
     = ( p^2 + (1-p)^2 ) – ( p(1-p) + (1-p)p )
If we add an irrelevant variable, this polynomial doesn’t change.
• For arbitrary f and variable xi, h(p) has degree ≤ r-1, where r is the number of relevant variables.
• So f is correlation-immune wrt at most r-1 values of p, unless h(p) is identically 0 for all xi.
39
h(p) = P_Dp[f=1|xi=1] - P_Dp[f=1|xi=0]

P_Dp[f=1|xi=1] = Σ_{d=0}^{n-1} w_d p^d (1-p)^(n-1-d)

where w_d is the number of inputs x for which f(x)=1, xi=1, and x contains exactly d additional 1’s,
i.e. w_d = number of positive assignments of f_{xi←1} of Hamming weight d
• Similar expression for P_Dp[f=1|xi=0]
40
P_Dp[f=1|xi=1] - P_Dp[f=1|xi=0] = Σ_{d=0}^{n-1} (w_d - r_d) p^d (1-p)^(n-1-d)

where w_d = number of positive assignments of f_{xi←1} of Hamming weight d
r_d = number of positive assignments of f_{xi←0} of Hamming weight d

Not identically 0 iff w_d ≠ r_d for some d
41
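The w_d and r_d coefficients can be computed for the running parity example; the resulting h(p) matches (2p - 1)^2, a degree-2 polynomial for r = 3:

```python
from itertools import product
from fractions import Fraction

# w_d (v = 1) and r_d (v = 0): number of positive assignments of
# f with x_i fixed to v, having d additional 1-bits.
def weight_counts(f, n, i, v):
    counts = [0] * n                # d ranges over 0 .. n-1
    for x in product([0, 1], repeat=n):
        if x[i] == v and f(x):
            counts[sum(x) - v] += 1
    return counts

f = lambda x: x[0] ^ x[1] ^ x[2]   # parity of 3 variables, i = x1
n, i = 3, 0
w = weight_counts(f, n, i, 1)
r = weight_counts(f, n, i, 0)
# h(p) = sum_d (w_d - r_d) p^d (1-p)^(n-1-d) should equal (2p-1)^2.
for p in (Fraction(k, 4) for k in range(5)):
    h = sum((w[d] - r[d]) * p**d * (1 - p)**(n - 1 - d) for d in range(n))
    assert h == (2 * p - 1) ** 2
print(w, r)  # → [1, 0, 1] [0, 2, 0]
```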
Property of Boolean functions
Lemma: If f has at least one relevant variable, then for some relevant variable xi and some d,
w_d ≠ r_d
where
w_d = number of positive assignments of f_{xi←1} of Hamming weight d
r_d = number of positive assignments of f_{xi←0} of Hamming weight d
42
How much does it help to have access to
examples from different distributions?
43
How much does it help to have access to examples from different distributions?

Exploiting Product Distributions to Identify Relevant Variables of Correlation Immune Functions
[Hellerstein, Rosell, Bach, Ray, Page]
44
• Even if f is not correlation-immune wrt Dp,
may need very large sample to detect
relevant variable
– if value of p very near root of h(p)
• Lemma: If h(p) is not identically 0, then for some value of p in the set
{ 1/(r+1), 2/(r+1), 3/(r+1), …, (r+1)/(r+1) },
|h(p)| ≥ 1/(r+1)^(r-1)
45
• Algorithm to find a relevant variable
  – Uses examples from distributions Dp, for p = 1/(r+1), 2/(r+1), …, (r+1)/(r+1)
  – Sample size poly((r+1)^r, log n, log 1/δ)
[Essentially the same algorithm was found independently by Arpe and Mossel, using very different techniques]
• Another algorithm to find a relevant variable
  – Based on proving (roughly) that if we choose a random p, then h^2(p) is likely to be reasonably large. Uses the prime number theorem.
  – Uses examples from poly(2^r, log 1/δ) distributions Dp.
  – Sample size poly(2^r, log n, log 1/δ)
46
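A rough sketch of the first algorithm above: sample from each p-biased distribution on the grid, estimate the dependence gap of every variable, and report the variable with the largest gap. The sample size and the max-gap rule here are illustrative simplifications, not the paper's exact procedure.

```python
import random

# Try each p in { 1/(r+1), ..., (r+1)/(r+1) } from the slide's grid;
# the degenerate p = 1 case is skipped automatically below, since one
# side of the split is then empty.
def find_relevant_variable(f, n, r, m=20_000, rng=random.Random(0)):
    best_var, best_gap = None, 0.0
    for k in range(1, r + 2):
        p = k / (r + 1)
        xs = [[1 if rng.random() < p else 0 for _ in range(n)] for _ in range(m)]
        ys = [f(x) for x in xs]
        for i in range(n):
            ones = [y for x, y in zip(xs, ys) if x[i]]
            zeros = [y for x, y in zip(xs, ys) if not x[i]]
            if ones and zeros:
                gap = abs(sum(ones) / len(ones) - sum(zeros) / len(zeros))
                if gap > best_gap:
                    best_var, best_gap = i, gap
    return best_var

# Parity of 3 variables is correlation-immune wrt the uniform distribution,
# but biased p (e.g. p = 1/4) exposes a gap of (2p-1)^2 = 1/4.
parity = lambda x: x[0] ^ x[1] ^ x[2]
print(find_relevant_variable(parity, 10, 3))  # one of the indices 0, 1, 2
```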
Better algorithms?
47
Summary
• Finding relevant variables (junta-learning)
• Correlation-immune functions
• Learning from p-biased distributions
48
Moral of the Story
• Handbook of integer sequences can be
useful in doing literature search
• Eating lunch with the right person can be
much more useful
49