Chapter 32

Learning the Fourier Spectrum of Probabilistic Lists and Trees

William Aiello*        Milena Mihail†

*Bell Communications Research, Morristown, NJ 07960.
†Bell Communications Research, Morristown, NJ 07960.

Abstract

We show that probabilistic boolean decision lists, and probabilistic decision trees in which each literal appears at most once, can be learned in polynomial time under the uniform instance distribution by reconstructing their Fourier representations. This result is in the spirit of the "learning via Fourier transforms" method of Linial, Mansour, and Nisan [LMN89] and of Mansour [M90], and, like that work, our algorithms learn by sampling. The new ingredient that allows us to achieve polynomiality is a refined analysis of the Fourier spectrum of these concepts: we are able to isolate, and efficiently identify, a polynomially small set of non-negligible Fourier coefficients that reside in a super-polynomially large area of the spectrum. We further observe that several more general concept classes have n^{poly log n} learning algorithms: these include arbitrary probabilistic decision trees of polynomial size, convex combinations of bounded spectrum concepts, etc. The polynomiality of our results should be contrasted with the n^{poly log n} complexities in the analogous cases of [LMN89] and [M90].

1 Introduction

There is a famous (and very practical) theorem by Nyquist in signal processing which roughly says the following: "If a continuous signal has bounded spectrum, then it can be completely reconstructed via discrete sampling." It is every so often the case that statements for infinite versus finite, or "uncountable versus countable", domains translate to statements for "exponential versus polynomial" complexities over finite domains. A concrete Fourier analogue of Nyquist's theorem over the finite n-cube appears in the work of Linial, Mansour, and Nisan [LMN89] on learning boolean concepts: concepts whose spectrum is concentrated on a small set of frequencies can be learned (under the uniform sampling distribution) by reconstructing their Fourier representation [LMN89]; Mansour [M90] further extends this "learning via Fourier transforms" method.

In this paper we are concerned with learning probabilistic concepts. Probabilistic concepts were introduced by Kearns and Schapire [KS90] as a model that captures natural uncertainties inherent in the labels of examples (the reader is referred to [KS90] for further justification of the model). While boolean concepts have range {0, 1}, the range of probabilistic concepts is the interval [0, 1]: for a probabilistic concept p and each domain element x, the label of x is 1 with probability p(x) and 0 with probability 1 − p(x). A concrete "weather-prediction" example from [KS90] is as follows: "If it was raining on Monday, and it is raining on Tuesday, then it will rain on Wednesday with probability 0.8." Here x corresponds to the weather on Mondays and Tuesdays (for any specific day, it either rains or it does not rain), and the label of x is 1 with probability 0.8.
Uniform learning is a special form of Valiant's distribution-free learning [V84] in which examples of the concept to be learned are drawn according to the uniform distribution. In the uniform (as opposed to arbitrary) learning scenario for probabilistic concepts there is a class P of concepts (P is simply a collection of concepts), one of which is the target concept p. The task of a learning algorithm for the class P is to produce a "good" approximation p̃ of p after a training phase and some further efficient computation. During the training phase the algorithm is presented with a "small" number of samples; each sample is a uniformly generated element of the n-cube, together with a label 1 or 0 that is determined by p(x).
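To make the sampling model concrete, here is a minimal Python sketch (the function names and the toy concept below are ours, not the paper's) of a uniform-distribution example oracle: each call returns a uniformly random point of the n-cube together with a 0/1 label that equals 1 with probability p(x).

```python
import random

def uniform_example(p, n, rng=random):
    """Draw one labeled sample (x, l) of the probabilistic concept p.

    x is a uniformly random point of {0,1}^n, and the label l is 1 with
    probability p(x) and 0 otherwise.
    """
    x = tuple(rng.randint(0, 1) for _ in range(n))
    label = 1 if rng.random() < p(x) else 0
    return x, label

# Purely illustrative target: "rain on Wednesday" with probability 0.8 if it
# rained on both Monday (bit 0) and Tuesday (bit 1), and 0.1 otherwise.
p = lambda x: 0.8 if x == (1, 1) else 0.1
training_phase = [uniform_example(p, 2) for _ in range(5)]
```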
In this paper we show that probabilistic decision lists (see Figure 1) are uniform-learnable in polynomial time: Theorem 3.6. Furthermore, our techniques extend to obtain polynomial-time uniform-learning algorithms for probabilistic decision trees with a single occurrence per literal (see Figure 2): Theorem 3.10.
sam-
n-cube
is determined
by p( ~).
Despite
their
concepts,
cepts
seeming
the
task
presents
is reflected
mines
the
in general
[KS90]
Chervonenkis
fact
that
not
known
cepts
hold
than
[B EHW89]),
several
as well
have
are
con-
rather
algorithms.
latter
as the
results
probabilistic
analogues
of this
deter-
the Vapnik-
learnability
learning
example
(for
that
of probabilistic
complex
for
boolean
distribution-free
ample
theorems
dimension
dimension
whose
conThis
structural
distribution-free
boolean
difficulties.
learnability
is more
to
with
probabilistic
larger
combinatorial
distribution-free
concepts
resemblance
learning
significantly
both
example
of
simple
A prime
situation
ex-
is decision
lists.
For the purposes of this paper, a decision list over n variables is a single-branch decision tree whose edges are labeled by literals or their negations (see Figure 1). The leaves of the tree are labeled by numbers in [0, 1]. Each element x of the n-cube naturally follows a path from the root to a unique leaf of the list, and the value of the decision list on x is the label of this unique leaf.

Decision lists are generally accepted as quite natural concepts and have been studied extensively in the boolean case, where the leaf labels p_i are in {0, 1} (see [R87] for a collection of results). In particular, it is well known that boolean decision lists are learnable in polynomial time in the distribution-free model, even in more general cases [R87]. However, for probabilistic decision lists the best known distribution-free result requires that the list to be learned is monotone, that is, p_1 > p_2 > ... > p_{n+1} [KS90].

The approach that we use for learning probabilistic concepts is to consider their Fourier coefficients and argue that the target concept is approximated, in a vector sense, by a hypothesis whose Fourier coefficients are estimated from samples. This novel method of learning by approximating Fourier representations was introduced by Linial, Mansour, and Nisan [LMN89] in the context of boolean concepts (specifically, the class AC0 of constant depth circuits); here we make the simple observation that their techniques extend to probabilistic concepts: Theorem 2.2. The efficiency of the whole scheme depends upon the number of Fourier coefficients that must be approximated. For probabilistic decision lists and single literal decision trees we use detailed Fourier analysis to argue that all but a polynomially small set of coefficients are negligible (Lemmas 3.2 and 3.9). We further give an algorithmic scheme that efficiently determines the frequencies of the non-negligible Fourier coefficients (Remark 3.3, Algorithm 1, Theorems 3.6 and 3.10).

Both conceptually and technically, this part of our work is entirely new. In [LMN89] and [M90] the small set of significant frequencies resides in the polynomially small area of "low frequencies", and this is the reason why n^{poly log n} complexities suffice in the analogous cases; here, the polynomially small set of significant frequencies resides in a super-polynomially large area of the spectrum, and isolating it efficiently is what yields polynomial complexity. We consider these results as the main thrust of our work.
For arbitrary probabilistic decision trees of polynomial size (i.e. when the single literal condition is removed) we observe that n^{poly log n} learning is feasible: Theorem 4.1. This follows along the lines of [LMN89], with the crucial simplification that the use of Hastad's Switching Lemma [H86], which [LMN89] needed for their AC0 proof, turns out to be unnecessary here; the same holds for convex combinations of bounded spectrum concepts: Theorem 4.2. An interesting consequence of these observations is that the weighted arithmetization of k-DNF (a notion from [KS90]) is learnable in the uniform model in polynomial time, and, by the recent elegant results of Mansour [M90], it is even learnable in the more general distribution-free model, of which uniform learning is a special case (Theorems 4.3 and 4.4).

Finally, before proceeding to the technical presentation of our work, it is worth mentioning that the Fourier methods that appear here and elsewhere [LMN89], [M90] possess a variety of remarkable and desirable features: besides their comparability to Nyquist's theorem, they are conceptually simple, easy to implement, and parallelizable, which suggests their potential for further algorithmic uses in uniform learning.

The rest of this paper is organized as follows. In Section 2 we establish the technical background and the context of our work. In Section 3 we present the polynomial learnability results for decision lists and single literal decision trees. In Section 4 we discuss the learnability of arbitrary decision trees of polynomial size and of convex combinations of concepts with bounded spectrum. Summary and open problems are in Section 5.
2 Preliminaries

In this section we briefly review the "learning via Fourier transforms" approach to learning. The techniques were described in [LMN89] and [M90] for boolean concepts; the only new point here is Theorem 2.2, namely, that the approach extends to probabilistic concepts, with equation (2) below supplying estimates of the Fourier coefficients from labeled samples. We use the following definitions:

● The domain is the n-cube Q_n = {0, 1}^n; elements of Q_n are denoted x = x_1 ... x_n.

● A n-concept is a function p : Q_n → [0, 1] (when n is well understood we simply say concept). For each domain element x, the label of x is 1 with probability p(x) and 0 with probability 1 − p(x).

● A sample of a concept p is a pair (x, l_x), where x is drawn uniformly from Q_n, and l_x is 1 with probability p(x) and 0 with probability 1 − p(x).

● A function p̃ is an ε-approximation of a concept p if (1/2^n) Σ_x |p(x) − p̃(x)| ≤ ε.

● A n-concept class P_n is a set of n-concepts; a concept class is P = ∪_n P_n.

● [Analogous to [KS90]] A concept class P is learnable in the uniform model if there is an algorithm that, for any p ∈ P and any ε and δ, takes as input ε, δ, and m = f(n, ε, δ) independent samples of p, and produces with probability 1 − δ a hypothesis p̃ which is an ε-approximation of p; moreover, the running time is f(n, ε, δ). We say that P is learnable in polynomial time if f(n, ε, δ) is polynomial in n and ε^{-1} and polynomial in log δ^{-1}. We say that P is learnable in slightly super-polynomial time if f(n, ε, δ) is of the form (n ε^{-1})^{poly log n} and polynomial in log δ^{-1}.

Concepts can be viewed as elements of the 2^n-dimensional vector space of all real valued functions on the n-cube (one dimension for each domain point). In this context, determining concepts by their values on each vertex of the n-cube is equivalent to using the standard basis. Of course, the obvious difficulty of the learning task is that all directions of the standard basis are equally important, and we are required to correctly approximate the projection of p on all but a vanishingly small fraction of these directions after seeing the behavior of p on a vanishingly small fraction of the directions (i.e. the small set of samples).

In an effort to overcome this fundamental difficulty, Linial, Mansour, and Nisan [LMN89] introduced the idea of switching bases, from the standard basis to a Fourier basis, so that the projection of p along most directions of the Fourier basis is negligible. Hence the number of non-negligible directions is comparable to the small number of samples.

The Fourier basis for the set of all real valued functions on the n-cube is defined as follows. For each S ⊆ [n] consider a "parity" function χ_S associated with S in the natural way: χ_S(x) = (−1)^{par_S(x)}, where par_S(x) = 0 if Σ_{i∈S} x_i is even and par_S(x) = 1 if Σ_{i∈S} x_i is odd. It is well known and easy to verify that ∪_S {χ_S} is indeed an orthonormal set with respect to the inner product (p, q) = (1/2^n) Σ_x p(x) q(x). Hence, any real valued function on the n-cube (therefore any concept p) can be written as p = Σ_S (p, χ_S) χ_S. The Sth Fourier coefficient of p is

    a(S) = (p, χ_S) = (1/2^n) Σ_x p(x) (−1)^{par_S(x)} .    (1)

We refer to |S| as the frequency of the coefficient a(S).

All the learning algorithms in this paper, as well as those of [LMN89] and [M90], are crucially based on the fact that the coefficients a(S) are simply averages, and that these averages can be efficiently estimated from samples, as suggested by (2):

    ã(S) = (1/m) Σ_{t=1}^{m} l_{x_t} (−1)^{par_S(x_t)} ,    (2)

where (x_1, l_{x_1}), ..., (x_m, l_{x_m}) are independent samples of p. The only technicality is that for probabilistic concepts we see the labels l_{x_t} rather than the values p(x_t); since the labels are independent and have expectation p(x_t), standard Chernoff bounds apply to (2) exactly as in the boolean case.

Definition 2.1 We say that a concept class P has bounded Fourier spectrum if there is a constant c such that, for all concepts p ∈ P, all "high" frequency coefficients are negligible in the following sense: Σ_{S:|S|>k} a²(S) ≤ poly(n) 2^{-kc}.

Theorem 2.2 [Extension of [LMN89]] If a probabilistic concept class P has bounded spectrum, then P is learnable in the uniform model in slightly super-polynomial time.

The intuition is as follows. Set k = poly(log n, log ε^{-1}) so that poly(n) 2^{-kc} ≤ ε²/2; hence we may estimate only the coefficients a(S) with |S| ≤ k, and there are at most n^{O(k)} of them. If m is some appropriate polynomial in n^k, ε^{-1}, and log δ^{-1}, then standard Chernoff bounds guarantee that, with probability at least 1 − δ, the estimates (2) satisfy |a(S) − ã(S)|² ≤ ε²/2 divided by the number of such S, for each S with |S| ≤ k, and hence Σ_{S:|S|≤k} (a(S) − ã(S))² ≤ ε²/2. Now, if we use as a hypothesis for p the concept p̃ whose Fourier coefficients are the ã(S) for |S| ≤ k (and 0 for |S| > k), then we have

    (1/2^n) Σ_x |p(x) − p̃(x)| ≤ ( (1/2^n) Σ_x (p(x) − p̃(x))² )^{1/2} = ( Σ_S (a(S) − ã(S))² )^{1/2} ≤ ( ε²/2 + ε²/2 )^{1/2} = ε ,    (3)

where the first inequality is by Cauchy-Schwartz, the equality is Parseval, and the last bound follows from the bounded spectrum and the choice of k and m. Values p̃(x) < 0 or p̃(x) > 1 can be trivially rounded to 0 or 1 respectively; the errors only decrease. Hence p̃ is an ε-approximation of p with high probability.

In [LMN89] it was shown that the class AC0 has bounded spectrum, and is hence uniform-learnable in slightly super-polynomial time. The intuition behind this is roughly the following: "The spectrum of AC0 concentrates on the low frequencies because AC0 functions should differ significantly from the parity function, which is known not to be in AC0 [H86], and whose spectrum consists of a single frequency: the highest." The results of Section 3 (as well as the stronger results of Section 4 with respect to polynomial-size decision trees) are justified by the same intuition: "The spectrum of the class of polynomial-size decision trees concentrates on the low frequencies because it requires at least exponential size decision trees to represent the parity function."
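As an illustration of equations (1)-(3), the following Python sketch (our own; the identifiers are hypothetical) estimates every coefficient of frequency at most k from labeled samples as in (2), and assembles the clipped low-frequency hypothesis p̃ used in the argument for Theorem 2.2 above. With m a suitable polynomial in n^k, ε^{-1}, and log δ^{-1}, Chernoff bounds make each estimate accurate, which is the content of the theorem for bounded-spectrum classes.

```python
from itertools import combinations

def parity(S, x):
    """par_S(x): parity of the bits of x indexed by S."""
    return sum(x[i] for i in S) % 2

def estimate_coefficient(S, samples):
    """a~(S) = (1/m) * sum_t l_t * (-1)^{par_S(x_t)}, as in equation (2)."""
    m = len(samples)
    return sum(l * (-1) ** parity(S, x) for x, l in samples) / m

def low_frequency_hypothesis(samples, n, k):
    """Estimate all coefficients of frequency <= k; return p~ clipped to [0,1]."""
    coeffs = {}
    for size in range(k + 1):
        for S in combinations(range(n), size):
            coeffs[S] = estimate_coefficient(S, samples)
    def p_hat(x):
        value = sum(a * (-1) ** parity(S, x) for S, a in coeffs.items())
        return min(1.0, max(0.0, value))   # trivial rounding, as in the text
    return p_hat
```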
3 Probabilistic Decision Lists and Single Literal Decision Trees

In the previous section we sketched how bounded spectrum concept classes can be learned in the uniform model in slightly super-polynomial time. In fact, without using additional structure, this is probably the best possible, since roughly n^{Θ(log n)} Fourier coefficients must be approximated. The main contribution of this paper is to show that for probabilistic decision lists and single literal decision trees there is a polynomial size subset of the low-frequency coefficients in which most of the power of the spectrum is concentrated, and that, furthermore, this subset can be efficiently identified. Thus we obtain the first nontrivial example of a concept class with bounded spectrum for which it is possible to take advantage of further structural properties of the spectrum and obtain uniform learning in polynomial time. The case of lists is treated in detail; the case of trees is completely analogous and is simply sketched at the end of the section.

As mentioned in the introduction, a decision list on n variables is a single-branch binary tree with edges labeled by literals x_i and their negations x̄_i, so that if the right edge of an internal node is labeled x_i (resp. x̄_i) then the left edge is labeled x̄_i (resp. x_i). There are n + 1 leaves, labeled by p_1, ..., p_{n+1} ∈ [0, 1]. Any x ∈ Q_n naturally follows a path from the root to a unique leaf, and the value of the decision list on x is the label of this unique leaf (see Figure 1).

Before presenting the learning algorithm we introduce some useful terminology. Clearly, the variables along a decision list define a permutation π of [n]: the edges of the i-th level of the list are labeled x_{π(i)} and x̄_{π(i)}. For example, in Figure 1 we have π(1) = 3, π(2) = 1, π(3) = 4, π(4) = 2. Conversely, x_i and x̄_i are on level π^{-1}(i) of the list; for convenience we define the level function l(i) = π^{-1}(i). We further introduce a 0-1 vector d = d_1 ... d_n to denote whether x_{π(i)} or x̄_{π(i)} labels the left edge of the i-th level of the list (in Figure 1 we have d = 1101), and we use the complement of d_i: c_i = 1 − d_i. The leaves of the list naturally induce a partition C_1, ..., C_{n+1} of the n-cube:

    C_i = { x : x_{π(1)} = d_1, ..., x_{π(i−1)} = d_{i−1}, x_{π(i)} ≠ d_i } , for 1 ≤ i ≤ n ,

and C_{n+1} is the single vector x with x_{π(i)} = d_i for all i. Clearly |C_i| = 2^{n−i} for 1 ≤ i ≤ n, and of course |C_{n+1}| = 1.

In the above terms, a decision list is formally defined as follows:

Definition 3.1 A concept p over n variables is an n-bit decision list if, for some permutation π of [n], some 0-1 vector d = d_1 ... d_n, and some p_1, ..., p_{n+1} ∈ [0, 1], the following holds: for all x ∈ Q_n, p(x) = p_i if x ∈ C_i, where the C_i's are as previously defined.
To identify the variable of a subset S ⊆ [n] that is farthest down the list, we define the maximum level of S, MaxL(S), to be the level of the deepest variable of S: MaxL(S) = max_{i∈S} l(i), so that x_{π(MaxL(S))} is the desired variable. Clearly, the sets {S : MaxL(S) = 1}, ..., {S : MaxL(S) = n} partition all nonempty subsets of [n], and the cardinalities of the partition are |{S : MaxL(S) = i}| = 2^{i−1}.

The crucial properties of the spectrum of decision lists, which allow us to approximate it efficiently, are, roughly, the following:

(i) Σ_{S : MaxL(S) > i} a²(S) ≤ 2^{-i}, which suggests that for the purpose of approximating the Fourier representation it suffices to approximate each one of the a(S) such that MaxL(S) ≤ i, for i = O(log n). Now the set {S : MaxL(S) ≤ i} is the powerset of X(i) = {π(1), ..., π(i)}, so there are only polynomially many coefficients to approximate for i = O(log n), and each one of them can be approximated satisfactorily in polynomial time by sampling.

(ii) The function π on 1, ..., i can be approximated satisfactorily in polynomial time by further sampling.

Point (i) is justified by Lemma 3.2 below, which captures all the structure of the spectrum. Point (ii) follows from Remark 3.3 and the learning algorithm.

Lemma 3.2 For all nonempty S ⊆ [n], if MaxL(S) = i then

    a(S) = ± (1/2^n) ( 2^{n−i} p_i − Σ_{j=i+1}^{n} 2^{n−j} p_j − p_{n+1} ) ,    (4)

    |a(S)| = |a({π(i)})| ≤ 2^{−i} ,    (5)

and consequently

    Σ_{S : MaxL(S) > i} a²(S) ≤ 2^{−i} .    (6)
PROOF. To see (4), recall by (1) that a(S) = (1/2^n) Σ_x p(x) (−1)^{par_S(x)}, and sum over the partition C_1, ..., C_{n+1}:

    a(S) = (1/2^n) Σ_{j=1}^{n+1} p_j Σ_{x∈C_j} (−1)^{par_S(x)} = ± (1/2^n) ( 2^{n−i} p_i − Σ_{j=i+1}^{n} 2^{n−j} p_j − p_{n+1} ) ,

where the last equality is fairly easy to justify as follows:

● Clearly S ⊆ {π(1), ..., π(i)}. For each j < i we argue that Σ_{x∈C_j} (−1)^{par_S(x)} = 0. This is because all vectors in C_j have the coordinates x_{π(1)}, ..., x_{π(j−1)}, x_{π(j)} forced to d_1, ..., d_{j−1}, c_j respectively, so the parity of their bits that belong to S ∩ {x_{π(1)}, ..., x_{π(j)}} is fixed; on the other hand, each of the coordinates x_{π(j+1)}, ..., x_{π(n)}, and in particular x_{π(i)} ∈ S, is free to vary in {0, 1}. Consequently, for half of the vectors in C_j the parity of the sum of their bits that belong to S is even, and for the other half it is odd. Hence the quantity (−1)^{par_S(x)}, averaged over all vectors in C_j, is zero.

● For j = i, there are 2^{n−i} vectors in C_i, all of which have the coordinates x_{π(1)}, ..., x_{π(i−1)}, x_{π(i)} forced to d_1, ..., d_{i−1}, c_i respectively; hence par_S(x) is fixed on C_i. Similarly, for i < j ≤ n there are 2^{n−j} vectors in C_j, all of which have their coordinates x_{π(1)}, ..., x_{π(i)} forced to d_1, ..., d_i; hence par_S(x) is again fixed on C_j, and its value is the complement of the value on C_i (the two differ exactly in the bit x_{π(i)}). The case j = n + 1 is easily seen to contribute p_{n+1} with the sign given in (4).

Equation (5) follows from (4) by noticing that all the p_j's are in [0, 1], so the quantity inside the parentheses has absolute value at most 2^{n−i}, and that the right-hand side of (4) depends on S only through i = MaxL(S). To verify (6), recall that |{S : MaxL(S) = j}| = 2^{j−1}, and by (5) each such S has a²(S) ≤ 2^{−2j}; hence

    Σ_{S : MaxL(S) > i} a²(S) ≤ Σ_{j=i+1}^{n} 2^{j−1} 2^{−2j} ≤ 2^{−i} .    □
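The following sketch (Python; the permutation, bit vector, and leaf labels are arbitrary illustrative choices of ours, not the values of Figure 1) evaluates a probabilistic decision list as in Definition 3.1, computes its exact Fourier coefficients by brute force for a small n, and checks the bound |a(S)| ≤ 2^{-MaxL(S)} of equation (5); the tail bound (6) can be checked the same way.

```python
from itertools import product, combinations
import random

def decision_list_value(x, pi, d, leaves):
    """Evaluate a probabilistic decision list (Definition 3.1) at x.

    pi     : permutation of range(n); pi[i] is the variable at level i+1
    d      : 0/1 vector; x leaves the list at the first level where x[pi[i]] != d[i]
    leaves : p_1, ..., p_{n+1} in [0, 1]
    """
    for i, var in enumerate(pi):
        if x[var] != d[i]:
            return leaves[i]          # x belongs to C_{i+1} (1-indexed: C_i)
    return leaves[-1]                 # x agrees with d everywhere: leaf p_{n+1}

def fourier_coefficient(p_values, S, n):
    """Exact a(S) = 2^{-n} * sum_x p(x) * (-1)^{sum_{i in S} x_i}."""
    return sum(px * (-1) ** (sum(x[i] for i in S) % 2)
               for x, px in p_values.items()) / 2 ** n

n = 4
rng = random.Random(0)
pi = [2, 0, 3, 1]                                   # an arbitrary permutation
d = [rng.randint(0, 1) for _ in range(n)]
leaves = [rng.random() for _ in range(n + 1)]
p_values = {x: decision_list_value(x, pi, d, leaves)
            for x in product((0, 1), repeat=n)}

level = {var: i + 1 for i, var in enumerate(pi)}    # l(var) = pi^{-1}(var)
for size in range(1, n + 1):
    for S in combinations(range(n), size):
        max_level = max(level[i] for i in S)        # MaxL(S)
        a = fourier_coefficient(p_values, S, n)
        assert abs(a) <= 2 ** -max_level + 1e-12    # Lemma 3.2, equation (5)
```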
Remark 3.3 As we discussed above, for our learning purposes we wish to identify the set X(i) = {π(1), ..., π(i)} for i = O(log n). It turns out that a small sample, used to estimate a(S) for |S| = 1 and |S| = 2 only, suffices. If it were the case that |a({π(j)})| = 2^{−j} for every j, then we could, in principle, approximate with high probability all the a({j_1})'s and a({j_1, j_2})'s up to arbitrary accuracy, and this would isolate X(i) by (5), since:

(a) for each π(j_1) ∈ X(i) we would have |a({π(j_1)})| ≥ 2^{−i}, while

(b) for each π(j_1) ∉ X(i) and all j_2 we would have MaxL({π(j_1), π(j_2)}) > i, hence |a({π(j_1), π(j_2)})| ≤ 2^{−(i+1)}.

However, in general it is not the case that |a({π(j)})| = 2^{−j} (cancellations may occur in (4)), and we need some further technicalities. In particular, let i_0 be max{ j : |a({π(j)})| ≥ (3/4) 2^{−i} }, and let X_0(i) = {π(1), ..., π(i_0)}. Clearly i_0 ≤ i and X_0(i) ⊆ X(i). Note that every variable π(r) with r ≤ i_0 can still be detected through its second-order coefficients: pairing it with π(i_0) gives |a({π(r), π(i_0)})| = |a({π(i_0)})| ≥ (3/4) 2^{−i} by (5) and the definition of i_0, while by (b) no variable outside X(i) has any first- or second-order coefficient above 2^{−(i+1)}. We will argue that Stage 1 of the algorithm below isolates, with high probability, a set X* with X_0(i) ⊆ X* ⊆ X(i), and that this suffices for our learning purposes.

ALGORITHM 1: Learns Probabilistic Decision Lists.

Stage 1: Approximate X(i).
    Set i := log(2 n ε^{−2});
    Set m := 16 n² ε^{−4} (log 8n + log δ^{−1});
    Input m samples (x_1, l_{x_1}), ..., (x_m, l_{x_m});
    X* := ∅;
    For j_1 := 1 to n do
        ã({j_1}) := (1/m) Σ_t l_{x_t} (−1)^{par_{{j_1}}(x_t)};
        For j_2 := 1 to n, j_2 ≠ j_1, do ã({j_1, j_2}) := (1/m) Σ_t l_{x_t} (−1)^{par_{{j_1, j_2}}(x_t)};
        If |ã({j_1})| ≥ (3/4) 2^{−i}, or for some j_2 |ã({j_1, j_2})| ≥ (3/4) 2^{−i}, then X* := X* ∪ {j_1};

Stage 2: Approximate the Spectrum.
    Set m to an appropriate polynomial in n, ε^{−1}, and log δ^{−1} (as required by Claim 3.5);
    Input m fresh samples (x_1, l_{x_1}), ..., (x_m, l_{x_m});
    For each S ⊆ X* do ã(S) := (1/m) Σ_t l_{x_t} (−1)^{par_S(x_t)};
    Output the hypothesis p̃(x) := Σ_{S ⊆ X*} ã(S) (−1)^{par_S(x)};
        (If p̃(x) < 0 then p̃(x) := 0; if p̃(x) > 1 then p̃(x) := 1.)
END.

Claims 3.4 and 3.5 below justify the correctness of Algorithm 1.

Claim 3.4 At the end of Stage 1, X_0(i) ⊆ X* ⊆ X(i) with probability at least 1 − δ/2.

PROOF (sketch). Follows in the spirit of Remark 3.3, from (5) and standard Chernoff bounds. □

Claim 3.5 At the end of Stage 2, (1/2^n) Σ_x |p(x) − p̃(x)| ≤ ε with probability at least 1 − δ/2.

PROOF (sketch). Assume that at the end of Stage 1, X* is as in Claim 3.4, so that there are at most 2^i coefficients ã(S) to be approximated. Then standard Chernoff bounds imply that, for the particular choice of m, the sum of the squares of the errors in these ã(S)'s is bounded by ε²/2 with the desired probability. Hence:
    (1/2^n) Σ_x |p(x) − p̃(x)| ≤ ( Σ_{S ⊆ X*} (a(S) − ã(S))² + Σ_{S ⊄ X*} a²(S) )^{1/2} ≤ ( ε²/2 + Σ_{S ⊄ X_0(i)} a²(S) )^{1/2} ≤ ε ,

where the first inequality is by Cauchy-Schwartz and Parseval as in (3), the second uses X_0(i) ⊆ X*, and the last bound holds because

    Σ_{S ⊄ X_0(i)} a²(S) = Σ_{S : MaxL(S) > i_0} a²(S) = Σ_{j=i_0+1}^{i} ( Σ_{S : MaxL(S) = j} a²(S) ) + Σ_{S : MaxL(S) > i} a²(S) ≤ ( Σ_{j=i_0+1}^{i} 2^j 2^{−2i} ) + 2^{−i} ≤ 3 · 2^{−i} ≤ ε²/2 ,

by (5), (6), the definition of i_0, and the choice of i in Algorithm 1. □

All the above imply:

Theorem 3.6 The class of probabilistic decision lists is learnable in the uniform model and in polynomial time, via Algorithm 1.
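Below is a minimal Python sketch of the two-stage scheme of Algorithm 1. It is not the verbatim algorithm: the sample sizes are left to the caller, the threshold is the (3/4)·2^{-i} used above, and no attempt is made to optimize. It only illustrates how Stage 1 filters variables by their first- and second-order coefficient estimates, and how Stage 2 builds the clipped hypothesis from the powerset of X*.

```python
from itertools import chain, combinations

def estimate(S, samples):
    """a~(S) from labeled samples, as in equation (2)."""
    return sum(l * (-1) ** (sum(x[j] for j in S) % 2) for x, l in samples) / len(samples)

def algorithm_1(samples_stage1, samples_stage2, n, i):
    """Two-stage learner for probabilistic decision lists (sketch of Algorithm 1)."""
    threshold = 0.75 * 2 ** -i
    # Stage 1: keep j1 if a first- or second-order estimate exceeds the threshold.
    x_star = []
    for j1 in range(n):
        first = abs(estimate((j1,), samples_stage1))
        seconds = (abs(estimate((j1, j2), samples_stage1))
                   for j2 in range(n) if j2 != j1)
        if first >= threshold or any(s >= threshold for s in seconds):
            x_star.append(j1)
    # Stage 2: estimate a~(S) for every S inside X* and return the clipped hypothesis.
    subsets = chain.from_iterable(combinations(x_star, r) for r in range(len(x_star) + 1))
    coeffs = {S: estimate(S, samples_stage2) for S in subsets}

    def p_hat(x):
        v = sum(a * (-1) ** (sum(x[j] for j in S) % 2) for S, a in coeffs.items())
        return min(1.0, max(0.0, v))
    return x_star, p_hat
```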
In the rest of this section we sketch the structure of the spectrum and the learning algorithm for probabilistic decision trees with a single occurrence per literal; both are completely analogous to the case of decision lists. Such a tree over n variables can be described as follows (see Figure 2): let T be a binary tree with at most n interior nodes; consider a labeling of the nodes of T with the variables x_1, ..., x_n, so that each variable labels at most one node, and a labeling of the leaves of T with numbers in [0, 1]. Each element x of the n-cube follows a path of the tree from the root to a unique leaf, and the value of the tree on x is the label of this unique leaf.

Analogously to decision lists, in which the literals appear according to the permutation π (which can also be viewed as a total order on [n]: π(i) < π(i+1)), a decision tree defines a partial order σ on [n] in the natural way: if x_i labels a node which is a descendant of the node labeled x_j, then i >_σ j. For S ⊆ [n], if all elements in S are related in σ (hence all appear along one path from the root to a leaf), define Max(S) to be the largest element of S under σ, that is, the element of S whose node is farthest from the root.

The structural Lemmas 3.7, 3.8, and 3.9 that follow are the analogues of Lemma 3.2; they can be shown by similar manipulations, and their proofs are left for the complete paper.

Lemma 3.7 For S ⊆ [n], if there are at least two elements in S that are not related in σ, then a(S) = 0.

Lemma 3.8 For all S ⊆ [n] such that all elements in S are related, |a(S)| = |a({Max(S)})|.

Lemma 3.9 For all S ⊆ [n] such that all elements in S are related, let i := Max(S). Then |a(S)| ≤ 2^{−depth(i)}, where depth(i) is the number of nodes on the path from the root to the node labeled x_i.

Let Y_1, ..., Y_k be the sets of variables that correspond to each of the k root-to-leaf paths of the tree, and let X_j(i) be the first i variables encountered along the j-th path (X_j(i) = Y_j if the corresponding path is shorter than i). By the above lemmas, to approximate the spectrum of the tree it suffices to approximate a(S) for S ⊆ X_j(i), for all j. There are at most n + 1 such sets X_j(i), and the powerset of each one of them, for say i = O(log n), is polynomially small; hence there is a polynomially small number of coefficients to be approximated. Furthermore, the crucial sets X_j(i) can be approximately isolated as in Stage 1 of Algorithm 1, by estimating a(S) for |S| = 1 and |S| = 2, roughly as follows:

● For each j_1 ∈ [n], let X'_{j_1}(i) := {j_1} ∪ { j_2 : |ã({j_1, j_2})| ≥ (3/4) 2^{−i} };
● If |X'_{j_1}(i)| > c log n, then X'_{j_1}(i) := ∅;
● The sets X_j(i) are approximated by the sets X'_{j_1}(i).
All this discussion can be formalized, yielding Theorem 3.10, which concludes the section.

Theorem 3.10 The class of probabilistic decision trees with a single occurrence per literal is learnable in the uniform model and in polynomial time.
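As a quick numerical illustration of the structure behind Theorem 3.10, the sketch below (Python; the example tree is an arbitrary choice of ours) builds a single-occurrence decision tree, computes its exact spectrum by brute force, and verifies Lemma 3.7: every coefficient a(S) vanishes when S is not contained in the variable set of some root-to-leaf path, i.e. when S contains two unrelated variables.

```python
from itertools import product, combinations

# A node is (variable, left_child, right_child); a leaf is a float in [0, 1].
# Each variable labels at most one node (single occurrence per literal).
tree = (0,
        (1, 0.2, 0.9),
        (2, (3, 0.1, 0.7), 0.5))

def tree_value(node, x):
    """Follow x from the root to a leaf and return the leaf's label."""
    while not isinstance(node, float):
        var, left, right = node
        node = left if x[var] == 0 else right
    return node

def path_variable_sets(node, prefix=frozenset()):
    """The sets Y_1, ..., Y_k of variables along each root-to-leaf path."""
    if isinstance(node, float):
        return [prefix]
    var, left, right = node
    return (path_variable_sets(left, prefix | {var}) +
            path_variable_sets(right, prefix | {var}))

n = 4
coeff = lambda S: sum(tree_value(tree, x) * (-1) ** (sum(x[i] for i in S) % 2)
                      for x in product((0, 1), repeat=n)) / 2 ** n

paths = path_variable_sets(tree)
for size in range(1, n + 1):
    for S in combinations(range(n), size):
        if not any(set(S) <= Y for Y in paths):   # some pair in S is unrelated
            assert abs(coeff(S)) < 1e-12          # Lemma 3.7: a(S) = 0
```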
4 Generalizations

In this final section we argue about general probabilistic decision trees of polynomial size, as well as convex combinations. Proofs here are either omitted or simply sketched; however, all details are straightforward and have been left for the complete paper.

A probabilistic decision tree of polynomial size can be described as follows. Let q(n) be a fixed polynomial, and let T be a binary tree with at most q(n) interior nodes. Consider a labeling of the interior nodes of T with the variables x_1, ..., x_n (the single occurrence restriction is now removed, so a variable may label many nodes), and a labeling of the leaves of T with numbers in [0, 1]. As usual, each element x of the n-cube follows a path of the tree from the root to a unique leaf, and the value of the probabilistic decision tree on x is the label of this unique leaf.

Theorem 4.1 [Extension of [LMN89]] The class of probabilistic decision trees of polynomial size has bounded spectrum, and is hence learnable in the uniform model in slightly super-polynomial time.

PROOF (outline). The proof follows along the lines of the proof for boolean decision trees in [LMN89]. First of all, notice that probabilistic decision trees of small depth have absolutely bounded spectrum; more precisely, it is easy to see that if a probabilistic decision tree has depth k, then a(S) = 0 for all S with |S| > k. Second of all, notice that if a probabilistic decision tree of polynomial size is hit with a suitable "random restriction", then with high probability every long branch of the tree will be forced, and the restricted concept is a decision tree of small depth; very roughly, the random restriction "chops off" long branches. The next steps are fairly technical, but identical to the manipulations in Lemmas 5 through 9 of [LMN89], with the additional ease that the use of Hastad's Switching Lemma [H86], which was crucial for the AC0 case, is unnecessary here: polynomial-size decision trees already behave under random restrictions very much like small depth decision trees. □

The bounded spectrum property is also preserved under convex combinations:

Theorem 4.2 [Straightforward from [LMN89]] If g_1, g_2, ..., g_N are probabilistic concepts of bounded spectrum (N arbitrary), and g is a convex combination of the g_i's, that is, g = Σ_i λ_i g_i where Σ_i λ_i = 1 and all λ_i ∈ [0, 1], then g is a bounded spectrum probabilistic concept, and is hence also learnable in the uniform model in slightly super-polynomial time.

PROOF. First notice that g is indeed a probabilistic concept, in the sense that the range of g is indeed in [0, 1] (g is a weighted average of quantities in [0, 1]). Then notice that for all S, a_g(S) = Σ_i λ_i a_{g_i}(S). Hence

    Σ_{S:|S|>k} a_g²(S) = Σ_{S:|S|>k} ( Σ_i λ_i a_{g_i}(S) )² ≤ Σ_{S:|S|>k} Σ_i λ_i a_{g_i}²(S) = Σ_i λ_i Σ_{S:|S|>k} a_{g_i}²(S) ≤ Σ_i λ_i · poly(n) 2^{−kc} = poly(n) 2^{−kc} .    (7)

□
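The key identity in the proof of Theorem 4.2 is the linearity a_g(S) = Σ_i λ_i a_{g_i}(S). A small self-contained check (Python; the two concepts and the weights are arbitrary choices of ours) is:

```python
from itertools import product, combinations
import random

rng = random.Random(1)
n = 3
cube = list(product((0, 1), repeat=n))
g1 = {x: rng.random() for x in cube}      # two arbitrary probabilistic concepts
g2 = {x: rng.random() for x in cube}
lam = (0.3, 0.7)
g = {x: lam[0] * g1[x] + lam[1] * g2[x] for x in cube}

def coeff(f, S):
    return sum(f[x] * (-1) ** (sum(x[i] for i in S) % 2) for x in cube) / 2 ** n

for size in range(n + 1):
    for S in combinations(range(n), size):
        lhs = coeff(g, S)
        rhs = lam[0] * coeff(g1, S) + lam[1] * coeff(g2, S)
        assert abs(lhs - rhs) < 1e-12     # a_g(S) = sum_i lambda_i * a_{g_i}(S)
```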
We finally argue about the weighted arithmetization of k-DNF. Notice that if p is a weighted arithmetization of a k-DNF formula, that is, p is of the form p(x) = Σ_i λ_i c_i(x), where the c_i's are products of at most k variables or their negations and the λ_i's are as in Theorem 4.2, then, since a_{c_i}(S) = 0 for all c_i's and all |S| > k, (7) gives Σ_{S:|S|>k} a_p²(S) = 0. For constant k, this means that there are only O(n^k) coefficients to be approximated. Therefore:

Theorem 4.3 The weighted arithmetization of k-DNF is learnable in the uniform model and in polynomial time.

Furthermore, and very interestingly, the strong condition Σ_{S:|S|>k} a_p²(S) = 0, coupled with Mansour's techniques [M90], suggests:

Theorem 4.4 The weighted arithmetization of k-DNF is distribution-free learnable in polynomial time.
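A short numerical check of the fact used for Theorems 4.3 and 4.4 — that a weighted arithmetization of a k-DNF has no Fourier mass above frequency k — can be written as follows (Python; the terms and weights are arbitrary illustrative choices of ours):

```python
from itertools import product, combinations

# A term is a dict {variable: required bit}; the weighted arithmetization of a
# k-DNF is p(x) = sum_i lambda_i * c_i(x), with the lambdas in [0,1] summing to 1.
k = 2
terms = [({0: 1, 2: 0}, 0.5), ({1: 1}, 0.3), ({2: 1, 3: 1}, 0.2)]

def p(x):
    return sum(lam * all(x[v] == b for v, b in term.items()) for term, lam in terms)

n = 4
coeff = lambda S: sum(p(x) * (-1) ** (sum(x[i] for i in S) % 2)
                      for x in product((0, 1), repeat=n)) / 2 ** n

for size in range(k + 1, n + 1):
    for S in combinations(range(n), size):
        assert abs(coeff(S)) < 1e-12   # every coefficient of frequency > k vanishes
```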
5 Summary and Open Problems

Here we used Fourier analysis to obtain polynomial-time uniform-learning algorithms for probabilistic decision lists and single literal probabilistic decision trees. We further observed that a straightforward extension of [LMN89] suggests slightly super-polynomial uniform-learning algorithms for probabilistic decision trees of polynomial size and for convex combinations of bounded spectrum concepts, and that the weighted arithmetization of k-DNF is learnable in polynomial time, even distribution-free via [M90].

In this sense, it might be interesting to formalize the natural arithmetization of AC0 and check whether the results in [LMN89] and [M90] extend to such extensions. Of course, the most challenging question is to extend these results in the distribution-free model, or to obtain negative evidence [KV89]. It might also turn out interesting to pursue a more careful study of the wide consequences and uses of Nyquist's theorem, and to see how much carries over to the finite case (in some sense like [Y85]).

References

[BEHW89] A. Blumer, A. Ehrenfeucht, D. Haussler, and M. Warmuth, "Learnability and the Vapnik-Chervonenkis Dimension", Journal of the ACM, 36(4), 1989, pp 929-965.

[H86] J. Hastad, "Computational Limitations of Small Depth Circuits", Ph.D. Thesis, MIT Press, 1986.

[KS90] M. Kearns and R. E. Schapire, "Efficient Distribution-Free Learning of Probabilistic Concepts", Proc. of the 31st IEEE Symposium on Foundations of Computer Science, 1990, pp 382-391.

[KV89] M. Kearns and L. G. Valiant, "Cryptographic Limitations on Learning Boolean Formulae and Finite Automata", Proc. of the 21st ACM Symposium on Theory of Computing, 1989, pp 433-444.

[LMN89] N. Linial, Y. Mansour, and N. Nisan, "Constant Depth Circuits, Fourier Transforms, and Learnability", Proc. of the 30th IEEE Symposium on Foundations of Computer Science, 1989, pp 574-579.

[M90] Y. Mansour, "Learning via Fourier Transforms", preprint.

[R87] R. Rivest, "Learning Decision Lists", Machine Learning, 2(3), 1987, pp 229-246.

[V84] L. G. Valiant, "A theory of the learnable", Communications of the ACM, 27(11), 1984, pp 1134-1142.

[Y85] A. C. Yao, "Separating the polynomial-time Hierarchy by Oracles", Proc. of the 26th IEEE Symposium on Foundations of Computer Science, 1985, pp 1-10.