Chapter 32

Learning the Fourier Spectrum of Probabilistic Lists and Trees

William Aiello*        Milena Mihail†

*Bellcore, Communications Research, Morristown NJ.
†Bellcore, Communications Research, Morristown NJ.

Abstract

We show that probabilistic decision lists, and probabilistic decision trees with a single occurrence of each literal, can be learned in polynomial time under the uniform instance distribution by reconstructing their Fourier representation from a small (polynomial) sample. This result is in the spirit of the work of Linial, Mansour, and Nisan [LMN89], which first identified Fourier analysis with learning algorithms, and of Mansour [M90]. Like [LMN89] and [M90], our algorithms learn a concept by approximating its non-negligible Fourier coefficients. The new ingredient that allows us to achieve polynomiality is that we are able to isolate a polynomially small set of non-negligible coefficients that reside in a super-polynomially large area of the spectrum. We further observe that several general classes of probabilistic concepts, which include polynomial-size decision trees and convex combinations, have bounded spectrum and hence admit $n^{\mathrm{polylog}\,n}$ uniform-learning algorithms. The polynomial complexities of our main results should be contrasted with the $n^{\mathrm{polylog}\,n}$ complexities in the analogous cases of [LMN89] and [M90].

1 Introduction

Probabilistic concepts were introduced by Kearns and Schapire [KS90] as a model that captures uncertainties inherent in natural learning. The simplest interpretation is the following. Boolean concepts are concepts whose range is the set {0,1}, while the range of probabilistic concepts is the interval [0,1]: for a probabilistic concept $p$ and an element $\bar x$ of the domain, the label of $\bar x$ is 1 with probability $p(\bar x)$ and 0 with probability $1-p(\bar x)$. A concrete "weather-prediction" example is: "If it was raining on Monday and it is raining on Tuesday, then it will rain on Wednesday with probability 0.8." Here 0.8 corresponds to the fraction of rainy Mondays and Tuesdays that are followed by rain on Wednesday; for any specific rainy Monday-Tuesday sequence it either rains on Wednesday or it does not, and the label of the sequence is probabilistic. The reader is referred to [KS90] for further justification of the model.

There is a famous (and very practical) theorem of Nyquist in signal processing which roughly says the following: "If a continuous signal has bounded spectrum, then it can be completely reconstructed via discrete sampling." The approach of [LMN89], of [M90], and of this paper is a discrete, finite analogue of Nyquist's theorem: statements about "uncountable versus countable" domains translate here to statements about "exponential versus polynomial" domains, and the non-negligible Fourier coefficients of a concept are estimated by polynomial sampling. In this sense we show that probabilistic decision lists and probabilistic decision trees with one occurrence of each literal are learnable under the uniform distribution: all such concepts can be approximated by efficiently identifying and then estimating their polynomially many non-negligible Fourier coefficients.
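To make the sampling model concrete, here is a minimal Python sketch (the helper name and the toy concept are ours, not the paper's) of drawing uniform examples of a probabilistic concept and attaching a {0,1} label that is 1 with probability $p(\bar x)$:

```python
import random

def sample(p, n, m, rng=random.Random(0)):
    """Draw m uniform examples of a probabilistic concept p : {0,1}^n -> [0,1].

    Each example is a pair (x, l): x is a uniformly random point of the n-cube
    and l is 1 with probability p(x), 0 otherwise -- the sampling model of [KS90].
    """
    examples = []
    for _ in range(m):
        x = tuple(rng.randint(0, 1) for _ in range(n))
        label = 1 if rng.random() < p(x) else 0
        examples.append((x, label))
    return examples

# Toy "weather" concept on 3 bits: if bits 0 and 1 are both 1, predict 1 w.p. 0.8.
p = lambda x: 0.8 if x[0] == 1 and x[1] == 1 else 0.3
print(sample(p, n=3, m=5))
```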
Uniform learning is a special form of Valiant's distribution-free learning [V84] in which the examples of the concept to be learned are drawn according to the uniform (as opposed to an arbitrary) distribution. In the uniform learning scenario for probabilistic concepts there is a class P of probabilistic concepts (P is simply a collection of concepts), one of which is the target concept $p$. The task of a learning algorithm for the class P is to produce a "good" approximation $\tilde p$ of $p$ after a training phase and some further efficient computation. During the training phase the algorithm is presented with a "small" number of samples; each sample is a uniformly generated element $\bar x$ of the n-cube, together with a label 1 or 0 that is determined by $p(\bar x)$.

Despite their seeming resemblance to boolean concepts, learning probabilistic concepts presents significantly larger difficulties in general. This is reflected both in the fact that the combinatorial dimension which determines distribution-free learnability (for example, in the Vapnik-Chervonenkis dimension theorems of [BEHW89]) is more complex for probabilistic than for boolean concepts [KS90], and in the fact that several simple concept classes whose boolean analogues have distribution-free learning algorithms are not known to have such algorithms in the probabilistic case. A prime example of the latter situation is decision lists.

A probabilistic decision list over n variables is a single-branch decision tree whose edges are labeled by literals or their negations (see Figure 1). The leaves of the tree are labeled by numbers $p_1,\dots,p_{n+1}$ in [0,1]. Each element $\bar x$ of the n-cube naturally follows a path from the root to a unique leaf of the list, and the value of the decision list on $\bar x$ is the label of this unique leaf. In the boolean case, where the $p_i$'s are in {0,1}, decision lists and more general cases are learnable distribution-free [R87]; however, for probabilistic decision lists the best known distribution-free result requires the list to be monotone, that is, $p_1 > p_2 > \dots > p_{n+1}$ [KS90].

In this paper we show that probabilistic decision lists are uniform-learnable in polynomial time: Theorem 3.6. Furthermore, our techniques extend to polynomial-time uniform-learning algorithms for arbitrary probabilistic decision trees in which each literal occurs once: Theorem 3.10 (see Figure 2).

The approach that we use for learning such concepts is to consider their Fourier representation and to approximate the Fourier coefficients that are not negligible; in this way the target probabilistic concept is approximated in a vector sense. This novel method of learning by approximating Fourier representations was introduced by Linial, Mansour, and Nisan [LMN89] in the context of boolean concepts (specifically, the class AC0 of constant depth circuits); here we make the simple observation that their techniques extend to probabilistic concepts. For the purposes of this paper, the efficiency of the whole scheme depends upon the number of Fourier coefficients that must be approximated. For decision lists and single literal decision trees we argue that all but a polynomially small set of coefficients are negligible (Lemmas 3.2 and 3.9), and we give an algorithmic scheme that efficiently determines the frequencies of the non-negligible coefficients (Remark 3.3, Algorithm 1).
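As an illustration of the definition above, here is a minimal sketch of evaluating a probabilistic decision list on a point of the n-cube; the rule-based encoding (one test per level, plus a default leaf) is our own choice, with illustrative leaf values:

```python
def eval_decision_list(x, rules, default):
    """Evaluate a probabilistic decision list on x in {0,1}^n.

    rules   : list of (variable index, required bit, leaf probability); the walk
              exits at the first level whose test fails
    default : probability at the final leaf p_{n+1}, reached if every test passes
    """
    for var, bit, leaf in rules:
        if x[var] != bit:
            return leaf
    return default

# A 4-variable list querying x1, x3, x4, x2 in this order (0-indexed below).
rules = [(0, 1, 0.9), (2, 1, 0.2), (3, 0, 0.7), (1, 1, 0.4)]
print(eval_decision_list((1, 1, 1, 0), rules, default=0.1))
```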
Decision lists are generally accepted as quite natural concepts and have been studied extensively in the boolean case, where the $p_i$'s are in {0,1} (see [R87] for a collection of results); in particular, it is well known that boolean decision lists are learnable distribution-free.

Both conceptually and technically, the new part of this work is in the discussion of the set of significant Fourier coefficients. In [LMN89] and [M90] the set of non-negligible coefficients resides in a super-polynomially large area of "low frequencies", which is the reason for the $n^{\mathrm{polylog}\,n}$ complexities obtained there. Here the significant coefficients also reside in a super-polynomially large area of the spectrum, but the set of significant coefficients is polynomially small and we show how to identify its frequencies in polynomial time; this is the reason why we obtain polynomial complexities instead of the $n^{\mathrm{polylog}\,n}$ complexities in the analogous cases, and we consider it the main technical thrust of our work.

For arbitrary probabilistic decision trees of polynomial size (i.e., when the single literal condition is removed) we observe that $n^{\mathrm{polylog}\,n}$ learning is feasible: Theorem 4.1. This follows along the lines of [LMN89] (the AC0 proof, which makes further use of Hastad's Switching Lemma). An interesting consequence of this observation is that the weighted arithmetization of k-DNF is learnable in the uniform model in polynomial time and, by the recent elegant work of Mansour [M90], distribution-free (Theorems 4.3 and 4.4). We also point out that Mansour's techniques extend to general classes of probabilistic concepts, so that uniform learning can be treated as a special case of distribution-free learning.

Finally, before proceeding to the technical presentation of our work, it is worth mentioning that all the Fourier learning methods that appear here and elsewhere [LMN89], [M90], apart from their remarkable comparability to Nyquist's theorem, possess a variety of further desirable features: they are conceptually simple, easy to implement, and the algorithmic methods are uniform and parallelizable.

The rest of this extended abstract is organized as follows. In Section 2 we establish the technical background. In Section 3 we present the polynomial learnability results for decision lists and single literal decision trees. In Section 4 we discuss arbitrary decision trees of polynomial size and convex combinations of concepts with bounded spectrum. Summary and open problems are in Section 5.

2 Preliminaries

Elements of the n-dimensional cube $Q_n = \{0,1\}^n$ are denoted $\bar x = x_1\dots x_n$. A (probabilistic) n-concept is a function $p : Q_n \rightarrow [0,1]$; for any $\bar x \in Q_n$, the label $l_{\bar x}$ of $\bar x$ is 1 with probability $p(\bar x)$ and 0 with probability $1-p(\bar x)$. A sample of a concept $p$ is a pair $(\bar x, l_{\bar x})$, where $\bar x$ is drawn uniformly from $Q_n$. An n-concept class $P_n$ is a set of n-concepts, and $P = \bigcup_n P_n$. A function $\tilde p$ is an $\varepsilon$-approximation of a concept $p$ if $\mathrm{E}_{\bar x}|p(\bar x)-\tilde p(\bar x)| \le \varepsilon$, where the expectation is over $\bar x$ drawn uniformly from $Q_n$.

Definition 2.1 [analogous to [KS90]] A concept class P is learnable in the uniform model if there is an algorithm which, for any $n$, $\varepsilon$, $\delta$ and any target $p \in P_n$, takes as input $f(n,\varepsilon,\delta)$ independent samples of $p$ and produces, with probability at least $1-\delta$, a hypothesis $\tilde p$ which is an $\varepsilon$-approximation of $p$; moreover, the running time of the algorithm is polynomial in $f(n,\varepsilon,\delta)$. We say that P is learnable in polynomial time if $f(n,\varepsilon,\delta)$ is polynomial in $n$, $\varepsilon^{-1}$, and $\log\delta^{-1}$. We say that P is learnable in slightly super-polynomial time if $f(n,\varepsilon,\delta)$ is of the form $(n\varepsilon^{-1})^{\mathrm{polylog}(n\varepsilon^{-1})}$ and polynomial in $\log\delta^{-1}$.
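To make the success criterion concrete, here is a brute-force sketch (toy concepts of our own) of the quantity $\mathrm{E}_{\bar x}|p(\bar x)-\tilde p(\bar x)|$ that a learner must drive below $\varepsilon$:

```python
from itertools import product

def approximation_error(p_fn, q_fn, n):
    """E_x |p(x) - q(x)| under the uniform distribution on the n-cube; q_fn is an
    eps-approximation of p_fn exactly when this quantity is at most eps."""
    cube = list(product((0, 1), repeat=n))
    return sum(abs(p_fn(x) - q_fn(x)) for x in cube) / len(cube)

p = lambda x: 0.8 if (x[0], x[1]) == (1, 1) else 0.3     # target concept
q = lambda x: 0.75 if (x[0], x[1]) == (1, 1) else 0.3    # candidate hypothesis
print(approximation_error(p, q, n=3))                    # approximately 0.0125
```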
Concepts can be viewed as elements of the $2^n$-dimensional vector space of all real-valued functions on the n-cube (one dimension for each domain point). In this context, determining the values of a concept on the vertices of the n-cube is equivalent to working in the standard basis. Of course, the obvious difficulty of the learning task is that all directions of the standard basis are equally important: we are required to approximate correctly the projection of $p$ on all but a vanishingly small fraction of these directions, after seeing the behavior of $p$ on only a vanishingly small fraction of the directions (i.e., the small set of samples).

In an effort to overcome this fundamental difficulty, Linial, Mansour, and Nisan [LMN89] introduced the idea of switching bases, from the standard basis to a Fourier basis, so that the projection of $p$ along most directions of the Fourier basis is negligible; the number of non-negligible directions then becomes comparable to the small number of samples. In the rest of this section we briefly review the Fourier approach of [LMN89]; the only new point is that everything extends from boolean concepts to concepts whose range is the interval [0,1].

The Fourier basis for the set of all real-valued functions on the n-cube is defined as follows. For each $S \subseteq [n]$ consider a "parity" function $\chi_S$ associated with $S$ in the natural way: $\chi_S(\bar x) = (-1)^{\mathrm{par}_S(\bar x)}$, where $\mathrm{par}_S(\bar x) = 0$ if $\sum_{i\in S}x_i$ is even and $\mathrm{par}_S(\bar x) = 1$ if $\sum_{i\in S}x_i$ is odd. It is well known and easy to verify that $\bigcup_S\{\chi_S\}$ is an orthonormal basis with respect to the inner product $(p,q) = \frac{1}{2^n}\sum_{\bar x}p(\bar x)q(\bar x)$. Hence any real-valued function on the n-cube (therefore any concept $p$) can be written as

$p(\bar x) = \sum_S a(S)\,\chi_S(\bar x)$,   (1)

where the $S$-th Fourier coefficient of $p$ is

$a(S) = (p,\chi_S) = \frac{1}{2^n}\sum_{\bar x}p(\bar x)\,\chi_S(\bar x)$.   (2)

The learning algorithms are based crucially on the fact that the Fourier coefficients are simply averages, and that these averages can be efficiently estimated by sampling, as suggested by (2): if $(\bar x_1,l_{\bar x_1}),\dots,(\bar x_m,l_{\bar x_m})$ are independent samples of $p$, then

$\tilde a(S) = \frac{1}{m}\sum_{t=1}^{m} l_{\bar x_t}(-1)^{\mathrm{par}_S(\bar x_t)}$   (3)

is an unbiased estimate of $a(S)$, whose accuracy is quantified in terms of $m$ by standard Chernoff bounds.
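The following sketch (helper names are ours) computes the parity characters $\chi_S$, the exact coefficient $a(S)$ of equation (2) by brute force over the cube, and the sampled estimate $\tilde a(S)$ of equation (3), so the two can be compared on a toy concept:

```python
import itertools, random

def chi(S, x):
    """Parity character chi_S(x) = (-1)^{sum_{i in S} x_i}."""
    return -1 if sum(x[i] for i in S) % 2 else 1

def fourier_coefficient(p, n, S):
    """Exact a(S) = (1/2^n) * sum_x p(x) chi_S(x), by brute force over the cube."""
    return sum(p(x) * chi(S, x) for x in itertools.product((0, 1), repeat=n)) / 2 ** n

def estimate_coefficient(samples, S):
    """Sampled estimate (3): a~(S) = (1/m) * sum_t l_t * chi_S(x_t)."""
    return sum(l * chi(S, x) for x, l in samples) / len(samples)

# Compare exact and estimated coefficient on a small concept.
rng = random.Random(1)
n, p = 4, (lambda x: 0.2 + 0.6 * x[0] * x[2])
samples = []
for _ in range(20000):
    x = tuple(rng.randint(0, 1) for _ in range(n))
    samples.append((x, int(rng.random() < p(x))))
S = (0, 2)
print(fourier_coefficient(p, n, S), estimate_coefficient(samples, S))  # ~0.15 for both
```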
Now, if we use as a hypothesis for $p$ the concept $\tilde p$ whose Fourier coefficients are the $\tilde a(S)$'s, then

$\mathrm{E}_{\bar x}|p(\bar x)-\tilde p(\bar x)| \;\le\; \sqrt{\mathrm{E}_{\bar x}(p(\bar x)-\tilde p(\bar x))^2} \;=\; \sqrt{\textstyle\sum_S (a(S)-\tilde a(S))^2},$

where the inequality is Cauchy-Schwarz and the equality is Parseval's identity. Hence, if $\sum_S(a(S)-\tilde a(S))^2 \le \varepsilon^2$, then $\tilde p$ is an $\varepsilon$-approximation of $p$. Notice also that values $\tilde p(\bar x) < 0$ or $\tilde p(\bar x) > 1$ can be trivially set to 0 and 1 respectively; this only decreases the error.

We say that a concept class P has bounded (Fourier) spectrum if there is a constant $c$ such that, for all concepts $p \in$ P and all "high" frequencies $k$,

$\sum_{S:|S|>k} a^2(S) \;\le\; \mathrm{poly}(n)\,2^{-kc};$

that is, the coefficients of high frequency are negligible, and only the coefficients $a(S)$ with $|S| \le k$, for $k = \mathrm{poly}(\log n, \log\varepsilon^{-1})$, need to be estimated.

Theorem 2.2 [Extension of [LMN89]] If a probabilistic concept class P has bounded spectrum, then P is learnable in the uniform model in slightly super-polynomial time.

Proof (sketch). The algorithm is as follows. Set $k = \mathrm{poly}(\log n,\log\varepsilon^{-1})$ so that $\sum_{S:|S|>k}a^2(S) \le \mathrm{poly}(n)2^{-kc} \le \varepsilon^2/2$. Using $m$ samples, compute $\tilde a(S)$ as in (3) for all $S$ with $|S| \le k$, and output the (clamped) hypothesis $\tilde p = \sum_{S:|S|\le k}\tilde a(S)\chi_S$. If $m$ is an appropriate polynomial in the number of estimated coefficients, $\varepsilon^{-1}$ and $\log\delta^{-1}$, then Chernoff bounds guarantee that, with probability at least $1-\delta$, $\sum_{S:|S|\le k}(a(S)-\tilde a(S))^2 \le \varepsilon^2/2$. Hence $\sum_S(a(S)-\tilde a(S))^2 \le \varepsilon^2$ (taking $\tilde a(S)=0$ for $|S|>k$), and $\tilde p$ is an $\varepsilon$-approximation of $p$ by the previous remarks. Since there are $\sum_{j\le k}\binom{n}{j} = n^{O(k)}$ coefficients to estimate, the sample size and the running time are slightly super-polynomial. □

In [LMN89] it was shown that the class AC0 has bounded spectrum, and hence that AC0 is learnable in the uniform model in slightly super-polynomial time. The intuition behind their result is as follows: "the spectrum of poly-size decision trees concentrates on the low frequencies", because such trees are, probabilistically, of small depth, while "the spectrum of the parity function", which is known not to be computable in AC0 [Y85], [H86], "consists of a single frequency: the highest." We should point out that the results of Section 3 below (as well as the results of Section 4 with respect to polynomial-size probabilistic decision trees) are justified by arguments on the spectrum that differ significantly from the above.
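A minimal sketch of the generic scheme behind Theorem 2.2 (the function name and structure are ours; the paper specifies the scheme only at this level of detail): estimate every coefficient of frequency at most $k$ from one sample set and clamp the resulting hypothesis to [0,1].

```python
from itertools import combinations

def chi(S, x):
    return -1 if sum(x[i] for i in S) % 2 else 1

def low_degree_learn(samples, n, k):
    """Bounded-spectrum learner sketched in Theorem 2.2: estimate a~(S) for every
    S with |S| <= k and return the clamped hypothesis p~(x) = sum_S a~(S) chi_S(x)."""
    m = len(samples)
    coeffs = {}
    for size in range(k + 1):
        for S in combinations(range(n), size):
            coeffs[S] = sum(l * chi(S, x) for x, l in samples) / m
    def hypothesis(x):
        value = sum(a * chi(S, x) for S, a in coeffs.items())
        return min(1.0, max(0.0, value))   # values outside [0,1] are trivially clamped
    return hypothesis

# There are sum_{j <= k} C(n, j) = n^{O(k)} coefficients, which is what makes the
# generic scheme slightly super-polynomial when k = polylog(n).
```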
on z E Qn is the root label decision list in which the literals we introduce in Figure appear 1 we have T(1)= 1, 7F(3) =4, 7r(4)= 2. In general the ith of the list are labeled ~=(i) and Z=(i). TO denote of the i-th branch O-1 vector ~= whether ~. For 1101 since 3, level Con- naturally list in edges Figure is farthest the the maximum subsets maximum of di: ~i = 1 – di. the partition of the 1 we have by 23, c1, di to denote mate along of the level variable by the branches {S : MazL(S) = n}. = i}l which the spectrum = classes we shall of decision 2i-1, so that the are polynomial efficiently lists for approxi- are, roughly, the following: (i) ~s,~~~~(s)>i a2(S) < 2-i, which suggests that for the purpose of approximating the Fourier representation it su#ices to approximate each one of a(S) such that Maz L(S) s i, for i = O(log n). Now of the {7r(l) set {S : MazL(S) ,. ... m(i)} coefficients Each s so there i} are and coefficients is the only to approximate, one of these for powerset polynomially i = O (log n ). can be approximated satisfactory in polynomial time by sampling. (ii) The function satisfactorily T on 1,..., i can be approximated in polynomial time by further sam- pfing. Point (i) is justified tures all the structure 1 that Lemma by Remark by Lemma 3.2 below of the spectrum. which Point 3.3 and the learning cap(ii) 3.2 For all S G [n], = i then if MazL(S) the which = ~12n-ipi - ~ is Zn-jpj - ILz+11(4) j=i+l of the decision [a(S)/ = la({7r(i)})l (5) S 2-i we introduce: C; = {Z: = {d,... Zm(l) dlj. = . .7z=(i_l) Clearly, (6) < 2–i S:iVfc3L(S)>i x ~(i) Z dz}, fOr 1 ~ i $ n To see (4) recall by (1) that dn} [C’i[ = 2n-i ICn+,l a2(S) E = ~–1> PROOF. G+l for 1 < i ~ n, and of COUrSe ZP(Z) (-1)=(=), = 1. In the above terms, a decision list is formally defined as follows: Definition decision n-bit 3.1 list ~ defined. = some d, some Z1. ..zn p(z) A concept if for O-1 vector all~= Ci is Algorithm follows. la(S)l n-cube = l},..., : MazL(S) The lines justified an n-bit are labeled use the notation suggested the left edge we introduce example, the left 24, Z2. We further complement o To denote ~T(i) or ~T(i) label of the list, of all to the ies of partition efficiently a permutation versely, z~ and ~i are on level r– 1(i) of the list. For convenience define the level function /(i) = m-l ( i). ● 1{S cardinalit i=o(]ogn), many the order m of [n] . For example, T(2)= edges the tree leaf. To formalize along from of a decision which we define a partition according 0, {S : MazL(S) a nary tree with edges labeled by literals xi and their negations ~i, so that if the right edge of an internal node is labeled ~i, (resp. ( Zi ) ) then the left edge is labeled ~i, (resp. (z;)). There are n + 1 leaves labeled by pl . . . pn+l. Any Z E Q. naturally follows a path To do this itturns : i E S C [n]} lines terminology. informally list. suggests Clearly the some useful the {xi the variable in the set: in detail. along of variables to identify variables a polynomial 295 level of S or MazL(S) to be j~ S: i(j) 2 l(i) for all i ~ S. Now z~a=~(s) is the desired variable. In turn, this of further sketched. As mentioned decision and is treated presenting formalize advantage algorithm. follows and is simply Before to take crucial down can be efficiently identified. Thus we obtain the first nontrivial example of a concept class with bounded spectrum for which For a subset out is concentrated, subset LISTS AND TREES pi, p over n variables permutation T of [n], E Qn pl, . . . . p~+l the following where the Ci ‘S are is a . 
Lemma 3.2 For all $S \subseteq [n]$, if $\mathrm{MaxL}(S) = i$ then

$|a(S)| = \frac{1}{2^n}\Big|2^{n-i}p_i - \sum_{j=i+1}^{n}2^{n-j}p_j - p_{n+1}\Big|$   (4)

$\qquad\;\; = |a(\{\pi(i)\})| \;\le\; 2^{-i}$,   (5)

and, for every $i$,

$\sum_{S:\mathrm{MaxL}(S)>i} a^2(S) \;\le\; 2^{-i}$.   (6)

Proof. To see (4), recall by (2) that $a(S) = \frac{1}{2^n}\sum_{\bar x}p(\bar x)(-1)^{\mathrm{par}_S(\bar x)}$, and consider the contribution of each class $C_j$ of the partition. Clearly $S \subseteq \{\pi(1),\dots,\pi(i)\}$.

For each $j < i$ we argue that $\sum_{\bar x\in C_j}(-1)^{\mathrm{par}_S(\bar x)} = 0$. This is because all vectors in $C_j$ have the coordinates $x_{\pi(1)},\dots,x_{\pi(j-1)},x_{\pi(j)}$ forced to $d_1,\dots,d_{j-1},\bar d_j$ respectively, so the parity of the bits of $S \cap \{\pi(1),\dots,\pi(j)\}$ is fixed, while the coordinate $x_{\pi(i)}$, which belongs to $S$, is free to vary in {0,1}. Consequently, for half of the vectors of $C_j$ the quantity $\mathrm{par}_S(\bar x)$ is even and for the other half it is odd; hence, averaged over all vectors in $C_j$, $(-1)^{\mathrm{par}_S(\bar x)}$ is zero.

For $j = i$, there are $2^{n-i}$ vectors in $C_i$, and all of them have the coordinates $x_{\pi(1)},\dots,x_{\pi(i-1)},x_{\pi(i)}$ forced to $d_1,\dots,d_{i-1},\bar d_i$; hence $\mathrm{par}_S$ is fixed on $C_i$. Similarly, for $i < j \le n$ there are $2^{n-j}$ vectors in $C_j$, all of which have the coordinates $x_{\pi(1)},\dots,x_{\pi(i)}$ forced, with $x_{\pi(i)} = d_i$; hence $\mathrm{par}_S$ is fixed on $C_j$ and, moreover, is the complement of its value on $C_i$, since the two classes differ exactly in the bit $x_{\pi(i)} \in S$. The case $j = n+1$ is easily seen to contribute $p_{n+1}$ with the same sign as the $C_j$'s for $j > i$. This gives (4), noticing that all the $p_j$'s are in [0,1].

Equation (5) follows from (4) by noticing that the right-hand side of (4) does not depend on $S$ but only on $i = \mathrm{MaxL}(S)$, and the bound $2^{-i}$ follows since both $2^{n-i}p_i$ and $\sum_{j=i+1}^{n}2^{n-j}p_j + p_{n+1}$ lie in $[0, 2^{n-i}]$. To verify (6), recall that there are $2^{j-1}$ elements in $\{S : \mathrm{MaxL}(S) = j\}$ and, by (5), each of them has $a^2(S) \le 2^{-2j}$; hence $\sum_{j>i}2^{j-1}2^{-2j} \le 2^{-i}$. □

Remark 3.3 As we discussed, for our learning purposes we would wish to know the set $X(i) = \{\pi(1),\dots,\pi(i)\}$ for $i = O(\log n)$. In principle, by sampling we can approximate with high probability all the $a(\{j_1\})$'s and $a(\{j_1,j_2\})$'s up to arbitrary (inverse-polynomial) accuracy. If it were the case that $|a(\{\pi(j)\})| \ge 2^{-i}$ for all $j \le i$, this would suffice to isolate $X(i)$, since by (5) every $\pi(j) \notin X(i)$ (hence with $\ell(j) > i$) has $|a(\{\pi(j)\})| \le 2^{-(i+1)}$. In general, however, this is not true, and some further technicalities are needed. Let $i_0 := \max\{j \le i : |a(\{\pi(j)\})| \ge 2^{-i}\}$ and $X_0(i) := \{\pi(1),\dots,\pi(i_0)\}$; clearly $i_0 \le i$ and $X_0(i) \subseteq X(i)$. Stage 1 of the algorithm below isolates, with high probability, a set $X^*$ such that $X_0(i) \subseteq X^* \subseteq X(i)$: (a) every $\pi(j)$ with $j \le i_0$ satisfies $|a(\{\pi(j)\})| \ge 2^{-i}$ or, by (5), $|a(\{\pi(j),\pi(i_0)\})| = |a(\{\pi(i_0)\})| \ge 2^{-i}$, so it passes the test of Stage 1, while (b) every $\pi(j)$ with $\ell(j) > i$ has $|a(S)| \le 2^{-(i+1)}$ for every $S \ni \pi(j)$ with $|S| \le 2$, so it fails the test. As we show in Claim 3.5, the coefficients $a(S)$ with $i_0 < \mathrm{MaxL}(S) \le i$ are each smaller than $2^{-i}$, and ignoring them altogether introduces only a negligible error; hence knowing such an $X^*$ suffices.

ALGORITHM 1 (learns decision lists)
BEGIN
Stage 1: Approximate $X(i)$.
    Set $i := \log 2\varepsilon^{-2} + \log n$;
    Set $m$ := a suitable polynomial in $n$, $\varepsilon^{-1}$ and $\log\delta^{-1}$, as dictated by the Chernoff bounds of Claim 3.4;
    Input $m$ samples $(\bar x_1,l_{\bar x_1}),\dots,(\bar x_m,l_{\bar x_m})$;
    $X^* := \emptyset$;
    For $j_1 := 1$ to $n$ do $\tilde a(\{j_1\}) := \frac{1}{m}\sum_t l_{\bar x_t}(-1)^{\mathrm{par}_{\{j_1\}}(\bar x_t)}$;
    For $j_1 := 1$ to $n$ and each $j_2 \ne j_1$ do $\tilde a(\{j_1,j_2\}) := \frac{1}{m}\sum_t l_{\bar x_t}(-1)^{\mathrm{par}_{\{j_1,j_2\}}(\bar x_t)}$;
    For each $j_1$: if $|\tilde a(\{j_1\})| \ge \frac{3}{4}2^{-i}$, or $|\tilde a(\{j_1,j_2\})| \ge \frac{3}{4}2^{-i}$ for some $j_2$, then $X^* := X^* \cup \{j_1\}$;
Stage 2: Approximate the Spectrum.
    Set $m := 16n^2\varepsilon^{-4}(\log 8n + \log\delta^{-1})$;
    Input $m$ fresh samples;
    For each $S \subseteq X^*$ do $\tilde a(S) := \frac{1}{m}\sum_t l_{\bar x_t}(-1)^{\mathrm{par}_S(\bar x_t)}$;
    Output the hypothesis $\tilde p(\bar x) := \sum_{S\subseteq X^*}\tilde a(S)(-1)^{\mathrm{par}_S(\bar x)}$; if $\tilde p(\bar x) < 0$ then $\tilde p(\bar x) := 0$; if $\tilde p(\bar x) > 1$ then $\tilde p(\bar x) := 1$;
END.

Claims 3.4 and 3.5 below justify the correctness of Algorithm 1.
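A compact Python rendering of Algorithm 1 follows. The threshold $\frac{3}{4}2^{-i}$ and the explicit sample sizes below are our own illustrative choices standing in for the constants dictated by the Chernoff-bound analysis of Claims 3.4 and 3.5; the two-stage structure mirrors the listing above:

```python
import math, random
from itertools import combinations

def chi(S, x):
    return -1 if sum(x[i] for i in S) % 2 else 1

def estimate(samples, S):
    return sum(l * chi(S, x) for x, l in samples) / len(samples)

def draw(p_fn, n, m, rng):
    samples = []
    for _ in range(m):
        x = tuple(rng.randint(0, 1) for _ in range(n))
        samples.append((x, int(rng.random() < p_fn(x))))
    return samples

def learn_decision_list(p_fn, n, eps, delta, rng=random.Random(0)):
    # Stage 1: isolate X* with X_0(i) <= X* <= X(i) from |S| = 1 and |S| = 2 estimates.
    i = math.ceil(math.log2(2 / eps ** 2) + math.log2(n))       # "Set i := log 2eps^-2 + log n"
    m1 = math.ceil(16 * 4 ** i * math.log(8 * n * n / delta))   # illustrative Chernoff-style size
    samples = draw(p_fn, n, m1, rng)
    threshold = 0.75 * 2 ** (-i)
    singles = {j: estimate(samples, (j,)) for j in range(n)}
    pairs = {(j1, j2): estimate(samples, (j1, j2))
             for j1 in range(n) for j2 in range(j1 + 1, n)}
    X_star = [j for j in range(n)
              if abs(singles[j]) >= threshold
              or any(abs(v) >= threshold for (a, b), v in pairs.items() if j in (a, b))]
    # Stage 2: estimate a~(S) for every S inside X*, then clamp the hypothesis to [0,1].
    m2 = math.ceil(16 * n ** 2 * eps ** -4 * (math.log(8 * n) + math.log(1 / delta)))
    samples = draw(p_fn, n, m2, rng)
    coeffs = {S: estimate(samples, S)
              for size in range(len(X_star) + 1)
              for S in combinations(X_star, size)}
    def hypothesis(x):
        return min(1.0, max(0.0, sum(a * chi(S, x) for S, a in coeffs.items())))
    return hypothesis
```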
Claim 3.4 At the end of Stage 1, $X_0(i) \subseteq X^* \subseteq X(i)$ with probability at least $1-\delta/2$.

Proof (sketch). Follows in the spirit of Remark 3.3 and standard Chernoff bounds. □

Claim 3.5 At the end of Stage 2, $\mathrm{E}_{\bar x}|p(\bar x)-\tilde p(\bar x)| \le \varepsilon$ with probability at least $1-\delta/2$.

Proof (sketch). Assume that at the end of Stage 1 $X^*$ is as in Claim 3.4, so that at most $2^i$ coefficients $\tilde a(S)$ are approximated in Stage 2. Standard Chernoff bounds then imply that, for the particular choice of $m$, the sum of the squares of all the errors $(a(S)-\tilde a(S))^2$, $S \subseteq X^*$, is bounded by $\varepsilon^2/2$ with the desired probability. Hence, in the spirit of Section 2,

$\mathrm{E}_{\bar x}|p(\bar x)-\tilde p(\bar x)| \;\le\; \sqrt{\sum_{S\subseteq X^*}(a(S)-\tilde a(S))^2 + \sum_{S\not\subseteq X^*}a^2(S)} \;\le\; \sqrt{\varepsilon^2/2 + \varepsilon^2/2} \;=\; \varepsilon,$

where the last bound holds because

$\sum_{S\not\subseteq X^*}a^2(S) \;\le\; \sum_{S:\mathrm{MaxL}(S)>i_0}a^2(S) \;=\; \sum_{j=i_0+1}^{i}\Big(\sum_{S:\mathrm{MaxL}(S)=j}a^2(S)\Big) + \sum_{S:\mathrm{MaxL}(S)>i}a^2(S) \;\le\; \sum_{j=i_0+1}^{i}2^{j}2^{-2i} + 2^{-i} \;\le\; 3\cdot 2^{-i} \;<\; \varepsilon^2/2,$

by (5), (6), the definition of $i_0$, and the choice of $i$ in Algorithm 1. □

All of the above imply:

Theorem 3.6 The class of probabilistic decision lists is learnable in the uniform model in polynomial time, via Algorithm 1.

The rest of this section concerns decision trees with a single occurrence per literal (see Figure 2). The structure of the spectrum of such trees is completely analogous to that of decision lists, and so is the learning algorithm; here we only sketch the idea.

A probabilistic decision tree over n variables with a single occurrence per literal can be described as follows. Let $T$ be a binary tree with at most $n$ interior nodes. Consider a labeling of the interior nodes of $T$ with the variables $x_1,\dots,x_n$, so that each variable labels at most one node, and a labeling of the leaves of $T$ with numbers in [0,1]. Each element $\bar x$ of the n-cube follows a path of the tree from the root to a unique leaf, and the value of the tree on $\bar x$ is the label of this unique leaf.

Analogously to the permutation $\pi$ of decision lists (which can also be viewed as a total order on $[n]$), a decision tree defines a partial order $\sigma$ on $[n]$ in the natural way: $x_i <_\sigma x_j$ if the node labeled $x_j$ is a descendant of the node labeled $x_i$. For $S \subseteq [n]$ whose elements are pairwise related in $\sigma$ (hence all appear along one path from the root to a leaf), define $\mathrm{Max}(S)$ to be the largest element of $S$ in $\sigma$, that is, the element farthest from the root. The structural Lemmas 3.7, 3.8 and 3.9 that follow are the natural analogues of Lemma 3.2; they can be shown by similar manipulations, and the complete proofs are left for the full paper.

Lemma 3.7 For $S \subseteq [n]$, if there are at least two elements of $S$ that are not related in $\sigma$, then $a(S) = 0$.

Lemma 3.8 For all $S \subseteq [n]$ such that all elements of $S$ are related in $\sigma$, let $i$ be the depth of $\mathrm{Max}(S)$ (the root having depth 1). Then $|a(S)| = |a(\{\mathrm{Max}(S)\})| \le 2^{-i}$.

Lemma 3.9 For every $i$, $\sum a^2(S) \le n\,2^{-i}$, where the sum ranges over all $S$ whose element $\mathrm{Max}(S)$ has depth greater than $i$ (the analogue of (6)).

Let $Y_1,\dots,Y_k$ be the sets of variables encountered along each one of the $k$ root-to-leaf paths of the tree (there are at most $n+1$ such paths), and let $X_j(i)$ be the subset of $Y_j$ consisting of the first $i$ variables encountered along the $j$-th path starting from the root (if the path is shorter than $i$, then $X_j(i)$ equals $Y_j$). By the above lemmas, to approximate the spectrum of the tree it suffices to approximate $a(S)$ for the sets $S$ contained in some $X_j(i)$, with $i = O(\log n)$. There are at most $n+1$ sets $X_j(i)$, and the powerset of each one of them is polynomially small; hence there is only a polynomially small number of coefficients to be approximated.
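The tree lemmas can likewise be checked by brute force on a small single-occurrence tree. In the following sketch (illustrative tree shape and leaf values of our own; variables are 0-indexed) the pair {x1, x2} is unrelated in $\sigma$, so its coefficients vanish as Lemma 3.7 predicts, while {x0, x2, x3} lies on one root-to-leaf path and obeys the analogue of (5):

```python
from itertools import product

def tree_value(x, tree):
    """Evaluate a single-occurrence probabilistic decision tree given as nested
    tuples: a leaf is a float, an internal node is (variable, left, right),
    with the left branch taken when the variable is 0."""
    while not isinstance(tree, float):
        var, left, right = tree
        tree = right if x[var] else left
    return tree

def coeff(p_fn, n, S):
    return sum(p_fn(x) * (-1) ** sum(x[i] for i in S)
               for x in product((0, 1), repeat=n)) / 2 ** n

# Root x0; left child x1 with leaves .2/.7; right child x2, whose left child is x3.
tree = (0, (1, 0.2, 0.7), (2, (3, 0.1, 0.5), 0.9))
n, p_fn = 4, (lambda x: tree_value(x, tree))

# x1 and x2 lie on different branches (unrelated in sigma): Lemma 3.7 gives a(S) = 0.
print(coeff(p_fn, n, (1, 2)))          # ~0.0 (up to float round-off)
print(coeff(p_fn, n, (1, 2, 3)))       # ~0.0
# x0, x2, x3 lie on one path; |a(S)| equals |a({Max(S)})| = |a({x3})|.
print(abs(coeff(p_fn, n, (0, 2, 3))), abs(coeff(p_fn, n, (3,))))   # both ~0.05
```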
Furthermore, the crucial sets $X_j(i)$ can be approximately isolated, as in Stage 1 of Algorithm 1, by estimating $a(S)$ for $|S| = 1$ and $|S| = 2$; roughly, as follows:

● For each $j_1 \in [n]$, let $X_{j_1}(i) := \{j_1\} \cup \{j_2 : |\tilde a(\{j_1,j_2\})| \ge \frac{3}{4}2^{-i}\}$;
● If $|X_{j_1}(i)| > c\log n$ then $X_{j_1}(i) := \emptyset$;
● The relevant sets $X_j(i)$ are then among the sets $X_{j_1}(i)$.

All this discussion can be formalized to Theorem 3.10, which concludes the section.

Theorem 3.10 The class of probabilistic decision trees with a single occurrence per literal is learnable in the uniform model in polynomial time.

4 Generalizations

In this final section we argue that more general classes of probabilistic concepts have bounded spectrum: decision trees of polynomial size and convex combinations, as well as the weighted arithmetization of k-DNF. Proofs here are either omitted or simply sketched; the details are straightforward and have been left for the complete paper.

A probabilistic decision tree of polynomial size can be described as follows. Let $q(n)$ be a fixed polynomial and let $T$ be a binary tree with at most $q(n)$ interior nodes. Consider a labeling of the interior nodes of $T$ with the variables $x_1,\dots,x_n$ (a variable may now label many nodes) and a labeling of the leaves of $T$ with numbers in [0,1]. As usual, each element $\bar x$ of the n-cube follows a path of the tree from the root to a unique leaf, and the value of the probabilistic decision tree on $\bar x$ is the label of this unique leaf.

As mentioned in Section 2, polynomial-size decision trees are "very far" in the spectrum from the parity function, and, exactly as in [LMN89], bounded spectrum results follow:

Theorem 4.1 The class of probabilistic decision trees of polynomial size has bounded spectrum; therefore, by Theorem 2.2, it is learnable in the uniform model in slightly super-polynomial time.

Proof (outline). The proof follows along the lines of the proof of the Main Lemma (Lemma 9) in [LMN89]. First notice that if a probabilistic polynomial-size decision tree is hit with a suitable "random restriction", then with high probability every long branch of the tree will be "chopped off", and the restricted concept is a decision tree of small depth; very roughly, the random-restriction method suggests that any poly-size decision tree is, probabilistically, close to a tree of small depth. Second, notice that decision trees of small depth have absolutely bounded spectrum: it is easy to see that if a probabilistic decision tree has depth $k$ then $a(S) = 0$ for all $S$ with $|S| > k$. The remaining steps are fairly technical but identical to the manipulations of Lemmas 5 through 9 in [LMN89] (with the additional ease that, in the case of trees, the use of Hastad's Switching Lemma is unnecessary). □

The boundedness of the spectrum is also preserved by convex combinations. In particular:

Lemma 4.2 [Straightforward extension of [LMN89]] Let $g_1, g_2, \dots, g_N$ be probabilistic concepts ($N$ arbitrary) and let $g$ be a convex combination of the $g_i$'s, that is, $g = \sum_i \lambda_i g_i$ where $\sum_i \lambda_i = 1$ and all $\lambda_i$'s are in [0,1]. If $g_1, g_2, \dots, g_N$ are bounded spectrum concepts, then $g$ is a bounded spectrum probabilistic concept.

Proof. First notice that $g$ is indeed a probabilistic concept, in the sense that the range of $g$ is in [0,1] ($g$ is a weighted average of quantities in [0,1]). Second, notice that for all $S$, $a_g(S) = \sum_i \lambda_i a_{g_i}(S)$. Hence

$\sum_{S:|S|>k} a_g^2(S) = \sum_{S:|S|>k}\Big(\sum_i \lambda_i a_{g_i}(S)\Big)^2 \;\le\; \sum_i \lambda_i \sum_{S:|S|>k} a_{g_i}^2(S) \;\le\; \sum_i \lambda_i\,\mathrm{poly}(n)2^{-kc} \;=\; \mathrm{poly}(n)2^{-kc}$.   (7)

□
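The first step of the proof, linearity of the Fourier coefficients under convex combinations, can be checked directly; a small sketch with toy concepts of our own:

```python
from itertools import product

def coeff(p_fn, n, S):
    return sum(p_fn(x) * (-1) ** sum(x[i] for i in S)
               for x in product((0, 1), repeat=n)) / 2 ** n

n = 3
g1 = lambda x: 0.1 + 0.8 * x[0] * x[1]           # two toy probabilistic concepts
g2 = lambda x: 0.5 * x[2] + 0.25
lam = (0.3, 0.7)
g = lambda x: lam[0] * g1(x) + lam[1] * g2(x)    # their convex combination

for S in [(), (0,), (0, 1), (2,), (0, 1, 2)]:
    lhs = coeff(g, n, S)
    rhs = lam[0] * coeff(g1, n, S) + lam[1] * coeff(g2, n, S)
    assert abs(lhs - rhs) < 1e-12                # a_g(S) = sum_i lambda_i a_{g_i}(S)
print("coefficients of the convex combination are the convex combination of coefficients")
```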
We finally argue about the weighted arithmetization of k-DNF of Kearns and Schapire [KS90]. Recall that a weighted arithmetization of k-DNF is a probabilistic concept $p$ of the form $p(\bar x) = \sum_i \lambda_i c_i(\bar x)$, where the $c_i$'s are products of at most $k$ variables or their negations and the weights $\lambda_i$ are such that the range of $p$ is in [0,1]. Since $a_{c_i}(S) = 0$ for every $c_i$ and every $S$ with $|S| > k$, the argument of (7) suggests that $\sum_{S:|S|>k} a_p^2(S) = 0$. For constant $k$ this means that there are only $O(n^k)$ coefficients to be approximated. Therefore:

Theorem 4.3 The weighted arithmetization of k-DNF is learnable in the uniform model in polynomial time.

Furthermore, and very interestingly, the strong condition $\sum_{S:|S|>k} a_p^2(S) = 0$, coupled with Mansour's techniques [M90], suggests:

Theorem 4.4 The weighted arithmetization of k-DNF is learnable distribution-free in polynomial time.

In the same sense, it might be interesting to formalize some natural arithmetization of AC0 and check whether the results of [LMN89] and [M90] extend.

5 Summary and Open Problems

We used Fourier analysis to obtain polynomial-time uniform-learning algorithms for probabilistic decision lists and probabilistic decision trees with a single occurrence per literal. We further observed that a straightforward extension of [LMN89] suggests slightly super-polynomial uniform-learning algorithms for probabilistic decision trees of polynomial size. Of course the most challenging open question is to extend these results to the distribution-free model, or to obtain some negative evidence [KV89]. It might also turn out to be interesting to pursue a careful study of the wide consequences and uses of Nyquist's theorem, and of how much of it carries over to the finite case.

References

[BEHW89] A. Blumer, A. Ehrenfeucht, D. Haussler, and M. Warmuth, "Learnability and the Vapnik-Chervonenkis Dimension", Journal of the ACM, 36(4), 1989, pp. 929-965.

[H86] J. Hastad, "Computational Limitations of Small Depth Circuits", Ph.D. Thesis, MIT Press, 1986.

[KS90] M. Kearns and R. E. Schapire, "Efficient Distribution-Free Learning of Probabilistic Concepts", Proc. of the 31st IEEE Symposium on Foundations of Computer Science, 1990, pp. 382-391.

[KV89] M. Kearns and L. G. Valiant, "Cryptographic Limitations on Learning Boolean Formulae and Finite Automata", Proc. of the 21st ACM Symposium on Theory of Computing, 1989, pp. 433-444.

[LMN89] N. Linial, Y. Mansour, and N. Nisan, "Constant Depth Circuits, Fourier Transforms, and Learnability", Proc. of the 30th IEEE Symposium on Foundations of Computer Science, 1989, pp. 574-579.

[M90] Y. Mansour, "Learning via Fourier Transforms", preprint.

[R87] R. Rivest, "Learning Decision Lists", Machine Learning, 2(3), 1987, pp. 229-246.

[V84] L. G. Valiant, "A Theory of the Learnable", Communications of the ACM, 27(11), 1984, pp. 1134-1142.

[Y85] A. C. Yao, "Separating the Polynomial-Time Hierarchy by Oracles", Proc. of the 26th IEEE Symposium on Foundations of Computer Science, 1985, pp. 1-10.