A Tutorial on Property Testing
Dana Ron, Tel Aviv University

Property Testing (Informal Definition)
For a fixed property P and any object O, determine whether O has property P or is far from having property P (i.e., far from every object that has P).
The task should be performed by querying the object in as few places as possible.

Examples
• The object can be a graph (represented by its adjacency matrix), and the property can be 3-colorability.
• The object can be a string, and the property can be membership in a given regular language L.
• The object can be a function, and the property can be linearity.

Context
Property testing can be viewed as:
• A relaxation of exactly deciding whether the object has the property.
• A relaxation of learning the object.
In either case, we want the testing algorithm to be significantly more efficient than the decision/learning algorithm.

When can Property Testing be Useful?
• The object is too large to even scan fully, so an approximate decision must be made.
• The object is not too large, but (1) exact decision is NP-hard (e.g., coloring), or (2) a sub-linear approximate algorithm is preferred to a polynomial exact algorithm.
• Testing is used as a preliminary step to exact decision or learning. In the first case it can quickly rule out an object that is far from the property. In the second case it can aid in efficiently selecting a good hypothesis class.

Property Testing - Background
• Initially defined by Rubinfeld and Sudan in the context of program testing (of algebraic functions).
• Goldreich, Goldwasser and Ron initiated the study of testing properties of graphs.
• A growing body of work deals with properties of functions, graphs, strings, sets of points, ...
Many algorithms have complexity that is sub-linear in (or even independent of) the size of the object.

Talk Organization
Will discuss four topics:
• Testing Algebraic Properties of Functions: Linearity Testing [BLR]
• Testing “Basic” (non-algebraic) Properties of Functions: Singletons, Monomials, small DNF [PRS]
• Testing Graph Properties: Testing Bipartiteness [GGR]
• Testing Properties of Strings: Testing Membership in Regular Languages [AKNS]

Testing Algebraic Properties of Functions: Linearity Testing [BLR]

Linearity Testing
Def1: Let F be a finite field. A function f : F^m → F is called linear (multi-linear) if there exist constants a_1,…,a_m ∈ F s.t. for every x = x_1,…,x_m ∈ F^m it holds that f(x) = Σ_i a_i x_i.
Def2: A function f is said to be ε-far from linear if for every linear function g, dist(f,g) > ε, where dist(f,g) = Pr[f(x) ≠ g(x)] (x selected uniformly in F^m).
Fact: A function f : F^m → F is linear iff for every x,y ∈ F^m it holds that f(x) + f(y) = f(x+y).

Linearity Testing Cont’
Linearity Test (Input: F, m, ε)
1) Uniformly and independently select Θ(1/ε) pairs of elements x,y ∈ F^m.
2) For every pair x,y selected, verify that f(x) + f(y) = f(x+y).
3) If linearity is violated for any of the selected pairs (i.e., f(x) + f(y) ≠ f(x+y)), then REJECT, otherwise ACCEPT.
Observe: If f is linear then the test accepts w.p. 1.
Theorem: If f is ε-far from linear then the test rejects it with probability at least 2/3.

Linearity Testing Cont’
Proof (of special case): Let ε(f) denote the distance of f to the closest linear function g. Assume 1/2 − ε(f) is bounded below by a constant.
Let G = {x : f(x) = g(x)} (so that Pr[x ∉ G] = ε(f) > ε).
Say that x and y are a violating pair if f(x) + f(y) ≠ f(x+y).
Observation: for any x, y, if among the 3 elements x, y, x+y we have 2 in G and 1 not in G, then x,y are a violating pair.
Consider one of the 3 (disjoint) events. Can show: Pr[x ∉ G, y ∈ G, (x+y) ∈ G] ≥ ε(f)(1 − 2ε(f)).
Since the events are disjoint, the probability of a violating pair is at least 3ε(f)(1 − 2ε(f)) = 6ε(f)(1/2 − ε(f)) = Ω(ε).
Since the test takes Θ(1/ε) pairs x,y, it rejects w.h.p.
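
The three steps of the linearity test can be sketched in Python for the special case F = GF(2), where vectors are stored as m-bit integers and addition is XOR. The names `linearity_test` and `parity`, the `rng` parameter, and the constant `c` hidden in the Θ(1/ε) sample size are assumptions made for illustration and are not taken from [BLR].

```python
import math
import random


def linearity_test(f, m, eps, rng=None, c=4):
    """Sketch of the linearity test over GF(2)^m.

    Vectors are ints in [0, 2**m), so vector addition is bitwise XOR.
    `c` is an assumed constant for the Theta(1/eps) number of pairs.
    """
    rng = rng or random.Random()
    for _ in range(math.ceil(c / eps)):
        # Step 1: select a uniform, independent pair x, y.
        x = rng.randrange(2 ** m)
        y = rng.randrange(2 ** m)
        # Steps 2-3: reject as soon as a violating pair is found.
        if (f(x) ^ f(y)) != f(x ^ y):
            return False
    return True  # no violating pair seen: ACCEPT


def parity(mask):
    """Linear function over GF(2): XOR of the bits of x selected by mask."""
    return lambda x: bin(x & mask).count("1") % 2
```

A linear function such as `parity(0b1011)` is accepted on every run, matching the observation that the test has one-sided error. The constant function 1 makes every pair violating, so it is rejected at the first pair.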

Linearity Testing Cont’
How do we deal with the general case (where ε(f) is not necessarily bounded away from 1/2)?
In order to prove that if ε(f) > ε then the test rejects w.p. ≥ 2/3, prove the contrapositive: if the test accepts w.p. > 1/3 (i.e., there is a small fraction of violating pairs) then f is ε-close to linear. That is, there exists a linear g s.t. dist(f,g) ≤ ε.
Specifically, define g as follows:
g(x) = 1 if Pr_y[f(x+y) − f(y) = 1] ≥ 1/2
g(x) = 0 if Pr_y[f(x+y) − f(y) = 0] > 1/2
Can prove that if the fraction of violating pairs (w.r.t. f) is sufficiently small then f is close to g and g is linear.
Note: the definition of g allows for Self-Correcting of f (for every x, can determine g(x) w.h.p. using few queries to f).

Testing “Basic” Properties of Functions: Singletons, Monomials, small DNF [PRS]

Testing “Basic” Properties of Functions
This work considers “the most basic” function classes:
• Singletons: f(x) = x_i
• Monomials: f(x) = x_i ∧ x_j ∧ x_k
• DNF: f(x) = (x_i ∧ x_j) ∨ (x_k ∧ x_l ∧ x_m)

Testing “Basic” Properties of Functions Cont’
• Can test whether f is a singleton using O(1/ε) queries.
• Can test whether f is a monomial using O(1/ε) queries.
• Can test whether f is a monotone DNF with at most t terms using Õ(t²/ε) queries.
Common theme: no dependence of the query complexity on the size of the input, n, and polynomial dependence on the distance parameter, ε.

Learning Boolean Formulae
Basic observation: (proper) learning implies testing.
• Can learn singletons and monomials under the uniform distribution using O(log n / ε) queries [BEHW].
• Can properly learn monotone DNF with t terms and r literals using Õ(r log² n / ε + t(r + 1/ε)) queries [A+BJT].
Main difference in the testing results: no dependence on n, and a different algorithmic approach.

Testing (Monotone) Singletons
Singletons satisfy:
(1) Pr[f(x) = 1] = 1/2
(2) f(x ∧ y) = f(x) ∧ f(y) for all x, y
Natural test: check, by sampling, that the conditions hold (approximately).
Can analyze the natural test for the case that the distance between the function and the class of singletons is not too big (bounded away from 1/2).
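
As a rough illustration, the natural test for monotone singletons might look like the following Python sketch. Inputs are n-bit integers, so ∧ is bitwise AND. The name `natural_singleton_test`, the sample count, and the tolerance `tol` on condition (1) are illustrative assumptions, not parameters from [PRS].

```python
import random


def natural_singleton_test(f, n, samples=200, tol=0.15, rng=None):
    """Sketch: check conditions (1) and (2) on uniformly sampled pairs.

    `samples` and `tol` are assumed values chosen for illustration only.
    """
    rng = rng or random.Random()
    ones = 0
    for _ in range(samples):
        x = rng.randrange(2 ** n)
        y = rng.randrange(2 ** n)
        ones += f(x)
        # Condition (2): f(x AND y) must equal f(x) AND f(y).
        if f(x & y) != (f(x) & f(y)):
            return False
    # Condition (1): the empirical fraction of 1-values is close to 1/2.
    return abs(ones / samples - 0.5) <= tol
```

A singleton never fails condition (2), while the constant-0 function satisfies condition (2) on every pair and is rejected only by condition (1).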

Testing Singletons II - Parity Testing
Observation: Singletons are a special case of parity functions (i.e., functions of the form g(x) = ⊕_{i∈S} x_i).
Claim: Let g(x) = ⊕_{i∈S} x_i. If |S| ≥ 2 then Pr[g(x ∧ y) ≠ g(x) ∧ g(y)] ≥ 1/4.
Modified algorithm:
(1) Test whether f is a parity function (with dist. par. ε) using the algorithm of [BLR].
(2) Uniformly select a constant number of pairs x,y and check whether any is a violating pair (i.e., f(x ∧ y) ≠ f(x) ∧ f(y)).

Testing Singletons III - Self Correcting
This “almost works”:
• If f is a singleton, it is always accepted.
• If f is ε-far from parity, it is rejected w.h.p.
• But if f is ε-close to a parity function g, then we cannot simply apply the claim to argue that there are many violating pairs w.r.t. f.
If we could only test violations w.r.t. g instead of f ...
Use the Self-Corrector of [BLR] to “fix” f into a parity function (g), and then test violations on the self-corrected version.

Testing Singletons IV - The Algorithm
Final Algorithm for Testing Singletons:
(1) Test whether f is a parity function with dist. par. ε using the algorithm of [BLR].
(2) Uniformly select a constant number of pairs x,y. Verify that Self-Cor(f,x) ∧ Self-Cor(f,y) = Self-Cor(f, x ∧ y).
(3) Verify that Self-Cor(f, 1ⁿ) = 1, where 1ⁿ is the all-ones input.

Testing Monomials and Monotone DNF
The monomial testing algorithm has a similar structure to the singleton testing algorithm. (Here too it suffices to find a test for monotone monomials.)
The first stage of linearity testing is replaced by Affinity Testing: if f is a monomial then F1 = {x : f(x) = 1} is an affine subspace. [Fact: H is an affine subspace iff ∀ x,y,z ∈ H, x ⊕ y ⊕ z ∈ H.]
The affinity test is similar to the parity test: select x,y ∈ F1 and z ∈ {0,1}ⁿ, and verify that f(x ⊕ y ⊕ z) = f(x) ⊕ f(y) ⊕ f(z) (which equals f(z), since f(x) = f(y) = 1).
The second stage is as in the singleton test (check for violating pairs). Here affinity adds structure that helps analyze the second stage.
Testing monotone DNF: use the monomial test as a sub-routine (a monotone DNF function is a disjunction of monotone monomials).
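
Returning to the final singleton-testing algorithm, its three steps can be sketched in Python as below, with inputs stored as n-bit integers (⊕ is XOR, ∧ is AND). This is a minimal sketch under assumptions: the self-corrector is taken to be a majority vote over f(x ⊕ y) ⊕ f(y) for random y, following the definition of g given for the [BLR] test, and the names and the constants `votes`, `pairs` and `c` are hypothetical.

```python
import math
import random


def self_correct(f, x, n, rng, votes=15):
    """Majority vote of f(x XOR y) XOR f(y) over random y (assumed form)."""
    ones = 0
    for _ in range(votes):
        y = rng.randrange(2 ** n)
        ones += f(x ^ y) ^ f(y)
    return 1 if 2 * ones > votes else 0


def singleton_test(f, n, eps, rng=None, pairs=20, c=4):
    """Sketch of the three-step singleton test; constants are assumptions."""
    rng = rng or random.Random()
    # (1) Parity (linearity) test with distance parameter eps.
    for _ in range(math.ceil(c / eps)):
        x, y = rng.randrange(2 ** n), rng.randrange(2 ** n)
        if (f(x) ^ f(y)) != f(x ^ y):
            return False
    # (2) Look for violating pairs on the self-corrected function.
    for _ in range(pairs):
        x, y = rng.randrange(2 ** n), rng.randrange(2 ** n)
        lhs = self_correct(f, x, n, rng) & self_correct(f, y, n, rng)
        if lhs != self_correct(f, x & y, n, rng):
            return False
    # (3) The self-corrected value at the all-ones input must be 1.
    return self_correct(f, 2 ** n - 1, n, rng) == 1
```

When f is exactly a parity function, every vote in `self_correct` agrees, so a singleton passes all three steps, while a parity over an even number of variables fails step (3) at the latest.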

Testing Graph Properties [GGR]

Testing Graph Properties
Assume graphs are represented by their adjacency matrix. In this model, the testing algorithm can perform queries of the form “is there an edge between u and v?”.
Distance between graphs: the fraction of entries in the adjacency matrix on which they differ.
This model is most appropriate for testing dense graphs.

Results for Testing Graph Properties
In the Adjacency-Matrix model:
• Can test Bipartiteness, k-colorability, r-Clique, r-Cut and a more general family of partition problems, with sample complexity poly(1/ε) and running time exp(poly(1/ε)), both independent of the size of the graph [GGR].
• Can test all properties that can be formulated by a first-order expression about graphs, with sample and time complexity independent of the graph size (but at a “steep” cost as a function of 1/ε) [AFKS].
• In directed graphs, can test acyclicity with sample and time complexity poly(1/ε) [BR] (special case treated in [EKKRV]).
In the Incidence-Lists model:
• Connectivity, k-edge-connectivity: complexity poly(1/ε) [GR1]
• Bipartiteness: poly(1/ε)·|V|^(1/2) [GR2]
• Diameter: poly(1/ε) [PR]

Testing Bipartiteness
Def: A graph G=(V,E) is bipartite iff its vertices can be partitioned into two subsets V1 and V2 s.t. there are no edges between vertices that are both in V1 or both in V2.
Recall that we can decide whether a graph is bipartite in time O(|V|+|E|) by Breadth First Search (BFS). However, we want a very fast approximate decision. Furthermore, the algorithm and analysis can be extended to testing k-colorability (which is NP-hard).

Testing Bipartiteness Cont’
Bipartite Testing Algorithm
• Uniformly and independently select m = Θ(log(1/ε)/ε²) vertices in the graph.
• For every pair of selected vertices, query whether there is an edge between the two, obtaining the induced subgraph.
• Perform a BFS to determine whether the induced subgraph is bipartite. If it is, output accept, otherwise output reject.
Query complexity and running time of the algorithm: O(log²(1/ε)/ε⁴).
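
The bipartite testing algorithm can be sketched in Python as follows. The graph is accessed only through an edge-query function `adj(u, v)`, matching the adjacency-matrix model. The function name, the constant `c` in the sample size, and the use of sampling without replacement (a simplification of the independent selection on the slide) are assumptions made for illustration.

```python
import math
import random
from collections import deque


def bipartite_test(adj, n, eps, rng=None, c=1):
    """Sketch: sample vertices, query all pairs, 2-color the sample by BFS.

    adj(u, v) answers the query "is there an edge between u and v".
    `c` is an assumed constant in m = Theta(log(1/eps) / eps**2).
    """
    rng = rng or random.Random()
    m = max(2, math.ceil(c * math.log(1 / eps) / eps ** 2))
    # Simplification: sample without replacement, capped at n vertices.
    sample = rng.sample(range(n), min(n, m))
    # Query every sampled pair to obtain the induced subgraph.
    nbrs = {v: [] for v in sample}
    for i, u in enumerate(sample):
        for v in sample[i + 1:]:
            if adj(u, v):
                nbrs[u].append(v)
                nbrs[v].append(u)
    # BFS 2-coloring of each connected component of the induced subgraph.
    color = {}
    for s in sample:
        if s in color:
            continue
        color[s] = 0
        queue = deque([s])
        while queue:
            u = queue.popleft()
            for v in nbrs[u]:
                if v not in color:
                    color[v] = 1 - color[u]
                    queue.append(v)
                elif color[v] == color[u]:
                    return False  # odd cycle in the sample: reject
    return True  # induced subgraph is bipartite: accept
```

A bipartite graph is accepted on every run because each induced subgraph of a bipartite graph is bipartite. The number of queries is quadratic in the sample size, which is where the O(log²(1/ε)/ε⁴) bound comes from.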
A slight variant of the algorithm yields O(log²(1/ε)/ε³), and [AK] have reduced this to O(log²(1/ε)/ε²).
Correctness: If the graph is bipartite then it is clearly always accepted. From this point on, assume the graph is ε-far from bipartite. Will show that it is rejected w.p. at least 2/3.

Analysis of Bipartiteness Testing Alg
Def: Let X be a subset of vertices, and (X1,X2) a partition of X. Say that an edge (u,v) is violating w.r.t. (X1,X2) if either both u,v are in X1 or both are in X2. If there are no violating edges w.r.t. (X1,X2) then say it is a bipartite partition.
View the sample as consisting of two parts: U and S.
Show that w.h.p., for every partition (U1,U2) of U there is no partition (S1,S2) of S s.t. (U1 ∪ S1, U2 ∪ S2) is bipartite. In other words, the subgraph induced by the sample U ∪ S is not bipartite.

Analysis of Bipartiteness Testing Alg Cont’
Def1: A vertex v is influential if it has degree at least (ε/4)|V|.
Def2: A vertex v is covered by a subset U if it has a neighbor in U.
Lem: W.h.p. U covers all but at most (ε/4)|V| of the influential vertices.
[Figure: V divided into influential vertices covered by U, uncovered influential vertices, and non-influential vertices.]

Analysis of Bipartiteness Testing Alg Cont’
Let C be the vertices covered by U and let R be the remaining vertices.
Observe: Since R contains at most all non-influential vertices, and at most (ε/4)|V| influential ones, the total number of edges incident to R is at most (ε/2)|V|².
Recall that the graph G is ε-far from bipartite: every partition (V1,V2) of V has > ε|V|² violating edges.
Together, the above two imply that every partition of U ∪ C has > (ε/2)|V|² violating edges.

Analysis of Bipartiteness Testing Alg Cont’
Consider a fixed partition (U1,U2) of U, and let (C1,C2) be the partition of C where neighbors of vertices in U1 are put in C2 and neighbors of vertices in U2 are put in C1.
Since (U1 ∪ C1, U2 ∪ C2) contains > (ε/2)|V|² violating edges, this many pairs of vertices (v,w) in C1 (C2) have a violating edge between them.
If we get such a pair (v,w) in the sample S, then for every partition (S1,S2), the partition (U1 ∪ S1, U2 ∪ S2) contains some violating edge.
Since there are many such pairs, the sample S contains such a pair w.h.p.
By a union bound over the number of partitions (U1,U2) (at most 2^|U| = exp(log(1/ε)/ε)), S contains such a pair for every (U1,U2).

Testing Other Graph (Partition) Properties
Each property (k-colorability, r-Clique, r-Cut) has its own “particularities”, but in all cases:
• The “natural algorithm” (take a small uniform sub-sample and check the induced subgraph for the property) works.
• The analysis works by breaking the sample into two parts: the first part, U, “forces” constraints on the possible partitions of all vertices. The second part, S, “tests” whether the constraints are satisfied.
The more general results of [AFKS] (combination of partition and forbidden subgraph properties (∃∀ properties)) also analyze the natural algorithm. The analysis builds on Szemerédi’s regularity lemma.

Testing Properties of Strings: Membership in Regular Languages [AKNS]

Testing Membership in Regular Languages
For a fixed regular language L ⊆ {0,1}*, the testing algorithm should accept w.h.p. every word w ∈ L, and should reject w.h.p. every word w that differs on more than εn bits (n=|w|) from every w’ ∈ L (|w’|=n). The algorithm can query any bit w_i of w.
Let M=(Q,F,q0,δ) be the (minimum) DFA that accepts L. Let G(M) denote the directed graph induced by M (that is, there is a directed edge for every transition).
Def: Let u = w_i…w_j be a sub-word of w that starts at position i. Say that u is feasible w.r.t. M starting from i if there exists a state q s.t. q can be reached in G(M) from q0 in exactly i−1 steps, and there is a path of length n−(|u|+i−1) in G(M) from q’ = δ(q,u) to an accepting state qf.

Testing Regular Language Cont’
Consider a special case:
• There is a unique accepting state qf.
• Q can be partitioned into two parts, C and D: q0,qf ∈ C; the subgraph G(C) is strongly connected; there are no edges from D to C.
• The GCD of the cycle lengths in G(C) is 1.
It follows that there exists a constant r (= O(|Q|²)) s.t. ∀ q,q’ ∈ C and ∀ m ≥ r, there exists a path of length m from q to q’.

Testing Regular Language Cont’
The Algorithm (simplified version):
• Uniformly and independently select Θ(r/ε) indices 1 ≤ i ≤ n.
• For each i selected, check that the substring w_i … w_(i+r/ε) is feasible.
• If any substring is infeasible then reject, otherwise accept.
Number of queries: O(r²/ε²) = poly(|Q|)/ε², and running time poly(|Q|)/ε² (can improve to almost linear dependence on 1/ε).
Correctness: If w ∈ L, then always accept. If w is ε-far from L, would like to show that w contains many (short) infeasible substrings (causing rejection w.h.p.).

Testing Regular Language Cont’
Prove the contrapositive statement: if the number of (short) infeasible substrings in w is small then w is close to some w* ∈ L.
Proof idea: partition w (except the first and last r symbols) into disjoint maximal feasible substrings u1, …, uh: each uj is feasible, but the addition of the next symbol wk makes it infeasible.
By slightly modifying each uj, can “glue” the modified substrings together into one string w* that “does not leave C” and reaches qf.
If h is small (as assumed), then w* is close to w.

Testing Regular Language Cont’
The general case works by reducing to the special case we discussed. In particular, need to decompose G(M) into its strongly connected components, and consider how a word “moves between them”.
This work has been extended by Newman to testing Branching Programs of bounded width, and by Kupferman and XX to testing Tree Automata.

Directions for Further Research
• “Biggest” open problem: Can we characterize which properties are efficiently testable? (E.g., find a measure analogous to VC-dimension.)
• Find families of properties that are efficiently testable. Some such results exist for testing graph properties (e.g., partition problems), and we have the regular languages result.
• Extend the scope of property testing.

Testing Properties of Collections of Points: Testing of Clustering

Property Testing - Background
Properties of functions:
• Initially defined by Rubinfeld and Sudan in the context of program testing. They tested algebraic properties of functions: low-degree polynomials.
• Other work on testing algebraic properties: [BLR, R, EKKRV, ...].
• Non-algebraic properties: monotonicity [GGLRS, DGLRSS, B, FN].
Properties of other objects:
• Main focus: graph properties [GGR, GR, AK, AFKS, BR, PR, CS, ...].
• A growing body of work deals with properties of strings [AKNS, N, PRR], sets of points [PR], geometric objects [CSZ], distributions [BFRW], and more.
All algorithms have complexity that is sub-linear in (or even independent of) the size of the object.