Property of Natali Ruchansky

Matrix Completion with Queries
Natali Ruchansky, Mark Crovella, Evimaria Terzi

Can you guess the picture?

What about now?

And now?

Salvador Domingo Felipe Jacinto Dalí i Domènech

How did you do it?
Columns: Input Image, Available Information, Our Estimate.
1. Available information: for most, there is too little information to recognize shapes or patterns. Our estimate: I'm not sure. Arbitrary guess.
2. Available information: recognize human features (ear, eyebrow shape, and facial contour). Our estimate: I know, a human face. (not Van Gogh)
3. Available information: I know this mustache! Our estimate: My friend Salvador Dali!

How Much and Which Information?
So the question is, if we start at this image:
abracadabra
How much and which information do I need to add so that my particular algorithm can infer the image?

If we can answer "How much and which information do I need to add so that my particular algorithm can infer the image?", then we can:
1. Choose which information to add, tailored to the particular reconstruction algorithm.
2. Reconstruct based on this information.

The example of reconstructing Dali is an instance of the problem of Matrix Completion:
Given a partially-observed matrix M, fill in the missing entries.

In particular, the version applied to real-world data is Low Rank Matrix Completion:
Given a partially-observed matrix M of low rank r, fill in the missing entries.

Completion of what?
• Yelp users rate restaurants (a users × restaurants matrix). But a given user has not visited all restaurants, so the matrix is partially observed.
• Traffic counters measure traffic on roads (a source × destination matrix). But counters do not exist on all roads, so the matrix is partially observed.
• Biologists measure the interaction of proteins (a protein × protein matrix). But they cannot exhaustively run all experiments, so the matrix is partially observed.
https://www.telegeography.com/telecom-maps/global-traffic-map.1.html
And many more instances of partially observed data…

Statistical Matrix Completion
Traditional approaches assume:
1. A random distribution of observations
2. At least n r log(n) observations
Under (at least) these assumptions, statistical matrix completion methods pose the problem as an optimization and find the solution that best matches the visible information.
[Figure: an input that meets the assumptions, and its reconstruction]

Statistical Matrix Completion
Traditional approaches assume:
1.
A random distribution of observations
2. At least n r log(n) observations

The challenge with these assumptions is that in real data:
1. The distribution is often not random
2. Very few entries are actually known.

known ratings: 9e7
required n r log(n): 2.5e8
≈160,000,000 fewer entries than required

[Figure: real observed data and the best guess; the guess matches on Ω, not elsewhere]

Our Question
How can we design a single querying and matrix completion algorithm that minimizes (1) the reconstruction error and (2) the number of queries, with the number of queries fixed to a budget b?
We call this the Active Completion problem.

With great power…
Many data owners are in the powerful position to add additional observations:
• Yelp can ask some users to rate some restaurants
• Cities can install traffic counters
• Biologists can experiment with a particular protein pair
How can we make the most of the limited budget of queries?

The Answer
We construct an algorithm called Order&Extend that is the first to integrate a querying strategy into its matrix completion algorithm.
It is able to select the small number of queries needed to find an accurate completion.

Our Approach
The key to our approach is viewing matrix completion through a sequence of linear systems.
This allows us to identify:
1. Parts of the matrix that can be recovered given the observations
2. Other parts that cannot be recovered due to insufficient information
3. The additional entries needed to recover those areas.
Note this means our algorithm will not do this: [figure omitted]
It will only estimate the parts it can.

MC as Linear Systems
Write the data M (n × m) as a product of factors, M = XY, where X is n × r and Y is r × m.
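The factorization M = XY can be made concrete with a small sketch. The factor values, the set of observed positions, and the helper name `product_entry` are all hypothetical and chosen only for illustration; nothing here is taken from the Order&Extend implementation.

```python
def product_entry(X, Y, i, j):
    """Entry (i, j) of XY: row i of X times column j of Y."""
    return sum(X[i][k] * Y[k][j] for k in range(len(Y)))

# Hypothetical rank-2 factors (values chosen only for illustration).
X = [[1, 0], [0, 1], [1, 1]]   # n x r = 3 x 2
Y = [[2, 0, 1], [1, 3, 1]]     # r x m = 2 x 3

n, m = len(X), len(Y[0])
M = [[product_entry(X, Y, i, j) for j in range(m)] for i in range(n)]

# Hypothetical set of observed positions; every other entry is missing (None).
observed = {(0, 0), (0, 2), (1, 1), (2, 0), (2, 1)}
M_partial = [[M[i][j] if (i, j) in observed else None for j in range(m)]
             for i in range(n)]

for row in M_partial:
    print(row)
```

Because the third row of X is the sum of the first two, the third row of M is the sum of its first two rows, which is what rank 2 means for this toy matrix. Matrix completion starts from something like `M_partial` and tries to recover `M`.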
MC as Linear Systems
For rank 2, two entries M_{ij} and M_{i'j} in column j depend on rows x_i and x_{i'} of X and on column y_j of Y:
M_{ij} = x_{i1} y_{1j} + x_{i2} y_{2j}
M_{i'j} = x_{i'1} y_{1j} + x_{i'2} y_{2j}
If x_i, x_{i'}, M_{ij}, and M_{i'j} are known and y_j is unknown, these are two equations in two variables, so we can solve for y_j.
Iteratively solve systems of this form to fill in X and Y, then multiply to get the estimate M̃ = XY.

How do we know when and what we need to query?

Incomplete Systems
M_{ij} = x_{i1} y_{1j} + x_{i2} y_{2j}
M_{i'j} = x_{i'1} y_{1j} + x_{i'2} y_{2j}
Suppose M_{i'j} was not observed in the input data. Then y_j and M_{i'j} are both unknown: two equations in three variables.
Query: what is the value of M_{i'j}?
After the query we have two equations in two unknowns, so we can solve for y_j.

Unstable systems
Consider X y = M with
X = [ 1    1/2 ]
    [ 1/2  1/3 ]
and two right-hand sides that differ only slightly:
X y = (3/2, 1) gives y = (0, 3)
X y' = (3/2, 5/6) gives y' = (1, 1)
A small change in one entry of M changes the solution substantially.

Unstable systems
In the paper…
1. How can we detect unstable systems?
2. How can we mitigate unstable systems?
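The incomplete and unstable cases above can be reproduced with a minimal sketch. It solves the rank-2 system by Cramer's rule using exact fractions; the function name `solve_for_y` and the convention of returning `None` are assumptions made for illustration, and this is not the stability check described in the paper.

```python
from fractions import Fraction as F

def solve_for_y(x_i, x_k, m_ij, m_kj):
    """Solve  x_i . y = m_ij,  x_k . y = m_kj  for y = (y_1, y_2).

    Returns None when the system is incomplete (an M entry is missing,
    so it would have to be queried first) or singular.
    """
    if m_ij is None or m_kj is None:
        return None  # incomplete: two equations, three unknowns
    a, b = F(x_i[0]), F(x_i[1])
    c, d = F(x_k[0]), F(x_k[1])
    e, f = F(m_ij), F(m_kj)
    det = a * d - b * c
    if det == 0:
        return None  # singular: no unique solution
    # Cramer's rule for a 2 x 2 system.
    return ((e * d - b * f) / det, (a * f - e * c) / det)

# Rows of X from the "Unstable systems" slide.
x_i = (F(1), F(1, 2))
x_k = (F(1, 2), F(1, 3))

y = solve_for_y(x_i, x_k, F(3, 2), F(1))           # right-hand side (3/2, 1)
y_prime = solve_for_y(x_i, x_k, F(3, 2), F(5, 6))  # right-hand side (3/2, 5/6)
missing = solve_for_y(x_i, x_k, F(3, 2), None)     # M_{i'j} not observed

print(y, y_prime, missing)
```

Changing the second right-hand-side entry by 1/6 moves the solution from (0, 3) to (1, 1), which is the instability the slide illustrates. Exact fractions are used so the effect is not confused with floating-point rounding.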
Minimizing Queries
When the algorithm encounters an incomplete or unstable system, it needs to query.
How can we also keep the number of queries asked to a minimum?
By manipulating the order in which we solve the systems. (Hence the 'order' in Order&Extend)

Takeaway
Observed data is typically:
- not random
- sparse
…But we can query! (minimally!)

Option 1: Independent
Decide what to query independently of how you complete.
[Figure: query limit = 1]

Option 2: Integrated
Decide what to query based on how you complete.
[Figure: Who is guessing? A normal person versus an artist]

Our algorithm Order&Extend is the first one composed of
1. a querying strategy tailored to
2. a completion algorithm
This integrated nature enables Order&Extend to:
- carefully select a small number of queries, so that the completion algorithm can
- recover the matrix with high accuracy.
It also allows Order&Extend to output partial completions under strict limits on the number of allotted queries.

A Flavor (of internet traffic data)
For full and accurate completion, Order&Extend asks 13k queries
…while other algorithms do not achieve comparable error even with <40k queries

Deeper discussion of:
• Matrix completion as a sequence of linear systems
• Sequence of linear systems as graph propagation
• Predicting unstable systems
  - distinction from ill-conditioning
• Efficient computation of stability checks
• Finding a good solving order
  - through the lens of graph propagation
Experiments:
• Comparison with Matrix Completion algorithms
  - extended with a querying ability
• Approximate low-rank
• Exact low-rank
Read the paper!

Thank you.
(and read the paper)
from the book Dali's Mustache