Advanced Topics
• NP-complete reports
• Continuing with NP, parallelism

Reprise: Non-determinism
• Informal: add to any algorithm
  – taking a guess at one or more places
  – forking and pursuing one or more possibilities
• If there is a non-deterministic algorithm, then there is a regular/standard algorithm
  – just try all the possibilities
  – may take a long time

Reprise: the class P
• … is all problems for which there exists an algorithm with complexity bounded by a polynomial.

Reprise: the class NP
• … is all problems for which there is an algorithm, possibly non-deterministic, that is bounded by a polynomial, assuming the right paths are taken.
• Alternative definition: a proposed answer can be checked for correctness in polynomial time.

Reprise: does P = NP?
• Is it possible to find actual standard (deterministic) polynomial-time algorithms for these NP problems?
• THE great open problem of computer science.
• Proving it false would also be significant.
• A theoretical problem with considerable practical value.

NP-complete
• A set of NP problems that can be translated into each other in polynomial time, so…
• if one of the problems can be solved in polynomial time (aka is tractable)…
• …they all can.

NP-hard
• A problem is NP-hard if there is an NP-complete problem that can be translated into it in polynomial time
  – but not necessarily the other way.
• NP-hard problems are at least as hard as NP-complete problems.

NP-hard example
• Robot path planning in a dynamic environment

Reports on NP-complete problems
• Tetris
• Knapsack problem
• Steiner tree problem
• Graph coloring
• Minesweeper
• Subset problem

Note
• There are methods for getting answers to NP problems, but the answers aren't guaranteed to be optimal.
• Called heuristics or approximations.

Distributed computing
• Approach to NP problems: fork a new process.
• That is, use distributed computing to investigate the different choices.
• Some problems may be embarrassingly parallelizable.

Sources
• Many
• Google: http://code.google.com/edu/parallel/mapreduce-tutorial.html
• Note: there is controversy regarding MapReduce
  – may be an issue of patents
  – is it the right framework?
  – ??

Concepts
• key/value pair
• Master / Worker
• nodes on a network
  – may be one Master and many Workers
• hashing: a quick way to find data (key/value data)
• piece / partition / split / shard

Example from Google tutorial
• Compute pi using many workers, each doing a calculation using a pseudo-random function.
  – no data (NOT a typical MapReduce problem)
• A worker picks a random point in the square. If it is in the circle, the worker increments a counter.
• http://faculty.purchase.edu/jeanine.meyer/processing/piEstimate/applet/

Formulas
• Area_of_circle = pi * r^2
• Area_of_square containing the circle = 4 * r^2
• So r^2 = Area_of_square / 4
• Let Ac be Area_of_circle and As be Area_of_square
• Then pi = Ac / r^2 = 4 * Ac / As
• Estimate for pi is 4 * counter / Number_of_points_tried

Informal proof
• The chance of a randomly chosen point in the square being in the circle equals the ratio of the areas.
• Choosing many points randomly carries out this test.
• We could [simply] use for-loops and do the calculation for every point (see the sketch below).
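A minimal sketch, in Python, of the sequential for-loop version just described. The names estimate_pi and num_points are illustrative choices, not taken from the Google tutorial or the applet.

```python
import random

def estimate_pi(num_points):
    """Monte Carlo estimate of pi.

    Pick random points in the square [-1, 1] x [-1, 1] (Area_of_square = 4 * r^2
    with r = 1) and count how many land in the inscribed circle of radius 1
    (Area_of_circle = pi * r^2).
    """
    counter = 0
    for _ in range(num_points):
        x = random.uniform(-1.0, 1.0)
        y = random.uniform(-1.0, 1.0)
        if x * x + y * y <= 1.0:      # the point is inside the circle
            counter += 1
    # pi = 4 * Ac / As, estimated as 4 * counter / Number_of_points_tried
    return 4.0 * counter / num_points

if __name__ == "__main__":
    print(estimate_pi(1000000))       # prints a value close to 3.1416
```

The more points tried, the closer the ratio of counts tends to get to the ratio of the areas, which is why the estimate improves with more work per run.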
MapReduce
• A model for distributed (aka parallel) computing.
• There are different products that implement MapReduce. From a Google search:
  – Google
  – Apache Hadoop: open source
  – Teradata
  – Amazon
  – Greenplum
  – Platform

MapReduce
• Programmers set up programs for the Master and for the Workers. Typically, the Master program sets up and partitions the input array(s).
• Typically, data is key/value pairs.
• Programmers write
  – Map functions that process data, possibly making use of functions in the MapReduce library
  – Reduce functions that combine the results
• Workers work on Map tasks and/or Reduce tasks. The Map task is applied to the worker's piece (aka shard) of the input array.

MapReduce for pi estimate
• Not typical in that there is no data.
• The map function does the calculation.
• When all are done, the reduce function adds up all the individual counters and calculates the estimate for pi (see the sketch at the end of these notes).

Speed up for pi estimate
• Suppose
  – each step (getting the 2 random values and determining if the point is in the circle) takes K time units
  – 1000 workers calculate 1,000,000 values all together, i.e. 1000 values per worker
  – adding 2 numbers takes 1 time unit
• Time without distributed computing: 1,000,000 * K
• Time with distributed computing: 1000 * K (the workers run in parallel) + 1000 (combining the 1000 counters)
• Speedup is 1,000,000 * K / (1000 * K + 1000) = 1000 * K / (K + 1), slightly less than 1000.

Follow-up
• Look up examples using MapReduce.
• Note: one example is Google maintaining its keyword index by scanning (crawling) the web.

Speaker
• Twitter: @kmwinterfield
• IBM Smarter Cities
• Social media for political campaigns
• World Community Grid

Homework
• Prepare a question for Kevin
  – follow on Twitter and send a message, OR
  – post on Moodle
• Continue with postings.
• Research a unique NP-complete problem and post a summary and source!
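Sketch referenced from the "MapReduce for pi estimate" section above: a minimal single-machine simulation of the map and reduce steps in Python, with multiprocessing standing in for separate worker nodes. The names map_task and reduce_task, the worker count, and the use of multiprocessing.Pool are illustrative assumptions, not part of the Google tutorial or of any MapReduce product.

```python
import random
from multiprocessing import Pool

def map_task(points_per_worker):
    """Map: one worker's share of the work. There is no input shard to read here;
    the worker just tries its share of random points and returns its counter."""
    rng = random.Random()             # fresh generator per worker process
    counter = 0
    for _ in range(points_per_worker):
        x = rng.uniform(-1.0, 1.0)
        y = rng.uniform(-1.0, 1.0)
        if x * x + y * y <= 1.0:      # the point is inside the circle
            counter += 1
    return counter

def reduce_task(counters, total_points):
    """Reduce: combine the individual counters and compute the estimate for pi."""
    return 4.0 * sum(counters) / total_points

if __name__ == "__main__":
    workers = 8                       # stand-in for the 1000 workers in the slides
    points_per_worker = 125000        # 1,000,000 points in total
    with Pool(workers) as pool:
        counters = pool.map(map_task, [points_per_worker] * workers)
    print(reduce_task(counters, workers * points_per_worker))
```

In a real MapReduce product the Master would hand each map task to a worker node and collect the counters for the reduce step; here Pool.map plays both roles on one machine.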