Reports on NP-complete problems. Parallel processing. MapReduce.

Advanced Topics
NP-complete reports. Continuing on NP and parallelism.
Reprise: Non-determinism
• Informal: add to any algorithm
– taking a guess at one or more places
– forking and pursuing one or more possibilities
• If there is a non-deterministic algorithm, then there is a regular/standard algorithm
– just try all the possibilities (sketched below)
– may take a long time
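A minimal sketch of "just try all the possibilities", using subset sum as an assumed example (the problem choice and function names are illustrative, not from the slides). A non-deterministic algorithm would guess which numbers to include; the deterministic version simply tries every subset, which can take exponential time.

```python
from itertools import combinations

def subset_sum_brute_force(numbers, target):
    """Deterministically try every subset (2^n possibilities)
    instead of non-deterministically guessing the right one."""
    for size in range(len(numbers) + 1):
        for subset in combinations(numbers, size):
            if sum(subset) == target:
                return subset          # the "lucky guess", found by exhaustive search
    return None

print(subset_sum_brute_force([3, 34, 4, 12, 5, 2], 9))  # (4, 5)
```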
Reprise: the class P
• … is the set of all problems for which there exists an algorithm whose running time is bounded by a polynomial.
Reprise: the class NP
• … is the set of all problems for which there is an algorithm, possibly non-deterministic, that, assuming the right choices are made at each guess, runs in time bounded by a polynomial.
• Alternative definition: a proposed answer can be checked for correctness in polynomial time (see the verifier sketch below).
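A minimal sketch of the "check the answer in polynomial time" definition, again using subset sum as an assumed example: checking a proposed subset only requires confirming its members come from the input and summing them, which is polynomial, even though finding the subset may not be.

```python
def verify_subset_sum(numbers, target, candidate):
    """Polynomial-time check that a proposed answer is correct."""
    counts = {}                      # allow repeated values in the input list
    for n in numbers:
        counts[n] = counts.get(n, 0) + 1
    for c in candidate:
        if counts.get(c, 0) == 0:
            return False             # candidate uses a number not available
        counts[c] -= 1
    return sum(candidate) == target

print(verify_subset_sum([3, 34, 4, 12, 5, 2], 9, [4, 5]))  # True
```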
Reprise: does P = NP?
• Is it possible to find actual standard polynomial-time algorithms for these NP problems?
• THE great problem of computer science.
• Proving it false would also be significant.
• Theoretical problem with considerable
practical value.
NP-complete
• A set of NP problems that can each be translated into the others in polynomial time, so…
• if one of the problems can be solved in polynomial time
– a.k.a. tractable
• … they all can.
NP-hard
• A problem is NP-hard if there is an NP-complete problem that can be translated into it in polynomial time.
– but not necessarily the other way.
• NP-hard problems are at least as hard as
NP-complete problems.
NP-hard example
• Robot path planning in a dynamic
environment
Reports on NP-complete problems
• Tetris
• Knapsack problem
• Steiner tree problem
• Graph coloring
• Minesweeper
• Subset sum problem
Note
• There are methods for getting answers to NP problems, but the answers aren't guaranteed to be optimal.
• These are called heuristics or approximations (a greedy example is sketched below).
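One hedged illustration of a heuristic (not from the slides): a greedy rule for the knapsack problem that takes items in order of value per unit weight. It runs quickly but is not guaranteed to be optimal, as the sample data shows.

```python
def greedy_knapsack(items, capacity):
    """Heuristic: take items by value/weight ratio; fast but not always optimal."""
    chosen, total_weight, total_value = [], 0, 0
    for value, weight in sorted(items, key=lambda vw: vw[0] / vw[1], reverse=True):
        if total_weight + weight <= capacity:
            chosen.append((value, weight))
            total_weight += weight
            total_value += value
    return chosen, total_value

items = [(60, 10), (100, 20), (120, 30)]   # (value, weight) pairs
print(greedy_knapsack(items, 50))          # picks value 160; the optimal packing is 220
```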
Distributed computing
• Approach to NP problems: fork a new
process
• That is, use distributed computing to
investigate the different choices
• Some problems may be embarrassingly parallel (see the sketch below).
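A minimal sketch of forking processes for an embarrassingly parallel search, using Python's standard multiprocessing module; the branch "cost" function is a made-up placeholder. Each worker independently investigates one choice, and the results are combined at the end.

```python
from multiprocessing import Pool

def explore_branch(first_choice):
    """Each worker independently investigates one choice; no communication needed."""
    # placeholder work: pretend the 'cost' of a branch is a simple function of the choice
    return first_choice, sum(i * first_choice for i in range(1000))

if __name__ == "__main__":
    with Pool(processes=4) as pool:              # fork several worker processes
        results = pool.map(explore_branch, range(8))
    best = min(results, key=lambda r: r[1])      # combine: keep the cheapest branch
    print(best)
```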
Sources
• Many
• Google: http://code.google.com/edu/parallel/mapreduce-tutorial.html
• Note: there is controversy re: MapReduce
– may be a patent issue
– is it the right framework?
– ??
Concepts
• key/value pair
• Master / Worker
• nodes on network
– may be one Master and many Workers
• hashing: quick way to find data (key/value
data)
• piece / partition / split / shard
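A minimal sketch tying these concepts together (all names are illustrative): key/value pairs, plus hashing a key to decide which worker's shard it belongs to.

```python
# key/value pairs: here, word -> count
pairs = [("apple", 1), ("banana", 2), ("apple", 3), ("cherry", 1)]

NUM_WORKERS = 3   # one Master would hand a shard to each Worker

def shard_for(key, num_shards=NUM_WORKERS):
    """Hashing gives a quick, deterministic way to route a key to a shard."""
    return hash(key) % num_shards

shards = {i: [] for i in range(NUM_WORKERS)}
for key, value in pairs:
    shards[shard_for(key)].append((key, value))

print(shards)   # within a run, every pair with the same key lands in the same shard
```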
Example from Google tutorial
• Compute pi using many workers, each doing a calculation using a pseudo-random function.
– no data (NOT a typical MapReduce problem)
• Each worker picks a random point in the square; if it is in the circle, the worker increments a counter.
• http://faculty.purchase.edu/jeanine.meyer/processing/piEstimate/applet/
Formulas
• Area_of_circle = pi * r²
• Area_of_square containing the circle = (2r)² = 4 * r²
• So r² = Area_of_square / 4
• Let Ac be Area_of_circle and As be Area_of_square
• Then pi = Ac / r² = 4 * Ac / As
• Estimate for pi is 4 * counter / Number_of_points_tried
Informal proof
• The chance of any point being in the circle is proportional to the ratio of the areas.
• Choosing many points randomly carries out this test.
• We could simply use for-loops and do the calculation for every point (see the sketch below).
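A minimal sequential sketch of that for-loop version, assuming a unit circle centered at the origin inside the square from -1 to 1: pick random points, count how many fall inside the circle, and apply pi ≈ 4 * counter / Number_of_points_tried.

```python
import random

def estimate_pi(num_points=1_000_000):
    counter = 0
    for _ in range(num_points):
        x, y = random.uniform(-1, 1), random.uniform(-1, 1)   # random point in the square
        if x * x + y * y <= 1:                                # is it inside the unit circle?
            counter += 1
    return 4 * counter / num_points

print(estimate_pi())   # roughly 3.14
```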
MapReduce
• Model for distributed (a.k.a. parallel) computing
• There are different products that implement MapReduce. From a Google search:
– Google
– Apache Hadoop: open source
– Teradata
– Amazon
– Greenplum
– Platform
MapReduce
• The programmer sets up a program for the Master and one for the Workers. Typically, the Master program sets up and partitions the input array(s).
• Typically, the data is key/value pairs.
• Programmers write
– Map functions that process data, possibly making use of functions in the MapReduce library
– Reduce functions that combine the results
• Workers work on Map tasks and/or Reduce tasks. The Map task is applied to the worker's piece (a.k.a. shard) of the input array (a word-count sketch follows).
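A minimal single-machine sketch of the Map and Reduce roles, using the classic word-count example rather than any framework's real API (function names are illustrative). The Master partitions the input into shards, each Map call emits key/value pairs for its shard, and Reduce combines the values for each key.

```python
from collections import defaultdict

def map_words(shard):
    """Map: emit a (word, 1) pair for every word in this worker's shard."""
    return [(word, 1) for line in shard for word in line.split()]

def reduce_counts(key, values):
    """Reduce: combine all values emitted for one key."""
    return key, sum(values)

lines = ["the cat sat", "the cat ran", "a dog sat"]
shards = [lines[:2], lines[2:]]                      # Master partitions the input

grouped = defaultdict(list)                          # shuffle: group pairs by key
for shard in shards:
    for key, value in map_words(shard):
        grouped[key].append(value)

print(dict(reduce_counts(k, v) for k, v in grouped.items()))  # {'the': 2, 'cat': 2, ...}
```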
MapReduce for pi estimate
• Not typical in that there is no data
• The map function does the calculation
• When all workers are done, the reduce function adds up the individual counters and calculates the estimate for pi (see the sketch below)
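The pi estimate recast in the same sketch style (again not any framework's real API): each map call runs one worker's share of trials and returns its counter; the reduce step adds the counters and applies the formula.

```python
import random

def map_trials(num_trials):
    """Map: one worker's share of random trials; returns its counter."""
    counter = 0
    for _ in range(num_trials):
        x, y = random.uniform(-1, 1), random.uniform(-1, 1)
        if x * x + y * y <= 1:
            counter += 1
    return counter

def reduce_pi(counters, total_trials):
    """Reduce: add up the counters and compute the estimate for pi."""
    return 4 * sum(counters) / total_trials

NUM_WORKERS, TRIALS_EACH = 10, 100_000
counters = [map_trials(TRIALS_EACH) for _ in range(NUM_WORKERS)]  # run sequentially here
print(reduce_pi(counters, NUM_WORKERS * TRIALS_EACH))             # roughly 3.14
```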
Speed up for pi estimate
• Suppose
– each trial (getting the 2 random values and determining whether the point is in the circle) takes K time units
– 1,000 workers calculate 1,000,000 values altogether, so each worker does 1,000 trials
– adding 2 numbers takes 1 time unit
• Time without distributed computing: 1,000,000 * K
• Time with distributed computing: 1,000 * K + 1,000 (each worker's trials, plus summing the 1,000 counters)
• Speedup is slightly less than 1,000 (checked arithmetically below)
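A short arithmetic check of that claim, treating the slide's numbers as assumptions (K time units per trial, 1,000 workers, 1,000,000 trials, 1 time unit per addition):

```python
def speedup(K, num_workers=1000, num_trials=1_000_000):
    sequential = num_trials * K                                    # one machine does every trial
    parallel = (num_trials // num_workers) * K + num_workers      # each worker's trials + summing counters
    return sequential / parallel

for K in (1, 10, 100):
    print(K, round(speedup(K), 1))
```

For small K the cost of summing the counters matters more; as K grows, the speedup approaches but never reaches 1,000, matching the "slightly less than 1,000" on the slide.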
Follow-up
• Look up examples using MapReduce
• Note: one example is Google maintaining
its keyword index by scanning (crawling)
the web
Speaker
• Twitter: @kmwinterfield
• IBM Smarter Cities
• Social media for political campaigns
• World Community Grid
Homework
• Prepare a question for Kevin
– follow him on Twitter and send a message, OR
– post on Moodle
• Continue with postings
• Research a unique NP-complete problem and post a summary and source!