4. Scalability and MapReduce Prof. Tudor Dumitraș Assistant Professor, ECE
ENEE 759D | ENEE 459D | CMSC 858Z
University of Maryland, College Park
http://ter.ps/759d
https://www.facebook.com/SDSAtUMD
Today’s Lecture
• Where we’ve been
– How to say “hapax legomenon” and “heteroskedasticity”
– Interpretation of Statistics
– Attributes of Big Data
• Where we’re going today
– Threats to validity
– Scalability
– MapReduce
• Where we’re going next
– Machine learning
The IROP Keyboard
[Zeller, 2011]
To prevent bugs, remove the keystrokes that predict 74% of failure-prone modules in Eclipse
Does this work?
What am I measuring?
[Figure: Reconstruct Lineage. Korgo worm family with variants C, D, E, F, G, N, S, T; samples C, D, and E; unknowns V1, V2, V3]
How well does this work in the real world?
Will this work tomorrow?
What Am I Measuring: Scalability vs. Latency
Can we make use of 1000s of cheap computers?
• Analyzing data in parallel
– To access 1 TB in 1 min, must distribute data over 20 disks
– Parallelism is useful for algorithms where complexity constants matter
• N log N operations sequentially => (N log N)/K operations in parallel
– Scalability: ability to throw resources at the problem
• You can measure scalability
– Scaleup (weak scalability):
• More resources => solve proportionally bigger problem with same latency
– Speedup (strong scalability):
• More resources => proportionally lower latency with same problem size
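The two scalability measures above, together with the slide's disk and N log N arithmetic, can be sketched in a few lines of Python. This is a minimal illustration, not a benchmark harness: the function names, the base-2 logarithm, the decimal definition of 1 TB (10^12 bytes), and the sample latencies are all assumptions made for the example.

```python
import math


def speedup(latency_1: float, latency_k: float) -> float:
    """Strong scalability: same problem size, 1 machine vs. K machines."""
    return latency_1 / latency_k


def scaleup(latency_small_1: float, latency_big_k: float) -> float:
    """Weak scalability: problem grown K times on K machines; 1.0 is ideal."""
    return latency_small_1 / latency_big_k


def ideal_parallel_ops(n: int, k: int) -> float:
    """Per-machine work when N log N operations split evenly over K machines."""
    return n * math.log2(n) / k  # base 2 is an assumption; the slide gives no base


def per_disk_throughput(total_bytes: float, seconds: float, disks: int) -> float:
    """Bytes per second each disk must sustain to read the data in time."""
    return total_bytes / seconds / disks


# Hypothetical latencies in seconds, invented for illustration only.
print(speedup(latency_1=120.0, latency_k=15.0))
print(scaleup(latency_small_1=120.0, latency_big_k=150.0))
# Throughput per disk (MB/s) implied by "1 TB in 1 min over 20 disks".
print(per_disk_throughput(1e12, 60, 20) / 1e6)
```

A speedup equal to K and a scaleup equal to 1.0 correspond to the "proportionally" wording in the definitions; measured values usually fall short of both.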
Some Problems Are Embarrassingly Parallel (1)
Task: Convert 405K TIFF images (~4 TB) to PNG
Input: many TIFF images
Distribute images among K computers
f is a function to convert TIFF to PNG; apply it to every item
Output: a big distributed set of converted images
http://open.blogs.nytimes.com/2008/05/21/the-new-york-times-archives-amazon-web-services-timesmachine/
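The pattern on this slide, applying f independently to every item, might be sketched as follows. This is a hedged illustration only: a thread pool stands in for the K computers, `convert` is a hypothetical placeholder that merely derives the output file name instead of decoding an image, and the file names are invented.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import PurePosixPath


def convert(tiff_name: str) -> str:
    """Stand-in for f: a real job would decode the TIFF and write a PNG.

    Here we only derive the output name so the sketch has no dependencies.
    """
    return str(PurePosixPath(tiff_name).with_suffix(".png"))


def convert_all(tiff_names: list[str], k: int) -> list[str]:
    """Apply f to every item independently, using K workers.

    Threads stand in for the K computers; no item depends on another,
    which is what makes the task embarrassingly parallel.
    """
    with ThreadPoolExecutor(max_workers=k) as pool:
        return list(pool.map(convert, tiff_names))  # results keep input order


images = [f"archive/page_{i:04d}.tiff" for i in range(8)]  # hypothetical names
pngs = convert_all(images, k=4)
print(pngs[0], pngs[-1])
```

Because no worker needs another worker's result, adding workers requires no coordination beyond handing out the items.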
Some Problems Are Embarrassingly Parallel (2)
Task: Compute the word frequency of 5M documents
Input: millions of documents
Distribute documents among K computers
For each document, f returns a set of <word, freq> pairs
Output: a big distributed list of sets of word freqs.
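A minimal sketch of the per-document f described above, assuming a simple lowercase tokenizer and a made-up two-document corpus (neither comes from the slides):

```python
from collections import Counter
import re


def word_freq(document: str) -> Counter:
    """f: return the <word, freq> pairs for one document."""
    words = re.findall(r"[a-z']+", document.lower())
    return Counter(words)


# Hypothetical two-document corpus.
documents = {
    "doc1": "the worm spreads and the worm mutates",
    "doc2": "the patch stops the worm",
}

# One independent histogram per document; nothing is combined across documents.
per_document = {name: word_freq(text) for name, text in documents.items()}
print(per_document["doc1"]["worm"], per_document["doc2"]["worm"])
```

Each histogram depends on one document only, so the documents can be processed on different machines without communication.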
Adapted from slides by Bill Howe
Some Problems Are Embarrassingly Parallel (3)
Task: Compute the word frequency across all documents
Input: millions of documents
Distribute documents among K computers
For each document, f returns a set of <word, freq> pairs
Now what?
We don’t want a bunch of little histograms – we want one big histogram
MapReduce
Task: Compute the word frequency across all documents
Distribute documents among K computers
map: for each document, f returns a set of <word, freq> pairs
A big distributed list of sets of word freqs.
Shuffle <word, freq> pairs so that all the counts for a word are sent to the same host
reduce: add the counts of each word
Output: the distributed histogram
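The three stages on this slide can be simulated in a single process. The sketch below is an illustration under stated assumptions, not Hadoop code: the reducer hosts are plain dictionaries, the routing rule (CRC32 of the word modulo the number of reducers) is one possible choice, and the input documents are invented.

```python
from collections import Counter, defaultdict
import zlib


def map_doc(text: str) -> list[tuple[str, int]]:
    """map: emit <word, freq> pairs for one document."""
    return list(Counter(text.lower().split()).items())


def shuffle(pairs, num_reducers: int):
    """Route each pair to a reducer chosen by hashing the word, so that
    all counts for a given word land on the same (simulated) host."""
    hosts = [defaultdict(list) for _ in range(num_reducers)]
    for word, freq in pairs:
        host = zlib.crc32(word.encode("utf-8")) % num_reducers
        hosts[host][word].append(freq)
    return hosts


def reduce_word(word: str, freqs: list[int]) -> tuple[str, int]:
    """reduce: add the counts of one word."""
    return word, sum(freqs)


documents = ["the worm spreads", "the patch stops the worm"]  # hypothetical input
intermediate = [pair for doc in documents for pair in map_doc(doc)]
hosts = shuffle(intermediate, num_reducers=2)

# Each host reduces only the words it received; together they hold the histogram.
histogram = dict(reduce_word(w, f) for host in hosts for w, f in host.items())
print(histogram["the"], histogram["worm"])
```

A deterministic hash is used for routing because the shuffle only works if every mapper sends a given word to the same host.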
Hadoop on One Slide
• MapReduce was invented at Google [Dean & Ghemawat, OSDI’04]
• Hadoop = open source implementation
• Data stored on HDFS distributed file system
– Direct-attached storage
– No schema needed on load
• Programmers write Map and Reduce functions
• Framework provides automated parallelization and fault tolerance
– Data replication, restarting failed tasks
– Scheduling Map and Reduce tasks on hosts with local copies of input data
Source: Huy Vo
MapReduce Programming Model
• Input & Output: each a set of key/value pairs
• Programmer specifies two functions:
map (in_key, in_value) -> list(out_key, intermediate_value)
– Processes input key/value pair
– Produces set of intermediate pairs
reduce (out_key, list(intermediate_value)) -> list(out_value)
– Combines all intermediate values for a particular key
– Produces a set of merged output values (usually just one)
• Inspired by primitives from functional programming languages such as Lisp, Scheme, and Haskell
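The two signatures above can be written down as Python type hints, with a tiny sequential driver standing in for the framework. This is a sketch under assumptions: `run_mapreduce`, `index_map`, and `index_reduce` are hypothetical names, the driver does no parallelization or fault tolerance, and the inverted-index job is just one example of a map/reduce pair.

```python
from collections import defaultdict
from typing import Callable, Iterable, TypeVar

K1 = TypeVar("K1")  # in_key
V1 = TypeVar("V1")  # in_value
K2 = TypeVar("K2")  # out_key
V2 = TypeVar("V2")  # intermediate_value
V3 = TypeVar("V3")  # out_value


def run_mapreduce(
    inputs: Iterable[tuple[K1, V1]],
    map_fn: Callable[[K1, V1], Iterable[tuple[K2, V2]]],
    reduce_fn: Callable[[K2, list[V2]], Iterable[V3]],
) -> dict[K2, list[V3]]:
    """Sequential, single-process stand-in for the framework."""
    groups = defaultdict(list)
    for in_key, in_value in inputs:
        for out_key, intermediate_value in map_fn(in_key, in_value):
            groups[out_key].append(intermediate_value)  # group by out_key
    return {key: list(reduce_fn(key, values)) for key, values in groups.items()}


def index_map(doc_name: str, contents: str):
    """Emit (word, document name) once per distinct word in the document."""
    for word in set(contents.lower().split()):
        yield word, doc_name


def index_reduce(word: str, doc_names: list[str]):
    """Merge all document names for a word into one sorted list."""
    yield sorted(doc_names)


docs = [("a.txt", "worm patch"), ("b.txt", "worm exploit")]  # hypothetical input
index = run_mapreduce(docs, index_map, index_reduce)
print(index["worm"])
```

The driver returns a list of output values per key to mirror `list(out_value)` in the signature, even though this job, like most, produces just one.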
Slide source: Google
Example: What Does This Do?
map(String input_key, String input_value):
    // input_key: document name
    // input_value: document contents
    for each word w in input_value:
        EmitIntermediate(w, 1);

reduce(String output_key, Iterator intermediate_values):
    // output_key: word
    // output_values: ????
    int result = 0;
    for each v in intermediate_values:
        result += v;
    EmitFinal(output_key, result);
Big Data in the Security Industry
• Booz Allen Hamilton
– Dr. Brian Keller’s colloquium “Innovating with Analytics”
– Sponsors Data Science Bowl, October 5th 1-5:30 pm CSIC 2117 & 2120
https://www.datasciencebowl.com/
• Symantec
– WINE platform for data analytics in security
• Google
– Mine user access patterns to mitigate data loss due to stolen credentials
• Supplementary to passwords and two-factor authentication
– Fuzz testing at scale
Big Data for Security: Benefits and Challenges
• Benefits
– Ability to analyze data at scale (e.g., the information on the 403 million malware variants created in 2011)
– MapReduce provides a simple programming model, automated parallelization, and fault tolerance
• Commercial parallel DBs (e.g., Vertica, Greenplum, Aster Data) also provide some of these benefits, but they are very expensive
• Challenges
– Lack of ground truth on malware families
– Lack of contextual data: e.g., date and time of appearance
– Inability to collect some types of data owing to privacy concerns
– Sharing data (e.g., malware samples are dangerous, some data sets may include personal information)
Illustrate general threats to validity in experimental cyber security
Threats to Validity
Construct validity: use metrics that model the hypothesis
Internal validity: establish causal connection
Content validity: include only and all relevant data
External validity: generalize results beyond experimental data
Questions: What am I measuring? Does it work? Will it work in the real world? Will it work tomorrow?
Review of Lecture
• What did we learn?
– Construct, content, internal, external validity
– Programming in MapReduce
– Measuring scalability
• What’s next?
– Paper discussion: ‘Before We Knew It: An Empirical Study of Zero-Day Attacks In The Real World’
– Next lecture: Machine learning techniques
• Deadline reminder
– Pilot project reports due on Wednesday
– Post report on Piazza