Uploaded by sanjanah99

NoSQL

Topic:
Google’s MapReduce
PRESENTERS:
SANJANA H (20MCAR0046)
KAVITHA N (20MCAR0031)
Google File System
Introduction to Google File System
• Google File System (GFS) is a scalable distributed file system (DFS) created by Google Inc. and developed to accommodate Google’s expanding data processing requirements.
• GFS provides fault tolerance, reliability, scalability, availability and performance to large networks and connected nodes.
• The Google File System capitalized on the strength of off-the-shelf servers while minimizing hardware weaknesses.
Distributed File System
• In the Google File System, each cluster is made up of hundreds or thousands of commodity servers, and clients can use the cluster to read or write files in a fault-tolerant and reliable manner.
• GFS is made up of several storage systems built from low-cost commodity hardware components. It is optimized to accommodate Google's different data use and storage needs, such as its search engine, which generates huge amounts of data that must be stored.
Use Case:
Google music team
Using GFS, sort the play logs (analytics) that record every time a particular song was played, and find the top and trending songs based on these analytics.
All the metadata about the cluster (which files exist, how many parts each file is divided into, and where each of those parts is located) is stored by a component called the GFS master.
Chunk Servers
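The metadata described above can be pictured as a lookup table from file path to parts to chunk servers. The following is only a minimal sketch: the class names, chunk ids and server names are hypothetical and are not taken from GFS itself.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ChunkInfo:
    """One part of a file and the chunk servers that hold it (illustrative)."""
    chunk_id: str
    servers: List[str]


@dataclass
class MasterMetadata:
    """Hypothetical in-memory table: file path -> ordered list of parts."""
    files: Dict[str, List[ChunkInfo]] = field(default_factory=dict)

    def add_file(self, path: str, chunks: List[ChunkInfo]) -> None:
        self.files[path] = list(chunks)

    def part_count(self, path: str) -> int:
        # In how many parts is the file divided?
        return len(self.files[path])

    def locate(self, path: str, part_index: int) -> List[str]:
        # Where is a given part stored?
        return self.files[path][part_index].servers


master = MasterMetadata()
master.add_file("var/foo", [
    ChunkInfo("foo-0", ["server-a", "server-b"]),
    ChunkInfo("foo-1", ["server-c"]),
])
```

A client would ask the master for `locate("var/foo", 1)` and then read that part from the returned server.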
File(var/foo)
THE FILES STORED IN THIS CLUSTER ARE GENERALLY VERY LARGE, SO EACH FILE IS SPLIT INTO MULTIPLE PARTS. EACH PART IS CALLED A CHUNK, AND THESE CHUNKS ARE SPREAD OVER MULTIPLE COMMODITY SERVERS WITHIN THE CLUSTER.
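As a rough illustration of splitting a file into chunks and spreading them over servers, here is a small sketch. The tiny chunk size, the round-robin placement and the server names are assumptions made for the example; the slides do not say how GFS chooses where a chunk goes.

```python
from typing import Dict, List


def split_into_chunks(data: bytes, chunk_size: int) -> List[bytes]:
    """Cut the file contents into fixed-size parts (the last may be shorter)."""
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]


def place_chunks(chunk_count: int, servers: List[str]) -> Dict[str, List[int]]:
    """Assumed policy: hand chunk indexes to servers in round-robin order."""
    placement: Dict[str, List[int]] = {server: [] for server in servers}
    for index in range(chunk_count):
        placement[servers[index % len(servers)]].append(index)
    return placement


chunks = split_into_chunks(b"abcdefghij", chunk_size=4)
placement = place_chunks(len(chunks), ["server-a", "server-b"])
```

With these inputs the file becomes three chunks, two of which land on `server-a` and one on `server-b`.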
DRAWBACKS:
• FILES ARE VERY LARGE (100S OF MB TO MULTIPLE GB)
• ACCESSING ALL THESE FILES CAN SEVERELY STRESS NETWORK BANDWIDTH
• A SINGLE CLIENT APPLICATION TRYING TO PROCESS MULTIPLE TERABYTES OF FILES TAKES A LONG TIME
Solution:
• Bring the processing function to the data instead of bringing the data to the client.
• The client writes a function that counts the plays of every song and passes it along to the servers that store the files. With this approach, each server can apply the function to the files stored on that server.
• All of this processing happens in parallel, and the client application waits until it is completed. Once the processing is complete, the results are sent back to the client application, where the client aggregates the individual results from all the servers.
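The idea of sending the function to the data can be sketched for the song-count use case as follows. The server names and log contents are made up, and a plain loop stands in for the parallel execution that a real cluster would provide.

```python
from collections import Counter
from typing import Callable, Dict, List

# Hypothetical cluster: each server holds its own slice of the play log.
server_logs: Dict[str, List[str]] = {
    "server-a": ["song-1", "song-2", "song-1"],
    "server-b": ["song-2", "song-3"],
    "server-c": ["song-1"],
}


def count_plays(log_lines: List[str]) -> Counter:
    """The function the client ships to every server."""
    return Counter(log_lines)


def run_on_servers(func: Callable[[List[str]], Counter],
                   logs_by_server: Dict[str, List[str]]) -> Dict[str, Counter]:
    # Each server applies func to its local data only; nothing is copied
    # to the client. This loop is sequential purely for illustration.
    return {server: func(lines) for server, lines in logs_by_server.items()}


def aggregate(partials: Dict[str, Counter]) -> Counter:
    """Client-side step: merge the per-server results."""
    total: Counter = Counter()
    for partial in partials.values():
        total.update(partial)
    return total


partial_results = run_on_servers(count_plays, server_logs)
top_songs = aggregate(partial_results).most_common(2)
```

Only the small per-server counts travel over the network, not the log files themselves.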
MapReduce
What is MapReduce?
• A programming model (and its associated implementation)
• For processing large data sets
• Exploits a large set of commodity computers
• Executes processes in a distributed manner
• Offers a high degree of transparency
In other words: simple and maybe suitable for your tasks!
• MapReduce is a programming model and an associated implementation for processing and generating large data sets.
• Users specify a Map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a Reduce function that merges all intermediate values associated with the same intermediate key.
• Programs written in this functional style are automatically parallelized and executed on a large cluster of commodity machines.
• The run-time system takes care of details such as partitioning the input data and scheduling the program’s execution.
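To make the Map and Reduce signatures concrete, here is a single-process sketch of the model. The names `map_fn`, `reduce_fn` and `run_mapreduce` are chosen for this example only, and the grouping step done here in memory is what the run-time system would handle across machines.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, Iterator, List, Tuple


def map_fn(key: str, value: str) -> Iterator[Tuple[str, int]]:
    """key: log name, value: log contents. Emits one (song, 1) pair per play."""
    for song in value.split():
        yield song, 1


def reduce_fn(key: str, values: List[int]) -> Tuple[str, int]:
    """Merges all intermediate values that share the same intermediate key."""
    return key, sum(values)


def run_mapreduce(inputs: Iterable[Tuple[str, str]],
                  mapper: Callable, reducer: Callable) -> Dict[str, int]:
    # Group intermediate pairs by key, then reduce each group.
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in inputs:
        for inter_key, inter_value in mapper(key, value):
            groups[inter_key].append(inter_value)
    return dict(reducer(k, v) for k, v in sorted(groups.items()))


inputs = [("log-1", "song-1 song-2 song-1"), ("log-2", "song-2 song-3")]
play_counts = run_mapreduce(inputs, map_fn, reduce_fn)
```

The user writes only `map_fn` and `reduce_fn`; everything inside `run_mapreduce` stands in for the framework.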
Map+Reduce
• MapReduce is a programming model where, instead of pulling all the data out of multiple servers, you push your code onto those servers and ask them to run it on the data they hold.
• The name comes from its two parts. The first is the map function: the processing function that the client passes to the servers.
• The second is the reduce function: the aggregation of the results from each of the individual servers.
Master Server
• The Master Server takes care of all the coordination.
• Using GFS, the master finds out on which servers the files, or parts of those files, are stored. On each of those servers it starts a separate process called a worker process. Each worker process takes the given function and starts mapping.
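One way to picture how the master turns chunk locations into worker assignments is sketched below. The tie-breaking rule (prefer the candidate server with the fewest tasks so far) is an assumption for the example, as are the chunk and server names.

```python
from typing import Dict, List


def plan_map_tasks(chunk_locations: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Assign each chunk to a worker on a server that already stores it."""
    tasks: Dict[str, List[str]] = {}
    for chunk_id, servers in chunk_locations.items():
        # Assumed policy: among the servers holding this chunk, pick the
        # one with the fewest map tasks assigned so far.
        chosen = min(servers, key=lambda s: len(tasks.get(s, [])))
        tasks.setdefault(chosen, []).append(chunk_id)
    return tasks


# Hypothetical answer from GFS: which servers hold which part of the file.
chunk_locations = {
    "foo-0": ["server-a", "server-b"],
    "foo-1": ["server-a"],
    "foo-2": ["server-b", "server-c"],
}
map_tasks = plan_map_tasks(chunk_locations)
```

Every chunk is mapped on a server that holds a copy of it, so no chunk has to cross the network before processing starts.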
Functioning of Master Server
master
• The Master Server keeps track of progress by using heartbeats: each worker keeps sending heartbeat data along with the percentage of progress made so far. The workers do not send their results directly to the master; instead, they save them on their local hard disk.
• The chunk servers send the result file name and path to the master, and the master collects the result file paths from all of these servers. Once the processing is complete, the partial results of each server are on that server's local hard disk.
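The bookkeeping described above might look roughly like this. `MasterTracker`, its method names and the result paths are all hypothetical; the sketch only shows that the master stores progress and file paths, never the result data itself.

```python
from typing import Dict


class MasterTracker:
    """Illustrative record of worker progress and result locations."""

    def __init__(self) -> None:
        self.progress: Dict[str, int] = {}      # worker -> percent complete
        self.result_paths: Dict[str, str] = {}  # worker -> path on its local disk

    def on_heartbeat(self, worker: str, percent_done: int) -> None:
        # Called each time a worker reports in with its progress.
        self.progress[worker] = percent_done

    def on_result(self, worker: str, path: str) -> None:
        # The worker reports only where its result file is stored.
        self.progress[worker] = 100
        self.result_paths[worker] = path

    def all_done(self) -> bool:
        return bool(self.progress) and all(p == 100 for p in self.progress.values())


tracker = MasterTracker()
tracker.on_heartbeat("worker-1", 40)
tracker.on_heartbeat("worker-2", 75)
tracker.on_result("worker-1", "/local/results/part-1")
done_early = tracker.all_done()
tracker.on_result("worker-2", "/local/results/part-2")
done_final = tracker.all_done()
```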
Examples
Inverted Web Index
Map: For all crawled documents, map each word to its document id. Output: (word, document-id)
Reduce: For all mappings, aggregate the results with the same key. Output: (word, list<document-id>)
Diagram labels: map worker-1, map worker-2, reduce worker-1, reduce worker-2
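The inverted index example can be sketched in a few lines. The document ids and texts are made up, and the grouping that the map and reduce workers would perform across machines is done here in a single process.

```python
from collections import defaultdict
from typing import Dict, Iterator, List, Tuple


def map_document(doc_id: str, text: str) -> Iterator[Tuple[str, str]]:
    """Map: emit (word, document-id) once for each distinct word."""
    for word in set(text.lower().split()):
        yield word, doc_id


def reduce_word(word: str, doc_ids: List[str]) -> Tuple[str, List[str]]:
    """Reduce: collect every document id seen for the same word."""
    return word, sorted(doc_ids)


def build_inverted_index(documents: Dict[str, str]) -> Dict[str, List[str]]:
    grouped: Dict[str, List[str]] = defaultdict(list)
    for doc_id, text in documents.items():
        for word, found_in in map_document(doc_id, text):
            grouped[word].append(found_in)
    return dict(reduce_word(word, ids) for word, ids in grouped.items())


documents = {"doc-1": "google file system", "doc-2": "google mapreduce"}
index = build_inverted_index(documents)
```

Looking up `index["google"]` returns both document ids, which is the (word, list<document-id>) shape from the slide.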
Fault Tolerance & Distributed Processing
Fault tolerance: for each MapReduce job, the master uses the heartbeats to learn when a worker has failed, and it assigns another worker on another server to do the same processing. Distributed processing: because the work runs in parallel across many servers, a MapReduce job is much faster than sequential processing.
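A heartbeat-based failure check could be sketched as below. The timeout value, the timestamps and the round-robin reassignment are assumptions for the example; the slides only state that the master notices the failure through heartbeats and gives the work to another worker.

```python
from typing import Dict, List


def find_failed_workers(last_heartbeat: Dict[str, float],
                        now: float, timeout: float) -> List[str]:
    """A worker counts as failed if its last heartbeat is older than timeout."""
    return [w for w, seen in last_heartbeat.items() if now - seen > timeout]


def reassign_tasks(assignments: Dict[str, List[str]],
                   failed: List[str], healthy: List[str]) -> Dict[str, List[str]]:
    """Move the failed workers' tasks to healthy workers (assumed round-robin)."""
    updated = {w: list(t) for w, t in assignments.items() if w not in failed}
    orphaned = [task for w in failed for task in assignments.get(w, [])]
    for i, task in enumerate(orphaned):
        updated.setdefault(healthy[i % len(healthy)], []).append(task)
    return updated


last_heartbeat = {"worker-1": 100.0, "worker-2": 92.0, "worker-3": 99.0}
assignments = {"worker-1": ["chunk-0"], "worker-2": ["chunk-1", "chunk-2"], "worker-3": []}

failed = find_failed_workers(last_heartbeat, now=101.0, timeout=5.0)
healthy = [w for w in assignments if w not in failed]
new_assignments = reassign_tasks(assignments, failed, healthy)
```

Here `worker-2` has been silent for too long, so its two chunks are redone by the remaining workers.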