Uploaded by Kevin

Big Data Concepts: MapReduce, Spark, Data Security

advertisement
Big Data
Data sets with sizes beyond the ability of commonly
used software tools to capture, curate, manage, and
process data within a tolerable elapsed time
Commonly used term for: Behavioral Analytics, User
Analytics, Specialized Data Analytics (DNA, Financial,
etc)
3V’s
Volume – Massive amounts of data collected
Velocity – Collecting and curating data at required
speeds
Variety – Structured, Semi-structured and Unstructured
data from a variety of sources
Batch/Stream Paths
Batch: Collection and storage of data and transaction,
scheduled time processing at once eg. Transactions
Stream: Immediate processing, databased updated at
the time eg. Reservation, streaming
Single:
Distributed:
Data processing spread across multiple compute
nodes in parallel
All nodes work in conjunction via network connection
distributing work relatively equally
Clusters of nodes are Scalable (up and out) and Highly
Available
All data is redundant and replicated across nodes and
storage
Pros
facilitates scalability, high availability, fault tolerance,
replication, redundancy which is typically not available
in centralized data processing systems.
Parallel distributed of work facilitates faster execution
of work.
Enforcing security, authentication & authorization
workflows becomes easier as the system is more
loosely coupled.
Cons
Setting up & working with a distributed system is
complex - many nodes working in conjunction with
each other, maintaining a consistent shared state.
The management of distributed systems is complex entails additional network latency which engineering
teams must deal with.
Strong consistency of data is hard to maintain when
everything is so distributed.
Head Node/Worker Node
MAPREDUCE
1. Collect Input Data
2. Split by "sentence"
3. Map Key-Value Pairs
4. Shuffle by Value
5. Reducer outputs final counts
Map Stage
• The master node takes the input.
• Breaks down the input into smaller
subproblems.
• The master node distributes these smaller
subproblems to the worker nodes.
• The worker node can do this again, resulting in
a multi-level tree structure.
• A worker node processes a smaller problem.
It forwards the response back to its master node.
Reduce Phase
• 1. The master node collects the answers to all
the subproblems given by the worker nodes.
• 2. It combines all the answers and forms the
output of the original problem
• The detailed steps in the MapReduce technique
are as follows:
• 1. Map input preparation: The system selects
the map processors, distributes an input keyvalue pair K1 to work on, and provides that
processor with all the input data associated with
that key value.
• 2. Run the user-provided map code ().
• 3. Execute the Map() code exactly once for
each K1 key value and generate output
organized by K2 key values.
• 4. Shuffle map output () into Reduce
processors; the MapReduce system selects
Reduce processors, assigns a K2 key value to
work with, and provides that processor with any
Map() generated data associated with that key
value.
• 5. Run the Reduce() code provided by the user
– Reduce() is executed exactly once for each
K2 key value created in the Map step.
• 6. To Produce the final output, the MapReduce
system collects all the output generated by
reducing () and sorts it by the key value of K2 to
produce the final result.
• HDFS – Hadoop Distributed File System. This
is the file system that manages the storage of
large sets of data across a Hadoop cluster.
HDFS can handle both structured and
unstructured data. The storage hardware can
range from any consumer-grade HDDs to
enterprise drives.
• MapReduce. The processing component of the
Hadoop ecosystem. It assigns the data
fragments from the HDFS to separate map
tasks in the cluster. MapReduce processes the
chunks in parallel to combine the pieces into
the desired result.
• YARN. Yet Another Resource Negotiator.
Responsible for managing computing resources
and job scheduling.
• The set of common libraries and utilities that
other modules depend on. Another name for
this module is Hadoop core, as it provides
support for all other Hadoop components.
Data Sharding
Optimization
Spark
• Apache Spark is an open-source tool that can
run in a standalone mode or on a cloud
platforms.
• It is designed for fast performance and uses
RAM for caching and processing data.
• The Spark engine was created to improve the
efficiency of MapReduce and keep its benefits.
• Even though Spark does not have its own file
system, it can access data on many different
storage solutions. The data structure that Spark
uses is called Resilient Distributed Dataset
(RDD).
• Apache Spark Core. The basis of the whole
project. Spark Core is responsible for
necessary functions such as scheduling, task
dispatching, input and output operations, fault
recovery, etc. Other functionalities are built on
top of it.
• Spark Streaming. This component enables the
processing of live data streams. Data can
originate from many different sources, including
Kafka, Kinesis, Flume, etc.
• Spark SQL. Spark uses this component to
gather information about the structured data
and how the data is processed.
• Machine Learning Library (MLlib). This library
consists of many machine learning algorithms.
MLlib’s goal is scalability and making machine
learning more accessible.
• GraphX. A set of APIs used for facilitating graph
analytics tasks.
Streaming Data
Lambda architecture
Kappa Architecture
Privacy and Security
The ability for individuals and organizations to protect
and control personal information that can be collected,
used, shared, or sold by organizations harvesting that
information.
Data security means protecting digital data, such as
those in a database, from destructive forces and from
the unwanted actions of unauthorized users
Data at Rest – Data stored in files, database, cloud,
removable media, other storage – even physically!
Data in Transit – Data being transmitted through
networks, across the internet, cellular
Data Encryption - security method where information is
encoded and can only be accessed or decrypted by a
user with the correct encryption key. Encrypted data,
also known as ciphertext, appears scrambled or
unreadable to a person or entity accessing without
permission.
Data masking - method of creating a structurally similar
but inauthentic version of an organization's data that
can be used for purposes such as software testing and
user training. The purpose is to protect the actual data
while having a functional substitute for occasions when
the real data is not required.
Single sign-on (SSO) - authentication scheme that
allows a user to log in with a single ID to any of several
related, yet independent, software systems. True single
sign-on allows the user to log in once and access
services without re-entering authentication factors.
Semi
(XML) is a markup language and file format for storing,
transmitting, and reconstructing arbitrary data. It
defines a set of rules for encoding documents in a
format that is both human-readable and machinereadable.
Popularized by SOAP - a messaging protocol
specification for exchanging structured information in
the implementation of web services in computer
networks. It uses XML Information Set for its message
format, and relies on application layer protocols, most
often Hypertext Transfer Protocol (HTTP), although
some legacy systems communicate over Simple Mail
Transfer Protocol (SMTP), for message negotiation and
transmission.
Includes other Markup Languages, EDI, etc
Underpins email and MS Word
JSON
Language-independent data format. It was derived
from JavaScript, but many modern programming
languages include code to generate and parse JSONformat data. JSON filenames use the extension .json.
Any valid JSON file is a valid JavaScript (.js) file, even
though it makes no changes to a web page on its own
Popularized by REST - Representational state transfer
(REST) is a software architectural style that describes
a uniform interface between physically separate
components, often across the Internet in a ClientServer architecture.
NoSQL
Key-value store: Key–value (KV) stores use the
associative array (also called a map or dictionary) as
their fundamental data model. In this model, data is
represented as a collection of key–value pairs, such
that each possible key appears at most once in the
collection.
Document store: Assume that documents encapsulate
and encode data (or information) in some standard
formats or encodings. Encodings in use include XML,
YAML, and JSON and binary forms like BSON.
Documents are addressed in the database via a unique
key. API or query language to retrieve documents
based on their contents.
Graph: Designed for data whose relations are well
represented as a graph consisting of elements
connected by a finite number of relations. Examples of
data include social relations, public transport links, road
maps, network topologies, etc.
Wide-column Store: AKA extensible record store, uses
tables, rows, and columns, but unlike a relational
database, the names and format of the columns can
vary from row to row in the same table. A wide-column
store can be interpreted as a two-dimensional key–
value store.
Reading
Data-Driven Decision Making
data can help managers make better decisions by
providing insights that might not be apparent through
intuition. defining the problem, identifying relevant data
sources, analyzing the data, and communicating the
findings to stakeholders
Download