Big Data

Introduction:
The past decade has seen an explosion of "data-intensive" or "data-centric"
applications, in which the analysis of large volumes of heterogeneous data
is the core of the problem. These are commonly known as "Big Data
applications", and the systems that support the management and
processing of this data are commonly referred to as "Big Data processing
systems".
What is Big Data?
Big Data is a broad term for data sets so large or complex that traditional
data processing applications are inadequate. Obviously, there are many
challenges in dealing with this type of data, such as: analysis, capture, data
manipulation, search, sharing, storage, transfer, visualization, and
information privacy.
How to describe Big Data?
It can be described by the following characteristics:
1. Volume: the quantity of data is important in this context, since it
determines whether a data set counts as Big Data or not.
2. Variety: the category to which the data belongs. This is a very
essential fact that needs to be known by the data analysts.
3. Velocity: this term refers to the speed of generation of data or how fast
the data is generated and processed to meet the demands and the
challenges which lie ahead in the path of growth and development.
4. Veracity: the data that arrives at Big Data systems comes from
different sources and may contain noise, bias, or inconsistencies;
therefore, Big Data systems need to clean the data and maintain its
provenance in order to reason about its trustworthiness.
Some authors add further characteristics, such as Complexity: data
management can become a very complex process when it deals with
large volumes of data that come from different sources and carry much
undesired information. This process needs the data to be linked,
connected, and correlated in order to grasp the information that is
supposed to be conveyed by these data.
What are the benefits and uses of Big Data?
The ability to process Big Data brings multiple benefits, such as:
1. Businesses can utilize outside intelligence while making decisions.
2. Universities use data science in their research, but also to enhance the
study experience of their students.
3. Improved customer service: traditional customer feedback systems are
being replaced by new systems designed with Big Data technologies.
4. Early identification of risk to the product or services, if any.
5. Better operational efficiency: Big Data technologies can be used for
creating a staging area or landing zone for new data before identifying
what data should be moved to the data warehouse.
What are the facets of data?
In big data you’ll come across many different types of data, and each of
them tends to require different tools and techniques. The main categories
of data are these:
 Structured.
 Unstructured.
 Natural language.
 Machine-generated.
 Graph-based.
 Audio, video, and images.
 Streaming.
 The following are examples of Big Data:
1. The New York Stock Exchange generates about one terabyte of new trade
data per day.
2. Social media: Facebook, Twitter, etc.
3. YouTube.
4. And more…
 The difference between traditional data and Big Data:

Aspect | Traditional Data | Big Data
Volume | Ranges from gigabytes to terabytes; a manageable volume | Ranges from petabytes to zettabytes or exabytes; this huge volume becomes unmanageable
Data | Structured, stable data with known inter-relationships | Different types of data that are unstable, with unknown relationships
Generated | Per hour or per day or more, from specific sources | More frequently, mainly per second, and from various sources
Architecture | Centralized and managed in a centralized form | Distributed and managed in a distributed form
Integration | Easy | Difficult
Process | A normal system is capable of processing this data | A high-configuration system is used to process the data
Data model | Strict schema-based and static | Dynamic
Source | May be financial data, organizational data, or others | Social media, device data, sensor data, video, image, audio, etc.
Storage | Block storage | Stored in files or objects distributed over nodes
 Big Data architecture:
1. Data sources govern Big Data architecture. This involves all those
sources from which the data extraction pipeline gets built.
2. Data Storage is the receiving end for Big Data. Data Storage receives data
of varying formats from multiple data sources and stores them.
3. Real-time Message Ingestion: We need to build a mechanism in our Big
Data architecture that captures and stores real-time data that is consumed
by stream processing consumers.
4. Batch Processing: The architecture requires a batch processing system
for filtering, aggregating, and processing data which is huge in size for
advanced analytics. This job involves reading the data from the data
storage, processing it, and writing outputs to new files.
5. Stream Processing: There is a little difference between stream processing
and real-time message ingestion. Stream processing handles all streaming
data which occurs in windows or streams. It then writes the data to the
output.
6. Analytical Data Store: After processing data, we need to bring data in one
place so that we can accomplish an analysis of the entire data set.
7. Analytics and Reporting: After ingesting and processing data from
varying data sources we require a tool for analyzing the data. For this,
there are many data analytics and visualization tools that analyze the data
and generate reports or a dashboard. Companies use these reports for
making data-driven decisions.
8. Orchestration: Moving data through these systems requires orchestration
in some form of automation. Ingesting data, transforming the data,
moving data in batches and stream processes, then loading it to an
analytical data store, and then analyzing it to derive insights must be in a
repeatable workflow. This allows us to continuously gain insights from
our big data.
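As a rough sketch of such a repeatable workflow, the following Python code chains hypothetical pipeline steps (ingest, batch_process, load_to_store, and report are illustrative names, not a real orchestration framework) into one function that can be re-run on a schedule:

    # A toy, framework-free sketch of an orchestrated pipeline.
    # All step names are hypothetical; a production system would use a
    # dedicated workflow scheduler instead.

    def ingest(source):
        # Pretend to pull raw records from a data source.
        return [{"user": "u1", "clicks": 3}, {"user": "u2", "clicks": 5}]

    def batch_process(records):
        # Filter and aggregate the raw records.
        return {"total_clicks": sum(r["clicks"] for r in records)}

    def load_to_store(result, store):
        # Write the processed result to an analytical data store.
        store.append(result)

    def report(store):
        # Produce a simple report from the analytical store.
        print("latest metrics:", store[-1])

    def run_pipeline(source, store):
        # The repeatable workflow: ingest -> process -> load -> analyze.
        load_to_store(batch_process(ingest(source)), store)
        report(store)

    analytical_store = []
    run_pipeline("clickstream", analytical_store)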
 Challenges in Designing Big Data Architecture:
[Figure: the main challenges: Cost, Tools, Quality, Security, Scalability]
1. Data quality is a challenge while working with multiple data sources.
2. Big Data architecture must be designed in such a way that it can scale
up when the need arises.
3. Data security is the most crucial part. It is the biggest challenge while
dealing with big data. Hackers and fraudsters may try to add their own
fake data or skim companies' data for sensitive information.
Cybercriminals could easily mine company data if companies do not
encrypt the data, secure the perimeters, and work to anonymize the data
to remove sensitive information.
4. There are many tools and technologies with their pros and cons for big
data analytics, like Apache Hadoop, Spark, Cassandra, Hive, etc.
Choosing the right technology set is difficult.
5. Big data architecture entails a lot of expense. During architecture
design, the Big Data company must know the hardware expenses,
new-hire expenses, electricity expenses, whether the needed framework
is open-source or not, and many more.
 Big Data Management System: it uses a different software stack with
the following layers:
1. Distributed Storage Systems: Big data management relies on a
distributed storage layer, whereby data is typically stored in files or
objects distributed over the nodes. This is one major difference from the
software stack of current DBMSs, which relies on block storage. The
distributed storage layer typically provides two solutions to store data,
objects or files, distributed over cluster nodes. These two solutions are
complementary, as they have different purposes and can be combined in
order to support both high numbers of objects and large files. One of
the most influential distributed file systems is the Google File System
(GFS).
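To make "files or objects distributed over nodes" concrete, here is a toy Python sketch, not GFS itself: it splits a file's bytes into fixed-size blocks and assigns each block, with replicas, to cluster nodes. The block size, replication factor, and node names are illustrative only; real systems use blocks of tens of megabytes and far smarter placement.

    # Toy illustration of block-based distributed file storage.
    BLOCK_SIZE = 4        # bytes; tiny so the example output is readable
    REPLICATION = 2       # number of copies kept of each block
    NODES = ["node1", "node2", "node3"]

    def split_into_blocks(data, block_size=BLOCK_SIZE):
        # Chop the byte string into fixed-size blocks.
        return [data[i:i + block_size] for i in range(0, len(data), block_size)]

    def place_blocks(blocks, nodes, replication=REPLICATION):
        # Round-robin placement: block i is stored on `replication` distinct nodes.
        return {i: [nodes[(i + r) % len(nodes)] for r in range(replication)]
                for i in range(len(blocks))}

    blocks = split_into_blocks(b"hello distributed world")
    for block_id, replicas in place_blocks(blocks, NODES).items():
        print(block_id, blocks[block_id], "->", replicas)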
2. Big Data Processing Frameworks: An important class of big data
applications requires data management without the overhead of full
database management, and cloud services require scalability for
applications that are easy to partition into a number of parallel but
smaller tasks, the so-called embarrassingly parallelizable applications.
For these cases, where scalability matters most, a parallel processing
platform called MapReduce has been proposed; we discuss it below.
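For intuition about embarrassingly parallelizable applications, the sketch below (plain Python, not MapReduce itself) splits work into independent chunks, processes them in parallel worker processes, and merges the partial results; because the chunks never interact, the job scales simply by adding workers.

    # Toy embarrassingly parallel job: each chunk is handled
    # independently, so workers never need to coordinate.
    from multiprocessing import Pool

    def count_words(chunk):
        # Independent task: count the words in one chunk of text.
        return len(chunk.split())

    if __name__ == "__main__":
        chunks = ["big data needs scale", "mapreduce splits work",
                  "workers run in parallel", "results are merged"]
        with Pool(processes=4) as pool:
            partial_counts = pool.map(count_words, chunks)
        print("total words:", sum(partial_counts))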
3. Stream Data Management: a data stream management system (DSMS)
is a computer software system that manages continuous data streams. It
offers flexible query processing, so that the information needed can be
expressed using queries. Since most DSMSs are data-driven, a
continuous query produces new results as new data arrives at the
system.
One of the biggest challenges for a DSMS is to handle potentially
infinite data streams using a fixed amount of memory and no random
access to the data. There are different approaches to limit the amount
of data in one pass, which can be divided into two classes:
1. compression techniques that try to summarize the data.
2. window techniques that try to partition the data into (finite) parts,
as in the sketch below.
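Here is a minimal sketch of the window technique, assuming a simple tumbling (fixed-size, non-overlapping) window: the stream is consumed in a single pass with a fixed amount of memory, and a result is emitted each time a window fills.

    # Tumbling-window aggregation over a potentially infinite stream.
    # Only the current window is held in memory; processed data is discarded.

    def tumbling_window_avg(stream, window_size):
        window = []
        for value in stream:                # single pass, no random access
            window.append(value)
            if len(window) == window_size:  # window full: emit one result
                yield sum(window) / window_size
                window = []                 # forget the processed values

    sensor_readings = iter([21.0, 21.5, 22.1, 23.0, 22.4, 21.9])
    for avg in tumbling_window_avg(sensor_readings, window_size=3):
        print("window average:", avg)

A sliding window would instead evict only the oldest element on each arrival, but the memory bound stays the same.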
The following are some differences between a DBMS and a DSMS:

DBMS | DSMS
Persistent data relations | Volatile data streams
Random access | Sequential access
One-time queries | Continuous queries
Unlimited secondary storage | Limited main memory
Only the current state is relevant | Consideration of the order of the input
Relatively low update rate | Potentially extremely high update rate
Little or no real-time requirements | Real-time requirements
Assumes exact data | Assumes outdated or inaccurate data
Plannable query processing | Variable data arrival and data characteristics
4. Data Analysis Platforms:
The simplicity of the graph model allows for rapidly absorbing and
connecting large volumes of data from many sources. Big data analytics
systems should provide a platform that can support different analytics
techniques that can be adapted in ways that help solve a variety of
challenging problems.
The biggest advantage of using graphs is that you can analyze these
graphs and use them for analyzing complex datasets. This analytic
method is needed in order to derive quick insights from massive big
datasets. Obviously, points of intersection appear between the graph
analysis model and the relational model when we need to analyze Big
Data.
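As a small illustration of graph-based analysis (a hand-rolled toy, not any particular platform's API), the sketch below stores relationships as an adjacency list and ranks nodes by degree, one of the simplest graph metrics:

    # Toy graph analysis: a social graph as an adjacency list.
    # Degree (number of direct connections) is a crude measure of influence.
    graph = {
        "alice": ["bob", "carol"],
        "bob":   ["alice", "dave"],
        "carol": ["alice"],
        "dave":  ["bob"],
    }

    # Rank users by how many direct connections they have.
    for user in sorted(graph, key=lambda u: len(graph[u]), reverse=True):
        print(user, "has", len(graph[user]), "connections")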
 Hadoop:
The Apache Hadoop software library is a framework that allows for the
distributed processing of large data sets across clusters of computers
using simple programming models. It is designed to scale up from single
servers to thousands of machines, each offering local computation and
storage. Rather than rely on hardware to deliver high availability, the
library itself is designed to detect and handle failures at the application
layer, thus delivering a highly available service on top of a cluster of
computers, each of which may be prone to failures.
A wide variety of companies and organizations use Hadoop for both
research and production.
It provides a software framework for distributed storage and processing
of Big Data using the MapReduce model.
 MapReduce:
Hadoop uses a programming method called MapReduce to achieve
parallelism. It is a programming model and an associated implementation
for processing and generating Big Data with a parallel, distributed
algorithm on a cluster.
MapReduce is composed of a "Map" procedure, which performs filtering
and sorting (such as sorting students by first name into queues, one
queue for each name), and a "Reduce" method, which performs a
summary operation (such as counting the number of students in each
queue, yielding name frequencies).
MapReduce processes stored data that may be in either a file system
(unstructured) or a database (structured). It can take advantage of data
locality by processing the data near the place it is stored in order to
minimize communication overhead.
MapReduce Stages:
1. Map: each worker node applies the map function to its local data and
writes the output to temporary storage. A master node ensures that only
one copy of the redundant input data is processed.
2. Shuffle: worker nodes redistribute data based on the output keys
(produced by the map function), such that all data belonging to one key
is located on the same worker node.
3. Reduce: worker nodes now process each group of output data, per key,
in parallel.
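The following is a minimal in-memory Python sketch of the three stages, using the classic word-count example; it illustrates the model only and is not Hadoop's actual API:

    # Word count expressed as the three MapReduce stages.
    from collections import defaultdict

    def map_phase(document):
        # Map: emit a (key, value) pair for every word.
        return [(word, 1) for word in document.split()]

    def shuffle_phase(mapped_pairs):
        # Shuffle: group all values that share the same key.
        groups = defaultdict(list)
        for key, value in mapped_pairs:
            groups[key].append(value)
        return groups

    def reduce_phase(groups):
        # Reduce: summarize each group (here, count the occurrences).
        return {key: sum(values) for key, values in groups.items()}

    documents = ["big data big ideas", "big clusters process data"]
    mapped = [pair for doc in documents for pair in map_phase(doc)]
    print(reduce_phase(shuffle_phase(mapped)))
    # {'big': 3, 'data': 2, 'ideas': 1, 'clusters': 1, 'process': 1}

In a real cluster, the map calls run on the nodes holding the input blocks, and the shuffle moves each key's values across the network to the node that will reduce them.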
Commonly cited advantages of this type of processing framework are:
1. Flexibility: the Hadoop MapReduce programming model offers the
flexibility to process structured or unstructured data, so various
business organizations can make use of the data and operate on
different types of data.
2. Scalability: Hadoop is a highly scalable platform, largely because of
its ability to store and distribute large data sets across lots of servers.
3. Cost-effective solution: previously, businesses were forced to downsize
their data and to classify it based on assumptions about how valuable
certain data could be to the organization, hence removing the raw data.
Here the Hadoop scale-out architecture with MapReduce programming
comes to the rescue.
4. Fault tolerance: this is a unique functionality offered in Hadoop
MapReduce: it is able to quickly recognize a fault and apply a quick fix
for an automatic recovery solution.
5. Parallel processing: the programming model divides the tasks in a
manner that allows the execution of independent tasks in parallel. This
parallel processing makes it easier for the processes to take on each of
the tasks, which helps to run the program in much less time.
6. A simple model of programming: MapReduce programming is based
on a very simple programming model, which basically allows
programmers to develop MapReduce programs that can handle many
more tasks with more ease and efficiency.
7. Fast: Hadoop MapReduce processes large volumes of data that are
unstructured or semi-structured in less time.
8. Security and authentication: if any outside person were to get access to
all of the organization's data and could manipulate multiple petabytes
of it, they could do much harm to the organization's business dealings
and operations. The MapReduce programming model addresses this
risk by working with HDFS and HBase, which allow high security,
permitting only approved users to operate on the data stored in the
system.