John Lenhart
Data stores are growing by 50% each year,
and that rate of increase is accelerating[1].
In 2010, we crossed the barrier of the
zettabyte (ZB) across all online data. This
year, we will produce 4 ZB of data.
The type of data is also changing. Over 80%
of it will be unstructured data, which does not
work well with relational databases[1].
“Big data is defined as large amount of data
which requires new technologies and
architectures so that it becomes possible to
extract value from it…”
Big data is something of a misnomer, as the name
points only to the size of the data and gives
little attention to its other properties
Variety - the stored data is not all of the
same type or category
 Structured data - data that is organized in a
structure so that it is identifiable e.g. SQL data
 Semi-structured data - a form of structured data
that has a self-describing structure yet does not
conform to the formal structure of a relational
database e.g. XML
 Unstructured data - data with no identifiable
structure e.g. an image
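A minimal sketch of the three kinds of data listed above, using only the Python standard library. The table name `readings`, the column names, and the sample values are illustrative assumptions, not part of the original slides.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Structured: rows in a relational table with a fixed, declared schema.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (city TEXT, temp_c INTEGER)")
conn.execute("INSERT INTO readings VALUES (?, ?)", ("Toronto", 20))
structured = conn.execute("SELECT city, temp_c FROM readings").fetchall()

# Semi-structured: XML describes itself through its tags,
# but it does not follow a relational table schema.
doc = ET.fromstring("<reading><city>Rome</city><temp_c>32</temp_c></reading>")
semi_structured = {child.tag: child.text for child in doc}

# Unstructured: raw bytes (here the first four bytes of a PNG image header);
# nothing in the data itself names fields or columns.
unstructured = bytes([0x89, 0x50, 0x4E, 0x47])

print(structured)
print(semi_structured)
print(len(unstructured))
```

Note that the XML values come back as text: any typing has to be applied by the reader of the data, which is part of why semi-structured data fits poorly into a relational schema.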
Volume - the “Big” in Big data; represents
the large volume or size of the data
 At present, existing data is measured in petabytes and is
expected to grow to zettabytes in the near future
 For example, big social networking sites are
producing data on the order of terabytes every day, and
this amount of data is difficult to handle using
traditional systems
Velocity - represents not only the speed at which
the data is incoming, but also the speed at which
the data is outgoing
 Traditional systems are not capable of performing
analytics on data that is constantly in motion
Variability - represents the inconsistency of the
data flow
 The flow of data can be highly inconsistent, leading to
periodic peaks and lows
 Daily, seasonal and event-triggered peak data loads can be
challenging to manage, especially for unstructured data[2]
 For example a large natural disaster would spike page visits
for cnn.com
Complexity - represents the difficulty of linking, matching,
cleansing, and transforming data from multiple sources
Value
 Systems must not only be designed to handle Big
data efficiently and effectively, but also be able to
filter the most important data from all of the data
 This filtered data is what helps add value to a business
Log Storage in IT Industries
 IT industries store large amounts of data as logs so that
rarely occurring problems can be diagnosed and solved
 Big data analytics is used on the data to pinpoint the point of
failure
 Traditional systems are not able to handle these logs because
of their volume, raw and semi-structured nature, and high rate
of change
Sensor Data
 The massive amount of sensor data is also a big challenge for Big
data
 Example
▪ The Large Hadron Collider (LHC) is the world’s largest and highest-energy particle accelerator. The data flow in its experiments consists of
25 to 200 petabytes of data which needs to be processed and stored
Risk Analysis
 It’s important for financial institutions to model data in order to
calculate the risk so that it falls under their acceptable
thresholds
 A lot of potentially useful data is underutilized because of its volume
and should be integrated within the model to determine the risk
patterns more accurately
Social Media
 The largest use of Big data is for social media and customer
sentiment
 Keeping an eye on what customers are saying about their
products helps business organizations get a kind of customer
feedback
 The customer feedback can then be used to make decisions and
add value to the business
Privacy and Security
 The most important issue with Big data, which
includes conceptual, technical as well as legal
significance
 Personal information about a person, when combined
with large external data sets, leads to the inference of
new private facts about that person
 Big data used by law enforcement will increase the
chances of certain tagged people suffering
adverse consequences without the ability to fight
back or even the knowledge that they are being
discriminated against
Data Access and Sharing of Information
 If data is to be used to make accurate decisions in
time, it must be available
in an accurate, complete and timely manner
Storage and Processing Issues
 Many companies are struggling to store the large
amount of data they are producing
▪ Outsourcing storage to the cloud may seem like an option but
long upload times and constant updates to the data preclude
this option
 Processing a large amount of data also takes a lot of
time
Hadoop - an open source project hosted by the
Apache Software Foundation for managing
Big data
Hadoop consists of two main components
 The Hadoop Distributed File System (HDFS), which is a
distributed file system that stores the data on
multiple separate servers (each of which has its
own processor(s))
 MapReduce, the framework that understands and
assigns work to the nodes in a cluster[3]
Hadoop provides the following advantages[3]
 Data read/write performance is increased by distributing
the data across the cluster, allowing each processor to do
work in a parallel fashion
 It’s scalable: new nodes can be added as needed without
making changes to the existing system
 It’s cost effective because it brings parallel computing to
commodity servers
 It’s flexible: it can absorb any type of data, structured or
not, from any number of sources
 It’s fault tolerant: it handles failures intrinsically by always
storing multiple copies of the data and automatically
loading a copy when a fault is detected
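The fault-tolerance idea above (keep several copies, read from a surviving one) can be illustrated with a toy model. This is only a sketch under stated assumptions: `ToyCluster`, its methods, the node names, and the round-robin placement are all hypothetical and are not the HDFS API or its actual placement policy.

```python
class ToyCluster:
    """Toy model of replicated block storage (hypothetical, not the HDFS API)."""

    def __init__(self, node_names, replicas=3):
        self.nodes = {name: {} for name in node_names}  # node -> {block_id: data}
        self.alive = set(node_names)
        self.replicas = replicas

    def write(self, block_id, data):
        # Store copies on `replicas` distinct nodes, chosen round-robin
        # starting from a position derived from the integer block id.
        names = sorted(self.nodes)
        start = block_id % len(names)
        for i in range(self.replicas):
            self.nodes[names[(start + i) % len(names)]][block_id] = data

    def fail(self, name):
        # Simulate a node going down; its copies become unreachable.
        self.alive.discard(name)

    def read(self, block_id):
        # Serve the block from any surviving node that holds a copy.
        for name in sorted(self.alive):
            if block_id in self.nodes[name]:
                return self.nodes[name][block_id]
        raise IOError("all replicas of block %d are lost" % block_id)


cluster = ToyCluster(["n1", "n2", "n3", "n4"], replicas=3)
cluster.write(0, b"block-0")   # copies land on n1, n2 and n3
cluster.fail("n1")
cluster.fail("n2")
recovered = cluster.read(0)    # still readable from n3
print(recovered)
```

With three copies, the block survives the loss of any two nodes that hold it; only losing all three makes the read fail.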
How do you use Hadoop?
 The developer writes a program that conforms to the
MapReduce programming model
 The developer specifies the format of the data to be
processed in their program
How does MapReduce work?[4]
 Each Hadoop program performs two tasks:
▪ Map - Breaks all of the data down into key/value pairs
▪ Reduce - Takes the output from the map step as input and
combines those data key/value pairs into a smaller set of
key/value pairs
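The two steps above can be sketched in plain Python on a single machine. This is a minimal illustration of the programming model, not Hadoop's implementation: `run_mapreduce`, `word_mapper`, `count_reducer`, and the sample lines are hypothetical names and data chosen for the example. The grouping step in the middle stands in for the shuffle that Hadoop performs between map and reduce.

```python
from collections import defaultdict


def run_mapreduce(records, mapper, reducer):
    # Map: turn every input record into zero or more (key, value) pairs.
    pairs = [pair for record in records for pair in mapper(record)]
    # Shuffle: group the values by key so each reducer call sees one key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    # Reduce: collapse each key's list of values into a single result.
    return {key: reducer(key, values) for key, values in groups.items()}


def word_mapper(line):
    # Emit (word, 1) for every word in the line.
    return [(word.lower(), 1) for word in line.split()]


def count_reducer(word, counts):
    # Combine all the 1s emitted for this word into a total.
    return sum(counts)


lines = ["big data big value", "data in motion"]
word_counts = run_mapreduce(lines, word_mapper, count_reducer)
print(word_counts)
```

The developer supplies only the mapper and the reducer; distributing the records, grouping by key, and collecting the results is the framework's job.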
MapReduce example[4]: Assume you have five files, and each file contains two
columns that represent a city and the corresponding temperature recorded in
that city for the various measurement days
 Toronto, 20, New York, 22, Rome, 32, Toronto, 4, Rome, 33, New York, 18
We want to find the maximum temperature for each city across all of the data
 We create five map tasks, where each mapper works on one of the five files
and the mapper task goes through the data and returns the maximum
temperature for each city
For the file shown above, this results in: (Toronto, 20) (New York, 22) (Rome, 33)
Let’s assume the other four mapper tasks (working on the other four files not
shown here) produced the following intermediate results:
(Toronto, 18) (New York, 32) (Rome, 37)
(Toronto, 32) (New York, 33) (Rome, 38)
(Toronto, 22) (New York, 20) (Rome, 31)
(Toronto, 31) (New York, 19) (Rome, 30)
All five of these output streams would be fed into the reduce tasks, which
combine the input results and output a single value for each city, producing a
final result set as follows:
(Toronto, 32) (New York, 33) (Rome, 38)
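The temperature example can be reproduced with a short single-machine sketch. The function names `map_max_per_city` and `reduce_max` are assumptions made for illustration. Only the first file's rows appear in the example, so the other four mappers are represented by the intermediate results quoted above rather than by raw data.

```python
def map_max_per_city(rows):
    """Map task: scan one file's (city, temperature) rows, keep each city's maximum."""
    best = {}
    for city, temp in rows:
        if city not in best or temp > best[city]:
            best[city] = temp
    return list(best.items())


def reduce_max(mapper_outputs):
    """Reduce task: merge every mapper's output into one maximum per city."""
    final = {}
    for output in mapper_outputs:
        for city, temp in output:
            final[city] = max(temp, final.get(city, temp))
    return final


# The one file whose contents are shown in the example.
file_one = [("Toronto", 20), ("New York", 22), ("Rome", 32),
            ("Toronto", 4), ("Rome", 33), ("New York", 18)]

# Intermediate results assumed for the other four mappers.
other_outputs = [
    [("Toronto", 18), ("New York", 32), ("Rome", 37)],
    [("Toronto", 32), ("New York", 33), ("Rome", 38)],
    [("Toronto", 22), ("New York", 20), ("Rome", 31)],
    [("Toronto", 31), ("New York", 19), ("Rome", 30)],
]

first_output = map_max_per_city(file_one)
result = reduce_max([first_output] + other_outputs)
print(first_output)  # [('Toronto', 20), ('New York', 22), ('Rome', 33)]
print(result)        # {'Toronto': 32, 'New York': 33, 'Rome': 38}
```

Because taking a maximum gives the same answer whether it is applied per file or across all files, each mapper can shrink its file to three pairs before anything is sent to the reducer.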
Big data: Issues, challenges, tools and Good practices
 http://ieeexplore.ieee.org.ezp.scranton.edu/xpls/icp.jsp?arnum
Why Every Database Must Be Broken Soon
 1. https://blogs.vmware.com/vfabric/2013/03/why-every-
Big Data: What it is and why it matters
 2. http://www.sas.com/en_us/insights/big-data/what-is-big-
What is Hadoop?
 3. http://www-01.ibm.com/software/data/infosphere/hadoop/
What is MapReduce?
 4. http://www-