
BDA 02 - Fundamentals

Big Data Analysis and Business Knowledge
Lesson 2: Fundamentals of Big Data Analytics
Dr. Le, Hai Ha
Content
• Review
• Concepts and Terminology
• Different categories of Data
• Distributed computing
• Functional Programming
0. Review
Big Data
Big data primarily refers to data sets that
are too large or complex to be dealt with by
traditional data-processing application
software. -- Wikipedia
• Big Data is a field dedicated to the analysis, processing, and storage
of large collections of data that frequently originate from disparate
sources.
• Big Data solutions and practices are typically required when
traditional data analysis, processing and storage technologies and
techniques are insufficient.
• Big Data addresses distinct requirements, such as:
• combining multiple unrelated datasets
• processing large amounts of unstructured data, and
• harvesting hidden information in a time-sensitive manner.
Scaling out vs. Scaling up
[Figure: a cluster of computers; scaling out adds more nodes, each pairing compute with storage, rather than scaling up a single machine]
Storage – HDFS example
Analysis, Processing - MapReduce
Typical architecture
1. Concepts and Terminology
Datasets
• Collections or groups of related data are generally referred to as
datasets.
• Each group or dataset member (datum) shares the same set of
attributes or properties as others in the same dataset.
Data Analysis
• Data analysis is the process of examining data to find facts,
relationships, patterns, insights and/or trends.
• Carrying out data analysis helps establish patterns and relationships
among the data being analyzed.
Data Analytics
• Data Analytics is a discipline that includes the
management of the complete data lifecycle,
which encompasses collecting, cleansing,
organizing, storing, analyzing and governing data.
• Data analytics is a broader term that
encompasses data analysis.
• In Big Data environments, data analytics has
developed methods that allow data analysis to
occur through the use of highly scalable
distributed technologies and frameworks that are
capable of analyzing large volumes of data from
different sources.
Categories of analytics
• Descriptive analytics
• Diagnostic analytics
• Predictive analytics
• Prescriptive analytics
The reality is that the generation of
high value analytic results increases
the complexity and cost of the
analytic environment.
Descriptive Analytics
• Descriptive Analytics are carried
out to answer questions about
events that have already occurred.
• Sample questions can include:
• What was the sales volume over the
past 12 months?
• What is the number of support calls
received as categorized by severity
and geographic location?
• What is the monthly commission
earned by each sales agent?
Diagnostic Analytics
• Diagnostic Analytics aim to determine the cause of a phenomenon that occurred in the past, using questions that focus on the reason behind the event.
• Sample questions can include:
• Why were Q2 sales less than Q1 sales?
• Why have there been more support calls originating from the
Eastern region than from the Western region?
• Why was there an increase in patient re-admission rates over the
past three months?
Predictive Analytics
• Predictive Analytics are carried out in
an attempt to determine the outcome
of an event that might occur in the
future.
• Questions are usually formulated using
a what-if rationale, such as the
following:
• What are the chances that a customer
will default on a loan if they have
missed a monthly payment?
• What will be the patient survival rate if
Drug B is administered instead of Drug
A?
• If a customer has purchased Products A
and B, what are the chances that they
will also purchase Product C?
Prescriptive Analytics
• Prescriptive Analytics build upon the results of predictive analytics by prescribing actions that should be taken.
• Sample questions may include:
• Among three drugs, which one
provides the best results?
• When is the best time to trade a
particular stock?
• Prescriptive analytics involve the use of business rules and large amounts of internal and external data to simulate outcomes and prescribe the best course of action.
Business Intelligence (BI)
• Business Intelligence (BI) enables an organization to gain insight into
the performance of an enterprise by analyzing data generated by its
business processes and information systems.
• BI applies analytics to large amounts of data across the enterprise,
which has typically been consolidated into an enterprise data
warehouse to run analytical queries.
Key Performance Indicators (KPI)
• A Key Performance Indicator (KPI) is a metric that can be used to gauge success within a particular business context.
• KPIs are linked with an enterprise’s overall
strategic goals and objectives.
• They are often used to identify business
performance problems and demonstrate
regulatory compliance.
2. Different Categories of Data
Different Categories of Data
• The data processed by Big Data solutions can be human-generated or
machine-generated
• The primary categories of data are:
• structured data
• unstructured data
• semi-structured data
Structured Data
• Structured data conforms to a data model or schema and is often
stored in tabular form.
• It is used to capture relationships between different entities and is
therefore most often stored in a relational database.
• Structured data is frequently generated by enterprise applications
and information systems like ERP and CRM systems.
Types of Data
• Categorical (nominal) data
• sorted into categories according to specified characteristics. E.g.
gender: male, female
• Ordinal data
• ordered or ranked according to some relationship to one another. E.g.
rating a service as poor, average, good, very good, or excellent
• Interval data
• ordinal but have constant differences between observations and have
arbitrary zero points. E.g. time and temperature
• Ratio data
• continuous and have a natural zero. E.g. weight, revenue
Example
• Categorical data (= labels: nominal, ordinal [ordered], binary)
• Quantitative data (= numbers: discrete [integer], continuous [real])
• TABLE ROWS = instances, examples, data points, observations, samples
• TABLE COLUMNS = attributes, features, variables
Unstructured Data
• Data that does not conform to a data model or data schema is known
as unstructured data.
• It is estimated that unstructured data makes up 80% of the data
within any given enterprise.
• Unstructured data has a faster growth rate than structured data.
Semi-structured Data
• Semi-structured data has a defined level of structure and consistency,
but is not relational in nature.
• Instead, semi-structured data is hierarchical or graph-based.
• This kind of data is commonly stored in files that contain text, as in the sketch below.
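To make this concrete, a hedged sketch: JSON is a typical semi-structured, text-based format, hierarchical but with no relational schema. The record below is invented for illustration.

```python
import json

# A semi-structured record: hierarchical and self-describing, but not relational.
raw = '{"name": "Alice", "orders": [{"id": 1, "items": ["pen", "book"]}]}'

record = json.loads(raw)             # parse the text into nested dicts/lists
print(record["orders"][0]["items"])  # ['pen', 'book']
```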
Metadata
• Metadata provides information about a dataset’s characteristics and
structure.
• This type of data is mostly machine-generated and can be appended
to data.
• Examples of metadata include:
• XML tags providing the author and creation date of a document
• attributes providing the file size and resolution of a digital photograph
3. Distributed Computing
Distributed Computing for Big Data
• Get parallelism from computing clusters – large collections of commodity hardware, including conventional processors (“compute nodes”) connected by Ethernet cables or inexpensive switches – instead of a single supercomputer.
• In these computing paradigms, we have a distributed file system (DFS):
• which features much larger units than the disk blocks in a conventional operating system.
• DFS also provides replication of data, or redundancy, to protect against the frequent media failures that occur when data is distributed over thousands of low-cost compute nodes.
• MapReduce is a programming style on top of DFS:
• There are many high-level languages, such as Hive, with a MapReduce foundation.
• Apache Spark is an extension of the MapReduce framework.
A Compute Cluster Physical Setup
● Compute nodes are stored on racks, perhaps 8–64 on a rack.
● The nodes on a single rack are connected by a network, typically gigabit Ethernet.
● There can be many racks of compute nodes, and racks are connected by another level of network or a switch.
● The bandwidth of inter-rack communication is somewhat greater than the intra-rack Ethernet.
Some of the Challenges with Distributed Computing Systems
• Communication costs. As the cluster grows, you need more bandwidth for nodes across racks to communicate effectively.
• Administration and maintenance. It takes effort to ensure the software running across nodes is synchronized, and inspecting and visualizing what is happening on each node is not easy.
• Partial failures. Since there are many nodes, failure of some nodes is inevitable.
• Programming in this environment brings its own challenges.
Overview of Main Solutions
How do DFS and MapReduce deal with the challenge of constant failures?
• DFS solution. Files must be stored redundantly. If we did not duplicate each file at several compute nodes, then if one node failed, all its files would be unavailable until the node was replaced.
• MapReduce solution (or that of any programming system working with compute clusters). Computations must be divided into tasks such that, if any one task fails to execute to completion, it can be restarted without affecting other tasks.
DFS Implementations
HDFS is not the only DFS out there; there are others.
• Some DFS are open source (e.g., HDFS) while others are proprietary.
• They are also implemented in different programming languages. For example, HDFS is implemented in Java.
• There are also cloud-based/remote DFS such as:
• AWS S3
• Google Cloud Storage
• Microsoft Azure
• IBM Cloud Object Storage
• See Wikipedia for comparisons of DFS implementations.
MapReduce
A programming style which works well with data in a DFS. Hadoop MapReduce is just one implementation of this style; Google, for example, has its own implementation, also called MapReduce.
1. Map tasks are each given one or more chunks of input data from a DFS. Each Map task turns its chunk into a sequence of key-value pairs. The way key-value pairs are produced from the input data (e.g., what the value should be) is determined by the code written by the user.
2. The key-value pairs from each Map task are collected by a master controller and sorted by key. The keys are divided among all the Reduce tasks, so all key-value pairs with the same key wind up at the same Reduce task.
3. The Reduce tasks work on one key at a time and combine all the values associated with that key. The manner of combining values is determined by the code written by the user. For instance, you can combine by adding the values for a single key. A word-count sketch follows.
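To make the three steps concrete, here is a minimal single-machine sketch of the classic word-count job in Python. It only imitates the MapReduce style (a real Hadoop or Spark job would distribute the chunks across nodes); the helper names map_task and reduce_task are ours, not part of any framework.

```python
from itertools import groupby
from operator import itemgetter

documents = ["big data is big", "data is data"]

# Step 1: Map - each element (a document) becomes a sequence of (key, value) pairs.
def map_task(doc):
    return [(word, 1) for word in doc.split()]

pairs = [pair for doc in documents for pair in map_task(doc)]

# Step 2: Shuffle - sort by key so equal keys end up together
# (the job the master controller does in a real cluster).
pairs.sort(key=itemgetter(0))

# Step 3: Reduce - combine all values for one key; here we combine by adding.
def reduce_task(key, values):
    return (key, sum(values))

counts = [reduce_task(key, [v for _, v in group])
          for key, group in groupby(pairs, key=itemgetter(0))]
print(counts)  # [('big', 2), ('data', 3), ('is', 2)]
```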
MapReduce
Map Tasks in MapReduce
• Input into a Map task:
• Input files for a Map task can be seen as consisting of elements, which can be of any type: a tuple or a document, for example. A chunk is a collection of elements, and no element is stored across two chunks.
• For example, in the word-count example, the input file is a repository of documents, and each document is an element.
• The Map function takes an input element as its argument and produces zero or more key-value pairs.
• Key-value pairs:
• Keys do not have to be unique. Rather, a Map task can produce several key-value pairs with the same key, even from the same element.
4. Functional Programming
Contents
1. Big Data problem solving
2. Big Data package stack in Python
3. Functional programming basics
a) What is functional programming?
b) Advantages of functional programming
c) Functional programming and Big Data processing
d) Lambda functions
e) Higher order functions
f) Functional programming in Python
g) Data structures for functional programming in Python
4. Further reading on functional programming
5. Functional programming tutorial in Python
Advice on Tackling a Big Data Problem
Some questions to ask yourself before you jump to the big guns:
1. Can I optimize pandas to solve the problem? If you are using pandas for data munging, you can tune pandas to load large datasets, depending on the nature of your problem (see the sketch after this list).
2. How about drawing a sample from the large dataset? Depending on your use case, drawing a sample out of a large dataset may or may not work. Just be careful that you sample correctly.
3. Can I use simple Python parallelism to solve the problem on my laptop? Sometimes the data isn't that big, but you need to run more intense computations on it; multiprocessing can help.
4. Can I use a big data framework on my laptop? For some tasks, even with a 25 GB dataset, frameworks like Spark and Dask can work on a single laptop.
5. Which package should I use?
6. Need to build a cluster? Take time to think about which distribution of Hadoop to use, which vendors to use, and whether you will put the cluster in the cloud or on-premises. You will need input from IT people for this one.
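A hedged sketch of points 1 and 2 above; the file name and column name are placeholders:

```python
import pandas as pd

# Point 1: read a large CSV in chunks, with narrower dtypes where safe.
chunks = pd.read_csv("big_file.csv",               # placeholder path
                     dtype={"amount": "float32"},  # downcast to save memory
                     chunksize=1_000_000)          # stream 1M rows at a time

# Point 2: build a manageable random sample while streaming the chunks.
sample = pd.concat(chunk.sample(frac=0.01, random_state=42)
                   for chunk in chunks)
print(len(sample))
```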
Big Data Package Ecosystem in Python
Four packages are widely known to handle large datasets in Python. Of those, PySpark and Dask are the most stable options for enterprise-level data processing:
1. Apache Spark (PySpark)
2. Dask
3. Vaex
4. Datatable
Explore Dask on Your Own
In this course, we use Dask, but I encourage you to explore PySpark.
• One of the best features of Dask is that it uses existing Python APIs and data structures, so it is easy to switch from NumPy, pandas, and scikit-learn to their Dask-powered equivalents, as the sketch below illustrates.
• At the same time, you can also run it on compute clusters such as those powered by the Hadoop framework.
• Learn all about Dask here
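A minimal sketch of that point, assuming a CSV split across files; the path pattern and column names are placeholders:

```python
import dask.dataframe as dd

# Looks like pandas, but the CSV is read lazily in partitions.
df = dd.read_csv("big_file-*.csv")             # placeholder path pattern
result = df.groupby("region")["amount"].sum()  # builds a task graph, no work yet
print(result.compute())                        # .compute() triggers execution
```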
What is Functional Programming?
“In computer science, functional programming is a programming paradigm
where programs are constructed by applying and composing functions. It is a
declarative programming paradigm in which function definitions are trees of
expressions that map values to other values, rather than a sequence of
imperative statements which update the running state of the program.” -- Wikipedia
A pure function is a function whose output value follows solely from its
input values and cannot be affected by any mutable state or other side
effects. In functional programming, a program consists entirely of evaluation
of pure functions.
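A small Python illustration of the definition above: the first function is pure, the second is not, because its output depends on mutable external state.

```python
# Pure: output depends only on the inputs; no side effects.
def add(a, b):
    return a + b

# Impure: reads and mutates state outside the function.
counter = 0
def add_and_count(a, b):
    global counter
    counter += 1             # side effect: mutates external state
    return a + b + counter   # output depends on how often it was called

print(add(2, 3), add(2, 3))                      # 5 5  - always the same
print(add_and_count(2, 3), add_and_count(2, 3))  # 6 7  - differs between calls
```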
What is Functional Programming
• A programming style based on mathematical functions
• Follows the declarative programming model
• Emphasizes the “what” of the solution instead of “how to” get to the solution
• Uses expressions instead of statements
• The LISt Processing language, known as LISP, was the first functional programming language, starting in the 1950s.
• Haskell and Scala are the most recent representatives of this family of programming languages. Apache Spark is written mainly in Scala.
• Other languages (e.g., Python, R, Java) also provide rudimentary support for functional programming.
[Figure: the same task written in a functional programming style in Scala and a procedural programming style in Python]
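Since the original code screenshots did not survive extraction, here is a hedged reconstruction of the same contrast, using Python for both styles:

```python
numbers = [1, 2, 3, 4]

# Procedural style: explicit loop and a mutable accumulator.
squares = []
for n in numbers:
    squares.append(n * n)

# Functional style: declare the result as a mapping, no mutation.
squares_fp = list(map(lambda n: n * n, numbers))

assert squares == squares_fp == [1, 4, 9, 16]
```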
Traditional vs. Functional Program in JavaScript
Source: Wikipedia
Advantages of Functional Programming
● Elegant code: code is elegant and concise because of higher-order function abstractions.
● High level: you're describing the result you want rather than explicitly specifying the steps required to get there.
● Transparent: the behavior of a pure function depends only on its inputs and outputs, without intermediary values. That eliminates the possibility of side effects, which facilitates debugging and reduces the introduction of bugs.
● Parallelizable: turning FP code into parallel code requires no changes to the function definitions, unlike traditional procedural code.
● Programs are deterministic.
Disadvantages of Functional Programming
● Potential performance losses because of the amount of garbage collection that needs to happen when we end up creating new variables, as we can't mutate existing ones.
● File I/O is difficult because it typically requires interaction with state.
● Programmers who are used to imperative programming can find this paradigm harder to grasp.
Advantages of Functional Programming: Parallelization
[Figure: input split across compute node 1 and compute node 2, the same function applied to each partition to produce the output; increase compute nodes as input size increases]
Lambda Functions
• Data types such as numbers, strings, booleans, etc. don't need to be bound to a variable. The same can be done for functions!
• In computer programming, an anonymous function (function literal, lambda abstraction, lambda function, lambda expression or block) is a function definition that is not bound to an identifier.
• Anonymous functions are often arguments being passed to higher-order functions or used for constructing the result of a higher-order function that needs to return a function. Anonymous functions are ubiquitous in functional programming languages and other languages with first-class functions, where they fulfil the same role for the function type as literals do for other data types.
Source: Wikipedia
Lambda Functions in Scala vs. Python
Functional Programming and Big Data Processing
• Functional programming lends itself to Big Data processing because of its ease of parallelization.
• For instance, Spark parallelizes computations using the lambda calculus.
• All functional Spark programs are inherently parallelizable, which means that when your input data grows from 1 MB to 1 PB during analysis, all you have to do is add more compute resources; there is no need to change the code, as sketched below.
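A hedged PySpark sketch of that point: the lambdas describe what to compute, and the same code runs unchanged whether the SparkContext is backed by a laptop or a cluster.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("fp-demo").getOrCreate()
sc = spark.sparkContext

# The lambdas below describe *what* to compute; Spark decides *where*.
rdd = sc.parallelize(range(1, 1_000_001))
total = rdd.map(lambda x: x * x).reduce(lambda a, b: a + b)
print(total)  # sum of squares; the same code scales out with more nodes

spark.stop()
```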
Functional Programming in Python
• Functions in Python are first-class citizens. That means functions have the same characteristics as values like strings and numbers.
• Functions have abilities which are crucial for functional programming:
• They can take another function as an argument
• They can return functions as values
• They can be stored in variables just like other data types
• Anonymous functions are easy to define with lambda
• Therefore, Python provides good support for functional programming (see the example below)
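A minimal demonstration of the abilities listed above:

```python
def shout(text):
    return text.upper() + "!"

# Stored in a variable like any other value.
f = shout

# Passed as an argument to another function.
def apply_twice(func, value):
    return func(func(value))

# Returned from a function.
def make_adder(n):
    def adder(x):
        return x + n
    return adder

print(f("hi"))                    # HI!
print(apply_twice(shout, "hi"))   # HI!!
print(make_adder(10)(5))          # 15
```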
Data Structures for Functional Programming in Python
• Mutable data structures such as Dictionaries and Lists are not ideal
for functional programs because they can be changed while the
program is running
• Instead, immutable data structures are better where you are forced
to make a copy of the object before you change it
• In Python, “namedtuple” and “tuple” can be used instead of lists and dictionaries (see the sketch below)
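A hedged sketch: a namedtuple gives the readability of a dictionary with the immutability of a tuple, so "changing" a field means making a copy.

```python
from collections import namedtuple

Customer = namedtuple("Customer", ["name", "balance"])
alice = Customer(name="Alice", balance=100)

# alice.balance = 200  # would raise AttributeError: can't set attribute

# Instead, create a modified copy; the original stays untouched.
richer_alice = alice._replace(balance=200)
print(alice, richer_alice)
```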
Why Does Data Immutability Matter in FP
● In pure functional languages, all data is immutable and the program state cannot change.
● What are the implications of this property?
○ Functions are deterministic: the same input will always yield the same output. This makes it easier to re-use functions elsewhere.
○ The order of execution of multiple functions does not affect the final outcome of the program.
○ Programs don't contain any side effects.
● We will see that in Apache Spark, all data structures are immutable; you have to make a copy or perform some action/transformation to change them.
Defining Lambda Functions in Python
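The slide's original code image did not survive extraction; a hedged reconstruction of typical lambda definitions in Python:

```python
# A lambda is an anonymous function: parameters before the colon, one expression after.
square = lambda x: x * x   # equivalent to: def square(x): return x * x
add = lambda a, b: a + b   # multiple parameters

print(square(4))  # 16
print(add(2, 3))  # 5
print((lambda s: s.title())("big data"))  # used inline, never named: 'Big Data'
```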
How Lambda Functions Fit Together with Python Function Definition and Lambda Calculus
Higher Order Functions
A function is a higher-order function if it takes one or more functions as parameters, returns a function, or both.
Higher Order Functions in Action
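The original example image is missing; a hedged reconstruction of a higher-order function in action (compose is our own helper, not a built-in):

```python
# compose is higher-order: it takes two functions and returns a new one.
def compose(f, g):
    return lambda x: f(g(x))

double = lambda x: 2 * x
increment = lambda x: x + 1

double_then_increment = compose(increment, double)
print(double_then_increment(10))  # increment(double(10)) = 21
```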
Common Higher Order Functions
(Similar idea to the map and reduce functions in the MapReduce programming model.)
• Map is a higher-order function with the following specification:
• Inputs: a function f and a list of elements L
• Outputs: a new list of elements, with f applied to each of the elements of L
• Reduce reduces a list of elements to one element, using a binary function to successively combine the elements.
• Inputs: a function f, a list of elements L, and an accumulator acc, the parameter that collects the return value. You can think of acc as the initial value.
• Outputs: the value of f sequentially applied and tracked in acc
• Filter keeps only the elements of a list for which a predicate function returns true.
Map Example
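The original example image is missing; a hedged Python reconstruction matching the Map specification above, with filter shown for comparison:

```python
numbers = [1, 2, 3, 4]

# map: f applied to each element of L, producing a new list.
squared = list(map(lambda x: x * x, numbers))
print(squared)  # [1, 4, 9, 16]

# filter, for comparison: keep the elements where the predicate is true.
evens = list(filter(lambda x: x % 2 == 0, numbers))
print(evens)    # [2, 4]
```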
Reduce Example
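Likewise, a hedged reconstruction using Python's functools.reduce, with an explicit initial accumulator as described above:

```python
from functools import reduce

numbers = [1, 2, 3, 4]

# f combines the accumulator with the next element; 0 is the initial acc.
total = reduce(lambda acc, x: acc + x, numbers, 0)
print(total)  # ((((0 + 1) + 2) + 3) + 4) = 10
```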
Further Reading in Functional Programming
1. Wikipedia has good content on the topic
2. The Lambda calculus background is also interesting to read
3. These slides provide good introductory information on FP
Exercises
• Practice with Python basics, NumPy, and pandas.
Exercises
• Identify each of the variables in the Excel file Credit Approval Decisions as categorical, ordinal, interval, or ratio, and explain why.