Big Data Analytics - Seidenberg School of Computer Science and Information Systems

DPS2017 Candidates @PACE – Fall 2015
Team 3
Amir Ataee
Marvin Billings
Sharice Cannady
LLiver José
Egal Sanchez
Professor Chuck Tappert, Ph.D.
DCS860A
EMERGING INFORMATION TECHNOLOGIES 1
Brief overview of Big Data Analytics with an emphasis on enabling algorithms
TABLE OF CONTENTS
• Big Data Overview (Egal)
  • Data Structures
  • Perspectives on Data Repository
• State of the Practice in Analytics (Marvin)
  • Current Analytical Architecture
  • Data Sources
  • Data Warehouse (DW)
  • EDW (Enterprise DW)
  • Data Science Users
  • Drivers of Big Data
• Example of Big Data Analytics - IoT (Sharice)
• Data Analytics Lifecycle (Amir)
  • Phase 1: Discovery
  • Phase 2: Data Preparation
  • Phase 3: Model Planning
  • Phase 4: Model Building
  • Phase 5: Communicate Results
  • Phase 6: Operationalize
• Algorithms (LLiver)
  • Clustering
    • K-means
  • Association Rules
    • Apriori Algorithm
  • Regression
    • Linear Regression
    • Logistic Regression
  • Classification
    • Decision Trees
    • Naïve Bayes
  • Time Series Analysis
  • Text Analysis
EGAL’S SECTION
Big Data Definition:
Big Data can be defined as all data that does not fit a traditional Relational Database Management System (RDBMS), whether it is used for online transaction processing (OLTP) or for analytic purposes. Big Data is associated not only with the size of the data but, most importantly, with its format. There are, however, ecosystems such as Hadoop that can utilize both structured and unstructured data.
The collection of data and the size of the data have gone through an evolutionary process covering four stages:
 Engineers hard-wired computers to extract information (the introduction of the telegraph)
 Data input by computer operators (corporations and organizations logging corporate data)
 The introduction of the web (Web 1.0 and Web 2.0)
 The introduction of sensors (the Internet of Things)
The explosion of data collection has been exponential, growing from KB to TB.
In computer science, a data structure is a methodical way of organizing data in a computer so that it can be used efficiently.
 Structured databases use B-tree indexes for retrieving small percentages of data
 Compilers and databases use dynamic hash tables as look-up tables (see the sketch below)
Usually, efficient data structures are very important to designing efficient algorithms. Some formal design methods and programming languages emphasize data structures, rather than algorithms, as the key organizing factor in software design.
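To make the look-up-table idea concrete, here is a minimal sketch using Python's dict, which is itself a dynamic hash table; the column names and metadata are invented for illustration.

    # A compiler- or database-style symbol look-up table built on a hash table.
    # Python's dict hashes each key, giving O(1) average insert and look-up.
    symbol_table = {}

    # Insert: the key is hashed and the value stored in the matching bucket
    symbol_table["customer_id"] = {"type": "INTEGER", "indexed": True}
    symbol_table["customer_name"] = {"type": "VARCHAR(100)", "indexed": False}

    # Look-up: one hash computation instead of scanning every entry
    print(symbol_table["customer_id"])   # O(1) on average
    print("order_id" in symbol_table)    # membership test, also O(1)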
Traditional data repository:
 Relational Database Management System (RDBMS)
 Popular RDBMSs: Microsoft SQL Server, Oracle SQL, and MySQL
 Enterprise Data Warehouse (EDW)
 Structured data
 Data warehouse
 Used in traditional EDW BI/analytics

Big Data repository:
 Unstructured data
 Hadoop Big Data repository
 Big Data applications
 Popular NoSQL technologies: Cassandra, CouchDB, and MongoDB

Some Big Data ecosystems, like Hadoop, can leverage both traditional data repositories (RDBMS) and Big Data NoSQL repositories.
MARVIN’S SECTION
STATE OF THE PRACTICE IN ANALYTICS - WHAT IS DRIVING IT?
❖ [3] 2.5 quintillion bytes of data are created daily.
❖ [5] The data generated every 2 days equals all data from the beginning of time until 2003.
❖ 90% of the data stored in the world was created in the last 2 years.
❖ All data in storage in the world doubles every 1.2 years.
❖ Data will grow from 3.2 zettabytes today to 40 zettabytes in 2020.
❖ 570 websites are created every minute.
❖ Google processes over 40,000 searches per second.
❖ Every minute, Facebook processes 1.8 million likes and 200,000 photo uploads.
❖ YouTube receives 100 hours of video a minute.
❖ The NSA monitors 60 petabytes daily (1.8% of internet traffic).
❖ There are 1.8 billion smartphones for the 7.1 billion people on Earth.
❖ 25% of the people on this planet have a smartphone full of sensors.
❖ Typical smartphone sensors: motion, air temperature, light, proximity, humidity, location, etc.
STATE OF THE PRACTICE IN ANALYTICS - WHERE IS THE DATA COMING FROM?
❖ Mobile device users
❖ Sensors
❖ Archives
❖ Social networks
❖ Internet of Things
❖ Enterprise applications
❖ Cameras
❖ Software logs
❖ Health data - Fitbit
❖ Databases
❖ Etc.
STATE OF THE PRACTICE IN ANALYTICS -
CURRENT GENERAL ARCHITECTURE
[Figure: current general big data architecture. Mysore, D., Khupat, S., & Jain, S. [Photograph]. Retrieved from http://www.ibm.com/developerworks/library/bd-archpatterns1/fig1.png]
SHARICE’S SECTION
BIG DATA & THE INTERNET OF THINGS
• Data management software such as:
   NoSQL databases
   Hadoop
• These tools were initially used to analyze website traffic and social media data.
• It turns out that NoSQL and Hadoop are also capable of analyzing data streams coming from sensors and controllers.
BIG DATA & THE INTERNET OF THINGS
• Today, sensors and controllers are streaming data from:
  1. Jet engines
  2. Mobile devices
  3. Health monitors
  4. Items in shipment
• On the right: a simplified view of key Internet of Things components.
BIG DATA & THE INTERNET OF THINGS
• The "Killer Use Cases" for Big Data
• Sensors and intelligent controllers increasingly provide data that is critical to running the business.
• A Big Data use case could be driven by the need to analyze data from the Internet of Things.
AMIR’S SECTION
DATA ANALYTICS LIFE CYCLE
 Big Data is defined as "extremely large data sets that have grown beyond the ability to manage and analyze them with traditional data processing tools."
 Dealing with big data involves several problems, such as the acquisition, storage, searching, sharing, analysis, and visualization of data.
 To overcome these issues, we need a process that facilitates the analysis of Big Data.
 For this purpose, the data analytics life cycle process was designed.
DATA ANALYTICS LIFE CYCLE - CONTINUED
 This life cycle has six phases, but they do not have to occur in serial order; at any time, one or more phases can happen simultaneously.
 Most of these phases can move either forward or backward, depending on what new information becomes available.
[Figure: circular diagram of the life cycle - 1. Discovery, 2. Data Prep, 3. Model Planning, 4. Model Building, 5. Communicate Results, 6. Operationalize.]
PHASE 1: DISCOVERY
 The team learns the business domain.
 The team assesses the resources available to support the project.
 The team formulates initial hypotheses (IHs) to test and begins learning the data.
PHASE 2: DATA PREPARATION
 Data preparation requires the presence of an analytic sandbox in which the team can work with data.
 The team needs to execute extract, load, and transform (ELT) processes (a minimal sketch follows below).
 The team needs to familiarize itself with the data thoroughly and take steps to condition the data.
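As an illustration of the ELT step, here is a minimal sketch using SQLite as a stand-in analytic sandbox; the file name, table names, and columns are all hypothetical.

    # Minimal ELT sketch: extract raw records, load them unchanged into the
    # sandbox, then transform (condition) them inside the sandbox.
    import csv
    import sqlite3

    conn = sqlite3.connect("sandbox.db")   # the analytic sandbox
    conn.execute("CREATE TABLE IF NOT EXISTS raw_sales (region TEXT, amount TEXT)")

    # Extract + Load: copy the source file into the sandbox as-is
    with open("sales.csv", newline="") as f:
        rows = [(r["region"], r["amount"]) for r in csv.DictReader(f)]
    conn.executemany("INSERT INTO raw_sales VALUES (?, ?)", rows)

    # Transform: cast types and drop empty values after loading
    conn.execute("""
        CREATE TABLE clean_sales AS
        SELECT region, CAST(amount AS REAL) AS amount
        FROM raw_sales
        WHERE amount IS NOT NULL AND amount <> ''
    """)
    conn.commit()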
PHASE 3: MODEL PLANNING
 The team determines the methods, techniques, and workflow it intends to follow for the subsequent model building phase.
 The team explores the data to learn about the relationships between variables and subsequently selects key variables and the most suitable models.
PHASE 4: MODEL BUILDING
 The team develops datasets for testing and production purposes.
 The team builds and executes models based on the work done in the model planning phase.
PHASE 5: COMMUNICATE RESULTS
 The team, in collaboration with major stakeholders, determines whether the results of the project are a success or a failure.
 The team should identify key findings, quantify the business value, and develop a narrative to summarize and convey findings to stakeholders.
PHASE 6: OPERATIONALIZE
 The team delivers final reports, briefings, code, and technical documents.
 The team may run a pilot project to implement the models in a production environment.
LLIVER’S SECTION
Pattern Classification (2nd ed) by R. O. Duda, P. E. Hart and
D. G. Stork, John Wiley & Sons, 2000
WHAT IS AN ALGORITHM?
• In mathematics and computer science, an algorithm is a self-contained, step-by-step set of operations to be performed. Algorithms exist that perform calculation, data processing, automated reasoning, etc.
• The concept of an algorithm has existed for centuries; however, a partial formalization of what would become the modern algorithm began with attempts to solve the Entscheidungsproblem (the "decision problem") posed by David Hilbert in 1928. Giving a formal definition of algorithms, corresponding to the intuitive notion, remains a challenging problem. [3]
• Algorithms normally behave as they are designed, performing a number of tasks. But when left unsupervised, they can and will do strange things. As we put more and more of our world under the control of algorithms, we can lose track of who, or what, is pulling the strings.
• Algorithms entered the general public newscast through the Flash Crash, but they did not leave. They soon showed up in stories about dating, shopping, entertainment, medicine, everything imaginable. Algorithms are taking over everything. [12]
WHAT IS AN ALGORITHM? (CONT.)
• The bounds of algorithms get pushed every day. They have displaced humans in a growing number of industries, something they often do well. They are faster than us, and when things work as they should, they make far fewer mistakes than we do. But, as algorithms acquire power and independence, there can be unexpected consequences. They observe, experiment, and learn, all independently of their human creators.
• Using advanced computer science techniques such as machine learning and neural networks, algorithms can even create new and improved algorithms based on observed results. Algorithms have already written symphonies, picked through legalese, diagnosed patients, written news articles, flown airplanes, and driven vehicles on urban highways with far better control than humans. [12]
• It is no coincidence that the most upwardly mobile people in society right now are those who can manipulate code to create algorithms that can sprint through oceans of data.
• In the history of thought, before the discovery of calculus, mathematics had been a discipline of great interest; afterward, it became a discipline of great power. Not until the advent of the computer algorithm in the twentieth century did a mathematical idea of comparable influence appear. The calculus and the algorithm are the two leading ideas of Western science.
ALGORITHM: DISCUSSION
• What will become of our duties as humans, our employment, etc.?
• Since algorithms are here to stay, how can we be more in control, build trust, etc.?
CLASSIFICATION OF ALGORITHMS
 By implementation means
• Recursion or iteration
• Logical
• Serial, parallel or distributed
• Deterministic or non-deterministic
• Exact or approximate
• Quantum
 Optimization problems
• Linear programming
• Dynamic programming
• The greedy method
• The heuristic method
 By design paradigm
• Brute-force or exhaustive search
• Divide and conquer
• Search and enumeration
• Randomized
• Reduction of complexity
 By complexity
 By field of study
MAJOR BIG DATA ANALYTICS ALGORITHMS:
CLUSTERING

Clustering
 Clustering is the use of unsupervised techniques for grouping similar objects. In machine learning, unsupervised refers to the problem of finding hidden structure within unlabeled data. Clustering techniques are unsupervised in the sense that the data scientist does not determine, in advance, the labels to apply to the clusters. For large data sets, clustering is computationally expensive.
 Clustering is primarily an exploratory technique to discover hidden structures in the data, possibly as a prelude to more focused analysis or decision processes.
 K-means is a simple and straightforward method for defining clusters. Once clusters and their associated centroids are identified, it is easy to assign new objects to a cluster based on the object's distance from the closest centroid.

Use Cases
 Image processing: Video is one example of the growing volumes of unstructured data being collected. Within each frame of a video, k-means analysis can be used to identify objects in the video.
 Medical: Patient attributes such as age, height, weight, systolic and diastolic blood pressures, cholesterol level, and other attributes can identify naturally occurring clusters. These clusters could be used to target individuals for specific preventive measures or clinical trial participation. Clustering, in general, is useful in biology for the classification of plants and animals, as well as in the field of human genetics.
 Customer segmentation: Marketing and sales groups use k-means to better identify customers who have similar behaviors and spending patterns.
Major Big Data Analytics Algorithms: Clustering (Cont.)
If n is the known number of patterns and c the desired number of clusters, the k-means algorithm is:

    initialize n, c, μ1, μ2, …, μc (randomly selected)
    do   classify n samples according to nearest μi
         recompute μi
    until no change in μi
    return μ1, μ2, …, μc

The question is how to evaluate whether the samples in one cluster are more similar among themselves than samples in other clusters.
• Two issues:
  • How to measure the similarity between samples?
  • How to evaluate a partitioning of a set into clusters?
• The most obvious measure of the similarity between two samples is the distance between them, i.e., we define a metric.
• Once this measure is defined, one would expect the distance between samples of the same cluster to be significantly less than the distance between samples in different classes.
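The pseudocode above maps almost line-for-line onto a short implementation. Below is a minimal NumPy sketch, assuming Euclidean distance as the metric and randomly selected samples as the initial centroids (common defaults, not mandated by the slide).

    import numpy as np

    def kmeans(samples, c, seed=0):
        rng = np.random.default_rng(seed)
        # initialize mu_1 ... mu_c as c randomly selected samples
        centroids = samples[rng.choice(len(samples), size=c, replace=False)]
        while True:
            # classify each sample by its nearest centroid (Euclidean distance)
            dists = np.linalg.norm(samples[:, None, :] - centroids[None, :, :], axis=2)
            labels = dists.argmin(axis=1)
            # recompute each mu_i as the mean of the samples assigned to it
            new_centroids = np.array([
                samples[labels == i].mean(axis=0) if np.any(labels == i) else centroids[i]
                for i in range(c)
            ])
            if np.allclose(new_centroids, centroids):   # no change in mu_i
                return centroids, labels
            centroids = new_centroids

    # Two obvious groups in 2-D
    data = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1],
                     [8.0, 8.0], [8.2, 7.9], [7.9, 8.1]])
    centers, labels = kmeans(data, c=2)
    print(centers, labels)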
MAJOR BIG DATA ANALYTICS ALGORITHMS:
ASSOCIATION RULES (ARs)
This is a descriptive, not predictive, method often used to
discover interesting relationships hidden in a large dataset.
The relationship occurs too frequently to be random and is
meaningful from a business perspective, which may or may
not be obvious. ARs are commonly used for mining
transactions in databases. Possible questions that ARs can
answer:
• Which products tend to be purchased together?
• Of those customers who are similar to this person, what
products do they tend to buy?
• Of those customers who have purchased this product,
what other similar products do they tend to view or
purchase?
Apriori [8] is one of the
earliest and the most
fundamental algorithms
for generating
association rules.
Applications of Association Rules
Market basket analysis refers to a specific implementation of
association rules mining that many companies use for a
variety of purposes, including these:
• Broad-scale approaches to better merchandising
• Cross-merchandising between products and high-margin
or high-ticket items
• Physical or logical placement of product within related
categories of products
• Promotional programs of multiple product purchase
incentives managed through a loyalty card program
ARs are commonly used for recommender systems [9] and clickstream analysis. Many online service providers, such as Amazon and Netflix, use recommender systems, which can discover related products or identify customers who have similar interests. These observations provide valuable insight into how to better personalize and recommend content to site visitors.
The framework has expanded to web contexts, such as
mining path traversal patterns and usage patterns [7] to
facilitate organization of web pages.
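To make the Apriori idea concrete, here is a minimal frequent-itemset sketch in Python over invented market baskets; the classic algorithm also prunes candidates whose (k-1)-subsets are infrequent, which is omitted here for brevity.

    # Minimal Apriori sketch: frequent 1-itemsets seed the search, candidate
    # k-itemsets are unions of frequent (k-1)-itemsets, and each level is
    # pruned by minimum support.
    def apriori(transactions, min_support):
        transactions = [frozenset(t) for t in transactions]
        n = len(transactions)

        def support(itemset):
            return sum(itemset <= t for t in transactions) / n

        items = {i for t in transactions for i in t}
        levels = [{frozenset([i]) for i in items if support(frozenset([i])) >= min_support}]

        k = 2
        while levels[-1]:
            candidates = {a | b for a in levels[-1] for b in levels[-1] if len(a | b) == k}
            levels.append({c for c in candidates if support(c) >= min_support})
            k += 1
        return [set(s) for level in levels for s in level]

    # Hypothetical market baskets
    baskets = [{"bread", "milk"}, {"bread", "diapers", "beer"},
               {"milk", "diapers", "beer"}, {"bread", "milk", "diapers", "beer"}]
    print(apriori(baskets, min_support=0.5))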
MAJOR BIG DATA ANALYTICS ALGORITHMS:
REGRESSION: LINEAR REGRESSION

Linear Regression
Regression analysis attempts to explain the influence that a set of variables has on the outcome of another variable of interest. Often, the outcome variable is called a dependent variable because the outcome depends on the other variables. These additional variables are sometimes called the input variables or the independent variables. Regression can address questions such as:
• What is a person's expected income?
• What is the probability that an applicant will default on a loan?

Linear regression is an analytical technique used to model the relationship between several input variables and a continuous outcome variable. Linear regression models are useful in physical and social science applications where there may be considerable variation in a particular outcome based on a given set of input values.

Use Cases
Linear regression is often used in business, government, and other scenarios. Some common practical applications:
• Real estate: A simple linear regression analysis can be used to model residential home prices as a function of the home's living area. The model could be further improved by including other input variables such as number of bathrooms, number of bedrooms, lot size, school district rankings, crime statistics, and property taxes.
• Demand forecasting: Businesses and governments can use linear regression models to predict demand for goods and services. Similar models can be built to predict retail sales, emergency room visits, and ambulance dispatches.
• Medical: A linear regression model can be used to analyze the effect of a proposed radiation treatment on reducing tumor sizes.
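A minimal sketch of simple linear regression via ordinary least squares, using NumPy; the living-area and price figures are invented purely for illustration.

    import numpy as np

    living_area = np.array([1100, 1500, 1800, 2400, 3000], dtype=float)  # sq ft
    price = np.array([199, 245, 289, 355, 425], dtype=float)             # $ thousands

    # Design matrix with an intercept column; solve argmin ||X b - y||^2
    X = np.column_stack([np.ones_like(living_area), living_area])
    coef, *_ = np.linalg.lstsq(X, price, rcond=None)
    intercept, slope = coef

    print(f"price ~ {intercept:.1f} + {slope:.3f} * living_area")
    print("predicted price for 2000 sq ft:", intercept + slope * 2000)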
MAJOR BIG DATA ANALYTICS ALGORITHMS:
REGRESSION: LOGISTIC REGRESSION

Logistic Regression
When the outcome variable is categorical, logistic regression is a better choice. Both models assume a linear additive function of the input variables. If such an assumption does not hold true, both regression techniques perform poorly.

Use Cases
The logistic regression model is applied to a variety of situations in both the public and the private sector.
• Medical: Develop a model to determine the likelihood of a patient's successful response to a specific medical treatment or procedure.
• Finance: Using a loan applicant's credit history and the details on the loan, determine the probability that an applicant will default on the loan. Based on the prediction, the loan can be approved or denied, or the terms can be modified.
• Marketing: Determine a wireless customer's probability of switching carriers (known as churning) based on age, number of family members on the plan, months remaining on the existing contract, and social network contacts.
• Engineering: Based on operating conditions and various diagnostic measurements, determine the probability of a mechanical part experiencing a malfunction or failure. With this probability estimate, schedule the appropriate preventive maintenance activity.
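A minimal logistic-regression sketch trained by gradient descent on the log loss; the churn-style data (months remaining on contract vs. whether the customer churned) is invented.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    # Feature: months remaining on contract; label: 1 = churned, 0 = stayed
    months_left = np.array([1, 2, 3, 5, 8, 12, 18, 24], dtype=float)
    churned = np.array([1, 1, 1, 1, 0, 0, 0, 0], dtype=float)

    X = np.column_stack([np.ones_like(months_left), months_left])
    w = np.zeros(2)
    for _ in range(20000):
        p = sigmoid(X @ w)                              # predicted churn probability
        w -= 0.01 * X.T @ (p - churned) / len(churned)  # gradient step on log loss

    print("P(churn | 4 months left) =", sigmoid(np.array([1.0, 4.0]) @ w))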
MAJOR BIG DATA ANALYTICS ALGORITHMS:
CLASSIFICATION
Classification
Classification is another fundamental learning
method that appears in applications related
to data mining. The primary task performed by
classifiers is to assign class labels to new
observations. The set of labels for classifiers is
predetermined, unlike in clustering, which
discovers the structure without a training set
and allows the data scientist optionally to
create and assign labels to the clusters.
Most classification methods are supervised, in
that they start with a training set of pre-labeled
observations to learn how likely the attributes
of these observations may contribute to the
classification of future unlabeled observations.
Classification is widely used for prediction
purposes.
Two fundamental methods: Decision Trees and Naïve Bayes.
 Decision Trees
A decision tree (also called prediction tree) uses a tree
structure to specify sequences of decisions and
consequences.
• Classification trees usually apply to output variables that are categorical, often binary, in nature, such as yes or no.
• Regression trees, on the other hand, can apply to output
variables that are numeric or continuous, such as the
predicted price of a consumer good or the likelihood a
subscription will be purchased.
 Naïve Bayes
Naïve Bayes is a probabilistic classification method based on
Bayes' theorem (or Bayes' law) with a few tweaks. Bayes'
theorem gives the relationship between the probabilities of
two events and their conditional probabilities.
Because Bayes classifiers are easy to implement and can
execute efficiently even without prior knowledge of the
data, they are among the most popular algorithms for
classifying text documents. E-mail spam filtering is a classic
use case of Bayes text classification. Naïve Bayes classifiers
can also be used for fraud detection [11].
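Since the slide calls out e-mail spam filtering as the classic Naïve Bayes use case, here is a minimal multinomial Naïve Bayes sketch with Laplace smoothing; the toy messages are invented, and a real filter would train on far more data.

    from collections import Counter
    import math

    train = [("win cash prize now", "spam"), ("cheap prize win win", "spam"),
             ("meeting agenda attached", "ham"), ("lunch later today", "ham")]

    # Count words per class and class priors
    word_counts = {"spam": Counter(), "ham": Counter()}
    class_counts = Counter()
    for text, label in train:
        word_counts[label].update(text.split())
        class_counts[label] += 1

    vocab = {w for c in word_counts.values() for w in c}

    def predict(text):
        scores = {}
        for label in word_counts:
            # log prior + sum of smoothed log likelihoods (Bayes' theorem)
            score = math.log(class_counts[label] / sum(class_counts.values()))
            total = sum(word_counts[label].values())
            for w in text.split():
                score += math.log((word_counts[label][w] + 1) / (total + len(vocab)))
            scores[label] = score
        return max(scores, key=scores.get)

    print(predict("win a cash prize"))   # expected: spam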
MAJOR BIG DATA ANALYTICS ALGORITHMS:
TIME SERIES ANALYSIS
Time series analysis attempts to model the underlying structure of observations taken over time. It
has many applications in finance, economics, biology, engineering, retail, and manufacturing.
Following are the goals of time series analysis:
• Identify and model the structure of the time series.
• Forecast future values in the time series.
Use cases:
• Retail sales: For various product lines, a clothing retailer is looking to forecast future monthly
sales.
• Spare parts planning: Companies' service organizations have to forecast future spare part
demands to ensure an adequate supply of parts to repair customer products.
• Stock trading: Some high-frequency stock traders utilize a technique called pairs trading.
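As one concrete way to "identify and model the structure of the time series," the sketch below fits a simple AR(1) model by least squares and forecasts the next value; the monthly sales figures are invented, and real forecasting would typically use richer models such as ARIMA.

    # Fit y_t = a + b * y_(t-1) by regressing each observation on its predecessor.
    import numpy as np

    sales = np.array([112, 118, 132, 129, 121, 135, 148, 148, 136, 119], dtype=float)

    X = np.column_stack([np.ones(len(sales) - 1), sales[:-1]])
    (a, b), *_ = np.linalg.lstsq(X, sales[1:], rcond=None)

    next_month = a + b * sales[-1]
    print(f"AR(1) fit: y_t = {a:.2f} + {b:.3f} * y_(t-1); forecast = {next_month:.1f}")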
MAJOR BIG DATA ANALYTICS ALGORITHMS:
TEXT ANALYSIS
Text analysis, sometimes called text analytics, refers to the representation, processing, and modeling of
textual data to derive useful insights. An important component of text analysis is text mining, the process of
discovering relationships and interesting patterns in large text collections. Text analysis suffers from the curse of
high dimensionality.
A corpus (plural: corpora) is a large collection of texts used for various purposes in Natural Language
Processing (NLP). Table below lists a few example corpora that are commonly used in NLP research.
Corpus                                     Word Count     Domain               Website
Shakespeare                                0.88 million   Written              http://shakespeare.mit.edu/
Brown Corpus                               1 million      Written              http://icame.uib.no/brown/bcm.html
Penn Treebank                              1 million      Newswire             http://www.cis.upenn.edu/~treebank/
Switchboard Phone Conversations            3 million      Spoken               http://catalog.ldc.upenn.edu/LDC97S62
British National                           100 million    Written and spoken   http://www.natcorp.ox.ac.uk/
NA News                                    350 million    Newswire             http://catalog.ldc.upenn.edu/LDC95T21
European Parliament Proceedings Parallel   600 million    Legal                http://www.statmt.org/europarl/
Google N-Grams                             1 trillion     Written              http://catalog.ldc.upenn.edu/LDC2006T13
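As a small taste of text analysis, the sketch below computes TF-IDF weights over a tiny invented corpus; this is one standard way to represent documents numerically before mining patterns, and it also hints at the high dimensionality mentioned above (one dimension per vocabulary word).

    import math
    from collections import Counter

    docs = ["big data analytics at scale",
            "data mining finds patterns in big text collections",
            "text analysis models textual data"]

    tokenized = [d.split() for d in docs]
    df = Counter(w for doc in tokenized for w in set(doc))  # document frequency
    N = len(docs)

    def tfidf(doc):
        tf = Counter(doc)
        # term frequency scaled by inverse document frequency
        return {w: (tf[w] / len(doc)) * math.log(N / df[w]) for w in tf}

    for doc in tokenized:
        top = max(tfidf(doc).items(), key=lambda kv: kv[1])
        print(" ".join(doc), "->", top)   # highest-weighted (most distinctive) word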
REFERENCES (LLIVER)
[1] Axt, P (1959). "On a Sub-Recursive Hierarchy and Primitive Recursive Degrees". Transactions of the American
Mathematical Society 92: 85–105.
[2] Bell, C. Gordon and Newell, Allen (1971), Computer Structures: Readings and Examples, McGraw-Hill Book Company,
New York. ISBN 0-07-004357-4.
[3] Moschovakis, Yiannis N. (2001). "What is an algorithm?" In Engquist, B.; Schmid, W. Mathematics Unlimited — 2001 and
beyond. Springer. pp. 919–936 (Part II). ISBN 9783540669135.
[4] J. MacQueen, "Some Methods for Classification and Analysis of Multivariate Observations," in Proceedings of the Fifth
Berkeley Symposium on Mathematical Statistics and Probability, Berkeley, CA, 1967.
[5] P. Hajek, I. Havel, and M. Chytil, "The GUHA Method of Automatic Hypotheses Determination," Computing, vol. 1, no.
4, pp. 293-308, 1966.
[6] R. Agrawal, T. Imielinski, and A. Swami, "Mining Association Rules Between Sets of Items in Large Databases," SIGMOD
'93 Proceedings of the 1993 ACM SIGMOD International Conference on Management of Data, pp. 207-216, 1993.
[7] R. Cooley, B. Mobasher, and J. Srivastava, "Web Mining: Information and Pattern Discovery on the World Wide Web,"
Proceedings of the 9th IEEE International Conference on Tools with Artificial Intelligence, pp. 558-567, 1997.
[8] R. Agrawal and R. Srikant, "Fast Algorithms for Mining Association Rules in Large Databases," in Proceedings of the 20th
International Conference on Very Large Data Bases, San Francisco, CA, USA, 1994.
[9] W. Lin, S. A. Alvarez, and C. Ruiz, "Efficient Adaptive-Support Association Rule Mining for Recommender Systems," Data
Mining and Knowledge Discovery, vol. 6, no. 1, pp. 83-105, 2002.
[10] W. Lin, S. A. Alvarez, and C. Ruiz, "Collaborative Recommendation via Adaptive Association Rule Mining," in
Proceedings of the International Workshop on Web Mining for E-Commerce (WEBKDD), Boston, MA, 2000.
[11] C. Phua, V. C. S. Lee, S. Kate, and R. W. Gayler, "A Comprehensive Survey of Data Mining-Based Fraud Detection,"
CoRR, vol. abs/1009.6119, 2010.
[12] C. Steiner, “Automate This: How Algorithms Took Over Our Markets, Our Jobs, and the World”, Portfolio / Penguin, NY,
NY, 2013. ISBN 9781591846529.
[13] R. O. Duda, P. E. Hart and D. G. Stork, “Pattern Classification” (2nd ed). John Wiley & Sons, 2000
REFERENCES (MARVIN)
[1]Assunção, M. D., Calheiros, R. N., Bianchi, S., Netto, M. A., & Buyya, R. (2015). Big Data computing and clouds: Trends
and future directions. Journal of Parallel and Distributed Computing, 79, 3-15.
[2][3] Mysore, D., Khupat, S., & Jain, S. (2013, September 17). Big data architecture and patterns, Part 1: Introduction to big data classification and architecture. Retrieved from http://www.ibm.com/developerworks/library/bd-archpatterns1/
[4] Wang, L., Ranjan, R., Kołodziej, J., Zomaya, A., & Alem, L. (2015). Software Tools and Techniques for Big Data Computing in Healthcare Clouds. Future Generation Computer Systems, 43, 38-39.
[5] Marr, B. (2015). Big Data: Using SMART big data, analytics and metrics to make better decisions and improve performance. Wiley.
REFERENCES (EGAL)
 [1] https://en.wikipedia.org/wiki/Data_structure
 [2] http://searchbusinessanalytics.techtarget.com/definition/big-data-analytics
 [3] http://searchstorage.techtarget.com/definition/file-storage
 [4] Zarate Santovena, A. Big Data: Evolution, Components, Challenges and Opportunities. Universidad Iberoamericana.
 [5] http://www.adamadiouf.com/2013/03/22/bigdata-vs-enterprise-data-warehouse/
REFERENCES (SHARICE)
[1] Licht, A., Mantha, N., Nagode, L., & Stackowiak, R. (2015). Big Data and the Internet of Things: Enterprise Information Architecture for a New Age. New York: Apress.
[2] CIT Group. (2015). The Internet of Things. Retrieved December 17, 2015, from https://www.youtube.com/watch?v=ygfBzQ1RKts
REFERENCES (AMIR)
• Ohlhorst, F. (2013). Big Data Analytics: Turning Big Data into Big Money. John Wiley & Sons.
• http://www.brainyquote.com/quotes/authors/l/lao_tzu.html
• EMC Education Services. (2015). Data Science and Big Data Analytics: Discovering, Analyzing, Visualizing and Presenting Data. John Wiley & Sons.
Questions?