
1 - Big Data Analytics - Spark

Talk: 5th September 2018
BIG DATA USING SPARK
By
Dr. Asadi Srinivasulu
Professor, M.Tech(IIIT), Ph.D.
SREE VIDYANIKETHAN ENGINEERING COLLEGE
(AUTONOMOUS)
(AFFILIATED TO JNTUA, ANANTAPUR)
2018-2019
Contents
 Big Data Fundamentals
 Big Data Architecture
 Spark Fundamentals
 Spark Ecosystem
 Spark Transformations
 Spark Actions
 Spark – MLlib
 Classification using Spark
 Clustering using Spark
 Spark Challenges
 IoT: Internet of Things
 Big Data Analytics: deals with the 3 V's of big data (e.g. Facebook)
 Data Mining: extracting meaningful data
 Data Warehouse: collection of Data Marts / OLAP
 Data Mart: subset of a DWH
 Database System: combination of Data + DBMS
 DBMS: collection of software
 Database: collection of inter-related data
 Information: processed data
 Data: raw material / facts / images
Fig: Pre-requisites of Big Data
Fig: Big Data Word Cloud
4
The Myth about Big Data
 Big Data Is New
 Big Data Is Only About Massive Data Volume
 Big Data Means Hadoop
 Big Data Needs a Data Warehouse
 Big Data Means Unstructured Data
 Big Data Is for Social Media & Sentiment Analysis
Big Data is…
Big data is high-volume, high-velocity and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making.
Big data is data which is too large, complex and dynamic for any conventional data tools to capture, store, manage and analyze.
6
Fig: Big Data - Characteristics
7
 Big Data Analytics is the process of examining large
amounts of data of a variety of types (big data) to uncover
hidden patterns, unknown correlations and other useful
information.
 Why do organizations care about Big Data?
 More knowledge leads to better customer engagement,
fraud prevention and new products.
8
 Data Evolution: roughly 10% of data is structured and 90% is unstructured, such as emails, videos, Facebook posts, website clicks, etc.
Fig: The Evolution of BIG DATA
9
 Big data is a collection of data sets which is so large and
complex that it is difficult to handle using DBMS tools.
 Facebook alone generates more than 500 terabytes of data daily, whereas many other organizations like Jet Air and the stock exchange market generate terabytes of data every hour.
 Types of Data:
1. Structured Data: this data is organized in a highly mechanized and manageable way. Ex: tables, transactions, legacy data, etc.
2. Unstructured Data: this data is raw and unorganized; it varies in its content and can change from entry to entry. Ex: videos, images, audio, text data, graph data, social media, etc.
3. Semi-Structured Data: roughly 50% structured and 50% unstructured. Ex: XML databases.
10
 Big Data Matters…….
 Data Growth is huge and all that data is valuable.
 Data won't fit on a single system, which is why it is distributed across machines.
 Distributed data enables faster computation.
 More knowledge leads to better customer engagement, fraud
prevention and new products.
 Big Data Matters for Aggregation, Statistics, Indexing, Searching,
Querying and Discovering Knowledge.
11
Fig: Measuring the Data in Big Data System
 Big Data Sources: Big data is everywhere, and it can help organizations in any industry in many different ways.
 Big data has become too complex and too dynamic to be able to
process, store, analyze and manage with traditional data tools.
 Big Data Sources are:
 ERP Data
 Transactions Data
 Public Data
 Social Media Data
 Sensor Media Data
 Big Data in Marketing
 Big Data in Health & Life Sciences
 Cameras Data
 Mobile Devices
 Machine Sensors
 Microphones
13
Fig: Big Data Sources
14
Structure of Big Data
 Big Data Processing: So-called big data technologies are about discovering patterns (in semi-structured and unstructured data) and about developing big data standards and (open-source) software, commonly driven by companies such as Google, Facebook, Twitter, Yahoo!, etc.
15
Big Data File Formats
 Videos
 Audios
 Images
 Photos
 Logs
 Click Trails
 Text Messages
 E-Mails
 Documents
 Books
 Transactions
 Public Records
 Flat Files
 SQL Files
 DB2 Files
 MYSQL Files
 Tera Data Files
 MS-Access Files
16
Characteristics of Big Data
 There are seven characteristics of Big Data: volume, velocity, variety, veracity, value, validity and visibility.
Earlier it was assessed in megabytes and gigabytes but
now the assessment is made in terabytes.
17
1. Volume: data size, the amount of data, data quantity, or data at rest.
2. Velocity: data speed, speed of change, rapidly changing content, or data in motion.
3. Variety: data types, the range of data types and sources, or data with multiple formats.
4. Veracity: fuzzy or cloudy data, messiness, or whether we can trust the data.
5. Value: Data alone is not enough, how can value be derived from it.
6. Validity: Ensure that the interpreted data is sound.
7. Visibility: Data from diverse sources need to be stitched together.
18
Advantages
 Flexible schema
 Massive scalability
 Cheaper to setup
 Understanding and Targeting Customers
 Understanding and Optimizing Business Process
 Improving Science and Research
 Improving Healthcare and Public Health
 Financial Trading
 Improving Sports Performance
 Improving Security and Law Enforcement
 No declarative query language
 Eventual consistency – higher performance
 Detect risks and check frauds
 Reduce Costs
Fig: Advantages of Big Data
20
Disadvantages
 Big data violates the privacy principle.
 Data can be used for manipulating customers.
 Big data may increase social stratification.
 Big data is not useful in the short run.
 It faces difficulties in parsing and interpreting.
 Big data is difficult to handle and requires more programming.
 Eventual consistency means fewer guarantees.
Big Data Challenges
 Data Complexity
 Data Volume
 Data Velocity
 Data Variety
 Data Veracity
 Capture data
 Curation data
 Performance
 Storage data
 Search data
 Transfer data
 Visualization data
 Data Analysis
 Privacy and Security
22
 Big Data Challenges include analysis, capture, data
curation, search, sharing, storage, transfer, visualization,
and information privacy.
23
Fig: Challenges of Big Data
24
 Research issues in Big Data Analytics
1. Sentiment Analysis in Big Data Hadoop using Mahout
Machine Learning Algorithms
2. Opinion mining Analysis in Big Data Hadoop using Mahout
Machine Learning Algorithms
3. Predictive mining Analysis in Big Data Hadoop using Mahout
Machine Learning Algorithms
4. Post-Clustering Analysis in Big Data Hadoop using Mahout
Machine Learning Algorithms
5. Pre-Clustering Analysis in Big Data Hadoop using Mahout
Machine Learning Algorithms
6. How can we capture and deliver data to the right people in real time?
7. How can we handle a variety of data forms?
8. How can we store and analyze data given its size and our computational capacity?
Big Data Tools
 Big Data Tools are Hadoop, Cloudera, Datameer, Splunk, Mahout, Hive, HBase,
LucidWorks, R, MapR, Ubuntu and Linux flavors.
26
Applications
 Social Networks and Relationships
 Cyber-Physical Models
 Internet of Things (IoT)
 Retail Market
 Retail Banking
 Real Estate
 Fraud detection and prevention
 Telecommunications
 Healthcare and Research
 Automotive and production
 Science and Research
 Trading Analytics
27
Fig: Applications of Big Data Analytics
28
Contents
 Spark Basics
 RDD(Resilient Distributed Dataset)
 Spark Transformations
 Spark Actions
 Spark with Machine Learning
 Hands on Spark
 Research Challenges in Spark
 Apache Spark is a free and open-source cluster computing framework, originally developed at UC Berkeley and written mainly in Scala.
 Apache Spark is a general-purpose cluster in-memory
computing system.
 Provides high-level APIs in Java, Scala and Python and
an optimized engine that supports general execution
graphs.
 Provides various high level tools like Spark SQL for
structured data processing, MLlib for Machine Learning
and more….
39
40
 Spark is a successor of MapReduce.
 Map Reduce is the ‘heart‘ of Hadoop that consists of two
parts – ‘map’ and ‘reduce’.
 Maps and reduces are programs for processing data.
 ‘Map’ processes the data first to give some intermediate
output which is further processed by ‘Reduce’ to generate
the final output.
 Thus, MapReduce allows for distributed processing of the
map and reduction operations.
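A minimal Spark word-count sketch in the same map/reduce spirit, assuming spark-shell's sc and a hypothetical input.txt:

val lines = sc.textFile("input.txt")                   // RDD of lines (hypothetical file)
val words = lines.flatMap(line => line.split(" "))     // "map" side: split lines into words
val pairs = words.map(word => (word, 1))               // emit (word, 1) pairs
val counts = pairs.reduceByKey((a, b) => a + b)        // "reduce" side: sum counts per word
counts.collect().foreach(println)                      // bring the results back to the driver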
41
 Apache Spark is the next-generation Big Data tool: it is considered the future of Big Data and the successor of MapReduce.
 Features of Spark are
– Speed.
– Usability.
– In - Memory Computing.
– Pillar to Sophisticated Analytics.
– Real Time Stream Processing.
– Compatibility with Hadoop & existing Hadoop Data.
– Lazy Evaluation.
– Active, progressive and expanding community.
42
Spark is a Distributed data analytics engine,
generalizing MapReduce.
Spark is a core engine, with streaming, SQL,
Machine Learning and Graph processing modules.
43
Precisely
44
Why Spark
 Spark is faster than MR when it comes to processing the
same amount of data multiple times rather than unloading
and loading new data.
 Spark is simpler and usually much faster than MapReduce
for the usual Machine learning and Data Analytics
applications.
5 Reasons Why Spark Matters to Business
1. Spark enables use cases “traditional” Hadoop can’t handle.
2. Spark is fast
3. Spark can use your existing big data investment
4. Spark speaks SQL
5. Spark is developer-friendly
45
 Spark Ecosystem: The Spark ecosystem is still a work in progress, with some Spark components not even in their beta releases.
Components of Spark Ecosystem
 The components of Spark ecosystem are getting developed and
several contributions are being made every now and then.
 Primarily, Spark Ecosystem comprises the following components:
1) Shark (SQL)
2) Spark Streaming (Streaming)
3) MLLib (Machine Learning)
4) GraphX (Graph Computation)
5) SparkR (R on Spark)
6) BlinkDB (Approximate SQL)
52
53
 Spark's official ecosystem consists of the following major
components.
 Spark DataFrames - Similar to a relational table
 Spark SQL - Execute SQL queries or HiveQL
 Spark Streaming - An extension of the core Spark API
 MLlib - Spark's machine learning library
 GraphX - Spark for graphs and graph-parallel computation
 Spark Core API - provides APIs in R, SQL, Python, Scala and Java
54
The MLlib library has implementations of various common machine learning algorithms (a minimal usage sketch follows after this list):
1. Clustering: K-means
2. Classification: Naïve Bayes, logistic regression, SVM
3. Decomposition: Principal Component Analysis (PCA) and Singular Value Decomposition (SVD)
4. Regression: Linear Regression
5. Collaborative Filtering: Alternating Least Squares for Recommendations
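A minimal sketch of training one of these algorithms (Naïve Bayes) with the RDD-based MLlib API, assuming a spark-shell session where sc is available; the tiny labelled dataset is invented purely for illustration:

import org.apache.spark.mllib.classification.NaiveBayes
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

// toy labelled points: class 0.0 or 1.0, two features each (illustrative values only)
val training = sc.parallelize(Seq(
  LabeledPoint(0.0, Vectors.dense(1.0, 0.0)),
  LabeledPoint(0.0, Vectors.dense(2.0, 0.0)),
  LabeledPoint(1.0, Vectors.dense(0.0, 1.0)),
  LabeledPoint(1.0, Vectors.dense(0.0, 2.0))))

val model = NaiveBayes.train(training, 1.0)      // 1.0 is the additive smoothing parameter
println(model.predict(Vectors.dense(0.0, 3.0)))  // expected to predict class 1.0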
58
• Spark Ecosystem Components:
 Spark Core: Spark Core is the base for parallel and distributed processing of huge datasets.
 Spark SQL: Spark SQL is the module in Apache Spark used to access structured and semi-structured data.
 Spark SQL built-in and user-defined functions: the DataFrame API comes with functions for column manipulation, and user-defined functions can be written in Scala.
 Spark SQL query execution: SQL queries or Hive queries can be executed, and the result is returned as a DataFrame.
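A minimal Spark SQL sketch, assuming a Spark 2.x spark-shell where the SparkSession is available as spark; the view name and sample rows are invented for illustration:

// build a small DataFrame, register it as a SQL view and query it
val df = spark.createDataFrame(Seq(("Anna", 23), ("John", 31))).toDF("name", "age")
df.createOrReplaceTempView("people")                               // temporary SQL view
val adults = spark.sql("SELECT name FROM people WHERE age > 25")   // the result is itself a DataFrame
adults.show()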
59
60
 DataFrame: a DataFrame is similar to a relational table in Spark SQL. It is a distributed collection of tabular data organized into rows and named columns.
 Dataset API: the Dataset is a newer API added to Apache Spark that provides the benefits of RDDs while being strongly typed and declarative in nature.
 Spark Streaming: Spark Streaming is a lightweight API that lets developers build and run streaming data applications.
 Spark MLlib: MLlib stands for machine learning (ML) library. Its goal is to make practical machine learning effective, scalable and easy.
 GraphX: GraphX is a distributed graph processing framework on top of Apache Spark.
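A minimal Spark Streaming sketch, assuming spark-shell's sc and a hypothetical local text socket on port 9999 (for example one started with "nc -lk 9999"):

import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(sc, Seconds(5))       // 5-second micro-batches
val lines = ssc.socketTextStream("localhost", 9999)  // DStream of incoming text lines
val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
counts.print()                                       // print the word counts of each batch
ssc.start()                                          // start receiving and processing
ssc.awaitTermination()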
62
 Language Support in Apache Spark
 Apache Spark ecosystem is built on top of the core
execution engine that has extensible API’s in different
languages.
 A 2016 Spark Survey, in which 62% of Spark users evaluated the Spark languages, reported that 71% were using Scala, 58% were using Python, 31% of the respondents were using Java, and 18% were using the R programming language.
63
1) Scala: Spark framework is built on Scala,
so programming in Scala for Spark can
provide access to some of the latest and
greatest features that might not be available
in other supported Spark programming languages.
2) Python: Python language has excellent
libraries for data analysis like Pandas and
Sci-Kit learn but is comparatively slower
than Scala.
64
3) R Language: R programming language has rich
environment for machine learning and statistical
analysis which helps increase developer
productivity. Data scientists can now use R
language along with Spark through SparkR for
processing data that cannot be handled by a single
machine.
4) Java: Java is verbose and does not support REPL
but is definitely a good choice for developers
coming from a Java + Hadoop background.
65
 What is Scala?: Scala is a general-purpose programming
language, which expresses the programming patterns in a
concise, elegant, and type-safe way.
 It is basically an acronym for “Scalable Language”.
 Scala is an easy-to-learn language and supports both
Object Oriented Programming as well as Functional
Programming.
 It is getting popular among programmers, and is being
increasingly preferred over Java and other programming
languages.
 It seems much in sync with the present and future Big
Data frameworks, like Scalding, Spark, Akka, etc.
66
Why is Spark Programmed in Scala?
 Scala is a pure object-oriented language, in which conceptually
every value is an object and every operation is a method-call. The
language supports advanced component architectures through
classes and traits.
 Scala is also a functional language. It supports functions,
immutable data structures and gives preference to immutability
over mutation.
 Scala can be seamlessly integrated with Java
 It is already being widely used for Big Data platforms and
development of frameworks like Akka, Scalding, Play, etc.
 Being written in Scala, Spark can be embedded in any JVM-based operational system.
68
 Procedure: Spark Installation in Ubuntu
 Apache Spark is a fast and general engine for large-scale data
processing.
 Apache Spark is a fast and general-purpose cluster computing
system. It provides high-level APIs in Java, Scala, Python and R,
and an optimized engine that supports general execution graphs.
 It also supports a rich set of higher-level tools including Spark
SQL for SQL and structured data processing, MLlib for machine
learning, GraphX for graph processing, and Spark Streaming.
Step 1: Installing Java.
java -version
Step 2: Installing Scala and SBT.
sudo apt-get update
sudo apt-get install scala
69
Step 3: Installing Maven plug-in.
 Maven plug-in is used to compile java program for spark. Type
below command to install maven.
sudo apt-get install maven
Step 4: Installing Spark.
 Download “tgz” file of spark by selecting specific version from
below link http://spark.apache.org/downloads.html
 Extract it and remember the path where it is stored.
 Edit the .bashrc file (terminal command: gedit .bashrc) by adding the lines below:
export SPARK_HOME=/path_to_spark_directory
export PATH=$SPARK_HOME/bin:$PATH
 Replace path_to_spark_directory above with the path of your Spark directory.
 Save and close .bashrc, then reload it by typing "source ~/.bashrc" in the terminal.
 If that doesn't work, restart the system. Spark is now installed successfully.
70
 Type spark-shell in terminal to start spark shell.
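A quick sanity check once the shell is up, using the SparkContext that spark-shell pre-builds as sc (a minimal sketch):

val nums = sc.parallelize(1 to 100)   // distribute the numbers 1..100 as an RDD
println(nums.count())                 // 100
println(nums.sum())                   // 5050.0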
Spark Installation on Windows
Step 1: Install Java (JDK)
Download and install java from https://java.com/en/download/
Step 2: Set java environment variable
 Open “control panel” and choose “system & security” and select
“system”.
 Select “Advanced System Settings” located at top right.
 Select “Environmental Variables” from pop-up.
 Next select new under system variables (below), you will get a pop-up.
 In variable name field type JAVA_HOME
 In variable value field provide installation directory of java, say
C:\Program Files\Java\jdk1.8.0_25
 Or you can simply choose the directory by selecting browse directory.
 Now close everything by choosing ok every time.
 Check whether the Java variable is set or not by running javac in the command prompt. If we get the Java version details then we are done.
71
Step 3: Installing SCALA
 Download the scala.msi file from https://www.scala-lang.org/download/
 Set scala environment variable just like java done above.
Variable name = SCALA_HOME
Variable value = path to scala installed directory, say
C:\Program Files (x86)\scala
Step 4: Installing SPARK
 Download and extract Spark from http://spark.apache.org/downloads.html
 You can set SPARK_HOME just like java.
 Note: On Windows, spark-shell can only be run from the bin folder inside the Spark folder.
72
Step 5: Installing SBT. Download and install sbt.msi from http://www.scala-sbt.org/0.13/docs/Installing-sbt-on-Windows.html
Step 6: Installing Maven
 Download Maven from http://maven.apache.org/download.cgi and unzip it to the folder where you want to install Maven.
 Add both M2_HOME and MAVEN_HOME variables in
the Windows environment, and point it to your Maven
folder.
 Update the PATH variable by appending the Maven bin folder (%M2_HOME%\bin), so that you can run Maven commands everywhere.
 Test Maven by running mvn -version in the command prompt.
73
 Practice on Spark Framework with Transformations and
Actions: You can run Spark using its standalone cluster mode,
on EC2, on Hadoop YARN, or on Apache Mesos. Access data
in HDFS, Cassandra, HBase, Hive, Tachyon, and any Hadoop data
source.
 Spark powers a stack of libraries including SQL and
DataFrames, MLlib for machine learning, GraphX, and Spark
Streaming. You can combine these libraries seamlessly in the same
application.
74
 RDD (Resilient Distributed Dataset) is the main logical data unit in Spark. An RDD is a distributed collection of objects. Quoting from the Learning Spark book: "In Spark all work is expressed as creating new RDDs, transforming existing RDDs, or calling operations on RDDs to compute a result."
 Spark performs Transformations and Actions:
75
 Resilient Distributed Datasets overcome the main drawback of Hadoop MapReduce by allowing fault-tolerant 'in-memory' computations.
 RDD in Apache Spark.
 Why is RDD used to process the data?
 What are the major features/characteristics of RDD (Resilient Distributed Datasets)?
 Resilient Distributed Datasets are immutable, partitioned collections of records that can be operated on in parallel.
 RDDs can contain any kind of objects Python, Scala, Java or even
user defined class objects.
 RDDs are usually created by either transformation of existing RDDs
or by loading an external dataset from a stable storage like HDFS or
HBase.
76
Fig: Process of RDD Creation
77
 Operations on RDDs
 i) Transformations: Coarse grained operations like
join, union, filter or map on existing RDDs which
produce a new RDD, with the result of the operation,
are referred to as transformations. All transformations
in Spark are lazy.
 ii) Actions: Operations like count, first and reduce
which return values after computations on existing
RDDs are referred to as Actions.
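A minimal sketch of this split between lazy transformations and actions, assuming spark-shell's sc:

// transformations only build the RDD lineage; nothing runs until an action is called
val nums = sc.parallelize(1 to 10)
val doubled = nums.map(_ * 2)           // transformation: no job runs yet
val odds = doubled.filter(_ % 4 != 0)   // another transformation, still lazy
println(odds.count())                   // action: triggers the actual computation (prints 5)
println(odds.collect().mkString(", "))  // action: prints 2, 6, 10, 14, 18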
78
• Properties / Traits of RDD:
 Immutable (read-only, cannot be changed or modified): data is safe to share across processes.
 Partitioned: the partition is the basic unit of parallelism in an RDD.
 Coarse-grained operations: an operation is applied to all elements in the dataset, e.g. through map, filter or groupBy.
 Actions/Transformations: all computations on RDDs are either actions or transformations.
 Fault Tolerant: as the name Resilient suggests, an RDD can recover or recompute lost data using its lineage graph.
 Cacheable: it can hold data in persistent storage.
 Persistence: the user can choose which storage is used, either in-memory or on-disk.
80
How Spark Works - RDD Operations
81
 Task 1: Practice on Spark Transformations i.e. map(), filter(),
flatmap(), groupBy(), groupByKey(), sample(), union(), join(),
distinct(), keyBy(), partitionBy and zip().
 Recall that an RDD (Resilient Distributed Dataset) is the main logical data unit in Spark: a distributed collection of objects.
 Transformations are lazily evaluated operations on an RDD that create one or many new RDDs, e.g. map, filter, reduceByKey, join, cogroup, randomSplit.
 Transformations are lazy, i.e. they are not executed immediately.
 Transformations are executed only when actions are called.
82
 Transformations are lazy operations on an RDD that create one or many new RDDs.
 Ex: map, filter, reduceByKey, join, cogroup, randomSplit.
 In other words, transformations are functions that take an RDD as the input and produce one or many RDDs as the output.
 Applying transformations creates dependencies between RDDs.
 These dependencies are the steps for producing results, i.e. a program.
 Each RDD in the lineage chain (the string of dependencies) has a function for operating on its data and a pointer (dependency) to its ancestor RDD.
 Spark divides RDD dependencies into stages and tasks and then sends those to the workers for execution.
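A small sketch for inspecting such a lineage chain from spark-shell (sc assumed); toDebugString prints the dependency graph with its stage boundaries:

val base = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
val summed = base.reduceByKey(_ + _)                           // introduces a shuffle dependency
val upper = summed.map { case (k, v) => (k.toUpperCase, v) }   // narrow dependency on summed
println(upper.toDebugString)                                   // prints the RDD lineage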
84
1. map(): Pass each element of the RDD through the supplied function.
val x = sc.parallelize(Array("b", "a", "c"))
val y = x.map(z => (z, 1))
println(y.collect().mkString(", "))
Output: y: [('b', 1), ('a', 1), ('c', 1)]
85
2.filter(): Filter creates a new RDD by passing in the supplied function
used to filter the results.
val x = sc.parallelize(Array(1,2,3))
val y = x.filter(n => n%2 == 1)
println(y.collect().mkString(", "))
Output: y: [1, 3]
86
3.flatmap() : Similar to map, but each input item can be mapped to 0 or
more output items.
val x = sc.parallelize(Array(1,2,3))
val y = x.flatMap(n => Array(n, n*100, 42))
println(y.collect().mkString(", "))
Output: y: [1, 100, 42, 2, 200, 42, 3, 300, 42]
87
4. groupBy(): Groups the elements of the RDD by the key returned by the supplied function, producing a dataset of (K, Iterable<V>) pairs.
val x = sc.parallelize( Array("John", "Fred", "Anna", "James"))
val y = x.groupBy(w => w.charAt(0))
println(y.collect().mkString(", "))
Output: y: [('A',['Anna']),('J',['John','James']),('F',['Fred'])]
88
5.groupByKey() : When called on a dataset of (K, V) pairs, returns a
dataset of (K, Iterable<V>) pairs.
val x = sc.parallelize( Array(('B',5),('B',4),('A',3),('A',2),('A',1)))
val y = x.groupByKey()
println(y.collect().mkString(", "))
Output: y: [('A', [3, 2, 1]),('B',[5, 4])]
89
6.sample() : Return a random sample subset RDD of the input RDD.
val x= sc.parallelize(Array(1, 2, 3, 4, 5))
val y= x.sample(false, 0.4)
// omitting seed will yield different output
println(y.collect().mkString(", "))
Output: y: [1, 3]
90
7.union() : Simple. Return the union of two RDDs.
val x= sc.parallelize(Array(1,2,3), 2)
val y= sc.parallelize(Array(3,4), 1)
val z= x.union(y)
val zOut= z.glom().collect()
Output z: [[1], [2, 3], [3, 4]]
91
8.join() : If you have relational database experience, this will be
easy. It’s joining of two datasets.
val x= sc.parallelize(Array(("a", 1), ("b", 2)))
val y= sc.parallelize(Array(("a", 3), ("a", 4), ("b", 5)))
val z= x.join(y)
println(z.collect().mkString(", "))
Output z: [('a', (1, 3)), ('a', (1, 4)), ('b', (2, 5))]
92
9.distinct() : Return a new RDD with distinct elements within a
source RDD.
val x= sc.parallelize(Array(1,2,3,3,4))
val y= x.distinct()
println(y.collect().mkString(", "))
Output: y: [1, 2, 3, 4]
93
10) keyBy() : Constructs two-component tuples (key-value pairs) by
applying a function on each data item.
val x= sc.parallelize(Array("John", "Fred", "Anna", "James"))
val y= x.keyBy(w => w.charAt(0))
println(y.collect().mkString(", "))
Output: y: [('J','John'),('F','Fred'),('A','Anna'),('J','James')]
94
11) partitionBy(): Repartitions a key-value RDD using its keys. The partitioner implementation is supplied as the argument.
import org.apache.spark.Partitioner
val x = sc.parallelize(Array(('J', "James"), ('F', "Fred"), ('A', "Anna"), ('J', "John")), 3)
val y = x.partitionBy(new Partitioner() {
  val numPartitions = 2
  def getPartition(k: Any) = if (k.asInstanceOf[Char] < 'H') 0 else 1
})
val yOut = y.glom().collect()
Output: yOut: Array(Array((F,Fred), (A,Anna)), Array((J,John), (J,James)))
95
12) zip(): Joins two RDDs by pairing the i-th element of one with the i-th element of the other.
val x= sc.parallelize(Array(1,2,3))
val y= x.map(n=>n*n)
val z= x.zip(y)
println(z.collect().mkString(", "))
Output: z: [(1, 1), (2, 4), (3, 9)]
96
 Task 2: Practice on Spark Actions i.e. getNumPartitions(), collect(), reduce(), aggregate(), max(), sum(), mean(), stdev(), countByKey().
 Recall that an RDD (Resilient Distributed Dataset) is the main logical data unit in Spark: a distributed collection of objects.
 Actions return the final result of RDD computations / operations.
 An action produces a value back to the Spark driver program. It may trigger a previously constructed, lazy RDD to be evaluated.
 Action functions materialize a value in a Spark program; basically, an action is an RDD operation that returns a value of any type except RDD[T].
97
 Actions: Unlike transformations, which produce RDDs, action functions produce a value back to the Spark driver program.
 Actions may trigger a previously constructed, lazy RDD to be evaluated.
1. getNumPartitions()
2. collect()
3. reduce()
4. aggregate()
5. mean()
6. sum()
7. max()
8. stdev()
9. countByKey()
98
1) collect() : collect returns the elements of the dataset as an array
back to the driver program.
val x= sc.parallelize(Array(1,2,3), 2)
val y= x.collect()
Output: y: [1, 2, 3]
99
2) reduce() : Aggregate the elements of a dataset through function.
val x= sc.parallelize(Array(1,2,3,4))
val y= x.reduce((a,b) => a+b)
Output: y: 10
100
3) aggregate(): The aggregate function allows the user to apply two different reduce functions to the RDD: one within each partition and one that combines the partition results, each seeded with a zero value.
val inputrdd = sc.parallelize(List(("maths", 21), ("english", 22), ("science", 31)), 3)
val result = inputrdd.aggregate(3)((acc, value) => acc + value._2, (acc1, acc2) => acc1 + acc2)
Each of the 3 partitions computes sum(its elements) + 3 (the zero value), and the combining step adds the zero value once more:
Result = Partition1 + Partition2 + Partition3 + 3 (zero value)
So we get 21 + 22 + 31 + (4 * 3) = 86
Output: result: Int = 86
101
4) max() : Returns the largest element in the RDD.
val x= sc.parallelize(Array(2,4,1))
val y= x.max()
Output: y: 4
102
5) count() : Number of elements in the RDD.
val x= sc.parallelize(Array("apple", "beatty", "beatrice"))
val y=x.count()
Output: y: 3
103
6) sum() : Sum of the RDD.
val x= sc.parallelize(Array(2,4,1))
val y= x.sum()
Output: y: 7
104
7) mean() : Mean of given RDD.
val x= sc.parallelize(Array(2,4,1))
val y= x.mean()
Output: y: 2.3333333
105
8) stdev(): An aggregate function that computes the standard deviation of a set of numbers.
val x= sc.parallelize(Array(2,4,1))
val y= x.stdev()
Output: y: 1.2472191
106
9) countByKey() : This is only available on RDDs of (K,V) and
returns a hashmap of (K, count of K).
val x= sc.parallelize(Array(('J',"James"),('F',"Fred"), ('A',"Anna"),('J',"John")))
val y= x.countByKey()
Output: y: {'A': 1, 'J': 2, 'F': 1}
107
10) getNumPartitions()
val x= sc.parallelize(Array(1,2,3), 2)
val y= x.partitions.size
Output: y: 2
108
1. Spark was initially developed by which university?
Ans) Berkeley
2. What are the characteristics of Big Data?
Ans) Volume, Velocity and Variety
3. The main focus of Hadoop ecosystem is on
Ans ) Batch Processing
4. Streaming data tools available in Hadoop ecosystem
are?
Ans ) Apache Spark and Storm
5. Spark has API's in?
Ans ) Java, Scala, R and Python
6. Which kind of data can be processed by spark?
Ans) Stored Data and Streaming Data
7. Spark can store its data in?
Ans) HDFS, MongoDB and Cassandra
109
8. How spark engine runs?
Ans) Integrating with Hadoop and Standalone
9. In spark data is represented as?
Ans ) RDDs
10. Which kind of data can be handled by Spark ?
Ans) Structured, Unstructured and Semi-Structured
11.Which among the following are the challenges in Map
reduce?
Ans) Every Problem has to be broken into Map and
Reduce phase
Collection of Key / Value pairs
High Throughput
110
12. Apache spark is a framework with?
Ans) Scheduling, Monitoring and Distributing Applications
13. Which of the features of Apache spark
Ans) DAG, RDDs and In- Memory processing
14) How much faster is the processing in spark when compared to
Hadoop?
ANS) 10-100X
15) In spark data is represented as?
Ans) RDDs
16) List of Transformations
Ans) map(), filter(), flatmap(), groupBy(), groupByKey(),
sample(), union(), join(), distinct(), keyBy(), partitionBy and
zip().
17) List of Actions
Ans) getNumPartitions(), collect(), reduce(), aggregate(),
max(), sum(), mean(), stdev(), countByKey().
111
18. Spark is developed in
Ans) Scala
19. Which type of processing Apache Spark can handle
Ans) Batch Processing, Interactive Processing, Stream
Processing and Graph Processing
20. List two statements of Spark
Ans) Spark can run on the top of Hadoop
Spark can process data stored in HDFS
Spark can use Yarn as resource management layer
21. Spark's core is a batch engine? True OR False
Ans) True
22) Spark is 100x faster than MapReduce due to
Ans) In-Memory Computing
23) MapReduce program can be developed in spark? T / F
Ans) True
112
24. Programming paradigm used in Spark
Ans) Generalized
25. Spark Core Abstraction
Ans) RDD
26. Choose correct statement about RDD
Ans) RDD is a distributed data structure
27. RDD is
Ans) Immutable, Recomputable and Fault-tolerant
28. RDD operations
Ans) Transformation, Action and Caching
29. We can edit the data of RDD like conversion to uppercase? T/F
Ans) False
30. Identify correct transformation
Ans) Map, Filter and Join
113
31. Identify Correct Action
Ans) Reduce
32. Choose correct statement
Ans) Execution starts with the call of Action
33. Choose correct statement about Spark Context
Ans) Interact with cluster manager and Specify spark how to
access cluster
34. Spark cache the data automatically in the memory as and when
needed? T/F
Ans) False
35. For resource management spark can use
Ans) Yarn, Mesos and Standalone cluster manager
36. RDD can not be created from data stored on
Ans) Oracle
37. RDD can be created from data stored on
Ans) Local FS, S3 and HDFS
114
38. Who is father of Big data Analytics
 Doug Cutting
39. What are major Characteristics of Big Data
 Volume, Velocity and Variety(3 V’s)
40. What is Apache Hadoop
 Open-source Software Framework
41. Who developed Hadoop
 Doug Cutting
42. Hadoop supports which programming framework
 Java
115
43. What is the heart of Hadoop
 MapReduce
44. What is MapReduce
 Programming Model for Processing Large Data Sets.
45. What are the Big Data Dimensions
 4 V’s
46. What is the caption of Volume
 Data at Scale
47. What is the caption of Velocity
 Data in Motion
116
48. What is the caption of Variety
 Data in many forms
49. What is the caption of Veracity
 Data Uncertainty
50. What is the biggest Data source for Big Data
 Transactions
51. What is the biggest Analytic capability for Big Data
 Query and Reporting
52. What is the biggest Infrastructure for Big Data
 Information integration
117
53. What are the Big Data Adoption Stages
 Educate, Explore, Engage and Execute
54. What is Mahout
 Algorithm library for scalable machine learning on Hadoop
55. What is Pig
 Creating MapReduce programs used with Hadoop.
56. What is HBase
 Non-Relational Database
57. What is the biggest Research Challenge for Big Data
 Heterogeneity , Incompleteness and Security
118
58. What is Sqoop
 Transferring bulk data between Hadoop to Structured data.
59. What is Oozie
 Workflow scheduler system to manage Hadoop jobs.
60. What is Hue
 Web interface that supports Apache Hadoop and its ecosystem
61. What is Avro
 Avro is a data serialization system.
62. What is Giraph
 Iterative graph processing system built for high scalability.
63. What is Cassandra
 Cassandra does not support joins or sub queries, except
for batch analysis via Hadoop
64. What is Chukwa
 Chukwa is an open source data collection system for
monitoring large distributed systems
65. What is Hive
 Hive is a data warehouse on Hadoop
66. What is Apache drill
 Apache Drill is a distributed system for interactive
analysis of large-scale datasets.
67. What is HDFS
 Hadoop Distributed File System ( HDFS )
68. Facebook generates how much data per day
 25TB
69. What is BIG DATA?
 Big Data is nothing but an assortment of such a huge and
complex data that it becomes very tedious to capture, store,
process, retrieve and analyze it with the help of on-hand
database management tools or traditional data processing
techniques.
70. What is the HUE expansion
 Hadoop User Experience
71.Can you give some examples of Big Data?
 There are many real life examples of Big Data! Facebook is
generating 500+ terabytes of data per day, NYSE (New York Stock
Exchange) generates about 1 terabyte of new trade data per day, a
jet airline collects 10 terabytes of sensor data for every 30 minutes
of flying time.
72. Can you give a detailed overview about the Big Data being
generated by Facebook?
 As of December 31, 2012, there are 1.06 billion monthly active
users on Facebook and 680 million mobile users. On an average,
3.2 billion likes and comments are posted every day on Facebook.
72% of web audience is on Facebook. And why not! There are so
many activities going on Facebook from wall posts, sharing images,
videos, writing comments and liking posts, etc.
122
73. What are the three characteristics of Big Data?
 The three characteristics of Big Data are: Volume: Facebook generating 500+ terabytes of data per day. Velocity: Analyzing 2 million records each day to identify the reason for losses. Variety: images, audio, video, sensor data, log files, etc.
74. How Big is ‘Big Data’?
 With time, data volume is growing exponentially. Earlier
we used to talk about Megabytes or Gigabytes. But time
has arrived when we talk about data volume in terms of
terabytes, petabytes and also zettabytes! Global data
volume was around 1.8ZB in 2011 and is expected to be
7.9ZB in 2015.
123
75. How analysis of Big Data is useful for organizations?
 Effective analysis of Big Data provides a lot of business
advantage as organizations will learn which areas to focus
on and which areas are less important.
76. Who are ‘Data Scientists’?
 Data scientists are experts who find solutions to analyze
data. Just as web analysis, we have data scientists who
have good business insight as to how to handle a business
challenge.
124
77. What is Hadoop?
 Hadoop is a framework that allows for distributed
processing of large data sets across clusters of commodity
computers using a simple programming model.
78. Why the name ‘Hadoop’?
 Hadoop doesn't have an expanded full form the way acronyms like 'OOPS' do. The charming yellow elephant you see is basically named after Doug's son's toy elephant!
79. Why do we need Hadoop?
 Everyday a large amount of unstructured data is getting
dumped into our machines.
125
80. What are some of the characteristics of Hadoop framework?
 Hadoop framework is written in Java. It is designed to solve
problems that involve analyzing large data (e.g. petabytes). The
programming model is based on Google’s MapReduce. The
infrastructure is based on Google’s Big Data and Distributed File
System.
81. Give a brief overview of Hadoop history.
 In 2002, Doug Cutting created an open source, web crawler project.
In 2004, Google published MapReduce, GFS papers.
 In 2006, Doug Cutting developed the open source, MapReduce and
HDFS project. In 2008, Yahoo ran 4,000 node Hadoop cluster and
Hadoop won terabyte sort benchmark.
 In 2009, Facebook launched SQL support for Hadoop.
126
82. Give examples of some companies that are using Hadoop
structure?
 A lot of companies are using the Hadoop structure such as
Cloudera, EMC, MapR, Horton works, Amazon, Facebook, eBay,
Twitter, Google and so on.
83. What is the basic difference between traditional RDBMS and
Hadoop?
 RDBMS is used for transactional systems to report and archive the
data.
 Hadoop is an approach to store huge amount of data in the
distributed file system and process it.
 RDBMS will be useful when you want to seek one record from big data, whereas Hadoop will be useful when you want the big data in one shot and perform analysis on it later.
127
84. What is structured and unstructured data?
 Structured data is the data that is easily identifiable as it is
organized in a structure. The most common form of
structured data is a database where specific information is
stored in tables, that is, rows and columns.
 Unstructured data refers to any data that cannot be
identified easily. It could be in the form of images, videos,
documents, email, logs and random text.
85. What are the core components of Hadoop?
 Core components of Hadoop are HDFS and MapReduce.
HDFS is basically used to store large data sets and
MapReduce is used to process such large data sets.
128
86. What is HDFS?
HDFS is a file system designed for storing very
large files with streaming data access patterns,
running clusters on commodity hardware.
87. What are the key features of HDFS?
HDFS is highly fault-tolerant, with high
throughput, suitable for applications with large
data sets, streaming access to file system data and
can be built out of commodity hardware.
129
88. What is Fault Tolerance?
 Suppose you have a file stored in a system, and due to some technical problem that file gets destroyed; then there is no chance of getting that data back. Fault tolerance means avoiding such loss: HDFS replicates and stores the data in multiple places, so the file remains available even if one system crashes.
89. Replication causes data redundancy then why is
pursued in HDFS?
 HDFS works with commodity hardware (systems with
average configurations) that has high chances of getting
crashed any time. Thus, to make the entire system highly
fault-tolerant, HDFS replicates and stores data in different
places.
130
90. Since the data is replicated thrice in HDFS, does it
mean that any calculation done on one node will also
be replicated on the other two?
 Since there are 3 nodes, when we send the MapReduce
programs, calculations will be done only on the original
data. The master node will know which node exactly has
that particular data.
91. What is throughput? How does HDFS get a good
throughput?
 Throughput is the amount of work done in a unit time. It
describes how fast the data is getting accessed from the
system and it is usually used to measure performance of
the system.
131
92. What is streaming access?
 As HDFS works on the principle of ‘Write Once, Read
Many‘, the feature of streaming access is
extremely important in HDFS. HDFS focuses not so
much on storing the data but how to retrieve it at the
fastest possible speed, especially while analyzing logs.
93. What is a commodity hardware? Does commodity
hardware include RAM?
 Commodity hardware is a non-expensive system which is
not of high quality or high-availability. Hadoop can be
installed in any average commodity hardware.
132
94. What is a Name node?
 Name node is the master node on which job tracker runs
and consists of the metadata. It maintains and manages
the blocks which are present on the data nodes.
95. Is Name node also a commodity?
 No. Name node can never be commodity hardware because the entire HDFS relies on it. It is the single point of failure in HDFS. Name node has to be a high-availability machine.
96. What is a metadata?
 Metadata is the information about the data stored in data
nodes such as location of the file, size of the file and so
on.
133
97. What is a Data node?
Data nodes are the slaves which are deployed on
each machine and provide the actual storage.
These are responsible for serving read and write
requests for the clients.
98. Why do we use HDFS for applications having
large data sets and not when there are lot of
small files?
HDFS is more suitable for large amount of data
sets in a single file as compared to small amount
of data spread across multiple files.
134
99. What is a daemon?
Daemon is a process or service that runs in
background. In general, we use this word in UNIX
environment.
100. What is a job tracker?
Job tracker is a daemon that runs on a name node
for submitting and tracking MapReduce jobs in
Hadoop. It assigns the tasks to the different task
tracker.
135
101. What is a task tracker?
 Task tracker is also a daemon that runs on data nodes.
Task Trackers manage the execution of individual tasks
on slave node.
102. Is Name node machine same as data node machine
as in terms of hardware?
 It depends upon the cluster you are trying to create. The
Hadoop VM can be there on the same machine or on
another machine.
136
103. What is a heartbeat in HDFS?
 A heartbeat is a signal indicating that it is alive. A data
node sends heartbeat to Name node and task tracker will
send its heart beat to job tracker.
104. Are Name node and job tracker on the same host?
 No, in practical environment, Name node is on a separate
host and job tracker is on a separate host.
105. What is a ‘block’ in HDFS?
 A 'block' is the minimum amount of data that can be read or written. In HDFS, the default block size is 64 MB, in contrast to the block size of 8192 bytes in Unix/Linux.
106. What are the benefits of block transfer?
 A file can be larger than any single disk in the network.
Blocks provide fault tolerance and availability.
107. If we want to copy 10 blocks from one machine to
another, but another machine can copy only 8.5
blocks, can the blocks be broken at the time of
replication?
 In HDFS, blocks cannot be broken down. Before copying
the blocks from one machine to another, the Master node
will figure out what is the actual amount of space
required, how many block are being used, how much
space is available, and it will allocate the blocks
accordingly.
138
108. How indexing is done in HDFS?
 Hadoop has its own way of indexing. Depending upon the
block size, once the data is stored, HDFS will keep on
storing the last part of the data which will say where the
next part of the data will be.
109. If a data Node is full how it’s identified?
 When data is stored in data node, then the metadata of
that data will be stored in the Name node. So Name node
will identify if the data node is full.
110.If data nodes increase, then do we need to upgrade
Name node?
 While installing the Hadoop system, Name node is
determined based on the size of the clusters.
139
111.Are job tracker and task trackers present in separate
machines?
 Yes, job tracker and task tracker are present in different machines.
The reason is job tracker is a single point of failure for the Hadoop
MapReduce service.
112. When we send a data to a node, do we allow settling in time,
before sending another data to that node?
 Yes, we do.
113. Does Hadoop always require digital data to process?
 Yes. Hadoop always require digital data to be processed.
114. On what basis Name node will decide which data node to
write on?
 As the Name node has the metadata (information) related to all the
data nodes, it knows which data node is free.
140
115. Doesn’t Google have its very own version of DFS?
 Yes, Google owns a DFS known as “Google File System
(GFS)” developed by Google Inc. for its own use.
116. Who is a ‘user’ in HDFS?
 A user is like you or me, who has some query or who needs some
kind of data.
117. Is client the end user in HDFS?
 No, Client is an application which runs on your machine, which is
used to interact with the Name node (job tracker) or data node (task
tracker).
118. What is the communication channel between client and name
node/ data node?
141
 The mode of communication is SSH.
Relational DB's vs. Big Data (Spark)
Relational DB's:
1. Deals with gigabytes to terabytes
2. Centralized
3. Deals with structured data
4. Has a stable data model
5. Deals with known, complex inter-relationships
6. Tools are relational DBs: SQL, MySQL, DB2
7. Access is interactive and batch
8. Updates are read and write many times
9. Integrity is high
10. Scaling is non-linear
Big Data (Spark):
1. Deals with petabytes to zettabytes
2. Distributed
3. Deals with semi-structured and unstructured data
4. Has an unstable data model
5. Deals with flat schemas and few inter-relationships
6. Tools are Hadoop, R, Mahout
7. Access is batch
8. Updates are write once, read many times
9. Integrity is low
10. Scaling is linear
148
HADOOP vs. SPARK
Performance: Hadoop processes data on disk. | Spark processes data in-memory.
Ease of use: Hadoop MapReduce requires proficiency in Java. | Spark supports Java, Scala, R and Python and is more expressive and intelligent.
Data processing: Hadoop needs other platforms for streaming and graphs. | Spark provides MLlib, GraphX and Streaming.
Failure tolerance: Hadoop continues from the point it left off. | Spark starts the processing from the beginning.
Cost: Hadoop pays for hard-disk space. | Spark pays for memory space.
Run everywhere: Spark runs on Hadoop as well as standalone.
Memory: In Hadoop, HDFS uses MapReduce to process and analyse data, and MapReduce takes a backup of all the data on a physical server after each operation. | In Spark this is called in-memory operation, because the data is stored in RAM.
Speed: Hadoop works slower than Spark. | Spark works up to 100 times faster than Hadoop for batch processing in memory, and about 10x faster on disk.
Version: Hadoop 2.6.5. | Spark 2.3.1.
Software: Hadoop is open-source software for reliable, scalable, distributed computing; it is a Big Data tool. | Spark is a fast and general engine for large-scale data processing; it is also a Big Data tool.
Execution engine: Hadoop uses the MapReduce engine. | Spark uses an advanced DAG engine with support for acyclic data flow and in-memory computing.
Hardware cost: More. | Less.
Library: Hadoop relies on external machine learning libraries. | Spark has an internal machine learning library (MLlib).
Recovery: Recovery in Hadoop is easier than in Spark since checkpoints are present. | Failure recovery in Spark is more difficult, but still good.
File management system: Hadoop has its own file management system (HDFS). | Spark does not come with its own file management system; it supports cloud-based data platforms and was designed to work with Hadoop.
Support: Hadoop supports HDFS, Hadoop YARN and Apache Mesos. | Spark supports RDDs.
Technologies: Hadoop works with Cassandra, HBase and Hive. | Spark works with Tachyon and any Hadoop data source.
Use cases: Hadoop is used for marketing analysis, computing analysis and cyber security. | Spark is used for online product analytics.
Run: Hadoop runs on clusters, databases and servers, processed by a batch system. | Spark supports all systems in Hadoop, cloud-based systems and datasets.
151
Reach me @
Srinivasulu Asadi:
[email protected] , + 91-9490246442
152
Fig: International Conference on Emerging Research In Computing,
Information, Communication and Applications- ERCICA-2014.
Fig: Keynote Speaker @GITM, Proddatur, 2016
Fig: Resource Person for Workshop @MITS, Madanapalle, 2016
157
Fig: Keynote Speaker @ SITM, Renigunta, 2017
Thank You
Business Intelligence
160
161
CLUSTERING
 What is Clustering: Clustering is unsupervised learning, i.e. there are no predefined classes; it forms groups of similar objects that differ significantly from other objects.
 The process of grouping a set of physical or abstract objects into classes of similar objects is called clustering.
 Clustering is "the process of organizing objects into groups whose members are similar in some way".
 The cluster property: intra-cluster distances are minimized and inter-cluster distances are maximized.
 What is Good Clustering: A good clustering method will produce
high quality clusters with
 high intra-class similarity
 low inter-class similarity
162
1. Nominal Variables allow for only qualitative classification. A
generalization of the binary variable in that it can take more than 2
states, e.g., red, yellow, blue, green
Ex: { male, female},{yes, no},{true, false}
2. Ordinal Data are categorical data where there is a logical
ordering to the categories.
Ex: 1=Strongly disagree; 2=Disagree; 3=Neutral; 4=Agree;
3. Categorical Data represent types of data which may be divided
into groups.
Ex: race, sex, age group, and educational level.
4. Labeled Data share the class labels or the generative distribution of the data.
5. Unlabeled Data do not share the class labels or the generative distribution of the labeled data.
6. Numerical Values: the data values are purely numerical. Ex: 1, 2, 3, 4, …
163
7. Interval-valued variables: These are variables ranges from
numbers
 Ex: 10-20, 20-30, 30-40,……..
8. Binary Variables: These are the variables and combination of 0
and 1.
 Ex: 1, 0, 001,010 ….
9. Ratio-Scaled Variables: A positive measurement on a nonlinear scale, approximately at an exponential scale, such as Ae^(Bt) or Ae^(-Bt).
 Ex: ½, 2/4, 4/8,…..
10. Variables of Mixed Types: A database may contain all the six
types of variables symmetric binary, asymmetric binary,
nominal, ordinal, interval and ratio.
 Ex: 11121A1201
164
Similarity Measure
 Distances are normally used to measure the similarity or dissimilarity between two data objects.
 Euclidean distance: the Euclidean distance is the straight-line distance between two points in Euclidean space (see the sketch below).
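A plain-Scala sketch of the Euclidean distance between two points given as coordinate sequences (hypothetical helper, for illustration only):

def euclidean(p: Seq[Double], q: Seq[Double]): Double =
  math.sqrt(p.zip(q).map { case (a, b) => (a - b) * (a - b) }.sum)

println(euclidean(Seq(1.0, 2.0), Seq(4.0, 6.0)))   // 5.0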
Major Clustering Approaches
1. Partitioning Methods
2. Hierarchical Methods
3. Density-Based Methods
4. Grid-Based Methods
5. Model-Based Clustering Methods
6. Clustering High-Dimensional Data
7. Constraint-Based Cluster Analysis
8. Outlier Analysis
166
Examples of Clustering Applications
1. Marketing
2. Land use
3. Insurance
4. City-planning
5. Earth-quake studies
Issues of Clustering
1. Accuracy
2. Training time
3. Robustness
4. Interpretability
5. Scalability
6. Find top 'n' outlier points
167
Applications
 Pattern Recognition
 Spatial Data Analysis
 GIS(Geographical Information System)
 Cluster Weblog data to discover groups
 Credit approval
 Target marketing
 Medical diagnosis
 Fraud detection
 Weather forecasting
 Stock Marketing
168
2. Classification vs. Clustering
1. Classification is "the process of organizing objects into groups whose members are not necessarily similar." | Clustering is "the process of organizing objects into groups whose members are similar in some way."
2. Classification is supervised learning. | Clustering is unsupervised learning.
3. Classification has predefined classes. | Clustering has no predefined classes.
4. Classification has labels for some points. | Clustering has no labels.
5. Classification requires a "rule" that will accurately assign labels to new points. | Clustering groups points into clusters based on how "near" they are to one another.
6. Fig: Classification | Fig: Clustering
7. Classification approaches are of two types: Predictive Classification and Descriptive Classification. | Clustering approaches are of eight types: Partitioning Methods, Hierarchical Methods, Density-Based Methods, Grid-Based Methods, Model-Based Clustering Methods, Clustering High-Dimensional Data, Constraint-Based Cluster Analysis, and Outlier Analysis.
8. Issues of Classification: Accuracy, Training time, Robustness, Interpretability, and Scalability. | Issues of Clustering: Accuracy, Training time, Robustness, Interpretability, Scalability, and finding the top 'n' outlier points.
9. Examples of Classification: Marketing, Land use, Insurance, City-planning, Earth-quake studies. | Examples of Clustering: Marketing, Land use, Insurance, City-planning, Earth-quake studies.
10. Classification techniques: Decision Tree, Bayesian classification, Rule-based classification, and Prediction with accuracy and error measures. | Clustering techniques: k-Means Clustering, DIANA (DIvisive ANAlysis), AGNES (AGglomerative NESting), BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies), and DBSCAN (Density-Based Spatial Clustering of Applications with Noise).
11. Applications of Classification: Credit approval, Target marketing, Medical diagnosis, Fraud detection, Weather forecasting, Stock marketing. | Applications of Clustering: Pattern Recognition, Spatial Data Analysis, WWW (World Wide Web), Weblog data to discover groups, Credit approval, Target marketing, Medical diagnosis, Fraud detection, Weather forecasting, Stock marketing.
173
k-Means Clustering
 It is a Partitioning cluster technique.
 It is a Centroid-Based cluster technique
 Clustering is a Unsupervised learning i.e. no predefined
classes, Group of similar objects that differ significantly
from other objects.
$d(i,j) = \sqrt{|x_{i1}-x_{j1}|^2 + |x_{i2}-x_{j2}|^2 + \cdots + |x_{ip}-x_{jp}|^2}$
 It then creates the first k initial clusters (k= number of
clusters needed) from the dataset by choosing k rows of
data randomly from the dataset.
 The k-Means algorithm calculates the Arithmetic Mean
of each cluster formed in the dataset.
174
 Square-error criterion: $E = \sum_{i=1}^{k} \sum_{p \in C_i} |p - m_i|^2$
 Where
– E is the sum of the square error for all objects in the data set;
– p is the point in space representing a given object; and
– m_i is the mean of cluster C_i (both p and m_i are multidimensional).
 Algorithm: The k-means algorithm for partitioning, where each cluster's center is represented by the mean value of the objects in the cluster.
 Input:
– k: the number of clusters,
– D: a data set containing n objects.
 Output: A set of k clusters.
175
k-Means Clustering Method Example
k = 2: arbitrarily choose k objects as initial cluster centers; assign each object to the most similar center; update the cluster means; reassign objects; update the cluster means again, repeating until the assignments no longer change.
Fig: Clustering of a set of objects based on the k-means
method. (The mean of each cluster is marked by a “+”.)
177
Steps
k - Means algorithm is implemented in four
steps:
1. Partition objects into k nonempty subsets.
2. Compute seed points as the centroids of the clusters
of the current partition (the centroid is the center, i.e.,
mean point, of the cluster).
3. Assign each object to the cluster with the nearest seed
point.
4. Go back to Step 2; stop when there are no new assignments.
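A minimal sketch of running k-means through Spark MLlib from spark-shell (sc assumed); the toy 2-D points are invented for illustration:

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

val points = sc.parallelize(Seq(
  Vectors.dense(1.0, 1.0), Vectors.dense(1.5, 2.0),
  Vectors.dense(8.0, 8.0), Vectors.dense(9.0, 9.5)))

val model = KMeans.train(points, 2, 20)          // k = 2 clusters, at most 20 iterations
model.clusterCenters.foreach(println)            // the final cluster means (the "+" marks in the figure)
println(model.predict(Vectors.dense(1.2, 1.3)))  // cluster index assigned to a new point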
179
Issues of Clustering
1. Accuracy
2. Training time
3. Robustness
4. Interpretability
5. Scalability
6. Find top 'n' outlier points
Examples of Clustering Applications
1. Marketing
2. Land use
3. Insurance
4. City-planning
5. Earth-quake studies
180
Applications
 Pattern Recognition
 Spatial Data Analysis
 GIS(Geographical Information System)
 Image Processing
 WWW (World Wide Web)
 Cluster Weblog data to discover groups
 Credit approval
 Target marketing
 Medical diagnosis
 Fraud detection
 Weather forecasting
 Stock Marketing
181
Classification and Prediction
 Classification and Prediction: Classification is supervised learning, i.e. output values are predicted from input values; the data is divided into groups that do not necessarily share similar properties.
 Classification is a two-step process:
1. Build the Classifier / Model
2. Use the Classifier for Classification.
1. Build the Classifier / Model: describes a set of predetermined classes.
 Each tuple / sample is assumed to belong to a predefined class, as determined by the class label attribute.
 Also called the learning phase or training phase.
 The set of tuples used for model construction is the training set.
 The model is represented as classification rules, decision trees, or mathematical formulae.
Classification Algorithms are applied to the Training Data to produce a Classifier (Model).
Training Data:
NAME | RANK | YEARS | TENURED
Mike | Assistant Prof | 3 | no
Mary | Assistant Prof | 7 | yes
Bill | Professor | 2 | yes
Jim | Associate Prof | 7 | yes
Dave | Assistant Prof | 6 | no
Anne | Associate Prof | 3 | no
Classifier (Model): IF rank = 'professor' OR years > 6 THEN tenured = 'yes'
Fig: Model Construction
2. Using Classifier: for classifying future or unknown
objects. It estimate accuracy of the model.
 The known label of test sample is compared with the
classified result from the model.
 Accuracy rate is the percentage of test set samples that are
correctly classified by the model.
 Test set is independent of training set, otherwise overfitting will occur.
 If the accuracy is acceptable, use the model to classify
data tuples whose class labels are not known.
184
The Classifier is applied to the Testing Data and then to Unseen Data, e.g. (Jeff, Professor, 4): Tenured?
Testing Data:
NAME | RANK | YEARS | TENURED
Tom | Assistant Prof | 2 | no
Merlisa | Associate Prof | 7 | no
George | Professor | 5 | yes
Joseph | Assistant Prof | 7 | yes
Fig: Using the Model in Prediction
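A minimal plain-Scala sketch of applying the learned rule from the figure (IF rank = 'professor' OR years > 6 THEN tenured = 'yes') to the unseen tuple; the case class and helper names are hypothetical, used only for illustration:

case class Faculty(name: String, rank: String, years: Int)

def tenured(f: Faculty): String =
  if (f.rank == "Professor" || f.years > 6) "yes" else "no"   // the learned classification rule

println(tenured(Faculty("Jeff", "Professor", 4)))   // yes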
TYPES OF CLASSIFICATION TECHNIQUES
1. Decision Tree
2. Bayesian classification
3. Rule-based classification
4. Prediction
5. Classifier Accuracy and Prediction error measures
186
187
Fig: Example for Model Construction and Usage of Model
2 - Decision Tree
 A decision tree is a flowchart-like tree structure, where each internal node (non-leaf node) denotes a test on an attribute, each branch represents an outcome of the test, and each leaf node (or terminal node) holds a class label.
 Decision tree induction was developed by Ross Quinlan, whose decision tree algorithm is known as ID3 (Iterative Dichotomiser).
 A decision tree is a classifier in the form of a tree structure (see the sketch after this list):
 Decision node: specifies a test on a single attribute
 Leaf node: indicates the value of the target attribute
 Arc/edge: split of one attribute
 Path: a disjunction of tests to make the final decision
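A minimal sketch of training a decision tree classifier with Spark MLlib (not ID3 itself, but MLlib's decision tree), assuming spark-shell's sc; the toy labelled data (label 1.0 = buys_computer, features such as age and income) is invented purely for illustration:

import org.apache.spark.mllib.tree.DecisionTree
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

val data = sc.parallelize(Seq(
  LabeledPoint(1.0, Vectors.dense(25.0, 40000.0)),
  LabeledPoint(0.0, Vectors.dense(45.0, 30000.0)),
  LabeledPoint(1.0, Vectors.dense(35.0, 60000.0)),
  LabeledPoint(0.0, Vectors.dense(52.0, 20000.0))))

val model = DecisionTree.trainClassifier(
  data, numClasses = 2, categoricalFeaturesInfo = Map[Int, Int](),
  impurity = "gini", maxDepth = 3, maxBins = 16)

println(model.toDebugString)                          // the learned tree: tests at internal nodes, classes at leaves
println(model.predict(Vectors.dense(30.0, 50000.0)))  // predicted class for a new customer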
188
EX: Data Set in All Electronics Customer Database
189
Fig: A decision tree for the concept buys_computer, indicating whether a customer at AllElectronics is likely to purchase a computer. Each internal (non-leaf) node represents a test on an attribute. Each leaf node represents a class (either buys_computer = yes or buys_computer = no).
Ex: For age attribute
191
Issues of Classification and Prediction
1. Accuracy
2. Training time
3. Robustness
4. Interpretability
5. Scalability
Typical applications
1. Credit approval
2. Target marketing
3. Medical diagnosis
4. Fraud detection
5. Weather forecasting
6. Stock Marketing
192