
1 - Big Data Analytics - Spark


Talk

05th September 2018

BIG DATA USING SPARK

By

Dr.Asadi Srinivasulu

Professor, M.Tech(IIIT), Ph.D.

SREE VIDYANIKETHAN ENGINEERING COLLEGE

(AUTONOMOUS)

(AFFILIATED TO JNTUA, ANANTAPUR)

2018-2019

Contents

Big Data Fundamentals

Big Data Architecture

Spark Fundamentals

Spark Ecosystem

 Spark Transformations

Spark Actions

Spark – MLlib

Classification using Spark

Clustering using Spark

 Spark Challenges

Fig: Pre-requisite of Big Data

IOT: Internet of Things
Big Data Analytics: Deals with the 3 V's of big data, e.g. Facebook
Data Mining: Extracting meaningful data
Data Warehouse: Collection of Data Marts / OLAP
Data Mart: Subset of a DWH
Database System: Combination of Data + DBMS
DBMS: Collection of software
Database: Collection of inter-related data
Information: Processed data
Data: Raw material / facts / images

Fig: Big Data Word Cloud

The Myth about Big Data

 Big Data Is New

 Big Data Is Only About Massive Data Volume

 Big Data Means Hadoop

 Big Data Need A Data Warehouse

 Big Data Means Unstructured Data

 Big Data Is for Social Media & Sentiment Analysis

Big Data is …….

Big data is high-volume, high-velocity and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making.

Big data is data which is too large, complex and dynamic for any conventional data tools to capture, store, manage and analyze .

6

Fig: Big Data - Characteristics

7

Big Data Analytics is the process of examining large amounts of data of a variety of types (big data) to uncover hidden patterns , unknown correlations and other useful information.

Why do they care about Big Data?

More knowledge leads to better customer engagement , fraud prevention and new products.

8

Data Evolution: roughly 10% of data is structured and 90% is unstructured, like emails, videos, Facebook posts, website clicks, etc.

Fig: The Evolution of BIG DATA

9

Big data is a collection of data sets which is so large and complex that it is difficult to handle using DBMS tools.

Facebook alone generates more than 500 terabytes of data daily, whereas many other organizations like Jet Air and the Stock Exchange Market generate terabytes of data every hour.

Types of Data:

1. Structured Data - This data is organized in a highly mechanized and manageable way.

Ex: Tables, Transactions, Legacy Data etc…

2. Unstructured Data - This data is raw and unorganized; it varies in its content and can change from entry to entry.

Ex: Videos, images, audio, Text Data, Graph Data, social media etc.

3. Semi-Structured Data - roughly 50% structured and 50% unstructured. Ex: XML databases.

10

Big Data Matters…….

Data Growth is huge and all that data is valuable.

Data won't fit on a single system, which is why distributed data is used.

Distributed data = Faster Computation.

More knowledge leads to better customer engagement , fraud prevention and new products.

Big Data Matters for Aggregation, Statistics, Indexing, Searching,

Querying and Discovering Knowledge.

11

Fig: Measuring the Data in Big Data System

Big Data Sources: Big data is everywhere and it can help organisations in any industry in many different ways.

Big data has become too complex and too dynamic to be able to process, store, analyze and manage with traditional data tools.

Big Data Sources are

 ERP Data

Transactions Data

Public Data

Social Media Data

Sensor Media Data

Big Data in Marketing

Big Data in Health & Life Sciences

Cameras Data

Mobile Devices

 Machine sensors

Microphones

13

Fig: Big Data Sources

14

Structure of Big Data

Big Data Processing: So-called big data technologies are about discovering patterns in semi-structured and unstructured data. The development of big data standards and (open source) software is commonly driven by companies such as Google, Facebook, Twitter, Yahoo! and others.

15

Big Data File Formats

Videos

Audios

Images

Photos

Logs

Click Trails

Text Messages

E-Mails

Documents

Books

Transactions

Public Records

Flat Files

SQL Files

DB2 Files

MYSQL Files

Tera Data Files

MS-Access Files

16

Characteristics of Big Data

There are seven characteristics of Big Data: volume, velocity, variety, veracity, value, validity and visibility.

Earlier it was assessed in megabytes and gigabytes but now the assessment is made in terabytes.

17

1. Volume: Data size or the amount of data or data quantity or data at rest.

2. Velocity: Data speed or speed of change or the content is changing quickly or data in motion.

3. Variety: Data types or the range of data types & sources or data with multiple formats.

4. Veracity: Data fuzzy & cloudy or messiness or can we trust the data.

5. Value: Data alone is not enough; how can value be derived from it?

6. Validity: Ensure that the interpreted data is sound.

7. Visibility: Data from diverse sources need to be stitched together.

18

Advantages

Flexible schema

 Massive scalability

Cheaper to setup

Understanding and Targeting Customers

 Understanding and Optimizing Business Process

Improving Science and Research

Improving Healthcare and Public Health

Financial Trading

Improving Sports Performance

 Improving Security and Law Enforcement

No declarative query language

Eventual consistency – higher performance

Detect risks and check frauds

Reduce Costs

Fig) Advantages of Big Data

20

Disadvantages

Big data violates the privacy principle.

Data can be used for manipulating customers.

Big data may increase social stratification.

Big data is not useful in the short run.

It faces difficulties in parsing and interpreting.

Big data is difficult to handle and requires more programming.

Eventual consistency - fewer guarantees

Big Data Challenges

Data Complexity

Data Volume

Data Velocity

Data Variety

Data Veracity

Capture data

Curation data

Performance

Storage data

Search data

Transfer data

Visualization data

Data Analysis

Privacy and Security

22

Big Data Challenges include analysis, capture, data curation, search, sharing, storage, transfer, visualization, and information privacy.

23

Fig: Challenges of Big Data

24

Research issues in Big Data Analytics

1. Sentiment Analysis in Big Data Hadoop using Mahout Machine Learning Algorithms

2. Opinion Mining Analysis in Big Data Hadoop using Mahout Machine Learning Algorithms

3. Predictive Mining Analysis in Big Data Hadoop using Mahout Machine Learning Algorithms

4. Post-Clustering Analysis in Big Data Hadoop using Mahout Machine Learning Algorithms

5. Pre-Clustering Analysis in Big Data Hadoop using Mahout Machine Learning Algorithms

6. How can we capture and deliver data to the right people in real time?

7. How can we handle the variety of forms and data?

8. How can we store and analyze data given its size and our computational capacity?

Big Data Tools

Big Data Tools are Hadoop, Cloudera, Datameer, Splunk, Mahout, Hive, HBase, LucidWorks, R, MapR, Ubuntu and Linux flavors.

26

Applications

Social Networks and Relationships

Cyber-Physical Models

Internet of Things (IoT)

Retail Market

Retail Banking

Real Estate

Fraud detection and prevention

Telecommunications

Healthcare and Research

Automotive and production

Science and Research

Trading Analytics

27

Fig: Applications of Big Data Analytics

28


Contents

Spark Basics

RDD(Resilient Distributed Dataset)

Spark Transformations

Spark Actions

Spark with Machine Learning

Hands on Spark

Research Challenges in Spark

Note: the name "Spark" is also used by an unrelated free and open source Java web application micro-framework (an alternative to JAX-RS, the Play framework and Spring MVC). That project is not the subject of this talk; here, Spark means Apache Spark, the big data processing engine.

Apache Spark is a general-purpose cluster in-memory computing system.

Provides high-level APIs in Java, Scala and Python and an optimized engine that supports general execution graphs.

Provides various high level tools like Spark SQL for structured data processing, MLlib for Machine Learning and more….

39

40

Spark is a successor of MapReduce.

MapReduce is the 'heart' of Hadoop and consists of two parts: 'map' and 'reduce'.

Maps and reduces are programs for processing data.

 ‘Map’ processes the data first to give some intermediate output which is further processed by ‘Reduce’ to generate the final output.

Thus, MapReduce allows for distributed processing of the map and reduction operations.
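To make the map and reduce phases concrete, here is a minimal word-count sketch in Spark's Scala shell (the file name input.txt is a hypothetical placeholder):

val lines  = sc.textFile("input.txt")                                  // hypothetical input file
val pairs  = lines.flatMap(line => line.split(" ")).map(w => (w, 1))   // 'map' phase: emit (word, 1) pairs
val counts = pairs.reduceByKey((a, b) => a + b)                        // 'reduce' phase: sum the counts per word
counts.collect().foreach(println)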

41

Apache Spark is the next-generation Big Data tool, i.e. it is considered the future of Big Data and the successor of MapReduce.

Features of Spark are

Speed.

Usability.

In - Memory Computing.

Pillar to Sophisticated Analytics.

Real Time Stream Processing.

Compatibility with Hadoop & existing Hadoop Data.

Lazy Evaluation.

Active, progressive and expanding community.

42

Spark is a distributed data analytics engine, generalizing MapReduce.

Spark is a core engine, with streaming, SQL, Machine Learning and Graph processing modules.

43

Precisely

44

Why Spark

Spark is faster than MR when it comes to processing the same amount of data multiple times rather than unloading and loading new data.

Spark is simpler and usually much faster than MapReduce for the usual Machine learning and Data Analytics applications.
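A minimal sketch of why this matters for iterative workloads: caching an RDD keeps it in memory, so repeated passes avoid re-reading from disk (ratings.csv is a hypothetical file):

val data    = sc.textFile("ratings.csv").cache()       // hypothetical file; kept in memory after first use
val total   = data.count()                             // first pass reads from storage and caches
val flagged = data.filter(_.contains("NA")).count()    // later passes reuse the in-memory copy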

5 Reasons Why Spark Matters to Business

1. Spark enables use cases “traditional” Hadoop can’t handle.

2. Spark is fast

3. Spark can use your existing big data investment

4. Spark speaks SQL

5. Spark is developer-friendly

45


Spark Ecosystem: The Spark Ecosystem is still a work-in-progress, with some Spark components not even in their beta releases.

Components of Spark Ecosystem

The components of Spark ecosystem are getting developed and several contributions are being made every now and then.

Primarily, Spark Ecosystem comprises the following components:

1) Shark (SQL)

2) Spark Streaming (Streaming)

3) MLLib (Machine Learning)

4) GraphX (Graph Computation)

5) SparkR (R on Spark)

6) BlinkDB (Approximate SQL)

52

53

 Spark's official ecosystem consists of the following major components.

Spark DataFrames - Similar to a relational table

Spark SQL - Execute SQL queries or HiveQL

Spark Streaming - An extension of the core Spark API

 MLlib - Spark's machine learning library

GraphX - Spark for graphs and graph-parallel computation

Spark Core API - provides R, SQL, Python, Scala, Java

54


The MLlib library has implementations of various common machine learning algorithms (a brief usage sketch follows the list):

1. Clustering: K-means

2. Classification: Naïve Bayes, logistic regression, SVM

3. Decomposition: Principal Component Analysis (PCA) and Singular Value Decomposition (SVD)

4. Regression: Linear Regression

5. Collaborative Filtering: Alternating Least Squares (ALS) for recommendations
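As an illustration (not from the original slides), a minimal MLlib classification sketch in the Spark shell, using a tiny made-up training set:

import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.linalg.Vectors

// hypothetical two-point training set: labels 0.0 and 1.0, two features each
val training = sc.parallelize(Seq(
  LabeledPoint(0.0, Vectors.dense(0.0, 1.0)),
  LabeledPoint(1.0, Vectors.dense(2.0, 0.0))))
val model = new LogisticRegressionWithLBFGS().setNumClasses(2).run(training)
println(model.predict(Vectors.dense(1.8, 0.2)))   // predicted class for a new point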

58

Spark Ecosystem Components:

Spark Core: Spark Core is the base for parallel and distributed processing of huge datasets.

Spark SQL component: Spark SQL is the module in Apache Spark that is used to access structured and semi-structured data.

Spark SQL - built-in and user-defined functions: Spark SQL comes with built-in functions for column manipulation, and user-defined functions (UDFs) can be written in Scala.

Spark SQL - executing SQL queries or Hive queries: the results are returned in the form of a DataFrame, as in the sketch below.
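A minimal Spark SQL sketch, assuming the spark-shell (where a SparkSession named spark and its implicits are available; the sample rows are made up):

import spark.implicits._

val df = Seq(("Anil", 23), ("Bhanu", 31)).toDF("name", "age")    // hypothetical sample data
df.createOrReplaceTempView("people")
spark.sql("SELECT name FROM people WHERE age > 25").show()       // the result comes back as a DataFrame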

59

60

DataFrame: Similar to a relational table in Spark SQL; it is a distributed collection of tabular data organized into rows and named columns.

Datasets API: Dataset is a newer API added to Apache Spark that provides the benefits of RDDs while being strongly typed and declarative in nature.

Spark Streaming: Spark Streaming is a lightweight API that lets developers run streaming data applications (a minimal sketch follows this list).

Spark MLlib component: MLlib is Spark's machine learning (ML) library. Its goal is to make practical machine learning effective, scalable and easy.

GraphX: GraphX is a distributed graph processing framework built on top of Apache Spark.
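A minimal Spark Streaming sketch, assuming text lines arrive on a local socket (localhost:9999 is a placeholder); it counts words in 5-second batches:

import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc    = new StreamingContext(sc, Seconds(5))                  // 5-second micro-batches
val lines  = ssc.socketTextStream("localhost", 9999)               // hypothetical streaming source
val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
counts.print()
ssc.start()
ssc.awaitTermination()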

61

62

Language Support in Apache Spark

The Apache Spark ecosystem is built on top of the core execution engine, which has extensible APIs in different languages.

In a recent Spark Survey (2016), in which 62% of Spark users evaluated the Spark languages:

71% of the respondents were using Scala,

58% were using Python,

31% were using Java and

18% were using the R programming language.

63

1) Scala: The Spark framework is built on Scala, so programming in Scala for Spark can provide access to some of the latest and greatest features that might not be available in the other supported Spark languages.

2) Python: The Python language has excellent libraries for data analysis, like Pandas and Sci-Kit Learn, but is comparatively slower than Scala.

64

3) R: The R programming language has a rich environment for machine learning and statistical analysis, which helps increase developer productivity. Data scientists can now use R along with Spark, through SparkR, to process data that cannot be handled by a single machine.

4) Java: Java is verbose and does not support a REPL, but is definitely a good choice for developers coming from a Java + Hadoop background.

65

What is Scala?: Scala is a general-purpose programming language, which expresses the programming patterns in a concise, elegant, and type-safe way.

It is basically an acronym for "Scalable Language".

Scala is an easy-to-learn language and supports both Object-Oriented Programming and Functional Programming.

It is getting popular among programmers, and is being increasingly preferred over Java and other programming languages.

It is much in sync with present and future Big Data frameworks, like Scalding, Spark, Akka, etc.

66

Why is Spark Programmed in Scala?

 Scala is a pure object-oriented language, in which conceptually every value is an object and every operation is a method-call.

The language supports advanced component architectures through classes and traits.

 Scala is also a functional language . It supports functions, immutable data structures and gives preference to immutability over mutation.

Scala can be seamlessly integrated with Java

It is already being widely used for Big Data platforms and development of frameworks like Akka, Scalding, Play, etc.

 Being written in Scala, Spark can be embedded in any JVM-based operational system.

67

68

 Procedure: Spark Installation in Ubuntu

Apache Spark is a fast and general engine for large-scale data processing.

Apache Spark is a fast and general-purpose cluster computing system. It provides high-level APIs in Java, Scala, Python and R , and an optimized engine that supports general execution graphs.

It also supports a rich set of higher-level tools including Spark

SQL for SQL and structured data processing, MLlib for machine learning, GraphX for graph processing, and Spark Streaming.

Step 1: Installing Java.

java -version

Step 2: Installing Scala and SBT.

sudo apt-get update
sudo apt-get install scala

69

Step 3: Installing Maven plug-in.

The Maven plug-in is used to compile Java programs for Spark. Type the command below to install Maven.

sudo apt-get install maven

Step 4: Installing Spark.

Download the ".tgz" file of Spark, selecting a specific version, from http://spark.apache.org/downloads.html

Extract it and note the path where it is stored.

Edit the .bashrc file (terminal command: gedit .bashrc) by adding the lines below:

export SPARK_HOME=/path_to_spark_directory
export PATH=$SPARK_HOME/bin:$PATH

Replace path_to_spark_directory in the line above with the path of your Spark directory.

Reload .bashrc by saving and closing it and typing ". .bashrc" (or "source ~/.bashrc") in the terminal.

If it doesn't work, restart the system. Spark is now installed.

Type spark-shell in terminal to start spark shell.
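A quick smoke test once the shell is up (a minimal sketch; the exact result prompt may differ by version):

scala> val nums = sc.parallelize(1 to 100)
scala> nums.sum()
res0: Double = 5050.0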

70

Spark Installation on Windows

Step 1: Install Java (JDK)

Download and install java from https://java.com/en/download/

Step 2: Set java environment variable

Open "Control Panel", choose "System & Security" and select "System".

Select "Advanced System Settings" located at the top right.

Select "Environment Variables" from the pop-up.

Next select "New" under System variables (below); you will get a pop-up.

In the variable name field type JAVA_HOME.

In the variable value field provide the installation directory of Java, say C:\Program Files\Java\jdk1.8.0_25

Or you can simply choose the directory by selecting "Browse Directory".

Now close everything by choosing OK each time.

Check whether the Java variable is set by running javac in the command prompt. If we get the Java version details then we are done.

71

Step 3: Installing SCALA

Download the scala.msi file from https://www.scala-lang.org/download/

Set the Scala environment variable just like Java above:

Variable name = SCALA_HOME

Variable value = path to the Scala installation directory, say C:\Program Files (x86)\scala

Step 4: Installing SPARK

Download and extract spark from http://spark.apache.org/downloads.html

You can set SPARK_HOME just like java.

Note:-We can only run spark-shell at bin folder in spark folder on windows.

72

Step 5: Installing SBT. Download and install sbt.msi from http://www.scala-sbt.org/0.13/docs/Installing-sbt-on-Windows.html

Step 6: Installing Maven

Download Maven from http://maven.apache.org/download.cgi and unzip it to the folder where you want to install Maven.

Add both the M2_HOME and MAVEN_HOME variables in the Windows environment, and point them to your Maven folder.

Update the PATH variable, appending the Maven bin folder (%M2_HOME%\bin), so that you can run Maven's commands anywhere.

Test Maven by running mvn -version in the command prompt.

73

Practice on the Spark Framework with Transformations and Actions: You can run Spark using its standalone cluster mode, on EC2, on Hadoop YARN, or on Apache Mesos. Access data in HDFS, Cassandra, HBase, Hive, Tachyon, and any Hadoop data source.

Spark powers a stack of libraries including SQL and DataFrames, MLlib for machine learning, GraphX, and Spark Streaming. You can combine these libraries seamlessly in the same application.

74

RDD (Resilient Distributed Dataset) is the main logical data unit in Spark. An RDD is a distributed collection of objects. Quoting from the Learning Spark book: "In Spark all work is expressed as creating new RDDs, transforming existing RDDs, or calling operations on RDDs to compute a result."

Spark performs Transformations and Actions:

75

Resilient Distributed Datasets overcome the drawback of Hadoop MapReduce by allowing fault-tolerant 'in-memory' computations.

RDD in Apache Spark:

Why is RDD used to process the data?

What are the major features/characteristics of RDD (Resilient Distributed Datasets)?

Resilient Distributed Datasets are immutable, partitioned collections of records that can be operated on in parallel.

RDDs can contain any kind of objects - Python, Scala, Java or even user-defined class objects.

RDDs are usually created either by transformation of existing RDDs or by loading an external dataset from stable storage like HDFS or HBase, as in the sketch below.
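For example (a minimal sketch; the HDFS path is a placeholder):

val fromCollection = sc.parallelize(List(1, 2, 3, 4, 5))          // RDD from an in-memory collection
val fromStorage    = sc.textFile("hdfs:///user/demo/input.txt")   // RDD from external storage (hypothetical path)
val derived        = fromCollection.map(_ * 2)                    // new RDD by transforming an existing one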

76

Fig: Process of RDD Creation

77

Operations on RDDs

i) Transformations: Coarse-grained operations like join, union, filter or map on existing RDDs, which produce a new RDD with the result of the operation, are referred to as transformations. All transformations in Spark are lazy.

ii) Actions: Operations like count, first and reduce, which return values after computations on existing RDDs, are referred to as actions. A short example combining both follows.
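A minimal sketch combining both kinds of operations:

val nums    = sc.parallelize(Array(1, 2, 3, 4, 5))
val squares = nums.map(n => n * n)        // transformation: builds a new RDD lazily
val first   = squares.first()             // action: returns 1
val total   = squares.reduce(_ + _)       // action: returns 55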

78

Properties / Traits of RDD:

Immutable (read-only; can't be changed or modified): data is safe to share across processes.

Partitioned: the partition is the basic unit of parallelism in an RDD.

Coarse-grained operations: operations are applied to all elements in the dataset through map, filter or group-by operations.

Actions/Transformations: all computations on RDDs are either actions or transformations.

Fault tolerant: as the name "Resilient" says, an RDD can reconcile, recover or get back all of its data using the lineage graph.

Cacheable: it can hold data in persistent storage.

Persistence: the option of choosing which storage will be used, either in-memory or on-disk.

79

80

How Spark Works - RDD Operations

81

 Task 1: Practice on Spark Transformations i.e. map(), filter(), flatmap(), groupBy(), groupByKey(), sample(), union(), join(), distinct(), keyBy(), partitionBy and zip().

RDD (Resilient Distributed Dataset) is main logical data unit in Spark . An RDD is distributed collection of objects. ... Quoting from Learning Spark book, "In Spark all work is expressed as creating new RDDs, transforming existing RDDs, or calling operations on RDDs to compute a result.“

Transformations are lazy evaluated operations on RDD that create one or many new RDDs, e.g. map, filter, reduceByKey, join, cogroup, randomSplit.

 Transformations are lazy, i.e. are not executed immediately.

Transformations can be executed only when actions are called.
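A minimal sketch of this laziness:

val words    = sc.parallelize(Seq("spark", "hadoop", "spark"))
val filtered = words.filter(_ == "spark")   // nothing is executed yet: transformations are lazy
val n        = filtered.count()             // the count() action triggers execution; n = 2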

82

 Transformations are lazy operations on a RDD that create one or many new RDDs.

Ex: map , filter , reduceByKey , join , cogroup , randomSplit .

 In other words, transformations are functions that take a RDD as the input and produce one or many RDDs as the output.

RDDs allow you to create dependencies between RDDs.

Dependencies are the steps for producing results, i.e. a program.

Each RDD in the lineage chain (the string of dependencies) has a function for operating on its data and a pointer (dependency) to its ancestor RDD.

Spark will divide RDD dependencies into stages and tasks and then send those to workers for execution.
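One way to see the lineage and stage boundaries is toDebugString (a minimal sketch):

val base   = sc.parallelize(1 to 10)
val staged = base.map(_ * 2).filter(_ > 5).map(x => (x % 3, x)).reduceByKey(_ + _)
println(staged.toDebugString)   // prints the RDD lineage; the reduceByKey shuffle starts a new stage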

83

84

1.map(): Pass each element of the RDD through the supplied function.

val x = sc.parallelize(Array("b", "a", "c"))
val y = x.map(z => (z, 1))
println(y.collect().mkString(", "))

Output: y: [('b', 1), ('a', 1), ('c', 1)]

85

2.filter(): Filter creates a new RDD by passing in the supplied function used to filter the results.

val x = sc.parallelize(Array(1, 2, 3))
val y = x.filter(n => n % 2 == 1)
println(y.collect().mkString(", "))

Output: y: [1, 3]

86

3.flatmap() : Similar to map, but each input item can be mapped to 0 or more output items.

val x = sc.parallelize(Array(1, 2, 3))
val y = x.flatMap(n => Array(n, n * 100, 42))
println(y.collect().mkString(", "))

Output: y: [1, 100, 42, 2, 200, 42, 3, 300, 42]

87

4.groupBy() : Groups the elements of the RDD by the result of the supplied function, returning a dataset of (K, Iterable<V>) pairs.

val x = sc.parallelize(Array("John", "Fred", "Anna", "James"))
val y = x.groupBy(w => w.charAt(0))
println(y.collect().mkString(", "))

Output: y: [('A',['Anna']),('J',['John','James']),('F',['Fred'])]

88

5.groupByKey() : When called on a dataset of (K, V) pairs, returns a dataset of (K, Iterable<V>) pairs.

val x = sc.parallelize(Array(('B', 5), ('B', 4), ('A', 3), ('A', 2), ('A', 1)))
val y = x.groupByKey()
println(y.collect().mkString(", "))

Output: y: [('A', [3, 2, 1]),('B',[5, 4])]

89

6.sample() : Return a random sample subset RDD of the input RDD.

val x = sc.parallelize(Array(1, 2, 3, 4, 5))
val y = x.sample(false, 0.4)   // omitting the seed will yield different output
println(y.collect().mkString(", "))

Output: y: [1, 3]

90

7.union() : Simple. Return the union of two RDDs.

val x = sc.parallelize(Array(1, 2, 3), 2)
val y = sc.parallelize(Array(3, 4), 1)
val z = x.union(y)
val zOut = z.glom().collect()

Output z: [[1], [2, 3], [3, 4]]

91

8.join() : If you have relational database experience, this will be easy. It’s joining of two datasets.

val x = sc.parallelize(Array(("a", 1), ("b", 2)))
val y = sc.parallelize(Array(("a", 3), ("a", 4), ("b", 5)))
val z = x.join(y)
println(z.collect().mkString(", "))

Output z: [('a', (1, 3)), ('a', (1, 4)), ('b', (2, 5))]

92

9.distinct() : Return a new RDD with distinct elements within a source RDD.

val x = sc.parallelize(Array(1, 2, 3, 3, 4))
val y = x.distinct()
println(y.collect().mkString(", "))

Output: y: [1, 2, 3, 4]

93

10) keyBy() : Constructs two-component tuples (key-value pairs) by applying a function on each data item.

val x = sc.parallelize(Array("John", "Fred", "Anna", "James"))
val y = x.keyBy(w => w.charAt(0))
println(y.collect().mkString(", "))

Output: y: [('J','John'),('F','Fred'),('A','Anna'),('J','James')]

94

11) partitionBy() : Repartitions as key-value RDD using its keys. The partitioner implementation can be supplied as the first argument.

import org.apache.spark.Partitioner

val x = sc.parallelize(Array(('J', "James"), ('F', "Fred"), ('A', "Anna"), ('J', "John")), 3)
val y = x.partitionBy(new Partitioner() {
  val numPartitions = 2
  def getPartition(k: Any) = {
    if (k.asInstanceOf[Char] < 'H') 0 else 1
  }
})
val yOut = y.glom().collect()

Output:y: Array(Array((F,Fred), (A,Anna)), Array((J,John), (J,James)))

95

12) zip() : Joins two RDDs by combining the i-th of either partition with each other.

val x = sc.parallelize(Array(1, 2, 3))
val y = x.map(n => n * n)
val z = x.zip(y)
println(z.collect().mkString(", "))

Output: z: [(1, 1), (2, 4), (3, 9)]

96

Task 2: Practice on Spark Actions i.e. getNumPartitions(), collect(), reduce(), aggregate(), max(), sum(), mean(), stdev(), countByKey().

RDD (Resilient Distributed Dataset) is main logical data unit in Spark . An RDD is distributed collection of objects. ... Quoting from Learning Spark book, "In Spark all work is expressed as creating new RDDs, transforming existing RDDs, or calling operations on RDDs to compute a result.“

Actions return the final result of RDD computations / operations.

An action produces a value back to the Spark driver program. It may trigger a previously constructed, lazy RDD to be evaluated.

Action functions materialize a value in a Spark program. So basically, an action is an RDD operation that returns a value of any type other than RDD[T].

97

Actions: Unlike Transformations which produce RDDs, action functions produce a value back to the Spark driver program.

Actions may trigger a previously constructed, lazy RDD to be evaluated.

1. getNumPartitions()

2. collect()

3. reduce()

4. aggregate ()

5. mean()

6. sum()

7. max()

8. stdev()

9. countByKey()

98

1) collect() : collect returns the elements of the dataset as an array back to the driver program.

val x = sc.parallelize(Array(1, 2, 3), 2)
val y = x.collect()

Output: y: [1, 2, 3]

99

2) reduce() : Aggregate the elements of a dataset through function.

val x = sc.parallelize(Array(1, 2, 3, 4))
val y = x.reduce((a, b) => a + b)

Output: y: 10

100

3) aggregate(): The aggregate function allows the user to apply two different reduce functions to the RDD.

val inputrdd = sc.parallelize(List(("maths", 21), ("english", 22), ("science", 31)), 3)
val result = inputrdd.aggregate(3)((acc, value) => (acc + value._2), (acc1, acc2) => (acc1 + acc2))

Partition 1 : Sum(all Elements) + 3 (Zero value)

Partition 2 : Sum(all Elements) + 3 (Zero value)

Partition 3 : Sum(all Elements) + 3 (Zero value)

Result = Partition1 + Partition2 + Partition3 + 3(Zero value)

So we get 21 + 22 + 31 + (4 * 3) = 86

 Output: y: Int = 86

101

4) max() : Returns the largest element in the RDD.

val x = sc.parallelize(Array(2, 4, 1))
val y = x.max()

Output: y: 4

102

5) count() : Number of elements in the RDD.

val x = sc.parallelize(Array("apple", "beatty", "beatrice"))
val y = x.count()

Output: y: 3

103

6) sum() : Sum of the RDD.

val x = sc.parallelize(Array(2, 4, 1))
val y = x.sum()

Output: y: 7.0

104

7) mean() : Mean of given RDD.

val x = sc.parallelize(Array(2, 4, 1))
val y = x.mean()

Output: y: 2.3333333

105

8) stdev() : An aggregate function that computes the standard deviation of a set of numbers.

val x = sc.parallelize(Array(2, 4, 1))
val y = x.stdev()

Output: y: 1.2472191

106

9) countByKey() : This is only available on RDDs of (K,V) and returns a hashmap of (K, count of K).

val x = sc.parallelize(Array(('J', "James"), ('F', "Fred"), ('A', "Anna"), ('J', "John")))
val y = x.countByKey()

Output: y: {'A': 1, 'J': 2, 'F': 1}

107

10) getNumPartitions()

val x = sc.parallelize(Array(1, 2, 3), 2)
val y = x.partitions.size

Output: y: 2

108

1. Spark is initially developed by which university

Ans) UC Berkeley

2. What are the characteristics of Big Data?

Ans) Volume, Velocity and Variety

3. The main focus of Hadoop ecosystem is on

Ans ) Batch Processing

4. Streaming data tools available in Hadoop ecosystem are?

Ans ) Apache Spark and Storm

5. Spark has API's in?

Ans ) Java, Scala, R and Python

6. Which kind of data can be processed by spark?

Ans) Stored Data and Streaming Data

7. Spark can store its data in?

Ans) HDFS, MongoDB and Cassandra

109

8. How spark engine runs?

Ans) Integrating with Hadoop and Standalone

9. In spark data is represented as?

Ans ) RDDs

10. Which kind of data can be handled by Spark ?

Ans) Structured, Unstructured and Semi-Structured

11.Which among the following are the challenges in Map reduce?

Ans) Every Problem has to be broken into Map and

Reduce phase

Collection of Key / Value pairs

High Throughput

110

12. Apache spark is a framework with?

Ans) Scheduling, Monitoring and Distributing Applications

13. Which of the features of Apache spark

Ans) DAG, RDDs and In- Memory processing

14) How much faster is the processing in Spark when compared to Hadoop?

Ans) 10-100x

15) In spark data is represented as?

Ans) RDDs

16) List of Transformations

Ans) map(), filter(), flatmap(), groupBy(), groupByKey(), sample(), union(), join(), distinct(), keyBy(), partitionBy and zip().

17) List of Actions

Ans) getNumPartitions(), collect(), reduce(), aggregate(), max(), sum(), mean(), stdev(), countByKey().

111

18. Spark is developed in

Ans) Scala

19. Which type of processing Apache Spark can handle

Ans) Batch Processing, Interactive Processing, Stream Processing and Graph Processing

20. List two statements of Spark

Ans) Spark can run on the top of Hadoop

Spark can process data stored in HDFS

Spark can use Yarn as resource management layer

21. Spark's core is a batch engine? True OR False

Ans) True

22) Spark is 100x faster than MapReduce due to

Ans) In-Memory Computing

23) MapReduce program can be developed in spark? T / F

Ans) True

112

24. Programming paradigm used in Spark

Ans) Generalized

25. Spark Core Abstraction

Ans) RDD

26. Choose correct statement about RDD

Ans) RDD is a distributed data structure

27. RDD is

Ans) Immutable, Recomputable and Fault-tolerant

28. RDD operations

Ans) Transformation, Action and Caching

29. We can edit the data of RDD like conversion to uppercase? T/F

Ans) False

30. Identify correct transformation

Ans) Map, Filter and Join

113

31. Identify Correct Action

Ans) Reduce

32. Choose correct statement

Ans) Execution starts with the call of Action

33. Choose correct statement about Spark Context

Ans) Interact with cluster manager and Specify spark how to access cluster

34. Spark cache the data automatically in the memory as and when needed? T/F

Ans) False

35. For resource management spark can use

Ans) Yarn, Mesos and Standalone cluster manager

36. RDD can not be created from data stored on

Ans) Oracle

37. RDD can be created from data stored on

Ans) Local FS, S3 and HDFS

114

38. Who is father of Big data Analytics

Doug Cutting

39. What are major Characteristics of Big Data

 Volume, Velocity and Variety(3 V’s)

40. What is Apache Hadoop

Open-source Software Framework

41. Who developed Hadoop

Doug Cutting

42. Hadoop supports which programming framework

 Java

115

43. What is the heart of Hadoop

MapReduce

44. What is MapReduce

 Programming Model for Processing Large Data Sets.

45. What are the Big Data Dimensions

4 V’s

46. What is the caption of Volume

 Data at Scale

47. What is the caption of Velocity

Data in Motion

116

48. What is the caption of Variety

Data in many forms

49. What is the caption of Veracity

 Data Uncertainty

50. What is the biggest Data source for Big Data

Transactions

51. What is the biggest Analytic capability for Big Data

 Query and Reporting

52. What is the biggest Infrastructure for Big Data

Information integration

117

53. What are the Big Data Adoption Stages

Educate, Explore, Engage and Execute

54. What is Mahout

Algorithm library for scalable machine learning on Hadoop

55. What is Pig

 Creating MapReduce programs used with Hadoop.

56. What is HBase

Non-Relational Database

57. What is the biggest Research Challenge for Big Data

Heterogeneity , Incompleteness and Security

118

58. What is Sqoop

Transferring bulk data between Hadoop and structured data stores.

59. What is Oozie

Workflow scheduler system to manage Hadoop jobs.

60. What is Hue

Web interface that supports Apache Hadoop and its ecosystem

61. What is Avro

Avro is a data serialization system.

62. What is Giraph

Iterative graph processing system built for high scalability.

63. What is Cassandra

Cassandra does not support joins or sub queries, except for batch analysis via Hadoop

64. What is Chukwa

Chukwa is an open source data collection system for monitoring large distributed systems

65. What is Hive

Hive is a data warehouse on Hadoop

66. What is Apache drill

Apache Drill is a distributed system for interactive analysis of large-scale datasets.

67. What is HDFS

Hadoop Distributed File System ( HDFS )

68. Facebook generates how much data per day

25TB

69.

What is BIG DATA?

Big Data is nothing but an assortment of such a huge and complex data that it becomes very tedious to capture, store, process, retrieve and analyze it with the help of on-hand database management tools or traditional data processing techniques.

70. What is HUE expansion

 Hadoop User Interface

71.Can you give some examples of Big Data?

There are many real-life examples of Big Data! Facebook is generating 500+ terabytes of data per day, NYSE (New York Stock Exchange) generates about 1 terabyte of new trade data per day, and a jet airline collects 10 terabytes of sensor data for every 30 minutes of flying time.

72. Can you give a detailed overview about the Big Data being generated by Facebook?

As of December 31, 2012, there are 1.06 billion monthly active users on Facebook and 680 million mobile users. On average, 3.2 billion likes and comments are posted every day on Facebook. 72% of the web audience is on Facebook. And why not! There are so many activities going on Facebook, from wall posts, sharing images and videos, to writing comments and liking posts.

122

73. What are the three characteristics of Big Data?

The three characteristics of Big Data are: Volume : Facebook generating 500+ terabytes of data per day .

Velocity : Analyzing 2 million records each day to identify the reason for losses.

Variety : images, audio, video, sensor data, log files, etc.

74. How Big is ‘Big Data’?

With time, data volume is growing exponentially. Earlier we used to talk about megabytes or gigabytes. But the time has arrived when we talk about data volume in terms of terabytes, petabytes and also zettabytes! Global data volume was around 1.8 ZB in 2011 and is expected to be 7.9 ZB in 2015.

123

75. How analysis of Big Data is useful for organizations?

Effective analysis of Big Data provides a lot of business advantage as organizations will learn which areas to focus on and which areas are less important.

76. Who are ‘Data Scientists’?

Data scientists are experts who find solutions to analyze data. Just as web analysis, we have data scientists who have good business insight as to how to handle a business challenge.

124

77. What is Hadoop?

Hadoop is a framework that allows for distributed processing of large data sets across clusters of commodity computers using a simple programming model.

78. Why the name ‘Hadoop’?

Hadoop doesn't have any expanded form the way 'OOPS' does. The charming yellow elephant you see is basically named after Doug's son's toy elephant!

79. Why do we need Hadoop?

Everyday a large amount of unstructured data is getting dumped into our machines.

125

80. What are some of the characteristics of Hadoop framework?

The Hadoop framework is written in Java. It is designed to solve problems that involve analyzing large data (e.g. petabytes). The programming model is based on Google's MapReduce, and the infrastructure is based on Google's Big Data and Distributed File System.

81. Give a brief overview of Hadoop history.

In 2002, Doug Cutting created an open source web crawler project.

In 2004, Google published the MapReduce and GFS papers.

In 2006, Doug Cutting developed the open source MapReduce and HDFS project. In 2008, Yahoo ran a 4,000-node Hadoop cluster and Hadoop won the terabyte sort benchmark.

In 2009, Facebook launched SQL support for Hadoop.

126

82. Give examples of some companies that are using Hadoop structure?

A lot of companies are using the Hadoop structure, such as Cloudera, EMC, MapR, Hortonworks, Amazon, Facebook, eBay, Twitter, Google and so on.

83. What is the basic difference between traditional RDBMS and

Hadoop?

RDBMS is used for transactional systems to report and archive the data, whereas Hadoop is an approach to store huge amounts of data in a distributed file system and process it.

RDBMS will be useful when you want to seek one record from big data, whereas Hadoop will be useful when you want big data in one shot and perform analysis on that later.

127

84. What is structured and unstructured data?

 Structured data is the data that is easily identifiable as it is organized in a structure. The most common form of structured data is a database where specific information is stored in tables, that is, rows and columns.

Unstructured data refers to any data that cannot be identified easily. It could be in the form of images, videos, documents, email, logs and random text.

85. What are the core components of Hadoop?

Core components of Hadoop are HDFS and MapReduce.

HDFS is basically used to store large data sets and

MapReduce is used to process such large data sets.

128

86. What is HDFS?

HDFS is a file system designed for storing very large files with streaming data access patterns, running clusters on commodity hardware.

87. What are the key features of HDFS?

HDFS is highly fault-tolerant, with high throughput, suitable for applications with large data sets, streaming access to file system data and can be built out of commodity hardware.

129

88. What is Fault Tolerance?

Suppose you have a file stored in a system, and due to some technical problem that file gets destroyed. Then there is no chance of getting the data back present in that file.

89. Replication causes data redundancy then why is pursued in HDFS?

HDFS works with commodity hardware (systems with average configurations) that has high chances of getting crashed any time. Thus, to make the entire system highly fault-tolerant, HDFS replicates and stores data in different places.

130

90. Since the data is replicated thrice in HDFS, does it mean that any calculation done on one node will also be replicated on the other two?

Since there are 3 nodes, when we send the MapReduce programs, calculations will be done only on the original data. The master node will know which node exactly has that particular data.

91. What is throughput? How does HDFS get a good throughput?

Throughput is the amount of work done in a unit time. It describes how fast the data is getting accessed from the system and it is usually used to measure performance of the system .

131

92. What is streaming access?

As HDFS works on the principle of 'Write Once, Read Many', the feature of streaming access is extremely important in HDFS. HDFS focuses not so much on storing the data but on how to retrieve it at the fastest possible speed, especially while analyzing logs.

93. What is a commodity hardware? Does commodity hardware include RAM?

Commodity hardware is a non-expensive system which is not of high quality or high-availability. Hadoop can be installed in any average commodity hardware.

132

94. What is a Name node?

Name node is the master node on which job tracker runs and consists of the metadata. It maintains and manages the blocks which are present on the data nodes.

95. Is Name node also a commodity?

No.

Name node can never be a commodity hardware because the entire HDFS rely on it. It is the single point of failure in HDFS. Name node has to be a high-availability machine.

96. What is a metadata?

Metadata is the information about the data stored in data nodes such as location of the file, size of the file and so on.

133

97. What is a Data node?

Data nodes are the slaves which are deployed on each machine and provide the actual storage.

These are responsible for serving read and write requests for the clients.

98. Why do we use HDFS for applications having large data sets and not when there are lot of small files?

HDFS is more suitable for large amount of data sets in a single file as compared to small amount of data spread across multiple files.

134

99. What is a daemon?

Daemon is a process or service that runs in background. In general, we use this word in UNIX environment.

100. What is a job tracker?

Job tracker is a daemon that runs on a name node for submitting and tracking MapReduce jobs in

Hadoop. It assigns the tasks to the different task tracker.

135

101. What is a task tracker?

Task tracker is also a daemon that runs on data nodes.

Task Trackers manage the execution of individual tasks on slave node.

102. Is Name node machine same as data node machine as in terms of hardware?

It depends upon the cluster you are trying to create. The

Hadoop VM can be there on the same machine or on another machine.

136

103. What is a heartbeat in HDFS?

A heartbeat is a signal indicating that it is alive. A data node sends heartbeat to Name node and task tracker will send its heart beat to job tracker.

104. Are Name node and job tracker on the same host?

No, in practical environment, Name node is on a separate host and job tracker is on a separate host.

105. What is a ‘block’ in HDFS?

A ‘block’ is the minimum amount of data that can be read or written. In HDFS, the default block size is 64 MB as contrast to the block size of 8192 bytes in Unix/Linux.

106. What are the benefits of block transfer?

A file can be larger than any single disk in the network.

Blocks provide fault tolerance and availability.

107. If we want to copy 10 blocks from one machine to another, but the other machine can copy only 8.5 blocks, can the blocks be broken at the time of replication?

In HDFS, blocks cannot be broken down. Before copying the blocks from one machine to another, the master node will figure out what the actual amount of space required is, how many blocks are being used and how much space is available, and it will allocate the blocks accordingly.

138

108. How indexing is done in HDFS?

Hadoop has its own way of indexing. Depending upon the block size, once the data is stored, HDFS will keep on storing the last part of the data which will say where the next part of the data will be.

109. If a data Node is full how it’s identified?

When data is stored in data node, then the metadata of that data will be stored in the Name node. So Name node will identify if the data node is full.

110.If data nodes increase, then do we need to upgrade

Name node?

While installing the Hadoop system, Name node is determined based on the size of the clusters.

139

111.Are job tracker and task trackers present in separate machines?

Yes, the job tracker and task trackers are present on different machines. The reason is that the job tracker is a single point of failure for the Hadoop MapReduce service.

112. When we send a data to a node, do we allow settling in time, before sending another data to that node?

Yes, we do.

113. Does Hadoop always require digital data to process?

Yes. Hadoop always require digital data to be processed.

114. On what basis Name node will decide which data node to write on?

As the Name node has the metadata (information) related to all the data nodes, it knows which data node is free.

140

115. Doesn’t Google have its very own version of DFS?

Yes, Google owns a DFS known as the "Google File System (GFS)", developed by Google Inc. for its own use.

116. Who is a ‘user’ in HDFS?

A user is like you or me, who has some query or who needs some kind of data.

117. Is client the end user in HDFS?

No, Client is an application which runs on your machine, which is used to interact with the Name node (job tracker) or data node (task tracker).

118. What is the communication channel between client and name node/ data node?

The mode of communication is SSH.

141

Relational DB's vs. Big Data (Spark)

Relational DB's | Big Data (Spark)
1. Deals with gigabytes to terabytes | Deals with petabytes to zettabytes
2. Centralized | Distributed
3. Deals with structured data | Deals with semi-structured and unstructured data
4. Stable data model | Unstable data model
5. Known, complex inter-relationships | Flat schemas, few inter-relationships
6. Tools: relational DB's - SQL, MySQL, DB2 | Tools: Hadoop, R, Mahout
7. Access: interactive and batch | Access: batch
8. Updates: read and write many times | Updates: write once, read many times
9. Integrity: high | Integrity: low
10. Scaling: nonlinear | Scaling: linear

HADOOP vs. SPARK

Performance: Hadoop processes data on disk | Spark processes data in-memory
Ease of use: Hadoop - need to be proficient in Java and MapReduce | Spark - Java, Scala, R, Python; more expressive and intelligent
Data processing: Hadoop needs other platforms for streaming and graphs | Spark has MLlib, GraphX and Streaming built in
Failure tolerance: Hadoop continues from the point it left off | Spark starts the processing from the beginning
Cost: Hadoop - hard disk space cost | Spark - memory space cost
Run: Hadoop runs everywhere | Spark runs on Hadoop
(HDFS uses MapReduce to process and analyse data; MapReduce takes a backup of all the data on a physical server after each operation, whereas Spark keeps data in RAM. This is called in-memory operation.)

Speed: Hadoop works slower than Spark | Spark works up to 100x faster than Hadoop for in-memory batch processing, and 10x faster on disk
Version: Hadoop 2.6.5 release notes | Spark 2.3.1
Software: Hadoop is an open source s/w for reliable, scalable, distributed computing; it is a Big Data tool | Spark is a fast and general engine for large-scale data processing; it is a Big Data tool
Execution engine: Spark has an advanced DAG execution engine, with support for acyclic data flow and in-memory computing
Hardware cost: Hadoop - more | Spark - less
Library: Hadoop - external machine learning lib | Spark - internal machine learning lib

Recovery: Hadoop - easier than Spark; checkpoints are present | Spark - failure recovery is more difficult but still good
File management system: Hadoop has its own FMS (HDFS) | Spark does not come with its own FMS; it supports cloud-based data platforms, and Spark was designed for Hadoop
Support: Hadoop - HDFS, Hadoop YARN, Apache Mesos | Spark - RDDs; Cassandra, HBase, Hive, Tachyon and any Hadoop source
Use places: Hadoop - supports all systems in the Hadoop stack, processed as a batch system; marketing analysis, computing analysis, cyber-security analytics | Spark - online products
Run: Hadoop - clusters, databases, servers | Spark - cloud-based systems and datasets

Reach me @

Srinivasulu Asadi: srinu_asadi@yahoo.com, +91-9490246442


Fig: International Conference on Emerging Research In Computing, Information, Communication and Applications - ERCICA-2014.

Fig: Keynote Speaker @GITM, Proddatur, 2016

Fig: Resource Person for Workshop @MITS, Madanapalle, 2016

157

Fig: Keynote Speaker @ SITM, Renigunta, 2017

Thank You

Business Intelligence


CLUSTERING

What is Clustering: Clustering is unsupervised learning, i.e. there are no predefined classes; it produces groups of similar objects that differ significantly from other objects.

The process of grouping a set of physical or abstract objects into classes of similar objects is called clustering.

Clustering is “the process of organizing objects into groups whose members are similar in some way”.

The cluster property is that intra-cluster distances are minimized and inter-cluster distances are maximized.

What is Good Clustering: A good clustering method will produce high quality clusters with

 high intra-class similarity

 low inter-class similarity

162

1. Nominal Variables allow for only qualitative classification. A generalization of the binary variable in that it can take more than 2 states, e.g., red, yellow, blue, green.

Ex: {male, female}, {yes, no}, {true, false}

2. Ordinal Data are categorical data where there is a logical ordering to the categories.

Ex: 1 = Strongly disagree; 2 = Disagree; 3 = Neutral; 4 = Agree

3. Categorical Data represent types of data which may be divided into groups.

Ex: race, sex, age group, and educational level.

4. Labeled Data share the class labels or the generative distribution of the data.

5. Unlabeled Data do not share the class labels or the generative distribution of the labeled data.

6. Numerical Values: the data values are purely numerical.

Ex: 1, 2, 3, 4, ...

7. Interval-valued Variables: variables whose values are ranges of numbers.

Ex: 10-20, 20-30, 30-40, ...

8. Binary Variables: variables that are combinations of 0 and 1.

Ex: 1, 0, 001, 010, ...

9. Ratio-Scaled Variables: a positive measurement on a nonlinear scale, approximately at exponential scale, such as Ae^Bt or Ae^-Bt.

Ex: 1/2, 2/4, 4/8, ...

10. Variables of Mixed Types: a database may contain all six types of variables - symmetric binary, asymmetric binary, nominal, ordinal, interval and ratio.

Ex: 11121A1201

164

Similarity Measure

Euclidean distance: Distances are normally used to measure the similarity or dissimilarity between two data objects.

Euclidean distance: Euclidean distance is the distance between two points in Euclidean space.
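A minimal Scala sketch of the Euclidean distance between two points given as coordinate sequences:

def euclidean(p: Seq[Double], q: Seq[Double]): Double =
  math.sqrt(p.zip(q).map { case (a, b) => (a - b) * (a - b) }.sum)

euclidean(Seq(1.0, 2.0), Seq(4.0, 6.0))   // = 5.0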

Major Clustering Approaches

1. Partitioning Methods

2. Hierarchical Methods

3. Density-Based Methods

4. Grid-Based Methods

5. Model-Based Clustering Methods

6. Clustering High-Dimensional Data

7. Constraint-Based Cluster Analysis

8. Outlier Analysis

166

Examples of Clustering Applications

1. Marketing

2. Land use

3. Insurance

4. City-planning

5. Earth-quake studies

Issues of Clustering

1. Accuracy,

2. Training time,

3. Robustness,

4. Interpretability, and

5. Scalability

6. Find top ‘n’ outlier points

167

Applications

 Pattern Recognition

Spatial Data Analysis

GIS(Geographical Information System)

Cluster Weblog data to discover groups

 Credit approval

 Target marketing

Medical diagnosis

Fraud detection

Weather forecasting

 Stock Marketing

168

2. Classification vs. Clustering

1. Classification is "the process of organizing objects into groups whose members are not necessarily similar." | Clustering is "the process of organizing objects into groups whose members are similar in some way."

2. It is Supervised Learning. | It is Unsupervised Learning.

3. Predefined classes. | No predefined classes.

4. Has labels for some points. | No labels in clustering.

5. Requires a "rule" that will accurately assign labels to new points. | Groups points into clusters based on how "near" they are to one another.

6. (Classification figure) | (Clustering figure)

7. Classification approaches are of two types: 1. Predictive Classification, 2. Descriptive Classification. | Clustering approaches are eight: 1. Partitioning Methods, 2. Hierarchical Methods, 3. Density-Based Methods, 4. Grid-Based Methods, 5. Model-Based Clustering Methods, 6. Clustering High-Dimensional Data, 7. Constraint-Based Cluster Analysis, 8. Outlier Analysis.

8. Issues of Classification: Accuracy, Training time, Robustness, Interpretability, and Scalability. | Issues of Clustering: Accuracy, Training time, Robustness, Interpretability, Scalability, and finding the top 'n' outlier points.

9. Examples: Marketing, Land use, Insurance, City-planning, Earthquake studies. | Examples: Marketing, Land use, Insurance, City-planning, Earthquake studies.

10. Techniques: 1. Decision Tree, 2. Bayesian classification, 3. Rule-based classification, 4. Prediction and accuracy/error measures. | Techniques: 1. K-Means Clustering, 2. DIANA (DIvisive ANAlysis), 3. AGNES (AGglomerative NESting), 4. BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies), 5. DBSCAN (Density-Based Spatial Clustering of Applications with Noise).

11. Applications: Credit approval, Target marketing, Medical diagnosis, Fraud detection, Weather forecasting, Stock marketing. | Applications: Pattern recognition, Spatial data analysis, WWW (World Wide Web), Weblog data to discover groups, Credit approval, Target marketing, Medical diagnosis, Fraud detection, Weather forecasting, Stock marketing.

k-Means Clustering

It is a Partitioning cluster technique.

It is a Centroid-Based cluster technique

Clustering is a Unsupervised learning i.e. no predefined classes, Group of similar objects that differ significantly from other objects.

d(i, j) = \sqrt{|x_{i1} - x_{j1}|^2 + |x_{i2} - x_{j2}|^2 + \cdots + |x_{ip} - x_{jp}|^2}

It then creates the first k initial clusters (k= number of clusters needed) from the dataset by choosing k rows of data randomly from the dataset.

The k-Means algorithm calculates the Arithmetic Mean of each cluster formed in the dataset.

174

Square-error criterion:

E = \sum_{i=1}^{k} \sum_{p \in C_i} |p - m_i|^2

where E is the sum of the square error for all objects in the data set; p is the point in space representing a given object; and m_i is the mean of cluster C_i (both p and m_i are multidimensional).

Algorithm: The k-means algorithm for partitioning, where each cluster’s center is represented by the mean value of the objects in the cluster.

Input:

– k: the number of clusters,

D: a data set containing n objects.

Output: A set of k clusters.

175

k-Means Clustering Method - Example (k = 2)

1. Arbitrarily choose k objects as the initial cluster centers.
2. Assign each object to the most similar center.
3. Update the cluster means.
4. Reassign objects and update the cluster means again, repeating until assignments no longer change.

Fig: Clustering of a set of objects based on the k-means method. (The mean of each cluster is marked by a "+".)

177

Steps

k - Means algorithm is implemented in four steps:

1. Partition objects into k nonempty subsets.

2. Compute seed points as the centroids of the clusters of the current partition (the centroid is the center, i.e., mean point, of the cluster).

3. Assign each object to the cluster with the nearest seed point.

4. Go back to Step 2; stop when there are no more new assignments.
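A minimal k-means sketch with Spark MLlib (the four sample points are made up; KMeans.train takes the data, k and the number of iterations):

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

val points = sc.parallelize(Seq(
  Vectors.dense(1.0, 1.0), Vectors.dense(1.5, 2.0),     // hypothetical 2-D sample points
  Vectors.dense(8.0, 8.0), Vectors.dense(9.0, 8.5)))
val model = KMeans.train(points, 2, 20)                  // k = 2 clusters, up to 20 iterations
model.clusterCenters.foreach(println)                    // the two cluster means
println(model.computeCost(points))                       // sum of squared distances (the square-error criterion E)
println(model.predict(Vectors.dense(0.8, 1.1)))          // cluster index assigned to a new point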

179

Issues of Clustering

1. Accuracy
2. Training time
3. Robustness
4. Interpretability
5. Scalability
6. Find top 'n' outlier points

Examples of Clustering Applications

1. Marketing
2. Land use
3. Insurance
4. City-planning
5. Earth-quake studies

180

Applications

Pattern Recognition

Spatial Data Analysis

GIS(Geographical Information System)

Image Processing

WWW (World Wide Web)

 Cluster Weblog data to discover groups

Credit approval

Target marketing

Medical diagnosis

Fraud detection

Weather forecasting

 Stock Marketing

181

Classification and Prediction

Classification and Prediction: Classification is supervised learning, i.e. we can predict input and output values; dividing data into groups that do not necessarily have similar properties is called Classification.

Classification is a two step process

1.

Build the Classifier/ Model

2.

Use Classifier for Classification.

1. Build the Classifier / Model: Describing a set of predetermined classes

Each tuple / sample is assumed to belong to a predefined class, as determined by the class label attribute.

Also called as Learning phase or training phase.

The set of tuples used for model construction is training set.

The model is represented as classification rules, decision trees, or mathematical formulae.

182

Classification algorithms + training data produce the classifier (model).

Training data:

NAME | RANK | YEARS | TENURED
Mike | Assistant Prof | 3 | no
Mary | Assistant Prof | 7 | yes
Bill | Professor | 2 | yes
Jim | Associate Prof | 7 | yes
Dave | Assistant Prof | 6 | no
Anne | Associate Prof | 3 | no

Classifier (model):

IF rank = 'professor' OR years > 6
THEN tenured = 'yes'

Fig: Model Construction

2. Using the Classifier: for classifying future or unknown objects. It estimates the accuracy of the model.

The known label of test sample is compared with the classified result from the model.

Accuracy rate is the percentage of test set samples that are correctly classified by the model.

Test set is independent of training set, otherwise overfitting will occur.

If the accuracy is acceptable, use the model to classify data tuples whose class labels are not known.

184

The classifier is applied to testing data, and then to unseen data.

Testing data:

NAME | RANK | YEARS | TENURED
Tom | Assistant Prof | 2 | no
Merlisa | Associate Prof | 7 | no
George | Professor | 5 | yes
Joseph | Assistant Prof | 7 | yes

Unseen data: (Jeff, Professor, 4) -> Tenured?

Fig: Using the Model in Prediction

TYPES OF CLASSIFICATION TECHNIQUES

1. Decision Tree

2. Bayesian classification

3. Rule-based classification

4. Prediction

5. Classifier Accuracy and Prediction error measures

186

Fig: Example for Model Construction and Usage of Model

2 - Decision Tree

Decision tree is a flowchart-like tree structure, where each internal node (nonleaf node)denotes a test on an attribute, each branch represents an outcome of the test, and each leaf node (or terminal node) holds a class label.

Decision tree induction was developed by Ross Quinlan; his decision tree algorithm is known as ID3 (Iterative Dichotomiser). A small MLlib decision-tree sketch follows the definitions below.

Decision tree is a classifier in the form of a tree structure

Decision node: specifies a test on a single attribute

Leaf node: indicates the value of the target attribute

Arc/edge: split of one attribute

Path: a disjunction of test to make the final decision
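A minimal decision-tree classification sketch with Spark MLlib (the tiny training set, with features age and student-flag, is made up for illustration):

import org.apache.spark.mllib.tree.DecisionTree
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.linalg.Vectors

// hypothetical training data: label 1.0 = buys computer, 0.0 = does not; features = (age, student)
val data = sc.parallelize(Seq(
  LabeledPoint(1.0, Vectors.dense(25.0, 1.0)),
  LabeledPoint(0.0, Vectors.dense(45.0, 0.0)),
  LabeledPoint(1.0, Vectors.dense(35.0, 1.0)),
  LabeledPoint(0.0, Vectors.dense(52.0, 0.0))))
val model = DecisionTree.trainClassifier(data, 2, Map[Int, Int](), "gini", 4, 32)
println(model.toDebugString)                         // the learned tree as nested if/else tests
println(model.predict(Vectors.dense(30.0, 1.0)))     // predicted class for a new customer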

188

EX: Data Set in All Electronics Customer Database

189

Fig: A Decision tree for the concept buys computer, indicating whether a customer at AllElectronics is likely to purchase a computer. Each internal (nonleaf) node represents a test on an attribute. Each leaf node represents a class (either buys computer

= yes or buys computer = no).

190

Ex: For age attribute

191

Issues of Classification and Prediction

1. Accuracy,

2. Training time,

3. Robustness,

4. Interpretability, and

5. Scalability

Typical applications

1.

Credit approval

2.

Target marketing

3.

Medical diagnosis

4.

Fraud detection

5.

Weather forecasting

6.

Stock Marketing

192
