Talk
5th September 2018
By
Dr. Asadi Srinivasulu
Professor, M.Tech(IIIT), Ph.D.
SREE VIDYANIKETHAN ENGINEERING COLLEGE
(AUTONOMOUS)
(AFFILIATED TO JNTUA, ANANTAPUR)
2018-2019
Big Data Fundamentals
Big Data Architecture
Spark Fundamentals
Spark Ecosystem
Spark Transformations
Spark Actions
Spark – MLlib
Classification using Spark
Clustering using Spark
Spark Challenges
IoT: Internet of Things
Big Data Analytics: deals with the 3 V's of big data (e.g. Facebook)
Data Mining: extracting meaningful data
Data Warehouse: collection of Data Marts / OLAP
Data Mart: subset of a DWH
Database System: combination of Data + DBMS
DBMS: collection of software
Database: collection of inter-related data
Information: processed data
Data: raw material / facts / images
4
Big Data Is New
Big Data Is Only About Massive Data Volume
Big Data Means Hadoop
Big Data Needs A Data Warehouse
Big Data Means Unstructured Data
Big Data Is for Social Media & Sentiment Analysis
Big data is high-volume, high-velocity and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making.
Big data is data which is too large, complex and dynamic for any conventional data tools to capture, store, manage and analyze.
6
7
Big Data Analytics is the process of examining large amounts of data of a variety of types (big data) to uncover hidden patterns, unknown correlations and other useful information.
Why do they care about Big Data?
More knowledge leads to better customer engagement, fraud prevention and new products.
8
Data evolution: only about 10% of data is structured and 90% is unstructured, e.g. emails, videos, Facebook posts, website clicks, etc.
Fig: The Evolution of BIG DATA
9
Big data is a collection of data sets so large and complex that it is difficult to handle using DBMS tools.
Facebook alone generates more than 500 terabytes of data daily, whereas many other organizations, like Jet Air and the stock exchange market, generate terabytes of data every hour.
Types of Data:
1. Structured Data - This data is organized in a highly mechanized and manageable way.
Ex: tables, transactions, legacy data, etc.
2. Unstructured Data - This data is raw and unorganized; it varies in its content and can change from entry to entry.
Ex: videos, images, audio, text data, graph data, social media, etc.
3. Semi-Structured Data - roughly 50% structured and 50% unstructured.
Ex: XML databases.
10
Big Data Matters…
Data growth is huge and all that data is valuable.
Data won't fit on a single system, which is why distributed data is used.
Distributed data = faster computation.
More knowledge leads to better customer engagement, fraud prevention and new products.
Big Data matters for aggregation, statistics, indexing, searching, querying and discovering knowledge.
11
Fig: Measuring the Data in Big Data System
Big Data Sources: Big data is everywhere, and it can help organisations in any industry in many different ways.
Big data has become too complex and too dynamic to process, store, analyze and manage with traditional data tools.
Big Data Sources are
ERP Data
Transactions Data
Public Data
Social Media Data
Sensor Media Data
Big Data in Marketing
Big Data in Health & Life Sciences
Cameras Data
Mobile Devices
Machine sensors
Microphones
Big Data Processing: So-called big data technologies are about discovering patterns in semi-structured and unstructured data. Development of big data standards and (open source) software is commonly driven by companies such as Google, Facebook, Twitter, Yahoo!, etc.
15
Videos
Audios
Images
Photos
Logs
Click Trails
Text Messages
E-Mails
Documents
Books
Transactions
Public Records
Flat Files
SQL Files
DB2 Files
MYSQL Files
Tera Data Files
MS-Access Files
16
There are seven characteristics of Big Data: volume, velocity, variety, veracity, value, validity and visibility.
Earlier it was assessed in megabytes and gigabytes but now the assessment is made in terabytes.
17
1. Volume: data size, the amount of data, data quantity, or data at rest.
2. Velocity: data speed, speed of change, content that is changing quickly, or data in motion.
3. Variety: data types, the range of data types and sources, or data with multiple formats.
4. Veracity: data fuzziness or messiness; can we trust the data?
5. Value: data alone is not enough; how can value be derived from it?
6. Validity: ensure that the interpreted data is sound.
7. Visibility: data from diverse sources needs to be stitched together.
18
Flexible schema
Massive scalability
Cheaper to set up
Understanding and Targeting Customers
Understanding and Optimizing Business Processes
Improving Science and Research
Improving Healthcare and Public Health
Financial Trading
Improving Sports Performance
Improving Security and Law Enforcement
No declarative query language
Eventual consistency – higher performance
Detect risks and check frauds
Reduce Costs
20
Big data violates the privacy principle.
Data can be used for manipulating customers.
Big data may increase social stratification.
Big data is not useful in the short run.
It faces difficulties in parsing and interpreting.
Big data is difficult to handle and requires more programming.
Eventual consistency - fewer guarantees
Data Complexity
Data Volume
Data Velocity
Data Variety
Data Veracity
Data capture
Data curation
Performance
Data storage
Data search
Data transfer
Data visualization
Data analysis
Privacy and Security
22
Big Data Challenges include analysis, capture, data curation, search, sharing, storage, transfer, visualization, and information privacy.
23
Fig: Challenges of Big Data
24
Research issues in Big Data Analytics
1. Sentiment analysis in Big Data Hadoop using Mahout machine learning algorithms
2. Opinion mining analysis in Big Data Hadoop using Mahout machine learning algorithms
3. Predictive mining analysis in Big Data Hadoop using Mahout machine learning algorithms
4. Post-clustering analysis in Big Data Hadoop using Mahout machine learning algorithms
5. Pre-clustering analysis in Big Data Hadoop using Mahout machine learning algorithms
6. How can we capture and deliver data to the right people in real time?
7. How can we handle the variety of forms and data?
8. How can we store and analyze data given its size and our computational capacity?
Big Data Tools are Hadoop, Cloudera, Datameer, Splunk, Mahout, Hive, HBase, LucidWorks, R, MapR, Ubuntu and Linux flavors.
26
Social Networks and Relationships
Cyber-Physical Models
Internet of Things (IoT)
Retail Market
Retail Banking
Real Estate
Fraud detection and prevention
Telecommunications
Healthcare and Research
Automotive and production
Science and Research
Trading Analytics
27
Fig: Applications of Big Data Analytics
Spark Basics
RDD(Resilient Distributed Dataset)
Spark Transformations
Spark Actions
Spark with Machine Learning
Hands on Spark
Research Challenges in Spark
Apache Spark is a free and open-source cluster computing framework, originally developed at UC Berkeley and written mainly in Scala. (It should not be confused with the unrelated "Spark" Java web application framework, an alternative to frameworks such as JAX-RS, the Play framework and Spring MVC.)
Apache Spark is a general-purpose cluster in-memory computing system.
Provides high-level APIs in Java, Scala and Python and an optimized engine that supports general execution graphs.
Provides various high level tools like Spark SQL for structured data processing, MLlib for Machine Learning and more….
Spark is a successor of MapReduce.
MapReduce is the 'heart' of Hadoop and consists of two parts: 'map' and 'reduce'.
Maps and reduces are programs for processing data.
'Map' processes the data first to give some intermediate output, which is further processed by 'Reduce' to generate the final output.
Thus, MapReduce allows for distributed processing of the map and reduction operations.
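To make the map/reduce pattern concrete, here is a minimal word-count sketch in Spark's Scala shell; the input path "input.txt" is hypothetical, and sc is the SparkContext provided by spark-shell.

// Minimal word count in the Spark shell (sc = SparkContext).
// "input.txt" is a hypothetical local file path.
val lines  = sc.textFile("input.txt")
val counts = lines.flatMap(line => line.split(" "))   // "map" side: split lines into words
                  .map(word => (word, 1))             // intermediate (word, 1) pairs
                  .reduceByKey(_ + _)                 // "reduce" side: sum the counts per word
counts.collect().foreach(println)                     // action that triggers the computation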
41
Apache Spark is the next-gen Big Data tool, i.e. considered the future of Big Data and the successor of MapReduce.
Features of Spark are:
– Speed.
– Usability.
– In-memory computing.
– Pillar to sophisticated analytics.
– Real-time stream processing.
– Compatibility with Hadoop & existing Hadoop data.
– Lazy evaluation.
– Active, progressive and expanding community.
42
Spark is a distributed data analytics engine, generalizing MapReduce.
Spark is a core engine, with streaming and graph processing modules.
Why Spark
Spark is faster than MR when it comes to processing the same amount of data multiple times rather than unloading and loading new data.
Spark is simpler and usually much faster than MapReduce for the usual Machine learning and Data Analytics applications.
5 Reasons Why Spark Matters to Business
1. Spark enables use cases “traditional” Hadoop can’t handle.
2. Spark is fast
3. Spark can use your existing big data investment
4. Spark speaks SQL
5. Spark is developer-friendly
Spark Ecosystem: The Spark ecosystem is still a work in progress, with some Spark components not even in their beta releases.
Components of Spark Ecosystem
The components of Spark ecosystem are getting developed and several contributions are being made every now and then.
Primarily, Spark Ecosystem comprises the following components:
1) Shark (SQL)
2) Spark Streaming (Streaming)
3) MLlib (Machine Learning)
4) GraphX (Graph Computation)
5) SparkR (R on Spark)
6) BlinkDB (Approximate SQL)
Spark's official ecosystem consists of the following major components.
Spark DataFrames - Similar to a relational table
Spark SQL - Execute SQL queries or HiveQL
Spark Streaming - An extension of the core Spark API
MLlib - Spark's machine learning library
GraphX - Spark for graphs and graph-parallel computation
Spark Core API - provides APIs in R, SQL, Python, Scala and Java
MLlib library has implementations for various common machine learning algorithms (one of which is sketched after this list):
1. Clustering: k-means
2. Classification: Naïve Bayes, logistic regression, SVM
3. Decomposition: Principal Component Analysis (PCA) and Singular Value Decomposition (SVD)
4. Regression: linear regression
5. Collaborative filtering: Alternating Least Squares (ALS) for recommendations
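As an illustration of one of these algorithms, here is a minimal, hedged sketch of logistic regression with the RDD-based MLlib API; the tiny two-feature training set is invented for the example.

import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.linalg.Vectors

// Hypothetical training data: label (0.0 or 1.0) plus two numeric features.
val training = sc.parallelize(Seq(
  LabeledPoint(0.0, Vectors.dense(1.0, 0.5)),
  LabeledPoint(1.0, Vectors.dense(3.0, 2.5)),
  LabeledPoint(0.0, Vectors.dense(0.5, 1.0)),
  LabeledPoint(1.0, Vectors.dense(2.5, 3.0))))

val model = new LogisticRegressionWithLBFGS().setNumClasses(2).run(training)
println(model.predict(Vectors.dense(2.0, 2.0)))   // classify a new point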
58
Spark Core: Spark Core is the base engine for parallel and distributed processing of huge datasets.
Spark SQL: Spark SQL is the module/component in Apache Spark used to access structured and semi-structured data.
Spark SQL built-in and user-defined functions: Spark SQL comes with built-in functions for column manipulation, and user-defined functions (UDFs) can be written in Scala.
Spark SQL queries: executing SQL queries or Hive queries returns the result in the form of a DataFrame.
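A minimal sketch of this flow using the Spark 2.x SparkSession API; the data here is a small hypothetical in-memory collection rather than a real Hive table.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("SparkSQLDemo").getOrCreate()
import spark.implicits._

// Hypothetical data turned into a DataFrame with named columns.
val people = Seq(("Anna", 31), ("John", 45), ("Fred", 23)).toDF("name", "age")
people.createOrReplaceTempView("people")

// Executing a SQL query returns the result as another DataFrame.
val adults = spark.sql("SELECT name, age FROM people WHERE age > 30")
adults.show()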
DataFrame: similar to a relational table in Spark SQL; a distributed collection of tabular data with rows and named columns.
Dataset API: the Dataset is a newer API added to Apache Spark that provides the benefits of RDDs while being strongly typed and declarative in nature.
Spark Streaming: Spark Streaming is a lightweight API that lets developers build and run streaming data applications (a small sketch follows below).
MLlib: MLlib in Spark stands for machine learning (ML) library; its goal is to make practical machine learning effective, scalable and straightforward.
GraphX: GraphX is a distributed graph processing framework on top of Apache Spark.
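To give a feel for the streaming component, here is a hedged DStream word-count sketch; it assumes text lines arrive on a local socket at port 9999 (for example via nc -lk 9999), which is purely illustrative.

import org.apache.spark.streaming.{Seconds, StreamingContext}

// 5-second micro-batches on top of the existing SparkContext (sc).
val ssc = new StreamingContext(sc, Seconds(5))

val lines  = ssc.socketTextStream("localhost", 9999)                   // hypothetical source
val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)

counts.print()          // print each batch's word counts
ssc.start()             // start receiving and processing
ssc.awaitTermination()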
Language Support in Apache Spark
The Apache Spark ecosystem is built on top of the core execution engine, which has extensible APIs in different languages.
In a recent (2016) Spark survey in which 62% of Spark users evaluated the Spark languages:
58% were using Python,
71% were using Scala,
31% of the respondents were using Java, and
18% were using the R programming language.
63
The Spark framework is built on Scala, so programming in Scala provides access to some of the latest and greatest features that might not be available in the other supported Spark languages.
Python has excellent libraries for data analysis, like Pandas and scikit-learn, but is comparatively slower than Scala.
64
R has a rich environment for machine learning and statistical analysis, which helps increase developer productivity. Data scientists can now use the R language along with Spark, through SparkR, to process data that cannot be handled by a single machine.
Java is verbose and does not support a REPL, but it is definitely a good choice for developers coming from a Java + Hadoop background.
65
What is Scala?: Scala is a general-purpose programming language, which expresses programming patterns in a concise, elegant, and type-safe way.
It is basically an acronym for "Scalable Language".
Scala is an easy-to-learn language and supports both Object-Oriented Programming as well as Functional Programming.
It is getting popular among programmers, and is being increasingly preferred over Java and other programming languages.
It seems much in sync with the present and future Big Data frameworks, like Scalding, Spark, Akka, etc.
66
Scala is a pure object-oriented language, in which conceptually every value is an object and every operation is a method-call.
The language supports advanced component architectures through classes and traits.
Scala is also a functional language. It supports functions, immutable data structures and gives preference to immutability over mutation.
Scala can be seamlessly integrated with Java.
It is already being widely used for Big Data platforms and development of frameworks like Akka, Scalding, Play, etc.
Being written in Scala, Spark can be embedded in any JVM-based operational system.
Procedure: Spark Installation in Ubuntu
Apache Spark is a fast and general engine for large-scale data processing.
Apache Spark is a fast and general-purpose cluster computing system. It provides high-level APIs in Java, Scala, Python and R, and an optimized engine that supports general execution graphs.
It also supports a rich set of higher-level tools including Spark SQL for SQL and structured data processing, MLlib for machine learning, GraphX for graph processing, and Spark Streaming.
Step 1: Installing Java.
java -version
Step 2: Installing Scala and SBT.
sudo apt-get update
sudo apt-get install scala
69
Step 3: Installing Maven plug-in.
Maven plug-in is used to compile java program for spark. Type below command to install maven.
sudo apt-get install maven
Step 4: Installing Spark.
Download the "tgz" file of Spark, selecting a specific version, from: http://spark.apache.org/downloads.html
Extract it and remember the path where it is stored.
Edit the .bashrc file (terminal command: gedit .bashrc) by adding the lines below:
export SPARK_HOME=/path_to_spark_directory
export PATH=$SPARK_HOME/bin:$PATH
Replace path_to_spark_directory in the line above with the path of your Spark directory.
Reload .bashrc by saving and closing it and typing ". .bashrc" (i.e. source .bashrc) in the terminal.
If it doesn't work, restart the system. Spark is now installed successfully.
Type spark-shell in the terminal to start the Spark shell.
70
Step 1: Install Java (JDK)
Download and install java from https://java.com/en/download/
Step 2: Set the Java environment variable
Open "Control Panel", choose "System & Security" and select "System".
Select "Advanced System Settings" located at the top right.
Select "Environment Variables" from the pop-up.
Next, select New under System Variables (below); you will get a pop-up.
In the variable name field type JAVA_HOME.
In the variable value field provide the installation directory of Java, say C:\Program Files\Java\jdk1.8.0_25, or you can simply choose the directory by selecting Browse Directory.
Now close everything by choosing OK each time.
Check whether the Java variable is set by running javac in the command prompt. If you get the Java version details, you are done.
71
Step 3: Installing Scala
Download the scala.msi file from https://www.scala-lang.org/download/
Set the Scala environment variable just like Java above:
Variable name = SCALA_HOME
Variable value = path to the Scala installation directory, say C:\Program Files (x86)\scala
Step 4: Installing SPARK
Download and extract spark from http://spark.apache.org/downloads.html
You can set SPARK_HOME just like java.
Note: On Windows, spark-shell can only be run from the bin folder inside the Spark folder.
72
Step 5: Installing SBT
Download and install sbt.msi from http://www.scala-sbt.org/0.13/docs/Installing-sbt-on-Windows.html
Step 6: Installing Maven
Download Maven from http://maven.apache.org/download.cgi and unzip it to the folder where you want to install Maven.
Add both M2_HOME and MAVEN_HOME variables in the Windows environment, and point them to your Maven folder.
Update the PATH variable, appending the Maven bin folder (%M2_HOME%\bin), so that you can run Maven commands everywhere.
Test Maven by running mvn -version in the command prompt.
73
Practice on the Spark Framework with Transformations and Actions: You can run Spark using its standalone cluster mode, on EC2, on Hadoop YARN, or on Apache Mesos. Access data in HDFS, Cassandra, HBase, Hive, Tachyon, and any Hadoop data source.
Spark powers a stack of libraries including SQL and DataFrames, MLlib for machine learning, GraphX, and Spark Streaming. You can combine these libraries seamlessly in the same application.
74
RDD (Resilient Distributed Dataset) is the main logical data unit in Spark. An RDD is a distributed collection of objects. Quoting from the Learning Spark book: "In Spark all work is expressed as creating new RDDs, transforming existing RDDs, or calling operations on RDDs to compute a result."
Spark performs Transformations and Actions:
75
Resilient Distributed Datasets overcome the drawback of Hadoop MapReduce (writing intermediate results to disk) by allowing fault-tolerant 'in-memory' computations.
RDD in Apache Spark:
Why is an RDD used to process the data?
What are the major features/characteristics of RDDs (Resilient Distributed Datasets)?
Resilient Distributed Datasets are immutable, partitioned collections of records that can be operated on in parallel.
RDDs can contain any kind of Python, Scala or Java objects, including user-defined class objects.
RDDs are usually created either by transformation of existing RDDs or by loading an external dataset from stable storage such as HDFS or HBase.
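For example, a minimal sketch of the two creation routes in the Spark shell (the HDFS path is hypothetical):

val fromCollection = sc.parallelize(Array(1, 2, 3, 4, 5))        // from an in-memory collection
val fromStorage    = sc.textFile("hdfs:///data/input.txt")       // from external storage (hypothetical path)
val derived        = fromCollection.map(_ * 2)                   // from an existing RDD via a transformation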
Coarse-grained operations (such as map or filter) on existing RDDs, which produce a new RDD containing the result of the operation, are referred to as transformations. All transformations in Spark are lazy.
Operations like count, first and reduce, which return values after computations on existing RDDs, are referred to as actions.
78
Properties / Traits of RDD:
Immutable (read-only; cannot be changed or modified): data is safe to share across processes.
Partitioned: the partition is the basic unit of parallelism in an RDD.
Coarse-grained operations: operations are applied to all elements in the dataset through map, filter or groupBy operations.
Actions/Transformations: all computations on RDDs are actions or transformations.
Fault tolerant: as the name 'Resilient' says, an RDD has the capability to reconcile, recover or get back all of its data using the lineage graph.
Cacheable: it can hold data in persistent storage.
Persistence: the option of choosing which storage will be used, either in-memory or on-disk.
How Spark Works - RDD Operations
81
Task 1: Practice on Spark Transformations i.e. map(), filter(), flatmap(), groupBy(), groupByKey(), sample(), union(), join(), distinct(), keyBy(), partitionBy and zip().
Transformations are lazily evaluated operations on an RDD that create one or many new RDDs, e.g. map, filter, reduceByKey, join, cogroup, randomSplit.
Transformations are lazy, i.e. they are not executed immediately.
Transformations are executed only when actions are called, as the example below shows.
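A small sketch of this laziness in the Spark shell; in local mode the println output only appears once the action runs:

val nums    = sc.parallelize(1 to 5)
val doubled = nums.map { n => println(s"processing $n"); n * 2 }   // nothing is computed yet: map is lazy
val result  = doubled.collect()                                    // the action triggers execution of the map
println(result.mkString(", "))                                     // 2, 4, 6, 8, 10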
82
Transformations are lazy operations on an RDD that create one or many new RDDs.
Ex: map, filter, reduceByKey, join, cogroup, randomSplit.
In other words, transformations are functions that take an RDD as the input and produce one or many RDDs as the output.
RDDs allow you to create dependencies between RDDs.
Dependencies are the steps for producing results, i.e. a program.
Each RDD in the lineage chain (string of dependencies) has a function for operating on its data and a pointer (dependency) to its ancestor RDD.
Spark divides RDD dependencies into stages and tasks and then sends those to workers for execution.
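The lineage (dependency chain) of an RDD can be inspected with toDebugString, as in this small sketch:

val base   = sc.parallelize(Seq("a b", "b c", "c a"))
val pairs  = base.flatMap(_.split(" ")).map((_, 1))
val counts = pairs.reduceByKey(_ + _)

// Prints the chain of dependencies Spark uses to plan stages and tasks.
println(counts.toDebugString)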
1. map(): Pass each element of the RDD through the supplied function.
val x = sc.parallelize(Array("b", "a", "c"))
val y = x.map(z => (z, 1))
Output: y: [('b', 1), ('a', 1), ('c', 1)]
85
2.filter(): Filter creates a new RDD by passing in the supplied function used to filter the results.
val x = sc.parallelize(Array(1, 2, 3))
val y = x.filter(n => n % 2 == 1)
println(y.collect().mkString(", "))
Output: y: [1, 3]
86
3. flatMap(): Similar to map, but each input item can be mapped to 0 or more output items.
val x = sc.parallelize(Array(1, 2, 3))
val y = x.flatMap(n => Array(n, n * 100, 42))
println(y.collect().mkString(", "))
Output: y: [1, 100, 42, 2, 200, 42, 3, 300, 42]
87
4. groupBy(): Groups the elements of the RDD by the result of the supplied function, returning a dataset of (K, Iterable<V>) pairs.
val x = sc.parallelize(Array("John", "Fred", "Anna", "James"))
val y = x.groupBy(w => w.charAt(0))
println(y.collect().mkString(", "))
Output: y: [('A',['Anna']),('J',['John','James']),('F',['Fred'])]
88
5.groupByKey() : When called on a dataset of (K, V) pairs, returns a dataset of (K, Iterable<V>) pairs.
val x = sc.parallelize(Array(('B', 5), ('B', 4), ('A', 3), ('A', 2), ('A', 1)))
val y = x.groupByKey()
println(y.collect().mkString(", "))
Output: y: [('A', [3, 2, 1]),('B',[5, 4])]
89
6.sample() : Return a random sample subset RDD of the input RDD.
val x = sc.parallelize(Array(1, 2, 3, 4, 5))
val y = x.sample(false, 0.4)  // omitting the seed will yield different output
println(y.collect().mkString(", "))
Output: y: [1, 3]
90
7.union() : Simple. Return the union of two RDDs.
val x = sc.parallelize(Array(1, 2, 3), 2)
val y = sc.parallelize(Array(3, 4), 1)
val z = x.union(y)
val zOut = z.glom().collect()
Output z: [[1], [2, 3], [3, 4]]
91
8.join() : If you have relational database experience, this will be easy. It’s joining of two datasets.
val x = sc.parallelize(Array(("a", 1), ("b", 2)))
val y = sc.parallelize(Array(("a", 3), ("a", 4), ("b", 5)))
val z = x.join(y)
println(z.collect().mkString(", "))
Output z: [('a', (1, 3)), ('a', (1, 4)), ('b', (2, 5))]
92
9.distinct() : Return a new RDD with distinct elements within a source RDD.
val x = sc.parallelize(Array(1, 2, 3, 3, 4))
val y = x.distinct()
println(y.collect().mkString(", "))
Output: y: [1, 2, 3, 4]
93
10) keyBy() : Constructs two-component tuples (key-value pairs) by applying a function on each data item.
val x = sc.parallelize(Array("John", "Fred", "Anna", "James"))
val y = x.keyBy(w => w.charAt(0))
println(y.collect().mkString(", "))
Output: y: [('J','John'),('F','Fred'),('A','Anna'),('J','James')]
94
11) partitionBy(): Repartitions a key-value RDD using its keys. The partitioner implementation is supplied as the argument.
import org.apache.spark.Partitioner

val x = sc.parallelize(Array(('J', "James"), ('F', "Fred"), ('A', "Anna"), ('J', "John")), 3)
val y = x.partitionBy(new Partitioner() {
  val numPartitions = 2
  def getPartition(k: Any) = {
    if (k.asInstanceOf[Char] < 'H') 0 else 1
  }
})
val yOut = y.glom().collect()
Output:y: Array(Array((F,Fred), (A,Anna)), Array((J,John), (J,James)))
95
12) zip() : Joins two RDDs by combining the i-th of either partition with each other.
val x = sc.parallelize(Array(1, 2, 3))
val y = x.map(n => n * n)
val z = x.zip(y)
println(z.collect().mkString(", "))
Output: z: [(1, 1), (2, 4), (3, 9)]
96
Task 2: Practice on Spark Actions i.e. getNumPartitions(), collect(), reduce(), aggregate(), max(), sum(), mean(), stdev(), countByKey().
Actions return the final result of RDD computations/operations.
An action produces a value back to the Spark driver program. It may trigger a previously constructed, lazy RDD to be evaluated.
Action functions materialize a value in a Spark program. So basically, an action is any RDD operation that returns a value of any type other than RDD[T].
97
Actions: Unlike Transformations which produce RDDs, action functions produce a value back to the Spark driver program.
Actions may trigger a previously constructed, lazy RDD to be evaluated.
1. getNumPartitions()
2. collect()
3. reduce()
4. aggregate()
5. mean()
6. sum()
7. max()
8. stdev()
9. countByKey()
98
1) collect() : collect returns the elements of the dataset as an array back to the driver program.
val x = sc.parallelize(Array(1, 2, 3), 2)
val y = x.collect()
Output: y: [1, 2, 3]
99
2) reduce() : Aggregate the elements of a dataset through function.
val x = sc.parallelize(Array(1, 2, 3, 4))
val y = x.reduce((a, b) => a + b)
Output: y: 10
100
3) aggregate(): The aggregate function allows the user to apply two different reduce functions to the RDD.
val inputrdd = sc.parallelize(List(("maths", 21), ("english", 22), ("science", 31)), 3)
val result = inputrdd.aggregate(3)((acc, value) => acc + value._2, (acc1, acc2) => acc1 + acc2)
Partition 1 : Sum(all Elements) + 3 (Zero value)
Partition 2 : Sum(all Elements) + 3 (Zero value)
Partition 3 : Sum(all Elements) + 3 (Zero value)
Result = Partition1 + Partition2 + Partition3 + 3(Zero value)
So we get 21 + 22 + 31 + (4 * 3) = 86
Output: result: Int = 86
101
4) max() : Returns the largest element in the RDD.
val x = sc.parallelize(Array(2, 4, 1))
val y = x.max()
Output: y: 4
102
5) count() : Number of elements in the RDD.
val x = sc.parallelize(Array("apple", "beatty", "beatrice"))
val y = x.count()
Output: y: 3
103
6) sum() : Sum of the RDD.
val x = sc.parallelize(Array(2, 4, 1))
val y = x.sum()
Output: y: 7
104
7) mean() : Mean of given RDD.
val x = sc.parallelize(Array(2, 4, 1))
val y = x.mean()
Output: y: 2.3333333
105
8) stdev(): An aggregate function that returns the standard deviation of a set of numbers.
val x = sc.parallelize(Array(2, 4, 1))
val y = x.stdev()
Output: y: 1.2472191
106
9) countByKey() : This is only available on RDDs of (K,V) and returns a hashmap of (K, count of K).
val x = sc.parallelize(Array(('J', "James"), ('F', "Fred"), ('A', "Anna"), ('J', "John")))
val y = x.countByKey()
Output: y: {'A': 1, 'J': 2, 'F': 1}
107
10) getNumPartitions():
val x = sc.parallelize(Array(1, 2, 3), 2)
val y = x.partitions.size
Output: y: 2
108
1. Spark was initially developed at which university?
Ans) Berkeley
2. What are the characteristics of Big Data?
Ans) Volume, Velocity and Variety
3. The main focus of Hadoop ecosystem is on
Ans ) Batch Processing
4. Streaming data tools available in Hadoop ecosystem are?
Ans ) Apache Spark and Storm
5. Spark has API's in?
Ans ) Java, Scala, R and Python
6. Which kind of data can be processed by spark?
Ans) Stored Data and Streaming Data
7. Spark can store its data in?
Ans) HDFS, MongoDB and Cassandra
109
8. How spark engine runs?
Ans) Integrating with Hadoop and Standalone
9. In spark data is represented as?
Ans ) RDDs
10. Which kind of data can be handled by Spark ?
Ans) Structured, Unstructured and Semi-Structured
11. Which among the following are the challenges in MapReduce?
Ans) Every problem has to be broken into a Map and a Reduce phase
Collection of Key / Value pairs
High Throughput
110
12. Apache spark is a framework with?
Ans) Scheduling, Monitoring and Distributing Applications
13. Which are the features of Apache Spark?
Ans) DAG, RDDs and in-memory processing
14) How much faster is the processing in Spark when compared to Hadoop?
Ans) 10-100x
15) In spark data is represented as?
Ans) RDDs
16) List of Transformations
Ans) map(), filter(), flatmap(), groupBy(), groupByKey(), sample(), union(), join(), distinct(), keyBy(), partitionBy and zip().
17) List of Actions
Ans) getNumPartitions(), collect(), reduce(), aggregate(), max(), sum(), mean(), stdev(), countByKey().
111
18. Spark is developed in
Ans) Scala
19. Which types of processing can Apache Spark handle?
Ans) Batch processing, interactive processing, stream processing and graph processing
20. List two statements of Spark
Ans) Spark can run on the top of Hadoop
Spark can process data stored in HDFS
Spark can use Yarn as resource management layer
21. Spark's core is a batch engine? True OR False
Ans) True
22) Spark is 100x faster than MapReduce due to
Ans) In-Memory Computing
23) MapReduce program can be developed in spark? T / F
Ans) True
112
24. Programming paradigm used in Spark
Ans) Generalized
25. Spark Core Abstraction
Ans) RDD
26. Choose correct statement about RDD
Ans) RDD is a distributed data structure
27. RDD is
Ans) Immutable, Recomputable and Fault-tolerant
28. RDD operations
Ans) Transformation, Action and Caching
29. We can edit the data of RDD like conversion to uppercase? T/F
Ans) False
30. Identify correct transformation
Ans) Map, Filter and Join
113
31. Identify Correct Action
Ans) Reduce
32. Choose correct statement
Ans) Execution starts with the call of Action
33. Choose correct statement about Spark Context
Ans) Interact with cluster manager and Specify spark how to access cluster
34. Spark cache the data automatically in the memory as and when needed? T/F
Ans) False
35. For resource management spark can use
Ans) Yarn, Mesos and Standalone cluster manager
36. RDD can not be created from data stored on
Ans) Oracle
37. RDD can be created from data stored on
Ans) Local FS, S3 and HDFS
114
38. Who is father of Big data Analytics
Doug Cutting
39. What are major Characteristics of Big Data
Volume, Velocity and Variety(3 V’s)
40. What is Apache Hadoop
Open-source Software Framework
41. Who developed Hadoop
Doug Cutting
42. Hadoop supports which programming framework
Java
115
43. What is the heart of Hadoop
MapReduce
44. What is MapReduce
Programming Model for Processing Large Data Sets.
45. What are the Big Data Dimensions
4 V’s
46. What is the caption of Volume
Data at Scale
47. What is the caption of Velocity
Data in Motion
116
48. What is the caption of Variety
Data in many forms
49. What is the caption of Veracity
Data Uncertainty
50. What is the biggest Data source for Big Data
Transactions
51. What is the biggest Analytic capability for Big Data
Query and Reporting
52. What is the biggest Infrastructure for Big Data
Information integration
117
53. What are the Big Data Adoption Stages
Educate, Explore, Engage and Execute
54. What is Mahout
Algorithm library for scalable machine learning on Hadoop
55. What is Pig
Creating MapReduce programs used with Hadoop.
56. What is HBase
Non-Relational Database
57. What is the biggest Research Challenge for Big Data
Heterogeneity , Incompleteness and Security
118
58. What is Sqoop
Transferring bulk data between Hadoop and structured data stores.
59. What is Oozie
Workflow scheduler system to manage Hadoop jobs.
60. What is Hue
Web interface that supports Apache Hadoop and its ecosystem
61. What is Avro
Avro is a data serialization system.
62. What is Giraph
Iterative graph processing system built for high scalability.
63. What is Cassandra
Cassandra does not support joins or sub queries, except for batch analysis via Hadoop
64. What is Chukwa
Chukwa is an open source data collection system for monitoring large distributed systems
65. What is Hive
Hive is a data warehouse on Hadoop
66. What is Apache drill
Apache Drill is a distributed system for interactive analysis of large-scale datasets.
67. What is HDFS
Hadoop Distributed File System ( HDFS )
68. Facebook generates how much data per day
25TB
69. What is BIG DATA?
Big Data is nothing but an assortment of such a huge and complex data that it becomes very tedious to capture, store, process, retrieve and analyze it with the help of on-hand database management tools or traditional data processing techniques.
70. What is HUE expansion
Hadoop User Interface
71. Can you give some examples of Big Data?
There are many real-life examples of Big Data! Facebook generates 500+ terabytes of data per day, the NYSE (New York Stock Exchange) generates about 1 terabyte of new trade data per day, and a jet airline collects 10 terabytes of sensor data for every 30 minutes of flying time.
72. Can you give a detailed overview about the Big Data being generated by Facebook?
As of December 31, 2012, there are 1.06 billion monthly active users on Facebook and 680 million mobile users. On an average,
3.2 billion likes and comments are posted every day on Facebook.
72% of web audience is on Facebook. And why not! There are so many activities going on Facebook from wall posts, sharing images, videos, writing comments and liking posts, etc.
122
73. What are the three characteristics of Big Data?
The three characteristics of Big Data are:
Volume: Facebook generating 500+ terabytes of data per day.
Velocity: analyzing 2 million records each day to identify the reason for losses.
Variety: images, audio, video, sensor data, log files, etc.
74. How Big is 'Big Data'?
With time, data volume is growing exponentially. Earlier we used to talk about megabytes or gigabytes.
But the time has arrived when we talk about data volume in terms of terabytes, petabytes and also zettabytes! Global data volume was around 1.8 ZB in 2011 and is expected to be 7.9 ZB in 2015.
123
75. How analysis of Big Data is useful for organizations?
Effective analysis of Big Data provides a lot of business advantage as organizations will learn which areas to focus on and which areas are less important.
76. Who are ‘Data Scientists’?
Data scientists are experts who find solutions to analyze data. Just as web analysis, we have data scientists who have good business insight as to how to handle a business challenge.
124
77. What is Hadoop?
Hadoop is a framework that allows for distributed processing of large data sets across clusters of commodity computers using a simple programming model.
78. Why the name 'Hadoop'?
Hadoop doesn't have an expansion (unlike acronyms such as 'OOPS').
The charming yellow elephant you see is basically named after Doug's son's toy elephant!
79. Why do we need Hadoop?
Everyday a large amount of unstructured data is getting dumped into our machines.
125
80. What are some of the characteristics of Hadoop framework?
Hadoop framework is written in Java. It is designed to solve problems that involve analyzing large data (e.g. petabytes). The programming model is based on Google’s MapReduce. The infrastructure is based on Google’s Big Data and Distributed File
System.
81. Give a brief overview of Hadoop history.
In 2002, Doug Cutting created an open source, web crawler project.
In 2004, Google published MapReduce, GFS papers.
In 2006, Doug Cutting developed the open source, MapReduce and
HDFS project. In 2008, Yahoo ran 4,000 node Hadoop cluster and
Hadoop won terabyte sort benchmark.
In 2009, Facebook launched SQL support for Hadoop.
126
82. Give examples of some companies that are using the Hadoop structure.
A lot of companies are using the Hadoop structure, such as Cloudera, EMC, MapR, Hortonworks, Amazon, Facebook, eBay, Twitter, Google and so on.
83. What is the basic difference between traditional RDBMS and Hadoop?
RDBMS is used for transactional systems to report and archive the data, whereas Hadoop is an approach to store huge amounts of data in a distributed file system and process it.
RDBMS will be useful when you want to seek one record from Big Data, whereas Hadoop will be useful when you want Big Data in one shot and perform analysis on it later.
127
84. What is structured and unstructured data?
Structured data is the data that is easily identifiable as it is organized in a structure. The most common form of structured data is a database where specific information is stored in tables, that is, rows and columns.
Unstructured data refers to any data that cannot be identified easily. It could be in the form of images, videos, documents, email, logs and random text.
85. What are the core components of Hadoop?
Core components of Hadoop are HDFS and MapReduce.
HDFS is basically used to store large data sets and
MapReduce is used to process such large data sets.
128
HDFS is a file system designed for storing very large files with streaming data access patterns, running clusters on commodity hardware.
HDFS is highly fault-tolerant, with high throughput, suitable for applications with large data sets, streaming access to file system data and can be built out of commodity hardware.
129
88. What is Fault Tolerance?
Suppose you have a file stored in a system, and due to some technical problem that file gets destroyed; then there is no chance of getting the data back. The ability of a system to keep working and recover the data despite such failures is called fault tolerance.
89. Replication causes data redundancy then why is pursued in HDFS?
HDFS works with commodity hardware (systems with average configurations) that has high chances of getting crashed any time. Thus, to make the entire system highly fault-tolerant, HDFS replicates and stores data in different places.
130
90. Since the data is replicated thrice in HDFS, does it mean that any calculation done on one node will also be replicated on the other two?
Since there are 3 nodes, when we send the MapReduce programs, calculations will be done only on the original data. The master node will know which node exactly has that particular data.
91. What is throughput? How does HDFS get a good throughput?
Throughput is the amount of work done in a unit time. It describes how fast the data is getting accessed from the system and it is usually used to measure performance of the system .
131
92. What is streaming access?
As HDFS works on the principle of ‘Write Once, Read
Many‘, the feature of streaming access is extremely important in HDFS. HDFS focuses not so much on storing the data but how to retrieve it at the fastest possible speed, especially while analyzing logs.
93. What is a commodity hardware? Does commodity hardware include RAM?
Commodity hardware is a non-expensive system which is not of high quality or high-availability. Hadoop can be installed in any average commodity hardware.
132
94. What is a Name node?
Name node is the master node on which job tracker runs and consists of the metadata. It maintains and manages the blocks which are present on the data nodes.
95. Is Name node also a commodity?
No.
Name node can never be a commodity hardware because the entire HDFS rely on it. It is the single point of failure in HDFS. Name node has to be a high-availability machine.
96. What is a metadata?
Metadata is the information about the data stored in data nodes such as location of the file, size of the file and so on.
133
Data nodes are the slaves which are deployed on each machine and provide the actual storage.
These are responsible for serving read and write requests for the clients.
HDFS is more suitable for large amount of data sets in a single file as compared to small amount of data spread across multiple files.
134
Daemon is a process or service that runs in background. In general, we use this word in UNIX environment.
Job tracker is a daemon that runs on a name node for submitting and tracking MapReduce jobs in
Hadoop. It assigns the tasks to the different task tracker.
135
101. What is a task tracker?
Task tracker is also a daemon that runs on data nodes.
Task Trackers manage the execution of individual tasks on slave node.
102. Is Name node machine same as data node machine as in terms of hardware?
It depends upon the cluster you are trying to create. The
Hadoop VM can be there on the same machine or on another machine.
136
103. What is a heartbeat in HDFS?
A heartbeat is a signal indicating that it is alive. A data node sends heartbeat to Name node and task tracker will send its heart beat to job tracker.
104. Are Name node and job tracker on the same host?
No, in practical environment, Name node is on a separate host and job tracker is on a separate host.
105. What is a ‘block’ in HDFS?
A ‘block’ is the minimum amount of data that can be read or written. In HDFS, the default block size is 64 MB as contrast to the block size of 8192 bytes in Unix/Linux.
106. What are the benefits of block transfer?
A file can be larger than any single disk in the network.
Blocks provide fault tolerance and availability.
107. If we want to copy 10 blocks from one machine to another, but another machine can copy only 8.5
blocks, can the blocks be broken at the time of replication?
In HDFS, blocks cannot be broken down. Before copying the blocks from one machine to another, the Master node will figure out what is the actual amount of space required, how many block are being used, how much space is available, and it will allocate the blocks accordingly.
138
108. How indexing is done in HDFS?
Hadoop has its own way of indexing. Depending upon the block size, once the data is stored, HDFS will keep on storing the last part of the data which will say where the next part of the data will be.
109. If a data Node is full how it’s identified?
When data is stored in data node, then the metadata of that data will be stored in the Name node. So Name node will identify if the data node is full.
110.If data nodes increase, then do we need to upgrade
Name node?
While installing the Hadoop system, Name node is determined based on the size of the clusters.
139
111.Are job tracker and task trackers present in separate machines?
Yes, job tracker and task tracker are present in different machines.
The reason is job tracker is a single point of failure for the Hadoop
MapReduce service.
112. When we send a data to a node, do we allow settling in time, before sending another data to that node?
Yes, we do.
113. Does Hadoop always require digital data to process?
Yes, Hadoop always requires digital data to be processed.
114. On what basis Name node will decide which data node to write on?
As the Name node has the metadata (information) related to all the data nodes, it knows which data node is free.
140
115. Doesn't Google have its very own version of DFS?
Yes, Google owns a DFS known as the "Google File System (GFS)", developed by Google Inc. for its own use.
116. Who is a ‘user’ in HDFS?
A user is like you or me, who has some query or who needs some kind of data.
117. Is client the end user in HDFS?
No, Client is an application which runs on your machine, which is used to interact with the Name node (job tracker) or data node (task tracker).
118. What is the communication channel between client and name node/ data node?
The mode of communication is SSH.
141
Traditional data systems vs. Big Data:
1. Deals with gigabytes to terabytes | Deals with petabytes to zettabytes
2. Centralized | Distributed
3. Deals with structured data | Deals with semi-structured and unstructured data
4. Stable data model | Unstable data model
5. Known, complex interrelationships | Flat schemas with few interrelationships
6. Tools: relational DBs (SQL, MySQL, DB2) | Tools: Hadoop, R, Mahout
7. Access: interactive and batch | Access: batch
8. Updates: read and write many times | Updates: write once, read many times
9. Integrity: high | Integrity: low
10. Scaling: nonlinear | Scaling: linear
148
Hadoop vs. Spark:
Performance: Hadoop processes data on disk | Spark processes data in-memory.
Ease of use: Hadoop requires proficiency in Java and MapReduce | Spark offers Java, Scala, R and Python, which are more expressive and intelligent.
Data processing: Hadoop needs other platforms for streaming and graphs | Spark includes MLlib, GraphX and Spark Streaming.
Failure tolerance: Hadoop continues from the point it left off | Spark starts the processing from the beginning.
Cost: Hadoop incurs hard-disk space cost | Spark incurs memory space cost.
Run: Hadoop runs everywhere | Spark runs on Hadoop.
Memory: HDFS uses MapReduce to process and analyse data, and MapReduce takes a backup of all the data on a physical server after each operation | Spark keeps the data in RAM; this is called in-memory operation.
149
Big Data frameworks: Hadoop vs. Spark (continued):
Speed: Hadoop works less fast than Spark | Spark works up to 100 times faster than Hadoop (in-memory batch), about 10x faster on disk.
Version: Hadoop 2.6.5 (Release Notes) | Spark 2.3.1
Software: Open-source software for reliable, scalable, distributed computing; a Big Data tool | A fast and general engine for large-scale data processing; a Big Data tool.
Execution engine (DAG): - | Advanced DAG, with support for acyclic data flow and in-memory computing.
Hardware cost: More | Less
Library: External machine learning library | Internal machine learning library (MLlib)
150
Recovery: Hadoop - easier than Spark; checkpoints are present | Spark - failure recovery is more difficult, but still good.
File management system: Hadoop - has its own FMS (File Management System) | Spark - does not come with its own FMS; it supports cloud-based data platforms (Spark was designed for Hadoop) and supports HDFS, Hadoop YARN, Apache Mesos and RDDs.
Technologies: Hadoop - supports all systems in Hadoop, processed by a batch system | Spark - Cassandra, HBase, Hive, Tachyon, and any Hadoop source.
Use places: Hadoop - marketing analysis, computing analysis, cyber-security analytics | Spark - online products.
Run: Hadoop - clusters, databases, servers | Spark - cloud-based systems and data sets.
151
Reach me @
Srinivasulu Asadi: srinu_asadi@yahoo.com, +91-9490246442
152
Fig: International Conference on Emerging Research In Computing, Information, Communication and Applications (ERCICA-2014).
Fig: Keynote Speaker @GITM, Proddatur, 2016
Fig: Resource Person for Workshop @MITS, Madanapalle, 2016
157
Fig: Keynote Speaker @ SITM, Renigunta, 2017
What is Clustering: Clustering is unsupervised learning, i.e. there are no predefined classes; it forms groups of similar objects that differ significantly from other objects.
The process of grouping a set of physical or abstract objects into classes of similar objects is called clustering.
Clustering is "the process of organizing objects into groups whose members are similar in some way".
The cluster property: intra-cluster distances are minimized and inter-cluster distances are maximized.
What is Good Clustering: A good clustering method will produce high-quality clusters with
high intra-class similarity and
low inter-class similarity.
162
1. Nominal variables allow only qualitative classification; a generalization of the binary variable in that it can take more than 2 states, e.g., red, yellow, blue, green.
Ex: {male, female}, {yes, no}, {true, false}
2. Ordinal data are categorical data where there is a logical ordering to the categories.
Ex: 1 = Strongly disagree; 2 = Disagree; 3 = Neutral; 4 = Agree
3. Categorical data represent types of data which may be divided into groups.
Ex: race, sex, age group, and educational level.
4. Labeled data share the class labels or the generative distribution of the data.
5. Unlabeled data do not share the class labels or the generative distribution of the labeled data.
6. Numerical values: the data values are purely numerical.
Ex: 1, 2, 3, 4, ...
163
7. Interval-valued variables: variables whose values fall in numeric ranges.
Ex: 10-20, 20-30, 30-40, ...
8. Binary variables: variables made up of combinations of 0 and 1.
Ex: 1, 0, 001, 010, ...
9. Ratio-scaled variables: a positive measurement on a nonlinear scale, approximately exponential, such as Ae^(Bt) or Ae^(-Bt).
Ex: 1/2, 2/4, 4/8, ...
10. Variables of mixed types: a database may contain all six types of variables: symmetric binary, asymmetric binary, nominal, ordinal, interval and ratio.
Ex: 11121A1201
164
Similarity Measure
Distances are normally used to measure the similarity or dissimilarity between two data objects.
Euclidean distance: Euclidean distance is the straight-line distance between two points in Euclidean space.
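As a small illustration (not part of the original slides), the Euclidean distance between two p-dimensional points can be computed in Scala as:

// Euclidean distance between two points given as equal-length arrays.
def euclidean(x: Array[Double], y: Array[Double]): Double =
  math.sqrt(x.zip(y).map { case (a, b) => (a - b) * (a - b) }.sum)

euclidean(Array(1.0, 2.0), Array(4.0, 6.0))   // = 5.0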
1. Partitioning Methods
2. Hierarchical Methods
3. Density-Based Methods
4. Grid-Based Methods
5. Model-Based Clustering Methods
6. Clustering High-Dimensional Data
7. Constraint-Based Cluster Analysis
8. Outlier Analysis
166
1. Marketing
2. Land use
3. Insurance
4. City-planning
5. Earth-quake studies
1. Accuracy,
2. Training time,
3. Robustness,
4. Interpretability, and
5. Scalability
6. Find top ‘n’ outlier points
167
Pattern Recognition
Spatial Data Analysis
GIS(Geographical Information System)
Cluster Weblog data to discover groups
Credit approval
Target marketing
Medical diagnosis
Fraud detection
Weather forecasting
Stock Marketing
168
Classification vs. Clustering:
1. Classification is "the process of organizing objects into groups whose members are not similar". | 1. Clustering is "the process of organizing objects into groups whose members are similar in some way".
2. It is supervised learning. | 2. It is unsupervised learning.
3. Predefined classes. | 3. No predefined classes.
4. Has labels for some points. | 4. No labels in clustering.
5. Requires a "rule" that will accurately assign labels to new points. | 5. Groups points into clusters based on how "near" they are to one another.
169
6. Classification 6. Clustering
170
7. Classification approaches are of two types:
1. Predictive classification
2. Descriptive classification
7. Clustering approaches are of eight types:
1. Partitioning methods
2. Hierarchical methods
3. Density-based methods
4. Grid-based methods
5. Model-based clustering methods
6. Clustering high-dimensional data
7. Constraint-based cluster analysis
8. Outlier analysis
8. Issues of classification:
1. Accuracy
2. Training time
3. Robustness
4. Interpretability
5. Scalability
8. Issues of clustering:
1. Accuracy
2. Training time
3. Robustness
4. Interpretability
5. Scalability
6. Finding the top 'n' outlier points
9. Examples (classification):
1. Marketing
2. Land use
3. Insurance
4. City planning
5. Earthquake studies
9. Examples (clustering):
1. Marketing
2. Land use
3. Insurance
4. City planning
5. Earthquake studies
10. Classification techniques:
1. Decision tree
2. Bayesian classification
3. Rule-based classification
4. Prediction, accuracy and error measures
10. Clustering techniques:
1. K-means clustering
2. DIANA (DIvisive ANAlysis)
3. AGNES (AGglomerative NESting)
4. BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies)
5. DBSCAN (Density-Based Spatial Clustering of Applications with Noise)
172
11. Applications of classification:
1. Credit approval
2. Target marketing
3. Medical diagnosis
4. Fraud detection
5. Weather forecasting
6. Stock marketing
11. Applications of clustering:
1. Pattern recognition
2. Spatial data analysis
3. WWW (World Wide Web)
4. Weblog data clustering to discover groups
5. Credit approval
6. Target marketing
7. Medical diagnosis
8. Fraud detection
9. Weather forecasting
10. Stock marketing
173
It is a Partitioning cluster technique.
It is a Centroid-Based cluster technique
Clustering is a Unsupervised learning i.e. no predefined classes, Group of similar objects that differ significantly from other objects.
Euclidean distance: d(i, j) = sqrt(|x_{i1} - x_{j1}|^2 + |x_{i2} - x_{j2}|^2 + ... + |x_{ip} - x_{jp}|^2)
It then creates the first k initial clusters (k= number of clusters needed) from the dataset by choosing k rows of data randomly from the dataset.
The k-Means algorithm calculates the Arithmetic Mean of each cluster formed in the dataset.
174
Square-error criterion, where:
– E is the sum of the square error for all objects in the data set;
– p is the point in space representing a given object; and
– m_i is the mean of cluster C_i (both p and m_i are multidimensional).
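For reference, the criterion just described is commonly written (here in LaTeX notation) as:

E = \sum_{i=1}^{k} \sum_{p \in C_i} \left| p - m_i \right|^{2}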
Algorithm: The k-means algorithm for partitioning, where each cluster’s center is represented by the mean value of the objects in the cluster.
Input:
– k: the number of clusters,
–
D: a data set containing n objects.
Output: A set of k clusters.
175
(Figure: k-means iterations on a 2-D plot with k = 2: arbitrarily choose k objects as the initial cluster centers; assign each object to the most similar center; update the cluster means; reassign; repeat until assignments no longer change.)
176
Fig: Clustering of a set of objects based on the k-means method. (The mean of each cluster is marked by a “+”.)
177
Steps
1. Partition objects into k nonempty subsets.
2. Compute seed points as the centroids of the clusters of the current partition (the centroid is the center, i.e., mean point, of the cluster).
3. Assign each object to the cluster with the nearest seed point.
4. Go back to Step 2; stop when there are no more new assignments (a small MLlib sketch of these steps follows).
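A hedged sketch of these steps using Spark MLlib's RDD-based KMeans; the 2-D points are made up for illustration:

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// Hypothetical 2-D points; in practice these would be loaded from HDFS or another source.
val points = sc.parallelize(Seq(
  Vectors.dense(1.0, 1.0), Vectors.dense(1.5, 2.0),
  Vectors.dense(8.0, 8.0), Vectors.dense(8.5, 9.0)))

val model = KMeans.train(points, 2, 20)           // k = 2 clusters, up to 20 iterations
model.clusterCenters.foreach(println)             // the learned cluster means
println(model.predict(Vectors.dense(8.2, 8.4)))   // assign a new point to its nearest center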
179
1. Accuracy
2. Training time
3. Robustness
4. Interpretability
5. Scalability
6. Find top 'n' outlier points
1. Marketing
2. Land use
3. Insurance
4. City planning
5. Earthquake studies
180
Pattern Recognition
Spatial Data Analysis
GIS(Geographical Information System)
Image Processing
WWW (World Wide Web)
Cluster Weblog data to discover groups
Credit approval
Target marketing
Medical diagnosis
Fraud detection
Weather forecasting
Stock Marketing
181
Classification and Prediction: Classification is supervised learning, i.e. we can predict output values from input values; the data is divided into groups, which do not necessarily have similar properties. This is called classification.
Classification is a two-step process:
1. Build the classifier / model.
2. Use the classifier for classification.
1. Build the Classifier / Model: Describing a set of predetermined classes
Each tuple / sample is assumed to belong to a predefined class, as determined by the class label attribute.
Also called as Learning phase or training phase.
The set of tuples used for model construction is training set.
The model is represented as classification rules, decision trees, or mathematical formulae.
182
Classification algorithms are applied to the training data to learn a classifier (model).

Training Data:
NAME | RANK           | YEARS | TENURED
Mike | Assistant Prof | 3     | no
Mary | Assistant Prof | 7     | yes
Bill | Professor      | 2     | yes
Jim  | Associate Prof | 7     | yes
Dave | Assistant Prof | 6     | no
Anne | Associate Prof | 3     | no

Classifier (Model):
IF rank = 'professor' OR years > 6 THEN tenured = 'yes'
2. Using the Classifier: for classifying future or unknown objects. It estimates the accuracy of the model.
The known label of test sample is compared with the classified result from the model.
Accuracy rate is the percentage of test set samples that are correctly classified by the model.
Test set is independent of training set, otherwise overfitting will occur.
If the accuracy is acceptable, use the model to classify data tuples whose class labels are not known.
184
The classifier is then applied to testing data and unseen data.

Testing Data:
NAME    | RANK           | YEARS | TENURED
Tom     | Assistant Prof | 2     | no
Merlisa | Associate Prof | 7     | no
George  | Professor      | 5     | yes
Joseph  | Assistant Prof | 7     | yes

Unseen data: (Jeff, Professor, 4) - Tenured? (yes, since rank = 'professor')
TYPES OF CLASSIFICATION TECHNIQUES
186
Fig: Example for Model Construction and Usage of Model
A decision tree is a flowchart-like tree structure, where each internal node (non-leaf node) denotes a test on an attribute, each branch represents an outcome of the test, and each leaf node (or terminal node) holds a class label.
Decision tree induction was developed by Ross Quinlan; his decision tree algorithm is known as ID3 (Iterative Dichotomiser).
A decision tree is a classifier in the form of a tree structure (a small training sketch follows this list):
Decision node: specifies a test on a single attribute
Leaf node: indicates the value of the target attribute
Arc/edge: a split on one attribute
Path: a disjunction of tests to make the final decision
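A hedged sketch of training such a tree with Spark MLlib's RDD-based DecisionTree API; the tiny labelled data set (label 1.0 = buys computer, features: age and a 0/1 income flag) is invented for illustration:

import org.apache.spark.mllib.tree.DecisionTree
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.linalg.Vectors

// Hypothetical training data: label (1.0 = buys computer) and two numeric features.
val data = sc.parallelize(Seq(
  LabeledPoint(0.0, Vectors.dense(25.0, 0.0)),
  LabeledPoint(1.0, Vectors.dense(35.0, 1.0)),
  LabeledPoint(1.0, Vectors.dense(45.0, 1.0)),
  LabeledPoint(0.0, Vectors.dense(22.0, 0.0))))

val numClasses = 2
val categoricalFeaturesInfo = Map[Int, Int]()   // treat all features as continuous
val model = DecisionTree.trainClassifier(data, numClasses, categoricalFeaturesInfo,
  "gini", 3, 32)                                 // impurity, maxDepth, maxBins

println(model.toDebugString)                     // the learned decision and leaf nodes
println(model.predict(Vectors.dense(40.0, 1.0))) // classify a new customer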
188
EX: Data Set in All Electronics Customer Database
189
Fig: A decision tree for the concept buys_computer, indicating whether a customer at AllElectronics is likely to purchase a computer. Each internal (non-leaf) node represents a test on an attribute. Each leaf node represents a class (either buys_computer = yes or buys_computer = no).
190
Ex: For age attribute
191
1. Accuracy,
2. Training time,
3. Robustness,
4. Interpretability, and
5. Scalability
1. Credit approval
2. Target marketing
3. Medical diagnosis
4. Fraud detection
5. Weather forecasting
6. Stock marketing
192