
12 15.Machine Learning

School of Computing Science and Engineering
Course Code : CSAI2030
Course Name: Predictive Analytics
Introduction to ML
Classification Techniques
Name of the Faculty: VIKASH KUMAR MISHRA
Traditional Programming
Input: x1, x2
Process (program): y = x1 + x2, or y = f(x1, x2)
Output: y
Machine Learning
x1 (Input)   y (Output)
2            4
3            6
4            8
5            10
6            12
7            14
8            16
y = 2 × x1
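The table above can be treated as a tiny learning problem: instead of hard-coding the multiplier, it is estimated from the (x1, y) pairs. The following is a minimal sketch, assuming a model of the form y = w * x1 with no intercept; the function name `fit_slope` is illustrative and not part of the course material.

```python
# Training data copied from the table above.
x1 = [2, 3, 4, 5, 6, 7, 8]
y = [4, 6, 8, 10, 12, 14, 16]


def fit_slope(xs, ys):
    """Least-squares estimate of w in y = w * x (no intercept assumed)."""
    numerator = sum(x * t for x, t in zip(xs, ys))
    denominator = sum(x * x for x in xs)
    return numerator / denominator


w = fit_slope(x1, y)   # learned multiplier
prediction = w * 9     # apply the learned rule to an unseen input
print(w, prediction)
```

The point of the sketch is the direction of the workflow: the data comes first and the rule is derived from it, which is the reverse of traditional programming.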
Machine Learning
Input (x1, x2) and Output (y) → Machine Learning Algorithm → Process: y = f(x1, x2)
What does a Machine Learning problem require?
A data set containing:
• Input (x1, x2, …)
• Output (y)
Where to get data?
• UCI Repository: https://archive.ics.uci.edu
• Kaggle: https://www.kaggle.com/datasets
• United Nations: http://data.un.org/
• India: https://data.gov.in/
What kinds of problems can Machine Learning solve?
Training Input (X: x1, x2) and Training Output (y) → Machine Learning Algorithm → Machine Learning Model (X → y), y = f(x1, x2)
What kinds of problems can Machine Learning solve?
Training: Training Input (X) and Training Output (y) → Machine Learning Algorithm → Machine Learning Model (X → y)
Prediction: New Input (X) → Machine Learning Model → Output (y)
What kinds of problems can Machine Learning solve?
S. No.  Area  bedrooms  bathrooms  stories  parking  price
1       585   3         1          2        no       4200000
2       400   2         1          1        no       3850000
3       306   3         1          1        no       4950000
4       665   3         1          2        yes      6050000
5       636   2         1          1        no       6100000
6       416   3         1          1        yes      6600000
7       388   3         2          2        no       6600000
8       416   3         1          3        no       6900000
9       480   3         1          1        yes      8380000
10      550   3         2          4        yes      8850000
11      720   3         2          1        no       9000000
This is called a Regression Problem.
Regression
• Calculates an approximation of a continuous variable.
• For example:
  o Estimated household income for loan processing, credit card limits, etc.
  o Estimated crop production in an area
  o Estimated traffic at a place
  o Estimated future requirement for saving/investment
What kinds of problems can Machine Learning solve?
S. No.  Area  bedrooms  bathrooms  stories  parking  Price > 6500000
1       585   3         1          2        no       No
2       400   2         1          1        no       No
3       306   3         1          1        no       No
4       665   3         1          2        yes      No
5       636   2         1          1        no       No
6       416   3         1          1        yes      Yes
7       388   3         2          2        no       Yes
8       416   3         1          3        no       Yes
9       480   3         1          1        yes      Yes
10      550   3         2          4        yes      Yes
11      720   3         2          1        no       Yes
This is called a Classification Problem.
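The Yes/No column can be derived from the prices of the regression table, which shows how the same data becomes a classification problem once the outcome is a category. The sketch below assumes a cut-off of 6,500,000; that value is an assumption chosen because it reproduces the Yes/No column (6100000 is labelled No and 6600000 is labelled Yes).

```python
# Prices copied from the regression table.
prices = [4200000, 3850000, 4950000, 6050000, 6100000, 6600000,
          6600000, 6900000, 8380000, 8850000, 9000000]

THRESHOLD = 6_500_000  # assumed cut-off, consistent with the Yes/No column

# A continuous outcome (price) becomes a two-class label.
labels = ["Yes" if price > THRESHOLD else "No" for price in prices]
print(labels)
```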
What kinds of problems can Machine Learning solve?
S. No.  Area  bedrooms  bathrooms  stories  parking  House Cost
1       585   3         1          2        no       Low
2       400   2         1          1        no       Low
3       306   3         1          1        no       Low
4       665   3         1          2        yes      Medium
5       636   2         1          1        no       Medium
6       416   3         1          1        yes      Medium
7       388   3         2          2        no       Medium
8       416   3         1          3        no       High
9       480   3         1          1        yes      High
10      550   3         2          4        yes      High
11      720   3         2          1        no       High
This is called a Classification Problem.
Classification
• Examining the features of a newly presented object and assigning it to one of the predefined classes.
• For example:
  o Classifying loan applications as Low, Medium and High Risk
  o Assigning products to categories and sub-categories
  o Classifying people as BPL etc.
Feature and Outcome
S. No.  Area  bedrooms  bathrooms  Floor  parking  House Cost
1       585   3         1          2      no       Low
2       400   2         1          1      no       Low
3       665   3         1          2      yes      Medium
4       636   2         1          1      no       Medium
5       416   3         1          3      no       High
6       480   3         1          1      yes      High
Features (X): Area, bedrooms, bathrooms, Floor, parking. Also called predictors or independent variables.
Outcome (y): House Cost. Also called the dependent variable or result.
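The split into features and outcome can be written down directly. This is a minimal sketch using the six rows of the table; encoding parking as 1 for "yes" and 0 for "no" is an assumption made for the example, not something the slides prescribe.

```python
# Rows copied from the table above:
# (area, bedrooms, bathrooms, floor, parking, house_cost)
rows = [
    (585, 3, 1, 2, "no", "Low"),
    (400, 2, 1, 1, "no", "Low"),
    (665, 3, 1, 2, "yes", "Medium"),
    (636, 2, 1, 1, "no", "Medium"),
    (416, 3, 1, 3, "no", "High"),
    (480, 3, 1, 1, "yes", "High"),
]

# Features X: one numeric vector per house (parking encoded as 1/0, an assumption).
X = [[area, bed, bath, floor, 1 if parking == "yes" else 0]
     for area, bed, bath, floor, parking, _cost in rows]

# Outcome y: the label a model should learn to predict.
y = [cost for *_features, cost in rows]

print(X[0], y[0])
```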
Affinity Grouping
• Which things go together?
• Used to understand the purchase behaviour of customers
• Market Basket Analysis
• For example:
  o If someone buys a book on Data Science, they are likely to also buy a book on Python.
  o If someone buys milk, they may also buy bread or cornflakes.
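Market basket analysis usually quantifies "goes together" with support (how often an itemset appears) and confidence (how often the consequent appears given the antecedent). The sketch below computes both for a milk and bread rule; the five baskets are invented for illustration and the function names are hypothetical.

```python
# Hypothetical shopping baskets (invented data for illustration only).
baskets = [
    {"milk", "bread"},
    {"milk", "cornflakes"},
    {"milk", "bread", "butter"},
    {"bread", "butter"},
    {"milk"},
]


def count(itemset):
    """Number of baskets that contain every item of the itemset."""
    return sum(1 for basket in baskets if itemset <= basket)


def support(itemset):
    """Fraction of all baskets that contain the itemset."""
    return count(itemset) / len(baskets)


def confidence(antecedent, consequent):
    """Of the baskets with the antecedent, the fraction that also hold the consequent."""
    return count(antecedent | consequent) / count(antecedent)


milk_bread_support = support({"milk", "bread"})
milk_to_bread = confidence({"milk"}, {"bread"})
print(milk_bread_support, milk_to_bread)
```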
Clustering
• Segmenting a heterogeneous population into more homogeneous subgroups or clusters.
• For example:
  o Customer segmentation according to buying behaviour
  o Creating clusters of patients with similar symptoms to identify diseases
Abstract
Data is now being generated rapidly. In such a voluminous data world, finding
the correct data is critical. Data can be sorted into classes on the basis of
some criteria or pattern; this is known as classification. The process of
grouping data into classes can be either supervised or unsupervised.
History of Classification
• Living organisms were simply classified as plants or animals.
• Plants were further classified as herbs, trees and shrubs.
• Animals were further grouped into herbivores, carnivores and omnivores.
• Modern classification systems take into account the evolutionary relationships between living organisms.
History of Classification
o In 1758, the Swedish botanist Carl Linnaeus (Carolus Linnaeus) developed a system that is still used today to classify species.
Figure: seven taxonomic units of classification
Classification
• It is the process of arranging data into homogeneous (similar)
groups according to their common characteristics.
• Raw data cannot be easily understood, and it is not fit for further
analysis and interpretation. Arrangement of data helps users in
comparison and analysis.
• For example, the population of a town can be grouped according to
sex, age, marital status, etc.
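Stages of Classification
Classification is performed in two stages:
1. Model Construction: a model is built from training data.
   Pointer  Result
   5        Fail
   8        Pass
   4        Fail
   9        Pass
   3        Fail
   8        Pass
   Resulting model: if Pointer ≤ 5 then Fail, otherwise Pass.
2. Model Usage: the model is applied to new data, for example Pointer = 7, which the model classifies as Pass.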
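The two stages can be sketched with the Pointer/Result data from this slide. Model construction here simply learns the largest pointer that still fails, which is one assumed way of arriving at the rule "Pointer ≤ 5 means Fail"; the function names are illustrative.

```python
# Training data from the slide: (pointer, result).
training = [(5, "Fail"), (8, "Pass"), (4, "Fail"),
            (9, "Pass"), (3, "Fail"), (8, "Pass")]


def construct_model(examples):
    """Stage 1, model construction: learn the largest pointer that still fails."""
    return max(pointer for pointer, result in examples if result == "Fail")


def use_model(threshold, pointer):
    """Stage 2, model usage: apply the learned rule to a new pointer."""
    return "Fail" if pointer <= threshold else "Pass"


threshold = construct_model(training)
result = use_model(threshold, 7)
print(threshold, result)
```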
Objective of Classification
The primary objectives of data classification are:
• To consolidate the volume of data in such a way that similarities and differences can be quickly understood.
• To aid comparison.
• To point out the important characteristics of the data at a glance.
• To give importance to the prominent data collected while separating out the optional elements.
• To allow statistical treatment of the material gathered.
The basis of classification
Data can be classified on the following bases:
• Geographical or Spatial Classification
• Chronological or Temporal Classification
• Qualitative Classification
• Quantitative Classification
Classification Steps
• Define why you want a classified image and how it will be used.
• Decide whether you really need a classified image.
• Define the study area.
• Select or develop a classification scheme.
• Select imagery.
• Prepare imagery for classification.
• Collect ancillary data.
• Choose a classification method and classify.
Classification criteria
The minimum level of interpretation accuracy should be at least 85 percent.
The accuracy of interpretation for the several categories should be about equal.
The classification system should be applicable over extensive areas.
Classification criteria
• The classification system should be suitable for temporal data.
• Effective use of subcategories should be possible.
• Aggregation of categories must be possible.
• Comparison should be possible.
Classification
• Sentiment Analysis
• Email Spam Classification
• Document Classification
• Image Segmentation
• Speech Recognition
• DNA Sequence Classification
Statistical Descriptors
Statistical classification requires some descriptors to measure the
similarity with the data to be classified:
• Mean
• Distance
• Range
• Standard deviation
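Each of the descriptors listed above can be computed with the Python standard library. The sketch below uses invented sample values, assumes the population form of the standard deviation, and assumes Euclidean distance; the slides do not say which variants are intended.

```python
import math
import statistics

values = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical sample values

mean = statistics.mean(values)
value_range = max(values) - min(values)
std_dev = statistics.pstdev(values)  # population standard deviation (assumed variant)

# Euclidean distance between two hypothetical feature vectors.
distance = math.dist([0, 0], [3, 4])

print(mean, value_range, std_dev, distance)
```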
Types of Classification
Classification can be done mainly in three ways:
1. Supervised: Training, Classification and Testing.
2. Unsupervised: Classification and Testing.
3. Hybrid: A mixture of both methods.
Comparison of Supervised, Unsupervised and Hybrid methods

Process
  Supervised: Input and output variables are given.
  Unsupervised: Only input data is given.
  Hybrid: A mixture of both methods.
Input Data
  Supervised: Algorithms are trained using labeled data.
  Unsupervised: Algorithms are used against data which is not labeled.
  Hybrid: They can establish state-of-the-art results on any task.
Algorithms Used
  Supervised: Support vector machine, neural network, linear and logistic regression, random forest, and classification trees.
  Unsupervised: Can be divided into different categories, such as cluster algorithms, K-means and hierarchical clustering.
  Hybrid: Q-learning, State-Action-Reward-State-Action, Deep Q Network.
Computational Complexity
  Supervised: A simpler method.
  Unsupervised: Computationally complex.
  Hybrid: Complex computation.
Use of Data
  Supervised: Uses training data to learn a link between the inputs and the outputs.
  Unsupervised: Does not use output data.
  Hybrid: Uses both input and output data.
Accuracy of Results
  Supervised: Highly accurate and trustworthy method.
  Unsupervised: Less accurate and trustworthy method.
  Hybrid: Highly accurate results.
Real-Time Learning
  Supervised: Learning takes place offline.
  Unsupervised: Learning takes place in real time.
  Hybrid: Learning takes place offline.
Number of Classes
  Supervised: The number of classes is known.
  Unsupervised: The number of classes is not known.
Main Drawback
  Supervised: Classifying big data can be a real challenge.
  Unsupervised: Precise information regarding data sorting and the output cannot be obtained, because the data used is unlabeled and the output is not known.
Past Knowledge
  Supervised: Allows you to collect data or produce a data output from previous experience.
  Unsupervised: Helps you to find all kinds of unknown patterns in data.
  Hybrid: Learns from past knowledge.
Type
  Supervised: Regression and Classification are the two types of supervised machine-learning techniques.
  Unsupervised: Clustering and Association are the two types of unsupervised learning.
  Hybrid: Decision making.
Uses
  Supervised: Image recognition, speech recognition, weather forecasting.
  Unsupervised: Pre-processing the data, pre-training supervised learning algorithms.
  Hybrid: Warehouses, inventory management, delivery management, power systems, financial systems.
Supervised Classification
Sampling/Training
• Sampling is the process of selecting some data points from the whole set in such a way that the whole set can be predicted on the basis of the selected points.
• In remote sensing classification, some data from each existing class is needed so that spectral information can be converted to LULC information.
• Selection of samples is a very important and critical part of classification.
• Inaccurate sampling may lead to misclassification of remote sensing imagery.
Sampling Rules
Take care of the following points to avoid inaccurate sampling:
i. The reference data and the satellite data should belong to the same time period (the temporal resolution should be the same).
ii. There should be enough samples (η); the number can be determined through Equation 2.1:
$$\eta = z^{2} \times \frac{p\,q}{e^{2}}$$
where z = 2 × (standard deviation), the value z = 2 being generalized from the standard normal deviate of
1.96 for the 95-percent two-sided confidence level; p = standard accuracy, i.e. between 85% and 95%;
q = 1 − p; e = allowable error ≈ 100% − z.
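A worked instance of Equation 2.1 makes the sizes concrete. The sketch below takes p = 0.85 from the stated 85% to 95% range and z = 2 as given; the allowable error e = 0.05 is an assumed value chosen only for illustration, and all quantities are expressed as fractions so that q = 1 − p applies directly.

```python
def sample_size(p, e, z=2.0):
    """Number of samples from Equation 2.1: z^2 * p * q / e^2, with q = 1 - p."""
    q = 1.0 - p
    return z ** 2 * p * q / e ** 2


# p = 0.85 (expected accuracy), e = 0.05 (assumed allowable error), z = 2.
n = sample_size(p=0.85, e=0.05)
print(round(n))
```

With these inputs the formula gives about 204 samples; halving the allowable error quadruples the requirement, because e appears squared in the denominator.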
Sampling Rules (continued)
iii. If the area is large (more than a million acres) or there are more than 12 land use categories,
the number of samples should be increased to 75 to 100.
iv. More samples can be taken in more important categories and fewer in less
important categories.
v. The minimum number of pixel samples can be obtained from the equation:
Samples = 30 × N × C
where N is the number of discriminating variables and C is the number of classes.
Sampling methods
Samples from remote sensing imagery can be selected through any of the following methods:
• Simple Random Sampling (RS)
• Stratified Random Sampling (SRS)
• Systematic Sampling (SS)
• Cluster Sampling (CS)
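The four methods differ only in how the sample positions are chosen. The sketch below illustrates each with the standard `random` module; the function names, the fixed seed and the data layout (lists of items, dictionaries of strata or clusters) are assumptions made for the example.

```python
import random


def simple_random_sample(population, n, seed=0):
    """Every item has the same chance of being selected."""
    return random.Random(seed).sample(population, n)


def stratified_random_sample(strata, n_per_stratum, seed=0):
    """Draw a random sample separately inside each stratum (e.g. each class)."""
    rng = random.Random(seed)
    return {name: rng.sample(members, n_per_stratum)
            for name, members in strata.items()}


def systematic_sample(population, step, start=0):
    """Take every step-th item, beginning at position start."""
    return population[start::step]


def cluster_sample(clusters, n_clusters, seed=0):
    """Pick whole clusters at random and keep all of their items."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), n_clusters)
    return [item for name in chosen for item in clusters[name]]


pixels = list(range(100))  # hypothetical pixel identifiers
print(systematic_sample(pixels, 25))
```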
Classification methods
Minimum Distance to Mean
The following steps are recommended to implement this scheme:
1. Define the possible number of classes and label them.
2. Select a training data set for each class.
3. Calculate the mean value for each class on the basis of its training data set.
4. Take an unclassified pixel, find its distance to the mean value of each class, and assign it to the class with the minimum distance to mean.
5. Repeat step 4 for all unclassified pixels and classify them into the appropriate class.
6. After classification, give a name to each class.
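The procedure can be sketched in a few lines. The example below assumes Euclidean distance and two-band pixel values; the class names, the training values and the function names are invented for illustration.

```python
import math


def class_means(training):
    """Mean vector per class; training maps a class label to its sample vectors."""
    means = {}
    for label, vectors in training.items():
        n = len(vectors)
        means[label] = [sum(band) / n for band in zip(*vectors)]
    return means


def classify_min_distance(pixel, means):
    """Assign the pixel to the class whose mean is nearest (Euclidean distance)."""
    return min(means, key=lambda label: math.dist(pixel, means[label]))


# Hypothetical two-band training samples for two classes.
training = {
    "water": [[10, 20], [12, 22], [11, 18]],
    "vegetation": [[50, 80], [55, 85], [52, 78]],
}

means = class_means(training)
labels = [classify_min_distance(p, means) for p in [[13, 21], [49, 82]]]
print(labels)
```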
Parallelepiped
The following steps are recommended to implement this scheme:
1. Define the possible number of classes and select a training data set for each class.
2. Based on the training data set, find the range, i.e. [minimum spectral value, maximum spectral value], for each class.
3. Classify a pixel into a class if its spectral value falls within the range of that class.
4. Repeat step 3 for all other pixels to classify them.
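A compact sketch of the range test follows. It assumes one [minimum, maximum] range per band, assigns a pixel to the first class whose ranges contain it, and returns `None` for pixels that fall in no range; the tie-breaking rule, the sample values and the function names are assumptions, since the steps above do not say how overlapping ranges are resolved.

```python
def class_ranges(training):
    """Per-class list of (min, max) ranges, one per spectral band."""
    return {label: [(min(band), max(band)) for band in zip(*vectors)]
            for label, vectors in training.items()}


def classify_parallelepiped(pixel, ranges):
    """Return the first class whose ranges contain the pixel, else None."""
    for label, bounds in ranges.items():
        if all(low <= value <= high
               for value, (low, high) in zip(pixel, bounds)):
            return label
    return None  # pixel lies outside every class range (unclassified)


# Hypothetical two-band training samples for two classes.
training = {
    "water": [[10, 20], [12, 22], [11, 18]],
    "vegetation": [[50, 80], [55, 85], [52, 78]],
}

ranges = class_ranges(training)
print(classify_parallelepiped([11, 19], ranges))
```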
Classification
K-Means
The following steps are followed:
1. Define the number of classes, say K.
2. From the data set, choose K random values as the initial mean values.
3. Classify the remotely sensed data into K classes on the basis of minimum distance to the randomly chosen means.
4. Compute the K means of the newly formed classes.
5. Classify the original remotely sensed data set into K classes again, using the means computed in step 4.
6. Repeat steps 4 and 5 iteratively until the means of the K classes stop changing.
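The steps above can be sketched as a short loop. This is a minimal illustration, not a production implementation: it assumes Euclidean distance, a fixed random seed for repeatability, an iteration cap, and that an empty cluster keeps its previous mean; the data points are invented.

```python
import math
import random


def k_means(data, k, seed=0, max_iter=100):
    """Return (means, clusters) after iterating until the means stop changing."""
    rng = random.Random(seed)
    means = rng.sample(data, k)  # step 2: k random data points as initial means
    clusters = []
    for _ in range(max_iter):
        # Steps 3 and 5: assign every point to its nearest mean.
        clusters = [[] for _ in range(k)]
        for point in data:
            nearest = min(range(k), key=lambda i: math.dist(point, means[i]))
            clusters[nearest].append(point)
        # Step 4: recompute each mean (an empty cluster keeps its old mean).
        new_means = [
            [sum(band) / len(cluster) for band in zip(*cluster)] if cluster else means[i]
            for i, cluster in enumerate(clusters)
        ]
        if new_means == means:  # step 6: stop once the means no longer change
            break
        means = new_means
    return means, clusters


# Hypothetical two-band pixel values forming two well-separated groups.
data = [[1, 1], [1, 2], [2, 1], [10, 10], [10, 11], [11, 10]]
means, clusters = k_means(data, k=2)
print(sorted(means))
```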
ISODATA
The following steps are recommended to implement this scheme:
1. Class centers (means) are placed randomly.
2. Pixels are assigned to classes using the minimum distance to mean method.
3. The standard deviation within each cluster and the distance between cluster centers are computed.
   3a. A class is split if its standard deviation is greater than the user-defined threshold.
   3b. Classes are merged if the distance between their centers is less than the user-defined threshold.
4. A new iteration is performed with the new class centers.
5. Iterations are performed until the average inter-center distance falls below the user-defined
threshold or the maximum number of iterations is reached.
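The split and merge logic is easier to follow on one-dimensional data. The sketch below is a deliberately simplified variant: initial centers are passed in rather than placed randomly, a split places the two new centers one standard deviation either side of the old mean, merged centers are averaged, and the loop stops when the centers no longer change instead of testing the average inter-center distance. All of these are assumptions made to keep the example short.

```python
import statistics


def assign(data, centers):
    """Step 2: group values by nearest center; empty groups are dropped."""
    clusters = [[] for _ in centers]
    for x in data:
        nearest = min(range(len(centers)), key=lambda i: abs(x - centers[i]))
        clusters[nearest].append(x)
    return [c for c in clusters if c]


def isodata_1d(data, centers, max_std, min_dist, max_iter=20):
    """Simplified 1-D ISODATA: returns the final sorted list of class centers."""
    for _ in range(max_iter):
        new_centers = []
        for cluster in assign(data, centers):
            mean = statistics.fmean(cluster)
            std = statistics.pstdev(cluster)
            if std > max_std:  # step 3a: split a cluster that is too spread out
                new_centers += [mean - std, mean + std]
            else:
                new_centers.append(mean)
        new_centers.sort()
        merged = [new_centers[0]]
        for center in new_centers[1:]:
            if center - merged[-1] < min_dist:  # step 3b: merge close centers
                merged[-1] = (merged[-1] + center) / 2
            else:
                merged.append(center)
        if merged == centers:  # simplified stopping rule (assumption)
            break
        centers = merged  # step 4: iterate with the new centers
    return centers


values = [1, 2, 3, 11, 12, 13]  # hypothetical single-band pixel values
final_centers = isodata_1d(values, centers=[0.0], max_std=2, min_dist=3)
print(final_centers)
```

Unlike K-Means, the number of classes is not fixed in advance here: the single starting center is split into two because the initial cluster is too spread out.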
Classification Accuracy Assessment
• An N×N confusion matrix is generated.
• Diagonal elements represent sites classified correctly according to the reference data.
• Off-diagonal elements represent sites classified incorrectly according to the reference data.
• Total accuracy = number of correct plots / total number of plots.
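The confusion matrix and the total accuracy can be computed directly from paired reference and predicted labels. In this sketch rows are reference classes and columns are predicted classes, which is one common convention and an assumption here; the six sample sites are invented.

```python
def confusion_matrix(reference, predicted, classes):
    """N x N matrix: rows are reference classes, columns are predicted classes."""
    index = {label: i for i, label in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for ref, pred in zip(reference, predicted):
        matrix[index[ref]][index[pred]] += 1
    return matrix


def total_accuracy(matrix):
    """Correctly classified sites (the diagonal) divided by all sites."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


# Hypothetical reference data and classifier output for six sites.
classes = ["water", "forest", "urban"]
reference = ["water", "water", "forest", "forest", "urban", "urban"]
predicted = ["water", "forest", "forest", "forest", "urban", "water"]

matrix = confusion_matrix(reference, predicted, classes)
accuracy = total_accuracy(matrix)
print(matrix, accuracy)
```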