
Machine Learning Shorts

ML Short notes by an Idiot
Himanshu Negi
[Mind map: Machine Learning → Supervised (Classification, Regression) and Unsupervised (Clustering, Association Rule Mining)]
Machine Learning
Supervised - Labels known
Unsupervised - Labels and classes unknown
Classification - Classify into discrete classes
Regression - Find a continuous value
Clustering - Grouping similar objects together
Association Rule Mining - Finding patterns/relationships among items, typically in transactional data
Measuring Dispersion
Range - Difference of largest and smallest value
Quartiles - 3 values which divide the data into 4 parts
Inter quartile range - Difference between the third and first quartiles: IQR = Q3 - Q1
Five number summary of data - [ Min, Q1, Q2, Q3, Max ]
Standard deviation - Square root of the average squared deviation of all data-points from the mean: σ = √( Σ(xi - μ)² / N )
Variance - Yet another measure of deviation, the square of the standard deviation: σ² = Σ(xi - μ)² / N
Mean - Average
Median - Middle value(s) in sorted data
Mode - Most frequent observation
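A quick sketch of these measures with Python's standard statistics module (the sample data is made up):

import statistics as st

data = [4, 8, 15, 16, 23, 42]

mean   = st.mean(data)                         # average
median = st.median(data)                       # middle value(s) in sorted data
mode   = st.mode([1, 2, 2, 3])                 # most frequent observation
rng    = max(data) - min(data)                 # range
q1, q2, q3 = st.quantiles(data, n=4)           # the 3 quartiles
iqr = q3 - q1                                  # inter-quartile range = Q3 - Q1
five_num = [min(data), q1, q2, q3, max(data)]  # five number summary
sd  = st.pstdev(data)                          # population standard deviation
var = st.pvariance(data)                       # variance = sd squared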
Statistics
Population - All data-points under study in a statistical operation
Sample - Small part of population to collect information
Knowing the data
Attributes - Properties that define a data point
Nominal attributes - Names/categories with no inherent order (e.g. colour, city)
Binary attributes - Nominal attributes with only two states (0/1, true/false)
Ordinal attributes - Representing rank or order of data-points among themselves
Numeric attributes - Quantitative, measurable values
Interval scaled attributes - Values are located at equal intervals and may be ranked
Ratio scaled attributes - Values with a true zero point, so one value can meaningfully be expressed as a multiple (ratio) of another
Reading box-plot - The box spans Q1 to Q3 (the IQR) with a line at the median (Q2); whiskers extend to the min and max (or 1.5 × IQR), and points beyond the whiskers are outliers
Properties of dataset
Dimensionality - Number of attributes a dataset possesses
Sparsity - The degree of presence of null values in a dataset
Resolution - The level of detail/granularity at which the data is captured; the patterns visible depend on the scale of measurement
Measuring Performance
Confusion Matrix

                      Predicted: True       Predicted: False
Actual: True          TP                    FN (Type 2 error)
Actual: False         FP (Type 1 error)     TN
Accuracy - Overall correctness of classifier: (TP + TN) / (TP + TN + FP + FN)
Error rate - Misclassification rate: (FP + FN) / total = 1 - accuracy
True positive rate - Fraction of actual "YES" cases caught: TP / (TP + FN)
True negative rate / Specificity - Fraction of actual "NO" cases caught: TN / (TN + FP)
False positive rate - Fraction of actual "NO" cases wrongly flagged: FP / (FP + TN)
Precision - How often it was right when it said "YES": TP / (TP + FP)
Sensitivity / Recall - Other names for the true positive rate
Prevalence - How often the "YES" condition occurs in the sample: (TP + FN) / total
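These metrics computed from raw confusion-matrix counts, as a rough sketch (the counts are invented):

def metrics(tp, fn, fp, tn):
    total = tp + fn + fp + tn
    return {
        "accuracy":    (tp + tn) / total,
        "error_rate":  (fp + fn) / total,
        "recall":      tp / (tp + fn),   # sensitivity / true positive rate
        "specificity": tn / (tn + fp),   # true negative rate
        "fpr":         fp / (fp + tn),   # false positive rate
        "precision":   tp / (tp + fp),
        "prevalence":  (tp + fn) / total,
    }

print(metrics(tp=40, fn=10, fp=5, tn=45))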
Correlation
Degree to which two variables vary together
Direct method - computed using deviations taken from the actual mean
Shortcut method - computed using deviations taken from an assumed mean
Matthews correlation coefficient (Brian W. Matthews, 1975)
It measures the quality of binary classification. It considers the whole Confusion Matrix
and provides a useful measure even if the classes are not balanced in size:
MCC = (TP·TN - FP·FN) / √((TP+FP)(TP+FN)(TN+FP)(TN+FN))
Lazy Learners: Store the training data and wait until the testing data appear. Take less time
in training but more in prediction.
Ex. KNN, Case Based Learning
Eager Learners: Construct a classification model based on given training data, before
receiving testing data.
Regression
Regression means "stepping back" towards the average; in practice, it means finding the
strength of the relationship between two variables
Regression analysis - Modeling the relation between a target and one or more predictor
variables, the target label being continuous valued
Types of regression techniques
Linear regression
Logistic regression
Polynomial regression
Lasso regression
Ridge regression
Elastic regression
Stepwise regression
Linear regression using least squares method
Calculate the means of the X and Y values: x̄, ȳ
Slope of the line of best fit: m = Σ(xi - x̄)(yi - ȳ) / Σ(xi - x̄)²
Calculate the intercept as: c = ȳ - m·x̄
Use m and c to put up the line y = mx + c
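A minimal least-squares fit in plain Python (the points are made up):

xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

x_bar = sum(xs) / len(xs)
y_bar = sum(ys) / len(ys)

# m = sum((xi - x_bar)(yi - y_bar)) / sum((xi - x_bar)^2)
m = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / \
    sum((x - x_bar) ** 2 for x in xs)
c = y_bar - m * x_bar              # intercept

predict = lambda x: m * x + c      # the fitted line y = mx + c
print(m, c, predict(6))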
Multiple linear regression
When X is a vector (tuple) of features, so a single value is predicted from several inputs:
y = w0 + w1x1 + w2x2 + ... + wnxn
Non-linear regression
When the data-points do not follow a straight line, we try moving to a higher degree and
fit a curve
Logistic regression
Predicts a binary dependent variable; a linear relation between the dependent and
independent variables is not required
The output of a linear model is put into a sigmoid function, σ(z) = 1 / (1 + e^(-z)),
which gives the probability of the positive class
Thresholding that probability (e.g. at 0.5) produces the binary output
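A tiny sketch of the prediction step, with invented weights:

import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

w, b = [0.8, -0.4], 0.1            # hypothetical learned weights and bias
x = [2.0, 1.5]                     # one input sample
z = sum(wi * xi for wi, xi in zip(w, x)) + b
prob = sigmoid(z)                  # probability of the positive class
label = 1 if prob >= 0.5 else 0    # binary output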
Performance parameters for regression
Correlation
Root Mean Square Error - Square root of the mean of the squared errors:
RMSE = √( Σ(yi - ŷi)² / n )
Coefficient of determination (R²) - The degree to which variation in Y is explained by X
(how much the y value changes on a change in X value)
Accuracy - How closely the predicted targets match the actual values (low percentage
deviation from actual values means high accuracy)
Dissimilarity in data
Numeric data
Cosine similarity
A measure of the cosine of the angle between two vectors in a multidimensional space:
cos(θ) = (A · B) / (‖A‖ ‖B‖)
Minkowski distance
D(x, y) = ( Σ |xi - yi|^p )^(1/p)
for p = 1, D = Manhattan distance
for p = 2, D = Euclidean distance
Pearson's correlation coefficient
A measure of linear correlation, applicable to numeric as well as binary data:
r = Σ(xi - x̄)(yi - ȳ) / ( √Σ(xi - x̄)² · √Σ(yi - ȳ)² )
Ranges from -1 to 1
Binary data
Use coefficients to determine similarity of binary attributes. Coefficient can range
anywhere from 0 to 1 where 0 represents no similarity and 1 represents complete
similarity
Simple Matching Coefficient - SMC = number of matching attributes / total number of
attributes (counts both 1-1 and 0-0 matches)
Jaccard coefficient - J = M11 / (M01 + M10 + M11); ignores 0-0 matches, so it suits
asymmetric binary attributes
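Rough pure-Python implementations of these measures:

def minkowski(x, y, p):
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)
# minkowski(x, y, 1) -> Manhattan, minkowski(x, y, 2) -> Euclidean

def cosine_similarity(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    norm = lambda v: sum(a * a for a in v) ** 0.5
    return dot / (norm(x) * norm(y))

def smc(x, y):      # x, y are 0/1 vectors; counts 1-1 and 0-0 matches
    return sum(a == b for a, b in zip(x, y)) / len(x)

def jaccard(x, y):  # ignores 0-0 matches
    both   = sum(a == 1 and b == 1 for a, b in zip(x, y))
    either = sum(a == 1 or  b == 1 for a, b in zip(x, y))
    return both / either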
Naive Bayes classifier (after Thomas Bayes, whose theorem was published in 1763)
For n disjoint events E1, E2, E3, ..., En and an observed event A, Bayes' theorem gives:
P(Ei|A) = P(A|Ei) · P(Ei) / Σj P(A|Ej) · P(Ej)
P(E1), P(E2), ... are prior probabilities as they are provided beforehand
P(A|Ei) is the likelihood probability as it represents how likely A is given that Ei occurred
P(Ei|A) are posterior probabilities as they are determined after the result of the experiment
Bayes Classification Method
Used for the class membership problem, i.e. "What is the probability of an item belonging
to a class?". We assume that the attributes are conditionally independent of each other
given the class. Since this assumption is made to simplify the calculations, we call it
"Naive". The results are comparable to decision tree and Neural Network classifiers in
some cases.
It assigns an element to class Ci if P(Ci|X) > P(Cj|X) for all j ≠ i, so we maximize P(Ci|X).
The class for which P(Ci|X) is maximized is called the maximum posterior hypothesis.
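A toy posterior calculation under the naive independence assumption (all numbers are invented):

priors     = {"spam": 0.4, "ham": 0.6}                # P(class)
likelihood = {"spam": [0.8, 0.3], "ham": [0.1, 0.5]}  # P(feature i = 1 | class)

x = [1, 0]    # observed binary features

def score(cls):
    p = priors[cls]
    for xi, li in zip(x, likelihood[cls]):
        p *= li if xi == 1 else (1 - li)   # multiply likelihoods independently
    return p

scores = {c: score(c) for c in priors}
total = sum(scores.values())
posterior = {c: s / total for c, s in scores.items()}   # P(class | x)
best = max(posterior, key=posterior.get)   # maximum posterior hypothesis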
Support Vector Machine
Used on both linear and non-linear data, for classification as well as numerical
prediction. For non-linear data it uses a non-linear mapping into a higher dimension,
searches there for a linear solution (a separating hyperplane), and maps it back to the
original dimension.
Procedure:
Write the separating hyperplane as w·x + b = 0, where w is the weight vector.
Expand the equation as w0 + w1x1 + w2x2 + ... = 0
Find w such that for one class w0 + w1x1 + w2x2 + ... > 0 and for the other
w0 + w1x1 + w2x2 + ... < 0
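A minimal sketch with scikit-learn, assuming it is available (the points are toy data):

from sklearn.svm import SVC

X = [[0, 0], [1, 1], [2, 2], [3, 3]]   # toy 2-D points
y = [0, 0, 1, 1]                       # two classes

clf = SVC(kernel="linear")             # search for a linear separating hyperplane
clf.fit(X, y)
print(clf.coef_, clf.intercept_)       # w and b of w·x + b = 0
print(clf.predict([[2.5, 2.5]]))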
Principal Component Analysis
Projects the data onto a smaller set of orthogonal directions (principal components) that
capture the maximum variance; the components are the eigenvectors of the data's
covariance matrix. Used for dimensionality reduction.
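A bare-bones PCA sketch with NumPy (random data for illustration):

import numpy as np

X = np.random.rand(100, 3)              # 100 samples, 3 features
Xc = X - X.mean(axis=0)                 # centre the data
cov = np.cov(Xc, rowvar=False)          # covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)  # eigen-decomposition (symmetric matrix)
order = np.argsort(eigvals)[::-1]       # sort directions by variance, descending
components = eigvecs[:, order[:2]]      # keep the top 2 principal components
X_reduced = Xc @ components             # project the data onto them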
K Nearest Neighbors
Classification based on analogy: a test tuple is compared with the training tuples most
similar to it. All training tuples are stored in an n-dimensional space. Whenever a testing
tuple arrives, it is compared to the patterns in that space. Closeness to the tuples is
defined in terms of Euclidean distance or any other suitable metric. It is a lazy learning
algorithm that does its work only once the testing data arrives. It is non-parametric,
meaning we make no assumptions about the underlying data.
Procedure:
Choose k, the number of neighbours to consider
Compute the distance from the testing tuple to every stored training tuple
Predict by majority vote among the k nearest tuples (or by averaging their values for
regression)
Pros:
Easy to understand
No assumptions about data
Can be used for both classification and regression
Works easily on multi-class problems
Cons:
Memory intensive
Sensitive to scale of data
Struggles with high-dimensional (many-attribute) training data
Modified KNN
Instead of a plain majority over the k neighbours, each neighbour's vote is weighted
(e.g. by inverse distance), giving a weighted majority so that outliers have less effect
on the classification; see the sketch below.
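A compact distance-weighted KNN sketch (toy training data):

from collections import defaultdict

train = [([1.0, 1.0], "A"), ([1.2, 0.8], "A"),
         ([4.0, 4.2], "B"), ([4.1, 3.9], "B")]

def euclidean(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y)) ** 0.5

def knn_predict(x, k=3):
    neighbours = sorted(train, key=lambda t: euclidean(x, t[0]))[:k]
    votes = defaultdict(float)
    for point, label in neighbours:
        votes[label] += 1 / (euclidean(x, point) + 1e-9)  # inverse-distance weight
    return max(votes, key=votes.get)

print(knn_predict([1.1, 0.9]))   # -> "A"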
Clustering
Differentiating points into groups based on similarity. Various clustering methods are:
Partition method
- Find mutually exclusive clusters of spherical shape
- Distance based
- Make use of mean/median or similar measures
- Effective on small to medium sized data

Hierarchical method
- Multiple level decomposition
- Incorporate other techniques like micro-clustering

Density based method
- Can find arbitrary shaped clusters
- Low density regions separate clusters
- Each point must have a minimum number of points within its neighborhood
- May filter outliers

Grid based method
- Use a multiresolution grid data structure
- Fast to process
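The partition method is typified by k-means: divide the data into k sets, take the mean of
each set, and iteratively reassign points to the nearest mean until nothing moves. A rough
sketch:

import random

def kmeans(points, k, iters=100):
    centroids = random.sample(points, k)       # initial means
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:                       # assign each point to its nearest mean
            i = min(range(k),
                    key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centroids[j])))
            clusters[i].append(p)
        new = [tuple(sum(c) / len(c) for c in zip(*cl)) if cl else centroids[i]
               for i, cl in enumerate(clusters)]
        if new == centroids:                   # stop when the means settle
            break
        centroids = new
    return centroids, clusters

pts = [(1.0, 1.0), (1.2, 0.8), (4.0, 4.2), (4.1, 3.9), (0.9, 1.1)]
centroids, clusters = kmeans(pts, k=2)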
Requirements for cluster analysis:
Scalability
Ability to deal with arbitrary shape
Domain knowledge
Dealing with noisy data
Insensitivity to input order
Interpretability and usability
Decision trees
A flowchart that represents flow of tests on training data and the outcome, leaf nodes being the
classes to predict.
Example: Should I go out or not
Weather?
├─ Sunny → Humidity?
│   ├─ High → No
│   └─ Low → Yes
├─ Cloudy → Yes
└─ Rainy → Wind?
    ├─ Strong → No
    └─ Weak → Yes
Types of decision trees:
Categorical variable decision tree
Continuous variable decision tree
Pruning - Instead of making more splits, we collapse sub-nodes of the tree to make it less
dense and less complex. It can be done via either a top-down or a bottom-up approach.
Hold out method and random sub-sampling
The given data is partitioned into 2 independent datasets. One set is used to train the
model and the other for testing. Random sub-sampling means the above process is carried
out k times and the overall accuracy is the average of the k runs.
Cross validation
Divide the initial data into k mutually exclusive subsets called "folds". Reserve one fold
for testing and the rest for training. Repeat the steps k times, once per fold, and take
the average of the accuracies.
Leave-one-out is a special case of k-fold cross validation where k is set to the number of
initial tuples and one sample is left out at a time for testing.
In stratified cross validation, the folds are made such that the class distribution is
balanced among the folds.
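A hand-rolled k-fold index split to make the procedure concrete (evaluate() is a
hypothetical stand-in for training and scoring a model):

def k_fold_indices(n, k):
    folds = [list(range(i, n, k)) for i in range(k)]   # k disjoint index sets
    for i in range(k):
        test = folds[i]
        train = [idx for j, f in enumerate(folds) if j != i for idx in f]
        yield train, test

# usage sketch: average the accuracy over the k runs
# scores = [evaluate(train, test) for train, test in k_fold_indices(len(data), k=5)]
# accuracy = sum(scores) / len(scores)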
Ensemble methods
When a model is built as a composition of several different models. Bagging and Boosting
are common ensemble methods.
Bagging : This ensemble accounts for the majority votes given by the base classifiers to produce
the final result. Ex. Random forest
Boosting : Every base classifier is assigned a weight and its vote is accounted for according to the
weight. Ex. AdaBoost
Apriori algorithm
Finding frequent item sets by candidate generation. The algorithm is named Apriori
because it uses prior knowledge of frequent itemset properties. We apply an iterative,
level-wise search where frequent k-itemsets are used to find the (k+1)-itemsets.
Apriori property: all subsets of a frequent itemset must be frequent. Equivalently, if an
itemset is infrequent, all its supersets will be infrequent.
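A condensed level-wise sketch of the frequent-itemset loop (the transactions are invented):

from itertools import chain

transactions = [{"milk", "bread"}, {"milk", "eggs"},
                {"milk", "bread", "eggs"}, {"bread"}]
min_support = 2

def frequent_itemsets(transactions, min_support):
    items = set(chain.from_iterable(transactions))
    level = [frozenset([i]) for i in items]            # 1-itemset candidates
    result = {}
    while level:
        counts = {s: sum(s <= t for t in transactions) for s in level}
        frequent = [s for s, c in counts.items() if c >= min_support]
        result.update((s, counts[s]) for s in frequent)
        # join step: only frequent k-itemsets generate (k+1)-candidates
        # (Apriori property: supersets of infrequent sets are pruned)
        level = list({a | b for a in frequent for b in frequent
                      if len(a | b) == len(a) + 1})
    return result

print(frequent_itemsets(transactions, min_support))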
Neural Networks
Equivalent of biological neural system, just meant for use with computers.
Model of a neuron :
Every input has a weight assigned to it; the weighted inputs are passed to a summing
function and then to the activation function that calculates the output. We may add a bias
to a node to allow the activation function to shift and fit the data better.
The Σ produces an output w1x1 + w2x2 + ... + wnxn (plus bias), which is then passed to a
non-linear function called the activation function. Introducing non-linearity is mandatory,
or else the network cannot learn anything beyond a linear mapping, no matter how many
layers it has. Various functions used are:
Sigmoid
Signum
Hyperbolic tangent
Rectified Linear Unit (ReLU)
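These activations and a single neuron's forward pass in NumPy (the weights are made up):

import numpy as np

sigmoid = lambda z: 1 / (1 + np.exp(-z))
signum  = np.sign
tanh    = np.tanh                      # hyperbolic tangent
relu    = lambda z: np.maximum(0, z)   # rectified linear unit

x = np.array([0.5, -1.2, 0.3])         # inputs
w = np.array([0.4,  0.7, -0.2])        # weights (hypothetical)
b = 0.1                                # bias
z = w @ x + b                          # the summing step
out = relu(z)                          # activation output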
Neural Net Architecture
A network typically consists of an input layer with nodes matching the input size, hidden
layers, and an output layer sized by the number/type of outputs to be produced.
Types of neural networks :
Single layer feed forward network - 2 layers (input and output)
Multi layer feed forward network - Many layers between input and output
Recurrent network - Contains feedback loops, which allow working on sequential data.
Convolution Neural Networks :
ANNs specialized to work on images. Other than fully connected layers, they contain:
Convolution layer : Apply filters on the image kernel by kernel (kernel is a sub-matrix of
pixels in the image).
Pooling layer : Reduce the spatial size of representation to reduce the amount of
parameters in computation.
Average pooling (Find average in a kernel)
Min pooling (Find minimum in the kernel)
Max pooling (Find maximum in the kernel)
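A tiny 2x2 pooling sketch in NumPy (toy 4x4 "image"):

import numpy as np

img = np.arange(16.0).reshape(4, 4)                 # toy 4x4 image
pooled = img.reshape(2, 2, 2, 2).max(axis=(1, 3))   # 2x2 max pooling -> 2x2 output
avg    = img.reshape(2, 2, 2, 2).mean(axis=(1, 3))  # average pooling variant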
Other terms related to Neural Networks :
Back-propagation : The NN predicts some output that is compared to the real value. An
error is calculated, and the error value is "back-propagated" through the network to
adjust the weights and get better output.
Batch Size : Portion of data worked upon at a time
Epoch - Number of passes of the entire training dataset
Filter : A matrix that is meant to spot specific features in an image.
Layers in a neural network :
Dense/Fully connected layer - All nodes in the current layer are connected to all nodes
in the previous layer.
Convolution layer - Applies filter to image
Pooling layer - Gets important features from image
Dropout layer - Randomly sets input units to 0 with a frequency of rate at each step during
training time, which helps prevent overfitting.
Batch-norm layer - Standardizes the inputs to a layer for each mini-batch. This has the
effect of stabilizing the learning process and dramatically reducing the number of
training epochs needed.