ML Short Notes by an Idiot
Himanshu Negi

[Diagram: Machine Learning splits into Supervised (Classification, Regression) and Unsupervised (Clustering, Association Rule Mining)]

Machine Learning
Supervised - Labels known
Unsupervised - Labels and classes unknown
Classification - Classify into discrete classes
Regression - Predict a continuous value
Clustering - Grouping similar objects together
Association Rule Mining - Finding patterns of items that occur together, mostly in transactional data

Measuring Dispersion
Range - Difference between the largest and smallest value
Quartiles - 3 values (Q1, Q2, Q3) which divide the sorted data into 4 equal parts
Inter quartile range - Q3 - Q1, the range covered by the middle half of the data
Five number summary of data - [ Min, Q1, Q2, Q3, Max ]
Standard deviation - Square root of the average squared deviation of all data-points from the mean (σ)
Variance - Average squared deviation from the mean, i.e. the square of the standard deviation (σ²)
Mean - Average
Median - Middle value(s) in sorted data
Mode - Most frequent observation

Statistics
Population - All data-points under study in a statistical operation
Sample - Small part of the population used to collect information

Knowing the data
Attributes - Properties that define a data point
Nominal attributes - Names or categories with no inherent order
Binary attributes - Nominal attributes with only two states (0/1, true/false)
Ordinal attributes - Values that represent rank or order of data-points among themselves
Numeric attributes - Measurable quantities, either interval scaled or ratio scaled
Interval scaled attributes - Measured on a scale of equal-sized units, so values can be ranked and their differences compared, but there is no true zero (e.g. temperature in Celsius)
Ratio scaled attributes - Numeric values with a true zero, so one value can be expressed as a multiple of another (e.g. weight)

Reading box-plot
A box plot draws the five number summary: the box spans Q1 to Q3 with a line at the median (Q2), and the whiskers extend towards Min and Max.

Properties of dataset
Dimensionality - Number of attributes a dataset possesses
Sparsity - The degree of presence of null (or zero) values in a dataset
Resolution - The level of detail (scale) at which the data is recorded

Measuring Performance
Confusion Matrix

                  Predicted True       Predicted False
  Actual True     TP                   FN (Type 2 error)
  Actual False    FP (Type 1 error)    TN

Accuracy - Overall correctness of classifier: (TP + TN) / (TP + TN + FP + FN)
Error rate - Misclassification rate: (FP + FN) / (TP + TN + FP + FN) = 1 - Accuracy
True positive rate - How often it says "YES" when the actual class is "YES": TP / (TP + FN)
True negative rate / Specificity - How often it says "NO" when the actual class is "NO": TN / (TN + FP)
False positive rate - How often it says "YES" when the actual class is "NO": FP / (FP + TN)
Precision - How often it was right when it said "YES": TP / (TP + FP)
Sensitivity / Recall - Other names for the true positive rate: TP / (TP + FN)
Prevalence - How often the "YES" condition actually occurs in the sample: (TP + FN) / (TP + TN + FP + FN)

Correlation
Degree to which the values of two variables vary together. It can be computed by:
Direct method
Shortcut method

Matthews' correlation coefficient
Brian W. Matthews - 1975
It is used to measure the quality of binary classification. It considers all four cells of the Confusion Matrix and gives a meaningful measure even if the classes are not balanced in size.
MCC = (TP*TN - FP*FN) / sqrt((TP + FP)(TP + FN)(TN + FP)(TN + FN))
It ranges from -1 (total disagreement) through 0 (no better than random) to +1 (perfect prediction).

Lazy Learners: Store the training data and wait until the testing data appears. Take less time in training but more in prediction. Ex. KNN, Case Based Learning
Eager Learners: Construct a classification model from the given training data before receiving any testing data.

Regression
Regression means stepping back towards the average / finding the strength of the relationship between two variables.
Regression analysis - Modeling the relation between one or more independent variables and a target, the target label being continuous valued

Types of regression techniques
Linear regression
Logistic regression
Polynomial regression
Lasso regression
Ridge regression
Elastic net regression
Stepwise regression

Linear regression using least squares method
Calculate the means of the X and Y values (x̄ and ȳ)
Slope of the line for best fit: m = Σ(x - x̄)(y - ȳ) / Σ(x - x̄)²
Calculate intercept as: c = ȳ - m*x̄
Use m and c to put up the line y = m*x + c

Multiple linear regression
The X value is a tuple of values, so we predict a single value from a number of input values.

Non-linear regression
When the data-points do not follow a straight line, we move to a higher degree and fit a curve.

Logistic regression
Makes predictions for a binary dependent variable; a linear relation between the independent variables and the target is not required.
In general, the output of a linear model is passed through a sigmoid function, which gives a probability between 0 and 1 that is then thresholded into a binary output.
In other words, it models the probability of the event in order to produce a binary output.

Performance parameters for regression
Correlation
Root Mean Square Error - Root of the Mean of the Squares of the Errors
Coefficient of determination (R²) - The square of the correlation between X and Y values in simple linear regression, i.e. the share of the variation in y that is explained by x (How much the y
value changes when the X value changes)
Accuracy - Percentage deviation of the predicted target from the actual values

Dissimilarity in data

Numeric data
Cosine similarity
The cosine of the angle between two vectors in a multidimensional space: cos(x, y) = (x . y) / (||x|| * ||y||)
Minkowski distance
D(x, y) = (Σ |xi - yi|^p)^(1/p)
for p = 1, D = Manhattan distance
for p = 2, D = Euclidean distance
Pearson's correlation coefficient
A measure of correlation for both numeric as well as binary data
Ranges from -1 to 1

Binary data
Use coefficients to determine the similarity of binary attributes. A coefficient ranges from 0 to 1, where 0 represents no similarity and 1 represents complete similarity. Writing f11 for the number of attributes where both objects are 1, f00 where both are 0, and f10, f01 where they differ:
Simple Matching Coefficient = (f11 + f00) / (f11 + f10 + f01 + f00)
Jaccard coefficient = f11 / (f11 + f10 + f01)

Naive Bayes classifier
Thomas Bayes - 1763
For n mutually exclusive and exhaustive events E1, E2, E3, ..., En and an observed event A:
P(Ei|A) = P(A|Ei) * P(Ei) / [ P(A|E1) * P(E1) + P(A|E2) * P(E2) + ... + P(A|En) * P(En) ]
P(E1), P(E2), ... are prior probabilities as they are provided beforehand
P(A|Ei) is the likelihood, as it represents how likely A is to occur given that Ei has occurred
P(Ei|A) are posterior probabilities as they are determined after the result of the experiment

Bayes Classification Method
Used for the class membership problem, i.e. "What is the probability of an item belonging to a class?".
On top of Bayes' theorem, we assume that within a class the effect of one attribute's value is independent of the values of the other attributes. Since this assumption is made only to simplify the calculations, we call it "Naive".
The results are comparable to decision trees and Neural Network classifiers in some cases.
It assigns an element X to class Ci if P(Ci|X) > P(Cj|X) for every other class Cj, so we maximize P(Ci|X). The class for which P(Ci|X) is maximized is called the maximum a posteriori hypothesis.

Support Vector Machine
Works on both linear and non-linear data, and can be used for classification as well as numerical prediction.
It uses a non-linear mapping to take the data to a higher dimension, where we search for a linear separator; the result is then mapped back to the original dimension.
Procedure:
Write the separating hyperplane as w.x + b = 0, where w is the weight vector.
With b written as w0, write the equation as w0 + w1x1 + w2x2 + ...
= 0
Find weights such that for one class w0 + w1x1 + w2x2 + ... > 0 and for the other w0 + w1x1 + w2x2 + ... < 0

Principal Component Analysis
A dimensionality reduction technique that projects the data onto the orthogonal directions (principal components) along which it varies the most.

K Nearest Neighbors
Classification based on analogy: a single test tuple is compared with the training tuples that are similar to it.
All training tuples are stored in an n-dimensional space. Whenever a testing tuple arrives, it is compared to the tuples in that space.
Closeness to the tuples is defined in terms of Euclidean distance or any other suitable metric.
It is a lazy learning algorithm: it builds no model in advance and does all its work once the testing data arrives.
It is non-parametric, meaning we do not make assumptions about the underlying data.
Procedure:
Choose k and a distance metric
Compute the distance from the test tuple to every training tuple
Pick the k training tuples closest to the test tuple
Assign the majority class among those k neighbors (or their average value for numeric prediction)
Pros:
Easy to understand
No assumptions about data
Can be used for both classification and numeric prediction
Works easily on multi-class problems
Cons:
Memory intensive
Sensitive to scale of data
Struggles with training data that has many attributes (high dimensionality)

Modified KNN
Instead of deciding by a simple majority among the neighbors, each neighbor's vote is weighted (for example by closeness), giving a weighted majority, so that outliers have less effect on the classification.

Clustering
Dividing points into groups based on similarity.
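As one concrete illustration of grouping points by similarity, here is a minimal sketch of mean-based clustering (k-means). The notes do not name this algorithm, so treat the choice as an assumption. The function name `kmeans`, the toy points, and the simplification of using the first k points as starting means are all made up for the example.

```python
import math

def kmeans(points, k, iterations=100):
    """Group points into k clusters around their means (illustrative sketch)."""
    # Assumption: the first k points serve as the initial means.
    centroids = [tuple(p) for p in points[:k]]
    assignment = None
    for _ in range(iterations):
        # Assign every point to the index of its nearest mean (Euclidean distance).
        new_assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # no point changed cluster, so the means will not move either
        assignment = new_assignment
        # Recompute each mean from the points currently assigned to it.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return centroids, assignment

# Two visibly separate groups, interleaved so the first two points differ.
points = [(1.0, 1.0), (8.0, 8.0), (1.2, 0.8), (8.2, 7.9), (0.9, 1.1), (7.8, 8.1)]
centroids, labels = kmeans(points, k=2)
print(labels)  # [0, 1, 0, 1, 0, 1]
```

Real implementations usually pick the starting means at random and rerun several times, because a poor start can lead to a poor grouping.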
Various clustering methods are:

Method - Characteristics
Partition method
  - Find mutually exclusive clusters of spherical shape
  - Distance based
  - Make use of mean/median or similar measures
  - Effective for small to medium sized data
Hierarchical method
  - Multiple level decomposition
  - Incorporate other techniques like microclustering
Density based method
  - Can find arbitrary shaped clusters
  - Low density regions separate clusters
  - Each point must have a minimum number of points within the neighborhood
  - May filter outliers
Grid based method
  - Use a multiresolution grid data structure
  - Fast to process

Requirements for cluster analysis:
Scalability
Ability to deal with arbitrary shape
Minimal need for domain knowledge (to choose input parameters)
Dealing with noisy data
Insensitivity to input order
Interpretability and usability

Decision trees
A flowchart that represents the flow of tests on the attributes of the training data and their outcomes, leaf nodes being the classes to predict.
Example: Should I go out or not?

  Weather
    Sunny -> Humidity
      High -> No
      Low -> Yes
    Cloudy -> Yes
    Rainy -> Wind
      Strong -> No
      Weak -> Yes

Types of decision trees:
Categorical variable decision tree
Continuous variable decision tree
Pruning - Instead of making more splits, we remove or merge sub-nodes of a tree to make it less dense and less complex. It can be done with either a top-down or a bottom-up approach.

Hold out method and random sub-sampling
The given data is partitioned into 2 independent datasets. One set is used to train the model and the other to test it.
Random sub-sampling means the above process is carried out k times and the overall accuracy is the average over the k runs.

Cross validation
Divide the initial data into k mutually exclusive subsets called "Folds".
Reserve one fold for testing and use the rest for training.
Repeat the steps k times so that each fold is used for testing once, and take the average of the accuracies.
Leave one out is a special case of k-fold cross validation where k is set to the number of initial tuples, so one sample is left out at a time for testing.
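The k-fold procedure above can be sketched in plain Python. The helper names `k_fold_indices` and `cross_validate`, the toy data, and the majority-class "model" are made up for illustration. The folds here are taken in order without shuffling, which is a simplification.

```python
from collections import Counter

def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for k mutually exclusive folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size

def cross_validate(xs, ys, k, fit, predict):
    """Average the test accuracy over k folds."""
    accuracies = []
    for train, test in k_fold_indices(len(xs), k):
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        correct = sum(predict(model, xs[i]) == ys[i] for i in test)
        accuracies.append(correct / len(test))
    return sum(accuracies) / k

# Hypothetical toy classifier: always predicts the majority training label.
def fit_majority(train_xs, train_ys):
    return Counter(train_ys).most_common(1)[0][0]

def predict_majority(model, x):
    return model

xs = list(range(10))
ys = ["yes"] * 8 + ["no"] * 2

five_fold = cross_validate(xs, ys, 5, fit_majority, predict_majority)
# Leave one out: k equals the number of tuples.
leave_one_out = cross_validate(xs, ys, len(xs), fit_majority, predict_majority)
```

With this toy data both runs average to 0.8, since only the two "no" tuples are ever misclassified.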
In stratified cross validation, the folds are made such that the class distribution in each fold is roughly the same as in the initial data.

Ensemble methods
A model built as a composition of different models. Bagging and Boosting are common ensemble methods.
Bagging : Each base classifier gets an equal vote, and the majority vote of the base classifiers is the final result. Ex. Random forest
Boosting : Every base classifier is assigned a weight and its vote counts according to that weight. Ex. AdaBoost

Apriori algorithm
Finding frequent itemsets by confined candidate generation. The algorithm is named Apriori because it uses prior knowledge of frequent itemset properties.
We apply an iterative approach, or level-wise search, where frequent k-itemsets are used to find (k+1)-itemsets.
Apriori Property : All non-empty subsets of a frequent itemset must also be frequent. Equivalently, if an itemset is infrequent, all its supersets will be infrequent.

Neural Networks
The computational equivalent of a biological neural system.
Model of a neuron : Every input has a weight assigned to it. The weighted inputs are passed to a summing function, and the sum is passed to an activation function that calculates the output.
We may add a bias to a node to allow the activation function to shift and fit the data better.
The Σ produces an output w1x1 + w2x2 + ... + wnxn (plus the bias, if any), which is then passed to a non-linear function called the activation function.
Introducing non-linearity is mandatory: without it, any number of layers collapses into a single linear transformation, so the network cannot learn non-linear relationships.
Various functions used are:
Sigmoid
Signum
Hyperbolic tangent
Rectified Linear Unit (ReLU)

Neural Net Architecture
A network typically consists of an input layer whose number of nodes depends on the input size, hidden layers, and an output layer whose size depends on the number/type of outputs to be produced.
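The neuron model described above (weighted sum, optional bias, then an activation function) can be sketched as follows. The function names and the example weights are made up for illustration.

```python
import math

def sigmoid(z):
    """Squash z into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def signum(z):
    """Return -1, 0 or 1 depending on the sign of z."""
    return (z > 0) - (z < 0)

def relu(z):
    """Rectified Linear Unit: pass positive values, clamp negatives to 0."""
    return max(0.0, z)

def neuron(inputs, weights, bias, activation):
    """Compute activation(w1*x1 + w2*x2 + ... + wn*xn + bias)."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(z)

inputs = [1.0, 2.0]
weights = [0.5, -0.25]  # weighted sum is 0.5 - 0.5 = 0.0

print(neuron(inputs, weights, 0.0, sigmoid))    # 0.5
print(neuron(inputs, weights, 1.0, relu))       # 1.0 (the bias shifts the sum)
print(neuron(inputs, weights, -1.0, relu))      # 0.0
print(neuron(inputs, weights, 1.0, math.tanh))  # tanh(1.0), about 0.76
```

A layer is many such neurons sharing the same inputs, and a network feeds the outputs of one layer in as the inputs of the next.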
Types of neural networks :
Single layer feed forward network - 2 layers (input and output)
Multi layer feed forward network - Many layers between input and output
Recurrent network - Contains feedback loops, which allow working on sequential data.

Convolution Neural Networks : ANNs specialized to work on images. Other than fully connected layers, they contain:
Convolution layer : Slides filters over the image window by window (a window is a sub-matrix of pixels in the image).
Pooling layer : Reduces the spatial size of the representation to cut down the number of parameters and the amount of computation.
  Average pooling (Find the average in a window)
  Min pooling (Find the minimum in the window)
  Max pooling (Find the maximum in the window)

Other terms related to Neural Networks :
Back-propagation : The NN predicts some output that is compared to the real value. An error is calculated, and the error value is "back-propagated" through the network to adjust the weights and get a better output.
Batch Size : Number of training samples worked upon at a time
Epoch : One pass over the entire training dataset; the number of epochs is the number of such passes
Filter : A matrix that is meant to spot specific features in an image.

Layers in a neural network :
Dense/Fully connected layer - Every node in the current layer is connected to every node in the previous layer.
Convolution layer - Applies filters to the image
Pooling layer - Keeps the important features from the image while shrinking it
Dropout layer - Randomly sets input units to 0 at each step during training, with a probability given by the dropout rate, which helps prevent overfitting.
Batch-norm layer - Standardizes the inputs to a layer for each mini-batch. This has the effect of stabilizing the learning process and often reduces the number of training epochs needed.
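The three pooling variants listed above can be sketched with one helper. This assumes non-overlapping square windows (stride equal to the window size), which is a common setup but not the only one. The name `pool2d` and the sample image are made up for illustration.

```python
def pool2d(image, size, reduce_fn):
    """Apply reduce_fn to each non-overlapping size x size window of a 2D list."""
    rows, cols = len(image), len(image[0])
    pooled = []
    for r in range(0, rows - size + 1, size):
        pooled_row = []
        for c in range(0, cols - size + 1, size):
            # Flatten the current window into a list of pixel values.
            window = [image[r + i][c + j] for i in range(size) for j in range(size)]
            pooled_row.append(reduce_fn(window))
        pooled.append(pooled_row)
    return pooled

image = [
    [1, 3, 2, 4],
    [5, 7, 6, 8],
    [9, 2, 1, 0],
    [4, 6, 3, 5],
]

max_pooled = pool2d(image, 2, max)                          # [[7, 8], [9, 5]]
min_pooled = pool2d(image, 2, min)                          # [[1, 2], [2, 0]]
avg_pooled = pool2d(image, 2, lambda w: sum(w) / len(w))    # [[4.0, 5.0], [5.25, 2.25]]
```

Each 4x4 input shrinks to 2x2, which is the reduction in spatial size the pooling layer is meant to provide.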