Foundations of Statistics
Variables
- What are the central tendencies? What is the spread of the values? How much do the values vary? Are there any abnormalities that stand out?
Numerical Variables:
- Mean (μ): average of the values
  - Pros: helps describe the central tendency
  - Cons: not robust to extreme values
- Median: 50th percentile
  - Pros: robust to extreme values
  - Cons: only concentrates on measures of location; difficult to use for describing multiple variables
- Mode: most frequent value
  - Pros: finds the most frequent value
  - Cons: difficult to determine with discrete sets when values get grouped together too generally; the mode might not be a good description in some continuous cases; not useful if the data is spread out
- Variance (σ²) and Standard Deviation (σ): spread of the values
  - Pros: dependent on the mean; help describe the distribution of values in relation to the mean
    - A small variance indicates that the data is close to the mean and thus similar in value; a high variance indicates that the data is far away from the mean and thus dissimilar in value
  - Cons: not robust to extreme values
Categorical Variables:
- Frequency Tables: display the number of times a data value occurs in a set
  - Pros: shows how many times each category occurs
  - Cons: doesn't describe the number of occurrences in relation to everything else
- Proportion Tables: display the fraction of occurrences of a data value in a set
  - Pros: describes the occurrences of each category relative to the whole data set
  - Cons: doesn't give any information on the size of the data set
- Contingency Tables: display frequencies or proportions among multiple categorical variables simultaneously
  - Pros: can describe the relationship between multiple variables
  - Cons: realistically can only be used for two variables at a time
  - Margins: individual variable frequencies/proportions; constructed by totalling the respective rows or columns
Correlation (ρ): The correlation helps quantify the linear dependence between two quantities (e.g., does knowing something about one variable inform anything about the other); ρ = Cov(X, Y) / (σX σY)
- Bounds: -1 ≤ ρ ≤ 1
- A positive correlation means a direct relationship; a negative correlation means an inverse relationship; zero means no linear correlation
- Notes: Zero correlation does not imply that the variables are independent; the formula above is for numerical variables - categorical variables usually need to be dummified to calculate correlations (create a one vs. all situation)
- Pros: describes the relationship between variables
- Cons: assumes the relationship is linear
Independent vs. Dependent Variables: independent variables are in no way related to one another; you cannot infer any information about one variable from information about the other; dependent variables are related to each other in some way
Statistical Inference
The process of deducing properties of an underlying distribution by analysis of data; we hope to make educated conclusions about a population by inferring behavior from a sample
- Statistical Hypothesis: a question we wish to answer that is testable by observing a process modeled by a set of random variables
  - Use the results to infer behavior in the population
- Hypothesis Test:
  - State the null and alternative hypotheses:
    - Null Hypothesis (H0): The assumed default scenario in which nothing abnormal is observed (e.g., there is no difference among groups, etc.)
    - Alternative Hypothesis (HA): The scientific supposition we desire to test that contrasts H0 (e.g., there is a difference among groups, etc.); the complete opposite of the null hypothesis
  - Assume the null hypothesis is true; calculate the probability (p-value) of observing results at least as extreme as what is present in your data sample; usually use a table or a computer to do the calculations
  - Based on the p-value, decide which hypothesis is more likely.
  - Generally: if the p-value is > 0.05, retain H0; if the p-value is < 0.05, reject H0 in favor of HA
T-Test: An independent samples t-test is used when you want to compare the means of a normally distributed interval dependent variable for two independent groups
- One Sample T-Test: To examine the average difference between a sample and a known value of the population mean
  - Assumptions: The population from which the sample is drawn is normally distributed; sample observations are randomly drawn and independent.
  - Test statistic: t = (x̄ − μ) / (s / √n)
  - P-value calculation: calculate the t-test statistic given by the equation above and compare the value with a standard table to get the p-value; or use a computer to calculate the p-value
    - x̄: sample mean
    - n: number of samples
    - s: sample standard deviation
    - n − 1: the degrees of freedom (degrees of freedom are usually one less than the sample size)
  - H0: the average of the sample is equal to the known value
  - HA: the average of the sample is not equal to the known value
  - Note: Usually for numerical variables
- Two Sample T-Test: To examine the average difference between two samples drawn from two different populations; used to determine if the samples are statistically similar enough to each other to compare for evaluation purposes
  - Assumptions: The populations from which the samples are drawn are normally distributed; the standard deviations of the two populations are equal; sample observations are randomly drawn and independent
  - Test statistic (pooled): t = (x̄1 − x̄2) / (sp √(1/n1 + 1/n2)), where sp² = ((n1 − 1)s1² + (n2 − 1)s2²) / (n1 + n2 − 2)
  - P-value calculation: calculate the t-test statistic given by the equation above and compare the value with a standard table to get the p-value; or use a computer to calculate the p-value
    - x̄1, x̄2: sample means
    - n1, n2: number of samples in each group
    - s1, s2: sample standard deviations
    - n1 + n2 − 2: the degrees of freedom
  - H0: the averages of the two samples are equal
  - HA: the averages of the two samples are not equal
  - Note: Usually for numerical variables; the one sample t-test can be derived as a special case
F-Test: Unlike the Z-statistic or t-statistic, which deal with means and proportions, the chi-square and F-tests deal with variance within the samples. The F statistic is the ratio of the variances of two samples, and the F-test is used to assess whether the variances of two different populations are equal.
- Assumptions: The populations from which the samples are drawn are normally distributed; sample observations are randomly drawn and independent.
- Test statistic: F = s1² / s2²
- P-value calculation: calculate the F-test statistic given by the equation above and compare the value with a standard table to get the p-value; or use a computer to calculate the p-value
  - s1, s2: sample standard deviations
  - n1 − 1 and n2 − 1: the degrees of freedom for each sample (degrees of freedom are usually one less than the sample size)
- H0: the variances of the two samples are equal
- HA: the variances of the two samples are not equal
- Note: Generally want to do the F-test before doing a two sample t-test; cannot determine if the means of the two samples are the same if the variances are different; usually for numerical variables because we are using the mean and standard deviation
One-Way ANOVA (Analysis of Variance): Uses F-tests to assess the equality of means of two or more groups; similar to the t-test; when there are two groups, it is the same as a two sample t-test
- Assumptions: The populations from which the samples are drawn are normally distributed; the standard deviations of the populations are equal; sample observations are randomly drawn and independent
- P-value calculation: Compare the test statistic F = MSB / MSW with a standard table of F-values to determine whether the test statistic surpasses the threshold of statistical significance (yielding a significant p-value); or use a computer
  - Mean squares between groups: MSB = Σi ni(x̄i − x̄)² / (k − 1). A good estimate of the overall variance only when H0 is true. Quantifies the between-group deviations from the overall grand mean.
    - k: number of groups
    - i: index of the groups
    - ni: number of observations for a specific group
    - x̄i, x̄: group average and grand average
  - Mean squares within groups: MSW = Σi Σj (xij − x̄i)² / (N − k). A good estimate of the overall variance, unaffected by whether the null or alternative hypothesis is true. Quantifies the within-group deviations from the respective group means.
    - k: number of groups
    - i: index of the groups
    - j: index of an observation in a specific group
    - N: total number of observations
    - xij: value for an observation
    - x̄i: group average
- H0: the averages of all of the groups are equal
- HA: at least one group has a different average from another
- Notes: Generally for numerical values because we are using the mean
Chi-Square (χ²) Test of Independence: Use the chi-square test for independence to determine whether there is a significant relationship between two categorical variables, i.e., to test whether two categorical variables are independent.
- Assumptions: Sample observations are randomly drawn and independent
- Test statistic: χ² = Σi Σj (Oij − Eij)² / Eij, where Eij = (row i total × column j total) / N
- P-value calculation: Compare the test statistic value with a standard table of χ² values to determine whether the test statistic surpasses the threshold of statistical significance (yielding a significant p-value)
  - i: index of the group in the first variable
  - j: index of the group in the second variable
  - Oij: observed number of observations for a specific group
  - N: total number of observations
- H0: the two variables are independent
- HA: the two variables are not independent
- Note: For comparing two categorical variables; similar to correlation for numerical variables but not based on a linear relationship
Other Tests:
http://stats.idre.ucla.edu/stata/whatstat/what-statistical-analysis-should-iusestatistical-analyses-using-stata/
https://www.csun.edu/~amarenco/Fcs%20682/When%20to%20use%20what%20test.pdf
R
Correlation, Variance and Covariance (Matrices): var, cov and cor compute the variance of x and the covariance or correlation of x and y if these are vectors.
If x and y are matrices then the covariances (or correlations) between the columns of x and the columns of y are computed. cov2cor scales a covariance matrix into the corresponding correlation matrix efficiently.
- var(x, y = NULL, na.rm = FALSE, use = "everything"): variance
- cov(x, y = NULL, use = "everything", method = c("pearson", "kendall", "spearman")): covariance between two vectors
- cor(x, y = NULL, use = "everything", method = c("pearson", "kendall", "spearman")): correlation between two vectors
- cov2cor(V): scales a covariance matrix into the corresponding correlation matrix efficiently
- Arguments:
  - x: numeric vector, matrix or data frame
  - y: NULL (default) or a vector, matrix or data frame with compatible dimensions to x. The default is equivalent to y = x (but more efficient).
  - method: a character string indicating which correlation coefficient (or covariance) is to be computed. One of "pearson" (default), "kendall", or "spearman"; can be abbreviated.
  - V: symmetric numeric matrix, usually positive definite, such as a covariance matrix.
Density, distribution function, quantile function and random generation for the t distribution with df degrees of freedom (and optional non-centrality parameter ncp).
- dt(x, df, ncp, log = FALSE): density for the t distribution
- pt(q, df, ncp, lower.tail = TRUE, log.p = FALSE): distribution function; gives p-values when inputting a t statistic value for q; set lower.tail to FALSE to get the area in the right tail
- qt(p, df, ncp, lower.tail = TRUE, log.p = FALSE): quantile function; e.g., use p = 0.05 to find the 5% threshold of the t distribution
- rt(n, df, ncp): random generation
- Arguments:
  - x, q: vector of quantiles
  - p: vector of probabilities
  - n: number of observations
  - df: degrees of freedom
  - ncp: non-centrality parameter delta; currently except for rt(), only for abs(ncp) ≤ 37.62. If omitted, use the central t distribution.
  - log, log.p: logical; if TRUE, probabilities p are given as log(p).
  - lower.tail: logical; if TRUE (default), probabilities are P[X ≤ x], otherwise P[X > x].
t.test(x, y, mu, alternative): t test for one or two sample t tests; y defaults to NULL for one sample; x and y are non-empty numeric vectors of data values; mu is the true value of the mean or of the difference in means (two sample); default mu is 0; alternative specifies the alternative hypothesis; options are 'two.sided' (default), 'greater' or 'less'
var.test(x, y, alternative): F test for comparing the variances of two samples from normal populations; alternative specifies the alternative hypothesis; options are 'two.sided' (default), 'greater' or 'less'
aov(formula, data): Fit an analysis of variance model by a call to lm for each stratum; used for conducting one-way ANOVA; formula specifies the model in the form values ~ categories; usually paired with the summary() function to find out more information
chisq.test(data): conduct a chi-square test of independence on the data
bartlett.test(x, y): conduct the Bartlett test of homogeneity of variances; can also take a formula instead of x and y; x and y are vectors of data values; x is numeric and y is a factor
Missingness
Occurs when at least some of an observation's values are not present within the dataset. We say that the absent values are "missing," and that the observation itself is "incomplete." A value could be missing for many reasons (e.g., human error, carelessness in handling, an undefined mathematical computation, etc.). These reasons are often unknown by the person who ultimately receives the dataset.
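Before choosing a strategy for an incomplete dataset, it helps to quantify the missingness. A minimal base-R sketch follows; the built-in airquality dataset (which happens to contain NAs) is used purely for illustration.

  # airquality ships with R and contains NAs in the Ozone and Solar.R columns
  data(airquality)
  colSums(is.na(airquality))        # number of missing values per column
  mean(complete.cases(airquality))  # fraction of rows that are complete cases
  head(airquality[!complete.cases(airquality), ])  # rows complete case analysis would drop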
- What to do with an incomplete dataset?:
  - Complete case analysis: only deal with complete observations; ignore all observations with missing data
    - Pros: quick and easy
    - Cons: severely limits the amount of available information; a smaller sample size leads to increasing standard errors of estimates
- Types of missing data: can reveal some information about the dataset
  - Missing Completely at Random (MCAR): Each piece of data in the overall dataset has an equally likely chance of being absent. MCAR data is the best case scenario for missing data in general, because its manifestation is truly "completely at random". Deletion of MCAR observations will not end up biasing your results.
  - Missing at Random (MAR): The chance that a piece of data is missing is dependent on variables for which we have complete information within our overall dataset. The probability a piece of data is missing depends on available information that we have already collected; they are not independent. MAR is the next-best scenario for missing data after MCAR because, although each observation has a different likelihood of being missing, we theoretically can estimate this likelihood. When data are MAR, it is acceptable to drop these observations from our analysis; if we control for the factors that are related to the missingness and adjust for their effects, we can avoid bias in our model.
  - Missing Not at Random (MNAR): The chance that a piece of data is missing is dependent on the actual value of the observation itself. The value of the missing piece of data is directly related to the reason why it is missing in the first place. MNAR is the worst-case scenario for missing data because it is non-ignorable. We cannot theoretically accurately estimate the missing values because the reason they are missing is not captured within our dataset. When data are MNAR, it is not appropriate to drop these observations from our analysis; doing so would leave us with a biased dataset, and thus our analyses would return biased models.
- Imputation: the process of filling in missing data
  - Mean value imputation procedure: Compute the average of the observed values for a variable that has missingness. Impute the average for each of the missing values.
    - Pros: One of the simplest ways of dealing with missing data because of its relatively straightforward approach.
    - Cons: Can distort the distribution of the variable and underestimate the standard deviation. Can distort relationships between variables by dragging correlation estimates towards 0.
  - Simple random imputation procedure: For each missing value in a variable, randomly select a complete value of the same variable; impute this randomly selected value. Repeat the process until all values are complete.
    - Pros: Uses true, observed values to fill in missingness
    - Cons: Can amplify outlier observation values by having them repeat in the dataset. Can induce bias into the dataset.
  - Regression prediction procedure: Assume an underlying, linear structure exists in the data. Give weights to a subset of the complete variables. Use the relationship between the complete variables and the complete observations to impute missing observations.
    - Pros: Uses true, observed values to fill in missingness. Uses the relationships among multiple variables to fill in missingness.
    - Cons: Must make assumptions about the structure of the data.
Can inappropriately extrapolate beyond the scope of available information in our dataset.
- Pros of Imputation: Helps retain a larger sample size of your data. Does not sacrifice all the available information in an observation because of sparse missingness. Can potentially avoid unwanted bias.
- Cons of Imputation: The standard errors of any estimates made during analyses following imputation tend to be too small. The methods operate under the assumption that all measurements are actually "known," when in fact some were imputed. Can potentially induce unwanted bias.
- Imputation can be done by supervised learning or other prediction methods, but this is only possible if only one column has missing data; we can get by using complete cases to fill in the missing data.
R
complete.cases(df or mat): a complete case is a row without missing data; returns a boolean for each row telling whether that row is a complete case
transform(data, col = modval, ...): transforms a dataset; choose a column or columns to transform and set them to the modified values of the data; good for transforming missing data
- ex.: impute by average; can replace the mean function with any other method for imputation: transform(data, col1 = ifelse(is.na(col1), mean(col1, na.rm=TRUE), col1))
R Library: VIM
Visualization and imputation of missing values
aggr(data): aggregations for missing or imputed values on a plot
R Library: mice
md.pattern(data): display the missing data pattern
R Library: Hmisc
impute(x, func): function to impute values; the default imputation method (func) is the mean; set func equal to 'random' for random imputation
is.imputed(x): determines if a value or values are imputed
Machine Learning
Basic Summary: Machine learning developed from the combination of statistics and computer science; it aims to implement algorithms that allow computers to "learn" about the data they analyze. Traditional algorithms require computers to follow a strict set of program instructions; machine learning algorithms instead assess the data at hand to make informed decisions. Machine learning algorithms have the ability to learn and adapt; traditional algorithms do not. Works well because of human-designed representations and input features.
Supervised Learning Algorithms: Your data includes the "truth" that you wish to predict. Use what you know about your observations to construct a model for future decision making. Essentially, you are trying to correctly fill in ("impute") the unknown target values of new observations.
- Regression: In regression, we aim to predict a continuous output given a slew of input variables. Our data contains the output that we wish to predict.
- Classification: In classification, we aim to predict a categorical output given a slew of input variables. Our data contains the output that we wish to predict.
- Simply becomes optimizing weights to best make a final prediction
Unsupervised Learning Algorithms: Your data does not include the "truth" that you wish to predict. Use your data to find underlying structure to inform intrinsic behavior that is not already explicitly available.
- Clustering: In clustering, we aim to uncover commonalities in our data that help segment observations into different groups; within the groups, observations share some characteristics. Our data does not contain the group information that we seek.
- Dimension Reduction: In dimension reduction, we aim to summarize massive amounts of data into smaller, more understandable components while retaining the structure of the original dataset. Our data does not tell us what the smaller components are.
Used to eliminate structural redundancies without sacrificing information.
Supervised Learning
Used to predict the values of one or more variables Y from a given set of predictors X. Predictions are based on the training data of previously solved cases. Performance can be estimated by some loss function (for example, RSS in regression or OOB error in bootstrap aggregating), using training-test splitting or cross-validation.
Regression: simple/multiple linear regression, regression trees, etc.
Classification: logistic regression, discriminant analysis, naive Bayes, support vector machines, classification trees, etc.
K-Nearest Neighbors
The basic idea: Observations that are closest to an arbitrary point are the most similar. Can be used in both classification and regression settings (i.e., the output can take the form of class membership or property values). For K-Nearest Neighbors we find the K closest observations to the data point in question, and predict the majority class as the outcome. For 1-Nearest Neighbors, the single closest observation is the sole vote.
Note: unlike most supervised learning, KNN does not require training
Voronoi Tessellation for Classification: The KNN algorithm partitions the feature space into different regions that represent classification rules; these regions are called Voronoi tessellations. Boundaries represent areas where distances are equal with respect to different observations. By following the Voronoi tessellations, the overall decision boundary has the flexibility to be non-linear.
1NN: While the algorithm is very simple to understand and implement, its simplicity comes along with some drawbacks. 1NN is unable to adapt to outliers; a single outlier can dramatically change the Voronoi tessellations, and thus the decision boundaries. There is no notion of class frequencies (i.e., the algorithm does not recognize that one class is more common than another). One way to get around these limitations and to add some stability is to consider more neighboring points (increasing the value of K) and assess the majority vote. What happens when we choose all neighbors?
Classification Algorithm: Given the following information:
- The training set:
  - Xi: The feature values for the ith observation (i.e., the location in space)
  - Yi: The class value for the ith observation (i.e., the group label)
- The testing set:
  - X*: The feature values for the new observation that we wish to classify.
The KNN classification algorithm (a minimal R sketch appears after the pros and cons list below):
- Calculate the distance between X* and each observation Xi
- Determine the K observations that are closest to X* (have the smallest distance)
- Classify X* as the most frequent class Y among the K selected observations.
Regression Algorithm: Given the following information:
- The training set:
  - Xi: The feature values for the ith observation (i.e., the location in space).
  - Yi: The real-valued target for the ith observation (i.e., a continuous measurement).
- The testing set:
  - X*: The feature values for the new observation that we wish to regress.
The KNN regression algorithm:
- Calculate the distance between X* and each observation Xi
- Determine the K observations that are closest to X* (have the smallest distance)
- Assign X* the mean of the Y measurements among the K selected observations
Choosing K
As we vary K the predicted classification rule will change, so the choice of K has a large effect on the algorithm's performance.
In general:
- Small values of K:
  - Pros: highlight local variations
  - Cons: are not robust to outliers; induce unstable decision boundaries
- Large values of K:
  - Pros: highlight global variations; are robust to outliers; induce stable decision boundaries
  - Cons: values that are too high make all predictions similar
- In practice a good balance is typically achieved with K ≈ √n
- Can also use cross-validation to choose K
Choosing a Distance Measure
As we change the way we measure the distance between two points in our feature space, the classification rule will change. The choice of distance measure also has a large effect on the algorithm's performance.
- Euclidean distance: The most common distance measure for continuous observations.
  - Pros: Euclidean distance is the "familiar" distance we typically use in everyday life
  - Cons: it is symmetric, treats all dimensions equally, and thus is sensitive to large deviations in a single dimension (different ranges and units in each dimension)
- Hamming distance: the most common distance for categorical observations; Hamming distance looks at each attribute between observations and compares whether or not the observations are the same
  - Pros: simple way to determine "distance" between categories
  - Cons: each similarity is ignored while each difference is penalized; the measure is symmetric and treats all dimensions equally
- Minkowski p-norm: a family of distance functions
  - As we vary p, we define distance measures that each have different behaviors:
    - p → 0: Logical And (assigns more significance to simultaneous deviations)
    - p = 1: Manhattan block distance (adds each component separately)
    - p = 2: Euclidean distance
    - p → ∞: Maximum distance, Logical Or (the largest difference among all attributes dominates the distance measure)
Breaking Ties
What do we do if there is a tie? More specifically, how do we decide to classify an observation whose K-nearest neighborhood has an equal number of maximum group memberships? Some methods for breaking ties:
- If there are only two groups, we can easily get around this by using an odd K. Why doesn't this work when there are more than two groups?
- Use the maximum prior probability to uniformly decide all ties.
- Randomly choose the group; for G groups:
  - Roll a G-sided die that has equally likely outcomes for each group.
  - Roll a G-sided die that has weighted outcomes for each group.
- Use the 1NN to break the tie.
Pros and Cons
Pros:
- The only assumption we are making about our data is related to proximity (i.e., observations that are close by in the feature space are similar to each other with respect to the target value).
- We do not have to fit a model to the data since this is a non-parametric approach.
Cons:
- We have to decide on K and a distance metric.
- Can be sensitive to outliers or irrelevant attributes because they add noise.
- Computationally expensive; as the number of observations, dimensions, and K increases, the time it takes for the algorithm to run and the space it takes to store the computations increase dramatically.
  - Why is this bad? We want more data!
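As referenced above, here is a minimal sketch of the KNN classification algorithm in base R. The helper name knn_predict, the Euclidean distance, and the built-in iris data are illustrative choices only; in practice the kNN/kknn functions documented below would normally be used instead.

  # knn_predict: hypothetical helper implementing the three steps of the KNN classifier
  knn_predict <- function(train_x, train_y, x_star, k = 5) {
    # 1. Euclidean distance from x_star to every training observation
    dists <- sqrt(rowSums(sweep(as.matrix(train_x), 2, unlist(x_star))^2))
    # 2. indices of the k closest observations
    nearest <- order(dists)[1:k]
    # 3. majority vote among the k selected observations
    names(which.max(table(train_y[nearest])))
  }

  # Illustration on iris: predict the species of the 1st row using the remaining rows
  knn_predict(iris[-1, 1:4], iris$Species[-1], iris[1, 1:4], k = 5)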
R
R Library: VIM
kNN(data, k): KNN imputation; the dataset isn't separated into a training set and a testing set when put into the function
R Library: deldir
Delaunay triangulation and Dirichlet tessellation library
deldir(x, y): returns the info needed to graph the tessellation
tiles.list(deldirobj): takes a deldir() created object and creates a list of tiles in a tessellation
plot.tile.list(tilelist, fillcol, main): plots the Voronoi tiles; takes the list created by the tiles.list() function; the fillcol argument defaults to none but it takes a vector of color strings to add color to the tiles (the vector length has to match the number of points); useful in highlighting different categories on the graph; main is for the title of the plot
R Library: kknn
kknn(formula, train, test, k, distance): weighted KNN classifier; the formula is in the form coltoimpute ~ colstodecide (use a period, ., if you wish to use the rest of the columns to decide); train and test are the separated training and testing sets; distance is the Minkowski distance chosen
Linear Regression
Generalized Linear Models
Regularization and Cross Validation
Decision Trees
Supervised learning models for both classification and regression; construct solutions that stratify the feature space into relatively easy to describe rectangular regions; since we can get an idea of the general characteristics of observations that fall within particular regions of space, we can infer the characteristics of new observations that fall within the same regions.
Regression Trees
How do we segment a graph into regions of high and low values? How do we interpret the tree?
- Start from the top of the tree and pass a new observation through the various internal nodes
- At each internal node, you make a decision on how to proceed based on the characteristics of the observation
  - If the condition is satisfied, move down the left branch
  - If the condition is not satisfied, move down the right branch
- Continue moving down the nodes until a terminal node/leaf is reached
- The value within the terminal node is the mean response value (ŷRj) for the observations that fell within that region; this is also the prediction for future observations that fall into that region
Process Summary (mathematically):
- Segment the predictor space (all possible values of X1, X2, …, Xp) into J distinct and non-overlapping regions (R1, R2, …, RJ)
- For each observation that falls into a specific region Rj, predict the mean of the response values (ŷRj) for the training observations that fell within Rj
How do we decide where exactly to segment the predictor space? How do we come upon the regions R1, R2, …, RJ?
Theoretically the regions could have any shape, but decision trees use rectangular box-like segments for ease of interpretation; if the regions did not follow some specific pattern, it would be difficult to represent the resulting model by a decision tree.
The goal is to find rectangular boxes R1, R2, …, RJ such that the RSS is minimized: RSS = Σj Σ(i ∈ Rj) (yi − ŷRj)²
- Aim to minimize the squared differences of the response as compared to the mean response for the training observations within the jth region
- Computationally infeasible to consider every possible segmentation of the feature space into J regions; the minimization isn't easily solvable, especially as the number of regions increases
Tree based methods provide an approximation by combining a top-down method with a greedy approach called recursive binary splitting:
- The method is top-down because the feature space is split into binary components in a successive fashion, creating new branches of the tree to potentially be split themselves
- The method is greedy because splits are made at each step of the process based on the best result possible at the given step
  - The splits are not based on what might eventually lead to a better segmentation in future steps
- The splitting process depends on the greatest reduction in the RSS based on the predictor Xj and the cut point s that end up partitioning the space into the regions:
  - R1(j,s) = {X | Xj < s}
  - R2(j,s) = {X | Xj ≥ s}
- The splitting process seeks the values of j and s that minimize the following: Σ(i: xi ∈ R1(j,s)) (yi − ŷR1)² + Σ(i: xi ∈ R2(j,s)) (yi − ŷR2)²
- This process is repeated by considering each of the newly created regions as the new overall feature spaces to segment
When do we stop splitting?
- The recursive binary splitting process is likely to induce overfitting, thus leading to poor predictive performance on new observations
- We definitely overfit if each observation is its own terminal node (the model will have high variance), but the RSS will be exactly 0 on the training set
- Can prevent overfitting by setting a threshold on:
  - Maximum depth of the tree
  - Minimum number of observations in a tree node to split
  - Minimum number of observations in each region (node)
- What if we try fitting a tree with fewer regions? This should lead to lower variance at the cost of some bias, but ultimately lead to better predictions
  - Grow the tree to a certain extent, until the reduction in the RSS at a split doesn't surpass a certain threshold
  - Problem: although a split might not be incredibly valuable in reducing the RSS early on in a tree, it might lead to a future split that does reduce the RSS to a large extent
Tree Pruning
One solution to the problem of stopping splits at locations that might lead to better reduction of the RSS: build a large tree and then prune it back in order to obtain a suitable subtree.
- The best subtree will be the one that yields the lowest test error rate. Given a subtree, we can estimate the test error by implementing the cross-validation process, but this is too cumbersome because of the large number of possible subtrees; we need a better process
- Rather than checking every single possible subtree, the process of cost complexity pruning (i.e., weakest link pruning) allows us to select a smaller set of subtrees for consideration
Cost Complexity Pruning: consider a sequence of trees indexed by a non-negative tuning parameter α.
For each value of α there corresponds a subtree T such that the following is minimized: Σ(m=1..|T|) Σ(i: xi ∈ Rm) (yi − ŷRm)² + α|T|
- |T| indicates the total number of terminal nodes of subtree T
- Rm is the subset region of the feature space corresponding to the mth terminal node
- The tuning parameter α helps balance the tradeoff between the overall complexity of the tree and its fit to the training data:
  - Small values of α yield trees that are quite extensive (have many terminal nodes)
  - Large values of α yield trees that are quite limited (have few terminal nodes)
- The process is similar to the shrinkage/regularization method utilized in ridge and lasso regression
- It can be shown that as the value of the tuning parameter α increases, branches of the overall tree are pruned in a nested manner
  - Thus it is possible to obtain a sequence of subtrees as a function of α
- As with any other tuning parameter, in order to select the optimal value of α we implement cross-validation
- The subtree that is used for prediction is built using all the available data with the determined optimal value of α
Algorithm
- Use recursive binary splitting to build a large tree on the training data; stop before each observation falls into its own leaf (e.g., when each terminal node has fewer than 5 observations, etc.)
- Apply cost complexity pruning to the large tree in order to obtain a sequence of best subtrees as a function of α
- Use K-fold cross-validation to choose the best α:
  - For each of the K folds:
    - Repeat the binary splitting and tree pruning on all but the kth fold of the training data
    - Evaluate the mean squared prediction error on the data in the left-out kth fold as a function of α
  - Average the errors for each α; select the α that minimizes this criterion
- Return the subtree of the overall tree from pruning that corresponds to the best α
Classification Trees
A decision tree that predicts a qualitative (categorical) response rather than a quantitative (numerical) response; similar to regression trees, but for each of the subregions created, we predict that an observation belongs to the most commonly occurring class of training observations in its associated region; we still implement recursive binary splitting to create various subregions of the feature space, but we do not use the RSS as the criterion to minimize.
We can use the misclassification rate (i.e., the fraction of training observations in a region that do not belong to the most common class)
- The misclassification rate can end up not being sufficiently sensitive; too choppy and doesn't lead to a smooth tree building process
Gini Index
Typical criterion: G = Σ(k=1..K) p̂mk(1 − p̂mk)
- The proportions p̂mk denote the fraction of training observations in the mth region that are from the kth class
- The index measures the total variance among the K classes; it is often referenced as a measure of terminal node purity
- Sometimes splits yield terminal nodes that have the same predicted value; such duplicate splits are recorded by the classification tree because they lead to increased node purity; having an increased sense of node purity yields an increased sense of certainty pertaining to the response value corresponding to each terminal node
- The goal is to have the two subregions be as pure as possible by reducing the weighted sum of Gini impurities
Information Entropy
Works in a similar way to Gini impurity: entropy = −Σi fi log2(fi)
- fi is the fraction of items labeled with class i in the set, and Σfi = 1
- Want the information gain to be as great as possible after the splitting
- (In the usual plot of impurity versus class proportion, the horizontal axis is the proportion of one of the classes.)
Pros and Cons
Pros:
- Easy to interpret (especially if
it's small) if you don't have a heavy mathematical background; relatively non-complex
- Can graphically depict a higher dimensionality more easily than linear regression and still be interpreted by a novice
- The process can easily adapt to qualitative/categorical predictors without the need to create and interpret dummy variables
- Reflects a more "human" decision-making process as compared to other machine learning methods
- Can be displayed graphically
Cons:
- Predictive accuracy tends to be lower and thus not as competitive; a trade-off for less complexity
- Can increase predictive accuracy with:
  - Bagging
  - Random Forests
  - Boosting
  - At a cost of decreased interpretative value
- Suffer from high variance; will probably get very different trees if you randomly split the data into two and fit independent trees
  - Instability: a small change in the data may result in very different splits
Bagging
Bootstrap aggregation (i.e., bagging): a procedure that aids in the reduction of variance for a statistical learning method; frequently used alongside trees
- Recall that given a set of n independent observations X1, X2, …, Xn, each with variance σ², the variance of the mean of the observations is σ²/n
  - Averaging a set of observations reduces the overall variance
  - Not practical: we typically do not have access to multiple training sets
- Create multiple pseudo-training sets by bootstrapping
  - Take repeated samples of the same size from the single overall training dataset; treat these different sets of data as pseudo-training sets
- By bootstrapping, we create B different training datasets; the method is trained on the bth bootstrapped training set in order to get predictions for each observation; we end up getting B different decision trees; we can then average all predictions to obtain the bagged estimate (or, for classification, take the majority vote: the overall prediction is the most commonly occurring class among the B predictions)
- Recall reducing the variance by pruning; while pruning reduces the variance of the overall tree model upon repeated builds with different datasets, we induce bias because the trees are much simpler
- The idea of bagging averts the pruning methodology but still gets its benefits:
  - Average many noisy trees and hence reduce the model variance
  - Instead of pruning back our trees, create very large trees in the first place. These large trees will tend to have low bias, but high variance
  - Retain the low bias, but get rid of the high variance by averaging across many trees
- Since each tree generated in bagging is identically distributed, the expectation of the average is the same as the expectation of any one of them; this means the bias will not be improved
- How do we estimate the test error of a bagged model?
Out of Bag Estimation:
- Decision trees are fit to bootstrapped subsets of the overall available observations
- Observations that are used to fit the tree are said to be "in the bag"
- Observations that are not used to fit the tree are said to be "out of bag"
- Can predict the response for a given observation using each of the trees in which the observation was out of bag, and then average the results
- The averaged predictions are used to calculate the out of bag error estimate
- When the number of bootstrapped samples is large, this is essentially the same as the leave-one-out cross-validation error for bagging
Random Forest
The variance of the mean of a sample increases as observations are correlated with one another
- Correlated observations are not as effective at reducing the uncertainty of the mean as uncorrelated, independent observations
Random forests: improve on the bagging procedure by decorrelating the trees; this results in a reduction of variance once we average the trees
- Similar to bagging, we first build various decision trees on bootstrapped training samples, but we split the internal nodes in a special way
- Each time a split is considered within the construction of a decision tree, only a random subset of m of the overall p predictors are allowed to be candidates
  - Only those m predictors have the possibility to be chosen as the splitting factor
- At every split, a new subset of predictors is randomly selected
  - Typically m ≈ √p is a sufficient rule for subset selection
  - What happens if we choose m = p? We just get the bagging model
- Why does using fewer of the predictor variables at each split help in the long run?
  - It forces the decision tree building process to use different predictors to split at different times
  - Should a good predictor be left out of consideration for some splits, it still has many chances to be considered in the construction of other splits; the same idea goes for predictors surfacing in trees as a whole
  - Likely to yield different trees even when using the same training samples
- Can't overfit by adding more trees; the variance ends up decreasing
Boosting
Boosting: similar to bagging except that the decision trees are generated in a sequential manner
- Each tree is generated using information from previously grown trees; the addition of a new tree improves upon the performance of the previous trees
- The trees are now dependent upon one another
- Whereas creating a single large decision tree can amount to severe overfitting to our training data, the boosted approach tends to slowly learn our data
- Given a current decision tree model, we fit a new decision tree to the residuals of the current decision tree
- The new decision tree (based on the residuals) is then added to the current decision tree, and the residuals are updated
- Limit the number of terminal nodes in order to sequentially fit small trees
- By fitting small trees to the residuals, we slowly improve the overall model in areas where it does not perform well
- The shrinkage parameter (λ) is taken to be quite small, and slows the process down even further to avoid overfitting
Algorithm:
- Set f̂(x) = 0 and, for each i in the training data, set the residual ri = yi
- For b = 1, 2, …, B:
  - Fit a tree f̂b with d splits (d + 1 terminal nodes) to the training data (X, r)
  - Update f̂ by adding in a shrunken version of the new tree: f̂(x) ← f̂(x) + λf̂b(x)
  - Update the residuals: ri ← ri − λf̂b(xi)
- Output of the boosted model: f̂(x) = Σ(b=1..B) λf̂b(x)
Tuning Parameters:
- B: number of trees
  - Can overfit (slowly) if B is too large; use cross-validation to select B
- λ: shrinkage parameter (small positive number)
  - Controls
the rate of learning; typical values are around 0.01 to 0.001
  - If λ is too small, it may require a very large value of B or else the model won't learn at all
- d: number of splits in each tree
  - Controls the complexity of the boosted ensemble; typically using stumps (single splits where d = 1) is sufficient and results in an additive model; the tree depth corresponds to the interaction order of the boosted model, since d splits can involve at most d distinct variables
Variable Importance
For bagged and random forest trees, we can record the total amount that a given criterion is decreased over all splits relevant to a given predictor, averaged over all B trees
- For regression trees, we can use the reduction in the RSS
- For classification trees, we can use the reduction in the Gini index
In both regression and classification, we can do this for each predictor in the original dataset
- A relatively large value indicates a notable drop in the RSS or Gini index, and thus a better fit to the data; the corresponding variables are relatively important predictors
This allows us to gain a qualitative understanding of the variables in our dataset
R
R Library: tree
tree(formula, split, data, subset): fit a tree to the data; the formula is in the form y ~ x1 + x2 + …; can use a period (.) to select all variables (excluding y) and can use subtraction (-) to exclude variables; split is the criterion to determine a split ('deviance' or 'gini'); subset can be used to specifically select a subset to use as training data (vector of indices)
- summary(treeobj): get information about the fitted tree
- plot(treeobj): plot the tree
- text(treeobj): add text to the tree plot
predict(treeobj, test, type): prediction on test data; set type to 'class' for classification; the default type is for regression trees
- table(pred, actual): confusion matrix to assess the accuracy of the overall tree for misclassification
- Calculate the mean squared error (MSE) for regression
cv.tree(treeobj, FUN): perform cross validation to decide how many splits to prune; set FUN to prune.misclass to use misclassification as the basis for pruning; the default is prune.tree for regression
- names(cv.treeobj): inspect the elements of cv.treeobj
- cv.treeobj$size: indicates the number of terminal nodes
- cv.treeobj$dev: deviance is the criterion we specify (misclassification rate)
- cv.treeobj$k: cost complexity tuning parameter alpha
- cv.treeobj$method: indicates the specified criterion
- Plot k or size versus dev to visually inspect the results
prune.tree(treeobj, best): prune a regression tree; best is the best number of terminal nodes to use, determined by cross-validation
prune.misclass(treeobj, best): prune a misclassification tree; best is the best number of terminal nodes to use, determined by cross-validation
R Library: randomForest
randomForest(formula, split, data, subset, mtry, importance): fit a random forest to the data; the formula is in the form y ~ x1 + x2 + …; can use a period (.)
to select all variables (excluding y) and can use subtraction (-) to exclude variables; split is the criterion to determine a split ('deviance' or 'gini'); subset can be used to specifically select a subset to use as training data (vector of indices); mtry is the number of randomly selected predictors to use at each split (set mtry to the number of predictors for bagging); set importance to TRUE to assess the importance of predictors
- rFobj$mse: MSE for the random forest fit
- rFobj$err.rate: error rate for classification
predict(rFobj, test, type): prediction on test data; set type to 'class' for classification; the default type is for regression trees
- table(pred, actual): confusion matrix to assess the accuracy of the overall tree for misclassification
- Calculate the mean squared error (MSE) for regression
importance(rFobj): determine the importance of predictors
varImpPlot(rFobj): plot the importance of predictors
R Library: gbm
gbm(formula, data, distribution, n.trees, interaction.depth, shrinkage): fit a boosted tree to the data; the formula is in the form y ~ x1 + x2 + …; can use a period (.) to select all variables (excluding y) and can use subtraction (-) to exclude variables; set distribution (default 'bernoulli') to 'gaussian' for a Gaussian distribution; n.trees is the number of trees to use (default 100); interaction.depth is the depth of each tree; shrinkage is the learning rate
- summary(gbmobj): gets a summary of the fit and also plots the importance of variables
- Can only classify two groups; must turn the classes into numbers (0 and 1) and then treat it like a regression problem
predict(gbmobj, newdata, n.trees): prediction on newdata (test data); n.trees is the number of trees to use; must be less than or equal to the number of trees specified in the fit; can use a vector of possible trees to get a prediction matrix
- with(data, apply((predictions - y)^2, 2, mean)): calculate the boosted errors (MSE case) for the predictions; can be plotted to view how the error changes with the number of trees
- Must round the predictions for classification problems to get the correct predictions; since gbm only does regression, the prediction is a number, and the closer it is to one of the class numbers, the more likely it is to be classified as the category associated with that number
- (A combined usage sketch of tree, randomForest, and gbm appears after the Python sections below.)
Python
Helpful function to determine purity
  from collections import Counter
  import math

  def purity(L, metric='gini'):
      # fraction of observations in each class
      total = len(L)
      freq = [count / total for count in Counter(L).values()]
      if metric == 'gini':
          scores = [f * (1 - f) for f in freq]            # Gini impurity terms
      elif metric == 'entropy':
          scores = [-f * math.log(f, 2) for f in freq]    # entropy terms
      return sum(scores)
Useful function to plot decision boundaries of a fitted model
  def plotModel(model, x, y, label):
      '''
      model: a fitted model
      x, y: two variables, should be arrays
      label: true label
      '''
      import numpy as np
      import matplotlib.pyplot as pl
      from matplotlib import colors
      margin = 0.5
      x_min = x.min() - margin
      x_max = x.max() + margin
      y_min = y.min() - margin
      y_max = y.max() + margin
      colDict = {'red': [(0, 1, 1), (1, 0.7, 0.7)],
                 'green': [(0, 1, 0.5), (1, 0.7, 0.7)],
                 'blue': [(0, 1, 0.5), (1, 1, 1)]}
      cmap = colors.LinearSegmentedColormap('red_blue_classes', colDict)
      pl.cm.register_cmap(cmap=cmap)
      nx, ny = 200, 200
      xx, yy = np.meshgrid(np.linspace(x_min, x_max, nx),
                           np.linspace(y_min, y_max, ny))
      Z = model.predict(np.c_[xx.ravel(), yy.ravel()])
      Z = Z.reshape(xx.shape)
      # plot colormap
      pl.pcolormesh(xx, yy, Z, cmap='red_blue_classes')
      # plot boundaries
      pl.contour(xx, yy, Z, [0.5], linewidths=1., colors='k')
      pl.contour(xx, yy, Z, [1], linewidths=1., colors='k')
      # plot scatters and true labels
      pl.scatter(x, y, c=label)
      pl.xlim(x_min, x_max)
      pl.ylim(y_min, y_max)
      # if it's an SVC, plot the support vectors
      try:
          index = model.support_
          pl.scatter(x[index], y[index], c=label[index], s=100, alpha=0.5)
      except:
          pass
Python: tree from sklearn
tree_model = tree.DecisionTreeClassifier(...): initializes a decision tree; save it to a variable for convenience
- Arguments:
  - criterion: "gini" or "entropy", corresponding to the criteria of "gini impurity" and "information gain". default = 'gini'.
  - max_depth: The maximum depth of the tree. default = None, which means the nodes will be expanded until all leaves are pure or until all leaves contain less than min_samples_split samples.
  - min_samples_split: The minimum number of samples required to split. default = 2.
  - min_samples_leaf: The minimum number of samples required to be at a terminal node. default = 1.
- Methods:
  - fit: Build a decision tree from the training set (X, y).
  - predict: Predict class or regression value for X.
  - predict_log_proba: Predict class log-probabilities of the input samples X.
  - predict_proba: Predict class probabilities of the input samples X.
  - score: Return the mean accuracy on the given test data and labels.
  - set_params: Set the parameters of this estimator.
  - get_params: Get parameters for this estimator.
- Attributes:
  - tree_: Tree object, the underlying tree object.
  - feature_importances_: The feature importances. The higher, the more important the feature. Also known as gini importance.
- tree_model.fit(x, y): fit the tree model
- tree_model.score(x, y): score the accuracy of the model on the data
- tree_model.feature_importances_: the importance of each feature, rated from 0 to 1; the higher, the better; the sum of the feature importances should be 1
Python: ensemble from sklearn
rf = ensemble.RandomForestClassifier(...): initialize a random forest; save the classifier to a variable for convenience
- Arguments: similar arguments to DecisionTreeClassifier()
  - criterion: default="gini"; can be "entropy"
  - max_depth: default = None.
  - min_samples_split: default = 2.
  - min_samples_leaf: default = 1.
  - n_estimators: The number of trees. default=100.
  - bootstrap: Whether bootstrap samples are used when building trees. default=True.
  - oob_score: Whether to use out-of-bag samples to estimate the generalization error. default=False.
- Methods:
  - fit: Build a forest of trees from the training set (X, y).
  - score: Return the mean accuracy on the given test data and labels.
  - predict: Predict class for X.
  - predict_log_proba: Predict class log-probabilities for X.
  - predict_proba: Predict class probabilities for X.
  - set_params: Set the parameters of this estimator.
  - get_params: Get parameters for this estimator.
- Attributes:
  - feature_importances_: The feature importances (the higher, the more important the feature).
  - oob_score_: Score of the training dataset obtained using an out-of-bag estimate.
Python: sklearn.grid_search
gs will represent sklearn.grid_search for all the following examples
gsmodel = gs.GridSearchCV(model, params, scoring, cv): initializes a grid search for supervised learning models; model is the initialized model; params is a list of parameter combinations to search; scoring is the evaluation method used to determine the best parameters; cv is the number of folds; like any other model, it should be saved to a variable
- gsmodel.fit(x, y): fit the model
- Trees: for trees the model should be tree_model; the scoring should be 'accuracy'; params can be a list of dictionaries with the keys being different arguments ('criterion', 'max_depth', 'min_samples_split', 'min_samples_leaf', etc.) and the values being lists of values to search through
- gsmodel.grid_scores_: returns all the scores of the grid search
- gsmodel.best_params_: returns the best parameters from the grid search, which can be saved and passed into a model later on
- gsmodel.best_score_: best score
- gsmodel.score(x, y): score the performance on a set of data
  - If this is scored on the original data used for the grid search, the value might not match the best score; the best score is the average result across the cv folds. If the data is ordered (not randomized) when the folds are split, the cross validation scores may look better than they should, but the model will probably show more error when tested on randomized data.
Python: sklearn.cross_validation
cv will represent sklearn.cross_validation for all the following examples; this module can alter the process for cross validation
cv.StratifiedKFold(y, n): selects the folds in a stratified pattern; y is the target variable and n is the number of folds; this can be passed to the grid search function in sklearn.grid_search (set its cv argument equal to this object)
cv.train_test_split(x, y, random_state, test_size): splits a data set into a training set and a testing set; random_state is like a seed in R and makes the test results reproducible; test_size determines the ratio of test data to the original data
- returns four values in the order: X_train, X_test, Y_train, Y_test
Python: xgboost
https://github.com/mpearmain/BayesBoost
Kaggle competition based repo using xgboost and Bayesian Optimization.
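As referenced in the R tree sections above, here is a hedged end-to-end sketch combining tree, randomForest, and gbm. The Boston housing data from the MASS package, the seed, and all parameter values (5 terminal nodes, 5000 trees, depth 4, shrinkage 0.01) are illustrative assumptions, not prescriptions.

  library(tree); library(randomForest); library(gbm); library(MASS)
  set.seed(0)
  train <- sample(1:nrow(Boston), nrow(Boston) / 2)

  # Single regression tree, pruned after inspecting cross-validation results
  fit.tree   <- tree(medv ~ ., data = Boston, subset = train)
  cv.out     <- cv.tree(fit.tree)                 # size vs. deviance
  fit.pruned <- prune.tree(fit.tree, best = 5)    # 5 terminal nodes chosen for illustration

  # Random forest; setting mtry equal to the 13 predictors would give bagging instead
  fit.rf <- randomForest(medv ~ ., data = Boston, subset = train, importance = TRUE)
  varImpPlot(fit.rf)

  # Boosted regression trees
  fit.gbm <- gbm(medv ~ ., data = Boston[train, ], distribution = "gaussian",
                 n.trees = 5000, interaction.depth = 4, shrinkage = 0.01)

  # Test-set MSE for each model
  test <- Boston[-train, ]
  mean((predict(fit.pruned, test) - test$medv)^2)
  mean((predict(fit.rf, test) - test$medv)^2)
  mean((predict(fit.gbm, test, n.trees = 5000) - test$medv)^2)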
Support Vector Machines
Only for classification; a direct approach to classification that constructs linear/non-linear decision boundaries by explicitly separating the data into two different classes as completely as possible.
- The linear decision boundaries in Support Vector Classifiers are called hyperplanes in the feature space
- The non-linear decision boundaries in general Support Vector Machines are called hypersurfaces in the feature space
Maximum Margin Classifier
Hyperplane: a subspace of one dimension less than its ambient space:
- In 2D space, hyperplanes are 1D lines
- In 3D space, they are 2D planes
- In pD space, they are (p-1)D objects
- Flat and affine:
  - They preserve parallel relationships
  - Don't need to pass through the origin
- Equation form: β0 + β1X1 + … + βpXp = 0, i.e., β0 + βᵀX = 0
  - β = (β1, …, βp) and X = (X1, …, Xp) are p-dimensional vectors
- For any point vector X in the space, there are two possibilities:
  - X satisfies the equation above and thus itself falls on the hyperplane
  - X does not satisfy the equation above and thus falls on one side of the hyperplane
- The signed distance of any given point x to the hyperplane is given by: f(x) = (β0 + βᵀx) / |β|
  - The distance function f can be used as the decision function of the classification
  - If X does not fall on the hyperplane (f(x) ≠ 0), then one of the following must be true: f(x) > 0, the point is on one side of the hyperplane; f(x) < 0, the point is on the opposite side
- Extracting the β coefficients from this equation (not including the intercept) yields what is called a normal vector:
  - The vector points in a direction orthogonal to the surface of the hyperplane and essentially defines its orientation
  - Might need to work in the normalized form β* = β/|β|, or require that |β| = 1
- For any given point in the feature space, we can project onto the normal vector of the hyperplane
  - Based on the sign of the resulting value, we can determine on which side of the hyperplane the point falls
  - When the normal vector is of unit length, the value of the hyperplane function defines the Euclidean distance from the point to the hyperplane
  - Example: in 2D, the hyperplane 1 + 2X1 + 3X2 = 0 is a line
Separating Hyperplanes:
- Need to develop a classifier that will help us predict into which category a new observation will fall
- Suppose observations fall into one of two classes, which we can label as {-1, 1} without loss of generality
- Also suppose that it is possible to construct a hyperplane that perfectly separates the observations based on these class labels
- This hyperplane would then have the following properties for all i = 1, …, n: yi = 1 where f(xi) > 0, and yi = -1 where f(xi) < 0
- A separating hyperplane will therefore have the following property: yi(β0 + βᵀxi) > 0 for all i
- By evaluating the value of the hyperplane function given an observation, we can determine on which side of the hyperplane the observation falls
  - If the value is positive, classify the observation into group 1
  - If the value is negative, classify the observation into group 2
- The magnitude of the evaluation also yields information regarding the confidence of our classification prediction:
  - Large values imply the observation is far from the hyperplane (high confidence)
  - Small values imply the observation is close to the hyperplane (low confidence)
Maximal Margin Classifier: If the data can be separated perfectly with a hyperplane, then there are infinitely many separating hyperplanes in the feature space; how do we determine which of these hyperplanes is the best?
Maximal Margin Classifier: If the data can be separated perfectly with a hyperplane, then there are infinitely many separating hyperplanes in the feature space; how do we determine which of these hyperplanes is the best?
- Compute the distance from each training observation to a separating hyperplane; of these distances, the smallest distance is called the margin
- Then try to find the maximal margin hyperplane, which:
  - Is the separating hyperplane that is farthest from the training observations
  - Creates the biggest gap/margin between the two classes
- Hopefully, if the maximal margin hyperplane has a large margin on the training data, it will also have a large margin on the test data
The construction of the maximal margin classifier is the solution to the following optimization problem:
  maximize M over β0, β1, …, βp
  subject to ‖β‖ = 1 and yi(β0 + βᵀxi) ≥ M for all i = 1, …, n
- Maximize the margin M
- Ensure the normal vector is of unit length (not actually a constraint! why?)
- Guarantee that each observation is on the correct side of the hyperplane
- The conditions ensure that the distances from all the points to the decision boundary specified by β and β0 are at least M, and we seek the largest M by varying the parameters
- We can get rid of the constraint ‖β‖ = 1 by replacing the inequalities with yi(β0 + βᵀxi) ≥ M‖β‖
- For any β and β0 satisfying the inequalities, any positively scaled multiple satisfies them too
- If we set ‖β‖ = 1/M, we can rephrase the original problem in a more elegant form by dropping the norm constraint on β: minimize ½‖β‖² subject to yi(β0 + βᵀxi) ≥ 1 for all i
- This is a convex quadratic optimization problem and can be solved efficiently
Limitations:
- The observations that fall closest to the separating hyperplane (equidistant) define the width of the margin. These observations are known as the support vectors because the hyperplane depends on their location
  - If these observations were to move around in the feature space, the maximal margin hyperplane would also move
- The maximal margin hyperplane directly depends only on the support vectors, not the remaining observations; poor solution if data is noisy
- The definition of the classifier can be very sensitive to outliers or a single change in the data
  - High sensitivity to a small change suggests that we have overfit the classifier
- What if no separating hyperplane exists?
- There would be no solution to the optimization problem with M > 0
Support Vector Classifier
Support Vector Classifier: extension of the maximal margin classifier that makes some compromises in an effort to improve upon the aforementioned limitations
- May not perfectly separate the classes
- Provides greater robustness to outliers and thus a lower sensitivity to individual observation shifts
- Helps better classify most of the training observations
- By giving up the ability to have a perfect classifier on the training data we:
  - Take a penalty by possibly misclassifying some observations
  - Do a better job classifying the remaining observations more confidently
  - May have better predictive power for future observations
Soft Margin: allows some observations to be on the incorrect side of either the margin or the hyperplane; allows for the cases where the data is not separable
- Need to optimize the following:
  maximize M over β0, β1, …, βp, ε1, …, εn
  subject to ‖β‖ = 1, yi(β0 + βᵀxi) ≥ M(1 − εi), εi ≥ 0, and Σεi ≤ C
- Maximize the margin M
- Ensure the normal vector is of unit length
- ε: slack variables (εi ≥ 0 and Σεi ≤ constant); they allow individual observations to potentially fall on the wrong side of the margin or hyperplane; εi tells us where the ith observation is located relative to the margin and hyperplane
  - If εi = 0, then the ith observation is on the correct side of both the margin and the hyperplane
  - If εi > 0, then the ith observation violates the margin
  - If εi > 1, then the ith observation violates the hyperplane (misclassification)
  - The magnitude of the slack variables is proportional to the distance from each observation to the margin
C: a tuning parameter that helps determine the threshold of tolerable violations to the margin and hyperplane; often thought of as a budget for the slack variables
- If C = 0, then there is no budget for the slack variables. For every i, εi = 0, and the problem reduces to the maximal margin classifier
- As C increases, there is more budget for violations; the classifier becomes more tolerant so the margin will widen (low variance, high bias)
- As C decreases, there is less budget for violations; the classifier becomes less tolerant so the margin will narrow (high variance, low bias)
- No more than C observations can be on the wrong side of the hyperplane
The problem can also be written in the penalty form: minimize ½‖β‖² + C Σεi subject to yi(β0 + βᵀxi) ≥ 1 − εi and εi ≥ 0
- The C term here acts as a penalty parameter on the total error term (different from the C term mentioned before); the maximal margin classifier corresponds to C = ∞; C close to 0 → wide soft margin; large C → close to the hard-margin formulation
- Similar to the maximal margin classifier, not all observations directly affect the orientation of the hyperplane
  - Only observations that either fall on the margin or violate the margin affect the solution to the optimization problem
  - These observations are called the support vectors
  - Observations that fall on the correct side of the margin have no direct bearing on the ultimate classifier
  - If these observations were shifted around in the feature space, the hyperplane would remain unchanged (as long as they did not end up crossing over the margin)
- Ultimately the support vector classifier is more robust than the maximal margin classifier
Limitations: the support vector classifier assumes that the boundary between classes is roughly linear; this process fails when the boundary is nonlinear
Support Vector Machines
Feature Expansion: similar to adding polynomial terms in the linear regression setting, we can address nonlinearity by enlarging the feature space of our original dataset
- Implement functions of the predictors themselves by using higher-order polynomial functions of the predictors
- By fitting a support vector classifier in the enlarged feature space, the decision boundaries become nonlinear in the original feature space
- Suppose we only have X1 and X2 in the dataset; we can use X1, X2, X1², X2², and X1X2
- In the enlarged feature space, the decision boundary is linear: implementing feature expansion in the support vector classifier can help solve problems with data that are not linearly separable
Support Vector Machines: extension of the support vector classifier that results from enlarging the feature space using kernels
- Kernels are more efficient ways of implementing feature expansion from a computational standpoint
- The solution to the support vector classifier problem can be rewritten in a way that involves only the inner products of the observations instead of the actual observations themselves
- The linear support vector classifier can be represented as f(x) = β0 + Σi αi<x, xi>, where the sum runs over the n training observations
- There are n parameters, one α per training observation
- In order to estimate the α parameters and the intercept β0, all that is needed are the inner products between all pairs of observations
- To evaluate the function, we need to compute the inner product between the new observation and each of the training observations
- Computationally, it turns out that the α parameters are nonzero only for the support vectors; if a training observation is not a support vector, its α is necessarily 0, so most of the terms in the original equation disappear
- Thus, if S is the collection of indices of the support vectors, we have f(x) = β0 + Σi∈S αi<x, xi>
- This formulation of the problem typically involves far fewer calculations than the original optimization described for support vector classifiers
- How can we gain more flexibility with the support vector machine?
Kernels: a function that quantifies the similarity of two observations; just as there are many measures of similarity in terms of distance, there are many different types of kernels
- Linear: K(x, x') = <x, x'>; the idea of the inner product used to improve upon calculations in the support vector classifier; this is a linear kernel because the resulting support vector classifier is linear in the features
- Polynomial: K(x, x') = (1 + <x, x'>)^d; an extension of the linear kernel is the polynomial kernel of degree d; using a polynomial kernel with d > 1 is analogous to fitting a support vector classifier using feature expansion based on polynomials of degree d rather than the original feature space; the decision boundary becomes more flexible
- Radial: K(x, x') = exp(−γ‖x − x'‖²); suppose we have a test observation; if it is far from a training observation, the Euclidean distance will be large and the value of the radial kernel will be small; if it is close to a training observation, the Euclidean distance will be small and the value of the radial kernel will be large; it exhibits local behavior since only nearby training observations have an appreciable effect on the class label of a test observation
  - γ is a positive constant and another tuning parameter
- When the support vector classifier is combined with a non-linear kernel, the resulting classifier is called a support vector machine
Polynomial and radial kernels:
- To implement any kernel in a support vector classifier, we define the kernel in the classifier
- Kernels are much more computationally efficient because we only need to compute the kernel for distinct pairs of observations in our dataset
- Don't need to work in the enlarged feature space (impossible for the radial kernel since the space is infinite-dimensional)
Multi-Class Classification
We have only considered implementing support vector machines with respect to two categories; SVMs are limited to binary classification; need roundabout ways to predict an output with more than two categories
1-vs-1 Classification:
- Construct a support vector machine for each pair of categories
- For each classifier, record the prediction for each observation
- Have the classifiers vote on the prediction for each observation
1-vs-All Classification:
- Construct a support vector machine for each individual category against all other categories combined
- Assign the observation to the classifier with the largest function value
Pros and Cons
Pros:
- Not hindered by high dimensions
Cons:
- Slow
- Can only do binary classification
- To do multi-class classification, need to do 1-vs-1 or 1-vs-all classification
R
R Library: e1071
svm(formula, data, subset, kernel, cost, gamma): fit an svm classifier to the data; the formula is in the form of y ~ x1 + x2 + …; can use a period (.)
to select all variables (excluding y) and can use subtraction (-) to exclude variables; kernel can be 'linear', 'polynomial', 'radial basis' or 'sigmoid'; cost (default is 1) is the tuning parameter; if cost is high you get a maximum margin classifier; gamma is an additional tuning parameter for radial kernels (default 1/p); can do multiclass classification
- plot(svmobj, data): plot the svm classifier
- svmobj$index: find the indices of the support vectors
- predict(svmobj, test): predict results on test data
- table(preds, actual): use a confusion matrix to calculate the error rate
tune(method, formula, data, kernel, ranges): parameter tuning of functions using grid search; set the method to svm for svm; ranges gives the range to search for each parameter; it takes a list of parameters, each set equal to a vector of possible values; for svm, use cost and gamma (gamma only for radial kernels); ex. ranges = list(cost = 10^(seq(-1, 1.5, length = 20)), gamma = 10^(seq(-2, 1, length = 20)))
- summary(tuneobj): inspect cv output
- tuneobj$performances$cost: vector of cost values tested
- tuneobj$performances$error: vector of error values tested
- tuneobj$best.model: model with the best result; should be used for predictions
- tuneobj$best.model$cost: can be used to determine the best cost value
R Library: rgl
plot3d(x,y,z): plot in 3d; useful for plotting the cv results of a radial kernel svm (cost vs gamma vs error)
Python
Useful function to plot SVM decision boundaries:

import numpy as np
import matplotlib.pyplot as pl
from matplotlib import colors

def plotModel(model, x, y, label):
    '''
    model: a fitted model
    x, y: two variables, should be arrays
    label: true label
    '''
    margin = 0.5
    x_min = x.min() - margin
    x_max = x.max() + margin
    y_min = y.min() - margin
    y_max = y.max() + margin
    colDict = {'red': [(0, 1, 1), (1, 0.7, 0.7)],
               'green': [(0, 1, 0.5), (1, 0.7, 0.7)],
               'blue': [(0, 1, 0.5), (1, 1, 1)]}
    cmap = colors.LinearSegmentedColormap('red_blue_classes', colDict)
    pl.cm.register_cmap(cmap=cmap)
    nx, ny = 200, 200
    xx, yy = np.meshgrid(np.linspace(x_min, x_max, nx),
                         np.linspace(y_min, y_max, ny))
    Z = model.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    ## plot colormap
    pl.pcolormesh(xx, yy, Z, cmap='red_blue_classes')
    ## plot boundaries
    pl.contour(xx, yy, Z, [0.5], linewidths=1., colors='k')
    pl.contour(xx, yy, Z, [1], linewidths=1., colors='k')
    ## plot scatters and true labels
    pl.scatter(x, y, c=label)
    pl.xlim(x_min, x_max)
    pl.ylim(y_min, y_max)
    ## if it's an SVC, plot the support vectors
    try:
        index = model.support_
        pl.scatter(x[index], y[index], c=label[index], s=100, alpha=0.5)
    except AttributeError:
        pass

Python: svm from sklearn
svm_model = svm.SVC(...): initialize the svm classifier; save it to a variable for convenience
- Kernels:
  - linear: <x1, x2>
  - polynomial: (γ<x1, x2> + r)^d; d is specified by the argument degree, r by coef0 (if degree is 1, it reduces to a linear-style kernel)
  - rbf: exp(−γ‖x1 − x2‖²); γ is specified by the argument gamma and must be greater than 0 (the radial kernel)
  - sigmoid: tanh(γ<x1, x2> + r), where r is specified by coef0
  - The linear kernel works in the original feature space; the polynomial kernel is equivalent to the linear kernel when degree is 1, γ = 1 and r = 0
- Arguments:
  - kernel: Specifies the kernel type to be used in the algorithm. It must be one of 'linear', 'poly', 'rbf', 'sigmoid', 'precomputed' or a callable. If none is given, 'rbf' will be used. If a callable is given, it is used to precompute the kernel matrix.
  - C: Penalty parameter of the error term. C=1 by default.
    Large C → closer to the maximum margin classifier
  - degree: Degree of the polynomial kernel function ('poly'). Ignored by all other kernels.
  - gamma: kernel coefficient, used mainly for the radial kernel; ignored by the linear kernel
- Methods:
  - fit: Fit the SVM model according to the given training data.
  - score: Return the mean accuracy on the given test data and labels.
  - predict: Perform classification on samples in X.
  - set_params: Set the parameters of this estimator.
  - get_params: Get the parameters of this estimator.
- Attributes:
  - support_: return the indices of the support vectors.
  - n_support_: return the number of support vectors.
  - support_vectors_: return the values of the support vectors.
- svm_model.set_params(params): set arguments for the model
- svm_model.fit(x, y): fit the data
- svm_model.score(x, y): return accuracy
- svm_model.n_support_: number of support vectors
- svm_model.support_: indices of support vectors
- svm_model.support_vectors_: values of support vectors
- svm_model.predict(test): predict values
Python: sklearn.grid_search
gs will represent sklearn.grid_search for all the following examples
gsmodel = gs.GridSearchCV(model, params, scoring, cv): initializes a grid search for supervised learning models; the model is the initialized model; params is a list of parameter combinations to search; scoring is the evaluation method used to determine the best parameters; cv is the number of folds; like any other model, it should be saved to a variable
- gsmodel.fit(x,y): fit the model
- SVM: for SVM the model should be svm_model; the scoring should be 'accuracy'; the list of params can be a list of dictionaries with the keys being different arguments ('kernel', 'C', etc.) and the values as a list of values to search through
- gsmodel.grid_scores_: returns all the scores of the grid search
- gsmodel.best_params_: returns the best parameters from the grid search, which can be saved and inputted into a model later on
- gsmodel.best_score_: best score
- gsmodel.score(x,y): score the performance on a set of data
  - If this is scored on the original set of data used for the grid search, the value might not match the best score; the best score is the average result across the cv folds. If the data is ordered (not randomized) when the folds are split, the cross-validation scores can look better than they should, which usually leads to more error when testing on randomized data
Python: sklearn.cross_validation
cv will represent sklearn.cross_validation for all the following examples; this module can alter the process for cross validation
cv.StratifiedKFold(y, n): selects the folds in a stratified pattern; y is the target variable and n is the number of folds; this can be inputted to the grid search function in sklearn.grid_search (set cv equal to this function)
cv.train_test_split(x, y, random_state, test_size): splits a data set into a training set and a testing set; random_state is like a seed in R and makes the results reproducible; test_size determines the ratio of test data to the original data
- returns four values in the order: X_train, X_test, Y_train, Y_test
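A minimal sketch tying the pieces above together, assuming scikit-learn's svm module and the older sklearn.cross_validation layout used in these notes; the toy data and parameter values are only for illustration:

from sklearn import svm
from sklearn import cross_validation as cv
from sklearn.datasets import make_blobs

x, y = make_blobs(n_samples=200, centers=2, random_state=0)   # toy two-class data
X_train, X_test, Y_train, Y_test = cv.train_test_split(x, y, random_state=0, test_size=0.3)

svm_model = svm.SVC(kernel='rbf', C=1.0, gamma=0.5)   # illustrative parameter values
svm_model.fit(X_train, Y_train)

print(svm_model.score(X_test, Y_test))   # mean accuracy on the held-out split
print(svm_model.n_support_)              # number of support vectors in each class
print(svm_model.support_[:5])            # indices of the first few support vectors
preds = svm_model.predict(X_test)        # predicted class labels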
Discriminant Analysis
Naive Bayes
Association Rule Mining
Natural Language Processing
Intersection of computer science, artificial intelligence and linguistics. The goal is for computers to process or "understand" natural language in order to perform tasks that are useful.
Why is NLP so difficult?
- Understanding context
- Understanding "common sense" and "common knowledge"
- Understanding named entities: variations in names and names that can refer to different things
- Understanding idioms
- Understanding ambiguity
Applications:
- Spell checking, keyword search
- Extracting information from websites
- Classifying reading level, positive/negative sentiment of longer documents
- Machine translation
- Conversational assistants: Siri, Google Now, Cortana, Alexa
Deep Learning:
- Provides a flexible, learnable framework for representing visual and linguistic information
- Deep learning can learn unsupervised (from raw text) and supervised (with specific labels like positive/negative)
- Benefits more from a lot of data
Vectorizing Text: TF-IDF
tf-idf: term frequency - inverse document frequency; the value increases proportionally to the number of times a word appears in the document, but is offset by the frequency of the word in the corpus (the group or collection of documents); this helps adjust for the fact that some words appear more frequently in general
Pros:
- Have some basic metric to extract the most descriptive terms in a document
- Can easily compute the similarity between 2 documents
Cons:
- Tf-idf is based on the bag-of-words (BoW) model, therefore it does not capture position in text, semantics, or co-occurrences across documents
- Only useful as a lexical-level feature
- Cannot capture semantics (as compared to topic models, word embeddings)
Vectorizing Text: Co-occurrence Matrix
Co-occurrence matrix: matrix/table that keeps count of how often words are placed next to each other
Co-occurrence vectors: vectors that keep count of how often words are placed next to a specified word
- Cons:
  - Vectors increase in size with vocabulary (more words, more dimensions)
  - Very high dimensional: requires a lot of storage
  - Subsequent classification models have sparsity issues
  - People want to store the most important information in a fixed, small number of dimensions: a dense vector
- How to reduce dimensionality?
Reducing dimensions of the co-occurrence matrix:
- SVD: singular value decomposition; a matrix factorization method to reduce dimensions; can be used to find the principal components of a covariance matrix (details here)
- Semantic patterns: directions in the reduced space can capture meaningful relationships between words
- Problems with SVD:
  - Computational cost scales quadratically for an m x n matrix, which makes it bad for millions of words or documents
  - Hard to incorporate new words or documents
  - Function words (the, he, has) are too frequent and have too much syntactic impact; common hacks: cap the counts (min(X, t) with t ≈ 100), ignore the function words entirely, or use ramped windows that count closer words more
Word2Vec: instead of capturing co-occurrence counts directly, predict the surrounding words in a window of length m around every word using a skip-gram neural network; the outputs are vectors with interesting relationships
- Skip-gram neural network: simple neural network with only one hidden layer; train the neural network to perform a certain task in order to learn the weights of the hidden layer, which are actually the word vectors
- The task is to determine the probabilities of every word in our vocabulary being within the window of our target word
- http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/
- Probability function: p(o | c) = exp(u_o · v_c) / Σ_w exp(u_w · v_c)
  - o is the outside (output) word id, c is the center word id, u and v are the outside and center vectors of o and c
  - Every word has two vectors; one is the center vector and the other one is the outside vector
- Objective function: maximize the log probability of any context word given the current center word, J(θ) = (1/T) Σ_t Σ_{-m ≤ j ≤ m, j ≠ 0} log p(w_{t+j} | w_t)
  - θ stands for all the center and outside vectors
  - Can use gradient descent to optimize the cost function
Convolutional Neural Networks:
- The first layer embeds words into low-dimensional vectors
- The next layer performs convolutions over the embedded word vectors using multiple filter sizes
- Max-pool the result of the convolutional layer into a long feature vector, add dropout regularization, and classify the result using a softmax layer
Python
Python: TfidfVectorizer from sklearn.feature_extraction.text
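A minimal sketch of how the vectorizer might be used; the toy documents below are made up for illustration:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = ["the cat sat on the mat",
        "the dog sat on the log",
        "cats and dogs are pets"]                     # hypothetical toy corpus

vectorizer = TfidfVectorizer(stop_words='english')    # drop very common function words
tfidf = vectorizer.fit_transform(docs)                # sparse matrix: documents x terms

print(vectorizer.get_feature_names())                 # the terms that make up the columns
print(tfidf.toarray())                                # tf-idf weight of each term in each document
print(cosine_similarity(tfidf[0], tfidf[1]))          # similarity between the first two documents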
Neural Networks
Neural Networks: supervised methodology that models the relationship between a set of input signals and an output signal in order to perform either classification or regression; they are very complex; the underlying models are based on composite mathematical systems that often render accurate results, but are near impossible to interpret; often referred to as a black box process because the mechanism seems to be hidden from view
Biological Neurons: the underlying model behind artificial neural networks; how a biological brain responds to stimuli from sensory inputs:
- The human brain uses a complicated network of interconnected cells called neurons in order to parallel-process input
- Signals are received by the neuron's dendrites through a biochemical process that weights impulses based on their relative importance
- Should a threshold be reached by accumulating impulses, the neuron is said to fire; its impulse is then passed to neighboring neurons
Perceptron
Perceptron: most basic type of artificial neuron
- Takes in various binary inputs x1, x2, …, xn, and produces a single binary output
- Each input has a corresponding weight w1, w2, …, wn that expresses the importance of the input in determining the output
- How is output determined?
- The output is also binary; determining whether the output should be 0 or 1 is simple
  - Each input xi is considered alongside its corresponding weight wi
  - The sum of the various input/weight combinations is calculated
  - Should the grand sum be greater than a certain threshold, then the perceptron returns a 1; otherwise, the perceptron returns a 0
Step-function: output = 0 if Σj wj xj ≤ threshold; output = 1 if Σj wj xj > threshold
- Essentially the perceptron is a simple way of making a decision by weighing the evidence at hand and comparing it to a threshold that represents a willingness to make the decision
Better step function: output = 0 if w·x + b ≤ 0; output = 1 if w·x + b > 0
- Process similar to before, but with two small improvements
  - The summation of weights is represented by a dot product
  - The threshold has been moved to the other side of the equation; this new term is referred to as a bias (bias = - threshold)
Modeling logical gates: perceptrons can be the basic models for various logical gates (a short sketch follows at the end of this subsection)
- Logical gates: basic logical decision-making tools given binary inputs
- Can model the logical NAND gate (not-and)
- NAND gate: is universal for computation; any computation can be built up out of multiple NAND gates
More complex decisions: the single perceptron is an oversimplification of the decision making process; to model more complex decisions, consider various perceptrons alongside one another to create a network of neurons
- Complexity/design of the network is often referred to as its topology
- Each column of neurons describes a layer of learning; each layer passes its outputs to a future layer for more abstract learning
- It can be shown that any bounded continuous function can be represented with a neural network (to an arbitrary degree of error ε) that has only one hidden layer
  - Consider each hidden node turning on at a particular point of space
  - Differences between these hidden values can hone in on a particular region
  - To approximate any function, choose weights that reproduce the desired function values in specific regions
- Essentially we are discretizing the function into arbitrarily small regions
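Tying back to the logical-gate discussion above, a minimal sketch of a perceptron acting as a NAND gate (the weights -2, -2 and bias 3 are chosen by hand for illustration, not learned):

def perceptron(x1, x2, w1=-2, w2=-2, b=3):
    # fire (return 1) when the weighted sum plus bias is positive
    return 1 if w1 * x1 + w2 * x2 + b > 0 else 0

# with weights -2, -2 and bias 3 the perceptron reproduces a NAND gate
for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, perceptron(x1, x2))   # prints 1, 1, 1, 0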
How do we choose/tune the weights and biases automatically to match these logical/functional structures?
- If we can do this, we can construct solutions to problems where manual construction would fail because of the sheer complexity of the underlying relationship
Sigmoid Neurons
Examine small changes: when we do not understand or know the weights and biases of the network of perceptrons, we would like to see how small changes change the output, in order to understand the relationships within the network
- If a small change in a particular weight or bias caused only a small change in the output, we could use this information to tune our network in order to receive a better, more accurate output
- To have our network learn, we could implement the following process:
  - Change a weight or bias
  - Observe the changes in the output
  - If the change made our prediction better, keep it and now manipulate a different weight
  - If the change made our prediction worse, forget it and try again
  - Upon repeated steps, an accumulation of many small changes could induce a big change in the output that ultimately leads to greater accuracy
Problems with perceptrons:
- Neural network learning relies on the fact that small changes in tuning parameters induce small changes in the output, but that is not the case with perceptrons
- Perceptrons are step functions that flip from off to on once a specific threshold (bias) value has been surpassed
- A small change in weights or bias for a particular perceptron could cause the output of the perceptron to completely switch
- Like a domino effect, this single switch would pass along to future perceptrons in the network and could end up completely changing the ultimate output drastically
- There isn't an easy way to gradually modify weights and biases to develop a robust learning algorithm
Sigmoid Neuron: easiest and most common way of overcoming the sensitivity issue presented by perceptron networks
- Has inputs x1, x2, …, xn and an overall output that takes on any value between 0 and 1
- Has corresponding weights w1, w2, …, wn that express the importance of the input in determining the output
- Has an overall bias (threshold)
- Sigmoid function: σ(z) = 1 / (1 + e^(-z))
- When it is combined with the inputs, weights and bias/threshold of a sigmoid neuron, the resulting equation for the output is: output = σ(w·x + b) = 1 / (1 + exp(-Σj wj xj - b))
  - If w·x + b is a large positive number, then e^-(w·x+b) ≈ 0 and σ(w·x+b) ≈ 1
  - If w·x + b is a large negative number, then e^-(w·x+b) ≈ ∞ and σ(w·x+b) ≈ 0
- The sigmoid neuron is very similar to the perceptron; differences between them are only observable when w·x + b is of intermediate magnitude
- Graphically, the neuron determines the output as a logistic function
  - The x-axis represents the sum of all the input/weight combinations and the bias; the y-axis represents the sigmoid-transformed output, i.e. the tendency for a decision to be made; note that the neuron always fires, just now on a scale instead of all-or-nothing
- Why is the sigmoid neuron the solution to the problems the perceptron presents?
  - It is a smooth function that is easily differentiable at every point; it also has nice properties that will aid in simplifying future calculations
  - The perceptron is not a smooth function and is not easily differentiable
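A small numeric sketch of the sigmoid's smoothness, in plain Python (the input values are arbitrary):

import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# large positive input -> output near 1; large negative input -> output near 0
print(sigmoid(10), sigmoid(-10))

# small changes in the input produce proportionally small changes in the output,
# unlike the all-or-nothing jump of a perceptron's step function
z, dz = 0.5, 0.01
print(sigmoid(z + dz) - sigmoid(z))          # small change in output
print(sigmoid(z) * (1 - sigmoid(z)) * dz)    # predicted by the derivative s(z) * (1 - s(z))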
- The sigmoid function has the following property that relates small changes in weights and biases to changes in the output: Δoutput ≈ Σj (∂output/∂wj) Δwj + (∂output/∂b) Δb
- Perceptrons do not have an easy way of quantifying this relationship because of their seemingly erratic and sensitive behavior
Network Topology
Input layer: raw data goes into this layer; neurons within this layer are called input neurons
Output layer: the desired target comes out of this layer; neurons within this layer are called output neurons
Multilayer Perceptrons (MLPs): networks with multiple layers; each neuron is made up of sigmoid neurons instead of perceptrons
Hidden layer: the black box; neurons within this layer are called hidden neurons; neural networks with more than one hidden layer are said to be deep
- The hidden layers represent features of the data; as data passes from layer to layer, the features become increasingly complex
- The combination of various neuron outputs in the hidden layers intends to essentially feature-engineer aspects of your data
- Similar idea to breaking down a problem into smaller subproblems
- We can connect the idea of answering subquestions to neurons embedded within a hidden layer of a neural network as follows:
  - Each of these subproblems can themselves be decomposed into even smaller subproblems
  - The cumulative answers to these questions will help us determine the answer to the first subproblem
- We can imagine that the hidden layers of a neural network are continually breaking down harder questions by attempting to first answer multiple, easier-to-answer questions
- Basically create a hierarchy that relies on the notion that the answers to a slew of easy questions will ultimately help answer one massively difficult question; encode the various low- and high-level features of our data in a hierarchical manner
Problems:
- Hidden features and their corresponding weights, biases, and connections are often not intuitively conceivable
- No clear way to determine the best combinations given our dataset, especially for deep hidden layers
Backpropagation Learning with Gradient Descent
A network topology needs to be trained with repetition and experience
- Weights are initially random because of the non-existence of knowledge
- As neural networks process input data, connections between the various neurons can be strengthened or weakened depending on how they ultimately seem to affect the output
- Errors that are initially made are back-propagated through the neural network and the connections among neurons are changed in an effort to reduce this error
Backpropagation algorithm: use errors to change connections in an effort to reduce the same error
- Extremely computationally expensive
- Highly accurate results
- Iterates through many cycles (epochs) of two processes:
  - Forward phase: neurons are activated in sequence from the input layer, through the hidden layers, and lastly the output layer
    - Predicted values are recorded
  - Backward phase: the network's current output resulting from the forward phase is compared to the target values in the training data
    - Error is propagated backwards through the network from the output layer, through the hidden layers, and back to the input layer in order to modify the connections with the goal of reducing the error
- Repeat the cycle until a certain stopping criterion is reached
- The connections between all of the neurons are very complicated
- How does the algorithm know the best way to modify the weights among the connections? Gradient descent
Gradient Descent: method for optimizing parameters
- For neural networks:
  - The derivative of each neuron's activation function is used to identify the gradient in the direction of each of the incoming weights
  - The algorithm will attempt to change the weights in the way that results in the greatest reduction of error
  - Need a differentiable function; that's why sigmoid neurons are better than perceptrons
- Suppose you have a cost function C that is defined by a relationship among the variables v1 and v2. We desire to minimize this function as much as possible
  - Gradient descent will assess the gradient at a specific point in order to push v1 and v2 in the direction that will minimize C
- Benefits:
  - Why gradient descent? Why not a grid search?
  - The curse of dimensionality is extremely harsh; an increase in weights increases the computations exponentially
- Notes on local minimums:
  - What if our cost function doesn't always go in the same direction? What if it's possible that we reach a plateau or a local minimum? In other words, what happens when the problem is non-convex?
  - We chose our error function the way that we did not only because it makes the calculation of derivatives a bit cleaner, but also because we are exploiting the generally convex nature of quadratic equations
  - Also, another possible way around this is to update our weights sequentially instead of all at once (i.e., implementing stochastic gradient descent); in this manner, we might be able to avoid local minimums by sequentially following gradients in varying directions
Derivative:
- Describes how fast a function f changes instantaneously at the value x
- It determines whether the function f will increase or decrease if we increase the value of x
- It informs whether x is higher or lower than an optimum value (maximum or minimum)
In the backpropagation algorithm:
- Chain rule: d/dx f(g(x)) = f'(g(x)) g'(x)
- Derivative of a sum is the same as the sum of the derivatives
- Derivative of a sum with respect to one element of the sum collapses: ∂/∂wi Σj wj xj = xi
- Derivative of the sigmoid function: σ'(z) = σ(z)(1 - σ(z))
- Derivative of the sigmoid function with the chain rule: d/dx σ(f(x)) = σ(f(x))(1 - σ(f(x))) f'(x)
Backpropagation: Formally
- Suppose the above represents our neural network. For each layer in the network, we would like to determine:
  - The partial derivative of the error with respect to each neuron (i.e., should the neuron's value be higher or lower?)
  - The partial derivative of the error with respect to each input weight (i.e., should the weight's value be higher or lower?)
- Error: E = ½ Σ (target - output)²; the choice to scale by ½ is somewhat arbitrary, but it will make future computations simpler
- Differentiate the error E with respect to one of the neurons in the last hidden layer (first layer of backpropagation)
- Go another level deeper (second layer of backpropagation):
  - We begin to see a nesting nature; thus, we can express the partial derivative of a deeper layer of backpropagation in terms of that of a shallower layer
- Want to differentiate the error E with respect to one of the edges:
  - Again we see this nesting nature; thus, we can express the partial derivative of a deeper edge of backpropagation in terms of that of a shallower layer's node
- Once we have computed all of the gradients for each of the weights in our network, we gain insight into how slight perturbations of the weights relate to the overall error. Now, it is easy to decide whether we should increase or decrease each weight.
- Given a particular weight wi in our network and the gradient of the error with respect to that weight, we update wi as follows: wi ← wi - η (∂E/∂wi)
  - If the gradient is positive, we shift wi away from the increasing tendency
  - If the gradient is negative, we shift wi towards the decreasing tendency
  - η is a small positive number referred to as the learning rate
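A minimal numeric sketch of this update rule on a made-up one-dimensional cost C(w) = (w - 3)², with an illustrative learning rate:

def grad(w):
    # derivative of the toy cost C(w) = (w - 3)**2
    return 2 * (w - 3)

w = 0.0        # arbitrary starting weight
eta = 0.1      # learning rate
for _ in range(100):
    w = w - eta * grad(w)   # shift w against the gradient
print(w)                    # converges toward the minimizer w = 3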
More Help
http://machinelearningmastery.com/tutorial-first-neural-network-python-keras/
http://www.bogotobogo.com/python/scikit-learn/Artificial-Neural-Network-ANN-1Introduction.php
Common Terminology: concepts of neural networks build upon ideas we have previously seen with respect to other machine learning algorithms, particularly regression. While the vocabulary is different, the terms generally point to similar concepts
Pros and Cons
Pros:
- Neural network models can be extremely flexible; with sufficient data they can effectively model curvatures, interactions, plateaus, step functions, etc.
- Standard regression assumptions (e.g., the true residuals are independent, normally distributed, and have constant variance) are not required
- Outliers tend to have a limited influence in comparison to standard regression approaches
Cons:
- The method depends on the availability of large datasets and is extremely computationally expensive
- Model parameters are vastly uninterpretable
- It is easy to overfit or underfit the training data
- Diagnostic tests are not widely developed
Time Series Analysis
Other Optimizations
https://github.com/fmfn/BayesianOptimization
Python package that implements Bayesian Optimization.
Unsupervised Learning
Used to infer the properties of the data directly, without knowing the "correct" answers or the error for each observation. Only a set of N observations with p features, no response variables. No direct measure of success, "learning without a teacher". Unlabeled data is easier to obtain than labeled data. No specific prediction goals, therefore more subjective. We are usually interested in discovering the hidden patterns of the data.
Principal component analysis: often used for data visualization or data preprocessing for supervised learning
Clustering: broad class of methods for grouping or segmenting a collection of objects into distinct subsets (clusters)
Principal Component Analysis
Solves the problem of multicollinearity. Turns a set of possibly correlated variables into a set of values of linearly uncorrelated variables.
Multicollinearity
Phenomenon in which two or more predictor variables in a multiple regression model are highly correlated, meaning that one can be predicted from the others through linear formulae with a substantial degree of accuracy.
Issues:
- Regression coefficients of highly correlated variables might be inaccurate (high model variance)
- The estimate of one variable's impact on the dependent variable Y while controlling for the others tends to be less precise
- Nearly collinear variables contain similar information about the dependent variable, which may lead to overfitting
- Standard errors of the affected coefficients tend to be large
The Curse of Dimensionality
Given a number of observations, additional dimensions spread the points out further and further from one another. Sparsity becomes exponentially worse as the dimensionality of the data increases. There tends to be insufficient repetition in various regions of the high-dimensional space. Less repetition makes inference more difficult. It is difficult to replicate results, and the analysis doesn't take into account regions that don't have any observations at all.
Pros:
- Note: SVM takes advantage of the curse of dimensionality
Cons:
- Collecting data is expensive, both monetarily and temporally
- Too much complexity with higher-order data
- Redundant information (multicollinearity) in measured dimensions
PCA
A tool that finds a sequence of linear combinations of the variables to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components
Ideal input variables:
- Linearly uncorrelated
- Low-dimensional in the feature space
Motivation:
- Remove variables that provide little to no additional information (when all observations have similar values or the same value)
- Search for, among all possible directions in the feature space (not just along the axes -> infinitely many directions), the direction along which the projection of the observations is most widely spread
First Loading Vector:
- The direction on which the projection of the observations is more widely spread than the projection on any other direction
- Being a direction (vector), it has as many components as the number of features
- This characterizes the principal direction, which needs linear algebra to calculate
First Principal Component: keeps the information recorded by all observations along the first loading vector
- Obtained by linear projection
- Find the directions on which the projection is most widely spread (directions of highest variance) using linear algebra
- The projection can be described in two ways: a certain length away from the origin along the principal direction, or a vector in the original coordinate system
- There are N (number of samples) components in a principal component
- There are p (number of features) components in a principal direction (the loading vector)
- The principal components live in the space of samples, while the principal directions live in the space of features
Second Principal Component: the information stored in a data set is about the variation of the points across the whole sample set, but not all directions are born equal. The first principal component provides the most information but most likely doesn't provide all of it. Need to find the next most significant direction and continue until you get the majority of the information
- First remove the data stored in the first PC
- Then find the new direction (orthogonal to the original principal direction) on which the projection of the observations is most widely spread
  - Visually, remove the effect of the first principal component by projecting the observations onto a plane perpendicular to the first PC
  - Find the direction on which the projection is most widely spread
  - Since all the projected observations are now in the plane, the direction we find is automatically in the plane and is perpendicular to the first loading vector
- This is the second loading vector (second principal direction).
The projected values of the observations onto this direction form the second principal component
PCA Mathematically
First need to centralize the data at 0 by subtracting off the mean from each variable:
- Pragmatically: this allows the future mathematical processes to be easier
- Conceptually: PCA is modeling the variances of the data → the mean doesn't matter as much; we can always add the mean back in later if we desire to do a bit of back-construction
- Our data X is an n by p matrix where the average of each column is 0
Project the data onto any possible direction; a direction is represented by a unit vector û in linear algebra; the projection is z = Xû
Need to find the direction on which the projection of the data is most widely spread: maximize Var(Xû) subject to ‖û‖ = 1
The solution to the above is the first loading vector (first principal direction), denoted by φ1; the projection of the data on the first loading vector, z1 = Xφ1, is the first principal component
Once the first k-1 PCs have been found, the next one (if there is one) can be found inductively:
- Remove the information about the first k-1 components from X (Xk denotes the resulting matrix)
- With this matrix we solve the same optimization problem again
- The solution φk is the kth loading vector, and the projection onto this direction is called the kth principal component
Note: solving an optimization problem can be hard. In the setting of PCA, this is relatively easy. The principal directions (loading vectors) are essentially the eigenvectors of the covariance matrix of the data, arranged in descending order of the eigenvalues they correspond to
- Compute the covariance matrix Σ of the centered data
  - Σ is symmetric, so it has a full set of orthogonal eigenvectors
- Find the eigenvectors e of Σ: they yield orthogonal directions of greatest variability (the principal directions)
  - Solve the equation Σe = λe
  - Compute the eigenvectors by finding the solutions to det(Σ - λI) = 0
  - The principal directions are the eigenvectors e
- The eigenvectors are ordered by the magnitude of the corresponding eigenvalues λ (the magnitude of the variance along the principal directions)
Determine how many principal components to use:
- Strike a balance between the total amount of variance that is captured by the principal components and the number of components selected
- Use the first k principal components
Project the original data onto the chosen k principal components
Notes
- This process is continued to retain more and more information from the raw data
- Can't have more principal components than the number of original features; the max number of PCs is the number of features (technically max = min(n, p), but we assume p is less than n)
- PCs live in the space of samples
- Principal directions live in the space of features
- PCs are orthogonal due to the linear projections; the dimensions are reduced by 1 every time a PC is found and the information is projected onto the plane orthogonal to the current PC
- The variance of each principal component decreases: Var(Z1) ≥ Var(Z2) ≥ … ≥ Var(Zp)
- The principal components Z1, Z2, …, Zp are mutually uncorrelated
- The principal loading vectors φ1, φ2, … φp are normalized and mutually perpendicular
- The variances of the data along the principal directions (eigenvectors) are the corresponding eigenvalues
Pros and Cons
Pros:
- Transformed data that straddles only k carefully selected dimensions that preserve as much of the original structure as possible
- Solution to multicollinearity and the curse of dimensionality
- Don't waste data by taking into account all variables
- Reduce complexity
- Same data, new perspective
- Results are useful properties that can be proved
using calculus and linear algebra Cons: - Interpreting data may become a little difficult because PCs are composed from a combination of variables - Need to centralize the raw data R R Library: psych fa.parallel(x, n.obs, fa, n.iter): Creates scree plots with parallel analyses for choosing K; x is the data frame of data matrix of scores; if the matrix is square, it is assumed as a correlation matrix; otherwise, correlations (pairwise) will be found; n.obs is the number of observations; if using a data frame, n.obs is does not need to be specified (default is null); fa displays the eigenvalues (‘pc’ for principal components, ‘fa’ for factor analysis, and ‘both’ for both); default is both; n.iter is the number of simulated analysis to perform (default 20) principal(r, nfactors, rotate): Performs principal components analysis with optional rotation; r is a correlation matrix; if raw data is used, correlations will be found using pairwise deletions for missing values; nfactors is the number of components to extract (default 1); use fa.parallel() to choose; rotate is the rotation/transformation of the solution; default is ‘varimax’ but should set to ‘none’ - principal()$score: plot the score of the principal object to view a plot of the principal component; use a scatterplot matrix like pairs(), or the regular plot() function if you’re only graphing the first two PCs factor.plot(principalobj, labels): Visualizes the principal component loadings; principalobj is the object obtained from the principal() function; labels is used to add variable name to the plot (use the column names of the dataset; default is null Python Requires numpy (np), PCA from sklearn.decomposition, Axes3D from mpl_toolkits.mplot3d, and matplotlib.pyplot (plt) Helper functions for plotting def rotate(array): rot = np.matrix([[1,0,0],[0,np.sqrt(3)/2,-np.sqrt(1)/2],[0,np.sqrt(1)/2,np.sqrt(3)/2]]).T return np.array(np.matrix(array)*rot) def plot_vec(array, length, color='blue', alpha= 1): ax.plot(*zip(-array[0]*length, array[0]*length), color = color, # colour of the curve linewidth = 2.4, # thickness of the line #linestyle = '--', # available styles - -- -. : alpha = alpha ) #return array*length def plot_plane(normal, color='blue', alpha=0.2, x_min=-1.5, x_max=2.5, y_min=-2.5, y_max=1.5): surf_x, surf_y = np.meshgrid([x_min]+range(int(np.floor(x_min)+1),0) + \ range(int(np.floor(x_max)))+[x_max], \ [y_min]+range(int(np.floor(y_min)+1),0) + \ range(int(np.floor(y_max)))+[y_max]) surf_z = (-normal[0,0]*surf_x - normal[0,1]*surf_y - 0.5)*1./normal[0,2] ax.plot_surface(surf_x, surf_y, surf_z, color=color, alpha=0.1) def project2vec(data, vec, id_=0, color='green', along=False): pp = data[[id_]] proj = (np.sum(vec*pp)*vec) ax.scatter( *( proj.ravel() ), c=color, s=16) ax.plot(*(zip(pp[0], proj[0])), color = color, # colour of the curve linewidth = 1.4, # thickness of the line #linestyle = '--', # available styles - -- -. : alpha = 0.3) if along: ax.plot(*(zip(np.array([0,0,0]), proj[0])), color = 'Dark' + color, # colour of the curve linewidth = 1.4, # thickness of the line #linestyle = '--', # available styles - -- -. : alpha = 1) return np.sum(vec*pp) def project2plane(data, normal, id_=0, color='green', shoot=False ): pp = data[[id_]] proj = pp - np.sum((pp*normal))*normal ax.scatter( *( proj.ravel() ), c=color, s=16) if shoot: ax.plot(*(zip(pp[0], proj[0])), color = color, # colour of the curve linewidth = 1.4, # thickness of the line # linestyle = '--', # available styles - -- -. 
: alpha = 0.5) return pp - np.sum(normal*pp)*normal def plot_oigin(): ax.scatter(0, 0, 0, marker='o', s=26, c="black", alpha=1) def plotModel(model, x, y, label): ''' model: a fitted model x, y: two variables, should arrays label: true label ''' margin = 0.5 x_min = x.min() - margin x_max = x.max() + margin y_min = y.min() - margin y_max = y.max() + margin import matplotlib.pyplot as plt from matplotlib import colors colDict = {'red': [(0, 1, 1), (1, 0.7, 0.7)], 'green': [(0, 1, 0.5), (1, 0.7, 0.7)], 'blue': [(0, 1, 0.5), (1, 1, 1)]} cmap = colors.LinearSegmentedColormap('red_blue_classes', colDict) plt.cm.register_cmap(cmap=cmap) nx, ny = 200, 200 xx, yy = np.meshgrid(np.linspace(x_min, x_max, nx), np.linspace(y_min, y_max, ny)) Z = model.predict(np.c_[xx.ravel(), yy.ravel()]) Z = Z.reshape(xx.shape) ## plot colormap plt.pcolormesh(xx, yy, Z, cmap='red_blue_classes') ## plot boundaries plt.contour(xx, yy, Z, [0.5], linewidths=1., colors='k') plt.contour(xx, yy, Z, [1], linewidths=1., colors='k') ## plot scatters ans true labels plt.scatter(x, y, c = label) plt.xlim(x_min, x_max) plt.ylim(y_min, y_max) ## if it's a SVM model try: # if it's a SVC, plot the support vectors index = model.support_ plt.scatter(x[index], y[index], c = label[index], s = 100, alpha = 0.5) except: pass Python: PCA from sklearn.decomposition pca = PCA(): initialize PCA; must save it to a variable (for convenience); if it is saved to a variable, you can easily return properties of the cluster without repeatedly calculating the fit; all properties and functions will be called on using pca - Arguments: - n_components: The number of components to keep. In default it is all, min(n_samples, n_features). If n_components is less than 1, it is interpreted as the percentage of variance. PCA model will find the suitable number of components to explain ≥ n_components percentage of variance. - - - - whiten: When True (False by default) the components_ vectors are divided by n_samples times singular values to ensure uncorrelated outputs with unit component-wise variances. Attributes: - components_: Components with maximum variance. - explained_variance_ratio_: Percentage of variance explained by each of the selected components. - mean_: The average of each feature. Methods: - fit: Fit the model with X. - fit_transform: Fit the model with X and apply the dimensionality reduction on X. - inverse_transform: Transform data back to its original space. - get_covariance: Compute data covariance with the generative model. - get_params: Get parameters for this estimator. - set_params: Set the parameters of this estimator. - transform: Apply the dimensionality reduction on X. pca.set_params(params): change parameters; takes in the arguments in PCA() function; usually used to set n_components; mutating - pca.fit(x): find the principal components; x is the data; every time you fit the data, you remove all the information from the previous fit; mutating - After the model has been fitted, you can get the properties of the clusters by calling on PCA() attributes - pca.components_: returns the components - pca.explained_variance_ratio_: Percentage of variance explained by each of the selected components. - pca.mean_: The average of each original feature. 
- pca.transform(data): apply the pca transformation to a data set
  - Manual transform with: np.dot(data - pca.mean_, pca.components_.T)
- pca.inverse_transform(fitteddata): used to transform the PCs back into the original space; not an exact inverse transformation because the dimensions were first reduced, so some information is missing; useful for image compression
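A minimal sketch of the PCA workflow above, assuming PCA from sklearn.decomposition; the toy data is made up for illustration:

import numpy as np
from sklearn.decomposition import PCA

np.random.seed(0)
x = np.random.randn(100, 3)                        # hypothetical data: 100 samples, 3 features
x[:, 2] = x[:, 0] + 0.1 * np.random.randn(100)     # make one feature nearly collinear with another

pca = PCA(n_components=2)
scores = pca.fit_transform(x)            # the first two principal components (100 x 2)

print(pca.explained_variance_ratio_)     # share of variance captured by each PC
print(pca.components_)                   # the loading vectors (principal directions)

# the manual transform noted above matches pca.transform
manual = np.dot(x - pca.mean_, pca.components_.T)
print(np.allclose(manual, scores))       # True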
Clustering
An unsupervised task that does not aim to specifically predict a numeric output or a class label. It does aim to uncover the underlying structure of the data and see what patterns exist in the data. The aim is to group together observations that are similar while separating observations that are dissimilar. Cluster analysis attempts to explore possible subpopulations that exist within your data. Cluster analysis tries to answer exploratory questions. Typical questions that cluster analysis attempts to answer are:
- Approximately how many subgroups exist in the data?
- Approximately what are the sizes of the subgroups in the data?
- What commonalities exist among members of similar subgroups?
- Are there smaller subgroups that can further segment current subgroups?
- Are there any outlying observations?
K-Means
With the K-means clustering algorithm, we aim to split up our observations into a predetermined number of clusters
- Must specify the number of clusters K in advance
- These clusters will be distinct and non-overlapping
The points of each of the clusters are determined to be similar to a specific centroid value:
- The centroid of a cluster represents the average observation of a given cluster; it is a single theoretical observation that represents the prototypical member of the cluster
- Each observation will be assigned to exactly one of the K clusters depending on where the observation falls in space with respect to the cluster centroid locations
What makes a good clustering solution? We desire each point in a specific cluster to be near:
- The centroid of that cluster
- All other points within the same cluster
Mathematically, we desire the within-cluster variation to be as small as possible
Procedure
Finding the global minimum of the optimization function is very difficult and computationally expensive. If we checked all possible clustering assignments, we would have to calculate the within-cluster variations for K^n different solutions
In practice, most K-means packages perform the following algorithm, also known as Lloyd's algorithm in computer science circles:
- Initialize: place K centroids at random locations in the feature space
- Assign: each observation to the cluster whose centroid is closest by some distance measure (Euclidean)
- Recalculate: recalculate the cluster centroids
  - The kth cluster centroid is the vector of the p variable averages for all observations in the kth cluster
- Repeat: repeat the assignment and recalculation steps
- Halt: stop when the cluster assignments no longer change
The K-means procedure always converges:
- If you run the algorithm from a fixed initial assignment, it will reach a stable endpoint where the clustering solution no longer changes through the iterations
Unfortunately, the guaranteed convergence is to a local minimum
- Thus if we begin the K-means algorithm with a different initial configuration, it is possible that convergence will find different centroids and therefore ultimately assign different cluster memberships
What can we do to get around this?
- Run the K-means procedure several times and pick the clustering solution that yields the smallest aggregate within-cluster variance
How to choose K?
- Need to know the answer prior to running the algorithm
- Can we check a lot of possible values of K and choose the K that yields the lowest variance? NO
  - As K increases, the overall within-cluster variance will continue to decrease
  - The more centroids you have, the closer all points will be to one of those centroids
  - If every data point were its own centroid (K = n), the within-cluster variance would be zero
- Use a scree plot (elbow graph) to visually inspect the data
- Plot the within-cluster variance as a function of the number of clusters to create a segmented curve
  - The within-cluster variance will necessarily decrease as we increase the number of clusters, but not uniformly
  - The within-cluster variance tends to decrease quickly at first, but then begins to taper off
  - The task reduces to simply finding the point where the within-cluster variance no longer decreases dramatically
K-means Algorithm Mathematically
Suppose C1, C2, …, CK denote the sets containing the indices of the observations in the respective clusters. Then, under the K-means clustering algorithm, the following must be true:
- C1 ∪ C2 ∪ … ∪ CK = {1, 2, …, n}: each observation belongs to at least one of the K clusters
- Ck ∩ Ck' = ∅ for k ≠ k': the clusters are distinct and non-overlapping; there does not exist an observation that belongs to more than one cluster
- It follows that each observation must fall into exactly one cluster
What technique does the K-means algorithm use to create these clusters? Suppose we use the Euclidean distance. Then the within-cluster variation is defined as:
  W(Ck) = (1/|Ck|) Σ over i, i' in Ck of Σ from j = 1 to p of (xij - xi'j)²
Here:
- |Ck| denotes the total number of observations in cluster k
- i and i' denote indices of observations in cluster Ck
- p is the number of variables/features in our dataset
In other words, the within-cluster variation for the kth cluster is the sum of all of the pairwise squared Euclidean distances between the observations in the kth cluster, divided by the total number of observations in the kth cluster
Since the within-cluster variation is a measure of the amount by which the observations in a specific cluster differ from one another, we want to minimize this quantity W(Ck) over all clusters:
  minimize over C1, …, CK: Σ from k = 1 to K of W(Ck)
We desire to partition the observations into K clusters such that the total within-cluster variation added together across all K clusters is as small as possible
Why does the K-means algorithm end up necessarily reducing the within-cluster variances? We can rewrite the pairwise variation as the variation around the component-wise means (centroids).
Why does the K-means algorithm necessarily reduce the within-cluster variances? We can rewrite the pairwise variation as the variation around the component-wise means (centroids): W(C_k) = 2 Σ_{i ∈ C_k} Σ_{j=1..p} (x_ij − x̄_kj)², where x̄_kj is the mean of variable j over the observations in cluster k.
During the algorithm, if we had just fixed the:
- Centroids, then the observation reassignment step finds the closest centroid (and thus reduces the within-cluster variances)
- Observation assignments, then the resulting sample cluster means minimize the sum of squared distances (and thus reduces the within-cluster variances)
Pros & Cons
Pros:
- Helps find the underlying structure of a set of data
- Simple to understand (group by distance)
- Well-defined groups
Cons:
- Have to choose K and the starting points (sometimes hard to determine)
- A change in units can change the solution
- Points that are nearby each other (have a small Euclidean distance between them) are not guaranteed to be clustered together
  - It could be the case that a stable solution produces clusters that don't necessarily cluster the closest points together
- K-means assumes that true clusters have a globular shape (i.e., a spherical shape with a well-defined center)
  - When the data has non-globular or chain-like shapes, K-means may not perform well (e.g., separating the outside vs. the inside of a face-shaped point cloud)
- Perceived granularity: the apparent number of clusters depends on the scale at which you look (the same data could plausibly be read as 2, 4, or 16 clusters)
Hierarchical Clustering (Agglomerative Clustering)
Agglomerative clustering: build a hierarchy of clustering structures (like a tree)
- At the bottom level, the extreme case is that each observation is partitioned into its own cluster (K = n)
- At each intermediary level, we recursively find the closest two clusters and fuse them together
- At the top level, the extreme case is that every observation is partitioned into the same single cluster (K = 1)
Dendrogram: visualization of the hierarchical tree
- The lower down in the dendrogram a fusion occurs, the more similar the fused groups of observations are to each other
- The higher up in the dendrogram a fusion occurs, the more dissimilar the fused groups of observations are to each other
- For any two observations, we can inspect the dendrogram and find the point at which the groups containing those two observations are fused together to get an idea of their dissimilarity
- Be careful to consider groups of points in the fusions within dendrograms, not just individual points; a sketch of building and reading a dendrogram follows
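To make the dendrogram discussion concrete, here is a minimal sketch using scipy and matplotlib; the toy data and variable names are made up for the example, and scipy's linkage() and dendrogram() are documented in more detail in the Python section later in these notes.

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

# Toy data: two loose groups of points
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (10, 2)), rng.normal(6, 1, (10, 2))])

# Bottom-up (agglomerative) clustering; each merge records the inter-cluster
# dissimilarity, which becomes the fusion height in the dendrogram
Z = linkage(X, method="complete", metric="euclidean")

dendrogram(Z)
plt.ylabel("Dissimilarity at fusion")   # lower fusions = more similar groups
plt.show()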
Procedure
Two strategies: bottom-up and top-down
Bottom-up:
- Begin with n observations and a distance measure of all pairwise dissimilarities; at this step, treat each of the n observations as its own cluster
- For i = n, (n-1), …, 2:
  - Evaluate all pairwise inter-cluster dissimilarities among the i clusters and fuse together the pair of clusters that are the least dissimilar
  - Note the dissimilarity between the recently fused cluster pair and mark that as the associated height in the dendrogram
  - Repeat the process, calculating the new pairwise inter-cluster dissimilarities among the remaining (i-1) clusters
Need to choose a dissimilarity measure and a linkage measure
- A distance metric (e.g., Euclidean distance) is sufficient as a dissimilarity measure
- Linkage is a measure of the dissimilarity between two groups of points
  - Compute the pairwise dissimilarities between the observations in the two clusters
  - Complete Linkage: maximum inter-cluster dissimilarity
    - Record the largest of the dissimilarities between members of the two clusters as the overall inter-cluster dissimilarity
    - Sensitive to outliers, yet tends to identify clusters that are compact, somewhat spherical objects with relatively equivalent diameters
  - Single Linkage: minimal inter-cluster dissimilarity
    - Record the smallest of the dissimilarities between members of the two clusters as the overall inter-cluster dissimilarity
    - Not as sensitive to outliers, yet tends to identify clusters with a chaining effect; these clusters often do not represent intuitive groups in the data, and many pairs of observations within them can be quite distant from one another
  - Average Linkage: mean inter-cluster dissimilarity
    - Record the average of the dissimilarities between members of the two clusters as the overall inter-cluster dissimilarity
    - Tends to strike a balance between the pros and cons of complete linkage and single linkage
  - Ward's Linkage: minimum variance method
    - Minimize the variance of the clusters being merged
Pros and Cons
Pros:
- Doesn't require us to choose the number of clusters in advance
Cons:
- Need to choose a dissimilarity measure and a linkage method
- A change in units can change the solution
Notes
- Clustering is an unsupervised approach to machine learning; the main goal is to uncover structure among subsets of the data
  - The procedure is generally used for data exploration, not for predicting outcomes
- In good clustering solutions, points in the same cluster should be more similar to each other than to points in other clusters
- The units by which each variable is measured matter; different unit measurements cause different distance calculations and thus change clustering solutions
  - We usually want a unit change in one dimension to correspond to the same unit change in another dimension; from that perspective, we should standardize our data prior to clustering
- The process of clustering is iterative and interactive; there is no one correct way to cluster your data
- Supervised methods generally have one solution to the optimization problems posed, whereas some clustering methods (e.g., K-means) aren't deterministic
- Different clustering methods yield different results (e.g., hierarchical clustering with varied linkage methodologies); consider the output of different approaches, as in the sketch below
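To illustrate the last two notes (standardize first; different linkages give different solutions), here is a minimal scikit-learn sketch. The toy data and variable names are made up for the example, and single linkage is only available in reasonably recent scikit-learn versions.

import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.preprocessing import StandardScaler

# Toy data: a tight blob plus a stretched chain of points
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.3, (30, 2)),
               np.column_stack([np.linspace(2, 8, 30), rng.normal(0, 0.1, 30)])])

# Standardize first: a unit change should mean the same thing in every dimension
X_std = StandardScaler().fit_transform(X)

# Same data, same K, different linkage -> potentially different cluster assignments
for link in ["complete", "single", "average", "ward"]:
    labels = AgglomerativeClustering(n_clusters=2, linkage=link).fit(X_std).labels_
    print(link, np.bincount(labels))   # cluster sizes typically differ by linkage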
R
scale(data): used to scale (standardize) your data
kmeans(x, centers, nstart): perform K-means clustering on a data matrix; x is the data matrix; centers is the number of centers or the set of initial cluster centers; nstart is the number of random sets of centers to try (default 1)
- kmeans()$cluster: gives a vector of the cluster assigned to each observation; can be used to visualize the clusters with the plot() function by setting col = kmeans()$cluster
- kmeans()$centers: obtain the coordinates of the cluster centers; can be added to the plot using the points() function
Use the following function to help determine the number of clusters when we do not have an idea ahead of time:

wssplot = function(data, nc = 15, seed = 0) {
  # Within-cluster variance for K = 1 is the total variance of the data
  wss = (nrow(data) - 1) * sum(apply(data, 2, var))
  for (i in 2:nc) {
    set.seed(seed)
    # Total within-cluster sum of squares for K = i
    wss[i] = sum(kmeans(data, centers = i, iter.max = 100, nstart = 100)$withinss)
  }
  plot(1:nc, wss, type = "b",
       xlab = "Number of Clusters",
       ylab = "Within-Cluster Variance",
       main = "Scree Plot for the K-Means Procedure")
}

dist(data, method): computes and returns the distance matrix using the specified distance measure between the rows of a data matrix; method defaults to euclidean; useful as input to the hierarchical clustering function hclust()
cutree(tree, k): cuts a tree, e.g., as resulting from hclust(), into several groups either by specifying the desired number of groups or the cut height(s); k is the desired number of groups
R Library: flexclust
hclust(d, method): hierarchical cluster analysis on a set of dissimilarities; d is the dissimilarity structure as produced by the dist() function; method can be one of "ward.D", "ward.D2", "single", "complete", "average" (= UPGMA), "mcquitty" (= WPGMA), "median" (= WPGMC), or "centroid" (= UPGMC); default is "complete"
- plot(hclustobj, hang): graphs the dendrogram for the data; set hang to -1 for the best view
- rect.hclust(hclustobj, k): draws rectangles around hierarchical clusters on the dendrogram plot; k determines the number of clusters to highlight on the dendrogram
- cutree(hclustobj, k): cuts the hierarchical tree into k groups to make interpretation easier; choose the number of clusters you want to highlight
- table(cutreeobj): view how many observations fall into each group
- aggregate(data, by = list(cluster = cutreeobj), FUN): aggregate the data by the cluster assignments; data should be the original data or the scaled data; cutreeobj is the result of the cutree() function; FUN is typically set to median
Python
Note: requires matplotlib.pyplot (as plt) for visualization and numpy (as np)
Python: KMeans from sklearn.cluster
kmeans = KMeans(...): initialize KMeans; save it to a variable (for convenience); if it is saved to a variable, you can easily return properties of the clustering without repeatedly recomputing the fit; all properties and methods below are called on kmeans
- Arguments:
  - n_clusters: the number of clusters to form; default is 8
  - max_iter: the maximum number of iterations; default is 300
  - n_init: the number of times the K-means algorithm will run with different centroid seeds; the final result is the best output of the n_init consecutive runs in terms of inertia; default is 10
  - random_state: optional; the generator used to initialize the centers; if an integer is given, it fixes the seed; defaults to the global numpy random number generator
  - Usually, we just need to set the argument n_clusters to determine how many groups to split the data into
- Attributes:
  - cluster_centers_: the coordinates of the cluster centers
  - labels_: the label of each observation, which indicates the cluster assigned to each observation
  - inertia_: sum of squared distances of samples to their closest cluster center
  - The most important attribute here is labels_
- Methods:
  - fit: fit K-means clustering on a given data set
  - fit_predict: compute cluster centers and predict the cluster index for each sample
  - get_params: get the parameters of this estimator
  - set_params: set the parameters of this estimator
  - predict: given a set of data, predict the closest cluster each sample belongs to
- kmeans.set_params(params): change parameters; takes the same arguments as the KMeans() constructor; normally used to set n_clusters; mutating
- kmeans.fit(x): compute the K-means clustering; x is the data; every time you refit, the information from the previous fit is discarded; mutating
- After the model has been fitted, you can get the properties of the clusters by calling the KMeans() attributes:
  - kmeans.cluster_centers_: returns the values of the cluster centers
  - kmeans.labels_: returns the assigned cluster of each observation; can be used to visualize the clusters on a plot (set the color to the labels)
For Image Compression
Note: requires matplotlib.image (as mpimg) for reading image data
mpimg.imread(filename): opens an image file
- np.shape(mpimgobj): returns the shape of the image in three dimensions; the first two are the height and width in pixels, and the third is the number of color channels (3 for red, green, and blue, or 4 when an alpha channel is also present); this is useful for keeping the size of the image the same while compressing the image to fewer colors

def KmeansCompression(data, nclus=16):
    '''
    data: data to cluster (one row per pixel)
    nclus: number of colors
    '''
    # n_jobs is accepted by older scikit-learn versions only; it has since been removed
    cluster = KMeans(n_clusters=nclus, n_jobs=4)
    cluster.fit(data)
    centers = cluster.cluster_centers_
    labels = cluster.labels_
    # The number of cluster centers is much smaller than the number of original samples;
    # the centers are good representatives of the nearby sample points in the same cluster.
    # Each pixel is replaced by the center of its cluster, compressing the image to nclus colors.
    return centers[labels]

KmeansCompression(...).reshape(np.shape(imgobj)): used to reshape the compressed pixels back to the original size of the image
Python: pairwise_distances from sklearn.metrics.pairwise
pairwise_distances(x, metric): calculate distances between observations in the data; metric can be 'l1' (manhattan), 'l2' (euclidean), 'cosine' (cosine), etc.
Python: AgglomerativeClustering from sklearn.cluster
hier = AgglomerativeClustering(...): initialize hierarchical clustering; save it to a variable (for convenience); if it is saved to a variable, you can easily return properties of the clustering without repeatedly recomputing the fit; all properties and methods below are called on hier
- Arguments:
  - n_clusters: the number of clusters to find; default is 2
  - affinity: metric used to compute the linkage; can be "euclidean", "l1", "l2", "manhattan", or "cosine"; if linkage is "ward", only "euclidean" is accepted; default is "euclidean"
    - "l1" is the same as "manhattan", and "l2" is the same as "euclidean"
    - "cosine distance" here refers to 1 − cos(θ), not cos(θ) itself
    - The smaller the euclidean/manhattan distance, the closer the two observations are; the smaller the cosine value cos(θ), the farther apart the observations are; note that cos(θ) can be negative (it is not guaranteed to be non-negative)
    - So the cosine distance is defined as: d(x, y) = 1 − cos(θ) = 1 − (x · y) / (‖x‖ ‖y‖)
      - Notice it is NOT the cosine itself, but rather 1 − cos(θ)
      - The cosine distance therefore ranges from 0 to 2, and the smaller it is, the closer the pair of observations are
        - 0.0: the two vectors point in the same direction
        - 1.0: the two vectors are perpendicular
        - 2.0: the two vectors point in opposite directions
  - linkage: which linkage criterion to use; the linkage criterion determines which distance to use between sets of observations; the algorithm merges the pairs of clusters that minimize this criterion
    - ward minimizes the variance of the clusters being merged
    - average uses the average of the distances between all pairs of observations of the two sets
    - complete or maximum linkage uses the maximum distance between all pairs of observations of the two sets
- Attributes:
  - labels_: cluster label for each observation
  - n_leaves_: number of leaves in the hierarchical clustering tree, which is also the number of observations
- Methods:
  - fit: fit the hierarchical clustering on the data
  - get_params: get the parameters of this estimator
  - set_params: set the parameters of this estimator
- hier.set_params(params): change parameters; takes the same arguments as the AgglomerativeClustering() constructor; usually used to set n_clusters; mutating
- hier.fit(x): compute the hierarchical clustering; x is the data; every time you refit, the information from the previous fit is discarded; mutating
- After the model has been fitted, you can get the properties of the clusters by calling the AgglomerativeClustering() attributes:
  - hier.labels_: returns the assigned cluster of each observation; can be used to visualize the clusters on a plot (set the color to the labels)
Python: linkage from scipy.cluster.hierarchy
linkage(x, method, metric): performs hierarchical clustering on a condensed distance matrix (or observation matrix) x; method can be any linkage method ('complete', 'single', 'average', 'ward', etc.); metric defaults to 'euclidean'
Python: dendrogram from scipy.cluster.hierarchy
dendrogram(z, p, truncate_mode, leaf_rotation, leaf_font_size): plots a dendrogram; z is the linkage matrix produced by the linkage() function; p is the number of clusters/levels to show (default is 30); truncate_mode is used to condense the dendrogram (default None); set it to 'lastp' to show only the last p merged clusters, or to 'level' to show no more than p levels of the dendrogram; leaf_rotation defaults to 0 and determines how to rotate the x-axis labels (set to 90 to make the labels easier to read); leaf_font_size defaults to None (it varies depending on the number of nodes in the dendrogram)
Other
R: caret http://topepo.github.io/caret/index.html
Survival Analysis https://www.cscu.cornell.edu/news/statnews/stnews78.pdf
Markov Chain https://en.wikipedia.org/wiki/Markov_chain
A/B Testing