
Statistics and Machine Learning

Foundations of Statistics
Variables
- What are the central tendencies?
- What is the spread of the values?
- How much do the values vary?
- Are there any abnormalities that stand out?
Numerical Variables:
- Mean (μ): the average of the values; μ = (x1 + x2 + … + xn) / n
- Pros: helps describe the central tendency
- Cons: not robust to extreme values
- Median: the 50th percentile
- Pros: robust to extreme values
- Cons: only describes a measure of location; difficult to use for describing multiple variables
- Mode: the most frequent value
- Pros: identifies the most common value
- Cons: difficult to determine for discrete sets when values get grouped together too generally; the mode might not be a good description in some continuous cases; not useful if the data is spread out
- Variance (σ²) and Standard Deviation (σ): the spread of the values
- σ² = Σ(xᵢ − μ)² / n; σ = √σ²
- Pros: based on the mean; help describe the distribution of the values in relation to the mean
- Small variance indicates that the data are close to the mean and thus similar in value; high variance indicates that the data are far from the mean and thus dissimilar in value
- Cons: not robust to extreme values
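For example, these summaries can be computed in base R (a minimal sketch with a hypothetical vector x):
x = c(2, 4, 4, 5, 7, 30)                      # hypothetical data with one extreme value
mean(x)                                       # mean: pulled upward by the outlier
median(x)                                     # median: robust to the outlier
names(sort(table(x), decreasing = TRUE))[1]   # mode: the most frequent value
var(x)                                        # variance (note: R uses the sample variance, dividing by n - 1)
sd(x)                                         # standard deviation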
Categorical Variables:
- Frequency Tables: displays the number of times a data value occurs in a set
- Pros: shows how many times each category occurs
- Cons: doesn’t describe the number of occurrences in relation to everything else
- Proportion Tables: displays the fraction of occurrences of each data value in a set
- Pros: describes the occurrences of each category relative to the whole
data set
- Cons: doesn’t give any information on the size of the data set
- Contingency Tables: displays frequency or proportions among multiple
categorical variables simultaneously
- Pros: can describe the relationship between multiple variables
- Cons: realistically can only be used for two variables at a time
- Margins: individual variable frequencies/proportions; constructed by totalling
respective rows or columns
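A minimal base-R sketch of these tables with hypothetical categorical vectors:
color = c("red", "blue", "red", "green", "blue", "red")
size  = c("S", "M", "M", "L", "S", "M")
table(color)                  # frequency table
prop.table(table(color))      # proportion table
tab = table(color, size)      # contingency table for two categorical variables
addmargins(tab)               # append the row/column margins (totals)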
Correlation (ρ): The correlation helps quantify the linear dependence between two
quantities (e.g., does knowing something about one variable inform anything about the
other)
- ρ = cov(X, Y) / (σX σY)
- Bounds: -1 ≤ ρ ≤ 1
- Positive correlation means a direct relationship; negative correlation means an inverse relationship; zero means no linear correlation
- Notes: Zero correlation does not imply that the variables are independent; the equation above is for numerical variables
- Categorical variables usually need to be dummified to calculate correlations (create a one vs. all situation)
- Pros: describe relationship between variables
- Cons: assumes the relationship is linear
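For example, in R a factor can be dummified with model.matrix() before computing correlations (a sketch with a hypothetical data frame df):
df = data.frame(y = c(1, 3, 2, 5, 4, 6),
                group = factor(c("a", "b", "a", "b", "a", "b")))
dummies = model.matrix(~ group - 1, data = df)   # one 0/1 column per level (one vs. all)
cor(df$y, dummies[, "groupb"])                   # correlation of y with the "b" indicator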
Independent vs. Dependent Variables: independent variables are in no way related to one another; you cannot infer any information about one variable from information about the other; dependent variables are related to each other in some way
Statistical Inference
Process of deducing properties of an underlying distribution by analysis of data; hope to make
educated conclusions about a population by inferring behavior from a sample
- Statistical Hypothesis: question we wish to answer that is testable by observing a
process that is modeled by a set of random variables
- Use results to infer behavior in population
- Hypothesis Test:
- State the null and alternative hypotheses:
- Null Hypothesis (H0): The assumed default scenario in which nothing abnormal is observed (e.g., there is no difference among groups, etc.)
- Alternative Hypothesis (HA): The scientific supposition we desire to test that contrasts H0 (e.g., there is a difference among groups, etc.); the complete opposite of the null hypothesis
- Assume null hypothesis is true; calculate probability (p-value) of observing
results at least as extreme as what is present in your data sample; usually use a
table or computers to do calculations
- Based on the p-value, decide which hypothesis is more likely.
- Generally: if the p-value is > 0.05, retain the H0; if the p-value is < 0.05,
reject the H0 in favor of HA
T-Test: An independent samples t-test is used when you want to compare the means of
a normally distributed interval dependent variable for two independent groups
- One Sample T-Test: To examine the average difference between a sample and
the known value of the population mean
- Assumptions: The population from which the sample is drawn is
normally distributed; Sample observations are randomly drawn and
independent.
- Test statistic: t = (x̄ − μ0) / (s / √n)
- P-value calculation: calculate the t statistic given by the equation above and compare it with a standard table of values to get the p-value, or use a computer
- x̄: sample mean
- μ0: the known value of the population mean
- n: number of samples
- s: sample standard deviation
- n − 1: the degrees of freedom (degrees of freedom are usually one less than the sample size)
- H0: the average of the sample is equal to the known value
- HA: the average of the sample is not equal to the known value
- Note: Usually for numerical variables
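A minimal sketch in R, assuming a hypothetical sample x and a known population mean of 10:
x = c(9.1, 10.3, 9.8, 10.9, 9.5, 10.2)   # hypothetical sample
t.test(x, mu = 10)                       # one-sample t-test; reports t, df = n - 1, and the p-value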
- Two Sample T-Test: To examine the average difference between two samples
drawn from two different populations; used to determine if the samples are
statistically similar enough to each other to compare for evaluation purposes
- Assumptions: The populations from which the samples are drawn are
normally distributed; the standard deviations of the two populations are
equal; sample observations are randomly drawn and independent
- Test statistic (pooled, equal variances): t = (x̄1 − x̄2) / (sp √(1/n1 + 1/n2)), where sp is the pooled standard deviation
- P-value calculation: calculate the t statistic given by the equation above and compare it with a standard table of values to get the p-value, or use a computer
- x̄1, x̄2: sample means
- n1, n2: numbers of observations in each sample
- s1, s2: sample standard deviations (combined into sp)
- n1 + n2 − 2: the degrees of freedom for the pooled test
- H0: the averages of the two samples are equal
- HA: the averages of the two samples are not equal
- Note: Usually for numerical variables; can be used to derive one sample
t-test
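A matching two-sample sketch in R with hypothetical samples (var.equal = TRUE reflects the equal-variance assumption above):
x = rnorm(30, mean = 5.0, sd = 1)        # hypothetical sample from population 1
y = rnorm(30, mean = 5.5, sd = 1)        # hypothetical sample from population 2
t.test(x, y, var.equal = TRUE)           # pooled two-sample t-test of H0: equal means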
F-Test: Unlike the z- and t-statistics, which deal with means and proportions, the chi-square and F-tests deal with variances. The F statistic is the ratio of the variances of two samples; it is used to assess whether the variances of two different populations are equal.
- Assumptions: The populations from which the samples are drawn are normally
distributed; sample observations are randomly drawn and independent.
- Test statistic: F = s1² / s2²
- P-value calculation: calculate the F statistic given by the equation above and compare it with a standard table of values to get the p-value, or use a computer
- s1, s2: sample standard deviations
- n1 − 1 and n2 − 1: the degrees of freedom of the numerator and denominator (degrees of freedom are usually one less than the sample size)
- H0: the variances of the two samples are equal
- HA: the variances of the two samples are not equal
- Note: Generally want to do f-test first before doing a two sample t-test; cannot
determine if the means of the two samples are the same if the variances are
different; usually for numerical variables because we are using mean and
standard deviation
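A minimal R sketch with hypothetical samples:
x = rnorm(30, mean = 5, sd = 1.0)   # hypothetical samples
y = rnorm(30, mean = 5, sd = 1.5)
var(x) / var(y)                     # the F statistic is the ratio of the two sample variances
var.test(x, y)                      # F-test of H0: equal variances; run before a two-sample t-test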
One-Way ANOVA (Analysis of Variance): Uses f-tests to assess the equality of means
of two or more groups; similar to t-test; when there are two groups, it is the same as a
two sample t-test
- Assumptions: The populations from which the samples are drawn are normally distributed; the standard deviations of the populations are equal; sample observations are randomly drawn and independent
- Test statistic: F = (mean squares between groups) / (mean squares within groups)
- P-value calculation: Compare the test statistic value with a standard table of F-values to determine whether the test statistic surpasses the threshold of statistical significance (yielding a significant p-value); or use a computer
- Mean squares between groups: MSB = Σᵢ nᵢ(Ȳᵢ − Ȳ)² / (k − 1). A good estimate of the overall variance only when H0 is true. Quantifies the between-group deviations from the overall grand mean.
- k: number of groups
- i: index of the groups
- nᵢ: number of observations for a specific group
- Ȳᵢ: average value for a specific group; Ȳ: overall grand mean
- Mean squares within groups: MSW = Σᵢ Σⱼ (Yᵢⱼ − Ȳᵢ)² / (N − k). A good estimate of the overall variance, unaffected by whether the null or alternative hypothesis is true. Quantifies the within-group deviations from the respective group means.
- k: number of groups
- i: index of the groups
- nᵢ: number of observations for a specific group
- j: index of an observation in a specific group
- N: total number of observations
- Yᵢⱼ: value for an observation
- Ȳᵢ: average value for a specific group
- H0: the averages of all of the groups are equal
- HA: at least one group has a different average from another
- Notes: Generally for numerical values because we are using mean
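A minimal R sketch of a one-way ANOVA with a hypothetical data frame of values and group labels:
dat = data.frame(value = c(rnorm(20, 5.0), rnorm(20, 6.0), rnorm(20, 5.5)),
                 group = factor(rep(c("a", "b", "c"), each = 20)))
fit = aov(value ~ group, data = dat)   # one-way ANOVA: values ~ categories
summary(fit)                           # F statistic (MSB / MSW) and its p-value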
Chi-Square (χ2) Test of Independence: Use the chi-square test for independence to
determine whether there is a significant relationship between two categorical variables.
To test whether two categorical variables are independent.
- Assumptions: Sample observations are randomly drawn and independent
- Test statistic: χ² = Σᵢ Σⱼ (observedᵢⱼ − expectedᵢⱼ)² / expectedᵢⱼ, where expectedᵢⱼ = (row i total × column j total) / N
- P-value calculation: Compare the test statistic value with a standard table of χ²-values to determine whether the test statistic surpasses the threshold of statistical significance (yielding a significant p-value)
- i: index of the group in the first variable
- j: index of the group in the second variable
- nᵢⱼ: number of observations for a specific pair of groups (cell)
- N: total number of observations
- H0: the two variables are independent
- HA: the two variables are not independent
- Note: For comparing two categorical variables; similar to correlation for
numerical variables but not based on a linear relationship
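A minimal R sketch with hypothetical categorical vectors (toy data; a real test needs adequate expected cell counts):
smoker  = c("yes", "no", "no", "yes", "no", "yes", "no", "no")
disease = c("yes", "no", "yes", "yes", "no", "no", "no", "yes")
chisq.test(table(smoker, disease))   # H0: the two variables are independent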
Other Tests: http://stats.idre.ucla.edu/stata/whatstat/what-statistical-analysis-should-iusestatistical-analyses-using-stata/
https://www.csun.edu/~amarenco/Fcs%20682/When%20to%20use%20what%20test.pdf
R
Correlation, Variance and Covariance (Matrices): var, cov and cor compute the variance of x
and the covariance or correlation of x and y if these are vectors. If x and y are matrices then the
covariances (or correlations) between the columns of x and the columns of y are computed.
cov2cor scales a covariance matrix into the corresponding correlation matrix efficiently.
- var(x, y = NULL, na.rm = FALSE, use = "everything"): variance
- cov(x, y = NULL, use = "everything", method = c("pearson", "kendall",
"spearman")): covariance between two vectors
- cor(x, y = NULL, use = "everything", method = c("pearson", "kendall",
"spearman")): correlation between two vectors
- cov2cor(V): scales covariance matrix into the corresponding correlation matrix efficiently
- Arguments:
- x: numeric vector, matrix or df
- y: NULL (default) or a vector, matrix or data frame with compatible dimensions to
x. The default is equivalent to y = x (but more efficient).
- method: a character string indicating which correlation coefficient (or covariance)
is to be computed. One of "pearson" (default), "kendall", or "spearman": can be
abbreviated.
- v: symmetric numeric matrix, usually positive definite such as a covariance
matrix.
Density, distribution function, quantile function and random generation for the t distribution with
df degrees of freedom (and optional non-centrality parameter ncp).
- dt(x, df, ncp, log = FALSE): density for t distribution
- pt(q, df, ncp, lower.tail = TRUE, log.p = FALSE): distribution function; gets p-values when inputting the t statistic value for q; set lower.tail to FALSE to get the area in the right tail
- qt(p, df, ncp, lower.tail = TRUE, log.p = FALSE): quantile function; e.g., use p = 0.05 to find the 5% threshold of the t distribution
- rt(n, df, ncp): random generation
- Arguments:
- x, q: vector of quantiles
- p: vector of probabilities
- n: number of observations
- df: degrees of freedom
- ncp: non-centrality parameter delta; currently except for rt(), only for abs(ncp) ≤ 37.62. If omitted, use the central t distribution.
- log, log.p: logical; if TRUE, probabilities p are given as log(p).
- lower.tail: logical; if TRUE (default), probabilities are P[X ≤ x], otherwise P[X > x].
t.test(x, y, mu, alternative): t test for one or two sample t tests; y default is NULL for one sample; x and y are non-empty numeric vectors of data values; mu is the true value of the mean or difference in means (in two sample); default mu is 0; alternative specifies the alternative hypothesis; options are ‘two.sided’ (default), ‘greater’ or ‘less’
var.test(x, y, alternative): f test for comparing variances of two samples from normal
populations; alternative specifies the alternative hypothesis; options are ‘two.sided’ (default),
‘greater’ or ‘less’
aov(formula, data): Fit an analysis of variance model by a call to lm for each stratum;
conducting one-way ANOVA; formula specifies the model in the form of values ~ categories;
usually paired up with summary() function to find out more information
chisq.test(data): conduct chi square test of independence on data
bartlett.test(x,y): conducting the Bartlett test of homogeneity of variances; can also take a
formula instead of x and y; x and y are vectors of data values; x is numeric and y is factors
Missingness
Occurs when at least some of an observation’s values are not present within the dataset. We
say that the absent values are “missing,” and that the observation itself is “incomplete.” A value
could be missing because of many reasons (e.g., human error, carelessness in handling, an
undefined mathematical computation, etc.). These reasons are often unknown by the person
who ultimately receives the dataset.
- What to do with incomplete dataset?:
- Complete case analysis: only deal with complete observations; ignore all
observations with missing data
- Pros: quick and easy
- Cons: severely limit the amount of available information; smaller sample
size leads to increasing standard errors of estimates
- Types of missing data: can reveal some information about the dataset
- Missing Completely at Random (MCAR): Each piece of data in the overall dataset has an equally likely chance of being absent. MCAR data is the best-case scenario for missing data in general, because its manifestation is truly “completely at random”. Deletion of MCAR observations will not end up biasing your results
- Missing at Random (MAR): The chance that a piece of data is missing is
dependent on variables for which we have complete information within our
overall dataset. The probability a piece of data is missing depends on available
information that we have already collected; they are not independent. MAR is the
next-best scenario for missing data after MCAR because, although each
observation has a different likelihood of missing, we theoretically can estimate
this likelihood. When data are MAR, it is acceptable to drop these observations from our analysis: if we control for the factors that are related to the missingness and adjust for their effects, we can avoid bias in our model
- Missing Not at Random (MNAR): The chance that a piece of data is missing is
dependent on the actual value of the observation itself. The value of the missing
piece of data is directly related to the reason why it is missing in the first place.
MNAR is the worst-case scenario for missing data because it is non-ignorable.
We cannot theoretically accurately estimate the missing values because the
reason they are missing is not captured within our dataset. When data are
MNAR, it is not appropriate to drop these observations from our analysis; doing
so would leave us with a biased dataset, and thus our analyses would return
biased models
- Imputation: process of filling in missing data
- Mean value imputation procedure: Compute the average of the observed
values for a variable that has missingness. Impute the average for each of the
missing values.
- Pros: One of the simplest ways of dealing with missing data because of
its relatively straightforward approach.
- Cons: Can distort the distribution of the variable and underestimate the
standard deviation. Can distort relationships between variables by
dragging correlation estimates towards 0.
- Simple random imputation procedure: For each missing value in a variable, randomly select a complete value of the same variable; impute this randomly selected value. Repeat the process until all values are complete.
- Pros: Uses true, observed values to fill in missingness
- Cons: Can amplify outlier observation values by having them repeat in
the dataset. Can induce bias into the dataset.
- Regression prediction procedure: Assume an underlying, linear structure
exists in the data. Give weights to a subset of the complete variables. Use a
relationship between the complete variables and the complete observations to
impute missing observations.
- Pros: Uses true, observed values to fill in missingness. Uses the
relationships among multiple variables to fill in missingness.
- Cons: Must make assumptions about the structure of the data. Can
inappropriately extrapolate beyond the scope of available information in
our dataset.
- Pros of Imputation: Helps retain a larger sample size of your data. Does not
sacrifice all the available information in an observation because of sparse
missingness. Can potentially avoid unwanted bias.
- Cons of Imputation: The standard errors of any estimates made during
analyses following imputation can tend to be too small. The methods are under
the assumption that all measurements are actually “known,” when in fact some
were imputed. Can potentially induce unwanted bias.
- Imputation can be done by supervised learning or other prediction methods, but this is only straightforward if only one column has missing data; the complete cases can then be used to fill in the missing data
R
complete.cases(df or mat): a complete case is a row without missing data; returns a boolean for each row indicating whether that row is a complete case
transform(data, col = modval, …): transforms a dataset; choose a column or columns to
transform and set it to the modified values of the data; good for transforming missing data
- ex.: impute by average; can replace mean function with any other method for imputation
transform(data, col1 = ifelse(is.na(col1), mean(col1, na.rm=TRUE), col1))
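A similar base-R sketch of simple random imputation, assuming a data frame data with NAs in col1 (the Hmisc impute() function below offers the same idea):
missing = is.na(data$col1)
data$col1[missing] = sample(data$col1[!missing], sum(missing), replace = TRUE)  # draw observed values at random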
R Library: VIM
Visualization and imputation of missing values
aggr(data): aggregations for missing or imputed values on a plot
R Library: mice
md.pattern(data): display missing data pattern
R Library: Hmisc
impute(x, func): function to impute values; func default for imputation method is mean; set func
equal to ‘random’ for random imputation
is.imputed(x): determines if a value or values are imputed
Machine Learning
Basic Summary: Machine learning developed from the combination of statistics and computer
science; it aims to implement algorithms that allow computers to “learn” about the data it
analyzes. Traditional algorithms require computers to follow a strict set of program instructions;
machine learning algorithms instead assess the data at hand to make informed decisions.
Machine learning algorithms have the ability to learn and adapt; traditional algorithms do not.
It works well largely because of human-designed representations and input features.
Supervised Learning Algorithms: Your data includes the “truth” that you wish to predict. Use
what you know about your observations to construct a model for future decision making.
Basically, it is like trying to correctly impute missing values of the target.
- Regression: In regression, we aim to predict a continuous output given a slew of input
variables. Our data contains the output that we wish to predict.
- Classification: In classification, we aim to predict a categorical output given a slew of
input variables. Our data contains the output that we wish to predict.
- Simply becomes optimizing weights to best make a final prediction
Unsupervised Learning Algorithms: Your data does not include the “truth” that you wish to
predict. Use your data to find underlying structure to inform intrinsic behavior that is not already
explicitly available.
- Clustering: In clustering, we aim to uncover commonalities in our data that help
segment observations into different groups; within the groups, observations share some
characteristics. Our data does not contain the group information that we seek.
- Dimension Reduction: In dimension reduction, we aim to summarize massive amounts
of data into smaller, more understandable components while retaining the structure of
the original dataset. Our data does not tell us what the smaller components are. Used to
eliminate structural redundancies without sacrificing information
Supervised Learning
Used to predict the values of one or more variables Y from a given set of predictors X. Predictions
are based on the training data of previously solved cases. Performance can be estimated by
some loss function (for example, RSS in regression or OOB error in bootstrap aggregating),
using training-test splitting or cross-validation.
Regression: simple/multiple linear regression, regression trees, etc.
Classification: logistic regression, discriminant analysis, naive Bayes, support vector machines,
classification trees, etc.
K-Nearest Neighbors
The basic idea: Observations that are closest to an arbitrary point are the most similar. Can be
used in both classification and regression settings (i.e., can have output take the form of class
membership or property values). For K-Nearest Neighbors we find the K closest observations to
the data point in question, and predict the majority class as the outcome. For 1-Nearest
Neighbors, the single closest observation is the sole vote.
Note: unlike most supervised learning, KNN does not require training
Voronoi Tessellation for Classification: The KNN algorithm partitions the feature space into
different regions that represent classification rules; these regions are called Voronoi
tessellations. Boundaries represent areas where distances are equal in respect to different
observations. By following the Voronoi tessellations, the overall decision boundary has the
flexibility to be non-linear.
1NN: While the algorithm is very simple to understand and implement, its simplicity comes along
with some drawbacks. 1NN is unable to adapt to outliers; a single outlier can dramatically
change the Voronoi tessellations, and thus the decision boundaries. There is no notion of class
frequencies (i.e., the algorithm does not recognize that one class is more common than
another). One way to get around these limitations and to add some stability is to consider more
neighboring points (increasing the value of K), and assessing the majority vote. What happens
when we choose all neighbors?
Classification Algorithm:
Given the following information:
- The training set:
- Xi: The feature values for the ith observation (i.e., the location in space)
- Yi: The class value for the ith observation (i.e., the group label)
- The testing set:
- X*: The feature values for the new observation that we wish to classify.
The KNN classification algorithm:
- Calculate the distance between X* and each observation Xi
- Determine the K observations that are closest to X* (have the smallest distance)
- Classify X* as the most frequent class Y among the K selected observations.
Regression Algorithm:
Given the following information:
- The training set:
- Xi: The feature values for the ith observation (i.e., the location in space).
- Yi: The real-valued target for the ith observation (i.e., a continuous measurement).
- The testing set:
- X*: The feature values for the new observation that we wish to regress.
The KNN regression algorithm:
- Calculate the distance between X* and each observation Xi
- Determine the K observations that are closest to X* (have the smallest distance)
- Assign X* the mean of the Y measurements among the K selected observations
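The steps above can be sketched directly in R; this is a minimal illustration (not the kknn implementation) assuming a numeric training matrix train.x, labels train.y, and a new point x.star:
knn_classify = function(train.x, train.y, x.star, k = 5) {
  # 1. Euclidean distance from x.star to every training observation
  diffs = sweep(train.x, 2, x.star)          # subtract x.star from each row
  d = sqrt(rowSums(diffs^2))
  # 2. the K closest observations
  nearest = order(d)[1:k]
  # 3. predict the most frequent class among them (majority vote)
  names(which.max(table(train.y[nearest])))
}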
Choosing K
As we vary K the predicted classification rule will change, thus the choice of K has a large effect
on the algorithm’s performance. In general:
- Small values of K:
- Pros: highlight local variations
- Cons: are not robust to outliers; induce unstable decision boundaries
- Large values of K:
- Pros: highlight global variations; are robust to outliers; induce stable decision
boundaries
- Cons: too high values make all predictions similar
- In practice a good balance is typically achieved with K = √n
- Can also use cross-validation to choose K
Choosing Distance Measure
As we change the way we measure the distance between two points in our feature space, the
classification rule will change. The choice of distance measure also has a large effect on the
algorithm’s performance.
- Euclidean distance: The most common distance measure for continuous observations.
- d(x, y) = √(Σᵢ (xᵢ − yᵢ)²)
- Pros: Euclidean distance is the “familiar” distance we typically use in everyday life
- Cons: it is symmetric, treats all dimensions equally, and thus is sensitive to large deviations in a single dimension (different ranges and units in each dimension)
- Hamming distance: the most common distance for categorical observations; Hamming distance looks at each attribute between observations and compares whether or not the observations are the same
- Pros: simple way to determine “distance” between categories
- Cons: each similarity is ignored while each difference is penalized; the measure is symmetric and treats all dimensions equally
- Minkowski p-norm: a family of distance functions, d(x, y) = (Σᵢ |xᵢ − yᵢ|^p)^(1/p)
- As we vary p, we define distance measures that each have different behaviors:
- p → 0: Logical And (assigns more significance to simultaneous deviations)
- p = 1: Manhattan block distance (adds each component separately)
- p = 2: Euclidean distance
- p → ∞: Maximum distance, Logical Or (the largest difference among all attributes dominates the distance measure)
Breaking Ties
What do we do if there is a tie? More specifically, how do we decide to classify an
observation whose K-nearest neighborhood has an equal number of maximum group
memberships? Some methods for breaking ties:
- If there are only two groups, we can easily get around this by using an odd K. Why
doesn’t this work when there are more than two groups?
- Use the maximum prior probability to uniformly decide all ties.
- Randomly choose the group; for G groups:
- Roll a G-sided die that has equally likely outcomes for each group.
- Roll a G-sided die that has weighted outcomes for each group.
- Use the 1NN to break the tie.
Pros and Cons
Pros:
- The only assumption we are making about our data is related to proximity (i.e.,
observations that are close by in the feature space are similar to each other in respect to
the target value).
- We do not have to fit a model to the data since this is a non-parametric approach.
Cons:
- We have to decide on K and a distance metric.
- Can be sensitive to outliers or irrelevant attributes because they add noise.
- Computationally expensive; as the number of observations, dimensions, and K
increases, the time it takes for the algorithm to run and the space it takes to store the
computations increases dramatically.
- Why is this bad? We want more data!
R
R Library: VIM
kNN(data, k): KNN imputation; the dataset isn’t separated into a training set and a testing set
when put in the function
R Library: deldir
Delaunay triangulation and Dirichlet tessellation library
deldir(x, y): returns info needed to graph the tessellation
tiles.list(deldirobj): takes a deldir() created object and creates a list of tiles in a tessellation
plot.tile.list(tilelist, fillcol, main): plots the Voronoi tiles; takes the list created by the tiles.list() function; the fillcol argument defaults to none but takes a vector of color strings to add color to the tiles (the vector length has to match the number of points); useful in highlighting different categories on the graph; main is for the title of the plot
R Library: kknn
kknn(formula, train, test, k, distance): weighted KNN classifier; formula is in the form:
coltoimpute ~ colstodecide (use a period, ., if you wish to use the rest of the columns to decide);
train and test are the separated training and testing sets; distance is the Minkowski distance
chosen
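A hypothetical usage sketch (the data frames iris.train and iris.test are assumed splits of a dataset with a Species column):
library(kknn)
fit = kknn(Species ~ ., train = iris.train, test = iris.test, k = 7, distance = 2)
fitted(fit)    # predicted class for each observation in the test set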
Linear Regression
Generalized Linear Models
Regularization and Cross Validation
Decision Trees
Supervised learning models for both classification and regression; construct solutions that
stratify the feature space into relatively easy to describe rectangular regions; can get an idea of
the general characteristics of observations that fall within particular regions of space, we can
inform the characteristics of new observations that fall within the same regions.
Regression Trees
How do we segment a graph into regions of high and low values?
How to interpret the tree?
- Start from the top of the tree and pass a new observation through the various internal
nodes
- At each internal node, you make a decision on how to proceed based on the characteristics of the observation
- If the condition is satisfied, move down the left branch
- If the condition is not satisfied, move down the right branch
- Continue moving down the nodes until a terminal node/leaf is reached
- The value within the terminal node is the mean response value (ŷRj) for the observations that fell within that region; this is also the prediction for future observations that fall into that region
Mathematically:
- The prediction for region Rj is ŷRj, the mean of the training responses in Rj
Process Summary:
- Segment the predictor space (all possible values of X1, X2, …, Xp) into J distinct and non-overlapping regions (R1, R2, …, RJ)
- For each observation that falls into a specific region Rj, predict the mean of the response values (ŷRj) for the training observations that fell within Rj
How do we decide where exactly to segment the predictor space?
How do we come upon the regions R1, R2, …, RJ?
Theoretically the regions can have any shape, but decision trees use rectangular box-like segments for ease of interpretation; if the regions did not follow some specific pattern, it would be difficult to represent the resulting model by a decision tree
Goal is to find rectangular boxes R1, R2, …, RJ such that the RSS is minimized: RSS = Σⱼ Σ_{i ∈ Rj} (yᵢ − ŷRj)²
- Aim to minimize the squared differences of the response as compared to the mean response for the training observations within the jth region
- Computationally infeasible to consider every possible segmentation of the feature
space into J regions; minimization isn’t as easily solvable especially as the number of
regions increases
Tree based methods provide an approximation by combining a top-down method with a greedy
approach called recursive binary splitting:
- The method is top-down because the feature space is split into binary components in a
successive fashion, creating new branches of the tree to potentially be split themselves
- The method is greedy because splits are made at each step of the process based on
the best result possible at the given step
- The splits are not based on what might eventually lead to a better segmentation
in future steps
- Splitting process depends on the greatest reduction in the RSS based on the predictor
Xj, and the cut point s that end up partitioning the space into the regions:
- R1(j,s) = {X | Xj < s}
- R2(j,s) = {X | Xj ≥ s}
- The splitting process seeks the values of j and s that minimize: Σ_{i: xᵢ ∈ R1(j,s)} (yᵢ − ŷR1)² + Σ_{i: xᵢ ∈ R2(j,s)} (yᵢ − ŷR2)²
- This process is repeated by considering each of the newly created regions as the new
overall feature spaces to segment
When do we stop splitting?
- The recursive binary splitting process is likely to induce overfitting, thus leading to
poor predictive performance on new observations
- Definitely overfit if each observation is its own terminal node (model will have high variance), but the RSS will be exactly 0 on the training set
- Can prevent overfitting by setting a threshold on:
- Maximum depth of the tree
- Minimum number of observations in a tree node to split
- Minimum number of observations in each region (node)
- What if we try fitting a tree with fewer regions? This should lead to lower variance
with a cost of some bias, but ultimately lead to better predictions
- Grow the tree to a certain extent until the reduction in the RSS at a split doesn’t
surpass a certain threshold
- Problem: although a split might not be incredibly valuable in reducing the RSS
early on in a tree, it might lead to a future split that does reduce the RSS to a
large extent
Tree Pruning
One solution to problem of stopping splits at locations that might lead to better reduction of the
RSS; Build a large tree and then prune it back in order to obtain a suitable subtree;
- The best subtree will be the one that yields the lowest test error rate. Given a subtree,
we can estimate the test error by implementing the cross-validation process, but too
cumbersome because the large number of possible subtrees; Need a better process
- Rather than checking every single possible subtree, the process of cost complexity
pruning (i.e., weakest link pruning) allows us to select a smaller set of subtrees for
consideration
Cost Complexity Pruning: consider a sequence of trees indexed by a non-negative tuning
parameter α. For each value of α there corresponds a subtree T such that the following is
minimized:
- Σ_{m=1}^{|T|} Σ_{i: xᵢ ∈ Rm} (yᵢ − ŷRm)² + α|T|
- |T| indicates the total number of terminal nodes of subtree T
- Rm is the subset region of the feature space corresponding to the mth terminal node
- Tuning parameter α helps balance the tradeoff between the overall complexity of the
tree and its fit to the training data:
- Small values of α yield trees that are quite extensive (have many terminal nodes)
- Large values of α yield trees that are quite limited (have few terminal nodes)
- The process is similar to the shrinkage/regularization method utilized in ridge and lasso regression
- It can be shown that as the value of the tuning parameter α increases, branches of the overall tree are pruned in a nested manner
- Thus it is possible to obtain a sequence of subtrees as a function of α
- As with any other tuning parameter, in order to select the optimal value of α we implement cross-validation
- The subtree used for prediction is built using all the available data with the determined optimal value of α
Algorithm
- Use recursive binary splitting to build a large tree on the training data; stop before each observation falls into its own leaf (e.g., when each terminal node has fewer than 5 observations)
- Apply cost complexity pruning to the large tree in order to obtain a sequence of best subtrees as a function of α
- Use K-fold cross-validation to choose the best α:
- For each of the K folds:
- Repeat the binary splitting and tree pruning on all but the k th fold of the
training data
- Evaluate the mean squared prediction error on the data in the left-out kth
fold as a function of α
- Average the errors for each α; select the α that minimizes this criterion
- Return the subtree of the overall tree from pruning that corresponds to the best α
Classification Trees
Decision tree that predicts a qualitative (categorical) response rather than a quantitative
(numerical) response; similar to regression trees but for each of the subregions created, we
predict that an observation belongs to the most commonly occurring class of training
observations in its associated region; still implement recursive binary splitting to create
various subregions of the feature space but do not use RSS as a criterion to minimize
Can use misclassification rate (i.e., the fraction of training observations in a region that do not
belong to the most common class)
- The misclassification rate can end up not being sufficiently sensitive; too choppy and
doesn’t lead to a smooth tree building process
Gini Index
Typical criterion:
- Gini index: G = Σ_{k=1}^{K} p̂mk (1 − p̂mk)
- The proportions p̂mk denote the fraction of training observations in the mth region that are from the kth class
- The index measures the total variance among the K classes; it is often referenced as a measure of terminal node purity
- Sometimes splits yield terminal nodes that have the same predicted value; such duplicate splits are recorded by the classification tree because they lead to increased node purity; having an increased sense of node purity yields an increased sense of certainty pertaining to the response value corresponding to each terminal node
- The goal is to make the two subregions as pure as possible by reducing the weighted sum of Gini impurities
Information Entropy
Works in a similar way to the Gini impurity
- Entropy: E = −Σᵢ fᵢ log2(fᵢ), where fᵢ is the fraction of items labeled with class i in the set and Σfᵢ = 1
- Want the information gain to be as great as possible after the splitting
- (In the usual plot of these impurity measures, the horizontal axis is the proportion of one of the classes)
Pros and Cons
Pros:
- Easy to interpret (especially if it's small) if you don’t have a heavy mathematical
background; relatively non-complex
- Can graphically depict a higher dimensionality easier than linear regression and still
be interpreted by a novice
- Process can easily adapt to qualitative/categorical predictors without the need to
create and interpret dummy variables
- Reflects a more “human” decision-making process as compared to other machine
learning methods
- Can be displayed graphically
Cons:
- Predictive accuracy tends to be lower and thus not as competitive as a trade off for
less complexity
- Can increase predictive accuracy with:
- Bagging
- Random Forests
- Boosting
- At a cost of decreased interpretative value
- Suffer from high variance; will probably get very different trees if you randomly split the
data into two and fit the independent trees
- Instability: a small change in the data may result in very different splits
Bagging
Bootstrap aggregation (i.e., bagging): procedure that aids in the reduction of variance for a
statistical learning method; frequently used alongside trees
- Recall that given a set of n independent observations X1, X2, …, Xn, each with variance σ², the variance of the mean of the observations is σ²/n
- Averaging a set of observations reduces the overall variance
- Not practical: typically do not have access to multiple training sets
- Create multiple pseudo-training sets by bootstrapping
- Take repeated samples of the same size from the single overall training dataset; treat these different sets of data as pseudo-training sets
- By bootstrapping, we create B different training datasets; the method is trained on the bth bootstrapped training set in order to get predictions for each observation; we end up with B different decision trees; we can then average all predictions (or take the majority vote) to obtain the bagged estimate: f̂bag(x) = (1/B) Σ_{b=1}^{B} f̂b(x); for classification, the overall prediction is the most commonly occurring class among the B predictions
Recall reducing the variance by pruning; while pruning reduces the variance of the
overall tree model upon repeated builds with different datasets, we induce bias because
the trees are much simpler
- The idea of bagging averts the pruning methodology but still gets its benefits:
- Average many noisy trees and hence reduce the model variance
- Instead of pruning back our trees, create very large trees in the first place. These
large trees will tend to have low bias, but high variance
- Retain the low bias, but get rid of the high variance by averaging across many
trees
- Since each tree generated in bagging is identically distributed, the expectation
value of the averages is the same as the expectation of any one of them; this
means bias will not be improved
- How to estimate the test error of a bagged model?
Out of bag Estimation:
- Decision trees are fit to bootstrapped subsets of the overall available observations
- Observations that are used to fit the tree are said to be “in the bag”
- Observations that are not used to fit the tree are said to be “out of bag”
- Can predict the response for a given observation using each of the trees in which the
observation was out of bag and then average the results
- The averaged predictions are used to calculate the out of bag error estimate
- When the number of bootstrapped samples is large, this is essentially the same
as leave-one-out cross-validation error for bagging
Random Forest
The variance of the mean of a sample increases as observations are correlated with one
another
- Correlated observations are not as effective at reducing the uncertainty of the mean as uncorrelated, independent observations
Random forests: improve on bagging procedure by decorrelating trees; this results in a
reduction of variance once we average the trees
- Similar to bagging, we first build various decision trees on bootstrapped training samples, but we split the internal nodes in a special way
- Each time a split is considered within the construction of a decision tree, only a random subset of m of the overall p predictors are allowed to be candidates
- Only the m predictors have the possibility to be chosen as the splitting factor
- At every split, a new subset of predictors is randomly selected
- Typically, m ≈ √p is a sufficient rule for subset selection
- What happens if we choose m = p? We just get the bagging model
- Why does using fewer of the predictor variables at each split help in the long run?
- It forces the decision tree building process to use different predictors to split at different times
- Should a good predictor be left out of consideration for some splits, it still has
many chances to be considered in the construction of other splits; same idea
goes for predictors surfacing in trees as a whole
- Likely to yield different trees even when using the same training samples
- Can’t overfit by adding more trees; the variance ends up decreasing
Boosting
Boosting: similar to bagging except that the decision trees are generated in a sequential
manner
- Each tree is generated using information from previously grown trees; the addition
of a new tree improves upon the performance of the previous trees
- The trees are now dependent upon one another
- Whereas creating a single large decision tree can amount to severe overfitting to our
training data, the boosted approach tends to slowly learn our data
- Given a current decision tree model, we fit a new decision tree to the residuals of the
current decision tree
- The new decision tree (based on the residuals) is then added to the current
decision tree, and the residuals are updated
- Limit the number of terminal nodes in order to sequentially fit small trees
- By fitting small trees to the residuals, slowly improve the overall model in areas
where it does not perform well
- The shrinkage parameter (λ) is taken to be quite small, and slows the process down even further to avoid overfitting
Algorithm:
- Set f̂(x) = 0 and the residuals rᵢ = yᵢ for each i in the training data
- For b = 1, 2, …, B:
- Fit a tree f̂ᵇ with d splits (d + 1 terminal nodes) to the training data (X, r)
- Update f̂ by adding in a shrunken version of the new tree: f̂(x) ← f̂(x) + λ f̂ᵇ(x)
- Update the residuals: rᵢ ← rᵢ − λ f̂ᵇ(xᵢ)
- Output of the boosted model: f̂(x) = Σ_{b=1}^{B} λ f̂ᵇ(x)
Tuning Parameters:
- B: number of trees
- Can overfit (slowly) if B is too large; use cross-validation to select B
- λ: shrinkage parameter (a small positive number)
- Controls the rate of learning; typical values are around 0.01 to 0.001
- If λ is too small, it may require a large value of B or else it won’t learn at all
- d: number of splits in each tree
- Controls the complexity of the boosted ensemble; typically using stumps (single
splits where d = 1) is sufficient and results in an additive model; the tree depth
corresponds to the interaction order of the boosted model since d splits can
involve at most d distinct variables
Variable Importance
For bagged and random forest trees, we can record the total amount that a given criterion is
decreased over all splits relevant to a given predictor, averaged over all B trees
- For regression trees, we can use the reduction in the RSS
- For classification trees, we can use the reduction in the Gini index
In both regression and classification, we can do this for each predictor in the original dataset
- A relatively large value indicates a notable drop in the RSS or Gini index, and thus a better fit to the data; corresponding variables are relatively important predictors
This allows us to gain a qualitative understanding of the variables in our dataset
R
R Library: tree
tree(formula, split, data, subset): fit a tree to the data; the formula is in the form of y ~ x1 + x2 +
…; can use a period (.) to select all variables (excluding y) and can use subtraction (-) to
exclude variables; split is the criteria to determine a split (‘deviance’ or ‘gini’); subset can be
used to specifically select a subset to use as training data (vector of indices)
- summary(treeobj): to get information about the fitted tree
- plot(treeobj): plot the tree
- text(treeobj): add text to the tree plot
predict(treeobj, test, type): prediction of test data; set type ‘class’ for classification; default type
is for regression trees
- table(pred, actual): confusion matrix to assess accuracy of the overall tree for
misclassification
- Calculate the mean squared error (MSE) for regression
cv.tree(treeobj, FUN): perform cross validation to decide how many splits to prune; set FUN to
prune.misclass for using misclassification as the basis for pruning; default is prune.tree for
regression
- names(cv.treeobj): inspect element of cv.treeobj
- cv.treeobj$size: indicates the number of terminal nodes
- cv.treeobj$dev: deviance is the criterion we specify (misclassification rate)
- cv.treeobj$k: cost complexity tuning parameter alpha
- cv.treeobj$method: indicates the specified criterion
- Plot k or size versus dev to visually inspect the results
prune.tree(treeobj, best): prune a regression tree; best is the number of terminal nodes to use, determined by cross-validation
prune.misclass(treeobj, best): prune a classification tree using misclassification; best is the number of terminal nodes to use, determined by cross-validation
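A sketch of the typical workflow with these functions, assuming a hypothetical data frame df with a factor response y:
library(tree)
fit = tree(y ~ ., data = df, split = "gini")        # grow a large classification tree
summary(fit)
cv = cv.tree(fit, FUN = prune.misclass)             # cross-validate over tree sizes
plot(cv$size, cv$dev, type = "b")                   # inspect deviance vs. number of leaves
best = cv$size[which.min(cv$dev)]
pruned = prune.misclass(fit, best = best)           # prune to the chosen size
predict(pruned, newdata = df, type = "class")       # class predictions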
R Library: randomForest
randomForest(formula, split, data, subset, mtry, importance): fit a random forest tree to the
data; the formula is in the form of y ~ x1 + x2 + …; can use a period (.) to select all variables
(excluding y) and can use subtraction (-) to exclude variables; split is the criteria to determine a
split (‘deviance’ or ‘gini’); subset can be used to specifically select a subset to use as training
data (vector of indices); mtry is the number of randomly selected predictors to use at each split
(set mtry to number of predictors for bagging); set importance to True to assess importance of
predictors
- rFobj$mse: MSE for random forest fitting
- rFobj$err.rate: error rate for classification
predict(rFobj, test, type): prediction of test data; set type ‘class’ for classification; default type is
for regression trees
- table(pred, actual): confusion matrix to assess accuracy of the overall tree for
misclassification
- Calculate the mean squared error (MSE) for regression
importance(rFobj): determine importance of predictors
varImpPlot(rFobj): plot importance of predictors
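A hypothetical usage sketch, assuming a data frame df whose first column is the response y:
library(randomForest)
p = ncol(df) - 1                                                                 # number of predictors
bag = randomForest(y ~ ., data = df, mtry = p, importance = TRUE)                # bagging (m = p)
rf  = randomForest(y ~ ., data = df, mtry = floor(sqrt(p)), importance = TRUE)   # random forest (m ≈ √p)
importance(rf)                                                                   # per-predictor importance
varImpPlot(rf)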
R Library: gbm
gbm(formula, data, distribution, n.trees, interaction.depth, shrinkage): fit a boosted tree to the
data; the formula is in the form of y ~ x1 + x2 + …; can use a period (.) to select all variables
(excluding y) and can use subtraction (-) to exclude variables; set distribution (default
‘bernoulli’) to ‘gaussian’ for gaussian distribution; n.trees is the number of trees to use (default
100); interaction.depth is the depth of each tree; shrinkage is learning rate
- summary(gbmobj): gets a summary of the fit and also plots the importance of variables
- Can only classify two groups; must turn the classes into numbers (0 and 1) and then
treat it like a regression problem
predict(gbmobj, newdata, n.trees): prediction of newdata (test data); n.trees is the number of
trees to use; must be less than or equal to the number of trees specified in the fit; can use a
vector of possible trees to get a prediction matrix
- with(data, apply((predictions - y)^2, 2, mean)): Calculate boosted errors (MSE case) for
predictions; can be plotted to view how the error changes with change in number of trees
- Must round the predictions for classification problems to get the correct predictions;
since gbm only does regression, the prediction is a number and the closer it is to one
number means it will be classified as the category that associates with that number
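A hypothetical usage sketch for a regression fit, assuming training and test data frames df and df.test with a numeric response y:
library(gbm)
boost = gbm(y ~ ., data = df, distribution = "gaussian",
            n.trees = 5000, interaction.depth = 1, shrinkage = 0.01)
summary(boost)                                      # relative influence of each predictor
pred = predict(boost, newdata = df.test, n.trees = 5000)
mean((pred - df.test$y)^2)                          # test MSE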
Python
Helpful function to determine purity
from collections import Counter
import math

def purity(L, metric='gini'):
    # L: a list of class labels; returns the Gini impurity or the entropy of the labels
    total = len(L)
    freq = [float(count) / total for count in Counter(L).values()]
    if metric == 'gini':
        scores = [f * (1 - f) for f in freq]
    elif metric == 'entropy':
        scores = [-f * math.log(f, 2) for f in freq]
    return sum(scores)
Useful function to plot the decision boundary of a fitted model
import numpy as np
import matplotlib.pyplot as pl
from matplotlib import colors

def plotModel(model, x, y, label):
    '''
    model: a fitted model
    x, y: two feature variables, should be arrays
    label: true labels
    '''
    margin = 0.5
    x_min = x.min() - margin
    x_max = x.max() + margin
    y_min = y.min() - margin
    y_max = y.max() + margin
    colDict = {'red': [(0, 1, 1), (1, 0.7, 0.7)],
               'green': [(0, 1, 0.5), (1, 0.7, 0.7)],
               'blue': [(0, 1, 0.5), (1, 1, 1)]}
    cmap = colors.LinearSegmentedColormap('red_blue_classes', colDict)
    pl.cm.register_cmap(cmap=cmap)
    # evaluate the model on a 200 x 200 grid covering the feature space
    nx, ny = 200, 200
    xx, yy = np.meshgrid(np.linspace(x_min, x_max, nx),
                         np.linspace(y_min, y_max, ny))
    Z = model.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    ## plot colormap
    pl.pcolormesh(xx, yy, Z, cmap='red_blue_classes')
    ## plot boundaries
    pl.contour(xx, yy, Z, [0.5], linewidths=1., colors='k')
    pl.contour(xx, yy, Z, [1], linewidths=1., colors='k')
    ## plot scatters and true labels
    pl.scatter(x, y, c=label)
    pl.xlim(x_min, x_max)
    pl.ylim(y_min, y_max)
    ## if it's an SVM model, plot the support vectors
    try:
        index = model.support_
        pl.scatter(x[index], y[index], c=label[index], s=100, alpha=0.5)
    except AttributeError:
        pass
Python: tree from sklearn
tree_model = tree.DecisionTreeClassifier(...): initializes decision tree; save to a variable for
convenience
- Arguments:
- criterion: "gini" or "entropy", corresponding to the criteria of "gini impurity" and
"information gain". default = 'gini'.
- max_depth: The maximum depth of the tree. default = None, which means the
nodes will be expanded until all leaves are pure or until all leaves contain less
than min_samples_split samples.
- min_samples_split: The minimum number of samples required to split. default =
2.
- min_samples_leaf: The minimum number of samples required to be at a terminal node. default = 1.
- Methods:
- fit: Build a decision tree from the training set (X, y).
- predict: Predict class or regression value for X.
- predict_log_proba: Predict class log-probabilities of the input samples X.
- predict_proba: Predict class probabilities of the input samples X.
- score: Return the mean accuracy on the given test data and labels.
- set_params: Set the parameters of this estimator.
- get_params: Get parameters for this estimator.
- Attributes:
- tree_: Tree object, the underlying tree object.
- feature_importances_: The feature importances. The higher, the more
important the feature. Also known as gini importance.
- tree_model.fit(x,y): fit the tree model
- tree_model.score(x,y): score the accuracy of the model on the data
- tree_model.feature_importances_: the importance of each feature, rated from 0 to 1;
the higher, the better; sum of feature importances should be 1
Python: ensemble from sklearn
rf = ensemble.RandomForestClassifier(...): initialize random forest; should save the classifier
to a variable for convenience
- Arguments: similar arguments to DecisionTreeClassifier()
- criterion : default=”gini”; can be “entropy”
- max_depth: default = None.
- min_samples_split: default = 2.
- min_samples_leaf: default = 1.
- n_estimators: The number of trees. default=100.
- bootstrap: Whether bootstrap samples are used when building trees.
default=true.
- oob_score: Whether to use out-of-bag samples to estimate the generalization
error. default=false.
- Methods:
- fit: Build a forest of trees from the training set (X, y).
- score: Return the mean accuracy on the given test data and labels.
- predict: Predict class for X.
- predict_log_proba: Predict class log-probabilities for X.
- predict_proba: Predict class probabilities for X.
- set_params: Set the parameters of this estimator.
- get_params: Get parameters for this estimator.
Attributes:
- feature_importances_:The feature importances (the higher, the more important
the feature).
- oob_score_: Score of the training dataset obtained using an out-of-bag
estimate.
Python: sklearn.grid_search
gs will represent sklearn.grid_search for all the following examples
gsmodel = gs.GridSearchCV(model, params, scoring, cv): initializes a grid search for
supervised learning models; the model is the initialized model; params is a list of parameter
combinations to search; scoring is the evaluation method to determine the best parameters; cv
is the number of folds; like any other model, it should be saved to variable
- gsmodel.fit(x,y): fit the model
- Trees: for trees the model should be tree_model; the scoring should be ‘accuracy’; the
list of params can be a list of dictionaries with the keys being different arguments
(‘criterion’, ‘max_depth’, min_samples_split, min_samples_leaf, etc.) and the values as a
list of values to search through
- gsmodel.grid_scores_: returns all the scores of the grid search
- gsmodel.best_params_: returns the best parameters from the grid search which can be
saved and inputted into a model later on
- gsmodel.best_score_: best score
- gsmodel.score(x,y): score the performance on a set of data
- if this is scored on the original set of data used for the grid search, the value might not match the best score; the best score is the average of the results from the cv folds; if the data is ordered when the folds are split, the model might look better at prediction because the data isn’t randomized, which may result in better scores during the cross-validation process but will probably lead to more error when testing randomized data
Python: sklearn.cross_validation
cv will represent sklearn.cross_validation for all the following examples; this module can alter
the process for cross validations
cv.StratifiedKFold(y, n): selects the folds in a stratified pattern; y is the target variable and n is the number of folds; the result can be passed to the grid search function in sklearn.grid_search (set cv equal to this object)
cv.train_test_split(x, y, random_state, test_size): splits a data set into a training set and testing
set; the random_state is like a seed in R; makes the test results reproducible; test_size
determines the ratio of test data to the original data;
- returns four values in the order: X_train, X_test, Y_train, Y_test
Python: xgboost
https://github.com/mpearmain/BayesBoost
Kaggle competition based repo using xgboost and Bayesian Optimization.
Support Vector Machines
Only for classification; direct approach to classification by constructing linear/non-linear
decision boundaries, by explicitly separating the data into two different classes as complete as
possible;
- the Linear decision boundaries in Support Vector Classifiers are called hyperplanes
in the feature space
- The non-linear decision boundaries in general Support Vector Machines are called
hypersurfaces in the feature space
Maximum Margin Classifier
Hyperplane: a subspace of one dimension less than its ambient space:
- In 2D space, hyperplanes are 1D lines
- In 3D space, they are 2D planes
- In pD space, they are (p-1)D objects
- Flat and affine:
- They preserve parallel relationships
- Don’t need to pass through the origin
- Equation form: β0 + β1X1 + … + βpXp = 0, or β0 + βᵀX = 0
- β = (β1, …, βp) and X = (X1, …, Xp) are p-dimensional vectors
- For any point vector X in the space, there are two possibilities:
- X satisfies the equation above and thus itself falls on the hyperplane
- X does not satisfy the equation above and thus falls on one side of the hyperplane
- The signed distance of any given point x to the hyperplane is given by: f(x) = (β0 + βᵀx) / |β|
- The distance function f can be used as the decision function of the classification
- If X does not fall on the hyperplane (f(x) ≠ 0), then one of the following must be true:
- f(x) > 0: the point is on one side of the hyperplane; f(x) < 0: the point is on the opposite side
Extracting the β coefficients from this equation (not including the intercept) yields what is
called a normal vector:
- The vector points in a direction orthogonal to the surface of the hyperplane and
essentially defines its orientation
- Might need to work in the normalized form: β* = β/|β|, or require that |β| = 1
- For any given point in the feature space, we can project onto the normal vector of the
hyperplane
- Based on the sign of the resulting value, we can determine on which side of the
hyperplane the point falls
When the normal vector is of unit length, the value of the hyperplane function defines the
Euclidean distance from the point to the hyperplane
Example: the hyperplane 1 + 2X1 + 3X2 = 0 in 2D space
Separating Hyperplanes:
- Need to develop a classifier that will help us predict into which category a new
observation will fall
- Suppose observations fall into one of two classes which we can label as {-1,1}
without loss of generality
- Also suppose that it is possible to construct a hyperplane that perfectly separates
the observations based on these class labels
- This hyperplane would then have the following properties for all i = 1, …, n:
  - yi = 1 for f(xi) > 0 and yi = -1 for f(xi) < 0
Equivalently, a separating hyperplane has the property:
- yi(β0 + β1xi1 + … + βpxip) > 0, i.e. yi·f(xi) > 0, for all i = 1, …, n
By evaluating the value of the hyperplane function given an observation, we can
determine on which side of the hyperplane the observation falls
- If the value is positive, classify the observation into group 1
- If the value is negative, classify the observation into group 2
Magnitude of the evaluation also yields information regarding the confidence of our
classification prediction:
- Large values imply the observation is far from the hyperplane (high conf.)
- Small values imply the observation is close to the hyperplane (low conf.)
Maximal Margin Classifier: If the data can be separated perfectly with a hyperplane, then there
are infinite separating hyperplanes in the feature space; How do we determine which of these
hyperplanes is the best?
- Compute the distance from each training observation to a separating hyperplane; of these distances, the smallest distance is called the margin
- Then try to find the maximal margin hyperplane, which:
  - Is the separating hyperplane that is farthest from the training observations
  - Creates the biggest gap/margin between the two classes
- Hopefully, if the maximal margin hyperplane has a large margin on the training data, it will also have a large margin on the test data
The construction of the maximal margin classifier is the solution to the following optimization problem:
- maximize M over β0, β1, …, βp subject to |β| = 1 and yi(β0 + βᵀxi) ≥ M for all i = 1, …, n
  - Maximize the margin M
  - Ensure the normal vector is of unit length (not actually a constraint! why?)
  - Guarantee that each observation is on the correct side of the hyperplane
- The conditions ensure that the distances from all the points to the decision boundary specified by β and β0 are at least M, and we seek the largest M by varying the parameters
- We can get rid of the constraint |β| = 1 by replacing the inequalities with (1/|β|)·yi(β0 + βᵀxi) ≥ M
  - For any β and β0 satisfying the inequalities, any positively scaled multiple satisfies them too
- If we set |β| = 1/M, we can rephrase the original problem in a more elegant form by dropping the norm constraint on β: minimize ½|β|² subject to yi(β0 + βᵀxi) ≥ 1 for all i
- This is a convex quadratic optimization problem and can be solved efficiently
Limitations: The observations that fall closest to the separating hyperplane (equidistant) define
the width of the margin. These observations are known as the support vectors because the
hyperplane depends on their location
- If these observations were to move around in the feature space, the maximal margin
hyperplane would also move
- The maximal margin hyperplane directly depends only on the support vectors, not the
remaining observations; poor solution if data is noisy
- The definition of the classifier can be very sensitive to outliers or a single
change in the data
- High sensitivity to a small change suggests that we have overfit the classifier
- What if no separating hyperplane exists?
  - There would be no solution to the optimization problem with M > 0
Support Vector Classifier
Support Vector Classifier: extension of the maximal margin classifier that makes some
compromises in an effort to improve upon the aforementioned limitations
- May not perfectly separate the classes
- Provides greater robustness to outliers and thus a lower sensitivity to individual
observation shifts
- Helps better classify most of the training observations
- By giving up the ability to have a perfect classifier on the training data we:
- Take a penalty by possibly misclassifying some observations
- Do a better job classifying the remaining observations more confidently
- May have better predictive power for future observations
- Soft Margin: allows some observations to be on the incorrect side of either the margin or hyperplane; allows for the cases where the data is not separable
- Need to optimize the following: maximize M over β0, β1, …, βp, ε1, …, εn subject to |β| = 1, yi(β0 + βᵀxi) ≥ M(1 − εi), εi ≥ 0, and Σεi ≤ C
  - Maximize the margin M
  - Ensure the normal vector is of unit length
- ε: slack variables (εi ≥ 0 and Σεi ≤ constant); they allow individual observations to potentially fall on the wrong side of the margin or hyperplane; εi tells us where the ith observation is located relative to the margin and hyperplane
  - If εi = 0, then the ith observation is on the correct side of both the margin and the hyperplane
  - If εi > 0, then the ith observation violates the margin
  - If εi > 1, then the ith observation violates the hyperplane (misclassification)
  - The magnitude of the slack variables is proportional to the distance from each observation to the margin
- C: a tuning parameter that helps determine the threshold of tolerable violations to the margin and hyperplane; often thought of as a budget for the slack variables
  - If C = 0, then there is no budget for the slack variables. For every i, εi = 0, and the problem reduces to the maximal margin classifier
  - As C increases, there is more budget for violations; the classifier becomes more tolerant so the margin will widen (low variance, high bias)
  - As C decreases, there is less budget for violations; the classifier becomes less tolerant so the margin will narrow (high variance, low bias)
  - No more than C observations can be on the wrong side of the hyperplane
Equation can also be in the form: minimize ½|β|² + C·Σεi over β0 and β, subject to εi ≥ 0 and yi(β0 + βᵀxi) ≥ 1 − εi for all i
- The C term here acts as a penalty parameter on the total error term (different from the C term mentioned before); the maximum margin classifier corresponds to C = ∞; C close to 0 → wide soft margin; large C → close to the hard-margin formulation
- Similar to the maximal margin classifier, not all observations directly affect the orientation of the hyperplane
- Only observations that either fall on the margin or violate the margin affect the
solution to the optimization problem
- These observations are called the support vectors
- Observations that fall on the correct side of the margin have no direct bearing on the
ultimate classifier
- If these observations were shifted around in the feature space, the hyperplane
would remain unchanged (as long as they did not end up crossing over the
margin)
- Ultimately the support vector classifier is more robust than the maximal margin classifier
Limitations: the support vector classifier assumes that the boundary between classes is
roughly linear; however, this process fails when the boundary is nonlinear
Support Vector Machines
Feature Expansion: similar to adding polynomial terms in the linear regression setting, we can
address nonlinearity by considering the enlargement of the feature space of our original
dataset
- Implement functions of the predictors themselves by using higher-order polynomial functions of the predictors
- By fitting a support vector classifier in the enlarged feature space, the decision boundaries become nonlinear in the original feature space
- Suppose we only have X1 and X2 in the dataset; we can use X1, X2, X1², X2², and X1X2
- In the enlarged feature space, the decision boundary β0 + β1X1 + β2X2 + β3X1² + β4X2² + β5X1X2 = 0 is linear
- Implementing feature expansion in the support vector classifier can help solve problems with data that are not linearly separable
Support Vector Machines: extension of the support vector classifier that results from enlarging
the feature space using kernels
- Kernels are more efficient ways of implementing feature expansion from a computational
standpoint
- The solution to the support vector classifier problem can be reduced in such a way that
only involves the inner products of the observations instead of the actual observation
themselves
- Linear support vector classifier can be represented as: f(x) = β0 + Σi=1..n αi⟨x, xi⟩
- There are n parameters, one α per training observation
- In order to estimate the α parameters and the intercept β0, all that is needed are the inner products between all pairs of observations
- To evaluate the function, we need to compute the inner product between the new observation and each of the training observations
- Computationally, it turns out that the αi are nonzero only for the support vectors; for every observation that is not a support vector, αi = 0, so most of the terms in the original equation disappear
- Thus, if S is the collection of indices of the support vectors, we have: f(x) = β0 + Σi∈S αi⟨x, xi⟩
The formulation of the problem typically involves far fewer calculations than the original
optimization described for support vector classifiers
- How can we gain more flexibility with the support vector machine?
Kernels: a function that quantifies the similarity of two observations; just as we have seen there
are many measures of similarity in terms of distance, so too there are many different types of
kernels;
- Linear: the idea of the inner product used to improve upon calculations in the support
vector classifier; this is a linear kernel because the resulting support vector classifier is
linear in features
-
Polynomial: an extension of the linear kernel is the polynomial kernel of degree d; using
a polynomial kernel with d > 1 is analogous to fitting a support vector classifier using
feature expansion based on polynomials of degree d rather than the original feature
space; the decision boundary appears to be more flexible;
-
Radial: suppose we have a test observation; if it is far from a training observation, the Euclidean distance will be large and the value of the radial kernel will be small; if it is close to a training observation, the Euclidean distance will be small and the value of the radial kernel will be large; it exhibits local behavior since only nearby training observations have a substantial effect on the class label of a test observation
  - γ: a positive constant and another tuning parameter
when the support vector classifier is combined with a non-linear kernel, the resulting
classifier is called a support vector machine
Polynomial and radial kernels: K(x, x′) = (1 + ⟨x, x′⟩)^d and K(x, x′) = exp(−γ‖x − x′‖²), respectively
- To implement any kernel in a support vector classifier, we replace the inner product with the kernel in the classifier: f(x) = β0 + Σi∈S αi·K(x, xi)
Kernels are much more computationally efficient because we only need to compute
the kernel for distinct pairs of observations in our dataset
- Don’t need to work in the enlarged feature space (impossible in radial kernel
since the space is infinite)
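A minimal sketch of the two kernels and the kernel form of the decision function; the names alphas, support_x, and beta0 below are hypothetical placeholders standing in for values that would come from a fitted classifier:

import numpy as np

def poly_kernel(x1, x2, d=2, gamma=1.0, r=1.0):
    # polynomial kernel: (gamma * <x1, x2> + r) ** d
    return (gamma * np.dot(x1, x2) + r) ** d

def rbf_kernel(x1, x2, gamma=0.5):
    # radial (RBF) kernel: exp(-gamma * ||x1 - x2||^2)
    return np.exp(-gamma * np.sum((x1 - x2) ** 2))

def decision_function(x_new, support_x, alphas, beta0, kernel):
    # f(x) = beta0 + sum over support vectors of alpha_i * K(x, x_i)
    return beta0 + sum(a * kernel(x_new, xi) for a, xi in zip(alphas, support_x))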
Multi-Class Classification
So far we have only considered support vector machines with respect to two categories; SVMs are limited to binary classification, so we need roundabout ways to predict an output with more than two categories
1-vs-1 Classification:
- Construct a support vector machine for each pair of categories
- For each classifier, record the prediction for each observation
- Have the classifiers vote on the prediction for each observation
1-vs-All Classification:
- Construct a support vector machine for each individual category against all other
categories combined
- Assign the observation to the classifier with the largest function value
Pros and Cons
Pros:
- Not hindered by high dimensions
Cons:
- Slow
- Can only do binary classification
- To do multi-class classification, need to do 1-vs-1 or 1-vs-all classification
R
R Library: e1071
svm(formula, data, subset, kernel, cost, gamma): fit an svm classifier to the data; the formula is
in the form of y ~ x1 + x2 + …; can use a period (.) to select all variables (excluding y) and can
use subtraction (-) to exclude variables; kernel can be ‘linear’, ‘polynomial’, ‘radial basis’ or
‘sigmoid’; cost (default is 1) is the tuning parameter; if cost is high you get a maximum margin
classifier; gamma is an additional tuning parameter for radial kernels (default 1/p); can do
multiclass classification
- plot(svmobj, data): plot the svm classifier
- svmobj$index: find the indices of the support vectors
predict(svmobj, test): predict results on test data
- table(preds, actual): use confusion matrix to calculate error rate
tune(method, formula, data, kernel, ranges): parameter tuning of functions using grid search; set the method to svm for SVMs; ranges specifies the values to search for each parameter as a named list of vectors of possible values; for svm, tune cost and gamma (the latter only for radial kernels); ex. ranges = list(cost = 10^(seq(-1, 1.5, length = 20)), gamma = 10^(seq(-2, 1, length = 20)))
- summary(tuneobj): inspect cv output
- tuneobj$performances$cost: vector of cost values tested
- tuneobj$performances$error: vector of error values tested
- tuneobj$best.model: model with best result; should be used for predictions
- tuneobj$best.model$cost: can be used to determine the best cost value
R Library: rgl
plot3d(x,y,z): plot in 3d; useful for plotting the cv results of radial kernel svm (cost vs gamma vs
error)
Python
Useful function to plot SVM
def plotModel(model, x, y, label):
    '''
    model: a fitted model
    x, y: two variables, should be arrays
    label: true label
    '''
    margin = 0.5
    x_min = x.min() - margin
    x_max = x.max() + margin
    y_min = y.min() - margin
    y_max = y.max() + margin
    import numpy as np
    import matplotlib.pyplot as pl
    from matplotlib import colors
    colDict = {'red': [(0, 1, 1), (1, 0.7, 0.7)],
               'green': [(0, 1, 0.5), (1, 0.7, 0.7)],
               'blue': [(0, 1, 0.5), (1, 1, 1)]}
    cmap = colors.LinearSegmentedColormap('red_blue_classes', colDict)
    pl.cm.register_cmap(cmap=cmap)
    nx, ny = 200, 200
    xx, yy = np.meshgrid(np.linspace(x_min, x_max, nx),
                         np.linspace(y_min, y_max, ny))
    Z = model.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    ## plot colormap
    pl.pcolormesh(xx, yy, Z, cmap='red_blue_classes')
    ## plot boundaries
    pl.contour(xx, yy, Z, [0.5], linewidths=1., colors='k')
    pl.contour(xx, yy, Z, [1], linewidths=1., colors='k')
    ## plot scatters and true labels
    pl.scatter(x, y, c=label)
    pl.xlim(x_min, x_max)
    pl.ylim(y_min, y_max)
    ## if it's an SVM model
    try:
        # if it's an SVC, plot the support vectors
        index = model.support_
        pl.scatter(x[index], y[index], c=label[index], s=100, alpha=0.5)
    except:
        pass
Python: svm from sklearn
svm_model = svm.SVC(...): initialize svm classifier; save to a variable for convenience
- Kernels:
- linear: ⟨x1, x2⟩.
- polynomial: (γ⟨x1, x2⟩ + r)^d. d is specified by the argument degree, r by coef0. (if degree is 1 → turns into a linear kernel)
- rbf: exp(−γ‖x1 − x2‖²). γ is specified by the argument gamma and must be greater than 0. (radial kernel)
- sigmoid: tanh(γ⟨x1, x2⟩ + r), where r is specified by coef0.
- The linear kernel is the original feature space. The polynomial kernel is equivalent to the linear kernel when d = 1, γ = 1, and r = 0.
Arguments:
- kernel: Specifies the kernel type to be used in the algorithm. It must be one of
‘linear’, ‘poly’, ‘rbf’, ‘sigmoid’, ‘precomputed’ or a callable. If none is given, ‘rbf’ will
be used. If a callable is given, it is used to precompute the kernel matrix.
-
C: Penalty parameter of the error term. C = 1 by default. Large C → closer to the maximum margin classifier
-
degree: Degree of the polynomial kernel function (‘poly’). Ignored by all other
kernels.
gamma: only for radial kernel; ignored by all other kernels
Methods:
- fit: Fit the SVM model according to the given training data.
-
-
- score: Return the mean accuracy on the given test data and labels.
- predict: Perform classification on samples in X.
- set_params: Set the parameters of this estimator.
- get_params: Get the parameters of this estimator.
Attributes:
- support_: return the index of the support vectors.
- n_support_: return the number of support vectors.
- support_vectors_: return the value of support vectors.
svm_model.set_params(params): set arguments for the model
svm_model.fit(x, y): fit the data
svm_model.score(x, y): return accuracy
svm_model.n_support_: number of support vectors
svm_model.support_: index of support vectors
svm_model.support_vectors_: values of support vectors
svm_model.predict(test): predict values
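A short end-to-end sketch of the calls listed above on a hypothetical toy dataset (the data values are made up for illustration):

import numpy as np
from sklearn import svm

# toy data: two slightly overlapping point clouds
rng = np.random.RandomState(0)
x = np.vstack([rng.randn(50, 2) - 1, rng.randn(50, 2) + 1])
y = np.array([0] * 50 + [1] * 50)

svm_model = svm.SVC(kernel='rbf', C=1.0, gamma=0.5)  # initialize the classifier
svm_model.fit(x, y)                                  # fit the data
print(svm_model.score(x, y))                         # training accuracy
print(svm_model.n_support_)                          # number of support vectors per class
preds = svm_model.predict(x[:5])                     # predict on (here, the first five) observations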
Python: sklearn.grid_search
gs will represent sklearn.grid_search for all the following examples
gsmodel = gs.GridSearchCV(model, params, scoring, cv): initializes a grid search for
supervised learning models; the model is the initialized model; params is a list of parameter
combinations to search; scoring is the evaluation method to determine the best parameters; cv
is the number of folds; like any other model, it should be saved to variable
- gsmodel.fit(x,y): fit the model
- SVM: for SVM the model should be svm_model; the scoring should be ‘accuracy’; the
list of params can be a list of dictionaries with the keys being different arguments
(‘kernel’, ‘C’, etc.) and the values as a list of values to search through
- gsmodel.grid_scores_: returns all the scores of the grid search
- gsmodel.best_params_: returns the best parameters from the grid search which can be
saved and inputted into a model later on
- gsmodel.best_score_: best score
- gsmodel.score(x,y): score the performance on a set of data
- if this is scored on the original data used for the grid search, the value might not match the best score; the best score is the average result across the cv folds. If the data is ordered when the folds are split, the model may appear to predict better simply because the folds aren't randomized; this can inflate the cross-validation scores but will probably lead to more error when testing on randomized data
Python: sklearn.cross_validation
cv will represent sklearn.cross_validation for all the following examples; this module can alter
the process for cross validations
cv.StratifiedKFold(y, n): selects the folds in a stratified pattern; y is the target variable and n is the number of folds; the resulting object can be passed to the CV grid search function in sklearn.grid_search (set its cv argument equal to this object)
cv.train_test_split(x, y, random_state, test_size): splits a data set into a training set and testing
set; the random_state is like a seed in R; makes the test results reproducible; test_size
determines the ratio of test data to the original data;
- returns four values in the order: X_train, X_test, Y_train, Y_test
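A combined sketch of grid search plus the cross-validation helpers, assuming the older sklearn module layout used in these notes (newer versions expose the same tools in sklearn.model_selection) and assuming a feature matrix x and label vector y already exist:

from sklearn import svm
from sklearn import grid_search as gs          # sklearn.model_selection in newer versions
from sklearn import cross_validation as cv     # sklearn.model_selection in newer versions

# hold out a reproducible 20% test split
x_train, x_test, y_train, y_test = cv.train_test_split(x, y, random_state=0, test_size=0.2)

params = [{'kernel': ['linear'], 'C': [0.1, 1, 10]},
          {'kernel': ['rbf'], 'C': [0.1, 1, 10], 'gamma': [0.01, 0.1, 1]}]
folds = cv.StratifiedKFold(y_train, 5)         # stratified 5-fold split on the training target
gsmodel = gs.GridSearchCV(svm.SVC(), params, scoring='accuracy', cv=folds)
gsmodel.fit(x_train, y_train)
print(gsmodel.best_params_, gsmodel.best_score_)
print(gsmodel.score(x_test, y_test))           # evaluate on the held-out, randomized test split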
Discriminant Analysis
Naive Bayes
Association Rule Mining
Natural Language Processing
Intersection of computer science, artificial intelligence and linguistics. The goal is for computers
to process or “understand” natural language in order to perform tasks that are useful
Why is NLP so difficult?
- Understanding context
- Understanding “common sense” and “common knowledge”
- Understanding named entities: variations in names and names that can refer to different
things
- Understanding idioms
- Understanding ambiguity
Applications:
- Spell checking, keyword search
- Extracting information from websites
- Classifying reading level, positive/negative sentiment of longer documents
- Machine translation: Siri, Google Now, Cortana, Alexa
Deep Learning:
- Provides a flexible, learnable framework for representing visual and linguistic information
- Deep learning can learn unsupervised (from raw text) and supervised (with specific
labels like positive/negative)
- Benefits more from a lot of data
Vectorizing Text: TF-IDF
tf-idf: term frequency - inverse document frequency; the value increases proportionally to the
number of times a word appears in the document, but is offset by the frequency of the word in
the corpus (group or collection of documents); this helps adjust for the fact that some words
appear more frequently in general
Pros:
- Have some basic metric to extract the most descriptive terms in a document
- Can easily compute the similarity between 2 documents
Cons:
- Tf-idf is based on the bag-of-words (BoW) model, therefore it does not capture position
in text, semantics, co-occurrences in different documents
- Only useful as a lexical level feature
- Cannot capture semantics (as compared to topic models, word embeddings)
Vectorizing Text: Co-occurrence Matrix
Co-occurrence matrix: matrix/table that keeps count of how often words are placed next to
each other
Co-occurrence vectors: vectors that keep count of how often words are placed next to a specified word
- Cons:
- Vectors increase in size with vocabulary (more words, more dimensions)
- Very high dimensional: requires a lot of storage
- Subsequent classification models have sparsity issues
- People want to store the most important information in a fixed, small number of
dimensions: a dense vector
- How to reduce dimensionality?
Reducing dimensions of co-occurrence matrix:
- SVD: singular value decomposition; matrix factorization method to reduce dimensions; can be used to find the principal components of a covariance matrix
- Semantic Patterns: (figure of word-vector patterns omitted)
- Problems with SVD:
  - Computational cost scales quadratically for an m×n matrix, which makes it bad for millions of words or documents
  - Hard to incorporate new words or documents
  - Function words (the, he, has) are too frequent and have too much syntactic impact; possible fixes:
    - Cap the counts: min(X, t) with t ≈ 100
    - Ignore them altogether
    - Use ramped windows that count closer words more
Word2Vec: instead of capturing co-occurrence counts directly, predict surrounding words in a
window of length m of every word using a skip-gram neural network; the output are vectors
with interesting relationships
- Skip-gram neural network: simple neural network with only one hidden layer; train the
neural network to perform a certain task to learn the weights of the hidden layer which
are actually the word vectors
- The task is to determine the probabilities of every word in our vocabulary being
within the window of our target word
- http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/
-
Probability Function: p(o | c) = exp(uoᵀvc) / Σw exp(uwᵀvc)
  - o is the outside (output) word id, c is the center word id, u and v are the outside and center vectors of o and c
  - Every word has two vectors; one is the center vector and the other one is the outside vector
Objective function: maximize the log probability of any context word given the current center word, J(θ) = (1/T)·Σt Σ(−m ≤ j ≤ m, j ≠ 0) log p(wt+j | wt)
  - θ stands for the center and outside vectors
  - Can use gradient descent to optimize the cost function
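A tiny numpy sketch of the skip-gram probability above, using a hypothetical 5-word vocabulary with random center (V) and outside (U) vectors:

import numpy as np

# U: outside-word vectors, V: center-word vectors (one row per word id)
rng = np.random.RandomState(1)
U = rng.randn(5, 3)
V = rng.randn(5, 3)

def p_outside_given_center(o, c):
    # softmax over the vocabulary: exp(u_o . v_c) / sum_w exp(u_w . v_c)
    scores = U.dot(V[c])
    exp_scores = np.exp(scores - scores.max())   # subtract the max for numerical stability
    return exp_scores[o] / exp_scores.sum()

print(p_outside_given_center(o=2, c=0))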
Convolutional Neural Networks:
- First layer embeds words into low-dimensional vectors
- Next layer performs convolutions over the embedded word vectors using multiple filter sizes
- Max-pool the result of the convolutional layer into a long feature vector, add dropout regularization, and classify the result using a softmax layer
Python
Python: TfidfVectorizer from sklearn.feature_extraction.text
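A minimal usage sketch, assuming a hypothetical three-document mini-corpus:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = ["the cat sat on the mat",
        "the dog sat on the log",
        "cats and dogs are pets"]              # made-up mini-corpus

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(docs)          # sparse (n_documents x n_terms) matrix
print(tfidf.shape)
print(cosine_similarity(tfidf[0], tfidf[1]))    # similarity between the first two documents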
Neural Networks
Neural Networks: supervised methodology that models the relationship between a set of input
signals and an output signal in order to perform either classification or regression; they are very
complex; the underlying models are based on composite mathematical systems that often render accurate results but are nearly impossible to interpret; often referred to as a black box process because the mechanism seems to be hidden from view
Biological Neurons: underlying model behind artificial neural networks;
- how a biological brain responds to stimuli from sensory inputs:
- Human brain uses a complicated network of interconnected cells called neurons
in order to parallel-process input
- Signals are received by the neuron’s dendrites through a biochemical process
that weights impulses based on their relative importance
- Should a threshold be reached by accumulating impulses, the neuron is said to
fire; its impulse is then passed to neighboring neurons
Perceptron
Perceptron: most basic type of artificial neuron
- Take in various binary inputs x1, x2, …, xn, and produces a single binary output
- Each input has a corresponding weight w1, w2, …, wn that expresses the importance of the input in determining the output
- How is output determined?
- The output is also binary; determining whether the output should be 0 or 1 is simple
  - Each input xi is considered alongside its corresponding weight wi
  - The sum of the various input/weight combinations is calculated
  - Should the grand sum be greater than a certain threshold, the perceptron returns a 1; otherwise, the perceptron returns a 0
Step-function: output = 0 if Σj wjxj ≤ threshold, and output = 1 if Σj wjxj > threshold
- Essentially the perceptron is a simple way of making a decision by weighing the evidence at hand and comparing it to a threshold that represents a willingness to make the decision
Better step function: output = 0 if w·x + b ≤ 0, and output = 1 if w·x + b > 0
- Process is similar to before but with two small improvements:
  - The summation of weights is represented by a dot product
  - The threshold has been moved to the other side of the equation; this new term is referred to as a bias (bias = −threshold)
Modeling logical gates: perceptrons can be the basic models for various logical gates
- Logical gates: basic logical decision-making tools given binary inputs
- Can model the logical NAND gate (not-and)
- NAND gate: is universal for computation; any computation can be built up out of
multiple NAND gates
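A small sketch of a perceptron acting as a NAND gate; the weights (-2, -2) and bias 3 are one common choice (an assumption, not taken from the notes) that reproduces the NAND truth table:

def perceptron(x, w, b):
    # fire (output 1) when the weighted sum plus bias is positive, otherwise output 0
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0

w, b = [-2, -2], 3
for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, perceptron(x, w, b))   # prints 1, 1, 1, 0 -- the NAND truth table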
More complex decisions: the single perceptron is an oversimplification of the decision making
process; to model more complex decisions, consider various perceptrons alongside one another
to create a network of neurons
- Complexity/design of the network is often referred to as its topology
- Each column of neurons describes a layer of learning; each layer passes its outputs to a
future layer for more abstract learning
- It can be shown that any bounded continuous function can be represented with a neural network (to an arbitrary degree of error ε) that has only one hidden layer
  - Consider each hidden node turning on at a particular point of space
  - Differences between these hidden values can hone in on a particular region
  - To approximate any function, choose weights that reproduce desired function values in specific regions
  - Essentially we are discretizing the function into arbitrarily small regions
- How do we choose/tune the weights and biases automatically to match these logical/functional structures?
  - If we can do this, we can construct solutions to problems where manual construction would fail because of the sheer complexity of the underlying relationship
Sigmoid Neurons
Examine small changes: when we do not understand or know the weights and biases of the network of perceptrons, we would like to see how small changes affect the output in order to understand the relationships within the network
-
If a small change in a particular weight or bias caused only a small change in the
output, we could use this information to tune our network in order to receive a better,
more accurate output
- To have our network learn, we could implement the following process:
- Change a weight or bias
- Observe the changes in the output
- If the change made our prediction better, keep it and now manipulate a
different weight
- If the change made our prediction worse, forget it and try again
- Upon repeated steps, an accumulation of much smaller changes could induce a big
change in the output that ultimately leads to greater accuracy
Problems with perceptrons:
- Neural network learning relies on the fact that small changes in tuning parameters
induce small changes on the output but that is not the case with perceptrons
- Perceptrons are step functions that flip from off to on once a specific threshold (bias) value has been surpassed
- A small change in weights or bias for a particular perceptron could cause the
output of the perceptron to completely switch
- Like a domino effect, this single switch would pass along to future perceptrons in
the network and could end up completely changing the ultimate output
drastically
- There isn’t an easy way to gradually modify weights and biases to develop a robust
learning algorithm
Sigmoid Neuron: easiest and most common way of overcoming the sensitivity issue presented
by perceptron networks
- Has inputs x1, x2, …, xn and an overall output that takes on any values between 0 and 1
- Has corresponding weights w1, w2, …, wn that express the importance of the input in
determining the output
- Has an overall bias (threshold)
- Sigmoid function: σ(z) = 1 / (1 + e^(−z))
  - When it is combined with the inputs, weights and bias/threshold of a sigmoid neuron, the resulting equation for the output is: output = σ(w·x + b) = 1 / (1 + e^(−(w·x + b)))
  - If w·x + b is a large positive number then e^(−(w·x+b)) ≈ 0 and σ(w·x+b) ≈ 1
  - If w·x + b is a large negative number then e^(−(w·x+b)) ≈ ∞ and σ(w·x+b) ≈ 0
  - The sigmoid neuron is very similar to the perceptron; differences between them are only observable when w·x + b is of intermediate magnitude
- Graphically, the neuron determines the output as a logistic (S-shaped) curve
  - x-axis represents the sum of all the input/weight combinations and the bias, which is transformed by the sigmoid function
  - y-axis represents the tendency for a decision to be made; note that the neuron always fires, just now on a scale instead of all-or-nothing
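A brief numpy sketch of a sigmoid neuron; the weight, bias, and input values are arbitrary and only meant to show the saturation behavior described above:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_neuron(x, w, b):
    # output = sigmoid(w . x + b); a graded value in (0, 1) rather than a hard 0/1
    return sigmoid(np.dot(w, x) + b)

w, b = np.array([0.7, -1.2]), 0.5
print(sigmoid_neuron(np.array([1.0, 0.0]), w, b))    # intermediate w.x+b -> graded output
print(sigmoid_neuron(np.array([10.0, 0.0]), w, b))   # large positive w.x+b -> output near 1
print(sigmoid_neuron(np.array([-10.0, 0.0]), w, b))  # large negative w.x+b -> output near 0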
Why is the sigmoid neuron the solution to the problems the perceptron presents?
- It is a smooth function that is easily differentiable at every point; it also has
nice properties that will aid in simplifying future calculations. The perceptron is
not a smooth function and is not easily differentiable
- The sigmoid function has the following property that relates small changes in weights and biases to changes in the output: Δoutput ≈ Σi (∂output/∂wi)·Δwi + (∂output/∂b)·Δb
- Perceptrons do not have an easy way of quantifying this relationship because of their seemingly erratic and sensitive behavior
Network Topology
Input layer: raw data goes into this layer; neurons within this layer are called input neurons
Output layer: desired target comes out of this layer; neurons within this layer are called
output neurons
Multilayer Perceptrons (MLPs): networks with multiple layers; each neuron is made up of
sigmoid neurons instead of perceptrons
Hidden layer: the black box; neurons within this layer are called hidden neurons; neural
networks with more than one hidden layer are said to be deep
- Hidden layers represent features of the data; as the data passes from layer to layer, the features become increasingly complex
- Combination of various neuron outputs in the hidden layers intend to essentially feature
engineer aspects of your data
- Similar idea to breaking down a problem into smaller subproblems
- We can connect the idea of answering subquestions as neurons embedded within a
hidden layer of a neural network as follows
- Each of these subproblems can themselves be decomposed into even smaller subproblems
- The cumulative answers to these questions will help us determine the answer to the
first subproblem
- We can imagine that the hidden layers of a neural network are continually breaking
down harder questions by attempting to first answer multiple, easier to answer
questions
- Basically create a hierarchy that relies on the notion that the answers to a slew of easy
questions will ultimately help answer one massively difficult question; encode the
various low and high level features of our data in a hierarchical manner
Problems:
- Hidden features and their corresponding weights, biases, and connections are often not
intuitively conceivable
- No clear way to determine the best combinations given our dataset, especially for deep
hidden layers
Backpropagation Learning with Gradient Descent
A network topology needs to be trained with repetition and experience
- Weights are initially random because the network starts with no knowledge
- As neural networks process input data, connections between the various neurons can be
strengthened or weakened depending on how they ultimately seem to affect the output
- Errors that are initially made are back-propagated through the neural network and the
connections among neurons are changed in an effort to reduce this error
Backpropagation algorithm: use errors to change connections in an effort to reduce the same
error
- Extremely computationally expensive
- Highly accurate results
- Iterates through many cycles (epochs) of two processes:
- Forward phase: neurons are activated in sequence from the input layer, through
the hidden layers, and lastly the output layer
- Predicted values are recorded
- Backward phase: the network’s current output resulting from the forward phase
is compared to the target values in the training data
- Error is propagated backwards through the network from the output layer,
through the hidden layers, and back to the input layer in order to modify
the connections with the goal of reducing the error
- Repeat cycle until a certain stopping criterion is reached
- Connections between all of the neurons is very complicated
- How does the algorithm know the best way to modify the weights among connections?
- Gradient descent
Gradient Descent: method for optimizing parameters
- For neural networks:
- The derivative of each neuron’s activation function is used to identify the
gradient in the direction of each of the incoming weights
- Algorithm will attempt to change the weights in such a way that will result
in the greatest reduction of error
- Need a differentiable function; that’s why sigmoid neurons are better than
perceptrons
- Suppose you have a cost function C that is defined by a relationship among the
variables v1 and v2. We desire to minimize this function as much as possible
- Gradient descent will assess the gradient at a specific point in order to push v1 and v2 in the direction that will minimize C
-
Benefits:
- Why gradient descent? Why not grid?
- Curse of dimensionality is extremely harsh; increase in weights increases
the computations exponentially
- Notes on local minimums:
- What if our cost function doesn’t always go in the same direction? What if it’s possible that we reach a plateau or a local minimum? In other words, what happens when the problem is non-convex?
- We chose our error function the way that we did not only because it makes the
calculation of derivatives a bit cleaner, but also because we are exploiting the
often generally convex nature of quadratic equations
- Also, another possible way around this is to update our weights sequentially
instead of all at once (i.e., implementing stochastic gradient descent); in this
manner, we might be able to avoid local minimums by sequentially following
gradient in varying directions
Derivative:
- Describes how fast a function f changes instantaneously at the value x
- It determines whether the function f will increase or decrease if we increase the value of
x
- It informs whether x is higher or lower than an optimum value (maximum or minimum)
In the backpropagation algorithm:
- Chain rule: d/dx f(g(x)) = f′(g(x))·g′(x)
- Derivative of a sum is the same as the sum of the derivatives: d/dx Σi fi(x) = Σi d/dx fi(x)
- Derivative of a sum with respect to one element of the sum collapses: ∂/∂xj Σi xi = 1
- Derivative of the sigmoid function: σ′(x) = σ(x)·(1 − σ(x))
- Derivative of the sigmoid function with the chain rule: d/dx σ(f(x)) = σ(f(x))·(1 − σ(f(x)))·f′(x)
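A quick numeric check (a sketch, with an arbitrary test point z = 0.8) that the sigmoid derivative formula matches a finite-difference estimate:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = 0.8
analytic = sigmoid(z) * (1 - sigmoid(z))             # sigma'(z) = sigma(z) * (1 - sigma(z))
h = 1e-6
numeric = (sigmoid(z + h) - sigmoid(z - h)) / (2*h)  # central finite difference
print(analytic, numeric)                             # the two agree to several decimal places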
Backpropagation: Formally
(Figure: feed-forward network diagram omitted)
- Suppose the network above represents our neural network. For each layer in the network, we would like to determine:
  - The partial derivative of the error with respect to each neuron (i.e., should the neuron’s value be higher or lower?)
  - The partial derivative of the error with respect to each input weight (i.e., should the input weight’s value be higher or lower?)
- Error: E = ½·Σ(target − output)²; the choice to scale by ½ is somewhat arbitrary but it will make future computations simpler
- Differentiate the error E with respect to one of the neurons in the last hidden layer (first layer of backpropagation):
Another level deeper (second layer of backpropagation):
-
We begin to see this nesting nature; thus, we can express the partial derivative of a
deeper layer of backpropagation with that of a shallower layer:
-
Want to differentiate the error E with respect to one of the edges:
-
Again see this nesting nature; thus, we can express the partial derivative of a deeper
edge of backpropagation with that of a shallower layer’s node:
-
Once we have computed all of the gradients for each of the weights in our network, we
gain insight into how slight perturbations of the weights relate to the overall error.
Now, it is easy to decide whether we should increase or decrease each weight.
- Given a particular weight wi in our network and the gradient of the error with respect to that weight, we update wi as follows: wi ← wi − η·(∂E/∂wi)
Here:
- If the gradient is positive, we shift wi away from the increasing tendency
- If the gradient is negative, we shift wi towards the decreasing tendency
- η is a small positive number referred to as the learning rate
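A minimal sketch of this update rule applied to a single sigmoid neuron with squared error; the data point, initial weights, and learning rate are made-up values:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x, target = np.array([1.0, 0.5]), 1.0                # one training example and its target
w, b, eta = np.array([0.1, -0.2]), 0.0, 0.5          # eta is the learning rate

for epoch in range(100):
    out = sigmoid(np.dot(w, x) + b)                  # forward phase
    # E = 0.5 * (target - out)^2; the chain rule gives the gradients below
    dE_dout = -(target - out)
    dout_dz = out * (1 - out)                        # sigmoid derivative
    grad_w = dE_dout * dout_dz * x                   # dE/dw_i
    grad_b = dE_dout * dout_dz                       # dE/db
    w, b = w - eta * grad_w, b - eta * grad_b        # w_i <- w_i - eta * dE/dw_i

print(sigmoid(np.dot(w, x) + b))                     # output has moved toward the target of 1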
More Help
http://machinelearningmastery.com/tutorial-first-neural-network-python-keras/
http://www.bogotobogo.com/python/scikit-learn/Artificial-Neural-Network-ANN-1Introduction.php
Common Terminology: concepts of neural networks build upon ideas we have previously seen
in respect to other machine learning algorithms, particularly regression. While the vocabulary is
different, the terms generally point to similar concepts
Pros and Cons
Pros:
- Neural networks models can be extremely flexible; with sufficient data they can
effectively model curvatures, interactions, plateaus, step functions, etc.
- Standard regression assumptions (e.g., the true residuals are independent, normally
distributed, and have constant variance) are not required
- Outliers tend to have a limited influence in comparison to standard regression
approaches
Cons:
- The method depends on the availability of large datasets and is extremely
computationally expensive
- Model parameters are vastly uninterpretable
- It is easy to overfit or underfit the training data
- Diagnostic tests are not widely developed
Time Series Analysis
Other Optimizations
https://github.com/fmfn/BayesianOptimization
Python package that implements Bayesian Optimization.
Unsupervised Learning
Used to infer the properties directly without knowing the “correct” answers or the error for each
observation. Only a set of N observations with p features, no response variables. No direct
measure of success, “Learning without a teacher”. Unlabeled data is easier to obtain than
labeled data. No specific prediction goals, therefore more subjective. We are usually interested
in discovering the hidden pattern of the data.
Principal component analysis: often used for data visualization or data preprocessing for
supervised learning
Clustering: broad class of methods for grouping or segmenting a collection of objects into
distinct subsets (clusters)
Principal Component Analysis
Solves the problem of multicollinearity. Turns a set of possibly correlated variables into a set of
values of linearly uncorrelated variables.
Multicollinearity
Phenomenon in which two or more predictor variables in a multiple regression model are highly
correlated, meaning that one can be predicted from the others through linear formulae with
substantial degree of accuracy.
Issues:
- Regression coefficients of highly correlated variables might be inaccurate (high model
variance)
- Estimate of one variable’s impact on the dependent variable Y while controlling for the
others tend to be less precise
- Nearly collinear variables contain similar information about the dependent variable,
which may lead to overfitting
- Standard errors of the affected coefficients tend to be large
The Curse of dimensionality
Given a number of observations, additional dimensions spread the points out further and further
from one another. Sparsity becomes exponentially worse as the dimensionality of the data increases. There tends to be insufficient repetition in various regions of the high-dimensional
space. Less repetition makes inference more difficult. Difficult to replicate results and doesn’t
take into account regions that don’t have any observations at all.
Pros:
- Note: SVM takes advantage of the curse of dimensionality
Cons:
- Collecting data is expensive, both monetarily and temporally
- Too much complexity with higher-order data
- Redundant information (multicollinearity) in measured dimensions
PCA
A tool that finds a sequence of linear combinations of the variables to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components
Ideal input variables:
- Linearly uncorrelated
- Low-dimensional in the feature space
Motivation:
- Remove variables that provide little to no additional information (when all observations have similar values or the same value)
- Search, among all possible directions in the feature space (not just along the axes → infinite directions), for the direction along which the projection of the observations is most widely spread
First Loading Vector:
-
Direction on which the projection of the observations is more widely spread than the
projection on any other direction
- Being a direction (vector), it has as many components as the number of the features
- This characterizes the principal direction, which needs linear algebra to calculate
First Principal Component: keeping the information recorded in the first loading vector by all
observations
- Obtained by linear projection
- Find the directions on which the projection is most widely spread (vectors of
highest variance) using linear algebra
-
The projection can be described in two ways: a certain length away from the
origin along the principal direction or a vector in the original coordinate system
- There are N (number of samples) components for a principal component
- There are p (number of features) components for a principal direction (the loading
vector)
- The principal components live in the space of samples, while the principal directions live
in the space of features
Second Principal Component: the info stored in a data set is about the variation of the points across the whole sample set, but not all directions are created equal. The first principal component provides the most information but most likely doesn’t provide all of it. Need to find the next significant direction and continue until the majority of the information is captured
- First remove the data stored in the first PC
- Then find the new direction (orthogonal to the first principal direction) on which the projection of the observations is most widely spread
- Visually remove the effect of the first principal component by projecting the observations
on a plane perpendicular to the first PC
-
Find the direction on which the projection is most widely spread
- Since all the projected observations are now in the plane, the direction we find
would be automatically in the plane and is perpendicular to the first loading
vector
- This is the second loading vector (second principal direction). The projected
values of the observations to this direction is the second principal component
PCA Mathematically
First need to centralize the data at 0 by subtracting off the mean from each variable:
- Pragmatically: this allows the future mathematical processes to be easier
-
Conceptually: PCA is modeling the variances of the data → the mean doesn’t matter as
much; can always add the mean back in later if we desire to do a bit of back-construction
- Our data X is an n by p matrix with the average of each column is 0
Project the data onto any possible direction; a direction is represented by a unit vector û in linear algebra; the projection is Xû (one projected value xiᵀû per observation)
Need to find the direction on which the projection of the data is most widely spread: maximize Var(Xû) over all û with |û| = 1
The solution to the above is the first loading vector (first principal direction), denoted φ1; the projection of the data on the first loading vector, Z1 = Xφ1, is the first principal component
Once the first k-1 PCs have been found, the next one (if there is one) can be found inductively
- Remove the information about the first k-1 components from X (Xk denotes the resulting
matrix)
- With this matrix we solve the same optimization problem again: maximize Var(Xkû) over all û with |û| = 1
The solution φk is the kth loading vector, and the projection on this direction is called the kth principal component
Note: solving an optimization problem can be hard. In the setting of PCA, this is relatively easy.
The principal directions (loading vectors) are essentially the eigenvectors for the covariance
matrix of the data, arranged in the descending order of the eigenvalues they correspond to
Compute the covariance matrix Σ:
- Observe a unique property of convergence
Find the eigenvectors e of Σ: they yield the orthogonal directions of greatest variability (the principal directions)
- Solve the equation Σe = λe
- Compute the eigenvalues by finding the solutions to det(Σ − λI) = 0, then solve for the corresponding eigenvectors
- The principal directions (loading vectors) are the eigenvectors e
- The eigenvectors are ordered by the magnitude of the corresponding eigenvalues λ (magnitude of variance along the principal directions)
Determine how many principal components to use:
- Strike a balance between the total amount of variance that is captured by the principal
components and the number of components selected
- Use the first k principal components
Project the original data onto the chosen k principal components (a small numpy sketch of the whole procedure follows)
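A small numpy sketch of the eigendecomposition procedure above, on hypothetical correlated toy data:

import numpy as np

rng = np.random.RandomState(0)
X = rng.randn(100, 3) @ np.array([[2.0, 0.0, 0.0],
                                  [0.5, 1.0, 0.0],
                                  [0.0, 0.2, 0.3]])   # correlated toy data
Xc = X - X.mean(axis=0)                               # centralize the data at 0
Sigma = np.cov(Xc, rowvar=False)                      # p x p covariance matrix
eigvals, eigvecs = np.linalg.eigh(Sigma)              # eigh: for symmetric matrices
order = np.argsort(eigvals)[::-1]                     # sort by descending eigenvalue
eigvals, eigvecs = eigvals[order], eigvecs[:, order]
loadings = eigvecs                                    # columns = principal directions (loading vectors)
scores = Xc @ loadings                                # principal components (projections of the data)
print(eigvals / eigvals.sum())                        # proportion of variance explained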
Notes
- This process is continued to retain more and more information from the raw data
- Can’t have more principal components than original features; the max number of PCs is the number of features (technically max = min(n, p), but we assume p is less than n)
- PCs live in the space of samples
- Principal directions live in the space of features
- PCs are orthogonal due to the linear projections; the dimensions are reduced by 1 every time a PC is found and the information is projected onto the plane orthogonal to the current PC
- The variance of each principal component decreases: Var(Z1) ≥ Var(Z2) ≥ … ≥ Var(Zp)
- Principal components Z1, Z2, …, Zp are mutually uncorrelated
- The principal loading vectors φ1, φ2, …, φp are normalized and mutually perpendicular
- The variances of the data along the principal directions (eigenvectors) are the corresponding eigenvalues
Pros and Cons
Pros:
- transformed data that straddles only k carefully selected dimensions that preserve as
much original structure as possible
- Solution to multicollinearity and curse of dimensionality
- Don’t waste data by taking into account of all variables
- Reduce complexity
- Same data, new perspective
- Results are useful properties that can be proved using calculus and linear algebra
Cons:
- Interpreting data may become a little difficult because PCs are composed from a
combination of variables
- Need to centralize the raw data
R
R Library: psych
fa.parallel(x, n.obs, fa, n.iter): creates scree plots with parallel analyses for choosing k; x is a data frame or data matrix of scores; if the matrix is square, it is assumed to be a correlation matrix; otherwise, correlations (pairwise) will be found; n.obs is the number of observations; if using a data frame, n.obs does not need to be specified (default is null); fa determines which eigenvalues are displayed ('pc' for principal components, 'fa' for factor analysis, and 'both' for both; default is both); n.iter is the number of simulated analyses to perform (default 20)
principal(r, nfactors, rotate): Performs principal components analysis with optional rotation; r is
a correlation matrix; if raw data is used, correlations will be found using pairwise deletions for
missing values; nfactors is the number of components to extract (default 1); use fa.parallel() to
choose; rotate is the rotation/transformation of the solution; default is ‘varimax’ but should set to
‘none’
- principal()$scores: the principal component scores; view them with a scatterplot matrix like pairs(), or the regular plot() function if you’re only graphing the first two PCs
factor.plot(principalobj, labels): Visualizes the principal component loadings; principalobj is
the object obtained from the principal() function; labels is used to add variable name to the plot
(use the column names of the dataset; default is null
Python
Requires numpy (np), PCA from sklearn.decomposition, Axes3D from
mpl_toolkits.mplot3d, and matplotlib.pyplot (plt)
Helper functions for plotting
# Note: these helpers assume an existing 3D axes object named `ax`
# (e.g., ax = plt.figure().add_subplot(111, projection='3d')).
def rotate(array):
    rot = np.matrix([[1, 0, 0],
                     [0, np.sqrt(3)/2, -np.sqrt(1)/2],
                     [0, np.sqrt(1)/2, np.sqrt(3)/2]]).T
    return np.array(np.matrix(array) * rot)

def plot_vec(array, length, color='blue', alpha=1):
    ax.plot(*zip(-array[0]*length, array[0]*length),
            color=color,        # colour of the curve
            linewidth=2.4,      # thickness of the line
            # linestyle='--',   # available styles: -  --  -.  :
            alpha=alpha)
    # return array*length

def plot_plane(normal, color='blue', alpha=0.2, x_min=-1.5,
               x_max=2.5, y_min=-2.5, y_max=1.5):
    surf_x, surf_y = np.meshgrid([x_min] + range(int(np.floor(x_min)+1), 0) +
                                 range(int(np.floor(x_max))) + [x_max],
                                 [y_min] + range(int(np.floor(y_min)+1), 0) +
                                 range(int(np.floor(y_max))) + [y_max])
    surf_z = (-normal[0,0]*surf_x - normal[0,1]*surf_y - 0.5)*1./normal[0,2]
    ax.plot_surface(surf_x, surf_y, surf_z, color=color, alpha=0.1)

def project2vec(data, vec, id_=0, color='green', along=False):
    pp = data[[id_]]
    proj = (np.sum(vec*pp)*vec)
    ax.scatter(*(proj.ravel()), c=color, s=16)
    ax.plot(*(zip(pp[0], proj[0])),
            color=color,        # colour of the curve
            linewidth=1.4,      # thickness of the line
            # linestyle='--',   # available styles: -  --  -.  :
            alpha=0.3)
    if along:
        ax.plot(*(zip(np.array([0, 0, 0]), proj[0])),
                color='Dark' + color,  # colour of the curve
                linewidth=1.4,         # thickness of the line
                # linestyle='--',      # available styles: -  --  -.  :
                alpha=1)
    return np.sum(vec*pp)

def project2plane(data, normal, id_=0, color='green', shoot=False):
    pp = data[[id_]]
    proj = pp - np.sum((pp*normal))*normal
    ax.scatter(*(proj.ravel()), c=color, s=16)
    if shoot:
        ax.plot(*(zip(pp[0], proj[0])),
                color=color,        # colour of the curve
                linewidth=1.4,      # thickness of the line
                # linestyle='--',   # available styles: -  --  -.  :
                alpha=0.5)
    return pp - np.sum(normal*pp)*normal

def plot_oigin():
    ax.scatter(0, 0, 0, marker='o', s=26, c="black", alpha=1)
def plotModel(model, x, y, label):
    '''
    model: a fitted model
    x, y: two variables, should be arrays
    label: true label
    '''
    margin = 0.5
    x_min = x.min() - margin
    x_max = x.max() + margin
    y_min = y.min() - margin
    y_max = y.max() + margin
    import matplotlib.pyplot as plt
    from matplotlib import colors
    colDict = {'red': [(0, 1, 1), (1, 0.7, 0.7)],
               'green': [(0, 1, 0.5), (1, 0.7, 0.7)],
               'blue': [(0, 1, 0.5), (1, 1, 1)]}
    cmap = colors.LinearSegmentedColormap('red_blue_classes', colDict)
    plt.cm.register_cmap(cmap=cmap)
    nx, ny = 200, 200
    xx, yy = np.meshgrid(np.linspace(x_min, x_max, nx),
                         np.linspace(y_min, y_max, ny))
    Z = model.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    ## plot colormap
    plt.pcolormesh(xx, yy, Z, cmap='red_blue_classes')
    ## plot boundaries
    plt.contour(xx, yy, Z, [0.5], linewidths=1., colors='k')
    plt.contour(xx, yy, Z, [1], linewidths=1., colors='k')
    ## plot scatters and true labels
    plt.scatter(x, y, c=label)
    plt.xlim(x_min, x_max)
    plt.ylim(y_min, y_max)
    ## if it's an SVM model
    try:
        # if it's an SVC, plot the support vectors
        index = model.support_
        plt.scatter(x[index], y[index], c=label[index], s=100, alpha=0.5)
    except:
        pass
Python: PCA from sklearn.decomposition
pca = PCA(): initialize PCA; must save it to a variable (for convenience); if it is saved to a
variable, you can easily return properties of the cluster without repeatedly calculating the fit; all
properties and functions will be called on using pca
- Arguments:
-
n_components: The number of components to keep. In default it is all,
min(n_samples, n_features). If n_components is less than 1, it is interpreted as
the percentage of variance. PCA model will find the suitable number of
components to explain ≥ n_components percentage of variance.
-
-
-
-
whiten: When True (False by default), the components_ vectors are multiplied by the square root of n_samples and then divided by the singular values to ensure uncorrelated outputs with unit component-wise variances.
Attributes:
- components_: Components with maximum variance.
- explained_variance_ratio_: Percentage of variance explained by each of the
selected components.
- mean_: The average of each feature.
Methods:
- fit: Fit the model with X.
- fit_transform: Fit the model with X and apply the dimensionality reduction on X.
- inverse_transform: Transform data back to its original space.
- get_covariance: Compute data covariance with the generative model.
- get_params: Get parameters for this estimator.
- set_params: Set the parameters of this estimator.
- transform: Apply the dimensionality reduction on X.
pca.set_params(params): change parameters; takes in the arguments in PCA()
function; usually used to set n_components; mutating
-
pca.fit(x): find the principal components; x is the data; every time you fit the data, you
remove all the information from the previous fit; mutating
- After the model has been fitted, you can get the properties of the clusters by
calling on PCA() attributes
- pca.components_: returns the components
- pca.explained_variance_ratio_: Percentage of variance explained by each of the
selected components.
- pca.mean_: The average of each original feature.
- pca.transform(data): apply pca to the data set
- Manual transform with:
- np.dot(data - pca.mean_, pca.components_.T)
- pca.inverse_transform(fitteddata): used to transform the PCs back into the
original space; not an accurate inverse transformation because you first reduced
the dimensions so you are missing some data; useful for image compression
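A short usage sketch of the calls above on a hypothetical random dataset:

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
data = rng.randn(200, 5)                       # made-up 5-feature dataset

pca = PCA(n_components=0.9)                    # keep enough PCs to explain >= 90% of variance
scores = pca.fit_transform(data)               # fit and project in one step
print(pca.n_components_)                       # how many components were kept
print(pca.explained_variance_ratio_)
reconstructed = pca.inverse_transform(scores)  # approximate reconstruction in the original space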
Clustering
Unsupervised task that does not aim to specifically predict a numeric output or a class label.
Does aim to uncover underlying structure of the data and see what pattern exists in the data.
Aim to group together observations that are similar while separating observations that are
dissimilar. Cluster analysis attempts to explore possible subpopulations that exist within your
data. Cluster analysis tries to answer exploratory questions.
Typical questions that cluster analysis attempts to answer are:
- Approximately how many subgroups exist in the data?
- Approximately what are the sizes of the subgroups in the data?
- What commonalities exist among members in similar subgroups?
- Are there smaller subgroups that can further segment current subgroups?
- Are there any outlying observations?
K-Means
With the K-means clustering algorithm, we aim to split up our observations into predetermined
number of clusters
- Must specify the number of clusters K in advance
- These clusters will be distinct and non-overlapping
The points of each of the clusters are determined to be similar to a specific centroid value:
- The centroid of a cluster represents the average observation of a given cluster; it is a
single theoretical observation that represents the prototypical member that exists
within the cluster
- Each observation will be assigned to exactly one of the k clusters depending on where
the observation falls in space in respect to the cluster centroid locations
What makes a good clustering solution? We desire each point in a specific cluster to be
near:
- The centroid of that cluster
- All other points within the same cluster
Mathematically, we desire the within-cluster variation to be as small as possible
Procedure
Finding the global minimum of the optimization function is very difficult and computationally expensive. If we checked all possible clustering assignments, we would have to calculate the within-cluster variations for roughly K^n different solutions
In practice, most K-means packages perform the following algorithm, also known as Lloyd's algorithm in computer science circles:
- Initialize: place K centroids at random locations in the feature space
- Assign: each observation to the cluster whose centroid is closest by some distance
measure (Euclidean)
- Recalculate: recalculate the cluster centroids
- The kth cluster centroid is the vector of the p variable averages for all
observations in the kth cluster
- Repeat: repeat assignment and recalculation steps
- Halt: stop when the cluster assignments no longer change
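A minimal numpy sketch of Lloyd's algorithm as described above; for simplicity it initializes centroids at randomly chosen observations and does not handle the (rare) empty-cluster case:

import numpy as np

def kmeans(X, K, n_iter=100, seed=0):
    rng = np.random.RandomState(seed)
    centroids = X[rng.choice(len(X), K, replace=False)]           # initialize the K centroids
    for _ in range(n_iter):
        # assign each observation to the cluster whose centroid is closest (Euclidean)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # recalculate each centroid as the mean of its assigned observations
        new_centroids = np.array([X[labels == k].mean(axis=0) for k in range(K)])
        if np.allclose(new_centroids, centroids):                 # halt when nothing changes
            break
        centroids = new_centroids
    return labels, centroids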
The K-means procedure always converges:
- If you run the algorithm from a fixed initial assignment, it will reach a stable endpoint
where the clustering solution will no longer change through the iterations
Unfortunately, the guaranteed convergence is to a local minimum
- Thus if we begin the K-means algorithm with a different initial configuration, it is possible
that convergence will find different centroids and therefore ultimately assigning different
cluster memberships
What can we do to get around this?
- Run the K-means procedure several times and pick the clustering solution that yields the
smallest aggregate within-cluster variance
How to choose K?
- Need to know the answer prior to running the algorithm
- Can we check many possible values of K and choose the K that yields the lowest within-cluster variance? NO
  - As K increases, the overall within-cluster variance will continue to decrease
  - The more centroids you have, the closer all points will be to one of those centroids
  - If every data point were its own centroid (K = n), the within-cluster variance would be zero
- Use a scree plot (elbow graph) to visually inspect the data: plot the within-cluster variance as a function of the number of clusters to create a segmented curve (a sketch follows this list)
  - The within-cluster variance will necessarily decrease as we increase the number of clusters, but not uniformly
  - The within-cluster variance tends to decrease quickly at first, but then begins to taper off
  - The task reduces to finding the point (the “elbow”) where the within-cluster variance no longer decreases dramatically
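A minimal Python sketch of this elbow inspection, assuming scikit-learn and matplotlib are available; the data matrix X and the range of K values are placeholders:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

X = np.random.rand(300, 2)   # placeholder data; substitute your own matrix

ks = range(1, 11)
wss = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in ks]

plt.plot(ks, wss, marker="o")   # look for the "elbow" where the curve stops dropping sharply
plt.xlabel("Number of Clusters K")
plt.ylabel("Total Within-Cluster Variance")
plt.show()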
K-means Algorithm Mathematically
Suppose C1,C2, …, CK denote the various sets containing the indices of the observations in the
respective clusters. Then, under the k-means clustering algorithm, the following must be true:
- C1 ∪ C2 ∪ … ∪ CK = {1, 2, …, n}
  - Each observation belongs to at least one of the K clusters
- Ck ∩ Ck′ = ∅ for every pair k ≠ k′
  - The clusters are distinct and non-overlapping; there does not exist an observation that belongs to more than one cluster
It follows that each observation must fall into exactly one cluster
What technique does the K-means algorithm use to create these clusters?
Suppose we use the Euclidean distance. Then the within-cluster variation for the kth cluster is defined as:
W(Ck) = (1 / |Ck|) Σ_{i, i′ ∈ Ck} Σ_{j = 1}^{p} (x_ij − x_i′j)²
Here:
- |Ck| denotes the total number of observations in cluster k
- i and i′ denote indices of observations in cluster Ck
- p is the number of variables/features in our dataset
In other words, the within-cluster variation for the kth cluster is the sum of all of the pairwise squared Euclidean distances between the observations in the kth cluster, divided by the total number of observations in the kth cluster
Since the within-cluster variation is a measure of the amount by which the observations in a specific cluster differ from one another, we want to minimize this quantity W(Ck) over all clusters. We desire to partition the observations into K clusters such that the total within-cluster variation added together across all K clusters is as small as possible; the K-means optimization problem is:
minimize over C1, …, CK:  Σ_{k = 1}^{K} W(Ck)
Why does the K-means algorithm end up necessarily reducing the within-cluster
variances?
We can rewrite the pairwise variation as the variation around the component-wise means
(centroids). During the algorithm, if we had just fixed the:
- Centroids, then the observation reassignment step finds the closest centroid (and thus
reduces the within-cluster variances)
- Observation assignments, then the resulting sample cluster means minimize the sum of
squared distances (and thus reduces the within-cluster variances)
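A small numerical check of that identity (the pairwise form of the within-cluster variation equals twice the sum of squared distances to the centroid); the random data here is purely illustrative:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))   # one hypothetical cluster with n = 50 observations and p = 3 variables

# Pairwise form: all squared Euclidean distances, divided by the cluster size
diffs = X[:, None, :] - X[None, :, :]
pairwise = (diffs ** 2).sum() / len(X)

# Centroid form: twice the sum of squared distances to the cluster mean
centroid = X.mean(axis=0)
around_centroid = 2 * ((X - centroid) ** 2).sum()

print(np.isclose(pairwise, around_centroid))   # True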
Pros & Cons
Pros:
- Helps find underlying structure in a set of data
- Simple to understand (group by distance)
- Well-defined groups
Cons:
- Have to choose K and the starting points (sometimes hard to determine)
- A change in units can change the solution
- Points that are near each other (have a small Euclidean distance between them) are not guaranteed to be clustered together
  - It could be the case that a stable solution produces clusters that don’t necessarily group the closest points together
- K-means assumes that true clusters have a globular shape (i.e., a roughly spherical shape with a well-defined center)
  - When the data has non-globular or chain-like shapes, K-means may not perform well (e.g., separating the outline of a face from the features inside it)
- Perceived granularity: depending on how finely you look, the same data could plausibly be viewed as 2, 4, or 16 clusters
Hierarchical Clustering (Agglomerative Clustering)
Agglomerative Clustering: build a hierarchy of clustering structures (like a tree)
- At the bottom level, the extreme case is that each observation is partitioned into its own cluster (K = n)
- At each intermediate level, we recursively find the closest two clusters and fuse them together
- At the top level, the extreme case is that every observation is partitioned into the exact same cluster (K = 1)
Dendrogram: visualization of the hierarchical tree
- The lower down in the dendrogram a fusion occurs, the more similar the groups of
observations that have been fused are to each other
- The higher up in the dendrogram a fusion occurs, the more dissimilar the groups of observations that have been fused are to each other
- For any two observations we can inspect the dendrogram and find the point at which the
groups that contain those two observations are fused together to get an idea of their
dissimilarity
- Be careful to consider groups of points in the fusions within dendrograms, not
just individual points
Procedure
There are two strategies: bottom-up (agglomerative) and top-down (divisive); these notes focus on the bottom-up approach.
Bottom-up:
- Begin with n observations and a distance measure of all pairwise dissimilarities. At this step, treat each of the n observations as its own cluster
- For i = n, (n-1), …, 2:
- Evaluate all pairwise inter-cluster dissimilarities among the i clusters and fuse
together the pair of clusters that are the least dissimilar
- Note the dissimilarity between the recently fused cluster pair and mark that as
the associated height in the dendrogram
- Repeat the process, calculating the new pairwise inter-cluster dissimilarities
among the remaining (i-1) clusters
Need to choose a dissimilarity measure and a linkage measure (a short comparison of linkage methods follows this list)
- It is sufficient to use a distance metric (e.g., Euclidean distance) as the dissimilarity measure
- Linkage is a measure of the dissimilarity between two groups of points
  - Compute the pairwise dissimilarities between the observations in the two clusters
  - Complete Linkage: maximum inter-cluster dissimilarity
    - Record the largest of the dissimilarities listed between members of the two clusters as the overall inter-cluster dissimilarity
    - Sensitive to outliers, yet tends to identify clusters that are compact, somewhat spherical objects with relatively equivalent diameters
  - Single Linkage: minimal inter-cluster dissimilarity
    - Record the smallest of the dissimilarities listed between members of the two clusters as the overall inter-cluster dissimilarity
    - Not as sensitive to outliers, yet tends to identify clusters that have a chaining effect; these clusters often do not represent intuitive groups in the data, and many pairs of observations within them might be quite distant from one another
  - Average Linkage: mean inter-cluster dissimilarity
    - Record the average of the dissimilarities listed between members of the two clusters as the overall inter-cluster dissimilarity
    - Tends to strike a balance between the pros and cons of complete linkage and single linkage
  - Ward’s Linkage: minimum variance method
    - Fuse the pair of clusters whose merge produces the smallest increase in within-cluster variance
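To see these tendencies in practice, here is a short scipy sketch comparing linkage methods on made-up data (a long chain of points plus a compact blob); the exact cluster sizes you get may vary, but single linkage tends to keep the chain together (the chaining effect described above) while complete linkage tends to split it:

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Made-up data: a long chain of points plus a compact blob
chain = np.column_stack([np.linspace(0, 10, 30), np.zeros(30)])
blob = np.random.default_rng(0).normal(loc=[5, 5], scale=0.3, size=(30, 2))
X = np.vstack([chain, blob])

for method in ["single", "complete", "average", "ward"]:
    Z = linkage(X, method=method)                      # hierarchical clustering with this linkage
    labels = fcluster(Z, t=2, criterion="maxclust")    # cut the tree into 2 clusters
    print(method, np.bincount(labels)[1:])             # cluster sizes under each linkage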
Pros and Cons
Pros:
- Doesn’t require us to choose the number of clusters in advance; the dendrogram can be cut at any level afterwards
Cons:
- Need to choose a dissimilarity measure and a linkage method
- Change in units can change solutions
Notes
- Clustering is an unsupervised machine learning method; the main goal is to uncover structure among subsets of the data
  - The procedure is generally used more for data exploration than for predicting any outcomes
- In good clustering solutions, points in the same cluster should be more similar to each other than to points in other clusters
- The units by which each variable is measured matter; different unit measurements cause different distance calculations and thus change clustering solutions
  - Usually we desire a unit change in one dimension to correspond to the same unit change in another dimension; from that perspective, we should standardize our data prior to clustering (see the sketch after this list)
- The process of clustering is iterative and interactive; there is no one correct way to cluster your data
- Supervised methods generally have one solution to the optimization problems posed, whereas some clustering methods (e.g., K-means) aren’t deterministic
- Different clustering methods yield different results (e.g., hierarchical clustering with varied linkage methodologies); consider the output of different approaches
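A minimal Python sketch of standardizing before clustering, in line with the note above; the two columns and their scales are made up to show how one variable can dominate the distances:

import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

X = np.column_stack([np.random.rand(100) * 1000,   # e.g., a variable measured in the thousands
                     np.random.rand(100)])          # e.g., a variable on a 0-1 scale
# Without scaling, the first column dominates every Euclidean distance
X_scaled = StandardScaler().fit_transform(X)        # each column now has mean 0 and standard deviation 1

labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_scaled)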
R
scale(data): centers and scales the columns of your data (subtracts each column mean and divides by each column standard deviation)
kmeans(x, centers, nstart): perform K-means clustering on a data matrix; x is the data matrix;
centers is the number of centers or the set of initial cluster centers; nstart is the number of
random sets of centers to choose from (default 1)
- kmeans()$cluster: this gives you a vector of the clusters assigned to each observation;
can be used to visualize the clusters in the plot() function; plot points and set col =
kmeans()$cluster
- kmeans()$centers: obtain the coordinates of the cluster centers; can be added to the
plot using points() function
Use the following function to help determine the number of clusters when we do not have an
idea ahead of time:
wssplot = function(data, nc = 15, seed = 0) {
  # For K = 1, the within-cluster variance is just the total sum of squares
  wss = (nrow(data) - 1) * sum(apply(data, 2, var))
  for (i in 2:nc) {
    set.seed(seed)
    # Total within-cluster sum of squares for K = i (best of 100 random starts)
    wss[i] = sum(kmeans(data, centers = i, iter.max = 100,
                        nstart = 100)$withinss)
  }
  plot(1:nc, wss, type = "b",
       xlab = "Number of Clusters",
       ylab = "Within-Cluster Variance",
       main = "Scree Plot for the K-Means Procedure")
}
dist(data, method): computes and returns the distance matrix obtained by using the specified distance measure to compute the distances between the rows of a data matrix; method defaults to "euclidean"; useful as the input to the hclust() hierarchical clustering function
cutree(tree, k): cuts a tree, e.g., as resulting from hclust() into several groups either by
specifying the desired number of groups or the cut height(s); k is the desired number of groups
R: hclust() and related functions (base stats package; the flexclust library offers additional clustering utilities)
hclust(d, method): hierarchical cluster analysis on a set of dissimilarities and methods for analyzing it; d is the dissimilarity structure as produced by the dist() function; method can be one of "ward.D", "ward.D2", "single", "complete", "average" (= UPGMA), "mcquitty" (= WPGMA), "median" (= WPGMC) or "centroid" (= UPGMC); default is "complete"
- plot(hclustobj, hang): graphs the dendrogram for the data; set hang to -1 for the best
view
- rect.hclust(hclustobj, k): draws rectangles around hierarchical clusters on
dendrogram plot; k determines the number of clusters to highlight on the
dendrogram
- cutree(hclustobj, k): cuts the dendrogram into k groups and returns the cluster membership of each observation, which makes interpretation easier; choose the number of clusters you want to highlight
- table(cutreeobj): viewing the groups of data
- aggregate(data, by = list(cluster = cutreeobj), FUN): aggregates the data by the cluster assignments; data should be the original data or the scaled data; cutreeobj is the result of the cutree() function; FUN can be set to median to summarize each cluster
Python
Note: requires matplotlib.pyplot (as plt) for visualization and numpy (as np)
Python: KMeans from sklearn.cluster
kmeans = KMeans(...): initialize KMeans; must save it to a variable (for convenience); if it is
saved to a variable, you can easily return properties of the cluster without repeatedly calculating
the fit; all properties and functions will be called on using kmeans
- Arguments:
  - n_clusters: The number of clusters to form; default is 8.
  - max_iter: The maximum number of iterations for a single run; default is 300.
  - n_init: Number of times the k-means algorithm will be run with different centroid seeds. The final result is the best output of the n_init consecutive runs in terms of inertia; default is 10.
  - random_state: Optional. The generator used to initialize the centers. If an integer is given, it fixes the seed; defaults to the global numpy random number generator.
- Usually, we just need to set the argument n_clusters to determine how many groups we are going to split the data into.
- Attributes:
  - cluster_centers_: The coordinates of the cluster centers.
  - labels_: The label of each observation, which indicates the cluster number assigned to each observation.
  - inertia_: Sum of squared distances of samples to their closest cluster center.
  - The most important attribute here is labels_.
- Methods:
- fit: Fit k-means clustering on a given data set.
- fit_predict: Compute cluster centers and predict cluster index for each sample.
- get_params: Get parameters for this estimator.
- set_params: Set the parameters of this estimator.
- predict: Given a set of data, predict the closest cluster each sample belongs to.
- kmeans.set_params(params): change parameters; takes in the arguments in KMeans()
function; normally used to set n_clusters; mutating
- kmeans.fit(x): compute kmeans clustering; x is the data; every time you fit the data, you
remove all the information from the previous fit; mutating
- After the model has been fitted, you can get the properties of the clusters by
calling on KMeans() attributes
- kmeans.cluster_centers_: return the values of the cluster centers
- kmeans.labels_: returns the designated cluster of each observation; can use
labels for visualizing the clusters on a plot (set color to the labels)
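A minimal end-to-end usage sketch consistent with the attributes and methods above; the data matrix X and the choice of 3 clusters are placeholders:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

X = np.random.rand(200, 2)                      # placeholder data matrix

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
kmeans.fit(X)                                   # or: labels = kmeans.fit_predict(X)

print(kmeans.cluster_centers_)                  # coordinates of the 3 centroids
print(kmeans.inertia_)                          # total within-cluster sum of squares

# Color the points by assigned cluster and overlay the centroids
plt.scatter(X[:, 0], X[:, 1], c=kmeans.labels_)
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], c="red", marker="x")
plt.show()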
For Image Compression
Note: requires matplotlib.image (as mpimg) for reading image data
mpimg.imread(filename): opens image file
- np.shape(mpimgobj): returns the shape of the image in three dimensions; the first two are the height and width in pixels, and the third dimension is the number of color channels: 3 for red, green, and blue, or 4 when an alpha channel is also present (e.g., for PNG files); this is useful for keeping the size of the image the same while compressing the image to fewer colors
# Function used to compress an image with K-means: the cluster centers act as a small
# palette of colors, and every pixel is replaced by the center of its cluster
def KmeansCompression(data, nclus=16):
    '''
    data: pixel data to cluster, one row per pixel
    nclus: number of colors to keep
    '''
    cluster = KMeans(n_clusters=nclus)   # note: the n_jobs argument was removed in recent scikit-learn versions
    cluster.fit(data)
    centers = cluster.cluster_centers_   # the nclus representative colors
    labels = cluster.labels_             # which color each pixel is assigned to
    # The number of cluster centers is far smaller than the number of original samples;
    # they are good representatives of the nearby sample points within the same cluster
    return centers[labels]
KmeansCompression(...).reshape(np.shape(imgobj)): reshapes the compressed pixel array back to the original dimensions of the image (see the usage sketch below)
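A usage sketch of the compression pipeline described above; the filename 'photo.png' is hypothetical, and KmeansCompression is the function defined earlier:

import numpy as np
import matplotlib.image as mpimg
import matplotlib.pyplot as plt

img = mpimg.imread('photo.png')                    # hypothetical file; shape is (height, width, channels)
h, w, c = np.shape(img)

pixels = img.reshape(h * w, c)                     # one row per pixel
compressed = KmeansCompression(pixels, nclus=16)   # every pixel replaced by its cluster center
compressed_img = compressed.reshape(np.shape(img)) # back to the original image dimensions

plt.imshow(compressed_img)
plt.show()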
Python: pairwise_distances from sklearn.metrics.pairwise
pairwise_distances(x, metric): calculates distances between observations in the data; metric can be ‘l1’ (Manhattan), ‘l2’ (Euclidean), ‘cosine’ (cosine distance), etc.
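A small check of the metrics described above; it also previews the cosine distance (1 − cos(θ)) discussed in the AgglomerativeClustering section below. The toy vectors are arbitrary:

import numpy as np
from sklearn.metrics.pairwise import pairwise_distances

X = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [2.0, 0.0]])

print(pairwise_distances(X, metric='euclidean'))   # ordinary 'l2' distances
print(pairwise_distances(X, metric='cosine'))      # 1 - cos(theta); rows 1 and 3 point the same way, so their distance is 0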
Python: AgglomerativeClustering from sklearn.cluster
hier = AgglomerativeClustering(...): initialize hierarchical clustering; must save it to a variable
(for convenience); if it is saved to a variable, you can easily return properties of the cluster
without repeatedly calculating the fit; all properties and functions will be called on using hier
- Arguments:
  - n_clusters: The number of clusters to find; default is 2.
  - affinity: Metric used to compute the linkage. Can be “euclidean”, “l1”, “l2”, “manhattan”, or “cosine”. If linkage is “ward”, only “euclidean” is accepted; default is “euclidean”. “l1” is the same as “manhattan”, and “l2” is the same as “euclidean”. “Cosine distance” here refers to 1 − cos(θ), not cos(θ) itself.
    - The smaller the Euclidean/Manhattan distance is, the closer the two observations are. In contrast, the smaller the cosine value is, the farther apart the observations are, and cos(θ) can be negative.
    - So the cosine distance is defined as 1 − cos(θ); notice it is NOT the cosine itself.
    - The cosine distance therefore ranges from 0 to 2, and the smaller it is, the closer the pair of observations are:
      - 0.0: the two vectors point in the same direction
      - 1.0: perpendicular
      - 2.0: opposite directions
  - linkage: Which linkage criterion to use. The linkage criterion determines which distance to use between sets of observations. The algorithm merges the pairs of clusters that minimize this criterion.
    - ward minimizes the variance of the clusters being merged.
    - average uses the average of the distances between all pairs of observations of the two sets.
    - complete or maximum linkage uses the maximum distance between all pairs of observations of the two sets.
-
Attributes:
- labels_: Cluster label for each observation.
- n_leaves_: Number of leaves in the hierarchical clustering tree, which is also the
number of observations.
-
Methods:
- fit: Fit the hierarchical clustering on the data.
- get_params: Get parameters for this estimator.
- set_params: Set the parameters of this estimator.
- hier.set_params(params): change parameters; takes in the arguments of the AgglomerativeClustering() function; usually used to set n_clusters; mutating
- hier.fit(x): compute the hierarchical clustering; x is the data; every time you fit the data, you remove all the information from the previous fit; mutating
  - After the model has been fitted, you can get the properties of the clusters by calling the AgglomerativeClustering() attributes
  - hier.labels_: returns the designated cluster of each observation; can use the labels for visualizing the clusters on a plot (set color to the labels)
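A minimal usage sketch consistent with the arguments, attributes, and methods above; the data matrix and the choice of 3 clusters with ward linkage are placeholders:

import numpy as np
from sklearn.cluster import AgglomerativeClustering

X = np.random.rand(100, 2)                         # placeholder data matrix

hier = AgglomerativeClustering(n_clusters=3, linkage='ward')   # ward linkage requires Euclidean affinity
hier.fit(X)

print(hier.labels_)                                # cluster label of each observation
print(hier.n_leaves_)                              # 100: one leaf per observation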
Python: linkage from scipy.cluster.hierarchy
linkage(x, method, metric): performs hierarchical clustering on x, which may be a condensed distance matrix or an (n, p) matrix of observations; method can be any linkage method (‘complete’, ‘single’, ‘average’, ‘ward’, etc.); metric defaults to ‘euclidean’
Python: dendrogram from scipy.cluster.hierarchy
dendrogram(z, p, truncate_mode, leaf_rotation, leaf_font_size): plots a dendrogram; z is the linkage matrix produced by the linkage() function; p controls how much of the tree is shown when truncating (default is 30); truncate_mode is used to condense the dendrogram (default None): set it to ‘lastp’ to show only the last p merged clusters, or to ‘level’ to show no more than p levels of the dendrogram; leaf_rotation (default 0) determines how to rotate the x-axis labels (set to 90 to make the labels easier to read); leaf_font_size defaults to None (it then varies with the number of nodes on the dendrogram)
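A short sketch combining linkage() and dendrogram() as described above; the data and the truncation settings are placeholders:

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

X = np.random.rand(50, 4)                          # placeholder data matrix

Z = linkage(X, method='complete', metric='euclidean')   # the linkage matrix

dendrogram(Z, truncate_mode='lastp', p=10,         # show only the last 10 merged clusters
           leaf_rotation=90, leaf_font_size=8)
plt.ylabel('Dissimilarity at fusion')
plt.show()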
Other
R: caret
http://topepo.github.io/caret/index.html
Survival Analysis
https://www.cscu.cornell.edu/news/statnews/stnews78.pdf
Markov Chain
https://en.wikipedia.org/wiki/Markov_chain
A/B Testing