CS910: Foundations of Data Analytics
Graham Cormode
G.Cormode@warwick.ac.uk

Regression

Module Outline
Part 1: Preliminaries
– Statistics and data handling
– Introduction to useful tools
– Case studies of analytics in action
Part 2: Core methods
– Regression: fitting a curve to data
– Classification: learning a model from data
– Clustering: finding groups in data
Part 3: Advanced topics
– Social Network Analysis
– Recommender systems
– Time series analysis
– Data management systems

Objectives
– Understand the principle of regression to predict values
– See how simple linear regression works from first principles
– Extend simple linear regression to multiple linear regression
– Perform non-linear regression by transformation of variables
– Apply logistic regression for categoric values

Supervised and Unsupervised Methods
Supervised methods in data analytics
– Classification: predict a class (categoric) value given other values
– Regression: predict a numeric value given other values
Unsupervised methods in data analytics
– Clustering: identify groups/clusters of similar records
In-between: semi-supervised methods
– Use a mixture of labeled and unlabeled data to infer labels

Regression
Regression lets us predict a value for a numeric attribute
– We fit a model to the data, and use the model to predict
Linear regression is the most familiar example
– A linear function of the explanatory variables
– Predicts a value for the dependent variable
Based on the principle of least squares
– Minimize the sum of squared differences between data and model
– The Gauss-Markov theorem shows this provides the "best" linear unbiased estimates for the parameters of the model, assuming unbiased, independent noise with bounded variance

Representation by a constant
Trivial case of regression:
– Given n observations of a random variable X (e.g., 8, 8, 10, 11, 13)
– Find a single value x′ to represent them
Principle of least squares:
– Minimize over x′: Σᵢ (xᵢ – x′)² = Σᵢ (xᵢ² + x′² – 2xᵢx′), i.e. minimize nx′² – 2x′ Σᵢ xᵢ (dropping the constant term Σᵢ xᵢ²)
– Differentiate with respect to x′ and set to 0: 2nx′ – 2 Σᵢ xᵢ = 0
– Achieved by x′ = Σᵢ xᵢ / n
The mean minimizes the squared deviation (so 10 for the example above)
– The variance gives the (minimal) expected squared deviation
– The standard deviation gives the root-mean-squared error (RMSE)

Representation by a line through the origin
Given paired observations (xᵢ, yᵢ) of variables X, Y
– We seek a model y = ax (with a as the parameter to fit)
Minimize the squared deviations
– For xᵢ, the model predicts y = axᵢ; the true observation is yᵢ
– Minimize over a the squared differences: Σᵢ (yᵢ – axᵢ)² = Σᵢ (yᵢ² + a²xᵢ² – 2axᵢyᵢ)
– Differentiate with respect to a: 2a Σᵢ xᵢ² – 2 Σᵢ xᵢyᵢ = 0
– So a = Σᵢ xᵢyᵢ / Σᵢ xᵢ²
Can write this in terms of E[XY] = (Σᵢ xᵢyᵢ)/n and E[X²] = (Σᵢ xᵢ²)/n
– We have a = E[XY]/E[X²]
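These two least-squares solutions are easy to check numerically. A minimal R sketch (the observations are the example values above; the fit_origin helper is a name introduced here purely for illustration):

x <- c(8, 8, 10, 11, 13)                   # the example observations above
x_const <- mean(x)                         # the least-squares constant is the mean (here 10)
rmse_const <- sqrt(mean((x - x_const)^2))  # RMSE of the constant model = standard deviation

# Slope of the best line through the origin, a = E[XY]/E[X^2], for paired data x, y
fit_origin <- function(x, y) mean(x * y) / mean(x^2)

fit_origin can be applied to the housing data introduced next to reproduce the slope computed below.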
New Data set: Boston housing (1978)
http://archive.ics.uci.edu/ml/datasets/Housing
http://tunedit.org/repo/UCI/numeric/housing.arff
506 examples of regions in Boston USA last century
– Can we find a model to predict house sale prices?
Some obvious features are included:
– Crime rate nearby
– Average number of rooms per home
– Property tax ratio
Other features may be less expected:
– Pollution measures (nitric oxides)
– Proportion of population that is "lower status"
– Proportion of population from ethnic minorities
Ethical question: should we include such features?
– Will this propagate and increase inequality?

Simple Linear Model
Apply the through-origin model to housing.data
– Set X = average rooms (RM), Y = median house value (MEDV)
– E[XY] = 146.1, E[X²] = 40.0, E[Y²] = 592.1
– a = E[XY]/E[X²] = 146.1/40.0 = 3.65
What is the residual sum-of-squares error from this model?
– SSres = Σᵢ (yᵢ – axᵢ)² = Σᵢ (yᵢ² + a²xᵢ² – 2axᵢyᵢ) = nE[Y²] + nE[X²]·E[XY]²/E[X²]² – 2nE[XY]·E[XY]/E[X²] = n(E[Y²] – E[XY]²/E[X²])
Root-mean-squared error: RMSE = √(SSres/n) = √(E[Y²] – E[XY]²/E[X²])
– For housing.data: √(592.1 – 146.1²/40.0) = 7.64
– The error magnitude is 7.64 ($K), compared to an average value of 22.5 ($K)

Aside: Sample versus Population statistics
Sometimes a statistical correction is needed for expectations
– Unbiased sample variance: s²(Y) = Σᵢ (yᵢ – E[Y])²/(n–1)
– Corrects the bias in estimating unknown population parameters
– A similar correction applies to the covariance
We will gloss over this issue
– We assume that n is reasonably large (e.g. n = 506 for the housing data)
– Using n–1 vs. n has only a minor impact (e.g. 1/506 ≈ 0.002)
– It may cause minor discrepancies depending on the software used

Linear model with constant
Fit a line not constrained to go through the origin: y = ax + b
– Follow the same set-up, but now there are two unknowns, a and b:
– Minimize over a, b: Σᵢ (yᵢ – axᵢ – b)² = Σᵢ (yᵢ² – 2byᵢ – 2axᵢyᵢ + b² + 2abxᵢ + a²xᵢ²) = f(a,b)
To solve, take partial derivatives with respect to a and b
– ∂f(a,b)/∂a: Σᵢ (–2xᵢyᵢ + 2bxᵢ + 2axᵢ²) = 0   (+)
– ∂f(a,b)/∂b: 2bn + Σᵢ (–2yᵢ + 2axᵢ) = 0   (*)
Rearrange (*): b = (Σᵢ yᵢ – a Σᵢ xᵢ)/n : the average difference from the model
Substitute into (+): a(Σᵢ xᵢ² – (Σᵢ xᵢ)²/n) = Σᵢ xᵢyᵢ – (Σᵢ xᵢ)(Σᵢ yᵢ)/n
– Simplify the LHS: Σᵢ xᵢ² – (Σᵢ xᵢ)²/n = nVar(X)
– Simplify the RHS: Σᵢ xᵢyᵢ – (Σᵢ xᵢ)(Σᵢ yᵢ)/n = nCov(X,Y)
So a = Cov(X,Y)/Var(X), and b = E[Y] – aE[X]

Linear model with constant (continued)
We have y = ax + b where a = Cov(X,Y)/Var(X) and b = E[Y] – aE[X]
– Write the model as y = xCov(X,Y)/Var(X) + E[Y] – Cov(X,Y)E[X]/Var(X) = E[Y] + (x – E[X])Cov(X,Y)/Var(X)
Error (residual sum of squares):
SSres = Σᵢ (yᵢ – axᵢ – b)²
 = Σᵢ (yᵢ – E[Y] – (xᵢ – E[X])Cov(X,Y)/Var(X))²
 = Σᵢ [(yᵢ – E[Y])² + (xᵢ – E[X])²Cov²(X,Y)/Var²(X) – 2(yᵢ – E[Y])(xᵢ – E[X])Cov(X,Y)/Var(X)]
 = nVar(Y) + nVar(X)Cov²(X,Y)/Var²(X) – 2nCov²(X,Y)/Var(X)
 = n(Var(Y) – Cov²(X,Y)/Var(X))

Partitioning the Variance
The total variance in Y, Var(Y), is explained by two components:
– The regression (explained) sum of squares, SSreg = Σᵢ (axᵢ + b – E[Y])²
– The residual sum of squares, SSres = Σᵢ (yᵢ – axᵢ – b)²
– Can show for this (a, b) that SSreg + SSres = nVar(Y) = Σᵢ (yᵢ – E[Y])²
Measure the "Coefficient of Determination" R² = 1 – SSres/nVar(Y)
– Measures how much of the variance in Y the model "explains", compared to the initial variance Var(Y)
– Here, SSres = n(Var(Y) – Cov²(X,Y)/Var(X))
– R² = 1 – (1 – Cov²(X,Y)/(Var(X)Var(Y))) = Cov²(X,Y)/(Var(X)Var(Y)) = PMCC(X,Y)²
– The Product-Moment Correlation Coefficient determines the quality of the linear regression between X and Y
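Before working through the application that follows, the closed-form solution can be checked directly against R's built-in fit; a minimal sketch (assuming housing.data is in the working directory, with rooms in column 6 and median value in column 14, as elsewhere in these notes):

house <- read.table("housing.data", sep="", header=FALSE)
x <- house$V6; y <- house$V14
a <- cov(x, y) / var(x)        # slope: Cov(X,Y)/Var(X)
b <- mean(y) - a * mean(x)     # intercept: E[Y] - aE[X]
r2 <- cor(x, y)^2              # R^2 = PMCC(X,Y)^2
coef(lm(y ~ x))                # should agree with (b, a)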
Application of the model
For housing.data, with X = average rooms, Y = median house value:
– Cov(X,Y) = 4.48, Var(X) = 0.493
– a = Cov(X,Y)/Var(X) = 9.08
– E[X] = 6.3, E[Y] = 22.5
– b = E[Y] – aE[X] = -34.6
– Line of best fit: y = 9.08x - 34.6
– The predicted price is -$34.6K plus $9.08K for every extra room. Does this make sense?
R² = Cov²(X,Y)/(Var(X)Var(Y)) = 0.48
– A moderate positive correlation
RMSE = √(Var(Y) – Cov²(X,Y)/Var(X)) = s(Y)√(1 – PMCC²) ≈ 6.6
– Smaller than the previous model (the line through the origin, RMSE 7.64)

Computation in R
To do the calculations using R:
house <- read.table("housing.data", sep="", header=F)  # read the data
summary(house$V14)
summary(house$V6)                 # show summaries of the two variables
cov(house$V14, house$V6)          # show the covariance of the variables
cor(house$V14, house$V6)          # show the correlation of the variables
cor(house$V14, house$V6)^2        # show PMCC squared / R2
fit <- lm(house$V14 ~ house$V6)   # fit a linear model with V14 as Y
print(fit)                        # show the parameters of the model
summary(residuals(fit))           # summarize the distribution of residuals
summary(fit)                      # summarize the model
# R shows the 'significance' of each parameter, based on a t-test
plot(house$V6, house$V14)         # plot the data
abline(fit)                       # show the line of best fit on the data
fit0 <- lm(house$V14 ~ -1 + house$V6)  # fit a linear model through the origin

Computation via spreadsheet
Built-in functions for the desired quantities:
– =average(range), =var(range) for mean and variance
– =covar(range1, range2) for covariance of paired values
– =correl(range1, range2) for PMCC of paired values
Can add a line of best fit to a scatter plot
– Make a scatter plot of the data
– Select the plot, "add trendline" (right-click)
– Can show the equation and R² value on the chart (here y = 9.1021x - 34.671, R² = 0.4835)
– Can set the intercept (= 0)

Computation via Gnuplot
Scatter plot of rooms versus average price:
set term emf enhanced font "Calibri,18"
set output "roomprice.emf"
set title "Rooms versus Price"
set xlabel "Rooms"
set ylabel "Average Price"
set key under
Add a line of best fit:
y(x) = a*x + b
fit y(x) "housing.data" using 6:14 via a,b
plot "housing.data" u 6:14 w p t 'Houses', \
     y(x) with lines title 'Fit'
[Plot: "Rooms versus Price" – the house data with the fitted line]
Output to standard output:
Final set of parameters        Asymptotic Standard Error
a = 9.10211                    +/- 0.419  (4.604%)
b = -34.6706                   +/- 2.65   (7.643%)

Computation via Weka
– Open the data file, remove unwanted (non-numeric) attributes
– Under the Classify tab, choose "functions/SimpleLinearRegression"
– Select "use training set" for the test options
– Hit start!
Partial output:
9.1 * RM - 34.67
Time taken to build model: 0.03 seconds
=== Evaluation on training set ===
=== Summary ===
Correlation coefficient      0.6884
Mean absolute error          4.4953
Root mean squared error      6.6644

Extra practice
Replicate this analysis in your preferred tool
– Spreadsheet, R, gnuplot, Weka or other
Build simple linear regression models for other numeric variables in the housing.data set:
– E.g. price versus pollution (nitric oxide rates)
– E.g. pollution versus proportion of old properties
– Compute the quantities needed (Var[Y], Cov[X,Y] etc.), and check that they agree with those found by the software
– How well does the regression explain the data (in terms of R² and RMSE)?
– What might explain these correlations, or lack of correlations?
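One way to set up the first suggested comparison; a minimal sketch (V5 is the pollution (NOX) column and V14 the price column, as used elsewhere in these notes; no particular result is implied):

x <- house$V5; y <- house$V14            # pollution (NOX) and price (MEDV)
c(var(y), cov(x, y), cov(x, y)/var(x))   # hand-computed quantities
fit_nox <- lm(y ~ x)                     # the same model via lm
summary(fit_nox)$r.squared               # R^2 reported by the software
sqrt(mean(residuals(fit_nox)^2))         # RMSE on the training data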
Multiple linear regression
Suppose we want to include more variables
– Model: Y = aX1 + bX2 + cX3 + … + z
– Y: dependent (response) variable; Xi: explanatory variables
– Could follow the same outline: write out the squared error and minimize
– But the notation gets ugly and messy
– Instead, we can solve via a matrix representation
Matrix form of linear regression with d explanatory variables
– Let the (d+1) model parameters be (w0, w1, …, wd) = w
– The prediction for (vector) x is f(x) = w0·1 + Σᵢ wi xi, summing over i = 1, …, d
– Encode the n examples as an n × (d+1) matrix X:
  X = ( 1  x11  x12  x13  …
        1  x21  x22  x23  …
        1  x31  x32  x33  …
        …                   )
– The first column is all 1s, for the constant term
– The n corresponding yi values form a vector y

Linear Algebra Refresher
An r × c matrix X has r rows and c columns
– Xi,j is the entry in row i, column j
Addition: add two r × c matrices entry-wise: (X+Y)i,j = Xi,j + Yi,j
Multiplication: multiply an r × n matrix X with an n × c matrix Y to get an r × c matrix Z = XY
– Zi,k = Σⱼ Xi,j Yj,k, summing over j = 1, …, n
The identity matrix I is the n × n matrix such that IX = XI = X
Inverse: X⁻¹ is the matrix (if it exists) such that X⁻¹X = XX⁻¹ = I
Transpose: Xᵀ switches rows and columns, Xᵀi,j = Xj,i
– (X + Y)ᵀ = Xᵀ + Yᵀ
– (XY)ᵀ = YᵀXᵀ

Sum of squares error
The column vector of predictions on the data is Xw
– The residuals are the column vector (y – Xw)
The residual sum of squares is now
– RSS(w) = (y – Xw)ᵀ(y – Xw) = (yᵀ – wᵀXᵀ)(y – Xw) = yᵀy – yᵀXw – wᵀXᵀy + wᵀXᵀXw
– The inner product of the residuals with themselves (their squared L2 norm)
– A quadratic function in the (d+1) parameters w

Optimization
(Partial) derivatives of quantities with respect to a vector a:
– ∂b/∂a = [∂b/∂a1, ∂b/∂a2, …]ᵀ   (Rule 1)
– ∂(aᵀb)/∂a = ∂(bᵀa)/∂a = b   (Rule 2)
– ∂(aᵀBᵀBa)/∂a = 2BᵀBa   (Rule 3)
Take the (partial) derivative with respect to (each) parameter:
– ∂RSS/∂w = ∂/∂w (yᵀy – yᵀXw – wᵀXᵀy + wᵀXᵀXw) = –(yᵀX)ᵀ – Xᵀy + 2XᵀXw = –2Xᵀy + 2XᵀXw = –2Xᵀ(y – Xw)
– Set ∂RSS/∂w = 0: Xᵀy – XᵀXw = 0
– So w = (XᵀX)⁻¹Xᵀy (XᵀX is square and symmetric), assuming (XᵀX)⁻¹ exists
– Need XᵀX to be non-singular: its determinant is non-zero, so X cannot have linearly dependent columns

Multilinear regression: matrix example
Consider the housing data (n = 506 examples)
– Predict y = sale price based on X = pollution and rooms
– w = (XᵀX)⁻¹Xᵀy : (XᵀX) is a 3 × 3 matrix
– (XᵀX)i,j is the inner product of the values of the ith variable with the jth variable
R can do matrix algebra:
X <- matrix(1:1518, nrow = 506, ncol = 3)
X[,1] <- 1; X[,2] <- house$V5; X[,3] <- house$V6; y <- house$V14
w <- (solve(t(X) %*% X) %*% t(X)) %*% y   # %*% is matrix multiply, t(X) is matrix transpose
RSS <- t(y - X %*% w) %*% (y - X %*% w)   # residual sum of squares

XᵀX = (  506      280.67   3180.25
         280.67   162.47   1751.52
        3180.25  1751.52  20234.6  )

(XᵀX)⁻¹ = (  0.286  -0.141  -0.032
            -0.141   0.162   0.008
            -0.032   0.008   0.004 )

– Obtain wᵀ = [-18.20, -18.97, 8.16], RSS = 19844.37

Prediction using the model
Given a new data point x, define x′ = [1 x1 … xd]
– The prediction is x′w = x′(XᵀX)⁻¹Xᵀy
As before, the quality of the fit is given by R²
– Computed as the fraction of the sum of squares explained by the regression
– R² = 1 – SSresidual/SStotal = 1 – SSresidual/(n Var(y))
– Same interpretation of R² as in simple linear regression: close to 1, good fit of the model; close to 0, weak fit of the model
– In R: R2 <- 1 - RSS/(length(y)*var(y))   # = 0.536
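To make the prediction step concrete, a minimal sketch continuing from the matrix code above (the NOX and room values of the new point are made up purely for illustration):

x_new <- c(1, 0.5, 6.0)            # [1, NOX, rooms]; hypothetical values
pred <- as.numeric(x_new %*% w)    # x'w, the predicted median price ($K)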
read.table("housing.data",sep="", header=F) # read the data fit2 <- lm(house$V14 ~ house$V5 + house$V6) # fit a linear model with V14 as Y, V5 and V6 as X fit2 # show the parameters of the model # Model: y = -18.2 – 19.0(nox) + 8.2(rooms) summary(fit2) # R2 = 0.53, up from 0.48 pairs(house$V14 ~ house$V5 + house$V6) # plots of pairs of vars 26 CS910 Foundations of Data Analytics Multiple linear regression in Weka Open the data file, remove unwanted (non-numeric) attributes Under classify tab, choose “functions/LinearRegression” Select “use training set” for test options – Hit start! – Partial output: 27 class = -18.9706 * NOX + 8.1567 * RM + -18.2059 Time taken to build model: 0.01 seconds === Evaluation on training set === === Summary === Correlation coefficient Mean absolute error Root mean squared error CS910 Foundations of Data Analytics 0.7317 4.2419 6.2624 Dealing with categoric attributes Regression is fundamentally numeric – But we can numerically encode categoric (explanatory) variables Simple case: binary attribute (e.g. Male/Female) Create a variable that is 0 if male, 1 if female – Include this new variable in the regression – General categoric attributes (e.g. Country): “Dummy coding” Create a binary variable for each possibility – E.g. England (T/F), Mexico (T/F), France (T/F)… – Include all these variables in the regression – Effectively, adds a different constant for each category – 28 CS910 Foundations of Data Analytics Kitchen sink regression Build a regression model for price, put in all other variables – R automatically handles categoric variables, skips dependencies: fit3 <- lm(house$V14 ~ house$V1 + house$V2 + house$V3 + house$V4 + house$V5 + house$V6 + house$V7 + house$V8 + house$V9 + house$V10 + house$V11 + house$V12 + house$V13) summary(fit3) – Weka can “automatically” convert categoric variables to numeric R2 value 0.74 (R 0.86) We have explained a lot more of the variance (though not all) – But have built a complex model (dozens of variables/parameters) – At risk of “kitchen sink regression”: throw in everything possible May find false correlations, lead to erroneous conclusions – Some variables significant: pollution, rooms, lower status Others not: age of housing, industry – 29 CS910 Foundations of Data Analytics Regularization in regression There is a danger of overfitting the model to the data – Fits what we have seen already, but not future data Regularization: include the parameters in the optimization – Minimize (y – Xw)T(y-Xw) + α‖w‖pp instead of just (y – Xw)T(y-Xw) Tikhonov regularization, also known as ridge regression: p=2 Seems most natural, since both terms are quadratic – Has a closed form solution similar to before: w = (XTX + 2αI)-1XTy – Implemented in Weka Linear Regression: parameter ‘ridge’ sets α – To pick α, test quality of predictions on withheld data – Least absolute shrinkage and selection operator (LASSO): p=1 Minimize (y – Xw)T(y-Xw) + α‖w‖1 – no closed form, not in WEKA – Tends to set many coefficients to zero: effectively model selection! – 30 CS910 Foundations of Data Analytics Correlations in data Two measures of education level in the adult data – Years of education (numeric), Education level (categoric) How do they relate? adult <- read.csv(adult.data, header=F) fit4 <- lm(adult$V5 ~ adult$V4) #R2 = 1! 
Correlations in data
Two measures of education level in the adult data
– Years of education (numeric), education level (categoric)
– How do they relate?
adult <- read.csv("adult.data", header=F)
fit4 <- lm(adult$V5 ~ adult$V4)   # R2 = 1!
plot(adult$V5 ~ adult$V4)
– Years of education is entirely determined by education level
– Conjecture: years of education is computed from education level
Extra practice: build a regression model for years of education
– Based on demographic factors: age, sex, nationality
– Experiment with adding and removing features: what works best?

Fitting non-linear functions
Not all relationships are linear
– Some are quadratic, cubic…
– Exponential, logarithmic…
Do we need to find new methods for each different model?
– Idea: try transforming the data so that we seek a linear model
– Suppose we have a quadratic model: y = ax² + bx + c
– Introduce a new variable z = x²
– The model is now y = az + bx + c : linear!
– Use multiple linear regression to learn the parameters of this model

Non-linear models
How exactly to learn the new model?
– For each example (xᵢ, yᵢ), create a new example (1, xᵢ, xᵢ², yᵢ)
– Apply linear regression to the expanded data
Do we need to include shifted or scaled terms such as (xᵢ + c)²?
– No, because such models can already be expressed as ax² + bx + c
But we do need to include the terms of the model we want
– Cannot learn a quadratic model unless x² is present
– log(xᵢ) : logarithmic model
– 1/x : reciprocal model

Fitting a non-linear model
house <- read.table("housing.data", sep="", header=F)  # read the data
fit1 <- lm(house$V14 ~ house$V6)                  # R2 = 0.4835
fit2 <- lm(house$V14 ~ house$V6 + I(house$V6^2))
# fit a linear model with V14 as Y, V6 and V6^2 as X
fit2             # show the parameters of the model
#  (Intercept)    house$V6   I(house$V6^2)
#        66.06      -22.64            2.47
summary(fit2)    # R2 = 0.548 – much better

Exponential model
Suppose we want to learn a model of the form y = β exp(αx)
– For unknown parameters α, β
Transforming x to exp(x) will not suffice
– c·exp(x) ≠ β exp(αx) in general
Here, we can transform y instead:
– ln y = ln β + αx : simple linear regression
However: linear regression now minimizes the sum of squared differences around ln y
– We have changed the objective function
fit <- lm(log(house$V14) ~ house$V1)
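A minimal sketch of recovering α and β from such a log-transformed fit (the choice of crime rate, V1, as the explanatory variable simply follows the line above; no particular fit quality is implied):

efit <- lm(log(house$V14) ~ house$V1)   # ln y = ln(beta) + alpha * x
alpha <- coef(efit)[2]                  # slope of the log-linear fit
beta <- exp(coef(efit)[1])              # back-transform the intercept
# predictions on the original scale are beta * exp(alpha * x)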
Categoric outputs
What about regression to predict categoric attributes?
– Regression so far predicts a number
– We will focus on binary outputs
Can encode True = 1, False = 0, and try to use regression
– Predicts 0.03: probably False
– Predicts 0.82: probably True
Not a principled approach
– Prediction of 13.3: really, really true???
– Prediction of -5.7: ???
– How to interpret the slope of the dependency on variables?

Odds Ratios
Need to transform the data so that regression makes sense
– Perform a non-linear transform on the dependent variable
Convert the probability of an outcome into an odds ratio
– Odds ratio(p) = p/(1–p)
– E.g. toss a coin, get heads: p = ½, odds ratio = 1
– E.g. roll a die, get a 6: (1/6)/(5/6) = 0.2
– E.g. roll a die, get anything other than 6: (5/6)/(1/6) = 5
Using odds ratios ensures that any positive value is meaningful
– As p → 1, odds ratio → ∞
– As p → 0, odds ratio → 0
What about negative values?

Log Odds Ratios
Taking the logarithm of the odds ratio allows negative values
– Heads on a coin: ln(1) = 0
– 6 on a die: ln(0.2) = -1.6
– Anything other than 6 on a die: ln(5) = 1.6
This is the logit transformation
– Logit(p) = ln(p/(1–p))
– Invertible: p = 1/(1 + exp(–l))
Now all reals have meaning
– As p → 0, logit(p) → –∞
– As p → 1, logit(p) → +∞

Logistic Regression
Apply the logit function to the dependent variable: logistic regression
– Model: logit(Y) = aX1 + bX2 + cX3 + … + z
– Transformed model: Y = 1/(1 + exp(–(aX1 + bX2 + cX3 + … + z)))
– Problem: this does not yield a closed form as linear regression did
Treat it as an optimization problem and solve iteratively
– Begin with an initial estimate of the parameters (e.g. all 1)
– Measure "goodness of fit" via the log-likelihood function
– Likelihood: the probability of seeing the data given the (current) model
– Adjust the parameters based on the local gradient (Newton-Raphson)
– Repeat until convergence: obtain maximum likelihood estimates
Built into R and Weka; needs plug-ins for spreadsheets ("solver")
– Measure the quality of the model by how well it fits the training data

Logistic Regression in R
adult <- read.csv("adult.data", header=F)   # use the adult data set, with a binary class
plot(adult$V15 ~ adult$V5)
lfit <- glm(adult$V15 ~ adult$V5, family="binomial")
# family="binomial" tells R to use logistic regression
summary(lfit)
predict(lfit, type="response")           # predicted probabilities
round(predict(lfit, type="response"))    # predicted classes (0/1)
pairs <- paste(round(predict(lfit, type="response")), adult$V15)
table(pairs)            # get statistics on the number of correct predictions
summary(adult$V15)      # compare to the overall statistics
Output:
 <=50K   >50K
 24720   7841

Logistic regression in R (multiple variables)
lfit <- glm(adult$V15 ~ adult$V1 + adult$V2 + adult$V4 + adult$V6 +
            adult$V7 + adult$V8 + adult$V9 + adult$V10, family="binomial")
# family="binomial" tells R to use logistic regression
summary(lfit)           # print a summary of the fit
pairs <- paste(round(predict(lfit, type="response")), adult$V15)
table(pairs)
Output:
pairs
 0 <=50K    0 >50K   1 <=50K    1 >50K
   22840      3548      1880      4293
summary(adult$V15)
 <=50K   >50K
 24720   7841

Interpreting a Logistic Regression Model
Linear model y = ax + b: an increase of 1 in x increases y by a
Logistic model logit(y) = aX1 + bX2 + cX3 + … + z : an increase of 1 in X1 multiplies the odds ratio of y by a factor of exp(a)
– Odds ratio(y | X1, X2, …) = exp(aX1 + bX2 + cX3 + … + z)
– Odds ratio(y | (X1+1), X2, …) = exp(a(X1+1) + bX2 + cX3 + … + z) = exp(a)·exp(aX1 + bX2 + cX3 + … + z)
– A factor of exp(a) between the two odds ratios
Example: predicting income based on years of education
Coefficients:
  (Intercept)   adult$V5
      -5.0197     0.3643
– Each extra year of education multiplies the odds of high income by exp(0.3643) = 1.44: quite significant!
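A quick way to read off these odds multipliers for every coefficient at once; a minimal sketch (continuing from the single-variable fit above):

lfit <- glm(adult$V15 ~ adult$V5, family="binomial")
exp(coef(lfit))   # per-unit odds-ratio multipliers, e.g. about 1.44 for each year of education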
Logistic Regression in Weka
– Select 'adult.arff', remove unwanted attributes
– Select the Classify tab
– Choose the classifier: classifiers/functions/Logistic
– For the test options, pick 'use training set'
– Pick the target attribute
– Hit 'start'
The result shows the model and some measures of quality:
Time taken to build model: 6.35 seconds
=== Evaluation on training set ===
=== Summary ===
Correctly Classified Instances     40612     83.1497 %
Incorrectly Classified Instances    8230     16.8503 %

Multiple logistic regression
We have so far assumed that the dependent variable is binary
– How to handle categoric dependent variables?
Quick fix: "one-versus-all"
– For each category C, create a binary variable C vs. (not C)
– Apply logistic regression for this variable
– Repeat for each category
– Select the category assigned the greatest likelihood
More generally: this is an example of classification
– Studied in depth in subsequent lectures

Summary of Regression
– Regression models the data to predict values
– Simple linear regression models one attribute in terms of another
– Multiple linear regression expresses an attribute in terms of many
– Non-linear regression can be done via transformation of variables
– Logistic regression predicts binary values (classification)
Background reading:
– Chapters 2 (simple linear regression), 3 (multiple linear regression), 7 (polynomial regression models), 8 (indicator variables) and 13.2 (logistic regression) in Introduction to Linear Regression Analysis (Montgomery, Peck, Vining)
– Logistic Regression and Newton's Method (how to fit the model)