CS910: Foundations of Data Analytics
Graham Cormode
G.Cormode@warwick.ac.uk

Regression

Module Outline
 Part 1: Preliminaries
– Statistics and data handling
– Introduction to useful tools
– Case studies of analytics in action
 Part 2: Core methods
– Regression: fitting a curve to data
– Classification: learning a model from data
– Clustering: finding groups in data
 Part 3: Advanced topics
– Social Network Analysis
– Recommender systems
– Time series analysis
– Data management systems

Objectives
 Understand the principle of regression to predict values
 See how simple linear regression works from first principles
 Extend simple linear regression to multiple linear regression
 Perform non-linear regression by transformation of variables
 Apply logistic regression for categoric values

Supervised and Unsupervised Methods
 Supervised methods in data analytics
– Regression: predict a numeric value given other values
– Classification: predict a class (categoric) value given other values
 Unsupervised methods in data analytics
– Clustering: identify groups/clusters of similar records
 In-between: Semi-supervised methods
– Use a mixture of labeled and unlabeled data to infer labels

Regression
 Regression lets us predict a value for a numeric attribute
– We fit a model to the data, and use the model to predict
 Linear regression is the most familiar example
– A linear function of the explanatory variables
– Predicts a value for the dependent variable
 Based on the principle of least squares
– Minimize the sum of squared differences between data and model
 The Gauss-Markov theorem shows this provides the "best" unbiased estimates for the parameters of the model
– Assuming unbiased, independent noise with bounded variance

Representation by a constant
 Trivial case of regression:
– Given n observations of a random variable X (e.g., 8, 8, 10, 11, 13)
– Find a single value x' to represent them
 Principle of least squares: Minimize over x': Σi (xi – x')²
– = Minimize over x': Σi (xi² + x'² – 2xi x')
– = Minimize over x': nx'² – 2x'(Σi xi)
– Differentiate with respect to x' and set to 0: 2nx' – 2(Σi xi) = 0
– Achieved by x' = (Σi xi)/n
 The mean minimizes the squared deviation (so 10 for the example above; a quick R check follows below)
– The variance gives the (minimal) expected squared deviation
– The standard deviation gives the root-mean-squared error (RMSE)

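A minimal R sketch of this check, using the example observations above (the sse helper is ours, purely for illustration):
x <- c(8, 8, 10, 11, 13)              # the example observations
sse <- function(c) sum((x - c)^2)     # sum of squared deviations from a constant c
sse(mean(x))                          # minimized at the mean (x' = 10)
sse(9); sse(11)                       # any other constant gives a larger value
mean((x - mean(x))^2)                 # population variance = minimal mean squared deviation
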
Representation by a line through the origin
 Given paired observations (xi, yi) of variables X, Y
– We seek a model y = ax (with a as the parameter to fit)
 Minimize the squared deviations
– For xi, the model predicts y = axi
– The true observation is yi
– Minimize the squared difference:
  Minimize over a: Σi (yi – axi)² = Σi (yi² + a²xi² – 2a xi yi)
  Differentiate with respect to a: 2a Σi xi² – 2 Σi (xi yi) = 0
  a = Σi (xi yi) / Σi xi²
– Can write in terms of E[XY] = Σi (xi yi)/n and E[X²] = Σi xi²/n
 We have a = E[XY]/E[X²] (a small numeric check follows below)

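A minimal sketch of this formula in R, on made-up paired values (the data here is purely illustrative):
x <- c(1, 2, 3, 4, 5)                 # hypothetical explanatory values
y <- c(2.1, 3.9, 6.2, 8.1, 9.8)       # hypothetical responses
a <- sum(x * y) / sum(x^2)            # a = Σ xi yi / Σ xi² = E[XY]/E[X²]
a
lm(y ~ -1 + x)                        # R's built-in fit through the origin gives the same slope
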
New Data set: Boston housing (1978)
http://archive.ics.uci.edu/ml/datasets/Housing
http://tunedit.org/repo/UCI/numeric/housing.arff
 506 examples of regions in Boston, USA, last century
– Can we find a model to predict house sale prices?
 Some obvious features are included:
– Crime rate nearby
– Average number of rooms per home
– Property tax ratio
 Other features may be less expected:
– Pollution measures (nitric oxide)
– Proportion of population that is "lower status"
– Proportion of population from ethnic minorities
 Ethical question: should we include such features?
– Will this propagate and increase inequality?

Simple Linear Model
 Apply the model to housing.data
– Set X = average rooms (RM), Y = median house value (MEDV)
– E[XY] = 146.1, E[X²] = 40.0, E[Y²] = 592.1
– a = E[XY]/E[X²] = 146.1/40.0 = 3.65
 What is the residual sum-of-squares error from this model?
– SSres = Σi (yi – axi)² = Σi (yi² + a²xi² – 2a xi yi)
  = nE[Y²] + nE[X²]·E[XY]²/E[X²]² – 2nE[XY]·E[XY]/E[X²]
  = n(E[Y²] – E[XY]²/E[X²])
 Root-Mean-Squared Error:
– RMSE = √(SSres/n) = √(E[Y²] – E[XY]²/E[X²])
– For housing.data: √(592.1 – 146.1²/40.0) = 7.64
 Error magnitude is 7.64 ($K), compared to an average value of 22.5 ($K)
– (these numbers are checked in the R sketch below)

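A minimal R sketch of these calculations; it assumes housing.data is in the working directory, with column 6 = RM and column 14 = MEDV as on the later slides:
house <- read.table("housing.data", sep="", header=F)   # read the data
x <- house$V6; y <- house$V14                            # rooms and median value
a <- mean(x*y) / mean(x^2)                               # a = E[XY]/E[X²], approx 3.65
rmse <- sqrt(mean(y^2) - mean(x*y)^2 / mean(x^2))        # sqrt(E[Y²] - E[XY]²/E[X²]), approx 7.64
c(a, rmse)
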
Aside: Sample versus Population statistics
 Sometimes a statistical correction is needed for expectations
– Unbiased sample variance: s²(Y) = Σi (yi – E[Y])²/(n–1)
– Corrects the bias in estimating unknown population parameters
– Similar correction for covariance
 We will gloss over this issue
– We assume that n is reasonably large (e.g. n = 506 for the housing data)
– Using n–1 vs. n has only minor impact (e.g. 1/506 = 0.002)
– May cause minor discrepancies depending on the software used (see below)

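For instance, R's built-in var() divides by n–1; a two-line check, again assuming housing.data is available:
house <- read.table("housing.data", sep="", header=F)
var(house$V14)                          # sample variance, dividing by n-1
mean(house$V14^2) - mean(house$V14)^2   # population variance, dividing by n: only slightly smaller
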
Linear model with constant
 Fit a line not constrained to go through the origin: y = ax + b
– Follow the same set-up, but now there are two unknowns, a and b:
  Minimize over a, b: Σi (yi – axi – b)²
  = Σi (yi² – 2byi – 2a xi yi + b² + 2ab xi + a²xi²) = f(a,b)
 To solve, take partial derivatives with respect to a and b
– ∂f(a,b)/∂a: Σi (–2xi yi + 2bxi + 2axi²) = 0   (+)
– ∂f(a,b)/∂b: 2bn + Σi (–2yi + 2axi) = 0   (*)
– Rearrange (*): b = (Σi yi – a Σi xi)/n : the average difference from the model
– Substitute into (+): a(Σi xi² – (Σi xi)²/n) = Σi (xi yi) – (Σi xi)(Σi yi)/n
 Simplify the LHS: Σi xi² – (Σi xi)²/n = nVar(X)
– Simplify the RHS: Σi (xi yi) – (Σi xi)(Σi yi)/n = nCov(X,Y)
– So a = Cov(X,Y)/Var(X), and b = E[Y] – aE[X] (sketched in R below)

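A minimal R sketch of these formulas on the housing data, using population (divide-by-n) moments to match the slides; lm() gives the same line up to rounding:
house <- read.table("housing.data", sep="", header=F)
x <- house$V6; y <- house$V14
covXY <- mean(x*y) - mean(x)*mean(y)    # population covariance
varX  <- mean(x^2) - mean(x)^2          # population variance
a <- covXY / varX                       # slope, approx 9.08
b <- mean(y) - a * mean(x)              # intercept, approx -34.6
c(a, b)
lm(y ~ x)                               # built-in fit agrees
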
Linear model with constant
 We have y = ax + b where a = Cov(X,Y)/Var(X) and b = E[Y] – aE[X]
 Write the model as y = xCov(X,Y)/Var(X) + E[Y] – Cov(X,Y)E[X]/Var(X)
  = E[Y] + (x – E[X])Cov(X,Y)/Var(X)
 Error (residual sum of squares):
  SSres = Σi (yi – axi – b)²
  = Σi (yi – E[Y] – (xi – E[X])Cov(X,Y)/Var(X))²
  = Σi [(yi – E[Y])² + (xi – E[X])²Cov²(X,Y)/Var²(X) – 2(yi – E[Y])(xi – E[X])Cov(X,Y)/Var(X)]
  = nVar(Y) + nVar(X)Cov²(X,Y)/Var²(X) – 2nCov²(X,Y)/Var(X)
  = n(Var(Y) – Cov²(X,Y)/Var(X))

Partitioning the Variance
 The total variance in Y, Var(Y), is explained by two components:
– The regression (explained) sum of squares, SSreg = Σi (axi + b – E[Y])²
– The residual sum of squares, SSres = Σi (yi – axi – b)²
– Can show for this (a, b) that SSreg + SSres = nVar(Y) = Σi (yi – E[Y])²
 Measure the "Coefficient of Determination" R² = 1 – SSres/nVar(Y)
– Measures how much the model "explains" the variance in Y, compared to the initial variance, Var(Y)
– Here, SSres = n(Var(Y) – Cov²(X,Y)/Var(X))
– R² = 1 – (1 – Cov²(X,Y)/(Var(X)Var(Y))) = Cov²(X,Y)/(Var(X)Var(Y)) = PMCC(X,Y)²
 The Product-Moment Correlation Coefficient determines the quality of the linear regression between X and Y (checked in R below)

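A quick numerical check of R² = PMCC² in R, again assuming housing.data is available:
house <- read.table("housing.data", sep="", header=F)
x <- house$V6; y <- house$V14
fit <- lm(y ~ x)
1 - sum(residuals(fit)^2) / sum((y - mean(y))^2)   # R² = 1 - SSres / nVar(Y)
cor(x, y)^2                                        # equals PMCC², approx 0.48
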
Application of the model
 For housing.data, with X = average rooms, Y = median house value:
– Cov(X,Y) = 4.48, Var(X) = 0.493
– a = Cov(X,Y)/Var(X) = 9.08
– E[X] = 6.3, E[Y] = 22.5
– b = E[Y] – aE[X] = –34.6
 Line of best fit: y = 9.08x – 34.6
– Price is –$34.6K plus $9.08K for every extra room. Does this make sense?
 R² = Cov²(X,Y)/(Var(X)Var(Y)) = 0.48
– A moderate positive correlation
 RMSE = √(Var(Y) – Cov²(X,Y)/Var(X)) = s(Y)√(1 – PMCC²) ≈ 6.7
– Smaller than the previous model (line through the origin, RMSE 7.64)

Computation in R
 To do the calculations using R:
house <- read.table("housing.data",sep="", header=F) # read the data
summary (house$V14)
summary (house$V6) # show summary of the two variables
cov(house$V14,house$V6) # show covariance of variables
cor(house$V14,house$V6) # show correlation of variables
cor(house$V14,house$V6)**2 # show PMCC squared / R2
fit <- lm(house$V14 ~ house$V6) # fit a linear model with V14 as Y
print (fit) # show the parameters of the model
summary(residuals(fit)) # summarize the distribution of residuals
summary(fit) # summarize the model.
# R shows the ‘significance’ of each parameter, based on a t-test
plot(house$V6, house$V14) # plot the data
abline(fit) # show the line of best fit on the data
fit0 <- lm(house$V14 ~ -1 + house$V6) # fit linear model through origin
Computation via spreadsheet
 Built-in functions for the desired quantities:
– =average(range), =var(range) for mean and variance
– =covar(range1, range2) for covariance of paired values
– =correl(range1, range2) for PMCC of paired values
 Can add a line of best fit to a scatter plot
– Make a scatter plot of the data
– Select the plot, "add trendline" (right-click)
 Can show the equation and R² value
 Can set the intercept (= 0)
[Chart: scatter plot of rooms versus price with fitted trendline, y = 9.1021x - 34.671, R² = 0.4835]

Computation via Gnuplot
 Scatter plot of rooms versus average price (same approach as the earlier hours-versus-education plot):
set term emf enhanced font "Calibri,18"
set output "roomprice.emf"
set title "Rooms versus Price"
set xlabel "Rooms"
set ylabel "Average Price"
set key under
 Add a line of best fit:
y(x) = a*x + b
fit y(x) "housing.data" using 6:14 via a,b
plot "housing.data" u 6:14 w p t 'Houses', \
     y(x) with lines title 'Fit'
 Output to standard output:
Final set of parameters            Asymptotic Standard Error
a               = 9.10211          +/- 0.419        (4.604%)
b               = -34.6706         +/- 2.65         (7.643%)
[Chart: "Rooms versus Price" scatter plot (x-axis Rooms, y-axis Average Price) with the fitted line]

Computation via Weka
 Open the data file, remove unwanted (non-numeric) attributes
 Under the classify tab, choose "functions/Simple Linear Regression"
– Select "use training set" for test options
– Hit start!
 Partial output:
9.1 * RM - 34.67
Time taken to build model: 0.03 seconds
=== Evaluation on training set ===
=== Summary ===
Correlation coefficient      0.6884
Mean absolute error          4.4953
Root mean squared error      6.6644

Extra practice
 Replicate this analysis in your preferred tool
– Spreadsheet, R, gnuplot, Weka or other
 Build simple linear regression models for other numerical variables in the housing.data set:
– E.g. price versus pollution (nitric oxide rates)
– E.g. pollution versus proportion of old properties
– Compute the quantities needed (Var[Y], Cov[X,Y] etc.), and check that they agree with those found by the software
– How well does the regression explain the data (in terms of R², RMSE)?
– What might explain these correlations or lack of correlations?

Multiple linear regression
 Suppose we want to include more variables
– Model: Y = aX1 + bX2 + cX3 + … + z
– Y: dependent (response) variable; Xi: explanatory variables
– Could follow the same outline: write out the squared error and minimize
– But the notation gets ugly and messy
– Instead, we can solve via a matrix representation
 Matrix form of linear regression with d explanatory variables
– Let the (d+1) model parameters be (w0, w1, …, wd) = w
 Prediction for (vector) x will be f(x) = w0·1 + Σi=1..d wi xi
– Encode the n examples as an n × (d+1) matrix X:
    ( 1  x11  x12  x13  … )
    ( 1  x21  x22  x23  … )
    ( 1  x31  x32  x33  … )
    (  …                  )
 The first column is all 1s, for the constant term
– Vector of n corresponding yi values, y

Linear Algebra Refresher
 An r × c matrix X has r rows, c columns
– Xi,j is the entry in row i, column j
 Addition: add two r × c matrices entry-wise: (X+Y)i,j = Xi,j + Yi,j
 Multiplication: multiply an r × n matrix X with an n × c matrix Y to get the r × c matrix Z = XY
– Zi,k = Σj=1..n Xi,j Yj,k
– The identity matrix I is the n × n matrix such that IX = XI = X
 Inverse: X⁻¹ is the matrix (if it exists) such that X⁻¹X = XX⁻¹ = I
 Transpose: Xᵀ switches rows and columns: (Xᵀ)i,j = Xj,i
– (X + Y)ᵀ = Xᵀ + Yᵀ
– (XY)ᵀ = YᵀXᵀ

Sum of squares error
 The column vector of predictions on the data is Xw
– The residuals are the column vector (y – Xw)
 The residual sum of squares is now
  RSS(w) = (y – Xw)ᵀ(y – Xw) = (yᵀ – wᵀXᵀ)(y – Xw)
         = yᵀy – yᵀXw – wᵀXᵀy + wᵀXᵀXw
– The inner product of the residuals with themselves (their squared L2 norm, ‖y – Xw‖₂²)
– A quadratic function in the (d+1) parameters w

Optimization
 (Partial) derivatives for quantities with respect to a vector a:
– ∂b/∂a = [∂b/∂a1, ∂b/∂a2, …]ᵀ   (Rule 1)
– ∂(aᵀb)/∂a = ∂(bᵀa)/∂a = b   (Rule 2)
– ∂(aᵀBᵀBa)/∂a = 2BᵀBa   (Rule 3)
 Take the (partial) derivative with respect to (each) parameter:
  ∂RSS/∂w = ∂/∂w (yᵀy – yᵀXw – wᵀXᵀy + wᵀXᵀXw)
          = 0 – (yᵀX)ᵀ – Xᵀy + 2XᵀXw
          = 0 – 2Xᵀy + 2XᵀXw
          = –2Xᵀ(y – Xw)
– Set ∂RSS/∂w = 0, giving Xᵀy – XᵀXw = 0
– So w = (XᵀX)⁻¹Xᵀy (XᵀX is square and symmetric), assuming (XᵀX)⁻¹ exists
 Need XᵀX to be non-singular: its determinant is non-zero
 X cannot have linearly dependent columns

Multilinear regression: matrix example
 Consider the housing data (n = 506 examples)
– Predict y = sale price based on X = pollution and rooms
 w = (XᵀX)⁻¹Xᵀy : (XᵀX) is a 3 × 3 matrix
– (XᵀX)i,j is the inner product of the values of the ith variable with the jth variable
 R can do matrix algebra:
X <- matrix(1:1518, nrow = 506, ncol = 3)
X[,1] <- 1; X[,2] <- house$V5; X[,3] <- house$V6; y <- house$V14
w <- (solve(t(X) %*% X) %*% t(X)) %*% y   # %*% is matrix multiply
RSS <- t(y - X %*% w) %*% (y - X %*% w)   # t(X) is matrix transpose
– XᵀX =  [  506       280.67    3180.25 ]
         [  280.67    162.47    1751.52 ]
         [  3180.25   1751.52   20234.6 ]
– (XᵀX)⁻¹ = [  0.286   -0.141   -0.032 ]
            [ -0.141    0.162    0.008 ]
            [ -0.032    0.008    0.004 ]
 Obtain wᵀ = [-18.20, -18.97, 8.16], RSS = 19844.37

Prediction using the model
 Given a new data point x, define x' = [1 x1 … xd]
– The prediction is x'w = x'(XᵀX)⁻¹Xᵀy (sketched below)
 As before, the quality of the fit is given by R²
– Computed as the fraction of the sum of squares explained by the regression
– R² = 1 – SSresidual/SStotal = 1 – SSresidual/(n Var(y))
 Same interpretation of R² as in simple linear regression:
– Close to 1: good fit of the model
– Close to 0: weak fit of the model
 In R: R2 <- 1 - RSS/(length(y)*var(y))   # = 0.536
– Fortunately, R has multiple linear regression built in…

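A minimal sketch of such a prediction in R, recomputing w as on the previous slide; the new data point here is made up purely for illustration:
house <- read.table("housing.data", sep="", header=F)
X <- cbind(1, house$V5, house$V6); y <- house$V14        # constant, NOX, RM
w <- solve(t(X) %*% X) %*% t(X) %*% y                    # w = (X'X)^-1 X'y
xnew <- c(1, 0.5, 6.5)            # hypothetical region: constant term, NOX = 0.5, RM = 6.5
as.numeric(xnew %*% w)            # the prediction x'w
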
Multiple linear regression in R
 house <- read.table("housing.data",sep="", header=F)
# read the data
fit2 <- lm(house$V14 ~ house$V5 + house$V6)
# fit a linear model with V14 as Y, V5 and V6 as X
fit2 # show the parameters of the model
# Model: y = -18.2 – 19.0(nox) + 8.2(rooms)
summary(fit2)
# R2 = 0.53, up from 0.48
pairs(house$V14 ~ house$V5 + house$V6)
# plots of pairs of vars
Multiple linear regression in Weka
 Open the data file, remove unwanted (non-numeric) attributes
 Under the classify tab, choose "functions/LinearRegression"
– Select "use training set" for test options
– Hit start!
 Partial output:
class =
    -18.9706 * NOX +
      8.1567 * RM +
    -18.2059
Time taken to build model: 0.01 seconds
=== Evaluation on training set ===
=== Summary ===
Correlation coefficient      0.7317
Mean absolute error          4.2419
Root mean squared error      6.2624

Dealing with categoric attributes
 Regression is fundamentally numeric
– But we can numerically encode categoric (explanatory) variables
 Simple case: binary attribute (e.g. Male/Female)
– Create a variable that is 0 if male, 1 if female
– Include this new variable in the regression
 General categoric attributes (e.g. Country): "dummy coding"
– Create a binary variable for each possibility
– E.g. England (T/F), Mexico (T/F), France (T/F)…
– Include all these variables in the regression
– Effectively, this adds a different constant for each category (see the sketch below)

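A minimal sketch of dummy coding in R, using made-up values; model.matrix() shows the indicator columns, and lm() applies the same encoding automatically:
country <- factor(c("England", "Mexico", "France", "England", "France"))   # made-up categoric data
y <- c(3.1, 4.0, 2.2, 3.3, 2.5)                                            # made-up responses
model.matrix(~ country)     # indicator (dummy) columns; one level becomes the baseline
lm(y ~ country)             # the regression gets a separate constant shift per category
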
Kitchen sink regression
 Build a regression model for price, putting in all the other variables
– R automatically handles categoric variables and skips dependencies:
fit3 <- lm(house$V14 ~ house$V1 + house$V2 + house$V3 + house$V4 + house$V5
  + house$V6 + house$V7 + house$V8 + house$V9 + house$V10 + house$V11
  + house$V12 + house$V13)
summary(fit3)
– Weka can "automatically" convert categoric variables to numeric
 R² value ≈ 0.74 (R ≈ 0.86)
– We have explained a lot more of the variance (though not all)
– But we have built a complex model (dozens of variables/parameters)
– At risk of "kitchen sink regression": throw in everything possible
 May find false correlations, leading to erroneous conclusions
– Some variables are significant: pollution, rooms, lower status
 Others are not: age of housing, industry

Regularization in regression
 There is a danger of overfitting the model to the data
– It fits what we have seen already, but not future data
 Regularization: include the parameters in the optimization
– Minimize (y – Xw)ᵀ(y – Xw) + α‖w‖ₚᵖ instead of just (y – Xw)ᵀ(y – Xw)
 Tikhonov regularization, also known as ridge regression: p = 2
– Seems most natural, since both terms are quadratic
– Has a closed-form solution similar to before: w = (XᵀX + αI)⁻¹Xᵀy (sketched below)
– Implemented in Weka Linear Regression: the parameter 'ridge' sets α
– To pick α, test the quality of predictions on withheld data
 Least absolute shrinkage and selection operator (LASSO): p = 1
– Minimize (y – Xw)ᵀ(y – Xw) + α‖w‖₁ – no closed form, not in Weka
– Tends to set many coefficients to zero: effectively model selection!

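A minimal sketch of the ridge closed form in R, on the same design matrix as the earlier matrix example; the value of α here is arbitrary and would normally be chosen by testing on withheld data:
house <- read.table("housing.data", sep="", header=F)
X <- cbind(1, house$V5, house$V6); y <- house$V14
alpha <- 1                                                            # arbitrary example value
w_ridge <- solve(t(X) %*% X + alpha * diag(ncol(X))) %*% t(X) %*% y   # (X'X + alpha I)^-1 X'y
w_ridge           # note: this simple form also shrinks the constant term w0
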
Correlations in data
 Two measures of education level in the adult data
– Years of education (numeric), education level (categoric)
 How do they relate?
adult <- read.csv("adult.data", header=F)
fit4 <- lm(adult$V5 ~ adult$V4)   # R2 = 1!
plot(adult$V5 ~ adult$V4)
– Years of education is entirely determined by education level
 Conjecture: years of education is computed from education level
 Extra Practice: build a regression model for years of education
– Based on demographic factors: age, sex, nationality
– Experiment with adding and removing features: what works best?

Fitting non-linear functions
 Not all relationships are linear
– Some are quadratic, cubic…
– Exponential, logarithmic…
– …
 Do we need to find new methods for each different model?
 Idea: try transforming the data so that we seek a linear model
– Suppose we have a quadratic model: y = ax² + bx + c
– Introduce a new variable z = x²
– The model is now y = az + bx + c – linear!
– Use multiple linear regression to learn the parameters of this model

Non-linear models
 How exactly to learn the new model?
– For each example (xi, yi), create a new example (1, xi, xi², yi)
– Apply linear regression to the expanded data
 Do we need to include terms like c·xi² or (xi + d)²?
– No, because these models can already be expressed as ax² + bx + c
 But we do need to include the terms of the model we want
– Cannot learn a quadratic model unless x² is present
– log(xi) : logarithmic model
– 1/xi : reciprocal model

Fitting a non-linear model
 house <- read.table("housing.data",sep="", header=F)   # read the data
fit1 <- lm(house$V14 ~ house$V6)                  # R2 = 0.4835
fit2 <- lm(house$V14 ~ house$V6 + I(house$V6^2))
# fit a linear model with V14 as Y, V6 and V6² as X
fit2        # show the parameters of the model
(Intercept)     house$V6     I(house$V6^2)
      66.06       -22.64              2.47
summary(fit2)   # R2 = 0.548 – much better

Exponential model
 Suppose we want to learn a model of the form y = α exp(β x)
– For unknown parameters α, β
 Transforming x to exp(x) will not suffice
– c·exp(x) ≠ α exp(β x)
 Here, we can transform y:
– (ln y) = (ln α) + β x : simple linear regression
 However: linear regression now minimizes the sum of squared differences around (ln y)
– We have changed the objective function
– In R: fit <- lm(log(house$V14) ~ house$V1)

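To recover the original parameters from such a fit, exponentiate the intercept; a minimal sketch (the choice of house$V1 as the explanatory variable simply follows the line above):
house <- read.table("housing.data", sep="", header=F)
efit <- lm(log(house$V14) ~ house$V1)   # fit (ln y) = (ln alpha) + beta x
alpha <- exp(coef(efit)[1])             # alpha = exp(intercept)
beta  <- coef(efit)[2]                  # beta = slope
c(alpha, beta)                          # parameters of y = alpha * exp(beta * x)
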
Categoric outputs
 What about regression to predict categoric attributes?
– Regression so far predicts a number
– We will focus on binary outputs
 Can encode True = 1, False = 0, and try to use regression
– Predicts 0.03: probably False
– Predicts 0.82: probably True
 Not a principled approach
– Prediction of 13.3: really, really true???
– Prediction of -5.7: ???
– How to interpret the slope of the dependency on variables?

Odds Ratios
 Need to transform the data so that regression makes sense
– Perform a non-linear transform on the dependent variable
 Convert the probability of an outcome into an odds ratio
– Odds ratio(p) = p/(1-p)
– E.g. toss a coin, get heads: p = ½, odds ratio = 1
– E.g. roll a dice, get a 6: (1/6)/(5/6) = 0.2
– E.g. roll a dice, get anything other than 6: (5/6)/(1/6) = 5
 Using odds ratios ensures that any positive value is meaningful
– As p → 1, odds ratio → ∞
– As p → 0, odds ratio → 0
 What about negative values?

Log Odds Ratios
 Taking the logarithm of the odds ratio allows negative values
– Heads on a coin: ln(1) = 0
– 6 on a dice: ln(0.2) = -1.6
– Anything other than 6 on a dice: ln(5) = 1.6
 This is the logit transformation (see the sketch below)
– Logit(p) = ln(p/(1-p))
– Invertible: p = 1/(1 + exp(-l)) for l = logit(p)
 Now all reals have meaning
– As p → 0, logit(p) → –∞
– As p → 1, logit(p) → +∞

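The transformation and its inverse as two one-line R functions (our own helpers, reproducing the coin and dice values above):
logit    <- function(p) log(p / (1 - p))    # log odds ratio
invlogit <- function(l) 1 / (1 + exp(-l))   # back to a probability
logit(c(1/2, 1/6, 5/6))                     # 0, -1.61, 1.61 as above
invlogit(logit(1/6))                        # recovers 1/6
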
Logistic Regression
 Apply the logit function to the dependent variable: logistic regression
– Model: logit(Y) = aX1 + bX2 + cX3 + … + z
– Transformed model: Y = 1/(1 + exp(-(aX1 + bX2 + cX3 + … + z)))
– Problem: this does not yield a closed form, as linear regression did
 Treat as an optimization problem and solve iteratively
– Begin with an initial estimate of the parameters (e.g. all 1)
– Measure "goodness of fit" via the log-likelihood function
 Likelihood: probability of seeing the data given the (current) model
– Adjust parameters based on the local gradient (Newton-Raphson)
– Repeat until convergence: obtain maximum likelihood estimates
– Built into R and Weka; needs plug-ins for spreadsheets ("solver")
 Measure the quality of the model by how well it fits the training data

Logistic Regression in R
 adult <- read.csv("adult.data", header=F)
# use the adult data set, with binary class
plot(adult$V15 ~ adult$V5)
lfit <- glm(adult$V15 ~ adult$V5, family="binomial")
# family = "binomial" tells R to use logistic regression
summary(lfit)
predict(lfit, type="response")           # predicted probabilities
round(predict(lfit, type="response"))    # rounded to 0/1 predictions
pairs <- paste(round(predict(lfit, type="response")), adult$V15)
table(pairs)   # get statistics on the number of correct predictions
 summary(adult$V15)   # compare to overall statistics
<=50K    >50K
24720    7841

Logistic regression in R
 lfit <- glm(adult$V15 ~ adult$V1 + adult$V2 + adult$V4 + adult$V6 +
  adult$V7 + adult$V8 + adult$V9 + adult$V10, family="binomial")
# family = "binomial" tells R to use logistic regression
summary(lfit)   # print a summary of the fit
pairs <- paste(round(predict(lfit, type="response")), adult$V15)
table(pairs)
 Output:
pairs
0 <=50K    0 >50K    1 <=50K    1 >50K
  22840      3548       1880      4293
 summary(adult$V15)
<=50K    >50K
24720    7841

Interpreting a Logistic Regression Model
 Linear model y = ax + b: an increase of 1 in x increases y by a
 Logistic model logit(y) = aX1 + bX2 + cX3 + … + z :
an increase of 1 in X1 multiplies the odds ratio of y by a factor of exp(a)
– Odds ratio(y | X1, X2, ...) = exp(aX1 + bX2 + cX3 + … + z)
– Odds ratio(y | (X1+1), X2, ...) = exp(a(X1+1) + bX2 + cX3 + … + z)
  = exp(a)·exp(aX1 + bX2 + cX3 + … + z)
– A factor of exp(a) between the two odds ratios
 Example: predicting income based on years of education
– Coefficients:
  (Intercept)    adult$V5
      -5.0197      0.3643
– Each extra year of education multiplies the odds of high income by exp(0.3643) = 1.44: quite significant! (sketched below)

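In R, exponentiating the fitted coefficients gives these multiplicative effects directly; a minimal sketch, assuming adult.data is available as on the earlier slides:
adult <- read.csv("adult.data", header=F)
lfit <- glm(adult$V15 ~ adult$V5, family="binomial")   # income class vs years of education
exp(coef(lfit))                                        # exp(0.3643) is approx 1.44 for adult$V5
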
Logistic Regression in Weka
 Select 'adult.arff', remove unwanted attributes
– Select the classify tab
– Choose the classifier: classifiers/functions/Logistic
– For test options, pick 'use training set'
– Pick the target attribute
– Hit 'start'
 The result shows the model and some measures of quality:
Time taken to build model: 6.35 seconds
=== Evaluation on training set ===
=== Summary ===
Correctly Classified Instances      40612      83.1497 %
Incorrectly Classified Instances     8230      16.8503 %

Multiple logistic regression
 Have so far assumed that the dependent variable is binary
– How to handle categoric dependent variables?
 Quick fix: "one-versus-all" (sketched below)
– For category C, create a binary variable C : (not C)
– Apply logistic regression for this variable
– Repeat for each category
– Select the category assigned the greatest likelihood
 More generally: this is an example of classification
– Studied in depth in subsequent lectures

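A minimal sketch of one-versus-all in R on a made-up three-class target (all names and values here are illustrative, not from the slides):
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9)                         # made-up explanatory values
target <- factor(c("A","A","A","B","B","B","C","C","C"))  # made-up 3-class target
classes <- levels(target)
scores <- sapply(classes, function(cl)                    # one logistic model per class
  predict(glm((target == cl) ~ x, family="binomial"), type="response"))
predicted <- classes[max.col(scores)]                     # pick the class with the greatest fitted likelihood
table(predicted, target)
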
Summary of Regression
 Regression models the data to predict values
 Simple linear regression models one attribute in terms of another
 Multiple linear regression expresses an attribute in terms of many
 Non-linear regression can be done via transformation of variables
 Logistic regression predicts binary values (classification)
 Background reading:
– Chapters 2 (simple linear regression), 3 (multiple linear regression), 7 (polynomial regression models), 8 (indicator variables) and 13.2 (logistic regression) in Introduction to Linear Regression Analysis (Montgomery, Peck, Vining)
– Logistic Regression and Newton's Method (how to fit the model)
