Short Course Spring 2012 - Regression in JMP


Presentation and Data

http://www.lisa.stat.vt.edu → Short Courses → Regression Analysis Using JMP → Download Data to Desktop
Regression Analysis Using JMP
Mark Seiss, Dept. of Statistics
February 28, 2012
Presentation Outline
1. Simple Linear Regression
2. Multiple Linear Regression
3. Regression with Binary and Count Response Variables
Presentation Outline
Questions/Comments
Individual Goals/Interests
Simple Linear Regression
1. Definition
2. Correlation
3. Model and Estimation
4. Coefficient of Determination (R²)
5. Assumptions
6. Example
Simple Linear Regression
• Simple Linear Regression (SLR) is used to study the relationship
between a variable of interest and another variable.
• Both variables must be continuous
• Variable of interest known as Response or Dependent
Variable
• Other variable known as Explanatory or Independent Variable
• Objectives
• Determine the significance of the explanatory variable in
explaining the variability in the response (not necessarily
causation).
• Predict values of the response variable for given values of the
explanatory variable.
Simple Linear Regression
• Scatterplots are used to graphically examine the relationship
between two quantitative variables.
• Linear or Non-linear
• Positive or Negative
Simple Linear Regression
[Four example scatterplots: no relationship, non-linear relationship, positive linear relationship, negative linear relationship]
Simple Linear Regression
• Correlation
• Measures the strength of the linear relationship between two
quantitative variables.
• Pearson Correlation Coefficient
• Assumption of normality
• Calculation: r = Σ(xi − x̄)(yi − ȳ) / √[Σ(xi − x̄)² Σ(yi − ȳ)²]
• Spearman’s Rho and Kendall’s Tau are used for non-normal
quantitative variables.
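For readers working outside of JMP, all three coefficients are available in SciPy. A minimal sketch, using toy x and y values made up for illustration:

```python
import numpy as np
from scipy import stats

# Toy data (made up for illustration)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.3, 2.9, 4.1, 4.8, 6.0])

# Pearson's r assumes (approximate) normality; Spearman and Kendall are rank-based
r, r_p = stats.pearsonr(x, y)
rho, rho_p = stats.spearmanr(x, y)
tau, tau_p = stats.kendalltau(x, y)

print(f"Pearson r = {r:.3f} (p = {r_p:.3f})")
print(f"Spearman rho = {rho:.3f}, Kendall tau = {tau:.3f}")
```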
Simple Linear Regression
• Properties of Pearson Correlation Coefficient
• -1 ≤ r ≤ 1
• Positive values of r: as one variable increases, the other
increases
• Negative values of r: as one variable increases, the other
decreases
• Values close to 0 indicate no linear relationship between the
two variables
• Values close to +1 or -1 indicate strong linear relationships
• Important note: Correlation does not imply causation
Simple Linear Regression
• Pearson Correlation Coefficient: General Guidelines
• 0 ≤ |r| < 0.2 : Very Weak linear relationship
• 0.2 ≤ |r| < 0.4 : Weak linear relationship
• 0.4 ≤ |r| < 0.6 : Moderate linear relationship
• 0.6 ≤ |r| < 0.8 : Strong linear relationship
• 0.8 ≤ |r| ≤ 1.0 : Very Strong linear relationship
Simple Linear Regression
• The Simple Linear Regression Model
• Basic Model: response = deterministic + stochastic
• Deterministic: model of the linear relationship between X
and Y
• Stochastic: Variation, uncertainty, and miscellaneous
factors
• Model: yi = β0 + β1xi + εi, where
yi= value of the response variable for the ith observation
xi= value of the explanatory variable for the ith observation
β0= y-intercept
β1= slope
εi= random error, iid Normal(0,σ2)
Simple Linear Regression
• Least Squares Estimation: b1 = Σ(xi − x̄)(yi − ȳ) / Σ(xi − x̄)², b0 = ȳ − b1x̄
• Predicted Values: ŷi = b0 + b1xi
• Residuals: ei = yi − ŷi
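To make the formulas concrete, here is a minimal sketch of the same calculations in Python with NumPy; the data values are made up for illustration:

```python
import numpy as np

# Toy data (made up for illustration)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Least-squares estimates of slope and intercept
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()

# Predicted values and residuals
y_hat = b0 + b1 * x
resid = y - y_hat

print(f"b0 = {b0:.3f}, b1 = {b1:.3f}")
```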
Simple Linear Regression
• Interpretation of Parameters
• β0: Value of Y when X=0
• β1: Change in the value of Y with an increase of 1 unit of X
(also known as the slope of the line)
• Hypothesis Testing
• β0- Test whether the true y-intercept is different from 0
Null Hypothesis: β0=0
Alternative Hypothesis: β0≠0
• β1- Test whether the slope is different from 0
Null Hypothesis: β1=0
Alternative Hypothesis: β1≠0
Simple Linear Regression
• Analysis of Variance (ANOVA) for Simple Linear Regression
Source | Df    | Sum of Squares | Mean Square     | F Ratio      | P-value
Model  | 1     | SSR            | MSR = SSR/1     | F1 = MSR/MSE | P(F(1, n-2) > F1)
Error  | n - 2 | SSE            | MSE = SSE/(n-2) |              |
Total  | n - 1 | SST            |                 |              |
Simple Linear Regression
• Coefficient of Determination (R²)
• Percent of variation in the response variable (Y) that is explained by the least squares regression line
• 0 ≤ R² ≤ 1
• Calculation: R² = SSR/SST = 1 − SSE/SST
Simple Linear Regression
• Assumptions of Simple Linear Regression
1. Independence
Residuals are independent of each other
Usually related to how the data were collected, especially time-ordered data
Tested by plotting collection time vs. residuals
Parametric test: Durbin-Watson Test
2. Constant Variance
Variance of the residuals is constant
Tested by plotting predicted values vs. residuals
Parametric test: Brown-Forsythe Test
Simple Linear Regression
• Assumptions of Simple Linear Regression
3. Normality
Residuals are normally distributed
Tested by evaluating histograms and normal-quantile plots of
residuals
Parametric test: Shapiro-Wilk test
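For readers checking these assumptions outside of JMP, statsmodels and SciPy cover the tests above. A minimal sketch, with toy data made up for illustration:

```python
import numpy as np
import statsmodels.api as sm
from scipy import stats
from statsmodels.stats.stattools import durbin_watson

# Toy data (made up for illustration)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
y = np.array([2.0, 4.1, 5.9, 8.2, 9.8, 12.1, 14.2, 15.9])

fit = sm.OLS(y, sm.add_constant(x)).fit()
resid = fit.resid

# 1. Independence: Durbin-Watson statistic (values near 2 suggest no autocorrelation)
print("Durbin-Watson:", durbin_watson(resid))

# 3. Normality: Shapiro-Wilk test on the residuals
w, p = stats.shapiro(resid)
print(f"Shapiro-Wilk: W = {w:.3f}, p = {p:.3f}")

# 2. Constant variance: inspect a plot of fitted values vs. residuals,
# e.g. with matplotlib: plt.scatter(fit.fittedvalues, resid)
```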
Simple Linear Regression
• Constant Variance: Plot of Fitted Values vs. Residuals
[Two plots of residuals against predicted values: a good residual plot with no pattern, and a bad residual plot with variability increasing]
Simple Linear Regression
• Normality: Histogram and Q-Q Plot of Residuals
[Two histogram and Q-Q plot pairs of residuals: one where the normal assumption is appropriate, one where it is not]
Simple Linear Regression
• Some Remedies
• Non-Constant Variance: Weighted Least Squares
• Non-Normality: Box-Cox Transformation
• Dependence: Autoregressive Models
Simple Linear Regression
• Example Dataset: Chirps of Ground Crickets
• Pierce (1949) measured the frequency (the number of wing vibrations per second) of chirps made by a ground cricket at various ground temperatures.
• Filename: chirp.jmp
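For readers following along outside of JMP, a minimal sketch of the same fit with statsmodels; the CSV export and the column names `Chirps` and `Temperature` are assumptions, so check them against the actual file:

```python
import pandas as pd
import statsmodels.formula.api as smf

# Load the chirp data (assumed exported from chirp.jmp to CSV; column names are assumed)
chirp = pd.read_csv("chirp.csv")

# Regress chirp frequency on ground temperature
fit = smf.ols("Chirps ~ Temperature", data=chirp).fit()
print(fit.summary())  # coefficients, t-tests, ANOVA F, and R-squared
```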
Simple Linear Regression
• Questions/Comments about Simple Linear Regression
Multiple Linear Regression
1. Definition
2. Categorical Explanatory Variables
3. Model and Estimation
4. Adjusted Coefficient of Determination
5. Assumptions
6. Model Selection
7. Example
Multiple Linear Regression
• Explanatory Variables
• Two Types: Continuous and Categorical
• Continuous Predictor Variables
• Examples – Time, Grade Point Average, Test Score, etc.
• Coded with one parameter – β#x#
• Categorical Predictor Variables
• Examples – Sex, Political Affiliation, Marital Status, etc.
• Actual value assigned to Category not important
• Ex) Sex - Male/Female, M/F, 1/2, 0/1, etc.
• Coded Differently than continuous variables
Multiple Linear Regression
• Categorical Explanatory Variables
• Consider a categorical explanatory variable with L categories
• One category selected as reference category
• Assignment of Reference Category is arbitrary
• Variable represented by L-1 dummy variables
• Model Identifiability
• Effect Coding (Used in JMP)
• xk = 1 if the explanatory variable is equal to category k, 0 otherwise
• xk = -1 for all k if the explanatory variable equals the reference category L
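To make effect coding concrete, here is a minimal sketch in Python with pandas; the variable name and category labels are made up for illustration:

```python
import pandas as pd

# Toy categorical variable with L = 3 categories (names are made up)
df = pd.DataFrame({"party": ["Dem", "Rep", "Ind", "Dem", "Ind"]})

# Effect coding: L-1 dummy columns; rows in the reference category get -1
reference = "Ind"
levels = [lvl for lvl in df["party"].unique() if lvl != reference]
for lvl in levels:
    df[f"x_{lvl}"] = (df["party"] == lvl).astype(int)
    df.loc[df["party"] == reference, f"x_{lvl}"] = -1

print(df)  # Dem -> (1, 0), Rep -> (0, 1), Ind -> (-1, -1)
```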
Multiple Linear Regression
• Similar to simple linear regression, except now there is more than
one explanatory variable, which may be quantitative and/or
qualitative.
• Model: yi = β0 + β1x1i + β2x2i + … + βpxpi + εi, where
yi= value of the response variable for the ith observation
x#i= value of the explanatory variable # for the ith observation
β0= y-intercept
β#= parameter corresponding to explanatory variable #
εi= random error, iid Normal(0,σ2)
Multiple Linear Regression
• Least Squares Estimation (matrix form): b = (XᵀX)⁻¹Xᵀy
• Predicted Values: ŷ = Xb
• Residuals: e = y − ŷ
Multiple Linear Regression
• Interpretation of Parameters
• β0: Value of Y when all explanatory variables equal 0
• β#: Change in the value of Y with an increase of 1 unit in X#, in the presence of the other explanatory variables
• Hypothesis Testing
• β0 - Test whether the true y-intercept is different from 0
Null Hypothesis: β0=0
Alternative Hypothesis: β0≠0
• β# - Test whether the change in Y with an increase of 1 unit in X# is different from 0, in the presence of the other explanatory variables
Null Hypothesis: β#=0
Alternative Hypothesis: β#≠0
Multiple Linear Regression
• Adjusted Coefficient of Determination (adjusted R²)
• Percent of variation in the response variable (Y) that is explained by the least squares regression line with explanatory variables x1, x2, …, xp
• Calculation: adjusted R² = 1 − (1 − R²)(n − 1)/(n − p − 1)
• The R² value will increase as explanatory variables are added to the model
• The adjusted R² introduces a penalty for the number of explanatory variables.
Multiple Linear Regression
• Other Model Evaluation Statistics
• Akaike Information Criterion (AIC or AICc)
• Schwarz Information Criterion (SIC)
• Bayesian Information Criterion (BIC)
• Mallows’ Cp
• Prediction Sum of Squares (PRESS)
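Outside of JMP, statsmodels exposes several of these criteria directly on a fitted model, which makes quick comparisons easy. A minimal sketch; the file and column names are made up for illustration:

```python
import pandas as pd
import statsmodels.formula.api as smf

# Made-up file and column names, purely for illustration
df = pd.read_csv("salary.csv")

fit1 = smf.ols("salary ~ years", data=df).fit()
fit2 = smf.ols("salary ~ years + degree", data=df).fit()

# Lower AIC/BIC favors a model; adjusted R-squared penalizes extra terms
for name, fit in [("years only", fit1), ("years + degree", fit2)]:
    print(f"{name}: AIC={fit.aic:.1f}, BIC={fit.bic:.1f}, adj R2={fit.rsquared_adj:.3f}")
```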
Multiple Linear Regression
• Model Selection
• 2 Goals: Complex enough to fit the data well; simple to interpret, does not overfit the data
• Study the effect of each explanatory variable on the
response Y
• Continuous Variable – Graph Y versus X
• Categorical Variable - Boxplot of Y for categories of X
Multiple Linear Regression
• Model Selection cont.
• Multicollinearity
• Correlation among the explanatory variables, which inflates the variance of the estimated coefficients
• Reduces the apparent significance of the affected variables
• Can occur whenever several related explanatory variables are used in the model
Multiple Linear Regression
• Algorithmic Model Selection
• Backward Selection: Start with all explanatory variables in the model and remove those that are insignificant
• Forward Selection: Start with no explanatory variables in the model and add the best explanatory variables one at a time
• Stepwise Selection: Start with two forward selection steps, then alternate backward and forward selection steps until there are no variables to add or remove
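Backward selection is easy to sketch by hand with statsmodels. This illustrative version (made-up file and column names, 0.05 cutoff, numeric predictors assumed so term names match column names) drops the least significant variable until everything remaining is significant:

```python
import pandas as pd
import statsmodels.formula.api as smf

def backward_select(df, response, predictors, alpha=0.05):
    """Drop the least significant predictor until all p-values < alpha."""
    preds = list(predictors)
    while preds:
        fit = smf.ols(f"{response} ~ {' + '.join(preds)}", data=df).fit()
        pvals = fit.pvalues.drop("Intercept")  # ignore the intercept test
        worst = pvals.idxmax()                 # least significant term
        if pvals[worst] < alpha:
            return fit  # every remaining predictor is significant
        preds.remove(worst)
    return None

# Made-up file and column names, purely for illustration
df = pd.read_csv("salary.csv")
final = backward_select(df, "salary", ["years", "degree", "pubs"])
```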
Multiple Linear Regression
• Example Dataset: Discrimination in Salaries
• A researcher was interested in whether there was discrimination in the salaries of tenure track professors at a small college. The researcher collected six variables from 52 professors.
• Filename: Salary.xls
• Reference: S. Weisberg (1985). Applied Linear Regression, Second Edition. New York: John Wiley and Sons. Page 194.
Multiple Linear Regression
• Other Multiple Linear Regression Issues
• Outliers
• Interaction Terms
• Higher Order Terms
Multiple Linear Regression
• Questions/Comments about Multiple Linear Regression
Regression with Non-Normal Response
1. Logistic Regression with Binary Response
2. Poisson Regression with Count Response
Logistic Regression
• Consider a binary response variable.
• Variable with two outcomes
• One outcome represented by a 1 and the other represented by a 0
• Examples:
Does the person have a disease? Yes or No
Who is the person voting for? McCain or Obama
Outcome of a baseball game? Win or Loss
Logistic Regression
• Consider the linear probability model
yi = β0 + β1xi
where
yi = response for observation i
xi = quantitative explanatory variable for observation i
• Predicted values represent the probability of Y = 1 given X
• Issue: Predicted probabilities for some subjects fall outside of the [0,1] range.
Logistic Regression
• Consider the logistic regression model
E[Yi] = P(Yi = 1 | xi) = π(xi) = exp(β0 + β1xi) / [1 + exp(β0 + β1xi)]
or, equivalently, on the logit scale
logit[π(xi)] = log{ π(xi) / [1 − π(xi)] } = β0 + β1xi
• Predicted values from the regression equation fall between 0 and 1
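A minimal sketch of fitting this model outside of JMP with statsmodels; the binary y and the x values are toy numbers made up for illustration:

```python
import numpy as np
import statsmodels.api as sm

# Toy binary data (made up for illustration)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
y = np.array([0, 0, 0, 1, 0, 1, 1, 1])

# Logistic regression: fitted values are P(Y = 1 | x), always in (0, 1)
fit = sm.Logit(y, sm.add_constant(x)).fit()
print(fit.params)                        # b0, b1 on the logit scale
print(fit.predict(sm.add_constant(x)))   # predicted probabilities
```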
Logistic Regression
• Interpretation of Coefficient β – Odds Ratio
• The odds ratio is a statistic that measures the odds of one event compared to the odds of another event.
• Say the probability of Event 1 is π1 and the probability of Event 2 is π2. Then the odds ratio of Event 1 to Event 2 is:
Odds Ratio = Odds(1)/Odds(2) = [π1/(1 − π1)] / [π2/(1 − π2)]
• Values of the odds ratio range from 0 to infinity
• Values between 0 and 1 indicate the odds of Event 2 are greater
• Values between 1 and infinity indicate the odds of Event 1 are greater
• A value equal to 1 indicates the events are equally likely
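A quick worked example of the formula, with toy probabilities made up for illustration; in the logistic model itself, exp(β1) plays the same role as the odds ratio for a one-unit increase in x:

```python
# Odds ratio from two event probabilities (toy numbers, made up)
p1, p2 = 0.8, 0.5
odds1 = p1 / (1 - p1)   # 4.0
odds2 = p2 / (1 - p2)   # 1.0
print("Odds ratio:", odds1 / odds2)  # 4.0 -> Event 1's odds are 4x Event 2's
```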
Logistic Regression
• Example Dataset: A researcher is interested in how GRE exam scores, GPA, and the prestige of a student's undergraduate institution affect admission into graduate school.
• Filename: Admittance.csv
• Important Note: JMP models the probability of the 0 category
Poisson Regression
• Consider a count response variable.
• Response variable is the number of occurrences in a given
time frame.
• Outcomes equal to 0, 1, 2, ….
• Examples:
Number of penalties during a football game.
Number of customers shopping at a store on a given day.
Number of car accidents at an intersection.
Poisson Regression
• Consider the model
yi = β0 + β1xi
where
yi = response for observation i
xi = quantitative explanatory variable for observation i
• Issue: Predicted values range from -∞ to +∞
Poisson Regression
• Consider the Poisson log-linear model
E[Yi | xi] = μi = exp(α + βxi)
or, equivalently, on the log scale
log(μi) = α + βxi
• Predicted response values fall between 0 and +∞
• In the case of a single predictor, an increase of one unit in x multiplies μ by a factor of exp(β)
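A minimal sketch of the Poisson log-linear fit outside of JMP with statsmodels; the counts are toy numbers made up for illustration:

```python
import numpy as np
import statsmodels.api as sm

# Toy count data (made up for illustration)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([0, 1, 1, 3, 5, 8])

# Poisson regression with the canonical log link
fit = sm.GLM(y, sm.add_constant(x), family=sm.families.Poisson()).fit()
print(fit.params)             # alpha, beta on the log scale
print(np.exp(fit.params[1]))  # multiplicative effect on mu per unit of x
```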
Poisson Regression
• Example Dataset: Researchers are interested in the number of awards earned by students at a high school. Possible explanatory variables include the type of program in which the student was enrolled (vocational, general, or academic) and the student's score on the final math exam.
Filename: Awards.csv
Attendee Questions
If time permits