STATISTICS 378 – APPLIED REGRESSION ANALYSIS
Serious Crime Analysis in Metropolitan U.S. Regions (1976-1977)
Terence Fung
Kamal Harris
Colin Sproat
Prit Phul
Table of Contents
Introduction
    Data Explanation
Conclusion
    Best Model
        Interpretation of the model
        Confidence Interval
        Correlation
        Importance of Variables
        Shortcomings of Model
Methodology
    Data Splitting
    Checking assumptions
        Independence
        Normality
        Homoscedasticity
    Fitting the best model
        Collinearity
        Selection Method
        Interaction Terms
        Final Model and influence
    Fitting model to prediction data
Appendix
    Correlation
    Interpretation of best model
    Collinearity
    Selection Method
    Influential Points – Training Group
    Fitted Model – Holdout Group
    Collinearity – Holdout Group
    Influential Points – Holdout Group
Introduction
Information on 141 Standard Metropolitan Statistical Areas (SMSAs) in the United States is provided in Neter et al. (1996)1. The data set contains 11 variables describing each metropolitan area over the years 1976–1977. We want to determine the relationship between serious crime and the regressors, and the importance of each regressor, using the least-squares regression method. By finding this relationship, we can see how well the model explains serious crime in any given metropolitan area. The goal of the study is to find the best model for estimating the amount of serious crime in a metropolitan area.
Data Explanation
We defined serious crime as our dependent variable. Unlike many of the other variables, serious crime is a random variable that can potentially be predicted from the other variables by the least-squares method. The regressors X2, X3, X4, X5, X6, X7, X8, X9, and X10 correspond to their variable numbers in the given data set. Variable X12 in the data set is categorical, indicating the geographic region. To handle this, we introduced two extra variables, X13 and X14, so that region membership is coded with binary indicator terms and does not influence the model in an unintended way. The coding is as follows: if the area is in NE, then (X12, X13, X14) = (1, 0, 0); if the area is in NC, then (X12, X13, X14) = (0, 1, 0); if the area is in S, then (X12, X13, X14) = (0, 0, 1); and if the area is in W, then (X12, X13, X14) = (0, 0, 0).
1 Neter, Kutner, Nachtsheim, and Wasserman (1996). Applied Linear Regression Models. IRWIN, pp. 1367–1368.
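For illustration, this region coding could be produced in R along the following lines. This is a minimal sketch: the data frame smsa and its region column are assumed names, not part of the original data set.

    # Build the three region indicators from a region factor;
    # W (West) is the baseline region with (X12, X13, X14) = (0, 0, 0).
    smsa$X12 <- as.numeric(smsa$region == "NE")  # North-East
    smsa$X13 <- as.numeric(smsa$region == "NC")  # North-Central
    smsa$X14 <- as.numeric(smsa$region == "S")   # South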
Conclusion
Best Model
Interpretation of the model
From the residual analysis of our data, the best-fit model for all 141 observations is:
yp = -2.70029 + 1123.22515·tX2 - 26.94736·tX6 + 0.1153·c8 - 229.5947·c10 - 0.44122·X12 - 0.18717·X13 + ε.
It should be noted that this model predicts the number of crimes in an area only approximately, not exactly. Holding all other variables constant: an increase of 1 unit in tX2 (the inverse of the land area squared) increases yp by 1123.22515; an increase of 1 unit in tX6 (the inverse of the number of active physicians) decreases yp by 26.94736; an increase of 1 unit in c8 (the centered percentage of high-school graduates) increases yp by 0.1153; and an increase of 1 unit in c10 (the centered inverse of total personal income) decreases yp by 229.5947. If the observation occurs within the NE region, yp is expected to decrease by 0.44122, and if it occurs within the NC region, yp is expected to decrease by 0.18717.
The value of R2 for the complete data set is 0.7676: 76.76% of the variability in y is explained by the regression model. Because R2 never decreases as variables are added, it is biased in favour of larger models. A better indicator is R2adj, which adjusts for the number of variables in the model. R2adj for the complete data set is 0.6746, so 67.46% of the variability in y is explained after this adjustment. R2adj safeguards against over-fitting the model: had we added unnecessary terms, R2adj would have decreased significantly, which is not the case.
The mean squared error (MSE) is 0.04999, the lowest among all of the models available to us via the all-possible-regressions method. For normally distributed data, MSE is an unbiased estimator of the error variance. The smaller we can make the MSE, the smaller the variance in the model will be, which is ideal.
Mallow's Cp for our model is 7. When Cp falls close to p, the regression equation shows minimal bias. A small Cp statistic is preferable, but it must be large enough to give a good estimator. Since Cp always equals p for the full model, this statistic was of no use in finding our best model.
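A minimal R sketch of how these criteria can be read off a fitted model, using the variable names defined in this report (the data frame smsa is an assumed name):

    fit <- lm(yp ~ tX2 + tX6 + c8 + c10 + X12 + X13, data = smsa)
    s <- summary(fit)
    s$r.squared        # R^2           (0.7676 reported above)
    s$adj.r.squared    # adjusted R^2  (0.6746)
    s$sigma^2          # MSE, the residual mean square (0.04999)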
Confidence Interval
If yp = ln((y/(X3·1000)) / (1 - y/(X3·1000))), then y = X3·1000·exp(yp)/(1 + exp(yp)). To find the 95% confidence interval for the data set, we computed the 95% confidence interval for the mean of each observation and then averaged the lower and upper bounds. Then, using the mean of the X3 values in place of X3, we converted this into a 95% confidence interval for the number of serious crimes in an area. The 95% CI for yp was (-3.6664, -2.2564). Considering multiple areas at a time and the total number of serious crimes, we found the 95% CI to be (45379.02, 56762.94).
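The back-transformation can be sketched in R as follows; backtransform is a hypothetical helper, and smsa$X3 is an assumed column name:

    # Map a value on the logit scale back to a number of serious crimes.
    backtransform <- function(yp, X3) X3 * 1000 * exp(yp) / (1 + exp(yp))
    # Apply it to the interval endpoints using the mean of X3:
    backtransform(c(-3.6664, -2.2564), mean(smsa$X3))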
Correlation
A correlation matrix was generated to see which variables are correlated with one another (see Appendix – Correlation). The matrix shows that only c10 and tX6 (total personal income versus number of active physicians) are strongly positively correlated. This relationship is plausible because physicians generally earn high personal incomes; the more physicians in an area, the higher the total personal income of the area is expected to be. Apart from this pair, we did not find any strongly correlated variables.
Importance of Variables
Through the forward and stepwise selection methods, X12 (the NE-region indicator) was found to be the most important variable in explaining serious crime. The second most important variable is tX6, the number of active physicians in the area. Based on our model, an area located in the NE region has a lower predicted number of serious crimes than an area in another region. An analogous relationship holds for the number of physicians in the area.
Shortcomings of Model
Some of our observations were determined to be influential points, so our model must be used with caution. Influential points exert great leverage on the fit, so the model is effectively explained by 5 or 6 observations rather than by the entire data set. We found that the influential points had a large impact on β1, the coefficient of the land-area term. The difference between β1 fitted to the holdout group with and without the influential points was 13178. Normally, robust regression would be performed to correct for influential points: it decreases their effect on the model and is far more resilient to outliers than the ordinary least-squares method.
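Such a robust fit was not performed in this study, but it can be sketched in R with the M-estimator in the MASS package (data frame name assumed):

    library(MASS)
    # Huber M-estimation downweights observations with large residuals.
    rfit <- rlm(yp ~ tX2 + tX6 + c8 + c10 + X12 + X13,
                data = smsa, psi = psi.huber)
    summary(rfit)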
Methodology
Data Splitting
Our data set was split into two groups, training and holdout, using the DUPLEX algorithm. The method begins by standardizing the data: zij = (xij - x̄j) / (Sjj)^0.5, where Sjj = Σi (xij - x̄j)^2. Then, using the Cholesky method, we obtained Z'Z = T'T, where T is an upper triangular matrix, and defined W = ZT^-1, an orthogonal matrix with unit variance. The W-matrix was calculated in the R statistical package and then entered into Microsoft Excel, where we calculated the Euclidean distances between all pairs of data points: a total of C(141,2) = 9,870 distances. The pair with the maximum distance was (1, 27), so observations 1 and 27 were placed in the training group. That pair was then deleted, the next largest distance was (4, 7), and observations 4 and 7 were placed in the holdout group. From then on, the point farthest from the observations already in the training group was added to the training group; this was done by summing the distances from each candidate observation to all prior training-group observations. That observation is then removed from the distance matrix, and the same step is performed for the holdout group. Since our data set is relatively small, we placed only 25 values in the holdout group; the remaining 116 data points were placed in the training group.
The observations entered into our holdout group (in ascending order) were: 2, 3, 4, 7, 10, 12, 15, 17, 19, 24,
42, 48, 59, 68, 73, 78, 92, 107, 112, 113, 124, 133, 137, 138, and 139.
The remaining observations were placed into our training group.
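The standardization and distance steps of DUPLEX can be sketched in R as follows. X is the assumed matrix of regressors; note that scale() divides by the sample standard deviation, which differs from (Sjj)^0.5 only by the constant factor √(n-1), so the distance ranking is unchanged.

    Z <- scale(X)                     # z_ij = (x_ij - xbar_j) / sd_j
    Tm <- chol(crossprod(Z))          # Z'Z = Tm'Tm, Tm upper triangular
    W <- Z %*% solve(Tm)              # orthogonal columns, unit variance
    D <- as.matrix(dist(W))           # all 141 x 141 Euclidean distances
    which(D == max(D), arr.ind = TRUE)[1, ]   # the farthest pair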
To check the validity of the split we calculated the determinant ratio (|XE'XE| / |XH'XH|)^(1/p) with 1/p = 1/13. The log-determinants generated by the R statistical package were 142.1728 and 127.9630 for the training and holdout groups respectively; applying the equation gives exp((142.1728 - 127.9630)/13) = 2.983374. For an even data split this value should be close to 1, but because we placed only 25 observations in the holdout group, a value under 3 is acceptable.
Checking assumptions
Independence
In order to perform least-squares linear regression we must verify three assumptions: independence, normality, and homoscedasticity. The variables of one observation should have no effect on the variables of any other observation. Assuming that the standard metropolitan areas were selected randomly, we can say that the observations are independent of each other. In addition, the least-squares errors of the model should be uncorrelated and normally distributed; for normally distributed errors, zero correlation (zero covariance) implies independence.
Normality
The split data set was entered into SAS. We must check the data for normality and constant variance; without these assumptions we cannot create a valid model. The normal probability plot of the residuals makes it clear that the normality assumption does not hold: the residuals show a light-tailed distribution. To fix this, we perform a transformation on Y.
The transformation we used was:
yp = ln((y/(X3·1000)) / (1 - y/(X3·1000))).
The variable X3 (total population) is given in thousands, so we multiplied it by 1,000; Y is the number of serious crimes in the area. A more plausible quantity to model is serious crime relative to the population of the area. The crime count is a random variable that follows a Poisson distribution, which is discrete on [0, ∞); the transformation maps it onto a continuous scale on (-∞, ∞), matching the support of a normal distribution.
Y* = y/(X3·1000) is a proportion, so its range is [0, 1]. Dividing Y* by 1 - Y* gives Y*/(1 - Y*), which has range [0, ∞). Taking the natural logarithm gives yp = ln(Y*/(1 - Y*)); since ln tends to -∞ as its argument tends to 0, the range of yp is (-∞, ∞), the same as a normal distribution.
Once we transformed y, we again checked the normality of the model. The normal probability plot now follows a linear pattern, so the normality assumption holds. Next we check for homoscedasticity.
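A sketch of this transformation and the re-check in R; the column names and the subset of regressors shown are assumptions of the sketch:

    Ystar <- smsa$Y / (smsa$X3 * 1000)     # crime as a proportion, in [0, 1]
    smsa$yp <- log(Ystar / (1 - Ystar))    # logit: maps (0, 1) onto (-Inf, Inf)
    fit <- lm(yp ~ X2 + X6 + X8 + X10 + X12 + X13, data = smsa)
    r <- rstudent(fit)
    qqnorm(r); qqline(r)   # an approximately linear plot supports normality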
Homoscedasticity
Homoscedasticity does not hold for this data set: we want the residuals in the plot to lie along a horizontal band, and they do not. For this to occur, transformation of our variables is required.
Looking at the residuals-versus-regressor graphs, the following transforms were made:
tX2 = 1/X2^2
tX3 = 1/X3
tX6 = 1/X6
tX7 = 1/X7
tX9 = 1/X9
tX10 = 1/X10
where tX* denotes the transformed variable.
After these transformations the residual plot shows a horizontal band, so homoscedasticity holds for the model. This also implies that the expected value of the least-squares error is zero, which is another assumption that must hold.
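A sketch of these diagnostic plots and transforms in R, on an illustrative subset of regressors:

    fit0 <- lm(yp ~ X2 + X6 + X10, data = smsa)
    plot(smsa$X2, resid(fit0))       # pattern here motivates transforming X2
    smsa$tX2  <- 1 / smsa$X2^2       # the transforms listed above
    smsa$tX6  <- 1 / smsa$X6
    smsa$tX10 <- 1 / smsa$X10
    fit1 <- lm(yp ~ tX2 + tX6 + tX10, data = smsa)
    plot(fitted(fit1), resid(fit1))  # want a horizontal band around zero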
Fitting the best model
With our assumptions satisfied, we may begin to fit our best model.
Collinearity
Before selecting our best model, we must check for collinearity between the variables (see Appendix – Collinearity). We found that tX3, X8, tX9, and tX10 had collinearity problems. To fix this, we centered these variables by subtracting each variable's mean from its observed values (c* = X* - mean(X*)):
c3 = tX3 - 0.0021298
c8 = X8 - 54.35
c9 = tX9 - 0.0047905
c10 = tX10 - 0.000347226
Centering the data removed the collinearity problems, and we may now try to find the best model.
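A sketch of the collinearity check and centering in R. The vif() function is from the car package, and kappa() computes the condition number of the design matrix slightly differently from SAS's condition indices; the regressor list is an assumption of the sketch.

    library(car)
    fit <- lm(yp ~ tX2 + tX3 + X4 + X5 + tX6 + tX7 + X8 + tX9 + tX10 +
                X12 + X13 + X14, data = smsa)
    vif(fit)                                 # large values (e.g. > 10) flag collinearity
    kappa(model.matrix(fit), exact = TRUE)   # condition number of the design
    # Centering: c* = X* - mean(X*)
    smsa$c8  <- smsa$X8   - mean(smsa$X8)
    smsa$c10 <- smsa$tX10 - mean(smsa$tX10)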
Selection Method
We performed the forward, backward, and stepwise selection methods to remove variables not deemed important in predicting yp (see Appendix – Selection Method).
Forward method: tX2, tX6, c8, X12, X13.
Backward method: tX2, c8, c10, X12, X13.
Stepwise method: tX2, tX6, c8, X12, X13.
After reducing the model, we performed the all-possible-regressions method on tX2, tX6, c8, c10, X12, X13, using R2, Cp, R2Adj, and MSE as criteria to select the best model. R2_0 is the adequate-subset threshold: any R2 value under R2_0 is considered unusable. R2_0 = 1 - (1 - R2_k+1)(1 + d_α,n,k), where d_α,n,k = k·F_α,k,n-k-1 / (n - k - 1). At the α = 0.05 significance level, d_α,n,k ≈ 0.121682, so R2_0 = 1 - (1 - 0.5074)(1 + 0.121682) = 0.447459; any R2 value below this is deemed unacceptable. MSE should be at its minimum, not a value that is increasing. Mallow's Cp should be as close to p as possible, because Cp helps us determine how many variables to use. We found that the model could not be narrowed down further, with R2 = 0.5074, Cp = 7, R2Adj = 0.4803, and MSE = 0.05081 (see Appendix – Selection Method).
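Comparable selections can be sketched in R with step(), which uses AIC rather than the F-based entry and removal tests used in SAS, so the chosen subsets can differ slightly (train is the assumed training data frame):

    full <- lm(yp ~ tX2 + c3 + tX6 + tX7 + c8 + c9 + c10 + X12 + X13 + X14,
               data = train)
    step(full, direction = "backward")                       # backward elimination
    null <- lm(yp ~ 1, data = train)
    step(null, scope = formula(full), direction = "forward") # forward selection
    step(null, scope = formula(full), direction = "both")    # stepwise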
Interaction Terms
We then checked whether interaction terms could be added to give a model with better estimation. Using our judgment about which interactions are logical, we felt that 'number of physicians' with 'total personal income' (the product of tX6 and c10, denoted x610) and 'percent of high-school graduates' with 'total personal income' (the product of c8 and c10, denoted x810) could plausibly interact with one another. The all-possible-regressions method was performed on the models with interaction terms. For the 8-variable model, R2 = 0.5516, Cp = 9, R2Adj = 0.5181, MSE = 0.04711. For the 7-variable model with x610 added, R2 = 0.5282, Cp = 8, R2Adj = 0.4983, MSE = 0.04905. For the other 7-variable model with x810 added, R2 = 0.5259, Cp = 8, R2Adj = 0.4952, MSE = 0.04935. None of the three candidate models differs significantly from our original best model, so no interaction terms were added.
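One such comparison can be sketched in R with a partial F-test:

    base    <- lm(yp ~ tX2 + tX6 + c8 + c10 + X12 + X13, data = train)
    with610 <- update(base, . ~ . + tX6:c10)  # physicians x income interaction
    anova(base, with610)    # a small F statistic favours the simpler model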
Final Model and influence
We found that our best model from the training group was:
yp = -2.69327 + 1106.12535·tX2 - 24.47901·tX6 + 0.01299·c8 - 332.16264·c10 - 0.45876·X12 - 0.20684·X13 + ε.
This model must now be tested for influential points in the training group. If influential points are present, the model must be used with caution, because only a few observations would then drive the fit. We used six diagnostics to check for possible influential points: R-Student, hii, Cook's D, DFFITS, covariance ratio, and DFBETAS (see Appendix – Influential Points – Training Group).
R-Student detects possible outliers: |R-Student| > 2 indicates a possible outlier, and |R-Student| > 3 a clear one. Observations flagged under the R-Student criterion: 1, 42, 49, 66, 85, 88, 106, 116.
Cook's D "measures the squared distance between the least-square estimates based on all points" and the estimates with the i-th point removed. Points have considerable influence on the least-squares estimate if Cook's D > 1. Observation flagged under Cook's D: 49.
hii is the i-th diagonal element of the hat matrix, with 0 < hii < 1. The diagonal values measure the distance of the i-th observation from the center of the x-space; observations far from the center have significant leverage over the data set. The criterion is hii > 2p/n = 2(7)/116 = 0.1207. Observations flagged under hii: 49, 116.
The covariance ratio expresses the role of the i-th observation in the precision of the estimates. A covariance ratio outside the range 1 ± 3p/n = 1 ± 3(7)/116 = [0.8190, 1.1810] indicates that the i-th data point has large influence. Observations flagged by the covariance ratio: 1, 49, 66, 85, 106.
DFFITS is the R-Student value scaled by the leverage of the i-th observation; an outlier has a large R-Student value and hence a large DFFITS. |DFFITS| > 2√(p/n) = 2√(7/116) = 0.4913 marks a possible influential point. Observations flagged by DFFITS: 1, 42, 49, 66, 85, 106, 107, 112, 116.
DFBETAS indicates how much each regression coefficient changes, in standard-deviation units, when the i-th observation is removed. |DFBETAS| > 2/√n = 2/√116 = 0.1857 marks a possible influential point. Observations flagged under DFBETAS: 1, 42, 49, 66, 85, 106, 116.
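All six diagnostics are available directly in R; a minimal sketch, with train and the cutoffs as above:

    fit <- lm(yp ~ tX2 + tX6 + c8 + c10 + X12 + X13, data = train)
    p <- length(coef(fit)); n <- nrow(train)
    which(abs(rstudent(fit)) > 2)               # possible outliers
    which(hatvalues(fit) > 2 * p / n)           # high leverage
    which(cooks.distance(fit) > 1)              # Cook's D
    which(abs(dffits(fit)) > 2 * sqrt(p / n))   # DFFITS
    which(abs(covratio(fit) - 1) > 3 * p / n)   # covariance ratio
    which(abs(dfbetas(fit)) > 2 / sqrt(n), arr.ind = TRUE)  # DFBETAS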
Using these criteria, we judged observations 1, 49, 66, 106, and 116 to be potential influential points. To test whether they truly influence the model, we removed them from the data set and refitted the least-squares line to see whether the fit changed significantly; a large change would indicate that these values are influential points. After removing them, our model changed to
yp = -2.64213 - 4110.19878·tX2 - 57.57935·tX6 + 0.01022·c8 - 129.71397·c10 - 0.47682·X12 - 0.18943·X13 + ε.
The change in β1, which even reverses sign, shows that removing the influential points materially alters the model. Our model was built with the influential points included and hence must be interpreted with caution.
Fitting model to prediction data
Now that we have found our best model using the estimation (training) data, we must check its validity by fitting it to the prediction (holdout) data set. For the centered transforms of X8 and X10 we used the means of the X8 and X10 columns from the prediction data set. We then checked the model for collinearity: the condition number for the prediction data was 11.19037, well below the value of 30 conventionally taken to indicate collinearity, so collinearity in the prediction set is not a factor.
To see how well our model predicts the holdout group, we compared R2 and R2Adj. If the absolute difference between the estimation R2 and the prediction R2 is greater than 0.1, the model is not adequate for predicting serious crime in the holdout group. The difference for R2 was 0.6544 - 0.5646 = 0.0898, and for R2Adj it was 0.5393 - 0.5392 = 0.0001, so our model appears adequate for predicting serious crime in the holdout group.
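A sketch of a prediction-R2 computation of this kind in R (train and holdout are assumed data frames, and this generic formula is one common way to define prediction R2):

    efit <- lm(yp ~ tX2 + tX6 + c8 + c10 + X12 + X13, data = train)
    pred <- predict(efit, newdata = holdout)
    SSE  <- sum((holdout$yp - pred)^2)
    SST  <- sum((holdout$yp - mean(holdout$yp))^2)
    R2_pred <- 1 - SSE / SST                       # prediction R^2 on the holdout group
    abs(summary(efit)$r.squared - R2_pred) < 0.1   # the 0.1 rule of thumb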
The best-fit model in the holdout group is
yp = -2.80488 + 23394·tX2 + 26.66752·tX6 + 0.00974·c8 - 340.22446·c10 - 0.5151·X12 - 0.10199·X13 + ε.
Next, the holdout group must be tested for influential points (see Appendix – Influential Points – Holdout Group). From these criteria, we felt that observations 7, 15, and 22 could be potential influential points in the holdout group, so we removed them to see how the model was affected. After removing the observations, the new model for the holdout group is
yp = -3.01936 + 36572·tX2 + 179.4668·tX6 + 0.00688·c8 - 1116.6124·c10 - 0.6844·X12 - 0.19492·X13 + ε.
Again, we found significant changes in the coefficients between the two fits, so we conclude that observations 7, 15, and 22 are influential points and this model must be used with caution.
Appendix
Correlation
Pearson Correlation Coefficients, N = 141
Prob > |r| under H0: Rho = 0 (p-values in parentheses)

       yp                 tx2                tx6                c8                 c10                x12                x13
yp      1.00000           -0.01298 (0.8785)  -0.30794 (0.0002)   0.38814 (<.0001)  -0.25578 (0.0022)  -0.49832 (<.0001)  -0.06806 (0.4226)
tx2    -0.01298 (0.8785)   1.00000           -0.00629 (0.9409)  -0.20201 (0.0163)  -0.02943 (0.7290)   0.19667 (0.0194)  -0.05092 (0.5488)
tx6    -0.30794 (0.0002)  -0.00629 (0.9409)   1.00000           -0.23114 (0.0058)   0.89171 (<.0001)  -0.04200 (0.6209)   0.02350 (0.7821)
c8      0.38814 (<.0001)  -0.20201 (0.0163)  -0.23114 (0.0058)   1.00000           -0.18551 (0.0276)  -0.16998 (0.0439)   0.12609 (0.1363)
c10    -0.25578 (0.0022)  -0.02943 (0.7290)   0.89171 (<.0001)  -0.18551 (0.0276)   1.00000           -0.08804 (0.2992)  -0.05941 (0.4841)
x12    -0.49832 (<.0001)   0.19667 (0.0194)  -0.04200 (0.6209)  -0.16998 (0.0439)  -0.08804 (0.2992)   1.00000           -0.27965 (0.0008)
x13    -0.06806 (0.4226)  -0.05092 (0.5488)   0.02350 (0.7821)   0.12609 (0.1363)  -0.05941 (0.4841)  -0.27965 (0.0008)   1.00000
Interpretation of best model
Root MSE          0.16977      R-Square    0.7676
Dependent Mean   -2.87257      Adj R-Sq    0.6746
Coeff Var        -5.91009

Analysis of Variance
Source            DF   Sum of Squares   Mean Square   F Value   Pr > F
Model              6          6.45218       1.07536     21.51   <.0001
Error            134          6.69894       0.04999
Corrected Total  140         13.15113
Collinearity
Before Centering:

Variable   DF   Parameter Estimate   Standard Error   t Value   Pr > |t|   Variance Inflation
tx3         1            -66.45402         81.26821     -0.82     0.4150             27.88359
x8          1              0.00797          0.00350      2.28     0.0245              2.21071
tx9         1             91.29262         46.95957      1.94     0.0541             54.32695
tx10        1          -1039.41076        495.82432     -2.10     0.0380             34.83511

Number   Eigenvalue   Condition Index
11          0.00717          34.75088
12          0.00402          46.37989
13          0.00291          54.56172

After Centering:

Number   Eigenvalue   Condition Index
11          0.02331          15.04390
12          0.01358          19.70937
13          0.01068          22.22573
Selection Method
Summary of Stepwise Selection
Step   Variable Entered   Number Vars In   Partial R-Square   Model R-Square      C(p)   F Value   Pr > F
1      x12                             1             0.2367           0.2367   58.1027     35.36   <.0001
2      tx6                             2             0.1221           0.3588   32.8997     21.51   <.0001
3      c8                              3             0.0558           0.4146   22.4531     10.69   0.0014
4      x13                             4             0.0617           0.4764   10.6995     13.08   0.0005
5      tx2                             5             0.0212           0.4976    7.9682      4.65   0.0333
Summary of Backward Selection (variables kept)
Variable    Parameter Estimate   Standard Error   Type II SS   F Value   Pr > F
Intercept            -2.72834          0.02840    466.28614   9227.91   <.0001
tx2                1104.33799        522.57050      0.22566      4.47   0.0368
c8                    0.01325          0.00296      1.01362     20.06   <.0001
c10                -457.33004        108.03921      0.90541     17.92   <.0001
x12                  -0.46273          0.05805      3.21106     63.55   <.0001
x13                  -0.21588          0.04929      0.96937     19.18   <.0001
Summary of Forward Selection
Step   Variable Entered   Number Vars In   Partial R-Square   Model R-Square      C(p)   F Value   Pr > F
1      x12                             1             0.2367           0.2367   58.1027     35.36   <.0001
2      tx6                             2             0.1221           0.3588   32.8997     21.51   <.0001
3      c8                              3             0.0558           0.4146   22.4531     10.69   0.0014
4      x13                             4             0.0617           0.4764   10.6995     13.08   0.0005
5      tx2                             5             0.0212           0.4976    7.9682      4.65   0.0333
Influential Points – Training Group
Obs   Residual   RStudent   Hat Diag H   Cov Ratio     DFFITS
1       0.5925     2.8248       0.0785      0.7028      0.8246
42     -0.4751    -2.2091       0.0573      0.8304     -0.5447
48     -0.0212    -2.2191       0.9981    418.7112    -51.3459
66     -0.6408    -3.0334       0.0554      0.6371     -0.7347
85      0.4890     2.2816       0.0609      0.8170      0.5812
88      0.4497     2.0467       0.0221      0.8357      0.3079
106    -0.6674    -3.2211       0.0822      0.6115     -0.9642
112     0.3349     1.5723       0.0948      1.0057      0.5088
116     0.4239     2.3777       0.3476      1.1439      1.7354
Fitted Model - Holdout Group
Root MSE          0.19139      R-Square    0.6544
Dependent Mean   -2.85990      Adj R-Sq    0.5393
Coeff Var        -6.69223

Analysis of Variance
Source            DF   Sum of Squares   Mean Square   F Value   Pr > F
Model              6          1.24871       0.20812      5.68   0.0018
Error             18          0.65935       0.03663
Corrected Total   24          1.90806
Collinearity – Holdout Group
Number   Eigenvalue   Condition Index
7           0.01977          11.19037
Influential Points – Holdout Group
Obs   Residual   RStudent   Hat Diag H   Cov Ratio    DFFITS
2      -0.0143    -0.1055       0.5228      3.1124    -0.1104
7       0.3325     2.1801       0.2326      0.3462     1.2003
13      0.0470     0.2697       0.2147      1.8439     0.1410
15     -0.0274    -0.4254       0.8921     12.8400    -1.2232
17      0.3429     2.1051       0.1376      0.3419     0.8410
18     -0.2451    -1.5717       0.2821      0.8040    -0.9851
20     -0.0470    -0.2823       0.2822      2.0116    -0.1770
21     -0.1177    -0.8195       0.4468      2.0564    -0.7365
22      0.3103     1.8470       0.1263      0.4747     0.7023
24      0.1321     1.4046       0.7456      2.7188     2.4043
25     -0.0765    -0.4920       0.3683      2.1395    -0.3757