Statistics Final Project 2012

5/1/2012
MATH 1040 FINAL PROJECT
BOLTON
Bryton Bolton
Math 1040-006
M. Rashid T-Th 8:30-9:45
Statistics Final Project
Appendix
Raw data collected from Stat Crunch, Excel Data Sets. #14, case study
http://cp03.coursecompass.com/webapps/portal/frameset.jsp?url=%2Fbin%2Fcommon%2Fcourse.pl%3Fcourse_id%3D_578411_1
Price ($1,000s)   Acres   Bedrooms   Bathrooms   Sq. ft.   Age (yrs)   Rooms
104.9             0.19    3          1           900       44          8
109               0.15    3          2           1431      34          6
94.9              0.2     3          1.5         1064      49          6
96.5              0.18    2          1           780       52          4
127.9             0.17    3          2.5         1140      47          6
129.9             0.18    3          1.5         1140      41          7
145               0.18    3          1.5         1845      45          6
199.9             0.17    4          3.5         1974      17          8
255.9             0.24    5          3           2460      22          8
310               0.23    4          3.5         2490      14          9
169               0.2     4          2           1896      37          8
344.5             0.24    4          4.5         2709      11          8
123               0.13    3          1           828       63          5
139.9             0.16    2          2           1131      75          5
169.9             0.15    2          2           1002      96          7
194.9             0.19    3          1           1024      55          6
210               0.23    3          1           1649      53          9
275               0.17    4          2           2380      10          8
299.5             0.17    4          2           1936      97          8
319.9             0.27    3          2           1648      77          8
397.5             0.3     4          2.5         2500      106         10
189.9             0.18    2          1           1016      71          6
349.9             0.4     4          2.5         1816      70          8
454.9             0.96    3          3           2160      37          7
499.9             1       5          3           3104      48          10
615               0.66    4          3.5         3205      26          10
635               0.44    4          3.5         3084      27          10
929               0.9     5          4.5         4470      16          14
1-Variable Summaries for X & Y data sets
X-data = Square Footage
Y-data = Price ($1,000s)
Regression/ R Value
Hypothesis
The purpose of this study is to determine whether there is a strong correlation between the size of a home,
in square feet, and its value in US dollars.
At first glance, there does appear to be a linear correlation between the two. We therefore intend to
show that as the square footage of a house increases, the value of the home also increases.
Organizing Data
To get a closer look at the relationship between a home’s value and its square footage, we begin by
organizing our data. First we set aside the acres, number of rooms, etc. and compare the two key
pieces of data we are interested in.
Here we have separated the data and sorted each variable from smallest to largest.
Square Footage
780, 828, 900, 1002, 1016, 1024, 1064, 1131, 1140, 1140, 1431, 1648, 1649, 1816,
1845, 1896, 1936, 1974, 2160, 2380, 2460, 2490, 2500, 2709, 3084, 3104, 3205, 4470
Price ($1,000s)
94.9, 96.5, 104.9, 109, 123, 127.9, 129.9, 139.9, 145, 169, 169.9, 189.9, 194.9, 199.9,
210, 255.9, 275, 299.5, 310, 319.9, 344.5, 349.9, 397.5, 454.9, 499.9, 615, 635, 929

Statistic                  Sq. ft.       Price ($1,000s)
No. of observations        28            28
Minimum                    780           94.9
Maximum                    4470          929
1st Quartile               1114.25       137.4
Median                     1830.5        204.95
3rd Quartile               2467.5        345.85
Mean                       1885.071      281.807
Variance (n-1)             803516.661    38953.959
Standard deviation (n-1)   896.391       197.368
Outlier(s)
Note: Sq. Ft. has 1 outlier: 4,470 Sq. Ft.
Price ($1,000s) has 3 outliers: 929, 635, & 615
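The one-variable summaries above can be reproduced from the raw data with a short script. This is a minimal sketch, not the Stat Crunch or Excel procedure used for the project: it assumes Python's standard statistics module, and it assumes the "inclusive" quartile method, chosen only because it matches the reported quartiles. The names summarize, sq_ft, and price are illustrative.

```python
import statistics

# Data transcribed from the appendix (28 homes, same order in both lists).
sq_ft = [900, 1431, 1064, 780, 1140, 1140, 1845, 1974, 2460, 2490,
         1896, 2709, 828, 1131, 1002, 1024, 1649, 2380, 1936, 1648,
         2500, 1016, 1816, 2160, 3104, 3205, 3084, 4470]
price = [104.9, 109, 94.9, 96.5, 127.9, 129.9, 145, 199.9, 255.9, 310,
         169, 344.5, 123, 139.9, 169.9, 194.9, 210, 275, 299.5, 319.9,
         397.5, 189.9, 349.9, 454.9, 499.9, 615, 635, 929]


def summarize(values):
    """Return the one-variable summary statistics for a list of numbers."""
    # Assumption: method="inclusive" treats the sample min and max as the
    # 0th and 100th percentiles; it reproduces the quartiles in the table.
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "n": len(values),
        "min": min(values),
        "max": max(values),
        "q1": q1,
        "median": median,
        "q3": q3,
        "mean": statistics.mean(values),
        "variance": statistics.variance(values),  # n-1 denominator
        "stdev": statistics.stdev(values),        # n-1 denominator
    }


sq_ft_summary = summarize(sq_ft)
price_summary = summarize(price)

for name, summary in (("Sq. ft.", sq_ft_summary),
                      ("Price ($1,000s)", price_summary)):
    print(name)
    for key, value in summary.items():
        print(f"  {key}: {value:.3f}")
```

Other quartile conventions (for example the "exclusive" method) give slightly different quartiles for the same data, so the method matters when comparing against software output.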
[Figure: Box plot (Price ($1,000s))]
[Figure: Box plot (Sq. ft.)]
[Figure: Histogram (Sq. ft.); x-axis: Sq. ft., y-axis: Relative frequency]
[Figure: Histogram (Price ($1,000s)); x-axis: Price ($1,000s), y-axis: Relative frequency]
Here we see the box plots and histograms of each variable on an individual basis, to better
illustrate their range, outliers, mean, std. deviation, etc.
With the outliers identified (all 28 observations are kept in the regression that follows), we can
further analyze our data to determine whether there is a significant correlation within our data set.
We now compare our r value, the linear correlation coefficient, of 0.908459108 to the critical r from
Table II in the text.
Here we see that the critical r value from Table II is 0.374 for n = 28.
Because 0.908 > 0.374, a significant correlation does exist in our
data set, and we are okay to proceed with linear regression
analysis to come up with the best-fit line and equation.
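The comparison of r against the critical value can be checked directly from the raw data. The following is a minimal sketch using only the Python standard library; the function name pearson_r and the constant R_CRITICAL are illustrative, and the value 0.374 is simply the Table II figure quoted in the text, not something the code derives.

```python
import math

# Data transcribed from the appendix (28 homes, same order in both lists).
sq_ft = [900, 1431, 1064, 780, 1140, 1140, 1845, 1974, 2460, 2490,
         1896, 2709, 828, 1131, 1002, 1024, 1649, 2380, 1936, 1648,
         2500, 1016, 1816, 2160, 3104, 3205, 3084, 4470]
price = [104.9, 109, 94.9, 96.5, 127.9, 129.9, 145, 199.9, 255.9, 310,
         169, 344.5, 123, 139.9, 169.9, 194.9, 210, 275, 299.5, 319.9,
         397.5, 189.9, 349.9, 454.9, 499.9, 615, 635, 929]

# Critical r quoted in the text from Table II (taken as given here).
R_CRITICAL = 0.374


def pearson_r(xs, ys):
    """Return the sample linear correlation coefficient of xs and ys."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


r = pearson_r(sq_ft, price)
significant = abs(r) > R_CRITICAL
print(f"r = {r:.9f}, critical r = {R_CRITICAL}, significant: {significant}")
```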
[Figure: Scatterplot, Sq. Ft. vs. Price ($1,000s), with best-fit line y = 0.2x - 95.254; x-axis: Sq. ft., y-axis: Price ($1,000s)]
The scatterplot shows all of our data points, the best-fit line, and our equation.
[Figure: Sq. Ft. vs. Price ($1,000s); x-axis: Sq. ft., y-axis: Price ($1,000s)]
Regression of variable Price ($1,000s):
Goodness of fit statistics:
Observations     28.000
Sum of weights   28.000
DF               26.000
R                0.908
R²               0.825
Adjusted R²      0.819
MSE              7067.080
RMSE             84.066
MAPE             29.850
DW               0.855
Cp               2.000
AIC              250.095
SBC              252.759
PC               0.202
Correlation matrix:
Variables         Sq. ft.   Price ($1,000s)
Sq. ft.           1.000     0.908
Price ($1,000s)   0.908     1.000
Analysis of variance:
Source            DF   Sum of squares   Mean squares   F         Pr > F
Model             1    868012.813       868012.813     122.825   < 0.0001
Error             26   183744.085       7067.080
Corrected Total   27   1051756.899
Computed against model Y=Mean(Y)
Model parameters:
Source      Value     Standard error   t        Pr > |t|   Lower bound (95%)   Upper bound (95%)
Intercept   -95.254   37.549           -2.537   0.018      -172.437            -18.070
Sq. ft.     0.200     0.018            11.083   < 0.0001   0.163               0.237
Standardized coefficients:
Source    Value   Standard error   t        Pr > |t|   Lower bound (95%)   Upper bound (95%)
Sq. ft.   0.908   0.082            11.083   < 0.0001   0.740               1.077
Equation of the model:
Price ($1,000s) = -95.2538039599986+0.200024752962752*Sq. ft.
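The model equation can be recovered from the raw data with the usual least-squares formulas, slope = Sxy / Sxx and intercept = mean(y) - slope * mean(x). The sketch below is an illustration in plain Python, not the software run that produced the output above; the function name fit_line and the dictionary keys are hypothetical.

```python
# Data transcribed from the appendix (28 homes, same order in both lists).
sq_ft = [900, 1431, 1064, 780, 1140, 1140, 1845, 1974, 2460, 2490,
         1896, 2709, 828, 1131, 1002, 1024, 1649, 2380, 1936, 1648,
         2500, 1016, 1816, 2160, 3104, 3205, 3084, 4470]
price = [104.9, 109, 94.9, 96.5, 127.9, 129.9, 145, 199.9, 255.9, 310,
         169, 344.5, 123, 139.9, 169.9, 194.9, 210, 275, 299.5, 319.9,
         397.5, 189.9, 349.9, 454.9, 499.9, 615, 635, 929]


def fit_line(xs, ys):
    """Fit y = intercept + slope * x by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x

    # Error and total sums of squares for the goodness-of-fit figures.
    sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    sst = sum((y - mean_y) ** 2 for y in ys)
    mse = sse / (n - 2)  # two estimated parameters, so n - 2 degrees of freedom
    return {
        "slope": slope,
        "intercept": intercept,
        "r_squared": 1 - sse / sst,
        "mse": mse,
        "f": (sst - sse) / mse,  # model mean square / error mean square
    }


fit = fit_line(sq_ft, price)
print(f"Price ($1,000s) = {fit['intercept']:.4f} + {fit['slope']:.6f} * Sq. ft.")
print(f"R² = {fit['r_squared']:.3f}, MSE = {fit['mse']:.3f}, F = {fit['f']:.3f}")
```

With one predictor, the F statistic is the square of the slope's t statistic, which is why the same p-value appears in the analysis of variance and in the Sq. ft. row of the model parameters.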
[Figure: Price ($1,000s) / Standardized coefficients (95% conf. interval); variable: Sq. ft.]
Predictions and residuals:
Observation   Sq. ft.    Price ($1,000s)   Pred(Price ($1,000s))   Residual   Std. residual
Obs1          900.000    104.900           84.768                  20.132     0.239
Obs2          1431.000   109.000           190.982                 -81.982    -0.975
Obs3          1064.000   94.900            117.573                 -22.673    -0.270
Obs4          780.000    96.500            60.766                  35.734     0.425
Obs5          1140.000   127.900           132.774                 -4.874     -0.058
Obs6          1140.000   129.900           132.774                 -2.874     -0.034
Obs7          1845.000   145.000           273.792                 -128.792   -1.532
Obs8          1974.000   199.900           299.595                 -99.695    -1.186
Obs9          2460.000   255.900           396.807                 -140.907   -1.676
Obs10         2490.000   310.000           402.808                 -92.808    -1.104
Obs11         1896.000   169.000           283.993                 -114.993   -1.368
Obs12         2709.000   344.500           446.613                 -102.113   -1.215
Obs13         828.000    123.000           70.367                  52.633     0.626
Obs14         1131.000   139.900           130.974                 8.926      0.106
Obs15         1002.000   169.900           105.171                 64.729     0.770
Obs16         1024.000   194.900           109.572                 85.328     1.015
Obs17         1649.000   210.000           234.587                 -24.587    -0.292
Obs18         2380.000   275.000           380.805                 -105.805   -1.259
Obs19         1936.000   299.500           291.994                 7.506      0.089
Obs20         1648.000   319.900           234.387                 85.513     1.017
Obs21         2500.000   397.500           404.808                 -7.308     -0.087
Obs22         1016.000   189.900           107.971                 81.929     0.975
Obs23         1816.000   349.900           267.991                 81.909     0.974
Obs24         2160.000   454.900           336.800                 118.100    1.405
Obs25         3104.000   499.900           525.623                 -25.723    -0.306
Obs26         3205.000   615.000           545.826                 69.174     0.823
Obs27         3084.000   635.000           521.623                 113.377    1.349
Obs28         4470.000   929.000           798.857                 130.143    1.548

Observation   Std. dev. on   Lower bound   Upper bound   Std. dev. on    Lower bound     Upper bound
              pred. (Mean)   95% (Mean)    95% (Mean)    pred. (Obs.)    95% (Obs.)      95% (Obs.)
Obs1          23.843         35.758        133.779       87.382          -94.847         264.384
Obs2          17.876         154.237       227.727       85.946          14.318          367.645
Obs3          21.726         72.915        162.230       86.828          -60.905         296.050
Obs4          25.499         8.352         113.179       87.848          -119.809        241.340
Obs5          20.814         89.990        175.558       86.604          -45.243         310.792
Obs6          20.814         89.990        175.558       86.604          -45.243         310.792
Obs7          15.903         241.102       306.482       85.557          97.927          449.657
Obs8          15.968         266.773       332.417       85.569          123.705         475.485
Obs9          18.975         357.802       435.812       86.181          219.660         573.954
Obs10         19.277         363.184       442.432       86.248          225.523         580.093
Obs11         15.888         251.334       316.652       85.554          108.134         459.852
Obs12         21.761         401.883       491.343       86.837          268.118         625.109
Obs13         24.827         19.334        121.400       87.655          -109.812        250.545
Obs14         20.919         87.974        173.975       86.630          -47.096         309.044
Obs15         22.504         58.914        151.428       87.026          -73.713         284.055
Obs16         22.224         63.889        155.254       86.954          -69.165         288.308
Obs17         16.448         200.777       268.397       85.660          58.510          410.664
Obs18         18.226         343.341       418.269       86.019          203.991         557.620
Obs19         15.914         259.283       324.705       85.559          116.125         467.863
Obs20         16.453         200.567       268.207       85.661          58.309          410.465
Obs21         19.380         364.973       444.644       86.271          227.476         582.140
Obs22         22.326         62.081        153.862       86.980          -70.818         286.761
Obs23         15.936         235.235       300.748       85.563          92.114          443.868
Obs24         16.644         302.588       371.012       85.698          160.645         512.954
Obs25         27.136         469.843       581.403       88.337          344.043         707.203
Obs26         28.634         486.967       604.684       88.809          363.276         728.375
Obs27         26.845         466.443       576.802       88.248          340.226         703.019
Obs28         49.285         697.550       900.163       97.448          598.550         999.164
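Each column of the predictions and residuals output follows from the fitted line and the RMSE. The sketch below shows one way to reproduce them in plain Python, under stated assumptions: the standardized residual is taken as residual / RMSE (which agrees with the reported column), and the 95% bounds use a hard-coded Student t critical value for 26 degrees of freedom read from a standard t table. The names prediction_rows and T_CRIT_95_DF26 are illustrative.

```python
import math

# Data transcribed from the appendix (same order as Obs1..Obs28).
sq_ft = [900, 1431, 1064, 780, 1140, 1140, 1845, 1974, 2460, 2490,
         1896, 2709, 828, 1131, 1002, 1024, 1649, 2380, 1936, 1648,
         2500, 1016, 1816, 2160, 3104, 3205, 3084, 4470]
price = [104.9, 109, 94.9, 96.5, 127.9, 129.9, 145, 199.9, 255.9, 310,
         169, 344.5, 123, 139.9, 169.9, 194.9, 210, 275, 299.5, 319.9,
         397.5, 189.9, 349.9, 454.9, 499.9, 615, 635, 929]

# Assumption: two-sided 95% critical value of Student's t with
# n - 2 = 26 degrees of freedom, hard-coded from a standard t table.
T_CRIT_95_DF26 = 2.0555


def prediction_rows(xs, ys, t_crit):
    """Return one dict per observation: fit, residual, and 95% bounds."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    rmse = math.sqrt(sse / (n - 2))

    rows = []
    for x, y in zip(xs, ys):
        pred = intercept + slope * x
        residual = y - pred
        # Variance factor for the estimated mean response at x.
        h = 1 / n + (x - mean_x) ** 2 / sxx
        sd_mean = rmse * math.sqrt(h)      # std. dev. of the mean prediction
        sd_obs = rmse * math.sqrt(1 + h)   # std. dev. for a single new home
        rows.append({
            "pred": pred,
            "residual": residual,
            "std_residual": residual / rmse,
            "sd_mean": sd_mean,
            "mean_lower": pred - t_crit * sd_mean,
            "mean_upper": pred + t_crit * sd_mean,
            "sd_obs": sd_obs,
            "obs_lower": pred - t_crit * sd_obs,
            "obs_upper": pred + t_crit * sd_obs,
        })
    return rows


rows = prediction_rows(sq_ft, price, T_CRIT_95_DF26)
for i, row in enumerate(rows, start=1):
    print(f"Obs{i}: pred={row['pred']:.3f} residual={row['residual']:.3f} "
          f"mean 95%=({row['mean_lower']:.3f}, {row['mean_upper']:.3f})")
```

The observation bounds are much wider than the mean bounds because they include the scatter of individual homes around the line, not just the uncertainty in the line itself.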
[Figure: Regression of Price ($1,000s) by Sq. ft. (R²=0.825); series: Active, Model, Conf. interval (Mean 95%), Conf. interval (Obs. 95%)]
[Figure: Residual vs. X (Sq. Ft.); axes: Residual, Sq. ft.]
In the plot above, we compare the residuals against x, or square feet, to make sure that our data are linearly
related. This plot does not show a discernible pattern, and it does not show the spread of the residuals
increasing or decreasing; therefore we can confirm that our data are linearly related and move on to use
our equation to make predictions.
[Figure: Residual vs. Y Price ($1,000s); axes: Residual, Price ($1,000s)]
[Figure: Pred(Price ($1,000s)) / Residual; x-axis: Pred(Price ($1,000s)), y-axis: Residual]
[Figure: Pred(Price ($1,000s)) / Price ($1,000s); x-axis: Pred(Price ($1,000s)), y-axis: Price ($1,000s)]
[Figure: Residual / Price ($1,000s); residual for each observation, Obs1 to Obs28]
Predictions
We have satisfied the three critical factors in confirming that our data are linearly related.
1. Compare calculated r to r from Table II in textbook. r=0.908 > critical r=0.374
2. Residual vs. X plot does not show a discernible pattern.
3. Residual vs. X plot does not show spread increasing or decreasing.
Now that our data analysis is complete and we have determined that there is in fact a linear
relationship between the size of a home in square feet and the value of the home, we can make
predictions as to what the value of a home might be based on its square footage. Below we see our
“best-fit line equation”; we can enter a square footage measurement and get the expected
value. We need only keep in mind that the square footage must fall within our data
range (780 to 4,470 sq. ft.), because our linear regression equation is only good within the data range
we used to create it. With that said, we take 10 square footage values and run them through the equation.
Equation of the model:
Price ($1,000s) = -95.2538039599986+0.200024752962752*Sq. ft.
Sq. Ft.   Predicted Value (price ($1,000s))
1030      110.772
1500      204.783
1750      254.790
1825      269.791
2100      324.798
2200      344.801
2300      364.803
2600      424.811
2825      469.816
2950      494.819
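Running square footage values through the model equation is simple arithmetic, and the restriction to the observed data range can be enforced in code. The following is a minimal sketch: the coefficients are copied from the equation of the model, the range limits are the minimum and maximum square footage in the data, and the function name predict_price is hypothetical.

```python
# Coefficients copied from the equation of the model.
INTERCEPT = -95.2538039599986
SLOPE = 0.200024752962752

# Smallest and largest square footage in the data set; the line is only
# trusted between these limits.
MIN_SQ_FT = 780
MAX_SQ_FT = 4470


def predict_price(sq_ft):
    """Return the predicted price in $1,000s for a home of the given size."""
    if not MIN_SQ_FT <= sq_ft <= MAX_SQ_FT:
        raise ValueError(
            f"{sq_ft} sq. ft. is outside the data range "
            f"{MIN_SQ_FT} to {MAX_SQ_FT}; the model should not be extrapolated"
        )
    return INTERCEPT + SLOPE * sq_ft


sizes = [1030, 1500, 1750, 1825, 2100, 2200, 2300, 2600, 2825, 2950]
predictions = {size: round(predict_price(size), 3) for size in sizes}

for size, value in predictions.items():
    print(f"{size} sq. ft. -> {value} ($1,000s)")
```

Raising an error for out-of-range input is one design choice; returning the value with a warning would also work, as long as extrapolated predictions are not mistaken for supported ones.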
Afterthought
Although we can say with confidence that there is a strong correlation between the size of a home and
its value, many other factors also contribute to the ultimate value or price of a home,
such as age, size of lot, number of rooms, etc., not to mention the general condition that the home is in.
It is important for anyone looking at this analysis or using this data to understand that concept and
realize that square footage is merely one of several factors. However, I do feel that it is one of the most
important, and arguably the one with the biggest influence on home value.
One concern that may arise from this study is sample size. With only 28 data points, it will be harder to
apply these findings to a large-scale prediction, such as the correlation between size and value of
a home in the entire United States. However, because our r value was so far above the critical r, we can
determine that the relationship between variables X & Y is strong.
One thing that would make the study more worthwhile would be to compare two separate X variables and
see which one has a stronger relationship to our Y variable. For example, it might be interesting to see
which has a better correlation to home value: square feet or year built.