5/1/2012 MATH 1040 FINAL PROJECT BOLTON

Bryton Bolton
Math 1040-006
M. Rashid
T-Th 8:30-9:45

Statistics Final Project

Appendix

Raw data collected from Stat Crunch, Excel Data Sets, #14, case study
http://cp03.coursecompass.com/webapps/portal/frameset.jsp?url=%2Fbin%2Fcommon%2Fcourse.pl%3Fcourse_id%3D_578411_1

Obs   Price ($1,000s)   Acres   Bedrooms   Bathrooms   Sq. ft.   Age (yrs)   Rooms
1     104.9             0.19    3          1           900       44          8
2     109               0.15    3          2           1431      34          6
3     94.9              0.2     3          1.5         1064      49          6
4     96.5              0.18    2          1           780       52          4
5     127.9             0.17    3          2.5         1140      47          6
6     129.9             0.18    3          1.5         1140      41          7
7     145               0.18    3          1.5         1845      45          6
8     199.9             0.17    4          3.5         1974      17          8
9     255.9             0.24    5          3           2460      22          8
10    310               0.23    4          3.5         2490      14          9
11    169               0.2     4          2           1896      37          8
12    344.5             0.24    4          4.5         2709      11          8
13    123               0.13    3          1           828       63          5
14    139.9             0.16    2          2           1131      75          5
15    169.9             0.15    2          2           1002      96          7
16    194.9             0.19    3          1           1024      55          6
17    210               0.23    3          1           1649      53          9
18    275               0.17    4          2           2380      10          8
19    299.5             0.17    4          2           1936      97          8
20    319.9             0.27    3          2           1648      77          8
21    397.5             0.3     4          2.5         2500      106         10
22    189.9             0.18    2          1           1016      71          6
23    349.9             0.4     4          2.5         1816      70          8
24    454.9             0.96    3          3           2160      37          7
25    499.9             1       5          3           3104      48          10
26    615               0.66    4          3.5         3205      26          10
27    635               0.44    4          3.5         3084      27          10
28    929               0.9     5          4.5         4470      16          14

1-Variable Summaries for X & Y data sets
X-data = Square Footage
Y-data = Price ($1,000s)
Regression / R Value

Hypothesis

The purpose of this study is to determine whether there is a strong correlation between the size of a home in square feet and its value in US dollars. At first glance, there does appear to be a linear correlation between the two. Therefore, we intend to show that as the size of a house in square feet increases, the value of the home also increases.

Organizing Data

In order to get a closer look at the relationship between home value and the home's square footage, we begin by organizing our data. First we ignore the acres, number of rooms, etc. and compare the two key pieces of data we are interested in. Here we have separated the data and sorted them from smallest to largest.

Square Footage: 780 828 900 1002 1016 1024 1064 1131 1140 1140 1431 1648 1649 1816 1845 1896 1936 1974 2160 2380 2460 2490 2500 2709 3084 3104 3205 4470

Price ($1,000s): 94.9 96.5 104.9 109 123 127.9 129.9 139.9 145 169 169.9 189.9 194.9 199.9 210 255.9 275 299.5 310 319.9 344.5 349.9 397.5 454.9 499.9 615 635 929

Statistic                   Sq. ft.       Price ($1,000s)
No. of observations         28            28
Minimum                     780           94.9
Maximum                     4470          929
1st Quartile                1114.25       137.4
Median                      1830.5        204.95
3rd Quartile                2467.5        345.85
Mean                        1885.071      281.807
Variance (n-1)              803516.661    38953.959
Standard deviation (n-1)    896.391       197.368

Outlier(s)
Note: Sq. ft. has 1 outlier: 4,470 sq. ft.
Price ($1,000s) has 3 outliers: 929, 635, and 615.

[Figures: Box plot (Price ($1,000s)); Box plot (Sq. ft.); Histogram (Sq. ft.), relative frequency vs. Sq. ft.; Histogram (Price ($1,000s)), relative frequency vs. Price ($1,000s)]

Here we see the box plots and histograms of each variable on an individual basis to better illustrate their range, outliers, mean, standard deviation, etc.

With the outliers identified, we can further analyze our data to determine whether there is a significant correlation within our data set. All 28 observations are kept in the regression that follows. We now compare our r, the linear correlation coefficient, of 0.908459108 to the critical r from Table II in the text. Table II gives a critical value of 0.374 for n = 28. Because 0.908 > 0.374, a significant linear correlation exists in our data set, and we can proceed with linear regression analysis to find the best-fit line and its equation.
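
The comparison of r against the critical value, and the best-fit line it leads to, can be sketched in Python as follows. This is only an illustrative sketch, not the tool used for the report: the function names pearson_r and least_squares are made up for this example, the data are copied from the appendix, and the critical value 0.374 is simply the number read from Table II.

```python
from math import sqrt

# Data copied from the appendix, in observation order (Obs1 to Obs28).
sq_ft = [900, 1431, 1064, 780, 1140, 1140, 1845, 1974, 2460, 2490,
         1896, 2709, 828, 1131, 1002, 1024, 1649, 2380, 1936, 1648,
         2500, 1016, 1816, 2160, 3104, 3205, 3084, 4470]
price = [104.9, 109, 94.9, 96.5, 127.9, 129.9, 145, 199.9, 255.9, 310,
         169, 344.5, 123, 139.9, 169.9, 194.9, 210, 275, 299.5, 319.9,
         397.5, 189.9, 349.9, 454.9, 499.9, 615, 635, 929]

# Critical value the report reads from Table II for n = 28 (taken as given).
CRITICAL_R = 0.374


def sums_of_squares(xs, ys):
    """Return (Sxx, Syy, Sxy, mean_x, mean_y) for paired data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxx, syy, sxy, mean_x, mean_y


def pearson_r(xs, ys):
    """Linear correlation coefficient r."""
    sxx, syy, sxy, _, _ = sums_of_squares(xs, ys)
    return sxy / sqrt(sxx * syy)


def least_squares(xs, ys):
    """Slope and intercept of the least-squares (best-fit) line."""
    sxx, _, sxy, mean_x, mean_y = sums_of_squares(xs, ys)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


r = pearson_r(sq_ft, price)
significant = abs(r) > CRITICAL_R
slope, intercept = least_squares(sq_ft, price)

print(f"r = {r:.3f}, critical r = {CRITICAL_R}, significant: {significant}")
print(f"Price ($1,000s) = {intercept:.3f} + {slope:.3f} * Sq. ft.")
```

The sums Sxx, Syy, and Sxy are computed once in a helper because both r and the slope are built from the same quantities.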

[Figure: Scatterplot of Sq. ft. vs. Price ($1,000s), with best-fit line y = 0.2x - 95.254]

On the right, we have our scatterplot with all of our data points, the best-fit line, and our equation.

Regression of variable Price ($1,000s):

Goodness of fit statistics:
Observations      28.000
Sum of weights    28.000
DF                26.000
R                 0.908
R²                0.825
Adjusted R²       0.819
MSE               7067.080
RMSE              84.066
MAPE              29.850
DW                0.855
Cp                2.000
AIC               250.095
SBC               252.759
PC                0.202

Correlation matrix:
Variables          Sq. ft.   Price ($1,000s)
Sq. ft.            1.000     0.908
Price ($1,000s)    0.908     1.000

Analysis of variance:
Source             DF    Sum of squares   Mean squares   F         Pr > F
Model              1     868012.813       868012.813     122.825   < 0.0001
Error              26    183744.085       7067.080
Corrected Total    27    1051756.899
Computed against model Y=Mean(Y)

Model parameters:
Source       Value     Standard error   t        Pr > |t|    Lower bound (95%)   Upper bound (95%)
Intercept    -95.254   37.549           -2.537   0.018       -172.437            -18.070
Sq. ft.      0.200     0.018            11.083   < 0.0001    0.163               0.237

Standardized coefficients:
Source       Value     Standard error   t        Pr > |t|    Lower bound (95%)   Upper bound (95%)
Sq. ft.      0.908     0.082            11.083   < 0.0001    0.740               1.077

Equation of the model:
Price ($1,000s) = -95.2538039599986+0.200024752962752*Sq. ft.

[Figure: Price ($1,000s) / Standardized coefficients (95% conf. interval), variable Sq. ft.]

Predictions and residuals:
Each series below lists its 28 values in observation order, Obs1 through Obs28.

Sq. ft.: 900.000 1431.000 1064.000 780.000 1140.000 1140.000 1845.000 1974.000 2460.000 2490.000 1896.000 2709.000 828.000 1131.000 1002.000 1024.000 1649.000 2380.000 1936.000 1648.000 2500.000 1016.000 1816.000 2160.000 3104.000 3205.000 3084.000 4470.000

Price ($1,000s): 104.900 109.000 94.900 96.500 127.900 129.900 145.000 199.900 255.900 310.000 169.000 344.500 123.000 139.900 169.900 194.900 210.000 275.000 299.500 319.900 397.500 189.900 349.900 454.900 499.900 615.000 635.000 929.000

Pred(Price ($1,000s)): 84.768 190.982 117.573 60.766 132.774 132.774 273.792 299.595 396.807 402.808 283.993 446.613 70.367 130.974 105.171 109.572 234.587 380.805 291.994 234.387 404.808 107.971 267.991 336.800 525.623 545.826 521.623 798.857

Residual: 20.132 -81.982 -22.673 35.734 -4.874 -2.874 -128.792 -99.695 -140.907 -92.808 -114.993 -102.113 52.633 8.926 64.729 85.328 -24.587 -105.805 7.506 85.513 -7.308 81.929 81.909 118.100 -25.723 69.174 113.377 130.143

Std. residual: 0.239 -0.975 -0.270 0.425 -0.058 -0.034 -1.532 -1.186 -1.676 -1.104 -1.368 -1.215 0.626 0.106 0.770 1.015 -0.292 -1.259 0.089 1.017 -0.087 0.975 0.974 1.405 -0.306 0.823 1.349 1.548

Std. dev. on pred. (Mean): 23.843 17.876 21.726 25.499 20.814 20.814 15.903 15.968 18.975 19.277 15.888 21.761 24.827 20.919 22.504 22.224 16.448 18.226 15.914 16.453 19.380 22.326 15.936 16.644 27.136 28.634 26.845 49.285

Lower bound 95% (Mean): 35.758 154.237 72.915 8.352 89.990 89.990 241.102 266.773 357.802 363.184 251.334 401.883 19.334 87.974 58.914 63.889 200.777 343.341 259.283 200.567 364.973 62.081 235.235 302.588 469.843 486.967 466.443 697.550

Upper bound 95% (Mean): 133.779 227.727 162.230 113.179 175.558 175.558 306.482 332.417 435.812 442.432 316.652 491.343 121.400 173.975 151.428 155.254 268.397 418.269 324.705 268.207 444.644 153.862 300.748 371.012 581.403 604.684 576.802 900.163

Std. dev. on pred. (Observation): 87.382 85.946 86.828 87.848 86.604 86.604 85.557 85.569 86.181 86.248 85.554 86.837 87.655 86.630 87.026 86.954 85.660 86.019 85.559 85.661 86.271 86.980 85.563 85.698 88.337 88.809 88.248 97.448

Lower bound 95% (Observation): -94.847 14.318 -60.905 -119.809 -45.243 -45.243 97.927 123.705 219.660 225.523 108.134 268.118 -109.812 -47.096 -73.713 -69.165 58.510 203.991 116.125 58.309 227.476 -70.818 92.114 160.645 344.043 363.276 340.226 598.550

Upper bound 95% (Observation): 264.384 367.645 296.050 241.340 310.792 310.792 449.657 475.485 573.954 580.093 459.852 625.109 250.545 309.044 284.055 288.308 410.664 557.620 467.863 410.465 582.140 286.761 443.868 512.954 707.203 728.375 703.019 999.164

[Figure: Regression of Price ($1,000s) by Sq. ft. (R²=0.825), showing Active, Model, Conf. interval (Mean 95%), and Conf. interval (Obs. 95%)]

[Figure: Residual vs. X (Sq. Ft.), Sq. ft. plotted against Residual]

In the plot above, we compare our residuals to x, or square feet, to make sure that our data are linearly related.
This plot does not show a discernible pattern, and it does not show the spread of the residuals increasing or decreasing. Therefore we can confirm that our data are linearly related and move on to use our equation to make predictions.

[Figures: Residual vs. Y (Price ($1,000s)); Pred(Price ($1,000s)) / Residual; Pred(Price ($1,000s)) / Price ($1,000s); Residual / Price ($1,000s), by observation]

Predictions

We have satisfied the three critical factors in confirming that our data are linearly related.

1. Compare calculated r to critical r from Table II in the textbook: r = 0.908 > critical r = 0.374.
2. Residual vs. X plot does not show a discernible pattern.
3. Residual vs. X plot does not show the spread increasing or decreasing.

Now that our data analysis is complete, and we have determined that there is in fact a linear relationship between the size of a home in square feet and the value of the home, we can make predictions as to what the value of a home might be based on its square footage. Below we see our "best-fit line equation"; we can enter a square footage measurement and get the expected value. We need only keep in mind that the square footage must fall within our data range, because our linear regression equation is only good within the data range we used to create it. With that said, we take 10 square footage values and run them through the equation.

Equation of the model:
Price ($1,000s) = -95.2538039599986+0.200024752962752*Sq. ft.
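
The prediction step described above can be sketched in Python as follows. This is a minimal sketch, assuming the coefficients printed in the equation of the model and the square footage range of the 28 observations (780 to 4,470); the function name predict_price is hypothetical and chosen only for this example.

```python
# Coefficients copied from the report's "Equation of the model".
INTERCEPT = -95.2538039599986
SLOPE = 0.200024752962752

# Smallest and largest square footage among the 28 observations.
MIN_SQ_FT = 780
MAX_SQ_FT = 4470


def predict_price(sq_ft):
    """Return the predicted price in $1,000s for a home of sq_ft square feet.

    Raises ValueError outside the observed range, because the regression
    line is only trusted within the data used to create it.
    """
    if not MIN_SQ_FT <= sq_ft <= MAX_SQ_FT:
        raise ValueError(
            f"{sq_ft} sq. ft. is outside the data range "
            f"{MIN_SQ_FT} to {MAX_SQ_FT}"
        )
    return INTERCEPT + SLOPE * sq_ft


# The ten square footage values used in the report.
for size in (1030, 1500, 1750, 1825, 2100, 2200, 2300, 2600, 2825, 2950):
    print(size, round(predict_price(size), 3))
```

Raising an error for out-of-range input is one way to enforce the caution about staying within the data range; returning the value with a warning would be another reasonable choice.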

Sq. ft.    Predicted value (Price ($1,000s))
1030       110.772
1500       204.783
1750       254.790
1825       269.791
2100       324.798
2200       344.801
2300       364.803
2600       424.811
2825       469.816
2950       494.819

Afterthought

Although we can say with confidence that there is a strong correlation between the size of a home and its value, many other factors also contribute to the ultimate value or price of a home, such as age, size of lot, number of rooms, etc., not to mention the general condition of the home. It is important for anyone looking at this analysis or using this data to understand that concept and to realize that square footage is merely one of several factors. However, I do feel that it is one of the most important, and arguably the one with the biggest influence on home value.

One concern that may arise from this study is sample size. With only 28 data points, it will be harder to apply these findings to a large-scale prediction, such as the correlation between size and value of a home in the entire United States. However, because our r value was so far above the critical r, we can conclude that the relationship between variables X and Y is strong.

One thing that would make the study more worthwhile would be to compare two separate X variables and see which one has a stronger relationship to our Y variable. For example, it might be interesting to see which one has a better correlation to home value: square feet or year built.
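
The comparison suggested in the last paragraph could be set up along the following lines. This is only a sketch of how it might be done, not a result of this study: it assumes that the Age (yrs) column from the appendix can stand in for year built, and the function name pearson_r is made up for this example.

```python
from math import sqrt

# Columns copied from the appendix, in observation order (Obs1 to Obs28).
price = [104.9, 109, 94.9, 96.5, 127.9, 129.9, 145, 199.9, 255.9, 310,
         169, 344.5, 123, 139.9, 169.9, 194.9, 210, 275, 299.5, 319.9,
         397.5, 189.9, 349.9, 454.9, 499.9, 615, 635, 929]
sq_ft = [900, 1431, 1064, 780, 1140, 1140, 1845, 1974, 2460, 2490,
         1896, 2709, 828, 1131, 1002, 1024, 1649, 2380, 1936, 1648,
         2500, 1016, 1816, 2160, 3104, 3205, 3084, 4470]
# Assumption: age in years is used in place of year built.
age = [44, 34, 49, 52, 47, 41, 45, 17, 22, 14,
       37, 11, 63, 75, 96, 55, 53, 10, 97, 77,
       106, 71, 70, 37, 48, 26, 27, 16]


def pearson_r(xs, ys):
    """Linear correlation coefficient r for paired data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


r_sq_ft = pearson_r(sq_ft, price)
r_age = pearson_r(age, price)

# Compare the strength of each relationship by the size of r, ignoring sign.
stronger = "Sq. ft." if abs(r_sq_ft) > abs(r_age) else "Age (yrs)"
print(f"r (Sq. ft. vs. Price)   = {r_sq_ft:.3f}")
print(f"r (Age (yrs) vs. Price) = {r_age:.3f}")
print(f"Stronger linear relationship with price: {stronger}")
```

The absolute value of r is compared because a strong negative correlation (for example, older homes selling for less) would count as a strong relationship too.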