CORRELATION OF HEIGHT AND SHOE SIZE
By Rebecca Dick & Tyler Dunyon
Math 1040 Final Project
12/13/2012

Introduction: It is often claimed that a person’s shoe size is directly related to their height. A taller person is generally expected to have a larger shoe size than a shorter person.

Statement of Task: The main purpose of this investigation and report is to answer the question: Is there a correlation between the height of a person and their shoe size?

Hypothesis: Our hypothesis is that there is in fact a direct correlation between the height of a person and their shoe size. A person’s shoe size will depend on their measured height, so the independent variable will be height and the dependent variable will be shoe size.

Plan of Investigation: Data was collected from 40 students and/or teachers. Two types of data were collected from each of the 40 individuals: their height and their shoe size. To get the most accurate calculation, only people over the age of sixteen were measured; younger people would still have time to grow taller, which would introduce a fault in the data. The independent variable for the data set is the individual’s height. The dependent variable is the individual’s shoe size; this tests whether shoe size depends on height. The data will be presented in a variety of ways and then tested to see whether the linear model assumption is valid. The data presented in this investigation was retrieved from: http://www.slideshare.net/deloti/correlation-of-height-and-shoe-size?from=share_email.

Collected Data: Below is a table of the collected data for the investigation.
Student   Height     Shoe Size (Standard      Student   Height     Shoe Size (Standard
#         (Inches)   American Units)          #         (Inches)   American Units)
1         153        5                        21        170        8.5
2         154        6                        22        171        9
3         154        6                        23        173        10
4         155        6                        24        174        8
5         158        5                        25        174        10
6         159        7                        26        174        9
7         160        6                        27        175        12
8         161        5                        28        175        11
9         163        6                        29        176        9
10        164        7                        30        177        10
11        165        7                        31        178        11
12        165        6                        32        178        11
13        165        7                        33        178        12
14        166        10                       34        179        10.5
15        167        9.5                      35        179        11.5
16        167        10                       36        179        11
17        168        10                       37        180        13
18        168        9                        38        180        12
19        170        10.5                     39        183        12.5
20        170        9.5                      40        185        13

Table 1.1: Table 1.1 displays the data collected from each individual. Each person’s height and shoe size were documented.

Analysis and Summary of the Data: Below are several tables and charts that display the data organized and summarized in a variety of ways. By analyzing the data in different ways, we are able to determine shape, outliers, mean, and variability in the data. Ultimately, we will be able to determine whether there is in fact a correlation between height and shoe size.

Column      n    Mean     Variance   Std. Dev.   Std. Err.   Median   Range   Min   Max   Q1      Q3
Height      40   169.75   74.19231   8.613496    1.361913    170      32      153   185   164.5   177.5
Shoe Size   40   9.0375   5.735737   2.39494     0.378673    9.5      8       5     13    7       11

Table 1.2: Table 1.2 displays the summary statistics for each data set.

We will use Table 1.2 to determine whether there are any outliers in our data. If there are any outliers, we would want to exclude them and recalculate in order to have the most accurate results; outliers could skew our data and charts away from the true answer. A value is an outlier if it falls outside the lower and upper fences. The lower fence = Q1 – 1.5(IQR). The upper fence = Q3 + 1.5(IQR). The IQR = Q3 – Q1. The quartiles were taken directly from Table 1.2.
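The fence formulas above can be sketched in a few lines of Python. This is only an illustration, not part of the original analysis: the function names `fences` and `has_outliers` and the `summary` dictionary are our own, and the quartiles, minimums, and maximums are copied from Table 1.2 rather than recomputed from the raw data.

```python
def fences(q1, q3, k=1.5):
    """Return the (lower, upper) outlier fences for the given quartiles."""
    iqr = q3 - q1                      # IQR = Q3 - Q1
    return q1 - k * iqr, q3 + k * iqr  # Q1 - 1.5(IQR), Q3 + 1.5(IQR)


# Q1, Q3, Min, and Max as reported in Table 1.2 (not recomputed here).
summary = {
    "Height": {"q1": 164.5, "q3": 177.5, "min": 153, "max": 185},
    "Shoe Size": {"q1": 7, "q3": 11, "min": 5, "max": 13},
}


def has_outliers(stats):
    """True if the smallest or largest value falls outside the fences."""
    lower, upper = fences(stats["q1"], stats["q3"])
    return stats["min"] < lower or stats["max"] > upper


for name, stats in summary.items():
    lower, upper = fences(stats["q1"], stats["q3"])
    print(f"{name}: lower fence = {lower}, upper fence = {upper}, "
          f"outliers present = {has_outliers(stats)}")
```

Checking only the minimum and maximum is enough here, because if the extreme values lie inside the fences then every other value does as well.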
Height: Q3 = 177.5, Q1 = 164.5, IQR = 13, Lower fence = 164.5 – 1.5(13) = 145, Upper fence = 177.5 + 1.5(13) = 197
Shoe Size: Q3 = 11, Q1 = 7, IQR = 4, Lower fence = 7 – 1.5(4) = 1, Upper fence = 11 + 1.5(4) = 17

All of our collected data falls between the lower and upper fences: heights range from 153 to 185 and shoe sizes from 5 to 13. We can move forward with our study confidently, without any outliers to skew our data.

[Graph: Shoe Size vs. Height. Horizontal axis: Height (Inches). Vertical axis: Shoe Size (Standard American Units). Series: Shoe Size, Linear (Shoe Size). Fitted line: y = 0.25x - 33.4, R² = 0.8084.]

Graph 1.1: Graph 1.1 is a graph of the plotted data set. Additionally, the R² value and the best fit line are displayed.

Graph 1.1 gives us two very important pieces of information. The first is the least squares regression line, also known as the best fit line; we will use this formula later in the study to make predictions, provided our linear model assumption is valid. The second is the correlation coefficient r. The graph reports R² = 0.8084, so r = √0.8084 ≈ 0.899. This value is compared against the table “Critical Values for Correlation Coefficient” (Table 2; 3rd Ed. Prentice Hall “Statistics”, Sullivan). For n = 30 values, Table 2 requires r to be greater than or equal to 0.361. An r value of about 0.899 is clearly large enough to represent a direct correlation. This is the first step in showing that our linear model assumption is valid.

Table 1.3: Table 1.3 shows the data with the corresponding actual values, predicted values, and residuals. The predicted value is calculated by entering the height (x) into the least squares regression formula (y = 0.25x – 33.4). The residual is calculated by subtracting the predicted value from the actual observed value.

[Graph: Residuals. Axis label: Shoe Size (Standard American Units).]

Graph 1.2: Graph 1.2 shows the residuals plotted from the data. Residuals are found by subtracting predicted y values from observed y values, and they are plotted against our x values.
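The best fit line and the residuals described above can be reproduced with a short, standard-library-only Python sketch. It is illustrative rather than the method actually used to build the graphs: the name `least_squares` is our own, the (height, shoe size) pairs are transcribed by hand from Table 1.1, and the textbook least squares formulas are assumed.

```python
# (height, shoe size) pairs transcribed from Table 1.1.
data = [
    (153, 5), (154, 6), (154, 6), (155, 6), (158, 5),
    (159, 7), (160, 6), (161, 5), (163, 6), (164, 7),
    (165, 7), (165, 6), (165, 7), (166, 10), (167, 9.5),
    (167, 10), (168, 10), (168, 9), (170, 10.5), (170, 9.5),
    (170, 8.5), (171, 9), (173, 10), (174, 8), (174, 10),
    (174, 9), (175, 12), (175, 11), (176, 9), (177, 10),
    (178, 11), (178, 11), (178, 12), (179, 10.5), (179, 11.5),
    (179, 11), (180, 13), (180, 12), (183, 12.5), (185, 13),
]


def least_squares(points):
    """Return (slope, intercept, r) of the least squares regression line."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    syy = sum((y - mean_y) ** 2 for _, y in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / (sxx * syy) ** 0.5   # correlation coefficient
    return slope, intercept, r


slope, intercept, r = least_squares(data)

# Residual = observed shoe size minus predicted shoe size.
residuals = [y - (slope * x + intercept) for x, y in data]

print(f"y = {slope:.2f}x + ({intercept:.1f})")
print(f"r = {r:.3f}, r squared = {r * r:.4f}")
print(f"largest residual in absolute value = {max(abs(e) for e in residuals):.2f}")
```

Plotting `residuals` against the heights gives a residual plot of the kind shown in Graph 1.2.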
Table 1.3 and Graph 1.2 are used to check the final two criteria of our linear model assumption. First, Graph 1.2 shows no distinct pattern, which indicates that our data has a clear linear relation. Graphs 1.3 and 1.4 illustrate this.

Graphs 1.3 and 1.4: illustrating the difference between an appropriate and an inappropriate linear model.

Second, the residuals in Graph 1.2 do not spread out as the values increase or decrease. This also indicates that our data has a linear relation. Graph 1.5 is an example of data with no linear relation.

Graph 1.5: an example of an increasing pattern. This data does NOT have a linear relation.

Predicting Values

Since our linear model assumption proves to be valid, we are safe to use our linear regression equation to make several predictions about what shoe size a person of a given height wears. Table 1.3 below displays several predictions and values. Each prediction is the shoe size predicted for a selected height within our data range.

y = 0.25x - 33.4

Trial #   Height (X)   Equation                  Shoe Size (Y)
1         156          y = 0.25(156) - 33.4      6
2         175.5        y = 0.25(175.5) - 33.4    10.5
3         181          y = 0.25(181) - 33.4      12

Table 1.3: Table 1.3 shows values of height (X) entered into the linear regression model to predict shoe size (Y). The raw results (5.6, 10.475, and 11.85) are rounded up to the next half or full size. The predicted shoe sizes are similar to the actual data. The equation given in Graph 1.1 allows us to make safe predictions.

Conclusion and Summary: Our investigation was very successful. Our data had an adequate sample size and number of data points. The data has been represented in a number of ways, in both graphs and tables, so that the reader can understand it completely. Our analysis of the data supported our hypothesis: the linear model assumption proved to be valid by meeting all three criteria. A more pleasing visual representation of the data, such as pie charts and graphs, could have been used. This would have made the report more enticing and therefore more worthwhile, as it would have been read more.
It is also important to note that this data was collected from a single group and was not separated by gender, race, or age, although samples were collected only from people age 16 and older. Also, when using the linear regression equation, shoe size values (Y) were rounded up to the next full size or half size; this matches the sizes in which shoes are sold. The data was collected through a survey on a high school campus, by random sampling.

To summarize, all three criteria were appropriately met in our investigation. The R² value of 0.8084 (r ≈ 0.899) represents a strong correlation in the data; it is large enough to represent a direct correlation. The residuals shown in Graph 1.2 have no distinct pattern. Had there been a definite pattern, it would have indicated that our data is NOT linearly related. The residuals shown in Graph 1.2 also do not spread out as the values increase or decrease. Had they done so, it would likewise have indicated that our data is NOT linearly related.
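As a closing illustration of the rounding rule noted above, the prediction step can be sketched in Python. The function name `predict_shoe_size` is our own, and the sketch assumes that “rounded up to the next full size or half size” means rounding up to the nearest multiple of 0.5.

```python
import math

# Coefficients of the regression line from Graph 1.1: y = 0.25x - 33.4.
SLOPE = 0.25
INTERCEPT = -33.4


def predict_shoe_size(height):
    """Predict a shoe size, rounded up to the next half or full size."""
    raw = SLOPE * height + INTERCEPT
    return math.ceil(raw * 2) / 2   # round up to a multiple of 0.5


# The three trial heights from the prediction table.
for height in (156, 175.5, 181):
    print(f"height {height} -> predicted shoe size {predict_shoe_size(height)}")
```

The rounding is applied only after the equation is evaluated, so the unrounded value is still available if a more precise comparison with the observed data is wanted.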