ES 25 Quantitative Thinking
Lab 4: Data Description: Summary Statistics and Histograms
Due: Tuesday, May 1st, before 12 noon (e-mail to your facilitator)

Investigation #1: ES 25 Students’ Water Consumption

1. Summary Statistics

Generate “summary statistics” for the list of students’ water consumption (measured in gallons/day):

Tools/Data Analysis/Descriptive Statistics/Input range = (select data)/Output range (select destination cell on same worksheet)/Check “summary statistics”

You should get the following chart (if you do NOT have the Data Analysis ToolPak on your computer, you can get the values in the chart using the formulas given):

Summary Statistics, with interpretation

Mean: 76.93
The "average" of students' water consumption is 77 gal/day.

Median: 65.04
One half of the students reported water consumption values that were less than 65 gal/day, and the other half of the students reported consuming more than 65 gal/day.

Standard Deviation: 47.68
On average, the daily consumption values were 47.7 gal/day more or less than the mean value of 77 gal/day. Note: this is not a very good measure of spread, since the distribution is not symmetric (Normal).

Minimum: 24.71
The smallest reported water use was 24.7 gal/day.

Maximum: 217.00
The largest reported water use was 217 gal/day.

Sum: 3539.00
The total daily water consumption for ES 25 was 3539 gal/day.

Count: 46
Forty-six students participated in this exercise.

2. Frequency table and Histogram

Calculate the “frequencies” for the water consumption data, using the following bins: (0-40], (40-80], (80-120], (120-160], (160-200], (200-240]

NOTE: for Excel to recognize the bins above, you must type them in as: 40, 80, 120, 160, 200, 240. Excel does not recognize parentheses as number values.
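The bin notation above means each bin excludes its lower edge and includes its upper edge, so a value of exactly 40 belongs to (0-40], not (40-80]. The following is a minimal sketch of that counting rule in Python, offered only as a cross-check on the Excel output. The function names `bin_label` and `frequencies` are illustrative, and the sample values are not the class data set: they are the reported minimum, median, and maximum plus two made-up values placed exactly on bin edges.

```python
from bisect import bisect_left

# Upper bin edges, typed the way the lab enters them in Excel.
BIN_EDGES = [40, 80, 120, 160, 200, 240]


def bin_label(value, edges=BIN_EDGES, lowest=0):
    """Return the right-closed bin label, e.g. '(40-80]', that contains value."""
    if not (lowest < value <= edges[-1]):
        raise ValueError(f"{value} falls outside ({lowest}-{edges[-1]}]")
    # bisect_left finds the first edge that is >= value, so a value equal to
    # an edge (such as 40.0) stays in the bin that ends at that edge.
    i = bisect_left(edges, value)
    lower = lowest if i == 0 else edges[i - 1]
    return f"({lower}-{edges[i]}]"


def frequencies(values, edges=BIN_EDGES, lowest=0):
    """Count how many values fall in each right-closed bin."""
    counts = {}
    lower = lowest
    for upper in edges:
        counts[f"({lower}-{upper}]"] = 0
        lower = upper
    for v in values:
        counts[bin_label(v, edges, lowest)] += 1
    return counts


# Illustrative values only (NOT the class data set): the reported minimum,
# median, and maximum, plus two made-up values sitting exactly on bin edges.
sample = [24.71, 40.0, 65.04, 120.0, 217.0]
sample_counts = frequencies(sample)

for label, count in sample_counts.items():
    print(label, count)
```

Replacing `sample` with the full list of 46 reported values should give counts that can be compared against the Excel frequency table.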
Frequencies can be calculated using Tools/Data Analysis/Histogram:
o input = water consumption data, bins = bins in the left-hand column below, output = click on one cell above where you want the values to start

Calculate the “relative frequency” and the “density” to generate the following table (replace the one below with your completed table):

Bin    Bin Range    Frequency   Relative Frequency   Density (bin width = 40)
40     (0-40]       14          0.304348             0.007609
80     (40-80]      16          0.347826             0.008696
120    (80-120]     8           0.173913             0.004348
160    (120-160]    3           0.065217             0.001630
200    (160-200]    4           0.086957             0.002174
240    (200-240]    1           0.021739             0.000543
Sum                 46          1

Use the “Chart” to make a “Frequency histogram.” Try to make it look like this:

[Figure: Histogram: ES 25 Student Water Consumption. x-axis: Total Daily Water Consumption (gal/day), bins (0-40] through (200-240]; y-axis: Frequency.]

Use the “Chart” to make a Relative Frequency histogram. Paste below.

[Figure: Relative Frequency Histogram: ES 25 Student Water Consumption. x-axis: Total Daily Water Consumption (gal/day), bins (0-40] through (200-240]; y-axis: Relative Frequency.]

Use the “Chart” to make a Density histogram. Paste below.

[Figure: Density Function: ES 25 Student Water Consumption. x-axis: Total Daily Water Consumption (gal/day), bins (0-40] through (200-240]; y-axis: Relative Frequency/Bin Width.]

3. Investigate effects of bin width:

Make Relative Frequency histograms for the bin widths below. Format the histograms like the previous group, for easier comparison. Paste below (resize them so that you can easily and accurately compare the graphs).
o Bin width = 25 and Bin width = 10

[Figure: Relative Frequency Histogram of ES 25 Daily Water Consumption. x-axis: Water Consumption (gal/day), bins (0-25] through (200-225]; y-axis: Relative Frequency.]

[Figure: Relative Frequency Histogram of ES 25 Daily Water Consumption. x-axis: Water Consumption (gal/day), bins (0-10] through (210-220]; y-axis: Relative Frequency.]

Which bin width do you think is most appropriate for displaying the water consumption data? Write a paragraph justifying your choice (think about what patterns you can see that may be “masked” by larger or smaller bin choices).

Answers will vary. (With the smaller bins the distribution looks more bimodal: you get an extra peak at (70-80], but lose the big trend of lots of observations in (25-50]. The highest observation also shows up as an interesting outlier with smaller bins.)

4. Draw conclusions (answer in complete sentences):

Write a paragraph description of the distribution (think about type, outliers, clusters, maximums, minimums; see pages 26-27 of the workbook).

The distribution is right-skewed, which makes sense because you can’t consume negative amounts of water (therefore, a “normal” distribution would be difficult). From the wide-binned histogram, it looks like a vast majority of students consume less than 100 gallons per day, but there are a few outliers that seem to consume way more than everyone else. I wonder if there was measurement error, or if student water consumption really varies as much as we see in this histogram.

What proportion of students used greater than or equal to 120 gallons of water per day? Explain how you got your answer (which histogram did you use, did you have to add numbers up, did you compare areas, etc.)
[Figure: Frequency Histogram of ES 25 Daily Water Consumption. x-axis: Water Consumption (gal/day), bins (0-10] through (210-220]; y-axis: Frequency.]

I used the frequency histogram above. I added the number of observations >120 gal/day, which was: 1 + 2 + 3 + 1 + 1 = 8. So, there were 8 students out of 46 who reported consuming more than 120 gal/day, which is 17.4%. (This agrees with the frequency table for bin width 40, where the bins above 120 hold 3 + 4 + 1 = 8 students.)

What percentage of students consume less water than the mean (you may have to calculate this by hand, or assume that values are uniformly distributed within bins)?

58.7% (I counted 27 out of 46 less than the mean).

What percentage of students consume less water than the median?

By definition, 50% of the observations are less than the median, and 50% are greater than the median.

Would you say “most students consume less than average?”

YES! Since the distribution is skewed right, the mean is influenced by the heavy water users, and is not a good measure of center. Also, since the standard deviation is measured using the mean, it is also not a good measure of spread (there is a measure called the interquartile range which is better).

What is the probability that a randomly selected student in ES 25 would use less than 50 gallons of water per day? Explain how you got your answer (which histogram did you use, did you have to add numbers up, did you compare areas, etc.)

The proportion of students who consume less than 50 gallons per day is 17/46 (37%), so the probability that a randomly selected student would consume less than 50 gallons per day is 37%.
You can get this value from the relative frequency histogram just by looking at the heights of the bars below 50 gal/day and adding them up (true, you have to make sure 50.00 is not in the dataset, since the question asks for less than 50, and your bin ends with 50]…)

Pretend that we actually did this experiment, for five days, and every single day a different, randomly selected student reported using less than 50 gal/day. How would you explain the discrepancy between the probability that you calculated above, and the observed phenomenon? (Clearly, there is no “right answer”; we are looking for logical reasoning.)

Perhaps students started practicing water conservation! (Or they lied.)

What is the z-score associated with your personal water consumption (report your water consumption, in gal/day)? If you did not do HW1, choose a value, report the value (in gal/day), and calculate the associated z-score. Interpret your answer, for this problem.

Investigation #2: Distribution of Carbon Dioxide for a One-year Period at Mauna Loa

1. Copy the 2004 monthly CO2 data from the “MaunaLoaCO2” spreadsheet into a new spreadsheet (name it 2004CO2). Paste special/transpose to get the data in a column rather than a row.

2. Summary Statistics

Generate “summary statistics” for the distribution of CO2 during the year 2004.

Summary Statistics, with interpretation

Mean: 377.64
The "average" carbon dioxide concentration over the 12 months of 2004 was 377.64 ppm.

Median: 377.43
One half of the CO2 concentrations were less than 377.43 ppm, and the other half of the CO2 concentrations were more than 377.43 ppm.

Standard Deviation: 1.95
On average, the CO2 concentrations varied by about 1.95 ppm (more or less than the mean value of 377.64 ppm). Note: this seems like a pretty good measure of spread, since the median is close to the mean, and the distribution looks symmetric (Normal).

Minimum: 374.06
The smallest observed concentration of CO2 in 2004 was 374.06 ppm.

Maximum: 380.63
The largest observed concentration of CO2 in 2004 was 380.63 ppm.

Range: 6.57
The observations span 6.57 ppm (see min and max above). This, like standard deviation, is also a measure of spread (notice that it is much greater than the standard deviation; you should understand why).

Count: 12
There are 12 months in a year. If you got 14, how do you explain it?

Paste your findings (in chart form) below. Interpret the values: mean, median, standard deviation, and range, in terms of this problem. You may report your answer in the chart above, if you like.

3. Frequency table and Histogram

Using a bin width of 2, calculate the “frequencies,” “relative frequency” and the “density,” to generate the following table (paste your completed version, below):

Bin    Bin Label    Frequency   Relative Frequency   Density
373    (371-373]    0           0                    0
375    (373-375]    1           0.08333333           0.04166667
377    (375-377]    3           0.25                 0.125
379    (377-379]    5           0.41666667           0.20833333
381    (379-381]    3           0.25                 0.125
Sum                 12          1

Using the chart tool, make a “relative frequency” histogram (format it like you did in the water problem, with no space between the bins). Paste below. (below, next to 1 ppm bin width histogram)

4. Investigate effects of bin width:

Make a new frequency table (like the one above) and relative frequency histogram for a bin width = 1 ppm. Paste below.

Which bin width (1 ppm or 2 ppm) is best at revealing patterns of variation in the data? Justify your choice.

Answers will vary; look for changes in the pattern of the histogram.

[Figure: Relative Frequency Histogram, CO2, Mauna Loa, 2004. x-axis: CO2 (ppm), bins (371-373] through (379-381]; y-axis: Relative Frequency.]

[Figure: Relative Frequency Histogram, CO2, Mauna Loa, 2004. x-axis: CO2 (ppm), bins (373-374] through (380-381]; y-axis: Relative Frequency.]

5. Is the distribution changing over time?
Generate a new frequency table and histogram for the CO2 distribution in the year that you were born (your “birth year”). If you were born before 1958, use the year 1960 data. Your histogram should be of “relative frequency,” and should use a bin width of 1 ppm. Paste the frequency table, and histogram, below.

[Figure: Distribution of CO2 at Mauna Loa, 1975. x-axis: CO2 (ppm), bins (328-329] through (332-333]; y-axis: Relative Frequency.]

Bin    Bin Label    Frequency   Relative Frequency
328    (327-328]    0           0
329    (328-329]    2           0.2
330    (329-330]    1           0.1
331    (330-331]    3           0.3
332    (331-332]    3           0.3
333    (332-333]    1           0.1

Draw conclusions (answer in complete sentences):

In the year 2004, what proportion of months had carbon dioxide concentrations greater than 379 ppm?

0.25 (25%)

In your birth year, what proportion of months had carbon dioxide concentrations greater than 379 ppm?

0

Compare the ‘typical’ CO2 observation in 2004 to the ‘typical’ CO2 observation in your birth year (think: should you be using the mean, or the median?). Explain why the values (from your birth year to 2004) are so different (hint: see Keeling Curve, page 4 of your workbook).

The concentration of CO2 is increasing in the atmosphere. Therefore, the average amount in 2004 will be much higher than that in 1975. I expected the distribution to be skewed left, since many observed high values of CO2 are “piling up” on the right-hand side, as CO2 increases over time. If it were skewed, it would be better to use the median than the mean (however, my birth year is far enough back that it looks pretty symmetrical, and 12 data points is really not enough to get a smooth distribution).

Compare the range and standard deviation for the 2004 data and your birth year data. Offer logical explanations (they don’t need to be “correct”) for the differences that you observe.

                 1975      2004
mean             331.15    377.64
median           331.15    377.43
range            5.62      6.57
st deviation     1.81      1.95
All values are in ppm.

It appears as if the carbon dioxide concentrations are becoming more spread out within one year than they used to be. In other words, we see a broader range of measurements (which are not as concentrated around the “average” concentration). I think that this could be because of CO2 fertilization (plants may be growing more in the summer, taking up more CO2 than in the past, and thus respiring more CO2 in the winter).

Compare the shape of the distribution to what you predicted it would look like (in HW3). Offer a reasonable hypothesis for the observed shape.

I thought the distribution would have been bimodal, but in fact it seems to be Normal or skewed left. I thought bimodal because of the summer and winter extreme CO2 concentrations, but now I realize that most of the time, CO2 concentration is between these two values (so that the bulk of the observations are in the middle). As previously mentioned, the distribution is becoming more left-skewed as the concentration of CO2 increases, forcing more observations to the right side of the distribution (with a few trailing observations on the left from the first few months of the year).
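
The relative frequencies, densities, and “greater than 379 ppm” proportions reported in this investigation follow from the bin counts alone. The following is a minimal sketch in Python, assuming the counts from the 2004 (bin width 2 ppm) and 1975 (bin width 1 ppm) frequency tables. The helper names `frequency_table` and `proportion_above` are illustrative and not part of the lab, which uses Excel.

```python
def frequency_table(labels, counts, bin_width):
    """Build (label, frequency, relative frequency, density) rows from raw counts."""
    total = sum(counts)
    rows = []
    for label, count in zip(labels, counts):
        rel = count / total        # share of all observations in this bin
        density = rel / bin_width  # relative frequency per ppm
        rows.append((label, count, rel, density))
    return rows


def proportion_above(rows, threshold):
    """Add up the relative frequencies of bins whose lower edge is >= threshold."""
    total = 0.0
    for label, _count, rel, _density in rows:
        # "(379-381]" -> "379-381" -> lower edge 379.0
        lower = float(label.strip("(]").split("-")[0])
        if lower >= threshold:
            total += rel
    return total


# Counts copied from the 2004 table (bin width = 2 ppm).
labels_2004 = ["(371-373]", "(373-375]", "(375-377]", "(377-379]", "(379-381]"]
counts_2004 = [0, 1, 3, 5, 3]
table_2004 = frequency_table(labels_2004, counts_2004, bin_width=2)

# Counts copied from the 1975 table (bin width = 1 ppm).
labels_1975 = ["(327-328]", "(328-329]", "(329-330]",
               "(330-331]", "(331-332]", "(332-333]"]
counts_1975 = [0, 2, 1, 3, 3, 1]
table_1975 = frequency_table(labels_1975, counts_1975, bin_width=1)

for year, table in (("2004", table_2004), ("1975", table_1975)):
    print(year)
    for label, count, rel, density in table:
        print(f"  {label:>10}  {count:>2}  {rel:.4f}  {density:.4f}")
    print("  proportion above 379 ppm:", proportion_above(table, 379))
```

This works only because 379 ppm is itself a bin edge. For a threshold that falls inside a bin, the counts alone are not enough and you would need the individual monthly values.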