ACTIVITY SET 1 Jan

Displaying Data - Solutions

1.1 Suppose a medical researcher compares the average blood pressures of women who take oral contraceptives to the blood pressures of women who do not.

a. Is blood pressure a categorical variable or a quantitative variable?
Quantitative; we would measure blood pressure. If we categorized the values (e.g. low, medium, high) then this would be a categorical variable.

b. Is oral contraceptive use (or not) a categorical variable or a quantitative variable?
Categorical; data produced from yes/no questions are categorical.

c. What variables that affect blood pressure might confuse the comparison of average blood pressures for users and nonusers? That is, what factors affecting blood pressure might differ for users and nonusers? Explain.
Answers will vary. One possibility is age. Age affects blood pressure, and there may also be age differences between users and nonusers of oral contraceptives.

1.2 A statistics class at UC Davis was asked “About how many hours do you watch television per week?” A five-number summary of the responses from 173 students follows.

    Median       6
    Quartiles    2      12.5
    Extremes     0      100

a. What were the median hours of weekly television watching? In the context of this situation, write a sentence that interprets the median.
Median = 6. About 50% of the class watched 6 or fewer hours per week.

b. Give the value that completes the following sentence. About 1/4 of the students watch less than ___ hours of television per week.
2 (lower quartile, Q1), since Q1 is the value at or below which 25% of the data fall.

c. Give the value that completes the following sentence. About 1/4 of the students watch more than ___ hours of television per week.
12.5 (upper quartile, Q3), since Q3 is the value at or below which 75% of the data fall; conversely, about 25% of the students watch at or above that amount.

d. What is an interval that describes the middle 1/2 of the students’ television watching amounts?
2 to 12.5 (between the quartiles Q1 and Q3)

e. The mean for these data is 8.9 hours per week. How do you think the mean is calculated?
Sum of all values divided by the total number of values.
Why do you think it is larger than the median in this instance?
There may be one or more large outliers pulling the mean upward; the maximum of 100 hours is far above the upper quartile of 12.5.

1.3 In ANGEL on the Lessons page, access the Datasets folder. Within this folder, click on the link for the data set named U.S. Smoking (Minitab file). This should cause a program named Minitab to open, with the data in place. The data are estimates of the percentage of adults who smoke in each state of the U.S. (and also the District of Columbia).

a. In Minitab, use Graph>Stem-and-Leaf to create a stemplot of the percents that smoke in the 50 states and Washington D.C. In the dialog box, double click on the name of the second column to enter it as the variable you want to plot.

      1    1  2
      1    1
      2    1  6
      5    1  999
     15    2  0001111111
    (18)   2  222222222233333333
     18    2  44444
     13    2  6666666777
      3    2  89
      1    3
      1    3  2

About where do most states fall, in terms of percent smoking?
In the 22 to 23% smoking range.
About what is the lowest percent in the dataset?
From the stemplot it is 12%, corresponding to the 12.7% for UTAH.
About what is the highest percent in the dataset?
32%, corresponding to the 32.6% for KENTUCKY.
What do you notice about the values in the worksheet and the values displayed in the stemplot?
The values in the stemplot have been truncated. That is, the decimal digit in the worksheet data has been dropped (not rounded) when creating the stemplot.
How would you describe the shape of this data?
Roughly symmetric.

b. Use Stat>Basic Statistics>Display Descriptive Statistics to determine summary statistics for the percents. As in part a, double click the name of the second column to list it as the variable we’re analyzing. Inspect the output to find these values:

    Mean percent                     = 23.353
    Standard deviation               = 3.327
    Median percent                   = 23.100
    Lower quartile (denoted by Q1)   = 21.500
    Upper quartile (denoted by Q3)   = 26.000

c.
Write a sentence that interprets the median in the context of this situation.
One interpretation: The smoking percent is 23.1% or less in about one-half of the states (or conversely, the smoking percent is 23.1% or higher in about one-half of the states).

d. What value completes the following sentence? In about 1/4 of the states, the percent that smokes is less than 21.5 (which is Q1).

e. What interval includes the middle 1/2 of the values of the state smoking percentages?
21.5% to 26% (between Q1 and Q3)

f. Use Calc>Calculator to manipulate the data in column 2. In the Store Result window type in ‘Plus10’, and in the Expression window double click on the name of the second column and use the calculator pad to add 10 (click the ‘+’ and 1 then 0), and click OK. Repeat this step, but in the Store Result window enter ‘Times10’ and in the Expression window change the ‘+’ to ‘*’. Again find the Descriptive Statistics to get the mean and standard deviation for the original data (column 2) as well as the new data in columns 3 and 4. For ease, enter all three variables into the Variables box at once. What do you notice about the changes in the mean and standard deviation from the original to the new data?

    Variable            Mean      StDev
    Percentage of Ad    23.353    3.327
    Plus 10             33.353    3.327
    Times 10            233.53    33.27

Note that when you add a constant (in this case, 10) to every value, the mean increases by that constant but the standard deviation remains the same. This is because adding a constant only shifts the location of the center (mean) in the direction of the constant; the spread of the values around the center is unchanged. When you multiply all values by a constant, however, both the mean and the standard deviation are multiplied by that constant (here, both become 10 times as large).

1.4 Car and truck speeds at a particular location have approximately a bell-shaped distribution with mean = 65 mph and standard deviation = 5 mph. Since the mean here is 65 and s = 5, about 68% of the observations will fall between 65 +/- 5.
Continuing, for 95% this will be the mean +/- 2s and for 99.7% this will be the mean +/- 3s.

a. About 68% of cars and trucks travel between ___60___ and ___70___ mph at this location. [Mean +/- 1s = 65 +/- 5]

b. About 95% of cars and trucks travel between ___55___ and ___75___ mph at this location. [Mean +/- 2s]

c. About 99.7% of cars and trucks travel between ___50___ and ___80___ mph at this location. [Mean +/- 3s]

d. A z-score is a measure of how many standard deviations a value is from the mean. Later in the course, we will see that it is an important measure of the size of a value. The formula is

    Z = (Observed Value - Mean) / Standard Deviation

Determine a z-score for a vehicle speed of 72 mph.
Plugging the numbers into the formula:

    Z = (72 - 65) / 5 = 1.4

e. What vehicle speed has a z-score = -1? Said another way, what vehicle speed is one standard deviation below the mean? (You will need to do some algebra to solve for Observed Value.)
Solving the z-score formula for the observed value (OV) gives OV = Mean + S*Z. Plugging in: OV = 65 + (5)*(-1) = 60. So a speed of 60 mph has a z-score of -1; that is, it is one standard deviation below the mean.

1.5 Open the Class Survey (Minitab File) data file from the Datasets folder in ANGEL on the Lessons page. These data are from a survey given to students in my Stat200 courses last semester. You are a researcher and want to use this class survey data to research how PSU undergraduate students compare to these national averages.

a. The purpose of most statistical studies is to use the sample data to generalize to a larger group. What do you think are the weaknesses of using this class survey data for generalizing to all PSU undergraduate students?
In order to generalize to a larger group, i.e. all PSU undergraduate students, your sample group needs to be representative of this larger group.
That is, the makeup of your sample group should reflect that of the larger group. For instance, does our class reflect the undergraduates as a whole with regard to the percentage of females and the racial composition? The best way to accomplish this would be through random sampling.

b. (Importance of checking data.) Compute the Descriptive Statistics for SATM (C16) and SATV (C17). Note the minimum and maximum value for each. (REMEMBER: Stat > Basic Statistics > Display Descriptive Statistics. Enter SATM and SATV together into the Variables window.)

i. From the output, what does the * represent?
The * represents missing data, and N* is the number of missing data points for that variable. That is, 10 students did not provide an answer to SATM and 11 students did not answer SATV.

ii. How many students answered the question regarding their SATM and SATV scores?
SATM 216
SATV 215

c. Now find the Descriptive Statistics for SATM and SATV by Gender (repeat what you did for part b, but now enter Gender in the By Variable window) and use the output to answer the following:

    Group          Q1        Q3        IQR
    Female SATM    537.50    650.00    112.50
    Female SATV    530.00    620.00     90.00
    Male SATM      570.00    670.00    100.00
    Male SATV      525.00    645.00    120.00

d. Using the 5-number summary, a data point is considered an outlier on a boxplot if it is either larger than Q3 + (1.5*IQR) or smaller than Q1 - (1.5*IQR). Calculate and identify any outliers for the Female group.

    SATM: Q3 + (1.5*IQR) = 650 + 168.75 = 818.75
    SATM: Q1 - (1.5*IQR) = 537.50 - 168.75 = 368.75
    SATV: Q3 + (1.5*IQR) = 620 + 135 = 755
    SATV: Q1 - (1.5*IQR) = 530 - 135 = 395

f. Based on the Descriptive Statistics you calculated by Gender, answer the following:
- How do the SAT scores from our survey compare across gender? Do you believe that any differences are significant?
That is, do you think these differences are large enough that they are statistically different?
For both the Math and Verbal sections of the SAT, the males reported higher mean scores. However, since both of the female means are within the IQR of the males, I doubt that the difference is significant. Again, we will study later this semester how to determine whether these differences are statistically significant or just represent small, but not noteworthy, differences.

1.6 Staying with the Class Survey (Minitab File). Column C20, Book Cost, contains the responses to how much students expected to pay for books that semester.

a. Use Graph>Histogram, click on Simple, and then enter Book Cost in the Variables box to draw a histogram. Use the mouse to identify in the graph the characteristics of the various bars in the histogram. Do this to complete the following sentences.
The most frequently reported amount spent was between 275 and 325. Of the 226 students, 60 students said they spent that much. The second most frequently reported amount spent was between 375 and 425.

b. Using Minitab, draw a boxplot of the Book Cost (use Graph>Boxplot, select “Simple” and then enter Book Cost in the variable window). The boxplot provides a graph of the 5-number summary for a set of data. By placing the mouse pointer over the “box”, a pop-up will appear displaying part of the 5-number summary.
- What does the * represent in a boxplot? It identifies any outlier(s) based on the 1.5*IQR rule.
- How many * are there for the variable Book Cost? 12
- What are the outlier values? [Place your mouse over the * to see the value.] 100, 110, 140, 600, 800
- What is the 5-number summary for Book Cost?
    Minimum = 100
    Q1 = 300
    Median = 350
    Q3 = 400
    Maximum = 800
- The shape of the data represented by the box plot can be determined by the location of the median bar in the box and by comparing the length of the “whiskers”, the two lines that extend from either end of the box.
If the median is in the center and the whiskers are of roughly equal length, then the data are symmetrical. If the median is near the bottom of the box and the top whisker is longer, then the distribution is said to be skewed to the right, or positively skewed. If the median is near the top of the box and the bottom whisker is longer, then the distribution is skewed to the left, or negatively skewed.
What is the shape of Book Cost based on the boxplot? Does this concur with how you would interpret the histogram?
The shape is symmetrical, and yes, the shapes are similar. Keep in mind that the larger the number of observations in your data set, the less the shape is influenced by a few outliers.
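
The empirical-rule and z-score arithmetic from activity 1.4 can also be checked with a few lines of Python. This is only an illustrative sketch and not part of the original Minitab activity; the function names (empirical_interval, z_score, value_from_z) are made up for this example, while the mean of 65 mph and standard deviation of 5 mph come from the problem statement.

```python
# Values taken from activity 1.4: bell-shaped speeds, mean 65 mph, SD 5 mph.
MEAN_MPH = 65.0
SD_MPH = 5.0


def empirical_interval(mean, sd, k):
    """Return (low, high) for mean +/- k standard deviations."""
    return (mean - k * sd, mean + k * sd)


def z_score(value, mean, sd):
    """Number of standard deviations that value lies from the mean."""
    return (value - mean) / sd


def value_from_z(z, mean, sd):
    """Invert the z-score formula: observed value = mean + sd * z."""
    return mean + sd * z


if __name__ == "__main__":
    # Hypothetical helper names; the percentages are the empirical-rule values.
    for k, pct in [(1, "68%"), (2, "95%"), (3, "99.7%")]:
        low, high = empirical_interval(MEAN_MPH, SD_MPH, k)
        print(f"About {pct} of speeds fall between {low:g} and {high:g} mph")
    print("z-score for 72 mph:", z_score(72, MEAN_MPH, SD_MPH))
    print("Speed with z = -1:", value_from_z(-1, MEAN_MPH, SD_MPH))
```

Running the script reproduces the intervals in parts a through c (60 to 70, 55 to 75, 50 to 80), and value_from_z is the same algebra used in part e.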

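The 1.5*IQR outlier rule used in activities 1.5(d) and 1.6(b) can be sketched the same way. The quartiles below are the ones reported in the solutions; the function names (iqr_fences, find_outliers) and the short list of book costs are assumptions made for illustration only. The list mixes the reported outliers with the reported quartiles and median, and is not the actual class survey data.

```python
def iqr_fences(q1, q3, multiplier=1.5):
    """Return (lower_fence, upper_fence) for the 1.5*IQR boxplot rule."""
    iqr = q3 - q1
    return (q1 - multiplier * iqr, q3 + multiplier * iqr)


def find_outliers(values, q1, q3):
    """Return the values lying outside the fences, in their original order."""
    low, high = iqr_fences(q1, q3)
    return [v for v in values if v < low or v > high]


if __name__ == "__main__":
    # Female group quartiles from activity 1.5(c).
    print("Female SATM fences:", iqr_fences(537.50, 650.00))
    print("Female SATV fences:", iqr_fences(530.00, 620.00))

    # Book Cost quartiles from activity 1.6(b): Q1 = 300, Q3 = 400.
    # Illustrative values only (NOT the real survey data).
    sample_costs = [100, 110, 140, 300, 350, 400, 600, 800]
    print("Book Cost fences:", iqr_fences(300, 400))
    print("Flagged values:", find_outliers(sample_costs, 300, 400))
```

With Q1 = 300 and Q3 = 400 the fences are 150 and 550, which is consistent with the outlier values 100, 110, 140, 600 and 800 read off the boxplot.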