EDF 6472 Introduction to Data Analysis in Education
Assignments Due October 1, 2012 – Solutions

Green, et al.

We begin this assignment by bringing the file Lesson 21 Exercise File 1 into the Data View screen in the SPSS system. The screen should look like the one shown below. We are told that these are anxiety scores from 15 college students who visit the university health center during finals week.

1. Compute descriptive statistics on the anxiety scores. From the output, identify the following: (a) Skewness, (b) Mean, (c) Standard deviation, and (d) Kurtosis.

We can find these statistics using the Explore facility of SPSS. To find it, click on the Analyze menu at the top of the Data View screen. Now, click on the Descriptive Statistics submenu and then choose Explore… from the choices that appear. The procedure should look like the example on the next page.

This will give us the Explore dialog box. In this box, move the variable containing the anxiety scores, anxiety, into the Dependent List: box by highlighting it and clicking on the right arrow. Under Display in the lower left hand corner of the dialog box, choose Statistics. The dialog box should look like the one below. Now, click on the OK button and you will receive the output shown on the next page.

Descriptives: anxiety (Anxiety Scores)

                                                  Statistic   Std. Error
Mean                                                32.27       6.062
95% Confidence Interval for Mean    Lower Bound     19.27
                                    Upper Bound     45.27
5% Trimmed Mean                                     31.24
Median                                              25.00
Variance                                           551.210
Std. Deviation                                      23.478
Minimum                                              5
Maximum                                             78
Range                                               73
Interquartile Range                                 40
Skewness                                             .416        .580
Kurtosis                                           -1.124       1.121

From this table we can see that the skewness of the distribution is .416; the mean is 32.27; the standard deviation is 23.478; and the kurtosis of the distribution of anxiety scores is -1.124.

2. Compute percentile ranks on the anxiety scores assuming that the distribution is normal. What are the scores associated with the percentile ranks of 12, 27, 38, 73, and 88?
This task becomes easier to understand if we keep in mind a few things we have learned. The first is the definition of the percentile rank. Remember that the percentile rank of a score is the percent of people who score below that score. Also recall that z-scores are useful because we know the proportion of people who fall below a given z-score if we know certain characteristics of the distribution of scores, such as the characteristics of a normal distribution. Finally, remember that the mean of a distribution of z-scores is 0 and that it has a standard deviation of 1.

Since we know the characteristics of a normal distribution of z-scores, let’s begin by converting all of the anxiety scores to z-scores. This is very simple to do using the Descriptives procedure in SPSS. To do this, click on the Analyze menu on the Data View screen and then click on the Descriptive Statistics submenu. From the choices now presented to you, click on Descriptives…. The Data View screen should look like the one shown on the next page.

This will give you the Descriptives dialog. Select the variable anxiety and move it into the Variable(s): window using the right arrow. Now click on the Save standardized values as variables box to obtain z-score equivalents of the raw anxiety scores for each case. The box should look like the one below. Now, click on the OK button.

You will see a table with a group of statistics, but we aren’t really interested in these at this time. What we do want to see is the z-scores of the anxiety scores. Move back to the Data View screen and note that the new variable containing the z-scores has been created by the system, as you can see on the next page.

We can use these z-scores to find the percentile ranks of the anxiety scores using the SPSS Compute function. The compute function CDF.NORMAL tells us the proportion (and, once multiplied by 100, the percent) of the total scores that are below a certain score, assuming the distribution is normal.
If you read the last sentence carefully you will realize that the percent of total scores below a certain score is that score’s percentile rank! To do this compute task, click on the Transform menu on the Data View and click Compute on that menu. This should give you the Compute dialog box.

Let’s call the new variable that will hold the subjects’ anxiety score percentile ranks anxrank. Type this variable name in the Target Variable: box. Now, scroll down the Functions: window until you come to CDF.NORMAL(q,mean,stddev) and click on the up arrow to move it into the Numeric Expression: box. The dialog box will look like the one shown to the left below.

The values in the parentheses represented by question marks need to be filled in. The letter q, represented by the first question mark, stands for the variable that will be converted to percentile ranks to form the new variable. In this case we will take the variable Zanxiety (the z-score of the anxiety scores) and use it to find the percentile ranks. So, highlight the variable Zanxiety in the variable list and click the right arrow to move the variable name in place of the first question mark. The second question mark stands in the place of the mean in the expression CDF.NORMAL(q,mean,stddev). The mean of any set of z-scores is 0, so replace the term mean with the number 0 in the expression. Finally, the standard deviation of any group of z-scores is 1. Replace the value stddev in the expression with the number 1.

Since we want percents rather than merely proportions, we need to multiply the obtained value by 100, so add *100 to the end of the expression (the symbol * is the symbol that SPSS uses for multiplication). The finished expression reads CDF.NORMAL(Zanxiety,0,1)*100. The Compute dialog box should now look like the one below.

Now, click on the OK button and look at the Data View screen to see the new variable, anxrank, which is the percentile rank of the anxiety scores. The new Data View screen should look like the one on the next page.
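The two SPSS steps above (standardize, then apply the normal CDF and multiply by 100) can be sketched outside SPSS as well. The following Python sketch is only an illustration: the helper names `normal_cdf` and `percentile_rank` are made up for this example, and it uses the rounded mean and standard deviation from the Descriptives output rather than the raw data file, so later decimals may differ slightly from what SPSS reports. The five raw scores are the ones named in the answer to this exercise.

```python
from math import erf, sqrt

# Summary statistics taken from the Descriptives output in Exercise 1
# (rounded values; the raw data file is not reproduced here).
MEAN = 32.27
SD = 23.478


def normal_cdf(z):
    """Proportion of a standard normal distribution that falls below z."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))


def percentile_rank(score, mean=MEAN, sd=SD):
    """Percentile rank of a raw score, assuming a normal distribution."""
    z = (score - mean) / sd        # plays the role of the saved Zanxiety value
    return normal_cdf(z) * 100     # mirrors CDF.NORMAL(Zanxiety,0,1)*100


# Raw scores named in the answer to Exercise 2.
scores = [5, 18, 25, 47, 60]
ranks = {x: percentile_rank(x) for x in scores}

for x, pr in ranks.items():
    print(f"anxiety = {x:2d}   percentile rank = {pr:.1f}")
```

Rounded to whole numbers, the printed values should line up with the percentile ranks of 12, 27, 38, 73, and 88 asked about in the exercise.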
The scores associated with the percentile ranks of 12, 27, 38, 73, and 88 can be found by locating the percentile rank in the anxrank column and looking at the variable anxiety for that case. For example, Case #1 has a percentile rank of 12. The same case has an anxiety score of 5, telling us that a person who has a percentile rank of 12 on the anxiety scale has a raw anxiety score of 5. The other four values are associated with anxiety scores of 18, 25, 47, and 60, respectively.

3. Compute percentile ranks on the anxiety scores not assuming that the distribution of scores is normal.

We begin to solve this problem by creating a variable that is the rank of the anxiety scores (in ascending order). This is done using the Rank Cases facility of SPSS. We can find this function by clicking on the Transform menu on the Data View screen and choosing the Rank Cases procedure.

This gives us the Rank Cases dialog box. Since we want to rank the anxiety scores, we move the variable anxiety into the Variable(s): window by highlighting it and clicking on the right arrow. Now, uncheck the Display summary tables box. You should see the dialog box shown on the next page. Now, click on the OK button. Look at the Data View screen shown below to see the new variable, Ranxiety (ranked anxiety).

It is important to note that when we rank a set of values in ascending order, the rank is actually the number of cases that fall at or below that score. For instance, the smallest value is given the rank of 1, and it makes sense to say that one score is equal to or less than the lowest score in a distribution: the score is equal to itself and there are no scores lower. By subtracting .5 from each rank, we credit each score with the cases that fall below it plus half of the case at the score itself; this midpoint convention is what the expression used below applies (rounding down instead would give the number of cases strictly below the score). If we divide this number by the total number of scores in the distribution, we have the proportion of total scores that fall below that value.
Finally, if we multiply that proportion by 100, we have the percent of scores that fall below that value. That is, we have the percentile rank of the raw score that corresponds to that rank.

We can use the Compute function of SPSS to do all this arithmetic for us. Click on the Transform menu on the Data View screen. On the menu that pops up, click Compute… to open the Compute dialog box. Now, type the name of the variable that will contain the percentile ranks when the normal distribution is not assumed, anxranknn, in the Target Variable: window. Type the expression that will give us this variable in the Numeric Expression: window. This expression is ((Ranxiety-.5)/15)*100. The dialog box should look like the one shown below. Now click on the OK button and the new variable is created. Note it on the Data View screen that is shown below.

4. Create a histogram to show the distribution of the anxiety scores. Edit the graph so that most of the normal curve is visible.

We’ll create this histogram by clicking on the Graphs menu on the Data View screen and clicking on Histogram… in the submenu that appears. In the Histogram dialog box, move the variable anxiety into the Variable: box by highlighting it and clicking on the right arrow. Now, make sure that you have checked the box labeled Display normal curve. The dialog box should look like the one below. Now, click on the OK button and you should get the output shown below.

[Figure: histogram of Anxiety Scores with the normal curve overlaid. Horizontal axis: Anxiety Scores (0 to 80); vertical axis: Frequency (0 to 4). Mean = 32.27, Std. Dev. = 23.478, N = 15.]

A difficulty with this histogram is that we cannot see the ends of the normal curve that has been drawn over the histogram. We could see more of the normal curve if we expanded the right and left sides of the chart. We can do this using the editing facility built into SPSS graphs and charts. To do this, double click on the graph.
When you have accomplished this you will see the Chart Editor with the histogram already placed in it, as shown on the next page.

We could move the left side of the Anxiety Scores scale further to the left by beginning the chart at a lower value than the current value of zero. Something like -20 (one additional interval) should do it. Likewise, we can move the right side of the scale further to the right by defining a higher upper end such as 100 (also one additional interval). To accomplish this, click anywhere on the horizontal axis. This gives you the Properties window shown below.

Click on the Scale tab to get the box shown on the next page. Change the lower end of the scale to -20 by changing the value of 0 in the box on the Minimum line to -20. We can change the highest value from 80 to 100 by changing the number on the Maximum line from 80 to 100. The dialog box should look like the one shown below. Now click on the Apply button to change the graph. The edited graph shows up in the Chart Editor.

Next, close both the Properties box and the Chart Editor by clicking on the Xs in the upper right hand corner of the objects. You should now be left with just the edited graph in the SPSS output, as shown below.

[Figure: edited histogram of Anxiety Scores with the normal curve overlaid. Horizontal axis: Anxiety Scores (-20 to 100); vertical axis: Frequency (0 to 4). Mean = 32.27, Std. Dev. = 23.478, N = 15.]

5. Based on the histogram and the descriptive statistics, which percentile rank method should you use?

In Exercise #1, we calculated the skewness of this distribution and found it to be .416. This indicates a positively skewed distribution in which more subjects score below the mean than above it. We can see this in the histogram. Both indicate that it probably wouldn’t be appropriate to assume that the data are distributed normally. Therefore, we should use the percentile rank method that does not assume the data are distributed normally.

Hinkle, et al.

2.
Assume that a set of 200 scores is normally distributed with a mean of 60 and a standard deviation of 12.

a. What are the z-scores corresponding to the raw scores of 76, 38, and 50?

Since a z-score is the number of standard deviations a score lies from the mean of the distribution, we use the following formula to compute the z-score of each raw score: $Z = \frac{X - \bar{X}}{S}$. So, for a raw score of 76 we have $Z_{76} = \frac{76 - 60}{12} = \frac{16}{12} = 1.33$. For a raw score of 38, $Z_{38} = \frac{38 - 60}{12} = \frac{-22}{12} = -1.83$. Finally, for a raw score of 50, $Z_{50} = \frac{50 - 60}{12} = \frac{-10}{12} = -.83$.

b. How many scores lie between the values of 48 and 80? 65 and 75? 34 and 52?

To find out how many scores lie between 48 and 80, we must first find the z-scores associated with raw scores of 48 and 80. Then we can use the table of the normal distribution to find the proportion of scores between these two z-scores: $Z_{48} = \frac{48 - 60}{12} = \frac{-12}{12} = -1.00$ and $Z_{80} = \frac{80 - 60}{12} = \frac{20}{12} = 1.67$. Since Z = -1.00 is below the mean and Z = 1.67 is above the mean, we must first find the proportion of the area under the curve between -1.00 and the mean. Looking at the table of areas under the normal curve, we see that .3413 (34.13%) of the scores fall between -1.00 and the mean (use the "Area between $\bar{X}$ and z" column to obtain this). The area between the mean and 1.67 is .4525 (45.25%). So the total area between -1.00 and 1.67 is .3413 + .4525 = .7938 (79.38%). Since there are 200 people in the distribution, the number of people scoring between 48 and 80 is (.7938)(200) = 158.76, or 159 people.

Using the same strategy we can find how many scores fall between 65 and 75: $Z_{65} = \frac{65 - 60}{12} = \frac{5}{12} = .42$ and $Z_{75} = \frac{75 - 60}{12} = \frac{15}{12} = 1.25$. Since both scores are above the mean, we can first find the area between the mean and 1.25. It is .3944. Next we find the area under the curve between the mean and a z-score of .42. It is .1628. If we subtract the area between the mean and .42 from the area between the mean and 1.25, we have the area between .42 and 1.25. So, .3944 - .1628 = .2316. Multiplying this area by 200 people, we find that (.2316)(200) = 46.32, or about 46 people, have scores between 65 and 75.

Finally, using the same strategy, we can calculate the number of people who have scores between 34 and 52: $Z_{34} = \frac{34 - 60}{12} = \frac{-26}{12} = -2.17$ and $Z_{52} = \frac{52 - 60}{12} = \frac{-8}{12} = -.67$. Both scores are below the mean. If we find the area under the curve between the mean and -.67 and subtract it from the area between -2.17 and the mean, we will have the area under the curve between -2.17 and -.67. Using the table of the normal curve, we find that the area between a z-score of -.67 and the mean is .2486 of the total area under the curve. For a z-score of -2.17, the proportion of area under the curve between the score and the mean of the distribution is .4850. So .4850 - .2486 = .2364 of the area under the curve is between the values of 34 and 52. Since we have 200 subjects in the sample, we find that (.2364)(200) = 47.28, or 47 subjects, have scores between 34 and 52.

c. How many scores exceed the values of 80, 60, and 40?

In part b of this exercise we found that a raw score of 80 corresponds to a z-score of 1.67. To find the proportion of scores that exceed 1.67, we look in the table of areas under the normal curve and find the proportion of scores that is beyond a z-score of 1.67. This is .0475. Since there are 200 people in the sample, we can see that (.0475)(200) = 9.5, or 10 people, have scores that exceed 80.

Sixty is the mean of the distribution. We know that in a normal curve 50% of the scores are above the mean. Since there are 200 people in the sample, there must be 100 people who have scores above 60.

The z-score corresponding to a raw score of 40 is $Z_{40} = \frac{40 - 60}{12} = \frac{-20}{12} = -1.67$. Since this score is below the mean, to find the proportion of scores above a z-score of -1.67 we can find the area between -1.67 and the mean and add it to the area above the mean, which we know is .5000. The area under the normal curve between -1.67 and the mean is found in the table to be .4525. If we add this to the area above the mean we get .4525 + .5000 = .9525. Since we have 200 people in the sample, (.9525)(200) = 190.50, or 191 of the 200 people in the sample, have raw scores above 40.

d. How many scores are less than the values of 35, 50, and 75?

The z-score corresponding to a raw score of 35 is $Z_{35} = \frac{35 - 60}{12} = \frac{-25}{12} = -2.08$. The area of the curve beyond (that is, below) a z-score of -2.08 is .0188 according to the table. Therefore (.0188)(200) = 3.76, or 4 people, have scores lower than a raw score of 35.

The z-score corresponding to a raw score of 50 is $Z_{50} = \frac{50 - 60}{12} = \frac{-10}{12} = -.83$. The area of the curve beyond (that is, below) a z-score of -.83 is .2033 according to the table. Therefore (.2033)(200) = 40.66, or 41 people, have scores lower than a raw score of 50.

In part b of this exercise we found that the z-score corresponding to a raw score of 75 is 1.25. Since this value is above the mean, we need to find the area between a z-score of 1.25 and the mean and add it to the area below the mean (.5000) in order to obtain the proportion of scores that fall below a raw score of 75. In part b we found this area was .3944. So, we find that .5000 + .3944 = .8944 of the area of the curve is below a raw score of 75. Since there are 200 people in the sample, (.8944)(200) = 178.88, or 179 people, have raw scores below 75.

e. Find $P_{35}$, $P_{80}$, $PR_{55}$, and $PR_{70}$.

$P_{35}$ is the 35th percentile. By definition, it is the score at or below which 35 percent of the scores fall. Since this value is below the mean (the mean is the 50th percentile in a normal distribution), all we need to know is the z-score below which (i.e., beyond which) .3500 of the scores fall. We can then convert this z-score into a raw score using the formula $X = ZS + \bar{X}$. Using the table of areas under the normal distribution, we find that the z-score whose entry in the "Area beyond z" column is closest to .3500 is Z = -.39 (negative because it lies below the mean). So, the raw score at the 35th percentile is $X = (-.39)(12) + 60 = 55.32$.
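The table lookups in parts b through d, and the $P_{35}$ result just obtained, can be cross-checked numerically. The sketch below is one way to do that in Python with the standard library's `statistics.NormalDist` (available in Python 3.8 and later); the helper names `count_between`, `count_above`, and `count_below` are made up for this illustration and are not part of the textbook's method. Because the code works with unrounded z-scores, its answers differ from the two-decimal table values by a few tenths of a person.

```python
from statistics import NormalDist

N = 200                               # number of scores in the distribution
dist = NormalDist(mu=60, sigma=12)    # mean and standard deviation from the problem


def count_between(lo, hi):
    """Expected number of scores between two raw scores."""
    return N * (dist.cdf(hi) - dist.cdf(lo))


def count_above(x):
    """Expected number of scores above a raw score."""
    return N * (1 - dist.cdf(x))


def count_below(x):
    """Expected number of scores below a raw score."""
    return N * dist.cdf(x)


# Part b
between_48_80 = count_between(48, 80)
between_65_75 = count_between(65, 75)
between_34_52 = count_between(34, 52)

# Part c
above_80, above_60, above_40 = count_above(80), count_above(60), count_above(40)

# Part d
below_35, below_50, below_75 = count_below(35), count_below(50), count_below(75)

# Part e: raw score at the 35th percentile
p35 = dist.inv_cdf(0.35)

for label, value in [
    ("between 48 and 80", between_48_80),
    ("between 65 and 75", between_65_75),
    ("between 34 and 52", between_34_52),
    ("above 80", above_80),
    ("above 60", above_60),
    ("above 40", above_40),
    ("below 35", below_35),
    ("below 50", below_50),
    ("below 75", below_75),
]:
    print(f"{label:>18}: {value:6.2f}")
print(f"P35 = {p35:.2f}")
```

Small disagreements with the hand calculations come from rounding each z-score to two decimals before reading the table, not from a different method.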
The 80th percentile would be above the mean. Since we know that .5000 of the scores below the 80th percentile are below the mean, all we have to do is find the z-score that has .3000 of the scores between it and the mean; together with the .5000 below the mean, this places .8000 of the scores below that z-score. The z-score with the area closest to .3000 between it and the mean is z = .84. The raw score corresponding to z = .84 is $X = (.84)(12) + 60 = 70.08$. This is the 80th percentile.

$PR_{55}$ is the percentile rank corresponding to a raw score of 55. If we knew the z-score of a raw score of 55, we could determine what proportion of scores falls below this z-score. This would give us the percentile rank corresponding to that raw score. The z-score corresponding to a raw score of 55 is $Z_{55} = \frac{55 - 60}{12} = \frac{-5}{12} = -.42$. This value is below the mean, so we can find the proportion of scores that falls below it by just finding the proportion of scores beyond it. The table tells us that the proportion of scores beyond a z-score of -.42 is .3372. Therefore about 34% of the scores fall below a raw score of 55, so 55 is at about the 34th percentile.

$PR_{70}$ is the percentile rank corresponding to a raw score of 70. If we knew the z-score of a raw score of 70, we could determine what proportion of scores falls below this z-score. This would give us the percentile rank corresponding to that raw score. The z-score corresponding to a raw score of 70 is $Z_{70} = \frac{70 - 60}{12} = \frac{10}{12} = .83$. Since this score is above the mean and we know that .5000 of the scores fall below the mean, we can find the proportion of scores between a z-score of .83 and the mean and add it to the .5000 of the area that is below the mean to get the proportion of scores that falls below a raw score of 70. We find that .2967 of the area under the curve falls between a raw score of 70 and the mean of the distribution. So, .5000 + .2967 = .7967 of the scores fall below a raw score of 70, placing it at about the 80th percentile.

4.
A statistics instructor tells the class that grading will be based on the normal distribution. He plans to give 10 percent As, 20 percent Bs, 40 percent Cs, 20 percent Ds, and 10 percent Fs. If the final examination scores have a mean of 75 and a standard deviation of 9.6, what is the range of scores for each grade?

If 10% of the grades are to be As, then the lower cutoff score for a grade of A is the score below which 90% of the scores fall (the 90th percentile). We know that 50% of the scores fall below the mean of the distribution. This leaves us with 40% of the scores falling between the 90th percentile and the mean. What score includes 40% of the scores between it and the mean? If we knew the z-score of the 90th percentile we could figure this out. In Table C.1 of your textbook, under the “Area between $\bar{X}$ and z” column, a z-score of 1.28 comes closest to taking in .4000 of the scores. It actually takes in .3997 of the scores.

Now if the mean is 75, the standard deviation is 9.6, and the distribution is normal, what raw score corresponds to a z-score of 1.28? Remember that $z = \frac{X - \mu}{\sigma}$. Multiplying both sides by σ, we find that zσ = X - μ. Adding μ to both sides, we find that X = zσ + μ. So, in this case we can find the value of the raw score that corresponds to a z-score of 1.28 by computing $X = z\sigma + \mu = (1.28)(9.6) + 75 = 87.29$. So, students who receive final exam scores of 87 and above will get grades of A.

Since 20% of the students will obtain a grade of B, we know that the lower end of the B interval is at the 70th percentile (10% for As and 20% for Bs). Using the same rationale as shown above, we know there are 50% of the scores below the mean and 20% (50% - 30%) between the 70th percentile and the mean. Using Table C.1, we look for the z-score that has 20% of the scores between it and the mean. In this case the z-score we need is .52. So, we can find the raw score that corresponds to a z-score of .52 using $X = z\sigma + \mu = (.52)(9.6) + 75 = 79.99$. So, students who receive final exam scores of 80 to 86 will get grades of B.
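The same conversion from a cumulative proportion to a raw score, X = zσ + μ, is used for every grade boundary in this problem. A minimal Python sketch of that conversion is given here, assuming the standard library's `statistics.NormalDist` (Python 3.8 and later); the dictionary names `lower_tail` and `cutoffs` are made up for this illustration. It uses exact inverse-normal values instead of the two-decimal z-scores read from Table C.1, so the raw cutoffs can differ from the hand calculations in the second decimal place.

```python
from statistics import NormalDist

# Final examination scores: mean 75, standard deviation 9.6 (from the problem).
exam = NormalDist(mu=75, sigma=9.6)

# Proportion of scores that fall below the lowest score earning each grade:
# 10% As leaves 90% below; 20% Bs leaves 70% below; and so on.
lower_tail = {"A": 0.90, "B": 0.70, "C": 0.30, "D": 0.10}

# inv_cdf returns the raw score X = z*sigma + mu for the given cumulative proportion.
cutoffs = {grade: exam.inv_cdf(p) for grade, p in lower_tail.items()}

for grade, x in cutoffs.items():
    print(f"lowest score for a grade of {grade}: {x:.2f} (rounds to {round(x)})")
print("scores below the D cutoff receive a grade of F")
```

Rounding each cutoff to a whole score gives the lower limit of each grade interval.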
Since 40% of the students will obtain a grade of C, we know that the lower end of the C interval is at the 30th percentile (10% for As, 20% for Bs, and 40% for Cs). From the 70th percentile, going down through 20% of the scores gets us to the mean, since it is also the median, or the 50th percentile, when the scores are distributed normally. Going down another 20% of the scores (since there will be 40% Cs) brings us to the 30th percentile. Now, what value of z cuts off the lower 30% of the scores? To put it another way, what is the z-score beyond which 30% of the scores lie? Going to Table C.1 and looking at the “Area beyond z” column shows us that a z-score of -.52 is the value we are looking for. Remember, this value is negative because it is below the mean. We can find the raw score that corresponds to a z-score of -.52 by evaluating the equation $X = z\sigma + \mu = (-.52)(9.6) + 75 = 70.01$. So, examinees who score between 70 and 79 on the final examination will receive a grade of C.

Since 20% of the students will obtain a grade of D, we know that the lower end of the D interval is at the 10th percentile (10% for As, 20% for Bs, 40% for Cs, and 20% for Ds). Now, what value of z cuts off the lower 10% of the scores? To put it another way, what is the z-score beyond which only 10% of the scores lie? Going to Table C.1 and looking at the “Area beyond z” column shows us that a z-score of -1.28 is the value we are looking for. Remember, this value is negative because it is below the mean. We can find the raw score that corresponds to a z-score of -1.28 by evaluating the equation $X = z\sigma + \mu = (-1.28)(9.6) + 75 = 62.71$. So, examinees who score between 63 and 69 on the final examination will receive a grade of D.

Finally, it follows that the rest of the students will receive Fs as their grades. So, any student with a final examination score less than 63 will receive a grade of F.

8. In an ancient culture, the average male life span was 37.6 years, with a standard deviation of 4.8 years.
The average female life span was 41.2 years, with a standard deviation of 7.7 years. Use the properties of the normal distribution to find out the following:

a. What percentage of men died before age 30?

To find the percentage of men who died before age 30, we need to find the z-score that corresponds to an age of 30 in the distribution of men and find what proportion of the men’s ages fell below this value: $z_{30} = \frac{X - \mu}{\sigma} = \frac{30 - 37.6}{4.8} = \frac{-7.6}{4.8} = -1.58$. Since this value is below the mean, all we have to do is find the area below (i.e., beyond) z = -1.58, and that will give us the proportion of men who died before age 30. The table of the normal distribution tells us that this is .0571. We can say that 5.71% of the men died before age 30.

b. What percentage of women lived to an age of at least 50?

Using the same logic we can find the proportion of women who lived to at least 50: $z_{50} = \frac{50 - 41.2}{7.7} = \frac{8.8}{7.7} = 1.14$. Since the age is above the mean, the proportion of women who lived 50 years or longer is simply the area above (beyond) this z-score. Using the table of the normal distribution, we find that the proportion of scores beyond z = 1.14 is .1271. So we see that 12.71% of the women lived to at least age 50.

c. At what age is a female death at the same relative position in the distribution as a male death at age 35?

A male age of 35 corresponds to a z-score of $z_{35} = \frac{35 - 37.6}{4.8} = \frac{-2.6}{4.8} = -.54$. Now we need to find the age that corresponds to a z-score of -.54 for women. We know that $X = z\sigma + \mu$, so for women, $X = (-.54)(7.7) + 41.2 = 37.04$. So, a death at the age of 37 for women has the same relative position as a male death at 35.
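The three life-span answers can be checked with a short Python sketch. It assumes the standard library's `statistics.NormalDist` (Python 3.8 and later), and the variable names are made up for this illustration. The code keeps full precision in the z-scores, so its percentages differ from the table-based answers by a few hundredths of a percentage point.

```python
from statistics import NormalDist

# Life-span distributions given in the problem (years).
men = NormalDist(mu=37.6, sigma=4.8)
women = NormalDist(mu=41.2, sigma=7.7)

# a. Percentage of men who died before age 30 (area below 30).
pct_men_before_30 = men.cdf(30) * 100

# b. Percentage of women who lived to at least 50 (area above 50).
pct_women_at_least_50 = (1 - women.cdf(50)) * 100

# c. Female age with the same z-score as a male death at age 35.
z_male_35 = (35 - men.mean) / men.stdev
female_age_same_position = z_male_35 * women.stdev + women.mean

print(f"a. men dying before 30:        {pct_men_before_30:.2f}%")
print(f"b. women living to at least 50: {pct_women_at_least_50:.2f}%")
print(f"c. z for a male death at 35:    {z_male_35:.2f}")
print(f"   equivalent female age:       {female_age_same_position:.2f}")
```

Part c needs no table at all: it only converts an age to a z-score in one distribution and back to an age in the other.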