EDF 6472 Introduction to Data Analysis in Education Assignments Due September 24, 2012 (Chapter 3) – Solutions Green, et al. Ann wants to describe the demographic characteristics of a sample of 25 individuals who completed a large-scale survey. She has demographic data on the participants’ gender (two categories), educational level (four categories), marital status (three categories), and community population size (eight categories). Before answering any of the questions, we want to open Lesson 20 Exercise File 1, the file that contains the data described above. When we do this, we see the Data View window shown below. 1. Conduct a frequency analysis on the gender and marital status variables. From the output, identify the following. 1) Percent of men, b) Mode for marital status, c) Frequency of divorced people in the sample. To carry out the frequency analysis we will use the Frequencies function of SPSS. To find it, first click on the Analyze menu on top of the data view screen. Now choose 1 Descriptive Statistics from the menu and finally choose Frequencies from the submenu that appears. Note the demonstration shown below. You should now see the Frequencies dialog box. Select the variables gender and marital and move these over to the Variable(s): window on the right. The window should look like the one below. Click on the OK button to obtain the output shown on the next page. 2 Frequencies Statistics N Valid Missing Gender 25 0 Marital Status 25 0 Frequency Table Gender Valid Men Women Total Frequency 13 12 25 Percent 52.0 48.0 100.0 Valid Percent 52.0 48.0 100.0 Cumulative Percent 52.0 100.0 Ma rita l Sta tus Valid Married Divorced Never Married Total Frequency 9 11 5 25 Percent 36.0 44.0 20.0 100.0 Valid Percent 36.0 44.0 20.0 100.0 Cumulative Percent 36.0 80.0 100.0 a. The percent of men in the sample is found by observing the Valid Percent value for Men in the first frequency table. We see that 52% of the participants are men. b. The mode is the value that has the highest frequency. In this sample, as we can see from the second frequency table, the mode for marital status is Divorced with a frequency of 11 participants. We could get the same information by looking at the value that has the highest percent. c. The second frequency table tells us that 11 people in this sample are diverced. 2. Create a frequency table to summarize the data on the educational level variable. We will use the Frequencies facility as we did in Question 1 to create a frequency table to summarize the data on educational level. First click on the Analyze menu on top of the data view screen. Now choose Descriptive Statistics from the menu and finally choose Frequencies from the submenu that appears. You should now see the Frequencies dialog box. Select the variable educ and move it over to the Variable(s): window on the right. The window should look like the one on the next page. 3 Now, click on the OK button and obtain the output shown below. This is a frequency table of the data on the educational level variable. Frequencies Ma rita l Sta tus Valid Married Divorced Never Married Total Frequency 9 11 5 25 Percent 36.0 44.0 20.0 100.0 Valid Percent 36.0 44.0 20.0 100.0 Cumulative Percent 36.0 80.0 100.0 Julie asks 50 men and 50 women to indicate what type of books they typically read for pleasure. She codes the responses into 10 categories: drama, mysteries, romance, historical nonfiction, travel books, children’s books, poetry, autobiographies, political science, and local interest books. She also asks participants how many books they read in a month. She categorizes their responses into four categories: no readers (no books a month), light readers (1-2 books a month), moderate readers (3-5 books a month), and heavy readers (more than 5 books a month). Julies SPSS data file contains two variables, book, a 10-category variable for type of books read, and readers, a fourcategory variable indicating the number of books read per month. We must first open file Lesson 20 Exercise File 2 to view this data. It is shown in the data screen pictured below. 4 5. Create a table to summarize the types of books that people report reading. We can create a frequency table using the variable book as we did in Question 1. The Frequency dialog box should look like the one show below. Next we click on the OK button and obtain the frequency distribution on the next page. 5 Frequencies Statistics Book Type N Valid Missing 100 0 Book Type Valid Drama Mysteries Romance Historical Nonfiction Travel Books Children's Literature Poetry Autobiographies Political Science Local Interest Total Frequency 16 7 10 6 13 10 9 9 8 12 100 Percent 16.0 7.0 10.0 6.0 13.0 10.0 9.0 9.0 8.0 12.0 100.0 Valid Percent 16.0 7.0 10.0 6.0 13.0 10.0 9.0 9.0 8.0 12.0 100.0 Cumulative Percent 16.0 23.0 33.0 39.0 52.0 62.0 71.0 80.0 88.0 100.0 6. Crate a pie chart to describe how many books per month Julie’s sample reads. The easiest way to construct a pie chart is to use the Graphs menu on the top of the Data View Screen in SPSS. Click on this menu and then on the Pie… choice as shown on the next page. 6 This will give us the Bar Chart dialog box shown below. Note that Summaries for groups of cases is the default for the arrangement of data in the pie chart. Take this default and click on the Define button to obtain the Define Pie: Summaries for Groups of Cases dialog box shown on the next page. Select the variable reader and move it into the Define Slices by: window using the right arrow since this is the variable that will determine the area of each of the slices of the pie chart. Now, click on the OK button to obtain the output shown on the next page. 7 Graph Your pie chart may look a little different than the one to the left. It may have pieces of solid colors or it may have different filler patterns. This is because of user settings that can be found under the Edit menu in SPSS. Just click on the Edit menu and then click on Options… . Then click on the Charts tab to see how these colors and other characteristics of charts can be changed in SPSS. 8 Hinkle, et al, 1. A college dean examines the cumulative grade-point averages of 120 sophomore students. The following frequency distribution is derived using class intervals of width 0.3. Class interval 3.8 – 4.0 3.5 – 3.7 3.2 – 3.4 2.9 – 3.1 2.6 – 2.8 2.3 – 2.5 2.0 – 2.2 1.7 – 1.9 1.4 – 1.6 1.1 – 1.3 0.8 – 1.0 f 4 8 15 18 20 17 12 12 10 4 0 a. Draw a histogram and a frequency polygon for the distribution. This will be easier to do if we expand the frequency distribution table shown above to include the exact limits and the midpoint of each of the intervals. See the table below. Class interval Exact limits Midpoint f cf % c% 3.8 – 4.0 3.75 – 4.05 3.9 4 120 3.3 100.0 3.5 – 3.7 3.45 – 3.75 3.6 8 116 6.7 96.7 3.2 – 3.4 3.15 – 3.45 3.3 15 108 12.5 90.0 2.9 – 3.1 2.85 – 3.15 3.0 18 93 15.0 77.5 2.6 – 2.8 2.55 – 2.85 2.7 20 75 16.7 62.5 2.3 – 2.5 2.25 – 2.55 2.4 17 55 14.2 45.8 2.0 – 2.2 1.95 – 2.25 2.1 12 38 10.0 31.6 1.7 – 1.9 1.65 – 1.95 1.8 12 26 10.0 21.6 1.4 – 1.6 1.35 – 1.65 1.5 10 14 8.3 11.6 1.1 – 1.3 1.05 – 1.35 1.2 4 4 3.3 3.3 0.8 – 1.0 0.75 – 1.05 0.0 0 0 0.0 0.0 A histogram representing distribution in the table is shown below. 9 The corresponding frequency polygon is shown below. b. Find P10 ,P45 , P60 , and P95 . 10 Np cf w where The percentiles can be found by using the formula PX f i is the real lower (exact) limit of the interval that has the percentile we are looking for, N is the number of subjects in the distribution, p is the proportion of subjects with scores below the percentile we are looking for, cf is the cumulative frequency of the interval below the interval that contains the desire percentile, fi is the frequency of the interval that contains the percentile we are looking for, and w is the width of the interval. The interval 1.4 – 1.6 contains the 10th percentile. We can see this by noting that this interval has 11.6% of the scores in or below it. The next lowest interval only has a cumulative percentage of 3.3%. Clearly, the score that has 10% of the scores at or below it is in the interval 1.4 – 1.6. The real or exact lower limit of this interval is 1.35. There are 120 scores in this distribution, so N equals 120. Ten percent of the scores are at or below the 10th percentile, so p=.10. The cumulative frequency of the interval below the interval that contains the 10th percentile (i.e., the interval 1.1 – 1.3) is 4. The frequency of the interval containing P10 (1.4 – 1.6) is 10, so f1 = 10. Finally, w is the width of the interval which is, in this case, 3. So, we find that P10 , the 10th percentile, is Np cf P10 fi 120.10 4 w 1.35 .3 1.35 0.24 1.59 10 The rest of the percentiles in this question are found in a similar manner. Np cf P45 fi 120.45 38 w 2.25 .3 2.25 0.28 2.53 17 Np cf P60 fi 120.60 55 w 2.55 .3 2.55 0.26 2.81 20 Np cf P95 fi 120.95 108 w 3.45 .3 3.45 0.23 3.68 8 c. Find the percentile ranks of cumulative grade point overages of 2.40, 2.75, 3.25, and 3.60. The percentile ranks of raw scores can be found using the formula Np cf 120.60 55 w 2.55 P60 .3 2.55 0.26 2.81 f 20 i 11 where all the symbols mean the same thing as in Part b and X corresponds to the raw score for which we are trying to find the percentile rank. In the first example, we see that a cumulative grade point average of 2.40 is in the interval 2.3 – 2.5 which has a real lower limit of 2.25 and a frequency 17. So, 2.25 and fi = 17. The cumulative frequency of the interval below 2.3 – 2.5 (that is, 2.0 – 2.2) is 38, so cf = 38. We are looking for the percentile rank of a score of 2.40, so X = 2.40. The width of the intervals (w) is 3 and there are 120 scores in this distribution, so N = 120. Therefore, the percentile rank of a cumulative grade point average of 2.40 in this distribution is PR2.40 X 2.40 2.25 17 cf w f1 38 .3 100 38.75 N 120 The other three percentile ranks are found in a similar manner. PR2.75 X 2.75 2.55 20 cf w f1 55 .3 100 56.94 N 120 PR3.25 X 3.25 3.15 15 cf w f1 93 .3 100 81.67 N 120 PR3.60 X 3.60 3.45 8 cf w f1 108 .3 100 93.33 N 120 12 10. The following are the final examination score of 40 students in a basic statistics class. These scores were randomly selected from the records of all students who have taken the same course over the past 10 years and have taken the standardized final examination. a. Determine the mean and median. The data are 58 86 70 80 82 The mean of a set of scores is found using the formula 88 60 80 72 75 X 89 61 72 76 80 X . In this case We add up the values to find ΣX, N 63 73 82 81 89 75 65 82 86 90 X 3069 75 63 65 84 82 which is equal to 3096. So, X N 40 77.4 . 76 68 82 91 94 68 74 79 84 96 The median will be the center score in the distribution. Since there are 40 scores this means that the median will be the score between the 20th and the 21st score. To find these we must first rank order the data. In this case we rank in ascending order and obtain the following: Rank Score Rank Score 1 58 21 80 2 60 22 80 3 61 23 80 4 63 24 81 5 63 25 82 6 65 26 82 7 65 27 82 8 68 28 82 9 68 29 82 10 70 30 84 11 72 31 84 12 72 32 86 13 73 33 86 14 74 34 88 15 75 35 89 16 75 36 89 17 75 37 90 18 76 38 91 19 76 39 94 20 79 40 96 In the table to the left we can see that the 20th score is 79 and the 21st score is 80. The median score is 79.5 since it is half way between these two scores. 13 b. Determine the variance and standard deviation. X XX X X 58 88 89 63 75 75 76 68 86 60 61 73 65 63 68 74 70 80 72 82 82 65 82 79 80 72 76 81 86 84 91 84 82 75 80 89 90 82 94 96 -19.40 10.60 11.60 -14.40 -2.40 -2.40 -1.40 -9.40 8.60 -17.40 -16.40 -4.40 -12.40 -14.40 -9.40 -3.40 -7.40 2.60 -5.40 4.60 4.60 -12.40 4.60 1.60 2.60 -5.40 -1.40 3.60 8.60 6.60 13.60 6.60 4.60 -2.40 2.60 11.60 12.60 4.60 16.60 18.60 376.36 112.36 134.56 207.36 5.76 5.76 1.96 88.36 73.96 302.76 268.96 19.36 153.76 207.36 88.36 11.56 54.76 6.76 29.16 21.16 21.16 153.76 21.16 2.56 6.76 29.16 1.96 12.96 73.96 43.56 184.96 43.56 21.16 5.76 6.76 134.56 158.76 21.16 275.56 345.96 The variance of a distribution of scores is found 2 X X 2 using the formula S 2 . Note that n 1 we use the denominator n-1 since we are told that this is a sample of a larger population (all the students who have taken the final exam over a period of years. We want to estimate the population standard deviation from the standard deviation this sample, so we subtract one from the sample size to adjust for the fact that samples tend to have smaller variation than the populations they are drawn from. In part a. of this problem we found that the mean of this distribution of scores was 77.4, so we will begin by subtracting 77.4 from each score as shown in the table to the left. Now we compute X X 2 by squaring the differences and placing them in the appropriate column of the table. We find that X X 2 3735.60 . So, X X 2 S 2 n 1 3735.60 3735.60 95.78 40 1 39 The standard deviation is simply the square root of the variance, so S S 2 95.78 9.79 . X X 2 3735.60 14 22. John took three standardized tests. a. Compute a z score for each of John’s test scores. The three standardized test scores obtained by John were Test Score X s Chemistry 85 82 10 Mathematics 80 75 8 English 80 90 12 The z score for a particular score is the number of standard deviations that that score is from the mean of the distribution. It is, therefore, found using the XX formula z . s 85 82 3 .3 . a. For Chemistry the z score is z 10 10 80 75 5 .625 . For Mathematics the z score is z 8 8 80 90 10 -.833. For English, the z score is z 12 12 b. What appears to be John’s strongest subject area? Johns strongest subject area on the basis of these tests in mathematics. He had the highest z score on this examination. 15