EDF 6472

Introduction to Data Analysis in Education
Assignments Due October 1, 2012 – Solutions
Green, et al.
We begin this assignment by bringing the file Lesson 21 Exercise File 1 into the
Data View screen in the SPSS system. The screen should look like the one shown below.
We are told that these are anxiety scores from 15 college students who visit the
university health center during finals week.
1. Compute descriptive statistics on the anxiety scores. From the output, identify the
following: (a) Skewness, (b) Mean, (c) Standard deviation, and (d) Kurtosis.
We can find these statistics using the Explore facility of SPSS. To find it, click
on the Analyze menu on top of the Data View screen. Now, click on the Descriptive
Statistics submenu and then choose Explore… from the choices that appear. The
procedure should look like the example on the next page.
This will give us the Explore dialog box. In this box, move the variable
containing the anxiety scores, anxiety, into the Dependent List: box by highlighting it and
clicking on the right arrow. Under Display in the lower left-hand corner of the dialog box,
choose Statistics. The dialog box should look like the one below.
Now, click on the OK button and you will receive the output shown on the next page.
Descriptives: anxiety (Anxiety Scores)

Statistic                            Value      Std. Error
Mean                                 32.27      6.062
95% Confidence Interval for Mean
  Lower Bound                        19.27
  Upper Bound                        45.27
5% Trimmed Mean                      31.24
Median                               25.00
Variance                             551.210
Std. Deviation                       23.478
Minimum                              5
Maximum                              78
Range                                73
Interquartile Range                  40
Skewness                             .416       .580
Kurtosis                             -1.124     1.121

From this table we can see that the skewness of the distribution is .416; the mean
is 32.27; the standard deviation is 23.478; and the kurtosis of the distribution of anxiety
scores is -1.124.
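As a quick consistency check on the table, the standard error and the 95% confidence interval can be recomputed from the mean, the standard deviation, and N. The sketch below is not part of the SPSS procedure; the critical t value of 2.145 for 14 degrees of freedom is an assumption taken from a standard t table rather than computed.

```python
import math

# Values read from the Descriptives table.
n = 15
mean = 32.27
std_dev = 23.478

# Assumption: two-tailed 95% critical t value for n - 1 = 14 degrees of
# freedom, copied from a standard t table rather than computed here.
T_CRIT_DF14 = 2.145

std_error = std_dev / math.sqrt(n)        # standard error of the mean
lower = mean - T_CRIT_DF14 * std_error    # lower bound of the 95% CI
upper = mean + T_CRIT_DF14 * std_error    # upper bound of the 95% CI

print(f"Std. Error = {std_error:.3f}")
print(f"95% CI = ({lower:.2f}, {upper:.2f})")
```

The recomputed values should agree with the Std. Error, Lower Bound, and Upper Bound entries to the precision shown in the table.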
2. Compute percentile ranks on the anxiety scores assuming that the distribution is
normal. What are the scores associated with the percentile ranks of 12, 27, 38, 73,
and 88?
This task becomes easier to understand if we keep a few things we have learned in
mind. The first is the definition of the percentile rank. Remember that the percentile
rank of a score is the percent of people who score below that score. Also recall that z-scores
are useful because we know the proportion of people who fall below a given z-score if we
know certain characteristics of the distribution of scores, such as the characteristics of a
normal distribution. Finally, remember that the mean of a distribution of z-scores is 0
and that it has a standard deviation of 1.
Since we know the characteristics of a normal distribution of z-scores, let’s begin
by converting all of the anxiety scores to z-scores. This is very simple to do using the
Descriptives procedure in SPSS. To do this, click on the Analyze menu on the Data View
screen and then click on the Descriptive Statistics submenu. From the choices now
presented to you click on Descriptives…. The Data View screen should look like the one
shown on the next page.
This will give you the Descriptives dialog. Select the variable anxiety and move it into
the Variable(s): window using the right arrow. Now click on the Save standardized
values as variables box to obtain z-score equivalents of the raw anxiety scores for each
case. The box should look like the one below.
Now, click on the OK button. You will see a table with a group of statistics, but we
aren’t really interested in these at this time. What we do want to see is the z-scores
of the anxiety scores. Move back to the Data View screen and note that the new
variable containing the z-scores has been created by the system, as you can see on the
next page.
We can use these z-scores to find the percentile ranks of the anxiety scores using
the SPSS Compute function. The compute command CDF.NORMAL tells us the percent
of the total scores that are below a certain score assuming the distribution is normal. If
you read the last sentence carefully you will realize that the percent of total scores below
a certain score is that score’s percentile rank! To do this compute task, click on the
Transform menu on the Data View and click Compute on that menu. This should give
you the Compute dialog box. Let’s call the new variable that will hold the anxiety score
percentile ranks of the subjects anxrank. Type this variable name in the Target Variable:
box. Now, scroll down the Functions: window until you come to
CDF.NORMAL(q,mean,stddev) and click on the up arrow to move it into the Numeric
Expression: box. The dialog box will look like the one shown to the left below.
The values in the parentheses, shown as question marks, need to be filled in. The
letter q, represented by the first question mark, stands for the variable that will be
converted to percentile ranks to form the new variable. In this case we will take the
variable Zanxiety (the z-score of the anxiety scores) and use it to find the percentile
ranks. So, highlight the variable Zanxiety and click the right arrow to move the variable
name in place of the first question mark.
The second question mark is in the place of the mean in the expression
CDF.NORMAL(q,mean,stddev). The mean of any set of z-scores is 0, so replace the term
mean with the number 0 in the expression. Finally, the standard deviation of any group
of z-scores is 1. Replace the value stddev in the expression with the number 1. Since we
want percents rather than merely proportions, we need to multiply the obtained value by
100, so add *100 to the end of the expression (the symbol * is the symbol that SPSS uses
for multiplication). The Compute dialog box should now look like the one below.
Now, click on the OK button and look at the Data View screen to see the new variable,
anxrank, which is the percentile rank of the anxiety scores. The new data view screen
should look like the one on the next page.
The scores associated with the percentile ranks of 12, 27, 38, 73, and 88 can be
found by finding the percentile rank in the anxrank column and looking at the variable
anxiety for that case. For example Case #1 has a percentile rank of 12. The same case
has an anxiety score of 5, telling us that a person who has a percentile rank of 12 on the
anxiety scale has a raw score value of anxiety equal to 5. The other four values are
associated with anxiety scores of 18, 25, 47, and 60, respectively.
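The same calculation can be sketched outside SPSS. The Python below is only an illustration, not the SPSS procedure itself: it uses the mean and standard deviation from the Descriptives output and the standard library's NormalDist in place of CDF.NORMAL. The function name normal_percentile_rank is a hypothetical helper.

```python
from statistics import NormalDist

MEAN = 32.27      # mean of the anxiety scores (from the Descriptives output)
STD_DEV = 23.478  # standard deviation (from the Descriptives output)

def normal_percentile_rank(score, mean=MEAN, std_dev=STD_DEV):
    """Percentile rank of a raw score, assuming a normal distribution."""
    z = (score - mean) / std_dev          # plays the role of Zanxiety
    return NormalDist(0, 1).cdf(z) * 100  # mirrors CDF.NORMAL(Zanxiety,0,1)*100

scores = [5, 18, 25, 47, 60]
ranks = [round(normal_percentile_rank(s)) for s in scores]
print(ranks)
```

Rounded to whole numbers, the five scores reproduce the percentile ranks asked for in the exercise: 12, 27, 38, 73, and 88.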
3. Compute percentile ranks on the anxiety scores not assuming that the distribution of
scores is normal.
We begin to solve this problem by creating a variable that is the rank of the
anxiety scores (in ascending order). This is done using the Rank Cases facility of SPSS.
We can find this function by clicking on the Transform menu on the Data View screen
and choosing the Rank Cases procedure. This gives us the Rank Cases dialog box. Since
we want to rank the anxiety scores, we move the variable anxiety into the Variable(s):
window by highlighting it and clicking on the right arrow. Now, uncheck the Display
summary tables box. You should see the dialog box shown on the next page.
Now, click on the OK button. Look at the Data View screen shown below to see the new
variable, Ranxiety (ranked anxiety).
It is important to note that when we rank a set of values in ascending order, the
rank is actually the number of cases that fall at or below the score. For instance, the
smallest value is given the rank of 1 and it makes sense to say that one score is equal to or
less than the lowest score in a distribution. The score is equal to itself and there are no
scores lower. By subtracting .5 from each rank, we count the cases that fall below the
score in question plus half of the case at that score. If we divide this number by the total
number of scores in the distribution, we have the proportion of total scores that fall below
that value. Finally, if we multiply that proportion by 100, we have
the percent of scores that fall below that value. That is, we have the percentile rank of the
raw score that corresponds to that rank. We can use the Compute function of SPSS to do
all this arithmetic for us.
Click on the Transform menu on the Data View screen. On the menu that pops
up, click Compute… and receive the Compute dialog box. Now, type in the name of the
variable that will contain the percentiles when the normal distribution is not assumed,
anxranknn, in the Target Variable: window. Type the expression that will give us this
variable in the Numeric Expression: window. This expression is ((Ranxiety-.5)/15)*100.
The dialog box should look like the one shown below.
Now click on the OK button and the new variable is created. Note it on the Data View
screen that is shown below.
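The arithmetic in the Numeric Expression can be sketched in a few lines of Python. This is only an illustration: the individual ranks are not listed in the text, so the sketch evaluates the formula at the lowest, middle, and highest possible ranks for 15 cases. The function name rank_percentile is hypothetical.

```python
def rank_percentile(rank, n):
    """Percentile rank from an ascending rank, without assuming normality."""
    return (rank - 0.5) / n * 100  # same arithmetic as ((Ranxiety-.5)/15)*100

N = 15  # number of students in the exercise

lowest = rank_percentile(1, N)    # rank of the smallest score
middle = rank_percentile(8, N)    # rank of the middle score when there are no ties
highest = rank_percentile(15, N)  # rank of the largest score
print(round(lowest, 2), round(middle, 2), round(highest, 2))
```

Because of the .5 adjustment, no score is assigned a percentile rank of 0 or 100, and the middle rank lands exactly at 50.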
4. Create a histogram to show the distribution of the anxiety scores. Edit the graph so
that most of the normal curve is visible.
We’ll create this histogram by clicking on the Graphs menu on the Data View
screen and clicking on Histogram… in the submenu that appears. In the Histogram
dialog box, move the variable anxiety into the Variable: box by highlighting it and
clicking on the right arrow. Now, make sure that you have checked the box labeled
Display normal curve. The dialog box should look like the one below.
Now, click on the OK button and you should get the output shown below.
[Figure: histogram of Anxiety Scores with normal curve overlaid. Vertical axis: Frequency (0 to 4). Horizontal axis: Anxiety Scores (0 to 80). Mean = 32.27, Std. Dev. = 23.478, N = 15.]
A difficulty with this histogram is that we cannot see the ends of the normal curve that
has been drawn over the histogram. We could see more of the curve if we expanded the
right and left sides of the chart. We can do this using the editing facility built into
SPSS graphs and charts. To do this, double click on the graph. When you have
accomplished this you will see the Chart Editor with the histogram already placed in it
as shown on the next page.
We could move the left side of the Anxiety Scores scale further to the left by
beginning the chart at a lower value than the current value of zero. Something like -20
(one additional interval) should do it. Likewise, we can move the right side of the scale
further to the right by defining a higher upper end such as 100 (also one additional
interval). To accomplish this, click anywhere on the horizontal axis. This gives you the
Properties window shown below.
Click on the Scale tab and get the box shown on the next page.
Change the lower end of the scale to -20 by changing the value of 0 in the box on the
Minimum line to -20. We can change the highest value from 80 to 100 by changing the
number on the Maximum line from 80 to 100. The dialog box should look like the one
shown below.
Now click on the Apply button to change the graph. The edited graph shows up in the
Chart Editor. Next, close both the Properties box and the Chart Editor by clicking on the
Xs in the upper right-hand corner of the objects. You should now be left with just the
edited graph in the SPSS output as shown below.
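[Figure: edited histogram of Anxiety Scores with normal curve overlaid. Vertical axis: Frequency (0 to 4). Horizontal axis: Anxiety Scores (-20 to 100). Mean = 32.27, Std. Dev. = 23.478, N = 15.]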
5. Based on the histogram and the descriptive statistics, which percentile rank method
should you use?
In Exercise #1, we calculated the skewness of this distribution and found it to be
.416. This indicates a positively skewed distribution where there are more subjects
scoring below the mean than there are scoring above the mean. We can see this in the
histogram. Both indicate that it probably wouldn’t be appropriate to assume that the data
was distributed normally. Therefore, we should use the percentile rank method where it
is not assumed that the data is distributed normally.
Hinkle, et al.
2. Assume that a set of 200 scores is normally distributed with a mean of 60 and a
standard deviation of 12.
a. What are the z-scores corresponding to the raw scores of 76, 38, and 50?
Since z-scores are the number of standard deviations a score is from the mean of the
distribution, we use the following formula to compute the z-score of each of the raw
scores: $Z = \frac{X - \bar{X}}{S}$. So, for a raw score of 76 we have
$Z_{76} = \frac{76 - 60}{12} = \frac{16}{12} = 1.33$. For a raw score of 38,
$Z_{38} = \frac{38 - 60}{12} = \frac{-22}{12} = -1.83$. Finally, for a raw score of 50,
$Z_{50} = \frac{50 - 60}{12} = \frac{-10}{12} = -.83$.
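The three conversions in part a can be checked with a small helper. This is a minimal sketch in Python; the function name z_score is hypothetical, and the mean of 60 and standard deviation of 12 come from the exercise.

```python
def z_score(x, mean, std_dev):
    """Number of standard deviations that x lies from the mean."""
    return (x - mean) / std_dev

# Raw scores from part a, with the distribution's mean of 60 and SD of 12.
z_values = {x: round(z_score(x, 60, 12), 2) for x in (76, 38, 50)}
print(z_values)
```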
b. How many scores lie between the values of 48 and 80? 65 and 75? 34 and 52?
To find out how many scores lie between 48 and 80, we must first find the z-scores
associated with raw scores of 48 and 80. Then we can use the table of the normal
distribution to find the proportion of scores between these two z-scores.
$Z_{48} = \frac{48 - 60}{12} = \frac{-12}{12} = -1.00$. $Z_{80} = \frac{80 - 60}{12} = \frac{20}{12} = 1.67$.
Since Z = -1.00 is below the mean and Z = 1.67 is above the mean, we must first find the
proportion of the area under the curve between –1.00 and the mean. Looking at the table
of areas under the normal curve, we see that .3413 (34.13%) of the scores fall between
-1.00 and the mean (use the “Area between $\bar{X}$ and z” column to obtain this). The area
between the mean and 1.67 is .4525 (45.25 %). So the total area between -1.00 and 1.67
is .3413 + .4525 = .7938 (79.38%). Since there are 200 people in the distribution, the
number of people scoring between 48 and 80 is (.7938) (200) = 158.76 or 159 people.
Using the same strategy we can find how many scores fall between 65 and 75.
$Z_{65} = \frac{65 - 60}{12} = \frac{5}{12} = .42$. $Z_{75} = \frac{75 - 60}{12} = \frac{15}{12} = 1.25$.
Since both scores are above the
mean we can first find the area between the mean and 1.25. It is .3944. Next we will
find the area under the curve between the mean and a z-score of .42. It is .1628. If we
subtract the area between the mean and .42 from the area under the curve between the
mean and 1.25, we can find the area between .42 and 1.25. So, .3944 - .1628 = .2316.
Multiplying the area under the curve by 200 people we find that (.2316)(200) = 46.32 or
46 people have scores between 65 and 75.
Finally, using the same strategy, we can calculate the number of people who have scores
between 34 and 52. $Z_{34} = \frac{34 - 60}{12} = \frac{-26}{12} = -2.17$ and
$Z_{52} = \frac{52 - 60}{12} = \frac{-8}{12} = -.67$.
Both scores are below the mean. If we find the area under the curve between the mean
and -.67 and subtract it from the area between -2.17 and the mean we will have the area
under the curve between -.67 and –2.17. Using the table of the normal curve we find that
the area between a score of -.67 and the mean is .2486 of the total area under the curve.
For a score of –2.17 the proportion of area under the curve between the score and the
mean of the distribution is .4850. So .4850 - .2486 = .2364 of the area under the curve is
between the values of 34 and 52. Since we have 200 subjects in the sample, we find that
(.2364)(200) = 47.28 or 47 subjects have scores between 34 and 52.
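The table lookups in part b can be approximated in code. The sketch below is an illustration in Python, not the textbook method: it rounds each z-score to two decimals to mimic reading a printed table, then uses the standard library's normal CDF in place of Table C.1. The function name expected_count_between is hypothetical.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist(0, 1)

def expected_count_between(low, high, mean=60, std_dev=12, n=200):
    """Expected number of scores between two raw scores in a normal distribution."""
    # Round z to two decimals to mimic looking the values up in a printed table.
    z_low = round((low - mean) / std_dev, 2)
    z_high = round((high - mean) / std_dev, 2)
    proportion = STD_NORMAL.cdf(z_high) - STD_NORMAL.cdf(z_low)
    return proportion * n

counts = [expected_count_between(a, b) for a, b in [(48, 80), (65, 75), (34, 52)]]
print([round(c, 2) for c in counts])
```

The unrounded counts land within a few hundredths of the hand calculations; any remaining difference comes from the four-decimal areas printed in the table.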
c. How many scores exceed the values of 80, 60, and 40?
In part B of this exercise we found that a raw score of 80 corresponded to a z-score of
1.67. To find out the proportion of scores that exceed 1.67 we look in the table of scores
under the normal curve and find the proportion of scores that is beyond a z-score of 1.67.
This is .0475. Since there are 200 people in the sample, we can see that (.0475) (200) =
9.5 or 10 people have scores that exceed 80.
Sixty is the mean of the distribution. We know that in a normal curve 50% of the scores
are above the mean. Since there are 200 people in the sample, there must be 100 people
who have scores beyond a score of 60.
The z-score corresponding to a raw score of 40 is
$Z_{40} = \frac{40 - 60}{12} = \frac{-20}{12} = -1.67$. Since
this score is below the mean, to find the proportion of scores above a z-score of -1.67 we
can find the area between -1.67 and the mean and add it to the area above the mean,
which we know is .5000. The area under the normal curve between -1.67 and the mean is
found in the table to be .4525. If we add this to the area above the mean we get .4525 +
.5000 = .9525. Since we have 200 people in the sample, (.9525)(200) = 190.50 or 191 of
the 200 people in the sample have raw scores above 40.
d. How many scores are less than the values of 35, 50, and 75?
The z-score corresponding to a raw score of 35 is
$Z_{35} = \frac{35 - 60}{12} = \frac{-25}{12} = -2.08$. The
area of the curve beyond (that is below) a z-score of -2.08 is .0188 according to the table.
Therefore (.0188)(200) = 3.76 or 4 people have scores lower than a raw score of 35.
The z-score corresponding to a raw score of 50 is
$Z_{50} = \frac{50 - 60}{12} = \frac{-10}{12} = -.83$. The area
of the curve beyond (that is below) a z-score of -.83 is .2033 according to the table.
Therefore (.2033)(200) = 40.66 or 41 people have scores lower than a raw score of 50.
In part b of this exercise we found that the z-score corresponding to a raw score of 75 is
1.25. Since this value is above the mean we need to find the area between a score of 1.25
and the mean and add it to the area below the mean (.5000) in order to obtain the
proportion of scores that fall below a raw score of 75. In part b we found this area was .3944.
So, we find that .5000 + .3944 = .8944 of the area of the curve is below a raw score of 75.
Since there are 200 people in the sample, (.8944)(200) = 178.88 or 179 people have raw
scores below 75.
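Parts c and d use the same idea in the two tails of the curve. As a rough check, the sketch below (Python, standard library only) computes the expected number of scores above and below each cutoff. The helper names are hypothetical, and z is rounded to two decimals to mimic a table lookup.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist(0, 1)
MEAN, STD_DEV, N = 60, 12, 200  # values given in the exercise

def expected_count_below(x):
    """Expected number of scores below raw score x (z rounded as in a table)."""
    z = round((x - MEAN) / STD_DEV, 2)
    return STD_NORMAL.cdf(z) * N

def expected_count_above(x):
    """Expected number of scores above raw score x."""
    return N - expected_count_below(x)

above = {x: expected_count_above(x) for x in (80, 60, 40)}  # part c
below = {x: expected_count_below(x) for x in (35, 50, 75)}  # part d
print({x: round(c, 2) for x, c in above.items()})
print({x: round(c, 2) for x, c in below.items()})
```

Rounding each result to the nearest whole person gives the head counts for each cutoff.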
e. Find $P_{35}$, $P_{80}$, $PR_{55}$, and $PR_{70}$.
$P_{35}$ is the 35th percentile. By definition, it is the score at or below which 35 percent of the
scores fall. Since this value is below the mean (the mean is the 50th percentile in a
normal distribution) all we need to know is the z-score below which (i.e. beyond which)
.3500 of the scores fall. We can then convert this z-score into a raw score using the
formula $X = ZS + \bar{X}$. Using the table of areas under the normal distribution we find that
the z-score whose value in the “Area beyond z” column is closest to .3500 is Z = -.39. So, the raw score at
the 35th percentile is $X = (-.39)(12) + 60 = 55.32$.
The 80th percentile would be above the mean. Since we know that .5000 of the scores
below the 80th percentile are below the mean, all we have to do is find the z-score that has
.3000 of the scores between it and the mean. Because .5000 + .3000 = .8000 of the scores
fall below that z-score, it is at the 80th percentile. The z-score with the area closest to .3000
between it and the mean is z = .84. The raw score corresponding to z = .84 is
$X = (.84)(12) + 60 = 70.08$. This is the 80th percentile.
$PR_{55}$ is the percentile rank corresponding to a raw score of 55. If we knew the z-score of a raw
score of 55 we could determine what proportion of scores falls below this z-score. This
would give us the percentile rank corresponding to that raw score. The z-score
corresponding to a raw score of 55 is $Z_{55} = \frac{55 - 60}{12} = \frac{-5}{12} = -.42$. This value is below
the mean so we can find the proportion of scores that falls below it by just finding the
proportion of scores beyond it. The table tells us that the proportion of scores beyond a
z-score of -.42 is .3372. Therefore about 34% of the scores fall below a raw score of
55, so 55 is at about the 34th percentile.
$PR_{70}$ is the percentile rank corresponding to a raw score of 70. If we knew the z-score of a raw
score of 70 we could determine what proportion of scores falls below this z-score. This
would give us the percentile rank corresponding to that raw score. The z-score
corresponding to a raw score of 70 is $Z_{70} = \frac{70 - 60}{12} = \frac{10}{12} = .83$. Since this score is above
the mean and we know that .5000 of the scores fall below the mean, we can find the
proportion of scores between a z-score of .83 and the mean and add this to the .5000 of
the area that is below the mean and get the proportion of scores that falls below a raw
score of 70. We find that .2967 of the area under the curve falls between a raw score of
70 and the mean of the distribution. So, .5000 + .2967 = .7967 of the scores fall below a
raw score of 70, placing it at about the 80th percentile.
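Percentiles and percentile ranks are inverse operations, which can be sketched with the normal CDF and its inverse. The Python below is an illustration only. It uses exact z-scores rather than the two-decimal table values, so its results differ slightly from the hand calculations. The helper names percentile and percentile_rank are hypothetical.

```python
from statistics import NormalDist

scores = NormalDist(mu=60, sigma=12)  # distribution given in the exercise

def percentile(p):
    """Raw score at the p-th percentile (p strictly between 0 and 100)."""
    return scores.inv_cdf(p / 100)

def percentile_rank(x):
    """Percent of scores expected to fall below raw score x."""
    return scores.cdf(x) * 100

p35, p80 = percentile(35), percentile(80)
pr55, pr70 = percentile_rank(55), percentile_rank(70)
print(round(p35, 2), round(p80, 2), round(pr55, 1), round(pr70, 1))
```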
4. A statistics instructor tells the class that grading will be based on the normal
distribution. He plans to give 10 percent As, 20 percent Bs, 40 percent Cs, 20 percent Ds, and
10 percent Fs. If the final examination scores have a mean of 75 and a standard
deviation of 9.6, what is the range of scores for each grade?
If 10% of the grades are to be As, then the lower cutoff score for a grade of A is the score
below which 90% of the scores fall (the 90th percentile). We know that 50% of the scores
fall below the mean of the distribution. This leaves us with 40% of the scores falling
between the 90th percentile and the mean. What score includes 40% of the scores
between it and the mean? If we knew the z-score of the 90th percentile we could figure
this out. In Table C.1 of your textbook, under the “Area between $\bar{X}$ and z” column, a z-score of 1.28
comes closest to taking in .4000 of the scores. It actually takes in .3997 of the scores.
Now if the mean is 75 and the standard deviation is 9.6 and the distribution is normal,
what raw score corresponds to a z-score of 1.28? Remember that $z = \frac{X - \mu}{\sigma}$.
Multiplying both sides by σ, we find that zσ = X - μ. Adding μ to both sides we find that
X = zσ + μ. So, in this case we can find the value of the raw score that corresponds to a
z-score of 1.28 by doing: $X = z\sigma + \mu = (1.28)(9.6) + 75 = 87.29$. So, students who
receive final exam scores of 87 and above will get grades of A.
Since 20% of the students will obtain a grade of B, we know that the lower end of the B
interval is at the 70th percentile (10% for As and 20% for Bs). Using the same rationale as
shown above, we know there are 50% of the scores below the mean and 20% (50% - 30%)
between the 70th percentile and the mean. Using Table C.1 we find the z-score
that has 20% of the scores between it and the mean. In this case the z-score we need is
.52. So, we can find the raw score that corresponds to a z-score of .52 using:
$X = z\sigma + \mu = (.52)(9.6) + 75 = 79.99$. So, students who receive final exam scores of 80 to
86 will get grades of B.
Since 40% of the students will obtain a grade of C, we know that the lower end of the C
interval is at the 30th percentile (10% for As, 20% for Bs, and 40% for Cs). From the 70th
percentile, going down through 20% of the scores gets us to the mean, since it is also the
median or the 50th percentile when the scores are distributed normally. Going down
another 20% of the scores (since there will be 40% Cs) brings us to the 30th percentile.
Now, what value of z cuts off the lower 30% of the scores? To put it another way, what
is the z-score beyond which 30% of the scores lie? Going to Table C.1 and looking at the
“Area beyond z” column shows us that a z-score of -.52 is the value we are looking for.
Remember, this value is negative because it is below the mean. We can find the raw
score that corresponds with a z-score of -.52 by evaluating the equation:
$X = z\sigma + \mu = (-.52)(9.6) + 75 = 70.01$. So, examinees who score between 70 and 79 on
the final examination will receive a grade of C.
Since 20% of the students will obtain a grade of D, we know that the lower end of the D
interval is at the 10th percentile (10% for As, 20% for Bs, 40% for Cs, and 20% for Ds).
Now, what value of z cuts off the lower 10% of the scores? To put it another way, what
is the z-score beyond which only 10% of the scores lie? Going to Table C.1 and looking
at the “Area beyond z” column shows us that a z-score of -1.28 is the value we are
looking for. Remember, this value is negative because it is below the mean. We can find
the raw score that corresponds with a z-score of -1.28 by evaluating the equation:
$X = z\sigma + \mu = (-1.28)(9.6) + 75 = 62.71$. So, examinees who score between 63 and 69 on
the final examination will receive a grade of D.
Finally, it follows that the rest of the students will receive Fs as their grades. So, any
student with a final examination score less than 63 will receive a grade of F.
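The four cutoffs can be cross-checked with the inverse normal CDF. The sketch below is an illustration in Python, not the table-based method used in the solution: it finds the raw score at the percentile that marks the bottom of each grade band. The dictionary names are hypothetical.

```python
from statistics import NormalDist

exam = NormalDist(mu=75, sigma=9.6)  # exam distribution given in the exercise

# Percent of the class that should fall below the bottom of each grade band.
lower_percentiles = {"A": 90, "B": 70, "C": 30, "D": 10}

# Raw score at each lower boundary, then rounded to whole exam points.
cutoffs = {grade: exam.inv_cdf(p / 100) for grade, p in lower_percentiles.items()}
rounded = {grade: round(score) for grade, score in cutoffs.items()}
print(rounded)
```

Rounded to whole points, the lower cutoffs are 87, 80, 70, and 63, matching the grade ranges in the solution. Anything below the D cutoff is an F.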
8. In an ancient culture, the average male life span was 37.6 years, with a standard
deviation of 4.8 years. The average female life span was 41.2 years, with a standard
deviation of 7.7 years. Use the properties of the normal distribution to find out the
following:
a. What percentage of men died before age 30?
To find the percentage of men who died before age 30, we need to find the z-score that
corresponds to an age of 30 in the distribution of men and find what proportion of the
men’s ages fell below this value.
$z_{30} = \frac{X - \mu}{\sigma} = \frac{30 - 37.6}{4.8} = \frac{-7.6}{4.8} = -1.58$. Since this
value is below the mean, all we have to do is find the area below (i.e. beyond) z = -1.58
and that will give us the proportion of men who died before age 30. The table of the
normal distribution tells us that this is .0571. We can say that 5.71% of the men died
before age 30.
b. What percentage of women lived to an age of at least 50?
Using the same logic we can find the proportion of women who lived to at least 50.
$z_{50} = \frac{50 - 41.2}{7.7} = \frac{8.8}{7.7} = 1.14$. Since the age is above the mean, one simple way to do this
is to find the proportion of scores above (beyond) z = 1.14 in the table of the
normal distribution. It is
.1271. So we see that 12.71% of the women lived to at least age 50.
c. At what age is a female death at the same relative position in the distribution as a
male death at age 35?
A male age of 35 corresponds to a z-score of
$z_{35} = \frac{35 - 37.6}{4.8} = \frac{-2.6}{4.8} = -.54$. Now we
need to find the age that corresponds to a z-score of -.54 for women. We know that
$X = z\sigma + \mu$, so for women, $X = (-.54)(7.7) + 41.2 = 37.04$. So, a death at the age of 37
for women has the same relative position as a male death at 35.
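All three parts of this exercise can be sketched with two normal distributions. The Python below is an illustration that uses exact z-scores instead of two-decimal table lookups, so the percentages differ from the hand calculations in the second decimal place. The variable names are hypothetical.

```python
from statistics import NormalDist

men = NormalDist(mu=37.6, sigma=4.8)    # male life spans
women = NormalDist(mu=41.2, sigma=7.7)  # female life spans

pct_men_before_30 = men.cdf(30) * 100              # part a
pct_women_at_least_50 = (1 - women.cdf(50)) * 100  # part b

# Part c: give the female age the same z-score that age 35 has among men.
z_male_35 = (35 - men.mean) / men.stdev
equivalent_female_age = women.mean + z_male_35 * women.stdev

print(round(pct_men_before_30, 2), round(pct_women_at_least_50, 2))
print(round(equivalent_female_age, 2))
```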