Study Guide - Michigan State University

Study Guide
1) Dependent vs. Independent Variables
a) The independent variable is said to be the cause, the dependent variable is said to
be the effect.
b) EXAMPLE: Let’s assume we are trying to predict one’s weight given one’s
diet. Clearly dietary considerations are the cause and how heavy you are is the
result because size generally does not dictate diet. So, if I only eat 3 apples per
day, my weight will be much less than if I eat 20 Big Macs per day.
c) Clearly, the independent variable precedes the dependent variable in time.
i) Note: oftentimes, it is not clear cut which variable comes before another. For
example, number of traffic accidents and seatbelt law. One is inclined to
believe that the dependent variable is number of traffic accidents, and
enactment of the seatbelt law acts to decrease them. However, it is very
possible that the high number of accidents has actually led to the enactment of
the seatbelt law in a particular state. Whatever the case, it is clear that in a
causal relationship, the variable that causes an effect comes first in temporal
order.
2) Discrete vs. Continuous variables
a) A discrete variable is one that takes on a countable number of distinct values, whereas
a continuous variable is one that can theoretically take on any value within a range, so
that between any two values there are infinitely many others.
i) Note: anything expressed as a percentage or proportion can take on an
infinite number of values, since the range is anything in the closed interval
[0,1].
b) EXAMPLES: age, height, weight, time are continuous. Gender, number of
children, number of sex partners are all discrete.
i) Note carefully, in practice, most everything is recorded as a discrete variable.
For example, if you are collecting data on “age”, you only record whole
numbers in the database, and not 12.1 years, for example.
(1) Exam strategy: if you are unsure of the level of measurement, your best
guess is that it is discrete.
ii) Note also, it is often very difficult to know whether something is discrete or
continuous, for example, number of sexual relations, number of questions left
unanswered on an exam.
iii) Rule of thumb: Ask yourself whether recording the variable as a fraction
makes sense. If not, it is discrete, if so, it is continuous. So, can I have ½ a
kid? Can I have ½ a gender? No, this is nonsensical.
3) Nominal, Ordinal and Interval/Ratio variables
a) Nominal: The categories are not “numerical” and cannot be thought of as
“higher” or “lower” in a numerical sense. THIS IS AKA CATEGORICAL
DATA, so that whenever the data is grouped into categories, the data is
considered to be nominal.
i) GOOD PROPERTIES OF NOMINAL VARIABLES:
(1) Mutual exclusivity
(a) The following example is a nominal variable (MOTHER’S AGE) that
is NOT mutually exclusive.
Mother's age     Frequency
13-20 Years          21
20-30 Years         305
30-35 Years         101
35-55 Years          62
Total               489
(2) Exhaustive: ALL values are considered. In the example above, are the
categories exhaustive?
(3) Relative homogeneity: cases should be truly comparable
ii) The median and nominal level data.
b) ORDINAL VARIABLES: Categories that are ranked
i) EXAMPLE: In the gss dataset, there is a variable called “spanking” that asks
whether the respondent favors spanking to discipline a child. The answers
range from strongly agree to strongly disagree, and these can be considered
ranked.
(1) Note: We can make 0 – 10 strongly disagree to strongly agree or 10 – 0
strongly disagree to strongly agree and it does not matter at all. The
“numeric” values we assign to the answers are arbitrary and meaningless.
c) Interval Ratio: Two properties
i) Equal distance between values
ii) 0 is a real value
4) What does Healey mean by “data reduction”?
a) Data reduction involves using a few numbers to summarize the distribution of a
variable, or an array of data as he calls it.
b) What is the problem with using only a few numbers to summarize the distribution
of a variable?
i) Summarizing a distribution involves using the mean, denoted $\bar{x}$, or standard
deviation, denoted $\sigma$, to describe the variable. This inevitably leads to a loss
of information (precision and detail).
5) Rates: rates are defined as the number of actual occurrences of some phenomenon
divided by the number of possible occurrences per some unit of time.
a) EXAMPLE: In a city of 750,000 people, the frequency of unwed pregnancies in
a one-year period was 1875. What is the unwed pregnancy rate for this city?
i) ANS. 1875/750,000 = .0025 = 2.5 per thousand
ii) What is the pregnancy rate per 10,000 people? 25
b) EXAMPLE: In a city with population 1,000,000, there were 516 homicides in the
past year. What is the homicide rate per 10,000 people? 5.16.
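The rate calculations above can be checked with a minimal Python sketch. The helper name `rate_per` and its `per` parameter are illustrative assumptions, not something the guide defines.

```python
def rate_per(occurrences, population, per=1_000):
    """Occurrences per `per` members of the population (hypothetical helper)."""
    return occurrences * per / population


# Unwed pregnancies: 1875 in a city of 750,000
print(rate_per(1875, 750_000, per=1_000))    # rate per thousand
print(rate_per(1875, 750_000, per=10_000))   # rate per ten thousand

# Homicides: 516 in a city of 1,000,000
print(rate_per(516, 1_000_000, per=10_000))  # rate per ten thousand
```

Changing `per` only rescales the same underlying proportion, which is why 2.5 per thousand and 25 per ten thousand describe the same city.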
6) Measures of Central Tendency
a) Measures of central tendency measure the typical value of a distribution.
i) It is a way to summarize the distribution to give you an idea about the typical
case of that distribution, in other words, the center of it.
b) There are three measures of central tendency
i) The mean: describes the average score
ii) The mode: describes the most recurring score
(1) The only measure of central tendency that can be used with nominal variables
iii) The median: is the 50th Percentile of the distribution
(1) A median is a special case of a percentile, which is the value
below which a specific percentage of cases fall.
c) How does the median differ from the mode and the mean? Unlike the mode or the
mean, the median always represents the exact center of a distribution of scores,
meaning that 50% of the cases always fall above the median and 50% of the cases
always fall below the median.
d) Characteristics of the mean
i) The mean is always the center of any distribution. The mean is the point
around which all of the scores cancel out. Mathematically, this says that if I
subtract the mean from each value and sum the results, the resulting sum will
be equal to 0. This is mathematically given as
$\sum_{i=1}^{n} (x_i - \bar{x}) = 0$
(1) Example: consider 5 numbers 1, 2, 3, 4, 5. The mean is 3. The equation
says to subtract the mean from each observation and the sum should be 0.
1 – 3 = -2
2 – 3 = -1
3 – 3 = 0
4 – 3 = 1
5 – 3 = 2
(-2) + (-1) + (0) + (1) + (2) = 0
More generally,
$\sum_{i=1}^{n} (x_i - \bar{x}) = \sum_{i=1}^{n} x_i - \sum_{i=1}^{n} \bar{x} = n\bar{x} - n\bar{x} = 0$
REMEMBER THIS RESULT FOR THE EXAM, BUT THERE IS NO NEED
TO BE ABLE TO DERIVE IT.
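As a quick numerical check of this result, here is a minimal Python sketch using the five numbers from the example. The variable names are illustrative only.

```python
scores = [1, 2, 3, 4, 5]

# The mean is the sum of the scores divided by how many there are
mean = sum(scores) / len(scores)

# Subtract the mean from each score to get its deviation
deviations = [x - mean for x in scores]

print(mean)             # the mean of the five scores
print(deviations)       # one deviation per score
print(sum(deviations))  # the deviations cancel out
```

With data that are not whole numbers, floating-point rounding can leave a sum that is extremely close to 0 rather than exactly 0.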
ii) The mean may often be very misleading because it is sensitive to all
observations whereas the median is not. In fact, the median is less sensitive
to extreme observations and therefore it is often “better” to report the median.
(1) To illustrate this, consider the familiar normal or “bell” curve. This is a
symmetric distribution because there are as many values on the left as
there are on the right of the center. Many natural phenomena have normal
distributions, such as weight, height, etc.
(2) There are important distributions that are not symmetric. IF THE
DISTRIBUTION IS NOT SYMMETRIC THEN THE MEDIAN,
MODE AND MEAN ARE GENERALLY NOT EQUAL. IT IS ONLY IN
SYMMETRIC, SINGLE-PEAKED DISTRIBUTIONS THAT THESE THREE
MEASURES ARE ALWAYS EQUAL. When a distribution is not symmetric, it is
skewed. There are two types of skewed distributions, right skewed and left
skewed.
(a) EXAMPLE of RIGHT SKEWED: Income. Often it is better to
report the median than the mean, since the mean is misleading in
extreme cases.
(b) EXAMPLE. Consider the following summary of AGE. Notice that the
arithmetic mean is somewhat greater than the median. The reason is
that the distribution is right skewed. If the mean is larger than the
median the distribution is __________ skewed.
Statistics: AGE OF RESPONDENT
N Valid       1385
N Missing        2
Mean         44.94
Median       41.00
To see this, create a histogram of the age variable.
[Histogram of AGE OF RESPONDENT: ages from 20 to 90 on the horizontal axis, frequency from 0 to 300 on the vertical axis. Std. Dev = 17.08, Mean = 44.9, N = 1385]
(c) Calculation of Grouped Mean
Minutes spent on test               mid pt     f     mid pt x f
0 to less than 5 minutes              2.5      2          5
at least 5 but less than 10 mins      7.5     12         90
at least 10, less than 20 mins       15       16        240
Total                                         30        335
Grouped mean = 335/30 ≈ 11.2
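The grouped-mean table can be reproduced with a short Python sketch. It assumes each case sits at the midpoint of its interval, which is what the table does. The names `groups` and `grouped_mean` are illustrative.

```python
# Each entry: ((lower bound, upper bound) in minutes, frequency f)
groups = [((0, 5), 2), ((5, 10), 12), ((10, 20), 16)]


def grouped_mean(groups):
    """Sum of (midpoint x f) divided by the total frequency."""
    total_f = sum(f for _, f in groups)
    weighted_sum = sum((low + high) / 2 * f for (low, high), f in groups)
    return weighted_sum / total_f


print(round(grouped_mean(groups), 1))
```

Because every case is treated as if it fell exactly at its interval's midpoint, the grouped mean is only an approximation of the mean of the raw times.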
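7) The summation operator
a) $\sum$ is called the summation operator; it is useful when representing the sum
of a large group of numbers.
b) $\sum_{i=1}^{n} x_i$ means $x_1 + x_2 + \dots + x_n$: add up all n values.
c) $\sum_{i=1}^{n} x_i^2$ means $x_1^2 + x_2^2 + \dots + x_n^2$: square each value first, then add up the squares.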
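A minimal Python sketch of the two sums, using the illustrative values 1 through 5 (an assumption chosen for easy checking, not data from the guide). It also shows that the sum of squares is not the same thing as the square of the sum.

```python
x = [1, 2, 3, 4, 5]

sum_x = sum(x)                            # add up the values
sum_x_squared = sum(xi ** 2 for xi in x)  # square each value, then add
square_of_sum = sum(x) ** 2               # add first, then square

print(sum_x, sum_x_squared, square_of_sum)
```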
8) Measures of Dispersion
a) What is a “measure of dispersion?”
i) Measures of Central Tendency don’t tell anything about how much the data
values differ from each other.
(1) EXAMPLE: What is the mean of the following two distributions of
AGE?
(a) 50 50 50 50 50
(b) 10 20 50 80 90
(2) The distributions are obviously very different.
(a) Measures of dispersion or variability attempt to quantify the spread of
observations.
(b) It is a measure of variability, usually defined in terms of variability
around the mean.
(c) The distance between an individual score and the mean is called a deviation;
mathematically this is $(X_i - \bar{X})$.
(d) The larger the distance from the mean, the larger the deviation will be.
(e) The more closely the scores are clustered around the mean, the less variability there
will be.
(i) PRACTICAL EXAMPLE: Let’s assume that average income for
people with PhDs is $55,000 and average income for people with
a high school education is $20,000. Since people with only a HS
education have fewer opportunities than those with PhDs, most
people who only have a HS education would make somewhere
around $20K, so there is not much variation. However, it is possible
for PhDs to make anywhere from $20K to $800K per year, and
hence there is much more variation around the average salary for
PhDs than there is for HS graduates.
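The two AGE distributions from the example above make the point concrete. This is a minimal sketch using Python's standard `statistics` module. The variable names are illustrative, and the sample standard deviation is used as one possible measure of spread.

```python
import statistics

ages_a = [50, 50, 50, 50, 50]
ages_b = [10, 20, 50, 80, 90]

# Both distributions have the same mean...
print(statistics.mean(ages_a), statistics.mean(ages_b))

# ...but very different spread around that mean
print(statistics.stdev(ages_a))            # no variation at all
print(round(statistics.stdev(ages_b), 2))  # large variation
```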
b) Measures of dispersion we have looked at
i) Inter-Quartile Range: defined as the 75th percentile minus the 25th
percentile.
ii) Quartile/Deciles
iii) Standard deviations
iv) Creating Box and Whiskers
9) Standardized Variables
a) EXAMPLES
i) Here is a random sample of eleven scores on a PLS 201 exam: 12, 16, 16, 18,
23, 23, 24, 25, 25, 26, 29
(1) Find the sample mean.
(a) Answer: $\bar{x}$ = 21.5
(2) Find the sample standard deviation.
(a) Answer: you should get something approximating 5.
(3) Find the median. Answer: 23
(4) Find the z-score for the student who received the highest score on the
exam. Answer: z = (29 - $\bar{x}$)/$s_x$ = 1.5 where $\bar{x}$ = 21.5 and $s_x$ = 5.
Interpretation: this student’s score was 1½ standard deviations above the mean.
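The answers for the eleven exam scores can be checked with a minimal sketch using Python's standard `statistics` module. The variable names are illustrative. Note that the guide rounds the mean to 21.5 and the standard deviation to 5 before computing z, so the unrounded z-score comes out slightly smaller than 1.5.

```python
import statistics

scores = [12, 16, 16, 18, 23, 23, 24, 25, 25, 26, 29]

mean = statistics.mean(scores)      # sample mean
sd = statistics.stdev(scores)       # sample standard deviation (n - 1 denominator)
median = statistics.median(scores)  # middle value of the ordered scores

# z-score for the highest score, first with the guide's rounded values,
# then with the unrounded mean and standard deviation
z_rounded = (29 - 21.5) / 5
z_exact = (max(scores) - mean) / sd

print(round(mean, 1), round(sd, 2), median)
print(z_rounded, round(z_exact, 2))
```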
ii) Faculty salaries at a Midwestern university are normally distributed with a
mean of $51,500 and standard deviation of $3,000.
(1) Find the probability that one faculty member chosen at random has a
salary less than $50,000.
(a) Answer: X = salary of a randomly selected faculty member.
Given that X ~ N(51500, 3000), normal with mean 51,500 and
standard deviation 3,000. P(X < 50000) = P(Z < (50000 - 51500)/3000) = P(Z < -.5) = .3085
iii) The mean height of adults in an African village is 150 cm, and the standard
deviation is 6 cm. What is the probability that a randomly selected adult from
this village will be shorter than 162 cm, if we assume that the distribution of
height in the population is normal?
(1) Calculate the z-value for 162 cm using the z-transformation, and look up
the corresponding p value in the table. With mean = 150 cm and SD = 6 cm,
the z-transformed value of x = 162 cm is:
$z = \frac{x - \bar{x}}{s_x} = \frac{162 - 150}{6} = +2$
(2) The p value in the table corresponding to z = +2 is 0.4772. This is
the proportion of area under the curve between the mean and the specified
z-value. We also need to add the proportion below the mean, since we are
looking for heights under 162 cm (which includes heights under 150 cm as
well). Therefore, (50% + 47.72%) = 97.72%; that is, the probability that a
randomly selected adult from this village will be shorter than 162 cm is
97.72%.
iv) A random sample of 47 items is drawn from a population with mean 40 and
standard deviation 1.46.
(1) Give a range of values that is almost certain to contain any particular value
of each item drawn.
(a) Y should be within 3 standard deviations of the mean. That is, between
35.62 and 44.38.
(2) What is the probability that Y will be greater than 50?
(a) z = (50 – 40)/1.46 = 10/1.46 = 6.85, so P(Y > 50) ≈ 0
(3) What is the probability that Y will be less than 38?
(a) z = (38 – 40)/1.46 = -1.37, so P(Y < 38) = P(Z < -1.37) = 0.0853 (i.e. column c in the table)
(4) What is the probability that Y will be greater than 45?
(a) z = (45 – 40)/1.46 = 3.42, so P(Y > 45) ≈ 0.0003, which is effectively 0
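The table lookups in the salary, height and sampling examples above can be reproduced without a z table. This is a minimal sketch that assumes each variable is normally distributed (as the z-table approach does) and builds the normal cumulative probability from `math.erf` in the standard library. The function name `normal_cdf` is an illustrative choice.

```python
import math


def normal_cdf(x, mean, sd):
    """P(X < x) for a normal variable with the given mean and standard deviation."""
    z = (x - mean) / sd
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


# Faculty salaries: P(X < 50,000) with mean 51,500 and SD 3,000
p_salary = normal_cdf(50_000, 51_500, 3_000)

# Village heights: P(X < 162) with mean 150 and SD 6
p_height = normal_cdf(162, 150, 6)

# Sampled items: mean 40, SD 1.46
low, high = 40 - 3 * 1.46, 40 + 3 * 1.46   # within 3 standard deviations
p_over_50 = 1 - normal_cdf(50, 40, 1.46)
p_under_38 = normal_cdf(38, 40, 1.46)
p_over_45 = 1 - normal_cdf(45, 40, 1.46)

print(round(p_salary, 4), round(p_height, 4))
print(round(low, 2), round(high, 2))
print(p_over_50, p_under_38, p_over_45)
```

Small differences from the table values are expected, because the table rounds z to two decimal places before the lookup.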
b) The school doctor has measured the heights of the pupils in class 5A. The
results (in cm) are as follows:
130 132 138 136 131 153 131 133 129 133
110 132 129 134 135 132 135 134 133 132
130 131 134 135 135 134 136 133 133 130
Table 3.2 Heights of the pupils of the class 5A
i) Box plot method
(1) Below are the steps to follow in constructing a box plot.
1. Calculate the median M, lower and upper quartiles, Q1 and Q3, and the
interquartile range, IQR = Q3 – Q1, for the data set.
2. Construct a box with Q1 and Q3 located at the lower corners. The base width will
then be equal to IQR. Draw a vertical line inside the box to locate the median M.
3. Construct the limits on the box plot: the inner fences are located a distance of 1.5
* IQR below Q1 and above Q3. Values beyond these fences are extreme values.
4. Locate the extremes on the box plot using asterisks (*).
[Diagram: a box running from Q1 to Q3 (width IQR) with the median M marked inside; inner fences 1.5 * IQR beyond Q1 and Q3; outer fences beyond the inner fences]
Answer: your box plot should look like this
[Figure 3.6 Output from SPSS showing box plot for the data above.]
(2) For the following find the
(a) Median
(b) Quartile 1
(c) Quartile 3
(d) Interquartile range.
(3) Draw a box and whisker plot, identifying any extreme values.
(a) Remember to order the data before you begin.
(i) 32 30 36 27 24 33 34
(ii) 998 92 432 223 785 335 367 444 457 458 488
(b) Answers
(i) Q1 = 27, Q2 = 32, Q3 = 34, IQR = 7. No extremes.
(ii) Q1 = 335, Q2 = 444, Q3 = 488, IQR = 153. Extremes: 785 and 998 (above 488 + 229.5 = 717.5) and 92 (below 335 – 229.5 = 105.5).
Detailed Solutions
c) Order the data: 24 27 30 32 33 34 36
i) From here, it is easy to see that 32 is the median
ii) To find Q1, (.25)(7) = 1.75, rounding up gives 2, so Q1 is the 2nd ordered value, 27
iii) To find Q3, (.75)(7) = 5.25, rounding up gives 6, so Q3 is the 6th ordered value, 34
iv) IQR = 34 – 27 = 7
v) Extremes: Q3 + 1.5(IQR) = 34 + 1.5(7) = 44.5, there are no values greater than or equal to
that in the data, so there are no upper extremes. Also, Q1 – 1.5(IQR) = 27 – 1.5(7) =
16.5, there are no values less than or equal to this, so there are no lower
extremes.
d) Follow the same procedure for (ii).
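The procedure in the detailed solution can be sketched in Python as follows. This follows the guide's rounding-up rule for locating quartiles (position = p times n, rounded up), which is only one of several quartile conventions, so statistical software may report slightly different values. The function names are illustrative.

```python
import math


def position_quantile(ordered, p):
    """Value at 1-based position ceil(p * n) in already ordered data."""
    position = math.ceil(p * len(ordered))
    return ordered[position - 1]


def box_plot_summary(data):
    ordered = sorted(data)                   # always order the data first
    q1 = position_quantile(ordered, 0.25)
    median = position_quantile(ordered, 0.50)
    q3 = position_quantile(ordered, 0.75)
    iqr = q3 - q1
    lower_limit = q1 - 1.5 * iqr
    upper_limit = q3 + 1.5 * iqr
    # Values at or beyond the limits are flagged as extremes
    extremes = [x for x in ordered if x <= lower_limit or x >= upper_limit]
    return {"Q1": q1, "median": median, "Q3": q3, "IQR": iqr,
            "limits": (lower_limit, upper_limit), "extremes": extremes}


print(box_plot_summary([32, 30, 36, 27, 24, 33, 34]))
print(box_plot_summary([998, 92, 432, 223, 785, 335, 367, 444, 457, 458, 488]))
```

When p times n is already a whole number (for example, the median of an even number of cases), this simple rule picks a single ordered value rather than averaging two neighbors, so treat it as a sketch of the hand method rather than a general-purpose routine.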
Practice Multiple Choice Questions
1. The average time between infection with the AIDS virus and developing AIDS
has been estimated to be 8 years with a standard deviation of about 2 years.
Approximately what fraction of people develop AIDS within 4 years of infection? (Answer: b)
a. about 5 %
b. about 2.5%
c. about 32%
d. about 16%
e. about 1%
2. An instructor decides to "curve grades" in a course depending upon the percentile
measures. Here are some summary statistics:
Quantile Levels     Final Mark
Minimum                 10
10.0%                   48
25.0%                   55
Median                  66
75.0%                   78
90.0%                   87
Maximum                 93
Which of the following is FALSE? (Answer: b)
a. About 1/4 of the class received a score of 55 or less.
b. About 3/4 of the class received a score of 75% or less.
c. About 50% of the class received grades between 55 and 78.
d. This method assigns grades relative to how others do in a class rather than
against an absolute standard.
e. This method always has half of the class at or above the median grade.
3. An experiment was performed upon rats to investigate the effect of ingesting Alar
(a chemical sprayed on apple trees to keep fruit from dropping before ripe) upon
subsequent cancer rates. The following variables were measured:
gender (0=female, 1=male); weight (g); dose of Alar (nil, low, high); number of
tumors
The typical weight of a rat is about 800 g and the weights were rounded to the
nearest gram. The number of tumors is around 10. Which of the following is
FALSE? (Answer: c)
a. Gender is nominal scale; dose is ordinal scale
b. Gender is discrete; weight is continuous
c. Number of tumors is discrete and is interval scale
d. Dose is ordinal scale and discrete
e. Weight is ratio scale; and number of tumors is discrete.
4. Here are some summary statistics on the results of the experiment. Draw suitable
BOXPLOTS to compare the results. Salmon production is in kg/km of spawning
sites.
Quantiles
Level        Minimum   10.0%   25.0%   median   75.0%   90.0%   maximum
clear cut      0.9      0.9     3.3     19.2     48.1    87.4     90.0
selective      0.9      1.2     8.5     29.3     51.5    93.4    108.0
Means and Std Deviations
Level        Number    Mean    Std Dev    Std Err Mean
clear cut      12      29.7     30.4          8.8
selective      12      34.6     31.1          9.0
Solution: In this case, side-by-side box plots would be suitable
5. What do you conclude from your boxplot and the descriptive statistics? Be sure to
explain how your plot leads you to this conclusion.
Solution: It appears that clear cut streams produce, on average, less salmon than
selectively cut streams. This is because the box plot for the clear-cut areas is shifted
down relative to the box-plot for the selective harvest areas; and the median of the
clear cut areas appears to be less than the median of the selective harvest areas
6. Which one of the following statements is FALSE? (Answer: a)
a. Pie charts are better than bar graphs for comparing relative sizes.
b. Data that are nominal scale are presented using frequency tables.
c. Means and standard deviation of ordinal data are meaningless.
d. Box-plots are a good choice for comparing the distribution of values
among groups.
7. As part of a study to investigate the effects of stubble burning, the following
variables were measured at several sites around Winnipeg:
pH of soil (to one decimal place, e.g., 6.3; 0 pH is not meaningful)
crop grown (0=wheat, 1=barley, 2=oats, 3=other);
amount of stubble (0=light, 1=medium, 2=heavy);
date of final harvesting (e.g., 10 Oct 92)
The scales of these variables are:
a. interval, ordinal, ratio, ratio
b. interval, nominal, nominal, interval
c. interval, nominal, ordinal, interval
d. ratio, ordinal, ordinal, ratio
e. interval, nominal, ordinal, ratio
8. A student discovers that his grade on a recent test was the 72nd percentile. If 90
students wrote the test, then approximately how many students received a higher
grade than he did? (Answer: b)
a. 65
b. 25
c. 72
d. 71
e. 18
Solution: (1 - .72)(90) = 25.2 or (.72)(90) = 64.8 students who scored less than him
so 90 – 64.8 is approximately 25.
9. Many professional schools require applicants to take a standardized test. Suppose
that 1000 students write the test, and you find that your mark of 63 (out of 100)
was the 73rd percentile. This means: (Answer: c)
a. At least 73% of the people got 63 or better.
b. At least 270 people got 73 or better.
c. At least 270 people got 63 or better.
d. At least 27% of the people got 73 or worse.
e. At least 730 people got 73 or better.
Solution: We know that 73% of the people scored below 63 and 27% of the people
scored better than 63, so rule out a and d. This means that (.73)(1000) = 730 scored below
63 and (.27)(1000) = 270 scored better than 63. C must be the correct answer.
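The arithmetic in the two percentile solutions above can be written as a small Python sketch. The helper names `count_below` and `count_above` are hypothetical, and the sketch treats a percentile as the percentage of test takers scoring below a given mark, as the solutions do.

```python
def count_below(percentile, n_students):
    """Approximate number of students scoring below the given percentile."""
    return n_students * percentile / 100


def count_above(percentile, n_students):
    """Approximate number of students scoring above the given percentile."""
    return n_students * (100 - percentile) / 100


# 72nd percentile among 90 students
print(count_below(72, 90), count_above(72, 90))

# 73rd percentile among 1000 students
print(count_below(73, 1000), count_above(73, 1000))
```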
Last Question
1994: DIVORCES PER 1,000 POPULATION
Valid     Frequency    Percent    Cumulative Percent
5.0           2          14.3           14.3
5.1           2          14.3           28.6
5.2           1           7.1           35.7
5.3           1           7.1           42.9
5.4           1           7.1           50.0
5.5           1           7.1           57.1
5.6           1           7.1           64.3
5.7           1           7.1           71.4
5.8           2          14.3           85.7
5.9           1           7.1           92.9
6.0           1           7.1          100.0
Total        14         100.0
Identify the Percent, Cumulative Percent and state the median, Q1, Q3, Interquartile
Range and identify any extreme values.
IQR = Q3 – Q1 = 5.8 – 5.1 = .7
Q3 + 1.5(.7) = 5.8 + 1.05 = 6.85, so there are no upper extreme values
Q1 – 1.5(.7) = 5.1 – 1.05 = 4.05, so there are no lower extreme values
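The Percent and Cumulative Percent columns, and the quartiles used in the solution, can be rebuilt from the frequencies alone. This is a minimal Python sketch with illustrative names. It takes each quartile to be the first value whose cumulative proportion reaches 25% or 75%, which matches the solution's Q1 and Q3 but is an assumption about the method used.

```python
values = [5.0, 5.1, 5.2, 5.3, 5.4, 5.5, 5.6, 5.7, 5.8, 5.9, 6.0]
freqs = [2, 2, 1, 1, 1, 1, 1, 1, 2, 1, 1]
total = sum(freqs)

rows = []        # (value, frequency, percent, cumulative percent)
cumulative = []  # unrounded cumulative proportions, used to find quartiles
running = 0
for value, f in zip(values, freqs):
    running += f
    cumulative.append(running / total)
    rows.append((value, f, round(100 * f / total, 1),
                 round(100 * running / total, 1)))


def first_value_reaching(p):
    """First value whose cumulative proportion is at least p."""
    for value, cum in zip(values, cumulative):
        if cum >= p:
            return value


q1, q3 = first_value_reaching(0.25), first_value_reaching(0.75)
iqr = q3 - q1
lower_limit, upper_limit = q1 - 1.5 * iqr, q3 + 1.5 * iqr
extremes = [v for v in values if v <= lower_limit or v >= upper_limit]

for row in rows:
    print(row)
print(q1, q3, round(iqr, 2), round(lower_limit, 2), round(upper_limit, 2), extremes)
```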