Study Guide

1) Dependent vs. Independent Variables
   a) The independent variable is said to be the cause; the dependent variable is said to be the effect.
   b) EXAMPLE: Let’s assume we are trying to predict one’s weight given one’s diet. Diet is the cause and weight is the result, because size generally does not dictate diet. So, if I eat only 3 apples per day, my weight will be much less than if I eat 20 Big Macs per day.
   c) The independent variable precedes the dependent variable in time.
      i) Note: oftentimes, it is not clear-cut which variable comes before the other. For example, consider the number of traffic accidents and a seatbelt law. One is inclined to believe that the dependent variable is the number of traffic accidents, and that enactment of the seatbelt law acts to decrease them. However, it is very possible that a high number of accidents actually led to the enactment of the seatbelt law in a particular state. Whatever the case, in a causal relationship the variable that causes the effect comes first in temporal order.
2) Discrete vs. Continuous Variables
   a) A discrete variable is one that takes on a countable number of values, whereas a continuous variable is one that can theoretically take on any value within a range, and therefore an infinite number of values.
      i) Note: anything expressed as a percentage or proportion can take on an infinite number of values, since the range is anything in the closed interval [0,1].
   b) EXAMPLES: age, height, weight, and time are continuous. Gender, number of children, and number of sex partners are all discrete.
      i) Note carefully: in practice, almost everything is recorded as a discrete variable. For example, if you are collecting data on “age”, you record only whole numbers in the database, and not, say, 12.1 years.
         (1) Exam strategy: if you are unsure of the level of measurement, your best guess is that it is discrete.
      ii) Note also: it is often very difficult to know whether something is discrete or continuous, for example, number of sexual relations or number of questions left unanswered on an exam.
      iii) Rule of thumb: ask yourself whether recording the variable as a fraction makes sense. If not, it is discrete; if so, it is continuous. So, can I have ½ a kid? Can I have ½ a gender? No, this is nonsensical.
3) Nominal, Ordinal and Interval/Ratio Variables
   a) Nominal: the categories are not “numerical” and cannot be thought of as “higher” or “lower” in a numerical sense. THIS IS AKA CATEGORICAL DATA, so that whenever the data is grouped into categories, the data is considered to be nominal.
      i) GOOD PROPERTIES OF NOMINAL VARIABLES:
         (1) Mutual exclusivity
            (a) The following example is a nominal variable (MOTHER’S AGE) that is NOT mutually exclusive: the ages 20, 30 and 35 each fall into two categories.

                Mother's age
                Valid            Frequency
                13-20 Years          21
                20-30 Years         305
                30-35 Years         101
                35-55 Years          62
                Total               489

         (2) Exhaustive: ALL values are considered. In the example above, are the categories exhaustive?
         (3) Relative homogeneity: cases should be truly comparable
      ii) The median and nominal level data.
   b) ORDINAL VARIABLES: Categories that are ranked
      i) EXAMPLE: In the GSS dataset, there is a variable called “spanking” that asks whether the respondent favors spanking to discipline a child. The answers range from strongly agree to strongly disagree, and these can be considered ranked.
         (1) Note: We can code the answers 0 to 10 from strongly disagree to strongly agree, or 10 to 0 from strongly disagree to strongly agree, and it does not matter at all. The “numeric” values we assign to the answers are arbitrary and meaningless; only their order matters.
   c) Interval/Ratio: Two properties
      i) Equal distance between values
      ii) 0 is a real value (a true zero point)
4) What does Healey mean by “data reduction”?
   a) Data reduction involves using a few numbers to summarize the distribution of a variable, or an array of data as he calls it.
   b) What is the problem with using only a few numbers to summarize the distribution of a variable?
      i) Summarizing a distribution involves using the mean, denoted $\bar{x}$, or the standard deviation, denoted $s_x$, to describe the variable. This inevitably leads to a loss of information (precision and detail).
5) Rates: rates are defined as the number of actual occurrences of some phenomenon divided by the number of possible occurrences per some unit of time.
   a) EXAMPLE: In a city of 750,000 people, the frequency of unwed pregnancies in a one-year period was 1875. What is the unwed pregnancy rate for this city?
      i) ANS. 1875/750,000 = .0025 = 2.5 per thousand
      ii) What is the pregnancy rate per 10,000 people? ANS. .0025 x 10,000 = 25
   b) EXAMPLE: In a city with population 1,000,000, there were 516 homicides in the past year. What is the homicide rate per 10,000 people? ANS. (516/1,000,000) x 10,000 = 5.16
6) Measures of Central Tendency
   a) Measures of central tendency measure the typical value of a distribution.
      i) It is a way to summarize the distribution to give you an idea about the typical case of that distribution, in other words, the center of it.
   b) There are three measures of central tendency
      i) The mean: describes the average score
      ii) The mode: describes the most frequently occurring score
         (1) The only measure of central tendency that can be used with nominal variables
      iii) The median: is the 50th percentile of the distribution
         (1) A median is a special case of a percentile, which is the score below which a specific percentage of cases fall.
   c) How does the median differ from the mode and the mean? Unlike the mode or the mean, the median always represents the exact center of a distribution of scores, meaning that 50% of the cases always fall above the median and 50% of the cases always fall below the median.
   d) Characteristics of the mean
      i) The mean is the balance point of any distribution: it is the point around which all of the scores cancel out.
         Mathematically, this says that if I subtract the mean from each value and sum the results, the resulting sum will be equal to 0. This is mathematically given as $\sum_{i=1}^{n} (x_i - \bar{x}) = 0$.
         (1) Example: consider 5 numbers 1, 2, 3, 4, 5. The mean is 3. The equation says to subtract the mean from each observation and the sum should be 0.

                1 - 3 = -2
                2 - 3 = -1
                3 - 3 = 0
                4 - 3 = 1
                5 - 3 = 2
                (-2) + (-1) + (0) + (1) + (2) = 0

             More generally,

                $\sum_{i=1}^{n} (x_i - \bar{x}) = \sum_{i=1}^{n} x_i - \sum_{i=1}^{n} \bar{x} = n\bar{x} - n\bar{x} = 0$

             REMEMBER THIS RESULT FOR THE EXAM, BUT THERE IS NO NEED TO BE ABLE TO DERIVE IT.
      ii) The mean can be very misleading because it is sensitive to every observation, including extreme ones. The median is much less sensitive to extreme observations, and therefore it is often “better” to report the median.
         (1) To illustrate this, consider the familiar normal or “bell” curve. This is a symmetric distribution because there are as many values on the left as there are on the right of the center. Many natural phenomena, such as weight and height, have approximately normal distributions.
         (2) There are important distributions that are not symmetric. IF THE DISTRIBUTION IS NOT SYMMETRIC THEN THE MEDIAN, MODE AND MEAN ARE GENERALLY NOT EQUAL. IT IS ONLY IN SYMMETRIC DISTRIBUTIONS THAT THESE THREE MEASURES ARE EQUAL. When a distribution is not symmetric, it is skewed. There are two types of skewed distributions, right skewed and left skewed.
            (a) EXAMPLE of RIGHT SKEWED: Income. Often it is better to report the median than the mean, since the mean is misleading in extreme cases.
            (b) EXAMPLE. Consider the following summary of AGE. Notice that the arithmetic mean is somewhat greater than the median. The reason is that the distribution is right skewed. If the mean is larger than the median the distribution is __________ skewed.

                Statistics: AGE OF RESPONDENT
                N        Valid       1385
                         Missing        2
                Mean                44.94
                Median              41.00

                To see this, create a histogram of the age variable.

                [Figure: histogram of AGE OF RESPONDENT. Std. Dev = 17.08, Mean = 44.9, N = 1385.00]

            (c) Calculation of Grouped Mean

                Minutes spent on test                 mid pt     f     mid pt x f
                0 to less than 5 minutes                2.5      2          5
                at least 5 but less than 10 mins        7.5     12         90
                at least 10, less than 20 mins         15       16        240
                Total                                           30        335

                Grouped mean = 335/30 = 11.2
7) The summation operator
   a) $\sum$ is called the summation operator; it is useful when representing the sum of a large group of numbers
   b) $\sum_{i=1}^{n} x_i$ means $x_1 + x_2 + \dots + x_n$
   c) $\sum_{i=1}^{n} x_i^2$ means $x_1^2 + x_2^2 + \dots + x_n^2$
8) Measures of Dispersion
   a) What is a “measure of dispersion?”
      i) Measures of central tendency don’t tell us anything about how much the data values differ from each other.
         (1) EXAMPLE: What is the mean of the following two distributions of AGE? (Both means are 50.)
            (a) 50 50 50 50 50
            (b) 10 20 50 80 90
         (2) The distributions are obviously very different.
            (a) Measures of dispersion or variability attempt to quantify the spread of observations.
            (b) Dispersion is usually defined in terms of variability around the mean.
            (c) A deviation is the distance between an individual score and the mean value; mathematically this is $(X_i - \bar{X})$.
            (d) The larger the distance from the mean, the larger the deviation will be.
            (e) The more tightly the scores are clustered around the mean, the less variability there will be.
               (i) PRACTICAL EXAMPLE: Let’s assume that average income for people with PhDs is $55,000 and average income for people with a high school education is $20,000. Since people with only a HS education have fewer opportunities than those with PhDs, most of them would make somewhere around $20K, so there is not much variation. However, it is possible for PhDs to make anywhere from $20K to $800K per year, and hence there is much more variation around the average salary for PhDs than there is for HS graduates.
   b) Measures of dispersion we have looked at
      i) Inter-Quartile Range: defined as the 75th percentile minus the 25th percentile.
      ii) Quartiles/Deciles
      iii) Standard deviations
      iv) Creating Box and Whiskers
9) Standardized Variables
   a) EXAMPLES
      i) Here is a random sample of eleven scores on a PLS 201 exam: 12, 16, 16, 18, 23, 23, 24, 25, 25, 26, 29
         (1) Find the sample mean.
            (a) Answer: $\bar{x}$ = 237/11 = 21.5 (rounded)
         (2) Find the sample standard deviation.
            (a) Answer: you should get something approximating 5 (about 5.2 before rounding).
         (3) Find the median.
            (a) Answer: 23
         (4) Find the z-score for the student who received the highest score on the exam.
            (a) Answer: $z = (29 - \bar{x})/s_x = 1.5$, where $\bar{x}$ = 21.5 and $s_x$ = 5. Interpretation: this student’s score was 1 ½ standard deviations above the mean.
      ii) Faculty salaries at a Midwestern university are normally distributed with a mean of $51,500 and standard deviation of $3,000.
         (1) Find the probability that one faculty member chosen at random has a salary less than $50,000.
            (a) Answer: Let X = salary of a randomly selected faculty member. We are given that X ~ N(51500, 3000), that is, normal with mean 51,500 and standard deviation 3,000. Then P(X < 50000) = P(Z < (50000 - 51500)/3000) = P(Z < -.5) = .3085
      iii) The mean height of adults in an African village is 150 cm, and the standard deviation is 6 cm. What is the probability that a randomly selected adult from this village will be shorter than 162 cm, if we assume that the distribution of height in the population is normal?
         (1) Calculate the z-value for 162 cm using the z-transformation, and look up the corresponding p value in the table. With mean = 150 cm and SD = 6 cm, the z-transformed value of x = 162 cm is:

                $z = \frac{x - \bar{x}}{s_x} = \frac{162 - 150}{6} = +2$

         (2) The p value in the table corresponding to z = +2 is 0.4772. This is the proportion of the area under the curve between the mean and the specified z-value. We also need to add the proportion below the mean, since we are looking for heights under 162 cm (which includes heights under 150 cm as well). Therefore, 50% + 47.72% = 97.72%; that is, the probability that a randomly selected adult from this village will be shorter than 162 cm is 97.72%.
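The z-score and normal-table answers in examples i) to iii) above can be checked numerically. The following is a minimal Python sketch and not part of the original guide: the helper names `z_score` and `normal_cdf` are illustrative, and the normal probability is computed from the standard library's error function instead of being read from a printed table, so the last digit may differ slightly from table values.

```python
import math
import statistics


def z_score(x, mean, sd):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mean) / sd


def normal_cdf(z):
    """P(Z <= z) for a standard normal variable, computed from the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


# Example i: the eleven PLS 201 exam scores.
scores = [12, 16, 16, 18, 23, 23, 24, 25, 25, 26, 29]
exam_mean = statistics.mean(scores)      # exact mean, 237/11
exam_sd = statistics.stdev(scores)       # sample standard deviation (n - 1 in the denominator)
exam_median = statistics.median(scores)

# The guide rounds the mean to 21.5 and the standard deviation to 5 before computing z.
top_z_rounded = z_score(29, 21.5, 5)

# Example ii: P(salary < 50,000) with mean 51,500 and standard deviation 3,000.
p_salary = normal_cdf(z_score(50_000, 51_500, 3_000))

# Example iii: P(height < 162 cm) with mean 150 cm and standard deviation 6 cm.
p_height = normal_cdf(z_score(162, 150, 6))

print(f"mean={exam_mean:.2f} sd={exam_sd:.2f} median={exam_median}")
print(f"z for top score (rounded inputs) = {top_z_rounded}")
print(f"P(salary < 50000) = {p_salary:.4f}")
print(f"P(height < 162)   = {p_height:.4f}")
```

Note that computing z for the top score from the unrounded mean and standard deviation gives a value somewhat below 1.5, because the sample standard deviation is a little larger than 5.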
      iv) A random sample of 47 items is drawn from a population with mean 40 and standard deviation 1.46.
         (1) Give a range of values that is almost certain to contain the value Y of any particular item drawn.
            (a) Y should be within 3 standard deviations of the mean. That is, between 40 - 3(1.46) = 35.62 and 40 + 3(1.46) = 44.38.
         (2) What is the probability that Y will be greater than 50?
            (a) z = (50 - 40)/1.46 = 10/1.46 = 6.85, so P(Y > 50) is approximately 0
         (3) What is the probability that Y will be less than 38?
            (a) z = (38 - 40)/1.46 = -1.37, so P(Y < 38) = P(Z < -1.37) = 0.0853 (i.e. column c in the table)
         (4) What is the probability that Y will be greater than 45?
            (a) z = (45 - 40)/1.46 = 3.42, so P(Y > 45) = P(Z > 3.42), which is about 0.0003, i.e. essentially 0
   b) The doctor of a school has measured the height of the pupils in class 5A. The results (in cm) are as follows:

         130 132 138 136 131 153 131 133 129 133
         110 132 129 134 135 132 135 134 133 132
         130 131 134 135 135 134 136 133 133 130

         Table 3.2 Heights of the pupils of the class 5A

      i) Box plot method
         (1) Below are the steps to follow in constructing a box plot.
            1. Calculate the median M, the lower and upper quartiles, Q1 and Q3, and the interquartile range, IQR = Q3 - Q1, for the data set.
            2. Construct a box with Q1 and Q3 located at the lower corners. The base width will then be equal to IQR. Draw a vertical line inside the box to locate the median M.
            3. Construct the limits on the box plot: extreme values are those located more than 1.5 * IQR below Q1 or above Q3.
            4. Locate the extremes on the box plot using asterisks (*).

            [Figure: schematic box plot marking Q1, M, Q3, the IQR, the inner fences at 1.5 * IQR beyond the box, and the outer fences]

            Answer: your box plot should look like this.

            [Figure 3.6 Output from SPSS showing box plot for the data above.]

         (2) For the following find the
            (a) Median
            (b) Quartile 1
            (c) Quartile 3
            (d) Interquartile range.
         (3) Draw a box and whisker plot, identifying any extreme values.
            (a) Remember to order the data before you begin.
               (i) 32 30 36 27 24 33 34
               (ii) 998 92 432 223 785 335 367 444 457 458 488
            (b) Answers
               (i) Q1 = 27, Q2 = 32, Q3 = 34, IQR = 7. No extremes.
               (ii) Q1 = 335, Q2 = 444, Q3 = 488, IQR = 153. Extremes: 785 and 998 (above 488 + 1.5(153) = 717.5) and 92 (below 335 - 1.5(153) = 105.5).

Detailed Solutions

   c) Order the data: 24 27 30 32 33 34 36
      i) From here, it is easy to see that 32 is the median
      ii) To find Q1, (.25)(7) = 1.75, rounding up gives 2, so Q1 is the 2nd value, 27
      iii) To find Q3, (.75)(7) = 5.25, rounding up gives 6, so Q3 is the 6th value, 34
      iv) IQR = 34 - 27 = 7
      v) Extremes: Q3 + 1.5(IQR) = 34 + 1.5(7) = 44.5; there are no values greater than or equal to that in the data, so there are no upper extremes. Also, Q1 - 1.5(IQR) = 27 - 1.5(7) = 16.5; there are no values less than or equal to this, so there are no lower extremes.
   d) Follow the same procedure for (ii).

Practice Multiple Choice Questions

1. The average time between infection with the AIDS virus and developing AIDS has been estimated to be 8 years with a standard deviation of about 2 years. Approximately what fraction of people develop AIDS within 4 years of infection? [Answer: b]
   a. about 5%
   b. about 2.5%
   c. about 32%
   d. about 16%
   e. about 1%

2. An instructor decides to "curve grades" in a course depending upon the percentile measures. Here are some summary statistics: [Answer: b]

      Quantile Level    Final Mark
      Minimum               10
      10.0%                 48
      25.0%                 55
      Median                66
      75.0%                 78
      90.0%                 87
      Maximum               93

   Which of the following is FALSE?
   a. About 1/4 of the class received a score of 55 or less.
   b. About 3/4 of the class received a score of 75% or less.
   c. About 50% of the class received grades between 55 and 78.
   d. This method assigns grades relative to how others do in a class rather than against an absolute standard.
   e. This method always has half of the class at or above the median grade.

3. An experiment was performed upon rats to investigate the effect of ingesting Alar (a chemical sprayed on apple trees to keep fruit from dropping before ripe) upon subsequent cancer rates.
   The following variables were measured:
      gender (0=female, 1=male);
      weight (g);
      dose of Alar (nil, low, high);
      number of tumors
   The typical weight of a rat is about 800 g and the weights were rounded to the nearest gram. The number of tumors is around 10. Which of the following is FALSE? [Answer: c]
   a. Gender is nominal scale; dose is ordinal scale
   b. Gender is discrete; weight is continuous
   c. Number of tumors is discrete and is interval scale
   d. Dose is ordinal scale and discrete
   e. Weight is ratio scale; and number of tumors is discrete.

4. Here are some summary statistics on the results of the experiment. Draw suitable BOXPLOTS to compare the results. Salmon production is in kg/km of spawning sites.

      Quantiles
      Level        Minimum   10.0%   25.0%   median   75.0%   90.0%   maximum
      clear cut      0.9      0.9     3.3     19.2     48.1    87.4     90.0
      selective      0.9      1.2     8.5     29.3     51.5    93.4    108.0

      Means and Std Deviations
      Level        Number    Mean    Std Dev    Std Err Mean
      clear cut      12      29.7     30.4          8.8
      selective      12      34.6     31.1          9.0

   Solution: In this case, side-by-side box plots would be suitable.

5. What do you conclude from your boxplot and the descriptive statistics? Be sure to explain how your plot leads you to this conclusion.
   Solution: It appears that clear cut streams produce, on average, less salmon than selectively cut streams. This is because the box plot for the clear-cut areas is shifted down relative to the box plot for the selective harvest areas, and the median of the clear cut areas appears to be less than the median of the selective harvest areas.

6. Which one of the following statements is FALSE? [Answer: a]
   a. Pie charts are better than bar graphs for comparing relative sizes.
   b. Data that are nominal scale are presented using frequency tables.
   c. Means and standard deviation of ordinal data are meaningless.
   d. Box-plots are a good choice for comparing the distribution of values among groups.

7.
   As part of a study to investigate the effects of stubble burning, the following variables were measured at several sites around Winnipeg:
      pH of soil (to one decimal place, e.g., 6.3) [note: 0 pH is not meaningful]
      crop grown (0=wheat, 1=barley, 2=oats, 3=other);
      amount of stubble (0=light, 1=medium, 2=heavy);
      date of final harvesting (e.g., 10 Oct 92)
   The scales of these variables are:
   a. interval, ordinal, ratio, ratio
   b. interval, nominal, nominal, interval
   c. interval, nominal, ordinal, interval
   d. ratio, ordinal, ordinal, ratio
   e. interval, nominal, ordinal, ratio

8. A student discovers that his grade on a recent test was the 72nd percentile. If 90 students wrote the test, then approximately how many students received a higher grade than he did? [Answer: b]
   a. 65
   b. 25
   c. 72
   d. 71
   e. 18
   Solution: (1 - .72)(90) = 25.2. Equivalently, (.72)(90) = 64.8 students scored less than him, so 90 - 64.8 is approximately 25.

9. Many professional schools require applicants to take a standardized test. Suppose that 1000 students write the test, and you find that your mark of 63 (out of 100) was the 73rd percentile. This means: [Answer: c]
   a. At least 73% of the people got 63 or better.
   b. At least 270 people got 73 or better.
   c. At least 270 people got 63 or better.
   d. At least 27% of the people got 73 or worse.
   e. At least 730 people got 73 or better.
   Solution: We know that 73% of the people scored below 63 and 27% of the people scored 63 or better, so rule out a and d. This means that (.73)(1000) = 730 scored below 63 and (.27)(1000) = 270 scored 63 or better. C must be the correct answer.

Last Question

   1994: DIVORCES PER 1,000 POPULATION

      Valid    Frequency    Percent    Cumulative Percent
      5.0          2          14.3           14.3
      5.1          2          14.3           28.6
      5.2          1           7.1           35.7
      5.3          1           7.1           42.9
      5.4          1           7.1           50.0
      5.5          1           7.1           57.1
      5.6          1           7.1           64.3
      5.7          1           7.1           71.4
      5.8          2          14.3           85.7
      5.9          1           7.1           92.9
      6.0          1           7.1          100.0
      Total       14         100.0

   Identify the Percent and Cumulative Percent columns, state the median, Q1, Q3 and Interquartile Range, and identify any extreme values.
   IQR = Q3 - Q1 = 5.8 - 5.1 = .7
   Q3 + 1.5(.7) = 5.8 + 1.5(.7) = 6.85; no value in the table is that large, so there are no upper extreme values
   Q1 - 1.5(.7) = 5.1 - 1.5(.7) = 4.05; no value in the table is that small, so there are no lower extreme values
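The quartile and extreme-value checks used in the box plot exercises can be sketched in Python. This is an illustrative sketch and not the guide's or SPSS's own procedure: the function names are hypothetical, and the quartile rule is assumed to be the rounding-up rule from the Detailed Solutions (position = fraction x n, rounded up), which can give slightly different quartiles than statistical software. It covers only Q1, Q3, the IQR and the extremes, not the median.

```python
import math


def quartile_by_rounding_up(sorted_values, fraction):
    """Value at position ceil(fraction * n), counting from 1 (rounding-up rule)."""
    position = math.ceil(fraction * len(sorted_values))
    return sorted_values[position - 1]


def box_plot_summary(values):
    """Return Q1, Q3, the IQR, the 1.5 * IQR limits, and any extreme values."""
    data = sorted(values)  # always order the data first
    q1 = quartile_by_rounding_up(data, 0.25)
    q3 = quartile_by_rounding_up(data, 0.75)
    iqr = q3 - q1
    lower_limit = q1 - 1.5 * iqr
    upper_limit = q3 + 1.5 * iqr
    # Values strictly beyond either limit are flagged as extremes.
    extremes = [v for v in data if v < lower_limit or v > upper_limit]
    return {
        "q1": q1,
        "q3": q3,
        "iqr": iqr,
        "lower_limit": lower_limit,
        "upper_limit": upper_limit,
        "extremes": extremes,
    }


# Divorce-rate table: each value is repeated according to its frequency (14 cases).
frequencies = {5.0: 2, 5.1: 2, 5.2: 1, 5.3: 1, 5.4: 1, 5.5: 1,
               5.6: 1, 5.7: 1, 5.8: 2, 5.9: 1, 6.0: 1}
divorce_rates = [value for value, count in frequencies.items() for _ in range(count)]

divorces = box_plot_summary(divorce_rates)

# Practice data set (ii) from the box plot exercise.
practice_ii = box_plot_summary([998, 92, 432, 223, 785, 335, 367, 444, 457, 458, 488])

print(divorces)
print(practice_ii)
```

Because the divorce rates are stored as floating-point numbers, the printed IQR and limits may show tiny rounding noise in the last decimal places; round them for display if needed.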