A Catechism for The Practice of Statistics, Second Edition
This is the 6th version of this catechism, revised as of 2/2/2007.
The excellent textbook was written by Daniel Yates, David Moore, and Daren Starnes, and is copyright 2003, W.H. Freeman and Company. This catechism was written by Joseph Strayhorn, in 2006, and is not copyrighted. Strayhorn’s email address is joestrayhorn@juno.com.

YMS Chapter 1: Exploring Data

Q1. The science of data is known as ____.
A1. Statistics

Q2. Most raw data sets can be organized into rows and columns. Each row represents some object or person that is studied, and each column represents some characteristic of that thing that is measured. What does our textbook call those objects and those characteristics, respectively?
A2. Individuals and variables

Q3. What are the two main classes of variable types?
A3. Categorical and quantitative

Q4. A description, depiction, or equation telling what values a variable takes on and how often it takes on these values is called the ___ of the variable.
A4. Distribution

Q5. Before studying the relationships among variables, it's usually good to begin by examining what?
A5. Each variable by itself

Q6. Before getting numerical summaries of the data, your textbook advises exploring the data with what?
A6. Graphs

Q7. What two types of graphs are usually most appropriate for categorical data?
A7. Bar charts and pie charts

Q8. If several percentages do not represent portions of the same whole, then what type of graph is inappropriate?
A8. A pie chart

Q9. When you are asked to describe a distribution after looking at a graph, the general tactic is to look for an overall pattern and also for striking deviations from that pattern. When describing the overall pattern, what three features should you mention?
A9. Center, shape, and spread

Q10. When you are asked to describe a distribution, the general tactic is to look for an overall pattern and also for striking deviations from that pattern.
What are the striking deviations called?
A10. Outliers

Q11. Someone wants to display the center, shape, and spread of a data set with a picture. But the person also wants to communicate, through the same graph, the individual raw data values that were collected in the study. The variable takes on too many different values to make a dot plot feasible. What type of graph should the person choose?
A11. A stem plot

Q12. Instead of a dot plot or a stem plot, a ____ is the most common graph of the distribution of one quantitative variable.
A12. Histogram

Q13. What does your textbook give as a minimum number for either the number of stems in a stem plot, or the number of classes in a histogram?
A13. Five

Q14. If the right and left sides of a histogram are approximately mirror images of each other, we call the distribution what?
A14. Symmetric

Q15. If there's a big hump on the left side of a histogram and a long tail extending far out to the right, do we say that the distribution is skewed right or skewed left?
A15. Skewed right

Q16. If you look at people's incomes, defining income so that zero is the smallest possible value, and your sample includes mainly middle-income people but also a few extremely high-income people, will the distribution be skewed right or skewed left?
A16. Skewed right

Q17. Mary gets a test report saying that 79% of the test takers fell at or below the score that she made. The name of the type of score she got is what?
A17. Percentile

Q18. A relative cumulative frequency graph is often called what?
A18. Ogive

Q19. In a relative cumulative frequency graph, or ogive, the horizontal axis is for the values of the variable you are looking at. For any given value on the horizontal axis, what does the value on the vertical axis stand for?
A19. The fraction of observations less than or equal to that value

Q20.
If you are given a relative cumulative frequency graph, and someone asks you to find the center of the distribution, how do you do it?
A20. Find the value on the x-axis that has a 50% or .5 value on the y-axis.

Q21. On a time plot, what axis does time go on?
A21. The horizontal axis

Q22. On a time plot, an overall upward or downward slope is called what?
A22. A trend

Q23. On a time plot, what do you call the shorter-term variations that occur regularly, repeating themselves in a cyclic fashion?
A23. Seasonal variation

Q24. 1/n times the summation of the x(i), where n is the number of cases and x(i) is the value of the ith case, is known as what?
A24. The mean

Q25. The number in a distribution such that half the observations are smaller and the other half are larger is called what?
A25. The median

Q26. If there is no middle value in a data set because you have an even number of cases, how do you find the median then?
A26. You find the mean of the two center observations.

Q27. Between the mean and median, which of these is pulled farther in the direction of extreme values or outliers?
A27. The mean.

Q28. If a distribution is highly skewed to the right, which value will be lower: the mean, or the median?
A28. The median.

Q29. From which statistic, the mean or the median, can you recover the total value of all the cases in your data set, if you know how many cases there are?
A29. The mean

Q30. What's the definition of the range of a distribution?
A30. The difference between the largest and smallest value

Q31. What's the chief problem with using the range as a measure of the spread of a distribution?
A31. It's too sensitive to outliers, and it depends on only two values in the data set.

Q32. What do you call the median of the subset of observations whose position in the ordered list is to the left of the overall median?
A32. The first quartile

Q33. What's the definition of the interquartile range?
A33. The third quartile minus the first quartile.

Q34.
What's the rule of thumb for defining outliers in terms of the interquartile range?
A34. An outlier falls more than 1.5 times the interquartile range above the third quartile or below the first quartile.

Q35. What five numbers are in the so-called five number summary?
A35. The minimum, the first quartile, the median, the third quartile, and the maximum.

Q36. What type of graph gives a picture of the five number summary?
A36. The box plot

Q37. What's the difference between a regular box plot and a modified box plot?
A37. In a regular box plot, the whiskers go out to the maximum and minimum. In a modified box plot, the whiskers go out to the largest and smallest data points that are not outliers. The outliers are plotted as isolated points on a modified box plot.

Q38. If you take the deviation of each observation from the mean of the whole set, square those deviations, add those squares, and divide by one less than the number of observations, what do you call the resulting number?
A38. The variance

Q39. What is the relationship between the variance and the standard deviation?
A39. The standard deviation is the square root of the variance.

Q40. How is the standard deviation like the interquartile range?
A40. Both of them are measures of spread of the distribution.

Q41. When you average the squared deviations from the mean to find the variance of a sample, what should you divide by: the n of cases, or the "degrees of freedom"?
A41. The degrees of freedom

Q42. Under what conditions will a standard deviation equal zero?
A42. When all the observations have the same value.

Q43. Between the interquartile range and the standard deviation, which is more resistant to the effects of outliers?
A43. The interquartile range

Q44. How do you choose between the five number summary on the one hand, and the mean and standard deviation, on the other hand, as ways of describing a distribution?
A44.
The mean and standard deviation are good for reasonably symmetric distributions that are free of outliers. Otherwise the five number summary is usually better.

Q45. If you add the same number to each observation, how does that affect the center and the spread of the distribution?
A45. The number that you add is added to the measures of center, such as the mean and median. But measures of spread, such as the interquartile range and standard deviation, are not affected.

Q46. If you multiply each observation by the same number, how does that affect measures of center and spread?
A46. Both the measures of center (median and mean) and the measures of spread (standard deviation and interquartile range) are multiplied by the same number. (The variance, which is also a measure of spread, is multiplied by the square of the number each observation is multiplied by.)

Q47. What are three graphical methods of comparing distributions?
A47. Side-by-side bar graphs, back-to-back stem plots, and side-by-side box plots.

YMS Chapter 2: The Normal Distribution

Q1. The scales of density curves are adjusted so that the total area under each curve is what?
A1. One

Q2. The area under a density curve between a couple of x-axis values represents what?
A2. The proportion of all observations that fall between those values.

Q3. Do measures of center and spread apply to a density curve as well as to sets of observations?
A3. Yes

Q4. How do you define the median of a density curve?
A4. The point with half the area under the curve to its left and the remaining half of the area to its right.

Q5. The quartiles of a density curve divide the area into what?
A5. Four equal parts.

Q6. What is the relationship between the mean and the median of a symmetric density curve?
A6. They are equal.

Q7. Which is pulled farther toward the tail of a skewed distribution: the median, or the mean?
A7. The mean

Q8. In conventional notation, what are the meanings of x-bar and s, as contrasted to mu and sigma?
A8. The first two refer to the mean and standard deviation, respectively, of a set of observations, a sample. The second two refer to the mean and standard deviation, respectively, of a density curve or idealized distribution, or the population distribution.

Q9. What three features describe the overall shape of a normal curve?
A9. Normal curves are symmetric, single peaked, and bell shaped.

Q10. Is there only one normal curve, or is there an infinite number of normal curves?
A10. An infinite number.

Q11. For any given mean and standard deviation, is there only one normal curve, or an infinite number of normal curves?
A11. Only one.

Q12. How can you visually find the points one standard deviation from the mean of a normal curve?
A12. Those points are the inflection points of the curve. That is, the curve changes from falling more and more steeply to falling less and less steeply, or vice versa. (Optional answer for calculus lovers: they are points where the second derivative of the curve equals zero.)

Q13. The distributions of test scores, of measures of characteristics of living things, and of summary statistics for chance outcomes repeated many times, often (but not always!) follow what type of distribution?
A13. The normal distribution

Q14. What three percentages do you have to remember when you are stating the “empirical rule”?
A14. 68%, 95%, and 99.7%.

Q15. Are the three percentages for 1, 2, and 3 standard deviations exact, or easier-to-remember rounded approximations?
A15. Approximations.

Q16. What do the three percentages in the empirical rule apply to? In other words, what is the meaning of this rule?
A16. The three numbers tell the percent of observations falling within the region plus or minus 1, 2, or 3 standard deviations from the mean, respectively, in a normal curve. (Note that the percents refer to the percent of observations encompassed by the interval from that number of standard deviations below the mean to that number above the mean.)

Q17.
True or false: If Mary scores one standard deviation above the mean on a normally distributed test, then approximately 68% of the test takers scored as close to the mean of the test as, or closer to the mean than, Mary did.
A17. True

Q18. True or false: If Mary scores one standard deviation above the mean on a normally distributed test, her score is in the 68th percentile.
A18. False

Q19. True or false: If Mary scores one standard deviation above the mean on a normally distributed test, half of 68%, or 34%, are above the mean but at or below Mary’s score. An additional 50% are below the mean. Thus Mary equals or surpasses 50% plus 34% of the test takers, and is at the 84th percentile.
A19. True

Q20. What does the notation N(100,15) mean?
A20. It denotes a normal distribution with mean 100 and standard deviation 15.

Q21. True or false: the standard score for any observation tells how many standard deviations that score is from the mean.
A21. True

Q22. What two operations do we do, to standardize a score?
A22. Subtract the mean and divide by the standard deviation.

Q23. A standard score is often called by what other term?
A23. The z-score.

Q24. What does the sign of a standard score correspond to?
A24. If the z-score is positive, the observation is above the mean, and if negative, below the mean.

Q25. Are there an infinite number of standard normal curves, each with its own equation describing it, or just one standard normal curve, with just one equation describing it?
A25. There is just one standard normal curve, with only one equation describing it.

Q26. In a table of areas under the standard normal curve, what does the table entry for each z score represent?
A26. The area under the curve to the left of z, or in other words, the proportion of cases with values less than z.

Q27. What steps do you follow to use the z table to solve the following “problem prototype”: given N(mu, sigma), please find the proportion of observations less than x?
A27.
First we standardize x (by subtracting mu and dividing by sigma). Then we look at the z table to find the proportion of the distribution less than the z score we’ve obtained.

Q28. What steps do you follow with a z table if you want to know what proportion of the scores are between two values?
A28. Look up the proportion less than the first, and less than the second, and find the difference between the two proportions.

Q29. What two pictures do YMS recommend drawing when solving problems where you are given normal curves and asked for proportions of the observations?
A29. They recommend drawing unstandardized and standardized normal curves, and shading in the areas that are asked for.

Q30. What do the authors recommend (as a word to the wise for future test-takers) as the last step of problems giving a normal distribution and asking for proportions of observations?
A30. They recommend stating the conclusion in the context of the problem. Thus rather than just saying, the answer is 49%, you would say, “About 49% of boys have cholesterol levels between 170 and 240 mg/dl.”

Q31. What steps do you go through when you want to find a value given a proportion of a normal distribution, using the z-table?
A31. You look for the proportion in the body of the table, and you find at the margin the z-score that corresponds to it. Then you “unstandardize” the z-score.

Q32. What operations do you do, in what order, to “unstandardize” a z-score, or turn the z-score into a raw score?
A32. You multiply the z score by the standard deviation, and then you add the result to the mean.

Q33. Suppose you have a data set, and you want to see if it is approximately normally distributed. What’s the first thing to do, before doing calculations?
A33. Make a frequency histogram or stemplot, and see if the graph looks bell-shaped and symmetric.

Q34. What’s a way of checking a data set for normality, using the empirical rule?
A34.
Find the mean and sd of the data set, and count (or get a computer to count) the proportions of observations that are within 1, 2, and 3 standard deviations of the mean. See if these proportions correspond, roughly, to .68, .95, and .997.

Q35. True or false: The point of making a normal probability plot is to see whether a set of numbers is normally distributed.
A35. True

Q36. If you were to take any set of numbers, and plot the numbers on the x-axis, and their z-scores on the y-axis, you would get what shape for your graph? (Hint 1: y values are (1/sd)*x - mean/sd, or of the form y=mx+b.) (Hint 2: Standardizing a score involves a linear transformation.)
A36. A line.

Q37. On a normal probability plot for a set of observations, what goes on the x axis?
A37. The values of the observations themselves.

Q38. True or false: On a normal probability plot, what goes on the y-axis for each x value is the z score that would be associated with the percentile for that value, assuming a normal distribution (and using midpoints of intervals in finding percentiles).
A38. True.

Q39. What conclusion do you come to if a normal probability plot is not linear?
A39. That the data are not normally distributed.

YMS Chapter 3: Examining Relationships

Q1. Suppose that a researcher wants to study the effect of people’s ever having taken the drug "ecstasy" upon the people’s memory scores when tested. Which of these is the response variable and which is the explanatory variable?
A1. Ingestion of ecstasy is the explanatory variable and memory test scores are the response variable.

Q2. How do the terms "dependent variable" and "independent variable" correspond to the terms “response variable” and “explanatory variable”?
A2. Dependent corresponds to response, and independent corresponds to explanatory.

Q3. Is it proper to use the terms, response variable and explanatory variable, if the explanatory variable does not actually cause the response variable?
A3. Yes.
No implication of causation is contained in the terms explanatory and response (or independent and dependent).

Q4. What's the order of tasks involved in examining relationships between two variables?
A4. First plot the data, then use numerical summaries. Look for the overall pattern and deviations from that pattern, and when the overall pattern is regular, use a mathematical model to describe it.

Q5. Suppose that someone has math scores for the children in one classroom, and English scores for a second set of children in another classroom. The person asks you about making a scatterplot for these data. What would you reply?
A5. It doesn't make sense to use a scatterplot in this situation, because a scatterplot involves graphing two variables measured upon the same individuals.

Q6. True or false: in a scatterplot, each point represents one individual; the x-coordinate of the point represents the value of one variable and the y-coordinate represents the value of another variable measured on that same individual.
A6. True

Q7. If there is an explanatory variable, which axis should it be graphed on?
A7. The x-axis

Q8. When describing a scatterplot, what three aspects of the pattern should you refer to?
A8. The form, the direction, and the strength of the relationship.

Q9. True or false: in describing the form of a scatterplot, it's important to say whether the graph appears to be linear or not.
A9. True

Q10. In describing the form of a scatterplot, what term do you use if the values tend to fall into two or more groups that are separated from one another by gaps?
A10. You say that there are clusters.

Q11. In describing the direction of a scatterplot, when there is a positive or negative slope, we say that the variables are positively or negatively what?
A11. Positively or negatively associated.

Q12.
When any given x value on a scatterplot has very widely varying y values associated with it: the more widely varying the y values, the less _____ is the relationship between the two variables.
A12. strong

Q13. When you are drawing a scatterplot, what symbols should you use in showing the axes if the origin of the graph is not at zero?
A13. You use a symbol that looks like two slashes to indicate a break in the scale.

Q14. What are about 3 other guidelines on how to draw scatterplots properly?
A14. Make the intervals uniform. Label both axes. Choose a scale that makes your graph big enough.

Q15. Suppose that you want your scatterplot to reflect the influence of a particular categorical variable, in addition to the relationship of the two quantitative variables that are plotted. For example, suppose you want to graph the relation between entertainment violence and real-life violence for males and females on the same graph, in a way that displays the relationship separately for males and females. What should you do?
A15. Use a different symbol on the scatterplot for the points designating males than for those designating females.

Q16. A common problem in constructing a scatterplot occurs when two or more individuals have exactly the same values for each of the two variables. What should you do in that case?
A16. Use a different plotting symbol to call attention to those points.

Q17. Which is a better method for judging the strength of a linear relationship: simply to look at a graph, or to use a calculated numerical statistic that summarizes the strength of the linear relationship? Why do you think your chosen method is better?
A17. A calculated statistic works better, because our eyes can be deceived by the different scaling methods used in graphs.

Q18. What is the summary statistic that measures the strength of a linear relationship?
A18. The correlation coefficient.

Q19.
We’ve used the Greek letter mu to represent a population mean; x-bar to represent a sample mean; the Greek letter sigma to represent the population standard deviation; and s to represent the sample standard deviation. What letter does our book use to designate what it calls the correlation?
A19. r

Q20. Given that the letter r, for the correlation coefficient, is in our own alphabet and not the Greek alphabet, do you think it refers to a sample statistic or a population parameter?
A20. A sample statistic

Q21. Would you guess that there is some Greek letter that refers to the population value of the correlation coefficient?
A21. Yes. (It’s the letter rho, which looks pretty much like a p!)

Q22. When you look at the formula for the sample correlation coefficient that your text gives, you see (xi-xbar)/sx and (yi-ybar)/sy. Can you give a simpler name to these expressions?
A22. These are the standard scores, or z-scores, for the ith individual. The first factor is the z-score for the x variable and the second is the z-score for the y variable.

Q23. What is the meaning of a positive or negative sign associated with the correlation coefficient?
A23. A positive sign means there’s a positive association between the variables; in other words, higher values of one are associated with higher values of the other. A negative sign means there’s a negative association; that is, higher values of one variable are associated with lower values of the other.

Q24. Suppose one person calculates the correlation of IQ score of some individuals with number of boxing matches fought, testing the hypothesis that boxing (the explanatory variable) affects IQ (the response variable). A second person, using the same data set, also calculates the correlation of the number of fights with IQ score, only this person thinks of IQ as the explanatory variable and number of fights as the response variable. Do they get the same correlation, or different ones?
A24. The same correlation.
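To see why, here is a minimal sketch in Python. The fight counts and IQ scores are invented purely for illustration, and the helper name pearson_r is an assumption of this sketch, not anything from the textbook. It follows the formula described in Q22: multiply each individual's two z-scores, add the products, and divide by n - 1. Swapping the variables only swaps the two factors in each product.

```python
from statistics import mean, stdev


def pearson_r(x, y):
    """Sample correlation: sum of z-score products divided by n - 1."""
    n = len(x)
    x_bar, y_bar = mean(x), mean(y)
    s_x, s_y = stdev(x), stdev(y)  # sample standard deviations (n - 1 divisor)
    products = (
        ((xi - x_bar) / s_x) * ((yi - y_bar) / s_y)
        for xi, yi in zip(x, y)
    )
    return sum(products) / (n - 1)


# Hypothetical data, made up for this example only.
fights = [0, 2, 5, 8, 12, 20]
iq = [110, 104, 101, 99, 95, 90]

r_fights_explanatory = pearson_r(fights, iq)  # first person's calculation
r_iq_explanatory = pearson_r(iq, fights)      # second person's calculation
print(r_fights_explanatory, r_iq_explanatory)
```

Both printed values agree, and with these made-up numbers the sign is negative because more fights go with lower scores.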
The correlation coefficient is not affected by which variable is considered explanatory and which is considered response.

Q25. Suppose someone codes race as follows: 0=Caucasian, 1=African American, 2=Asian, 3=Hispanic, 4=American Indian, 5=Other. Then someone calculates a correlation between race and a reading test score for a sample of kids. Do you have a problem with this? If so, what’s your problem?
A25. The problem is that the correlation coefficient is to be used with quantitative variables, not categorical variables like this. The obtained correlation would be meaningless, an artifact of the arbitrary coding system.

Q26. Melinda computes a correlation between the height of mothers and their daughters. Lunk is looking at the computations, and says, “You blew it! You have the height of mothers measured in centimeters, and the height of the daughters measured in inches!” Please tell whether Melinda needs to do anything to fix her correlation coefficient, and if so, what?
A26. Melinda did not blow it, because the correlation coefficient comes out the same no matter what units are used. (This is because a transformation from one unit to another (which involves multiplying each number in the data set by the same number) multiplies both the mean and the standard deviation of the data set by the same number (as was learned in chapter 1). The z-score, which is (xi-xbar)/sx, comes out the same, because each of the three numbers that make up the z-score is multiplied by the same factor, and that factor cancels out. Since the z-scores are not affected by changes of units, the correlation coefficient is also not affected.)

Q27. What range of values is possible for the correlation coefficient?
A27. -1 to +1.

Q28. What sort of correlation coefficient do you find when two variables have a very strong linear relationship, and when as the first gets greater, the second gets smaller?
A28. A correlation close to -1.

Q29.
Suppose the data points are two variables collected for all the days of 2005. For each of those days, imagine that we know (variable 1) the number of words your instructor for this course spoke in that day, and (variable 2) the peak barometric pressure for that day in Caracas, Venezuela. About what would you guess the correlation between these two variables to be? Why?
A29. You’d guess a correlation of about 0, since there is no reason to expect that these two variables would rise and fall in association with each other.

Q30. Suppose there are two variables which, when graphed in a scatterplot, form an almost perfect u-shaped parabola. Would the strong relationship between these variables imply a high correlation coefficient (meaning one close to 1)? Why or why not?
A30. No, because the correlation coefficient measures the strength of linear relationships only, not curvilinear relationships. A u-shaped curve isn’t a straight line!

Q31. Does the correlation coefficient resemble the median and interquartile range in being fairly resistant to outliers, or resemble the mean and standard deviation in being heavily influenced by outliers?
A31. Like the mean and sd, the correlation coefficient can be greatly influenced by outliers.

Q32. Someone practices guessing correlation coefficients from scatterplots using an “applet” on the internet. Why should the person not get too confident of his or her guessing power given scatterplots of real-life data?
A32. Because the scales of the variables are not necessarily the same as they were on the applet, and scales can throw off “eyeball” estimates.

Q33. In attempting to give a more complete description of a set of data involving two variables, someone wants to give a measure of center and spread as well as the correlation coefficient.
Assuming the person has made a good decision to use the correlation coefficient, what measures of center and spread would be most consistent with the correlation coefficient: the mean and sd or the median and IQR?
A33. The mean and sd, because the formula for the correlation uses the mean and sd.

Q34. The women in a corporation think that they are being discriminated against in their salaries. A management spokesman says to them, “Look at this plot. The first data point is the average salary for men who have worked here one year, put into an ordered pair with the average salary for women who have worked here one year. The second ordered pair is the average salary for men and women with two years’ experience, and so forth. The correlation between men’s salaries and women’s salaries is .95! That’s almost a perfect correlation! You women have nothing to complain about!” Is this argument valid? Why or why not?
A34. It’s not valid. The correlation coefficient measures the predictability of one score from another, not the equality of the two scores. Adding the same value to all values of either x or y does not change the correlation, and neither does subtracting a constant or multiplying or dividing by a positive constant. So the salaries of women could be half those of the comparable salaries of men, or $10,000 less than the comparable salaries of men, and you could still get a high correlation.

Q35. Suppose that you have a data set with a correlation fairly close to 0. All the numbers for both variables are between 0 and 10. There are about 10 individuals in the data set. Then suppose that one more individual gets added, an outlier with a value of (100, 100). What do you think the correlation coefficient will become? (Try it out with your calculator or Minitab if you want, or mimic this situation on an “applet.”)
A35. The correlation will become close to 1, because this one outlier has such a strong effect.

Q36.
True or false: In a regression line, like a correlation coefficient, you get the same numbers (slopes and intercepts) no matter which variable is considered the explanatory variable and which is considered the response variable.
A36. False. The change in y per unit change in x, for example, is not the same as the change in x per unit change in y.

Q37. Please explain, for a least squares regression line: the sum of the squares of what is being minimized?
A37. The squares of the errors for each data point, where the errors are the vertical distances from the data points to the regression line. (The word residuals is also correct.)

Q38. Please explain why the vertical distance from the data point to the regression line corresponds to the idea of an “error.”
A38. The regression line gives predicted values of y (called y-hat) for each x. There is also an actual observed value y for each x, for each data point. The difference between the actual and the predicted value is the “error” in prediction that is made by using the regression line to predict the response variable.

Q39. What’s the formula for the slope of a regression line, in terms of the correlation between the two variables and their standard deviations? (Call the slope b, the correlation r, and the two sd’s sx and sy.)
A39. b=r*(sy/sx).

Q40. Every least-squares regression line passes through what point?
A40. It passes through (xbar, ybar), the ordered pair formed by the means of both variables.

Q41. Once you know the slope of a regression line, how would you find the y-intercept, knowing the means of the x values and the y values? (Call the intercept a, and the means for x and y xbar and ybar.)
A41. a=ybar-b*xbar. You get this by solving the equation ybar=b*xbar+a for a. That second equation comes from the fact that ybar is always the predicted value of y for xbar.

Q42.
When you have a regression equation delivered by the computer software output, and someone asks you for the predicted value of y given a certain x value, what do you do?
A42. Just substitute the value of x into the equation and solve for the predicted y value.

Q43. Suppose that someone measures height as a function of weight for a bunch of human adults, and gets a regression equation predicting height as a function of weight. Why is the y-intercept of the equation not as meaningful or important as the slope, or as the equation as a whole?
A43. Because the y-intercept corresponds to the height of someone with weight 0. But the weight of 0 is far outside the range of weights measured in the study and thus the height predicted will be an extrapolation. Secondly, the weight of 0 is one seldom found in human beings (at least those who have already been born and aren’t dead yet)!

Q44. Suppose you have a regression equation output from a computer and you are asked to plot the line by hand. How would you do it?
A44. Just pick two values of x, and calculate the yhat values for each, and connect those two dots. It helps if you pick points that are close to the bottom and top ends of the range. (One easy point is the y-intercept.)

Q45. When, in the context of regression, people speak about the SST (sum of squares total), what do they mean by that?
A45. The sum of the squared deviations of each y value from the mean of the y values.

Q46. When, in the context of regression, people speak of the SSE or sum of squares for error, what do they mean by that?
A46. They mean the sum of the squares of the deviations of the actual y values from the predicted y values. (These deviations are also called residuals.)

Q47. Your book doesn’t define very explicitly in this chapter what the sum of squares for regression is. Do you think it would be reasonable to think of that as the sum of all the squared deviations of the predicted y values (the y-hats) from the mean of the y values?
Particularly if a trustworthy source hinted that it was? (P.S. you can calculate the SSReg in your head, easily, for the 3-point data set of example 3.11 on page 160: it comes out to 32. The SST comes out to 38, and the SSE to 6.) A47. Yes. Q48. The book speaks of the sum of squares for the regression as the SST-SSE, or the sum of squares total minus the sum of squares for error. Can we infer from this that the total sum of squares, SST, can be partitioned into the SSReg (sum of squares for regression) and the SSE, (sum of squares for error), and that SST=SSReg+SSE? (P.S. I use the notation SSReg so as not to confuse sum of squares for regression with sum of squares for residuals.) A48. Yes. Q49. The square of the correlation coefficient, or r-squared, a.k.a. the coefficient of determination, means what in terms of the fraction of the total sum of squares? Please answer in symbols and words. A49. r-squared = the SSReg/SST or (SST-SSE)/SST. The r-square is the fraction of the total sum of squares that is accounted for by the regression of y on x. Q50. One person studies IQ as a function of number of boxing matches participated in, and another uses the same data set to study boxing matches participated in as a function of IQ. (That is, matches is the explanatory variable in the first study and IQ is the explanatory variable in the second.) Do they both get the same value for the slope of the regression line? Can you explain in words the reason for this answer? (You may assume a negative relationship between the two variables in constructing language for your answer.) A50. No. The first slope tells how many points IQ changes per additional boxing match, and the second slope tells how many fewer boxing matches someone has for each additional IQ point. Q51. The slope of the regression line b is equal to r*(sy/sx). Along the regression line, a change in 1 standard deviation in x results in a change of how many standard deviations of y? 
(Hint: the slope is the change in y over the change in x. So the change in y equals the slope times the change in x. So if the change in x is sx, we get r*(sy/sx)*sx, which equals...) A51. r standard deviations of y, or r*sy. Q52. True or false: the slope of the regression line tells you how many unstandardized units the predicted value of y changes for each unstandardized unit change in x. A52. True. Q53. True or false: the correlation coefficient tells you how many standard deviations the predicted y changes for each standard deviation change in x. A53. True. Q54. True or false: If both of two variables y and x are standardized, (so that the standard deviation of both is 1) then the slope of the regression line and the correlation coefficient are equal. A54. True. Q55. What is another name for y – yhat, or the deviation of the observed y value from the predicted value, or the error in prediction for a given value, or the vertical distance between any data point and the regression line? A55. The residual. Q56. True or false: when we speak of a “least squares” regression line, we mean that we choose the line so as to minimize the sum of the squares of the residuals. A56. True. Q57. Someone draws a graph of residuals (on the y axis) versus the values of the explanatory variable. This graph is called what? A57. A residual plot. Q58. Someone draws a residual plot, and all the values are positive. Someone says to that person, “There must be some mistake.” Why did the person say that? A58. Because the mean of the least-squares residuals is always zero; thus if you have positive values you have to have at least one negative value. Q59. If the linear regression equation fits the data well, what do you see on the residual plot? A59. A uniform scatter of points, without a clear pattern, and with no unusual individual observations. Q60. What do you call a data point that has a big effect on the slope or intercept of the regression line? A60. An influential point. Q61.
Does an influential point necessarily have a large residual? (Hint: the influential point can pull the line close to it.) A61. No. YMS Chapter 4: More on Two-Variable Data Q1. In the example at the beginning of this chapter, a plot of the log of brain weight as a function of the log of body weight provides a “better fit” for the observed data than a simple plot of brain weight as a function of body weight. What is meant by better fit? (This is also the answer to the question, what are we trying to do (at least in this course) when we transform data?) A1. A better fit means that the data are more linear – a linear model is more successful in describing the relationship, the correlation coefficient is higher, the sum of squares of residuals for a regression line is lower. Q2. Can you figure out why we would want to transform data so as to get a more linear relationship? A2. Because then we can use the methods of correlation and regression that were studied in the previous chapter and will be studied more in future chapters, which are very powerful methods of statistical analysis. Q3. True or false: if we have a curvilinear function, and we want to straighten it out to make a linear function, we can’t do that by multiplying or dividing by constants or adding or subtracting constants (i.e. by using linear transformations). A3. True. Linear transformations don’t straighten curves. Q4. What are the transformations that are most commonly used, other than linear transformation? A4. Positive and negative powers, and logarithms. Q5. What is the definition of a monotonic function? A5. It’s one where as x increases, y always increases (a monotonic increasing function) or as x increases, y always decreases (a monotonic decreasing function). It produces a graph that doesn’t go up and dip down, but consistently has either a positive or negative slope. Q6. 
Is it kosher to speak of a function as being, for example, monotonic increasing over part of the domain of x, and monotonic decreasing over another part? If so, can you give an example? A6. Yes. An example is y=x^2, which is monotonic decreasing for negative values of x, and monotonic increasing for positive values of x. Q7. True or false: There are often two steps in transformation. The second is to apply a power or logarithmic function that simplifies the data. The first is to use a linear transformation, such as adding a constant, that makes the values all positive, so that the function applied in the second step will be defined and monotonic increasing. A7. True. Q8. How is the ladder of power functions useful? A8. When we are trying to straighten out curved data sets, we can go in one direction or the other along the ladder, seeing how straight the line becomes, rather than just randomly picking different functions. Q9. Linear growth is to adding a fixed amount per unit time as exponential growth is to ______ by a fixed amount per unit time. A9. Multiplying. Q10. If the number of a certain type of bacteria doubles every two hours, is that linear growth or exponential growth? A10. Exponential. Q11. Increasing everyone’s salary by a certain percentage is to ______ growth as increasing everyone’s salary by the same dollar amount is to _______ growth. A11. Exponential, linear. Q12. Suppose we have a function y=ab^x, where a and b are constants and x is the explanatory or independent variable, and y is the response or dependent variable. Is this an example of an exponential function, or a power function? A12. An exponential function. (This is a function like y=2^x.) Q13. Suppose we have a function y=ax^b, where a and b are constants and x is the explanatory variable and y is the response variable. Is this an example of an exponential function, or a power function? A13. A power function. (This is a function like y=x^2.) Q14.
If y is an exponential function of x, plotting what function of y versus x should result in a linear graph? A14. The log of y versus x. Q15. Suppose you do a regression of the log (base 10) of y versus x, and you get a nice linear scatterplot and a high coefficient of determination (r^2) when you do a regression. Now you can use this linear relationship for prediction. Suppose someone (like a testmaker) asks you what the predicted value is of y (not log y) for a given value of x. How would you find it? A15. You’d just use your equation to find the predicted value of log y. Then you take the antilog (or 10 to that number) to get the predicted value of y. In other words, you “untransform” the value back to the original scale. Q16. If a variable grows exponentially, its logarithm grows how? A16. Linearly. Q17. To make an exponential function linear, we use the log transformation just with the response variable y. To make a power function linear, we use the log transformation with what? A17. Both the explanatory and the response variable. Q18. If you start with the power function y=ax^p, and take the log of both sides, what result do you end up with? A18. log y=log a +p log x. Q19. Suppose you have a data set, and its scatterplot is curved. Then you take the log of both explanatory and response variables, and plot them, and you get a line. What do you infer from this? A19. That the original variables were related according to a power function (or power law). Q20. When you plot the log of y vs. the log of x, do you give any meaningful interpretation to the slope of the line that you get? If so, what is it? A20. According to the equation log y =log a + p log x, the slope of the line is the power to which x is raised in the original power function. Q21. Jane gets a regression coefficient (i.e. a slope) of 3.617 when regressing log y vs. log x. 
She says, “Now I know that x and y are related in a power function, and y= a constant times x to the 3.617 power.” Do you think this conclusion should be tempered or qualified? If so, how? A21. Yes. The slope you obtain in any given experiment is an estimate of the population value of the slope, and not an exact rendering of it. Another sample is very likely to give a different slope, and it could possibly even lead to a different conclusion about the form of the functional relationship! Q22. Suppose you plot the log y vs. the log x and you get a good line, with intercept 2 and slope 3. So log y=2+3log x. Now you are asked to find the equation for y in terms of x, without logs in it. How do you do this? A22. You just take the antilog of both sides. You get y= 10^(2+3 log x), or y=10^2*(10^log x)^3, or y=100*x^3. That is, y=100 times x cubed. Q23. Suppose I find that in the range of 3 to 7 milligrams of Ritalin given to a group of children, their math scores rise in linear fashion with increasing dose. A parent looks at the regression equation and says, “By my calculations, all it would take would be 400 milligrams of Ritalin for my child to get a 5 on the AP Statistics exam, while he’s still in 3rd grade.” What do we call this type of reasoning (which often leads to incorrect conclusions)? A23. Extrapolation, which is using the regression equation to make predictions for values of the explanatory variable that we have no experience with. Q24. I notice a linear relationship between shoe size and basketball prowess, and propose to help my daughter’s fifth grade basketball team win by dressing them all in size 15 Converse All-Stars. Someone says, “But in your study, there’s another variable, namely height, that was not among your explanatory or response variables and yet may influence the interpretation of the relationship between shoe size and basketball skill.” What type of variable is height, in this situation? A24. A lurking variable. Q25.
In the example you just read, a lurking variable enhanced the apparent association between two variables. Can lurking variables also mask or attenuate the apparent association between two variables? If so, can you think of an example? A25. Yes. The relation of overcrowding and lack of indoor toilets on page 227 of YMS is an example. Here’s another: suppose a researcher finds that the correlation between violent entertainment viewed and aggressive behavior in real life is low. Then someone reanalyzes the data and finds that when you make separate scatterplots for males and females, you get a strong relationship. Males are much more aggressive than females, and when you throw the data together on one scatterplot you get more of a horizontal line than a sloped one. Thus the lurking variable of gender obscured the real association that was there. Q26. What’s one way of discovering lurking variables that someone may not have thought about? A26. Plot the data as a function of time. You may be able to see relationships that can be linked to events that happened at a certain time, or conditions that changed at a certain time. Q27. One researcher studies the murder rate of every state as a function of the average literacy level of the teenagers in that state. The correlation is reported to be very strong (and close to –1). Another researcher studies aggression in individuals in a high school, as a function of their reading scores. The correlation, though in the same direction, shows a much weaker relationship. Can you explain why? A27. Correlations with averaged data are generally considerably higher than correlations obtained with individuals. This is because some of the random variation in each variable gets reduced when you sum or average over many individuals.
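The effect of averaging can be illustrated with a small simulation. The sketch below is not from the textbook. It assumes a hypothetical model in which each group (think of a state) has its own underlying level, and each individual's x and y values are that level (with opposite signs, to give a negative relationship) plus a large amount of individual noise. The function names pearson_r and simulate, and all of the numbers, are illustrative assumptions.

```python
import random
import statistics


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def simulate(n_groups=50, group_size=200, noise_sd=3.0, seed=1):
    """Return (r among individuals, r among group averages)."""
    rng = random.Random(seed)
    ind_x, ind_y = [], []   # one entry per individual
    avg_x, avg_y = [], []   # one entry per group
    for _ in range(n_groups):
        # Hypothetical group-level factor shared by everyone in the group.
        level = rng.gauss(0, 1)
        # Individual values: shared level plus large individual noise.
        xs = [level + rng.gauss(0, noise_sd) for _ in range(group_size)]
        ys = [-level + rng.gauss(0, noise_sd) for _ in range(group_size)]
        ind_x.extend(xs)
        ind_y.extend(ys)
        avg_x.append(statistics.fmean(xs))
        avg_y.append(statistics.fmean(ys))
    return pearson_r(ind_x, ind_y), pearson_r(avg_x, avg_y)


r_individuals, r_averages = simulate()
print(f"correlation among individuals:    {r_individuals:.2f}")
print(f"correlation among group averages: {r_averages:.2f}")
```

Under these assumptions the correlation among the group averages should come out much closer to -1 than the correlation among individuals, because averaging over 200 individuals removes most of the individual noise while leaving the shared group level intact.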
(Later on in the course, you’ll find out that the variance of the mean of a set of observations for a population is equal to 1/n times the variance of the individual observations, where n is the number of observations.) The less “noisy” both variables are, the more highly they tend to correlate with each other. Q28. When two variables X and Y are found to correlate with each other, of course two possible explanations for this association are 1) that X causes Y, and 2) (one not diagrammed on page 232) that Y causes X. Please name the other two possible explanations that are good to keep in mind when interpreting findings of associations. A28. Common response (z causes both x and y) and confounding (z, which is associated with x, may cause y). Q29. Suppose a researcher studies the effects of a way of teaching children not to be violent. The researcher gives the instruction to all the children in Mrs. Harmony’s classroom, and uses the kids in Mr. Gutsly’s classroom as a comparison group. But then the researcher realizes that Mrs. Harmony has a very different personality and interpersonal style than Mr. Gutsly: she tries to promote kindness and good will, whereas Mr. Gutsly is mainly interested in promoting competitiveness and not being wimps. What would we say about the variables of teacher personality and interpersonal style in this study? A29. That they are CONFOUNDED with the intervention. Thus the effects of these teacher variables can’t be distinguished from the effects of the intervention the study is meant to test. Q30. Someone finds that the degree of physical fitness in youth (as measured by heart rate recovery from exercise) is correlated with the number of ankle injuries the person has had. But before concluding that we should hurt the ankles of youth in order to make them more fit, a COMMON RESPONSE explanation for the association comes to mind. Can you posit this common response explanation? A30. 
That both fitness and ankle injuries are associated with more running or more athletic activity – both are responses to this basic causal variable. Q31. Even when causation is present, is there usually one and only one contributing cause for a given effect, at least in the types of phenomena people study with statistics? A31. No. Q32. Someone says, “Lots of kids play “shooter” video games for hundreds of hours, and never do anything violent. Therefore these games can’t cause violence.” What does the principle, as stated in your text, that “Even when direct causation is present, it is rarely a complete explanation of an association between two variables” have to do with this reasoning? A32. Another way of stating this principle is that one phenomenon does not have to be a necessary and sufficient condition for a second, in order to be causally related. Therefore one or several instances of non-association do not disprove a causal relationship. Q33. What is the strongest type of evidence for causal relations? A33. Well-designed experiments that are meant to control for all lurking variables. (These usually entail randomly assigning individuals to different conditions.) Q34. What’s the problem with doing a well-designed experiment, for example, to see what the effects of child abuse are? A34. We will never find it ethical to randomly assign children to conditions of child abuse versus nonabuse. Q35. Is it possible to come to valid causal inferences without doing experiments that randomly assign people to various conditions? Can you give an example of such? A35. Although your text says that “the only fully compelling method” of establishing causality is an experiment, we can and do come to valid causal inferences without randomly assigning people to conditions. The example of smoking and lung cancer is one where the evidence for causation is “overwhelming” despite no study in which people were randomly assigned to smoke or not smoke over many years. Q36. 
A two-way table describes the relation between two of what kind of variables? A36. Categorical. Q37. When you look at a two-way table that looks like this:

                       Approval of president’s performance
Party affiliation      Yes      No      Total
Democrat                25     100        125
Republican             125       5        130
Total                  150     105        255

what is the row variable, and what is the column variable? A37. The party affiliation is the row variable and the approval of the president’s performance is the column variable. Q38. If we look, at the table above, at the totals for the rows, we get how many Democrats and how many Republicans are in the sample. Similarly, the column totals tell us how many approvers and disapprovers are in our sample. These give us the distribution for each variable separately, in our sample. These distributions are called what? A38. Marginal distributions. (Because they’re in the right and bottom margins of the table.) Q39. The above table gives the results in counts. Especially when the marginal distributions are not equal (for example, if the sample should contain twice as many Republicans as Democrats) we should convert the count data to what kind of data? A39. To per cents (or fractions). Q40. True or false: When describing the relationship between two quantitative variables, the scatterplot and the correlation coefficient are usually the graph and numeric measure of choice; but in describing the relation between two categorical variables, no single graph or numeric measure summarizes the strength of the association. We usually pick and choose among bar charts and pie charts and the reporting of various per cents. A40. True. Q41. Someone looks at a sample of 500 men and 100 women. 250 men oppose the war, whereas 80 women oppose the war. The researcher says, “Lots more men than women oppose the war. Therefore the idea that women in this area are more pacifist is incorrect.” What’s the problem with this reasoning, and what should the researcher do? A41.
He should not just use the counts, but find the per cents. 50% of men, but 80% of women, in this sample opposed the war. So in this region it looks like the women are more anti-war than the men. Q42. Suppose you have three age groups, and you have data on how many individuals got educated to each of 4 different levels. Suppose you calculate, just for one of the age groups, the per cent of people in that age group who attained each level. This distribution of per cents for one age group is called what? A42. A conditional distribution. Q43. Do the per cents for a conditional distribution add to 100 for each of the different groups for which you calculate them? A43. Yes. Q44. Do the per cents for conditional distributions equal the per cents for marginal distributions? A44. No, not necessarily. Q45. There were two AP Statistics teachers. 40% of the 40 students in the first teacher’s classes got 5’s, and 25% of the 40 students in the second teacher’s classes got 5’s. People assumed that the first teacher is better. However, someone then studied the results based on whether or not the students scored above or below a certain cutoff on the SAT, before going into AP Statistics. The first teacher had 80% of students above this cutoff and 20% below. The second teacher had 20% above and 80% below. The first teacher had 50% of the “aboves” get 5’s, and none of the “belows.” The second teacher had 75% of the “aboves” get 5’s, and 12.5% of the “belows.” Now which teacher appears to be better, and why? A45. The second teacher, because a higher fraction of that teacher’s students got 5s from those both above the cutoff and below the cutoff. Q46. The situation above is whose paradox? A46. Simpson’s. Q47. 
True or false: In Simpson’s paradox, there is a lurking variable, which predisposes the results against one of the two groups; controlling for the effects of that lurking variable by looking separately at the subsets formed by the categories of it reveals results in the opposite direction from those obtained when ignoring the lurking variable. A47. True. Q48. If a lurking variable can actually reverse the direction of results, do you think it is also possible that a lurking variable could result in lack of an observed association when in fact there is a causal influence? A48. Yes. Q49. Does the fact that lurking variables can obscure influences that are actually present imply that: not only does correlation not imply causation, but lack of correlation does not rule out causation? A49. Yes. YMS Chapter 5: Producing Data Q1. The difference between an observational study and an experiment is that in the first, the explanatory variable is observed and measured, whereas in an experiment, the explanatory variable is ____. A1. Deliberately imposed, manipulated. Q2. When there is a jobs program for welfare recipients, and you simply observe that those who voluntarily take part in the program do better than those who don’t, what’s the problem with inferring that the program causes better results? A2. The effects of the program are confounded with the characteristics that lead people to seek the program, for example motivation and values. Q3. The entire group of individuals we want information about is called the _____. A3. Population Q4. The subset of the population we actually examine in order to gather information is called the ______. A4. Sample. Q5. Studying the whole population by attempting to contact every individual is called conducting a ______. A5. Census. Q6. Studying a population by taking a subset of it in order to generalize to the whole population is called _____. A6. Sampling. Q7. 
The method used for selecting the sample from the population is called the ____ of sampling. A7. Design. Q8. If a radio station invites anyone who wants to call and give an opinion on a question, the set of people thus obtained is called a _____ response sample. A8. Voluntary. Q9. If the researcher enrolls a group of people in the study on the basis of how easy it is to contact them and get them to enroll, that method of sampling is called ______ sampling. A9. Convenience. Q10. The systematic error introduced when the sample is very different from the population is called ____. A10. Bias. Q11. If a conservative radio commentator polls his listeners, and a liberal commentator polls her listeners, both polls are likely to be biased as methods of ascertaining the sentiment of the country, because _______. A11. It’s very likely that these samples differ highly from the country as a whole. Q12. A SRS, or simple random sample, is a subset of n individuals from a population, chosen in such a way that ____. A12. Every subset of n individuals has an equal chance of being chosen for the sample. Q13. True or false: if every individual in the population has an equal chance of being included in the sample, the sample is a simple random sample. A13. False. You need not only this condition, but also that every subset of the population of size n is equally likely to be chosen. Q14. Suppose I take the numbers 1, 2, 3, and 4, and write them on identical pieces of paper, put them into a hat and mix them thoroughly, and draw out two numbers. Is this a simple random sample of the 4 numbers? A14. Yes. Q15. Suppose I take the numbers 1, 2, 3, and 4. First I take the numbers 1 and 2 and put them into a hat, and choose one of them. Then I take the numbers 3 and 4 and put them into a hat and choose one of them. For each of the numbers 1, 2, 3, and 4, what is the probability that this number will end up in the sample? A15. .5 Q16. 
Is it possible that the subset {1,2} would be chosen for our sample using the sampling method just mentioned (that is, pick randomly from 1 and 2, then pick randomly from 3 and 4)? A16. No. Q17. So the sampling method just mentioned is one where each individual has equal probability of being chosen, but each subset is not equally likely to be chosen; thus the sample obtained is, or is not, a simple random sample? A17. Is not. Q18. In a table of random digits, each triple of digits is equally likely to be any of the ____ possibilities from _____ to ______. A18. 1000, 000, 999 Q19. The two rhyming words (with different ways of spelling the second syllable) that summarize the process of using a table of random digits to select a simple random sample are ___ and ____. A19. Label and table. Q20. There are 7 members in a class. Please describe how you would use a table of random digits to select a simple random sample of 3 of them. A20. Assign each of them a single digit label. Enter the random number table at any point, and look at the numbers in order. If the number isn’t one of the ones you assigned, ignore it and go to the next. If it is one that you assigned, put that individual in the sample. Keep going until you have put 3 individuals in the sample. Q21. A sample chosen by chance is called a ____ sample. A21. Probability Q22. Suppose there is a class, and someone wants to choose a random sample of it. But the researcher wants to make sure that both males and females are adequately sampled. So the researcher takes the names of the girls, and draws a simple random sample of them, and then does the same with the boys’ names. The total sample thus obtained is not a simple random sample, but a _____ random sample. A22. Stratified Q23. Suppose a researcher wants to collect a random sample of high school students in the U.S. 
The researcher first takes a simple random sample of counties in the country, then takes a simple random sample of high schools within each county, and then a simple random sample of students within each high school. This sampling method is called ____. A23. Multistage sampling. Q24. The above method of sampling high school students leaves out homeschoolers. The general term for such a problem in sampling is ____. A24. Undercoverage. Q25. When you get a survey in the mail and immediately toss it in the trash, the source of bias this introduces into the survey is called _____. A25. Nonresponse. Q26. If you were asked what is the “essential principle of statistical sampling,” would you say that it’s to have a simple random sample, a probability sample, a stratified sample, or a multistage sample? A26. A probability sample, because the most essential factor is that the sample be chosen by chance. Q27. During recent decades, society has become less and less tolerant of any sexual activity between therapists and their clients. Surveys of the incidence of such behavior are now almost impossible to obtain, because therapists would avoid trusting a researcher with a confession of behavior that would lead to severe penalties. This introduces bias into any survey that is called ______ bias. A27. Response Q28. One survey question asks, “Do you believe that children should be legally protected from exposure to violent models on TV that can lead them to commit acts of violence?” And a separate question asks, “Do you believe that government should limit the free expression of ideas by censoring television?” The major difference in results these questions would yield would be referred to as ______ effects. A28. Wording. Q29. Which would give more accurate results in a poll: a probability sample of 1000 people, or a voluntary response sample of 100,000 people? A29. The probability sample of 1000. Q30. 
A study in which we actually do something to people, animals, or objects in order to learn about the response is called an _____. A30. Experiment. Q31. The individuals on which an experiment is done are called the experimental whats? A31. Units. Q32. When the experimental units are human beings, according to our book they are called ____, (although the preferred term among psychological researchers these days is “participants.”) A32. Subjects. Q33. The thing that is done to the subjects (or participants) (for example giving them a drug or teaching them to read) is called a ____. A33. Treatment. Q34. Suppose that in an experiment, learning of math facts is your response variable. You are studying two explanatory variables, and varying them systematically in your study: amount of practice, and the frequency of recurrence of any one math fact in a practice session. These two explanatory variables are called the two _____ in the experiment. A34. Factors. Q35. In an experiment on math facts, one is studying the frequency of recurrence of any one math fact in practice sessions: does, for example, 7+8 occur every 3 problems, every 20 problems, or every 90 problems? If the experiment is set up like this, within the factor called “frequency of recurrence” there are three different degrees of that factor, three different specific values of that factor, which in the jargon is called three ____ of that factor. A35. Levels. Q36. A pill that is made of inactive material, which is used so that subjects can have information withheld about which treatment group they are in, is an example of a _____. A36. Placebo. Q37. If you want evidence for causation, and if you want to study the interactions of factors, and you are able to do either an observational study or an experiment, you should, all other things equal, choose the ______. A37. Experiment. Q38. When people get better from an inactive treatment, that is called the ____ effect. A38. Placebo. Q39. 
A group of individuals who receive an inactive treatment, so that the effects of a possibly active treatment can be contrasted with those of inactive treatment, is called a ____ group. A39. Control. Q40. A researcher tries to make two treatment groups equal on every variable other than the treatment of interest. The researcher does this by fashioning two groups that are very similar on several variables relevant to outcome. What is the problem with this method? A40. There are too many lurking variables – the experimenter may not measure all of them, and some of them may not become apparent until after the experiment. Some of them may not be measurable at all. Q41. What’s the “gold standard” method of assuring the equivalence of two treatment groups? A41. To randomly assign subjects to treatment groups. Q42. Suppose you first assemble pairs of subjects that are very similar on the preintervention measure of the response variable. Then, you randomly assign one member of each pair to the experimental group or the control group. This method combines random assignment with ______. A42. Matching. Q43. A researcher is studying the effect of two methods of teaching reading. Instead of matching the subjects on their reading level and then randomly assigning one of each pair to the two groups, the researcher ignores the initial reading level for purposes of assignment to groups, and instead picks a simple random sample of the whole set of subjects to be in each group. Is this an acceptable method of assignment to groups? A43. Yes. Understanding why this is true is central to the logic of experimental design. Q44. True or false. When subjects are assigned at random to two groups, get two treatments, and they differ on the response variable, it must be true that the treatment accounts for the difference between the groups. A44. False. It is also possible that the “play of chance in the random assignment” accounts for the difference in the groups. 
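The “play of chance” can be sketched with a small simulation. The code below is an illustration, not something from the textbook: it assumes hypothetical response scores drawn from a normal distribution (mean 100, standard deviation 15), a treatment with no effect at all, 10 subjects per group, and an observed difference of 10 points between the group means. It then estimates how often random assignment alone produces a difference at least that large. The function name chance_of_large_difference and every number in it are assumptions chosen for the example.

```python
import random
import statistics


def chance_of_large_difference(n_per_group=10, observed_diff=10.0,
                               n_trials=5000, seed=0):
    """Estimate how often random assignment alone (no treatment effect)
    gives a difference in group means at least as large as observed_diff."""
    rng = random.Random(seed)
    count = 0
    for _ in range(n_trials):
        # Hypothetical scores that do not depend on group membership.
        scores = [rng.gauss(100, 15) for _ in range(2 * n_per_group)]
        # Random assignment: shuffle, then split into two equal groups.
        rng.shuffle(scores)
        group_a = scores[:n_per_group]
        group_b = scores[n_per_group:]
        diff = statistics.fmean(group_a) - statistics.fmean(group_b)
        if abs(diff) >= observed_diff:
            count += 1
    return count / n_trials


fraction = chance_of_large_difference()
print(f"fraction of chance-only trials with a difference >= 10: {fraction:.3f}")
```

If the printed fraction is not small, a 10-point difference between two groups of this size is weak evidence that the treatment did anything.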
(However, the likelihood of this alternative explanation can be quantified, and when it is small enough, the other explanation is favored.) Q45. Suppose that both treatments studied in an experiment in fact have no causal influence upon the response variable. Under what conditions are we more likely to see big differences between the two groups, due to the vagaries of random assignment: with a very small sample size, or a very big sample size? A45. A very small sample size. Q46. A difference between groups that is so large (and with so many subjects) that it would “rarely” (i.e. to whatever criterion of rarity we specify) occur by chance is called a _____ ______ effect. A46. Statistically significant. Q47. The three central principles of experimental design are _____ (which is making comparisons between groups), _______ (a method of assigning individuals to groups), and _______ (which has to do with how many individuals you have in your groups). A47. Control, random assignment, and replication. (The word replication here refers to repeating the observation on more subjects within a given experiment. The word is also used, in a different sense, to refer to repeating the experiment.) Q48. Suppose that we want to study the effect of a new curriculum and an old one, on reading skills. We also want to study the effects of whether the curriculum is delivered in person or over the phone. We randomly assign subjects to the curriculum, but we can’t randomly assign them to in person or over the phone, because certain people live too far away to get the training in person. So subjects are allocated to the curriculum at random, but not to the delivery method. We would say that the experimental design here is not _____ ________. A48. Completely randomized. Q49. In a “double-blind” experiment, what two sets of people are “blind” to which group the subject is in? A49. The subjects themselves, and the research staff who have contact with them. Q50. 
What’s a problem in making inferences from experiments that is often less of a problem in observational studies? A50. The problem of “lack of realism,” in other words, the problem that the conditions in the study do not match those to which we wish to generalize. Q51. How do you do random assignment in a matched pairs design? A51. First choose pairs that are as similar as possible, then randomly choose one subject from each pair for the first treatment; the other member of the pair gets the second treatment. Q52. In a certain type of matched pair design where each subject serves as his or her own control, and each “pair” consists of only one individual, what is randomly assigned? A52. Whether the subject gets treatment 1 first or treatment 2 first. Q53. Suppose that we want to compare two methods of tutoring in reading that children receive after school. We know that the teacher the student has is also an important influence on the outcome variable, which is reading skill. Please describe how we would use a block design to control the effect of teacher when studying the method of teaching reading. A53. We randomly assign the students of the first teacher to the two groups, then do the same for the students of the second teacher, and so forth, rather than using a simple random sample of all students. Q54. True or false: If we want to make separate conclusions about males and females in a study, it’s a good idea to block on gender when making our assignment to groups. A54. True. Q55. Making a model that accurately reflects the experiment under consideration and imitating chance behavior based on that model is called doing a _____. A55. Simulation. Q56. What are the 5 steps of doing simulations? A56. State problem, state assumptions, assign digits to represent outcomes, simulate repetitions, state conclusions. Q57. Someone wants to simulate a situation where there’s a 3/10 chance that a child will be involved in bullying. The person assigns the digits 0 to 3 for involved in bullying, and the rest of the digits to not involved in bullying.
Do you have a problem with this? If so, what’s your problem? A57. Yes, there is a problem. There are 4 digits from 0 to 3 inclusive, and 6 other digits, so the person would be simulating a 40% probability situation rather than a 30% probability. Q58. Please use your calculator to generate 4 random integers in the range from 0 to 99. Please tell what you entered on your calculator to get these, and what 4 integers you got. A58. On the TI 83 or 84, you do math>prb>5:randInt( and then enter 0, 99, 4 and close the parentheses, so the command reads randInt(0, 99, 4). On the TI 89, you do catalog, F3, and then scroll down to randInt and hit enter. Then you insert 0, 99, 4 in the parentheses. You’ll get different sets of numbers each time, unless something very unlikely happens! YMS Chapter 6 Probability Q1. The branch of mathematics that deals with the pattern of chance outcomes is ____. A1. Probability Q2. The big idea of the study of probability is that chance behavior is unpredictable in the _____ but has a regular and predictable pattern in the _____. A2. short run, long run Q3. An illustration of the “big idea” mentioned in Q2 is that while it is unpredictable whether a single coin toss will come out heads, the ________ is almost always very close to .5. A3. Fraction of heads in a very large number of tosses Q4. What is the difference between a changing, or variable phenomenon that is “random” and one that is not? A4. A random phenomenon is uncertain with respect to individual outcomes, but nonetheless there is a regular distribution of outcomes in a large number of repetitions. Q5. The ____ of any outcome of a random phenomenon is the proportion of times the outcome would occur in a very long series of repetitions, i.e. long-term relative frequency. A5. Probability. Q6. When there are independent trials, that means that the outcome of one trial _______. A6. Does not influence the outcome of another. Q7. The set of all possible outcomes of a random phenomenon is called the ______. A7. Sample space. Q8. An event is defined as a subset of ____. A8.
The sample space. Q9. When we make a mathematical description of a random phenomenon by describing a sample space and a way of assigning probabilities to events, we are constructing a ____. A9. Probability model. Q10. Jane has 2 shirts and 3 pairs of pants. If we want to picture the 6 ways she can dress in these garments, we can draw a diagram with a bifurcation point at the left of the page, with two lines going out to two points called “red shirt” and “brown shirt.” From each of these, you then draw 3 lines, saying “blue pants,” “green pants,” and “black pants.” This sort of picture is called a _____. A10. Tree diagram. Q11. Jane has 2 shirts and 3 pairs of pants. The “Cartesian Product” of these two sets produces 6 possible combinations. This illustrates what our book calls the _____ principle, which says that if you can do one task in a ways, and another in b ways, you can do both together in _____ ways. A11. multiplication, ab Q12. Please give an example of sampling with and without replacement. A12. As one of many possible examples: in sampling without replacement, you draw first one, then another card from a deck without putting the first card back. In sampling with replacement, you draw one card from the deck, note its identity, replace it, shuffle the deck, draw again, and note the identity of the second draw. Q13. The probability of any event A has to satisfy the inequality x <= P(A) <= y. What are x and y? A13. 0 and 1. Q14. If you sum the probabilities for each member of the sample space, you always come to a grand total of ____. A14. 1. Q15. If the probability that A will occur is P(A), the probability that A will not occur is ____. A15. 1-P(A). Q16. What notation do we use to represent the probability that A will not occur, if P(A) is the probability that A will occur? A16. P(A^c), where A^c is the complement of A. Q17. What does it mean when we call two events disjoint or mutually exclusive? A17. That if one happens, the other can’t also happen. Q18.
If two events are mutually exclusive, and P(A) is the probability of A, and P(B) is the probability of B, what is the probability of (A and B) (a.k.a. A intersect B)? A18. 0 Q19. If two events are mutually exclusive, and P(A) is the probability of A, and P(B) is the probability of B, what is the probability of (A or B) (a.k.a. A union B)? A19. P(A) + P(B). Q20. Suppose a random event has k equally likely outcomes. What’s the probability of any one of these outcomes? A20. 1/k. Q21. When there are k equally likely outcomes, and a certain number of outcomes are counted as being in event A, then P(A) = (number of outcomes in A)/k. Please give an example of this. A21. As one possible example: when a deck of cards has 52 cards, each of which is equally likely to be drawn, and 13 of these are called hearts, then P(drawing a heart) = number of hearts/number of cards, or 13/52, or ¼. Q22. In a finite sample space, with outcomes that are not equally likely, the probability of any event is the sum of the probabilities of the outcomes making up the event. Someone’s statistics instructor asks, “Please give an example of this.” The person says, “Suppose the probability that a randomly selected person in a certain community will own a dog is .3, a cat is .2, and another pet is .1. Then the probability that the person will own a pet is (.3 +.2 +.1).” Do you have a problem with this example? If so, what’s your problem? A22. One big problem is: what about the people who own two or more different types of pets? If the only people who owned cats or other pets were the dog owners, the probability of pet ownership would be .3 instead of .6. The addition rule only applies to disjoint events, and these categories are not disjoint. Q23. When two events are independent, the probability that both will occur is what? A23. The product of their individual probabilities. Q24.
How would you use the multiplication rule for independent events, as well as the complement rule, to calculate the probability that in two coin tosses, you would get at least 1 tail? Please explain your reasoning as you compute an answer. A24. Your strategy is to realize that (at least one is tails) is the complement of the event (both are heads). So you use the multiplication rule to compute the probability of two heads (½ * ½ = ¼) and then compute the complement of that (1 - ¼) to get ¾. Q25. Suppose you have a lab test that gets a “false positive” 1 in 100 times. (That means that the P is 1/100 that the test will say you have the disease, when you don’t.) How would you use the multiplication rule as well as the complement rule to compute the probability that out of 100 tests where the correct answer is “negative,” you will get at least one false positive? Please explain your reasoning as you compute an answer. A25. The logic for this is exactly the same as for Q24. You realize that the event (at least one false positive in 100 tests) is the complement of (no false positives in 100 tests). The probability of not getting a false positive on one test is .99, and each test is assumed independent of the others, so .99^100 (or about .37) is the probability that all the tests will be free of false positive results. The complement of this, which is the probability that at least one is a false positive, is about .63. Q26. What is the general addition rule for unions of two events that may or may not be disjoint? A26. P(A or B) = P(A) + P(B) - P(A and B). Q27. The probability that someone plays the guitar, in a certain community, is .4, and the probability that the person plays the harmonica is .2. The probability that the person plays both guitar and harmonica is .1. What’s the probability that the person plays either guitar or harmonica? A27. .4+.2-.1=.5 Q28. In a certain community, .7 of people take physics and .6 of people take statistics.
The fraction of people taking one or the other is .85. What’s the probability that a randomly drawn person will take both courses? A28. P(A or B) = P(A) + P(B) - P(A and B); therefore .85=.7+.6 –P(A and B), from which P(A and B) =.45. Q29. What does the symbol P(B|A) mean? A29. The probability of B given A, or the probability that B will happen given that A has happened. Q30. What is the general multiplication rule for any two events? A30. P(A and B) = P(A)P(B|A) Q31. Why does this rule simplify to the multiplication rule for independent events, when the two events are independent? A31. Because the definition of independence is that P(B|A) =P(B), another way of saying that A doesn’t influence the occurrence of B. Thus when the two events are independent, we can substitute P(B) for P(B|A) in the general multiplication rule, and get P(A and B)= P(A)P(B). Q32. The general multiplication rule for any two events really follows from the definition of conditional probability, and is a rearrangement of the defining formula. What is that defining formula for conditional probability? A32. P(B|A) = P(A and B) /P(A) Q33. True or false: P(A and B and C) = P(A)P(B|A)P(C|A and B)? A33. True. Q34. True or false: the fact that P(A and B and C)=P(A)P(B|A)P(C|A and B) is the basis of tree diagrams. A34. True. Q35. If there’s a .05 chance that a male high school athlete will play in college, and a .017 chance that a college athlete will play professionally, what’s the chance that a male high school athlete will both go to college and play professionally? A35. .05 * .017, or .00085. Q36. If there’s a .95 chance that a male high school athlete will not play in college, and a .0001 chance that someone not playing in college will play professionally, what’s the chance that someone will both not play in college and play professionally? A36. .95 * .0001, or .000095. Q37. 
If the chance is .00085 that a male high school athlete will play in college and play professionally, and the chance is .000095 that an athlete will not play in college and play professionally, what’s the chance that a male high school athlete will play professionally? A37. .00085+.000095, or .000945, which is a little less than one in a thousand. Q38. For a tree diagram: the probability of reaching the end of any complete branch is the product of what? A38. The probabilities written on its segments (provided that those probabilities are written so as to mean the conditional probability of going down this branch given that you have reached the previous bifurcation!) Q39. When you look at Bayes’s rule (page 375), does the numerator of the fraction, P(B|A)P(A), equal the probability of A and B? A39. Yes. Q40. When you look at Bayes’s rule (page 375), does the denominator of the fraction reduce to the probability of B and A plus the probability of B and A^c, which reduces to the probability of B? A40. Yes. Q41. Is a simpler form of Bayes’s rule the following: P(A|B) = P(B|A)P(A)/P(B) ? A41. Yes. Q42. Does this simpler form of Bayes’s rule say that if you want to “reverse the direction of a conditional probability,” i.e. to find P(A|B) when you know P(B|A), you multiply the P(B|A) by the ratio of the P(A) to the P(B)? A42. Yes. Q43. P(A|B)=P(B|A)P(A)/P(B) implies that if you know the probability of a positive test given having a disease, you can calculate the probability of having the disease given a positive test, provided that you also know what two other quantities? A43. The probability (or rate) of the disease, and the probability (or rate) of positive tests. So for example, if the probability of a positive test given the disease was .99, the rate of positive tests was 1 in 100, and the rate of the disease was 1 in 1000, the probability of the disease given a positive test would be only .99*.001/.01, or .099.
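The arithmetic in A43 can be checked with a short Python sketch. The function and variable names are made up for this illustration; the three input numbers are the ones given in the answer.

```python
def posterior(p_b_given_a, p_a, p_b):
    """Simple form of Bayes's rule: P(A|B) = P(B|A) * P(A) / P(B)."""
    return p_b_given_a * p_a / p_b


p_pos_given_disease = 0.99   # P(positive test | disease)
p_disease = 0.001            # rate of the disease: 1 in 1000
p_pos = 0.01                 # rate of positive tests: 1 in 100

p_disease_given_pos = posterior(p_pos_given_disease, p_disease, p_pos)
print(round(p_disease_given_pos, 3))
```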
This illustrates that the probability of having the disease given a positive test does not necessarily equal the probability of having a positive test given the disease! Q44. What should someone do to organize the information for decision analysis if one is given a set of complex information about branching alternatives and outcomes given those alternatives? A44. Use a tree diagram. YMS Chapter 7 Random Variables Q1. A random variable is a variable whose value is a ________ of a random phenomenon. A1. Numerical outcome Q2. A random variable with a countable number of possible values is a _____ random variable. A2. Discrete. Q3. What is a probability distribution of a discrete random variable? A3. A list of the values the variable can take on, and the probability for each value. Q4. For the probability distribution of a discrete random variable, every probability is between ___ and ___, and the sum of all the probabilities is equal to ___. A4. 0 and 1, 1 Q5. In a probability histogram, what quantity do the horizontal and vertical axes represent, respectively? A5. The horizontal axis represents the possible values the random variable can take on, and the vertical axis represents the probability of that value. Q6. A continuous random variable can take on how many values for a certain interval in its domain? A6. An infinite number Q7. A continuous random variable’s probability distribution is described by a graph called the ___. A7. Density curve, or probability density curve.(This is the graph of the probability density function, or pdf.) Q8. Events, for continuous random variables, are described by the random variable’s taking on a value within a certain interval. The probability of that event is represented by what aspect of the density curve? A8. The area under the curve, between the two points that bound the interval, or the area under the curve, over the values (on the x-axis) that make up the event. Q9. Suppose you have a continuous random variable X. 
What is the probability that X=10? A9. Zero. Continuous probability distributions assign probability 0 to every individual outcome. Q10. In a continuous probability distribution, what is the relationship between the probability that X<10 and the probability that X<=10? A10. The two are equal, because the probability that X=10 is 0. Q11. True or false: the normal distribution is an example of a continuous probability distribution. A11. True. Q12. The mean of a discrete random variable is the sum of the products of all the possible values and the __________. A12. Probabilities of those values. Q13. Suppose there are two possible outcomes for a certain random variable, 0 and 100. The probability of getting 0 is .99 and the probability of getting 100 is .01. What is the mean of the random variable? A13. 1. Q14. The mean of a random variable is often called the e_____ v_____ of the variable. A14. Expected value Q15. Someone is offered a gambling game where there is a .25 chance of her losing $100, and a .75 chance of her winning $60. If she plays many times, what would her average winnings be? A15. $20. (Because this number is positive, it is unlike any gambling games anyone is likely to be offered.) Q16. Someone is invited to send in a contest entry in which the chances are 1 in 50 million of winning a million dollars, and one in a million of winning a thousand dollars. How do the expected earnings (or the mean earnings) from this contest compare with the price of a first-class stamp? A16. The expected earnings are (1/50 million) * $1 million + (1/1 million) * $1000 + (the rest of the probability) * $0. This comes out to one fiftieth of a dollar plus one thousandth of a dollar, or 2.1 cents. At the time of this writing, a first class stamp costs 39 cents. So the cost is 36.9 cents more than the expected earnings. Q17. The mean of symmetric continuous probability distributions lies at the ____ of the curves. A17. Center Q18.
The variance of a discrete random variable is the sum of the products of the squared deviation of each possible value from the mean of the distribution and the _____ for that value. A18. Probability Q19. Suppose there is a distribution with possible values 0, 1, and 2, each with probability 1/3. What is the variance, i.e. the sigma-squared, of this distribution? (This is also known as the variance of the population.) A19. 1*1/3 +0*1/3 +1*1/3, or 2/3. Q20. Think back to the definition of the variance of a sample. Suppose you had a sample consisting of 0, 1, and 2, with mean 1. Is the variance of this sample the same as the variance of the population? A20. No. The variance of the sample is the sum of the squared deviations over n-1. So the variance of the sample would be (1+0+1)/(3-1) or 1, rather than 2/3. Q21. Please take a few seconds to enter 0,1,and 2 in a list on your calculator. (On the TI83 or 84, stat >edit.) Then please compute 1 variable stats on these. (Stat>calc>1varstats, listname). Look at sx and sigma x results. What are they, and why do they make sense vis a vis the different definitions of population and sample variance and standard deviation? A21. They are 1 and .81650, respectively, and they do make sense because they are the square roots of 1 and 2/3, respectively. Q22. What is the law of large numbers, in your own words? A22. One way of putting it is that as the sample size approaches infinity, the sample mean approaches the population mean. Another is that you can make the sample mean get as close as you want to the population mean by getting a large enough sample. Q23. True or false: If by chance, you flip a coin and get 10 heads in a row, the law of large numbers tells us that if we flip many more times, we will get just a tiny bit under 50% heads in the remaining tosses, to compensate for the first 10 heads and make the long-range probability equal 50%. A23. False. 
The definition of independent trials implies that the coin “doesn’t remember” the first 10 flips and the subsequent results are not influenced by the initial ones. Q24. The law of large numbers says that you can get a sample mean as close as you want to a population mean by using a large enough sample. (Or: the probability that the difference between x-bar and mu is less than a given constant c approaches 1 as the sample size approaches infinity.) True or false: Given a certain desired accuracy, i.e. a certain maximum difference between the sample mean and the population mean, you will need a larger sample the greater the variance of the population. A24. True. For example, with very small variability in the population, the sample mean gives a close approximation to the population mean with a much smaller sample size than when there is great variability in the population. Q25. True or false: the mean of a linear function of a random variable is that same linear function of the mean of the random variable. In other words, the mean of a + bX is a + b * the mean of X. A25. True. Q26. The mean of the sum of two random variables equals what? A26. The sum of the means of the two variables. Q27. If the mean amount that Linda makes at her lemonade stand per day is $10 and the mean amount her brother Tom makes is $9, what’s the average of their total daily receipts? A27. $19. Q28. Suppose someone tells you that the standard deviation of single scores on the SAT is 100 points. Suppose that there are two people who take the SAT independently of one another. How would you find the sd of the sum of their scores? A28. You’d first square the standard deviations to get the variances of their individual scores – 10000 apiece. Then you add those variances, to get the total variance – 20000. Then you take the square root of that to get the sd of the sum, 100*the square root of 2, or 141.42. Q29.
If the standard deviations of the SAT math and critical reading are both about 100, is the standard deviation of the sum of these two scores for an individual more or less than 141.42? Why? A29. More, because the two scores are correlated with each other. The variance of the sum is the sum of the variances plus twice the correlation between the variables times the product of their standard deviations. Q30. Can you give an intuitive explanation for why the variance of the sum of two random variables is increased, the more highly they are correlated with each other? A30. The variance is increased the more extreme values you have. If the two variables are independent, then a high value of one variable tends to be balanced by a lower value of the other. For example, if you get a 6 on rolling one die, you’ll on the average get a sum of 9.5 once you add in the value of the other die – not a sum of 12. The same thing goes in the other direction: low values of one variable are, on the average, balanced by higher values of the other. But when the values are highly correlated, then high values of one variable predict high values of the other, and you get sums that are higher, and low values of one variable predict low values of the other, and you get sums that are lower. So with sums that are higher and lower, the variance of the sum is greater with correlated variables. Q31. A linear combination of two independent normally distributed random variables is distributed how? A31. Normally. YMS Chapter 8 The Binomial and Geometric Distributions Q1. Suppose someone looks at the numbers of 1’s, 2’s, 3’s, 4’s, 5’s, and 6’s that result from 600 die rolls. Is this situation an example of the “binomial setting”? A1. Almost, but not quite. For there to be a binomial setting, you have to have each observation fall into only two categories, rather than the 6 categories described here.
However, if you defined a 1 as a “success” and anything else as a “failure,” then you would have a binomial setting, and you could then do the same thing separately with 2, 3, 4, 5, and 6. Q2. What are the four requirements for the binomial setting? A2. 1. Two categories 2. Fixed n of observations 3. Independence 4. p of success same for all observations. Q3. The distribution of the number of successes out of n trials (with probability of success p on each trial) is the ______ _______. A3. binomial distribution Q4. If someone has 51 socks in a drawer, with 1/3 red and 2/3 black, and the person grabs a handful of 5 of them, and counts the number of black, will the results of such a trial follow the binomial distribution? Why or why not? A4. Not quite, because grabbing a handful of 5 is equivalent to sampling without replacement. The probability of a black’s being included in the handful is altered some depending on what other socks are also in the handful. If you picked a sock one at a time, replaced the sock and mixed them thoroughly, and then picked again, the binomial distribution would apply. Q5. Suppose you roll a die 1000 times and count “1” as success and “anything else” as failure. Is this an example of the binomial setting, and does the count have the binomial distribution? Why or why not? A5. Yes and yes, because the conditions of 2 categories, fixed n of observations, independence, and p of success constant all hold. Q6. In Chapter 1, the word distribution was defined as what values the variable takes and how often it takes these values. Let’s say you roll a die 1000 times and count the number of 1’s. The count of successes comes out to 165. Someone asks you, “What does this have to do with a distribution? We just got one number from this experiment. How’s anybody going to plot a histogram or any other representation of a distribution with this?” What would be your answer? A6. 
The number of 1’s that you would get in such an experiment is a random variable. If you did the experiment many times, you would get a distribution that could be plotted with a histogram, and which would take the approximate shape of the theoretically derived binomial distribution for this situation. Q7. When we say a certain random variable has a B(100, .7) distribution, what do we mean? A7. That there is a binomial distribution with 100 observations and probability of success on each observation .7. Q8. If there is a discrete random variable (such as a binomial), and you want to find the probability of any given value of X, what function do you use – the cumulative distribution function or the probability distribution function? (cdf or pdf?) A8. pdf Q9. Suppose you want to know the probability that a binomial random variable B(100, .7) takes on a value less than or equal to 60. One way would be to use the binomial pdf for values 0, 1, .. 60 and then add them all up. A much less laborious way would be to do what? A9. Use the binomial cdf function. On the TI-83 or TI-84 the command would be binomcdf(100, .7, 60). Q10. Suppose you roll a die six times, and you want the probability of getting exactly 3 1’s. What would be the appropriate expression of the binomial formula that would give the answer to this? A10. (6 choose 3) * (1/6)^3 * (5/6)^3. Q11. Can you please explain why the binomial probability formula is as it is, using this example of rolling a die six times? Please give an explanation for why each of the three factors is what it is. A11. You have 6 independent rolls. The probability of getting 1 on the first three and something other than 1 on the next three is (1/6)^3 * (5/6)^3 because of the multiplication rule for independent events. The number of different ways you can get three ones is the number of ways you can select 3 rolls to be ones out of the 6 different rolls: i.e. rolls 1, 2, and 3; rolls 1, 2, and 4; etc.
This is the number of combinations of six things taken 3 at a time, or 6 choose 3. Because each of these ways of getting 3 1’s is mutually exclusive of the others, you can use the addition rule to add the probability of each of the 6 choose 3 ways of getting the outcome of 3 1’s, and since these probabilities are all the same, a shorter way is to multiply the probability of any one of them times 6 choose 3. Q12. Can you please explain why, to obtain the binomial coefficient, you use the number of combinations rather than the number of permutations, in calculating n choose k? A12. In our die roll example, using the number of permutations would for example count the event of getting 1’s on rolls number 1, 3, and 5 as a different event from getting 1’s on rolls number 5, 3, and 1. Since these events are the same, order does not make a difference in enumerating subsets, and therefore you want combinations rather than permutations. Q13. In chapter 7 we learned that means and variances are additive when you want to know the mean and variance of sums of independent random variables. How are these facts crucial in figuring out the formulas for the mean and variance of a binomial random variable? A13. In the case of both the mean and the variance, we consider a certain random variable to be the outcome of any one trial, giving a success numerical value 1 and failure value 0. We calculate the mean and variance of this variable from the defining formulas. Then we define another random variable to be the sum of n of these random variables, which is the number of successes in n trials. We use the additive properties of the mean and the variance that we learned in chapter 7 to move us from an expression for the mean and variance of any one trial, to the mean and variance of the sum of n trials. Q14. What are the formulas for the mean, variance, and standard deviation of a binomial random variable in terms of n and p, and, if you want, q (or 1-p)? A14. The mean mu=np.
The variance sigma squared = npq. The standard deviation is the square root of npq or (np(1-p))^.5. Q15. When n is “large,” the binomial distribution with n trials and success probability p can be approximated by what? A15. The normal distribution with mean np and standard deviation (npq)^.5. Q16. As a rule of thumb, the normal distribution may be used as an approximation to the binomial when both np and nq (expected successes, expected failures) equal or exceed what number? A16. 10 Q17. Please describe how to have your calculator simulate a binomial experiment. What are the keys that you press? A17. For the TI-83 or 84, you press math>prb>randBin( and then enter 1, p, and n in parentheses. You press the sto> key to store the results, and 2nd L1 to get the results stored in L1; then you hit the colon, which is done by alpha and then the decimal point key; then you do the sum function, which is under list>math>. Then in parentheses you enter L1. So the command is randBin(1, p, n)->L1:sum(L1). What this does is to generate n numbers that are either 0 or 1, with a p probability of being 1, and adds them to get the number of successes in the n trials, and displays that number. Each time you press the enter key, this will be repeated. So you take your number 2 pencil and write down each of these numbers, and then you count how many of them had the result you were looking for, and you express that result as a fraction of the number of repetitions you did. Q18. For a binomial setting, the number of trials is fixed, and the random variable is the number of successes in those trials. For a geometric setting, the random variable is the number of ____ necessary to achieve the first _____. A18. trials, success Q19. True or false: in the geometric setting, as in the binomial setting, you have 1. two categories, 2. with the same probability for each observation, and 3. independent observations. A19. True. Q20.
In a geometric setting, with probability of success p, what is the probability that the first success will occur on the nth trial? A20. P(X=n) = (1-p)^(n-1) * p Q21. True or false: the probabilities of success on the first, second, third, etc. trial in a geometric setting, when arranged in order, form a geometric series where p is the first term and each successive term is (1-p) (or q) times the previous one. A21. True Q22. True or false: if you apply the formula a/(1-r) for the sum of the terms of an infinite geometric series, where a is the first term and r is the ratio of each term to the previous one, for the geometric setting p is the first term and (1-p) is the ratio, so the sum becomes p/(1-(1-p)) or 1. Thus even though there are infinitely many possibilities for the outcome of the experiment in the geometric setting, the probabilities of each outcome sum to 1. A22. True Q23. If your chances of rolling a 1 on a die roll are one in 6, what is the expected or average number of times that you would have to roll the die before getting a 1? A23. 6 times. Q24. If your chance of getting a success at anything in the geometric setting is p, what is the average or expected number of trials that you would have to conduct before getting a success? A24. 1/p trials. Q25. What is the variance of the geometric random variable? A25. q/p^2, or (1-p)/p^2 Q26. In the geometric setting, if q=1-p, what is the probability that it takes more than n trials to see the first success? A26. P(X>n) = q^n. Q27. On page 470 of YMS there is a derivation of the formula for the probability that it takes more than n trials to see the first success. Can you think of a really simple way to arrive at the same formula? A27. If, and only if, the first n trials are failures, it will take more than n trials to get the first success. The probability of the first n trials being failures, using the multiplication rule for independent events, is q^n. Q28.
For a geometric distribution, would you say that it is approximately true that 34% of the observations would fall between the mean and 1 standard deviation above the mean, and 34% would fall between the mean and 1 standard deviation below the mean? Why or why not? A28. No, because the geometric distribution is always strongly skewed to the right, and its shape doesn’t resemble the normal distribution (for which the above statement is true). Q29. Suppose that some experts estimate that the probability of a major nuclear war in any given year is 1%. You think that you will live another 65 years. You are wondering what the chance is that you will and will not see a nuclear war. Please fit some of the concepts of this chapter to this situation, and calculate the probability. A29. This is the geometric setting, where “success” is defined as nuclear war, and “failure” is defined as no nuclear war! What you are being asked is the probability that it will take more than 65 trials to see a “success.” You use the formula P(X>n)=q^n to get P(X>65)=.99^65. The probability of no nuclear war comes out to .52. So would that be comforting, or what? YMS Chapter 9 Sampling Distributions Q1. A parameter (which begins with p) is a number describing a _____; a statistic (which begins with s) is a number describing a ______. A1. population, sample Q2. What symbols are used in our book’s notation to represent a population mean, sample mean, sample proportion, and population proportion, respectively? A2. (mu), x-bar, p-hat, and p. (Note that the letter P will also be used later to indicate the “P-value” for a statistical test. Our text will use a capital P for that, or the phrase P-value. But many other writings will use small p for P-value.) Q3. Suppose you were to take a large number of samples (all the same size) from a population, compute the mean of each, and plot a histogram of the sample means that you obtain.
This histogram would approximate the shape of the ______ ________ of the mean. A3. Sampling distribution. Q4. The sampling distribution for a proportion or mean changes as the number in the sample increases: the mean of that sampling distribution (increases, stays the same, decreases) and the variance of the sampling distribution (increases, stays the same, decreases). A4. stays the same, decreases Q5. If the mean of a sampling distribution is the true value of the parameter being estimated, we refer to the statistic used to estimate the parameter as being _____. A5. unbiased Q6. True or false: if a statistic is unbiased, the value of the statistic computed from the sample equals the population parameter. A6. False. Samples vary. It’s only the mean of all possible samples that equals the population parameter for an unbiased statistic. Q7. True or false: the variability (and thus the accuracy) of a statistic is very sensitive to the size of the population from which the samples are drawn. A7. False. The sample size is much more important than the population size. Q8. An organization wants to sample with equal accuracy from each state of the USA. Would it make more sense to sample 2000 from each state, or 1% of each state? A8. 2000 from each state, because the absolute sample size determines the accuracy, and you don’t need a larger sample for a larger population. Q9. To review from Chapter 8, on the binomial distribution: what are the mean and standard deviation of a binomially distributed variable X, where p is the (population) probability of success, q is the probability of failure, and n is the size of the sample? A9. The mean is np, and the standard deviation is sqrt(npq). Q10. What are the mean and sd of the sample proportion, which is X/n where X is binomially distributed (but X/n is not binomially distributed)? A10. The mean is p, and the standard deviation is sqrt(pq/n). Q11.
If you want a standard deviation for a sample proportion that is half as big as some other one, you have to get a sample that is how many times bigger? A11. 4 times bigger. Q12. If the sample is a substantial fraction of the population, then the assumption of independence that leads to the binomial distribution is violated. How many times bigger should the population be than the sample, so that we don’t worry about this? A12. At least 10 times bigger. Q13. True or false: The standard deviation of the sampling distribution of a proportion is only approximately sqrt(pq/n); this approximation is most accurate when np>=10 and nq>=10. A13. False. The sd of the sampling distribution of a proportion is always exactly sqrt(pq/n). But that distribution is approximately NORMAL when np and nq are >=10. Q14. If you know the population proportion, how do you use the normal approximation to figure out the probability that the proportion obtained from a random sample of size n will be between two given values? A14. You use p and sqrt(pq/n) as the mean and sd, and with these compute a z score for the upper and lower bounds of the interval you’re interested in. Then you use the normal table, or a calculator, to get the proportion of the normal curve between those two z scores. This is the probability that the sample proportion will fall between those values. Q15. How do the sampling distributions of means compare with the distributions of individual observations? They are less _____ and more _____. A15. Variable, normal Q16. Suppose you have a population with mean mu and sd sigma. What are the mean and sd of the sampling distribution for means with sample size n? A16. The mean of the sampling distribution is mu and the sd is sigma/sqrt(n). Q17. Under what conditions will the sampling distribution of the mean have an exact normal distribution, no matter what the sample size is? A17. When the population is normally distributed. Q18. What does the central limit theorem tell us? A18. 
That as the sample size gets larger, the sampling distribution of the mean approaches the normal, regardless of the distribution of the population from which the observations are drawn. Q19. True or false: suppose that income in a large country is not normally distributed, but is very skewed. The central limit theorem tells us that if we were to collect several very large samples and compute the mean income for each sample, those means would be approximately normally distributed, even though the incomes in the population are not normally distributed. A19. True. Q20. Why do you think the central limit theorem is so “central” to statistics? A20. Because it enables us to use normal probability calculations to answer questions about sample means even when population distributions are not normal. Those questions include the big idea of confidence intervals: how likely is the right answer to be between these two bounds. Thus the central limit theorem helps us say, “There’s x probability that the true mean of the population is between a and b.” YMS Chapter 10 Introduction to Inference Section 1 Q1. Statistical inference consists in drawing conclusions about a ____ from data in a ____. A1. population, sample Q2. If the standard deviation of a population is sigma, what is the sd of the sampling distribution for the sample mean (this is often called the standard error of the mean) with sample size n? A2. sigma/sqrt(n) Q3. Suppose we know that the sd of the sample mean (a.k.a. standard error of the mean) is 4.5. This implies that if we were to draw many samples from the population, about 95% of these sample means would fall within what interval? A3. The population mean plus or minus 9. Q4. True or false: we should imagine the sample mean as being at the center of a bell-shaped curve, with 2 standard deviations of the sample mean (a.k.a.
standard errors) on either side of this point encompassing 95% of the other sample means. (Assume the sample is an SRS and the sample means are normally distributed.) A4. False. We should imagine the population mean as being at the center of that bell curve. We visualize the sample mean as falling within 2 standard errors of the population mean 95% of the time. Q5. True or false: The reasoning we use in making confidence intervals around a sample mean is as follows: if the sample mean is normally distributed, then 95% of the time, x-bar will be within 2 standard deviations of the sample mean (standard errors) of the population mean, mu. Whenever x-bar is within 2 standard errors of mu, mu is within 2 standard errors of x-bar. So if we make an interval + or – 2 standard errors around x-bar, that interval will encompass mu for 95% of the sample means we obtain. A5. True. Q6. A confidence interval has two parts: 1) the interval itself (usually expressed as an estimate plus or minus a margin of error) and 2) ______________. A6. The confidence level, e.g. .95. Q7. Someone says, “I read that the 95% confidence interval for a certain group’s score on a certain test was 115 to 128. That means that 95% of all the members of the group score in that range.” Is this an accurate interpretation? If not, please give a better one. A7. Not accurate. The confidence interval stated means that we are 95% confident that the population mean lies within the stated interval. And 95% confident means that 95% of the intervals obtained the way we got this one would encompass the population mean. Q8. In order to construct a confidence interval for a mean, what two conditions need to be met? A8. That the data come from an SRS of the population, and that the sampling distribution of x-bar is approximately normal. Q9. A first person says, “I want a 90% confidence interval.
So I’ll look in the normal table for the z-score with 95% of the area to the left of it.” A second person says, “You mean 90% of the area, don’t you?” What is the correct way to look in the table? A9. The first person got it right. The region around the population mean that subsumes 90% of the sample means is that with 5% above that region and 5% below that region. So you want z for .95 or the negative of the z for .05. Q10. What are the “tail areas” you look for, for confidence intervals of .90, .95, and .99, respectively? A10. .05, .025, and .005. Q11. If C is the confidence level, what is the expression for the area to the right of the interval subsuming fraction C of the distribution for sample means? A11. (1-C)/2. Q12. What does the symbol z* stand for? A12. The z-score with (1-C)/2 of the area lying to the right of it. Or: the number of standard deviations above and below the mean that bound the C level confidence interval. Q13. True or false: The values mu - z* sigma/sqrt(n) and mu + z* sigma/sqrt(n) represent the lower and upper bounds for the confidence interval for the mean. A13. False. The confidence interval is centered around x-bar, not around mu, because we don’t know mu. (If we did, we wouldn’t need to make a confidence interval.) The values listed above are the bounds within which there is a probability C that any observed sample mean will fall. Whenever the sample mean falls between these values, then we can say that the population mean falls between x-bar - z* sigma/sqrt(n) and x-bar + z* sigma/sqrt(n). Q14. If my wife’s age falls in the interval of my age plus or minus 5 years, then my age must fall within the interval of my wife’s age plus or minus 5 years. Is this true, and is this sort of reasoning central to the reasoning about confidence intervals? A14. Yes and yes. Q15.
True or false: The way in which the statement in the previous question has its analogy in the reasoning about confidence intervals is: any time the sample mean falls within the interval of mu plus or minus the margin of error, then the population mean must fall within the interval of x-bar plus or minus the same margin of error. A15. True. Q15. True or false: the values x-bar –z* sigma/sqrt(n) and x-bar + z* sigma/sqrt(n) form the upper and lower bounds for the confidence interval for the mean (assuming the assumptions are met). A15. True. Q16. Example 10.5 on page 546 is worthy of careful study. What are the 4 steps that were exemplified in using confidence intervals? A16. 1. Identify the population of interest and the parameter to be estimated. 2. Choose the appropriate procedure, and verify that the conditions for using it are met. 3. Carry out the procedure. CI = estimate + or – margin of error. 4. Interpret the results in the context of the problem. Q17. Please tell whether the margin of error (which is half the width of the confidence interval), or the width of the confidence interval itself, gets bigger or smaller under each of the following circumstances: a. the population standard deviation gets smaller, b. the level of confidence C gets bigger (e.g. a move from a 90% confidence interval to a 99% confidence interval) c. the sample size gets bigger, and d. the population size gets bigger? A17. a. smaller, b. bigger, c. smaller, d. no effect Q18. Is it preferable in research for a 95% confidence interval to have its upper and lower bounds closer together, or farther apart? A18. Closer together, because this represents a more accurate estimate of whatever you’re trying to estimate. Q19. Suppose you are a researcher planning a study, and you are deciding how many subjects to enroll. You want a certain margin of error m. You know what level of confidence you want, and you know (or estimate) the sigma for the population. How do you figure out the sample size? 
A19. Set m>= z* sigma/sqrt(n) and solve that inequality for n. As usual, you use as z* the z score that has (1-C)/2 area to the right of that score. Q20. Some of the problems in the use of confidence intervals can be surmounted by getting a large enough sample size – with this, the distribution of sample means can be considered normal even if the population isn’t normal. Also, with a large enough sample size, the sample standard deviation is close to the population standard deviation. What’s the main problem that can’t be overcome with a large sample size? A20. That the data should be a SRS of a population, and that the margin of error in a confidence interval covers only random sampling errors. If you are dealing with a convenience sample (as in the vast majority of research studies) then you can go ahead and compute a confidence interval, but the results are generalizable only to the population of which this particular sample is representative, and sometimes it’s impossible to say what that population is. Chapter 10, Section 2 Q21. True or false: The basic reasoning for significance testing is: an outcome that would happen rarely if a claim were true is good evidence that the claim is not true. A21. True. (Note: there are problems with this basic presumption.) Q22. In doing statistical tests, the first step is to identify what you want to make conclusions about. Are you always wanting to make conclusions about sample statistics, or population parameters? Or is it sometimes one or sometimes another? A22. Always the population parameter. Q23. What does a “null hypothesis” typically state? A23. That there is no difference between two parameters, or no effect, or no change (or that a parameter is equal to a certain value). Q24. A significance test works by assessing how likely the ______ _____ would be if the ____ _____ were true. A24. observed outcome, null hypothesis Q25. 
True or false: the p-value is the probability of getting exactly the results we observed, presuming the null hypothesis to be true. A25. False. The probability of getting exactly the results we obtained is almost always very small (or even theoretically 0 when dealing with continuous distributions). The p-value is the probability of getting results as extreme as, or more extreme than, the actually observed results; “extreme” means “far from what we would expect if the null hypothesis were true.” Q26. We are more likely to reject the null hypothesis of “no difference” or “no effect,” and infer that there is a difference or an effect, when the P-value is large, or small? A26. Small Q27. The null hypothesis has to do with a population parameter; in analyzing your sample data you calculate a ______ that estimates that population parameter. A27. Statistic (the phrase “sample statistic” is correct but redundant.) Q28. When a drug company researcher is hoping to find evidence that a drug is better than placebo, is the researcher wishing to reject, or fail to reject, the null hypothesis? A28. Reject. The null hypothesis would be that drug and placebo are equal in effect. Q29. Suppose someone is testing a drug versus placebo. If the researcher is interested only in the alternative that the drug is better than placebo, then the alternative hypothesis is _____-sided, but if the researcher counts both harmful effects (drug worse than placebo) and beneficial effects (drug better than placebo) as rejections of the null hypothesis, then the alternative hypothesis is ____-sided. A29. one, two Q30. What is the meaning of the significance level, or alpha? A30. It’s a threshold level for the p-value that we consider decisive, with which the obtained p-value is compared. Q31. Do we reject the null hypothesis when the p-value is less than alpha, or greater than alpha? A31. Less than alpha. Q32.
Are we more likely to reject the null hypothesis with a larger alpha, or a smaller alpha, all other things equal? A32. With a larger alpha. The larger alpha is, the more likely the p-value we obtain is less than that cutoff. Thus smaller p-values, and larger alphas, dispose to rejection of the null. Q33. If a test is statistically significant at the .05 level, what does that mean? A33. That the p-value obtained is less than or equal to .05. Q34. Someone finishes writing up a statistical test by saying, “In conclusion, p=.021.” What step of the “inference toolbox” are they leaving out, that should come after what they said? A34. Interpreting the results in the context of the problem. So they should say something like, “Therefore we reject the hypothesis that drug and placebo are equal; our study gives evidence that our drug is more effective than placebo.” Q35. When we are testing the hypothesis that a population mean is equal to a certain hypothesized value, in the unlikely situation where we know the population standard deviation, what is our test statistic? A35. The standardized sample mean, which is the sample mean minus the hypothesized population mean, divided by the standard deviation of the sample means. This is also known as the one-sample z statistic. Q36. What distribution does the one-sample z statistic, a.k.a. the standardized sample mean, have when the null hypothesis is true? A36. The standard normal distribution. Q37. True or false: for a one-sided test (or a one-sided alternative hypothesis), results extreme in one direction are counted as evidence against the null hypothesis; for a two-sided test (or a two-sided alternative hypothesis), results extreme in either direction are counted as evidence against the null hypothesis. A37. True. Q38. Please explain why the two-sided p-value is double that of the one-sided p-value. A38. The p-value is the probability of getting results as deviant as, or more deviant than, the results obtained.
For a two sided test, we add the probability of getting results deviant in both directions to get the total p value; for symmetrical distributions, those two probabilities are equal, thus amounting to twice the value for any single direction. Q39. How do you compute the one sample z statistic? A39. Sample mean minus hypothesized population mean, over the sd of the sampling distribution of the mean (that sd is sigma/sqrt(n) ). Q40. True or false: What is meant by doing “tests with fixed significance level” for a one-sample z test is that you become aware of what the cutoff (or critical) values are for z for the alpha you’ve picked. If the z your data yield is more extreme than the z for the alpha you’ve picked, the test is significant at the specified level of alpha. This method is most useful for those who don’t have access to calculators or computers that will give a p-value directly. A40. True. Q41. True or false: If you obtained a 95% confidence interval for a mean that ranged from 10 to 30, then a null hypothesis that the mean was equal to any value outside that range would be rejected and a null hypothesis of a mean within that range would not be rejected, at the .05 level, using a two-sided test. A41. True. Chapter 10, Sections 3 and 4 Q42. Suppose you thought your research would overturn a conclusion that many people had held for a long time. Would you tend to choose a smaller alpha (thus necessitating a larger sample size to reject the null hypothesis) or a larger alpha? A42. A smaller alpha. You know that people will be skeptical of your conclusion, so you'd be better off getting a larger sample size that enables you to say, "This result is significant at the .001 level" rather than just the .05 level. Q43. Suppose you had limited subjects with whom to work, and you were looking for evidence of toxicity from a chemical. The consequences of declaring that the chemical is safe when it isn't are very bad. 
The consequences of declaring the chemical dangerous when it isn't are primarily that more studies would be done than you have the resources to do. Given these consequences, would you tend to set alpha higher, or lower? A43. Higher. You want to give yourself as good a chance as possible to reject the null, under these conditions. Q44. True or false: If you report the p-value itself, rather than saying that p<.05, you in a sense let the readers of your journal set their own alpha, i.e. make their own decision as to whether they want to reject the null hypothesis given the p-value you report. A44. True. Q45. True or false: P-values slightly over .05 should not be considered statistically significant. A45. False. Where we draw the cutoff depends on the stakes of the various outcomes of the decision we are trying to make. Q46. Suppose we test a drug with a very large number of subjects. We find that on a 60-point rating scale, the drug group has a mean depression score rating that is 2 points lower than the placebo group. The p-value is .03. Someone is likely to say that the difference is ____ significant but not ____ significant. A46. Statistically, practically Q47. A researcher designs a study, gathers data, punches the data into the computer, runs a significance test, and interprets the result based on the significance test. What important step is being left out? Please give one reason why this step is important. A47. Inspecting graphs of the data. This is important because 1) outliers can have big effects on significance tests; 2) you want to see how closely the data follow a normal distribution, especially with small samples; and 3) you want to see if an effect is visible in the plots; if not, perhaps the effect is not practically significant. Q48. Suppose you are trying to find the effect of lighting levels on worker productivity.
But your design has confounded lighting level with the effect of workers being observed and monitored; the effect of monitoring gives a false impression that lighting makes a difference. This situation would be referred to as the ____ effect. A48. Hawthorne Q49. If there are bad design problems, is it likely, or unlikely, that sophisticated inferential statistical analysis techniques can get around these problems to produce valid inferences? A49. Unlikely. Q50. Should an ethical statistician attempt to produce inferential statistics on a data set that did not involve random sampling or random assignment? A50. Definitely yes. Much data which was not produced by random methods can nonetheless be treated as if it were, to produce very useful results. Q51. Someone is looking for drugs or dietary factors that could prevent Parkinson's Disease. The researcher gives a questionnaire to a random sample of people with the disease and of people without it. The researcher looks at the mean intake of each of 100 substances for people with and without Parkinson's, and tests each one to see if there is a significant difference in intake between the two groups. The researcher comes out with about 5 substances that are significantly different between the two groups, at the .05 level. The researcher recommends that people adjust their intakes accordingly. Do you have a problem with this? If so, what? A51. Yes -- this is the multiple comparison problem. Out of 100 hypothesis tests you would expect about 5 to come out significant by chance alone. It could be that the effects are real, but it is very possible that they have happened by chance alone. Q52. Our textbook agrees with your instructor, that (confidence intervals, tests of significance) are underutilized, whereas _______(same options) are perhaps overutilized. A52. confidence intervals, significance tests Q53. Our book points out that "many tests run at once will probably produce some significant results by chance alone." 
What does your logical reasoning tell you about many tests run sequentially, each testing one association, in a separate study? (The answer to this isn't specifically discussed in our book, but tell what you think.) A53. Logically, the same problem holds. And if people choose to publish the significant findings and relegate the non-significant ones to the "file drawer," there is a distortion of the truth just as there can be from the multiple comparison problem. Q54. What is acceptance sampling? A54. Sampling a subset of a product and using statistical methods to decide whether to accept or throw out the whole batch on the basis of the sample. Q55. When we are thinking of statistical inference as making a decision, as in acceptance sampling, what is different about the status of the null hypothesis and the alternative hypothesis from null hypothesis significance testing? A55. There's no special status allocated to the null hypothesis; there is a null hypothesis and an alternative hypothesis that we are trying to decide between. Q56. Suppose the null hypothesis is that a drug has no effect. If this is true and our analysis yields the decision that the drug is effective, that is a _____ error; if the drug has an effect but our analysis concludes that there is no effect, that is a ______ error. A56. Type 1, Type 2 Q57. If we set alpha at .01, what is the probability of a type 1 error given that the null hypothesis is true? A57. The probability is .01. Q58. Somebody does a statistical test and rejects the null with p less than the preset alpha of .05. Someone else says, "There is a 5% chance that your result is a type 1 error." Is this correct? A58. Despite the sentence in our text on page 596 that reads, "The significance level of any fixed level test is the probability of a Type 1 error," the answer to this is no. The 5% probability is a conditional probability: the probability of rejecting the null GIVEN that the null is true.
The person who made the statement above is mistakenly thinking that "Given that the null is rejected, there is a 5% chance that the null is rejected falsely." To understand better why such reasoning is wrong: suppose we are testing whether running really fast raises the heart rate of previously resting college students. We set an alpha of .05 when comparing the resting heart rates and the immediate post-running heart rates. When we find a huge effect of our intervention, what is the chance that it's a type 1 error? Approximately zero, because it's not an error at all. On the other hand, if we are testing the effects of a drug with absolutely no effect, versus placebo, then if we rejected the null, the chance that we had made a type 1 error would be 100%; before doing the test, our probability of making a type 1 error would be 5%. Q59. Suppose someone says, "I'm testing whether a certain mean equals 0 -- that's my null hypothesis. I'm setting alpha at .05. I have a sample size of 50. The population standard deviation, we're assuming, is known to be 1. Can you please calculate the probability of a type 2 error for me? (Or the 'power' of my test, which is 1 minus the probability of a type 2 error, that is, the probability of rejecting a false null?)" Can you calculate this, or is there something missing, and if something's missing, what is it? A59. The probability of rejecting a false null depends upon "how false" the null is. If the true mean of the population the researcher is drawing from is really 100, the probability of rejecting the false null is very close to 1. If the population mean is really 0.000001, the probability of rejecting the false null is very close to alpha, .05, which is hardly better than the rejection rate when the null is true. So what's missing is a hypothesized mean that represents "how false" the null is. Q60. Suppose you are given a hypothesized true mean, a null hypothesis mean, an alpha, a sample size, and a population sd, and the right to assume a normal distribution of the mean you obtain.
How do you calculate the probability of a type 2 error? A60. You figure out the region where the null hypothesis is not rejected, which is, for a .05 alpha and a two-sided test, within 1.96 standard deviations of the sample mean (that is, 1.96 times sigma/sqrt(n)) on either side of the null hypothesized mean. Then you use the normal cumulative distribution function for the distribution N(mu, sigma/sqrt(n)), where mu is the hypothesized true mean and sigma is the population sd, to find the probability that the sample mean will fall within the interval you've calculated. Q61. The probability that a fixed level alpha significance test will reject the null when a particular alternative value of the parameter is true is called the _____ of the test against that alternative. A61. Power Q62. The power of a test is what function of the probability of a type 2 error? A62. 1 minus prob of type 2 error. Q63. True or false: the p-value tells what would happen if we tested many samples, when the null hypothesis is true; the power tells what would happen if we tested many samples, when a particular alternative hypothesis is true. A63. True. Q64. When the alternative hypothesis parameter is very close to that of the null hypothesis, the power tends to be (lower than, higher than, the same as) the power when the alternative hypothesis parameter is very far from that of the null hypothesis. A64. Lower than. Q65. What happens to power as you increase sample size, all other things equal? A65. It increases. Q67. What happens to power as you increase the diversity of the population you are looking at, i.e. increase the population standard deviation of whatever variable you are measuring? A67. Power goes down. Q68. What happens to power as you increase the precision and accuracy of your measuring devices? A68. Power goes up, because the more accurately the phenomenon is measured, the less the standard deviation of the measurements becomes. Q69.
If too many tests of a hypothesis have insufficient power, then what happens: (a false null tends to become believed to be true, or a true null tends to become believed to be false)? A69. The false null tends to become accepted as true. This is because study after study fails to reject the null. YMS Chapter 11 Inference for Distributions (T-Tests of Means) Q1. What two conditions does our text list for inference about means when the population standard deviation is not known? A1. That the data are a SRS from the population of interest, and that the observations from the population have a normal distribution. Q2. The sample standard deviation, divided by the sqrt of the n of observations in the sample, is called the s_____ e_____ of the sample mean. A2. Standard error Q3. When the standard deviation of any statistic is estimated from the data, the result is called the ____ ____ of that statistic. (Thus you can have these that apply not just to the sample mean.) A3. standard error Q4. Does it make sense to speak of the standard deviation of the population mean? If not, why not? A4. A population parameter is a single number, not a random variable. As such, it doesn't have a variance or standard deviation. Sigma/sqrt(n) gives the standard deviation of the sample mean, and s/sqrt(n) gives a less accurate estimate of the standard deviation of the sample mean. Q5. The z-statistic is the (x-bar-mu)/(sigma/sqrt(n)). What is the t-statistic? A5. t=(x-bar-mu)/(s/sqrt(n)) where s is the sample standard deviation. Q6. There is just one standard normal distribution. Is there just one t-distribution? A6. No; there is a t-distribution for each number of degrees of freedom of the statistic, where the degrees of freedom in dealing with means is n-1. Q7. What is the general shape of the t-distribution? A7. Bell-shaped, similar to the normal. Q8. As the degrees of freedom increase, the shape of the t-distribution more and more closely approximates what? A8. 
The standard normal distribution. Q9. Can you please explain the reason for the way the shape of the t-distribution differs from that of the normal when the degrees of freedom are low? A9. The distribution is more spread out, less peaked, with less probability in the center and more in the tails -- in other words, it has more variation. This is because estimating sigma by s rather than knowing sigma for sure adds more variation to the statistic. Q10. What's the expression for the level C confidence interval for the population mean (mu), using the t distribution to estimate when the population standard deviation is unknown? A10. The confidence interval is x-bar +- t* (s/sqrt(n)) where t* is the upper (1-C)/2 critical value for the t distribution with n-1 degrees of freedom. (And s is the sample standard deviation, and n is the sample size.) Q11. How do you construct the t statistic and use it to test a hypothesis about a hypothesized population mean mu-zero? A11. The t statistic is (x-bar - mu-zero)/(s/sqrt(n)). Once you compute that statistic, you look in a table for the area in the t(n-1) distribution that is >= that value, <= that value, or twice the area >= the absolute value of that value, depending on whether your alternative hypothesis is that the mean is greater than mu-zero, less than mu-zero, or not equal to mu-zero, respectively. Q12. Suppose you make from the subjects who have signed up for your study a set of pairs who are as similar as you can make them. Then you randomly assign one from each pair to get a treatment, and one to get placebo or no-treatment. What sort of design is this? A12. A matched pairs design. Q13. If in a matched pairs design, you simply measure the post-treatment outcome for each subject and want to know if the treated subjects do better than the untreated subjects, how does our text suggest conducting that test? A13.
By creating a variable which is the difference between treated and untreated for each pair, and then using the one-sample t procedure to either get a confidence interval for the mean difference or to do a hypothesis test on the mean difference. Q14. When you are using a matched pairs design as suggested, what population parameter are you estimating? A14. The mean difference between pairs for all matched pairs in the population. Q15. What is a "robust" procedure? A15. One where the accuracy of a confidence interval or significance test is not very greatly affected by violation of the assumptions. Q16. The t procedures are very robust against (nonnormality of the population, outliers) but not very robust against (nonnormality of the population, outliers). A16. nonnormality of the population, outliers Q17. The main reason why the t procedures are robust against the nonnormality of the population is what theorem? A17. The central limit theorem, i.e. that sample means become more nearly normally distributed as the sample size gets larger, even when the population does not have a normal distribution. Q18. The rules of thumb for sample size and t procedures are that with sample size less than ____, you can use t if the data are close to normal and there are no outliers; with sample size ___ to ____, you can use t unless there are outliers or strong skewness, and with sample size over ____, you can use t procedures even for clearly skewed distributions. A18. 15, 15 to 40, 40. Q19. When calculating the power of a t-test, (as when calculating the power of a z-test), what two steps do you go through? A19. First, write the rule for rejecting H0 in terms of x-bar. (This is done by writing the formula for t, plugging in the critical t value, sample size, and null hypothesis mean, and solving for x-bar.) Second, you compute the probability of rejecting H0 given that the alternative is true.
(This is done by standardizing x-bar relative to the alternative, assuming that sigma is equal to s, and using the normal distribution to compute the probability that this mean will fall in the rejection region computed by the first step.) Q20. In power calculations, and in explaining them, there's a tricky distinction between the "alternative hypothesis" for the test, and the "alternative" that is assumed for the sake of power calculations. Please clear up any confusion on this by giving an example. (The example on page 640 will do fine.) A20. Suppose you are hoping that the listening score of 20 subjects will improve by 2 points, and that your test can detect such an improvement. Your null hypothesis would be a mean improvement score of zero, tested against an Ha, "alternative hypothesis," that the mean is greater than zero (or, if a two-sided test is desired, that the mean is not equal to zero). But the "alternative" that you use in your power calculations is that the mean change is really 2 points. So the "alternative hypothesis" for the test is that the mean is greater than 0, and the "alternative" for the power calculation is that the mean is equal to 2. Q21. One researcher randomly samples 2 groups from a population, and gives training to one and not the other. The researcher uses a t procedure to compare the test scores of the two groups. Another researcher samples a group from the population, and gives a test to the group 2 times, once before training and once after. The researcher uses a t procedure to compare the results after training with those before training. How are these two situations different, and what different statistical procedures should they result in? A21. In the first case, the samples are independent of one another, and in the second, they are not. So in the first case, you use a 2 sample t to study the difference in the means.
In the second case, you create a new variable, the postscore-prescore, and use a 1-sample t to study the mean of the differences. Q22. Has it always been an assumption or condition, for every inference procedure we have studied, (as contrasted to the exploratory analysis procedures we studied) that the sample can be considered a SRS of the population of interest? A22. Yes. (This is despite the fact that the overwhelming majority of inferential statistics reported in research are NOT based on SRS's of populations of interest!) Q23. Suppose someone were to draw many pairs of samples from two populations, and compute the difference between the sample means for each pair. What would the mean of this difference approach as the number of samples drawn approached infinity? A23. The difference in population means. Q24. The fact that the mean of the difference in sample means approaches the difference in population means as the number of samples gets larger is a long way of saying that the difference in sample means is an ____ estimator of the difference in population means. A24. unbiased Q25. True or false: just as the difference in sample means estimates the difference in population means, the difference in sample standard deviations estimates the population standard deviation of the difference between two means. A25. This is triply false! First, what you would combine would be variances, not standard deviations. Second, to find the variance of the difference between two random variables you add the variances; you don't subtract them. Third, the sample variances would have to be divided by n to estimate the variance of the sample mean. Q26. True or false: the variance of the difference between two sample means is estimated by s1^2/n1 + s2^2/n2, where s1 and s2 are the sample standard deviations (and thus s1^2 and s2^2 are the sample variances) and where n1 and n2 are the sample sizes. A26. True. Q27.
When the standard deviations of the two populations you are sampling from are different, why does the difference of the means of two independent samples not exactly follow the t-distribution? A27. Because there are two population standard deviations replaced by the sample standard deviations in the formula, not just one. Q28. When the sample sizes for the two samples are different, how many degrees of freedom do you use for t procedures? Please mention two options. A28. One option is to use n-1, where n is the size of the smaller of the two samples. The other (and the one almost always used in research) is to let the computer program calculate a noninteger degrees of freedom according to a more complicated formula to be presented later, which does not need to be memorized! Q29. Can you explain in words how you form a confidence interval for the difference between two means? A29. You take the difference between the sample means, and add and subtract the margin of error, which is the t* multiplied by the estimated standard deviation of the difference. The t* is the upper (1-C)/2 critical value. The estimated standard deviation of the difference is the square root of the sum of the estimated variances of each group mean. The estimated variance of a group mean is the sample variance over the n for that group. Q30. Can you explain in words how you do a hypothesis test of the null hypothesis that two group means are equal? A30. You compute the t statistic by putting the difference in sample means in the numerator, and the estimated standard deviation of the difference in the denominator. You get the P-value that corresponds to that t. Q31. The guidelines for sample sizes with one-sample t procedures are also applicable to two-sample t tests, if you substitute for the phrase "sample size" the phrase ____. A31. Sum of the sample sizes.
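The two-sample procedure described in A28 through A30 can be sketched in Python. This is only an illustration under stated assumptions: the data are made up, the function names are hypothetical, and the "more complicated formula" for degrees of freedom is assumed here to be the usual Welch-Satterthwaite approximation, which the catechism does not spell out. The critical value t* is passed in by hand, as if read from a t table.

```python
import math
from statistics import mean, variance


def two_sample_t(sample1, sample2):
    """Return (difference in means, standard error, t statistic,
    conservative df, Welch-Satterthwaite df) for two independent samples."""
    n1, n2 = len(sample1), len(sample2)
    # Estimated variance of each group mean: sample variance over n.
    v1 = variance(sample1) / n1
    v2 = variance(sample2) / n2
    diff = mean(sample1) - mean(sample2)
    # Variances add for a difference of independent means.
    se = math.sqrt(v1 + v2)
    t = diff / se  # tests H0: the two population means are equal
    # Option 1 from A28: smaller sample size minus 1.
    df_conservative = min(n1, n2) - 1
    # Option 2 (assumed formula): Welch-Satterthwaite approximation.
    df_welch = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return diff, se, t, df_conservative, df_welch


def confidence_interval(diff, se, t_star):
    """Difference in sample means plus and minus t* times the standard error."""
    margin = t_star * se
    return diff - margin, diff + margin


# Hypothetical data, invented for this illustration only.
group_a = [12, 15, 11, 14, 13]
group_b = [9, 10, 12, 8, 11]
diff, se, t, df_c, df_w = two_sample_t(group_a, group_b)
print(diff, se, t, df_c, df_w)
# 2.776 is the table value of t* for 95% confidence with 4 degrees of freedom.
print(confidence_interval(diff, se, 2.776))
```

The conservative degrees of freedom give a wider interval than the software-computed value would, which is why A28 calls the second option the one almost always used in research.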
Q32. For using t procedures with means of independent samples, an excellent approximation is achieved by using the t distribution with a not-necessarily-integer degrees of freedom computed by the formula on p. 659. This approximation is quite accurate when the sample size of both samples is what? A32. Five or larger. Q33. When software gives you a choice between assuming, or not assuming, equal variances for the two populations whose means you are comparing with a t procedure, which choice should you generally make? A33. Not to assume equal variances, because this assumption is very difficult to check. YMS Chapter 12 Inference for Proportions Q1. The statistic that estimates (in an unbiased way) the population proportion is ____. A1. The sample proportion. Q2. What is the standard deviation of the sample proportion (provided the population is at least 10 times as big as the sample)? A2. sqrt(pq/n) where p is the population proportion, q is 1-p, and n is the sample size. Q3. If np and nq are at least 10, then we can treat the distribution of p-hat as approximately what? A3. Normal. Q4. Just for review: if asked for the standard deviation of the population proportion, what would you say? A4. Population parameters do not have standard deviations, because they are not random variables. The population proportion is a single number. For example, if a certain proportion of the population of the U.S. is left-handed, there is no variance in that number. Q5. Just for review: what does it mean to "standardize" any statistic? A5. To subtract the mean and divide by the standard deviation. Q6. When you standardize p-hat, what do you subtract and divide by? A6. You subtract the population proportion and divide by the standard deviation of p-hat, which is sqrt(pq/n). Q7. What familiar ploy do we use to cope with not knowing the population proportion, in creating a standardized p-hat? A7. We use the familiar ploy of substituting the sample statistic for the population parameter.
So we use the standard error, sqrt(phat*qhat/n), instead of the standard deviation of the proportion. Q8. What is the expression for a confidence interval around the sample proportion? A8. The confidence interval is phat +- z*sqrt(phat*qhat/n). This fits the format of estimate +- z* times the standard error of the estimate, for any normally distributed estimator. Q9. Just for review: what would you say if asked to do a hypothesis test about whether a sample proportion was equal to a certain number? A9. Hypothesis tests, and statistical inference in general, have to do with what population parameters are and not what sample statistics are. Q10. How do you do a hypothesis test that a population proportion is equal to a certain number? A10. You form a z statistic with phat - p0 in the numerator, and sqrt(p0*q0/n) in the denominator. You then get a p value associated with that z statistic, from the standard normal table or the normal cdf function of a calculator. (Note that when doing a hypothesis test, you use the hypothesized proportion, p0, for the estimate of the standard deviation; when doing a confidence interval, you use the sample proportion, phat.) Q11. Given a certain margin of error, for example +- .03, how do you determine how big a sample size you need to get a margin of error that small or smaller? A11. You set the expression for margin of error, namely z* sqrt(p*q*/n), to be <= m and solve for n; z* is the critical value of z for (1-C)/2, and p* is the estimated proportion of successes, and q* is 1-p*. Q12. How do you solve for sample size needed in a study of proportion, without knowing the proportion you are looking for in the first place? Please mention two strategies. A12. If you can make an informed guess, you can do that. If your guess is a range, use both ends of the range. If you have no information, p*=.5 yields the biggest margin of error and thus the biggest and most conservative estimate of the necessary sample size. Q13.
If the guessed proportion is anywhere between .3 and .7, you should as a rule of thumb use what as the p* or estimated proportion of successes, for your calculations on determining the necessary sample size? A13. .5. Q14. Suppose we give a cancer drug to one group and not to another group, and look at the difference in proportions of people who survive for five years in these two conditions. What parameter are we trying to estimate, and what statistic do we use to estimate it? A14. The parameter is the difference between the population proportions of survivors for the two conditions. The statistic is the difference in sample proportions. Q15. How do you find the standard deviation of the difference in sample proportions? A15. The variance of the difference is the sum of the variances of the individual proportions. So the sd of the difference is sqrt (p1q1/n1 + p2q2/n2). And because you don't know the p's and q's for this expression, you substitute the sample statistics for the population parameters (the usual ploy). Q16. What is the expression for the confidence interval for the difference of two proportions? A16. It's the estimate +- the margin of error, or: the difference in sample proportions, +- z*(sqrt(p1q1/n1 + p2q2/n2)), with the sample proportions substituted for the p's and q's. Q17. What are the conditions for the confidence interval for a difference in proportions? A17. Population 10 or more times the sample; all 4 np and nq quantities are 5 or more. Q18. In doing a hypothesis test for the difference of two proportions, we compute a z statistic. What, in general terms (that is without going into the specific formula) is in the numerator and the denominator of this statistic? A18. The numerator is the difference in sample proportions. The denominator is an estimate of the standard deviation of the difference of sample proportions (a.k.a. the standard error of the difference). Q19.
In the special case where the null hypothesis is that the two proportions are equal, in other words the difference between them is 0, what do we do differently because of this? A19. In this case, under the null hypothesis, there is only one population p rather than a p1 and a p2. So we can substitute p for p1 and p2 in the expression we used for the standard deviation of the difference, and it simplifies some. To estimate that p, we use the pooled p-hat, which is the total successes over the total trials for both samples combined. YMS Chapter 13 Inference for Tables: Chi-Square Procedures Q1. Suppose that you are dealing with a situation where there are several possible outcomes, not just 2 (success and failure). You are interested in seeing whether the proportion of outcomes falling into each of a certain set of categories is consistent with a certain hypothesized population distribution. What is the name of the test you use? A1. The chi-square test for goodness of fit. Q2. Suppose that your hypothesized population distribution for the percent of objects that are certain colors is 20% black, 50% white, and 30% green. Suppose you draw a sample of 200, to test this hypothesis. What are the "expected" values that you use when you do the chi-square goodness of fit test? A2. 40, 100, and 60. Q3. Suppose you are doing a goodness of fit test for the distribution hypothesized in Q2. What is the formula that you use for chi-square? A3. Chi-square = Summation of (observed-expected)^2/expected. Or summation of (O-E)^2/E. Q4. In testing the hypothesis mentioned in Q2, suppose your observed counts are 45, 90, and 65. What does chi-square equal for this goodness of fit test? Please write a numerical expression without bothering to calculate the result. A4. chi-square = (45-40)^2/40 + (90-100)^2/100 + (65-60)^2/60 Q5. Is there just one chi-square distribution, or a family of distributions, with one distribution for each number of degrees of freedom? A5. 
A family, with one distribution for each number of degrees of freedom. Q6. How do you find the number of degrees of freedom for a chi-square goodness of fit test? For example, how many degrees of freedom would there be if you were looking at the proportion of blacks, whites, and greens as in Q2? A6. The degrees of freedom is one less than the number of categories in the distribution; for example, when there are blacks, whites, and greens, the number of degrees of freedom is 3-1=2. Q7. When you look up in a table or a calculator the P-value associated with a certain chi-square, what is that the probability of? A7. The probability of obtaining results as extreme as, or more extreme than, the ones you obtained, if the hypothesized distribution is true. (Extreme means deviant from what is expected.) Q8. Is the chi-square distribution symmetrical? If not, in which direction is it skewed? A8. Skewed to the right. Q9. When you are doing a chi-square test for goodness of fit, what are the null hypothesis H0 and the alternative hypothesis Ha? A9. The H0 is that the population percents are equal to the set of hypothesized percents. The Ha is that the population percents do not equal that set of hypothesized percents. Q10. What are the rule of thumb conditions for the use of the chi-square goodness of fit test? A10. All individual expected counts are at least 1 and no more than 20% of the expected counts are less than 5. Q11. What's an efficient way of calculating a chi-square on a TI 83 or 84 graphing calculator? (If you have a TI-89, please tell me how to do the same thing on it.) A11. Enter the observed counts as a list, say L1, and the expected counts as another list, say L2. Then define L3 as (L1-L2)^2/L2. On the TI-83 or 84 you do this by putting the cursor on L3, and then typing in the expression. Then when you hit the enter key, you see the (observed-expected)^2/expected values in L3. Then you sum the values in L3. You can do this by hitting (second) List >math>5.
sum, and then L3. Another way of getting the sum of the values in a list is by hitting stat >calc>1.1-var stats, and then entering L3. Then you look in the output for the summation of x, in the second line. Q12. If a chi-square goodness of fit test yields a significant result, what should you inspect before you interpret the results? A12. You see which observed counts deviated the most from the expected ones -- in other words, you see which cells contributed the most to the chi-square that was calculated. You take these observations into account when interpreting your results. Q13. Two-way tables describe relationships between two (categorical or continuous?) variables. A13. Categorical Q14. When there are multiple comparisons that can be made, what two steps are often carried out? A14. First an overall test for evidence of any differences among the parameters being compared, and then a follow-up analysis to decide which parameters differ and to estimate how large the differences are. Q15. When doing a chi-square test to compare several proportions, the first step is to set up the table with the numbers in it being (proportions of success and number of trials, or counts of the number of cases falling into each category). A15. Counts of the number of cases falling into each category. Q16. When there are two categorical variables being displayed in an r by c table (with r rows and c columns), each of the r x c possible categories into which the observations may fall is called a _____ of the table. A16. cell Q17. When we are comparing the proportion of successes for three treatment conditions, what null hypothesis would we use? A17. That the proportion of successes is the same among all three conditions, i.e. that p1=p2=p3. Q18. When comparing the proportion of successes for three treatment conditions, what would be the alternative hypothesis? A18. That not all the proportions are equal. Q19. 
In testing H0 via chi-square with a two-way table, we compare the observed counts with the expected counts. Evidence against H0 consists of observed and expected counts that are (far from each other, close to each other). A19. Far from each other. Q20. How do you compute the expected count in a certain cell of a two-way table? A20. The expected count is the (row total * column total)/table total. Q21. The calculation of the expected count for a cell of a two-way table assumes that the row and column variables are (disjoint, independent). A21. Independent Q22. Please explain why the formula for expected cell counts is what it is. A22. The row total/table total is the probability that a case will fall into a certain row. The column total is the total number of cases that fall into a certain column. When there are n trials and a probability p of success on each trial, the expected number of successes is np, if we assume independent trials. Thus if there are a certain number of cases in the column and there is a probability row total/table total that each of those cases will fall into the row, the expected number to fall into both the row and the column is (row total/table total) * column total, or row total * column total/table total. This assumes that the probability of a case's being in a certain row is independent of what column it's in. Q23. When you want to test the statistical significance of the deviation of observed from expected counts, in a two-way table, using chi-square, how do you compute the chi-square statistic? A23. chi-square is the summation of the (observed count - expected count)^2/expected count. The summation is over all r * c cells of the table. Q24. Large values of chi-square are evidence for, or against H0? Why? A24. Against. This is because chi-square will be bigger, the bigger are the deviations of observed counts from those that would be expected under H0. Q25.
How many degrees of freedom do you have in a chi-square test with an r * c two-way table? A25. (r-1)(c-1) Q26. True or false: when doing chi-square tests, the p-value is always the area under the distribution curve that is to the right of the observed chi-square, and never the area to the left. A26. True. For the chi-square distribution, the farther you go to the right, the more you have deviated from the null hypothesis. The value most consistent with the null hypothesis is 0, which is the left end of the domain for the function. To get the probability of results as deviant as, or more deviant than, the obtained results, you look at the probability under the curve to the right of the obtained results. (This includes the probability exactly at the obtained results, but since chi-square is a continuous distribution, the distinction between "above" and "at or above" is not meaningful.) Q27. What cell counts are required for doing a chi-square test for homogeneity of populations? A27. The same as for tests of goodness of fit: all expected counts are 1 or greater, and no more than 20% of the expected counts are less than 5. Q28. In the special case of a two-by-two table (r=2 and c=2), how many expected cell counts need to be 5 or greater in order to do a chi-square? A28. All four of them. Q29. How many degrees of freedom would be used for a 3 by 2 table? A29. (3-1)*(2-1)=2 Q30. Just for review: what are the four steps for the "inference toolbox" that you use for doing chi-square tests as well as any other hypothesis tests? A30. 1) Identify populations of interest; state hypotheses in words and symbols. 2) Choose the appropriate inference procedure and verify the conditions for its use. 3) Carry out the procedure. 4) Interpret your results in the context of the problem. Q31. After having done an overall test rejecting the hypothesis that all the proportions are equal, what should be done?
A31. A follow-up analysis that asks which cells most contribute to the deviation from expectations under the null hypothesis. You can do this informally by observation; there are more formal methods that do significance tests and confidence intervals for the individual proportions. Q32. True or false: the chi-square tests the hypothesis that "the row and column variables are not related to each other," even when it is difficult to conceive of the groups defined by the rows and columns as different populations, i.e. when you are dealing with the relation of some variables in one population. A32. True. Q33. True or false: for a chi-square test of association/independence of variables, you compute the expected counts just as in the other situations: the row total * column total/ table total. A33. True. Q34. True or false: converting table entries to percents is not necessary for the computation of chi-square, but it does help to shed light on the association among the variables. A34. True. Q35. For a chi-square test of association/independence of variables, what is the null hypothesis? A35. That the variables are independent, or that there is no association between them. Q36. True or false: the distinction between tests of homogeneity of populations and tests of association/independence is that in the first, there is a sample from each of two or more populations, and in the second, there is a single sample from a single population. A36. True. (However, distinguishing whether there is one or more than one population involved in a study can be a debatable procedure. If you collect a sample of people, some of whom are wealthy and some of whom are poor, can you argue that you have sampled some individuals from the population of poor people and some from the population of rich people? Or have you drawn from one population of people, who simply differ in one variable? Fortunately, the chi-square test is done in the same way regardless of the outcome of such a debate.)
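Since the chi-square computation is the same for homogeneity and for association/independence, one short Python sketch covers both. It follows the recipe in this chapter: expected count = row total * column total / table total, chi-square = sum of (observed - expected)^2/expected, and (r-1)(c-1) degrees of freedom. The function names and the 2 by 2 table of counts are hypothetical, made up for illustration; the P-value would still come from a table or calculator.

```python
def expected_counts(table):
    """Expected count for each cell: row total * column total / table total."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    return [[r * c / total for c in col_totals] for r in row_totals]


def chi_square(table):
    """Return (chi-square statistic, degrees of freedom, expected counts)."""
    expected = expected_counts(table)
    # Sum (observed - expected)^2 / expected over all r * c cells.
    stat = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
    )
    df = (len(table) - 1) * (len(table[0]) - 1)
    return stat, df, expected


def conditions_met(expected):
    """Rule of thumb: every expected count is at least 1, and no more
    than 20% of the expected counts are less than 5."""
    cells = [e for row in expected for e in row]
    small = sum(1 for e in cells if e < 5)
    return all(e >= 1 for e in cells) and small <= 0.2 * len(cells)


# Hypothetical observed counts: rows are two groups, columns are two outcomes.
observed = [[30, 20],
            [20, 30]]
stat, df, expected = chi_square(observed)
print(expected)                 # every expected count is 25.0
print(stat, df)                 # chi-square of 4.0 with 1 degree of freedom
print(conditions_met(expected))
```

Printing the expected counts next to the observed ones also supports the follow-up analysis in A31, since it shows which cells contribute most to the statistic.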
Q37. When there is a two-by-two table, and you wish to compare two proportions, how will a two-sided z test for equality of proportions and a chi-square test compare with respect to the p values that result? A37. The same p values will result. Q38. If there is a two-by-two table and you wish to compare two proportions, which test is usually recommended, between a z test and a chi-square, and why? A38. The z test has the advantages that it is related to a confidence interval for the difference in proportions, plus you can do a one-sided test if desired. YMS Chapter 14: Inference for Regression Q1. To review: we use least-squares regression to study the relation between a couple of variables, both of which are (quantitative, categorical). A1. Quantitative Q2. Before doing regressions to study the relationship between two quantitative variables, we should explore the data by examining a _______. A2. Scatterplot. Q3. The statistic that describes the strength of a linear relationship, that is the same whichever variable is thought of as the explanatory variable, and which has a familiar relationship to the percent of variance in one variable explained by the other, is the ______ ______. A3. correlation coefficient (or just, the correlation) Q4. The formulas for generating a slope and intercept to form a regression line are set up so as to minimize what quantity? A4. The sum of the squared residuals or errors. Q5. What is a residual? A5. A residual is the vertical distance between the data point and the regression line, or y - y-hat. Q5. The r-squared, which is part of the regression output, tells us how much of what is what? A5. How much of the variation in the y variable is accounted for by the linear relationship with x. Q6. Suppose we draw lots of samples and compute a regression line for each sample. The slope and intercept of each sample line estimate a true value. Thus the slope and intercept we obtain from our sample are _____ that estimate population ______. A6.
Statistics, parameters Q7. One of the conditions for regression inference is that for any fixed value of x, the response variable y varies according to a _____ distribution. A7. Normal Q8. Another assumption for regression inference is that for any fixed value of x, the repeated responses y are ____ of each other. A8. Independent. Q9. Another assumption for regression inference is that the means of the sets of y-values for each x value have what relationship to the x values? A9. That the means of the y's for each x are a linear function of x: mean for y's = alpha + beta * x Q10. Another assumption for regression inference is that what measure of dispersion is equal for each value of x? A10. The standard deviation of the y's for the various x values. Q11. True or false: the slope and intercept we obtain from the least squares regression for our sample are unbiased estimators, respectively, of the slope and intercept of the line connecting the population means for each of the x's. A11. True. Q12. What is the unbiased estimator for the standard deviation of the y values around the regression line (in other words, the standard deviation of the y values around the means of each of those values for each x)? A12. The statistic called s, which is the standard error, or the standard deviation of the residuals. Q13. What's the formula for the standard error s obtained from a regression? A13. The square root of (the sum of squared residuals/(n-2)). Or s = sqrt((summation of (y - y-hat)^2)/(n-2)) Q14. The parameter we are usually most interested in estimating from regression output is the (slope, y-intercept) of the line. A14. slope Q15. How do we put a confidence interval around the slope of a regression line? A15. Confidence interval = b +- t*(SEb) where b is the obtained slope, t* is the (1-C)/2 critical value for t, and SEb is the standard error of the slope. Q16. When you find the confidence interval for the slope of a regression line, one of the variables in the equation is t*.
Please explain how you would find t* when your sample size is 40 and your desired level of confidence is 95%. A16. You would use n-2, or 38, degrees of freedom. You could use an inverse t function on your calculator to find the t score that corresponds to an upper-tail area of (1-C)/2 = .025 (or, equivalently, the t that corresponds to a cumulative area of .975). Or you can look up the result in a table of critical values for t. (If the table gives values for 30 and 40 degrees of freedom but not 38, you can interpolate, i.e. get the value that is .8 of the way between the value for 30 and the value for 40.) Q17. The standard error of the slope is a function of s. But it's not really necessary to memorize the formula for this, because _____. A17. Regressions are seldom done by hand, and the SEb is part of the regression output of the calculator or computer program. Q18. The most commonly tested hypothesis about regressions is that Beta, the "Population slope," is 0. Can you put this hypothesis in some other phrasings? A18. That the straight line dependence on x is of no value in predicting y. Or that the population correlation between x and y is 0. Or that there is no true linear relationship between x and y in the population. Q19. If you form the ratio of the slope obtained in your sample to the standard error of that slope, what is the sampling distribution of that statistic? A19. When the population slope is 0, it's distributed according to the t distribution, with n-2 degrees of freedom. Q20. Regression output usually gives a two-sided p value for the hypothesis test that the population slope is 0. How do you obtain a one-sided p-value for the same hypothesis? A20. Divide the two-sided p-value by two (provided the sample slope lies in the direction specified by the alternative hypothesis). Q21. Someone asks, "The regression equation we obtained from analyzing our data is y=2 + 10x. What's the 95% confidence interval for the y when x is 8?" The other person says, "That depends upon what question you want the confidence interval to answer." What does the second person mean by this? A21.
The two possible questions are: 1) "What is the 95% confidence interval for the MEAN of the y values when x=8," and 2) "What is the 95% confidence interval for a single y value when x=8?" In other words, what do we want to predict: the average value for a group, or a score for a single individual? Q22. When we are getting a confidence interval for the predicted mean of y's for a given x value, our confidence interval is (narrower, wider) than when we get a "prediction interval" for a single individual? A22. The confidence interval is narrower for predicting the mean than predicting an individual. Q23. Suppose that some researchers are testing how long a thermometer takes to come to equilibrium, given the temperature difference between starting and ending temperature. They regress the time to equilibrium on the temperature difference. Then the researchers want to produce a table for a clinician. The table will contain numbers so as to help the clinician predict the time to wait for a certain individual who is using the thermometer, such that there is a 95% chance that this individual will not start trying to raise the temperature before the temperature has equilibrated. Please explain whether the researchers should use a confidence interval or a prediction interval in constructing their table. A23. The researchers are interested in predicting an individual response, not the mean of a set of responses. This is what the prediction interval is designed for, so the prediction interval is the correct choice. Q24. Please list 4 conditions for inference in regression. A24. Independent observations, true relationship linear, standard deviation of the response about the true line is the same everywhere, and the response varies normally about the true line. Q25. Is there an additional condition placed upon the construction of a prediction interval via regression? A25.
The prediction interval assumes normality of the individual observations, not just the approximate normality of statistics like the slope and intercept. Q26. Suppose that in a residual plot, the values are close to 0 when x is low, but the residuals get bigger and bigger in absolute value as the x values get greater. What condition of regression is violated in this circumstance? A26. The condition that the standard deviation of the response around the true line is the same everywhere. Q27. Someone examines a residual plot and a scatterplot and observes a curvilinear pattern. What condition of regression is being violated, and what should the researcher consider doing in order to correct this? A27. The condition violated is that the true relationship is linear. The researcher should consider transforming one or more of the variables. YMS Chapter 15: Analysis of Variance Q1. Inference about the spread of populations and inference about several means by one-way analysis of variance are very different goals. But they will both involve the use of what distribution? A1. The F-distribution. Q2. What's one of the big problems with trying to infer whether one population's spread is greater than another by the F test? A2. That the test is extremely sensitive to departures from normality, and this situation does not improve as the sample size gets larger. Thus it's difficult to tell whether a significant F-value implies unequal population spreads, or nonnormal populations. Q3. What's a problem with even using the standard deviation as a measure of spread, with asymmetric distributions? A3. When distributions are skewed, with unequally spread tails, no single measure of spread does a good job of describing the spread. Q4. At one time, it was common to test equality of standard deviations as a preliminary to performing a pooled two-sample t test. What's now regarded as better practice? A4.
To check the distributions graphically, and to use the version of the two-sample t-test that does not require equal standard deviations. Q5. How do you do a significance test of the hypothesis that two population standard deviations are equal? A5. You compute the ratio of the larger sample variance (standard deviation squared, not the standard deviation itself) to the smaller sample variance. This is the F statistic, with n1-1 and n2-1 degrees of freedom, where sample 1 is the one with the larger variance. Then you compare the value of F with critical values for rejection of the null hypothesis at a given significance level. Or, you use a calculator or statistical software to arrive at a P-value for this F statistic and these two degrees of freedom. Q6. Suppose you want a two-sided p value once you have computed a one-sided p-value. What do you do? A6. Double the one-sided value to get the two-sided p-value. Q7. If you want to carry out a significance test for the hypothesis that two population spreads are equal, how do you do it on a TI 83 or 89 calculator? A7. You use stat>tests>D:2-SampFTest. Then you fill in the blanks for the two standard deviations and sample sizes, and the calculator does the rest for you. Notice that you enter the standard deviations, not the variances; the calculator does the squaring for you in calculating F. Q8. Now that you've learned how to compare two standard deviations with the F test, is it recommended that you go out and start using this knowledge? A8. No, because the sensitivity to departures from normality limits the usefulness of the procedure. However, understanding the F test is going to help you understand ANOVA, and thus you haven't wasted your time! Q9. What is the main objective of one-way analysis of variance? A9. To compare several group means. Q10. If we are comparing, say, 10 different treatments, why do we prefer analysis of variance to using a bunch of t-tests to compare the treatments 2 at a time? A10.
Because in this situation, for example, that involves (10*9)/2=45 different pairwise comparisons, and the results get overwhelming. Also, you run into the multiple comparisons problem, where the more you do, the more you get some significant by chance alone. Q11. What's the usual two-step procedure in doing multiple comparisons? A11. First do an overall test to see if there is good evidence of any differences among the parameters. Then do a detailed follow-up to decide which parameters differ and to estimate how large the differences are. Q12. In doing an anova with several groups, what are the null hypothesis and the alternative hypothesis? A12. The null hypothesis is that all the population means are equal, and the alternative hypothesis is that not all the means are equal. Q13. When you look at computer output regarding a one-way anova, where do you look to see if the null hypothesis was rejected? A13. You look at the far right column of the output at the p-value. (This is low when the F-statistic is high, relative to the degrees of freedom.) Q14. What is the main idea of analysis of variance? A14. You look at the variation among sample means, compared with the variation within each sample. The variance within the samples provides an indicator of how much the means should differ by the randomness of sampling; if the means differ by more than this chance expectation, you reject the null hypothesis. Q15. The F statistic for one-way anova is the ratio of what to what? A15. Variation among the sample means/variation among the individuals in the same sample. Q16. When doing an F test, how do you get the numerator and denominator degrees of freedom? A16. The numerator degrees of freedom is one less than the number of groups. The denominator degrees of freedom is the total number of subjects in the study, minus the number of groups. Q17. When you are doing a one-way anova, you are thinking about how many different populations? A17. 
One population for each of the groups. Q18. You assume that each group is derived in what manner from its population, when doing a one-way anova? A18. Each group is a SRS of its population. Q19. What other assumptions or conditions are there for one-way anova, other than that each group is a SRS of its population? A19. That each population has a normal distribution, and that all the populations have the same standard deviation. Q20. With respect to robustness to violations of the assumption of normality, does anova more resemble the t-test, which is rather robust, or the F-test for differences in population standard deviations, which is not very robust? A20. The t-test, which is rather robust. Q21. What's a rule of thumb that helps you decide when the violation of the assumption of equal variances of the populations invalidates the one-way anova? A21. The results of the anova are approximately correct when the largest sample standard deviation is no more than twice as big as the smallest one; when the ratio exceeds two, the equal-variance assumption is in doubt. Q22. The numerator of the ratio used in one-way anova is called the mean square for groups (elsewhere a.k.a. the mean square "between"). How is this calculated? A22. You take each of the group means, subtract the overall mean from it, and square each of those deviations; you multiply each squared deviation by the number in that group; then you sum these terms, then divide the sum by one less than the number of groups. In other words, MSG = (summation of ni(x-bari - xbar)^2)/(I-1) where ni is the number of subjects in the ith group, x-bari is the mean of the ith group, xbar is the overall mean, and I is the number of groups. Q23. The denominator of the ratio used in one-way anova is called the mean square for error (elsewhere a.k.a. the "mean square within"). How is this calculated? A23. You take the variance of each group and multiply it by one less than the number in that group -- this is the "sum of squares" for the group, i.e. the sum of squared deviations of the individual observations from their group means.
(Alternatively, you can just compute the sum of the squared deviations from the group means, without converting to a variance and back again to a sum of squares!) You add up all those squared deviations, and divide by the total number of subjects in the study minus the number of groups. In symbols, MSE = (summation of (ni-1)si^2)/(N-I) where ni is the number of subjects in group i, si is the standard deviation of group i, N is the total number of subjects in the study, and I is the number of groups. Q24. In the anova table: the mean square for groups is the ratio of what two quantities? A24. The sum of squares for groups over the degrees of freedom for groups. Q25. The mean square error is the ratio of what two quantities? A25. The sum of squares error (meaning the sum of the squared deviations of each observation from its group mean) over the degrees of freedom for error (which is the number of observations minus the number of groups.) Q26. The F ratio presented in the Anova table is the ratio of what two quantities? A26. The mean square for groups over the mean square for error. Q27. The square root of the MSE (mean square for error) is called what? A27. The pooled standard deviation sp, which estimates the common standard deviation of each group. Q28. How do you form a confidence interval for the means found in an analysis of variance? A28. You take each sample mean, and add to and subtract from it, its margin of error. The margin of error is the critical value of t for the desired interval, multiplied by the pooled standard deviation, divided by the square root of the number in that group. Q29. If you have a bunch of group means, and you know the sample size for each group, and you want to calculate the overall mean from them, how do you do that? A29.
Multiply each group mean by the number in that group, to get the sum of the observations in that group; add up all those sums to get the overall sum; add up the n's for each group to get the overall n; divide the overall sum by the overall n. YMS Chapter 16: Multiple Regression Q1. Please explain both in words and symbols what the multiple regression model is, or the population regression equation. A1. There is a response variable whose mean is a linear function of not just one explanatory variable, but more than one. The population regression equation is mu_y = beta0 + beta1*x1 + ... + betap*xp where mu_y is the mean y value for the subpopulation defined by a particular combination of values of the several explanatory variables, p is the number of explanatory variables (statisticians like the letter p, don't they? Proportion, p-value, and now number of predictor variables!), beta0 is the y-intercept of this equation, and each of the other terms involves a certain coefficient multiplied by the value of the variable that coefficient applies to. Q2. What are some other assumptions of the regression model? A2. That in each subpopulation, determined by a certain combination of explanatory variable values, y is normally distributed with a mean given by the population regression equation. Also, the model assumes that the standard deviation of the response variable is the same in all the subpopulations. Q3. Please give an example of predicting something from more than one explanatory variable, in such a way that a multiple regression equation could be used. A3. The example our text uses is one in which grade point average for computer science students in college after 3 semesters is predicted by math, science, and English average grades in high school.
We can think of an almost infinite number of other examples: life span as a function of number of cigarettes smoked, years lived by grandparents, and body weight; popularity as a function of physical attractiveness, conversation skill, friendliness, and self-control; wealth as a function of parents' wealth, educational attainment, and stated desire for high income; etc. Q4. According to the notation for naming observations our text uses, xij means what? A4. The value of the jth explanatory variable for the ith subject or case. Q5. True or false: the multiple regression model requires the explanatory variables to be normally distributed. A5. False. Only the population y's are assumed to be normally distributed around their means. Q6. True or false: The first step in doing multiple regression analysis is to use graphs and tables to do exploratory analysis with all the variables individually. A6. True. Q7. True or false: when doing a multiple regression analysis, the second step is to use scatterplots and correlations to examine the bivariate relations between pairs of variables. A7. True. Q8. What does the equation data=fit+residual mean? A8. In the multiple regression equation, the "data" is the observed data point; the "fit" is the y-hat, or predicted y-value, or predicted mean of the y-values for that combination of x-values; the residual is the difference between the actual y value and the predicted y value. Q9. True or false: in the multiple regression model, the residuals are independent, normally distributed, and with mean 0 and sd sigma. A9. True. Q10. What are the parameters of the multiple regression model? A10. Beta0, Beta1, Beta2... BetaP and sigma, the standard deviation of each y value about its mean (or the standard deviation of the residuals). Q11. The b coefficients in the multiple regression equation obtained from our sample data are chosen so as to minimize what? A11. The sum of the squares of the residuals. Q12. 
The s that our computer output reports for multiple regressions, that is the estimator of the sigma of each y value for each combination of x values, is computed how? A12. s is the square root of (the sum of squared residuals/(n-p-1)) where n is the number of subjects, p is the number of explanatory variables, and n-p-1 is the degrees of freedom for this quantity. Q13. How do you get a confidence interval for a certain regression coefficient? A13. The confidence interval is that coefficient plus or minus t*(standard error of that coefficient). Q14. How do you do a significance test for the null hypothesis that a given regression coefficient is equal to 0? A14. Your test statistic is t=b/standard error of b, with n-p-1 degrees of freedom. Q15. In multiple regression, as in bivariate regression, we can ask software to produce confidence intervals or prediction intervals. What is meant by a 95% interval of each of these types? A15. A 95% confidence interval means that we have 95% confidence that the interval we produce encloses the mean of y for the given combination of x's. A 95% prediction interval means that we have 95% confidence that an individual observation of y with the given combination of x's would fall into the given interval. Q16. What are the two sources of variation for the y values analyzed in a multiple regression model? A16. The y values are different for two reasons: first, they have different x-values. Second, even for a given set of x-values, the y values vary around the mean y value -- in other words, the residuals are not zero. So there is variation due to the regression and variation due to the residuals or to error. Q17. The total sum of squares of the y values, that is the sum of the squared deviations of the y values from the overall y mean, can be partitioned into what two components? A17. The two components are the sum of squares for the model, i.e.
the sum of squared deviations of the predicted y's from the overall y mean, and the sum of squares for error or residuals, i.e. the sum of squared deviations of the actual y's from the predicted y's. In other words, SS total = SS model + SS error. Q18. The analysis of variance that is done for a multiple regression tests what null hypothesis? A18. That all our population regression coefficients are zero. Q19. In general, the mean square of something is equal to what divided by what? A19. The sum of squares divided by the degrees of freedom. Q20. The F ratio presented in the analysis of variance for multiple regression is the ratio of what to what? A20. The ratio MSM/MSE, or the mean square for the model divided by the mean square for error. Q21. The degrees of freedom for the model is (p, or n-p-1)? A21. p (the number of explanatory variables). The total degrees of freedom is n-1, and that minus p gives n-p-1 for the degrees of freedom for error. Q22. When there is only one predictor variable, the t-test for the regression coefficient b and the F test in the analysis of variance always yield the same p value, because they are testing the same hypothesis. In this case, the F bears what relation to t? A22. F is the square of t. Q23. The R^2 statistic for multiple regression is the ratio of what to what? A23. The sum of squares for the model to the sum of squares total. Q24. The R^2 statistic for multiple regression has what meaning? A24. The proportion of the total variation in y that is explained by the explanatory variables. Q25. The square root of the R^2 statistic for multiple regression is called the multiple correlation coefficient, and is interpretable as the correlation between what and what? A25. The y-hats that are predicted by the multiple regression equation, and the actual y values. In other words, the correlation between the actual and predicted values of y. Q26.
When, in regression output, you find that a regression coefficient is significantly different from zero, what should you check to help you interpret the meaning of that coefficient? A26. You should look at the sign of the coefficient and figure out whether the relationship between the explanatory variable and response variable is an increasing or decreasing function. Q27. When examining multiple regression output, why is it a good idea to verify the degrees of freedom? A27. To check for errors in data entry or in specifying the model. Q28. In the regression output, what is the meaning of the quantity that some statistical packages call the Root MSE and others call s? A28. The root mean square error or the s is the estimate of sigma, the standard deviation of the y values about their individual means. This is estimated by the square root of the MSE given in the Anova table; the MSE is the SSE/DFE or sum of squares error divided by the degrees of freedom for error (and that is n-p-1). Q29. The parameter estimates of the regression output, or the coefficients (as they are called by other packages) can be put together to form what? A29. The fitted regression equation, the Y-hat=b0 + b1x1 ... prediction equation. Q30. The t statistics for each parameter estimate are formed by dividing what by what? A30. The value of each coefficient (or intercept) by its standard error. Q31. Are the p values in the regression output two-sided or one-sided? A31. Two-sided. The null hypothesis is that the coefficient is 0 and the alternative is that it's not 0. Q32. True or false: the significance tests for the coefficients test, not whether the single variables can predict the response variable, but whether those variables improve upon the prediction achieved by all the other variables in the model. 
Thus you can have a variable that correlates well with the response variable by itself, which does not have a significant p-value in the regression equation, because it does not add to what other explanatory variables can do. A32. True. Q33. It is good practice to make several residual plots when doing a multiple regression. What several plots should be done? A33. Plot the residuals as a function of each of the explanatory variables. You can also make a normal probability plot to assess the normality of the residuals. Q34. Suppose that after examining the regression output with three explanatory variables in the model, you eliminate one of the explanatory variables and run a multiple regression with two explanatory variables. As a rule, the intercept, coefficients, t's, and p values for the remaining variables will (remain the same as they were before, change)? A34. Change. The values you find are thus contingent upon what other variables are in the model. Q35. True or false: it is possible, using multiple regression, to test a certain set of variables, asking whether that set of variables improves significantly upon the prediction made by another set of variables. A35. True. YMS Chapter 17: Logistic Regression Q1. When our response variable is dichotomous, or has only two values (like live or die) and there is either one or several explanatory variables (that can be categorical or quantitative) such as for example, severity of condition on admission to the hospital, degree of training of the doctor, and rating of cleanliness of the ward, then the method of _____ ______ is often an appropriate method of analysis. A1. Logistic regression Q2. Why can't you use ordinary least-squares regression with the situation when the response variable is dichotomous? A2. Because the least-squares regression model we've studied so far carries the condition that the y values are normally distributed around the model fit.
You can't have a normal distribution with only two possible values. Q3. Suppose there are 200 violent people among a set of 1000. Thus the proportion of violent people is .20. What are the odds of being a violent person in this set? A3. Odds=p/(1-p) = .20/(1-.20)=.20/.80=.25. OR: odds = ratio of violent to nonviolent, = 200/800=.25. Q4. Suppose that in our group of 1000, 150 of 500 males are violent and 50 of 500 females are violent. What method did we learn in chapter 12 for comparing these proportions via a confidence interval? A4. We would get a confidence interval for the difference between the two proportions, which are .30 and .10. We would use the normal approximation to the binomial distribution to do this. We would use the fact that the variance of a difference between two independent random variables is the sum of the variances of the variables, to compute the standard error of the difference in proportions by taking the square root of the sum of the pq/n for men and women. We'd multiply that standard error of the difference by the critical z, or z*, for whatever level of confidence we wanted, and add and subtract that margin of error to and from the obtained difference in proportions. Q5. If instead of this method, we use logistic regression, how would we turn gender into an "indicator variable"? A5. By letting, for example, female=0 and male=1. Q6. If we have 150 out of 500 males violent and 50 out of 500 females violent, how would we transform these proportions to odds, if we are doing logistic regression? A6. For males, the odds of violent to nonviolent are .30/.70=.43 (equivalently 150/350, or .15/.35 as fractions of the whole group of 1000). For females, the odds are .10/.90=.11 (equivalently 50/450, or .05/.45). Q7. In simple linear regression, we modeled the mean of the response variable for a given value of x by mu=B0+B1x. Why don't we do something similar with proportions, and model p= B0+B1x? A7. Because with extreme values of x, we get values of p that are inconsistent with the constraint that p has to be between 0 and 1. Q8.
So what model do we use for logistic regression? A8. We transform the odds by taking the natural log of it, and let that equal B0+B1x. Thus ln(p/(1-p)) = B0+B1x Q9. If we have 150 out of 500 males violent and 50 out of 500 females violent, how would we compute the logistic regression equation? A9. The odds for men are .43 and the odds for women are .11. The natural logs, or ln's, respectively, are -.844 and -2.21. So with x=0 for women, the ln(p/(1-p)) becomes B0. So B0= -2.21. For men, -.844 = -2.21 + B1*1, so B1=2.21-.844=1.366. Thus the logistic regression equation is: ln(p/(1-p))=-2.21+1.366x Q10. True or false: in the logistic regression example that we just gave, the slope in the logistic regression model is the difference between the natural logs of the odds for men and women, that is the ln of the odds for men minus the ln of the odds for women. A10. True. Q11. Given that the slope, 1.366, is the difference in the natural logs of the odds, the antilog of that number corresponds to what relation between the odds? A11. The odds ratio, which is the odds of violence for men divided by that for women. Since ln(a/b)=ln(a)-ln(b), the antilog of ln(a)-ln(b) is a/b. Thus the antilog of 1.366 (or e to the 1.366 power) is equal to the ratio of the odds for men to the odds for women. This equals 3.92. To check, the ratio of .43 to .11 is approximately the same. Q12. More generally, the odds ratio is the ratio of the odds for explanatory variable equal to _____ to the odds for explanatory variable equal to ______. A12. x+1, x Q13. How do we form a confidence interval for the slope in the logistic regression model? A13. It's the obtained slope plus or minus z*(SEb1), where z* is the critical value of z in the normal distribution, and SEb1 is the standard error of the regression coefficient. (The computer prints out the standard error of the regression coefficient.) Q14. A confidence interval for the odds ratio is obtained by getting the antilogs of what? A14.
The antilogs of the boundaries for the confidence interval for the slope. Q15. To do a hypothesis test of the null hypothesis that the slope in the logistic regression is 0, what statistic do you use? A15. The square of the ratio of the slope to its standard error, which under the null hypothesis is distributed approximately according to the chi-square distribution with 1 degree of freedom. Q16. How can the hypothesis that the slope is 0 be translated to hypotheses about the odds ratio? A16. If the slope in the logistic regression model is 0, then the odds ratio is e^0, or 1. In other words, the explanatory variable does not alter the odds for the response variable. Q17. We would reject the null hypothesis at the .05 level if the 95% confidence interval for the odds ratio did not include what value? A17. One. Q18. Suppose that instead of the simple case where you have one explanatory variable that takes on only two values, you have one explanatory variable that takes on, say, five or six values. Then your raw data are proportions, one associated with each of those values of the one explanatory variable. How do you fit the logistic regression model to these data? A18. You transform the proportions to ln(odds), or ln(p/(1-p)) as before. Then you do a least-squares regression with the ordered pairs that you have. The result is an equation of the form ln(p/(1-p)) = B0+B1x as before. Q19. When doing this logistic regression, as in doing all analyses, we should examine the data via graphs. What graph should we do? A19. We should do a scatterplot of ln(p/(1-p)) versus the explanatory variable x. (Note: in the pesticide example in the text, the explanatory variable is also transformed to get a better linear relationship.) Q20. When we do a logistic regression with more than one explanatory variable, the computer output that is labeled "chi-square for covariates" gives a chi-square statistic and a p-value that is testing what null hypothesis? A20. That all the regression coefficients in the model are equal to zero.
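A worked check of the male/female example from Q6 through Q17: the arithmetic can be reproduced with a short Python script. This is only a sketch, not something taken from the textbook. The variable names are made up for illustration, and the standard error of the slope is computed with the usual large-sample formula for a two-by-two table, sqrt(1/a + 1/b + 1/c + 1/d), which is an assumption added here (the catechism says only that the computer prints the standard error). Worked without rounding, the slope comes out near 1.35 and the odds ratio near 3.86, slightly different from the 1.366 and 3.92 that result from the rounded odds .43 and .11.

```python
import math

# Counts from the example: 150 of 500 males and 50 of 500 females are violent.
male_yes, male_no = 150, 350
female_yes, female_no = 50, 450

# Odds = p / (1 - p), which equals (number violent) / (number nonviolent).
odds_male = male_yes / male_no        # about .43
odds_female = female_yes / female_no  # about .11

# With x = 0 for females and x = 1 for males, ln(odds) = B0 + B1*x gives:
b0 = math.log(odds_female)                        # ln(odds) for females
b1 = math.log(odds_male) - math.log(odds_female)  # difference in ln(odds)

# The antilog of the slope is the odds ratio (odds for males / odds for females).
odds_ratio = math.exp(b1)

# ASSUMPTION (not from the catechism): large-sample standard error of the
# slope when the data form a two-by-two table of counts.
se_b1 = math.sqrt(1 / male_yes + 1 / male_no + 1 / female_yes + 1 / female_no)

# 95% confidence interval for the slope, then antilogs for the odds ratio.
z_star = 1.96
slope_ci = (b1 - z_star * se_b1, b1 + z_star * se_b1)
odds_ratio_ci = (math.exp(slope_ci[0]), math.exp(slope_ci[1]))

# Statistic for H0: slope = 0, referred to chi-square with 1 degree of freedom.
chi_square = (b1 / se_b1) ** 2

print(f"odds: males {odds_male:.3f}, females {odds_female:.3f}")
print(f"B0 = {b0:.3f}, B1 = {b1:.3f}, odds ratio = {odds_ratio:.3f}")
print(f"95% CI for slope: ({slope_ci[0]:.3f}, {slope_ci[1]:.3f})")
print(f"95% CI for odds ratio: ({odds_ratio_ci[0]:.3f}, {odds_ratio_ci[1]:.3f})")
print(f"chi-square statistic: {chi_square:.1f}")
```

If the interval for the odds ratio excludes 1, that agrees with rejecting the hypothesis of zero slope at the .05 level, as in Q17.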
Q21. True or false: as in multiple linear regression, the regression coefficients obtained from a multiple logistic regression are meant to measure how much additional each variable contributes to the prediction over that contributed by the other variables, and not to measure how much that variable by itself predicts the response variable. A21. True. Q22. True or false: as in multiple linear regression, when you are doing multiple logistic regression, you can sometimes get results when, for example, one or two of three explanatory variables are included in the model that you would not have predicted from looking at the results when all three variables are included in the model. A22. True.
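The fitting procedure described in Q18 and Q19 of this chapter (transform each proportion to ln(odds), then run a least-squares regression on the resulting ordered pairs) can be sketched in a few lines of Python. The data below are HYPOTHETICAL, invented only to make the script runnable, and the helper names log_odds, least_squares, and fitted_proportion are illustrative rather than taken from any package. Statistical software normally fits logistic models by maximum likelihood instead, so its coefficients will generally differ somewhat from this least-squares version.

```python
import math

# HYPOTHETICAL data: five values of x and the observed proportion at each.
x_values = [1.0, 2.0, 3.0, 4.0, 5.0]
proportions = [0.10, 0.20, 0.35, 0.55, 0.75]


def log_odds(p):
    """Transform a proportion p (strictly between 0 and 1) to ln(p / (1 - p))."""
    return math.log(p / (1 - p))


def least_squares(xs, ys):
    """Return (intercept, slope) of the least-squares line through (xs, ys)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def fitted_proportion(b0, b1, x):
    """Invert ln(p / (1 - p)) = b0 + b1*x to recover the proportion p."""
    return 1 / (1 + math.exp(-(b0 + b1 * x)))


# Step 1: transform the proportions to ln(odds).
y_values = [log_odds(p) for p in proportions]

# Step 2: least-squares regression of ln(odds) on x.
b0, b1 = least_squares(x_values, y_values)

print(f"ln(p/(1-p)) = {b0:.3f} + {b1:.3f}x")
for x in x_values:
    print(f"x = {x}: fitted p = {fitted_proportion(b0, b1, x):.3f}")
```

Plotting y_values against x_values gives the scatterplot recommended in Q19, and the back-transformed fitted values always fall between 0 and 1, which is the point made in Q7.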