Review
• residual: the prediction error for an observation, which is the difference y − ŷ between the actual value and the predicted value of the response variable.
• Residual sum of squares: $\text{Residual sum of squares} = \sum (\text{residual})^2 = \sum (y - \hat{y})^2$
• Least Squares Method: among the possible lines that can go through the data points in a scatterplot, this method gives the regression line that has the smallest residual sum of squares when using ŷ = a + bx to predict y.
• The least squares regression line has some positive residuals and some negative residuals, but the sum (and the mean) of the residuals equals 0.
• The least squares regression line passes through the point (x̄, ȳ).
• The slope: $b = r \, \dfrac{s_y}{s_x}$. The y-intercept: $a = \bar{y} - b\bar{x}$.
• r-Squared ($r^2$): Interpretation: the proportion of the variation in the y-values that is accounted for by the linear relationship of y with x.
Some Cautions in Analyzing Associations
• Extrapolation is dangerous. Extrapolation refers to using a regression line to predict y values for x values outside the observed range of data.
• Be cautious of influential outliers.
• Correlation does not imply causation.
• lurking variable: a variable, usually unobserved, that influences the association between the variables of primary interest.
• A lurking variable may be a common cause of both the explanatory variable and the response variable.
• A change in the response variable may be due to multiple causes.
• experiment: assigning subjects to certain experimental conditions and then observing outcomes on the response variable.
• treatments: the experimental conditions, which correspond to assigned values of the explanatory variable.
• observational study (nonexperimental study): observing values of the response variable and explanatory variables for the sampled subjects, without anything being done to the subjects (such as imposing a treatment).
Advantage of Experiments over Observational Studies
• In an experiment, using some sort of “random” selection to determine which subjects receive each treatment “balances” the effects of lurking variables; that is, the treatment groups have similar distributions on other variables. Thus, we can study the effect of an explanatory variable on a response variable more accurately with an experiment than with an observational study.
• Why bother to do observational studies?
Data Types in Observational Studies
• anecdotal evidence
• sample survey: selecting a sample of subjects from a population and collecting data from them
• census: a survey of the whole population; expensive, time consuming, or impossible
• sampling frame: the list of subjects in the population from which the sample is taken. Ideally, the sampling frame lists the entire population. In practice, it’s usually hard to identify every subject in the population.
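Worked check of the Review formulas. The least squares facts from the Review (slope b = r·s_y/s_x, intercept a = ȳ − b·x̄, residuals summing to 0, and r² as the proportion of variation explained) can be verified numerically. The sketch below is only an illustration: the function name `least_squares` and the five data points are made up for this example, and the residual is taken as actual minus predicted (y − ŷ).

```python
from statistics import mean, stdev


def least_squares(x, y):
    """Return (a, b, r) for the line y_hat = a + b*x (hypothetical helper)."""
    n = len(x)
    x_bar, y_bar = mean(x), mean(y)
    s_x, s_y = stdev(x), stdev(y)  # sample standard deviations (n - 1 divisor)
    # Correlation: sum of products of z-scores, divided by n - 1.
    r = sum(((xi - x_bar) / s_x) * ((yi - y_bar) / s_y)
            for xi, yi in zip(x, y)) / (n - 1)
    b = r * s_y / s_x        # slope
    a = y_bar - b * x_bar    # y-intercept
    return a, b, r


# Toy data, invented for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

a, b, r = least_squares(x, y)

# Residual = actual value minus predicted value.
residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]
rss = sum(e ** 2 for e in residuals)           # residual sum of squares
tss = sum((yi - mean(y)) ** 2 for yi in y)     # total variation in y

print("slope b:", round(b, 4))
print("intercept a:", round(a, 4))
print("sum of residuals:", round(sum(residuals), 10))
print("r squared:", round(r ** 2, 4))
print("1 - RSS/TSS:", round(1 - rss / tss, 4))
```

For this toy data the fit works out to b = 0.6 and a = 2.2, so the line passes through (x̄, ȳ) = (3, 4). The residuals sum to 0 up to floating-point rounding, and r² agrees with 1 − RSS/TSS, which is the "proportion of variation accounted for" interpretation.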