The probability framework for statistical inference

Population
- The group or collection of entities of interest
- Here, "all possible" school districts. "All possible" school districts means all possible circumstances that lead to specific values of STR (student-teacher ratio) and test scores.
- The set of "all possible" school districts includes, but is much larger than, the set of 420 school districts observed in 1998.
- We will think of populations as infinitely large; the task is to make inferences from a sample drawn from a large population.

Random variable Y
- A random variable assigns a number to each member of the population in a particular way. The adjective "random" refers to the fact that the value the variable takes is determined by a drawing from the population.
- The district average test scores and the district STRs are random variables; their numerical values are determined once we choose a year/district to sample.

Characterizing random variables:
- Distribution
- Moments of the Distribution
- Joint and Marginal Distributions
- Covariance; Correlation
- Conditional Moments of the Distribution

Population distribution of Y
- Discrete random variables: the probabilities of the different values of Y that occur in the population. For example, Pr[Y = 650], when Y is discrete, is the proportion of elements of the population for which the value of Y is exactly 650.
- Continuous random variables: the probabilities of sets of values. For example, Pr[Y ≤ 650], when Y is continuous, is the proportion of elements of the population for which the value of Y is less than or equal to 650.

"Moments" of the population distribution
mean = expected value = E(Y) = μ_Y = long-run average value of Y over repeated realizations of Y
- For a discrete random variable, the mean is a weighted average of the possible values of Y, where the weight assigned to a given value of Y is the probability of that value.
- For a continuous random variable, the mean is found by integrating over all possible values of Y, weighting each value of Y by the density function evaluated at that value.
variance = E[(Y – μ_Y)²] = σ_Y² = measure of the squared spread of the distribution
standard deviation = √variance = σ_Y
Note that:
1. The variance is an expected value of a random variable.
2. The variance is in squared units of Y; the standard deviation is in the same units as Y.
(The discrete-case computations are illustrated in the first sketch at the end of this part.)

Joint distributions
Corresponding to each member of the population there may be more than one value assigned, e.g., test score (Y) and STR (X). There is a probability distribution for Y (from which we can derive the mean and variance of Y) and a probability distribution for X (from which we can derive the mean and variance of X). The joint probability distribution of Y and X gives the probability that Y and X take on the values y and x, respectively (if Y and X are discrete random variables), i.e., Pr(Y = y and X = x), or the probability that (Y, X) lies in some subset of R² (if Y and X are continuous random variables), e.g., Pr(Y ≤ y and X ≤ x). For example, what is the probability of drawing a district from the population for which the average test score is 650 and the STR is 20? The marginal distributions of Y and X are simply the individual probability distributions of Y and X; they can be recovered from the joint distribution (although the reverse isn't true), as the second sketch below illustrates.
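A minimal Python sketch of the discrete-case moment formulas above; the probability table for Y is invented for illustration and is not taken from the district data.

    import numpy as np

    # Hypothetical population distribution of a discrete random variable Y
    # (test-score values and their probabilities are invented for illustration).
    y = np.array([630.0, 640.0, 650.0, 660.0, 670.0])
    p = np.array([0.10, 0.25, 0.30, 0.25, 0.10])   # probabilities; sum to 1

    mean_y = np.sum(p * y)                   # E(Y): probability-weighted average
    var_y = np.sum(p * (y - mean_y) ** 2)    # E[(Y - mu_Y)^2]
    sd_y = np.sqrt(var_y)                    # same units as Y

    print(mean_y, var_y, sd_y)               # 650.0, 130.0, ~11.40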
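And a sketch of recovering the marginal distributions from a joint distribution; the joint probability table is again invented:

    import numpy as np

    # Hypothetical joint distribution of X (STR, rows) and Y (test score, columns);
    # joint[i, j] = Pr(X = x_i and Y = y_j). All cell probabilities are invented.
    joint = np.array([[0.05, 0.15, 0.20],   # X = 18
                      [0.10, 0.25, 0.25]])  # X = 22

    # Marginals: sum the joint probabilities over the other variable.
    marginal_x = joint.sum(axis=1)   # Pr(X = x): [0.40, 0.60]
    marginal_y = joint.sum(axis=0)   # Pr(Y = y): [0.15, 0.40, 0.45]

    # The reverse fails in general: the product of the marginals need not
    # reproduce the joint (it does only under independence; see below).
    print(np.allclose(joint, np.outer(marginal_x, marginal_y)))  # False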
Independence
The random variables Y and X are independent if (and only if) their joint distribution factors into the product of their marginal distributions, i.e.,
Pr(Y = y and X = x) = Pr(Y = y) × Pr(X = x) (discrete case)
Pr(Y ≤ y and X ≤ x) = Pr(Y ≤ y) × Pr(X ≤ x) (continuous case)
for all x and y.

Covariance
The covariance between the random variables X and Y is
cov(X, Y) = E[(X – μ_X)(Y – μ_Y)] = σ_XY
- cov(X, Y) > 0: X and Y are positively related; when X is above (below) its mean, Y tends to be above (below) its mean.
- cov(X, Y) < 0: X and Y are negatively related; when X is above (below) its mean, Y tends to be below (above) its mean.
(We hypothesize that the random variables test score and STR have a negative covariance.)
If X and Y are independently distributed, then cov(X, Y) = 0 (but not vice versa!).

Correlation
The correlation coefficient is defined in terms of the covariance:
corr(X, Y) = cov(X, Y) / (√var(X) √var(Y)) = σ_XY / (σ_X σ_Y) = r_XY
–1 ≤ corr(X, Y) ≤ 1
- corr(X, Y) = 1 means perfect positive linear association
- corr(X, Y) = –1 means perfect negative linear association
- corr(X, Y) = 0 means no linear association
(The first sketch at the end of this section computes a covariance and correlation from simulated data.)

Conditional distributions
The distribution of Y, given value(s) of some other random variable X. Conditional distributions are thus distributions of "subpopulations," created from the original population according to some criterion. Example: the distribution of test scores, given that STR < 20. (Divide the population into two subpopulations according to their STRs, then consider the distribution of test scores within each subpopulation.)

Moments of conditional distributions
conditional mean = mean of the conditional distribution = E(Y|X = x) (important notation)
conditional variance = variance of the conditional distribution
Example: E(Test scores|STR < 20) is the mean of test scores for districts with small class sizes; Var(Test scores|STR < 20) is the variance of test scores for districts with small class sizes.
The difference in means Δ is the difference between the means of two conditional distributions:
Δ = E(Test scores|STR < 20) – E(Test scores|STR ≥ 20)
Other examples of conditional means:
- Wages of all female workers (Y = wages, X = gender)
- One-year mortality rate of those given an experimental treatment (Y = live/die; X = treated/not treated)
The conditional mean is a new term for a familiar idea: the group mean. (The second sketch below computes conditional means and their difference as group means.)

Inference about means, conditional means, and differences in conditional means
We would like to know Δ (the test-score gap; the gender wage gap; the effect of an experimental treatment), but we don't know it. (Didn't we calculate it last week? What we calculated was the difference in sample means for the 420 observed districts, not the population Δ.) Therefore we must collect and use data, sampling from the population, permitting us to make statistical inferences about Δ.
- Experimental data
- Observational data

Simple random sampling
Choose an individual (district, entity) at random from the population.
Randomness and data:
- Prior to sample selection, the value of Y for that individual is random, because which individual is selected is random.
- Once the individual is selected and the value of Y is observed, Y is just a number, not random.
The data set is (Y1, Y2,…, Yn), where Yi = value of Y for the ith individual (district, entity) sampled. Thus, the data set is made up of realized values of n random variables.

Implications of simple random sampling
Because individuals #1 and #2 are selected at random, the value of Y1 has no information content for Y2. Thus:
- Y1 and Y2 are independently distributed.
- Y1 and Y2 come from the same distribution; that is, Y1 and Y2 are identically distributed.
That is, a consequence of simple random sampling is that Y1 and Y2 are independently and identically distributed (i.i.d.). More generally, under simple random sampling, {Yi}, i = 1,…, n, are i.i.d. (The third sketch below simulates this sampling scheme.)
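A minimal Python sketch of the covariance and correlation formulas, using simulated district data; the data-generating numbers (means, slope, noise) are invented, not estimates from the California data.

    import numpy as np

    rng = np.random.default_rng(0)

    # Simulate a large "population" of districts in which STR (X) and test
    # score (Y) are negatively related; all parameters are invented.
    n = 100_000
    x = rng.normal(20.0, 2.0, size=n)                     # student-teacher ratio
    y = 700.0 - 2.5 * x + rng.normal(0.0, 15.0, size=n)   # test score

    cov_xy = np.mean((x - x.mean()) * (y - y.mean()))   # E[(X - mu_X)(Y - mu_Y)]
    corr_xy = cov_xy / (x.std() * y.std())              # sigma_XY / (sigma_X sigma_Y)

    print(cov_xy, corr_xy)   # covariance < 0; correlation lies in [-1, 1]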
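The conditional means and their difference Δ are then just group means over the two subpopulations (same invented simulation):

    import numpy as np

    rng = np.random.default_rng(1)

    # Same invented data-generating process as above.
    str_ = rng.normal(20.0, 2.0, size=100_000)
    score = 700.0 - 2.5 * str_ + rng.normal(0.0, 15.0, size=100_000)

    small = str_ < 20.0                 # subpopulation with small classes
    mean_small = score[small].mean()    # E(Test scores | STR < 20)
    mean_large = score[~small].mean()   # E(Test scores | STR >= 20)
    delta = mean_small - mean_large     # difference in conditional means

    print(mean_small, mean_large, delta)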
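Finally, a sketch of simple random sampling from a finite stand-in for the population; sampling with replacement mimics the "infinitely large" population and makes the draws exactly i.i.d. The population parameters are invented.

    import numpy as np

    rng = np.random.default_rng(2)

    # Finite stand-in for the population of district test scores
    # (mean and spread are invented numbers).
    population = rng.normal(654.0, 19.0, size=1_000_000)

    # Simple random sampling: every individual is equally likely on each
    # draw; with replacement, Y_1, ..., Y_n are independent and share the
    # population distribution, i.e., they are i.i.d.
    n = 420
    sample = rng.choice(population, size=n, replace=True)

    # Once drawn, the Y_i are realized values: just numbers, not random.
    print(sample.mean(), sample.std())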