The probability framework for statistical inference

 Population: the group or collection of entities of interest
 Here, “all possible” school districts
 “All possible” school districts means “all
possible” circumstances that lead to
specific values of STR (student-teacher
ratio), test scores
 The set of “all possible” school
districts includes but is much larger than
the set of 420 school districts observed
in 1998.
 We will think of populations as infinitely
large; the task is to make inferences
from a sample from a large population
Random variable Y
 A random variable assigns a
number to each member of the
population in a particular way.
 The adjective “random” refers to
the fact that the value the variable
takes is determined by a drawing
from the population.
 The district average test scores
and the district STRs are random
variables; their numerical values
are determined once we choose a
year/district to sample.
Characterizing Random Variables:
Moments of the Distribution
Joint and Marginal Distributions
Covariance; Correlation
Conditional Moments of the Distribution
Population distribution of Y
 Discrete Random Variables: The
probabilities of different values of Y that
occur in the population
For ex. Pr[Y = 650], when Y is
discrete. This probability is the
proportion of elements of the
population for which the value of
Y is exactly equal to 650.
 Continuous Random Variables: The
probabilities of sets of these values
For ex. Pr[Y ≤ 650], when Y is
continuous. This probability is the
proportion of elements of the population
for which the value of Y is less than or
equal to 650.
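The "proportion of the population" reading of these probabilities can be sketched in a few lines of Python. This is only an illustration: the eight scores below are made-up numbers standing in for a population, and the function names are hypothetical.

```python
# Hypothetical finite "population" of district test scores (made-up numbers).
population = [620, 635, 650, 650, 661, 672, 650, 644]

def prob_equal(values, y):
    """Pr[Y = y]: share of population members whose value equals y."""
    return sum(1 for v in values if v == y) / len(values)

def prob_at_most(values, y):
    """Pr[Y <= y]: share of population members whose value is at most y."""
    return sum(1 for v in values if v <= y) / len(values)

p_eq = prob_equal(population, 650)    # 3 of the 8 scores equal 650
p_le = prob_at_most(population, 650)  # 6 of the 8 scores are at most 650
print(p_eq, p_le)
```

For a truly continuous Y, Pr[Y = 650] would be zero, which is why probabilities of sets such as Y ≤ 650 are used instead.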
“Moments” of the population distribution
mean = expected value = E(Y) = $\mu_Y$
= long-run average value of Y
over repeated realizations of Y
For a discrete random variable, the mean is
found by a weighted average of each
possible value of Y, where the weight
assigned to a given value of Y is the
probability of that Y.
For a continuous random variable the mean
is found by integrating over all possible
values of Y weighting each value of Y by
the “density function” evaluated at that Y.
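For the discrete case, the weighted-average rule can be written out directly. The sketch below assumes a made-up three-point distribution for Y; the dictionary `pmf` and the function name `expected_value` are illustrative choices, not part of the lecture.

```python
from fractions import Fraction

# Hypothetical distribution: each possible value of Y mapped to its probability.
pmf = {640: Fraction(1, 4), 650: Fraction(1, 2), 660: Fraction(1, 4)}

def expected_value(pmf):
    """Mean of a discrete random variable: sum of value * probability."""
    assert sum(pmf.values()) == 1, "probabilities must sum to one"
    return sum(y * p for y, p in pmf.items())

mu_Y = expected_value(pmf)
print(mu_Y)  # 640/4 + 650/2 + 660/4 = 650
```

Exact fractions are used so the weights sum to exactly one; with floating-point weights the same function works but the check on the total would need a tolerance.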
variance = $E[(Y - \mu_Y)^2]$
= $\sigma_Y^2$
= measure of the squared
spread of the distribution
standard deviation = $\sqrt{\text{variance}}$ = $\sigma_Y$
Note that –
1. The variance is an expected
value of a random variable.
2. The variance is in squared units
of Y; the standard deviation is in
the same units as Y.
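Both notes can be checked on a small example. The sketch below assumes a made-up three-point distribution for Y (the values and the helper names `mean` and `variance` are hypothetical) and computes the variance as the expected value of the squared deviation from the mean.

```python
import math
from fractions import Fraction

# Hypothetical distribution of Y (made-up numbers).
pmf = {640: Fraction(1, 4), 650: Fraction(1, 2), 660: Fraction(1, 4)}

def mean(pmf):
    """E(Y) for a discrete random variable."""
    return sum(y * p for y, p in pmf.items())

def variance(pmf):
    """E[(Y - mu_Y)^2]: the expected value of the squared deviation."""
    mu = mean(pmf)
    return sum((y - mu) ** 2 * p for y, p in pmf.items())

var_Y = variance(pmf)    # in squared units of Y
sd_Y = math.sqrt(var_Y)  # back in the units of Y
print(var_Y, sd_Y)
```

Here the variance is 50 (squared test-score points) and the standard deviation is about 7.07 (test-score points).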
Joint distributions
 Corresponding to each member
of the population there may be
more than one value assigned.
E.g., test score (Y) and STR (X)
 There is a probability
distribution for Y (from which
we can derive the mean and
variance of Y) and a probability
distribution for X (from which
we can derive the mean and
variance of X).
 The joint probability
distribution for Y and X
provides the probability that the
random variables Y and X take
on the values y and x,
respectively (if Y and X are
discrete random variables), i.e.,
Prob(Y=y and X=x), or the
probability that the random
variables Y and X lie in some
subset of R^2 (if Y and X are
continuous random variables),
i.e., Prob(Y<y and X<x).
 For example, what is the
probability of drawing a district
from the population for which
the average test score is 650
and the STR is 20?
 The marginal distributions of
Y and X are simply the
individual probability
distributions of Y and X, which
can be recovered from their
joint distribution (although the
reverse isn’t true.)
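Recovering the marginals from a discrete joint distribution amounts to summing the joint probabilities over the other variable. A minimal sketch, assuming a made-up 2x2 joint table for test score (Y) and STR (X); the table and the function name `marginals` are hypothetical.

```python
from collections import defaultdict
from fractions import Fraction

# joint[(y, x)] = Prob(Y=y and X=x); made-up numbers for illustration.
joint = {
    (640, 18): Fraction(1, 10), (640, 22): Fraction(3, 10),
    (660, 18): Fraction(4, 10), (660, 22): Fraction(2, 10),
}

def marginals(joint):
    """Sum the joint probabilities over the other variable."""
    p_y, p_x = defaultdict(Fraction), defaultdict(Fraction)
    for (y, x), p in joint.items():
        p_y[y] += p  # sum over x gives Prob(Y=y)
        p_x[x] += p  # sum over y gives Prob(X=x)
    return dict(p_y), dict(p_x)

p_y, p_x = marginals(joint)
print(p_y, p_x)
```

The reverse direction fails because many different joint tables share these same two marginals.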
 The random variables Y and X
are independent if (and only
if) their joint distribution
factors into the product of their
marginal distributions, i.e.,
Prob(Y=y and X=x) =
Prob(Y=y) Prob(X=x)
in the discrete case, or
Prob(Y<y and X<x) =
Prob(Y<y) Prob(X<x)
in the continuous case, for all x and y.
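For discrete variables the factorization condition can be tested cell by cell. The sketch below is an illustration under assumed data: both joint tables are made up, and `is_independent` is a hypothetical helper name.

```python
from fractions import Fraction
from itertools import product

def is_independent(joint):
    """True if joint[(y, x)] == Prob(Y=y) * Prob(X=x) for every pair (y, x)."""
    ys = {y for y, _ in joint}
    xs = {x for _, x in joint}
    p_y = {y: sum(p for (yy, _), p in joint.items() if yy == y) for y in ys}
    p_x = {x: sum(p for (_, xx), p in joint.items() if xx == x) for x in xs}
    # Pairs missing from the table are treated as having probability zero.
    return all(joint.get((y, x), 0) == p_y[y] * p_x[x]
               for y, x in product(ys, xs))

# Made-up joint table in which low STR goes with high scores (dependent).
dependent = {
    (640, 18): Fraction(1, 10), (640, 22): Fraction(3, 10),
    (660, 18): Fraction(4, 10), (660, 22): Fraction(2, 10),
}
# Same marginals, but built as the product of the marginals (independent).
independent = {
    (640, 18): Fraction(1, 5), (640, 22): Fraction(1, 5),
    (660, 18): Fraction(3, 10), (660, 22): Fraction(3, 10),
}
print(is_independent(dependent), is_independent(independent))
```

The two tables have identical marginal distributions, which also illustrates why the joint distribution cannot be recovered from the marginals alone.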
The covariance between r.v.’s X and
Y is
$\mathrm{cov}(X,Y) = E[(X - \mu_X)(Y - \mu_Y)] = \sigma_{XY}$
 cov(X,Y) > 0: X and Y are
positively related; when X is above
(below) its mean, Y tends to be
above (below) its mean. cov(X,Y) <
0 … (We hypothesize that the random
variables test score and STR have a
negative covariance.)
 If X and Y are independently
distributed, then cov(X,Y) = 0 (but
not vice versa!!)
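A standard counterexample for the "not vice versa" part: let X take the values -1, 0, 1 with equal probability and set Y = X^2. The sketch below (helper names are hypothetical) computes the covariance from the definition and then checks one cell of the joint distribution against the product of the marginals.

```python
from fractions import Fraction

def expectation(joint, g):
    """E[g(X, Y)] under a discrete joint distribution keyed by (x, y)."""
    return sum(g(x, y) * p for (x, y), p in joint.items())

def covariance(joint):
    """cov(X, Y) = E[(X - mu_X)(Y - mu_Y)]."""
    mu_x = expectation(joint, lambda x, y: x)
    mu_y = expectation(joint, lambda x, y: y)
    return expectation(joint, lambda x, y: (x - mu_x) * (y - mu_y))

# X is -1, 0 or 1 with equal probability, and Y = X^2.
third = Fraction(1, 3)
joint = {(-1, 1): third, (0, 0): third, (1, 1): third}

cov_xy = covariance(joint)

# Independence would require Prob(X=0 and Y=0) = Prob(X=0) * Prob(Y=0).
p_x0 = sum(p for (x, _), p in joint.items() if x == 0)
p_y0 = sum(p for (_, y), p in joint.items() if y == 0)
factorizes_at_00 = joint[(0, 0)] == p_x0 * p_y0
print(cov_xy, factorizes_at_00)
```

The covariance is exactly zero even though Y is a deterministic function of X, so zero covariance rules out only a linear relationship.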
The correlation coefficient is defined
in terms of the covariance:
$\mathrm{corr}(X,Y) = \frac{\mathrm{cov}(X,Y)}{\sqrt{\mathrm{var}(X)\,\mathrm{var}(Y)}} = \frac{\sigma_{XY}}{\sigma_X \sigma_Y} = r_{XY}$
 –1  corr(X,Y)  1
 corr(X,Y) = 1 mean perfect
positive linear association
 corr(X,Y) = –1 means perfect
negative linear association
 corr(X,Y) = 0 means no linear
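To see the boundary case corr(X,Y) = –1, one can build a joint distribution in which Y is an exact decreasing linear function of X. The numbers below (Y = 700 – 2X on three equally likely STR values) are made up, and `correlation` is a hypothetical helper that applies the definition as covariance over the product of standard deviations.

```python
import math
from fractions import Fraction

def expectation(joint, g):
    """E[g(X, Y)] under a discrete joint distribution keyed by (x, y)."""
    return sum(g(x, y) * p for (x, y), p in joint.items())

def correlation(joint):
    """corr(X, Y) = cov(X, Y) / sqrt(var(X) * var(Y))."""
    mu_x = expectation(joint, lambda x, y: x)
    mu_y = expectation(joint, lambda x, y: y)
    cov = expectation(joint, lambda x, y: (x - mu_x) * (y - mu_y))
    var_x = expectation(joint, lambda x, y: (x - mu_x) ** 2)
    var_y = expectation(joint, lambda x, y: (y - mu_y) ** 2)
    return float(cov) / math.sqrt(float(var_x * var_y))

# Made-up example: Y = 700 - 2X exactly, X equally likely to be 18, 20, 22.
third = Fraction(1, 3)
joint = {(x, 700 - 2 * x): third for x in (18, 20, 22)}

corr_xy = correlation(joint)
print(corr_xy)
```

Unlike the covariance, the result is unit-free: rescaling X or Y by a positive constant leaves it unchanged.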
Conditional distributions
 The distribution of Y, given
value(s) of some other random
variable, X
 So, conditional distributions are
distributions of “subpopulations,”
created from the original
population according to some criterion
 Ex: the distribution of test scores,
given that STR < 20. (Divide the
population into two subpopulations
according to their STRs. Then
consider the distribution of test
scores for each subpopulation.)
Moments of conditional distributions
 conditional mean = mean of
conditional distribution
= E(Y|X = x) (important notation)
 conditional variance = variance of
conditional distribution
 Example: E(Test scores|STR <
20), the mean of test scores for
districts with small class sizes;
Var(Test scores|STR < 20), the
variance of test scores for districts
with small class sizes;
The difference in means is the
difference between the means of
two conditional distributions:
Δ = E(Test scores|STR < 20) – E(Test
scores|STR ≥ 20)
Other examples of conditional means:
 Wages of all female workers (Y =
wages, X = gender)
 One-year mortality rate of those
given an experimental treatment (Y
= live/die; X = treated/not treated)
The conditional mean is a new term
for a familiar idea: the group mean
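The group-mean reading can be made concrete with a short sketch. The six (STR, test score) pairs below are made-up numbers treated as if they were the whole population, and `conditional_mean` is a hypothetical helper; the difference in conditional means is then just the gap between two group averages.

```python
# Hypothetical (STR, test score) pairs, treated here as the whole population.
districts = [
    (17.5, 664.0), (19.0, 658.0), (18.2, 655.0),
    (21.0, 648.0), (22.5, 652.0), (20.0, 644.0),
]

def conditional_mean(pairs, condition):
    """Mean test score among members whose STR satisfies `condition`."""
    selected = [score for ratio, score in pairs if condition(ratio)]
    return sum(selected) / len(selected)

mean_small = conditional_mean(districts, lambda ratio: ratio < 20)   # E(score | STR < 20)
mean_large = conditional_mean(districts, lambda ratio: ratio >= 20)  # E(score | STR >= 20)
delta = mean_small - mean_large  # difference in conditional means
print(mean_small, mean_large, delta)
```

With these numbers the two group means are 659 and 648, so the difference is 11 points.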
Inference about means, conditional
means, and differences in
conditional means
We would like to know Δ (test score
gap; gender wage gap; effect of
experimental treatment), but we don’t
know it. (We don’t know it? Didn’t
we calculate Δ last week?)
Therefore we must collect and use
data by sampling from the population,
permitting us to make statistical
inferences about Δ.
 Experimental data
 Observational data
Simple random sampling
 Choose an individual (district,
entity) at random from the population
Randomness and data
 Prior to sample selection, the value
of Y is random because the
individual selected is random
 Once the individual is selected and
the value of Y is observed, then Y
is just a number – not random
 The data set is (Y1, Y2,…, Yn),
where Yi = value of Y for the ith
individual (district, entity)
 Thus, the data set is made up of
realized values of n random variables
Implications of simple random sampling
Because individuals #1 and #2 are
selected at random, the value of Y1 has
no information content for Y2. Thus:
 Y1, Y2 are independently distributed
 Y1 and Y2 come from the same
distribution, that is, Y1, Y2 are
identically distributed
 That is, a consequence of simple
random sampling is that Y1 and Y2
are independently and identically
distributed (i.i.d.).
 More generally, under simple
random sampling, {Yi}, i = 1,…, n,
are i.i.d.
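A small simulation can illustrate what simple random sampling produces. The sketch below makes two assumptions that are not in the lecture: the eight scores are made-up stand-ins for a population, and each draw is made with replacement, which mimics drawing from an infinitely large population so that every draw comes from the same distribution and does not depend on the others.

```python
import random

# Hypothetical population of district test scores (made-up numbers).
population = [620, 635, 650, 650, 661, 672, 650, 644]

def simple_random_sample(values, n, rng):
    """Draw n values, each chosen at random and independently of the others."""
    return [rng.choice(values) for _ in range(n)]

rng = random.Random(0)  # fixed seed so the run is reproducible
sample = simple_random_sample(population, 5, rng)  # realized (Y1, ..., Y5)
sample_mean = sum(sample) / len(sample)
print(sample, sample_mean)
```

Before the draw, each Yi is random; after the draw, `sample` holds ordinary numbers. Rerunning with a different seed gives a different realized data set from the same population.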