Notes 1 - Wharton Statistics Department

Statistics 550 Notes 1
Reading: Section 1.1.
I. Basic definitions and examples of models (Section 1.1.1)
Goal of statistics: Draw useful information from data.
Model based approach to statistics: Treat data as the
outcome of a random experiment that we model
mathematically. We draw useful information from the data
by drawing inferences about parameters that describe the
random experiment.
Random Experiment: Any procedure that (1) can be
repeated, theoretically, over and over; and (2) has a
well-defined set of possible outcomes (the sample space).
The outcome of the experiment is data $X$.
Examples of experiments and data:
• Experiment: Randomly select 1000 people without
replacement from the U.S. adult population and ask
them whether they are employed. Data:
$X = (X_1, \ldots, X_{1000})$, where $X_i = 1$ if person $i$ in the
sample is employed and $X_i = 0$ if person $i$ in the sample is
not employed.
• Experiment: Randomly sample 500 handwritten ZIP
codes on envelopes from U.S. postal mail. Data:
$X = (X_1, \ldots, X_{500})$, where $X_i$ is a 216 x 72 matrix whose
elements are numbers from 0 to 255 that represent the
intensity of writing in each part of the image.
The probability distribution for the data X over repeated
experiments is P .
Frequentist concept of probability:
$P(X \in E)$ = Proportion of times in repeated experiments
that the data $X$ falls in the set $E$.
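The frequentist definition can be illustrated by simulation. The sketch below is not part of the original notes: it assumes a hypothetical experiment of 10 independent 0/1 trials with success probability 0.5, takes $E$ to be the event "at least 7 successes", and approximates $P(X \in E)$ by the proportion of repeated experiments in which the data fall in $E$. The function names, the seed, and all numbers are illustrative assumptions.

```python
import random
from math import comb


def run_experiment(rng, n=10, p=0.5):
    """One run of the hypothetical experiment: n independent 0/1 trials."""
    return [1 if rng.random() < p else 0 for _ in range(n)]


def in_event(x, threshold=7):
    """E = {outcomes with at least `threshold` successes} (illustrative choice)."""
    return sum(x) >= threshold


def frequency_of_event(n_repeats, seed=0):
    """Proportion of repeated experiments in which the data fall in E."""
    rng = random.Random(seed)
    hits = sum(in_event(run_experiment(rng)) for _ in range(n_repeats))
    return hits / n_repeats


# Exact probability for comparison: P(at least 7 successes in 10 fair trials).
exact = sum(comb(10, k) for k in range(7, 11)) / 2**10

estimate = frequency_of_event(100_000)
print(f"exact = {exact:.4f}, proportion in repeated experiments = {estimate:.4f}")
```

As the number of repetitions grows, the proportion settles near the exact value, which is what the frequentist reading of $P(X \in E)$ describes.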
(Statistical) Model: Family of possible $P$'s:
$P \in \mathcal{P} = \{P_\theta, \theta \in \Theta\}$. The $\theta$'s label the $P$'s and $\Theta$ is a
space of labels called the parameter space.
Goal of statistical inference: On the basis of the data $X$,
make inferences about the true $P$ that generated the data.
We will study three types of inferences:
(1) Point estimation – best estimate of $\theta$.
(2) Hypothesis testing – decide whether $\theta$ is in a
specified subset of $\Theta$.
(3) Interval (set) estimation – estimate a set that $\theta$ lies
in.
Goal of this course: Study how to make “good” inferences.
Example of a statistical model:
Example 1: Yao Ming’s free throw shooting.
What is the probability p that Yao Ming will make a free
throw the next time he attempts one in an NBA game?
Data: In the 2007-2008 season, Yao made 345 out of the
406 free throws he attempted (85.0%). Let $X_1, \ldots, X_{406}$
denote whether or not Yao made his 1st, 2nd, ..., 406th free
throw of the season (1 denotes made, 0 denotes missed).
Model 1: IID Bernoulli model. $X_1, \ldots, X_{406}$ are
independent and identically distributed (iid) random
variables with a Bernoulli($p$) distribution.
$\theta = p$, $\Theta = [0,1]$.
Model 2: Markov chain model. Let $X_2, \ldots, X_{406}$ follow a
Markov chain with transition matrix

                    Last Shot
This Shot       Made      Missed
Make             a          b
Miss             c          d

where each entry is the probability of the outcome of this
shot given the outcome of the last shot (so $a + c = 1$ and
$b + d = 1$), and the stationary probability of making a free
throw, $b/(b+c)$, is defined as $p$. We furthermore assume that
the initial free throw $X_1$ is drawn from the stationary
distribution, i.e., $X_1 \sim$ Bernoulli($p$). In this model, each of
the free throws $X_1, \ldots, X_{406}$ has a marginal Bernoulli($p$)
distribution, but the free throws can be dependent. The
model can be specified as
$\theta = (b, c)$, $\Theta = [0,1] \times [0,1]$, $p \in [0,1]$.
Other modeling issues:
(1) Should we use just the 2007-2008 data or also use previous
seasons' data? (Yao's free throw shooting might be
changing over time.)
(2) Even if we knew $p$ in the above models, is this the
right $p$ to use to predict Yao's first free throw in the
2008-2009 season? (Again, Yao's free throw shooting might
change over time.)
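Returning to Model 2, the claim that $b/(b+c)$ is the stationary probability of a made free throw can be checked numerically. The sketch below is illustrative only. It assumes $b$ = P(make | last shot missed) and $c$ = P(miss | last shot made), as in the transition matrix of Model 2, and the values $b = 0.8$, $c = 0.1$ are hypothetical, not estimates from Yao's data. The function names are also assumptions made for the example.

```python
def stationary_make_probability(b, c):
    """Stationary probability of a make, p = b / (b + c).

    Assumes b = P(make | last shot missed), c = P(miss | last shot made),
    and b + c > 0 (otherwise the stationary distribution is not unique).
    """
    if b + c <= 0:
        raise ValueError("b + c must be positive")
    return b / (b + c)


def next_make_probability(p_make, b, c):
    """P(this shot made) given P(last shot made) = p_make."""
    a = 1.0 - c  # P(make | last shot made)
    return p_make * a + (1.0 - p_make) * b


b, c = 0.8, 0.1  # hypothetical values
p = stationary_make_probability(b, c)

# Starting from the stationary distribution, the marginal stays at p.
marginal = p
for _ in range(405):  # shots 2 through 406
    marginal = next_make_probability(marginal, b, c)

print(p, marginal)
```

When $b = 1 - c$, the probability of a make does not depend on the last shot, and the chain reduces to the iid Bernoulli model (Model 1).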
Choosing models:
Consultation with subject matter experts and knowledge
about how the data are collected are important for selecting
a reasonable model.
George Box (1979): “Models, of course, are never true but
fortunately it is only necessary that they be useful.”
We will focus mostly on making inferences about the true
$P$ conditional on the model's validity, i.e., assuming the true $P$
belongs to the family of distributions specified by the
model, $P \in \mathcal{P} = \{P_\theta, \theta \in \Theta\}$. Another important step in
data analysis is to investigate the model's validity through
diagnostics (techniques for doing this will be discussed in
Chapter 4).
II. Parameterization and Parameters (Section 1.1.2)
Model: $P \in \mathcal{P} = \{P_\theta, \theta \in \Theta\}$.
Parameterization: A way of labeling the distributions in the
model. Formally, an onto map $\theta \mapsto P_\theta$ from a parameter space
$\Theta$ to $\mathcal{P}$ is called a parameterization of $\mathcal{P}$.
Example 1 continued: In Model 1 for Yao's free throw
shooting, a parameterization is $\theta = p$, $\Theta = [0,1]$, where
$X_1, \ldots, X_{406}$ are iid Bernoulli($p$).
The parameterization is not unique. In Model 1 for Yao's
free throw shooting, we could also use the parameterization
$\theta = 10p$, $\Theta = [0,10]$ to label the distributions in the model.
We try to choose a parameterization in which the
components of the parameterization are interpretable
in terms of the phenomenon we are trying to measure.
Example 2: The level of phosphate in the blood of kidney
dialysis patients is of concern because kidney failures and
dialysis can lead to nutritional problems. Phosphate levels
tend to vary normally over time. Doctors are interested in
the mean level of phosphate over a period of time. A blood
test was performed on a dialysis patient on six consecutive
clinic visits. The data are $(X_1, \ldots, X_6)$, the milligrams of
phosphate per deciliter at the six visits. The model is
$(X_1, \ldots, X_6)$ iid $N(\mu, \sigma^2)$ (where $\mu$, $\sigma^2$ are the mean and
variance of the normal distribution respectively).
Three possible parameterizations are $(\mu, \sigma^2)$, $(\mu, \mu^2 + \sigma^2)$
and $(\mu^2 + \sigma^2, \sigma^2 \,\mathrm{sign}(\mu))$. The first two parameterizations
are more interpretable because they contain one parameter
$\mu$ that corresponds exactly to what we are interested in, the
patient's mean phosphate level.
Parametric vs. Nonparametric models: Models in which
$\Theta$ is a nice subset of a finite dimensional Euclidean space
are called “parametric” models, e.g., the model in Example
2 is parametric. Models in which $\Theta$ is infinite dimensional
are called “nonparametric.” For example, if in Example 2
we considered $(X_1, \ldots, X_6)$ iid from any distribution with a
density, the model would be nonparametric.
Identifiability: The parameterization is identifiable if the
map $\theta \mapsto P_\theta$ is one-to-one, i.e., if $\theta_1 \neq \theta_2 \Rightarrow P_{\theta_1} \neq P_{\theta_2}$.
The parameterization is unidentifiable if there exist
$\theta_1 \neq \theta_2$ such that $P_{\theta_1} = P_{\theta_2}$.
When the parameterization is unidentifiable, parts of
$\theta$ remain unknowable even with “infinite amounts of data”,
i.e., even if we knew the true $P$.
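Example 3: Suppose $X_1, \ldots, X_n$ are iid exponential with mean
$\mu$, i.e.,
$$p(x_1, \ldots, x_n) = \frac{1}{\mu^n} \exp\left(-\sum_{i=1}^n x_i / \mu\right).$$
The parameterization $\theta = \mu$ is identifiable. The
parameterization $\theta = (\theta_1, \theta_2)$, $\mu = \theta_1 \theta_2$, $\Theta = (0, \infty) \times (0, \infty)$
is unidentifiable because $P_{(\theta_1, \theta_2)} = P_{(\theta_1 \theta_2, 1)}$.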
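The unidentifiability in Example 3 can be checked numerically: two different labels with the same product give the same joint density at every data point. The sketch below is illustrative. The function names and the data values are hypothetical and are not part of the original notes.

```python
import math


def exp_joint_density(xs, mu):
    """Joint density of iid exponential observations with mean mu."""
    n = len(xs)
    return mu ** (-n) * math.exp(-sum(xs) / mu)


def density_under_theta(xs, theta):
    """Density under the labeling theta = (theta1, theta2), mu = theta1 * theta2."""
    theta1, theta2 = theta
    return exp_joint_density(xs, theta1 * theta2)


xs = [0.5, 1.2, 3.0]  # hypothetical data values

# Two different labels with the same product theta1 * theta2 = 6.
d1 = density_under_theta(xs, (2.0, 3.0))
d2 = density_under_theta(xs, (6.0, 1.0))
print(d1 == d2)  # compares the densities assigned by the two labels
```

Because the density depends on the label only through the product, no amount of data can separate $(2, 3)$ from $(6, 1)$.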
Note on notation: We will use p() to denote the
probability mass function if the distribution is discrete or
the probability density function if the distribution is
continuous.
Parameter: A parameter is a feature $\nu(P)$ of $P$, i.e., a map
from $\mathcal{P}$ to another space $\mathcal{N}$.
e.g., for Example 2, $(X_1, \ldots, X_6)$ iid $N(\mu, \sigma^2)$,
$\mu$, the mean of each $X_i$, is a parameter.
$\sigma^2$, the variance of each $X_i$, is a parameter.
$\mu^2 + \sigma^2 = E(X_i^2)$ is a parameter.
Some parameters are of interest and others are nuisance
parameters that are not of central interest.
In Example 2, for the parameterization $(\mu, \sigma^2)$, the
parameter $\mu$ is the parameter of interest and the parameter
$\sigma^2$ is a nuisance parameter. The doctors' primary interest
is in the mean phosphate level.
A parameter is by definition identified, meaning that if we
knew the true P , we would know the parameter.
Proposition: For a given parameterization $\theta \mapsto P_\theta$, $\theta$ is a
parameter if and only if the parameterization is identifiable.
Proof: If the parameterization is identifiable, then the
parameterization, which maps $\Theta \to \mathcal{P}$, is one-to-one and onto,
so it has an inverse $\nu$ mapping $\mathcal{P} \to \Theta$, and $\theta = \nu(P_\theta)$.
If the parameterization is not identifiable, then for
some $\theta_1 \neq \theta_2$ we have $P_{\theta_1} = P_{\theta_2}$, and consequently we can't
write $\theta = \nu(P)$ for any function $\nu$.
Remark: Even if the parameterization is unidentifiable,
components of the parameterization may be identified (i.e.,
parameters).
Why would we ever want to consider an unidentifiable
parameterization?
Components of the parameterization may capture the
scientific features of interest. We may be interested in whether
certain components of the parameterization are identified.
Example 5: Survey nonresponse. A major problem in
survey sampling is nonresponse.
Example: On Sunday, Sept. 11, 1988, the San
Francisco Examiner ran a story headlined:
3 IN 10 BIOLOGY TEACHERS BACK BIBLICAL
CREATIONISM
Arlington, Texas. Thirty percent of high school biology
teachers polled believe in the biblical creation and 19
percent incorrectly think that humans and dinosaurs lived at
the same time, according to a nationwide survey published
Saturday…
The poll was conducted by choosing 400 teachers at
random from the National Science Teachers Association’s
list of 20,000 teachers and sending these 400 teachers
questionnaires. 200 of these 400 teachers returned the
questionnaires and 60 of the 200 believe in biblical
creationism.
Let $Y_i = 1$ or $0$ according to whether the $i$th teacher believes
in biblical creationism, $i = 1, \ldots, 20000$.
Let $R_i = 1$ or $0$ according to whether the $i$th teacher would
respond to the questionnaire if sent it, $i = 1, \ldots, 20000$.
We would like to know the proportion of teachers that
believe in biblical creationism, $\sum_{i=1}^{20000} Y_i / 20000$.
The data $X$ from the experiment of randomly sampling 400
teachers is (i) the number of teachers that respond, call this
$X_1$, and (ii) the number of teachers that respond and believe
in biblical creationism, call this $X_2$.
The distribution of $(X_1, X_2)$ over repeated random samples is
that $X_1$ is hypergeometric,
$$P(X_1 = j) = \frac{\binom{\sum_{i=1}^{20000} R_i}{j} \binom{20000 - \sum_{i=1}^{20000} R_i}{400 - j}}{\binom{20000}{400}}, \quad j = 0, \ldots, 400,$$
and the conditional distribution of $X_2 \mid X_1 = j$ is
$$P(X_2 = k \mid X_1 = j) = \frac{\binom{\sum_{i=1}^{20000} Y_i R_i}{k} \binom{\sum_{i=1}^{20000} R_i - \sum_{i=1}^{20000} Y_i R_i}{j - k}}{\binom{\sum_{i=1}^{20000} R_i}{j}}, \quad k = 0, \ldots, j.$$
A parameterization for the model is
$$\theta = (\theta_1, \theta_2, \theta_3) = \left( \frac{\sum_{i=1}^{20000} R_i}{20000}, \frac{\sum_{i=1}^{20000} Y_i}{20000}, \frac{\sum_{i=1}^{20000} Y_i R_i}{\sum_{i=1}^{20000} R_i} \right),$$
$\theta_1$ = proportion of teachers who would respond if sent the
questionnaire
$\theta_2$ = proportion of teachers who back biblical creationism
$\theta_3$ = among the teachers who would respond, proportion
who back biblical creationism
This parameterization includes our quantity of interest, the
proportion of teachers who back biblical creationism, $\theta_2$.
But this parameterization is not identifiable: the values
$$\frac{\sum_{i=1}^{20000} R_i}{20000} = 0.5, \quad \frac{\sum_{i=1}^{20000} Y_i}{20000} = 0.3, \quad \frac{\sum_{i=1}^{20000} Y_i R_i}{\sum_{i=1}^{20000} R_i} = 0.3$$
and
$$\frac{\sum_{i=1}^{20000} R_i}{20000} = 0.5, \quad \frac{\sum_{i=1}^{20000} Y_i}{20000} = 0.2, \quad \frac{\sum_{i=1}^{20000} Y_i R_i}{\sum_{i=1}^{20000} R_i} = 0.3$$
have the same distribution for $(X_1, X_2)$.
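This can be checked directly from the two hypergeometric formulas: the distribution of $(X_1, X_2)$ depends on the population only through $\sum R_i$ and $\sum Y_i R_i$, so changing $\sum Y_i$ alone leaves it unchanged. The sketch below is a minimal check under that reading. The function name and the way the counts are recovered from the three proportions are illustrative assumptions, not part of the original notes.

```python
from math import comb

N = 20_000  # teachers on the list
n = 400     # questionnaires sent


def joint_pmf(j, k, theta):
    """P(X1 = j, X2 = k) for theta = (theta1, theta2, theta3).

    theta1 = proportion who would respond,
    theta2 = proportion who back creationism (never used below),
    theta3 = proportion backing creationism among would-be responders.
    """
    theta1, theta2, theta3 = theta
    sum_r = round(theta1 * N)        # number who would respond
    sum_yr = round(theta3 * sum_r)   # would-be responders who back creationism
    if not (0 <= k <= j <= n):
        return 0.0
    # X1 is hypergeometric: j would-be responders among the n sampled teachers.
    p_x1 = comb(sum_r, j) * comb(N - sum_r, n - j) / comb(N, n)
    # Given X1 = j, X2 is hypergeometric within the would-be responders.
    p_x2 = comb(sum_yr, k) * comb(sum_r - sum_yr, j - k) / comb(sum_r, j)
    return p_x1 * p_x2


theta_a = (0.5, 0.3, 0.3)
theta_b = (0.5, 0.2, 0.3)

# Evaluate at the counts reported in the article: 200 responders, 60 of them
# backing creationism.
print(joint_pmf(200, 60, theta_a), joint_pmf(200, 60, theta_b))
```

The second component of the label never enters the calculation, which is exactly why it cannot be recovered from the distribution of the data.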
The quantity the article reported an estimate of,
$\sum_{i=1}^{20000} Y_i R_i / \sum_{i=1}^{20000} R_i$,
is identified (i.e., a parameter) but our quantity of interest,
$\sum_{i=1}^{20000} Y_i / 20000$, is not identified (i.e., not a parameter).
III. Statistics
A statistic $Y = T(X)$ is a random variable or random
vector that is a function of the data.
Example 2 continued: $(X_1, \ldots, X_n)$ iid $N(\mu, \sigma^2)$. Two
statistics are the sample mean
$$\bar{X} = \frac{\sum_{i=1}^n X_i}{n}$$
and the sample variance
$$s^2 = \frac{1}{n-1} \sum_{i=1}^n (X_i - \bar{X})^2.$$
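As a concrete illustration, the sample mean and sample variance can be computed from a data vector as follows. The six values below are hypothetical placeholders, not the patient's actual phosphate measurements, which are not given in these notes. The helper function names are also assumptions made for the example.

```python
from statistics import mean, variance


def sample_mean(xs):
    """X-bar = (1/n) * sum of the X_i."""
    return sum(xs) / len(xs)


def sample_variance(xs):
    """s^2 = (1/(n-1)) * sum of (X_i - X-bar)^2; requires n >= 2."""
    xbar = sample_mean(xs)
    return sum((x - xbar) ** 2 for x in xs) / (len(xs) - 1)


# Hypothetical phosphate levels (mg/dl) at six visits, for illustration only.
x = [5.0, 5.5, 4.5, 6.0, 5.0, 4.0]

print(sample_mean(x), sample_variance(x))
# The standard library computes the same two quantities.
print(mean(x), variance(x))
```

Both quantities are functions of the data alone and involve no unknown parameter, which is what makes them statistics.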
IV. Regression Model (Section 1.1.4)
In the regression setting, each individual unit has a
response variable $Y_i$ and a vector of explanatory variables
$z_i$. A regression model is a model for $E(Y_i \mid z_i)$.
The multiple linear regression model is
$$E(Y_i \mid z_i) = \sum_{j=1}^p z_{ij} \beta_j$$
and
$$Y_i = \sum_{j=1}^p z_{ij} \beta_j + \epsilon_i, \quad \epsilon_i \sim N(0, \sigma^2).$$
The coefficients $\beta_j$ can be interpreted as the change in the
mean of $Y$ that is associated with a one unit change in $z_j$ when
$z_1, \ldots, z_{j-1}, z_{j+1}, \ldots, z_p$ are held fixed.
Example: The 1966 Coleman Report on “Equality of
Educational Opportunity” sought to explain how student
achievement in schools was associated with the resources
of the school and the socioeconomic background of the
student, e.g.,
$Y$ = verbal achievement score in school (6th graders)
$z_1$ = staff salaries per pupil
$z_2$ = % of students in 6th grade of school whose father has
a white collar occupation
$z_3$ = SES (socioeconomic status)
$z_4$ = teachers’ average verbal scores
$z_5$ = mothers’ average education
Problem 1.1.9 (Problem 4 on homework 1) is concerned
with the impact of collinearity of the explanatory variables
on identifiability of the parameter vector $\beta$. The variables
$z_1, z_2, z_3, z_4, z_5$ would be collinear if one variable was a
linear function of the other variables. $z_1, z_2, z_3, z_4, z_5$ were
close to being collinear in the Coleman study because
socioeconomic status was highly correlated with the
resources of the school (staff salaries per pupil) prior to the
desegregation of schools.
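The effect of exact collinearity on identifiability can be seen in a small numerical sketch. It is not a solution to Problem 1.1.9 and does not use the Coleman data: the three explanatory variables and the coefficient values below are hypothetical, with the third variable constructed as the sum of the first two. Two different coefficient vectors then give the same mean $E(Y_i \mid z_i)$ for every unit, so the data cannot distinguish them.

```python
def regression_mean(z_row, beta):
    """E(Y_i | z_i) = sum_j z_ij * beta_j for one unit."""
    return sum(z * b for z, b in zip(z_row, beta))


# Hypothetical design: the third variable is exactly z1 + z2 (collinear).
z1 = [1.0, 2.0, 3.0, 4.0]
z2 = [2.0, 0.0, 1.0, 5.0]
design = [(u, v, u + v) for u, v in zip(z1, z2)]

beta = (1.0, 2.0, 3.0)
# Shift along (1, 1, -1): t*z1 + t*z2 - t*(z1 + z2) = 0 for every unit.
t = 0.5
beta_alt = (beta[0] + t, beta[1] + t, beta[2] - t)

means = [regression_mean(row, beta) for row in design]
means_alt = [regression_mean(row, beta_alt) for row in design]
print(means)
print(means_alt)
```

When the variables are only close to collinear, as in the Coleman study, the coefficient vector is still identifiable in principle, but very different coefficient vectors give nearly the same means.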