
Data and estimation
or
“Where do probabilities come from anyway?”
Business Statistics 41000
Fall 2015
1
Topics
1. Empirical distributions: creating random variables from data
2. Exploratory data analysis
3. EstimaND, estimaTOR, estimaTE.
4. Gauging sampling variability.
2
Determining probabilities from data
We’ve seen that probability concepts allow us to describe statistical
patterns and to exploit them to our benefit using the idea of expected
utility maximization.
But where do those probabilities come from in the first place?
The short answer — data!
If we have enough of it, probabilities estimated from data are accurate,
but when we have only limited data, more care is needed because of sampling
variability — the observed data does not always reflect the underlying
probabilities accurately.
3
Creating a RV via random sampling
By data we mean the recorded facts of the world, which can take many
forms: presence or absence data (dummy variables), categorical data
(this, that or the other), and measurements of various kinds (continuous,
discrete).
We’ve seen these types before in defining the sample space of random
variables.
The empirical distribution is the probability distribution defined by
randomly sampling our data (with replacement), with each observation
getting equal probability.
4
Example: Bernoulli probability
Consider iid random variables Xi ∼ Ber(p = 1/2), for i = 1, . . . , 10, and
consider a realization of these random variables. That is, consider the
outcome of tossing a fair coin 10 times. Let's say the results look like
this:

[0, 0, 1, 0, 1, 0, 0, 0, 0, 1].

The empirical distribution refers to the distribution defined by randomly
sampling our data (with replacement). In this case each draw is an iid
random variable Di ∼ Ber(p̂ = 3/10).
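A minimal R sketch of this construction (the seed and variable names are my own choices; a fresh simulation will generally not reproduce the 3/10 above):

  set.seed(41000)                            # arbitrary seed, for reproducibility
  x <- rbinom(10, size = 1, prob = 1/2)      # ten tosses of a fair coin (0/1)
  p.hat <- mean(x)                           # estimated Bernoulli probability
  d <- sample(x, size = 10, replace = TRUE)  # draws from the empirical distribution

Each entry of d is a draw from Ber(p.hat), exactly as described above.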
5
Example: die rolls
Consider a three-sided fair die. In ten rolls we might get
[3, 1, 1, 3, 1, 2, 3, 3, 3, 1].
X:   x          1     2     3
     P(X = x)   1/3   1/3   1/3

D:   d          1     2     3
     P(D = d)   4/10  1/10  5/10
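A quick R check: the empirical distribution is just the table of relative frequencies.

  d <- c(3, 1, 1, 3, 1, 2, 3, 3, 3, 1)
  table(d) / length(d)    # P(D = 1) = 0.4, P(D = 2) = 0.1, P(D = 3) = 0.5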
6
Example: milk demand
If the true probabilities are as in our earlier cafe milk ordering scenario
x          1    2     3     4    5    6    7    8     9    10
P(X = x)   4%   15%   35%   5%   5%   5%   5%   20%   3%   3%
we might actually observe, over the past 100 days, numbers like
d                1    2     3     4    5    6    7    8     9    10
100 · P(D = d)   5    15    36    2    7    6    6    18    2    3
7
Example: data from a normal distribution
The same idea holds for continuous random variables. Assume the
underlying distribution of heights for NBA players is X ∼ N(79, 13). We
may randomly observe 10 heights:
D:   d           P(D = d)
     74.66610    1/10
     74.81724    1/10
     74.82560    1/10
     75.33874    1/10
     78.54953    1/10
     78.59774    1/10
     79.06059    1/10
     79.23552    1/10
     80.24751    1/10
     83.02688    1/10
8
Summary features of empirical distributions
Summary features of empirical distributions — like summary features of
any distribution — are fixed numbers.
With the data d1 , d2 , . . . , dn in hand, we define a random variable D, and
then we can compute various properties of D as we would with any other
distribution.
9
Summary features of empirical distributions
Given a sample that defines an empirical distribution, we can compute
the usual summary features of a distribution:
- mean
- median
- mode
- variance and standard deviation
- etc.
10
Mean of empirical distribution
(Population) Mean
The mean of a random variable X is defined as
E(X) = ∑_{j=1}^{J} x_j P(X = x_j).
Empirical (sample) mean
The random variable D defined by a list of n observed numbers,
d1 , d2 , . . . , dn has expected value
E(D) = ∑_{j=1}^{n} d_j · (1/n) = (1/n) ∑_{j=1}^{n} d_j.
11
Median of empirical distribution
(Population) Median
A random variable X has median m if
P(X ≤ m) ≥ 1/2  and  P(X ≥ m) ≥ 1/2.
Empirical (sample) median
Let D be the random variable defined by a list of n observed numbers,
sorted so that d_(1) ≤ d_(2) ≤ . . . ≤ d_(n). Any number m such that

  d_(a) ≤ m ≤ d_(b)

is a median of D, where a = ⌊(n+1)/2⌋ and b = ⌈(n+1)/2⌉.
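A small R sketch of this definition, using the die rolls from the earlier slide:

  d <- c(3, 1, 1, 3, 1, 2, 3, 3, 3, 1)
  n <- length(d)
  d.sorted <- sort(d)              # order statistics d_(1) <= ... <= d_(n)
  a <- floor((n + 1) / 2)          # a = 5
  b <- ceiling((n + 1) / 2)        # b = 6
  c(d.sorted[a], d.sorted[b])      # 2 and 3: any m in [2, 3] is a median of D
  median(d)                        # R reports the midpoint of that interval, 2.5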
12
Median of empirical distribution
Let’s unpack this a bit.
The sorted list of n observed numbers, d_(1) ≤ d_(2) ≤ . . . ≤ d_(n), are called
order statistics; d_(j) is the jth smallest number.

So, if m is between d_(a) and d_(b) we know that

  P(D ≤ m) = ⌊(n+1)/2⌋ / n ≥ 1/2   and   P(D ≥ m) = ⌈(n+1)/2⌉ / n ≥ 1/2.
13
Mode of empirical distribution
The mode is easy.
Empirical (sample) mode
Let D be the random variable defined by a list of n observed numbers,
d1 , d2 , . . . , dn . The mode of D is the number (or numbers) occurring
most often.
In the three-sided-die example, we observed [3, 1, 1, 3, 1, 2, 3, 3, 3, 1]. The
mode of the empirical distribution is 3. The underlying distribution in
this example was uniform, so each value was a mode.
14
Variance of empirical distribution
(Population) Variance
The variance of a random variable X with distribution p(x) is defined as
V(X) = ∑_{j=1}^{J} (x_j − E(X))² p(x_j).
Empirical (sample) variance
Let D be the random variable defined by a list of n observed numbers,
d1 , d2 , . . . , dn . The variance of D is defined as
V(D) = (1/n) ∑_{i=1}^{n} (d_i − E(D))².
15
Variance of empirical distribution
Plug-in formula for empirical variance
Let D be the random variable defined by a list of n observed numbers,
d1 , d2 , . . . , dn . The variance of D can be written as
V(D) = (1/n) ∑_{i=1}^{n} d_i² − ((1/n) ∑_{i=1}^{n} d_i)² = E(D²) − E(D)².
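A quick R check, using the die rolls from earlier. Note that R's built-in var() divides by n − 1 rather than n, so it must be rescaled to match the plug-in value:

  d <- c(3, 1, 1, 3, 1, 2, 3, 3, 3, 1)
  n <- length(d)
  mean(d^2) - mean(d)^2      # plug-in variance V(D), divides by n:      0.89
  var(d) * (n - 1) / n       # var() divides by n - 1; rescaled, agrees: 0.89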
16
Multivariate empirical distributions
Consider the following discrete, bivariate random variable
X = (X1, X2)

(x1, x2)               (1,1)   (2,1)   (2,2)   (17,1)
P(X1 = x1, X2 = x2)    9/20    1/20    9/20    1/20
17
Multivariate empirical distributions
Here is the same information in a different form.
X = (X1, X2)

           X1 = 1   X1 = 2   X1 = 17
X2 = 1     9/20     1/20     1/20
X2 = 2     0        9/20     0
18
Multivariate empirical distributions
We can visualize this distribution with a bubble plot.

[Bubble plot of the distribution: x1 on the horizontal axis (5 to 15 shown), x2 on the vertical axis (1.0 to 2.0).]
19
Multivariate empirical distributions
Consider a sample of size n = 10 drawn from this distribution:
[(17, 1), (2, 2), (17, 1), (2, 2), (2, 2), (2, 2), (1, 1), (17, 1), (2, 2), (2, 2)].
This defines the empirical distribution
D = (D1, D2)

(d1, d2)            (1,1)   (2,1)   (2,2)   (17,1)
P(D = (d1, d2))     1/10    0       6/10    3/10
(17,1)
20
Formula: correlation
Recall the formula for correlation
Plug-in formula for correlation
The correlation between two random variables X and Y can be expressed
as
corr(X, Y) = [E(XY) − E(X)E(Y)] / (σ_X σ_Y).
21
Formula: correlation of empirical distribution
Let di refer to the ith observed pair; i.e. d1 = (17, 1), and let di,j refer to
the jth co-ordinate of di .
Plug-in formula for correlation
Let D be the random variable defined by randomly sampling n observed
points (d1,1 , d1,2 ), (d2,1 , d2,2 ), . . . (dn,1 , dn,2 ).
The correlation between D1 and D2 can be expressed as

  corr(D1, D2) = [E(D1 D2) − E(D1)E(D2)] / (σ_{D1} σ_{D2})
               = [(1/n) ∑_i d_{i,1} d_{i,2} − ((1/n) ∑_{i=1}^{n} d_{i,1}) ((1/n) ∑_{i=1}^{n} d_{i,2})] / (σ_{D1} σ_{D2}).
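A quick R check, using the ten sampled pairs from the previous slide (the answer is the same whether we divide by n or n − 1, since the factors cancel in a correlation):

  d1 <- c(17, 2, 17, 2, 2, 2, 1, 17, 2, 2)
  d2 <- c( 1, 2,  1, 2, 2, 2, 1,  1, 2, 2)
  sd.plug <- function(v) sqrt(mean(v^2) - mean(v)^2)   # plug-in SD (divide by n)
  (mean(d1 * d2) - mean(d1) * mean(d2)) / (sd.plug(d1) * sd.plug(d2))
  cor(d1, d2)                                          # same value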
22
Tool: scatter plots

[Scatter plot of the n = 10 sampled points: D1 on the horizontal axis, D2 on the vertical axis.]
We have added some “jitter” in order to reflect the number of points at
each location. We could as well have directly bubble-plotted the table for
D from a few slides back.
23
Example: NBA height and weight
Here is the NBA (2008) height and weight data.
[Scatter plot "NBA 2008": weight in pounds (horizontal) vs. height in inches (vertical).]
For continuous bivariate data, no jitter is necessary.
24
Example: NBA height and weight
Recall that the best linear predictor is given in terms of E(X), E(Y),
σ_X = √V(X), σ_Y = √V(Y) and ρ = corr(X, Y).

[Scatter plot of the NBA 2008 data: weight in pounds (horizontal) vs. height in inches (vertical).]
Applied to the empirical distribution (see above) we can find the best
linear predictor for a particular data set. How can we interpret this in
terms of random sampling?
25
Empirical risk minimization
Notice that the best linear predictor for an empirical distribution directly
minimizes the sum of squares error:
(1/n) ∑_{i=1}^{n} (y_i − [a + x_i b])².
This general strategy is called empirical risk minimization...which can
be thought of as empirical utility maximization.
This is similar to the idea of back-testing: make future decisions
(actions) on the basis of which past decisions would have worked well.
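As a concrete illustration, here is a minimal R sketch with made-up numbers (lm() performs exactly this least-squares minimization, and the plug-in formulas give the same coefficients):

  x <- c(150, 180, 200, 220, 250, 280)   # hypothetical weights
  y <- c(70, 74, 76, 78, 81, 84)         # hypothetical heights
  coef(lm(y ~ x))                        # intercept a and slope b

  b <- cor(x, y) * sd(y) / sd(x)         # plug-in slope
  a <- mean(y) - b * mean(x)             # plug-in intercept
  c(a, b)                                # matches coef(lm(y ~ x))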
26
Tool: histograms
With univariate empirical distributions for continuous data, each value is
unique so they all get the same 1/n weight. If we create evenly spaced
“buckets”, something like a density function emerges.
[Histogram "NBA 2008": weight in pounds (horizontal) vs. density (vertical).]
Such a plot is called a histogram.
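In R, assuming the weights sit in a vector called weight (a hypothetical name), this is one line; freq = FALSE puts the vertical axis on the density scale, and the number of buckets is a matter of choice:

  hist(weight, breaks = 20, freq = FALSE,
       main = "NBA 2008", xlab = "Weight in pounds")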
27
Empirical CDF plots
We can visualize the empirical CDF plot.
[Empirical CDF "NBA 2008": weight in pounds (horizontal) vs. F(x) (vertical).]
This plot shows the order statistics d_(j) plotted against j/n.
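In R (again assuming a vector called weight), we can draw it by hand from the order statistics or use the built-in ecdf():

  n <- length(weight)
  plot(sort(weight), (1:n) / n, type = "s",
       xlab = "Weight in pounds", ylab = "F(x)")   # d_(j) against j/n
  plot(ecdf(weight))                               # built-in equivalent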
28
Probability ideas meet data
To reiterate: the thought experiment of randomly sampling observations
from our data permits us to connect the probability concepts and
terminology from the first three weeks of class to whichever real-world
problem we happen to be studying.
In particular, we can determine the probability of certain events or
visualize certain conditional distributions — we just have to know how to
ask the computer to pull them up.
29
Data interrogation
Now we will begin to explore some real data sets in R and Excel.
We will see the following exploratory data analysis (or EDA) tools:
- pivot tables and subsetting
- boxplots
- histograms and binning
- scatterplots and time series plots
- simple trend lines
[Begin computation lab session.]
30
Guiding principle of statistical estimation
The guiding principle of statistical estimation is that:
Empirical distributions tend to look like the underlying distributions of
the data which define them.
Moreover, this likeness improves as we get more and more data. (We
rarely have as much data as we would like.)
Note: the estimated quantities in our data are not the “true” values of
the underlying RV!
31
Guiding principle
Consider some event A. In terms of A, the claim is that as the sample
size gets bigger and bigger (n → ∞),
P(Dn ∈ A) → P(X ∈ A).
In more detail,
(1/n) ∑_{i=1}^{n} 1(x_i ∈ A) → P(X ∈ A).
We can perform simulations to support this claim.
32
Guiding principle in action
For example, let A = {h | 2 ≤ h ≤ 5} and Xi ∼ iid N(2, 3²) for i = 1, . . . , n.
We can compute P(X ∈ A) using the R command
pnorm(5,2,3) - pnorm(2,2,3)
and find it to be 0.341.
We can then draw x1 , x2 , . . . , xn using the R command
x <- rnorm(n,2,3).
Finally, we can compute P(Dn ∈ A) with the command
mean((2 < x) & (x < 5)).
33
Guiding principle
- Because A was arbitrary, we can conclude that essentially any feature of the distribution of Dn will have to look like the corresponding feature of the X distribution.
- Every time you perform the demo above, you get a slightly different answer for P(Dn ∈ A).
- By the Law of Large Numbers (from last lecture), as n gets bigger, P(Dn ∈ A) gets closer and closer to P(X ∈ A).
To see this last point, compute the mean and variance of
(1/n) ∑_{i=1}^{n} 1(Xi ∈ A).
Seriously, like right now. Do it.
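(For checking your work: each 1(Xi ∈ A) is a Bernoulli random variable with success probability P = P(X ∈ A), and the Xi are independent, so

  E[(1/n) ∑_{i=1}^{n} 1(Xi ∈ A)] = P   and   V[(1/n) ∑_{i=1}^{n} 1(Xi ∈ A)] = P(1 − P)/n.

The variance shrinks like 1/n, which is exactly why P(Dn ∈ A) concentrates around P(X ∈ A) as n grows.)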
34
Terminology: estimand, estimator, estimate
Suppose we are interested in a particular feature of an unknown
distribution. It could be a common summary statistic — such as the
mean or the variance of the distribution.
Perhaps we are interested in the probability of a particular event.
We may be interested in the optimal action under some problem-specific
utility function (such as our “milk demand” example).
In each case, any quantity of interest of an unknown probability
distribution is referred to as an estimand.
35
Estimand, estimator, estimate
Next, we proceed to use data to figure out what the estimand is.
To do this, we come up with a recipe for taking observed data and
spitting out our best guess as to what the estimand is.
This recipe itself is called an estimator.
Note that this recipe defines a random variable, induced by the data
generating process: estimators are random variables!
36
Estimand, estimator, estimate
Finally, after we observe actual data, we can apply the estimator recipe
to it, to obtain an actual number.
This number — our post-data guess as to the value of the unknown
estimand — is called our estimate.
Estimates are fixed numbers based on the observed data.
37
Estimand, estimator, estimate
An estimand is a fixed but unknown property of a probability
distribution. It is the thing we want to estimate.
An estimator is a recipe for taking observed data and formulating a
guess about the unknown value of the estimand.
An estimate is a specific value obtained when the estimator is applied to
specific observed data.
38
Example: population mean E(X); sample mean X̄; observed sample mean x̄
Suppose we are interested in E(X ), the mean of a random variable X .
This is sometimes referred to as the population mean, in reference to
polling problems.
The standard estimator of the population mean is the mean of the
observed data, or the sample mean. Before any P
data is observed, this
n
defines our estimator, commonly denoted X̄ ≡ n1 i=1 Xi .
Once the data has been observed, we have an estimate in hand, denoted
x̄ ≡ (1/n) ∑_{i=1}^{n} xi. (Note the lower-case x's here.) This is the observed
sample mean.
39
Gauging sampling variability
Now we will look at some examples of sampling variability:
1. fair coin or biased coin?
2. milk demand (aggregate daily demand),
3. best linear predictor of NBA height, given weight.
40
Example: fair or biased coin?
Suppose a partner at your firm always tosses a coin to see if you or he
pays for your weekly Thursday lunch meeting.
After ten lunches, you’ve had to pay eight times. We can denote this by
saying that p̂ = 8/10.
Should you accuse him of using a loaded coin?
(This “hat” notation, p̂, indicates an estimate of the corresponding
estimand p.)
41
Example: fair or biased coin? (cont’d)
First, let’s approach this question mathematically: how much do we trust
our estimate p̂ = 8/10, based on a sample size of 10?
To answer this, we compute the sampling distribution of our estimator
p̂ ≡ (1/n) ∑_{i=1}^{n} Xi, where each Xi ∼ Ber(p), for some unknown p (our
estimand).
Specifically, we learned last week that a sum of n independent Bernoulli
RV’s has a binomial distribution, so np̂ ∼ Bin(n, p), for whatever the
actual value of p is.
Notice that the sampling distribution depends on the unknown estimand
E(X ) = p. That’s annoying, but typical.
42
Example: fair or biased coin? (cont’d)
However, because our null hypothesis was that the coin is fair, we can
use p = 1/2 as a benchmark. Under the null hypothesis the observed
number of “successes” (call it Y) is a draw from a Bin(10, 1/2) distribution.
Accordingly, we calculate P(Y ≤ 7) = 0.945, so P(Y ≥ 8) = 0.055 — if you
accuse your senior colleague in cases like the one you observed, you’d
wrongly accuse him of being a scoundrel 5.5% of the time.
Thus 5.5% is the p-value of your data: the probability, under the null
hypothesis, of seeing data as or more extreme than what you actually
witnessed.
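A quick R check of these numbers:

  pbinom(7, size = 10, prob = 1/2)       # P(Y <= 7) = 0.9453
  1 - pbinom(7, size = 10, prob = 1/2)   # P(Y >= 8) = 0.0547, the 5.5% p-value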
43
Example: fair or biased coin? (cont’d)
An important caveat: for any fixed data set either the guy is cheating or
he isn’t. But the rule itself brings with it long-run guarantees.
The 5.5% probability of false accusation is a statement about the
rule itself, not about the coin on any particular occasion or even on
average.
IMO, unless you plan to accuse a series of bosses of shenanigans, this
sort of guarantee isn’t so very useful.
Nonetheless, this approach of developing decision rules that work well
averaged over many different data sets (the “frequentist” approach) is
standard. We will discuss it in detail next week.
44
Predicting milk demand
Recall our milk demand random variable is:
x          1    2     3     4    5    6    7    8     9    10
P(X = x)   4%   15%   35%   5%   5%   5%   5%   20%   3%   3%
And our utility function is

  u(a, x) = −$5 (a − x)    if a > x,
  u(a, x) = −$35 (x − a)   if x > a,
where the action a is the number of gallons we order and the “state” x is
the amount of milk required.
If we based our calculation on the past 3 weeks of data, how often would
we make the wrong decision?
45
Milk order sampling variation
We can investigate this question via simulation.

[Bar plot over order quantities 1–10: simulated frequency (vertical axis, 0 to 0.8) of each chosen order.]
Most of the time we get it right. Sometimes we over-order and
sometimes we under-order (less often).
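A minimal R sketch of such a simulation (the seed, the 5,000 replications, and the reading of "3 weeks" as 21 days of data are my own choices):

  set.seed(41000)
  x.vals <- 1:10
  p.true <- c(.04, .15, .35, .05, .05, .05, .05, .20, .03, .03)
  u <- function(a, x) ifelse(a >= x, -5 * (a - x), -35 * (x - a))  # utility of ordering a gallons when x are needed

  best.order <- function(p) {        # expected-utility-maximizing order under distribution p
    eu <- sapply(x.vals, function(a) sum(u(a, x.vals) * p))
    x.vals[which.max(eu)]
  }

  picks <- replicate(5000, {
    d     <- sample(x.vals, 21, replace = TRUE, prob = p.true)   # 3 weeks of demand
    p.hat <- tabulate(d, nbins = 10) / 21                        # empirical distribution
    best.order(p.hat)                                            # order chosen from the data
  })
  table(picks) / 5000                   # distribution of the chosen order quantity
  mean(picks != best.order(p.true))     # how often the data lead us to the wrong order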
46
Example: NBA heights and weights
An easy way to approximate the sampling distribution is simply to
re-sample your data (with replacement), compute your estimate, and
repeat.

[Histogram of the re-sampled slope estimates: β (roughly 0.100 to 0.115) on the horizontal axis, frequency on the vertical axis.]
This technique is called bootstrapping.
47
Example: NBA heights and weights

[Scatter plot of the NBA 2008 data: weight in pounds (horizontal) vs. height in inches (vertical).]
48
Example: NBA heights and weights
Here is the code to produce the bootstrap samples.
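The code itself appeared as an image on the original slide and is not reproduced here; the following is a minimal sketch of the bootstrap just described, assuming the data sit in a data frame nba with columns height and weight (hypothetical names):

  set.seed(41000)
  B <- 1000                                           # number of bootstrap samples
  beta.boot <- replicate(B, {
    idx <- sample(nrow(nba), replace = TRUE)          # resample rows with replacement
    coef(lm(height ~ weight, data = nba[idx, ]))[2]   # slope of the best linear predictor
  })
  hist(beta.boot, xlab = expression(beta),
       main = "Bootstrap distribution of the slope")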
49