Probability and Estimation - Department of Statistics | Rajshahi

Prof. Dr. S. K. Bhattacharjee
Department of Statistics
University of Rajshahi
Statistical Inference
• Statistical inference is the process of making judgments about an unknown population based on a sample.
• An important aspect of statistical inference is using estimates to approximate the value of an unknown population parameter.
• Another type of inference involves choosing between two opposing views or statements about the population; this process is called hypothesis testing.
Statistical Estimation
• An estimator is a statistic (a function of the sample) that provides an estimate of a population parameter.
• Point Estimation
• Interval Estimation
Point Estimation
• A point estimator gives a single numerical estimate of a population parameter.
• The sample mean, Ȳ, is a point estimator of the population mean, μ.
• The sample proportion, p, is a point estimator of the population proportion, π.
Properties of a good Estimator
Principles of Parameter Estimation
• Unbiased
– The expected value of the estimator equals the population parameter.
• Consistent
– As the sample size n increases, the estimator converges (in probability) to the population parameter.
• Efficient
– Has the smallest variance among comparable estimators.
• Minimum Mean-Squared Error
– The mean-squared error of the estimator is as low as possible.
• Sufficient
– Contains all the information about the parameter that a sample of size n provides.
Unbiased Estimator
• An unbiased estimator is a statistic whose expected value equals the population parameter being estimated.
• E[θ̂ₙ] = θ₀ for any n
• Examples:
• The sample mean, Ȳ, is an unbiased estimator of the population mean, μ.
• The sample variance, S² = Σᵢ(Yᵢ − Ȳ)² / (n − 1), is an unbiased estimator of the population variance, σ².
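As a quick numerical check (a minimal simulation sketch with illustrative values, using NumPy), the averages of Ȳ and S² over many repeated samples land close to μ and σ²:

    import numpy as np

    rng = np.random.default_rng(0)
    mu, sigma, n, reps = 10.0, 2.0, 15, 200_000      # illustrative values

    samples = rng.normal(mu, sigma, size=(reps, n))
    ybar = samples.mean(axis=1)                      # sample means
    s2 = samples.var(axis=1, ddof=1)                 # sample variances with divisor n - 1

    print(ybar.mean())   # close to mu = 10,     illustrating E[Ybar] = mu
    print(s2.mean())     # close to sigma^2 = 4, illustrating E[S^2] = sigma^2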
Consistent Estimators
• A statistic is a consistent estimator of a parameter if the probability that it is close to the parameter's true value approaches 1 as the sample size increases.
• Mathematically, a sequence of estimators {tₙ} is a consistent estimator for a parameter θ if and only if, for every ϵ > 0, no matter how small,
lim (n → ∞) P(|tₙ − θ| < ϵ) = 1.
• The standard error of a consistent estimator becomes smaller as the sample size gets larger.
• The sample mean and the sample proportion are consistent estimators, since from their formulas, σ_Ȳ = σ/√n and σₚ = √(π(1 − π)/n), the standard errors become smaller as n gets larger.
Consistent Estimators
An estimator's distribution (like that of any other non-trivial statistic) becomes narrower and more normal-like as larger and larger samples are considered. If we take for granted that the variance of the estimator tends to 0 as the sample size grows without limit, what consistency really means is that the mean of the estimator's distribution tends to θ₀ as the sample size grows without limit.
In technical terms, a consistent estimator is a sequence of random variables indexed by n (the sample size) that converges in probability to θ₀.
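For instance, a small simulation (illustrative values, using NumPy) shows P(|Ȳ − μ| < ϵ) rising toward 1 as n grows, which is exactly the convergence in probability described above:

    import numpy as np

    rng = np.random.default_rng(1)
    mu, sigma, eps, reps = 0.0, 1.0, 0.1, 10_000     # illustrative values

    for n in (10, 100, 1000):
        ybar = rng.normal(mu, sigma, size=(reps, n)).mean(axis=1)
        print(n, np.mean(np.abs(ybar - mu) < eps))   # estimated P(|Ybar - mu| < eps) -> 1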
Relative Efficiency
A parameter may have several unbiased estimators. For example, given a symmetrical continuous distribution, both:
* The sample mean, and
* The sample median
are unbiased estimators of the distribution mean (when it exists).
Which one should we choose?
Certainly we should choose the estimator that generates estimates that are closer (in a probabilistic sense) to the true value θ₀ than the estimates generated by the other one. One way to do that is to select the estimator with the lower variance.
This leads to the definition of the relative efficiency of two unbiased estimators. Given two unbiased estimators θ*₁ and θ*₂ of the same parameter θ, one defines the efficiency of θ*₂ with respect to θ*₁ (for a given sample size n) as the ratio of their variances:
Relative efficiency (θ*₂ with respect to θ*₁)ₙ = Var(θ*₁)ₙ / Var(θ*₂)ₙ
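As a numerical illustration (a simulation sketch with illustrative values, using NumPy): for normal data the sample median has roughly π/2 ≈ 1.57 times the variance of the sample mean, so its efficiency relative to the mean is about 0.64.

    import numpy as np

    rng = np.random.default_rng(2)
    n, reps = 101, 100_000                      # illustrative values
    x = rng.normal(0.0, 1.0, size=(reps, n))    # N(0, 1) samples

    var_mean = x.mean(axis=1).var()
    var_median = np.median(x, axis=1).var()
    print(var_median / var_mean)                # about pi/2 ~ 1.57
    print(var_mean / var_median)                # relative efficiency of the median ~ 0.64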
Efficient Estimator
• An efficient estimator has a low variance, usually judged relative to other estimators; this comparison is called relative efficiency. Equivalently, the variance of the estimator should be as small as possible.
• Efficiency concerns the reliability of an estimator: its tendency to have a smaller standard error than competing estimators for the same sample size.
• The median is an unbiased estimator of μ when the sampled distribution is normal, but its standard error is about 1.25 times that of the sample mean, so the sample mean is a more efficient estimator than the median.
• The Maximum Likelihood Estimator is asymptotically the most efficient estimator among all unbiased ones.
Minimum Mean-Squared Error Estimator
The practitioner is not particularly keen on unbiasedness. What is really important is that, on average, the estimate θ* be close to the true value θ₀. So he will tend to favour estimators such that the mean-squared error
E[(θ* − θ₀)²]
is as low as possible, whether θ* is biased or not. Such an estimator is called a minimum mean-squared-error estimator.
Given two estimators:
* θ*₁, which is unbiased but has a large variance, and
* θ*₂, which is somewhat biased but has a small variance,
θ*₂ might prove a better estimator than θ*₁ in practice.
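A concrete sketch of this trade-off (illustrative values, using NumPy): for normal data, the biased variance estimator with divisor n has a lower mean-squared error than the unbiased divisor n − 1 version, because its smaller variance more than compensates for its bias.

    import numpy as np

    rng = np.random.default_rng(3)
    sigma2, n, reps = 4.0, 10, 200_000                  # illustrative values
    x = rng.normal(0.0, np.sqrt(sigma2), size=(reps, n))

    for ddof in (1, 0):                                 # divisor n - 1 (unbiased) vs n (biased)
        est = x.var(axis=1, ddof=ddof)
        bias = est.mean() - sigma2
        mse = np.mean((est - sigma2) ** 2)
        print(f"ddof={ddof}: bias={bias:+.3f}  MSE={mse:.3f}")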
Sufficient Estimator
• We have shown that Ȳ and S² are unbiased estimators of μ and σ², respectively.
• Are we losing any information about our target parameters by relying on these statistics?
• Statistics that summarize all the information about the target parameters are said to have the property of sufficiency; they are called sufficient statistics.
• “Good” estimators are (or can be made to be) functions of any sufficient statistic.
Sufficient Estimator
* Let Y₁, Y₂, ..., Yₙ denote a random sample from a probability distribution with unknown parameter θ. Then the statistic U is said to be sufficient for θ if the conditional distribution of Y₁, Y₂, ..., Yₙ given U does not depend on θ.
* Let U be a statistic based on the random sample Y₁, Y₂, ..., Yₙ. Then U is a sufficient statistic for the estimation of a parameter θ if and only if the likelihood L(y₁, y₂, ..., yₙ | θ) can be factored into two nonnegative functions,
L(y₁, y₂, ..., yₙ | θ) = g(u, θ) · h(y₁, y₂, ..., yₙ),
where g(u, θ) is a function only of u and θ, and h(y₁, y₂, ..., yₙ) is not a function of θ.
Example: Sufficient Estimator
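A standard illustration of the factorization criterion: suppose Y₁, ..., Yₙ is a random sample from a Poisson distribution with mean λ. The likelihood is
L(y₁, ..., yₙ | λ) = ∏ᵢ e^(−λ) λ^(yᵢ) / yᵢ! = [e^(−nλ) λ^(Σ yᵢ)] · [1 / ∏ᵢ yᵢ!] = g(u, λ) · h(y₁, ..., yₙ), with u = Σᵢ yᵢ.
Since g depends on the data only through u, and h does not involve λ, the statistic U = Σᵢ Yᵢ (and hence Ȳ) is sufficient for λ.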
Methods of Point Estimation
• Classical Approach
• Bayesian Approach
Classical Approach:
– Method of Moments
– Method of Maximum Likelihood
– Method of Least Squares
Method of Moments
i) Sample moments should provide good estimates of the corresponding population moments.
ii) Because the population moments are functions of the population parameters, we can use i) to obtain estimates of these parameters.
Formal Definition:
Choose as estimates those values of the parameters that are solutions of the equations μ′ₖ = m′ₖ, for k = 1, 2, ..., t, where t is the number of parameters to be estimated, μ′ₖ is the k-th population moment, and m′ₖ is the k-th sample moment.
Example
A random sample Y₁, Y₂, ..., Yₙ is selected from a population in which Yᵢ possesses a uniform density function over the interval (0, θ), where θ is unknown. Use the method of moments to estimate θ.
Solution
The value of μ′₁ for a uniform random variable on (0, θ) is
μ′₁ = θ / 2.
The corresponding first sample moment is
m′₁ = (1/n) Σᵢ Yᵢ = Ȳ.
Setting μ′₁ = m′₁ gives θ/2 = Ȳ, and thus
θ̂ = 2Ȳ.
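A quick simulation sketch of this estimator (illustrative values, using NumPy):

    import numpy as np

    rng = np.random.default_rng(4)
    theta, n, reps = 7.0, 50, 100_000        # illustrative true theta and sample size

    y = rng.uniform(0.0, theta, size=(reps, n))
    theta_hat = 2 * y.mean(axis=1)           # method-of-moments estimator 2 * Ybar

    print(theta_hat.mean())                  # close to theta = 7 (unbiased)
    print(theta_hat.std())                   # sampling variability of the estimator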
Method of Maximum Likelihood
The likelihood and log-likelihood functions are the basis for deriving estimators of parameters, given data. While the shapes of these two functions differ, they attain their maximum at the same parameter value. The value of the parameter that corresponds to this maximum point is defined as the Maximum Likelihood Estimate (MLE). This is the value that is “most likely” relative to the other values. The maximum likelihood estimate of the unknown parameter in the model is the value that maximizes the log-likelihood, given the data.
Method of Maximum Likelihood
• Using calculus, one takes the first partial derivative of the likelihood or log-likelihood function with respect to the parameter(s), sets it to zero, and solves for the parameter(s). The solution gives the MLE(s).
Method of Maximum Likelihood
If x is a continuous random variable with pdf f(x; θ₁, θ₂, ..., θₖ), where θ₁, θ₂, ..., θₖ are k unknown constant parameters which need to be estimated, conduct an experiment and obtain N independent observations x₁, x₂, ..., x_N. Then the likelihood function is given by the following product:
L(θ₁, ..., θₖ) = ∏ᵢ₌₁ᴺ f(xᵢ; θ₁, ..., θₖ)
The logarithmic likelihood function is given by:
ln L = Σᵢ₌₁ᴺ ln f(xᵢ; θ₁, ..., θₖ)
The maximum likelihood estimators (MLE) of θ₁, ..., θₖ are obtained by maximizing L or ln L.
By maximizing ln L, which is much easier to work with than L, the maximum likelihood estimators (MLE) of θ₁, ..., θₖ are the simultaneous solutions of the k equations such that:
∂(ln L) / ∂θⱼ = 0,  j = 1, 2, ..., k.
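As a sketch of how this works numerically (an illustrative exponential-distribution example, where the closed-form MLE of the rate is 1/x̄), the log-likelihood can be maximized directly, for example with SciPy:

    import numpy as np
    from scipy.optimize import minimize_scalar

    rng = np.random.default_rng(5)
    x = rng.exponential(scale=2.0, size=200)     # illustrative data, true rate = 0.5

    def neg_log_lik(lam):
        # exponential log-likelihood: n*ln(lam) - lam*sum(x); minimize its negative
        return -(len(x) * np.log(lam) - lam * x.sum())

    res = minimize_scalar(neg_log_lik, bounds=(1e-6, 10.0), method="bounded")
    print(res.x)            # numerical MLE of the rate
    print(1 / x.mean())     # closed-form MLE, should agree closely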
Properties of Maximum Likelihood Estimators
• For “large” samples (“asymptotically”), MLEs are optimal:
1. MLEs are asymptotically normally distributed.
2. MLEs are asymptotically “minimum variance.”
3. MLEs are asymptotically unbiased (MLEs are often biased, but the bias → 0 as n → ∞).
• MLEs are consistent.
• The Maximum Likelihood Estimator is asymptotically the most efficient estimator among all unbiased ones.
• Maximum likelihood estimation represents the backbone of statistical estimation.
Example
Suppose X₁, X₂, ..., Xₙ are independent Bernoulli(p) observations. Find the MLE of p.
The likelihood is
L(p) = ∏ᵢ p^(xᵢ) (1 − p)^(1 − xᵢ) = p^(Σ xᵢ) (1 − p)^(n − Σ xᵢ),
and the log-likelihood is
ln L(p) = (Σ xᵢ) ln p + (n − Σ xᵢ) ln(1 − p).
Taking derivatives and solving, we find p̂ = (1/n) Σᵢ xᵢ = x̄.
Example
A second example follows the same steps: write down the likelihood for the assumed model, take the logarithm, differentiate with respect to the parameter, set the derivative equal to zero, and solve to obtain the MLE.
Method of Least Squares
• A statistical technique for determining the line of best fit for a model.
• The least squares method fits an equation with certain parameters to observed data.
• This method is used extensively in regression analysis and estimation.
• Ordinary least squares: a straight line is fitted through a number of points so as to minimize the sum of the squares of the vertical distances (hence the name “least squares”) from the points to the line of best fit.
Method of Least Squares
For a fitted line ŷ = a + bx, define the distance of each data point from the line (the residual), denoted by u, as
uᵢ = yᵢ − (a + b xᵢ).
Least squares chooses a and b to minimize Σᵢ uᵢ². Setting the partial derivatives of Σᵢ uᵢ² with respect to a and b to zero gives
b = (n Σ xᵢyᵢ − Σ xᵢ Σ yᵢ) / (n Σ xᵢ² − (Σ xᵢ)²)  and  a = ȳ − b x̄.
Example: Method of Least Squares
To illustrate the computations of b and a, a small data set is used; the required sums Σx, Σy, Σxy, and Σx² are computed and substituted into the formulas above.
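A minimal computational sketch (with a small made-up data set for illustration, not the data from the original example), using NumPy:

    import numpy as np

    # illustrative (x, y) data
    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
    n = len(x)

    b = (n * (x * y).sum() - x.sum() * y.sum()) / (n * (x ** 2).sum() - x.sum() ** 2)
    a = y.mean() - b * x.mean()
    print(a, b)                     # intercept and slope of the least-squares line

    print(np.polyfit(x, y, 1))      # cross-check: returns [slope, intercept]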
Interval Estimation
Estimation of the parameter alone is not sufficient. It is also necessary to analyse how confident we can be about a particular estimate. One way of doing this is by defining confidence intervals. If we have estimated θ, we want to know whether the “true” parameter is close to our estimate. In other words, we want to find an interval that satisfies the following relation:
P{L < θ < U} ≥ 1 − α
i.e. the probability that the “true” parameter θ lies in the interval (L, U) is at least 1 − α. The actual realisation of this interval, (L, U), is called a 100(1 − α)% confidence interval; the limits of the interval are called the lower and upper confidence limits, and 1 − α is called the confidence level.
Example: If the population variance σ² is known and we estimate the population mean μ, then
Z = (x̄ − μ) / (σ / √n)
is standard normal, N(0, 1). We can find from the table that the probability that Z exceeds 1 is 0.1587, and the probability that Z is less than −1 is again 0.1587. These values come from tables of the standard normal distribution.
Interval Estimation
Interval estimation, credible intervals, and prediction intervals
Confidence intervals are one method of interval estimation, and the most widely used in Classical statistics. The analogous concept in Bayesian statistics is the credible interval, while prediction intervals, used in both Classical and Bayesian statistics, estimate the outcome of future samples rather than parameters.
An interval estimator of the population mean can be expressed as the probability that the mean lies between two values.
Interval estimation (“confidence interval”):
– use a range of numbers within which the parameter is believed to fall (lower bound, upper bound)
– e.g. (10, 20)
Interval Estimation for the Mean of a Normal Distribution
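For a normal population with known σ, the 100(1 − α)% interval for the mean is x̄ ± z₁₋α/₂ · σ/√n. A minimal computational sketch (illustrative values, using SciPy):

    import numpy as np
    from scipy.stats import norm

    x = np.array([10.2, 9.8, 10.5, 10.1, 9.9, 10.3, 10.0, 10.4])   # illustrative data
    sigma = 0.3                                                     # assumed known population sd
    alpha = 0.05

    z = norm.ppf(1 - alpha / 2)                  # 1.96 for a 95% interval
    half_width = z * sigma / np.sqrt(len(x))
    print(x.mean() - half_width, x.mean() + half_width)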
Confidence Interval
The simplest and most commonly used formula for a binomial confidence interval relies on approximating the binomial distribution with a normal distribution. This approximation is justified by the central limit theorem. The formula is
p̂ ± z₁₋α/₂ √( p̂(1 − p̂) / n ),
where p̂ is the proportion of successes in a Bernoulli trial process estimated from the statistical sample, z₁₋α/₂ is the 1 − α/2 percentile of a standard normal distribution, α is the error rate, and n is the sample size. For example, for a 95% confidence level the error (α) is 5%, so 1 − α/2 = 0.975 and z₁₋α/₂ = 1.96.
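A short sketch of this calculation (illustrative counts, using SciPy):

    import numpy as np
    from scipy.stats import norm

    successes, n, alpha = 37, 120, 0.05        # illustrative values
    p_hat = successes / n

    z = norm.ppf(1 - alpha / 2)
    half_width = z * np.sqrt(p_hat * (1 - p_hat) / n)
    print(p_hat - half_width, p_hat + half_width)   # approximate 95% interval for the proportion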
Exponential Distribution
For λ̂ = n / Σᵢ xᵢ, the 100(1 − α)% exact confidence interval for this estimate is given by[2]
λ̂ χ²(1 − α/2, 2n) / (2n)  <  λ  <  λ̂ χ²(α/2, 2n) / (2n),
which is also equal to:
χ²(1 − α/2, 2n) / (2 Σᵢ xᵢ)  <  λ  <  χ²(α/2, 2n) / (2 Σᵢ xᵢ),
where λ̂ is the MLE estimate, λ is the true value of the parameter, and χ²(p, ν) is the 100(1 − p) percentile of the chi-squared distribution with ν degrees of freedom.
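A computational sketch of this interval (illustrative data, using SciPy; note that chi2.ppf takes the lower-tail probability, so the 100(1 − p) percentile used above is chi2.ppf(1 − p, ν)):

    import numpy as np
    from scipy.stats import chi2

    rng = np.random.default_rng(6)
    x = rng.exponential(scale=2.0, size=40)      # illustrative data, true rate = 0.5
    n, alpha = len(x), 0.05

    lam_hat = n / x.sum()                        # MLE of the rate
    lower = lam_hat * chi2.ppf(alpha / 2, 2 * n) / (2 * n)
    upper = lam_hat * chi2.ppf(1 - alpha / 2, 2 * n) / (2 * n)
    print(lam_hat, lower, upper)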
Bayesian Estimation
Bayesian statistics views every unknown as a random quantity. Bayesian statistics is a little more complicated, in the simple cases, than computing the Maximum Likelihood Estimate. Suppose we have data X₁, ..., Xₙ from a distribution with unknown parameter θ. Our goal is to estimate the unknown θ. The first step in Bayesian statistics is to select a prior distribution, π(θ), intended to represent prior information about θ. Often you don't have any available. In this case, the prior should be relatively diffuse. For example, if we are trying to guess the average height (in feet) of students at RU, we may know enough to realize that most students are between 5 and 6 feet tall, and therefore the mean should be between 5 and 6 feet, but we may not want to be more specific than that. We wouldn't, for example, want to specify a normal prior centred at 5.6 feet with a tiny variance. Even though 5.6 feet may be a good guess, such a prior places almost all its mass between 5.599995 and 5.600005 feet, indicating we are almost sure, before seeing any data, that the mean height is in this range. I'm personally not that sure, so I might choose a much more diffuse prior, such as a uniform distribution on (5, 6), indicating that I'm sure the mean height is between 5 and 6 feet but every value in there seems about as likely as any other.
Prior and Posterior Distribution
The tool for guessing at the parameter's value using both prior knowledge of the parameter and the data is called the posterior distribution, which is defined as the conditional distribution of the parameter given the data; formally
π(θ | x) = L(x | θ) π(θ) / ∫ L(x | θ′) π(θ′) dθ′,
where L(x | θ) is the likelihood function. The posterior π(θ | x) is a distribution over θ and has all the usual properties of a distribution; in particular, it is nonnegative and integrates to 1.
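A minimal numerical sketch of this definition (illustrative Bernoulli data with a grid approximation to the integral, using NumPy): the posterior is simply prior × likelihood, renormalized so that it integrates to 1.

    import numpy as np

    theta = np.linspace(0.001, 0.999, 999)        # grid over the parameter
    step = theta[1] - theta[0]
    prior = np.ones_like(theta)                   # flat (diffuse) prior on (0, 1)

    data = np.array([1, 0, 1, 1, 0, 1, 1, 1])     # illustrative Bernoulli observations
    lik = theta ** data.sum() * (1 - theta) ** (len(data) - data.sum())

    posterior = prior * lik
    posterior /= posterior.sum() * step           # normalize so it integrates to 1

    print((theta * posterior).sum() * step)       # posterior mean of theta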
Prior and Posterior Distribution
1. Although not guaranteed, in almost all practical situations the posterior distribution provides a more refined guess of θ than the prior. We are combining our prior information with the information contained in the data to make better guesses about θ.
2. If we observe a large amount of data, the posterior distribution is determined almost exclusively by the data, and tends to place more and more mass near the true value of θ. Thus, we don't have to be too precise about specifying our prior distribution in advance; any errors will tend to wash out as we observe more data.
Properties of Posterior Mean
• The Bayes estimate of a parameter is the posterior mean. Usually the posterior distribution will have some common distributional form (such as Gamma, Normal, Beta, etc.). Some things to remember about the posterior mean:
• The data enter the equation for the posterior only through the likelihood function. Therefore, the parameters of the posterior distribution, and hence the posterior mean, are functions of the sufficient statistics.
• Often the posterior mean has lower MSE than the MLE over portions of the parameter space, so it's a worthwhile estimator to consider and compare to the MLE.
• The posterior mean is consistent and asymptotically unbiased (meaning the bias tends to 0 as the sample size increases), and the asymptotic efficiency of the MLE compared to the posterior mean is 1. In fact, for large n the MLE and the posterior mean are very similar estimators, as we will see in the examples.
Example: Geometric
Suppose we wished to use a general Beta(α, β) prior for p. We would like a formula for the posterior in terms of α and β. We proceed as before, finding the prior density to be
π(p) = [Γ(α + β) / (Γ(α) Γ(β))] p^(α − 1) (1 − p)^(β − 1).
The likelihood is unchanged: for geometric observations y₁, ..., yₙ counting the number of trials to the first success, it is p^n (1 − p)^(Σ yᵢ − n). The prior parameters α and β are treated as fixed constants; thus the Gamma functions in front may be considered part of the normalizing constant C, leaving, for the product of prior and likelihood, the kernel
p^(α + n − 1) (1 − p)^(β + Σ yᵢ − n − 1).
This is the kernel of a Beta(α + n, β + Σ yᵢ − n) distribution, with posterior mean
(α + n) / (α + β + Σ yᵢ).
Example: Binomial
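The Binomial case follows the standard Beta–Binomial conjugate update: with a Beta(α, β) prior on p and x successes in n trials, the posterior is Beta(α + x, β + n − x), with posterior mean (α + x)/(α + β + n). A brief sketch (illustrative counts, using SciPy):

    from scipy.stats import beta

    a, b = 2.0, 2.0                  # illustrative Beta(2, 2) prior
    x, n = 37, 120                   # illustrative data: 37 successes in 120 trials

    a_post, b_post = a + x, b + n - x
    print(a_post / (a_post + b_post))                 # posterior mean of p
    print(beta.ppf([0.025, 0.975], a_post, b_post))   # central 95% posterior interval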
Example: Poisson
Let Y₁, ..., Yₙ be iid Poisson(λ). Suppose you have a Gamma(α, β) prior on λ (with rate parameter β). Compute the posterior distribution of λ.
As stated above, our first goal is to compute and simplify the product of the likelihood and the prior. If the data are y₁, ..., yₙ, then the likelihood is
L(λ) = ∏ᵢ e^(−λ) λ^(yᵢ) / yᵢ! = e^(−nλ) λ^(Σ yᵢ) / ∏ᵢ yᵢ!,
and the prior density is
π(λ) = [β^α / Γ(α)] λ^(α − 1) e^(−βλ).
The posterior distribution is proportional to the product of the likelihood and the prior, which simplifies to
π(λ | y) ∝ λ^(α + Σ yᵢ − 1) e^(−(β + n)λ).
Example: Poisson
All the yᵢ, α, and β are constants, since λ is the only thing random in this expression. The terms that involve λ are
λ^(α + Σ yᵢ − 1) e^(−(β + n)λ).
Hence the posterior distribution is a Gamma(α + Σ yᵢ, β + n) distribution.
The Bayes estimate is the posterior mean. The posterior mean of a Gamma(a, b) distribution (with rate b) is a / b. Notice that the posterior mean here is
(α + Σ yᵢ) / (β + n).
The only terms that get large as n increases are Σ yᵢ and n. Thus, for large n, the posterior mean is approximately Σ yᵢ / n = ȳ, the MLE.
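A small numerical sketch of this update (illustrative prior and data, using NumPy): the posterior mean sits between the prior mean and the MLE, and moves toward the MLE as n grows.

    import numpy as np

    rng = np.random.default_rng(7)
    a, b = 3.0, 1.0                           # illustrative Gamma(3, 1) prior (rate b); prior mean 3
    lam_true = 5.0

    for n in (5, 50, 500):
        y = rng.poisson(lam_true, size=n)     # illustrative Poisson data
        a_post, b_post = a + y.sum(), b + n   # conjugate Gamma posterior
        print(n, a_post / b_post, y.mean())   # posterior mean vs the MLE (ybar)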
Example: Normal
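The Normal case follows the same conjugate pattern; stated here as the standard known-variance result: if Y₁, ..., Yₙ are iid N(μ, σ²) with σ² known and the prior on μ is N(μ₀, τ²), then the posterior of μ is normal with
mean = (μ₀/τ² + nȳ/σ²) / (1/τ² + n/σ²)  and  variance = 1 / (1/τ² + n/σ²),
so the posterior mean is a precision-weighted average of the prior mean μ₀ and the sample mean ȳ, and for large n it is approximately ȳ, the MLE.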
Bayesian Interval Estimation
Prediction
Predictive Distribution: Binomial-Beta
Predictive Density: Normal-Normal
Predictive Distribution: Binomial-Beta