Probability and Estimation - Department of Statistics | Rajshahi

Prof. Dr. S. K. Bhattacharjee
Department of Statistics
University of Rajshahi
Statistical Inference
• Statistical inference is the process of making judgments about an unknown population based on a sample.
• An important aspect of statistical inference is using estimates to approximate the value of an unknown population parameter.
• Another type of inference involves choosing between two opposing views or statements about the population; this process is called hypothesis testing.
Statistical Estimation
• An estimator is a statistic (a function of the sample) that provides an estimate of a population parameter.
• Point Estimation
• Interval Estimation
Point Estimation
• A point estimator gives a single numerical estimate of a population parameter.
• The sample mean, Ȳ, is a point estimator of the population mean, μ.
• The sample proportion, p, is a point estimator of the population proportion, π.
Properties of a good Estimator
Principles of Parameter Estimation
• Unbiased
– The expected value of the estimator equals the population parameter.
• Consistent
– As the sample size n increases, the estimator converges (in probability) to the population parameter.
• Efficient
– Has the smallest variance among comparable estimators.
• Minimum Mean-Squared Error
– The mean-squared error of the estimator is as low as possible.
• Sufficient
– Contains all the information about the parameter that a sample of size n provides.
Unbiased Estimator
• An unbiased estimator is a statistic whose expected value equals the population parameter being estimated.
• E[θ̂ₙ] = θ₀ for any n
• Examples:
• The sample mean, Ȳ, is an unbiased estimator of the population mean, μ.
• The sample variance, S² = Σᵢ(Yᵢ − Ȳ)² / (n − 1), is an unbiased estimator of the population variance, σ².
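As a quick numerical check (a minimal simulation sketch with illustrative values, using NumPy), the averages of Ȳ and S² over many repeated samples land close to μ and σ²:

    import numpy as np

    rng = np.random.default_rng(0)
    mu, sigma, n, reps = 10.0, 2.0, 15, 200_000      # illustrative values

    samples = rng.normal(mu, sigma, size=(reps, n))
    ybar = samples.mean(axis=1)                      # sample means
    s2 = samples.var(axis=1, ddof=1)                 # sample variances with divisor n - 1

    print(ybar.mean())   # close to mu = 10,     illustrating E[Ybar] = mu
    print(s2.mean())     # close to sigma^2 = 4, illustrating E[S^2] = sigma^2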
Consistent Estimators
• A statistic is a consistent estimator of a parameter if the probability that it is close to the parameter's true value approaches 1 as the sample size increases.
• Mathematically, a sequence of estimators {tₙ} is a consistent estimator for a parameter θ if and only if, for every ϵ > 0, no matter how small,
lim (n → ∞) P(|tₙ − θ| < ϵ) = 1.
• The standard error of a consistent estimator becomes smaller as the sample size gets larger.
• The sample mean and the sample proportion are consistent estimators, since from their formulas, σ_Ȳ = σ/√n and σₚ = √(π(1 − π)/n), the standard errors become smaller as n gets larger.
Consistent Estimators
An estimator's distribution (like that of any other non-trivial statistic) becomes narrower and more normal-like as larger and larger samples are considered. If we take for granted that the variance of the estimator tends to 0 as the sample size grows without limit, what consistency really means is that the mean of the estimator's distribution tends to θ₀ as the sample size grows without limit.
In technical terms, a consistent estimator is a sequence of random variables indexed by n (the sample size) that converges in probability to θ₀.
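For instance, a small simulation (illustrative values, using NumPy) shows P(|Ȳ − μ| < ϵ) rising toward 1 as n grows, which is exactly the convergence in probability described above:

    import numpy as np

    rng = np.random.default_rng(1)
    mu, sigma, eps, reps = 0.0, 1.0, 0.1, 10_000     # illustrative values

    for n in (10, 100, 1000):
        ybar = rng.normal(mu, sigma, size=(reps, n)).mean(axis=1)
        print(n, np.mean(np.abs(ybar - mu) < eps))   # estimated P(|Ybar - mu| < eps) -> 1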
Relative Efficiency
A parameter may have several unbiased estimators. For example, given a symmetrical continuous distribution, both:
* The sample mean, and
* The sample median
are unbiased estimators of the distribution mean (when it exists).
Which one should we choose?
Certainly we should choose the estimator that generates estimates that are closer (in a probabilistic sense) to the true value θ₀ than the estimates generated by the other one. One way to do that is to select the estimator with the lower variance.
This leads to the definition of the relative efficiency of two unbiased estimators. Given two unbiased estimators θ*₁ and θ*₂ of the same parameter θ, one defines the efficiency of θ*₂ with respect to θ*₁ (for a given sample size n) as the ratio of their variances:
Relative efficiency (θ*₂ with respect to θ*₁)ₙ = Var(θ*₁)ₙ / Var(θ*₂)ₙ
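As a numerical illustration (a simulation sketch with illustrative values, using NumPy): for normal data the sample median has roughly π/2 ≈ 1.57 times the variance of the sample mean, so its efficiency relative to the mean is about 0.64.

    import numpy as np

    rng = np.random.default_rng(2)
    n, reps = 101, 100_000                      # illustrative values
    x = rng.normal(0.0, 1.0, size=(reps, n))    # N(0, 1) samples

    var_mean = x.mean(axis=1).var()
    var_median = np.median(x, axis=1).var()
    print(var_median / var_mean)                # about pi/2 ~ 1.57
    print(var_mean / var_median)                # relative efficiency of the median ~ 0.64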
Efficient Estimator
• An efficient estimator has a low variance, usually judged relative to other estimators; this comparison is called relative efficiency. Equivalently, the variance of the estimator should be as small as possible.
• Efficiency concerns the reliability of an estimator: its tendency to have a smaller standard error than competing estimators for the same sample size.
• The median is an unbiased estimator of μ when the sampled distribution is normal, but its standard error is about 1.25 times that of the sample mean, so the sample mean is a more efficient estimator than the median.
• The Maximum Likelihood Estimator is asymptotically the most efficient estimator among all unbiased ones.
Minimum Mean-Squared Error Estimator
The practitioner is not particularly keen on unbiasedness. What is really important is that, on average, the estimate θ* be close to the true value θ₀. So he will tend to favour estimators such that the mean-squared error
E[(θ* − θ₀)²]
is as low as possible, whether θ* is biased or not. Such an estimator is called a minimum mean-squared-error estimator.
Given two estimators:
* θ*₁, which is unbiased but has a large variance, and
* θ*₂, which is somewhat biased but has a small variance,
θ*₂ might prove a better estimator than θ*₁ in practice.
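A concrete sketch of this trade-off (illustrative values, using NumPy): for normal data, the biased variance estimator with divisor n has a lower mean-squared error than the unbiased divisor n − 1 version, because its smaller variance more than compensates for its bias.

    import numpy as np

    rng = np.random.default_rng(3)
    sigma2, n, reps = 4.0, 10, 200_000                  # illustrative values
    x = rng.normal(0.0, np.sqrt(sigma2), size=(reps, n))

    for ddof in (1, 0):                                 # divisor n - 1 (unbiased) vs n (biased)
        est = x.var(axis=1, ddof=ddof)
        bias = est.mean() - sigma2
        mse = np.mean((est - sigma2) ** 2)
        print(f"ddof={ddof}: bias={bias:+.3f}  MSE={mse:.3f}")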
Sufficient Estimator
• We have shown that Ȳ and S² are unbiased estimators of μ and σ², respectively.
• Are we losing any information about our target parameters by relying on these statistics?
• Statistics that summarize all the information about the target parameters are said to have the property of sufficiency; they are called sufficient statistics.
• “Good” estimators are (or can be made to be) functions of any sufficient statistic.
Sufficient Estimator
* Let Y₁, Y₂, ..., Yₙ denote a random sample from a probability distribution with unknown parameter θ. Then the statistic U is said to be sufficient for θ if the conditional distribution of Y₁, Y₂, ..., Yₙ given U does not depend on θ.
* Let U be a statistic based on the random sample Y₁, Y₂, ..., Yₙ. Then U is a sufficient statistic for the estimation of a parameter θ if and only if the likelihood L(y₁, y₂, ..., yₙ | θ) can be factored into two nonnegative functions,
L(y₁, y₂, ..., yₙ | θ) = g(u, θ) · h(y₁, y₂, ..., yₙ),
where g(u, θ) is a function only of u and θ, and h(y₁, y₂, ..., yₙ) is not a function of θ.
Example: Sufficient Estimator
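A standard illustration of the factorization criterion: suppose Y₁, ..., Yₙ is a random sample from a Poisson distribution with mean λ. The likelihood is
L(y₁, ..., yₙ | λ) = ∏ᵢ e^(−λ) λ^(yᵢ) / yᵢ! = [e^(−nλ) λ^(Σ yᵢ)] · [1 / ∏ᵢ yᵢ!] = g(u, λ) · h(y₁, ..., yₙ), with u = Σᵢ yᵢ.
Since g depends on the data only through u, and h does not involve λ, the statistic U = Σᵢ Yᵢ (and hence Ȳ) is sufficient for λ.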
Methods of Point Estimation
• Classical Approach
• Bayesian Approach
Classical Approach:
– Method of Moments
– Method of Maximum Likelihood
– Method of Least Squares
Method of Moments
i) Sample moments should provide good estimates of the corresponding population moments.
ii) Because the population moments are functions of the population parameters, we can use i) to obtain estimates of these parameters.
Formal Definition:
Choose as estimates those values of the parameters that are solutions of the equations μ′ₖ = m′ₖ, for k = 1, 2, ..., t, where t is the number of parameters to be estimated, μ′ₖ is the k-th population moment, and m′ₖ is the k-th sample moment.
Example
A random sample Y₁, Y₂, ..., Yₙ is selected from a population in which Yᵢ possesses a uniform density function over the interval (0, θ), where θ is unknown. Use the method of moments to estimate θ.
Solution
The value of μ′₁ for a uniform random variable on (0, θ) is
μ′₁ = θ / 2.
The corresponding first sample moment is
m′₁ = (1/n) Σᵢ Yᵢ = Ȳ.
Setting μ′₁ = m′₁ gives θ/2 = Ȳ, and thus
θ̂ = 2Ȳ.
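A quick simulation sketch of this estimator (illustrative values, using NumPy):

    import numpy as np

    rng = np.random.default_rng(4)
    theta, n, reps = 7.0, 50, 100_000        # illustrative true theta and sample size

    y = rng.uniform(0.0, theta, size=(reps, n))
    theta_hat = 2 * y.mean(axis=1)           # method-of-moments estimator 2 * Ybar

    print(theta_hat.mean())                  # close to theta = 7 (unbiased)
    print(theta_hat.std())                   # sampling variability of the estimator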
Method of Maximum Likelihood
The likelihood and log-likelihood functions are the basis for deriving estimators of parameters, given data. While the shapes of these two functions differ, they attain their maximum at the same parameter value. The value of the parameter that corresponds to this maximum point is defined as the Maximum Likelihood Estimate (MLE). This is the value that is “most likely” relative to the other values. The maximum likelihood estimate of the unknown parameter in the model is the value that maximizes the log-likelihood, given the data.
Method of Maximum Likelihood
• Using calculus, one takes the first partial derivative of the likelihood or log-likelihood function with respect to the parameter(s), sets it to zero, and solves for the parameter(s). The solution gives the MLE(s).
Method of Maximum Likelihood
If x is a continuous random variable with pdf f(x; θ₁, θ₂, ..., θₖ), where θ₁, θ₂, ..., θₖ are k unknown constant parameters which need to be estimated, conduct an experiment and obtain N independent observations x₁, x₂, ..., x_N. Then the likelihood function is given by the following product:
L(θ₁, ..., θₖ) = ∏ᵢ₌₁ᴺ f(xᵢ; θ₁, ..., θₖ)
The logarithmic likelihood function is given by:
ln L = Σᵢ₌₁ᴺ ln f(xᵢ; θ₁, ..., θₖ)
The maximum likelihood estimators (MLE) of θ₁, ..., θₖ are obtained by maximizing L or ln L.
By maximizing ln L, which is much easier to work with than L, the maximum likelihood estimators (MLE) of θ₁, ..., θₖ are the simultaneous solutions of the k equations such that:
∂(ln L) / ∂θⱼ = 0,  j = 1, 2, ..., k.
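As a sketch of how this works numerically (an illustrative exponential-distribution example, where the closed-form MLE of the rate is 1/x̄), the log-likelihood can be maximized directly, for example with SciPy:

    import numpy as np
    from scipy.optimize import minimize_scalar

    rng = np.random.default_rng(5)
    x = rng.exponential(scale=2.0, size=200)     # illustrative data, true rate = 0.5

    def neg_log_lik(lam):
        # exponential log-likelihood: n*ln(lam) - lam*sum(x); minimize its negative
        return -(len(x) * np.log(lam) - lam * x.sum())

    res = minimize_scalar(neg_log_lik, bounds=(1e-6, 10.0), method="bounded")
    print(res.x)            # numerical MLE of the rate
    print(1 / x.mean())     # closed-form MLE, should agree closely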
Properties of Maximum Likelihood Estimators
• For “large” samples (“asymptotically”), MLEs are optimal:
1. MLEs are asymptotically normally distributed.
2. MLEs are asymptotically “minimum variance.”
3. MLEs are asymptotically unbiased (MLEs are often biased, but the bias → 0 as n → ∞).
• MLEs are consistent.
• The Maximum Likelihood Estimator is asymptotically the most efficient estimator among all unbiased ones.
• Maximum likelihood estimation represents the backbone of statistical estimation.
Example
Suppose X₁, X₂, ..., Xₙ are independent Bernoulli(p) observations. Find the MLE of p.
The likelihood is
L(p) = ∏ᵢ p^(xᵢ) (1 − p)^(1 − xᵢ) = p^(Σ xᵢ) (1 − p)^(n − Σ xᵢ),
and the log-likelihood is
ln L(p) = (Σ xᵢ) ln p + (n − Σ xᵢ) ln(1 − p).
Taking derivatives and solving, we find p̂ = (1/n) Σᵢ xᵢ = x̄.
Example
A second example follows the same steps: write down the likelihood for the assumed model, take the logarithm, differentiate with respect to the parameter, set the derivative equal to zero, and solve to obtain the MLE.
Method of Least Squares
• A statistical technique for determining the line of best fit for a model.
• The least squares method fits an equation with certain parameters to observed data.
• This method is used extensively in regression analysis and estimation.
• Ordinary least squares: a straight line is fitted through a number of points so as to minimize the sum of the squares of the vertical distances (hence the name “least squares”) from the points to the line of best fit.
Method of Least Squares
For a fitted line ŷ = a + bx, define the distance of each data point from the line (the residual), denoted by u, as
uᵢ = yᵢ − (a + b xᵢ).
Least squares chooses a and b to minimize Σᵢ uᵢ². Setting the partial derivatives of Σᵢ uᵢ² with respect to a and b to zero gives
b = (n Σ xᵢyᵢ − Σ xᵢ Σ yᵢ) / (n Σ xᵢ² − (Σ xᵢ)²)  and  a = ȳ − b x̄.
Example: Method of Least Squares
To illustrate the computations of b and a, a small data set is used; the required sums Σx, Σy, Σxy, and Σx² are computed and substituted into the formulas above.
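A minimal computational sketch (with a small made-up data set for illustration, not the data from the original example), using NumPy:

    import numpy as np

    # illustrative (x, y) data
    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
    n = len(x)

    b = (n * (x * y).sum() - x.sum() * y.sum()) / (n * (x ** 2).sum() - x.sum() ** 2)
    a = y.mean() - b * x.mean()
    print(a, b)                     # intercept and slope of the least-squares line

    print(np.polyfit(x, y, 1))      # cross-check: returns [slope, intercept]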
Interval Estimation
Estimation of the parameter alone is not sufficient. It is also necessary to analyse how confident we can be about a particular estimate. One way of doing this is by defining confidence intervals. If we have estimated θ, we want to know whether the “true” parameter is close to our estimate. In other words, we want to find an interval that satisfies the following relation:
P{L < θ < U} ≥ 1 − α
i.e. the probability that the “true” parameter θ lies in the interval (L, U) is at least 1 − α. The actual realisation of this interval, (L, U), is called a 100(1 − α)% confidence interval; the limits of the interval are called the lower and upper confidence limits, and 1 − α is called the confidence level.
Example: If the population variance σ² is known and we estimate the population mean μ, then
Z = (x̄ − μ) / (σ / √n)
is standard normal, N(0, 1). We can find from the table that the probability that Z exceeds 1 is 0.1587, and the probability that Z is less than −1 is again 0.1587. These values come from tables of the standard normal distribution.
Interval Estimation
Interval estimation, credible intervals, and prediction intervals
Confidence intervals are one method of interval estimation, and the most widely used in Classical statistics. The analogous concept in Bayesian statistics is the credible interval, while prediction intervals, used in both Classical and Bayesian statistics, estimate the outcome of future samples rather than parameters.
An interval estimator of the population mean can be expressed as the probability that the mean lies between two values.
Interval estimation (“confidence interval”):
– use a range of numbers within which the parameter is believed to fall (lower bound, upper bound)
– e.g. (10, 20)
Interval Estimation for the Mean of a Normal Distribution
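For a normal population with known σ, the 100(1 − α)% interval for the mean is x̄ ± z₁₋α/₂ · σ/√n. A minimal computational sketch (illustrative values, using SciPy):

    import numpy as np
    from scipy.stats import norm

    x = np.array([10.2, 9.8, 10.5, 10.1, 9.9, 10.3, 10.0, 10.4])   # illustrative data
    sigma = 0.3                                                     # assumed known population sd
    alpha = 0.05

    z = norm.ppf(1 - alpha / 2)                  # 1.96 for a 95% interval
    half_width = z * sigma / np.sqrt(len(x))
    print(x.mean() - half_width, x.mean() + half_width)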
Confidence Interval
The simplest and most commonly used formula for a binomial confidence interval relies on approximating the binomial distribution with a normal distribution. This approximation is justified by the central limit theorem. The formula is
p̂ ± z₁₋α/₂ √( p̂(1 − p̂) / n ),
where p̂ is the proportion of successes in a Bernoulli trial process estimated from the statistical sample, z₁₋α/₂ is the 1 − α/2 percentile of a standard normal distribution, α is the error rate, and n is the sample size. For example, for a 95% confidence level the error (α) is 5%, so 1 − α/2 = 0.975 and z₁₋α/₂ = 1.96.
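A short sketch of this calculation (illustrative counts, using SciPy):

    import numpy as np
    from scipy.stats import norm

    successes, n, alpha = 37, 120, 0.05        # illustrative values
    p_hat = successes / n

    z = norm.ppf(1 - alpha / 2)
    half_width = z * np.sqrt(p_hat * (1 - p_hat) / n)
    print(p_hat - half_width, p_hat + half_width)   # approximate 95% interval for the proportion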
Exponential Distribution
For λ̂ = n / Σᵢ xᵢ, the 100(1 − α)% exact confidence interval for this estimate is given by[2]
λ̂ χ²(1 − α/2, 2n) / (2n)  <  λ  <  λ̂ χ²(α/2, 2n) / (2n),
which is also equal to:
χ²(1 − α/2, 2n) / (2 Σᵢ xᵢ)  <  λ  <  χ²(α/2, 2n) / (2 Σᵢ xᵢ),
where λ̂ is the MLE estimate, λ is the true value of the parameter, and χ²(p, ν) is the 100(1 − p) percentile of the chi-squared distribution with ν degrees of freedom.
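A computational sketch of this interval (illustrative data, using SciPy; note that chi2.ppf takes the lower-tail probability, so the 100(1 − p) percentile used above is chi2.ppf(1 − p, ν)):

    import numpy as np
    from scipy.stats import chi2

    rng = np.random.default_rng(6)
    x = rng.exponential(scale=2.0, size=40)      # illustrative data, true rate = 0.5
    n, alpha = len(x), 0.05

    lam_hat = n / x.sum()                        # MLE of the rate
    lower = lam_hat * chi2.ppf(alpha / 2, 2 * n) / (2 * n)
    upper = lam_hat * chi2.ppf(1 - alpha / 2, 2 * n) / (2 * n)
    print(lam_hat, lower, upper)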
Bayesian Estimation
Bayesian statistics views every unknown as a random quantity. Bayesian statistics is a little more complicated, in the simple cases, than computing the Maximum Likelihood Estimate. Suppose we have data X₁, ..., Xₙ from a distribution with unknown parameter θ. Our goal is to estimate the unknown θ. The first step in Bayesian statistics is to select a prior distribution, π(θ), intended to represent prior information about θ. Often you don't have any available. In this case, the prior should be relatively diffuse. For example, if we are trying to guess the average height (in feet) of students at RU, we may know enough to realize that most students are between 5 and 6 feet tall, and therefore the mean should be between 5 and 6 feet, but we may not want to be more specific than that. We wouldn't, for example, want to specify a normal prior centred at 5.6 feet with a tiny variance. Even though 5.6 feet may be a good guess, such a prior places almost all its mass between 5.599995 and 5.600005 feet, indicating we are almost sure, before seeing any data, that the mean height is in this range. I'm personally not that sure, so I might choose a much more diffuse prior, such as a uniform distribution on (5, 6), indicating that I'm sure the mean height is between 5 and 6 feet but every value in there seems about as likely as any other.
Prior and Posterior Distribution
The tool for guessing at the parameter's value using both prior knowledge of the parameter and the data is called the posterior distribution, which is defined as the conditional distribution of the parameter given the data; formally
π(θ | x) = L(x | θ) π(θ) / ∫ L(x | θ′) π(θ′) dθ′,
where L(x | θ) is the likelihood function. The posterior π(θ | x) is a distribution over θ and has all the usual properties of a distribution; in particular, it is nonnegative and integrates to 1.
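A minimal numerical sketch of this definition (illustrative Bernoulli data with a grid approximation to the integral, using NumPy): the posterior is simply prior × likelihood, renormalized so that it integrates to 1.

    import numpy as np

    theta = np.linspace(0.001, 0.999, 999)        # grid over the parameter
    step = theta[1] - theta[0]
    prior = np.ones_like(theta)                   # flat (diffuse) prior on (0, 1)

    data = np.array([1, 0, 1, 1, 0, 1, 1, 1])     # illustrative Bernoulli observations
    lik = theta ** data.sum() * (1 - theta) ** (len(data) - data.sum())

    posterior = prior * lik
    posterior /= posterior.sum() * step           # normalize so it integrates to 1

    print((theta * posterior).sum() * step)       # posterior mean of theta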
Prior and Posterior Distribution
1. Although not guaranteed, in almost all practical situations the posterior distribution provides a more refined guess of θ than the prior. We are combining our prior information with the information contained in the data to make better guesses about θ.
2. If we observe a large amount of data, the posterior distribution is determined almost exclusively by the data, and tends to place more and more mass near the true value of θ. Thus, we don't have to be too precise about specifying our prior distribution in advance; any errors will tend to wash out as we observe more data.
Properties of Posterior Mean
• The Bayes estimate of a parameter is the posterior mean. Usually the posterior distribution will have some common distributional form (such as Gamma, Normal, Beta, etc.). Some things to remember about the posterior mean:
• The data enter the equation for the posterior only through the likelihood function. Therefore, the parameters of the posterior distribution, and hence the posterior mean, are functions of the sufficient statistics.
• Often the posterior mean has lower MSE than the MLE over portions of the parameter space, so it's a worthwhile estimator to consider and compare to the MLE.
• The posterior mean is consistent and asymptotically unbiased (meaning the bias tends to 0 as the sample size increases), and the asymptotic efficiency of the MLE compared to the posterior mean is 1. In fact, for large n the MLE and the posterior mean are very similar estimators, as we will see in the examples.
Example: Geometric
Suppose we wished to use a general Beta(α, β) prior for p. We would like a formula for the posterior in terms of α and β. We proceed as before, finding the prior density to be
π(p) = [Γ(α + β) / (Γ(α) Γ(β))] p^(α − 1) (1 − p)^(β − 1).
The likelihood is unchanged: for geometric observations y₁, ..., yₙ counting the number of trials to the first success, it is p^n (1 − p)^(Σ yᵢ − n). The prior parameters α and β are treated as fixed constants; thus the Gamma functions in front may be considered part of the normalizing constant C, leaving, for the product of prior and likelihood, the kernel
p^(α + n − 1) (1 − p)^(β + Σ yᵢ − n − 1).
This is the kernel of a Beta(α + n, β + Σ yᵢ − n) distribution, with posterior mean
(α + n) / (α + β + Σ yᵢ).
Example: Binomial
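The Binomial case follows the standard Beta–Binomial conjugate update: with a Beta(α, β) prior on p and x successes in n trials, the posterior is Beta(α + x, β + n − x), with posterior mean (α + x)/(α + β + n). A brief sketch (illustrative counts, using SciPy):

    from scipy.stats import beta

    a, b = 2.0, 2.0                  # illustrative Beta(2, 2) prior
    x, n = 37, 120                   # illustrative data: 37 successes in 120 trials

    a_post, b_post = a + x, b + n - x
    print(a_post / (a_post + b_post))                 # posterior mean of p
    print(beta.ppf([0.025, 0.975], a_post, b_post))   # central 95% posterior interval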
Example: Poisson
Let Y₁, ..., Yₙ be iid Poisson(λ). Suppose you have a Gamma(α, β) prior on λ (with rate parameter β). Compute the posterior distribution of λ.
As stated above, our first goal is to compute and simplify the product of the likelihood and the prior. If the data are y₁, ..., yₙ, then the likelihood is
L(λ) = ∏ᵢ e^(−λ) λ^(yᵢ) / yᵢ! = e^(−nλ) λ^(Σ yᵢ) / ∏ᵢ yᵢ!,
and the prior density is
π(λ) = [β^α / Γ(α)] λ^(α − 1) e^(−βλ).
The posterior distribution is proportional to the product of the likelihood and the prior, which simplifies to
π(λ | y) ∝ λ^(α + Σ yᵢ − 1) e^(−(β + n)λ).
Example: Poisson
All the yᵢ, α, and β are constants, since λ is the only thing random in this expression. The terms that involve λ are
λ^(α + Σ yᵢ − 1) e^(−(β + n)λ).
Hence the posterior distribution is a Gamma(α + Σ yᵢ, β + n) distribution.
The Bayes estimate is the posterior mean. The posterior mean of a Gamma(a, b) distribution (with rate b) is a / b. Notice that the posterior mean here is
(α + Σ yᵢ) / (β + n).
The only terms that get large as n increases are Σ yᵢ and n. Thus, for large n, the posterior mean is approximately Σ yᵢ / n = ȳ, the MLE.
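A small numerical sketch of this update (illustrative prior and data, using NumPy): the posterior mean sits between the prior mean and the MLE, and moves toward the MLE as n grows.

    import numpy as np

    rng = np.random.default_rng(7)
    a, b = 3.0, 1.0                           # illustrative Gamma(3, 1) prior (rate b); prior mean 3
    lam_true = 5.0

    for n in (5, 50, 500):
        y = rng.poisson(lam_true, size=n)     # illustrative Poisson data
        a_post, b_post = a + y.sum(), b + n   # conjugate Gamma posterior
        print(n, a_post / b_post, y.mean())   # posterior mean vs the MLE (ybar)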
Example: Normal
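The Normal case follows the same conjugate pattern; stated here as the standard known-variance result: if Y₁, ..., Yₙ are iid N(μ, σ²) with σ² known and the prior on μ is N(μ₀, τ²), then the posterior of μ is normal with
mean = (μ₀/τ² + nȳ/σ²) / (1/τ² + n/σ²)  and  variance = 1 / (1/τ² + n/σ²),
so the posterior mean is a precision-weighted average of the prior mean μ₀ and the sample mean ȳ, and for large n it is approximately ȳ, the MLE.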
Bayesian Interval Estimation
Prediction
Predictive Distribution: Binomial-Beta
Predictive Density: Normal-Normal
Predictive Distribution: Binomial-Beta