Inference: Mean, Proportion, CLT, Bootstrap

From Probability to Statistics
• In all our probability calculations, we have assumed that we know all the quantities needed to solve the problem:
– Portfolio problems: To find the expected return and standard deviation of a portfolio, we assumed we knew the mean and standard deviation of the returns of the underlying stocks.
– Potato chip example: To find the proportion of bags below the 8-ounce minimum, we assumed we knew the mean and standard deviation of the weight of chips in each bag.
– In practice, these types of parameters are not given to us; we must estimate them from data.
• Statistical analysis usually proceeds along the following lines:
– Postulate a probability model (usually including unknown parameters) for a situation involving uncertainty; e.g., assume that a certain quantity follows a normal distribution.
– Use data to estimate the unknown parameters in the model.
– Plug the estimated parameters into the model in order to make predictions from the model.

How do we start?
• The first step, picking a model, must be based on an understanding of the situation to be modeled.
– Which assumptions are plausible?
– Which are not?
– These questions are answered by judgment, not by precise statistical techniques.
• Examples:
– Assume that daily changes in a stock price follow a normal distribution.
   • Use historical data to estimate the mean and standard deviation.
   • Once we have estimates, we might use the model to predict future price ranges or to value an option on the stock.
– Assume that demand for a fashion item is normally distributed.
   • Use historical data to estimate the mean and standard deviation.
   • Once we have estimates, we might use the model to set production levels.

How do we get data and make inferences?
• The first step in understanding the process of estimation is understanding basic properties of sampled data and sample statistics, since these are the basis of estimation.
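As a rough illustration of the three-step recipe described earlier (postulate a model, estimate its parameters, plug the estimates in), here is a minimal R sketch. The data are simulated stand-ins for historical daily price changes; the "true" values 0.05 and 1.2 and the names `changes`, `m_hat`, `s_hat` and `pred_range` are assumptions made up for this example, not taken from the slides.

```r
set.seed(1)                                   # for reproducibility

# Stand-in for real data: simulate 250 "historical" daily changes.
# The mean 0.05 and sd 1.2 are hypothetical values chosen for illustration.
changes <- rnorm(250, mean = 0.05, sd = 1.2)

# Step 1: postulate a model: daily changes are N(mu, sigma^2), parameters unknown.
# Step 2: estimate the unknown parameters from the data.
m_hat <- mean(changes)
s_hat <- sd(changes)

# Step 3: plug the estimates into the model to make a prediction,
# here a range expected to cover about 95% of future daily changes.
pred_range <- qnorm(c(0.025, 0.975), mean = m_hat, sd = s_hat)

print(c(m_hat = m_hat, s_hat = s_hat))
print(pred_range)
```

The prediction is only as good as the modeling assumption in step 1; the code cannot check whether normality is plausible for real price changes.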
• When we talk about sampling, it is always in the context of a fixed underlying population:
– If we look at 50 daily changes in IBM stock, we are looking at a sample of size 50 from the population of all daily changes in IBM stock.
– If we ask 150 shoppers whether or not they buy corn flakes, we have a sample of size 150 from all possible shoppers.
• If the population is very large (as in these examples), we generally treat it as though it were infinite; this simplifies matters. Thus, we are primarily concerned with finite samples from infinite populations.
• A single observation drawn at random from a population is a random variable. Its distribution is the population distribution; e.g.,
– The distribution of a randomly selected daily change in IBM stock is the distribution over all daily changes;
– The probability that a randomly selected shopper buys corn flakes is the proportion of the entire population that buys corn flakes.

Random Sample
• A random sample from a population is a set of randomly selected observations from that population. If $X_1, \dots, X_n$ are a random sample, then
– they are independent;
– they are identically distributed, all with the distribution of the underlying population.
• A sample statistic is any quantity calculated from a random sample. The most familiar example of a sample statistic is the sample mean $\bar{X}$, given by
   $\bar{X} = (X_1 + X_2 + \dots + X_n)/n$
• The sample mean gives an estimate of the population mean $\mu = E[X_i]$.

Distribution of the Sample Mean
• Every sample statistic is a random variable.
– Randomness is introduced through the sampling mechanism.
• As noted above, the sample mean $\bar{X}$ of a random sample $X_1, \dots, X_n$ is an estimate of the population mean $\mu = E[X_i]$.
– How good an estimate is it?
– How can we assess the uncertainty in the estimate?
– To answer these questions, we need to examine the sampling distribution of the sample mean; that is, the distribution of the random variable $\bar{X}$.
• Assume that the underlying population is normal with mean $\mu$ and variance $\sigma^2$.
– This means that $X_i \sim N(\mu, \sigma^2)$ for all i.
– The $X_i$'s are independent, since we assume we have a random sample.
• The sum of independent normal random variables is normally distributed. The usual rules for means and variances apply:
– The expected value of the sum is the sum of the expected values.
– The variance of the sum is the sum of the variances (by independence).
• Any linear transformation of a normal random variable is normal; in particular, multiplication by a constant preserves normality.

Distribution of the Sample Mean (continued)
• Using these two facts, we find that if $X_i \sim N(\mu, \sigma^2)$ for all i, then
– $X_1 + X_2 + \dots + X_n \sim N(n\mu, n\sigma^2)$;
– $\bar{X} = (X_1 + X_2 + \dots + X_n)/n \sim N(\mu, \sigma^2/n)$.
• The sample mean from a normal population has a normal distribution.
• First consequence:
– The expected value of the sample mean is the population mean; "on average" the sample mean correctly estimates the underlying mean.
– The standard deviation of a sample statistic is called its standard error. Thus, we have shown that the standard error of the sample mean is $\sigma/\sqrt{n}$, where $\sigma$ is the underlying standard deviation and n is the sample size.
• Second consequence:
– Because the standard error of the sample mean is $\sigma/\sqrt{n}$, the uncertainty in this estimate decreases as the sample size n increases. (That's good.)
– The uncertainty (as measured by the standard error) decreases rather slowly: to cut the standard error in half, we need to collect four times as much data, because of the square root. (That's not so good, but that's life.)

Example:
• Suppose the number of miles driven each week by US car owners is normally distributed with a standard deviation of $\sigma = 75$ miles.
– Suppose we plan to estimate the population mean number of miles driven per week by US car owners using a random sample of size n = 100.
– What is the probability that our estimate will differ from the true value by more than 10 miles?
• Denote the population mean by $\mu$ and the sample mean by $\bar{X}$.
– We need to find $P(|\bar{X} - \mu| > 10)$.
– By symmetry of the normal distribution, it is
   $P(|\bar{X} - \mu| > 10) = 2P(\bar{X} - \mu > 10) = 2P\left(\frac{\bar{X} - \mu}{\sigma/\sqrt{n}} > \frac{10}{75/\sqrt{100}}\right) = 2P(Z > 1.33) = 0.1836.$
   Thus, the probability that our estimate will be off by more than 10 miles is 18.36%.
• If the underlying population is not normal, what can be done?

Central Limit Theorem
• By the central limit theorem, for any underlying population with finite variance, the distribution of the sample mean tends towards $N(\mu, \sigma^2/n)$ as n becomes large.
– If we accept the use of this approximation, we don't need to assume that the number of miles driven per week in the example is normally distributed (as long as our sample size n is large).
– This approximation is used repeatedly to assess the error in $\bar{X}$ as an estimate of $\mu$.
• How large should n be for the normal approximation to be accurate?
– There is no simple answer (it depends on the underlying distribution), but n ≥ 30 is a reasonable rule of thumb.
• If the underlying population is finite of size N, and if the sample size n is not a small proportion of N, we use the following finite population correction to the standard error:
   $\text{Std Error}(\bar{X}) = \frac{\sigma}{\sqrt{n}} \sqrt{\frac{N - n}{N - 1}}.$

Sampling Distribution of the Sample Proportion
• Consider estimating any of the following quantities:
– Proportion of voters who will vote for a third-party candidate in the next election.
– Proportion of visits to a web site that result in a sale.
– Proportion of shoppers who prefer crunchy over creamy.
• In each of these examples, we are trying to estimate a population proportion. Denote a generic population proportion by the symbol p.
• Estimate a population proportion using a sample proportion.
– For example, if a poll surveys 1000 voters and finds that 85 of those surveyed plan to vote for a third-party candidate, then the sample proportion is 8.5%.
– The population proportion is what the poll would find if it could ask every voter in the population.
– Denote the sample proportion by the symbol $\hat{p}$.
– Once we have collected a random sample, the sample proportion $\hat{p}$ is known.
We use it to estimate the true, unknown population proportion p.

Estimating a proportion can be formulated as a special case of estimating a population mean.
• Consider again the example of a poll of 1000 voters.
– Imagine encoding responses to a question about third-party candidates as follows: for the ith person polled,
   $X_i = 1$ if the ith person plans to vote for a third-party candidate;
   $X_i = 0$ otherwise.
– Our random sample consists of $X_1, \dots, X_{1000}$. If 85 respondents indicated that they would vote for a third-party candidate, then $X_1 + \dots + X_{1000} = 85$, because 85 of the $X_i$'s are equal to 1 and all the rest are equal to 0.
– The sample proportion is just a special case of the sample mean: $\hat{p} = (X_1 + \dots + X_{1000})/1000 = 0.085$.
• How good an estimate of the population proportion p is the sample proportion? How effective are polls and surveys?
– By how much is the sample proportion likely to deviate from the true population proportion p?
– This is measured by the standard deviation of the sample proportion (its standard error).
– The standard error is $\sqrt{p(1-p)/n}$. It is greatest when p = 0.5.

EXAMPLE
• Suppose that the true, unknown proportion p of voters who will vote for a third-party candidate in the next election is 9%.
– What is the probability that a poll of 1000 voters will find a sample proportion that differs from the true proportion by more than 2 percentage points?
• We need to find
   $P(|\hat{p} - p| > 0.02) = 2P\left(Z > \frac{0.02}{\sqrt{0.09(1 - 0.09)/1000}}\right) \approx 2P(Z > 2.21) = 0.027.$
• We conclude that the probability that the poll will be off by more than two percentage points is 0.027.

Confidence Intervals
• For the mean $\mu$ of a population, a $100(1-\alpha)\%$ CI is:
– When the population is normal and the SD $\sigma$ is known: $\bar{x} \pm z_{\alpha/2}\,\sigma/\sqrt{n}$, where $z_{\alpha/2}$ comes from the normal table.
– Reason: since $\bar{X} \sim N(\mu, \sigma^2/n)$, we have $P\left(-z_{\alpha/2} \le \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \le z_{\alpha/2}\right) = 1 - \alpha$.
– When the population is normal, $\sigma$ is not known, but n is large (maybe > 50): use the same formula with s in place of $\sigma$.
– When the population is not necessarily normal, but n is large (maybe > 50 to 100, depending on how close to normal the population is, or seems to be): use the same formula with σ, if known, or with s if σ is not known.
• Summary: These intervals have probability approximately 1 − α of containing the true value of µ.

Demonstration with R
• Take 1000 samples of size 200 from a Normal(µ = 0, σ² = 1) population.
– Calculate a 95% CI for each sample.
– Check to see how many of these contain the true µ. Answer = ___. Check to see that the percentage is approximately 1 − α.

   x <- rnorm(200)                      # generate 200 standard normal random variables
   mu <- mean(x); sd <- sqrt(var(x))    # calculate sample mean and sd
   q95 <- qnorm(0.975)                  # 97.5% quantile of the normal distribution (z-interval)
   q95 <- qt(0.975, 199)                # or the t quantile with n - 1 = 199 df; this value is used below
   lower <- mu - q95*sd/sqrt(200); upper <- mu + q95*sd/sqrt(200)   # CI; standard error is sd/sqrt(n)
   if (lower*upper > 0) contain <- 0 else contain <- 1              # 1 if the CI contains the true mean 0

Demonstration with R (continued)
• Write a function to find whether the confidence interval contains the mean.

   demons <- function(nsize, conf) {
     x <- rnorm(nsize)                      # generate nsize standard normal random variables
     mu <- mean(x); sd <- sqrt(var(x))      # calculate sample mean and sd
     q95 <- qt((1 + conf)/2, nsize - 1)     # quantile of the t distribution with nsize - 1 df
     lower <- mu - q95*sd/sqrt(nsize); upper <- mu + q95*sd/sqrt(nsize)
     if (lower*upper > 0) contain <- 0 else contain <- 1   # 1 if the CI contains the true mean 0
     contain
   }

• Conduct a simulation study to check the validity of the confidence interval based on the t-statistic.

   nsimu <- 1000
   contain <- numeric(nsimu)
   for (i in 1:nsimu) contain[i] <- demons(200, 0.95)
   mean(contain)                            # proportion of intervals containing the true mean

Higher confidence (Good!) = Wider interval (Bad!)
• The only way to control both confidence and interval size is to choose a sufficiently large n.
• For confidence $100(1-\alpha)\%$ and width w we need $w = 2 z_{\alpha/2}\,\sigma/\sqrt{n}$.
– If σ (or s) is not known, use your best guess (or preliminary data).
• Example. Fisher's Iris data had n = 50, s = 3.5, and a 95% CI of 5.0 ± 1.96 × (3.5/√50) = 5.0 ± 0.97 = (4.03, 5.97).*
– This CI has width w = 2 × 0.97 = 1.94.
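The arithmetic in the Iris example can be checked with a few lines of R. This is only a sketch using the summary numbers quoted in the example (n = 50, s = 3.5, sample mean 5.0); the variable names are made up for illustration.

```r
n    <- 50      # sample size quoted in the example
s    <- 3.5     # standard deviation quoted in the example
xbar <- 5.0     # sample mean quoted in the example
conf <- 0.95

z      <- qnorm(1 - (1 - conf) / 2)   # z_{alpha/2}, about 1.96
margin <- z * s / sqrt(n)             # half-width of the interval
ci     <- c(xbar - margin, xbar + margin)
width  <- 2 * margin                  # w = 2 * z_{alpha/2} * s / sqrt(n)

print(round(margin, 2))
print(round(ci, 2))
print(round(width, 2))
```

Changing `conf` shows the trade-off directly: a higher confidence level gives a larger `z` and therefore a wider interval for the same n.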
– *The sample size 50, here, is on the borderline of what could be acceptable for the use of this procedure. It would be (slightly) better to use the t-procedure discussed below.
• Suppose we want a CI of total width w = 0.5 (ignoring the data we have already gathered). How large a sample size should we use?
– Our best guess for σ is 3.5. (We don't have any other information to give us a better idea.)
– We should choose n ≈ (2 × 1.96 × 3.5/0.5)² = 753.**
– **This value of n is large. If the answer to a question like this works out to be a small n (suggesting use of a t-test) then it's not really a valid answer; at best, it should be thought of as only a very rough estimate.

t-Interval
• When the population is normal but σ² is not known and n is not large! (p.s.: How do we tell whether the population is normal?)
• What we've done so far doesn't work.
• Demonstration:
• Repeat the previous demonstration, but with 50,000 samples of size 4 from an exponential distribution.

Bootstrap
• As a general term, bootstrapping describes any operation which allows a system to generate itself from its own small well-defined subsets (e.g. compilers, software to read tapes written in computer-independent form).
• The word is borrowed from the saying "pull yourself up by your own bootstraps."
• In statistics, the bootstrap is a method allowing one to judge the uncertainty of estimators obtained from small samples, without prior assumptions about the underlying probability distributions.
– The method consists of forming many new samples of the same size as the observed sample, by drawing a random selection of the original observations with replacement, i.e. usually including some of the observations several times.
– The estimator under study (e.g. a mean, a correlation coefficient) is then computed for every one of the samples thus generated, and will show a probability distribution of its own.
– From this distribution, confidence limits can be given.
– For details, see B.
Efron, Computers and the Theory of Statistics, SIAM Rev. 21 (1979) 460, or Efron, The Jackknife, the Bootstrap and Other Resampling Plans, SIAM, Bristol, 1982.

Random Numbers
• Random numbers are particular occurrences of random variables. They are used in Monte Carlo calculations, where three different types may be distinguished according to the method used to generate them:
– Truly random numbers are unpredictable in advance and can only be generated by a physical process such as radioactive decay: in the presence of radiation, a Geiger counter will record particles at time intervals that follow a truly random (exponential) distribution.
– Pseudo random numbers are those most often used in Monte Carlo calculations. They are generated by a numerical algorithm, and are therefore predictable in principle, but appear to be truly random to someone who does not know the algorithm.
– Quasi random numbers are also generated by a numerical algorithm, but are not intended to appear to have the properties of a truly random sequence; rather, they are optimized to give the fastest convergence of the Monte Carlo calculation.

Pseudo Random Numbers
• Generated in a digital computer by a numerical algorithm, pseudorandom numbers are not random, but should appear to be random when used in Monte Carlo calculations.
– The most widely used and best understood pseudorandom generator is the Lehmer multiplicative congruential generator, in which each number r is calculated as a function of the preceding number in the sequence:
   $r_i = [a\,r_{i-1}] \bmod m$ or $r_i = [a\,r_{i-1} + c] \bmod m$,
   where a and c are carefully chosen constants, and m is usually a power of two, $2^k$.
– All quantities appearing in the formula (except m) are integers of k bits.
– The expression in brackets is an integer of length 2k bits (twice k), and the effect of the modulo m is to mask off the most significant part of the result of the multiplication.
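To make the recurrence concrete, here is a minimal R sketch of the second form, r_i = (a r_{i-1} + c) mod m. The constants (m = 2^16, a = 25173, c = 13849) are small illustrative values picked for this sketch, not a recommendation for serious use, and the function name `lcg` and argument name `c_add` are hypothetical. The modulus is kept small so that a * r + c stays exactly representable in R's double-precision numbers.

```r
# Congruential generator: r_i = (a * r_{i-1} + c_add) mod m.
# All default constants are illustrative assumptions, not tuned values.
lcg <- function(n, seed, a = 25173, c_add = 13849, m = 2^16) {
  r <- numeric(n)
  prev <- seed
  for (i in seq_len(n)) {
    prev <- (a * prev + c_add) %% m   # keep only the low-order part (mod m)
    r[i] <- prev
  }
  r
}

ints <- lcg(5, seed = 1)
print(ints)

# Scale the integers to floating point numbers in [0, 1) before use.
u <- ints / 2^16
print(u)

# Same seed, same sequence: the output is fully determined by the algorithm.
print(identical(lcg(5, seed = 1), ints))
```

Setting `c_add = 0` gives the purely multiplicative form; in that case the seed must not be 0, since the sequence would then stay at 0.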
– $r_0$ is the seed of a generation sequence; many generators allow one to start with a different seed for each run of a program, to avoid re-generating the same sequence, or to preserve the seed at the end of one run for the beginning of a subsequent one.
– Before being used in calculations, the $r_i$ are usually transformed to floating point numbers normalized into the range [0,1].
• Generators of this type can be found which attain the maximum possible period of $2^{k-2}$, and whose sequences pass all reasonable tests of "randomness", provided one does not exhaust more than a few percent of the full period.
– D.E. Knuth, The Art of Computer Programming, Addison-Wesley, 1981.
– A detailed discussion can be found in G. Marsaglia, A Current View of Random Number Generators, in Computer Science and Statistics, Elsevier, Amsterdam, 1985.

Jackknife
• The jackknife is a method in statistics allowing one to judge the uncertainties of estimators derived from small samples, without assumptions about the underlying probability distributions.
• The method consists of forming new samples by omitting, in turn, one of the observations of the original sample.
– For each of the samples thus generated, the estimator under study can be calculated, and the probability distribution thus obtained will allow one to draw conclusions about the estimator's sensitivity to individual observations.
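As a minimal sketch of the two resampling ideas, the R code below applies the leave-one-out jackknife to the sample mean of a small simulated sample and, for comparison, the bootstrap described earlier (resampling with replacement). The data, the seed, the number of bootstrap replications `B`, and all variable names are assumptions for illustration. The jackknife standard-error formula used here, sqrt((n-1)/n × Σ(θ̂₍ᵢ₎ − θ̄)²), is the usual one and is not stated in the slides; for the sample mean it reproduces s/√n exactly, which gives a handy sanity check.

```r
set.seed(42)
x <- rexp(15, rate = 1)          # small simulated sample (hypothetical data)
n <- length(x)

# Jackknife: recompute the estimator with observation i omitted, for each i.
jack_est <- sapply(seq_len(n), function(i) mean(x[-i]))

# Usual jackknife standard error of the estimator.
se_jack <- sqrt((n - 1) / n * sum((jack_est - mean(jack_est))^2))

# For the sample mean this matches the textbook standard error s / sqrt(n).
se_formula <- sd(x) / sqrt(n)
print(c(se_jack = se_jack, se_formula = se_formula))

# Sensitivity to individual observations: which omission moves the estimate most?
influence <- jack_est - mean(x)
print(which.max(abs(influence)))

# Bootstrap: draw n observations with replacement, B times, and
# recompute the estimator on each resample.
B <- 2000
boot_est <- replicate(B, mean(sample(x, size = n, replace = TRUE)))
se_boot <- sd(boot_est)                          # bootstrap standard error
ci_boot <- quantile(boot_est, c(0.025, 0.975))   # percentile confidence limits
print(se_boot)
print(ci_boot)
```

Replacing `mean` with another estimator (a median or a correlation coefficient, say) changes nothing else in the code, which is the practical appeal of both methods; the agreement with s/√n, however, is specific to the mean.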