The Uniform and Normal Distribution

When a random variable X can take on any value in an interval, we call it a continuous random variable and describe its outcomes using a probability density function. Instead of talking about the probability that X takes on a specific value, P(X = x), as we do in the discrete case, we talk about the probability that X falls in an interval such as [c, d] and write it as \(P(c \le X \le d)\). The probability that X takes on a specific value in the continuous case is 0. The probability that X falls in an interval [c, d] is determined by calculating the area under the density function that is bounded by c, d, and the x-axis. In the continuous case, probabilities correspond to areas.

Uniform Distribution

The uniform distribution is possibly the simplest example of a continuous random variable. Its probability density function (“density” for short) is given by

\[ f(x) = \begin{cases} \dfrac{1}{b-a} & \text{for } a \le x \le b \\ 0 & \text{otherwise} \end{cases} \]

The graph of this density is shown below:

[Figure: Uniform Density, a horizontal line at height 1/(b − a) over the interval from a to b]

Notice that the height of the elevated horizontal line is precisely \(\frac{1}{b-a}\) so that the total area under the curve is 1. The mean of a uniform random variable is \(E(X) = \frac{a+b}{2}\). The variance is \(\mathrm{Var}(X) = \frac{(b-a)^2}{12}\).

Example. Suppose U follows a uniform distribution on [3,8] (usually written U[3,8]).
(a) What is the probability that U falls between 4 and 6?
(b) What is the probability that U falls between 0 and 5?

(a) [Figure: the U[3,8] density, a horizontal line at height 1/5 over [3, 8], with the area between 4 and 6 shaded.] The shaded area is a rectangle with base 6 − 4 = 2 and height 1/5, so the probability is 2 × 1/5 = 0.4.

The Normal Distribution

The Normal distribution is the most important continuous distribution in all of statistics. The “bell curve,” as it is frequently referred to, has a density function given by

\[ f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^{2}} \quad \text{for } -\infty < x < \infty, \]

where \(\mu\) and \(\sigma\) are parameters (\(\sigma > 0\)) which affect the shape of the distribution. The choice of Greek letters is prophetic. If X is a random variable whose density function is

\[ f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^{2}}, \]

then X has a normal distribution with mean \(E(X) = \mu\) and variance \(\mathrm{Var}(X) = \sigma^2\) (equivalently, the standard deviation of X is \(\sigma\)).
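The two densities introduced so far can be checked numerically. The sketch below is only an illustration in Python (these notes themselves use Excel): the helper names `uniform_prob`, `normal_pdf`, and `integrate` are made up for this example, and the values 7 and 2 for the normal parameters are arbitrary illustrative choices. It computes the two U[3,8] probabilities as areas, then uses a simple trapezoid rule to confirm that the normal density has total area 1, mean \(\mu\), and variance \(\sigma^2\).

```python
import math

def uniform_prob(c, d, a, b):
    """P(c <= U <= d) for U ~ U[a, b]: overlap length times the height 1/(b - a)."""
    overlap = max(0.0, min(d, b) - max(c, a))  # the density is 0 outside [a, b]
    return overlap / (b - a)

def normal_pdf(x, mu, sigma):
    """Normal density with mean mu and standard deviation sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def integrate(g, lo, hi, steps=20000):
    """Trapezoid-rule approximation of the area under g between lo and hi."""
    h = (hi - lo) / steps
    total = 0.5 * (g(lo) + g(hi))
    for i in range(1, steps):
        total += g(lo + i * h)
    return total * h

# Uniform example: U ~ U[3, 8]
p_a = uniform_prob(4, 6, 3, 8)  # area between 4 and 6
p_b = uniform_prob(0, 5, 3, 8)  # only the part of [0, 5] inside [3, 8] counts

# Normal density with illustrative (assumed) parameters
mu, sigma = 7.0, 2.0
lo, hi = mu - 10 * sigma, mu + 10 * sigma  # wide enough that the tails are negligible
area = integrate(lambda x: normal_pdf(x, mu, sigma), lo, hi)
mean = integrate(lambda x: x * normal_pdf(x, mu, sigma), lo, hi)
var = integrate(lambda x: (x - mean) ** 2 * normal_pdf(x, mu, sigma), lo, hi)

print(f"P(4 <= U <= 6) = {p_a:.2f}, P(0 <= U <= 5) = {p_b:.2f}")
print(f"normal: area = {area:.6f}, mean = {mean:.6f}, variance = {var:.6f}")
```

Both uniform probabilities come out to 0.4: in part (b) the interval [0, 5] overlaps [3, 8] only from 3 to 5, so the area is again 2 × 1/5.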
Notation: A normal random variable X with mean \(\mu\) and variance \(\sigma^2\) is described using the shorthand notation \(X \sim N(\mu, \sigma^2)\). The tilde (~) is read “is distributed as.”

Calculations with the Normal Distribution in Excel

Example. Suppose a Book Rep informs you that his/her publisher can provide you with a high quality customized statistics textbook for your MBA students, who just happen to be attending an elite, private Business School in North Texas whose name rhymes with the word “Sox.” However, as the Rep acknowledges, this typically requires 6-8 weeks of production time after it is ordered. Suppose the production time is normally distributed with a mean (\(\mu\)) of 7 weeks and a standard deviation (\(\sigma\)) of 2 weeks.
(a) What is the probability that the production time really takes between 6 and 8 weeks?
(b) What is the probability that the production of the customized book takes more than 11 weeks?

Solution: Draw a picture! For part (a), we want the probability (read area) under the density between 6 and 8. Be sure to shade in the desired area in your diagram.

Excel provides the cumulative distribution function for any normal distribution. This function calculates the area to the left of a point x for a normal distribution with mean (\(\mu\)) and standard deviation (\(\sigma\)). The function is NORMDIST(x, mean, standard deviation, true). For example, the total area to the left of the point 6 in our book production example is =NORMDIST(6,7,2,true), which equals .308538. The area to the left of 8 in the same distribution is =NORMDIST(8,7,2,true), which equals .691462. The area between 6 and 8 is therefore .691462 – .308538 = .382925.

For part (b), draw another picture and shade the appropriate area. (Answer: 1-NORMDIST(11,7,2,True) = .02275)

Example. Suppose you know going into negotiations that production time for a customized STAT book (used at an elite, private B-School in ……..) is normally distributed with a mean of 7 weeks and a standard deviation of 2 weeks.
You want to determine how much time to allocate for production so that your book is done within the allotted time with a high probability, say 98% (equivalently, a 2% chance that the book isn’t completed in the time allotted). How much time should you allocate for production?

Solution: This problem does not ask for a probability, but rather states a probability (read area) and asks for a value (time) that corresponds to this area. This is the inverse of the previous problem, which started with values and asked for a probability. Still, it is important to draw a rough picture and label all relevant information.

Excel provides the values that correspond to cumulative areas through the inverse cumulative distribution function, NORMINV(area, mean, standard deviation). This is actually the inverse function of the cumulative distribution function. Given an area A (\(0 < A < 1\)), the function determines the value x in the normal distribution such that the cumulative area to the left of x is A. In our example above, we want to determine the value x in a normal distribution (with mean 7 and standard deviation 2) such that 98% of the area is to the left of x. The desired x-value is =NORMINV(.98,7,2), which equals 11.1075 weeks of production time.

Standardization

The case where \(\mu = 0\) and \(\sigma = 1\) is of special importance. It is called the standard normal and its random variable is denoted by Z. Tables are available for Z (Appendix B, page 630). One of the advantages of working with Normal random variables is that probability statements regarding general normal random variables can be transformed into equivalent statements regarding a standard normal. This is due to the following standardization rule.

Standardization Rule for Normal Random Variables
If \(X \sim N(\mu, \sigma^2)\), then \(\dfrac{X - \mu}{\sigma} \sim N(0, 1)\).

Sums of Normal Random Variables

In the previous book production example, we were only concerned with production time.
In reality, we probably need to be concerned with two things, the production time and the order/delivery time.

Example. Suppose the production time for the STAT book is as before (i.e., N(7,4)). The order/delivery time (from a separate delivery company) follows a normal distribution with a mean of 2 weeks and a standard deviation of 1 week (i.e., N(2,1)). What is the probability that the textbook is at the campus bookstore within 12 weeks from the time it is ordered?

Solution. We’d like to combine the production time and the order/delivery time into a single random variable representing the total time between placing the order and its arrival in the bookstore. If we denote the random production time by PT (normally distributed with a mean of 7 and a standard deviation of 2) and the order/delivery time by ODT (normally distributed with a mean of 2 and a standard deviation of 1), then the total time is PT+ODT. But what is the distribution of PT+ODT? In general, this is a hard problem, but there is a simple case that occurs quite frequently. It is the case where the two random variables are independent. In simple words, independence means that the two outcomes are not related to one another in any way. This condition is often simply assumed when it seems plausible. Assume PT and ODT are independent (this makes sense in the context of our example because the delivery company is a separate entity). Then we have the following rule, which will dramatically improve our ability to obtain a solution.

Rule for Combining Independent Normal Random Variables
If \(X \sim N(\mu_X, \sigma_X^2)\) and \(Y \sim N(\mu_Y, \sigma_Y^2)\) are independent and a and b are any two constants, then

\[ aX + bY \sim N\!\left(a\mu_X + b\mu_Y,\; a^2\sigma_X^2 + b^2\sigma_Y^2\right). \]

We can now complete the problem. In our example, we apply the formula above with a=1, b=1. PT+ODT has a normal distribution with mean 1×7+1×2=9, and variance \((1)^2 \times 4 + (1)^2 \times 1 = 5\). The standard deviation of this distribution is therefore \(\sqrt{5} = 2.236\) (approximately).
The probability that the book arrives in the bookstore within 12 weeks is =NORMDIST(12,9,sqrt(5),true), which is equal to .9101 (approximately).

An Application to Inventory Pooling. The monthly demand for a product at four outlet stores follows a normal distribution with a mean of 100 and a variance of 625. Demands are independent at the four stores. Unsatisfied demand is lost (i.e., you cannot backorder units). Each store is allowed to place a single order at the end of the month. This order arrives right at the start of the next month. There are no additional shipments per month.

(a) Suppose each outlet stocks with the goal of meeting monthly demand 90% of the time. How much stock should each outlet have on hand at the start of each month? Calculate how much is needed at the four stores collectively at the start of each month (this is the inventory held by the entire chain at the start of a month).

(b) Now suppose the stores decide to pool (or share) their inventory from a common source. For example, there could be a central warehouse that all four stores draw from as depicted below. How much inventory is needed in the system’s central pool at the start of a month to satisfy customer demand 90% of the time? [Here, observe that when the central pool is empty, all four stores are out of stock simultaneously; compare this situation with that in (a)]. Is the total amount of inventory held by the chain at the start of a month the same as in part (a)? Explain the difference, if any.

[Figure: a central Inventory Pool connected to Store 1, Store 2, Store 3, and Store 4]

Sampling Distributions: Estimating a Population Mean

Until now, you have been given privileged information about a random variable. You have been given its distribution, which has enabled you to know the “true” mean and “true” variance with precision, even if it involved a minor calculation. In practice, you will not have such detailed information regarding the distribution of a random variable.
Because the mean (expected value) of a random variable is one of the most important measures in all of statistics, we need a way of estimating its value in the real world. This involves the intuitive concepts of sampling and estimation, which are best motivated by a real example.

Example (From: Jon Danklefs, SMU MBA Class 45D, KIA Motors). In March of 2000, a hailstorm ripped through North Texas and did serious damage to a number of homes and businesses. KIA Motors’ Midlothian distribution center was particularly hard hit. Nearly 5000 exposed vehicles were hail-damaged. The distributor initially authorized a local hail-dent remover to repair up to 200 of the vehicles. Each car was fixed panel by panel using a “paintless” dent removal method, and then a detailed invoice was prepared to document the cost of the repairs. After finishing approximately 180 vehicles to the distributor’s satisfaction, it was mutually agreed that the remaining cars should be repaired using a flat rate per car. Jon Danklefs was largely responsible for negotiating the appropriate flat rate. He already had a sample of 180 cars whose repair costs were known. The costs for these 180 vehicles are listed in the file dents.xls. Using this information, come up with a reasonable estimate of the expected cost per vehicle. How does your value compare with the true expected cost per vehicle? What are the risks (to both sides) if the current invoice method is continued instead of using a flat rate?

A Solution

Our proposed solution method first involves drawing a random sample, which is a set of observations drawn in such a manner that on each draw the remaining observations are all equally likely to be selected. Assuming the cars were drawn this way (by Jon Danklefs), we have a random sample “without replacement” from a finite population.[1] If the observations are replaced after each draw, then we have a random sample with replacement.
When the sample is small compared to the population (less than 5%), this distinction is not critical. One may think of the population as nearly infinite, which makes the replacement issue (with or without) unimportant.[2] This simplifies the formulas we use (more complicated formulas are needed for the case of small finite populations where items are drawn without replacement). Procedures for drawing random samples from both finite and infinite populations are discussed in section 7.2 of your book. As is often the case in the real world, our data was collected without our involvement. We will assume it constitutes a random sample from an infinite population. This will be our standard assumption throughout the course.

[1] Some authors reserve the term simple random sampling for precisely this situation. Your book does not.
[2] One can make any finite population infinite by drawing with replacement. The term infinite refers to the number of observations of the random variable, not the number of distinct outcomes of the random variable.

The second part of our solution involves selecting a formula or rule to convert the observed sample values into an appropriate estimate. With a little thought, I suspect you’d come up with the following estimate of the expected value:

\[ \bar{x} = \frac{x_1 + x_2 + x_3 + \cdots + x_{178} + x_{179} + x_{180}}{180} = \$215.66. \]

The value \(\bar{x}\) is called the sample mean, and it is a sample estimate of the true expected cost or true mean cost, \(\mu_{\text{cost}}\). As a practical matter, the KIA distributor might refuse to accept a fixed price higher than $215.66 per car. But the true mean cost \(\mu_{\text{cost}}\) could actually be higher than $215.66, in which case $215.66 might be a good deal. The unfortunate truth is that we will never know with 100% accuracy how close $215.66 is to \(\mu_{\text{cost}}\) since the true cost of repairs on the remaining vehicles will not be documented with invoices. But we can get a probabilistic idea of how close we are.
To understand how this is done, imagine we were to draw another random sample (now with replacement) of size n=180 and compute another sample estimate of the true mean repair cost. Would this second estimate likely be $215.66? What if we took a third random sample of size n=180? In general, we can think of our original estimate \(\bar{x} = \$215.66\) as a single draw from a distribution of sample averages. In the language of statistics, we are interested in the distribution of the estimator

\[ \bar{X} = \frac{X_1 + X_2 + X_3 + \cdots + X_{178} + X_{179} + X_{180}}{180} \]

(remember, capital X’s stand for random variables whose values have not yet been realized). We think of the estimator as a rule, and the number we get from a particular sample as an estimate. Characterizing the distribution of the estimator \(\bar{X}\) (called its sampling distribution) is a critical step in understanding how close (in a probabilistic sense) our computed sample mean \(\bar{x} = \$215.66\) is to the true mean cost \(\mu_{\text{cost}}\). To get there, we need two theoretical facts about the distribution of our estimator \(\bar{X}\).

Fact 1: The expected value of \(\bar{X}\) satisfies \(E(\bar{X}) = \mu_{\text{cost}}\). In simple terms, this means the estimator’s theoretical average is “dead on” the value it is intended to estimate. This can be shown using the rules for expectation given in an earlier lecture. Generally speaking, estimators that have this property are said to be unbiased. Unfortunately, this is like knowing that a manufacturing machine makes parts that are “correct on average” without knowing what that average actually is.

Fact 2: The distribution of \(\bar{X}\) is approximately Normal with a mean of \(\mu_{\text{cost}}\) and a variance of \(\sigma^2 / n\), where \(\sigma^2\) is the variance of the population we are sampling from. This is not at all obvious and is a consequence of the Central Limit Theorem (CLT).

Fact 2 deserves some discussion, which will be supplemented by an Excel demonstration. What makes it such a profound result is that it doesn’t depend on the distribution of the underlying population we are sampling from.
An Excel Simulation/Demonstration of the Central Limit Theorem

In this demonstration, we use a random variable that is uniformly distributed on [100,300], denoted by U[100, 300]. Consequently \(\mu = \$200\) and \(\sigma^2 = 3333.333\) (you can calculate these from the formulas given for the uniform distribution earlier). I selected this distribution for two convenient reasons: (1) it is clearly not normal; (2) we talked about it at the start of the lecture (so you know a bit about it). Of course, since we know the distribution, we wouldn’t need to estimate \(\mu\) because we could actually compute it. However, I am using this distribution simply to demonstrate what the Central Limit Theorem tells us about the distribution of our estimator \(\bar{X}\).

Suppose we draw a random sample of 100 values from U[100, 300] and compute the sample mean. We can actually do this using the computer (I will show you how). Now suppose we repeat this step over and over. We would never actually do this in practice; we are only doing it here to prove a point about the distribution of \(\bar{X}\). If we take enough samples, we’ll get a pretty good idea of the distribution of \(\bar{X}\) for the case where the sample size is n = 100. I generated 10000 samples (each of size n = 100) for our in-class example. Here’s a really geeky summarization:

Excel Experiment
    Do Sample = 1, 10,000
        Draw a Sample of n=100;
        Calculate the Sample Mean;
        Store the Mean.
    End
    Construct a histogram of the 10,000 sample means.

The number of samples (10,000) is arbitrary; I chose it because I figured it would give us a really nice picture (meaning histogram) for the distribution of \(\bar{X}\). The n stated in the CLT is the sample size, n=100, not the number of samples.

Observe that the distribution of \(\bar{X}\) is quite different from the underlying population (U[100,300]). It is approximately bell-shaped. Evidently, adding values together and taking their average has this effect. The histogram gets more bell-shaped if we take a larger sample size (a bigger n than 100).
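For readers who would rather run the experiment outside Excel, here is a minimal Python sketch of the same loop. It is not the in-class spreadsheet: the seed, the variable names, and the bin width of the text histogram are all arbitrary choices made for this illustration. It draws 10,000 samples of size n = 100 from U[100, 300], stores each sample mean, and compares the mean and variance of those sample means with the values Fact 2 predicts (200 and 3333.333/100).

```python
import math
import random

random.seed(12345)  # arbitrary seed, used only so a rerun gives the same numbers

NUM_SAMPLES = 10_000     # number of repeated samples (arbitrary, as in the text)
SAMPLE_SIZE = 100        # the n in the Central Limit Theorem
LOW, HIGH = 100.0, 300.0

# Draw a sample, calculate its mean, store the mean; repeat.
sample_means = []
for _ in range(NUM_SAMPLES):
    draws = [random.uniform(LOW, HIGH) for _ in range(SAMPLE_SIZE)]
    sample_means.append(sum(draws) / SAMPLE_SIZE)

mean_of_means = sum(sample_means) / NUM_SAMPLES
var_of_means = sum((m - mean_of_means) ** 2 for m in sample_means) / NUM_SAMPLES

# Values predicted by Fact 2: mean mu and variance sigma^2 / n
theory_mean = (LOW + HIGH) / 2
theory_var = (HIGH - LOW) ** 2 / 12 / SAMPLE_SIZE

print(f"mean of sample means:     {mean_of_means:8.3f}  (theory {theory_mean:.3f})")
print(f"variance of sample means: {var_of_means:8.3f}  (theory {theory_var:.3f})")

# Crude text histogram of the sample means, one row per bin of width 2
BIN_WIDTH = 2.0
counts = {}
for m in sample_means:
    left_edge = math.floor(m / BIN_WIDTH) * BIN_WIDTH
    counts[left_edge] = counts.get(left_edge, 0) + 1
for left_edge in sorted(counts):
    bar = "#" * (counts[left_edge] // 50)  # one '#' per 50 sample means
    print(f"{left_edge:6.0f} to {left_edge + BIN_WIDTH:6.0f} | {bar}")
```

The printed rows should trace out a rough bell shape centered near 200, even though every individual draw comes from a flat distribution. Changing SAMPLE_SIZE shows how the spread of the sample means shrinks as n grows.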
The Central Limit Theorem (CLT)

Let \((X_1, X_2, \ldots, X_n)\) be a random sample from any infinite population with mean \(\mu\) and variance \(\sigma^2\). As n becomes large, the distribution of

\[ \bar{X} = \frac{X_1 + X_2 + \cdots + X_n}{n} \]

is approximately Normal with mean \(\mu\) and variance \(\dfrac{\sigma^2}{n}\) (in fancy notation, \(\bar{X} \overset{\text{Approx.}}{\sim} N\!\left(\mu, \dfrac{\sigma^2}{n}\right)\)).

Recall that this rule does not require the underlying population distribution to have any particular form. The underlying population above was uniform, which is relatively nondescript. However, if the underlying distribution is normal to begin with, the distribution of \(\bar{X}\) is exactly normal (for all n), and we can dispense with the word “approximately.” This follows directly from the combination rule for independent normal random variables.

How big does n need to be for this approximation to work? In this class, we will agree that n = 100 is big enough (although many people say n = 30 is big enough). Nevertheless, the bigger the sample size, the better the approximation.

Notice that the variance of \(\bar{X}\) is shrinking as n becomes large in the statement of the Central Limit Theorem above. Another way of stating the theorem that avoids this is to convert \(\bar{X}\) to a standard normal. The equivalent statement becomes

\[ \frac{\bar{X} - \mu}{\sigma / \sqrt{n}} \overset{\text{Approx.}}{\sim} N(0, 1) \quad \text{(for large } n\text{)}. \]

This is often a more convenient form to work with. The Central Limit Theorem does have direct applications. Consider the following example.

Application: Suppose you run an insurance company that processes claims at 100 regional offices around the country. Weekly claims at the regional offices are independent of one another and follow a distribution with \(\mu = 1000\) and \(\sigma^2 = 22500\). What is the probability that the company experiences more than 103,000 claims nationwide in a given week? (Note: This is only an average of 1030 claims per regional office. Does exceeding 1030 at a regional office seem unusual given \(\sigma = 150\)?).

Solution. Turn this into an equivalent statement involving average claims and then apply the CLT.
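One way to carry out that hint is sketched below in Python. This is an illustration, not the official class solution: it uses `NormalDist` from Python's standard `statistics` module in place of Excel's NORMDIST, the variable names are invented for the example, and the answer is only an approximation because the CLT is being applied to a claim distribution whose exact form is not given. The total exceeds 103,000 exactly when the average over the 100 offices exceeds 1030, so the standardized form of the CLT applies with n = 100.

```python
from math import sqrt
from statistics import NormalDist

num_offices = 100            # n, the number of regional offices
mu = 1000.0                  # mean weekly claims per office
sigma = sqrt(22500.0)        # standard deviation per office (150)
total_threshold = 103_000    # nationwide total in question

# Total > 103,000 is the same event as average > 1030.
avg_threshold = total_threshold / num_offices

# CLT: the average is approximately N(mu, sigma^2 / n),
# so its standard deviation is sigma / sqrt(n).
std_error = sigma / sqrt(num_offices)

# Standardize, then take the area to the right of z under N(0, 1).
z = (avg_threshold - mu) / std_error
prob = 1.0 - NormalDist().cdf(z)

print(f"average threshold = {avg_threshold:.0f}, standard error = {std_error:.1f}")
print(f"z = {z:.2f}, approximate probability = {prob:.5f}")
```

The standardized value is z = 2, so the approximate probability is about .02275. An average of 1030 is only 0.2 standard deviations above the mean for a single office, but it is 2 standard deviations above the mean for the average of 100 offices, which is why the nationwide event is unusual.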
Assignment #1 (Due Saturday, July 12th)

1. Book, 6.19
2. Book, 6.20
3. Book, 6.23
4. Book, 6.25
5. Suppose you run an insurance company that processes claims at 25 regional offices around the country. Weekly claims at the regional offices are independent of one another and follow a normal distribution with \(\mu = 100\) and \(\sigma^2 = 400\).
   (a) What is the probability that a single regional office experiences more than 110 claims in a given week?
   (b) What is the probability that the company experiences more than 2750 claims nationwide (an average of 110 per office at all 25 offices) in a given week?
   (c) What is the probability that at least 15 of the 25 offices experience more than 110 claims in a week?

Note: The problems on Assignment 1 do not require the Central Limit Theorem. Problems involving the Central Limit Theorem appear at the beginning of Assignment 2.