Probability

Introduction

The primary objective of this section is to learn how probability can be used to help understand and quantify uncertainty, in order to make more informed decisions.

To this point we have covered the first of the four areas of the class, namely descriptive statistics. Now we begin the second area, probability. Recall that probability is a numerical measure of chance or likelihood, with higher numbers indicating a higher degree of likelihood. With probability, we assume that we know something about the entire population, and use that knowledge to characterize samples from the population.

We need some definitions to begin.

A random experiment is one in which the outcome is uncertain before the experiment is performed.
An experimental outcome is one realization of the experiment.
The sample space is the collection of all possible outcomes.
An event is a collection of experimental outcomes.
Events are mutually exclusive if only one can occur at a time.
A set of events is collectively exhaustive if the set describes every possible outcome of the experiment.

There are two fundamental requirements of probability:
1. The probability of an experimental outcome must be between 0 and 1.
2. The sum of the probabilities of a set of mutually exclusive, collectively exhaustive outcomes or events is 1.

Assigning Probability

There are three methods used to assign probabilities to outcomes or events.
1. The classical method assumes that all experimental outcomes are mutually exclusive and equally likely, and hence the outcomes have the same probability. Examples:
2. The relative frequency method uses historical data to calculate how frequently the event or outcome has occurred in the past. Examples:
3. The subjective method relies on degree of belief. How likely do we think the outcome is? Examples:

Important Relationships

The complement of an event consists of every outcome in the sample space that is not part of the event in question.
When two events occur together, they intersect, and the probability of their intersection is called the joint probability. The key word here is AND.

The probability of the union of two events is the probability that one or the other or both occur. The key word here is OR.

The probability that one event occurs, given that we know another event has already occurred, is called the conditional probability. Key words here include GIVEN, WHEN and OF.

With these definitions, we can define some useful mathematical relationships.

Addition: $P(A \cup B) = P(A) + P(B) - P(A \cap B)$

Multiplication: $P(A \cap B) = P(A \mid B)\,P(B)$

We can also define what it means if two events are statistically independent: A and B are independent if $P(A \mid B) = P(A)$, or equivalently if $P(A \cap B) = P(A)\,P(B)$.

Contingency Tables

A contingency table gives us one way to look at all of these probabilities and relationships. Consider the following example.

Suppose that 800 individuals participated in a market research study. They were asked about a particular product, and a television advertisement for the product. Each participant indicated whether or not they had purchased the product, and whether or not they could recall the ad. The results are shown in the table below:

                   Purchased   Did not purchase   Total
Recalled Ad           160            240           400
Did not recall         80            320           400
Total                 240            560           800

Define R to be the event that an individual could recall the ad, and B to be the event that he or she purchased (bought) the product.

$P(R)$ =
$P(B)$ =
$P(R \cap B)$ =
$P(B \mid R)$ =

Are purchasing the product and recalling the ad statistically independent?

Creating Contingency Tables in Excel

Excel has a utility called a Pivot Table that allows us to create and analyze tabular summaries (contingency tables) of qualitative data. It can also be used with quantitative data or combinations of quantitative and qualitative data.

To use the pivot table feature, data must be entered in columns and each column must have a title or header. Before invoking the procedure, be sure that the cursor is in one of the cells containing a header or data.
To start the "wizard," go to Data/PivotTable and PivotChart Report. In the first step, just click on Next (the default values are what we want). In the second step, verify that the data range shown contains all of the data that you want to analyze, then click on Next again. In step 3, click on the button called "Layout." You will be presented with the following dialog box (except the buttons on the right will change according to the data set you are using).

At this point, click on and drag the button corresponding to the variable that you want on the rows of your output table to the area labeled "Row," and the variable you want in columns to the area labeled "Column." Then drag either of the two buttons that you just used to the "Data" area. I recommend always dragging one of the qualitative variables' buttons. The button should change to say "Count of VARIABLE," where VARIABLE is the name of the variable that you dragged to the middle. Then click OK. To complete the procedure there are a few other options you can change if you desire, but I usually just click on Finish at this point and change options later if the output is not what I want.

If you have used a quantitative variable, you will likely want to group it. To do so, right click on the variable name in the table. One item in the popup menu should say Group. Choose it, and then specify how you want the variable to be grouped.

The pivot table can display several different types of summary measures. The default or "normal" state is to display total counts. There may be times that you want to display the numbers in the table as overall percentages, as row percentages, etc. To change the display, click anywhere in the table and go again to the Data/PivotTable and PivotChart Report menu item. You should be at step 3 again. Click on Layout and then double click what is in the middle of the table (it should say "Count of..."). Then select Options.
A drop down menu that says "Show Data As" will be in the middle of the dialog box. Use the drop down menu to choose how you want to display the data. Then exit out of all of the boxes.

A Note on Lists in Excel: By default, Excel lists the categories of qualitative variables alphabetically. You may want them listed in some kind of logical ascending order (for example, you may want to list class standing as Freshman, Sophomore, Junior and Senior). To tell Excel how you want the labels to be ordered, go to the Tools menu, select Options, and then click on the tab called "Custom Lists." Then you can type the list items in the order you want them (separate them with a comma or return) in the List Entries section. Or you can import the list in the order that you want by identifying the cells where the items are listed.

Below is a portion of an Excel worksheet with both qualitative and quantitative variables. It shows both a portion of the original data and the resulting pivot table. I created a custom list in Excel as "Good, Very Good, Excellent."

The COUNTIF Function

Sometimes, rather than create a contingency table, we just need to count the frequency of each category within one qualitative variable. The easiest way to do so in Excel is with the COUNTIF function.

The COUNTIF function has 2 arguments: a reference to the data set, and a condition to be met. The function looks at the cells indicated by the first argument, and counts the number of times that the condition (in the second argument) is met. If you want to look for an exact value, then the second argument is just that value (or a reference to a cell containing the value). If, on the other hand, you want to count values that fall in a range, then the condition needs to be in quotes. For example, if I wanted to count the number of times the cells A1:A30 have values that exceed 10, I would enter =COUNTIF(A1:A30,">10").
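For anyone who wants to check the counting logic outside Excel, the same idea can be sketched in Python. This is only an illustration under stated assumptions: the data values and the helper name countif are hypothetical, and Excel's COUNTIF remains the tool used in these notes.

```python
# Hypothetical values standing in for the cells A1:A30 (made up for illustration).
values = [4, 12, 7, 15, 10, 22, 3, 11]


def countif(cells, condition):
    """Count the cells for which condition(cell) is true, in the spirit of COUNTIF."""
    return sum(1 for cell in cells if condition(cell))


# Counterpart of =COUNTIF(A1:A30,">10"): how many values exceed 10?
over_ten = countif(values, lambda v: v > 10)

# Counterpart of an exact-value condition: how many values equal 10?
exactly_ten = countif(values, lambda v: v == 10)

print(over_ten, exactly_ten)  # prints: 4 1
```

The condition is passed as a small function here, which plays the role of the quoted condition string in the Excel formula.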
Returning to the restaurant worksheet shown earlier, suppose that all we are interested in knowing is the number of restaurants that received each of the three ratings. The ratings of the restaurants are listed in cells B2:B301. The first thing I would do is enter the different possible ratings. For example, I might type Good, Very Good, and Excellent in cells E8:E10. Then in cell F8 I can enter =COUNTIF($B$2:$B$301,E8). This function says to look in cells B2:B301 for what is entered in cell E8 ("Good" in this case). Every time it finds Good it counts it. The final result will be the total frequency of the word Good in cells B2:B301. To count the frequency for the other two ratings, I can copy my formula in F8 to cells F9:F10.

Random Variables

Another way to represent the results from probabilistic experiments is with random variables. A random variable assigns a numerical value to each experimental outcome, so it has a set of possible values.

There are two main types of random variables:

Discrete random variables take on a countable number of values. For example, the number of defective items in a production batch is a discrete random variable.

A discrete random variable, X, has an associated probability mass function (pmf), $f(x)$, where $f(x) = P(X = x)$. The pmf must meet the following:
a. $0 \le f(x) \le 1$,
b. $\sum_{\text{all } x} f(x) = 1$.

The random variable X also has a cumulative mass function (cmf) (also called the distribution function of X), denoted $F(x)$, where $F(x) = P(X \le x)$.

Example:

Continuous random variables take on continuous or interval values (there are an infinite number of possibilities). For example, the width of an extruded bar is a continuous random variable.

A continuous random variable, Y, has an associated probability density function (pdf), $f(y)$. The pdf must meet the following:
a. $f(y) \ge 0$,
b. $\int_{\text{all } y} f(y)\,dy = 1$.

The probability that a continuous random variable is exactly equal to a single value is 0.
Instead, we quantify the probability that the random variable falls within a certain interval:

$$P(a \le Y \le b) = \int_a^b f(y)\,dy .$$

This is equivalent to saying that the probability is equal to the area under the pdf curve between a and b. We will show some examples of continuous variables shortly.

Expected Value and Variance

Probability distributions are often summarized using 2 measures referred to as the expected value and the variance. The expected value gives us information about the center of the distribution (it is another name for the population mean), and the variance tells us about the spread. When we are talking about probability distributions, these quantities are population quantities.

Mathematical representation: The expected value is denoted $E(X)$ or more commonly $\mu$. The variance is denoted $Var(X)$, or more commonly $\sigma^2$.

Expected Value and Variance for a discrete random variable:

$$\mu = E(X) = \sum_{\text{all } x} x f(x),$$

$$\sigma^2 = Var(X) = \sum_{\text{all } x} (x - \mu)^2 f(x) = \sum_{\text{all } x} x^2 f(x) - \mu^2 .$$

Expected Value and Variance for a continuous random variable:

$$\mu = E(Y) = \int_{\text{all } y} y f(y)\,dy,$$

$$\sigma^2 = Var(Y) = \int_{\text{all } y} (y - \mu)^2 f(y)\,dy = \int_{\text{all } y} y^2 f(y)\,dy - \mu^2 .$$

As before, the standard deviation can also be used and is denoted $\sigma$.

Examples:

Some rules of expected value and variance:

$E(cX) = cE(X)$
$E(X_1 + X_2) = E(X_1) + E(X_2)$
$Var(cX) = c^2 Var(X)$
$Var(c_1 X_1 + c_2 X_2) = c_1^2 Var(X_1) + c_2^2 Var(X_2) + 2 c_1 c_2 Cov(X_1, X_2)$
$\qquad = c_1^2 Var(X_1) + c_2^2 Var(X_2)$ if $X_1$ and $X_2$ are independent.

Application to Portfolio Management

When looking at financial investments, financial managers use the expected or average return to measure a security's return. The standard deviation or variance of return is used as the measure of risk. It is common for investors to hold more than one security in an investment portfolio to try to reduce the amount of risk. This is called hedging.

Suppose we have two securities, A and B, in a portfolio.
Let $p_A$ be the proportion of money invested in security A and $p_B$ be the proportion invested in security B. Also, let $\sigma_A^2$ be the variance in return of security A, $\sigma_B^2$ be the variance in return of security B, and $\sigma_{AB}$ be the covariance between the returns of securities A and B. Then the variance of return for the portfolio is

$$Var(\text{portfolio}) = p_A^2 \sigma_A^2 + p_B^2 \sigma_B^2 + 2 p_A p_B \sigma_{AB} .$$

Example

Suppose that an investor holds two securities in her portfolio. She has $500 invested in Andrews stock and $1000 in Dean stock. She has no information on the variance or covariance of returns, so she takes a random sample of 4 years of returns and finds the results in the table below.

Year          1       2       3       4
Andrews     0.08    0.25    0.10    0.04
Dean       -0.05    0.55    0.19    0.30

We want to estimate the variance of return of the investor's portfolio.

First, what are $p_A$ and $p_B$?

Next we need estimates of the variances and covariance of the individual securities.

Now we can estimate the variance of the portfolio.

How does the risk of the portfolio compare to that of the individual securities?

Portfolio Management Practice Problem

Security F has an expected return of 10% and a standard deviation of 5% per year. Security G has an expected return of 20% and a standard deviation of 60% per year.
a. What is the expected return on a portfolio composed of 40% of Security F and 60% of Security G?
b. Find the variance of the portfolio described in part a if the correlation coefficient between F and G is .5.
c. Repeat part b if the correlation coefficient is -.5. What is the effect of the correlation coefficient's sign on the variance of the portfolio?

Common Probability Distributions

Our goal for this section is to learn the assumptions and usefulness of three common probability distributions. We will discuss their applications and computations using spreadsheets. The binomial distribution is discrete and the uniform and normal are continuous.
The Binomial Distribution

The binomial distribution describes probabilistic experiments with 2 possible outcomes. For example, if we are inspecting finished goods, they can be classified as either good or bad.

There are three important assumptions of the binomial distribution.
1. We have n independent trials;
2. There are only 2 possible outcomes for each trial;
3. The probability of a "success," p, is constant from trial to trial.

If we let X = the number of successes in n independent trials, then

$$P(X = i) = \binom{n}{i} p^i (1 - p)^{n - i}, \quad \text{for } i = 0, 1, \ldots, n,$$

where

$$\binom{n}{i} = \frac{n!}{i!\,(n - i)!} .$$

Fortunately for us, Excel will find binomial probabilities so we don't have to do the computation above. The formula is =BINOMDIST(x,n,p,I), where I is 0 if we want $P(X = x)$ and I is 1 if we want $P(X \le x)$.

For the binomial distribution, $\mu = np$ and $\sigma^2 = np(1 - p)$.

Example:
In a population of sales invoices, 5 percent have no shipping document attached. If an auditor takes a random sample of 50 invoices, what is the likelihood that 3 will have missing shipping documents?

What is the probability that there will be fewer than 1 with missing documents?

What is the expected number of invoices with missing documents?

Common Continuous Distributions

We now begin to discuss continuous distributions. Recall that finding probabilities with continuous distributions is equivalent to finding areas under the curve. Areas can always be found by integration and sometimes can be found geometrically. We will see both types in our examples. The most common continuous distribution is the normal distribution, so we will spend most of our time with it. We will also discuss the uniform distribution, since it occurs naturally in some cases, and mainly because it gives insight into working with continuous distributions.

The Uniform Distribution

The uniform distribution is similar to the classical method of assigning probability in that it assumes that every outcome is equally likely.
There is a discrete uniform distribution (e.g., tossing a die), but we will discuss the continuous uniform distribution.

The uniform distribution has density function

$$f(y) = \begin{cases} \dfrac{1}{b - a} & \text{for } a \le y \le b \\ 0 & \text{otherwise.} \end{cases}$$

Because of the shape of its density, the uniform distribution is also called the rectangular distribution.

[Figure: uniform density, f(y) plotted against y.]

Examples:
$P(Y = 2)$ =
$P(Y > 2)$ =
$P(Y < 2)$ =
$P(3 \le Y \le 4)$ =

For the uniform distribution, $\mu = \dfrac{a + b}{2}$ and $\sigma^2 = \dfrac{(b - a)^2}{12}$. So for our example,

$\mu$ =
$\sigma^2$ =

The Normal Distribution

Last time we talked about the bell-shaped rule. We want to say a little more about this now. The numbers (68%, 95%, etc.) are derived from the normal or Gaussian distribution.

The normal distribution is the most commonly used continuous distribution, and is assumed for many problems in statistical inference. We would use it whenever we believe or have evidence that the data we are working with have a most likely value in the center, and that as we move to values away from that most likely value in either direction, the probability of obtaining such values declines.

The normal probability density function is

$$f(y) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{y - \mu}{\sigma}\right)^2}, \quad -\infty < y < \infty,$$

where $\mu$ is the mean and $\sigma$ is the standard deviation of the distribution.

The normal pdf is a symmetric bell shaped curve.

[Figure: normal density f(y) plotted against the number of standard deviations from the mean of y, from -4 to 4. Areas marked on each side of the mean: 0.3413 between 0 and 1 standard deviations, 0.1359 between 1 and 2, 0.0214 between 2 and 3, and 0.00135 beyond 3.]

Using the Normal Distribution:

In order to use the distribution to make probability calculations, you probably learned in earlier classes to first transform the given distribution to the standard normal distribution. The standard normal is a normal distribution with mean 0 and variance 1. It is usually represented by the random variable Z. Once the Z value is calculated, it is possible to look up areas in a table. We, however, will let the spreadsheet do this work for us.
In the spreadsheet we can find normal probabilities of the form $P(X < x)$ by using =NORMDIST(x, $\mu$, $\sigma$, 1), where $\mu$ is the mean and $\sigma$ is the standard deviation of the distribution.

Even with the NORMDIST function, it is very important to draw a picture of the probability that you want. That is because the NORMDIST function only finds left-tail areas. If we want other types of areas, we need to rewrite the desired probability in terms of left-tail areas. The best way to visualize what areas we want is to draw a picture.

Example: Suppose we have a process which produces rods with a mean diameter of .625 in. and a standard deviation of .01 in. The customer requires that no diameter exceed .65 in., or be smaller than .6 in. What proportion of parts will not meet these specifications?
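As a rough check on the rod example, the left-tail reasoning can be sketched in Python. This is only an illustration, not part of the course spreadsheet: the use of Python's statistics.NormalDist in place of NORMDIST is my assumption, the variable names are made up, and the sketch assumes the diameters really are normally distributed.

```python
from statistics import NormalDist

# Rod diameters from the example: mean 0.625 in., standard deviation 0.01 in.
diameter = NormalDist(mu=0.625, sigma=0.01)

lower_spec = 0.60  # smallest acceptable diameter, in inches
upper_spec = 0.65  # largest acceptable diameter, in inches

# cdf(x) is a left-tail area P(X < x), like =NORMDIST(x, mean, sd, 1).
too_small = diameter.cdf(lower_spec)

# A right-tail area has to be rewritten as 1 minus a left-tail area.
too_large = 1 - diameter.cdf(upper_spec)

out_of_spec = too_small + too_large
print(round(out_of_spec, 4))  # prints 0.0124
```

Both specification limits sit 2.5 standard deviations from the mean, so the two tail areas are equal and about 1.24% of rods would fall outside the limits under these assumptions.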