Statistics 5311
Applied Statistics for Management I

Textbook: See Current Syllabus
Prerequisites: See Current UT Dallas Catalog for Educational and Course Prerequisites
Access to a Statistical Package (for example, EXCEL with the Analysis ToolPak add-in, MINITAB, SAS, or SPSS). I will be using Microsoft EXCEL.

Instructor: John J. Wiorkowski, Professor of Statistics
Wiorkow@utdallas.edu
Telephone: 972-883-2274 (USA, on Central Standard Time)

What is Statistics?

Humorous:
The Science of drawing a precise line between an unwarranted assumption and a foregone conclusion.
The Science of stating precisely what you don’t know.

Popular Conceptions:
Facts: Demographics, Census Counts, Product Sales, Touchdowns in American Football, Runs scored in British Cricket
Economic Projections: Sales Forecasts, Market Projections, Consumer Price Index
Probability: Odds, Gambling
“Lies, Damned Lies, and Statistics” (popularized by Mark Twain, who attributed it to Benjamin Disraeli)

Actually, “Statistics” encompasses all of the above.

Popular Concept: You can prove anything with Statistics

In fact, if you think of the word “prove” in its mathematical sense, in which things are either true or false, then you cannot prove anything with statistics. Statistics uses mathematics, but it generalizes the concept of “true or false”.

EXAMPLE: YOU LAUNCH A NEW PRODUCT

Mathematics:
It Will Fail (0) or It Will Succeed (1)

As a decision maker, you rarely have the luxury of absolute certainty. Accordingly, you must assess the risks (as measured by probability) and make your decisions in a world of uncertainty.

Statistics:
It Will Fail  0________________x___1  It Will Succeed
(x marks .8)

This means that your project has an 80% chance of succeeding and a 20% chance of failing. You, as the decision maker, must decide if you can live with this risk.

The main reason the myth of proving anything with statistics persists is that most people do not think using statistical logic; rather, they think “mathematically” and ignore variability.
For example, recently (Fall 1999) a story aired on television in the Dallas area indicating that fewer Hispanic (individuals of Latin American descent) college students received financial aid than Anglo (individuals of European descent) college students. This led to angry responses by politicians in the State of Texas. The local newspaper in fact reported the actual statistics:

Ethnic Group    Proportion of College Students Receiving Aid
Hispanic        57 percent
Anglo           59 percent
Black           77 percent

Anyone skilled in statistical thinking would have realized that the Hispanic-Anglo difference is well within statistical fluctuation, so the two groups are essentially the same, while both the Hispanic and Anglo groups are much below the Black group, a fact which was not reported on TV.

One Definition of “Statistics”

Statistics is a body of theory and techniques designed to:
a) convert data into “information” through the use of graphical displays, summarization, and other techniques so that patterns in the data are apparent; and
b) extrapolate any perceived pattern in the data to a broader area of applicability.
Statistics Usually Begins With a Business Problem

Examples:
Accounting -- an auditor is interested in the costs of business travel
Organizational Behavior -- a manager is interested in why turnover of employees seems to have increased
Marketing -- your firm wishes to expand its product line and determine which products are of interest to consumers and within the scope of your company’s expertise
Finance -- the CEO wishes to understand what factors are affecting the firm’s stock price
Economics -- your firm is interested in the status of the economy over the next year
Operations Research -- store managers have been reporting increased waiting lines at checkout counters
Management Information Systems -- you are contemplating replacing your financial reporting system
International Management -- you are interested in expanding your firm’s services to a non-US market

Having Defined Your Problem You Need to Collect Data

The set of all objects (could be persons, records, computer transactions, etc.) which are relevant to your problem is called the Population. For example, in the Marketing example above, suppose you worked for a beverage firm and were interested in marketing a new Cola drink, say Vanilla Cola. The first question of interest is whether people would like it (a later question is whether they would switch from their present preferences).

What is the Population?
Guess 1 -- All consumers. (Probably too broad.)
Guess 2 -- All consumers who drink soft drinks. (Probably too broad.)
Guess 3 -- All consumers who drink cola beverages.

This population is extremely large. We cannot possibly ask everyone, so we will ask a smaller set of consumers called a sample.

How does one take a sample?

Perhaps the least understood aspect of statistical analysis is the importance of taking the sample in an appropriate fashion. All the formulae used in this course, and indeed in most simple statistical analyses, require a random sample.
A random sample is a sample chosen using a randomization scheme. This randomization scheme must have the following properties:
the probability of any individual object in the population being included in the sample must be the same as for any other object; and
the probability of any pair of objects in the population being included in the sample must be the same as for any other pair; and
the probability of any three objects in the population being included in the sample must be the same as for any other set of three objects; and
...
the probability of any n objects in the population being included in the sample must be the same as for any other set of n objects.

We will use the symbol “n” for the sample size and “N” for the size of the population. Also, a sample can be taken either with replacement (meaning the same object may be chosen more than once) or without replacement (meaning that once an object is picked, it can never be picked again). All the formulae in your textbook assume that sampling is done randomly with replacement. (If n/N is less than .02, the formulae in the text can be used even though sampling is done without replacement.)

It is important to realize that you can’t tell if a sample is random by looking at it. In the original Texas Lottery, six numbers were picked at random from 50 without replacement (this is equivalent to N=50 and n=6). There are 15,890,700 different samples which could be picked. Most of them look like the following: 11, 17, 26, 31, 48, 50. But it could just as easily have come out: 1, 2, 3, 4, 5, 6. Both are equally likely; however, most samples will look more like the former than the latter.

How To Take a Random Sample

It is usually necessary to form a list of some kind to take the sample. In the case of our “Vanilla Cola” we would need to get a list of Cola drinkers. One way this could be obtained is by purchasing the information from food stores, which can often track which of their customers are buying Cola when they use various discount cards.
Alternatively, one might cull the list of individuals who have responded to various promotions of your company, since responding usually demonstrates use of your Cola products. (If a list cannot be obtained, then other sampling techniques need to be used; we will discuss these later.)

To illustrate how a random sample is taken, let us use N=20 and n=5, and use EXCEL. Please open your EXCEL worksheet now. Your screen should look like this. I have entered the list of the population.

If you press the button labeled fx on the tool bar at the top of the screen you will see something that looks like this:

In the left panel click on “All” and then scroll down the right panel till you see the function “RAND”. Your screen should look something like this:

The description indicates that this function will generate a random number between 0 and 1. In EXCEL two symbols indicate a function: you begin with an “=” sign, type in the function name, and enclose the arguments of the function in parentheses. The function RAND has no arguments, so one just types =rand() as shown below.

When you hit Enter, a random number will appear in the cell as shown below. (Note: don’t be surprised if this number changes whenever you do something. It is programmed to generate a new random number whenever any computation is made or the F9 key is pressed. Press F9 a few times to see the number change.)

Now copy the entry in this cell down to the remaining 19 cases by grabbing the lower-right corner of the cell and holding your left mouse button down as you drag down the column, as shown below. The result should look something like the following:

Since the values of the random numbers will keep changing, we need to fix them so they will remain the same. With the values still highlighted as above, click on the word “Edit” on the top line of your spreadsheet. It should look something like this:

Now press “Copy” on the displayed menu. You will see a shimmering line around the twenty highlighted numbers.
Hit “Edit” again, and then click on “Paste Special”. You should see something like the following:

Now click on the word “Values” on the displayed panel and hit “OK”. This should result in something that looks like the following. Now when you press the F9 key, the values will no longer change.

To take our sample of size 5 with replacement, I will first multiply the random number by the population size, in this case 20. This is shown below:

In the next column, I will round the result to a whole case number by using the EXCEL function “ROUND”. This has two arguments: the number to be rounded and the number of decimal places to round to (positive for rounding to the right of the decimal point, negative for rounding to the left of the decimal point, and 0 for no decimals at all). The entry would look like this:

Now copy down the two entries in Columns C and D for five rows, and that is my sample. This is shown below:

This is sampling with replacement; notice that I included case 11 twice. If I wanted to sample without replacement, I could do one of two things.

First, I could delete the second pick of Case 11 and continue copying down till I got 5 distinct cases. This is illustrated below:

Alternatively, I could sort the 20 cases in order of the random numbers and then just take the first 5 cases. Since each case appears only once in the sorted list, I could get no duplications. To begin, highlight the first two columns of data and click on the word “Data”. This is shown below:

Now click on “Sort” to get a menu that looks like this:

Replace the word “Case” in the menu with the word “Column C” and hit OK. This will result in something that looks like the page below:

My random sample, without replacement, consists of 2, 6, 7, 10, and 20. Both methods are equally valid, but the second can be faster if the sample size n is a substantial fraction of the population size N.

Technical Point (not required): In the first method of taking a random sample described above, we used the EXCEL “round” function.
This actually creates a slightly biased sample, since values between .5 and 1.5 would round to “1”, values between 1.5 and 2.5 would round to “2”, and so on up to values between 18.5 and 19.5, which would round to “19”. Values between 0 and .5 round to “0”, which is not a valid case number, and only the values from 19.5 to 20 would round to “20”. This means that the value “20” has only half the probability of being generated as any of the values “1” through “19”.

This can be fixed in EXCEL by using the “rounddown” function rather than the “round” function (the function arguments are exactly the same as for the “round” function). The function “rounddown” would take any value between 0 and .99… and round it to “0”, any value between 1.00 and 1.99… and round it to “1”, and so on up to any value between 19.00 and 19.99…, which it would round to “19”. This would generate equally likely values “0” through “19”. Since we want the values to range from 1 to 20, we would add one to the generated values. The following EXCEL command would produce the correct values:

=rounddown(20*rand(),0) + 1

The larger the size of the population (N), the less important this correction is.

Other Forms of Sampling

A random sample will usually automatically mirror different aspects of the population. For example, if half the population is male, then approximately half the values in the sample will be male. Similarly, if 20 percent of the population is over 65 years of age, then approximately 20 percent of the sample will be over 65 years of age.

Suppose, however, that you wanted to guarantee that the sample exactly reflected certain proportions by gender and/or by age. Then one could take a stratified sample. In this case, one takes the initial population lists and divides them into groups (say males and females). Such a division is called stratification; males would be one stratum and females another stratum. You then take two random samples, one from the male stratum and one from the female stratum.
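
The stratified procedure just described can be sketched in a few lines of Python (used here only for illustration; the course itself works in EXCEL). This is a minimal sketch under stated assumptions: the 20-case population, the "M"/"F" labels, the helper name stratified_sample, and the choice to allocate the sample to each stratum in proportion to its size are all hypothetical and are not prescribed by the text.

```python
import random
from collections import defaultdict


def stratified_sample(population, stratum_of, n, seed=None):
    """Take a random sample without replacement inside each stratum.

    Assumption (not from the text): each stratum receives a share of the
    n picks proportional to its size, rounded to a whole number, so the
    total can differ slightly from n when the shares do not divide evenly.
    """
    rng = random.Random(seed)

    # Divide the population list into strata.
    strata = defaultdict(list)
    for item in population:
        strata[stratum_of(item)].append(item)

    # Take one random sample per stratum.
    total = len(population)
    sample = {}
    for name, members in strata.items():
        k = round(n * len(members) / total)
        sample[name] = rng.sample(members, k)
    return sample


# Hypothetical population: cases 1..20, the first ten male, the rest female.
population = [(case, "M" if case <= 10 else "F") for case in range(1, 21)]
result = stratified_sample(population, stratum_of=lambda p: p[1], n=6, seed=1)
```

With 10 males and 10 females and n = 6, each stratum contributes exactly 3 cases, whereas a simple random sample of 6 would match the 50/50 split only approximately.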
Within each stratum you can use the results in this text, but in combining the results you would need to use formulae not in your book. Stratified Sampling is clearly more expensive than random sampling, since you have to divide the lists and take multiple random samples, one for each of the possible stratum combinations (e.g. males over 65, males under 65, etc.). However, under certain circumstances it can give better results than a random sample.

If no lists can be made, then another form of sampling, called Cluster Sampling, can be used. In this case we make use of natural clusters in the population. We sample the clusters randomly, and then randomly sample subjects within the clusters. For example, in our “Vanilla Cola” case, we might not be able to obtain a “list” of Cola drinkers. However, since much soda is purchased in supermarkets, we could randomly pick supermarkets within a metropolitan area, and then visit the stores and observe who is buying Cola. We could either ask the opinion of every buyer, or we could randomly pick, say, 10 percent of the Cola purchasers for interviews. This method is usually cheaper than random sampling; however, it is usually not as accurate, since cluster-to-cluster variability can add significant error to the results. Again, if you take a cluster sample, you cannot use the formulae in your textbook.

Notice that at the basis of all sampling techniques is the random sample.

Course Structure
Population
Inference (Module 3)
Probability: Basis for Inference (Module 2)
Random Sample
Description (Module 1)