Ch. 2 PowerPoint notes, Descriptive Statistics: Tabular and Graphical Displays
1. Summarizing data for two variables using tables
2. Summarizing data for two variables using graphical displays
3. Data visualization: best practices in creating effective graphical displays

SUMMARIZING DATA FOR TWO VARIABLES USING TABLES
o Thus far we have focused on methods used to summarize the data for one variable at a time
o Often a manager is interested in tabular and graphical methods that help in understanding the relationship between two variables

Crosstabulation
o A crosstabulation is a tabular summary of data for two variables
o Crosstabulation can be used when
   o One variable is categorical (qualitative) and the other is quantitative
   o Both variables are quantitative
   o Both variables are categorical
o The left and top margin labels define the classes for the two variables
o Example: Finger Lakes Homes
   o Price range: quantitative variable
   o Home style: categorical variable
   o The number of Finger Lakes homes sold for each style and price range over the past 2 years is shown below

Price range    Colonial    Log    Split   A-Frame   Total
< $200,000        18         6      19       12       55
>= $200,000       12        14      16        3       45
Total             30        20      35       15      100

o Insights gained from the preceding crosstabulation
   o The greatest number of homes in the sample (19) are split-level and priced at less than $200,000
   o Only three homes in the sample are A-frame and priced at $200,000 or more
   o The row totals 55 and 45 are the frequency distribution for the price range variable
   o The column totals 30, 20, 35, and 15 are the frequency distribution for the home style variable

Crosstabulation: row or column percentages
o Converting the entries in the table into row percentages or column percentages can provide additional insight into the relationship between the two variables

Crosstabulation: row percentages

Price range    Colonial    Log     Split   A-Frame   Total
< $200,000      32.73     10.91    34.55    21.82     100
>= $200,000     26.67     31.11    35.56     6.67     100

o Example: (Colonial and >= $200,000) / (all >= $200,000) x 100 = (12/45) x 100 = 26.67

Crosstabulation: column percentages

Price range    Colonial    Log     Split   A-Frame
< $200,000      60.00     30.00    54.29    80.00
>= $200,000     40.00     70.00    45.71    20.00
Total             100       100      100      100

o Example: (Colonial and >= $200,000) / (all Colonial) x 100 = (12/30) x 100 = 40.00

Crosstabulation: Simpson’s paradox
o Data in two or more crosstabulations are often aggregated to produce a summary crosstabulation
o We must be careful in drawing conclusions about the relationship between the two variables in the aggregated crosstabulation
o In some cases the conclusions based on an aggregated crosstabulation are completely reversed when we look at the unaggregated data
o This reversal of conclusions between aggregated and unaggregated data is called Simpson’s paradox

SUMMARIZING DATA FOR TWO VARIABLES USING GRAPHICAL DISPLAYS
o In most cases, a graphical display is more useful than a table for recognizing patterns and trends
o Displaying data in creative ways can lead to powerful insights
o Scatter diagrams and trendlines are useful in exploring the relationship between two variables

Scatter diagram and trendline
o A scatter diagram is a graphical presentation of the relationship between two quantitative variables
o One variable is shown on the horizontal axis and the other on the vertical axis
o The general pattern of the plotted points suggests the overall relationship between the variables
o A trendline provides an approximation of the relationship
o Positive relationship: the trendline slopes upward (y tends to increase as x increases)
o Negative relationship: the trendline slopes downward (y tends to decrease as x increases)
o No apparent relationship: the trendline is roughly horizontal

Side-by-side bar chart
o A graphical display for depicting multiple bar charts on the same display
o Each cluster of bars represents one value of the first variable
o Each bar within a cluster represents one value of the second variable

Stacked bar chart
o Another way to display and compare two variables on the same display
o It is a bar chart in which each bar is broken into rectangular segments of different colors
o If percent frequencies are displayed, all bars will be of the same height (or length), extending to the 100% mark

Data visualization: best practices in creating effective graphical displays
o Data visualization describes the use of graphical displays to summarize and present information about a data set
o The goal is to communicate the key information about the data as effectively and clearly as possible

Creating effective graphical displays
o Both an art and a science
o Guidelines
   o Give the display a clear and concise title
   o Keep the display simple
   o Clearly label each axis and provide the units of measure
   o If colors are used, make sure they are distinct
   o If multiple colors or lines are used, provide a legend

Choosing the type of graphical display
o Displays used to show the distribution of the data
   o Bar chart, pie chart, dot plot, histogram, stem-and-leaf display
o Displays used to make comparisons
   o Side-by-side bar chart, stacked bar chart
o Displays used to show relationships
   o Scatter diagram, trendline

Data dashboard
o A data dashboard is a widely used data visualization tool
o It organizes and presents key performance indicators (KPIs) used to monitor an organization or process
o It provides timely summary information that is easy to read, understand, and interpret
o Some additional guidelines include
   o Minimize the need for screen scrolling
   o Avoid unnecessary use of color or 3D
   o Use borders between charts to improve readability

Ch. 2 part A, Descriptive Statistics: Tabular and Graphical Displays
o Summarizing data for a categorical variable
o Summarizing data for a quantitative variable
o Categorical data use labels or names to identify categories of like items
o Quantitative data are numerical values that indicate how much or how many

FREQUENCY DISTRIBUTION
o A frequency distribution is a tabular summary of data showing the number (frequency) of observations in each of several nonoverlapping categories or classes
o The objective is to provide insights about the data that cannot be quickly obtained by looking only at the original data

Relative frequency distribution
o The relative frequency of a class is the fraction or proportion of the total number of data items belonging to the class
o A relative frequency distribution is a tabular summary of a set of data showing the relative frequency for each class
o The percent frequency of a class is the relative frequency multiplied by 100
o A percent frequency distribution is a tabular summary of a set of data showing the percent frequency for each class

Bar chart
o A bar chart is a graphical display for depicting categorical (qualitative) data
o On one axis (usually the horizontal axis) we specify the labels used for each of the classes
o A frequency, relative frequency, or percent frequency scale can be used for the other axis (usually the vertical axis)
o Using a bar of fixed width drawn above each class label, we extend the height appropriately
o The bars are separated to emphasize that each class is a separate category

Pareto diagram
o In quality control, bar charts are used to identify the most important causes of problems
o When the bars are arranged in descending order of height from left to right (with the most frequently occurring cause appearing first), the bar chart is called a Pareto diagram
o This diagram is named for its founder, Vilfredo Pareto, an Italian economist

Pie chart
o A pie chart is a
  commonly used graphical display for presenting relative frequency and percent frequency distributions for categorical data
o First draw a circle, then use the relative frequencies to subdivide the circle into sectors that correspond to the relative frequency for each class
o Since there are 360 degrees in a circle, a class with a relative frequency of .25 would consume .25(360) = 90 degrees of the circle

Frequency distribution
o The three steps necessary to define the classes for a frequency distribution with quantitative data are:
   1. Determine the number of nonoverlapping classes
   2. Determine the width of each class
   3. Determine the class limits
o Guidelines for determining the number of classes
   o Use between 5 and 20 classes
   o Data sets with a larger number of elements usually require a larger number of classes
   o Smaller data sets usually require fewer classes
   o The goal is to use enough classes to show the variation in the data, but not so many that some classes contain only a few data items
o Guidelines for determining the width of each class
   o Use classes of equal width
   o Approximate class width = (largest data value - smallest data value) / number of classes
   o Making the classes the same width reduces the chance of inappropriate interpretations
o Note on number of classes and class width
   o In practice, the number of classes and the appropriate class width are determined by trial and error
   o Once a possible number of classes is chosen, the appropriate class width is found
   o The process can be repeated for a different number of classes
   o Ultimately, the analyst uses judgment to determine the combination of number of classes and class width that provides the best frequency distribution for summarizing the data
o Guidelines for determining the class limits
   o Class limits must be chosen so that each data item belongs to one and only one class
   o The lower class limit identifies the smallest possible data value assigned to the class
   o The upper class limit identifies the largest possible data value assigned to the class
   o The appropriate values for the class limits depend on the level of accuracy of the data
   o An open-end class requires only a lower class limit or an upper class limit

Dot plot
o One of the simplest graphical summaries of data is a dot plot
o A horizontal axis shows the range of data values
o Each data value is represented by a dot placed above the axis

Histogram
o Another common graphical display of quantitative data is a histogram
o The variable of interest is placed on the horizontal axis
o A rectangle is drawn above each class interval with its height corresponding to the interval’s frequency, relative frequency, or percent frequency
o Unlike a bar chart, a histogram has no natural separation between rectangles of adjacent classes

Cumulative distributions
o Cumulative frequency distribution: shows the number of items with values less than or equal to the upper limit of each class
o Cumulative relative frequency distribution: shows the proportion of items with values less than or equal to the upper limit of each class
o Cumulative percent frequency distribution: shows the percentage of items with values less than or equal to the upper limit of each class
o The last entry in a cumulative frequency distribution always equals the total number of observations
o The last entry in a cumulative relative frequency distribution always equals 1.00
o The last entry in a cumulative percent frequency distribution always equals 100

Stem-and-leaf display
o A stem-and-leaf display shows both the rank order and shape of the distribution of the data
o It is similar to a histogram on its side, but it has the advantage of showing the actual data values
o The first digits of each data item are arranged to the left of a vertical line
o To the right of the vertical line we record the last digit for each item in rank order
o Each line in the display is referred to as a stem
o Each digit on a stem is a leaf

Stretched stem-and-leaf display
o If we believe the original stem-and-leaf display has condensed the data too much, we can stretch the display vertically by using two stems for each leading digit
o Whenever a stem value is stated twice, the first value corresponds to leaf values of 0-4 and the second to leaf values of 5-9
o Leaf units
   o A single digit is used to define each leaf
   o In the preceding example the leaf unit was 1
   o Leaf units may be 100, 10, 1, 0.1, and so on
   o Where the leaf unit is not shown, it is assumed to equal 1
   o The leaf unit indicates how to multiply the stem-and-leaf numbers to approximate the original data

Statistics
o The term statistics can refer to numerical facts such as averages, medians, percents, and index numbers that help us understand a variety of business and economic situations
o Statistics can also refer to the art and science of collecting, analyzing, presenting, and interpreting data

Applications in Business and Economics
o Accounting
   o Public accounting firms use statistical sampling procedures when conducting audits for their clients
o Economics
   o Economists use statistical information in making forecasts about the future of the economy or some aspect of it
o Finance
   o Financial advisors use price-earnings ratios and dividend yields to guide their investment advice
o Marketing
   o Electronic point-of-sale scanners at retail checkout counters are used to collect data for a variety of marketing research applications
o Production
   o A variety of statistical quality control charts are used to monitor the output of a production process
o Information systems
   o A variety of statistical information helps administrators assess the performance of computer networks

Data and Data Sets
o Data: facts and figures collected, analyzed, and summarized for presentation and interpretation
o All the data collected in a particular study are referred to as the data set for the study

Elements, Variables, and Observations
o Elements are the entities on which data are collected
o A variable is a characteristic of interest for the elements
o The set of measurements obtained for a particular element is called an observation
o A data set with n elements contains n observations
o The total number of data values in a complete data set is the number of elements multiplied by the number of variables

Scales of Measurement
o The scales are nominal, ordinal, interval, and ratio
o The scale determines the amount of information contained in the data
o The scale indicates the data summarization and statistical analyses that are most appropriate
o Nominal
   o Data are labels or names used to identify an attribute of the element
   o A nonnumeric label or numeric code may be used
   o Ex. Students of a university are classified by the school in which they are enrolled using a nonnumeric label such as Business, Humanities, Education, and so on
   o Alternatively, a numeric code could be used for the school variable (e.g. 1 denotes Business, 2 denotes Humanities)
o Ordinal
   o The data have the properties of nominal data, and the order or rank of the data is meaningful
   o A nonnumeric label or numeric code may be used
   o Ex. Students of a university are classified by class standing using a nonnumeric label such as freshman, sophomore, junior, or senior
   o Alternatively, a numeric code could be used for the class standing variable (e.g. 1 denotes freshman, 2 denotes sophomore)
o Interval
   o The data have the properties of ordinal data, and the interval between observations is expressed in terms of a fixed unit of measure
   o Interval data are always numeric
   o Ex. Melissa has an SAT score of 1985, while Kevin has an SAT score of 1880.
     Melissa scored 105 points more than Kevin
o Ratio
   o The data have all the properties of interval data, and the ratio of two values is meaningful
   o Variables such as distance, height, weight, and time use the ratio scale
   o This scale must contain a zero value that indicates that nothing exists for the variable at the zero point
   o Ex. Melissa’s college record shows 36 credit hours earned, while Kevin’s shows 72 credit hours earned. Kevin has twice as many credit hours earned as Melissa

Categorical and Quantitative Data
o Data can be further classified as being categorical or quantitative
o The statistical analysis that is appropriate depends on whether the data for the variable are categorical or quantitative
o In general, there are more alternatives for statistical analysis when the data are quantitative
o Categorical data
   o Labels or names used to identify an attribute of each element
   o Often referred to as qualitative data
   o Use either the nominal or ordinal scale of measurement
   o Can be either numeric or nonnumeric
   o Appropriate statistical analyses are rather limited
o Quantitative data
   o Indicate how many or how much
   o Discrete if measuring how many
   o Continuous if measuring how much
   o Quantitative data are always numeric
   o Ordinary arithmetic operations are meaningful for quantitative data

Cross-Sectional Data
o Cross-sectional data are collected at the same or approximately the same point in time
o Ex. Data detailing the number of building permits issued in November 2012 in each of the counties of Ohio

Time Series Data
o Time series data are collected over several time periods
o Ex. Data detailing the number of building permits issued in Lucas County, Ohio, in each of the last 36 months
o Graphs of time series help analysts
   o Understand what happened in the past
   o Identify any trends over time
   o Project future levels for the time series

Data Sources
o Existing sources
   o Internal company records: almost any department
   o Business database services: Dow Jones and Co.
   o Government agencies: U.S. Department of Labor
   o Industry associations: Travel Industry Association of America
   o Special-interest organizations: Graduate Management Admission Council
   o Internet: more and more firms
o Data available from internal company records
   o Employee records: name, address, Social Security number
   o Production records: part number, quantity produced, direct labor cost, material cost
   o Inventory records: part number, quantity in stock, reorder level, economic order quantity
   o Sales records: product number, sales volume, sales volume by region
   o Credit records: customer name, credit limit, accounts receivable balance
   o Customer profile: age, gender, income, household size
o Data available from selected government agencies
   o Census Bureau: population data, number of households, household income
   o Federal Reserve Board: data on money supply, exchange rates, discount rates
   o Office of Management and Budget: data on revenue, expenditures, debt of the federal government
   o Department of Commerce: data on business activity, value of shipments, profit by industry
   o Bureau of Labor Statistics: consumer spending, unemployment rate, hourly earnings, safety record
o Statistical studies: experimental
   o In experimental studies the variable of interest is first identified. Then one or more other variables are identified and controlled so that data can be obtained about how they influence the variable of interest
   o The largest experimental study ever conducted is believed to be the 1954 Public Health Service experiment for the Salk polio vaccine.
     Nearly 2 million U.S. children were selected
o Statistical studies: observational
   o In observational (nonexperimental) studies no attempt is made to control or influence the variables of interest
   o A survey is a good example
   o Studies of smokers and nonsmokers are observational studies because researchers do not determine or control who will smoke and who will not

Data Acquisition Considerations
o Time requirement
   o Searching for information can be time-consuming
   o Information may no longer be useful by the time it is available
o Cost of acquisition
   o Organizations often charge for information even when it is not their primary business activity
o Data errors
   o Using any data that happen to be available or that were acquired with little care can lead to misleading information

Descriptive Statistics
o Most of the statistical information in newspapers, magazines, company reports, and other publications consists of data that are summarized and presented in a form that is easy to understand
o Such summaries of data, which may be tabular, graphical, or numerical, are referred to as descriptive statistics
o Ex. The manager of Hudson Auto would like to have a better understanding of the cost of parts used in the engine tune-ups performed in her shop. She examines 50 customer invoices for tune-ups. The costs of parts, rounded to the nearest dollar, are listed on the next slide

Numerical Descriptive Statistics
o The most common numerical descriptive statistic is the average (or mean)
o The average is a measure of the central tendency, or central location, of the data for a variable

Statistical Inference
o Population: the set of all elements of interest in a particular study
o Sample: a subset of the population
o Statistical inference: the process of using data obtained from a sample to make estimates and test hypotheses about the characteristics of a population
o Census: collecting data for the entire population
o Sample survey: collecting data for a sample

Process of statistical inference
1. The population consists of all tune-ups. The average cost of parts is unknown
2. A sample of 50 engine tune-ups is examined
3. The sample data provide a sample average parts cost of $79 per tune-up
4. The sample average is used to estimate the population average

Computers and Statistical Analysis
o Statisticians often use computer software to perform the statistical computations required with large amounts of data
o Many of the data sets in this book are available on the website that accompanies the book
o The data sets can be downloaded in either Minitab or Excel format
o Also, the Excel add-in StatTools can be downloaded from the website

Data Warehousing
o Organizations obtain large amounts of data on a daily basis by means of magnetic card readers, bar code scanners, point-of-sale terminals, and touch-screen monitors
o Wal-Mart captures data on 20-30 million transactions per day
o Visa processes 6,800 payment transactions per second
o Capturing, storing, and maintaining the data, referred to as data warehousing, is a significant undertaking

Data Mining
o Analysis of the data in the warehouse might aid in decisions that will lead to new strategies and higher profits for the organization
o Using a combination of procedures from statistics, mathematics, and computer science, analysts mine the data to convert it into useful information
o The most effective data mining systems use automated procedures to discover relationships in the data and predict future outcomes, prompted by only general, even vague, queries by the user

Data mining applications
o The major applications of data mining have been made by companies with a strong consumer focus, such as retail, financial, and communication firms
o Data mining is used to identify related products that customers who have already purchased a specific product are also likely to purchase (pop-ups are then used to draw attention to those related products)
o As another example, data mining is used to identify customers who should receive special discount offers based on their past purchasing volumes

Data mining requirements
o Statistical methodology such as multiple regression, logistic regression, and correlation is heavily used
o Also needed are computer science technologies involving artificial intelligence and machine learning
o A significant investment in time and money is required as well

Data mining model reliability
o Finding a statistical model that works well for a particular sample of data does not necessarily mean that it can be reliably applied to other data
o With the enormous amount of data available, the data set can be partitioned into a training set (for model development) and a test set (for validating the model)
o There is, however, a danger of overfitting the model to the point that misleading associations and conclusions appear to exist
o Careful interpretation of results and extensive testing are important

Ethical Guidelines for Statistical Practice
o In a statistical study, unethical behavior can take a variety of forms, including
   o Improper sampling
   o Inappropriate analysis of the data
   o Development of misleading graphs
   o Use of inappropriate summary statistics
   o Biased interpretation of the results
o You should strive to be fair, thorough, objective, and neutral as you collect, analyze, and present data
o As a consumer of statistics, you should also be aware of the possibility of unethical behavior by others
o The American Statistical Association developed the report Ethical Guidelines for Statistical Practice
o The report contains 67 guidelines organized into 8 topic areas
   o Professionalism
   o Responsibilities to funders, clients, and employers
   o Responsibilities in publications and testimony
   o Responsibilities to research subjects
   o Responsibilities to research team colleagues
   o Responsibilities to other statisticians and practitioners
   o Responsibilities regarding allegations of misconduct
   o Responsibilities of employers, including organizations, individuals, attorneys, or other clients

Ch. 3 part A, Descriptive Statistics: Numerical Measures

Measures of location
o Mean
   o The most important measure of location is the mean
   o It provides a measure of central location
   o The mean of a data set is the average of all the data values
   o The sample mean x̄ (x-bar) is the point estimator of the population mean μ (mu)
o Weighted mean
   o In some instances, the mean is computed by giving each observation a weight that reflects its relative importance
   o The choice of weights depends on the application
   o The weights might be the number of credit hours earned for each grade, as in GPA
   o In other weighted mean computations, quantities such as pounds, dollars, or volume are frequently used as weights
   o If the data are from a population, μ replaces x̄
o Median
   o The median of a data set is the value in the middle when the data items are arranged in ascending order
   o Whenever a data set has extreme values, the median is the preferred measure of central location
   o The median is the measure of location most often reported for annual income and property value data
   o A few extremely large incomes or property values can inflate the mean
o Trimmed mean
   o Another measure, sometimes used when extreme values are present, is the trimmed mean
   o It is obtained by deleting a percentage of the smallest and largest values from the data set and then computing the mean of the remaining values
   o For example:
     The 5% trimmed mean is obtained by removing the smallest 5% and the largest 5% of the data values and then computing the mean of the remaining values
o Geometric mean
   o Calculated by finding the nth root of the product of n values
   o It is often used in analyzing growth rates in financial data (where using the arithmetic mean would give misleading results)
   o It should be applied any time you want to determine the mean rate of change over several successive periods, be they years, quarters, or weeks
   o Other common applications include changes in populations of species, crop yields, pollution levels, and birth and death rates
o Mode
   o The mode is the value that occurs with greatest frequency
   o The greatest frequency can occur at two or more different values
   o If the data have exactly two modes, the data are bimodal
   o If the data have more than two modes, the data are multimodal
   o Caution: if the data are bimodal or multimodal, Excel’s MODE function will incorrectly identify a single mode
o Percentiles
   o A percentile provides information about how the data are spread over the interval from the smallest value to the largest value
   o Admission test scores for colleges and universities are frequently reported in terms of percentiles
   o The pth percentile of a data set is a value such that at least p percent of the items take on this value or less and at least (100 - p) percent of the items take on this value or more
   o Arrange the data in ascending order
   o Compute the index i, the position of the pth percentile: i = (p/100)n
   o If i is not an integer, round up; the pth percentile is the value in the ith position
   o If i is an integer, the pth percentile is the average of the values in positions i and i + 1
o Quartiles
   o Quartiles are specific percentiles
   o First quartile: 25th percentile
   o Second quartile: 50th percentile (median)
   o Third quartile: 75th percentile
o If the measures are computed for data from a sample, they are called sample statistics
o If the measures are computed for data from a population, they are called population parameters
o A sample statistic is referred to as the point
  estimator of the corresponding population parameter

Measures of variability
o It is often desirable to consider measures of variability as well as measures of location
o For example, when choosing between supplier A and supplier B, we might consider not only the average delivery time for each but also the variability in delivery time for each
o Range
   o Difference between the largest and smallest data values
   o It is the simplest measure of variability
   o It is very sensitive to the smallest and largest data values
o Interquartile range
   o Difference between the third and first quartiles; it is the range of the middle 50% of the data
   o It overcomes the sensitivity to extreme data values
o Variance
   o A measure of variability that utilizes all the data
   o It is based on the difference between the value of each observation and the mean
   o The variance is useful in comparing the variability of two or more variables
   o The variance is the average of the squared differences between each data value and the mean
o Standard deviation
   o Positive square root of the variance
   o It is measured in the same units as the data, making it more easily interpreted than the variance
o Coefficient of variation
   o Indicates how large the standard deviation is in relation to the mean

Ch. 3 part B, Descriptive Statistics: Numerical Measures

Distribution shape: skewness
o An important measure of the shape of a distribution is called skewness
o The formula for the skewness of sample data is
o Skewness can be easily computed using statistical software
o Look at the PowerPoint for types of skewness

Z-scores
o The z-score is often called the standardized value
o It denotes the number of standard deviations a data value is from the mean
o Excel’s STANDARDIZE function can be used to compute the z-score
o An observation’s z-score is a measure of the relative location of the observation in a data set
o A data value less than the sample mean will have a z-score less than zero
o A data value greater than the sample mean will have a z-score greater than zero
o A data value equal to the sample mean will have a z-score of zero

Chebyshev’s theorem
o At least (1 - 1/z^2) of the items in any data set will be within z standard deviations of the mean, where z is any value greater than 1
o Chebyshev’s theorem requires z > 1, but z need not be an integer
o At least 75% of the data values must be within z = 2 standard deviations of the mean
o At least 89% of the data values must be within z = 3 standard deviations of the mean
o At least 94% of the data values must be within z = 4 standard deviations of the mean

Empirical rule
o When the data are believed to approximate a bell-shaped distribution, the empirical rule can be used to determine the percentage of data values that must be within a specified number of standard deviations of the mean
o The empirical rule is based on the normal distribution, which is covered in Ch. 6
o For data having a bell-shaped distribution
   o 68.26% of the values of a normal random variable are within +/-1 standard deviation of its mean
   o 95.44% of the values of a normal random variable are within +/-2 standard deviations of its mean
   o 99.72% of the values of a normal random variable are within +/-3 standard deviations of its mean

Detecting outliers
o An outlier is an unusually small or unusually large value in a data set
o A data value with a z-score less than -3 or greater than +3 might be considered an outlier
o It might be
   o An incorrectly recorded data value
   o A data value that was incorrectly included in the data set
   o A correctly recorded data value that belongs in the data set

Five-number summaries and box plots
o Summary statistics and easy-to-draw graphs can be used to quickly summarize large quantities of data
o Two tools that accomplish this are five-number summaries and box plots
o Five-number summary
   1. Smallest value
   2. First quartile
   3. Median
   4. Third quartile
   5. Largest value

Box plot
o A graphical summary of data that is based on a five-number summary
o A key to the development of a box plot is the computation of the median and the quartiles Q1 and Q3
o Box plots provide another way to identify outliers
o Limits are located (not drawn) using the interquartile range (IQR)
o Data outside these limits are considered outliers
o The location of each outlier is shown with the symbol *

Measures of association between two variables
o Thus far we have examined numerical methods used to summarize the data for one variable at a time
o Often a manager or decision maker is interested in the relationship between two variables
o Two descriptive measures of the relationship between two variables are covariance and the correlation coefficient

Covariance
o A measure of linear association between two variables
o Positive values indicate a positive relationship
o Negative values indicate a negative relationship

Correlation coefficient
o Correlation is a measure of linear association and not necessarily causation
o Just because two variables are highly
correlated, it doesn’t mean that one variable is the cause of the other
o The coefficient can take on values between -1 and +1
o Values near -1 indicate a strong negative linear relationship
o Values near +1 indicate a strong positive linear relationship
o The closer the correlation is to zero, the weaker the linear relationship
Data dashboards: adding numerical measures to improve effectiveness
o Data dashboards are not limited to graphical displays
o The addition of numerical measures, like the mean and standard deviation of KPIs, to a data dashboard is often critical
o Dashboards are often interactive
o Drilling down refers to functionality in interactive dashboards that allows the user to access info and analyses at an increasingly detailed level
Ch. 4 Intro to Probability
Uncertainties
o Managers often base their decisions on an analysis of uncertainties such as the following:
    o What are the chances that sales will decrease if we increase prices?
    o What is the likelihood a new assembly method will increase productivity?
    o What are the odds that a new investment will be profitable?
Probability
o A numerical measure of the likelihood that an event will occur
o Probability values are always assigned on a scale from 0 to 1
o A probability near zero indicates an event is quite unlikely to occur
o A probability near 1 indicates an event is almost certain to occur
Probability as a numerical measure of the likelihood of occurrence
Statistical Experiments
o In stats, the notion of an experiment differs somewhat from that of an experiment in the physical sciences
o In statistical experiments, probability determines outcomes
o Even though the experiment is repeated in exactly the same way, an entirely different outcome may occur
o For this reason, statistical experiments are sometimes called random experiments
An Experiment and its Sample Space
o An experiment is any process that generates well-defined outcomes
o The sample space for an experiment is the set of all experimental outcomes
o An experimental outcome is also called a sample point
    Experiment              Experimental outcomes
    Toss a coin             Head, tail
    Inspect a part          Defective, nondefective
    Conduct a sales call    Purchase, no purchase
    Roll a die              1, 2, 3, 4, 5, 6
    Play a football game    Win, lose, tie
o Bradley Investments example
    o Bradley has invested in two stocks, Markley Oil and Collins Mining
    o Bradley has determined that the possible outcomes of these investments three months from now are as follows
    Investment gain or loss in 3 months (in $000)
    Markley Oil    Collins Mining
    10             8
    5              -2
    0
    -20
A counting rule for multiple-step experiments
o If an experiment consists of a sequence of k steps in which there are n1 possible results for the first step, n2 possible results for the second, and so on, then the total number of experimental outcomes is given by (n1)(n2)…(nk).
o A helpful graphical representation of a multiple-step experiment is a tree diagram
o Bradley Investments
o Bradley Investments can be viewed as a two-step experiment.
o It involves two stocks, each with a set of experimental outcomes
    Markley Oil: n1=4
    Collins Mining: n2=2
    Total # of experimental outcomes = (n1)(n2) = (4)(2) = 8
Counting rule for combinations
o A second useful counting rule enables us to count the number of experimental outcomes when n objects are to be selected from a set of N objects
o Number of combinations of N objects taken n at a time: C(N,n) = N!/(n!(N-n)!)
Counting rule for permutations
o A third useful counting rule enables us to count the number of experimental outcomes when n objects are to be selected from a set of N objects, where the order of selection is important
o Number of permutations of N objects taken n at a time: P(N,n) = N!/(N-n)!
Assigning probabilities
o Basic requirements for assigning probabilities
    1. The probability assigned to each experimental outcome must be between 0 and 1, inclusive
    2. The sum of the probabilities for all experimental outcomes must equal 1
o Classical method
    o Assigning probabilities based on the assumption of equally likely outcomes
    o If an experiment has n possible outcomes, the classical method would assign a probability of 1/n to each outcome
    o Ex. Rolling a die
        Experiment: rolling a die
        Sample space: S={1,2,3,4,5,6}
        Probabilities: each sample point has a 1/6 chance of occurring
o Relative frequency method
    o Assigning probabilities based on experimentation or historical data
    o Ex.
Lucas Tool Rental
    o Lucas would like to assign probabilities to the number of car polishers it rents each day
    o Office records show the following frequencies of daily rentals for the last forty days
    o Each probability assignment is given by dividing the frequency (number of days) by the total frequency (total number of days)
o Subjective method
    o Assigning probabilities based on judgment
    o When economic conditions and a company’s circumstances change, it might be inappropriate to assign probabilities based solely on historical data
    o We can use any data available as well as our experience and intuition, but ultimately a probability value should express our degree of belief that the experimental outcome will occur
    o The best probability estimates are often obtained by combining the estimates from the classical or relative frequency approach with the subjective estimate
    o Ex. Bradley Investments
        o An analyst made the following probability estimates
Events and their probabilities
o An event is a collection of sample points
o The probability of any event is equal to the sum of the probabilities of the sample points in the event
o If we can identify all the sample points of an experiment and assign a probability to each, we can compute the probability of an event
Some basic relationships of probability
o There are some basic probability relationships that can be used to compute the probability of an event without knowledge of all the sample point probabilities
o Complement of an event
    o The complement of event A is defined to be the event consisting of all sample points that are not in A
    o The complement of A is denoted by A^c
o Union of two events
    o The union of events A and B is the event containing all sample points that are in A or B or both
    o The union of events A and B is denoted by A ∪ B
o Intersection of two events
    o The intersection of events A and B is the set of all sample points that are in both A and B
    o The intersection of events A and B is denoted by A ∩
B
o Addition law
    o Provides a way to compute the probability of event A, or B, or both A and B occurring
    o The law is written as P(A ∪ B) = P(A) + P(B) - P(A ∩ B)
o Mutually exclusive events
    o Two events are said to be mutually exclusive if the events have no sample points in common
    o Two events are mutually exclusive if, when one event occurs, the other can’t
    o If events A and B are mutually exclusive, P(A ∩ B) = 0
    o The addition law for mutually exclusive events is P(A ∪ B) = P(A) + P(B)
Conditional probability
o The probability of an event given that another event has occurred is called conditional probability
o The conditional probability of A given B is denoted by P(A|B)
o A conditional probability is computed as follows: P(A|B) = P(A ∩ B)/P(B)
Multiplication Law
o Provides a way to compute the probability of the intersection of two events
o The law is written as: P(A ∩ B) = P(B)P(A|B)
o Joint probability table
Independent events
o If the probability of event A is not changed by the existence of event B, we would say that events A and B are independent
o Two events A and B are independent if P(A|B) = P(A) or P(B|A) = P(B)
Multiplication Law for Independent Events
o The multiplication law also can be used as a test to see if two events are independent
o The law is written as: P(A ∩ B) = P(A)P(B)
Mutual Exclusiveness and Independence
o Do not confuse the notion of mutually exclusive events with that of independent events
o Two events with nonzero probabilities cannot be both mutually exclusive and independent
o If one mutually exclusive event is known to occur, the other cannot occur; thus the probability of the other event occurring is reduced to zero, and they are therefore dependent
o Two events that are not mutually exclusive might or might not be independent
Bayes’ Theorem
o Often we begin probability analysis with initial or prior probabilities
o Then, from a sample, special report, or product test, we obtain some additional info
o Given this info, we calculate revised or posterior probabilities
o Bayes’ theorem provides the means for revising prior probabilities
o Prior probabilities -> new info ->
application of Bayes’ theorem -> posterior probabilities
o Ex. L.S. Clothiers
    o A proposed shopping center will provide strong competition for downtown businesses like L.S. Clothiers. If the shopping center is built, the owner of L.S. Clothiers feels it would be best to relocate to the shopping center
    o The shopping center cannot be built unless a zoning change is approved by the town council
    o The planning board must first make a recommendation, for or against the zoning change, to the council
o New info
    o The planning board has recommended against the zoning change. Let B denote the event of a negative recommendation by the planning board
    o Given that B has occurred, should L.S. Clothiers revise the probabilities that the town council will approve or disapprove the zoning change?
Conditional probabilities
o Ex. L.S. Clothiers
    o Let A1 = town council approves the zoning change and A2 = town council disapproves, with prior probabilities P(A1) = .7 and P(A2) = .3
    o Past history with the planning board and the town council indicates the following: P(B|A1) = .2 and P(B|A2) = .9
Bayes’ Theorem
o To find the posterior probability that event Ai will occur given that event B has occurred, we apply Bayes’ theorem:
    P(Ai|B) = P(Ai)P(B|Ai) / [P(A1)P(B|A1) + P(A2)P(B|A2) + … + P(An)P(B|An)]
o Bayes’ theorem is applicable when the events for which we want to compute posterior probabilities are mutually exclusive and their union is the entire sample space
Posterior Probabilities
o Ex. L.S. Clothiers
    o Given the planning board’s recommendation not to approve the zoning change, we revise the prior probabilities as follows:
        P(A1|B) = (.7)(.2) / [(.7)(.2) + (.3)(.9)] = .14/.41 = .34
    o The planning board’s recommendation is good news for L.S. Clothiers.
The posterior probability of the town council approving the zoning change is .34 compared to a prior probability of .70
Bayes’ Theorem: Tabular approach
o Example: L.S. Clothiers
o Step 1: Prepare the following three columns
    Column 1: the mutually exclusive events for which posterior probabilities are desired
    Column 2: the prior probabilities for the events
    Column 3: the conditional probabilities of the new info given each event
o Step 2: Prepare the fourth column
    Column 4: compute the joint probabilities for each event and the new info B by using the multiplication law
    Multiply the prior probabilities in column 2 by the corresponding conditional probabilities in column 3
    We see that there is a .14 probability of the town council approving the zoning change and a negative recommendation by the planning board
    There is a .27 probability of the town council disapproving the zoning change and a negative recommendation by the planning board
o Step 3: Sum the joint probabilities in column 4
    The sum is the probability of the new info, P(B)
    The sum .14+.27 shows an overall probability of .41 of a negative recommendation by the planning board
o Step 4: Prepare the fifth column
    Column 5: compute the posterior probabilities using the basic relationship of conditional probability
    The joint probabilities of each event and B are in column 4, and the probability P(B) is the sum of column 4
Ch. 5 Discrete Probability Distributions
Random Variables
o Random variable: a numerical description of the outcome of an experiment
o Discrete random variable: may assume either a finite number of values or an infinite sequence of values
o Continuous random variable: may assume any numerical value in an interval or collection of intervals
Discrete random variable with a finite number of values
o Ex.
JSL Appliances
    o Let x = number of TVs sold at the store in one day, where x can take on 5 values (0, 1, 2, 3, 4)
    o We can count the TVs sold, and there is a finite upper limit on the number that might be sold (which is the number of TVs in stock)
Discrete random variable with an infinite sequence of values
o Ex. JSL Appliances
    o Let x = number of customers arriving in one day, where x can take on the values 0, 1, 2, …
    o We can count the customers arriving, but there is no finite upper limit on the number that might arrive
Random variables
    Question                       Random variable x                                    Type
    Family size                    x = # of dependents reported on tax return           Discrete
    Distance from home to store    x = distance in miles from home to the store site    Continuous
    Own dog or cat                 x = 1 if own no pet, 2 if own dogs only,             Discrete
                                   3 if own cats only, 4 if own dogs and cats
Discrete probability distributions
o The probability distribution for a random variable describes how probabilities are distributed over the values of the random variable
o We can describe a discrete probability distribution with a table, graph, or formula
o Two types of discrete probability distributions will be introduced
    1. The first uses the rules of assigning probabilities to experimental outcomes to determine probabilities for each value of the random variable
    2. The second uses a special mathematical formula to compute probabilities for each value of the random variable
o The probability distribution is defined by a probability function, denoted by f(x), that provides the probability for each value of the random variable
o The required conditions for a discrete probability function are f(x) ≥ 0 and Σf(x) = 1
o There are three methods for assigning probabilities to random variables: the classical method, the subjective method, and the relative frequency method
o The use of the relative frequency method to develop discrete probability distributions leads to what is called an empirical discrete distribution
o Ex.
o In addition to tables and graphs, a formula that gives the probability function f(x) for every value of x is often used to describe probability distributions
o Several discrete probability distributions specified by formulas are the discrete uniform, binomial, Poisson, and hypergeometric distributions
Discrete uniform probability distribution
o The simplest example of a discrete probability distribution given by a formula
o The discrete uniform probability function is f(x) = 1/n, where n = number of values the random variable may assume
Expected Value
o The expected value, or mean, of a random variable is a measure of its central location: E(x) = μ = Σx f(x)
o The expected value is a weighted avg of the values the random variable may assume
o The weights are the probabilities
o The expected value doesn’t have to be a value the random variable can assume
Variance and Standard deviation
o The variance summarizes the variability in the values of a random variable: Var(x) = σ^2 = Σ(x - μ)^2 f(x)
o The variance is a weighted average of the squared deviations of a random variable from its mean
o The weights are the probabilities
o The standard deviation is defined as the positive square root of the variance
Bivariate Distributions
o A probability distribution involving two random variables
o Each outcome of a bivariate experiment consists of two values, one for each random variable
o Ex. Rolling a pair of dice
o When dealing with bivariate probability distributions, we are often interested in the relationship between the random variables
A bivariate discrete probability distribution
o A company asked 200 of its employees how they rated their benefit package and job satisfaction
o The crosstabulation below shows the ratings data
o The bivariate empirical discrete probabilities for benefits rating and job satisfaction are listed below
o Covariance for random variables x and y
Binomial Probability distribution
o Four properties of a binomial experiment
    1. The experiment consists of a sequence of n identical trials
    2.
Two outcomes, success and failure, are possible on each trial
    3. The probability of a success, denoted by p, does not change from trial to trial (stationarity assumption)
    4. The trials are independent
o Our interest is in the number of successes occurring in the n trials
o We let x denote the number of successes occurring in the n trials
o Binomial probability function: f(x) = [n!/(x!(n-x)!)] p^x (1-p)^(n-x)
o Ex. Evans Electronics
    o Evans Electronics is concerned about a low retention rate for its employees
    o In recent years, management has seen a turnover of 10% of hourly employees annually
    o Thus, for any hourly employee chosen at random, management estimates a probability of .1 that the person will not be with the company next year
    o Choosing 3 hourly employees at random, what is the probability that 1 of them will leave the company this year?
    o The probability of the first employee leaving and the second and third employees staying, denoted (S,F,F), is given by p(1-p)(1-p)
    o With a .10 probability of an employee leaving on any one trial, the probability of an employee leaving on the first trial and not on the second and third trials is given by (.1)(.9)(.9) = (.1)(.9)^2 = .081
    o Two other experimental outcomes also result in one success and two failures
    o The probabilities for the three experimental outcomes involving one success follow: (S,F,F), (F,S,F), and (F,F,S) each have probability .081, for a total of 3(.081) = .243
Binomial probabilities and cumulative probabilities
o Statisticians have developed tables that give probabilities and cumulative probabilities for a binomial random variable
o These tables can be found in some stats textbooks
o With modern calculators and the capability of statistical software packages, such tables are almost unnecessary
Poisson probability distribution
o A Poisson distributed random variable is often useful in estimating the number of occurrences over a specified interval of time or space
o It is a discrete random variable that may assume an infinite sequence of values (x = 0, 1, 2, …)
o Examples of Poisson distributed random variables
    o The number of knotholes in 14 linear
feet of pine board
    o The number of vehicles arriving at a toll booth in one hour
o Bell Labs used the Poisson distribution to model the arrival of phone calls
o Two properties of a Poisson experiment
    1. The probability of an occurrence is the same for any two intervals of equal length
    2. The occurrence or nonoccurrence in any interval is independent of the occurrence or nonoccurrence in any other interval
o Poisson probability function: f(x) = μ^x e^(-μ) / x!
o Where:
    x = number of occurrences in an interval
    f(x) = the probability of x occurrences in an interval
    μ = mean number of occurrences in an interval
    e = 2.71828
    x! = x(x-1)(x-2)…(2)(1)
o Since there is no stated upper limit for the number of occurrences, the probability function f(x) is applicable for values x = 0, 1, 2, … without limit
o In practical applications, x will eventually become large enough so that f(x) is approximately zero and the probability of any larger value of x becomes negligible
o Ex. Mercy Hospital
    o Patients arrive at the emergency room of Mercy Hospital at the average rate of 6 per hour on weekend evenings. What’s the probability of 4 arrivals in 30 mins on a weekend evening?
    o μ = 6/hour = 3/half-hour and x = 4, so f(4) = 3^4 (2.71828)^(-3) / 4! = .1680
o A property of the Poisson distribution is that the mean and variance are equal: μ = σ^2
o Ex. Mercy Hospital
    o Variance for number of arrivals during 20-minute periods
Hypergeometric probability distribution
- Closely related to the binomial distribution
- However, for the hypergeometric distribution
    o the trials are not independent
    o the probability of success changes from trial to trial
- Ex. Neveready’s Batteries
    o Bob Neveready has removed two dead batteries from a flashlight and inadvertently mingled them with the two good batteries he intended as replacements
    o The four batteries look identical
    o Bob now randomly selects two of the four batteries
    o What is the probability he selects the two good batteries?
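The Mercy Hospital and Neveready questions above can be checked numerically. Below is a minimal Python sketch, assuming the standard Poisson function f(x) = μ^x e^(-μ)/x! and the standard hypergeometric function f(x) = C(r,x)C(N-r,n-x)/C(N,n), with N = 4 batteries, r = 2 good batteries, and n = 2 selected. The function names are illustrative and do not come from the slides.

```python
from math import comb, exp, factorial


def poisson_pmf(x, mu):
    """Probability of x occurrences when the mean number per interval is mu."""
    return mu**x * exp(-mu) / factorial(x)


def hypergeom_pmf(x, n, r, N):
    """Probability of x successes in n draws without replacement
    from N items, r of which are successes."""
    return comb(r, x) * comb(N - r, n - x) / comb(N, n)


# Mercy Hospital: 6 arrivals/hour means mu = 3 per 30 minutes; x = 4 arrivals.
p_four_arrivals = poisson_pmf(4, mu=3)

# Neveready: N = 4 batteries, r = 2 good, n = 2 selected, x = 2 good selected.
p_two_good = hypergeom_pmf(2, n=2, r=2, N=4)

print(f"P(4 arrivals in 30 min) = {p_four_arrivals:.4f}")  # 0.1680
print(f"P(2 good batteries)     = {p_two_good:.4f}")       # 0.1667
```

Only one of the six equally likely pairs of batteries consists of the two good ones, which is why the second result equals 1/6.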
- Consider a hypergeometric distribution with n trials and let p = r/N denote the probability of a success on the first trial
- If the population size is large, the term (N-n)/(N-1) approaches 1
- The expected value and variance can then be written E(x) = np and Var(x) = np(1-p)
- Note that these are the expressions for the expected value and variance of a binomial distribution
- When the population size is large, a hypergeometric distribution can be approximated by a binomial distribution with n trials and probability of success p = r/N
Ch. 6 Continuous Probability Distributions
Continuous probability distributions
- A continuous random variable can assume any value in an interval on the real line or in a collection of intervals
- It’s not possible to talk about the probability of the random variable assuming a particular value
- Instead, we talk about the probability of the random variable assuming a value within a given interval
- The probability of the random variable assuming a value within a given interval from x1 to x2 is defined to be the area under the graph of the probability density function between x1 and x2
Uniform probability distribution
- A random variable is uniformly distributed whenever the probability is proportional to the interval’s length
- The uniform probability density function is: f(x) = 1/(b-a) for a ≤ x ≤ b, and f(x) = 0 elsewhere
    o Where a = smallest value the variable can assume
    o b = largest value the variable can assume
- Expected value of x: E(x) = (a+b)/2
- Variance of x: Var(x) = (b-a)^2/12
- Ex.
Slater’s Buffet
    o Slater customers are charged for the amount of salad they take
    o Sampling suggests that the amount of salad taken is uniformly distributed between 5 and 15 ounces
    o With a = 5 and b = 15: f(x) = 1/10 for 5 ≤ x ≤ 15, E(x) = (5+15)/2 = 10, and Var(x) = (15-5)^2/12 = 8.33
Area as a measure of probability
- The area under the graph of f(x) and probability are identical
- This is valid for all continuous random variables
- The probability that x takes on a value between some lower value x1 and some higher value x2 can be found by computing the area under the graph of f(x) over the interval from x1 to x2
Normal probability distribution
- The most important distribution for describing a continuous random variable
- It is widely used in statistical inference
- It has been used in a wide variety of applications, including
    o Heights of people
    o Rainfall amounts
    o Test scores
    o Scientific measurements
- Abraham de Moivre, a French mathematician and author of The Doctrine of Chances, derived the normal distribution in 1733
- Characteristics
    o The distribution is symmetric; its skewness measure is zero
    o The entire family of normal probability distributions is defined by its mean and its standard deviation
    o The highest point on the normal curve is at the mean, which is also the median and mode
    o The mean can be any numerical value: negative, zero, or positive
    o The standard deviation determines the width of the curve: larger values result in wider, flatter curves
    o Probabilities for the normal random variable are given by areas under the curve
    o The total area under the curve is 1 (.5 to the left of the mean and .5 to the right)
    o 68.26% of values of a normal random variable are within +/-1 standard deviation of its mean
    o 95.44% of values of a normal random variable are within +/-2 standard deviations of its mean
    o 99.72% of values of a normal random variable are within +/-3 standard deviations of its mean
Standard normal probability distribution
- Characteristics
    o A random variable having a normal distribution with a mean of 0 and standard
deviation of 1 is said to have a standard normal probability distribution
    o The letter z is used to designate the standard normal random variable
    o Converting to the standard normal distribution: z = (x - μ)/σ
    o We can think of z as a measure of the number of standard deviations x is from μ
- Ex. Pep Zone
    o Pep Zone sells auto parts and supplies, including a popular multi-grade motor oil
    o When the stock of this oil drops to 20 gal, a replenishment order is placed
    o The store manager is concerned that sales are being lost due to stockouts while waiting for a replenishment order
    o It has been determined that demand during replenishment lead-time is normally distributed with a mean of 15 gal and a standard deviation of 6 gal
    o The manager would like to know the probability of a stockout during replenishment lead-time. In other words, what is the probability that demand during lead-time will exceed 20 gal? P(x > 20) = ?
- Solving for the stockout probability
    o Step 1: convert x to the standard normal distribution: z = (x - μ)/σ = (20 - 15)/6 = .83
    o Step 2: find the area under the standard normal curve to the left of z = .83: P(z ≤ .83) = .7967
    o Step 3: compute the area under the standard normal curve to the right of z = .83: P(z > .83) = 1 - .7967 = .2033, so P(x > 20) = .2033
- If the manager of Pep Zone wants the probability of a stockout during replenishment lead-time to be no more than .05, what should the reorder point be?
- Hint: given a probability, we can use the standard normal table in an inverse fashion to find the corresponding z value
- Solving for the reorder point
    o Step 1: find the z value that cuts off an area of .05 in the right tail of the standard normal distribution: z.05 = 1.645
    o Step 2: convert z.05 to the corresponding value of x: x = μ + z.05 σ = 15 + 1.645(6) = 24.87, or about 25
    o A reorder point of 25 gal will place the probability of a stockout during lead-time at slightly less than .05
    o By raising the reorder point from 20 to 25 gal on hand, the probability of a stockout decreases from about .2 to .05
    o This is a significant decrease in the chance that Pep Zone will be out of stock and unable to meet a customer’s desire to make a purchase
Normal Approximation of Binomial Probabilities
- When the number of trials n becomes large, evaluating the binomial probability function by hand or with a calculator is difficult
- The normal probability distribution provides an easy-to-use approximation of binomial probabilities when np ≥ 5 and n(1-p) ≥ 5, with μ = np and σ = sqrt(np(1-p))
- Add and subtract a continuity correction factor of .5 because a continuous distribution is being used to approximate a discrete distribution
- Ex.
    o Suppose that a company has a history of making errors in 10% of its invoices. A sample of 100 invoices has been taken, and we want to compute the probability that 12 invoices contain errors
    o Here μ = np = 100(.1) = 10 and σ = sqrt(100(.1)(.9)) = 3, and with the continuity correction P(x = 12) is approximated by P(11.5 ≤ x ≤ 12.5)
Exponential probability distribution
- Useful in describing the time it takes to complete a task
- Exponential random variables can be used to describe:
    o Time between vehicle arrivals at a toll booth
    o Time required to complete a questionnaire
    o Distance between major defects in a highway
- In waiting line applications, the exponential distribution is often used for service times
- A property of the exponential distribution is that the mean and the standard deviation are equal
- The exponential distribution is skewed to the right
- Cumulative probabilities: P(x ≤ x0) = 1 - e^(-x0/μ), where μ is the mean
- Ex.
Al’s Full Service Pump
    o The time between arrivals of cars at Al’s follows an exponential probability distribution with a mean time between arrivals of 3 mins
    o Al would like to know the probability that the time between two successive arrivals will be 2 mins or less
    o P(x ≤ 2) = 1 - e^(-2/3) = 1 - .5134 = .4866
Relationship between the Poisson and Exponential Distributions
- The Poisson distribution provides an appropriate description of the number of occurrences per interval
- The exponential distribution provides an appropriate description of the length of the interval between occurrences
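The Al’s example and the Poisson/exponential relationship above can be sketched in Python. This is a minimal illustration, assuming the exponential cumulative probability P(x ≤ x0) = 1 - e^(-x0/μ) with μ = 3 minutes; the function and variable names are illustrative. A mean of 3 minutes between arrivals corresponds to a Poisson mean of 2/3 arrivals per 2-minute interval, so the Poisson probability of no arrival in 2 minutes should equal P(x > 2).

```python
from math import exp, factorial


def expon_cdf(x0, mean):
    """P(x <= x0) for an exponential distribution with the given mean."""
    return 1 - exp(-x0 / mean)


def poisson_pmf(x, mu):
    """Probability of x occurrences when the mean number per interval is mu."""
    return mu**x * exp(-mu) / factorial(x)


mean_gap = 3.0  # mean minutes between arrivals at Al's

# Probability the next car arrives within 2 minutes.
p_within_2 = expon_cdf(2, mean_gap)

# Poisson view of the same process: mean arrivals in 2 minutes is 2/3,
# and "no arrivals in 2 minutes" is the complement of the event above.
p_no_arrival = poisson_pmf(0, mu=2 / mean_gap)

print(f"P(gap <= 2 min)          = {p_within_2:.4f}")    # 0.4866
print(f"P(no arrivals in 2 min)  = {p_no_arrival:.4f}")  # 0.5134
```

The two printed values sum to 1, which is the link between the two distributions: counting occurrences per interval (Poisson) and measuring the interval between occurrences (exponential) describe the same arrival process.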