Data Science for Managerial Decisions Course Syllabus

Data Science for Managerial Decisions

Objective of the Course
This course aims to provide the basic foundations needed by data scientists. It introduces the fundamental concepts and covers the mathematical and statistical essentials required for understanding and implementing predictive and prescriptive models for solving business problems. It provides hands-on training to students through an Excel-based approach.

Textbooks for the Course
• Anderson, D. R., Sweeney, D. J., Williams, T. A., Camm, J. D., and Cochran, J. J. (2018). Statistics for Business & Economics. Cengage Learning.
• Hastie, T., Tibshirani, R., and Friedman, J. H. (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction (Vol. 2, pp. 1-758). New York: Springer.
• Provost, F., and Fawcett, T. (2013). Data Science for Business: What You Need to Know about Data Mining and Data-Analytic Thinking. O'Reilly Media, Inc.

Grading & Evaluation
Mid-Term Exam                                                   20%
End-Term Exam                                                   40%
Project, Assignment, Attendance, Quiz, Class participation      40%
Total                                                          100%

Course Outline: Topic(s) to be covered
1. Introduction
-> Applications of quantitative techniques in business
-> Data – Categorical & Quantitative
-> Scales of measurement
2. Descriptive Statistics – Tabular & Graphical Display
-> Summarizing data for qualitative variables
-> Summarizing data for quantitative variables
-> Summarizing data for two variables
3. Descriptive Statistics – Numerical Measures
-> Measures of location – mean, weighted mean, median, geometric mean, mode, percentiles, quartiles
-> Measures of variability – range, interquartile range, variance, standard deviation, coefficient of variation
-> Measures of distribution shape, relative location, and detecting outliers
-> Five-number summaries & box plot
-> Measures of association between two variables
4.
Introduction to Probability
-> Experiments, counting rules, assigning probabilities
-> Events and their probabilities
-> Basic relationships of probabilities
-> Conditional probability
-> Bayes’ theorem
5. Discrete and Continuous Probability Distributions
-> Random variables
-> Expected value and variance
-> Binomial probability distribution
-> Poisson probability distribution
-> Hypergeometric probability distribution
-> Uniform probability distribution
-> Normal probability distribution
-> Exponential probability distribution
6. Sampling and Sampling Distributions
-> Selecting a sample
-> Point estimation
-> Sampling distribution of x̄
-> Properties of point estimators
7. Interval Estimation
-> Margin of error and interval estimate
-> Population mean: σ known
-> Population mean: σ unknown
-> Determining sample size
8. Hypothesis Tests
-> Developing null & alternative hypotheses
-> Type I and Type II errors
-> One-tailed test
-> Two-tailed test
9. Inference
-> Inferences about the difference between two population means: σ1 and σ2 known
-> Inferences about the difference between two population means: σ1 and σ2 unknown
-> Inferences about the difference between two population means: matched samples

Topics for Chapter 1: Statistics
• Applications in Business and Economics
• Data
• Data Sources
• Descriptive Statistics
• Statistical Inference
• Analytics
• Big Data and Data Mining
• Computers and Statistical Analysis
• Ethical Guidelines for Statistical Practice

The Dominance of Data
• We live in a world that’s drowning in data.
• Websites track every user’s every click.
• Your smartphone is building up a record of your location and speed every second of every day.
• “Quantified selfers” wear pedometers-on-steroids that are ever recording their heart rates, movement habits, diet, and sleep patterns.
• Smart cars collect driving habits.
• Smart homes collect living habits.
• Smart marketers collect purchasing habits.
• The Internet itself represents a huge graph of knowledge that contains (among other things) an enormous cross-referenced encyclopedia; domain-specific databases about movies, music, sports results, pinball machines, memes, and too many government statistics.

Data Science
Data science is a blend of skills in three major areas:
• Business acumen
• Mathematics expertise
• Technology: hacking skills
Data science is an interdisciplinary field that combines techniques, tools, and methodologies from statistics, mathematics, computer science, and domain knowledge to extract insights and knowledge from large volumes of structured and unstructured data.

Applications of Data Science in Business
• Fraud detection and risk management
• Sales forecasting and demand prediction
• Recommender systems
• Supply chain optimization

Applications of Data Science in Healthcare
• Disease prediction and early diagnosis
• Personalized medicine and treatment optimization
• Health monitoring and wearable devices
• Drug discovery and clinical trials
• Public health analytics and outbreak prediction

Applications of Data Science in Finance
• Credit scoring and risk assessment
• Algorithmic trading and financial forecasting
• Fraud detection and anti-money laundering
• Customer sentiment analysis for investment decisions
• Portfolio optimization and wealth management

What is Data?
Data are the facts and figures collected, analyzed, and summarized for presentation and interpretation. All the data collected in a particular study are referred to as the data set for the study.

What is Data?
Every day, we come into contact with an enormous amount of data in the form of facts, numerical figures, tables, and graphs through newspapers, television, magazines, and other forms of communication. These could be cricket batting or bowling averages, company revenues, city temperatures, five-year plan expenditures, polling results, and more. Data are numerical or nonnumerical facts or figures collected for a purpose.

Applications of Data Science: Netflix
Personalized video ranking and "Trending Now" ranker
• Over 208 million paid subscribers worldwide.
• Streaming supported on thousands of smart devices.
• Around 3 billion hours watched every month.
i) Personalized Recommendation System
• Data: viewing time, platform searches for keywords, and content that is paused, rewound, or rewatched.
• Predicts what a viewer is likely to watch and gives the user a personalized watchlist.
• Advanced use of data analytics and recommendation systems.
ii) Content Analytics
• Data collected: over 100 billion events every day.
• Content development using data: come up with content that viewers want to watch even before they know they want to watch it.

Data Types
• Qualitative
-> Nominal
-> Ordinal
• Quantitative
-> Discrete
-> Continuous
-> Interval
-> Ratio

Scale of Measurement
• Categorical data
-> Numeric: nominal, ordinal
-> Nonnumeric: nominal, ordinal
• Quantitative data
-> Numeric: interval, ratio

Nominal Scale
Used to categorize data into mutually exclusive categories or groups.
Examples:
• Gender
• Marital status
• Religion
• Race
• Hair color
• Country
• Zip code
• Student ID

Ordinal Scale
Depicts only the order of the values, not the size of the difference between them.

Interval Scale
• Negative values are possible because the zero point does not indicate the absence of value.
• Shows not only the order and direction of the values but also the exact differences between them.
• On the interval scale, the distance between adjacent values is identical.
• There is no true zero point.
• Cannot calculate ratios.
• Can perform addition and subtraction, but not multiplication or division.

Examples of Interval Scale
• Time of day on a 12-hour clock
• Temperature in degrees Fahrenheit or Celsius (not Kelvin)
• IQ test
• SAT and ACT scores
• Age
• Income range
• Year
• Voltage
• Grade levels in school

Ratio Scale
• A quantitative scale with a true zero and equal distances between adjacent points.
• An expansion of the interval measurement level.
• Both the differences between values and the ratios of values are meaningful.
• Temperature ratios can only be calculated using the Kelvin scale. Although 40 degrees is numerically double 20 degrees, it is not twice as hot on the Celsius or Fahrenheit scales.
• However, on the Kelvin scale, 40 K is twice as hot as 20 K because this scale's starting point is a true zero.

Ratio Scale of Measurement
Suppose the temperature outside is 0 degrees Celsius. Zero degrees does not mean that there is no temperature; it is simply a value on the scale. Temperature expressed in F or C is therefore not a ratio variable: a temperature of 0.0 on either of those scales does not mean 'no heat'. Temperature in Kelvin, however, is a ratio variable, as 0.0 Kelvin really does mean 'no heat'.

Examples of Ratio Scale
Characteristics of Ratio Scale
Source: https://www.questionpro.com/blog/ratio-scale/
https://www.cuemath.com/measurement/scales-ofmeasurement/

Comparison of Four Scales of Measurement

Data, Data Sets, Elements, Variables, and Observations
• Elements are the entities on which data are collected.
• A variable is a characteristic of interest for the elements.
• An observation is the set of measurements obtained for a particular element.
• A data set with n elements contains n observations.
• The total number of data values in a complete data set is the number of elements multiplied by the number of variables.

Question: Identify the type of scale of measurement.

Practice Problem
Which of the following variables are qualitative and which are quantitative?
If the variable is quantitative, then specify whether the variable is discrete or continuous.
a. Points scored in a football game.
b. Racial composition of a high school classroom.
c. Heights of 15-year-olds.

Data Types
1. Cross-sectional data
Refers to data collected by recording a characteristic of many subjects at the same point in time, or without regard to differences in time.

Place   Max temperature   Humidity   Wind   Date
P1      T1                H1         W1     30-August-2021
P2      T2                H2         W2     30-August-2021
P3      T3                H3         W3     30-August-2021

2. Time series data
Refers to data collected over several time periods focusing on certain groups of people, specific events, or objects. Time series data can include hourly, daily, weekly, monthly, quarterly, or annual observations.

Place   Max temperature   Humidity   Wind   Date
P1      T1                H1         W1     11-March-2020
P2      T2                H2         W2     11-March-2021
P3      T3                H3         W3     11-March-2022

3. Panel data (longitudinal data)
A combination of cross-sectional and time series data.

Place   Max temperature   Humidity   Wind   Date
P1      T1                H1         W1     1-March-2020
P2      T2                H2         W2     12-April-2021
P3      T3                H3         W3     09-Jan-2022

4. Structured data
• Resides in a predefined row-column format.
• Entered, stored, queried, and analyzed with spreadsheet or database applications.
• Numerical information that is objective and not open to interpretation.

5. Unstructured data
• Does not conform to a predefined row-column format, the model required by most database systems.
• Textual and multimedia content.
• Example: social media data such as Twitter, YouTube, Facebook, and blogs.

6.
Big Data
Characteristics:
(a) Volume: immense amount of data compiled from single or multiple sources
(b) Velocity: generated at rapid speed; management is a critical issue
(c) Variety: all types and forms, structured or unstructured
(d) Veracity: credibility, quality, and reliability of data
(e) Value: methodological plan for formulating questions, curating the right data, and unlocking hidden potential

Types of Data
Classification of data based on its presentation:
• Ungrouped data
• Grouped data

In statistics, raw data refers to data that has been collected directly from a primary source and has not been processed in any way. For example: data obtained from Government reports.
Ungrouped data is data given as individual points (i.e., values or numbers) such as 15, 63, 34, 20, 25, and so on.
Grouped data is data given in the form of class intervals such as 0-20, 20-40, and so on.

Grouped vs. Ungrouped Data
• For example, let us say there are 30 men in a colony, whose ages are as follows:
• 55, 35, 29, 35, 24, 77, 65, 45, 26, 29, 35, 66, 57, 59, 33, 31, 64, 28, 63, 55, 25, 69, 46, 38, 48, 61, 37, 55, 24, 64
• Data available in such a form is called raw data. Each entry (55, 35, 29, and so forth) is a value or observation.
• To analyze this data, we first arrange it in increasing order:
• 24, 24, 25, 26, 28, 29, 29, 31, 33, 35, 35, 35, 37, 38, 45, 46, 48, 55, 55, 55, 57, 59, 61, 63, 64, 64, 65, 66, 69, 77

Frequency Distribution Table for Ungrouped Data

Age   Frequency
24    2
25    1
26    1
28    1
29    2
31    1
33    1
35    3
37    1
38    1
45    1
46    1
48    1
55    3
57    1
59    1
61    1
63    1
64    2
65    1
66    1
69    1
77    1
Total 30

If we create a frequency distribution table for each and every observation, then it will form a large table.
So, for easy understanding, we can make a table with groups of observations, say 0 to 10, 10 to 20, etc.

Grouped vs. Ungrouped Data
Consider the marks obtained in an examination by 50 students of an MBA class in Supply Chain Management. The maximum mark for the exam is 50.
23, 8, 13, 18, 32, 44, 19, 8, 25, 27, 10, 30, 22, 40, 39, 17, 25, 9, 15, 20, 30, 24, 29, 19, 16, 33, 38, 46, 43, 22, 37, 27, 17, 11, 34, 41, 35, 45, 31, 26, 42, 18, 28, 30, 22, 20, 33, 39, 40, 32

Frequency Distribution Table for Grouped Data

Groups (marks)   Frequency (No. of Students)
0-10             3
10-20            11
20-30            14
30-40            14
40-50            8
Total            50

Each class includes its lower limit and excludes its upper limit, so a mark of 20 is counted in the 20-30 class.

Data Preparation
Three steps:
• Counting and sorting
• Handling missing values
• Subsetting
Subsetting is the process of extracting a portion of the data set that is relevant for subsequent statistical analysis.

Data Presentation: Tabular and Graphical Methods
A categorical variable consists of observations that represent labels or names. Summarize the data with a frequency distribution.
Methods to visualize a categorical variable:
A. Bar chart
B. Pie chart: a segmented circle whose segments portray the relative frequencies of the categories of a qualitative variable.

Data Presentation: Tabular and Graphical Methods
Understanding the relationship between two categorical variables
• Use a contingency table to examine the relationship between two categorical variables.
• Use a stacked column chart to visualize more than one categorical variable.

Relationship between two categorical variables

         Personality
         Analyst   Diplomat   Explorer   Sentinel
Female   55        164        194        79
Male     61        160        210        77

Data Presentation: Tabular and Graphical Methods
Use a frequency distribution to summarize a numerical variable. Instead of categories, a series of intervals (classes) must be defined.
Methods to visualize a numeric variable:
Histogram: A histogram is the counterpart to the vertical bar chart used for a categorical variable.
No gaps between bars/intervals.

Data Presentation: Tabular and Graphical Methods
Polygon
• Midpoint of each interval/class on the x-axis
• Frequency or relative frequency on the y-axis
Line chart
Stem-and-leaf diagram
An ogive depicts a cumulative frequency or cumulative relative frequency.
Use a scatterplot to examine the relationship between two numerical variables.

Data Presentation: Tabular and Graphical Methods
Stem-and-Leaf Diagram
Source: https://www.mathsisfun.com/definitions/stem-andleaf-plot.html
Application: Japanese train and bus timetables
Source: https://byjus.com/maths/frequencypolygons/

Tabular and Graphical Procedures
Qualitative Data
-> Tabular methods: frequency distribution, relative frequency distribution, percent frequency distribution, cross-tabulation
-> Graphical methods: bar graph, pie chart
Quantitative Data
-> Tabular methods: frequency distribution, relative frequency distribution, cumulative frequency distribution, cumulative relative frequency distribution, cross-tabulation
-> Graphical methods: histogram, frequency curve, box plot, scatter diagram, stem-and-leaf display

Descriptive Statistics: Tabular and Graphical Displays
1. Summarizing Data for a Categorical Variable
2. Summarizing Data for a Quantitative Variable
• Categorical data use labels or names to identify categories of like items.
• Quantitative data are numerical values that indicate how much or how many.

Summarizing Categorical Data
• Frequency Distribution
• Relative Frequency Distribution
• Percent Frequency Distribution
• Bar Chart
• Pie Chart

Frequency Distribution
• A frequency distribution is a tabular summary of data showing the number (frequency) of observations in each of several nonoverlapping classes.
• The goal is to reveal insights about the data that cannot be quickly obtained by looking only at the raw data.

Frequency Distribution
• Guests at Marada Inn were asked to rate the quality of their accommodations as excellent, above average, average, below average, or poor.
• The ratings given by a sample of 20 guests are as follows:

Below Average, Average, Above Average, Above Average, Above Average,
Above Average, Above Average, Below Average, Below Average, Average,
Poor, Poor, Above Average, Excellent, Above Average,
Average, Above Average, Average, Above Average, Average

Frequency Distribution: Marada Inn

Rating          Frequency
Poor            2
Below Average   3
Average         5
Above Average   9
Excellent       1
Total           20

Relative Frequency Distribution
• The relative frequency of a class is the fraction or proportion of the total number of data items belonging to that class.
\[ \text{Relative frequency of a class} = \frac{\text{Frequency of the class}}{n} \]
• A relative frequency distribution is a table that summarizes a data set by showing the relative frequency of each class.

Percent Frequency Distribution
• The percent frequency of a class is the relative frequency multiplied by 100.
• A percent frequency distribution is a table that summarizes a data set by showing the percent frequency of each class.

Relative Frequency and Percent Frequency Distributions: Marada Inn

Rating          Relative Frequency   Percent Frequency
Poor            .10                  10
Below Average   .15                  15
Average         .25                  25
Above Average   .45                  45
Excellent       .05                  5
Total           1.00                 100

Bar Chart
• A bar chart is a graphical display for depicting qualitative data.
• The labels for each class are specified on one axis, typically the horizontal axis.
• The other axis, typically the vertical axis, can use a frequency, relative frequency, or percent frequency scale.
• Using a bar of fixed width drawn above each class label, we extend the height of the bar to the frequency of the class. The bars are separated to emphasize that each class is a distinct category.

[Figure: Bar chart of Marada Inn quality ratings. Horizontal axis: Quality Rating (Poor, Below Average, Average, Above Average, Excellent); vertical axis: Frequency.]

Pareto Diagram
• In quality control, bar charts are used to identify the most important causes of problems.
• A Pareto diagram is a bar chart in which the bars are arranged in descending order of height from left to right, with the most frequent cause appearing first.
• The diagram is named for Italian economist Vilfredo Pareto.

Pie Chart
A popular way to show relative frequency and percent frequency distributions for categorical data is with a pie chart. First, draw a circle. Then, use the relative frequencies to divide the circle into sectors that correspond to the relative frequency of each class. Since a circle has 360 degrees, a class with a relative frequency of .25 takes up .25 × 360 = 90 degrees of the circle.

[Figure: Pie chart of Marada Inn quality ratings. Excellent 5%, Above Average 45%, Poor 10%, Below Average 15%, Average 25%.]

Half of the customers surveyed rated the quality of Marada Inn as "above average" or "excellent" (look at the left side of the pie). This might make the manager happy. If you look at the top of the pie, you can see that for every customer who gave an "excellent" rating, there were two who gave a "poor" rating. This should make the manager unhappy.

Summarizing Quantitative Data
• Frequency Distribution
• Relative Frequency and Percent Frequency Distributions
• Dot Plot
• Histogram
• Cumulative Distributions
• Stem-and-Leaf Display

Frequency Distribution
The manager of Hudson Auto wants to know more about the cost of the parts used in the engine tune-ups performed in the shop. She examines 50 customer invoices for tune-ups. The costs of parts, rounded to the nearest dollar, are shown below:

Frequency Distribution
The three steps needed to determine the classes for a frequency distribution with quantitative data are:
1. Determine the number of nonoverlapping classes.
2. Determine the width of each class.
3. Determine the class limits.

Frequency Distribution: Guidelines for Determining the Number of Classes
Use between 5 and 20 classes. Data sets with a larger number of elements usually require a larger number of classes. Smaller data sets usually require fewer classes.
The goal is to use enough classes to show the variation in the data, but not so many classes that some contain only a few data items.

Frequency Distribution: Guidelines for Determining the Width of Each Class
• Use classes of equal width.
• Approximate class width:
\[ \text{Approximate class width} = \frac{\text{Largest data value} - \text{Smallest data value}}{\text{Number of classes}} \]
• Making the classes the same width reduces the chance of inappropriate interpretations.

Note on Number of Classes and Class Width
In practice, the number of classes and the appropriate class width are determined by trial and error. Once a possible number of classes is chosen, the appropriate class width is found. The process can be repeated for a different number of classes. Ultimately, the analyst uses judgment to determine the combination of the number of classes and class width that provides the best frequency distribution for summarizing the data.

Guidelines for Determining the Class Limits
Class limits must be chosen so that each data item belongs to one and only one class. The lower class limit is the smallest possible data value assigned to the class. The upper class limit is the largest possible data value assigned to the class. The appropriate values for the class limits depend on the level of accuracy of the data. A class with no upper limit or no lower limit is called an open-end class.

Hudson Auto Repair
• If we choose six classes:
• Approximate class width = (109 − 52)/6 = 9.5, which is rounded up to 10.

Parts Cost ($)   Frequency
50-59            2
60-69            13
70-79            16
80-89            7
90-99            7
100-109          5
Total            50

Relative Frequency and Percent Frequency Distributions: Hudson Auto Repair

Parts Cost ($)   Relative Frequency   Percent Frequency
50-59            .04                  4
60-69            .26                  26
70-79            .32                  32
80-89            .14                  14
90-99            .14                  14
100-109          .10                  10
Total            1.00                 100

Insights Gained from the Percent Frequency Distribution: Hudson Auto Repair
• Only 4% of the parts costs are in the $50-59 class.
• 30% of the parts costs are under $70.
• The greatest percentage (32%, or almost one-third) of the parts costs are in the $70-79 class.
• 10% of the parts costs are $100 or more.

Dot Plot
• One of the simplest graphical summaries of data is a dot plot.
• A horizontal axis shows the range of data values.
• Each data value is then represented by a dot placed above the axis.

Histogram
A histogram is a popular way to show quantitative data on a graph. The variable of interest is placed on the horizontal axis. Above each class interval, a rectangle is drawn with a height that corresponds to the frequency, relative frequency, or percent frequency of the interval. Unlike a bar graph, a histogram has no natural separation between the rectangles of adjacent classes.

Histogram: Example: Hudson Auto Repair

Histograms: Symmetric
• The left tail is the mirror image of the right tail.
• Example: heights of people.

Histograms Showing Skewness: Moderately Skewed Left
• A longer tail to the left.
• Example: exam scores.

Histograms Showing Skewness: Moderately Skewed Right
• A longer tail to the right.
• Example: housing values.

Histograms Showing Skewness: Highly Skewed Right
• A very long tail to the right.
• Example: executive salaries.

Cumulative Distributions
• The cumulative frequency distribution shows the number of items with values less than or equal to the upper limit of each class.
• The cumulative relative frequency distribution shows the proportion of items with values less than or equal to the upper limit of each class.
• The cumulative percent frequency distribution shows the percentage of items with values less than or equal to the upper limit of each class.

Cumulative Distributions
• The last entry in a cumulative frequency distribution always equals the total number of observations.
• The last entry in a cumulative relative frequency distribution always equals 1.00.
• The last entry in a cumulative percent frequency distribution always equals 100.

Cumulative Distributions: Hudson Auto Repair

Parts Cost ($)   Cumulative Frequency   Cumulative Relative Frequency   Cumulative Percent Frequency
≤ 59             2                      .04                             4
≤ 69             15                     .30                             30
≤ 79             31                     .62                             62
≤ 89             38                     .76                             76
≤ 99             45                     .90                             90
≤ 109            50                     1.00                            100

Stem-and-Leaf Display
• A stem-and-leaf display shows both the rank order and the shape of the data distribution.
• It is similar to a histogram turned on its side, but it shows the actual data values.
• The first digits of each data item are arranged to the left of a vertical line.
• To the right of the vertical line, we record the last digit of each item in rank order.
• Each line (row) in the display is called a stem. Each digit on a stem is a leaf.

Example: Hudson Auto Repair
The manager of Hudson Auto would like to gain a better understanding of the cost of parts used in the shop's engine tune-ups. She examines 50 customer invoices for tune-ups. The following data show the costs of parts, rounded to the nearest dollar.

Stem-and-Leaf Display
Example: Hudson Auto Repair

Stretched Stem-and-Leaf Display
• If we believe the original stem-and-leaf display has condensed the data too much, we can stretch the display vertically by using two stems for each leading digit(s).
• Whenever a stem value is stated twice, the first value corresponds to leaf values of 0-4, and the second value corresponds to leaf values of 5-9.

Stem-and-Leaf Display
Example: Hudson Auto Repair

Stem-and-Leaf Display: Leaf Units
• Each leaf is represented by a single digit.
• The leaf unit in the preceding example was 1.
• Leaf units may be 100, 10, 1, 0.1, and so on.
• Where the leaf unit is not shown, it is assumed to equal 1.
• The leaf unit indicates how to multiply the stem-and-leaf numbers in order to approximate the original data.

Stem-and-Leaf Display
• Example: Leaf Unit = 0.1
If we have data with values such as
8.6  11.7  9.4  9.1  10.2  11.0  8.8

Leaf Unit = 0.1
 8 | 6 8
 9 | 1 4
10 | 2
11 | 0 7

Stem-and-Leaf Display
• Example: Leaf Unit = 10
If we have data with values such as
1806  1717  1974  1791  1682  1910  1838

Leaf Unit = 10
16 | 8
17 | 1 9
18 | 0 3
19 | 1 7

The 82 in 1682 is rounded down to 80 and is represented as an 8.

Methods of Organizing, Exploring, and Summarizing Data
• Visual (charts and graphs): provides insight into characteristics of a data set without using mathematics.
• Numerical (statistics or tables): provides insight into characteristics of a data set using mathematics.
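Returning to the stem-and-leaf displays with different leaf units above, the following Python sketch shows one way the leaf-unit rule could be applied programmatically. It is only an illustration: the course itself uses Excel, the function names `stem_and_leaf` and `show` are made up here, the data are assumed to be non-negative, and digits smaller than the leaf unit are dropped (rounded down), as in the 1682 example. `Decimal` is used so that a value such as 8.6 divides cleanly by a leaf unit of 0.1.

```python
from collections import defaultdict
from decimal import Decimal


def stem_and_leaf(values, leaf_unit):
    """Return {stem: sorted leaves} for non-negative data (illustrative helper)."""
    unit = Decimal(str(leaf_unit))
    rows = defaultdict(list)
    for value in values:
        # Express the value in leaf units, dropping any smaller digits
        # (for example, 1682 with leaf unit 10 becomes 168).
        scaled = int(Decimal(str(value)) // unit)
        stem, leaf = divmod(scaled, 10)  # the last digit is the leaf
        rows[stem].append(leaf)
    return {stem: sorted(leaves) for stem, leaves in sorted(rows.items())}


def show(values, leaf_unit):
    """Print the display with stems left of the bar and leaves to the right."""
    print(f"Leaf Unit = {leaf_unit}")
    for stem, leaves in stem_and_leaf(values, leaf_unit).items():
        print(f"{stem:>3} | {' '.join(str(leaf) for leaf in leaves)}")


# The two data sets from the leaf-unit examples above.
tenths = [8.6, 11.7, 9.4, 9.1, 10.2, 11.0, 8.8]
tens = [1806, 1717, 1974, 1791, 1682, 1910, 1838]

show(tenths, 0.1)
show(tens, 10)
```

Multiplying a stem-and-leaf number by the leaf unit approximates the original value: stem 16 with leaf 8 and leaf unit 10 gives 168 × 10 = 1680 for the original 1682.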
Descriptive Statistics: Tabular and Graphical Displays
• Summarizing Data for Two Variables Using Tables
• Summarizing Data for Two Variables Using Graphical Displays
• Data Visualization: Best Practices in Creating Effective Graphical Displays

Summarizing Data for Two Variables Using Tables
• Thus far we have focused on methods that are used to summarize the data for one variable at a time.
• Often a manager is interested in tabular and graphical methods that will help understand the relationship between two variables.
• Cross-tabulation is a method for summarizing the data for two variables.

Cross-tabulation
• A cross-tabulation is a tabular summary of data for two variables.
• Cross-tabulation can be used when:
-> one variable is categorical and the other is quantitative,
-> both variables are categorical, or
-> both variables are quantitative.
• The left and top margin labels define the classes for the two variables.

Cross-tabulation Example: Finger Lakes Homes
The number of Finger Lakes homes sold for each style and price for the past two years is shown below.

               Home Style
Price Range    Colonial   Log   Split   A-Frame   Total
< $250,000     18         6     19      12        55
≥ $250,000     12         14    16      3         45
Total          30         20    35      15        100

Cross-tabulation Example: Finger Lakes Homes
Insights gained from the preceding cross-tabulation:
• The greatest number of homes (19) in the sample are a split-level style and priced at less than $250,000.
• Only three homes in the sample are an A-Frame style and priced at $250,000 or more.

Cross-tabulation: Simpson’s Paradox
Data in two or more cross-tabulations are often aggregated to produce a summary cross-tabulation. We must be careful in drawing conclusions about the relationship between the two variables in the aggregated cross-tabulation. In some cases, the conclusions based upon an aggregated cross-tabulation can be completely reversed if we look at the unaggregated data. The reversal of conclusions based on aggregated and unaggregated data is called Simpson’s paradox.
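The reversal described above is easier to see with numbers. The Python sketch below uses hypothetical counts, invented purely for illustration and not taken from the course material: two salespeople, A and B, each with (successes, attempts) in two regions. The names `counts`, `rate`, and `aggregate` are likewise assumptions made for this example.

```python
# Hypothetical counts, invented only to show the reversal:
# (successes, attempts) for two salespeople in two regions.
counts = {
    "A": {"Region 1": (18, 20), "Region 2": (40, 80)},
    "B": {"Region 1": (64, 80), "Region 2": (8, 20)},
}


def rate(successes, attempts):
    """Success rate for one cell of a cross-tabulation."""
    return successes / attempts


def aggregate(person):
    """Combine the two regional cross-tabulations into one summary rate."""
    successes = sum(s for s, _ in counts[person].values())
    attempts = sum(n for _, n in counts[person].values())
    return rate(successes, attempts)


# Unaggregated view: compare A and B within each region.
for region in ("Region 1", "Region 2"):
    a = rate(*counts["A"][region])
    b = rate(*counts["B"][region])
    print(f"{region}: A = {a:.2f}, B = {b:.2f}")

# Aggregated view: compare A and B after pooling the regions.
print(f"Aggregated: A = {aggregate('A'):.2f}, B = {aggregate('B'):.2f}")
```

With these made-up counts, A has the higher rate within each region (0.90 vs. 0.80 and 0.50 vs. 0.40), yet B has the higher aggregated rate (0.72 vs. 0.58), because most of A's attempts fall in the region where success is harder.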
Summarizing Data for Two Variables Using Graphical Displays
• In most cases, a graphical display is more useful than a table for recognizing patterns and trends.
• Displaying data in creative ways can lead to powerful insights.
• Scatter diagrams and trendlines are useful in exploring the relationship between two variables.

Scatter Diagram and Trendline
• A scatter diagram is a graphical presentation of the relationship between two quantitative variables.
• One variable is shown on the horizontal axis and the other on the vertical axis.
• The general pattern of the plotted points suggests the overall relationship between the variables.
• A trendline provides an approximation of the relationship.

[Figure: Three scatter diagrams showing a positive relationship, a negative relationship, and no relationship.]

Scatter Diagram
• Example: Panthers Football Team
The Panthers football team is interested in investigating the relationship, if any, between interceptions made and points scored.

x = Number of Interceptions   y = Number of Points Scored
1                             14
3                             24
2                             18
1                             17
3                             30

[Figure: Scatter diagram and trendline. Horizontal axis (x): Number of Interceptions; vertical axis (y): Number of Points Scored.]

Example: Panthers Football Team
• Insights gained from the preceding scatter diagram:
• The scatter diagram indicates a positive relationship between the number of interceptions and the number of points scored.
• Higher points scored are associated with a higher number of interceptions.
• The relationship is not perfect; not all plotted points in the scatter diagram lie on a straight line.

Side-by-Side Bar Chart
• A side-by-side bar chart is a graphical display for depicting multiple bar charts on the same display.
• Each cluster of bars represents one value of the first variable.
• Each bar within a cluster represents one value of the second variable.
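Returning to the Panthers scatter diagram above, the sketch below shows one way a trendline could be computed. It assumes the trendline is an ordinary least-squares line, which is how a linear trendline is fitted in spreadsheet tools; the slides do not state how their trendline was obtained, and the function name `least_squares_line` is chosen here for illustration.

```python
def least_squares_line(x, y):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Sum of cross-deviations and sum of squared deviations of x.
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept


# Panthers data from the table above.
interceptions = [1, 3, 2, 1, 3]
points = [14, 24, 18, 17, 30]

slope, intercept = least_squares_line(interceptions, points)
print(f"points scored = {intercept:.2f} + {slope:.2f} * interceptions")
```

For these five games the fitted slope is positive (5.75 points per interception, with an intercept of 9.10), which agrees with the positive relationship read off the scatter diagram.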
[Figure: Side-by-side bar chart, Finger Lakes Homes. Horizontal axis: Home Style (Colonial, Log, Split-Level, A-Frame); vertical axis: Frequency; one bar per price range (< $250,000 and ≥ $250,000) within each home style.]

Stacked Bar Chart
• Another method for comparing and displaying two variables simultaneously is a stacked bar chart.
• It is a bar chart in which each bar is broken into rectangular segments of a different color.
• When percent frequencies are shown, all bars have the same height (or length) and reach 100%.

[Figure: Stacked bar chart, Finger Lakes Homes. Horizontal axis: Home Style (Colonial, Log, Split, A-Frame); vertical axis: Frequency; segments for < $250,000 and ≥ $250,000.]

[Figure: Stacked bar chart, Finger Lakes Homes. Horizontal axis: Home Style (Colonial, Log, Split, A-Frame); vertical axis: Percent Frequency (0 to 100); segments for < $250,000 and ≥ $250,000.]

Data Visualization: Best Practices in Creating Effective Graphical Displays
• Data visualization is the process of presenting and summarizing information about a data set using graphical displays.
• The objective is to communicate the most important information about the data as efficiently and clearly as possible.

Creating Effective Graphical Displays
• Effective graphical display design requires both science and art.
• Here are some suggestions:
• Give the display a clear and concise title.
• Keep the display simple.
• Clearly label each axis and include the units of measurement.
• If colors are used, make sure they are distinct.

Choosing the Type of Graphical Display
Displays used to show the distribution of data:
• Bar chart: shows the frequency distribution or relative frequency distribution for categorical data.
• Pie chart: shows the relative frequency or percent frequency for categorical data.
• Dot plot: shows the distribution of quantitative data over the entire range of the data.
• Histogram: shows the frequency distribution for quantitative data over a set of class intervals.

Choosing the Type of Graphical Display
A.
Displays used to make comparisons:
• Side-by-Side Bar Chart to compare two variables
• Stacked Bar Chart to compare the relative frequency or percent frequency of two categorical variables
B. Displays used to show relationships:
• Scatter Diagram to show the relationship between two quantitative variables
• Trendline to approximate the relationship of data in a scatter diagram

Data Dashboards
• A data dashboard is a widely used data visualization tool.
• It organizes and presents key performance indicators (KPIs) used to monitor an organization or process.
• It provides timely, summary information that is easy to read, understand, and interpret.
• Some additional guidelines include:
• Minimize the need for screen scrolling.
• Avoid unnecessary use of color or 3D.
• Use borders between charts to improve readability.

Tabular and Graphical Displays
Categorical Data
  Tabular Displays
  • Frequency Distribution
  • Relative Frequency Distribution
  • Percent Frequency Distribution
  • Cross-tabulation
  Graphical Displays
  • Bar Chart
  • Pie Chart
  • Side-by-Side Bar Chart
  • Stacked Bar Chart
Quantitative Data
  Tabular Displays
  • Frequency Distribution
  • Relative Frequency Distribution
  • Percent Frequency Distribution
  • Cumulative Frequency Distribution
  • Cumulative Relative Frequency Distribution
  • Cumulative Percent Frequency Distribution
  • Cross-tabulation
  Graphical Displays
  • Dot Plot
  • Histogram
  • Stem-and-Leaf Display
  • Scatter Diagram

Practice Problem
Construct a stem-and-leaf display for the following data:
A. 11.3, 9.6, 10.4, 7.5, 8.3, 10.5, 10.0, 9.3, 8.1, 7.7, 7.5, 8.4, 6.3, 8.8
B. 1161, 1206, 1478, 1300, 1604, 1725, 1361, 1422, 1221, 1378, 1623, 1426, 1557, 1730, 1706, 1689

Assignment 1
Showcase various graphical methods for categorical and numerical variables using Excel, Tableau, and Power BI.
Mode of Submission: Online (OneDrive)
Submission: PPT
Type: Individual
Last Date: 12/July/2023

Descriptive Statistics: Numerical Measures
Measures of Location
• Mean
• Median
• Mode
• Weighted Mean
• Geometric Mean
• Percentiles
• Quartiles
Measures of Variability
• Range
• Interquartile Range
• Variance
• Standard Deviation
• Coefficient of Variation

Descriptive Measures
Measures of Central Tendency (Measures of Location/Position): Mean, Median, Mode
These are statistical measures that try to determine the "centre" of a set of numbers. Measures of central tendency condense data into a single value and make data comparisons easier.

Mean
• It is the most commonly used measure.
• It represents the average of the data values.
• It is useful for a data set that does not have outliers.
• (An outlier is a number in a set of data that is much bigger or smaller than the rest of the numbers.)
• Types of mean: (a) Arithmetic or Simple Mean (b) Harmonic Mean (c) Geometric Mean

Mean: Merits and Demerits
Merits of Mean
• It is simple to compute and has a distinct value.
• It is least susceptible to sampling variations.
• It may be employed for additional statistical analysis.
• It is founded on every observation.
Demerits of Mean
• Extreme items (outliers) have an undue influence on the mean.
• For qualitative data, it cannot be computed.
• It cannot be determined by inspection or by graphical location.

Arithmetic Mean
The mean is represented by $\bar{x}$ (for a sample) or $\mu$ (for a population). Equation (1) can be used to calculate the mean for any data set:

$$\text{Mean} = \frac{\text{Sum of all observations}}{\text{Number of observations}}$$

Mean for ungrouped data:
$$\bar{x} = \frac{\sum x_i}{n} \qquad (1)$$

Mean for grouped data:
$$\bar{x} = \frac{\sum f_i x_i}{n}, \quad \text{where } n = \sum f_i$$

Mean for ungrouped data
Example 1: Rahul requires a B or better in chemistry to graduate. He did poorly on the first three of his tests, but well on the last four.
These are his evaluations: 46; 53; 54; 74; 78; 81 and 100
Compute the mean and determine if Rahul's grade will be a B (80 to 89 average) or a C (70 to 79 average).
(In Excel, you can select Sigma from the drop-down menu.)

$$\bar{x} = \frac{\sum x_i}{n} = \frac{46 + 53 + 54 + 74 + 78 + 81 + 100}{7} = \frac{486}{7} \approx 69.43$$

With these scores the average is just below the 70 to 79 range for a C, so Rahul does not reach a B.

Geometric Mean
Geometric Mean (GM): the nth root of the product of n positive values.

Ungrouped data:
$$G.M. = \text{Antilog}\left(\frac{\sum \log x_i}{n}\right)$$

Grouped data:
$$G.M. = \text{Antilog}\left(\frac{\sum f_i \log x_i}{n}\right), \quad \text{where } n = \sum f_i$$

Excel command to compute the geometric mean: =GEOMEAN(Number 1, Number 2, ..., Number n)

Geometric Mean
• Calculated by finding the nth root of the product of n values.
• Used to analyze growth rates in financial data (where the arithmetic mean gives misleading results).
• Used to determine the mean rate of change over several successive periods (years, quarters, weeks, ...).
Applications
• Changes in populations of species
• Crop yields
• Pollution levels
• Birth and death rates

Geometric Mean: Rate of Return

Period    Return (%)    Growth Factor
1         -6.0          0.940
2         -8.0          0.920
3         -4.0          0.960
4          2.0          1.020
5          5.4          1.054

$$\bar{x}_g = \sqrt[5]{(.94)(.92)(.96)(1.02)(1.054)} = [.89254]^{1/5} = .97752$$

Average growth rate per period is (.97752 - 1)(100) = -2.248%

Harmonic Mean
Harmonic Mean Computation

Ungrouped data:
$$H = \frac{n}{\sum \frac{1}{x_i}}$$

Grouped data:
$$H = \frac{n}{\sum \frac{f_i}{x_i}}, \quad \text{where } n = \sum f_i$$

Excel command to compute the harmonic mean: =HARMEAN(Number 1, Number 2, ..., Number n)

Weighted Mean
Example 2: Suppose your midterm test score is 83 and your final exam score is 95. Using weights of 40% for the midterm and 60% for the final exam, compute the weighted average of your scores. If the minimum average for an A is 90, will you earn an A?
$$\bar{x} = \frac{\sum w x}{n}, \quad \text{where } n = \sum w \text{ and } w = \text{weight}$$

Weighted mean = (0.40)(83) + (0.60)(95) = 90.2. Since 90.2 is above 90, you earn an A.

Solved Example: Using Excel
Example 3: The weights, recorded to the nearest gram, of 60 apples picked at random from a consignment are given below (Source: http://www.uop.edu.pk/ocontents/chapter%203.pdf)

106  107   76   82  109  107  115   93  187   95
123  125  111   92   86   70  126   68  130  129
139  119  115  128  100  186   84   99  113  204
111  141  136  123   90  115   98  110   78  185
162  178  140  152  173  146  158  194  148   90
107  181  131   75  184  104  110   80  118   82

Calculate the arithmetic mean, geometric mean, and harmonic mean.

Weight (grams)    Frequency
65-84              9
85-104            10
105-124           17
125-144           10
145-164            5
165-184            4
185-204            5

Answer: AM: 122.5; GM: 117.7021; HM: 113.1139

Practice Problem
Practice Problem 1: Calculate the arithmetic mean, harmonic mean, and geometric mean for the following raw data. Try Yourself.

Hint:
Lower    Upper    fj
0        150       3
150      300       8
300      450      21
450      600      10
600      750       1
750      900       4

Raw Data (seconds)
 41   96  109  203  264  266  267  285  289  290
292  307  311  358  362  372  378  380  382  385
414  417  421  421  423  424  427  429  429  439
445  448  454  470  484  499  514  522  524  552
561  587  653  775  792  809  878

Median
Median: when the observations are arranged in ascending or descending order, the value that divides the distribution into two equal parts is called the median.

Median for ungrouped data
If n is odd:
$$\text{Median} = \text{size of the } \left(\frac{n+1}{2}\right)\text{th observation}$$
If n is even:
$$\text{Median} = \frac{\text{size of the } \left(\frac{n}{2}\right)\text{th} + \left(\frac{n}{2}+1\right)\text{th observations}}{2}$$

Median
Median for ungrouped data when n is odd
Calculate the median for the following marks obtained by 9 students:
X: 45; 32; 37; 46; 39; 36; 41; 48; 36
Solution: Arrange the data in ascending or descending order:
32, 36, 36, 37, 39, 41, 45, 46, 48
n = 9, i.e. n is odd, so
Median = size of the ((9 + 1)/2)th observation = 5th observation = 39

Median
Median for ungrouped data when n is even
Calculate the median for the following marks obtained by 10 students:
X: 45; 32; 37; 46; 39; 36; 41; 48; 36; 50
Solution: Arrange the data in ascending or descending order:
32, 36, 36, 37, 39, 41, 45, 46, 48, 50
n = 10, i.e. n is even, so
Median = size of the (5th + 6th) observations / 2 = (39 + 41)/2 = 40

Median
Median for grouped data
$$\text{Median} = l + \frac{h}{f}\left(\frac{n}{2} - C\right), \quad \text{where } n = \sum f$$
Where,
l = lower class boundary of the median class
h = width of the class
f = frequency of the median class
C = cumulative frequency of the class preceding the median class

Solved Example: Median
Median for continuous grouped data
Example 4: Find the median for the distribution of examination marks given below:

Class Boundaries    Number of Students (fi)
29.5-39.5             8
39.5-49.5            87
49.5-59.5           190
59.5-69.5           304
69.5-79.5           211
79.5-89.5            85
89.5-99.5            20

Here n = 905, so n/2 = 452.5. The cumulative frequency first exceeds 452.5 in the class 59.5-69.5, which is therefore the median class (C = 285, f = 304, h = 10).

$$\text{Median} = l + \frac{h}{f}\left(\frac{n}{2} - C\right) = 59.5 + \frac{10}{304}\left(\frac{905}{2} - 285\right) \approx 65 \text{ marks}$$

Median: Advantages and Disadvantages
Advantages
• It is easy to compute and simple to understand.
• It is not affected by extreme values.
• It has a definite and certain value because it is rigidly defined.
• It can be calculated even if the values of the extremes are not known. However, the number of items should be known.
Disadvantages
• Since the median is a positional average, arranging the data in ascending or descending order of magnitude is time-consuming when the number of observations is large.
• It is a positional average and does not consider the magnitude of the items.
• It does not depend on all the observations, so it cannot be considered a good representative of them.

Mode
The value that occurs most frequently in a data set is called the mode.
For example: 4, 6, 7, 8, 3, 4, 5, 4, 3, 5, 5, 2, 1, 4, 9, 4
The mode is 4.
Data having one mode are called a uni-modal distribution.
Data having two modes are called a bi-modal distribution.
Data having more than two modes are called a multi-modal distribution.

Mode
Mode for grouped data
The value that has the largest frequency in a set of data is called the mode. For example:

No. of Assistants    0   1   2   3   4   5   6   7   8   9
Frequency            3   4   6   7  10   6   5   5   3   1

The largest frequency is 10, so the mode is 4 assistants.

Mode
Mode for continuous grouped data
$$\text{Mode} = l + \frac{f_m - f_1}{(f_m - f_1) + (f_m - f_2)} \times h$$
Where
l = the lower limit of the modal class
h = the size of the class interval
fm = the frequency of the modal class
f1 = the frequency of the class preceding the modal class
f2 = the frequency of the class succeeding the modal class

Solved Example: Mode
Example 5: Find the mode for the distribution of examination marks given below:

Class Boundaries    Number of Students (fi)
29.5-39.5             8
39.5-49.5            87
49.5-59.5           190
59.5-69.5           304
69.5-79.5           211
79.5-89.5            85
89.5-99.5            20

Answer:
$$\text{Mode} = 59.5 + \frac{304 - 190}{(304 - 190) + (304 - 211)} \times 10 = 59.5 + 5.507 \approx 65.01$$

Practice Problem: Mode
Practice Problem 2: Calculate the mode for the following data. Try Yourself.

Wages         Number of Workers
Below 100      8
100-200       12
200-300       25
300-400       15
400-500       10
Above 500      6

Mode: Advantages and Disadvantages
Advantages
• It is easy to compute and simple to understand.
• It is the value around which the observations are most concentrated. Hence, it is the best representative of the data.
• It is not affected by extreme values of the given data.
• It can be determined graphically with the help of a histogram.
• It is useful for both quantitative and qualitative data.
Disadvantages
• The value of the mode is not based on each and every item of the series, as it considers only the highest concentration of frequencies.
• The value of the mode cannot always be determined.
• There are two methods of determining the mode, the Inspection Method and the Grouping Method. The two methods may not give the same value, so the mode is not rigidly defined.
• Since it is not based on all the observations and is not rigidly defined, it is not suitable for further algebraic treatment.

Application of Mean, Median and Mode
Mode: Many everyday decisions are based on the modal concept. Examples include selecting the elective in which the majority of students are enrolling, choosing the car color that the majority of people are purchasing, and going to the movie that the majority of people are watching.
Median: Consider a country's income distribution in which the income of the wealthiest 20% is roughly equivalent to the income of the remaining 80% of the population. Because of the elevated income of that 20%, the mean income will be distorted. The median provides a more realistic average income.

Application of Mean, Median and Mode
Mean: If data are symmetrical or normally distributed, the mean provides a very accurate depiction of the central value. Because many distributions are approximately symmetrical, the mean is the most common measure of central tendency in industrial and non-industrial settings. Examples: mean height, weight, blood pressure, score, etc.

Percentiles
• A percentile provides information about how the data are spread over the interval from the smallest value to the largest value.
• Example: admission test scores for colleges and universities
• The pth percentile of a data set is a value such that at least p percent of the items take on this value or less and at least (100 - p) percent of the items take on this value or more.

Percentiles
The median is the middle value for a given set of observations.
What are percentiles, quartiles, and deciles?
Suppose your GATE Exam percentile is 90.
It implies that 90 percent of the people who took the GATE exam scored less than you.

Percentiles
Q1 = First Quartile / Lower Quartile = 25th percentile
Q2 = Second Quartile / Middle Quartile = 50th percentile
Q3 = Third Quartile / Upper Quartile = 75th percentile
IQR = Interquartile Range = Q3 - Q1
In general, the pth percentile divides the data into two parts:
• Approximately p percent of the observations are less than the pth percentile.
• Approximately (100 - p) percent of the observations are greater than the pth percentile.

Percentiles
• Arrange the data in ascending order.
• Compute Lp, the location of the pth percentile: Lp = (p/100)(n + 1)

80th Percentile
Example: Apartment Rents (70 observations, in ascending order, read row by row)

525  530  530  535  535  535  535  535  540  540
540  540  540  545  545  545  545  545  550  550
550  550  550  550  550  560  560  560  565  565
565  570  570  572  575  575  575  580  580  580
580  585  590  590  590  600  600  600  600  610
610  615  625  625  625  635  649  650  670  670
675  675  680  690  700  700  700  700  715  715

Lp = (p/100)(n + 1) = (80/100)(70 + 1) = 56.8
(the 56th value plus .8 times the difference between the 57th and 56th values)
80th Percentile = 635 + .8(649 - 635) = 646.2

80th Percentile
Example: Apartment Rents
"At least 80% of the items take on a value of 646.2 or less." 56 of the 70 rents are at or below 646.2: 56/70 = .8 or 80%
"At least 20% of the items take on a value of 646.2 or more." 14 of the 70 rents are at or above 646.2: 14/70 = .2 or 20%

Quartiles
Quartiles are specific percentiles.
First Quartile = 25th Percentile
Second Quartile = 50th Percentile = Median
Third Quartile = 75th Percentile

Third Quartile (75th Percentile)
Example: Apartment Rents
Lp = (p/100)(n + 1) = (75/100)(70 + 1) = 53.25
(the 53rd value plus .25 times the difference between the 54th and 53rd values)
Third quartile = 625 + .25(625 - 625) = 625

Five-Number Summaries and Box Plots
• Summary statistics and easy-to-draw graphs can be used to quickly summarize large quantities of data.
• Two tools that accomplish this are five-number summaries and box plots.

Five-Number Summary
1. Smallest Value
2. First Quartile
3. Median
4. Third Quartile
5. Largest Value

Five-Number Summary
Example: Apartment Rents
Lowest Value = 525
First Quartile = 545
Median = 575
Third Quartile = 625
Largest Value = 715

Box Plot
• A box plot is a graphical summary of data that is based on a five-number summary.
• A key to the development of a box plot is the computation of the median and the quartiles Q1 and Q3.
• Box plots provide another way to identify outliers.

Boxplot
A boxplot, also referred to as a box-and-whisker plot, is a way to graphically display a five-number summary.
Steps for constructing a boxplot:
a. Place the five-number summary values on the horizontal axis in ascending order.
b. Make a box that includes the first and third quartiles.
c. In the box, draw a dashed vertical line at the median.
d. Draw a line ("whisker") from Q1 to the smallest value that is not more than 1.5*IQR below Q1. Similarly, draw a whisker from Q3 to the largest value that is not more than 1.5*IQR above Q3.
e. Use an asterisk (or another symbol) to mark observations that are more than 1.5*IQR from the box; these are outliers.

Boxplot: Shape of a Distribution
A boxplot is used to judge the shape of a distribution informally.
• Symmetric: the median is in the center of the box, and the left and right whiskers are equally distant from their respective quartiles.
• Positively skewed: the median is left of center and the right whisker is longer than the left whisker.
• Negatively skewed: the median is right of center and the left whisker is longer than the right whisker.

Measures of Variability/Dispersion
• It is often desirable to consider measures of variability (dispersion), as well as measures of location.
• For example, in choosing supplier A or supplier B we might consider not only the average delivery time for each but also the variability in delivery time for each.

Measures of Variability/Dispersion
• A measure of central location gives the value on which all the observations could be assumed to be located or concentrated.
• It does not give any information about the variability present in the data, i.e., how much the observations are scattered or differ from each other.
• Measures of dispersion help to interpret the variability of the data, i.e., to know how homogeneous or heterogeneous the data are.

Measures of Variability
• Range
• Interquartile Range
• Variance
• Standard Deviation
• Coefficient of Variation

Range
• The range of a data set is the difference between the largest and smallest data values.
  Range = Largest value - Smallest value
• It is the simplest measure of variability.
• It is very sensitive to the smallest and largest data values.
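The percentile location rule, the five-number summary, and the 1.5*IQR limits used for box plot whiskers can be checked with a short sketch on the apartment rents data. The function name `percentile` and the variable names are hypothetical; the interpolation follows the Lp = (p/100)(n + 1) rule from the slides, which is not the default method of every software package.

```python
# Apartment rents from the example above (70 observations).
rents = [
    525, 530, 530, 535, 535, 535, 535, 535, 540, 540,
    540, 540, 540, 545, 545, 545, 545, 545, 550, 550,
    550, 550, 550, 550, 550, 560, 560, 560, 565, 565,
    565, 570, 570, 572, 575, 575, 575, 580, 580, 580,
    580, 585, 590, 590, 590, 600, 600, 600, 600, 610,
    610, 615, 625, 625, 625, 635, 649, 650, 670, 670,
    675, 675, 680, 690, 700, 700, 700, 700, 715, 715,
]


def percentile(values, p):
    """pth percentile using the location L_p = (p/100)(n + 1)."""
    data = sorted(values)
    n = len(data)
    location = (p / 100) * (n + 1)   # 1-based position in the sorted data
    whole = int(location)            # e.g. 56 for L_p = 56.8
    fraction = location - whole      # e.g. 0.8 for L_p = 56.8
    if whole < 1:
        return data[0]
    if whole >= n:
        return data[-1]
    below, above = data[whole - 1], data[whole]
    return below + fraction * (above - below)


q1, median, q3 = (percentile(rents, p) for p in (25, 50, 75))
five_number_summary = (min(rents), q1, median, q3, max(rents))
p80 = percentile(rents, 80)

# Whisker limits: values beyond these are flagged as outliers.
iqr = q3 - q1
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr
outliers = [x for x in rents if x < lower_limit or x > upper_limit]

print("Five-number summary:", five_number_summary)
print("80th percentile:", round(p80, 1))
print("IQR:", iqr, "limits:", (lower_limit, upper_limit), "outliers:", outliers)
```

For these rents, every value lies inside the limits, so the box plot would show no outlier symbols.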
Range
What is the range of the following data: 4; 8; 1; 6; 6; 2; 9; 3; 6; 9
The maximum value is 9; the minimum value is 1; the range is 9 - 1 = 8.
Two very different sets of data can have the same range: 1 1 1 1 9 vs 1 3 5 7 9

Range
Example: Apartment Rents
Range = largest value - smallest value
Range = 715 - 525 = 190

Interquartile Range
• The interquartile range of a data set is the difference between the third quartile and the first quartile.
• It is the range for the middle 50% of the data.
• It overcomes the sensitivity to extreme data values.

Interquartile Range (IQR)
Example: Apartment Rents
3rd Quartile (Q3) = 625
1st Quartile (Q1) = 545
IQR = Q3 - Q1 = 625 - 545 = 80

Variance
• The variance is a measure of variability that utilizes all the data.
• It is based on the difference between the value of each observation ($x_i$) and the mean ($\bar{x}$ for a sample, $\mu$ for a population).
• The variance is useful in comparing the variability of two or more variables.

Variance
• Average of the squared differences between each data value and the mean.
• Computed as follows:
$$s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1} \quad \text{for a sample}$$
$$\sigma^2 = \frac{\sum (x_i - \mu)^2}{N} \quad \text{for a population}$$

Standard Deviation
• Positive square root of the variance.
• Measured in the same units as the data, making it more easily interpreted than the variance.
• Computed as follows:
$$s = \sqrt{s^2} \quad \text{for a sample}$$
$$\sigma = \sqrt{\sigma^2} \quad \text{for a population}$$

Coefficient of Variation
Indicates how large the standard deviation is in relation to the mean.
$$\left(\frac{s}{\bar{x}} \times 100\right)\% \quad \text{for a sample}$$
$$\left(\frac{\sigma}{\mu} \times 100\right)\% \quad \text{for a population}$$

Mean Deviation
For ungrouped data:
$$\text{Mean Deviation} = \frac{\sum |x_i - \bar{x}|}{n}$$
For grouped data:
$$\text{Mean Deviation} = \frac{\sum f_i |x_i - \bar{x}|}{n}$$

Standard Deviation
For ungrouped data:
$$\text{Variance } (\sigma^2) = \frac{1}{n}\sum (x_i - \bar{x})^2; \quad \text{Standard deviation} = \sqrt{\text{Variance}}$$

Example: Compute the variance and standard deviation for the following ungrouped data.

            Observation (x_i)    x_i - x̄    (x_i - x̄)²
            1                    -1          1
            2                     0          0
            3                    +1          1
Sum         6                     0          2
Mean        2                     0          2/3 = 0.67

$$\text{Variance } (\sigma^2) = \frac{1}{n}\sum (x_i - \bar{x})^2 = \frac{2}{3} = 0.67 \quad \text{and} \quad \text{S.D. } (\sigma) = \sqrt{0.67} = 0.82$$

Standard Deviation
For grouped data:
$$\text{Variance} = \frac{\sum f_i (x_i - \bar{x})^2}{\sum f_i} = \frac{\sum f_i x_i^2 - \left(\sum f_i\right)\bar{x}^2}{\sum f_i}; \quad \text{Standard deviation} = \sqrt{\text{Variance}}$$

Example 6: Compute the mean, mean deviation, standard deviation, and variance.

Class Interval    Frequency    Mid-point of class interval (x_i)
40-50             10           45
50-60             20           55
60-70             20           65
70-80             15           75
80-90             15           85
90-100            20           95

Practice Problem
Practice Problem: Compute the mean, mean deviation, standard deviation, and variance for the data given below.

Class Interval    Frequency
2000-3000         2
3000-4000         5
4000-5000         6
5000-6000         4
6000-7000         3

Practice Problem
Example: using the five-number summaries for Growth and Value
• Find the IQR for Growth.
• Determine whether any outliers exist.
• Repeat for Value.

Solution
Example: using the five-number summaries for Growth and Value
Growth:
IQR = Q3 − Q1 = 36.97 − 2.86 = 34.11
Limit = 1.5*IQR = 1.5*34.11 = 51.17
Q1 − Min = 2.86 − (−40.90) = 43.76 (left whisker)
Max − Q3 = 79.48 − 36.97 = 42.51 (right whisker)
Both distances are less than 51.17, so there are no outliers.

Solution
Value:
IQR = Q3 − Q1 = 22.44 − 1.70 = 20.74
Limit = 1.5*IQR = 1.5*20.74 = 31.11
Q1 − Min = 1.70 − (−46.52) = 48.22 (left whisker)
Max − Q3 = 44.08 − 22.44 = 21.64 (right whisker)
The left distance exceeds the limit; hence, there is at least one outlier on the left side.

Numerical problem
• Revenue Growth Rate.
Annual revenue for Corning Supplies grew by 5.5% in 2014, 1.1% in 2015, −3.5% in 2016, −1.1% in 2017, and 1.8% in 2018. What is the mean annual growth rate over this period?

Solution
To calculate the mean growth rate, we must first compute the geometric mean of the five growth factors:

Year    % Growth    Growth Factor x_i
2014     5.5        1.055
2015     1.1        1.011
2016    -3.5        0.965
2017    -1.1        0.989
2018     1.8        1.018

Solution
$$\bar{x}_g = \sqrt[5]{(1.055)(1.011)(0.965)(0.989)(1.018)} = [1.036275]^{1/5} = 1.007152$$
The mean annual growth rate is (1.007152 − 1)100% = 0.7152%.

Numerical Problem
• Hardshell Jacket Ratings. OutdoorGearLab is an organization that tests outdoor gear used for climbing, camping, mountaineering, and backpacking. Suppose that the following data show the ratings of hardshell jackets based on the breathability, durability, versatility, features, mobility, and weight of each jacket. The ratings range from 0 (lowest) to 100 (highest).
• 42, 66, 67, 71, 78, 62, 61, 76, 71, 67
• 61, 64, 61, 54, 83, 63, 68, 69, 81, 53
a. Compute the mean, median, and mode.
b. Compute the first and third quartiles.
c. Compute and interpret the 90th percentile.

Numerical Problem
Air Quality Index. The Los Angeles Times regularly reports the air quality index for various areas of Southern California. A sample of air quality index values for Pomona provided the following data: 28, 42, 58, 48, 45, 55, 60, 49, and 50.
a. Compute the range and interquartile range.
b. Compute the sample variance and sample standard deviation.
c. A sample of air quality index readings for Anaheim provided a sample mean of 48.5, a sample variance of 136, and a sample standard deviation of 11.66. What comparisons can you make between the air quality in Pomona and that in Anaheim on the basis of these descriptive statistics?

Measures of Distribution Shape, Relative Location, and Detecting Outliers
• Distribution Shape: Skewness
• z-Scores
• Chebyshev's Theorem
• Empirical Rule
• Detecting Outliers

Distribution Shape: Skewness
• An important measure of the shape of a distribution is called skewness.
• The formula for the skewness of sample data is
$$\text{Skewness} = \frac{n}{(n-1)(n-2)} \sum \left(\frac{x_i - \bar{x}}{s}\right)^3$$
A histogram provides a graphical display showing the shape of a distribution.
Source: https://en.wikipedia.org/wiki/Skewness

Distribution Shape: Skewness
Symmetric (not skewed)
• Skewness is zero.
• Mean and median are equal.
[Figure: relative frequency histogram of a symmetric distribution, Skewness = 0]

The distribution of income is often positively skewed (right-skewed). A few individuals earn very large incomes, while the majority of people earn close to an average income. The few affluent individuals sit at the extreme right of the distribution.
Suppose a researcher surveys a group of elderly people about their age at retirement. Because the majority of people retire in their mid-60s or older, the distribution would be negatively skewed.

Distribution Shape: Skewness
Moderately Skewed Left
• Skewness is negative.
• Mean will usually be less than the median.
[Figure: relative frequency histogram of a moderately left-skewed distribution, Skewness = −.31]

Distribution Shape: Skewness
Moderately Skewed Right
• Skewness is positive.
• Mean will usually be more than the median.
[Figure: relative frequency histogram of a moderately right-skewed distribution, Skewness = .31]

Distribution Shape: Skewness
Highly Skewed Right
• Skewness is positive (often above 1.0).
• Mean will usually be more than the median.
[Figure: relative frequency histogram of a highly right-skewed distribution, Skewness = 1.25]

z-Scores
In addition to measures of location, variability, and shape, we are also interested in the relative location of values within a data set. Measures of relative location help us determine how far a particular value is from the mean.
• The z-score is often called the standardized value.
• It denotes the number of standard deviations a data value $x_i$ is from the mean:
$$z_i = \frac{x_i - \bar{x}}{s}$$
• Excel's STANDARDIZE function can be used to compute the z-score.
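To make the skewness and z-score formulas above concrete, here is a minimal sketch that applies them to the Pomona air quality index sample from the earlier numerical problem. The function names are hypothetical, and `statistics.stdev` is used because it computes the sample standard deviation (divisor n − 1), matching s in the formulas.

```python
from statistics import mean, stdev

# Pomona air quality index sample from the numerical problem above.
aqi = [28, 42, 58, 48, 45, 55, 60, 49, 50]


def z_scores(values):
    """z_i = (x_i - x_bar) / s, with s the sample standard deviation."""
    x_bar = mean(values)
    s = stdev(values)
    return [(x - x_bar) / s for x in values]


def sample_skewness(values):
    """Skewness = n / ((n - 1)(n - 2)) * sum(z_i ** 3)."""
    n = len(values)
    return n / ((n - 1) * (n - 2)) * sum(z ** 3 for z in z_scores(values))


zs = z_scores(aqi)
skewness = sample_skewness(aqi)

for x, z in zip(aqi, zs):
    print(f"x = {x:3d}  z = {z:+.2f}")
print(f"skewness = {skewness:.2f}")
```

The reading of 28 sits well below the mean and pulls the skewness negative, yet its z-score is still inside the −3 to +3 band, so the z-score rule of thumb would not flag it as an outlier.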
z-Scores
• An observation's z-score is a measure of the relative location of the observation in a data set.
• A data value less than the sample mean will have a z-score less than zero.
• A data value greater than the sample mean will have a z-score greater than zero.
• A data value equal to the sample mean will have a z-score of zero.

Example: sample mean $\bar{x}$ = 44, sample standard deviation s = 8.
The z-score of −1.50 for the fifth observation shows that it is farthest from the mean; it is 1.50 standard deviations below the mean. The z-score for any observation can be interpreted as a measure of the relative location of the observation in a data set.

Chebyshev's Theorem
• Enables us to make statements about the proportion of data values that must be within a specified number of standard deviations of the mean.
• At least $(1 - 1/z^2)$ of the items in any data set will be within z standard deviations of the mean, where z is any value greater than 1.
• Chebyshev's theorem requires z > 1, but z need not be an integer.

Chebyshev's Theorem
• At least 75% of the data values must be within z = 2 standard deviations of the mean.
• At least 89% of the data values must be within z = 3 standard deviations of the mean.
• At least 94% of the data values must be within z = 4 standard deviations of the mean.

Practice Problem
Suppose that the midterm test scores for 100 students in a college business statistics course had a mean of 70 and a standard deviation of 5.
• How many students had test scores between 60 and 80?
• How many students had test scores between 58 and 82?

Solution
For the test scores between 60 and 80, we note that 60 is two standard deviations below the mean and 80 is two standard deviations above the mean.
• Using Chebyshev's theorem, we see that at least .75, or at least 75%, of the observations must have values within two standard deviations of the mean.
• Thus, at least 75% of the students must have scored between 60 and 80.
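The bound used in this solution can be computed directly from the theorem. The sketch below is only illustrative; the function name is hypothetical, and the mean of 70 and standard deviation of 5 come from the practice problem.

```python
def chebyshev_lower_bound(z):
    """Minimum proportion of data within z standard deviations of the mean."""
    if z <= 1:
        raise ValueError("Chebyshev's theorem requires z > 1")
    return 1 - 1 / z ** 2


mean_score, std_dev = 70, 5

# Each interval is symmetric about the mean, so z comes from the upper limit.
bounds = {}
for low, high in [(60, 80), (58, 82)]:
    z = (high - mean_score) / std_dev
    bounds[(low, high)] = chebyshev_lower_bound(z)
    print(f"{low} to {high}: z = {z}, at least {bounds[(low, high)]:.1%}")
```

Because z need not be an integer, the same function handles the 58 to 82 interval, where z = 2.4.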
Solution
For the test scores between 58 and 82, we see that (58 − 70)/5 = −2.4 indicates 58 is 2.4 standard deviations below the mean and that (82 − 70)/5 = +2.4 indicates 82 is 2.4 standard deviations above the mean. Applying Chebyshev's theorem with z = 2.4, we have
$$1 - \frac{1}{z^2} = 1 - \frac{1}{(2.4)^2} = .826$$
At least 82.6% of the students must have test scores between 58 and 82.

Empirical Rule
The advantage of Chebyshev's theorem is that it applies to any data set, regardless of the shape of the distribution of the data. In many practical applications, however, data sets exhibit a symmetric, mound-shaped or bell-shaped distribution. When the data are believed to approximate this distribution, the empirical rule can be used to determine the percentage of data values that must be within a specified number of standard deviations of the mean.

Empirical Rule
The empirical rule is based on the normal distribution. For data having a bell-shaped distribution:
• Approximately 68% of the data values will be within one standard deviation of the mean.
• Approximately 95% of the data values will be within two standard deviations of the mean.
• Almost all of the data values will be within three standard deviations of the mean.

[Figure: Empirical Rule]

Detecting Outliers
• An outlier is an unusually small or unusually large value in a data set.
• A data value with a z-score less than −3 or greater than +3 might be considered an outlier.
• It might be:
  • an incorrectly recorded data value
  • a data value that was incorrectly included in the data set
  • a correctly recorded data value that belongs in the data set

Measures of Association Between Two Variables
• We have examined numerical methods used to summarize the data for one variable at a time.
• Often a manager or decision maker is interested in the relationship between two variables.
• Two descriptive measures of the relationship between two variables are covariance and correlation coefficient.

Covariance
• The covariance is a measure of the linear association between two variables.
• Positive values indicate a positive relationship.
• Negative values indicate a negative relationship.
• The covariance is computed as follows:
For samples:
$$s_{xy} = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{n - 1}$$
For populations:
$$\sigma_{xy} = \frac{\sum (x_i - \mu_x)(y_i - \mu_y)}{N}$$

Correlation Coefficient
Correlation is a measure of linear association and not necessarily causation. Just because two variables are highly correlated, it does not mean that one variable is the cause of the other.
The correlation coefficient is computed as follows:
For samples:
$$r_{xy} = \frac{s_{xy}}{s_x s_y}$$
For populations:
$$\rho_{xy} = \frac{\sigma_{xy}}{\sigma_x \sigma_y}$$

Correlation Coefficient
• The coefficient can take on values between −1 and +1.
• Values near −1 indicate a strong negative linear relationship.
• Values near +1 indicate a strong positive linear relationship.
• The closer the correlation is to zero, the weaker the linear relationship.

Covariance and Correlation Coefficient
Example: Golfing Study
A golfer is interested in investigating the relationship, if any, between driving distance and 18-hole score.

Average Driving Distance (yds.)    Average 18-Hole Score
277.6                              69
259.5                              71
269.1                              70
267.0                              70
255.6                              71
272.9                              69

Covariance and Correlation Coefficient
Example: Golfing Study

             x         y        (x_i − x̄)    (y_i − ȳ)    (x_i − x̄)(y_i − ȳ)
             277.6     69        10.65        -1.0         -10.65
             259.5     71        -7.45         1.0          -7.45
             269.1     70         2.15         0             0
             267.0     70         0.05         0             0
             255.6     71       -11.35         1.0         -11.35
             272.9     69         5.95        -1.0          -5.95
Average      267.0     70.0
Std. Dev.    8.2192    .8944
Total                                                      -35.40

Covariance and Correlation Coefficient
• Example: Golfing Study
• Sample Covariance
$$s_{xy} = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{n - 1} = \frac{-35.40}{6 - 1} = -7.08$$
• Sample Correlation Coefficient
$$r_{xy} = \frac{s_{xy}}{s_x s_y} = \frac{-7.08}{(8.2192)(.8944)} = -.9631$$

Quiz-1
In each of the following scenarios, define the type of measurement scale.
a.
An investor collects data on the weekly closing price of gold throughout the year.
b. An analyst assigns a sample of bond issues to one of the following credit ratings, given in descending order of credit quality (increasing probability of default): AAA, AA, BBB, BB, CC, D.
c. The dean of the business school at a local university categorizes students by major (i.e., accounting, finance, marketing, etc.) to help in determining class offerings in the future.

Practice Problem
A data set has a mean of 1,500 and a standard deviation of 100.
a. Using Chebyshev's theorem, what percentage of the observations fall between 1,300 and 1,700?
b. Using Chebyshev's theorem, what percentage of the observations fall between 1,100 and 1,900?

Practice Problem
A sample space S yields five equally likely events, A, B, C, D, and E.
a. Find P(D)
b. Find P(B^c)
c. Find P(A ∪ C ∪ E)

Practice Problem
An alarming number of U.S. adults are either overweight or obese. The distinction between overweight and obese is made on the basis of body mass index (BMI), expressed as weight/height². An adult is considered overweight if the BMI is 25 or more but less than 30. An obese adult will have a BMI of 30 or greater. According to a January 2012 article in the Journal of the American Medical Association, 33.1% of the adult population in the United States is overweight and 35.7% is obese. Use this information to answer the following questions.
a. What is the probability that a randomly selected adult is either overweight or obese?
b. What is the probability that a randomly selected adult is neither overweight nor obese?
c. Are the events "overweight" and "obese" exhaustive?
d.
Are the events “overweight” and “obese” mutually exclusive Solution Introduction to Probability • Random Experiments, Counting Rules, and Assigning Probabilities • Events and Their Probability • Some Basic Relationships of Probability • Conditional Probability • Bayes’ Theorem 225 Uncertainties • Managers often base their decisions on an analysis of uncertainties such as the following: • What are the chances that the sales will decrease if we increase prices? • What is the likelihood a new assembly method will increase productivity? • What are the odds that a new investment will be profitable? 226 Probability • Probability is a numerical measure of the likelihood that an event will occur. • Probability values are always assigned on a scale from 0 to 1. • A probability near zero indicates an event is quite unlikely to occur. • A probability near one indicates an event is almost certain to occur. Statistical Experiments • In statistics, the notion of an experiment differs somewhat from that of an experiment in the physical sciences. • In statistical experiments, probability determines outcomes. • Even though the experiment is repeated exactly the same way, an entirely different outcome may occur. • For this reason, statistical experiments are sometimes called random experiments. 228 Random Experiment and Its Sample Space • A random experiment is a process that generates well-defined experimental outcomes. • The sample space for an experiment is the set of all experimental outcomes. • An experimental outcome is also called a sample point. Experiment Toss a coin Inspect a part Conduct a sale call Roll a die Play a football game Experimental Outcomes Head, tail Defective, non-defective Purchase, no purchase 1, 2, 3, 4, 5, 6 Win, lost, tie Probability A probability is a numerical value that measures the likelihood that an event occurs. 
– Between zero (0) and one (1)
– 0 → impossible event that never occurs
– 1 → a definite event that always occurs
• An experiment is a process that leads to one of several possible outcomes.
– Actual outcome is not known with certainty before the experiment begins
– Diversity of outcomes is due to uncertainty
• Example: rolling a fair die

Probability
• The sample space of an experiment is denoted $S$. It contains all possible outcomes of the experiment.
– For example, letter grades in a course: $S = \{A, B, C, D, F\}$
– Passing a course or not: $S = \{P, F\}$
• An event is any subset of outcomes of the experiment.
– It is a simple event if it contains a single outcome.
– It may also contain several outcomes, for example, a passing grade: $Pass = \{A, B, C, D\}$

Probability
Mutually exclusive events
• They do not share any common outcomes.
• The occurrence of one event prohibits the occurrence of the others.
• For example, the sample space for rolling a die is $S = \{1, 2, 3, 4, 5, 6\}$.
– X = event that the number is even = $\{2, 4, 6\}$
– Y = event that the number is odd = $\{1, 3, 5\}$
– Here, X and Y are mutually exclusive events: $X \cap Y = \emptyset$
• Two events are mutually exclusive if the occurrence of either event excludes the occurrence of the other.

Probability
Exhaustive events
• All possible outcomes of an experiment belong to the events; together they include all outcomes in the sample space.
• For example, the sample space for throwing a die once is $S = \{1, 2, 3, 4, 5, 6\}$.
– X = event that the number on the die is less than 4 = $\{1, 2, 3\}$
– Y = event that the number on the die is greater than 2 and less than 5 = $\{3, 4\}$
– Z = event that the number on the die is greater than 4 = $\{5, 6\}$
– Here, $X \cup Y \cup Z = \{1, 2, 3, 4, 5, 6\} = S$
• Events are exhaustive if at least one of them necessarily occurs whenever the experiment is performed.

Probability
We can define events based on one or more outcomes of the experiment and also combine events to form new events.
Venn Diagram
• The sample space S is represented by a rectangle.
• Two circles represent the events A and B.
Union of two events
• Denoted $A \cup B$
• All outcomes in A or B (or both)
• The portion of the Venn diagram that is included in either A or B

Probability
Intersection of two events
• Denoted $A \cap B$
• All outcomes in both A and B
• The portion of the Venn diagram that is included in both A and B (the overlap)

Probability
Complement of an event A
• Denoted $A^c$
• All outcomes in the sample space S that are not in A
• The portion of the Venn diagram that is in S but not in A

Practice Problem
You roll a die with the sample space S = {1, 2, 3, 4, 5, 6}. You define A as {1, 2, 3}, B as {1, 2, 3, 5, 6}, C as {4, 6}, and D as {4, 5, 6}. Determine which of the following events are exhaustive and/or mutually exclusive.
a. A and B
b. A and C
c. A and D
d. B and C
Solution
a. $A \cup B = \{1, 2, 3, 5, 6\} \neq \{1, 2, 3, 4, 5, 6\} = S$; the events A and B are not exhaustive. $A \cap B = \{1, 2, 3\}$; the events A and B are not mutually exclusive.
b. $A \cup C = \{1, 2, 3, 4, 6\} \neq \{1, 2, 3, 4, 5, 6\} = S$; the events A and C are not exhaustive. $A \cap C = \emptyset$; the events A and C are mutually exclusive.
c. $A \cup D = \{1, 2, 3, 4, 5, 6\} = S$; the events A and D are exhaustive. $A \cap D = \emptyset$; the events A and D are mutually exclusive.
d. $B \cup C = \{1, 2, 3, 4, 5, 6\} = S$; the events B and C are exhaustive. $B \cap C = \{6\}$; the events B and C are not mutually exclusive.

Probability
Properties of probability
1. The probability of an event A is a value between 0 and 1; that is, $0 \le P(A) \le 1$.
2. The sum of the probabilities of any list of mutually exclusive and exhaustive events equals 1.
There are three types of probabilities.
• Subjective: calculated by drawing on personal and subjective judgement
• Empirical: calculated as a relative frequency of occurrence
• Classical: based on logical analysis
Because empirical and classical probabilities do not vary from person to person, they are often grouped as objective probabilities.
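
The die-rolling practice problem above can also be checked mechanically. The following is a minimal sketch using Python's built-in sets; the helper names `is_exhaustive`, `is_mutually_exclusive`, and `classify` are illustrative assumptions, not part of the course material.

```python
# Sample space and events from the die-rolling practice problem.
S = {1, 2, 3, 4, 5, 6}
events = {
    "A": {1, 2, 3},
    "B": {1, 2, 3, 5, 6},
    "C": {4, 6},
    "D": {4, 5, 6},
}


def is_exhaustive(x, y, sample_space):
    """Two events are exhaustive if their union is the whole sample space."""
    return (x | y) == sample_space


def is_mutually_exclusive(x, y):
    """Two events are mutually exclusive if they share no outcome."""
    return x.isdisjoint(y)


def classify(pairs):
    """Return {pair name: (exhaustive?, mutually exclusive?)}."""
    return {
        p + q: (
            is_exhaustive(events[p], events[q], S),
            is_mutually_exclusive(events[p], events[q]),
        )
        for p, q in pairs
    }


results = classify([("A", "B"), ("A", "C"), ("A", "D"), ("B", "C")])
for name, (exhaustive, exclusive) in results.items():
    print(f"{name}: exhaustive={exhaustive}, mutually exclusive={exclusive}")
```

Only the pair A and D comes out as both exhaustive and mutually exclusive, which agrees with the worked solution.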
Rules of Probability
1. Complement rule
• Follows from one of the defining properties of probability: $P(A) + P(A^c) = 1$
• Rearranged: $P(A^c) = 1 - P(A)$
2. Addition rule
• Used to find the probability of the union of two events, that is, the probability that A or B occurs (at least one of these events occurs):
  $P(A \cup B) = P(A) + P(B) - P(A \cap B)$
• $P(A \cap B)$ is subtracted because it is counted twice, once in $P(A)$ and once in $P(B)$.
• $P(A \cap B)$ is referred to as the joint probability.

Rules of Probability
• For mutually exclusive events A and B, the probability of their intersection is zero: $P(A \cap B) = 0$.
• The addition rule then simplifies to $P(A \cup B) = P(A) + P(B)$.

Conditional probability
In business applications, the probability of interest is often a conditional probability. Examples include the probability that the customer will make an online purchase conditional on receiving an e-mail with a discount offer; the probability of making a six-figure salary conditional on getting an MBA; and the probability that sales will improve conditional on the firm launching a new marketing campaign.

Conditional probability
• The conditional probability that A occurs given that B has occurred is
  $P(A \mid B) = \dfrac{P(A \cap B)}{P(B)}$
• Because $P(A \mid B)$ is conditional on B (B has occurred), the sample space reduces to B.
• $P(A \mid B)$ is the share of B in the Venn diagram that is taken up by the $A \cap B$ portion.
• Similarly, $P(B \mid A) = \dfrac{P(A \cap B)}{P(A)}$

Multiplication rule of Probability
• We can find the joint probability as a product of probabilities by rearranging the conditional probability formula; this is the multiplication rule.
• The joint probability of events A and B is
  $P(A \cap B) = P(A \mid B) P(B)$

Probability
• Two events are independent if the occurrence of one event does not affect the probability of the occurrence of the other event.
• Events are considered dependent if the occurrence of one is related to the probability of the occurrence of the other event.
• Two events, A and B, are independent if $P(A \mid B) = P(A)$ or, equivalently, $P(A \cap B) = P(A) P(B)$.
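
The addition rule, the conditional probability formula, and the independence check can be tried out on a small sample space. The sketch below assumes a fair six-sided die and uses exact fractions; the events `A`, `B`, `C` and the helper names are illustrative choices, not taken from the slides.

```python
from fractions import Fraction

# Assumption: a fair die, so every outcome in S is equally likely.
S = {1, 2, 3, 4, 5, 6}


def prob(event):
    """Classical probability: favourable outcomes / all outcomes."""
    return Fraction(len(event & S), len(S))


def conditional(a, b):
    """P(a | b) = P(a and b) / P(b)."""
    return prob(a & b) / prob(b)


def independent(a, b):
    """Independence check: P(a and b) == P(a) * P(b)."""
    return prob(a & b) == prob(a) * prob(b)


A = {2, 4, 6}      # the roll is even
B = {1, 2, 3, 4}   # the roll is at most 4
C = {4, 5, 6}      # the roll is at least 4

# Addition rule: P(A or B) = P(A) + P(B) - P(A and B)
union_by_rule = prob(A) + prob(B) - prob(A & B)

print("P(A or B) =", union_by_rule)
print("P(A | B)  =", conditional(A, B), "independent:", independent(A, B))
print("P(A | C)  =", conditional(A, C), "independent:", independent(A, C))
```

Here knowing B leaves the probability of A unchanged, so A and B are independent, while knowing C raises the probability of A, so A and C are dependent.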
Practice Problem
Suppose that for a given year there is a 2% chance that your desktop computer will crash and a 6% chance that your laptop computer will crash. Moreover, there is a 0.12% chance that both computers will crash. Is the reliability of the two computers independent of each other?
Solution
• Let D represent the outcome that your desktop crashes, $P(D) = 0.02$.
• Let L represent the outcome that your laptop crashes, $P(L) = 0.06$.
• The joint probability is $P(D \cap L) = 0.0012$.
• Calculate $P(D \mid L) = \dfrac{P(D \cap L)}{P(L)} = \dfrac{0.0012}{0.06} = 0.02$.
• So, $P(D \mid L) = P(D)$. If your laptop crashes, it does not alter the probability that your desktop also crashes.
• The reliability of the two computers is independent.

Practice Problem
Let P(A) = 0.65, P(B) = 0.30, and P(A | B) = 0.45.
a. Calculate $P(A \cap B)$.
b. Calculate $P(A \cup B)$.
c. Calculate $P(B \mid A)$.
Solution

Practice Problem
Consider the following probabilities: $P(A) = 0.40$, $P(B) = 0.50$, and $P(A^c \cap B^c) = 0.24$. Find:
a. $P(A^c \mid B^c)$
b. $P(A^c \cup B^c)$
c. $P(A \cup B)$
Hint for part c: $P(A^c \cap B^c) = P((A \cup B)^c) = 1 - P(A \cup B) = 0.24$.
Solution
a. 0.48
b. 0.86
c. 0.76

Practice Problem
An analyst estimates that the probability of default on a seven-year AA-rated bond is 0.06, while that on a seven-year A-rated bond is 0.13. The probability that they will both default is 0.04.
a. What is the probability that at least one of the bonds defaults?
b. What is the probability that neither the seven-year AA-rated bond nor the seven-year A-rated bond defaults?
c. Given that the seven-year AA-rated bond defaults, what is the probability that the seven-year A-rated bond also defaults?
Solution
Let A be the event that the AA-rated bond defaults and B the event that the A-rated bond defaults: $P(A) = 0.06$, $P(B) = 0.13$, and $P(A \cap B) = 0.04$.
a. $P(A \cup B) = P(A) + P(B) - P(A \cap B) = 0.06 + 0.13 - 0.04 = 0.15$
b. $P((A \cup B)^c) = 1 - P(A \cup B) = 1 - 0.15 = 0.85$

Practice Problem
Apple products have become a household name in America. Suppose that the likelihood of owning an Apple product is 61% for households with kids and 48% for households without kids. Suppose there are 1,200 households in a representative community, of which 820 are with kids and the rest are without kids.
a. Are the events “household with kids” and “household without kids” mutually exclusive and exhaustive? Explain.
b. What is the probability that a household is without kids?
c. What is the probability that a household is with kids and owns an Apple product?
d. What is the probability that a household is without kids and does not own an Apple product?

Contingency tables and probabilities
• A contingency table is useful when examining the relationship between two categorical variables.
• It shows the frequencies for two categorical variables, x and y.
• Each cell represents a mutually exclusive combination of the pair of x and y values.
• We can estimate an empirical probability by calculating the relative frequency of the occurrence of the event.

Practice Problem
Consider the following contingency table.

          A     A^c
  B       26    14
  B^c     34    26

a. Convert the contingency table into a joint probability table.
b. What is the probability that A occurs?
c. What is the probability that A and B occur?
d. Given that B has occurred, what is the probability that A occurs?
e. Given that $A^c$ has occurred, what is the probability that B occurs?
f. Are A and B mutually exclusive events? Explain.
g. Are A and B independent events? Explain.
Solution

          A      A^c    Total
  B       0.26   0.14   0.40
  B^c     0.34   0.26   0.60
  Total   0.60   0.40   1

Contingency Tables and Probabilities
Enrollment and age group from the introductory case
• What is the probability that a randomly selected attendee enrolls in the fitness center?
• What is the probability that a randomly selected attendee is over 50 years old?
• What is the probability that a randomly selected attendee enrolls in the fitness center and is over 50 years old?
• What is the probability that an attendee enrolls in the fitness center, given the attendee is over 50 years old?
Solution
Let E denote the event of enrolling in the fitness center. Let O denote the event of being over 50 years old.
a. What is the probability that a randomly selected attendee enrolls in the fitness center?
   $P(E) = \dfrac{140}{400} = 0.35$
b. What is the probability that a randomly selected attendee is over 50 years old?
   $P(O) = \dfrac{132}{400} = 0.33$
c. What is the probability that a randomly selected attendee enrolls in the fitness center and is over 50 years old?
   $P(E \cap O) = \dfrac{44}{400} = 0.11$
d. What is the probability that an attendee enrolls in the fitness center, given the attendee is over 50 years old?
   $P(E \mid O) = \dfrac{44}{132} = 0.33$, or equivalently $P(E \mid O) = \dfrac{P(E \cap O)}{P(O)} = \dfrac{0.11}{0.33} = 0.33$

Joint Probability
Let X and Y be two events in a sample space. Then the joint probability of the two events, written as $P(X \cap Y)$, is given by
  $P(X \cap Y) = \dfrac{\text{Number of observations in } X \cap Y}{\text{Total number of observations}}$

Total probability rule and Bayes’ Theorem
• The total probability rule expresses the probability of an event, $A$, in terms of probabilities of the intersection of $A$ with any mutually exclusive and exhaustive events.
• The total probability rule based on two events, $B$ and $B^c$, is given by
  $P(A) = P(A \cap B) + P(A \cap B^c)$.

Total probability rule and Bayes’ Theorem
• Bayes’ theorem is a procedure for updating probabilities based on new information; it uses the total probability rule.
• The original probability is an unconditional probability called a prior probability, in the sense that it reflects only what we know before the arrival of new information.
• On the basis of new information, we update the prior probability to arrive at a conditional probability called a posterior probability.
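
The updating step described above can be sketched in a few lines of Python. This is an illustration under stated assumptions: the function names are hypothetical, the numbers 0.30, 0.80, and 0.20 are made up for the example, and each joint probability in the total probability rule is written as a conditional probability times a prior, using the multiplication rule.

```python
def total_probability(prior_b, p_a_given_b, p_a_given_not_b):
    """P(A) = P(A|B) P(B) + P(A|B^c) P(B^c), for the two events B and B^c."""
    return p_a_given_b * prior_b + p_a_given_not_b * (1 - prior_b)


def posterior(prior_b, p_a_given_b, p_a_given_not_b):
    """P(B|A): the prior P(B) updated after observing that A occurred."""
    p_a = total_probability(prior_b, p_a_given_b, p_a_given_not_b)
    return p_a_given_b * prior_b / p_a


# Hypothetical inputs, chosen only for illustration.
prior_b = 0.30          # P(B), the prior
p_a_given_b = 0.80      # P(A | B)
p_a_given_not_b = 0.20  # P(A | B^c)

p_a = total_probability(prior_b, p_a_given_b, p_a_given_not_b)
p_b_given_a = posterior(prior_b, p_a_given_b, p_a_given_not_b)

print(f"P(A)   = {p_a:.4f}")   # 0.80*0.30 + 0.20*0.70
print(f"P(B|A) = {p_b_given_a:.4f}")
```

With these inputs the evidence A is more likely under B than under its complement, so the posterior for B ends up above the 0.30 prior.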
Bayes Theorem
• Bayes’ theorem is a fundamental concept in analytics that is widely utilized in solving various problems through the application of Bayesian statistics.
• By the definition of conditional probability,
  $P(A \mid B) = \dfrac{P(A \cap B)}{P(B)}$ and $P(B \mid A) = \dfrac{P(A \cap B)}{P(A)}$
• The two equations allow us to demonstrate that
  $P(B \mid A) = \dfrac{P(A \mid B) P(B)}{P(A)}$

Terms for Bayes Theorem components
• The prior probability (estimate of the probability without any further information) is denoted by P(B).
• P(B|A) is known as the posterior probability: given the new information (or additional evidence) that A has occurred, what is the probability that B occurs?
• P(A|B) denotes the likelihood of observing evidence A if B is true.
• P(A) represents the prior probability of A.
• Flow: Prior Probabilities → New Information → Application of Bayes’ Theorem → Posterior Probabilities

Total probability rule and Bayes’ Theorem
The posterior probability $P(B \mid A)$ can be found using the information on the prior probability $P(B)$ along with conditional probabilities as
  $P(B \mid A) = \dfrac{P(A \cap B)}{P(A)} = \dfrac{P(A \cap B)}{P(A \cap B) + P(A \cap B^c)} = \dfrac{P(A \mid B) P(B)}{P(A \mid B) P(B) + P(A \mid B^c) P(B^c)}$

Total probability rule and Bayes’ Theorem
The analysis can be extended to include $n$ mutually exclusive and exhaustive events $B_1, B_2, \dots, B_n$. For the extended case, Bayes’ theorem, for any $i = 1, 2, \dots, n$, is
  $P(B_i \mid A) = \dfrac{P(A \cap B_i)}{P(A)} = \dfrac{P(A \cap B_i)}{P(A \cap B_1) + P(A \cap B_2) + \dots + P(A \cap B_n)}$
  $P(B_i \mid A) = \dfrac{P(A \mid B_i) P(B_i)}{P(A \mid B_1) P(B_1) + P(A \mid B_2) P(B_2) + \dots + P(A \mid B_n) P(B_n)}$

Generalization of Bayes Theorem
[Figure: event generated from mutually exclusive subsets]

Practice Problem
Christine has always been weak in mathematics. Based on her performance prior to the final exam in Calculus, there is a 40% chance that she will fail the course if she does not have a tutor. With a tutor, her probability of failing decreases to 10%. There is only a 50% chance that she will find a tutor at such short notice.
a. What is the probability that Christine fails the course?
b. Christine ends up failing the course. What is the probability that she had found a tutor?
Solution

Practice Problem
• In a lie-detector test, an individual is asked to answer a series of questions while connected to a polygraph (lie detector).
• This instrument measures and records several physiological responses of the individual on the basis that false answers will produce distinctive measurements.
• Assume that 99% of the individuals who go in for a polygraph test tell the truth.
• These tests are considered to be 95% reliable.
• In other words, there is a 95% chance that the test will detect a lie if an individual actually lies.
• Let there also be a 0.5% chance that the test erroneously detects a lie even when the individual is telling the truth.
• An individual has just taken a polygraph test and the test has detected a lie. What is the probability that the individual was actually telling the truth?
Solution
• Let D and T correspond to the events that the polygraph detects a lie and that an individual is telling the truth, respectively, so $P(T) = 0.99$ and $P(T^c) = 0.01$.
• We formulate $P(D \mid T^c) = 0.95$ and $P(D \mid T) = 0.005$.
• We can use Bayes’ theorem to find
  $P(T \mid D) = \dfrac{P(D \mid T) P(T)}{P(D \mid T) P(T) + P(D \mid T^c) P(T^c)} = \dfrac{0.005 \times 0.99}{0.005 \times 0.99 + 0.95 \times 0.01} = 0.34256$
Solution
The table provided can assist in solving the problem in a systematic manner.

Monty Hall Problem

Practice Problem
Dr. Miriam Johnson has been teaching accounting for over 20 years. From her experience, she knows that 60% of her students do homework regularly. Moreover, 95% of the students who do their homework regularly pass the course.
She also knows that 85% of her students pass the course.
a. What is the probability that a student will do homework regularly and also pass the course?
b. What is the probability that a student will neither do homework regularly nor pass the course?
c. Are the events “pass the course” and “do homework regularly” mutually exclusive? Explain.
d. Are the events “pass the course” and “do homework regularly” independent? Explain.
Solution
Let event A correspond to “Do homework regularly” and B to “Pass the course”: $P(A) = 0.60$, $P(B \mid A) = 0.95$, $P(B) = 0.85$.
a. $P(A \cap B) = P(B \mid A) P(A) = 0.95(0.60) = 0.57$
b. $P((A \cup B)^c) = 1 - P(A \cup B)$, where $P(A \cup B) = P(A) + P(B) - P(A \cap B) = 0.60 + 0.85 - 0.57 = 0.88$. Therefore, $P((A \cup B)^c) = 1 - 0.88 = 0.12$.
c. No, because $P(A \cap B) = 0.57 \neq 0$.
d. No, because $P(B \mid A) = 0.95 \neq 0.85 = P(B)$.

Counting Rules to compute possible event arrangements
• Basic Rule
– If event X can occur in $n_1$ ways and event Y can occur in $n_2$ ways, then events X and Y can occur in $n_1 \times n_2$ ways.
– In general, m events can occur in $n_1 \times n_2 \times \dots \times n_m$ ways.
– How many different stock-keeping unit (SKU) labels can a hardware store make by using two letters (AA-ZZ) followed by four numerals (0-9)?

Factorial
• The number of ways to arrange n items in a specific order is n factorial.
• n factorial is the product of all integers from 1 to n.
• $n! = n(n-1)(n-2) \cdots 1$
• Factorials can be used to count the possible orderings of any n things.
• There are n different ways to choose the first, n − 1 different ways to choose the second, and so on.
• A home appliance service vehicle is required to make three stops (A, B, and C). In how many different ways may the three stops be arranged? $3! = 3 \times 2 \times 1 = 6$

Permutations
• A permutation, indicated by nPr, is an arrangement in a specific order of r randomly picked elements from a set of n objects.
• In other words, in how many ways can r things be arranged from n things while treating each arrangement as distinct (i.e., XYZ is distinct from ZYX)?
• $nPr = \dfrac{n!}{(n-r)!}$

Combinations
• A combination is a selection of r items chosen at random from n items, where the order of the elements is irrelevant (i.e., XYZ is the same as ZYX).
• The symbol for a combination is nCr.
• $nCr = \dfrac{n!}{r!\,(n-r)!}$

Practice Problem
Black boxes used in aircraft are manufactured by three companies, A, B, and C: 75% are manufactured by A, 15% by B, and 10% by C. The defect rates of black boxes manufactured by A, B, and C are 4%, 6%, and 8%, respectively. If a black box tested randomly is found to be defective, what is the probability that it is manufactured by company A?
Solution
• Let A, B, and C be the events that the black box is manufactured by companies A, B, and C, respectively, and let D be the event that the black box is defective.
• The probability $P(A \mid D)$ is calculated as follows:
  $P(A \mid D) = \dfrac{P(D \mid A) \times P(A)}{P(D)}$
• Now $P(D \mid A) = 0.04$ and $P(A) = 0.75$.
• $P(D) = 0.75 \times 0.04 + 0.15 \times 0.06 + 0.10 \times 0.08 = 0.047$
• So, $P(A \mid D) = \dfrac{0.04 \times 0.75}{0.047} = 0.6383$

Discrete & Continuous Probability Distribution
Case: Available Staff for Probable Customers
Anne Jones works as a manager at a nearby Starbucks. Starbucks revealed plans to close 500 stores in the United States in 2008. While Anne's shop will remain open, she is afraid that the surrounding store closings will have an impact on her business. Anne must decide on staffing requirements. A store with too many employees would be costly, while customers who choose not to wait may be lost if there are not enough staff. Using her understanding of the probability distribution of customer arrivals, Anne will be able to:
• Calculate the expected number of visits from a typical Starbucks customer in a particular time period.
• Determine the likelihood that a typical customer will visit the store a certain number of times in a particular time period.
Random Variables
• A random variable is a function that maps every possible outcome in the sample space to a real value.
• Each sample point in the sample space S is assigned a real number by this function.
• A random variable is a reliable and practical way of describing the results of a random experiment.

Discrete Random Variables
A discrete random variable is one that can only take on a finite or countably infinite set of values. Here are some examples of discrete random variables:
• Credit rating (typically categorized as low, medium, or high, or using designations such as AAA, AA, A, BBB, and so on).
• The number of orders received by an e-commerce retailer, which can be countably infinite.
• Customer churn (the random variable takes binary values: 1. Churn and 2. Do not churn).
• Fraud (the random variable takes binary values: 1. Fraudulent transaction and 2. Genuine transaction).
• Any experiment that involves counting (for example, the number of returns in a day from users of e-commerce sites such as Amazon and Flipkart; the number of candidates who do not accept job offers from an organization).

Continuous Random Variables
• A continuous random variable X is a random variable that can take on any value within an interval, so its set of possible values is uncountably infinite.
• The following are examples of random variables that are continuous:
• A company's market share (which can assume any value between 0% and 100%).
• A company's attrition rate as a proportion of its workforce.
• Engineering systems' time to failure.
• The amount of time necessary to execute an online order.
• Call and service centers’ resolution time for customer complaints.

Probability mass function
Every discrete random variable is associated with a probability distribution. For a discrete random variable, the probability that a random variable X takes a specific value $x_i$, $P(X = x_i)$, is called the probability mass function $P(x_i)$.
That is, a probability mass function is a function that maps each value of a discrete random variable to its probability.

Expected Value
• The expected value (or mean) of a discrete random variable is given by
  $E(X) = \sum_{i=1}^{n} x_i P(x_i)$
• $x_i$ is the specific value taken by a discrete random variable X and $P(x_i)$ is the corresponding probability, that is, $P(X = x_i)$.

Variance and Standard Deviation
• The variance of a discrete random variable is given by
  $Var(X) = \sum_{i=1}^{n} [x_i - E(X)]^2 \times P(x_i)$
• The standard deviation of a discrete random variable is given by
  $\sigma = \sqrt{Var(X)}$

Probability Density Function (pdf)
The probability density function, $f(x)$, is defined as the probability that the value of the random variable X lies in an infinitesimally small interval between $x$ and $x + \delta x$, divided by the length of that interval:
  $f(x) = \lim_{\delta x \to 0} \dfrac{P(x \le X \le x + \delta x)}{\delta x}$

Cumulative Distribution Function (CDF)
• The cumulative distribution function (CDF) of a continuous random variable is defined by
  $F(a) = P(X \le a) = \int_{-\infty}^{a} f(x)\,dx$

Key Points about the Distributions
• The probability density function and cumulative distribution function of a continuous random variable satisfy the following properties:
  $f(x) \ge 0$
  $F(\infty) = \int_{-\infty}^{+\infty} f(x)\,dx = 1$
  $P(a \le X \le b) = \int_{a}^{b} f(x)\,dx = F(b) - F(a)$
• The probability between two values a and b, $P(a \le X \le b)$, is the area between the values a and b under the probability density function.

Key Points about the Distributions
• The expected value of a continuous random variable, E(X), is given by
  $E(X) = \int_{-\infty}^{+\infty} x f(x)\,dx$
• The variance of a continuous random variable, Var(X), is given by
  $Var(X) = \int_{-\infty}^{+\infty} [x - E(X)]^2 f(x)\,dx$

Types of Probability Distributions
[Chart: discrete and continuous distributions, listing the Binomial, Normal, Uniform, Exponential, Poisson, Logistic, and Hypergeometric distributions]

Practice Problem
Brad Williams is the owner of a large car dealership in Chicago.
Brad decides to construct an incentive compensation program that equitably and consistently compensates employees on the basis of their performance.
a. Calculate the expected value of the annual bonus amount.
b. Calculate the variance and the standard deviation of the annual bonus amount.
c. What is the total annual amount that Brad can expect to pay in bonuses if he has 25 employees?
Solution
Let the random variable X denote the bonus amount (in $1,000s).
a. The expected value is $E(X) = \mu = \sum x_i P(X = x_i) = 4.2$, or $4,200.
b. The variance is $Var(X) = \sigma^2 = \sum (x_i - \mu)^2 P(X = x_i) = 9.97$ (in ($1,000s)²), and the standard deviation is $SD(X) = \sigma = 3.158$, or $3,158.
c. If Brad has 25 employees, he can expect to pay $4,200 × 25 = $105,000 in bonuses.

Binomial Distribution
A random variable X is said to follow a Binomial distribution when
• Each trial can have only two outcomes, success and failure (such trials are also known as Bernoulli trials).
• The objective is to find the probability of getting k successes out of n trials.
• The probability of success is p and thus the probability of failure is (1 − p).
• The probability p is constant and does not change between trials.

Binomial Probability Distribution
Four Properties of a Binomial Experiment
1. The experiment consists of a sequence of n identical trials.
2. Two outcomes, success and failure, are possible on each trial.
3. The probability of a success, denoted by p, does not change from trial to trial. (This is referred to as the stationarity assumption.)
4. The trials are independent.

Binomial Probability Distribution
• Our interest is in the number of successes occurring in the n trials.
• We let x denote the number of successes occurring in the n trials.

Binomial Probability Function
  $f(x) = \dfrac{n!}{x!\,(n-x)!}\, p^x (1-p)^{(n-x)}$
Where:
x = the number of successes
p = the probability of a success on one trial
n = the number of trials
f(x) = the probability of x successes in n trials
n! = n(n – 1)(n – 2) ….. (2)(1)
• The factor $\dfrac{n!}{x!\,(n-x)!}$ is the number of experimental outcomes providing exactly x successes in n trials.
• The factor $p^x (1-p)^{(n-x)}$ is the probability of a particular sequence of trial outcomes with x successes in n trials.

Probability Mass Function (PMF) of Binomial Distribution
• The PMF of the Binomial distribution (the probability that the number of successes will be exactly x out of n trials) is given by
  $PMF(x) = P(X = x) = \binom{n}{x} p^x (1-p)^{n-x}, \quad 0 \le x \le n$, where $\binom{n}{x} = \dfrac{n!}{x!\,(n-x)!}$

Cumulative Distribution Function (CDF) of Binomial Distribution
• The CDF of a binomial distribution, F(a), representing the probability that the random variable X takes a value less than or equal to a, is given by
  $F(a) = P(X \le a) = \sum_{k=0}^{a} P(X = k) = \sum_{k=0}^{a} \binom{n}{k} p^k (1-p)^{n-k}$

Mean and Variance of Binomial Distribution
• The mean of a binomial distribution is given by
  $E(X) = \sum_{x=0}^{n} x \times PMF(x) = \sum_{x=0}^{n} x \binom{n}{x} p^x (1-p)^{n-x} = np$
• The variance of a binomial distribution is given by
  $Var(X) = \sum_{x=0}^{n} (x - E(X))^2 \times PMF(x) = \sum_{x=0}^{n} (x - E(X))^2 \binom{n}{x} p^x (1-p)^{n-x} = np(1-p)$
• If the number of trials (n) in a binomial distribution is large, then it can be approximated by a normal distribution with mean np and variance npq, where q = 1 − p.

Mean, Variance, and SD of Binomial Distribution
• Expected value: $E(X) = np$
• Variance: $Var(X) = np(1-p)$
• Standard deviation: $\sigma = \sqrt{np(1-p)}$

Practice Problem
Fashion Trends Online (FTO) is an e-commerce company that sells women's apparel. It is observed that about 10% of their customers return the items they purchased for various reasons (such as size, color, and material mismatch). On a particular day, 20 customers purchased items from FTO. Calculate:
(a) Probability that exactly 5 customers will return the items.
(b) Probability that a maximum of 5 customers will return the items.
(c) Probability that more than 5 customers will return the items purchased by them.
(d) Average number of customers who are likely to return the items.
(e) The variance and the standard deviation of the number of returns.
Solution
In this case, n = 20 and p = 0.1.
(a) The probability that exactly 5 customers will return the items purchased is
  $P(X = 5) = \binom{20}{5} \times (0.1)^5 \times (0.9)^{15} = 0.03192$
(b) The probability that a maximum of 5 customers will return the items purchased is
  $P(X \le 5) = \sum_{k=0}^{5} \binom{20}{k} \times (0.1)^k \times (0.9)^{20-k} = 0.9887$
(c) The probability that more than 5 customers will return the items is
  $P(X > 5) = 1 - P(X \le 5) = 1 - \sum_{k=0}^{5} \binom{20}{k} \times (0.1)^k \times (0.9)^{20-k} = 1 - 0.9887 = 0.0113$
(d) The average number of customers who are likely to return the items is
  $E(X) = n \times p = 20 \times 0.1 = 2$
(e) The variance of a binomial distribution is given by
  $Var(X) = n \times p \times (1 - p) = 20 \times 0.1 \times 0.9 = 1.8$
  and the corresponding standard deviation is $\sqrt{1.8} = 1.3416$.

Poisson Distribution
• It is another type of discrete probability distribution.
• It is considered the limiting form of the binomial distribution in which n, the number of trials, becomes very large and p, the probability of success, is very small.
• It was proposed by the French mathematician Siméon Poisson in 1837.

Why we need Poisson Distribution
It is used in cases where the chance of any individual event being a success is very small. This distribution is used to describe the behavior of rare events. Examples:
• The number of defective screws per box of 2000 screws.
• The number of printing mistakes on each page of the first proof of a book
• The number of air accidents in India in one year
• The number of scratches on a sheet of glass
• The number of customers who use a new banking app in a day
• The number of spam emails received in a month
• The number of defects in a 50-yard roll of fabric

Characteristics of Poisson Distribution
An experiment satisfies a Poisson process under the following conditions:
• The random variable X should be discrete.
• Each occurrence must have two alternatives, such as success and failure.
• It is applicable in those cases where the number of trials n is very large and p is very small.
• Statistical independence is assumed.

Two Properties of a Poisson Experiment
1. The probability of an occurrence is the same for any two intervals of equal length.
2. The occurrence or nonoccurrence in any interval is independent of the occurrence or nonoccurrence in any other interval.

Poisson Probability Function
  $f(x) = \dfrac{\mu^x e^{-\mu}}{x!}$
Where:
x = the number of occurrences in an interval
f(x) = the probability of x occurrences in an interval
µ = mean number of occurrences in an interval
e = 2.71828
x! = x(x – 1)(x – 2) . . . (2)(1)

Poisson Distribution
• For a Poisson random variable X, the probability of x successes over a given interval of time or space is
  $P(X = x) = \dfrac{e^{-\mu} \mu^x}{x!}$ for $x = 0, 1, 2, \dots$
• $\mu$ is the mean number of successes.
• $e \approx 2.718$ is the base of the natural logarithm.
• The mean is $E(X) = \mu$.
• The variance is $Var(X) = \sigma^2 = E(X) = \mu$.

Example: Mercy Hospital
Patients arrive at the emergency room of Mercy Hospital at the average rate of 6 per hour on weekend evenings. What is the probability of 4 arrivals in 30 minutes on a weekend evening?
Solution
Using the probability function with µ = 6/hour = 3/half-hour and x = 4:
  $f(4) = \dfrac{3^4 (2.71828)^{-3}}{4!} = 0.1680$

Numerical Example
The average number of accidents at a particular intersection every year is 18.
(a) Calculate the probability that there are exactly 2 accidents there this month.

Solution
There are 12 months in a year, so \( \mu = 18/12 = 1.5 \) accidents per month.
\( P(X = x) = \frac{e^{-\mu} \mu^{x}}{x!} \)
\( P(X = 2) = \frac{e^{-1.5} (1.5)^{2}}{2!} = 0.2510 \)

Numerical Example
Anne is concerned about staffing needs at the Starbucks that she manages. She believes that the typical Starbucks customer averages 18 visits to the store over a 30-day month.
a. How many visits should Anne expect in a 5-day period from a typical Starbucks customer?
b. What is the probability that a customer visits the chain five times in a 5-day period?
c. What is the probability that a customer visits the chain no more than two times in a 5-day period?
d. What is the probability that a customer visits the chain at least three times in a 5-day period?

Solution
a. Given the rate of 18 visits over a 30-day month, the mean for the 30-day period is \( \mu_{30} = 18 \), so the mean for the 5-day period is \( \mu_{5} = 18 \times 5/30 = 3 \).
b. \( P(X = 5) = \frac{e^{-3} 3^{5}}{5!} = 0.1008 \)
c. \( P(X \le 2) = P(X = 0) + P(X = 1) + P(X = 2) = 0.0498 + 0.1494 + 0.2240 = 0.4232 \)
d. \( P(X \ge 3) = P(X = 3) + P(X = 4) + \cdots \) cannot be summed term by term because there are infinitely many terms, so use the complement:
\( P(X \ge 3) = 1 - [P(X = 0) + P(X = 1) + P(X = 2)] = 1 - 0.4232 = 0.5768 \)

Numerical Example using Excel
Craft breweries that make beer in small batches are experiencing spectacular growth in bars and liquor stores across the nation. The craft beer industry now boasts 4,269 breweries, representing a 12% share of the total beer market in the United States (Fortune, March 22, 2016). It has been estimated that 1.5 craft breweries open every day. Assume this number represents an average that remains constant over time.
a. What is the probability that no more than 10 craft breweries open in a week?
b. What is the probability that exactly 10 craft breweries open in a week?

Numerical Example using Excel: Excel Command
a.
With 1.5 openings per day, the weekly mean is μ = 7 × 1.5 = 10.5. To find the probability that no more than 10 craft breweries open in a week, P(X ≤ 10), enter =POISSON.DIST(10, 10.5, 1). Excel returns 0.5207, so there is a 52.07% chance that no more than 10 craft breweries open in a week.
b. To find the probability that exactly 10 craft breweries open in a week, P(X = 10), enter =POISSON.DIST(10, 10.5, 0). Excel returns 0.1236, so there is a 12.36% chance that exactly 10 craft breweries open in a week.

Practice problem
Assume that X is a Poisson random variable with μ = 20. Use Excel's function options to find the following probabilities.
a) P(X < 14)
b) P(X ≥ 20)
c) P(X = 25)
d) P(18 ≤ X ≤ 23)

Practice problem: Solution
a. \( P(X < 14) = P(X \le 13) = 0.0661 \)
In Excel: =POISSON.DIST(13,20,TRUE)
b. \( P(X \ge 20) = 1 - P(X \le 19) = 0.5297 \)
In Excel: =1 - POISSON.DIST(19,20,TRUE)
c. \( P(X = 25) = 0.0446 \)
In Excel: =POISSON.DIST(25,20,FALSE)
d. \( P(18 \le X \le 23) = P(X \le 23) - P(X \le 17) = 0.4905 \)
In Excel: =POISSON.DIST(23,20,TRUE) - POISSON.DIST(17,20,TRUE)

Practice problem
A textile manufacturing process finds that, on average, two flaws occur per 50 yards of material produced.
a. What is the probability of exactly two flaws in a 50-yard piece of material?
b. What is the probability of no more than two flaws in a 50-yard piece of material?
c. What is the probability of no flaws in a 25-yard piece of material?

Practice problem: Solution
a. Exactly two flaws in a 50-yard piece, with μ = 2: \( P(X = 2) = 0.2707 \)
In Excel: =POISSON.DIST(2,2,0) returns 0.270670566
b. No more than two flaws in a 50-yard piece: \( P(X \le 2) = P(X = 0) + P(X = 1) + P(X = 2) = 0.6767 \)
In Excel: =POISSON.DIST(2,2,1)
c. No flaws in a 25-yard piece. The mean scales with the length, so μ = 2 × 25/50 = 1: \( P(X = 0) = 0.3679 \)
In Excel: =POISSON.DIST(0,1,0) returns 0.367879

Exponential distribution
The exponential distribution is a useful non-symmetric continuous probability distribution.
It is related to the Poisson distribution. The Poisson random variable counts the number of occurrences over a given interval of time or space; the exponential random variable measures the time or space between successive occurrences or arrivals. The exponential random variable is nonnegative.

Exponential distribution
• The probability distribution is defined in terms of its rate parameter \( \lambda \).
• This is the same rate as in the Poisson distribution (written μ per interval above): the average number of arrivals per unit of time or space.
• For an exponential random variable, the mean is the inverse of the rate, \( E(X) = 1/\lambda \): the average time between arrivals.
• It is used to model lifetimes or failure times.
• For \( x \ge 0 \), the density is \( f(x) = \lambda e^{-\lambda x} \) and the cumulative probability is \( P(X \le x) = 1 - e^{-\lambda x} \), so \( P(X > x) = e^{-\lambda x} \).

Exponential distribution
Example: the time between e-mail messages during work hours is exponentially distributed with a mean of 25 minutes.
a. Calculate the rate parameter \( \lambda \).
b. What is the probability that you do not get an e-mail for more than one hour?
c. What is the probability that you get an e-mail within 10 minutes?

Solution
a. \( \lambda = \frac{1}{E(X)} = \frac{1}{25} = 0.04 \) e-mails per minute
b. \( P(X > 60) = e^{-0.04(60)} = 0.0907 \)
c. \( P(X \le 10) = 1 - e^{-0.04(10)} = 1 - 0.6703 = 0.3297 \)

Continuous Uniform Distribution
• One of the simplest continuous probability distributions is the continuous uniform distribution.
• This distribution is appropriate when the underlying random variable is equally likely to assume any value within a specified range [a, b].
\( f(x) = \begin{cases} \frac{1}{b - a} & \text{for } a \le x \le b \\ 0 & \text{for } x < a \text{ or } x > b \end{cases} \)
\( E(X) = \mu = \frac{a + b}{2} \)
\( SD(X) = \sigma = \sqrt{(b - a)^{2}/12} \)

Continuous Uniform Distribution
• The probability density function does not directly represent probability.
• The area under the curve represents probability.
• For the uniform distribution this is the area of a rectangle, base times height: (length of the interval) × \( \frac{1}{b - a} \).

Practice Problem
• Example: Sales for a particular cosmetic line follow a continuous uniform distribution with a lower limit of $2,500 and an upper limit of $5,000.
• What are the mean and standard deviation?
• \( \mu = \frac{2500 + 5000}{2} = 3750 \), that is, $3,750
• \( \sigma = \sqrt{(5000 - 2500)^{2}/12} = 721.69 \), that is, $721.69

Solved Problem
Example, continued. What is the probability that sales exceed $4,000?
\( P(X > 4000) = 1000 \times \frac{1}{5000 - 2500} = 0.40 \)

Solved Problem
Example, continued. What is the probability that sales are between $3,200 and $3,800?
\( P(3200 \le X \le 3800) = 600 \times \frac{1}{5000 - 2500} = 0.24 \)

Practice Problem
Example: For a continuous random variable X with an upper bound of 4, P(0 ≤ X ≤ 2.5) = 0.54 and P(2.5 ≤ X ≤ 4) = 0.16. Calculate the following probabilities.
a. P(X < 0)
b. P(X > 2.5)
c. P(0 ≤ X ≤ 4)

Solution
a. \( P(X < 0) = 1 - P(X \ge 0) = 1 - (0.54 + 0.16) = 1 - 0.70 = 0.30 \)
b. Since 4 is the upper bound, \( P(X > 2.5) = P(2.5 \le X \le 4) = 0.16 \)
c. \( P(0 \le X \le 4) = P(0 \le X \le 2.5) + P(2.5 \le X \le 4) = 0.54 + 0.16 = 0.70 \)

Practice Problem
Suppose the average price of electricity for a New England customer follows the continuous uniform distribution with a lower bound of 12 cents per kilowatt-hour and an upper bound of 20 cents per kilowatt-hour.
a. Calculate the average price of electricity for a New England customer.
b. What is the probability that a New England customer pays less than 15.5 cents per kilowatt-hour?
c. A local carnival is not able to operate its rides if the average price of electricity is more than 14 cents per kilowatt-hour. What is the probability that the carnival will need to close?

Solution
Applying the uniform formulas with a = 12 and b = 20:
a. \( \mu = \frac{12 + 20}{2} = 16 \) cents per kilowatt-hour
b. \( P(X < 15.5) = (15.5 - 12) \times \frac{1}{20 - 12} = 0.4375 \)
c. \( P(X > 14) = (20 - 14) \times \frac{1}{20 - 12} = 0.75 \)

Practice Problem
You were informed at the nursery that your peach tree will definitely bloom sometime between March 18 and March 30. Assume that the bloom times follow a continuous uniform distribution between these specified dates.
a. What is the probability that the tree does not bloom until March 25?
b. What is the probability that the tree will bloom by March 20?

Solution
Treating the bloom date X as uniform on [18, 30], measured in days of March:
a. \( P(X \ge 25) = (30 - 25) \times \frac{1}{30 - 18} = 0.4167 \)
b. \( P(X \le 20) = (20 - 18) \times \frac{1}{30 - 18} = 0.1667 \)

Practice Problem
For a continuous random variable X, P(20 ≤ X ≤ 40) = 0.15 and P(X > 40) = 0.16. Calculate the following probabilities.
a. P(X = 40)
b.
P(X < 40)

Solution
a. The probability that a continuous random variable takes any particular value is zero: \( P(X = x) = 0 \) for any value x, so \( P(X = 40) = 0 \). This zero probability of taking any one value is what distinguishes continuous random variables from discrete ones.
b. Because \( P(X = 40) = 0 \), \( P(X < 40) = 1 - P(X > 40) = 1 - 0.16 = 0.84 \).

The Hypergeometric Distribution
The binomial distribution is appropriate when you sample with replacement:
• The probability of success does not change from trial to trial.
• The trials are independent.
Sampling without replacement means that after an item is drawn, it is not put back for subsequent draws:
• The trials are not independent; each depends on the earlier ones.
• The probability of success changes from trial to trial.
Use the hypergeometric distribution in place of the binomial distribution when sampling without replacement. It gives the number of successes in a two-outcome experiment whose trials are not independent of one another.
With replacement means the same item can be chosen more than once. Without replacement means the same item cannot be selected more than once.

The Hypergeometric Distribution
The probability of x successes in a random selection of n items is
\( P(X = x) = \frac{\binom{S}{x} \binom{N - S}{n - x}}{\binom{N}{n}} \)
where N is the population size, S is the number of population successes, and n is the sample size, for \( x = 0, 1, 2, \ldots, n \) if \( n \le S \) or \( x = 0, 1, 2, \ldots, S \) if \( n > S \).

The Hypergeometric Distribution
The formula consists of three parts:
• \( \binom{S}{x} \): the number of ways to select x successes from S population successes
• \( \binom{N - S}{n - x} \): the number of ways to select n - x failures from N - S population failures
• \( \binom{N}{n} \): the number of ways a sample of size n can be selected from a population of size N
The mean and variance are
\( \mu = \frac{nS}{N} \) and \( \sigma^{2} = \frac{N - n}{N - 1} \cdot n \cdot \frac{S}{N} \left(1 - \frac{S}{N}\right) \)

The Hypergeometric Distribution: Excel Command
The hypergeometric distribution is important because it gives accurate probabilities when the number of trials is not very large and the sample is taken from a finite population without replacement. In Excel, the corresponding function is =HYPGEOM.DIST(x, n, S, N, cumulative).

Solution using Excel
Example: Inspect five mangoes from a box containing 20 mangoes, exactly two of which are damaged. What is the probability that one of the five mangoes is damaged? (n = 5, N = 20, x = 1, S = 2)
\( P(X = 1) = \frac{\binom{2}{1} \binom{20 - 2}{5 - 1}}{\binom{20}{5}} = 0.3947 \)
In Excel: =HYPGEOM.DIST(1,5,2,20,FALSE)
If the manager decides to reject the shipment when one or more of the mangoes are damaged, what is the probability that the shipment will be rejected?
\( P(X = 0) = \frac{\binom{2}{0} \binom{20 - 2}{5 - 0}}{\binom{20}{5}} = 0.5526 \)
\( P(X \ge 1) = 1 - P(X = 0) = 1 - 0.5526 = 0.4474 \)
Calculate the expected value, the variance, and the standard deviation.
• \( E(X) = 5 \cdot \frac{2}{20} = 0.50 \)
• \( Var(X) = 5 \cdot \frac{2}{20} \left(1 - \frac{2}{20}\right) \cdot \frac{20 - 5}{20 - 1} = 0.3553 \)
• \( SD(X) = \sqrt{0.3553} = 0.5960 \)

Practice Problem
Assume that X is a hypergeometric random variable with N = 25, S = 3, and n = 4. Calculate the following probabilities.
a. P(X = 0)
b. P(X = 1)
c. P(X ≤ 1)

Normal Distribution
• The normal distribution is bell-shaped and symmetric around its mean.
• The normal distribution is completely described by two parameters: the population mean μ and the population variance σ².
• The normal distribution is asymptotic in the sense that the tails get closer and closer to the horizontal axis but never touch it.
The total area under the curve and above the horizontal axis is equal to 1:
\( \int_{-\infty}^{\infty} f(x)\,dx = \frac{1}{\sigma\sqrt{2\pi}} \int_{-\infty}^{\infty} e^{-\frac{1}{2\sigma^{2}}(x - \mu)^{2}}\,dx = 1 \)
[Figure: normal curve marking μ, x1, and x2]

Normal Distribution
\( P(x_1 < x < x_2) = \frac{1}{\sigma\sqrt{2\pi}} \int_{x_1}^{x_2} e^{-\frac{1}{2\sigma^{2}}(x - \mu)^{2}}\,dx \)
denotes the probability of x in the interval \( (x_1, x_2) \).
Mean: \( \mu = \int_{-\infty}^{\infty} x\,f(x)\,dx = \frac{1}{\sigma\sqrt{2\pi}} \int_{-\infty}^{\infty} x\,e^{-\frac{1}{2\sigma^{2}}(x - \mu)^{2}}\,dx \)
Variance: \( \sigma^{2} = \frac{1}{\sigma\sqrt{2\pi}} \int_{-\infty}^{\infty} (x - \mu)^{2}\,e^{-\frac{1}{2}\left[\frac{x - \mu}{\sigma}\right]^{2}}\,dx \); the standard deviation is its square root, σ.

Standard Normal Distribution
• μ controls location; σ controls spread.
• The wider the curve, the larger the standard deviation and the more variation exists in the process.
• The location of the normal distribution is determined by the mean, and its dispersion or spread is determined by the standard deviation.

Standard Normal Distribution
Calculating \( P(x_1 < x < x_2) \) directly is computationally difficult for arbitrary \( (x_1, x_2) \), μ, and σ. To avoid this difficulty, we use the standard normal distribution.
The standard normal distribution is a special case of the normal distribution, denoted by Z.
Condition 1: The mean is equal to zero, μ = E(Z) = 0.
Condition 2: The standard deviation (and hence the variance) is equal to one, σ = SD(Z) = 1.
The lowercase letter z denotes a value that the standard normal variable Z takes.

Finding the probability of the standard normal distribution
Z-tables for standard normal probabilities: the tabulated areas under the standard normal density are the probabilities of intervals extending from the mean μ = 0 to a point z to its right. For z ≥ 0, the cumulative probability is therefore \( P(Z \le z) = 0.5 + \) (tabulated area).
Any normally distributed data can be converted to the standardized form using the formula
\( z = \frac{x - \mu}{\sigma} \)
where x is the data point in question and z (the z-score) measures the number of standard deviations between that data point and the mean.

Finding the probability of the standard normal distribution
Example: \( P(Z \le 1.52) = 0.9357 \)
Example: \( P(Z \le -1.96) = 0.0250 \)
Example: \( P(0 \le Z \le 1.96) = P(Z \le 1.96) - P(Z \le 0) = 0.9750 - 0.5000 = 0.4750 \), which is the tabulated area for z = 1.96.
Example: \( P(1.52 \le Z \le 1.96) = P(Z \le 1.96) - P(Z \le 1.52) = 0.9750 - 0.9357 = 0.0393 \), or equivalently, from the tabulated areas, 0.4750 - 0.4357 = 0.0393.
Example: \( P(-1.52 \le Z \le 1.96) = P(Z \le 1.96) - P(Z \le -1.52) = 0.9750 - 0.0643 = 0.9107 \)
Example: find z that satisfies \( P(Z \le z) = 0.6808 \); the answer is z = 0.47.
Example: find z that satisfies \( P(Z > z) = 0.0212 \). Then \( P(Z \le z) = 1 - 0.0212 = 0.9788 \), the tabulated area is 0.9788 - 0.5 = 0.4788, and z = 2.03.

Normal distribution transformation
Any normally distributed random variable can be transformed into the standard normal random variable as follows.
Let X have a normal distribution with mean μ and standard deviation σ. X can be transformed into Z using \( Z = \frac{X - \mu}{\sigma} \).
Any value x can be transformed into z using \( z = \frac{x - \mu}{\sigma} \).
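The standardization step and the table lookups above can also be checked numerically. The following is a minimal sketch that assumes Python 3.8 or later and uses only the standard library; the helper names z_score and normal_interval_prob, and the sample values (mean 50, standard deviation 10), are illustrative and do not come from the course material.

```python
from statistics import NormalDist

# Standard normal random variable Z: mean 0, standard deviation 1.
Z = NormalDist(mu=0.0, sigma=1.0)


def z_score(x, mu, sigma):
    """Number of standard deviations that x lies from the mean."""
    return (x - mu) / sigma


def normal_interval_prob(x1, x2, mu, sigma):
    """P(x1 <= X <= x2) for a normal X, computed by standardizing both ends."""
    return Z.cdf(z_score(x2, mu, sigma)) - Z.cdf(z_score(x1, mu, sigma))


# Table-style lookups matching the examples in the text.
print(round(Z.cdf(1.96) - Z.cdf(0.0), 4))  # P(0 <= Z <= 1.96), about 0.4750
print(round(Z.inv_cdf(1 - 0.0212), 2))     # z with P(Z > z) = 0.0212, about 2.03

# Hypothetical X with mean 50 and standard deviation 10 (made-up values).
print(round(normal_interval_prob(50, 69.6, 50, 10), 4))
```

Here Z.cdf plays the role of the cumulative z-table and Z.inv_cdf the role of reading the table backwards to find z for a given probability.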

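The worked Poisson, exponential, uniform, and hypergeometric examples in this section can be cross-checked outside Excel. The sketch below assumes Python 3.8 or later and uses only the standard library; all function names are illustrative choices rather than anything defined in the course, and the formulas are the ones stated in the corresponding slides.

```python
from math import comb, exp, factorial


def poisson_pmf(x, mu):
    """P(X = x) for a Poisson variable with mean mu."""
    return exp(-mu) * mu ** x / factorial(x)


def poisson_cdf(x, mu):
    """P(X <= x): sum of the Poisson probabilities from 0 to x."""
    return sum(poisson_pmf(k, mu) for k in range(x + 1))


def exponential_tail(x, lam):
    """P(X > x) for an exponential variable with rate lam."""
    return exp(-lam * x)


def uniform_prob(lo, hi, a, b):
    """P(lo <= X <= hi) for X uniform on [a, b]: interval length times 1/(b - a)."""
    lo, hi = max(lo, a), min(hi, b)
    return max(hi - lo, 0.0) / (b - a)


def hypergeom_pmf(x, n, S, N):
    """P(X = x) successes in n draws without replacement (S successes among N)."""
    return comb(S, x) * comb(N - S, n - x) / comb(N, n)


print(round(poisson_pmf(4, 3), 4))                     # Mercy Hospital, about 0.1680
print(round(poisson_cdf(2, 2), 4))                     # textile flaws, about 0.6767
print(round(exponential_tail(60, 0.04), 4))            # no e-mail for an hour, about 0.0907
print(round(uniform_prob(4000, 5000, 2500, 5000), 2))  # sales above $4,000, 0.4
print(round(hypergeom_pmf(1, 5, 2, 20), 4))            # one damaged mango, about 0.3947
```

poisson_pmf and poisson_cdf correspond to POISSON.DIST with the last argument set to FALSE and TRUE respectively.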