Introduction to Probability
Dr. Indranil Ghosh, IT & Analytics Area, Institute of Management Technology, Hyderabad, Telangana, India

A Brief History
Early Generalizations
Axiomatic Development

“But to us, probability is the very guide of life.” Bishop Joseph Butler

Chance of Occurrence

Random Experiment
• A random experiment is an experiment in which the outcome is not known with certainty.
• Predictive analysis mainly deals with random experiments such as:
  • Predicting the quarterly revenue of an organization
  • Customer churn
  • Demand for a product in a future time period

Fundamentals

Experiment
• An experiment is a process which produces outcomes. For example, if we toss a fair coin, we may obtain either a head or a tail. So, tossing this fair coin is an experiment which can produce two outcomes, either a head or a tail.
• Similarly, when we roll a die, six possible outcomes can arise, that is, any of the six numbers 1, 2, 3, 4, 5, 6 can turn up on the upper face of the die.
• An interview to gauge the job satisfaction levels of the employees in an organization is also an experiment because it produces outcomes.

Event
• An event is an outcome, or a set of outcomes, of an experiment.
• If the experiment is to roll a die, an event can be defined as obtaining a 6 on the upper face of the die.
• If the experiment is to toss a fair coin, an event can be obtaining a tail.
• If an event has a single possible outcome, it is called a simple (or elementary) event.
• A subset of outcomes corresponding to a specific event is called an event space.

Union of two sets
Intersection of two sets

Compound Event
• The joint occurrence of two or more simple events is known as a compound event.
• In other words, if two or more events are connected with each other, then their simultaneous occurrence is called a compound event.
• In an experiment in which two coins are tossed, the event of obtaining “one head and one tail” is a compound event as it consists of two events: (1) the occurrence of one head and (2) the occurrence of one tail.

Independent and Dependent Events
• Two events are said to be independent if the occurrence or non-occurrence of one is not affected by the occurrence or non-occurrence of the other.
• For example, when tossing a coin, a tail on the first toss does not affect the probability of obtaining a tail on the second toss. So, the outcomes of the two tosses are independent events.
• Two or more events are said to be dependent if the occurrence of one event influences the occurrence of the other. Dependence indicates a relationship between two events and implies that knowledge of one event can be used in assessing the occurrence of the other event. An example is actual sales and the expense incurred in advertising.

Mutually Exclusive Events
• Two or more events are said to be mutually exclusive if the occurrence of one implies that the other cannot occur. In other words, two events are mutually exclusive if the occurrence of one of them rules out the occurrence of the other.
• For example, in an unbiased coin tossing experiment, either a head can occur or a tail can occur, but the two events head and tail cannot occur together. Similarly, when rolling a die, the two numbers 3 and 4 cannot both occur on the upper face in one throw.

Equally Likely Events
• Two or more events are said to be equally likely if each has an equal chance of occurrence.
• In other words, two or more events are said to be equally likely if none of them can be expected to occur in preference over the others. For example, in an unbiased coin tossing experiment, both the outcomes, that is, head and tail, have an equal chance of occurrence.
• Similarly, in a die rolling experiment, all possible outcomes, that is, 1, 2, 3, 4, 5, 6, are equally likely because none of the outcomes can occur in preference over the others.
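The three definitions above can be checked on the coin and die examples by enumerating the sample space. The sketch below is illustrative only: the helper `prob` and the variable names are assumptions introduced here, and independence is tested with the product rule P(A ∩ B) = P(A) P(B).

```python
from fractions import Fraction
from itertools import product

def prob(event, space):
    """Classical probability: favourable outcomes / equally likely outcomes."""
    return Fraction(len(event & space), len(space))

# Two tosses of a fair coin: four equally likely outcomes.
tosses = set(product("HT", repeat=2))
tail_first = {o for o in tosses if o[0] == "T"}
tail_second = {o for o in tosses if o[1] == "T"}

# Independent events: P(A and B) equals P(A) * P(B).
independent = (
    prob(tail_first & tail_second, tosses)
    == prob(tail_first, tosses) * prob(tail_second, tosses)
)

# One roll of a fair die.
die = set(range(1, 7))
three, four = {3}, {4}

# Mutually exclusive events: they cannot occur together, so P(A and B) = 0.
mutually_exclusive = prob(three & four, die) == 0

# Equally likely events: every face has the same probability.
equally_likely = len({prob({face}, die) for face in die}) == 1

print(independent, mutually_exclusive, equally_likely)
```

Exact fractions are used instead of floats so that the equality checks are not affected by rounding.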
Complementary Events
• The complement of event A is the set of all the outcomes in a sample space that are not included in the event A. This is generally denoted by $A'$ or $\bar{A}$.
• For example, in a die rolling experiment, if event A is getting 2, then the complement $A'$ is getting 1, 3, 4, 5, or 6 on the upper face of the die.
• Two events are complementary when one event occurs if and only if the other does not.

Sample Space
• The sample space, denoted by S, is the set of all possible outcomes of an experiment. For a single die rolling experiment, the sample space is {1, 2, 3, 4, 5, 6}. When we roll a pair of dice, the sample space consists of all 36 ordered pairs $(i, j)$ with $i, j \in \{1, 2, 3, 4, 5, 6\}$.

[Table: Possible outcomes for rolling a pair of dice]

Counting Rule
• Multi-Step Experiment: If an experiment is defined as a sequence of k steps, with $n_1$ possible outcomes in the first step, $n_2$ possible outcomes in the second step, and so on, then the total number of experimental outcomes is given by $n_1 \times n_2 \times \dots \times n_k$.

Counting Rules for Combinations
• The second counting method uses the concept of combinations. Sampling n items from a population of size N (usually larger) without replacement provides

$C^N_n = \binom{N}{n} = \frac{N!}{n!\,(N-n)!}$

possible samples.

Example
• A firm wants to randomly select 3 employees from a total of 10 employees. How many combinations of 3 employees can be selected? Applying the rule with N = 10 and n = 3: $\binom{10}{3} = \frac{10!}{3!\,7!} = 120$.

Counting Rules for Permutations
• A third rule of counting, known as the counting rule for permutations, helps in computing the possible number of experimental outcomes when n items are to be selected from a set of N items in a particular order.
• The same n items selected in a different order would be considered a different experimental outcome.
• The number of permutations of N items taken n at a time is given by

$P^N_n = \frac{N!}{(N-n)!}$

Example
• A quality control inspector selects two parts out of five for inspecting defects. How many permutations may be selected? Applying the rule with N = 5 and n = 2: $\frac{5!}{3!} = 20$.

Classical Definition of Probability
• This is a mathematical approach of assigning probability. If for an experiment there are N exhaustive, mutually exclusive, and equally likely cases, and out of these, $n_e$ are favorable to the occurrence of an event E, then as per the classical approach, the probability of occurrence of the event E is given by

$P(E) = \frac{n_e}{N}$

Illustrations
• A company employs a total of 400 workers. Out of these, 150 workers are skilled and 250 workers are unskilled. The probability of randomly selecting a skilled worker is $P(E) = \frac{150}{400} = 0.375$.
• So, the probability of randomly selecting a skilled worker from a total of 400 workers is 37.5%.
• The probability of non-occurrence of an event E is given by $P(\bar{E}) = 1 - P(E)$.
• The probability of not selecting a skilled worker from a total of 400 workers is: $P(\bar{E}) = 1 - 0.375 = 0.625$.

Probability Estimation using Relative Frequency
• According to frequency estimation, the probability of an event X, P(X), is given by

$P(X) = \frac{\text{Number of observations in favour of event } X}{\text{Total number of observations}} = \frac{n(X)}{N}$

Examples
A website displays 10 advertisements and the revenue generated by the website depends on the number of visitors to the site clicking on any of the advertisements displayed. The data collected by the company has revealed that out of 2500 visitors, 30 people clicked on 1 advertisement, 15 clicked on 2 advertisements, and 5 clicked on 3 advertisements. The remaining visitors did not click on any of the advertisements. Calculate:
(a) The probability that a visitor to the website will click on an advertisement.
(b) The probability that the visitor will click on at least two advertisements.
(c) The probability that a visitor will not click on any advertisement.

Solution
(a) The number of visitors clicking on an advertisement is 50 (30 + 15 + 5) and the total number of visitors is 2500.
Thus, the probability that a visitor to the website will click on an advertisement is $\frac{50}{2500} = 0.02$.
(b) The number of visitors clicking on at least 2 advertisements is 20. Thus, the probability that a visitor will click on at least 2 advertisements is $\frac{20}{2500} = 0.008$.
(c) The probability that a visitor will not click on any advertisement is $\frac{2450}{2500} = 0.98$.

Algebra of Events
• Assume that X, Y and Z are three events of a sample space. Then the following algebraic relationships are valid and are useful while deriving probabilities of events:
  • Commutative rule: $X \cup Y = Y \cup X$ and $X \cap Y = Y \cap X$
  • Associative rule: $(X \cup Y) \cup Z = X \cup (Y \cup Z)$ and $(X \cap Y) \cap Z = X \cap (Y \cap Z)$
  • Distributive rule: $X \cup (Y \cap Z) = (X \cup Y) \cap (X \cup Z)$ and $X \cap (Y \cup Z) = (X \cap Y) \cup (X \cap Z)$

Contd.
• The following rules, known as DeMorgan’s Laws on complementary sets, are useful while deriving probabilities:
  $(X \cup Y)^C = X^C \cap Y^C$
  $(X \cap Y)^C = X^C \cup Y^C$
  where $X^C$ and $Y^C$ are the complementary events of X and Y, respectively.

Axioms of Probability
According to the axiomatic theory of probability, the probability of an event E satisfies the following axioms:
1. The probability of event E always lies between 0 and 1. That is, $0 \le P(E) \le 1$.
2. The probability of the universal set S is 1. That is, $P(S) = 1$.
3. $P(X \cup Y) = P(X) + P(Y)$, where X and Y are two mutually exclusive events.

The elementary rules of probability are directly deduced from the original three axioms of probability, using the set theory relationships:
1. For any event A, the probability of the complementary event, written $A^C$, is given by $P(A) = 1 - P(A^C)$. If P(A) is the probability of observing a fraudulent transaction at an e-commerce portal, then $P(A^C)$ is the probability of observing a genuine transaction.
2. The probability of an empty or impossible event, $\emptyset$, is zero: $P(\emptyset) = 0$.
3. If occurrence of an event A implies that an event B also occurs, so that the event class A is a subset of event class B, then the probability of A is less than or equal to the probability of B: $P(A) \le P(B)$.
4.
The probability that either event A or event B occurs, or both occur, is given by $P(A \cup B) = P(A) + P(B) - P(A \cap B)$.
5. If A and B are mutually exclusive events, so that $P(A \cap B) = 0$, then $P(A \cup B) = P(A) + P(B)$.
6. If $A_1, A_2, \dots, A_n$ are n events that form a partition of sample space S, then their probabilities must add up to 1: $P(A_1) + P(A_2) + \dots + P(A_n) = \sum_{i=1}^{n} P(A_i) = 1$.

Types of Probability
• Marginal Probability
• Union Probability
• Joint Probability
• Conditional Probability

Union Probability
• Union probability is the second type of probability. If $E_1$ and $E_2$ are two events, then union probability is denoted by $P(E_1 \cup E_2)$ and is the probability that event $E_1$ will occur, or that event $E_2$ will occur, or that both will occur.

Example
• ABRC, a leading marketing research firm in India, wants to collect information about households with computers and Internet access in urban Mumbai. After conducting an intensive survey, it was revealed that 60% of the households have computers with Internet access and 70% of the households have two or more computer sets. Suppose 50% of the households have computers with Internet connection and two or more computers. A household with a computer is randomly selected.
  1. What is the probability that the household has computers with Internet access or two or more computers?
  2. What is the probability that the household has computers with Internet access or two or more computers, but not both?
  3. What is the probability that the household has neither computers with Internet access nor two or more computers?

Solution

Joint Probability
• Let A and B be two events in a sample space.
Then the joint probability of the two events, written as $P(A \cap B)$, is given by

$P(A \cap B) = \frac{\text{Number of observations in } A \cap B}{\text{Total number of observations}}$

Example
At an e-commerce customer service centre a total of 112 complaints were received. 78 customers complained about late delivery of the items and 40 complained about poor product quality.
(a) Calculate the probability that a customer will complain about both late delivery and product quality.
(b) What is the probability that a complaint is only about poor quality of the product?

Solution
• Let A = late delivery and B = poor quality of the product. Let n(A) and n(B) be the number of complaints in favour of A and B. So n(A) = 78 and n(B) = 40. Since the total number of complaints is 112 and every complaint falls under at least one of the two categories, $n(A \cap B) = n(A) + n(B) - n(A \cup B) = 78 + 40 - 112 = 6$.
• The probability of a complaint about both late delivery and poor product quality is

$P(A \cap B) = \frac{n(A \cap B)}{\text{Total number of complaints}} = \frac{6}{112} \approx 0.0536$

• A complaint that is not about late delivery must be only about poor quality, so the probability that the complaint is only about poor quality is $1 - P(A) = 1 - \frac{78}{112} \approx 0.3036$.

• Marginal probability is simply the probability of an event X, denoted by P(X), without any conditions.
• Independent Events: Two events A and B are independent when occurrence of one event (say event A) does not affect the probability of occurrence of the other event (event B). Mathematically, two events A and B are independent when $P(A \cap B) = P(A) \times P(B)$.
• Conditional Probability: If A and B are events in a sample space, then the conditional probability of the event B given that the event A has already occurred, denoted by $P(B \mid A)$, is defined as

$P(B \mid A) = \frac{P(A \cap B)}{P(A)}, \quad P(A) > 0$

Application of Simple Probability Rules in Analytics
• Association rule mining is one of the popular algorithms used to solve problems such as market basket analysis and recommender systems.
• Market basket analysis (MBA) is used frequently by retailers to predict products a customer is likely to buy together, which can further be used for designing planograms and product promotions.

Association Rule Mining
• Association rule learning (also known as association rule mining) is a method of finding associations between different entities in a database.
• An association rule is a relationship of the form $X \rightarrow Y$ (that is, X implies Y).

Association rule learning Example

[Table: Binary representation of point of sale data]

• In the table, transaction ID is the transaction reference number and apple, orange, etc. are the different SKUs sold by the store. Binary code is used to represent whether the SKU was purchased (equal to 1) or not (equal to 0) during a transaction.

The strength of association between two mutually exclusive subsets can be measured using ‘support’, ‘confidence’, and ‘lift’.
• Support between two sets (of products purchased) is calculated using the joint probability of those events:

$\text{Support} = P(X \cap Y) = \frac{n(X \cap Y)}{N}$

• Here $n(X \cap Y)$ is the number of times both X and Y are purchased together and N is the total number of transactions.
• Confidence is the conditional probability of purchasing product Y given that product X is purchased. It measures the probability of event Y (the customer buying product Y) given that event X has occurred (the customer has already purchased product X). That is,

$\text{Confidence} = P(Y \mid X) = \frac{P(X \cap Y)}{P(X)}$

• Lift: The third measure in association rule mining is lift, which is given by

$\text{Lift} = \frac{P(X \cap Y)}{P(X)\,P(Y)}$

Association rules can be generated based on threshold values of support, confidence and lift.
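The three measures above can be sketched in a few lines of Python. The point-of-sale table from the slides is not reproduced in this text, so the `rows` below are hypothetical data made up for illustration, and the helper names `p` and `rule_metrics` are likewise assumptions rather than part of any library.

```python
# Hypothetical binary point-of-sale data: 1 = SKU purchased in the transaction.
columns = ["apple", "orange", "grapes", "banana"]
rows = [
    (1, 1, 0, 0),
    (1, 1, 1, 0),
    (1, 1, 0, 0),
    (1, 0, 1, 0),
    (0, 1, 1, 0),
    (0, 0, 1, 0),
    (0, 0, 1, 1),
    (0, 0, 0, 1),
]

def p(*skus):
    """Fraction of transactions that contain every SKU in `skus`."""
    idx = [columns.index(s) for s in skus]
    hits = sum(all(row[i] == 1 for i in idx) for row in rows)
    return hits / len(rows)

def rule_metrics(x, y):
    """Support, confidence and lift for the rule x -> y."""
    support = p(x, y)                  # P(X and Y)
    confidence = support / p(x)        # P(Y | X)
    lift = support / (p(x) * p(y))     # P(X and Y) / (P(X) P(Y))
    return support, confidence, lift

support, confidence, lift = rule_metrics("apple", "orange")
print(support, confidence, lift)
```

A lift above 1 suggests that the two SKUs are bought together more often than would be expected if the purchases were independent.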
For example, assume that the cut-off for support is 0.25 and for confidence is 0.5 (lift should be more than 1).

Bayes Theorem
• Bayes theorem is one of the most important concepts in analytics since several problems are solved using Bayesian statistics.

$P(A \mid B) = \frac{P(A \cap B)}{P(B)}$ and $P(B \mid A) = \frac{P(A \cap B)}{P(A)}$

• Using the two equations, we can show that

$P(B \mid A) = \frac{P(A \mid B)\,P(B)}{P(A)}$

Terminologies used to describe various components in Bayes Theorem
1. P(B) is called the prior probability (the estimate of the probability without any additional information).
2. P(B|A) is called the posterior probability (that is, given that the event A has occurred, what is the probability of occurrence of event B). In other words, after the additional information (or additional evidence) that A has occurred, what is the estimated probability of occurrence of B.
3. P(A|B) is called the likelihood of observing evidence A if B is true.
4. P(A) is the prior probability of A.

Monty Hall Problem

Monty Hall Problem Using Bayes Theorem
• Let $C_1$, $C_2$, and $C_3$ be the events that the car is behind door 1, 2, and 3, respectively. Let $D_1$, $D_2$, and $D_3$ be the events that Monty opens door 1, 2, and 3, respectively. The prior probabilities of $C_1$, $C_2$, and $C_3$ are $P(C_1) = P(C_2) = P(C_3) = 1/3$.
• Assume that the player has chosen door 1 and Monty opens door 2 to reveal a goat. Now we would like to calculate the posterior probability $P(C_1 \mid D_2)$, that is, the probability that the car is behind door 1 (the door chosen initially by the player) when Monty has provided the additional information that the car is not behind door 2.
• Using Bayes theorem,

$P(C_1 \mid D_2) = \frac{P(D_2 \mid C_1)\,P(C_1)}{P(D_2)} = \frac{(1/2)(1/3)}{1/2} = \frac{1}{3}$

• $P(D_2 \mid C_1) = 1/2$ (if the car is behind door 1, then Monty can open either door 2 or door 3).
• $P(D_2) = P(D_2 \mid C_1)P(C_1) + P(D_2 \mid C_2)P(C_2) + P(D_2 \mid C_3)P(C_3) = (1/2)(1/3) + 0 + (1)(1/3) = 1/2$.
• Note that $P(C_2 \mid D_2) = 0$.

$P(C_3 \mid D_2) = \frac{P(D_2 \mid C_3)\,P(C_3)}{P(D_2)} = \frac{1 \times (1/3)}{1/2} = \frac{2}{3}$

Thus, changing the initial choice will increase the probability of winning the car.
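The 1/3 versus 2/3 result can also be checked empirically. The following is a minimal simulation sketch, not part of the original slides: the function names, the trial count and the seed are arbitrary choices made here. It assumes the player always starts with door 1, as in the derivation, and it averages over whichever goat door Monty opens rather than conditioning on door 2 specifically; by symmetry the same win rates apply.

```python
import random

def play(switch, rng):
    """Play one game; return True if the player wins the car."""
    car = rng.randint(1, 3)
    choice = 1  # the player always starts with door 1
    # Monty opens a door that is neither the player's choice nor the car.
    opened = rng.choice([d for d in (1, 2, 3) if d != choice and d != car])
    if switch:
        # Move to the only door that is neither chosen nor opened.
        choice = next(d for d in (1, 2, 3) if d not in (choice, opened))
    return choice == car

def win_rate(switch, trials=100_000, seed=0):
    """Estimate the winning probability by relative frequency."""
    rng = random.Random(seed)
    return sum(play(switch, rng) for _ in range(trials)) / trials

stay = win_rate(switch=False)
swap = win_rate(switch=True)
print(f"stay: {stay:.3f}, switch: {swap:.3f}")
```

With enough trials the estimates should settle near 1/3 for staying and 2/3 for switching, in line with the posterior probabilities computed above.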
• $P(D_2 \mid C_3) = 1$ (if the car is behind door 3 and the player has chosen door 1, Monty has to open door 2 with probability 1).

Generalization of Bayes Theorem

Example
• Black boxes used in aircraft are manufactured by three companies A, B and C. 75% are manufactured by A, 15% by B, and 10% by C. The defect rates of black boxes manufactured by A, B, and C are 4%, 6%, and 8%, respectively. If a black box tested randomly is found to be defective, what is the probability that it is manufactured by company A?

Solution

Probable but not Possible!!!
https://www.pinterest.com/pin/64317100903604229/

Random Variables
• A random variable is a function that maps every outcome in the sample space to a real number.
• Equivalently, it is a function that assigns a real number to each sample point in the sample space S.
• A random variable is a robust and convenient way of representing the outcome of a random experiment.

Discrete Random Variables
• If the random variable X can assume only a finite or countably infinite set of values, then it is called a discrete random variable.
• Examples of discrete random variables are:
  • Credit rating (usually classified into different categories such as low, medium and high, or using labels such as AAA, AA, A, BBB, etc.).
  • Number of orders received at an e-commerce retailer, which can be countably infinite.
  • Customer churn (the random variable takes binary values: 1. Churn and 2. Do not churn).
  • Fraud (the random variable takes binary values: 1. Fraudulent transaction and 2. Genuine transaction).
  • Any experiment that involves counting (for example, number of returns in a day from customers of e-commerce portals such as Amazon, Flipkart; number of customers not accepting job offers from an organization).
Continuous Random Variables
• A random variable X which can take any value from an uncountably infinite set of values (for example, any value in an interval) is called a continuous random variable.
• Examples of continuous random variables are listed below:
  • Market share of a company (which can take any value between 0 and 100%).
  • Percentage of attrition among employees of an organization.
  • Time to failure of engineering systems.
  • Time taken to complete an order placed at an e-commerce portal.
  • Time taken to resolve a customer complaint at call and service centers.

Problem Solving

Problem
• A store receives 3 red, 6 white, and 7 blue shirts. Two shirts are drawn at random. Determine the probability that:
  1. Both the shirts are white
  2. Both the shirts are blue
  3. One shirt is red and the other is white
  4. One shirt is white and the other shirt is blue.

Solution

Problem
• The probability that a contractor will not get a plumbing contract is 1/3, and the probability that he will get an electrical contract is 4/9. If the probability of getting at least 1 contract is 4/5, what is the probability that he will get both the contracts? Let A and B stand for the events of getting the plumbing and electrical contracts, respectively.

Solution

Problem (Independent Event)
• A candidate is selected for an interview for 3 posts. For the first post, there are 3 candidates, for the second, there are 4, and for the third, there are 2. What are the chances of his getting at least 1 post?

Solution

Problem
• From a well-shuffled pack of 52 cards, a card is drawn at random. Find the probability that it is an ace or a heart.

Probability Matrices
• A company is interested in understanding consumer behaviour in Raipur, the capital of the newly formed state of Chhattisgarh.
For this purpose, the company has selected a sample of 300 consumers and asked a simple question, “Do you enjoy shopping?” Out of 300 respondents, 200 were males and 100 were females. Out of 200 males, 120 responded “Yes,” and out of 100 females, 70 responded “Yes.” A respondent is selected randomly. Construct a probability matrix and ascertain the probability that the respondent:
1. Is a male
2. Enjoys shopping
3. Is a female and enjoys shopping
4. Is a male and does not enjoy shopping
5. Is a female or enjoys shopping
6. Is a male or does not enjoy shopping
7. Is a male or female.

Solution
The probability matrix can be constructed as shown in the table below (each cell is the count from the problem statement divided by 300).

          Enjoys shopping: Yes    Enjoys shopping: No    Total
Male      120/300 = 0.4000        80/300 = 0.2667        200/300 = 0.6667
Female    70/300 = 0.2333         30/300 = 0.1000        100/300 = 0.3333
Total     190/300 = 0.6333        110/300 = 0.3667       1.0000

Independent Events
• Delta is a leading marketing research firm in India. A client of Delta is interested in the probable relationship between telephone and television purchases in a particular region. The company prepared a single question, “Do you have a telephone and/or a television in your home?”, and conducted a survey of 75 persons.

Is Television Purchase Dependent on Telephone Purchase

Problem
• A market survey was conducted in four cities to find out the preference for brand A soap. The responses are shown below:
(a) What is the probability that a consumer selected at random preferred brand A?
(b) What is the probability that a consumer preferred brand A and was from Chennai?
(c) What is the probability that a consumer preferred brand A, given that he was from Chennai?
(d) Given that a consumer preferred brand A, what is the probability that he was from Mumbai?

Solution
• Let X denote the event that a consumer selected at random preferred brand A. Then

Revisiting Bayes Theorem
• Bayes’ theorem is useful in revising the original probability estimates of known outcomes as we gain additional information about these outcomes. The prior probabilities, when changed in the light of new information, are called revised or posterior probabilities.
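The revision of prior probabilities into posterior probabilities can be sketched for several competing causes at once. The code below is an illustration rather than the slides' own solution: it assumes the usual generalized form $P(B_i \mid A) = P(A \mid B_i)P(B_i) / \sum_j P(A \mid B_j)P(B_j)$, the function name `posterior` is hypothetical, and the figures are taken from the black box example stated elsewhere in this document (shares of 75%, 15%, 10% and defect rates of 4%, 6%, 8%).

```python
from fractions import Fraction

def posterior(priors, likelihoods):
    """Revise priors P(B_i) given evidence A with likelihoods P(A | B_i)."""
    # Total probability of the evidence: P(A) = sum_i P(A | B_i) * P(B_i)
    evidence = sum(likelihoods[k] * priors[k] for k in priors)
    return {k: likelihoods[k] * priors[k] / evidence for k in priors}

# Prior: share of black boxes manufactured by each company.
priors = {"A": Fraction(75, 100), "B": Fraction(15, 100), "C": Fraction(10, 100)}
# Likelihood: probability that a black box is defective, given the company.
defect_rates = {"A": Fraction(4, 100), "B": Fraction(6, 100), "C": Fraction(8, 100)}

revised = posterior(priors, defect_rates)
for company, prob in revised.items():
    print(company, prob, round(float(prob), 4))
```

Under these inputs the posterior for company A works out to 30/47, roughly 0.64, which is lower than its 75% prior because A has the lowest defect rate of the three.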
Proof

Generalization of Bayes Theorem

Problem
• In a bolt factory, machines X, Y, and Z manufacture 20%, 35%, and 45% of the items, respectively. Of these, 8%, 6%, and 5% of the items from machines X, Y, and Z, respectively, are defective. One bolt is drawn at random from the output and is found to be defective. What is the probability that it was manufactured by machine Z?

Solution
• Tabulate the prior and posterior probabilities:

Representation in the form of tree diagram

Problem
• Suppose an item is manufactured by three machines X, Y, and Z. All three machines have equal capacity and are operated at the same rate. It is known that the percentages of defective items produced by X, Y, and Z are 2, 7, and 12 per cent, respectively. All the items produced by X, Y, and Z are put into one bin. From this bin, one item is drawn at random and is found to be defective. What is the probability that this item was produced on Y?

Discrete Random Variables
• If the random variable X can assume only a finite or countably infinite set of values, then it is called a discrete random variable.
• Examples of discrete random variables are:
  • Credit rating (usually classified into different categories such as low, medium and high, or using labels such as AAA, AA, A, BBB, etc.).
  • Number of orders received at an e-commerce retailer, which can be countably infinite.
  • Customer churn (the random variable takes binary values: 1. Churn and 2. Do not churn).
  • Fraud (the random variable takes binary values: 1. Fraudulent transaction and 2. Genuine transaction).
  • Any experiment that involves counting (for example, number of returns in a day from customers of e-commerce portals such as Amazon, Flipkart; number of customers not accepting job offers from an organization).

Probability Distribution

Random Variables
• A random variable is a function that maps every outcome in the sample space to a real number.
• Equivalently, it is a function that assigns a real number to each sample point in the sample space S.
• A random variable is a robust and convenient way of representing the outcome of a random experiment.
• A random variable is a numerical description of the outcome of an experiment.

Why Random Variables? To Predict

Discrete Random Variables
• If the random variable X can assume only a finite or countably infinite set of values, then it is called a discrete random variable.
• Examples of discrete random variables are:
  • Credit rating (usually classified into different categories such as low, medium and high, or using labels such as AAA, AA, A, BBB, etc.).
  • Number of orders received at an e-commerce retailer, which can be countably infinite.
  • Customer churn (the random variable takes binary values: 1. Churn and 2. Do not churn).
  • Fraud (the random variable takes binary values: 1. Fraudulent transaction and 2. Genuine transaction).
  • Any experiment that involves counting (for example, number of returns in a day from customers of e-commerce portals such as Amazon, Flipkart; number of customers not accepting job offers from an organization).

Continuous Random Variables
• A random variable X which can take any value from an uncountably infinite set of values (for example, any value in an interval) is called a continuous random variable.
• Examples of continuous random variables are listed below:
  • Market share of a company (which can take any value between 0 and 100%).
  • Percentage of attrition among employees of an organization.
  • Time to failure of engineering systems.
  • Time taken to complete an order placed at an e-commerce portal.
  • Time taken to resolve a customer complaint at call and service centers.

Instances of Discrete Random Variables

Types of Random Variables

Discrete Probability Distributions
• The probability distribution for a random variable describes how probabilities are distributed over the values of the random variable.
• We can describe a discrete probability distribution with a table, graph, or equation.

Property
• The probability distribution is defined by a probability function, denoted by f(x), which provides the probability for each value of the random variable.
• The required conditions for a discrete probability function are:
  $f(x) \ge 0$
  $\sum f(x) = 1$

Examples
[Figure: bar chart of a discrete probability distribution, P(X) on the vertical axis against X on the horizontal axis]

Properties

Probability mass function
• For a discrete random variable, the probability that the random variable X takes a specific value $x_i$, $P(X = x_i)$, is called the probability mass function $P(x_i)$.
• That is, a probability mass function is a function that maps each outcome of a random experiment to a probability.

Probability density function

Examples on Random Variables
• From a bag containing 3 red balls and 2 white balls, a man is to draw two balls at random without replacement. He gains Rs. 20 for each red ball and Rs. 10 for each white one. What is the expectation of his draw?

Examples (Contd.)
• In a cricket match played to benefit an ex-player, 10,000 tickets are to be sold at Rs. 500 each. The prize, awarded by lottery, is a fridge worth Rs. 12,000. If a person purchases two tickets, what is his expected gain?

Orthodox Probability Distributions

Binomial Distribution
• A random variable X is said to follow a Binomial distribution when:
  • Each trial can have only two outcomes, success and failure (such trials are known as Bernoulli trials).
  • The objective is to find the probability of getting k successes out of n trials.
  • The probability of success is p and thus the probability of failure is $(1 - p)$.
  • The probability p is constant and does not change between trials.

Possible Applications for the Binomial Distribution
• A manufacturing plant labels items as either defective or acceptable.
• A firm bidding for contracts will either get a contract or not.
• A marketing research firm receives survey responses of “yes I will buy” or “no I will not.”
• New job applicants either accept the offer or reject it.

Illustration

Probability Mass Function (PMF) of Binomial Distribution

$P(X = k) = \binom{n}{k}\, p^{k} (1 - p)^{n - k}, \quad k = 0, 1, \dots, n$

Example
Fashion Trends Online (FTO) is an e-commerce company that sells women's apparel. It is observed that about 10% of their customers return the items purchased by them for many reasons (such as size, color, and material mismatch). On a particular day, 20 customers purchased items from FTO. Calculate:
(a) The probability that exactly 5 customers will return the items.
(b) The probability that a maximum of 5 customers will return the items.
(c) The probability that more than 5 customers will return the items.
(d) The average number of customers who are likely to return the items.
(e) The variance and the standard deviation of the number of returns.

Solution

Problem
• Of the 41,636 residents of Tamil Nadu, 20% were born outside Tamil Nadu. A group of 5 people is to be randomly selected from the state, and the discrete random variable X is the number of persons in the group who were born outside Tamil Nadu.
Find:
1. The probability that exactly 2 persons were born outside Tamil Nadu.
2. The probability that at least 3 persons were born outside Tamil Nadu.

Time to be Normal!!

Normal Distribution

Continuous Probability Distributions
• A continuous variable is a variable that can assume any value on a continuum (can assume an uncountable number of values):
  • thickness of an item.
  • time required to complete a task.
  • temperature of a solution.
  • height, in inches.
• These can potentially take on any value, depending only on the ability to measure precisely and accurately.

Continuous Probability Distributions Vary By Shape
• Normal:
  • Symmetrical
  • Bell-shaped
  • Ranges from negative to positive infinity
• Uniform:
  • Symmetrical
  • Also known as Rectangular Distribution
  • Every value between the smallest and largest is equally likely
• Exponential:
  • Right skewed
  • Mean > Median
  • Ranges from zero to positive infinity

The Normal Distribution
• ‘Bell Shaped.’
• Symmetrical.
• Mean, Median and Mode are Equal.
• Location is determined by the mean, $\mu$.
• Spread is determined by the standard deviation, $\sigma$.
• The random variable has an infinite theoretical range: $-\infty$ to $+\infty$.

The Normal Distribution Density Function

$f(X) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{X - \mu}{\sigma}\right)^{2}}$

Gaussian Distribution Applications
• Stock Market Modelling
• Analyzing Mutual Funds
• Predictive Analytics
• Sampling

The Standardized Normal
• Any normal distribution (with any mean and standard deviation combination) can be transformed into the standardized normal distribution (Z).
• To compute normal probabilities, X units need to be transformed into Z units.
• The standardized normal distribution (Z) has a mean of 0 and a standard deviation of 1.

Translation to the Standardized Normal Distribution

$Z = \frac{X - \mu}{\sigma}$

The Standardized Normal Probability Density Function
The Standardized Normal Distribution Example
Finding Normal Probabilities
Probability as Area Under the Curve
The Standardized Normal Table
The Standardized Normal Table (Contd.)
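In place of a printed table, cumulative standardized normal probabilities can be computed directly. The sketch below is an illustration, not the slides' method: it uses the identity $\Phi(z) = \frac{1}{2}\left[1 + \operatorname{erf}(z/\sqrt{2})\right]$ with Python's standard `math.erf`; the function names `z_score` and `phi` are introduced here, and the values X = 8.6, mean 8.0 and standard deviation 5.0 are hypothetical numbers chosen so that Z = 0.12.

```python
import math

def z_score(x, mu, sigma):
    """Transform X units into Z units: Z = (X - mu) / sigma."""
    return (x - mu) / sigma

def phi(z):
    """Cumulative standardized normal probability P(Z < z)."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Hypothetical inputs: X = 8.6 from a normal distribution with mean 8, sd 5.
z = z_score(8.6, 8.0, 5.0)

lower = phi(z)               # P(Z < 0.12), area to the left of z
upper = 1.0 - lower          # P(Z > 0.12), upper tail
between = phi(z) - phi(0.0)  # P(0 < Z < 0.12), area between two values

print(f"z = {z:.2f}")
print(f"P(Z < z) = {lower:.4f}, P(Z > z) = {upper:.4f}, P(0 < Z < z) = {between:.4f}")
```

The three quantities correspond to the lower tail, the upper tail, and the area between two values that a standardized normal table would otherwise be used to look up.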
General Procedure for Finding Normal Probabilities
Finding Normal Probabilities
Solution: Finding P(Z < 0.12)
Finding Normal Upper Tail Probabilities
Finding a Normal Probability Between Two Values
Solution: Finding P(0 < Z < 0.12)
Probabilities in the Lower Tail
Example
Solution
Evaluating Normality
• Not all continuous distributions are normal.
• It is important to evaluate how well the data set is approximated by a normal distribution.
• Normally distributed data should approximate the theoretical normal distribution:
• The normal distribution is bell shaped (symmetrical), with the mean equal to the median.
• The empirical rule applies to the normal distribution.
• The interquartile range of a normal distribution is approximately 1.33 standard deviations.
Evaluating Normality (Contd.)
Comparing data characteristics to theoretical properties:
• Construct charts or graphs:
• For small- or moderate-sized data sets, construct a stem-and-leaf display or a boxplot to check for symmetry.
• For large data sets, does the histogram or polygon appear bell-shaped?
• Compute descriptive summary measures:
• Do the mean, median and mode have similar values?
• Is the interquartile range approximately 1.33σ?
• Is the range approximately 6σ?
Evaluating Normality (Contd.)
Comparing data characteristics to theoretical properties:
• Observe the distribution of the data set:
• Do approximately 2/3 of the observations lie within mean ±1 standard deviation?
• Do approximately 80% of the observations lie within mean ±1.28 standard deviations?
• Do approximately 95% of the observations lie within mean ±2 standard deviations?
• Evaluate the normal probability plot:
• Is the normal probability plot approximately linear (i.e., a straight line) with positive slope?
Constructing A Normal Probability Plot
• Normal probability plot:
• Arrange the data into an ordered array.
• Find the corresponding standardized normal quantile values (Z).
• Plot the pairs of points with the observed data values (X) on the vertical axis and the standardized normal quantile values (Z) on the horizontal axis.
• Evaluate the plot for evidence of linearity.
The Normal Probability Plot Interpretation
Evaluating Normality: An Example: Mutual Fund Returns
• Conclusions:
• The returns are right-skewed.
• The returns have more values concentrated around the mean than expected.
• The range is larger than expected.
• The normal probability plot is not a straight line.
• Overall, this data set greatly differs from the theoretical properties of the normal distribution.
Introduction to Sampling
Essence of Sampling
Random Sampling
• Shewhart (1931) defines a random sample as a ‘sample drawn under conditions such that the law of large numbers applies’.
• Random sampling is usually carried out without replacement, that is, an observation selected for the sample is removed from the population and is not considered again.
• Random samples can also be created with replacement, that is, an observation selected for inclusion in the sample can be selected again, since it is replaced (not removed) in the population.
Random Sampling (Example)
Stratified Sampling
• The population can be divided into mutually exclusive groups using some factor (for example, age, gender, marital status, income, geographical regions, etc.). The groups thus formed are called strata.
• It is important that the groups are mutually exclusive and exhaustive of the population.
Stratified Sampling Examples
a) Amount of time spent by male and female users in sending messages in a day.
Here the strata are male and female users.
b) Efficacy of a drug among different age groups. Age can be classified into categories such as less than 40, between 40 and 60, and over 60 years of age.
c) Performance of children in school and the parents’ marital status. Here, marital status can be (a) Single, (b) Married, (c) Divorced. In this case we assume that the parents’ marital status may influence children’s academic performance.
d) Television rating points for a program across different geographical regions of a country. For India, the geographical regions could be the different states of the country.
Steps in Stratified Sampling
a) Identify the factor that can be used for creating strata (for example: factor = Age; Stratum 1: age less than 40; Stratum 2: age between 40 and 60; and Stratum 3: age more than 60).
b) Calculate the proportion of each stratum in the population (say p1, p2, and p3 for the three strata identified in step a).
c) Calculate the sample size (say N). The sample sizes for strata 1, 2, and 3 are p1 × N, p2 × N, and p3 × N, respectively.
d) Use the random sampling procedure described earlier to generate a random sample in each stratum.
e) Combine the samples from each stratum to create the final sample.
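The steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a definitive implementation: the population of 1,000 people and their ages are hypothetical, the helper names `stratified_sample` and `age_stratum` are invented for this example, and each stratum's sample size is simply rounded from p × N (so the total can differ slightly from N for other proportions).

```python
import random
from collections import defaultdict

def age_stratum(person):
    """Step a: assign a (person_id, age) unit to an age stratum."""
    age = person[1]
    if age < 40:
        return "under 40"
    if age <= 60:
        return "40 to 60"
    return "over 60"

def stratified_sample(population, stratum_of, sample_size, seed=0):
    """Proportional stratified sampling without replacement."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    strata = defaultdict(list)
    for unit in population:
        strata[stratum_of(unit)].append(unit)
    sample = []
    for members in strata.values():
        proportion = len(members) / len(population)  # step b
        k = round(proportion * sample_size)           # step c
        sample.extend(rng.sample(members, k))         # step d
    return sample                                     # step e

# Hypothetical population: 500 people aged 30, 300 aged 50, 200 aged 70
population = [(i, 30) for i in range(500)]
population += [(i, 50) for i in range(500, 800)]
population += [(i, 70) for i in range(800, 1000)]

sample = stratified_sample(population, age_stratum, sample_size=50)
counts = defaultdict(int)
for person in sample:
    counts[age_stratum(person)] += 1
print(dict(counts))
```

With these hypothetical proportions (0.5, 0.3, 0.2) and N = 50, the strata contribute 25, 15, and 10 units respectively.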
Cluster Sampling
Cluster Sampling Steps
Bootstrap Aggregating
• Bootstrap Aggregating (known as Bagging) is sampling with replacement used in machine learning algorithms, especially the random forest algorithm (Breiman, 1996).
• The size of each sample and the number of samples are determined based on factors such as population size, target accuracy of the model developed using bagging, convergence, etc.
• Bagging is frequently used in ensemble methods (in which several models are developed and the final prediction is usually based on majority voting).
Non-Probability Sampling
• Convenience sampling is a non-probability sampling technique in which the sample units are not selected according to a probability distribution.
• Voluntary sampling: the data is collected from people who volunteer for such data collection. There could be bias in the case of voluntary sampling.
Sampling Distribution Examples
Sampling Distribution
• A sampling distribution is a distribution of all of the possible values of a sample statistic for a given sample size selected from a population.
• For example, suppose you sample 50 students from your college regarding their mean GPA. If you obtained many different samples of size 50, you would compute a different mean for each sample. We are interested in the distribution of all potential mean GPAs we might calculate for any sample of 50 students.
Developing Sampling Distribution
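One way to see how a sampling distribution is developed is to enumerate every possible sample from a very small population. The sketch below is an illustration only: the four-value population is hypothetical, and the samples are drawn with replacement so that every ordered pair is counted.

```python
import math
from itertools import product
from statistics import mean, pstdev

# Hypothetical population of four values (assumed for illustration)
population = [18, 20, 22, 24]
n = 2  # sample size

# All possible ordered samples of size n, drawn with replacement
samples = list(product(population, repeat=n))
sample_means = [mean(s) for s in samples]

mu = mean(population)        # population mean
sigma = pstdev(population)   # population standard deviation

mean_of_means = mean(sample_means)       # centre of the sampling distribution
standard_error = pstdev(sample_means)    # spread of the sampling distribution

print(len(samples), mu, mean_of_means)
print(round(sigma, 3), round(standard_error, 3), round(sigma / math.sqrt(n), 3))
```

In this enumeration the mean of the sample means equals the population mean, and the standard deviation of the sample means equals σ/√n, which is the standard error of the mean.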
Comparing the Population Distribution to the Sample Means Distribution
Sample Mean Sampling Distribution: Standard Error of the Mean
Sample Mean Sampling Distribution: If the Population is Normal
Z-value for Sampling Distribution of the Mean
Sampling Distribution Properties
Sample Mean Sampling Distribution: If the Population is not Normal
Central Limit Theorem
How Large is Large Enough?
• For most distributions, n > 30 will give a sampling distribution that is nearly normal.
• For fairly symmetric distributions, n > 15 is usually enough.
• For a normal population distribution, the sampling distribution of the mean is always normally distributed.
Example
Population Proportions
Sampling Distribution of p
Z-Value for Proportions
Example
Central Limit Theorem (CLT) Alternative Version
Implications
Example
Estimation
To find the true story we need to have confidence in the work that produced the numbers
Estimation Process
• Estimation is a process used for making inferences about population parameters based on samples.
• Point Estimate: A point estimate of a population parameter is a single (specific) value calculated from a sample (and thus called a statistic).
• Interval Estimate: Instead of a specific value of the parameter, in an interval estimate the parameter is said to lie in an interval (say between points a and b) with a certain probability (or confidence).
• According to the central limit theorem, the sample means for sufficiently large samples (n ≥ 30) are approximately normally distributed, regardless of the shape of the population distribution. For a normally distributed population, sample means are normally distributed for any sample size.
• The z formula for this is: $z = \dfrac{\bar{x} - \mu}{\sigma/\sqrt{n}}$
• This formula can be rearranged algebraically for the population mean: $\mu = \bar{x} - z\,\dfrac{\sigma}{\sqrt{n}}$
• The sample mean $\bar{x}$ can be greater than or less than the population mean; hence, the formula takes the following form: $\mu = \bar{x} \pm z\,\dfrac{\sigma}{\sqrt{n}}$
• Confidence interval for estimating the population mean: $\bar{x} - z_{\alpha/2}\,\dfrac{\sigma}{\sqrt{n}} \le \mu \le \bar{x} + z_{\alpha/2}\,\dfrac{\sigma}{\sqrt{n}}$
Deeper Insights
• The population mean is located within the confidence interval.
99% Confidence Interval
Problem
• A researcher has taken a random sample of size 70 from a population, obtaining a sample mean of 35. The population standard deviation is 4.62. Construct a 90% confidence interval to estimate the population mean.
Sampling from a Finite Population
Problem
• A researcher wants to measure the income level of employees working in a company. The total employee strength of the company is 1200. A random sample of 50 employees reveals that the average income of the sampled employees is Rs 15,000. Historical data reveals that the standard deviation of the income of the employees is approximately Rs 1500. Construct a 99% confidence interval for the average income of all the employees working in this company.
Solution
Interval Estimates Using t-Distribution
• We have seen that when the population standard deviation is unknown, the sample standard deviation can be used for estimating the confidence interval for large samples (n ≥ 30).
• In real-life situations, a sample size of less than 30 is not uncommon. In the case of a small sample size (n < 30), the z formula discussed earlier is not applicable. The problem can be solved by using the t statistic, developed by a British statistician, William S. Gosset.
• When the population standard deviation is not known and the sample size is less than 30, the t-distribution is used.
• Assumption for using the t-distribution: The population is normal or approximately normal.
• Applicable when the population standard deviation is not known.
t-Distribution
• The t-distribution is symmetrical but flatter than the normal distribution, and there is a different t-distribution for each sample size (or degrees of freedom). As the sample size gets larger, the shape of the t-distribution becomes approximately equal to that of the normal distribution.
• Interval estimate for the mean using the t-distribution:
$\bar{x} + t_{\alpha/2,\,n-1}\,\dfrac{s}{\sqrt{n}}$: upper confidence limit
$\bar{x} - t_{\alpha/2,\,n-1}\,\dfrac{s}{\sqrt{n}}$: lower confidence limit
(the value of t depends upon the degrees of freedom, n − 1, and α)
Process
t-Distribution
Problems
• In a grocery store, the mean expenditure per customer is Rs 2000 with a standard deviation of Rs 300. If a random sample of 50 customers is selected, what is the probability that the sample average expenditure per customer is more than Rs 2080?
Problems
• By the year 2014–2015, the telephone instrument industry is estimated to grow by 106.20 million units as compared to 1993–1994, when the total market size was only 3 million units. Bharti Teletech, BPL Telecom, ITI (Indian Telephone Industries), Bharti Systel, Tata Telecom, and Gigrej Telecom are some of the major players in the market. Bharti Teletech has a market share of 24%. If 200 purchasers of telephone instruments are randomly selected, what is the probability that 55 or more are Bharti Teletech customers?
Problems
• In order to estimate the customer loyalty for a particular product, a researcher poses the following question to a sample of 100 customers: How many years have you been continuously using this product? This sample yielded a mean period of 8 years with a sample standard deviation of 2 years. Construct a 95% confidence interval for estimating the population mean.
Problems
• The personnel department of an organization wants to apply cost-cutting measures for improving efficiency. As the first step, the personnel department wants to curtail telephone expenses incurred by employees.
For this, the personnel department has taken a random sample of 10 employees and gathered the following data about telephone expenses (in thousand rupees) in the previous year: 10, 12, 24, 23, 11, 14, 15, 34, 16, 23. Construct a 95% confidence interval to estimate the average telephone expenses of the employees in the population.
Ten Commandments of Sampling and Estimation
• If the sample size (n) is large enough (n ≥ 30), the sampling distribution of the sample mean is approximately normal regardless of the shape of the population distribution.
• A larger sample automatically reduces the standard error of the mean.
• The primary endeavor of sampling is to minimize the difference between the sample mean and the population mean.
• Sampling is the gateway to inferential statistics.
• For a normal population distribution, the sampling distribution of the mean is always normally distributed, irrespective of sample size.
• For constructing a confidence interval, the sample standard deviation can be used if the population standard deviation is not known beforehand.
• For smaller sample sizes (n < 30), estimation of the confidence interval resorts to the t-distribution.
• The t-distribution approaches the normal distribution as the degrees of freedom increase.
• Sample size is determined on the basis of the tolerable error and the desired confidence level.
• As with the mean, a confidence interval can be constructed for a population proportion as well.
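As a worked illustration of the small-sample case (n < 30, population standard deviation unknown), the telephone-expense data from the problem above can be run through the t-based interval. This is a sketch rather than an official solution: it assumes the population is approximately normal, and the critical value 2.262 is the standard t-table entry for a two-sided 95% interval with 9 degrees of freedom, hard-coded here because Python's standard library has no t quantile function.

```python
import math
from statistics import mean, stdev

# Telephone expenses (in thousand rupees) for the 10 sampled employees
expenses = [10, 12, 24, 23, 11, 14, 15, 34, 16, 23]

n = len(expenses)
x_bar = mean(expenses)   # sample mean
s = stdev(expenses)      # sample standard deviation (divisor n - 1)

# Assumed t-table value: two-sided 95% confidence, df = n - 1 = 9
t_crit = 2.262

margin = t_crit * s / math.sqrt(n)
lower, upper = x_bar - margin, x_bar + margin

print(f"mean = {x_bar:.2f}, s = {s:.3f}")
print(f"95% CI: ({lower:.2f}, {upper:.2f}) thousand rupees")
```

Under these assumptions the interval runs from roughly 12.8 to 23.6 thousand rupees around a sample mean of 18.2.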