
Business analytics

2023
Lecture 5
Probability Distributions and Statistical Inference
Dr. Sangwook HA
Department of Management (EBIS programme)
Faculty of Business and Management
WEEK FIVE
CONTENTS
01
Session 1: Probability Distributions and Data Modeling
02
Session 2: Statistical Inference (Inferential Statistics) I
01
Probability Distributions and Data Modeling
1-1 Probability Distributions
1-2 Data Modeling
Probability Distributions and Data Modeling
• Data visualizations and descriptive statistics we have learned so far provide
some information about (mostly):
▪ Data at hand (sample)
▪ Past and/or present events
• However, knowing the past and present may not be enough to make decisions
for the future, mainly because of uncertainty and randomness in business
▪ e.g., product delivery can be delayed because of bad weather
• The concepts of probability (distribution) and data modeling help managers
make (future) decisions in the presence of uncertainty and randomness
and serve as a basis for predicting future events (i.e., predictive analytics)
Basic Concepts of Probability
● A (probability) experiment is a process (action, trial)
that results in an outcome. In a random experiment, the
outcome cannot be predicted with certainty
Experiment: roll two dice
● The outcome of an experiment is a result that we observe
Outcome (2 = 1+1)
● The sample space is the collection of all possible
outcomes of an experiment
Sample space: {2, 3, …, 12} (sums of the two dice)
● An event is a collection of one or more outcomes from a
sample space
Event: {Even outcomes} = {2, 4, 6, 8, 10, 12}
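A minimal sketch in R of the two-dice example, assuming the outcome of interest is the sum of the two dice. The variable names are illustrative and do not come from the lecture.

```r
# Enumerate all 36 equally likely rolls of two dice
rolls <- expand.grid(die1 = 1:6, die2 = 1:6)
sums <- rolls$die1 + rolls$die2

# Sample space of the sum: every distinct value that can occur
sample_space <- sort(unique(sums))

# Event: the sum is even
event_even <- sample_space[sample_space %% 2 == 0]

# Probability of the event = share of the 36 rolls that fall in it
p_even <- mean(sums %% 2 == 0)

print(sample_space)
print(event_even)
print(p_even)
```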
Probability Definitions
• Probability is the likelihood that an outcome occurs. Probabilities are
expressed as values between 0 and 1
• Probabilities may be defined from one of three perspectives:
▪ Classical definition: probabilities can be deduced from theoretical
arguments
▪ Relative frequency definition: probabilities are based on empirical data
▪ Subjective definition: probabilities are based on judgment and
experience
Classical Definition of Probability
• Probability that a specific event X occurs
= (number of outcomes in which event X occurs) / (total number of possible outcomes)
• Suppose we roll 2 dice:
Probability that the dice sum to 3 = 2/36 ≈ 0.0556
Relative Frequency Definition of Probability
• Probability a computer is repaired in 10 days = 0.076
▪ This probability can change if additional observations are collected
Basic Concepts of Probability
● Rule 1: The probability of any event is the sum of the probabilities of the
outcomes that comprise that event
● Rule 2: Probability of complement of an event A is P(Ā)=1-P(A)
- For an event A, P(Ā)+P(A) = 1 Therefore, P(Ā)=1-P(A)
● Rule 3: If events A and B are mutually exclusive,
then P(A or B) = P(A)+P(B)
● Rule 4: If two events A and B are not mutually exclusive,
then P(A or B) = P(A) + P(B) – P(A and B)
Conditional Probability
● Conditional probability is the probability of
occurrence of one event A, given that another event B
is known to be true or has already occurred.
● P(A|B) = P(A and B)/P(B)
- the conditional probability of A given B
● Data shows the first and second purchases for a
sample of 200 customers.
● Probability of purchasing an iPad given already
purchased an iMac = 2/13
[Table: first purchase vs. second purchase for the 200 customers]
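The purchase table is not reproduced here, so the sketch below applies the formula P(A|B) = P(A and B)/P(B) to two dice instead. This is a swapped-in illustration, and the events A and B are assumptions chosen for the example.

```r
# All 36 equally likely rolls of two dice
rolls <- expand.grid(die1 = 1:6, die2 = 1:6)

A <- (rolls$die1 + rolls$die2) == 8  # event A: the sum is 8
B <- rolls$die1 %% 2 == 0            # event B: the first die is even

p_B <- mean(B)             # P(B)
p_A_and_B <- mean(A & B)   # P(A and B)

# Conditional probability of A given B
p_A_given_B <- p_A_and_B / p_B
print(p_A_given_B)
```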
Random Variables and Probability Distribution
● A random variable is a numerical description of the outcome of an experiment
= data (a set of metrics) generated by an experiment
○ A discrete random variable is one for which the number of possible
outcomes can be counted
■ e.g., outcomes of dice rolls, whether a customer likes or dislikes a product, number
of hits on a website link today
○ A continuous random variable has outcomes over one or more continuous
intervals of real numbers
■ e.g., weekly change in Dow Jones Industrial Average, daily
temperature, time between machine failures
Random Variables and Probability Distribution
● Probability distribution: a characterization of the possible values that a random variable
may assume along with the probability of assuming these values
▪ A theoretical model of the random variable (contains the probabilities of all possible outcomes)
▪ Can be constructed by using observed data (= an empirical probability distribution)
✓ X axis = all possible outcomes (values) from an experiment
✓ Y axis = probability of observing each outcome (value) from the experiment
✓ The total area under the function is always 1 (= 100%)
Discrete Probability Distribution
• The probability distribution of the discrete outcomes is called a Probability Mass
Function (PMF)
• A mathematical function f(x) specifying the probability of each value of the random variable X
• x_i represents the i-th value of X, and f(x_i) is the probability of that value
• Properties:
\[ 0 \le f(x_i) \le 1 \text{ for all } i, \qquad \sum_i f(x_i) = 1 \]
Discrete Probability Distributions
Example:
Probability Mass Function for the sum of two independently rolled dice
f(x=2)=1/36
f(x=3)=2/36
f(x=4)=3/36
f(x=5)=4/36
f(x=6)=5/36
⋮
f(x=12)=1/36
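The probability mass function listed above can be reproduced by counting outcomes. This is a minimal sketch with illustrative variable names.

```r
# All 36 equally likely rolls and their sums
rolls <- expand.grid(die1 = 1:6, die2 = 1:6)
sums <- rolls$die1 + rolls$die2

# table() counts how many rolls give each sum; dividing by 36 gives f(x)
pmf <- table(sums) / nrow(rolls)
print(pmf)

# A valid PMF sums to 1
print(sum(pmf))
```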
Cumulative Distribution Function
• A cumulative distribution function, F(x), specifies the probability that the discrete
(or continuous) random variable X assumes a value less than or equal to a
specified value, x; that is,
\[ F(x) = P(X \le x) \]
Example:
Using the Cumulative Distribution Function
● Probability of rolling between 4 and 8:
P(4≤X≤8)
= P(3<X≤8)
= F(8)-F(3)
=26/36-3/36
=23/36
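As a check on the calculation, the cumulative distribution function for the sum of two dice can be built with cumsum(). This is a sketch; the names x, f and cdf are illustrative.

```r
x <- 2:12                  # possible sums of two dice
f <- c(1:6, 5:1) / 36      # PMF: 1/36, 2/36, ..., 6/36, ..., 1/36

# The CDF is the running total of the PMF
cdf <- cumsum(f)
names(cdf) <- x

# P(4 <= X <= 8) = F(8) - F(3)
p_4_to_8 <- unname(cdf["8"] - cdf["3"])
print(p_4_to_8)
```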
Expected Value of a Discrete Random Variable
• The expected value of a random variable corresponds to
the notion of the mean, or average, for a sample.
• For a discrete random variable X, the expected value,
denoted E[X], is the weighted average of all
possible outcomes, where the weights are the probabilities:
\[ E[X] = \sum_i x_i f(x_i) \]
Computing the Expected Value
• Rolling two dice
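A short sketch of the expected-value calculation for the sum of two dice, weighting each outcome by its probability (variable names are illustrative):

```r
x <- 2:12                  # possible sums of two dice
f <- c(1:6, 5:1) / 36      # probability of each sum

# Expected value: probability-weighted average of the outcomes
expected_value <- sum(x * f)
print(expected_value)
```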
Application: Airline Revenue Management
• Full and discount airfares are available for a flight.
• Full-fare ticket costs $560.
• Discount ticket costs $400.
• X = ticket price paid
• p = 0.75 (the probability of selling a full-fare ticket)
• Expected value of a full-fare ticket: E[X] = 0.75($560) + 0.25($0) = $420
• The airline should not discount full-fare tickets because
the expected value of a full-fare ticket is greater than the
cost of a discount ticket.
• Break-even point:
$400 = p($560), so p ≈ 0.714
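The airline decision can be sketched in R as follows. The sketch assumes a seat held for a full-fare customer earns $0 if it goes unsold, which is the model implied by the break-even calculation; variable names are illustrative.

```r
full_fare <- 560   # full-fare ticket price
discount <- 400    # discount ticket price
p_full <- 0.75     # probability of selling the full-fare ticket

# Expected revenue from holding the seat for a full-fare sale
# (assumption: revenue is 0 if the full-fare ticket is not sold)
expected_full <- p_full * full_fare + (1 - p_full) * 0

# Break-even probability: p * full_fare equals the discount price
break_even_p <- discount / full_fare

print(expected_full)
print(break_even_p)
print(expected_full > discount)  # TRUE means: do not discount
```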
Expected Value and Decision Making
• The expected value is a “long-run average” and is
appropriate for decisions that occur on a repeated basis.
• For one-time decisions, however, you need to consider the
downside risk and the upside potential of the decision.
Expected Value of a Charitable Raffle
• Cost of raffle ticket is $50.
• 1000 raffle tickets are sold.
• Winning prize is $25,000.
• E(x) = -50*0.999 + 24950*0.001 = -25
• If you played this game repeatedly over the long run, you
would lose an average of $25.00 each time you play.
• However, for any one game, you would either lose $50 or
win $24,950.
– Is the risk of losing $50 worth the potential of winning $24,950?
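The raffle calculation, sketched in R under the assumption that you buy exactly one of the 1000 tickets (names are illustrative):

```r
ticket_cost <- 50
n_tickets <- 1000
prize <- 25000

p_win <- 1 / n_tickets   # one ticket out of 1000

# Net payoffs: lose the ticket price, or win the prize minus the ticket price
payoffs <- c(lose = -ticket_cost, win = prize - ticket_cost)
probs <- c(lose = 1 - p_win, win = p_win)

expected_payoff <- sum(payoffs * probs)
print(expected_payoff)
```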
Variance of a Discrete Random Variable
• The variance, Var[X], of a discrete random
variable X is a weighted average of the squared
deviations from the expected value:
\[ \mathrm{Var}[X] = \sum_i (x_i - E[X])^2 f(x_i) \]
Computing the Variance of a Random Variable
• Rolling two dice
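A sketch of the variance calculation for the sum of two dice, as the probability-weighted average of squared deviations from the expected value (names are illustrative):

```r
x <- 2:12                  # possible sums of two dice
f <- c(1:6, 5:1) / 36      # probability of each sum

expected_value <- sum(x * f)

# Variance: weighted average of squared deviations from the expected value
variance <- sum((x - expected_value)^2 * f)
std_dev <- sqrt(variance)

print(variance)
print(std_dev)
```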
Continuous Probability Distributions
• A continuous random variable is defined over one or more intervals of real numbers
= has an infinite number of possible outcomes
▪ Change in DJIA using 5% increment
▪ Change in DJIA using 2.5% increment
The probability distribution approaches a smooth curve
as the interval for outcomes decreases
Continuous Probability Distributions
• A probability density function is a mathematical function that characterizes a
continuous random variable (e.g., stock market index)
Continuous Probability Distributions
Probability density function
● A curve described by a mathematical function that characterizes a continuous
random variable
Properties of a probability density function
● f(x)≥0 for all values of x -> a graph of the density function must lie at or above
the x-axis
● Total area under the density function equals 1.
● P(X=x)=0 -> we cannot define the probability of a specific value for a
continuous random variable (there are infinitely many possible values!)
● Probabilities are only defined over an interval.
● P(a≤X≤b) is the area under the density function between a and b.
P(a≤X≤b) = P(X≤b)-P(X≤a) = F(b)- F(a)
Distributions and Business Decisions
Why? Working knowledge of common families of
probability distributions:
1. helps you understand the underlying process that
generates sample data
2. is useful in building decision models with a theoretical
distribution of data
3. helps you compute probabilities of outcomes
to assess risks and make decisions
Commonly Used Distributions
● Discrete Variables
○ Bernoulli Distribution
○ Binomial Distribution
○ Poisson Distribution
…
● Continuous Variables
○ Uniform Distribution
○ Normal Distribution (and Standard Normal Distribution)
○ (Student’s) t-distribution
○ Exponential Distribution
Uniform Distribution
• The uniform distribution characterizes a continuous random variable for which all
outcomes between a minimum (a) and a maximum (b) are equally likely.
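• Density function:
\[ f(x) = \frac{1}{b-a}, \quad a \le x \le b \]
• Cumulative distribution function:
\[ F(x) = \begin{cases} 0 & x < a \\ \dfrac{x-a}{b-a} & a \le x \le b \\ 1 & x > b \end{cases} \]
• Expected value:
\[ E[X] = \frac{a+b}{2} \]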
Computing Uniform Probabilities
• Sales revenue for a product varies uniformly each week between
$1,000 and $2,000.
• Probability that sales revenue will be less than x = $1,300:
– F(1300) = (1300 − 1000)/(2000 − 1000) = 0.3
• Probability that revenue will be between $1,500 and $1,700:
– F(1700) − F(1500) = 0.7 − 0.5 = 0.2
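The two uniform probabilities can be computed with R's built-in punif(), the cumulative distribution function of the uniform distribution. This is a minimal sketch; the variable names are illustrative.

```r
a <- 1000   # minimum weekly revenue
b <- 2000   # maximum weekly revenue

# P(X < 1300): CDF evaluated at 1300
p_less_1300 <- punif(1300, min = a, max = b)

# P(1500 <= X <= 1700): difference of two CDF values
p_1500_1700 <- punif(1700, min = a, max = b) - punif(1500, min = a, max = b)

print(p_less_1300)
print(p_1500_1700)
```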
Discrete Uniform Distribution
• A variation of the uniform distribution is one for
which the random variable is restricted to integer
values between a and b (also integers); this is
called a discrete uniform distribution.
– Example: roll of a single die. Each of the numbers 1
through 6 has a 1/6 probability of occurrence.
Normal Distribution
● f(x) is a bell-shaped curve
● Characterized by 2 parameters
○ 𝝁 (mean)
○ 𝝈 (standard deviation)
● Properties
1. Symmetric
2. Mean = Median = Mode
3. Range of X is unbounded
(negative ~ positive ∞)
4. Empirical rules apply
= The area under the density function within ± 2 standard deviations of the mean is
95.4%, and that within ± 3 standard deviations is 99.7%
Normal Distribution
Example
The distribution for customer demand (units per month) is normal with:
mean=750
stdev.=100
Find the probability that demand will be:
a) at most 900 units/month
b) exceed 700 units/month
c) be between 700 and 900 units/month
Normal Distribution
a) Probability that demand is at most 900 units/month: P(X ≤ 900)
b) Probability that demand exceeds 700 units/month: P(X > 700)
c) Probability that demand is between 700 and 900 units/month: P(700 < X < 900)
Standard Normal Distribution
● A standard normal distribution is a normal distribution with a mean
of 0 and a standard deviation of 1
▪ A standard normal random variable is denoted by Z (z-scores)
▪ The scale along the z-axis represents the number of standard
deviations from the mean of zero
Using Standard Normal Distribution Tables
● Table 1 of Appendix A
● We can compute probabilities for any normal random variable X having a mean 𝝁 and standard
deviation 𝝈 by converting it to a standard normal random variable Z:
\[ Z = \frac{X - \mu}{\sigma} \]
● In other words, any normal distribution can be converted into the standard normal distribution,
so the standard normal distribution table can be used to calculate its probabilities
Computing Probabilities with Standard Normal Tables
● In the earlier example, what is the probability
that demand will be at most 900 units/month?
● Convert 900 to a z-score: z = (900 − 750)/100 = 1.50
● Using the table, we find:
P(X<900)=P(Z<1.50)=0.93319
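The table lookup can be cross-checked in R. The sketch below standardizes the value by hand and compares the result with calling pnorm() directly on the original scale; variable names are illustrative.

```r
mu <- 750      # mean demand (units per month)
sigma <- 100   # standard deviation of demand
x <- 900       # demand level of interest

# Standardize: number of standard deviations above the mean
z <- (x - mu) / sigma

# Probability from the standard normal distribution
p_from_z <- pnorm(z)

# Same probability computed directly on the original scale
p_direct <- pnorm(x, mean = mu, sd = sigma)

print(z)
print(p_from_z)
print(p_direct)
```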
Question (b)
● Probability that demand will exceed 700 units, or P(X>700).
= 1-pnorm(700,750,100)=1-0.3085=0.6915
Question (c)
● Probability that demand will be between 700 and 900, or P(700<X<900).
= pnorm(900,750,100)-pnorm(700,750,100)=0.9332-0.3085
= 0.6247
Probability Functions in R
In R, probability functions take the form [dpqr]distribution_abbreviation(), where the
first letter refers to the aspect of the distribution returned:
• d = density
• p = cumulative distribution function
• q = quantile function
• r = random generation (random deviates)
Abbreviations:
• multinom (multinomial distribution)
• binom (binomial distribution)
• nbinom (negative binomial distribution)
• norm (normal distribution)
• exp (exponential distribution)
• pois (Poisson distribution)
• unif (uniform distribution)
Example: rnorm() generates values drawn from a normal distribution with a
specified mean and standard deviation
Probability Functions in R
By default, R’s functions for the normal distribution return values for the standard normal distribution (mean = 0, sd = 1)
Examples of Using R to plot probability distribution
• Plot the standard normal curve on the interval [–3,3]
• What is the area under the standard normal curve to the left of z=1.96?
• What is the value of the 90th percentile of a normal distribution with a mean of
500 and a standard deviation of 100?
• Generate 50 random normal deviates with a mean of 50 and a standard
deviation of 10.
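One possible way to answer these four tasks with base R is sketched below. The seed and the axis labels are arbitrary choices, not taken from the lecture.

```r
# 1. Plot the standard normal density on [-3, 3]
curve(dnorm(x), from = -3, to = 3, xlab = "z", ylab = "Density")

# 2. Area under the standard normal curve to the left of z = 1.96
area_left <- pnorm(1.96)

# 3. 90th percentile of a normal distribution with mean 500 and sd 100
pct90 <- qnorm(0.90, mean = 500, sd = 100)

# 4. 50 random normal deviates with mean 50 and sd 10
set.seed(1234)  # arbitrary seed, only for reproducibility
deviates <- rnorm(50, mean = 50, sd = 10)

print(area_left)
print(pct90)
print(head(deviates))
```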
Data Modeling and Distribution Fitting
● Using sample data may limit our ability to predict uncertain events that may
occur because potential values outside the range of the sample data are not
included
● A better approach is to identify the underlying probability distribution from which
sample data come by “fitting” a theoretical distribution to the data and verifying
the goodness of fit statistically
= test whether the sample data has a shape of a specific probability distribution
▪ Examine a histogram for clues about the distribution’s shape
▪ Look at summary statistics such as the mean, median,
standard deviation, coefficient of variation, and skewness
Analyzing Airline Passenger Data
● Sample data on passenger demand for 25 flights
● The histogram shows a relatively symmetric distribution.
● The mean, median, and mode are all similar, although there is
moderate skewness. Normal distribution seems reasonable.
Goodness of Fit
● A better approach than simply visually examining a histogram and
summary statistics is to analytically fit the data to the best type of
probability distribution.
● Statistical measures of goodness of fit:
▪ Chi-square (need at least 50 data points)
▪ Kolmogorov-Smirnov (works well for small samples)
▪ Anderson-Darling (puts more weight on the differences
between the tails of the distributions)
▪ Shapiro-Wilk Normality Test (test data against normal
distribution)
Goodness of Fit
● Kolmogorov-Smirnov test
● Shapiro-Wilk test
● Graphical methods
▪ Probability density plot
▪ Quantile-Quantile (QQ) Plot
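A minimal sketch of these checks in base R. The data are simulated (a hypothetical sample of 25 values), not the airline passenger data from the lecture, and the seed is arbitrary.

```r
set.seed(42)  # arbitrary seed so the sketch is reproducible
demand <- rnorm(25, mean = 100, sd = 15)  # hypothetical sample, not lecture data

# Kolmogorov-Smirnov test against a normal distribution.
# Note: the mean and sd are estimated from the same data, so the
# reported p-value is only approximate.
ks_result <- ks.test(demand, "pnorm", mean = mean(demand), sd = sd(demand))

# Shapiro-Wilk normality test
sw_result <- shapiro.test(demand)

print(ks_result$p.value)
print(sw_result$p.value)

# Graphical checks: density plot and normal QQ plot
plot(density(demand), main = "Probability density plot")
qqnorm(demand)
qqline(demand)
```

In both tests a large p-value means the data give no evidence against the normal distribution; it does not prove that the data are normal.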
02
Statistical Inference I
2-1 Part 1: Sampling and Estimation
Sampling and Estimation
Average monthly spending of
Chinese university students = ? (𝝁)
● Sampling refers to a process of collecting a subset of
observations from its (intended / assumed) population
● Sample data allow us to infer the characteristics of population,
which is usually not known (= Estimation of unknown
population parameters)
● Estimators are measures used to estimate unknown population
parameters
● A point estimate is a single number derived from a
sample that is used to estimate the value of a population
parameter
● Examples:
○ Mean: x̄ is a point estimate of 𝝁
○ Standard deviation: s is a point estimate of 𝝈
Average monthly spending of
100 UIC students = RMB 1000 (x̄)
Sampling and Estimation
● (Classical) statistics aims to infer population-level
values / probabilities
● To do this, we need to assume the characteristics of the
population in terms of probability = a probability
distribution (e.g., normal distribution)
● We already know that if we can assume the probability
distribution of a variable, we can calculate the
population-level value for a given sample value
● In practice, performing such a calculation is often
challenging because of:
1) Errors in the sampling process
2) Population information for sample data is usually
unknown (e.g., mean, S.D., …)
[Figure: distribution of values for variable diesum in data (data distribution, n=36)]
[Figure: probability function, where the area under the function is the probability; distribution of all possible values and the probability of observing these values for variable diesum in the population (probability distribution, n = ∞)]
Sampling Error
● Different samples from the same population can have different characteristics because:
1. Sampling (statistical) error occurs because samples are only a subset of the total
population
▪ Sampling error depends on the size of the sample relative to the population; this
type of error cannot be totally avoided
2. Non-sampling error occurs when the sample does not adequately represent
the target population
▪ Non-sampling error usually results from a poor sample design or choosing the wrong
population frame (e.g., convenience sampling)
Both errors can influence (point) estimates
A Sampling Experiment
● A population is uniformly distributed between 0 and 10.
▪ Mean = (0+10)/2=5
▪ Variance = (10-0)^2/12 = 8.333
● Experiment:
1. Generate 25 samples of size 10 from this population (10 rows * 25
columns = 250 obs)
2. For each sample, compute its mean, sd, mean ± 3sd, and its range
3. Prepare a histogram of the 250 observations (= all samples)
4. Prepare a histogram of the 25 sample means (mean of the means
from 25 columns)
5. Repeat for larger sample sizes (size 25, 100, 500) and draw
comparative conclusions
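The first four steps of the experiment could be sketched in R as follows. The seed and variable names are illustrative, and the numbers produced will differ from those on the lecture slides because the samples are random.

```r
set.seed(2023)  # arbitrary seed for reproducibility

n <- 10           # size of each sample (rows)
n_samples <- 25   # number of samples (columns)

# Step 1: each column is one sample from Uniform(0, 10)
samples <- matrix(runif(n * n_samples, min = 0, max = 10),
                  nrow = n, ncol = n_samples)

# Step 2: summary statistics for each sample (column)
sample_means <- colMeans(samples)
sample_sds <- apply(samples, 2, sd)
lower_3sd <- sample_means - 3 * sample_sds
upper_3sd <- sample_means + 3 * sample_sds
sample_ranges <- apply(samples, 2, function(s) max(s) - min(s))

# Step 3: histogram of all 250 observations
hist(as.vector(samples), main = "All 250 observations", xlab = "Value")

# Step 4: histogram of the 25 sample means
hist(sample_means, main = "25 sample means", xlab = "Sample mean")

print(mean(sample_means))  # average of the sample means
print(sd(sample_means))    # spread of the sample means
```

For step 5, changing n to 25, 100, or 500 and rerunning the block shows how the spread of the sample means shrinks.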
Experiment Results
Note that the average of all the sample means is
quite close to the true population mean of 5.0.
Experiment Results
● Repeat the sampling experiment for samples of size 25, 100, and 500.
• As the sample size increases, the averages of the sample
means are all still close to the expected value of 5;
• however, the standard deviation of the sample means
becomes smaller,
• meaning that the means of samples are clustered closer
together around the true expected value. The
distributions become approximately normal.
Estimating Sample Error Using the Empirical Rules
● Using the empirical rule for 3 standard deviations away from
the mean, ~99.7% of sample means should be between:
[2.55, 7.45] for n=10
[3.65, 6.35] for n=25
[4.09, 5.91] for n=100
[4.76, 5.24] for n=500
● As the sample size increases, the sampling error decreases.
Sampling Distribution of the Mean
● Sampling distribution of the mean is the distribution of the means of all
possible samples of a fixed size n from some population
● The standard deviation of the sampling distribution of the mean is
called the standard error of the mean:
\[ \text{standard error of the mean} = \frac{\sigma}{\sqrt{n}} \]
● As n increases, the standard error decreases
▪ Sample means are less dispersed around the population mean
= Larger sample sizes have less sampling error
Central Limit Theorem (CLT)
1. If the sample size is large enough (n>=30), then the sampling
distribution of the mean:
● is approximately normally distributed regardless of the distribution of
the population
● has a mean equal to the population mean
2. If the population is normally distributed, then the sampling
distribution is also normally distributed for any sample size.
● The central limit theorem allows us to use the theory we learned about
calculating probabilities for normal distributions to draw conclusions about
sample means.
When calculating probabilities, determine whether the question concerns an individual
observation (use the population standard deviation) or the mean of a sample (use the
standard error of the mean).
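A small simulation can illustrate the theorem. The sketch below assumes a strongly skewed population (exponential with rate 1, so mean 1 and standard deviation 1), which is a choice made for this illustration and not part of the lecture; the seed is arbitrary.

```r
set.seed(7)    # arbitrary seed for reproducibility

n <- 30        # sample size
reps <- 5000   # number of samples drawn

# Mean of each of 5000 samples of size 30 from a skewed population
sample_means <- replicate(reps, mean(rexp(n, rate = 1)))

# The sample means centre on the population mean (1) ...
print(mean(sample_means))

# ... and their spread is close to the standard error sigma / sqrt(n)
print(sd(sample_means))
print(1 / sqrt(n))

# The histogram of sample means looks roughly bell-shaped
hist(sample_means, main = "Sample means, n = 30", xlab = "Sample mean")
```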
Central Limit Theorem (CLT)
Why is the CLT important?
As mentioned earlier, performing
statistical inference by using sample
data is challenging because of
1) Errors in the sampling process
2) Population information for sample data is
usually unknown (e.g., mean, S.D., …)
CLT helps us to resolve these issues by defining
the probability distribution of sample means from a
population (i.e., sampling distribution)
With CLT, we can calculate the probability of
observing a sample with a specific mean value
at the population level
Using Standard Error in Probability Calculations
● The amount of purchase orders for books on a publisher’s website is
normally distributed with a mean of $36 and a standard deviation of $8.
● Find the probability that:
1. The amount of someone’s purchase order exceeds $40.
Use the population standard deviation:
P(x>40)=1-pnorm(40,36,8)=0.3085
2. The mean amount of 16 customers’ purchase orders exceeds $40.
Use the standard error of the mean (8/√16 = 2):
P(x̅>40)=1-pnorm(40,36,2)=0.0228
Interval Estimates
● An interval estimate provides a range for a population characteristic based on a sample
(population distribution is assumed)
▪ Intervals specify a range of plausible values for the characteristic of interest and
a way of assessing “how plausible” they are
▪ e.g., if we observe a value “X” in a sample, what would be its value in the
population?
● 100(1-α)% probability interval is any interval [A,B] such that the probability
of falling between A and B is 1-α.
▪ Probability intervals are often centered on the mean or median
▪ Example: in a normal distribution, the mean +/- 1
sd describes an approximate 68% probability
interval around the mean.
▪ Another example, the 5th and 95th percentiles in a
data set constitute a 90% probability interval
Interval Estimates in the News
● In the U.S., news media often conduct a poll (sample) to predict the outcome of
an election (population)
1. A Gallup poll might report that 56% of voters support a certain candidate with a
margin of error of ±3%
▪ We would have a lot of confidence that the candidate would win since the
interval estimate is [53%, 59%]
2. Suppose the poll reported a 52% level of support with a ±4% margin of error
▪ We would be less confident in predicting a win for the candidate since the
interval estimate is [48%, 56%]
How to calculate the error associated with a point estimate?
Confidence Intervals
● A confidence interval is a range of values between which the
value of the population parameter is believed to be, along with
a probability that the interval correctly estimates the true
(unknown) population parameter.
▪ This probability is called the level of confidence, denoted by 1-α, where α is a
number between 0 and 1.
▪ The level of confidence is usually expressed as a percent; common values are 90%, 95%,
or 99%.
● For a 95% confidence interval, if we chose 100 different samples, leading to 100 different
interval estimates, we would expect that 95% of them would contain the true population
mean
▪ In other words, confidence interval estimates provide a way of assessing the accuracy of
a point estimate
Confidence Interval for the Mean with
Known Population Standard Deviation
● We can use the standard normal distribution to calculate a range for the population mean
around the sample mean if the population SD is known
● Sample mean ± margin of error
● Margin of error is: zα/2 × (standard error) = zα/2 (𝝈/√n)
○ zα/2: value of the standard normal random variable with an upper tail area of α/2 (or a
lower tail area of 1-α/2).
○ Example: if α=0.05 (for a 95% confidence interval), then z0.975=1.96
○ Example: if α=0.10 (for a 90% confidence interval), then z0.95=1.645
Confidence Interval for the Mean with
Known Population Standard Deviation
● A production process fills bottles of liquid detergent. The standard
deviation in filling volumes is constant at 15 mL. A sample of 25 bottles
revealed a mean filling volume of 796 mL.
● A 95% confidence interval estimate of the mean filling volume for the
population is 796 ± 1.96(15/√25) = 796 ± 5.88, or [790.12, 801.88] mL
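A sketch of this confidence interval in R, using qnorm() for the critical value instead of the table (variable names are illustrative):

```r
xbar <- 796    # sample mean filling volume
sigma <- 15    # known population standard deviation
n <- 25        # sample size
alpha <- 0.05  # for a 95% confidence interval

# Critical value: standard normal quantile with lower tail area 1 - alpha/2
z_crit <- qnorm(1 - alpha / 2)

# Margin of error = critical value * standard error of the mean
margin <- z_crit * sigma / sqrt(n)

ci <- c(lower = xbar - margin, upper = xbar + margin)
print(z_crit)
print(ci)
```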
But what if we don’t know the population’s standard deviation?
The t-Distribution
● Also called Student’s t-Distribution
● Used for confidence intervals when the population standard deviation is
unknown
● Its only parameter is the degrees of freedom (df) (no. of sample values - no.
of estimated parameters)
▪ The shape of the t-distribution changes with df
Confidence Interval for the Mean with
Unknown Population Standard Deviation
● Formula:
\[ \bar{x} \pm t_{\alpha/2,\,n-1} \frac{s}{\sqrt{n}} \]
● tα/2,n-1: value from the t-distribution with (n-1) degrees of freedom,
giving an upper tail probability of α/2 (or a lower tail area of
1-α/2).
○ Example: if α=0.05, n=30 (for a 95% confidence interval), then
• t0.975,29=2.05;
○ Example: if α=0.10, n=30 (for a 90% confidence interval), then
• t0.95,29=1.70.
[Figure: density curves of the t-distribution for n=30, n=100, and n=1000 compared with the Z (standard normal) distribution]
The shape of the t-distribution becomes closer to a normal distribution as its df increases
Confidence Interval for the Mean with Unknown
Population Standard Deviation
● Excel file Credit Approval Decisions. Find a 95% confidence interval
estimate of the mean revolving balance of homeowner applicants.
● Sample mean=$12,630.37; s=$5393.38;
standard error=$1037.96; t0.025,26=2.056
12,630.37± 2.056(5393.38/√27)
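The interval can be evaluated in R from the summary statistics given above. This is a sketch that uses qt() for the critical value rather than reading the Excel file; variable names are illustrative.

```r
xbar <- 12630.37  # sample mean revolving balance
s <- 5393.38      # sample standard deviation
n <- 27           # number of homeowner applicants in the sample
alpha <- 0.05     # for a 95% confidence interval

# Standard error of the mean, estimated from the sample
se <- s / sqrt(n)

# Critical value from the t-distribution with n - 1 degrees of freedom
t_crit <- qt(1 - alpha / 2, df = n - 1)

margin <- t_crit * se
ci <- c(lower = xbar - margin, upper = xbar + margin)

print(se)
print(t_crit)
print(ci)
```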
Thank You For Listening
Homework
• Run the R code and try to
understand all of it