Lecture 5 - The Department of Statistics and Applied Probability, NUS

Theoretical Distributions in
Probability and Statistics
Decision-making
In a large family where it is known that there is genetic pre-disposition to
suffer from diabetes, how many children out of a possible 7 are likely to be
affected by diabetes?
A hospital administrator needs to decide how many people to staff the
Accident and Emergency Department of the hospital during 9am to 12pm on
weekdays. How should the administrator decide?
In 2004, the World Health Organisation (WHO) revised the body-mass index
(BMI) definitions for overweight and obese individuals in Asian populations.
Instead of a BMI range of 25 – 29 for defining overweight, and a BMI above
30 for defining obese (as used in Caucasian populations), the corresponding
ranges for Asian populations are 23 – 27.5 and above 27.5. How did the
scientists at WHO decide on the new ranges?
Modeling the outcome variable with some appropriate theoretical framework
Data exploration and Statistical analysis
1. Data checking, identifying problems and characteristics
2. Understanding chance and uncertainty
3. How will the data for one attribute behave, in a theoretical framework?
[Flowchart: Data → Data exploration (categorical / numerical outcomes) →
Model each outcome with a theoretical distribution]
Random variable
Definition:
A random variable is a theoretical consideration of the possible outcome of
an event.
Example:
In a survey of 5 students, how many female students are there?
The answer to this is a random variable. The possible outcomes are 0, 1, 2,
3, 4 or 5 female students. So the random variable describes what the
answer could have been, prior to finding out the actual answer.
Suppose we know that out of 5 students, there are 4 girls. Then there is no
longer any uncertainty or variability: the exact answer is known, and thus
this is no longer a random variable.
Discrete random variables
Probability mass function
Definition:
The PMF describes the probability of each possible outcome of a discrete
random variable: p(x) = P(X = x).
Properties of a probability mass function:
- p(x) ≥ 0 for every possible value x
- Σx p(x) = 1, summing over all possible values x
Example 1:
Let X denote the number of heads obtained when an unbiased coin is
tossed 3 times. Find the probability distribution of X. Find also
P(|X – 2| ≤ 1.2).
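As a quick check, a minimal sketch in R (the engine underlying the RExcel
add-in used later in these notes): here X ~ Binomial(3, 0.5), and
|X – 2| ≤ 1.2 corresponds to X taking a value in {1, 2, 3}.

# PMF of X = number of heads in 3 tosses of a fair coin: X ~ Binomial(3, 0.5)
dbinom(0:3, size = 3, prob = 0.5)       # 0.125 0.375 0.375 0.125
# P(|X - 2| <= 1.2) requires 0.8 <= X <= 3.2, i.e. X in {1, 2, 3}
sum(dbinom(1:3, size = 3, prob = 0.5))  # 0.875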
Cumulative distribution function
Definition:
The CDF describes the probability that a random variable takes a value of at
most x, and is formally defined as F(x) = P(X ≤ x) for any real x.
Properties of a CDF:
- F is non-decreasing, with F(x) → 0 as x → –∞ and F(x) → 1 as x → ∞
- P(a < X ≤ b) = F(b) – F(a)
Example 2:
Uniform Distribution
Definition:
A random variable is said to follow a Uniform distribution if all of the
possible outcomes are equally likely.
Mathematically: P(X = x) = constant.
So if there are n possible outcomes, the chance of each of the outcomes is
1 / n.
Example 3:
In a game of chance, a gambler chooses an integer between 13 and 18
inclusive. There are equal chances for any number in
the set {13, 14, 15, 16, 17, 18} to be drawn. Let X be the random variable
denoting the number drawn. Find the probability distribution of X and also
P(X < 16).
Bernoulli Distribution
A random experiment with two possible outcomes, conveniently defined as
“success” or “failure” is called a Bernoulli trial after Jacob Bernoulli (1654 –
1705). The choice of the event as “success” or “failure” is completely
arbitrary.
Example: a toss of a coin will show either a head or a tail. The “success” event can
be either the head, or the tail.
Conventionally, p denotes the probability of success and 1 – p denotes the probability
of failure.
Binomial Distribution
The number of “success” events out of n repeated trials, where each trial
results in one of 2 mutually exclusive outcomes with the same success
probability and the repeated trials are mutually independent, follows a
Binomial distribution.
Example 4:
A batch of pregnancy test kits contains 50 kits, of which 10% are known to be
defective. If 3 test kits are randomly chosen with replacement from the batch,
what is the probability that:
(i) all will be defective;
(ii) none will be defective;
(iii) at least one will be defective;
(iv) exactly one will be defective;
(v) exactly two will be defective;
(vi) not more than two will be defective.
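Because the kits are drawn with replacement, the number of defective kits
follows a Binomial(3, 0.1) distribution; a sketch of all six parts in R:

# X = number of defective kits among 3 drawn with replacement: X ~ Binomial(3, 0.1)
dbinom(3, size = 3, prob = 0.1)      # (i)   all defective: 0.001
dbinom(0, size = 3, prob = 0.1)      # (ii)  none defective: 0.729
1 - dbinom(0, size = 3, prob = 0.1)  # (iii) at least one defective: 0.271
dbinom(1, size = 3, prob = 0.1)      # (iv)  exactly one defective: 0.243
dbinom(2, size = 3, prob = 0.1)      # (v)   exactly two defective: 0.027
pbinom(2, size = 3, prob = 0.1)      # (vi)  not more than two defective: 0.999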
Multinomial Distribution
The Binomial distribution has been used to obtain probabilities for the
number of times an event of interest (out of 2 possible events) occurs when
the same experiment is repeated several times.
Sometimes one is interested in counting the number of occurrences of
several events simultaneously. In such a situation the multinomial
distribution is useful.
Assume there are k possible outcomes E1, E2, …, Ek, and let n1, n2, …, nk
denote the corresponding numbers of occurrences of each outcome out of a
total of n independent trials. Then
P(n1 occurrences of E1, n2 of E2, …, nk of Ek)
= [n! / (n1! n2! … nk!)] × p1^n1 × p2^n2 × … × pk^nk,
with pi = P(Ei) and n1 + n2 + … + nk = n.
Example 5:
When snapdragons with pink flowers are crossed, a randomly chosen
offspring has either red (with prob. 0.25), pink (with prob. 0.50) or white (with
prob. 0.25) flowers. What is the probability that among 10 randomly chosen
seeds, 3 will develop white flowers, 2 red ones and 5 pink flowers?
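A one-line sketch in R, using the built-in multinomial PMF with the counts
ordered as (red, pink, white):

# P(2 red, 5 pink, 3 white) out of 10 seeds with probabilities (0.25, 0.50, 0.25)
dmultinom(x = c(2, 5, 3), prob = c(0.25, 0.50, 0.25))  # ~ 0.0769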
Poisson Distribution
The Poisson distribution is usually used to calculate the probabilities of a
number of occurrences of a rare event. Often these cases are such that an
event can occur repeatedly over a long period of time or over a large area;
the distribution applies to the number of occurrences in a small interval of
time or over a small area.
Example: machine breakdowns, arrivals of calls at a telephone exchange,
faults developing in a pipeline, random arrival of customers at a service
station, accident occurrences, radioactive decay, gene mutations at a
particular locus
Assumptions of a Poisson Distribution
• The outcomes occur randomly.
• The number of outcomes occurring in one time interval or specified region
is independent of the number that occur in any other disjoint time interval or
region.
• The probability that a single outcome will occur during a very short time
interval or in a small region is very small and is constant.
• The probability of 2 or more outcomes occurring in such a short time
interval or falling in such a small region is negligible.
Properties of a Poisson Distribution
(A) If X ~ Binomial(n, p), then X → Poisson(np) as n → ∞ and p → 0, with np
held constant. That is, the Poisson distribution arises as the limiting case of
the Binomial distribution.
(B) Suppose that X1 and X2 are independent random variables with X1 ~
Poisson(λ1) and X2 ~ Poisson(λ2); then Y = X1 + X2 ~ Poisson(λ1 + λ2). That is,
the sum of two independent Poisson random variables also has a Poisson
distribution.
Example 6:
The number of emergency admissions each day to a hospital is found to
have a Poisson distribution with mean 2.
a) Evaluate the probability that on a particular day there will be no
emergency admissions.
b) At the beginning of one day, the hospital has 5 beds available for
emergencies. Calculate the probability that this will be an insufficient
number for the day.
c) Calculate the probability that there will be exactly 3 admissions
altogether on two consecutive days.
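A sketch of all three parts in R; for (c), the two-day total of independent
Poisson counts is itself Poisson with mean 2 + 2 = 4 (property B above).

# Daily emergency admissions X ~ Poisson(2)
dpois(0, lambda = 2)      # (a) P(X = 0) = e^(-2) ~ 0.135
1 - ppois(5, lambda = 2)  # (b) P(X > 5), more admissions than the 5 beds: ~ 0.017
dpois(3, lambda = 4)      # (c) two-day total ~ Poisson(4); P(total = 3) ~ 0.195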
Example 7:
Oranges are packed in crates, each containing 250. On average, 0.6% are
found to be bad when the crates are opened. What is the probability that
there will be more than 2 bad oranges in a crate?
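Here the exact distribution is Binomial(250, 0.006), and the Poisson
approximation (property A above) uses λ = np = 250 × 0.006 = 1.5; both are
sketched in R below.

lambda <- 250 * 0.006                    # np = 1.5
1 - ppois(2, lambda)                     # Poisson approximation: P(X > 2) ~ 0.1912
1 - pbinom(2, size = 250, prob = 0.006)  # exact Binomial, for comparison: ~ 0.191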
Recap – Numerical EDA
• Calculating informative numbers which summarise the dataset
• What numbers are useful for describing the age of 1,059 individuals with
diabetes?
• Location parameters (mean, median, mode)
• Spread (range, standard deviation, interquartile range)
• Skewness
[Figure: histogram of AGE (20 – 80 years) for the 1,059 individuals with
diabetes; mean age 54.6 years]
Properties of means and variances
Means and variances play important roles in the definitions of theoretical
distributions and in determining the variation of the outcomes.
Mean (Expectation) of a discrete random
variable
The expectation of a discrete outcome X, commonly known as the mean of X
or the expected value of X, is denoted by E(X) and defined as
E(X) = Σx x P(X = x).
The value of E(X) refers to the average value of x that one can expect after
sampling a large number of values from the distribution of X. E(X) is the
long-run average of observations of the variable X.
The expectation of any function g(.) which depends on the random variable X,
g(X), is defined as follows:
E[g(X)] = Σx g(x) P(X = x).
Variance of a discrete random variable
The variance of X, or the population variance of X, is denoted by Var(X) and is
defined as
Var(X) = E[(X – E(X))²] = E(X²) – [E(X)]².
Var(X) is usually denoted by σ², and σ = √Var(X) is defined to be the standard
deviation of X.
Functions of means and variances
For constants a and b: E(aX + b) = a E(X) + b and Var(aX + b) = a² Var(X).
For any two random variables, E(X + Y) = E(X) + E(Y); if X and Y are also
independent, Var(X + Y) = Var(X) + Var(Y).
Example 8:
Find the expected score of a single roll of a fair die.
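A one-line check in R of the textbook answer E(X) = (1 + 2 + … + 6) / 6 = 3.5:

# E(X) = sum of x * P(X = x); a fair die gives each face probability 1/6
sum((1:6) * (1 / 6))  # 3.5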
Continuous random variables
Definition:
A continuous random variable X takes any value in a given range, and
theoretically can be measured to any desired degree of accuracy. (E.g. height,
weight, age, etc.)
When the total number of possible outcomes is very large, the histogram
approximates a smooth curve called a frequency curve or a probability
density curve. The function represented by this curve is called the frequency
function, or more commonly the probability density function, denoted by f.
As the function f denotes a probability function,
f(x) ≥ 0 for all x, and the total area under f equals 1.
Some notes on continuous random variables
For a continuous random variable, P(X = x) = 0 for any single value x, so that
P(a ≤ X ≤ b) = P(a < X < b) = the area under f between a and b.
Properties of continuous random variables
The cumulative distribution function (CDF) of a continuous random variable
is denoted FX(x) = P(X ≤ x) for any real x.
Uniform Distribution
Definition:
A random variable is said to follow a Uniform distribution on the interval
[a, b] if the probability density function is constant on the interval:
f(x) = 1 / (b – a) for a ≤ x ≤ b, and f(x) = 0 otherwise.
Normal Distribution
[Figure: Normal curve fitted to marks for a Mathematics exam (40 – 80):
about 68% of the probability lies within 1 standard deviation of the mean,
and about 95% within 2 SDs]
Normal Distribution
Also known as the Gaussian distribution.
A useful distribution to model outcomes in the
natural world.
Properties of the Normal distribution
- Special case: If μ = 0 and σ² = 1, then X has a Standard Normal distribution.
Usually, the probability density function of the standard normal is written
φ(x), and the CDF is written Φ(x).
- If X ~ N(0, 1), and Y = aX + b, then Y ~ N(b, a²). Conversely, if
X ~ N(μ, σ²), and Y = (X – μ) / σ, then Y ~ N(0, 1).
- If X1 ~ N(μ1, σ1²) and X2 ~ N(μ2, σ2²), and X1 and X2 are mutually
independent, then Y = X1 + X2 ~ N(μ1 + μ2, σ1² + σ2²).
- The plot of the density function f is bell-shaped and symmetrical about the
line x = μ, with a single peak. So the mean, mode and median of the normal
distribution coincide.
- Practically all of the population (about 99.7%) lies in the interval μ ± 3σ,
about 95% of the population lies in the interval μ ± 2σ, and about 68% of the
population lies in the interval μ ± σ.
Properties of the Normal distribution
- Suppose X ~ Binomial(n, p); for large n, with np and n(1 – p) both
reasonably large, the Normal distribution can be used as an approximation:
X ≈ N(np, np(1 – p)).
- Suppose X ~ Poisson(λ); for large λ, the Normal distribution can also be
used as an approximation: X ≈ N(λ, λ).
- When the Normal distribution is used to approximate a discrete
distribution, a continuity correction must be used. This is because the
discrete probability P(X = x) is equivalent to the continuous probability
P(x – 0.5 ≤ X < x + 0.5).
- For example, suppose X is discrete, the Normal approximation is used, and
the question requires P(X < 35). This is equivalent to finding the continuous
probability P(X < 34.5), since the discrete value x = 35 is not included in the
range X < 35, and so the continuous random variable cannot be bigger than
34.5 (any continuous value with 34.5 ≤ x < 35 would still round up to give 35
in the discrete random variable).
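To see the effect of the correction, here is a sketch in R with illustrative
numbers (not from the notes): X ~ Binomial(100, 0.3), approximated by
N(np, np(1 – p)) = N(30, 21).

# Illustrative numbers only: X ~ Binomial(100, 0.3), approximated by N(30, 21)
pbinom(34, size = 100, prob = 0.3)     # exact P(X < 35) = P(X <= 34): ~ 0.837
pnorm(34.5, mean = 30, sd = sqrt(21))  # with continuity correction:   ~ 0.837
pnorm(35, mean = 30, sd = sqrt(21))    # without the correction: ~ 0.862 (further off)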
Calculating probabilities for N(0,1)
- http://www.stat.psu.edu/~babu/418/norm-tables.pdf
- Cumulative Standard Normal table
P(Z < 0.45) = 0.67364
?
P(Z > 1.12) = 1
? – P(Z < 1.12)
= 1 – 0.8684
= 0.1316
P(Z < -0.45) = 1 – P(Z > -0.45)
= 1 – P(Z < 0.45)
RExcel and Normal distribution
Example 9:
Suppose X ~ N(0, 1). Find the following probabilities by using RExcel:
a) P(X < x) for x = 0.65
b) P(X ≤ x) for x = 0.123
c) P(X > x) for x = 2.78
d) P(X > x) for x = 0
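A sketch of the same four calculations in R, whose pnorm function is what
RExcel calls for Normal probabilities:

pnorm(0.65)      # (a) P(X < 0.65) ~ 0.7422
pnorm(0.123)     # (b) P(X <= 0.123) ~ 0.549 (equals P(X < 0.123): X is continuous)
1 - pnorm(2.78)  # (c) P(X > 2.78) ~ 0.0027
1 - pnorm(0)     # (d) P(X > 0) = 0.5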
Example 10:
X and Y are independent random variables which are both normally
distributed, with X ~ N(100, 25) and Y ~ N(120, 20).
Calculate the following probabilities:
(a) P(X > 92)
(b) P(Y > X)
(c) P(2X + Y < 300)
(d) P(|X – Y| < 10)
Exponential Distribution
Recall that, under certain assumptions, the number of occurrences of rare
events follows a Poisson distribution. Sometimes, the interest may be in the
time till the observation of the event.
Let Yt denote the number of occurrences of the rare event in t time units.
Suppose the mean number of events is λ per time unit. Then Yt follows a
Poisson distribution with mean λt.
Let X denote the time, measured from an arbitrary moment, to the first event.
Then
P(X > x) = P(no events in an interval of x time units) = P(Yx = 0) = e^(–λx).
Therefore FX(x) = P(X ≤ x) = 1 – P(X > x) = 1 – e^(–λx),
and
f(x) = λ e^(–λx), for x > 0.
This is called the exponential distribution or the waiting time distribution.
Exponential Distribution
The waiting time until an event occurs in a Poisson process follows the
exponential distribution.
Lack of memory property
Formally: if X ~ Exponential(λ), then P(X > s + t | X > s) = P(X > t) for all
s, t ≥ 0.
This is rather relevant to some of you! The waiting time for a bus follows an
Exponential distribution (prove this!), and this property of an Exponential
distribution is rather depressing.
It says that the chance that you have to wait another 5 minutes for the bus
is exactly the same even if you have already waited 20 minutes and still
have not seen it arrive!
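A numerical illustration in R, with a hypothetical rate of 0.2 buses per
minute: the conditional chance of waiting at least 5 more minutes after
already waiting 20 equals the unconditional chance of waiting at least 5.

rate <- 0.2  # hypothetical: buses arrive at an average rate of 0.2 per minute
# P(X > 25 | X > 20) = P(X > 25) / P(X > 20)
pexp(25, rate, lower.tail = FALSE) / pexp(20, rate, lower.tail = FALSE)  # e^(-1) ~ 0.368
pexp(5, rate, lower.tail = FALSE)                                        # e^(-1) ~ 0.368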
Example 11:
Assume that a radioactive substance emits particles at a mean rate of 1.5
per second. What is the chance that we have to wait more than three
seconds for the first emission to occur?
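Here the waiting time X is Exponential with rate λ = 1.5 per second; a
one-line sketch in R:

# P(X > 3) = e^(-1.5 * 3) = e^(-4.5)
pexp(3, rate = 1.5, lower.tail = FALSE)  # ~ 0.0111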
Example 12:
Assume that the average time between two subsequent visits of insects to a
certain flower is 12 minutes. You are starting to observe the flower. What is
the chance that you will have to wait for no more than 15 minutes for the first
insect to arrive? What is the chance that the time between the first and
second arriving insect is less than 15 minutes? What is the chance that less
than 3 insects will visit the flower, given that you observe the flower for one
hour?
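A sketch of all three parts in R: waiting times are Exponential with rate
1/12 per minute, the gap between the first and second insect has the same
distribution, and the number of visits in one hour is Poisson with mean
60/12 = 5.

p_first <- pexp(15, rate = 1/12)  # P(wait <= 15 min for first insect) ~ 0.713
p_gap <- p_first                  # gap between 1st and 2nd insect: same distribution
ppois(2, lambda = 5)              # P(fewer than 3 visits in one hour) ~ 0.125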
Entropy
Often in medical research, we are interested in predicting the outcome given
some probability statements.
Suppose there are four possible outcomes after chemotherapy treatment:
(complete remission, partial remission, no change, early death)
If the probabilities of the four outcomes estimated from current data are:
(0.90, 0.08, 0.02, 0.00),
you will feel confident about the treatment, since current data intuitively
provide a lot of information, and this information seems to suggest a high
likelihood of positive outcomes.
Similarly, if the probabilities are
(0.01, 0.01, 0.08, 0.90)
you will also feel confident that you should avoid undergoing the treatment,
because again, current data provide a lot of information to suggest
negative outcomes.
Entropy
However, if the probabilities are:
(0.25, 0.25, 0.25, 0.25)
you actually will not gain additional information from previous data, or
previous data are perfectly uninformative.
Entropy is a statistical measure to quantify the amount of information
available for prediction, and is calculated from using all the probabilities of
the possible outcomes (i.e. from the probability function).
Statistical definition
The entropy of a random variable X with probability function p(x) is defined
to be the quantity
H(X) = –Σx p(x) log p(x),
with the convention that 0 log 0 = 0.
Entropy
It can be shown that for a random variable with n possible values, the
entropy is always bounded between 0 and log(n), where:
0 corresponds to the situation with perfect information, and
log(n) corresponds to the situation with no information (all n outcomes
equally likely).
Relative mutual information
It is increasingly common to define the relative mutual information (RMI) as
RMI(X) = 1 – [H(X)/log(n)]
to yield a more intuitive information criterion that is bounded between 0 and
1, where:
0 corresponds to the situation with no information
1 corresponds to the situation with perfect information.
Example 13:
Let X denote the outcome when flipping a fair coin and Y the outcome when
rolling a fair die. Furthermore, let Z be one if two fair dice show a double six,
and zero otherwise. Notice that if you want to predict the outcome of these
random variables, you have the best chance of predicting Z correctly, while
Y is the hardest to predict. Calculate the entropies and the relative mutual
information of these three random variables.
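A minimal sketch in R (natural logarithms, matching the log(n) bound
above): X has two equally likely outcomes, Y six, and Z has probabilities
(1/36, 35/36).

# Entropy H(X) = -sum of p * log(p), taking 0 * log(0) = 0
H <- function(p) { p <- p[p > 0]; -sum(p * log(p)) }
H_coin <- H(rep(1/2, 2))     # X: fair coin, log(2) ~ 0.693
H_die <- H(rep(1/6, 6))      # Y: fair die,  log(6) ~ 1.792
H_Z <- H(c(1/36, 35/36))     # Z: double six or not, ~ 0.127
# Relative mutual information: RMI = 1 - H / log(n)
1 - H_coin / log(2)          # 0     (no information)
1 - H_die / log(6)           # 0     (no information)
1 - H_Z / log(2)             # ~ 0.82 (Z is the easiest to predict)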
Something fun – practical application of
what we have learnt so far!
It is very common for students to go through the material on probability and
theoretical distributions wondering about the relevance of all this in real
life!
Let’s look at something fun, which most of you will hopefully have some
experience with:
Monopoly
- 40 grids possible
- each player moves his avatar around the game board by rolling two dice
- Community Chest / Chance cards
- Acquire properties across the game board
- Develop properties of the same colour combination into houses and hotels
- Aim to bankrupt other players and be the richest (sounds familiar?)
- Potential of going to jail if landing on “Go to jail”
- or if you roll doubles 3 times in a row
- or if Chance / Community Chest sends you there.
Monopoly
- 40 grids possible
- Is every grid equally likely (i.e. a 2.5% chance each)?
- What are the properties that are most likely to be landed on?
- A computer simulation of Monopoly, with all the rules and regulations,
turns out that the Jail spot has the highest occupancy rate (5.88%)
- which inevitably results in the orange properties being the most
frequented (8.47%)
Simple probability theory and knowledge of dice outcomes can provide a
marginal edge in games!
Possible outcomes from a roll of two dice:
Prob(X = 2) = 1 in 36
Prob(X = 3) = 2 in 36
Prob(X = 4) = 3 in 36
Prob(X = 5) = 4 in 36
Prob(X = 6) = 5 in 36
Prob(X = 7) = 6 in 36
Prob(X = 8) = 5 in 36
Prob(X = 9) = 4 in 36
Prob(X = 10) = 3 in 36
Prob(X = 11) = 2 in 36
Prob(X = 12) = 1 in 36
[Figure: Monopoly board annotated with simulated landing percentages for
the grids and colour groups, ranging from about 2.15% to 8.47%; Jail is the
single most-visited spot at 5.88%]
Waiting time?
We can model the waiting time for someone to land on a particular grid
with an Exponential distribution.
For example, let’s suppose we are interested in the most expensive
property on the board.
[Figure: Monopoly board annotated with expected waiting times (in turns)
until someone lands on each grid, ranging from about 32.7 to 47.9]
Students should be able to
• know the definitions of the various terminologies and
distributions
• know how to calculate the probability mass/density function for
the theoretical distributions, and in empirical situations
• calculate the probability of specific outcomes, when assuming
a theoretical distribution for these outcomes
• understand the interpretation of entropy and know how to
calculate the entropy