Probabilistic Models
Example #1
A production lot of 10,000 parts is tested for defects. It is expected that a defective
part occurs once in every 1,000 parts. A sample of 500 is tested, with 2 defective
parts found.
Should you conclude that this is a "bad" lot?
Is the number of defective parts within the tolerance?
To understand and analyze this test, we need to have the right model for the events.
We need to identify an "event" and its probability.
Basic probability properties:
P(event) = 0 if the event cannot occur,
P(event) = 1 if the event must occur,
0 < P(E) < 1 otherwise.
If A is the set of all possible events =⇒ P(A) = 1
If distinct events are disjoint =⇒
P(A1 or A2 or A3) = P(A1) + P(A2) + P(A3)
and, summing over all possible events, Σ_j P(Aj) = 1.
[Figure: disjoint events A1, A2, A3 inside the event space A]
If all events are equally likely =⇒ P(Aj) = 1/n for n events.
For example we can consider:
A1 = the first part defective
A2 = the second part defective
A3 = only the third part defective
Then A1 and A3 are disjoint =⇒ P(A1 or A3) = P(A1) + P(A3)
Note: If A1 and A2 are not disjoint =⇒ P(A1 or A2) ≠ P(A1) + P(A2)
[Figure: overlapping events A1 and A2 inside the event space A]
In this case we have P (A1 or A2 ) = P (A1 ) + P (A2 ) − P (A1 and A2 )
In our example: the event space is all possible outcomes of the test, which is large and
complicated! So we would like to break it down into smaller tests: testing each part individually.
Defective / Not Defective is similar to a coin flip:
Fair coin:   P(H) = 1/2, P(T) = 1/2
Biased coin: P(T) = p,   P(H) = 1 − p = q
It is useful to designate H, T (Success or Failure) with 1 and 0. Each test is then a
Bernoulli trial with parameter p: P(1) = p.
Now we can calculate the probability of events involving more than one test of a
part, by noting that the tests of the parts are assumed to be independent.
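To make the independence assumption concrete, here is a minimal simulation sketch of two part tests; the defect probability used is an arbitrary illustrative value, not the rate from the example above.

```python
import numpy as np

# Minimal simulation of two independent part tests (Bernoulli trials).
# p = 0.3 is an arbitrary illustrative value, not the defect rate from the example.
rng = np.random.default_rng(0)
p = 0.3
n = 1_000_000

x1 = rng.random(n) < p   # first test: True means "defective" (success, coded 1)
x2 = rng.random(n) < p   # second test, generated independently of the first

estimate = np.mean(x1 & ~x2)       # estimate of P(X1 = 1 and X2 = 0)
print("simulated P(X1=1 and X2=0):", estimate)
print("product rule p(1-p):       ", p * (1 - p))
```

The simulated frequency agrees with the product rule because the two tests are generated independently.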
Conditional Probability
P(A|B) = probability that event A occurs given B:
P(A|B) = P(A, B) / P(B)
where P(A, B) = probability that both A and B occur, and P(B) = probability that B occurs.
Simple example showing dependence:
Flip a coin three times. Find the P(observe 2 heads | second is a head).
If we view all possible outcomes (for a fair coin) we have:
HHH HHT THH THT
HTH HTT TTH TTT.
The first four events are those with H as second outcome.
Then P(observe 2 heads | second is a head) = P(2 out of 3 | second is head) = 2/4
= 1/2.
Note: P(2 out of 3) is the unconditional probability, i.e. the probability of 2 successes
in 3 Bernoulli trials.
P(2 out of 3) = (3 choose 2) (1/2)² (1/2) = 3/8
We can also calculate P(observe 2 heads | second is a head) using
P(2 out of 3 and second is H) / P(second is H) = (2/8) / (1/2) = 1/2
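A short enumeration sketch over the eight equally likely outcomes reproduces both the conditional and the unconditional probabilities above.

```python
from itertools import product

# Enumerate the 8 equally likely outcomes of three fair-coin flips.
outcomes = list(product("HT", repeat=3))

second_is_head = [o for o in outcomes if o[1] == "H"]
two_heads_given = [o for o in second_is_head if o.count("H") == 2]

# Conditional probability by counting within the reduced sample space.
print(len(two_heads_given) / len(second_is_head))                 # 0.5
# Unconditional P(2 out of 3).
print(sum(o.count("H") == 2 for o in outcomes) / len(outcomes))   # 0.375 = 3/8
```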
Simple example showing independence:
Successive outcomes of the coin flip (Bernoulli trial) are independent.
So P(X1 = H | X2 = T) = P(X1 = H). Then, since
P(X1 = H | X2 = T) = P(X1 = H, X2 = T) / P(X2 = T)
=⇒ P(X1 = H, X2 = T) = P(X1 = H) · P(X2 = T)
In general the joint probability density for two independent random variables X1, X2
is written as P(X1, X2) = f(x1, x2) = f(x1)f(x2); that is, the density for the joint
distribution is just the product of the individual densities.
The statement of a problem can make a big difference:
Example
Given a room of n people, what is the probability that at least 2 have the same
birthday?
So we have to find P(X ≥ 2) where X is the number of people with the same birthday.
Stated in this way it is difficult because we have to consider all possible combinations.
But we can also state it as 1 − P(X = 1). So P(X ≥ 2) = 1 − P(X = 1), where
P(X = 1) = P(the 2nd person is different from the 1st)
· P(the 3rd person is different from the 2nd and 1st)
· · · P(the nth person is different from all the rest)
= (1 − 1/365)(1 − 2/365) · · · (1 − (n − 1)/365)
n          17      23     32     56
P(X ≥ 2)   0.315   0.5    0.75   0.98
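The table values can be reproduced directly from the product above; a small sketch (ignoring leap years):

```python
def p_shared_birthday(n: int) -> float:
    """P(at least two of n people share a birthday), using the product above."""
    p_all_distinct = 1.0
    for k in range(1, n):
        p_all_distinct *= 1 - k / 365
    return 1 - p_all_distinct

for n in (17, 23, 32, 56):
    print(n, round(p_shared_birthday(n), 3))
```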
Recall our problem.
Define Xj = outcome of the jth test. P(Xj = 1) = p where p is the probability of a
defective part. Here "success" is the event of finding a defective part. Then if the
individual tests are independent:
P (X1 = 1 and X2 = 0) = P (X1 = 1) · P (X2 = 0)
Again, the treatment of "and" and "or" in probability of events can be viewed from
the set point of view.
P (X1 = 1 or X2 = 0) = P (X1 = 1) + P (X2 = 0) − P (X1 = 1 and X2 = 0)
Let's write the event space for the first two tests: {00, 01, 10, 11}, where the first digit
is the outcome of test 1 and the second is the outcome of test 2.
P (X1 = 1) = p
P (X2 = 0) = 1 − p
P (X1 = 1 and X2 = 0) = p(1 − p) using independence
P (X1 = 1 or X2 = 0) = p + (1 − p) − p(1 − p) = 1 − p(1 − p)
Note: we also have to be careful about specifying order:
P(X1 = 1 and X2 = 0) ≠ P(1 success and 1 failure)
P(1 success and 1 failure) = P(X1 = 1, X2 = 0) + P(X1 = 0, X2 = 1) = 2p(1 − p)
So if we ask what is the probability of one defective part in the first n trials, we need to
take into account the number of ways that could occur:
1 0 0 0 ··· 0
0 1 0 0 ··· 0
0 0 1 0 ··· 0
and so on.
Similarly, if we ask for P(two defective parts in n trials):
1 1 0 0 ··· 0
1 0 1 0 ··· 0
0 0 1 1 ··· 0
and so on.
For any one of these sequences of n trials with j defective (successes), the probability of
that particular sequence is p^j (1 − p)^(n−j).
Since the order does not affect this probability, and the trials are identical, we count the
number of combinations/permutations we can have in these sequences:
choose j from n = (n choose j) = n! / (j!(n − j)!)
So the probability for j defective in n trials is:
P(j out of n) = (n choose j) p^j (1 − p)^(n−j)
This is called the binomial distribution B(n, p) with parameters n and p, where n
is the number of Bernoulli trials and p is the probability of success on any one trial.
Note: to use the binomial distribution we must have a sequence of tests which are:
1. Identical Bernoulli trials (only two possible outcomes)
2. Independent
So if Y is the number of defectives found in a batch of size n,
P(Y = j) = Σ P(all possible sequences with j successes) = (n choose j) p^j (1 − p)^(n−j).
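This pmf is straightforward to evaluate directly; a minimal sketch:

```python
from math import comb

def binomial_pmf(j: int, n: int, p: float) -> float:
    """P(Y = j) for Y ~ B(n, p): (n choose j) p^j (1 - p)^(n - j)."""
    return comb(n, j) * p**j * (1 - p) ** (n - j)

# For the lot example: P(2 defectives in a sample of 500) when p = 0.001.
print(binomial_pmf(2, 500, 0.001))   # ≈ 0.0758
```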
Back to our example:
We have a test of 500 parts with 2 defectives, and we have to determine if that is
enough to decide that we have a bad lot of parts.
Then, not only do we have to calculate the probability of this event, but also give
a tolerance for the number of defectives observed. Tolerance is typically defined in
terms of a 90%, 95%, or 99% chance of observing some event, under the assumption
that the lot is good. First we calculate the probability of the observed event
P(Y = 2) for Y ∼ B(500, 0.001). So
P(Y = 2) = (500 choose 2) (0.001)² (0.999)^498 ≈ 0.0758
How do we compare this to the tolerance?
If the lot is good, then we would expect that the probability of observing k or more
defectives is
P(Y ≥ k) = Σ_{j=k}^{n} (n choose j) p^j (1 − p)^(n−j)
Graphically, our probability mass function for the discrete binomial variable Y, P(Y
= j) looks like:
[Figure: probability mass function P(Y = j) versus j, with most of the mass at j = 0 and j = 1]
So P (Y ≥ k) is adding up the mass for all values j ≥ k
Here we see:
P (Y ≥ 1) = 1 − P (Y = 0) = 1 − 0.6064 = 0.3936
P (Y ≥ 2) = 1 − [P (Y = 0) + P (Y = 1)] = 1 − (0.6064 + 0.3035) = 0.0901
P (Y ≥ 3) = 1−[P (Y = 0)+P (Y = 1)+P (Y = 2)] = 1−(0.6064+0.3035+0.0758) =
0.0143
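These tail probabilities follow by summing the pmf over j < k and subtracting from 1; a sketch:

```python
from math import comb

def binomial_pmf(j: int, n: int, p: float) -> float:
    return comb(n, j) * p**j * (1 - p) ** (n - j)

def binomial_tail(k: int, n: int, p: float) -> float:
    """P(Y >= k) for Y ~ B(n, p), computed as 1 minus the lower tail."""
    return 1 - sum(binomial_pmf(j, n, p) for j in range(k))

for k in (1, 2, 3):
    print(k, round(binomial_tail(k, 500, 0.001), 4))
# ≈ 0.3936, 0.0901, 0.0143, matching the values above up to rounding
```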
So if we define a 90% "tolerance" - usually called a confidence interval - then this
defines a "rare" event as an event that occurs with 10% probability. Similarly an
event which occurs with 5% probability, defined as "rare", corresponds to defining
a 95% confidence interval.
Thus for a 90% confidence interval, we would conclude that our observation of 2
defectives is a "rare" event for a good batch.
For a 95% confidence interval, we would conclude that observing 2 defectives is not
"rare".
Thus the choice of the confidence interval influences how many false positives or false
negatives we obtain in our testing. The choice depends on our tolerance of either one.
So far our calculation of probability/confidence interval is based on discrete probability
mass functions. We can generalize to other approximations by noting that
we are calculating P(Y) where Y is a sum of identically distributed, independent
random variables Xj, each equal to 0 or 1, with P(Xj = 1) = p.
Y = Σ Xj = the number of successes, where the outcome for each trial is independent.
There are many results for sums of random variables, the most famous being the
Central Limit Theorem. This lets us approximate the distribution of the sum with
a normal distribution:
Y = Σ Xj ∼ N(µ, σ²) where µ = E[Σ Xj] and σ² = Var[Σ Xj]
Let’s first review Expected Value.
Expected value is by definition the weighted average, where the weight is given by
the probability mass function: P(X = xi) = f(xi) for the discrete values xi of the
random variable X.
E[X] = Σ_{all i} xi · f(xi)
E[g(X)] = Σ_{all i} g(xi) · f(xi) for g a function
For X a Bernoulli random variable, f(1) = p, f(0)= 1-p, so E[X] = 1·p+0·(1−p) = p
For a binomial random variable,
P(Y = j) = (n choose j) p^j (1 − p)^(n−j)
E[Y] = Σ_{j=0}^{n} j (n choose j) p^j (1 − p)^(n−j)
This can be calculated by rearranging terms in the sum. We also note that since
Y = Σ_{k=1}^{n} Xk, i.e. Y is the sum of the outcomes of n Bernoulli trials, then
E[Y] = E[Σ_{k=1}^{n} Xk].
Since the Xk are identical, and we can commute the sum and the expectation,
E[Σ_{k=1}^{n} Xk] = Σ_{k=1}^{n} E[Xk] = nE[Xk] = np
Similarly we can use this idea to calculate variance:
Var[X] = E[(X − E[X])²] = E[X² − 2X·E[X] + (E[X])²] = E[X²] − (E[X])²
(since E[X] is just a number, not a random variable).
For the Bernoulli trials:
E[X] = p
Var[X] = E[X²] − p²
E[X²] = 1²·p + 0²·(1 − p) = p
Var[X] = p − p² = p(1 − p)
So E[Y] = Σ E[Xk] since Y = Σ Xk.
Now we can calculate:
Var[Y] = Var[Σ_{k=1}^{n} Xk] = E[( Σ_{k=1}^{n} Xk − E[Σ_{k=1}^{n} Xk] )²]
Recall that
E[Σ_{k=1}^{n} Xk] = nE[Xk] = np
So we can write the variance as:
Var[Y] = E[( Σ_{k=1}^{n} (Xk − p) )²]
= E[ Σ_{k=1}^{n} (Xk − p)² + 2 Σ_j Σ_{k<j} (Xk − p)(Xj − p) ]
= Σ_{k=1}^{n} E[(Xk − p)²] + 2 Σ_j Σ_{k<j} E[(Xk − p)(Xj − p)]
Note that
E[(Xk − p)(Xj − p)] = Σ_{all xk, xj} (xk − p)(xj − p) f(xj, xk)
= Σ_{all xk, xj} (xk − p)(xj − p) f(xj) f(xk)
since Xj, Xk are independent (Bernoulli trials).
So, E[(Xk − p)(Xj − p)] = E[Xk − p] · E[Xj − p] = 0 since E[Xk] = p.
So, Var[Y] = Σ_{k=1}^{n} E[(Xk − p)²] = Σ_{k=1}^{n} Var[Xk] = np(1 − p)
Thus the sum of independent random variables Xi has expected value nE[Xi] and
variance nVar[Xi].
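A quick simulation check of these two results; the values of n and p below are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 0.2                       # arbitrary illustrative values

# Many replicates of Y = sum of n independent Bernoulli(p) trials.
samples = rng.binomial(1, p, size=(200_000, n)).sum(axis=1)

print("sample mean:", samples.mean(), "   np =", n * p)
print("sample var: ", samples.var(), "   np(1-p) =", n * p * (1 - p))
```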
In addition, the Central Limit Theorem gives an approximation of the density of a
sum of i.i.d. random variables:
lim_{n→∞} (Σ_{i=1}^{n} Xi − nE[Xi]) / (√n · √Var[Xi]) ∼ N(0, 1)
This says that the standardized sum of n i.i.d. random variables Xi approaches a normal distribution.
A normal distribution with mean µ and variance σ² has probability density function
of the form
p(y) = exp(−(y − µ)² / (2σ²)) / √(2πσ²),   −∞ < y < +∞
Note: this is a continuous random variable.
You can verify that:
∫_{−∞}^{+∞} p(y) dy = 1,   E[Y] = ∫_{−∞}^{+∞} y p(y) dy = µ,   Var[Y] = ∫_{−∞}^{+∞} (y − µ)² p(y) dy = σ²
By the definition of E[Y],
E[Y − µ] = 0, E[cY ] = cE[Y ] = cµ
V ar[cY ] = E[(cY − cµ)2 ] = c2 V ar[Y ]
So in the limit above, we subtract off the mean and divide by the standard deviation,
leaving us with a random variable with mean 0 and variance 1.
The proof that the density tends towards a normal distribution is not covered here,
but the implications are significant: The Central Limit Theorem (CLT) says that
we can take any random variable which has bounded mean and variance (the random variables can be discrete or continuous) and if we take a sum of these random
variables, as n → ∞ the density of the sum will be normal.
Then we can approximate the probability of observations for the sums by using the
normal distribution.
For example, in our previous example, we considered the probability P(Σ_{i=1}^{n} Xi > k) for some k.
The CLT says we can consider instead the standardized variable
Z = (Σ Xi − nE[Xi]) / (√n · √Var[Xi])
[Figure: standard normal density p(z), centered at 0]
Then
P( (Σ Xi − nE[Xi]) / (√n · √Var[Xi]) > (k − nE[Xi]) / (√n · √Var[Xi]) )
=⇒ P( Z > (k − nE[Xi]) / (√n · √Var[Xi]) ) = P(Z > z)
Notice that in this case the comparison is between Σ_{i=1}^{n} Xi, a sum of discrete random
variables which take only positive values, and Z, a continuous random variable with
range over all the reals.
So we would expect that this approximation may not be valid for all values of
Y = Σ_{i=1}^{n} Xi for finite n.
Then we can evaluate P(Z > z) using the density p(z). Typically we call the range of likely
variation a confidence interval, which is then defined in terms of values of Z. The
confidence interval could be one-sided or two-sided, depending on the application.
We can compare our previous results to approximation with the normal distribution:
P(Y ≤ 1) computed from the binomial distribution, versus P(Y ≤ 1) computed from the normal approximation.
[Figure: comparison of the binomial and normal probabilities]
[Figure: standard normal density p(z) with the tail area P(Z > z*) shaded to the right of z*]
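A sketch of this comparison for the lot example (no continuity correction is applied, matching the plain CLT statement above):

```python
from math import comb, erf, sqrt

n, p = 500, 0.001

# Exact binomial: P(Y <= 1) = P(Y = 0) + P(Y = 1).
binom = sum(comb(n, j) * p**j * (1 - p) ** (n - j) for j in (0, 1))

# Normal approximation: Y ≈ N(np, np(1 - p)); standardize and use the normal CDF.
mu, sigma = n * p, sqrt(n * p * (1 - p))
normal_cdf = lambda x: 0.5 * (1 + erf(x / sqrt(2)))
normal = normal_cdf((1 - mu) / sigma)

print("binomial P(Y <= 1):", round(binom, 4))
print("normal   P(Y <= 1):", round(normal, 4))
```

With np this small the two values differ noticeably, illustrating the warning above that the approximation need not be accurate for finite n.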
Example #2
Suppose a bus can arrive at any time between 11:00 am and 11:15 am, with equal
probability. If you arrive at the bus stop at 11:00 am, with what probability will
you wait for 10 minutes or more for the bus?
With what probability will you wait a total of 300 minutes over the whole month
(assuming you arrive at 11:00 am every day)?
For which value of total minutes would you question the validity of the bus schedule?
Let X = the waiting time for the bus in one day. Then
P(X > 10) = 1 − P(X ≤ 10) = 1 − ∫_0^{10} f(x) dx
where f(x) is a uniform probability density given by
f(x) = 1/15 for 0 ≤ x ≤ 15, and 0 otherwise.
Let Y = total amount of time waiting = Σ_{i=1}^{30} Xi. Then
P(Y > 300) = P(Σ_{i=1}^{30} Xi > 300) = P( (Σ Xi − nE[Xi]) / (√n · √Var[Xi]) > (300 − nE[Xi]) / (√n · √Var[Xi]) )
where
E[Xi] = ∫_0^{15} (x/15) dx = 15/2
and
Var[Xi] = ∫_0^{15} ((x − 15/2)²/15) dx = (1/15)[x³/3 − 15x²/2 + 15²x/4]_0^{15} = 18.75
Then
P(Y > 300) = P(Σ_{i=1}^{30} Xi > 300) = P( Z > (300 − 30·(15/2)) / √(30 · 18.75) ) = P(Z > 3.1623) < 0.01
If we choose a 90% or 95% (one-sided) confidence level, the corresponding thresholds satisfy
P(Z ≤ 1.28) ≈ 0.90
P(Z ≤ 1.65) ≈ 0.95
Note that 3.1623 is much larger than these threshold values.
Then we could conclude that we have observed a rare event, given the assumption
about the bus arrivals, so we could question this assumption.
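A short check of this calculation, both through the normal approximation and by direct simulation of the 30 daily Uniform(0, 15) waits:

```python
import numpy as np
from math import erf, sqrt

n, mean_x, var_x = 30, 15 / 2, 18.75

# Normal (CLT) approximation for P(Y > 300).
z = (300 - n * mean_x) / sqrt(n * var_x)
p_normal = 1 - 0.5 * (1 + erf(z / sqrt(2)))
print("z =", round(z, 4), "  P(Z > z) ≈", round(p_normal, 5))

# Direct Monte Carlo check: 30 independent Uniform(0, 15) waits per month.
rng = np.random.default_rng(0)
monthly_totals = rng.uniform(0, 15, size=(200_000, n)).sum(axis=1)
print("simulated P(Y > 300):", np.mean(monthly_totals > 300))
```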
Example #3
Test a drug on 100 patients, with probability p of benefit from the drug.
How many patients would we expect to observe receiving benefits from the drug in
order to accept the assumed effectiveness of it?
In this case we have to choose the confidence level and determine the number of
patients that satisfy that confidence level.
If Y is the number of patients receiving a benefit, this says we want to find y such
that P (Y ≥ y) = 0.95
Then we need to identify E[Y] and Var[Y], and in particular we would like to identify
Y as a sum of random variables. Here Y = Σ Xi where Xi is the Bernoulli trial for
each individual patient.
So E[Xi ] = p =⇒ E[Y ] = 100p and V ar[Y ] = 100p(1 − p).
With an assumed effectiveness of p = 0.9, E[Y] = 90 and Var[Y] = 100 · 0.9 · 0.1 = 9, so
=⇒ P( (Y − 90) / √(100 · 0.09) ≥ −1.65 ) = 0.95
=⇒ Y ≥ −1.65 · 3 + 90 ≈ 85.05
So, in order to accept the drug's effectiveness of 0.9, we would want to see more than
85 patients receiving benefits from the drug.
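The same calculation as a small sketch, treating the sample size, the assumed effectiveness p = 0.9, and the one-sided z value as inputs (the function name is only illustrative):

```python
from math import sqrt

def min_benefiting(n: int, p: float, z: float) -> float:
    """Smallest y with P(Y >= y) ≈ the chosen level, under the normal approximation:
    y = mu - z*sigma with mu = n*p and sigma = sqrt(n*p*(1 - p))."""
    mu, sigma = n * p, sqrt(n * p * (1 - p))
    return mu - z * sigma

# z = 1.65 corresponds to the one-sided 95% level used in the text.
print(min_benefiting(100, 0.9, 1.65))   # ≈ 85.05
```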
"Least Squares" - Linear Regression
Data: view the data as a random variable, or as a function plus a random variable at each data
point.
Let Yi = the data points:
Yi = f(ti) + εi where εi = error (usual assumption: εi ∼ N(0, σ²))
What is σ? It depends on f(t). Here the data points fit a linear function f(t) = a + bt.
[Figure: data points Yi versus t with a fitted straight line]
In general, we would like to minimize the "errors" εi. In fact, we will minimize the variance
about a + bt ("like the mean").
F(a, b) = Σ_{i=1}^{n} (yi − (a + bti))²
∂F/∂a = Σ_{i=1}^{n} 2(yi − (a + bti))(−1) = 0
∂F/∂b = Σ_{i=1}^{n} 2(yi − (a + bti))(−ti) = 0
=⇒
Σ_{i=1}^{n} yi = na + b Σ_{i=1}^{n} ti
Σ_{i=1}^{n} ti yi = a Σ_{i=1}^{n} ti + b Σ_{i=1}^{n} ti²
Solve for a and b:
a = (Σ_{i=1}^{n} yi − b Σ_{i=1}^{n} ti) / n
b = (n Σ_{i=1}^{n} yi ti − Σ_{i=1}^{n} yi Σ_{i=1}^{n} ti) / (n(Σ_{i=1}^{n} ti²) − (Σ_{i=1}^{n} ti)²)
Note that ŷi = a + bti is the estimate of yi. Then
R² = (variation of estimate) / (variation of actual data) = Σ_{i=1}^{n} (ŷi − ȳ)² / Σ_{i=1}^{n} (yi − ȳ)²
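A direct implementation sketch of these formulas (the function and variable names are only illustrative):

```python
def least_squares(t, y):
    """Fit y ≈ a + b*t using the normal-equation formulas above; return a, b, R^2."""
    n = len(t)
    sum_t, sum_y = sum(t), sum(y)
    sum_tt = sum(ti * ti for ti in t)
    sum_ty = sum(ti * yi for ti, yi in zip(t, y))

    b = (n * sum_ty - sum_y * sum_t) / (n * sum_tt - sum_t**2)
    a = (sum_y - b * sum_t) / n

    y_bar = sum_y / n
    y_hat = [a + b * ti for ti in t]
    r2 = sum((yh - y_bar) ** 2 for yh in y_hat) / sum((yi - y_bar) ** 2 for yi in y)
    return a, b, r2

# Example usage with made-up points (not data from the notes):
print(least_squares([1, 2, 3, 4], [2.1, 3.9, 6.2, 8.1]))
```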
We can apply the least squares fit to the following data:
Σ xi = 217.3,   Σ xi² = 3671,   Σ yi = 1049.2,   Σ xi yi = 17657
Then a = 29.4751, b = 3.065, and R² = 0.7178.
Example (from Larsen & Marx)
Crickets make their chirping sound by sliding one wing cover very rapidly back and
forth over the other. Biologists have long been aware that there is a linear relationship
between temperature and the frequency with which a cricket chirps, although
the slope and y-intercept of the relationship vary from species to species. Listed
below are 15 frequency-temperature observations recorded for the striped ground
cricket, Nemobius fasciatus fasciatus. Plot these data and find the equation of the
least-squares line, y = a + bx. Suppose a cricket of this species is observed to chirp 18
times per second. What would be the estimated temperature?
Observation number   Chirps per second (xi)   Temperature, yi (°F)
 1                   20.0                     88.6
 2                   16.0                     71.6
 3                   19.8                     93.3
 4                   18.4                     84.3
 5                   17.1                     80.6
 6                   15.5                     75.2
 7                   14.7                     69.7
 8                   17.1                     82.0
 9                   15.4                     69.4
10                   16.2                     83.3
11                   15.0                     79.6
12                   17.2                     82.6
13                   16.0                     80.6
14                   17.0                     83.5
15                   14.4                     76.3
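Applying the least-squares formulas above to this data answers the question posed (the estimated temperature at 18 chirps per second); a self-contained sketch:

```python
import numpy as np

chirps = np.array([20.0, 16.0, 19.8, 18.4, 17.1, 15.5, 14.7, 17.1,
                   15.4, 16.2, 15.0, 17.2, 16.0, 17.0, 14.4])
temps  = np.array([88.6, 71.6, 93.3, 84.3, 80.6, 75.2, 69.7, 82.0,
                   69.4, 83.3, 79.6, 82.6, 80.6, 83.5, 76.3])

n = len(chirps)
b = (n * (chirps * temps).sum() - chirps.sum() * temps.sum()) / \
    (n * (chirps ** 2).sum() - chirps.sum() ** 2)
a = (temps.sum() - b * chirps.sum()) / n

print(f"least-squares line: y ≈ {a:.2f} + {b:.2f} x")
print(f"estimated temperature at 18 chirps/sec: {a + b * 18:.1f} °F")
```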
Note that X and Y do not have to be related linearly in order to use linear regression.
Example
If y ≈ A e^(bx), then we can take logarithms of both sides to obtain a linear problem:
=⇒ ln y = ln(A e^(bx)) = ln A + bx. So, W = ln y satisfies W = ln A + bx.
Then we can apply the linear regression formulae to W and x:
ln A = (Σ Wi − b Σ xi) / n
b = (n Σ Wi xi − Σ Wi Σ xi) / (n(Σ xi²) − (Σ xi)²)
Example (from Larsen & Marx)
Mistletoe is a plant that grows parasitically in the upper branches of large trees.
Left unchecked, it can seriously stunt a tree's growth. Recently an experiment
was done to test a theory that older trees are less susceptible to mistletoe growth
than younger trees. A number of shoots were cut from 3-, 4-, 9-, 15-, and 40-year-old
Ponderosa pines. These were then side-grafted to 3-year-old nursery stock and
planted in a preserve. Each tree was "inoculated" with mistletoe seeds. Five years
later, a count was made of the number of mistletoe plants in each stand of trees.
(A stand consisted of approximately ten trees; there were three stands of each of
the four youngest age groups and two stands of the oldest.) The results are shown
below:
Age of Trees, x (years)        3    4    9   15   40
Number of Mistletoe           28   10   15    6    1
Plants, y                     33   36   22   14    1
                              22   24   10    9
So if we try to approximate the mistletoe data using y = a + bx we get a = 25.9375,
b = −0.7117, and R² = 0.6508.
For y = ax^b, we use z = ln y, v = ln x, α = ln a
=⇒ z = ln a + b ln x = α + bv.
Then we get α = 4.9, b = −1.1572, and r² = 0.826.
Note: Here r² was calculated using the linear formulation:
r² = Σ_{i=1}^{n} (ẑi − z̄)² / Σ_{i=1}^{n} (zi − z̄)²
where ẑi = α + bvi and ŷi = e^α xi^b.
For y = a e^(bx), w = ln y =⇒ w = ln a + bx
Then we get ln a = 3.56, b = −0.0893, and
r² = Σ_{i=1}^{n} (ŵi − w̄)² / Σ_{i=1}^{n} (wi − w̄)²
Again r² was calculated using the linear formulation, where ŵi = ln a + bxi and
ŷi = e^(ln a + bxi).
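A sketch of both transformed fits on the mistletoe data (using the table as reconstructed above, so the printed values may differ slightly from the rounded figures quoted in the notes):

```python
import numpy as np

def linfit(u, w):
    """Least-squares fit w ≈ alpha + beta*u; returns alpha, beta and the linear-formulation r^2."""
    n = len(u)
    beta = (n * (u * w).sum() - u.sum() * w.sum()) / (n * (u ** 2).sum() - u.sum() ** 2)
    alpha = (w.sum() - beta * u.sum()) / n
    w_hat = alpha + beta * u
    r2 = ((w_hat - w.mean()) ** 2).sum() / ((w - w.mean()) ** 2).sum()
    return alpha, beta, r2

# Mistletoe data, paired up as in the table above (ages repeated once per stand).
ages   = np.array([3, 3, 3, 4, 4, 4, 9, 9, 9, 15, 15, 15, 40, 40], dtype=float)
plants = np.array([28, 33, 22, 10, 36, 24, 15, 22, 10, 6, 14, 9, 1, 1], dtype=float)

# Power-law model y = a x^b  ->  ln y = ln a + b ln x.
print("y = a x^b:   ", linfit(np.log(ages), np.log(plants)))
# Exponential model y = a e^(bx)  ->  ln y = ln a + b x.
print("y = a e^(bx):", linfit(ages, np.log(plants)))
```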