Uploaded by rj.estrada8

sada weeks 1 to 3

2/26/23, 3:44 PM
sada_weeks_1_to_3
Statistics and Data Analysis - COR1 GB 1305
Statistics Basics
Experiment Design
Types of Data
Quantitative vs. Qualitative
Single-variate vs. Multi-variate (multiple questions in a single observation)
Discrete (integers) vs. Continuous (floats)
Inference can be made from a sample to the population from which it was drawn, but not
to another population.
Required Sample Size
Desired Confidence
Max Acceptable Error
Null Hypothesis
No relationship exists between two sets of data or variables.
Central Limit Theorem (often conflated with the Law of Large Numbers, which is a separate result)
The more a population differs from the normal distribution, the larger the sample size $n$ required for the distribution of the sample mean to be approximately normal.
Measures of Central Tendency
Range
range = |max − min|
Median
For an ordered array of $n$ values:
$$\text{ordered list} = [x_1, x_2, \cdots, x_n]$$
If $n$ is odd, the median is in the 'center' of the ordered list:
$$x_{\frac{n-1}{2}+1}$$
such that there are $\frac{n-1}{2}$ values to either side of the median.
If $n$ is even, the median is the average of the two 'center' elements of the ordered list:
$$\frac{x_{\frac{n}{2}} + x_{\frac{n}{2}+1}}{2}$$
Note that the median may not be present in the list of values.
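The odd and even cases can be sketched in Python as follows. This is a minimal illustration: the function name `median` and the two sample lists are assumptions made for the example, not part of the course material.

```python
def median(values):
    """Return the median of a non-empty list of numbers."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd n: the single center element (x_{(n-1)/2 + 1} in 1-based indexing).
        return ordered[mid]
    # Even n: average of the two center elements (x_{n/2} and x_{n/2 + 1}).
    return (ordered[mid - 1] + ordered[mid]) / 2


odd_example = median([7, 1, 5])       # ordered: [1, 5, 7]
even_example = median([4, 1, 3, 2])   # ordered: [1, 2, 3, 4]
```

In the even example the median is 2.5, a value that does not appear in the list itself.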
Mode
In a unimodal distribution, there is a single value which is most represented in the data
set. In a normal unimodal distribution, this serves as the 'peak' around which other
values gather.
A bimodal distribution can be characterized as having two distinctive 'peaks'.
Skew
The sign of the skew indicates where the long tail is. Positive skew has a long tail to the right (and the peak to the left); negative skew has a long tail to the left (and the peak to the right).
Measures of Central Tendency - Population vs Sample
A sample of size $n$ is drawn from the overall population of size $N$.

Measure              Population    Sample
Size                 N             n
Mean                 μ             x̄
Variance             σ²            s²
Standard Deviation   σ             s
Mean
Population:
$$\mu = \frac{\sum_{i=1}^{N} x_i}{N}$$
Sample:
$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$$
Trimmed Mean
Remove a chosen percentage of the most extreme values (the outliers) from the dataset, then take the mean of what remains.
Variance
We can compare each element to the average of the dataset to determine the amount of variation in the dataset. We expect a dataset with no variation to have a variance of $0$.
Population
$$\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$$
Sample
Calculating the variance of a sample relies on $\bar{x}$ rather than $\mu$, which introduces a bias: applying the population formula above to a sample tends to underestimate the population variance. The denominator must be reduced in order to increase the sample variance and compensate for this bias, so we change the denominator from $n$ to $n - 1$.
$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$
Units:
Note that the units of variance are the square of the units of the items in the dataset.
Standard Deviation:
Population
$$\sigma = \sqrt{\sigma^2} = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}}$$
Sample
$$s = \sqrt{s^2} = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$$
Units:
Note that, unlike variance, the units of standard deviation are the same as the units of the items in the dataset.
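To see the difference between dividing by $N$ and dividing by $n - 1$, here is a small Python sketch. The data values are made up for illustration; the hand-written formulas are cross-checked against the standard library's `statistics` module.

```python
import math
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # illustrative values, not from the notes

n = len(data)
mean = sum(data) / n
squared_deviations = sum((x - mean) ** 2 for x in data)

population_variance = squared_deviations / n      # denominator N
sample_variance = squared_deviations / (n - 1)    # denominator n - 1

population_std = math.sqrt(population_variance)   # same units as the data
sample_std = math.sqrt(sample_variance)

# Cross-check against the standard library implementations.
matches_stdlib = (
    math.isclose(population_variance, statistics.pvariance(data))
    and math.isclose(sample_variance, statistics.variance(data))
    and math.isclose(population_std, statistics.pstdev(data))
    and math.isclose(sample_std, statistics.stdev(data))
)
```

For the same data the sample variance is always the larger of the two, since the same numerator is divided by a smaller number.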
Coefficient of Variation and Risk vs Reward
Minimize:
$$\frac{\text{risk}}{\text{reward}} = \frac{\text{variation}}{\text{expected return}} = \frac{\sigma}{\mu}$$
Maximize:
$$\frac{\text{reward}}{\text{risk}}$$
Coefficient of Correlation
The coefficient of correlation of $X$ and $Y$:
$$\rho = \frac{\mathrm{Cov}(X, Y)}{\sigma_X \sigma_Y}$$
where $\mathrm{Cov}(X, Y)$ is the covariance of $X$ and $Y$:
$$\mathrm{Cov}(X, Y) = \frac{\sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})}{N}$$
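A short Python sketch of the covariance and correlation formulas above, using the divide-by-$N$ form throughout. The paired values in `x` and `y` are invented for the example.

```python
import math

x = [1, 2, 3, 4, 5]   # illustrative values
y = [2, 4, 5, 4, 5]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# Covariance: average product of the paired deviations from each mean.
cov_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / n

# Standard deviations, using the same denominator as the covariance.
std_x = math.sqrt(sum((xi - mean_x) ** 2 for xi in x) / n)
std_y = math.sqrt(sum((yi - mean_y) ** 2 for yi in y) / n)

rho = cov_xy / (std_x * std_y)
```

Because the covariance is divided by both standard deviations, $\rho$ is unitless and always lies between $-1$ and $1$.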
Probability Basics
An event
Universe - set of all possible outcomes of an event
Outcome $A$ can occur or not occur ($A^{\text{complement}}$)
An outcome can be impossible, or certain, or somewhere in between:
$$0 \le P \le 1$$
but something must happen:
$$\sum_{i} P(e_i) = 1$$
The event must have an outcome.
Union
P (A ∪ B) = P (A) + P (B) − P (A ∩ B)
subtract the intersection to avoid "double counting"
If the events are mutually exclusive, then
P (A ∩ B) = 0
by definition.
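The "double counting" can be checked by brute force on a small universe. The sketch below assumes one roll of a fair die, with events `A` and `B` chosen purely for illustration.

```python
from fractions import Fraction

universe = {1, 2, 3, 4, 5, 6}   # one roll of a fair die
A = {2, 4, 6}                   # the roll is even
B = {4, 5, 6}                   # the roll is greater than 3


def prob(event):
    """Probability of an event when all outcomes are equally likely."""
    return Fraction(len(event), len(universe))


p_union_direct = prob(A | B)                          # count the union directly
p_union_formula = prob(A) + prob(B) - prob(A & B)     # add, then remove the overlap
```

The outcomes 4 and 6 belong to both events, so adding `prob(A)` and `prob(B)` counts them twice; subtracting `prob(A & B)` removes the extra copy.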
Independent Events - Intersection
$$P(A \cap B) = P(A) \cdot P(B|A)$$
If $A$ and $B$ are independent, then $P(B|A) = P(B)$, so
$$P(A \cap B) = P(A) \cdot P(B)$$
Dependent Events - Conditional Probability
Providing information about the outcome of event $B$ causes you to "change your answer" about the probability of event $A$.
Now that you know the outcome of event $B$, the probability of event $A$ has changed and needs to be recalculated to factor in this new information.
The range of possible outcomes for event $A$ may have changed given the outcome of event $B$.
Event $A$ is said to be dependent on event $B$.
The probability of $A$ given $B$ is equal to the probability of the intersection of $A$ and $B$, over the probability of $B$:
$$P(A \mid B) = \frac{P(A \cap B)}{P(B)}$$
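A worked example of "changing your answer", sketched in Python. The two-dice setup and the choice of events are assumptions made for illustration only.

```python
from fractions import Fraction
from itertools import product

# All 36 equally likely ordered rolls of two fair dice.
universe = list(product(range(1, 7), repeat=2))


def prob(outcomes):
    return Fraction(len(outcomes), len(universe))


A = [roll for roll in universe if sum(roll) == 8]   # event A: the total is 8
B = [roll for roll in universe if roll[0] == 3]     # event B: the first die shows 3
A_and_B = [roll for roll in A if roll in B]         # intersection of A and B

p_A = prob(A)                            # before knowing anything about B
p_A_given_B = prob(A_and_B) / prob(B)    # after learning that B occurred
```

Knowing the first die shows 3 leaves only six possible rolls, of which one totals 8, so the probability moves from 5/36 to 1/6.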
Probability Distribution
We consider an event $X$ with $N$ possible outcomes. The outcomes $x_i$ each have a probability of occurring $P(x_i)$.
We want to find the expected value of $X$: $E(X)$.
We must weigh each outcome $x_i$ by the probability of that outcome $P(x_i)$.

X      P(X)      P(X) * X        P(X) * (X − μ)²
x_i    P(x_i)    P(x_i) * x_i    P(x_i) * (x_i − μ)²
⋯
x_N    P(x_N)    P(x_N) * x_N    P(x_N) * (x_N − μ)²

Summing the last two columns gives $\mu$ and $\sigma^2$:
$$\mu = \sum_{i=1}^{N} P(x_i) \cdot x_i$$
$$\sigma^2 = \sum_{i=1}^{N} P(x_i) \cdot (x_i - \mu)^2$$
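The two column sums can be computed directly. The sketch below uses a fair six-sided die as an assumed example distribution and exact fractions to avoid rounding.

```python
from fractions import Fraction

outcomes = [1, 2, 3, 4, 5, 6]              # x_i: faces of a fair die
probabilities = [Fraction(1, 6)] * 6       # P(x_i): each face equally likely

# Third column of the table, summed: the expected value.
mu = sum(p * x for x, p in zip(outcomes, probabilities))

# Fourth column of the table, summed: the variance.
sigma_squared = sum(p * (x - mu) ** 2 for x, p in zip(outcomes, probabilities))
```

Note that the expected value, 7/2, is not itself a possible outcome of the die.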
Binomial Distribution
Binomial Conditions:
1. n trials
2. 2 outcomes
3. events are independent
4. P(Success) is constant
A Fair Coin
The Experiment
Let's flip a fair coin n times and define success as Heads. Consider the following
outcomes, where S is for Success, F for Failure.
Now imagine that we require x successes, that is we want to know the probability of
attaining x number of Success outcomes, ignoring order.
In our first example:
n = 3
x = 1
p = 0.5
We will call this a triplet (n,x,p) with a value (3,1,0.5)
The Outcomes
These are all the possible ordered outcomes of the 3 trials:
SSS
SSF
SFS
SFF
FSS
FSF
FFS
FFF
This is the subset of outcomes where x = 1:
SFF
FSF
FFS
SFFL
Out of this subset, there is only one option that follows the arbitrary requirement of
being ordered SuccessesFirstFailuresLast (SFFL).
SFF
This is the SFFL result for the (n,x,p) triplet of (3,1,0.5) .
If (n,x,p) was modified to (3,2,0.5) the SFFL result would be:
SSF
and at (3,3,0.5) it would be:
SSS
If we expand to (5,2,0.5) :
SSFFF
If we generalize (n,x,p) , then we can imagine a SFFL result of
SSS........SSSFFF........FFF
|----- x ----||--- n-x ----|
|------------ n -----------|
P(SFFL)
Let's calculate the probability that the n trials result in the SFFL result.
P(x successes in n trials each with chance p ordered as SFFL) = P(x consecutive successes ∩ n-x consecutive failures)
The probability of x consecutive Successes for independent events is the product of the individual probabilities, p:
$$p^x$$
In the SFFL result, the remaining results are Failures. After counting $x$ Successes, there are $n - x$ Failures remaining. The probability of a failure is $1 - p$. So the probability of $n - x$ consecutive Failures is:
$$(1 - p)^{(n-x)}$$
Bringing it all together for the SFFL result, we have:
$$P(\text{x successes in n trials each with chance p ordered as SFFL}) = p^x \cdot (1 - p)^{(n-x)}$$
W - the 'winners'
Now, remember that the selection of the SFFL result was arbitrary. By placing the Successes all in a neat row, it was easy for us to calculate the probability of that row occurring in the results as $p^x$. By ending with all of the Failures in a row, we were able to arrive at $(1 - p)^{(n-x)}$.
But since the original ask was for $x$ successes, regardless of order, we have clearly undercounted.
For the (n,x,p) triplet of (3,1,0.5) , when we chose the SFFL result
SFF
that was just a subset of the broader pool of results $W$ that won by satisfying the ask $x = 1$. Let's examine this set:
SFF
FSF
FFS
When we apply the (n,x,p) triplet of (3,1,0.5) to the formula we derived above, we get:
$$P(\text{x successes in n trials each with chance p ordered as SFFL}) = p^x \cdot (1 - p)^{(n-x)} = 0.5^1 \cdot 0.5^2 = 0.125$$
which is the probability of getting the SFFL result SFF . Looking again at the set of results $W$ satisfying $x = 1$:
SFF
FSF
FFS
we see that the other two 'winners' are just rearrangements of the SFFL result.
Recall how we identified that the probability of getting the SFFL result was an undercount of the original ask, which was the probability of any of the results in $W$ satisfying $x = 1$ occurring, regardless of order.
We need to know how many 'winners' can be created by simply rearranging the order of the SFFL result.
In the case above, with a (n,x,p) triplet of (3,1,0.5) , the SFFL could be arranged in 3 different ways to create the full set of results $W$ satisfying $x = 1$.
Let's compare some (n,x,p) triplets, their SFFL value, and the size of the full set of results $W$ satisfying the required $x$ value.

n,x,p      SFFL value   size of W   W
3,0,0.5    FFF          1           FFF
3,1,0.5    SFF          3           SFF, FSF, FFS
3,2,0.5    SSF          3           SSF, SFS, FSS
3,3,0.5    SSS          1           SSS

n,x,p      SFFL value   size of W   W
4,0,0.5    FFFF         1           FFFF
4,1,0.5    SFFF         4           SFFF,FSFF,FFSF,FFFS
4,2,0.5    SSFF         6           SSFF,SFSF,SFFS,FSSF,FSFS,FFSS
4,3,0.5    SSSF         4           SSSF,SSFS,SFSS,FSSS
4,4,0.5    SSSS         1           SSSS

n,x,p      SFFL value   size of W   W
5,0,0.5    FFFFF        1           FFFFF
5,1,0.5    SFFFF        5           SFFFF,FSFFF,FFSFF,FFFSF,FFFFS
5,2,0.5    SSFFF        10          SSFFF,SFSFF,SFFSF,SFFFS,FSSFF,FSFSF,FSFFS,FFSSF,FFSFS,FFFSS
5,3,0.5    SSSFF        10          SSSFF,SSFSF,SSFFS,SFSSF,SFSFS,SFFSS,FSSSF,FSSFS,FSFSS,FFSSS
5,4,0.5    SSSSF        5           SSSSF,SSSFS,SSFSS,SFSSS,FSSSS
5,5,0.5    SSSSS        1           SSSSS
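The sets of 'winners' listed above can be generated by brute force rather than written out by hand. A minimal Python sketch, where the helper name `winners` is an assumption made for this example:

```python
from itertools import product


def winners(n, x):
    """All ordered S/F sequences of length n containing exactly x successes."""
    return [
        "".join(seq)
        for seq in product("SF", repeat=n)   # every ordered outcome of n trials
        if seq.count("S") == x               # keep those with exactly x successes
    ]


w_3_1 = winners(3, 1)
sizes_n5 = [len(winners(5, x)) for x in range(6)]
```

Enumeration becomes impractical quickly, since there are $2^n$ ordered outcomes to filter, which is why a counting formula is needed.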
n choose x
We need to generalize this to ask
Given n values, how many different ways can I choose x, ignoring order?
Notice that this is equivalent to:
How many ways can I rearrange the letters in my (3,1,0.5) SFFL: S, F, F ?
where $n = 3$ and the presence of a single S represents the fact that $x = 1$. We can generalize to:
n,x,p    SFFL value                                           size of W     W
n,0,p    'F' repeated n times                                 1             'F' repeated n times
n,x,p    'S' repeated x times, then 'F' repeated n-x times    n choose x    n choose x # of rearrangements
n,n,p    'S' repeated n times                                 1             'S' repeated n times
factorial !
Consider $s$ students entering a classroom in a line to sit at $s$ desks. How many different ways could they arrange themselves? The first student has $s$ desks to choose from. The second student has $s - 1$ options. The final ($s$th) student has 1 option.
Number of orders in which $s$ objects can be selected (arranged) $= s(s-1)(s-2)\cdots(2)(1)$
We define this product as $s!$, read as "s factorial"
$$s! = s(s-1)(s-2)\cdots(2)(1)$$
Considering that $1! = 1$ and $0! = 1$ helps ground the definition of factorial less in the pattern of the multiplication used to calculate it, and more in the concept of ordering. $0!$ can be thought of as the number of ways that the null set $\emptyset$ can be ordered, which is naturally $1$.
nPr
We know that the number of ways that $n$ items can be arranged is defined as $n!$. But what if we only want to select and arrange $r$ out of those $n$ items? We define these permutations as "permutations of n items taken r at a time", ${}_nP_r$.
We pick the $r$ elements in any arbitrary order:

element   options
1         n
2         n − 1
3         n − 2
...       ...
r         n − (r − 1)

The choices are independent, so by the Product Rule for Counting, we can say:
$${}_nP_r = n(n-1)(n-2)\cdots(n-(r-1))$$
Multiplying by $\frac{(n-r)!}{(n-r)!}$, we get
$${}_nP_r = \frac{n!}{(n-r)!}$$
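The permutation formula can be checked against a direct enumeration. In this sketch the function name `n_P_r` and the values $n = 5$, $r = 3$ are illustrative choices; `math.perm` requires Python 3.8 or later.

```python
import math
from itertools import permutations


def n_P_r(n, r):
    """Permutations of n items taken r at a time: n! / (n - r)!."""
    return math.factorial(n) // math.factorial(n - r)


n, r = 5, 3
by_formula = n_P_r(n, r)                                  # 5 * 4 * 3
by_enumeration = len(list(permutations(range(n), r)))     # list every ordered pick
by_stdlib = math.perm(n, r)                               # built-in equivalent
```

Setting $r = n$ recovers the plain factorial, since the denominator becomes $0! = 1$.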
nCr a.k.a. $\binom{n}{r}$
We know that combinations are order-agnostic, while permutations are order-dependent. We can intuit that to go from the number of permutations to the number of combinations, that is from ${}_nP_r$ to ${}_nC_r$, we would have to divide by some factor. That factor would be the number of orderings that are considered equivalent from the perspective of combinations, after ignoring order. Remember that $r!$ is the number of ways that one can rearrange $r$ elements. It turns out that ${}_nP_r$ exceeds ${}_nC_r$ by a factor of $r!$:
$${}_nC_r = \frac{{}_nP_r}{r!} = \frac{\frac{n!}{(n-r)!}}{r!} = \frac{n!}{r!(n-r)!}$$
Returning to n choose x, a.k.a $\binom{n}{x}$
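As a quick check of the combinations formula before applying it, the sketch below compares it with a direct enumeration and with the count of permutations divided by $r!$. The function name `n_C_r` and the values $n = 5$, $r = 2$ are illustrative; `math.comb` and `math.perm` require Python 3.8 or later.

```python
import math
from itertools import combinations


def n_C_r(n, r):
    """Combinations of n items taken r at a time: n! / (r! (n - r)!)."""
    return math.factorial(n) // (math.factorial(r) * math.factorial(n - r))


n, r = 5, 2
by_formula = n_C_r(n, r)
by_enumeration = len(list(combinations(range(n), r)))        # unordered picks
from_permutations = math.perm(n, r) // math.factorial(r)     # nPr divided by r!
by_stdlib = math.comb(n, r)
```

The value 10 matches the size of the winners set for the triplet (5,2,0.5).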
Recall that we wanted to relate our (n,x,p) triplet to the size of the set of 'winners' $W$.
We asked:
Given n values, how many different ways can I choose x, ignoring order?
We knew that $P(\text{SFFL})$ was undercounting by this exact factor, which we now have the capacity to name.
P(x successes in n trials each with chance p) = P(x successes in n trials each with chance p ordered as SFFL) ∗ (number of ways that ordered SFFL could be rearranged)
Recall that
$$P(\text{x successes in n trials each with chance p ordered as SFFL}) = p^x \cdot (1 - p)^{(n-x)}$$
And we can now express that n choose x factor that we explored in the tables earlier as ${}_nC_r$ or $\binom{n}{r}$:
$$\text{number of ways that ordered SFFL could be rearranged} = \binom{n}{x}$$
Thus we can conclude:
$$P(\text{x successes in n trials each with chance p}) = \binom{n}{x} \cdot p^x \cdot (1 - p)^{(n-x)}$$
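The final formula translates almost directly into Python. This is a minimal sketch; the function name `binomial_pmf` is an assumption for the example, and the triplet (3,1,0.5) is the one used throughout the derivation.

```python
import math


def binomial_pmf(x, n, p):
    """P(x successes in n trials, each with chance p), ignoring order."""
    return math.comb(n, x) * p ** x * (1 - p) ** (n - x)


# Probability of the single ordered result SFF for the triplet (3, 1, 0.5).
p_sffl = 0.5 ** 1 * (1 - 0.5) ** (3 - 1)

# Probability of exactly one success in any order: 3 rearrangements of SFF.
p_one_success = binomial_pmf(1, 3, 0.5)

# Over all possible x, the probabilities must add up to 1.
total = sum(binomial_pmf(x, 3, 0.5) for x in range(4))
```

Here `p_one_success` is exactly three times `p_sffl`, which is the undercount factor identified earlier.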
Poisson Distribution
Let $\lambda$ be the frequency of some event $V$ occurring per time period $T$, where the expected value of $x$, the number of occurrences during $T$, can be written as:
$$E(x) = \lambda$$
We want the probability of the event occurring $k$ times over the course of period $T$. That is:
$$P(x = k)$$
Deriving the Poisson Distribution from the Binomial Distribution
Let's define $n$ trials, each occurring over a period $t$, where $t$ is of duration $T/n$.
Starting with
$$E(\text{number of times the event occurs over time period } T) = \lambda$$
we can see that
$$E(\text{number of times the event occurs over time period } t) = \frac{\lambda}{n}$$
When $t$ is short enough that the event rarely occurs more than once within it, this is approximately the probability that the event occurs during $t$. So for a given trial we can say that
$$p = \frac{\lambda}{n}$$
Let's first model this as a binomial distribution:
Binomial Conditions:
1. n trials: True
2. 2 outcomes: True
3. events are independent: True
4. P(Success) is constant: True
Setup:
define success as: the event occurs at least once during the sampled period $t$
probability of success: $p = \frac{\lambda}{n}$
number of trials: $n$
required number of successes: $k$
We take our binomial distribution:
$$P(\text{k successes in n trials each with chance p}) = \binom{n}{k} \cdot p^k \cdot (1 - p)^{(n-k)}$$
and we start with an $n$ value of 60.
$$P\left(\text{success},\ n = 60,\ k,\ p = \frac{\lambda}{n}\right) = \binom{60}{k} \cdot \left(\frac{\lambda}{60}\right)^k \cdot \left(1 - \frac{\lambda}{60}\right)^{(60-k)}$$
But we quickly realize that "how many of the 60 1-minute trials succeeded?" is only an approximation of "how many times did the event occur?".
After all, we defined the success of each trial as: the event occurs at least once. Events occurring more than once per trial will not be captured.
Therefore "what is the probability of $k$ out of the $n$ slices of $T$ succeeding" is only an approximation of "what is the probability the event occurs $k$ times over $T$". We can improve the approximation by increasing $n$. Let's perform 3600 1-second trials.
$$P\left(\text{success},\ n = 3600,\ k,\ p = \frac{\lambda}{n}\right) = \binom{3600}{k} \cdot \left(\frac{\lambda}{3600}\right)^k \cdot \left(1 - \frac{\lambda}{3600}\right)^{(3600-k)}$$
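And it follows that the best approximation is:
$$P(x = k) = \lim_{n \to \infty} \binom{n}{k} \cdot \left(\frac{\lambda}{n}\right)^k \cdot \left(1 - \frac{\lambda}{n}\right)^{(n-k)}$$
Rewriting $\binom{n}{k}$ as a factorial and distributing the exponents:
$$P(x = k) = \lim_{n \to \infty} \frac{n!}{(n-k)!\,k!} \cdot \frac{\lambda^k}{n^k} \cdot \left(1 - \frac{\lambda}{n}\right)^{n} \cdot \left(1 - \frac{\lambda}{n}\right)^{-k}$$
Rewriting $\frac{n!}{(n-k)!}$ as $n(n-1)(n-2)\cdots(n-k+1)$:
$$P(x = k) = \lim_{n \to \infty} \frac{n(n-1)(n-2)\cdots(n-k+1)}{n^k} \cdot \frac{\lambda^k}{k!} \cdot \left(1 - \frac{\lambda}{n}\right)^{n} \cdot \left(1 - \frac{\lambda}{n}\right)^{-k}$$
Applying the limit to each term and simplifying in advance of applying the limit (the product $n(n-1)\cdots(n-k+1)$ expands to $n^k$ plus lower-order terms, written here as $C$):
$$P(x = k) = \lim_{n \to \infty} \frac{n^k + C}{n^k} \cdot \frac{\lambda^k}{k!} \cdot \lim_{n \to \infty} \left(1 - \frac{\lambda}{n}\right)^{n} \cdot \lim_{n \to \infty} \left(1 - \frac{\lambda}{n}\right)^{-k}$$
Applying each limit:
$$P(x = k) = 1 \cdot \frac{\lambda^k}{k!} \cdot \frac{1}{e^{\lambda}} \cdot 1$$
$$P(x = k) = \frac{\lambda^k}{k!\, e^{\lambda}}$$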
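The convergence can be observed numerically. The sketch below evaluates the binomial approximation for increasing $n$ and compares it to the limiting formula; the values $\lambda = 2$ and $k = 3$, the third value of $n$, and the function names are assumptions chosen for illustration.

```python
import math


def binomial_pmf(k, n, p):
    """P(k successes in n trials, each with chance p)."""
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)


def poisson_pmf(k, lam):
    """The limiting formula: lambda^k / (k! * e^lambda)."""
    return lam ** k / (math.factorial(k) * math.exp(lam))


lam, k = 2.0, 3   # illustrative rate and target count

# Slice the period T into n trials, each with p = lambda / n.
approximations = {n: binomial_pmf(k, n, lam / n) for n in (60, 3600, 360000)}

limit = poisson_pmf(k, lam)
errors = {n: abs(value - limit) for n, value in approximations.items()}
```

Each increase in $n$ shrinks the gap between the binomial value and the Poisson value, consistent with the limit taken above.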
Applications of the Poisson Distribution
Unbounded upper limit
When asked what is $P(x > k)$, the sum over $k+1, k+2, \ldots$ has no upper bound, so you must reformulate the question in terms of the complement:
P (x > k) = 1 − P (x ≤ k)
Similarly,
P (x ≥ k) = 1 − P (x < k)
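Both reformulations reduce an unbounded sum to a finite one. A minimal Python sketch, with function names and the values $\lambda = 2$, $k = 3$ assumed for illustration:

```python
import math


def poisson_pmf(k, lam):
    """P(x = k) = lambda^k / (k! * e^lambda)."""
    return lam ** k / (math.factorial(k) * math.exp(lam))


def poisson_cdf(k, lam):
    """P(x <= k): a finite sum over 0, 1, ..., k."""
    return sum(poisson_pmf(i, lam) for i in range(k + 1))


lam, k = 2.0, 3

p_greater = 1 - poisson_cdf(k, lam)        # P(x > k)  = 1 - P(x <= k)
p_at_least = 1 - poisson_cdf(k - 1, lam)   # P(x >= k) = 1 - P(x < k) = 1 - P(x <= k - 1)
```

Since $x$ only takes integer values, $P(x < k)$ is the same as $P(x \le k - 1)$, and the two tail probabilities differ by exactly $P(x = k)$.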
HyperGeometric Distribution
Consider a population of size $N$. A sample of size $n$ will be randomly selected.
Conditions
1. The population can be divided into two mutually exclusive groups (e.g. Pass/Fail).
2. The finite population $N$ is sampled without replacement. This means that the probability of choosing an item from a given group will change after each draw, as the composition of the remaining elements has changed.
Probability
Consider $n$ samples drawn from population $N$, which can be divided into two mutually exclusive groups Success (of size $K$) and Failure.
The probability that $k$ out of $n$ samples drawn belong to the Success group:
$$P(x = k) = \frac{\binom{K}{k}\binom{N-K}{n-k}}{\binom{N}{n}}$$
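The formula counts the ways to pick $k$ items from the Success group and $n - k$ from the Failure group, out of all ways to pick $n$ items. A minimal Python sketch, where the function name and the population numbers are assumptions for illustration:

```python
import math


def hypergeometric_pmf(k, N, K, n):
    """P(k of the n draws are Successes), sampling without replacement."""
    return math.comb(K, k) * math.comb(N - K, n - k) / math.comb(N, n)


# Illustrative numbers: 20 items, 5 of them Successes, 4 drawn without replacement.
N, K, n = 20, 5, 4

p_one = hypergeometric_pmf(1, N, K, n)

# Over all possible k, the probabilities must add up to 1.
total = sum(hypergeometric_pmf(k, N, K, n) for k in range(n + 1))
```

Unlike the binomial case, the chance of a Success changes from draw to draw, which is why the formula is built from counts rather than from a constant $p$.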