2/26/23, 3:44 PM sada_weeks_1_to_3
file:///Users/cschneider/Desktop/local-vsc/mba/src/statistics_and_data_analysis/problem_sets/study_guide/sada_weeks_1_to_3.html

Statistics and Data Analysis - COR1 GB 1305

Statistics Basics

Experiment Design

Types of Data

- Quantitative
- Qualitative
- Single-variate
- Multi-variate (multiple questions in single observation)
- Discrete (integers)
- Continuous (float)

Inference can be made from a sample to the population from which it was drawn, but not to another population.

Required Sample Size

The required sample size depends on:

- Desired Confidence
- Max Acceptable Error

Null Hypothesis

No relationship exists between two sets of data or variables.

Law of Large Numbers and Central Limit Theorem

These are related but distinct results. The Law of Large Numbers says that the sample mean approaches the population mean as the sample size $n$ grows. The Central Limit Theorem says that the distribution of the sample mean approaches a normal distribution as $n$ grows. The more a population differs from the normal distribution, the higher the $n$ required for the sample means to follow a normal distribution.

Measures of Central Tendency

Range

$$\text{range} = |\max - \min|$$

Median

For an ordered array of $n$ values:

$$\text{ordered list} = [x_1, x_2, \cdots, x_n]$$

If $n$ is odd, the median is in the 'center' of the ordered list:

$$x_{\frac{n-1}{2}+1}$$

such that there are $\frac{n-1}{2}$ values to either side of the median.

If $n$ is even, the median is the average of the two 'center' elements of the ordered list:

$$\frac{x_{\frac{n}{2}} + x_{\frac{n}{2}+1}}{2}$$

Note that the median may not be present in the list of values.

Mode

In a unimodal distribution, there is a single value which is most represented in the data set. In a normal unimodal distribution, this serves as the 'peak' around which other values gather. A bimodal distribution can be characterized as having two distinctive 'peaks'.

Skew

The sign of the skew indicates where the long tail is:
- positive skew has a long tail to the right (and peak to the left)
- negative skew has a long tail to the left (and peak to the right)

Measure of Central Tendency - Population vs Sample

A sample of size $n$ is drawn from the overall population of size $N$.

| Measure | Population | Sample |
| --- | --- | --- |
| Size | $N$ | $n$ |
| Mean | $\mu$ | $\bar{x}$ |
| Variance | $\sigma^2$ | $s^2$ |
| Standard Deviation | $\sigma$ | $s$ |

Mean

$$\mu = \frac{\sum_{i=1}^{N} x_i}{N}$$

$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$$

Trimmed Mean

Remove some % of the outliers before taking the mean.

Variance

We can compare each element to the average of the dataset to determine the amount of variation in the dataset. We expect a dataset with no variation to have a variance of $0$.

Population

$$\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$$

Sample

Calculating the variance of a sample relies on $\bar{x}$ rather than $\mu$, which introduces a bias: applying the population formula above to a sample tends to underestimate the variance. The denominator must be reduced in order to increase the sample variance and compensate for this bias. See derivation of why we alter the denominator $n$ to $n - 1$ here.

$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$

Units: Note that the units of variance are the square of the units of the items in the dataset.

Standard Deviation

Population

$$\sigma = \sqrt{\sigma^2} = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}}$$

Sample

$$s = \sqrt{s^2} = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$$

Units: Note that, unlike variance, the units of standard deviation are the same as the units of the items in the dataset.
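As a small illustration of the population and sample formulas above, the sketch below computes both variances and standard deviations by hand and leaves the results in variables that can be compared with Python's `statistics` module. The data values and the function names `population_variance` and `sample_variance` are made up for this example.

```python
import math
import statistics


def population_variance(values):
    """sigma^2: average squared deviation from the mean, dividing by N."""
    mu = sum(values) / len(values)
    return sum((x - mu) ** 2 for x in values) / len(values)


def sample_variance(values):
    """s^2: squared deviations from x-bar, dividing by n - 1."""
    x_bar = sum(values) / len(values)
    return sum((x - x_bar) ** 2 for x in values) / (len(values) - 1)


# Hypothetical data, chosen only so the arithmetic is easy to follow.
data = [2, 4, 4, 4, 5, 5, 7, 9]

pop_var = population_variance(data)   # divides by N = 8
samp_var = sample_variance(data)      # divides by n - 1 = 7
pop_sd = math.sqrt(pop_var)           # same units as the data
samp_sd = math.sqrt(samp_var)

# The standard library offers the same four measures.
lib_pop_var = statistics.pvariance(data)
lib_samp_var = statistics.variance(data)

print(pop_var, samp_var, pop_sd, samp_sd)
```

Because the two formulas share a numerator and differ only in the denominator, the sample variance always comes out larger than the population formula applied to the same numbers.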
Coefficient of Variation and Risk vs Reward

Minimize:

$$\frac{\text{risk}}{\text{reward}} = \frac{\text{variation}}{\text{expected return}} = \frac{\sigma}{\mu}$$

Maximize:

$$\frac{\text{reward}}{\text{risk}}$$

Coefficient of Correlation

The coefficient of correlation of $X$ and $Y$ is

$$\rho = \frac{\text{Cov}(X, Y)}{\sigma_X \sigma_Y}$$

where $\text{Cov}(X, Y)$ is the covariance of $X$ and $Y$:

$$\text{Cov}(X, Y) = \frac{\sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})}{N}$$

Probability Basics

An event

- Universe: set of all possible outcomes of an event
- Outcome
  - can occur or not occur ($A$, $A^{\text{complement}}$)
  - an outcome can be impossible, or certain, or somewhere in between: $0 \le P \le 1$
  - but something must happen: $\sum_{i} P(e_i) = 1$ (the event must have an outcome)

Union

$$P(A \cup B) = P(A) + P(B) - P(A \cap B)$$

Subtract the intersection to avoid "double counting". If the events are mutually exclusive, then $P(A \cap B) = 0$ by definition.

Independent Events - Intersection

$$P(A \cap B) = P(A) * P(B|A)$$

If $A$ and $B$ are independent, then $P(B|A) = P(B)$, so

$$P(A \cap B) = P(A) * P(B)$$

Dependent Events - Conditional Probability

Providing information about the outcome of event $B$ causes you to "change your answer" about the probability of event $A$.

Now that you know the outcome of event $B$, the probability of event $A$ has changed and needs to be recalculated to factor in this new information. The range of possible outcomes for event $A$ may have changed given that outcome of event $B$. Event $A$ is said to be dependent on event $B$.

The probability of $A$ given $B$ is equal to the probability of the intersection of $A$ and $B$, over the probability of $B$.

$$P(A \mid B) = \frac{P(A \cap B)}{P(B)}$$

Probability Distribution

We consider an event $X$ with $N$ possible outcomes. The outcomes each have a probability of occurring $P(X)$. We want to find the expected value of $X$: $E(X)$.
We must weigh each outcome $x_i$ by the probability of that outcome $P(x_i)$.

| $X$ | $P(X)$ | $P(X) * X$ | $P(X) * (X - \mu)^2$ |
| --- | --- | --- | --- |
| $x_i$ | $P(x_i)$ | $P(x_i) * x_i$ | $P(x_i) * (x_i - \mu)^2$ |
| $\cdots$ | | | |
| $x_N$ | $P(x_N)$ | $P(x_N) * x_N$ | $P(x_N) * (x_N - \mu)^2$ |

Summing the last two columns gives $\mu$ and $\sigma^2$:

$$\mu = \sum_{i=1}^{N} P(x_i) * x_i$$

$$\sigma^2 = \sum_{i=1}^{N} P(x_i) * (x_i - \mu)^2$$

Binomial Distribution

Binomial Conditions:

1. n trials
2. 2 outcomes
3. events are independent
4. P(Success) is constant

A Fair Coin

The Experiment

Let's flip a fair coin n times and define success as Heads. Consider the following outcomes, where S is for Success, F for Failure.

Now imagine that we require x successes, that is, we want to know the probability of attaining x Success outcomes, ignoring order. In our first example:

- n = 3
- x = 1
- p = 0.5

We will call this a triplet (n,x,p) with a value (3,1,0.5).

The Outcomes

These are all the possible ordered outcomes of the 3 trials:

SSS SSF SFS SFF FSS FSF FFS FFF

This is the subset of outcomes where x = 1:

SFF FSF FFS

SFFL

Out of this subset, there is only one option that follows the arbitrary requirement of being ordered SuccessesFirstFailuresLast (SFFL):

SFF

This is the SFFL result for the (n,x,p) triplet of (3,1,0.5). If (n,x,p) was modified to (3,2,0.5) the SFFL result would be:

SSF

and at (3,3,0.5) it would be:

SSS

If we expand to (5,2,0.5):

SSFFF

If we generalize (n,x,p), then we can imagine a SFFL result of

    SSS........SSSFFF........FFF
    |----- x ----||--- n-x ----|
    |------------ n -----------|

P(SFFL)

Let's calculate the probability that the n trials result in the SFFL result.
$$P(x \text{ successes in } n \text{ trials each with chance } p \text{ ordered as SFFL}) = P(x \text{ consecutive successes} \cap (n - x) \text{ consecutive failures})$$

The probability of $x$ consecutive Successes for independent events is the product of the individual probabilities, $p$:

$$p^x$$

In the SFFL result, the remaining results are Failures. After counting $x$ Successes, there are $n - x$ Failures remaining. The probability of a failure is $1 - p$. So the probability of $n - x$ consecutive Failures is:

$$(1 - p)^{(n-x)}$$

Bringing it all together for the SFFL result, we have:

$$P(x \text{ successes in } n \text{ trials each with chance } p \text{ ordered as SFFL}) = p^x * (1 - p)^{(n-x)}$$

W - the 'winners'

Now, remember that the selection of the SFFL result was arbitrary. By placing the Successes all in a neat row, it was easy for us to calculate the probability of that row occurring in the results as $p^x$. By ending with all of the Failures in a row, we were able to arrive at $(1 - p)^{(n-x)}$. But since the original ask was for $x$ successes, regardless of order, we have clearly undercounted.

For the (n,x,p) triplet of (3,1,0.5), when we chose the SFFL result SFF, that was just a subset of the broader pool of results $W$ that won by satisfying the ask $x = 1$. Let's examine this subset $W$:

SFF FSF FFS

When we apply the (n,x,p) triplet of (3,1,0.5) to the formula we derived above, we get:

$$P(x \text{ successes in } n \text{ trials each with chance } p \text{ ordered as SFFL}) = p^x * (1 - p)^{(n-x)} = 0.5^1 * (1 - 0.5)^{(3-1)} = 0.125$$

which is the probability of getting the SFFL result SFF. Looking again at the set of results $W$ satisfying $x = 1$:

SFF FSF FFS

we see that the other two 'winners' are just a rearrangement of the SFFL result. Recall how we identified that the probability of getting the SFFL result was an undercount of the original ask, which was the probability of any of the results $W$ satisfying $x = 1$ occurring, regardless of order.
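As a quick check of the undercount described above, the sketch below enumerates every ordered outcome of the 3 trials, keeps the ones with exactly 1 Success, and adds up their probabilities. The helper names `sffl_probability` and `winners` are hypothetical and exist only for this illustration.

```python
from itertools import product


def sffl_probability(n, x, p):
    """Probability of one specific ordering with x Successes and n - x Failures."""
    return p ** x * (1 - p) ** (n - x)


def winners(n, x):
    """All ordered outcomes of n trials that contain exactly x Successes."""
    return ["".join(o) for o in product("SF", repeat=n) if o.count("S") == x]


n, x, p = 3, 1, 0.5

w = winners(n, x)                    # the set W from the notes
p_sffl = sffl_probability(n, x, p)   # probability of SFF alone

# Multiplication is commutative, so every rearrangement of SFF has the
# same probability as SFF itself; summing over W removes the undercount.
p_any_order = sum(sffl_probability(n, x, p) for _ in w)

print(w, p_sffl, p_any_order)
```

With these values the list holds the three winners SFF, FSF and FFS, and the probability of one Success in any order is three times the SFFL probability.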
We need to know how many 'winners' can be created by simply rearranging the order of the SFFL result. In the case above, with a (n,x,p) triplet of (3,1,0.5), the SFFL could be arranged in 3 different ways to create the full set of results $W$ satisfying $x = 1$.

Let's compare some (n,x,p) triplets, their SFFL value, and the size of the full set of results $W$ satisfying the required $x$ value.

| n,x,p | SFFL value | size of W | W |
| --- | --- | --- | --- |
| 3,0,0.5 | FFF | 1 | FFF |
| 3,1,0.5 | SFF | 3 | SFF, FSF, FFS |
| 3,2,0.5 | SSF | 3 | SSF, SFS, FSS |
| 3,3,0.5 | SSS | 1 | SSS |

| n,x,p | SFFL value | size of W | W |
| --- | --- | --- | --- |
| 4,0,0.5 | FFFF | 1 | FFFF |
| 4,1,0.5 | SFFF | 4 | SFFF, FSFF, FFSF, FFFS |
| 4,2,0.5 | SSFF | 6 | SSFF, SFSF, SFFS, FSSF, FSFS, FFSS |
| 4,3,0.5 | SSSF | 4 | SSSF, SSFS, SFSS, FSSS |
| 4,4,0.5 | SSSS | 1 | SSSS |

| n,x,p | SFFL value | size of W | W |
| --- | --- | --- | --- |
| 5,0,0.5 | FFFFF | 1 | FFFFF |
| 5,1,0.5 | SFFFF | 5 | SFFFF, FSFFF, FFSFF, FFFSF, FFFFS |
| 5,2,0.5 | SSFFF | 10 | SSFFF, SFSFF, SFFSF, SFFFS, FSSFF, FSFSF, FSFFS, FFSSF, FFSFS, FFFSS |
| 5,3,0.5 | SSSFF | 10 | SSSFF, SSFSF, SSFFS, SFSSF, SFSFS, SFFSS, FSSSF, FSSFS, FSFSS, FFSSS |
| 5,4,0.5 | SSSSF | 5 | SSSSF, SSSFS, SSFSS, SFSSS, FSSSS |
| 5,5,0.5 | SSSSS | 1 | SSSSS |

n choose x

We need to generalize this to ask:

Given n values, how many different ways can I choose x, ignoring order?

Notice that this is equivalent to:

How many ways can I rearrange the letters in my (3,1,0.5) SFFL: S, F, F?

where $n = 3$ and the presence of a single S represents the fact that $x = 1$. We can generalize to:

| n,x,p | SFFL value | size of W | W |
| --- | --- | --- | --- |
| n,0,p | 'F' repeated n times | 1 | 'F' repeated n times |
| n,x,p | 'S' repeated x times, then 'F' repeated n-x times | n choose x | n choose x # of rearrangements |
| n,n,p | 'S' repeated n times | 1 | 'S' repeated n times |

factorial !

Consider $s$ students entering a classroom in a line to sit at $s$ desks.
How many different ways could they arrange themselves? The first student has $s$ desks to choose from. The second student has $s - 1$ options. The final ($s^{th}$) student has 1 option.

$$\text{\# of orders that } s \text{ objects can be selected (arranged)} = s(s - 1)(s - 2) \cdots (2)(1)$$

We define this product as $s!$, read as "s factorial":

$$s! = s(s - 1)(s - 2) \cdots (2)(1)$$

Considering that $1! = 1$ and $0! = 1$ helps ground the definition of factorial less in the pattern of the multiplication used to calculate it and more in the concept of ordering. $0!$ can be thought of as the number of ways that the null set $\emptyset$ can be ordered, which is naturally $1$.

${}_nP_r$

We know that the number of ways that $n$ items can be arranged is defined as $n!$. But what if we only want to select and arrange $r$ out of those $n$ items? We define these permutations ${}_nP_r$ as "permutations of n items taken r at a time".

We pick the elements in any arbitrary order:

| element | 1 | 2 | 3 | ... | $r$ |
| --- | --- | --- | --- | --- | --- |
| options | $n$ | $n - 1$ | $n - 2$ | ... | $n - (r - 1)$ |

The choices are independent, so by the Product Rule for Counting, we can say:

$${}_nP_r = n(n - 1)(n - 2) \cdots (n - (r - 1))$$

Multiplying by $\frac{(n-r)!}{(n-r)!}$, we get

$${}_nP_r = \frac{n!}{(n - r)!}$$

${}_nC_r$ a.k.a. $\binom{n}{r}$

We know that combinations are order-agnostic, while permutations are order-dependent. We can intuit that to go from the number of permutations to the number of combinations, that is from ${}_nP_r$ to ${}_nC_r$, we would have to divide by some factor. That factor would be the number of arrangements that are considered equivalent from the perspective of combinations, after ignoring order.

Remember that $r!$ is the number of ways that one can rearrange $r$ elements. It turns out that ${}_nP_r$ exceeds ${}_nC_r$ by a factor of $r!$:

$${}_nC_r = \frac{{}_nP_r}{r!} = \frac{\frac{n!}{(n-r)!}}{r!} = \frac{n!}{r!(n - r)!}$$

Returning to n choose x, a.k.a. $\binom{n}{x}$

Recall that we wanted to relate our (n,x,p) triplet to the size of the set of 'winners' $W$. We asked:

Given n values, how many different ways can I choose x, ignoring order?

We knew that $P(SFFL)$ was undercounting by this exact factor, which we now have the capacity to name.

$$P(x \text{ successes in } n \text{ trials each with chance } p) = P(x \text{ successes in } n \text{ trials each with chance } p \text{ ordered as SFFL}) * (\text{number of ways that ordered SFFL could be rearranged})$$

Recall that

$$P(x \text{ successes in } n \text{ trials each with chance } p \text{ ordered as SFFL}) = p^x * (1 - p)^{(n-x)}$$

And we can now express the "n choose x" factor that we explored in the tables earlier as ${}_nC_r$ or $\binom{n}{r}$:

$$\text{number of ways that ordered SFFL could be rearranged} = \binom{n}{x}$$

Thus we can conclude:

$$P(x \text{ successes in } n \text{ trials each with chance } p) = \binom{n}{x} p^x * (1 - p)^{(n-x)}$$

Poisson Distribution

Let $\lambda$ be the frequency of some event $V$ occurring per time period $T$, where the expected value of $x$ can be written as:

$$E(x) = \lambda$$

We want the probability of the event occurring $k$ times over the course of period $T$. That is:

$$P(x = k)$$

Deriving the Poisson Distribution from the Binomial Distribution

Let's define $n$ trials, each occurring over a period $t$, where $t$ is of duration $T/n$. Starting with

$$P(\text{event occurs 1 time over time period } T) = \lambda$$

we can see that

$$P(\text{event occurs 1 time over time period } t) = \frac{\lambda}{n}$$

So for a given trial we can say that $p = \frac{\lambda}{n}$. Strictly, $\lambda$ is an expected count rather than a probability; treating $\frac{\lambda}{n}$ as a probability is an approximation that holds once $n$ is large enough that $\frac{\lambda}{n} \le 1$ and multiple occurrences within one trial are rare.

Let's first model this as a binomial distribution:

Binomial Conditions:

1. n trials: True
2. 2 outcomes: True
3. events are independent: True
4. P(Success) is constant: True

Setup:

- define success for the period sampled as: event occurs at least once
- probability of success: $p = \frac{\lambda}{n}$
- number of trials: $n$
- required number of successes: $k$

We take our binomial distribution:

$$P(k \text{ successes in } n \text{ trials each with chance } p) = \binom{n}{k} p^k * (1 - p)^{(n-k)}$$

and we start with an $n$ value of 60.

$$P(\text{success}, n = 60, k, p = \tfrac{\lambda}{60}) = \binom{60}{k} * \left(\frac{\lambda}{60}\right)^k * \left(1 - \frac{\lambda}{60}\right)^{(60-k)}$$

But we quickly realize that "how many of the 60 1-minute trials succeeded?" is only an approximation of "how many times did the event occur?". After all, we defined the success of each trial as: the event occurs at least once. Events occurring more than once per trial will not be captured. Therefore "what is the probability of $k$ out of the $n$ slices of $T$ succeeding" is only an approximation of "what is the probability the event occurs $k$ times over $T$". We can improve the approximation by increasing $n$. Let's perform 3600 1-second trials.

$$P(\text{success}, n = 3600, k, p = \tfrac{\lambda}{3600}) = \binom{3600}{k} * \left(\frac{\lambda}{3600}\right)^k * \left(1 - \frac{\lambda}{3600}\right)^{(3600-k)}$$

And it follows that the best approximation is:

$$P(x = k) = \lim_{n \to \infty} \binom{n}{k} * \left(\frac{\lambda}{n}\right)^k * \left(1 - \frac{\lambda}{n}\right)^{(n-k)}$$

Rewriting $\binom{n}{k}$ as a factorial and distributing the exponents:

$$P(x = k) = \lim_{n \to \infty} \frac{n!}{(n - k)!\,k!} * \frac{\lambda^k}{n^k} * \left(1 - \frac{\lambda}{n}\right)^n * \left(1 - \frac{\lambda}{n}\right)^{-k}$$

Rewriting $\frac{n!}{(n-k)!}$ as $n(n - 1)(n - 2) \cdots (n - k + 1)$:

$$P(x = k) = \lim_{n \to \infty} \frac{n(n - 1)(n - 2) \cdots (n - k + 1)}{n^k} * \frac{\lambda^k}{k!} * \left(1 - \frac{\lambda}{n}\right)^n * \left(1 - \frac{\lambda}{n}\right)^{-k}$$

Applying the limit to each term, and simplifying in advance of applying the limit (here $C$ collects the terms of the expanded product whose degree in $n$ is lower than $k$):

$$P(x = k) = \lim_{n \to \infty} \frac{n^k + C}{n^k} * \frac{\lambda^k}{k!} * \lim_{n \to \infty} \left(1 - \frac{\lambda}{n}\right)^n * \lim_{n \to \infty} \left(1 - \frac{\lambda}{n}\right)^{-k}$$

Applying each limit:

$$P(x = k) = 1 * \frac{\lambda^k}{k!} * \frac{1}{e^{\lambda}} * 1$$

$$P(x = k) = \frac{\lambda^k}{k!\,e^{\lambda}}$$

Applications of the Poisson Distribution

Unbounded upper limit

When asked what is $P(x > k)$, you must reformulate the question:

$$P(x > k) = 1 - P(x \le k)$$

Similarly,

$$P(x \ge k) = 1 - P(x < k)$$

HyperGeometric Distribution

Consider a population $N$. A sample $n$ will be randomly selected.

Conditions

1. The population $N$ can be divided into two mutually exclusive groups (e.g. Pass/Fail).
2. The finite population is sampled without replacement. This means that the probability of choosing an item from a given group will change after each draw, as the composition of the remaining elements has changed.

Probability

Consider $n$ samples drawn from population $N$, which can be divided into two mutually exclusive groups $Success$ (of size $K$) and $Failure$. The probability that $k$ out of $n$ samples drawn belong to the $Success$ group:

$$P(x = k) = \frac{\binom{K}{k}\binom{N-K}{n-k}}{\binom{N}{n}}$$
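The Poisson tail reformulation and the hypergeometric probability can both be sketched in a few lines of Python. This is a minimal illustration, not part of the original notes: the function names and the numbers used (a rate of 3 events per period, and a population of 20 with 5 in the Success group) are assumptions chosen for the example.

```python
from math import comb, exp, factorial


def poisson_pmf(k, lam):
    """P(x = k) for a Poisson distribution with expected count lam."""
    return lam ** k / (factorial(k) * exp(lam))


def poisson_greater_than(k, lam):
    """P(x > k), computed as 1 - P(x <= k) because the upper limit is unbounded."""
    return 1 - sum(poisson_pmf(i, lam) for i in range(k + 1))


def hypergeom_pmf(k, N, K, n):
    """P(x = k): k of the n draws (without replacement) fall in the Success group."""
    return comb(K, k) * comb(N - K, n - k) / comb(N, n)


# Hypothetical Poisson example: lam = 3 events per period, more than 2 events.
tail = poisson_greater_than(2, 3.0)

# Hypothetical hypergeometric example: N = 20, K = 5 Successes, n = 4 draws.
one_success = hypergeom_pmf(1, 20, 5, 4)

# The probabilities over every possible k should add up to 1.
total = sum(hypergeom_pmf(k, 20, 5, 4) for k in range(0, 5))

print(tail, one_success, total)
```

Summing the finite head `P(x <= k)` and subtracting from 1 avoids having to add an infinite number of Poisson terms, which is the point of the reformulation.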