[ Basic Definitions ]
Population: the entire group under study
Sample: a representative subset of the population

[ Measures of Center ]
Mean (average): the sum of the numbers divided by the count of numbers
[9, 10, 12, 13, 13, 13, 15, 15, 16, 16, 18, 22, 23, 24, 24]
SUM = 243, COUNT = 15, MEAN = 243 / 15 = 16.2

Median (middle): the number in the middle after arranging all values from lowest to highest; if there are two middle numbers, add them and divide by two
[9, 10, 12, 13, 13, 13, 15, 15, 16, 16, 18, 22, 23, 24, 24]
MEDIAN = 15 (the 8th of 15 values)

Mode (most): the number that appears most often in the sample
[9, 10, 12, 13, 13, 13, 15, 15, 16, 16, 18, 22, 23, 24, 24]
MODE = 13 (appears three times)
Special cases*:
• No mode: all numbers occur the same number of times
• Two (or more) modes: more than one number is tied for most frequent

[ Measures of Spread ]
Range (difference): the difference between the largest and the smallest number
[72, 110, 134, 190, 238, 287, 305, 324]
MAX = 324, MIN = 72, RANGE = 324 - 72 = 252

IQR (interquartile range): when the values are ordered from low to high, the IQR spans the middle half of them
[72, 110, 134, 190, 238, 287, 305, 324]
To find the IQR, first find the values at Q1 and Q3: multiply the count of values (8) by 0.25 and 0.75 to get their positions.
Q1 position: 0.25 x 8 = 2, i.e., Q1 = 110
Q3 position: 0.75 x 8 = 6, i.e., Q3 = 287
IQR = Q3 - Q1 = 287 - 110 = 177
(Other quartile conventions exist and can give slightly different values.)

Outliers: extreme values present in the data, like 72 or 61 in [12, 5, 9, 11, 72, 7, 61]. Data points that fall below Q1 - 1.5 x IQR or above Q3 + 1.5 x IQR are considered outliers.

Variance (distance): the average squared distance from each data point to the mean
1. Calculate the mean of the set.
2. From each number, subtract the mean to find the differences (deviations).
3. Square each difference.
4. Work out the average of the squared differences (for a sample rather than a whole population, divide by n - 1 instead of n).
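The measures above can be checked with a short Python sketch. It relies on the standard-library statistics module and reuses the two example arrays from the text. The helper quartile_by_position is a hypothetical name introduced here; it follows the position rule used for Q1 and Q3 above and assumes the position works out to a whole number, which holds for the 8-value example.

```python
from statistics import mean, median, multimode, pvariance

center_data = [9, 10, 12, 13, 13, 13, 15, 15, 16, 16, 18, 22, 23, 24, 24]
spread_data = [72, 110, 134, 190, 238, 287, 305, 324]


def quartile_by_position(sorted_values, fraction):
    """Return the value at 1-based position fraction * n (hypothetical helper).

    Assumes fraction * n is a whole number, as in the 8-value example.
    """
    position = int(fraction * len(sorted_values))
    return sorted_values[position - 1]


# Measures of center
center_mean = mean(center_data)
center_median = median(center_data)
center_modes = multimode(center_data)  # list, so ties ("two modes") are visible

# Measures of spread
ordered = sorted(spread_data)
data_range = ordered[-1] - ordered[0]
q1 = quartile_by_position(ordered, 0.25)
q3 = quartile_by_position(ordered, 0.75)
iqr = q3 - q1

# Variance, following the four steps in the text (population form: divide by n)
spread_mean = mean(spread_data)                       # step 1
differences = [x - spread_mean for x in spread_data]  # step 2
squared = [d ** 2 for d in differences]               # step 3
manual_variance = sum(squared) / len(squared)         # step 4

# Cross-check against the library's population variance
population_variance = pvariance(spread_data)

print(center_mean, center_median, center_modes)
print(data_range, q1, q3, iqr)
print(manual_variance, population_variance)
```

multimode returns a list rather than a single value, which makes the "two modes" special case easy to spot.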
Standard Deviation: the square root of the variance; it measures how far a group of numbers is spread from the mean
standard deviation (s) = √variance

Z-Score (standard score): how far a data point is from the mean, measured in standard deviations
z = (value - mean) / standard deviation
[-3, -2, -1, 0, +1, +2, +3]

[ Measures of Chance ]
Probabilities
Let's assume we have the following Palomar College male student characteristics (counts out of 10,000 men):
• Attractive: 6,100
• Wealthy: 1,700
• A average: 910
• B average: 2,460

1- Simple Probability:
P(attractive) = 6,100 attractive men / 10,000 men
P(attractive) = 6,100 / 10,000 = 0.61 (61%)

2- Joint Probability: probability of Event A AND Event B happening (the product rule below assumes the two events are independent)
P(attractive AND wealthy) = P(attractive) x P(wealthy)
P(attractive AND wealthy) = (6,100 / 10,000) x (1,700 / 10,000)
P(attractive AND wealthy) = 0.61 x 0.17 = 0.104 (10.4%)

3- Union Probability:
• Mutually exclusive (cannot occur simultaneously):
P(A average OR B average) = P(A average) + P(B average)
P(A average OR B average) = (910 / 10,000) + (2,460 / 10,000)
P(A average OR B average) = 0.091 + 0.246 = 0.337 (33.7%)
• Non-mutually exclusive (can occur simultaneously):
P(A average OR attractive) = P(A average) + P(attractive) - P(A average AND attractive)
P(A average OR attractive) = (910 / 10,000) + (6,100 / 10,000) - [(910 / 10,000) x (6,100 / 10,000)]
P(A average OR attractive) = 0.091 + 0.61 - (0.091 x 0.61)
P(A average OR attractive) = 0.701 - 0.056 = 0.645 (64.5%)

4- Conditional Probability:
P(A | B) = P(A AND B) / P(B), where A is the event to measure and B is the condition
P(COVID) = 0.13
P(fever) = 0.42
P(COVID | fever) = P(COVID AND fever) / P(fever) = 0.054 / 0.42 ≈ 0.129 (12.9%)

[ Probability Distributions ]
1- Discrete Distributions
Discrete probability distributions describe outcomes that are countable, such as a value of 1, 2, 3, true, false, success, or failure.
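As a concrete illustration of a discrete distribution, the sketch below lists every equally likely sequence of fair coin flips and tabulates the probability of each possible number of heads. The function name heads_distribution and the choice of 3 flips are assumptions made for this example only.

```python
from collections import Counter
from fractions import Fraction
from itertools import product


def heads_distribution(num_flips):
    """Map each possible number of heads to its exact probability (fair coin)."""
    # Every sequence of H/T is equally likely for a fair coin.
    outcomes = list(product("HT", repeat=num_flips))
    counts = Counter(sequence.count("H") for sequence in outcomes)
    return {
        heads: Fraction(count, len(outcomes))
        for heads, count in sorted(counts.items())
    }


distribution = heads_distribution(3)
for heads, probability in distribution.items():
    print(heads, probability)

# A valid probability distribution always sums to 1.
total_probability = sum(distribution.values())
```

Using Fraction keeps the probabilities exact, so the total comes out as exactly 1 rather than a rounded float.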
Binomial Distribution: probability distribution of the number of successes in a sequence of independent trials:
• Number of heads in a sequence of coin flips
• 10 women in a sample of 100 people
P(X = k) = C(n, k) x p^k x (1 - p)^(n - k)
n = the number of trials (occurrences)
p = the probability of success in a single trial
k = the number of successes

Poisson Distribution: models how many times an event is likely to occur within a fixed period of time
• Probability of 12 people arriving at a restaurant per hour
• Probability of fewer than 200 visits to a website per day
P(X = x) = (λ^x x e^(-λ)) / x!
x = the Poisson random variable (number of occurrences)
λ = the average rate of occurrence (near the graph peak)

2- Continuous Distributions
A probability distribution in which the random variable X can take on any value within a range. The probability that X falls between two values (a and b) equals the integral of the density between them (the area under the curve).

The Normal Distribution: also known as the Gaussian distribution, a probability distribution that is symmetric about the mean (bell curve).

[ Hypothesis Testing ]
Hypothesis testing is a type of statistical analysis in which you put your assumptions about a population parameter to the test. It is used to assess the relationship between two statistical variables, for example:
• Does changing the name of a website increase traffic?
• Does taking a vitamin C supplement affect the sex ratio at birth?

Null Hypothesis (H0): assumes nothing is going on, i.e., no difference exists
Alternative Hypothesis (H1): a difference exists (e.g., in the birth ratio between the two populations)
P-Value (probability value): how likely it is that data at least as extreme as yours would occur if the null hypothesis were true
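To make the p-value definition concrete, here is a minimal sketch that uses the binomial distribution described earlier. The scenario is hypothetical: a coin lands heads 60 times in 100 flips, and the null hypothesis is that the coin is fair (p = 0.5). The function name one_sided_p_value and the numbers are assumptions for illustration, not data from the text.

```python
from math import comb


def one_sided_p_value(heads_observed, num_flips, p_null=0.5):
    """Probability of at least heads_observed heads if the null hypothesis is true."""
    # Sum the binomial probabilities of every outcome as extreme or more extreme.
    return sum(
        comb(num_flips, k) * p_null ** k * (1 - p_null) ** (num_flips - k)
        for k in range(heads_observed, num_flips + 1)
    )


# Hypothetical observation: 60 heads in 100 flips of a supposedly fair coin.
p_value = one_sided_p_value(60, 100)
print(f"p-value = {p_value:.4f}")
```

This is a one-sided test; a two-sided test would also count outcomes that are equally extreme in the other direction (40 heads or fewer).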
Alpha α (significance level): probability threshold for rejecting the null hypothesis; alpha is decided before data collection happens (typically 0.05)
if (p ≤ α) reject the null hypothesis

Hypothesis testing workflow:
1- Define the population: adults aged 18-30 who eat meat or are vegan
2- Define the Null Hypothesis: there is no difference in cancer frequency between meat eaters and vegans
3- Define the Alternative Hypothesis: meat eaters are more likely to develop cancer
4- Collect Data: diet status and cancer frequency in the past 3 years
5- Perform a Statistical Test: compare the mean number of cancers in the two populations
6- Draw a Conclusion: is the frequency of cancer higher in adults who consume meat?

Statistical Tests & Experiments:
Treatment: the independent variable (e.g., advertisement)
Response: the dependent variable (e.g., number of purchases)

Controlled Experiments:
Treatment group: sees the advertisement
Control group: doesn't see the advertisement

Experiment Standards:
Randomization: participants are assigned to the treatment or control group randomly.
Blinding: participants don't know which group they are in.
*Experiments that use these techniques are referred to as Randomized Controlled Trials (RCTs)
A/B Testing: a form of RCT that tests only two different treatments

[ Correlation ]
Correlation means association; more precisely, it measures the extent to which two variables are related. The relationship can be described by its direction (positive or negative) and by its strength (strong, moderate, or weak).
Correlation does not imply causation, remember that!

Pearson Correlation Coefficient (r):
• Quantifies the strength of a relationship between two variables with a number between -1 and 1
• Applies only to linear relationships between two variables (for example, a dependent and an independent variable)
• The magnitude corresponds to the strength of the relationship
• The sign (+ or -) corresponds to the direction of the relationship
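The Pearson coefficient can be computed directly from its usual definition: the sum of the products of the deviations from each mean, divided by the product of the square roots of the summed squared deviations. The sketch below is a minimal plain-Python version; the function name pearson_r and the ad_spend / purchases figures are hypothetical values chosen only to illustrate the calculation.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient for two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of the same length (at least 2 values)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of products of paired deviations (numerator).
    co_deviation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Square roots of the summed squared deviations (denominator).
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("r is undefined when one variable is constant")
    return co_deviation / (spread_x * spread_y)


# Hypothetical data: advertising spend (treatment) vs. number of purchases (response).
ad_spend = [1, 2, 3, 4, 5]
purchases = [2, 4, 5, 4, 5]
r = pearson_r(ad_spend, purchases)
print(f"r = {r:.3f}")
```

A value of r near +1 or -1 indicates a strong linear relationship, while a value near 0 indicates a weak one; the sign gives the direction.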