MGT 201 Midterm study guide

I. Descriptive statistics

How to quantitatively describe a data set {x1, …, xn}.

1. Measures of central tendency: where is the data located?
   a. Mean/average: x̄ = (x1 + … + xn)/n. Excel: =AVERAGE(…)
   b. Median: middle number when the data is sorted in increasing order (or the average of the two middle numbers). Less influenced by extreme values. Excel: =MEDIAN(…)
   c. Mode: most frequent observation. Excel: =MODE.SNGL(…)
2. Measures of spread/variability/dispersion: how far from the mean the data falls
   a. Mean absolute deviation: average of the |xi – x̄|'s
   b. Variance = mean squared deviation: average of the (xi – x̄)²'s. Excel: =VAR.P(…)
   c. Standard deviation = square root of the variance (brings it back to the same unit as the data). Used as a unit of distance from the mean (e.g., a data point is "far" from the mean if it is 3 standard deviations above or below it). Excel: =STDEV.P(…)
   d. Coefficient of variation = ratio of the standard deviation to the mean (gives a reference point for judging the size of the standard deviation in context)
   e. Range: gap between the lowest and highest observations. Excel: =MAX(…) – MIN(…)
   f. IQR: gap between the 1st and 3rd quartiles. Measures the spread of the middle 50% of the data. To find the quartiles: find the median; if it is a data point, exclude it. The median divides the data set into two halves; the median of each half is a quartile (the 1st quartile is the median of the lower half, the 3rd quartile is the median of the upper half). Excel: =QUARTILE.EXC(…) (may give a slightly different answer than the method described here)
3. Measures of dependence of two data sets
   a. Covariance = average of the (xi – x̄)(yi – ȳ)'s. Measures how close to a straight line the scatter plot of y vs. x is. Unit = unit of x times unit of y. The value is difficult to interpret. Excel: =COVARIANCE.P(…, …)
   b. Correlation coefficient = covariance / (σx σy). Standardizes the covariance to a value from –1 to 1.
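The measures in Section I can be reproduced outside Excel. The sketch below is a minimal Python (stdlib-only) version; the data values are made up for illustration, and the population formulas are used to match Excel's VAR.P / STDEV.P / COVARIANCE.P. Python's `statistics.quantiles(…, method='exclusive')` follows the same convention as QUARTILE.EXC.

```python
from statistics import median, mode, quantiles

# Made-up example data for illustration only.
x = [2, 4, 4, 4, 5, 5, 7, 9]
y = [1, 3, 2, 5, 4, 6, 8, 9]
n = len(x)

mean_x = sum(x) / n                                    # =AVERAGE(...)
med = median(x)                                        # =MEDIAN(...)
mo = mode(x)                                           # =MODE.SNGL(...)
var_x = sum((xi - mean_x) ** 2 for xi in x) / n        # =VAR.P(...)
sd_x = var_x ** 0.5                                    # =STDEV.P(...)
cv_x = sd_x / mean_x                                   # coefficient of variation
rng = max(x) - min(x)                                  # =MAX(...) - MIN(...)
q1, q2, q3 = quantiles(x, n=4, method='exclusive')     # =QUARTILE.EXC(...)
iqr = q3 - q1

mean_y = sum(y) / n
sd_y = (sum((yi - mean_y) ** 2 for yi in y) / n) ** 0.5
cov_xy = sum((xi - mean_x) * (yi - mean_y)
             for xi, yi in zip(x, y)) / n              # =COVARIANCE.P(..., ...)
corr_xy = cov_xy / (sd_x * sd_y)                       # =CORREL(..., ...)

print(mean_x, var_x, sd_x)   # → 5.0 4.0 2.0
print(cov_xy, corr_xy)       # covariance in (unit of x)·(unit of y); corr is unitless
```

Note how the correlation, unlike the covariance, is directly interpretable: here it comes out close to 1, so the scatter plot of y vs. x is close to an increasing straight line.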
When the absolute value of the correlation is close to 1, we can predict one variable from the other using the equation of the regression line. Excel: =CORREL(…, …)
   c. Do not confuse correlation with causation.

II. Probability of events

1. Sample space = set of all possible outcomes. Has probability 1.
2. Event = subset of the sample space = subset of outcomes.
3. Probability of an event A: likelihood that event A happens. If all outcomes are equally likely, it is the number of outcomes in A divided by the total number of possible outcomes.
4. Complement of an event A: happens whenever A does not happen. P(not A) = 1 – P(A)
5. Union of events A and B: when A happens, or B happens, or both happen. P(A or B) = P(A) + P(B) – P(A and B)
6. Intersection of events A and B: when A and B happen simultaneously. Events are mutually exclusive (i.e., disjoint) when their intersection is empty.
7. Conditional probability: used when we know for sure that an event is true. It reduces the sample space to the subset corresponding to what we know for sure is true (what is "given"). The probability of A given B is the probability that A happens given that B happens for sure. P(A|B) = P(A and B) / P(B)
8. Total probability rule: used when breaking the probability into two (or more) parts, by intersecting with or conditioning on another event, makes it easier. P(A) = P(A and B) + P(A and not B)
9. Multiplication rule: P(A and B) = P(A|B) · P(B) = P(B|A) · P(A)
10. Independence of events: A and B are independent when knowing that B happens does not change the probability of A. Events A and B are independent if and only if P(A and B) = P(A) · P(B). Two other equivalent definitions: P(A) = P(A|B); P(B) = P(B|A). To show A and B are independent, show that one of these 3 equalities holds. To show events are not independent, show that one of them does not hold. Do not confuse independence with being mutually exclusive! Mutually exclusive events (with positive probabilities) cannot be independent.
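The rules in items 4–10 can be checked numerically. The sketch below uses one roll of a fair six-sided die (a made-up example) with events A = "even" and B = "greater than 3"; with equally likely outcomes, every probability is just a ratio of set sizes.

```python
# Sample space for one roll of a fair die; events are subsets of it.
outcomes = {1, 2, 3, 4, 5, 6}
A = {2, 4, 6}   # "even"
B = {4, 5, 6}   # "greater than 3"

def p(event):
    """Equally likely outcomes: |event| / |sample space|."""
    return len(event) / len(outcomes)

# Complement rule: P(not A) = 1 - P(A)
assert abs(p(outcomes - A) - (1 - p(A))) < 1e-9
# Union rule: P(A or B) = P(A) + P(B) - P(A and B)
assert abs(p(A | B) - (p(A) + p(B) - p(A & B))) < 1e-9
# Conditional probability: P(A|B) = P(A and B)/P(B) = (2/6)/(3/6) = 2/3
p_A_given_B = p(A & B) / p(B)
# Total probability rule: P(A) = P(A and B) + P(A and not B)
assert abs(p(A) - (p(A & B) + p(A & (outcomes - B)))) < 1e-9
# Multiplication rule: P(A and B) = P(A|B) * P(B)
assert abs(p(A & B) - p_A_given_B * p(B)) < 1e-9
# Independence check: P(A and B) = 1/3 but P(A)*P(B) = 1/4, so A and B
# are NOT independent (knowing the roll is > 3 makes "even" more likely).
assert abs(p(A & B) - p(A) * p(B)) > 0.05
```

This also illustrates the warning in item 10: A and B here are neither mutually exclusive (they share {4, 6}) nor independent.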
11. Bayes' rule: used when you need to flip the conditioning, i.e., find P(B|A) when you know P(A|B). Formula:
    P(B|A) = P(A|B) · P(B) / [P(A|B) · P(B) + P(A|not B) · P(not B)]

III. Random variables and distributions

1. Generalities
   a. A random variable assigns a numerical value to each possible outcome of a statistical experiment. It must be subject to uncertainty and it must be numerical. It can be discrete or continuous.
   b. Distribution of a discrete random variable: the set of possible values xi and their corresponding probabilities pi.
   c. To find the probability of an event on a discrete random variable X, add up the probabilities of all the outcomes corresponding to the event.
   d. Expected value of X, E[X]: weighted average of the possible outcomes: Σi pi xi. Excel: =SUMPRODUCT(…, …)
   e. Variance of X, Var(X): weighted average of the squared deviations from the mean: Σi pi (xi – E[X])²
   f. Standard deviation of X: square root of the variance.
   g. Coefficient of variation of X: ratio of the standard deviation to the expected value.
2. Some special distributions
   a. Discrete uniform distribution: n equally likely outcomes, each with probability 1/n.
   b. Binomial distribution:
      • n independent trials, each resulting in either success or failure (define what a trial is and what counts as a success)
      • Each trial has the same chance of success, p
      • The random variable X counts how many successes there are out of the n trials. (Make sure you define what X is.)
      • Then X has a binomial distribution with parameters (n, p).
      • P(X = x) = [n! / (x! (n – x)!)] pˣ (1 – p)ⁿ⁻ˣ for x = 0, …, n
      • E[X] = np, Var(X) = np(1 – p)
      • P(X = x) = BINOM.DIST(x, n, p, FALSE) = BINOM.DIST(x, n, p, 0)
      • P(X ≤ x) = BINOM.DIST(x, n, p, TRUE) = BINOM.DIST(x, n, p, 1)

IV. Combinations of random variables

1. Definition: Z = aX + bY, where a and b are constants and X and Y are random variables.
2. Expected value: E[aX + bY] = a E[X] + b E[Y]
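The binomial formulas in item 2.b can be sketched in a few lines of Python (stdlib only); the parameters n = 10 trials and p = 0.3 below are illustrative, not from the course.

```python
from math import comb  # comb(n, x) = n! / (x! (n - x)!)

n, p = 10, 0.3  # illustrative parameters

def binom_pmf(x):
    """P(X = x) = C(n, x) * p^x * (1-p)^(n-x); Excel BINOM.DIST(x, n, p, FALSE)."""
    return comb(n, x) * p**x * (1 - p)**(n - x)

def binom_cdf(x):
    """P(X <= x): add up P(X = 0), ..., P(X = x); Excel BINOM.DIST(x, n, p, TRUE)."""
    return sum(binom_pmf(k) for k in range(x + 1))

# The probabilities over x = 0..n sum to 1, and the weighted-average
# definitions of E[X] and Var(X) recover the shortcuts np and np(1-p).
mean = sum(x * binom_pmf(x) for x in range(n + 1))
var = sum((x - mean) ** 2 * binom_pmf(x) for x in range(n + 1))
assert abs(sum(binom_pmf(x) for x in range(n + 1)) - 1) < 1e-9
assert abs(mean - n * p) < 1e-9            # E[X] = np = 3.0
assert abs(var - n * p * (1 - p)) < 1e-9   # Var(X) = np(1-p) = 2.1
```

Note the exam-style bookkeeping the bullets ask for: a trial is one of the 10 repetitions, a success happens with probability 0.3, and X counts successes, so X ~ Binomial(10, 0.3).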
3. Joint distribution of X and Y: gives the probabilities that simultaneously X = x and Y = y, for all possible values x of X and y of Y. The marginal distributions of X and of Y can be found by adding up the joint probabilities over the rows and columns of the joint distribution table.
4. X and Y are independent random variables when P(X = x and Y = y) = P(X = x) · P(Y = y) for all possible values x of X and y of Y.
5. Covariance of X and Y: weighted average of the (xi – E[X])(yj – E[Y])'s, using the joint probabilities as weights.
6. Correlation of X and Y = covariance / (SD(X) · SD(Y)). Standardizes the covariance to a value from –1 to 1.
7. Finding the distribution of Z: find all the possible values of Z and, for each of these values, the corresponding probability, by adding up the joint probabilities of X and Y that lead to that value of Z.
8. Variance of a linear combination: Var(aX + bY) = a² Var(X) + b² Var(Y) + 2ab Cov(X, Y)
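The machinery in items 3–8 can be checked numerically. The sketch below uses an invented 2×2 joint distribution table: it computes the marginals by summing rows/columns, the covariance from the joint probabilities, the distribution of Z = aX + bY by grouping, and then verifies the expected-value and variance formulas from items 2 and 8.

```python
# Made-up joint distribution for illustration: keys are (x, y) pairs,
# values are P(X = x and Y = y). The four probabilities sum to 1.
joint = {
    (0, 0): 0.2, (0, 1): 0.3,
    (1, 0): 0.1, (1, 1): 0.4,
}

# Marginals: sum the joint probabilities over the other variable.
px, py = {}, {}
for (x, y), pr in joint.items():
    px[x] = px.get(x, 0) + pr
    py[y] = py.get(y, 0) + pr

ex = sum(x * pr for x, pr in px.items())                # E[X]
ey = sum(y * pr for y, pr in py.items())                # E[Y]
vx = sum((x - ex) ** 2 * pr for x, pr in px.items())    # Var(X)
vy = sum((y - ey) ** 2 * pr for y, pr in py.items())    # Var(Y)
# Covariance: weighted average of (x - E[X])(y - E[Y]) over the joint table.
cov = sum((x - ex) * (y - ey) * pr for (x, y), pr in joint.items())
corr = cov / (vx ** 0.5 * vy ** 0.5)                    # in [-1, 1]

# Distribution of Z = aX + bY: group joint probabilities by the value of Z.
a, b = 2, 3  # illustrative constants
pz = {}
for (x, y), pr in joint.items():
    z = a * x + b * y
    pz[z] = pz.get(z, 0) + pr
ez = sum(z * pr for z, pr in pz.items())
vz = sum((z - ez) ** 2 * pr for z, pr in pz.items())

# Item 2: E[aX + bY] = a E[X] + b E[Y]
assert abs(ez - (a * ex + b * ey)) < 1e-9
# Item 8: Var(aX + bY) = a^2 Var(X) + b^2 Var(Y) + 2ab Cov(X, Y)
assert abs(vz - (a**2 * vx + b**2 * vy + 2 * a * b * cov)) < 1e-9
```

Here Cov(X, Y) = 0.05 > 0, and P(X = 0 and Y = 0) = 0.2 ≠ P(X = 0) · P(Y = 0) = 0.15, so X and Y are not independent and the cross term in item 8 cannot be dropped.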