Statistics and Quantitative Analysis U4320
Segment 4: Statistics and Quantitative Analysis
Prof. Sharyn O’Halloran

Probability Distributions

A. Distributions: How do simple probability tables relate to distributions?

1. What is the probability of getting a head? (1 coin toss)

[Figure: bar graph of probability against proportion of heads, with bars of height 1/2 at 0 and at 1.]

2. Now say we flip the coin twice. The picture now looks like:

[Figure: bar graph of probability against proportion of heads, with bars of height 1/4 at 0 (0 heads), 1/2 at 1/2 (1 head), and 1/4 at 1 (2 heads).]

As the number of coin tosses increases, the distribution looks like a bell-shaped curve:

[Figure: bell-shaped curve over proportion of heads, with axis marks at 0, 1/2, and 1.]

3. General: Normal Distribution

Probability distributions are idealized bar graphs or histograms. As we get more and more tosses, the probability of any one observation falls to zero. Thus, the final result is a bell-shaped curve.

B. Properties of a Normal Distribution

1. Formulas: Mean and Variance

            Population                                              Sample
Mean        $\mu = \frac{1}{N}\sum_{i=1}^{N} X_i$                   $\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$
Variance    $\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(X_i - \mu)^2$     $s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(X_i - \bar{X})^2$

2. Note Difference with the Book

The population mean is written as $\mu = \sum x\,p(x)$, and the variance as $\sigma^2 = \sum (x - \mu)^2\,p(x)$.

Example: Two tosses of a coin

Number of Heads x     0      1      2
Probability p(x)      1/4    1/2    1/4

So the average, or expected, number of heads in two tosses of a coin is:
0*1/4 + 1*1/2 + 2*1/4 = 1.

3. Expected Value

$E(X) = \mu$ = average, or mean
$E[(X - \mu)^2] = \sigma^2$ = expected variance

C. Standard Normal Distribution

1. Definition: a normal distribution with mean 0 and standard deviation 1.

Z-values are points on the x-axis that show how many standard deviations that point is away from the mean $\mu$.

[Figure: standard normal curve plotted against Z-values; total area under the curve = 1.]

2. Characteristics: symmetric, unimodal, and continuous.

3. Example: The heights of people are normally distributed with mean 5'7". What is the proportion of people taller than 5'7"? Because the curve is symmetric about its mean, the answer is 1/2.

[Figure: normal curve centered at 5'7"; total area under the curve = 1, area to the right of 5'7" = 1/2.]

D. How to Calculate Z-scores

Definition: The Z-value is the number of standard deviations away from the mean.
Definition: Z-tables give the probability (score) associated with a particular Z-value. The entries used below are areas to the right of z.

1. What is the area under the curve that is greater than 1? Prob (Z > 1)
The entry in the table is 0.159, which is the total area to the right of 1.

2. What is the area to the right of 1.64? Prob (Z > 1.64)
The table gives 0.051, or about 5%.

3. What is the area to the left of -1.64? Prob (Z < -1.64)
By symmetry, this equals the area to the right of 1.64, or 0.051.

4. What is the probability that an observation lies between 0 and 1? Prob (0 < Z < 1)
Prob (Z > 0) - Prob (Z > 1) = 0.50 - 0.159 = 0.341

5. How would you figure out the area between 1 and 1.5 on the graph? Prob (1 < Z < 1.5)
Prob (Z > 1) - Prob (Z > 1.5) = 0.159 - 0.067 = 0.092

6. What is the area between -1 and 2? Prob (-1 < Z < 2)
P (-1 < Z < 0) = .341
P (0 < Z < 2) = .50 - .023 = .477
0.477 + 0.341 = .818

7. What is the area between -2 and 2? Prob (-2 < Z < 2)
1 - Prob (Z < -2) - Prob (Z > 2) = 1 - .023 - .023 = .954

E. Standardization

1. The standard normal distribution is a very special case where the mean of the distribution equals 0 and the standard deviation equals 1.

2. Case 1: Standard deviation differs from 1
For a normal distribution with mean 0 and some standard deviation $\sigma$, you can convert any point x to the standard normal distribution by changing it to $x/\sigma$.

[Figure: normal curves with SD = 1 and SD = 3 plotted against Z-values, with the points -2.00 and 2.00 marked on each.]

3. Case 2: Mean differs from 0
So starting with any normal distribution with mean $\mu$ and standard deviation 1, you can convert to a standard normal by taking $x - \mu$ and using this as your Z-value.
Now, what would be the area under the graph between 50 and 51? Taking the mean to be 50, the Z-values are 0 and 1, so the area is Prob (0 < Z < 1) = .341.

[Figure: normal curve with SD = 1 and the point x = 51 marked.]

4. General Case: Mean not equal to 0 and SD not equal to 1
Say you have a normal distribution with mean $\mu$ and standard deviation $\sigma$. You can convert any point x in that distribution to the same point in the standard normal by computing
$Z = \frac{x - \mu}{\sigma}$.
This is called standardization. The Z-value corresponds to x. The Z-table lets you look up the probability for the Z-value of any number.

5. Trout Example:

a. The lengths of trout caught in a lake are normally distributed with mean 9.5" and standard deviation 1.4". There is a law that you can't keep any fish below 12". What percent of the trout can be kept (those above 12")?

Step 1: Standardize. Find the Z-value of 12 for Prob (x > 12):
Z = (12 - 9.5) / 1.4 = 1.79.
Step 2: Find the probability Prob (Z > 1.79).
Look up 1.79 in your table; only .037, or about 4%, of the fish could be kept.

b. Now they're thinking of changing the standard to 10" instead of 12". What proportion of fish could be kept under the new limit?

Step 1: Standardize. Find the Z-value of 10 for Prob (x > 10):
Z = (10 - 9.5) / 1.4 = 0.36.
Step 2: Find the probability Prob (Z > .36).
In your tables, this gives .359, or almost 36% of the fish could be kept under the new law.

Joint Distributions

A. Probability Tables

1. Example: Toss a coin 3 times. How many heads and how many runs do we observe?
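One way to answer this question is to enumerate the eight equally likely sequences directly. The following is a minimal Python sketch, not part of the original slides: it assumes a fair coin and counts a run as a maximal block of identical consecutive outcomes. The names `count_runs`, `outcomes`, and `joint` are illustrative only.

```python
from collections import Counter
from fractions import Fraction
from itertools import groupby, product


def count_runs(tosses):
    """Count maximal blocks of identical consecutive outcomes."""
    return sum(1 for _ in groupby(tosses))


# All 2**3 sequences of three tosses, each assumed equally likely.
outcomes = ["".join(seq) for seq in product("TH", repeat=3)]

# Probability of each (number of heads, number of runs) pair.
joint = Counter()
for seq in outcomes:
    joint[(seq.count("H"), count_runs(seq))] += Fraction(1, len(outcomes))

# One line per sequence: tosses, heads x, runs y.
for seq in outcomes:
    print(seq, seq.count("H"), count_runs(seq))
```

Using `Fraction` keeps the probabilities exact, so they can be compared directly with eighths and quarters rather than with rounded decimals.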
Def: A run is a sequence of one or more of the same event in a row.

Possible outcomes:

Toss    Probability    Heads x    Runs y
TTT     1/8            0          1
TTH     1/8            1          2
THT     1/8            1          3
THH     1/8            2          2
HTT     1/8            1          2
HTH     1/8            2          3
HHT     1/8            2          2
HHH     1/8            3          1

2. Joint Distribution Table

                    Heads x
Runs y      0      1      2      3      p(y)
1           1/8    0      0      1/8    1/4
2           0      1/4    1/4    0      1/2
3           0      1/8    1/8    0      1/4
p(x)        1/8    3/8    3/8    1/8    1

3. Definition: The joint probability of x and y is the probability that both x and y occur.
p(x, y) = Pr(X = x and Y = y)
p(0, 1) = 1/8, p(1, 2) = 1/4, and p(3, 3) = 0.

B. Marginal Probabilities

1. Def: A marginal probability is the sum across a row or down a column of the joint table. It is the overall probability of an event occurring.
$p(x) = \sum_{y} p(x, y)$
So the probability that there is just 1 head is the probability of 1 head and 1 run + 1 head and 2 runs + 1 head and 3 runs = 0 + 1/4 + 1/8 = 3/8.

C. Independence

A and B are independent if P(A|B) = P(A).
$P(A \mid B) = \frac{P(A \,\&\, B)}{P(B)}$; if $P(A) = P(A \mid B)$, then $P(A, B) = P(A)\,P(B)$.

Are the # of heads and the # of runs independent?

                    # heads x
# runs y      0      1      2      3      marg dist
1                                         1/4
2                                         1/2
3                                         1/4
marg dist     1/8    3/8    3/8    1/8    1

Under independence, each cell would equal the product of its marginals. For example, p(0, 1) would be 1/8 * 1/4 = 1/32, but the joint table gives 1/8, so the number of heads and the number of runs are not independent.

Correlation and Covariance

A. Covariance

1. Definition of Covariance
Covariance is defined as the expected value of the product of the differences from the means.
\[ \sigma_{x,y} = E[(X - \mu_x)(Y - \mu_y)] = \frac{1}{N}\sum_{i=1}^{N}(X_i - \mu_x)(Y_i - \mu_y) = \sum_{x}\sum_{y}(x - \mu_x)(y - \mu_y)\,p(x, y) \]

2. Graph

[Figure: graph illustrating covariance; not recoverable from the source.]

B. Correlation

1. Definition of Correlation
\[ \rho_{x,y} = \frac{\sigma_{x,y}}{\sigma_x \sigma_y} = \frac{\text{Covariance}}{SD_x \cdot SD_y} \]

2. Characteristics of Correlation
$-1 \le \rho \le 1$

If $\rho = 1$: [Figure: plot of y against x; the points lie on an upward-sloping line.]
If $\rho = -1$: [Figure: plot of y against x; the points lie on a downward-sloping line.]
If $\rho = 0$: [Figure: plot of y against x; no linear relationship.]

Why?
\[ \rho_{x,y} = \frac{\sigma_{x,y}}{\sigma_x \sigma_y} = \frac{\frac{1}{N}\sum_{i=1}^{N}(x_i - \mu_x)(y_i - \mu_y)}{\sqrt{\frac{1}{N}\sum_{i=1}^{N}(x_i - \mu_x)^2}\,\sqrt{\frac{1}{N}\sum_{i=1}^{N}(y_i - \mu_y)^2}} \]

Sample Homework

GET /FILE 'gss91.sys'.

The SPSS/PC+ system file is read from file gss91.sys
The SPSS/PC+ system file contains 1517 cases, each consisting of 203 variables (including system variables).
203 variables will be used in this session.

COMPUTE AFFAIRS = XMARSEX.
RECODE AFFAIRS (0,5,8,9 = SYSMIS) (1,2 = 0) (3,4 = 1).
VALUE LABELS AFFAIRS 0 'BAD' 1 'OK'.

The raw data or transformation pass is proceeding
1517 cases are written to the compressed active file.

***** Memory allows a total of 10345 Values, accumulated across all Variables.
There also may be up to 1293 Value Labels for each Variable.

AFFAIRS
                                              Valid      Cum
Value Label    Value    Frequency   Percent   Percent    Percent
BAD              .00       870       57.4      90.2       90.2
OK              1.00        94        6.2       9.8      100.0
                 .         553       36.5     Missing
                         -------    -------   -------
               Total      1517      100.0     100.0

Mean         .098    Std err       .010    Median       .000
Mode         .000    Std dev       .297    Variance     .088
Kurtosis    5.398    S E Kurt      .157    Skewness    2.718
S E Skew     .079    Range        1.000    Minimum      .000
Maximum     1.000    Sum         94.000

Valid cases    964    Missing cases    553

COMPUTE MONEY = INCOME91.
RECODE MONEY (0,22,98,99 = SYSMIS) (1 THRU 15 = 0) (16 THRU 21 = 1).
VALUE LABELS MONEY 0 'LOW' 1 'HIGH'.
FREQUENCIES /VARIABLES AFFAIRS MONEY /STATISTICS ALL.

MONEY
                                              Valid      Cum
Value Label    Value    Frequency   Percent   Percent    Percent
LOW              .00       787       51.9      57.5       57.5
HIGH            1.00       581       38.3      42.5      100.0
                 .         149        9.8     Missing
                         -------    -------   -------
               Total      1517      100.0     100.0

Mean         .425    Std err       .013    Median       .000
Mode         .000    Std dev       .494    Variance     .245
Kurtosis   -1.910    S E Kurt      .132    Skewness     .305
S E Skew     .066    Range        1.000    Minimum      .000
Maximum     1.000    Sum        581.000

Valid cases    1368    Missing cases    149

CROSSTABS /TABLES=MONEY BY AFFAIRS /CELLS /STATISTICS=CORR.

Memory allows for 7,021 cells with 2 dimensions for general CROSSTABS.

MONEY by AFFAIRS

                          AFFAIRS
MONEY                  BAD       OK      ROW TOTAL
LOW      COUNT         450       55        505
         ROW %        89.1     10.9       57.9
         COL. %       57.0     66.3
         TOTAL %      51.6      6.3
HIGH     COUNT         339       28        367
         ROW %        92.4      7.6       42.1
         COL. %       43.0     33.7
         TOTAL %      38.9      3.2
COLUMN TOTAL           789       83        872
                      90.5      9.5      100.0

                                                          Approximate
Statistic               Value       ASE1      T-value     Significance
--------------------    --------    ------    --------    ------------
Pearson's R             -.05487     .03267    -1.62089    .10540
Spearman Correlation    -.05487     .03267    -1.62089    .10540

Number of Missing Observations: 645
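
The Pearson's R reported above can be reproduced by hand from the four cell counts of the MONEY by AFFAIRS crosstab, using the correlation formula from the lecture (covariance divided by the product of the standard deviations). The following is a minimal Python sketch, not part of the original SPSS session. It assumes LOW and BAD are coded 0 and HIGH and OK are coded 1, as in the RECODE commands; the function name `pearson_r` is hypothetical.

```python
import math

# Cell counts copied from the MONEY by AFFAIRS crosstab.
# Keys are (money, affairs): LOW/BAD coded 0, HIGH/OK coded 1.
counts = {(0, 0): 450, (0, 1): 55, (1, 0): 339, (1, 1): 28}


def pearson_r(cells):
    """Correlation = covariance / (SD_x * SD_y), weighting each cell by its count."""
    n = sum(cells.values())
    mean_x = sum(x * c for (x, _), c in cells.items()) / n
    mean_y = sum(y * c for (_, y), c in cells.items()) / n
    cov = sum((x - mean_x) * (y - mean_y) * c for (x, y), c in cells.items()) / n
    var_x = sum((x - mean_x) ** 2 * c for (x, _), c in cells.items()) / n
    var_y = sum((y - mean_y) ** 2 * c for (_, y), c in cells.items()) / n
    return cov / math.sqrt(var_x * var_y)


r = pearson_r(counts)
print(round(r, 5))  # compare with the Pearson's R line in the SPSS output
```

Dividing by N rather than N - 1 does not change the result, because the same divisor appears in the covariance and in both variances and cancels in the ratio.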