Quantitative Data Analysis: Statistics Sherlock Holmes "... while man is an insoluble puzzle, in the aggregate he becomes a mathematical certainty. You can, for example, never foretell what any one man will do, but you can say with precision what an average number will be up to. Individuals vary, but percentages remain constant. So says the statistician" Overview General Statistics The Normal Distribution Z-Tests Confidence Intervals T-Tests General Statistics ~ THE GOLDEN RULE ~ Statistics NEVER replace the judgment of the expert. Approach to Statistical Research 1. 2. 3. 4. 5. 6. Formulate a Hypothesis State predictions of the hypothesis Perform experiments or observations Interpret experiments or observations Evaluate results with respect to hypothesis Refine hypothesis and start again (Basically the same as all other research) Hypothesis Testing H0 : Null Hypothesis, status quo HA : Alternative Hypothesis, research question So, either : "The data does not support H0" or "We fail to reject H0" Types of Data Continuous Discrete # of days worked this week, # leaves on a tree Ordinal height, age, time {Good, O.K., Bad} Nominal {Yes/No}, {Teacher/Chemist/Haberdasher} Picturing The Data Pie Charts Nominal/Ordinal Only suitable for data that adds up to 1 Hard to compare values in the chart Bar Charts Nominal/Ordinal Easier to compare values than pie chart Suitable for a wider range of data Dot Plots Nominal/Ordinal Represents all the data Difficult to read Box Plots Nominal/Ordinal 1IQR, 3IQR Outliers Scatter Plots Excellent for examining association between two variables Histograms Continuous Data Divide Data into ranges Time-Series Plots Time related Data e.g. Stock Prices Question 1 In a telephone survey of 68 households, when asked do they have pets, the following were the responses : 16 : No Pets 28 : Dogs 32 : Cats Draw the appropriate graphic to illustrate the results !! Question 1 - Solution Total number surveyed = 68 Number with no pets = 16 =>Total with pets = (68 - 16) = 52 But total 28 dogs + 32 cats = 60 => So some people have both cats and dogs Question 1 - Solution How many? It must be (60 - 52) = 8 people No pets = 16 Dogs = 20 Cats = 24 Both = 8 ------------------------Total = 68 Question 1 - Solution Graphic: Pie Chart or Bar Chart The Literary Digest Poll 1936 US Presidential Election Alf Landon (R) vs. Franklin D. Roosevelt (D) The Literary Digest Poll Literary Digest had been conducting successful presidential election polls since 1916 They had correctly predicted the outcomes of the 1916, 1920, 1924, 1928, and 1932 elections by conducting polls. These polls were a lucrative venture for the magazine: readers liked them; newspapers played them up; and each “ballot” included a subscription blank. The Literary Digest Poll They sent out 10 million ballots to two groups of people: prospective subscribers, “who were chiefly upper- and middle-income people” a list designed to "correct for bias" from the first list, consisting of names selected from telephone books and motor vehicle registries The Literary Digest Poll Response rate: approximately 25%, or 2,376,523 responses Result: Landon in a landslide (predicted 57% of the vote, Roosevelt predicted 40%) Election result: Roosevelt received approximately 60% of the vote The Literary Digest Poll POSSIBLE CAUSES OF ERROR Selection Bias: By taking names and addresses from telephone directories, survey systematically excluded poor voters. Republicans were markedly overrepresented in 1936, Democrats did not have as many phones, not as likely to drive cars, and did not read the Literary Digest “Sampling Frame” is the actual population of individuals from which a sample is drawn: Selection bias results when sampling frame is not representative of the population of interest The Literary Digest Poll POSSIBLE CAUSES OF ERROR Non-response Bias: Because only 20% of 10 million people returned surveys, nonrespondents may have different preferences from respondents Indeed, respondents favored Landon Greater response rates reduce the odds of biased samples Terminology Population: is a set of entities concerning which statistical inferences are to be drawn. Sample: a number of independent observations from the same probability distribution Parameter: the distribution of a random variable as belonging to a family of probability distributions, distinguished from each other by the values of a finite number of parameters Bias: a factor that causes a statistical sample of a population to have some examples of the population less represented than others. Outliers (and their treatment) An "outlier" is an observation that does not fit the pattern in the rest of the data Check the data Check with the measurer If reason to believe it is NOT real, change it if possible, otherwise leave it out (but note). If reason to believe it is real, leave it out and note. The Mean The Mean (Arithmetic) The mean is defined as the sum of all the elements, divided by the number of elements. The statistical mean of a set of observations is the average of the measurements in a set of data The Variance But there can be a lot of variance in individual elements, e.g. teacher salaries Average = €22,000 Lowest = € 12,000 Difference = 12,000 - 22,000 = -10,000 The Variance Sum of (Sample - Average) = 0, thus we need to define variance. The variance of a set of data is a cumulative measure of the squares of the difference of all the data values from the mean divided by sample size minus one. Standard Deviation The standard deviation of a set of data is the positive square root of the variance. -1 -1 Question 2 Find the mean and variance of the following sample values : 36, 41, 43, 44, 46 Question 2 Mean: (36 + 41 + 43 + 44 + 46)/5 = 42 Variance Difference Square 36 – 42 = -6 36 41 – 42 = -1 1 43 – 42 = 1 1 44 – 42 = 2 4 46 – 42 = 4 16 --------------------------------------- 58 58 / (5 -1) = 58 / 4 = 14.5 The Normal Distribution Density Curves: Properties The Normal Distribution The graph has a single peak at the center, this peak occurs at the mean The graph is symmetrical about the mean The graph never touches the horizontal axis The area under the graph is equal to 1 Characterization A normal distribution is bell-shaped and symmetric. The distribution is determined by the mean mu, m, and the standard deviation sigma, s. The mean mu controls the center and sigma controls the spread. The Normal Distribution If a variable is normally distributed, then: within one standard deviation of the mean there will be approximately 68% of the data within two standard deviations of the mean there will be approximately 95% of the data within three standard deviations of the mean there will be approximately 99.7% of the data The Normal Distribution Why? One reason the normal distribution is important is that many psychological and organsational variables are distributed approximately normally. Measures of reading ability, introversion, job satisfaction, and memory are among the many psychological variables approximately normally distributed. Although the distributions are only approximately normal, they are usually quite close. Why? A second reason the normal distribution is so important is that it is easy for mathematical statisticians to work with. This means that many kinds of statistical tests can be derived for normal distributions. Almost all statistical tests discussed in this text assume normal distributions. Fortunately, these tests work very well even if the distribution is only approximately normally distributed. Some tests work well even with very wide deviations from normality. One Tail / Two Tail Imagine we undertook an experiment where we measured staff productivity before and after we introduced a computer system to help record solutions to common issues of work Average productivity before = 6.4 Average productivity after = 9.2 One Tail / Two Tail 0 Before = 6.4 After = 9.2 10 One Tail / Two Tail 0 Before = 6.4 After = 9.2 10 One Tail / Two Tail 0 Before = 6.4 After = 9.2 10 One Tail / Two Tail 0 Before = 6.4 After = 9.2 10 One Tail / Two Tail 0 Before = 6.4 After = 9.2 10 One Tail / Two Tail 0 Before = 6.4 After = 9.2 10 One Tail / Two Tail 0 Before = 6.4 After = 9.2 10 One Tail / Two Tail σ 0 Before = 6.4 σ σ After = 9.2 10 One Tail / Two Tail One-Tailed H0 : m1 >= m2 HA : m1 < m2 Two-Tailed H0 : m1 = m2 HA : m1 <>m2 STANDARD NORMAL DISTRIBUTION Normal Distribution is defined as N(mean, (Std dev)^2) Standard Normal Distribution is defined as N(0, (1)^2) STANDARD NORMAL DISTRIBUTION Using the following formula : will convert a normal table into a standard normal table. Exercise If the average IQ in a given population is 100, and the standard deviation is 15, what percentage of the population has an IQ of 145 or higher ? Answer P(X >= 145) P(Z >= ((145 - 100)/15)) P(Z >= 3) From tables: 99.87% are less than 3 => 0.13% of population Trends in Statistical Tests used in Research Papers Historically Results in: Accept/Reject Currently Results in: p-Value Results in: Approx. Mean Confidence Intervals A confidence interval is used to express the uncertainty in a quantity being estimated. There is uncertainty because inferences are based on a random sample of finite size from a population or process of interest. To judge the statistical procedure we can ask what would happen if we were to repeat the same study, over and over, getting different data (and thus different confidence intervals) each time. Confidence Intervals If we know the true population mean and sample n individuals, we know that if the data is normally distributed, Average mean of these n samples has a 95% chance of falling into the interval Confidence Intervals where the standard error for a 95% CI may be calculated as follows; Example 1 Example 1 Does FF-PD-G have more of the popular vote than FG-L ? In a random sample of 721 respondents : 382 FF-PD-G 339 FG-L Can we conclude that FF-PD-G has more than 50% of the popular vote ? Example 1 - Solution Sample proportion = p = 382/721 = 0.53 Sample size = n = 721 Standard Error = (SqRt((p(1-p)/n))) = 0.02 95% Confidence Interval 0.53 +/- 1.96 (0.02) 0.53 +/- 0.04 [0.49, 0.57] Thus, we cannot conclude that FF-PD-G had more of the popular vote, since this interval spans 50%. So, we say: "the data are consistent with the hypothesis that there is no difference" Example 2 Example 2 Did Obama have more of the popular vote than McCain ? In a random sample of 1000 respondents 532 Obama 468 McCain Can we conclude that Obama had more than 50% of the popular vote ? Example 2 – 95% CI Sample proportion = p = 532/1000 = 0.532 Sample size = n = 1000 Standard Error = (SqRt((p(1-p)/n))) = 0.016 95% Confidence Interval 0.532 +/- 1.96 (0.016) 0.532 +/- 0.03136 [0.5006, 0.56336] Thus, we can conclude that Obama had more of the popular vote, since this interval does not span 50%. So, we say : "the data are consistent with the hypothesis that there is a difference in a 95% CI" Example 2 – 99% CI Sample proportion = p = 532/1000 = 0.532 Sample size = n = 1000 Standard Error = (SqRt((p(1-p)/n))) = 0.016 99% Confidence Interval 0.532 +/- 2.58 (0.016) 0.532 +/- 0.041 [0.491, 0.573] Thus, we cannot conclude that Obama had more of the popular vote, since this interval does span 50%. So, we say : "the data are consistent with the hypothesis that there is no difference in a 99% CI" Example 2 – 99.99% CI Sample proportion = p = 532/1000 = 0.532 Sample size = n = 1000 Standard Error = (SqRt((p(1-p)/n))) = 0.016 99.99% Confidence Interval 0.532 +/- 3.87 (0.016) 0.532 +/- 0.06 [0.472, 0.592] Thus, we cannot conclude that Obama had more of the popular vote, since this interval does span 50%. So, we say : "the data are consistent with the hypothesis that there is no difference in a 99.99% CI" T-Tests One Tail / Two Tail T-test Z-test T-Tests powerful parametric test for calculating the significance of a small sample mean necessary for small samples because their distributions are not normal one first has to calculate the "degrees of freedom" T-Tests The t-test is often called the Student's t-test. It was created by a chief brewer named William S. Gossett who worked for the Guinness Brewery. He discovered this statistic as part of his work in the brewery to compare the different brewing processes for changing raw materials into beer. Guinness did not allow its employees to publish results but the management decided to allow Gossett to publish it under a pseudonym - Student. Hence we have the Student's t-test. T-Test ~ THE GOLDEN RULE ~ Use the t-Test when your sample size is less than 30 T-Tests If the underlying population is normal If the underlying population is not skewed and reasonable to normal (n < 15) If the underlying population is skewed and there are no major outliers (n > 15) If the underlying population is skewed and some outliers (n > 24) T-Tests Form of Confidence Interval with tValue Mean +/- tValue * SE -------------as before as before Two Sample T-Test: Unpaired Sample Consider a questionnaire on computer use to final year undergraduates in year 2007 and the same questionnaire give to undergraduates in 2008. As there is no direct one-to-one correspondence between individual students (in fact, there may be different number of students in different classes), you have to sum up all the responses of a given year, obtain an average from that, down the same for the following year, and compare averages. Two Sample T-Test: Paired Sample If you are doing a questionnaire that is testing the BEFORE/AFTER effect of parameter on the same population, then we can individually calculate differences between each sample and then average the differences. The paired test is a much strong (more powerful) statistical test. Choosing the right test Choosing a statistical test http://www.graphpad.com/www/Book/Choose.htm Choosing a statistical test http://www.graphpad.com/www/Book/Choose.htm