Southern California Bioinformatics Summer Institute
Richard Johnston
Pasadena City College
UCLA School of Medicine
rmjohnston@mac.com
© Richard Johnston

Introduction
Data
Displaying Data
Descriptive Statistics
Inferential Statistics
Q&A
Wrap-up

Introduce commonly used statistical concepts and measures.
Show how to compute statistical measures using Excel and R.
Minimize discussion of statistical theory as much as possible.
Provide sample calculations in the downloaded materials.

Could I have gotten these results by chance?
Is there a significant difference between these two samples?
How can I describe my results?
What can I say about the average of these measurements?
What should I do with these outliers?
... etc.

Excel 2007
R

Mostly, the look and feel have changed.
Up to 1,048,576 rows and 16,384 columns in a single worksheet.
Up to 32,767 characters in a single cell.
More sorting options.
Enhanced data importing.
Improved PivotTables.

More options for conditional formatting of cells.
Multithreaded calculation of formulas, to speed up large calculations, especially on multi-core/multi-processor systems.
Improved filtering.
New charting features.
Other changes to make Excel and other Office programs more Vista-like.

R is a language and environment for statistical computing and graphics, widely used in research institutions.
R provides a wide variety of statistical and graphical techniques, and is highly extensible.
One of R's strengths is the ease with which publication-quality plots can be produced.
R is available as free software and runs on Windows, MacOS, and a wide variety of UNIX platforms.

Data can be Primary (collected by you) or Secondary (obtained from another source); Observational or Experimental; Quantitative or Qualitative.
Quantitative data uses numerical values to describe something (e.g., weight, temperature).
Qualitative data uses descriptive terms to classify something (e.g., gender, color).

Nominal (Qualitative): examples are gender, color, species name, ...
Ordinal (Qualitative or quantitative): allows rank ordering of values. Examples: grades A-F; rating levels 1 through 5; "Slow", "Medium", "Fast".

Interval (Quantitative): allows addition and subtraction, but not multiplication and division. There is no real zero point. Example: temperature measured in degrees Fahrenheit. 100 degrees F is 50 more than 50 degrees F, but 100 degrees F is not twice as hot as 50 degrees F.

Ratio (Quantitative): allows addition, subtraction, multiplication, and division. There is a true zero point: a value of zero means the absence of the measured quantity. Examples: weight, age, or speed. To decide whether a measurement is Interval or Ratio, see if the phrase "Twice as ..." makes sense, e.g., twice as (heavy, old, fast).

Charts: Pie, Column, Line, Scatter, Histograms.
Tip: PivotCharts can be used to quickly generate a wide variety of charts. (An R sketch producing similar charts appears after the Excel examples below.)

Open AspirinStudyData.xlsx or .xls in Excel and select the "Pie" tab.
Open AspirinStudyData.xlsx or .xls in Excel and select the "Bar" tab.
Open AspirinStudyData.xlsx or .xls in Excel and select the "XY Line" tab.
Open AspirinStudyData.xlsx or .xls in Excel and select the "Scatter" tab.
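The same kinds of displays can be produced in R. A minimal sketch, assuming the study data are available as AspirinStudyData.csv (the file used in the R examples later in this tutorial) with the Subject, Age, and Group columns described in the study field list; the particular plots shown are illustrative and not part of the workbook exercise:

# Read the study data (the CSV version used in the R examples later on)
aspirin <- read.csv("AspirinStudyData.csv", header = TRUE)

# Pie and bar charts of a qualitative (nominal) variable
counts <- table(aspirin$Group)
pie(counts, main = "Subjects by Group")
barplot(counts, main = "Subjects by Group", ylab = "Count")

# Scatter plot of a quantitative variable against subject number
plot(aspirin$Subject, aspirin$Age, xlab = "Subject", ylab = "Age")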
Open AspirinStudyData.xlsx or .xls.
Create bin values 20, 25, ..., 100.
Select Data Analysis.
Select Histogram.
Click Input Range and select the Age data.
Click on Bin Values and select the bin values.
Check Labels and Chart Output.
Click OK.

Start R.
Select File > Change dir... and browse to the default directory.
Type the following (including capital letters):
> aspirin=read.csv("AspirinStudyData.csv",header=T)
> attach(aspirin)
> hist(Age)

Measures of Central Tendency: Mean, Median, Mode.
Measures of Dispersion: Range, Variance, Standard Deviation, Quartiles, Interquartile Range.

Use built-in functions to perform basic analyses: Formulas | More Functions | Statistical.
Use the Data Analysis Add-in for more complex analyses:
Open AspirinStudyData.xlsx.
Select Data | Data Analysis.
Select Descriptive Statistics.
Select the Age data.
Check Labels in first row.
Check Summary Statistics.

In R, type the following:
> summary(Age)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  34.00   54.00   59.00   59.09   65.00   82.00
> sd(Age)
[1] 8.173196
> var(Age)
[1] 66.80114
...(etc.)

Mean: the arithmetic average of the values.
Median: the midpoint of the values (half are higher and half are lower). If there is an even number of values, the median is the average of the two middle values.
Mode: the value that occurs most frequently. There may be more than one mode.

Range: gives an idea of the spread of values, but depends only on two of them, the largest and the smallest.
Variance: averages the squared deviations of each value from the mean.
Standard Deviation: calculated by taking the square root of the variance. It is more useful than the variance since it is in the same measurement units as the data.

Quartiles divide the data into four equal segments.
In Excel:
Q1: 54  =QUARTILE(Age,1)
Q2: 59  =QUARTILE(Age,2)
Q3: 65  =QUARTILE(Age,3)
Q4: 82  =QUARTILE(Age,4)
In R:
> quantile(Age)
  0%  25%  50%  75% 100%
  34   54   59   65   82
Approximately 25% of the values are less than Q1, 50% are less than Q2, 75% are less than Q3, and 100% are less than Q4.

Interquartile Range (IQR): measures the spread of the center 50% of the data. IQR = Q3 - Q1. It is used to help identify outliers (more on this later).
General "rule": consider discarding values less than Q1 - 1.5 × IQR or greater than Q3 + 1.5 × IQR (an R sketch of this rule follows the boxplot discussion below).

Predicting the distribution of values: the empirical rule for "bell-shaped" curves says that approximately 68% of the values will fall within 1 SD of the mean, 95% will fall within 2 SD of the mean, and 99.7% will fall within 3 SD of the mean.

From the example:
±1 SD: 71.4%
±2 SD: 95.4%
±3 SD: 99.4%

Outliers are values that are (or seem to be) out of line with the rest of the observations.
Outliers can distort statistical measures.
They may be indicative of transient errors in equipment or errors in transcription.
They may also indicate a flaw in experimental assumptions.
As we've just shown, 3 observations out of 1000 can be expected to be over 3 SD from the mean.
Outliers that can't be readily explained should receive careful attention.

Quartiles and boxplots can help identify outliers. In R, type boxplot(Age). The middle box is the IQR, the horizontal line is the median, and the whiskers extend to 1.5 × IQR. In this example, four values are outliers.
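The 1.5 × IQR rule can be applied directly in R. A minimal sketch, assuming the Age column from the aspirin study has been attached as in the histogram example above:

# Quartiles and interquartile range of Age
q1 <- quantile(Age, 0.25)
q3 <- quantile(Age, 0.75)
iqr <- q3 - q1                        # same value as IQR(Age)

# Fences used by the 1.5 x IQR rule
lower <- q1 - 1.5 * iqr
upper <- q3 + 1.5 * iqr

# Values flagged as potential outliers (the points drawn beyond the whiskers)
Age[Age < lower | Age > upper]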
Age histogram and a normal curve with the same mean and standard deviation. In R:
> h=hist(Age,plot=F)
> plot(h)
> s=sd(Age)
> m=mean(Age)
> ylim=range(0,h$density,dnorm(0,sd=s))
> hist(Age,freq=F,ylim=ylim)
> curve(dnorm(x,m,s),add=T)
[Figure: histogram of Age (30-80) with an overlaid normal density curve; the vertical axis is Density.]

The mean, median, and mode are the same value.
The distribution is bell shaped and symmetrical around the mean.
The total area under the curve is equal to 1.
The left and right sides extend indefinitely.
x = the normally distributed random variable of interest
μ = the mean of the normal distribution
σ = the standard deviation of the normal distribution
z = the number of standard deviations between x and μ, otherwise known as the standard z-score

z is calculated using the formula z = (x - μ)/σ.
For the Age data, μ = 59.09 and σ = 8.17.
For x = 82 (the oldest subject): z = (82 - 59.09)/8.17 = 2.80.
The 82-year-old is 2.80 SD away from the mean of the population.

The standard normal distribution is a normal distribution with μ = 0 and σ = 1.
The total area under the standard normal curve is equal to 1.

The shaded area represents the probability that x is within 1 SD of the mean; in Excel, 68% = NORMDIST(1,0,1,1) - NORMDIST(-1,0,1,1).
[Figure: standard normal curve with the area within ±1 SD of the mean shaded.]
The z-score for a 5% probability that x is less than z is about -1.64; in Excel, -1.64 = NORMSINV(.05).
[Figure: standard normal curve with the lower 5% tail shaded.]
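The same areas can be computed in R with pnorm and qnorm, the counterparts of Excel's NORMDIST and NORMSINV. A minimal sketch (values are approximate):

# Probability that a standard normal value falls within 1 SD of the mean
pnorm(1) - pnorm(-1)        # about 0.683

# z-score below which 5% of the standard normal distribution lies
qnorm(0.05)                 # about -1.645

# z-score of the oldest subject in the Age data (mean 59.09, SD 8.17)
(82 - 59.09) / 8.17         # about 2.80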
Sampling
Sampling Distributions
Confidence Intervals
Hypothesis Testing

The term "population" in statistics represents all possible outcomes or measurements of interest in a particular study.
A "sample" is a subset of the population that is representative of the whole population.
Analysis of a sample allows us to infer characteristics of the entire population with a quantifiable degree of certainty.

In the 1980s Harvard did a study of the effectiveness of aspirin in the prevention of heart attacks. They followed over 22,000 physicians for five years. Half of the physicians were given a daily dose of aspirin, and half were given a placebo. Neither the subjects nor the investigators knew which was being administered. (More on this later.)

A coin is flipped 20 times in each of 20 trials to determine whether it is "fair".

Seeds are divided randomly into two groups and planted. One group receives fertilizer A and the other group receives fertilizer B. All other factors (light, water, etc.) are kept the same.

Patients at 6 US hospitals were randomly assigned to 1 of 3 groups: 604 received intercessory prayer after being informed that they may or may not receive prayer; 597 did not receive intercessory prayer, also after being informed that they may or may not receive prayer; and 601 received intercessory prayer after being informed they would receive prayer. Intercessory prayer was provided for 14 days, starting the night before coronary artery bypass graft surgery (CABG). The primary outcome was the presence of any complication within 30 days of CABG. Secondary outcomes were any major event and mortality. (American Heart Journal, 2006)

Several factors contribute to the determination of the sample size needed for a particular study:
Desired confidence level (95%, 99%)
Margin of error (5%, 3%)
Population size (results don't change much for populations of 20,000 or more)
Expected proportion (p = q = .5 is conservative)
For example, a 99% confidence level with a margin of error of 6% would require about 450 samples.
Formulas for sample size vary and are not presented here.
Several online tools are available for determining sample sizes (e.g., http://www.raosoft.com/samplesize.html).

Your company has just completed a five-year study on the effectiveness of aspirin in preventing heart attacks. Five hundred physician volunteers were divided randomly into two groups. One group received 325 mg of aspirin every other day. The other group received a placebo instead of aspirin. Neither the subjects nor the test administrators knew whether aspirin or placebo was being administered. The subjects were monitored for five years to determine whether or not they experienced a heart attack.

The results of the study are provided in tabular format. The table contains the following information:

Field Name    Contents
Subject       Subject identification number
Age           Age of the subject at the start of the experiment
Sex           Sex of the subject
Group         Group the subject was assigned to (placebo or aspirin)
Smoker        Smoker/Non-Smoker status
Attack        Attack: the subject had a heart attack during the study; No Attack: the subject did not have a heart attack during the study
AttackDate    Date the heart attack occurred
Ulcer         Ulcer: the subject developed an ulcer during the study; No Ulcer: the subject did not develop an ulcer during the study
Transfusion   Trans: the subject required a transfusion during the study; No Trans: the subject did not require a transfusion during the study

The capabilities of Excel can be used to summarize the data in various ways. One way to summarize the data is with a PivotTable report. Open AspirinStudyData.xlsx or .xls and click on the tab "Study Summary".

The data seem to indicate that aspirin helps to prevent heart attacks. Your task is to determine the statistical significance of the results.

If we perform an experiment such as flipping a coin, we can count the number of "successes" (i.e., heads) in a number of trials to get an estimate of the underlying probability that a flip of the coin will result in heads. The larger the number of trials, the more confident we are that the true probability is within a certain range, or confidence interval. The numbers of successes in repeated experiments such as this form the familiar normal (bell-shaped) curve. Assuming a normal distribution of results allows us to calculate statistical characteristics for a wide variety of experiments, including clinical trials.

The "true" chance of attack in the Placebo Group is referred to as p1. Similarly, the "true" chance of attack in the Aspirin Group is referred to as p2. Our objective is to estimate the true difference of p1 and p2 using the results of this study.
Note: the details of the computation are provided in the Aspirin Study PDF document and Excel workbook.
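Before walking through the computation, the basic estimates can be checked quickly in R. A minimal sketch, using the counts that appear in the study summary and in the pooled-estimate calculation later in this tutorial (55 attacks among the 250 placebo subjects and 31 among the 250 aspirin subjects):

# Estimated attack probabilities in each group
p1_hat <- 55 / 250     # placebo group, 0.220
p2_hat <- 31 / 250     # aspirin group, 0.124

# Estimated difference in attack probabilities
p1_hat - p2_hat        # 0.096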
We compute estimates of p1, p2 and the difference p1 - p2 using the information in Table 1 as follows:
Estimate of p1: p̂1 = 55/250 = .220
Estimate of p2: p̂2 = 31/250 = .124
Estimate of p1 - p2: p̂1 - p̂2 = .220 - .124 = .096
(Note: statisticians use the caret or "hat" to indicate that a value is an estimate of the true value for that measure.)

The computation of a "confidence interval" allows us to specify the probability that any given confidence interval from a random sample will contain the true population mean. Typically, a 95% confidence interval is used. The formula for computing the confidence interval for the true difference is:

p1 - p2 = (p̂1 - p̂2) ± z_α/2 · SE(p̂1 - p̂2)

where
p1 - p2 is the true difference,
p̂1 - p̂2 is the observed difference,
z_α/2 is the critical value for 95% confidence (see diagram), and
SE(p̂1 - p̂2) is the standard error of p̂1 - p̂2 (the standard deviation of the sample proportion).

The 95% confidence interval for this study is

p1 - p2 = .096 ± (1.96)(.033) = .096 ± .066

In other words, we are 95% confident that the true difference in the heart attack rates is between .030 and .162. Since the lower number is still positive, we are 95% confident that aspirin has a beneficial effect in preventing heart attacks.

(See documentation for details of the computation.)

A confidence interval is a range of values used to estimate a population parameter such as the mean. The confidence level is the probability that the interval estimate will include the mean. Increasing the confidence level makes the interval wider (less precise). Increasing the sample size reduces the width of the interval (more precise).

We now address the question: if aspirin had no effect, what is the probability that the observed results occurred by chance?

In order to answer this question, we formulate two hypotheses, H0 and Ha. H0 is called the "Null Hypothesis", and Ha is called the "Alternate Hypothesis". For this study, H0 and Ha can be stated as follows:
Null hypothesis H0: aspirin has no effect, and p1 = p2.
Alternate hypothesis Ha: aspirin does reduce heart attacks, and p1 > p2.

Under the Null Hypothesis, the probability of attack with or without aspirin therapy is the same, and the observed difference is due to random chance. Since we are assuming p1 = p2, we can pool the results to get an estimate of the probability of an attack under the Null Hypothesis:

p̂ = (x1 + x2)/(n1 + n2) = (55 + 31)/(250 + 250) = .172

The Standard Error under the Null Hypothesis is given by

SE_0(p̂1 - p̂2) = sqrt( p̂(1 - p̂)(1/n1 + 1/n2) ) = sqrt( .172(1 - .172)(1/250 + 1/250) ) = .0338

We can now compute the test statistic z for the observed results under the Null Hypothesis:

zOBS = (p̂1 - p̂2) / SE_0(p̂1 - p̂2) = (.220 - .124)/.0338 = 2.840

The value for zOBS is almost three standard deviations from zero, indicating that it is highly unlikely we would get the observed results from random chance.

In order to determine the probability of getting this result under the Null Hypothesis, we compute the "p-value". The p-value in this case is the probability that the test statistic z is greater than or equal to zOBS:

p-value = Pr(z ≥ zOBS) = Pr(z ≥ 2.840)
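This tail probability can be reproduced in R with pnorm; the Excel and prop.test versions follow on the next slides. A minimal sketch, recomputing the test statistic from the study counts:

# Observed proportions and the pooled estimate under the null hypothesis
p1_hat <- 55 / 250
p2_hat <- 31 / 250
p_pool <- (55 + 31) / (250 + 250)

# Standard error under the null hypothesis and the test statistic
se0   <- sqrt(p_pool * (1 - p_pool) * (1/250 + 1/250))
z_obs <- (p1_hat - p2_hat) / se0    # about 2.84

# Upper-tail p-value
1 - pnorm(z_obs)                    # about 0.002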
We can use a built-in Excel function to calculate the p-value. In this case we use the function NORMSDIST, which returns the area under the bell curve from minus infinity up to the specified z value. Since we are interested in the area under the right-hand tail of the bell curve, we use the following calculation in Excel:

p-value = 1 - NORMSDIST(zOBS) = 1 - NORMSDIST(2.840) = .002

In R:
> attack=c(31,55)
> total=c(250,250)
> prop.test(attack,total,alternative="less",correct=F)

        2-sample test for equality of proportions without continuity correction

data:  attack out of total
X-squared = 8.089, df = 1, p-value = 0.002227
alternative hypothesis: less
sample estimates:
prop 1 prop 2
 0.124  0.220

[Figure: illustration of the p-value for zOBS = 2.840; the area to the left of 2.840 is 99.8% and the upper tail beyond it is 0.2%.]

Conclusion (finally!): given this p-value, we can reject the Null Hypothesis H0 with a 99.8% level of confidence, and accept the Alternate Hypothesis Ha that the probability of heart attack is less when aspirin is taken regularly.

There are three versions of the Alternate Hypothesis Ha:
Two-sided: Ha: p1 ≠ p2, with p-value = Pr(|z| ≥ |zOBS|)
Right: Ha: p1 > p2, with p-value = Pr(z ≥ zOBS)
Left: Ha: p1 < p2, with p-value = Pr(z ≤ zOBS)
This tutorial used the "Right" version, since we were interested in the right-hand tail of the bell curve. The determination of the p-value is slightly different in each case.

Interval or Ratio data. [Figure: frequency histograms of Sample A and Sample B by bin.]

Use one of Excel's t-Tests in the Data Analysis Add-in. In this case, Two Samples with Unequal Variances.

t-Test: Two-Sample Assuming Unequal Variances

                               Sample A Length   Sample B Length
Mean                           31.81100303       59.25031625
Variance                       245.1876265       408.7490104
Observations                   50                50
Hypothesized Mean Difference   0
df                             92
t Stat                         -7.587355003
P(T<=t) one-tail               1.28745E-11
t Critical one-tail            1.661585397
P(T<=t) two-tail               2.57491E-11
t Critical two-tail            1.986086272

Since the p-value is less than .05, we reject the null hypothesis.

In R:
> t.test(Sample_A,Sample_B)

        Welch Two Sample t-test

data:  Sample_A and Sample_B
t = -7.5874, df = 92.23, p-value = 2.544e-11
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -34.62166 -20.25696
sample estimates:
mean of x mean of y
 31.81100  59.25032

The Chi-Square test allows hypothesis testing of nominal and ordinal data. It is used to test whether a frequency distribution fits a predicted distribution.
Hypotheses:
H0: The actual distribution can be described by the expected distribution.
Ha: The actual distribution differs from the expected distribution.

Suppose the expected distribution of colors of flowers is:

Color    Expected percentage
White    40%
Yellow   30%
Orange   20%
Blue     5%
Purple   5%
Total    100%

The observed distribution of an experimental sample is:

Color    Number
White    145
Yellow   128
Orange   73
Blue     32
Purple   22
Total    400

Can we conclude that the expected distribution is "true" based on the observations?

The Chi-Square statistic is calculated from

X² = Σ (O - E)² / E

where O = the number observed in each category and E = the number expected in each category. For this example, X² = 9.95.
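The same statistic can be computed in R, either by hand or with chisq.test. A minimal sketch using the observed counts and expected percentages above:

# Observed counts and expected proportions for the five flower colors
observed <- c(145, 128, 73, 32, 22)
expected_prop <- c(0.40, 0.30, 0.20, 0.05, 0.05)
expected <- sum(observed) * expected_prop     # 160 120 80 20 20

# Chi-square statistic computed by hand
sum((observed - expected)^2 / expected)       # about 9.95

# The same test in one call; it also reports df = 4 and a p-value of about 0.04
chisq.test(observed, p = expected_prop)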
The critical Chi-Square score Xc² depends on the number of degrees of freedom. In this case d.f. = k - 1, where k is the number of categories; here k = 5, so d.f. = 4.

In Excel, we can use the CHIINV function to get the critical chi-square score: CHIINV(probability, deg_freedom). For α = 0.10 and d.f. = 4, CHIINV(0.1,4) = 7.77944. Since X² = 9.95 is greater than the critical chi-square value, we conclude that the observed distribution differs from the expected distribution.

In Excel, we can also use the CHITEST function to calculate the probability of the observed chi-square score: CHITEST(actual_range, expected_range). CHITEST returns the probability that a value of the χ² statistic at least as high as the value calculated could have happened by chance. In this case, CHITEST(actual_range, expected_range) = .041. This means that the probability of the observed results is less than the 10% probability we chose for α, so we conclude that the observed distribution differs from the expected distribution.

Use Excel's Help resources to explore the various types of tests and statistics.
A list of useful books and websites is provided with the handouts.
Most importantly: have a statistician look at your results before publishing them.

1. Excellent electronic statistics textbook: http://www.statsoft.com/textbook/stathome.html
2. UCLA's Statistics Advisory site: http://www.ats.ucla.edu/stat/
3. Choosing the correct statistic: http://bama.ua.edu/~jleeper/627/choosestat.html
4. Handy R reference: http://www.math.ilstu.edu/dhkim/Rstuff/Rtutor.html
5. Discussions of statistical tests with examples: http://www.ats.ucla.edu/stat/stata/whatstat/whatstat.htm#hsb
6. List of sites for learning and using R: http://www.ats.ucla.edu/stat/R/
7. Wikipedia has informative discussions of topics in statistics, with links to primary references.

Material and information from the following references were used in this presentation:
1. Introductory Statistics with R by Peter Dalgaard
2. The Complete Idiot's Guide to Statistics by Robert A. Donnelly Jr.
3. Cartoon Guide to Statistics by Larry Gonick and Woollcott Smith

SoCal BSI
Dr. Momand
Dr. Johnston
SoCal BSI Core Instructors
Ronnie Cheng
All of you
No statisticians were harmed during the making of this presentation.