DATA ANALYSIS & BASIC STATISTICS
XIAO WU
XIAO.WU@YALE.EDU

PURPOSE OF THIS WORKSHOP
• Statistics as a useful tool to analyze results
• Basic terminology and most commonly used tests
• Exposure to more advanced statistical tools

WHY DO WE NEED STATISTICS?
• Summary
• Classification
• Interpretation
• Pattern searching
• Abnormality identification
• Prediction
• Interpolation
• Extrapolation

SUMMARY
• Mean, median, mode
• Variance, standard deviation
• Max, min values and range
• Quartiles
http://www.mymarketresearchmethods.com/descriptive-inferential-statistics-difference/

EXAMPLE
Firm A
• Mean: $5,800
Firm B
• Mean: $5,000

EXAMPLE
Firm A
• Mean: $5,800
• Median: $4,000
• SD: $7,270
• 3rd Quartile: $4,000
• 1st Quartile: $500
Firm B
• Mean: $5,000
• Median: $5,000
• SD: $203
• 3rd Quartile: $5,175
• 1st Quartile: $4,825

EXAMPLE
Firm A                Firm B
#    Salary ($)       #    Salary ($)
1    20000            1    4650
2    4000             2    4700
3    4000             3    4750
4    500              4    4800
5    500              5    4850
                      6    4900
                      7    4950
                      8    5000
                      9    5050
                      10   5100
                      11   5150
                      12   5200
                      13   5250
                      14   5300
                      15   5350

CLASSIFICATION
Identification of variable
• Independent vs. dependent
• Numeric vs. categorical
Variable
• Categorical
  • Nominal
  • Ordinal
• Numeric
  • Continuous
  • Discrete

PATTERN SEARCHING
• Distribution of data
• Some commonly used distributions
  • Uniform
  • Binomial
  • Poisson
  • …
• Central limit theorem
http://www.mathwave.com/img/art/graphs_pdf2.gif

UNIFORM
• Every outcome has an equal chance
• Example:
  • Flipping a coin
  • Rolling a die
• What if you need to flip multiple times?
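
The question of what happens when a coin is flipped multiple times can be explored with a short calculation. The sketch below is illustrative only: the function name `heads_distribution` and the choice of 10 flips are assumptions, not part of the workshop material. It computes the exact probability of each possible number of heads. A single fair flip is uniform over its two outcomes, while the head count over many flips is not.

```python
from math import comb


def heads_distribution(n_flips, p_heads=0.5):
    """Return [P(0 heads), ..., P(n_flips heads)] for independent flips."""
    return [
        comb(n_flips, k) * p_heads**k * (1 - p_heads) ** (n_flips - k)
        for k in range(n_flips + 1)
    ]


# One fair flip: heads and tails are equally likely (uniform).
one_flip = heads_distribution(1)

# Ten fair flips: counts near the middle are far more likely than the extremes.
ten_flips = heads_distribution(10)

for k, prob in enumerate(ten_flips):
    print(f"{k:2d} heads: {prob:.4f}")
```

Each individual sequence of flips is still equally likely. The count of heads is not uniform because many more sequences produce 5 heads than produce 0 or 10.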
BINOMIAL
• Two outcomes, with probabilities p and 1 - p
• Multiple trials: n
• Example:
  • Flipping a coin 100 times
  • Germination of multiple seeds
https://onlinecourses.science.psu.edu/stat414/sites/onlinecourses.science.psu.edu.stat414/files/lesson09/graph_n15_p02.gif

POISSON
• Counts of rare, independent events
• Each with a small probability, or average rate, p
• Example: radioactive decay
http://kaffee.50webs.com/Science/images/alpha_decay.gif

THE MOST IMPORTANT DISTRIBUTION

NORMAL DISTRIBUTION
• Central limit theorem
  • For almost any distribution (with finite variance), the mean of a sample converges to a normal distribution as the sample size grows
  • Large sample size → approximately normal distribution
Parameters:
• mean
• standard deviation
https://www.mathsisfun.com/data/images/normal-distrubution-large.gif

PATTERN SEARCHING
Hypothesis testing
• Difference between two populations
  • Z-test or t-test?
  • What does the p-value mean?
  • Family-wise error – Bonferroni correction
• More than two possibilities
  • Chi-square test
  • Fisher’s exact test
• More than two variables
  • ANOVA

EXAMPLE 1
SAT score is related to gender
• Null hypothesis
• Alternative hypothesis (3 possibilities)
• One- or two-tailed?
• Z-test or t-test?
• p = 0.07, conclusion?

EXAMPLE 2
Predictors of stroke
• Age
• Hypertension
• Gender
• …

EXAMPLE 3
Genome-wide association studies
• Scanning markers across the DNA of many people to find genetic variations associated with certain diseases

PATTERN SEARCHING
Hypothesis testing
• One variable
  • Z-test or t-test?
  • What does the p-value mean?
  • Family-wise error – Bonferroni correction
• Compare two categorical variables
  • Chi-square test
  • Fisher’s exact test
• More than two variables
  • ANOVA

CHI SQUARE
Punnett Square
• A cross between two pea plants yields 880 plants: 639 green and 241 yellow
• Hypothesis: The green allele is dominant and both parents are heterozygous.
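
Under the hypothesis just stated, a heterozygous cross predicts a 3:1 ratio of green to yellow plants. The following is a minimal sketch of the resulting chi-square goodness-of-fit calculation, using only the counts given above. The variable names are illustrative, and the p-value shortcut via `erfc` is an assumption that holds only for one degree of freedom.

```python
from math import erfc, sqrt

# Counts from the pea-plant cross described above.
observed = {"green": 639, "yellow": 241}

# A dominant green allele with two heterozygous parents predicts 3:1.
expected_ratio = {"green": 0.75, "yellow": 0.25}

total = sum(observed.values())
expected = {color: total * ratio for color, ratio in expected_ratio.items()}

# Chi-square statistic: sum of (observed - expected)^2 / expected.
chi_square = sum(
    (observed[color] - expected[color]) ** 2 / expected[color]
    for color in observed
)

# Degrees of freedom: number of categories minus one.
dof = len(observed) - 1

# With exactly 1 degree of freedom, P(X >= x) equals erfc(sqrt(x / 2)).
p_value = erfc(sqrt(chi_square / 2))

print(f"expected counts: {expected}")
print(f"chi-square = {chi_square:.3f}, dof = {dof}, p = {p_value:.3f}")
```

The statistic comes out near 2.67, below the usual 5% critical value of 3.84 for one degree of freedom, so these counts would not lead to rejecting the 3:1 hypothesis at that level.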
http://www2.lv.psu.edu/jxm57/irp/chisquar.html

CHI SQUARE
       G             g
G      GG (green)    Gg (green)
g      Gg (green)    gg (yellow)
• 75% green
• 25% yellow

CHI SQUARE
                           Green    Yellow
Observed (o)               639      241
Expected (e)               660      220
Deviation (d = o - e)      -21      21
Deviation squared (d^2)    441      441
d^2/e                      0.668    2.005
Sum                        2.673
Degrees of freedom: number of categories - 1 = 1

PREDICTION
• Regression
  • Linear regression
  • Multiple linear regression
• Accuracy vs. simplicity
• Validation
  • leave-k-out
http://2.bp.blogspot.com/-W7Ptp8uB02U/T8UAGm4Uw5I/AAAAAAAAC08/DcHCtLWXvU/s1600/actnactn+1.png

EXAMPLE
• Use brain structural measurements to predict a subject’s performance on a picture vocabulary test
• 144 total structural measurements
• 521 subjects
• First step: eliminate unnecessary variables
  • All zeros?
  • Highly correlated pairs
  • Variables that do not correlate well with performance score

EXAMPLE
• Run regression
• Validation: leave 1 out and leave 10 out
• Principal component analysis
• …

PREDICTION
More complicated models:
• Bayesian approach
  • Use prior knowledge to update prediction
• Diffusion weights
  • Use local structure to predict neighboring values

STATISTICAL TOOLS
• Excel
• MATLAB
• R
• Minitab
• …

QUESTIONS?

MY OWN RESEARCH
• Cost-effectiveness analysis
• Mathematical modeling in medicine
• Simulate iterations rather than actual patients

RECENT RESULTS

RESULTS

GROUP EXERCISE
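
Returning to the PREDICTION slides, the idea of validating a regression by leaving observations out can be sketched in a few lines. Everything here is illustrative: the data are made up (not the brain-measurement study), the helper names `fit_line` and `leave_k_out_mse` are hypothetical, and the sketch holds out consecutive blocks of k points rather than every possible subset of size k.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def leave_k_out_mse(xs, ys, k):
    """Mean squared prediction error over consecutive held-out blocks of size k."""
    squared_errors = []
    for start in range(0, len(xs), k):
        held_out = range(start, min(start + k, len(xs)))
        # Fit only on the points that are NOT held out.
        train_x = [x for i, x in enumerate(xs) if i not in held_out]
        train_y = [y for i, y in enumerate(ys) if i not in held_out]
        intercept, slope = fit_line(train_x, train_y)
        # Score the fit on the held-out points it has never seen.
        for i in held_out:
            squared_errors.append((ys[i] - (intercept + slope * xs[i])) ** 2)
    return sum(squared_errors) / len(squared_errors)


# Hypothetical data: roughly y = 2x + 1 with small deviations.
xs = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
ys = [1.1, 2.9, 5.2, 6.8, 9.1, 10.9, 13.2, 14.8, 17.1, 18.9]

print("leave-1-out MSE:", leave_k_out_mse(xs, ys, 1))
print("leave-2-out MSE:", leave_k_out_mse(xs, ys, 2))
```

Because each error is measured on points excluded from the fit, a model that merely memorizes its training data scores poorly, which is the accuracy vs. simplicity trade-off in practice. For data stored in a meaningful order, shuffling before forming the blocks would likely be a safer choice.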