Statistics: Data Presentation & Analysis
Fr Clinic I

Overview
• Tables & Graphs
• Populations & Samples
• Mean, Median, & Variance
• Error Bars
  – Standard Deviation, Standard Error & 95% Confidence Interval (CI)
• Comparing Means of Two Populations
• Linear Regression (LR)

Warning
• Statistics is a huge field; I’ve simplified considerably here. For example:
  – Mean, Median, and Standard Deviation
    • There are alternative formulas
  – 95% Confidence Interval
    • There are other ways to calculate CIs (e.g., z statistic instead of t; difference between two means rather than a single mean…)
  – Error Bars
    • Don’t go beyond the interpretations I give here!
  – Comparing Means of Two Data Sets
    • We cover only the t test for two means when the variances are unknown but equal; there are other tests
  – Linear Regression
    • We look only at simple LR and calculate only the intercept, slope, and R². There is much more to LR!

Tables
Table 1: Average Turbidity and Color of Water Treated by Portable Water Filters

  (1)          (2)         (3)          (4)
  Water        Turbidity   True Color   Apparent Color
               (NTU)       (Pt-Co)      (Pt-Co)
  Pond Water   10          13           30
  Sweetwater   4           5            12
  Hiker        3           8            11

• Consistent Format, Title, Units, Big Fonts
• Differentiate Headings, Number Columns

Figures
[Figure 1: bar chart; y-axis: Turbidity (NTU); x-axis: Filter (Pond Water, Sweetwater, Miniworks, Hiker, Pioneer, Voyager)]
Figure 1: Turbidity of Pond Water, Treated and Untreated
• Consistent Format, Title, Units
• Good Axis Titles, Big Fonts

Populations and Samples
• Population
  – All possible outcomes of experiment or observation
    • US population
    • Particular type of steel beam
• Sample
  – Finite number of outcomes measured or observations made
    • 1000 US citizens
    • 5 beams
• Use samples to estimate population properties
  – Mean, Variance
    • E.g., Height of 1000 US citizens used to estimate mean of US population

Central Tendency
• Mean and Median
Data: 1 3 3 6 8 10
Mean = $\bar{x}$ (xbar) = Sum of values divided by sample size = (1+3+3+6+8+10)/6 = 5.2 NTU
Median = m = Middle number
Rank    1  2  3  4  5  6
Number  1  3  3  6  8  10
For even number of sample points, average middle two = (3+6)/2 = 4.5
Excel: Mean – AVERAGE; Median – MEDIAN

Variability
• Variance, s²
  – Sum of the squared deviations about the mean divided by the degrees of freedom
  – $s^2 = \sum_{i=1}^{n} (x_i - \bar{x})^2 / (n-1)$
  – Where $x_i$ = a data point, $\bar{x}$ (xbar) = the sample mean, and n = number of data points
• Example (cont.)
  – $s^2 = [(1-5.2)^2 + (3-5.2)^2 + (3-5.2)^2 + (6-5.2)^2 + (8-5.2)^2 + (10-5.2)^2]/(6-1) = 11.8$ NTU²
Excel: Variance – VAR

Error Bars
• Show data variability on plot of mean values
• Types of error bars include:
  – Max/min, ± Standard Deviation, ± Standard Error, ± 95% CI
[Figure: mean turbidity (NTU) by filter type (Filter 1, Filter 2, Filter 3), plotted with error bars]

Standard Deviation, s
• Square root of variance: $s = \sqrt{s^2}$
• If a phenomenon follows the Normal Distribution (bell curve), 95% of the population lies within 1.96 standard deviations of the mean
• Error bar is s above & below mean
Excel: standard deviation – STDEV
[Figure: Normal Distribution curve; x-axis: Standard Deviations from Mean; the central 95% lies between -1.96 and 1.96]

Standard Error of Mean, $s_{\bar{x}}$
• Also called St-Err or sxbar
• For sample of size n taken from population with standard deviation estimated as s:
  $s_{\bar{x}} = s / \sqrt{n}$
• As n ↑, sxbar ↓, i.e., estimate of population mean improves
• Error bar is St-Err above & below mean

95% Confidence Interval (CI) for Mean
• A 95% Confidence Interval is expected to contain the population mean 95% of the time (i.e., of the 95% CIs from 100 samples, about 95 are expected to contain the population mean)
  $\bar{x} \pm t_{95\%,n-1}\, s_{\bar{x}}$
• t95%,n-1 is a statistic for the 95% CI from a sample of size n
  – t95%,n-1 = TINV(0.05,n-1)
  – If n is greater than about 30, t95%,n-1 ≈ 1.96 (Normal Distribution)
• Error bar is $t_{95\%,n-1}\, s_{\bar{x}}$ above & below mean

Using Error Bars to compare data
• Standard Deviation
  – Demonstrates data variability, but no comparison possible
• Standard Error
  – If bars overlap, any difference in means is not statistically significant
  – If bars do not overlap, indicates nothing!
• 95% Confidence Interval
  – If bars overlap, indicates nothing!
  – If bars do not overlap, difference is statistically significant
• We’ll use 95% CI in this class
  – Any time you have 3 or more data points, determine mean, standard deviation, standard error, and t95%,n-1, then plot mean with error bars showing the 95% confidence interval

Adding Error Bars to an Excel Graph
• Create Graph
  – Column, scatter,…
• Select Data Series
• In Layout Tab-Analysis Group, select Error Bars
• Select More Error Bar Options
• Select Custom and Specify Values and select cells containing the t95%,n-1 × St-Err values

Example 1: 95% CI
Turbidity Data

             1       2       3       Mean    St Dev        St-Err            ± 95% CI
             (NTU)   (NTU)   (NTU)   (NTU)   (NTU)    n    (NTU)    t95%,2   (t95%,2 × St-Err)
  Filter 1   2.1     2.1     2.2     2.1     0.06     3    0.03     4.30     0.14
  Filter 2   3.2     4.4     5       4.2     0.92     3    0.53     4.30     2.28
  Filter 3   4.3     4.2     4.5     4.3     0.15     3    0.09     4.30     0.38

[Figure: bar chart of mean turbidity with 95% CI error bars; y-axis: Turbidity (NTU); x-axis: Portable Water Filter; Filter 1 = 2.1, Filter 2 = 4.2, Filter 3 = 4.3]

What can we do?
• Lift weight multiple times using different solar panel combinations (or hydroturbines, or gear boxes) and plot mean and 95% confidence interval error bars.
  – If error bars overlap between two different test conditions, indicates nothing!
  – If error bars do not overlap, difference is statistically significant

T Test
• A more sophisticated way to compare means
• Use t test to determine if means of two populations are different
• E.g., lift times with different solar panel combinations or turbines or…

Comparing Two Data Sets using the t test
• Example: You lift weight with two panels in series and two in parallel.
  – Series: Mean = 2 min, s = 0.5 min, n = 20
  – Parallel: Mean = 3 min, s = 0.6 min, n = 20
• You ask the question: Do the different panel combinations result in different lift times?
  – Different in a statistically significant way

Are the Lift Times Different?
• Use TTEST (Excel)
• TTEST returns the fractional probability of being wrong if you claim the two populations are different
  – We’ll say they are significantly different if probability is ≤ 0.05

  Series:    1.5  2    2.2  1.8  3    1.6  1.2  2.1  1.9  2.2  2.6  1.7  1.8  1.5  2.4  2.5  2.7  1.4  1.5  2.6
  Parallel:  3    2.4  2.2  2.6  3.4  3.6  3.8  3.5  2.7  2.4  3.5  3.8  2.1  2.5  3.4  3.3  2.4  3.6  2.3  3.7

Marbles

Linear Regression
• Fit the best straight line to a data set
[Figure: scatter plot with fitted trendline; y-axis: Grade Point Average; x-axis: Height (m); trendline y = 1.897x + 0.8667, R² = 0.9762]
Right-click on data point and select “trendline”. Select options to show equation and R².

R² - Coefficient of multiple Determination
• $R^2 = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2 / \sum_{i=1}^{n} (y_i - \bar{y})^2$
  – $\hat{y}_i$ = Predicted y values, from regression equation
  – $y_i$ = Observed y values
  – $\bar{y}$ (ybar) = mean of y
• R² = fraction of variance explained by regression
  – R² = 1 if data lie along a straight line
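
For anyone who wants to check the trendline numbers outside Excel, the R² definition above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `simple_linear_regression` and the x, y values are hypothetical (they are not the data behind the plotted trendline), and the slope and intercept use the ordinary least-squares formulas for simple LR, which the slides do not spell out.

```python
def simple_linear_regression(x, y):
    """Return (intercept, slope, r_squared) for the least-squares line y = a + b*x."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2 points)")

    x_bar = sum(x) / n  # mean of x
    y_bar = sum(y) / n  # mean of y (ybar)

    # Ordinary least-squares slope and intercept
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar

    # R^2 = sum((yhat_i - ybar)^2) / sum((y_i - ybar)^2)
    y_hat = [intercept + slope * xi for xi in x]  # predicted y values
    ss_regression = sum((yh - y_bar) ** 2 for yh in y_hat)
    ss_total = sum((yi - y_bar) ** 2 for yi in y)
    r_squared = ss_regression / ss_total

    return intercept, slope, r_squared


# Hypothetical data for illustration only (not taken from the slides)
x = [0, 2, 4, 6, 8, 10]
y = [1.0, 4.5, 8.7, 12.0, 16.4, 19.6]

intercept, slope, r2 = simple_linear_regression(x, y)
print(f"y = {slope:.3f}x + {intercept:.3f}, R^2 = {r2:.4f}")
```

With real measurements in place of the hypothetical lists, the printed slope, intercept, and R² should agree with the values Excel shows on the trendline, since both fit the same least-squares line.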