Biostatistics – descriptive statistics
Presentation of data: tables and graphs
Lecture INF4350, 1 October 2008
Anja Bråthen Kristoffersen, Biomedical Research Group, Department of Informatics, UiO

Goal
• Presentation of data – tables and graphs
• Sensitivity, specificity, ROC curves
• Hypothesis testing
– Type I and type II error
– Multiple testing
– False positives

Variable types
• Categorical variables
– Ordinal:
• Are you smoking? 1 = "Daily", 2 = "Now and then", 3 = "Stopped last year", 4 = "Stopped earlier", 5 = "Never"
– Nominal (discrete variables):
• Civil state: 1 = "not married", 2 = "married", 3 = "have a partner", 4 = "divorced", 5 = "widow"
• DNA (A, T, C, G)
• Binary variables (0, 1)
• Continuous variables – numbers

Variables can be
• Independent
– Are not influenced by other variables.
– Are not influenced by the event, but could influence the event.
• Dependent
– Variables that influence each other. For instance, the information that one gene is on/off could influence another gene. Which variable depends on, or influences, the other can often not be determined.

Average (mean)
• Properties:
– All observations must be known.
– The observations do not need to be ordered.
– Sensitive to outliers (extreme, untypical values).
• Equation:
x̄ = (x1 + x2 + … + xn)/n = (1/n) Σ xi

Adjusted mean
• Mean based on the central observations (90–95% of the observations); "the tails" (5–10% of the data) are not included in the calculation.
• Less sensitive to extreme observations.

Combining means
• Equation:
x̄ = (n1·x̄1 + n2·x̄2 + … + nm·x̄m)/(n1 + n2 + … + nm)
where ni is the number of observations behind the mean x̄i.
• Note that adjusted means cannot be combined like this.

Median
• Synonyms:
– 50th percentile
– Empirical median
• Properties:
– The observations are ordered.
– The median is the value that divides the ordered observations into two equal parts.
– Not sensitive to extreme observations.
– Mathematically inconvenient, since the median of more than one set of observations cannot be combined from the individual medians.

Mode
• The observation that occurs most often.
– Mathematically inconvenient for the same reason: modes of more than one set of observations cannot be combined.

Dispersion measures
• Range = X(n) − X(1)
– Same unit as the observations.
– Sensitive to extreme observations.
• Quantiles, percentiles
– The number Vp that has a proportion p of the ordered observations below it (0 < p < 1).
– Same unit as the observations.
• Standard deviation
sd = sqrt( Σ (xi − x̄)² / (n − 1) )
– Always positive.
– Outlying observations contribute most.
– Same unit as the observations.

Standard deviation
• If the data are close to Gaussian distributed, approximately 95% of the population lies within x̄ ± 1.96·sd.
– This corresponds approximately to the 2.5 and 97.5 percentiles.
– A consequence of the properties of the Gaussian distribution.
– Depends on approximate symmetry and unimodality.
• Quick and dirty: sd ≈ Range/4
– Handy as a first guess of the sd when calculating the necessary number of observations.
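The location and dispersion measures above are easy to verify numerically. A minimal sketch in Python (the data values are hypothetical; the lecture's own examples use R):

```python
import statistics

# Hypothetical sample with one outlier, to contrast mean and median
x = [4, 5, 5, 6, 6, 7, 7, 8, 40]

mean = statistics.mean(x)      # pulled upward by the outlier 40
median = statistics.median(x)  # unaffected by the outlier
sd = statistics.stdev(x)       # uses the n-1 denominator, as in the slide

# Adjusted (trimmed) mean: drop roughly 10% from each tail
xs = sorted(x)
k = max(1, len(xs) // 10)
trimmed = statistics.mean(xs[k:-k])

# Combining means of two groups, weighted by the group sizes n_i
a, b = [4, 5, 6], [10, 12]
combined = (len(a) * statistics.mean(a) + len(b) * statistics.mean(b)) / (len(a) + len(b))
assert abs(combined - statistics.mean(a + b)) < 1e-12  # same as pooling the raw data

# Quick and dirty: sd is roughly Range/4 (a first guess only)
print(mean, median, trimmed, sd, (max(x) - min(x)) / 4)
```

Note how the outlier drags the mean well above the median while the trimmed mean stays close to it, which is exactly the sensitivity argument made on the slide.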
Descriptive statistics – tables
• A scalar variable:
– Calculate mean, median and standard deviation.
• A categorical variable:
– Descriptive statistics → frequencies.
• Two categorical variables:
– Descriptive statistics → cross table.
• A scalar variable and a categorical variable:
– Compare mean/median for each category.
• Two scalar variables:
– Categorise one of the variables, or: linear regression.

Always plot your data
"A plot tells more than 1000 tests"
• A scalar variable:
– Histogram
– Box plot
– Compare the data with the Gaussian distribution: a Q-Q plot is easier to read and explain than a Gaussian curve drawn on top of a histogram.
• Two scalar variables:
– Scatter plot
• A scalar and a categorical variable:
– Box plot
• Two scalar variables and a categorical variable:
– Scatter plot

Example: probability of getting a boy

Number of babies born   Number of boys   Proportion of boys
10                      8                0.8
100                     55               0.55
1000                    525              0.525
10000                   5139             0.5139
100000                  51127            0.51127
376058                  1927054          0.51247
17989361                9219202          0.51248
34832051                17857857         0.51268

Prevalence, sensitivity, specificity, and more
A = {symptom or positive diagnostic test}
B = {ill}
P(B) = prevalence of the illness

Relative risk
A = {positive mammogram}
B = {breast cancer within two years}
Pr(B | A) = 0.1
Pr(B | not A) = 0.0002
RR = Pr(B | A) / Pr(B | not A) = 0.1 / 0.0002 = 500

Example: breast cancer diagnostics
A = {positive mammogram}
B = {breast cancer within two years}
P(A | B) = sensitivity
P(not A | not B) = specificity
P(A | not B) = false positive rate
Since P(A | not B) + P(not A | not B) = 1, we have P(A | not B) = 1 − specificity.
PPV = Pr(B | A) = positive predictive value (PV+) = 0.1
P(not B | not A) = NPV
NPV (PV−), the negative predictive value: with Pr(B | not A) = 0.0002, NPV = Pr(not B | not A) = 1 − 0.0002 = 0.9998.

Example: breast cancer in different groups
• Breast cancer among women 45 to 54 years old:
– Group A: gave first birth before 20 years of age.
– Group B: gave first birth after 30 years of age.
• Assume that 40 of 10000 in group A and 50 of 10000 in group B get cancer. Coincidence, or different risk?
• If the numbers were 400 of 100000 and 500 of 100000 – still coincidence?

Traditional 2×2 table

                 ill +     ill −
Test result +    a [TP]    b [FP]    a+b
Test result −    c [FN]    d [TN]    c+d
                 a+c       b+d       a+b+c+d

TP = true positive, FP = false positive, FN = false negative, TN = true negative.

Analysis of a 2×2 table
> fisher.test(matrix(c(40,9960,50,9950), ncol = 2, byrow = TRUE))
• Fisher showed that the probability of obtaining any such set of values is given by the hypergeometric distribution:

p = C(a+b, a) · C(c+d, c) / C(a+b+c+d, a+c) = (a+b)! (c+d)! (a+c)! (b+d)! / [ (a+b+c+d)! a! b! c! d! ]

• If the p-value is less than a cut-off (e.g. p < 0.05), we reject the null hypothesis and conclude that the "true odds ratio is not equal to 1"; hence the test result differentiates between ill and not ill.
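The hypergeometric formula above is all that is needed for a two-sided Fisher test. A minimal sketch in Python (the lecture uses R's fisher.test; the function name and tolerance here are our own choices):

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact test for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of all tables with the same
    margins whose point probability does not exceed that of the observed
    table (the definition R's fisher.test uses for its two-sided p-value).
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def p_table(x):
        # Hypergeometric point probability for the cell value a = x
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = p_table(a)
    lo, hi = max(0, col1 - row2), min(col1, row1)
    # small relative tolerance to absorb floating-point noise
    return sum(p for x in range(lo, hi + 1)
               if (p := p_table(x)) <= p_obs * (1 + 1e-7))

# The first table from the slides: 40/9960 vs 50/9950
p = fisher_exact_two_sided(40, 9960, 50, 9950)
print(round(p, 4))  # R's fisher.test reports p-value = 0.3417 for this table
```

Python's arbitrary-precision integers make the factorial ratios exact even for the large cell counts in this example.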
Fisher's Exact Test for Count Data

data: matrix(c(40, 9960, 50, 9950), ncol = 2, byrow = TRUE)
p-value = 0.3417
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
 0.5133146 1.2371891
sample estimates:
odds ratio
 0.7992074

> fisher.test(matrix(c(400,99600,500,99500), ncol = 2, byrow = TRUE))

Fisher's Exact Test for Count Data

data: matrix(c(400, 99600, 500, 99500), ncol = 2, byrow = TRUE)
p-value = 0.0009314
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
 0.6987355 0.9135934
sample estimates:
odds ratio
 0.7991994

The two tables tested:
 40   9960        400   99600
 50   9950        500   99500

Prevalence, sensitivity, specificity, and more

Prevalence  = Pr(B) = (a+c)/(a+b+c+d)
Sensitivity = Pr(A | B) = a/(a+c)
Specificity = Pr(not A | not B) = d/(b+d)
PPV = Pr(B | A) = a/(a+b)
NPV = Pr(not B | not A) = d/(c+d)
Accuracy = (a+d)/(a+b+c+d)

Statistical tests – testing hypotheses
• Find a null and an alternative hypothesis.
• Example:
– H0: Expected response is equal in both groups.
– H1: Expected response is different between the groups.
• p-value: the probability of observing the observed (or more extreme) values given that H0 is true.
• Reject H0 if the p-value is less than a given significance level (e.g.
0.05 or 0.01).

Statistical test methods
• Some tests assume a certain distribution.
– E.g. the t-test assumes that the data are (approximately) Gaussian distributed.
• Non-parametric tests are more flexible.
– E.g. comparing two medians: non-parametric test for two independent groups (Mann-Whitney).
• Two categorical variables: Fisher test, chi-square test.
• Two scalar variables: t-test, Mann-Whitney.
• A scalar and a categorical variable: ANOVA.

The t-test
• The t statistic is based on the sample mean and variance.

Mann-Whitney
• In order to apply the Mann-Whitney test, the raw data from samples A and B must first be combined into a set of N = na + nb elements, which are then ranked from lowest to highest. These rankings are then re-sorted into the two separate samples.
• The value of U reported in this analysis is the one based on sample A, calculated as
U_A = na·nb + na(na + 1)/2 − T_A
where T_A = the observed sum of ranks for sample A, and na·nb + na(na + 1)/2 = the maximum possible value of T_A.
• Convert the U statistic into a p-value.

ANOVA
• The t-test and its variants only work when there are two sample pools.
• Analysis of variance (ANOVA) is a general technique for handling multiple variables, with replicates.

A simple experiment
• Measure the response to a drug treatment in two different mouse strains.
• Repeat each measurement five times.
• Total experiment = 2 strains × 2 treatments × 5 repetitions = 20 arrays.
• If you look for treatment effects using a t-test, you ignore the strain effects.

ANOVA lingo
• Factor: a variable that is under the control of the experimenter (strain, treatment).
• Level: a possible value of a factor (drug, no drug).
• Main effect: an effect that involves only one factor.
• Interaction effect: an effect that involves two or more factors simultaneously.
• Balanced design: an experiment in which each factor and level is measured an equal number of times.

Fixed and random effects
• Fixed effect: a factor for which the levels would be repeated exactly if the experiment were repeated.
• Random effect: a term for which the levels would not repeat in a replicated experiment.
• In the simple experiment, treatment and strain are fixed effects, and we include a random effect to account for biological and experimental variability.

Two-factor ANOVA model

E_ijk = μ + T_i + S_j + (TS)_ij + ε_ijk,   i = 1, …, n;  j = 1, …, m;  k = 1, …, p.

• μ is the mean expression level of the gene.
• T and S are main effects (treatment, strain) with n and m levels, respectively.
• TS is an interaction effect.
• p is the number of replicates per group.
• ε represents random error (to be minimized).

ANOVA steps
• For each gene on the array:
– Fit the parameters T and S, minimizing ε.
– Test T, S and TS for difference from zero, yielding three F statistics.
– Convert the F statistics into p-values.

F-statistics
• Compare two linear models:
F = MSG / MSE   (Mean Squares Group / Mean Squares Error)
• This compares the variation between groups (group mean to group mean) to the variation within groups (individual values to group means).
• The p-value is the tail probability Pr(F(df1, df2) > F_calculated) of the F-distribution.

ANOVA assumptions
• For a given gene, the random error terms are independent, normally distributed and have uniform variance.
• The main effects and their interactions are linear.

ANOVA output
• For each gene: a p-value for strain effects, treatment effects and interaction effects (e.g. vehicle vs. drug).

Multiple testing correction
This and some following slides are from http://compdiag.molgen.mpg.de/ngfn/docs/2004/mar/DifferentialGenes.pdf.
• On an array of 10,000 spots, a p-value of 0.0001 may not be significant.
• Bonferroni correction: divide your p-value cut-off by the number of measurements.
• For significance of 0.05 with 10,000 spots, you need a p-value of 5 × 10⁻⁶.
• Bonferroni is conservative because it assumes that all genes are independent.

Types of errors
• False positive (Type I error): the experiment indicates that the gene has changed, but it actually has not.
• False negative (Type II error): the gene has changed, but the experiment failed to indicate the change.
• Typically, researchers are more concerned about false positives.
• Without doing many (expensive) replicates, there will always be many false negatives.

False discovery rate
• The false discovery rate (FDR) is the percentage of genes above a given position in the ranked list that are expected to be false positives.
• False positive rate: the percentage of non-differentially expressed genes that are flagged.
• False discovery rate: the percentage of flagged genes that are not differentially expressed.
• Example: 5 FP, 13 TP, 33 TN, 5 FN.

FDR example
• Order the unadjusted p-values p1 ≤ p2 ≤ … ≤ pm.
• To control the FDR at level α (the desired significance threshold), find
j* = max{ j : p_j ≤ (j/m)·α },
where j is the rank of the gene and m is the total number of genes.
• Reject the null hypothesis for j = 1, …, j*.
• This approach is conservative if many genes are differentially expressed.
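The cut-off rule above is the Benjamini-Hochberg step-up procedure. A minimal sketch in Python, with hypothetical p-values:

```python
def benjamini_hochberg_cutoff(pvalues, alpha=0.05):
    """Return the number of hypotheses rejected when controlling the FDR
    at level alpha: j* = max{ j : p_(j) <= (j/m) * alpha }."""
    ps = sorted(pvalues)
    m = len(ps)
    j_star = 0
    for j, p in enumerate(ps, start=1):
        if p <= (j / m) * alpha:
            j_star = j  # keep the largest rank j satisfying the bound
    return j_star

# Ten hypothetical p-values; the rank-dependent cut-off (j/m)*alpha grows
# with j, so later ranks get a more generous threshold than Bonferroni's.
pvals = [0.0001, 0.0004, 0.0019, 0.0095, 0.020, 0.028, 0.30, 0.42, 0.57, 0.75]
print(benjamini_hochberg_cutoff(pvals, alpha=0.05))  # → 6
```

Note that a rank can fail its own bound yet still be rejected, as long as some larger rank satisfies it; that is why the loop keeps the largest such j rather than stopping at the first failure.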
This is the Benjamini & Hochberg (1995) procedure.

In the example with 5 FP, 13 TP, 33 TN and 5 FN:
FDR = FP / (FP + TP) = 5/18 = 27.8%
FPR = FP / (FP + TN) = 5/38 = 13.2%

Controlling the FDR

Rank    (jα)/m     p-value
1       0.00005    0.0000008
2       0.00010    0.0000012
3       0.00015    0.0000013
4       0.00020    0.0000056
5       0.00025    0.0000078
6       0.00030    0.0000235
7       0.00035    0.0000945
8       0.00040    0.0002450
9       0.00045    0.0004700
10      0.00050    0.0008900
…
1000    0.05000    1.0000000

• Choose the threshold so that, for all the genes above it, the p-value is less than the corresponding (jα)/m (here the line falls after rank 8).
• Approximately 5% of the examples above the line are expected to be false positives.

Bonferroni vs. false discovery rate
• Bonferroni controls the family-wise error rate, i.e. the probability of at least one false positive.
• FDR is the proportion of false positives among the genes that are flagged as differentially expressed.

Diagnostic/ROC curve
Ranking of 109 CT images by one radiologist:

Status       Definitively  Probably  Unsure  Probably     Definitively  Total
             normal        normal            not normal   not normal
Normal       33            6         6       11           2             58
Not normal   3             2         2       11           33            51
Total        36            8         8       22           35            109

Criteria "1+": all images ranked 1 to 5 get the diagnosis ill.
Finds all the ill ones, but identifies no healthy ones.
Sensitivity = 1, specificity = 0, false positive rate = 1.

Criteria "2+": all images ranked 2 to 5 get the diagnosis ill.
Finds 48/51 of the ill ones, and identifies 33/58 of the healthy ones.
Sensitivity = 0.94, specificity = 0.57, false positive rate = 0.43.

Criteria "3+": all images ranked 3 to 5 get the diagnosis ill.
Finds 46/51 of the ill ones, and identifies 39/58 of the healthy ones.
Sensitivity = 0.90, specificity = 0.67, false positive rate = 0.33.

Criteria "4+": all images ranked 4 to 5 get the diagnosis ill.
Finds 44/51 of the ill ones, and identifies 45/58 of the healthy ones.
Sensitivity = 0.86, specificity = 0.78, false positive rate = 0.22.

Criteria "5+": all images ranked 5 get the diagnosis ill.
Finds 33/51 of the ill ones, and identifies 56/58 of the healthy ones.
Sensitivity = 0.65, specificity = 0.97, false positive rate = 0.03.

Criteria "6+": all images ranked above 5 get the diagnosis ill.
Finds none of the ill ones, but identifies all the healthy ones.
Sensitivity = 0, specificity = 1, false positive rate = 0.

Diagnostic/ROC curve – summary

Positive test criteria   Sensitivity   Specificity   False positive rate
1+                       1             0             1
2+                       0.94          0.57          0.43
3+                       0.90          0.67          0.33
4+                       0.86          0.78          0.22
5+                       0.65          0.97          0.03
6+                       0             1.0           0

References
• http://www.medisin.ntnu.no/ikm/medstat/MedStat1.07.dag1.pdf
• http://www.medisin.ntnu.no/ikm/medstat/MedStat1.07dag2.sanns.pdf
• http://noble.gs.washington.edu/~noble/genome373/ – Microarray analysis: ANOVA and multiple testing correction
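The ROC summary table can be recomputed directly from the radiologist's count table. A minimal sketch in Python (the counts are the ones given in the slides; Python is used here for illustration):

```python
# Columns are the ranks 1 ("definitively normal") to 5 ("definitively
# not normal"); rows are the true status of the 109 CT images.
normal     = [33, 6, 6, 11, 2]   # 58 healthy images
not_normal = [3, 2, 2, 11, 33]   # 51 ill images

for cutoff in range(1, 7):       # criteria "1+" ... "6+"
    # Test positive = rank >= cutoff (ranks are 1-based)
    tp = sum(not_normal[cutoff - 1:])   # ill images called ill
    fp = sum(normal[cutoff - 1:])       # healthy images called ill
    sens = tp / sum(not_normal)
    spec = 1 - fp / sum(normal)
    print(f"{cutoff}+  sens={sens:.2f}  spec={spec:.2f}  fpr={1 - spec:.2f}")
```

Sweeping the cut-off from 1+ to 6+ traces out exactly the sensitivity/false-positive-rate pairs of the table, which is all a ROC curve is.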