Comparing correlated correlations

Advisor: Rhonda Decook
Client: Vinayak
Consultants: Tianyu Li, Qinbin Fan
Department of Statistics and Actuarial Science
University of Iowa

Outline
  Introduction
  Data Highlights
  Results & Analysis
  Conclusion

Introduction

Gold Standard: first take the average score of each image across the 3 graders, then re-rank the images. (We also tried other ways to define the gold standard, but this definition is the one we mainly use.)

Let $r_{alg.i,GS}$ denote the Pearson correlation coefficient between algorithm i and the gold standard (GS). The $r_{alg.i,GS}$ values for the data set with 12 algorithms and 25 images are shown below (from largest correlation to smallest):

  Algorithm   Correlation with GS
  12*         0.6583
  10          0.6259
  3           0.5806
  9           0.5718
  4           0.5385
  8           0.5040
  6           0.5000
  5           0.4630
  7           0.4601
  1           0.4300
  11          0.3104
  2           0.2422

For each competing algorithm i = 1, 2, …, 11, we wish to test

  $H_0: r_{alg.12,GS} - r_{alg.i,GS} = 0$

We improve our method by using the Fisher transformation of r, which is

  $Z = \frac{1}{2} \log_e \frac{1+r}{1-r}$

It is a 'normalizing' transformation. Correlations fall between -1 and 1, but after transformation they fall between −∞ and +∞.

Our method is based on the journal article: Cohen A. (1989) "Comparison of correlated correlations." Statistics in Medicine, 8(12):1485-1495.

We use the bootstrap to form a confidence interval for the difference between the transformed correlations. The bootstrap method takes into account the fact that the algorithms were all applied to the same set of 25 images.

The bootstrap method resamples with replacement from the original set of n = 25 images to create a 'new hypothetical' data set.
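The transformation and resampling step described above can be sketched in Python as follows. This is a minimal illustration, not the consultants' actual code: it assumes NumPy, the function names (`fisher_z`, `pearson_r`, `bootstrap_z_diff`) and the synthetic scores are hypothetical, and the gold standard is treated as a fixed value per image rather than being re-ranked inside each resample.

```python
import numpy as np

def fisher_z(r):
    """Fisher transformation 0.5 * ln((1 + r) / (1 - r)), i.e. arctanh(r)."""
    return np.arctanh(r)

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length 1-D arrays."""
    return np.corrcoef(x, y)[0, 1]

def bootstrap_z_diff(alg_a, alg_b, gs, rng):
    """One bootstrap replicate of z(r_a,GS) - z(r_b,GS).

    The same resampled image indices are used for both algorithms and the
    gold standard, so the dependence between the two correlations is kept.
    """
    n = len(gs)
    idx = rng.integers(0, n, size=n)  # draw n images with replacement
    z_a = fisher_z(pearson_r(alg_a[idx], gs[idx]))
    z_b = fisher_z(pearson_r(alg_b[idx], gs[idx]))
    return z_a - z_b

# Hypothetical data: 25 images, a rank-based gold standard, two noisy algorithms.
rng = np.random.default_rng(0)
gs = rng.permutation(25).astype(float) + 1.0
alg_a = gs + rng.normal(scale=8.0, size=25)
alg_b = gs + rng.normal(scale=15.0, size=25)

diff = bootstrap_z_diff(alg_a, alg_b, gs, rng)
print(diff)
```

With the reported correlations for algorithms 12 and 2 (0.6583 and 0.2422), `fisher_z(0.6583) - fisher_z(0.2422)` comes out at about 0.54, in line with the Fisher-scale difference reported for that pair up to rounding of the inputs.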
We calculate the difference in correlations in each of 5000 bootstrapped data sets, which gives us a sampling distribution for the difference.

If the confidence interval does not include zero, then the two algorithms have significantly different correlations, and one is better than the other.

We apply the Bonferroni multiple comparison adjustment to account for the fact that we are doing 11 comparisons (C.I. level is 1 − 0.05/11 ≈ 0.995). The adjustment allows us to maintain the family-wise type I error rate at the α = 0.05 level.

Data Highlights

  Patient ID   1          2          …
  A            1.16154    2.114705   …
  B            1.203186   2.126865   …
  …            …          …          …

  1    2    3    GS
  16   21   13   16
  20   13   23   20
  …    …    …    …

Results & Analysis

  Comparison   r12-rj (raw)   z12-zj (Fisher)   CI of Diff (99.5%)   CI of Diff (95%)   Significant Diff
  A2 VS A12    0.416          0.542             ( 0.042, 1.338 )     ( 0.169, 1.048 )   YES
  A11 VS A12   0.347          0.468             (-0.109, 1.266 )     ( 0.087, 0.975 )   NO/YES
  A1 VS A12    0.228          0.329             (-0.253, 0.977 )     (-0.058, 0.761 )   NO
  A7 VS A12    0.198          0.292             (-0.229, 0.870 )     (-0.046, 0.664 )   NO
  A5 VS A12    0.195          0.288             (-0.255, 0.845 )     (-0.066, 0.670 )   NO
  A6 VS A12    0.158          0.240             (-0.466, 0.835 )     (-0.208, 0.620 )   NO
  A8 VS A12    0.154          0.235             (-0.310, 0.753 )     (-0.113, 0.596 )   NO
  A4 VS A12    0.119          0.187             (-0.431, 0.768 )     (-0.211, 0.584 )   NO
  A9 VS A12    0.086          0.139             (-0.461, 0.689 )     (-0.256, 0.504 )   NO
  A3 VS A12    0.077          0.126             (-0.422, 0.652 )     (-0.222, 0.464 )   NO
  A10 VS A12   0.032          0.055             (-0.411, 0.576 )     (-0.255, 0.392 )   NO

  NO/YES: the 99.5% interval includes zero, but the 95% interval does not.

[Figure: A2 (worst) VS A12]

[Figure: A11 (2nd worst) VS A12]

We also tried other ways to define the gold standard: using only the 2 most strongly correlated columns, using the median rank of all 3 for each image, and using averages without re-ranking, applying the same analysis in each case. We found very similar results in all four cases. To save space and time, we do not present the results of the other three.
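The Bonferroni-adjusted confidence level and the zero-exclusion rule described in the Introduction can be sketched as below. This is an illustration under stated assumptions, not the analysis code behind the table: it assumes NumPy and a simple percentile interval (the slides do not say which bootstrap interval was used), the helper names are hypothetical, and `boot_diffs` is synthetic stand-in data rather than real bootstrap output.

```python
import numpy as np

def bonferroni_level(alpha, n_comparisons):
    """Per-comparison confidence level that keeps the family-wise error at alpha."""
    return 1.0 - alpha / n_comparisons

def percentile_ci(boot_diffs, level):
    """Two-sided percentile interval from bootstrap replicates (an assumed choice)."""
    tail = (1.0 - level) / 2.0
    lo, hi = np.quantile(boot_diffs, [tail, 1.0 - tail])
    return float(lo), float(hi)

def excludes_zero(ci):
    """True when the interval lies entirely above or entirely below zero."""
    lo, hi = ci
    return lo > 0.0 or hi < 0.0

# 11 comparisons at a family-wise alpha of 0.05.
level = bonferroni_level(0.05, 11)

# Synthetic stand-in for 5000 bootstrap differences on the Fisher scale.
rng = np.random.default_rng(0)
boot_diffs = rng.normal(loc=0.3, scale=0.2, size=5000)

ci_adj = percentile_ci(boot_diffs, level)  # Bonferroni-adjusted interval
ci_95 = percentile_ci(boot_diffs, 0.95)    # unadjusted interval
print(level, ci_adj, ci_95)
```

Applied to the reported intervals, `excludes_zero((0.042, 1.338))` is true for A2 VS A12, while `excludes_zero((-0.109, 1.266))` is false for A11 VS A12 at the adjusted level.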
Conclusion

We did not find a statistically significant difference between our client's method and most of the other methods; only the A2 VS A12 comparison was significant at the Bonferroni-adjusted 99.5% level. Results may vary when we have more data (more image grades).