AAPOS 2015 Research Workshop
The tools of data analysis: matching the hammer to the nail

GLOSSARY

Chi-square test
This is an approximate test of whether or not two distributions are the same (e.g., proportions with and without disease in two different patient groups). It is calculated from the numbers in each category, in each group (e.g., the "two-by-two" table).

Fisher's exact test
This is another test of whether or not two distributions are the same. Unlike the chi-square test, it is not based on an approximation, so it is more accurate than the chi-square test when small numbers of subjects are involved.
Online calculator: http://www.graphpad.com/quickcalcs/contingency1/

T-test (a.k.a. two-sample t-test or unpaired t-test)
This is a test for comparing the mean values from two sets of measurements from different groups of subjects. It is based on the mean, standard deviation, and sample size in each group.
Online calculator: http://www.graphpad.com/quickcalcs/ttest1/?Format=SD

One-sample t-test
This is a test for comparing a mean value from a set of measurements to a hypothetical value (e.g., 0). It is based on the mean, standard deviation, and sample size.
Online calculator: http://www.graphpad.com/quickcalcs/OneSampleT1.cfm?Format=SD

Paired t-test
This is a test for comparing two sets of measurements when the measurements come from the same sample, e.g., "before" and "after" for the same group of subjects. It is the same as performing a one-sample t-test on the difference measures (e.g., the difference between before and after for each subject), testing whether the mean difference is different from zero.
Online calculator: You can enter the pairs of measurements: http://www.graphpad.com/quickcalcs/ttest1/?Format=50
Or, you can calculate the difference measure for each subject and use a one-sample t-test (see above).

Two-tailed test
A two-tailed test should be used in most cases, because the real difference (e.g., between two groups with mean values A and B) could theoretically go either way (A>B or B>A).

Standard deviation (s.d.)
This is a measure of the degree of scatter of values around the mean value. Excel has a function (STDEV) to calculate it.

Sensitivity
The proportion of actual cases that test positive.

Specificity
The proportion of actual non-cases (normals) that test negative.

Positive predictive value (PPV)
The chance that a positive test value is really an actual case. PPV depends on a combination of sensitivity, specificity, and the frequency of cases in the population.

Negative predictive value (NPV)
The chance that a negative test value is really normal. NPV depends on a combination of sensitivity, specificity, and the frequency of cases in the population.

Receiver-operating characteristic (ROC) curve
A graphical representation of all the combinations of sensitivity and specificity that you can have as you vary the cutoff for a test. It is plotted as sensitivity on the y-axis against (1 - specificity) on the x-axis. The area under the ROC curve is a measure of overall accuracy, with 1 being a perfect test and 0.5 being a worthless test.

Non-parametric test
A non-parametric test may be appropriate when more common tests are not, e.g., when the distribution of values is very skewed (i.e., nowhere near a normal distribution). There are non-parametric equivalents to more commonly used tests; e.g., the Wilcoxon rank-sum test may be more appropriate than a two-sample t-test when a distribution is very skewed in one direction (just as the median may be a more appropriate summary measure than the mean).
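Worked example (t-tests): for those who prefer code to the online calculators, here is a minimal Python sketch of the t-tests above using SciPy; all data values are made up for illustration. It also demonstrates the point above that a paired t-test is the same as a one-sample t-test on the within-subject differences.

from scipy import stats

# Unpaired (two-sample) t-test: compares means from two different groups
# (made-up measurements).
group_a = [12.1, 11.4, 13.0, 12.7, 11.9]
group_b = [10.2, 10.9, 11.1, 10.5, 11.3]
t, p = stats.ttest_ind(group_a, group_b)
print("unpaired t-test: t = %.3f, p = %.4f" % (t, p))

# Paired data: "before" and "after" for the same subjects (made up).
before = [1.2, 0.8, 1.5, 1.1, 0.9]
after = [0.9, 0.7, 1.1, 1.0, 0.6]

# Paired t-test, and the equivalent one-sample t-test on the differences.
t_paired, p_paired = stats.ttest_rel(before, after)
diffs = [b - a for b, a in zip(before, after)]
t_one, p_one = stats.ttest_1samp(diffs, 0.0)
print("paired t-test:             t = %.3f, p = %.4f" % (t_paired, p_paired))
print("one-sample on differences: t = %.3f, p = %.4f" % (t_one, p_one))

Worked example (the two-by-two table): the chi-square test, Fisher's exact test, and the sensitivity/specificity/PPV/NPV definitions above can all be computed from the same two-by-two table; the counts below are hypothetical.

from scipy import stats

# Hypothetical two-by-two table of test result vs. true disease status:
#                 disease   no disease
# test positive      40         10
# test negative       5         45
tp, fp = 40, 10  # true positives, false positives
fn, tn = 5, 45   # false negatives, true negatives
table = [[tp, fp], [fn, tn]]

# Chi-square (approximate) and Fisher's exact test of the same table.
chi2, p_chi2, dof, expected = stats.chi2_contingency(table)
odds_ratio, p_fisher = stats.fisher_exact(table)
print("chi-square p = %.4g, Fisher's exact p = %.4g" % (p_chi2, p_fisher))

# Diagnostic-accuracy measures as defined above. Note that PPV and NPV
# computed this way reflect the frequency of cases in this particular sample.
sensitivity = tp / (tp + fn)  # actual cases that test positive
specificity = tn / (tn + fp)  # actual non-cases that test negative
ppv = tp / (tp + fp)          # positive tests that are actual cases
npv = tn / (tn + fn)          # negative tests that are actual non-cases
print("sens %.2f, spec %.2f, PPV %.2f, NPV %.2f" % (sensitivity, specificity, ppv, npv))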
Pearson's correlation coefficient R
This is a measure of the strength of the linear correlation between an "independent" variable (x) and a "dependent" variable (y). It can vary from 1 (perfect positive linear relationship; y increasing as x increases) to -1 (perfect negative relationship; y decreasing as x increases), with 0 indicating no relationship at all (y is totally independent of x). Excel has a function (CORREL) to calculate this. R does not tell you how much y changes for a given change in x (that's a different number, the slope; see below); R just tells you how tight the relationship is.

Linear regression
This is a statistical technique that finds the best-fitting line modeling the relationship between x and y. The equation describing the line can then be used to predict a value of y if you know the value of x, or to describe how much y changes for a given change in x (this is the "slope" of the equation). The "add trendline" function in Excel performs linear regression on scatter-plot data and shows the fitted line. You can display the equation and the R-squared value; the R-squared value is the square of Pearson's correlation coefficient R.

Logistic regression
Logistic regression is a form of regression, closely related to linear regression, to be used when the dependent (outcome) variable is a binary (e.g., yes/no) variable (rather than a continuous variable). The logistic regression technique essentially first transforms the binary variable to something that can be treated as a continuous variable, and then performs "ordinary" linear regression. It is very useful in data modeling when one or more variables in the model are binary. Instead of y being a continuous variable that changes with x, it is the frequency or probability of a "yes" vs. "no" outcome that changes with x. The output of the equation is a value somewhere between 0 and 1, representing the risk or probability of a yes vs. no outcome.
More details: http://www.biostathandbook.com/simplelogistic.html
http://www.medcalc.org/manual/logistic_regression.php

Odds ratio (OR)
The odds ratio (OR) indicates the strength of the association between two variables, classically termed "exposure" (think: treated/untreated; with/without the gene; onset before/after a given age) and "outcome" (think: cured/not cured, misaligned/straight, acuity above/below a given value). If OR>1, exposure is associated with higher odds of the outcome; if OR<1, exposure is associated with lower odds of the outcome. But if the 95% confidence interval (CI; see below) includes the value 1, i.e., runs from a value below 1 to a value above 1, then the OR is not significantly different from 1, and the exposure is not associated with the odds of the outcome.
Online calculator: http://www.medcalc.org/calc/odds_ratio.php

Confidence interval (CI)
A sample that you study provides an estimate of a true value (e.g., an odds ratio, or a prevalence). The 95% CI gives a range of values around this estimate; the true value lies within this range with 95% confidence.

Multiple or multivariate linear regression
This is linear regression in which y is modeled as a function of a combination of different x variables instead of just one x variable. This statistical technique models the relationship between a dependent variable (y) and multiple independent or explanatory variables (x1, x2, x3, etc.) that each influence y. The data set of x values and y values is used to create an equation that describes that relationship. The equation can then be used to predict a value of y for any subject if you know the values of x1, x2, x3, etc. The equation describes how much y varies for a given change in x1 or x2 or x3 (each of which can have a different "slope" or effect on y).
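Worked example (correlation and regression): in Python, scipy.stats.linregress returns both Pearson's R and the slope in a single call; a minimal sketch with made-up data.

from scipy import stats

x = [1, 2, 3, 4, 5, 6]                # "independent" variable (made up)
y = [2.1, 3.9, 6.2, 8.1, 9.8, 12.2]   # "dependent" variable (made up)

result = stats.linregress(x, y)
print("slope     =", result.slope)       # how much y changes per unit change in x
print("intercept =", result.intercept)
print("R         =", result.rvalue)      # Pearson's correlation coefficient
print("R-squared =", result.rvalue**2)   # what Excel's trendline displays

Worked example (odds ratio and confidence interval): an OR and its 95% CI can be computed from a two-by-two exposure/outcome table with the standard log-odds-ratio approximation (the usual textbook method; the counts below are hypothetical).

import math

a, b = 30, 10  # exposed:   outcome yes / outcome no (made-up counts)
c, d = 15, 25  # unexposed: outcome yes / outcome no

odds_ratio = (a * d) / (b * c)
se_log_or = math.sqrt(1/a + 1/b + 1/c + 1/d)  # standard error of ln(OR)
ci_low = math.exp(math.log(odds_ratio) - 1.96 * se_log_or)
ci_high = math.exp(math.log(odds_ratio) + 1.96 * se_log_or)
print("OR = %.2f, 95%% CI %.2f to %.2f" % (odds_ratio, ci_low, ci_high))
# If this interval included 1, the OR would not be significantly different from 1.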
Multiple or multivariate logistic regression
This is logistic regression in which the binary (yes/no) outcome y is modeled as a function of a combination of different x variables instead of just one x variable. The output of the equation is a value somewhere between 0 and 1, i.e., the risk or probability that the outcome will be yes vs. no based upon the values of the different x variables. The equation can be used to develop a prediction model and predict an outcome for a given patient. The model also provides an odds ratio for each x variable, each of which describes the strength of the association between that exposure (x variable) and the outcome (y). Because these regression methods are "multivariate" (have multiple exposure variables), they allow us to control for potential confounding variables, in order to determine whether an association between a specific exposure or treatment (e.g., high oxygen) and an outcome (e.g., severe ROP) exists independently of other factors (e.g., birth weight and gestational age).

Bland-Altman plot
A plot of data from two sets of measurements on the same samples/subjects, where the difference between measure A and measure B is plotted against the average of measure A and measure B for each pair of measurements (each sample/subject).

Repeatability index
This is a measure of repeatability. It is 1.96 times the s.d. of the difference between measure A and measure B.

Power Vector
To quantitatively analyze astigmatism in both magnitude and axis, we transform clinical refractive error into a power vector notation consisting of (J0, J45, M). This is a simple tool for astigmatism research, allowing averaging and comparison of astigmatic refractive errors of different axes.

M
M is the same as the spherical equivalent refractive error.

J0
For sphero-cylindrical notation S + C axis A (in degrees), J0 = (-C/2) * cos(2A)

J45
For sphero-cylindrical notation S + C axis A (in degrees), J45 = (-C/2) * sin(2A)

Examples:
Sphero-cylindrical notation     J0      J45
Plus cylinder 1D axis 090       0.5     0
Plus cylinder 1D axis 180      -0.5     0
Plus cylinder 1D axis 045       0      -0.5
Plus cylinder 1D axis 135       0       0.5

Oblique astigmatism that is not exactly at axis 045 or 135 has non-zero values for both the J0 and J45 components.

LogMAR
LogMAR notation of visual acuity (VA) expresses VA in a form that is more appropriate for mathematical manipulation (such as averaging) than Snellen fraction or decimal notation. For VA represented as a Snellen fraction (such as 20/40), LogMAR VA = -log10(Snellen fraction).
Examples:
LogMAR VA      0       0.1     0.2     0.3
Snellen VA     20/20   20/25   20/32   20/40
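Worked example (Bland-Altman plot and repeatability index): a minimal Python/matplotlib sketch with made-up paired measurements, plotting the difference against the average and computing the 1.96 × s.d. repeatability index.

import numpy as np
import matplotlib.pyplot as plt

# Made-up measurements of the same subjects by two methods.
measure_a = np.array([1.10, 0.95, 1.32, 1.08, 0.99, 1.21, 1.15])
measure_b = np.array([1.05, 1.00, 1.25, 1.10, 0.93, 1.28, 1.12])

diff = measure_a - measure_b       # A - B for each subject
avg = (measure_a + measure_b) / 2  # average of A and B for each subject

bias = diff.mean()                       # mean difference between the methods
repeatability = 1.96 * diff.std(ddof=1)  # 1.96 x s.d. of the differences

plt.scatter(avg, diff)
plt.axhline(bias)                                  # mean difference
plt.axhline(bias + repeatability, linestyle="--")  # upper limit
plt.axhline(bias - repeatability, linestyle="--")  # lower limit
plt.xlabel("Average of measure A and measure B")
plt.ylabel("Difference (A - B)")
plt.title("Bland-Altman plot")
plt.show()

Worked example (power vectors and LogMAR): these conversions are simple enough to compute directly; the sketch below implements the formulas above and checks them against the example tables (the components are returned in (M, J0, J45) order here).

import math

def power_vector(S, C, A):
    """Convert sphero-cylinder S + C axis A (degrees) to (M, J0, J45)."""
    M = S + C / 2  # spherical equivalent
    J0 = (-C / 2) * math.cos(math.radians(2 * A))
    J45 = (-C / 2) * math.sin(math.radians(2 * A))
    return M, J0, J45

def logmar(numerator, denominator):
    """LogMAR VA = -log10(Snellen fraction), e.g., 20/40 -> 0.3."""
    return -math.log10(numerator / denominator)

# Checks against the tables above:
print([round(v, 2) for v in power_vector(0, 1, 90)])  # J0 = +0.5, J45 = 0
print([round(v, 2) for v in power_vector(0, 1, 45)])  # J0 = 0, J45 = -0.5
print(round(logmar(20, 40), 2))                       # 0.3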