Statistics Help Guide
Becki Paynich

I. A Few Important Things:
A. Math is the least important factor in statistics. Knowing which statistical test to employ based on the sample, the level of measurement, and what you are trying to understand is the most important part of statistics. However, you need to understand how the math works to better understand why you pick any given test.
B. Statistics is not brain surgery. Lighten up. You’ll do fine.
C. This is a help guide, not a comprehensive treatment of statistics. Please refer to a good statistics textbook for a deeper understanding of statistics.

II. Levels of Measurement: Knowing the level of measurement is extremely important in statistics. Most tests of significance are explicit about the level of measurement they require. If the requirements for the level of measurement are not adhered to, results lose validity. For example, Pearson’s r assumes that variables are at least interval level. If a Pearson’s r is employed on nominal or ordinal level variables, the results cannot be directly interpreted (pretty much, your results are meaningless).
A. Nominal—a variable that is measured in categories that cannot be ranked. Usually nominal variables are collapsed into the smallest number of categories possible (0, 1) so that they may be used in analysis. A variable in this form (dichotomous) can be used in virtually any type of analysis. For example: the variable RACE may originally be separated into 12 different categories. These 12 might be collapsed into 2 (most likely Caucasian = 1 and Non-Caucasian = 0). The important thing to remember is that collapsing categories into larger ones can hide important relationships between them. For example, the difference between African-Americans and Native Americans on a given variable may be important, but if both are collapsed into the category of Non-Caucasian, this relationship will be obscured.
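The collapsing just described can be sketched in a few lines of Python. This is a minimal illustration, not part of the guide; the category labels and recode function name are hypothetical.

```python
# Sketch: collapsing a multi-category nominal RACE variable into a
# dichotomous (dummy) variable. Labels here are hypothetical examples.
def collapse_race(category):
    """Return 1 for Caucasian, 0 for all other categories."""
    return 1 if category == "Caucasian" else 0

sample = ["Caucasian", "African-American", "Native American", "Caucasian"]
dummy = [collapse_race(c) for c in sample]
print(dummy)  # [1, 0, 0, 1]
```

Note that once the variable is recoded this way, the distinction between the collapsed categories is gone, which is exactly the trade-off the paragraph above warns about.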
Another thing to keep in mind is that a dichotomous or dummy variable (one with only two categories, as in the example above) is the most powerful form, statistics-wise, for a nominal variable to be in.
B. Ordinal—a variable that can be ranked but where the exact difference between categories cannot be identified. For example: the categories Strongly Agree, Agree, No Opinion, Disagree, and Strongly Disagree can be differentiated by order, but one does not know the exact mathematical difference between two respondents who answer Strongly Agree and Agree respectively. Another example would be categorical ranges. Income, for example, can be grouped into the following ranges: $0-$10,000; $10,001-$20,000; and $20,000+. While certainly this variable is measured with numbers, one cannot tell the exact difference between a respondent who selected the $0-$10,000 range and a respondent who selected the $10,001-$20,000 range. In theory, they could be $1.00 apart or closer to $10,000 apart in income. A final form of ordinal variable, which is actually treated as interval level, is created when one adds together multiple ordinal level responses. For example, I may want to make an index of how well a student likes the course materials by combining his/her responses to questions about the books, the web page, and the class notes. Thus, if a student answered Strongly Agree to questions about these three items, and the score for Strongly Agree is 5, the student’s total score for this new index is 15. Ordinal variables can also be, and often are, collapsed into a smaller number of categories. For example: a variable measuring rank may have 8 or more original categories that can be collapsed into administration = 1 and non-administration = 0. Or there could be three categories, with a middle category of middle-management. The problem again is sacrificing important relationships that may be obscured by the collapse for ease of analysis.
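The additive index described above amounts to nothing more than summing the item scores. A minimal sketch, assuming the usual 1-5 Likert coding (5 = Strongly Agree):

```python
# Sketch: building an additive index from three ordinal (Likert) items,
# each assumed coded 1 = Strongly Disagree ... 5 = Strongly Agree.
def index_score(responses):
    """Sum the item scores into a single index score."""
    return sum(responses)

# A student answering Strongly Agree (5) on books, web page, and notes:
print(index_score([5, 5, 5]))  # 15
```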
The best rule of thumb is to run analyses with all the categories in their smaller form first to identify important relationships between them, and then run the analyses again after collapsing the variables to identify important relationships across the larger ones.
**When you create nominal and ordinal level questions, make sure they are mutually exclusive (no respondent can fit into more than one category) and exhaustive (you have listed all possibilities). Also, when designing numerical ranges, make sure that the intervals have the same width. In the following example, notice that the width of each category is $10,000:**
Question: What is your income range?
1 = 0-10,000
2 = Above 10,000 to 20,000
3 = Above 20,000 to 30,000
C. Interval—a variable or scale that uses numbers to rank order. This is the best type of variable to have, as you can keep it in interval form or turn it into an ordinal or nominal level variable and do virtually any type of analysis. Furthermore, the exact difference between two respondents can be identified. For example: with the variable age, when respondents report their actual age, one can mathematically determine the differences between the ages of respondents. If, however, the variable is grouped into age ranges (0-5, 6-10, 11-15, 16-20…), then the variable is ordinal, not interval. Interval level variables can be collapsed into ordinal or even nominal categories, but this is usually avoided: most statistical tests designed for interval level variables are more powerful and provide more understanding of the nature of the relationships than tests designed for ordinal or nominal variables.
D. Ratio—a ratio scale variable is identical to interval in almost every respect except that it has an absolute zero. Age and income have absolute zeros. If a variable can be considered on a ratio scale, then it is also interval scale.
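The income-range coding above can be sketched as a small recode function. This is illustrative only; the function name is hypothetical, and the ranges mirror the three categories in the example (so incomes above $30,000 are deliberately outside the coded scheme, which a real exhaustive question would need to cover).

```python
# Sketch: coding income into mutually exclusive categories with equal
# $10,000 widths, mirroring the example question above.
def income_category(income):
    """1 = 0-10,000; 2 = above 10,000 to 20,000; 3 = above 20,000 to 30,000."""
    if income <= 10_000:
        return 1
    elif income <= 20_000:
        return 2
    elif income <= 30_000:
        return 3
    raise ValueError("income falls outside the coded ranges")

print(income_category(10_000), income_category(10_001))  # 1 2
```

Because the boundaries use strict "above" cutoffs, no income can fall into two categories, which is what mutual exclusivity requires.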
**Many researchers in criminal justice ignore the difference between interval and ratio, as the tests of significance that we use for ratio and interval variables are the same.**
**Make sure to code variables in the most logical order.** For example, if you are trying to measure the frequency of smoking, use the highest numbers for the most smoking and the lowest numbers for the least smoking. For example:
Question: How often do you smoke?
0 = Never
1 = Less than once per week
2 = At least once per week
3 = 2 or 3 times per week
4 = Daily
III. Numerical Measures:
A. Measures of Central Tendency
1. Mode—the most prevalent value or category (for example, in a data set of pets: 25 cats, 13 dogs, 12 hamsters, 47 fish. Fish is the modal category, with 47 being the mode). The only measure of central tendency for nominal-level data is the mode. The mode can also be used with ordinal and interval-level data.
2. Median—the midpoint of a distribution. An equal number of the sample will be above the median as below it. In a sample of n = 101, 50 of the scores will be above the median and 50 will be below it; when ranked from lowest to highest, the 51st score will be the median. If your n is an even number, take the average of the two middle scores. The median is most often used for interval-level data.
3. Mean—the arithmetic average. The mean is usually not computed on ordinal and nominal-level data because there it has little explanatory value. However, many statistical tests require that the mean be computed for the formula. The mean is most appropriate for interval-level data.
B. Measures of Variability
1. Range—the difference between the largest and smallest values.
2. Variance—the sum of the squared deviations of n measurements from their mean, divided by (n – 1).
3. Standard Deviation—the positive square root of the variance.
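All of the measures just defined are available in Python's standard library, which makes for a quick check of the definitions. This sketch reuses the pets example from above; the scores list is hypothetical.

```python
# Sketch: central tendency and variability with Python's statistics module.
import statistics

# Mode: the pets example above (25 cats, 13 dogs, 12 hamsters, 47 fish).
pets = ["cat"] * 25 + ["dog"] * 13 + ["hamster"] * 12 + ["fish"] * 47
print(statistics.mode(pets))        # fish (the modal category)

scores = [2, 4, 4, 4, 5, 5, 7, 9]   # hypothetical scores, n = 8 (even)
print(statistics.median(scores))    # 4.5 (average of the two middle scores)
print(statistics.mean(scores))      # 5.0

# Variance divides the sum of squared deviations by (n - 1), per the
# definition above; the standard deviation is its positive square root.
print(statistics.variance(scores))  # about 4.57
print(statistics.stdev(scores))     # about 2.14
```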
One must compute the variance first to get the standard deviation because the sum of the (unsquared) deviations from the mean always equals zero.
4. Typically, standard deviations are used to mark distances from the mean in a normal curve. In a normal distribution, the interval from one standard deviation below the mean to one standard deviation above the mean contains approximately 68% of the measurements. The interval from two standard deviations below to two standard deviations above the mean contains approximately 95% of the measurements.
5. In a perfect normal curve, the mode, median, and mean are all the same number.
IV. Probability:
A. Classic theory of probability—the chance of a particular outcome occurring is determined by the ratio of the number of favorable outcomes (successes) to the total number of outcomes. This theory only pertains to outcomes that are mutually exclusive (disjoint).
B. Relative Frequency Theory—if an experiment is repeated an extremely large number of times and a particular outcome occurs a percentage of the time, then that percentage is close to the probability of that outcome.
C. Independent Events—outcomes not affected by other outcomes.
D. Dependent Events—outcomes affected by other outcomes.
E. Multiplication Rule
1. Joint Occurrence—to compute the probability of two or more independent events all occurring, multiply their probabilities.
F. Addition Rule—to determine the probability of at least one of several mutually exclusive events occurring, add their probabilities.
G. Remember to account for probabilities with and without replacement. For example, when picking cards out of a deck, the probability of choosing the Queen of Hearts on the first try is 1/52. However, if you don’t put back the first card before choosing a second time, the probability on the second draw increases to 1/51 because there are only 51 cards left to choose from.
H. Probability Distribution—nothing more than a visual representation of the probabilities of success for given outcomes.
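The card-drawing example above can be worked through exactly with Python's Fraction type, which keeps the probabilities as exact ratios rather than rounded decimals. This is a sketch of the multiplication rule with and without replacement.

```python
# Sketch: the multiplication rule with and without replacement,
# using exact fractions for the card-deck example above.
from fractions import Fraction

# P(Queen of Hearts on the first draw) from a full 52-card deck:
p_first = Fraction(1, 52)

# With replacement, the second draw is independent of the first,
# so the joint probability multiplies: (1/52) * (1/52).
p_both_with = Fraction(1, 52) * Fraction(1, 52)

# Without replacement, only 51 cards remain, so a specific remaining
# card has probability 1/51 on the second draw.
p_second_without = Fraction(1, 51)

print(p_first, p_both_with, p_second_without)  # 1/52 1/2704 1/51
```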
I. Discrete variables are those that do not have possible outcomes in between values. Thus, a coin toss can only result in either a heads or a tails outcome. Continuous variables are those that do have possible outcomes between values. For example, it is absolutely possible to be 31.26 years old.
V. Sampling:
A. Independent Random Sampling: Most statistical tests are based on the premise that samples are independently and randomly selected. If in fact a sample was purposive, statistical analysis cannot be generalized outside the sample. Random sampling is done so that inferential statistics can be interpreted beyond the sample. Statistical tests can still be employed on samples that have been drawn through non-random techniques; however, their interpretation must be confined to the sample at hand, and the limitations of the sampling design must be addressed in the results and methods discussions of your research.
B. Central Limit Theorem—the larger the number of observations (the bigger your n), the more likely the sampling distribution of the mean will approximate a normal curve.
VI. Principles of Testing:
A. Research hypothesis—can be stated in a certain direction or without direction. Directional hypotheses are tested with one-tailed tests of significance. Non-directional hypotheses are tested with two-tailed tests of significance.
B. Null hypothesis—essentially the statement that two values are not statistically related. If a test of a research hypothesis is not significant, then the null hypothesis cannot be rejected.
C. Type I error (alpha)—when a researcher rejects the null hypothesis when in fact it is true. That is, stating that two variables are significantly related when in fact they are not.
D. Type II error (beta)—when a researcher fails to reject the null hypothesis when in fact the null hypothesis is false. That is, stating that two variables are not significantly related when in fact they are.
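The Central Limit Theorem mentioned above is easy to see by simulation: die rolls are uniformly distributed (not normal), yet the means of repeated samples pile up tightly around the population mean of 3.5. This is a sketch with hypothetical simulation settings.

```python
# Sketch: illustrating the Central Limit Theorem by simulation.
# Individual die rolls are uniform, but sample MEANS cluster
# around the population mean (3.5) as n grows.
import random
import statistics

random.seed(42)  # fixed seed so the simulation is reproducible

def sample_mean(n):
    """Mean of n simulated fair-die rolls."""
    return statistics.mean(random.randint(1, 6) for _ in range(n))

means = [sample_mean(100) for _ in range(1000)]
print(round(statistics.mean(means), 2))   # close to 3.5
print(round(statistics.stdev(means), 2))  # small: the means cluster tightly
```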
VII. Univariate Inferential Tests: (See also Appendix A for a quick reference guide.) Essentially, univariate tests look at one variable and how the scores for that variable differ between sample and population groups, between two samples, and within groups at different points in time. Many univariate tests can be employed on proportions when that is all that is available. This is not a comprehensive listing of all univariate tests, but the most commonly used are briefly discussed below.
A. Steps in testing:
1. State the assumptions.
2. State the null and research hypotheses.
3. Decide on a significance level for the test and determine the test to be used.
4. Compute the value of the test statistic.
5. Compare the test statistic with the critical value to determine whether the test statistic falls in the region of acceptance or the region of rejection.
B. One-sample z-test:
1. Requirements: normally distributed population; population variance is known.
2. Test for a population mean.
3. One-tailed or two-tailed.
4. This test essentially tells you if your sample mean is statistically different from the larger population mean.
C. One-sample t-test:
1. Requirements: normally distributed population; population variance is unknown (it is estimated from the sample).
2. Test for a population mean.
3. One-tailed or two-tailed.
4. This test essentially tells you if your sample mean is statistically different from the larger population mean.
D. Two-sample t- and z-tests for comparing two means:
1. Requirements: two normally distributed but independent populations. Population variance can be known or unknown; there are different formulas depending on whether the variance is known.
E. Paired Difference t-test:
1. Requirements: a set of paired observations from a normal population. This test is usually employed to compare “before” and “after” scores. It is also employed in twin studies.
F. Chi-Square for population distributions:
1. Requirements: random sampling; expected frequencies of at least 5 per category.
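Before moving on: the one-sample z-test in B above can be sketched numerically. All the numbers here are hypothetical, invented purely to show the arithmetic of the five testing steps.

```python
# Sketch: a one-sample z-test by hand (population variance known).
# All values below are hypothetical illustration numbers.
import math

pop_mean, pop_sd = 100.0, 15.0   # known population parameters
samp_mean, n = 104.5, 64         # sample statistics

# Obtained z: how many standard errors the sample mean lies from mu.
z = (samp_mean - pop_mean) / (pop_sd / math.sqrt(n))
print(round(z, 2))  # 2.4

# Two-tailed test at alpha = .05: the critical z is +/-1.96.
reject_null = abs(z) > 1.96
print(reject_null)  # True: the sample mean differs significantly
```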
This test basically compares frequencies from a known sample with frequencies from an assumed population. There are many different types of chi-square tests which test different things—consistency, distribution, goodness of fit, independence… Make sure you are employing the right chi-square for your needs.
VIII. Bivariate Relationships: Bivariate simply means analysis between two variables. Before beginning this section, it is important to note that there is a difference between measures of association and tests of significance. A Pearson’s r (a measure of association) will give you a statistic between -1 and +1: -1 is a perfect negative correlation, +1 is a perfect positive correlation, and 0 is no correlation. When SPSS provides output, it has to run a separate test to tell you whether the relationship is significant. Thus, if doing stats by hand, one has to complete two different formulas—one to get the correlation and one to get the significance.
A. Measures of Association: (See also Appendix A for a quick reference guide.)
Nominal Variables
1. Lambda—a PRE (proportionate reduction in error) approach that essentially tells one how much the prediction of one variable improves with knowledge of another. Lambda ranges from 0 to 1; thus the direction of the relationship between two variables cannot be determined, only the strength. This measure of association is used for nominal level variables and is based on modal categories.
2. Phi—a correlation coefficient used to estimate association in a 2 x 2 table.
3. Cramer’s V—a correlation coefficient used to estimate association in tables larger than 2 x 2.
Ordinal Variables
1. Gamma—the most frequently used PRE measure of ordered cross-tabular association. It ranges from -1 to +1; thus both the direction and the strength of the relationship can be determined with gamma. Gamma is used for ordinal level variables and is based on concordant and discordant pairs. Gamma does not account for tied pairs.
2. Kendall’s tau-b—like gamma, a PRE measure based on concordant and discordant pairs, but it accounts for tied pairs on the dependent variable. Kendall’s tau-b ranges from -1 to +1. Use Kendall’s tau-b for ordinal variables that, when arranged in a table, have an equal number of rows and columns.
3. Kendall’s tau-c—the same as Kendall’s tau-b, but used when the number of rows and columns are not equal.
4. Somers’s dxy—a measure of association that accounts for tied pairs on both the independent and the dependent variable. Somers’s dxy will have different values depending on which variable is X (independent) and which is Y (dependent).
Interval Variables
1. Spearman’s rho—a measure of association, computed on ranks, that is less sensitive to outliers than Pearson’s r. Pearson’s r is usually the preferred statistic, so if outliers are not a problem, use Pearson’s r.
2. Pearson’s r—a measure of association that requires interval level variables. Pearson’s r is used often because it holds strong interpretive power. However, it is sensitive to outliers. In cases where your data hold outliers that cannot be taken out of the analysis, use Spearman’s rho, which is less sensitive to outliers.
B. Tests of Significance (Again, see also Appendix A.)
1. Chi-Square—used for nominal and sometimes ordinal level data. The important thing to remember here is that if you have too many categories, the data will be spread out, which can be problematic for this test: chi-square assumes that there are at least 5 cases in each cell of the table. To avoid this problem, categories can be collapsed (see the discussion under nominal variables in Levels of Measurement). Chi-square can tell you whether the observed frequencies are statistically significant relative to what we might expect by chance. Again, be careful that you are using the right chi-square formula, as there are several uses for chi-square.
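The chi-square statistic just described reduces to summing (observed minus expected) squared over expected across all cells. A sketch for a 2 x 2 table with hypothetical observed counts:

```python
# Sketch: the chi-square statistic for a 2x2 table, computed by hand.
# The observed counts below are hypothetical.
observed = [[30, 20],   # row 1
            [10, 40]]   # row 2

row_totals = [sum(row) for row in observed]        # [50, 50]
col_totals = [sum(col) for col in zip(*observed)]  # [40, 60]
grand = sum(row_totals)                            # 100

chi_sq = 0.0
for i, row in enumerate(observed):
    for j, obs in enumerate(row):
        # Expected cell count under independence: (row total * column total) / n
        expected = row_totals[i] * col_totals[j] / grand
        chi_sq += (obs - expected) ** 2 / expected

print(round(chi_sq, 2))  # 16.67
# df = (r-1)(c-1) = 1; the critical chi-square at alpha = .05 is 3.84,
# so here we would reject the null hypothesis of independence.
print(chi_sq > 3.84)  # True
```

Note that every cell here comfortably exceeds the 5-case minimum the test assumes.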
2. Bivariate Regression—regression can be used with as few as two variables. However, the assumptions of linear regression must be followed, and the causal order of X and Y must be specified correctly. If the causal order is not clear, try running the regression with X and Y flipped around; perhaps a reciprocal relationship is present. A more detailed discussion of regression is provided below under multivariate linear regression.
3. Logistic Regression—logistic regression is used when the dependent variable is dichotomous (only two categories: yes/no, black/white, etc.). Logistic regression is interpreted slightly differently than linear regression, in that the coefficients refer to the change in the likelihood (odds) of Y given a change in X.
IX. Multivariate Relationships:
A. Linear Regression Assumptions: There are several assumptions that must be followed when employing linear regression. As a researcher, if you violate any of these assumptions, you must provide a discussion of the limitations produced by the violations. Some violations are more serious and problematic than others. These assumptions are listed and briefly discussed below; many of them also apply to the logistic and bivariate equations discussed above. They are in no particular order.
1. The relationship between the independent variables (X) and the dependent variable (Y) is linear.
2. Errors are random and normally distributed.
3. The mean of the error is zero.
4. Errors are not correlated with X.
5. No autocorrelation between the disturbances.
6. No covariance between the error and X.
7. The number of observations must be greater than the number of parameters to be estimated (n must be larger than the number of X’s in the equation).
8. There is variability in both the X and Y values. (X and Y should be interval level, but violating this assumption is not a huge deal if Y is interval and at least one X is interval.)
(X can be dichotomous as well, but then the coefficient cannot be directly interpreted; it must be restated into probabilities.)
9. No specification bias of the error.
10. No perfect multicollinearity.
11. The direction of the hypothesis is properly stated (X impacts Y in this manner).
12. X and Y are normally distributed.
B. Multivariate regression assumes that the variables included in the equation are meaningful and make theoretical sense. If one throws every variable possible into the equation, the r2 will be inflated. To avoid this, only include those variables which make theoretical sense, and use the adjusted r2 if there is a significant difference between r2 and adjusted r2.
C. The output from SPSS will provide a table with the standardized and unstandardized beta coefficients. The standardized coefficients reflect the relative importance of any given X in the equation: the largest standardized coefficient signifies the variable with the biggest impact on Y. The unstandardized coefficients tell you the change in Y with a one-unit increase in X. There is also a column (sig) which tells you whether that X variable is statistically significantly related to Y.
D. Regression is an extremely powerful and useful tool in statistics. Please refer to class notes and stats books if you have any questions about regression.
X. Common Mistakes:
A. Forgetting to convert between standard deviation and variance. Some formulas require the standard deviation; others require the variance. Remember that one must compute the variance first and then take its square root to get the standard deviation, because the sum of the deviations from the mean always equals zero.
B. Misstating one-tailed and two-tailed hypotheses. If your hypothesis suggests a direction—greater than or less than (for example, individuals with more education will have a higher income)—then the hypothesis is one-tailed.
If your hypothesis only says that there will be a difference in values—not equal (for example, education has an effect on income)—then your hypothesis is two-tailed. Make sure that the appropriate test is employed based on how your hypothesis is stated.
C. Failing to split the alpha level for two-tailed tests.
D. Misreading the standard normal (z) table.
E. Using n instead of n – 1 degrees of freedom in one-sample t-tests. Make sure to use the appropriate formula to figure out the degrees of freedom; it differs across tests.
F. Confusing confidence level with confidence interval. The confidence level is one minus the significance level of the test (the likelihood of obtaining a given result by chance). The confidence interval is the range of values between the lowest and highest values that the estimated parameter could take at a given confidence level.
G. Confusing interval width with margin of error. The interval width is the distance between the margins of error. For example, a +/-3 margin of error has an interval width of 6.
H. Confusing statistics with parameters. Parameters are characteristics of the population that you usually don’t know. Statistics are characteristics of samples that you are usually able to compute. Essentially, you compute statistics in order to estimate parameters.
I. Confusing the addition rule with the multiplication rule.
J. Forgetting that the addition rule applies to mutually exclusive outcomes.
K. Using a “percentage” instead of a “proportion” in a formula.
XI. What do I do with a data set?
A. There are no hard and fast rules because every research project is different. However, there are a few basic steps in data analysis:
1. Run frequencies on every variable. This will give you a feel for the data and will alert you to any problems or anomalies that need to be addressed.
2. Run a correlation matrix of important variables. Use Kendall’s tau-b for this preliminary analysis.
I know that Kendall’s tau-b is for ordinal level data and not all variables will be ordinal; however, it is a good starting point for seeing what the relationships between variables look like. The matrix might also help you identify degrees of multicollinearity, which will justify combining variables into scales and indices. Once you review the matrix, you may get some concrete ideas about how to proceed with the analyses. Do not use this matrix as a form of hypothesis testing unless both variables you are looking at are ordinal level.
3. Construct scales, indices, and interaction variables. Usually, questions/items which measure similar things are combined into a scale by adding the items together. This step will eliminate multicollinearity problems. Keep in mind that the items you add together must be theoretically or empirically related to each other. Each item must also be measured the same way and in the same direction, and you may have to place weight on items which hold more predictive power. Make sure to run an alpha reliability analysis. Alpha should be at least .60 to justify the scale. However, if it is close and you have sound theoretical arguments for why the item(s) should remain together in scale form, it is OK to keep them together. But you should also analyze the items separately to identify which variables/relationships were most important.
4. Begin hypothesis testing. At this stage, you should be careful about the assumptions you violate, because you are going to use your analyses as tests of hypotheses rather than as exploratory analyses. Choose the appropriate tests based on the requirements and assumptions. Recode variables into scales and indices (and be sure to run reliability analyses). Be aware that you may have to test hypotheses in different ways with different tests depending on the data set. For example, experience may be measured in several ways, and the measures may be at different levels.
You may not be able to combine these measures into a scale or index, which means you will have to analyze each measure of experience separately.
5. Discuss your analyses. Make sure to include limitations and alternative explanations for your results.

APPENDIX A

Tests of Significance
Information Obtained: Tests of significance can only tell you whether or not something is statistically significant, such as:
Whether or not two variables are dependent upon one another (chi-square)
Whether or not the distribution of one variable is not likely based on random chance (chi-square)
Whether or not a sample mean is significantly different from a population mean (t-tests and z-tests for comparing a sample to a population)
Whether or not two sample means are significantly different from one another (t-tests and z-tests for comparing 2 samples)
Whether or not 3 or more groups or samples are significantly different from each other (ANOVA)

T-Tests (Univariate)
Level of measurement: The test variable is measured at either the nominal (when you compare proportions) or interval (when you compare means) level.
Why Use?: To see if a population (based on information from a sample) is significantly different from the overall population. To see if two samples are significantly different.
Step 1: State your assumptions:
Random sampling (independent random sampling if comparing two samples)
Level of Measurement _________
Normal sampling distribution
Step 2: State the null and research hypotheses:
H0: The population (based on the sample) is not significantly different from the overall population. Or: The two samples are not significantly different.
H1: The population (based on the sample) is significantly different from the overall population. Or: The two samples are significantly different.
*You may also state the research hypothesis in a certain direction if you know the direction.
This is called a one-tailed test.*
Step 3: Select the sampling distribution (t distribution) and establish the critical region:
1. Pick your alpha level (for example, .05).
2. Determine if this is a one- or two-tailed test.
3. Determine the degrees of freedom: df = n - 1 (for comparing a sample to a population); df = n1 + n2 - 2 (for comparing two samples).
4. Look up the critical t in the back of the book based on whether it is a one- or two-tailed test and on the alpha.
Step 4: Compute the test statistic (solve the formula to get the obtained t).
Step 5: Interpret your results. Anything beyond the critical t on both tails for a two-tailed test, or beyond one tail for a one-tailed test, means you reject the null hypothesis (the area beyond t is considered the “critical” region). Anything smaller than the critical t for a one-tailed test, or within the region between the critical t’s (plus and minus), means you fail to reject the null hypothesis.
Special Instructions: t can be used when n is smaller than 100. It can also be used when n is larger than 100. When comparing samples (Ch. 8), be sure to compute ALL formulas needed.

Z-Tests (Univariate)
Level of measurement: The test variable is measured at either the nominal (when you compare proportions) or interval (when you compare means) level.
Why Use?: To see if a population (based on information from a sample) is significantly different from the overall population. To see if two samples are significantly different.
Step 1: State your assumptions:
Random sampling (independent random sampling if comparing two samples)
Level of Measurement _________
Normal sampling distribution
Step 2: State the null and research hypotheses:
H0: The population (based on the sample) is not significantly different from the overall population. Or: The two samples are not significantly different.
H1: The population (based on the sample) is significantly different from the overall population. Or: The two samples are significantly different.
*You may also state the research hypothesis in a certain direction if you know the direction. This is called a one-tailed test.*
Step 3: Select the sampling distribution (z distribution) and establish the critical region:
1. Pick your alpha level (for example, .05).
2. Determine if this is a one- or two-tailed test.
3. Look up the critical z in the back of the book based on whether it is a one- or two-tailed test and on the alpha.
Step 4: Compute the test statistic (solve the formula to get the obtained z).
Step 5: Interpret your results. Anything beyond the critical z on both tails for a two-tailed test, or beyond one tail for a one-tailed test, means you reject the null hypothesis (the area beyond z is considered the “critical” region). Anything smaller than the critical z for a one-tailed test, or within the region between the critical z’s (plus and minus), means you fail to reject the null hypothesis.
Special Instructions: z can only be used when n is larger than 100. When comparing samples, be sure to compute ALL formulas needed.

ANOVA (Univariate)
Level of measurement: The test variable is interval (you are comparing means). The groups are based on nominal or ordinal categories.
Why Use?: To see if 3 or more groups or samples are significantly different.
Step 1: State your assumptions:
Independent random sampling
Level of Measurement _________
Normal sampling distribution
Population variances are equal
Step 2: State the null and research hypotheses:
H0: The population means (based on the samples) are not significantly different from one another.
H1: The population means (based on the samples) are significantly different from one another.
Step 3: Select the sampling distribution (F distribution) and establish the critical region:
1. Pick your alpha level (for example, .05).
2. Determine the degrees of freedom between and within: Dfb = k - 1; Dfw = N - k.
3. Look up the critical F in the back of the book based on the degrees of freedom and the alpha.
Step 4: Compute the test statistic (solve the formula to get the obtained F).
Step 5: Interpret your results.
Anything equal to or larger than the critical F means you reject the null hypothesis. Anything smaller than the critical F means you fail to reject the null hypothesis.
Special Instructions: ANOVA should be used when the n’s are comparable in number. When using ANOVA, be sure to compute ALL formulas needed. With ANOVA, there is no such thing as a one- or two-tailed test; the F ratio is based solely on the differences between groups.

Chi-Square (Univariate and Bivariate)
Level of measurement: Nominal and ordinal.
Why Use?: To see if two variables are dependent. To see if the distribution of one variable is not likely based on random chance.
Step 1: State your assumptions:
Random sampling
Level of Measurement _________
Step 2: State the null and research hypotheses:
H0: The two variables are independent. Or: The distribution of ____ is random.
H1: The two variables are dependent. Or: The distribution of ____ is not random.
Step 3: Select the sampling distribution (chi-square) and establish the critical region:
1. Pick your alpha level (for example, .05).
2. Calculate the degrees of freedom: df = (r - 1)(c - 1) for two-variable tests; df = (k - 1) for one-variable tests.
3. Look up the critical chi-square in the back of the book.
Step 4: Compute the test statistic (solve the formula to get the obtained chi-square).
Step 5: Interpret your results. Anything equal to or above the critical chi-square means you reject the null hypothesis. Anything smaller than the critical chi-square means you fail to reject the null hypothesis.
Special Instructions: Don’t use on anything bigger than a 4 x 4 table. If any cell has a frequency of less than 5, do not use.

Measures of Association (Bivariate)
Information Obtained: Measures of association will tell you the strength of a relationship but will not tell you whether or not it is statistically significant.
{For example, the relationship between eating apples and getting cancer may achieve statistical significance (because it is consistent), but it can be either a strong relationship (it only takes a few apples to give you cancer) or a weak one (it takes many apples—like millions—over a lifetime to increase the probability of getting cancer).}
Tests which measure the association between two variables are based on the following assumptions:
Random sampling
Normal sampling distribution
The level of measurement of the variables will determine which test you choose.

Nominal Level Variables
Phi: The formula is based on the chi-square statistic. Use for 2 x 2 tables. Ranges from 0 to +1.
Cramer’s V: The formula is based on the chi-square statistic. Use for tables larger than 2 x 2. Ranges from 0 to +1.
Lambda: A PRE measure (meaning your prediction error is reduced when you have knowledge of the independent variable). Use for any type of table except when the row marginals are vastly different. Ranges from 0 to +1. Multiply lambda by 100 to get the exact percentage reduction in error from knowing the independent variable.

Ordinal Level Variables
Gamma: A PRE measure based on concordant and discordant pairs (gamma does not account for tied pairs).
Kendall’s tau-b (use when the number of rows is equal to the number of columns): A PRE measure based on concordant and discordant pairs. This test accounts for tied pairs (thus it is a little more accurate than gamma).
Kendall’s tau-c (use when the number of rows is NOT equal to the number of columns): A PRE measure based on concordant and discordant pairs. This test accounts for tied pairs (thus it is a little more accurate than gamma).
Somers’s dxy: This test accounts for tied pairs on both the independent and dependent variable. With this formula, you must specify the independent and dependent variables correctly.
Each of these ranges from -1 to +1 (-1 is a perfect negative relationship, +1 is a perfect positive relationship,
and 0 means no relationship).

Interval and Ratio Level Variables
Spearman’s rho (can be used for ordinal level variables that are “continuous”—scales with actual scores—in some cases): Use this formula when you have outliers; this test is less sensitive to outliers. Ranges from -1 to +1. Once you square the obtained Spearman’s rho, you may multiply that number by 100 and interpret it as the percentage of reduced error.
Pearson’s r: This formula is the most often used but is sensitive to outliers. Ranges from -1 to +1 (-1 is a perfect negative relationship, +1 is a perfect positive relationship, and 0 means no relationship). If you square this number, it can be interpreted as the percentage of variation in the dependent variable that the independent variable explains.
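To close, the Pearson's r and bivariate regression coefficients discussed in the sections above can be computed by hand from a handful of sums. The data points below are hypothetical, invented only to show the arithmetic.

```python
# Sketch: Pearson's r and a bivariate least-squares regression by hand.
# The x and y values are hypothetical illustration data.
import math

x = [1, 2, 3, 4, 5]   # e.g., years of education beyond high school
y = [2, 4, 5, 4, 6]   # e.g., income in $10,000s

n = len(x)
mean_x, mean_y = sum(x) / n, sum(y) / n

# Sums of cross-products and squared deviations around the means.
sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
sxx = sum((xi - mean_x) ** 2 for xi in x)
syy = sum((yi - mean_y) ** 2 for yi in y)

r = sxy / math.sqrt(sxx * syy)   # Pearson's r, between -1 and +1
slope = sxy / sxx                # unstandardized b: change in Y per one-unit increase in X
intercept = mean_y - slope * mean_x

print(round(r, 2), round(slope, 2), round(intercept, 2))  # 0.85 0.8 1.8
# Squaring r gives the proportion of variation in Y explained by X:
print(round(r ** 2, 2))  # 0.73
```

Note that r and the slope share the same sign and the same numerator (sxy); r standardizes it by both variables' spread, while the slope scales it to the units of Y per unit of X.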