A Consideration of Practical Significance in Adverse Impact Analysis
Eric M. Dunleavy, Ph.D. - Senior Consultant
July 2010
One of the most frequently used statistical techniques in the EEO context is the adverse impact analysis, which compares the employment consequences of an organizational policy or procedure between two groups. This comparison often simplifies to a test of the difference between two rates or the ratio[1] of those rates. Perhaps most commonly considered in analyses of hiring data, adverse impact analyses often answer this basic question: Are the hiring rates of group 1 (e.g., males) and group 2 (e.g., females) 'meaningfully' different? It is important to note that, in this context, the notion of 'meaningfully different' can be interpreted in more than one way. For example, from a statistical significance perspective, 'meaningfully different' generally means 'probably not due to chance'. In other words, what is the degree of uncertainty inherent in the conclusions of the analysis (i.e., that there is a meaningful difference between two groups)?
From the practical significance perspective, 'meaningfully different' could also mean 'dissimilar enough for the EEO and/or scientific community to notice'. This perspective emphasizes the magnitude or size of the difference. As this notion suggests, practical significance measures include some inherent subjectivity, because the EEO and scientific communities must determine how large a difference (or how much a ratio deviates from 1) must be to become a 'red flag' that may eventually be deemed unlawful discrimination. As described by the OFCCP statistical standards report (1979):
“First, any standard of practical significance is arbitrary. It is not rooted in mathematics or statistics, but in a practical judgment as to the size of the disparity from which it is reasonable to infer discrimination. Second, no single mathematical measure of practical significance will be suitable to widely different factual settings.”

[1] Please refer to Morris and Lobsenz (2000) for a review of tests that focus on the ratio of selection rates.
Practical significance is an important addition to statistical significance in the consideration of potential adverse impact. Because even meaningless group differences will be “statistically significant” with large sample sizes, it is important to determine whether the size of the group difference represents potential discrimination. For example, Dunleavy, Morgan, and Clavette (2010) demonstrated that a 1% difference in selection rates can become statistically significant when the sample size reaches 1,200 applicants per group: a difference in selection rates so small that discrimination cannot reasonably be inferred.
Although concrete practical significance standards are not available for all situations, a
number of practical significance measures have been endorsed by EEO doctrine and accepted by
U.S. courts dealing with EEO claims. Other practical significance measures, while not explicitly
endorsed by EEO doctrine or courts, are generally accepted by the social scientific community.
This paper reviews some practical significance measures that may be useful in the context of
adverse impact analyses. These measures are particularly useful in combination with statistical significance tests.[2]

[2] Note that statistical significance tests like Z and Fisher's exact test are often useful, yet may be trivial in some situations.
Practical significance measures appropriate for adverse impact analysis
Perhaps the most commonly used practical significance measure in the EEO context is
the 4/5th or 80% rule, which uses an impact ratio (i.e., Group A pass rate divided by Group B pass rate) to measure magnitude. Codified in the Uniform Guidelines on Employee Selection Procedures (UGESP, Section 4D), the rule is described as follows:
“A selection rate for any race, sex, or ethnic group which is less than four-fifths (4/5) (or eighty percent)
of the rate for the group with the highest rate will generally be regarded by the Federal enforcement
agencies as evidence of adverse impact, while a greater than four-fifths rate will generally not be
regarded by Federal enforcement agencies as evidence of adverse impact. Smaller differences in selection
rate may nevertheless constitute adverse impact, where they are significant in both statistical and
practical terms or where a user's actions have discouraged applicants disproportionately on grounds of
race, sex, or ethnic group. Greater differences in selection rate may not constitute adverse impact where
the differences are based on small numbers and are not statistically significant.”
Thus, the 4/5th rule is a measure of the magnitude of disparity. As the UGESP definition
points out, the 4/5th rule is endorsed by Federal agencies, yet may need to be interpreted in light
of particular context (e.g., sample size, in combination with statistical significance testing).
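As a rough illustration, the impact ratio computation and the 80% threshold can be sketched in a few lines of Python (the function names and example counts below are ours, purely illustrative):

```python
def impact_ratio(focal_rate: float, reference_rate: float) -> float:
    """Ratio of the focal group's selection rate to the highest (reference) group's rate."""
    return focal_rate / reference_rate

def violates_four_fifths(focal_rate: float, reference_rate: float) -> bool:
    """True when the focal group's rate is less than 80% of the reference rate."""
    return impact_ratio(focal_rate, reference_rate) < 0.80

# Illustrative data: 50 of 100 female applicants hired vs. 70 of 100 male applicants
female_rate, male_rate = 50 / 100, 70 / 100
print(impact_ratio(female_rate, male_rate))          # ~0.714
print(violates_four_fifths(female_rate, male_rate))  # True: 0.71 < 0.80
```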
However, case law suggests that the 4/5th rule can be interpreted as adequate stand-alone evidence in some situations, although it is unclear exactly what circumstances warrant such interpretation.[3] Note that the 4/5th rule is also explicitly endorsed in the Office of Federal Contract Compliance Programs (OFCCP) Compliance Manual (1993; Section 7E06, titled “MEASUREMENT OF ADVERSE IMPACT”):
“80 Percent Rule: OFCCP has adopted an enforcement rule under which adverse impact will not
ordinarily be inferred unless the members of a particular minority group or sex are selected at a rate that
is less than 80 percent or four-fifths of the rate at which the group with the highest rate is selected (41
CFR 60-3.4D, Questions and Answers to Clarify and Provide a Common Interpretation of the Uniform
Guidelines on Employee Selection Procedures (Questions and Answers) (Nos. 10-27)). When a minority
or female selection rate is less than 80 percent of that of White males a test of statistical significance
should be conducted. (See SCRR Worksheets 17-6a, 6b, and accompanying instructions.) The 80 percent
rule is a general rule, and other factors such as statistical significance, sample size, whether the
employer's actions have discouraged applicants, etc., should be analyzed”.
Of course, it is important to note that the 4/5th rule analysis can be inaccurate in some
situations. Shortly after the publication of the UGESP, the management science literature
criticized the rule as stand-alone evidence of discrimination. These reasonable criticisms
centered on (1) a series of inconsistencies regarding the interpretation of the rule (which are
apparent in UGESP and the UGESP Question and Answers) and (2) some poor psychometric
properties of 4/5th rule analyses. Most recently, Roth, Bobko, and Switzer (2006) used simulation research to identify situations where the 4/5th rule provided erroneous conclusions. Specifically, the authors showed that false positives (situations where the 4/5th rule was violated but selection rates were essentially equal in the population of applicants) occurred at an alarming rate, particularly when there were few hires, low minority representation, and small applicant pools. For these and other reasons,[4] most experts in the area of EEO view the 4/5th rule as a general rule of thumb that can be used in combination with other evidence, such as statistical significance testing (Meier, Sacks, & Zabell, 1984). Having said that, little research has offered an alternative rule of thumb, so the 4/5th rule appears to be no worse conceptually than other social scientific rules of thumb for measures such as odds ratios, absolute differences in selection rates, or Cohen's h transformations of the difference. These measures are
described in more detail later in the paper.
Note that the rationale for combining practical and statistical significance results is an intuitive one. In situations where the measures come to identical conclusions, the EEO analyst can usually feel very confident in a finding of meaningful impact or no impact. In other situations, context may play an important role when statistical and practical significance measures produce different conclusions (e.g., when a standard deviation analysis is greater than 2.0 but the 4/5th rule is not violated).

[3] Also note that much of this case law is older, and many rulings were decided in the 10 years after UGESP were codified.

[4] Please refer to Biddle (2005) for a description of how the 4/5th rule was developed, and the somewhat arbitrary nature of this rule of thumb.
Table 1 presents a framework for interpreting statistical and practical significance measures. As the table shows, statistically significant tests paired with meaningful practical measures point toward a disparity from which it is reasonable to infer discrimination. It is probably not reasonable to infer discrimination when a disparity is neither statistically significant nor practically meaningful. In other situations, where the two perspectives disagree, context will play an important role. Note that it is difficult to conclude practical significance in the absence of statistical significance, because we cannot be confident that the difference is 'real'.
Table 1: A Framework for Interpreting Statistical and Practical Significance Measures

                                 Practical Significance Measure Results (e.g., difference, impact ratio, etc.)
Statistical Significance         Meaningful                                    Trivial
Test Results (e.g., Z, FET)
------------------------------------------------------------------------------------------------------------
Significant                      A disparity that is probably reasonable      Somewhere in the middle
                                 from which to infer discrimination           (but chance is probably not
                                                                              an explanation)
Not Significant                  Somewhere in the middle (but chance          A disparity that is probably not
                                 is probably an explanation)                  reasonable from which to infer
                                                                              discrimination
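In code, the framework in Table 1 amounts to a simple two-flag decision. A minimal sketch in Python (the labels paraphrase the table's cells; how each flag gets set depends on the tests and thresholds an analyst chooses):

```python
def interpret_disparity(statistically_significant: bool, practically_meaningful: bool) -> str:
    """Map the two perspectives onto the four cells of Table 1."""
    if statistically_significant and practically_meaningful:
        return "disparity from which discrimination may reasonably be inferred"
    if statistically_significant:
        return "middle ground (but chance is probably not an explanation)"
    if practically_meaningful:
        return "middle ground (but chance is probably an explanation)"
    return "disparity from which discrimination is probably not reasonably inferred"

# Example: Z = 2.4 (significant) but the 4/5th rule is not violated
print(interpret_disparity(statistically_significant=True, practically_meaningful=False))
```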
The issue of inconsistent results across disparity measurement perspectives was considered by the 2nd Circuit in Waisome v. Port Authority (1991), which ruled that practical significance evidence was required even in situations where a disparity was statistically significant at greater than two standard deviations:

“We believe Judge Duffy correctly held there was not a sufficiently substantial disparity in the rates at which black and white candidates passed the written examination. Plainly, evidence that the pass rate of black candidates was more than four-fifths that of white candidates is highly persuasive proof that there was not a significant disparity. See EEOC Guidelines, 29 C.F.R. § 1607.4D (1990); cf. Bushey, 733 F.2d at 225-26 (applying 80 percent rule). Additionally, though the disparity was found to be statistically significant, it was of limited magnitude, see Bilingual Bicultural Coalition on Mass Media, Inc. v. Federal Communications Comm'n, 595 F.2d 621, 642 n. 57 (D.C. Cir. 1978) (Robinson, J., dissenting in part) (statistical significance tells nothing of the importance, magnitude, or practical significance of a disparity) (citing H. Blalock, Social Statistics 163 (2d ed. 1972))…. These factors, considered in light of the admonition that no minimum threshold of statistical significance mandates a finding of a Title VII violation, persuade us that the district court was justified in ruling there was an insufficient showing of a disparity between the rates at which black and white candidates passed the written examination.”
Other practical significance measures have been used by courts as well. For example, numerous courts have evaluated practical significance using the actual percentage difference in selection rates. In Frazier v. Garrison I.S.D. (1993), a four and a half percent difference in selection rates was deemed trivial in a situation where 95% of applicants were selected. A similar practical significance measure was used in Moore v. Southwestern Bell Telephone Co., where the court held that 'employment examinations having a 7.1 percentage point differential between black and white test takers do not, as a matter of law, make a prima facie case of disparate impact. Therefore, there was no meaningful discrepancy between minority and non-minority pass rates based on selection rate differences.'[5]
'Flip-flop' rules have also been endorsed by courts and the EEO community as measures of practical significance. Instead of measuring magnitude, these measures essentially impose some correction for sampling error on a practical significance measure, ensuring that a result would not drastically change if small changes to the hiring rates were made. This rationale is similar to that of statistical significance testing. For example, with regard to the 4/5th rule, Question and Answer 21 from UGESP states:
“If the numbers of persons and the difference in selection rates are so small that it is likely that the
difference could have occurred by chance, the Federal agencies will not assume the existence of adverse
impact, in the absence of other evidence. In this example, the difference in selection rates is too small,
given the small number of black applicants, to constitute adverse impact in the absence of other
information (see Section 4D). If only one more black had been hired instead of a white the selection rate
for blacks (20%) would be higher than that for whites (18.7%). Generally, it is inappropriate to require
validity evidence or to take enforcement action where the number of persons and the difference in
selection rates are so small that the selection of one different person for one job would shift the result
from adverse impact against one group to a situation in which that group has a higher selection rate than
the other group.”
A similar practical significance measure was articulated in Contreras v. City of Los Angeles (1981). In this case, practical significance was assessed via the number of additional 'victim' applicants that would need to be selected to eliminate a significant disparity. Practical significance was also assessed by determining the number of additional 'victim' applicants that would need to be selected to make rates very close between groups (i.e., around 2%).
[5] It is important to note that in both of these cases overall selection rates and subgroup selection rates were very high, and that the 4/5th rule was not violated. It is unclear how differences in selection rates of this magnitude would be interpreted when selection rates are lower such that the 4/5th rule is violated (e.g., 4% vs. 8% and an impact ratio of .50, instead of 92% vs. 96% and an impact ratio of .96). Intuitively, such differences may be treated differently.
Another practical significance measure was used in U.S. v. Commonwealth of Virginia (1978) and in the Waisome case described above. This method required assessing the number of additional 'victim' applicants that would need to be selected to eliminate a statistically significant disparity (e.g., less than 2 standard deviations). In this context, if 'one or two' additional passes from the 'victim' group changed the statistical results, the difference would not be considered practically significant.[6]
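A hypothetical sketch of this 'statistical significance flip-flop' check, assuming a simple two-group hiring table and the pooled two-sample Z statistic as the significance test (the helper names and the one-or-two-selection tolerance are our loose reading of the cases, not a legal standard):

```python
import math

def z_statistic(hired_a: int, n_a: int, hired_b: int, n_b: int) -> float:
    """Pooled two-sample Z statistic for the difference in selection rates."""
    p_a, p_b = hired_a / n_a, hired_b / n_b
    pooled = (hired_a + hired_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return (p_a - p_b) / se

def statistical_flip_flop(hired_a: int, n_a: int, hired_b: int, n_b: int,
                          threshold: float = 2.0) -> bool:
    """True if shifting one or two selections from group A to group B would
    drop an otherwise significant disparity below the threshold."""
    if abs(z_statistic(hired_a, n_a, hired_b, n_b)) < threshold:
        return False  # nothing to flip: the disparity is not significant to begin with
    return any(
        abs(z_statistic(hired_a - k, n_a, hired_b + k, n_b)) < threshold
        for k in (1, 2)
    )
```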
Social scientific trends toward the use of practical significance measures
Practical significance is a general concept that has gained a great deal of support in the social scientific community over the last few decades.[7] As advocated by Kirk (1996) in a special series on practical significance in Educational and Psychological Measurement, it is a concept whose time has come. This is because many in the social scientific community have identified an over-reliance on statistical significance testing in academic and applied research, and have advocated a more balanced set of statistical standards that includes practical significance measures in the form of effect sizes.[8] For example, in the most recent Publication Manual of the American Psychological Association (2010), a failure to report effect sizes (as practical significance measures) is considered a defect in the reporting of research:
“No approach to probability value directly reflects the magnitude of an effect or the strength of a
relation. For the reader to fully understand the importance of your findings, it is almost always necessary
to include some index of effect size or strength of relation in your Results section.”
Additionally, the Journal of Applied Psychology, generally considered a top-tier social scientific
journal, now requires authors to:
“…indicate in the results section of the manuscript the complete outcome of statistical tests including
significance levels, some index of effect size or strength of relation, and confidence intervals” (Zedeck,
2003, p. 4).
[6] Note that this 'statistical significance flip-flop' rule may be somewhat counter-intuitive, since the flip-flop condition equates to the use of statistical significance testing with a lower alpha level (e.g., .04 instead of .05) or a higher standard deviation criterion (e.g., 2.1 instead of 2.0). In this context alpha is being adjusted according to the number of additional selections that would be required to produce non-significant results.

[7] Importantly, many statistically savvy researchers have advocated the use of both practical and statistical significance testing methods for many years (e.g., Cohen, 1988; Henkel, 1976; Reynolds, 1984; Tabachnick & Fidell, 2001). Meier, Sacks, & Zabell (1984) made this same recommendation specifically for analyses of disparity.

[8] In this context, effect size refers to a measure capturing the magnitude of a relation between two variables, using an outcome and a predictor that influences that outcome. Note that this is not simply a probabilistic test as in statistical significance testing. For example, how strongly are gender and the likelihood of being hired correlated in the EEO context? In the discrimination context it is usually hypothesized that gender is a cause or explanation of being hired or rejected.
This paradigmatic shift is particularly noteworthy within the context of applied, present-day EEO research because applicant pools are often very large and thus may produce trivial yet statistically significant results as a function of sample size alone. See Table 2 for an example of a disparity that appears practically meaningless, but as more and more data are collected (via multiplying sample sizes by a constant of 10), the statistical significance test eventually suggests meaningful disparity (at a total sample size of 2,400, or 1,200 per group). In other words, the impact ratio and the difference in rates always suggest trivial disparity and are constant across sample sizes, yet the Z test changes simply as a function of sample size and eventually becomes significant even though the difference in rates is only 1 percentage point. This phenomenon is likely when data are collected over time (e.g., 1 year, 2 years, 3 years, 10 years), across multiple locations, or across multiple jobs.
Table 2: A comparison of practical and statistical measures across sample sizes

# Applicants            # Selections           Selection Rates             Practical Measures        Statistical Test
Males      Females      Males      Females     Total    Males    Females   Impact Ratio  Diff in     SD (Z) test
                                                                                         rates
100        100          99         98          0.985    0.99     0.98      0.99          0.01        0.58
1,000      1,000        990        980         0.985    0.99     0.98      0.99          0.01        1.84
1,200      1,200        1,188      1,176       0.985    0.99     0.98      0.99          0.01        2.01
10,000     10,000       9,900      9,800       0.985    0.99     0.98      0.99          0.01        5.82
100,000    100,000      99,000     98,000      0.985    0.99     0.98      0.99          0.01        18.40
1,000,000  1,000,000    990,000    980,000     0.985    0.99     0.98      0.99          0.01        58.17
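The Z column in Table 2 is easy to reproduce. Below is a short sketch using only the Python standard library (this assumes the common pooled two-sample Z formula; variants such as continuity-corrected tests yield slightly different values):

```python
import math

def z_statistic(hired_a: int, n_a: int, hired_b: int, n_b: int) -> float:
    """Pooled two-sample Z statistic for the difference in selection rates."""
    p_a, p_b = hired_a / n_a, hired_b / n_b
    pooled = (hired_a + hired_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return (p_a - p_b) / se

# Hold the rates fixed at 99% (males) vs. 98% (females) and grow the applicant pools;
# prints 0.58, 1.84, 2.01, 5.82, 18.40, 58.17 as in Table 2
for n in (100, 1_000, 1_200, 10_000, 100_000, 1_000_000):
    z = z_statistic(round(0.99 * n), n, round(0.98 * n), n)
    print(f"n = {n:>9,} per group: Z = {z:.2f}")
```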
Importantly, there are a variety of available practical significance measures capturing the magnitude of the relation between protected group status and the likelihood of an employment decision. For example, the odds ratio, which captures the odds of experiencing a positive employment decision for one group relative to another, provides an intuitive metric of practical significance.[9] This metric has been endorsed by statisticians with expertise in the EEO community (e.g., Gastwirth, 1988). For example, Gastwirth suggested that an odds ratio of 1.4 (or its reciprocal, .70) was a reasonable rule of thumb for moderate disparity, representing the case where one group's odds of a positive employment decision are 1.4 times (i.e., 40% higher than) the other group's odds.
[9] Note that the odds ratio is similar to the impact ratio used for 4/5th rule analysis. However, the odds ratio takes into consideration the rejection rate of each group in addition to the selection rate, while the impact ratio only considers selection rates. This difference can explain situations where the odds ratio and the impact ratio do not provide the same conclusion. For example, if one group is selected at 92% and another group is selected at 96%, the impact ratio suggests a trivial disparity (i.e., an impact ratio of 0.96), whereas the odds ratio suggests a meaningful disparity (i.e., an odds ratio of approximately 0.48).
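A minimal sketch contrasting the two ratios on the illustrative 92% vs. 96% selection rates from the footnote above (the function names are ours):

```python
def impact_ratio(rate_b: float, rate_a: float) -> float:
    """Selection-rate ratio used in 4/5th rule analysis."""
    return rate_b / rate_a

def odds_ratio(rate_b: float, rate_a: float) -> float:
    """Odds of selection for group B relative to group A."""
    return (rate_b / (1 - rate_b)) / (rate_a / (1 - rate_a))

print(impact_ratio(0.92, 0.96))  # ~0.96 -- trivial by the 4/5th rule
print(odds_ratio(0.92, 0.96))    # ~0.48 -- well below Gastwirth's 0.70 benchmark
```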
Measures of association (e.g., Phi) capturing the magnitude of the relation between protected group and employment outcome can also be useful from a practical significance perspective. For readers familiar with the employment testing context, this notion is similar to the 'validity coefficients' used to measure how well a test predicts job performance. Although there are some special considerations for assessing the relationship between two dichotomous variables, the same general 0-to-1 'validity coefficient scale' applies, where 0 indicates no relationship between group and outcome and values closer to 1 indicate a strong relationship. Unfortunately, there are no clear and obvious rules of thumb for these metrics,[10] although a value close to zero can reasonably be interpreted as no relationship, and thus trivial disparity. Cohen's h statistic, which is a transformation of the difference in two rates, may be another useful metric. Although no clear and universal rules of thumb are available for its interpretation, Cohen (who later wrote that he regretted providing rules of thumb that were so easily misapplied) suggested the following starting points, with both measures sketched in code after the list:
0.2 = Small difference in rates
0.5 = Medium difference in rates
0.8 = Large difference in rates
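Both measures are straightforward to compute. A minimal sketch (the 2x2 cell layout and the example counts are illustrative):

```python
import math

def cohens_h(p1: float, p2: float) -> float:
    """Cohen's h: the difference between arcsine-transformed selection rates."""
    return 2 * math.asin(math.sqrt(p1)) - 2 * math.asin(math.sqrt(p2))

def phi_coefficient(a: int, b: int, c: int, d: int) -> float:
    """Phi for a 2x2 table laid out as:
               selected  rejected
      group 1      a         b
      group 2      c         d
    """
    return (a * d - b * c) / math.sqrt((a + b) * (c + d) * (a + c) * (b + d))

# Illustrative counts: 60 of 100 selected in group 1 vs. 40 of 100 in group 2
print(cohens_h(0.60, 0.40))             # ~0.40: between Cohen's 'small' and 'medium'
print(phi_coefficient(60, 40, 40, 60))  # 0.20
```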
It is important to reiterate that the various measures of practical significance may yield differing conclusions. The investigation of the validity of drug testing at the U.S. Postal Service by Normand, Salyards, and Mahoney (1990) provides an excellent example of this potential. Normand et al. found that applicants testing positive for drugs were 48.5% more likely to be heavy users of absenteeism leave than were applicants testing negative for drugs, a difference that was statistically significant. The 48.5% difference is of a magnitude that on its face appears to be of high practical significance, and the odds ratio of 1.97 exceeds Gastwirth's rule of thumb for moderate practical significance. However, if the same data are converted to a correlation coefficient, the resulting correlation of .10 would be considered a low level of practical significance.
Conclusion
Understanding the practical significance of a selection rate difference (or ratio of selection rates) is a critical issue in the high-stakes situation where an employer faces charges of unlawful employment practices that discriminate against a protected group. A number of EEO-endorsed and scientifically based practical significance measures are available for analysis of
traditional employment decision data. These measures may be particularly useful in situations
where sample sizes are very large, and statistical significance testing becomes a meaningless
exercise. In fact, conducting and interpreting statistical significance tests alone when samples are
very large is scientifically unsound and can be potentially misleading. We strongly recommend
that EEO analysts consider both statistical significance tests and practical significance measures
in adverse impact analyses.[11]
[10] The Department of Labor provided some general rules of thumb for interpreting the usefulness of correlations in the testing context, usually when a continuous test score predicts a continuous performance outcome. In this context, a correlation of .11 or less is 'unlikely to be useful', between .11 and .20 'depends on the circumstances', between .21 and .35 is 'likely to be useful', and above .35 is 'very beneficial'. However, given the special statistical case of two dichotomous variables, it is unclear how these DOL rules of thumb apply to adverse impact analyses. Future research should consider this issue.

[11] Again, it is important to note that UGESP may endorse a similar combination of tests, although the exact meaning of the following section (4D) is unclear: 'Smaller differences in selection rate may nevertheless constitute adverse impact, where they are significant in both statistical and practical terms or where a user's actions have discouraged applicants disproportionately on grounds of race, sex, or ethnic group. Greater differences in selection rate may not constitute adverse impact where the differences are based on small numbers and are not statistically significant.'
References
American Psychological Association. (2010). Publication manual of the American Psychological
Association (6th ed.). Washington, DC: Author.
Biddle, D.A. (2005). Adverse impact and test validation: A practitioners guide to valid and
defensible employment testing. Burlington, VT: Gower.
Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Hillsdale, NJ: Lawrence Erlbaum Associates.
Dunleavy, E. M., Morgan, D. M., & Clavette, M. (2010). Practical Significance: A concept
whose time has come in adverse impact analyses. In Morrison, J., & Sinar, E.
(Moderators). The 4/5ths Is Just a Fraction: Alternative Adverse Impact Methodologies.
Symposium presented to the 25th annual SIOP conference in Atlanta, GA, April 2010.
Gastwirth, J.L. (1988). Statistical reasoning in law and public policy (Vol. 1). San Diego, CA:
Academic Press.
Henkel, R. E. (1976). Tests of significance. Sage University Series: Quantitative Applications in the Social Sciences. Newbury Park, CA: Sage Publications.
Kirk, R.E. (1996). Practical significance: A concept whose time has come. Educational and
Psychological Measurement, 56, 746-759.
Meier, P., Sacks, J., & Zabell, S. (1984). What happened in Hazelwood: Statistics, employment discrimination, and the 80% Rule. American Bar Foundation Research Journal, 1, 139-186.
Morris, S.B. & Lobsenz, R.E. (2000). Significance tests and confidence intervals for the adverse
impact ratio. Personnel Psychology, 53, 89-111.
Normand, J., Salyards, S. D., & Mahoney, J. J. (1990). An evaluation of preemployment drug
testing. Journal of Applied Psychology, 75(6), 629–639.
Office of Federal Contract Compliance Programs. (1993). Federal contract compliance manual.
Washington, DC: U.S. Department of Labor.
Reynolds, H. T. (1984). Analysis of nominal data. Sage University Series: Quantitative Applications in the Social Sciences. Newbury Park, CA: Sage Publications.
Roth, P. L., Bobko, P., & Switzer, F. S., III (2006). Modeling the behavior of the 4/5ths rule for
determining adverse impact: Reasons for caution. Journal of Applied Psychology, 91,
507-522.
Sobel, R., Michelson, S., Finklestein, M., Fienberg, S., Eisen, D., Davis, F. G., & Paller, P. E. (1979). Statistical inferences of employment discrimination and the calculation of back pay. Part I: Inferences of discrimination. Unpublished OFCCP statistical standards panel report.
Tabachnick, B., & Fidell, L. (2001). Using multivariate statistics. Needham Heights, MA: Allyn
& Bacon.
Uniform guidelines on employee selection procedures. Fed. Reg., 43, 38,290-38,315 (1978).
Zedeck, S. (2003). Instructions for authors. Journal of Applied Psychology, 88, 35.
Cases Cited
Contreras v. City of Los Angeles (1981) 656 F.2d 1267
Frazier v. Garrison ISD (1993) 980 F.2d 1514
Moore v. Southwestern Bell Telephone Co. (5th Cir.1979) 593 F.2d 607, 608
U.S. v. Commonwealth of Virginia (1978) 620 F.2d 1018
Waisome v. Port Authority of New York & New Jersey (1991) 948 F.2d 1370