Measurement and Statistics Primer

MEASUREMENT AND STATISTICAL ISSUES IN HUMAN RESOURCE
MANAGEMENT
A Primer for the Non-Expert
Timothy A. Judge
Department of Management
Mendoza College of Business
University of Notre Dame
©Timothy A. Judge, 2013
OUTLINE
I. INTRODUCTION
    Importance of Measurement
    Importance of Statistical Analysis
II. FUNDAMENTALS OF STATISTICAL ANALYSIS
    Central Tendency
    Dispersion
    Standard Scores
    Normal Distribution
    Hypothesis Testing
    Errors
    Correlation
    Regression
    Multiple Regression
III. PROBLEMS IN ESTABLISHING CAUSALITY
IV. MEASURING INDIVIDUAL DIFFERENCES
    Reliability
    Standard Error of Measurement
    Validity of Measures
    Criterion-Related Validity
    Content Validity
    Face Validity
    Construct Validity
V. CONFIRMATORY RESEARCH
    Cross-Validation
    Validity Generalization
    Decision Analysis
    Utility Analysis
    Meta-Analysis
VI. COMPUTER PACKAGES
VII. SUMMARY
I. INTRODUCTION
After many years of saving, Jack had accumulated enough cash to buy a local
ice cream shop. One of Jack's first tasks was to figure out how to staff the shop. Being
a novice at this, Jack consulted his friend, Margaret, owner of the local hardware store.
Margaret advised Jack that she used the interview to get "the most knowledgeable
people possible," and recommended it to Jack because her people had "generally
worked out well."
While Jack greatly respected Margaret's advice, upon reflection several
questions came to mind. Given that there are several qualities important to a good ice
cream shop employee, how does one go about identifying and measuring the best
indicators of those qualities? Does Margaret's use of the interview mean that it meets
Jack's requirements? Jack also wondered: if he used the interview, how confident
could he be that his judgments would match someone else's? Jack also needed
to hire a store manager. What characteristics would he need to look for in a strong
leader? Finally, how could Jack test if his chosen method of selecting employees was
effective or ineffective?
Jack also had another set of decisions to make. How could he determine if the
wage he offers differs greatly from the relevant labor market? Jack has heard that
entry-level employees often engage in counterproductive behaviors—stealing, showing
up late, taking off early, giving free ice cream to friends, etc. By what means could he
predict employees’ tendencies to engage in these behaviors in advance? How could
these relationships be compared with findings from other organizations? By what
means could Jack evaluate the effectiveness of a training and development program?
Finally, how can Jack ensure that his human resource decisions are fair and nondiscriminatory? Jack was unsure how to go about answering these questions.
These questions faced by Jack are just a few of the issues confronting
managers of human resources every day. While answering each question requires
knowledge of the specific practice under consideration, it is also essential that the
manager understand the measurement and analytical issues underlying each
question. Without measurement and statistical analysis, evaluation of practices
must be as subjective as Margaret's answer to Jack's question. The purpose of this
primer is to introduce you to the measurement concepts and statistical tools
essential to answer the questions facing managers of human resources, a few of
which were presented above.
Importance of Measurement
Imagine a world in which measurement of individual differences did not
exist, except within the mind of each individual. Every person would have his or her
own measure of a man or woman, but the standard would dwell solely within the
opinions and values of the individual. Inferences made about, and debates over, the
characteristics of individuals would be entirely subjective. Efforts to understand
and predict could not be undertaken because no knowledge would be generally
held. Further, because each individual would have his or her own set of standards
and measurements, general knowledge about people would be difficult to achieve.
Accepted standards of measurement provide a common metric against which
differences between individuals can be judged. To be sure, there is still room for
subjectivity and disagreements. However, measures allow the debate of individual
differences to reach a higher plane. Accepted standards of measurement enable us
to draw inferences based on procedures that have been tried and tested, allowing us
to be more objective and systematic in investigating our attribute(s) of interest.
The better the measure, the less one risks a decision error about the true level
of the attribute. This has direct implications for managers. For example, the better
the measure of friendliness Jack chooses, the fewer customers will be driven away by
employees (mistakenly identified as friendly) providing poor customer service.
Further, if Jack has difficulty measuring friendliness, accurately appraising whether
this is a wise selection strategy will be an arduous task. Finally, selection and
appraisal procedures that are not accurate predictors of true performance often
place one in jeopardy of litigation from disgruntled applicants.
Importance of Statistical Analysis
As just explained, measurement is an essential issue for the manager of
human resources to consider. Yet without analysis of those measures, measurement
itself is futile. It is probably safe to conclude that rather than being beset by a lack of
measurement information, most managers are overwhelmed by too much
information. For example, in formulating selection decisions the manager may have
information on hundreds of candidates on several different predictors. The purpose of
statistics is to make sense out of this mass of information.
As evidenced by Jack's dilemma, the typical manager is faced with a great
deal of uncertainty. While statistical analysis does not eliminate the uncertainty, it
provides the basis for better decisions to be made based on the data at hand.
Further, statistics are the tools that allow us to make inferences about our measures.
How reliable or consistent are the measures of the attribute(s) of interest? How
accurate or valid are they? This paper will introduce the ways in which we can
describe and make inferences about our measures of concern.
II. FUNDAMENTALS OF STATISTICAL ANALYSIS
Measuring individual differences is a detailed issue that will be addressed in
Section IV. However, a pertinent question is: once we have a measurement,
what do we do with it? It is essential that the manager be able to analyze the
numbers measurement provides. Statistics are the methods we use to make sense
out of numbers, both to describe measures of attributes, and to infer knowledge
from them. In short, descriptive statistics are concerned with summarizing data in a
digestible manner; inferential statistics are concerned with estimating the likelihood
of certain phenomena given the results at hand. The statistics reviewed below can
be used for both descriptive and inferential purposes, depending on the goal of the
manager.
Central Tendency
Central tendency designates the typical response of a distribution. There
are three statistics commonly used to indicate central tendency. The mode refers to
the most frequent value. The median is the middle observation, or the point at
which half the observations fall above and half fall below. The mean of a set of
observations is the arithmetic average, or the sum of the set divided by the total
number of observations in the set. The mean is calculated using the following
formula:
$$M = \frac{\sum x}{n}$$
Where:
M = mean
∑x = sum of the observations, x
n = number of observations
As an example, suppose we had the following set of performance scores from a
sample of Jack's employees (on a 100 point scale):
49,54,68,68,75,78,84,91,100
There is only one value, 68, that occurs twice. Therefore, it is the mode. The median
is 75—four observations fall above 75 and four fall below. The mean is 74.1, which
is the sum of scores (667) divided by the number of scores (9).
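The three measures can be checked with a few lines of Python. This is a minimal sketch using only the standard library's `statistics` module; the variable names are illustrative and not part of the primer.

```python
from statistics import mean, median, multimode

# Performance scores from the example above (100-point scale).
scores = [49, 54, 68, 68, 75, 78, 84, 91, 100]

modes = multimode(scores)   # every most-frequent value, as a list
middle = median(scores)     # middle observation of the sorted scores
average = mean(scores)      # sum of the scores divided by their count

print("mode(s):", modes)
print("median:", middle)
print("mean:", round(average, 1))

# Adding another 91 yields two modes, which makes the mode ambiguous.
print("modes with an extra 91:", multimode(scores + [91]))
```

`multimode` returns a list rather than a single value, which makes the "more than one mode" case explicit.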
What are the advantages and disadvantages of each measure of central
tendency? The mode is most appropriate for summarizing qualitative data. For
example, if one was curious about the number of women working at a company
(perhaps to compare female representation of one's company to the relevant labor
market), the mode would describe the most common gender indicated. It may make
less sense to discuss mean or median gender. However, the mode suffers from
several disadvantages that limit its use. First, there may be more than one mode. If
another 91 were added to the above distribution, there would be two modes,
making it an ambiguous measure of central tendency. Second, the mode is very
sensitive to changes in a single value in the distribution. For example, if one of the
applicants scoring 68 instead scored 100, the mode would jump from 68 to 100
even though only one score changed! For these reasons, the mode is generally only
used in describing qualitative data.
The median has the advantage of not being sensitive to extreme values in the
distribution. If the person who scored 68 instead scored 25, the median would not
change (four scores still fall below 75), whereas the mean would change
considerably (69.3). On the other hand, this insensitivity to extreme values can be a
disadvantage. Consider the following tests:
Test #1: 19,25,51,52,53
Test #2: 50,50,51,97,99
The median (51) is the same for both tests even though the placement of values is
radically different. The mean is capable of reflecting this difference (40 for test #1
versus 69.4 for test #2). Thus, sensitivity to extreme values can be both illustrative
and misleading. If the median and mean are vastly different, one should investigate
the cause of the difference, as each may provide an important piece of information
in describing the data.
While the mean and median are both acceptable methods of describing
central tendency, the mean has one characteristic that makes it the most widely
used measure of central tendency: its importance in drawing inferences about
central tendency (for example, to see if the average scores for the above two tests are
significantly different). The median has computational properties that make it
problematic in inferential statistics. Thus, the mean is employed as the measure of
central tendency in most statistical analyses. In a subsequent section we will
illustrate the use of the mean in drawing inferences.
Dispersion
The obvious fact in studying individual differences is that individuals differ.
Dispersion, or variability, indicates the degree to which observations on individuals
depart from central tendency. The most common means of expressing dispersion is
the standard deviation, which indicates how far the observations on average deviate
from central tendency. The equation for the standard deviation (s) is:
$$s = \sqrt{\frac{\sum (x_i - M)^2}{n - 1}}$$
Where:
∑(x_i − M)² = squared deviation of the ith observation, x_i, from
the mean of the observations, M, summed over all
observations
n = number of observations¹
From the previous example, the standard deviation of the first test is 16.6. The
standard deviation of the second test is 26.1. The higher standard deviation of test
#2 indicates that the scores are more dispersed.
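The two standard deviations can be reproduced directly from the formula. The sketch below is illustrative only; the helper name `sample_sd` is an assumption, and it uses the n − 1 denominator given in the equation above.

```python
from math import sqrt

def sample_sd(values):
    """Standard deviation with n - 1 in the denominator, as in the formula above."""
    n = len(values)
    m = sum(values) / n
    squared_deviations = sum((x - m) ** 2 for x in values)
    return sqrt(squared_deviations / (n - 1))

test1 = [19, 25, 51, 52, 53]
test2 = [50, 50, 51, 97, 99]

sd1 = sample_sd(test1)
sd2 = sample_sd(test2)
print(round(sd1, 1), round(sd2, 1))
```

Python's built-in `statistics.stdev` uses the same n − 1 convention and should give the same results.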
Standard Scores
When comparing scores between two or more samples, often the raw value
alone does not provide full information on the relative status of the score. For
example, an individual scoring 80 on test #1 (with a mean of 40) is very different
from scoring 80 on test #2 (where the mean is 69.4). The former is 40 points above
the mean, the latter only 10.6. It is also important to consider, and control for, how
variable the scores are about the mean.
¹ In finding the average deviation, why not simply average the deviations about the mean by
subtracting each observation from the mean and dividing by the number of observations? The
difficulty is that the average signed deviation from the mean is always zero. Therefore, one must take
the absolute average deviation. The easiest way to do this is to square each deviation and then
return it to its original units by taking the square root. If the square root is not taken, it is known as
the variance.

Standard Scores (Z Scores)
Standard scores show the relative status of a score within a distribution, or,
as in the above example, between distributions. A standard score indicates the number of
standard deviations the particular observation is above or below the mean. Therefore, it
adjusts for unequal means and variances between samples. It is calculated as
follows:
$$Z = \frac{x - M}{s}$$
where the terms are as previously defined. Continuing the example of the two tests,
we can calculate a standard score for someone scoring 80 on each test:
$$Z_1 = \frac{80 - 40.0}{16.6} = 2.41$$
$$Z_2 = \frac{80 - 69.4}{26.1} = 0.41$$
The person in the first test, scoring 2.41 standard deviations above the mean, did
relatively better than the individual in the second scoring 0.41 standard deviations
above the mean—even though their absolute score is the same. Standardizing
variables gives us a more complete picture of where the scores stand relative to
others within a distribution or across distributions.²
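As a rough check on the standard-score arithmetic, the sketch below computes both z scores and then standardizes every score on test #1 to confirm that standardized scores have a mean of 0 and a standard deviation of 1. The function name `z_score` is illustrative, not taken from the primer.

```python
from statistics import mean, stdev

def z_score(x, m, s):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - m) / s

# A score of 80 on each test, using the means and standard deviations above.
z1 = z_score(80, 40.0, 16.6)
z2 = z_score(80, 69.4, 26.1)
print(round(z1, 2), round(z2, 2))

# Standardizing a whole distribution gives mean 0 and standard deviation 1.
test1 = [19, 25, 51, 52, 53]
m1, s1 = mean(test1), stdev(test1)
standardized = [z_score(x, m1, s1) for x in test1]
print(round(mean(standardized), 6), round(stdev(standardized), 6))
```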
² The mean of standardized scores is always 0 and the standard deviation 1.

Percentiles
Another way of reporting standard scores is with a score with which the
reader undoubtedly has some experience, the percentile rank. The percentile rank of a score
is the percentage of scores in its frequency distribution that are the same as or lower
than it. For example, if someone scores at the 80th percentile on a measure, the
person scored equal to or higher than 80% of the other people who completed the
measure. The formula for computing percentile rank is:
$$PR = \frac{C_l + 0.5F_i}{N} \times 100$$
Where:
PR = percentile rank
C_l = the count of all scores less than the score of interest
F_i = the frequency of the score of interest
N = the number of individuals in the sample
Returning to Jack’s distribution of scores:
Test #1: 19,25,51,52,53
Test #2: 50,50,51,97,99
For either test, the person who scored 51 would be at the following
percentile:
$$PR = \frac{2 + 0.5(1)}{5} \times 100 = 50 \quad \text{(50th percentile)}$$
For Test #2, the person who scored 50 would be at the following percentile:
$$PR = \frac{0 + 0.5(2)}{5} \times 100 = 20 \quad \text{(20th percentile)}$$
As you can see, percentile rankings change depending on the number and
distribution of scores. For example, if 50 still tied for the lowest score on Test #2 out
of 100 (as opposed to 5) test takers, the percentile rank becomes:
$$PR = \frac{0 + 0.5(2)}{100} \times 100 = 1 \quad \text{(1st percentile)}$$
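The percentile-rank formula translates directly into a short function. This is a sketch under stated assumptions: the name `percentile_rank` is illustrative, and the 98 scores of 60 in the last example are hypothetical filler, chosen only so that two scores of 50 tie for lowest among 100 test takers.

```python
def percentile_rank(score, scores):
    """PR = (count below + 0.5 * frequency of the score) / N * 100."""
    below = sum(1 for s in scores if s < score)
    frequency = sum(1 for s in scores if s == score)
    return 100 * (below + 0.5 * frequency) / len(scores)

test1 = [19, 25, 51, 52, 53]
test2 = [50, 50, 51, 97, 99]

print(percentile_rank(51, test1))  # a score of 51 on test #1
print(percentile_rank(51, test2))  # a score of 51 on test #2
print(percentile_rank(50, test2))  # a score of 50 on test #2

# Hypothetical larger group: two scores of 50 tied for lowest out of 100.
larger_group = [50, 50] + [60] * 98
print(percentile_rank(50, larger_group))
```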
Other Standard Scores
There are other ways of standardizing scores, often for the purpose of
providing feedback. Stanine scores standardize scores on a nine-point scale with a
mean of five and a standard deviation of two. So, for example, the bottom 4% of
scores represent the 1st stanine, the middle 20% of scores represent the 5th stanine,
and the top 4% of scores represent the 9th stanine. T-scores standardize scores so
that the mean is 50 and the standard deviation is 10. T-scores are computed as
follows:
$$T = 50 + \frac{10(X - M_x)}{s_x}$$
Where:
X = Raw score of individual
M_x = Mean score of sample
s_x = Standard deviation of sample scores
Returning again to Jack's scores, the person who scored 51 on Test #1 would have:
$$T = 50 + \frac{10(51 - 40)}{16.58} = 56.63$$
The person who scored 99 on Test #2 would have:
$$T = 50 + \frac{10(99 - 69.4)}{26.12} = 61.33$$
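The T-score calculation can be sketched in the same way. The helper below (an illustrative name, not from the primer) computes the mean and standard deviation from the raw scores instead of using rounded values, so it should agree with the hand calculation to two decimals.

```python
from statistics import mean, stdev

def t_score(x, scores):
    """T = 50 + 10 * (x - M) / s, with M and s taken from the sample itself."""
    return 50 + 10 * (x - mean(scores)) / stdev(scores)

test1 = [19, 25, 51, 52, 53]
test2 = [50, 50, 51, 97, 99]

t_51_on_test1 = t_score(51, test1)
t_99_on_test2 = t_score(99, test2)
print(round(t_51_on_test1, 2), round(t_99_on_test2, 2))
```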
Figure 1
Relationships Among Various Standard Score Measures
Figure 1 shows the relationships among z-scores, percentiles, stanines, T
scores, and the normal distribution. If scores are normally distributed, the
percentile rank is directly analogous to probabilities derived from the normal
distribution, a topic to which we turn next.
Normal Distribution
Observe Figure 2. It could be, for example, a distribution of scores on an
employment test. Note that the distribution is centered on (and has the greatest
frequency about) the mean and is bell shaped, with decreasing frequency of
observations as one gets farther from the mean. Also note that the distribution is
symmetric about the mean. Such a distribution is called a normal distribution.
One rather interesting property of the normal distribution is that approximately
68% of the scores fall within 1 standard deviation of the mean, approximately 95%
within 2, and approximately 99.7% within 3 standard deviations of the mean.

Figure 2
The Normal Distribution
Figure 3
Height and the Normal Distribution
Height is one of many variables that are normally distributed. As we will see, though, it is
important to remember that not everything is normally distributed.
The normal distribution is referred to as "the workhorse of inferential
statistics" because once raw scores have been transformed into z scores, it is very
easy to refer them to tabled values of the standard normal distribution to find
probabilities associated with finding a value within the particular range of interest.
For example, if the population of scores for test #1 is normally distributed, the
probability of observing a z-score greater than 2.41 is about .008, indicating that
fewer than 1% of individuals taking the test can be expected to score above 80.
By contrast, roughly 34% of individuals taking test #2 can be expected to score above 80.
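Instead of a printed table, the standard normal probabilities can be computed directly. The sketch below uses `statistics.NormalDist` from the Python standard library (Python 3.8 or later); the helper name `upper_tail` is illustrative.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0.0, sigma=1.0)

def upper_tail(z):
    """Probability of observing a standard score greater than z."""
    return 1.0 - standard_normal.cdf(z)

p_above_80_test1 = upper_tail(2.41)  # z score for 80 on test #1
p_above_80_test2 = upper_tail(0.41)  # z score for 80 on test #2
print(round(p_above_80_test1, 3), round(p_above_80_test2, 3))

# Share of scores within 1, 2, and 3 standard deviations of the mean.
for k in (1, 2, 3):
    share = standard_normal.cdf(k) - standard_normal.cdf(-k)
    print(k, round(share, 4))
```

These figures hold only to the extent that the scores really are approximately normally distributed.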
While some attributes are approximately normally distributed (height,
weight, intelligence), many are not (income). One cannot use the normal
distribution for inferential purposes without assuming the values are approximately
normally distributed. However, the Central Limit Theorem allows us to assume
that the distribution of sample means is approximately normally distributed as long as the
sample size is sufficiently large (usually at least 30), regardless of the distribution of
individual values. Therefore, even if the population is not normally distributed, the
distribution of sample means drawn from the population is approximately normal. This allows
determination of probabilistic properties associated with mean observations from
the standard normal.

Figure 4
Not All Variables Are Normally Distributed
As you can see from this graph of income in the United Kingdom, income is one of those
variables that is not normally distributed. (Source: Life in the Middle - The Untold Story of Britain's
Average Earners.)
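A small simulation can make the Central Limit Theorem concrete. The sketch below is illustrative only: it assumes a strongly skewed population (an exponential distribution with mean 1 and standard deviation 1, chosen purely for demonstration), draws many samples of size 30, and examines how the sample means behave. The seed and all names are arbitrary.

```python
import random
from math import sqrt
from statistics import mean, stdev

rng = random.Random(2013)  # fixed seed so the run is repeatable

SAMPLE_SIZE = 30
N_SAMPLES = 2000

def draw_sample_mean(n):
    """Mean of n draws from a skewed (exponential) population with mean 1."""
    return mean([rng.expovariate(1.0) for _ in range(n)])

sample_means = [draw_sample_mean(SAMPLE_SIZE) for _ in range(N_SAMPLES)]

grand_mean = mean(sample_means)
spread = stdev(sample_means)
expected_spread = 1.0 / sqrt(SAMPLE_SIZE)  # population SD / sqrt(n)

# If the sample means are roughly normal, about 95% of them should fall
# within 2 standard errors of the population mean.
share_within_2 = sum(
    abs(m - 1.0) <= 2 * expected_spread for m in sample_means
) / N_SAMPLES

print(round(grand_mean, 3), round(spread, 3), round(expected_spread, 3))
print(round(share_within_2, 3))
```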
The standard normal distribution applies when the population standard
deviation is known. In practice, one seldom knows values of the entire population.
When the population variance is unknown, the Student's t-distribution can be used,
which closely resembles the standard normal. Tables for the t-distribution are also
widely published in statistics texts, and are precisely estimated by computer
packages (see section VI).
Hypothesis Testing
Human resource managers often want to make inferences about a population
or populations from which samples have been drawn. Remember that one of the
questions in Jack's mind was how his company's compensation level compared with
the relevant labor market. He may, for example, wish to compare the wage he is
offering to that of a competing company. As another example, Jack may wish to
compare pass rates on his selection measure between minorities and nonminorities
to assess if his hiring procedure adversely impacts upon minorities. For both these
investigations, Jack could take a sample of each group to assess if the means from
each population are equal or unequal. Since the sample drawn will not perfectly
reflect the population, the means will vary due to sampling error. Hypothesis
testing seeks to answer the question: at what point does the difference between the
means become so large that we dismiss the hypothesis that the two population
means are equal? The null hypothesis, denoted Ho, is the hypothesis that is
assumed to be true in producing the sample distribution used in testing the null
hypothesis. Typically, the null hypothesizes no difference between the
populations. The alternative hypothesis, H1, is assumed to be true when the null
is false. It typically posits a difference between the means.
The exact procedures to execute the test vary, depending on the particular
assumptions and samples underlying the test. The computations are explained in
most introductory statistics texts, or conducted on computer (see section VI).
Suffice it to say that a t-statistic is calculated (in place of a z-score because σ is
unknown) and compared to the t-distribution.³ Given, as explained above, that the
sample means will probably differ, the difference can mean two things. It could
simply be due to sampling error or chance variation because we do not have a
perfect picture of the population. On the other hand, it could be an indication that the
two population means are in fact not equal and the difference is not due to error.
Convention is to use .05 (5 chances out of 100 that the difference arises by chance
variation if there is no true difference) as the probability level at which we would
reject the null hypothesis that the means are equal. A t-statistic of ±2 is a good
benchmark: with reasonably large samples, the probability of observing a t-statistic
beyond ±2 when the null is true is about .05. To be sure, when the null is true,
5 times out of 100 we can expect to be wrong in rejecting the null of equal
means. However, .05 is a point at which most are willing to chance a mistake in
order to make inferences about the true nature of events.
³ We are allowed to compare the mean value to the t-distribution because we can assume the means
are approximately normally distributed through the Central Limit Theorem.
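To illustrate the mechanics, the sketch below compares the means of test #1 and test #2 from the earlier sections. It is one possible version of the test, not necessarily the one the primer has in mind: it assumes two independent samples with equal population variances (a pooled-variance t-test), and it hard-codes the two-tailed .05 critical value for 8 degrees of freedom (2.306) from a standard t table instead of computing a p-value.

```python
from math import sqrt
from statistics import mean, variance

test1 = [19, 25, 51, 52, 53]
test2 = [50, 50, 51, 97, 99]

def pooled_t(a, b):
    """Pooled-variance t statistic and degrees of freedom for two independent samples."""
    na, nb = len(a), len(b)
    df = na + nb - 2
    pooled_variance = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / df
    standard_error = sqrt(pooled_variance * (1 / na + 1 / nb))
    return (mean(b) - mean(a)) / standard_error, df

t_stat, df = pooled_t(test1, test2)

# Assumption: two-tailed critical value at the .05 level for df = 8,
# taken from a standard t table.
CRITICAL_T_DF8 = 2.306
reject_null = abs(t_stat) > CRITICAL_T_DF8

print(round(t_stat, 3), df, reject_null)
```

With only five scores per test, the critical value is noticeably larger than the ±2 benchmark, which applies to larger samples.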
Errors
Effective management of human resources necessitates the use of statistics
to make "best guesses" about the true state of affairs when incomplete information
and measurement error exists. Obviously, these educated guesses are not always
correct. In statistical lexicon, mistakes that arise from erroneous inferences are
termed Type I and Type II errors.⁴
If the null hypothesis is true, but Jack rejected it, he has made a Type I error.
This is also represented by the Greek letter α ("alpha"), or the significance level.
When one makes a Type I error, the means differed by a significant amount, but the
difference was due to chance variation (sampling error). This is not the only
mistake Jack needs to concern himself with. He could also make a Type II error, or
falsely accepting the null hypothesis of equal means when they are in fact not equal.
This error is represented by the Greek letter β ("beta"). When one lowers the
probability of rejecting a true null (decreases α), it is more likely that one has
accepted a false null (increases β). For most decisions, it is best to make it difficult
to reject the hypothesis the weight of past evidence supports (the null). That is why
α is generally set quite low (thus increasing β). However, one must be aware of
both errors. Each can be costly. And, all else equal, decreasing one error increases
the probability of committing the other.
⁴ There is nothing magical (or, according to some, even logical) about the p < .05 standard. The
standard originated with one of the towering figures in statistics, Sir Ronald A. Fisher. In 1925, Fisher
suggested the use of a boundary between significance and nonsignificance that was based on
probability. Fisher set this boundary at p = .05; its widespread adoption has led many to question the
wisdom of the standard in theory and in practice (see Cohen, 1994).
Figure 4 illustrates the decisions and results. The probability of accepting a
true null is equal to 1-α. On the other hand, the probability of rejecting a false null,
the other correct decision, is 1-β and is often referred to as the power of
the test. Alpha and beta are as previously defined.

Figure 4
Results of Hypothesis Tests

                        NATURE OF NULL
DECISION        Ho true               Ho false
Accept Ho       Correct (1-α)         Type II error (β)
Reject Ho       Type I error (α)      Correct (power)
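The meaning of α can be illustrated with a simulation in which the null hypothesis is true by construction. The sketch below is a hedged illustration, not a procedure from the primer: both samples are drawn from the same hypothetical normal population (mean 50, standard deviation 10), a pooled-variance t statistic is computed each time, and the share of (incorrect) rejections is counted. The critical value 2.002 is the approximate two-tailed .05 cutoff for 58 degrees of freedom.

```python
import random
from math import sqrt
from statistics import mean, variance

rng = random.Random(42)  # fixed seed so the run is repeatable

N_PER_GROUP = 30
TRIALS = 2000
CRITICAL_T = 2.002  # approximate two-tailed .05 cutoff for df = 58

def pooled_t(a, b):
    """Pooled-variance t statistic for two independent samples."""
    na, nb = len(a), len(b)
    pooled_variance = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / sqrt(pooled_variance * (1 / na + 1 / nb))

def null_is_rejected():
    """Draw two samples from the SAME population and test for a difference."""
    a = [rng.gauss(50, 10) for _ in range(N_PER_GROUP)]
    b = [rng.gauss(50, 10) for _ in range(N_PER_GROUP)]
    return abs(pooled_t(a, b)) > CRITICAL_T

# Every rejection here is a Type I error, because the null is true.
type_i_rate = sum(null_is_rejected() for _ in range(TRIALS)) / TRIALS
print(round(type_i_rate, 3))
```

The observed rate should land near .05, the significance level built into the cutoff.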
Correlation
Remember one of the questions in Jack's mind was how to hire a store
manager. Suppose a friend of Jack’s—Sallie—gave him a dataset from the lifeguard
service she manages (in reality, the data in Figures 5 and 6 are actually on
lifeguards). Sallie’s data shows a relationship between a lifeguard’s personality and
his or her leadership effectiveness. Graphically, the relationship might look like
Figure 5 for Sallie’s lifeguards.
Each point on the graph, called a scatterplot,
represents a lifeguard, having both a score on extraversion and a rating of
leadership effectiveness. By visual inspection one could see that there is a positive
association between extraversion and leadership. Those who are extraverted seem
to make better leaders. However, it is important to have a precise numerical
measure of the association between two variables. A correlation coefficient is a
standardized (controls for differing levels of variance) measure of linear covariation
between two variables. The population correlation, like the population mean and
standard deviation, is unknown and must be estimated from sample data. The
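sample correlation coefficient is calculated by the following formula:
$$r_{xy} = \frac{\sum (x - M_x)(y - M_y)}{\sqrt{\sum (x - M_x)^2 \sum (y - M_y)^2}}$$
With standardized values (z scores), the equation simplifies to:
$$r_{xy} = \frac{\sum z_x z_y}{n}$$
(This form holds when the z scores are computed with n in the denominator of the standard
deviation; with the n − 1 formula given earlier, divide the sum by n − 1 instead.)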
The correlation can range from +1.0 (perfect positive relation between the
two variables) to -1.0 (perfect negative relation). A correlation of ±1 indicates that
knowing the value of one variable allows exact determination of the other's value. A
correlation of 0.0 signifies no relationship between the variables, indicating that
knowing the value of one variable gives us no information about the value of the
other. In the extraversion and leadership example above, the correlation is +.42,
consistent with the visual inspection of Figure 5.⁵
⁵ The reader can be forgiven for underestimating the correlation in Figure 5 from a visual inspection of
the graph. As Hunter and Schmidt (2004) note, when interpreting raw data, we tend to
underestimate the true relationship and overestimate the variability in that relationship (in other
words, think the data are "all over the place" when in fact there is a consistent relationship).

Let's say Jack also received data from Margaret's hardware store, in this
case on the degree to which the employees engaged in counterproductive
work behaviors. This variable of interest, counterproductive work behaviors, is
graphed with conscientiousness in Figure 6. Each data point represents an
employee with a score on conscientiousness and a supervisor rating of the degree to
which the employee engages in counterproductive work behaviors. A visual
inspection gives one the impression that the variables are negatively related.
Indeed, the correlation is −.41. Higher levels of employee conscientiousness are
associated with lower degrees of counterproductive behaviors (as perceived by the
employee's supervisor). From Figure 6, Jack might interpret these data as indicating
that when staffing the ice cream shop, he should give applicants a personality test
(to assess conscientiousness). From Figure 5, Jack might wish to give a measure of
extraversion to those individuals he is considering for store manager. (Shortly, we
will address a question that might come to mind: Can we have any confidence that
validity for one organization or one type of job [in this case, lifeguards or hardware
store employees] would generalize to another organization or another job type [in
this case, ice cream shop employees or store manager]?)

Figure 5
The Relationship Between Extraversion and Leadership

Figure 6
The Relationship Between Conscientiousness and Counterproductive Work Behaviors
Given many possible correlation coefficients based on many different
possible samples from the population, how does one determine if there is a "true"
relationship between the variables? In much the same way as comparing means, we
may test the hypothesis of no relationship between the variables (correlation
coefficient equal to zero) against the alternative of a significant relationship. As in
comparing population means, a test statistic is calculated (here rxy), compared to a
probability distribution (generally the t-distribution) and a probability level
derived. If the probability is less than the significance level, the hypothesis of no
relationship between the variables is rejected. In such a case we would conclude the
"true" relationship is likely to be other than zero.
The larger the sample size, the easier it is to achieve a significant correlation.
For example, a correlation of rxy=.97 is not significantly different from zero at the .05
level when the sample size is 3. However, when n=100 a correlation of rxy=.19 is
significant.⁶
The squared correlation coefficient, r², represents the proportion of total
variance of one variable explained by the other. Therefore, Jack's correlation of .38
between pay and performance represents 14% of the variance in performance
explained by variation in pay. It also leaves 86% unexplained by pay (explained by
other factors). When trying to predict what a person will do in the future, errors are
common.
This simply serves to illustrate that human behavior is somewhat
unpredictable. Thus, it is relatively rare for one variable to explain a majority of
variance in another. This issue will be revisited in subsequent sections.
⁶ The significance test for the correlation coefficient relies on the assumption that the population
values of both variables are normally distributed. When this assumption is in doubt or the
sample size is small, one should use Spearman's rank-order correlation coefficient. The
computational formula is contained in nearly all statistics texts.

Regression
Suppose Jack has operated the store for a year and now wants to estimate his
staffing needs for the upcoming summer ice cream rush. Jack could use past data on
the daily high temperature and the estimated number of workers required that day
(recorded each day over the last year) to predict his staffing requirements for the
upcoming summer. Regression, a prediction of the level of one variable based on
the level of one or more other variables, is perfectly suited for this type of problem.
Suppose Jack had past data on daily high temperatures and numbers of workers
required for the past year. Figure 7 represents these values. Each data point
represents a day in the past year when Jack recorded the daily high temperature
and wrote down his estimate of the optimal number of employees on that day. The
line fitted through the data is called a regression line. It is the "best fit"
line because the sum of squared deviations of the data points from this line is
smaller than for any other possible straight line. It represents the prediction line for the number of workers
demanded for a corresponding high temperature. From this line, the number of
workers Jack needs to hire, based on the forecast high, can be projected.
In regression, the dependent variable is the variable whose value is
influenced by (or depends on) the value of another. In this case, the dependent
variable is the number of workers demanded (total number of workers needed to
staff three shifts). The independent variable is that which induces changes in the
dependent variable. Here, the independent variable is the daily high temperature.
The regression line is estimated by:
$$y = a + bx + e$$
Where:
y = score on dependent variable
a = intercept value
b = slope of the regression line (regression coefficient)
x = score on independent variable
e = error term
[Figure 7. Predicted Demand for Workers: regression line fit plot of estimated workers needed over three 6-hour shifts (vertical axis) against daily high temperature (horizontal axis), with fitted line y = -7.008 + 0.2916x. Estimated workers is based on the past year's data, when on each day Jack wrote down an estimate of the optimal number of workers needed that day.]
Like all other statistics, the population regression equation must be
estimated from sample data. Errors result when the regression line does not
perfectly predict values of the dependent variable. In our example the error term includes all
factors other than temperature that influence demand for workers. The estimated
regression function is $y = -7.008 + 0.2916x$, where y is the predicted value.
Accordingly, for any given value of x (i.e., daily high temperature) we can predict y
(the number of workers required). For example, if the daily high is 60 degrees, Jack
will need an estimated 10-11 workers on his payroll (the actual predicted value is
10.487 workers). If the high temperature is 90 degrees, Jack will need a predicted
19 (exact predicted value = 19.234) workers. The slope value indicates that a 1 unit
change in x induces a b unit change in y. In our example, an increase of 10 degrees
leads to approximately 3 more workers required.7
Several assumptions are required for regression analysis: the independent
variables and error terms are uncorrelated; the mean of all errors is 0; all errors
have equal variances; and the errors are not correlated with one another. The
implications of violating these assumptions are discussed in Kennedy (2008).
It would be useful to determine what proportion of total variability in the
dependent variable is explained by the regression of Y on X. The coefficient of
determination, denoted $R^2$, is the proportion of total sample variability of the
dependent variable explained by the independent variable. It is calculated by
dividing variability explained by the independent variable by total variability (which
is the variance of Y). For example, $R^2 = .68$ in the equation in our example, meaning
68% of the variability in number of ice cream workers required is explained by its
linear dependence on the daily high temperature. In "simple" regression (one
independent variable) such as this, $R^2 = r_{xy}^2$. When predicting human thoughts,
feelings, or actions, one generally has to settle for less variance explained. People are
complicated.
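As a rough illustration of the calculations described above, the following Python sketch fits a least-squares line, computes the coefficient of determination, and confirms that it equals the squared correlation in simple regression. The temperature and worker values are invented for illustration (they are not Jack's data), and the function names are hypothetical. The last function simply evaluates the fitted equation reported in the text.

```python
from math import sqrt
from statistics import mean


def fit_line(x, y):
    """Least-squares intercept a and slope b for y = a + b*x."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx
    return my - b * mx, b


def r_squared(x, y):
    """Variability explained by the line divided by total variability of y."""
    a, b = fit_line(x, y)
    my = mean(y)
    ss_total = sum((yi - my) ** 2 for yi in y)
    ss_explained = sum((a + b * xi - my) ** 2 for xi in x)
    return ss_explained / ss_total


def correlation(x, y):
    """Pearson correlation between x and y."""
    mx, my = mean(x), mean(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    return sxy / sqrt(sxx * syy)


def predict_workers(daily_high):
    """Evaluate the fitted equation reported in the text."""
    return -7.008 + 0.2916 * daily_high


# Hypothetical data: daily highs and estimated workers needed.
temps = [30, 45, 55, 60, 70, 80, 90, 95]
workers = [2, 6, 9, 11, 13, 17, 19, 21]

a, b = fit_line(temps, workers)
print(f"fitted line: y = {a:.3f} + {b:.4f}x")
print(f"R^2 = {r_squared(temps, workers):.3f}")
print(f"r^2 = {correlation(temps, workers) ** 2:.3f}")
print(f"predicted workers at 60 degrees: {predict_workers(60):.3f}")
print(f"predicted workers at 90 degrees: {predict_workers(90):.3f}")
```

Small differences in the third decimal place between these predictions and the values quoted in the text would reflect rounding of the printed coefficients.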
7 When using standardized variables, the intercept drops out (remember z scores have a mean of
zero), and the b coefficient equals the correlation between the dependent and independent
variable.
As with other statistics, we are able to test the b coefficient against zero to
determine if the independent variable is a significant predictor of the dependent
variable.8 We do this by dividing the coefficient estimate by its standard error
(remember that because the population regression coefficients are estimated from
sample data, and because the prediction is not perfect, they are estimated with
error). Calculation of b coefficients is quite laborious and is therefore conducted
using computer packages (see section VI). The null hypothesis is generally b=0 (a
slope of zero), indicating no relationship between the variables. Once the test
statistic is calculated, it is referred to the t-distribution. If the statistic is large
enough to be statistically significant, the null is rejected and it is asserted that values
of Y significantly depend on values of X (or that X significantly predicts Y).
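The test described above can be sketched in a few lines of Python. This is a minimal illustration for simple regression only, assuming the usual formula for the standard error of the slope (residual mean square divided by the sum of squared deviations of x, then square-rooted). The data and the function name are hypothetical, and the resulting statistic would still need to be compared with a tabled t value for n - 2 degrees of freedom.

```python
from math import sqrt
from statistics import mean


def slope_t_statistic(x, y):
    """Return (b, standard error of b, t statistic, degrees of freedom)."""
    n = len(x)
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = my - b * mx
    # Sum of squared prediction errors around the fitted line.
    ss_error = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    df = n - 2
    se_b = sqrt((ss_error / df) / sxx)
    return b, se_b, b / se_b, df


# Hypothetical data: daily highs and estimated workers needed.
temps = [30, 45, 55, 60, 70, 80, 90, 95]
workers = [2, 6, 9, 11, 13, 17, 19, 21]

b, se_b, t, df = slope_t_statistic(temps, workers)
print(f"b = {b:.4f}, SE(b) = {se_b:.4f}, t = {t:.2f}, df = {df}")
# With df = 6, the two-tailed .05 critical value of t is 2.447;
# a statistic larger than this leads to rejecting the null of b = 0.
```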
8 In order to do this, it is necessary to assume that the prediction errors, e, are normally distributed.
This assumption is also dealt with in Kennedy (2008).
Multiple Regression
Remember the example from a few pages earlier regarding the effect of
conscientiousness on counterproductive behaviors? Jack observed a correlation of .41 and concluded
that hiring conscientious individuals should reduce
counterproductive behaviors such as absence, lateness, theft, etc. However, this
conclusion might be suspect without considering the job held by the individual.
Individuals in higher-level positions (like managers) may be less likely to engage in
counterproductive behaviors, since taking a day off may simply leave more work for the
next day. Therefore, job level might confound the relationship between
conscientiousness and counterproductivity. Luckily, there is a procedure that
allows us to control for other influences when investigating the relationship
between two variables. Multiple regression, as a generalization of simple
regression, allows investigation of multiple influences on the dependent variable.
The general form of the equation can be represented as:
$$Y = a + b_1 x_1 + b_2 x_2 + \dots + b_k x_k + e$$
Where $x_1, x_2, \dots, x_k$ represent 1 through k independent variables; all other terms are as
previously defined.
The interpretation of the effect of an independent variable is similar to
simple regression, except that it now measures the effect of one variable holding the
others in the equation constant. Each regression coefficient in multiple regression is
known as a partial regression coefficient because it expresses the partial effect of
its independent variable on the dependent variable.
The value of multiple regression to the human resource manager should not
be underestimated. It allows inferences regarding the influence of one independent
variable on the dependent variable while controlling for the effect of any other
influences the investigator chooses to specify.
In our earlier example, it is possible to investigate the effect of
conscientiousness on counterproductive behaviors controlling for job held. In other
words, for those having the same position in the organization, what is the effect of
conscientiousness on counterproductivity?
Multiple regression is ideally suited for prediction based on multiple sources
of information. For example, suppose Jack decided to predict job performance
based on two selection predictors, collected data on the predictors and the criterion,
and estimated the following regression equation with his sample data:
$$Y = 10 + .3X_1 + .6X_2$$
Jack may then use this equation for future selection decisions. For example, Jack
may wish to predict subsequent job performance for an applicant who scored 50 on
test 1 and 80 on test 2. Assume 65 is the minimum acceptable performance rating.
The applicant's predicted job performance is:
$$Y = 10 + .3(50) + .6(80) = 73$$
Thus, this applicant would be predicted to be successful, albeit marginally, on the
job. If Jack needed to fill 25 positions, he would probably hire the 25 applicants with the highest
predicted job performance.
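The selection rule just described can be sketched as follows. The equation and the minimum acceptable rating of 65 come from the text; the applicant names, their scores (other than the 50 and 80 from the example), and the number of openings are hypothetical values chosen only to show the ranking step.

```python
MIN_ACCEPTABLE = 65  # minimum acceptable performance rating (from the text)


def predict_performance(test1, test2):
    """Predicted job performance from the estimated regression equation."""
    return 10 + 0.3 * test1 + 0.6 * test2


# Hypothetical applicant pool: name -> (score on test 1, score on test 2).
applicants = {
    "A": (50, 80),  # the applicant from the example in the text
    "B": (70, 60),
    "C": (40, 50),
    "D": (90, 85),
}

predictions = {name: predict_performance(*scores)
               for name, scores in applicants.items()}

# Rank applicants from highest to lowest predicted performance.
ranked = sorted(predictions, key=predictions.get, reverse=True)

openings = 2  # hypothetical number of positions to fill
hired = [name for name in ranked
         if predictions[name] >= MIN_ACCEPTABLE][:openings]

for name in ranked:
    print(f"{name}: predicted performance = {predictions[name]:.1f}")
print("hired:", hired)
```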
It is often held that because the weight on $X_2$ is greater than the weight on $X_1$, it is a more
important predictor of the dependent variable (e.g., job performance). This is an
incorrect assertion because the variables may be measured in different units. For
example, measuring pay in dollars versus thousands of dollars would yield a
coefficient one thousand times smaller even though the relationship is no different.
Regression with standardized variables eliminates this problem as all the variables
are forced into the same units. With standardized variables, each regression
coefficient behaves much like a partial correlation coefficient between the particular
independent variable and the dependent variable (the two are not numerically identical,
but they share the same sign and the same significance test). Therefore, it provides
information on the strength of the separate relationship between the independent
and dependent variables, partialling out (i.e., holding constant) the effect of the
other variables. Squaring the partial correlation coefficient indicates the proportion
of the remaining variance in the dependent variable explained by the independent variable, once
the influence of the other variables is removed. In our example, if, once standardized,
X2 had a larger coefficient than X1, X2 would explain more variance in performance.
Thus, without generalizing beyond the sample, X2 would be a stronger (more
important) predictor of the dependent variable.
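The point about units can be checked with a small sketch. For simplicity it uses a single predictor rather than a multiple regression, and the pay and performance figures are invented for illustration. It shows that rescaling pay from dollars to thousands of dollars changes the raw coefficient by a factor of one thousand, leaves the correlation untouched, and that standardizing both variables yields a coefficient equal to the correlation.

```python
from math import sqrt
from statistics import mean, pstdev


def slope(x, y):
    """Least-squares slope of y on x."""
    mx, my = mean(x), mean(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    return sxy / sxx


def correlation(x, y):
    """Pearson correlation between x and y."""
    mx, my = mean(x), mean(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    return sxy / sqrt(sxx * syy)


def standardize(values):
    """Convert raw scores to z scores (mean 0, standard deviation 1)."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / s for v in values]


# Hypothetical data: annual pay and performance ratings.
pay_dollars = [28000, 31000, 35000, 40000, 46000, 52000]
performance = [55, 62, 60, 71, 75, 80]
pay_thousands = [p / 1000 for p in pay_dollars]

b_dollars = slope(pay_dollars, performance)
b_thousands = slope(pay_thousands, performance)
b_standardized = slope(standardize(pay_dollars), standardize(performance))
r = correlation(pay_dollars, performance)

print(f"b (pay in dollars)   = {b_dollars:.6f}")
print(f"b (pay in thousands) = {b_thousands:.3f}")
print(f"standardized b       = {b_standardized:.3f}")
print(f"correlation r        = {r:.3f}")
```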
The coefficient of determination in multiple regression has a comparable
interpretation to simple regression. $R^2$ reflects the proportion of total variance in
the dependent variable explained by the set of independent variables. For example,
$R^2 = .50$ indicates that 50% of the variance in the dependent variable is explained by
the independent variables.9
III. PROBLEMS IN ESTABLISHING CAUSALITY
One must be cautious in attributing causality using correlation and
regression. By themselves, they do not establish the direction of causality between variables.
Consider a correlation one might find between pay and performance such that those
who earn more have higher performance ratings. How does one interpret this?
High performers are generally paid more for their accomplishments
(performance → pay). However, high pay also serves as an incentive to greater
efforts (pay → performance). Thus, in this example it is impossible, merely by looking
at a correlation or regression coefficient, to attribute causal direction. In such cases,
tighter controls, either in research designs or statistical controls, are needed before
causal inferences can be drawn (see Schwab & Trevor, 2012, for further discussion).
9 Non-linear regression models can be estimated, often with a substantial increase in predictive accuracy. For
example, one can see that the scatterplot in Figure 7 is not linear: as you might expect, changes in
temperature lead to greater differences in estimated demand for workers at high temperatures than
at low temperatures (i.e., the difference between a high of 80° and 70° leads to a greater change in
workers needed than the difference between a high of 30° and 20°). The relationship is exponential,
and there are various ways to model such relationships (see Kennedy, 2008).
IV. MEASURING INDIVIDUAL DIFFERENCES
From Plato to Darwin to managers in search of productive workers, the
fundamental differences among individuals have at once been an obvious fact and a source
of fascination. The first task of a manager making differentiations between people
(whether for hiring, compensating, training, or appraising employees) is to measure
the differences. Measurement is the assignment of numbers to objects, attributes,
or events. In many cases, measurement is both critical and difficult. In trying to
assess human thought and behavior, measurement is particularly difficult. Two
central means of evaluating the quality of our measures are reliability and validity.
Each will be explored in turn.
Reliability
Remember Jack's concern about whether his judgments when interviewing
applicants would be consistent with those of others? For example, if Jack's assistant
manager also interviewed applicants, to what degree would their evaluations agree?
This is an issue of reliability, or the consistency or reproducibility of a measuring
instrument. If Jack found that their judgments were often quite different, Jack might
question the reliability of their evaluations, and the usefulness of the procedure. A
test, set of evaluations, or survey items that do not correlate well with themselves
can hardly be expected to correlate with any variable of interest. Thus, reliability is
an essential starting point in measurement and statistical analysis.
Reliability theory posits that variation in scores, for example on an
employment test, appraised performance, or job satisfaction survey, is composed of
variation in "true" scores (i.e., reflecting variation in true ability or performance)
plus variation due to error in the measuring instrument. Or,
$$\sigma^2 = \sigma_t^2 + \sigma_e^2$$
Where:
$\sigma^2$ = total variance in scores, as defined earlier
$\sigma_t^2$ = variance in "true" scores
$\sigma_e^2$ = error variance
The more of the total variance that is due to true differences between individuals, and the less
that is due to inconsistencies (which produce variance) in the measuring instrument, the more
reliable the measuring device.
In classical reliability theory, the reliability coefficient is represented as:
$$r_{xx} = \frac{\sigma_t^2}{\sigma^2} = 1 - \frac{\sigma_e^2}{\sigma^2}$$
The higher the proportion of "true" variance to total variability (or the lower the proportion of
error to total variance), the higher the reliability of the measuring instrument. Just
as $r^2$ tells us the percentage of total variance shared by the variables, and $R^2$
indicates the proportion of variance in the dependent variable explained by the
independent variable(s), the reliability coefficient itself (not its square), theoretically,
reveals the proportion of total variance in the measured variable due to "true"
differences in individuals. If we had true scores, we could calculate reliability in this
manner.
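[Figure 8. A Tale of Two Tests. Two tests with the same total variance ($\sigma^2$). Test #1: 80% true variance ($\sigma_t^2$), 20% error variance ($\sigma_e^2$). Test #2: 20% true variance ($\sigma_t^2$), 80% error variance ($\sigma_e^2$).]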
Variability alone does not determine reliability; it is the proportion of true
variance to total. For example, Figure 8 shows two tests with the same level of total
variance, $\sigma^2$. Yet test 1 is much more reliable than test 2, as 80% of the total
variance in test 1 is due to variation in individual characteristics ("true" variance)
and only 20% due to error. However, in test 2, only 20% is "true" variance, and
80% measurement error.
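The variance decomposition behind the reliability coefficient can be written out as a short sketch. It assumes, hypothetically, that the true and error variances are known (which, as noted, is never the case in practice); the function names and the variance figures of 80 and 20 are illustrative, chosen to mirror the 80%/20% split of test 1.

```python
def reliability_from_true(true_variance, error_variance):
    """Reliability as true variance divided by total variance."""
    total_variance = true_variance + error_variance
    return true_variance / total_variance


def reliability_from_error(true_variance, error_variance):
    """Equivalent form: one minus error variance over total variance."""
    total_variance = true_variance + error_variance
    return 1 - error_variance / total_variance


# Hypothetical variances mirroring test 1: 80% true, 20% error.
r_xx = reliability_from_true(80, 20)
print(f"reliability (true / total)      = {r_xx:.2f}")
print(f"reliability (1 - error / total) = {reliability_from_error(80, 20):.2f}")
```

Both forms return the same value, which is the sense in which a test with proportionally more "true" variance is the more reliable one.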
In practice, since true scores are never known, reliability must be estimated
from the data obtained from our measuring instruments. One of the more obvious
means of estimating reliability is test-retest, where the same form of a test is
administered twice to the same applicants (after a suitable time period) and the two
scores are correlated. One potential drawback of the test-retest estimate is that any
variable that influences one administration and not the other will reduce reliability.
Another problem with the test-retest method is that the individual may remember
responses from the first test or assessment, or consistently guess in the same
manner on both tests.
Perhaps the most popular method of estimating reliability is internal
consistency, which holds that items from the same test should predict the total score
equally well regardless of where they are placed in the test. One approach is to
correlate one half of the test with the other half, a split-half reliability. Because
reliability increases with test length and the split-half method cuts length in half, the
obtained correlation is a conservative estimate of the true reliability of the test. The
Spearman-Brown prophecy formula is often used to correct for this reduced
reliability:
$$r_{11} = \frac{2 r_{xx}}{1 + r_{xx}}$$
Where $r_{11}$ is the corrected correlation and $r_{xx}$ is the correlation between the halves.
Perhaps the most sophisticated measure of internal consistency is Cronbach's alpha
(Cronbach, 1951), which yields the mean correlation between all possible half-splits.
Cronbach's alpha is available in most computer packages (see section VI). It can be
calculated manually with the following formula:
$$\alpha = \frac{N \times \bar{r}}{1 + ([N - 1] \times \bar{r})}$$
Where
$\alpha$ = coefficient alpha
N = number of items in measure
$\bar{r}$ = average correlation among items
For example, if Jack wished to measure extraversion with a 10-item scale,
and the average correlation among those 10 items was $\bar{r} = .40$, then $\alpha$ is:
$$\alpha = \frac{10 \times .40}{1 + ([10 - 1] \times .40)} = \frac{4}{1 + 3.6} = \frac{4}{4.6} = .87$$
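Both internal consistency formulas are simple enough to compute directly. The sketch below reproduces the 10-item example from the text; the function names are hypothetical, and the split-half correlation of .60 is an invented value used only to show the Spearman-Brown correction.

```python
def spearman_brown(r_halves):
    """Correct a split-half correlation to full test length."""
    return 2 * r_halves / (1 + r_halves)


def alpha_from_mean_r(n_items, mean_r):
    """Coefficient alpha from the average inter-item correlation."""
    return (n_items * mean_r) / (1 + (n_items - 1) * mean_r)


# The example from the text: 10 items with an average correlation of .40.
alpha = alpha_from_mean_r(10, 0.40)
print(f"alpha = {alpha:.2f}")

# Hypothetical split-half correlation of .60, corrected for test length.
print(f"corrected split-half reliability = {spearman_brown(0.60):.2f}")
```

With only two "items" (the two halves), the alpha formula reduces to the Spearman-Brown formula, which is one way to see that the two estimates belong to the same family.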
What is an acceptable level of reliability? It depends on several factors.
Though most researchers appear to adhere to an "$\alpha \geq .70$ is acceptable, $\alpha \geq .80$ is
good" rule, such simplistic rules do as much harm as good. For example, longer tests
can be expected to be more reliable than shorter tests. Internal consistency
estimates can also be expected to be higher than inter-rater estimates. A coefficient
alpha of .70 on a long test might be considered only marginally reliable,
whereas a correlation of .60 between interviewer judgments might be thought of as
quite good. Reliabilities below .50 are seldom considered adequate regardless of the
method used to estimate reliability.
There are many factors that influence the reliability of a measuring
instrument. As mentioned earlier, larger sample sizes (more is known about the
population) and a larger number of test items or raters (using 10 predictors to select people
is more likely to yield a consistent estimate of their ability than a single item)
increase reliability. Finally, heterogeneity in the individual difference being
measured serves to increase reliability, as there is more variance to be explained.
Standard Error of Measurement
The standard error of measurement indicates the degree of error expected
in an individual's score. If an individual were to take the test (or be evaluated)
many times, his or her scores would vary, and we expect that variance to follow a
normal distribution. More scores should be near the individual's true score than far
away. The mean of this distribution is the individual's true score, and the standard
deviation is the standard error of measurement (abbreviated $\sigma_{meas}$). The $\sigma_{meas}$
represents the average error in the measurement device. As with all normal
distributions, 68% of the scores lie within 1 standard deviation of the mean, 95%
within 2, and so on. The standard error of measurement may be expressed as:
$$\sigma_{meas} = \sigma_x \sqrt{1 - r_{xx}}$$
As one can see, $\sigma_{meas}$ is determined by both the standard deviation of the scores and the
reliability of measurement. If reliability is perfect ($r_{xx} = 1.0$), there is no error in
estimating an individual's true score. Perhaps the most important use of $\sigma_{meas}$ for
human resource managers is that it enables us to make inferences about true scores.
For example, if the standard deviation on Jack's employment test is $\sigma_x = 4$, and
reliability for the test is $r_{xx} = .80$, then $\sigma_{meas} = 1.79$. If an individual scores 80, Jack can
be 68% confident that the individual's true score is within ±1.79 points of their
obtained score (roughly between 78 and 82), and 95% confident that their true
score is between 76.4 and 83.6 (±3.58 points). This also provides useful information
in determining whether two scores are significantly different. If the lower limit of
the higher score is above the upper limit of the lower score, then we can conclude
the two scores are significantly different. For example, following the example above,
if one applicant scored 80 and another scored 72, Jack can be 95% confident that the
two scores are different (that the first applicant truly has a higher score).
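The worked example above can be reproduced with a short sketch. The standard deviation of 4, the reliability of .80, and the scores of 80 and 72 come from the text; the function names are hypothetical, and the comparison rule is the one described in the text (non-overlapping bands of two standard errors of measurement around each score).

```python
from math import sqrt


def standard_error_of_measurement(sd, reliability):
    """Standard deviation of scores times the square root of (1 - reliability)."""
    return sd * sqrt(1 - reliability)


def true_score_band(score, sem, n_sems):
    """Lower and upper limits n_sems standard errors around an obtained score."""
    return score - n_sems * sem, score + n_sems * sem


def scores_differ(higher, lower, sem, n_sems=2):
    """True if the lower limit of the higher score exceeds the
    upper limit of the lower score."""
    lower_limit_of_higher = true_score_band(higher, sem, n_sems)[0]
    upper_limit_of_lower = true_score_band(lower, sem, n_sems)[1]
    return lower_limit_of_higher > upper_limit_of_lower


sem = standard_error_of_measurement(sd=4, reliability=0.80)
print(f"standard error of measurement = {sem:.2f}")

low68, high68 = true_score_band(80, sem, 1)
low95, high95 = true_score_band(80, sem, 2)
print(f"68% band: {low68:.2f} to {high68:.2f}")
print(f"95% band: {low95:.2f} to {high95:.2f}")

print("80 vs 72 differ:", scores_differ(80, 72, sem))
```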
Validity of Measures
Suppose that Jack and his assistant manager each interviewed applicants and
then rated them on a 1 to 10 point scale. Jack found that the correlation between
their ratings was r=.75. One might be tempted to conclude that Jack and his
assistant must do a good job of selecting applicants since they have fairly consistent
evaluations.
However, reliability of measurement does not necessarily imply
accuracy of judgment. For example, weight can be measured quite reliably, but
surely is not an accurate predictor of performance for most jobs. Similarly, while
Jack and his hand-picked assistant's judgments are consistent, it could be because
they both evaluate applicants on criteria not strongly related to job performance
(e.g., appearance).
The above example illustrates the importance of validity in human resource
management. Validity refers to how well the instrument measures or predicts the
criterion. If we have information from a measurement device, how much does that
information help in predicting the criterion of interest? If the highest (lowest)
scores on a predictor always led to the highest (lowest) scores on the criterion, our
predictor would be perfectly valid. Unfortunately, in practice this does not occur.
The question then becomes: how does one go about designing a measure to be as
valid as possible and evaluating if a given measure is valid? Strategies used to
establish validity depend on both the specific use of the measuring instrument and
the data collection constraints imposed on the organization.
The primary validation strategies can be classified as either empirical or
logical. Empirical strategies estimate the validity of a procedure by examining the
correlation or regression coefficient between the predictor and the criterion. High
correlation coefficients imply high validities.10 The most important empirical
strategy is criterion-related validity. Logical strategies establish validity by
evaluating how well the measuring device samples the criterion. The important
logical strategy is content validation. Face validity is an informal method, neither
logical nor empirical. Construct validity is actually a combination of empirical and
logical strategies that enable us to understand the factors that cause variation in the
criterion. Each will be explained in turn.
Criterion-Related Validity
Criterion-related validation is employed when one wishes to quantitatively
estimate the relationship between a predictor and the criterion. For example, if Jack
were to relate (using correlation or regression) interview or test scores to job
performance in evaluating the accuracy of the predictor, he would be using a
criterion-related strategy. Those predictors explaining the most variance in the
criterion are the most valid and will be preferred. There are two specific variants of
criterion-related strategies: predictive and concurrent.
Predictive Validation
In predictive validation, the predictor is measured at one point in time and
information on the criterion is gathered at a later date. Then, the two sets of
information are correlated.
Perhaps the "purest" way to conduct predictive
validation in selection decisions, for example, is to gather information on the
predictor and then select applicants on the basis of some other predictor.
10 The following formula is used to estimate the validity if there was no measurement error
(reliability was perfect):
$$r_{xy} = \frac{r_{xy(obs)}}{\sqrt{r_{xx}} \sqrt{r_{yy}}}$$
Where $r_{xy}$ = estimated true correlation; $r_{xy(obs)}$ = observed correlation; $r_{xx}$ = reliability of predictor;
$r_{yy}$ = reliability of criterion. This is the best estimate of the true validity of the measure.
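The correction for measurement error given in footnote 10 (dividing the observed correlation by the square roots of the two reliabilities) can be sketched as follows. The observed validity of .30 and the reliabilities of .80 and .60 are hypothetical values chosen for illustration, as is the function name.

```python
from math import sqrt


def correct_for_unreliability(r_observed, rel_predictor, rel_criterion):
    """Estimated validity if predictor and criterion were measured
    without error."""
    return r_observed / (sqrt(rel_predictor) * sqrt(rel_criterion))


# Hypothetical values: observed validity .30, predictor reliability .80,
# criterion reliability .60.
r_true = correct_for_unreliability(0.30, 0.80, 0.60)
print(f"estimated true validity = {r_true:.3f}")
```

Because both reliabilities are at most 1.0, the corrected value is never smaller in magnitude than the observed correlation.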
For most organizations this "pure" method of validation is impractical. It is
costly to administer a test when no direct result is forthcoming. This is particularly
true in the case of the purest predictive design, which would require hiring all
applicants. A more realistic but still costly method would entail giving applicants
the test but ignoring the results from this test when making hiring decisions. A
problem with both of these methods is that managers need validation information
quickly to avoid costly mistakes in the immediate future. “If the test you want me to
use is so good,” Jack might ask, “Why can’t I use it now?”
Imagine that we used a predictive validation design whereby we administer the
measuring device (e.g., an employment test) to applicants, select on the basis of
those scores, and later correlate predictor scores with measures of job performance.
Why is this a problem? The primary problem with this strategy is that the correlation
underestimates the true relationship between the test and performance because of
restriction of range in the predictor. Because only those who scored above the
cutoff point on the predictor were hired, we never know how those who were not
hired would have scored on the criterion (job performance). Figure 9a shows a
"true" relationship between the test and job performance of r=.56.
If the organization were to select on the basis of test scores, Figure 9b indicates that, because
the range is restricted, only information to the right of the cutoff $X_c$ is considered,
and the obtained correlation coefficient would drop to r=.19 even though the true
relationship was still r=.56. It is possible to estimate the correlation between the
predictor and criterion if no restriction of the range existed.
[Figure 9. Effect of Range Restriction on Observed Correlation. Both panels plot performance rating (1 to 10) against score on the selection test (70 to 140). Figure 9a: relationship without range restriction ($r_{xy} = .56$). Figure 9b: observed relationship with range restriction ($r_{xy} = .19$). Only those applicants with scores above 100 on the selection test were hired. Thus, in validating the test, performance ratings of those not hired are not available. This range restriction downwardly biases the observed correlation (if they had been hired, the observed correlation would have been r=.56).]
The formula to correct estimated validities for range restriction is relatively
complicated. This formula is provided below:
$$r_t = \frac{r_{xy}\,(s_t / s_r)}{\sqrt{1 - r_{xy}^2 + r_{xy}^2 (s_t / s_r)^2}}$$
Where:
$r_t$ = estimated "true" correlation between predictor and criterion
$r_{xy}$ = observed correlation between predictor and criterion
$s_t$ = standard deviation of predictor for total sample (estimated on
applicant pool)
$s_r$ = standard deviation of restricted sample
In essence, this formula estimates what the distribution of test scores and job
performance would have looked like if all applicants were hired. As such, it is a
hypothetical means of projecting what the validity would be if all information was
available.
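A minimal sketch of this correction follows, assuming the standard formula for direct range restriction on the predictor (the observed correlation scaled by the ratio of unrestricted to restricted standard deviations, divided by the square root of one minus the squared observed correlation plus the squared observed correlation times the squared ratio). The standard deviations of 15 and 7.5 are hypothetical; only the observed correlation of .19 is taken from the example in the text.

```python
from math import sqrt


def correct_for_range_restriction(r_xy, sd_total, sd_restricted):
    """Estimate the correlation that would be observed without
    restriction of range in the predictor."""
    ratio = sd_total / sd_restricted
    return (r_xy * ratio) / sqrt(1 - r_xy ** 2 + (r_xy ** 2) * (ratio ** 2))


# Observed validity among those hired (from the text) and hypothetical
# standard deviations for the applicant pool and the hired group.
r_corrected = correct_for_range_restriction(0.19, sd_total=15, sd_restricted=7.5)
print(f"corrected validity = {r_corrected:.3f}")
```

When the two standard deviations are equal (no restriction), the function returns the observed correlation unchanged, which is a quick sanity check on the formula.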
Concurrent Validation
Perhaps the most expedient method of empirical validation is concurrent
validation. In this case, present employees are administered the employment test,
and their most recent performance ratings are correlated with their test scores.
While this approach is convenient, particularly under time constraints, there are
several potential problems. First, it is not clear that current job holders are as
motivated to do well on the predictor (after all, their employment does not hinge on
performance on the test) as actual applicants for the job. Further, how would those
who quit or were fired have tested? This restriction in range of the criterion
attenuates the observed predictor-criterion relationship. Perhaps most importantly,
concurrent designs may be biased by the effect of job experience on the test. Almost
certainly individuals learn skills on the job that are related to the skills assessed by
the employment test.
One approach to mitigate this bias is to control (using
multiple regression) for experience in predicting job performance.
Content Validity
Criterion-related validation strategies concern the extent to which the
predictor is a significant sign of the criterion. Content validation concerns the
degree to which the measurement device is an adequate sample of the criterion. In
other words, a test is content valid if it adequately represents the criterion of
interest. For example, Jack might consider a test content valid if it entails evaluating how nimbly
the applicant scoops ice cream into the cone and serves it to customers. Though there are
metrics or statistics to assess content validity, typically it is
ascertained by subjective judgments.
If one does not use quantitative results to evaluate the content validity of a
test, how does one go about establishing validity? Typically, an expert or experts
evaluate how well the content of the test represents job performance. In short, the
knowledge, skills, and abilities (as identified by a job description and specification)
required to perform the job must be reflected in the test for it to be judged content
valid. Because content validity is judgmental, it is crucial that those who evaluate
the content of the test be experts regarding the job in question, and be supplied with
accurate information on the test and criterion.
Face Validity
Face validity refers to whether individuals taking the test believe it to be a
valid measure of the criterion. In short, is the test valid on "the face of it?" While
this is an informal and entirely subjective method, it can be very important to
organizations. If applicants view the test as a poor method of selection, using the
test might generate more resentment than it is worth. Although applicants judge a
test as fair or unfair on many grounds, content validity would appear to be one way
to increase face validity. For example, a work sample test (e.g., in Jack’s case, having
applicants scoop ice cream and serve it to customers) would likely be judged to have
high content validity because it samples a key aspect of performance. For the same
reason it should also have high face validity. Thus, content valid tests will almost
always be face valid, although the reverse is not necessarily true.
Construct Validity
Construct validity has as its goal to understand the trait or construct that the
test measures. Because it entails more than prediction or sampling, it is a more
rigorous method of validation. While construct validation can be conducted in many
different forms, several of the more common are: 1) correlations between several
different measures of the construct; 2)
expert judgment regarding the
appropriateness of the test in sampling or predicting the underlying construct; 3)
correlational relationships between the measures and behaviors purportedly
manifested by the construct.11
11 There are more advanced methods (such as factor analysis) and concepts (such as convergent and
discriminant validity) designed to assess construct validity (see Schwab & Trevor, 2012).
For example, suppose Jack wished to assess the construct validity of an
integrity test. Does the test allow Jack to understand what integrity is and how well
it is measured by the test? If Jack knew that his test was highly correlated with
other measures of honesty (#1), was rated as appropriate by experts on the subject
(#2), and showed a strong negative correlation with stealing (#3), it would
provide some evidence of the construct validity of the measure. While construct
validation is demanding, the conclusions one can draw about applicants based on the
test are stronger, as one has a better understanding of the construct underlying the
scores.
Cross-Validation
How does one know if a validity coefficient calculated from one sample will
apply to other samples of interest? Cross-validation is the procedure by which one
demonstrates whether a predictor validated from the present sample continues to
be a valid predictor when applied to another sample. Cross-validation is important
in selection because a prediction scheme (for example, weights on various
predictors) is often applied to many samples subsequent to the one in which it was
originally developed. It is crucial, therefore, to investigate how valid this scheme is
on the various samples to which it might be applied.
Cross-validation generally begins by gathering predictor and criterion
information on the current sample and then calculating a correlation coefficient or
regression equation. Next, predictor and criterion information is gathered from a
separate, independent group. Criterion scores for this second group are then
predicted from their predictor scores using the equation (or weights) derived from
the original sample. Finally, the predicted and actual criterion values are
correlated. The higher this correlation, the greater the confidence that the
selection method is valid across samples. Perhaps the most practical approach to
cross-validation is to split the sample in half, using one half for developing the
prediction scheme, and testing the scheme on the other half. Regardless of the
method used, the cross-validated coefficient can be expected to "shrink" because
the original scheme capitalized on idiosyncrasies of the development sample that do
not generalize to the other. If the shrinkage is great, doubt is cast on the ability
of the predictor(s) to predict beyond the sample on which the scheme was originally
based.
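The split-sample approach can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the data are synthetic, and the function names (pearson_r, fit_line, split_half_cross_validation) are hypothetical, not part of any package. With a single predictor the cross-validated coefficient is simply the holdout correlation (signed by the slope); shrinkage matters most when several predictors are weighted together.

```python
import random
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def fit_line(x, y):
    """Least-squares slope and intercept for predicting y from x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def split_half_cross_validation(x, y, seed=0):
    """Develop the equation on one random half and test it on the other."""
    order = list(range(len(x)))
    random.Random(seed).shuffle(order)
    half = len(order) // 2
    dev, holdout = order[:half], order[half:]
    dev_x = [x[i] for i in dev]
    dev_y = [y[i] for i in dev]
    slope, intercept = fit_line(dev_x, dev_y)
    # Predict holdout criterion scores with the development-sample equation.
    predicted = [intercept + slope * x[i] for i in holdout]
    actual = [y[i] for i in holdout]
    return pearson_r(dev_x, dev_y), pearson_r(predicted, actual)


# Synthetic (made-up) predictor and criterion scores, for illustration only.
rng = random.Random(42)
predictor = [rng.gauss(100, 10) for _ in range(60)]
criterion = [0.05 * p + rng.gauss(0, 1) for p in predictor]

r_dev, r_cross = split_half_cross_validation(predictor, criterion)
print(f"development r = {r_dev:.2f}, cross-validated r = {r_cross:.2f}")
```

Comparing the two printed coefficients gives a rough sense of how much the validity estimate changes when the scheme is applied to people it was not developed on.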
Validity Generalization
One of the traditional views of personnel psychology was that validities for
employment tests are situation-specific.
This was based on empirical results
showing considerable variation in validity coefficients across populations. This
opinion carried great weight in the formulation of early standards and laws
governing employee tests, which advised against borrowing validity evidence from
other populations unless it could be demonstrated that work behaviors and the
organizational context between the populations were very similar.
Schmidt and Hunter have convincingly argued that the apparent situational
specificity of validity coefficients might be due to artifacts in the measuring
procedures. For example, small sample sizes, differences in reliability in the
predictor and criterion, or differences in range restriction are only a few of the
possible factors that distort estimates of validity across samples, irrespective of
the true validity.
Schmidt, Hunter, and colleagues have found that nearly all of the variance in validity
estimates is due to these artifacts. Their findings indicate that validity coefficients
are much more generalizable than has typically been assumed. The implication is
that managers may not be forced to "re-invent the wheel" for their staffing
decisions. They may be able to rely on others who have demonstrated the test to be
valid.12 In fact, meta-analysis, introduced in the next section, will show how the
organization can use findings compiled across many organizations in making human
resource decisions.
V. CONFIRMATORY RESEARCH
Obviously, a central part of a manager's job is to make decisions. But how
can one determine the quality of those decisions? Successful outcomes are the
ultimate standard, but final outcomes (e.g., profitability, market share) give us very
poor information about exactly where decisions might be improved. Confirmatory
research enables us to investigate the accuracy of human resource decisions, the
cost of errors associated with particular practices, and how to compile findings in
hope of making better decisions in the future.
Decision Analysis
After Jack institutes his new hiring procedure, he might like to see his
"batting average." Remember from hypothesis testing that we discussed four types
of decisions: accepting the null hypothesis when it is true; accepting the null when
it is false (Type II error); rejecting the null when it is true (Type I error); and
rejecting the null when it is false (power). Decision analysis is another 2 × 2
procedure that provides information on the immediate consequences of human
resource decisions. For the purposes of decision analysis, we assume that the null
hypothesis is that the individual will be considered successful on the job. Accepting
the applicant when he or she will in fact be successful is obviously a correct
decision. However, rejecting the applicant when he or she would have been
successful is an error. Rather than being labelled a Type I error, in decision
analysis such a mistake is termed a false negative (applicants falsely predicted
to be unsuccessful). Rejecting the applicant who would have been considered
unsuccessful is a correct decision. Finally, accepting an applicant who turns out to
be unsuccessful is a false positive (positive performance was falsely predicted).
12 Not all courts have accepted this standard.
Figure 10 shows the scatterplot of predictor-criterion scores from Figure 9,
with a validity coefficient of r=.54. Point Xc represents the cutoff point for predictor
scores (in this case, the cutoff is 94). Applicants scoring to the right of Xc are hired,
those to the left are rejected. Cutoffs are set based on the desired number of
employees hired, minimum qualifications needed, or both factors. Point Ys
represents the minimum performance required to be judged successful on the job
(in this case, the minimum performance baseline is the scale midpoint—5.5 on the
1-10 scale; where the baseline is set depends, of course, on the job, the performance
standards, and so forth). Those above it are considered successful employees; those
below it are not. Applicants in Quadrant I were hired and were above the baseline
(considered successful). Applicants in Quadrant III were not hired and, if they were,
would have been below the performance baseline (considered unsuccessful). Thus,
Quadrants I and III are correct decisions. Applicants in Quadrant II were not hired
but, had they been, would have been above the baseline (considered successful).
Applicants in Quadrant IV were hired, but performed below the baseline. Thus,
whereas Quadrants I and III represent correct decisions, Quadrants II and IV
represent errors. Applicants in Quadrant II are false negatives.
Employees in
Quadrant IV are false positives.13
Setting a cutoff score defines the selection ratio, or the proportion of
applicants hired. It can be calculated using the number of individuals in each
quadrant for the following formula:
\[ \text{Selection Ratio} = \frac{I + IV}{I + II + III + IV} \]
The lower (higher) the cutoff, the higher (lower) the selection ratio. Because 48 out
of 67 applicants in Figure 10 were hired, the selection ratio is (48/67)=.72.14
The base rate is the proportion of applicants considered successful if all
applicants were hired. It is represented by the following formula:
\[ \text{Base Rate} = \frac{I + II}{I + II + III + IV} \]
In Figure 10, 45 out of 67 applicants would be considered successful. Therefore, the
base rate is .67 (45/67).
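The quadrant counts, selection ratio, and base rate can be tallied directly from predictor and criterion scores. The sketch below is illustrative only: the eight applicants are made-up data (not the 67 points in Figure 10), the function name decision_analysis is hypothetical, and it assumes that an applicant scoring exactly at the cutoff is hired and that performance exactly at the baseline counts as successful.

```python
def decision_analysis(scores, cutoff, baseline):
    """Classify (predictor, performance) pairs into Quadrants I-IV."""
    counts = {"I": 0, "II": 0, "III": 0, "IV": 0}
    for predictor, performance in scores:
        hired = predictor >= cutoff            # assumption: ties are hired
        successful = performance >= baseline   # assumption: ties are successful
        if hired and successful:
            counts["I"] += 1      # correct decision
        elif successful:
            counts["II"] += 1     # false negative
        elif not hired:
            counts["III"] += 1    # correct decision
        else:
            counts["IV"] += 1     # false positive
    total = sum(counts.values())
    return {
        "counts": counts,
        "selection_ratio": (counts["I"] + counts["IV"]) / total,
        "base_rate": (counts["I"] + counts["II"]) / total,
        "proportion_correct": (counts["I"] + counts["III"]) / total,
    }


# Hypothetical applicants: (predictor score, performance rating on a 1-10 scale).
applicants = [
    (101, 7.0), (97, 4.5), (90, 6.0), (88, 3.0),
    (99, 8.5), (94, 5.5), (85, 5.0), (92, 7.5),
]

result = decision_analysis(applicants, cutoff=94, baseline=5.5)
print(result)
```

Rerunning the function with a different cutoff shows how the false positive and false negative counts trade off against each other.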
13 Of course, in a predictive validation design (where the selection measure is used to hire from an
applicant pool), if the test is used in making decisions, Quadrants II and III are missing (since
applicants who scored below the cut line were never hired). However, as discussed previously, there
are several options: (1) until the measure is validated, hiring decisions can be made without regard
to scores on the selection measure; (2) simulated results for those quadrants can be constructed based
on range restriction; (3) a concurrent validation design can be used such that the selection measure
is given to current employees.
14 Selection ratios vary dramatically by job type, industry, and labor market conditions. For example,
one would expect a very high selection ratio in hiring packing plant workers in good economic
conditions (I worked with one such organization that hired virtually every able-bodied applicant). In
contrast, the selection ratio in hiring a professor may be .01, which is precisely what it was with a
search committee I chaired in 2012.
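Figure 10
Decision Analysis of Predictor-Criterion Scores
[Scatterplot of predictor scores (horizontal axis, cutoff Xc) against performance
(vertical axis, baseline Ys). Quadrant I is upper right, II is upper left, III is
lower left, and IV is lower right.]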
Obviously the goal is to eliminate the errors. One way to reduce the overall
error rate would be to choose a more valid selection procedure.
A validity
coefficient of 1.0 (a straight line of scores) would lead to no errors. A coefficient of
0.0 (a circle of scores) would lead to as many errors as correct decisions. The
selection ratio and base rate also have implications for errors. Moving the cutoff or
minimum level of acceptable performance decreases one error while increasing the
other. However, there is a point at which total errors are minimized. The highest
number of correct decisions is where the number of false positives exactly equals
the number of false negatives.
The optimal place to set the cutoff, however, depends on the cost of each
error to organizations. False positives are undoubtedly more salient to managers.
Hiring applicants who later turn out to be poor matches is very visible. Conversely,
those who got away are often unnoticed. A strategy designed to minimize false
positives would mean hiring fewer applicants. Therefore, the balance is between
meeting one's labor force requirements and minimizing the number of applicants who
are incorrectly hired.
The benefit of decision analysis is not that it makes the staffing decisions for
the manager. Rather, the advantage is that it presents the manager with the
consequences of the human resource judgments he or she must make. Further, the
natural tradeoff between false positives and false negatives forces managers to
consider the costs of both errors in formulating their selection strategies.
Utility Analysis
It is a truism that profit and loss are the bottom line for most organizations.
Utility analysis concerns the evaluation of implications of human resource (staffing
in particular) decisions on organizations in dollar terms. As such, it is a powerful
means to understand the costs and benefits of decisions managers must make
regarding selection.15
Suppose that Jack wishes to hire 50 employees, and has 100 applicants for
the positions. The selection ratio is .50 (50/100). Jack has the choice of using two
different predictors but is unsure of which to use (he cannot afford to use both).
Schmidt, Hunter, McKenzie and Muldrow (1979) provide a framework to analyze
which predictor will yield the biggest dollar improvement over random selection.
15 Cascio and Aguinis (2010) also analyze the costs associated with other human resource
management activities (turnover, absenteeism, training programs).
Suppose the two predictors Jack is considering are the interview (denoted
P1) and a work sample test that entails scooping ice cream and serving it to the
customer (denoted P2). It costs $250 to interview an applicant and $325 per
applicant to administer the work sample (these costs mostly comprise the staff time
required to interview applicants or administer the work sample to them). In the
past Jack has found a correlation of .30 between his ratings of applicants based on
the interview and job performance, and a correlation of .35 between scores on the
work sample and job performance ratings. If the selection ratio is .50, the average
predictor score of the top 50% of applicants is z=.80 (.80 standard deviations above
the mean).16 The final piece of information Jack needs is the standard deviation of
performance in dollars. Cascio and Aguinis (2010) present several methods for
calculating the standard deviation of dollar-valued performance.
The simplest
method is to assume that SDy is 40% of employees’ average annual salary. Assume
that Jack finds the standard deviation to be $6,000, indicating that an employee who
performs one standard deviation above the mean is worth $6,000 more to Jack than
the average employee.
Schmidt et al. use the following formula to estimate the net increase in
dollars to the organization using the selection procedure in question over random
selection:
\[ U = (N_s \times r_{xy} \times SD_y \times z_{xs}) - (N_t \times c) \]
Where:
U   = utility (net gain from using selection procedure)
Ns  = number of applicants selected
rxy = correlation between predictor and job performance
SDy = standard deviation of performance in dollars
zxs = average standard score on predictor for applicants selected
Nt  = total number of applicants
c   = cost of predictor per applicant
16 Cascio and Aguinis (2010) provide tables for estimating this figure.
The net gain over random selection for the interview would be:
\[ U_{P1} = (50 \times .30 \times \$6{,}000 \times .80) - (100 \times \$250) = \$47{,}000 \]
The net gain for the work sample would be:
\[ U_{P2} = (50 \times .35 \times \$6{,}000 \times .80) - (100 \times \$325) = \$51{,}500 \]
Although both are a substantial improvement over random selection, it appears that
Jack would be better off using the work sample even though it costs more to
administer. Use of the work sample is expected to result in a $4,500 greater annual
net gain than using the interview as a predictor.
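The two utility figures can be reproduced with a short function. This is a minimal sketch of the formula as stated in the text; the function and parameter names are illustrative choices, not taken from Schmidt et al.

```python
def selection_utility(n_selected, validity, sd_y, mean_z_selected,
                      n_applicants, cost_per_applicant):
    """Net dollar gain over random selection: (Ns * rxy * SDy * zxs) - (Nt * c)."""
    gain = n_selected * validity * sd_y * mean_z_selected
    cost = n_applicants * cost_per_applicant
    return gain - cost


# Values from Jack's example: 50 hires from 100 applicants, SDy = $6,000, zxs = .80.
u_interview = selection_utility(50, 0.30, 6000, 0.80, 100, 250)
u_work_sample = selection_utility(50, 0.35, 6000, 0.80, 100, 325)

print(f"Interview:   ${u_interview:,.0f}")
print(f"Work sample: ${u_work_sample:,.0f}")
print(f"Difference:  ${u_work_sample - u_interview:,.0f}")
```

Changing any single input (for example, the cost per applicant) shows how sensitive the comparison between the two predictors is to that assumption.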
One can see that the potential payoff from a selection procedure is a function
of several factors. As the selection ratio decreases (that is, as the organization
becomes more selective), the average predictor score of those hired rises, and so
does the gain per person hired. In fact, if the selection ratio were quite high, the
work sample would lose money compared to random selection. The validity of the test
will also increase the utility. If the validity for either test were .10, Jack would
lose money using either method relative to random selection. However, since a more
valid selection procedure may be more expensive to administer, one must balance the
extra cost against the savings from better predictors. Finally, as job performance
becomes more valuable, it pays the organization to have a more valid selection
procedure.
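These sensitivities can be explored numerically. The sketch below assumes strict top-down selection on a normally distributed predictor, so that the average standard score of those hired can be computed from the normal curve rather than looked up in a table. The function names are hypothetical, and because the computed value for a .50 selection ratio is about .798 rather than the rounded .80, the dollar figures differ slightly from those in the text.

```python
from statistics import NormalDist


def mean_z_of_selected(selection_ratio):
    """Average standard score of those above the cutoff (normal predictor assumed)."""
    if not 0.0 < selection_ratio < 1.0:
        raise ValueError("selection_ratio must be between 0 and 1")
    standard_normal = NormalDist()
    z_cutoff = standard_normal.inv_cdf(1.0 - selection_ratio)
    return standard_normal.pdf(z_cutoff) / selection_ratio


def utility(n_applicants, selection_ratio, validity, sd_y, cost_per_applicant):
    """Net gain over random selection when hiring the top share of applicants."""
    n_selected = n_applicants * selection_ratio
    gain = n_selected * validity * sd_y * mean_z_of_selected(selection_ratio)
    return gain - n_applicants * cost_per_applicant


# Work sample (validity .35, $325 per applicant) at several selection ratios.
for ratio in (0.10, 0.50, 0.95):
    u = utility(100, ratio, 0.35, 6000, 325)
    print(f"selection ratio {ratio:.2f}: mean z = {mean_z_of_selected(ratio):.2f}, "
          f"utility = ${u:,.0f}")

# Either predictor with a validity of only .10 at a .50 selection ratio.
low_validity_interview = utility(100, 0.50, 0.10, 6000, 250)
low_validity_work_sample = utility(100, 0.50, 0.10, 6000, 325)
print(f"validity .10: interview ${low_validity_interview:,.0f}, "
      f"work sample ${low_validity_work_sample:,.0f}")
```

Under these assumptions the work sample loses money at a very high selection ratio, and both predictors lose money when validity is only .10.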
Meta-Analysis
Remember that Jack wondered how to examine results from other
organizations in formulating his own human resource management policies? He
could rely on information from surveys of other organizations, or he may have
information on his closest competitor(s).
However, samples from disparate
populations may be difficult for Jack to assimilate in a systematic manner. Further,
he has no way of determining if his sample is representative. Meta-analysis refers
to the statistical analysis of empirical results accumulated from individual studies.
It allows the collection of data from various studies in an objective and systematic
manner, permitting the manager to make more informed and comprehensive
judgments about the relationship(s) of interest.
The particular methods of meta-analysis vary, depending on the data
available and the preferences of the investigator.
The general approach is to
combine findings in a certain manner to arrive at the average result. For example,
suppose Jack had results from 5 organizations on their findings regarding the
relationship between satisfaction with pay and intent to leave the organization.
Their results are described in Figure 11.
Figure 11
Correlation Between Pay Satisfaction and Intent to Leave Organization for 5 Companies
Company    rxy     n
#1         -.25    110
#2         -.30     52
#3         -.45     98
#4         -.29     28
#5         -.51    205
How would Jack interpret these findings? We could find the average correlation
between pay satisfaction and intent to leave using the following formula:
\[ \bar{r} = \frac{\sum r_{xy}}{n_r} \]
Where: rxy = the correlation from each study
       nr  = the number of studies
For our example the average correlation is:
\[ \bar{r} = \frac{(-.25) + (-.30) + (-.45) + (-.29) + (-.51)}{5} = -.36 \]
One could also calculate a weighted mean, so that the studies with larger sample
sizes would be given proportionately greater weight (thus reducing the influence of
sampling error). Again using our example, where the total sample size is 493:
\[ \bar{r} = \frac{(-.25 \times 110) + (-.30 \times 52) + (-.45 \times 98) + (-.29 \times 28) + (-.51 \times 205)}{493} = -.41 \]
Based on the weighted result, then, satisfaction with pay explains about 17% of the
variance in intent to leave (the square of -.41 is approximately .17). If Jack has a
problem with turnover he may want to increase employees' compensation.
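Both averages can be checked with a few lines of Python using the five correlations and sample sizes from Figure 11. The function names are illustrative only.

```python
# (correlation, sample size) for the five companies in Figure 11.
studies = [(-0.25, 110), (-0.30, 52), (-0.45, 98), (-0.29, 28), (-0.51, 205)]


def unweighted_mean_r(results):
    """Simple average of the study correlations."""
    return sum(r for r, _ in results) / len(results)


def weighted_mean_r(results):
    """Average of the study correlations weighted by sample size."""
    total_n = sum(n for _, n in results)
    return sum(r * n for r, n in results) / total_n


simple = unweighted_mean_r(studies)
weighted = weighted_mean_r(studies)

print(f"unweighted mean r = {simple:.2f}")
print(f"weighted mean r   = {weighted:.2f}")
print(f"variance explained (weighted) = {weighted ** 2:.2f}")
```

The weighted mean is pulled toward Company #5 because it contributes the largest sample.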
One of the strengths of meta-analysis is that it is possible to combine studies
reporting differing statistics into an overall effect. For example, t-statistics,
correlations, and z-scores can all be transformed into the same metric, enabling
interpretation of the overall relationship despite the differing statistics. The
manager will not often conduct a meta-analysis. However, the increasing
proliferation of the results in professional journals allows the manager to consult
the source for an overall summary statistic in formulating his or her policies.
Another advantage of meta-analysis is that statistical corrections for study
artifacts can be made. Computing an average correlation weighted by sample size
corrects for sampling error (removing the bias that would be created by giving small
sample correlations or effects the same weight as large sample correlations or
effects). However, other corrections can be made as well, including corrections for
predictor and criterion unreliabilities and for range restriction (each using the
formulae provided earlier).
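As one illustration of such a correction, the classical correction for attenuation divides the observed correlation by the square root of the product of the predictor and criterion reliabilities. The sketch below applies it to the weighted mean correlation of -.41. The reliabilities of .80 and .70 are hypothetical values chosen only for the example, and the function name is an assumption.

```python
from math import sqrt


def correct_for_unreliability(r_observed, reliability_x, reliability_y):
    """Classical correction for attenuation: r / sqrt(rxx * ryy)."""
    if not (0.0 < reliability_x <= 1.0 and 0.0 < reliability_y <= 1.0):
        raise ValueError("reliabilities must be greater than 0 and at most 1")
    return r_observed / sqrt(reliability_x * reliability_y)


# Hypothetical reliabilities for pay satisfaction (.80) and intent to leave (.70).
r_corrected = correct_for_unreliability(-0.41, 0.80, 0.70)
print(f"corrected r = {r_corrected:.2f}")
```

Because reliabilities cannot exceed 1.0, the corrected correlation is never smaller in magnitude than the observed one.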
VI. COMPUTER PACKAGES
The statistics and measurement techniques reviewed in this paper can be
calculated, as they typically are, using computer packages. While the packages
available are too numerous to mention, PC Magazine reviewed 49 of the
most popular statistical packages. The editor recommends the following advanced
packages: SPSS, Stata, SAS, Minitab, and R. Each performs all the statistics
reviewed in this paper: mode, median, mean, standard deviation, correlation,
reliability, difference between means, and regression. More advanced statistics are
also within the packages' capabilities. The price of these packages averages about
$795. The article also reviews basic packages that are cheaper and easier for the
novice to use.
R is particularly noteworthy because it is free (see http://www.r-project.org/),
though it is more technically oriented (and flexible) than other packages.
Spreadsheets such as Excel also perform all of the basic statistical analyses
mentioned above, though they can be quite cumbersome to use (there are add-ins—
such as EZAnalyze [http://www.ezanalyze.com/]—that make analyzing data with
Excel somewhat easier).
VII. SUMMARY
Statistics are the methods used to summarize data, and to infer knowledge
based upon it. Statistics indicating central tendency describe the typical value of a
distribution. Dispersion indicates how variable the scores are from the mean. Both
dispersion and central tendency can be used for inferential purposes. The normal
distribution is used to make probabilistic inferences about variables following such
a distribution.
These inferences are made based upon the null (e.g., no significant
relationship or difference) and alternative (a significant difference or relationship)
hypotheses. Rejecting a null of no differences indicates an inferred difference
between variables. A correlation coefficient is a standardized measure of linear
association between two variables. High correlation coefficients indicate the two
variables are strongly related. Regression is the prediction of one variable based on
the level of one or more other variables.
Measurement is the assignment of numbers to objects, attributes, or events.
The quality of the measuring device directly affects managers. Good measures
provide important information about the attributes of interest. The two primary
means of evaluating measures are reliability and validity. Reliability indicates the
consistency of the measuring instrument. If a measuring instrument is inconsistent,
serious doubt is cast on its usefulness as a way of gaining information about the
attribute. Standard error of measurement indicates the average degree of error
in the measuring instrument. Validity refers to how well the instrument predicts
the criterion. Valid measures provide much information about the criterion. There
are several different forms of validity and validation, depending on the data and the
goal of the investigator.
There are several ways the manager can investigate and improve on the
quality of his or her decisions. Decision analysis refers to the analysis of mistakes
in human resource decisions.
Utility analysis concerns the evaluation of
implications of human resource (particularly staffing) decisions in dollar terms.
Finally, meta-analysis is the empirical analysis of results accumulated from
individual studies.
References
Cascio, W. F., & Aguinis, H. (2010). Applied Psychology in Human Resource
Management (7th ed.). Upper Saddle River, NJ: Prentice Hall.
Cohen, J. (1994). The Earth Is Round (p < .05). American Psychologist, 49, 997-1003.
Cronbach, L. J. (1951). Coefficient Alpha and the Internal Structure of Tests.
Psychometrika, 16, 297-334.
Hunter, J. E., & Schmidt, F. L. (2004). Methods of Meta-Analysis: Correcting Error
and Bias in Research Findings (2nd ed.). Newbury Park, CA: Sage.
Kennedy, P. (2008). A Guide to Econometrics (6th ed.). Hoboken, NJ: Wiley-Blackwell.
Schwab, D. P., & Trevor, C. O. (2012). Research Methods for Organizational
Studies (3rd ed.). Florence, KY: Routledge Academic.
Schmidt, F. L., Hunter, J. E., McKenzie, R. C., & Muldrow, T. W. (1979). Impact of Valid
Selection Procedures on Work-Force Productivity. Journal of Applied
Psychology, 64, 609-626.