MEASUREMENT AND STATISTICAL ISSUES IN HUMAN RESOURCE MANAGEMENT
A Primer for the Non-Expert

Timothy A. Judge
Department of Management
Mendoza College of Business
University of Notre Dame

© Timothy A. Judge, 2013

OUTLINE

I. INTRODUCTION
Importance of Measurement
Importance of Statistical Analysis
II. FUNDAMENTALS OF STATISTICAL ANALYSIS
III. PROBLEMS IN ESTABLISHING CAUSALITY
IV. MEASURING INDIVIDUAL DIFFERENCES
V. CONFIRMATORY RESEARCH
VI. COMPUTER PACKAGES
VII. SUMMARY

Subsection topics: Central Tendency, Dispersion, Standard Scores, Normal Distribution, Hypothesis Testing, Errors, Correlation, Regression, Multiple Regression, Reliability, Standard Error of Measurement, Validity of Measures, Criterion-Related Validity, Content Validity, Face Validity, Construct Validity, Cross-Validation, Validity Generalization, Decision Analysis, Utility Analysis, Meta-Analysis

I. INTRODUCTION

After many years of saving, Jack had accumulated enough cash to buy a local ice cream shop. One of Jack's first tasks was to figure out how to staff the shop. Being a novice at this, Jack consulted his friend, Margaret, owner of the local hardware store. Margaret advised Jack that she used the interview to get "the most knowledgeable people possible," and recommended it to Jack because her people had "generally worked out well."

While Jack greatly respected Margaret's advice, upon reflection several questions came to mind. Given that there are several qualities important to a good ice cream shop employee, how does one go about identifying and measuring the best indicators of those qualities? Does Margaret's use of the interview mean that it meets Jack's requirements? Jack also wondered: if he used the interview, how confident could he be that his judgments would be the same as someone else's? Jack also needed to hire a store manager.
What characteristics would he need to look for in a strong leader? Finally, how could Jack test whether his chosen method of selecting employees was effective or ineffective?

Jack also had another set of decisions to make. How could he determine if the wage he offers differs greatly from the relevant labor market? Jack has heard that entry-level employees often engage in counterproductive behaviors (stealing, showing up late, taking off early, giving free ice cream to friends, etc.). By what means could he predict employees’ tendencies to engage in these behaviors in advance? How could these relationships be compared with findings from other organizations? By what means could Jack evaluate the effectiveness of a training and development program? Finally, how can Jack ensure that his human resource decisions are fair and nondiscriminatory? Jack was unsure how to go about answering these questions.

These questions faced by Jack are just a few of the issues confronting managers of human resources every day. While answering each question requires knowledge of the specific practice under consideration, it is also essential that the manager understand the measurement and analytical issues underlying each question. Without measurement and statistical analysis, evaluation of practices must be as subjective as Margaret's answer to Jack's question. The purpose of this primer is to introduce you to the measurement concepts and statistical tools essential to answer the questions facing managers of human resources, a few of which were presented above.

Importance of Measurement

Imagine a world in which measurement of individual differences did not exist, except within the mind of each individual. Every person would have his or her own measure of a man or woman, but the standard would dwell solely within the opinions and values of the individual.
Inferences made about, and debates over, the characteristics of individuals would be entirely subjective. Efforts to understand and predict could not be undertaken because no knowledge would be generally held. Further, because each individual would have his or her own set of standards and measurements, general knowledge about people would be difficult to achieve.

Accepted standards of measurement provide a common metric against which differences between individuals can be judged. To be sure, there is still room for subjectivity and disagreements. However, measures allow the debate of individual differences to reach a higher plane. Accepted standards of measurement enable us to draw inferences based on procedures that have been tried and tested, allowing us to be more objective and systematic in investigating our attribute(s) of interest.

The better the measure, the less decision error one risks over the true level of the attribute. This has direct implications for managers. For example, the better the measure of friendliness Jack chooses, the fewer customers will be driven away by employees (mistakenly identified as friendly) providing poor customer service. Further, if Jack has difficulty measuring friendliness, accurately appraising whether this is a wise selection strategy will be an arduous task. Finally, selection and appraisal procedures that are not accurate predictors of true performance often place one in jeopardy of litigation from disgruntled applicants.

Importance of Statistical Analysis

As just explained, measurement is an essential issue for the manager of human resources to consider. Yet without analysis of those measures, measurement itself is futile. It is probably safe to conclude that rather than being beset by a lack of measurement information, most managers are overwhelmed by too much information.
For example, in formulating selection decisions the manager may have information on hundreds of candidates on several different predictors. The purpose of statistics is to make sense of this mass of information. As evidenced by Jack's dilemma, the typical manager is faced with a great deal of uncertainty. While statistical analysis does not eliminate the uncertainty, it provides the basis for better decisions to be made based on the data at hand.

Further, statistics are the tools that allow us to make inferences about our measures. How reliable or consistent are the measures of the attribute(s) of interest? How accurate or valid are they? This paper will introduce the ways in which we can describe and make inferences about our measures of concern.

II. FUNDAMENTALS OF STATISTICAL ANALYSIS

Measuring individual differences is a detailed issue that will be addressed in the next section. However, a pertinent question is: once we have a measurement, what do we do with it? It is essential that the manager be able to analyze the numbers measurement provides. Statistics are the methods we use to make sense out of numbers, both to describe measures of attributes and to infer knowledge from them. In short, descriptive statistics are concerned with summarizing data in a digestible manner; inferential statistics are concerned with estimating the likelihood of certain phenomena given the results at hand. The statistics reviewed below can be used for both descriptive and inferential purposes, depending on the goal of the manager.

Central Tendency

Central tendency designates the typical response of a distribution. There are three statistics commonly used to indicate central tendency. The mode refers to the most frequent value. The median is the middle observation, or the point at which half the observations fall above and half fall below.
The mean of a set of observations is the arithmetic average, or the sum of the set divided by the total number of observations in the set. The mean is calculated using the following formula:

$$ M = \frac{\sum x}{n} $$

Where:
M = mean
∑x = sum of the observations, x
n = number of observations

As an example, suppose we had the following set of performance scores from a sample of Jack's employees (on a 100-point scale):

49, 54, 68, 68, 75, 78, 84, 91, 100

There is only one value, 68, that occurs twice. Therefore, it is the mode. The median is 75: four observations fall above 75 and four fall below. The mean is 74.1, which is the sum of scores (667) divided by the number of scores (9).

What are the advantages and disadvantages of each measure of central tendency? The mode is most appropriate for summarizing qualitative data. For example, if one was curious about the number of women working at a company (perhaps to compare female representation of one's company to the relevant labor market), the mode would describe the most common gender indicated. It may make less sense to discuss mean or median gender. However, the mode suffers from several disadvantages that limit its use. First, there may be more than one mode. If another 91 were added to the above distribution, there would be two modes, making it an ambiguous measure of central tendency. Second, the mode is very sensitive to changes in a single value in the distribution. For example, if one of the applicants scoring 68 instead scored 100, the mode would jump from 68 to 100 even though only one score changed! For these reasons, the mode is generally only used in describing qualitative data.

The median has the advantage of not being sensitive to extreme values in the distribution. If one of the people who scored 68 instead scored 25, the median would not change (four scores still fall below 75), whereas the mean would change considerably (to 69.3). On the other hand, this insensitivity to extreme values can be a disadvantage. Consider the following tests:

Test #1: 19, 25, 51, 52, 53
Test #2: 50, 50, 51, 97, 99

The median (51) is the same for both tests even though the placement of values is radically different. The mean is capable of reflecting this difference (40 for test #1 versus 69.4 for test #2). Thus, sensitivity to extreme values can be both illustrative and misleading. If the median and mean are vastly different, one should investigate the cause of the difference, as each may provide an important piece of information in describing the data.

While the mean and median are both acceptable methods of describing central tendency, the mean has one characteristic that makes it the most widely used measure of central tendency: its importance in drawing inferences about central tendency (for example, to see if the average scores for the above two tests are significantly different). The median has computational properties that make it problematic in inferential statistics. Thus, the mean is employed as the measure of central tendency in most statistical analyses. In a subsequent section we will illustrate the use of the mean in drawing inferences.

Dispersion

The obvious fact in studying individual differences is that individuals differ. Dispersion, or variability, indicates the degree to which observations on individuals depart from central tendency. The most common means of expressing dispersion is the standard deviation, which indicates how far the observations on average deviate from central tendency.
The equation for the standard deviation (s) is:

$$ s = \sqrt{\frac{\sum (x_i - M)^2}{n - 1}} $$

Where:
∑(x_i − M)² = squared deviation of the ith observation, x_i, from the mean of the observations, M, summed over all observations
n = number of observations¹

From the previous example, the standard deviation of the first test is 16.6. The standard deviation of the second test is 26.1. The higher standard deviation of test #2 indicates that the scores are more dispersed.

¹ In finding the average deviation, why not simply average the deviations about the mean by subtracting each observation from the mean and dividing by the number of observations? The difficulty is that the average signed deviation from the mean is always zero. Therefore, one must take the absolute average deviation. The easiest way to do this is to square each deviation and then return it to its original units by taking the square root. If the square root is not taken, it is known as the variance.

Standard Scores

When comparing scores between two or more samples, often the raw value alone does not provide full information on the relative status of the score. For example, an individual scoring 80 on test #1 (with a mean of 40) is very different from scoring 80 on test #2 (where the mean is 69.4). The former is 40 points above the mean, the latter only 10.6. It is also important to consider, and control for, how variable the scores are about the mean.

Standard Scores (Z Scores)

Standard scores show the relative status of a score within a distribution, or, as in the above example, between distributions. A standard score indicates the number of standard deviations the particular observation is above or below the mean. Therefore, it adjusts for unequal means and variances between samples. It is calculated as follows:

$$ Z = \frac{x - M}{s} $$

where the terms are as previously defined.
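The formulas above for M, s, and Z can be checked against the two tests with a few lines of Python. This is only a sketch: it relies on the standard library's statistics module, whose stdev function divides by n − 1 as in the equation for s, and the helper name z_score is an illustrative choice rather than anything defined in this primer.

```python
from statistics import mean, median, stdev

# Scores for the two tests used in the text.
test_1 = [19, 25, 51, 52, 53]
test_2 = [50, 50, 51, 97, 99]

def z_score(x, scores):
    """Number of sample standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mean(scores)) / stdev(scores)

for name, scores in (("Test #1", test_1), ("Test #2", test_2)):
    print(
        name,
        "median:", median(scores),
        "mean:", round(mean(scores), 1),
        "s:", round(stdev(scores), 1),
        "Z for a score of 80:", round(z_score(80, scores), 2),
    )
```

Run as written, the sketch should report the same median (51) for both tests, standard deviations of about 16.6 and 26.1, and quite different standard scores for the same raw score of 80.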
Continuing the example of the two tests, we can calculate a standard score for someone scoring 80 on each test:

$$ Z_1 = \frac{80 - 40.0}{16.6} = 2.41 $$

$$ Z_2 = \frac{80 - 69.4}{26.1} = 0.41 $$

The person in the first test, scoring 2.41 standard deviations above the mean, did relatively better than the individual in the second, scoring 0.41 standard deviations above the mean, even though their absolute score is the same. Standardizing variables gives us a more complete picture of where the scores stand relative to others within a distribution or across distributions.²

² The mean of standardized scores is always 0 and the standard deviation 1.

Percentiles

Another way of reporting standard scores is with a score with which the reader undoubtedly has some experience, the percentile rank. The percentile rank of a score refers to the percentage of scores in its frequency distribution that are the same as or lower than it. For example, if someone scores at the 80th percentile on a measure, the person scored equal to or higher than 80% of the other people who completed the measure. The formula for computing percentile rank is:

$$ PR = \frac{C_l + 0.5F_i}{N} \times 100 $$

Where:
PR = percentile rank
C_l = the count of all scores less than the score of interest
F_i = the frequency of the score of interest
N = the number of individuals in the sample

Returning to Jack’s distribution of scores:

Test #1: 19, 25, 51, 52, 53
Test #2: 50, 50, 51, 97, 99

For either test, the person who scored 51 would be at the following percentile:

$$ PR = \frac{2 + 0.5(1)}{5} \times 100 = 50 \quad \text{(50th percentile)} $$

For Test #2, the person who scored 50 would be at the following percentile:

$$ PR = \frac{0 + 0.5(2)}{5} \times 100 = 20 \quad \text{(20th percentile)} $$

As you can see, percentile rankings change depending on the number and distribution of scores.
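The percentile rank formula above translates directly into code. The following Python sketch is one possible rendering, not a standard library routine; the function name percentile_rank and its variable names are illustrative assumptions.

```python
def percentile_rank(score, scores):
    """Percentile rank: (count below + half the count equal) / N * 100."""
    count_below = sum(1 for s in scores if s < score)   # scores less than the score of interest
    frequency = sum(1 for s in scores if s == score)    # frequency of the score of interest
    return 100 * (count_below + 0.5 * frequency) / len(scores)

test_1 = [19, 25, 51, 52, 53]
test_2 = [50, 50, 51, 97, 99]

print(percentile_rank(51, test_1))  # score of 51 on Test #1
print(percentile_rank(51, test_2))  # score of 51 on Test #2
print(percentile_rank(50, test_2))  # tied lowest score on Test #2
```

Because ties contribute half their frequency, the two people tied at 50 on Test #2 share a single percentile rank rather than being ranked one above the other.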
For example, if the two scores of 50 were still tied for the lowest score on Test #2 but there were 100 (as opposed to 5) test takers, the percentile rank of a score of 50 becomes:

$$ PR = \frac{0 + 0.5(2)}{100} \times 100 = 1 \quad \text{(1st percentile)} $$

Other Standard Scores

There are other ways of standardizing scores, often for the purpose of providing feedback. Stanine scores standardize scores on a nine-point scale with a mean of five and a standard deviation of two. So, for example, the bottom 4% of scores represent the 1st stanine, the middle 20% of scores represent the 5th stanine, and the top 4% of scores represent the 9th stanine. T-scores standardize scores so that the mean is 50 and the standard deviation is 10. T-scores are computed as follows:

$$ T = 50 + \frac{10(X - M_x)}{s_x} $$

Where:
X = raw score of individual
M_x = mean score of sample
s_x = standard deviation of sample scores

Returning again to Jack’s scores, the person who scored 51 on Test #1 would have:

$$ T = 50 + \frac{10(51 - 40)}{16.58} = 56.63 $$

The person who scored 99 on Test #2 would have:

$$ T = 50 + \frac{10(99 - 69.4)}{26.12} = 61.33 $$

Figure 1. Relationships Among Various Standard Score Measures

Figure 1 shows the relationships among z-scores, percentiles, stanines, T scores, and the normal distribution. If scores are normally distributed, the percentile rank is directly analogous to probabilities derived from the normal distribution, a topic to which we turn next.

Normal Distribution

Observe Figure 2. It could be, for example, a distribution of scores on an employment test. Note that the distribution is centered on (and has the greatest frequency about) the mean and is bell shaped, with decreasing frequency of observations as one gets farther from the mean.

Figure 2. The Normal Distribution

Also note that the distribution is symmetric about the mean.
Such a distribution is called a normal distribution. One rather interesting property of the normal distribution is that approximately 68% of the scores fall within 1 standard deviation of the mean, approximately 95% within 2, and approximately 99.7% within 3 standard deviations of the mean.

Figure 3. Height and the Normal Distribution. Height is one of many variables that is normally distributed. As we will see, though, it is important to remember that not everything is normally distributed.

The normal distribution is referred to as "the workhorse of inferential statistics" because once raw scores have been transformed into z scores, it is very easy to refer them to tabled values of the standard normal distribution to find probabilities associated with finding a value within the particular range of interest. For example, if the population of scores for test #1 is normally distributed, the probability of observing a z-score greater than 2.41 is about .008, indicating that fewer than 1% of individuals taking the test can be expected to score above 80. Conversely, roughly 34% of individuals taking test #2 can be expected to score above 80.

While some attributes are approximately normally distributed (height, weight, intelligence), many are not (income). One cannot use the normal distribution for inferential purposes without assuming the values are approximately normally distributed.

Figure 4. Not All Variables Are Normally Distributed. As you can see from this graph of income in the United Kingdom, income is one of those variables that is not normally distributed. (Source: Life in the Middle - The Untold Story of Britain’s Average Earners.)

However, the Central Limit Theorem allows us to assume that the distribution of means is approximately normally distributed as long as the sample size is sufficiently large (usually at least 30), regardless of the distribution of individual values.
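As a rough check on the normal-curve areas and the Central Limit Theorem described above, the following Python sketch uses only the standard library (statistics.NormalDist requires Python 3.8 or later). The exponential population, the sample size of 30, the 2,000 replications, and the fixed seed are illustrative assumptions, not values taken from this primer.

```python
import random
from statistics import NormalDist, mean, stdev

std_normal = NormalDist(mu=0.0, sigma=1.0)

def share_within(k):
    """Proportion of a normal distribution within k standard deviations of the mean."""
    return std_normal.cdf(k) - std_normal.cdf(-k)

def share_above(z):
    """Proportion of a normal distribution lying above a given z score."""
    return 1.0 - std_normal.cdf(z)

for k in (1, 2, 3):
    print(f"within {k} SD of the mean: {share_within(k):.4f}")
print(f"above z = 2.41 (score of 80 on test #1): {share_above(2.41):.4f}")
print(f"above z = 0.41 (score of 80 on test #2): {share_above(0.41):.4f}")

# Central Limit Theorem: draw many samples of 30 from a strongly skewed
# (exponential) population and look at the distribution of the sample means.
random.seed(0)  # fixed seed so the illustration is repeatable
sample_means = [
    mean(random.expovariate(1.0) for _ in range(30))
    for _ in range(2000)
]
center = mean(sample_means)
spread = stdev(sample_means)
share_within_one_sd = sum(
    1 for m in sample_means if abs(m - center) <= spread
) / len(sample_means)
print(f"mean of sample means: {center:.3f} (population mean is 1.0)")
print(f"share of sample means within 1 SD: {share_within_one_sd:.3f}")
```

Even though individual exponential values are far from bell shaped, the share of sample means falling within one standard deviation of their center should come out close to the 68% expected under a normal distribution.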
Therefore, even if the population is not normally distributed, the distribution of sample means drawn from the population is. This allows determination of probabilistic properties associated with mean observations from the standard normal.

The standard normal distribution applies when the population standard deviation is known. In practice, one seldom knows values of the entire population. When the population variance is unknown, the Student's t-distribution, which closely resembles the standard normal, can be used. Tables for the t-distribution are also widely published in statistics texts, and are precisely estimated by computer packages (see section VI).

Hypothesis Testing

Human resource managers often want to make inferences about a population or populations from which samples have been drawn. Remember that one of the questions in Jack's mind was how his company's compensation level compared with the relevant labor market. He may, for example, wish to compare the wage he is offering to that of a competing company. As another example, Jack may wish to compare pass rates on his selection measure between minorities and nonminorities to assess if his hiring procedure has an adverse impact on minorities. For both these investigations, Jack could take a sample of each group to assess if the means from each population are equal or unequal. Since the sample drawn will not perfectly reflect the population, the means will vary due to sampling error.

Hypothesis testing seeks to answer the question: at what point does the difference between the means become so large that we dismiss the hypothesis that the two population means are equal? The null hypothesis, denoted H0, is the hypothesis that is assumed to be true in producing the sample distribution used in testing the null hypothesis.
Typically, the null hypothesizes no difference between the populations. The alternative hypothesis, H1, is assumed to be true when the null is false. It typically posits a difference between the means.

The exact procedures to execute the test vary, depending on the particular assumptions and samples underlying the test. The computations are explained in most introductory statistics texts, or conducted on computer (see section VI). Suffice it to say that a t-statistic is calculated (in place of a z-score because σ is unknown) and compared to the t-distribution.³

Given, as explained above, that the sample means will probably differ, an observed difference can mean two things. The difference could simply be due to sampling error or chance variation because we do not have a perfect picture of the population. On the other hand, it could be an indication that the two population means are in fact not equal and the difference is not due to error. Convention is to use .05 (5 chances out of 100 that the difference arises by chance variation if there is no true difference) as the probability level at which we would reject the null hypothesis that the means are equal. A t-statistic of ±2 is a good benchmark, as the probability of observing a t-statistic at least that far from zero when the null is true is about .05. To be sure, 5 times out of 100 we can expect to be wrong in rejecting the null of equal means. However, .05 is a point at which most are willing to chance a mistake in order to make inferences about the true nature of events.

³ We are allowed to compare the mean value to the t-distribution because we can assume the means are approximately normally distributed through the Central Limit Theorem.

Errors

Effective management of human resources necessitates the use of statistics to make "best guesses" about the true state of affairs when incomplete information and measurement error exist. Obviously, these educated guesses are not always correct.
In statistical lexicon, mistakes that arise from erroneous inferences are termed Type I and Type II errors.⁴ If the null hypothesis is true, but Jack rejected it, he has made a Type I error. The probability of this error is represented by the Greek letter α ("alpha"), or the significance level. When one makes a Type I error, the means differed by a significant amount, but the difference was due to chance variation (sampling error). This is not the only mistake Jack needs to concern himself with. He could also make a Type II error, that is, falsely accepting the null hypothesis of equal means when they are in fact not equal. The probability of this error is represented by the Greek letter β ("beta"). When one lowers the probability of rejecting a true null (decreases α), it is more likely that one has accepted a false null (increases β).

For most decisions, it is best to make it difficult to reject the hypothesis the weight of past evidence supports (the null). That is why α is generally set quite low (thus increasing β). However, one must be aware of both errors. Each can be costly. And, all else equal, decreasing one error increases the probability of committing the other.

⁴ There is nothing magical (or, according to some, even logical) about the p < .05 standard. The origin of this p-value is one of the towering figures in statistics, Sir Ronald A. Fisher. In 1925, Fisher suggested the use of a boundary between significance and nonsignificance that was based on probability. Fisher set this boundary at p = .05; its widespread adoption has led many to question the wisdom of the standard in theory and in practice (see Cohen, 1994).

Figure 4. Results of Hypothesis Tests

                     NATURE OF NULL
DECISION        H0 true                H0 false
Accept H0       Correct (1−α)          Type II error (β)
Reject H0       Type I error (α)       Correct (power)

Figure 4 illustrates the decisions and results. The probability of accepting a true null is equal to 1−α. On the other hand, the probability of rejecting a false null, the other correct decision, is 1−β and is often referred to as the power of the test. Alpha and beta are as previously defined.

Correlation

Remember one of the questions in Jack's mind was how to hire a store manager. Suppose a friend of Jack’s, Sallie, gave him a dataset from the lifeguard service she manages (in reality, the data in Figures 5 and 6 are actually on lifeguards). Sallie’s data show a relationship between a lifeguard’s personality and his or her leadership effectiveness. Graphically, the relationship might look like Figure 5 for Sallie’s lifeguards. Each point on the graph, called a scatterplot, represents a lifeguard, having both a score on extraversion and a rating of leadership effectiveness.

By visual inspection one could see that there is a positive association between extraversion and leadership. Those who are extraverted seem to make better leaders. However, it is important to have a precise numerical measure of the association between two variables. A correlation coefficient is a standardized (controls for differing levels of variance) measure of linear covariation between two variables. The population correlation, like the population mean and standard deviation, is unknown and must be estimated from sample data. The sample correlation coefficient is calculated by the following formula:

$$ r_{xy} = \frac{\sum (x - M_x)(y - M_y)}{\sqrt{\sum (x - M_x)^2 \sum (y - M_y)^2}} $$

With standardized values (z scores), the equation simplifies to:

$$ r_{xy} = \frac{\sum z_x z_y}{n} $$

The correlation can range from +1.0 (perfect positive relation between the two variables) to −1.0 (perfect negative relation). A correlation of ±1 indicates that knowing the value of one variable allows exact determination of the other's value.
A correlation of 0.0 signifies no relationship between the variables, indicating that knowing the value of one variable gives us no information about the value of the other. In the extraversion and leadership example above, the correlation is +.42, consistent with the visual inspection of Figure 5.⁵

⁵ The reader can be forgiven for underestimating the correlation in Figure 5 from a visual inspection of the graph. As Hunter and Schmidt (2004) note, when interpreting raw data, we tend to underestimate the true relationship and overestimate the variability in that relationship (in other words, think the data are “all over the place” when in fact there is a consistent relationship).

Figure 5. The Relationship Between Extraversion and Leadership

Let’s say Jack also received data from Margaret’s hardware store: in this case, data on the degree to which the employees engaged in counterproductive work behaviors. This variable of interest (counterproductive work behaviors) is graphed with conscientiousness in Figure 6. Each data point represents an employee with a score on conscientiousness and a supervisor rating of the degree to which the employee engages in counterproductive work behaviors. A visual inspection gives one the impression that the variables are negatively related. To the point, the correlation is −.41. Higher levels of employee conscientiousness are associated with lower degrees of counterproductive behaviors (as perceived by the employee’s supervisor).

From Figure 6, Jack might interpret these data as indicating that when staffing the ice cream shop, he should give applicants a personality test (to assess conscientiousness). From Figure 5, Jack might wish to give a measure of extraversion to those individuals he is considering for store manager.
(Shortly, we will address a question that might come to mind: Can we have any confidence that validity for one organization or one type of job [in this case, lifeguards or hardware store employees] would generalize to another organization or another job type [in this case, ice cream shop employees or store manager]?)

Figure 6. The Relationship Between Conscientiousness and Counterproductive Work Behaviors

Given many possible correlation coefficients based on many different possible samples from the population, how does one determine if there is a "true" relationship between the variables? In much the same way as comparing means, we may test the hypothesis of no relationship between the variables (correlation coefficient equal to zero) against the alternative of a significant relationship. As in comparing population means, a test statistic is calculated (here r_xy) and compared to a probability distribution (generally the t-distribution), and a probability level is derived. If the probability is less than the significance level, the hypothesis of no relationship between the variables is rejected. In such a case we would conclude the "true" relationship is likely to be other than zero.

The larger the sample size, the easier it is to achieve a significant correlation. For example, a correlation of r_xy = .97 is not significantly different from zero at the .05 level when the sample size is 3. However, when n = 100 a correlation of r_xy = .19 is significant.⁶

Squaring the correlation coefficient, or r², gives the proportion of total variance of one variable explained by the other. Therefore, Jack's correlation of .38 between pay and performance represents 14% of the variance in performance explained by variation in pay. It also leaves 86% unexplained by pay (explained by other factors).
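The correlation formula and the r² interpretation can be illustrated with a short Python sketch. The extraversion and leadership values below are made-up numbers chosen only to keep the arithmetic easy to follow; they are not Sallie's data, and the function name pearson_r is likewise an illustrative choice.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Sample correlation: sum of cross-products of deviations, divided by
    the square root of the product of the two sums of squared deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cross_products = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sum_sq_x = sum((x - mean_x) ** 2 for x in xs)
    sum_sq_y = sum((y - mean_y) ** 2 for y in ys)
    return cross_products / sqrt(sum_sq_x * sum_sq_y)

# Hypothetical scores for five people (assumed for illustration only).
extraversion = [1, 2, 3, 4, 5]
leadership = [2, 1, 4, 3, 5]

r = pearson_r(extraversion, leadership)
print(f"r = {r:.2f}, r squared = {r ** 2:.2f}")

# Variance explained for the pay and performance correlation quoted in the text.
print(f"r = .38 explains about {0.38 ** 2:.0%} of the variance")
```

Squaring makes clear how quickly explained variance falls off: even a correlation that looks respectable leaves most of the variance in the other variable unaccounted for.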
When trying to predict what a person will do in the future, errors are common. This simply serves to illustrate that human behavior is somewhat unpredictable. Thus, it is relatively rare for one variable to explain a majority of variance in another. This issue will be revisited in subsequent sections.

⁶ The significance test for the correlation coefficient relies on the assumption that the population values of both distributions are normally distributed. When this assumption is in doubt or the sample size is small, one should use Spearman's rank-order correlation coefficient. The computational formula is contained in nearly all statistics texts.

Regression

Suppose Jack has operated the store for a year and now wants to estimate his staffing needs for the upcoming summer ice cream rush. Jack could use past data on the daily high temperature and the estimated number of workers required that day (recorded each day over the last year) to predict his staffing requirements for the upcoming summer. Regression, a prediction of the level of one variable based on the level of one or more other variables, is perfectly suited for this type of problem.

Suppose Jack had past data on daily high temperatures and the number of workers required over the past year. Figure 7 represents these values. Each data point represents a day in the past year when Jack recorded the daily high temperature and wrote down his estimate of the optimal number of employees on that day. The line fitted through the data is called a regression line. It is the "best fit" line because the sum of squared deviations of the data points from this line is smaller than for any other possible straight line. It represents the prediction line for the number of workers demanded for a corresponding high temperature. From this line, the number of workers Jack needs to hire, based on the forecast high, can be projected.
In regression, the dependent variable is the variable whose value is influenced by (or depends on) the value of another. In this case, the dependent variable is the number of workers demanded (total number of workers needed to staff three shifts). The independent variable is that which induces changes in the dependent variable. Here, the independent variable is the daily high temperature. The regression line is estimated by:

$$y = a + bx + e$$

Where:
y = score on dependent variable
a = intercept value
b = slope of the regression line (regression coefficient)
x = score on independent variable
e = error term

Like all other statistics, the population regression equation must be estimated from sample data. Errors result when the regression line does not perfectly predict values of the dependent variable. In our example the error term includes all factors other than temperature that influence demand for workers.

Figure 7. Predicted Demand for Workers (regression line fit plot; x-axis: Daily High; y-axis: Estimated Workers Needed Over 3 6-Hour Shifts; fitted line: y = −7.008 + 0.2916x). Estimated workers is based on the past year's data, when on each day Jack wrote down an estimate of the optimal number of workers needed that day.

The estimated regression function is $\hat{y} = -7.008 + 0.2916x$, where $\hat{y}$ is the predicted value. Accordingly, for any given value of x (i.e., daily high temperature) we can predict y (the number of workers required). For example, if the daily high is 60 degrees, Jack will need an estimated 10-11 workers on his payroll (the actual predicted value is 10.487 workers). If the high temperature is 90 degrees, Jack will need a predicted 19 (exact predicted value = 19.234) workers.
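As a minimal sketch, the prediction step amounts to plugging a forecast high into the fitted line. The function name below is hypothetical. Because the coefficients reported for Figure 7 are rounded, the results can differ from the values quoted in the text in the third decimal place.

```python
INTERCEPT = -7.008  # a, from the fitted line reported for Figure 7
SLOPE = 0.2916      # b, additional workers per one-degree rise in the daily high

def predicted_workers(daily_high):
    """Predicted number of workers for a given daily high temperature."""
    return INTERCEPT + SLOPE * daily_high

for temp in (60, 90):
    print(f"daily high of {temp} degrees: {predicted_workers(temp):.2f} workers")
```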
The slope value indicates that a 1 unit change in x induces a b unit change in y. In our example, an increase of 10 degrees leads to approximately 3 more workers required.⁷ Several assumptions are required for regression analysis: the independent variables and error terms are uncorrelated; the mean of all errors is 0; all errors have equal variances; and the errors are not correlated with one another. The implications of violating these assumptions are discussed in Kennedy (2008).

It would be useful to determine what proportion of total variability in the dependent variable is explained by the regression of Y on X. The coefficient of determination, denoted $R^2$, is the proportion of total sample variability of the dependent variable explained by the independent variable. It is calculated by dividing the variability explained by the independent variable by the total variability (which is the variance of Y). For example, $R^2=.68$ in the equation in our example, meaning 68% of the variability in number of ice cream workers required is explained by its linear dependence on the daily high temperature (our proxy for consumer demand for ice cream). In "simple" regression (one independent variable) such as this, $R^2 = r_{xy}^2$. When predicting human thoughts, feelings, or actions, one generally has to settle for less variance explained. People are complicated.

⁷ When using standardized variables, the intercept drops out (remember z scores have a mean of zero), and the b coefficient equals the correlation between the dependent and independent variable.

As with other statistics, we are able to test the b coefficient against zero to determine if the independent variable is a significant predictor of the dependent variable.⁸ We do this by dividing the coefficient estimate by its standard error (remember that because the population regression coefficients are estimated with sample data, and because the prediction is not perfect, they are estimated with error).
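For a single predictor, the quantities described above can be computed directly. The sketch below is an illustration under stated assumptions: the six data points are hypothetical (they are not Jack's data from Figure 7), and the function names are invented for this example. It fits the line by least squares, computes $R^2$ as explained variability over total variability, and divides the slope by its standard error.

```python
import math

def fit_simple_regression(xs, ys):
    """Least-squares intercept a and slope b for y = a + b*x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx
    return mean_y - b * mean_x, b

def residual_ss(xs, ys, a, b):
    """Sum of squared prediction errors around the fitted line."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

def r_squared(xs, ys, a, b):
    """Share of total variability in y explained by the regression."""
    mean_y = sum(ys) / len(ys)
    total_ss = sum((y - mean_y) ** 2 for y in ys)
    return 1 - residual_ss(xs, ys, a, b) / total_ss

def correlation(xs, ys):
    """Pearson correlation between xs and ys."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def slope_t_statistic(xs, ys, a, b):
    """Slope divided by its standard error (referred to t with n - 2 df)."""
    n = len(xs)
    mean_x = sum(xs) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    standard_error = math.sqrt(residual_ss(xs, ys, a, b) / (n - 2) / sxx)
    return b / standard_error

# Hypothetical data: daily highs and estimated workers needed.
temps = [40, 50, 60, 70, 80, 90]
workers = [5, 8, 10, 13, 17, 20]

a, b = fit_simple_regression(temps, workers)
print(f"intercept={a:.3f}, slope={b:.3f}")
print(f"R^2={r_squared(temps, workers, a, b):.3f}, "
      f"r^2={correlation(temps, workers) ** 2:.3f}")
print(f"t for slope={slope_t_statistic(temps, workers, a, b):.1f}")
```

The two printed values of $R^2$ and $r^2$ agree, which is the simple-regression identity noted in the text.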
Calculation of b coefficients is quite laborious and is therefore conducted using computer packages (see section VI). The null hypothesis is generally b=0 (a slope of zero), indicating no relationship between the variables. Once the test statistic is calculated, it is referred to the t-distribution. If the statistic is large enough to be statistically significant, the null is rejected and it is asserted that values of Y significantly depend on values of X (or that X significantly predicts Y).

⁸ In order to do this, it is necessary to assume that the prediction errors, e, are normally distributed. This assumption is also dealt with in Kennedy (2008).

Multiple Regression

Remember the example from a few pages earlier regarding the effect of conscientiousness on counterproductive behaviors? Jack observed a correlation of .41 and concluded that hiring conscientious individuals should reduce counterproductive behaviors such as absence, lateness, theft, etc. However, this conclusion might be suspect without considering the job held by the individual. Individuals in higher-level positions (like managers) may be less likely to engage in counterproductive behaviors: taking a day off may simply leave more work for the next day. Therefore, job level might confound the relationship between conscientiousness and counterproductivity. Luckily, there is a procedure that allows us to control for other influences when investigating the relationship between two variables.

Multiple regression, as a generalization of simple regression, allows investigation of multiple influences on the dependent variable. The general form of the equation can be represented as:

$$Y = a + b_1x_1 + b_2x_2 + \dots + b_kx_k + e$$

Where $x_1, x_2, \dots, x_k$ represent 1 through k independent variables; all other terms are as previously defined.
The interpretation of the effect of an independent variable is similar to simple regression, except that it now measures the effect of one variable holding the others in the equation constant. Each regression coefficient in multiple regression is known as a partial regression coefficient because it expresses the partial effect of its variable on the dependent variable.

The power of multiple regression to the human resource manager should not be underestimated. By controlling for the influence of all variables the investigator wishes to specify, it allows inferences regarding the influence of one independent variable on the dependent variable, controlling for the effect of other possible influences. In our earlier example, it is possible to investigate the effect of conscientiousness on counterproductive behaviors controlling for job held. In other words, for those having the same position in the organization, what is the effect of conscientiousness on counterproductivity?

Multiple regression is ideally suited for prediction based on multiple sources of information. For example, suppose Jack decided to predict job performance based on two selection predictors, collected data on the predictors and the criterion, and estimated the following regression equation with his sample data:

$$Y = 10 + .3X_1 + .6X_2$$

Jack may then use this equation for future selection decisions. For example, Jack may wish to predict subsequent job performance for an applicant who scored 50 on test 1 and 80 on test 2. Assume 65 is the minimum acceptable performance rating. The applicant's predicted job performance is:

$$Y = 10 + .3(50) + .6(80) = 73$$

Thus, this applicant would be predicted to be successful, albeit marginally, on the job. If Jack needed to fill 25 positions, he would probably hire the 25 applicants with the highest predicted job performance.
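A small sketch of how Jack might apply this equation to a pool of applicants follows. The equation and the cutoff of 65 come from the example above; the applicant records and function names are hypothetical.

```python
MINIMUM_ACCEPTABLE = 65  # minimum acceptable performance rating from the example

def predicted_performance(test1, test2):
    """Jack's estimated equation: Y = 10 + .3*X1 + .6*X2."""
    return 10 + 0.3 * test1 + 0.6 * test2

def select_top(applicants, openings):
    """Return the `openings` applicants with the highest predicted performance."""
    ranked = sorted(
        applicants,
        key=lambda person: predicted_performance(person["test1"], person["test2"]),
        reverse=True,
    )
    return ranked[:openings]

# Hypothetical applicant pool; applicant A is the worked example from the text.
applicants = [
    {"name": "A", "test1": 50, "test2": 80},
    {"name": "B", "test1": 70, "test2": 40},
    {"name": "C", "test1": 90, "test2": 90},
]

for person in select_top(applicants, openings=2):
    score = predicted_performance(person["test1"], person["test2"])
    print(person["name"], round(score, 1), "acceptable:", score >= MINIMUM_ACCEPTABLE)
```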
It is often held that because the weight on $X_2$ is greater than the weight on $X_1$, it is a more important predictor of the dependent variable (e.g., job performance). This is an incorrect assertion because the variables are measured in different units. For example, measuring pay in dollars versus thousands of dollars would yield a coefficient one thousand times smaller even though the relationship is no different. Regression with standardized variables eliminates this problem as all the variables are forced into the same units. In fact, with standardized variables, each regression coefficient is closely related to the partial correlation coefficient between the particular independent variable and the dependent variable (the two always share the same sign, although they are not in general numerically equal). Therefore, it provides information on the strength of the separate relationship between the independent and dependent variables, partialling out (i.e., holding constant) the effect of the other variables. Squaring the partial correlation coefficient indicates the proportion of the remaining variance in the dependent variable explained by the independent variable, once the influence of the other variables is removed. In our example, if once standardized $X_2$ had a larger coefficient than $X_1$, $X_2$ would explain more variance in performance. Thus, without generalizing beyond the sample, $X_2$ would be a stronger (more important) predictor of the dependent variable.

The coefficient of determination in multiple regression has a comparable interpretation to simple regression. $R^2$ reflects the proportion of total variance in the dependent variable explained by the set of independent variables. For example, $R^2=.50$ indicates that 50% of the variance in the dependent variable is explained by the independent variables.⁹

III. PROBLEMS IN ESTABLISHING CAUSALITY

One must be cautious in attributing causality using correlation and regression. By themselves, they do not establish the direction of causality between variables.
Consider a correlation one might find between pay and performance such that those who earn more have higher performance ratings. How does one interpret this? High performers are generally paid more for their accomplishments (performance → pay). However, high pay also serves as an incentive to greater efforts (pay → performance). Thus, in this example it is impossible, merely by looking at a correlation or regression coefficient, to attribute causal direction. In such cases, tighter controls, either in research designs or statistical controls, are needed before causal inferences can be drawn (see Schwab & Trevor, 2012, for further discussion).

⁹ Non-linear regression models can be estimated, often with a substantial increase in prediction. For example, one can see that the scatterplot in Figure 7 is not linear: as you might expect, changes in temperature lead to greater differences in estimated demand for workers at high temperatures than at low temperatures (i.e., the difference between a high of 80° and 70° leads to a greater change in workers needed than the difference between a high of 30° and 20°). The relationship is roughly exponential, and there are various ways to model such relationships (see Kennedy, 2008).

IV. MEASURING INDIVIDUAL DIFFERENCES

From Plato to Darwin to managers in search of productive workers, the fundamental differences among individuals have at once been an obvious fact and a source of fascination. The first task of a manager making differentiations between people (whether for hiring, compensating, training, or appraising employees) is to measure the differences. Measurement is the assignment of numbers to objects, attributes, or events. In many cases, measurement is both critical and difficult. In trying to assess human thought and behavior, measurement is particularly difficult. Two central means of evaluating the quality of our measures are reliability and validity.
Each will be explored in turn.

Reliability

Remember Jack's concern about whether his judgment when interviewing applicants would be consistent with that of others? For example, if Jack's assistant manager also interviewed applicants, to what degree would their evaluations agree? This is an issue of reliability, or the consistency or reproducibility of a measuring instrument. If Jack found that their judgments were often quite different, Jack might question the reliability of their evaluations, and the usefulness of the procedure. A test, set of evaluations, or survey items that do not correlate well with themselves can hardly be expected to correlate with any variable of interest. Thus, reliability is an essential starting point in measurement and statistical analysis.

Reliability theory posits that variation in scores, for example on an employment test, appraised performance, or job satisfaction survey, is composed of variation in "true" scores (i.e., reflecting variation in true ability or performance) plus variation due to error in the measuring instrument. Or,

$$\sigma^2 = \sigma_t^2 + \sigma_e^2$$

Where:
$\sigma^2$ = total variance in scores, as defined earlier
$\sigma_t^2$ = variance in "true" scores
$\sigma_e^2$ = error variance

The more total variance is due to true differences between the individuals and less to inconsistencies (which produce variance) in the measuring instrument, the more reliable the measuring device. In classical reliability theory, the reliability coefficient is represented as:

$$r_{xx} = \frac{\sigma_t^2}{\sigma^2} = 1 - \frac{\sigma_e^2}{\sigma^2}$$

The higher the proportion of "true" variance to total variability (or the lower the proportion of error to total variance), the higher the reliability of the measuring instrument.
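The definition can be made concrete with a tiny sketch. The variance figures are hypothetical, chosen only so that the two forms of the formula can be compared, and the function name is an assumption for illustration.

```python
def reliability(true_variance, error_variance):
    """r_xx = true variance / total variance, where total = true + error."""
    total_variance = true_variance + error_variance
    return true_variance / total_variance

# Hypothetical test: true-score variance of 16 and error variance of 4.
true_var, error_var = 16.0, 4.0
total_var = true_var + error_var

print(reliability(true_var, error_var))  # first form: true / total
print(1 - error_var / total_var)         # second form: 1 - error / total
```

Both lines print the same value, since the two expressions are algebraically identical once total variance is the sum of the true and error components.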
Just as $r^2$ tells us the proportion of total variance shared by the variables, and $R^2$ indicates the proportion of variance in the dependent variable explained by the independent variable(s), the reliability coefficient itself (it is not squared), theoretically, reveals the proportion of total variance in the measured variable due to "true" differences in individuals. If we had true scores, we could calculate reliability in this manner.

Figure 8. A Tale of Two Tests (Test #1: 80% true variance $\sigma_t^2$, 20% error variance $\sigma_e^2$; Test #2: 20% true variance $\sigma_t^2$, 80% error variance $\sigma_e^2$; both tests have the same total variance $\sigma^2$)

Variability alone does not determine reliability; it is the proportion of true variance to total. For example, Figure 8 shows two tests with the same level of total variance, $\sigma^2$. Yet test 1 is much more reliable than test 2, as 80% of the total variance in test 1 is due to variation in individual characteristics ("true" variance) and only 20% due to error. However, in test 2, only 20% is "true" variance, and 80% is measurement error.

In practice, since true scores are never known, reliability must be estimated from the data obtained from our measuring instruments. One of the more obvious means of estimating reliability is test-retest, where the same form of a test is administered twice to the same applicants (after a suitable time period) and the two scores are correlated. One potential drawback of the test-retest estimate is that any variable that influences one administration and not the other will reduce the estimated reliability. Another problem with the test-retest method is that the individual may remember responses from the first test or assessment, or consistently guess in the same manner on both tests.

Perhaps the most popular method of estimating reliability is internal consistency, which holds that items from the same test should predict the total score equally well regardless of where they are placed in the test.
One approach is to correlate one half of the test with the other half, a split-half reliability. Because reliability increases with test length and the split-half method cuts length in half, the obtained correlation is a conservative estimate of the true reliability of the test. The Spearman-Brown prophecy formula is often used to correct for this reduced reliability:

$$r_{11} = \frac{2r_{xx}}{1 + r_{xx}}$$

Where $r_{11}$ is the corrected correlation and $r_{xx}$ is the correlation between the halves.

Perhaps the most sophisticated measure of internal consistency is Cronbach's alpha (Cronbach, 1951), which yields the mean correlation between all possible half-splits. Cronbach's alpha is available on most computer packages (see section VI). It can be calculated manually with the following formula:

$$\alpha = \frac{N \times \bar{r}}{1 + ([N - 1] \times \bar{r})}$$

Where:
$\alpha$ = coefficient alpha
N = number of items in measure
$\bar{r}$ = average correlation among items

For example, if Jack wished to measure extraversion with a 10-item scale, and the average correlation among those 10 items was $\bar{r}=.40$, then $\alpha$ is:

$$\alpha = \frac{10 \times .40}{1 + ([10 - 1] \times .40)} = \frac{4}{1 + 3.6} = \frac{4}{4.6} = .87$$

What is an acceptable level of reliability? It depends on several factors. Though most researchers appear to adhere to an "α ≥ .70 is acceptable, α ≥ .80 is good" rule, such simplistic rules do as much harm as good. For example, longer tests can be expected to be more reliable than shorter tests. Internal consistency estimates can also be expected to be higher than inter-rater estimates. A coefficient alpha of .70 on a long test might be considered to be marginally reliable, whereas a correlation of .60 between interviewer judgments might be thought of as quite good. Reliabilities below .50 are seldom considered adequate regardless of the method used to estimate reliability.

There are many factors that influence the reliability of a measuring instrument.
As mentioned earlier, large sample sizes (more is known about the population) and a larger number of test items or raters (using 10 predictors to select people is more likely to yield a consistent estimate of their ability than a single item) increase reliability. Finally, heterogeneity in the individual difference being measured serves to increase reliability, as there is more variance to be explained.

Standard Error of Measurement

The standard error of measurement indicates the degree of error expected in an individual's score. If an individual were to take the test (or be evaluated) many times, his or her scores would vary, and we expect that variation to follow a normal distribution. More scores should be near the individual's true score than far away. The mean of this distribution is the individual's true score, and the standard deviation is the standard error of measurement (abbreviated $\sigma_{meas}$). The $\sigma_{meas}$ represents the typical error in the measurement device. As with all normal distributions, about 68% of the scores lie within 1 standard deviation of the mean, about 95% within 2, and so on. The standard error of measurement may be expressed as:

$$\sigma_{meas} = \sigma_x \sqrt{1 - r_{xx}}$$

As one can see, $\sigma_{meas}$ is determined by both the variability of the scores and the reliability of measurement. If reliability is perfect ($r_{xx}=1.0$), there is no error in estimating an individual's true score.

Perhaps the most important use of $\sigma_{meas}$ for human resource managers is that it enables us to make inferences about true scores. For example, if the standard deviation on Jack's employment test is $\sigma_x=4$, and reliability for the test is $r_{xx}=.80$, then $\sigma_{meas}=1.79$. If an individual scores 80, Jack can be 68% confident that the individual's true score is within ±1.79 points of their obtained score (roughly between 78 and 82), and 95% confident that their true score is roughly between 76.4 and 83.6 (±3.58 points).
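The reliability formulas in this section (the Spearman-Brown correction, coefficient alpha, and the standard error of measurement) are simple enough to check by hand or in a few lines of code. The sketch below reproduces the worked figures from the text; the function names are hypothetical, and the split-half correlation of .60 is an invented input used only to exercise the Spearman-Brown formula.

```python
import math

def spearman_brown(half_correlation):
    """Corrected full-length reliability from a split-half correlation."""
    return 2 * half_correlation / (1 + half_correlation)

def cronbach_alpha(n_items, mean_inter_item_r):
    """Coefficient alpha from the number of items and the average item correlation."""
    return n_items * mean_inter_item_r / (1 + (n_items - 1) * mean_inter_item_r)

def standard_error_of_measurement(sd, reliability):
    """sigma_meas = sigma_x * sqrt(1 - r_xx)."""
    return sd * math.sqrt(1 - reliability)

def true_score_band(score, sd, reliability, n_sems=1):
    """Interval of +/- n_sems standard errors of measurement around a score."""
    margin = n_sems * standard_error_of_measurement(sd, reliability)
    return score - margin, score + margin

print(round(spearman_brown(0.60), 2))                      # hypothetical split-half r
print(round(cronbach_alpha(10, 0.40), 2))                  # Jack's 10-item scale
print(round(standard_error_of_measurement(4, 0.80), 2))    # Jack's employment test
print([round(x, 2) for x in true_score_band(80, 4, 0.80)])            # about 68% band
print([round(x, 2) for x in true_score_band(80, 4, 0.80, n_sems=2)])  # about 95% band
```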
This also provides useful information in determining whether two scores are significantly different. If the lower limit of the higher score is above the upper limit of the lower score, then we can conclude the two scores are significantly different. For example, following the example above, if one applicant scored 80 and another scored 72, Jack can be 95% confident that the two scores are different (that the first applicant truly has a higher score).

Validity of Measures

Suppose that Jack and his assistant manager each interviewed applicants and then rated them on a 1 to 10 point scale. Jack found that the correlation between their ratings was r=.75. One might be tempted to conclude that Jack and his assistant must do a good job of selecting applicants since they have fairly consistent evaluations. However, reliability of measurement does not necessarily imply accuracy of judgment. For example, weight can be measured quite reliably, but surely is not an accurate predictor of performance for most jobs. Similarly, while Jack and his hand-picked assistant's judgments are consistent, it could be because they both evaluate applicants on criteria not strongly related to job performance (e.g., appearance).

The above example illustrates the importance of validity in human resource management. Validity refers to how well the instrument measures or predicts the criterion. If we have information from a measurement device, how much does that information help in predicting the criterion of interest? If the highest (lowest) scores on a predictor always led to the highest (lowest) scores on the criterion, our predictor would be perfectly valid. Unfortunately, in practice this does not occur. The question then becomes: how does one go about designing a measure to be as valid as possible and evaluating if a given measure is valid?
Strategies used to establish validity depend on both the specific use of the measuring instrument and the data collection constraints imposed on the organization. The primary validation strategies can be classified as either empirical or logical. Empirical strategies estimate the validity of a procedure by examining the correlation or regression coefficient between the predictor and the criterion. High correlation coefficients imply high validities.¹⁰ The most important empirical strategy is criterion-related validity. Logical strategies establish validity by evaluating how well the measuring device samples the criterion. The important logical strategy is content validation. Face validity is an informal method, neither logical nor empirical. Construct validity is actually a combination of empirical and logical strategies that enable us to understand the factors that cause variation in the criterion. Each will be explained in turn.

Criterion-Related Validity

Criterion-related validation is employed when one wishes to quantitatively estimate the relationship between a predictor and the criterion. For example, if Jack were to relate (using correlation or regression) interview or test scores to job performance in evaluating the accuracy of the predictor, he would be using a criterion-related strategy. Those predictors explaining the most variance in the criterion are the most valid and will be preferred. There are two specific variants of criterion-related strategies: predictive and concurrent.

Predictive Validation

In predictive validation, the predictor is measured at one point in time and information on the criterion is gathered at a later date. Then, the two sets of information are correlated.
Perhaps the "purest" way to conduct predictive validation in selection decisions, for example, is to gather information on the predictor and then select applicants on the basis of some other predictor. For most organizations this "pure" method of validation is impractical. It is costly to administer a test when no direct result is forthcoming. This is particularly true in the case of the purest predictive design, which would require hiring all applicants. A more realistic but still costly method would entail giving applicants the test but ignoring the results from this test when making hiring decisions. A problem with both of these methods is that managers need validation information quickly to avoid costly mistakes in the immediate future. "If the test you want me to use is so good," Jack might ask, "Why can't I use it now?"

¹⁰ The following formula is used to estimate what the validity would be if there were no measurement error (reliability was perfect):

$$r_{xy} = \frac{r_{xy(obs)}}{\sqrt{r_{xx}}\sqrt{r_{yy}}}$$

Where $r_{xy}$ = estimated true correlation; $r_{xy(obs)}$ = observed correlation; $r_{xx}$ = reliability of predictor; $r_{yy}$ = reliability of criterion. This is the best estimate of the true validity of the measure.

Imagine if we used a predictive validation design whereby we administer the measuring device (e.g., an employment test) to applicants, select on the basis of those scores, and later correlate predictor scores with measures of job performance. Why is this a problem? The primary problem with this strategy is that the correlation underestimates the true relationship between the test and performance because of restriction of range in the predictor. Because only those who scored above the cutoff point on the predictor were hired, we never know how those who were not hired would have scored on the criterion (job performance). Figure 9a shows a "true" relationship between the test and job performance of r=.56.
If the organization were to select on the basis of test scores, Figure 9b indicates that because the range is restricted, only information to the right of the cutoff $X_c$ is considered, and the obtained correlation coefficient would drop to r=.19 even though the true relationship was still r=.56.

Figure 9. Effect of Range Restriction on Observed Correlation (both panels are titled "Validity of Test"; x-axis: Score on Selection Test; y-axis: Performance Rating)
Figure 9a. Relationship Without Range Restriction ($r_{xy}=.56$)
Figure 9b. Observed Relationship With Range Restriction ($r_{xy}=.19$). Only those applicants with scores above 100 on the selection test were hired. Thus, in validating the test, performance ratings of those not hired are not available. This range restriction downwardly biases the observed correlation (if they had been hired, the observed correlation would have been r=.56).

It is possible to estimate the correlation between the predictor and criterion if no restriction of the range existed. The formula to correct estimated validities for range restriction is relatively complicated. This formula is provided below:

$$r_t = \frac{r_{xy}\,(s_t/s_r)}{\sqrt{1 - r_{xy}^2 + r_{xy}^2\,(s_t/s_r)^2}}$$

Where:
$r_t$ = estimated "true" correlation between predictor and criterion
$r_{xy}$ = observed correlation between predictor and criterion
$s_t$ = standard deviation of predictor for total sample (estimated on applicant pool)
$s_r$ = standard deviation of restricted sample

In essence, this formula estimates what the distribution of test scores and job performance would have looked like if all applicants were hired. As such, it is a hypothetical means of projecting what the validity would be if all information was available.
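Both corrections discussed in this part of the primer, the footnoted correction for unreliability and the correction for range restriction, can be sketched as short functions. The inputs below are hypothetical: the text does not report the standard deviations behind Figure 9 or the reliabilities of Jack's measures, so the printed values are illustrations and are not meant to reproduce the figure. The function names are likewise assumptions.

```python
import math

def correct_for_attenuation(r_observed, r_xx, r_yy):
    """Estimated validity if predictor and criterion were measured without error."""
    return r_observed / (math.sqrt(r_xx) * math.sqrt(r_yy))

def correct_for_range_restriction(r_observed, sd_total, sd_restricted):
    """Estimated validity in the full applicant pool, given the observed
    correlation and the predictor's standard deviation in the full pool
    (sd_total) and among those actually hired (sd_restricted)."""
    ratio = sd_total / sd_restricted
    return (r_observed * ratio) / math.sqrt(
        1 - r_observed ** 2 + r_observed ** 2 * ratio ** 2
    )

# Hypothetical inputs: observed r of .30, predictor reliability .80, criterion reliability .60.
print(round(correct_for_attenuation(0.30, 0.80, 0.60), 3))

# Hypothetical inputs: observed r of .19, applicant-pool SD of 15, SD of 5 among hires.
print(round(correct_for_range_restriction(0.19, 15, 5), 3))
```

When the two standard deviations are equal there is no restriction and the formula returns the observed correlation unchanged, which is a quick sanity check on any implementation.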
Concurrent Validation

Perhaps the most expedient method of empirical validation is concurrent validation. In this case, present employees are administered the employment test, and their most recent performance ratings are correlated with their test scores. While this approach is convenient, particularly under time constraints, there are several potential problems. First, it is not clear that current job holders are as motivated to do well on the predictor (after all, their employment does not hinge on performance on the test) as actual applicants for the job. Further, how would those who quit or were fired have tested? This restriction in range of the criterion attenuates the observed predictor-criterion relationship. Perhaps most importantly, concurrent designs may be biased by the effect of job experience on the test. Almost certainly individuals learn skills on the job that are related to the skills assessed by the employment test. One approach to mitigate this bias is to control (using multiple regression) for experience in predicting job performance.

Content Validity

Criterion-related validation strategies concern the extent to which the predictor is a significant sign of the criterion. Content validation concerns the degree to which the measurement device is an adequate sample of the criterion. In other words, a test is content valid if it adequately represents the criterion of interest. For example, Jack might consider a test that entails evaluating how nimbly the applicant scoops ice cream into the cone and serves it to customers content valid. Though there are metrics or statistics to assess content validity, typically it is ascertained by subjective judgments. If one does not use quantitative results to evaluate the content validity of a test, how does one go about establishing validity?
Typically, an expert or experts evaluate how well the content of the test represents job performance. In short, the knowledge, skills, and abilities (as identified by a job description and specification) required to perform the job must be reflected in the test for it to be judged content valid. Because content validity is judgmental, it is crucial that those who evaluate the content of the test be experts regarding the job in question, and be supplied with accurate information on the test and criterion.

Face Validity

Face validity refers to whether individuals taking the test believe it to be a valid measure of the criterion. In short, is the test valid on "the face of it?" While this is an informal and entirely subjective method, it can be very important to organizations. If applicants view the test as a poor method of selection, using the test might generate more resentment than it is worth. Although applicants judge a test as fair or unfair on many grounds, content validity would appear to be one way to increase face validity. For example, a work sample test (e.g., in Jack's case, having applicants scoop ice cream and serve it to customers) would likely be judged to have high content validity because it samples a key aspect of performance. For the same reason it should also have high face validity. Thus, content valid tests will almost always be face valid, although the reverse is not necessarily true.

Construct Validity

Construct validity has as its goal to understand the trait or construct that the test measures. Because it entails more than prediction or sampling, it is a more rigorous method of validation.
While construct validation can be conducted in many different forms, several of the more common are: 1) correlations between several different measures of the construct; 2) expert judgment regarding the appropriateness of the test in sampling or predicting the underlying construct; 3) correlational relationships between the measures and behaviors purportedly manifested by the construct.¹¹

¹¹ There are more advanced methods (such as factor analysis) and concepts (such as convergent and discriminant validity) designed to assess construct validity (see Schwab & Trevor, 2012).

For example, suppose Jack wished to assess the construct validity of an integrity test. Does the test allow Jack to understand what integrity is and how well it is measured by the test? If Jack knew that his test was highly correlated with other measures of honesty (#1), was rated as appropriate by experts on the subject (#2), and showed a strong negative correlation with stealing (#3), it would provide some evidence of the construct validity of the measure. While construct validation is demanding, the conclusions one can draw about applicants based on the test are stronger, as one has a better idea of what factors cause the construct.

Cross-Validation

How does one know if a validity coefficient calculated from one sample will apply to other samples of interest? Cross-validation is the procedure by which one demonstrates whether a predictor validated from the present sample continues to be a valid predictor when applied to another sample. Cross-validation is important in selection because a prediction scheme (for example, weights on various predictors) is often applied to many samples subsequent to the one in which it was originally developed. It is crucial, therefore, to investigate how valid this scheme is on the various samples to which it might be applied.
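A minimal sketch of such a check follows: weights on two predictors are estimated in one sample and then applied, unchanged, to a second sample, where predicted and actual criterion scores are correlated. Everything here is an assumption for illustration: the data are hypothetical, the function names are invented, and the two-predictor least-squares solution is written out by hand to keep the example self-contained.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation between two lists of scores."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def fit_two_predictors(rows):
    """Least-squares weights (a, b1, b2) for y = a + b1*x1 + b2*x2.

    rows is a list of (x1, x2, y) tuples."""
    n = len(rows)
    m1 = sum(r[0] for r in rows) / n
    m2 = sum(r[1] for r in rows) / n
    my = sum(r[2] for r in rows) / n
    s11 = sum((r[0] - m1) ** 2 for r in rows)
    s22 = sum((r[1] - m2) ** 2 for r in rows)
    s12 = sum((r[0] - m1) * (r[1] - m2) for r in rows)
    s1y = sum((r[0] - m1) * (r[2] - my) for r in rows)
    s2y = sum((r[1] - m2) * (r[2] - my) for r in rows)
    det = s11 * s22 - s12 ** 2
    b1 = (s22 * s1y - s12 * s2y) / det
    b2 = (s11 * s2y - s12 * s1y) / det
    return my - b1 * m1 - b2 * m2, b1, b2

def validity(rows, weights):
    """Correlation between predicted and actual criterion scores."""
    a, b1, b2 = weights
    predicted = [a + b1 * x1 + b2 * x2 for x1, x2, _ in rows]
    actual = [y for _, _, y in rows]
    return pearson_r(predicted, actual)

# Hypothetical (test 1, test 2, performance) records.
development = [(50, 80, 75), (70, 40, 52), (90, 90, 88),
               (60, 50, 61), (55, 65, 64), (85, 45, 66)]
holdout = [(40, 70, 60), (80, 60, 74), (65, 85, 79), (75, 55, 63), (45, 50, 58)]

weights = fit_two_predictors(development)
print("validity in development sample:", round(validity(development, weights), 2))
print("cross-validated validity:", round(validity(holdout, weights), 2))
```

The key design point is that the weights are never re-estimated on the second sample; only then does the second correlation say anything about how well the scheme travels.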
Cross-validation generally begins by gathering predictor and criterion information on the current sample and then calculating a correlation coefficient or regression equation. Next, predictor information is gathered on a separate, independent group. Criterion scores for this group are then predicted using the validity coefficient(s) from the original sample. Finally, the predicted criterion values are correlated with the actual criterion values. The higher this correlation, the greater the confidence that the selection method is valid across samples. Perhaps the most practical approach to cross-validation is to split the sample in half, using one half for developing the prediction scheme, and testing the scheme on the other half. Regardless of the method used, the cross-validated coefficient can be expected to "shrink" because the original scheme capitalized on idiosyncrasies of the original sample that do not generalize to the other. If the shrinkage is great, doubt is cast on the usefulness of the predictor(s) beyond the sample on which they were originally based.

Validity Generalization

One of the traditional views of personnel psychology was that validities for employment tests are situation-specific. This was based on empirical results showing considerable variation in validity coefficients across populations. This opinion carried great weight in the formulation of early standards and laws governing employee tests, which advised against borrowing validity evidence from other populations unless it could be demonstrated that work behaviors and the organizational context between the populations were very similar. Schmidt and Hunter have convincingly argued that the apparent situational specificity of validity coefficients might be due to artifacts in the measuring procedures.
For example, small sample sizes, differences in reliability in the predictor and criterion, and differences in range restriction are only a few of the factors that distort estimates of validity across samples, irrespective of the true validity. Schmidt, Hunter, and colleagues have found that nearly all of the variance in validity estimates is due to these artifacts. Their findings indicate that validity coefficients are much more generalizable than has typically been assumed. The implication is that managers may not be forced to "re-invent the wheel" for their staffing decisions. They may be able to rely on others who have demonstrated the test to be valid.12 In fact, meta-analysis, introduced in the next section, will show how the organization can use findings compiled across many organizations in making human resource decisions.

V. CONFIRMATORY RESEARCH

Obviously, a central part of a manager's job is to make decisions. But how can one determine the quality of those decisions? Successful outcomes are the ultimate standard, but final outcomes (e.g., profitability, market share) give us very poor information about exactly where decisions might be improved. Confirmatory research enables us to investigate the accuracy of human resource decisions, the cost of errors associated with particular practices, and how to compile findings in the hope of making better decisions in the future.

Decision Analysis

After Jack institutes his new hiring procedure, he might like to see his "batting average." Remember from hypothesis testing that we discussed four types of decisions: accepting the null hypothesis when it is true; accepting the null when it is false (Type II error); rejecting the null when it is true (Type I error); and rejecting the null when it is false (power). Decision analysis is another 2 × 2 procedure that provides information on the immediate consequences of human resource decisions.
For the purposes of decision analysis, we assume that the null hypothesis is that the individual will be considered successful on the job. Accepting the applicant when he or she will in fact be successful is obviously a correct decision. However, rejecting the applicant when he or she would have been successful is an error. Rather than being labeled a Type I error, in decision analysis such a mistake is termed a false negative (applicants falsely predicted to be unsuccessful). Rejecting the applicant who would have been considered unsuccessful is a correct decision. Finally, accepting an applicant who turns out to be unsuccessful is a false positive (positive performance was falsely predicted).

12 Not all courts have accepted this standard.

Figure 10 shows the scatterplot of predictor-criterion scores from Figure 9, with a validity coefficient of r=.54. Point Xc represents the cutoff point for predictor scores (in this case, the cutoff is 94). Applicants scoring to the right of Xc are hired; those to the left are rejected. Cutoffs are set based on the desired number of employees hired, minimum qualifications needed, or both factors. Point Ys represents the minimum performance required to be judged successful on the job (in this case, the minimum performance baseline is the scale midpoint, 5.5 on the 1-10 scale; where the baseline is set depends, of course, on the job, the performance standards, and so forth). Those above it are considered successful employees; those below it are not. Applicants in Quadrant I were hired and were above the baseline (considered successful). Applicants in Quadrant III were not hired and, had they been, would have been below the performance baseline (considered unsuccessful). Thus, Quadrants I and III are correct decisions. Applicants in Quadrant II were not hired but, had they been, would have been above the baseline (considered successful).
Applicants in Quadrant IV were hired, but performed below the baseline. Thus, whereas Quadrants I and III represent correct decisions, Quadrants II and IV represent errors. Applicants in Quadrant II are false negatives. Employees in Quadrant IV are false positives.13

Setting a cutoff score defines the selection ratio, or the proportion of applicants hired. It can be calculated from the number of individuals in each quadrant using the following formula:

$$\text{Selection ratio} = \frac{I + IV}{I + II + III + IV}$$

The lower (higher) the cutoff, the higher (lower) the selection ratio. Because 48 out of 67 applicants in Figure 10 were hired, the selection ratio is (48/67)=.72.14

The base rate is the proportion of applicants considered successful if all applicants were hired. It is represented by the following formula:

$$\text{Base rate} = \frac{I + II}{I + II + III + IV}$$

In Figure 10, 45 out of 67 applicants would be considered successful. Therefore, the base rate is .67 (45/67).

Of course, in a predictive validation design (where the selection measure is used to hire from an applicant pool), if the test is used in making decisions, Quadrants II and III are missing (since applicants who scored below the cut line were never hired). However, as discussed previously, there are several options: (1) until the measure is validated, hiring decisions can be made without regard to scores on the selection measure; (2) simulated results for those quadrants can be constructed based on range restriction; (3) a concurrent validation design can be used such that the selection measure is given to current employees.

13 Selection ratios vary dramatically by job type, industry, and labor market conditions. For example, one would expect a very high selection ratio in hiring packing plant workers in good economic conditions (I worked with one such organization that hired virtually every able-bodied applicant).
In contrast, the selection ratio in hiring a professor may be .01, which is precisely what it was with a search committee I chaired in 2012.

[Figure 10: Decision Analysis of Predictor-Criterion Scores. Scatterplot divided by the predictor cutoff Xc and the performance baseline Ys into Quadrant II (upper left), Quadrant I (upper right), Quadrant III (lower left), and Quadrant IV (lower right).]

Obviously the goal is to eliminate the errors. One way to reduce the overall error rate would be to choose a more valid selection procedure. A validity coefficient of 1.0 (a straight line of scores) would lead to no errors. A coefficient of 0.0 (a circle of scores) would lead to as many errors as correct decisions. The selection ratio and base rate also have implications for errors. Moving the cutoff or minimum level of acceptable performance decreases one error while increasing the other. However, there is a point at which total errors are minimized. The highest number of correct decisions is where the number of false positives exactly equals the number of false negatives.

The optimal place to set the cutoff, however, depends on the cost of each error to organizations. False positives are undoubtedly more salient to managers. Hiring applicants who later turn out to be poor matches is very visible. Conversely, those who got away are often unnoticed. A strategy designed to minimize false positives would mean hiring fewer applicants. Therefore, the balance is between meeting one's labor force requirements and minimizing the number of applicants who are incorrectly hired.

The benefit of decision analysis is not that it makes the staffing decisions for the manager. Rather, the advantage is that it presents the manager with the consequences of the human resource judgments he or she must make. Further, the natural tradeoff between false positives and false negatives forces managers to consider the costs of both errors in formulating their selection strategies.
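The quadrant counts, selection ratio, and base rate described in this section can be tallied with a few lines of code. The Python sketch below is illustrative only: the eight applicant records are hypothetical (they are not the Figure 10 data), the function name decision_analysis is an assumption, and scores falling exactly on the cutoff or the baseline are assumed to count as hired or successful.

```python
def decision_analysis(applicants, cutoff, baseline):
    """Tally the four decision-analysis quadrants.

    applicants: list of (predictor score, performance rating) pairs.
    cutoff:     predictor cutoff (Xc); scores at or above it count as hired
                (an assumption about how ties are treated).
    baseline:   minimum successful performance (Ys); ratings at or above it
                count as successful (also an assumption).
    """
    counts = {"I": 0, "II": 0, "III": 0, "IV": 0}
    for predictor, performance in applicants:
        hired = predictor >= cutoff
        successful = performance >= baseline
        if hired and successful:
            counts["I"] += 1      # correct decision: hired and successful
        elif not hired and successful:
            counts["II"] += 1     # false negative
        elif not hired and not successful:
            counts["III"] += 1    # correct decision: rejected, would have failed
        else:
            counts["IV"] += 1     # false positive
    total = sum(counts.values())
    return {
        "counts": counts,
        "selection_ratio": (counts["I"] + counts["IV"]) / total,
        "base_rate": (counts["I"] + counts["II"]) / total,
        "proportion_correct": (counts["I"] + counts["III"]) / total,
    }


# Hypothetical applicants: (predictor score, performance on a 1-10 scale).
applicants = [
    (98, 7.0), (101, 6.2), (95, 4.0), (90, 6.0),
    (88, 3.5), (97, 8.1), (92, 5.0), (94, 5.5),
]

result = decision_analysis(applicants, cutoff=94, baseline=5.5)
print(result["counts"])            # {'I': 4, 'II': 1, 'III': 2, 'IV': 1}
print(result["selection_ratio"])   # 0.625
print(result["base_rate"])         # 0.625
```

Rerunning the function with a different cutoff shows the tradeoff discussed above: raising the cutoff tends to convert false positives into correct rejections, but it also turns some successful hires into false negatives.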
Utility Analysis

It is a truism that profit and loss are the bottom line for most organizations. Utility analysis concerns the evaluation of the implications of human resource (staffing in particular) decisions for organizations in dollar terms. As such, it is a powerful means to understand the costs and benefits of decisions managers must make regarding selection.15

15 Cascio and Aguinis (2010) also analyze the costs associated with other human resource management activities (turnover, absenteeism, training programs).

Suppose that Jack wishes to hire 50 employees, and has 100 applicants for the positions. The selection ratio is .50 (50/100). Jack has the choice of using two different predictors but is unsure of which to use (he cannot afford to use both). Schmidt, Hunter, McKenzie and Muldrow (1979) provide a framework to analyze which predictor will yield the biggest dollar improvement over random selection.

Suppose the two predictors Jack is considering are the interview (denoted P1) and a work sample test that entails scooping ice cream and serving it to the customer (denoted P2). It costs $250 to interview an applicant and $325 per applicant to administer the work sample (these costs mostly comprise the staff time required to interview applicants or administer the work sample to them). In the past Jack has found a correlation of .30 between his ratings of applicants based on the interview and job performance, and a correlation of .35 between scores on the work sample and job performance ratings. If the selection ratio is .50, the average predictor score of the top 50% of applicants is z=.80 (.80 standard deviations above the mean).16 The final piece of information Jack needs is the standard deviation of performance in dollars. Cascio and Aguinis (2010) present several methods for calculating the standard deviation of dollar-valued performance.
The simplest method is to assume that SDy is 40% of employees' average annual salary. Assume that Jack finds the standard deviation to be $6,000, indicating that an employee who performs one standard deviation above the mean is worth $6,000 more to Jack than the average employee. Schmidt et al. use the following formula to estimate the net increase in dollars to the organization from using the selection procedure in question over random selection:

$$U = (N_s \times r_{xy} \times SD_y \times z_{xs}) - (N_t \times c)$$

Where:
U = utility (net gain from using selection procedure)
Ns = number of applicants selected
rxy = correlation between predictor and job performance
SDy = standard deviation of performance in dollars
zxs = average standard score on predictor for applicants selected
Nt = total number of applicants
c = cost of predictor per applicant

16 Cascio and Aguinis (2010) provide tables for estimating this figure.

The net gain over random selection for the interview would be:

$$U_{P1} = (50 \times .30 \times \$6{,}000 \times .80) - (100 \times \$250) = \$47{,}000$$

The net gain for the work sample would be:

$$U_{P2} = (50 \times .35 \times \$6{,}000 \times .80) - (100 \times \$325) = \$51{,}500$$

Although both are a substantial improvement over random selection, it appears that Jack would be better off using the work sample even though it costs more to administer. Use of the work sample is expected to result in an annual net gain $4,500 greater than that of the interview.

One can see that the potential payoff from a selection procedure is a function of several factors. As the selection ratio increases (that is, as Jack becomes less selective), the average standard score of those selected falls, and the gain per person hired falls with it. In fact, if the selection ratio were quite high, the work sample would lose money compared to random selection. The validity of the test will also increase the utility. If the validity for either test were .10, Jack would lose money by using either method rather than random selection. However, since a more valid selection procedure may be more expensive to administer, one must balance the extra cost against the savings from better predictors. Finally, as job performance becomes more valuable, it pays the organization to have a more valid selection procedure.

Meta-Analysis

Remember that Jack wondered how to examine results from other organizations in formulating his own human resource management policies? He could rely on information from surveys of other organizations, or he may have information on his closest competitor(s). However, samples from disparate populations may be difficult for Jack to assimilate in a systematic manner. Further, he has no way of determining if his sample is representative. Meta-analysis refers to the statistical analysis of empirical results accumulated from individual studies. It allows the collection of data from various studies in an objective and systematic manner, permitting the manager to make more informed and comprehensive judgments about the relationship(s) of interest.

The particular methods of meta-analysis vary, depending on the data available and the preferences of the investigator. The general approach is to combine findings in a certain manner to arrive at the average result. For example, suppose Jack had results from 5 organizations on their findings regarding the relationship between satisfaction with pay and intent to leave the organization. Their results are described in Figure 11. How would Jack interpret these findings?
Figure 11
Correlation Between Pay Satisfaction and Intent to Leave Organization for 5 Companies

Company    rxy     n
#1         -.25    110
#2         -.30     52
#3         -.45     98
#4         -.29     28
#5         -.51    205

We could find the average correlation between pay satisfaction and intent to leave using the following formula:

$$\bar{r} = \frac{\sum r_{xy}}{n_r}$$

Where:
rxy = the correlation from each study
nr = the number of studies

For our example the average correlation is:

$$\bar{r} = \frac{(-.25) + (-.30) + (-.45) + (-.29) + (-.51)}{5} = -.36$$

One could also calculate a weighted mean, so that the studies with larger sample sizes would be given proportionately greater weight (thus reducing the influence of sampling error). Again using our example, with the total sample size of 493 in the denominator:

$$\bar{r} = \frac{(-.25 \times 110) + (-.30 \times 52) + (-.45 \times 98) + (-.29 \times 28) + (-.51 \times 205)}{493} = -.41$$

Based on the result, then, satisfaction with pay explains about 16% of the variance in intent to leave (the square of the weighted average correlation). If Jack has a problem with turnover he may want to increase employees' compensation.

One of the strengths of meta-analysis is that it is possible to combine studies reporting differing statistics into an overall effect. For example, t-statistics, correlations, and z-scores can all be transformed into the same metric, enabling interpretation of the overall relationship despite the differing statistics. The manager will not often conduct a meta-analysis. However, the increasing proliferation of meta-analytic results in professional journals allows the manager to consult the source for an overall summary statistic in formulating his or her policies.

Another advantage of meta-analysis is that statistical corrections for study artifacts can be made. Computing an average correlation weighted by sample size corrects for sampling error (removing the bias that would be created by giving small sample correlations or effects the same weight as large sample correlations or effects).
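The two averages computed from Figure 11 can be reproduced with a short Python sketch. It is a minimal illustration rather than a full meta-analytic procedure: the function names are assumptions, and the only adjustment shown is the sample-size weighting described above.

```python
# Figure 11 values: (correlation, sample size) for each of the 5 companies.
studies = [(-0.25, 110), (-0.30, 52), (-0.45, 98), (-0.29, 28), (-0.51, 205)]


def unweighted_mean_r(studies):
    """Simple average: every study counts equally, whatever its size."""
    return sum(r for r, _ in studies) / len(studies)


def weighted_mean_r(studies):
    """Sample-size-weighted average: larger studies get more weight."""
    total_n = sum(n for _, n in studies)
    return sum(r * n for r, n in studies) / total_n


print(f"Unweighted mean r: {unweighted_mean_r(studies):.2f}")  # -0.36
print(f"Weighted mean r:   {weighted_mean_r(studies):.2f}")    # -0.41
```

The weighted figure is pulled toward company #5 because its sample (n = 205) is the largest of the five.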
However, other corrections can be made as well, including corrections for predictor and criterion unreliabilities and for range restriction (each using the formulae provided earlier).

VI. COMPUTER PACKAGES

The statistics and measurement techniques reviewed in this paper can be calculated, as they typically are, using computer packages. While the packages available are too numerous to mention, PC Magazine reviewed 49 of the most popular statistical packages. The editor recommends several advanced packages: SPSS, Stata, SAS, Minitab, and R. Each performs all the statistics reviewed in this paper: mode, median, mean, standard deviation, correlation, reliability, difference between means, and regression. More advanced statistics are also within the packages' capabilities. The price of these packages averages about $795. The article also reviews basic packages that are cheaper and easier for the novice to use. R is particularly noteworthy because it is free (see http://www.r-project.org/), though it is more technically oriented (and flexible) than other packages.

Spreadsheets such as Excel also perform all of the basic statistical analyses mentioned above, though they can be quite cumbersome to use (there are add-ins, such as EZAnalyze [http://www.ezanalyze.com/], that make analyzing data with Excel somewhat easier).

VII. SUMMARY

Statistics are the methods used to summarize data, and to infer knowledge based upon them. Statistics indicating central tendency describe the typical value of a distribution. Dispersion indicates how variable the scores are from the mean. Both dispersion and central tendency can be used for inferential purposes. The normal distribution is used to make probabilistic inferences about variables following such a distribution.
These inferences are made based upon the null (e.g., no significant relationship or difference) and alternative (a significant difference or relationship) hypotheses. Rejecting a null of no differences indicates an inferred difference between variables. A correlation coefficient is a standardized measure of linear association between two variables. High correlation coefficients indicate the two variables are strongly related. Regression is the prediction of one variable based on the level of one or more other variables.

Measurement is the assignment of numbers to objects, attributes, or events. The quality of the measuring device directly affects managers. Good measures provide important information about the attributes of interest. The two primary means of evaluating measures are reliability and validity. Reliability indicates the consistency of the measuring instrument. If a measuring instrument is inconsistent, serious doubt is cast on its usefulness as a way of gaining information about the attribute. Standard error of measurement indicates the average degree of error in the measuring instrument. Validity refers to how well the instrument predicts the criterion. Valid measures provide much information about the criterion. There are several different forms of validity and validation, depending on the data and the goal of the investigator.

There are several ways the manager can investigate and improve on the quality of his or her decisions. Decision analysis refers to the analysis of mistakes in human resource decisions. Utility analysis concerns the evaluation of implications of human resource (particularly staffing) decisions in dollar terms. Finally, meta-analysis is the empirical analysis of results accumulated from individual studies.

References

Cascio, W. F., & Aguinis, H. (2010).
Applied Psychology in Human Resource Management (7th ed.). Upper Saddle River, NJ: Prentice Hall.
Cohen, J. (1994). The Earth Is Round (p < .05). American Psychologist, 49, 997-1003.
Cronbach, L. J. (1951). Coefficient Alpha and the Internal Structure of Tests. Psychometrika, 16, 297-334.
Hunter, J. E., & Schmidt, F. L. (2004). Methods of Meta-Analysis: Correcting Error and Bias in Research Findings (2nd ed.). Newbury Park, CA: Sage.
Kennedy, P. (2008). A Guide to Econometrics (6th ed.). Hoboken, NJ: Wiley-Blackwell.
Schwab, D. P., & Trevor, C. O. (2012). Research Methods for Organizational Studies (3rd ed.). Florence, KY: Routledge Academic.
Schmidt, F. L., Hunter, J. E., McKenzie, R. C., & Muldrow, T. W. (1979). Impact of Valid Selection Procedures on Work-Force Productivity. Journal of Applied Psychology, 64, 609-626.