TEST CONSTRUCTION (Jennifer Janusz)
Test= systematic method for measuring a sample of behavior
Test construction involves:
1) specifying the test's purpose
2) generating test items
3) administering the items to a sample of examinees for the purpose of item analysis
4) evaluating the test's reliability and validity
5) establishing norms
Item analysis
Relevance- extent to which test items contribute to achieving the stated goals of testing
To determine relevance:
1) Content appropriateness: Does the item actually assess the domain the test is designed to evaluate?
2) Taxonomic level: Does the item reflect the appropriate ability level?
3) Extraneous abilities: Does the item require knowledge, skills, or abilities outside the domain of interest?
Ethics- to meet the requirements of privacy, the information asked by test items must be relevant to the stated purpose of
testing (Anastasi)
Item Difficulty
Determined by the Item Difficulty Index (p)
- values of p range from 0-1
- calculated by dividing # who answered correctly by total # of sample
- larger values of p indicate easier items
- p=1, all people answered correctly
- p=0, no one answered correctly
Typically, items with moderate difficulty level (p=.5) are retained
- increases score variability
- ensures that scores will be normally distributed
- provides maximum differentiation between subjects
- helps maximize test’s reliability
Optimal difficulty level is affected by several factors
1) greater the probability that correct answer can be selected by guessing, the higher the optimal difficulty level
- for true/ false item, where chance is .50, preferred difficulty level is .75
2) If the goal of testing is to choose a specific number of examinees, the preferred difficulty level will equal the proportion of examinees to be chosen
- if only 15% of examinees are to be admitted on the basis of the test, the average item difficulty level for the entire test should be .15
People in sample must be representative of population of interest, since difficulty index is affected by nature of tryout
sample.
Study tip- In most situations p=.50 is optimal, except in T/F test, where p=.75 is optimal.
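A minimal Python sketch of the p computation described above (the response data are hypothetical):

```python
# Item difficulty index p = (# who answered correctly) / (total # in sample).
responses = [1, 1, 0, 1, 0, 1, 1, 1, 0, 1]  # 1 = correct, 0 = incorrect

p = sum(responses) / len(responses)
print(f"p = {p:.2f}")  # 0.70 -> a relatively easy item
```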
Item Discrimination
Extent to which an item differentiates between examinees who obtain high vs low scores
Item Discrimination Index (D)
- identifying the examinees who obtained scores in the upper and lower 27% of the distribution
- for each item, subtract the percent of examinees in the lower-scoring group (L) from the percent of examinees in the
upper-scoring (U) group who answered it right
D= U-L
- range from –1 to +1.
- D= +1 if all in upper group and none in lower group answer right
- D= -1 if none in upper group and all in lower group answer right
- Acceptable D= .35 or higher
- Items with moderate difficulty level (.50) have greatest potential for maximum discrimination
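A minimal sketch of the D computation using the upper/lower 27% groups described above (data and function name are illustrative):

```python
# Item discrimination index D = U - L, where U and L are the proportions of
# the top and bottom 27% of scorers who answered the item correctly.
def discrimination_index(total_scores, item_correct):
    """total_scores: overall test scores; item_correct: 1/0 for one item."""
    n = len(total_scores)
    k = max(1, round(0.27 * n))                  # size of each extreme group
    order = sorted(range(n), key=lambda i: total_scores[i])
    lower, upper = order[:k], order[-k:]
    L = sum(item_correct[i] for i in lower) / k  # proportion correct, low group
    U = sum(item_correct[i] for i in upper) / k  # proportion correct, high group
    return U - L
```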
Item Response Theory
Classical Test Theory- obtained test score reflects truth and error
Concerned with item difficulty and discrimination, reliability, and validity
Shortcomings:
1) item and test parameters are sample-dependent
- item difficulty index, reliability coefficient, etc. are likely to differ between samples
2) Difficult to equate scores on different tests
- score of 50 on one test doesn’t = 50 on another
Item Response Theory (IRT)
Advantages over Classical Test Theory
1) item characteristics are sample invariant- same across samples
2) test scores are reported in terms of the examinee's level of a trait rather than in terms of total test score, making it
possible to equate scores from different tests
3) has indices that help identify item bias
4) easier to develop computer-adaptive tests, where administration of items is based on the examinee's performance on
previous items
Item characteristic curve (ICC)- plots the proportion of examinees who answered an item correctly against the total test score, performance on an external criterion, or a
mathematically-derived estimate of ability
- provides info on relationship between an examinee’s level on the trait measured by the test and the probability that he
will respond correctly to that item
- P value: probability of getting item correct based on examinee’s overall level
Various ICCs provide information on either 1, 2, or 3 parameters:
1) difficulty
2) difficulty and discrimination
3) difficulty, discrimination, and guessing (probability of answering right by chance)
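The notes don't name a specific model; the three-parameter logistic (3PL) is the standard one behind the difficulty/discrimination/guessing parameters listed above. A sketch with illustrative parameter values:

```python
import math

# 3PL item characteristic curve: P(correct) given ability theta.
def icc_3pl(theta, a=1.0, b=0.0, c=0.0):
    """a = discrimination, b = difficulty, c = guessing (lower asymptote)."""
    return c + (1 - c) / (1 + math.exp(-a * (theta - b)))

# e.g., a 4-option multiple-choice item, so chance-level c is about .25
print(icc_3pl(theta=0.5, a=1.2, b=0.0, c=0.25))
```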
Study tip Link these with IRT- sample invariant, test equating, computer adaptive tests
RELIABILITY
Classical test theory- obtained score (X) composed of true score (T) and error component (E)
X=T+E
True score= examinee’s status with regard to attribute measured by test
Error component= measurement error
Measurement error= random error due to factors that are irrelevant to what is being measured and have an unpredictable
effect on test score
Reliability- estimate of proportion of variability in score that is due to true differences among examinees on the attribute
being measured
- when a test is reliable, it provides consistent results
- consistency = reliability
Reliability Coefficient
Reliability Index- (theoretical only)
- calculated by dividing true score variance by the obtained variance
- would indicate proportion of variability in test scores that reflects true variability
- however, true test scores variance not known so reliability must be estimated
Ways to estimate a test’s reliability:
1) consistency of response over time
2) consistency across content samples
3) consistency across scorers
Variability that is consistent is true score variability
Variability that is inconsistent is random error
Reliability coefficient- correlation coefficient for estimating reliability
- ranges from 0-1
- r = 0, all variability is due to measurement error
- r = 1, all variability due to true score variability
- Reliability coefficient symbolized by rxx .
- Subscript indicates correlation coefficient calculated by correlating test with itself rather than with another
measure
- Coefficient is interpreted directly as the proportion of variability in obtained test scores that reflects true score
variability
- Ex. r= .84 means that 84% of variability in scores due to true score differences while 16% (1.0 - .84) is due to
measurement error.
- a coefficient twice as large reflects twice as much true score variability
- Does NOT indicate what is being measured by a test
- Only indicates whether it is being measured in a consistent precise way
Study tip Unlike other correlations, the reliability coefficient is NEVER squared to interpret it. It is interpreted directly as a
measure of true score variability. r=.89 means that 89% of variability in obtained scores is true score variability.
Methods for Estimating Reliability
Test-Retest Reliability
- administering the same test to the same group on two occasions
- correlating the two sets of scores
- reliability coefficient indicates degree of stability or consistency of scores over time
- coefficient of stability
Source of error
1) Primary sources of error are random factors related to the time that passes
Time sampling factors
- random fluctuations in examinees over time (anxiety, motivation)
- random variations in testing situation
- memory and practice effects, when they don't affect all examinees in the same way
Appropriate for tests that measure things that are stable over time and not affected by repeated measurement.
- good for aptitude
- bad for mood or creativity
Higher coefficient than alternate form because only one source of error
Alternate (Equivalent, Parallel) Forms Reliability
- two equivalent forms of the test are given to same group and scores are correlated
- consistency of response to different item samples
- may be over time, if given on two different occasions
- alternate forms reliability coefficient
- coefficient of equivalence when administered at same time
- coefficient of equivalence and stability when administered at two different times
Sources of error
1) Primary source of error is content sampling
- error introduced by an interaction between different examinees' knowledge and the different content assessed by the two forms
- ex. items on form A might be a better match for one examinee’s knowledge, while the opposite is true for another
examinee
- two scores will differ, lowering the reliability coefficient
2) Time sampling can also cause error
Not good for
- when attribute not stable over time
- when scores can be affected by repeated measurement
- when the same strategies are used to solve problems on both forms (practice effect)
- when practice effects differ for different examinees (i.e., are random), they are a source of measurement error
Good for speeded tests
Considered to be the most rigorous and best method for estimating reliability
Often not done because of difficulty of creating two equal forms
Less affected by heterogeneous items than internal consistency
- higher coefficient than internal consistency (KR-20) when items are heterogeneous
Internal Consistency Reliability
Administering the test once to a single sample
Yields the coefficient of internal consistency
1) Split half
Test is split into equal halves so that each examinee has two scores
Scores are correlated
Most common method is to divide the test into odd- and even-numbered items
Problem- produces reliability coefficient based on test scores derived from one-half of the length of the test
- reliability tends to decrease as length of test decreases
- split-half tends to underestimate true reliability
- however, when the two halves are not equal in mean or SD, it may either under- or overestimate reliability
- corrected using the Spearman-Brown prophecy formula, which estimates what the reliability coefficient would have been if it were based on the full-length test
- Spearman-Brown is also used to estimate the effects of increasing or decreasing the length of a test on its reliability
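A minimal sketch of the Spearman-Brown prophecy formula, r_new = n*r / (1 + (n - 1)*r), where n is the factor by which the test length changes (n = 2 corrects a split-half coefficient; the values are illustrative):

```python
# Spearman-Brown: estimated reliability after changing test length by factor n.
def spearman_brown(r, n=2):
    return (n * r) / (1 + (n - 1) * r)

r_halves = 0.70                      # correlation between odd and even halves
print(spearman_brown(r_halves))      # ~0.82, estimated full-length reliability
print(spearman_brown(0.80, n=0.5))   # ~0.67, reliability if the test were halved
```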
2) Cronbach’s coefficient alpha
Test administered once to single sample
Formula used to determine average degree of inter-item consistency
Average reliability that would be obtained from all possible splits of the test
Tends to be conservative, considered the lower boundary of test’s reliability
Kuder-Richardson Formula 20 (KR-20)- use when tests are scored dichotomously (right or wrong); produces a spuriously high reliability coefficient for speeded tests
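A sketch of coefficient alpha, alpha = k/(k-1) * (1 - sum of item variances / variance of total scores); with dichotomous 0/1 items this is equivalent to KR-20 (the helper names are mine):

```python
# Coefficient alpha from per-item score lists (one list per item, all over
# the same examinees).
def variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

def cronbach_alpha(items):
    k = len(items)
    totals = [sum(scores) for scores in zip(*items)]  # each examinee's total
    item_var_sum = sum(variance(it) for it in items)
    return (k / (k - 1)) * (1 - item_var_sum / variance(totals))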
Sources of error
1) Content sampling
- split half: error resulting from differences in the two halves of the test (better fit for some examinees)
- coefficient alpha: differences between individual test items
2) Heterogeneity of content domain
- coefficient alpha only
- test is heterogeneous when it measures several different domains
- the greater the heterogeneity, the lower the inter-item correlations, and the lower the magnitude of coefficient alpha
Good for
- tests measuring a single characteristic
- when the characteristic changes over time (only one administration is needed)
- when scores are likely to be affected by repeated measurement
Bad for
- speeded tests, because they produce spuriously high coefficients
- alternate forms best for speed tests
Inter-rater ( Inter-scorer, Inter-observer) Reliability
When test scores rely on rater’s judgment
Done by
1) calculating a correlation coefficient (kappa coefficient or coefficient of concordance)
2) determining the percent agreement between the raters
- does not take into account the level of agreement that would have occurred by chance
- problematic when recording high-frequency behavior because the degree of chance agreement is high
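A sketch of Cohen's kappa, which corrects percent agreement for chance: kappa = (p_observed - p_chance) / (1 - p_chance). The rating data are hypothetical:

```python
from collections import Counter

def cohens_kappa(rater1, rater2):
    n = len(rater1)
    p_o = sum(a == b for a, b in zip(rater1, rater2)) / n   # observed agreement
    c1, c2 = Counter(rater1), Counter(rater2)
    # chance agreement: product of each category's marginal proportions
    p_e = sum((c1[cat] / n) * (c2[cat] / n) for cat in c1 | c2)
    return (p_o - p_e) / (1 - p_e)

ratings1 = ["yes", "yes", "no", "yes", "no", "no"]
ratings2 = ["yes", "no", "no", "yes", "no", "yes"]
print(cohens_kappa(ratings1, ratings2))   # ~0.33 despite 67% raw agreement
```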
Sources of error
- factors related to the raters (motivation, biases)
- characteristics of the measuring device
- reliability low when categories are not exhaustive and/or not mutually exclusive and discrete
- consensual observer drift
- observers working together influence each other so they score in similar, idiosyncratic ways
- tends to artificially inflate reliability
Improve reliability
- eliminate drift by having raters work independently or alternate raters
- tell raters their work will be checked
- training should emphasize difference between observation and interpretation
Study tip Spearman-Brown = split-half reliability; KR-20 = coefficient alpha; Alternate form most thorough method
Internal consistency not appropriate for speeded tests
Factors that Affect the Reliability Coefficient
1) Test length- longer the test, larger the reliability coefficient
- Spearman-Brown can be used to estimate effects of lengthening or shortening a test on its reliability coefficient
- Spearman-Brown tends to overestimate a test's true reliability
- overestimation is most likely when the added items do not measure the same content domain or when the new items are more susceptible to measurement error
- when the halves' means and SDs are not equal, it can either over- or underestimate
2) Range of test scores- maximized when range is unrestricted
- range affected by degree of similarity among sample on attribute measured
- when heterogeneous, range is maximized
- will overestimate reliability if the tryout sample is more heterogeneous than the examinees the test will actually be used with
- affected by item difficulty level
- if all easy or hard, results in restricted range
- best to choose items in mid-range (p = .50)
3) Guessing
- as probability of guessing correct answer increases, reliability coefficient decreases
- T/F test lower reliability coefficient than multiple choice
- Multiple choice lower reliability coefficient than free recall
Interpretation of Reliability
Scores achieved by the group and the individual
The Reliability Coefficient
Interpreted directly as the proportion of variability in a set of test scores that is attributable to true score variability
r = .84 means that 84% of variability in test scores is due to true score differences among examinees, while 16% is due to error
Coefficient of .80 is acceptable; for achievement and ability tests, .90 is typical
No single index of reliability for any test
Test’s reliability can vary by situation and sample
Standard Error of Measurement
Assists in interpreting the individuals score
Index of error in measurement
Construct a confidence interval around the score
- estimate range examinee’s true score likely to fall in
- when raw scores converted to percentile ranks, called percentile band
Use standard error of measurement
- index of amount of error expected in obtained scores due to unreliability of test
SEmeas = SDx * sqrt(1 - rxx)
SEmeas = standard error of measurement
SDx = standard deviation of test scores
rxx = reliability coefficient
Standard error affected by the standard deviation and the test’s reliability coefficient
Lower standard deviation and higher reliability coefficient, the smaller the standard error of measure (vice versa)
The standard error of measurement is a type of standard deviation, so it can be interpreted in terms of areas under the normal curve
- 68% confidence interval: obtained score +/- 1 SEmeas
- 95% confidence interval: obtained score +/- 2 SEmeas
- 99% confidence interval: obtained score +/- 3 SEmeas
Problem- measurement error not equally distributed throughout range of scores
- use of same standard error to construct confidence intervals for all scores can be misleading
- manuals report different standard errors for different score intervals
Study tip The name "standard error of measurement" can help you remember when it's used- to construct a confidence interval
around a measured (obtained) score
Know difference between standard error of measurement and standard error of estimate
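A minimal sketch of the SEmeas formula above and a confidence interval around an obtained score (the SD, reliability, and score values are illustrative):

```python
import math

# Standard error of measurement: SEmeas = SD * sqrt(1 - rxx).
def sem(sd, rxx):
    return sd * math.sqrt(1 - rxx)

se = sem(sd=15, rxx=0.89)          # e.g., an IQ-style scale -> ~4.97
obtained = 110
print(f"95% CI: {obtained - 2*se:.1f} to {obtained + 2*se:.1f}")  # +/- 2 SEmeas
```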
Estimating True Scores from Obtained Scores
Because of measurement error, obtained test scores tend to be biased estimates of true scores
- scores above mean tend to overestimate
- scores below mean tend to underestimate
- farther from the mean, greater the bias
Rather than using confidence interval, can use a formula that estimates true score by taking into account this bias by
adjusting the obtained score by using the mean of the distribution and the test’s reliability coefficient.
- less commonly used
Reliability of Difference Scores
Compare performance of one person on two test scores (i.e., VIQ and PIQ)
Reliability coefficient for the difference score can be no larger than the average reliabilities of the two tests
- Test A has reliability coefficient of .95 and Test B has .85, difference score will have reliability of .90 or less
Reliability coefficient for difference scores depends on degree of correlation between tests
- the more highly correlated the two tests are, the smaller the reliability coefficient and the larger the standard error of measurement
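The notes state these results without the formula; the standard classical-test-theory formula for the reliability of a difference score is r_dd = ((r_xx + r_yy)/2 - r_xy) / (1 - r_xy). A sketch with illustrative values:

```python
# Reliability of a difference score from the two tests' reliabilities
# (rxx, ryy) and their intercorrelation (rxy).
def diff_score_reliability(rxx, ryy, rxy):
    return ((rxx + ryy) / 2 - rxy) / (1 - rxy)

print(diff_score_reliability(0.95, 0.85, 0.0))   # uncorrelated tests: 0.90
print(diff_score_reliability(0.95, 0.85, 0.70))  # highly correlated: ~0.67
```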
VALIDITY
Test’s accuracy. Valid when it measures what it is intended to measure.
Intended use for tests, each has it’s own method for establishing validity
1) content validity- for a test used to obtain information about a person’s familiarity with a particular content or behavior
domain
2) construct validity- test used to determine the extent to which an examinee possesses a trait
3) criterion related validity- test used to estimate or predict a person’s standing or performance on an external criterion
Even when a test is found to be valid, it might not be valid for all people
Study tip When scores are important because they provide info on how much a person knows or on each person’s status
with regard to a trait- content and construct
When scores used to predict scores on another measure, and those scores are of most interest- criterion related validity
Content Validity
The extent that a test adequately samples the content or behavior domain it is to measure
If items not a good sample, results of test misleading
Most associated with achievement tests that measure knowledge and with tests designed to measure a specific behavior
Usually “built into” test as it is constructed by identifying domains and creating items
Establishment of content validity relies on the judgment of subject matter experts
If experts agree items are adequate and representative, then test is said to have content validity
Quantitative evidence of content validity
- coefficient of internal consistency will be large
- test will correlate highly with other tests of same domain
- pre-post test evaluations of a program designed to increase familiarity with domain will indicate appropriate changes
Don’t confuse with face validity
Content validity = systematic evaluation of tests by experts
Face validity = whether or not a test looks like it measures what it’s supposed to
If a test lacks face validity, examinees may not be motivated
Construct Validity
When a test has been found to measure the trait that it is intended to measure
A construct is an abstract characteristic that cannot be observed directly but must be inferred by observing its effects.
No single way to measure
Accumulation of evidence that test is actually measuring what it was designed to
1) Assessing internal consistency
- do scores on individual items correlate highly with overall score
- are all items measuring same construct
2) Studying group differences
- Do scores accurately distinguish between people known to have different levels of the construct
3) Conducting research to test hypotheses about the construct
- Do test scores change, following experimental manipulation, in the expected direction
4) Assessing convergent and discriminant validity
- does it have high correlations with measures of the same trait (convergent)
- does it have low correlations with measures of different traits (discriminant)
5) Assessing factorial validity
- Does it have the factorial composition expected
Most theory laden of the methods of test validation
- begin with a theory about the nature of the construct
- guides selection of test items and choosing a method for establishing validity
- example: if want to develop a creativity test and believe that creativity is unrelated to intelligence, is innate, and that
creative ppl generate more solutions, you would want to determine the correlation between scores on creativity tests
and IQ tests, see if a course in creativity affects scores, and see if test scores distinguished between ppl who differ in
number of solutions they generate
Most basic form of validity because techniques involved overlap those used for content and criterion-related validity
“all validation is one, and in a sense all is construct validation” Cronbach
Convergent and Discriminant Validity
Correlate test scores with scores on other measures
Convergent = high corr with measures of same trait
Discriminant = low corr with measures of unrelated trait
Multitrait-multimethod matrix- used to assess convergent and discriminant
- table of correlation coefficients
- provides info about degree of association between 2 or more traits that have been assessed using 2 or more measures
- see if the correlations between different methods measuring the same trait are larger than the correlations between the
same and different methods measuring different traits
You need two traits that are unrelated (assertiveness and aggressiveness) and each trait measured by different methods (self
and other rating)
Calculate correlation coefficient for each pair and put in matrix
Four types of correlation coefficients
1) Monotrait-monomethod coefficient = same trait-same method
- reliability coefficients
- indicate correlation between a measure and itself
- not directly relevant to validity, but should be large
2) Monotrait-heteromethod coefficient = same trait-different method
- correlation between different measures of the same trait
- provide evidence of convergent validity when large
3) Heterotrait-monomethod coefficient = different traits-same method
- correlation between different traits measured by same method
- provide evidence of discriminant validity when small
4) Heterotrait-heteromethod coefficient = different traits-different method
- correlation between different traits measured by different methods
- provide evidence of discriminant validity when small
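A sketch of how the MTMM coefficients are computed and labeled for the assertiveness/aggressiveness example above; the score lists and helper name are hypothetical:

```python
# Plain Pearson correlation for pairs of score lists.
def corr(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

assert_self  = [4, 7, 5, 8, 6]   # assertiveness, self-rating (hypothetical)
assert_other = [5, 8, 4, 7, 6]   # assertiveness, other-rating
aggress_self = [2, 3, 6, 2, 4]   # aggressiveness, self-rating

# monotrait-heteromethod (convergent validity) -> should be large
print(corr(assert_self, assert_other))
# heterotrait-monomethod (discriminant validity) -> should be small
print(corr(assert_self, aggress_self))
```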
Factor Analysis
Identify the minimum number of common factors (dimensions) required to account for the intercorrelations among a set of
tests, subtests, or test items.
Construct validity when it has high correlations with factors it would be expected to correlate with and low correlations
with factors it wouldn’t be expected to correlate with (another way for convergent and discriminant validity)
Five steps
1) Administer several tests to a sample
- administer test in question along with some that measure same construct and some that measure different construct
2) Correlate scores on each test with scores on every other test to obtain a correlation [R] matrix
- high correlations suggest measuring same construct
- low correlations suggest measuring different constructs
- pattern of correlations determines how many factors will be extracted
3) Convert correlation matrix into factor matrix using one of several factor analytic techniques
- Data in correlation matrix used to derive a factor matrix
- Factor matrix contains correlation coefficients (“factor loadings”) which indicate the degree of association between
each test and each factor
4) Simplify the interpretation of the factors by “rotating” them
- pattern of factor loadings in original matrix is difficult to interpret, so factors are rotated to obtain clearer pattern of
correlation
- rotation can produce orthogonal or oblique factors
5) Interpret and name factors in rotated matrix
- names are determined by looking at the tests that do and do not correlate with each factor
Factor loadings- correlation coefficients indicate the degree of association between each test and each factor
- squaring a factor loading gives the amount of variability in test scores explained by that factor
Communality- “common variance”
- amount of variability in test scores that is due to the factors that the test shares in common with the other tests in the
analysis
- total amount of variability in test scores explained by the identified factors
- communality = .64 means that 64% of the variability in those test scores is explained by a combination of the factors
A test’s reliability (true score variability) consists of two components
Communality- variability due to factors that the test shares in common with other tests in the factor analysis
Specificity- variability due to factors that are specific and unique to the test and are not measured by other tests in the factor
analysis
- portion of true test score variability not explained by the factor analysis
Communality is a lower limit estimate of a test’s reliability coefficient
- a test's reliability will always be at least as large as its communality
Naming of factor done by inspecting pattern of factor loadings
Rotated matrix: redividing the communality of each test included in the analysis
- as a result, each factor accounts for a different proportion of a test’s variability than it did before the rotation
- makes it easier to interpret the factor loadings
Two types
1) orthogonal- resulting factors are uncorrelated
- attribute measured by one factor is independent from the attribute measured by the other factor
- choose if think constructs are unrelated
- types include varimax, quartimax, equimax
2) oblique- factors are correlated
- attributes measured are not independent
- choose if think constructs may be related
- types include quartimin, oblimin, oblimax
When factors are orthogonal, a test's communality can be calculated from its factor loadings
Communality equals the sum of the squared factor loadings
When factors are oblique, the sum of the squared factor loadings exceeds the communality
Study tip
1) squared factor loading provides measure of shared variability
2) when orthogonal, a test's communality can be calculated by squaring and adding the test's factor loadings
3) orthogonal factors are uncorrelated, while oblique factors are correlated
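A minimal sketch of the communality computation for orthogonal factors (the loadings are illustrative):

```python
# With orthogonal factors, communality = sum of squared factor loadings.
loadings = [0.70, 0.40]                       # loadings on Factors I and II

communality = sum(l ** 2 for l in loadings)   # 0.49 + 0.16 = 0.65
print(communality)  # 65% of score variability explained by the common factors
```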
CRITERION-RELATED VALIDITY
Used when test scores are used to draw conclusions about an examinee’s standing or performance on another measure.
Predictor- the test used to predict performance
Criterion- other measure that is being predicted
Correlating scores of a sample on the predictor with their scores on the criterion.
When the criterion-related validity coefficient is large, it confirms that the predictor has criterion-related validity
Concurrent vs. Predictive Validity
Two forms
Concurrent- criterion data collected prior to or at same time as predictor data
- preferred when predictor used to estimate current status
- examples: estimate mental status, predict immediate job performance
Predictive- criterion data collected after predictor data
- preferred when purpose of testing is to predict future performance on the criterion
- examples: predict future job performance, predict mental illness
Study tip Convergent and divergent associated with construct validity
Concurrent and predictive associated with criterion-related validity
Interpretation of the Criterion-Related Validity Coefficient
Rarely exceed .60
.20 or .30 might be acceptable if alternative predictors are unavailable or have lower coefficients or if test administered in
conjunction with others
Shared variability- squaring the coefficient gives you the variability that is accounted for by the measure
Expectancy table- scores on predictor used to predict scores on criterion
Study tip You can square a correlation coefficient to interpret it only when it represents the correlation between two
different tests.
When squared, it gives a measure of shared variability
Terms that suggest shared variability include “accounted for by” and “explained by”
If asked how much variability in Y is explained by X, square the correlation coefficient
Standard Error of Estimate
Derive a regression equation used to estimate criterion score from obtained predictor score
There will be error unless correlation is 1.0
Standard error of estimate used to construct confidence interval around estimated criterion score.
SEest = SDy * sqrt(1 - rxy^2)
SEest = standard error of estimate
SDy = standard deviation of criterion scores
rxy = criterion-related validity coefficient
Affected by two factors: standard deviation of the criterion scores and predictor’s criterion related validity coefficient
Standard error of estimate smaller with smaller standard deviation and larger validity coefficient
Larger SD, larger standard error of estimate
When validity coefficient = +/- 1, standard error of estimate = 0
When validity coefficient = 0, standard error of estimate = standard deviation of the criterion scores
Study tip Know difference between standard error of estimate and standard error of measurement.
Standard error of estimate is confidence interval around an estimated (predicted) score
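A minimal sketch of the SEest formula above with a confidence interval around a predicted criterion score (values are illustrative):

```python
import math

# Standard error of estimate: SEest = SDy * sqrt(1 - rxy^2).
def se_est(sd_y, rxy):
    return sd_y * math.sqrt(1 - rxy ** 2)

se = se_est(sd_y=10, rxy=0.60)    # -> 8.0
predicted = 75
print(f"68% CI: {predicted - se:.1f} to {predicted + se:.1f}")  # +/- 1 SEest
```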
Incremental Validity
Increase in correct decisions that can be expected if predictor is used as a decision-maker
Important- even when a predictor’s validity coefficient is large, use of the predictor might not result in a larger proportion
of correct decisions
Scatterplot
To use, criterion and predictor cutoff scores must be set
Criterion cutoff score- provides cutoff for your criterion being predicted, i.e. successful and unsuccessful
Predictor cutoff score- provides score that would have been used to hire or not hire
- divides into positives and negatives
- positives= those who scored above the cutoff
- negatives= those who scored below the cutoff
Four quadrants of the scatterplot:

                    High on predictor | Low on predictor
High on criterion   True +            | False -
Low on criterion    False +           | True -
1) True positives- predicted to succeed by the predictor and are successful on the criterion
2) False positives- predicted to succeed by the predictor and are not successful on the criterion
3) True negatives- predicted to be unsuccessful by predictor and are unsuccessful on the criterion
4) False negatives- predicted to be unsuccessful by the predictor and are successful on the criterion
If the predictor cutoff score is lowered, the number of positives increases and the number of negatives decreases.
If the predictor cutoff score is raised, the number of positives decreases and the number of negatives increases (false positives decrease).
Selection of optimal predictor cutoff based on:
- number of people in the four quadrants
- goal of testing
- if the goal is to maximize the proportion of true positives, a high predictor cutoff score is set because it reduces the number of false positives
Criterion cutoff can also be raised or lowered, but might not be feasible
Low scores may not be acceptable
Incremental validity calculated by subtracting the base rate from the positive hit rate
Incremental validity = Positive Hit Rate – Base Rate
Base rate is proportion of people who were selected without use of the predictor and who are currently considered
successful on the criterion
Base rate = (True Positives + False Negatives) / Total number of people
Positive hit rate is proportion of people who would have been selected on the basis of their predictor scores and who are
successful on the criterion
Positive hit rate = True Positives / Total Positives
Incremental validity is greatest when:
1) validity coefficient is high
2) base rate is moderate
3) selection ratio is low
Incremental validity is also used in determining whether a screening test is an adequate substitute for a lengthier evaluation
Positives- people who are identified as having the disorder by the predictor
Negatives- people who are not identified as having the disorder by the predictor
Criterion cutoff divides people into those who have or have not been diagnosed with the disorder by the lengthier evaluation
Study tip Predictor determines whether a person is positive or negative
Criterion determines whether the positive or negative is true or false
The scatterplot on the test may not have the same quadrant labels
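A minimal sketch of the incremental validity computation from the four quadrant counts (the counts are hypothetical):

```python
# True positives, false positives, true negatives, false negatives.
TP, FP, TN, FN = 40, 10, 35, 15

total = TP + FP + TN + FN
base_rate = (TP + FN) / total           # successful without the predictor
positive_hit_rate = TP / (TP + FP)      # successful among those selected
incremental_validity = positive_hit_rate - base_rate
print(f"{incremental_validity:.2f}")    # 0.80 - 0.55 = 0.25
```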
Relationship between Reliability and Validity
Reliability places a ceiling on validity
- when a test has low reliability, it cannot have a high degree of validity
High reliability does not guarantee validity
- test can be free of measurement error but still not measure what it's supposed to
Reliability is necessary but not sufficient for validity
Predictor’s criterion related validity cannot exceed the square root of its reliability coefficient
If the reliability coefficient of the predictor is .81, the validity coefficient cannot exceed .90 (the square root of .81)
Validity is limited by reliability of predictor and criterion
To obtain a high validity coefficient, reliability of both must be high
Correction of Attenuation Formula
Estimate what a predictor’s validity coefficient would be if the predictor and the criterion were perfectly reliable (r=1.0)
Need
1) the predictor’s current reliability coefficient
2) the criterion’s current reliability coefficient
3) criterion-related validity coefficient
Tends to overestimate the actual validity coefficient
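The notes list the three ingredients without the formula; the standard correction for attenuation is r_corrected = r_xy / sqrt(r_xx * r_yy). A sketch with illustrative values:

```python
import math

# Validity coefficient if predictor and criterion were perfectly reliable.
def correct_for_attenuation(rxy, rxx, ryy):
    return rxy / math.sqrt(rxx * ryy)

print(correct_for_attenuation(rxy=0.50, rxx=0.81, ryy=0.64))  # ~0.69
```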
Criterion Contamination
a) Accuracy of a criterion measure can be contaminated by the way in which scores on the criterion measure are determined.
b) Tends to inflate the relationship between the predictor and a criterion, resulting in artificially high criterion-related
validity coefficient
c) If eliminate it, coefficient decreases
d) To prevent it, make sure that the individual rating people on the criterion measure is not familiar with their performance on the predictor.
Cross-validation
a) When a predictor is developed, items that are retained for final version are those that correlate highly with criterion.
b) However, high item-criterion correlations can be due to unique characteristics of the tryout sample
c) Predictor is "made" for that sample, and if the same sample is used to validate the test, the criterion-related validity coefficient will be high
d) Must cross validate a predictor on another sample.
e) Cross validation coefficient tends to shrink and be smaller than the original coefficient. Smaller the initial evaluation
sample, the greater the shrinkage of the validity coefficient
TEST SCORE INTERPRETATION
Norm Referenced Interpretation- comparing to scores obtained by people in a normative sample
- Raw score is converted to another score that indicates standing in norm group
- Emphasis is on identifying individual differences
- Adequacy relies on how much person’s characteristics match those in sample
Weaknesses
- Finding norms that match the person
- Norms quickly become obsolete
1) Percentile Ranks- raw score expressed in terms of percentage of sample who obtained lower scores.
- Advantage: Easy to interpret
- Distribution always flat (rectangular) in shape regardless of shape of raw score distribution because evenly distributed
throughout range of scores (same # of ppl between 20 and 30 as between 40 and 50)
- Nonlinear transformation: distribution of percentile ranks differs in shape from the distribution of raw scores
- Disadvantage: ordinal scale of measurement
- Indicate relative position in distribution, do not provide info on absolute differences between subjects
- Maximizes differences between those in middle of distribution while minimizing differences between those at
extremes
- Can’t perform many mathematical calculations on percentiles
Study tip Linear = distributions look alike
Nonlinear = distributions look different
2) Standard Scores- indicates position in normative sample in terms of standard deviations from the mean
- Advantage: permits comparison of scores across tests
Z score: subtracting mean of distribution from raw score to obtain a deviation score, then dividing by distribution’s
standard deviation
Z = (X - M) / SD
Following properties
1- mean = 0
2- SD = 1
3- All raw scores below mean are negative, all above mean are positive
4- Unless it is normalized, distribution has same shape as raw scores
Linear transformation
Normalized by a special procedure if it is thought that the distribution is normal in the population and that non-normality is due to error
T score: Mean of 50, standard deviation of 10
Normal curve: 68.26% of cases fall within +/- 1 SD of the mean, 95.44% within +/- 2 SD, and 99.72% within +/- 3 SD

SD:  -3   -2   -1    0   +1   +2   +3
PR:   1    2   16   50   84   98   99
Z:   -3   -2   -1    0   +1   +2   +3
T:   20   30   40   50   60   70   80
IQ:  55   70   85  100  115  130  145
Study tip Can calculate percentile rank from a z-score using the area under the normal curve.
Percentile rank of 84 = z score of 1
50% fall below the mean and 34% (half of 68%) fall between the mean and 1SD
50 + 34= 84
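A minimal sketch of the conversions in the table and study tip above (raw score, mean, and SD are illustrative; the percentile rank assumes a normal distribution):

```python
from statistics import NormalDist

# Convert a raw score to z, T, and IQ-style standard scores and a percentile.
raw, mean, sd = 65, 50, 10
z = (raw - mean) / sd              # 1.5
t = 50 + 10 * z                    # 65
iq = 100 + 15 * z                  # 122.5
pr = 100 * NormalDist().cdf(z)     # ~93rd percentile
print(z, t, iq, round(pr))
```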
3) Age and grade equivalent scores- score that permits comparison of performance to that of average examinees at
different ages or grades
- easily misinterpreted
- highly sensitive to minor changes in raw scores
- do not represent equal intervals of measurement
- do not uniformly correspond to norm referenced scores
CRITERION REFERENCED INTERPRETATION
Interpreting scores in terms of a prespecified standard.
Percentage score: indicates the percentage of questions answered correctly.
A cutoff score is then set so that those above pass and those below fail
Mastery (criterion referenced) testing: specifying terminal level of performance required for all learners and periodically
administering test to assess degree of mastery
- used to see whether performance improves as a result of the program
- if deficiencies seen, given remedial instruction and process repeated until passes
- goal not to identify differences between examinees, but to make sure all examinees reach same level of performance
Regression equation/ expectancy tables: interpreting scores in terms of likely status on an external criterion
Study tip Link percentile ranks, standard scores, and age/grade equivalents with norm referenced interpretation
Link percentage scores, regression equation, and expectancy table with criterion referenced interpretation
SCORE ADJUSTMENT AND RELATED TECHNIQUES
Ensure that use of measures does not result in adverse impact for members of minority groups
Used to:
1) alleviate test bias
2) achieve business or societal goals (“make up” for past discrimination, allow for greater diversity in workplace)
3) increase fairness in selection system by ensuring that score on single instrument not overemphasized
Only first justification has been widely accepted, and only under certain circumstances (when predictive bias demonstrated)
Involve taking group differences into account when assigning or interpreting scores
Bonus points- adding a constant number of points to all ppl of a certain group
Separate cutoff scores- different cutoff scores for different groups
Within-group norming- raw score converted to standard score within group
Top-down selection from separate lists- ranking candidates separately within groups and then selecting top-down from within each group in accordance with a prespecified rule about number of openings. Often used by affirmative action to overselect from excluded groups
Banding- considering people within a range of scores as having identical scores. May be combined with another method to ensure minority representation
Section 106 of the Civil Rights Act of 1991 prohibits score adjustment or use of different cutoff scores in employment based on
race, color, religion, sex, or national origin
Sackett and Wilk: may take group membership into account if doing so can be shown to increase accuracy of measurement or
accuracy of prediction without increasing the adverse impact of the test
- banding found legal in one case
- banding with minority preference may be useful for balancing competing goals of diversity and productivity and legal
under the Act
Correction for Guessing
Ensure examinee does not benefit from guessing
When tests are corrected for guessing, it is best to omit an item rather than guess
Corrections involve calculating a corrected score by taking into account:
- the number of correct responses
- the number of incorrect responses
- the number of alternatives for each item
Corrected score = R - W/(n - 1)
R = # of right answers
W = # of wrong answers
n = # of answer choices per item
When the correction involves subtracting points from scores, the resulting distribution will have a lower mean and a larger
SD than the original distribution
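A minimal sketch of the correction formula above on a hypothetical 100-item multiple-choice test with 4 options per item:

```python
# Correction for guessing: R - W/(n - 1); omitted items are not penalized.
right, wrong, omitted, n_options = 70, 20, 10, 4

corrected = right - wrong / (n_options - 1)
print(corrected)   # 70 - 20/3 = 63.33
```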
Experts generally discourage correction for guessing
- unless considerable number of items omitted by some of the examinees, relative position of examinees will be same
regardless of correction
- correction has little or no effect on test's validity
Only time justified is for speeded tests