Appraisal 1

Comprehensive
Exam Review
Appraisal
Part 1
The Vocabulary
of Appraisal
A test is a sample of behavior, i.e., a series of
tasks (e.g., items) used to obtain systematic
observations presumed to represent
attributes or characteristics. A test is used as
a measurement tool.
Measurement is the process of assigning
numbers to human attributes or characteristics.
Assessment is the use of methods or processes to
gather data about, or evidence of, human
behavior.
Assessment is a preferred term because it
(merely) connotes the collection of data
concerning the present state of human
behavior, whereas the term diagnosis connotes
determination of the degree of abnormality.
Interpretation is the act of stating the
meaning and/or usefulness of behavioral data.
Evaluation is the process of applying
judgments to and/or making decisions based
on the results of measurement.
An evaluation program is a systematic program
designed to measure and assess an
individual’s growth, adjustment, and/or
achievement, or a program’s effectiveness.
Tests used in the counseling professions are
usually classified into five
categories:
Aptitude
Achievement
Intelligence
Interest
Personality
Aptitude, achievement, and intelligence tests
are sometimes clustered under the heading
ability tests.
An ability test is a standardized test that
measures a test taker’s current level of
performance in a specified area of cognitive,
psychomotor, or physical functioning.
An achievement test measures a test taker’s
achievement level in one or more content or
subject matter areas.
An adjustment inventory is a self-report
instrument used to identify personal and
social adjustment problems.
Cognitive assessment is a data-collecting
technique used to assess an individual’s
ability to perform mental activities relative
to acquiring, processing, retaining,
conceptualizing, and organizing verbal,
spatial, psychomotor, sensory, and
perceptual information.
A diagnostic (ability) test measures specific
aspects of achievement in a single subject or
field.
An intelligence test is a psychological or
educational test designed to measure
intellectual operations, functions, and
general abilities.
An inventory assesses an individual’s
opinions, interests, and dispositions about
specific situations.
A mastery test assesses whether an individual
has achieved mastery, generally defined by a
passing or cut score, in a specific domain of
knowledge or skill.
A multi-factor test measures multiple
constructs that are relatively uncorrelated
with one another.
A performance test is one that generally
requires the use and manipulation of
physical objects and the application of
physical and manual skills in situations
rather than oral or written responses.
A screening test is a beginning point in a
selection or diagnostic process that identifies
broad classifications of test takers.
A projective test technique assesses personality
dynamics through psychological projection.
Test takers respond to ink blots, pictures,
incomplete sentences, or other unstructured
stimuli, in such a manner that they “project”
into their responses manifestations of
personality characteristics.
An aptitude test is a cognitive or psychomotor measure used to predict success in a
course, job, or educational or training
program.
An interest inventory measures preferences for one or more activities from a
large set of possible activities.
A personality inventory measures one or
more aspects of personality, including
attributes, dynamics, or characteristic ways
of behaving.
A self-report inventory usually consists of
questionnaire-type statements requiring a
limited form of responding (e.g., true-false or
multiple-choice items).
An individual test is administered to one person
at a time.
A group test is administered simultaneously to
a group of people.
In a power test, speed is not measured as a
component of performance, i.e., there is more
than sufficient time to respond.
In a speed(ed) test, time is measured as a
component of performance.
A verbal test necessitates command of
language for effective responding.
A nonverbal test de-emphasizes comprehension of language as a requirement
for effective responding.
An objective test has clear and unambiguous
scoring criteria.
A placement test is used to assign
individuals to different levels or categories.
Construct equivalence is the degree to
which multiple tests measure essentially the
same construct. It also refers to the extent
to which the same test measures the same
construct when administered to two
different cultural or linguistic groups.
Documentation includes supporting materials
such as test manuals and research reports
created by test authors and publishers to
provide evidence of a test’s quality and
promote use of that test.
Discriminating power is the ability of a test
item to differentiate between individuals who
possess much of a given characteristic such as
skill, knowledge, or attitude, and individuals
who possess little of the characteristic.
Adaptive testing is an individualized,
sequential form of testing in which successive
test items are selected on the basis of a test
taker’s responses to previous items. Test items
also are selected based on psychometric
properties and test content.
A pilot test is the administration of a test to a
representative sample of examinees so that
the test’s properties may be determined.
A test battery is a group of tests for which the
results are valued individually and/or in
combination. It is standardized on the same
population so that norm-referenced scores can
be derived and used for comparison and
decision-making purposes.
A standardized test is one in which testing
conditions are the same for all examinees,
including directions, scoring procedures, test
use, data on reliability and validity, and
adequately determined norms.
A field test is an administration of a test
employed to examine the quality of testing
procedures such as test administration,
responding, scoring, and reporting, in a
manner that is more extensive than in pilot
testing.
Alternate forms are two or more interchangeable versions of a test that generally assess the
same construct, use the same instructions for
test administration, and are given for the same
purposes.
Alternate forms include:
parallel forms, which have identical content
and psychometric properties;
equivalent forms, which sample the same
content areas and are considered equivalent
in regard to derived scores; and
comparable forms, which have similar
content areas but do not share statistical
similarity.
Neuropsychological assessment is an
evaluation that generates possible
hypotheses and conclusions regarding
processes that affect the central nervous
system, or psychological or behavioral
dysfunctions related to pathology in the
central nervous system.
Outcome evaluation is a practitioner-generated
assessment of the efficacy of a
particular intervention, program, or
service.
A job analysis identifies the (a) knowledge,
skills, abilities, and other personal qualities
needed to perform a given job; and (b) the
specific tasks to be performed relative to the
job.
Portfolio assessment is the evaluation of
systematically collected educational or work
products over a period of time.
Performance assessment is evaluation of
observable products or behaviors in settings
designed to represent real-life contexts in
which knowledge and skills are actually
utilized.
Program evaluation is assessment of the
efficacy of a planned set of procedures.
Personality assessment is evaluation of
normal or abnormal dimensions of
personality.
Psychological assessment is an evaluation of
an individual’s psychological functioning
that includes administering, scoring and
interpreting tests and inventories, behavioral
observations, client and third-party
interviews, and analysis of prior educational,
occupational, medical, and psychological
records.
Psychological testing is employment of tests
and inventories to measure an individual’s
psychological traits and dimensions.
Vocational assessment is a form of
psychological assessment that generates
hypotheses and inferences related to
constructs such as the test taker’s values,
work needs, interests, and career-development status.
Norms are statistics that describe the
performance of individuals of various ages
or grade levels who comprise the
standardization group for the test.
Age norms are scores that represent average
performances for individuals by chronological age. They usually are expressed as
central tendencies, scores, percentiles,
standard scores, or stanines.
Local norms are a set of scores obtained from
a specific sample that are not considered
generalizable to populations beyond the
sample.
The reference population is the group of
people from which a sample was used to
establish norms for a given test.
The standardized sample is the group of
people from the reference population whose
performances were used to establish the
norms for a given test.
Utility is an evaluation, often in cost-benefit
form, of the relative value of using or not
using a given test for a specific purpose.
Ability is the power to perform a designated
responsive act. The power may be potential or
actual, native or acquired.
Achievement level is an individual’s
performance and competency in a specified
subject area.
The description of achievement level is usually
defined as a category on a continuum that
ranges from “basic” to “advanced.”
Aptitude is the capacity to gain proficiency
with training.
Intelligence is the cognitive ability to perceive
and understand relationships, such as logical,
spatial, verbal, numerical, and recall of
associated meanings.
Intelligence is sometimes considered
synonymous with academic aptitude,
scholastic aptitude, mental ability, capacity, or
mental maturity.
A raw score is an original and unadjusted
test score, usually characterized by a sum
of the correct answers or another
combination of item scores.
The “ceiling” is the upper limit of ability
measured by a test.
The “ceiling effect” is when many
respondents achieve very high (raw) scores
on a test or measurement, i.e., the test is too
easy for most of the respondents.
A criterion is a standard, norm, or judgment
used as the basis for quantitative and/or
qualitative comparison.
In a criterion-referenced test, score
interpretations are made based on the test
taker’s independent performance level,
rather than relative to the performance levels
achieved by others.
In a norm-referenced test, score interpretations are made relative to the performance levels achieved by others.
A composite score results from the
combination of several scores as specified by
a certain formula.
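As an illustration, a composite can be sketched as a weighted sum of subtest scores; the subtests, scores, and weights below are hypothetical, not taken from any specific test.

```python
# Hypothetical composite: combine subtest scores with specified weights.
def composite_score(scores, weights):
    """Combine several scores as specified by a weighting formula."""
    return sum(s * w for s, w in zip(scores, weights))

# e.g., a verbal subtest weighted twice as heavily as a quantitative one
print(composite_score([50, 60], [2.0, 1.0]))  # 160.0
```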
A cut score is the particular score value or
point on a score scale that differentiates
interpretation of scores below and above that
point. If one cut score is used, scores fall
into ranges of either “pass” (mastery) or
“fail” (nonmastery).
A gain score is the difference between an
individual’s two test scores on the same or
equivalent test.
Holistic scoring is a method that uses
previously specified criteria to determine an
overall appraisal of performance on a test or
test item.
A derived score is one numerically converted
from a quantitative or qualitative mark on
one scale into the units of another scale. It is
also referred to as a scaled score.
Examples include grade placement, chronological age equivalent, chronological age
placement, educational age, intelligence
quotient, percentile rank, and standard
score.
An equated score is a derived score that is
comparable from test to test, such as
standard scores, grade placements, and
mental ages.
A grade-equivalent score is the real or
estimated mean or median score for a grade-level population.
An intelligence quotient (IQ) is a measure of
potential rate of intellectual growth that is
expressed as the ratio of mental age (MA) to
chronological age (CA). The formula is IQ =
MA/CA x 100.
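The ratio IQ formula can be worked directly; the ages below are purely illustrative.

```python
def ratio_iq(mental_age, chronological_age):
    """Classic ratio IQ: IQ = MA / CA x 100."""
    return mental_age / chronological_age * 100

# A child with a mental age of 12 and a chronological age of 10:
print(ratio_iq(12, 10))  # 120.0
# A mental age equal to chronological age yields the average IQ of 100:
print(ratio_iq(10, 10))  # 100.0
```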
A mental age is the average or normal
chronological age for a given score on an
intelligence test.
A deviation IQ is an intelligence test score
that is a derived score based on the
individual’s deviation from the mean of the
norm group in standard deviation units.
A scaled score is a unit in a system of
equated scores established for the raw
scores of a test.
A scaled score usually is interpreted relative to
the mean performance of a given reference
group, whereby the interval between any pair
of scaled scores represents meaningful
differences in terms of the characteristics of
the reference group.
A scoring rubric is the set of principles, rules,
and standards used to assess an individual’s
performance, a product, or a response to a
test item.
Scoring rubrics vary by the amount of judgment
involved, number of distinct score levels, and
latitude for intermediate or fractional score
values.
A standard score (e.g., Sigma score, T score,
or z score) is a type of derived score that
indicates the extent to which a score deviates
from the mean.
A distribution of standard scores for a
specified population will have values for the
mean and standard deviation that can be
readily interpreted and understood.
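A minimal sketch of the common standard-score conversions described above (z score, T score, and deviation IQ). The scaling constants follow widely used conventions (T: mean 50, SD 10; deviation IQ: mean 100, SD 15), and the norm-group mean, SD, and raw score are illustrative assumptions.

```python
def z_score(x, mean, sd):
    """Standard (z) score: deviation from the mean in SD units."""
    return (x - mean) / sd

def t_score(x, mean, sd):
    """T score: a z score rescaled to mean 50, SD 10."""
    return 50 + 10 * z_score(x, mean, sd)

def deviation_iq(x, mean, sd):
    """Deviation IQ: a z score rescaled to mean 100, SD 15."""
    return 100 + 15 * z_score(x, mean, sd)

raw = 65  # one SD above an assumed norm-group mean of 50 (SD 15)
print(z_score(raw, 50, 15))       # 1.0
print(t_score(raw, 50, 15))       # 60.0
print(deviation_iq(raw, 50, 15))  # 115.0
```

Note that all three conversions carry the same information; they differ only in the mean and SD chosen for the reporting scale.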
A “true score” is the mean score of the
theoretical distribution of scores that
would be obtained by the individual test
taker on an unlimited number of identical
administrations of the same test.
In “true score theory”
X (the actual/observed score received) =
true score (i.e., actual trait level) +
systematic error (e.g., test anxiety) +
random error (e.g., not feeling well)
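The decomposition above can be illustrated with a small simulation, assuming a normally distributed random error (the true score, systematic error, and error SD below are arbitrary). Averaging over many administrations removes the random error but not the systematic error.

```python
import random

def observed_score(true_score, systematic_error, rng):
    """One simulated administration: X = true + systematic + random error."""
    return true_score + systematic_error + rng.gauss(0, 2)  # error SD of 2 assumed

rng = random.Random(0)  # seeded for reproducibility
# True score 80, a constant systematic error of -3 (e.g., test anxiety):
scores = [observed_score(80, -3, rng) for _ in range(10_000)]
mean = sum(scores) / len(scores)
print(round(mean))  # 77: random error averages out, systematic error does not
```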
Classification accuracy is the degree of
accurate categorizations and diagnoses when
a test is used to classify an individual or event.
A false negative is an error whereby an
outcome or performance that is predicted not
to meet an expected criterion actually meets
that criterion.
A false positive is an error whereby an
outcome or performance is predicted to meet
an expected criterion but actually does not
fulfill that criterion.
In a high-stakes test, results have a significant
and direct impact for the individual test taker,
program, or institution being evaluated.
In a low-stakes test, results have inconsequential
impact on the individual test taker, program,
or institution.
Intervention planning is the work behavior
of a practicing helping professional that
involves the development of treatment goals,
plans, and protocols.
The local setting is the place where a test is
used.
Local evidence is the reliability and/or
validity data collected for a given set of test
takers at a single institution or specific
location.
A test user is an individual or organization
that chooses to administer and interpret test
scores elicited in a given setting so that test-based decisions and actions may be made.
Psychodiagnosis is the use of psychological test
data to classify an individual’s mental health
status.
Selection is an objective of testing that results
in either accepting or rejecting candidates for
specific opportunities in educational and
employment contexts.
Sensitivity is the extent to which a diagnostic
test identifies a disorder when it actually is
present.
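Sensitivity, together with the classification-accuracy and false-positive/negative concepts above, can be computed from a table of classification counts; the counts below are hypothetical.

```python
def sensitivity(true_positives, false_negatives):
    """Proportion of actual cases the test correctly identifies."""
    return true_positives / (true_positives + false_negatives)

def classification_accuracy(tp, tn, fp, fn):
    """Proportion of all classifications that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)

# Hypothetical screening results: 40 true positives, 10 false negatives,
# 45 true negatives, 5 false positives
print(sensitivity(40, 10))                     # 0.8
print(classification_accuracy(40, 45, 5, 10))  # 0.85
```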
(Test) Bias is the underrepresentation or
irrelevance of construct components in test
scores that results in one group of test takers
being typically favored over another.
Response bias is the systematic error caused
by the test taker’s tendency to respond in a
certain way to test items.
Translational equivalence is the extent to
which the (original) content of a test
corresponds to a linguistically translated
version of the test.
Sociometry is the measurement of the
interpersonal relationships among members
of a group.
Coaching is the process of helping prospective
examinees increase their test scores. It
includes practices such as learning test-taking
strategies that are independent of the
curricula of schools and training programs.
Correction for guessing is a score-change
technique that compensates for guessing
on a test. The number of right answers on
a test is adjusted by subtracting a
proportion of the total number of incorrect
responses from the total number of correct
answers.
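The standard correction formula is R − W / (k − 1), where R is the number right, W the number wrong, and k the number of response options per item; a sketch assuming that formula, with illustrative counts:

```python
def corrected_score(num_right, num_wrong, options_per_item):
    """Correction for guessing: R - W / (k - 1)."""
    return num_right - num_wrong / (options_per_item - 1)

# 60 right and 20 wrong on a 5-option multiple-choice test:
print(corrected_score(60, 20, 5))  # 55.0
```

The rationale is that on a k-option item a pure guesser is right 1 time in k, so each wrong answer implies roughly 1/(k − 1) additional lucky guesses to subtract.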
Flagging is the process of attaching an
indicator to a test score to signify that the
score was obtained in a nonstandardized
testing administration.
Item analysis is a method used in test
construction to determine how well a given
test item discriminates among individuals
who differ in some characteristic.
Item-effectiveness considerations include
validity relative to curriculum content and
educational objectives, discriminating power
relative to validity and internal consistency,
and level of difficulty.
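One common index of an item's discriminating power is D = p_upper − p_lower, the difference in proportion correct between high-scoring and low-scoring groups; a sketch assuming that index, with hypothetical counts:

```python
def discrimination_index(upper_correct, upper_n, lower_correct, lower_n):
    """D = proportion correct in the upper group minus the lower group."""
    return upper_correct / upper_n - lower_correct / lower_n

# 24 of 30 high scorers answered the item correctly vs. 9 of 30 low scorers:
d = discrimination_index(24, 30, 9, 30)
print(round(d, 2))  # 0.5
```

Values near 0 (or negative) suggest the item fails to separate individuals who possess the characteristic from those who do not.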
A construct is the underlying theoretical
concept or characteristic to be measured
by a specific test.
The construct domain is a set of associated
attributes to be assessed by a specific test.
The content domain is the specific set of
skills or level of knowledge that is measured
by a given test.
The criterion domain is the variable used as
a frame of reference when making
comparisons for a specific test.
An item pool is a set of potential items from
which items are extracted for either the
development of a test or the selection of
successive items when adapting the test.
An item prompt is a stimulus, such as a
question or set of instructions, that guides
the test taker in formulating a response.
A test manual is a publication (aka a “user’s
guide”) prepared by test developers and
publishers to provide information on
administering and scoring the test, and
interpreting scores. It also may provide
information on test characteristics, and
procedures used in developing the test and
evaluating the technical quality of its test
scores.
A technical manual is a publication prepared
by test authors and publishers that provides
technical and psychometric data concerning
the respective test.
A test developer is the individual(s) or
organization that constructed a test and its
supporting materials.
Test development is the process of designing,
constructing, assessing, and modifying a
test. It includes the development of content,
administration, and scoring procedures, and
determination of technical quality.
Test documents are publications, written
works, and technical information concerning
a test that test users may use to evaluate the
test for appropriateness and technical
adequacy for a particular intended purpose.
Classical test theory is a school of thought
that defines an individual’s observed test
score as the product of two separate
components: a true test score and an
independent error of measurement.
Classical test theory and its premises about
the components of a test score yield
(traditional) implications for relationships
among validity, reliability, and other statistical
measures.
Generalizability theory is an extension of
classical test theory in which analyses are
used to evaluate the generalizability of scores
beyond the specific sample of items, persons,
and observational conditions that were
studied.
Item response theory (IRT) is a theory of
test performance that highlights the
relationship between the mean item score
and the calibrated level of the ability or trait
measured by the item to theoretically yield
the maximally appropriate items for each
respondent.
A population is the group of people to whom
results will apply, typically considered as the
group to whom results will be generalized.
A sample is a subset of a given population.
A random sample is a sample of a given
population that is selected in such a way that
selection bias is eliminated and every member
of the population has an equal chance of being
included in the sample.
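Drawing a simple random sample can be sketched with Python's standard library; the population and seed below are arbitrary. `random.sample` draws without replacement, giving every member an equal chance of inclusion.

```python
import random

population = list(range(1, 101))     # a population of 100 numbered members
rng = random.Random(42)              # seeded only for reproducibility
sample = rng.sample(population, 10)  # each member equally likely to be drawn

print(len(sample))       # 10
print(len(set(sample)))  # 10 -- no member appears twice
```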
This concludes Part 1 of the
presentation on
APPRAISAL