
PSM106-MIDTERM-REVIEWER

CHAPTER 3: A STATISTICS REFRESHER
Scales of Measurement
Measurement – the act of assigning
numbers or symbols to characteristics
of things (people, events, whatever)
according to rules. The rules used in
assigning numbers are guidelines for
representing the magnitude (or some
other characteristic) of the object
being measured.
Scale – a set of numbers (or other
symbols) whose properties model
empirical properties of the objects to
which the numbers are assigned.
There are various ways in which a scale
can be categorized.
Continuous Scale – a scale used to measure a continuous variable.
Discrete Scale – a scale used to measure a discrete variable.
Error – refers to the collective influence of all of the factors on a test score or measurement beyond those specifically measured by the test or measurement.
Nominal scales – are the simplest form of measurement. These scales involve classification or categorization based on one or more distinguishing characteristics, where all things measured must be placed into mutually exclusive and exhaustive categories.
Ordinal scales – permit classification. However, in addition to classification, rank ordering on some characteristic is also permissible with ordinal scales.
Interval scales – contain equal intervals between numbers. Each unit on the scale is exactly equal to any other unit on the scale. But like ordinal scales, interval scales contain no absolute zero point. With interval scales, we have reached a level of measurement at which it is possible to average a set of measurements and obtain a meaningful result.
Ratio scale – has a true zero point. All mathematical operations can meaningfully be performed because there exist equal intervals between the numbers on the scale as well as a true or absolute zero point.
Describing Data
Distribution – may be defined as a set of test scores arrayed for recording or study.
Raw score – is a straightforward, unmodified accounting of performance that is usually numerical.
Frequency Distribution – all scores are
listed alongside the number of times
each score occurred.
- a frequency distribution is referred to
as a simple frequency distribution to
indicate that individual scores have
been used and the data have not been
grouped.
Grouped Frequency Distribution – a frequency distribution used to summarize data, in which test-score intervals (class intervals) replace the actual test scores.
Range – is equal to the difference
between the highest and the lowest
scores.
Bar graph – numbers indicative of frequency appear on the Y-axis, and reference to some categorization appears on the X-axis.
Quartiles – the dividing points
between the four quarters in the
distribution.
Frequency Polygon – expressed by a continuous line connecting the points where test scores or class intervals (as indicated on the X-axis) meet frequencies (as indicated on the Y-axis).
Measure of Central Tendency – is a
statistic that indicates the average or
midmost score between the extreme
scores in a distribution.
Arithmetic mean or mean – which is
referred to in everyday language as
the “average.”
Median – defined as the middle score
in a distribution, is another commonly
used measure of central tendency.
Mode – the most frequently occurring
score in a distribution of scores.
Bimodal distribution – a distribution in which two scores (for example, 51 and 66) occur with the highest frequency.
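The three measures above are easy to verify on a small set of scores. A minimal sketch using Python's standard library (the scores are invented for illustration; note that two values tie for most frequent, echoing the bimodal case):

```python
# Mean, median, and mode on invented scores, using the statistics module.
import statistics

scores = [51, 66, 66, 72, 51, 80, 66, 51, 90]

print(statistics.mean(scores))       # arithmetic mean (the "average")
print(statistics.median(scores))     # middle score of the ordered distribution
print(statistics.mode(scores))       # first most frequent score encountered
print(statistics.multimode(scores))  # [51, 66]: two modes, a bimodal distribution
```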
Measures of Variability – statistics
that describe the amount of variation
in a distribution.
Variability – is an indication of how
scores in a distribution are scattered or
dispersed.
Interquartile range – is a measure of
variability equal to the difference
between Q3 and Q1.
Semi-interquartile range – which is
equal to the interquartile range
divided by 2.
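A quick sketch of how the quartile-based statistics relate (data invented; statistics.quantiles with n=4 returns the three quartile points Q1, Q2, and Q3):

```python
# Range, quartiles, interquartile range, and semi-interquartile range.
import statistics

scores = [2, 4, 5, 7, 8, 9, 11, 12, 15, 18]

score_range = max(scores) - min(scores)         # highest minus lowest score
q1, q2, q3 = statistics.quantiles(scores, n=4)  # the three quartile points
iqr = q3 - q1                                   # interquartile range: Q3 - Q1
semi_iqr = iqr / 2                              # semi-interquartile range
print(score_range, q1, q2, q3, iqr, semi_iqr)
```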
Average Deviation – another tool that
could be used to describe the amount
of variability in a distribution.
Standard Deviation – a measure of variability equal to the square root of the average squared deviations about the mean.
Variance – is equal to the arithmetic
mean of the squares of the differences
between the scores in a distribution
and their mean.
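The last three entries translate directly into code. A minimal sketch on invented scores, using the population forms since the definitions describe deviations about the distribution's own mean:

```python
# Average deviation, variance, and standard deviation about the mean.
import statistics

scores = [10, 12, 14, 16, 18]
m = statistics.mean(scores)

avg_dev = statistics.mean(abs(x - m) for x in scores)  # mean absolute deviation
variance = statistics.pvariance(scores)                # mean of squared deviations
sd = statistics.pstdev(scores)                         # square root of the variance
print(avg_dev, variance, sd)
```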
Skewness – the nature and extent to which symmetry is absent in a distribution.
Positive Skew – when relatively few of
the scores fall at the high end of the
distribution.
Negative Skew – when relatively few
of the scores fall at the low end of the
distribution.
Kurtosis – refers to the steepness of a distribution in its center.
Platykurtic – relatively flat
Leptokurtic – relatively peaked
Mesokurtic – somewhere in the middle
Correlation – is an expression of the degree and direction of correspondence between two things.
Normal Curve – is a bell-shaped,
smooth, mathematically defined curve
that is highest at its center.
CHAPTER 4: OF TESTS AND TESTING
Standard Score – is a raw score that
has been converted from one scale to
another scale, where the latter scale
has some arbitrarily set mean and
standard deviation.
z score – results from the conversion
of a raw score into a number indicating
how many standard deviation units
the raw score is below or above the
mean of the distribution.
T scores – can be called a fifty plus or
minus ten scale.
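A short sketch of both conversions (the raw score, mean, and standard deviation are invented): a z score counts standard deviation units from the mean, and a T score linearly rescales z to the fifty-plus-or-minus-ten scale:

```python
# Converting a raw score to a z score, then to a T score.
def z_score(raw, mean, sd):
    return (raw - mean) / sd          # SD units above (+) or below (-) the mean

def t_score(z):
    return 50 + 10 * z                # T scale: mean 50, standard deviation 10

z = z_score(raw=65, mean=50, sd=10)   # 1.5 SDs above the mean
print(z, t_score(z))                  # 1.5 65.0
```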
Stanine – a term that was a
contraction of the words standard and
nine.
Linear Transformation – is one that
retains a direct numerical relationship
to the original raw score.
Nonlinear Transformation – may be
required when the data under
consideration are not normally
distributed yet comparisons with
normal distributions need to be made.
Coefficient of correlation (or correlation coefficient) – is a number that provides us with an index of the strength of the relationship between two things.
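As a quick illustration on invented paired scores (Python 3.10+ provides statistics.correlation), the sign of the Pearson coefficient gives the direction of the relationship and its magnitude gives the strength:

```python
# Pearson correlation coefficient between two sets of paired scores.
import statistics

x = [2, 4, 6, 8, 10]
y = [1, 3, 7, 9, 10]

print(statistics.correlation(x, y))  # close to +1: strong positive relationship
```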
A. Some Assumptions About Psychological Testing and Assessment
Assumption 1. Psychological Traits
and States Exist
Trait – any distinguishable, relatively
enduring way in which one individual
varies from another.
States – distinguish one person from
another but are relatively less
enduring.
A psychological trait exists only as a construct.
Construct – an informed, scientific concept developed or constructed to describe or explain behavior. We can't see, hear, or touch constructs, but we can infer their existence from overt behavior.
Overt behavior – refers to an
observable action or product of an
observable action.
Assumption 2. Psychological Traits
and States can be Quantified and
Measured
Measuring traits and states by means of a test entails developing not only appropriate test items but also appropriate ways to score the test and interpret the results.
The test score is presumed to represent the strength of the targeted ability, trait, or state, and is frequently based on cumulative scoring.
Assumption 3. Test-Related Behavior Predicts Non-Test-Related Behavior.
Patterns of answers to true-false questions on one widely used test of personality are used in decision making regarding mental disorders. The tasks in some tests mimic the actual behaviors that the test user is attempting to understand.
Assumption 4. Tests and Other Measurement Techniques Have Strengths and Weaknesses.
Competent test users understand a great deal about the tests they use. They understand, among other things, how a test was developed, the circumstances under which it is appropriate to administer the test, how the test should be administered and to whom, and how the test results should be interpreted. Competent test users understand and appreciate the limitations of the tests they use, as well as how those limitations might be compensated for by data from other sources.
Assumption 5. Various Sources of Error Are Part of the Assessment Process.
In everyday language, error refers to mistakes, miscalculations, and the like. In the context of assessment, error traditionally refers to something that is more than expected; it is a component of the measurement process. It refers to a long-standing assumption that factors other than what a test attempts to measure will influence performance on the test.
Test scores are always subject to questions about the degree to which the measurement process includes error.
Error variance – the component of a test score attributable to sources other than the trait or ability measured.
Sources of Error Variance
1. Assessee
2. Assessor
3. Measuring instruments
Classical test theory or true score
theory - the assumption is made that
each test taker has a true score on a
test that would be obtained but for the
action of measurement error.
Assumption 6. Testing and Assessment Can Be Conducted in a Fair and Unbiased Manner.
Today all major test publishers strive
to develop instruments that are fair
when used in strict accordance with
guidelines in the test manual.
One source of fairness-related
problems is the test user who
attempts to use a particular test with
people whose background and
experience are different from the
background and experience of people
for whom the test was intended.
Assumption 7. Testing and Assessment Benefit Society.
In a world without tests, teachers and
school administrators could arbitrarily
place children in different types of
special classes simply because that is
where they believed the children
belonged. In a world without tests,
there would be a great need for
instruments to diagnose educational
difficulties in reading and math and
point the way to remediation.
The criteria for a good test would include clear instructions for administering, scoring, and interpreting it. It would also seem to be a plus if a test offered economy in the time and money it took to administer, score, and interpret it.
CHAPTER 5: RELIABILITY
A good test, or more generally, a good measuring tool or procedure, is reliable. The criterion of reliability involves the consistency of the measuring tool: the precision with which the test measures, and the extent to which error is present in measurements. In theory, the perfectly reliable measuring tool consistently measures in the same way.
Classical test theory states that a score on an ability test is presumed to reflect not only the test taker's true score on the ability being measured but also error.
Error – refers to the component of the observed test score that does not have to do with the test taker's ability.
A statistic useful in describing sources
of test score variability is the variance
- the standard deviation squared.
True variance - variance from true
differences
Error variance – variance from
irrelevant, random sources
Reliability refers to the proportion of
the total variance attributed to true
variance.
The greater the proportion of the total
variance attributed to true variance,
the more reliable the test.
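In the usual classical test theory notation (standard in the psychometric literature rather than spelled out in this reviewer):

```latex
% Total (observed) score variance is true variance plus error variance:
\sigma^2_X = \sigma^2_T + \sigma^2_E
% Reliability is the proportion of total variance that is true variance:
r_{xx} = \frac{\sigma^2_T}{\sigma^2_X}
```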
Measurement error – refers, collectively, to all of the factors associated with the process of measuring some variable, other than the variable being measured.
Categories of Measurement Error
Random error – is a source of error in
measuring a targeted variable caused
by unpredictable fluctuations and
inconsistencies of other variables in
the measurement process.
Systematic error – refers to a source of
error in measuring a variable that is
typically constant or proportionate to
what is presumed to be the true value
of the variable being measured.
Sources of Error Variance
1. Test construction, administration,
scoring, and/or interpretation
Item sampling or content sampling – is one source of variance during test construction. It refers to the variation among items within a test, as well as to variation among items between tests. The extent to which a test taker's score is affected by the content sampled on a test, and by the way the content is sampled, is a source of error variance.
Sources of error variance during test
administration:
1. Test environment, like room temperature, level of lighting, and amount of ventilation and noise.
2. Test taker variables, like pressing emotional problems, physical discomfort, lack of sleep, and the effects of drugs or medication.
3. Formal learning experiences, casual life experiences, therapy, illness, and changes in mood or mental state.
4. Body weight can be a source of error variance.
5. Examiner-related variables are potential sources of error variance.
Scorers and scoring systems are potential sources of error variance.
Test-retest reliability – is an estimate
of reliability obtained by correlating
pairs of scores from the same people
on two different administrations of the
same test. It is appropriate when
evaluating the reliability of a test that
purports to measure something that is
relatively stable over time, such as a personality trait.
Coefficient of stability – the estimate of test-retest reliability when the interval between testings is greater than six months.
The degree of relationship between various forms of a test can be evaluated by means of an alternate-forms or parallel-forms coefficient of reliability, which is often termed the coefficient of equivalence.
Parallel forms of a test exist when for
each form of the test the means and
the variances of observed test scores
are equal.
Parallel forms reliability – refers to an
estimate of the extent to which item
sampling and other errors have
affected test scores on versions of the
same test when for each form of the
test, the means and variances of
observed test scores are equal.
Alternate forms are simply different
versions of a test that have been
constructed so as to be parallel.
Alternate forms reliability refers to an
estimate of the extent to which these
different forms of the same test have
been affected by item sampling error
or other error.
An estimate of the reliability of a test
can be obtained without developing
an alternate form of the test and
without having to administer the test
twice to some people.
The computation of a coefficient of
split-half reliability generally entails
three steps:
1. Divide the test into equivalent
halves.
2. Calculate a Pearson r between scores on the two halves of the test.
3. Adjust the half-test reliability using the Spearman-Brown formula.
Split-half reliability is also referred to as odd-even reliability.
The Spearman-Brown formula – allows
a test developer or user to estimate
internal consistency reliability from a
correlation of two halves of a test.
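The three steps reduce to a few lines of code. A minimal sketch on invented half-test totals (statistics.correlation requires Python 3.10+); the Spearman-Brown correction for a half-length split is r_sb = 2r / (1 + r):

```python
# Split-half reliability with the Spearman-Brown adjustment.
import statistics

# Each test taker's total score on the odd-numbered and even-numbered items.
odd_half = [10, 14, 9, 16, 12, 8]
even_half = [11, 13, 10, 15, 13, 9]

r_half = statistics.correlation(odd_half, even_half)  # step 2: Pearson r
r_sb = (2 * r_half) / (1 + r_half)                    # step 3: Spearman-Brown
print(r_half, r_sb)
```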
Inter-item consistency – refers to the degree of correlation among all the items on a scale. A measure of inter-item consistency is calculated from a single administration of a single form of a test.
Tests are said to be homogeneous if they contain items that measure a single trait.
Internal consistency estimates of reliability – an estimate of the reliability of a test obtained from a measure of inter-item consistency.
Heterogeneity – describes the degree to which a test measures different factors.
A heterogeneous test is composed of items that measure more than one trait.
Methods of obtaining internal consistency estimates of reliability:
1. Split-half estimate – is obtained by correlating two pairs of scores obtained from equivalent halves of a single test administered once.
Coefficient alpha – developed by Cronbach and elaborated on by others; it is the mean of all possible split-half correlations.
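A minimal sketch of the standard coefficient alpha formula, alpha = k/(k - 1) × (1 - sum of item variances / total score variance), computed on an invented item-score matrix:

```python
# Coefficient alpha from a matrix of item scores (rows = test takers).
import statistics

items = [
    [3, 4, 3, 5],
    [2, 2, 3, 3],
    [4, 5, 4, 5],
    [1, 2, 2, 2],
    [3, 3, 4, 4],
]

k = len(items[0])                                               # number of items
item_vars = [statistics.pvariance(col) for col in zip(*items)]  # per-item variance
total_var = statistics.pvariance([sum(row) for row in items])   # total-score variance
alpha = (k / (k - 1)) * (1 - sum(item_vars) / total_var)
print(alpha)
```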
Average proportional distance (APD) – a relatively new measure for evaluating the internal consistency of a test; it focuses on the degree of difference that exists between item scores.
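The reviewer does not spell out the APD computation, so the following is only one plausible reading, stated as an assumption: average the absolute difference between scores on every pairing of items, then scale by the number of response options minus one:

```python
# Hedged sketch of an APD-style index (assumed recipe, not a standard API).
from itertools import combinations
import statistics

responses = [        # rows = test takers, columns = items on a 1-5 scale
    [3, 4, 3, 5],
    [2, 2, 3, 3],
    [4, 5, 4, 5],
]
n_options = 5

diffs = [abs(a - b)
         for row in responses
         for a, b in combinations(row, 2)]      # every pairing of item scores

apd = statistics.mean(diffs) / (n_options - 1)  # proportional distance
print(apd)
```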
Inter-scorer reliability is the degree of
agreement or consistency between
two or more scorers with regard to a
particular measure.
The simplest way of determining the degree of consistency among scorers in the scoring of a test is to calculate a coefficient of correlation, the coefficient of inter-scorer reliability.
3 basic approaches to the estimation of reliability:
1. Test-retest
2. Alternate or parallel forms
3. Internal or inter-item consistency
A dynamic characteristic – a trait, state, or ability presumed to be ever-changing as a function of situational and cognitive experiences.
Static characteristics – a trait, state, or ability presumed to be relatively unchanging over time; contrast with dynamic.
Restriction or inflation of range – also referred to as restriction or inflation of variance; a phenomenon associated with reliability estimates wherein the variance of either variable in a correlational analysis is restricted (or inflated) by the sampling procedure used, and so the resulting correlation coefficient tends to be lower (or higher).
Speed tests – generally contain items of a uniform level of difficulty so that, when given generous time limits, all test takers should be able to complete all the test items correctly.
Power tests – the time limit is long enough to allow test takers to attempt all items, but some items are so difficult that no test taker is able to obtain a perfect score.
Criterion-referenced tests – designed to provide an indication of where a test taker stands with respect to some variable or criterion, such as an educational or a vocational objective.
Classical test theory – is also referred
to as the true score model of
measurement. It is the most widely
used and accepted model in the
psychometric literature today.
True score – a value that, according to classical test theory, genuinely reflects an individual's ability level as measured by a particular test.
Domain sampling theory – seeks to estimate the extent to which specific sources of variation under defined conditions are contributing to the test score. In this theory, a test's reliability is conceived of as an objective measure of how precisely the test score assesses the domain from which the test draws a sample.
Generalizability study – examines how
generalizable scores from a particular
test are if the test is administered in
different situations.
Stated in the language of generalizability theory, a generalizability study examines how much of an impact different facets of the universe have on the test score.
The influence of particular facets on
the test score is represented by
coefficients of generalizability.
In the decision study developers
examine the usefulness of test scores
in helping the test user make
decisions.
Item response theory (IRT) – another alternative to the true score model. It is also referred to as latent-trait theory or the latent-trait model: a system of assumptions about measurement and the extent to which each test item measures the trait.
In the context of IRT, discrimination signifies the degree to which an item differentiates among people with higher or lower levels of the trait, ability, or whatever it is that is being measured.
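For reference, one common IRT model (the two-parameter logistic, standard in the IRT literature rather than named in this reviewer) makes the roles of discrimination and difficulty concrete:

```latex
% Probability of a correct response at trait level theta,
% where a is the item's discrimination and b its difficulty:
P(X = 1 \mid \theta) = \frac{1}{1 + e^{-a(\theta - b)}}
```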
There are IRT models designed to handle data resulting from the administration of tests with dichotomous test items.
Dichotomous test item – a test item or
question that can be answered with
only one of two response options such
as true or false or yes-no.
Polytomous test items – a test item or
question with three or more
alternative responses where only one
alternative is scored correct or scored
as being consistent with a targeted
trait or other construct.
CHAPTER 6: VALIDITY
A test is considered valid for a particular purpose if it does, in fact, measure what it purports to measure. A test of reaction time is a valid test if it accurately measures reaction time. A test of intelligence is a valid test if it truly measures intelligence.
Other considerations
A good test is one that trained
examiners can administer, score, and
interpret with a minimum of difficulty.
A good test is a useful test, one that yields actionable results that will ultimately benefit individual test takers or society at large.
If the purpose of a test is to compare the performance of the test taker with the performance of other test takers, then a "good test" is one that contains adequate norms, also referred to as normative data.
Norms – provide a standard with
which the results of measurement can
be compared.
Norm-referenced testing and assessment – a method of evaluation and a way of deriving meaning from test scores by evaluating an individual test taker's score and comparing it to the scores of a group of test takers.
A common goal of norm-referenced
tests is to yield information on a test
taker’s standing or ranking relative to
some comparison group of test takers.
Norm – refers to behavior that is usual, average, normal, standard, expected, or typical.
Norms, in the psychometric context, are the test performance data of a particular group of test takers that are designed for use as a reference when evaluating or interpreting individual test scores.
Standardization or test standardization – the process of administering a test to a representative sample of test takers for the purpose of establishing norms.
In developing a test, the test developer targets some defined group as the population for which the test is designed.
Sample – a portion of the universe of
people deemed to be representative
of the whole population.
Sampling – the process of selecting the portion of the universe deemed to be representative of the whole population.
Stratified-random sampling is the
process of developing a sample based
on specific subgroups of a population
in which every member has the same
chance of being included in the
sample.
Stratified sampling is the process of
developing a sample based on specific
subgroups of a population.
2 types of sampling procedure:
1. Purposive sampling – the arbitrary
selection of people to be part of a
sample because they are thought to be
representative of the population being
studied.
2. Incidental sampling – also referred to as convenience sampling. The process of arbitrarily selecting some people to be part of a sample because they are readily available, not because they are most representative of the population being studied.
Types of Norms
1. Age Norms – also known as age-equivalent scores; indicate the average performance of different samples of test takers who were at various ages at the time the test was administered.
2. Grade Norms – designed to indicate
the average test performance of test
takers in a given school grade.
– are developed by administering the
test to representative samples of
children over a range of consecutive
grade levels.
3. National Norms – are derived from
a normative sample that was
nationally representative of the
population at the time the norming
study was conducted.
4. National Anchor Norms – an equivalency table for scores on two nationally standardized tests designed to measure the same thing.
5. Local Norms – provide normative
information with respect to the local
population’s performance on some
test.
Percentage correct – refers to the distribution of raw scores; more specifically, to the number of items that were answered correctly, multiplied by 100 and divided by the total number of items.
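For example, a test taker who answers 45 of 60 items correctly has a percentage correct of (45 × 100) / 60 = 75.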
Equipercentile method – the equivalency of scores on different tests is calculated with reference to corresponding percentile scores.
6. Norms from a fixed reference group
Fixed reference group scoring system – a system of scoring wherein the distribution of scores obtained on the test from one group of test takers (the fixed reference group) is used as the basis for the calculation of test scores for future administrations of the test.
7. Subgroup Norms – norms for any defined group within a larger group.
8. Percentile Norms – the raw data from a test's standardization sample converted to percentile form.
Percentile – an expression of the percentage of people whose score on a test or measure falls below a particular raw score. It is a converted score that refers to a percentage of test takers.
Norm-referenced – a way to derive meaning from a test score by evaluating it in relation to other scores on the same test.
Criterion – a standard on which a judgment or decision may be based.
Criterion-referenced testing and assessment – a method of evaluation and a way of deriving meaning from test scores by evaluating an individual's score with reference to a set standard.
Content-referenced testing and assessment – also referred to as criterion-referenced or domain-referenced testing and assessment; a method of evaluation and a way of deriving meaning from test scores by evaluating an individual's score with reference to a set standard; contrast with norm-referenced testing and assessment.
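A closing sketch of the percentile idea defined above (the norm-group scores are invented): the percentile rank of a raw score is the percentage of scores in the comparison group that fall below it:

```python
# Percentile rank of a raw score relative to an invented norm group.
norm_scores = [40, 45, 50, 52, 55, 58, 60, 63, 67, 72]

def percentile_rank(raw, norms):
    below = sum(1 for s in norms if s < raw)
    return 100 * below / len(norms)

print(percentile_rank(58, norm_scores))  # 50.0: half the norm group scored lower
```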