Title of Resource: Reliability Assessment
Author(s): Raechel Soicher
Institution: Sierra College
Brief Description: This is an in-class practice activity that allows students to identify examples of different reliability assessments. They are provided with seven short descriptions and asked to identify each as an example of test-retest, split-half, internal consistency, parallel-forms, or interrater reliability.
Keywords: Reliability and Validity
Author Contact Information: rsoicher@sierracollege.edu
Additional Information:
TeachPsychScience.org is made possible with grant support from the Association for Psychological Science (APS) Fund
for Teaching and Public Understanding of Psychological Science to the site creators Gary Lewandowski, Natalie
Ciarocco, and David Strohmetz. All materials on this site have been subjected to a peer review process. We welcome
additional resources (www.teachpsychscience.org/submissions).
© 2013 by Raechel Soicher. All Rights Reserved. This material may be used for noncommercial
educational purposes. All other uses require the written consent of the authors.
Instructors:
It is often useful to provide students with concrete examples that help to illustrate concepts in research methods that
the students may not have the opportunity to experience first-hand. In this brief exercise (5-10 minutes), students
review real-life descriptions that illustrate the different types of reliability assessment. This can be
given as an in-class assignment following review of the assessment types, either individually or in pairs, or it may be assigned as
homework.
Testing Reliability: Handout 1
Test-Retest Reliability: In this type of reliability assessment, the measure of the construct is tested on two
different occasions (time-points) for consistency. Testing is typically completed a couple of weeks apart. If the
measure is reliable, then each participant's scores at the two time points will be comparable. The
scores do not need to be exactly the same, but they should be similar.
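For those who want to see the arithmetic, here is a brief sketch in Python (the names and scores are invented for illustration). It correlates the same participants' scores from the two testing occasions; the same calculation applies to the parallel-forms method described further below, where the two score sets come from two alternate forms rather than two occasions.

# Illustrative sketch only: invented scores for five participants tested
# on two occasions about two weeks apart.
import numpy as np

time1 = np.array([12, 18, 9, 22, 15])   # scores at the first occasion
time2 = np.array([13, 17, 10, 21, 16])  # scores at the second occasion

# Pearson correlation between the two occasions; a value near 1 means
# participants kept roughly the same relative standing over time.
r = np.corrcoef(time1, time2)[0, 1]
print(f"Test-retest correlation: r = {r:.2f}")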
Split-Half Method: This reliability assessment splits the measure of the construct into two halves and
compares performance on the two halves. In this method, participants complete the full measure at a single
session; the researcher then divides the items into two halves (for example, odd-numbered versus
even-numbered items) and scores each half separately. Then, a correlation is computed between the two
half-scores. If the measure is reliable, then the correlation will be high. This method is preferable to
test-retest because it requires only one time point for data collection.
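As a minimal illustration of that computation, the following sketch (invented data, assuming a ten-item measure scored 1-5 and an odd/even split) totals each participant's score on the two halves and correlates them.

# Illustrative sketch only: invented responses (rows = participants,
# columns = the ten items of one measure, each scored 1-5).
import numpy as np

responses = np.array([
    [4, 3, 4, 5, 3, 4, 4, 5, 3, 4],
    [2, 2, 1, 2, 3, 2, 1, 2, 2, 3],
    [5, 4, 5, 5, 4, 5, 4, 5, 5, 4],
    [3, 3, 2, 3, 3, 2, 3, 3, 2, 3],
    [1, 2, 1, 1, 2, 1, 2, 1, 1, 2],
])

# Total each participant's score on the odd-numbered and even-numbered items.
odd_half = responses[:, 0::2].sum(axis=1)
even_half = responses[:, 1::2].sum(axis=1)

# Correlate the two half-scores; a high correlation suggests the halves
# measure the construct consistently.
r = np.corrcoef(odd_half, even_half)[0, 1]
print(f"Split-half correlation: r = {r:.2f}")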
Internal Consistency: This method is similar to the split-half method. In the case of internal consistency,
however, the split-half procedure is repeated multiple times, using different splits of the items (Think!
What do you think would be the advantage of repeating the split multiple times?). Because of this, multiple
correlation coefficients will be computed. Finally, the researcher takes the average of the correlations. In practice, "Cronbach's Alpha" can be
used to compute a value for internal consistency.
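Most statistical packages report Cronbach's Alpha directly, but as a rough sketch of how the value could be computed by hand, the following uses an invented item-response matrix (the data and function name are illustrative only).

# Illustrative sketch only: Cronbach's alpha from an item-response matrix
# (rows = participants, columns = items).
import numpy as np

def cronbach_alpha(responses):
    # alpha = k/(k-1) * (1 - sum of item variances / variance of total scores)
    k = responses.shape[1]                              # number of items
    item_variances = responses.var(axis=0, ddof=1)      # variance of each item
    total_variance = responses.sum(axis=1).var(ddof=1)  # variance of participants' total scores
    return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

responses = np.array([
    [4, 3, 4, 5, 3],
    [2, 2, 1, 2, 3],
    [5, 4, 5, 5, 4],
    [3, 3, 2, 3, 3],
    [1, 2, 1, 1, 2],
])
print(f"Cronbach's alpha = {cronbach_alpha(responses):.2f}")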
Parallel-Forms Method: In this method, the researcher creates two measures of the same construct which
are administered to the same group of people at one time point. Then, the researcher calculates the
correlation between the two forms. The higher the correlation, the more reliable the measure. This method
is not very time consuming to administer, which is a benefit, but you need a relatively large pool of items in
order to create two forms of sufficient length.
Inter-rater Reliability: The previous four tests of reliability are focused on creating a measure of a construct.
For this test of reliability, the emphasis is on using a particular measure consistently across different
researchers. When a research design requires observations of an event (e.g., acts of aggression), this
assessment is used to make sure the observations are consistent and unbiased when more than one person on
the research team is doing the observations. The results of the observations for multiple observers are
compared and their agreement is assessed. The higher the agreement between observers, the more reliable
they have been. For example, if the observers agree on 92 out of 100 observed acts of aggression, the reliability is a
strong 92%.
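To make the percent-agreement arithmetic concrete, here is a small sketch comparing two observers' codings of the same ten events (the codings are invented for illustration).

# Illustrative sketch only: two observers code the same ten events
# (1 = aggressive act, 0 = not aggressive).
rater_a = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
rater_b = [1, 0, 1, 0, 0, 1, 0, 1, 1, 1]

# Percent agreement: the share of events both observers coded the same way.
agreements = sum(a == b for a, b in zip(rater_a, rater_b))
percent = 100 * agreements / len(rater_a)
print(f"Agreement: {agreements}/{len(rater_a)} events = {percent:.0f}%")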
Handout 2: Identify the Reliability Assessment Being Used
In each of the following examples, determine which form of assessment is being used for the reliability of the
measure of the dependent variable. Possible forms of assessment: Test-Retest, Split-Half, Internal Consistency, Parallel-Forms, Inter-Rater
______________________1. Julie is going to measure a person’s mood before and after a stressful situation.
However, she is worried that if a participant sees the same mood questionnaire before and after the
experimental situation, it will change their responses. Therefore, Julie decides to create two different mood
inventories composed of different questions. In order to assess the reliability of the two forms, she brings in twenty
pilot participants, administers both forms to the group, and then correlates scores on the two forms.
______________________ 2. Ronnie, Janna, and Todd have been watching video recordings of children
playing in a room with different types of toys. They have been tasked with noting when a child displays
helping behaviors. To make sure their categorizations are consistent, the researcher compares all
three sets of observations to see the extent to which they agree with one another.
______________________ 3. Dr. Perkins wants to know if his test of extraversion is reliable. On Monday, he
splits the test of extraversion such that twenty participants see the odd-numbered questions and twenty
participants see the even-numbered questions. On Wednesday, he repeats a similar procedure with a new
group, where twenty participants respond to questions 1-20 and the other twenty participants respond to
questions 21-40. After correlating scores on the four "parts," he computes the average of the correlations.
______________________ 4. Rodger has developed a test of depressive symptoms. He gives the test to
twelve of his patients in October and then again to the same twelve in December. Next, he determines the
correlation in scores between the two testing occasions.
______________________ 5. In just one session, Dr. Woolley administers an inventory she developed to
measure college students’ level of motivation. To determine if her inventory is reliable, she compares the
scores on the first half of the inventory to the second half of the inventory.
______________________ 6. Professor X wants to assess the level of pre-existing knowledge of students in
his upper division psychology class. On the first day of class, the students complete the final exam from
Professor X’s introductory psych class. Then, a week later, before any content has been introduced, Professor
X administers the exam for a second time. A high level of consistency in the results between the first and
second administrations would indicate that Professor X's exam is reliable.
______________________ 7. Sierra College is interested in evaluating the reliability of a critical thinking
assessment. After developing a set of 100 critical thinking problems, half of them are randomly assigned to
students at the Rocklin campus and the other half are assigned to students at the Roseville campus. The
reliability of the critical thinking assessment will be high if the scores of students at the two campuses are
comparable.