What are Confidence Judgments Made of? Students’ Explanations for their Confidence Ratings
and What that Means for Calibration
Daniel L. Dinsmore and Meghan M. Parkinson
University of Maryland
DRAFT
Author Note
Daniel L. Dinsmore, Department of Human Development, University of Maryland; Meghan M.
Parkinson, Department of Human Development, University of Maryland.
Correspondence concerning this article should be addressed to Daniel L. Dinsmore,
Department of Human Development, 3304 Benjamin Building, College Park, MD 20742.
E-mail: dinsmore@umd.edu
Thanks to Gregory Hancock for his helpful comments regarding this manuscript.
Abstract
Although calibration has been widely studied, questions remain about how best to capture
confidence ratings, how to calculate continuous variable calibration indices, and on what exactly
students base their reported confidence ratings. Undergraduates in a research methods class
completed a prior knowledge assessment, two sets of readings and posttest questions, and rated
their confidence in their responses to each posttest item. Students also wrote open-ended
responses explaining why they marked their confidence as they did. Students provided
confidence ratings on a 100-mm line for one of the passages and through magnitude scaling for
the other counterbalanced passage. Calibration was calculated using a rho coefficient. No
within-subject differences were found between 100-mm line responses and magnitude scaling
responses, p = .54. Open-ended responses revealed that students base their confidence ratings on
prior knowledge, characteristics of the text, characteristics of the item, guessing, and
combinations of these categories. Future studies including calibration should carefully consider
implicit assumptions about students’ sources of confidence and how those sources theoretically
relate to calibration.
Keywords: Calibration, confidence ratings, magnitude scaling, self-regulation
What are Confidence Judgments Made of? Students’ Explanations for their Confidence Ratings
and What that Means for Calibration
Students take many tests throughout their academic careers. Successful students develop a set of expectations about what will be on the test and either memorize the appropriate information or develop knowledge schemata (Crisp, Sweiry, Ahmed, & Pollitt, 2008). At a certain point in the process of studying, students must determine whether they have adequately learned the concepts on which they will be tested, a determination called a judgment of learning (Dunlosky, Serra, Matvey, & Rawson, 2005). The same type of judgment occurs again during testing, when students determine whether they have adequately answered each question; this is called a confidence judgment (Schraw, 2009). The accuracy of students’ judgments of learning and confidence judgments is critically important because it affects the self-regulation of learning (Labuhn, Zimmerman, & Hasselhorn, 2010). Calibration is thus the relation between the degree of confidence students have in their performance and their actual performance (Fischhoff, Slovic, & Lichtenstein, 1977; Glenberg, Sanocki, Epstein, & Morris, 1987).
Since calibration is so crucial to learning, it is not surprising that cognitive,
developmental, and educational psychologists have long studied the phenomenon (e.g.,
Fischhoff, Slovic, & Lichtenstein, 1977; Glenberg & Epstein, 1985; Koriat, 2011). Thorndike first
brought attention to the need to move away from simply measuring how precisely one could
estimate physical differences and move towards investigations of how one judges differences in
comparison to a mental standard of some sort (Woodworth & Thorndike, 1900). He argued that
this was of utmost importance in understanding how we navigate everyday situations. After more than a century of research into calibration, there are still many questions about how it relates to the
everyday situation of successful or unsuccessful learning. It is the purpose of this study to
systematically address these questions through precise theoretical framing, manipulated methods
of measurement, and students’ self-reported reasons for their responses to the confidence
measures.
A recent review of the contemporary literature on calibration found that while just over
half of studies explicitly defined what was meant by calibration, very few grounded that
definition in a theoretical framework (Parkinson, Dinsmore, & Alexander, 2010). What it means
to be a poorly calibrated versus a well-calibrated learner is somewhat in doubt without
conceptual assumptions from a model or theory of some sort. Furthermore, measurement of the
confidence ratings upon which calibration is calculated should be congruent with the chosen
theoretical framework. If, for example, calibration is theoretically thought to be domain-specific, the corresponding measurement should reflect domain-specific items. Theory should also guide interpretation of students’ bases for judging their confidence, since confidence ratings are necessary for calculating calibration. Previous research has suggested that when students answer questions after reading a passage, they tend to judge their understanding and subsequent confidence based on their familiarity with the domain rather than on what they learned from the specific text (Glenberg, Sanocki, Epstein, & Morris, 1987). According to the Model of
Domain Learning (Alexander, Murphy, Woods, & Duhon, 1997) this finding would not only be
explainable, but also expected.
The current study is guided by Bandura’s (1986) model of reciprocal determinism. The
model explains that individuals must regulate the reciprocal relations between personal
influences, behavioral influences, and environmental influences in order to learn. Personal
influences could be metacognitive, such as prior knowledge and experience (i.e., metacognitive knowledge or experiences; Flavell, 1979), or motivational, such as goals, interest, and self-efficacy. Metacognitive experiences, according to Flavell (1979), impact goal-setting, activate strategy use, and tend to be conscious when a learner is faced with a complex or highly demanding task. As with most conceptions of the relation between metacognition and self-regulation, we conceptualize metacognition as a cognitive aspect of self-regulation, which subsumes it (Dinsmore, Alexander, & Loughlin, 2008). Since the larger frame of self-regulation is thought to be domain-specific in nature (Alexander, Dinsmore, Parkinson, & Winters, 2011), it is important to consider that students will have different metacognitive experiences across domains; consequently, their calibration may be better or worse depending on the interaction between those experiences and the particular task. Given this consideration, students in the current study were asked to complete a set of tasks specific to their research methods class.
Personal factors both influence and are influenced by environmental factors, such as the
task, achievement outcomes, or the classroom climate. Evaluations of the task are expected to
contribute to students’ confidence ratings, and students’ judgment accuracy is in turn expected to
shape perceptions of task difficulty. The final corner of the triangle is behavioral factors.
Bandura described three types of behavior: self-observation, self-judgment, and self-reaction.
Self-observation refers to attention to a task. Confidence ratings are a type of self-judgment
because students compare their performance to pre-established goals, or Woodworth and
Thorndike’s mental standard. Self-reaction is the behavior that results from self-observation and self-judgment. Positioning calibration in the behavioral portion of the model implies not only that personal and environmental factors influence calibration (as has often been studied), but also that calibration itself influences students’ metacognition, goals, and ability to
confront the task. Therefore, if students are poorly calibrated, their performance is not expected to be as high as that of well-calibrated students.
In order to understand the role of calibration in the frameworks mentioned previously, we
must be able to make valid inferences about the measurement of confidence and performance,
and ultimately, the calculations of calibration used. Currently, there appears to be little
consensus on what methods to use to calculate calibration (e.g., Parkinson, Dinsmore, &
Alexander, 2010). Typically, respondents are asked to complete a multiple-choice recall measure and then, for each item, to indicate whether they were “confident” or “not confident.” The Hamann or gamma coefficient would then be used to calculate the likelihood that a participant exhibited more correct judgments than incorrect judgments (Schraw, 2009). However, this dichotomous measurement of both performance and confidence is problematic. As Thorndike and Gates (1929) pointed out, people likely fall along a continuum on any variable (in this case, performance and confidence) rather than into a true dichotomy. While dichotomization of the construct may make discussing and calculating calibration easier, it may in fact invalidate any inferences we can make about an individual’s calibration within the framework of self-regulation.
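For reference, the following is a minimal sketch (in Python, with entirely hypothetical cell counts) of how a gamma coefficient is typically computed from such a dichotomous 2 x 2 table of confidence by performance; it is offered only to illustrate the dichotomous approach described above, not as the procedure of any particular study.

# Goodman-Kruskal gamma for a 2 x 2 confidence-by-performance table;
# for a 2 x 2 table this reduces to (ad - bc) / (ad + bc).
# All cell counts below are hypothetical.
a = 14  # confident and correct
b = 3   # confident and incorrect
c = 2   # not confident and correct
d = 6   # not confident and incorrect
gamma = (a * d - b * c) / (a * d + b * c)
print(round(gamma, 2))  # values near 1 suggest the confident items were the correct ones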
Fortunately, there are measures of both performance and confidence that can be used to
increase the validity of the inferences made from calculations of calibration. With regard to knowledge, it is clear that a correct response on a multiple-choice question does not indicate complete knowledge of a concept and, conversely, that an incorrect response does not indicate the absence of knowledge. One way to move beyond this dichotomy is to use a
graduated response model (Alexander, Murphy, & Kulikowich, 1998). In this response model,
ordered categories are used to represent varying levels of knowledge about a domain or topic.
With regards to confidence, it is also unlikely that an individual would either be
completely confident or not confident at all. This study explores two different options for
measuring confidence on a continuous scale. The first of these is to measure confidence on a
100-millimeter line (e.g., Schraw, Potenza, & Nebelsick-Gullet, 1993). Using the 100-mm line, participants are free to mark anywhere on the line from “not confident” to “very confident” to describe their level of confidence about an item, with possible scores ranging from 0 to 100. The other option explored in this study is magnitude scaling (for more information about magnitude scaling see Lodge, 1981). Magnitude scaling not only provides interval-level data, but also allows respondents to rate their confidence at the ratio level by comparing their confidence in one item with their confidence in an anchor item. Knowing the magnitude of confidence relative to the anchor increases the validity of the interpretations one can make about confidence and, thus, about the calibration calculated for each individual.
Although extending confidence and performance beyond dichotomized measures allows for greater validity of interpretation, it also brings challenges. Namely, how do we calculate calibration? The extension of confidence and performance measures to continuous data requires us to go beyond the Hamann and gamma coefficients to coefficients suited to ordinal data with underlying continuity (i.e., the graduated response model) and to continuous data (i.e., the 100-mm and magnitude scales).
Lastly, looking at the model of reciprocal determinism (Bandura, 1986), it seems apparent
that many studies of calibration manipulate environmental influences, such as feedback (Labuhn,
Zimmerman, & Hasselhorn, 2010) and task difficulty (Pulford & Colman, 1997), and measure
behavioral influences, such as judgments of learning (van Overschelde & Nelson, 2006) or
help-seeking behavior (Stavrianopoulos, 2007), but less is known about the role of personal
factors in calibration. The current study asks students to self-report how they chose their
confidence rating for particular items through an open-ended written response. The aim in
gathering this information is to determine which factors students are considering when they
make their confidence judgments, and subsequently which factors are cited by poorly and highly
calibrated individuals. In addition to the valuable contribution of self-reported bases for
judgments, the following questions are under investigation in the current study.
Can the rho coefficient provide valid inferences about the relation between performance
and confidence (i.e., calibration)? It is hypothesized that the rho coefficient will
be appropriate for calculating calibration since it correlates ordinal data with an
underlying continuous distribution (Cohen, Cohen, West, & Aiken, 2003).
Does the type of confidence scale used (i.e., 100-mm versus magnitude scale) affect the
validity of inferences drawn from the calculation of calibration (i.e., the rho
coefficient) within subjects and in relation to an external measure (e.g., prior
subject-matter knowledge)? It is predicted that magnitude scaling will provide
more valid inferences since these are ratio-level rather than interval-level data.
What do students report considering when making their confidence judgments? It is
anticipated that students will report personal influences suggested by Bandura
(1986) such as self-efficacy or interest, and aspects of metacognition described by
Flavell (1979) such as metacognitive knowledge and metacognitive experiences.
Students may also consider aspects of the task, such as item difficulty and answer
choices, when making their confidence judgments. The current study utilized
open-ended written responses to capture as much variability in self-reported
responses as possible.
Method
Participants
The participants for the current study (11 males, 61 females; Mage = 20.9 years, age range: 19-25) were recruited from a human development research methods class at a large public mid-Atlantic university. These participants were ethnically diverse (58.3% Caucasian, 19.4% African American, 15.3% Asian, and 6.9% Hispanic) and came from a wide variety of majors (the most prevalent major was psychology at 33.3%). Participants were sophomores, juniors, and seniors (Mcredits = 87.8, MGPA = 3.3).
Materials
The materials for this study consisted of two text passages adapted from a research
methods textbook (Cozby, 2007). The passages dealt with the topics of the scientific approach (SA; Appendix A) and statistical inference (SI; Appendix B) and were chosen to be equivalent with regard to both readability and difficulty. Both passages were three single-spaced pages (1,669 and 1,672 words, respectively), had comparable Flesch Reading Ease scores (42.7 and 45.0, respectively), and had comparable Flesch-Kincaid Grade Level scores (11.5 for both passages). Further, no significant difference was detected between performance on the two passage recall measures described below (Mdiff = 1.25, SEdiff = 6.63, p = .11).
Measures
The measures for this study consisted of a subject-matter knowledge measure, two
passage recall measures, and two item confidence measures.
Subject-matter knowledge. The subject-matter knowledge measure was a 12-item graduated multiple-choice test. The 12 items covered a variety of topics from the research methods course in which the students were enrolled and were verified as important course topics by the instructor of record.
The answer choices for each item were on a graduated response scale (Alexander, Murphy, & Kulikowich, 1998) representing four categories. The first category was the in-topic correct response, which was scored a 4. The second category was the in-topic incorrect response, which was scored a 2; this response was not correct but was within the same topic as the correct response. The third category was the in-domain incorrect response, which was scored a 1; this response was also incorrect and was from a different topic than the correct response, but was still in the domain of research methods. The fourth category was the popular lore option, which was scored a 0; this incorrect response was one that a participant with little or no subject-matter knowledge would choose and was not associated with the domain of research methods. An example of one of these items follows.
A researcher runs a time-series design testing the effects of his magic memory pill.
Ironically, after taking his Magic Memory Pill, the participants in his study forget to
come to the subsequent observations. The main threat to internal validity here is:
a. dizziness (0)
b. maturation (2)
c. mortality (4)
d. reliability (1)
These items and graduated responses were validated by the instructor of record for the research
methods course. Possible scores for this measure could range from 0 to 48. The current sample
had a mean score for this measure of 33.74 (SD = 5.77). The reliability for this measure was low
(α = .30), possibly due to the low prior knowledge of many of these participants.
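A minimal sketch of how such graduated responses might be scored and totaled follows; the answer key and responses are hypothetical and serve only to illustrate the 4/2/1/0 weighting described above.

# Hypothetical answer key: each option maps to a graduated score
# (4 = in-topic correct, 2 = in-topic incorrect, 1 = in-domain incorrect, 0 = popular lore).
key = {
    1: {"a": 0, "b": 2, "c": 4, "d": 1},  # e.g., the mortality item shown above
    2: {"a": 4, "b": 1, "c": 0, "d": 2},  # remaining items are invented
}
responses = {1: "c", 2: "d"}  # one participant's circled options
total = sum(key[item][choice] for item, choice in responses.items())
print(total)  # 4 + 2 = 6; across all 12 items, totals range from 0 to 48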
Passage recall measures. For each of the text passages (SA and SI) there was a 10-item
graduated multiple-choice measure. The ten items for each of these measures (the SA passage
recall and SI passage recall) were based on concepts taken directly from the respective passages.
Similar to the subject-matter knowledge measure, the response choices represented a graduated
response scale consisting of four categories. The first category was the in-passage correct
response and was scored a 4. The second category was the in-passage closely-related incorrect
response and was scored a 2. This response was closely conceptually related to the correct response and was located in close proximity to the correct answer in the passage. The third category was the in-passage further-related incorrect response and was scored a 1. This response was in the passage but was not closely related to the correct response and was located farther from the correct response in the passage. The last category was the popular lore response
and was scored a 0. This response was not located anywhere in the text passage. An example of
one of these items follows.
To ensure research with major flaws will not become part of the scientific literature,
scientists rely on:
a. expertise (0)
b. reliability estimates (1)
c. verification of ideas (2)
d. peer review (4)
Validity for these items was established by their correspondence to the text passages and confirmed by the course instructor of record. Possible scores for these measures could range from 0 to 40. The current sample had a mean score of 33.40 (SD = 6.45) for the SA passage recall measure and 32.15 (SD = 5.63) for the SI passage recall measure. The reliabilities for these measures were .67 and .62 for the SA and SI measures, respectively.
Item confidence measures. Following each item on both the SA and SI passage recall
measures participants were asked how confident they were that their response was correct. This
was done using either a 100-mm line (Schraw, et al., 1993) or magnitude scaling (Lodge, 1981).
100-mm line confidence measure. For this measure, participants were instructed, “After
each item please indicate how confident you are with the answer you circled by making a slash
mark on the line (or ends) indicating your confidence from not confident to very confident.” A
sample item follows.
How confident are you that your response to the above item is correct?
not confident _________________________________________________ very confident
Each participant’s response was measured on the 100 millimeter line and recorded using a
standard ruler. Possible scores on the 100-mm line confidence scale ranged from 0 to 100.
Magnitude scaling confidence measure. For this measure, the directions were:
After each item please indicate how confident you are with the answer you circled
by comparing your confidence in that item with the anchor item which has been
assigned a value of 10. If you are more confident for a particular item than the
anchor item (which had a value of 10) you should write a number higher than 10
in the brackets. For example, if you are three times as confident in that particular
item than you were with the anchor statement, you would write a 30 (i.e. three
times as much as 10) for your confidence judgment of that item. If you are less
confident for a particular item than the anchor item (which had a value of 10) you
should write a number lower than 10 in the brackets. For example, if you are half
as confident in that particular item than you were with the anchor statement, you
would write a 5 (i.e. half of 10) for your confidence judgment of that item. No
negative numbers please.
The anchor item immediately followed the instructions and consisted of a multiple-choice item
of the same type as the recall items that was of about medium difficulty (item easiness = .53) in
an earlier pilot sample. A sample confidence question follows.
How confident are you that your response to the above item is correct compared to the anchor item? [          ]
Possible scores on the magnitude judgments ranged from 0 to ∞. These raw scores were then converted to log (base 10) scores (Lodge, 1981) so that a transformed value of one was equal to the anchor (initially 10), transformed values under one indicated less confidence than the anchor, and transformed values above one indicated more confidence than the anchor.
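A minimal sketch of that transformation follows; the ratings are hypothetical, and we assume that raw ratings of zero would be excluded or replaced before taking logarithms.

import numpy as np

raw = np.array([5.0, 10.0, 30.0, 100.0])  # hypothetical magnitude confidence ratings
raw = np.clip(raw, 0.1, None)             # assumption: guard against log of zero
transformed = np.log10(raw)               # the anchor value of 10 maps to 1.0
print(np.round(transformed, 2))           # [0.7  1.   1.48 2.  ]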
Measurements
In addition to the measures described above, two measurements were taken during the
experiment: participants’ summaries of the passages they had read and their responses to open-ended questions about how they judged their confidence.
Passage summaries. After reading the passage, participants were asked, “Please
summarize this passage, giving both the overall main idea and the major points addressed.”
Participants were given the front and back of the page to write their response. These summaries
will be scored using a rubric currently under development.
Open-ended confidence questions. Following two of the confidence items, participants were asked, “Please explain how you arrived at or what you considered when making your
confidence judgment.” Participants were given a 3-inch space to write their response.
The open-ended responses were then coded for instances of confidence judgments being
arrived at through five a priori categories: prior knowledge, characteristics of the text,
characteristics of the item, guessing, and other considerations. Considerations of prior
knowledge in confidence judgments were instances where participants cited their level of prior
knowledge about the concept in question. For example, one participant responded to an item,
“this question I decided based on past knowledge and understanding.” Characteristics of the text
included instances where participants cited things they had read in the text. For example, a
participant responded, “I remember seeing adoption in the passage earlier.” Characteristics of
the item included instances where participants cited either the stem or responses for the item as
reasons for their confidence level. For example, one participant responded, “All of the answer
choices seem to contribute to the soundness of research (i.e., expertise of the peer reviewers
matters). However, peer review is the formalized process by which research becomes part of the
scientific literature.” Guessing included instances where participants indicated their confidence
level was influenced by guessing. For example, one participant responded, “I just guessed.” The
other category included statements that had none of the elements described previously (i.e., prior
knowledge, characteristics of the text, characteristics of the item, or guessing). For example, one
participant responded, “What I considered when making my confidence judgment is that
personal judgment is like a gut feeling of one person not a group of people. Everyone can/may
rely on intuition based on experience.”
It was also possible that an open-ended response could include more than one of these
elements. In other words, a response could explain how both prior knowledge and item characteristics influenced the participant’s confidence level. For example, a participant with this
combination responded, “prior knowledge (definitions) and relation of confidence to other
answers.”
The first and second authors concurrently coded responses from 10 randomly selected participants (40 responses in all) to ensure the coding scheme was useful. The first and second authors then independently coded the responses from the remaining 62 participants (248 responses).
Exact agreement for codes (including combinations of codes) was 92.5%.
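A minimal sketch of how exact agreement between two coders can be computed follows; the codes shown are hypothetical.

# Hypothetical codes assigned by each coder to the same six open-ended responses.
coder_1 = ["P", "T", "PT", "G", "I", "T"]
coder_2 = ["P", "T", "P",  "G", "I", "T"]
matches = sum(c1 == c2 for c1, c2 in zip(coder_1, coder_2))
print(matches / len(coder_1))  # proportion of exact matches (here 5/6 ≈ 0.83)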
Procedures
The study was conducted during regular class time in the participants’ research methods class. The first session consisted of filling out the informed consent form, completing a demographic survey, and completing the subject-matter knowledge measure. Additionally, students were randomly assigned to read and respond to one of the two text passages (either SA or SI) and to rate their confidence using either the 100-mm scale or the magnitude scale. Passage and confidence rating scale were counterbalanced. After reading the passage, participants completed the 10 recall items, the confidence items, and the passage summary. During the second session, participants read the other passage (either SA or SI) and completed the 10 recall items, confidence items, and passage summary for that passage. Participants were also randomly assigned two items from each passage for which to complete the open-ended response.
Results
Rho Coefficient
Calibration for each participant was calculated using the rho coefficient, which is appropriate for ordinal-level data with an underlying continuous distribution (Cohen, Cohen, West, & Aiken, 2003). To examine whether the rho coefficient gave valid inferences about an individual’s calibration, we began by inspecting the scatterplots of each individual’s raw performance and confidence ratings. Figures 1 and 2 present scatterplots for individuals with a range of rho coefficient values for the 100-mm and magnitude confidence scales, respectively.
From these scatterplots, it appears that the rho coefficient provides some useful signal: the more strongly confidence and performance are linearly related, the higher the rho coefficient (i.e., the more highly calibrated the individual). One issue that we encountered, evident in the scatterplots for the 100-mm lines, is how much of the 100-mm line participants actually used. Specifically, the mean of the standard deviations of participants’ ten responses on the 100-mm confidence line was 17.68 (SD = 12.88), with a maximum standard deviation of 71.56 and a minimum standard deviation of 0. It therefore becomes difficult to ascertain what differences participants ascribe to different locations on the 100-mm line.
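As a point of reference, a minimal sketch of computing one participant’s coefficient follows, under the assumption that rho here refers to a Spearman-type rank-order correlation between the ten graduated recall scores and the ten corresponding confidence ratings; the data are invented.

from scipy.stats import spearmanr

# Hypothetical data for one participant: graduated recall scores (0, 1, 2, or 4)
# and the matching confidence ratings from the 100-mm line (0-100).
performance = [4, 4, 2, 1, 4, 0, 2, 4, 1, 4]
confidence = [90, 85, 60, 40, 75, 20, 55, 95, 30, 80]
rho, p_value = spearmanr(performance, confidence)
print(round(rho, 2))  # higher rho = confidence ranks track performance ranks more closely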
Type of Confidence Scale
The issue of interpreting different locations on the 100-mm line brings us to the next question: Does the type of confidence scale (100-mm line versus magnitude judgments) matter? We return to the interpretive issues below, but first present an analysis of within-subjects differences between the 100-mm and magnitude scales. Because there were no passage differences in recall performance (see the analysis in the Method section), differences in participants’ calibration (rho coefficient) between the two scales could be inspected directly. A within-subjects analysis indicated no significant difference between individuals’ calibration using the 100-mm scale and the magnitude scale, F(1, 68) = 0.375, p = .54. There was also no difference for order of measurement scale, F(1, 68) = 0.186, p = .67. The means and standard deviations for this analysis can be found in Table 1.
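As an illustration only, a simplified within-subjects comparison of the two sets of rho values could be run as a paired t-test (equivalent, for two conditions, to a one-way repeated-measures ANOVA); this sketch omits the order factor included in the analysis above, and the values are hypothetical.

from scipy.stats import ttest_rel

# Hypothetical per-participant calibration (rho) under each confidence scale.
rho_100mm = [0.53, 0.41, 0.62, 0.47, 0.58, 0.35]
rho_magnitude = [0.50, 0.45, 0.60, 0.52, 0.49, 0.40]
t_stat, p_value = ttest_rel(rho_100mm, rho_magnitude)
print(round(p_value, 2))  # a large p-value suggests no within-subject difference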
Further, the distributions of rho for the two passages were similar. The distributions for both the SA passage and the SI passage were negatively skewed (-.64 and -.53, respectively), although they had different levels of kurtosis (-.02 and -.34, respectively). The distributions of rho are presented in Figure 3 for the SA passage and Figure 4 for the SI passage.
Although there were no differences in the magnitude of participants’ calibration (i.e., the rho coefficient) across scales, we still believe that the validity of the interpretations of the calibration coefficient is stronger for the magnitude scale. In other words, given an anchor statement, individuals rate their confidence by indicating the magnitude of their confidence in one item relative to the anchor item; hence, the scale is known and fixed to the anchor item. For the 100-mm line, it is unknown whether a mark at 50 millimeters represents twice as much confidence as a mark at 25 millimeters.
We had hoped to examine whether the strength of the relation to subject-matter knowledge differed for rho coefficients calculated from the 100-mm line versus the magnitude scale; however, due to the low reliability of the subject-matter knowledge measure, we chose not to analyze that scale. We will, however, code the passage summaries and examine whether the correlations between passage summary scores and calibration differ across the two confidence scales.
Open-ended Participant Reports
Finally, we present data on what students reported they used to make their confidence
ratings. The five categories discussed in the methods section (prior knowledge, characteristics of
the text, item characteristics, guessing, and other) were all present in the raters’ codings.
Additionally, participants’ responses also fell into the following joint categories: prior knowledge and text characteristics; prior knowledge and item characteristics; prior knowledge and guessing; text characteristics and item characteristics; text characteristics and guessing; item characteristics and guessing; item characteristics and other; prior knowledge, text characteristics, and item characteristics; text characteristics, item characteristics, and guessing; and finally, prior knowledge, item characteristics, and guessing. The number of responses that fell into each of these categories by passage is presented
in Table 2.
It is clear from these data that participants were taking into account multiple factors when
rating their confidence. Not only do these data support using Bandura’s (1986) model of reciprocal determinism, but they further indicate that inferences made about participants’ calibration must be carefully considered. For instance, many individuals used only one factor when making their confidence judgments, whereas others used different factors for different items. Further, some individuals were able to incorporate multiple factors when making their
confidence judgments.
Discussion
How can the preceding findings from this exploratory study help guide exploration of
calibration in different learning situations? We suggest that these data inform research into
calibration in regards to theoretical framing and the importance of knowing what individuals are
basing their confidence judgments on, measurement of confidence and performance, and the
calculation of calibration.
These findings indicate that students’ confidence ratings do include elements of the person (i.e., prior knowledge) and the task (i.e., text characteristics and item characteristics in this study). Further, it was clear that students were basing their confidence on different parts of the model of reciprocal determinism (i.e., person characteristics or environment characteristics) or on a combination of both (i.e., person and environment characteristics). These findings highlight that these data fit the model of reciprocal determinism and that further investigation into how students choose which of these aspects to consider when making a confidence judgment is warranted. For instance, simply asking students to rate their general confidence may not provide enough information for valid inferences about the nature of calibration. For some items or tasks, it may be more telling to ask how confident one is based on person factors, environment factors, and some combination of both. Regardless of how confidence is measured,
it is critical to frame it in some sort of theory or model to provide explanatory power about
calibration inside and outside of the classroom.
Secondly, in order to make valid inferences about calibration, it is also clear that the measurements of confidence and performance must match the underlying theoretical distributions. For both confidence and performance, these underlying continuous distributions need to be measured using continuous, not dichotomous, variables. For confidence judgments this can be done in relatively easy fashion. The 100-mm lines provided a “quick and dirty” method of measuring confidence on an interval scale. While these scales yield interval-level data, where students rate their confidence on the line (both the mean location and the variability of these responses) is difficult to interpret and likely differs across individuals. The magnitude scales, on the other hand, provide ratio-level data in which we can directly interpret the magnitude of students’ responses relative to a given anchor statement. The difficulty here is determining what the anchor item should be. For this study, we chose to use an
item that was of moderate difficulty for a small pilot sample, but perhaps other considerations
about an item and the individuals in the sample need to be taken into account.
Performance, on the other hand, is a bit more difficult to measure in a continuous fashion. We have used ordinal-level data with the graduated response model. One possible extension of this response model would be to use more response options, upwards of seven, which may produce a distribution of item responses over enough items to be approximately normal. Using Monte Carlo simulations, such as those of Schraw, Kuch, and Roberts (this symposium), may help elucidate some
of these issues.
Lastly, the issue of measurement has a direct effect on the calculation of calibration. It is
evident that whatever coefficient or value is used for calibration must meet the assumptions of
the underlying data. In this case we used rho because it was a conservative coefficient for our
data. Another option for these data would be the polyserial correlation, which relates an ordinal variable with an underlying continuous distribution to a continuous variable, although the polyserial coefficient involves computational difficulties (i.e., maximum-likelihood estimation) that become pronounced when each student responds to only a small number of items, as in the current study.
One avenue of further investigation would be to see if an index of calibration could be
constructed using the standard error from linear regression. If confidence and performance were expected to have a standardized slope of 1, participants’ confidence ratings and performance could be used to obtain a standard error of these points around this predicted slope of 1. However, since continuous measures of performance are difficult to obtain and not common in the literature, other types of regression need to be investigated. One type of regression useful for ordinal data in this case might be ordinal regression, which is a type of logistic regression (Pedhazur, 1997). Issues of violated assumptions would need to be overcome; for example, with these data a few participants had responses in only one outcome category (e.g., 4). Without a minimum number of responses in each category, violations of assumptions and poor model fit may inhibit this approach.
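A minimal sketch of the deviation-from-a-slope-of-1 idea follows, assuming both confidence and performance have first been standardized so that the identity line is a meaningful reference; all values are hypothetical.

import numpy as np

# Hypothetical standardized confidence and performance scores for one participant.
confidence = np.array([0.8, -0.2, 1.1, -1.0, 0.3])
performance = np.array([0.6, 0.1, 0.9, -1.2, 0.2])
# Root-mean-square deviation of the points from the line performance = confidence.
residuals = performance - confidence
rmse = float(np.sqrt(np.mean(residuals ** 2)))
print(round(rmse, 2))  # smaller values = points lie closer to the slope-of-1 line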
Because calibration is such an important consideration in self-regulation and learning,
innovations in its measurement continue to be of utmost importance to researchers and educators
alike. This study sought to answer three critical questions remaining in the use and interpretation
of students’ calibration data. First, two types of confidence rating response formats were
contrasted and some restriction in the range of responses to 100-mm lines was found. This
finding draws attention to the need to consider the nature of the data being interpreted. Second, a
rho coefficient was utilized to calculate calibration scores with continuous data, and this approach is suggested for use with non-dichotomous responses. Finally, a theoretical examination of
students’ rationales for their confidence ratings was undertaken. Students consider various and
sometimes multiple sources when making their confidence judgments. Therefore, it is
imperative for educators to consider the cues provided from these sources in order to better
understand why some students can accurately monitor their learning, while others do not even
notice the need to adjust their strategies and goals for learning.
References
Alexander, P. A., Dinsmore, D. L., Parkinson, M. M., & Winters, F. I. (2011). Self-regulated
learning in academic domains. In B. J. Zimmerman, & D. Schunk (Eds.), Handbook of
self-regulation of learning and performance. New York: Routledge.
Alexander, P. A., Murphy, P. K., & Kulikowich, J. M. (1998). What responses to domain-specific analogy problems reveal about emerging competence: A new perspective on an
old acquaintance. Journal of Educational Psychology, 90, 397-406.
Alexander, P. A., Murphy, P. K., Woods, B. S., & Duhon, K. E. (1997). College instruction and
concomitant changes in students’ knowledge, interest, and strategy use: A study of
domain learning. Contemporary Educational Psychology, 22, 125-146.
Bandura, A. (1986). Social foundations of thought and action: A social cognitive theory.
Englewood Cliffs, NJ: Prentice-Hall.
Cohen, J., Cohen, P., West, S. G., & Aiken, L. S. (2003). Applied multiple regression/correlation
analysis for the behavioral sciences. Mahwah, NJ: Lawrence Erlbaum Associates.
Cozby, P. C. (2007). Methods in behavioral research. NY: McGraw-Hill.
Crisp, V., Sweiry, E., Ahmed, A., & Pollitt, A. (2008). Tales of the expected: The influence of
students’ expectations on question validity and implications for writing exam questions.
Educational Research, 50, 95-115.
Dinsmore, D. L., Alexander, P. A., & Loughlin, S. M. (2008). Focusing the conceptual lens on
metacognition, self regulation, and self-regulated learning. Educational Psychology
Review, 20, 391-409.
Dunlosky, J., Serra, M. J., Matvey, G., & Rawson, K. A. (2005). Second-order judgments about
judgments of learning. Journal of General Psychology, 132, 335-346.
Fischhoff, B., Slovic, P., & Lichtenstein, S. (1977). Knowing with certainty: The appropriateness
of extreme confidence. Journal of Experimental Psychology: Human Perception and
Performance, 3, 552-564.
Flavell, J. H. (1979). Metacognition and cognitive monitoring: A new area of cognitive-developmental inquiry. American Psychologist, 34, 906-911.
Glenberg, A. M., & Epstein, W. (1985). Calibration of comprehension. Journal of Experimental
Psychology: Learning, Memory, and Cognition, 11, 702-718.
Glenberg, A. M., Sanocki, T., Epstein, W., & Morris, C. (1987). Enhancing calibration of
comprehension. Journal of Experimental Psychology: General, 116, 119-136.
Koriat, A. (2011). Subjective confidence in perceptual judgments: A test of the self-consistency
model. Journal of Experimental Psychology: General, 140, 117-139.
Labuhn, A. S., Zimmerman, B. J., & Hasselhorn, M. (2010). Enhancing students’ self-regulation
and mathematics performance: The influence of feedback and self-evaluative standards.
Metacognition and Learning, 5, 173-194.
Lodge, M. (1981). Magnitude scaling: Quantitative measurement of opinions. Newbury Park,
CA: Sage Publications.
Parkinson, M. M., Dinsmore, D. L., & Alexander, P. A. (2010). Calibrating calibration:
Towards conceptual clarity and agreement in calculation. Paper presented at the annual
Meeting of the American Educational Research Association, Denver.
Pedhazur, E. J. (1997). Multiple regression in behavioral research (3rd edition). Orlando, FL:
Harcourt Brace.
Pulford, B. D., & Colman, A. M. (1997). Overconfidence: Feedback and item difficulty effects.
Personality and Individual Differences, 23, 125-133.
Schraw, G. (2009). A conceptual analysis of five measures of metacognitive monitoring.
Metacognition and Learning, 4, 33-45.
Schraw, G., Kuch, F., & Roberts, R. (2011). Bias in the gamma coefficient: A Monte Carlo
study. In P. Alexander (Chair), Calibrating calibration: Conceptualization, measurement,
calculation, and context. Symposium presented at the annual meeting of the American
Educational Research Association, New Orleans.
Schraw, G., Potenza, M. T., & Nebelsick-Gullet, L. (1993). Constraints on the calibration of
performance. Contemporary Educational Psychology, 18, 455-463.
Stavrianopoulos, K. (2007). Adolescents’ metacognitive knowledge monitoring and academic
help seeking: The role of motivation orientation. College Student Journal, 41, 444-453.
Thorndike, E. L., & Gates, A. I. (1929). Elementary principles of education. New York: The
Macmillan Company.
van Overschelde, J. P., & Nelson, T. O. (2006). Delayed judgments of learning cause both a
decrease in absolute accuracy (calibration) and an increase in relative accuracy
(resolution). Memory & Cognition, 34, 1527-1538.
Woodworth, R. S., & Thorndike, E. L. (1900). Judgments of magnitude by comparison with a
mental standard. Psychological Review, 7, 344-355.
Table 1
Means and Standard Deviations of Participants’ Calibration for the 100-mm and Magnitude Scales

Scale               100-mm line first    Magnitude scale first
100-mm scale        .53 (.17)            .47 (.22)
Magnitude scale     .50 (.21)            .54 (.21)
Table 2
Number of Responses for each Confidence Code Category by Passage

Category    SA passage    SI passage
P           12            11
T           49            29
I           24            25
G           2             8
O           25            31
PT          5             14
PI          9             6
PG          0             1
TI          6             7
TG          1             2
IG          4             2
IO          1             1
PTI         4             2
TIG         0             1
PIG         0             1
Note. P = prior knowledge; T = text characteristics; I = item characteristics; G = guessing; O = other; PT = prior knowledge and text characteristics; PI = prior knowledge and item characteristics; PG = prior knowledge and guessing; TI = text characteristics and item characteristics; TG = text characteristics and guessing; IG = item characteristics and guessing; IO = item characteristics and other; PTI = prior knowledge, text characteristics, and item characteristics; TIG = text characteristics, item characteristics, and guessing; PIG = prior knowledge, item characteristics, and guessing.
Figure 1
Scatterplots for Ranges of Participants’ Calibration (rho) for the 100-mm Scales
[Four scatterplots of graduated recall score (y-axis, 0-4) against 100-mm confidence rating (x-axis, 0-100) for participants with rho = .09, .28, .55, and .82.]
Figure 2
Scatterplots for Ranges of Participants’ Calibration (rho) for the Magnitude Scales
[Four scatterplots of graduated recall score (y-axis, 0-4) against log-transformed magnitude confidence rating (x-axis) for participants with rho = .05, .28, .51, and .88.]
Figure 3
Distribution of Participants’ Calibration for the Scientific Approach Passage
[Histogram of the frequency of participants’ rho coefficients in .05-wide bins ranging from 0 to 1.]
Figure 4
Distribution of Participants’ Calibration for the Statistical Inference Passage
[Histogram of the frequency of participants’ rho coefficients in .05-wide bins ranging from 0 to 1.]