Quantitative critical thinking: Student activities using Bayesian updating

Aaron R. Warren a)
Department of Chemistry and Physics, Purdue University Northwest, 1401 S. US-421, Westville, Indiana 46391-9542

(Received 1 June 2016; accepted 6 November 2017)

One of the central roles of physics education is the development of students' ability to evaluate proposed hypotheses and models. This ability is important not just for students' understanding of physics but also to prepare students for future learning beyond physics. In particular, it is often hoped that students will better understand the manner in which physicists leverage the availability of prior knowledge to guide and constrain the construction of new knowledge. Here, we discuss how the use of Bayes' Theorem to update the estimated likelihood of hypotheses and models can help achieve these educational goals through its integration with evaluative activities that use hypothetico-deductive reasoning. Several types of classroom and laboratory activities are presented that engage students in the practice of Bayesian likelihood updating on the basis of either consistency with experimental data or consistency with pre-established principles and models. This approach is sufficiently simple for introductory physics students while offering a robust mechanism to guide relatively sophisticated student reflection concerning models, hypotheses, and problem solutions. A quasi-experimental study utilizing algebra-based introductory courses is presented to assess the impact of these activities on student epistemological development. The results indicate gains on the Epistemological Beliefs Assessment for Physical Science (EBAPS) at a minimal cost of class-time. © 2018 American Association of Physics Teachers. https://doi.org/10.1119/1.5012750

I. INTRODUCTION

There is widespread agreement that preparation for the contemporary and future workplace requires physics students (including non-majors) to develop an understanding of the procedures by which physics and science generally operate.1–5 Many of these procedures fall within the category of abilities termed "critical thinking," involving evaluative reflection upon data, hypotheses, and models. Although a number of instructors may prefer to value other abilities more, the case is made here that this is a mistake, since traditional course designs leave students underprepared for their futures.
This is in part because students often complete physics courses with their initial novice-like traits still intact.6,7 Novice physics students are often prone to unscientific and disorganized evaluation of models and hypotheses. They usually prefer methods that might be viewed as primarily authoritarian in nature due to their reliance upon recognized authorities such as instructors, outstanding peers, and answer keys rather than an independent consideration of evidence.8,9 If little attention is paid during a course to helping students develop the means to evaluate models and hypotheses in a robust fashion, then there can be little hope that they will finish a class with any measurable improvement in the associated critical thinking abilities. For example, it has been found10,11 that students in traditional labs spend significantly less time engaged in sense-making than their peers in non-traditional courses, particularly in the amount of time spent making sense of errors and uncertainties. Moreover, many studies find that introductory physics courses, both traditional and non-traditional, even including those that produce gains on measures of conceptual understanding, tend to have a negative impact on students' attitudes.12,13

Despite the broadly negative findings on student attitude shifts during introductory courses, there are some well-established curricula that have demonstrated success in helping to improve student beliefs. In particular, courses with an explicit focus on model-building, including Physics by Inquiry,14 Physics of Everyday Thinking,15 Modeling Instruction,16–18 and the Investigative Science Learning Environment (ISLE),19,20 have been able to demonstrate gains on student attitudes. Additionally, courses that pay special attention to epistemological considerations have been shown to produce gains.21,22 Critical thinking and reflection on the process of learning are woven within these curricula as students continually propose and evaluate hypotheses and models. Several evaluation techniques that may receive attention in these curricula are dimensional analysis, special-case analysis, and limit-case analysis. Each evaluation technique applies pre-determined criteria and standards to an assertion in order to test whether the assertion is well-defined and consistent with prior knowledge. Students are then expected to update the estimated likelihood of an assertion in a qualitative, intuitive fashion. The attention that these courses generally pay to helping students reflect on their learning is thought to be one essential feature that enables positive attitudinal shifts to occur.

Specific evaluation activities, employing hypothetico-deductive reasoning, have been shown to positively impact student problem-solving ability.8,9 These activities, although ideally meant for curricula such as ISLE or Modeling Instruction, are curriculum-agnostic in the sense that they may be easily and directly incorporated within even traditional courses (although it is not known how their effects are modified by being employed in different curricular environments). Here, we endeavor to deepen such activities and make them more precise through the use of Bayes' Theorem to quantitatively update student estimates of likelihoods.
We propose that the incorporation of Bayes' Theorem within these evaluation activities for an algebra-based introductory physics curriculum may provide a novel means for generating positive attitudinal shifts among students. Given the effect hypothetico-deductive evaluation activities are already known to have on problem-solving ability, their extension to include Bayesian updating in order to also produce positive attitudinal shifts may make them especially useful in fulfilling the broader educational goals of many physics courses. The central purpose of this report is therefore to test whether students' use of Bayes' Theorem to estimate and update subjective probabilities of hypotheses can positively impact their epistemological attitudes.

We first present the motivation for this study by examining the manner by which Bayes' Theorem creates explicit value for organized student reflection at the end of the evaluative process, enabling important but subtle points of general scientific reasoning to be clarified. This motivation is rooted within the logical structure of hypothetico-deductive reasoning, which we briefly review. We then discuss how the use of Bayes' Theorem is related to hypothesis-evaluation and illustrate its use with examples. Next, templates and examples of several novel types of curricular materials that engage students in the use of Bayes' Theorem are described. Finally, the results of a quasi-experimental study assessing the epistemological effect of these materials are presented. While not conclusive, the results support the hypothesis that Bayesian evaluation activities enhance students' epistemological attitudes. These results, when combined with previous work showing that evaluation activities are able to improve problem-solving ability at a relatively low cost of class-time and instructor preparation, suggest that the Bayesian evaluation activities have promising educational potential.

II. HYPOTHETICO-DEDUCTIVE REASONING

One of the central cognitive tools in scientific reasoning is the hypothetico-deductive (HD) process,23 which is used to test a hypothesis by deducing a prediction and then comparing an actual result with the predicted result. A generic template for this process is shown in Table I. It should be noted that the assumptions of the planned test will often require testing or assessment of their own before any change in confidence regarding the hypothesis may be granted.

Table I. A schematic outline of the hypothetico-deductive process.

IF [the hypothesis to be tested is assumed to be true]
AND [a test is planned under certain assumed conditions]
THEN [a prediction is deduced]
AND/BUT [results of the test, including associated uncertainties, were obtained]
THEREFORE [estimated likelihood of hypothesis should/should not be changed]

Depending on the nature of the hypothesis being tested, there are two general types of situations in which students can be asked to use the HD process.

Direct evaluation: The hypothesis makes assertions about the nature of some data, for example, assertions about data patterns (e.g., asserting that some particular data are best modeled with a linear fit), the presence and impact of errors associated with data collection, or the comparison of empirical results with a theoretical prediction.
Indirect evaluation: The hypothesis makes assertions about relationships between physical models (e.g., coherence in some limiting cases or special cases) or between a physical model and general theoretical principles (e.g., energy conservation).25

The difference between these two categories is the immediate presence or absence of empirical data. The term "indirect" is used for the latter category because data have already been used to establish confidence in some other models or principles, and thus those data are now being used indirectly to test the model at hand. It is in this way that much work in theoretical physics is done. String theory, for example, is largely motivated by inconsistencies between general relativity and quantum field theory, and through the careful use of indirect evaluation, it qualifies as an empirical science despite the difficulty in generating directly testable predictions.26 This is a typical stage in the development of theoretical knowledge, as a new framework first gains confidence by demonstrating consistency with pre-existing principles (and the data they are supported by) before being tested against data from novel experiments.

One of the distinctive features of introductory physics is that it is often the first opportunity available to help students become sensitive to the point that one can reason at the level of physical models and theories to test new ideas without always needing to collect more data. This is an essential concept for students to be able to construct well-organized, hierarchical knowledge in many fields, including physics, and it can be introduced at the beginning of the course by the following example.

A. Introductory example

We start with two cylindrical containers, with differing radii and with marks at equal height intervals. Water is initially poured into the wide cylinder up to mark 4. It is then poured from the wide cylinder to the narrow cylinder and is observed to reach mark 6 (see Fig. 1). On the basis of this result, one naïve model relating the heights of the water in the two cylinders, h_wide and h_narrow, is

h_narrow = h_wide + 2.  (1)

Fig. 1. Water is poured from a wide cylinder to a narrow cylinder, and the height is observed to change from mark 4 in the wide cylinder to mark 6 in the narrow cylinder. [From Lawson, The Neurological Basis of Learning, Development and Discovery: Implications for Science and Mathematics Instruction. Copyright 2003 by Kluwer Academic Publishers. Reproduced by permission of Kluwer Academic Publishers (Ref. 24).]

We evaluate this model using the HD process, either directly or indirectly. Direct evaluation amounts to repeating the experiment with a change in the independent variable (the water's initial height) and comparing the result with the prediction of our model.

Direct evaluation:
IF the model is true
AND the experiment is repeated with the wide cylinder initially filled to mark 6 instead of mark 4
THEN when poured into the narrow cylinder, the water should reach mark 8
BUT when the experiment is performed, we see the water reaches mark 9
THEREFORE we must decrease confidence in our model.

An indirect evaluation of the water-pouring model entails a clever thought experiment that turns the experimental procedure around in order to arrive at a contradiction with a general theoretical principle in which we already have strong confidence.

Indirect evaluation:
IF the model is true
AND water is initially poured into the narrow cylinder up to mark 2 and then transferred from the narrow cylinder into the wide cylinder
THEN the model predicts that h_wide = 0
BUT this means that the water disappears, which clearly violates volume conservation for our fluid
THEREFORE we must decrease confidence in our model.

Here, prior data that went into developing students' established confidence in volume conservation as a valid principle for water (generally provided by earlier K-12 science courses) have been leveraged in order to test a new assertion. Since the indirect approach is able to modify confidence in the proposed model, it is epistemically equivalent to the direct evaluation of the model with new data from new experiments. Students usually express surprise and appreciation for the indirect approach, as they often have not engaged in thought experimentation or critical reflection of a quantitative model. This simple scenario provides a touchstone example of such reasoning which other activities can build upon, as discussed below.

III. BAYESIAN UPDATING

At the end of the HD process, one may update confidence in the hypothesis being tested, provided the assumptions of the test were sufficiently met. This updating is usually performed in a qualitative fashion by students in introductory physics courses, sometimes with the exaggerated and incorrect belief that the result of a test may disprove a hypothesis or (even worse) that a result may prove a hypothesis. Such a binary model of truth value hampers the ability to adapt to new information, as it severely overestimates the impact of any single test. Thus, students with particular attitudes toward knowing possess different attitudes toward learning, affecting their adaptiveness to new information or claims.27

In contemporary science, the likelihood of a hypothesis is often calculated and updated using Bayes' Theorem, which codifies the optimal manner by which new evidence should be used to update the estimated likelihood of a claim.28 One need only look at the extensive role Bayesian inference plays in gravitational wave detection29 to see its impact on physics. A typical use of Bayes' Theorem requires one to begin with an initial probability for a hypothesis H, denoted P(H), which is called the prior. Once some evidence E is gathered, the probability of the evidence P(E) and the conditional probability P(E|H) of observing evidence E given hypothesis H are used to calculate the posterior probability P(H|E) of hypothesis H given the evidence E according to

P(H|E) = P(E|H) P(H) / P(E).  (2)

This can be re-written in a more convenient form for our purposes,

P(H|E) = P(H) R / [P(H) R + 1 - P(H)],  (3)

where

R = P(E|H) / P(E|¬H)  (4)

is a Bayes factor. This is the ratio by which a particular piece of evidence is relatively more likely given hypothesis H than given its converse, ¬H. This factor encodes the inferential power of the evidence with regard to H. For example, a strong confirmatory result means that the evidence can be much better explained by the hypothesis than otherwise, and R will tend toward large values in such cases (i.e., R >> 1). A strong disconfirmatory result means that the evidence is much better explained by assuming that the hypothesis is false than otherwise, and R will tend toward zero (although never equaling zero).
A null result is one in which the evidence offers no discrimination between H and ¬H, in which case R = 1. Guidelines for the estimation of R are provided in Table II.

Table II. Guidelines for estimation of R. [Adapted with permission from Kass and Raftery, J. Am. Stat. Assoc. 90(430), 791 (1995). Copyright 1995, American Statistical Association (Ref. 30).]

R               Interpretation
< 1/150         ¬H very strongly favored
1/150 to 1/20   ¬H strongly favored
1/20 to 1/3     ¬H substantially favored
1/3 to 1        ¬H barely favored
1 to 3          H barely favored
3 to 20         H substantially favored
20 to 150       H strongly favored
> 150           H very strongly favored

In order to better help students understand the act of confidence updating along with the associated epistemic implications, it seems natural to engage them in the use of Bayes' Theorem. While the full-fledged use of Bayes' Theorem is perhaps too much to expect in an introductory physics course, it is quite possible to employ it in a loose fashion. This can be introduced at the beginning of a course with the water-pouring example.

A. Introductory example

Direct evaluation: If we revisit the direct evaluation of the water-pouring task from above, then the hypothesis H is the assertion that h_narrow = h_wide + 2. Before planning or doing the direct evaluation, each student is asked to estimate their confidence in H, thus providing the prior P(H). If the student is completely unsure, then they are told to set P(H) = 0.5, representing a 50% chance that the hypothesis is or is not valid and corresponding to the maximum epistemic uncertainty. The test is then performed, and the evidence E is the observation that the water rose to mark 9 instead of mark 8. The same student may then use Table II to select a value for R, which in this case should be a small number because the result is much more likely if our hypothesis is false than if it is true. Let us suppose the student, based on their subjective estimation of the inferential power of this result, selects R = 0.001, which is in the category of being a very strong disconfirmation of the hypothesis (< 1/150). Specifically, this choice for the R value indicates that the student believes that the obtained result is 1000 times more likely to occur if the hypothesis is false than if the hypothesis were true. Another student may select R = 0.05 if they feel that the result is only 20 times more likely if the hypothesis is false than if it is true. Note that the value of R cannot be zero, though, because one can never be absolutely sure of the measurements; for example, there is always the possibility of miscounting the marks or having made a procedural error with the materials, no matter how many times the experiment is performed.

Continuing with the example, where our student has identified an initial subjective confidence of P(H) = 0.5 and selected an updating coefficient of R = 0.001, the subjective posterior probability is then calculated as

P(H|E) = (0.5)(0.001) / [(0.5)(0.001) + 1 - 0.5] ≈ 9.99 × 10^-4,  (5)

demonstrating how the direct evaluation allows the student to drastically reduce the likelihood of this model on the basis of compelling disconfirmatory evidence. Within the HD process, this posterior probability is the precise formulation of the conclusion reached in the THEREFORE portion of the process. Although the specific values used are subjective and will vary student-by-student, as more evaluations are performed (say, by repeating the experiment with new values), each student's posterior probability will asymptotically approach zero. If the students do this for a total of three tests and then plot the results for the class, they can directly see the convergence for themselves.
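Equations (3)-(5) amount to one line of arithmetic, so the updating rule is easy to script. The following minimal sketch (ours, not part of the published activities; Python) implements Eq. (3), reproduces the worked example of Eq. (5), and then shows how three repeated, substantially disconfirming tests drive three different priors toward a common near-zero posterior:

def update_confidence(prior, bayes_factor):
    """Posterior P(H|E) from the prior P(H) and Bayes factor R, per Eq. (3)."""
    return prior * bayes_factor / (prior * bayes_factor + 1 - prior)

# The worked example of Eq. (5): P(H) = 0.5 and R = 0.001.
print(update_confidence(0.5, 0.001))  # ~9.99e-4

# Three repeated tests, each judged substantially disconfirming (R = 0.05),
# starting from three students' different priors; each posterior becomes
# the prior for the next test.
for prior in (0.2, 0.5, 0.8):
    confidence = prior
    for _ in range(3):
        confidence = update_confidence(confidence, 0.05)
    print(f"prior {prior:.1f} -> posterior {confidence:.5f}")

Because each test multiplies the odds in favor of H by R, the starting point matters less and less as tests accumulate, which is precisely the convergence students see in their class plot.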
After doing this, it is important to point out several epistemic corollaries. First, the fact that we can never reach a probability of zero is the quantitative expression of the maxim that we should never rule out any hypothesis with absolute certainty. Similarly, in cases where a hypothesis is confirmed by multiple experiments, the posterior probability will approach, but never reach, a value of 1; this expresses the maxim that one ought never to have absolute confidence in the truth of a hypothesis. Second, the fact that everyone starts with subjective prior probabilities and yet inevitably converges to a single asymptotic result illustrates the objective aspect of science. Students may tend to conflate "absolute" with "objective," believing that in order to be objectively valid, something must be absolutely true; this example is the first time when many of them may begin to realize that these are logically distinct properties.

The emphasis that the Bayesian activity places upon the selection and justification of the updating coefficient R is intended to create value for the act of experimental error analysis. Students in traditional labs spend very little time engaged in sense-making.11 Indeed, students generally do not think at all about error analysis, and if they do, they think only that it is a waste of time.10 Uncertainties and errors are viewed as being unrelated to physics proper and of little importance to the experiment or its outcome. Students tend to take a "trouble-shooting" approach in their attitude toward error analysis, intending to get the expected result instead of accepting the results and trying to make sense of them.

One intention of the Bayesian direct evaluation activities is to redefine for students what the result of an experiment really is. Instead of being a simple yes/no match to a prediction in order to confirm a hypothesis that the students already accept as true, the Bayesian approach requires students to shift their perspective and see an experiment as a way to gain new information that affects their confidence in a hypothesis which can never be fully proven or disproven. It is now acceptable for the experiment to produce a disconfirmatory result, and the goal becomes not to get the predicted result but to figure out whether their result is confirmatory, null, or disconfirmatory (and to what extent). The only way to determine that is via careful error analysis. Thus, the goal of the laboratory is no longer to prove something they know but to gather new information that has a real, calculable effect on their confidence in the hypothesis, and this creates value for the act of error analysis.

The evaluative reflection that Bayesian updating enables is able to create authentic value for the act of error analysis from the students' perspective even in a traditional lab. After all, there seems to be little point to error analysis if one already and absolutely knows the truth, thanks to the textbook or instructor. Without Bayesian updating, error analysis in an introductory lab is more likely to be reduced in the students' view to an artificial and unmotivated burden.
Another way to create value and engage students in error analysis is by framing the experiments as investigations instead of tests, as in Refs. 10 and 11. We leave open as a possibility that the inclusion of Bayesian updating may benefit both labs that run traditional testing experiments and those that are set up as more open-ended investigative experiments.

Indirect evaluation: Bayes' Theorem also clarifies the importance of indirect evaluation as a scientific reasoning strategy. For the indirect evaluation of the water-pouring task, H is still the assertion that h_narrow = h_wide + 2, but E now represents the evidence that this model is inconsistent with volume conservation (as found via the special-case analysis with h_narrow treated as the independent variable and set to an initial value of 2 marks). Students start with their personal priors P(H) and make estimates of R, which will again be a small number due to the fact that the thought experiment appears to violate volume conservation. Note that R cannot be zero, though, because there may always be unknown factors which make volume conservation an invalid principle in this setting; this is extremely unlikely, and would be shocking if true, but is impossible to rule out. Again, students can use their estimates to calculate posterior probabilities, and doing a few similar thought experiments (such as starting with h_narrow = 1) allows them to see convergence toward zero likelihood of the model. The use of indirect evaluation is thus demonstrated to be epistemically equivalent to the use of direct evaluation. This illustrates the evaluative role of scientific principles, which are able to serve as powerful tools to modify confidence in proposed hypotheses and models. The precise values of the posteriors that each student gets from direct evaluation of H will generally be different from the posteriors arrived at via indirect evaluation (due to differing estimates of R), but this is a difference of degree and not of kind.

Finally, the broader importance of why students should value this process is something that deserves continual emphasis during the course. The ability to reason precisely, even under uncertain conditions, is one of the most daunting challenges for students, especially for those who initially tend toward absolutist epistemological beliefs. Bayes' Theorem is powerful in this regard because it shows that uncertainty in a hypothesis is not something that can or should be avoided; rather, it is unavoidable, and evaluative tests provide the only valid mechanism by which learning can be achieved and uncertainty diminished. By continually reflecting on their epistemic uncertainties and seeing how those uncertainties change in response to the HD process via Bayes' Theorem, students can gain a certain comfort level in dealing with uncertainty and may begin to treat it from a more productive point of view.

IV. CURRICULAR MATERIAL DESIGNS

To engage students in the use of both the HD process and Bayes' Theorem, a variety of activities have been developed. These activities allow students to exercise the use of each of these reasoning tools as part of direct and indirect evaluations.

A. Direct evaluation activities

Direct evaluation is primarily exercised by students during in-class activities that involve the analysis of pre-recorded data and in laboratory experiments that require students to design and conduct experiments to test specific physical models.
An example in-class activity is shown in Fig. 2, illustrating a typical format for engaging students in the construction and evaluation of empirical hypotheses regarding relationships between measured quantities. This activity is done in lecture at the beginning of a module focusing on Newtonian mechanics. Here, evaluation is not simply a summative test performed to verify a model, but a formative tool that plays a central role in scientific reasoning, even when one is just beginning to explore a particular class of phenomena. For the example shown in Fig. 2, typical proposed relationships include linear, quadratic, and sinusoidal functions of the angle. Students work in groups, with laptops, to graph, fit, and test proposed functions and to update their confidence in each proposal.

Activities such as this may be utilized in a few different ways. When time is available or the number of competing hypotheses is suitably small, students can complete the entire activity in small groups before sharing and discussing results with the class. An alternative approach, which is sometimes more appropriate, is to reconvene as a class after part (a) is complete, compile a single class list of hypotheses under consideration along with initial estimates of the likelihood of each, assign particular hypotheses to each group for testing in part (b), and then reconvene the class again afterward to share results and identify an optimal hypothesis.

While some direct evaluation activities have been developed for use in lecture or recitation, such as above, the majority have been made for use in the laboratory. Laboratory experiments in introductory classes often intend to test a particular physical model. As shown in Fig. 3, the laboratory report can be used as a mechanism to compel students to reflect on the HD process being instantiated by the particular experiment at hand and to use Bayes' Theorem to modify their confidence in the tested model. The summary in part (h) of the report generally follows a consistent pattern:

IF the model is true
AND the proposed experiment is conducted as described
THEN the actual results should confirm the prediction deduced from the model
AND/BUT our results, after accounting for both random and systematic errors, confirm/fail to test/disconfirm the model
THEREFORE we are more/equally/less confident in the model, as estimated with Bayes' Theorem.

In addition, the lab report highlights the importance of experimental error analysis. Students not only perform error analysis but also reflect on how making an inference about the validity of the model strongly depends upon sufficient minimization of the errors. If the errors are too significant, the R factor in Bayes' Theorem will be essentially equal to 1 and the posterior probability of the model will nearly equal the prior; thus, very little information about the validity of the model can be learned from such an experiment. Moreover, the directionality of the systematic errors is especially important to consider when estimating R. For example, if one tests Newtonian mechanics by predicting the average acceleration of a cart rolling down an incline under the assumed condition of no rolling friction, then the systematic error induced by the presence of rolling friction will cause the actual acceleration to be less than predicted. If, after doing the experiment, the actual acceleration is significantly less than predicted, then this error will cause R to be closer to 1, since the error increases the likelihood that the evidence could have been obtained even though the hypothesis (of Newton's laws) is true. However, if the actual acceleration is greater than predicted, then this systematic error will make R closer to 0, since the error decreases the likelihood that the evidence could have been obtained while the hypothesis was valid. Although one does not need Bayes' Theorem to realize this, its use provides a structure to draw out such reasoning and explicitly demonstrates via the R factor why and how error analysis is central to the direct evaluation of a model.
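In the published activities R remains a subjective estimate guided by Table II, but an explicit error model can show students where such a number might come from. The sketch below is our illustration, not part of the paper's materials: it assumes Gaussian scatter of width sigma about the prediction under H, a flat density over a plausible range of outcomes under ¬H, and hypothetical cart-on-incline numbers.

import math

def gaussian_pdf(x, mu, sigma):
    """Normal probability density; the likelihood of a measurement under H."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

def estimate_R(measured, predicted, sigma, plausible_range):
    """Rough Bayes factor R = P(E|H) / P(E|not-H): Gaussian scatter about the
    prediction under H, flat density over plausible_range under not-H."""
    return gaussian_pdf(measured, predicted, sigma) * plausible_range

# Hypothetical numbers: predicted a = 1.40 m/s^2 (no rolling friction assumed),
# measured a = 1.25 m/s^2 with 0.10 m/s^2 uncertainty, and any value within a
# 3 m/s^2 window considered plausible if the model were false.
print(estimate_R(1.25, 1.40, 0.10, 3.0))  # ~3.9: "H substantially favored" in Table II

The directionality point made above then amounts to shifting or widening the Gaussian once a known systematic error, such as rolling friction, is taken into account.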
Fig. 2. An example in-class activity for direct evaluation of empirical hypotheses. Note the notation change in the presentation of Bayes' Theorem: the prior is renamed the initial confidence, P(H) = C_i, and the posterior is the final confidence, P(H|E) = C_f. This terminology and notation are used consistently on all Bayesian activities given to students.

Fig. 3. A template for laboratory experiment reports that highlights the use of hypothetico-deductive reasoning and the use of Bayes' Theorem.

B. Indirect evaluation activities

Students can practice indirect evaluation during lecture, recitation, and homework activities that require them to test either their proposed problem solutions or a proposed physical model, without collecting any new data. Valid indirect evaluation techniques such as dimensional, special-case, limit-case, and trend-case analyses are foreign to most students. Demonstration of these techniques during lecture and practice of their use during recitation and homework, along with specific and steady feedback, are therefore required for students to gradually adopt these techniques. Such feedback includes written comments on student work pointing out when a student's selected special case was not productive, perhaps because the selected change does not yield an intuitively or conceptually obvious impact on the answer. Other times, the feedback may point out mistakes in conceptual or quantitative reasoning a student made during the evaluative analysis. It may take a minimum of 6–8 weeks before the majority of a class begins to demonstrate proficiency with indirect evaluation strategies.8,9

Figure 4 shows a general template for activities that engage students in self-evaluation of their proposed solutions to standard end-of-chapter problems. This approach first requires the students to reflect on how their solution is derived as an instantiation of a general physical model, such as the ideal gas law or rigid body motion. Then, the activity points out that a solution can be evaluated by testing for consistency with other pieces of information that are already held in high confidence. The students do this via the HD process, using techniques such as special-case analysis,8 and then update their confidence in their solution via Bayes' Theorem, as illustrated in the sketch below. In this way, students can reduce the epistemic uncertainty in their work in a scientifically authentic fashion, without relying upon external authorities. Near the end of the semester, some of this scaffolding, including reminders about the HD process and prompts to update confidence estimates with Bayes' Theorem, can be removed as students internalize the process.

Fig. 4. After attempting to solve a standard end-of-chapter problem, this activity compels students to indirectly evaluate their solution using techniques such as special-case and limit-case analysis and to update their confidence in their solution, thereby modifying the epistemic uncertainty in their own work.
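As a concrete instance of the updating that the Fig. 4 template calls for, consider a student who has derived v = sqrt(2gh) for the speed of an object dropped from height h. This is our hypothetical example, not one of the published activities, and it reuses update_confidence from the sketch in Sec. III:

# Special-case analysis of a proposed solution v = sqrt(2*g*h):
# at h = 0 nothing has fallen, so v must be 0, and the solution agrees.
confidence = 0.7   # the student's prior confidence in the derivation
R = 3.0            # per Table II, agreement here only "substantially" favors H,
                   # since many incorrect formulas also vanish at h = 0
confidence = update_confidence(confidence, R)
print(confidence)  # 0.875: confidence rises, but remains well short of 1

A special case that disconfirmed the solution would instead warrant a small R and cut the confidence sharply, exactly as in direct evaluation.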
V. IMPACT ON STUDENT ATTITUDES

A. Methods

To assess the impact of the Bayesian evaluation activities on student learning, the Epistemological Beliefs Assessment for Physical Science (EBAPS)21,31 was administered in a pre/post-test format to three one-semester algebra-based introductory physics courses. This instrument includes 30 items that ask students to respond to statements on a 5-point Likert scale ranging from strongly disagree to strongly agree. Scoring of the EBAPS features five subscales that measure specific aspects of student epistemological sophistication. The subscales are Axis 1, Structure of Scientific Knowledge; Axis 2, Nature of Knowing and Learning; Axis 3, Real-life Applicability; Axis 4, Evolving Knowledge; and Axis 5, Source of Ability to Learn.

In this study, we required an instrument capable of performing measurements that are tightly aligned to the particular attitudes that the Bayesian activities are intended to impact. None of the eight categories of highly correlated items from the Colorado Learning Attitudes about Science Survey (CLASS)12 were deemed appropriate measures. The Sense-Making/Effort category comes closest, but it conflates the nature of learning, the structure of knowledge, and other expert-identified attitudinal dimensions that the EBAPS keeps separate. This is perhaps not surprising (and even expected) since the CLASS categories are based on student responses in order to achieve high reliability, and students are not likely to attend to the distinct conceptual dimensions an expert recognizes. In particular, the EBAPS scores for Axes 2 and 4 focus more specifically on attitudes that the Bayesian activities are designed to affect. Thus, although not as widely used or reliable as the CLASS, the EBAPS allows the most appropriate measurement of the specific attitudes we wish to assess.

Axis 2, the Nature of Knowing and Learning (Nat. Learn.) subscale, assesses whether students believe that learning science consists of information retention or whether it is inherently constructive, progressing by leveraging prior experiences to interpret and test new information and requiring continual reflection upon one's own understanding. Given that the Bayesian activities involve students reflecting upon and altering their confidence based on tests employing both their prior knowledge in thought experiments and their empirical results in laboratory, we expect that such activities could improve student gains on this subscale. In particular, questions 1, 11, 12, 13, and 30 each deal with students' attitudes toward constructivist activities that assess one's confidence and change subjective confidence by using prior knowledge to independently test an idea or problem solution.

Axis 4, the Evolving Knowledge (Evo. Know.) subscale, focuses on how well students are able to avoid absolutism and extreme relativism. The HD process and Bayes' Theorem necessarily refute absolutism, as models are tested and confidence levels modified.
Bayesian activities also give students grounds to reject extreme relativism by demonstrating that, regardless of one's personal prior confidence, the results of direct and indirect evaluations can cause everyone's confidence in a proposition to converge toward 0 or 1. This subscale relies solely on questions 6, 28, and 29, each of which deals with attitudes toward subjective probability in science and its ability to be affected by new evidence.

The remaining subscales (Axes 1, 3, and 5) are not predicted to show any effect due to the Bayesian activities. For instance, although various hypothetico-deductive activities, such as special-case analysis, do intend to help students construct a hierarchical organization of physical knowledge, the addition of Bayesian updating to those activities does not draw further attention to such a structure. Instead, the Bayesian updating only propagates the implication of that structure in order to revise confidence in a hypothesis or model. Likewise, Bayesian updating has no direct bearing on attitudes toward problem-solving sophistication, notions of self-efficacy, or the relationship between abstract physical models or representations and the real world.

The three courses were each taught by the same instructor, for lectures and labs, and used the textbook by Etkina et al.,32 which is based on the ISLE approach. However, only two of the courses (labeled E1 and E2) included Bayesian activities; the third course (labeled C) did not include any Bayesian activities. Within each experimental-condition course, E1 and E2, there were three direct evaluation activities conducted during lecture, eight laboratory experiments and reports using Bayesian updating, and 14 indirect evaluation activities incorporated within the weekly recitation assignments. The survey administration in all three courses was similar, being completed during class and without incentivization. Student responses were checked to ensure that they indicated serious effort, for example, by not giving the same response on the final 10 items of the survey.

B. Results

The results of the pre/post-tests, after listwise deletion of students who did not fully complete either the pre- or post-surveys, are shown in Table III, including mean scores on all five subscales. The percentages of students who provided complete responses to both the pre- and post-surveys were 78% (E1: 21 of 27 students), 80% (E2: 35 of 44 students), and 82% (C: 18 of 22 students). The overall score and each subscale score range from 0 to 100. Bayesian estimation (BEST)33 is used to determine the mean and 95% highest-density interval (HDI) of the distribution of the difference of means. Bayesian estimation has the advantage of generating an explicit distribution of credible values and is more informative than traditional distribution characterization via the mean and standard error. This study focuses on Axis 2 (Nat. Learn.) and Axis 4 (Evo. Know.), since those are the only subscales that the Bayesian activities are intended to affect.

The E1 and E2 classes appear to have differences in their performances. Similar variability has been seen in CLASS pre-/post-results obtained from prior sections of this course during routine course assessment, with gains in overall CLASS scores fluctuating by roughly 5%–10% from one semester to the next despite nearly identical instructional materials and style.
Table III. Pre-/post-comparisons for C (N = 18), E1 (N = 21), and E2 (N = 35). Classes E1 and E2 included Bayesian activities, while C did not. Mean gains, with the 95% HDI for the distribution of mean gains in parentheses, are provided by Bayesian estimation (Ref. 33). Bayesian activities are predicted to affect attitudinal shifts in Axis 2 (Nat. Learn.) and Axis 4 (Evo. Know.).

                Overall           Struct. know.      Nat. learn.       Real-life app.     Evo. know.          Src. learn.
C   Pre         67.3 ± 2.1        58.8 ± 2.4         67.9 ± 3.2        73.1 ± 4.3         70.4 ± 5.1          75.6 ± 3.6
    Post        70.3 ± 1.4        62.4 ± 2.0         67.5 ± 2.4        79.5 ± 2.6         73.6 ± 3.6          79.7 ± 3.1
    Mean gain   2.6 (-2.9, 7.8)   3.4 (-3.5, 10.3)   -0.4 (-9.2, 8.2)  5.6 (-5.4, 16.0)   3.1 (-10.7, 16.8)   4.6 (-5.8, 14.8)
E1  Pre         69.0 ± 1.6        59.2 ± 2.7         63.2 ± 2.9        77.7 ± 3.2         71.0 ± 3.2          83.3 ± 2.6
    Post        73.4 ± 1.6        65.7 ± 2.6         72.3 ± 2.4        81.8 ± 2.9         75.4 ± 3.6          84.0 ± 2.4
    Mean gain   4.4 (-0.5, 9.4)   6.3 (-1.8, 14.0)   9.5 (1.2, 17.5)   4.3 (-4.8, 13.6)   4.1 (-6.3, 14.1)    0.7 (-7.0, 8.1)
E2  Pre         61.9 ± 2.0        56.6 ± 2.5         60.0 ± 2.1        68.0 ± 3.6         57.1 ± 4.0          71.4 ± 3.4
    Post        66.0 ± 1.6        57.6 ± 2.1         63.7 ± 2.0        76.0 ± 3.0         66.0 ± 3.4          77.0 ± 2.2
    Mean gain   4.0 (-1.2, 9.5)   0.8 (-6.2, 7.5)    3.9 (-1.9, 9.9)   7.9 (-1.8, 17.7)   9.0 (-1.7, 20.1)    5.4 (-3.4, 13.7)

Given the relatively high reliability and validity of the CLASS, it seems most likely that statistical noise is the cause of the EBAPS differences between E1 and E2, and we therefore combine the two sections in order to better smooth fluctuations and test the predicted impact of the Bayesian activities. The results for the mean gains of the Control class versus the gains from the two experimental classes (combined) are shown in Fig. 5 and summarized in Table IV. Although this study is only concerned with Axis 2 (Nat. Learn.) and Axis 4 (Evo. Know.), for the sake of completeness, we include the results for each subscale as well as the overall EBAPS score. The data were then multiply imputed34,35 to produce estimated values for the missing pre-/post-test results from individual students. Globally, the missing data rate was 10.2% (19 of 186 possible). A total of 50 imputations were generated, pooled, and assessed for each missing value. The method used was multiple imputation by chained equations,36,37 and the variables used in the imputation model included the set of individual gains on all five EBAPS subscales as well as the overall score.

To test whether the Bayesian activities produced a significant effect on Axis 2 and Axis 4, as predicted, we will analyze the data using two separate approaches. First, we will use classical null-hypothesis significance testing in order to make inferences about effects observed on Axis 2 and Axis 4. Our second analysis will use Bayesian parameter estimation to calculate likelihoods and effect sizes for Axis 2 and Axis 4. Since this is an exploratory study to gather tentative information about the impacts of these new activities, rather than a test of a firm hypothesis, we will proceed in an investigative fashion. Our goal is not a strict binary judgment of impact or no impact because even if there were significant evidence for one or the other, there would still be significant threats to external validity posed by the sample composition and size, by the unknown reliability of the EBAPS subscales, and by the fact that only one method of instruction from a single instructor was employed.
Instead, our analytical goal is to generate information regarding the observed impact of these activities which can inform any future implementation and study of their attitudinal impact. As will be seen, both the classical and Bayesian approaches paint a similar picture of the impact produced by the activities.

First, taking the classical approach, we perform a multivariate analysis of variance with one independent variable (class condition) and two dependent variables (the subscale gains on Axis 2 (Nat. Learn.) and Axis 4 (Evo. Know.)).

Fig. 5. Gains for mean student scores in the Control and Experimental (E1 and E2 combined) classes, with bars indicating one standard error of the mean.

Table IV. Summary of mean gains for the control and experimental (E1 and E2 combined) classes. Cohen's d is provided as an estimate of the effect size. Although this study is only concerned with Axis 2 and Axis 4, the overall score and all subscales are also included for completeness.

                Overall     Axis 1      Axis 2      Axis 3      Axis 4      Axis 5
Mean gain (C)   2.9 ± 1.5   3.6 ± 2.4   0.3 ± 2.6   6.4 ± 3.9   3.2 ± 5.4   4.2 ± 3.2
Mean gain (E)   4.2 ± 0.8   3.1 ± 1.6   5.7 ± 1.6   6.5 ± 2.1   7.1 ± 3.1   3.8 ± 1.9
Cohen's d       0.184       0.040       0.453       0.005       0.146       0.026

The MANOVA indicates a marginally significant effect (Pillai = 0.052, F(2, 90) = 2.474, p = 0.090) with a moderate effect size (partial η² = 0.063, 95% HDI = (0.009, 0.170)). The observed power is 0.54, which means that a repetition of this study would be nearly as likely to produce a false negative as to identify a true positive. Thus, if these results are to be tested further, then future work must involve larger samples and perhaps utilize a higher-precision instrument (yet to be developed) that is more capable of detecting the attitudinal shifts that the Bayesian activities are intended to produce.

Continuing our exploration of the results, we tentatively accept the marginal significance of the MANOVA in order to investigate how each individual axis may have been affected by the class condition. Strictly speaking, no firm conclusions may be drawn from this work; however, it is informative to see whether one axis appears to have responded differently from another. Indeed, the class condition shows a likely and moderately strong effect on Axis 2 (F = 4.793, p = 0.031, partial η² = 0.050 with 95% HDI = (0.002, 0.158)), but a much weaker and less likely effect on Axis 4 (F = 0.499, p = 0.482, partial η² = 0.005 with 95% HDI = (0.000, 0.058)).

We now undertake an independent Bayesian analysis of the data. The likelihoods that the gains in the Experiment condition exceeded the gains in the Control condition on each subscale, and the corresponding effect sizes, were estimated using the BEST software package. The results are shown in Table V. Although the results are listed for all subscales, we are only concerned with Axis 2 and Axis 4. In Bayesian parameter estimation, a prior estimate of credible parameters is updated and credibility thereby reallocated across the space of possible parameter values. Unlike classical inferencing, whereby multiple tests require p-value corrections in order to moderate family-wise Type I error rates, multiple comparisons made in a Bayesian estimation do not usually require any correction or modification (the exception being situations when a multi-level model would be appropriate).38

Table V. The difference of mean gains between conditions, the estimated likelihood that gains in the Experiment condition exceeded Control group gains, and the estimated effect sizes, all provided by Bayesian parameter estimation (Ref. 33); 95% HDIs are given in parentheses. Although this study is only concerned with Axis 2 and Axis 4, all subscales are included for completeness.

                                      Axis 1               Axis 2              Axis 3               Axis 4               Axis 5
Difference of gains (BEST)            -0.6 (-6.6, 5.7)     6.4 (0.0, 12.8)     0.2 (-9.5, 9.8)      3.6 (-9.4, 16.7)     -0.6 (-8.5, 7.4)
Likelihood Egain > Cgain (BEST) (%)   43.4                 97.4                51.6                 70.6                 44.4
Effect size (BEST)                    -0.04 (-0.47, 0.40)  0.51 (0.02, 1.00)   0.01 (-0.46, 0.46)   0.14 (-0.34, 0.61)   -0.05 (-0.58, 0.51)
Here, to the extent that we believe Axis 2 and Axis 4 measure distinct attitudinal dimensions (as intended in the construction of the subscales), a multi-level model is inappropriate.

Examining the estimates yielded by the BEST estimation procedure, we find that there is a 97.4% likelihood that the Experiment condition outperformed the Control condition on Axis 2, while there is a 70.6% likelihood for Axis 4. Effect sizes are estimated at roughly 0.51 for Axis 2 and 0.14 for Axis 4. These results support the same conclusions as our classical analysis, without any need for worry regarding tentative acceptance of marginal significances or other such threats to internal validity. Namely, we again conclude that the activities appear to have had a moderate and highly likely impact on Axis 2, alongside a weak and less likely impact on Axis 4. The external validity of these conclusions still suffers a number of threats posed by the limited sample sizes, the specific nature of the instruction in each course, and the unknown reliability of the EBAPS subscale scores. These are factors that only further testing can address. Despite these important caveats, when taken as a whole, the results are suggestive of an educational impact that merits further study.
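To give a feel for the quantities reported in Table V: BEST (Ref. 33) fits a Bayesian model with t-distributed likelihoods by MCMC, which we do not reproduce here. As a crude, avowedly non-Bayesian stand-in, bootstrapping the difference of group means conveys what a "likelihood that Experiment gains exceed Control gains" means operationally. The per-student gains below are synthetic: their means follow Table IV's Axis 2 row and their spreads are chosen so the standard errors roughly match; they are not the study data.

import numpy as np

rng = np.random.default_rng(0)

# Synthetic Axis 2 gains (illustrative only): means from Table IV, spreads
# chosen so that sd/sqrt(N) roughly matches the reported standard errors.
gains_C = rng.normal(0.3, 11.0, size=18)   # Control, N = 18
gains_E = rng.normal(5.7, 12.0, size=56)   # E1 + E2 combined, N = 21 + 35

# Bootstrap the difference of mean gains, Experiment minus Control.
diffs = np.array([
    rng.choice(gains_E, gains_E.size).mean() - rng.choice(gains_C, gains_C.size).mean()
    for _ in range(10_000)
])

print("fraction of resamples with E gain > C gain:", (diffs > 0).mean())
print("central 95% interval:", np.percentile(diffs, [2.5, 97.5]))

On the real per-student gains, the analogous Bayesian computation is what produced the 97.4% likelihood and the (0.0, 12.8) HDI reported for Axis 2 in Table V.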
An immediate follow-up question that emerges from our analysis is why the activities might positively impact attitudes on Axis 2 (Nat. Learn.) to a much greater degree than on Axis 4 (Evo. Know.). This result is surprising considering the emphasis the Bayesian activities place upon confidence updating. One hypothesis we favor is that the use of subjective probabilities in the Bayesian activities places obvious and repeated emphasis on the relativistic nature of science. Meanwhile, the emergence of intersubjective agreement, which is gradually realized via multiple tests and updatings of a single hypothesis (and which tends to universally drive subjective probabilities toward 0 or 1), was experienced by students in only one Bayesian activity early in the course. Therefore, students may be led by this skewed experience toward an overly relativist perception of scientific knowledge.

To test this hypothesis, student responses to the three questions that factor into the Axis 4 subscale score, shown in Fig. 6, may be analyzed for changes in student attitudes toward relativist or absolutist positions. For example, for question #29, we classify answer C as a neutral response, while answers A and B show relativist attitudes and answers D and E express absolutist attitudes. For questions #6 and #28, we classify responses A and B as absolutist and D and E as relativist, with C neutral. As shown in Fig. 7, aggregating the pre- and post-responses from each condition and examining the change in response rates for these three categories, we find that students in the Control condition tended to adopt more absolutist attitudes from pre- to post-tests, while students in the Experimental group showed a shift toward relativist attitudes. This pattern held for responses to all three items, as well as for the aggregate shown here.

Fig. 6. The items from the EBAPS which are used to score Axis 4 (Evo. Know.). The responses scored as most expert-like for this subscale are #6 A, #28 E, and #29 C. Here, we instead sort student responses into three categories (relativist, neutral, and absolutist) as described in the text.

Fig. 7. Shown, for each condition, are changes in the cumulative percentage of responses to the questions used for the Axis 4 subscale (questions #6, 28, and 29), which were classified as relativist, neutral, or absolutist.

Given that the Axis 4 subscale intends to assess the extent to which a student is able to maintain a suitably sophisticated understanding of scientific knowledge, as rooted in intersubjective agreements which are constrained by nature, attitudes that are "too absolutist" or "too relativist" are accordingly scored lower. Thus, it may be that the curricular selection of the specific Bayesian activities used produced attitudes which were scored only slightly higher than those in the Control condition. A more frequent engagement in Bayesian activities that develop intersubjective convergence via multiple Bayesian updatings (performed either serially by a single student/group or in parallel by pooling results from a number of independent evaluations) may better foster expert-like sophistication and enable a greater positive shift in student attitudes on Axis 4.

An alternative hypothesis, which is not mutually exclusive with the above hypothesis, is that these results are due to an incomplete growth of students' epistemic attitudes. The Reflective Judgment Model39,40 views people as generally progressing through several levels of growth (Prereflective, Quasi-Reflective, and Reflective) which, roughly speaking, are typified by absolutist, relativist, and expert-like attitudes, respectively. Thus, relativist attitudes are an intermediate stage between absolutist and expert-like attitudes. From this perspective, we would suggest that the Bayesian activities may enable a progression in students' level of reflective judgment. If true, the results obtained would indicate that the students' progression was incomplete but was nonetheless substantially enhanced in the Experimental condition, whereas the Control condition appeared to regress toward a Prereflective level.

C. Discussion

The results suggest that the Bayesian activities may generate a meaningful and educationally significant effect on specific student attitudes regarding the nature of learning, although this suggestion is not conclusive. Studies with larger sample sizes and a variety of instructional methodologies are required to more precisely test and estimate the effects of the Bayesian activities. Additionally, the development of an instrument designed specifically to assess student attitudes that pertain to uncertainty, confidence, and changes in confidence due to new information would allow results to carry stronger inferential power. Given the distinctive nature of the results obtained in this work, namely the ability to positively shift student epistemic attitudes in a one-semester introductory physics course, further exploration is warranted.

It is worth noting that any effects of the Bayesian activities came at a relatively small cost in terms of class time, as the activities did not require a burdensome curricular investment.
The cumulative amount of time (in lecture, recitation, and lab) that students in the E1 and E2 classes spent on Bayesian activities was roughly 4–5 hours out of a total of 72 hours of instructional time over the semester. It is possible that a greater incorporation of Bayesian activities would produce greater epistemological gains. Although these courses were based on the ISLE curriculum, the Bayesian activities themselves could easily be adapted to any curriculum, including traditional designs. Of course, the impact of the Bayesian activities is almost certainly non-trivially related to other pedagogical features of the course, and so their effectiveness is sure to vary depending on the context of use. The ISLE approach is centered upon the notion of model development, testing, and revision/rejection and therefore lends itself very naturally to the inclusion of Bayesian activities. Approaches similar to ISLE, such as Modeling Instruction or Physics by Inquiry, could similarly prove to be amenable to the addition of Bayesian activities. Traditional pedagogical approaches may make it more difficult for engagement in Bayesian activities to carry sufficient meaning or significance to the students and may therefore mitigate the effectiveness of these activities. On the other hand, if exam questions or other assessments are given that include the exercise of Bayesian activities, perhaps the explicit value placed on these activities would be sufficient to produce both attitudinal and problem-solving gains.

VI. CONCLUSION

Several activity types, featuring both direct and indirect evaluation of hypotheses, have been proposed in order to engage students in Bayesian updating. The results from a quasi-experimental study indicate that these Bayesian activities benefit some aspects of student epistemology, particularly those regarding the nature of knowing and learning. The ability of a set of activities to produce a positive shift in student attitudes makes them a distinctly useful curricular tool, particularly given the well-documented difficulty many courses have in producing such shifts. Moreover, the prospective portability of these activities across a variety of course designs, at minimal cost of class time, means that they may be broadly helpful in many courses without needing extensive instructor preparation or course alterations. Considering that other positive attitudinal shifts have generally required extensive investments of preparation and class time (e.g., Refs. 14, 15, 18, and 21), this seems rather remarkable. Beyond that, these activities appear to have enhanced a curriculum (ISLE) that already exhibited positive attitudinal shifts. We caution that further testing is required, given the limitations imposed by the sample size in this study, and also that qualitative analysis via interviews would help to deepen an understanding of how and to what degree these Bayesian activities influence student attitudes.

The specific selection of activities used in the study also appears to have caused students to shift their attitude regarding the evolution of knowledge toward an overly relativist perspective. This suggests that there is room for improvement in the design and selection of Bayesian activities in order to generate further gains. The optimal selection of Bayesian activities is almost certain to depend to some degree upon the particular course design that they are incorporated into.
Moreover, their use and impact in calculus-based introductory classes may differ from what was seen here in algebra-based courses; this is something we are currently studying. One intent of the Bayesian activities, particularly the direct evaluation activities, was to motivate students to attend to error analysis and to engage in greater sense-making in general. This has not yet been tested and stands out as something that ought to be studied, particularly if, as we hypothesized, the Bayesian activities are able to increase sense-making among students even in more traditional lab designs.

Finally, there are many open questions, including the interaction of these activities with demographic variables in producing attitudinal gains. Considering the many ways in which gender affects student attitudes in physics courses (e.g., Refs. 41 and 42), one may anticipate a similar difference in the impact of Bayesian activities. Another important question concerns the robustness and transferability of students’ attitudes beyond the physics classroom and surveys, similar to studies of the transferability of scientific reasoning abilities (e.g., Ref. 43). Since Bayes’ Theorem is presented and used across a wide variety of physical contexts in these activities, does that make it possible for students to recognize and use it more generally? In particular, one may wonder whether the instantiations of Bayesian updating that these activities provide are sufficiently generic to facilitate transfer across topical domains (Ref. 44), which is indeed a broader goal of these activities and would be worthy of continued pursuit.

ACKNOWLEDGMENTS

The author received financial support for this study via Instructional Improvement Program grants at his institution and is thankful to three anonymous reviewers whose comments greatly improved the quality of this article.

a) Electronic mail: arwarren@purdue.edu

1. R. W. Bybee and B. Fuchs, “Preparing the 21st century workforce: A new reform in science and technology education,” J. Res. Sci. Teach. 43(4), 349–352 (2006).
2. R. Gott, S. Duggan, and P. Johnson, “What do practicing applied scientists do and what are the implications for science education?,” Res. Sci. Technol. Educ. 17, 97–107 (1999).
3. E. Lottero-Perdue and N. W. Brickhouse, “Learning on the job: The acquisition of scientific competence,” Sci. Educ. 86, 756–782 (2002).
4. S. Duggan and R. Gott, “What sort of science education do we really need?,” Int. J. Sci. Educ. 24, 661–679 (2002).
5. National Academy of Engineering, Educating the Engineer of 2020: Adapting Engineering Education to the New Century (The National Academies Press, Washington, DC, 2005).
6. E. F. Redish, J. M. Saul, and R. N. Steinberg, “Student expectations in introductory physics,” Am. J. Phys. 66, 212–224 (1998).
7. M. Sahin, “Effects of problem-based learning on university students’ epistemological beliefs about physics and physics learning and conceptual understanding of Newtonian mechanics,” J. Sci. Educ. Technol. 19, 266–275 (2010).
8. A. R. Warren, “Impact of teaching students to use evaluation strategies,” Phys. Rev. Spec. Top.-Phys. Educ. Res. 6, 020103 (2010).
9. A. R. Warren, Ph.D. dissertation, Rutgers University, 2006.
10. R. Lippmann, Ph.D. dissertation, University of Maryland, 2003.
11. A. Karelina and E. Etkina, “Acting like a physicist: Student approach study to experiment design,” Phys. Rev. Spec. Top.-Phys. Educ. Res. 3, 020106 (2007).
12. W. K. Adams, K. K. Perkins, N. S. Podolefsky, M. Dubson, N. D. Finkelstein, and C. E. Wieman, “New instrument for measuring student beliefs about physics and learning physics: The Colorado Learning Attitudes about Science Survey,” Phys. Rev. Spec. Top.-Phys. Educ. Res. 2, 010101 (2006).
13. A. Madsen, S. B. McKagan, and E. C. Sayre, “How physics instruction impacts students’ beliefs about learning physics,” Phys. Rev. Spec. Top.-Phys. Educ. Res. 11, 010115 (2015).
14. B. A. Lindsey, L. Hsu, H. Sadaghiani, J. W. Taylor, and K. Cummings, “Positive attitudinal shifts with the Physics by Inquiry curriculum across multiple implementations,” Phys. Rev. Spec. Top.-Phys. Educ. Res. 8, 010102 (2012).
15. V. Otero and K. Gray, “Attitudinal gains across multiple universities using the Physics and Everyday Thinking curriculum,” Phys. Rev. Spec. Top.-Phys. Educ. Res. 4, 020104 (2008).
16. D. Hestenes, “Toward a modeling theory of physics instruction,” Am. J. Phys. 55, 440–454 (1987).
17. D. Hestenes, C. Megowan-Romanowicz, S. Osborn Popp, J. Jackson, and R. Culbertson, “A graduate program for high school physics and physical science teachers,” Am. J. Phys. 79(9), 971–979 (2011).
18. E. Brewe, L. Kramer, and G. O’Brien, “Modeling instruction: Positive attitudinal shifts in introductory physics measured with CLASS,” Phys. Rev. Spec. Top.-Phys. Educ. Res. 5, 013102 (2009).
19. E. Etkina and A. Van Heuvelen, “Investigative Science Learning Environment,” in Forum on Education of the American Physical Society, Spring issue (2004), pp. 12–14.
20. E. Etkina, A. Van Heuvelen, S. White-Brahmia, D. T. Brookes, M. Gentile, S. Murthy, D. Rosengrant, and A. Warren, “Scientific abilities and their assessment,” Phys. Rev. Spec. Top.-Phys. Educ. Res. 2, 020103 (2006).
21. A. Elby, “Helping physics students learn how to learn,” Am. J. Phys., Phys. Educ. Suppl. 69(7), S54–S64 (2001).
22. E. F. Redish and D. Hammer, “Reinventing college physics for biologists: Explicating an epistemological curriculum,” Am. J. Phys. 77(7), 629–642 (2009).
23. A. E. Lawson, “The generality of hypothetico-deductive reasoning,” Am. Biol. Teach. 62(7), 482–495 (2000).
24. A. E. Lawson, The Neurological Basis of Learning, Development and Discovery: Implications for Science and Mathematics Instruction (Kluwer Academic Publishers, New York, 2003).
25. E. Etkina, A. Warren, and M. Gentile, “The role of models in physics instruction,” Phys. Teach. 44(1), 34–39 (2006).
26. R. Dawid, String Theory and the Scientific Method (Cambridge U.P., UK, 2013).
27. J. Brownlee, S. Walker, S. Lennox, B. Exley, and S. Pearce, “The first year university experience: Using personal epistemology to understand effective learning and teaching in higher education,” High. Educ. 58(5), 599–618 (2009).
28. D. Sivia and J. Skilling, Data Analysis: A Bayesian Tutorial, 2nd ed. (Oxford U.P., Oxford, 2006).
29. B. P. Abbott et al., “Observation of gravitational waves from a binary black hole merger,” Phys. Rev. Lett. 116, 061102 (2016).
30. R. E. Kass and A. E. Raftery, “Bayes factors,” J. Am. Stat. Assoc. 90(430), 773–795 (1995).
31. A. Elby, J. Fredriksen, C. Schwartz, and B. White, “Epistemological Beliefs Assessment for Physical Science,” <http://www2.physics.umd.edu/~elby/EBAPS/home.htm>.
32. E. Etkina, M. Gentile, and A. Van Heuvelen, College Physics, 1st ed. (Pearson, Boston, 2014).
33. J. K. Kruschke, “Bayesian estimation supersedes the t test,” J. Exp. Psychol. 142(2), 573–603 (2013).
34. R. J. Little and D. B. Rubin, Statistical Analysis with Missing Data, 2nd ed. (Wiley, Canada, 2002).
35. N. J. Horton and K. P. Kleinman, “Much ado about nothing: A comparison of missing data methods and software to fit incomplete data regression models,” Am. Stat. 61(1), 79–90 (2007).
36. I. R. White, P. Royston, and A. M. Wood, “Multiple imputation using chained equations: Issues and guidance for practice,” Stat. Med. 30(4), 377–399 (2011).
37. S. van Buuren and K. Groothuis-Oudshoorn, “mice: Multivariate imputation by chained equations in R,” J. Stat. Software 45(3), 1–67 (2011).
38. A. Gelman, J. Hill, and M. Yajima, “Why we (usually) don’t have to worry about multiple comparisons,” J. Res. Educ. Eff. 5(2), 189–211 (2012).
39. P. M. King and K. S. Kitchener, Developing Reflective Judgment: Understanding and Promoting Intellectual Growth and Critical Thinking in Adolescents and Adults (Jossey-Bass, San Francisco, 1994).
40. P. M. King and K. S. Kitchener, “Reflective judgment: Theory and research on the development of epistemic assumptions through adulthood,” Educ. Psychol. 39(1), 5–18 (2004).
41. L. Kost, S. Pollock, and N. Finkelstein, “Characterizing the gender gap in introductory physics,” Phys. Rev. Spec. Top.-Phys. Educ. Res. 5, 010101 (2009).
42. J. M. Nissen and J. T. Shemwell, “Gender, experience, and self-efficacy in introductory physics,” Phys. Rev. Phys. Educ. Res. 12, 020105 (2016).
43. E. Etkina, S. Murthy, and X. Zou, “Using introductory labs to engage students in experimental design,” Am. J. Phys. 74(11), 979–986 (2006).
44. J. A. Kaminski, V. M. Sloutsky, and A. F. Heckler, “The cost of concreteness: The effect of nonessential information on analogical transfer,” J. Exp. Psychol. 19(1), 14–29 (2013).