Contact info:
Ernest T. Pascarella
Educational Policy and Leadership Studies
N491 Lindquist Center
University of Iowa
Iowa City, IA 52242
(319) 335-5369
ernest-pascarella@uiowa.edu

How Robust Are the Findings of Academically Adrift?

By Ernest T. Pascarella, Charles Blaich, Georgianna L. Martin, and Jana M. Hanson

Ernest T. Pascarella (ernest-pascarella@uiowa.edu) is professor and the Mary Louise Petersen Chair in Higher Education at the University of Iowa, where he also co-directs the Center for Research on Undergraduate Education (CRUE). Charles Blaich has been at Wabash College since 1991, where he is director of the Center of Inquiry in the Liberal Arts; he also directs the Higher Education Data Sharing Consortium. Georgianna L. Martin is a PhD candidate in the Student Affairs Administration and Research program, and Jana Hanson is a PhD student in the Higher Education and Student Affairs program, at the University of Iowa; both are research assistants in CRUE.

The research on which this study was based was supported by a generous grant from the Center of Inquiry at Wabash College to the Center for Research on Undergraduate Education at the University of Iowa.

Abstract

The recently published book by Richard Arum and Josipa Roksa, Academically Adrift: Limited Learning on College Campuses, has ignited something of a furor in American postsecondary education. Its major findings suggest that, on average, students make only small gains in critical thinking and reasoning skills during college; that a substantial percentage of students make no discernible gain at all in these skills; and that levels of engagement in serious academic work, such as studying and writing, are low. Researchers from the Center of Inquiry at Wabash College and the Center for Research on Undergraduate Education at the University of Iowa have conducted analyses of the 17-institution Wabash National Study to determine the robustness of some of Arum and Roksa's major findings. Our results, obtained with a different sample of institutions, a different sample of students, and a different standardized measure of critical thinking, closely parallel those of Arum and Roksa. We conclude that the findings of Arum and Roksa are not the artifact of an anomalous sample or instrument and need to be taken seriously. At the same time, we also point out important limitations in drawing causal inferences from change scores without a control group of individuals who do not attend college.

There's no news like bad news: It travels fast and far. The publication of Richard Arum and Josipa Roksa's influential new book, Academically Adrift: Limited Learning on College Campuses (2011), the findings of which were summarized in an article by the same authors in the March/April 2011 issue of Change, has caused a national furor centering on their multi-institutional findings that the general impact of college on student intellectual development is considerably less than stellar. The book's core conclusion that postsecondary education has little effect on student learning is based largely on three major outcomes: small average gains during college on a standardized measure of critical thinking and complex reasoning (the Collegiate Learning Assessment, or CLA), a large percentage of students failing to make individually significant gains on the CLA during college, and low levels of engagement in serious academic work such as studying and writing. Whether or not they intended to, Arum and Roksa have thrown down a sizable gauntlet.
If their findings are robust or broadly generalizable, American postsecondary education may be facing a period of considerable soul searching. But just how robust are the findings of Academically Adrift (hereafter AcAd) and the other related reports released by Arum and Roksa? This is what replication studies, a time-honored way of testing the generalizability of important findings, are for.

While not unheard of, independent replication of findings is not a particularly common event in research on the effects of college on students (Pascarella & Terenzini, 1991, 2005). But the authors of this article--researchers at the Center of Inquiry (COI) at Wabash College and the Center for Research on Undergraduate Education (CRUE) at the University of Iowa who jointly conducted the quantitative component of the Wabash National Study of Liberal Arts Education (WNS)--think that because of its national visibility, AcAd is undoubtedly worth replicating. And because Arum and Roksa were transparent about their research methods, conducting such analyses with the WNS was reasonably straightforward. The WNS had a design closely paralleling that of the study on which AcAd is based. Because it administered standardized measures of such outcomes as critical thinking and moral reasoning, the WNS presented a unique opportunity to determine the generalizability of the AcAd findings with a different sample of institutions and students and with somewhat different measures of student intellectual development. This article reports the first of our findings.

The Wabash National Study (WNS)

The WNS was a large, longitudinal study of students attending 17 four-year institutions located in 11 states. It oversampled liberal arts colleges because its focus was the effect of a liberal arts education, but it also contained a mix of public and private research universities, as well as comprehensive regional institutions. Random samples of full-time students at each institution were assessed when they entered college in fall 2006, a second time at the end of their initial year of college (spring 2007), and a third time near the end of four years of college (spring 2010). During each 90-minute assessment, students completed either the Collegiate Assessment of Academic Proficiency Critical Thinking Test (CAAP-CT) or the Defining Issues Test (DIT), the latter a measure of moral reasoning. They were randomly assigned an instrument (half got the CAAP-CT and half the DIT) in fall 2006 and continued with the same instrument in the follow-up assessments. In the second and third assessments, participants also completed one of several instruments, including the National Survey of Student Engagement (NSSE), that asked about their various experiences during college. Usable data across all four years of the WNS were available for 2,212 students.

The CAAP-CT is a well-regarded and frequently used 40-minute, 32-item instrument designed to measure a student's ability to clarify, analyze, evaluate, and extend arguments. It consists of four passages in a variety of formats (e.g., case studies, debates, dialogues, experimental results, statistical arguments, editorials). Each passage contains a series of arguments that support a general conclusion followed by a set of multiple-choice test items. The internal consistency reliabilities of the CAAP-CT range between .81 and .82.
The DIT measures levels of moral judgment or reasoning. It gauges the extent to which an individual uses principled moral reasoning in resolving moral dilemmas and rejects ideas based on their being simplistic or biased. We used the DIT-N2 score, which has a reliability in the .77 to .81 range; in its earlier form as the P-score, it significantly predicted a wide range of ethical behaviors (Pascarella & Terenzini, 1991, 2005).

Replication Analyses

The CLA performance task, which forms the centerpiece of the research conducted in Academically Adrift, is a 90-minute task requiring students to use an integrated set of critical thinking, analytic reasoning, problem-solving, and written communications skills to address several open-ended questions about a hypothetical but realistic situation. Trained human scorers determine an individual's performance score. The CLA performance task correlates .58 with the CAAP-CT (Klein, Liu, & Sconing, 2009), so while the two measures are substantially related, they also appear to measure somewhat different dimensions of critical thinking and analytical reasoning.

In the analyses reported in AcAd, Arum and Roksa followed more than 2,000 students attending 24 four-year colleges and universities during the first two years of postsecondary education. In another report the authors reveal what happened after four years of college (Arum, Roksa, & Cho, 2011). In each case they computed average change scores on the CLA for their sample and converted them to a common metric so that they could be compared with other research findings. They also estimated the percentage of students in the sample after two and four years who made statistically reliable individual CLA gains.

The longitudinal design of the WNS closely parallels that of the CLA study, with one difference: the first follow-up assessment in the CLA study came after two years of postsecondary education, while the first follow-up in the WNS came at the end of the first year of college. Accordingly, in our replication analyses we calculated average change scores on the CAAP-CT and the DIT over the first year of college and over four years of college. We converted our average change scores to the same scales as Arum and Roksa's (i.e., a proportion of a standard deviation gain and a percentile gain) and likewise estimated the percentage of students failing to demonstrate statistically reliable CAAP-CT and DIT gains after one and four years of college. Finally, we summarized students' self-reports of study time and writing assignments during college and compared them across the WNS and CLA studies.

What We Found

The average gains during college on the CLA, CAAP-CT, and DIT-N2 are compared visually in Figure 1. As the figure shows, the three were generally similar; the gains on the CLA and CAAP-CT, made by students from two independent samples, were particularly close.

Figure 1 about here

Over the first two years of college, students made an average gain on the CLA of .18 of a standard deviation (sd). Placed against the template of a normal distribution, this represented an improvement of about 7 percentile points. Thus, if students were placed at the 50th percentile on the CLA when they entered college, they were at the 57th percentile after two years of college. The one-year gain on the CAAP-CT was .11 sd, which represented only a 4 percentile point improvement, to the 54th percentile.
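The sd-to-percentile translations used here and throughout the article follow directly from the normal-distribution template just described. The short Python sketch below is our own illustration rather than code from either study; it simply assumes that an average gain expressed in sd units can be mapped onto a standard normal distribution for a hypothetical student entering at the 50th percentile.

```python
# Minimal sketch of the normal-distribution conversion described above: an average
# gain in standard deviation (sd) units is translated into percentile points for a
# hypothetical student who entered college at the 50th percentile.
from scipy.stats import norm

def percentile_point_gain(gain_in_sd):
    """Percentile-point improvement implied by a gain of `gain_in_sd` standard deviations."""
    return 100 * (norm.cdf(gain_in_sd) - 0.5)

print(f"CLA, two years (.18 sd): about {percentile_point_gain(0.18):.0f} percentile points")
print(f"CAAP-CT, one year (.11 sd): about {percentile_point_gain(0.11):.0f} percentile points")
```

The same conversion underlies the four-year figures reported below.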
Assuming the same trajectory of change, this would mean that after two years of college the hypothetical CAAP-CT gain would be estimated at .22 sd, which would be quite similar to the .18 sd gain over two years of college reported for the CLA by Arum and Roksa.

The high degree of similarity continues if one looks at CLA and CAAP-CT gains made over four years of college. The average four-year CLA gain was .47 sd, which translates to an 18 percentile point improvement. Thus, if students were functioning at the 50th percentile on the CLA when they entered college, they functioned at the 68th percentile on average four years later. The corresponding four-year gain on the CAAP-CT was .44 sd. This represents a 17 percentile point increase and indicates performance at the 67th percentile after four years of college. Across both instruments, the four-year gains in critical thinking were quite similar to Pascarella and Terenzini's (2005) estimate of a .50 sd gain over four years of college based on research conducted largely in the 1990s. However, for research conducted during the 20-year period from 1969 to 1989, Pascarella and Terenzini (1991) estimated an average gain in critical thinking skills over four years of college of 1.00 sd, substantially larger than these more recent estimates.

As also shown in Figure 1, the gains made on the DIT-N2 score over one and four years of college were somewhat larger than those shown for critical thinking skills. During the first year of college, students made an average gain of .32 sd, which was a 13 percentile point increase. Over four years of college, the DIT-N2 gain was .58 sd. This represented a 22 percentile point increase in students' use of principled reasoning in resolving moral issues since entering college. Still, this latter gain was considerably smaller than the average four-year gain from research conducted in the 1990s, which was estimated at .77 sd, or a 28 percentile point increase in the use of principled moral reasoning (Pascarella & Terenzini, 2005).

Arum and Roksa also considered change during college in students' performance on the CLA from a different perspective, by estimating the percentage of students in their sample who failed to make statistically significant gains (at the .05 level) on the CLA over two and four years of college. This they estimated at 45 percent of students over two years and 36 percent of students over four years. Using the same statistical procedure as Arum and Roksa to identify a significant gain, we estimated that 44 percent of students did not make statistically significant gains on the CAAP-CT over the first year of college and that 33 percent did not do so over four years of college. Similarly, an estimated 38 percent of students did not make statistically significant gains on the DIT-N2 score during the first year of college, and an estimated 29 percent did not increase their use of principled moral reasoning over four years of college.

Although our findings are consistent with those of Arum and Roksa, we remain somewhat skeptical of the procedure itself. Gain scores at the individual student level are notoriously unreliable, and testing their statistical significance with a procedure designed to determine the significance of gains made by groups and not individuals does not correct for that fact.
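To make this concern concrete, the sketch below works through one common way of flagging "reliable" individual gains, a Jacobson-Truax-style reliable change index. It is offered only as an illustration of how test unreliability and the chosen significance level interact; it is not necessarily the exact procedure Arum and Roksa used, and the reliability value is simply the internal consistency reported above for the CAAP-CT.

```python
# Illustrative sketch (not necessarily the procedure used in either study): a reliable
# change index counts an individual's gain as "significant" only if it exceeds the
# measurement error expected for the difference between two somewhat unreliable scores.
import math

def minimum_reliable_gain(sd, reliability, critical_z):
    """Smallest individual gain that exceeds chance fluctuation at the given z cutoff."""
    se_difference = sd * math.sqrt(2 * (1 - reliability))  # standard error of a difference score
    return critical_z * se_difference

sd = 1.0            # work in sd units, as in the text
reliability = 0.81  # internal consistency reported for the CAAP-CT

for alpha, z in [(0.05, 1.96), (0.10, 1.645)]:
    print(f"alpha = {alpha}: an individual gain must exceed "
          f"{minimum_reliable_gain(sd, reliability, z):.2f} sd to count as reliable")
```

With reliability near .81, an individual student would need to gain roughly 1.2 sd to clear the .05 cutoff, far more than the average four-year gains of .44 to .47 sd, and the threshold drops noticeably when the significance level is relaxed to .10, which is the point taken up next.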
Indeed, it could be argued that using a somewhat arbitrary level of statistical significance (.05) with unreliable scores might have seriously underestimated the percentage of students who made gains that were, in fact, real. When we substituted a more liberal significance level (.10), we saw a substantial drop in the percentage of students not making significant gains on all three measures. Consequently, we would caution that an estimated percentage of students failing to make a "significant" individual gain on any of the three instruments constitutes a finding subject to a very wide range of potential error and attendant misinterpretation. We think it provides a somewhat less trustworthy argument for "limited learning on college campuses" than the evidence of overall small average gains during college on two different measures of critical thinking and reasoning.

Finally, both AcAd and the WNS collected student self-reports about their college experience, including their level of involvement in academic work. While a detailed comparison of all these self-reports is beyond the scope of this paper, a comparison of two salient dimensions of academic engagement after four years of college gives a flavor of what the two studies found. Arum, Roksa, and Cho (2011) report that the typical full-time student in the CLA sample studied between 13 and 14 hours per week, which they point out is about half as much time as was reported over a decade ago. In the WNS study the typical full-time student spent about 15 hours per week preparing for class, a finding quite consistent with Arum, Roksa, and Cho. This lends credence to Arum and Roksa's argument that students used to work harder, particularly when linked with the average drop in four-year critical thinking gains from 1.00 sd to .50 sd observed over the past two decades (as noted above). Along with others (see Pascarella & Terenzini, 1991, 2005), Arum and Roksa report that the extent of cognitive growth is frequently dependent upon the level of a student's overall academic engagement during college. Other things being equal, less academic engagement often means less cognitive growth. Arum, Roksa, and Cho also found that 51 percent of students had not written a paper of at least 20 pages during the school year. The corresponding percentage in the WNS was slightly more than 40 percent. The difference in these last percentages is quite possibly due to the heavy representation in the WNS sample of liberal arts colleges, which tend to require more written work than other four-year institutions (Pascarella et al., 2005).

When someone delivers news as potentially unsettling as that delivered by Arum and Roksa, it is almost axiomatic that their methods will be questioned, as will the robustness of their findings. Our attempt to cross-validate some of Arum and Roksa's major findings with the Wabash National Study is not an attempt to answer all those questions. Nevertheless, the findings from the WNS, based on an independent sample of institutions and students and using a multiple-choice measure of critical thinking substantially different in format from the Collegiate Learning Assessment, closely match those reported by Arum and Roksa. This suggests that an important part of what Arum and Roksa found is not simply the result of an anomalous sample or the use of a psychometrically unconventional method of measuring critical thinking and reasoning skills.
The results from the WNS reported above clearly do not resolve all the genuine issues and concerns that will inevitably be raised about the findings of Academically Adrift and its associated reports. However, the WNS results do suggest that Arum and Roksa should be taken seriously. Their findings are not going to go away simply because they make academics uncomfortable.

Cautions in Interpreting Change Scores

Because much of the evidence presented above centers on change during college, we feel compelled to point out at least two critical cautions involved in interpreting change scores as indicators of the actual effect of college. These cautions are every bit as important as the fact that our findings closely match those of Academically Adrift.

The first caution concerns change in an absolute sense. The changes in intellectual and moral development that both studies report appear to be modest. As far as we know, however, no one has come up with an operational definition of just how much change we should expect on such instruments during college if we are to conclude that postsecondary education is doing the job it claims it is. Some human traits are simply less changeable than others, and that needs to be considered. Until we can come up with standards of expected change during college, the meaning of average gain scores like the ones reported above will be largely in the eye of the beholder. One person's "trivial" may be another person's "important."

A second, and equally important, caution concerns change in a relative sense. We fear that what appear to be small changes during college on the CLA or CAAP-CT will be interpreted in many camps as definitive evidence that college provides little or no added value over secondary school in terms of intellectual development. Making such a case based on the change-score evidence summarized above would perhaps represent a greater indictment of higher education's ability to impart critical thinking skills than the findings themselves. Simply put, one cannot validly use an average gain score during college as an accurate estimate of the value-added effect of college. Neither the AcAd study nor the WNS has a control group of individuals who do not attend college measured over the same period of time with the same instruments. Without such a control group, the meaning of change during college can be quite deceiving, for two reasons.

• First, the presence of gains during college does not necessarily point to an effect of college. Many factors, such as history, normal maturation, and the practice effect from taking a test more than once, contribute to change and may masquerade as a college effect. This means that the value added by college could actually be smaller than the gains described above.

• Conversely (and this seems counterintuitive), little or no gain during college does not mean that college is failing to add value. On some traits, such as quantitative skills, students do not always appear to progress much during college, but their counterparts who do not attend college actually retrogress substantially over the same period of time (Pascarella & Terenzini, 1991). This perhaps explains why Pascarella and Terenzini (2005) concluded that the average four-year gain in critical thinking during college in the 1990s (.50 of a standard deviation) was actually smaller than the estimated value-added effect of college on critical thinking (.55 of a standard deviation) during the same period of time. (A short sketch of this arithmetic follows the list.)
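To make the second bullet's arithmetic explicit, the brief sketch below adopts a simple difference-in-gains reading of value added. The implied change for non-attenders is our own back-of-the-envelope inference from the two published figures, not a number reported in either study.

```python
# Back-of-the-envelope illustration (an inference, not a reported analysis): under a
# simple difference-in-gains reading,
#   value added = gain for college students - gain for comparable non-attenders.
college_gain = 0.50          # average four-year critical thinking gain in the 1990s, in sd units
value_added_estimate = 0.55  # Pascarella and Terenzini's (2005) value-added estimate, in sd units

implied_nonattender_change = college_gain - value_added_estimate
print(f"Implied four-year change for non-attenders: {implied_nonattender_change:+.2f} sd")
# -> -0.05 sd: if non-attenders slip slightly, college adds more value than the raw gain alone suggests.
```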
Both the value-added and average gain methods took into account changes in critical thinking for students during four years of college. However, the value-added method produced a different, and larger, estimate because it also took into account four-year changes in critical thinking for those with little or no exposure to college.

The bottom line is that simple average gain scores on critical thinking or other measures of development during college need to be interpreted with a large dose of caution. Admittedly, it is no easy task to construct studies of college impact that include control groups of non-college individuals. Until such studies are done, however, accurately estimating the added value of postsecondary education on student intellectual development will be an elusive, and likely contentious, quest.

A Final Thought

While the above cautions need to be taken seriously, they do not diminish the potential importance of the findings of Academically Adrift and the fact that these findings have essentially met the standard of independent replication with different samples of institutions and students and a different measure of critical thinking skills. There is a regrettable tendency in scholarly work about postsecondary education for even the most important books to flash brilliantly and fade rapidly. It would be genuinely unfortunate if this were the fate of AcAd. Irrespective of how validly one thinks Arum and Roksa have interpreted their findings, they have nevertheless issued an important wake-up call to American higher education. Hopefully, such a call can be used to initiate a productive and open-minded national conversation on just how much effective undergraduate education really matters in our colleges and universities.

Resources

Arum, R., & Roksa, J. (2011). Academically adrift: Limited learning on college campuses. Chicago: University of Chicago Press.

Arum, R., Roksa, J., & Cho, E. (2011). Improving undergraduate learning: Findings and policy recommendations from the SSRC-CLA longitudinal project. Brooklyn, NY: Social Science Research Council.

Klein, S., Liu, O., & Sconing, J. (2009). Test validity study (TVS) report. Washington, DC: Fund for the Improvement of Postsecondary Education.

Pascarella, E., & Terenzini, P. (1991). How college affects students. San Francisco: Jossey-Bass.

Pascarella, E., & Terenzini, P. (2005). How college affects students (2nd ed.). San Francisco: Jossey-Bass.

Pascarella, E., et al. (2005). Liberal arts colleges and liberal arts education: New evidence on impacts. San Francisco: Jossey-Bass.

Roksa, J., & Arum, R. (2011, March/April). The state of undergraduate learning. Change.

Figure 1: Four-year student change measured in standard deviation (sd) and percentile point (pp) increases on the Defining Issues Test, the Collegiate Assessment of Academic Proficiency Critical Thinking Test, and the Collegiate Learning Assessment