GENERAL EDUCATION ASSESSMENT

Spring 2010 Report

Prepared by Dr. Linda Siefert, General Education Assessment Director

August 2010

ACKNOWLEDGEMENT

I would like to acknowledge the help and support of the following people in the assessment efforts for Spring 2010:

Dr. Ken Spackman, Director of University Planning
Dr. Carrie Clements, Director of the Center for Teaching Excellence
Dr. Abdou Ndoye, Assessment Director, Watson School of Education
Erin Danielle Cooke, Graduate Assistant, Department of Psychology
Robert Wilcox and Judy Kinney, Office of Institutional Research and Assessment

TABLE OF CONTENTS

EXECUTIVE SUMMARY
BACKGROUND AND SCOPE
METHODOLOGY
    ASSESSMENT TOOLS
    SAMPLE SELECTION
    SCORING
RESULTS
    A NOTE ON QUANTITATIVE AND QUALITATIVE DATA
    THOUGHTFUL EXPRESSION (WRITTEN COMMUNICATION)
    INQUIRY
    CRITICAL THINKING
    SOCIAL SCIENCE FOUNDATIONAL KNOWLEDGE
    COMPARISON OF SCORES FROM TWO RUBRICS
    RELIABILITY OF SCORES
    SCORER FEEDBACK
    INSTRUCTOR FEEDBACK
DISCUSSION, LIMITATIONS AND RECOMMENDATIONS
    THOUGHTFUL EXPRESSION (WRITTEN COMMUNICATION)
    INQUIRY
    CRITICAL THINKING
    SOCIAL SCIENCE FOUNDATIONAL KNOWLEDGE
    RELATIVE STRENGTHS AND WEAKNESSES ACROSS RUBRICS
    METHODOLOGY AND PROCESS
    LIMITATIONS
    RECOMMENDATIONS
REFERENCES
APPENDIX A RUBRICS USED
APPENDIX B DIMENSION MEANS AND STANDARD DEVIATIONS
APPENDIX C RESULTS BY COURSE
APPENDIX D DETAILED SCORER FEEDBACK
APPENDIX E CORRELATIONS BETWEEN RUBRIC DIMENSIONS

LIST OF TABLES

Table 1 Written Communication Score Results
Table 2 Distribution of Scores for Written Communication
Table 3 Distribution of Scores by Gender
Table 4 Inquiry Rubric Score Results
Table 5 Distribution of Scores for Inquiry, Applicable Scores Only
Table 6 Critical Thinking Score Results
Table 7 Distribution of Scores for Critical Thinking, Applicable Scores Only
Table 8 Foundational Knowledge Score Results
Table 9 List of Statistically Significant Correlations across Rubrics
Table 10 Interrater Reliability
Table 11 Scorer Feedback on Process
Table 12 Written Communication Percent of Sample Scored at Least 2 and at Least 3
Table 13 Inquiry Percent of Sample Scored at Least 2 and at Least 3
Table 14 Critical Thinking Percent of Sample Scored at Least 2 and at Least 3
Table 15 Foundational Knowledge Percent of Sample Scored at Least 2 and at Least 3
Table B1 Means and Standard Deviations for Each Rubric Dimension
Table C1 Written Communication Results by Course
Table C2 Inquiry Rubric Score Results by Course
Table C3 Critical Thinking Score Results by Course
Table E1 Correlation between Dimensions
Table E2 Correlation between Dimensions

LIST OF FIGURES

Figure 1 Distribution of Scores for Written Communication
Figure 2 Distribution of Scores for Inquiry, Applicable Scores Only
Figure 3 Distribution of Scores for Critical Thinking, Applicable Scores Only

EXECUTIVE SUMMARY

This report provides the results of the General Education Assessment efforts for Spring 2010. The processes used were recommended by the General Education Assessment Committee in its March 2009 report.
Three UNCW Learning Goals were assessed using AAC&U VALUE rubrics: Thoughtful Expression, Inquiry, and Critical Thinking. In addition, a locally created rubric for Foundational Knowledge was piloted. The sample consisted of 293 student work products from the following Basic Studies courses: ENG 201, FST 210, MUS 115, PSY 105, and SOC 105.

RESULTS FOR THOUGHTFUL EXPRESSION (WRITTEN COMMUNICATION)

The median score for all five dimensions was 2 on the 4-level scale (with level 4 the expectation for UNCW graduates). Work products were strongest on the dimension WC1 Context of and Purpose for Writing, and weakest on the dimensions WC3 Genre and Disciplinary Conventions and WC4 Sources and Evidence. There were significant differences between the results for females and males. Scores were higher for term papers than for in-class test questions, except on the dimension WC5 Control of Syntax and Mechanics.

RESULTS FOR INQUIRY

Three of the dimensions were considered not applicable by scorers for some assignments. The median score for five of the six dimensions was 2 on the 4-level scale (with level 4 the expectation for UNCW graduates); the median was 3 for the dimension IN2 Existing Knowledge, Research, and/or Views. Work products were strongest on the dimension IN2 Existing Knowledge, Research, and/or Views, and weakest on the dimension IN6 Limitations and Implications.

RESULTS FOR CRITICAL THINKING

All dimensions of the rubric were considered not applicable for at least one assignment. The median score for three of the five dimensions was 2 on the 4-level scale (with level 4 the expectation for UNCW graduates); the median was 1 for the other two dimensions. Work products were strongest on the dimensions CT1 Explanation of Issues and CT2 Evidence, and weakest on CT3 Influence of Context and Assumptions and CT5 Conclusions and Related Outcomes. Scores were higher on term papers than on in-class test questions on all dimensions except CT1 Explanation of Issues.

OTHER FINDINGS

The results for the pilot of the Foundational Knowledge rubric were inconclusive. It was determined that the student work products were collected too early in the semester to provide an accurate measure of student knowledge of discipline terminology and concepts.

Interrater reliability was measured using a number of statistical methods. While only 3 of the 16 dimensions across all three rubrics met the benchmark chosen, the findings were promising for the first use of the rubrics. Additional exposure to and use of the rubrics, along with enhanced training, should improve interrater reliability in the future.

PROCESS FEEDBACK

Instructor and scorer feedback was gathered for all steps in the process. Both instructors and scorers had a high level of satisfaction with the process. Two scorers suggested that more training would be helpful. Scorers also provided valuable feedback on aspects of the rubrics, and this feedback will be used to make modifications to the rubrics.

RECOMMENDATIONS

Based on the analysis of the findings from the student work products sampled and of the participant feedback, the following recommendations were made by the Learning Assessment Council.

• Levels of expected performance at the basic studies, or lower division, level should be developed for each rubric.
• Additional exposure to the content of and rationale for the UNCW Learning Goals should be provided to increase faculty ownership and awareness of these Goals.
  The LAC will ask the Center for Teaching Excellence to provide a workshop series on these Goals, and will ask the University Curriculum Committee to consider actions in this area.
• To increase student exposure to the writing process, the Writing Intensive component of University Studies should be implemented by Fall 2012.
• Modifications and improvements to the general education assessment process should be made as needed, including the following: modify rubrics based on feedback, develop benchmark work products, and enhance instructor and scorer workshops.
• The long-term implementation schedule should provide flexibility for targeting additional sampling for specific learning goals whose assessment results are ambiguous or unclear. For 2010-2011, Critical Thinking will be sampled for this purpose.

BACKGROUND AND SCOPE

Before discussing General Education Assessment, it is important to understand what we mean by General Education. General Education is most often thought of as the curriculum requirements outside of the majors that expose students to foundational knowledge across the disciplines and to a variety of ways of thinking about and exploring the world. General Education can also be thought of in terms of broad learning outcomes: the knowledge and set of abilities that are needed by citizens and workers throughout their lives. Examples of these broad learning outcomes are the ability to think critically and the ability to communicate thoughtfully and clearly. In terms of these broad learning outcomes, General Education is any curriculum or formal experience that provides students opportunities to practice and eventually master these abilities. Through this lens, General Education becomes broader than "the Gen Ed curriculum" to also include any work within the college experience that helps cultivate the set of General Education learning outcomes.

Taking this broader perspective, UNCW has adopted nine Learning Goals: Foundational Knowledge, Inquiry, Information Literacy, Critical Thinking, Thoughtful Expression, Second Language, Diversity, Global Citizenship, and Teamwork. For each learning goal, detailed learning outcomes are described both for basic studies and for the characteristics of UNCW graduates.

In August 2008 Provost Brian Chapman created and charged a General Education Assessment Committee with designing assessment mechanisms for the current Basic Studies structure (as it appeared in the 2008-09 Undergraduate Catalogue) using the Faculty Senate-approved learning outcomes for general education. After collaborating with the University Studies Advisory Committee, which was drafting Revising General Education at UNCW, and designing and administering an information-gathering survey of faculty teaching Basic Studies courses, the committee presented a Report of the General Education Assessment Committee to the Provost in March 2009.
Key findings and recommendations included in that report were:

• An alignment between the UNCW Learning Goals and the basic studies component common student learning outcomes;
• An estimation of the fit of the University Studies component common student learning outcomes to the Basic Studies courses, based on a faculty survey;
• A recommendation to use student work products from assignments embedded in normal basic studies coursework to assess student learning;
• A recommendation to use the AAC&U VALUE rubrics for the Learning Goals Information Literacy, Critical Thinking, Thoughtful Expression (written), and Inquiry;
• A recommendation to implement a three-year recurring cycle for assessing the nine UNCW Learning Goals, with a recommended schedule from Fall 2009 through Fall 2010.

In late Spring 2009, three members of the General Education Assessment Committee performed a Pilot Assessment of Basic Studies: Critical Thinking, Inquiry and Analysis, and Written Communication. The main purpose of the pilot was to test the process recommended by the committee. Their report outlined additional recommendations about the process.

During Fall 2009, it was determined that the College of Arts and Sciences Director of Assessment would be responsible for implementing general education assessment at the basic studies level in Spring 2010. Based on the recommended implementation schedule, the following UNCW Learning Goals were assessed: Thoughtful Expression, Inquiry, and Critical Thinking. This report outlines the methodology of and findings from that study. In the scheme of all general education assessment at UNCW, this report provides useful information on the abilities of UNCW students during their basic studies work as seen through course-embedded assignments.

METHODOLOGY

The purpose of the general education assessment activities in Spring 2010 was to examine the following questions:

• What are the overall abilities of students taking basic studies courses with regard to the UNCW Learning Goals of Thoughtful Expression, Inquiry, and Critical Thinking?
• What are the relative strengths and weaknesses within the subskills of those goals?
• Are there any differences in performance based on demographic and preparedness variables such as gender, race or ethnicity, transfer students vs. freshman admits, honors vs. non-honors students, total hours completed, or entrance test scores?
• What are the strengths and weaknesses of the assessment process itself?

A final purpose was to pilot test a Social Science Foundational Knowledge rubric.

UNCW has adopted an approach to assessing its Learning Goals at the basic studies level that uses assignments that are a regular part of the course content. A strength of this approach is that the student work products are an authentic part of the curriculum, and hence there is a natural alignment often missing in standardized assessments. Students are motivated to perform at their best because the assignments are part of the course content and course grade. The assessment activities require little additional effort on the part of course faculty because the assignments used are a regular part of the coursework. An additional strength of this method is faculty collaboration and full participation in both the selection of the assignments and the scoring of the student work products. The student work products collected are scored independently on a common rubric by trained scorers.
The results of this scoring provide quantitative estimates of students' performance and qualitative descriptions of what each performance level looks like, which provides valuable information for the process of improvement. The usual disadvantage of this type of approach, compared to standardized tests, is that results cannot be compared to other institutions. This disadvantage is mitigated in part by the use of the AAC&U VALUE rubrics for many of the Learning Goals. This concern is also addressed by the regular administration of standardized assessments, in particular the CLA and the ETS Proficiency Profile, giving the university the opportunity to make such comparisons.

ASSESSMENT TOOLS

For three of the four UNCW Learning Goals assessed, Association of American Colleges and Universities (AAC&U) Valid Assessment of Learning in Undergraduate Education (VALUE) rubrics were used:

• for Thoughtful Expression, the VALUE Written Communication rubric;
• for Inquiry, the VALUE Inquiry and Analysis rubric; and
• for Critical Thinking, the VALUE Critical Thinking rubric.

The VALUE rubrics, part of the AAC&U Liberal Education and America's Promise (LEAP) initiative, were developed by over 100 faculty and other university professionals. Each rubric contains the common dimensions and most broadly shared characteristics of quality for each dimension. A locally created rubric was piloted for assessing Foundational Knowledge in the Social Sciences. Appendix A contains the versions of each of the rubrics that were used in the study.

SAMPLE SELECTION

The sampling method used lays the foundation for the generalizability of the results. As mentioned in the introduction, no single part of the basic studies curriculum (nor, for that matter, of the university experience) is solely responsible for helping students to write well, think critically, or conduct responsible inquiry and analysis. These skills are practiced in many courses. The Fall 2008 survey helped determine which basic studies courses are most appropriate for assessing each of these goals.

For this first round of assessment, five basic studies courses taken by a large number of students were selected, in order to represent as much as possible the work of "typical" UNCW students. Within each course, sections were divided into those taught by tenure-line and non-tenure-line faculty, those taught in the classroom and online, and honors and non-honors. Within each subgroup, sections were selected randomly in quantities that represent as closely as possible the overall breakdown of sections by these criteria; a sketch of this selection scheme appears at the end of this section. Thirteen sections were selected in all. Within each section, all student work products were collected, and random samples of the work products were selected.

Prior to the start of the semester, the CAS Director of Assessment met with course instructors to familiarize them with the VALUE rubrics. Instructors were asked to review their course content and assignments, and to select one assignment that they felt fit the dimensions of at least one of the rubrics.

Each student filled out a Student Work Product Cover Sheet, which acknowledged the use of their work for the purpose of general education assessment. These cover sheets were removed before scoring. The name and student ID information on the cover sheets was matched with student demographic information in university records for the purpose of analysis based on demographic and preparedness variables.
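The proportional, stratified selection described above can be illustrated with a short script. This is a minimal sketch only: the section records, field names, and counts below are hypothetical, not the actual Spring 2010 data.

```python
import random

random.seed(2010)  # fixed seed so the same selection can be reproduced

# Hypothetical section records; the fields mirror the strata described above
# (tenure-line vs. not, classroom vs. online, honors vs. non-honors).
sections = [
    {"id": f"PSY105-{i:03d}",
     "tenure_line": i % 3 != 0,
     "online": i % 5 == 0,
     "honors": i == 7}
    for i in range(1, 16)
]

def stratified_pick(sections, n_pick):
    """Group sections into strata, then draw from each stratum roughly in
    proportion to its share of all sections."""
    strata = {}
    for s in sections:
        key = (s["tenure_line"], s["online"], s["honors"])
        strata.setdefault(key, []).append(s)
    chosen = []
    for members in strata.values():
        quota = round(n_pick * len(members) / len(sections))
        chosen.extend(random.sample(members, min(quota, len(members))))
    return chosen

print([s["id"] for s in stratified_pick(sections, 4)])
```

Because per-stratum quotas are rounded, the total drawn can differ slightly from the target, which is why a selection like this is typically adjusted by hand to land on a fixed number of sections.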
SCORING

REVIEWER RECRUITMENT AND SELECTION

Reviewers were recruited from UNCW faculty across the college and all schools. A recruitment email was sent to all department chairs on February 11, 2010, with a request that it be forwarded to all department faculty. The desire was to include reviewers from a broad spectrum of departments. The intent was to include three groups of faculty: those who do not teach in departments offering basic studies courses, who would have the opportunity to see the work being done by students in the general education courses; those who teach upper-level courses, such as capstone courses, within departments that do offer general education courses, who would see the learning students experience as they begin their programs; and those who themselves teach basic studies courses. It was also important to have at least one faculty member from each of the departments from which student work products were being reviewed.

SCORING PROCESS

Metarubrics, such as the VALUE rubrics, are constructed so that they can be used to score a variety of student artifacts across disciplines, across universities, and across preparation levels. But their strength is also a weakness: the generality of the rubric makes it more difficult to use than a rubric created for one specific assignment. To address this issue, a process must be created that not only introduces the rubric to the scorers, but also makes its use more manageable.

Volunteer scorers initially attended a two-hour workshop on one of the three rubrics (Written Communication, Inquiry and Analysis, or Critical Thinking). During the workshop, scorers reviewed the rubric in detail and were introduced to the following assumptions adopted for applying the rubrics to basic studies work products.

Initial assumptions

1. Each rubric can be used across all school years (freshman to senior), with Level 4 Capstone representing the characteristics we want the work of UNCW graduates to demonstrate.
2. When scoring, we are comparing a particular work product to the characteristics we want the work of UNCW graduates to demonstrate.
3. A main purpose of the scoring is to determine the relative strengths and weaknesses of our students. Therefore it is important to look for evidence for each dimension of the rubric separately, and not to score the work products holistically (i.e., tend towards one score for all dimensions).
4. The instructor's directions about the assignment should guide the scorer's interpretation of the rubric dimensions.
5. Other assumptions will need to be made when each rubric is used to score individual assignments. For example, a dimension may not fit a particular assignment.

After reviewing the rubric and initial assumptions, the volunteers read and scored three to four student work products. Scoring was followed by a detailed discussion, so that scorers could better see the nuances of the rubric and learn what fellow scorers saw in the work products. From these discussions, assumptions began to be developed for applying the rubric to each specific assignment.

The work on common assignment-specific assumptions or guidelines was continued on the day of scoring. Scorers were assigned to groups of 2, 3, or 4. Scoring of each assignment began with the group scoring one student work product together and discussing their individual scores. Discussion clarified any implicit assumptions each scorer had used in scoring the first work product.
From that discussion, each group created any assignment-specific assumptions that they would use for scoring the rest of the set of assignments.

After completing a packet of work products, each scorer completed a rubric feedback form and turned in the assignment-specific assumptions used by the group. The feedback form asked for information on how well each rubric dimension fit the assignment and student work. It also asked for feedback on the quality criteria for each dimension. Scorers were also asked to complete an end-of-day survey to provide feedback on the entire process.

In order to measure the consistency of the application of the rubric, additional common work products were included in each packet for statistically measuring interrater reliability.

RESULTS

A NOTE ON QUANTITATIVE AND QUALITATIVE DATA

Quantitative data are numerical scores, such as the number of questions answered correctly on a test. Qualitative data are verbal summaries, such as the oral or written feedback professors provide on an assignment. Rubrics combine aspects of both qualitative and quantitative data. They contain detailed descriptions of the quality criteria for each level on the scoring continuum. When a student work product is scored, it is compared to these criteria and categorized into the level that best matches the features of the work. The levels or categories are often designated with labels such as Novice, Developing, Apprentice, and Expert. Sometimes the levels are simply numbers.

With or without the use of numbers, the levels usually represent an ordering of categories, but the categories are not equally spaced along a number line. Although a level 2 is considered higher, or larger, than a level 1, it is not proper to assume that a student who scores at level 2, or Developing, is twice as knowledgeable as a student who scored at level 1, or Novice; nor can we assume that the difference between these two categories, whatever it is, is exactly the same as the difference between levels 2 and 3. For this reason, these ordinal data do not yield valid mean scores. And averages are not what we're interested in anyway. When we analyze the results of an assessment effort, we want to determine if we are satisfied with the demonstrated knowledge and skills of our students. Rather than describing the average student, we want to discover what percent of our students are below, meet, or exceed our expectations. In this report, score results are given as the percentage of students scored at each level. The nature of the data also requires the use of non-parametric tests of significance. Means and standard deviations are often provided by researchers for non-interval data; this information is given in appendix Table B1, as it may be helpful to some readers as a starting point to suggest further investigation using statistical methods appropriate to ordinal data.

As previously mentioned, one of the assumptions of our use of the VALUE rubrics is that Level 4 Capstone describes the qualities of understanding that we want our graduating seniors to demonstrate. As an institution, we have not yet defined our minimum expectations for our first- and second-year students, the predominant group taking basic studies courses. After this initial project, we might be in a position to set reasonable expectations.
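The kind of summary used throughout this report (percent of work products at each level, together with medians and quartiles) can be computed as in the following sketch. The scores below are made up for illustration; only the method reflects the report.

```python
import numpy as np

# Hypothetical rubric scores for one dimension (ordinal levels 0-4).
scores = np.array([2, 3, 1, 2, 2, 4, 0, 3, 2, 1, 3, 2])

# Percent of work products at each level, as reported in Tables 1-8.
for lv in range(5):
    pct = 100 * np.count_nonzero(scores == lv) / scores.size
    print(f"level {lv}: {pct:.1f}%")

# Order statistics are meaningful for ordinal data even though means are not.
# method="inverted_cdf" (NumPy >= 1.22) returns actual observed levels
# rather than interpolating between them.
q25, q50, q75 = np.percentile(scores, [25, 50, 75], method="inverted_cdf")
print("25th/50th/75th percentiles:", q25, q50, q75)
```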
DESCRIPTION OF SAMPLE

DESCRIPTION OF COURSES

A total of 302 student work products were sampled from the 13 assignments collected. However, the cover sheet information could not be matched with Banner data for nine of them. After removal of these work products from the sample, a total of 293 student work products were scored from the following five courses:

• ENG 201 College Writing and Reading II (4 sections: one taught by tenure-line faculty, one by a full-time lecturer, and two by part-time instructors)
• FST 210 Moviemakers and Scholars Series (1 section, taught by a full-time lecturer)
• MUS 115 Survey of Music Literature (3 sections: two taught by tenure-line faculty and one by a part-time instructor; one honors section)
• PSY 105 General Psychology (3 sections: two taught by tenure-line faculty and one by a part-time instructor)
• SOC 105 Introduction to Sociology (3 sections, taught by two tenure-line faculty; two online sections)

The breakdown of sections taught by tenure-line faculty, lecturers, and part-time faculty is representative of the breakdown for each course as a whole.

SAMPLE BY RUBRIC

The total number of work products in the final sample scored using each rubric was:

• Written Communication: 116 work products scored by four scorers
• Inquiry: 98 work products scored by four scorers
• Critical Thinking: 183 work products scored by seven scorers
• Foundational Knowledge: 45 work products scored by one scorer

The total number of scores produced was larger than the total number of work products because 27 work products were scored using both the Written Communication and Inquiry rubrics, 37 were scored using both the Written Communication and Critical Thinking rubrics, 40 were scored using both the Inquiry and Critical Thinking rubrics, and 45 were scored using both the Critical Thinking and Foundational Knowledge rubrics. No work products were scored using more than two rubrics.

DESCRIPTION OF STUDENTS

The 293 work products were produced by 288 unique students (five students provided work products for two different courses). A few Banner records did not contain all demographic variables of interest; therefore, the sample size for any particular variable may be smaller. The demographic breakdown of the participating students, compared in parentheses to the overall undergraduate enrollment for AY 2009-2010, was: 55.0% (59.4%) female; 13.1% (28.0%) transfer students; 7.7% (6.5%) honors students; 3.4% (4.9%) African American; 0.3% (0.6%) American Indian; 1.0% (1.8%) Asian; 3.0% (3.9%) Hispanic; 2.3% (1.3%) of multiple race or ethnicity; 0.3% (0.4%) non-resident alien; 83.6% (83.0%) white; and 3.0% (4.1%) unknown or other ethnicity. There were no students of Hawaiian or Pacific Island ethnicity in the sample, although 0.1% of all students describe themselves as being of this ethnicity (UNCW OIRA, 2009). The only group that was not representative of all UNCW students was transfer students; it is to be expected that transfer students would not be represented proportionally in basic studies courses.

For those students with SAT score information (223), the mean Total SAT score was 1145.6 (the overall undergraduate mean for AY 2009-2010 was 1166), the mean SAT Math was 583.7 (589), and the mean SAT Verbal was 561.9 (577). For those who took the ACT college placement test (77), the mean composite score was 23.5, which, like the SAT scores, is just slightly below the 50th percentile for Fall 2009 freshmen (UNCW OIRA, 2010).
The mean total number of credit hours students had completed prior to Spring 2010, 44.9, was skewed by a number of outliers with well over 120 total hours (the maximum was 191). The median number of hours, including both UNCW and transfer hours, was 37. The median number of UNCW hours was 16 (mean 33.8), and the median number of transfer hours was 3 (mean 11.1). Broken down into groups, 43.0% had completed between 0 and 29 hours, 32.9% between 30 and 59 hours, 8.7% between 60 and 89 hours, and 15.4% had completed 90 or more hours.

THOUGHTFUL EXPRESSION (WRITTEN COMMUNICATION)

At the basic studies level, the UNCW Thoughtful Expression Learning Goal is for students to demonstrate an ability to express meaningful ideas in writing. For purposes of this Learning Goal, "Thoughtful Expression is the ability to communicate meaningful ideas in an organized, reasoned and convincing manner. Thoughtful expression involves a purpose responsive to an identified audience, effective organization, insightful reasoning and supporting detail, style appropriate to the relevant discipline, purposeful use of sources and evidence, and error-free syntax and mechanics" (UNCW Learning Goals, 2009). The VALUE Written Communication rubric contains five dimensions that are aligned with the UNCW description of Thoughtful Expression.

SUMMARY OF SCORES BY DIMENSION

Four faculty members scored 116 work products from four courses: ENG 201, FST 210, MUS 115, and PSY 105. Nineteen work products (16.4%) were scored by multiple scorers. Table 1 provides summary information for all work products.

Table 1 Written Communication Score Results
(counts, with percent of the 116 work products; 0 = below benchmark, 1 = Benchmark, 2 and 3 = Milestones, 4 = Capstone, NA = not applicable)

Dimension                                    0           1           2           3           4           NA
WC1 Context of and Purpose for Writing      1 (0.9%)    17 (14.7%)  44 (37.9%)  36 (31.0%)  18 (15.5%)  0 (0.0%)
WC2 Content Development                     4 (3.4%)    23 (19.8%)  43 (37.1%)  36 (31.0%)  10 (8.6%)   0 (0.0%)
WC3 Genre and Disciplinary Conventions      5 (4.3%)    25 (21.6%)  48 (41.4%)  34 (29.3%)  4 (3.4%)    0 (0.0%)
WC4 Sources and Evidence                    23 (19.8%)  18 (15.5%)  24 (20.7%)  45 (38.8%)  6 (5.2%)    0 (0.0%)
WC5 Control of Syntax and Mechanics         1 (0.9%)    26 (22.4%)  40 (34.5%)  42 (36.2%)  7 (6.0%)    0 (0.0%)

All assignments were scored on each dimension (no dimension was considered not applicable for any assignment). Figure 1 and Table 2 provide additional illustration of the score distributions for each dimension.

WRITTEN COMMUNICATION RESULTS BY DIMENSION

Figure 1 Distribution of Scores for Written Communication

Table 2 Distribution of Scores for Written Communication

              WC1     WC2     WC3     WC4     WC5
0             0.9%    3.4%    4.3%    19.8%   0.9%
1             14.7%   19.8%   21.6%   15.5%   22.4%
2             37.9%   37.1%   41.4%   20.7%   34.5%
3             31.0%   31.0%   29.3%   38.8%   36.2%
4             15.5%   8.6%    3.4%    5.2%    6.0%
25th %tile    2       2       1       1       2
50th %tile    2       2       2       2       2
75th %tile    3       3       3       3       3
Mode          2       2       2       3       3

RESULTS BY DIMENSION

WC1 Context of and Purpose for Writing

This dimension was the highest scoring Written Communication dimension. Less than one percent of the work products demonstrated no attention to context, audience, purpose, and to the assigned task (scores of 0). One in seven work products demonstrated minimal attention to context, audience, purpose, and to the assigned task (scores of 1). Over one third of work products demonstrated awareness of the context, audience, purpose, and assigned task (scores of 2). Three in ten work products demonstrated adequate consideration of context, audience, and purpose, and a clear focus on the assigned task (scores of 3).
One in seven work products demonstrated a thorough understanding of context, audience, and purpose that was responsive to the assigned task and focused all elements of the work (scores of 4).

WC2 Content Development

Less than one in twenty work products demonstrated no content development (scores of 0). One in five work products used appropriate and relevant content to develop simple ideas in some parts of the work (scores of 1). Over one third of the work products used appropriate and relevant content to develop and explore ideas through the work (scores of 2). Two in five work products used appropriate, relevant, and compelling content to explore ideas within the context of the discipline (scores of 3 and 4).

WC3 Genre and Disciplinary Conventions

With the exception of WC4 Sources and Evidence, scores on this dimension were the lowest. This dimension was the most problematic for scorers, requiring a number of assumptions. All teams had questions as to where to score use of citations, and all decided to score this under disciplinary conventions. One in twenty work products demonstrated no attempt to use a consistent system for organization and presentation (scores of 0). One in five work products demonstrated an attempt to use a consistent system for basic organization and presentation (scores of 1). Two in five work products followed expectations appropriate to the specific writing task for basic organization, content, and presentation (scores of 2). One third of work products demonstrated consistent use of important conventions particular to the writing task, including stylistic choices (scores of 3 and 4).

WC4 Sources and Evidence

The scores on this dimension were the lowest of all dimensions, showing very mixed results, with a large portion of students scoring 0. However, 18 of the 23 scores of zero came from one assignment, an in-class compare and contrast essay that did not specifically ask for examples. Including those work products, one in five demonstrated no attempt to use sources to support ideas (scores of 0). One in seven work products demonstrated an attempt to use sources to support ideas (scores of 1). One in five work products demonstrated an attempt to use credible and/or relevant sources to support ideas that were appropriate to the task (scores of 2). More than two in five work products demonstrated consistent use of credible, relevant sources to support ideas (scores of 3 and 4).

WC5 Control of Syntax and Mechanics

Less than one percent of the work products did not meet the level 1 benchmark (scores of 0). Almost one fourth of work products used language that sometimes impeded meaning because of errors in usage (scores of 1). Over one third of work products used language that generally conveyed meaning with clarity, although the writing included some errors (scores of 2). Over one third of work products used straightforward language that generally conveyed meaning, with few errors (scores of 3). One in twenty work products used graceful language that skillfully communicated meaning with clarity and fluency, with virtually no errors (scores of 4).

CORRELATION BETWEEN DIMENSIONS

All dimension scores were significantly correlated with each other at the .01 level, except for the correlation between WC4 Sources and Evidence and WC5 Control of Syntax and Mechanics, which was significant at the .05 level. The magnitudes of the correlations range from .193 to .671, with the highest correlation between WC2 Content Development and WC4 Sources and Evidence.
This finding seems appropriate, as content development requires the use of appropriate and relevant content, or sources and evidence. See appendix Table E1 for a complete presentation of correlation coefficients. These large and statistically significant correlations between the scores on each dimension of the rubric might suggest some "cross scoring," or lack of independent scoring on the part of the scorers. They may, however, also simply represent the interdependence of the components of writing.

DEMOGRAPHIC AND PREPAREDNESS FINDINGS

The most notable finding related to demographic variables is that there were differences between the score distributions for males and females, and two of these differences were statistically significant. Table 3 below illustrates these distributions. There were statistically significant differences between the distributions of scores for males and females for the dimensions WC2 Content Development and WC5 Control of Syntax and Mechanics, with females scoring higher. Although not statistically significant, the distributions for the rest of the dimensions also showed females scoring higher than males. (Although only percentages are given, the sample contained an equal number of work products written by females and males, 58 each.) Many research studies have demonstrated that females score much higher than males on verbal fluency and basic writing skills (Hockenbury and Hockenbury, 2006, p. 426), so these results should come as no surprise. Keep in mind that scoring was done anonymously.

Table 3 Distribution of Scores by Gender

Dimension           0        1        2        3        4        Chi-square test for independence
WC1    Female       0.0%     6.9%     37.9%    36.2%    19.0%    7.321
       Male         1.7%     22.4%    37.9%    25.9%    12.1%
WC2    Female       1.7%     5.2%     50.0%    34.5%    8.6%     19.242**
       Male         5.2%     34.5%    24.1%    27.6%    8.6%
WC3    Female       3.4%     13.8%    43.1%    36.2%    3.4%     5.406
       Male         5.2%     29.3%    39.7%    22.4%    3.4%
WC4    Female       17.2%    10.3%    19.0%    48.3%    5.2%     5.247
       Male         22.4%    20.7%    22.4%    29.3%    5.2%
WC5    Female       0.0%     13.8%    37.9%    37.9%    10.3%    8.913*
       Male         1.7%     31.0%    31.0%    34.5%    1.7%
* Statistically significant at the .05 level
** Statistically significant at the .01 level

There was a significant positive correlation between the number of credit hours completed and the scores on WC2 Content Development (.312**), WC3 Genre and Disciplinary Conventions (.247**), and WC4 Sources and Evidence (.255**). The lack of significant correlation between the number of hours completed and WC1 Context of and Purpose for Writing and WC5 Control of Syntax and Mechanics may indicate that these dimensions of writing are not being addressed for improvement as much as the other three. Despite the fact that scores increased as credit hours increased, there were still a substantial number of students who had already completed 90 or more credit hours and did not score at a level of 3 or 4 (45.5% for WC1, 45.5% for WC2, 54.6% for WC3, 36.4% for WC4, and 57.8% for WC5).

With regard to college entrance test scores, the only significant correlation was between SAT Math scores and WC5 Control of Syntax and Mechanics (.231*). There were no statistically significant differences in the score distributions between transfer students and students who entered as freshmen, or between honors and non-honors students. Due to the small (though representative) number of students from each of the race/ethnicity categories other than white, no analysis on this variable was done.

COMPARISON BETWEEN COURSES AND ASSIGNMENTS

Scores by subject are provided in appendix Table C1. Analysis of scores separated into the four subjects did not result in significant differences in the distribution of scores, except for WC4, where the Film Studies assignment scored significantly lower than all other assignments. This is most likely due to the wording of the question.
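The gender comparisons in Table 3, like the course and assignment comparisons in this section, rest on chi-square tests of independence applied to the score distributions. The sketch below shows the general form of such a test, assuming SciPy is available; the counts are illustrative, not the actual scores.

```python
from scipy.stats import chi2_contingency

# Contingency table: rows are groups (e.g., female, male), columns are
# rubric levels 0-4. Counts are made up for illustration.
observed = [
    [1, 3, 29, 20, 5],   # group 1
    [3, 20, 14, 16, 5],  # group 2
]

chi2, p, dof, expected = chi2_contingency(observed)
print(f"chi-square = {chi2:.3f}, df = {dof}, p = {p:.4f}")
```

A small p-value indicates that the two groups' score distributions differ more than chance alone would suggest.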
Analysis of scores separated into the four subjects did not result in significant differences in the distribution of scores, except for WC4. There was a significant difference in the distribution of scores on the Film Studies assignment compared to all other assignments, with the Film Studies scores much lower. This is most likely due to the wording of the question. 16 There were differences in the distributions of scores for most of the dimensions based on the type of assignment; for three dimensions the differences were statistically significant (WC1, WC2, and WC4). Scores were higher on out-of-class term papers for all dimensions except WC5 Control of Syntax and Mechanics. While the differences for WC5 was not significant, the results were contrary to expectations. INQUIRY At the basic studies level, the UNCW Inquiry Learning Goal is for students to practice rigorous, open-minded and imaginative inquiry. For purposes of this Learning Goals, “Inquiry is the systematic and analytic investigation of an issue or problem with the goal of discovery. Inquiry involves the clear statement of the problem, issue or question to be investigated; examination of relevant existing knowledge; design of an investigation process; analysis of the complexities of the problem, clear rationale supporting conclusions; and identification of limitations of the analysis” (UNCW Learning Goals, 2009). The VALUE Inquiry and Analysis rubric contains six dimensions that are aligned with the UNCW description of Inquiry. SUMMARY OF SCORES BY DIMENSION Four faculty members scored 98 work products from two courses, ENG 201 and PSY 105. Fifteen work products (15.3%) were scored by multiple scores. Table 4 provides summary information for all work products. 17 Table 4 Inquiry Rubric Score Results Benchmark IN1 Topic Selection IN2 Existing Knowledge, Research, and/or Views IN3 Design Process IN4 Analysis IN5 Conclusions IN6 Limitations and Implications Milestones Capstone 0 1 2 3 4 NA 1 (1.0%) 1 (1.0%) 3 (3.1%) 4 (4.1%) 6 (6.1%) 22 (23.2%) 1 (1.0%) 26 (22.4%) 3 (3.1%) 5 (5.1%) 84 (85.7%) 40 (40.8%) 6 (6.1%) 9 (9.2%) 10 (10.2%) 6 (6.1%) 10 (10.2%) 8 (8.2%) 19 (19.4%) 26 (26.5%) 34 (34.7%) 39 (39.8%) 30 (30.6%) 32 (32.7%) 40 (40.8%) 35 (35.7%) 30 (30.6%) 14 (14.3%) 8 (8.2%) 7 (7.1%) 9 (9.2%) 7 (7.1%) 0 (0.0%) 0 (0.0%) 0 (0.0%) 13 (13.3%) IN1 Topic Selection was judged by scorers as not applicable for four of the five assignments. IN2 Existing Knowledge and IN6 Limitations and Implications were each considered not applicable for one assignment. Figure 2 and Table 5 provide the score distributions for each dimension for work products that were scored on that dimension (i.e., work products in the NA column above are not included). 18 INQUIRY RESULTS BY DIMENSION FOR APPLICABLE SCORES ONLY Figure 2 Distribution of Scores for Inquiry, Applicable Scores Only Table 5 Distribution of Scores for Inquiry, Applicable Scores Only 0 1 2 3 4 th 25 %tile 50th %tile 75th %tile Mode IN1 IN2 IN3 IN4 7.1% 1.7% 6.1% 9.2% 10.2% 7.1% 21.4% 6.9% 10.2% 8.2% 19.4% 30.6% 42.9% 37.9% 34.7% 39.8% 30.6% 37.6% 7.1% 44.8% 40.8% 35.7% 30.6% 16.5% 21.4% 1 8.6% 2 8.2% 2 7.1% 2 9.2% 1 8.2% 1 2 3 2 2 2 2 3 3 3 3 3 3 2 3 3 2 2,3 2 19 IN5 IN6 RESULTS BY DIMENSION IN1 Topic Selection This dimension was not scored for four of the five assignments. Only one English assignment provided students with the opportunity to select their topic. For the rest of the assignments, students were instructed to analyze a given article, looking at particular aspects of the article. 
For the assignment for which this dimension was applicable, one work product did not meet the benchmark score of 1. Three work products identified a topic that was considered too general to be manageable (scores of 1). Six work products identified a manageable topic that left out relevant aspects (scores of 2). Four work products identified a focused and manageable topic that addressed relevant aspects of the topic (scores of 3 and 4). The small number of student work products makes it difficult to make performance comparisons with the other dimensions.

IN2 Existing Knowledge, Research, and/or Views

This dimension was the highest scoring Inquiry dimension. It was scored for all four English assignments, but not for the Psychology assignment. (Although the assignment asked students to summarize the background information within the article, the scorers determined this did not fit the intent of the rubric dimension, which required presenting information from relevant sources with various points of view.) Almost one in ten work products either lacked enough content to be judged (scores of 0) or presented information from irrelevant sources with limited points of view (scores of 1). Over one third of work products presented information from relevant sources representing limited points of view (scores of 2). Over half the work products presented in-depth information from relevant sources representing various points of view (scores of 3 and 4).

IN3 Design Process

This dimension was scored for all five assignments. The methodology that scorers looked for was the methodology provided by the instructor in the assignment directions. About one in twenty work products demonstrated no understanding of the methodology (scores of 0). One in ten work products demonstrated some misunderstanding of the methodology (scores of 1). One in three students utilized a process, but parts of the process were missing or incorrectly developed (scores of 2). Almost half the students utilized a methodology that was appropriately developed, although for most the more subtle elements were missing (scores of 3 and 4).

IN4 Analysis

This dimension was scored for all five assignments. One in ten work products contained no elements of analysis (scores of 0). One in twelve work products did not organize evidence in a way that supported analysis (scores of 1). Two in five work products contained organized evidence, although the organization was not considered effective in revealing patterns, differences, or similarities (scores of 2). Two in five work products contained evidence organized effectively to reveal patterns, differences, or similarities (scores of 3 and 4).

IN5 Conclusions

This dimension was scored for all five assignments. One in ten work products stated no conclusions (scores of 0). One in five work products stated a general conclusion that was ambiguous, illogical, or unsupportable from the inquiry findings (scores of 1). Three in ten work products stated general conclusions that went beyond the scope of the inquiry (scores of 2). Four in ten work products stated conclusions focused solely on and arising specifically from the inquiry findings (scores of 3 and 4).

IN6 Limitations and Implications

The scores on this dimension were the lowest of all Inquiry dimensions. This dimension was scored for four of the five assignments. One in fourteen work products presented no limitations and implications (scores of 0).
Three in ten work products presented limitations and implications that were irrelevant and unsupported (scores of 1). Over one third of work products presented relevant and supported limitations and implications (scores of 2). One fourth of work products discussed relevant and supported limitations and implications (scores of 3 and 4).

CORRELATION BETWEEN DIMENSIONS

All dimensions of Inquiry were significantly correlated with each other except IN1 Topic Selection. For this dimension, all correlation coefficients were positive, but the only statistically significant correlation was with IN5 Conclusions. The lack of significance is probably due to the fact that only 14 work products were scored on this dimension. The strongest correlations were between IN3 and IN4, IN4 and IN5, and IN1 and IN5. See appendix Table E1 for a complete presentation of correlation coefficients. It should be no surprise that the components of inquiry are highly correlated. These large and statistically significant correlations between the scores on each dimension of the rubric might also suggest some "cross scoring," or lack of independent scoring on the part of the scorers.

DEMOGRAPHIC AND PREPAREDNESS FINDINGS

There was no significant difference in the distribution of scores between males and females on any of the dimensions. For IN1 Topic Selection the distributions were virtually the same. For IN5 Conclusions the distributions were somewhat different, with larger percentages of male scores in the lower levels and larger percentages of female scores in the higher levels; but even here, the significance level on all measures of comparison was only .085.

There was no significant difference in the distribution of scores between transfer students and students who entered as freshmen. Likewise, there was no significant difference in the distribution of scores between honors students and non-honors students. However, the sample size for each of these was extremely low (no more than 10 transfer students and 4 honors students were scored on each dimension). Due to the small (though representative) number of students from each of the race/ethnicity categories other than white (no more than 10 for each category), no analysis on this variable was done.

There were statistically significant, though not large, positive correlations between the number of credit hours completed and IN3 (.275**), IN4 (.279**), IN5 (.322**), and IN6 (.278*). This finding is what we would hope for. The correlation between IN2 Existing Knowledge, which marked the highest scores, and the number of credit hours completed was small (.053) and not significantly different from zero. There were no significant correlations between the dimension scores and ACT, SAT Verbal, or SAT Math scores.

COMPARISON BETWEEN COURSES AND ASSIGNMENTS

Psychology work products were not scored on dimensions IN1 and IN2. Scores for the other four dimensions were significantly higher on the English composition papers than on the Psychology papers (see appendix Table C2). The percentages of 2's and 3's were about the same, but there were no scores of 4 on the Psychology papers, and more 0's and 1's. All assignments were out-of-class term papers, hence there is no comparison between assignment types.

CRITICAL THINKING

At the basic studies level, the UNCW Critical Thinking Learning Goal is for students to use multiple methods and perspectives to critically examine complex problems.
For purposes of this Learning Goal, "Critical Thinking is 'skilled, active interpretation and evaluation of observations, communications, information and argumentation' (Fisher and Scriven, 1997). Critical thinking involves a clear explanation of relevant issues, skillful investigation of evidence, purposeful judgments about the influence of context or assumptions, reasoned creation of one's own perspective, and synthesis of evidence and implications from which conclusions are drawn" (UNCW Learning Goals, 2009). The VALUE Critical Thinking rubric contains five dimensions that are aligned with the UNCW description of Critical Thinking.

SUMMARY OF SCORES BY DIMENSION

A total of 183 student work products from two Music 115, three Psychology 105, and three Sociology 105 sections were scored by seven scorers. Twenty-two work products (12.0%) were scored by multiple scorers. Table 6 provides summary information for all work products.

Table 6 Critical Thinking Score Results
(counts, with percent of the 183 work products; 0 = below benchmark, 1 = Benchmark, 2 and 3 = Milestones, 4 = Capstone, NA = not applicable)

Dimension                                   0           1           2           3           4          NA
CT1 Explanation of Issues                   12 (6.6%)   40 (21.9%)  42 (23.0%)  40 (21.9%)  11 (6.0%)  38 (20.8%)
CT2 Evidence                                13 (7.1%)   44 (24.0%)  60 (32.8%)  39 (21.3%)  7 (3.8%)   20 (10.9%)
CT3 Influence of Context and Assumptions    31 (16.9%)  40 (21.9%)  35 (19.1%)  13 (7.1%)   0 (0.0%)   64 (35.0%)
CT4 Student's Position                      20 (10.9%)  49 (26.8%)  47 (25.7%)  27 (14.8%)  1 (0.5%)   39 (21.3%)
CT5 Conclusions and Related Outcomes        39 (21.3%)  41 (22.4%)  33 (18.0%)  9 (4.9%)    5 (2.7%)   56 (30.6%)

Each of the five dimensions was judged by the scorers as not applicable for at least one of the assignments. CT3 Influence of Context and Assumptions and CT5 Conclusions and Related Outcomes were not applicable for three assignments; CT1 Explanation of Issues and CT4 Student's Position were not applicable for two assignments. Figure 3 and Table 7 provide the score distributions for each dimension for work products that were scored on that dimension (i.e., work products in the NA column above are not included).

CRITICAL THINKING RESULTS BY DIMENSION FOR APPLICABLE SCORES ONLY

Figure 3 Distribution of Scores for Critical Thinking, Applicable Scores Only

Table 7 Distribution of Scores for Critical Thinking, Applicable Scores Only

              CT1     CT2     CT3     CT4     CT5
0             8.3%    8.0%    26.1%   13.9%   30.7%
1             27.6%   27.0%   33.6%   34.0%   32.3%
2             29.0%   36.8%   29.4%   32.6%   26.0%
3             27.6%   23.9%   10.9%   18.8%   7.1%
4             7.6%    4.3%    0.0%    0.7%    3.9%
25th %tile    1       1       0       1       0
50th %tile    2       2       1       2       1
75th %tile    3       3       2       2       2
Mode          2       2       1       1       1

RESULTS BY DIMENSION

CT1 Explanation of Issues

This dimension was scored for six of the eight assignments. Scores on this dimension were the highest of all dimensions of Critical Thinking (along with CT2 Evidence). Less than one in ten work products provided no explanation of the issue (scores of 0). Over one fourth of work products stated the issue or problem with no clarification (scores of 1). Over one third of work products stated the issue or problem, but left some points ambiguous (scores of 2). One in three work products stated, described, and clarified the issue or problem (scores of 3 and 4).

CT2 Evidence

This dimension was scored for seven of the eight assignments. Scores on this dimension were the highest of all dimensions of Critical Thinking (along with CT1 Explanation of Issues). Over one third of the work products provided no evidence (scores of 0) or provided evidence that was taken from sources without evaluation of relevance or factualness (scores of 1).
Over one third of students provided evidence with some interpretation, although the viewpoints of authors were taken as fact, with little questioning (scores of 2). About three in ten work products provided evidence that was interpreted, analyzed or synthesized, and evaluated as to factualness (scores of 3 and 4).

CT3 Influence of Context and Assumptions

This dimension was scored for five of the eight assignments. Scores on this dimension were among the lowest (along with CT5 Conclusions and Related Outcomes). One fourth of the work products demonstrated no awareness of assumptions (scores of 0). Over one third of work products showed an emerging awareness of assumptions and some identification of context (scores of 1). Three in ten work products questioned some assumptions (but overlooked others) and identified some relevant context (scores of 2). One in ten work products identified the student's own and others' assumptions as well as several relevant contexts (scores of 3). There were no scores of 4.

CT4 Student's Position

This dimension was scored for six of the eight assignments. One in seven work products contained no statement of student position (scores of 0). One third of the work products provided a simplistic or obvious position (scores of 1). One third of the work products provided a specific position that acknowledged different sides of an issue (scores of 2). One in ten work products not only acknowledged different sides of an issue, but incorporated those positions and took into account the complexities of the issue (scores of 3 and 4).

CT5 Conclusions and Related Outcomes

This dimension was scored for five of the eight assignments. Scores were the lowest on this dimension. One in three work products provided no conclusions (scores of 0). Almost one third of work products provided oversimplified outcomes and conclusions that were inconsistently tied to some of the information discussed (scores of 1). Approximately one fourth of work products provided conclusions that were logically tied to information (because information was chosen to fit the conclusion) and identified some related outcomes clearly (scores of 2). Approximately one in ten work products provided conclusions logically tied to a range of information, including opposing viewpoints, and identified related outcomes clearly (scores of 3 and 4).

CORRELATION BETWEEN DIMENSIONS

All dimensions of Critical Thinking were significantly correlated with each other. The correlation coefficients range in magnitude from .247 to .692. See appendix Table E1 for a complete presentation of correlation coefficients. It should come as no surprise that the components of critical thinking are highly correlated. The fairly large values of the correlation coefficients, and the fact that they are all statistically significant, might also point to some "cross scoring," or lack of independence in the scoring of each dimension.

DEMOGRAPHIC AND PREPAREDNESS FINDINGS

The following statistically significant correlations were found: the number of hours completed was negatively correlated with CT5 (-.320**); SAT Verbal was positively correlated with CT3 (.271**), CT4 (.261**), and CT5 (.306**); and the ACT composite score was positively correlated with CT4 (.442**). There were no significant differences in the distributions of scores between genders, between transfer students and students starting as freshmen, or between honors and non-honors students.
There were only 5 honors students in this part of the sample, and their scores were distributed across the scale. Due to the small (though representative) number of students from each of the race/ethnicity categories other than white (no more than 10 for each category), no analysis on this variable was done.

COMPARISON BETWEEN COURSES AND ASSIGNMENTS

Scores by subject are provided in appendix Table C3. There were statistically significant differences between courses on all dimensions except CT4. In each case, the score distributions were higher for MUS115 than for SOC105 and PSY105 (PSY105 was not scored on CT3 and CT5). There were differences in the distributions of scores for most of the dimensions based on the type of assignment; for two dimensions the differences were statistically significant (CT3 and CT5). Scores were higher on out-of-class term papers for all dimensions except CT1 Explanation of Issues.

SOCIAL SCIENCE FOUNDATIONAL KNOWLEDGE

At the basic studies level, the UNCW Foundational Knowledge Learning Goal is for students to acquire foundational knowledge, theories and perspectives in a variety of disciplines. For purposes of this Learning Goal, "Foundational knowledge comprises the facts, theories, principles, methods, skills, terminology and modes of reasoning that are essential to more advanced or independent learning in an academic discipline" (UNCW Learning Goals, 2009). A locally created rubric for assessing Foundational Knowledge in the Social Sciences, based on the Social Science Component student learning outcomes, was piloted in Spring 2010.

SUMMARY OF SCORES BY DIMENSION

A total of 45 student work products from two Sociology 105 online sections were scored by one scorer. Table 8 provides summary information for the work products.

Table 8 Foundational Knowledge Score Results

                                                               0          1           2           3          4         NA
FK1 Use of Discipline Terminology                              0 (0.0%)   27 (60.0%)  10 (22.2%)  7 (15.6%)  1 (2.2%)  0 (0.0%)
FK2 Explanation and Understanding of Concepts and Principles   0 (0.0%)   27 (60.0%)  11 (24.4%)  5 (11.1%)  2 (4.4%)  0 (0.0%)

RESULTS BY DIMENSION

FK1 Use of Discipline Terminology
In three of five work products, the meaning of the discourse was unclear, and/or attempts to use terminology were inaccurate or inappropriate to the context (scores of 1). Over one in five work products conveyed meaning, although not always using appropriate terminology (scores of 2). Not quite one in five work products conveyed meaning by using all relevant terminology appropriately (scores of 3 and 4).

FK2 Explanation and Understanding of Concepts and Principles
Three in five work products demonstrated an attempt to describe or explain concepts or principles, but those descriptions or explanations were too vague or simplistic (scores of 1). One fourth of the work products explained concepts and principles at a basic level, but left out important information or connections (scores of 2). One in seven work products accurately explained the concepts and principles within the context of the situation (scores of 3 and 4).

CORRELATION BETWEEN DIMENSIONS

There was a very high, statistically significant correlation between the two dimensions of the rubric (.933**). This might indicate some "cross scoring," or lack of independence in scoring the two dimensions. Another possibility is that the two dimensions are assessing almost the same thing.

DEMOGRAPHIC AND PREPAREDNESS FINDINGS

There were no significant differences between the distributions of scores for males and females on either dimension.
In addition, there were no correlations that were significantly different from zero between either dimension and credit hours completed, or ACT or SAT Math scores. However, there was a statistically significant correlation between FK1 Use of Discipline Terminology and students' SAT Verbal scores (.354*).

COMPARISON OF SCORES FROM TWO RUBRICS

Assignments often require the demonstration of skills related to multiple learning goals, such as written communication and critical thinking, or written communication and inquiry and analysis. Seven instructors suggested two rubrics as fitting their assignments, and six assignments were scored with two rubrics: 27 work products from two sections were scored using both the Written Communication and Inquiry rubrics, 37 work products from two sections were scored using both the Written Communication and Critical Thinking rubrics, 40 work products from one section were scored using both the Inquiry and Critical Thinking rubrics, and 45 work products from two sections (one assignment) were scored using both the Critical Thinking and Foundational Knowledge rubrics. Each work product was scored on the two rubrics by a separate set of scorers trained on the rubric.

Any comparison of scores across rubrics must be done with caution, and should only serve as a starting place for further investigation. This is because the criteria for each level of the rubric cannot be assumed to be scaled the same. For example, Level 2 cannot be considered to be in the identical place on a scale of abilities for each of the rubrics. With this in mind, the review of the distribution of scores for work products scored with two rubrics provides the following observations:

• For those work products scored with both the Written Communication and Inquiry rubrics, there was no dimension that stood out as relatively stronger than the others, while those of WC5 Control of Syntax and Mechanics, WC3 Genre and Disciplinary Conventions, and IN1 Topic Selection were relatively low.

• For those work products scored with both the Written Communication and Critical Thinking rubrics, the distributions of scores on WC1 Context and Purpose for Writing and CT1 Explanation of Issues were relatively strong, while those of CT3 Influence of Context and Assumptions and CT5 Conclusions and Related Outcomes were relatively low.

• For those work products scored with both the Inquiry and Critical Thinking rubrics, there was no dimension that stood out as relatively stronger or weaker than the others.

• For those work products scored with both the Critical Thinking and Foundational Knowledge rubrics, there was no dimension that stood out as relatively stronger than the others, while that of CT5 Conclusions and Related Outcomes was relatively low.

For the most part, these results are consistent with what you would see by comparing the results of all work products scored with each rubric. An exception to this is point three, where scores for this set of work products were about the same for Inquiry and Critical Thinking. A review of the scores of all work products scored on Critical Thinking and those scored on Inquiry would suggest that student skills in inquiry are strong relative to critical thinking. The fact that there was only one assignment in this sample makes any conclusions based solely on it unjustified.

Correlation was tested between the scores on all the dimensions from each rubric. Three fourths of the 95 correlations were not significantly different from zero at the .05 level.
See appendix Tables E1 and E2 for complete correlation tables. The statistically significant correlations are presented in Table 9.

Table 9 List of Statistically Significant Correlations across Rubrics

Dimensions    Spearman's Rho    Sample Size        Dimensions    Spearman's Rho    Sample Size
CT1 – IN3     .529**            40                 WC1 – CT1     .576*             19
CT1 – IN4     .506**            40                 WC1 – IN1     -.569**           14
CT1 – IN5     .363*             40                 WC1 – IN2     -.425*            27
CT1 – IN6     .434**            40                 WC5 – IN4     .533**            27
CT2 – IN6     .399*             40                 WC5 – IN6     .712**            14
CT5 – IN6     .420**            40                 WC5 – IN5     .427*             27
                                                   WC5 – CT5     -.624**           18
FK1 – CT1     .471**            45                 FK2 – CT1     .455**            45
FK1 – CT2     .372*             45                 FK2 – CT2     .340*             45
FK1 – CT3     .398**            45                 FK2 – CT3     .347*             45
FK1 – CT4     .484**            45                 FK2 – CT4     .474**            45
FK1 – CT5     .336*             45                 FK2 – CT5     .323*             45

*Statistically significant at the .05 level
**Statistically significant at the .01 level

There were six significant correlations between the dimensions of Critical Thinking and the dimensions of Inquiry and Analysis, all positive. CT1 Explanation of Issues was correlated with four of the six Inquiry dimensions: IN3 Design Process, IN4 Analysis, IN5 Conclusions, and IN6 Limitations and Implications. (Note that there are no correlation coefficients between CT1 and the other two dimensions of Inquiry because those two dimensions were considered by the scorers to be not applicable to the assignment.) CT2 Evidence was positively correlated with IN6 Limitations and Implications. And CT5 Conclusions and Related Outcomes (implications and consequences) was positively correlated with IN6 Limitations and Implications; it is actually surprising that this correlation was not higher. The skills of Critical Thinking and Inquiry and Analysis overlap to a large extent, so these results are not surprising.

The correlations between Written Communication and the other rubrics are very inconsistent and difficult to interpret. It is not surprising that WC1 Context of and Purpose for Writing is positively correlated with CT1 Explanation of Issues, which requires a description of the issue or problem to be considered. It is very surprising, however, that WC1 Context and Purpose for Writing is negatively correlated with IN1 Topic Selection. Further review of the 14 work products in this subsample shows extremely high scores on WC1 compared to the rest of the sample. This sample of 14 work products from one assignment is too small to draw any conclusions from. It is also interesting that WC1 is negatively correlated with IN2 Existing Knowledge, Research, and/or Views, as IN2 and WC2 Content Development are overlapping concepts, and WC1 and WC2 are highly correlated. In fact, it is interesting that WC2 and IN2 are not significantly correlated (the correlation coefficient is negative, but not significantly different from zero). As above, the number of work products is too small to draw any conclusions from.

WC5 Control of Syntax and Mechanics is correlated with four dimensions from the other two rubrics, one correlation being negative. Any significant positive correlation could be an indication that scorers, while not looking specifically for control of syntax and mechanics, are biased by the overall quality of use of language. The fact that one of the correlations is negative (with CT5 Conclusions and Related Outcomes), and the fact that the other three Inquiry dimensions are not significantly correlated with WC5, suggests that this is probably not the case. All of the correlation coefficients between Critical Thinking and Foundational Knowledge were significantly different from zero.
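As an illustration of the statistical machinery behind Table 9, the following is a minimal sketch of how a single cross-rubric Spearman correlation could be computed. The score lists are hypothetical, not the report's actual data; in practice each list would hold one dimension's 0–4 rubric scores for the same set of work products, with not-applicable cases removed.

```python
# Minimal sketch of a cross-rubric rank correlation (hypothetical scores).
from scipy.stats import spearmanr

ct1_scores = [2, 1, 3, 2, 0, 2, 3, 1, 2, 4]  # hypothetical CT1 scores
in3_scores = [2, 2, 3, 1, 1, 2, 4, 0, 2, 3]  # hypothetical IN3 scores, same work products

# Spearman's rho is appropriate here because rubric levels are ordinal data.
rho, p_value = spearmanr(ct1_scores, in3_scores)
print(f"Spearman's rho = {rho:.3f}, p = {p_value:.3f}")
# A correlation is flagged * or ** when p_value falls below .05 or .01.
```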
RELIABILITY OF SCORES

A NOTE ABOUT VALIDITY

Reliability is a necessary, but not sufficient, condition for validity. According to standards published jointly by the American Educational Research Association (AERA), the American Psychological Association (APA), and the National Council on Measurement in Education (NCME), validity is the degree to which evidence and theory support the interpretation of test scores for the proposed use (AERA, 1999). The VALUE rubrics were recommended as valid means to assess specific UNCW Learning Goals because (1) they align directly to the definitions of Thoughtful Expression, Inquiry, and Critical Thinking adopted by UNCW, and (2) according to the AAC&U developers, "The rubric development teams relied on existing campus rubrics when available, other organizational statements on outcomes, experts in the respective fields and faculty feedback from campuses throughout the process. Each VALUE rubric contains the most common and broadly shared criteria or core characteristics considered critical for judging the quality of student work in that outcome area" (AAC&U, 2010). The Social Sciences Foundational Knowledge rubric also aligns with the definitions within the UNCW Learning Goals, as well as the social science component student learning outcomes. The rubric does, however, require additional vetting by faculty before it can meet validity standards.

MEASURING RELIABILITY

Even when a means of assessment, such as a rubric, is considered to be valid, we must also verify that it produces reliable scores; in this case the relevant measure is interrater reliability. The Spring 2010 scoring event was the first use of the four rubrics for the 15 scorers. Details about how scorers were normed were given in the Methodology chapter of this report. Briefly, scorer norming consisted of two stages. First, each scorer attended a two-hour workshop at which the rubric was reviewed and three to four student work products were scored and discussed. Second, on the day of scoring, scorers worked in groups of 2, 3, or 4. They began the scoring process for each assignment packet by scoring and discussing one common work product from their packets, and created additional scoring guidelines specific to that assignment, if necessary. There were a number of additional common student work products in each packet so that interrater reliability could be assessed. Only the independently scored work products were used to measure interrater reliability (as scorers came to consensus on all discussed papers).

Interrater reliability is a measure of the degree of agreement between scorers, and provides information about the trustworthiness of the data. It helps answer the question: Would a different set of scorers at a different time arrive at the same conclusions? In practice, interrater reliability is enhanced over time through scorer discussion, as well as through improvements to the scoring instructions (rubric).

There is much debate about the best means of measuring interrater reliability, and many measures are in use. Some differences in the measures are due to the types of data (nominal, ordinal, or interval data). Other differences have to do with what is actually being measured. Correlation coefficients describe consistency between scorers. For example, if Scorer 1 always scored work products one level higher than Scorer 2, there would be perfect correlation between them: you could always predict one scorer's score by knowing the other's score.
A correlation does not, however, yield any information about agreement. A value of 0 for a correlation coefficient indicates no association between the scores, and a value of 1 indicates complete association. Spearman's rho rank order correlation coefficient is an appropriate correlation coefficient for ordinal data.

Percent agreement measures exactly that—the percentage of scores that are exactly the same. It does not, however, account for chance agreement. Percent adjacent measures the number of times the scores were exactly the same plus the number of times the scores were only one level different. Percent adjacent lets the researcher know how often there is major disagreement between the scorers on the quality of the artifact. If percent adjacent is far from 100%, the rubric should be reevaluated and/or additional norming may be required. Krippendorff's alpha is a measure of agreement that accounts for chance agreement. It can be used with ordinal data, small samples, and with scoring practices where there are multiple scorers. A value of 0 for alpha indicates only chance agreement, and a value of 1 indicates reliable agreement not based on chance. Each of these measures of reliability is provided in Table 10.

For Written Communication, 19 work products were double scored. Six of those work products were discussed, leaving a sample of 13 (11.2%) for testing interrater reliability. For Inquiry, 15 work products were double scored. Four of those work products were discussed, leaving a sample of 11 (11.2%) for testing interrater reliability. For Critical Thinking, 22 work products were double, triple, or quadruple scored. Seven of those work products were discussed, leaving a sample of 15 (8.2%) for testing interrater reliability. (Since in many cases each Critical Thinking artifact was scored by three or four scorers, the number of cases to check was larger.) Table 10 provides the results of the various interrater reliability measures for each dimension.

Table 10 Interrater Reliability

                                              Percent      Percent Agreement       Krippendorff's   Spearman's
                                              Agreement    Plus Percent Adjacent   Alpha            Rho
Written Communication
WC1 Context of and Purpose for Writing        46.2%        84.6%                   0.242            0.381
WC2 Content Development                       38.5%        84.6%                   0.397            0.427
WC3 Genre and Disciplinary Conventions        53.8%        100.0%                  0.550            0.588*
WC4 Sources and Evidence                      46.2%        84.6%                   0.377            0.377
WC5 Control of Syntax and Mechanics           41.2%        100.0%                  0.259            0.284
Inquiry and Analysis
IN1 Topic Selection(1)                        —            —                       —                —
IN2 Existing Knowledge, Research, Views       62.5%        87.5%                   0.396            0.497
IN3 Design Process                            63.6%        100.0%                  0.503            0.561
IN4 Analysis                                  81.8%        90.9%                   0.642            0.621*
IN5 Conclusions                               63.6%        90.9%                   0.588            0.600
IN6 Limitations and Implications              55.6%        77.8%                   0.213            0.205
Critical Thinking
CT1 Explanation of Issues                     55.6%        100.0%                  0.841            0.874**
CT2 Evidence                                  40.0%        100.0%                  0.728            0.737**
CT3 Influence of Context and Assumptions      31.3%        81.3%                   0.590            0.594*
CT4 Student's Position                        41.2%        88.2%                   0.549            0.542*
CT5 Conclusions and Related Outcomes          75.0%        93.8%                   0.895            0.909**

(1) Sample was too small to analyze after the removal of assignments for which this dimension was considered Not Applicable.
*Statistically significant at the .05 level
**Statistically significant at the .01 level

Determining acceptable values for interrater reliability measures is not easy. Acceptable levels will depend on the purposes that the results will be used for. These levels must also be chosen in relationship to the type of scoring tool or rubric, and the measure of reliability being used. Before turning to those benchmarks, the sketch below illustrates how the agreement measures themselves are computed.
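The following is a minimal sketch, using hypothetical scores, of the agreement measures reported in Table 10 for a pair of scorers. Percent agreement and percent adjacent are computed directly; Krippendorff's alpha would in practice come from a dedicated implementation (for example, a third-party package), since the chance-correction is more involved.

```python
# Minimal sketch of two-scorer agreement measures (hypothetical scores).
from scipy.stats import spearmanr

scorer_a = [2, 3, 1, 2, 4, 2, 3, 1, 2, 2, 3, 0, 2]  # hypothetical double-scored sample
scorer_b = [2, 2, 1, 3, 4, 2, 3, 2, 2, 1, 3, 0, 3]

n = len(scorer_a)
# Exact agreement: both scorers assigned the same rubric level.
exact = sum(a == b for a, b in zip(scorer_a, scorer_b)) / n
# Adjacent agreement: scores identical or only one level apart.
adjacent = sum(abs(a - b) <= 1 for a, b in zip(scorer_a, scorer_b)) / n
# Consistency (not agreement) between the two scorers' rankings.
rho, _ = spearmanr(scorer_a, scorer_b)

print(f"Percent agreement:                     {exact:.1%}")
print(f"Percent agreement plus percent adjacent: {adjacent:.1%}")
print(f"Spearman's rho:                        {rho:.3f}")
```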
In this case, the tool is a "metarubric," a rubric that is designed to be applied across a broad range of artifacts and contexts. This type of instrument requires more scorer interpretation than rubrics designed for specific assignments. For consistency measures, such as correlation coefficients, Nunnally states in a seminal work that .7 may suffice for some purposes, whereas for other purposes "it is frightening to think that any measurement error is permitted" (Nunnally, 1978, pp. 245-246). The standard set for Krippendorff's alpha by Krippendorff himself is .8, to ensure that the data are at least similarly interpretable by researchers. However, "where only tentative conclusions are acceptable, alpha greater than or equal to .667 may suffice" (Krippendorff, 2004, p. 241). In the present context, we should aim for values of at least .67, with the recognition that this could be difficult given the broad range of artifacts scored with the metarubrics.

Comparing the results of the reliability indices for this study to the benchmark of .67, three Krippendorff's alpha and Spearman's rho coefficients are above .67—CT1, CT2, and CT5. The percent agreement was high for most of the Inquiry dimensions, higher than the alpha coefficient, which is generally the case when there are two scorers. For Critical Thinking, most work products in the sample were scored by three to four scorers, which decreased the percent agreement (across all scorers) but increased alpha and rho. High levels of percent agreement are seen for IN4 and CT5. Five dimensions show percent adjacent at 100%, and only 17 dimension scores out of 202 total dimension scores (8.4%) were more than one level different. An interesting finding is that scorers who worked in groups of 3 and 4 (those scoring the Critical Thinking rubric) had higher interrater reliability as measured by alpha and rho than those working in pairs. Whether this was due to the rubric itself or to working in a larger group cannot be determined without further research.

Overall, these various measures of reliability are promising for the first use of the rubrics at UNCW. They provide some evidence that the scorer norming activities had an effect. They do, however, indicate that more work needs to be done.

SCORER FEEDBACK

All scorers filled out two types of feedback forms. At the end of the day, each scorer filled out a process feedback survey. This survey asked for their opinions about how well each step of the process had gone, and for any recommendations for improvement. During the day, after completing each packet of student work products, each scorer filled out a rubric feedback form. This form asked for information on how well each rubric dimension fit the assignment and student work. It also asked for feedback on the quality criteria for each dimension.

SCORER FEEDBACK ON PROCESS

Table 11 provides the results of the selected-response items on the survey.

Table 11 Scorer Feedback on Process

                                                             Strongly                                            Strongly    Missing
                                                             Agree        Agree       Neutral     Disagree      Disagree    or NA
1. The invitation to volunteer accurately
   described the experience.                                 14 (93.3%)   1 (6.7%)    0 (0.0%)    0 (0.0%)      0 (0.0%)    0 (0.0%)
2. The timing of the invitation gave adequate
   opportunity to arrange for attending
   workshops and scoring.                                    14 (93.3%)   1 (6.7%)    0 (0.0%)    0 (0.0%)      0 (0.0%)    0 (0.0%)
3. The 2-hour norming session adequately
   prepared me for what was expected of me
   during the all-day scoring session.                       11 (73.3%)   2 (13.3%)   1 (6.7%)    1 (6.7%)      0 (0.0%)    0 (0.0%)
4. The all-day scoring session was
   well-organized.                                           13 (86.7%)   1 (6.7%)    1 (6.7%)    0 (0.0%)      0 (0.0%)    0 (0.0%)
5. The structure of the all-day scoring made
   it reasonable to work for the full time.                  12 (80.0%)   2 (13.3%)   1 (6.7%)    0 (0.0%)      0 (0.0%)    0 (0.0%)
6. When I had questions, one of the leaders
   was available to answer it.                               15 (100.0%)  0 (0.0%)    0 (0.0%)    0 (0.0%)      0 (0.0%)    0 (0.0%)
7. When I had questions, the question was
   answered.                                                 13 (86.7%)   2 (13.3%)   0 (0.0%)    0 (0.0%)      0 (0.0%)    0 (0.0%)
8. I was comfortable scoring student work
   products from outside my discipline on the
   broad Learning Goals.                                     5 (33.3%)    4 (26.7%)   3 (20.0%)   1 (6.7%)      0 (0.0%)    2 (13.3%)
9. The process is an appropriate way to assess
   students on the UNCW Learning Goals.                      5 (33.3%)    7 (46.7%)   2 (13.3%)   0 (0.0%)      0 (0.0%)    1 (6.7%)
10. This process is valuable in improving
    student learning.                                        6 (40.0%)    7 (46.7%)   1 (6.7%)    0 (0.0%)      0 (0.0%)    1 (6.7%)
11. I would participate in this process again.               7 (46.7%)    7 (46.7%)   1 (6.7%)    0 (0.0%)      0 (0.0%)    0 (0.0%)
12. I would recommend participating in this
    process to my colleagues.                                7 (46.7%)    7 (46.7%)   1 (6.7%)    0 (0.0%)      0 (0.0%)    0 (0.0%)

There were also three open-ended questions on the survey. The results of these are included in the summary below. The complete results of these questions are provided in Appendix D.

There was a high level of satisfaction with regard to most aspects of the process. The initial contact and explanations about the responsibilities of the volunteers were clear to all. Most scorers responded that the norming session adequately prepared them for the all-day scoring session. Positive comments were that it was a great way to learn, that it provided context, and that it was helpful, and even fun. Although scorers felt prepared to score, they did provide a number of recommendations for improving preparation. Two scorers suggested that an additional session, maybe optional, would have been beneficial.

With regard to the all-day scoring session, scorers were generally satisfied. Eight scorers commented that working with a partner was beneficial. One stated that the initial discussion facilitated uniformity and efficiency. Another said that it allowed them to check themselves and made the exercise enjoyable. Two scorers commented on the value of participating in the scoring process. One stated that it was a valuable way for faculty to share and learn and discuss, and the other stated that s/he learned a lot that will be able to be applied to courses taught. One scorer who scored the same assignment at the beginning of the day and again at the end of the day stated that it was very instructive to revisit the same assignment (most scorers scored three packets from different assignments). With regard to the length of the session, one respondent suggested breaking the scoring day into two sessions. Finally, there was a suggestion to create a set of guidelines for the scoring process.

There were a number of comments related to the rubrics in the comments on the end-of-day survey. Two scorers suggested defining exemplar works to use as guides in the scoring process. One scorer was not sure that the Inquiry rubric could be applied universally. Another suggested refining some parts of the rubrics. Most comments concerned the match between the assignments and rubrics. Four scorers suggested that rubrics and assignments should be synchronized better or more effectively, one suggesting that assignments be written with the rubric in mind.
Another scorer stated that fuller instructions from the professor to the students would improve the responses.

SCORER FEEDBACK ON RUBRIC AND REQUIRED ASSUMPTIONS

Written Communication Rubric
For the most part, the Written Communication Rubric fit the six different assignments well, requiring only a few assumptions. The dimension WC3 Genre and Disciplinary Conventions was the most problematic, requiring a number of assumptions. All teams had questions as to where to score use of citations, and all decided to score this under disciplinary conventions. For scorers who were scoring outside their discipline, lack of knowledge about disciplinary conventions was also an issue. This issue was somewhat abated because there were faculty from four of the five disciplines at the event who could answer questions. The dimension WC4 Sources and Evidence was problematic for one assignment (see the Written Communication Results section). In this case, it was determined by the scorers that the prompt for the assignment could have been improved.

Inquiry and Analysis
Inquiry is approached differently across the disciplines, making it difficult to use a metarubric. However, the way the dimensions are divided places most of these differences into one dimension, IN3 Design Process, and most instructors provided information on the approach to the inquiry in the assignment directions. This dimension required the scorers to make specific guidelines for four of the five assignments. The guideline common to all scoring was that the appropriate methodological approach was that described (or inferred) in the assignment instructions.

One dimension in the rubric stands out as clearly not applicable to most assignments at the basic studies level: IN1 Topic Selection. Only one assignment gave students a clear choice of topics (a second one gave a choice of focus within the topic). Therefore, this dimension was scored as not applicable for four of the five assignments. The dimension IN2 Existing Knowledge, Research, and/or Views fit all assignments except one. Although the assignment asked students to summarize the background information within the article, the scorers determined this did not fit with the intent of the rubric dimension, which required presenting information from relevant sources with various points of view.

The dimensions IN4 Analysis and IN5 Conclusions fit all assignments well. There were a number of comments on the quality criteria for the Conclusions dimension. Scorers felt that it does not follow that the level 2 quality criteria are "better" than the level 1 criteria, or vice versa, for that matter. Because of this flaw, the level 3 criteria were too broad, requiring a score of 3 even if the conclusion was weak. There was one similar comment about the Analysis dimension quality criteria, indicating a large jump between the level 2 and level 3 criteria. The dimension IN6 Limitations and Implications was assumed not to fit one assignment. However, review after the scoring session showed that the assignment did contain a question regarding implications. This illustrates the need to check scorer assumptions before they are used.

Critical Thinking
This rubric was the most difficult for scorers to apply. Some of the difficulty seems to have come from the rubric itself, and some from the course instructors' understanding of the dimensions of the rubric. Most assignments did not fit all dimensions. Each of the five dimensions was judged by the scorers as not applicable for at least one of the assignments.
CT3 Influence of Context and Assumptions and CT5 Conclusions and Related Outcomes were not applicable for three assignments; CT1 Explanation of Issues and CT4 Student's Position were not applicable for two assignments. Most of the assumptions were clarifications of the quality criteria in the scorers' own words. There were a number of assumptions for the dimension CT2 Evidence. Since this dimension has two parts, scorers needed to create rules about what happens when a work product met one part but not the other at any given level.

INSTRUCTOR FEEDBACK

A brief survey was sent to the 13 instructors who provided the student work products. Four responded. The survey was sent at the end of April, right after the all-day scoring session. Instructors were asked to comment on the process of selecting an assignment and the collection process, and to provide any additional comments. All four respondents said that the assignment selection process was not difficult, although one said that he would appreciate a bit more training. Regarding the collection process, all said that it was clear and worked well; three also mentioned that the communication from the beginning of the process until the time of collection was very important. Under other comments, two respondents mentioned that the only suggestion for improvement in the process would be more updates or meetings. No clear signs of difficulties with the process were seen either in the survey or in any of the interactions with the instructors throughout the semester. A portion of the communications, mainly concerning collection of papers, was between a graduate assistant and the instructors, and this worked well.

DISCUSSION, LIMITATIONS AND RECOMMENDATIONS

Returning to the purpose of the general education assessment activities in Spring 2010, what evidence was obtained for the following questions:

• What are the overall abilities of students taking basic studies courses with regard to the UNCW Learning Goals of Thoughtful Expression, Inquiry, and Critical Thinking?
• What are the relative strengths and weaknesses within the subskills of those goals?
• Are there any differences in performance based on demographic and preparedness variables such as gender, race or ethnicity, transfer students vs. freshman admits, total hours completed, or entrance test scores?
• What are the strengths and weaknesses of the assessment process itself?

THOUGHTFUL EXPRESSION (WRITTEN COMMUNICATION)

The median score on all five dimensions of Written Communication was 2, which means that at least half of the student work products sampled demonstrated performance at the first Milestone level (level 2). In fact, for all dimensions, it was substantially over half. Table 12 shows the percent of work products scored at a level 2 or higher and the percent of work products scored at a level 3 or higher for each dimension.

Table 12 Written Communication Percent of Sample Scored at Least 2 and at Least 3

Dimension                                     % Scored 2 or Higher    % Scored 3 or Higher
WC1 Context of and Purpose for Writing        84.4%                   46.5%
WC2 Content Development                       76.8%                   39.7%
WC3 Genre and Disciplinary Conventions        74.1%                   32.7%
WC4 Sources and Evidence                      64.7%                   44.0%
WC5 Control of Syntax and Mechanics           76.7%                   42.2%

The university has not as yet determined an expected level of attainment on this rubric for basic studies students, so no statement can be made as to whether expectations have or have not been met.
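The threshold percentages in Table 12 (and in Tables 13 through 15 below) are simple cumulative proportions of a dimension's raw score distribution. The sketch below, using hypothetical scores, shows the derivation; not-applicable work products are excluded before computing.

```python
# Minimal sketch of the "scored at least 2 / at least 3" derivation
# (hypothetical 0-4 rubric scores, NA work products already removed).
scores = [0, 1, 2, 2, 3, 2, 1, 3, 4, 2, 2, 3, 1, 2, 3]

n = len(scores)
at_least_2 = sum(s >= 2 for s in scores) / n  # share at or above the first Milestone
at_least_3 = sum(s >= 3 for s in scores) / n  # share at or above the second Milestone

print(f"Scored 2 or higher: {at_least_2:.1%}")
print(f"Scored 3 or higher: {at_least_3:.1%}")
```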
These scores may demonstrate adequate skills for the majority of students completing basic studies courses. These scores also demonstrate, however, the need for additional practice and feedback for the remainder of their four-year experience. There is some evidence from this study that some students may not attain skills at level 4 before leaving the university. Scores increased with hours of coursework completed on only three of the five dimensions, and even these correlations were low. Additionally, results showed that for each dimension there was still a large portion of students who had completed 90 or more credit hours who scored below 3 (from 36.4% to 54.6% per dimension). These findings point to the need for additional instruction and practice in the writing process, such as from the Writing Intensive curriculum that will be introduced in the new University Studies curriculum.

Student performance was strongest on WC1 Context and Purpose for Writing. Areas of relative weakness were WC3 Genre and Disciplinary Conventions and WC4 Sources and Evidence. However, a substantial portion of the work products that were scored below level 2 on WC4 were from one assignment. Had the test question been specific in requiring evidence, it's possible these scores would have been higher. This fact, along with findings from the other rubrics, indicates the need for additional training of instructors on the dimensions of each Learning Goal. Other than the issue just mentioned, there were no significant differences in the score distributions between the four content areas. There were significant differences in scores by type of assignment, with higher scores on the term papers than on the in-class test questions for 4 of the 5 dimensions. Scores on WC5 Control of Syntax and Mechanics were higher on the in-class test questions (though not significantly), which could be due to scorers being more lenient in this area, taking into account the time constraints. This is an area for additional investigation and to be discussed with scorers in the future.

A significant finding, consistent with other research findings, was that females scored higher than males on all dimensions, and significantly higher on WC2 Content Development and WC5 Control of Syntax and Mechanics. This suggests the need for additional writing support for males. Differences between student-reported race/ethnicity classifications could not be researched due to the fact that the samples were too small for all categories except white. Future research will need to determine ways that possible differences between race/ethnicity classifications can be checked for. The only correlation between SAT or ACT scores was a positive correlation between SAT Math scores and WC5 Control of Syntax and Mechanics. There were no differences between transfer students and four-year students, or between honors and non-honors students. The lack of strong association between preparedness and writing ability as scored by this rubric is difficult to interpret.

The rubric was deemed by the scorers to fit the student work products well. There was some scorer confusion about dimension WC3 Genre and Disciplinary Conventions, which will need to be addressed in scorer training and/or through some wording changes in the rubric. Scorers suggested some minor changes to quality criteria.
There were high correlations between the scores of each dimension, which may indicate that more emphasis needs to be placed on scoring each dimension separately, in accordance with the rubric. Interrater reliability measures were all lower than we would like them to be, and ways to improve them in the future need to be implemented.

INQUIRY

The median value of five of the six dimensions was 2, while the median of IN2 Existing Knowledge, Research, and/or Views was 3. As Table 13 below shows, more than 60% of work products were scored at least a level 2 on each dimension.

Table 13 Inquiry Percent of Sample Scored at Least 2 and at Least 3

Dimension                                        % Scored 2 or Higher    % Scored 3 or Higher
IN1 Topic Selection                              71.5%                   28.6%
IN2 Existing Knowledge, Research, and/or Views   91.4%                   53.5%
IN3 Design Process                               83.7%                   49.0%
IN4 Analysis                                     82.6%                   42.8%
IN5 Conclusions                                  70.4%                   39.8%
IN6 Limitations and Implications                 62.3%                   24.7%

The university has not as yet determined an expected level of attainment on this rubric for basic studies students, so no statement can be made as to whether expectations have or have not been met. These scores show that a large majority of students possess inquiry skills at least at level 2, and a large portion at level 3 or higher. Growth, of course, is needed in the skills represented by each dimension for most students in order to attain level 4 skills. There were statistically significant, though not large, increases in scores as credit hours completed increased for four of the dimensions, demonstrating the positive impact of coursework on inquiry skills. The correlation between hours completed and IN2 Existing Knowledge was small (.053) and not significantly different from zero. Along with the fact that the scores on this dimension were very high, this may suggest that most students enter the university with adequate skills for their basic studies work. Additional investigation is needed to shed more light on this issue. Students with over 90 hours of coursework completed (there were only five in the sample) scored at levels 3 and 4 for all dimensions, except for one score of 2 for IN5, and two scores of 1 and one score of 2 for IN6.

Student performance was strongest on IN2 Existing Knowledge, Research, and/or Views, which represents the skill in presenting information from relevant, diverse sources. Scores on IN6 Limitations and Implications were the weakest regardless of credit hours completed. This indicates that students need additional instruction and practice in presenting and supporting the limitations and implications of inquiry findings.

Only two content areas were scored with this rubric—ENG201 and PSY105. Scores were significantly higher on the English composition papers than the Psychology papers on the four dimensions that all assignments were scored on. These data may suggest that the use of a process approach to writing affects not only the dimensions of written communication, but also the dimensions of inquiry. Although there is not often time for instructor feedback loops in many basic studies classes, peer review mechanisms could provide benefits in such courses.

The only statistically significant correlations between Inquiry scores and demographic and preparedness variables were those between the number of credit hours completed and four of the dimensions of the rubric, which were discussed above.
However, differences between student-reported race/ethnicity classifications could not be researched due to the fact that the samples were too small for all categories except white. Future research will need to determine ways that possible differences between race/ethnicity classifications can be checked for.

The correlations between the scores on each dimension were fairly large and statistically significant, except for those involving correlations with IN1, which had a small sample. This suggests the possibility that scorers were not looking at each dimension independently, which would introduce bias into the results. Future training and scoring sessions will focus on this issue. Interrater reliability measures were all lower than we would like them to be, and ways to improve them in the future need to be implemented.

The VALUE rubric fit the sample assignments well, except for IN1 Topic Selection. Only one assignment was scored on that dimension. All other assignments did not require students to select a topic. While this might suggest a concern to some readers, there are also arguments to the contrary. Curriculum for teaching inquiry skills is often constructed to give students increasing ownership of the various steps in the inquiry process (for example, see Teaching Issues and Experiments in Ecology, 2010). In the level often called guided inquiry, the research question is given to the student (as well as other portions of the process). While students should graduate with the skills to complete open-ended inquiry, the guided inquiry represented by this sample of student work products is an appropriate step in the learning process and therefore an appropriate sample for assessing Inquiry at the basic studies level.

There were a number of scorer comments on the quality criteria for dimensions IN4 Analysis and IN5 Conclusions. The rubric needs to be evaluated in these areas for ways to address these concerns.

Finally, while scores on this Learning Goal are relatively high when comparing all three rubrics, any comparison of scores across rubrics must be done with caution.

CRITICAL THINKING

The median score on CT1, CT2, and CT4 was 2, which means that at least half of the student work products sampled demonstrated performance at the first Milestone level (level 2). The median score for CT3 and CT5 was 1, indicating that less than half of the student work products demonstrated performance at the first Milestone level. Table 14 shows the percent of work products scored at a level 2 or higher and the percent of work products scored at a level 3 or higher for each dimension.

Table 14 Critical Thinking Percent of Sample Scored at Least 2 and at Least 3

Dimension                                     % Scored 2 or Higher    % Scored 3 or Higher
CT1 Explanation of Issues                     64.1%                   35.1%
CT2 Evidence                                  65.0%                   28.2%
CT3 Influence of Context and Assumptions      40.3%                   10.9%
CT4 Student's Position                        51.1%                   18.5%
CT5 Conclusions and Related Outcomes          37.0%                   11.0%

The university has not as yet determined an expected level of attainment on this rubric for basic studies students, so no statement can be made as to whether expectations have or have not been met. These scores may demonstrate adequate skills for the majority of students completing basic studies courses in only three of the five dimensions of critical thinking. These results are consistent with the evidence attained so far from the Collegiate Learning Assessment (CLA).
These scores also demonstrate the need for additional practice and feedback for the remainder of students' four-year experience. Evidence from this study suggests, however, that student abilities may not be improving by graduation. All correlations between dimension scores and credit hours completed were negative, although only one was statistically significant (CT5, -.329). Additionally, results showed that for each dimension there was still an extremely large portion of students who had completed 90 or more credit hours who scored below 3 (from 72.7% to 100%).

It should be no surprise that students performed higher in explaining the issue (CT1) and in presenting evidence from sources (CT2) than they did on the other dimensions. What is hidden from this summary is the number of assignments in this sample that were not scored on all dimensions. This fact indicates that students need more opportunities to practice all dimensions of critical thinking, especially those skills associated with understanding the influence of context and assumptions (CT3), formulating a position (CT4), and drawing conclusions and understanding implications and consequences (CT5). This does not necessarily mean that all dimensions must be practiced on the same assignment.

There were significant positive correlations with the preparedness variables SAT Verbal and ACT scores. There were no differences between genders, between transfer students and students who started as freshmen, or between honors and non-honors students (there were only 5 honors students in this part of the sample). Scores could not be analyzed with respect to race/ethnicity due to the small sample size of all categories except white. There was not a significant difference between scores by assignment type for the three highest-scoring dimensions. However, there were significant differences between scores by assignment type for the two lowest-scoring dimensions, CT3 and CT5, where the scores on the one term paper were higher than on the test questions. The lack of association for three dimensions may be more interesting than the association.

The correlations between the scores on each dimension were fairly large and statistically significant. This suggests the possibility that scorers were not looking at each dimension independently. Future training and scoring sessions will focus on this issue. Interrater reliability met the benchmark for three of the five dimensions, and the reliability of the other two dimensions was still high compared to the other two rubrics. This was the only rubric for which scorers were placed into groups of 3 and 4 for part of the scoring session. Whether the higher reliability results were due to the rubric itself or to working in larger groups cannot be determined without further investigation. However, given the similarity in structure of the rubrics, the hypothesis that working in larger groups improves calibration seems worthy of future investigation.

This rubric was the most difficult for scorers to apply. Some of the difficulty seems to have come from the rubric itself, and some from the course instructors' understanding of the dimensions of the rubric. The dimensions and rubric quality criteria need to be reviewed. In addition, more time needs to be spent in the information sessions for instructors on each of the dimensions of critical thinking. It is also important to keep in mind that assignments at the basic studies level may not always address all dimensions of critical thinking.
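The report does not name the statistical test used for the group comparisons above (genders, transfer vs. freshman admits, honors vs. non-honors). For ordinal rubric scores, one common choice is the Mann-Whitney U test; the following is a minimal sketch with hypothetical data, not a reproduction of the report's analysis.

```python
# Minimal sketch of a rank-based comparison of two groups' score
# distributions (hypothetical scores; the actual test used in the
# report is not specified).
from scipy.stats import mannwhitneyu

group_1 = [2, 3, 1, 2, 2, 3, 4, 2]  # hypothetical scores for one group
group_2 = [1, 2, 2, 1, 3, 2, 2, 1]  # hypothetical scores for the other group

u_stat, p_value = mannwhitneyu(group_1, group_2, alternative="two-sided")
print(f"U = {u_stat}, p = {p_value:.3f}")  # p < .05 would indicate a significant difference
```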
SOCIAL SCIENCE FOUNDATIONAL KNOWLEDGE

The median score on both FK1 and FK2 was 1, which means that less than half the work products received scores of 2 or higher. Table 15 below shows the percent of work products scored at a level 2 or higher and the percent of work products scored at a level 3 or higher for each dimension.

Table 15 Foundational Knowledge Percent of Sample Scored at Least 2 and at Least 3

Dimension                                                      % Scored 2 or Higher    % Scored 3 or Higher
FK1 Use of Discipline Terminology                              40.0%                   17.8%
FK2 Explanation and Understanding of Concepts and Principles   40.0%                   15.6%

Information from this rubric is for the evaluation of the rubric and process only, as this is the first use of the rubric, evidence comes from only one of the social sciences, and only one scorer assessed the work products. Both sections were online, and the test question came halfway into the semester. The correlation between the scores on the two dimensions was high and significant (.933). This rubric needs further review by social science faculty and further testing before results can be evaluated with confidence.

One conclusion does start to come to light from the data. The test was given approximately halfway through the semester. The results indicate that this might be too early to assess foundational knowledge. Assessment of terminology and concepts would be more appropriate at the end of an introductory course, or during a second course, when more than one course in a discipline is required.

RELATIVE STRENGTHS AND WEAKNESSES ACROSS RUBRICS

Comparison of scores across rubrics should be done cautiously and should only serve as a starting place for further investigation. This is because the criteria for each level of the rubric cannot be assumed to be scaled the same. For example, Level 2 cannot be considered to be in the identical place on a scale of abilities for each of the rubrics. With this in mind, determination of university-wide expectations for performance in basic studies courses should be done on a rubric-by-rubric basis.

With this caveat in mind, it is helpful for the prioritization of effort to look at potential areas of strength and weakness. Looking at the results from all scoring, the following areas stand out.

Areas of relative strength:
• IN2 Existing Knowledge, Research, and/or Views – presenting information from relevant sources representing various points of view
• WC1 Context and Purpose for Writing – demonstrating consideration of context, audience, purpose, and a clear focus on the assigned task
• IN3 Design Process – developing critical elements of the methodology or theoretical framework

Areas of relative weakness:
• CT5 Conclusions and Related Outcomes – logically tying conclusions to a range of information, including opposing viewpoints
• CT3 Influence of Context and Assumptions – identifying own and others' assumptions and relevant contexts when presenting a position
• CT4 Student's Position – stating a specific position that takes into account the complexity of an issue, including acknowledging others' points of view

When you look at the results from only those work products that were scored on more than one rubric, for the most part, these results are consistent with what is listed above. The notable exception is that for this smaller sample, the Inquiry dimensions were low along with those of Critical Thinking. However, there was only one assignment scored on both rubrics, making this evidence less compelling.
In addition, an important finding is that, on all three VALUE rubrics, scores tended to drop as you go down the dimensions of the rubric. That is, student abilities in understanding purpose, explaining issues, and presenting information—those used to begin communication and investigation—are stronger than their skills in identifying assumptions, stating conclusions or positions, and discussing limitations, implications, and consequences—those used to critically evaluate information. The findings in this section all point towards the need to provide students more opportunities to practice higher-order thinking skills, starting with general education courses.

METHODOLOGY AND PROCESS

This assessment methodology ran fairly smoothly during its first full-scale implementation. Feedback was good from both instructor and scorer participants. Based on the feedback from scorers and instructors, and the results presented in this report, there are some areas for further work.

PROCESS OF SELECTING ASSIGNMENTS

Most assignments selected for scoring with the Written Communication and Inquiry rubrics matched their respective rubrics well. However, this was not the case for the Critical Thinking rubric. There was no Critical Thinking dimension that scorers deemed applicable to all assignments, and two of the dimensions were not deemed applicable for three of the seven assignments. This indicates clearly the need for additional discussion of the dimensions of critical thinking in the initial workshop for instructors, in addition to follow-up during the selection process. This is not meant to suggest, however, that all assignments selected for general education assessment purposes must align with all dimensions of the rubric. It would also be helpful for instructional purposes for there to be more dissemination of information about the UNCW Learning Goals, such as through Center for Teaching Excellence workshops, and inclusion of these goals as appropriate in course syllabi.

PROCESS OF NORMING SCORERS

Interrater reliability was lower than the benchmark for 13 of the 16 dimensions. Reliability can be improved in two ways. The first is by reexamining the rubrics. Scorers provided feedback on a number of dimensions for which the quality criteria can be improved. The other means of improving interrater reliability is through training. Scorers themselves noted that additional training would be beneficial. In addition, the hypothesis that working in groups larger than two improves calibration seems worthy of future investigation.

OTHER ISSUES

Although the sample of students who contributed work products was representative of the UNCW undergraduate student body, the sample of students from all race/ethnicity classifications other than white was not large enough to test for differences between groups. Further studies will need to include ways to analyze potential differences so that they can be addressed if necessary.

As previously mentioned, one of the assumptions of our use of the VALUE Rubrics is that the Level 4 Capstone describes the qualities of understanding that we want our graduating seniors to demonstrate. As an institution, we have not yet defined our minimum expectations for our first- and second-year students, the predominant group taking basic studies courses. It is important to set these expectations to get the full benefit of the results of this and future assessment efforts.

LIMITATIONS

This was the first large-scale use of this methodology for assessing the General Education learning goals.
Therefore, student ability is not the only thing reflected in the results. The newness of the rubrics to both the instructors selecting assignments and the faculty doing the scoring has implications for the reliability of the results.

The sample of student work products for Spring 2010 was created to be a random sample from representative courses in the Basic Studies curriculum. Still, it represents just five subject areas at one point in time. Another limitation of the results is that interrater reliability measures were lower than optimal for 13 of the 16 dimensions. Generalization to all students working on basic studies requirements should be done only cautiously at this time. Additional sampling from other basic studies courses as we continue to assess these Learning Goals should be combined with these findings to provide a clearer picture of UNCW student abilities. They should also be combined with findings from other instruments, such as the Collegiate Learning Assessment.

RECOMMENDATIONS

Although there are limitations to this study, some recommendations that will only have positive effects on student learning can be made in light of these findings. Based on the analysis of the findings presented in the previous sections of this chapter, the Learning Assessment Council recommends the following actions to improve both student learning and the General Education Assessment process.

• Levels of expected performance at the basic studies, or lower division, level should be developed for each rubric.
• Additional exposure to the content of and rationale for the UNCW Learning Goals should be provided to increase faculty ownership and awareness of these Goals. The LAC will ask the Center for Teaching Excellence to provide a workshop series on these Goals. The LAC will ask the University Curriculum Committee to consider actions in this area.
• To increase student exposure to the writing process, the Writing Intensive component of University Studies should be implemented by Fall 2012.
• Modifications and improvements to the general education assessment process should be made as needed, including the following: modify rubrics based on feedback, develop benchmark work products, and enhance instructor and scorer workshops.
• The long-term implementation schedule should provide flexibility for targeting additional sampling for specific learning goals that are characterized by ambiguous or unclear assessment results. For 2010-2011, Critical Thinking will be sampled for this purpose.

REFERENCES

American Association of Colleges and Universities. (2010). VALUE Project. Accessed July 15, 2010. http://www.aacu.org/value/index.cfm

American Educational Research Association. (1999). Standards for educational and psychological testing. Washington, D.C.: AERA.

Fisher, A., & Scriven, M. (1997). Critical thinking: Its definition and assessment. CA: Edgepress / UK: Center for Research in Critical Thinking.

Hockenbury, D. H., & Hockenbury, S. E. (2006). Psychology (4th ed.). New York: Worth Publishers.

Krippendorff, K. (2004). Content analysis: An introduction to its methodology (2nd ed.). Thousand Oaks, CA: Sage Publications.

Lance, C. E., Butts, M. M., & Michels, L. C. (2006). The sources of four commonly reported cutoff criteria: What did they really say? Organizational Research Methods, 9(2), 202-220.

Nunnally, J. C. (1978). Psychometric theory (2nd ed.). New York: McGraw-Hill.

Teaching Issues and Experiments in Ecology: Teaching Resources: Inquiry Framework. Accessed June 12, 2010.
http://tiee.ecoed.net/teach/framework.jpg

University of North Carolina Wilmington. (2009). Revising general education at UNCW. Adopted by Faculty Senate March 17, 2009. http://www.uncw.edu/universitystudies/documents/Univ.StudiesCurriculumReport.pdf

University of North Carolina Wilmington. (2009). UNCW Learning Goals. Adopted by Faculty Senate March 17, 2009. http://www.uncw.edu/assessment/uncwLearningGoals.html

University of North Carolina Wilmington. (2009). Report of the General Education Assessment Committee, March 2009. http://www.uncw.edu/assessment/Documents/General%20Education/GenEdAssessmentCommitteeReportMarch2009.pdf

University of North Carolina. (2009). Office of Institutional Research and Assessment. Factsheet Fall 2009.

University of North Carolina Wilmington. (2010). Office of Institutional Research and Assessment. Common Data Set 2009-2010. Accessed June 2, 2010. http://www.uncw.edu/oira/documents/CDS2009_2010_finalrevisedfacultycounts_3.pdf

APPENDIX A RUBRICS USED

AAC&U Written Communication Rubric
AAC&U Inquiry and Analysis Rubric
AAC&U Critical Thinking Rubric
Locally created Social Sciences Foundational Knowledge Rubric

[The three AAC&U VALUE rubrics are reproduced as full-page images in the original report; the locally created rubric is transcribed below.]

Social and Behavioral Sciences Student Learning Outcome
SBS 1: Describe and explain the major terms, concepts, and principles in at least one of the Social and Behavioral Sciences.

Rubric
Evaluators are encouraged to assign a zero to any work sample or collection of work that does not meet benchmark (cell one) level performance.

Use of Discipline Terminology
Level 4: Demonstrates fluency in the terminology relevant to the topic by displaying skillful and precise word choices that underscore meaning.
Level 3: Conveys meaning to the reader by using all relevant terminology accurately and appropriately.
Level 2: Conveys meaning, although often does not utilize terms relevant to the topic, OR when terminology is utilized, it is sometimes used inaccurately.
Level 1: Meaning of the discourse is unclear, and/or attempts to use terminology of the discipline are inaccurate or inappropriate to the context.

Explanation and Understanding of Concepts and Principles
Level 4: Demonstrates a thorough understanding of the relevant concepts and principles of the discipline by correctly using them in support of an argument.
Level 3: Accurately explains the concepts and principles within the context of the situation, making relevant connections.
Level 2: Explains concepts and principles at a basic level, but leaves out important information and/or connections.
Level 1: Attempts to describe or explain relevant concepts and principles are too vague and/or simplistic.

Draft March 15, 2010

APPENDIX B DIMENSION MEANS AND STANDARD DEVIATIONS

Note of caution: Data from these rubrics cannot be assumed to be interval-level data. That is, although a level 2 is considered higher, or larger, than a level 1, it is not proper to assume that a student who scores at a level 2 is twice as knowledgeable as a student who scored at a level 1; nor can we assume that, whatever the difference is between these two categories, it is exactly the same as the difference between levels 2 and 3. In addition, the scale of quality criteria may differ between the three rubrics. See page X for a more complete discussion of these possible differences. Therefore, this table should be analyzed with extreme caution, and no hypotheses should be formed solely from this information.

Table B1 Means and Standard Deviations for Each Rubric Dimension
Table B1 Means and Standard Deviations for Each Rubric Dimension

Dimension                                          Mean   Std. Dev.    N
WC1 Context of and Purpose for Writing             2.46     0.955     116
WC2 Content Development                            2.22     0.976     116
WC3 Genre and Disciplinary Conventions             2.06     0.907     116
WC4 Sources and Evidence                           1.94     1.246     116
WC5 Control of Syntax and Mechanics                2.24     0.900     116
IN1 Topic Selection                                2.14     1.231      14
IN2 Existing Knowledge, Research, and/or Views     2.52     0.822      58
IN3 Design Process                                 2.35     0.985      98
IN4 Analysis                                       2.23     1.023      98
IN5 Conclusions                                    2.09     1.131      98
IN6 Limitations and Implications                   1.88     1.040      85
CT1 Explanation of Issues                          1.99     1.092     145
CT2 Evidence                                       1.90     0.998     163
CT3 Influence of Context and Assumptions           1.25     0.967     119
CT4 Student’s Position                             1.58     0.972     144
CT5 Conclusions and Related Outcomes               1.21     1.081     127
FK1 Use of Discipline Terminology                  1.60     0.837      45
FK2 Explanation and Understanding of Concepts
    and Principles                                 1.60     0.863      45

APPENDIX C RESULTS BY COURSE

The following three tables contain the distribution of score results by subject. Each cell gives the number of work products scored at that level, with the percentage of the course sample in parentheses.

Table C1 Written Communication Results by Course

All Subjects (6 sections, 116 work products)
WC1 Context of and Purpose for Writing: 0 = 1 (0.9%); 1 = 17 (14.7%); 2 = 44 (37.9%); 3 = 36 (31.0%); 4 = 18 (15.5%); NA = 0 (0.0%)
WC2 Content Development: 0 = 4 (3.4%); 1 = 23 (19.8%); 2 = 43 (37.1%); 3 = 36 (31.0%); 4 = 10 (8.6%); NA = 0 (0.0%)
WC3 Genre and Disciplinary Conventions: 0 = 5 (4.3%); 1 = 25 (21.6%); 2 = 48 (41.4%); 3 = 34 (29.3%); 4 = 4 (3.4%); NA = 0 (0.0%)
WC4 Sources and Evidence: 0 = 23 (19.8%); 1 = 18 (15.5%); 2 = 24 (20.7%); 3 = 45 (38.8%); 4 = 6 (5.2%); NA = 0 (0.0%)
WC5 Control of Syntax and Mechanics: 0 = 1 (0.9%); 1 = 26 (22.4%); 2 = 40 (34.5%); 3 = 42 (36.2%); 4 = 7 (6.0%); NA = 0 (0.0%)

English 201 (2 sections, 27 work products)
WC1: 0 = 0 (0.0%); 1 = 0 (0.0%); 2 = 15 (55.6%); 3 = 4 (14.8%); 4 = 8 (29.6%); NA = 0 (0.0%)
WC2: 0 = 0 (0.0%); 1 = 1 (3.7%); 2 = 7 (25.9%); 3 = 15 (55.6%); 4 = 4 (14.8%); NA = 0 (0.0%)
WC3: 0 = 3 (11.1%); 1 = 3 (11.1%); 2 = 9 (33.3%); 3 = 10 (37.0%); 4 = 2 (7.4%); NA = 0 (0.0%)
WC4: 0 = 5 (18.5%); 1 = 0 (0.0%); 2 = 5 (18.5%); 3 = 14 (51.9%); 4 = 3 (11.1%); NA = 0 (0.0%)
WC5: 0 = 0 (0.0%); 1 = 7 (25.9%); 2 = 15 (55.6%); 3 = 5 (18.5%); 4 = 0 (0.0%); NA = 0 (0.0%)

FST 210 (1 section, 38 work products)
WC1: 0 = 0 (0.0%); 1 = 10 (26.3%); 2 = 15 (39.5%); 3 = 10 (26.3%); 4 = 2 (5.3%); NA = 0 (0.0%)
WC2: 0 = 4 (10.5%); 1 = 16 (42.1%); 2 = 12 (31.6%); 3 = 6 (15.8%); 4 = 0 (0.0%); NA = 0 (0.0%)
WC3: 0 = 1 (2.6%); 1 = 15 (39.5%); 2 = 14 (34.2%); 3 = 8 (21.1%); 4 = 1 (2.6%); NA = 0 (0.0%)
WC4: 0 = 18 (47.4%); 1 = 11 (28.9%); 2 = 5 (13.2%); 3 = 4 (10.5%); 4 = 0 (0.0%); NA = 0 (0.0%)
WC5: 0 = 1 (2.6%); 1 = 13 (34.2%); 2 = 6 (15.8%); 3 = 14 (36.9%); 4 = 4 (10.5%); NA = 0 (0.0%)

MUS 115 (2 sections, 33 work products)
WC1: 0 = 0 (0.0%); 1 = 5 (15.2%); 2 = 11 (33.3%); 3 = 17 (51.5%); 4 = 0 (0.0%); NA = 0 (0.0%)
WC2: 0 = 0 (0.0%); 1 = 2 (6.1%); 2 = 13 (39.4%); 3 = 16 (48.5%); 4 = 2 (6.1%); NA = 0 (0.0%)
WC3: 0 = 6 (18.2%); 1 = 10 (30.3%); 2 = 14 (42.4%); 3 = 3 (9.1%); 4 = 0 (0.0%); NA = 0 (0.0%)
WC4: 0 = 5 (15.2%); 1 = 13 (39.4%); 2 = 10 (30.3%); 3 = 5 (15.2%); 4 = 0 (0.0%); NA = 0 (0.0%)
WC5: 0 = 4 (12.1%); 1 = 16 (48.5%); 2 = 12 (36.4%); 3 = 1 (3.0%); 4 = 0 (0.0%); NA = 0 (0.0%)

PSY 105 (1 section, 18 work products)
WC1: 0 = 1 (5.6%); 1 = 3 (16.7%); 2 = 10 (55.6%); 3 = 4 (22.2%); 4 = 0 (0.0%); NA = 0 (0.0%)
WC2: 0 = 0 (0.0%); 1 = 2 (11.1%); 2 = 3 (16.7%); 3 = 10 (55.6%); 4 = 3 (16.7%); NA = 0 (0.0%)
WC3: 0 = 0 (0.0%); 1 = 4 (22.2%); 2 = 6 (33.3%); 3 = 7 (38.9%); 4 = 1 (5.6%); NA = 0 (0.0%)
WC4: 0 = 1 (5.6%); 1 = 4 (22.2%); 2 = 8 (44.4%); 3 = 5 (27.8%); 4 = 0 (0.0%); NA = 0 (0.0%)
WC5: 0 = 1 (5.6%); 1 = 11 (61.1%); 2 = 5 (27.8%); 3 = 1 (5.6%); 4 = 0 (0.0%); NA = 0 (0.0%)
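Distribution tables of this kind (Table C1 above and Tables C2 and C3 below) can be produced from record-level data with a crosstab. The sketch below is hypothetical: the records, course labels, and tabulation code are invented for illustration and are not the report's actual data or method.

    import pandas as pd

    # Invented record-level data: one row per work product, with the
    # course and its score (or "NA") on a single rubric dimension.
    df = pd.DataFrame({
        "course": ["ENG 201", "ENG 201", "FST 210", "FST 210", "MUS 115", "MUS 115"],
        "score":  ["2", "3", "2", "1", "NA", "3"],
    })

    counts = pd.crosstab(df["course"], df["score"])                       # raw counts
    percents = pd.crosstab(df["course"], df["score"], normalize="index")  # row percentages
    print(counts)
    print((percents * 100).round(1))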
Table C2 Inquiry Rubric Score Results by Course

All Subjects (5 sections, 98 work products)
Topic Selection: 0 = 1 (1.0%); 1 = 3 (3.1%); 2 = 6 (6.1%); 3 = 1 (1.0%); 4 = 3 (3.1%); NA = 84 (85.7%)
Existing Knowledge, Research, and/or Views: 0 = 1 (1.0%); 1 = 4 (4.1%); 2 = 22 (22.4%); 3 = 26 (26.5%); 4 = 5 (5.1%); NA = 40 (40.8%)
Design Process: 0 = 6 (6.1%); 1 = 10 (10.2%); 2 = 34 (34.7%); 3 = 40 (40.8%); 4 = 8 (8.2%); NA = 0 (0.0%)
Analysis: 0 = 9 (9.2%); 1 = 8 (8.2%); 2 = 39 (39.8%); 3 = 35 (35.7%); 4 = 7 (7.1%); NA = 0 (0.0%)
Conclusions: 0 = 10 (10.2%); 1 = 19 (19.4%); 2 = 30 (30.6%); 3 = 30 (30.6%); 4 = 9 (9.2%); NA = 0 (0.0%)
Limitations and Implications: 0 = 6 (6.1%); 1 = 26 (26.5%); 2 = 32 (32.7%); 3 = 14 (14.3%); 4 = 7 (7.1%); NA = 13 (13.3%)

English 201 (4 sections, 58 work products)
Topic Selection: 0 = 1 (1.7%); 1 = 3 (5.2%); 2 = 6 (10.3%); 3 = 1 (1.7%); 4 = 3 (5.2%); NA = 44 (75.9%)
Existing Knowledge, Research, and/or Views: 0 = 1 (1.7%); 1 = 4 (6.9%); 2 = 22 (37.9%); 3 = 26 (44.8%); 4 = 5 (8.6%); NA = 0 (0.0%)
Design Process: 0 = 1 (1.7%); 1 = 4 (6.9%); 2 = 21 (36.2%); 3 = 24 (41.4%); 4 = 8 (13.8%); NA = 0 (0.0%)
Analysis: 0 = 1 (1.7%); 1 = 3 (5.2%); 2 = 27 (46.6%); 3 = 20 (34.5%); 4 = 7 (12.1%); NA = 0 (0.0%)
Conclusions: 0 = 0 (0.0%); 1 = 10 (17.2%); 2 = 20 (34.5%); 3 = 19 (32.8%); 4 = 9 (15.5%); NA = 0 (0.0%)
Limitations and Implications: 0 = 0 (0.0%); 1 = 9 (15.5%); 2 = 20 (34.5%); 3 = 9 (15.5%); 4 = 7 (12.1%); NA = 13 (22.4%)

PSY 105 (1 section, 40 work products)
Topic Selection: 0 = 0 (0.0%); 1 = 0 (0.0%); 2 = 0 (0.0%); 3 = 0 (0.0%); 4 = 0 (0.0%); NA = 40 (100.0%)
Existing Knowledge, Research, and/or Views: 0 = 0 (0.0%); 1 = 0 (0.0%); 2 = 0 (0.0%); 3 = 0 (0.0%); 4 = 0 (0.0%); NA = 40 (100.0%)
Design Process: 0 = 5 (12.5%); 1 = 6 (15.0%); 2 = 13 (32.5%); 3 = 16 (40.0%); 4 = 0 (0.0%); NA = 0 (0.0%)
Analysis: 0 = 8 (20.0%); 1 = 5 (12.5%); 2 = 12 (30.0%); 3 = 15 (37.5%); 4 = 0 (0.0%); NA = 0 (0.0%)
Conclusions: 0 = 10 (25.0%); 1 = 9 (22.5%); 2 = 10 (25.0%); 3 = 11 (27.5%); 4 = 0 (0.0%); NA = 0 (0.0%)
Limitations and Implications: 0 = 6 (15.0%); 1 = 17 (42.5%); 2 = 12 (30.0%); 3 = 5 (12.5%); 4 = 0 (0.0%); NA = 0 (0.0%)

Table C3 Critical Thinking Score Results by Course

All Subjects (8 sections, 183 work products)
CT1 Explanation of Issues: 0 = 12 (6.6%); 1 = 40 (21.9%); 2 = 42 (23.0%); 3 = 40 (21.9%); 4 = 11 (6.0%); NA = 38 (20.8%)
CT2 Evidence: 0 = 13 (7.1%); 1 = 44 (24.0%); 2 = 60 (32.8%); 3 = 39 (21.3%); 4 = 7 (3.8%); NA = 20 (10.9%)
CT3 Influence of Context and Assumptions: 0 = 31 (16.9%); 1 = 40 (21.9%); 2 = 35 (19.1%); 3 = 13 (7.1%); 4 = 0 (0.0%); NA = 64 (35.0%)
CT4 Student’s Position: 0 = 20 (10.9%); 1 = 49 (26.8%); 2 = 47 (25.7%); 3 = 27 (14.8%); 4 = 1 (0.5%); NA = 39 (21.3%)
CT5 Conclusions and Related Outcomes: 0 = 39 (21.3%); 1 = 41 (22.4%); 2 = 33 (18.0%); 3 = 9 (4.9%); 4 = 5 (2.7%); NA = 56 (30.6%)

MUS 115 (2 sections, 39 work products)
CT1: 0 = 1 (2.6%); 1 = 0 (0.0%); 2 = 5 (12.8%); 3 = 9 (23.1%); 4 = 4 (10.3%); NA = 20 (51.3%)
CT2: 0 = 4 (10.3%); 1 = 6 (15.4%); 2 = 13 (33.3%); 3 = 12 (30.8%); 4 = 4 (10.3%); NA = 0 (0.0%)
CT3: 0 = 0 (0.0%); 1 = 0 (0.0%); 2 = 0 (0.0%); 3 = 0 (0.0%); 4 = 0 (0.0%); NA = 39 (100.0%)
CT4: 0 = 0 (0.0%); 1 = 0 (0.0%); 2 = 0 (0.0%); 3 = 0 (0.0%); 4 = 0 (0.0%); NA = 39 (100.0%)
CT5: 0 = 0 (0.0%); 1 = 0 (0.0%); 2 = 0 (0.0%); 3 = 0 (0.0%); 4 = 0 (0.0%); NA = 39 (100.0%)

PSY 105 (3 sections, 78 work products)
CT1: 0 = 7 (9.0%); 1 = 14 (17.9%); 2 = 19 (24.4%); 3 = 19 (24.4%); 4 = 1 (1.3%); NA = 18 (23.1%)
CT2: 0 = 6 (7.7%); 1 = 15 (19.2%); 2 = 33 (42.3%); 3 = 21 (26.9%); 4 = 3 (3.8%); NA = 0 (0.0%)
CT3: 0 = 10 (12.8%); 1 = 18 (23.1%); 2 = 19 (24.4%); 3 = 11 (14.1%); 4 = 0 (0.0%); NA = 20 (25.6%)
CT4: 0 = 8 (10.3%); 1 = 25 (32.1%); 2 = 28 (35.9%); 3 = 16 (20.5%); 4 = 1 (1.3%); NA = 0 (0.0%)
CT5: 0 = 7 (9.0%); 1 = 29 (37.2%); 2 = 29 (37.2%); 3 = 8 (10.3%); 4 = 5 (6.4%); NA = 0 (0.0%)

SOC 105 (3 sections, 66 work products)
CT1: 0 = 4 (6.1%); 1 = 26 (39.4%); 2 = 18 (27.3%); 3 = 12 (18.2%); 4 = 6 (9.1%); NA = 0 (0.0%)
CT2: 0 = 3 (4.5%); 1 = 23 (34.8%); 2 = 14 (21.2%); 3 = 6 (9.1%); 4 = 0 (0.0%); NA = 20 (30.3%)
CT3: 0 = 21 (31.8%); 1 = 22 (33.3%); 2 = 16 (24.2%); 3 = 2 (3.0%); 4 = 0 (0.0%); NA = 5 (7.6%)
CT4: 0 = 12 (18.2%); 1 = 24 (36.4%); 2 = 19 (28.8%); 3 = 11 (16.7%); 4 = 0 (0.0%); NA = 0 (0.0%)
CT5: 0 = 32 (48.5%); 1 = 12 (18.2%); 2 = 4 (6.1%); 3 = 1 (1.5%); 4 = 0 (0.0%); NA = 17 (25.8%)
APPENDIX D DETAILED SCORER FEEDBACK

Scorer Qualitative Comments

What parts of the process worked the best?
• The 2-hr session was a great way to learn & score & gave context & materials that prepared me for the longer session. The paired/small group norming during the long session.
• Everything was very well-organized.
• The small 2-hr norming session was very helpful. Working with a partner is a great idea.
• Working with someone else.
• The overall process was very well organized. The expectations were also very clear. There was adequate time between the norming session & the all-day scoring to review materials.
• The training was excellent. Everything was well organized. Questions were answered thoroughly.
• The workshop.
• Both the norming session and the initial group review of a common student response greatly facilitated uniformity and efficiency of the process. Working in groups allowed us to check ourselves and made the exercise enjoyable.
• Collaboration, discussion & exchanging ideas.
• The time we were allowed. I would hate to try this in a short time period.
• It was good to work with a partner to come to decisions about rubric use. Having a second round of a set of assignments later in the day was instructive.
• Training & creating assumptions before using the rubric for scoring.
• The training session was helpful, even fun. I also found it helpful to compare my assessment (scores) with another volunteer, for the purpose of calibration.
• Very organized. Norming session helpful.
• Splitting into groups and discussing the same cases with the partner(s).

In what ways could the scoring process be improved?
• Perhaps an additional short session prior to the all-day session, not required but available.
• We are still learning what assignments match what rubrics. Mismatches in this area cause confusion. Also it is quite possible that some rubric categories will be interpreted one way for one assignment, and a different way for another assignment.
• I wasn't always clear what to list under "assumptions."
• AC in the room.
• I'm not sure I like the full-day approach. Perhaps it could be broken down into 2 sessions, maybe even weekday evenings?
• If possible, rubrics & assignments should be synchronized better/more effectively. (Of course, this comment speaks more to scoring itself.)
• The process, while long, was fairly seamless & straightforward.
• Refining some parts of the rubric (see specific suggestions on packet evaluations).
• Exemplar work (examples) to use as a guide.
• Probably to have the rubrics and the assignment instructions correlated. Not all dimensions of the Inquiry rubric fit each assignment, and we questioned the applicability at all of this rubric to the English assignment. Some assignments were difficult to evaluate without having prior knowledge of course content.
• Need a more "linear" continuum for bins 1-4, as this is how the data will be interpreted/presented. Can instructors write assignments with the rubric in mind? I'm not sure how universal the Inquiry rubric truly is.
• Clearer, fuller instructions from professor to student would help me assess whether the student met the assignment. Sparse instructions invite vague interpretations of whether the student successfully met the requirements of the assignment. This would be especially helpful for me as I scored student work outside my discipline.
• An additional norming session with a different work product could be useful.
• Clearer definitions/standards. Going through some classic examples for each level would be very helpful.

Any other comments or suggestions.
• Well organized. A valuable site for faculty to share & learn & discuss from multiple depts and disciplines! Thank you.
• A worthwhile experience. I feel I learned a lot that I will be able to apply in my courses. Thank you!
• Request facilities to leave the AC on during the work session!
• Lunch was good. Everyone seemed delighted to participate. A well-managed process.
• A set of guidelines for assignments to be used in this process could allow for a more uniform application of the rubric and make the scoring more accurate and representative.
• Sharing an overview of the process (different people using different rubrics with the same student work). I will be interested in the inter-rater reliability of the rubric.
• The end of the semester (especially April) is sooo busy that it might be better to hold this earlier in the semester--either making use of work products from the fall semester, or earlier in the spring semester.
• #12 neutral just because of the time involved.
• #9 Neutral - yes & no. Will depend on what data show and how we modify the curriculum (if at all).
• Thanks for the process. I learned something about my own assignments as a result of this project. Great!
• #8 neutral - except music!

APPENDIX E CORRELATIONS BETWEEN RUBRIC DIMENSIONS
Table E1 Correlation between Dimensions: Spearman's rho Rank Order Correlation Coefficients

The matrix is symmetric, so only the upper triangle is shown; each entry gives rho with n in parentheses. IN1 and IN2 had no work products in common with the Critical Thinking dimensions (n = 0), so those coefficients could not be computed. The n for each dimension with itself matches the N column of Table B1.

WC1: WC2 .658** (116); WC3 .537** (116); WC4 .535** (116); WC5 .407** (116); IN1 -.569** (14); IN2 -.425* (27); IN3 -.326 (27); IN4 -.090 (27); IN5 -.250 (27); IN6 -.294 (14); CT1 .576* (19); CT2 .173 (37); CT3 -.070 (18); CT4 -.230 (18); CT5 -.371 (18)
WC2: WC3 .606** (116); WC4 .671** (116); WC5 .259** (116); IN1 -.250 (14); IN2 -.111 (27); IN3 -.033 (27); IN4 .193 (27); IN5 .105 (27); IN6 .032 (14); CT1 .006 (19); CT2 .163 (37); CT3 .215 (18); CT4 -.077 (18); CT5 -.071 (18)
WC3: WC4 .443** (116); WC5 .414** (116); IN1 -.627* (14); IN2 -.111 (27); IN3 -.150 (27); IN4 -.026 (27); IN5 -.026 (27); IN6 .071 (14); CT1 .172 (19); CT2 .222 (37); CT3 .036 (18); CT4 -.056 (18); CT5 -.193 (18)
WC4: WC5 .193* (116); IN1 .115 (14); IN2 -.428 (27); IN3 .047 (27); IN4 .330 (27); IN5 .197 (27); IN6 .510 (14); CT1 .137 (19); CT2 .037 (37); CT3 -.024 (18); CT4 -.293 (18); CT5 -.253 (18)
WC5: IN1 .028 (14); IN2 -.015 (27); IN3 .235 (27); IN4 .533** (27); IN5 .427* (27); IN6 .712** (14); CT1 .097 (19); CT2 -.053 (37); CT3 -.421 (18); CT4 -.407 (18); CT5 -.624** (18)
IN1: IN2 .366 (14); IN3 .446 (14); IN4 .499 (14); IN5 .717** (14); IN6 .451 (14); CT1 through CT5: n = 0
IN2: IN3 .542** (58); IN4 .503** (58); IN5 .509** (58); IN6 .479** (45); CT1 through CT5: n = 0
IN3: IN4 .775** (98); IN5 .639** (98); IN6 .493** (85); CT1 .529** (40); CT2 .157 (40); CT3 .033 (40); CT4 -.073 (40); CT5 .182 (40)
IN4: IN5 .753** (98); IN6 .521** (85); CT1 .506** (40); CT2 .190 (40); CT3 .041 (40); CT4 -.093 (40); CT5 .254 (40)
IN5: IN6 .628** (85); CT1 .363* (40); CT2 .061 (40); CT3 -.086 (40); CT4 -.098 (40); CT5 .165 (40)
IN6: CT1 .434** (40); CT2 .399* (40); CT3 .263 (40); CT4 .135 (40); CT5 .420** (40)
CT1: CT2 .651** (125); CT3 .247* (101); CT4 .512** (126); CT5 .561** (109)
CT2: CT3 .692** (104); CT4 .668** (124); CT5 .618** (123)
CT3: CT4 .593** (119); CT5 .533** (107)
CT4: CT5 .634** (127)

*Statistically significant at the .05 level
**Statistically significant at the .01 level

Table E2 Correlation between Dimensions: Spearman's rho Rank Order Correlation Coefficients (n = 45 for all cells)

       FK1      FK2      CT1     CT2     CT3     CT4      CT5
FK1    1.00     .933**   .471**  .372*   .398**  .484**   .336*
FK2    .933**   1.00     .455*   .340*   .347*   .474**   .323*

*Statistically significant at the .05 level
**Statistically significant at the .01 level
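For readers unfamiliar with the statistic, the hypothetical sketch below shows how a single Spearman's rho of the kind tabulated above can be computed with SciPy. The scores are invented for illustration; the report does not describe the software it used.

    from scipy.stats import spearmanr

    # Invented paired scores for work products rated on two rubric
    # dimensions. Spearman's rho correlates ranks, so it respects the
    # ordinal nature of rubric scores (see the Appendix B caution).
    dim_a = [2, 3, 1, 4, 2, 0, 3, 2]
    dim_b = [1, 3, 2, 4, 2, 1, 3, 3]

    rho, p = spearmanr(dim_a, dim_b)
    print(f"rho = {rho:.3f}, p = {p:.3f}")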