Higher Education Research & Development, Vol. 20, No. 3, 2001

The Course Experience Questionnaire: Altering question format and phrasing could improve the CEQ's effectiveness

MALCOLM ELEY, Monash University

ABSTRACT  Two studies considered whether the Course Experience Questionnaire's (CEQ) question format was the most appropriate for the CEQ's purpose. In the first, comparisons were made against two alternative but minimalist variations on the standard format. None of three tests showed the standard format to be superior. In the second, students reported the thinking used in deciding their responses on a sample of specific CEQ questions. Those reports uniformly showed responses to be decided by the recall of particular, concrete and personal experiences prompted by a question, and not by the overviewing implicitly assumed by the standard format. The implications drawn are that the systematic trialing of alternative question forms could well result in improved performance of the CEQ as an instrument, and that those alternative forms should probably be constrained to those directly prompting the recall of personal experience, but in more guided fashion than seems presently to occur.

Introduction

The Course Experience Questionnaire (CEQ) (Ramsden, 1991) was developed to gather information relevant to judging the overall effectiveness of degree programs. As commonly used within the Australian higher education sector it comprises 25 questions, structured as five subscales plus a single overall rating question, and asks recent graduates to judge the quality of the teaching and learning support experienced during their entire degree studies. The CEQ has undergone considerable development work, and its overall construct validity, and that of its component subscales, seems well established (Richardson, 1994; Wilson, Lizzio, & Ramsden, 1997).
Further, there has been work to establish the influence on CEQ responses of various characteristics of the responding graduates (Long & Johnson, 1997). Concerns about which particular aspects of a graduate's teaching and learning experiences constitute potentially useful sources of variation amongst degree programs, which show valid relationships with teaching quality, and which seem therefore sensible targets for questionnaire questions, have been well considered. The CEQ has been incorporated into an annual survey of graduates run by the Graduate Careers Council of Australia (GCCA) since 1993. The responses collected in that survey are fed back to the participating institutions, and in varied fashion those institutions make use of the data for their own course review processes. Those same responses are also used annually to rank Australian universities within particular fields of study (e.g., Ashenden & Milligan, 1999), rankings that are widely available to prospective students. Recently, the Australian federal government has shown interest in developing a set of performance indicators applicable to universities' activities, and the CEQ has been identified as a candidate for that set. The latest area of development is in statistical analysis and interpretation tools that will enable institutions to make valid, appropriate and more targeted use of CEQ data (Hand, Trembath, & Elsworthy, 1998). However, with all this national and institutional interest, there is yet one aspect of the CEQ that seems to have received little research attention. This relates not to the substance of its questions, but to their form. The CEQ uses a question format originally devised for attitude surveying, in which the intent is that responses should indicate something of respondents' attitudes and values (Likert, 1932; Anastasi, 1982).
In the CEQ context, a statement that represents a particular value position—e.g., "It was always easy to know the standard of work expected" or "To do well in this course all you really needed was a good memory"—is presented, and the student responds by indicating a degree of agreement or disagreement. But rather than responses being used to describe the students, as is the original design purpose of this question format, those responses are instead used to infer something about the courses that they experienced. There has been a subtle shift; the "tool" has been used for something other than that for which it was originally designed. We should not assume that this shift is of no matter. With student questionnaires focused on the teaching of an individual academic, question form can have significant effects on the psychometric properties of the questionnaire (Eley & Stecher, 1994; 1995; 1997; Stecher & Eley, 1994). Questionnaires comprising behavioural observation scale questions (e.g., Borman, 1986; Latham & Wexley, 1977) proved better at detecting differences between independent samples of teaching, and between distinct aspects within a single sample, than parallel equivalent questionnaires using a CEQ-like agree/disagree format. There is yet a further point. In asking a student to indicate agreement with something like "Feedback on my work was usually provided only in the form of marks or grades" the CEQ makes some implicit assumptions about how individual students determine their responses. In that they are asked to respond relative to their overall course, the assumption is that students bring to mind some representative sampling of their experiences from across the years, and then apply some sort of averaging or mode-finding algorithm. Further, in that all students' responses to any single question are pooled in analyses, the assumption is that all those responses belong to the same class; that is, all students took the same intended meaning from the question.
Neither of these assumptions need be valid. Students might well vary in the time frames that they apply to their response processing. Students might vary in terms of what they include in "my work", or what "usually" means quantitatively, or indeed what constitutes "feedback". The point is that we do not know much of the thinking that students use in deciding their CEQ responses. We do not know whether the CEQ question form implicitly asks students to do something that they might find impossible to do, or which is perhaps contrary to their well established processing predispositions. The question arises as to whether some modifications to the ways in which the CEQ asks its questions might improve the effectiveness of the CEQ as an instrument. What might happen if question formats were varied from the indirect, attitudinally based agree/disagree form used? Are there discernible patterns of processing that students use in deciding their CEQ responses?

Study One

The first study tested the proposition that the CEQ's agree/disagree question format might not be psychometrically optimal. More specifically, the concern was to test whether altering the form of the CEQ's individual questions offered any reasonable prospect of improving the reliabilities of the CEQ's subscales, the empirical fit of those subscales to the underlying conceptual dimensionality, and the ability of the subscales to differentiate amongst defined cohorts within fields of study. The basic approach was not to develop and test full alternatives to the CEQ. Rather, the present CEQ was compared to alternatives that incorporated quite minimal modifications. Only the form of the individual questions was altered. The semantic intent of those questions was maintained. The order and number of questions was kept the same. The general instructions to respondents were kept as parallel as possible.
The simple logic was that if the present form of the CEQ is even broadly optimal for its task, then any sort of tinkering should result in a deterioration of its capabilities as a measurement device. It should certainly not be the case that the present form of the CEQ would prove to be poorer than any such modification. One modified version used an adaptation of the behavioural observation scale (BOS) format. In developing "true" BOS questions, specific concrete events that respondents might observe or experience, and which could be taken as indicators of the dimension of interest, are generated. In an individual BOS question, a description of one of these "indicator events" would be presented, and the respondent would report the frequency with which it had been observed or experienced. So to parallel a CEQ question like "The staff put a lot of time into commenting on my work", more specific descriptions like "Assignment work was returned with written comment on its quality", "Tutors observed my in-class work, and suggested how I might improve it", or "Teachers gave me out-of-class assistance in interpreting my mid-term test results" might be generated. BOS questions would then present these descriptions, each time asking the student to report how consistently or frequently they were observed or experienced. In the present study however, such specific BOS descriptions were not generated. Rather, the original CEQ phrasing was minimally altered, if at all, and only as necessary for a frequency-based response scale to seem sensible semantically. In essence this meant that the present "BOS" version of the CEQ comprised phrasings very close to the standard phrasings, but associated with a frequency-based response scale rather than the agree/disagree. The "BOS" CEQ used here was not so much a full BOS alternative to the CEQ, but the standard CEQ with its questions minimally transformed to be "BOS-like".
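The minimal "BOS-like" transformation just described can be sketched programmatically. In the following illustration the stems and the end anchors of the frequency scale come from the paper's own examples; the qualifier list, the intermediate scale points and the helper function itself are hypothetical conveniences, not part of the study's materials.

```python
# Sketch: turning a standard agree/disagree (GCCA) stem into a
# minimally "BOS-like" stem by stripping frequency qualifiers and
# pairing it with a frequency response scale.

# Hypothetical list of frequency-connoting qualifiers to remove.
FREQUENCY_QUALIFIERS = ("always", "usually", "sometimes", "normally")

AGREEMENT_SCALE = ["strongly disagree", "disagree", "neutral",
                   "agree", "strongly agree"]
# End anchors from the paper; the middle points are assumptions.
FREQUENCY_SCALE = ["very few or none", "few", "about half",
                   "most", "all or almost all"]

def to_bos_stem(gcca_stem):
    """Strip frequency-connoting qualifiers from a GCCA stem."""
    words = [w for w in gcca_stem.split()
             if w.lower().strip(".,") not in FREQUENCY_QUALIFIERS]
    return " ".join(words)

gcca = "It was always easy to know the standard of work expected."
print(to_bos_stem(gcca))
```

The point of the sketch is only that the transformation is mechanical: the semantic intent of the stem is preserved while the response scale changes from agreement to frequency.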
In similar fashion, the second CEQ version used here was an adaptation of the dimensional rating scale (DRS) format often found in teaching evaluations. In this format students might be given dimensional aspects of lecturing such as "punctuality", "clarity", "supportiveness" or "availability", and be required to rate a lecturer against each on something like a "very good" through "very poor" scale. In the present context, CEQ phrasings like "The teaching staff normally gave me helpful feedback on how I was going" would be recast as a dimensional description like "The helpfulness of the feedback given by teaching staff on how you were going". Again then, the "DRS" CEQ used here was not a full dimensional rating alternative; it was constrained to comprise the standard CEQ questions minimally transformed to be "DRS-like". This first study was then clearly a quite conservative test of the simple proposition that the present CEQ, with its agree/disagree question format, is optimal for its purpose. The study was not a test of what fully developed alternative CEQ formats might look like. It was rather a test of whether proceeding to such development effort might be worthwhile.

Method

Three parallel versions of the 25-item CEQ were prepared. The first was the "standard" version used by the GCCA. It comprised question stems that were qualitative statements paired with agree/disagree response scales. The instructions were those in the GCCA usage. Students were instructed to think about their course as a whole, and to indicate their view of each question's statement. The second version used adaptations of the behavioural observation scale (BOS) question format. The GCCA stems were modified by removing any frequency-connoting qualifiers (e.g., always, sometimes, usually), and by deleting any syntactically unnecessary references to "the course". The response scale ranged from "all or almost all" through "very few or none".
Students were instructed to think back over the individual subjects taken in their course, and to estimate the proportion for which each question's statement was true. The individual subject was used as the basis for response for purely practical reasons; students could not be expected to recall specific events over their degree durations, yet some "countable" unit was needed for the "BOS-like" format to be sensible. The final version adapted a dimensional rating scale (DRS) question format. Each GCCA stem was converted into an unqualified description of a dimension, and the response scale asked for a rating from "very good" through "very poor" against that dimension. Students were instructed to think about their course as a whole, and to rate the course element described in each question. Example parallel questions are shown in Table 1.

TABLE 1. Examples of question stems from each CEQ version

GCCA (standard) version: It was always easy to know the standard of work expected.
BOS version: It was easy to know the standard of work expected.
DRS version: The ease with which you could determine the standard of work expected.

GCCA (standard) version: I usually had a clear idea of where I was going and what was expected of me in this course.
BOS version: I had a clear idea of where I was going and what was expected of me.
DRS version: How clearly you understood where you were going and what was expected of you in the course.

GCCA (standard) version: As a result of my course, I feel confident about tackling unfamiliar problems.
BOS version: The subject developed my confidence about tackling unfamiliar problems.
DRS version: The course's development of your confidence to tackle unfamiliar problems.

The three versions were administered to final year undergraduate Engineering and Business students in their last semester before graduation. Engineering and Business were chosen because the degree programs are relatively structured, with considerable commonality within the programs undertaken for any specific specialty.
This meant that the students making judgments about, say, Civil Engineering would have much experience in common. Any individual student completed one of the three versions. The versions were distributed randomly to students within each accessed class group, ensuring that roughly equivalent numbers within any specialty completed each. Because of varying access practicalities, questionnaire completion occurred via a mixture of modes: supervised individually, supervised in class, completed at home but returned in class, and mailed out and mailed back. This variation in completion mode could be expected to contribute to nonsystematic error variance, and thus to the conservative nature of the present study.

Results and Discussion

Responses from a total of 352 students were analysed, being 147 from Engineering (50.3% of the enrolment) and 205 from Business (48.8% of enrolment). The breakdown of students according to specialty and CEQ version is shown in Table 2. In summary, 126 students completed the GCCA version, 123 the BOS, and 103 the DRS.

TABLE 2. Numbers of students within each specialty program responding to the CEQ versions

Engineering          GCCA  BOS  DRS
Chemical                8   10    8
Civil                  15   13    7
Electrical             10   17    9
Materials               2    1    3
Mechanical             18   11   15

Business             GCCA  BOS  DRS
Accounting             11   15   12
Banking & Finance      31   30   18
Business Admin.         5   10    5
International Trade     1    1    1
Management             12    5   11
Marketing              10    6   11

Note: A further 10 Business students were not identifiable as belonging to one of the specialties.

For each version, response possibilities were scored by ascribing point values 5 through 1, so that larger values consistently associated with more positive response meanings. This meant some variation between version scoring regimes, since eight of the GCCA and BOS questions were negatively phrased whereas all DRS questions were unidirectional. Two sets of analyses were conducted on the entire sample. First, Cronbach α reliabilities were calculated on each of the five defined CEQ subscales, separately for each of the three versions (see Table 3).

TABLE 3. Subscale reliabilities (decimal points omitted) calculated for each version

                                         GCCA  BOS  DRS
Good teaching (GT), 6 questions           863  843  830
Clear goals (CG), 4 questions             677  766  785
Generic skills (GS), 6 questions          648  787  780
Appropriate assessment (AA), 4 questions  489  605  716
Appropriate workload (AW), 4 questions    654  662  827

For four of the subscales the GCCA version yielded the lowest reliability, indicating that as measures the subscales seemed generally the weakest when defined by the standard GCCA format questions. Even for the Good Teaching subscale, for which the GCCA reliability was the highest, the reliabilities for the three versions were probably more similar than distinct. The GCCA version of the CEQ was not found to yield clearly superior subscale reliabilities than what were presumably less than optimal manifestations of alternative question format versions.

The second analyses on the entire sample were factor analyses (principal components, varimax) of responses to the 24 subscale-defining questions from each of the three versions separately. Each analysis forced a five-factor solution to investigate the extent to which factor loadings would reflect the a priori conceptual subscale structure defined by the "standard" GCCA version. Table 4 lists rotated factor loadings for each CEQ question, grouped by subscale membership and solution factor.

TABLE 4. Primary factor loadings for each CEQ question for five-factor solutions on each CEQ version separately (principal components, varimax, decimal points omitted). Questions are listed as grouped into their CEQ subscales (GT: questions 3, 7, 15, 17, 18, 20; CG: 1, 6, 13, 24; GS: 2, 5, 9, 10, 11, 22; AA: 8, 12, 16, 19; AW: 4, 14, 21, 23); bold figures mark the loadings that conform to CEQ subscale structure. [Table body not reliably recoverable from the source scan.]

For any CEQ version, the strongest indication of stable subscale structure would be for all primary loadings (the largest factor loading for any given question) to sit in the diagonal cells in Table 4. This would indicate a close coincidence between factor solution and defined subscale membership of the questions. In general terms the fit with conceptual subscale structure seemed best for the BOS version, next for the GCCA, and least for the DRS. For the BOS version 21 of the 24 questions showed agreement between primary loadings and subscale membership, for the GCCA this was true of 19 questions, and for the DRS 16 questions. Putting this another way, all five BOS subscales had a majority of items showing solution loading agreement, with two showing complete agreement; four of the GCCA subscales showed majority agreement, with one showing complete agreement; for the DRS it was two subscales and one respectively. There are some further indications deriving from the factor solutions. Initial components extraction for the GCCA analysis showed seven eigenvalues greater than one, whereas for each of the BOS and DRS six were shown. The rotated five-factor solution for the GCCA version explained 54.8% of the variance, whereas for the BOS and DRS this figure was 57.5% and 61.3% respectively. For the GCCA some 42% of nonredundant residuals had absolute values greater than 0.05, whereas for the BOS and DRS this was 38% and 41% respectively.
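The reliability figures in Table 3 rest on Cronbach's α computed over items scored 5 through 1, with the negatively phrased GCCA and BOS items reverse-scored first. A minimal sketch of that calculation in plain Python follows; the 4-item "subscale" and its responses are invented for illustration and are not the study's data.

```python
from statistics import pvariance

def reverse_score(response, points=5):
    """Reverse-score a negatively phrased item on a 1..points scale."""
    return points + 1 - response

def cronbach_alpha(items):
    """Cronbach's alpha for a list of item-response columns.

    `items` is a list of k lists, each holding one item's scores
    across the same n respondents."""
    k = len(items)
    n = len(items[0])
    totals = [sum(item[j] for item in items) for j in range(n)]
    item_vars = sum(pvariance(item) for item in items)
    total_var = pvariance(totals)
    return (k / (k - 1)) * (1 - item_vars / total_var)

# Hypothetical 4-item subscale, 6 respondents; the fourth item is
# negatively phrased, so its raw responses are reverse-scored first.
raw = [
    [4, 5, 3, 4, 2, 5],
    [4, 4, 3, 5, 2, 4],
    [5, 4, 2, 4, 3, 5],
    [2, 1, 3, 2, 4, 1],   # negative phrasing: disagreement is positive
]
scored = raw[:3] + [[reverse_score(r) for r in raw[3]]]
print(round(cronbach_alpha(scored), 3))
```

Reverse-scoring matters here: without it, the negatively phrased item would correlate negatively with the rest and the subscale's α would collapse, which is exactly the "scoring regime" variation the study had to manage across versions.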
Finally, Kaiser-Meyer-Olkin measures of sampling adequacy calculated on the five-factor solutions were 0.768 for the GCCA, 0.803 for the BOS, and 0.846 for the DRS, indicating that the GCCA solution could be characterised as "middling" and the BOS and DRS as "meritorious" (Kaiser, 1974). All of this combines to suggest that the a priori expectation of a five-factor solution seems to fit least well for the GCCA data, compared to the BOS and DRS (Tabachnick & Fidell, 1996). There are some further observations that warrant comment. The five "errant" GCCA questions with loading mismatches were all from the four subscales other than Good Teaching, with all migrating to align with Good Teaching. For the other two CEQ versions the picture was more mixed, with "errant" BOS questions migrating to align with Good Teaching or Clear Goals, and the "errant" DRS questions with Good Teaching, Clear Goals or Generic Skills. One interpretation might be that for the GCCA version the Good Teaching subscale is dominant or superordinate, whereas for the other versions the subscales seem more independent and equally weighted. In summary, while all of these comparisons of the CEQ versions on subscale fit are rough measures, the consistent finding would seem to be that the standard GCCA question format did not result in unambiguous superiority. A further set of analyses was conducted, but now separately on the Engineering and Business samples. Within each sample, and separately for each CEQ version, univariate one-way analyses of variance comparing degree specialties were calculated using subscale and overall rating scores as dependent variables. For reasons of small group sizes, Materials students and International Trade students were excluded. The purpose in these analyses was not so much to test for actual differences amongst the Engineering and Business specialties, but rather to test the comparative ability of the three CEQ versions to detect whatever between-specialty differences there might be.
The outcomes of interest therefore were not the actual main effect significances, but rather the relative strengths of those effects across equivalent analyses for each CEQ version. The measure used was η², the ratio of SS_between to SS_total: larger values of η² indicate that group means are more different than similar. If, within one of the Engineering or Business samples, the η² for a particular dependent variable for one CEQ version is larger than the parallel values for the other two versions, then it is a rough indicator that the between-specialty differences, as measured by that CEQ version, are greater than the differences measured by the other versions. To put it another way, that one CEQ version would be a better measurement instrument in that differences would have a better chance of being detected. So if across all dependent variables we find that one CEQ version consistently tends to show the largest (or smallest) η² values compared to parallel values for the other versions, then what individually were "rough indicators" of measurement superiority (or inferiority) combine into more confident grounds. For only one of the 12 analyses within the Engineering or Business samples did the GCCA version yield the greatest η² value (see Table 5). Were the GCCA version the better measurement instrument, then the simple expectation would be that between-group differences would more likely manifest with the GCCA. The GCCA version should show the greatest η² values on at least a majority of instances, certainly better than simple chance. It did not.

TABLE 5. Strength of association (η²) values from ANOVAs on each CEQ version, for Engineering and Business samples separately

Engineering               GCCA    BOS    DRS
Good teaching            0.219  0.172  0.314
Clear goals              0.062  0.176  0.189
Generic skills           0.184  0.130  0.188
Appropriate assessment   0.059  0.048  0.181
Appropriate workload     0.133  0.097  0.202
Overall rating           0.146  0.083  0.144

Business                  GCCA    BOS    DRS
Good teaching            0.057  0.024  0.086
Clear goals              0.046  0.086  0.047
Generic skills           0.031  0.227  0.450
Appropriate assessment   0.018  0.085  0.067
Appropriate workload     0.115  0.148  0.031
Overall rating           0.047  0.117  0.053

Note: Within each row, the largest of the three values marks the version that best differentiated the specialties.

As rough and conservative a test as this comparative use of η² is, students' responses using the GCCA version seemed least likely to differentiate between the specialty groups within either Engineering or Business. When it comes to the ability of the CEQ to operate as a measurement device, and to detect differences, the standard GCCA version did not show itself to be clearly superior. The question addressed by this first study was very basic. Can evidence be found that the present CEQ, with its agree/disagree question format, is likely less than optimal for its role as a measurement instrument? The study used a conservative methodology in that the standard CEQ was compared against minimalist variations of itself, rather than against genuinely developed alternatives. The analyses applied three conservative tests. The simple conclusion is that on none of those tests did the standard CEQ show itself to be unambiguously superior. The CEQ's subscales were generally found to exhibit better scale reliabilities when the question formats were something other than the standard agree/disagree. The empirical fit of individual questions' response patterns to those subscales as an underlying a priori structure was better when the question formats were other than agree/disagree.
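The η² measure used in Table 5 follows directly from its definition as SS_between over SS_total, and is straightforward to compute. A minimal sketch in plain Python; the three cohorts and their subscale scores are invented for illustration, not drawn from the study.

```python
from statistics import mean

def eta_squared(groups):
    """η² = SS_between / SS_total for a one-way layout.

    `groups` is a list of lists, one per specialty cohort, each
    holding the subscale scores of the students in that cohort."""
    all_scores = [x for g in groups for x in g]
    grand = mean(all_scores)
    ss_total = sum((x - grand) ** 2 for x in all_scores)
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups)
    return ss_between / ss_total

# Hypothetical subscale scores for three specialty cohorts.
groups = [[3.2, 3.8, 3.5], [4.1, 4.4, 4.0], [2.9, 3.1, 3.3]]
print(round(eta_squared(groups), 3))
```

Because η² is a ratio of sums of squares, it ranges from 0 (group means identical) to 1 (all variance between groups), which is what licenses the study's comparative reading: larger parallel values mean a version better separates the same cohorts.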
Finally, the capability of the CEQ to detect measurable differences was better when the question format was other than agree/disagree. In summary, this first study's findings clearly suggest that the present form of the CEQ is indeed less than optimal for its purpose. The minimalist and conservative comparisons made here suggest that a careful and systematic development of alternative question forms could well be expected to result in the clearer definition of the CEQ's subscales, and an improved capability of the CEQ to distinguish amongst cohorts, institutions and programs.

Study Two

Taking the Study One findings to indicate that the development of alternative CEQ forms would likely prove fruitful, an obvious next question is what such alternative forms might look like. A good starting point might be to investigate the thinking processes that respondents use in deciding their responses to the CEQ. The concern here is not to alter the conceptual dimensions of the CEQ, but rather to consider better ways of collecting students' responses relating to those conceptual dimensions. Were we to investigate how students presently make their response decisions, what they specifically recall to mind, and how they use that recalled information, we might then be able to devise ways of asking the CEQ questions that fit better with those decision processes. Such a concern, that any alternative CEQ form should be developed especially to fit closely with student processing, was the basis of the second study. Students were asked to report on the specific decision making that led to particular responses. To the extent that students are empirically found to do something other than what the present form of a question might implicitly assume, there would be grounds for revising that question to conform more to what the students do, or rather to the ways in which they typically think in responding.
Method

Subjects

A total of 45 volunteer students in the final semester of their undergraduate programs in Law, Engineering and Psychology were recruited. These volunteers comprised three 5-student groups within each discipline, drawn from the full-time, on-campus enrolments. Subjects were not paid; however, their participation occurred during lunchtimes, and juice and sandwiches were provided.

Materials

The same three versions of the CEQ that were used in Study One were used here also. All three, rather than only the standard GCCA version, were used to allow for the possibility that students' response processes might vary with version.

Procedure

For each 5-student group, the session began with all members of that group completing one of the three CEQ versions. The purpose of the session was defined as the gathering of student comment on how they individually decided their responses to the CEQ, and to provide a context for that comment the students would first be required to complete the CEQ on their own "about to be completed" degree studies. Following the questionnaire's completion, the students were taken through a sample of 11 CEQ questions. For each question the text of the question was read out, and the students were invited to relate as directly as possible how they came to their individual responses. The general instruction was that they should recall their thinking, and then simply describe that thinking. Each student in the group was given an opportunity to respond. The only comments made by the experimenter during this process were requests for clarification or elaboration when a student's description was vague or ambiguous, or redirections when the discussion veered too far from a focus on describing the response thinking for individual questions.
In using such a self-report methodology it was assumed that the processes functionally involved in students' decision making are indeed conscious, working memory processes, and that at least in the short term they remain available to be recalled and verbalised. Further, if those decision processes are in large part themselves verbal in nature, then the retrospective reporting would be a relatively direct externalisation. In that the time delay between actually responding to a question and verbalising the response processing was minutes, and that a re-statement of the question was used to cue the recall of those response processes, the expectation was that the resulting report data would be as good a reflection of those actual processes as practicalities allowed (see Ericsson & Simon, 1980). All students were taken through the same sample of questions, in the same order, regardless of discipline or CEQ version. This order was questions 18, 24, 22, 12, 23, 25, 17, 6, 11, 19 and 4, being one question from each of the Good Teaching, Clear Goals, Skills, Assessment, and Workload subscales, then the general overall question, and then a further question from each of the subscales. For each student group the responses were audiotape recorded. These tapes were later transcribed for analysis. Sessions typically required about 40 to 50 minutes.

Results and Discussion

Student responses for each group were inspected for statements that clearly described how an individual's response to a given CEQ question had been derived. Students would often embellish their responses with comment or reflection about their study experiences more generally. They would often divert into discussion about what ought to have been, rather than what was. Such embellishments and diversions were ignored for the analyses here. Only those statements that described what a student had done in deciding his or her response were taken as the data for analysis.
It was often the case that individual students would not give a personal description of a response decision process, but would instead concur with another group member's description. Such simple concurrences were also not considered or counted in the analyses here. The purpose of the analyses was not to compute some accurate count of individual response approaches. Rather, the purpose was to categorise those discrete statements that were offered to determine the range of approaches used, and the broad relativities between them. The assumption was that the picture gained from analysing only those descriptions actually offered would reflect also the approaches taken by students who simply concurred.

The clear initial observation is that the described response approaches were overwhelmingly based on recalling personal, specific and concrete experiences. A consistent thread through all groups, irrespective of CEQ version and degree program, was that an individual question would prompt recall of particular experiences, events, people, which in various ways would then be used to derive a response to the question.

TABLE 6. Counts and percentages of reported response approaches fitting different categories

                                                            GCCA         BOS          DRS
                                                            respondents  respondents  respondents
Recall of specific concrete instances prompted directly
by the question:
  … apparently unguided or nonsystematic recall             26 (25.7%)   25 (22.7%)   17 (16.0%)
  … focus on the salient or recent                          11 (10.9%)   14 (12.7%)    5  (4.7%)
  … focus on extreme instances                               2  (2.0%)    4  (3.6%)    3  (2.8%)
  Sub-totals                                                39 (38.6%)   43 (39.1%)   25 (23.6%)
Recall guided by preliminary definition of classes of
events:
  … defined observable criteria used to guide attempted
    recall of specific instances                            39 (38.6%)   38 (34.5%)   48 (45.3%)
  … deliberate narrowing of the range of instances to
    be recalled                                             18 (17.8%)   18 (16.4%)   14 (13.2%)
  Sub-totals                                                57 (56.4%)   56 (50.9%)   62 (58.5%)
Responses not clearly indicative of the recall of
personal, specific and concrete experiences                  5  (5.0%)   11 (10.0%)   19 (17.9%)
Totals                                                     101          110          106

In essence, none of the students reported doing what any of the CEQ versions asked, that being to base their responses on their overall experiences. Often the students would "confess" that they found it impossible to recall their entire experiences, and that they had no option but to base their responses on the sample of experiences that they could recall, typically with the explicit recognition that inaccurate reflections of those overall experiences might result.

There was however variation in the ways in which individual students made use of those recalled concrete experiences. Two broad categories of approach could be discerned, with further subdivision within each. These categories and subcategories are not claimed to be mutually exclusive, but to be broad, general descriptions of the approaches taken, sometimes in isolation, sometimes in combination. In constructing the summary tabulation of the 317 discrete descriptions that comprised the base data (Table 6), individual descriptions were assigned to single subcategories even though they might not have reflected that subcategory uniquely. The categorisation process was admittedly judgmental, with descriptions being assigned on a "best apparent fit" criterion. Although the responses are shown in Table 6 as classified according to CEQ version, the discussion here will ignore that distinction. The simple finding was that there were no strong and apparent differences in the range and mix of decision approaches attributable to CEQ version.
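This absence of version differences can be checked with a chi-square test of independence on the Table 6 counts. The sketch below is a minimal, pure-Python check; the counts are as transcribed here, so the resulting statistic need not reproduce the published figure exactly, but with df = 10 it falls, like the reported value, below the .05 critical value of 18.31.

```python
# Table 6 counts as transcribed (rows = response-approach subcategories;
# columns = GCCA, BOS, DRS questionnaire versions).
counts = [
    [26, 25, 17],  # unguided or nonsystematic recall
    [11, 14, 5],   # focus on the salient or recent
    [2, 4, 3],     # focus on extreme instances
    [39, 38, 48],  # recall guided by defined observable criteria
    [18, 18, 14],  # deliberate narrowing of the instance range
    [5, 11, 19],   # not clearly indicative of concrete recall
]

def chi_square_independence(table):
    """Pearson chi-square test of independence for a two-way count table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return stat, dof

stat, dof = chi_square_independence(counts)
print([sum(col) for col in zip(*counts)])  # [101, 110, 106], the Table 6 totals
print(dof)                                 # 10, as reported
print(stat < 18.31)                        # True: not significant at the .05 level
```

The conclusion is unchanged whichever exact counts are used: the subcategory distributions do not differ reliably across the three CEQ versions.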
Judgmental categorisation notwithstanding, differences between the subcategory distributions for the three versions were not statistically significant (χ²(df = 10) = 15.998). This is perhaps further indication of the strength of the reliance on recalled personal and specific concrete experiences, constituting a warning that assuming students do something else is almost guaranteed to be problematic.

Before discussing the Table 6 subcategories in more detail, it is acknowledged that not all descriptions given by the students indicated the recall of personal, specific and concrete experiences. These are the 11% of total responses shown in the last category of Table 6. In the main, these were statements that were clearly identifiable as referring to decision processes, but which nonetheless gave no indication of the detail of just what those processes might have been. Some examples are:

I personally compared different lecturers that I'd had. I eliminated those that were an exception to the rule.

It's obviously hard to do it generally, so you sort of sift through and make a decision of the majority.

Over the three years I thought it was poor, because of the variation.

The sheer bulk of essays that I've had to write; my skills must have improved.

It is not so much that these statements indicate something other than the use of recalled specific concrete experiences. Rather it is that for the most part they did not unambiguously allow any process interpretation beyond the global identification that they were genuinely about deciding. The 89% of discrete response process descriptions that did indicate the use of recalled specific concrete experiences are now discussed.
These descriptions were broadly distinguishable into those in which the recall of specific concrete instances seemed prompted directly by a CEQ question, or those in which that recall seemed guided by some preliminary definition of what might constitute appropriate or applicable events or experiences. For instance, in relation to Q12 ("The staff seemed more interested in testing what I had memorised than what I had understood") two responses were:

I actually thought of a couple of exam questions that I had memorised, and a couple that I worked out.

I thought of how for this year's exam I already have most of the questions that are going to be asked.

Neither of these students seems to be indicating that the question was thought through before the recall occurred. It appears that the recall was somewhat spontaneous. The question has simply prompted the recall of particular concrete examples that fit. In contrast, another student reported:

I tried to work out how many subjects had allowed us to take in a "cheat sheet", which would mean testing more for understanding.

Now we see that the presence of "cheat sheets" is being used as a filter. The focus is still on the recall of specific instances of examinations, but examinations of a particular type. The recall is now mediated or guided. Presumably the use of the "cheat sheet" filter resulted from some preliminary consideration of the question. The first two responses would be categorised as "prompted directly" while the latter response would be categorised as "guided".

The "prompted directly" responses could be further subcategorised according to the nature of the recalled instances. Although the common thread in this broad category is that there seemed no preliminary consideration apparent before recall occurred, sometimes the recalled instances seemed other than simply what first came to mind. Sometimes the recalled events were described as being particularly salient or recent, or as being extremes.
The interpretation is that although the student might not have deliberately tried to recall events that were salient or extreme, that salience or extremeness might nonetheless have been the basis for the recall. Some examples of "salient or recent" are:

I was very influenced by my bad experiences this year in which we were told of what was expected after we had done the assignment. (Q24)

I thought of a particular lecturer from first year who told us how Law is different, and how we might find it difficult to predict how we would go in our studies. (Q24)

I was quite focused on my current lecturers; I can't remember the way I was taught earlier. (Q18)

I thought of one particular subject in which there were two projects, where we were given a large slice of time to do the work. (Q22)

Some examples of "extremes" are:

I can think of a really good one, and particularly bad. You try to think of a few of them. (Q18)

I first thought of the bad lecturers, and then I thought I have to be general so I thought of the good ones. (Q18)

I compared last year's research project to this year's: one was really bad and the other quite good. (Q22)

I remembered first year in which everything was crammed, so you don't remember any of it. But then in my later subjects you try to get the skills rather than just memorise. (Q12)

As can be seen from Table 6, the majority of "prompted directly" responses were neither "salient or recent" nor "extremes". Most decision descriptions in this broad category seem to indicate the simple nonsystematic recall of particular concrete instances prompted directly by the question. Descriptions in which the recall was indicated to be of extremes were in reality relatively infrequent.

As already noted, the second broad category of decision descriptions were those in which the recalled instances were referred to as fitting some set of defining criteria.
The interpretation drawn was that some initial consideration had been given to determining those criteria, and that the subsequent recall of experiences was thus at least minimally guided or filtered. This is not to claim that that recall was consciously exhaustive, or a representative sampling across time, just that it was not a matter of a simple "first to mind" prompting. Some further examples of this category are:

I thought about when a new concept comes up, and whether the lecturer is able to express it coherently so that I walked away having understood. (Q18)

I thought of how assessment requirements are described in terms of how much they're worth, and of the times when I wondered what I had to do. (Q24)

I thought of the practicals, where you basically follow a recipe, so there's no planning required. (Q22)

I focused on how much I can now recall from previous subjects; I doubt I'd be able to recall it all, so there must have been too much. (Q23)

A variant on the "guided" category of decision descriptions was when the phrasing suggested that the student had consciously narrowed the focus of the question from the range of experiences that it could logically include. The student would typically state something like "I took this to mean …". The inference is that the student was aware of the narrowing, and that it was perhaps a deliberate ploy. Some examples of narrowing are:

I let the incompetent lecturers sway me here, because they probably had more of an effect on my university career. (Q18)

I took "expectation" to mean assignments, and what you need to go into exams with. (Q24)

I decided between whether the question referred to time and deadlines, rather than planning in a drafting sense. (Q22)

I interpreted the question in terms of me being able to memorise and getting away with it anyhow. (Q12)

I focused on "thoroughly"; I haven't thoroughly comprehended anything in a 13-week semester.
(Q23)

From Table 6 it can be seen that more than half of the response decision descriptions overall were in the broad "guided" category. Within that, about a third were of the "narrowing" variety. Considering just the 89% of descriptions interpretable as indicating the use of recalled specific concrete experiences, the broad "guided" category accounted for about 60% compared to the "prompted directly".

Finally, consider the response decision approaches for the overall rating Q25. Like those reported for the other questions, these seem also predominantly based on personal experiential criteria (see Table 7).

TABLE 7. Examples of approaches reported for Q25 "Overall, I was satisfied with the quality of this course" (standard phrasing)

I look at how confident I am now with my abilities.
I took "satisfied" to mean doing subjects that I actually enjoyed.
I know that for most of my course material I have to go back and refresh, and that makes me judge low.
How well the lectures had been structured and taught came into it for me.
I thought in terms of what interesting material could have been covered, but which wasn't.
The dealings that I have with administration people in getting through.
I judged in terms of whether I'm now ready for a professional job.
Whether I could think of things in the course that I would change.
I judged by the credentials and expertise of my lecturers.
The course's reputation relative to those of courses at other universities.
I thought of my own preferential reaction to the content material.
I answered on pure interest.
After the exams, when I get my results, thinking how well I had learned.
I thought entirely emotionally, how happy I was with the course.

Only two reported approaches could be categorised as being "overall". One "… looked back at [his/her] responses to the other questions, and just summarised". The second "… compared to the other course in [his/her] double degree". All the other students reported deciding on the
basis of some observable or experiential criterion. This is interesting in that this particular question quite explicitly asks the student to make an overall satisfaction judgement on the quality of his or her course. It would seem that the students simply do not do that. Instead they choose some single experiential aspect, and focus on that. The students seem to use something akin to a "guided", even "narrowing", approach. Maybe judging "overall quality satisfaction" is just too broad a task. Maybe students find that they simply cannot "take everything into account", and have no option but to select some narrower, but manageable, aspect of their course experiences, and base their individual responses on that.

There is, however, a critical difference between "guided" and "narrowing" as applied here to Q25, and as applied to the other CEQ questions. Each of those other questions has a particular intentional focus, be it the helpfulness of feedback, the development of writing skills, the emphasis on factual assessment, or whatever. That particular focus should constrain the choice of any experiential proxy to being a logical operationalisation of that same focus. Students may well narrow a question in different ways, but they should still at least partially be answering the same question. For Q25 however, there is no constraining focus. When students re-define overall satisfaction to mean confidence, or interest levels, or credentials of lecturers, or administrative proficiency, or emotional reactions, they are choosing experiential proxies from a much broader range. When same-course peers choose such different aspects of their course experiences on which to base their Q25 responses, they are each essentially answering different questions.
This raises a methodological concern in that such differences will constitute a source of variation that may not be systematically related to satisfaction, making it more difficult to show between-cohort differences. Further, it raises an interpretational concern in that such differences will make it very difficult to draw any particular meaning from pooled student responses.

In summary then, the findings from this second study seem quite unambiguous. When students were asked to describe how they decided their responses to individual CEQ questions they consistently and uniformly did not describe being general and dimensional. They did not report performing some sort of comprehensive cataloguing of their experiences over the duration of their degree studies. They did not report using some sort of abstract or systematic judgmental process. Instead, they reported processes very much based in the particular, the concrete, the specific, in the recall of personally experienced episodic events and instances. This was true irrespective of CEQ version. The present categorisations of those decision processes are perhaps best interpreted as simply reflecting different ways in which students cope with what seems the pervasive reality of specific experiential recall.

The implications for CEQ question forms are also clear. The present findings suggest that no matter what form questions might take, the responding students will make use of specific recollections. Perhaps it is simply unavoidable. When something like "feedback" is mentioned to students, the recall to mind of particular relevant events might be almost automatic. And once those recollections are present in consciousness they are attended to. To use any question format that implicitly assumes that students will make balanced and measured judgements over an entire degree time frame is probably to accept a fiction.
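The proportions quoted in the discussion above, that the "guided" category covered more than half of all descriptions, that "narrowing" was about a third within it, and that "guided" accounted for about 60% of the interpretable descriptions, follow directly from the Table 6 sub-totals. A quick arithmetic check, using the counts as transcribed here:

```python
# Table 6 sub-totals per column (GCCA, BOS, DRS), as transcribed.
prompted = [39, 43, 25]    # "prompted directly" sub-totals
guided = [57, 56, 62]      # "guided" sub-totals
other = [5, 11, 19]        # not clearly indicative of concrete recall
narrowing = [18, 18, 14]   # "narrowing" subset within "guided"

total = sum(prompted) + sum(guided) + sum(other)  # 317 discrete descriptions
interpretable = sum(prompted) + sum(guided)       # those indicating concrete recall

print(round(interpretable / total, 2))         # 0.89, the "89%" in the text
print(sum(guided) / total > 0.5)               # True: "guided" is more than half overall
print(round(sum(narrowing) / sum(guided), 2))  # 0.29, roughly a third of "guided"
print(round(sum(guided) / interpretable, 2))   # 0.62, about 60% of interpretable
```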
Conclusions

The starting point for the studies reported here was a concern that the question formats used in the CEQ might not actually be well suited to the instrument's purpose. Although considerable research has well established the practical and theoretical validity of what the CEQ's questions ask, little work seems to have considered how those questions are asked. The first study here considered whether evidence could be found that the CEQ's agree/disagree question format might be less than optimal. The second study considered the thinking processes employed by students in deciding their responses to individual questions: were those processes different to what the form of the CEQ seemed implicitly to assume?

In the first study, the standard CEQ question format was "tinkered with" in deliberately constrained or minimalist ways. The logic was that were the standard agree/disagree format optimal, then any such tinkering should result in a deterioration of the CEQ as a measurement instrument. On none of the tests applied did the standard CEQ format prove itself to be at all superior. The conclusion drawn was that that standard format was likely less than optimal, and that developing alternatives could well be a fruitful course to pursue.

The second study had students report on the thinking processes that they had used in responding to individual questions in a just completed CEQ. The assumption implicit in the present CEQ, that students do indeed sample from their experiences across the duration of their degree studies, and seek to generate a response representative of those studies, simply did not hold. The clear and strong finding was that responding students used the recall of particular and concrete personal experiences and events, as prompted by the question. These recalled events are typically not logically exhaustive. They are typically not drawn from the entirety of the student's study experiences. They often focus on the salient and the recent.
They are often recognised, and explicitly acknowledged, as being biased and non-representative.

If alternatives to the present CEQ should be developed, do the present data give any suggestions, albeit speculative, for directions that that development might take? A first suggestion is probably quite obvious: begin by accepting the reality of how students will respond. Accept that it is likely unavoidable that a question will prompt the recall of personal, concrete events that in some sense fit with the question. So whatever question formats we might eventually choose, it would seem that in some fashion they must work with students' prompted recall of specific experiences.

A second suggestion can be made, but it is more speculative. Perhaps we could take a lead from the apparently dominant decision approach found in the present data, that of using a preliminarily decided set of criteria to guide or filter the subsequent prompted recall of instances. In the context of re-crafting the CEQ that might mean more than some minimal re-phrasing of questions. It might mean returning to the concern or issue underlying a given question and deciding on a range of more defined instances that could be experiential indicators of that concern or issue. That single question might then be replaced with a cluster of questions, each seeking to determine the extent to which some particular experiential indicator was indeed part of the student's personal experience. The range of these experiential indicators could of course be something determined by further research. In essence such an alternative approach could be seen as simply a means of harnessing already extant processing predilections, but in a much more determined, or channelled, fashion. The expectation is that students would not find such "clustered specific prompt" approaches difficult or alien. The benefit of course would be much greater consistency in the ways in which questions were interpreted by students, and thus responded to.
This in turn should translate to those response distributions being more usefully interpretable by institutions.

What of the overall satisfaction question? Do the present response data offer any speculative suggestions there? In the earlier discussion, it was noted that students seemed to apply a "guided" or "narrowing" approach here also. But in the context of re-crafting the CEQ, this finding might be more problematic than suggestive. The range of concrete instances that individual students recalled in constructing an overall satisfaction response was very wide. Adopting a strategy of replacing Q25 with a cluster of more specific prompts might simply be impractical. For any of the other "more targeted" questions, it is at least imaginable that there could exist a constrained range of specific instance classes that would contain the bulk of actual recalls that individual students might have. For Q25, such is probably unlikely. The suggestion here then is that the best option might be simply to discard Q25 as a discrete question, as bureaucratically unattractive as that might be. It might be better to investigate ways of using the responses to the other CEQ questions to construct or generate some index of overall satisfaction. One benefit of going down this route might be that the computation of such an index would be known, and transparent. Such an index would be recognised as being a derived entity. It would thus perhaps not divert institutional attention away from the more specific questions and dimensions, wherein the real feedback value arguably lies.

As a final point on which to conclude, it is well to remember what the present studies have not investigated. The present findings provide no evidence at all that the conceptual structure underlying the CEQ should be tampered with.
That scales reflecting dimensional variability on teaching activity, instructional goals, transferable skills, assessment and workload should prove predictive of teaching quality and learning outcomes generally, is well established in the literature. The concern here was with how we sample that dimensional variability, and not at all with sampling something else.

Address for correspondence: Dr Malcolm Eley, Higher Education Development Unit, PO Box 91, Monash University, Victoria 3800, Australia. E-mail: Malcolm.Eley@CeLTS.monash.edu.au

References

Anastasi, A. (1982). Psychological testing (5th ed.). New York: Macmillan.

Ashenden, D., & Milligan, S. (1999). Good universities guide, Australian universities. Port Melbourne: Mandarin Australia.

Borman, W.C. (1986). Behavior-based rating scales. In R.A. Beck (Ed.), Performance assessment: Methods and applications (pp. 100–120). Baltimore: Johns Hopkins University Press.

Eley, M.G., & Stecher, E.J. (1994). Comparison of an observationally- versus an attitudinally-based response scale in teaching evaluation questionnaires: II. Variation across time and teaching quality. Research and Development in Higher Education, 17. Proceedings of the 20th Annual Conference of the Higher Education Research and Development Society of Australasia, Canberra, ACT, 6–10 July, pp. 196–202.

Eley, M.G., & Stecher, E.J. (1995). The comparative effectiveness of two response scale formats in teaching evaluation questionnaires. Research and Development in Higher Education, 18. Proceedings of the 21st Annual Conference of the Higher Education Research and Development Society of Australasia, Rockhampton, Queensland, 4–8 July, pp. 278–283.

Eley, M.G., & Stecher, E.J. (1997). A comparison of two response scale formats used in teaching evaluation questionnaires. Assessment and Evaluation in Higher Education, 22, 65–79.

Ericsson, K.A., & Simon, H.A. (1980). Verbal reports as data. Psychological Review, 87, 215–251.

Hand, T., Trembath, K., & Elsworthy, P.
(1998). Enhancing and customising the analysis of the Course Experience Questionnaire. Evaluation and Investigations Program; Department of Employment, Education, Training and Youth Affairs, Canberra, Australia.

Kaiser, H.F. (1974). An index of factorial simplicity. Psychometrika, 39, 31–36.

Latham, G.P., & Wexley, K.N. (1977). Behavioral observation scales for performance appraisal purposes. Personnel Psychology, 33, 815–821.

Likert, R. (1932). A technique for the measurement of attitudes. Archives of Psychology, No. 140.

Long, M., & Johnson, T. (1997). Influences on the Course Experience Questionnaire scales. Evaluation and Investigations Program; Department of Employment, Education, Training and Youth Affairs, Canberra, Australia.

Ramsden, P. (1991). A performance indicator of teaching quality in higher education: The Course Experience Questionnaire. Studies in Higher Education, 16, 129–150.

Richardson, J.T.E. (1994). A British evaluation of the Course Experience Questionnaire. Studies in Higher Education, 19, 59–68.

Stecher, E.J., & Eley, M.G. (1994). Comparison of an observationally- versus an attitudinally-based response scale in teaching evaluation questionnaires: I. Variation relative to a common teaching sample. Research and Development in Higher Education, 17. Proceedings of the 20th Annual Conference of the Higher Education Research and Development Society of Australasia, Canberra, ACT, 6–10 July, pp. 210–217.

Tabachnick, B.G., & Fidell, L.S. (1996). Using multivariate statistics (3rd ed.). New York: HarperCollins.

Wilson, K.L., Lizzio, A., & Ramsden, P. (1997). The development, validation and application of the Course Experience Questionnaire. Studies in Higher Education, 22, 33–52.