Higher Education Research & Development, Vol. 20, No. 3, 2001
The Course Experience Questionnaire:
Altering question format and phrasing
could improve the CEQ’s effectiveness
MALCOLM ELEY
Monash University
ABSTRACT

Two studies considered whether the Course Experience Questionnaire's (CEQ) question format was the most appropriate for the CEQ's purpose. In the first, comparisons were made against two alternative but minimalist variations on the standard format. None of three tests showed the standard format to be superior. In the second, students reported the thinking used in deciding their responses on a sample of specific CEQ questions. Those reports uniformly showed responses to be decided by the recall of particular, concrete and personal experiences prompted by a question, and not by the overviewing implicitly assumed by the standard format. The implications drawn are that the systematic trialing of alternative question forms could well result in improved performance of the CEQ as an instrument, and that those alternative forms should probably be constrained to those directly prompting the recall of personal experience, but in more guided fashion than seems presently to occur.
Introduction
The Course Experience Questionnaire (CEQ) (Ramsden, 1991) was developed to gather information relevant to judging the overall effectiveness of degree programs. As commonly used within the Australian higher education sector it comprises 25 questions, structured as five subscales plus a single overall rating question, and asks recent graduates to judge the quality of the teaching and learning support experienced during their entire degree studies. The CEQ has undergone considerable development work, and its overall construct validity, and that of its component subscales, seems well established (Richardson, 1994; Wilson, Lizzio, & Ramsden, 1997). Further, there has been work to establish the influence on CEQ responses of various characteristics of the responding graduates (Long & Johnson, 1997). Which particular aspects of a graduate's teaching and learning experiences constitute potentially useful sources of variation amongst degree programs, which show valid relationships with teaching quality, and which therefore seem sensible targets for questionnaire questions, have all been well considered.
The CEQ has been incorporated into an annual survey of graduates run by the Graduate Careers Council of Australia (GCCA) since 1993. The responses collected in that survey are fed back to the participating institutions, and in varied fashion those institutions make use of the data for their own course review processes. Those same responses are also used annually to rank Australian universities within particular fields of study (e.g., Ashenden & Milligan, 1999), which rankings are widely available to prospective students. Recently, the Australian federal government has shown interest in developing a set of performance indicators applicable to universities' activities, and the CEQ has been identified as a candidate for that set. The latest area of development is in statistical analysis and interpretation tools that will enable institutions to make valid, appropriate and more targeted use of CEQ data (Hand, Trembath, & Elsworthy, 1998).
However, with all this national and institutional interest, there is yet one aspect of the CEQ that seems to have received little research attention. This relates not to the substance of its questions, but more to their form. The CEQ uses a question format originally devised for attitude surveying, in which the intent is that responses should indicate something of respondents' attitudes and values (Likert, 1932; Anastasi, 1982). In the CEQ context, a statement that represents a particular value position, e.g., "It was always easy to know the standard of work expected" or "To do well in this course all you really needed was a good memory", is presented, and the student responds by indicating a degree of agreement or disagreement. But rather than responses being used to describe the students, as is the original design purpose of this question format, those responses are instead used to infer something about the courses that they experienced. There has been a subtle shift; the "tool" has been used for something other than that for which it was originally designed. We should not assume that this shift is of no matter. With student questionnaires focused on the teaching of an individual academic, question form can have significant effects on the psychometric properties of the questionnaire (Eley & Stecher, 1994, 1995, 1997; Stecher & Eley, 1994). Questionnaires comprising behavioural observation scale questions (e.g., Borman, 1986; Latham & Wexley, 1977) proved better at detecting differences between independent samples of teaching, and between distinct aspects within a single sample, than parallel equivalent questionnaires using a CEQ-like agree/disagree format.
There is yet a further point. In asking a student to indicate agreement with something like "Feedback on my work was usually provided only in the form of marks or grades" the CEQ makes some implicit assumptions about how individual students determine their responses. In that they are asked to respond relative to their overall course, the assumption is that students bring to mind some representative sampling of their experiences from across the years, and then apply some sort of averaging or mode-finding algorithm. Further, in that all students' responses to any single question are pooled in analyses, the assumption is that all those responses belong to the same class; that is, all students took the same intended meaning from the question. Neither of these assumptions need be valid. Students might well vary in the time frames that they apply to their response processing. Students might vary in terms of what they include in "my work", or what "usually" means quantitatively, or indeed what constitutes "feedback". The point is that we do not know much of the thinking that students use in deciding their CEQ responses. We do not know whether the CEQ question form implicitly asks students to do something that they might find impossible to do, or which is perhaps contrary to their well established processing predispositions.
The question arises as to whether some modifications to the ways in which the CEQ asks its questions might improve the effectiveness of the CEQ as an instrument. What might happen if question formats were varied from the indirect, attitudinally based agree/disagree form used? Are there discernible patterns of processing that students use in deciding their CEQ responses?
Study One
The first study tested the proposition that the CEQ's agree/disagree question format might not be optimal psychometrically. More specifically, the concern was to test whether altering the form of the CEQ's individual questions offered any reasonable prospect of improving the reliabilities of the CEQ's subscales, the empirical fit of those subscales to the underlying conceptual dimensionality, and the ability of the subscales to differentiate amongst defined cohorts within fields of study.
The basic approach was not to develop and test full alternatives to the CEQ. Rather, the present CEQ was compared to alternatives that incorporated quite minimal modifications. Only the form of the individual questions was altered. The semantic intent of those questions was maintained. The order and number of questions were kept the same. The general instructions to respondents were kept as parallel as possible. The simple logic was that if the present form of the CEQ is even broadly optimal for its task, then any sort of tinkering should result in a deterioration of its capabilities as a measurement device. It should certainly not be the case that the present form of the CEQ would prove to be poorer than any such modification.
One modified version used an adaptation of the behavioural observation scale (BOS) format. In developing "true" BOS questions, specific concrete events that respondents might observe or experience, and which could be taken as indicators of the dimension of interest, are generated. In an individual BOS question, a description of one of these "indicator events" would be presented, and the respondent would report the frequency with which it had been observed or experienced. So to parallel a CEQ question like "The staff put a lot of time into commenting on my work", more specific descriptions like "Assignment work was returned with written comment on its quality", "Tutors observed my in-class work, and suggested how I might improve it", or "Teachers gave me out-of-class assistance in interpreting my mid-term test results" might be generated. BOS questions would then present these descriptions, each time asking the student to report how consistently or frequently they were observed or experienced.
In the present study, however, such specific BOS descriptions were not generated. Rather, the original CEQ phrasing was minimally altered, if at all, and only as necessary for a frequency-based response scale to seem sensible semantically. In essence this meant that the present "BOS" version of the CEQ comprised phrasings very close to the standard phrasings, but associated with a frequency-based response scale rather than the agree/disagree one. The "BOS" CEQ used here was not so much a full BOS alternative to the CEQ as the standard CEQ with its questions minimally transformed to be "BOS-like".
In similar fashion, the second CEQ version used here was an adaptation of the dimensional rating scale (DRS) format often found in teaching evaluations. In this format students might be given dimensional aspects of lecturing such as "punctuality", "clarity", "supportiveness" or "availability", and be required to rate a lecturer against each on something like a "very good" through "very poor" scale. In the present context, CEQ phrasings like "The teaching staff normally gave me helpful feedback on how I was going" would be re-cast as a dimensional description like "The helpfulness of the feedback given by teaching staff on how you were going". Again then, the "DRS" CEQ used here was not a full dimensional rating alternative; it was constrained to comprise the standard CEQ questions minimally transformed to be "DRS-like".
This first study was clearly then a quite conservative test of the simple proposition that the present CEQ, with its agree/disagree question format, was optimal for its purpose. The study was not a test of what fully developed alternative CEQ formats might look like. It was rather a test of whether proceeding to such development effort might be worthwhile.
Method
Three parallel versions of the 25-item CEQ were prepared. The first was the "standard" version used by the GCCA. It comprised question stems that were qualitative statements paired with agree/disagree response scales. The instructions were those in the GCCA usage. Students were instructed to think about their course as a whole, and to indicate their view of each question's statement. The second version used adaptations of the behavioural observation scale (BOS) question format. The GCCA stems were modified by removing any frequency-connoting qualifiers (e.g., always, sometimes, usually), and by deleting any syntactically unnecessary references to "the course". The response scale ranged from "all or almost all" through "very few or none". Students were instructed to think back over the individual subjects taken in their course, and to estimate the proportion for which each question's statement was true. The individual subject was used as the basis for response for purely practical reasons; students could not be expected to recall specific events over their degree durations, yet some "countable" unit was needed for the "BOS-like" format to be sensible. The final version adapted a dimensional rating scale (DRS) question format. Each GCCA stem was converted into an unqualified description of a dimension, and the response scale asked for a rating from "very good" through "very poor" against that dimension. Students were instructed to think about their course as a whole, and to rate the course element described in each question. Example parallel questions are shown in Table 1.
TABLE 1. Examples of question stems from each CEQ version

GCCA (standard) version: It was always easy to know the standard of work expected.
BOS version: It was easy to know the standard of work expected.
DRS version: The ease with which you could determine the standard of work expected.

GCCA (standard) version: I usually had a clear idea of where I was going and what was expected of me in this course.
BOS version: I had a clear idea of where I was going and what was expected of me.
DRS version: How clearly you understood where you were going and what was expected of you in the course.

GCCA (standard) version: As a result of my course, I feel confident about tackling unfamiliar problems.
BOS version: The subject developed my confidence about tackling unfamiliar problems.
DRS version: The course's development of your confidence to tackle unfamiliar problems.

The three versions were administered to final year undergraduate Engineering and Business students in their last semester before graduation. Engineering and Business were chosen because the degree programs are relatively structured, with considerable commonality within the programs undertaken for any specific specialty. This meant that the students making judgments about, say, Civil Engineering would have much experience in common. Any individual student completed one of the three versions. The versions were distributed randomly to students within each accessed class group, ensuring that roughly equivalent numbers within any specialty completed each. Because of varying access practicalities, questionnaire completion occurred via a mixture of modes: supervised individually, supervised in-class, at home but returned in class, and mailed out and mailed back. This variation in completion mode could be expected to contribute to nonsystematic error variance, and thus to the conservative nature of the present study.
Results and Discussion
Responses from a total of 352 students were analysed, being 147 from Engineering (50.3% of the enrolment) and 205 from Business (48.8% of the enrolment). The breakdown of students according to specialty and CEQ version is shown in Table 2. In summary, 126 students completed the GCCA version, 123 the BOS, and 103 the DRS. For each version, response possibilities were scored by ascribing point values 5 through 1, so that larger values were consistently associated with more positive response meanings. This meant some variation between version scoring regimes, since eight of the GCCA and BOS questions were negatively phrased whereas all DRS questions were unidirectional.
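As a concrete illustration of this scoring step (not part of the original analyses), the sketch below shows how such reverse scoring might be done. The response matrix and the list of negatively phrased item positions are hypothetical placeholders; the paper does not enumerate which eight items were reversed.

```python
import numpy as np

# Minimal sketch of the scoring step described above. `raw` stands in for a
# hypothetical (n_students, 25) matrix of responses already coded 1..5, and
# `negatively_phrased` is an illustrative placeholder list of 0-based column
# positions; the actual reversed CEQ items are not listed in the paper.
rng = np.random.default_rng(0)
raw = rng.integers(1, 6, size=(352, 25))                # placeholder data
negatively_phrased = [3, 7, 11, 13, 15, 18, 20, 22]     # hypothetical indices

scored = raw.astype(float)
# Reverse-score the negative items so that 5 always carries the more
# positive meaning (5 -> 1, 4 -> 2, ..., 1 -> 5).
scored[:, negatively_phrased] = 6 - scored[:, negatively_phrased]
```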
Two sets of analyses were conducted on the entire sample. First, Cronbach α reliabilities were calculated on each of the five defined CEQ subscales, separately for each of the three versions (see Table 3). For four of the subscales the GCCA version yielded the lowest reliability, indicating that as measures the subscales seemed generally the weakest when defined by the standard GCCA format questions. Even for the Good Teaching subscale, for which the GCCA reliability was the highest, the reliabilities for the three versions were probably more similar than distinct.
TABLE 2. Numbers of responding students within each specialty program responding to the CEQ versions

Engineering | GCCA | BOS | DRS
Chemical | 8 | 10 | 8
Civil | 15 | 13 | 7
Electrical | 10 | 17 | 9
Materials | 2 | 1 | 3
Mechanical | 18 | 11 | 15

Business | GCCA | BOS | DRS
Accounting | 11 | 15 | 12
Banking & Finance | 31 | 30 | 18
Business Admin. | 5 | 10 | 5
International trade | 1 | 1 | 1
Management | 12 | 5 | 11
Marketing | 10 | 6 | 11

Note: A further 10 Business students were not identifiable as belonging to one of the specialties.
The GCCA version of the CEQ was not found to yield clearly superior subscale reliabilities to what were presumably less than optimal manifestations of the alternative question format versions.
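For readers wanting to run this kind of reliability check on their own data, the following is a minimal sketch of Cronbach's α computed from an item-score matrix; it is not the original computation, and the subscale column indices shown are hypothetical placeholders rather than the actual CEQ item numbers.

```python
import numpy as np

def cronbach_alpha(items):
    """Cronbach's alpha for an (n_respondents, n_items) matrix of item scores."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    item_variances = items.var(axis=0, ddof=1)       # variance of each item
    total_variance = items.sum(axis=1).var(ddof=1)   # variance of the summed scale
    return (k / (k - 1)) * (1.0 - item_variances.sum() / total_variance)

# Hypothetical usage: `scored` is an (n_students, 25) matrix of scored responses
# and `gt_items` lists the columns of the Good Teaching questions. These column
# indices are illustrative only.
gt_items = [2, 6, 14, 16, 17, 19]
# alpha_gt = cronbach_alpha(scored[:, gt_items])
```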
The second set of analyses on the entire sample comprised factor analyses (principal components, varimax) of responses to the 24 subscale-defining questions, run for each of the three versions separately. Each analysis forced a five-factor solution to investigate the extent to which factor loadings would reflect the a priori conceptual subscale structure defined by the "standard" GCCA version. Table 4 lists rotated factor loadings for each CEQ question, grouped by subscale membership and solution factor. For any CEQ version, the strongest indication of stable subscale structure would be for all primary loadings (the largest factor loading for any given question) to sit in the diagonal cells in Table 4. This would indicate a close coincidence between factor solution and defined subscale membership of the questions.
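The original analyses were presumably run in a standard statistics package; the sketch below is only an illustrative numpy equivalent of the procedure described (principal components extracted from the item correlation matrix, five components retained, varimax rotation), with `scored_24_items` standing in for a hypothetical matrix of the 24 subscale question scores.

```python
import numpy as np

def pca_loadings(scores, n_factors=5):
    """Unrotated principal-component loadings from the item correlation matrix."""
    corr = np.corrcoef(scores, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(corr)
    order = np.argsort(eigvals)[::-1][:n_factors]      # largest eigenvalues first
    return eigvecs[:, order] * np.sqrt(eigvals[order])

def varimax(loadings, max_iter=100, tol=1e-6):
    """Standard varimax rotation of an (n_items, n_factors) loading matrix."""
    p, k = loadings.shape
    rotation = np.eye(k)
    d_old = 0.0
    for _ in range(max_iter):
        rotated = loadings @ rotation
        target = rotated ** 3 - (rotated @ np.diag((rotated ** 2).sum(axis=0))) / p
        u, s, vt = np.linalg.svd(loadings.T @ target)
        rotation = u @ vt
        d_new = s.sum()
        if d_new < d_old * (1 + tol):   # stop when the criterion stops improving
            break
        d_old = d_new
    return loadings @ rotation

# Hypothetical usage on a scored (n_students, 24) matrix:
# rotated = varimax(pca_loadings(scored_24_items, n_factors=5))
# primary = np.abs(rotated).argmax(axis=1)   # primary factor for each question
```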
In general terms the fit with conceptual subscale structure seemed best for the BOS version, next for the GCCA, and least for the DRS. For the BOS version 21 of the 24 questions showed agreement between primary loadings and subscale membership, for the GCCA this was true of 19 questions, and for the DRS 16 questions.
TABLE 3. Subscale reliabilities (decimal points omitted) calculated for each version

Version | Good teaching (GT, 6 questions) | Clear goals (CG, 4 questions) | Generic skills (GS, 6 questions) | Appropriate assessment (AA, 4 questions) | Appropriate workload (AW, 4 questions)
GCCA | 863 | 677 | 648 | 489 | 654
BOS | 843 | 766 | 787 | 605 | 662
DRS | 830 | 785 | 780 | 716 | 827
TABLE 4. Primary factor loadings for each CEQ question for five-factor solutions on each CEQ version separately (principal components, varimax, decimal points omitted)

Questions are listed as grouped into their CEQ subscales (GT: 03, 07, 15, 17, 18, 20; CG: 01, 06, 13, 24; GS: 02, 05, 09, 10, 11, 22; AA: 08, 12, 16, 19; AW: 04, 14, 21, 23), with loadings tabulated against the five solution factors for each of the GCCA, BOS and DRS versions. Bold figures are the loadings that conform to CEQ subscale structure. [Individual loading values are not reproduced here.]
Putting this another way, all five BOS subscales had a majority of items showing solution loading agreement, with two showing complete agreement; four of the GCCA subscales showed majority agreement, with one showing complete agreement; for the DRS it was two subscales and one respectively.
There are some further indications deriving from the factor solutions. Initial components extraction for the GCCA analysis showed seven eigenvalues greater than one, whereas for each of the BOS and DRS six were shown. The rotated five-factor solution for the GCCA version explained 54.8% of the variance, whereas for the BOS and DRS the figures were 57.5% and 61.3% respectively. For the GCCA some 42% of nonredundant residuals had absolute values greater than 0.05, whereas for the BOS and DRS the figures were 38% and 41% respectively. Finally, Kaiser-Meyer-Olkin measures of sampling adequacy calculated on the five-factor solutions were 0.768 for the GCCA, 0.803 for the BOS, and 0.846 for the DRS, indicating that the GCCA solution could be characterised as "middling" and the BOS and DRS as "meritorious" (Kaiser, 1974). All of this combines to suggest that the a priori expectation of a five-factor solution seems to fit least well for the GCCA data, compared to the BOS and DRS (Tabachnick & Fidell, 1996).
There are some further observations that warrant comment. The five "errant" GCCA questions with loading mismatches were all from the four subscales other than Good Teaching, with all migrating to align with Good Teaching. For the other two CEQ versions the picture was more mixed, with "errant" BOS questions migrating to align with Good Teaching or Clear Goals, and the "errant" DRS with Good Teaching, Clear Goals or Generic Skills. One interpretation might be that for the GCCA version the Good Teaching subscale is dominant or superordinate, whereas for the other versions the subscales seem more independent and equally weighted.
In summary, while all of these comparisons of the CEQ versions on subscale fit are rough measures, the consistent finding would seem to be that the standard GCCA question format did not result in unambiguous superiority.
A further set of analyses was conducted, but now separately on the Engineering and Business samples. Within each sample, and separately for each CEQ version, univariate one-way analyses of variance comparing degree specialties were calculated using subscale and overall rating scores as dependent variables. For reasons of small group sizes, Materials students and International Trade students were excluded. The purpose in these analyses was not so much to test for actual differences amongst the Engineering and Business specialties, but rather to test the comparative ability of the three CEQ versions to detect whatever between-specialty differences there might be. The outcomes of interest therefore were not the actual main effect significances, but rather the relative strengths of those effects across equivalent analyses for each CEQ version. The measure used was η², the ratio of SS_between to SS_total: larger values of η² indicate that group means are more different than similar. If, within one of the Engineering or Business samples, the η² for a particular dependent variable for one CEQ version is larger than the parallel values for the other two versions, then it is a rough indicator that the between-specialty differences, as measured by that CEQ version, are greater than the differences measured by the other versions. To put it another way, that one CEQ version would be a better measurement instrument in that differences would have a better chance of being detected. So if across all dependent variables we find that one CEQ version consistently tends to show the largest (or smallest) η² values compared to parallel values for the other versions, then what individually were "rough indicators" of measurement superiority (or inferiority) combine into more confident grounds.
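The η² measure itself is simple to compute. The following is a minimal sketch (not the original analysis code), with `groups` standing in for hypothetical arrays of subscale scores, one per specialty within a faculty sample.

```python
import numpy as np

def eta_squared(groups):
    """eta-squared (SS_between / SS_total) for a one-way design.
    `groups` is a sequence of 1-D arrays of scores, one array per specialty."""
    groups = [np.asarray(g, dtype=float) for g in groups]
    all_scores = np.concatenate(groups)
    grand_mean = all_scores.mean()
    ss_total = ((all_scores - grand_mean) ** 2).sum()
    ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
    return ss_between / ss_total

# Hypothetical usage, one value per CEQ version per dependent variable, e.g.:
# eta_squared([chem_gt_scores, civil_gt_scores, elec_gt_scores, mech_gt_scores])
```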
For only one of the 12 analyses within the Engineering or Business samples did the GCCA version yield the greatest η² value (see Table 5).
TABLE 5. Strength of association (η²) values from ANOVAs on each CEQ version, for the Engineering and Business samples separately

Engineering | Good teaching | Clear goals | Generic skills | Appropriate assessment | Appropriate workload | Overall rating
GCCA | 0.219 | 0.062 | 0.184 | 0.059 | 0.133 | 0.146
BOS | 0.172 | 0.176 | 0.130 | 0.048 | 0.097 | 0.083
DRS | 0.314 | 0.189 | 0.188 | 0.181 | 0.202 | 0.144

Business | Good teaching | Clear goals | Generic skills | Appropriate assessment | Appropriate workload | Overall rating
GCCA | 0.057 | 0.046 | 0.031 | 0.018 | 0.115 | 0.047
BOS | 0.024 | 0.086 | 0.227 | 0.085 | 0.148 | 0.117
DRS | 0.086 | 0.047 | 0.450 | 0.067 | 0.031 | 0.053

Note: The largest value within each comparable set was set in bold type in the original.
Were the GCCA version the better measurement instrument, then the simple expectation would be that between-group differences would more likely manifest with the GCCA. The GCCA version should show the greatest η² values on at least a majority of instances, certainly better than simple chance. It did not. As rough and conservative a test as this comparative use of η² is, students' responses using the GCCA version seemed least likely to differentiate between the specialty groups within either Engineering or Business. When it comes to the ability of the CEQ to operate as a measurement device, and to detect differences, the standard GCCA version did not show itself to be clearly superior.
The question addressed by this first study was very basic. Can evidence be found that the present CEQ, with its agree/disagree question format, is likely less than optimal for its role as a measurement instrument? The study used a conservative methodology in that the standard CEQ was compared against minimalist variations from itself, rather than against genuinely developed alternatives. The analyses applied three conservative tests. The simple conclusion is that on none of those tests did the standard CEQ show itself to be unambiguously superior. The CEQ's subscales were generally found to exhibit better scale reliabilities when the question formats were something other than the standard agree/disagree. The empirical fit of individual questions' response patterns to those subscales as an underlying a priori structure was better when the question formats were other than agree/disagree. Finally, the capability of the CEQ to detect measurable differences was better when the question format was other than agree/disagree.
In summary, this Ž rst study’s Ž ndings clearly suggest that the present form of the
CEQ is indeed less than optimal for its purpose. The minimalist and conservative
comparisons made here suggest that a careful and systematic development of
alternative question forms could well be expected to result in the clearer deŽ nition
of the CEQ’s subscales, and an improved capability of the CEQ to distinguish
amongst cohorts, institutions and programs.
Study Two
Taking the Study One findings to indicate that the development of alternative CEQ forms would likely prove fruitful, an obvious next question is what such alternative forms might look like. A good starting point might be to investigate the thinking processes that respondents use in deciding their responses to the CEQ. The concern here is not to alter the conceptual dimensions of the CEQ, but rather to consider better ways of collecting students' responses relating to those conceptual dimensions. Were we to investigate how students presently make their response decisions, what they specifically recall to mind, and how they use that recalled information, we might then be able to devise ways of asking the CEQ questions that fit better with those decision processes.
Such a concern that any alternative CEQ form should be developed especially to fit closely with student processing was the basis of the second study. Students were asked to report on the specific decision making that led to particular responses. To the extent that students are empirically found to do something other than what the present form of a question might implicitly assume, there would be grounds for revising that question to conform more to what the students do, or rather to the ways in which they typically think in responding.
Method
Subjects
A total of 45 volunteer students in the final semester of their undergraduate programs in each of Law, Engineering and Psychology were recruited. These volunteers comprised three 5-student groups within each discipline, drawn from the full-time, on-campus enrolments. Subjects were not paid; however, their participation occurred during lunchtimes, and juice and sandwiches were provided.
Materials
The same three versions of the CEQ that were used in Study One were used here
also. All three, rather than only the standard GCCA version, were used to allow for
the possibility that students’ response processes might vary with version.
Procedure
For each 5-student group, the session began with all members of that group completing one of the three CEQ versions. The purpose of the session was defined as the gathering of student comment on how they individually decided their responses to the CEQ; to provide a context for that comment, the students were first required to complete the CEQ on their own "about to be completed" degree studies.
Following the questionnaire’s completion, the students were taken through a
sample of 11 CEQ questions. For each question the text of the question was read
out, and the students were invited to relate as directly as possible how they came to
their individual responses. The general instruction was that they should recall their
thinking, and then simply describe that thinking. Each student in the group was
The Course Experience Questionnaire
303
given an opportunity to respond. The only comments made by the experimenter
during this process were requests for clariŽ cation or elaboration when a student’s
description was vague or ambiguous, or redirections when the discussion veered too
far from a focus on describing the response thinking for individual questions.
In using such a self-report methodology it was assumed that the processes functionally involved in students' decision making are indeed conscious, working memory processes, and that at least in the short term they remain available to be recalled and verbalised. Further, if those decision processes are in large part themselves verbal in nature, then the retrospective reporting would be a relatively direct externalisation. In that the time delay between actually responding to a question and verbalising the response processing was minutes, and that a re-statement of the question was used to cue the recall of those response processes, the expectation was that the resulting report data would be as good a reflection of those actual processes as practicalities allowed (see Ericsson & Simon, 1980).
All students were taken through the same sample of questions, in the same order,
regardless of discipline or CEQ version. This order was questions 18, 24, 22, 12, 23,
25, 17, 6, 11, 19 and 4, being one question from each of the Good Teaching, Clear
Goals, Skills, Assessment, and Workload subscales, then the general overall question,
and then a further question from each of the subscales.
For each student group the responses were audiotape-recorded. These tapes were later transcribed for analysis. Sessions typically required about 40 to 50 minutes.
Results and Discussion
Student responses for each group were inspected for statements that clearly described how an individual's response to a given CEQ question had been derived. Students would often embellish their responses with comment or reflection about their study experiences more generally. They would often divert into discussion about what ought to have been, rather than what was. Such embellishments and diversions were ignored for the analyses here. Only those statements that described what a student had done in deciding his or her response were taken as the data for analysis.
It was often the case that individual students would not give a personal description
of a response decision process, but would instead concur with another group
member’s description. Such simple concurrences were also not considered or
counted in the analyses here. The purpose of the analyses was not to compute some
accurate count of individual response approaches. Rather, the purpose was to
categorise those discrete statements that were offered to determine the range of
approaches used, and the broad relativities between them. The assumption was that
the picture gained from analysing only those descriptions actually offered would
re ect also the approaches taken by students who simply concurred.
The clear initial observation is that the described response approaches were overwhelmingly based on recalling personal, specific and concrete experiences.
TABLE 6. Counts and percentages of reported response approaches fitting different categories

Category | GCCA respondents | BOS respondents | DRS respondents
Recall of specific concrete instances prompted directly by the question:
… apparently unguided or nonsystematic recall | 26 (25.7%) | 25 (22.7%) | 17 (16.0%)
… focus on the salient or recent | 11 (10.9%) | 14 (12.7%) | 5 (4.7%)
… focus on extreme instances | 2 (2.0%) | 4 (3.6%) | 3 (2.8%)
Sub-totals | 39 (38.6%) | 43 (39.1%) | 25 (23.6%)
Recall guided by preliminary definition of classes of events:
… defined observable criteria used to guide attempted recall of specific instances | 39 (38.6%) | 38 (34.5%) | 48 (45.3%)
… deliberate narrowing of the range of instances to be recalled | 18 (17.8%) | 18 (16.4%) | 14 (13.2%)
Sub-totals | 57 (56.4%) | 56 (50.9%) | 62 (58.5%)
Responses not clearly indicative of the recall of personal, specific and concrete experiences | 5 (5.0%) | 11 (10.0%) | 19 (17.9%)
Totals | 101 | 110 | 106
A consistent thread through all groups, irrespective of CEQ version and degree program, was that an individual question would prompt recall of particular experiences, events, people, which in various ways would then be used to derive a response to the question. In essence, none of the students reported doing what any of the CEQ versions asked, that being to base their responses on their overall experiences. Often the students would "confess" that they found it impossible to recall their entire experiences, and that they had no option but to base their responses on the sample of experiences that they could recall, typically with the explicit recognition that inaccurate reflections of those overall experiences might result.
There was however variation in the ways in which individual students made use of those recalled concrete experiences. Two broad categories of approach could be discerned, with further subdivision within each. These categories and subcategories are not claimed to be mutually exclusive, but to be broad, general descriptions of the approaches taken, sometimes in isolation, sometimes in combination. In constructing the summary tabulation of the 317 discrete descriptions that comprised the base data (Table 6), individual descriptions were assigned to single subcategories even though they might not have reflected that subcategory uniquely. The categorisation process was admittedly judgmental, with descriptions being assigned on a "best apparent fit" criterion.
Although the responses are shown in Table 6 as classified according to CEQ version, the discussion here will ignore that distinction. The simple finding was that there were no strong and apparent differences in the range and mix of decision approaches attributable to CEQ version. Judgmental categorisation notwithstanding, differences between the subcategory distributions for the three versions were not statistically significant (χ² (df = 10) = 15.998). This is perhaps further indication of the strength of the reliance on recalled personal and specific concrete experiences, constituting a warning that to assume students do something else might almost be guaranteed problematic.
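By way of illustration only, the sketch below shows how such a version-by-category test of independence could be run on the Table 6 counts with scipy; it is not claimed to reproduce the reported statistic exactly, since the paper does not spell out precisely how the subcategories entered that calculation.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Subcategory counts from Table 6 (columns: GCCA, BOS, DRS respondents).
counts = np.array([
    [26, 25, 17],   # unguided or nonsystematic recall
    [11, 14,  5],   # focus on the salient or recent
    [ 2,  4,  3],   # focus on extreme instances
    [39, 38, 48],   # recall guided by defined observable criteria
    [18, 18, 14],   # deliberate narrowing of the range of instances
    [ 5, 11, 19],   # not clearly experiential recall
])

chi2, p, dof, expected = chi2_contingency(counts)
print(f"chi2(df={dof}) = {chi2:.3f}, p = {p:.3f}")   # dof = (6 - 1) * (3 - 1) = 10
```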
Before discussing the Table 6 subcategories in more detail, it is acknowledged that not all descriptions given by the students indicated the recall of personal, specific and concrete experiences. These are the 11% of total responses shown in the last category of Table 6. In the main, these were statements that were clearly identifiable as referring to decision processes, but which nonetheless gave no indication of the detail of just what those processes might have been. Some examples are:
I personally compared different lecturers that I’d had.
I eliminated those that were an exception to the rule.
It’s obviously hard to do it generally, so you sort of sift through and make
a decision of the majority.
Over the three years I thought it was poor, because of the variation.
The sheer bulk of essays that I’ve had to write; my skills must have
improved.
It is not so much that these statements indicate something other than the use of recalled specific concrete experiences. Rather it is that for the most part they did not unambiguously allow any process interpretation beyond the global identification that they were genuinely about deciding.
The 89% of discrete response process descriptions that did indicate the use of recalled specific concrete experiences are now discussed. These descriptions were broadly distinguishable into those in which the recall of specific concrete instances seemed prompted directly by a CEQ question, or those in which that recall seemed guided by some preliminary definition of what might constitute appropriate or applicable events or experiences. For instance, in relation to Q12 ("The staff seemed more interested in testing what I had memorised than what I had understood") two responses were:
I actually thought of a couple of exam questions that I had memorised, and
a couple that I worked out.
I thought of how for this year’s exam I already have most of the questions
that are going to be asked.
Neither of these students seems to be indicating that the question was thought through before the recall occurred. It appears that the recall was somewhat spontaneous. The question has simply prompted the recall of particular concrete examples that fit. In contrast, another student reported:
I tried to work out how many subjects had allowed us to take in a “cheat
sheet”, which would mean testing more for understanding.
Now we see that the presence of "cheat sheets" is being used as a filter. The focus is still on the recall of specific instances of examinations, but examinations of a particular type. The recall is now mediated or guided. Presumably the use of the "cheat sheet" filter resulted from some preliminary consideration of the question.
The first two responses would be categorised as "prompted directly", while the latter response would be categorised as "guided".
The "prompted directly" responses could be further subcategorised according to the nature of the recalled instances. Although the common thread in this broad category is that there seemed no preliminary consideration apparent before recall occurred, sometimes the recalled instances seemed other than simply what first came to mind. Sometimes the recalled events were described as being particularly salient or recent, or as being extremes. The interpretation is that although the student might not have deliberately tried to recall events that were salient or extreme, that salience or extremeness might nonetheless have been the basis for the recall. Some examples of "salient or recent" are:
I was very in uenced by my bad experiences this year in which we were
told of what was expected after we had done the assignment. (Q24)
I thought of a particular lecturer from Ž rst year who told us how Law is
different, and how we might Ž nd it difŽ cult to predict how we would go in
our studies. (Q24)
I was quite focused on my current lecturers; I can’t remember the way I
was taught earlier. (Q18)
I thought of one particular subject in which there were two projects, where
we were given a large slice of time to do the work. (Q22)
Some examples of “extremes” are:
I can think of a really good one, and particularly bad. You try to think of
a few of them. (Q18)
I first thought of the bad lecturers, and then I thought I have to be general so I thought of the good ones. (Q18)
I compared last year’s research project to this year’s: one was really bad and
the other quite good. (Q22)
I remembered first year in which everything was crammed, so you don't remember any of it. But then in my later subjects you try to get the skills rather than just memorise. (Q12)
As can be seen from Table 6, the majority of “prompted directly” responses were
neither “salient or recent” nor “extremes”. Most decision descriptions in this broad
category seem to indicate the simple nonsystematic recall of particular concrete
instances prompted directly by the question. Descriptions in which the recall was
indicated to be of extremes were in reality relatively infrequent.
As already noted, the second broad category of decision descriptions were those in which the recalled instances were referred to as fitting some set of defining criteria. The interpretation drawn was that some initial consideration had been given to determining those criteria, and that the subsequent recall of experiences was thus at least minimally guided or filtered. This is not to claim that that recall was consciously exhaustive, or a representative sampling across time, just that it was not a matter of a simple "first to mind" prompting. Some further examples of this category are:
I thought about when a new concept comes up, and whether the lecturer
is able to express it coherently so that I walked away having understood.
(Q18)
I thought of how assessment requirements are described in terms of how
much they’re worth, and of the times when I wondered what I had to do.
(Q24)
I thought of the practicals, where you basically follow a recipe, so there’s
no planning required. (Q22)
I focused on how much I can now recall from previous subjects; I doubt I’d
be able to recall it all, so there must have been too much. (Q23)
A variant on the “guided” category of decision descriptions was when the phrasing
suggested that the student had consciously narrowed the focus of the question from
the range of experiences that it could logically include. The student would typically
state something like “I took this to mean …”. The inference is that the student was
aware of the narrowing, but that it was perhaps a deliberate ploy. Some examples of
narrowing are:
I let the incompetent lecturers sway me here, because they probably had
more of an effect on my university career. (Q18)
I took “expectation” to mean assignments, and what you need to go into
exams with. (Q24)
I decided between whether the question referred to time and deadlines,
rather than planning in a drafting sense. (Q22)
I interpreted the question in terms of me being able to memorise and
getting away with it anyhow. (Q12)
I focused on “thoroughly”; I haven’t thoroughly comprehended anything in
a 13-week semester. (Q23)
From Table 6 it can be seen that more than half of the response decision descriptions overall were in the broad "guided" category. Within that, about a third were of the "narrowing" variety. Considering just the 89% of descriptions interpretable as indicating the use of recalled specific concrete experiences, the broad "guided" category accounted for about 60%, with the "prompted directly" category accounting for the remainder.
Finally, consider the response decision approaches for the overall rating Q25. Like those reported for the other questions, these seem also predominantly based on personal experiential criteria (see Table 7). Only two reported approaches could be categorised as being "overall". One "… looked back at [his/her] responses to the other questions, and just summarised". The second "… compared to the other course in [his/her] double degree".
TABLE 7. Examples of approaches reported for Q25, "Overall, I was satisfied with the quality of this course" (standard phrasing)

I look at how confident I am now with my abilities.
I took "satisfied" to mean doing subjects that I actually enjoyed.
I know that for most of my course material I have to go back and refresh, and that makes me judge low.
How well the lectures had been structured and taught came into it for me.
I thought in terms of what interesting material could have been covered, but which wasn't.
The dealings that I have with administration people in getting through.
I judged in terms of whether I'm now ready for a professional job.
Whether I could think of things in the course that I would change.
I judged by the credentials and expertise of my lecturers.
The course's reputation relative to those of courses at other universities.
I thought of my own preferential reaction to the content material.
I answered on pure interest.
After the exams, when I get my results, thinking how well I had learned.
I thought entirely emotionally, how happy I was with the course.
All the other students reported deciding on the basis of some observable or experiential criterion. This is interesting in that this particular question quite explicitly asks the student to make an overall satisfaction judgement on the quality of his or her course. It would seem that the students simply do not do that. Instead they choose some single experiential aspect, and focus on that. The students seem to use something akin to a "guided", even "narrowing", approach. Maybe judging "overall quality satisfaction" is just too broad a task. Maybe students find that they simply cannot "take everything into account", and have no option but to select some narrower, but manageable, aspect of their course experiences, and base their individual responses on that.
There is, however, a critical difference between "guided" and "narrowing" as applied here to Q25, and as applied to the other CEQ questions. Each of those other questions has a particular intentional focus, be it the helpfulness of feedback, the development of writing skills, the emphasis on factual assessment, or whatever. That particular focus should constrain the choice of any experiential proxy to being a logical operationalisation of that same focus. Students may well narrow a question in different ways, but they should still at least partially be answering the same question. For Q25, however, there is no constraining focus. When students re-define overall satisfaction to mean confidence, or interest levels, or credentials of lecturers, or administrative proficiency, or emotional reactions, they are choosing experiential proxies from a much broader range. When same-course peers choose such different aspects of their course experiences on which to base their Q25 responses, they are each essentially answering different questions. This raises a methodological concern, in that such differences will constitute a source of variation that may not be systematically related to satisfaction, making it more difficult to show between-cohort differences. Further, it raises an interpretational concern, in that such differences will make it very difficult to draw any particular meaning from pooled student responses.
In summary then, the findings from this second study seem quite unambiguous. When students were asked to describe how they decided their responses to individual CEQ questions, they consistently and uniformly did not describe being general and dimensional. They did not report performing some sort of comprehensive cataloguing of their experiences over the duration of their degree studies. They did not report using some sort of abstract or systematic judgmental process. Instead, they reported processes very much based in the particular, the concrete, the specific, in the recall of personally experienced episodic events and instances. This was true irrespective of CEQ version. The present categorisations of those decision processes are perhaps best interpreted as simply reflecting different ways in which students cope with what seems the pervasive reality of specific experiential recall.
The implications for CEQ question forms are also clear. The present findings suggest that no matter what form questions might take, the responding students will make use of specific recollections. Perhaps it is simply unavoidable. When something like "feedback" is mentioned to students, the recall to mind of particular relevant events might be almost automatic. And once those recollections are present in consciousness they are attended to. To use any question format that implicitly assumes that students will make balanced and measured judgements over an entire degree time frame is probably to accept a fiction.
Conclusions
The starting point for the studies reported here was a concern that the question formats used in the CEQ might not actually be well suited to the instrument's purpose. Although considerable research has well established the practical and theoretical validity of what the CEQ's questions ask, little work seems to have considered how those questions are asked. The first study here considered whether evidence could be found that the CEQ's agree/disagree question format might be less than optimal. The second study considered the thinking processes employed by students in deciding their responses to individual questions: were those processes different to what the form of the CEQ seemed implicitly to assume?
In the first study, the standard CEQ question format was "tinkered with" in deliberately constrained or minimalist ways. The logic was that were the standard agree/disagree format optimal, then any such tinkering should result in a deterioration of the CEQ as a measurement instrument. On none of the tests applied did the standard CEQ format prove itself to be at all superior. The conclusion drawn was that that standard format was likely less than optimal, and that developing alternatives could well be a fruitful course to pursue.
The second study had students report on the thinking processes that they had used in responding to individual questions in a just completed CEQ. The assumption implicit in the present CEQ, that students do indeed sample from their experiences across the duration of their degree studies, and seek to generate a response representative of those studies, simply did not hold. The clear and strong finding was that responding students used the recall of particular and concrete personal experiences and events, as prompted by the question. These recalled events are typically not logically exhaustive. They are typically not drawn from the entirety of the student's study experiences. They often focus on the salient and the recent. They are often recognised, and explicitly acknowledged, as being biased and non-representative.
If alternatives to the present CEQ should be developed, do the present data give any suggestions, albeit speculative, for directions that that development might take? A first suggestion is probably quite obvious: begin by accepting the reality of how students will respond. Accept that it is likely unavoidable that a question will prompt the recall of personal, concrete events that in some sense fit with the question. So whatever question formats we might eventually choose, it would seem that in some fashion they must work with students' prompted recall of specific experiences.
A second suggestion can be made, but it is more speculative. Perhaps we could take a lead from the apparently dominant decision approach found in the present data, that of using a preliminarily decided set of criteria to guide or filter the subsequent prompted recall of instances. In the context of re-crafting the CEQ that might mean more than some minimal re-phrasing of questions. It might mean returning to the concern or issue underlying a given question and deciding on a range of more defined instances that could be experiential indicators of that concern or issue. That single question might then be replaced with a cluster of questions, each seeking to determine the extent to which some particular experiential indicator was indeed part of the student's personal experience. The range of these experiential indicators could of course be something determined by further research.
In essence such an alternative approach could be seen as simply a means of harnessing already extant processing predilections, but in a much more determined, or channelled, fashion. The expectation is that students would not find such "clustered specific prompt" approaches difficult or alien. The benefit of course would be much greater consistency in the ways in which questions were interpreted by students, and thus responded to. This in turn should translate to those response distributions being more usefully interpretable by institutions.
What of the overall satisfaction question? Do the present response data offer any speculative suggestions there? In the earlier discussion, it was noted that students seemed to apply a "guided" or "narrowing" approach here also. But in the context of re-crafting the CEQ, this finding might be more problematic than suggestive. The range of concrete instances that individual students recalled in constructing an overall satisfaction response was very wide. Adopting a strategy of replacing Q25 with a cluster of more specific prompts might simply be impractical. For any of the other "more targeted" questions, it is at least imaginable that there could exist a constrained range of specific instance classes that would contain the bulk of actual recalls that individual students might have. For Q25, such is probably unlikely.
The suggestion here then is that the best option might be simply to discard Q25 as a discrete question, as bureaucratically unattractive as that might be. It might be better to investigate ways of using the responses to the other CEQ questions to construct or generate some index of overall satisfaction. One benefit of going down this route might be that the computation of such an index would be known, and transparent. Such an index would be recognised as being a derived entity. It would thus perhaps not divert institutional attention away from the more specific questions and dimensions, wherein the real feedback value arguably lies.
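One purely illustrative way such a derived index might be constructed, and not something the paper itself specifies, would be to publish a fixed set of weights over the subscale scores, for example weights estimated once by regressing existing Q25 responses on the five subscale means. The names and arrays in the sketch below are hypothetical placeholders.

```python
import numpy as np

def fit_index_weights(subscale_means, q25):
    """Estimate an intercept plus five subscale weights by least squares.
    A hypothetical, transparent construction, not one prescribed by the paper."""
    X = np.column_stack([np.ones(len(q25)), subscale_means])
    weights, *_ = np.linalg.lstsq(X, q25, rcond=None)
    return weights

def derived_satisfaction_index(subscale_means, weights):
    """Apply the published weights to new subscale scores."""
    X = np.column_stack([np.ones(len(subscale_means)), subscale_means])
    return X @ weights
```

The point of the sketch is only that such a computation would be explicit and reproducible, which is the transparency benefit argued for above.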
As a final point on which to conclude, it is well to remember what the present studies have not investigated. The present findings provide no evidence at all that the conceptual structure underlying the CEQ should be tampered with. That scales reflecting dimensional variability on teaching activity, instructional goals, transferable skills, assessment and workload should prove predictive of teaching quality and learning outcomes generally is well established in the literature. The concern here was with how we sample that dimensional variability, and not at all with sampling something else.
Address for correspondence: Dr Malcolm Eley, Higher Education Development Unit,
PO Box 91, Monash University, Victoria 3800, Australia. E-mail: Malcolm.Eley@CeLTS.monash.edu.au
References
Anastasi, A. (1982). Psychological testing (5th ed.). New York: Macmillan.
Ashenden, D., & Milligan, S. (1999). Good universities guide, Australian universities. Port Melbourne: Mandarin Australia.
Borman, W.C. (1986). Behavior-based rating scales. In R.A. Berk (Ed.), Performance assessment: Methods and applications (pp. 100–120). Baltimore: Johns Hopkins University Press.
Eley, M.G., & Stecher, E.J. (1994). Comparison of an observationally- versus an attitudinally-based response scale in teaching evaluation questionnaires: II. Variation across time and teaching quality. Research and Development in Higher Education, 17. Proceedings of the 20th Annual Conference of the Higher Education Research and Development Society of Australasia. Canberra, ACT, 6–10 July, pp. 196–202.
Eley, M.G., & Stecher, E.J. (1995). The comparative effectiveness of two response scale formats
in teaching evaluation questionnaires. Research and Development in Higher Education, 18.
Proceedings of the 21st Annual Conference of the Higher Education Research and
Development Society of Australasia. Rockhampton, Queensland, 4–8 July, pp. 278–283.
Eley, M.G., & Stecher, E.J. (1997). A comparison of two response scale formats used in teaching
evaluation questionnaires. Assessment and Evaluation in Higher Education, 22, 65–79.
Ericsson, K.A., & Simon, H.A. (1980). Verbal reports as data. Psychological Review, 87, 215–251.
Hand, T., Trembath, K., & Elsworthy, P. (1998). Enhancing and customising the analysis of the
Course Experience Questionnaire. Evaluation and Investigations Program; Department of
Employment, Education, Training and Youth Affairs, Canberra, Australia.
Kaiser, H.F. (1974). An index of factorial simplicity. Psychometrika, 39, 31–36.
Latham, G.P., & Wexley, K.N. (1977). Behavioral observation scales for performance appraisal
purposes. Personnel Psychology, 33, 815–821.
Likert, R. (1932). A technique for the measurement of attitudes. Archives of Psychology, No. 140.
Long, M., & Johnson, T. (1997). Influences on the Course Experience Questionnaire scales. Evaluation
and Investigations Program; Department of Employment, Education, Training and Youth
Affairs, Canberra, Australia.
Ramsden, P. (1991). A performance indicator of teaching quality in higher education: The Course
Experience Questionnaire. Studies in Higher Education, 16, 129–150.
Richardson, J.T.E. (1994). A British evaluation of the Course Experience Questionnaire. Studies
in Higher Education, 19, 59–68.
Stecher, E.J., & Eley, M.G. (1994). Comparison of an observationally- versus an attitudinally-based response scale in teaching evaluation questionnaires: I. Variation relative to a
common teaching sample. Research and Development in Higher Education, 17. Proceedings of
the 20th Annual Conference of the Higher Education Research and Development Society
of Australasia. Canberra, ACT, 6–10 July, pp. 210–217.
Tabachnick, B.G., & Fidell, L.S. (1996). Using multivariate statistics (3rd ed.). New York: HarperCollins.
Wilson, K.L., Lizzio, A., & Ramsden, P. (1997). The development, validation and application of
the Course Experience Questionnaire. Studies in Higher Education, 22, 33–52.