Analyze Data

Analyzing quantitative content analysis data
Rater reliability
To establish reliability when evaluating content, use two or more raters and make sure they are rating in a
consistent manner. Train them in a group and have them practice using the same criteria and compare
results. They should continue to compare results as the study progresses to make sure their ratings do not
drift apart. Although reliability will be higher when rating concrete (for example, use of transitional phrases
in an essay) rather than abstract content (for example, persuasiveness), rating only concrete content might
lead to overlooking important indicators of quality. Once you begin rating, refine procedures or category
definitions as needed to increase reliability.
There are several ways to assess reliability among two or more raters. The simplest way is to record
the percentage agreement:
% agreement = (# of ratings that agree / total # of ratings) x 100
If you are using a 4-point rating scale (0 = not present to 3 = exceeds criteria), compute the percentage of
times raters made the same rating. If there is not at least 80% agreement between evaluators, discuss
differences and repeat the process until they achieve satisfactory agreement. Avoid using a rating scale with
more than 5 points because raters will have difficulty making subtle distinctions. Many journals prefer
Cohen's Kappa or the intraclass correlation coefficient because percentage agreement does not correct for
chance agreement.
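If you prefer to script the calculation rather than compute it by hand, the sketch below shows one way to compute percentage agreement in Python; the ratings are invented for illustration.

```python
# Illustrative sketch: percentage agreement between two raters who each
# scored ten essays on a 0-3 rubric scale (the ratings are invented).
rater_a = [2, 3, 1, 2, 0, 3, 2, 1, 2, 3]
rater_b = [2, 3, 1, 1, 0, 3, 2, 2, 2, 3]

agreements = sum(a == b for a, b in zip(rater_a, rater_b))
percent_agreement = agreements / len(rater_a) * 100
print(f"Percent agreement: {percent_agreement:.0f}%")  # 80% meets the suggested threshold
```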
Cohen's Kappa is a measure of reliability that corrects for chance agreement and can be used for checklists
that involve yes/no decisions or decisions between mutually exclusive categories. Generally, a Kappa value of
.7 or greater indicates acceptable reliability.
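For illustration, the sketch below computes Cohen's Kappa with the scikit-learn library (one option among several; SPSS and SAS also report Kappa). The yes/no ratings are invented.

```python
# Illustrative sketch: Cohen's Kappa for two raters making yes/no decisions
# about whether each paper has a clearly stated thesis (1 = yes, 0 = no).
from sklearn.metrics import cohen_kappa_score

rater_a = [1, 1, 0, 1, 0, 1, 1, 0, 0, 1]
rater_b = [1, 1, 0, 1, 0, 0, 1, 0, 1, 1]

kappa = cohen_kappa_score(rater_a, rater_b)
print(f"Cohen's kappa: {kappa:.2f}")  # values of roughly .7 or higher suggest acceptable reliability
```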
The intraclass correlation coefficient (ICC) is a measure of reliability between observers that can be used with
categorical or continuous data, such as observing the number of questions students ask during a class.
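A brief sketch of one way to compute an ICC appears below; it assumes the third-party pingouin package (SPSS and SAS also report ICCs), and the counts are invented.

```python
# Illustrative sketch: ICC for two judges who each counted the number of
# questions asked by four students; the data are invented.
import pandas as pd
import pingouin as pg

data = pd.DataFrame({
    "student":   [1, 1, 2, 2, 3, 3, 4, 4],
    "judge":     ["A", "B", "A", "B", "A", "B", "A", "B"],
    "questions": [5, 6, 2, 2, 8, 7, 4, 5],
})

icc = pg.intraclass_corr(data=data, targets="student", raters="judge", ratings="questions")
print(icc[["Type", "ICC"]])  # report the ICC form that matches your design
```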
Once you establish adequate reliability, you can simplify analyses by averaging ratings across raters.
Alternatively, you can analyze only one set of ratings if you designate one person at the outset as the main
rater and use a secondary rater only to establish reliability.
Using content analysis in an experiment
Example 1:
You decide to conduct a study that compares the quality of papers for two groups of students in a
sociology class. At the start of the semester, students agree to be randomly assigned to one of two
groups. At the semester mid-point, each student in Group 1 posts a rough draft of their paper on the
course website and receives an overall quality score ranging from 1-poor to 4-excellent from three
other students in Group 1. Each student in Group 2 also posts a rough draft at the same time and
receives an overall quality score and also detailed comments on an evaluation form from three other
students in Group 2. Students comment about organization, use of examples, use of theory, and
persuasiveness of arguments. Both groups incorporate the feedback into their papers. Two judges, who
do not know which group students belong to, then rate overall paper quality (1-poor to 4-excellent)
and use the rubric below to assess paper quality in four categories. You have three hypotheses: 1) As
rated by the two judges, overall paper quality for Group 2 will be higher than for Group 1, on average;
2) Total rubric scores will be higher, on average, for Group 2 than Group 1; 3) Overall quality scores
will not differ between groups for rough drafts.
Below is an example of ratings provided by the first judge for a student in Group 1:
Overall quality (1-poor; 2-average; 3-good; 4-excellent)
2 -- average
Rubric ratings of paper:
Key:
Not present = no use or demonstration of objective
Below criteria = little use or demonstration of objective or use is frequently inaccurate
Meets criteria = consistent use or demonstration of objective
Exceeds criteria = consistent and skillful use or demonstration of objective
Objective           Not Present (0 pts)   Below Criteria (1 pt)   Meets Criteria (2 pts)   Exceeds Criteria (3 pts)
Organization
Use of examples
Use of theory
Persuasiveness
(The judge marks an X under one rating for each objective.)
Enter all data into a statistical program such as SPSS or SAS and calculate means and standard
deviations for overall quality ratings and total rubric scores. While comparing means provides a
rough sense of differences between the groups, statistical tests indicate whether those differences are
likely to have occurred by chance. Many statistical programs provide a p value that indicates the
probability that group differences occurred by chance alone. For example, a p value of .05 indicates
that there is a 5% probability that differences between groups occurred by chance rather than
because of the intervention. Prior to analyzing the data, you set a p value of .05 or less as the criterion
for statistical significance. In addition, you make sure that outcome variables are normally
distributed, a requirement for many statistical tests. If a variable is not distributed normally,
consult with a statistician to determine if you need to transform the variable.
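The sketch below illustrates these preliminary steps in Python with pandas and SciPy, as an alternative to SPSS or SAS; the rubric scores are invented.

```python
# Illustrative sketch: descriptive statistics and a normality check for the
# invented outcome "total rubric score" in each group.
import pandas as pd
from scipy import stats

df = pd.DataFrame({
    "group": ["1"] * 6 + ["2"] * 6,
    "rubric_total": [6, 7, 5, 8, 6, 7, 9, 10, 8, 11, 9, 10],
})

# Means and standard deviations by group.
print(df.groupby("group")["rubric_total"].agg(["mean", "std"]))

# Shapiro-Wilk test: a p value below .05 suggests the scores depart from normality.
for name, scores in df.groupby("group")["rubric_total"]:
    w, p = stats.shapiro(scores)
    print(f"Group {name}: W = {w:.2f}, p = {p:.3f}")
```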
To determine if there is a difference in paper quality between the two groups, conduct two t-tests for
independent groups (also called an independent samples t-test, or a t-test for independent
means), one comparing average group ratings of overall quality and a second t-test comparing total
rubric scores. At the study's outset, it would be wise to give all participants a standardized test of
writing quality to make sure that Groups 1 and 2 did not significantly differ in writing ability. This
would enable you to conclude that later group differences in paper quality were not due to differences
that existed before you began the study.
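As an illustration, the sketch below runs an independent-samples t-test in SciPy on invented overall quality ratings (one value per paper, averaged across the two judges).

```python
# Illustrative sketch: independent-samples t-test comparing the two groups'
# overall quality ratings; the scores are invented.
from scipy import stats

group1 = [2.0, 2.5, 3.0, 2.0, 2.5, 3.0]   # mean of the two judges' 1-4 ratings per paper
group2 = [3.0, 3.5, 3.0, 4.0, 3.5, 3.0]

t, p = stats.ttest_ind(group1, group2)
print(f"t = {t:.2f}, p = {p:.3f}")  # p < .05 meets the significance criterion set in advance
```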
If you discover that, before you start your study, two groups differ on a variable you are measuring,
such as writing quality, you can control for these differences using an Analysis of Covariance
(ANCOVA) procedure. You cannot use an ANCOVA, however, to control for pre-existing group
differences when there is no random assignment, so consult with a statistician in this case.
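When you do have random assignment and a pretest measure, an ANCOVA can be fit as a linear model; the sketch below uses statsmodels, and the variable names and scores are invented.

```python
# Illustrative sketch: ANCOVA controlling for a writing pretest, fit as an
# ordinary least squares model; the data are invented.
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

df = pd.DataFrame({
    "group":   ["1", "1", "1", "1", "2", "2", "2", "2"],
    "pretest": [50, 55, 60, 48, 52, 58, 61, 57],
    "quality": [2.0, 2.5, 3.0, 2.0, 3.0, 3.5, 4.0, 3.5],
})

model = smf.ols("quality ~ C(group) + pretest", data=df).fit()
# The C(group) row tests group differences after adjusting for the pretest.
print(sm.stats.anova_lm(model, typ=2))
```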
To test for differences between three or more groups, use an independent samples analysis of
variance (ANOVA). Obtaining a significant F value for an ANOVA tells you that, overall, the group
means differ, but it does not tell you which groups are significantly different from each other. To
answer that question, you must perform post-hoc comparisons after you obtain a significant F, using
tests such as Tukey's and Scheffé's, which set more stringent significance levels as you make more
comparisons. However, if you make specific predictions about differences between means, you can
test these predictions with planned comparisons, which enable you to set significance levels at p <
.05. Planned comparisons are performed instead of an overall ANOVA.
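The sketch below illustrates a one-way ANOVA followed by Tukey post-hoc comparisons for three invented groups, using SciPy (scipy.stats.tukey_hsd requires SciPy 1.8 or later).

```python
# Illustrative sketch: one-way ANOVA for three groups, with Tukey post-hoc
# comparisons interpreted only after a significant F; the scores are invented.
from scipy import stats

group1 = [6, 7, 5, 8, 6]
group2 = [9, 10, 8, 11, 9]
group3 = [7, 8, 7, 9, 8]

f, p = stats.f_oneway(group1, group2, group3)
print(f"F = {f:.2f}, p = {p:.3f}")

if p < .05:
    # Pairwise comparisons that adjust for the number of comparisons made.
    print(stats.tukey_hsd(group1, group2, group3))
```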
Example 2:
You decide to alter your study design from Example 1. You use the same procedure but ask the judges
to make two additional ratings when students post rough drafts on the course website: 1) an overall
rating of rough draft quality 2) a total rubric score based on ratings of organization, use of examples,
use of theory, and persuasiveness of arguments. Both groups receive feedback from other students,
as described in Example 1, but judges' ratings are not shared with students. You then compare
judges' rough and final draft ratings and test whether the average amount of change for Groups 1 and 2
differs significantly, using a mixed factorial ANOVA. A mixed ANOVA enables you to
simultaneously consider change over time within each group and differences between the two
groups.
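One way to run a mixed ANOVA outside SPSS or SAS is the third-party pingouin package, shown below with invented rough- and final-draft ratings.

```python
# Illustrative sketch: mixed ANOVA with draft (rough vs. final) as the
# within-subjects factor and group as the between-subjects factor; data invented.
import pandas as pd
import pingouin as pg

data = pd.DataFrame({
    "student": [1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6],
    "group":   ["1"] * 6 + ["2"] * 6,
    "draft":   ["rough", "final"] * 6,
    "quality": [2, 2.5, 2, 3, 2.5, 2.5, 2, 3.5, 2.5, 3.5, 2, 3],
})

aov = pg.mixed_anova(data=data, dv="quality", within="draft",
                     subject="student", between="group")
print(aov)  # the draft x group interaction tests whether the groups change by different amounts
```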
Other statistical procedures
To test whether a statistically significant change has occurred within one group of students at two points in
time (for example, the start and end of the semester), use a t-test for dependent means (also called a paired
samples t-test, repeated measures t-test, or t-test for dependent samples). To compare ratings at three or
more points in time (for example, the start, midpoint, and end of the semester), one option is a repeated
measures analysis of variance (ANOVA), also called an ANOVA for correlated samples.
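The sketch below illustrates both options with invented scores: a paired t-test in SciPy for two time points, and a repeated-measures ANOVA in statsmodels for three.

```python
# Illustrative sketch: dependent-means (paired) t-test and repeated-measures
# ANOVA; all scores are invented.
import pandas as pd
from scipy import stats
from statsmodels.stats.anova import AnovaRM

# Two time points: the same students rated at the start and end of the semester.
start = [2.0, 2.5, 2.0, 3.0, 2.5, 2.0]
end   = [3.0, 3.0, 2.5, 3.5, 3.0, 2.5]
t, p = stats.ttest_rel(start, end)
print(f"Paired t-test: t = {t:.2f}, p = {p:.3f}")

# Three time points: one rating per student at the start, midpoint, and end.
long = pd.DataFrame({
    "student": [1, 2, 3, 4] * 3,
    "time":    ["start"] * 4 + ["mid"] * 4 + ["end"] * 4,
    "score":   [2.0, 2.5, 2.0, 3.0, 2.5, 3.0, 2.5, 3.0, 3.0, 3.5, 2.5, 3.5],
})
print(AnovaRM(data=long, depvar="score", subject="student", within=["time"]).fit())
```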
If you are rating a student product using categories that are not on a continuous scale (for example,
"inadequate, satisfactory, above average"), you can test for differences between groups or times using a
chi-square statistic. For example, you may rate whether the thesis for a research paper is "clearly stated" or
"not clearly stated."
You might also compute correlations to determine whether there is a statistically significant positive or
negative relationship between two continuous variables. For example, you could determine if ratings of the
quality of student essays are related to students' satisfaction with the course. Be aware, however, that
computing correlations between several sets of variables increases the chances of finding a relationship due
to chance alone, and that finding significant correlations between variables does not tell you what causes
those relationships.
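The sketch below computes a Pearson correlation in SciPy between invented essay-quality ratings and course-satisfaction ratings for the same students.

```python
# Illustrative sketch: Pearson correlation between two continuous variables;
# the ratings are invented.
from scipy import stats

essay_quality        = [2.0, 2.5, 3.0, 3.5, 2.0, 3.0, 4.0, 2.5]
course_satisfaction  = [3, 4, 4, 5, 2, 4, 5, 3]

r, p = stats.pearsonr(essay_quality, course_satisfaction)
print(f"r = {r:.2f}, p = {p:.3f}")  # a significant r shows association, not causation
```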
If you need additional help from someone knowledgeable about statistics, contact the research consulting
staff at UT Austin's Division of Statistics & Scientific Computation.
Additional information
2 x 2 Mixed Factorial Design. Retrieved June 21, 2006 from the University of Missouri - Rolla, Psychology World Web site: http://web.umr.edu/~psyworld/mixed_designs.htm
Aron, A. & Aron, E. N. (2002). Statistics for Psychology, 3rd edition. Upper Saddle River, NJ: Prentice Hall.
Chi-square: One Way. Retrieved June 21, 2006 from the Georgetown University, Department of Psychology, Research Methods and Statistics Resources Web site: http://www.georgetown.edu/departments/psychology/researchmethods/statistics/inferential/chisquareone.htm
Cohen's Kappa: Index of inter-rater reliability. Retrieved June 21, 2006 from the University of Nebraska Psychology Department Research Design and Data Analysis Directory Web site: http://wwwclass.unl.edu/psycrs/handcomp/hckappa.PDF
Helberg, C. (1995). Pitfalls of data analysis. Retrieved June 21, 2006 from: http://my.execpc.com/4A/B7/helberg/pitfalls
Lane, D. M. (2003). Tests of linear combinations of means, independent groups. Retrieved June 21, 2006 from the Hyperstat Online textbook: http://davidmlane.com/hyperstat/confidence_intervals.html
Lowry, R. P. (2005). Concepts and Applications of Inferential Statistics. Retrieved June 21, 2006 from: http://faculty.vassar.edu/lowry/webtext.html
T-test. Retrieved December 4, 2007 from the Georgetown University, Department of Psychology, Research Methods and Statistics Resources Web site: http://www1.georgetown.edu/departments/psychology/resources/researchmethods/statistics/8318.html
Wuensch, K. L. (2003). Inter-rater agreement. Retrieved June 21, 2006 from: http://core.ecu.edu/psyc/wuenschk/docs30/InterRater.doc