Content Analysis Practicum
CMN 529 SA (CRN# 53578)
T Th 3:30pm–4:50pm, Foreign Languages Building G7
Fall 2010

Professor Scott Althaus
Office: 256 Computer Applications Building
Hours: Tuesdays 1–3pm, and by appt.
E-mail: salthaus@illinois.edu
Web: www.illinois.edu/~salthaus
Phone: 217.265.7879

Course Objectives

The objectives of this course are threefold: (1) to teach a generic and multi-purpose method of quantitative content analysis that is commonly employed by scholars of mass communication and political communication to measure trends and discourse elements in news coverage; (2) to give students practical experience in all stages of quantitative content analysis, from protocol design to validity testing, reliability testing, coding, data entry, and data analysis; and (3) to produce publishable research papers co-authored by participants in the course (the number of these papers will depend on the number of students taking the course and the scope of the projects we choose to undertake). The first half of the semester will be spent learning the methods of content analysis; the second half will be spent applying them.

The Practicum Project

In order to focus on the practical aspects of content analysis and to complete a major data collection effort by the end of the semester, students will either pursue their own content analysis projects (which must be approved by the instructor) or participate in a group project headed by the instructor. Students wishing to pursue independent content analysis projects as part of the course must have their proposed projects approved by the instructor and must have a well-developed literature review, including hypotheses and a draft codebook for the content analysis, ready for submission soon after the start of the class. This means that doing your own project (by yourself or with other collaborators in the class), independent of the main class project(s), requires more outside work than taking part in the instructor's group project.

Everyone else will work on a research project designed by the instructor in advance of the semester to address a research question that is of broad topical interest, both popularly and in scholarly circles, and that has never been subject to systematic quantitative content analysis. This project is designed to yield a high-profile academic publication.

This semester's project will produce a validity study that compares several different computer data-mining strategies for coding the evaluative tone of news stories (that is, positive versus negative coverage). Done in collaboration with Kalev Leetaru (Coordinator of Information and Research Technologies at the Cline Center for Democracy at UIUC, and a PhD student in the Graduate School of Library and Information Science), the project will focus on New York Times coverage of the Israeli-Arab conflict from 1946 to 2004, and will seek to determine whether computer-based sentiment analysis methods can produce tone results comparable to those produced by human coders. We have unprecedented access to the entire full text of the New York Times over this period, so the scale of this project will be truly impressive. It will not only yield valuable tests of the validity of such methods relative to human coding, but will also produce unprecedented insights into the dynamics of American news coverage about Israel and Palestine over the last 60 years.
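At its simplest, the kind of data-mining strategy the project will test is dictionary-based: count a story's positive and negative words from a sentiment lexicon and combine the counts into a tone score. The sketch below illustrates that idea only; the word lists, scoring rule, and example sentence are hypothetical placeholders, not any of the lexicons or data the class will actually evaluate.

```python
# Illustrative sketch of dictionary-based tone scoring (hypothetical word lists).
POSITIVE = {"peace", "agreement", "progress", "support", "cooperation"}
NEGATIVE = {"attack", "violence", "crisis", "conflict", "casualties"}

def tone_score(text: str) -> float:
    """Return a tone score in [-1, 1]: (positive - negative) / (positive + negative)."""
    words = [w.strip(".,;:!?\"'()").lower() for w in text.split()]
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    if pos + neg == 0:
        return 0.0  # no tone-bearing words found; treat the story as neutral
    return (pos - neg) / (pos + neg)

if __name__ == "__main__":
    story = "Talks produced an agreement and new cooperation despite the ongoing conflict."
    print(round(tone_score(story), 2))  # two positive hits, one negative hit -> 0.33
```

Whether simple counting strategies like this can approximate the judgments of trained human coders is exactly the validity question the class project is designed to answer.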
Target Journals: American Journal of Political Science; Political Analysis; Journal of Communication.

Required Reading

Students are required to obtain the following book, which is available at local bookstores:

• Kimberly A. Neuendorf. 2002. The Content Analysis Guidebook. Thousand Oaks, CA: Sage Publications. (One non-circulating copy is in the reserve stacks of the Education Library, call number 301.01 N393c.)

A Web site for this textbook can be found at: http://academic.csuohio.edu/neuendorf_ka/content/

Students are also required to obtain a set of additional readings for class, most of which are available in electronic form on the course Moodle.

Course Moodle Site

This course has a Moodle site that will be the primary vehicle for receiving course assignments and distributing course-related materials in electronic form. The Moodle site can be accessed here (course enrollment required for access): https://courses.las.illinois.edu/

Assignments

Your final grade for this course will be determined by your performance on the following assignments:

• Reliability analysis assignment (2-3 page paper): 10% of final grade
• Validity analysis assignment (2-3 page paper): 10% of final grade
• Weekly participation in class discussions and lab sessions: 20% of final grade
• Weekly content analysis case quotas: 20% of final grade
• Lexicon review paper and presentation: 20% of final grade
• Technical report paper and presentation: 20% of final grade

Students pursuing independent content analysis projects for course credit will turn in final papers that include the following sections: literature review, research questions and hypotheses, methods and data, findings, and conclusions. These papers will be worth 50% of the student's final grade.

The reliability analysis assignment requires you to assess the reliability of variables in a hypothetical two-coder reliability test (using "fake" data generated from an actual coding scheme; hat tip to Kimberly Neuendorf for this assignment idea). You will calculate rates of intercoder agreement and a variety of intercoder reliability statistics for each of the variables in the data set. You will then outline concrete steps to improve the reliability of the data and recommend whether the hypothetical project should proceed to the data collection stage or should conduct another round of reliability testing. (A short illustrative sketch of this kind of two-coder check appears below, after the validity assignment description.)

The validity analysis assignment requires you to assess the methodological soundness and appropriateness (in terms of both validity and reliability) of a major content analysis project that is available to scholars and used in your area of research. This is not an assignment to analyze the coding used in a particular paper, but rather to analyze a major content analysis data source that is already in use by multiple scholars in your field. You should choose a dataset that is relevant to you and that you might actually use or engage with down the road. Applying methodological concepts learned in class, you will review the research questions and hypotheses the dataset was designed to address, and assess the quality of sampling, the operationalization of key content analysis variables, and the reliability of the coding scheme. Possible examples include the Wisconsin Advertising Project (http://wiscadproject.wisc.edu/), the Policy Agendas Project (http://www.policyagendas.org/), the Kansas Event Data System (http://web.ku.edu/~keds/), the Comparative Manifesto Project (http://www.wzb.eu/zkd/dsl/Projekte/projekte-manifesto.en.htm), Media Tenor (http://www.mediatenor.com/), and the Project for Excellence in Journalism's News Coverage Index (http://www.journalism.org/by_the_numbers/datasets).
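By way of illustration only (the actual assignment data and coding scheme will be distributed on the course Moodle), the following minimal Python sketch computes percent agreement and Cohen's kappa for a single nominal variable coded by two hypothetical coders. The additional reliability statistics discussed in the readings, such as Krippendorff's alpha, extend this same chance-correction logic.

```python
# Hypothetical two-coder reliability check: percent agreement and Cohen's kappa
# for one nominal variable. The codes below are made-up illustration data.
from collections import Counter

coder_a = ["pos", "neg", "neg", "neu", "pos", "neg", "pos", "neu", "neg", "pos"]
coder_b = ["pos", "neg", "pos", "neu", "pos", "neg", "neg", "neu", "neg", "pos"]

def percent_agreement(a, b):
    return sum(x == y for x, y in zip(a, b)) / len(a)

def cohens_kappa(a, b):
    n = len(a)
    observed = percent_agreement(a, b)
    # Expected chance agreement, computed from each coder's marginal distribution.
    freq_a, freq_b = Counter(a), Counter(b)
    expected = sum((freq_a[c] / n) * (freq_b[c] / n) for c in set(a) | set(b))
    return (observed - expected) / (1 - expected)

print(f"Percent agreement: {percent_agreement(coder_a, coder_b):.2f}")  # 0.80
print(f"Cohen's kappa:     {cohens_kappa(coder_a, coder_b):.2f}")       # 0.69
```

High raw agreement can coexist with weak chance-corrected reliability when one code dominates the data, which is one reason the assignment asks for several statistics rather than agreement rates alone.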
The lexicon review paper and presentation is designed to support the class project and will be written by groups of students. The paper and accompanying PowerPoint presentation will have four goals: (1) review and explain the analytical strategy of a popular sentiment lexicon (to be assigned by the instructor); (2) identify the theoretical assumptions about communication that inform the lexicon's use; (3) review the academic literature that has used the lexicon; and (4) assess the validity of the lexicon's analytical strategy for sentiment analysis in political science and communication research. The assigned lexicons will later be tested against our class's human coding and compared against the parallel results of other sentiment lexicons.

The technical report paper and presentation presents the results of an empirical test comparing human coding to an assigned lexicon, where both sets of results are drawn from the same underlying data. This assignment follows directly from the lexicon review assignment, and groups will retain the same lexicons for both assignments. The paper and accompanying PowerPoint presentation will have three goals: (1) assess how accurately the lexicon reproduces the tone scores made by human coders; (2) to the degree that the lexicon misrepresents the tone scores of human coders, assess how the lexicon's analytical strategy may have been responsible for producing particular errors; and (3) make recommendations on how to adjust the lexicon's analytical strategy to better match the results of human coding.
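As a hedged illustration of the core comparison in the technical report (the actual analyses and graphical styles will be settled in class on 11/18), the sketch below measures how closely a set of hypothetical lexicon tone scores tracks hypothetical human-coded tone scores for the same stories, using a correlation and a mean absolute error. The numbers are invented for illustration and are not project results.

```python
# Hypothetical comparison of lexicon-based tone scores to human-coded tone scores.
from statistics import mean

human   = [0.6, -0.4, 0.1, -0.8, 0.3, 0.0, -0.5, 0.7]   # human-coded tone per story
lexicon = [0.4, -0.2, 0.2, -0.6, 0.1, -0.1, -0.7, 0.5]  # lexicon tone for the same stories

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

mae = mean(abs(a - b) for a, b in zip(human, lexicon))
print(f"Correlation with human coding: {pearson_r(human, lexicon):.2f}")
print(f"Mean absolute error:           {mae:.2f}")
```

Examining the stories where the two sets of scores diverge most is one natural way to address the assignment's second goal: tracing particular errors back to the lexicon's analytical strategy.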
Authorship and Data-Sharing Agreement

For those participating in the course's collaborative data collection project, the following guidelines determine authorship credits for the specific projects detailed above:

• Default order of authorship credit: Kalev Leetaru and the course instructor will be listed as first authors (order to be determined), followed by each student (in alphabetical order) who participated materially in the collection of data, the analysis of data, or manuscript preparation for the particular project.

• Modifications to this default order of authorship credit can be made with the consent of the majority of persons claiming authorship in the project's publication. Reasons for modifying the order of authorship include, but are not limited to: recognizing additional data collection effort above and beyond that of other co-authors, recognizing the contribution of special skills or additional time above and beyond those contributed by other co-authors, and recognizing additional effort given to preparing or revising the manuscript for publication. It is common for some participants to contribute only to data collection, while others go on to also put extra work into the manuscript and publication process; these differences in material contributions normally should be factors influencing the order of authorship.

• Authors can also be dropped from sharing credit in the project if the data they supply turn out to be flawed or otherwise unusable due to problems in the quality of data collection.

In addition, students taking this course and participating in its data collection project(s) also agree that once each research project (as defined above) yields its intended publication, all data collected for the project can be used freely by any of the project co-authors, who agree to cite the authors of the original publication as the source of the data. This means that every project participant is guaranteed authorship credit on the first publication, but not on any subsequent publications that might result from this data collection effort. This is meant to encourage students to add further data to the cases or to use the original data for different projects without having to involve or get individual permission from all of the original co-authors.

Tentative Course Schedule

Tue 8/24: Overview of Course and Introduction to Content Analysis Projects
Thu 8/26: The History, Applications and Analytical Logic of Content Analysis
          In-class: begin discussing sampled news stories to define tone and the two actors for which tone will be assessed
Tue 8/31: What Is Content Analysis Trying to Do?
          In-class: discuss assigned readings; finish discussing the first batch of sampled news stories to define tone and the two actors for which tone will be assessed
Thu 9/2:  Before class: read the second batch of sampled news content with the draft coding scheme in mind
          In-class: discuss the sampled news content to define tone and the two actors for which tone will be assessed
Tue 9/7:  Validity I: Generalizability
Thu 9/9:  Before class: read the third batch of sampled news content with the draft coding scheme in mind
          In-class: refine draft codebook
Tue 9/14: Validity II: Operationalization
Thu 9/16: In-class: Finalize codebook; determine sampling frames for content analysis project; draw sample for first reliability test
Tue 9/21: Validity III: Reliability
Thu 9/23: In-class: Discuss results of first reliability test. If necessary, refine coding scheme, and draw new sample; otherwise begin coding
Tue 9/28: Validity IV: Databases and Content Proxies
Thu 9/30: In-class: Work on second reliability test
          Reliability analysis assignment due
Tue 10/5: Introduction to Data Mining Approaches
Thu 10/7: In-class: Interesting examples session
          In-class: Discuss results of second reliability test. If necessary, refine coding scheme, and draw new sample; otherwise begin coding
Tue 10/12: Sentiment Analysis I
           Validity analysis assignment due
Thu 10/14: In-class: Interesting examples session
Tue 10/19: Sentiment Analysis II
Thu 10/21: In-class: Interesting examples session
Tue 10/26: Network Analysis
Thu 10/28: In-class: Interesting examples session
Tue 11/2:  Automatic Topic Categorization
Thu 11/4:  In-class: Interesting examples session
Tue 11/9:  Lexicon Review Presentations
Thu 11/11: In-class: Data cleaning and analysis session
Tue 11/16: Comparative Lexicon Performance Analysis
Thu 11/18: In-class: Determine the parallel analyses that will be presented in the technical reports, as well as graphical styles for presenting the reports
Tue 11/23: NO CLASS (Thanksgiving Break)
Thu 11/25: NO CLASS (Thanksgiving Break)
Tue 11/30: Special Topic (To Be Determined)
Thu 12/2:  Special Topic (To Be Determined)
Tue 12/7:  Special Topic (To Be Determined)
Thu 12/16: Technical Report Presentations, 1:30-4:30 PM

Reading List

8/26 The History, Applications and Analytical Logic of Content Analysis

CAG chapters 1, 2, 3, and 9

8/31 What Is Content Analysis Trying to Do?
Allport, Gordon W. 2009 [1965]. Letters from Jenny. In The content analysis reader, edited by K. Krippendorff and M. A. Bock. Los Angeles: Sage.
Janis, Irving. 2009 [1965]. The problem of validating content analysis. In The content analysis reader, edited by K. Krippendorff and M. A. Bock. Los Angeles: Sage.
Poole, M. Scott, and Joseph P. Folger. 2009 [1981]. Modes of observation and the validation of interaction analysis schemes. In The content analysis reader, edited by K. Krippendorff and M. A. Bock. Los Angeles: Sage.
Berinsky, Adam, and Donald R. Kinder. 2006. Making sense of issues through media frames: Understanding the Kosovo Crisis. Journal of Politics 68(3): 640-656.

9/7 Validity I: Generalizability

CAG chapter 4 ("Message Units and Sampling")
Barnhurst, Kevin. (in preparation). Skim chapters 1 ("Long"), 3 ("What"), and 4 ("Where") of The Myth of American Journalism. Unpublished manuscript. Available here.
Weare, Christopher, and Wan-Ying Lin. 2000. Content analysis of the World Wide Web. Social Science Computer Review 18(3): 272-292.
Hinduja, Sameer, and Justin W. Patchin. 2008. Personal information of adolescents on the Internet: A quantitative content analysis of MySpace. Journal of Adolescence 31(1): 125-146.

For further reading:
Herring, Susan C. 2010. Web content analysis: Expanding the paradigm. In International Handbook of Internet Research, edited by J. Hunsinger, L. Klastrup and M. Allen. New York: Springer.
Hester, Joe Bob, and Elizabeth Dougall. 2007. The efficiency of constructed week sampling for content analysis of online news. Journalism & Mass Communication Quarterly 84(4): 811-824.
Riffe, Daniel, Charles F. Aust, and Stephen R. Lacy. 1993. The effectiveness of random, consecutive day and constructed week samplings in newspaper content analysis. Journalism Quarterly 70: 133-139.

9/14 Validity II: Operationalization

CAG chapters 5 and 6
Krippendorff, Klaus. 2004. Content Analysis: An Introduction to Its Methodology, Second Edition. Thousand Oaks, CA: Sage Publications. Chapter 5, "Unitizing."
Krippendorff, Klaus. 2004. Content Analysis: An Introduction to Its Methodology, Second Edition. Thousand Oaks, CA: Sage Publications. Chapter 7, "Recording/Coding."
Potter, W. James, and Deborah Levine-Donnerstein. 1999. Rethinking validity and reliability in content analysis. Journal of Applied Communication Research 27: 258-284.

For further reading:
Dixon, Travis L., and Daniel Linz. 2000. Overrepresentation and underrepresentation of African Americans and Latinos as lawbreakers on television news. Journal of Communication 50(2): 131-154.
Nadeau, Richard, Elisabeth Gidengil, Neil Nevitte, and André Blais. 1998. Do trained and untrained coders perceive electoral coverage differently? Paper delivered at the annual meeting of the American Political Science Association, Boston, MA.
Watts, Mark D., David Domke, Dhavan V. Shah, and David P. Fan. 1999. Elite cues and media bias in presidential campaigns: Explaining public perceptions of a liberal press. Communication Research 26(2): 144-175.

9/21 Validity III: Reliability

CAG chapter 7 ("Reliability")
Krippendorff, Klaus. 2004. Content Analysis: An Introduction to Its Methodology, Second Edition. Thousand Oaks, CA: Sage Publications. Chapter 11, "Reliability."
Lombard, Matthew, Jennifer Snyder-Duch, and Cheryl Campanella Bracken. 2002. Content analysis in mass communication: Assessment and reporting of intercoder reliability. Human Communication Research 28(4): 587-604.
Krippendorff, Klaus. 2004. Reliability in content analysis: Some common misconceptions and recommendations. Human Communication Research 30(3): 411-433.
Artstein, Ron, and Massimo Poesio. 2008. Inter-coder agreement for computational linguistics. Computational Linguistics 34(4): 555-596.

Useful Web resources:
Andrew Hayes' Kalpha Macro for SPSS and SAS: http://www.comm.ohio-state.edu/ahayes/SPSS%20programs/kalpha.htm
Deen Freelon's ReCal System for Calculating Intercoder Reliability: http://dfreelon.org/utils/recalfront/
Matthew Lombard's Intercoder Reliability Page: http://astro.temple.edu/~lombard/reliability/
PRAM Intercoder Reliability Software Documentation (software will be distributed from the course Moodle): http://academic.csuohio.edu/neuendorf_ka/content/pram.html

For further reading:
Bakeman, Roger, and John M. Gottman. 1997. Observing Interaction: An Introduction to Sequential Analysis, 2nd ed. New York: Cambridge University Press. Chapter 4, "Assessing observer agreement."
Freelon, Deen G. 2010. ReCal: Intercoder reliability calculation as a Web service. International Journal of Internet Science 5(1): 20-33.
Hayes, Andrew F., and Klaus Krippendorff. 2007. Answering the call for a standard reliability measure for coding data. Communication Methods and Measures 1: 77-89.
Rothman, Steven B. 2007. Understanding data quality through reliability: A comparison of data reliability assessment in three international relations datasets. International Studies Review 9(3): 437-456.

9/28 Validity IV: Databases and Content Proxies

Databases
CAG resource 2 ("Using Nexis")
Snider, James H., and Kenneth Janda. 1998. Newspapers in bytes and bits: Limitations of electronic databases for content analysis. Paper presented at the annual meeting of the American Political Science Association, September 3-6, Boston, MA.
Stryker, Jo Ellen, Ricardo J. Wray, Robert C. Hornik, and Itzik Yanovitsky. 2006. Validation of database search terms for content analysis: The case of cancer news coverage. Journalism & Mass Communication Quarterly 83(2): 413-430.

News Indices
Woolley, John T. 2000. Using media-based data in studies of politics. American Journal of Political Science 44: 156-173.
Althaus, Scott, Jill Edy, and Patricia Phalen. 2001. Using substitutes for full-text news stories in content analysis: Which text is best? American Journal of Political Science 45(3): 707-723.

News Abstracts
Althaus, Scott, Jill Edy, and Patricia Phalen. 2002. Using the Vanderbilt Television Abstracts to track broadcast news content: Possibilities and pitfalls. Journal of Broadcasting & Electronic Media 46(3): 473-492.
Edy, Jill, Scott Althaus, and Patricia Phalen. 2005. Using news abstracts to represent news agendas. Journalism & Mass Communication Quarterly 82(2): 434-46.
Page, Benjamin, Robert Shapiro, and Glenn Dempsey. 1987. What moves public opinion? American Political Science Review 81(1): 23-43.

10/5 Introduction to Data Mining Approaches

Cardie, Claire, and John Wilkerson. 2008. Text annotation for political science research. Journal of Information Technology & Politics 5(1): 1-6.
Leetaru, Kalev. In preparation. Content Analysis: A Data Mining and Intelligence Approach. Unpublished book manuscript. Chapters 4 ("Vocabulary Analysis"), 5 ("Correlation and Cooccurrence"), 6 ("Gazetteers, Lexicons, and Entity Extraction"), 7 ("Topic Extraction"), and 8 ("Automatic Categorization and Clustering").
Hogenraad, Robert, Dean P. McKenzie, and Normand Péladeau. 2003. Force and influence in content analysis: The production of new social knowledge. Quality & Quantity 37(3): 221-238.

Useful Web sites:
CAG resource 3, "Computer Content Analysis Software," and software links available in the Content Analysis Guidebook Online web page at this location
Stuart Soroka's Annotated Bibliography of the Data Mining Literature: http://www.snsoroka.com/lexicoder/resources.html
Amsterdam Content Analysis Toolkit (AmCAT) Software: http://content-analysis.org/
Lexicoder Automated Content Analysis Software: http://www.snsoroka.com/lexicoder/index.html
Software Environment for the Advancement of Scholarly Research (SEASR) Software: http://seasr.org/

For further reading:
Alexa, Melina, and Cornelia Zuell. 2000. Text analysis software: Commonalities, differences and limitations: The results of a review. Quality & Quantity 34(3): 299-321.
Pang, Bo, and Lillian Lee. 2008. Opinion mining and sentiment analysis. Foundations and Trends in Information Retrieval 2(1-2): 1-135.

10/12 Sentiment Analysis I

van Atteveldt, Wouter, Jan Kleinnijenhuis, Nel Ruigrok, and Stefan Schlobach. 2008. Good news or bad news? Conducting sentiment analysis on Dutch text to distinguish between positive and negative relations. Journal of Information Technology & Politics 5(1): 73-94.
Soroka, Stuart N., Marc André Bodet, and Lori Young. 2009. Campaign news and vote intentions in Canada, 1993-2008. Paper presented at the annual meeting of the American Political Science Association, Toronto, Canada, September 3-6, 2009.
Gentzkow, Matthew, and Jesse M. Shapiro. 2010. What drives media slant? Evidence from U.S. daily newspapers. Econometrica 78(1): 35-71.

10/19 Sentiment Analysis II

Dodds, Peter, and Christopher Danforth. 2010. Measuring the happiness of large-scale written expression: Songs, blogs, and presidents. Journal of Happiness Studies 11(4): 441-456.
Mislove, Alan, Sune Lehmann, Yong-Yeol Ahn, Jukka-Pekka Onnela, and J. Niels Rosenquist. 2010. Pulse of the nation: U.S. mood throughout the day inferred from Twitter. Available URL: http://www.ccs.neu.edu/home/amislove/twittermood/
Whissell, Cynthia M. 1996. Traditional and emotional stylometric analysis of the songs of Beatles Paul McCartney and John Lennon. Computers and the Humanities 30: 257-265.
Sigelman, Lee, and Cynthia Whissell. 2002. 'The Great Communicator' and 'The Great Talker' on the radio: Projecting presidential personas. Presidential Studies Quarterly 32(1): 137-45.

10/26 Network Analysis

Corman, Steven R., Timothy Kuhn, Robert D. McPhee, and Kevin J. Dooley. 2002. Studying complex discursive systems. Human Communication Research 28(2): 157-206.
O'Connor, Amy, and Michelle Shumate. In press. Differences among NGOs in the business-NGO cooperative network. Business & Society.
Batagelj, Vladimir, and Andrej Mrvar. 2003. Density based approaches to network analysis: Analysis of Reuters terror news network. Available URL: http://www2.cs.cmu.edu/~dunja/LinkKDD2003/papers/Batagelj.pdf

11/2 Automatic Topic Classification

Evans, Michael, Wayne McIntosh, Jimmy Lin, and Cynthia Cates. 2007. Recounting the courts? Applying automated content analysis to enhance empirical legal research. Journal of Empirical Legal Studies 4(4): 1007-1039.
Hillard, Dustin, Stephen Purpura, and John Wilkerson. 2008. Computer-assisted topic classification for mixed-methods social science research. Journal of Information Technology & Politics 4(4): 31-46.
Quinn, Kevin M., Burt L. Monroe, Michael Colaresi, Michael H. Crespin, and Dragomir R. Radev. 2010. How to analyze political attention with minimal assumptions and costs. American Journal of Political Science 54(1): 209-228.
Hopkins, Daniel J., and Gary King. 2010. A method of automated nonparametric content analysis for social science. American Journal of Political Science 54(1): 229-247.