COM 633: Content Analysis CATA Kimberly A. Neuendorf, Ph.D. Cleveland State University Fall 2010 COM 633 Fall 2010 CATA Presentations Kate & Julie: Jen & Diane: Fran & Dongwoo: Jon & Elizabeth: Joe: LIWC & PCAD LIWC & MCCALite? CATPAC & WordStat Yoshikoder & General Inquirer Diction CATA: Computer Aided Text Analysis Why might you want to use CATA rather than traditional human-coding techniques? CATA programs typically have been written by researchers with a specific need; thus, their utility is often limited. Online search and acquisition opportunities have made CATA easier, more attractive (e.g., Nexis) Purposes of CATA 1. Descriptive—e.g., word counts Modell project, using: VBPro, M. Mark Miller, 1980s software 2. Coding of Open-ended Survey Responses WordStat, SimStat adjunct program (Provalis Research; Normand Peladeau) Purposes of CATA Standard Dictionaries: Most of the following applications use internal “standard” dictionaries: 3. Linguistic and Sociolinguistic Measures General Inquirer, Philip Stone, 1966 Harvard IV Dictionary MCCALite, Don McTavish & Ellen Pirro 116 “idea categories” are applied to multiple characters in a script CATPAC, Joseph Woelfel Semantic “neural” networks—no actual dictionary Purposes of CATA 4. Psychometric Measures (or “Thematic Content Analysis”—Smith) General Inquirer e.g., Lasswell Values Dictionary 5. Clinical Psychological/Psychiatric Diagnoses PCAD, Louis Gottschalk & Robert Bechtel Computer version of Gottschalk’s earlier humancoded schemes devised to provide alternative diagnostic techniques Purposes of CATA 6. Verbal Style or Communicator Style LIWC, Pennebaker, Booth, & Francis e.g., positive emotions, cognitive processes Also includes many linguistic measures and some that might be used as psychometrics Diction, Rod Hart Computer application of Hart’s earlier humancoded schemes aimed at measuring characteristics of political speech—e.g., aggression, cooperation, ambivalence Purposes of CATA 7. Authorship Attribution Most use simple counts of letters or words to attribute authorship (e.g., the Federalist papers; Raymond Chandler; Shakespeare) Basic computer/word processing programming is sufficient Measurement in CATA Three choices: Custom Dictionaries Complicated, time-consuming Standard Dictionaries A task of matching one’s conceptualization to someone else’s operationalization—sometimes a scavenger hunt Similar to the challenge of finding an appropriate scale for a survey “Emergent” Coding—outcome based on language patterns that emerge (e.g., CATPAC) Quantitative CATA Programs Program Author Original Purpose VBPro M. Mark Miller Newspaper articles Yoshikoder Will Lowe Political documents WordStat Normand Peladeau Part of SimStat, a statistical analysis package General Inquirer Philip Stone General mainframe computer application (1960s) Profiler Plus Michael Young Communications of world leaders LIWC 2007 Pennebaker, Booth, & Francis Linguistic characteristics & psychometrics Diction 5.0 Rod Hart Political speech PCAD 2000 Gottschalk & Bechtel Psychiatric diagnoses WORDLINK James Danowski Network analysis/communication CATPAC Joseph Woelfel Consumer behavior/marketing Quantitative CATA Programs Program Type VBPro Word count/researcher-created dictionaries only Yoshikoder Word count/researcher-created dictionaries only WordStat Word count/researcher-created dictionaries only General Inquirer Word count with pre-set dictionaries Profiler Plus Word count with pre-set dictionaries LIWC 2007 Word count with pre-set dictionaries (researchercreated dictionaries may be added) Diction 5.0 Word count with pre-set dictionaries PCAD 2000 Word count with pre-set dictionaries (researchercreated dictionaries may be added) WORDLINK Word co-occurrence CATPAC Word co-occurrence Validity and CATA Validation part of development of CATA system (e.g., Lin et al., 2009—genres of online discussion threads) Validation of thematic CA (psychometrics) against self-report— rare and uncertain (e.g., McClelland et al., 1992) A comprehensive model for assessing content, external, and predictive validity when using CATA—Short, Broberg, Cogliser, Brigham (2010) as applied to “entrepreneurial orientation”: Content validity—an inductive/deductive combo External validity—use multiple sampling frames Predictive validity—measure non-CATA variables that should relate Validity of Standard Dictionaries Trusting the Standard Dictionary—an issue of face validity Few CATA programs reveal the full dictionary lists (e.g., Diction, General Inquirer) None reveal the full algorithm (including disambiguation (e.g., well, pot, leaves)) None account for negation Construct and Criterion Validity Rod Hart’s Diction—”normed” rather than validated Gottschalk and Bechtel’s PCAD—validated against standard psychiatric diagnoses Quantitative CATA Programs Program Type Validation VBPro Word count/researcher-created dictionaries only N/A—all custom dictionaries Yoshikoder Word count/researcher-created dictionaries only N/A—all custom dictionaries WordStat Word count/researcher-created dictionaries only N/A—all custom dictionaries General Inquirer Word count with pre-set dictionaries No--Dictionaries adapted from Harvard IV, Lasswell values, other standard linguistic and socio-psychological scales Profiler Plus Word count with pre-set dictionaries Proprietary LIWC 2007 Word count with pre-set dictionaries (researchercreated dictionaries may be added) Some dimensions have been validated against assessments by human judges Diction 5.0 Word count with pre-set dictionaries No—Based on R. Hart’s substantive work PCAD 2000 Word count with pre-set dictionaries (researchercreated dictionaries may be added) Long history of development of a human-coded scheme; both human & CATA heavily validated against clinical diagnoses WORDLINK Word co-occurrence N/A—emergent dimensions CATPAC Word co-occurrence N/A—emergent dimensions Yoshikoder About Yoshikoder Created by Will Lowe at Harvard’s Department of Government Can be downloaded free at www.yoshikoder.org A cross-platform, multi-lingual CATA program Must run one case at a time Assumes the researcher will create dictionaries Can import external dictionaries Exports results into Excel Yoshikoder: KWIC and Concordance Yoshikoder: Dictionary Report WordStat About WordStat • Created by Normand Peladeau, as part of the SimStat suite for quantitative data analysis (a counterpart to SPSS) •Must be run as part of SimStat •Particularly suited to analyzing open-ended responses, in that data set typically includes both numeric and textual variables—which can immediately be crosstabulated •The “standard” dictionaries that are included are incomplete and should be avoided •Also includes KWIC The WordStat Interface (within SimStat) Selection of Independent & Dependent Variables—Including Textual Variable Standard WordStat “Dictionaries” Breakdown of very limited WordStat “Dictionary” WordStat Output: Word counts WordStat Output: Dendogram WordStat Output: Crosstab with bar graph WordStat Output: Crosstab and 3D representation WordStat Output: KWIC General Inquirer (PC/MAC version) About General Inquirer Created by Philip Stone in the Department of Social Relations at Harvard in the 1960s—on mainframe for many years The current version combines the "Harvard IV-4" dictionary content-analysis categories, the "Lasswell" dictionary content-analysis categories, and five categories based on the social cognition work of Semin and Fiedler, making for 182 categories (dictionaries) in all The General Inquirer (PC) Interface Input and output files must be named Two choices: Tags (application of dictionaries) & Words General Inquirer Output: Tags (data file that may easily be exported to Excel & SPSS) First row of each set is the ‘r’ (raw count) form of the output. This corresponds to frequencies. Second row of each set is the ‘s’ (scaled count) form of the output. This corresponds to percentages (of total). General Inquirer Output: Words PCAD About PCAD Developed by Gottschalk & Bechtel, using scales developed by Gottschalk & Gleser for human-coding in 1960s Diagnostic—assesses one text at a time Intended for naturally-occurring speech or writing, minimum 80 words Measures states of neuropsychiatric interest such as: Anxiety Hostility Cognitive impairment Depression Schizophrenia Achievement Strivings Hope The PCAD Interface PCAD Interface-2 PCAD Output: 4 Types (Clauses, Summaries, Analyses, Diagnoses) PCAD Output: Analyses PCAD Output: Diagnoses LIWC About LIWC •Created by Pennebaker, Booth, & Francis •“Looks at how people write & their state of mind” •Intended to measure both affective and cognitive constructs •84 Output Variables (standard dictionaries): •17 Standard linguistic dimensions (e.g., number of pronouns) •25 Word categories (e.g., “psychological constructs – affect, cognition”) •10 Time categories (e.g.“space, motion”) •19 Personal concerns (e.g., “home”) James W. Pennebaker & Martha E. Francis LIWC Dictionaries (dimensions) with sample words http://www.liwc.net/descriptiontable1.php The LIWC Interface LIWC Output: Data Matrix (Each row is a case/text, each column a dictionary) Diction About Diction • Created by Roderick P. Hart, University of Texas, originally for the purpose of analyzing political discourse • To measure “semantic features”, uses a series of 31 standard dictionaries and five “Master Variables” (scales constituted of combinations of the 31): • Activity • Optimism • Certainty • Realism • Commonality Users can create custom dictionaries in addition to standard dictionaries. The program can accept individual or multiple passages. The Diction Interface Diction Output: Calculated & Master Variables Diction Output: Dictionary Totals with Normative Values Diction Output: Interactively Changing Normative Values Diction: Custom Dictionaries as Simple .txt Files Diction Output: Data file may be exported to SPSS SPSS Syntax Editor CATPAC About CATPAC Created by Joseph Woelfel, Communication scientist at University of Buffalo Part of the GALILEO suite of softwares that analyze and display various types of networks CATPAC uses a neural network approach, identifying the most frequent words and determining patterns of connection based on co-occurrence A scanning window is used to measure the association/cooccurrence Uses cluster analysis to present results of this cooccurrence procedure The CATPAC Interface Text input will appear in CATPAC main screen CATPAC Output: Descending Frequency List, Alphabetically Sorted List CATPAC Output: Dendogram CATPAC Output: 3D Plot (using ThoughView, another part of Galileo Suite) VBPro About VBPro Created by M. Mark Miller at the University of Tennessee For use with MS-DOS (!!) Entirely do-it-yourself. . . no standard dictionaries Quantitative: frequencies & coding texts in numeric format for analysis in statistical software Qualitative: can provide KWIC (key word in context) VBPro: Preparing the text Multiple cases within one file are prefixed with an identification tag and saved as a .txt file (NOT .asc, the old standard) VBPro: Preparing Dictionaries Each search dictionary is headed with >>#<< The VBPro Interface VBPro Output: Data matrix (each row is a case/text, each column a dictionary) VBPro Output: Alphabetization VBPro Output: Word Frequency MCCALite About MCCALITE Created by Donald G. McTavish & Ellen B. Pirro, sociologists at the University of Minnesota, 1990 Full name: Minnesota Contextual Content Analysis Measures the frequency of words in 116 “idea categories” (dictionaries) and compare these frequencies to the norms of general usage statistics for the English Language There are standard dictionaries (categories) and KWIC, DIMAP Two types of dictionary scores are reported: E-Scores (emphasis) and C-Scores (context) Ideal content for MCCALITE are multiple-person transcripts (plays, hearings, interviews, TV) The MCCALite Interface & Output MCCALite: One more example (of many possible) end