Social Psychological Analysis of Public Political Comments

Social Psychological Analysis of Public Political Comments on Facebook
Márton Miháltz
21st November 2014, Budapest
www.trendminer-project.eu
TrendMiner Overview
• What kinds of socio-political trends appear in Hungarian comments on political posts on Facebook?
– Facebook in Hungary: 4.27M registered users = 59.2% of internet users, 43% of total population
• Download all public comments from the Facebook pages of Hungarian politicians and parties
• Analysis of comments:
– Basic NLP (tokenization, PoS tagging, stemming), domain-adapted
– Entities: political actors (people, organizations)
– Sentiment
– Social psychology dimensions: agency/communion, individualism/collectivism, optimism/pessimism, primordial/conceptual thinking
• In cooperation with Narrative Psychology Research Group,
Hungarian Academy of Sciences
Data Acquisition
• Get comments via fb Graph API
– 1.9M comments for 141K fb posts (2013.10.01 – 2014.09.02)
– from 1344 fb pages
• Organizations: parties, regional and associated branches
• People: candidate and elected representatives (MPs), government,
party officials
• Official and fan pages
– In 3 categories
• Hungarian parliament 2010-2014
• Hungarian parliament elections 2014 (6th April)
• EU parliament elections 2014 (25th May)
• Sources: valasztas.hu, wikipedia.hu
• Everything in a MySQL database
– For arbitrary queries (political groups, time etc.)
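The download step can be sketched as a paging loop. This is a minimal sketch, not the project's script: it assumes each Graph API response is a JSON object with a `data` list and an optional `paging.next` URL, and `fetch_json` is a hypothetical callable (for example a thin wrapper around an HTTP client) supplied by the caller. The demo runs offline against made-up pages.

```python
from typing import Callable, Dict, Iterator

def iter_comments(first_url: str, fetch_json: Callable[[str], Dict]) -> Iterator[Dict]:
    """Yield every comment, following 'paging.next' links until none is left."""
    url = first_url
    while url:
        page = fetch_json(url)
        yield from page.get("data", [])
        # Assumed response shape: {"data": [...], "paging": {"next": "<url>"}}
        url = page.get("paging", {}).get("next")

# Offline demo: a dict stands in for the network (keys play the role of URLs).
FAKE_PAGES = {
    "page1": {"data": [{"id": "c1"}, {"id": "c2"}], "paging": {"next": "page2"}},
    "page2": {"data": [{"id": "c3"}]},
}
comments = list(iter_comments("page1", FAKE_PAGES.__getitem__))
```

Injecting the fetch function keeps the paging logic testable without network access or credentials.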
Data model
• Fb_pages
– Id, URL, Page title
– Type: person or organization
– Affiliated party (3 campaigns)
• Fb_posts, Fb_comments
– Id, Created_timestamp
– Message text, Author_user_id
• Comments_annotations
– Sentence_id, Start_token, End_token index
– Annotated text, Lemmatized_annotated_text, Annotation_tag
• Fb_comments_scores
– 16 scores and counts (sentiment, RID, agency, communion, optimism, …)
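To make the "arbitrary queries (political groups, time etc.)" idea concrete, here is a small sketch of part of the data model. It uses Python's built-in `sqlite3` instead of MySQL so that it runs anywhere. The three `party_*` columns, the `page_id`/`post_id` foreign keys and all inserted rows are assumptions made for illustration, not the project's actual schema or data.

```python
import sqlite3

# Column names follow the slide where given; the party_* columns and the
# page_id/post_id foreign keys are assumptions made for this sketch.
SCHEMA = """
CREATE TABLE fb_pages (
    id TEXT PRIMARY KEY,
    url TEXT,
    page_title TEXT,
    type TEXT CHECK (type IN ('person', 'organization')),
    party_parl_2010 TEXT,
    party_parl_2014 TEXT,
    party_ep_2014 TEXT
);
CREATE TABLE fb_posts (
    id TEXT PRIMARY KEY,
    page_id TEXT REFERENCES fb_pages(id),
    created_timestamp TEXT,
    message_text TEXT,
    author_user_id TEXT
);
CREATE TABLE fb_comments (
    id TEXT PRIMARY KEY,
    post_id TEXT REFERENCES fb_posts(id),
    created_timestamp TEXT,
    message_text TEXT,
    author_user_id TEXT
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)

# Made-up rows, only to exercise the query.
conn.execute(
    "INSERT INTO fb_pages VALUES "
    "('p1', 'https://example.org/p1', 'Example page', 'person', NULL, 'PartyA', NULL)"
)
conn.execute(
    "INSERT INTO fb_posts VALUES ('s1', 'p1', '2014-03-01T09:00:00', 'post text', 'u0')"
)
conn.executemany(
    "INSERT INTO fb_comments VALUES (?, ?, ?, ?, ?)",
    [
        ("c1", "s1", "2014-03-01T10:00:00", "first comment", "u1"),
        ("c2", "s1", "2014-10-01T10:00:00", "late comment", "u2"),
    ],
)

def comments_per_party(conn, start, end):
    """Count comments per 2014 parliamentary party in the half-open range [start, end)."""
    query = """
        SELECT pg.party_parl_2014, COUNT(*)
        FROM fb_comments c
        JOIN fb_posts po ON po.id = c.post_id
        JOIN fb_pages pg ON pg.id = po.page_id
        WHERE c.created_timestamp >= ? AND c.created_timestamp < ?
        GROUP BY pg.party_parl_2014
    """
    return conn.execute(query, (start, end)).fetchall()

result = comments_per_party(conn, "2013-10-01", "2014-09-03")
```

Storing timestamps as ISO 8601 text keeps range filters simple, because string order then matches chronological order.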
Hungarian Political Ontology
• Extending TM multilingual political ontology
– 8 new classes, 3+3 new object/data properties, 1579 new instances (1 Country, 18 Party, 661 Politician, 899 Nomination)
– Nominated and elected MPs (2010 Hungarian Parliament, 2014 Hungarian Parliament, 2014 EU Parliament) and their nominating parties
– Names, abbreviated names, nicknames, Facebook page URLs etc.
• Example: Benedek Jávor was a member of the Hungarian Parliament in 2010-2014 (nominated by LMP) and has been a member of the European Parliament since 2014 (nominated by EGYÜTT-PM).
Processing Pipeline
• Downloading (Facebook Graph API Python script)
• Tokenization (huntoken tool)
• PoS-tagging (hunpos tool)
• Morphological analysis (hunmorph tool)
• Stem + analysis disambiguation (Python script)
• Content analysis (Java NooJ)
• Scoring & storage in DB
• Uploading in RDF to TM Integration Server
Domain Adaptation
• Problem: existing NLP tools were developed on a different domain and fail on social media language (Facebook comments)
• Using the corpus for a survey:
– 1.25M Facebook comments (29M tokens)
– 2.25M unknown tokens (694K types)
– Frequency list: items with f > 15 manually revised
– Identify common problems
– Lists of frequent, relevant unknown words, new words etc.
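The survey step above amounts to counting tokens that the analyzer does not know and keeping the frequent ones for manual review. A minimal sketch, assuming the analyzer's vocabulary is available as a plain set (`known_words` is a stand-in, not a hunmorph API):

```python
from collections import Counter

def unknown_frequency_list(tokens, known_words, min_freq):
    """Return (word, count) pairs for unknown words with count > min_freq, most frequent first."""
    counts = Counter(t.lower() for t in tokens if t.lower() not in known_words)
    return [(word, n) for word, n in counts.most_common() if n > min_freq]

# Toy input; the slide's threshold is f > 15 on a 29M-token corpus.
tokens = ["asszem", "jó", "asszem", "pfff", "asszem"]
review_list = unknown_frequency_list(tokens, known_words={"jó"}, min_freq=1)
```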
Domain Adaptation: Tokenization
• Huntoken tool
• Frequent problems:
– Missing spaces around punctuation
... end of sentence.Beginning of another ...
– Repeated punctuation
first part……. Second part
– Contracted words (slang)
asszem = azt hiszem (“I think”)
– Consonant repetition (interjections, onomatopoeic words etc.)
e.g. pfffffffff, uffffff, ejjjjjjjj (pff(f*), uff(f*), ej(j*))
– Large numbers split at digit groups
125 000
– URLs split into pieces
– Emoticons split into pieces
:D
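Several of the problems listed above can be handled by a normalization pass before tokenization. The following is a rough sketch with regular expressions, not the project's actual huntoken adaptation. The rules, their order and the choice to rejoin digit groups are assumptions, and URLs and emoticons are left untouched here.

```python
import re

CONTRACTIONS = {"asszem": "azt hiszem"}  # example taken from the slide
INTERJECTIONS = [
    (re.compile(r"\bpff+\b", re.IGNORECASE), "pff"),
    (re.compile(r"\buff+\b", re.IGNORECASE), "uff"),
    (re.compile(r"\bej+\b", re.IGNORECASE), "ej"),
]
LOWER = "a-záéíóöőúüű"
UPPER = "A-ZÁÉÍÓÖŐÚÜŰ"

def normalize(text):
    # 1. Restore a missing space between a sentence end and the next sentence.
    text = re.sub(rf"(?<=[{LOWER}])([.!?])(?=[{UPPER}])", r"\1 ", text)
    # 2. Collapse runs of dots and ellipsis characters into a single ellipsis.
    text = re.sub(r"[.…]{2,}", "…", text)
    # 3. Rejoin numbers written in space-separated digit groups (125 000 -> 125000).
    text = re.sub(r"\b\d{1,3}(?: \d{3})+\b", lambda m: m.group().replace(" ", ""), text)
    # 4. Reduce stretched interjections to a base form.
    for pattern, base in INTERJECTIONS:
        text = pattern.sub(base, text)
    # 5. Expand known slang contractions.
    return " ".join(CONTRACTIONS.get(w.lower(), w) for w in text.split(" "))

example = normalize("asszem 125 000 pfffffffff")
```

Rule 1 only fires between a lowercase and an uppercase letter, which avoids breaking most lowercase URLs but will miss sentences that start in lowercase.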
Domain Adaptation: PoS/stemming
• Hunpos tagger + hunmorph analyzer + stemming script
• Frequent problems:
– Unknown words (no lemma/PoS)
• add to hunmorph analyzer’s lexicon
• using analogous words (morphological paradigm)
• Compounds, abbreviations, acronyms, slang words etc.
– Frequently misspelled word forms:
• replace with correct forms
– Wrong capitalization
e.g. SENTENCES IN ALL CAPS
– Missing accents: a disambiguation model is needed
e.g. kor (age), kór (disease), kör (circle)
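The accent problem can be illustrated by listing, for an unaccented form, all lexicon words that reduce to it. This sketch only generates the candidates. Choosing among them is the job of the disambiguation model mentioned above, which is not shown. The tiny lexicon is illustrative.

```python
import unicodedata
from collections import defaultdict

def strip_accents(word):
    """Remove combining marks after canonical decomposition (ó -> o, ö -> o, ő -> o)."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def build_candidate_index(lexicon):
    """Map each accent-free form to the set of lexicon words it could stand for."""
    index = defaultdict(set)
    for word in lexicon:
        index[strip_accents(word)].add(word)
    return index

index = build_candidate_index(["kor", "kór", "kör", "ház"])
candidates = index["kor"]
```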
NooJ, Java NooJ, NooJ-Cmd
• Java NooJ
– Open source version of NooJ: define and run finite state
machines for querying, annotation etc. (morphology, syntax)
– NooJ-Cmd extension: all NooJ GUI features => command line
options
– Open source: https://github.com/tkb-/nooj-cmd
• NooJ grammars (FSMs) for annotation:
– Actors (entities)
– Emotional valence (sentiment polarity)
– Regressive imagery dictionary
– Agency-communion
– Optimism-pessimism
– Individualism-collectivism
Development of NooJ Grammars
• In collaboration with social psychology researchers
– Social Psychology Department, Eötvös Loránd University, Budapest
– Narrative Psychology Research Group, Hungarian Academy of
Sciences
• Development Corpus
– 176K sample fb comments from 570 fb pages (4.9M tokens)
– NLP annotation
– Frequency lists (lemmas, lemmas+PoS, lemmas+morphological
info etc.)
• Development:
– Content words with f > 100 from the development corpus (3500 types)
– 7 independent annotators
– Items on which >= 4 annotators agree: manual revision
– Compile into NooJ grammar with polarity shifters, items to be excluded etc.
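The agreement filter in the development steps can be sketched as a simple vote count. The label names below are hypothetical. The slide specifies only the threshold (at least 4 of 7 annotators).

```python
from collections import Counter

def agreed_label(labels, min_agree=4):
    """Return the most common label if at least min_agree annotators chose it, else None."""
    label, count = Counter(labels).most_common(1)[0]
    return label if count >= min_agree else None

# Seven hypothetical annotations for one candidate word.
accepted = agreed_label(["pos", "pos", "pos", "pos", "neg", "neg", "none"])
rejected = agreed_label(["pos", "pos", "pos", "neg", "neg", "neg", "none"])
```

Items that pass this filter would then go on to manual revision, as described above.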
1. Political Actors (NEs)
• Maxent NE tool (huntag): low performance on this domain
– Trained on standard-language news texts
– Miscategorization, false positive NEs, entity boundary recognition problems
• NooJ grammar/lexicon for TrendMiner
– Person names:
family_name (given_name_lemmatized)? |
frequent_nicknames …
– Organization names:
Standard_form | abbreviated_forms… | nicknames…
– Created automatically (names from DB) + manually
(nicknames from freq. lists)
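The person-name pattern above can be approximated with a regular expression built from a database row. This is a sketch over surface forms only. The NooJ grammar matches lemmatized given names, which is not reproduced here. "Kovács Anna" and the nickname "Kovi" are invented for illustration.

```python
import re

def person_pattern(family, given, nicknames=()):
    """Build: family_name (given_name)? | nickname_1 | nickname_2 ..."""
    core = rf"{re.escape(family)}(?: {re.escape(given)})?"
    alternatives = [core] + [re.escape(n) for n in nicknames]
    return re.compile(r"\b(?:" + "|".join(alternatives) + r")\b")

# Hungarian name order: family name first.
javor = person_pattern("Jávor", "Benedek")
mentions = javor.findall("Jávor Benedek szerint Jávor")

# Hypothetical person with a hypothetical nickname.
kovacs = person_pattern("Kovács", "Anna", nicknames=["Kovi"])
nickname_hit = kovacs.search("Kovi mondta").group()
```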
2. Emotional Valence
• Emotions with positive or negative polarity
• Polarity in context: recognize negation using simple rules
• Nouns, adjectives, verbs, adverbs, emoticons, multiword expressions
• 500 positive, 420 negative entries
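A lexicon lookup with one simple negation rule might look like the sketch below. The three-word lexicon is a toy example, not the project's 500 + 420 entries, and the rule used here (flip polarity when the previous token is "nem", i.e. "not") is an assumption about what "simple rules" could mean.

```python
LEXICON = {"jó": 1, "szép": 1, "rossz": -1}  # good, beautiful, bad (toy entries)
NEGATORS = frozenset({"nem"})                # "not"

def valence_counts(tokens, lexicon=LEXICON, negators=NEGATORS):
    """Return (positive_hits, negative_hits), flipping polarity after a negator."""
    positive = negative = 0
    for i, token in enumerate(tokens):
        polarity = lexicon.get(token.lower(), 0)
        if polarity and i > 0 and tokens[i - 1].lower() in negators:
            polarity = -polarity
        if polarity > 0:
            positive += 1
        elif polarity < 0:
            negative += 1
    return positive, negative

plain = valence_counts(["ez", "jó"])           # "this is good"
negated = valence_counts(["ez", "nem", "jó"])  # "this is not good"
```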
3. Regressive Imagery Dictionary
• Martindale (1975, 1990): uncover psychological
processes reflected in the text
• 2 basic categories of thinking:
– Primordial (primary): associative, concrete, and takes little
account of reality (fantasy, dreams)
– Conceptual (secondary): abstract, logical, reality oriented,
aimed at problem solving
• 7+29 more subcategories (social behavior,
cognition, perceptions, sensations etc.)
• Hungarian version by Pólya and Szász
• 3000+ terms
4. Agency/Communion
• 2 fundamental dimensions of social values:
– Communion: moral and emotional aspects of an individual’s
relations to others (affection, expressiveness, cooperation,
social benefit etc.)
– Agency: efficiency of an individual’s goal-orientated behavior
(motivation, competence, control)
• Positive or negative for both dimensions
– Context dependent (e.g. negation)
• 640 expressions
5. Optimism/Pessimism
• Based on PoS and morphology annotations +
time expressions
• 2 measures:
1. |future_tense_verbs| / (|present_tense_verbs| + |past_tense_verbs|)
2. |present_tense_verbs| / |past_tense_verbs|
• Both correlate with degree of optimism
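Both ratios are straightforward to compute once each verb carries a tense label. In this sketch the labels "past", "present" and "future" are hypothetical stand-ins for whatever the morphological annotation provides, and returning `None` for a zero denominator is a design choice of the sketch.

```python
from collections import Counter

def optimism_measures(verb_tenses):
    """Return (future / (present + past), present / past); None where undefined."""
    counts = Counter(verb_tenses)
    denominator = counts["present"] + counts["past"]
    measure_1 = counts["future"] / denominator if denominator else None
    measure_2 = counts["present"] / counts["past"] if counts["past"] else None
    return measure_1, measure_2

m1, m2 = optimism_measures(["future", "present", "present", "past"])
```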
6. Individualism/Collectivism
• Based on PoS and morphology annotations
• 1 measure:
|personal pronouns| / (|verbs with personal inflection| + |nouns with possessive inflection|)
• Higher score: higher degree of individualism
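The measure can be written as a small function over three counts. How the counts are obtained from the PoS and morphology annotations is left out, and the parameter names are made up for this sketch.

```python
def individualism_score(personal_pronouns, verbs_personal_infl, nouns_possessive_infl):
    """personal pronouns / (personally inflected verbs + possessive-inflected nouns)."""
    denominator = verbs_personal_infl + nouns_possessive_infl
    return personal_pronouns / denominator if denominator else None

score = individualism_score(personal_pronouns=3, verbs_personal_infl=4, nouns_possessive_infl=2)
```

A plausible reading is that Hungarian, being a pro-drop language, already marks person on verbs and possessed nouns, so an explicit pronoun carries extra emphasis on the individual.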
Visualisation
[Slides 19-23: visualisation figures, not included in the extracted text]
Dissemination and Exploitation
• Presentations
– Hungarian NLP Meetup, Sept. 25. 2014., Budapest
– conText, Nov. 20. 2014, Budapest
• Conference papers, presentations
– 2 papers at the 11th Conference on Hungarian Computational Linguistics (January 15-16, 2015, Szeged)
• Source code
– https://github.com/mmihaltz/trendminer-hunlp
– https://github.com/mmihaltz/trendminer-hutools
– https://github.com/tkb-/nooj-cmd
• Project website (http://corpus.nytud.hu/trendminer)
– Download political ontology
– Download 1.9M Facebook comments corpus (with annotations)
– Project info, papers, presentation slides
Thank You!