Exposing and Quantifying Narrative and Thematic Structures in Well

advertisement
Exposing and Quantifying
Narrative and Thematic
Structures in Well-formed and
Ill-formed Text
Dr Dale Chant, Red Centre Software Pty Ltd
ASC Conference: Making Sense of New Research Technologies
Critical Reflections on Methodology and Technology:
Gamification, Text Analysis, and Data Visualisation
Friday 6th and Saturday 7th September 2013, University of
Winchester
The Problem
• Coding open-ended verbatims takes a long time
• Inconsistent coding judgements can wreak havoc
on small weekly samples
• Some bodies of free text, such as Twitter feeds,
are beyond human capacity to digest due to
sheer volume
• Machine coding by string matching assumes wellformed text – variant morphologies difficult to
accommodate
Naïve Auto-Coding
• Read all source words (or complete strings)
into an array
• Sort alphabetically
• Assign codes from 1 to N, where N is the
number of unique words (or unique strings)
• Write the assigned codes in the original word
(or string) order
Naïve Auto-Coding
• Well-formed published text is code-complete
The first line of Wuthering Heights
The complete code frame has 9,201 items
Netting to a Theme
• With the code frame defined, themes can be
netted from individual words
abandonment = abandon/abandoned/abandonment/reject/rejected/rejecting
Coded
Decoded
Theme(1) = Text(3/5,6530/6532)
Code Incomplete
• Open-ended tracker Brand Awareness
questions, time-dependent blog or social
media exchanges
• Can never be code-complete, because
forthcoming data may throw up unanticipated
variations
Dog/dogs/mongrel/mongrels/mutt/mutts/dingo/wolf/…
Damerau-Levenshtein
• One approach is Approximate String Matching
• Match a source string to a target string by
combinations of
i) Insert
ii) Delete
iii) Replace
iv) Transpose
• The edit distance is the number of transforms
needed to get from the source to the target
The Algorithm in Action
• There is an interactive implementation of
Damerau-Levenshtein at
http://fuzzy-string.com/Compare/
Scaling the Algorithm (1)
• To be useful, the allowable distance for a
positive match needs to scale against the
length of the target strings
• ‘ox’ to ‘fox’ has distance 1 (insert at head)
This would be a false positive
• ‘megalomania’ to ‘megalomaniacs’ has
distance 2 (insert twice at tail)
This is a good match
Scaling the Algorithm (2)
•
•
•
•
Short strings need a distance of zero
Intermediate strings need 1 or 2
Longer strings can bear 2 or 3 or more
The thresholds for short/intermediate/long
and allowed distances for a positive match are
here termed the fuzz parameters
• Fuzz parameters are determined empirically,
and will vary with the body of text being
analysed.
What is Gained
The target string megalomaniac at an edit distance of
1 will match on:
12 * 26 in situ typos (negalomaniac) +
12 missing (megaomaniac) +
12*26 extraneous (megaloomaniac) +
11 transpositions (meglaomaniac) +
2*26 extra pre/post character (mmegalomaniac)
= 699 possible variations
The Procedure
• Code the source text, one code per unique word
• Run a sorted frequency count to expose recurrent
themes and concepts
• Review actual instances of these words in situ to
determine appropriate fuzz parameters and the
thematic and conceptual contexts
• Devise a compact target code frame which maps the
themes and concepts words of interest to synonym and
variant lists
• Process the source text against the targets, to create a
categorical variable which can be tabulated in the
normal manner against any other variable
Exposure and Quantification:
Romeo and Juliet
• Since this text is bounded, code-complete and
well-formed, the fuzz parameters can all be
set to zero
• The Exposure step reveals dominance for
i) love and related
ii) misery and despair
iii) conflict and death
Love dominates, then diminishes
Romeo Romeo, wherefore
art thou Romeo?
To be replaced by misery and despair
As Percents (share of scene)
Interlaced with Conflict and Death
Near mathematical symmetry:
Wuthering Heights
Ill-formed Text
• Tweets on Australian Federal Politics
• From 1 June 2013 to 31 July 2013
• Search term: #auspol OR #auspoll OR
#ausvotes OR #ozcot
• 927,190 cases
• Average between 10 to 20,000 per day
• Huge spike on 26 June
http://votecompass.com/2013/07/25/are-you-among-australias-most-influential-politicaltweeters-votecompass-maps-the-auspol-twittersphere/
Tweet Frequencies
Data Sources
• Two commercial data source providers were used: Gnip and
ScraperWiki
• The Gnip data was collected in a single 28 hour run conducted
on 15 Aug 2013
• ScraperWiki provides user-initiated searches for up to the
prior seven days
• Because ScraperWiki is near real time, accounts later banned
or suspended by 15th August, and hence not in the Gnip data,
remain present
• The ScraperWiki data is used below only to demonstrate this
point.
http://gnip.com/
https://scraperwiki.com/
Australian Federal Politics since 2007
Howard,
Conservative
Rudd,
Labor
Gillard,
Labor
Abbott,
Conservative
Timeline:
‘07
Rudd
defeats
Howard at
general
election
‘08
‘09
‘10
Gillard
challenges
and defeats
Rudd, calls
election,
hung
parliament
‘11
‘12
‘13
Rudd
challenges
and defeats
Gillard, calls
election for
7 Sept
The Grand Narrative
• With the Pretender to the Throne (Gillard) summarily
dispatched
• The True and Rightful King (Rudd), triumphantly returned
from (backbench) exile
• Now faces the Great Adversary (Abbott) in a battle to the
(political) death for control of the realm
Warning: Aussie Vernacular Alert
Assange Senate Bid
Rudd’s Major Problem
The Conservative’s Hammer
Distorting the Message (1)
Attack of the TweetBots
@ALPDirt Posts Once a Minute
Distorting the Message (2)
The Scheduled Automatons
Obsessive / Compulsives
HashTags
• Much more than just a metatag
• Function as a message tokens too:
– Commentary on current affairs
(#1000BoatDeaths, #20000JobCuts)
– Calls to action (#2013electiondateplease,
#AbolishParliament)
– Political attack (#AbbotLies)
– Take a position (#AgeOfEntitlement)
– Make a joke or pun
(#calmdownbirdie, #fraudband)
33,206 Unique HashTags over June/July
Colour Masked to Highlight the Clumps
Zoom-in on BattleRort
Hashtag Spawn
Quantification should capture as many variants as possible.
Sort on 19 July, Zoom
The PNG Solution is more punitive than anything the Conservatives have tried
But many instances missed
Three dominant tags are clear, but the variants will be lost
under a search on just asylum OR asylumseeker/s
Ditto Refugee/Refugees, etc.
Quantification
• Smoothed percentage chart of all instances exposes the
narratives, but to quantify them accurately, we cannot forego
counting the variants.
• To get a more precise read, we apply Damerau-Levenshtein.
• Recalling the four transformation rules (insert, delete, replace,
transpose), the following matches (among many others) will be
made to the dominant forms at run time:
• battelrort->battlerort (transpose once)
• calmdownbirdie->calmdownbridie (transpose once)
• asylumseeke ->asylumseekers (insert twice)
• asylumseekeers->asylumseekers (delete once)
• asylymseekers->asylumseekers (replace once)
Prepare the synonym/variants lists for
the dominant tags
The procedure is:
• Code the hashtags, one code per unique tag
• Generate a sorted frequency count table
• Choose a cut-off point - I have used 30
• Review all items > 30, define and initialise a coded
synonym/variants list with the dominant tags
• Sort the table alphabetically by label
• Review label blocks for any variants which are too
coarse for Damerau-Levenshtein, and add to the
relevant synonym/variants target list
Define Coded Categories against Targets
Code
1
Category
BattleRort
2
CalmDownBri #CalmDownBridie/#CalmDownTony/#CalmDownAbbott/#CalmDown
die
Any Media
#qanda/#abcnews24/#abcnews/#abc24/#Insiders/#730report/#abc730/#730/#lateline/
#thedrum/#pmlive/#pmagenda/#amagenda/#4corners/#contrarians/#abc/#MSMfail/
#MSM/#contrarians/#theboltreport/#viewpoint/#media/#Murdoch/#ABC1/
#mediawatch/#datelineSBS
3
Synonym/ Variant Targets
#BattleRort/#BattleRortAbbott/#BattleRortGate/#BattleRortMovies/#BattleRortSongs
4
Any Border
Protection
#asylumseekers/#refugees/#boatpeople/#asylum/#PNGSolution/#PNG/#Nauru/
#ManusIsland/#Manus/#humanrights/#stoptheboats/#Indonesia/#immigration/#boats/
#operationsovereignborders
5
6
7
Islam
NBN
Any
Environment
8
pinkbatts
#islamophobe/#islamist/#islamlaw/#islamic/#Islam/#muslim
#NBNCo/#NBN/#fraudband
#climate/#coal/#fracking/#carbon/#energy/#CSG/#climatetax/#climatechange/
#environment/#ETS/#carbonscam/#carbontaxscam/#climatescam/#climatecon/#green/
#AGWHoax/#globalwarming/#greenarmy/#renewables/#votegreen/#climategate/
#naturalcsg/#naturalgas
#pinkbatts
Confirm it Works
• Set fuzz parameters as distance=0 for strings 4 characters or less, distance=1 for 9
characters or less, and distance=2 for 10 characters or more
• Run the source hashtags against these targets to create a new variable comprising
eight categorical codes
• To confirm, run a table of the eight coded categories against the original raw hashtag
text
The source strings battelrort and calmdownbirdie are both correctly captured and
coded.
Doing Likewise for the Tweet Text
Code Category
the government
1
Rudd
2
3
4
5
Albanese
Gillard
Shorten
Labor
Party
6
Greens
7
Unions
the opposition
8
Abbott
9
10
Turnbull
Coalition
Synonym/Variant Targets
Kevin Rudd/KevinRudd/Kevin13/Kevin747/Kevin/@KRuddMP/KRudd/Rudd/
CrudDudd/KR/milky bar/messiah
Albanese/Albosleaze/Albo
Julia Gillard/JuliaGillard/Gillard/Juliar/Julia/Jules
Bill Shorten/Shorten
Labor Party/Labor/#ALP/ALP/the government/the govt
Greens/Milne/Bob Brown
unions/faceless men/AWU/HSU/Bill Ludwig/Ludwig/Paul Howes/Piggy Howes
@TonyAbbotMHR/Tony Abbott/TonyAbbott/TAbbott/Abbott/Tony/TA/budgie smuggler/
mad monk
Malcolm Turnbull/@TurnbullMalcolm/Turnbull
Coalition/liberal/the opposition/Libs/#LNP/LNP
etc to code 46
Continued to Code 46
Code
34
35
36
37
38
Category
Corruption
Synonym/variant Targets
corruption/corrupt/fraud/sleaze/dishonest/stealing/steal/greed/shonky/crooks/criminals/
thuggish/thugs/thieves/thief
Treachery
treachery/treacherous/back stab/backstab/back
stabbing/backstabbing/stabbed/stabbing/knifing/knifed/plot/betrayal/betrayed/betray/spill/
leadership coup/ousting/oust
Insanity
insanity/insane/nutter/crazy/lunatic/unstable/psychopath/psychotic/psycho/narcissist/
delusional/delusion/egotist/egotistic/egotistical/egomaniac/ego/
power mad/powermad/madness/deviant
Stupidity
stupidity/stupid/wanker/numpty/imbecilic/imbecile/zombie/clueless/moron/retarded/retard/
idiotic/idiot/bogan
Incompetence incompetent/mismanagement/dysfunctional/waste/inefficient/chaotic/chaos/destructive/inept
39
Cowardice
40
Hypocrisy
41
Arrogance
scandals
42
Scandals
43
AWU Slush
Fund
44
ALP Scandals
cowardice/coward/gutless/ticker/frightened/scared
hypocrisy/hypocrit/bigoted/bigot
arrogance/arrogant/smart arse/smartarse/smart ass/self indulgent/smugness/smug
45
Heiner
policy
46
Policy
Heiner
scandalous/scandal
AWU slush fund/slush fund/Bruce Wilson
Peter Slipper/Slipper/Craig Thompson/Thompson/Eddie Obeid/Obeid
agenda/policies/policy
Share of Voice
All synonym and variant matches for Rudd, Abbott, Gillard, as percentages of
the sum of their total mentions per day:
Compare June to July
Tweet Categories by Day
Rudd vs Abbott – Image Attributes
Corruption Share – June vs July
Compared to Topsy Sentiment Score
http://www.couriermail.
com.au/news/specialfeatures/ruddeffect-onthe-wane-as-abbottretains-thepeople8217s-trust/storyfnho52jo1226683181964
Not much agreement here.
Who is right?
Performance
• Machine: Standard business Dell laptop, dual core, 4 gig RAM, nothing
fancy, no accelerations
• The bottleneck is the Damerau-Levenshtein step on the tweet text, which
for the above 46 categories over 113 meg of plain text, takes about 15 hours
• Performance is linear to the number of individual target synonyms/variants
• Damerau-Levenshtein on the hashtags, a much smaller set of targets,
completes in about 20 minutes
• The major time commitment from a human is in devising the target
synonym and variants lists, here several hours
• For more routine applications of the technique, such as open-ended brand
lists, preparing the target lists is trivial
[end of document]
The Finale
http://www.theage.com.au/
And Throughout the land of Oz,
the Loud Lament was Raised:
http://www.michaelsmithnews.com/2013/08/j
oin-simon-sheikh-not-even-for-practice.html
DatDat: Rebirthing Dada for the Digital Milieu
Tentative: 5.30 Tuesday Goldsmiths. Email dale@redcentresoftware.com to confirm.
Download