CHI Course 2013
Emre Kıcıman, Shelly Farnham emrek@microsoft.com, shellyfa@microsoft.com
Douglas Wray http://instagr.am/p/nm695/ @ThreeShipsMedia
1. Observe
2. Tweet
But people are not perfect sensors, for many reasons.
[Charts: tweet volume over time, shown at three scales]
One week of tweets mentioning “donut” or “doughnut”: ~180k tweets during the week of Feb 6–12, 2012.
Drugs, diseases, and contagions
“You Are What You Tweet: Analyzing Twitter for Public Health”
Paul and Dredze, 2011
Symptoms and medication usage, tracking illness over time, behavioral risk factors
“Predicting Disease Transmission from Geo-Tagged Micro-Blog Data”
Sadilek, Kautz and Silenzio, 2012
Study disease transmission in physical world based on location traces of sick & healthy people
Public Sentiment
Political and election indices, market insights
Everyday life
Cross-domain / open-domain
Large-scale, fine-grained, naturalistic
What are common conventions in interactions?
Ex. “Unsupervised Modeling of Twitter Conversations”, Ritter, Cherry and Dolan, 2010
How do people’s interactions impact each other? How do norms form?
Ex. “The Birth of Retweeting Conventions on Twitter”, Cha, Gummadi, Kooti, Mason and Yang, 2012
How do communities organize themselves?
Ex., social media usage in the context of war, disasters and crises. Starbird et al. 2010; Al-Ani, Mark & Semaan 2010; Monroy-Hernandez et al. 2012
Ex. “Feed Me: Motivating Newcomer Contribution in Social Network Sites,” Burke, Marlow and Lento, 2009.
Plus, a case study a bit later on return visits from first-time users
Social media captures people talking and interacting with each other, publicly, on a wide variety of topics.
We would like to study it to learn about the world…
… about how people interact with each other
… and the role of the system in influencing these interactions
Specialize in social technologies
Social networks, community, identity, mobile social
Early stage innovation
Extremely rapid R&D cycle: study, brainstorm, design, prototype, deploy, evaluate (repeat)
Convergent evaluation methodologies: usage analysis, interviews, questionnaires
Career
PhD in Social Psychology from UW
7 years Microsoft Research: Virtual Worlds, Social Computing, Community Technologies
4 years startup world: Waggle Labs (consulting), Pathable
2 Years Yahoo!
FUSE Labs, Microsoft Research
Personal Map
Specialize in Social Data Analytics
Social media, analytics and search
Focus:
1. Improving our analytical capabilities
2. Extracting information about the world from social media
3. Reasoning about social media biases, reinforcing useful signals
Career
Ph.D. in computer science from Stanford University, ’05
7 years at Internet Services Research Center, Microsoft Research
Part I: Introduction and conceptual framework
1. Introduction and preliminaries
2. Basic model for interaction through social media
3. A processing pipeline for analyzing social media
Part II: Case Studies
4. Social responses and engagement on So.Cl
5. Population biases in political tweets
6. Studying self-reporting bias by comparing tweet rates to ground-truth
7. Annotating graph structures with discussion context to interpret high-level graph analysis results
[Diagram: Message from Bob → Alice → Effect]
Possible effects on Alice:
* Wow, Bob is ______!
* I should donate money to _____!
* Write back to Bob....
* Forward Bob’s message to others…
Effect depends on many factors:
* Relationship <Alice, Bob>
* Content of the message
* Alice’s environment:
-> social context
-> current tasks
[Diagram: Bob chooses among candidate messages (Message1? Message2? Message3?) for their effect on Alice]
Assumption: the source writes a message to have an effect on the recipient.
The effect might be to project a persona, to elicit engagement or action, or to build social capital [cite communications theory; social capital; etc.]
(Of course, this is not always true; sometimes messages are written for effect on self.
E.g., cathartic writing)
Let effect be a function of relationship, message and context:

E = F(⟨s,r⟩, m, e)

where ⟨s,r⟩ represents the relationship between a source and recipient (e.g., “close friends”, “authority/expert”), m represents the message content and style, and e represents the environment in which the message is received (e.g., Facebook or LinkedIn; this also includes broader social norms, etc.).
Then a rational source, trying to achieve an effect E*, will select messages based on:

argmin_{m∈M} |E* − F̂(⟨s,r⟩, m, e)|

where s, r and e are fixed and F̂ is the source’s approximation of F.
More complex versions take into account multiple recipients, thresholds of cost & utility, etc.
Let’s assume for simplicity that style is constant, and content is what an author is actively choosing. Then, given a set of observations W about real-world events, the author chooses a w:

argmin_{w∈W} |E* − F̂(⟨s,r⟩, w, e)|

If no w achieves goal E* within some threshold t, the author writes nothing.
The weather bias case study essentially investigates the relationship between features of w and whether a w clears this threshold.
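To make the model concrete, here is a minimal, illustrative sketch of the message-selection rule in Python; f_hat, the keyword-overlap scoring, and the toy messages are invented stand-ins, not part of any real system:

```python
def f_hat(relationship, message, environment):
    """Author's approximation F-hat of the effect function F.
    Toy version: score = keyword overlap with the recipient's interests."""
    interests = environment.get("interests", set())
    words = set(message.lower().split())
    return len(words & interests)

def choose_message(messages, target_effect, relationship, environment, threshold=1.0):
    """Select argmin_{m in M} |E* - F_hat(<s,r>, m, e)|; write nothing if
    no message comes within `threshold` of the desired effect."""
    best = min(messages,
               key=lambda m: abs(target_effect - f_hat(relationship, m, environment)))
    if abs(target_effect - f_hat(relationship, best, environment)) > threshold:
        return None  # the author stays silent
    return best

msgs = ["I had fun hiking Tiger Mountain", "Doing laundry today"]
env = {"interests": {"hiking", "tiger", "mountain"}}
print(choose_message(msgs, target_effect=3, relationship=("Bob", "Alice"), environment=env))
```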
[Diagrams: the same message from Bob may reach multiple recipients (Alice, Charlie, Justin), with a different, uncertain effect on each]
[Diagram: Bob sends Alice a message, producing an effect]

Alice: E_Alice = F(⟨Bob, Alice⟩, m, e)
Bob: argmin_{m∈M} |E*_Bob − F̂(⟨Bob, Alice⟩, m, e)|
The social media system is also trying to align parameters to achieve its own effect E*_Sys.
E.g., align relationships ⟨S,R⟩ and the environment e, as well as the space of messages m that Bob selects from.

Ex. Reinforcing useful signals: if we want to increase the likelihood that world events W⁺ that we care about are reported, this model indicates several possible directions:
* Investigate and optimize E*, ⟨s,r⟩ and e such that W⁺ is reported
* Design feedback that improves the approximation F̂ to reinforce W⁺
Basic model of social media system interactions:
Messages have an effect on recipients, conditioned on various factors
Authors choose their messages to have some desired effect
Social media systems today can (and do) play an active role
Having a generation process in the back of our mind helps us recognize the biases and limitations of the data
Having a sense of the basic knobs can help us think about how to improve social media systems
Part I: Introduction and conceptual framework
1. Introduction and preliminaries
2. Basic model for interaction through social media
3. A processing pipeline for analyzing social media
Part II: Case Studies
4. Social responses and engagement on So.Cl
5. Population biases in political tweets
6. Studying self-reporting bias by comparing tweet rates to ground-truth
7. Annotating graph structures with discussion context to interpret high-level graph analysis results
Clarity:
* Be clear in your purpose and the real-world problem you wish to address
* State your question as a hypothesis
Testability:
* In some cases, the hypothesis is tested through the social data
* Other times validation must lie outside the social data
Passive and active experiments:
* If we look at questions of causality we often need to do active experimentation
Data Analysis is one tool of many:
* User surveys, interviews, mockups, prototypes, etc.
The amount of data is overwhelming – the more defined your question, the easier the analysis
What real world problem are you trying to explore?
Avoid pitfall of technology for technology’s sake
What argument do you want to be able to make?
State your problem as a hypothesis
Studying the relationship between activities and locations
What do people do? And where do they do it?
(Note: research question is about accuracy of results, and its use in applications)
Pipeline: Collect Content & Interactions → Cleaning and Feature Extraction → Define the Key Context → Extract Core Relationships
Followed by higher-level statistical, graph and machine learning analyses to build …
“I had fun hiking Tiger Mountain last weekend” – Alice said on Monday, at 10am

Extracted features:
Location: Tiger Mountain · Mood: Happy · Activity: Hiking · Name: Alice · Gender: Female · Post Time: Mon 10am · Activity Time: {Sat–Sun}

Core relationship: Location: Tiger Mountain ↔ Activity: Hiking
Followed by higher-level graph and machine learning analyses on the combined structure and context…
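A minimal sketch of this feature-extraction step in Python, assuming toy keyword lists in place of real entity-recognition and mood classifiers:

```python
import re

# Toy gazetteers standing in for real extractors.
LOCATIONS = {"tiger mountain"}
ACTIVITIES = {"hiking", "skiing"}
MOOD_WORDS = {"fun": "Happy", "awful": "Sad"}

def extract_features(text, author_name, author_gender, post_time):
    features = {"Name": author_name, "Gender": author_gender, "Post Time": post_time}
    lower = text.lower()
    for loc in LOCATIONS:
        if loc in lower:
            features["Location"] = loc.title()
    for act in ACTIVITIES:
        if act in lower:
            features["Activity"] = act.title()
    for word, mood in MOOD_WORDS.items():
        if re.search(r"\b%s\b" % word, lower):
            features["Mood"] = mood
    if "last weekend" in lower:
        features["Activity Time"] = "{Sat-Sun}"
    return features

tweet = "I had fun hiking Tiger Mountain last weekend"
print(extract_features(tweet, "Alice", "Female", "Mon 10am"))
```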
Instrumentation
Avoid tendency to collect everything without organization
Validate logging -> untested instrumentation is prone to bugs
Design for key scenarios: Make it easy to get data for key questions up front
Streaming and Search APIs
Easy to use. Appropriate for many experiments
Often rate-limited, but can build large-scale data over time
Crawling
More effort, but can grab historical data
Some sites will block
In all cases, do consider user privacy and expectations.
Filters
Time span, type of person, type of actions
Sampling
Random selection
Snowballing, to get a complete picture of a person’s social experience
Consider your research questions, how you want to generalize
Clean once: removing irrelevant raw data (depends on your research question)
* Spammy users, people who were never active
* Geographic or temporal filtering
When you remove a user, message, or action, think about whether to remove associated data
(e.g., you might want to keep a spammy user’s interactions with other, non-outlier users)
Feature extraction:
* Entity recognition from text
* User classification, …
Implication: keep both the raw data and the feature results.
This is also a good stage to bring in external data…
Clean again:
Look for outliers and remove feature values that are not dependable
Keep samples of raw data for distinct feature values to make inspection easier
After extracting features from our social media, we want to reason about the relationships among these features.
What defines a relationship?
One common choice: Context == Co-occurrence within the same message
(next slide)
“I had fun hiking Tiger Mountain last weekend” – Alice said on Monday, at 10am

[Diagram: features extracted from two users’ tweets, linked where they co-occur]
Alice’s tweet: Location: Tiger Mountain · Gender: Female · Mood: Happy · Activity: Hiking · Name: Alice · Post Time: Mon 10am · Activity Time: {Sat–Sun}
Bob’s tweet: Location: Tiger Mountain · Gender: Male · Mood: Happy · Activity: Hiking · Name: Bob · Post Time: Fri 3pm
User as defining context
-> Two things are related if they are associated with the same user
-> Common in recommender systems
-> Ex. Livehoods study of neighborhood boundaries
Location as defining context
-> Two things (users, actions, …) are related if they co-occur at the same physical location
-> Ex. Sadilek’s study of disease transmission
[Diagram: core relationship — Location: Tiger Mountain ↔ Activity: Hiking]
• Focus on core relationships among domains of interest
• Strength defined by how frequently items co-occurred in key context
• Statistical distribution of other features annotates core relationships
[Diagram: the Location ↔ Activity relationship, annotated with the gender distribution of its tweets]
Statistical tests
Ex. Test that relationships are statistically significant
Ex. Test that two items are statistically different from one another
Graph analyses
Ex. Clique finding, graph clustering, path algorithms, network centrality, ….
Machine learning
Ex. Classifiers, clustering, etc., based on graph relationships or annotations
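A minimal sketch of building the weighted co-occurrence structure these analyses operate on (here the message is the key context, and edge strength counts co-occurrences):

```python
from collections import Counter
from itertools import combinations

def build_cooccurrence_graph(feature_sets):
    """feature_sets: one iterable of (type, value) features per message.
    Returns edge weights: number of messages in which two features co-occur."""
    edges = Counter()
    for features in feature_sets:
        for a, b in combinations(sorted(set(features)), 2):
            edges[(a, b)] += 1
    return edges

messages = [
    [("Location", "Tiger Mountain"), ("Activity", "Hiking"), ("Gender", "Female")],
    [("Location", "Tiger Mountain"), ("Activity", "Hiking"), ("Gender", "Male")],
]
graph = build_cooccurrence_graph(messages)
# ("Activity","Hiking") -- ("Location","Tiger Mountain") has strength 2
for (a, b), weight in graph.most_common(3):
    print(a, "--", b, "strength", weight)
```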
Build “profiles” of things based on the words and sentiment used
Build demographics of places and concepts based on who is talking about them
Build co-mention graph among entities, people, places, etc.
Build profiles of users based on what they talk about and how they express themselves
Include “time” in the projection, and see how profiles change over time
Part I: Introduction and conceptual framework
1. Introduction and preliminaries
2. Basic model for interaction through social media
3. A processing pipeline for analyzing social media
Part II: Case Studies
4. Social responses and engagement on So.Cl
5. Population biases in political tweets
6. Studying self-reporting bias by comparing tweet rates to ground-truth
7. Annotating graph structures with discussion context to interpret high-level graph analysis results
Social responses and engagement on So.Cl
Clear & simple study of user interactions
Population biases in political tweets
Extracting basic features from tweets
Demonstrates complexity of population biases
Studying self-reporting bias by comparing tweet rates to ground-truth
Example of building a domain classifier
Methodology to study self-reporting bias
Annotating graph structures with discussion context to interpret high-level graph analysis results
Applies higher-level graph analyses to graphs of discussion topics
Shows how discussion context can be useful at different layers
Part I: Introduction and conceptual framework
1. Introduction and preliminaries
2. Basic model for interaction through social media
3. A processing pipeline for analyzing social media
Part II: Case Studies
4. Social responses and engagement on So.Cl
5. Population biases in political tweets
6. Studying self-reporting bias by comparing tweet rates to ground-truth
7. Annotating graph structures with discussion context to interpret high-level graph analysis results
8. (Bonus) Statistical language modeling to analyze language differences across user populations
So.cl is an experimental web site that allows people to connect around their interests by integrating search tools with social networking.
Study: How important are social interactions in encouraging users to become engaged with an interest network?
We’ll start with some background on So.Cl, and then dive into the case study.
Reimagining search as social from the ground up: search + sharing + networking = informal discovery and learning
= informal discovery and learning
History:
Oct 2011: Pre-release deployment study
Dec 2011: Private, invitation-only beta
May 2012: Removed invitation restrictions
Nov 2012: Over 300K registered users, 13K active per month
Try it now! http://www.so.cl
Find others around common interests
Be inspired by new interests
Learn from each other through these shared interests
Search & Post
Feed Filters
People
Try it now! http://www.so.cl
– use facsumm tag
Feed
Search (Bing)
Post Builder
Filter Results
Experience:
Step 1: Perform search
Step 2: Click on items in results to add to post
Step 3: Add a message
Step 4: Tag
Try it now! http://www.so.cl
– use facsumm tag
[Framework diagram: social search + discovery network + interest network. Connect (follow, simple profiles, wall messages, people list) · Collaborate (liking, commenting, riffing) · Create (post builder, add links) · Collect · Consume (visual interest stream, video parties, explore page, interest page, search interests) — all aimed at increasing engagement, community, learning, and innovation]
Access to public so.cl behavioral data for research purposes
Foster research in interest networking, social search, and community development: http://fuse.microsoft.com/research/srd
Hypothesis:
If people receive a social response when they first join So.cl they are more likely to become engaged.
Measuring social/behavioral constructs:
When first join
First session = time of first action to time of last action prior to an hour of inactivity
Social responses
Follows user, likes user’s post(s), comments on user’s post(s)
Engagement = coming back
A second session = any action occurring 60 minutes or more after the first session
Restating the hypothesis:
If people receive follows, likes, and comments in their first session, they are more likely to come back for a second session
Simple, common instrumentation schema, kept in a database
Users table: Row per user
Include creation time and other metadata
Content table: Row per content item; includes text, URLs, etc.
Actions table: Row per action
Filter out non-meaningful, non-user generated actions
Actions capture user interactions and context
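A minimal sketch of such a schema using SQLite from Python; the column names are illustrative, not the actual So.cl schema:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE Users (
    user_id     INTEGER PRIMARY KEY,
    created_at  TEXT,      -- account creation time
    metadata    TEXT       -- other per-user metadata
);
CREATE TABLE Content (
    content_id  INTEGER PRIMARY KEY,
    user_id     INTEGER REFERENCES Users(user_id),
    created_at  TEXT,
    text        TEXT,
    urls        TEXT
);
CREATE TABLE Actions (
    action_id   INTEGER PRIMARY KEY,
    user_id     INTEGER REFERENCES Users(user_id),
    content_id  INTEGER REFERENCES Content(content_id),
    action_type TEXT,      -- e.g., 'post', 'like', 'comment', 'follow'
    occurred_at TEXT,
    context     TEXT       -- e.g., which page/feed the action came from
);
""")
```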
Always look at your raw data: play with it, ask yourself if it makes sense, test!
Filtered out administrators/community managers
New users only
Date range: Sept 28 to Oct 13
100% sample for that time span: 2462 people
A large percentage never become active or return
- “kicking the tires” unduly biases averages
Common reporting format:
X% performed Y behavior, of those averaged Z times each
5% commented on a post their first session, averaging 5 times each
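A small sketch of computing this reporting format with pandas, on made-up data:

```python
import pandas as pd

# Toy actions table: one row per action.
actions = pd.DataFrame({
    "user_id":     [1, 1, 2, 3, 3, 3],
    "action_type": ["comment", "comment", "post", "comment", "like", "comment"],
})
n_users = 4  # total users in the sample, including those with no actions

counts = (actions[actions.action_type == "comment"]
          .groupby("user_id").size())
pct = 100.0 * len(counts) / n_users
print(f"{pct:.0f}% commented, of those averaged {counts.mean():.1f} times each")
```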
OUTLIERS: Filtered out 13 outliers with z > 4 in number of actions (among those who did more than sign in)
SYSTEMATIC BIASES IN SOCIAL SYSTEMS #2
A small percentage are “hyper-active” users: avid users, spammers, trolls, administrators; they can unduly bias averages
-> Remove outliers
A substantial percentage are consumers but not producers (“lurkers”); often there is no signal for lurkers
So.cl has about 75% lurkers
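A minimal sketch of the z-score outlier filter, on made-up counts:

```python
import numpy as np

def drop_outliers(action_counts, z_threshold=4.0):
    """Drop users whose total action count is more than z_threshold SDs from the mean."""
    counts = np.asarray(list(action_counts.values()), dtype=float)
    mean, std = counts.mean(), counts.std()
    return {user: c for user, c in action_counts.items()
            if std == 0 or abs(c - mean) / std <= z_threshold}

users = {f"u{i}": 5 for i in range(100)}
users["spammer"] = 10_000
print(len(drop_outliers(users)))  # 100: the hyper-active account is removed
```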
Custom instrumentation, logging sign-ins
Web analytics for clicks
Very important to spend time examining data:
Descriptives, frequencies, correlations, graphs
Use a tool that easily generates graphs and correlations
Does it make sense? If not, really chase it down. Often a bug or misinterpretation of data.
Feature: Active Sessions
Active session = a period of (public) activity, with a 60-minute gap of no activity before and after
91% of users had only one active session
Sessions were, on average, 34.6 hours apart
First session: 1.6 minutes, on average
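A minimal sketch of this sessionization rule in Python:

```python
from datetime import datetime, timedelta

def sessionize(timestamps, gap=timedelta(minutes=60)):
    """Group sorted action timestamps into sessions split by >= 60-minute gaps."""
    sessions, current = [], []
    for t in sorted(timestamps):
        if current and t - current[-1] >= gap:
            sessions.append(current)
            current = []
        current.append(t)
    if current:
        sessions.append(current)
    return sessions

ts = [datetime(2012, 10, 1, 9, 0), datetime(2012, 10, 1, 9, 5),
      datetime(2012, 10, 2, 20, 0)]
print(len(sessionize(ts)))  # 2 sessions: the later visit counts as a return
```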
Feature: User Actions
Number of Posts in First Session
Actions in First Session
8% created a post in their first session, of those averaged 1.5 posts each
Feature: Coming back
9.1% came back for another active session
(~25% including inactive)
On average, 35 hours later
Aggregation: merging down for summarization
What is your level of analysis?
* Person, group, network
* Content types
If person is the unit of analysis, aggregate measures to the person level
E.g., in SPSS: one line per person. It is very important to have the appropriate unit of analysis, to avoid bias in statistics.
SPSS Syntax:
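The original slide showed the SPSS syntax; a rough pandas equivalent of the person-level aggregation might look like this:

```python
import pandas as pd

# Toy actions table; in practice this comes from the Actions log.
actions = pd.DataFrame({
    "user_id":     [1, 1, 2, 2, 2],
    "action_type": ["post", "like", "post", "post", "comment"],
})
per_person = (actions
              .pivot_table(index="user_id", columns="action_type",
                           aggfunc="size", fill_value=0)
              .reset_index())
print(per_person)  # one line per person, one column per behavior count
```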
Always ask, does this pattern make sense?
How often is a user the target of social behavior?
23% received some response up to the 2nd session
-> 3% if they did not create a post; 37% if they did create a post
[Charts: responses *during* first session; responses *in between* 1st and 2nd sessions]
Social responses inspire people to return to the site, especially if occurring during the first session
(N = 2273, N = 179, N = 1942, N = 510)
Social responses to a user: following, commenting on post, liking post, liking comment, riffing
Logistic Regression: Any Response Predicts Coming Back

                                   B     S.E.   Sig.
Created post first session        .71    .20    .000
Response1: during first session  1.12    .21    .000
Response2: after first session    .60    .17    .000
Logistic Regression: Which Responses Predict Coming Back

                              B     Sig.
Created post first session   .95    .000
Followed                     .92    .003
Commented On                 .38    ns
Post Liked                   .87    .02
Comment Liked               -.09    ns
Messaged                    -.09    ns
Riffed                       .00    ns
IDENTIFYING SUBGROUPS
Factor analysis of first-session behaviors (principal components, varimax rotation [meaning factors forced to be orthogonal]): created post, invited, followed, added item to post, searched, commented, liked post, liked comment, messaged, viewed person, navigated to All, joined party.
[Component matrix: loadings of each behavior on three components — Creators (32% of variance), Socialites (12%), Browsers (9%); e.g., “joined party” loads .17 / .09 / .68 on Creators / Socialites / Browsers]
Factor Analysis for Associated Behaviors:
Three types of usage – creating, socializing, browsing
Factors about equally predict whether a user comes back

Regression Coefficients   Beta    t     Sig
Creating                  .14    5.28   .000
Socializing               .07    2.61   .000
Browsing                  .19    7.20   .000
Browsing is a stronger predictor of overall activity level

Regression Coefficients   Beta    t     Sig
Creating                  0.20   7.89   0.00
Socializing               0.17   6.58   0.00
Browsing                  0.29   9.07   0.00
Part I: Introduction and conceptual framework
1. Introduction and preliminaries
2. Basic model for interaction through social media
3. A processing pipeline for analyzing social media
Part II: Case Studies
4. Social responses and engagement on So.Cl
5. Population biases in political tweets
6. Studying self-reporting bias by comparing tweet rates to ground-truth
7. Annotating graph structures with discussion context to interpret high-level graph analysis results
There was a significant amount of political discussion on Twitter during the US election season in Summer/Fall 2012.
Case Study: Is the population of tweeters representative of US demographics along two axes, gender and geography?
Why this case study?
Good illustration of simple extractors for gender, location, and simple methods for identifying topics.
More fundamentally, highlights challenges of dealing with population biases
Collected all tweets during August – November, 2012 that mentioned “Obama”, “Romney” or other politician names
Inspecting raw data:
* Removed some common names and issue phrases from collection
Feature: Gender
Simple gender classifier based on first name of Twitter user in profile
Approach: Look up first name in a weighted gender map built from census data and other sources.
Practical results:
Ad hoc inspection is positive
Coverage is 60-70%, depending on domain. Remainder are organizations and ambiguous names
Still requires:
Accuracy evaluation based on ground-truth data
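A minimal sketch of the lookup, with invented weights standing in for the census-derived gender map:

```python
# Toy weighted gender map; a real one is built from census and name data.
GENDER_MAP = {"alice": {"f": 0.99, "m": 0.01},
              "pat":   {"f": 0.55, "m": 0.45}}

def classify_gender(profile_name, min_confidence=0.8):
    first = profile_name.strip().split()[0].lower() if profile_name.strip() else ""
    weights = GENDER_MAP.get(first)
    if not weights:
        return None                      # unknown name or an organization
    gender, score = max(weights.items(), key=lambda kv: kv[1])
    return gender if score >= min_confidence else None  # ambiguous name

print(classify_gender("Alice B."))   # 'f'
print(classify_gender("Pat Smith"))  # None: too ambiguous
```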
Feature: Location
Map from self-declared user profile locations to lat-lon regions.
Approach: Use a mapping learned from the small % of tweets that are geocoded.
Cluster mapped geo-locations together into city-size areas.
Practical results:
Maps to metropolitan-area size regions.
Learns official location names, as well as abbreviations, nicknames, etc.
Automatically identifies non-specific locations
Coverage is 60-70%, depending on domain. Remainder have non-specific locations or “tail” locations not covered in training set.
Location cluster                            Example members
New York                                    “NYC”, “Yonkers”, “manhattan,” “NY,NY”, “Nueva York”, “N Y C”, “The Big Apple”
Los Angeles                                 “Laguna beach”, “long beach”, “LosAngeles,CA”, “West Los Angeles, CA”, “Downtown Los Angeles”, “LAX”
Filtered out due to ambiguity (large area)  “World”, “everywhere”, “USA”, “California”, …
1. Use geo-tagged tweets.
- Most appropriate when you need fine-grained locations per tweet (e.g., user tracking)
- But the trade-off is that a very small % of tweets are geo-coded
2. Much recent research on location inference.
- State-of-the-art uses textual references to known locations to identify user location.
This mapping technique is a little coarser-grained, but simpler.
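A minimal sketch of learning such a mapping from geocoded tweets; the clustering and ambiguity tests here are simplified stand-ins for the approach described above:

```python
from collections import defaultdict

def learn_location_map(geocoded, min_count=5, max_spread_deg=2.0):
    """geocoded: iterable of (profile_location_string, lat, lon).
    Returns profile-location text -> approximate metro-area centroid."""
    points = defaultdict(list)
    for text, lat, lon in geocoded:
        points[text.strip().lower()].append((lat, lon))
    mapping = {}
    for text, pts in points.items():
        if len(pts) < min_count:
            continue                      # "tail" location: too rare to trust
        lats, lons = zip(*pts)
        if max(lats) - min(lats) > max_spread_deg or \
           max(lons) - min(lons) > max_spread_deg:
            continue                      # non-specific ("USA", "everywhere")
        mapping[text] = (sum(lats) / len(pts), sum(lons) / len(pts))
    return mapping  # e.g., {"nyc": (40.7, -74.0), ...}
```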
Feature: Politician mention
Approach: Exact-match on well-known, unambiguous politician names.
Still needs:
* Domain classification and/or stronger entity linking to recognize ambiguous names. For example, “Mitt” is likely Mitt Romney in a political context, but not otherwise.
Key context is the tweet itself.
We will assume a relationship among features if they co-occur in the same tweet.
The relationship is stronger if the features co-occur across many tweets.
We extract two sets of relationships:
1) Politician mentions per Day:
Strength of relationship indicates volume of discussion about a given politician on a given day
Discussion context summarizes gender and location for each day.
2) Politician mentions over all time:
Discussion context summarizes gender and location over all time
[Chart: gender distribution (male/female) of authors of tweets mentioning Obama, Aug 29 – Oct 13, 2012]
Gender distribution equalizes during high-volume events like the DNC
Geographic distribution of tweets mentioning Obama during the 2012 elections:

Metro-area        Tweets    % of tweets   Actual population
New York, NY      141,878   10%           22,000,000
Washington, DC    135,347    9%            8,500,000
Los Angeles        68,676    5%           12,800,000
Chicago            47,130    3%            9,800,000
Atlanta, GA        45,475    3%            5,200,000
Houston, TX        35,956    2%            2,100,000
Boston, MA         34,363    2%            7,600,000
Part I: Introduction and conceptual framework
1. Introduction and preliminaries
2. Basic model for interaction through social media
3. A processing pipeline for analyzing social media
Part II: Case Studies
4. Social responses and engagement on So.Cl
5. Population biases in political tweets
6. Studying self-reporting bias by comparing tweet rates to ground-truth
7. Annotating graph structures with discussion context to interpret high-level graph analysis results
Background:
Frequency of discussion about events does not directly reflect real-world frequency of occurrence.
We may assume that bias is constant for a given kind of event, but not across different kinds of events.
We can make few inferences about the relationship between distinct events through social media analysis.
Study:
Compare tweet rates about weather to ground-truth data about weather
Why this case study:
Easy example of domain identification and ambiguity resolution in the cleaning stage
Good illustration of self-reporting bias
[Chart: weather-related tweet rate (log scale) and temperature, with spikes labeled “Hottest day” and “Thunderstorm”]
Weather-related tweet rate and temperature in San Diego, CA from Sep. 1 – Oct. 15, 2010
Collected 12 months of tweets that mentioned weather-related words (e.g., “rain”, “snow”, “sun”, “heat”, …)
Word list built by hand from weather glossaries, dictionaries, etc.
Example tweets — only some are actually about the weather:
Woke up to a sunny 63F (17C) morning. It's going to be a good day :)
The rainy season has started.
The inside of our house looks like a tornado came through it.
Japan, Germany hail U.N. Iran sanctions resolution
Used a language-based classifier, with a simple Bayes model:

P(weather | tweet) = (1/|T|) · Σ_{t∈T} P(weather | t)

where T is the set of features (all pairs of co-occurring words within a tweet, regardless of order), and

P(weather | t) = (1 + C_weather(t)) / (1 + C(t))

where C_weather(t) counts occurrences of feature t in weather-labeled tweets and C(t) counts its occurrences overall.
Also:
• Simple stemming of words – remove ‘-s’ and ‘-ing’ suffixes
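A minimal sketch of this classifier, mirroring the formula above; the stemming rule and the two labeled tweets are toy stand-ins for the 2k-tweet training set:

```python
from collections import Counter
from itertools import combinations

def stem(w):
    # crude stemming per the slide: strip '-ing' and '-s' suffixes
    for suffix in ("ing", "s"):
        if w.endswith(suffix):
            w = w[: -len(suffix)]
    return w

def features(tweet):
    words = sorted({stem(w) for w in tweet.lower().split()})
    return list(combinations(words, 2))  # unordered co-occurring word pairs

# Toy labeled data standing in for the hand-labeled gold set.
labeled = [("woke up to a sunny morning", True),
           ("hail the new album", False)]

c_weather, c_all = Counter(), Counter()
for tweet, is_weather in labeled:
    for t in features(tweet):
        c_all[t] += 1
        if is_weather:
            c_weather[t] += 1

def p_weather(tweet):
    feats = features(tweet)
    # mean over features of (1 + C_weather) / (1 + C), as in the slide
    return sum((1 + c_weather[t]) / (1 + c_all[t]) for t in feats) / len(feats)

print(p_weather("a sunny morning walk"))    # high: pairs seen in weather tweets
print(p_weather("hail the new sanctions"))  # lower: 'hail' pairs seen in non-weather text
```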
Labeled 2000 tweets manually (2 labelers) to create a “gold” training/test set
What were the challenges of labeling? Mainly maintaining strong, consistent criteria.
For example: “incidental” mentions of the weather?
Or mentions of the weather someplace else?
Slightly less complicated:
Mentions of the weather in proverbs (‘when it rains it pours’)
Results:
Classifier F-Score of 0.83, with a precision of 0.80 and recall of 0.85
Is this good? In general, the precision/recall will depend heavily on the domain and the collection criteria for tweets.
Feature: Location
Extract as described in politics case study
Add derived weather features from external (non-social) data:
Extremeness
Expectation
Change
Calculated based on the nearest weather station to the median location within the metropolitan area
12 months of tweets, June 2010 – June 2011
130M tweets include a weather-related word (179 words from weather glossaries, etc.)
71M tweets pass a Bayesian classifier (trained on 2k labeled tweets)
8M tweets geo-located to 56 US cities (using geo-tagged tweets to learn a mapping from profile locations)
Key context in this case is the location-day pair.
This also defines the core relationship.
We are most interested in the count of tweets per location-day and the weather features per location-day…
Linear regression on derived features with L2 regularization.
Model Features        Global R² Correlation   Local R² Correlation
Basic Weather                0.30                   0.70
Expectation + Basic          0.45                   0.71
Change + Basic               0.33                   0.40
Extreme + Basic              0.35                   0.70
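A minimal sketch of the regression setup using scikit-learn's Ridge (L2-regularized) regression; the feature values below are random stand-ins for the real location-day data:

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
n_days = 200
X = np.column_stack([
    rng.normal(size=n_days),   # basic weather (e.g., temperature)
    rng.normal(size=n_days),   # expectation (deviation from seasonal norm)
    rng.normal(size=n_days),   # change (difference from yesterday)
    rng.normal(size=n_days),   # extremeness (percentile of historic range)
])
# Synthetic target: tweet rate driven mostly by basic weather and extremeness.
y = 0.5 * X[:, 0] + 1.5 * X[:, 3] + rng.normal(scale=0.3, size=n_days)

model = Ridge(alpha=1.0).fit(X, y)
print("R^2:", model.score(X, y))
```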
[Chart, 0–100% axis: Extreme 98.2%; remaining bars labeled Basic, Expectation, Change: 85.7%, 66.1%, 57.1%]
Part I: Introduction and conceptual framework
1. Introduction and preliminaries
2. Basic model for interaction through social media
3. A processing pipeline for analyzing social media
Part II: Case Studies
4. Social responses and engagement on So.Cl
5. Population biases in political tweets
6. Studying self-reporting bias by comparing tweet rates to ground-truth
7. Annotating graph structures with discussion context to interpret high-level graph analysis results
Question: Given a set of related locations inferred from social media, what can we tell about why they are related?
Why this case study?
Introduction to higher-level analyses and using context to interpret them.
Extracting features:
Activities: Exact match on activity names derived from search queries
Locations: Exact match on unambiguous location names from Wikipedia articles
Key Context == Tweet
Extract Core Relationships: Locations
[Graph: two pseudo-cliques — NYC tourist locations, and NYC “midtown worker” locations]
Discussion context of the two pseudo-cliques:

                     New York Tourist   Midtown Worker
Gender    Male             49%               63%
          Female           33%               23%
Metroarea NYC              33%               54%
          Other            67%               46%
Mood      Joviality        56%               49%
          Fear             14%               13%
          Sadness          11%               15%
          Guilt             8%                6%
          Fatigue           3%                6%
          Serenity          3%                4%
          Hostility         2%                4%
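A minimal sketch of this kind of analysis using networkx: find pseudo-cliques (here, simply maximal cliques) among locations and annotate each with the distribution of a context feature; the data below is made up:

```python
import networkx as nx

G = nx.Graph()
# Edges between locations; each edge carries the genders of the tweet
# authors whose co-mentions created it.
edges = [("Times Square", "Empire State", {"genders": ["f", "f", "m"]}),
         ("Empire State", "Central Park", {"genders": ["f", "m"]}),
         ("Times Square", "Central Park", {"genders": ["f"]})]
G.add_edges_from(edges)

for clique in nx.find_cliques(G):          # maximal cliques
    if len(clique) < 3:
        continue
    genders = [g for u, v, data in G.subgraph(clique).edges(data=True)
               for g in data["genders"]]
    pct_f = 100.0 * genders.count("f") / len(genders)
    print(clique, f"{pct_f:.0f}% female")
```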
Part I: Introduction and conceptual framework
1. Introduction and preliminaries
2. Basic model for interaction through social media
3. A processing pipeline for analyzing social media
Part II: Case Studies
4. Social responses and engagement on So.Cl
5. Population biases in political tweets
6. Studying self-reporting bias by comparing tweet rates to ground-truth
7. Annotating graph structures with discussion context to interpret high-level graph analysis results
[Recap diagram: Bob sends Alice a message]
Bob sends Alice a message m: argmin_{m∈M} |E* − F̂(⟨Bob, Alice⟩, m, e)|
[Recap: the processing pipeline — Collect Raw Social Media → Feature Extraction → Define the Key Context → Extract Core Relationships — illustrated with the features of Alice’s Tiger Mountain tweet (Location: Tiger Mountain, Mood: Happy, Activity: Hiking, Name: Alice, Gender: Female, Post Time: Mon 10am, Activity Time: {Sat–Sun}), followed by higher-level graph and machine learning analyses on the combined structure and context]
Social responses and engagement on So.Cl
Clear & simple study of user interactions
Population biases in political tweets
Extracting basic features from tweets
Demonstrates complexity of population biases
Studying self-reporting bias by comparing tweet rates to ground-truth
Example of building a domain classifier
Methodology to study self-reporting bias
Annotating graph structures with discussion context to interpret high-level graph analysis results
Applies higher-level graph analyses to graphs of discussion topics
Shows how discussion context can be useful at different layers
Social media data provides a fine-grained and large-scale representation of people’s discussions and interactions with each other.
Extract information about the real-world
Study people’s interactions with each other
How system design influences those interactions
But be careful, social media is generated through a complicated system, and has many biases!
E-mail: Emre Kiciman emrek@microsoft.com
http://research.Microsoft.com/~emrek/
Selected Dataset resources
So.cl dataset: http://fuse.microsoft.com/research/srd
ICWSM Datasets http://icwsm.org/2013/datasets/datasets/
MyPersonality Project: http://mypersonality.org
Part I: Introduction and conceptual framework
1.
Introduction and preliminaries
2.
Basic model for interaction through social media
3.
A processing pipeline for analyzing social media
Part II: Case Studies
4.
Social responses and engagement on So.Cl
5.
Population biases in political tweets
6.
Studying self-reporting bias by comparing tweet rates to ground-truth
7.
Annotating graph structures with discussion context to interpret high-level graph analysis results
8.
(Bonus) Statistical language modeling to analyze language differences across user populations
What we’re doing:
Build and compare language models of Tweets, conditioned on various metadata features such as geography and number of followers.
Why we’re doing it:
1. It’s interesting in its own right to find and quantify the differences in style and topic among different groups of users.
2. Analysis and information extraction from tweets is important. More accurate language models may improve algorithms for word segmentation, NER, …
Class              Explicit Signals            Inferred Signals
Geography          Time zone                   User-reported location
                   GPS coordinates
User Metadata      Number of followers         Gender
                   Number followed             Interests
                   Total tweets count
                   Age of account
Message Metadata   Message length              #Topic
                   Retweet                     Well-capitalized
                   Contains URL
                   Number of user references
                   Time of day
Data
72M Tweets gathered over ~3 days
90% training; 10% test
Focus on English tweets in these experiments
Partition by metadata feature
E.g., group messages by whether they contain a link
Build 1- to 3-gram LM per partition
Smoothed LMs with closed vocabulary
Cross-entropy among all partitions
Analyze differences in term-likelihoods among LMs
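A minimal sketch of the method: per-partition bigram language models with add-one smoothing over a closed vocabulary, compared by cross-entropy (perplexity is the exponential of cross-entropy); the corpora here are toy stand-ins:

```python
import math
from collections import Counter

def bigrams(tokens):
    return list(zip(["<s>"] + tokens, tokens + ["</s>"]))

class BigramLM:
    def __init__(self, corpus, vocab_size):
        self.bi, self.uni = Counter(), Counter()
        self.v = vocab_size
        for tokens in corpus:
            for a, b in bigrams(tokens):
                self.bi[(a, b)] += 1
                self.uni[a] += 1

    def logp(self, a, b):
        # add-one smoothed log P(b | a) over a closed vocabulary
        return math.log((self.bi[(a, b)] + 1) / (self.uni[a] + self.v))

def cross_entropy(lm, corpus):
    pairs = [p for tokens in corpus for p in bigrams(tokens)]
    return -sum(lm.logp(a, b) for a, b in pairs) / len(pairs)

east = [["good", "morning", "new", "york"]]
west = [["good", "morning", "los", "angeles"]]
vocab = {w for c in (east, west) for t in c for w in t} | {"<s>", "</s>"}
lm_east = BigramLM(east, len(vocab))
# Held-out text from the same partition scores lower cross-entropy.
print(cross_entropy(lm_east, east), cross_entropy(lm_east, west))
```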
Perplexity of bi-gram models learned for each time zone with respect to the others (columns in the same order as rows):

           Hawaii Alaska Pacific Mountain Central Eastern Quito Brasilia Greenland London Jakarta Osaka Tokyo
Hawaii       1573   3078    3623     2795    3018    3294  3094     5591      4228   2027    6623  5051  3529
Alaska       3506   1500    3238     2641    2866    3182  2892    11005      6496   2907    6004 12610  6477
Pacific      2775   1894    1303     1825    2040    2222  2226    11676      6501   2493    2769 11591  5611
Mountain     4619   4379    5263     1360    2362    2742  2824    13384      7465   2874   17897 13453  7023
Central      4941   4655    5969     1774    1185    2009  1838    13244      7368   2695   24610 14107  6740
Eastern      5586   5208    7244     2053    1943    1216  1767    15560      8475   2648   31850 14535  6953
Quito        5042   4689    6539     2324    2200    2241  1153     8234      6061   2810   26049 13806  7197
Brasilia     8063   8279   10229     5674    6230    6528  6666      724      5810   4909   28775 11331  7465
Greenland    4437   4776    5966     3642    4006    4170  4030     1932      1536   2868   14817 11179  5962
London       5013   5573    7160     3478    4065    4115  4266    10621      6561    917   21472 15561  7342
Jakarta      5631   4896    5494     5298    5761    6200  6138    17000      9690   4461    1338 12107  7407
Osaka        8276   8086    9359     6599    6944    7340  7252    16236     10461   5444   19994  1598  4495
Tokyo        5682   5546    6589     4521    5006    5043  5222     8904      6811   3635   13864  2386  1265
3 kinds of differences:
• Geographic locations
• Topic variance
• Dialect, spelling differences
Perplexity between bi-gram models partitioned by the author’s number of followers:

             0≤x<10   10≤x<100   100≤x<1000   x≥1000
0≤x<10          922       2413         4528     7831
10≤x<100       1166       1071         2477     4811
100≤x<1000     1682       1341         1216     2317
x≥1000         3345       2421         2804     1544
• Similar language models for <10, <100, <1000 followers
• Differences appear for authors with >1000 followers
[Chart, y-axis 0–1.4: values by number of followers — 0≤x<10, 10≤x<100, 100≤x<1000, x≥1000]