Analyzing Social Media Systems

CHI Course 2013

Emre Kıcıman, Shelly Farnham emrek@microsoft.com, shellyfa@microsoft.com

Douglas Wray http://instagr.am/p/nm695/ @ThreeShipsMedia

People talking and interacting.

Publicly. Or semi-publicly.

Often about the quotidian, but not necessarily.


Why?

Learn about the World

1. Observe

2. Tweet

But people are not perfect sensors, for many reasons.

Let’s Learn about Donuts

Where Do People Get Donuts?


1 week of tweets mentioning “donut” or “doughnut”: ~180k tweets during the week of Feb 6-12, 2012.

What Do People Drink with Donuts?


1 week of tweets mentioning “donut” or “doughnut”: ~180k tweets during the week of Feb 6-12, 2012.

What Kind of Donuts Do People Eat?


1 week of tweets mentioning “donut” or “doughnut”: ~180k tweets during the week of Feb 6-12, 2012.

Beyond Donuts…

Drugs, diseases, and contagions

“You Are What You Tweet: Analyzing Twitter for Public Health”

Paul and Dredze, 2011

Symptoms and medication usage, tracking illness over time, behavioral risk factors

“Predicting Disease Transmission from Geo-Tagged Micro-Blog Data”

Sadilek, Kautz and Silenzio, 2012

Study disease transmission in physical world based on location traces of sick & healthy people

Public Sentiment

Political and election indices, market insights

Everyday life

Why use social media?

Cross-domain / open-domain

Large-scale, fine-grained, naturalistic


Why?

Learn How People Interact with Each Other

What are common conventions in interactions

Ex. “Unsupervised Modeling of Twitter Conversations”, Ritter, Cherry and Dolan, 2010

How do people’s interactions impact each other? How do norms form?

Ex. “The Birth of Retweeting Conventions on Twitter”, Cha, Gummadi, Kooti, Mason and Yang, 2012

How do communities organize themselves?

Ex., social media usage in the context of war, disasters and crises. Starbird et al. 2010; Al-Ani, Mark & Semaan 2010; Monroy-Hernandez et al. 2012


Why?

Learn How System Influences People

Ex. “Feed Me: Motivating Newcomer Contribution in Social Network Sites,” Burke, Marlow and Lento, 2009.

Plus, a case study a bit later on return visits from first-time users

Recap: Social Media Analyses

Social media captures people talking and interacting with each other, publicly, on a wide variety of topics.

We would like to study it to learn about the world…

… about how people interact with each other

… and the role of the system in influencing these interactions

Preliminaries

Speaker Bio: Shelly Farnham

Specialize in social technologies

Social networks, community, identity, mobile social

Early stage innovation

Extremely rapid R&D cycle: study, brainstorm, design, prototype, deploy, evaluate (repeat)

Convergent evaluation methodologies: usage analysis, interviews, questionnaires

Career

PhD in Social Psychology from UW

7 years Microsoft Research: Virtual Worlds, Social Computing, Community Technologies

4 years startup world: Waggle Labs (consulting), Pathable

2 Years Yahoo!

FUSE Labs, Microsoft Research

Personal Map

Speaker Bio: Emre Kıcıman

Specialize in Social Data Analytics

Social media, analytics and search

Focus:

1. Improving our analytical capabilities

2. Extracting information about the world from social media

3. Reasoning about social media biases, reinforcing useful signals

Career

Ph.D. in computer science from Stanford University, ’05

7 years at Internet Services Research Center, Microsoft Research

Outline

Part I: Introduction and conceptual framework

1. Introduction and preliminaries

2. Basic model for interaction through social media

3. A processing pipeline for analyzing social media

Part II: Case Studies

4. Social responses and engagement on So.Cl

5. Population biases in political tweets

6. Studying self-reporting bias by comparing tweet rates to ground-truth

7. Annotating graph structures with discussion context to interpret high-level graph analysis results

Outline

Part I: Introduction and conceptual framework

1. Introduction and preliminaries

2. Basic model for interaction through social media

3. A processing pipeline for analyzing social media

Part II: Case Studies

4. Social responses and engagement on So.Cl

5. Population biases in political tweets

6. Studying self-reporting bias by comparing tweet rates to ground-truth

7. Annotating graph structures with discussion context to interpret high-level graph analysis results

Basic Model of Social Media Interaction

Message from Bob

Alice

Effect

* Wow, Bob is ______!

* I should donate money to _____!

* Write back to Bob....

* Forward Bob’s message to others…

Effect depends on many factors:

* Relationship <Alice, Bob>

* Content of the message

* Alice’s environment:

-> social context

-> current tasks

From Bob’s Point-of-View

[Diagram: Bob considers candidate messages to Alice (Message1? Message2? Message3?) but cannot directly observe their effects.]

Bob needs feedback!

A bit more formally…

Assumption: A source writes a message to have an effect on a recipient

Effect might be to project a persona; to elicit engagement or action; build social capital [cite communications theory; social capital; etc]

(Of course, this is not always true; sometimes messages are written for effect on self. E.g., cathartic writing)

A bit more formally…

Let effect be a function of relationship, message and context:

E = F((s, r), m, e)

(s, r) represents the relationship between a source and recipient. E.g., “close friends”, “authority/expert”

m represents the message content and style

e represents the environment in which the message is received. E.g., Facebook or LinkedIn. Also includes broader social norms, etc.

A bit more formally…

Then a rational source, trying to achieve an effect E*, will select messages based on:

argmin_{m ∈ M} |E* − F̂((s, r), m, e)|

where s, r and e are fixed and F̂ is the source’s approximation of F.

More complex versions take into account multiple recipients, thresholds of cost & utility, etc.
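As a compact restatement of the model on the last two slides (a minimal LaTeX sketch; notation exactly as defined above, with F̂ the source’s approximation of F):

```latex
\[
E = F\bigl((s, r),\, m,\, e\bigr)
\qquad
m^{*} \;=\; \operatorname*{arg\,min}_{m \in M}\;
  \bigl|\, E^{*} - \hat{F}\bigl((s, r),\, m,\, e\bigr) \,\bigr|
\]
```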

Messages about the real world…

Let’s assume for simplicity that style is constant, and content is what an author is actively choosing. Then, given a set of observations W about real-world events, the author chooses a w:

argmin_{w ∈ W} |E* − F̂((s, r), w, e)|

If no w achieves goal E* within some threshold t, the author writes nothing.

The weather bias case study is essentially investigating the relationship between features of w and whether the achievable effect clears that threshold t.

Bob is not the only Actor

Messaging is not the only Action


[Diagrams: Bob is not the only actor affecting Alice – Charlie and Justin also send messages and take other actions, each with their own effects; and the message between Bob and Alice is mediated by the Social Media System.]

Social Media System Role

Alice: E_Alice = F((Bob, Alice), m, e)

Bob: argmin_{m ∈ M} |E*_Bob − F̂((Bob, Alice), m, e)|

The social media system is trying to align parameters to achieve its own effect E*_Sys.

E.g., align relationships (s, r), the environment e, as well as the space of messages M that Bob selects from.

Reinforcing real-world signals

argmin_{w ∈ W} |E* − F̂((s, r), w, e)|

Ex. Reinforcing useful signals: if we want to increase the likelihood that world events W⁺ that we care about are reported, this model indicates several possible directions:

Investigate and optimize E*, (s, r) and e such that W⁺ is reported

Design feedback that improves the approximation F̂ to reinforce W⁺

Recap & How might this model help?

Basic model of social media system interactions:

• Messages have an effect on recipients, conditioned on various factors

• Authors choose their messages to have some desired effect

• Social media systems today can (and do) play an active role

Having this generation process in the back of our minds helps us recognize biases and limitations of the data

Having a sense of the basic knobs can help us think about how to improve social media systems

Outline

Part I: Introduction and conceptual framework

1. Introduction and preliminaries

2. Basic model for interaction through social media

3. A processing pipeline for analyzing social media

Part II: Case Studies

4. Social responses and engagement on So.Cl

5. Population biases in political tweets

6. Studying self-reporting bias by comparing tweet rates to ground-truth

7. Annotating graph structures with discussion context to interpret high-level graph analysis results

Some Basics:

Clarity:

* Be clear in your purpose and the real-world problem you wish to address

* State your question as a hypothesis

Testability:

* In some cases, the hypothesis is tested through the social data

* Other times validation must lie outside the social data

Passive and active experiments:

* If we look at questions of causality, we often need to do active experimentation

Data Analysis is one tool of many:

* User surveys, interviews, mockups, prototypes, etc.

Defining Research Question

The amount of data is overwhelming – the more defined your question, the easier the analysis

What real world problem are you trying to explore?

Avoid pitfall of technology for technology’s sake

What argument do you want to be able to make?

State your problem as a hypothesis

Introducing a Running Example

Studying the relationship between activities and locations

What do people do? And where do they do it?

(Note: the research question here is about the accuracy of results, and their use in applications)

Processing Pipeline

Collect Content & Interactions → Cleaning and Feature Extraction → Define the Key Context → Extract Core Relationships

Followed by higher-level statistical, graph and machine learning analyses on the combined structure and context…

Social Media Processing Pipeline

Collect Raw Social Media → Feature Extraction → Define the Key Context → Extract Core Relationships

Example: “I had fun hiking Tiger Mountain last weekend” – Alice said on Monday, at 10am

Extracted features: Location: Tiger Mountain; Mood: Happy; Activity: Hiking; Name: Alice; Gender: Female; Post Time: Mon 10am; Activity Time: {Sat-Sun}

Core relationship: Location: Tiger Mountain ↔ Activity: Hiking

Followed by higher-level graph and machine learning analyses on the combined structure and context…

1. Collection

Instrumentation

Avoid tendency to collect everything without organization

Validate logging -> untested instrumentation is prone towards bugs

Design for key scenarios: Make it easy to get data for key questions up front

Streaming and Search APIs

Easy to use. Appropriate for many experiments

Often rate-limited, but can build large-scale data over time

Crawling

More effort, but can grab historical data

Some sites will block

In all cases, do consider user privacy and expectations.
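As a concrete illustration of API-based collection under rate limits, here is a minimal Python sketch. `fetch_page` is a hypothetical stand-in for whatever Streaming/Search API client you use (assumed to return a page of items plus a continuation cursor); the 180-requests-per-15-minute window is only an example.

```python
import time

def collect(query, fetch_page, max_requests=180, window_secs=900):
    """Rate-limited collection loop; accumulates large-scale data over time."""
    items, cursor = [], None
    while True:
        for _ in range(max_requests):
            page, cursor = fetch_page(query, cursor)  # hypothetical API call
            items.extend(page)
            if cursor is None:          # no further results available
                return items
        time.sleep(window_secs)         # wait out the API's rate-limit window
```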

To consider, during collection and signal extraction:

Filters

Time span, type of person, type of actions

Sampling

Random selection

Snowball sampling, to get a complete picture of a person’s social experience

Consider your research questions, how you want to generalize

2. Cleaning & Feature Extraction

Clean once: removing irrelevant raw data (depends on your research question)

Spammy users, people who were never active

Geographic or temporal filtering

When you remove a user, message, or action, think about whether to remove associated data

(e.g., might want to keep a spammy user’s interactions with other, non-outlier users)

Feature extraction:

Entity recognition from text

User classification, …

(Implication: keep both the raw data and the feature results)

This is also a good stage to bring in external data…

Clean again:

Look for outliers and remove feature values that are not dependable

Keep samples of raw data for distinct feature values to make inspection easier
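A minimal Python sketch of the two cleaning passes above (the spam heuristic and record layout are illustrative assumptions):

```python
def clean(messages, users, is_spammy, min_actions=1):
    """First pass: drop spammy and never-active users' own content,
    but keep their interactions with other, non-outlier users."""
    spammers = {u["id"] for u in users if is_spammy(u)}
    inactive = {u["id"] for u in users if u["action_count"] < min_actions}
    # Interactions *received* by removed users (e.g., replies to a spammer
    # authored by normal users) survive, since we filter only by author.
    return [m for m in messages
            if m["author_id"] not in spammers | inactive]

def clean_features(records, is_dependable):
    """Second pass: drop undependable feature values, keeping one raw-data
    sample per distinct feature value to make inspection easier."""
    sample, kept = {}, []
    for r in records:
        if not is_dependable(r["feature_value"]):
            continue
        sample.setdefault(r["feature_value"], r["raw_text"])
        kept.append(r)
    return kept, sample
```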

3. Pick defining context of relationships

After extracting features from our social media, we want to reason about the relationships among these features.

What defines a relationship?

One common choice: Context == Co-occurrence within the same message

(next slide)

“I had fun hiking Tiger Mountain last weekend” – Alice said on Monday, at 10am

[Diagram: features extracted from Alice’s tweet (Name: Alice; Gender: Female; Mood: Happy; Post Time: Mon 10am; Activity Time: {Sat-Sun}) and from a second tweet by Bob (Name: Bob; Gender: Male; Mood: Happy; Post Time: Fri 3pm) both link to the shared nodes Location: Tiger Mountain and Activity: Hiking.]

Other common choices

User as defining context

-> Two things are related if they are associated with the same user

-> Common in recommender systems

-> Ex. Livehoods study of neighborhood boundaries

Location as defining context

-> Two things (users, actions, …) are related if they co-occur at the same physical location

-> Ex. Sadilek’s study of disease transmission

4. Extracting Core Relationships

[Diagram: core relationship between Location: Tiger Mountain and Activity: Hiking, annotated with the distribution of other features (e.g., Gender: Male / Gender: Female) from the tweets in which they co-occur.]

• Focus on core relationships among domains of interest

• Strength defined by how frequently items co-occurred in the key context

• Statistical distribution of other features annotates core relationships

• Iterate on “core relationships”
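A minimal Python sketch of this step, using co-occurrence within a message as the key context (field names and the choice of “core” feature types are illustrative):

```python
from collections import Counter, defaultdict
from itertools import combinations

CORE_TYPES = ("location", "activity")   # domains of interest (assumed)

edge_strength = Counter()               # co-occurrence counts in the key context
edge_context = defaultdict(Counter)     # distributions annotating each edge

def add_message(features):
    """features: e.g. {"location": "Tiger Mountain", "activity": "hiking",
    "gender": "f", ...} extracted from one message (the key context)."""
    core = sorted((k, v) for k, v in features.items() if k in CORE_TYPES)
    other = [(k, v) for k, v in features.items() if k not in CORE_TYPES]
    for edge in combinations(core, 2):
        edge_strength[edge] += 1        # strength = co-occurrence frequency
        for kv in other:                # annotate with other features' values
            edge_context[edge][kv] += 1
```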

Higher-level algorithms

Statistical tests

Ex. Test that relationships are statistically significant

Ex. Test that two items are statistically different from one another

Graph analyses

Ex. Clique finding, graph clustering, path algorithms, network centrality, ….

Machine learning

Ex. Classifiers, clustering, etc., based on graph relationships or annotations

Some usage scenarios

Build “profiles” of things based on the words and sentiment used

Build demographics of places and concepts based on who is talking about them

Build co-mention graph among entities, people, places, etc.

Build profiles of users based on what they talk about and how they express themselves

Include “time” in the projection, and see how profiles change over time

Outline

Part I: Introduction and conceptual framework

1. Introduction and preliminaries

2. Basic model for interaction through social media

3. A processing pipeline for analyzing social media

Part II: Case Studies

4. Social responses and engagement on So.Cl

5. Population biases in political tweets

6. Studying self-reporting bias by comparing tweet rates to ground-truth

7. Annotating graph structures with discussion context to interpret high-level graph analysis results

Selection of Case Studies

Social responses and engagement on So.Cl

Clear & simple study of user interactions

Population biases in political tweets

Extracting basic features from tweets

Demonstrates complexity of population biases

Studying self-reporting bias by comparing tweet rates to ground-truth

Example of building a domain classifier

Methodology for studying self-reporting bias

Annotating graph structures with discussion context to interpret high-level graph analysis results

Applies higher-level graph analyses to graphs of discussion topics

Shows how discussion context can be useful at different layers

Outline

Part I: Introduction and conceptual framework

1. Introduction and preliminaries

2. Basic model for interaction through social media

3. A processing pipeline for analyzing social media

Part II: Case Studies

4. Social responses and engagement on So.Cl

5. Population biases in political tweets

6. Studying self-reporting bias by comparing tweet rates to ground-truth

7. Annotating graph structures with discussion context to interpret high-level graph analysis results

8. (Bonus) Statistical language modeling to analyze language differences across user populations

Case Study: Usage analysis of So.Cl

So.cl is an experimental web site that allows people to connect around their interests by integrating search tools with social networking.

Study: How important are social interactions in encouraging users to become engaged with an interest network?

We’ll start with some background on So.Cl, and then dive into the case study.

SO.CL

Reimagining search as social from the ground up: search + sharing + networking = informal discovery and learning

History:

Oct 2011: Pre-release deployment study

Dec 2011: Private, invitation-only beta

May 2012: Removed invitation restrictions

Nov 2012: Over 300K registered users, 13K active per month

Try it now! http://www.so.cl

So.Cl as Interest Network

• Find others around common interests

• Be inspired by new interests

• Learn from each other through these shared interests

How It Works

[Screenshots: the So.cl feed, feed filters, people list, search (Bing), post builder, and filtered results.]

Experience:

Step 1: Perform a search

Step 2: Click on items in results to add to post

Step 3: Add a message

Step 4: Tag

Try it now! http://www.so.cl – use the facsumm tag

So.Cl as Research Platform

[Diagram: So.cl features spanning social networking and search/discovery: Connect (follow, simple profiles, wall messages, people list), Collaborate (liking, commenting, riffing), Collect/Create/Consume (interests, visual stream, post builder, add links, video parties), Explore (explore page, interest pages, search interests) – all aimed at increasing engagement, community, learning, and innovation.]

So.cl Research Dataset Program

Access to public so.cl behavioral data for research purposes

Foster research in interest networking, social search, and community development: http://fuse.microsoft.com/research/srd

Case Study:

Hypothesis:

If people receive a social response when they first join So.cl they are more likely to become engaged.

Measuring social/behavioral constructs:

When first join

First session = time of first action to time of last action prior to an hour of inactivity

Social responses

Follows user, likes user’s post(s), comments on user’s post(s)

Engagement = coming back

A second session = any action occurring 60 minutes or more after the first session

Restating the hypothesis:

If people receive follows, likes, and comments in their first session, they are more likely to come back for a second session
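A minimal Python sketch of these constructs (sessionization with a 60-minute inactivity gap; names are illustrative):

```python
from datetime import timedelta

SESSION_GAP = timedelta(minutes=60)

def sessions(action_times):
    """Split a user's action timestamps into sessions separated by
    60+ minutes of inactivity."""
    if not action_times:
        return []
    times = sorted(action_times)
    out, current = [], [times[0]]
    for t in times[1:]:
        if t - current[-1] >= SESSION_GAP:
            out.append(current)
            current = [t]
        else:
            current.append(t)
    out.append(current)
    return out

def came_back(action_times):
    """Engagement = any action 60+ minutes after the first session ends."""
    return len(sessions(action_times)) >= 2
```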



Simple, common instrumentation schema, kept in a database

Users table: Row per user

Include creation time and other metadata

Content table: Row per content item; includes text, URLs, etc.

Actions table: Row per action

Filter out non-meaningful, non-user generated actions

Actions capture user interactions and context
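A minimal sketch of this three-table schema in Python/SQLite (column names are illustrative assumptions; the slide specifies only the users/content/actions shape):

```python
import sqlite3

conn = sqlite3.connect("social_media.db")
conn.executescript("""
CREATE TABLE IF NOT EXISTS users (
    user_id     INTEGER PRIMARY KEY,
    created_at  TEXT,                 -- creation time
    metadata    TEXT                  -- other per-user metadata
);
CREATE TABLE IF NOT EXISTS content (
    content_id  INTEGER PRIMARY KEY,
    author_id   INTEGER REFERENCES users(user_id),
    text        TEXT,
    urls        TEXT
);
CREATE TABLE IF NOT EXISTS actions (
    action_id   INTEGER PRIMARY KEY,
    user_id     INTEGER REFERENCES users(user_id),
    action_type TEXT,                 -- e.g. 'sign_in', 'post', 'like', 'comment'
    target_id   INTEGER,              -- content or user acted upon (context)
    occurred_at TEXT
);
""")
```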


Always look at your raw data: play with it, ask yourself if it makes sense, test!


Filters

Time span, type of person, type of actions

Sampling

Random selection

Snowball sampling, to get a complete picture of a person’s social experience

Consider your research questions, how you want to generalize


Filtered out administrators/community managers

New users only

Date range: Sept 28 to Oct 13

100% sample for that time span: 2462 people

SYSTEMATIC BIASES IN SOCIAL SYSTEMS #1

If you want to understand your “typical” users, keep in mind that you will generally find:

Large percent never become active or return

-“kicking the tires” unduly biases averages

Common reporting format:

X% performed Y behavior; of those, they averaged Z times each

E.g., 5% commented on a post in their first session, averaging 5 comments each


OUTLIERS: Filtered out 13 outlier users with z > 4 on number of actions (among those who did more than sign in)
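A minimal Python sketch of this outlier filter (the "more than sign in" eligibility rule is an assumption rendered as min_actions):

```python
import statistics

def drop_outliers(action_counts, z_cutoff=4.0, min_actions=2):
    """action_counts: {user_id: number_of_actions}. Drops users whose
    activity is more than z_cutoff standard deviations above the mean."""
    eligible = [n for n in action_counts.values() if n >= min_actions]
    mean = statistics.mean(eligible)
    sd = statistics.pstdev(eligible)
    return {u: n for u, n in action_counts.items()
            if sd == 0 or (n - mean) / sd <= z_cutoff}
```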

SYSTEMATIC BIASES IN SOCIAL SYSTEMS #2

A small percentage are “hyper-active” users – avid users, spammers, trolls, administrators – who can unduly bias averages

Remove outliers

A substantial percentage are consumers but not producers (“lurkers”); there is often no signal for lurkers

So.cl has about 75% lurkers

Custom instrumentation logs sign-ins; web analytics capture clicks


• Very important to spend time examining data

• Descriptives, frequencies, correlations, graphs

• Use a tool that easily generates graphs and correlations

• Does it make sense? If not, really chase it down. It is often a bug or a misinterpretation of the data.



Feature: Active Sessions

Active session = a period of (public) activity, with a 60-minute gap of no activity before or after

91% of users had only one active session

Sessions were, on average, 34.6 hours apart

First sessions lasted 1.6 minutes on average

Feature: User Actions

[Charts: actions in first session; number of posts in first session.]

8% created a post in their first session; of those, they averaged 1.5 posts each


Feature: Coming back

9.1% came back for another active session (~25% including inactive sessions)

On average, 35 hours later



Aggregation: merging down for summarization

What is your level of analysis?

Person, group, network

Content types

If person is the unit of analysis, aggregate measures to the person level

E.g., in SPSS: one line per person. It is very important to use the appropriate unit of analysis, to avoid bias in statistics

AGGREGATIONS

[Screenshot: SPSS syntax aggregating action-level records to one line per person.]
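The same aggregation in Python/pandas rather than SPSS (a sketch; the action-level column names are assumptions):

```python
import pandas as pd

actions = pd.read_csv("actions.csv")   # one row per action

# Merge down to the unit of analysis: one line per person.
per_person = actions.groupby("user_id").agg(
    n_actions=("action_type", "size"),
    n_posts=("action_type", lambda s: (s == "post").sum()),
    n_sessions=("session_id", "nunique"),
)
per_person["came_back"] = per_person["n_sessions"] >= 2
```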

PRELIMINARY CORRELATIONS

Always ask, does this pattern make sense?

IN THE FIRST SESSION

How often is a user the target of social behavior?

23% received some response up to their 2nd session

-> 3% if they did not create a post; 37% if they did create a post

[Charts: responses *during* the first session vs. responses *in between* the 1st and 2nd sessions]

PREDICTORS OF COMING BACK

Social responses inspire people to return to the site, especially if they occur during the first session

[Chart: return rates with and without a response, split by whether the user created a post; N = 2273, 179, 1942, 510]

Social responses to a user: following, commenting on a post, liking a post, liking a comment, riffing

WHICH RESPONSE MATTERS

Logistic Regression: Any Response Predicts Coming Back

Predictor | B | S.E. | Sig.
Created post first session | .71 | .20 | .000
Response1: during first session | 1.12 | .21 | .000
Response2: after first session | .60 | .17 | .000

Logistic Regression: Which Response Predicts Coming Back

Predictor | B | Sig.
Created post first session | .95 | .000
Followed | .92 | .003
Commented on | .38 | ns
Post liked | .87 | .02
Comment liked | -.09 | ns
Messaged | -.09 | ns
Riffed | .00 | ns

IDENTIFYING SUBGROUPS

Factor Analysis for Associated Behaviors: three types of usage – creating, socializing, browsing

[Table: component matrix (principal components, varimax rotation, meaning components are forced to be orthogonal) over first-session behaviors – created post, invited, followed, added item to post, searched, commented, liked post, liked comment, messaged, viewed person, navigated to All, joined party – yielding three components: Creators (32% of variance), Socialites (12%), Browsers (9%).]

Factors about equally predict whether a user comes back

Regression Coefficients (predicting return):
Creating: Beta .14, t 5.28, Sig .000
Socializing: Beta .07, t 2.61, Sig .000
Browsing: Beta .19, t 7.20, Sig .000

Browsing is a stronger predictor of overall activity level

Regression Coefficients (predicting activity level):
Creating: Beta 0.20, t 7.89, Sig 0.00
Socializing: Beta 0.17, t 6.58, Sig 0.00
Browsing: Beta 0.29, t 9.07, Sig 0.00

Outline

Part I: Introduction and conceptual framework

1. Introduction and preliminaries

2. Basic model for interaction through social media

3. A processing pipeline for analyzing social media

Part II: Case Studies

4. Social responses and engagement on So.Cl

5. Population biases in political tweets

6. Studying self-reporting bias by comparing tweet rates to ground-truth

7. Annotating graph structures with discussion context to interpret high-level graph analysis results

Case Study: Population biases in political tweets

There was a significant amount of political discussion on Twitter during the US election season in Summer/Fall 2012.

Case Study: Is the population of tweeters representative of the US population along two demographic axes: gender and geography?

Why this case study?

Good illustration of simple extractors for gender, location, and simple methods for identifying topics.

More fundamentally, highlights challenges of dealing with population biases


Collected all tweets during August – November, 2012 that mentioned “Obama”, “Romney” or other politician names

Inspecting raw data:

* Removed some common names and issue phrases from collection



Feature: Gender

Simple gender classifier based on first name of Twitter user in profile

Approach: Look up first name in a weighted gender map built from census data and other sources.

Practical results:

Ad hoc inspection is positive

Coverage is 60-70%, depending on domain. Remainder are organizations and ambiguous names

Still requires:

Accuracy evaluation based on ground-truth data
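A minimal Python sketch of this classifier (the map entries and threshold are illustrative; a real map would be built from census name/gender frequencies and other sources):

```python
# P(female | first name), built offline from census data and other sources.
GENDER_MAP = {"emily": 0.99, "james": 0.01, "taylor": 0.55}

def classify_gender(profile_name, threshold=0.9):
    tokens = profile_name.strip().split()
    first = tokens[0].lower() if tokens else ""
    p_female = GENDER_MAP.get(first)
    if p_female is None:
        return None                  # organizations and unknown names
    if p_female >= threshold:
        return "f"
    if p_female <= 1 - threshold:
        return "m"
    return None                      # ambiguous names are left unlabeled
```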


Feature: Location

Map from self-declared user profile locations to lat-lon regions.

Approach: Use a mapping learned from the small % of tweets that are geocoded.

Cluster mapped geo-locations together into city-size areas.

Practical results:

Maps to metropolitan-area size regions.

Learns official location names, as well as abbreviations, nicknames, etc.

Automatically identifies non-specific locations

Coverage is 60-70%, depending on domain. Remainder have non-specific locations or “tail” locations not covered in training set.
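A minimal Python sketch of learning this mapping from the geo-coded subset (the thresholds and the spread heuristic are illustrative simplifications; the case study clustered locations into city-sized areas):

```python
from collections import defaultdict

def normalize(s):
    return " ".join(s.lower().split())

def spread(points):
    lats = [p[0] for p in points]
    lons = [p[1] for p in points]
    return max(max(lats) - min(lats), max(lons) - min(lons))

def learn_location_map(geocoded_tweets, min_count=50, max_spread_deg=1.0):
    """geocoded_tweets: (profile_location, lat, lon) triples."""
    by_name = defaultdict(list)
    for name, lat, lon in geocoded_tweets:
        by_name[normalize(name)].append((lat, lon))
    mapping = {}
    for name, pts in by_name.items():
        if len(pts) < min_count:
            continue                    # "tail" locations: too little evidence
        if spread(pts) > max_spread_deg:
            continue                    # non-specific ("USA", "everywhere")
        mapping[name] = (sum(p[0] for p in pts) / len(pts),   # centroid of a
                         sum(p[1] for p in pts) / len(pts))   # metro-sized area
    return mapping
```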

Example results:

Location cluster: New York – “NYC”, “Yonkers”, “manhattan”, “NY,NY”, “Nueva York”, “N Y C”, “The Big Apple”

Location cluster: Los Angeles – “Laguna beach”, “long beach”, “LosAngeles,CA”, “West Los Angeles, CA”, “Downtown Los Angeles”, “LAX”

Filtered out due to ambiguity (large area): “World”, “everywhere”, “USA”, “California”, …

Location detection alternatives

1. Use geo-tagged tweets.

- Most appropriate when you need fine-grained locations per tweet (e.g., user tracking)

- But trade-off is that very small % of tweets are geo-coded

2. Much recent research on location inference.

- State-of-the-art uses textual references to known locations to identify user location.

This mapping technique is a little coarser-grained, but simpler.


Feature: Politician mention

Approach: Exact-match on well-known, unambiguous politician names.

Still needs:

* Domain classification and/or stronger entity linking to recognize ambiguous names. For example, “Mitt” is likely Mitt Romney in a political context, but not otherwise.


The key context is the tweet itself.

We will assume a relationship among features if they co-occur in the same tweet.

The relationship is stronger if the features co-occur across many tweets.


We extract two sets of relationships:

1) Politician mentions per day:

• Strength of relationship indicates volume of discussion about a given politician on a given day

• Discussion context summarizes gender and location for each day

2) Politician mentions over all time:

• Discussion context summarizes gender and location over all time

Gender Bias

[Chart: gender distribution (m/f) of authors of tweets mentioning Obama, Aug 29 - Oct 13, 2012.]

Gender distribution equalizes during high-volume events like the DNC

Geographic Bias in Political Tweets

Metro-area | Tweets | % of tweets | Actual population
New York, NY | 141878 | 10% | 22,000,000
Washington, DC | 135347 | 9% | 8,500,000
Los Angeles | 68676 | 5% | 12,800,000
Chicago | 47130 | 3% | 9,800,000
Atlanta, GA | 45475 | 3% | 5,200,000
Houston, TX | 35956 | 2% | 2,100,000
Boston, MA | 34363 | 2% | 7,600,000

* Geographic distribution of tweets mentioning Obama during the 2012 elections

[Chart: moods over time for tweets mentioning Obama]

Outline

Part I: Introduction and conceptual framework

1. Introduction and preliminaries

2. Basic model for interaction through social media

3. A processing pipeline for analyzing social media

Part II: Case Studies

4. Social responses and engagement on So.Cl

5. Population biases in political tweets

6. Studying self-reporting bias by comparing tweet rates to ground-truth

7. Annotating graph structures with discussion context to interpret high-level graph analysis results

Studying self-reporting bias by comparing tweet rates to ground-truth

Background:

• Frequency of discussion about events does not directly reflect the real-world frequency of occurrence.

• We may be able to assume that bias is constant for a given kind of event, but not across different kinds of events.

• As a result, we can make few inferences about the relationship between distinct events through social media analysis.

Study:

• Compare tweet rates about weather to ground-truth data about weather

Why this case study:

• Easy example of domain identification and ambiguity resolution in the cleaning stage

• Good illustration of self-reporting bias

Self-reporting Bias

We study reporting bias by comparing tweet rates about the weather to ground-truth weather data.

Does the weather’s extremeness, changes, or unexpectedness affect tweet rates?

[Kıcıman, ICWSM 2012]

Tweets & Weather Timeline

[Chart: weather-related tweet rate (log scale) and temperature over time, with spikes annotated “Hottest day” and “Thunderstorm”.]

Weather-related tweet rate and temperature in San Diego, CA, from Sep. 1 - Oct. 15, 2010


Collected 12 months of tweets that mentioned weather-related words (e.g., “rain”, “snow”, “sun”, “heat”, …). The word list was built by hand from weather glossaries, dictionaries, etc.


Example “Weather” Tweets

Woke up to a sunny 63F (17C) morning. It's going to be a good day :)

The rainy season has started.

The inside of our house looks like a tornado came through it.

Japan, Germany hail U.N. Iran sanctions resolution

Domain Classifier

Used a language-based classifier, with a simple Bayes model that scores a tweet by its average feature probability:

score = (1 / |T|) Σ_{t ∈ T} P(weather | t)

where T is the set of features (all pairs of co-occurring words within a tweet, regardless of order), and

P(weather | t) = (1 + C_weather(t)) / (1 + C(t))

where C_weather(t) counts occurrences of feature t in weather-labeled tweets and C(t) counts its occurrences overall.

Also:

• Simple stemming of words – remove ‘-s’ and ‘-ing’ suffixes
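A minimal Python sketch of this classifier (the counters are assumed to be built from the labeled tweets; features and stemming follow the slide):

```python
from collections import Counter
from itertools import combinations

def stem(word):
    for suffix in ("ing", "s"):          # crude stemming, per the slide
        if word.endswith(suffix) and len(word) > len(suffix):
            return word[: -len(suffix)]
    return word

def features(tweet):
    """All unordered pairs of co-occurring words within the tweet."""
    words = sorted({stem(w) for w in tweet.lower().split()})
    return list(combinations(words, 2))

def p_weather(t, c_weather, c_all):
    """P(weather | t) = (1 + C_weather(t)) / (1 + C(t))."""
    return (1 + c_weather[t]) / (1 + c_all[t])

def score(tweet, c_weather, c_all):
    """Mean of P(weather | t) over the tweet's feature set T."""
    T = features(tweet)
    if not T:
        return 0.0
    return sum(p_weather(t, c_weather, c_all) for t in T) / len(T)

# c_weather counts each feature's occurrences in weather-labeled tweets;
# c_all counts its occurrences across all labeled tweets.
c_weather, c_all = Counter(), Counter()
```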

Domain Classifier: Labeling

Labeled 2000 tweets manually (2 labelers) to create a “gold” training/test set

What were the challenges of labeling? Mainly maintaining strong, consistent criteria.

For example: “incidental” mentions of the weather?

Or, mentions of the weather someplace else?

Slightly less complicated:

Mentions of the weather in proverbs (‘when it rains it pours’)

Domain Classifier

Results:

Classifier F-Score of 0.83, with a precision of 0.80 and recall of 0.85

Is this good? In general, the precision/recall will depend heavily on the domain and the collection criteria for tweets.

Collect:

Raw Social

Media

Feature

Extraction

Feature: Location

Extracted as described in the politics case study

Define the Key

Context

Extract Core

Relationships

Collect:

Raw Social

Media

Feature

Extraction

Define the Key

Context

Add derived weather features from external (non-social) data:

Extremeness

Expectation

Change

Calculated based on the nearest weather station to the median location within the metropolitan area

Extract Core

Relationships

Data Preparation

12 months of Tweets, June 2010-June 2011

130M tweets include a weather-related word

179 words from weather glossaries, etc.

71M tweets pass a Bayesian classifier

Trained on 2k labeled tweets

8M tweets geo-located to 56 US cities

Used geo-tagged tweets to learn a mapping from profile locations

Collect:

Raw Social

Media

Feature

Extraction

Define the Key

Context

The key context in this case is the location-day pair.

This also defines the core relationship.

What we are most interested in is the count of tweets per location-day and the weather features per location-day…

Extract Core

Relationships

Correlation Analysis

Linear regression on derived features with L2 regularization.

Model Features | Global R² Correlation | Local R² Correlation
Basic Weather | 0.30 | 0.70
Expectation + Basic | 0.45 | 0.71
Change + Basic | 0.33 | 0.40
Extreme + Basic | 0.35 | 0.70
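A minimal sketch of this regression in Python with scikit-learn (the synthetic X and y are placeholders for the derived weather features and per-location-day tweet rates):

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))          # placeholder: extremeness, expectation, change
y = X @ np.array([0.5, 1.0, 2.0]) + rng.normal(size=1000)  # placeholder tweet rates

model = Ridge(alpha=1.0)                # linear regression with L2 regularization
model.fit(X, y)
print("R^2:", model.score(X, y))        # the R^2 correlations reported above
```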

Granger Analysis

[Chart: Granger analysis results by feature set – Extreme: 98.2%, Basic: 85.7%, Expectation: 66.1%, Change: 57.1%]

Outline

Part I: Introduction and conceptual framework

1. Introduction and preliminaries

2. Basic model for interaction through social media

3. A processing pipeline for analyzing social media

Part II: Case Studies

4. Social responses and engagement on So.Cl

5. Population biases in political tweets

6. Studying self-reporting bias by comparing tweet rates to ground-truth

7. Annotating graph structures with discussion context to interpret high-level graph analysis results

Case study on activities & locations

Question: Given a set of related locations inferred from social media, what can we tell about why they are related?

Why this case study?

Introduction to higher-level analyses and using context to interpret them.


Extracting features:

Activities: Exact match on activity names derived from search queries

Locations: Exact match on unambiguous location names from Wikipedia articles

Key Context == Tweet

Extract Core Relationships: Locations

Contextual statistics of discussions: comparing a pseudo-clique of NYC tourist locations with a pseudo-clique of NYC “midtown worker” locations

 | New York Tourist | Midtown Worker
Gender: Male | 49% | 63%
Gender: Female | 33% | 23%
Metro area: NYC | 33% | 54%
Metro area: Other | 67% | 46%
Mood: Joviality | 56% | 49%
Mood: Fear | 14% | 13%
Mood: Sadness | 11% | 15%
Mood: Guilt | 8% | 6%
Mood: Fatigue | 3% | 6%
Mood: Serenity | 3% | 4%
Mood: Hostility | 2% | 4%

Outline

Part I: Introduction and conceptual framework

1. Introduction and preliminaries

2. Basic model for interaction through social media

3. A processing pipeline for analyzing social media

Part II: Case Studies

4. Social responses and engagement on So.Cl

5. Population biases in political tweets

6. Studying self-reporting bias by comparing tweet rates to ground-truth

7. Annotating graph structures with discussion context to interpret high-level graph analysis results

Recap: Basic model of interaction

[Diagram: Bob sends Alice a message.]

Bob sends Alice a message m: argmin_{m ∈ M} |E* − F̂((Bob, Alice), m, e)|

Recap: Processing Framework

Collect Raw Social Media → Feature Extraction → Define the Key Context → Extract Core Relationships

Example: “I had fun hiking Tiger Mountain last weekend” – Alice said on Monday, at 10am

Extracted features: Location: Tiger Mountain; Mood: Happy; Activity: Hiking; Name: Alice; Gender: Female; Post Time: Mon 10am; Activity Time: {Sat-Sun}

Core relationship: Location: Tiger Mountain ↔ Activity: Hiking

Followed by higher-level graph and machine learning analyses on the combined structure and context…

Recap: Case Studies

Social responses and engagement on So.Cl

Clear & simple study of user interactions

Population biases in political tweets

Extracting basic features from tweets

Demonstrates complexity of population biases

Studying self-reporting bias by comparing tweet rates to ground-truth

Example of building a domain classifier

Methodology for studying self-reporting bias

Annotating graph structures with discussion context to interpret high-level graph analysis results

Applies higher-level graph analyses to graphs of discussion topics

Shows how discussion context can be useful at different layers

Summary

Social media data provides a fine-grained and large-scale representation of people’s discussions and interactions with each other.

Extract information about the real-world

Study people’s interactions with each other

How system design influences those interactions

But be careful, social media is generated through a complicated system, and has many biases!

Questions?

E-mail: Emre Kiciman emrek@microsoft.com

http://research.Microsoft.com/~emrek/

Selected Dataset Resources:

So.cl dataset: http://fuse.microsoft.com/research/srd

ICWSM Datasets http://icwsm.org/2013/datasets/datasets/

MyPersonality Project: http://mypersonality.org

Extra

Outline

Part I: Introduction and conceptual framework

1. Introduction and preliminaries

2. Basic model for interaction through social media

3. A processing pipeline for analyzing social media

Part II: Case Studies

4. Social responses and engagement on So.Cl

5. Population biases in political tweets

6. Studying self-reporting bias by comparing tweet rates to ground-truth

7. Annotating graph structures with discussion context to interpret high-level graph analysis results

8. (Bonus) Statistical language modeling to analyze language differences across user populations

Statistical language modeling to analyze language differences across user populations

What we’re doing:

Build and compare language models of Tweets, conditioned on various metadata features such as geography and number of followers.

Why we’re doing it:

1. It’s interesting to find and quantify the differences in style and topic among different groups of users.

2. Analysis and information extraction from tweets is important. More accurate language models may improve algorithms for word segmentation, NER, …

Metadata Class | Explicit Signals | Inferred Signals
Geography | Time zone; GPS coordinates | User-reported location
User Metadata | Number of followers; number followed; total tweet count; age of account | Gender; Interests
Message Metadata | Message length; retweet; contains URL; number of user references; time of day | #Topic; Well-capitalized

Twitter Data set

Data

72M Tweets gathered over ~3 days

90% training; 10% test

Focus on English tweets in these experiments

Approach

Partition by metadata feature

E.g., group messages by whether there’s a link in it

Build 1- to 3-gram LM per partition

Smoothed LMs with closed vocabulary

Cross-entropy among all partitions

Analyze differences in term-likelihoods among LMs
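A minimal Python sketch of the per-partition bigram LM and cross-entropy comparison (add-alpha smoothing stands in for whatever smoothing the actual models used):

```python
import math
from collections import Counter

def train_bigram_lm(messages, alpha=0.1):
    """Build an add-alpha smoothed bigram model over one partition."""
    unigrams, bigrams = Counter(), Counter()
    for msg in messages:
        toks = ["<s>"] + msg.split()
        unigrams.update(toks[:-1])          # context counts
        bigrams.update(zip(toks, toks[1:]))
    vocab = len(unigrams) or 1              # closed vocabulary
    def logprob(prev, w):
        return math.log((bigrams[(prev, w)] + alpha) /
                        (unigrams[prev] + alpha * vocab))
    return logprob

def cross_entropy(logprob, messages):
    """Average bits per token of one partition's text under another
    partition's model; lower means more similar language."""
    total, n = 0.0, 0
    for msg in messages:
        toks = ["<s>"] + msg.split()
        for prev, w in zip(toks, toks[1:]):
            total -= logprob(prev, w)
            n += 1
    return total / (n or 1) / math.log(2)
```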

Cross-entropy across Timezones

(Columns follow the same time-zone order as the rows.)

Hawaii: 1573 3078 3623 2795 3018 3294 3094 5591 4228 2027 6623 5051 3529
Alaska: 3506 1500 3238 2641 2866 3182 2892 11005 6496 2907 6004 12610 6477
Pacific: 2775 1894 1303 1825 2040 2222 2226 11676 6501 2493 2769 11591 5611
Mountain: 4619 4379 5263 1360 2362 2742 2824 13384 7465 2874 17897 13453 7023
Central: 4941 4655 5969 1774 1185 2009 1838 13244 7368 2695 24610 14107 6740
Eastern: 5586 5208 7244 2053 1943 1216 1767 15560 8475 2648 31850 14535 6953
Quito: 5042 4689 6539 2324 2200 2241 1153 8234 6061 2810 26049 13806 7197
Brasilia: 8063 8279 10229 5674 6230 6528 6666 724 5810 4909 28775 11331 7465
Greenland: 4437 4776 5966 3642 4006 4170 4030 1932 1536 2868 14817 11179 5962
London: 5013 5573 7160 3478 4065 4115 4266 10621 6561 917 21472 15561 7342
Jakarta: 5631 4896 5494 5298 5761 6200 6138 17000 9690 4461 1338 12107 7407
Osaka: 8276 8086 9359 6599 6944 7340 7252 16236 10461 5444 19994 1598 4495
Tokyo: 5682 5546 6589 4521 5006 5043 5222 8904 6811 3635 13864 2386 1265

Perplexity of bi-gram models learned for each time zone with respect to others

Differences across Timezone

3 kinds of differences:

• Geographic locations

• Topic variance

• Dialect, spelling differences

Cross-entropy across Num Followers

 | 0≤x<10 | 0≤x<100 | 0≤x<1000 | x≥1000
0≤x<10 | 922 | 2413 | 4528 | 7831
0≤x<100 | 1166 | 1071 | 2477 | 4811
0≤x<1000 | 1682 | 1341 | 1216 | 2317
x≥1000 | 3345 | 2421 | 2804 | 1544

• Similar language models for <10, <100, <1000 followers

• Differences appear for authors with >1000 followers

Differences across num. followers

[Chart: language-model differences across number-of-followers buckets (0≤x<10, 10≤x<100, 100≤x<1000, x≥1000).]
