Case Studies CS910: Foundations of Data Analytics Graham Cormode

CS910: Foundations
of Data Analytics
Graham Cormode
G.Cormode@warwick.ac.uk
Case Studies
Case Studies
 4 papers on data analytics published in the scientific literature:
– I Tube, You Tube, Everybody Tubes: Analyzing the World’s Largest User Generated Content Video System
  Internet Measurement Conference (IMC), 2007
– What is Twitter, a Social Network or a News Media?
  19th International Conference on World Wide Web (WWW), 2010
– Meme-tracking and the Dynamics of the News Cycle
  Knowledge Discovery and Data Mining (KDD), 2009
– Link Prediction by De-anonymization: How We Won the Kaggle Social Network Challenge
  2011 International Joint Conference on Neural Networks (IJCNN)
 Read them!
CS910 Foundations of Data Analytics
Details on Case Studies
 Full details and links to the papers on course web site
www2.warwick.ac.uk/fac/sci/dcs/teaching/material/cs910/
 Please read the papers in detail to get the full story
 Bias: papers are from the Computer Science research community
– Mostly address data analysis applied to large websites
– The most well-studied example of “Big Data”
– Examples should be familiar to you (YouTube, Facebook, Twitter)
 Objectives for the case studies:
– To see examples of data analytics in practice
– To introduce and motivate topics we will study in more detail later
– To see examples of going from data to insight to understanding
Case Study 1: Online video
 “I Tube, You Tube, Everybody Tubes: Analyzing the world’s largest
user generated content video system”
 By Meeyoung Cha, Haewoon Kwak, Pablo Rodriguez, Yong-Yeol
Ahn, Sue Moon (Telefonica Research and KAIST)
 Published in Internet Measurement Conference 2007
 http://conferences.sigcomm.org/imc/2007/papers/imc131.pdf
Objectives
 To understand the impact of video sharing systems
 To study the popularity life-cycle of videos
 To study the statistical properties of requests, and their relation to video age
 To study the prevalence of copying activities
 To understand the potential for caching to save bandwidth
Data Collection
 Crawled YouTube and Daum sites in 2007
– Wrote programs to automatically collect data about all videos
 YouTube was already very large in 2007
– Restricted crawl to ‘Entertainment’ and ‘Science/Tech’ categories
 Collected data on each video:
– Fixed: uploader id, date of upload, duration of video
– Variable: #views, #total ratings, #positive ratings, links to
 Daily crawl for 6 days to see changes
Video popularity distribution
 Plot what fraction of views are outside the top videos
 Normalize ranks from 0 to 100, to allow comparison
 Top 10% of videos account for 80% of views
– Very skewed distribution
– Wide variation in popularity. Why?
Understanding video popularity distribution
 “Skewness” (“the long tail”) is a common phenomenon in data
 Observed by plotting data on a log-log scale: straight lines
 Plot views on x-axis, #videos with more than x views on y-axis
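The plot described above is a complementary cumulative distribution function (CCDF). A minimal sketch of computing one, on hypothetical synthetic view counts (the Pareto parameters are illustrative, not from the paper):

```python
import bisect
import random

random.seed(0)
# Hypothetical view counts drawn from a heavy-tailed (Pareto) distribution
views = [int(random.paretovariate(1.5)) for _ in range(10000)]

def ccdf(values):
    """Return (x, fraction of values strictly greater than x), x ascending."""
    n = len(values)
    ordered = sorted(values)
    return [(x, (n - bisect.bisect_right(ordered, x)) / n)
            for x in sorted(set(values))]

curve = ccdf(views)
# Plotting curve on log-log axes: an approximately straight line
# suggests a power law.
```

The CCDF is preferred to a raw histogram here because it smooths out the noisy counts in the tail.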
Modeling Skewness
Several distributions generate skew:
 “Power law” (Pareto, Zipf):
– y proportional to x^(-a) for some a
– Gives a straight line on a log-log plot
 “Power law with exponential cut-off”:
– y proportional to x^(-a) e^(-bx)
– x < 1/b: behaves like a power law
– x > 1/b: behaves as exponential decay
 Log-normal: taking the log of the distribution produces a Normal curve
[Figure: the three candidate distributions plotted on log-log axes]
 How to tell which we are seeing?
Fit a curve
 Find the best fitting curve for each model
– A regression problem (covered later)
– Log-normal captures behaviour for popular videos (the head)
– Power law with exponential cut-off seems best for the tail
 Why?
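A minimal sketch of the regression idea for the simplest model: since y = C·x^(-a) becomes linear in log space (log y = log C − a·log x), ordinary least squares on the logged data recovers the exponent. The data here is synthetic with a known exponent, purely for illustration:

```python
import math
import random

random.seed(1)
# Hypothetical data following y = C * x^(-a) with small multiplicative noise
C, a_true = 100.0, 2.0
data = [(x, C * x ** (-a_true) * math.exp(random.gauss(0, 0.05)))
        for x in range(1, 200)]

# In log space the model is linear: log y = log C - a * log x,
# so least squares on (log x, log y) estimates the exponent a.
lx = [math.log(x) for x, _ in data]
ly = [math.log(y) for _, y in data]
n = len(data)
mx, my = sum(lx) / n, sum(ly) / n
slope = (sum((u - mx) * (v - my) for u, v in zip(lx, ly))
         / sum((u - mx) ** 2 for u in lx))
a_fit = -slope  # estimated power-law exponent, close to a_true
```

Fitting the cut-off and log-normal models needs nonlinear regression (e.g. scipy's curve_fit), but the principle is the same: minimise the error of each model and compare fits.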
Possible explanations
Several mechanisms are known to generate long-tail distributions:
 Preferential attachment: popular items are most likely to attract new views
– Describes the main behaviour, but not the truncated tail
 Aging effect: “old” items eventually die and receive no new activity
– Does not fit videos: there is no ‘death’ or ‘removal’ of an old video
 Information filtering: a user can only view a fixed number of items
– Does not fit: people can keep watching new videos
 Fetch-at-most-once: a user views each video at most once
– Better: people prefer to watch new videos, and don’t watch the top-10 over and over
 Some exceptions: music fans watch favourites many times
Validating ‘fetch-at-most-once’
 Simulate preferential attachment + fetch-at-most-once
– R requests per user
– U different users
– V different videos
 Observations from simulation
– Increasing R sharpens the tail
– Increasing U shifts the graph; the shape doesn’t change much
 Do you agree?
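A minimal sketch of this simulation, with the slide's R, U, V parameters; the sampling details (initial virtual view per video, weights fixed per request) are my assumptions, not the paper's exact setup:

```python
import random

random.seed(42)
U, V, R = 500, 200, 50   # users, videos, requests per user (illustrative)

views = [1] * V          # start each video with one "virtual" view
for user in range(U):
    seen = set()
    for _ in range(R):
        # Preferential attachment: pick a video with probability ∝ views,
        # but fetch-at-most-once: exclude videos this user already watched.
        choices = [v for v in range(V) if v not in seen]
        if not choices:
            break
        weights = [views[v] for v in choices]
        v = random.choices(choices, weights=weights)[0]
        seen.add(v)
        views[v] += 1
```

The fetch-at-most-once constraint means the most popular videos saturate (no user can watch them twice), which truncates the tail of the resulting distribution.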
Effect of time
 Views increase over time
 Truncation gets sharper over time
– Many possible reasons: e.g. more push to the most popular content
Impact of age
 Popularity does not vary strongly by age
 Most recent videos are slightly more popular
 Data is from early days in YouTube: have things changed?
Can we predict future popularity?
 Consider current popularity (views) and age (time since upload)
– Does this correlate with future popularity?
– Table shows correlation coefficients (number of videos sampled)
 Strong correlation of instant popularity with future popularity
– From day 2. Day 3 does not change much.
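The correlation coefficient in the table is the standard Pearson coefficient. A minimal sketch on synthetic numbers (the data here is invented to mimic "later views roughly proportional to early views"; it is not the paper's data):

```python
import math
import random

random.seed(7)
# Hypothetical view counts: 'later' is roughly proportional to 'early'
early = [random.paretovariate(2.5) for _ in range(1000)]   # e.g. views by day 2
later = [v * random.uniform(1.5, 2.5) for v in early]      # e.g. views weeks later

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

r = pearson(early, later)   # close to 1: early views predict later views
```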
Use these observations to cache
 Streaming video uses a lot of Internet bandwidth (up to 66%?)
 Could we cut bandwidth usage by running a cache?
– E.g. put a video cache in place for all of Warwick University
 How to fill the cache?
– Static: pick the most popular items once and for all
– Dynamic: initialize with the most popular, then cache all new videos
 Unrealistic, but a point of comparison
– Hybrid: static + daily most popular
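A minimal sketch of estimating the saving from the static policy, assuming (my assumption, for illustration) a Zipf-like request distribution over the video catalogue:

```python
import random

random.seed(3)
V, k = 1000, 100     # catalogue size and cache size (illustrative numbers)
# Assume a Zipf-like popularity distribution: rank 0 is most popular
weights = [1 / (rank + 1) for rank in range(V)]
requests = random.choices(range(V), weights=weights, k=20000)

cache = set(range(k))   # static policy: the k most popular videos, fixed
hits = sum(1 for r in requests if r in cache)
hit_rate = hits / len(requests)   # fraction of requests served from cache
```

Because popularity is so skewed, caching just 10% of the catalogue serves a large majority of requests; the dynamic and hybrid policies would update `cache` as requests arrive.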
Reflections on the paper
 A widely referenced paper from early in YouTube’s history
– 1100+ citations in the literature
 Characterized many aspects of video viewing behaviour
– And attempted to explain many of them
– Many other plots in the paper
 Video on the Internet has changed a lot since 2007
– Changes to YouTube website structure
– Huge growth in mobile devices
– Videos with billions of views
 Do conclusions still hold? What other phenomena emerge?
Case Study 2: Microblogging
 “What is Twitter, a social network or a news media?”
– Kwak, Lee, Park, Moon (KAIST), in the WWW conference, 2010
– http://an.kaist.ac.kr/~haewoon/papers/2010-www-twitter.pdf
 Objectives:
– “To study the topological characteristics of Twitter and its power as a new medium of information sharing”
– “The first quantitative study on the entire Twittersphere and information diffusion on it”
One slide on Twitter
 Messaging service where messages are up to 140 characters
 Broadcast (by default) to all other users
– @userid addresses a particular user
– #hashtag tags a message
– RT: re-tweet someone else’s message (sometimes with comment)
 A user “follows” another, and receives that user’s messages
Data Collection
 User profiles
– Crawled name, location, timezone, number of tweets
– Began at a popular user to crawl the “giant connected component”
– Twitter limits crawlers to 20,000 requests an hour
 Using 20 machines with different IP addresses, the crawl took 24 days
 Tweets
– Collected all tweets mentioning current “trending topics”
 Probed every 5 minutes
– Used the Twitter search API to get up to 1,500 tweets per query
– Collected text, author, timestamp
– Removed “spam tweets” from very new users
Follower/following analysis
[Figure: CCDF (fraction with more than this number) of followers and followings]
 Asymmetry between followers and following
 Some bumps can be explained with domain knowledge
– Following 20: Twitter suggests an initial set of 20 users to follow
– Following 2000: there used to be a limit of 2000, since removed
 Fits power law, with exponent 2.28 (quite skewed)
Degree of separation
 Two users are “friends” if there is a mutual following relationship
 Study the friendship distance from random starting points
– 80% are within 4 steps (compare to “6 degrees of separation”)
Proximity of users
 Are friends geographically close? How would you test this?
 Look at the average time difference between friends
– As a function of the number of friends
Importance of Twitter users
 Who are the most important/influential Twitter users?
– F: count followers? Too crude?
– R: most retweeted?
– PR: most critical in the follower-following graph?
 Use PageRank: a measure defined for the importance of web pages
– Recursive definition: the PageRank of a node is the sum of the PageRanks of its followers
– Can be computed efficiently (see later)
 Compare the top-20 in each: they appear similar
 Compute rank-correlation between top-k lists
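A minimal sketch of PageRank by power iteration on a toy follower graph. Note the slide's informal definition omits the out-degree normalisation and damping factor used in practice; the graph below is invented for illustration:

```python
# followers[v] = users who follow v (rank flows from follower to followee)
followers = {
    "a": ["b", "c"],
    "b": ["c"],
    "c": [],
    "d": ["a"],   # hypothetical users, not from the paper
}
# out_deg[f] = number of accounts f follows
out_deg = {u: 0 for u in followers}
for v, fs in followers.items():
    for f in fs:
        out_deg[f] += 1

damping = 0.85
n = len(followers)
pr = {u: 1 / n for u in followers}   # start from the uniform distribution
for _ in range(50):                  # power iteration until (near) convergence
    pr = {v: (1 - damping) / n
             + damping * sum(pr[f] / out_deg[f]
                             for f in followers[v] if out_deg[f])
          for v in followers}
```

Each node's rank is a damped sum of its followers' ranks, each split across everyone that follower follows; nodes with many (or highly ranked) followers end up with high PageRank.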
Comparison of importance measures
 Followers (F) and PageRank (PR) give the most similar rankings
 Retweets (R) are also correlated, but more weakly
Spread of influence
 Retweets can spread a message far and wide
 Figure shows retweets of messages about a plane crash
– Most trees are shallow (<3 hops)
[Figure: CCDF — fraction with more than this number]
Retweet time
[Figures: distribution of delay from initial tweet to retweet; inter-hop delay through retweet trees]
 What do these plots tell you?
Reflections on the paper
 See also “A few chirps about twitter”, Krishnamurthy, Gill, Arlitt
– From 2008, early in Twitter’s history
– Instructive to compare the approach and the findings
 Shows that there are many ways to slice and dice data from a relatively “simple” data source
– Did not even look much at the content of tweets
 Widely cited (3400+ citations) as early work on Twitter
Case Study 3: Meme-tracking
 “Meme-tracking and the Dynamics of the News Cycle”
 Jure Leskovec, Lars Backstrom, Jon Kleinberg (Stanford, Cornell)
 International Conference on Knowledge Discovery and Data Mining (KDD), 2009
 snap.stanford.edu/class/cs224w-readings/leskovec09meme.pdf
 Objectives:
– Track short, distinctive phrases that travel through the web
– Use this to study the “news cycle” in the news media
Data Collection and Preparation
 News and blog activity from August 1 to October 31, 2008
– 90 million documents from 1.65 million sites
– Used the Spinn3r API to collect: see spinn3r.com
 Extracted 112 million phrases in “quotes”
– Discarded those with < 4 words or seen < 10 times (uninteresting)
– Discarded those where > 25% of occurrences come from the same source (spam)
– Leaves 47 million occurrences of phrases
 Collected phrases into clusters, based on overlap
– Consider two phrases linked if they differ by at most 1 word
– Partition the induced “phrase graph” to isolate key phrases
 A long speech may have several key phrases in it
 Quite a detailed process; see the paper for details
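A minimal sketch of the linking-and-clustering idea: link near-identical phrases, then take connected components with union-find. The link test here (same length, at most one differing word) is a simplification of the paper's phrase graph, and the phrases are toy examples:

```python
phrases = [
    "you can put lipstick on a pig",
    "you could put lipstick on a pig",
    "put lipstick on a pig",
    "the fundamentals of our economy are strong",
]

# Union-find over phrase indices
parent = list(range(len(phrases)))
def find(i):
    while parent[i] != i:
        parent[i] = parent[parent[i]]   # path halving
        i = parent[i]
    return i

def differ_by_at_most_one(p, q):
    # Simplified link test: equal length, at most one differing position
    a, b = p.split(), q.split()
    return len(a) == len(b) and sum(x != y for x, y in zip(a, b)) <= 1

for i in range(len(phrases)):
    for j in range(i + 1, len(phrases)):
        if differ_by_at_most_one(phrases[i], phrases[j]):
            parent[find(i)] = find(j)     # merge the two clusters

clusters = {}
for i in range(len(phrases)):
    clusters.setdefault(find(i), []).append(i)
```

The paper's actual partitioning also links phrases of different lengths (sub-phrases) and uses a more careful graph-partitioning step.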
Phrase distribution
 For each volume, plot the # of phrases with at least that volume
– For all phrases, clusters of phrases, and phrases about “lipstick on a pig” (the largest phrase cluster)
Most important threads
 A thread is all articles containing a phrase from a cluster
– Plot is automatically generated and labeled to show volume
Formulating a model
 Advanced data analytics: propose a new model to explain data
 Try to capture major effects, and neglect minor points
– Imitation: sources imitate/copy each other
– Recency: the news cycle is dominated by recent events
 Model: simulate discrete time steps; news sources report threads
– A new thread is produced at each step
– At time t, each source picks thread j with probability proportional to f(n_j) · d(t − t_j)
 n_j: number of sources reporting on thread j [“Imitation”]
 d(t − t_j): decay factor based on the age of thread j [“Recency”]
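A minimal sketch of this model. Following the next slide's suggestion, f() is a power law and d() an exponential decay; the exponents and constants are illustrative assumptions, not the paper's fitted values:

```python
import math
import random

random.seed(5)
S, T = 50, 40    # number of sources, number of time steps (illustrative)
n = []           # n[j]: number of sources that have reported thread j
born = []        # born[j]: time step at which thread j appeared

f = lambda nj: nj ** 0.75              # imitation: power law in popularity
d = lambda age: math.exp(-age / 5)     # recency: exponential decay with age

for t in range(T):
    n.append(1)                        # a new thread appears at each step
    born.append(t)
    # Weights ∝ f(n_j) * d(t - t_j); held fixed within a step for simplicity
    weights = [f(nj) * d(t - born[j]) for j, nj in enumerate(n)]
    for _ in range(S):                 # each source picks one thread
        j = random.choices(range(len(n)), weights=weights)[0]
        n[j] += 1
```

Runs of this kind produce the characteristic rise-and-fall thread volumes: imitation amplifies popular threads, while decay pulls attention towards newer ones.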
Validating the Model
[Figure: simulated volume curves for recency-only and imitation-only variants]
 Simulation: pick f() as a power law and d() as exponential decay
– Generates synthetic data that looks similar to the real data
– Can we more rigorously validate the model?
Reflections on the paper
 A very innovative approach to the question
– Created from scratch a way to think about “memes”
– Proposed new models, and gave some evaluation of them
– Tackled a timely question and used compelling examples
– Plots and figures illustrate the key points
– Widely cited (800+ citations)
– But: models not robustly evaluated
Case Study 4: Link prediction
 “Link Prediction by De-anonymization: How we won the Kaggle
Social Network Challenge”
 Narayanan (Texas), Shi (Berkeley), Rubinstein (Microsoft)
 International Joint Conference on Neural Networks (IJCNN), 2011
 http://arxiv.org/abs/1102.4374
 Objectives:
– Correctly predict whether two users in a network would form a link
– Use additional background information to improve the results
The Competition
Kaggle hosts competitions for data analytics
 Hosted the 2011 IJCNN Social Network Challenge in late 2010
 Provided a graph drawn from a social network
– Nodes correspond to users in the network
– (Directed) edges indicate a following relationship
– Evaluation: determine whether a set of test edges truly occur
 It was later disclosed that the graph came from Flickr
– Edges are (directed) “friendship” relations between users
Link Prediction
 Goal of “link prediction” is to determine which new links will
form, given current state of the graph
 Many factors can be taken into account:
– Properties of the nodes
– Existing number of links
– Common neighbours between a pair
– Graph distance between the pair
 This work used an additional factor:
– Try to match nodes in the data set to the team’s own data collection
Data Collection
 Competition data: Kaggle
– 1.1M nodes, 7.2M edges provided as the main data
– 8,960 “test edges”: 50% true edges (removed from the main data)
– 20% of the test set held back by Kaggle to evaluate the results
 Competitors’ data: Flickr
– Crawled the Flickr social graph (using Python + the cURL library)
– 2M nodes and all their outgoing edges crawled
– Total of 9.1M nodes, with 163M edges (a much bigger data set)
 Evaluation: Area Under the Curve (AUC) [defined later]
– Values are True/False, predicted as Positive/Negative
– Ranges from 0.5 (poor: random chance) to 1 (perfect)
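A minimal sketch of AUC as it applies here: the probability that a randomly chosen true edge is scored above a randomly chosen false edge (ties count half). The scores below are toy values, not competition data:

```python
def auc(scores, labels):
    """Probability a random positive outscores a random negative (ties = 0.5)."""
    pos = [s for s, l in zip(scores, labels) if l]
    neg = [s for s, l in zip(scores, labels) if not l]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0
               for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

# Toy example: a scorer that ranks every true edge above every false edge
perfect = auc([0.9, 0.8, 0.2, 0.1], [True, True, False, False])
```

This pairwise formulation is equivalent to the area under the ROC curve, and makes clear why a random scorer gets 0.5.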
Degree distribution
 Each node has an in-degree and an out-degree
 Skewed distribution: few nodes have a high in/out-degree
– Can try to use these as “landmarks”
Seed identification
 Try to match a few nodes between the two graphs
– Look at nodes with high in-degree (pointed to by many)
– These are likely to be present in both graphs
 Because of the crawling process
 Pick the n (= 20) highest-degree nodes from Kaggle (K) and Flickr (F)
– Try to match them up to get a “seed” matching
– For a pair of nodes v, w in K (or F), compute their “cosine similarity”:
 #common neighbours(v,w) / √(#neighbours(v) × #neighbours(w))
– Find the best matching of nodes in K and F based on cosine similarities
 Initially: manually
 Later: as an optimization problem (see OR and optimization)
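A minimal sketch of the cosine similarity above, on a toy adjacency structure (the graph is invented for illustration; the real computation runs within each of K and F):

```python
import math

# Toy adjacency: node -> set of neighbours (not the real graphs)
neighbours = {
    1: {2, 3, 4},
    2: {1, 3, 4},
    3: {1, 2},
    4: {1, 2},
}

def cosine_sim(v, w):
    """#common neighbours(v,w) / sqrt(#neighbours(v) * #neighbours(w))"""
    common = neighbours[v] & neighbours[w]
    return len(common) / math.sqrt(len(neighbours[v]) * len(neighbours[w]))

s = cosine_sim(1, 2)   # nodes 1 and 2 share neighbours {3, 4}
```

Nodes whose neighbourhoods overlap heavily score near 1, so matching high-similarity pairs across the two graphs identifies likely copies of the same user.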
Propagation
 Now have matched n nodes between Kaggle and Flickr graphs
 Maintain a matching, and try to extend it based on neighbours
– Find pairs of nodes in Kaggle and Flickr whose similarity is high
– Extend the matching; iterate
– Some heuristics to accept a new pair into the matching:
 There must be at least 4 mapped common neighbours
 The cosine-similarity score must be at least 0.5
 The difference in similarity scores between the best and second-best candidates must be at least 0.2
Results
 Using “ground truth” information:
– After 120,000 mappings in the first stage, 99.3% correct
– After the second stage, had mappings for 14K out of 17.6K nodes in the test set
 Overall accuracy for matched nodes: 97.8%
 A coverage of 57% of edges: still need to give an answer for the rest
 For test edges, accuracy was 95%
 Use the inferred information to predict links for more of the test set
– Look at all possible candidates for a node pair in Flickr
– Use these to vote on whether the edge is present or not
 Accept if the vote is unanimous
– Covers a further 19% of test edges
Machine Learning
 Leaves 24% of test edges without a mapping to Flickr
 Apply machine learning (the original goal of the challenge)
– Create a number of “features” for each edge
 In-degree and out-degree of each node
 Whether the reverse edge exists
 Measures of the local graph (number of common neighbours etc.)
– Train a “classifier”: see later lectures on classification
– AUC for the classifier approach is ~0.9
 Total AUC for the whole approach on the test data: 0.981
– Excellent accuracy for deanonymized nodes, less for the rest
Reflections on the paper
 The score of 0.981 AUC was enough to win the competition
– Second best was 0.969: www.kaggle.com/c/socialNetwork/leaderboard
– The researchers contacted the organizers to reveal their method
– They were told that this was within the rules
– Read the message board for the competition to see other opinions: www.kaggle.com/c/socialNetwork/forums
 Lesson: when understanding data, think beyond what you have
– Are there other data sets that can help understand it better?
– Can you learn properties of one data set and transfer them to another?
– Can you link two data sets to learn more about the first?
 Lesson: removing information does not “anonymize” data
Lessons from Case Studies
 Start by describing the data
– How it was collected, with what attributes, what was dropped
 Look at the data
– Plot combinations of attributes to see correlations/distributions
 Find models that agree with the data
– Lines of best fit, long-tail distributions
 Find explanations that are consistent with the data and domain knowledge
 Extract observations of interest/value
 Make predictions and evaluate their quality