CS910: Foundations of Data Analytics
Graham Cormode
G.Cormode@warwick.ac.uk

Case Studies
4 papers on data analytics published in the scientific literature:
– I Tube, You Tube, Everybody Tubes: Analyzing the World’s Largest User Generated Content Video System. Internet Measurement Conference, 2007
– What is Twitter, a Social Network or a News Media? 19th International Conference on World Wide Web, 2010
– Meme-tracking and the Dynamics of the News Cycle. Knowledge Discovery and Data Mining (KDD), 2009
– Link Prediction by De-anonymization: How We Won the Kaggle Social Network Challenge. 2011 International Joint Conference on Neural Networks (IJCNN)
Read them!

Details on Case Studies
Full details and links to the papers on course web site
– www2.warwick.ac.uk/fac/sci/dcs/teaching/material/cs910/
Please read the papers in detail to get the full story
Bias: papers are from Computer Science research community
– Mostly address data analysis applied to large websites
– The most well-studied example of “Big Data”
– Examples should be familiar to you (YouTube, Facebook, Twitter)
Objectives for the case studies:
– To see examples of data analytics in practice
– To introduce and motivate topics we will study in more detail later
– To see examples of going from data to insight to understanding

Case Study 1: Online video
“I Tube, You Tube, Everybody Tubes: Analyzing the world’s largest user generated content video system”
By Meeyoung Cha, Haewoon Kwak, Pablo Rodriguez, Yong-Yeol Ahn, Sue Moon (Telefonica Research and KAIST)
Published in Internet Measurement Conference 2007
http://conferences.sigcomm.org/imc/2007/papers/imc131.pdf

Objectives
To understand the impact of video sharing systems
To study the popularity life-cycle of videos
Study statistical properties of requests, relation to video age
Study prevalence of copying activities
Understand potential for
caching to save bandwidth

Data Collection
Crawled YouTube and Daum sites in 2007
– Wrote programs to automatically collect data about all videos
YouTube was already very large in 2007
– Restricted crawl to ‘Entertainment’ and ‘Science/Tech’ categories
Collected data on each video:
– Fixed: uploader id, date of upload, duration of video
– Variable: #views, #total ratings, #positive ratings, links to
– Daily crawl for 6 days to see changes

Video popularity distribution
Plot what fraction of views are outside the top videos
Normalize ranks from 0 to 100, to allow comparison
Top 10% of videos account for 80% of views
– Very skewed distribution
– Wide variation in popularity. Why?

Understanding video popularity distribution
“Skewness” (“the long tail”) is a common phenomenon in data
Observed by plotting data on a log-log scale: straight lines
Plot views on x-axis, #videos with more than x views on y-axis

Modeling Skewness
Several distributions generate skew
“Power law” (Pareto, Zipf): y proportional to $x^{-a}$ for some a
– Gives straight line on log-log plot
“Power law with exponential cut-off”: y proportional to $x^{-a} e^{-bx}$
– x < 1/b: behaves like power law
– x > 1/b: behaves as exponential decay
Log-normal: taking the log of the values produces a Normal distribution
[Figure: log-log plots; x-axis from 10^0 to 10^6, y-axis from 10^-6 to 10^0]
How to tell which we are seeing?

Fit a curve
Find the best fitting curve for each model
– A regression problem (covered later)
– Log-normal captures behaviour for popular videos (head)
– Power law with exponential cutoff seems best for tail
Why?
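The log-log view described on these slides can be sketched in a few lines. This is only an illustration, not the paper's fitting procedure: the function names `ccdf` and `loglog_slope` and the view counts are made up for the example, and an ordinary least-squares line on log-log axes is a crude way to estimate a power-law exponent.

```python
import math

def ccdf(values):
    """Return (x, fraction of values strictly greater than x) for each distinct x."""
    ordered = sorted(values)
    n = len(ordered)
    points = []
    for i, x in enumerate(ordered):
        # Emit one point per distinct value, at its last occurrence.
        if i + 1 == n or ordered[i + 1] != x:
            points.append((x, (n - i - 1) / n))
    return points

def loglog_slope(points):
    """Least-squares slope and intercept of log10(y) against log10(x).

    Points with a non-positive coordinate cannot be shown on log axes and are skipped.
    """
    logs = [(math.log10(x), math.log10(y)) for x, y in points if x > 0 and y > 0]
    n = len(logs)
    mean_x = sum(lx for lx, _ in logs) / n
    mean_y = sum(ly for _, ly in logs) / n
    sxx = sum((lx - mean_x) ** 2 for lx, _ in logs)
    sxy = sum((lx - mean_x) * (ly - mean_y) for lx, ly in logs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Hypothetical view counts for ten videos (not data from the paper).
views = [1, 1, 1, 1, 2, 2, 3, 5, 8, 40]
slope, intercept = loglog_slope(ccdf(views))
print(f"estimated log-log slope: {slope:.2f}")
```

A roughly constant slope across the whole range suggests a power law; a curve that bends downwards at large x is what the cut-off and log-normal models are meant to capture.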
Possible explanations
Several mechanisms are known to generate long tail distributions:
Preferential attachment: popular items are most likely to be chosen
– Describes the main behaviour, but not the truncated tail
Aging effect: “old” items eventually die, receive no new activity
– Does not fit videos: no ‘death’ or ‘removal’ of an old video
Information filtering: a user can only view a fixed number
– Does not fit: people can keep watching new videos
Fetch-at-most-once: a user can view each video at most once
– Better: people prefer to watch new videos, don’t watch top-10 over and over
– Some exceptions: music fans watch favourites many times

Validating ‘fetch-at-most-once’
Simulate preferential attachment + fetch-at-most-once
– R requests per user
– U different users
– V different videos
Observations from simulation
– Increasing R sharpens tail
– Increasing U shifts graph; shape doesn’t change much
Do you agree?

Effect of time
Views increase over time
Truncation gets sharper over time
– Many possible reasons: e.g. more push to most popular content

Impact of age
Popularity does not vary strongly by age
Most recent videos are slightly more popular
Data is from early days in YouTube: have things changed?

Can we predict future popularity?
Consider current popularity (views) and age (time since upload)
– Does this correlate with future popularity?
Table shows correlation coefficient (number of videos sampled)
– Strong correlation of instant popularity with future popularity
– From day 2. Day 3 does not change much.

Use these observations to cache
Streaming video uses a lot of Internet bandwidth (up to 66%?)
Could we cut bandwidth usage by running a cache?
– E.g. put a video cache for all of Warwick University
How to fill the cache?
– Static: Pick the most popular items once and for all
– Dynamic: Initialize with most popular, then cache all new videos (unrealistic, but a point of comparison)
– Hybrid: Static + daily most popular

Reflections on the paper
A widely referenced paper from early in YouTube’s history
– 1100+ citations in the literature
Characterized many aspects of video viewing behaviour
– And attempted to explain many of these
– Many other plots in the paper
Video on the Internet has changed a lot since 2007
– Changes to YouTube website structure
– Huge growth in mobile devices
– Videos with billions of views
Do conclusions still hold? What other phenomena emerge?

Case Study 2: Microblogging
“What is Twitter, a social network or a news media?”
– Kwak, Lee, Park, Moon (KAIST), in WWW conference 2010
– http://an.kaist.ac.kr/~haewoon/papers/2010-www-twitter.pdf
Objectives:
– “To study the topological characteristics of Twitter and its power as a new medium of information sharing”
– “The first quantitative study on the entire Twittersphere and information diffusion on it”

One slide on Twitter
Messaging service where messages are up to 140 characters
Broadcast (by default) to all other users
– @userid addresses a particular user
– #hashtag to tag a message
– RT: re-tweet someone else’s message (sometimes with comment)
Users “follow” another, and receive that user’s messages

Data Collection
User Profiles
– Crawl name, location, timezone, number of tweets
– Begin at a popular user to crawl the “giant connected component”
– Twitter limits to 20,000 requests an hour
– Using 20 machines with different IP addresses, took 24 days
Tweets
– Collected all tweets mentioning current “trending topics”
– Probed every 5 minutes
– Used Twitter search API to get up to 1500 tweets per query
– Collect text, author, timestamp
– Removed “spam tweets” from very
new users

Follower/following analysis
CCDF (complementary cumulative distribution function): fraction of users with more than this number
Asymmetry between followers and following
Some bumps can be explained with domain knowledge
– Following 20: Twitter suggests an initial set of 20 to follow
– Following 2000: used to be a limit of 2000, since removed
Fits power law, with exponent 2.28 (quite skewed)

Degree of separation
2 users are “friends” if there is a mutual following relationship
Study the friendship distance from random starting points
– 80% are within 4 steps (compare to “6 degrees of separation”)

Proximity of users
Are friends geographically close? How would you test this?
Look at average time difference between friends
– As a function of number of friends

Importance of Twitter users
Who are most important/influential Twitter users?
– F: Count followers? Too crude?
– R: Most retweeted?
– PR: Most critical in the follower-following graph?
Use PageRank: defined to measure importance of web pages
– Recursive definition: PageRank of a node is a weighted sum of the PageRanks of its followers
– Can be computed efficiently (see later)
Compare top-20 in each: they appear similar
Compute rank-correlation between top-k

Comparison of importance measures
Followers (F) and PageRank (PR) give most similar ranking
ReTweets (RT) are also correlated, but more weakly so

Spread of influence
ReTweets can spread a message far and wide
– Figure shows retweets of messages about a plane crash
Most trees are shallow (<3 hops)
CCDF: fraction with more than this number

Retweet time
Distribution of delay from initial tweet to retweet
Inter-hop delay through retweet trees
What do these plots tell you?
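The recursive PageRank definition from the "Importance of Twitter users" slide can be sketched as a simple power iteration. This is a minimal illustration under stated assumptions, not the computation used in the paper: it uses the standard formulation in which each follower's rank is divided by the number of accounts they follow, the damping factor of 0.85 and the iteration count are conventional choices, and the user names are hypothetical.

```python
def pagerank(followers, damping=0.85, iterations=50):
    """Power-iteration PageRank on a follow graph.

    `followers` maps each user to the list of users who follow them.
    The damping value and iteration count are assumptions for illustration.
    """
    users = set(followers)
    for fs in followers.values():
        users.update(fs)
    n = len(users)

    # Number of accounts each user follows (their out-degree).
    following_count = {u: 0 for u in users}
    for fs in followers.values():
        for f in fs:
            following_count[f] += 1

    rank = {u: 1.0 / n for u in users}
    for _ in range(iterations):
        # Rank held by users who follow nobody is spread evenly over everyone.
        dangling = sum(rank[u] for u in users if following_count[u] == 0)
        new_rank = {}
        for u in users:
            incoming = sum(rank[f] / following_count[f] for f in followers.get(u, []))
            new_rank[u] = (1 - damping) / n + damping * (incoming + dangling / n)
        rank = new_rank
    return rank

# Hypothetical three-user graph: bob and carol follow alice, carol follows bob.
example = {"alice": ["bob", "carol"], "bob": ["carol"], "carol": []}
for user, score in sorted(pagerank(example).items(), key=lambda kv: -kv[1]):
    print(user, round(score, 3))
```

Sorting users by this score gives the PR ranking that the slides compare against follower counts (F) and retweet counts.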
Reflections on the paper
See also “A few chirps about twitter”, Krishnamurthy, Gill, Arlitt
– From 2008, early in Twitter’s history
– Instructive to compare the approach and the findings
Shows that there are many ways to slice and dice data from a relatively “simple” data source
– Did not even look much at content of tweets
Widely cited (3400+ citations) as early work on Twitter

Case Study 3: Meme-tracking
“Meme-tracking and the Dynamics of the News Cycle”
Jure Leskovec, Lars Backstrom, Jon Kleinberg (Stanford, Cornell)
International Conference on Knowledge Discovery and Data Mining (KDD), 2009
snap.stanford.edu/class/cs224w-readings/leskovec09meme.pdf
Objectives:
– Track short distinctive phrases that travel through the web
– Use this to study the “news cycle” in the news media

Data Collection and Preparation
News and blog activity from August 1 to October 31 2008
– 90 million documents from 1.65 million sites
– Used Spinn3r API to collect: see spinn3r.com
Extracted 112 million phrases in “quotes”
– Discard those with < 4 words or seen < 10 times (uninteresting)
– Discard those where > 25% occurrences from same source (spam)
– Leaves 47 million occurrences of phrases
Collect phrases into clusters, based on overlap
– Consider two phrases linked if they differ by at most 1 word
– Partition the induced “phrase graph” to isolate key phrases
– A long speech may have several key phrases in it
– Quite detailed process; see paper for details

Phrase distribution
For each volume, plot # of phrases with at least that volume
– For all phrases, clusters of phrases, and phrases about “lipstick on a pig” (largest phrase cluster)

Most important threads
Thread is all articles containing a phrase from a cluster
– Plot is automatically generated and labeled to show volume

Formulating a model
Advanced data analytics: propose a new model to explain data
Try to capture major effects, and neglect minor points
– Imitation: sources imitate/copy each other
– Recency: news cycle dominated by recent events
Model: simulate discrete time steps: news sources report threads
– New thread produced at each step
– At time t, each source picks thread j with probability proportional to $f(n_j)\,d(t - t_j)$
– $n_j$: number of sources reporting on thread j [“Imitation”]
– $d(t - t_j)$: decay factor based on age of thread j [“Recency”]

Validating the Model
[Figure panels: Recency-only, Imitation-only]
Simulation: pick f() as power-law and d() as exponential decay
– Generates synthetic data that looks similar to real
– Can we more rigorously validate the model?

Reflections on the paper
A very innovative approach to the question
– Created from scratch a way to think about “memes”
– Proposed new models, and gave some evaluation of them
– Tackled a timely question and used compelling examples
– Plots and figures illustrate the key points
– Widely cited (800+ citations)
– But: models not robustly evaluated

Case Study 4: Link prediction
“Link Prediction by De-anonymization: How we won the Kaggle Social Network Challenge”
Narayanan (Texas), Shi (Berkeley), Rubinstein (Microsoft)
International Joint Conference on Neural Networks, 2011
http://arxiv.org/abs/1102.4374
Objectives:
– Correctly predict whether two users in a network would form a link
– Use additional background information to improve the results

The Competition
Kaggle hosts competitions for data analytics
Hosted the 2011 IJCNN Social Network Challenge in late 2010
Provided a graph drawn from a social network
– Nodes correspond to users in the network
– (Directed) edges indicate a following relationship
– Evaluation: determine whether a set of test edges truly occur
– It was later
disclosed that the graph came from Flickr
– Edges are (directed) “friendship” relations between users

Link Prediction
Goal of “link prediction” is to determine which new links will form, given current state of the graph
Many factors can be taken into account:
– Properties of the nodes
– Existing number of links
– Common neighbours between a pair
– Graph distance between the pair
This work used an additional factor:
– Try to match nodes in data set to their own data collection

Data Collection
Competition Data: Kaggle
– 1.1M nodes, 7.2M edges provided as main data
– 8960 “test edges”: 50% true edges (removed from main data)
– 20% of test set held back by Kaggle to evaluate the results
Competitors’ Data: Flickr
– Crawled Flickr social graph (used Python + Curl library)
– 2M nodes and all outgoing edges crawled
– Total of 9.1M nodes, with 163M edges (much bigger data set)
Evaluation: Area Under the Curve (AUC) [defined later]
– Values are True/False, predicted as Positive/Negative
– Ranges from 0.5 (poor, random chance) to 1 (perfect)

Degree distribution
Each node has an in-degree and an out-degree
Skewed distribution, few nodes have high in/out-degree
– Can try to use these as “landmarks”

Seed identification
Try to match a few nodes between the two graphs
Look at nodes with high in-degree (pointed to by many)
– These are likely to be present in both graphs
– Because of the crawling process
Pick highest n (20) degree nodes from Kaggle (K) and Flickr (F)
– Try to match them up to get a “seed” matching
For a pair of nodes v, w in K (F), compute their “cosine similarity”:
– #common neighbours(v,w)/√(#neighbours(v)*#neighbours(w))
Find best matching of nodes in K and F based on cosine similarities
– Initially: manually
– Later: optimization problem (see OR and optimization)
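The two ingredients of seed identification, picking the highest in-degree nodes and scoring a pair by cosine similarity, can be sketched as follows. This is an illustrative sketch rather than the authors' code: the function names and the toy edge list are assumptions, and neighbour sets are treated as plain Python sets.

```python
import math
from collections import Counter

def top_in_degree(edges, n=20):
    """Return the n nodes with the most incoming edges.

    `edges` is an iterable of (source, target) pairs from a directed graph.
    """
    in_degree = Counter(target for _, target in edges)
    return [node for node, _ in in_degree.most_common(n)]

def cosine_similarity(neighbours_v, neighbours_w):
    """#common neighbours / sqrt(#neighbours(v) * #neighbours(w))."""
    v, w = set(neighbours_v), set(neighbours_w)
    if not v or not w:
        return 0.0  # no neighbours means no evidence of similarity
    return len(v & w) / math.sqrt(len(v) * len(w))

# Hypothetical toy graph: node 3 is pointed to by both other nodes.
edges = [(1, 3), (2, 3), (1, 2)]
print(top_in_degree(edges, n=1))
print(cosine_similarity({1, 2, 3, 4}, {3, 4, 5, 6}))
```

Computing this score for every pair among the top-n nodes of each graph gives two similarity matrices, which is the input to the matching step described on the slide.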
Propagation
Now have matched n nodes between Kaggle and Flickr graphs
Maintain a matching, and try to extend based on neighbors
– Find pairs of nodes in Kaggle and Flickr whose similarity is high
– Extend the matching. Iterate.
Some heuristics to accept a new pair into matching:
– Must be at least 4 mapped common neighbors
– Cosine-similarity score must be at least 0.5
– Difference in similarity scores between best and second best must be at least 0.2

Results
Using “ground truth” information:
– After 120,000 mappings in first stage, 99.3% correct
– After second stage, had mappings for 14K out of 17.6K in test set
– Overall accuracy for matched nodes 97.8%
– A coverage of 57% of edges: still need to give answer for rest
– For test edges, accuracy was 95%
Use inferred information to predict links for more of test set
– Look at all possible candidates for node pair in Flickr
– Use these to vote on whether the edge is present or not
– Accept if unanimous vote
– Covers a further 19% of test edges

Machine Learning
Leaves 24% of test edges without a mapping to Flickr
Apply Machine Learning (the original goal of the challenge)
Create a number of “features” for each edge
– In-degree and out-degree of node
– Whether reverse edge exists
– Measures of local graph (number of common neighbors etc.)
Train a “classifier”: see later lectures on classification
– AUC for the classifier approach is ~0.9
– Total AUC for the whole approach on test data is 0.981
– Excellent accuracy for deanonymized nodes, less for rest

Reflections on the paper
The score of 0.981 AUC was enough to win the competition
– Second best was 0.969: www.kaggle.com/c/socialNetwork/leaderboard
– The researchers contacted the organizers to reveal their method
– Were told that this was within the rules
– Read the messageboard for the competition to see other opinions: www.kaggle.com/c/socialNetwork/forums
Lesson: when understanding data, think beyond what you have
– Are there other data sets that can help understand it better?
– Can you learn properties of one data set and transfer to another?
– Can you link two data sets to learn more about the first?
Lesson: removing information does not “anonymize” data

Lessons from Case Studies
Start by describing the data
– How collected, with what attributes, what is dropped
Look at the data
– Plot combinations of attributes to see correlations/distributions
Find models that agree with the data
– Lines of best fit, long tail distributions
Find explanations that are consistent with data and knowledge
Extract observations of interest/value
Make predictions and evaluate the quality