CSCI-599: Social Media Analysis
Kristina Lerman
University of Southern California

[Figure: social media sites such as Bugzilla, essembly, and delicious]

Social media is a platform for people to create, organize, and share information

Global participation in social media, low barrier to entry

Amplification via networks
http://blog.socialflow.com/post/5246404319/breaking-bin-laden-visualizing-the-power-of-a-single

Interesting emergent behavior

Social media elements
• Users create information
  – Text – Blogs, Facebook, Twitter, …
  – Images – Flickr, Instagram, Pinterest, …
  – Videos – YouTube, Vimeo, …
  – Maps – OpenStreetMaps, …
  – Personal profiles – Facebook, LinkedIn, …
  – Structured data – Metaweb, Google Base, …
• Explosion of user-generated content
  – ‘Every hour, 20 hours of video content uploaded to YouTube’ (2009)
  – ‘Every minute, 20+ hours of video content uploaded’ (2013)

Social media elements
• Users organize information
  – Annotate it with descriptive labels – tags
  – Geo-referenced with spatial coordinates – geo-tags
  – Discussion or comments
  – Hierarchically in directories
Tags: Malibu wild fire brush fire PCH Santa Ana fire USA
Tags: station fire Los Angeles california satellite image USA

Social media elements
• Users share information
  – Follow others to receive updates from them
  – Form social networks
  – Create and join special-interest groups

What to expect
• This course will not make you rich
• It will teach you
  – How to ask interesting questions
  – How to answer these questions
  – How to use data analysis to study human behavior
• Give you an opportunity to put it all together in a creative, technical project

Overview of Topics

Phenomenology of social media
[Figure: plot with axis label "% of all tweets"]

Network analysis basics
How to characterize network structure … and properties?
\[
A = \begin{pmatrix}
0 & 1 & 1 & 1 & 1 & 1 & 1 & \cdots \\
1 & 0 & 1 & 1 & 0 & 0 & 0 & \cdots \\
1 & 1 & 0 & 1 & 0 & 0 & 0 & \cdots \\
1 & 1 & 1 & 0 & 0 & 0 & 0 & \cdots \\
1 & 0 & 0 & 0 & 0 & 0 & 1 & \cdots \\
1 & 0 & 0 & 0 & 0 & 0 & 1 & \cdots \\
1 & 0 & 0 & 0 & 1 & 1 & 0 & \cdots \\
\vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \ddots
\end{pmatrix}
\]

Influence and centrality in social networks
Who is important?
• PageRank [Brin et al, 1998]
• Eigenvector centrality [Bonacich, 2001]

Topic analysis basics
Given users’ ratings, what are their preferences?
[Figure: users (Daniel, Sara, Bob), items, and topics (e.g. "TV series, Classic, Action…", "Drama, Family, …", "Marvel’s hero, Classic, Action..."); labels R, U, V, N, D, K]

Sentiment analysis and opinion mining
What does social media say about our mood?
… or our opinions …
… or how we intend to vote?
[Figure: Hourly changes in average positive (top) and negative (bottom) affect in Twitter posts, arrayed by time (X-axis) and day (color). (Golder & Macy, Science 333(2011):1878-1881)]

Information diffusion
• Collect data about real information cascades
• … analyze mathematically
  – Characterize dynamic structure of cascades
• … to answer scientific questions
  – how widely they spread
  – how deeply they spread
[Figure: information cascade on Twitter at times t1, t2, t3; f(t) plotted against time]

Wikipedia analysis
Volunteer contributed semi-structured information about everything
… as a source for structured knowledge
… as a basis for representing anything
[Figure labels: fire, USA, California, wild fire, brush fire, Los Angeles, Malibu, PCH, …]

Search query logs
How can we use billions of search queries
… to make new searches easier …
… or as a knowledge base of intentions?

Social ties and information diffusion
Where do you want to position yourself in a network to maximize access to novel information?
[flickr: GustavoG]

Social ties and link prediction
Who will become friends in the future?
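To make the network analysis and centrality slides above more concrete, here is a minimal Python sketch that is not part of the original slides. It assumes that the visible 7×7 block of the adjacency matrix A is the entire (undirected) network, and the function names `degrees` and `eigenvector_centrality` are illustrative only. Eigenvector centrality is approximated here by power iteration, which is one common way to compute it; PageRank is not implemented.

```python
# Visible 7x7 block of the adjacency matrix A from the slide.
# ASSUMPTION: this block is treated as the whole network; the slide
# elides further rows and columns with "…".
A = [
    [0, 1, 1, 1, 1, 1, 1],
    [1, 0, 1, 1, 0, 0, 0],
    [1, 1, 0, 1, 0, 0, 0],
    [1, 1, 1, 0, 0, 0, 0],
    [1, 0, 0, 0, 0, 0, 1],
    [1, 0, 0, 0, 0, 0, 1],
    [1, 0, 0, 0, 1, 1, 0],
]


def degrees(adj):
    """Degree of each node: the sum of its row in the adjacency matrix."""
    return [sum(row) for row in adj]


def eigenvector_centrality(adj, iterations=200, tol=1e-10):
    """Approximate eigenvector centrality by power iteration.

    Repeatedly replaces each node's score with the sum of its neighbors'
    scores, then rescales so the scores sum to 1. Assumes a connected,
    non-bipartite undirected graph so that the iteration converges.
    """
    n = len(adj)
    x = [1.0 / n] * n
    for _ in range(iterations):
        y = [sum(adj[i][j] * x[j] for j in range(n)) for i in range(n)]
        total = sum(y)
        y = [v / total for v in y]
        if max(abs(a - b) for a, b in zip(x, y)) < tol:
            return y
        x = y
    return x


if __name__ == "__main__":
    print("degrees:", degrees(A))
    scores = eigenvector_centrality(A)
    for node, score in enumerate(scores):
        print(f"node {node}: {score:.3f}")
```

In this small example, node 0 is connected to every other node, so it has both the highest degree and the highest eigenvector centrality. The two measures can disagree on larger networks, which is part of what the "Who is important?" question is about.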

Social bots and spam in social media
[Figure: # tweets over time for a NYTimes post, a Justin Bieber fansite post, and an ad post]

Geo-spatial social data mining
Tags: station fire los angeles california satellite image
Tags: usa malibu california wild fire brush fire PCH Santa Ana
[Figure: usa; ‘california’ boundary; ‘california’ gazetteer; california, los angeles, …, malibu, PCH, …]

Privacy and health in a networked world
How much does Facebook know about you from your Likes and Shares?
How well can we predict your health from your online behavior?

Politics and social media
[Figure: Structures of the aggregate following, retweeting, and mentioning networks of German politicians from around the time of the 2013 federal elections. [Lietz et al. (2014)]]

Predicting the future
What social media behavior best correlates with info we care about?
… and how do we use it to predict what happens next?

Emotional contagion
When it rains in one place, do people elsewhere feel sad?

Detecting contagions
Who do we monitor to detect a contagion before it reaches epidemic proportions?
Social Network Sensors for Early Detection of Contagious Outbreaks, by Nicholas A. Christakis and James H. Fowler

Crowdsourcing
Fast and cheap!
… but is it good?

Social tagging and folksonomies
User-generated annotations …
… aggregated from many users …
… to extract social knowledge
Tags: station fire Los Angeles california satellite image USA
Tags: Malibu wild fire brush fire PCH Santa Ana fire USA
[Figure labels: fire, USA, California, Los Angeles, Malibu, PCH, wild fire, brush fire, …]

Course Details

Where to find Professor Lerman…
• Research Associate Professor, Computer Science Department
  – Outside KAP146 (immediately before or after class)
• Project Leader, Information Sciences Institute, Marina del Rey
  – ISI Rm. 932 (by appointment)
  – 310-448-8714
• Email: lerman@isi.edu

TA
• Farshad Kooti
  – Office Hours: TBD
  – Location: TBD
  – Email: kooti@usc.edu

Course Web Pages
• Blackboard
  – blackboard.usc.edu
  – Your USC login works on this account
  – If you are registered for 599, you will have access
• All course material will be posted on the site page
• Please check for announcements and read the discussion board on a regular basis
• All questions should be posted (not emailed!)
  – If you know the answer to a posted question, please try to provide helpful suggestions

Readings
• Posted on the site each week
  – You can read them online or print them
• Please read all required readings before the class in which they are covered

Slides
• Available online by midnight of the day before the lecture
• These are not intended as a replacement for the lecture
• You can print these out and make notes on them
  – I suggest you print 6 slides per page to save paper
  – Print double-sided

Prerequisites & Recommendations
• Prerequisites
  – No formal prerequisites, but the recommended courses are a plus
• Recommended Courses
  – CS561 – Introduction to AI
  – CS573 – Advanced AI
  – Databases
  – Probability and statistics
  – Networks or Graph Theory

Grading
• Quizzes: 30%
• Presentation of 1 paper: 20%
• Course project: 45%
• Class participation: 5%

Work Load – Quizzes
• Importance – 30% of the final score
• Given on Wednesdays
• No make-ups!
• There will be 11 quizzes, but only the 10 best scores will count toward the grade
• Each quiz will have 2-3 questions, nothing complicated, just to make sure you keep up with the readings

Work Load – Paper Presentation
• Importance – 20% of the final score
• One paper from the syllabus
• Length – 25 min
• Written summary: (1 page) 5%
  – Introduction of the problem
  – Contribution of the paper
  – Discussion of your new ideas or other improvements
• Presentation slides: (12-15 slides) 5%
  – Introduction of the problem
  – Contribution of the paper
  – Technical details
• In-class presentation: 10%

Work Load – Paper Presentation (continued)
• Timeline
  – Sign up for a date to present (goal is 2 students per class)
  – First come, first served
  – Sign up using the Google Doc
• Late Policy
  – Submit written summary 1 week before presentation date
  – Submit presentation slides 3 days before presentation date
  – Submission of materials by EMAIL to the assigned instructor
  – Show up for your presentation!
  – No extensions!

Work Load – Projects
• Importance – 40% of the final score
• Grade breakdown
  – Proposal: 5%
    • Due October 8
  – Mid-term progress report: 5%
    • Due November 5
  – Final report: 15%
    • Due November 24
  – Presentation: 15%

Course Projects
• A research project based on what you have learned in class
• Be creative!
• An ideal project is one that you could publish a paper about
  – Empirical validation is important – show that your method beats the state of the art

Example Projects
• Personalized news search
• Friends mood ring – sentiment analysis of friends’ posts
• Twitter spam detector
• Twitter bot creator
• Anything that builds on or extends the ideas and methods that we cover in class … be creative!

Data sets
• Social networks
  – Stanford Large Network Dataset Collection (SNAP)
  – http://snap.stanford.edu/data/
• Information diffusion
  – Digg data set
  – http://www.isi.edu/~lerman/downloads/digg2009.html
  – Twitter data set
  – http://www.isi.edu/~lerman/downloads/twitter/twitter2010.html
• Social annotations
  – Personal directories and tags on Flickr
  – http://www.isi.edu/~lerman/downloads/flickr/flickr_taxonomies.html
  – Email me for password

Data sets (continued)
• Microblogs (Twitter)
  – TREC Tweets2011 Dataset, 16 million tweets
  – http://trec.nist.gov/data/tweets/
• Weblogs
  – ICWSM 2011 Spinn3r Dataset, 386 million blog posts
  – http://icwsm.org/data/index.php
• Forums
  – ICWSM boards.ie Forums Dataset, 10 years of discussion
  – http://www.icwsm.org/2012/submitting/datasets/

Project Proposal
• Proposal should include:
  – Title
    • This should be the title of your final paper
  – Authors
    • The people that will be involved in the project (1 or 2)
  – A description of the project
    • Be sure to state what you think is new and innovative about your project
    • The use of pictures and screen shots is encouraged
  – The date you would like to present your project
  – Length – 2 pages (max)
  – Format: LaTeX

Project Paper
• Due at the beginning of the last class
• Length: 5 pages
• Format: LaTeX – same as project proposal
• Content:
  – Title block
  – Abstract
  – Body
  – References

Title Block
• Title of the paper
  – Choose a title that highlights the contribution of your paper
  – Full names of the authors
  – Address includes affiliation:
    • University of Southern California
    • Computer Science Department
    • Los Angeles, CA 90089
    • Your email address

Abstract
• An abstract is a 100-250 word summary of your paper
  – Identify the problem you are solving
  – Describe your solution
  – Summarize the results that support your solution
• An abstract is NOT the same as the introduction
  – NEVER repeat the abstract in the introduction
  – You might restate portions of it in the paper, but using different words

Body of the Paper
• All papers should have
  – Introduction
    • What is the problem you are solving? Why is it important?
    • What is the state of the art? What are its drawbacks?
    • What is your contribution?
  – Conclusion/Discussion
    • Summarizes the contribution, including any conclusions and directions for future research
• Other sections might include:
  – Motivating application or example
  – Approach
  – Empirical results or Evaluation
  – Related work
• Use pictures, screen shots, and diagrams

References
• References to both related work and work that you build on
• Use the "named" bibliography style [Ambite, 2004].
• Bibliography
  – Ambite, Jose Luis, 2004. Planning by Rewriting, Journal of Artificial Intelligence, Kluwer Academic Publishers, 4(2), pp. 27–34.

Class Presentation
• Schedule will be posted under assignments/projects
  – Individual presentations – 15 minutes + 5 minutes for questions
  – Joint – 20 minutes (roughly 10 minutes each)
• Use PowerPoint
  – Slides submitted online by 3pm on the first day of presentations
• Length
  – No more than 1 slide per minute
• Content – summary of your project (not all the details): make me want to read the paper
• Format
  – Use pictures
  – No font smaller than 18pt
• Practice your talk – time it!

Cheating
• Not tolerated!
• No second chances – all infractions will be reported
  – First offense is automatic failure in the class
  – Second offense is suspension from the University
• Examples:
  – Turning in someone else’s work
  – Copying from someone else during a quiz
  – Doing a project that uses someone else’s work without giving them credit

Cell Phone Use
• If it makes noise, turn it off or to vibrate mode in class
• No texting! No web surfing!
• Pay attention! You are paying for the privilege of attending the course – make the most of it

When the Course is Over
• Directed research (1-2 MS or PhD students)
• M.S. Thesis
• Summer interns (MS or PhD)
• Research Assistantships (PhD students)
  – We can also recommend you for positions in other groups
• Teaching Assistantships (for PhD students)
• Recommendation letters (anyone who gets at least an A-)
• Positions at companies affiliated with USC
  – Other companies are often looking for students