CSCI-599: Social Media Analysis. Kristina Lerman, University of Southern California

[Figure: "social media" shown with example sites: Bugzilla, essembly, delicious]
Social media is a platform for people to create, organize, and share information
Global participation in social media, low barrier to entry
Amplification via networks
http://blog.socialflow.com/post/5246404319/breaking-bin-laden-visualizing-the-power-of-a-single
Interesting emergent behavior
Social media elements
• Users create information
– Text – Blogs, Facebook, Twitter, …
– Images – Flickr, Instagram, Pinterest …
– Videos – YouTube, Vimeo, …
– Maps – OpenStreetMaps, …
– Personal profiles – Facebook, LinkedIn, …
– Structured data – Metaweb, Google Base, …
• Explosion of user-generated content
– "Every hour, 20 hours of video content uploaded to YouTube" (2009)
– "Every minute, 20+ hours of video content uploaded" (2013)
Social media elements
• Users organize information
– Annotate it with descriptive labels – tags
– Geo-referenced with spatial coordinates – geo-tags
– Discussion or comments
– Hierarchically in directories
Tags: Malibu, wild fire, brush fire, PCH, Santa Ana, fire, USA
Tags: station, fire, Los Angeles, california, satellite, image, USA
Social media elements
• Users share information
– Follow others to receive updates from them
→ Form social networks
– Create and join special-interest groups
What to expect
• This course will not make you rich
• It will teach you
– How to ask interesting questions
– How to answer these questions
→ How to use data analysis to study human behavior
• It will give you an opportunity to put it all together in a creative, technical project
Overview of Topics
Phenomenology of social media
[Figure: plot with axis label "% of all tweets"]
Network analysis basics
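How to characterize network structure … and properties?
$$
A = \begin{pmatrix}
0 & 1 & 1 & 1 & 1 & 1 & 1 & \cdots \\
1 & 0 & 1 & 1 & 0 & 0 & 0 & \cdots \\
1 & 1 & 0 & 1 & 0 & 0 & 0 & \cdots \\
1 & 1 & 1 & 0 & 0 & 0 & 0 & \cdots \\
1 & 0 & 0 & 0 & 0 & 0 & 1 & \cdots \\
1 & 0 & 0 & 0 & 0 & 0 & 1 & \cdots \\
1 & 0 & 0 & 0 & 1 & 1 & 0 & \cdots \\
\vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \ddots
\end{pmatrix}
$$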
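As a rough illustration of what "characterizing network structure" can mean, the sketch below takes only the visible 7×7 block of the adjacency matrix A from this slide (the entries hidden behind "…" are not guessed) and computes a few basic properties. The function names are illustrative, not part of the course material.

```python
# Visible 7x7 block of the adjacency matrix A from the slide.
# Entries shown as "…" on the slide are left out, not guessed.
A = [
    [0, 1, 1, 1, 1, 1, 1],
    [1, 0, 1, 1, 0, 0, 0],
    [1, 1, 0, 1, 0, 0, 0],
    [1, 1, 1, 0, 0, 0, 0],
    [1, 0, 0, 0, 0, 0, 1],
    [1, 0, 0, 0, 0, 0, 1],
    [1, 0, 0, 0, 1, 1, 0],
]


def is_symmetric(adj):
    """True if the matrix describes an undirected network."""
    n = len(adj)
    return all(adj[i][j] == adj[j][i] for i in range(n) for j in range(n))


def degrees(adj):
    """Number of neighbors of each node (row sums)."""
    return [sum(row) for row in adj]


def num_edges(adj):
    """Each undirected edge appears twice in a symmetric matrix."""
    return sum(degrees(adj)) // 2


def density(adj):
    """Fraction of possible undirected edges that are present."""
    n = len(adj)
    return num_edges(adj) / (n * (n - 1) / 2)


print(is_symmetric(A))   # True
print(degrees(A))        # [6, 3, 3, 3, 2, 2, 3]
print(num_edges(A))      # 11
print(density(A))        # 11/21, about 0.524
```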
Influence and centrality in social networks
Who is important?
PageRank
[Brin et al, 1998]
Eigenvector centrality
[Bonacich, 2001]
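A minimal sketch of both measures, using plain power iteration on the visible 7×7 block of the adjacency matrix from the network analysis slide. This is one common way to compute them, not necessarily the formulation in the cited papers. The damping factor 0.85 and the iteration count are assumptions, and the undirected matrix is treated as if every edge were a pair of directed links.

```python
# Visible 7x7 block of the adjacency matrix from the earlier slide.
A = [
    [0, 1, 1, 1, 1, 1, 1],
    [1, 0, 1, 1, 0, 0, 0],
    [1, 1, 0, 1, 0, 0, 0],
    [1, 1, 1, 0, 0, 0, 0],
    [1, 0, 0, 0, 0, 0, 1],
    [1, 0, 0, 0, 0, 0, 1],
    [1, 0, 0, 0, 1, 1, 0],
]


def eigenvector_centrality(adj, iterations=100):
    """Power iteration: a node is important if its neighbors are important."""
    n = len(adj)
    x = [1.0] * n
    for _ in range(iterations):
        y = [sum(adj[i][j] * x[j] for j in range(n)) for i in range(n)]
        norm = max(y)                      # rescale so the top score is 1
        x = [v / norm for v in y]
    return x


def pagerank(adj, damping=0.85, iterations=100):
    """Power iteration for PageRank; adj[j][i] == 1 means j links to i."""
    n = len(adj)
    out_degree = [sum(row) for row in adj]
    rank = [1.0 / n] * n
    for _ in range(iterations):
        new_rank = [(1.0 - damping) / n] * n
        for j in range(n):
            if out_degree[j] == 0:
                # Dangling node: spread its rank evenly over all nodes.
                for i in range(n):
                    new_rank[i] += damping * rank[j] / n
            else:
                for i in range(n):
                    if adj[j][i]:
                        new_rank[i] += damping * rank[j] / out_degree[j]
        rank = new_rank
    return rank


centrality = eigenvector_centrality(A)
ranks = pagerank(A)
print(centrality.index(max(centrality)))  # index of the most central node
print(ranks.index(max(ranks)))            # index of the top-ranked node
```

In this small example both measures single out the first node, which is connected to every other node.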
Topic analysis basics
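Given users’ ratings, what are their preferences?
[Figure: user–item rating matrix R (N users × D items) shown with a user–topic matrix U (N × K) and a topic–item matrix V (K × D). Example users: Daniel, Sara, Bob. Example topics: "TV series, Classic, Action…", "Drama, Family, …", "Marvel’s hero, Classic, Action..."]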
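One way to read the figure is as a factorization of the rating matrix R into a user–topic matrix U and a topic–item matrix V. The sketch below fits such a factorization by stochastic gradient descent on a tiny, made-up set of ratings. The ratings, the number of topics K = 2, and the learning-rate and regularization values are all assumptions for illustration; the course may cover a different model.

```python
import random

# Hypothetical ratings: (user index, item index) -> rating on a 1-5 scale.
RATINGS = {
    (0, 0): 5, (0, 1): 4, (0, 3): 1,
    (1, 0): 4, (1, 2): 1, (1, 3): 2,
    (2, 1): 1, (2, 2): 5, (2, 3): 4,
}


def factorize(ratings, n_users, n_items, k=2, lr=0.02, reg=0.02,
              epochs=500, seed=0):
    """Fit R ~ U V on the observed ratings only.

    U is n_users x k (user-topic), V is k x n_items (topic-item).
    """
    rng = random.Random(seed)
    U = [[rng.random() for _ in range(k)] for _ in range(n_users)]
    V = [[rng.random() for _ in range(n_items)] for _ in range(k)]
    for _ in range(epochs):
        for (u, i), r in ratings.items():
            pred = sum(U[u][t] * V[t][i] for t in range(k))
            err = r - pred
            for t in range(k):
                u_t, v_t = U[u][t], V[t][i]
                # Gradient step on squared error with L2 regularization.
                U[u][t] += lr * (err * v_t - reg * u_t)
                V[t][i] += lr * (err * u_t - reg * v_t)
    return U, V


def rmse(ratings, U, V):
    """Root-mean-square error over the observed ratings."""
    k = len(V)
    total = 0.0
    for (u, i), r in ratings.items():
        pred = sum(U[u][t] * V[t][i] for t in range(k))
        total += (r - pred) ** 2
    return (total / len(ratings)) ** 0.5


U0, V0 = factorize(RATINGS, 3, 4, epochs=0)   # random starting point
U, V = factorize(RATINGS, 3, 4)               # after training
print(rmse(RATINGS, U0, V0), rmse(RATINGS, U, V))
```

Each row of U can then be read as a user's affinity for the K latent topics, and a missing rating can be predicted from the corresponding row of U and column of V.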
Sentiment analysis and opinion mining
What does social media say about our mood?
… or our opinions …
… or how we intend to vote?
Hourly changes in average positive (top) and negative (bottom) affect in Twitter posts, arrayed by time (X-axis) and day (color). (Golder & Macy, Science 333(2011):1878-1881)
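To make "average positive and negative affect per hour" concrete, here is a toy sketch that scores posts against two small word lists and averages the scores by hour. The word lists and posts are invented for illustration, and this simple word-counting approach is an assumption, not a description of the method used in the cited study.

```python
from collections import defaultdict

# Hypothetical affect lexicons (illustration only).
POSITIVE = {"happy", "great", "love"}
NEGATIVE = {"sad", "awful", "hate"}

# Hypothetical posts as (hour of day, text).
POSTS = [
    (8, "love this great morning"),
    (8, "so sad today"),
    (22, "awful day hate it"),
]


def affect(text):
    """Fraction of words in the text that are positive / negative."""
    tokens = text.lower().split()
    if not tokens:
        return 0.0, 0.0
    pos = sum(t in POSITIVE for t in tokens) / len(tokens)
    neg = sum(t in NEGATIVE for t in tokens) / len(tokens)
    return pos, neg


def hourly_affect(posts):
    """Average (positive, negative) affect of posts for each hour."""
    by_hour = defaultdict(list)
    for hour, text in posts:
        by_hour[hour].append(affect(text))
    return {
        hour: (
            sum(p for p, _ in scores) / len(scores),
            sum(n for _, n in scores) / len(scores),
        )
        for hour, scores in sorted(by_hour.items())
    }


for hour, (pos, neg) in hourly_affect(POSTS).items():
    print(hour, round(pos, 3), round(neg, 3))
```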
Information diffusion
Collect data about real information cascades
… analyze mathematically
… to answer scientific questions
Characterize dynamic structure of cascades
– how widely they spread
– how deeply they spread
[Figure: information cascade on Twitter at times t1, t2, t3; plot of f(t) against time]
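A small sketch of how "how widely" and "how deeply" might be measured once a cascade has been collected as a tree of who-received-it-from-whom edges. The definitions used here (depth = longest chain from the source, width = largest number of nodes at any one level) are one reasonable choice and are assumptions, as is the example cascade.

```python
from collections import defaultdict, deque


def cascade_stats(root, edges):
    """Size, depth, and width of a cascade tree.

    edges is a list of (parent, child) pairs: child adopted from parent.
    """
    children = defaultdict(list)
    for parent, child in edges:
        children[parent].append(child)

    level_counts = defaultdict(int)        # depth level -> number of nodes
    queue = deque([(root, 0)])
    while queue:                           # breadth-first walk from the source
        node, level = queue.popleft()
        level_counts[level] += 1
        for child in children[node]:
            queue.append((child, level + 1))

    return {
        "size": sum(level_counts.values()),
        "depth": max(level_counts),            # deepest level reached
        "width": max(level_counts.values()),   # widest single level
    }


# Hypothetical cascade: "a" posts, three followers repost, one chain continues.
EDGES = [("a", "b"), ("a", "c"), ("a", "d"), ("b", "e"), ("e", "f")]
print(cascade_stats("a", EDGES))  # {'size': 6, 'depth': 3, 'width': 3}
```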
Wikipedia analysis
Volunteer-contributed, semi-structured information about everything
… as a source for structured knowledge
… as a basis for representing anything
[Figure: diagram relating the concepts fire, wild fire, brush fire, USA, California, Los Angeles, Malibu, PCH, …]
Search query logs
How can we use billions of search queries
… to make new searches easier …
… or as a knowledge base of intentions?
Social ties and information diffusion
Where do you want to position yourself in a network to maximize access to novel information?
[flickr: GustavoG]
Social ties and link prediction
Who will become friends in the future?
[flickr: GustavoG]
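One simple baseline for this question scores each pair of people who are not yet friends by how much their friend circles overlap (the Jaccard coefficient of their neighbor sets). The sketch below is a minimal illustration on an invented four-person network; the names and the choice of this particular score are assumptions.

```python
from itertools import combinations

# Hypothetical friendship network: person -> set of current friends.
FRIENDS = {
    "ann": {"bob", "cat"},
    "bob": {"ann", "cat", "dan"},
    "cat": {"ann", "bob"},
    "dan": {"bob"},
}


def jaccard_scores(neighbors):
    """Score every non-adjacent pair by shared friends / all friends."""
    scores = {}
    for u, v in combinations(sorted(neighbors), 2):
        if v in neighbors[u]:
            continue                      # already friends, nothing to predict
        union = neighbors[u] | neighbors[v]
        common = neighbors[u] & neighbors[v]
        if union:
            scores[(u, v)] = len(common) / len(union)
    return scores


# Higher scores suggest a more likely future friendship.
for pair, score in sorted(jaccard_scores(FRIENDS).items()):
    print(pair, score)
```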
Social bots and spam in social media
[Figure: number of tweets over time for three posts: a NYTimes post, a Justin Bieber fansite post, and an ad post]
Geo-spatial social data mining
Tags: station, fire, los angeles, california, satellite, image, usa
Tags: malibu, california, wild fire, brush fire, PCH, Santa Ana, usa
[Figure: ‘california’ boundary and ‘california’ gazetteer, with entries california, los angeles, malibu, PCH, …]
Privacy and health in networked world
How much does Facebook know about you from your Likes and Shares?
How well can we predict your health from your online behavior?
Politics and social media
Structures of the aggregate following, retweeting, and mentioning networks of German politicians from around the time of the 2013 federal elections. [Lietz et al. (2014)]
Predicting the future
What social media behavior best correlates with info we care about?
… and how do we use it to predict what happens next?
Emotional contagion
When it rains in one place, do people elsewhere feel sad?
Detecting contagions
Who do we monitor to detect a contagion before it reaches epidemic proportions?
Social Network Sensors for Early Detection of Contagious Outbreaks, by Nicholas A. Christakis and James H. Fowler
Crowdsourcing
Fast and cheap!
… but is it good?
Social tagging and folksonomies
User-generated annotations …
… aggregated from many users …
… to extract social knowledge
Tags: station, fire, Los Angeles, california, satellite, image, USA
Tags: Malibu, wild fire, brush fire, PCH, Santa Ana, fire, USA
[Figure: diagram relating the extracted concepts fire, wild fire, brush fire, USA, California, Los Angeles, Malibu, PCH, …]
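A first step toward "extracting social knowledge" from tags is simply to aggregate them: count how often each tag is used and which tags appear together. The sketch below does this for the two tag sets shown on this slide. Lowercasing the tags and using raw co-occurrence counts are simplifying assumptions; building an actual concept hierarchy takes more than this.

```python
from collections import Counter
from itertools import combinations

# The two tag sets from the slide, one per annotated photo.
PHOTO_TAGS = [
    ["station", "fire", "Los Angeles", "california", "satellite", "image", "USA"],
    ["Malibu", "wild fire", "brush fire", "PCH", "Santa Ana", "fire", "USA"],
]


def aggregate_tags(photo_tags):
    """Count tag usage and tag-pair co-occurrence across photos."""
    tag_counts = Counter()
    pair_counts = Counter()
    for tags in photo_tags:
        normalized = sorted({t.lower() for t in tags})   # dedupe per photo
        tag_counts.update(normalized)
        pair_counts.update(combinations(normalized, 2))  # alphabetical pairs
    return tag_counts, pair_counts


tag_counts, pair_counts = aggregate_tags(PHOTO_TAGS)
print(tag_counts.most_common(2))       # 'fire' and 'usa' appear on both photos
print(pair_counts[("fire", "usa")])    # 2
```

With many more photos and users, frequently co-occurring pairs such as a broad tag and a more specific one become candidates for relations between concepts.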
Course Details
Where to find Professor Lerman…
• Research Associate Professor
Computer Science Department
Outside KAP146 (Immediately before or after class)
• Project Leader
Information Sciences Institute
Marina del Rey
ISI Rm. 932 (by appointment)
310-448-8714
• Email: lerman@isi.edu
TA
• Farshad Kooti
– Office Hours: TBD
– Location: TBD
– Email: kooti@usc.edu
Course Web Pages
• Blackboard – blackboard.usc.edu
– Your USC login works on this account
– If you are registered for 599, you will have access
• All course material will be posted on the site page
• Please check for announcements and read the discussion board on a regular basis
• All questions should be posted (not emailed!)
– If you know the answer to a posted question, please try to provide helpful suggestions
Readings
• Posted on the site each week
– You can read them online or print them
• Please read all required readings before the class in which they are covered
Slides
• Available online by midnight of the day before the lecture
• These are not intended as a replacement for the lecture
• You can print these out and make notes on them
– I suggest you print 6 slides per page to save paper
– Print double-sided
Prerequisites & Recommendations
• Prerequisites
– No formal prerequisites, but the recommended courses are a plus
• Recommended Courses
– CS561 – Introduction to AI
– CS573 – Advanced AI
– Databases
– Probability and statistics
– Networks or Graph Theory
Grading
• Quizzes: 30%
• Presentation of 1 paper: 20%
• Course project: 45%
• Class participation: 5%
Work Load – Quizzes
• Importance – 30% of the final score
• Given on Wednesdays
• No make-ups!
• There will be 11 quizzes, but only the 10 best scores will count toward the grade
• Each quiz will have 2-3 questions, nothing complicated, just to make sure you keep up with the readings
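The "10 best of 11" rule can be worked through as simple arithmetic. The sketch below assumes that every quiz is scored as a fraction between 0 and 1, that a missed quiz counts as 0, and that the kept quizzes are weighted equally; none of these details are stated on the slide.

```python
def quiz_contribution(scores, keep=10, weight=30.0):
    """Points (out of `weight`) earned from the best `keep` quiz scores.

    scores: one fraction in [0, 1] per quiz; a missed quiz is 0.
    """
    best = sorted(scores, reverse=True)[:keep]   # drop the lowest score(s)
    return weight * sum(best) / keep


# Hypothetical student: nine perfect quizzes, one half-right, one missed.
scores = [1.0] * 9 + [0.5, 0.0]
print(quiz_contribution(scores))  # 28.5 of the 30 quiz points
```

Under these assumptions, the missed quiz is the one that gets dropped, so it costs nothing.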
Work Load – Paper Presentation
• Importance – 20% of the final score
• One paper from the syllabus
• Length: 25 min
• Written summary: (1 page) 5%
– Introduction of the problem
– Contribution of the paper
– Discussion of your new ideas or other improvements
• Presentation slides: (12-15 slides) 5%
– Introduction of the problem
– Contribution of the paper
– Technical details
• In-class presentation: 10%
Work Load – Paper Presentation
• Timeline
– Sign up for date to present (goal is 2 students per class)
– First come, first served
– Sign up using the Google Doc
• Late Policy
– Submit written summary 1 week before presentation date
– Submit presentation slides 3 days before presentation date
– Submission of materials by EMAIL to the assigned instructor
– Show up for your presentation!
– No extensions!
Work Load – Projects
• Importance – 40% of the final score
• Grade breakdown
– Proposal: 5%
• Due October 8
– Mid-term progress report: 5%
• Due November 5
– Final report: 15%
• Due November 24
– Presentation: 15%
Course Projects
• A research project based on what you have learned in class
• Be creative!
• An ideal project is one that you could publish a paper about
– Empirical validation is important – show that your method beats the state of the art
Example Projects
• Personalized news search
• Friends mood ring – sentiment analysis of friends' posts
• Twitter spam detector
• Twitter bot creator
• Anything that builds on or extends the ideas and
methods that we cover in class…be creative!
Data sets
• Social networks
– Stanford Large Network Dataset Collection (SNAP)
– http://snap.stanford.edu/data/
• Information diffusion
– Digg data set
– http://www.isi.edu/~lerman/downloads/digg2009.html
– Twitter data set
– http://www.isi.edu/~lerman/downloads/twitter/twitter2010.html
• Social annotations
– Personal directories and tags on Flickr
– http://www.isi.edu/~lerman/downloads/flickr/flickr_taxonomies.html
– Email me for password
Data sets (continued)
• Microblogs (Twitter)
– TREC Tweets2011 Dataset, 16 million tweets
– http://trec.nist.gov/data/tweets/
• Weblogs
– ICWSM 2011 Spinn3r Dataset, 386 million blog posts
– http://icwsm.org/data/index.php
• Forums
– ICWSM boards.ie Forums Dataset, 10 years of discussion
– http://www.icwsm.org/2012/submitting/datasets/
Project Proposal
• Proposal should include:
– Title
• This should be the title of your final paper
– Authors
• The people that will be involved in the project (1 or 2)
– A description of the project
• Be sure to state what you think is new and innovative about your project
• The use of pictures and screen shots is encouraged
– The date you would like to present your project
– Length – 2 pages (max)
– Format: LaTeX
Project Paper
• Due at the beginning of the last class
• Length: 5 pages
• Format: LaTeX – same as project proposal
• Content:
– Title block
– Abstract
– Body
– References
Title Block
• Title of the paper
– Choose a title that highlights the contribution of your paper
– Full names of the authors
– Address includes affiliation:
• University of Southern California
• Computer Science Department
• Los Angeles, CA 90089
• Your email address
Abstract
• An abstract is a 100-250 word summary of your paper
– Identify the problem you are solving
– Describe your solution
– Summarize the results that support your solution
• An abstract is NOT the same as the introduction
– NEVER repeat the abstract in the introduction
– You might restate portions of it in the paper, but using different words
Body of the Paper
• All papers should have
– Introduction
• What is the problem you are solving? Why is it important?
• What is the state of the art? What are its drawbacks?
• What is your contribution?
– Conclusion/Discussion
• Summarizes the contribution, including any conclusions and directions for future research
• Other sections might include:
– Motivating application or example
– Approach
– Empirical results or Evaluation
– Related work
• Use pictures, screen shots, and diagrams
References
• References to both related work and work that you build on
• Use the “named” bibliography style [Ambite, 2004].
• Bibliography
– Ambite, Jose Luis, 2004. Planning by Rewriting, Journal of Artificial Intelligence, Kluwer Academic Publishers, 4(2), pp. 27–34.
Class Presentation
• Schedule will be posted under assignments/projects
– Individual presentations – 15 minutes + 5 minutes for questions
– Joint – 20 minutes (roughly 10 minutes each)
• Use PowerPoint
– Slides submitted online by 3pm on the first day of presentations
• Length
– No more than 1 slide per minute
• Content
– Summary of your project (not all the details): make me want to read the paper
• Format
– Use pictures
– No font smaller than 18pt
• Practice your talk – time it!
Cheating
• Not tolerated!
• No second chances – all infractions will be reported
– First offense is automatic failure in the class
– Second offense is suspension from the University
• Examples:
– Turning in someone else’s work
– Copying from someone else during a quiz
– Doing a project that uses someone else’s work without giving them credit
Cell Phone Use
• If it makes noise, turn it off or to vibrate mode in class
• No texting! No web surfing!
• Pay attention! You are paying for the privilege of attending the course – make the most of it
When the Course is Over
• Directed research (1-2 MS or PhD students)
• M.S. Thesis
• Summer interns (MS or PhD)
• Research Assistantships (PhD students)
– We can also recommend you for positions in other groups
• Teaching Assistantships (for PhD students)
• Recommendation letters (anyone that gets at least an A-)
• Positions at companies affiliated with USC
– Other companies are often looking for students