ppt - Computer Science - University of Massachusetts Lowell

advertisement
Predicting Flu Trends using Twitter Data
SNEFT – Social Network Enabled Flu Trends
Harshavardhan Achrekar [1]
Avinash Gandhe [2]
Ross Lazarus [3]
Ssu-Hsin Yu[2]
Benyuan Liu [1]
Workshop on Cyber-Physical Networking Systems
in conjunction with INFOCOM 2011
CPNS 2011, Shanghai, China
[1]
[2]
[3]
Computer Science Department, University of Massachusetts Lowell
Scientific Systems Company Inc , Woburn, MA
Department of Population Medicine - Harvard Medical School
Outline
Background
Related Work
Our Approach
SNEFT System Architecture
Twitter Dataset Description
Twitter Dataset Analysis
Detection and Prediction
Conclusion
Seasonal flu
•
•
•
•
•
•
•
Influenza (flu) is contagious respiratory
illness caused by influenza viruses.
Seasonal - wave occurrence pattern.
5 to 20 % of population gets flu
≈ 200,000 people are hospitalized from flu
related complications.
36,000 people die from flu every year in
USA.
worldwide death toll is 250,000 to 500,000.
Epidemiologists use early detection of
disease outbreak to reduce no. of people
affected.
Related Work :- Google Flu Trends
•
•
•
•
•
Certain Web Search terms are good
Indicators of flu activity.
Google Trend uses Aggregated
search data on flu indicators.
Estimate current flu activity around
the world in real time.
Accuracy of data
{not every person who searches for
“Flu” is sick}
From example :- Google Flu Trend
detects increased flu activity two
weeks before CDC.
Link:- www.google.com/flutrends
CDC stands for Center for Disease Control
Our Approach
•
•
•
•
•
•
OSN emerged as popular platform for people
to make connections,share information and
interact.
OSN represent a previously untapped data
source for detecting onset of an epidemic and
predicting its spread.
{“i am down with flu”, “got flu.”} msg exchange
between users provide early ,robust
predictions.
Twitter/Facebook mobile users tweet/posts
updates with their geo-location updates. helps
in carrying out refined analysis.
User demographics like age, gender, location,
affiliated networks.,etc can be inferred from
data.
snapshot of current epidemic condition and
preview on what to expect next on daily or
hourly bases.
User Population (in millions)
FaceBook:- 400 ,
Myspace:- 200 ,
Twitter:- 80
System Architecture of SNEFT
Data Collection Engine
downloader
ILI
Data
ARMA Model
ILI
Prediction
Novelty
Detector
Flu
Warning
Filter /
Predictor
State
Estimate
Internet
crawler
OSN
Data
OSN models
Math models
ILI stands for Influenza-Like Illness
OSN Data Collection
Design of the Twitter data collection engine / Crawler
Twitter Data Set
Real Time Response Stream
fetches entries relevant to searched
keyword having the tweets in
reverse-time order.
Data collection active from October
18, 2009 until present.
Until October 23, 2010 we have
collected 4.7 million tweets from 1.5
million unique users.
CDC’s Inactive ILI period last from
May 23, 2010 to October 9, 2010,
Results in 31 weeks of CDC data
available for comparison with
Twitter dataset.
(Power outage on our data collection site resulted in no data being collected from January
18, 2010 till January 20, 2010.)
Spatio Temporal Database for Twitter Data Set
Crawler uses Streaming Real time Search Application Programming Interface (API)
to fetch data at regular time intervals. A tweet has the
Twitter User Name,
the Post with status id
Time stamp attached with each post.
From Twitter’s username we can get profile details attached to every user which
include
number of followers,
number of friends,
his/her profile creation date,
location {public or private from the profile page or mobile client}
with status updates count
User’s current location is passed as an input to Google’s location based web
services to get geo-location codes (i.e., latitude and longitude) along with the
country, state, city with a certain accuracy scale.
Twitter Data Set Analysis
In our Twitter dataset
30.6% users are from USA,
41.3% users are outside USA
28.1% users have not published their location details.
Status posting times (tweet timestamp in GMT) are converted to the local timezone of the
individual profile. Day light saving are applied within required time frame.
State-wise Distribution of USA users on Twitter for flu postings
Twitter Data Set Analysis
Hourly Twitter usage pattern in USA
Average daily Twitter usage within a week
Figure shows percentage of unique Twitter users who mentioned about flu in tweets
at different hours of the day.
The hourly activity patterns observed at different hours of the day are much to our
expectations, with high traffic volumes being witnessed from late morning to early
afternoon and less tweet posted from midnight to early morning, reflecting people’s
work and rest hours within a day.
Average daily usage pattern within a week suggests a trend on OSN sites with more
people discussing about flu on weekdays than on weekends.
Twitter Data Set Cleaning
Retweets: A retweet is a post originally made by one user that is forwarded by
another user.
Syndrome elapsed time: An individual patient may have multiple encounters
associated with a single episode of illness . To avoid duplication the first encounter
for each patient within any single syndrome group is reported to CDC, but
subsequent encounters with the same syndrome are not reported as new episodes
until more than six weeks has elapsed since the most recent encounter in the same
syndrome. We call it syndrome elapsed time.
Remove retweets and tweets from the same user within a certain syndrome elapsed
time, since they do not indicate new ILI cases.
Twitter Data Set Analysis
Number of Twitter users per week versus percentage of weighted ILI visits by CDC
Increase in number of users tweeting about flu related activity is accompanied by
increase in the percentage of weighted ILI visits reported by CDC in the same week.
Twitter Data Set Analysis
Percentage of weighted ILI visits by CDC, original Scaled Twitter dataset and
Scaled Twitter dataset filtered by retweets and syndrome elapsed time of one week
displayed on weekly basis
Plot sketches CDC’s percentage of ILI visits to physician with the original Twitter data and
Filtered Twitter data both normalized to the scale of CDC data for a time-span of thirty one
week when our data collection mechanism was active and CDC was publishing their
reports online.
Twitter Improves Prediction of Influenza Data
where t indexes weeks, y(t) denotes the percentage of physician visits due to ILI in
week t, u(t) represents the number of unique Twitter users with flu related tweets in
week t, and e(t) is a sequence of independent random variables. c is a constant term
to account for offset.
To predict the flu cases in week t using the ARX model based on the CDC data with
2 weeks of delay and/or the up-to-date Twitter data, we use
where ˆy(t) represents the predicted CDC data in week t.
Cross Validation Results
The results of 5-fold cross validation are given in Table II. According to 5-fold cross
validation results, the model corresponding to m = 0 and n = 3 has the lowest RMSE.
Predicted Influenza Data (percentage of weighted ILI visits)
Twitter dataset normalized to the same scale as CDC data along with its predicted
values for percentage off weighted ILI visits (5-fold cross validation)
Conclusion and Future Work
Investigated the use of a previously untapped data source, namely, messages posted on
Twitter to track and predict influenza epidemic situation in the real world.
Results show that the number of flu related tweets are highly correlated with ILI activity
in CDC data with a Pearson correlation coefficient of 0.9846.
Build auto-regression models to predict number of ILI cases in a population as
percentage of visits to physicians in successive weeks.
Tested our regressive models with the historic CDC data and verified that Twitter data
substantially improves our model’s accuracy in predicting ILI cases.
In view of the lag inherent in CDC’s ILI reports, Twitter data provides near real time
assessment of influenza activity and can be used to effectively predict current ILI activity
levels.
Opportunity to significantly enhance public health preparedness among the masses for
influenza epidemic and other large scale pandemic.
Thank You
Download