ChatterGrabber.py Methods and Development
A System for High Throughput Social Media Data Collection
By James Schlitt, in collaboration with Elizabeth Musser, Dr. Bryan Lewis, and Dr. Stephen Eubank

Introduction
Social media surveillance is a valuable tool for epidemiological research:
  – Pros: Cheap, consistent, and easy-to-parse data source.
  – Cons: Volume & specificity vary; content cannot be easily verified.
  – Tweepy provides easy Twitter API access via Python.

The Twitter Norovirus Study
• Developed under MIDAS funding. The Virginia Department of Health requested a tool to track Norovirus and gastrointestinal illness (GI) outbreaks within Montgomery County, VA with the following capabilities:
  – Automated surveillance of social media.
  – No special skills required to use.
  – Forward compatible for GIS applications.
• Twitter was well suited to GI outbreak surveillance due to the short duration of infection and Tweet-worthy symptoms. For example:
  @ATweeter20 My hubs did some vomiting w/ his flu. Had stuff messing w/ his tummy - high fever & snot was bad. Get better!
• Challenged by low population density and a high degree of linguistic confounders.

Methods Considered
• Tweepy.py: Python wrapper for Twitter RESTful APIs
• Gnip: Twitter commercial partner

Twitter Method Comparison
Streaming:
• Up to 1% of total stream volume by location or keywords from Twitter.
• About 10 keywords per query, 1 query per stream, and 1 stream per OAuth key.
• Tweets come in real-time.
• Tweet pull rate limited by Twitter.
• Most commonly used, great for whole country studies with simple queries.
Search:
• Up to 100% of all tweets matching query, may use multiple queries.
• Search by location and/or keywords.
• ~35 mile search radius limit.
• All tweets within the last week.
• Query rate limited per Twitter OAuth key to 180 searches every 15 minutes.
• Narrower geographic coverage, but very flexible.

Gnip Method Comparison
• Official Twitter data partner.
• Historic or real-time, variety of services.
• Large volume, representative sample. Excellent choice when affordable!
• Prices not public, quoted in a 2010 interview as:
  – 5% stream for $60k/year.
  – 50% stream for $360k/year.

Challenges
Given a partial data sample, how can we accurately track tweets in an area with low engagement?
  – 12 potential Norovirus (NRV)/GI tweets per day.
  – 4 suspected hits after human confirmation.
  – Long keyword list requires multiple queries.
Twitter's 1% streaming is limited by query length and volume, and Gnip was not affordable. That leaves the search method...

ChatterGrabber Introduction
ChatterGrabber: A search-method-based social media data miner developed in Python.
  – Google Docs interface (GDI) included for simplified partner access.
  – Specialized hunters pull from GDI spreadsheets to set run parameters.
  – Multiple logins may be used to increase search frequency during collaborative experiments.
  – No limits on query length.
  – Data sent nightly to subscribers as CSV.
  – Summary of history presented in dashboard (under development).

ChatterGrabber Reliability
High redundancy & error tolerance for long term experiments:
  – If multiple API keys are used, functional keys take up the work of failed keys until they can be reconnected.
  – Daemon automatically executes & resumes experiments on start up and after an interruption.
  – Any hunter may be resumed up to 1 week after termination without loss of incoming data.

General Execution
1. Pull list of condition phrases & config from Google spreadsheet.
2. Partition conditions into {x} queries.
3. Search radius > 35 miles?
  – Yes: Generate {y} coordinate sets via covering algorithm, then prepare search with |x|*|y| queries.
  – No: Prepare search with |x| queries.
4. Run Twitter search, from last tweet ID recorded for each location and query pair.
5. Filter results by phrases, classifiers, and location; sleep.
6. Has a new day begun?
  – Yes: Store data, send subscribers CSV and config link.
  – No: Continue searching.

ChatterGrabber GDI Interface Example
[Figure: ChatterGrabber GDI interface example]

ChatterGrabber Search Methods
Pure Query Based:
• Conditions, qualifiers, & exclusions.
• Searches by conditions, keeps if qualifier and no exclusions present.
• Simple, easy to set up, but vulnerable to complexities of wording.
NLTK* Based:
• Take output from conditions search, manually classify.
• Train NLTK maxEnt or Naïve Bayesian classifier via content n-grams.
• Classifier discards tweets that don’t fit desired categories.
• Powerful, but requires longer setup and a representative tweet sample.
*NLTK: Natural Language Tool Kit

Tweet Linguistic Classification
1. Tweet passed for classification.
2. Using NLTK mode?
  – Yes: Extract features from Tweet, classify Tweet by features, then ask: is Tweet classification sought? If yes, go to step 3; if no, go to step 4.
  – No: Does Tweet contain an exclusion? If yes, go to step 4. If no, does Tweet contain a qualifier? If yes, go to step 3; if no, go to step 4.
3. Store Tweet data and derived data.
4. Keeping non-hits?
  – Yes: Store Tweet data and derived data.
  – No: Discard Tweet.

NLTK Classifier Example
[Figure: NLTK classifier example]

ChatterGrabber Geographic Methods
• Large lat/lon boxes filled via covering algorithm.
• Fine and coarse geolocations obtained via GoogleMapsV3 API:
  – If coordinates for the tweet are present, finds street address.
  – If common name present, finds coordinates, then searches by coordinates for proper name/street address of position.
  – If location is outside of lat/lon box, discards tweet.
• All geo queries cached locally, shared between experiments, and pulled on demand to reduce API utilization.

Work Flow
Basic Execution:
1. Create GDI sheet, run initial experiment.
2. Check first results for confounders, update keyword lists.
3. Rerun experiment with new keywords, monitor periodically for new keywords & memes.
If Greater Specificity Desired:
1. Run whole country experiment with desired query list.
2. Score output manually & enable NLTK classification.
3. Expand area as desired.

Results
• Found and geolocated 4,000-8,000 suspected Norovirus tweets per day across the US during peak Norovirus season.
• Preliminary estimates of 70-80% accuracy with 2,000 tweet training set.

Results Continued

Limitations
• Results exceed the geographic and temporal resolution of existing surveillance systems, complicating verification.
• No true denominator, ChatterGrabber only collects queried hits.
• Not all desired information is available in social media, some may be incomplete or falsified.
• ChatterGrabber is just an information gathering method, external analysis and review needed for validity.
• Twitter users will differ from population at large.

Conclusions
• ChatterGrabber provides an easy to use social media surveillance tool.
  – Natural Language Processing speeds illness identification.
  – Geographic region directed searching allows complete coverage of a user defined jurisdiction.
• ChatterGrabber can successfully identify GI illness related tweets in a population.
  – 220 Million USA Tweets per day.
  – 6,000 matches per day by Nationwide NLTK search.
  – 353 matches per day by Virginia keyword search.
  – 136 matches per day by Virginia NLTK search.

Next Steps
• Streamlined web interface needed for NDSSL long term studies.
• Real-time bioterrorism surveillance methods under evaluation using gun violence as a proof of concept.
• Norovirus visualization & dashboard under development by Elizabeth Musser.
• Tick bite zoonosis and unlicensed tattoo hepatitis risk tracking underway by Pyrros Telionis.
• Vaccine sentiment tracking underway by Meredith Wilson.
[Figure: Firearm violence related tweets by time of day]

Next Steps: Public Health Outreach
• Design and execution of real-world use by state and local public health offices.
• Dashboard deployed and customized for users across Virginia.
• Evaluation of pre and post deployment practice.
• Assessment of utility and iterative refinement.
• If interested contact: blewis@vbi.vt.edu

References
Python Resources
I. Roesslein, J. (2009). Tweepy (Version 1.8) [Computer program]. Available at https://github.com/tweepy/tweepy (Accessed 1 November 2013)
II. Bird, Steven, Edward Loper and Ewan Klein (2009). Natural Language Processing with Python. O’Reilly Media Inc. (Accessed 14 January 2014)
III. Google Developers (2012). gdata-python-client (Version 3.0) [Computer program]. Available at http://code.google.com/p/gdata-python-client/ (Accessed 6 January 2014)
IV. McKinney, W. (2010). Data structures for statistical computing in Python. In Proc. 9th Python Sci. Conf (pp. 51-56)
V. Tigas, M. (2014). GeoPy (Version 0.99) [Computer program]. Available at https://github.com/geopy/geopy (Accessed 21 December 2013)
VI. KilleBrew, K. (2013). query_places.py [Computer program]. Available at https://gist.github.com/flibbertigibbet/7956133 (Accessed 27 January 2014)
VII. Coutinho, R. (2007, August 22nd) Sending emails via Gmail with Python [Web log Post]. Retrieved January 5th from http://kutuma.blogspot.com/2007/08/sending-emails-via-gmail-with-python.html
Relevant Papers
I. Rivers, C. M., & Lewis, B. L. (2014). Ethical research standards in a world of big data. F1000Research, 3.
II. Young, S. D., Rivers, C., & Lewis, B. (2014). Methods of using real-time social media technologies for detection and remote monitoring of HIV outcomes. Preventive medicine.
III. Chakraborty, P., Khadivi, P., Lewis, B., Mahendiran, A., Chen, J., Butler, P., ... & Ramakrishnan, N. Forecasting a Moving Target: Ensemble Models for ILI Case Count Predictions. SDM14

Questions?
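
Appendix: Search Preparation Sketch
The search-preparation and query-based filtering steps described under General Execution and ChatterGrabber Search Methods can be sketched roughly as follows. This is an illustration under stated assumptions, not ChatterGrabber's actual code: the function names, the quoted OR-joined query format, the 10-phrase cap per query, substring matching, the flat-earth mileage approximation, and the square-grid covering are all hypothetical stand-ins, since the slides do not say which covering algorithm or matching rules are used. Only the 35 mile radius and the 180 searches per 15 minutes per key come from the slides.

```python
import math

MILES_PER_DEG_LAT = 69.0  # rough flat-earth approximation (assumption)


def partition_conditions(conditions, max_per_query=10):
    """Split a long condition list into {x} OR-joined query strings.

    The cap of 10 phrases per query is an assumed value for illustration.
    """
    chunks = [conditions[i:i + max_per_query]
              for i in range(0, len(conditions), max_per_query)]
    return [" OR ".join('"%s"' % c for c in chunk) for chunk in chunks]


def cover_box(south, west, north, east, radius_miles=35.0):
    """Return {y} circle centers (lat, lon) covering a lat/lon box.

    Hypothetical square-grid covering: a cell whose side is radius*sqrt(2)
    has a half-diagonal equal to the radius, so a circle at the cell
    center covers the whole cell.
    """
    side = radius_miles * math.sqrt(2)
    # Longitude cells are widest (in miles) at the latitude nearest the equator.
    if south <= 0.0 <= north:
        max_cos = 1.0
    else:
        max_cos = math.cos(math.radians(min(abs(south), abs(north))))
    height_miles = (north - south) * MILES_PER_DEG_LAT
    width_miles = (east - west) * MILES_PER_DEG_LAT * max_cos
    n_rows = max(1, math.ceil(height_miles / side))
    n_cols = max(1, math.ceil(width_miles / side))
    dlat = (north - south) / n_rows
    dlon = (east - west) / n_cols
    return [(south + (r + 0.5) * dlat, west + (c + 0.5) * dlon)
            for r in range(n_rows) for c in range(n_cols)]


def keep_tweet(text, qualifiers, exclusions):
    """Pure query-based filter: keep if a qualifier and no exclusion is present.

    Case-insensitive substring matching is an assumption of this sketch.
    """
    lowered = text.lower()
    has_qualifier = any(q.lower() in lowered for q in qualifiers)
    has_exclusion = any(e.lower() in lowered for e in exclusions)
    return has_qualifier and not has_exclusion


def min_cycle_minutes(n_searches, n_keys=1, per_window=180, window_minutes=15.0):
    """Shortest time to run every search once under the per-key rate limit."""
    return window_minutes * n_searches / (per_window * n_keys)


# Example with made-up keyword lists and an approximate box around Virginia.
queries = partition_conditions(["vomiting", "stomach flu", "norovirus"],
                               max_per_query=2)
centers = cover_box(36.5, -83.7, 39.5, -75.2)
searches = [(q, c) for q in queries for c in centers]  # |x|*|y| searches
print(len(queries), len(centers), len(searches))
print(round(min_cycle_minutes(len(searches)), 1), "minutes per full pass")
```

Under these assumptions, adding a second API key halves the minimum time per full pass, which is consistent with the slides' note that multiple logins may be used to increase search frequency.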