Web Archive Content Analysis: Disaster Events Case Study

IIPC 2015 General Assembly
Stanford University and Internet Archive
April 27 – May 1, 2015

Mohamed Farag, Dr. Edward A. Fox
mmagdy@vt.edu, fox@vt.edu
DLRL, CS @ Virginia Tech

Acknowledgments

• IDEAL team also includes Drs. Kavanaugh, Sheetz, and Shoemaker; and GRA Sunshin Lee
• Related Funding:
  – 2007-2008: NSF IIS-0736055, DL-VT416: A Digital Library Testbed for Research Related to 4/16/2007 at Virginia Tech
  – 2009-2013: NSF IIS-0916733, Crisis, Tragedy, and Recovery network (CTRnet)
  – 2013-2016: NSF IIS-1319578, Integrated Digital Event Archive & Library (IDEAL)
• The Internet Archive (Kristine Hanna, co-PI):
  – Heritrix crawler and other tools and support
  – Hosting the crawls and resulting archives

Outline

• Building archives for events
• Event modeling and representation
• Assessing archive quality using event model
• Quality tool and results
• Future work

Building archives for events - 1
Manual Curation

• We have created ~60 collections (https://archive-it.org/organizations/156)
• These collections are about disaster events: bombings, earthquakes, hurricanes, plane crashes, shootings, floods, fires
• URLs are prepared manually and archived using the Archive-It service

Sample Web Collections

Collection Name               No. of Seeds
Alabama University Shooting   116
April 16 Archive              88
Chile Earthquake              19
Nevada air race crash         64
China Floods                  60
Encephalitis (India)          59
Hurricane Irene               70

Building archives for events - 2
Seeds from social media (Twitter)

• We created more than 600 tweet collections with ~1 billion tweets
• For each collection we extract the URLs in the tweets, fetch the webpages, and archive just those webpages
• Webpage collections are of two types:
  – Disaster events: shootings, earthquakes, plane crashes, hurricanes, bombings, terrorism, floods, fire
  – Community and political events

Sample Tweet Collections

Collection              Keywords/Hashtags   No. of Tweets   Start date
Hurricane Sandy         hurricane sandy     3,219,383       2012-10-26
Ebola                   #ebola              1,855,680       2014-07-30
Ferguson shooting       #Ferguson           1,580,479       2014-08-11
Thanksgiving            #Thanksgiving       214,888         2014-11-20
AirAsia Plane Crash     #QZ8501             174,353         2014-12-30
Charlie Hebdo shooting  #CharlieHebdo       451,009         2015-01-07
Iran Talks              #IranTalks          117,966         2015-04-02

For the full list, check: http://hadoop.dlib.vt.edu:81/twitter/

Building archives for events - 2
Seeds from social media

[Figure: workflow in two stages, Collect and Archive/Organize/Analyze. An event keyword/hashtag is used to collect tweets into a tweet collection; URLs are extracted and shortened URLs expanded; the original webpages are archived (WARC, Wayback) and indexed in Solr for search and browse access.]

Building archives for events - 3
Focused Crawling

• Curator selects high-quality seed URLs
• Use the Event Focused Crawler (EFC) to retrieve webpages that are highly similar to the pages at the seed URLs
• Curator can configure EFC to adjust the number of webpages retrieved and the quality of the retrieved webpages (similarity threshold)

Event Model and Representation

• Modeling events
  – What happened, where, and when
• Information retrieval
  – Helps find the What part (Vector Space/Probabilistic)
• Natural language processing
  – Helps find the Where and When parts (Named Entity Recognition)

Event Model and Representation

• Educational activities
  – CS4984 Computational Linguistics (Fall 2014)
  – CS5604 Information Retrieval (Spring 2015)
• Equipment
  – Hadoop cluster with 20 data nodes
  – 612 RAM, 76 Cores, and 60 TB Disk
• Processing methods
  – Stanford Named Entity Recognition
  – Mahout routine for topic identification
  – Python programming for text analysis (Hadoop streaming)
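
The vector-space idea mentioned under event modeling, and the similarity threshold that a curator sets for the Event Focused Crawler, can be illustrated with a small sketch. The slides do not describe how EFC actually scores pages, so the cosine similarity over raw term counts used here is an assumption, and the function names (`term_vector`, `cosine_similarity`, `is_relevant`) and the default threshold of 0.5 are hypothetical.

```python
import math
import re
from collections import Counter


def term_vector(text):
    """Lowercase the text and count word occurrences (bag-of-words vector)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-count vectors."""
    # Only terms present in both vectors contribute to the dot product.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is similar to nothing
    return dot / (norm_a * norm_b)


def is_relevant(seed_text, page_text, threshold=0.5):
    """Keep a candidate page if it is similar enough to the seed text.

    The threshold plays the role of the curator-configured similarity
    threshold; 0.5 is an arbitrary illustrative value.
    """
    score = cosine_similarity(term_vector(seed_text), term_vector(page_text))
    return score >= threshold


if __name__ == "__main__":
    seed = "Outbreak of Ebola virus disease in West Africa"
    for page in ("Ebola virus outbreak spreads in West Africa",
                 "Local football scores and weekend fixtures"):
        print(is_relevant(seed, page), "-", page)
```

Raising the threshold trades recall for precision: fewer pages are retrieved, but those that pass are closer to the seed content. A production crawler would likely weight terms (for example with TF-IDF) rather than use raw counts.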

Assessing archive quality using event model

• Approaches to textual and linguistic analysis of an archive
  – Frequent and important words in the whole collection
  – Important sentences: sentences that contain one or more frequent words
  – Frequent entities (locations and dates) extracted from the important sentences

Assessing archive quality using event model

[Figure: processing pipeline. Webpages → text extraction → text content → sentence tokenization → sentences. Frequency analysis of the text yields frequent words; keyword matching against the sentences yields selected sentences; named entity recognition on the selected sentences yields event entities. Aggregation produces the event model: Topic: (t1, t2, ..., tn), Location: (l1, l2, ..., ln), Date: (d1, d2, ..., dn).]

Example

• Ebola Outbreak (22 documents)
• Top 10 frequent words and top 2 sentences that include 2 or more frequent words

Frequent Words: Ebola, Virus, Disease, Health, 2014, Africa, West, Ago, University, Outbreak

Important Sentences and Extracted Entities:
  – "Outbreak of Ebola virus disease in West Africa: third update, 1 August 2014." (7)
    DATE: ['August 2014'], LOCATION: ['West Africa']
  – "ECDC (2014) Outbreak of Ebola virus disease in West Africa." (7)
    LOCATION: [u'West Africa']

Archive Quality Assessment

• http://nick.dlib.vt.edu/EventModel/
• Input:
  – Existing collections, WARC file, or text file with a list of URLs
• Frontend: HTML, JavaScript/Dojo
• Backend: Python, NLTK

Sample Results

[Screenshot of the quality assessment tool output]

Future Work

• Use the event model to:
  – Summarize an event collection (generate the most informative sentence)
  – Extract the relevant parts from a webpage

Thank You
Questions?

Mohamed Farag, Dr. Edward A. Fox
mmagdy@vt.edu, fox@vt.edu

IDEAL Interface

• http://nick.dlib.vt.edu/ideal/collections/index.php
• Collections
  – 11 event categories, 2 events each (small and big size)
  – Total 1.6 M documents
• Services:
  – Search (keywords, web collections text)
  – Browse (event categories and event metadata, web and tweet collections)

Technologies

• Search engine
  – Solr 4.9 (http://lucene.apache.org/solr/)
• Web interface
  – Apache server
  – JavaScript Solr library (https://github.com/evolvingweb/ajax-solr/wiki)
• Tweet archiving
  – yourTwapperKeeper (https://github.com/540co/yourtwapperkeeper)
• Webpage archiving
  – Archive-It service from the Internet Archive (https://archive-it.org/)

Collections

Category           Big                          Small
Accident           Train derailment in Quebec   Texas factory explosion
Bombing            Boston bombing               Somalia Blast
Community          Blacksburg events            Labor day and world cup 2014
Disease Outbreak   Ebola                        Encephalitis
Earthquake         Turkey earthquake            Virginia earthquake and others
Fire               Brazil night club fire       Texas wild fire
Flood              Pakistan flood               China flood and Islip 13 inch rain
Hurricane          Hurricane Sandy              Typhoon Haiyan
Plane Crash        Russia Plane Crash           Nevada air race crash
Shooting           April 16 shooting            Norway shooting and others

Search Interface
[Screenshot: searching "Sandy"]

Faceted Search
[Screenshot: search all events under Fire]

Faceted Search
[Screenshot: search Brazil Night Club Fire]

Browse Interface
[Screenshot: select event type, then select event]

Hurricane Events
[Screenshot: hurricane events listing]
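
The faceted searches shown in the interface slides could be issued against Solr roughly as sketched below. The request parameters (`q`, `wt`, `facet`, `facet.field`, `fq`) are standard Solr query parameters, but the base URL, the core name `events`, and the field names `event_type` and `event_name` are hypothetical; the slides do not give the actual IDEAL schema.

```python
from urllib.parse import urlencode


def build_solr_query(base_url, keywords, facet_field, filters=None):
    """Build a Solr select URL with one facet field and optional filters.

    base_url    -- URL of a Solr core (hypothetical in this sketch)
    keywords    -- free-text query, e.g. "sandy"
    facet_field -- field whose value counts should be returned as facets
    filters     -- dict of field -> value, each sent as an fq filter query
    """
    params = [
        ("q", keywords),
        ("wt", "json"),               # ask for a JSON response
        ("facet", "true"),            # turn faceting on
        ("facet.field", facet_field),
    ]
    for field, value in (filters or {}).items():
        # Quote the value so multi-word names match as a phrase.
        params.append(("fq", '{}:"{}"'.format(field, value)))
    return base_url + "/select?" + urlencode(params)


if __name__ == "__main__":
    core = "http://localhost:8983/solr/events"  # assumed local Solr core
    # Keyword search with facet counts per event type.
    print(build_solr_query(core, "sandy", "event_type"))
    # Restrict results to all events under the Fire category.
    print(build_solr_query(core, "*:*", "event_name", {"event_type": "Fire"}))
```

Using `fq` for the category restriction, rather than adding it to `q`, keeps the filter out of relevance scoring and lets Solr cache it separately, which suits repeated facet clicks in a browse interface.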