Archive-It partner meeting

Web Archive Content Analysis:
Disaster Events Case Study
IIPC 2015 General Assembly
Stanford University and Internet Archive
Mohamed Farag
Dr. Edward A. Fox
mmagdy@vt.edu, fox@vt.edu
DLRL, CS @ Virginia Tech
April 27 – May 1, 2015
Acknowledgments
IDEAL team also includes Drs. Kavanaugh, Sheetz, and Shoemaker;
and GRA Sunshin Lee
• Related Funding:
– 2007-2008: NSF IIS-0736055, DL-VT416: A Digital Library
Testbed for Research Related to 4/16/2007 at Virginia Tech
– 2009-2013: NSF IIS-0916733, Crisis, Tragedy, and Recovery
network (CTRnet)
– 2013-2016: NSF IIS-1319578, Integrated Digital Event
Archive & Library (IDEAL)
• The Internet Archive (Kristine Hanna, co-PI):
– Heritrix crawler and other tools and support
– Hosting the crawls and resulting archives
Outline
• Building archives for events
• Event modeling and representation
• Assessing archive quality using event model
• Quality tool and results
• Future work
Building archives for events – 1
Manual Curation
• We have created ~ 60 collections
( https://archive-it.org/organizations/156 )
• These collections are about disaster events:
bombings, earthquakes, hurricanes, plane
crashes, shootings, floods, fires
• URLs are prepared manually and archived
using the Archive-It service
Sample Web Collections
Collection Name                No. of Seeds
Alabama University Shooting    116
April 16 Archive               88
Chile Earthquake               19
Nevada air race crash          64
China Floods                   60
Encephalitis (India)           59
Hurricane Irene                70
Building archives for events - 2
Seeds from social media (Twitter)
• We created more than 600 tweet collections
with ~ 1 billion tweets
• For each collection we extract URLs in the
tweets, fetch webpages, and archive just
those webpages
• Webpage collections are of two types:
– Disaster events: shootings, earthquakes, plane
crashes, hurricanes, bombings, terrorism, floods,
fire
– Community and political events
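The URL extraction step above can be sketched as follows. This is a minimal illustration, not the project's actual code: the regular expression, the function names (extract_urls, expand_url, expand_all), and the fallback behavior on failed lookups are all assumptions.

```python
import re
from urllib.request import urlopen

# Rough URL pattern (an assumption); real tweets may need a stricter matcher.
URL_PATTERN = re.compile(r"https?://[^\s<>\"']+")


def extract_urls(tweets):
    """Return the unique URLs found in an iterable of tweet texts, in first-seen order."""
    seen = []
    for text in tweets:
        for url in URL_PATTERN.findall(text):
            url = url.rstrip(".,;:!?)")  # drop trailing punctuation
            if url and url not in seen:
                seen.append(url)
    return seen


def expand_url(url, timeout=10):
    """Follow redirects and return the final URL (needs network access)."""
    with urlopen(url, timeout=timeout) as response:
        return response.geturl()


def expand_all(urls, resolver=expand_url):
    """Map each URL to its expanded form; on failure keep the original URL."""
    expanded = {}
    for url in urls:
        try:
            expanded[url] = resolver(url)
        except Exception:
            expanded[url] = url
    return expanded
```

The resolver argument lets the expansion step be swapped out, for example with a cached lookup table, so the extraction logic can be tried without network access.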
Sample Tweet Collections
Collection               Keywords/Hashtags   No. of Tweets   Start date
Hurricane Sandy          hurricane sandy     3,219,383       2012-10-26
Ebola                    #ebola              1,855,680       2014-07-30
Ferguson shooting        #Ferguson           1,580,479       2014-08-11
Thanksgiving             #Thanksgiving       214,888         2014-11-20
AirAsia Plane Crash      #QZ8501             174,353         2014-12-30
Charlie Hebdo shooting   #CharlieHebdo       451,009         2015-01-07
Iran Talks               #IranTalks          117,966         2015-04-02
For full list check: http://hadoop.dlib.vt.edu:81/twitter/
Building archives for events - 2
Seeds from social media
[Diagram: Event → Keyword/Hashtag → Collect Tweets → Tweet Collection →
Extract URLs → Shortened URLs → Expand → Original Webpages.
Stages: Collect; Archive/Organize/Analyze.
Archive: WARC, Wayback. Index: SOLR. Access: Search, Browse.]
Building archives for events - 3
Focused Crawling
• Curator selects high quality seed URLs
• Use the Event Focused Crawler (EFC) to retrieve
webpages that are highly similar to the webpages
at the seed URLs
• Curator can configure EFC to adjust the
number of webpages retrieved and the quality
of retrieved webpages (similarity threshold)
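The role of the similarity threshold can be illustrated with a small sketch. The slides do not say which similarity measure EFC uses, so the cosine similarity over term-frequency vectors below is an assumption, as are the function names and the threshold and max_pages parameters. Only the filtering step is shown; fetching pages and following links are left out.

```python
import math
import re
from collections import Counter


def term_vector(text):
    """Bag-of-words term-frequency vector over lowercase alphanumeric tokens."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two term-frequency vectors (0.0 if either is empty)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def select_relevant(seed_texts, candidates, threshold=0.5, max_pages=100):
    """Keep candidate pages whose similarity to the combined seed text meets the threshold.

    candidates maps URL -> page text. Returns (url, score) pairs, best first,
    capped at max_pages.
    """
    seed_vector = Counter()
    for text in seed_texts:
        seed_vector.update(term_vector(text))
    scored = [
        (url, cosine_similarity(seed_vector, term_vector(text)))
        for url, text in candidates.items()
    ]
    kept = [(url, score) for url, score in scored if score >= threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept[:max_pages]
```

Raising threshold trades collection size for relevance, while max_pages caps the number of pages kept regardless of score.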
Building archives for events - 3
Focused Crawling
Outline
• Building archives for events
• Event modeling and representation
• Assessing archive quality using event model
• Quality tool and results
• Future work
Event Model and Representation
• Modeling events
– What happened, where, and when
• Information retrieval
– Helps find What part (Vector Space/Probabilistic)
• Natural language processing
– Helps find Where and When parts (Named Entity
Recognition)
Event Model and Representation
• Educational activities
– CS4984 Computational Linguistics (Fall 2014)
– CS5604 Information Retrieval (Spring 2015)
• Equipment
– Hadoop cluster with 20 data nodes
– 612 GB RAM, 76 cores, and 60 TB disk
• Processing methods
– Stanford Named Entity Recognition
– Mahout routine for topic identification
– Python programming for text analysis (Hadoop
streaming)
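As an illustration of text analysis with Hadoop streaming, here is a minimal word-count mapper and reducer in Python. It is a sketch, not the course or project code: the function names and the "map"/"reduce" command-line convention are assumptions. Hadoop streaming passes records through standard input and output and sorts mapper output by key before the reducer sees it.

```python
import re
import sys
from itertools import groupby


def map_lines(lines):
    """Mapper: emit 'word<TAB>1' for every token in the input lines."""
    for line in lines:
        for word in re.findall(r"[a-z0-9']+", line.lower()):
            yield f"{word}\t1"


def reduce_lines(lines):
    """Reducer: sum the counts per word.

    Assumes the input is already sorted by key, which Hadoop streaming
    guarantees for reducer input.
    """
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines)
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


# Hypothetical entry point: run as "script.py map" or "script.py reduce".
if __name__ == "__main__" and len(sys.argv) == 2 and sys.argv[1] in ("map", "reduce"):
    step = map_lines if sys.argv[1] == "map" else reduce_lines
    for record in step(sys.stdin):
        print(record)
```

Locally, the same logic can be checked without a cluster by sorting the mapper output and feeding it to the reducer.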
Outline
• Building archives for events
• Event modeling and representation
• Assessing archive quality using event model
• Quality tool and results
• Future work
Assessing archive quality using
event model
• Approaches to textual and linguistic analysis
of an archive
– Frequent and important words in the whole collection
– Important sentences: sentences that contain one or
more frequent words
– Frequent entities (locations and dates) extracted
from important sentences
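The first two approaches above can be sketched in a few lines. The slides name NLTK as the backend library; the version below uses plain regular expressions as stand-ins for NLTK's tokenizers so that it runs without extra data files. The stopword list, the function names, and the min_matches and top_k parameters are assumptions.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption); a real run would use a fuller one.
STOPWORDS = {
    "a", "an", "and", "are", "as", "at", "by", "for", "from", "has", "in",
    "is", "it", "of", "on", "that", "the", "to", "was", "were", "with",
}


def tokenize(text):
    """Lowercase alphanumeric tokens with stopwords removed."""
    return [w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in STOPWORDS]


def split_sentences(text):
    """Naive sentence splitter on ., ! and ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def frequent_words(documents, top_n=10):
    """Most frequent non-stopword tokens across the whole collection."""
    counts = Counter()
    for doc in documents:
        counts.update(tokenize(doc))
    return [word for word, _ in counts.most_common(top_n)]


def important_sentences(documents, top_words, min_matches=1, top_k=2):
    """Sentences containing at least min_matches distinct frequent words, best first."""
    top = set(top_words)
    scored = []
    for doc in documents:
        for sentence in split_sentences(doc):
            matches = len(top & set(tokenize(sentence)))
            if matches >= min_matches:
                scored.append((matches, sentence))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [sentence for _, sentence in scored[:top_k]]
```

Frequent locations and dates would then come from running a named entity recognizer over the selected sentences.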
Assessing archive quality using
event model
[Diagram: Webpages → Text Extraction → Text Content → Sentence Tokenization →
Sentences → Keyword Matching → Selected Sentences → Named Entity Recognition →
Event Entities → Aggregation → Event Model.
Frequency Analysis of the text content yields the Frequent Words used in
Keyword Matching.]
Event Model:
Topic: (t1, t2, ..., tn)
Location: (l1, l2, ..., ln)
Date: (d1, d2, ..., dn)
Example
• Ebola Outbreak (22 documents)
• Top 10 frequent words and top 2 sentences
that include 2 or more frequent words
• Frequent Words: Ebola, Virus, Disease, Health, 2014, Africa, West, Ago,
University, Outbreak
• Important Sentences and Extracted Entities:
– Outbreak of Ebola virus disease in West Africa: third update, 1
August 2014. (7)
DATE: ['August 2014'], LOCATION: ['West Africa']
– ECDC (2014) Outbreak of Ebola virus disease in West Africa. (7)
LOCATION: [u'West Africa']
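The aggregation of extracted entities into an event model (topic, location, date) can be sketched as below. The EventModel class, the build_event_model function, and the top_n cutoff are hypothetical names introduced for illustration. The per-sentence entity dictionaries are assumed to come from a named entity recognizer and mirror the DATE/LOCATION output shown in the example.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class EventModel:
    """What happened (topic words), where (locations), and when (dates)."""
    topic: list = field(default_factory=list)
    location: list = field(default_factory=list)
    date: list = field(default_factory=list)


def build_event_model(frequent_words, sentence_entities, top_n=3):
    """Aggregate per-sentence entities into one event model.

    sentence_entities is a list of dicts such as
    {"DATE": ["August 2014"], "LOCATION": ["West Africa"]}.
    The most frequent locations and dates are kept.
    """
    locations = Counter()
    dates = Counter()
    for entities in sentence_entities:
        locations.update(entities.get("LOCATION", []))
        dates.update(entities.get("DATE", []))
    return EventModel(
        topic=list(frequent_words),
        location=[value for value, _ in locations.most_common(top_n)],
        date=[value for value, _ in dates.most_common(top_n)],
    )


# Entities taken from the two example sentences above.
example_entities = [
    {"DATE": ["August 2014"], "LOCATION": ["West Africa"]},
    {"LOCATION": ["West Africa"]},
]
model = build_event_model(["ebola", "virus", "disease"], example_entities)
```

Counting entity mentions across sentences is one simple aggregation choice; weighting by sentence score would be another.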
Outline
• Building archives for events
• Event modeling and representation
• Assessing archive quality using event model
• Quality tool and results
• Future work
Archive Quality Assessment
• http://nick.dlib.vt.edu/EventModel/
• Input:
– Existing collections, WARC file, Text file with list of
URLs
• Frontend: HTML, JavaScript/Dojo
• Backend: Python, NLTK
Sample Results
Future Work
• Use event model to:
– Summarize event collection (generate most
informative sentence)
– Extract relevant parts from webpage
Thank You
Questions?
Mohamed Farag
Dr. Edward A. Fox
mmagdy@vt.edu, fox@vt.edu
IDEAL Interface
• http://nick.dlib.vt.edu/ideal/collections/index.php
• Collections
– 11 event categories, 2 events each (one small and
one big)
– 1.6 M documents in total
• Services:
– Search (keywords, web collections text)
– Browse (Event categories and events metadata, web
and tweet collections)
Technologies
• Search engine
– Solr 4.9 (http://lucene.apache.org/solr/)
• Web Interface
– Apache server
– JavaScript - Solr library
(https://github.com/evolvingweb/ajax-solr/wiki )
• Tweets archiving
– yourTwapperKeeper
(https://github.com/540co/yourtwapperkeeper )
• Webpages archiving
– Archive-It service from Internet Archive
(https://archive-it.org/ )
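A keyword search with facets against Solr might be built as in the sketch below. The parameters q, fq, rows, wt, facet, and facet.field are standard Solr request parameters. The base URL, the core name "ideal", and the field names event_type and event are hypothetical, since the slides do not show the index schema.

```python
from urllib.parse import urlencode


def build_solr_query(base_url, keywords, event_type=None,
                     facet_fields=("event_type", "event"), rows=10):
    """Build a Solr /select URL for a keyword search with optional facets.

    event_type, if given, is applied as a filter query on a hypothetical
    'event_type' field; facet_fields are hypothetical schema field names.
    """
    params = [("q", keywords), ("rows", rows), ("wt", "json")]
    if event_type:
        params.append(("fq", f'event_type:"{event_type}"'))
    if facet_fields:
        params.append(("facet", "true"))
        params.extend(("facet.field", name) for name in facet_fields)
    return f"{base_url.rstrip('/')}/select?{urlencode(params)}"


# Hypothetical local Solr core; adjust host and core name to the deployment.
url = build_solr_query("http://localhost:8983/solr/ideal", "sandy",
                       event_type="Hurricane")
```

Restricting by event_type corresponds to the faceted search over one event category, while the plain keyword query corresponds to searching across all collections.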
Collections
Category/Collection   Big                          Small
Accident              Train derailment in Quebec   Texas factory explosion
Bombing               Boston bombing               Somalia Blast
Community             Blacksburg events            Labor day and world cup 2014
Disease Outbreak      Ebola                        encephalitis
Earthquake            Turkey earthquake            Virginia earthquake and others
Fire                  Brazil night club fire       Texas wild fire
Flood                 Pakistan flood               China flood and Islip 13 inch rain
Hurricane             Hurricane Sandy              Typhoon Haiyan
Plane Crash           Russia Plane Crash           Nevada air race crash
Shooting              April 16 shooting            Norway shooting and others
Search Interface
Searching Sandy
Faceted Search
Search all events under Fire
Faceted Search
Search Brazil Night Club Fire
Browse Interface
Select Event Type
Select Event
Hurricane Events
Download