Building Geospatial Mashups to Visualize Information for Crisis Management
Authors: Shubham Gupta and Craig A. Knoblock
Presented by: Shrikanth Mayuram, Akash Saxena, Namrata Kaushik

Contents
• Term Definitions
• Problem Definition
• Data Retrieval
• Source Modeling
• Data Cleaning
• Data Integration
• Data Visualization

Term Definitions
Mashup
• Heterogeneous data sources combined to suit the user's needs
Geospatial
• Data that is geographic and spatial in nature
Information Visualization
• Visualizing a large data set in an effective and judicious manner to aid decision making
Programming-by-demonstration
• Enables users to write programs by demonstrating concrete examples through the UI

Examples of Geospatial Mashups
• WikiMapia (wikimapia.org)
• Zillow (zillow.com)
• Yahoo's Pipes (pipes.yahoo.com)
• Intel's MashMaker (mashmaker.intel.com)

Problem Addressed in the Paper
• Existing tools use widgets, which require an understanding of programming concepts
• No customization of data visualization in the final mashup that is built
• Emergency management
o Heterogeneous data sources
o Time-sensitive data visualization

Question?
• What are the problems associated with existing mashup-building tools?
a) They use widgets, which require programming concepts
b) No customization of data visualization
c) Heterogeneous data sources
d) All of the above
Ans) d

Motivating Example
Drawbacks
• Time consumption
o Switching between data sources
o Analyzing data using various software packages
Solution
• Programming by demonstration
• Geospatial mashup with visualization techniques
Geospatial mashup developed for the analyst's scenario

Programming-by-Demonstration
• Advantages
o Saves time in constructing the program
o Supports quick decisions by analyzing the data
o Makes this solution ideal when there is no time for training

Tool: Karma
• Issues in the mashup creation process: Data Retrieval, Source Modeling, Data Cleaning, Data Integration and Data Visualization
• Karma solves all of the above issues in one interactive process

Question?
• Karma has the ability to work with Excel, text, database and semi-structured data
a) True
b) False
Ans) True

Data Retrieval
• The searching, selecting, and retrieving of actual data from a personnel file, data bank, or other file
• In Karma
Figure 6: Extracting data from the Evacuation Centers List (CSV text file) using drag and drop in Karma

Data Retrieval (continued)
• Drag and drop: Karma constructs a query to get similar data
• Semi-structured data is extracted using wrappers (software: Fetch Agent Platform, OpenKapow)
• Hence, a unified platform for accessing and extracting data from heterogeneous data sources

Source Modeling
• The process of learning the underlying model of a data source with the help of semantic matching
• In Karma
o The user selects an existing semantic type, ranked by previous learning/hypotheses
o Or the user defines a new semantic type
o Karma learns and maintains a repository of these learnt semantic types
o A semantic type is a description of an attribute that helps identify the behavior of that attribute

Data Cleaning
• The act of detecting and correcting corrupt or inaccurate records from a record set, table, or database
• A join operation aids the data cleaning process
• In Karma, the user specifies by example what the cleaned data should look like
Figure 7: Analyst provides an example of cleaned data in Karma during data cleaning
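In the paper, Karma learns cleaning transformations from the analyst's examples (Figure 7) rather than from hand-written rules. Below is only a minimal sketch of that programming-by-demonstration idea, assuming the cleaned value appears verbatim inside the raw value; the function names and the capacity column are hypothetical, and this is not Karma's actual learner.

def induce_rule(raw, cleaned):
    """Learn a (prefix, suffix) stripping rule from a single (raw, cleaned) example."""
    start = raw.find(cleaned)
    if start < 0:
        raise ValueError("cleaned value must be a substring of the raw value")
    return raw[:start], raw[start + len(cleaned):]

def apply_rule(rule, raw):
    """Strip the learned prefix/suffix from a new raw value."""
    prefix, suffix = rule
    value = raw
    if prefix and value.startswith(prefix):
        value = value[len(prefix):]
    if suffix and value.endswith(suffix):
        value = value[:-len(suffix)]
    return value

# The analyst demonstrates one cleaned value; the rule is applied to the rest of the column.
column = ["Capacity: 250 people", "Capacity: 1200 people", "Capacity: 75 people"]
rule = induce_rule(column[0], "250")
print([apply_rule(rule, v) for v in column])   # ['250', '1200', '75']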
Data Integration
• The process of combining data from multiple sources to provide a unified view of the data
• The major challenge is to identify the related sources to be manipulated during integration
• In Karma
o Automatic detection and ranking of relations with other sources, based on attribute names and matching semantic types (a toy sketch of such ranking follows the Future Work slide)
o The default weights change based on learning
Figure 8: Data Integration in Karma

Question?
• In what sequence is the mashup built in Karma?
a) Data Retrieval -> Data Cleaning -> Data Integration -> Source Modeling -> Data Visualization
b) Data Retrieval -> Source Modeling -> Data Cleaning -> Data Integration -> Data Visualization
c) Data Cleaning -> Source Modeling -> Data Cleaning -> Data Integration -> Data Visualization
Ans) b

Data Visualization
• Advantages
o Detecting patterns, anomalies and relationships in the data
o Lowers the probability of incorrect decision making
o Harnesses the capabilities of the human visual system
• Related factors
o Structure of the underlying data set
o Task at hand
o Dimensions of the display
Figure 9: Statistical Data in Table Format
Figure 10: Statistical Data Visualized as a Chart
Figure 11: Sample data elements are dragged to the List Format interactive pane for bulleted-list visualization. A preview is also generated in the output preview window.
Figure 12: Data Visualization in Chart Format
Figure 13: Data Visualization in Paragraph Format
Figure 14: Data Visualization in Table Format
Figure 15: Data Visualization in List Format

Visualization in Karma
• Karma uses the Google Charts API, which lets users generate charts dynamically
• Uses the semantic types generated during semantic mapping
• In a geospatial mashup this information appears as pop-ups on the map markers

Similar Tools
• MIT's Simile
o Emphasizes the data retrieval process
• CMU's Marmite
o Takes the widget approach; the user requires programming knowledge
• Intel's MashMaker
o A browser extension; mashups only on the current site
o Data retrieval is limited to web pages, and integration requires an expert user
• All of the above tools lack the data visualization feature

Karma's Contributions
• A programming-by-demonstration approach to data visualization
o The user can customize the output without any knowledge of programming
• Builds the mashup in one seamless interactive process
• Solves all the issues, including visualizing the data the way the user wants

Future Work
• Include more visualization formats such as scatter plots and 2D/3D isosurfaces
• Read geospatial data to integrate it within Karma
• Save the plans for extracting and integrating the data, to apply them when new data becomes available
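A toy sketch of the related-source ranking described under Data Integration: candidate sources are scored by the overlap of their attribute names and of their semantic types, with weights standing in for the default weights that Karma adjusts through learning. The function, the weights and the example tables are hypothetical, not Karma's actual code.

def rank_related_sources(current, candidates, w_name=0.4, w_type=0.6):
    """Score candidate sources by attribute-name and semantic-type overlap (Jaccard)."""
    cur_names = set(current)
    cur_types = set(current.values())

    def jaccard(a, b):
        return len(a & b) / len(a | b) if a | b else 0.0

    scored = []
    for source, attrs in candidates.items():
        score = (w_name * jaccard(cur_names, set(attrs))
                 + w_type * jaccard(cur_types, set(attrs.values())))
        scored.append((round(score, 2), source))
    return sorted(scored, reverse=True)

# Which source is most related to the evacuation-centers table?
current = {"center_name": "Name", "address": "Address", "capacity": "Number"}
candidates = {
    "shelters":  {"name": "Name", "address": "Address", "beds": "Number"},
    "hospitals": {"facility": "Name", "phone": "PhoneNumber"},
}
print(rank_related_sources(current, candidates))   # shelters ranks first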
References
• For the working of Karma, watch this video: http://www.youtube.com/watch?v=hKqcmsvP0No
• Wong and Hong, "Making Mashups with Marmite: Towards End-User Programming for the Web": http://mashup.pubs.dbs.unileipzig.de/files/Wong2007Makingmashupswithmarmitetowardsenduserprogrammingfor.pdf
• MIT Simile Exhibit: http://www.simile-widgets.org/exhibit/
• Rob Ennals, Eric Brewer, Minos Garofalakis, Michael Shadle, Prashant Gandhi, "Intel Mash Maker: Join the Web"

Web-a-Where: Geotagging Web Content
Authors: Einat Amitay, Nadav Har'El, Ron Sivan, Aya Soffer

Contents
• Motivation
• Problem
• Ambiguity Handling So Far
• Tool: Web-a-Where
• Page Focus Algorithm

Motivation
• Understanding place names benefits
o Data mining systems
o Search engines
o Location-based services for mobile devices
• Every page has two types of geography associated with it: source and target

Problem
• Ambiguity of place names
o A person's name (Jack London) vs. a place name
o Multiple places with the same name, e.g. the US has 18 cities named Jerusalem
o The volume of Web data to be processed is huge, so ambiguity resolution must be fast

Ambiguity Handling So Far
• NER (Named Entity Recognition)
o Uses natural language processing with statistical learning
o Machine learning from structure and context is expensive and requires more training data, e.g. "Charlotte best pizza"
o Too slow for web-scale data mining
• Data mining
o Grounding/localization using glossaries and gazetteers (general knowledge, such as all the places in an atlas)
• Plausible principles
o Single sense per discourse (a later bare "Portland" refers to the same Portland, OR mentioned earlier)
o Nearby locations share one context (Vienna, Alexandria -> Northern Virginia)
• Web pages
o URL, the language the page is written in, phone numbers, zip codes, hyperlink connections
o Requires postal details and phone directories, which are more easily available in the US than in other parts of the world

Tool: Web-a-Where
• Three-step processing of any page
• Spotting: identify geographic locations
o Finds and disambiguates geographic names (taxonomy approach) with the help of a gazetteer
• Disambiguation: assign a meaning and a confidence to each spot
• Focus determination: derive the focus (aggregate the spots and represent the geographic focus of the whole page)
• Most prior work is theoretical, but this paper provides experimental proof of the tool's effectiveness

Gazetteer
• To resolve ambiguity, the gazetteer associates each place with
o A canonical taxonomy node (Paris/France/Europe)
o Abbreviations (Alabama, AL)
o World coordinates and population
• Geo/non-geo ambiguity across languages, e.g. "Of" is a common English word but also a town in Turkey
• "Mobile" is considered non-geo unless followed by "Alabama"; resolved by word frequency and capitalization, e.g. "Asbestos" (a town in Quebec)
• Frequency is related to population (e.g. Metro, Indonesia)
• Short abbreviations are not used on their own; they are too ambiguous (IN: Indiana or India?), but they help disambiguate other spots such as "Gary, IN"

Disambiguating Spots Algorithm
Steps:
1. Assign confidence, e.g. "Chicago, IL" gets confidence 0.9, while "London, Germany" is left with unassigned confidence
2. Unresolved spots are assigned confidence 0.5 for the candidate place with the largest population
3. Single sense per discourse: the sense of a qualified spot is delegated to its other occurrences (confidence 0.8 vs. 0.9)
4. Disambiguating by context: for spots with confidence < 0.7, the context of the region is considered. E.g. a page mentioning "London and Hamilton": London could be England, UK or Ontario, Canada, and Hamilton could be Ohio, USA or Ontario, Canada; the shared region resolves both to Ontario, Canada.
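A toy sketch of steps 2 and 4 above, assuming each spot comes with candidate (taxonomy path, population) pairs. The data structures, population figures and the confidence bump are illustrative; this is not the paper's implementation.

def disambiguate(spots):
    """spots: dict mapping spot text -> list of (taxonomy_path, population) candidates."""
    # Step 2: provisionally pick the most populous candidate with confidence 0.5.
    resolved = {}
    for text, candidates in spots.items():
        path, _pop = max(candidates, key=lambda c: c[1])
        resolved[text] = (path, 0.5)

    # Step 4: index which spots could fall in which region (the path minus the city).
    regions = {}
    for text, candidates in spots.items():
        for path, _pop in candidates:
            regions.setdefault(path.split("/", 1)[1], set()).add(text)

    # Re-resolve low-confidence spots to a candidate whose region is shared with another spot.
    for text, (path, conf) in resolved.items():
        if conf >= 0.7:
            continue
        for cand_path, _pop in spots[text]:
            if len(regions[cand_path.split("/", 1)[1]]) > 1:
                resolved[text] = (cand_path, 0.7)   # confidence bump is illustrative
                break
    return resolved

page_spots = {
    "London":   [("London/England/UK", 8_900_000), ("London/Ontario/Canada", 400_000)],
    "Hamilton": [("Hamilton/Ohio/USA", 62_000), ("Hamilton/Ontario/Canada", 570_000)],
}
print(disambiguate(page_spots))   # both resolve to .../Ontario/Canada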
Page Focus
• Decides which geographic mentions are incidental and which constitute the actual focus of the page

Rationale of the Focus Algorithm
• E.g. a search for "California" should return a page about the cities of California rather than a page mentioning San José, Chicago and Louisiana
• A page may have several regions of focus, e.g. a news story mentioning two countries
• Nearby foci coalesce into one region, e.g. a page listing the 50 US states has page focus United States
• Coalescing into whole continents is not productive
• Page focus assigns a higher weight to a spot if the preceding disambiguation algorithm assigned it high confidence, and vice versa

Outline of the Focus Algorithm
• Mainly involves summing confidences over taxonomy nodes: each spot's confidence is added to its node, and ancestor nodes accumulate contributions from their descendants
• E.g. a page contains: Orlando/Florida (confidence 0.5), 3 times Texas (confidence 0.75), 8 times Fort Worth/Texas (confidence 0.75)
Final scores:
6.41 Texas/United States/North America
4.50 Fort Worth/Texas/United States/North America
1.00 Orlando/Florida/United States (second focus)

Focus Scoring Algorithm
• The algorithm loops over the taxonomy nodes in order of importance
• It stops after 4 nodes or when the confidence is lower than a threshold value
• It skips nodes that are already covered, e.g. United States/North America is contained in North America

Question
• The Focus Scoring Algorithm stops when
A. the confidence is higher than a threshold value
B. the confidence is equal to the threshold value
C. the confidence is lower than a threshold value
Ans) C

Testing Page Focus
• The focus-finding algorithm is evaluated in the first stage by comparing its decisions to those of human editors
• Second stage: the Open Directory Project (ODP), the largest human-edited directory of the Web
• A random sample of about 20,000 web pages from ODP's Regional section is chosen
• Web-a-Where is run on this sample and the foci are compared to those listed in the ODP index
• It performed quite well: it found a correct page focus, up to the country level, for 92% of the pages

Evaluation of the Geotagging Process
• Web-a-Where is tested on three different web-page collections: the Arbitrary Collection, the ".GOV Collection" and the "ODP Collection"
• All three collections were geotagged with Web-a-Where and manually checked for correctness
• Each geotag was labeled "correct", or as an error of type "Geo/Non-Geo", "Geo/Geo", or "Not in Gazetteer"

Question?
• Web-a-Where is run on the sample of web pages and the foci are compared to those listed in the ODP index
A) True
B) False
Ans) A

Future Work
• The main source of error was Geo/Non-Geo ambiguity
o To resolve this: rule out all uncapitalized words in properly capitalized text, or use a part-of-speech tagger
o Use the coordinates of places and the linkage among Web pages

Thank You!!!
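Backup: a minimal sketch of the focus-scoring idea from the Outline of the Focus Algorithm slide, assuming each disambiguated mention carries a taxonomy path and a confidence. The decay factor, threshold and skip rule are illustrative guesses, not the paper's exact formula, so the scores will not reproduce the 6.41 / 4.50 / 1.00 figures above.

from collections import defaultdict

def page_focus(mentions, max_foci=4, threshold=1.0, decay=0.7):
    """mentions: list of (taxonomy_path, confidence) pairs, one per spot occurrence."""
    scores = defaultdict(float)
    for path, conf in mentions:
        parts = path.split("/")
        weight = conf
        for i in range(len(parts)):               # own node first, then each ancestor
            scores["/".join(parts[i:])] += weight
            weight *= decay                       # ancestors get a decayed share
    foci = []
    for node, score in sorted(scores.items(), key=lambda kv: -kv[1]):
        if len(foci) >= max_foci or score < threshold:
            break                                 # stop after 4 foci or below threshold
        if any(chosen.endswith(node) for chosen in foci):
            continue                              # skip a node covered by an already chosen, more specific focus
        foci.append(node)
    return [(round(scores[n], 2), n) for n in foci]

mentions = ([("Orlando/Florida/United States/North America", 0.5)]
            + [("Texas/United States/North America", 0.75)] * 3
            + [("Fort Worth/Texas/United States/North America", 0.75)] * 8)
print(page_focus(mentions))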