KDD - Service based Numerical Entity Searcher (KSNES)
Presentation 3 on April 14th, 2009
Naga Sowjanya Karumuri
sowji@ksu.edu
CIS 895 – MSE PROJECT

OUTLINE
- Introduction
- Terms
- Motivation
- Goal
- Project Overview
- Project Data Flow Diagram
- Component Design
- Project Evaluation
- Future Work
- Prototype Demonstration
- Questions / Comments

TERMS[1]
- Knowledge Discovery in Databases (KDD)
  - A group headed by Dr. Hsu
  - Primary focus is machine learning, data mining, and human-computer intelligent interaction
- Natural Language Processing (NLP)
  - To allow computers to process and understand human languages
  - Some areas:
    - Text segmentation (identifying word boundaries)
    - Part-of-speech tagging
    - Word sense disambiguation (for words with more than one meaning)

TERMS[2]
- Named Entity Recognition (NER)
  - Locating and classifying atomic elements (single part of speech) in text into predefined categories such as
    - Names of Persons
    - Names of Locations
    - Names of Organizations
    - Names of Miscellaneous Entities
- Example
  Dr. William H. Hsu is a Professor at Kansas State University located in Manhattan, Kansas.
  Dr. [PER William H. Hsu ] is a Professor at [ORG Kansas State University ] located in [LOC Manhattan ] , [LOC Kansas ] .

TERMS[3]
- Shallow Parsing/Chunking
  - NLP technique that looks for key phrases but does not fully parse the text into a parse tree.
  - Output: a series of word groups, mostly noun phrases, verb phrases, prepositional phrases, etc.
- Example
  Chunker: [NP He ] [VP reckons ] [NP the current account deficit ] [VP will narrow ] [PP to ] [NP only L1.8 billion ]
  Full Parser: (PRP)He (VBZ)reckons (DT)the (JJ)current (NN)account (NN)deficit (MD)will (VB)narrow (TO)to (RB)only (L)L (CD)1.8 (CD)billion

PROJECT OVERVIEW[1]
Motivation
- Occurrence of events is naturally anchored in time within the narrative text
  - Is Bush currently the President of America?
  - When was India attacked by Pakistan in the last century?
- To know the quantities of entities
  - How many Oscar awards have been won by Steven Spielberg?
  - What was the highest temperature recorded in the year 2008?

PROJECT OVERVIEW[2]
Goal
- To develop a system that
  - extracts Numerical Phrases from raw text
  - displays value – unit – unit-type
- System is set up as a service on the web server
- User interacts through a webpage
- Numerical Phrase: Types
  - Number Phrase: 33 dollars, 100 Watts, 13 years, two miles
  - Date Phrase: Aug 1998, Nov 10th 1984, between 1989 and 2006

PROJECT OVERVIEW[3]
Purpose
- To understand the timestamp of an event
- To understand the order of occurrence of events
- To understand the persistence of an event, i.e., the time period over which the event occurred and continued
- For KDD Group
  - To gather certain statistical information from the data they gather by crawling different web pages
    - How many cattle have been affected by the virus?
    - When did the disease break out?
  - Sample NABC (National Agricultural Bio-Security Centre) data is given to the system for testing

APPLICATION AREAS
- Textual Entailment (TE) Recognition
  - Given two text fragments, decide whether the meaning of one can be inferred from the other.
- Question Answering (QA) System
  - Identifies text that entails the expected answer.
- Ex: During 1997, 10,000 cattle were killed because of the RVF.
  - Possible inferences (TE)
    - 10,000 cattle were killed because of RVF.
    - RVF occurred during 1997.
  - Possible Questions (QA)
    - How many cattle were killed during the 1997 RVF outbreak?
    - When did RVF occur?

SYSTEM OVERVIEW
[Diagram]

PROJECT DATA FLOW DIAGRAM: NUMERICAL ENTITY SEARCHER
[Diagram]

MODULES IN THE PROJECT
- Webpage (JSP): For requesting and receiving information from the service.
- POS Tagger (Java): Stanford POS Tagger
- Numerical Phrase Extractor (Java): Implemented using Shallow Parsing Technique
- Number-Unit/Date Pattern Recognizer (Java): Implemented based on the Numerical Quantifier developed by Benjamin Sapp, UIUC.
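The Numerical Phrase Extractor module listed above can be pictured with a minimal sketch. It is not the project's code: it assumes the tagger output is available as space-separated word/TAG tokens, and the class name, method names, and the single number-plus-noun regex are hypothetical stand-ins chosen for illustration. The idea is to run a regex over the tagged text, keep each matched chunk, and strip the tags.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NumericalPhraseSketch {
    // Hypothetical pattern (an assumption, not one of the project's patterns):
    // a token tagged CD or JJ followed by a noun token,
    // e.g. "thirty-three/JJ dollars/NNS".
    private static final Pattern NUMBER_UNIT =
            Pattern.compile("\\S+/(?:CD|JJ) [a-zA-Z]+/NNS?");

    // Removes the "/TAG" suffix from every token of a matched chunk.
    public static String stripTags(String taggedChunk) {
        return taggedChunk.replaceAll("/[A-Z$]+", "");
    }

    // Returns every chunk of the tagged text that matches the pattern,
    // with the POS tags removed.
    public static List<String> extract(String taggedText) {
        List<String> phrases = new ArrayList<>();
        Matcher m = NUMBER_UNIT.matcher(taggedText);
        while (m.find()) {
            phrases.add(stripTags(m.group()));
        }
        return phrases;
    }

    public static void main(String[] args) {
        String tagged = "I/PRP lost/VBD thirty-three/JJ dollars/NNS in/IN 1998/CD";
        System.out.println(extract(tagged)); // prints [thirty-three dollars]
    }
}
```

This sketch covers only one number-unit shape; date phrases such as "in 1998" would need their own patterns.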
POS TAGGER TAGSET
[Tagset table] Source: http://www.cs.ualberta.ca/~lindek/650/Slides/POSTagging.ppt

IMPLEMENTING NUMERICAL PHRASE EXTRACTOR
- Input: Tagged Text
  I/PRP lost/VBD thirty-three/JJ dollars/NNS in/IN 1998/CD
- Regular expressions (regex) are used to determine the numerical patterns in the input.
  thirty-three/JJ dollars/NNS in/IN 1998/CD
- Output: Numerical Phrases
  thirty-three dollars in 1998

SOME PATTERNS
- "\\d+-\\d+(/JJ|/CD) [a-zA-Z]+/NN"
  parses 3-2/JJ lead/NN, 20-20/JJ match/NN
- "(between|Between|from|From|In|in|since|Since|during|During)/IN ..../CD (([a-zA-Z]+/CC|[a-z]+/TO) ..../CD)?"
  parses 'between 1987 and 1997', 'in 2007 and 2008'

COMPONENT DESIGN
- Contains class variables and functions
- Added separate table to describe the roles of functions

COMPONENT DESIGN (MYPATTERNS)[1]
Patterns: Matching Numerical Phrases
- p_words: about, around, approximately, more than, nearly, almost, no more than, at least, less than, no fewer than
- p_tnl: this, next, last, since, in
- p_inl: between, from, in, since, during
- p_words + p_abtfrac: about two-thirds of the vote, millions of books
- p_words + p_age: 27 year-old bachelor, 27-year-old bachelor
- p_words + p_ampm: About 3:00 a.m., 4:15 p.m. CST
- p_and: 3,792 children and adolescents
- p_tnl + p_anydate: Oct 1st 1987, Nov 5, December 21, 1998

COMPONENT DESIGN (MYPATTERNS)[2]
Patterns: Matching Numerical Phrases
- p_inl + p_btwfrm: between 1987 and 1997, in 2007 and 2008
- p_inl + p_btwfrmd: from 200 to 300 miles, from 7.5 percent to 6.85 percent
- p_date: 18 April 2008
- p_tnl + p_days: this Monday, next Saturday, last Friday, Tuesday, Wednesday
- p_centuary: 17th century, 17th-century
- p_words + p_hyphenww: million-dollar home, six-bedroom home, thirty-three dollars
- p_hyphennumnum: the 20-20 match, a 3-2 lead
- p_in: 9 in 10 people, 1 in every 8 women
- p_mids: mid-1990s, the early 1990s, 1970s
- p_months: January, February, December, Jan, Feb, Sept, Dec

COMPONENT DESIGN (MYPATTERNS)[3]
Patterns: Matching Numerical Phrases
- p_words + p_numunit: 33 USD, about 34 miles, 33,333 tons, 3.3 million dollars, one thing, 3.4 billion
- p_words + p_per: $33 per day, about 100 miles per hour
- p_words + p_percentinches: 39%, 0.5-1%, about 90 %, 20"
- p_ratio: one of the five people, 89 percent of people, 3 out of 5 people
- p_tty: today, tomorrow, yesterday, noon
- p_twmy: this year, this month, next year, next month, last week, last year, last month
- p_xbits: 1024KB, 8MB, 320GB, 1TB
- p_words + p_yrange: In 1998-99, during 2000-09

SAMPLE SENTENCES[1]
Sentence, followed by the patterns it matches
- I have lost 33,000 dinars in 1998
  Patterns: p_numunit, p_btwfrm
- At just 12-years-old, he enrolled as a freshman at F.I.U. in Miami.
  Patterns: p_age
- The 20" iMac is cheaper at $1200 and it has a 320GB hard drive.
  Patterns: p_percentinches, p_numunit, p_xbits
- Volunteers bring in a heavy crane for work on a bridge last month.
  Patterns: p_twmy
- As for those who do not invest, around 40% say capitalism is better.
  Patterns: p_percentinches
- As of 7 January 2007, about 75 people have died and another 183 infected.
  Patterns: p_date, p_numunit

SAMPLE SENTENCES[2]
Sentence, followed by the patterns it matches
- Approximately 1% of human sufferers die of the disease.
  Patterns: p_percentinches
- Current listings of 2,000 children and adults who are reported missing, including in-depth coverage of high-profile cases.
  Patterns: p_and
- 38 of the 62 patients who provided blood samples tested positive.
  Patterns: p_ratio
- She became an exotic dancer at Scores in New York City in the mid-1990s.
  Patterns: p_mids
- Peterson's three capped the surge, giving New Orleans a 64-51 lead.
  Patterns: p_numunit, p_hyphennumnum

PROBLEMS ENCOUNTERED
- Determining the Patterns
  - Lots of Numerical Phrases found
  - Designed Patterns to filter more than one kind of Numerical Pattern
- Prioritizing the Patterns
  - More than one pattern may match the same Numerical Phrase
  - To avoid clashes between the Patterns

PROJECT EVALUATION[1]
Test Case: Main Functionality Tested: Pass/Fail
- Test Case 1: Application Functionality: Pass
- Test Case 2: POS Tagger Functionality: Pass
- Test Case 3: Numerical Phrase Extractor Functionality: Pass
- Test Case 4: Number-Unit/Date Pattern Recognizer Functionality: Pass

PROJECT EVALUATION[2]
Phase: Expected Completion: Actual Completion
- Phase 1: February 26, 2009: February 24, 2009
- Phase 2: March 26, 2009: March 31, 2009
- Phase 3: April 17, 2009: April 14, 2009

PROJECT EVALUATION[3]
- Phase 2 took more time since Implementation and Testing were done simultaneously

PROJECT EVALUATION[4]
- More time for Coding and the Documentation

PROJECT EVALUATION[5]
- More time spent in discussion since it is the initial phase

PROJECT EVALUATION[6]
- More time is spent in Coding after gathering the requirements in the first phase.

PROJECT EVALUATION[7]
- A lot of time was spent on documentation as per the ETDR standards.
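The clash noted under PROBLEMS ENCOUNTERED (more than one pattern matching the same numerical phrase) could be handled by trying patterns in priority order and skipping any later match that overlaps a span already claimed. The sketch below is one possible approach under stated assumptions, not the project's implementation: the first regex is the hyphenated-number pattern quoted on the SOME PATTERNS slide, its pairing with the name p_hyphennumnum is assumed, and generic_number_unit is a hypothetical catch-all introduced only for this example.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PatternPrioritySketch {
    /** One accepted match: the pattern that fired and the span it covers. */
    public static final class Hit {
        public final String name;
        public final int start;
        public final int end;
        public final String text;

        Hit(String name, int start, int end, String text) {
            this.name = name;
            this.start = start;
            this.end = end;
            this.text = text;
        }
    }

    // LinkedHashMap keeps insertion order, so earlier entries have priority.
    private static final Map<String, Pattern> PRIORITIZED = new LinkedHashMap<>();
    static {
        // Regex from the slides; the name attached to it is an assumption.
        PRIORITIZED.put("p_hyphennumnum",
                Pattern.compile("\\d+-\\d+(/JJ|/CD) [a-zA-Z]+/NN"));
        // Hypothetical broader pattern that also matches the span above.
        PRIORITIZED.put("generic_number_unit",
                Pattern.compile("\\S+/(?:CD|JJ) [a-zA-Z]+/NNS?"));
    }

    // Accepts matches pattern by pattern; a match is dropped when its
    // character span overlaps one accepted from a higher-priority pattern.
    public static List<Hit> match(String tagged) {
        List<Hit> accepted = new ArrayList<>();
        for (Map.Entry<String, Pattern> entry : PRIORITIZED.entrySet()) {
            Matcher m = entry.getValue().matcher(tagged);
            while (m.find()) {
                if (!overlaps(accepted, m.start(), m.end())) {
                    accepted.add(new Hit(entry.getKey(), m.start(), m.end(), m.group()));
                }
            }
        }
        return accepted;
    }

    private static boolean overlaps(List<Hit> hits, int start, int end) {
        for (Hit h : hits) {
            if (start < h.end && h.start < end) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String tagged = "New/NNP Orleans/NNP took/VBD a/DT 64-51/JJ lead/NN "
                + "after/IN 33/CD minutes/NNS";
        for (Hit h : match(tagged)) {
            System.out.println(h.name + " -> " + h.text);
        }
    }
}
```

With this ordering, "64-51/JJ lead/NN" is reported once under the specific pattern, while "33/CD minutes/NNS" still falls through to the generic one. Reordering the map entries changes which pattern wins, which is the design decision the slide refers to as prioritizing.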
FUTURE WORK
- Adding more Patterns
  - To filter more different kinds of numerical phrases
- Improving the Output Display
  - By displaying the number and date phrases in different colors
  - To make it more readable for the user

LESSONS LEARNED
- Java
- Tool Usage
- Java Eclipse IDE
- Design Development
- MS Visio
- SDLC
- Documentation

PROTOTYPE DEMONSTRATION
- KSNES Project
  - Set up as a Service on the CIS Server
  - A webpage is set up: http://viper.cis.ksu.edu:11603/numerical/

FINAL STEPS
- Final Examination Ballot
- Make necessary changes to the MSE Portfolio
- Deliver the Portfolio

Questions?? Suggestions!!
THANK YOU