KDD-Service based Numerical Entity Searcher (KSNES)
Presentation 3 on April 14th, 2009
Naga Sowjanya Karumuri
sowji@ksu.edu
CIS 895 – MSE PROJECT
1
OUTLINE

Introduction
Terms
 Motivation
 Goal

Project Overview
 Project Data Flow Diagram
 Component Design
 Project Evaluation
 Future Work
 Prototype Demonstration
 Questions / Comments

2
TERMS[1]

Knowledge Discovery in Databases (KDD)
A group headed by Dr. Hsu
 Primary focus: machine learning, data mining, and human-computer intelligent interaction


Natural Language Processing (NLP)
To allow computers to process and understand
human languages
 Some areas like

Text Segmentation (identify word boundaries)
 Part-of-speech tagging
 Word sense disambiguation (words with more than one
meaning)

3
TERMS[2]

Named Entity Recognition (NER)

Locating and classifying atomic elements (single part
of speech) in text into predefined categories such as
Names of Persons
 Names of Locations
 Names of Organizations
 Names of Miscellaneous Entities


Example
Dr. William H. Hsu is a Professor at Kansas State
University located in Manhattan, Kansas.
 Dr. [PER William H. Hsu ] is a Professor at [ORG
Kansas State University ] located in [LOC
Manhattan ] , [LOC Kansas ] .

4
TERMS[3]

Shallow Parsing/Chunking
An NLP technique that looks for key phrases without fully parsing the text into a parse tree.
 Output: a series of word groups, mostly noun, verb, and prepositional phrases.


Example
Chunker: [NP He ] [VP reckons ] [NP the current
account deficit ] [VP will narrow ] [PP to ] [NP only
L1.8 billion ]
 Full Parser: (PRP)He (VBZ)reckons (DT)the
(JJ)current (NN)account (NN)deficit (MD)will
(VB)narrow (TO)to (RB)only (L)L (CD)1.8 (CD)billion

5
PROJECT OVERVIEW[1]

Motivation

Occurrence of events is naturally anchored in time
within the narrative text
Is Bush currently the President of America?
 When was India attacked by Pakistan in the last century?


To know the quantities of entities
How many Oscar awards have been won by Steven Spielberg?
 What was the highest temperature recorded in the year
2008?

6
PROJECT OVERVIEW[2]

Goal

To develop a system that
extracts Numerical Phrases from raw text
 displays value – unit – unit-type

System is set as a service on the web server
 User interacts through a webpage


Numerical Phrase: Types

Number Phrase


33 dollars, 100 Watts, 13 years, two miles
Date Phrase

Aug 1998, Nov 10th 1984, between 1989 and 2006
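The value, unit, and unit-type display described in the goal can be pictured with a small Java sketch. This is only an illustration for simple two-part number phrases such as "33 dollars"; the class name `NumberPhrase`, the `parse` helper, and the unit-type labels in the lookup table are assumptions made for this example, not the project's actual code or vocabulary.

```java
import java.util.Map;

final class NumberPhrase {
    // Hypothetical unit-to-type lookup, for illustration only.
    private static final Map<String, String> UNIT_TYPES = Map.of(
        "dollars", "currency",
        "Watts", "power",
        "years", "time",
        "miles", "length");

    final String value;
    final String unit;
    final String unitType;

    NumberPhrase(String value, String unit, String unitType) {
        this.value = value;
        this.unit = unit;
        this.unitType = unitType;
    }

    // Splits a phrase such as "33 dollars" into its value and unit,
    // then looks up the unit type (falls back to "unknown").
    static NumberPhrase parse(String phrase) {
        String[] parts = phrase.trim().split("\\s+", 2);
        if (parts.length != 2) {
            throw new IllegalArgumentException("Expected '<value> <unit>': " + phrase);
        }
        String unit = parts[1];
        return new NumberPhrase(parts[0], unit, UNIT_TYPES.getOrDefault(unit, "unknown"));
    }

    @Override
    public String toString() {
        return value + " - " + unit + " - " + unitType;
    }

    public static void main(String[] args) {
        System.out.println(parse("33 dollars")); // 33 - dollars - currency
        System.out.println(parse("two miles"));  // two - miles - length
    }
}
```

Date phrases such as "between 1989 and 2006" would need a different structure (for example a start and an end year), which this sketch does not attempt.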
7
PROJECT OVERVIEW[3]

Purpose




To understand the timestamp of an event
To understand the order of occurrence of events
To understand the persistence of an event i.e., the
time period over which the event occurred and
continued
For KDD Group

To gather certain statistical information from the
data they gather by crawling different web pages
How many cattle have been affected by the virus?
 When did the disease break out?


Sample NABC (National Agricultural Biosecurity Center) data is given to the system for testing
8
APPLICATION AREAS

Textual Entailment (TE) Recognition


Given two text fragments, decide whether the meaning of one can be inferred from the other.
Question Answering (QA) System

Identifies text that entails the expected answer.
Ex: During 1997, 10,000 cattle were killed because of the RVF.
Possible inferences (TE)
 10,000 cattle were killed because of RVF.
 RVF occurred during 1997.
 Possible Questions (QA)
 How many cattle were killed during 1997 RVF outbreak?
 When did RVF occur?

9
SYSTEM OVERVIEW
10
PROJECT DATA FLOW DIAGRAM: NUMERICAL ENTITY SEARCHER
11
MODULES IN THE PROJECT




Webpage (JSP): requests and receives information from the service.
POS Tagger (Java): Stanford POS Tagger.
Numerical Phrase Extractor (Java): implemented using a shallow parsing technique.
Number-Unit/Date Pattern Recognizer (Java): implemented based on the Numerical Quantifier developed by Benjamin Sapp, UIUC.
12
POS TAGGER TAGSET
Source: http://www.cs.ualberta.ca/~lindek/650/Slides/POSTagging.ppt
13
IMPLEMENTING NUMERICAL PHRASE
EXTRACTOR

Input: Tagged Text
I/PRP lost/VBD thirty-three/JJ dollars/NNS in/IN 1998/CD

Regular expressions (regex) are used to determine the numerical patterns in the input.
thirty-three/JJ dollars/NNS
in/IN 1998/CD

Output: Numerical Phrases
thirty-three dollars
in 1998
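The input-to-output step above can be sketched in Java with `java.util.regex`. This is a minimal illustration, not the project's extractor: the class name `NumericalPhraseSketch` and the two simplified patterns are assumptions chosen to cover only the example sentence.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class NumericalPhraseSketch {
    // Hypothetical, simplified patterns; the project's real pattern set is larger.
    private static final Pattern[] PATTERNS = {
        // Hyphenated number word tagged JJ or CD, followed by a noun,
        // e.g. thirty-three/JJ dollars/NNS
        Pattern.compile("[a-z]+-[a-z]+/(?:JJ|CD) [a-zA-Z]+/NNS?"),
        // Preposition "in" followed by a four-digit year, e.g. in/IN 1998/CD
        Pattern.compile("\\b(?:in|In)/IN \\d{4}/CD")
    };

    // Returns the matched phrases with their "/TAG" suffixes removed.
    static List<String> extract(String tagged) {
        List<String> phrases = new ArrayList<>();
        for (Pattern p : PATTERNS) {
            Matcher m = p.matcher(tagged);
            while (m.find()) {
                phrases.add(m.group().replaceAll("/[A-Z]+", ""));
            }
        }
        return phrases;
    }

    public static void main(String[] args) {
        String tagged = "I/PRP lost/VBD thirty-three/JJ dollars/NNS in/IN 1998/CD";
        System.out.println(extract(tagged)); // [thirty-three dollars, in 1998]
    }
}
```

Matching on the tagged text rather than the raw text lets a pattern use the part-of-speech tag (JJ, CD, NNS, IN) as a cue, which is what makes this a shallow-parsing style of extraction.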

14
SOME PATTERNS


"\\d+-\\d+(/JJ|/CD) [a-zA-Z]+/NN"
parses
\\d+-\\d+(/JJ|/CD) | [a-zA-Z]+/NN
3-2/JJ | lead/NN
20-20/JJ | match/NN

"(between|Between|from|From|In|in|since|Since|during|During)/IN ..../CD (([a-zA-Z]+/CC|[a-z]+/TO) ..../CD)?"
parses
'between 1987 and 1997', 'in 2007 and 2008'
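The two patterns on this slide can be tried directly with `java.util.regex`. The sketch below is an illustration only: the class name `SlidePatternsDemo`, the constant names, and the tagged test strings are assumptions, and the character class in the second pattern is written as `[a-zA-Z]`.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class SlidePatternsDemo {
    // Number-number compound (tagged JJ or CD) followed by a singular noun.
    static final Pattern SCORE =
        Pattern.compile("\\d+-\\d+(/JJ|/CD) [a-zA-Z]+/NN");

    // Preposition, four-character year, then an optional "and/to <year>" part.
    static final Pattern YEAR_RANGE = Pattern.compile(
        "(between|Between|from|From|In|in|since|Since|during|During)/IN"
        + " ..../CD (([a-zA-Z]+/CC|[a-z]+/TO) ..../CD)?");

    // Returns the first match of the pattern in the tagged text, or null.
    static String firstMatch(Pattern p, String tagged) {
        Matcher m = p.matcher(tagged);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(firstMatch(SCORE, "a/DT 3-2/JJ lead/NN"));
        // 3-2/JJ lead/NN
        System.out.println(firstMatch(YEAR_RANGE, "between/IN 1987/CD and/CC 1997/CD"));
        // between/IN 1987/CD and/CC 1997/CD
    }
}
```

As transcribed here, the second pattern has a literal space before its optional group, so a single year with nothing after it (for example `in/IN 1998/CD` at the very end of the input) would not match this pattern on its own.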
15
COMPONENT DESIGN


Contains class variables and functions
Added separate table to describe the roles of
functions
16
COMPONENT DESIGN (MYPATTERNS)[1]
Patterns | Matching Numerical Phrases
p_words | about, around, approximately, more than, nearly, almost, no more than, at least, less than, no fewer than
p_tnl | this, next, last, since, in
p_inl | between, from, in, since, during
p_words + p_abtfrac | about two-thirds of the vote, millions of books
p_words + p_age | 27 year-old bachelor, 27-year-old bachelor
p_words + p_ampm | About 3:00 a.m., 4:15 p.m. CST
p_and | 3,792 children and adolescents
p_tnl + p_anydate | Oct 1st 1987, Nov 5, December 21, 1998
17
COMPONENT DESIGN (MYPATTERNS)[2]
Patterns | Matching Numerical Phrases
p_inl + p_btwfrm | between 1987 and 1997, in 2007 and 2008
p_inl + p_btwfrmd | from 200 to 300 miles, from 7.5 percent to 6.85 percent
p_date | 18 April 2008
p_tnl + p_days | this Monday, next Saturday, last Friday, Tuesday, Wednesday
p_centuary | 17th century, 17th-century
p_words + p_hyphenww | million-dollar home, six-bedroom home, thirty-three dollars
p_hyphennumnum | the 20-20 match, a 3-2 lead
p_in | 9 in 10 people, 1 in every 8 women
p_mids | mid-1990s, the early 1990s, 1970s
p_months | January, February, December, Jan, Feb, Sept, Dec
18
COMPONENT DESIGN (MYPATTERNS)[3]
Patterns | Matching Numerical Phrases
p_words + p_numunit | 33 USD, about 34 miles, 33,333 tons, 3.3 million dollars, one thing, 3.4 billion
p_words + p_per | $33 per day, about 100 miles per hour
p_words + p_percentinches | 39%, 0.5-1%, about 90 %, 20"
p_ratio | one of the five people, 89 percent of people, 3 out of 5 people
p_tty | today, tomorrow, yesterday, noon
p_twmy | this year, this month, next year, next month, last week, last year, last month
p_xbits | 1024KB, 8MB, 320GB, 1TB
p_words + p_yrange | In 1998-99, during 2000-09
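One plausible reading of entries such as "p_words + p_numunit" is that a qualifier fragment is concatenated in front of a core pattern and may be absent, since the examples include both "33 USD" and "about 34 miles". The Java sketch below illustrates that reading on untagged text. The regex bodies of `P_WORDS` and `P_NUMUNIT` are assumptions written for this example (the slides give only the pattern names and sample matches), and the class name is hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ComposedPatternSketch {
    // Hypothetical fragment bodies; only the names come from the slides.
    static final String P_WORDS =
        "(?:about|around|approximately|nearly|almost|at least|more than|less than)";
    static final String P_NUMUNIT =
        "\\d+(?:,\\d{3})*(?:\\.\\d+)? [A-Za-z]+";

    // "p_words + p_numunit": an optional qualifier, then a number and a unit word.
    static final Pattern WORDS_NUMUNIT =
        Pattern.compile("(?:" + P_WORDS + " )?" + P_NUMUNIT);

    // Returns the first composed match in the text, or null if there is none.
    static String find(String text) {
        Matcher m = WORDS_NUMUNIT.matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(find("It is about 34 miles away")); // about 34 miles
        System.out.println(find("She shipped 33,333 tons"));   // 33,333 tons
    }
}
```

Keeping the qualifier as a separate fragment means one list of words (about, nearly, at least, and so on) can be reused in front of several core patterns instead of being repeated inside each one.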
19
SAMPLE SENTENCES[1]
Sentence | Patterns
I have lost 33,000 dinars in 1998 | p_numunit, p_btwfrm
At just 12-years-old, he enrolled as a freshman at F.I.U. in Miami. | p_age
The 20" iMac is cheaper at $1200 and it has a 320GB hard drive. | p_percentinches, p_numunit, p_xbits
Volunteers bring in a heavy crane for work on a bridge last month. | p_twmy
As for those who do not invest, around 40% say capitalism is better. | p_percentinches
As of 7 January 2007, about 75 people have died and another 183 infected. | p_date, p_numunit
20
SAMPLE SENTENCES[2]
Sentence | Patterns
Approximately 1% of human sufferers die of the disease. | p_percentinches
Current listings of 2,000 children and adults who are reported missing, including in-depth coverage of high-profile cases. | p_and
38 of the 62 patients who provided blood samples tested positive. | p_ratio
She became an exotic dancer at Scores in New York City in the mid-1990s. | p_mids
Peterson's three capped the surge, giving New Orleans a 64-51 lead. | p_numunit, p_hyphennumnum
21
PROBLEMS ENCOUNTERED

Determining the Patterns
Lots of Numerical Phrases found
 Designed Patterns to filter more than one kind of
Numerical Pattern


Prioritizing the Patterns
More than one pattern may match the same
Numerical Phrase
 To avoid clashes between the Patterns
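One common way to handle such clashes is to try the patterns in a fixed priority order and let an earlier pattern claim a span of text so that later patterns cannot match inside it. The Java sketch below illustrates that idea only; it is not taken from the project. The pattern names come from the slides, but the simplified regex bodies, the priority order, and the test sentence are assumptions.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class PatternPrioritySketch {
    // Insertion order is the priority order. Regex bodies are hypothetical.
    static final Map<String, Pattern> PRIORITIZED = new LinkedHashMap<>();
    static {
        PRIORITIZED.put("p_hyphennumnum", Pattern.compile("\\b\\d+-\\d+ [a-z]+"));
        PRIORITIZED.put("p_numunit", Pattern.compile("\\b\\d+ [a-z]+"));
    }

    // Labels each phrase with the highest-priority pattern that matches it;
    // a lower-priority match overlapping a claimed span is discarded.
    static List<String> label(String text) {
        List<int[]> claimed = new ArrayList<>();
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, Pattern> e : PRIORITIZED.entrySet()) {
            Matcher m = e.getValue().matcher(text);
            while (m.find()) {
                if (!overlaps(claimed, m.start(), m.end())) {
                    claimed.add(new int[] {m.start(), m.end()});
                    out.add(e.getKey() + ": " + m.group());
                }
            }
        }
        return out;
    }

    // True if the half-open range [start, end) intersects any claimed span.
    static boolean overlaps(List<int[]> spans, int start, int end) {
        for (int[] s : spans) {
            if (start < s[1] && s[0] < end) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(label("New Orleans took a 64-51 lead with 12 seconds left"));
        // [p_hyphennumnum: 64-51 lead, p_numunit: 12 seconds]
    }
}
```

Without the overlap check, the simplified p_numunit pattern would also report "51 lead", which is part of the phrase already matched by p_hyphennumnum.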

22
PROJECT EVALUATION[1]

Test Case | Main Functionality Tested | Pass/Fail
Test Case 1 | Application Functionality | Pass
Test Case 2 | POS Tagger Functionality | Pass
Test Case 3 | Numerical Phrase Extractor Functionality | Pass
Test Case 4 | Number-Unit/Date Pattern Recognizer Functionality | Pass
23
PROJECT EVALUATION[2]

Phase | Expected Completion Phase | Actual Completion Phase
1 | February 26, 2009 | February 24, 2009
2 | March 26, 2009 | March 31, 2009
3 | April 17, 2009 | April 14, 2009
24
PROJECT EVALUATION[3]

Phase 2 took more time since Implementation and Testing were done simultaneously
25
PROJECT EVALUATION[4]

More time for Coding and the Documentation
26
PROJECT EVALUATION[5]

More time was spent on discussion since it was the initial phase
27
PROJECT EVALUATION[6]

More time was spent on Coding after gathering the requirements in the first phase.
28
PROJECT EVALUATION[7]

A lot of time was spent on Documentation to meet the ETDR standards.
29
FUTURE WORK

Adding more Patterns


To filter more kinds of numerical phrases
Improving the Output Display
By displaying the number and date phrases in
different colors
 To make it more readable for the user

30
LESSONS LEARNED

Java Tool Usage
Java
 Eclipse IDE


Design Development
MS Visio
 SDLC
 Documentation

31
PROTOTYPE DEMONSTRATION

KSNES Project
Set up as a Service on the CIS Server
 A webpage is set up:


http://viper.cis.ksu.edu:11603/numerical/
32
FINAL STEPS
Final Examination Ballot
 Make necessary changes to the MSE Portfolio
 Deliver the Portfolio

33
Questions??
Suggestions!!
THANK YOU
34