Stop words

Stop Word and Related Problems in
Web Interface Integration
Eduard C. Dragut (speaker), University of Illinois at Chicago
Fang Fang, University of Illinois at Chicago
Clement Yu, University of Illinois at Chicago
Prasad Sistla, University of Illinois at Chicago
Weiyi Meng, SUNY at Binghamton
VLDB 2009, Lyon, France
Objectives

Address the problem of automatically identifying the set of stop
words in a given application domain.
  “Stop words is the name given to words which are filtered out prior to, or
  after, processing of natural language data (text)”, wikipedia.org,
  answers.com
  There is no definite list of stop words.
  The process of identifying stop words is manually carried out.
  Hans Peter Luhn is credited with coining the phrase.

Establish semantic relationships between multi-word phrases
beyond those in electronic dictionaries (e.g., WordNet)
  We focus on synonymy and hyponymy/hypernymy relationships

Analyze the impact of words such as and and or (which, by the
way, are regarded as stop words, in general) when establishing
semantic relationships
  E.g., Is drop-off date and time a hyponym of date and time?
E. Dragut et al Stop Word and Related Problems in Web Interface Integration
Page 2
A Motivating Scenario

Looking for the cheapest ticket

Chicago – Paris, August 20th – August 29th
AirFrance.com

A user looking for the “best” price for a ticket:


Has to explore multiple sources
It is tedious, frustrating and time-consuming
The Goal

Provide a unified way to query
multiple sources in the same domain
The Web
Unified query interface
AirFrance.com
Formulate the query
Lufthansa.com
united.com
delta.com
nwa.com
Overview of Integrating Web Interfaces
[Figure: stages of integrating (Deep) Web query interfaces, with example domains Auto, Books, Car Rental and Airfare]
Cluster query interfaces: Barbosa07, He04, Peng04
Extract query interfaces: He05, Zhang04, Dragut09
Match query interfaces: B.He03, Dhamankar04, Doan02, Madhavan05, Wu04, 06
Integration of interfaces: H.He03, Dragut06
Various formats, e.g. ASCII files
Motivation for Stop Words


Automating the process of identifying the set of stop words
Establishing semantic relationships between labels


Stop words express important semantic information and their removal may
lead to erroneous logic inferences
Stop words removal may leave some labels empty

Issue: No semantic relationships can be established using empty labels
Motivation for Stop Words, cont’

The stop words are domain
dependent, i.e. a stop word
in one domain may not be
a stop word in another
domain.

The word where is a stop
word in the Airline
domain, but not in the
Credit Card domain
Motivation for Semantic Enrichment Words


The labels of attributes may
contain the words AND, OR
and the characters “/”, “&”
Questions:




What is their intended use?
What are their semantics?
Where are they used, in the
labels of fields or in the labels of
sections?
How should they be handled
when semantic relationships are
established?


Is “Pick-up Date & Time” a
hyponym of “Dates & Time”?
Is “Pick-up Date ” a hyponym
of “Pick-up Date & Time”?
Motivation for Semantic Relationships

Goal:


Provide a systematic way to distinguish between synonymy
and hyponymy relationships
Usage:



Schema matching
Naming the attributes of an integrated query interface [Dragut
06], as part of Web interface integration
Integration of hierarchies

Two synonym concepts from distinct hierarchies are collapsed into
one concept in the integrated hierarchy
The Stop Words Problem - Solution

The Problem:


Given a set of query interfaces in the same application domain (e.g., real
estate), determine those words within the labels of the query interfaces
that are stop words
The input:

A set of query interfaces in the same domain


E.g. Airline domain: Delta, AA, NWA, Orbitz, Travelocity
Each query interface is represented hierarchically [Wu04]
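[Figure: hierarchical representation of a Vacations query interface. Section labels: "Where and when do you want to travel?" and "How many people are going?". Other labels: Departing from, Going to, Leaving, Returning, Adults, Seniors, Children. Fields: depDate, depTime, retDate, retTime.]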

A general purpose dictionary of stop words

E.g. dcs.gla.ac.uk/idom/ir_resources/linguistic_utils/stop_words
The Stop Words Problem - Solution

The main heuristic observation:

The set of stop words from an Information Integration perspective is a
subset of the set of stop words from an Information Retrieval perspective

E.g. the word last in the label Last Name is a stop word from an IR perspective,
but it is not a stop word in the label.

The strategy

Take an arbitrary general purpose dictionary of stop words and find its
largest subset satisfying constraints specific to the information integration
problem.
The constraints

After the removal of incorrect stop words, the following situations arise:



Empty label - A non-empty label becomes empty after the removal. It cannot
be used to derive any knowledge.
Homonymy - Two sibling nodes in a hierarchy have synonym labels.
Hyponymy - Two sibling nodes in a hierarchy have hyponym labels.
The Stop Words Problem - Solution

Did we run into a chicken or egg dilemma?



Before semantic relationships between labels can be established, the stop
words need to be removed.
To remove the stop words, semantic relationships need to be established.
How do we avoid this issue?

Regard labels as mere sets of words. Consequently,


Two labels are synonyms if they have identical sets of words
A label is a hyponym of another label if the set of words of the former is a
proper subset of the set of words of the latter.
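
The set-of-words view above can be sketched in a few lines. This is only an illustration: the function names are hypothetical, and the direction of the hyponym test follows the statement on this slide literally (for the sibling constraints only the existence of a subset relation matters).

```python
def words(label: str) -> frozenset:
    """Treat a label as a bare set of lower-cased words."""
    return frozenset(label.lower().split())


def set_relation(a: str, b: str) -> str:
    """Classify two labels using only set equality and proper-subset tests."""
    wa, wb = words(a), words(b)
    if wa == wb:
        return "synonyms"             # identical sets of words
    if wa < wb:
        return "a is a hyponym of b"  # words of a are a proper subset of b's
    if wb < wa:
        return "b is a hyponym of a"
    return "unrelated"
```

No dictionary lookup is involved, which is what breaks the circular dependency between stop word removal and relationship discovery.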
The Stop Words Problem - Solution

Why does this work?


There are numerous instances of query interfaces such that labels of
sibling nodes share many of their words and the designer uses some key
distinct words to emphasize the semantic meanings of the sibling fields.
Example:
The Stop Words Problem - Solution



The Stop Word Problem is intractable: it is NP-complete.
Worse, regardless of the subset of constraints chosen the
problem remains “equally” hard.
Common practice

Come up with an approximation algorithm


Not covered.
The proposed algorithm produces a maximal set of stop words
with respect to the stop word constraints.
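
The talk does not cover the authors' approximation algorithm, so the following is not their method. It is a minimal greedy sketch of what "maximal with respect to the constraints" can mean, assuming labels are plain sets of words, sibling labels are given as groups, and the three constraints are checked by set emptiness, equality and subset tests. All names are hypothetical.

```python
from typing import FrozenSet, Iterable, List, Set


def content_words(label: str, stop: Set[str]) -> FrozenSet[str]:
    """Words of a label that survive removal of the given stop words."""
    return frozenset(w for w in label.lower().split() if w not in stop)


def violates(sibling_groups: List[List[str]], stop: Set[str]) -> bool:
    """True if removing `stop` empties a label or makes two sibling labels
    identical (synonyms) or subset-related (hyponyms)."""
    for group in sibling_groups:
        sets = [content_words(label, stop) for label in group]
        if any(not s for s in sets):        # empty-label constraint
            return True
        for i, s in enumerate(sets):
            for t in sets[i + 1:]:
                if s <= t or t <= s:        # identical sets or proper subset
                    return True
    return False


def greedy_stop_words(sibling_groups: List[List[str]],
                      candidates: Iterable[str]) -> Set[str]:
    """Keep each candidate whose removal, together with the words already
    kept, violates no constraint."""
    chosen: Set[str] = set()
    for word in sorted(candidates):
        if not violates(sibling_groups, chosen | {word}):
            chosen.add(word)
    return chosen
```

Because a violation never disappears when more words are removed, a single pass suffices for maximality in this sketch: no rejected candidate could be added back later. For the sibling groups `[["Departing from", "Going to"], ["First Name", "Last Name"]]` and candidates `first, from, last, the, to`, the words `first` and `last` are rejected because removing either one makes `Name` a subset of its sibling label.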
Semantic Relationships Among Labels

The goal is to devise a methodology for establishing synonymy
and hyponymy relationships between multi-word phrases.

Why is the problem of establishing semantic relationships
between labels (names) difficult in practice?

Is it because, in a given application domain, a content word occurs with
multiple senses with respect to an (electronic) dictionary (e.g., WordNet
[Fellbaum98])?


Is it because of the context of usage of words?


E.g. Select an area vs. Minimum floor area
E.g. Home address vs. Business address
Is it because of the occurrence of the semantic enrichment words?


E.g., Pick-up date and time vs. Pick-up date
E.g., Date and time vs. Pick-up date and time
The Sense of a Word in a Domain

To better see the number of meanings of content words

Create inverted lists of labels for each domain used in our experiments



9 domains were used. There are 735 distinct words and 2,319 labels.
Manually check the number of meanings of each word.
Domain       Word     Labels
Credit Card  Address  Home address, Company address, Email address
Credit Card  Type     3rd party credit card type, Major credit card type
Real estate  Type     Property type, Parcel type, Type of use
Real estate  Area     Select an area, Minimum floor area
Finding: Only one word (i.e., the word “area” in the Real estate domain) out
of 735 words has multiple senses in the same application domain.

Assumption:

each word has a unique sense in a given domain.
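
The inverted lists mentioned above can be built as follows. This is a small sketch with hypothetical names; the sample labels come from the table on this slide, and the sense check itself remains manual.

```python
from collections import defaultdict
from typing import Dict, List


def inverted_lists(labels_by_domain: Dict[str, List[str]]) -> Dict[str, Dict[str, List[str]]]:
    """For each domain, map every word to the labels in which it occurs."""
    index: Dict[str, Dict[str, List[str]]] = defaultdict(lambda: defaultdict(list))
    for domain, labels in labels_by_domain.items():
        for label in labels:
            for word in set(label.lower().split()):
                index[domain][word].append(label)
    return index


sample = {
    "Real estate": ["Property type", "Select an area", "Minimum floor area"],
    "Credit Card": ["Home address", "Company address", "Email address"],
}
index = inverted_lists(sample)
```

Listing `index["Real estate"]["area"]` puts `Select an area` next to `Minimum floor area`, which makes the two senses of the word easy to spot by eye.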
Defining Semantic Relationships

Normalization [e.g., He03 et al., Madhavan01 et al., Rahm01 et al.]


E.g. Adults (18-64) becomes adult
A label is seen as a set of normalized content words



E.g., {area, study} corresponds to Area of Study
E.g., {field, work} corresponds to Field of Work
A label A is a synonym of a label B if there exists a bijection f
between their sets of words, such that f(a) = b, if a = b or a is a
synonym of b.
Area of Study is a synonym of Field of Work


Area is synonym of Field (by WordNet)
Study is synonym of Work (by WordNet)
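
The normalization step that turns a label into a set of content words can be sketched as below. The slides do not spell out the normalization rules, so the stop word list and the crude plural stripping are assumptions standing in for a real stemmer or lemmatizer.

```python
import re

# Hypothetical, tiny stop word list used only for this illustration.
STOP_WORDS = {"of", "the", "a", "an"}


def normalize(label: str) -> frozenset:
    """Return the set of normalized content words of a label."""
    text = re.sub(r"\([^)]*\)", " ", label.lower())   # drop "(18-64)" and the like
    result = set()
    for token in re.findall(r"[a-z]+", text):
        if token in STOP_WORDS:
            continue
        # Crude singularization (assumption): "adults" -> "adult".
        if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
            token = token[:-1]
        result.add(token)
    return frozenset(result)
```

With these rules `Adults (18-64)` becomes `{adult}`, and `Area of Study` and `Field of Work` become `{area, study}` and `{field, work}`, the two sets between which the bijection is sought.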
Defining Semantic Relationships

A label A is a hypernym (i.e., semantically more general) of a
label B if there exists an injection f from A’s set of words to B’s
set of words, such that f(a) = b, if either a = b, a is a synonym of
b or a is a hypernym of b. If A and B have the same number of
content words then at least one of the relationships must be a
hypernymy relationship.


The intuition is that additional words usually restrict the meaning of a
phrase
Example:


Financial Information is a hypernym of Household Financial Information
Employment Information is a hypernym of Job Information

Employment is a hypernym of Job (by Wordnet)
Computing Semantic Relationships

Between two sets A and B, with A and B having n and m elements
(n ≤ m), respectively, there can be P(m, n) injective functions (P
stands for permutation) and m! bijective functions, if n = m.


A brute force enumeration algorithm takes exponential time.
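
To make the cost concrete, here is a brute-force check of the hypernym definition that enumerates every injection from A's words to B's words. It is a sketch only: the tiny `SYN` and `HYPER` tables are hypothetical stand-ins for WordNet lookups, and the function names are not from the talk.

```python
from itertools import permutations
from math import perm
from typing import Iterable, Optional

# Hypothetical toy lexicon standing in for WordNet.
SYN = {frozenset(("area", "field")), frozenset(("study", "work"))}
HYPER = {("employment", "job")}          # (more general, more specific)


def word_rel(a: str, b: str) -> Optional[str]:
    """Relationship between two words: 'syn', 'hyper' or None."""
    if a == b or frozenset((a, b)) in SYN:
        return "syn"
    if (a, b) in HYPER:
        return "hyper"
    return None


def count_injections(n: int, m: int) -> int:
    """Number of injections from an n-element set into an m-element set."""
    return perm(m, n)


def is_hypernym_brute(A: Iterable[str], B: Iterable[str]) -> bool:
    """Try every injection f: A -> B and test the definition directly."""
    A, B = sorted(set(A)), sorted(set(B))
    if len(A) > len(B):
        return False
    for image in permutations(B, len(A)):      # one tuple per injection
        rels = [word_rel(a, b) for a, b in zip(A, image)]
        if None in rels:
            continue
        # Equal sizes require at least one word-level hypernymy.
        if len(A) < len(B) or "hyper" in rels:
            return True
    return False
```

Even for short labels the number of candidates grows quickly: `count_injections(3, 5)` is 60 and `count_injections(5, 5)` is 120.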
Solution sketch:

Convert the problem to bipartite matching problems



The vertices of the graph correspond to the content words of the labels.
An edge corresponds to two words of the two labels being either equal,
synonyms or hyponyms.
The trick to distinguish a synonymy relationship from a hyponymy one is:




To assign a weight of 1 to edges denoting equality or synonymy relationships and a
weight of 2 to edges denoting hyponymy relationships.
When |A| = |B| (|A| = number of content words of A) , a synonymy relationship
corresponds to a maximum weighted bipartite matching whose weight is equal
to |A|.
When |A| = |B| a hyponymy relationship corresponds to a maximum weighted
bipartite matching whose weight is larger than |A|.
When |A| < |B| a hyponymy relationship corresponds to a maximum bipartite
matching whose weight is equal to |A|.
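
The matching formulation can be sketched without a general weighted-matching routine. Because the weights are only 1 and 2, a matching that covers all of A has weight larger than |A| exactly when it uses at least one weight-2 edge, so plain augmenting-path matching (Kuhn's algorithm) is enough here. This is an illustration under assumptions, not the authors' implementation: the toy lexicon is hypothetical, A is taken as the candidate hypernym, and every word of A is required to be matched.

```python
from typing import Dict, Iterable, List, Set

# Hypothetical toy lexicon standing in for WordNet.
SYNONYMS = {frozenset(("area", "field")), frozenset(("study", "work"))}
HYPERNYMS = {("employment", "job")}      # (more general, more specific)


def edge_weight(a: str, b: str) -> int:
    """0 = no edge, 1 = equal or synonym, 2 = a is a hypernym of b."""
    if a == b or frozenset((a, b)) in SYNONYMS:
        return 1
    if (a, b) in HYPERNYMS:
        return 2
    return 0


def max_matching(adj: Dict[str, List[str]]) -> int:
    """Size of a maximum bipartite matching (Kuhn's augmenting paths)."""
    match_of: Dict[str, str] = {}        # word of B -> matched word of A

    def try_augment(a: str, seen: Set[str]) -> bool:
        for b in adj[a]:
            if b in seen:
                continue
            seen.add(b)
            if b not in match_of or try_augment(match_of[b], seen):
                match_of[b] = a
                return True
        return False

    return sum(try_augment(a, set()) for a in adj)


def classify(A: Iterable[str], B: Iterable[str]) -> str:
    """Return 'synonym', 'hypernym' (A more general than B) or 'none'."""
    A, B = sorted(set(A)), sorted(set(B))
    if len(A) > len(B):
        return "none"
    adj = {a: [b for b in B if edge_weight(a, b)] for a in A}
    if max_matching(adj) < len(A):       # some word of A cannot be matched
        return "none"
    if len(A) < len(B):
        return "hypernym"
    # |A| == |B|: is there a perfect matching that uses a weight-2 edge?
    for a in A:
        for b in adj[a]:
            if edge_weight(a, b) == 2:
                rest = {x: [y for y in adj[x] if y != b] for x in A if x != a}
                if max_matching(rest) == len(A) - 1:
                    return "hypernym"
    return "synonym"
```

On the labels used in this talk, `classify({"area", "study"}, {"field", "work"})` gives `synonym`, while `{"employment", "information"}` against `{"job", "information"}` gives `hypernym`.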
Computing Semantic Relationships

Examples:
Synonymy – as a perfect matching: {Area, Study} matched with {Field, Work}
Hyponymy – as a maximum bipartite weighted matching: {Employment, Information} matched with {Job, Information}; the edge between Employment and Job denotes a hyponym edge
Hyponymy – as a maximum bipartite matching: {Household, Financial, Information} matched with {Financial, Information}
Dictionary Senses versus Context of Use

An example:

Consider the noun Address in the following labels:


Home Address, Company Address, Relative’s Address, Email Address
Address has the same meaning in all of them, according to Wordnet:



“the place where a person or organization can be found or communicated with”
It will wrongly suggest that Home Address is a hyponym of Address
(Electronic) Dictionaries are limited


The context of a label needs to be also taken into consideration
The context of a label of an internal node is the set of its descendant leaves
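
The notion of context can be sketched with a small tree type. The `Node` class and the sample hierarchy are illustrative assumptions (loosely based on the Vacations interface shown earlier), not the authors' data structure.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class Node:
    label: str
    children: List["Node"] = field(default_factory=list)


def context(node: Node) -> Set[str]:
    """Labels of all descendant leaves of a node (a leaf yields itself)."""
    if not node.children:
        return {node.label}
    leaves: Set[str] = set()
    for child in node.children:
        leaves |= context(child)
    return leaves


# Illustrative hierarchy; the grouping of fields is assumed.
travel = Node("Where and when do you want to travel?", [
    Node("Leaving", [Node("depDate"), Node("depTime")]),
    Node("Returning", [Node("retDate"), Node("retTime")]),
])
```

Two internal nodes whose labels share a word such as Address can then be compared by their contexts rather than by the dictionary sense of the shared word alone.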
Semantic Enrichment Words, briefly


In the presence of semantic enrichment words (i.e., and and or),
the intuition that additional words restrict the meaning of a phrase
is no longer true
Examples:



Pick-up date is a hyponym of Pick-up date and time
City or airport code is a hyponym of City, point of interest or airport code
Some observations:


AND appears frequently (91.3%) among the labels of the internal nodes
OR appears frequently (96%) among the labels of the (fields) leaf nodes
Experiments

Goals:

Evaluate the approximation algorithm for computing the
dictionary of stop words.

Assess the ability of the proposed methods to establish
semantic relationships.

Determine the impact of stop words on determining
semantic relationships.
Experiments

Setup

9 real world domains from the web

Parts of the data set were also used in Wu06 et al., Madhavan05 et al., Dragut06 et al.
Domain       # interfaces  Avg. # fields per interface  Avg. # internal nodes per interface  Avg. depth of interfaces
Airfare      20            10.7                         5.1                                  3.6
Automobile   20            5.1                          1.7                                  2.4
Book         20            5.4                          1.3                                  2.3
Job          20            4.6                          1.1                                  2.1
Real Estate  20            6.5                          2.4                                  2.7
Car Rentals  20            10.4                         2.4                                  2.5
Hotels       30            7.6                          2.4                                  2.3
Credit Card  20            50.15                        20.25                                3.6
Alliances    50            15.3                         8.32                                 3.58
Experiments: Gold Standard Stop Words

How was the gold standard created?

Following the intuition:


A word is not a stop word if there is a label whose meaning changes
so “drastically” after the removal of the word from the label that the
new label does not resemble in any way the original meaning of the
label.
Examples:
The word yourself in the Credit Card domain is not a stop word
because of labels such as Please tell us about yourself
The word who in the Airline domain is not a stop word because of
labels such as Who is going in this trip?

Experiments: Evaluating Stop Words

From left to right Precision, Recall, F-score
Experiments: Discussion on Stop Words
Example of non-stop words commonly regarded as stop words

Domain       Found non-stop words                             Missed non-stop words
Airfare      first, last, from, to, when, and, or             where, who
Alliances    from, to, on, yourself, no, for, there, and, or  where, when, who, by
Auto         first, last, from, to, within, or                -
Book         first, last, before, or                          after
Car Rental   to, and, or                                      from, last
Credit Card  first, last, per, and, or                        yourself
Real Estate  to, from, or                                     -

Why do we miss some of them?
Experiments: Semantic Relationships

The gold standard


Manually created for each of the 9 domains.
Contains 7,544 relationships: 4,103 (54.4%) are synonymy relationships
and 3,441 (45.6%) are hypernymy/hyponymy relationships.
Experiments: The Naïve Algorithm


It uses only the dictionary senses of individual words
Why is the accuracy so poor and ranging over such a large interval
(from 39% to 97.3%)?


It compares labels without taking into consideration their contexts.
It blindly establishes semantic relationships between labels that share some words.
Experiments: The Improved Algorithm




It combines the context of labels
and semantic enrichment words.
F-score ranges from 82.1% to
99.3%, with the mean at 92.6%
and a standard deviation of
5.9%.
The naive algorithm has a mean
F-score of 74.9% and a standard
deviation of 18.5%.
It improves the average
precision to 95%, the average
recall to 90.4% and the average
F-score to 92.6%.
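
As a quick arithmetic check, the F-score is the harmonic mean of precision and recall. The sketch below applies that formula to the reported averages; how the talk's own average F-score was aggregated across domains is not stated, so the agreement is only indicative.

```python
def f_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (the balanced F-score)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Reported averages for the improved algorithm.
f = f_score(0.95, 0.904)
```

Rounded to one decimal place, `f` corresponds to 92.6%, in line with the average F-score reported above.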
Experiments: Where Do the Problems Lie?

Words and phrases that are commonly perceived as
synonyms but are not recorded as such in electronic dictionaries
such as WordNet.


E.g. drop-off and return are synonyms in the Car Rental
domain but not according to WordNet
Many labels are complex sentences

E.g. “So, what do you do for a living?”, “How flexible are
you?”.
Experiments: What Else Did We Try?

Other linguistic techniques were attempted
Normalized Google Distance (NGD) [Cilibrasi and Vitanyi 2007]
The kernel function for measuring the semantic similarity
between pairs of short text snippets [Sahami and Heilman 2006]

Domain       Label                             Relationship  Label
Airfare      Outbound                          Syn           Origin date
Airfare      How flexible are you?             Hyp           Search one day before and after
Car Rental   End                               Syn           Drop-off date
Car Rental   Pick-up                           Syn           Start
Credit Card  2nd card holder                   Syn           Additional authorized user
Credit Card  So, what do you do for a living?  Syn           Employment Information
Real Estate  Size                              Hyp           Square feet
Experiments: Stop Words & Semantic Relationships

We ran the improved algorithm for computing semantic
relationships with the following four possible sets of stop words:

S1 is the set of stop words produced by our algorithm;
S2 is the gold standard of stop words;
S3 is the empty set;
S4 is a domain-independent stop word set used by a typical IR system;
we used dcs.gla.ac.uk/idom/ir_resources/linguistic_utils/stop_words

The outcome:

F-score using S1 is on average 17.6% better than that using S3.
The largest difference is 43%.

F-score using S1 is on average 8% better than that using S4.
The largest difference is 33%.

F-score using S1 is on average 0.03% better than that using S2.
This is another way of validating our improved algorithm.
Related Work


Synonym and near-synonym relationships between short phrases
have been recently studied [Bollegala et al. 2007, Sahami and
Heilman 2006]
There is a great deal of work to represent meaning of words (not
phrases) in various areas of research: linguistics, computer
science, cognitive psychology, etc


Manually created semantic networks: WordNet [Fellbaum 1998] and Cyc
[Lenat et al. 1990]
Generic methods to measure word similarity or word association
Using word frequencies in text corpora [Berland and Charniak 1999,
Caraballo 1999, Hearst 1992, Jiang and Conrath 1998, Lin 1998]
Using Web search engine counts (hits) to identify lexico-syntactic
patterns [Bollegala et al. 2007, Cilibrasi and Vitanyi 2007, Cimiano and Staab
2004]
Related Work, Cont’

Schema Matching
Surveys [Rahm and Bernstein 2001, Shvaiko and Euzenat
2005]
Query interface matching [He and Chang 2003, He et al. 2004,
Wang et al. 2004, Wu et al. 2004, 2006]
A number of dictionary-based semantic matching techniques
for relational/XML schema and ontology alignment
[Beneventano et al. 2001, Giunchiglia et al. 2005, Kotis and
Vouros 2004]
End

Please visit the project web site
http://www.cs.uic.edu/~edragut/QIProject.html
Thank you for your time and patience!