Automatic query expansion

advertisement
PERFORMANCE ANALYSIS FOR QUERY EXPANSION FOR
INFORMATION RETRIEVAL
A Project
Presented to the faculty of the Department of Computer Science
California State University, Sacramento
Submitted in partial satisfaction of
the requirements for the degree of
MASTER OF SCIENCE
in
Computer Science
by
Bimal Panchal
SPRING
2012
PERFORMANCE ANALYSIS FOR QUERY EXPANSION FOR
INFORMATION RETRIEVAL
A Project
by
Bimal Panchal
Approved by:
__________________________________, Committee Chair
Mary Jane Lee, Ph.D.
__________________________________, Second Reader
Robert Buckley
__________________________________
Date
ii
Student: Bimal Panchal
I certify that this student has met the requirements for format contained in the University
format manual, and that this project is suitable for shelving in the Library and credit is to
be awarded for the project.
______________________________, Graduate Coordinator
Nikrouz Faroughi, Ph.D.
Department of Computer Science
iii
_________________
Date
Abstract
of
PERFORMANCE ANALYSIS FOR QUERY EXPANSION FOR
INFORMATION RETRIEVAL
by
Bimal Panchal
Information Retrieval (IR) is the technique of getting information from document
repositories. It is the science of searching for documents for information within
documents as well as that of searching relational databases. Query Expansion (QE) is one
of the various aspects of Information Retrieval. Query Expansion is the process of
reformulating a query which leads to the improvement in IR. In web search engine
context, Query Expansion involves evaluating user’s input and expanding search query to
match additional documents.
The purpose of this project is to analyze different kinds of QE techniques which help
improve Information Retrieval. In addition performance of various database tools, such
as, MySQL and MS SQL Server 2008, in query expansion will be conducted.
Performance factors considered are CPU cost and time.
______________________________, Committee Chair
Mary Jane Lee, Ph.D.
______________________________
Date
iv
ACKNOWLEDGMENTS
I would like to express gratitude to my project advisor Dr. Mary Jane Lee for her
direction and supporting me to complete this project. I am thankful to my second reader
Professor Robert Buckley for his assistance. I am utterly grateful to my parents Savitri
and Subhash Panchal for their love, moral support and encouragement.
v
TABLE OF CONTENTS
Page
Acknowledgments..........................................................................................................v
List of Figures ............................................................................................................ vii
Chapter
1. INTRODUCTION ....................................................................................................1
2. BACKGROUND INFORMATION .........................................................................2
2.1 Information Retrieval .......................................................................................2
2.1.1 Performance Measures for IR ..................................................................3
2.2 Query Expansion .............................................................................................4
2.2.1 Classes of Query Expansion ....................................................................5
3. GOOGLE QUERY EXPANSION…... ......................................................................9
4. FULL-TEXT QUERY EXPANSION IN MYSQL ................................................12
5. PERFORMANCE ANALYSIS FOR MYSQL QE ................................................15
6. FULL-TEXT QUERY EXPANSION IN MS SQL SERVER ................................22
7. PERFORMANCE ANALYSIS FOR MS SQL SERVER QE ...............................25
8. SUMMARY ............................................................................................................38
8.1 Summary ........................................................................................................38
9. FUTURE WORK ....................................................................................................40
9.1 Future work ....................................................................................................40
References ...................................................................................................................41
vi
LIST OF FIGURES
Page
Figure 1 Result set using full-text index ..................................................................... 16
Figure 2 Execution time based on different factors for results shown in figure 1 ...... 16
Figure 3 Time statistics result chart based on various factors for results in figure 1 . 17
Figure 4 Result set using full-text index with QE....................................................... 19
Figure 5 Execution time based on different factors for results shown in figure 4 ...... 20
Figure 6 Time statistics result chart based on various factors for results in figure 4 . 20
Figure 7 Output for query executed using FREETEXT ............................................. 28
Figure 8 Performance statistics for result set shown in figure 7 ................................. 28
Figure 9 CPU cost analysis for result set shown in figure 7 ....................................... 28
Figure 10 Output for query executed using CONTAINS and AND ........................... 29
Figure 11 Performance statistics for result set shown in figure 10 ............................. 30
Figure 12 CPU cost analysis for result set shown in figure 10 ................................... 30
Figure 13 Output for query executed using CONTAINS and ISABOUT .................. 32
Figure 14 Performance statistics for result set shown in figure 13 ............................. 32
Figure 15 CPU cost analysis for result set shown in figure 13 ................................... 32
Figure 16 Output for query executed using CONTAINS and INFLECTIONAL ...... 34
Figure 17 Performance statistics for result set shown in figure 16 ............................. 35
Figure 18 CPU cost analysis for result set shown in figure 16 ................................... 35
Figure 19 Output for query executed using CONTAINS and < > and AND ............. 36
vii
Figure 20 Performance statistics for result set shown in figure 19 ............................. 36
Figure 21 CPU cost analysis for result set shown in figure 19 ................................... 37
viii
1
Chapter 1
INTRODUCTION
In this project, there are two broad terms that reader will encounter – Information
Retrieval, and Query Expansion. In general, Information Retrieval (IR) is method of
searching for a document or part of document for specific information. Query Expansion
(QE) is a process of reformulating a user’s query for searching for information, in such a
way that the query should include additional search items that are relevant in addition to
the original query itself.
Information Retrieval is a big picture, whereas Query Expansion is a subset of
Information Retrieval. This project includes analysis of different QE techniques that are
being used in various web search engines and database tools. The analysis will help
compare different database tools and determine which one is better in terms of query
expansion process. Additionally, the analysis will help determine various factors, such as,
performance, efficiency, etc. coupled with query expansion techniques.
In terms of analysis, MySQL and MS SQL Server 2008 database tools are used. Test
queries are performed and results are collected. Analysis is done on various queries
enabling query expansion. Performance is measured in terms of time being taken to
execute queries with and without QE for MySQL. For MS SQL Server, time statistics and
CPU cost are taken into account when performing various query expansion processes on
a source query.
2
Chapter 2
BACKGROUND INFORMATION
2.1 Information Retrieval
Information Retrieval (IR) is the basis on which the modern search engines work today.
IR has been evolved greatly so far. The traditional IR mechanism was little different. It
involved retrieving information from the locally stored collection of documents. Modern
IR mechanism works this way: the query is passed as an input by user; IR engine then
processes that query; and based on the internal algorithm that is used, it then outputs the
results related to that query term [2].
The output is a list of ranked documents based on their relevance to the original query
specified by user.
Mathematics,
IR is not only related to Computer Science but also related to
Information
Science,
Library Science,
Information
Architecture,
Linguistics, Statistics, etc. to name a few [10].
IR has many applications – many Public Libraries use IR to provide access to the
information user needs; Web Search Engines use IR systems and algorithms a lot. Web
Search Engines are in fact the most visible application of IR systems. IR uses different
retrieval algorithms to help retrieve exact information based on user’s initial query. It is
up to an underlying algorithm to fetch more relevant information. Hence retrieval
algorithm needs to be quite efficient.
3
2.1.1 Performance Measures for IR
As this project is a study of performance analysis at its core, there should be enough
information about critical performance measures for IR. These different measures require
stack of information (i.e., a database) and a query for specific information [19].
Precision:
Precision can be written as follows:
𝑝𝑟𝑒𝑐𝑖𝑠𝑖𝑜𝑛 =
|(relevant information) + (retrieved information)|
|(retrieved information)|
Recall:
In mathematical expression, Recall can be written as following:
𝑟𝑒𝑐𝑎𝑙𝑙 =
|(relevant information) + (retrieved information)|
|(relevant information)|
Fall-Out:
Fall-Out can be expressed as:
𝐹𝑎𝑙𝑙𝑂𝑢𝑡 =
|(nonrelevant information) + (retrieved information)|
|(nonrelevant information)|
F-measure:
F-measure can be expressed as:
𝐹=
2 ∗ precision ∗ recall
(precision + recall)
4
There are many other performance measures such as Average Precision, R-precision,
Mean Average Precision, etc. But the ones mentioned here are the most important of all.
2.2 Query Expansion
Query Expansion (QE) is the process of generating a query response in such a way that
results include additional search terms related to the original query itself. This process is
quite complex since Query Expansion Engine needs to analyze the search results before
displaying and ranking them as relevant search items.
Query Expansion is most popular with web search engines these days. This is how it
works. User enters a query to the search engine and then the search engine, based on its
underlying QE algorithm, enhances that query to match additional items that might be of
an interest to user. In the end, it displays them in an order based on their relevancy –
more relevant items will be displayed first followed by less relevant.
By using Query Expansion, web search engines aim at improving Precision and/or Recall
– performance measures of IR [9]. For example, if user enters query for “car”, the
expanded query may include “car, cars, automobiles, auto” etc. There are mainly 3
classes of Query Expansion: human or computer-generated thesaurus, relevance
feedback, and automatic query expansion.
There are two major problems in QE – which terms to include, and which terms weigh
more. The underlying QE mechanism needs to be efficient enough in order to get nearly
perfect answers to these questions. There is also an important distinction between
5
concept-based and term-based query expansion. The question here is: Is it better to
expand the query based on the query term specified, or to expand it based upon overall
concept of the query.
2.2.1 Classes of Query Expansion
The process of query expansion classifies into following: human-generated thesaurus,
computer-generated thesaurus, relevance feedback, and automatic query expansion.
Human-generated thesaurus:
This is a compilation of frequently used words in relation to the corresponding field. For
example, an engineering thesaurus to include all the engineering terms, industrial
thesaurus to include industrial terms, etc. This is largely used in fields such as medicine,
aerospace, technology, and so on. The drawbacks are associated cost and time. It takes lot
of development cost and maintenance cost. And it also takes a long time to develop a
thesaurus.
Computer-generated thesaurus:
For the automatically-generated thesaurus, it doesn’t require experts to generate it.
Hence cost of experts is saved. To generate a thesaurus, there are steps involved. First
step is to extract the search word co-occurrences. Second step is to define similarities
between words by word co-occurrences or lexical relationship and finally to cluster
words based on their similarities. This method has not been proven successful as the
handcrafted / human-generated thesaurus method.
6
Relevance feedback:
Use of relevance feedback significantly improves recall and precision [13], IR measures
over the traditional way of query expansion. The process contains several steps: first, the
user provides initial query which returns initial result set matching the input query. The
user then selects the list of documents from the initial result set that are relevant to the
search. The system then re-weights and/or expands the query based upon the search terms
in the documents. This process may be iterative, meaning that system could possibly
iterate until it refines the search query to match more relevant results.
There are many models described in relation to relevance feedback mechanism for doing
query expansion. Most popular ones are: vector-space, probabilistic, and boolean. The
models differ depending on methods and theories behind them. The method in vectorspace model follows like this: all the top ranked relevant documents are used as the
highest ranked non-relevant document. The non-relevant document is used as the point in
the vector-space from which the feedback query is then removed.
Interactive query expansion:
Interactive query expansion uses a thesaurus. Initially, the user provides query to the
system. Then the system comes up with the list of documents that are relevant to the
search query. Once the user selects documents from the result set that are of an interest to
him/her, the system again refines the search result by looking at the result set and a
7
thesaurus. This process needs more research since there are some unknowns associated
with it.
Pseudo-relevance feedback:
Pseudo-relevance feedback mechanism was developed because of some limitations to the
relevance feedback process. The obvious thing here is that most users don’t like to give
manual feedback to the system. The process follows like this: the system returns an initial
set of documents as soon as the user provides search query. The system then assumes that
first n number of documents are relevant to the search query. The system takes terms
from these documents to re-weight the initial query. Finally, the system does it iteratively
until it gets final result set that is more suitable to the search query.
The drawback here is that the system relies heavily on the ability to efficiently retrieve
relevant documents initially. This depends on the underlying algorithm used to carry out
the process [13]. The advantage is that no manual feedback required from the user. Thus,
it saves time and efforts.
Automatic query expansion:
Automatic query expansion uses computer-generated thesaurus. This works similar to
pseudo-relevance feedback process. In terms of co-occurrence measures, relationships
between words based upon their co-occurrence in document is defined. During clustering,
documents that share measurable number of terms are grouped together. Then, a
8
thesaurus is generated from these terms. The drawbacks are that it doesn’t account for
synonyms and categories of terms sometimes too narrow or broad.
On the other hand in lexical co-occurrence measures, the proximity of words in the
document is taken into account instead of frequency of terms. It significantly improves
performance in small collection of document set. This process is better than pseudorelevance feedback but not as quite efficient as relevance feedback.
9
Chapter 3
GOOGLE QUERY EXPANSION
Word stemming
In word stemming, the search term is refined to its root or stem. For example, search term
for ‘translator’ can be expanded to ‘translator’, ‘translation’, etc. The keyword is reduced
to its stem (‘translate’ in this case) and then words beginning from the same stem can be
matched [20].
Acronyms
An acronym / abbreviation searched is automatically resolved to its full form. For
example a search for OSI may include results for ‘Open Systems Interconnection’ as well
as ‘Open Source Initiative’. Another example is term NFL, which can be expanded to
include ‘National Football League’, ‘National Forensic Laboratory’. The search results
are ranked based upon the relevance – most probable ones are listed first followed by the
rest.
Misspellings and typos-errors
Anytime if by mistake, one makes a typo while doing a search, Google will identify that
typo for almost all the time and will suggest correct variant for it in addition to the “did
you mean” prompt. Searching for ‘city trafic’ will include search results containing word
‘city’ and its variants as well as the correct word spell for ‘traffic’.
10
Synonyms
Google uses query expansion to include synonyms of the entered query term [20]. It is
useful to include related words. Most of the time, the search query is expanded when the
search term is entered improper. The synonym substituted as a part of improper word is
not bolded as a common rule. For example, search query for ‘remote connexion’ will
display search results for ‘remote connection’ in most cases.
Translations
In some cases, Google seems to translate search query term into another language and
displays results based on that. For example, search query for non-english term matches
query’s English-equivalent word. Searching for Spanish term ‘amigo’ may display results
that include English term ‘friend’ in them.
Ignored words
Surprisingly, some of the query term word gets discarded completely from the query. It
may be that those terms which are being dropped do not have much significance on the
overall query [20]. For example, search for a term ‘birthday bound attack’ might drop the
word ‘attack’ from the original query because of not much significance.
11
Interestingly, the query expansion process does not take place all the time when someone
searches for a query term. Whether or not query expansion should occur also happens to
be related to the entire search query – some query variants are much more likely to
trigger query expansion than others.
12
Chapter 4
FULL-TEXT QUERY EXPANSION IN MYSQL
In some instances, users may want to search for information which they rely on based on
their own knowledge. Using that knowledge, they define key terms to search and
typically these terms are too short. To address this situation, MySQL full-text search
engine introduced query expansion process within MySQL itself [16].
The expansion of the input query is based on automatic relevance feedback mechanism
that was discussed in the previous chapter. This process is also called blind query
expansion. MySQL full-text search engine perform following steps when it uses query
expansion [16]:

First, MySQL full-text search engine checks for all rows that match the search query

Next it checks all resulted rows and finds the relevant words out of those rows.

Finally, MySQL full-text search engine searches for the query again based on the
relevant words it got from previous step.
Typically, users need query expansion in situations when they could not find relevant
results to their search query and also when the returned search results are less. Users
search again but this time with the query expansion enabled so they can find what they
are looking for.
To use the query expansion, users need to use modifier WITH QUERY EXPANSION in
SELECT statement. Here is the usual form of using WITH QUERY EXPANSION:
13
SELECT col1, col2, col3
FROM table1
WHERE MATCH (col1, col2, col3)
AGAINST (‘keyword’, WITH QUERY EXPANSION)
Let’s look at the example below to understand query expansion using WITH QUERY
EXPANSION. Here, empName column is used in the employee table to demonstrate this.
ALTER TABLE employee
ADD FULLTEXT (empName)
Next, let’s find all the employees whose name contain “smith” in them without using
query expansion.
SELECT empName
FROM employee
WHERE MATCH (empName) AGAINST (‘smith’)
The output appears like following
---------------------------------------empName
---------------------------------------william smith
paul mithell smith
andrew smith
-----------------------------------------
14
As per the results, the employee names in the above search results contain “smith” in it.
Now, the same query can be used but with query expansion enabled as following:
SELECT empName
FROM employee
WHERE MATCH (empName)
AGAINST (‘smith’ WITH QUERY EXPANSION)
The output goes like following:
---------------------------------------empName
---------------------------------------william smith
paul mithell smith
andrew smith
paul patrick
andrew white
william george
----------------------------------------As per the results above, when query expansion is used it outputted more rows. First
three rows are the most relevant result. The remaining rows come from the relevant
words out of the first three rows, for example ‘andrew’.
Blind query expansion tends to increase noise significantly by returning non-relevant
results. Thus, it is recommended to use whenever search keywords are short.
15
Chapter 5
PEFORMANCE ANALYSIS FOR MYSQL QE
To run QE, MySQL server 5.5 is used. To be able to perform full-text search on a given
query, one needs to create full-text index on specified column(s) of a given table.
Following query adds full-text index on table ‘country’ for columns ‘name’ and ‘region’.
mysql> alter table country add fulltext (name, region);
Query OK, 239 rows affected (0.26 sec)
Records: 239 Duplicates: 0 Warnings: 0
Now, full-text searches against columns ‘name’ and ‘region’ can be executed. The
following query searches for word ‘america’ from mentioned columns using full-text
indexing.
mysql> select name, continent, region from country where match (name, region) against
('america');
+---------------------------+---------------+-----------------+
| name
| continent
| region
|
+---------------------------+---------------+-----------------+
| Argentina
| South America | South America
|
| Guyana
| South America | South America
|
| Honduras
| North America | Central America |
| Mexico
| North America | Central America |
| Nicaragua
| North America | Central America |
| Panama
| North America | Central America |
| Peru
| South America | South America
|
| Paraguay
| South America | South America
|
| El Salvador
| North America | Central America |
| Suriname
| South America | South America
|
| Uruguay
| South America | South America
|
| Venezuela
| South America | South America
|
| Guatemala
| North America | Central America |
| Greenland
| North America | North America
|
16
| Belize
| North America | Central America |
| Bermuda
| North America | North America
|
| Bolivia
| South America | South America
|
| Brazil
| South America | South America
|
| Canada
| North America | North America
|
| Chile
| South America | South America
|
| Colombia
| South America | South America
|
| Ecuador
| South America | South America
|
| Costa Rica
| North America | Central America |
| Falkland Islands
| South America | South America
|
| United States
| North America | North America
|
| French Guiana
| South America | South America
|
| Saint Pierre and Miquelon | North America | North America
|
+---------------------------+---------------+-----------------+
27 rows in set (0.03 sec)
Figure 1. Result set using full-text index
Figure 2 provides the query execution time (in seconds) based on different factors used
while analyzing and performing query search. The time measurements are for the above
query.
+-------------------------+----------+
| Status
| Duration |
+-------------------------+----------+
| starting
| 0.000154 |
| checking permissions
| 0.000017 |
| Opening tables
| 0.000039 |
| System lock
| 0.000022 |
| init
| 0.000034 |
| optimizing
| 0.000013 |
| statistics
| 0.030746 |
| preparing
| 0.000038 |
| FULLTEXT initialization | 0.000094 |
| executing
| 0.000007 |
| Sending data
| 0.000255 |
| end
| 0.000010 |
| query end
| 0.000005 |
| closing tables
| 0.000014 |
| freeing items
| 0.000105 |
| logging slow query
| 0.000006 |
| cleaning up
| 0.000006 |
| TOTAL
| 0.031565 |
+-------------------------+----------+
Figure 2. Execution time based on different factors for results shown in figure 1
17
Figure 3 shows results for query execution time for important factors that play key role in
query search process in a simpler pie chart form. From the chart, one of the major timeconsuming factor is “sending data” to the host machine when the data is ready to be
delivered by server engine. The “Full-text initialization” comes with no surprise while
consuming fair amount of time performing query search.
Time statistics
Opening tables
Initialization
Fulltext Init
Execution
Sending data
Closing tables
Figure 3. Time statistics result chart based on various factors for results in figure 1
Following query searches for search term ‘america’ on table name ‘country’ with
MySQL query expansion enabled. Clearly, this query will fetch more results because
after one execution, the internal full-text engine will expand this query to include more
relevant search terms.
mysql> select name, continent, region from country where match (name, region) against
('america' with query expansion);
18
+-------------------------+---------------+---------------------------+
| name
| continent
| region
|
+-------------------------+---------------+---------------------------+
| French Guiana
| South America | South America
|
| Falkland Islands
| South America | South America
|
| Colombia
| South America | South America
|
| Ecuador
| South America | South America
|
| Chile
| South America | South America
|
| Brazil
| South America | South America
|
| Guyana
| South America | South America
|
| Bolivia
| South America | South America
|
| Venezuela
| South America | South America
|
| Argentina
| South America | South America
|
| Suriname
| South America | South America
|
| Paraguay
| South America | South America
|
| Uruguay
| South America | South America
|
| Peru
| South America | South America
|
| Costa Rica
| North America | Central America
|
| Nicaragua
| North America | Central America
|
| Guatemala
| North America | Central America
|
| Panama
| North America | Central America
|
| Honduras
| North America | Central America
|
| Mexico
| North America | Central America
|
| Belize
| North America | Central America
|
| Bermuda
| North America | North America
|
| Canada
| North America | North America
|
| Greenland
| North America | North America
|
| El Salvador
| North America | Central America
|
| United States
| North America | North America
|
| Saint Pierre and Miquelon| North America | North America
|
| South Georgia
| Antarctica
| Antarctica
|
| South Korea
| Asia
| Eastern Asia
|
| South Africa
| Africa
| Southern Africa
|
| Central African Republic| Africa
| Central Africa
|
| Congo
| Africa
| Central Africa
|
| Chad
| Africa
| Central Africa
|
| Cameroon
| Africa
| Central Africa
|
| Gabon
| Africa
| Central Africa
|
| Angola
| Africa
| Central Africa
|
| Tajikistan
| Asia
| Southern and Central Asia |
| Sao Tome and Principe
| Africa
| Central Africa
|
| Pakistan
| Asia
| Southern and Central Asia |
| Nepal
| Asia
| Southern and Central Asia |
| Uzbekistan
| Asia
| Southern and Central Asia |
| Maldives
| Asia
| Southern and Central Asia |
| Sri Lanka
| Asia
| Southern and Central Asia |
| Bhutan
| Asia
| Southern and Central Asia |
| Equatorial Guinea
| Africa
| Central Africa
|
| Bangladesh
| Asia
| Southern and Central Asia |
| Afghanistan
| Asia
| Southern and Central Asia |
| India
| Asia
| Southern and Central Asia |
| Turkmenistan
| Asia
| Southern and Central Asia |
19
| Iran
| Asia
| Southern and Central Asia |
| Kazakstan
| Asia
| Southern and Central Asia |
| Kyrgyzstan
| Asia
| Southern and Central Asia |
| Congo, The Democratic Rep| Africa
| Central Africa
|
| North Korea
| Asia
| Eastern Asia
|
| French Southern territories| Antarctica | Antarctica
|
| French Polynesia
| Oceania
| Polynesia
|
| Virgin Islands, U.S.
| North America | Caribbean
|
| Ireland
| Europe
| British Islands
|
| Cook Islands
| Oceania
| Polynesia
|
| Solomon Islands
| Oceania
| Melanesia
|
| Cayman Islands
| North America | Caribbean
|
| Fiji Islands
| Oceania
| Melanesia
|
| Marshall Islands
| Oceania
| Micronesia
|
| Virgin Islands, British | North America | Caribbean
|
| Northern Mariana Islands| Oceania
| Micronesia
|
| United Kingdom
| Europe
| British Islands
|
| Faroe Islands
| Europe
| Nordic Countries
|
| Turks and Caicos Islands| North America | Caribbean
|
| Heard Island and McDonald| Antarctica
| Antarctica
|
| Cocos (Keeling) Islands | Oceania
| Australia and New Zealand |
| United States Minor Outlying| Oceania
| Micronesia/Caribbean
|
+-------------------------+---------------+---------------------------+
71 rows in set (0.09 sec)
Figure 4. Result set using full-text index with QE
Figure 4 shows query execution time (in seconds) based on different factors used while
analyzing and performing query search. The time measurements are for query results
shown in figure 3.
+-------------------------+----------+
| Status
| Duration |
+-------------------------+----------+
| starting
| 0.083475 |
| checking permissions
| 0.000020 |
| Opening tables
| 0.000025 |
| System lock
| 0.000015 |
| init
| 0.000025 |
| optimizing
| 0.000010 |
| statistics
| 0.000016 |
| preparing
| 0.000012 |
| FULLTEXT initialization | 0.000603 |
| executing
| 0.000006 |
| Sending data
| 0.000487 |
| end
| 0.000007 |
20
| query end
| 0.000005 |
| closing tables
| 0.000009 |
| freeing items
| 0.000141 |
| logging slow query
| 0.000005 |
| cleaning up
| 0.000005 |
| TOTAL
| 0.084866 |
+-------------------------+----------+
Figure 5. Execution time based on different factors for results shown in figure 4
Figure 6 shows results for query execution time for important factors that play key role in
query expansion process in a simpler pie chart form. By involving query expansion, the
full-text initialization factor consumes lot of time to execute the query. It is taking 0.5
(0.603 – 0.094) ms more for performing full-text initialization. This is because the
column to search for needs to be indexed prior to search.
Time statistics
Opening tables
Initialization
Fulltext Init
Execution
Sending data
Closing tables
Figure 6. Time statistics result chart based on various factors for results in figure 4
Let’s carefully examine figures 2 and 5. For figure 2, MySQL full-text search query
without QE took total time of 31 ms. Out of that, majority of time is taken by statistics
and the second most is “sending data” part followed by “fulltext initialization”.
21
For figure 5, the same MySQL full-text search query but now with QE took total time of
84 ms. From all, “starting” phase took longest time followed by “fulltext initialization”
phase.
By comparing total time, it can be concluded that it is obvious when QE is taken into
consideration the execution time is much higher. It takes almost 53 more milliseconds to
perform query expansion on the same table with same query to search for same keyword.
By carefully analyzing the results, one can also figure out that the full-text initialization
impacts a lot on the overall performance of the query in terms of time. This is due to
performing indexing on specified column(s) of a relatively very big database table with
thousands of rows.
So the question arises - is it wise to use query expansion? The answer depends on the
user’s needs. If the user is searching for short phrases or keywords, then it is better to
keep the query expansion disabled. For the ambiguous searches when the user is not sure
about the search term, the query expansion becomes really useful.
22
Chapter 6
FULL-TEXT QUERY EXPANSION IN MS SQL SERVER
This chapter briefly describes how MS SQL Server performs full-text query search. For
performance analysis and research, MS SQL Server 2008 R2 is used. SQL Server uses
FREETEXT and CONTAINS keywords to perform Query Expansion. The expansion
process occurs in such a way that source query is modified after initial run by full-text
internal search engine to get more information.
The search terms supplied in a query are expanded by the parser before hitting full-text
index. This query expansion process is performed by a component called a stemmer and
the expansion depends on rules that are being implied explicitly. However, query
expansion can be used to include following types:

Search for plural forms of a word – a search on bike would also return bikes

Thesaurus searches for synonymous forms of word – a search on Leopard may
return words that contain panther, mountain lion, puma, etc.

Searches for a word that will return all of linguistically meaningful variations of
that word – search for ride may include ride, rides, ridden, rode, etc.
There are two types of query: using contains and freetext. They both are supported in MS
SQL Server 2008. Both will be examined in detail.
23
Querying using CONTAINS
By default, use of the CONTAINS predicate will entail minimal language specific query
time expansion. For example, consider a search on the term ‘bank’
select * from TableName where CONTAINS(*,'bank')
The asterisk (*) is used to search all full-text indexed columns. In SQL 2008, search can
be made within all columns, a named column, or a subset of all full-text indexed
columns:
select * from TableName where CONTAINS(col1,'bank')
select * from TableName where CONTAINS((col1,col2),'bank'))
Sometimes, there is a need to search for words as noun and as verb. Searching for ‘bank’
using keyword CONTAINS will not return result for ‘banking’. CONTAINS will only
match with ‘bank’.
To solve this, FORMSOF keyword is useful. This term accepts two arguments –
INFLECTIONAL or THESAURUS. The INFLECTIONAL argument will expand the
search
phrase
to
search
on
all
conjugations
and
declensions,
and
the THESAURUS argument will enable a thesaurus expansion on the search phrase. Here
are two examples of what this would look like:
24
Select * from TableName where CONTAINS(*,'FORMSOF(INFLECTIONAL,run)')
And
Select * from TableName where CONTAINS(*,'FORMSOF(THESAURUS,run)')
Querying using FREETEXT
In fact, use of FREETEXT means that, by default, the search is expanded to
encompass all generations of the searched word, for the default full-text language setting
of your server. So, to continue the previous "bank" example, a FREETEXT search on
bank:
Select * from TableName
where FREETEXT
(*,‘bank’)
would be expanded to the equivalent of:
Select * from TableName
where CONTAINS
(*,'"bank" or "banks" or "bank''s" or banks''" or "banking")
The search would also return any relevant thesaurus expansions.
25
Chapter 7
PEFORMANCE ANALYSIS FOR MS SQL SERVER QE
Performance analysis is important in determining how query expansion impacts on
different performance factors such as disk IO, memory, CPU cost, query execution time,
and so on. In this chapter, various query execution times for MS SQL Server 2008 are
reviewed while running different free-text queries with QE enabled on a relatively large
database table.
To enable SQL Server to use full-text search mechanism, the following steps needed to
be done:

Select the database to be used and create a database catalog.

Create full-text index on column(s) for the chosen table.

Run full-text queries to search for terms.
Now series of queries will be executed on freshly created database table named
Production.ProductDescription. Firstly, run following query to create catalog on the
database to be used. Here AdvemtureWorks database is used.
create fulltext catalog AdventureWorksCatalog
26
The table used for querying is Production.ProductDescription in MS SQL Server 2008. It
contains detailed description about various products. Now, create full-text index on
Description column of this table .To create full-text index on specified column(s), run the
following query:
Create FullText Index on Production.ProductDescription
([Description])
Key Index PK_ProductDescription_ProductDescriptionID on AdventureWorksCatalog
with Change_Tracking Auto
There are various forms for querying SQL Server full-text search engine to demonstrate
query expansion. Following query examines ProductDescription table and fetches rows
containing keyword ‘race’ and ‘bike’ independent of each other.
Select [Description]
from Production.ProductDescription
Where
FREETEXT([Description], 'Race Bike')
The output appears as below:
Description
This bike delivers a high-level of performance on a budget. It is
responsive and maneuverable, and offers peace-of-mind when you
decide to go off-road.
For true trail addicts. An extremely durable bike that will go
anywhere and keep you in control on challenging terrain - without
27
breaking your budget.
Top-of-the-line competition mountain bike. Performance-enhancing
options include the innovative HL Frame, super-smooth front
suspension, and traction for all terrain.
Entry level adult bike; offers a comfortable ride cross-country
or down the block. Quick-release hubs and rims.
Value-priced bike with many features of our top-of-the-line
models. Has the same light, stiff frame, and the quick
acceleration we're famous for.
Same technology as all of our Road series bikes, but the frame is
sized for a woman. Perfect all-around bike for road or racing.
Same technology as all of our Road series bikes. Perfect allaround bike for road or racing.
A true multi-sport bike that offers streamlined riding and a
revolutionary design. Aerodynamic design lets you ride with the
pros, and the gearing will conquer hilly roads.
Cross-train, race, or just socialize on a sleek, aerodynamic
bike. Advanced seat technology provides comfort all day.
Cross-train, race, or just socialize on a sleek, aerodynamic bike
designed for a woman. Advanced seat technology provides comfort
all day.
Alluminum-alloy frame provides a light, stiff ride, whether you
are racing in the velodrome or on a demanding club ride on
country roads.
This bike is ridden by race winners. Developed with the Adventure
Works Cycles professional race team, it has a extremely light
heat-treated aluminum frame, and steering that allows precision
control.
All-occasion value bike with our basic comfort and safety
features. Offers wider, more stable tires for a ride around town
or weekend trip.
The plush custom saddle keeps you riding all day, and there's
plenty of space to add panniers and bike bags to the newlyredesigned carrier. This bike has stability when fully-loaded.
Lightweight kevlar racing saddle. Leather.
Carries 4 bikes securely; steel construction, fits 2" receiver
hitch.
Clip-on fenders fit most mountain bikes.
Men's 8-panel racing shorts - lycra with an elastic waistband and
leg grippers.
28
Perfect all-purpose bike stand for working on your bike at home.
Quick-adjusting clamps and steel construction.
Figure 7. Output for query executed using FREETEXT
The figure below shows total query execution time including client processing time and
wait on server replies in milliseconds. The time taken by each query will be examined
later.
Time Statistics
Client processing time
Total execution time
Wait time on server replies
CPU Cost
Clustered Index Seek
Fulltext Match
Nested Loops
Ms
3
58
55
in %
0.0001581
0.0033386
0.0000902
Figure 8. Performance statistics for result set shown in figure 7
The chart below displays information about the three most important variables which are
associated with CPU cost during the query expansion process.
CPU cost analysis
Clustered Index Seek
Fulltext Match
Nested Loops
Figure 9. CPU cost analysis for result set shown in figure 7
29
The following query examines ProductDescription table and fetches rows containing
keyword ‘race’ and ‘bike’ dependent of each other – meaning both words have to appear
together in a row.
Select [Description]
from Production.ProductDescription
Where
Contains([Description], '"Race" and "Bike"')
The output appears as below:
Description
Cross-train, race, or just socialize on a sleek, aerodynamic
bike. Advanced seat technology provides comfort all day.
Cross-train, race, or just socialize on a sleek, aerodynamic bike
designed for a woman. Advanced seat technology provides comfort
all day.
This bike is ridden by race winners. Developed with the Adventure
Works Cycles professional race team, it has a extremely light
heat-treated aluminum frame, and steering that allows precision
control.
Figure 10. Output for query executed using CONTAINS and AND
The figure below shows total query execution time including client processing time and
wait on server replies in milliseconds.
Time Statistics
Client processing time
Total execution time
Wait time on server replies
Ms
3
7
4
30
CPU Cost
Clustered Index Seek
Fulltext Match
Nested Loops
in %
0.0001581
0.0033022
0.0000084
Figure 11. Performance statistics for result set shown in figure 10
From the information given in the chart below, it can be said that the full-text match
operation requires more CPU power and hence its CPU cost is high.
CPU cost analysis
Clustered Index Seek
Fulltext Match
Nested Loops
Figure 12. CPU cost analysis for result set shown in figure 10
The following query examines ProductDescription table and fetches rows containing
keyword ‘race’ and ‘bike’ independent ranking them according to their weights.
Select [Description]
from Production.ProductDescription
Where
Contains([Description],
31
'ISABOUT (Race Weight(.4), Bike Weight (.2))')
The output appears as below:
Description
This bike delivers a high-level of performance on a budget. It is
responsive and maneuverable, and offers peace-of-mind when you
decide to go off-road.
For true trail addicts. An extremely durable bike that will go
anywhere and keep you in control on challenging terrain - without
breaking your budget.
Top-of-the-line competition mountain bike. Performance-enhancing
options include the innovative HL Frame, super-smooth front
suspension, and traction for all terrain.
Entry level adult bike; offers a comfortable ride cross-country
or down the block. Quick-release hubs and rims.
Value-priced bike with many features of our top-of-the-line
models. Has the same light, stiff frame, and the quick
acceleration we're famous for.
Same technology as all of our Road series bikes, but the frame is
sized for a woman. Perfect all-around bike for road or racing.
Same technology as all of our Road series bikes. Perfect allaround bike for road or racing.
A true multi-sport bike that offers streamlined riding and a
revolutionary design. Aerodynamic design lets you ride with the
pros, and the gearing will conquer hilly roads.
Cross-train, race, or just socialize on a sleek, aerodynamic
bike. Advanced seat technology provides comfort all day.
Cross-train, race, or just socialize on a sleek, aerodynamic bike
designed for a woman. Advanced seat technology provides comfort
all day.
This bike is ridden by race winners. Developed with the Adventure
Works Cycles professional race team, it has a extremely light
heat-treated aluminum frame, and steering that allows precision
control.
All-occasion value bike with our basic comfort and safety
features. Offers wider, more stable tires for a ride around town
or weekend trip.
The plush custom saddle keeps you riding all day, and there's
plenty of space to add panniers and bike bags to the newlyredesigned carrier. This bike has stability when fully-loaded.
32
Perfect all-purpose bike stand for working on your bike at home.
Quick-adjusting clamps and steel construction.
Figure 13. Output for query executed using CONTAINS and ISABOUT
The figure below shows total query execution time including client processing time and
wait on server replies in milliseconds.
Time Statistics
Client processing time
Total execution time
Wait time on server replies
CPU Cost
Clustered Index Seek
Fulltext Match
Nested Loops
Ms
5
7
2
in %
0.0001581
0.0033187
0.0000069
Figure 14. Performance statistics for result set shown in figure 13
The following chart shows CPU consumption for various factors for the result set above.
CPU cost analysis
Clustered Index Seek
Fulltext Match
Nested Loops
Figure 15. CPU cost analysis for result set shown in figure 13
33
The following query examines ProductDescription table and fetches rows containing
keyword ‘ride’ doing inflectional search i.e. by specifying INFLECTIONAL, the query
expansion engine would search for all conjugations of that keyword.
SELECT Description
FROM Production.ProductDescription
WHERE CONTAINS(Description, ' FORMSOF (INFLECTIONAL, ride) ')
The output appears as below:
Description
Suitable for any type of riding, on or off-road. Fits any budget.
Smooth-shifting with a comfortable ride.
Serious back-country riding. Perfect for all levels of
competition. Uses the same HL Frame as the Mountain-100.
Entry level adult bike; offers a comfortable ride cross-country
or down the block. Quick-release hubs and rims.
A true multi-sport bike that offers streamlined riding and a
revolutionary design. Aerodynamic design lets you ride with the
pros, and the gearing will conquer hilly roads.
Alluminum-alloy frame provides a light, stiff ride, whether you
are racing in the velodrome or on a demanding club ride on
country roads.
This bike is ridden by race winners. Developed with the Adventure
Works Cycles professional race team, it has a extremely light
heat-treated aluminum frame, and steering that allows precision
control.
All-occasion value bike with our basic comfort and safety
features. Offers wider, more stable tires for a ride around town
or weekend trip.
The plush custom saddle keeps you riding all day, and there's
plenty of space to add panniers and bike bags to the newlyredesigned carrier. This bike has stability when fully-loaded.
34
Aerodynamic rims for smooth riding.
A light yet stiff aluminum bar for long distance riding.
Expanded platform so you can ride in any shoes; great for allaround riding.
A stable pedal for all-day riding.
Excellent aerodynamic rims guarantee a smooth ride.
Anatomic design for a full-day of riding in comfort. Durable
leather.
New design relieves pressure for long rides.
Cut-out shell for a more comfortable ride.
Lightweight carbon reinforced
compromised weight.
for an unrivaled ride at an un-
The LL Frame provides a safe comfortable ride, while offering
superior bump absorption in a value-priced aluminum frame.
Lightweight butted aluminum frame provides a more upright riding
position for a trip around town. Our ground-breaking design
provides optimum comfort.
The HL aluminum frame is custom-shaped for both good looks and
strength; it will withstand the most rigorous challenges of daily
riding. Men's version.
Affordable light for safe night riding - uses 3 AAA batteries
Warm spandex tights for winter riding; seamless chamois
construction eliminates pressure points.
Figure 16. Output for query executed using CONTAINS and INFLECTIONAL
The figure below shows total query execution time including client processing time and
wait on server replies in milliseconds.
Time Statistics
Client processing time
Total execution time
Wait time on server replies
CPU Cost
Clustered Index Seek
Fulltext Match
Ms
4
7
3
in %
0.0001581
0.0033187
35
Nested Loops
0.0000069
Figure 17. Performance statistics for result set shown in figure 16
The chart below displays CPU cost information for the query expansion process. Clearly
for the above query, the full-text matching process consumes most CPU.
CPU cost analysis
Clustered Index Seek
Fulltext Match
Nested Loops
Figure 18. CPU cost analysis for result set shown in figure 16
The following query examines ProductDescription table and fetches rows containing
keywords ‘aluminum’ and ‘spindle’. The AND keyword is useful here and it will only
fetch those rows containing both keywords in them.
SELECT Description
FROM Production.ProductDescription
WHERE ProductDescriptionID <> 5 AND
CONTAINS(Description, ' Aluminum AND spindle')
36
The output appears as below:
Description
Aluminum alloy cups; large diameter spindle.
Figure 19. Output for query executed using CONTAINS and < > and AND
The figure below shows total query execution time including client processing time and
wait on server replies in milliseconds.
Time Statistics
Client processing time
Total execution time
Wait time on server replies
CPU Cost
Clustered Index Seek
Fulltext Match
Nested Loops
Ms
7
25
18
in %
0.0001581
0.0033044
0.0000084
Figure 20. Performance statistics for result set shown in figure 19
37
The chart below depicts the performance analysis in terms of CPU cost taking three most
important variables in account: Clustered index seek, Fulltext match, Nested loops.
CPU cost analysis
Clustered Index Seek
Fulltext Match
Nested Loops
Figure 21. CPU cost analysis for result set shown in figure 19
To conclude, the query expansion has an impact on performance of query execution time.
The Indexing process also costs more within overall CPU cost. Nested loops (inner join)
cost the least from the three factors affecting CPU cost. For the same table, applying fulltext query expansion leads to no change in CPU cost for the full-text initialization.
38
Chapter 8
SUMMARY
8.1 Summary
To summarize, MS SQL Server and MySQL take a relatively fair amount of time
performing full-text indexing and initialization while keeping the query expansion
enabled. This is due to the process involved in query expansion. When a database user
fires a query to search for a keyword using query expansion, the table gets the index for a
range to search for, the initial results are fetched, the query expansion algorithm is
applied to the initial result set, and finally a new result set is displayed containing
expanded search results.
The query expansion does indeed impact CPU cost. For MS SQL Server, CPU costs are
analyzed. For a same table, applying full-text query expansion leads to no change in the
CPU cost for full-text initialization. The CPU cost is an important factor in analyzing the
query performance. The timing statistics also determine that when applying the query
expansion, same query leads to increase in the execution time.
From the charts shown in previous chapter, it is clear that MS SQL Server consumes
more CPU to perform query expansion, leading to performance bottleneck. The main
operation that results in higher CPU cost is full-text matching. Careful analysis shows
that the process involved while performing query expansion is so complex that the query
expansion indeed needs more CPU power to perform critical operations. Nevertheless,
39
CPU performance can be improved if the operations involved in the internal process are
reduced in some way or the other. This heavily relies on the underlying query expansion
algorithm used. The traditional algorithms are proven but not necessarily result in better
performance. There is a need for better query expansion algorithm.
For MS SQL Server when applying query expansion, the SQL Server search engine takes
approximately 0.5 ms more to finish the process of full-text initialization. For the tables
with significantly more rows, the time difference would make huge impact on overall
query execution performance. For MySQL while applying query expansion, the overall
query execution time increases 53 ms.
The query performance results vary depending on the platform, the database, and the
database tool. The results are not predictable, since these different factors affect the
performance of that entire result set.
When the query expansion is enabled, the MySQL server takes more time in order to
perform full-text initialization, thus resulting in close to twice the amount of time it takes
to perform search when the query expansion is disabled. This is because of the process of
query expansion, which needs full-text initialization to expand source query. And this is
why the database users and application developers are advised to perform query
expansion only when needed.
40
Chapter 9
FUTURE WORK
9.1 Future work
This project includes performance analysis for QE among various database tools and
Google web search engine. This project can be expanded to include research work on
other web search engines such as Yahoo!, Bing, AltaVista, etc.
For database tools, performance analysis is done on MS SQL Server 2008 and MySQL.
Analysis might be included for other database tools such as Oracle, IBM DB2, and so on.
In this project, performance analysis is achieved by closely examining query execution
time statistics and CPU cost. There are other factors which can be analyzed, for example,
IO cost, memory consumption, operator cost and so on.
41
REFERENCES
[1] Kevyn B. Collins-Thompson. “Robust model estimation methods for IR” Language
Technologies Institute, Carnegie Mellon University pp. 41-50, 2008.
[2] R. Navigli, P. Velardi. “An analysis of Ontology based QE strategies” Proc. of
Workshop on Adaptive Text Extraction and Mining, 14th European Conference on
Machine Learning, 2003.
[3] Y. Qiu, H.P. Frei. “Concept Based Query Expansion” SIGIR ’93 Proc. of the 16th
annual international ACM SIGIR Conference on Research and development in
information retrieval, 1993.
[4] Singhal Amit “Modern Information Retrieval: A Brief Overview” Bulletin of the
IEEE Computer Society Technical Committee on Data Engineering, 2001.
[5] Efthimis N. Efthimiadis “Query Expansion” Annual Review of Information Systems
and Technology (ARIST), 1996.
[6] M. Shamim Khan, Sebastian Khor “Enhanced web document retrieval using
automatic query expansion” Journal of the American Society for Information Science and
Technology, 2004.
[7] Jonathan Mamou, Bhuvana Ramabhadran “Phonetic Query Expansion for Spoken
Document Retrieval” IBM Haifa Research Labs, 2004.
42
[8] Query Expansion techniques http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.83.71&rep=rep1&type=pdf.
[9] Query Expansion for IR – http://nlp.stanford.edu/IR-book/html/htmledition/queryexpansion-1.html#11685.
[10] http://en.wikipedia.org/.
[11] http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.
com/en/us/pubs/archive/13021.pdf.
[12] http://www.dsoergel.com/NewPublications/HCIEncyclopediaIRShortEForDS.pdf.
[13] http://nlp.stanford.edu/IR-book/pdf/09expand.pdf.
[14] http://www.macs.hw.ac.uk/~pdw/b1/chli&d.pdf.
[15] http://scholarworks.sjsu.edu/cgi/viewcontent.cgi?article=1064&context=etd_
projects.
[16] http://dev.mysql.com/doc/refman/5.6/en/fulltext-query-expansion.html.
[17] http://www.mysqltutorial.org/using-mysql-query-expansion.aspx.
[18] http://www.mysqlfaqs.net/mysql-faqs/Indexes/Full-Text-Indexes/What-are-naturallanguage-and-boolean-and-query-expansion-full-text-searches.
[19] Gundong Xu, Yanchun Zhang, Lin Li “Web Mining and Social Networking –
Techniques and Applications” Springer Press pp. 22-24, 2011.
43
[20] http://code.google.com/apis/searchappliance/documentation/46/help_gsa/
serve_query_expansion.html.
Download