PERFORMANCE ANALYSIS FOR QUERY EXPANSION FOR INFORMATION RETRIEVAL

A Project

Presented to the faculty of the Department of Computer Science
California State University, Sacramento

Submitted in partial satisfaction of the requirements for the degree of

MASTER OF SCIENCE
in
Computer Science

by
Bimal Panchal

SPRING 2012

PERFORMANCE ANALYSIS FOR QUERY EXPANSION FOR INFORMATION RETRIEVAL

A Project
by
Bimal Panchal

Approved by:

__________________________________, Committee Chair
Mary Jane Lee, Ph.D.

__________________________________, Second Reader
Robert Buckley

__________________________________
Date

Student: Bimal Panchal

I certify that this student has met the requirements for format contained in the University format manual, and that this project is suitable for shelving in the Library and credit is to be awarded for the project.

______________________________, Graduate Coordinator        _________________
Nikrouz Faroughi, Ph.D.                                      Date
Department of Computer Science

Abstract
of
PERFORMANCE ANALYSIS FOR QUERY EXPANSION FOR INFORMATION RETRIEVAL
by
Bimal Panchal

Information Retrieval (IR) is the technique of getting information from document repositories. It is the science of searching for documents, for information within documents, as well as that of searching relational databases. Query Expansion (QE) is one of the various aspects of Information Retrieval. Query Expansion is the process of reformulating a query which leads to the improvement in IR. In web search engine context, Query Expansion involves evaluating user’s input and expanding search query to match additional documents.

The purpose of this project is to analyze different kinds of QE techniques which help improve Information Retrieval. In addition performance of various database tools, such as, MySQL and MS SQL Server 2008, in query expansion will be conducted. Performance factors considered are CPU cost and time.

______________________________, Committee Chair
Mary Jane Lee, Ph.D.
______________________________
Date

ACKNOWLEDGMENTS

I would like to express my gratitude to my project advisor, Dr. Mary Jane Lee, for her direction and for supporting me in completing this project. I am thankful to my second reader, Professor Robert Buckley, for his assistance. I am deeply grateful to my parents, Savitri and Subhash Panchal, for their love, moral support, and encouragement.

TABLE OF CONTENTS

Acknowledgments
List of Figures
Chapter
1. INTRODUCTION
2. BACKGROUND INFORMATION
   2.1 Information Retrieval
       2.1.1 Performance Measures for IR
   2.2 Query Expansion
       2.2.1 Classes of Query Expansion
3. GOOGLE QUERY EXPANSION
4. FULL-TEXT QUERY EXPANSION IN MYSQL
5. PERFORMANCE ANALYSIS FOR MYSQL QE
6. FULL-TEXT QUERY EXPANSION IN MS SQL SERVER
7. PERFORMANCE ANALYSIS FOR MS SQL SERVER QE
8. SUMMARY
   8.1 Summary
9. FUTURE WORK
   9.1 Future work
References

LIST OF FIGURES

Figure 1 Result set using full-text index
Figure 2 Execution time based on different factors for results shown in figure 1
Figure 3 Time statistics result chart based on various factors for results in figure 1
Figure 4 Result set using full-text index with QE
Figure 5 Execution time based on different factors for results shown in figure 4
Figure 6 Time statistics result chart based on various factors for results in figure 4
Figure 7 Output for query executed using FREETEXT
Figure 8 Performance statistics for result set shown in figure 7
Figure 9 CPU cost analysis for result set shown in figure 7
Figure 10 Output for query executed using CONTAINS and AND
Figure 11 Performance statistics for result set shown in figure 10
Figure 12 CPU cost analysis for result set shown in figure 10
Figure 13 Output for query executed using CONTAINS and ISABOUT
Figure 14 Performance statistics for result set shown in figure 13
Figure 15 CPU cost analysis for result set shown in figure 13
Figure 16 Output for query executed using CONTAINS and INFLECTIONAL
Figure 17 Performance statistics for result set shown in figure 16
Figure 18 CPU cost analysis for result set shown in figure 16
Figure 19 Output for query executed using CONTAINS and < > and AND
Figure 20 Performance statistics for result set shown in figure 19
Figure 21 CPU cost analysis for result set shown in figure 19

Chapter 1
INTRODUCTION

Two broad terms recur throughout this project: Information Retrieval and Query Expansion. In general, Information Retrieval (IR) is the method of searching a document, or part of a document, for specific information. Query Expansion (QE) is the process of reformulating a user’s query so that, in addition to the original query itself, it includes further search terms that are relevant. Information Retrieval is the big picture, whereas Query Expansion is one part of it.

This project analyzes different QE techniques used in web search engines and database tools. The analysis helps compare database tools and determine which one handles query expansion better. It also helps assess factors such as performance and efficiency that are coupled with query expansion techniques.

For the analysis, the MySQL and MS SQL Server 2008 database tools are used. Test queries are run and their results collected, and the analysis covers various queries with query expansion enabled. For MySQL, performance is measured as the time taken to execute queries with and without QE.
For MS SQL Server, time statistics and CPU cost are taken into account when performing various query expansion processes on a source query.

Chapter 2
BACKGROUND INFORMATION

2.1 Information Retrieval

Information Retrieval (IR) is the basis on which modern search engines work. IR has evolved greatly over time. The traditional IR mechanism was a little different: it involved retrieving information from a locally stored collection of documents. The modern IR mechanism works this way: the user passes a query as input; the IR engine processes that query; and, based on the internal algorithm it uses, the engine outputs the results related to that query term [2]. The output is a list of documents ranked by their relevance to the original query specified by the user.

IR is related not only to Computer Science but also to Mathematics, Information Science, Library Science, Information Architecture, Linguistics, and Statistics, to name a few [10]. IR has many applications: many public libraries use IR to provide access to the information users need, and web search engines rely heavily on IR systems and algorithms. Web search engines are in fact the most visible application of IR systems. IR uses different retrieval algorithms to retrieve the right information based on the user’s initial query. It is up to the underlying algorithm to fetch the more relevant information; hence the retrieval algorithm needs to be efficient.

2.1.1 Performance Measures for IR

As this project is at its core a study of performance analysis, the critical performance measures for IR need to be introduced. These measures require a collection of information (i.e., a database) and a query for specific information [19].
Precision: Precision is the fraction of the retrieved information that is relevant. It can be written as follows:

\[ \text{precision} = \frac{|\{\text{relevant information}\} \cap \{\text{retrieved information}\}|}{|\{\text{retrieved information}\}|} \]

Recall: Recall is the fraction of the relevant information that is retrieved. As a mathematical expression, it can be written as follows:

\[ \text{recall} = \frac{|\{\text{relevant information}\} \cap \{\text{retrieved information}\}|}{|\{\text{relevant information}\}|} \]

Fall-Out: Fall-Out is the fraction of the non-relevant information that is retrieved. It can be expressed as:

\[ \text{fall-out} = \frac{|\{\text{non-relevant information}\} \cap \{\text{retrieved information}\}|}{|\{\text{non-relevant information}\}|} \]

F-measure: The F-measure combines precision and recall. It can be expressed as:

\[ F = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}} \]

There are many other performance measures, such as Average Precision, R-precision, and Mean Average Precision, but the ones mentioned here are the most important of all.

2.2 Query Expansion

Query Expansion (QE) is the process of generating a query response in such a way that the results include additional search terms related to the original query itself. This process is quite complex, since the query expansion engine needs to analyze the search results before displaying and ranking them as relevant search items. Query Expansion is most popular with web search engines these days.

This is how it works. The user enters a query into the search engine, and the search engine, based on its underlying QE algorithm, enhances that query to match additional items that might be of interest to the user. In the end, it displays them in order of relevance: more relevant items are displayed first, followed by less relevant ones. By using Query Expansion, web search engines aim to improve precision and/or recall, the performance measures of IR [9]. For example, if the user enters the query “car”, the expanded query may include “car, cars, automobiles, auto”, etc.

There are mainly 3 classes of Query Expansion: human or computer-generated thesaurus, relevance feedback, and automatic query expansion. There are two major problems in QE: which terms to include, and how much weight each term should carry.
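To make the measures of Section 2.1.1 and the “car” example concrete, the following is a minimal Python sketch rather than an implementation of any particular engine. The document collection, the relevance judgments, the thesaurus, and its weights are all invented for illustration. The sketch expands the query with weighted thesaurus terms and then compares precision, recall, fall-out, and F-measure before and after expansion.

```python
# Toy data: every document, judgment, and weight below is hypothetical.
DOCS = {
    1: "used car for sale",
    2: "new cars and trucks",
    3: "automobile repair shop",
    4: "auto insurance quotes",
    5: "train timetable",
    6: "car wash coupons",
    7: "auto focus camera lens",
}
RELEVANT = {1, 2, 3, 4, 6}  # assumed relevance judgments for the query "car"

# Hypothetical thesaurus: related term -> weight relative to the original term.
THESAURUS = {"car": {"cars": 0.9, "automobile": 0.8, "auto": 0.7}}


def expand(query_terms):
    """Return {term: weight}; original terms keep weight 1.0."""
    weighted = {term: 1.0 for term in query_terms}
    for term in query_terms:
        for related, weight in THESAURUS.get(term, {}).items():
            weighted.setdefault(related, weight)
    return weighted


def retrieve(weighted_terms):
    """Return ids of matching documents, highest summed term weight first."""
    scored = []
    for doc_id, text in DOCS.items():
        tokens = set(text.split())
        score = sum(w for t, w in weighted_terms.items() if t in tokens)
        if score > 0:
            scored.append((score, doc_id))
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in scored]


def precision(retrieved):
    found = set(retrieved)
    return len(found & RELEVANT) / len(found) if found else 0.0


def recall(retrieved):
    return len(set(retrieved) & RELEVANT) / len(RELEVANT)


def fall_out(retrieved):
    non_relevant = set(DOCS) - RELEVANT
    return len(set(retrieved) & non_relevant) / len(non_relevant)


def f_measure(retrieved):
    p, r = precision(retrieved), recall(retrieved)
    return 2 * p * r / (p + r) if p + r else 0.0


plain = retrieve({"car": 1.0})       # original query only
expanded = retrieve(expand(["car"]))  # query plus weighted thesaurus terms

for label, result in (("plain", plain), ("expanded", expanded)):
    print(label, result, precision(result), recall(result),
          fall_out(result), f_measure(result))
```

In this toy collection the expanded query raises recall from 0.4 to 1.0 but also retrieves one non-relevant document (the camera document matching “auto”), so precision drops and fall-out rises. The weights only affect the ranking here; how to choose them is exactly the second problem named above.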
The underlying QE mechanism needs to be efficient enough to give nearly perfect answers to these questions. There is also an important distinction between concept-based and term-based query expansion. The question here is whether it is better to expand the query based on the specified query terms or based on the overall concept of the query.

2.2.1 Classes of Query Expansion

Query expansion is classified into the following: human-generated thesaurus, computer-generated thesaurus, relevance feedback, and automatic query expansion.

Human-generated thesaurus:
This is a compilation of the words frequently used in a given field. For example, an engineering thesaurus includes engineering terms, an industrial thesaurus includes industrial terms, and so on. It is largely used in fields such as medicine, aerospace, and technology. The drawbacks are the associated cost and time: a thesaurus is expensive to develop and maintain, and it takes a long time to build.

Computer-generated thesaurus:
An automatically generated thesaurus does not require experts to build it, so the cost of experts is saved. Generating the thesaurus involves several steps. The first step is to extract word co-occurrences. The second step is to define similarities between words through their co-occurrences or lexical relationships. The final step is to cluster words based on their similarities. This method has not proven as successful as the handcrafted, human-generated thesaurus method.

Relevance feedback:
Relevance feedback significantly improves the IR measures recall and precision [13] over the traditional way of query expansion. The process contains several steps: first, the user provides an initial query, which returns an initial result set matching the input query. The user then selects the documents from the initial result set that are relevant to the search.
The system then re-weights and/or expands the query based on the terms in the selected documents. This process may be iterative, meaning that the system can repeat it until the refined search query matches more relevant results.

Many models have been described for relevance feedback in query expansion. The most popular ones are vector-space, probabilistic, and Boolean. The models differ in the methods and theories behind them. The method in the vector-space model works like this: all the top-ranked relevant documents are used, together with the highest-ranked non-relevant document. The non-relevant document serves as the point in the vector space from which the feedback query is moved away.

Interactive query expansion:
Interactive query expansion uses a thesaurus. Initially, the user provides a query to the system. The system then comes up with the list of documents that are relevant to the search query. Once the user selects the documents of interest from the result set, the system refines the search result again by looking at the result set and a thesaurus. This process needs more research, since there are some unknowns associated with it.

Pseudo-relevance feedback:
The pseudo-relevance feedback mechanism was developed because of some limitations of the relevance feedback process. The obvious one is that most users do not like to give manual feedback to the system. The process works like this: the system returns an initial set of documents as soon as the user provides a search query. The system then assumes that the first n documents are relevant to the search query and takes terms from these documents to re-weight the initial query. Finally, the system repeats this iteratively until it gets a final result set that is more suitable to the search query. The drawback here is that the system relies heavily on the ability to efficiently retrieve relevant documents initially.
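The pseudo-relevance feedback loop described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the method of any specific system: the documents, the scoring by summed term weights, the choice of n = 2 feedback documents, the two added terms, and their weight of 0.5 are all hypothetical.

```python
from collections import Counter

# Toy collection; all documents are invented for illustration.
DOCS = {
    1: "jaguar big cat rainforest",
    2: "jaguar cat habitat rainforest",
    3: "rainforest cat species",
    4: "sports car engine",
    5: "cat food brands",
}


def search(weights):
    """Rank documents by the summed weight of the query terms they contain."""
    ranked = []
    for doc_id, text in DOCS.items():
        tokens = set(text.split())
        score = sum(w for term, w in weights.items() if term in tokens)
        if score > 0:
            ranked.append((score, doc_id))
    ranked.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in ranked]


def pseudo_relevance_feedback(query, n_docs=2, n_terms=2, new_weight=0.5, rounds=1):
    """Expand `query` using terms from the top-ranked documents.

    The first `n_docs` results are simply assumed to be relevant; the
    `n_terms` most frequent new terms in them join the query with a
    lower weight than the original terms.
    """
    weights = {term: 1.0 for term in query.split()}
    for _ in range(rounds):
        top_docs = search(weights)[:n_docs]
        counts = Counter(
            term
            for doc_id in top_docs
            for term in DOCS[doc_id].split()
            if term not in weights
        )
        # Sort by frequency, then alphabetically, so ties are deterministic.
        best = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:n_terms]
        for term, _count in best:
            weights[term] = new_weight
    return weights, search(weights)


initial = search({"jaguar": 1.0})
weights, expanded = pseudo_relevance_feedback("jaguar")
print(initial)   # documents matching the original query
print(weights)   # original term plus the terms learned from the top documents
print(expanded)  # result set after one feedback round
```

In this toy run the query “jaguar” picks up “cat” and “rainforest” from the two top-ranked documents, which brings in one more on-topic document but also an off-topic one about cat food. The expansion is therefore only as good as the initial ranking.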
This initial retrieval quality depends on the underlying algorithm used to carry out the process [13]. The advantage is that no manual feedback is required from the user, which saves time and effort.

Automatic query expansion:
Automatic query expansion uses a computer-generated thesaurus and works similarly to the pseudo-relevance feedback process. With co-occurrence measures, relationships between words are defined by their co-occurrence in documents. During clustering, documents that share a measurable number of terms are grouped together, and a thesaurus is then generated from these terms. The drawbacks are that this approach does not account for synonyms and that the resulting categories of terms are sometimes too narrow or too broad.

With lexical co-occurrence measures, on the other hand, the proximity of words in the document is taken into account instead of the frequency of terms. This significantly improves performance on small document collections. The process is better than pseudo-relevance feedback but not quite as efficient as relevance feedback.

Chapter 3
GOOGLE QUERY EXPANSION

Word stemming
In word stemming, the search term is reduced to its root, or stem. For example, the search term ‘translator’ can be expanded to ‘translator’, ‘translation’, etc. The keyword is reduced to its stem (‘translate’ in this case), and words beginning with the same stem can then be matched [20].

Acronyms
A searched acronym or abbreviation is automatically resolved to its full form. For example, a search for OSI may include results for ‘Open Systems Interconnection’ as well as ‘Open Source Initiative’. Another example is the term NFL, which can be expanded to include ‘National Football League’ and ‘National Forensic Laboratory’. The search results are ranked by relevance: the most probable ones are listed first, followed by the rest.
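The stemming and acronym examples above can be mimicked with a short Python sketch. It is purely illustrative and makes no claim about how Google implements either feature: the suffix list, the vocabulary, and the function names are assumptions, and the crude suffix stripping yields the stem ‘transl’ rather than a dictionary word such as ‘translate’. The acronym table reuses the two examples from the text.

```python
# Hypothetical suffix list, checked in order; real stemmers use far richer rules.
SUFFIXES = ("ation", "ator", "ate")

# Acronym table built from the two examples in the text.
ACRONYMS = {
    "OSI": ["Open Systems Interconnection", "Open Source Initiative"],
    "NFL": ["National Football League", "National Forensic Laboratory"],
}

# Hypothetical index vocabulary to expand against.
VOCABULARY = ["translator", "translation", "translate", "transport", "train"]


def naive_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def expand_by_stem(term, vocabulary):
    """Return every vocabulary word that shares the stem of `term`."""
    stem = naive_stem(term)
    return sorted(word for word in vocabulary if naive_stem(word) == stem)


def expand_acronym(term):
    """Return the known full forms of an acronym, or the term itself."""
    return ACRONYMS.get(term.upper(), [term])


print(expand_by_stem("translator", VOCABULARY))
print(expand_acronym("osi"))
```

Ranking the alternative full forms of an acronym by likelihood, as described above, would need usage statistics that this sketch does not model.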
Misspellings and typos
Whenever a typo is made by mistake during a search, Google identifies it almost every time and suggests the correct variant, in addition to showing the “did you mean” prompt. Searching for ‘city trafic’ will include search results containing the word ‘city’ and its variants as well as the correct spelling ‘traffic’.

Synonyms
Google uses query expansion to include synonyms of the entered query term [20], which is useful for including related words. Most of the time, the search query is expanded when the search term is entered incorrectly. As a common rule, a synonym substituted for an incorrect word is not shown in bold. For example, a search query for ‘remote connexion’ will display search results for ‘remote connection’ in most cases.

Translations
In some cases, Google seems to translate the search query term into another language and display results based on that. For example, a search query for a non-English term may match the query’s English-equivalent word: searching for the Spanish term ‘amigo’ may display results that include the English term ‘friend’.

Ignored words
Surprisingly, some words of the query are discarded completely. It may be that the dropped terms do not have much significance for the overall query [20]. For example, a search for the term ‘birthday bound attack’ might drop the word ‘attack’ from the original query because it carries little significance.

Interestingly, query expansion does not take place every time someone searches for a query term. Whether or not it occurs also depends on the entire search query: some query variants are much more likely to trigger query expansion than others.

Chapter 4
FULL-TEXT QUERY EXPANSION IN MYSQL

In some instances, users search for information relying on their own knowledge.
Using that knowledge, they define key terms to search for, and typically these terms are too short. To address this situation, the MySQL full-text search engine introduced a query expansion process within MySQL itself [16]. The expansion of the input query is based on the automatic relevance feedback mechanism that was discussed in Chapter 2. This process is also called blind query expansion.

The MySQL full-text search engine performs the following steps when it uses query expansion [16]:

1. First, the MySQL full-text search engine finds all rows that match the search query.
2. Next, it examines the resulting rows and finds the relevant words in them.
3. Finally, it searches again based on the relevant words obtained in the previous step.

Typically, users need query expansion when they could not find results relevant to their search query, or when too few results were returned. They search again, this time with query expansion enabled, so that they can find what they are looking for. To use query expansion, users add the modifier WITH QUERY EXPANSION to the SELECT statement. Here is the usual form of using WITH QUERY EXPANSION:

    SELECT col1, col2, col3
    FROM table1
    WHERE MATCH (col1, col2, col3)
          AGAINST ('keyword' WITH QUERY EXPANSION)

Let’s look at the example below to understand query expansion using WITH QUERY EXPANSION. Here, the empName column of the employee table is used to demonstrate it.

    ALTER TABLE employee ADD FULLTEXT (empName)

Next, let’s find all the employees whose name contains “smith”, without using query expansion.

    SELECT empName
    FROM employee
    WHERE MATCH (empName) AGAINST ('smith')

The output appears as follows:

    ----------------------------------------
    empName
    ----------------------------------------
    william smith
    paul mithell smith
    andrew smith
    ----------------------------------------

As the results show, every employee name returned contains “smith”.
Now, the same query can be used with query expansion enabled, as follows:

    SELECT empName
    FROM employee
    WHERE MATCH (empName)
          AGAINST ('smith' WITH QUERY EXPANSION)

The output is as follows:

    ----------------------------------------
    empName
    ----------------------------------------
    william smith
    paul mithell smith
    andrew smith
    paul patrick
    andrew white
    william george
    ----------------------------------------

As the results above show, the query returns more rows when query expansion is used. The first three rows are the most relevant results. The remaining rows come from the relevant words found in the first three rows, for example ‘andrew’. Blind query expansion tends to increase noise significantly by returning non-relevant results. Thus, it is recommended only when the search keywords are short.

Chapter 5
PERFORMANCE ANALYSIS FOR MYSQL QE

To run QE, MySQL server 5.5 is used. To perform a full-text search for a given query, one needs to create a full-text index on the specified column(s) of a given table. The following query adds a full-text index on the table ‘country’ for the columns ‘name’ and ‘region’.

    mysql> alter table country add fulltext (name, region);
    Query OK, 239 rows affected (0.26 sec)
    Records: 239  Duplicates: 0  Warnings: 0

Now, full-text searches against the columns ‘name’ and ‘region’ can be executed. The following query searches the mentioned columns for the word ‘america’ using the full-text index.
    mysql> select name, continent, region from country
        -> where match (name, region) against ('america');
    +---------------------------+---------------+-----------------+
    | name                      | continent     | region          |
    +---------------------------+---------------+-----------------+
    | Argentina                 | South America | South America   |
    | Guyana                    | South America | South America   |
    | Honduras                  | North America | Central America |
    | Mexico                    | North America | Central America |
    | Nicaragua                 | North America | Central America |
    | Panama                    | North America | Central America |
    | Peru                      | South America | South America   |
    | Paraguay                  | South America | South America   |
    | El Salvador               | North America | Central America |
    | Suriname                  | South America | South America   |
    | Uruguay                   | South America | South America   |
    | Venezuela                 | South America | South America   |
    | Guatemala                 | North America | Central America |
    | Greenland                 | North America | North America   |
    | Belize                    | North America | Central America |
    | Bermuda                   | North America | North America   |
    | Bolivia                   | South America | South America   |
    | Brazil                    | South America | South America   |
    | Canada                    | North America | North America   |
    | Chile                     | South America | South America   |
    | Colombia                  | South America | South America   |
    | Ecuador                   | South America | South America   |
    | Costa Rica                | North America | Central America |
    | Falkland Islands          | South America | South America   |
    | United States             | North America | North America   |
    | French Guiana             | South America | South America   |
    | Saint Pierre and Miquelon | North America | North America   |
    +---------------------------+---------------+-----------------+
    27 rows in set (0.03 sec)

Figure 1. Result set using full-text index

Figure 2 provides the query execution time (in seconds) based on different factors used while analyzing and performing query search. The time measurements are for the above query.
    +-------------------------+----------+
    | Status                  | Duration |
    +-------------------------+----------+
    | starting                | 0.000154 |
    | checking permissions    | 0.000017 |
    | Opening tables          | 0.000039 |
    | System lock             | 0.000022 |
    | init                    | 0.000034 |
    | optimizing              | 0.000013 |
    | statistics              | 0.030746 |
    | preparing               | 0.000038 |
    | FULLTEXT initialization | 0.000094 |
    | executing               | 0.000007 |
    | Sending data            | 0.000255 |
    | end                     | 0.000010 |
    | query end               | 0.000005 |
    | closing tables          | 0.000014 |
    | freeing items           | 0.000105 |
    | logging slow query      | 0.000006 |
    | cleaning up             | 0.000006 |
    | TOTAL                   | 0.031565 |
    +-------------------------+----------+

Figure 2. Execution time based on different factors for results shown in figure 1

Figure 3 shows, in simpler pie chart form, the query execution time for the important factors that play a key role in the query search process. According to the chart, one of the major time-consuming factors is “sending data” to the host machine once the data is ready to be delivered by the server engine. Unsurprisingly, “Full-text initialization” also consumes a fair amount of time during the query search.

[Pie chart “Time statistics” with segments: Opening tables, Initialization, Fulltext Init, Execution, Sending data, Closing tables]

Figure 3. Time statistics result chart based on various factors for results in figure 1

The following query searches for the term ‘america’ in the table ‘country’ with MySQL query expansion enabled. This query will fetch more results because, after one execution, the internal full-text engine expands the query to include more relevant search terms.
    mysql> select name, continent, region from country
        -> where match (name, region) against ('america' with query expansion);
    +------------------------------+---------------+---------------------------+
    | name                         | continent     | region                    |
    +------------------------------+---------------+---------------------------+
    | French Guiana                | South America | South America             |
    | Falkland Islands             | South America | South America             |
    | Colombia                     | South America | South America             |
    | Ecuador                      | South America | South America             |
    | Chile                        | South America | South America             |
    | Brazil                       | South America | South America             |
    | Guyana                       | South America | South America             |
    | Bolivia                      | South America | South America             |
    | Venezuela                    | South America | South America             |
    | Argentina                    | South America | South America             |
    | Suriname                     | South America | South America             |
    | Paraguay                     | South America | South America             |
    | Uruguay                      | South America | South America             |
    | Peru                         | South America | South America             |
    | Costa Rica                   | North America | Central America           |
    | Nicaragua                    | North America | Central America           |
    | Guatemala                    | North America | Central America           |
    | Panama                       | North America | Central America           |
    | Honduras                     | North America | Central America           |
    | Mexico                       | North America | Central America           |
    | Belize                       | North America | Central America           |
    | Bermuda                      | North America | North America             |
    | Canada                       | North America | North America             |
    | Greenland                    | North America | North America             |
    | El Salvador                  | North America | Central America           |
    | United States                | North America | North America             |
    | Saint Pierre and Miquelon    | North America | North America             |
    | South Georgia                | Antarctica    | Antarctica                |
    | South Korea                  | Asia          | Eastern Asia              |
    | South Africa                 | Africa        | Southern Africa           |
    | Central African Republic     | Africa        | Central Africa            |
    | Congo                        | Africa        | Central Africa            |
    | Chad                         | Africa        | Central Africa            |
    | Cameroon                     | Africa        | Central Africa            |
    | Gabon                        | Africa        | Central Africa            |
    | Angola                       | Africa        | Central Africa            |
    | Tajikistan                   | Asia          | Southern and Central Asia |
    | Sao Tome and Principe        | Africa        | Central Africa            |
    | Pakistan                     | Asia          | Southern and Central Asia |
    | Nepal                        | Asia          | Southern and Central Asia |
    | Uzbekistan                   | Asia          | Southern and Central Asia |
    | Maldives                     | Asia          | Southern and Central Asia |
    | Sri Lanka                    | Asia          | Southern and Central Asia |
    | Bhutan                       | Asia          | Southern and Central Asia |
    | Equatorial Guinea            | Africa        | Central Africa            |
    | Bangladesh                   | Asia          | Southern and Central Asia |
    | Afghanistan                  | Asia          | Southern and Central Asia |
    | India                        | Asia          | Southern and Central Asia |
    | Turkmenistan                 | Asia          | Southern and Central Asia |
    | Iran                         | Asia          | Southern and Central Asia |
    | Kazakstan                    | Asia          | Southern and Central Asia |
    | Kyrgyzstan                   | Asia          | Southern and Central Asia |
    | Congo, The Democratic Rep    | Africa        | Central Africa            |
    | North Korea                  | Asia          | Eastern Asia              |
    | French Southern territories  | Antarctica    | Antarctica                |
    | French Polynesia             | Oceania       | Polynesia                 |
    | Virgin Islands, U.S.         | North America | Caribbean                 |
    | Ireland                      | Europe        | British Islands           |
    | Cook Islands                 | Oceania       | Polynesia                 |
    | Solomon Islands              | Oceania       | Melanesia                 |
    | Cayman Islands               | North America | Caribbean                 |
    | Fiji Islands                 | Oceania       | Melanesia                 |
    | Marshall Islands             | Oceania       | Micronesia                |
    | Virgin Islands, British      | North America | Caribbean                 |
    | Northern Mariana Islands     | Oceania       | Micronesia                |
    | United Kingdom               | Europe        | British Islands           |
    | Faroe Islands                | Europe        | Nordic Countries          |
    | Turks and Caicos Islands     | North America | Caribbean                 |
    | Heard Island and McDonald    | Antarctica    | Antarctica                |
    | Cocos (Keeling) Islands      | Oceania       | Australia and New Zealand |
    | United States Minor Outlying | Oceania       | Micronesia/Caribbean      |
    +------------------------------+---------------+---------------------------+
    71 rows in set (0.09 sec)

Figure 4. Result set using full-text index with QE

Figure 5 shows the query execution time (in seconds) based on different factors used while analyzing and performing query search. The time measurements are for the query results shown in figure 4.
    +-------------------------+----------+
    | Status                  | Duration |
    +-------------------------+----------+
    | starting                | 0.083475 |
    | checking permissions    | 0.000020 |
    | Opening tables          | 0.000025 |
    | System lock             | 0.000015 |
    | init                    | 0.000025 |
    | optimizing              | 0.000010 |
    | statistics              | 0.000016 |
    | preparing               | 0.000012 |
    | FULLTEXT initialization | 0.000603 |
    | executing               | 0.000006 |
    | Sending data            | 0.000487 |
    | end                     | 0.000007 |
    | query end               | 0.000005 |
    | closing tables          | 0.000009 |
    | freeing items           | 0.000141 |
    | logging slow query      | 0.000005 |
    | cleaning up             | 0.000005 |
    | TOTAL                   | 0.084866 |
    +-------------------------+----------+

Figure 5. Execution time based on different factors for results shown in figure 4

Figure 6 shows, in simpler pie chart form, the query execution time for the important factors that play a key role in the query expansion process. With query expansion involved, the full-text initialization factor consumes considerably more time: it takes about 0.5 ms longer (0.603 ms versus 0.094 ms). This is because the column to search needs to be indexed prior to the search.

[Pie chart “Time statistics” with segments: Opening tables, Initialization, Fulltext Init, Execution, Sending data, Closing tables]

Figure 6. Time statistics result chart based on various factors for results in figure 4

Let’s carefully examine figures 2 and 5. In figure 2, the MySQL full-text search query without QE took a total of about 31.6 ms. The majority of that time is taken by “statistics”; the second largest share is “sending data”, followed by “fulltext initialization”.

In figure 5, the same MySQL full-text search query, now with QE, took a total of about 84.9 ms. The “starting” phase took the longest, followed by the “fulltext initialization” phase. Comparing the total times shows that the execution time is much higher when QE is used.
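The comparison of figures 2 and 5 can be reproduced with simple arithmetic. The Python sketch below copies the durations (in seconds) from the two figures; the dictionary and function names are chosen only for this illustration. It recomputes both totals, their difference, and the three slowest phases of each run.

```python
# Durations in seconds, copied from figure 2 (without QE) and figure 5 (with QE).
WITHOUT_QE = {
    "starting": 0.000154, "checking permissions": 0.000017,
    "Opening tables": 0.000039, "System lock": 0.000022, "init": 0.000034,
    "optimizing": 0.000013, "statistics": 0.030746, "preparing": 0.000038,
    "FULLTEXT initialization": 0.000094, "executing": 0.000007,
    "Sending data": 0.000255, "end": 0.000010, "query end": 0.000005,
    "closing tables": 0.000014, "freeing items": 0.000105,
    "logging slow query": 0.000006, "cleaning up": 0.000006,
}
WITH_QE = {
    "starting": 0.083475, "checking permissions": 0.000020,
    "Opening tables": 0.000025, "System lock": 0.000015, "init": 0.000025,
    "optimizing": 0.000010, "statistics": 0.000016, "preparing": 0.000012,
    "FULLTEXT initialization": 0.000603, "executing": 0.000006,
    "Sending data": 0.000487, "end": 0.000007, "query end": 0.000005,
    "closing tables": 0.000009, "freeing items": 0.000141,
    "logging slow query": 0.000005, "cleaning up": 0.000005,
}


def total_ms(profile):
    """Total execution time of a profile in milliseconds."""
    return sum(profile.values()) * 1000


def slowest(profile, k=3):
    """Names of the k phases with the longest duration, slowest first."""
    return sorted(profile, key=profile.get, reverse=True)[:k]


delta_ms = total_ms(WITH_QE) - total_ms(WITHOUT_QE)
fulltext_delta_ms = (WITH_QE["FULLTEXT initialization"]
                     - WITHOUT_QE["FULLTEXT initialization"]) * 1000

print(f"without QE: {total_ms(WITHOUT_QE):.3f} ms, slowest: {slowest(WITHOUT_QE)}")
print(f"with QE:    {total_ms(WITH_QE):.3f} ms, slowest: {slowest(WITH_QE)}")
print(f"extra time with QE: {delta_ms:.3f} ms")
print(f"extra full-text initialization time: {fulltext_delta_ms:.3f} ms")
```

The per-phase sums match the TOTAL rows of both figures (31.565 ms and 84.866 ms), so the run with QE is 53.301 ms slower, of which about 0.509 ms is attributed to full-text initialization.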
It takes about 53 more milliseconds to perform query expansion on the same table, with the same query, searching for the same keyword. A careful analysis of the results also shows that full-text initialization affects the overall performance of the query in terms of time. This is due to the indexing performed on the specified column(s) of a relatively large database table.

So the question arises: is it wise to use query expansion? The answer depends on the user’s needs. If the user is searching for short phrases or keywords, then it is better to keep query expansion disabled. For ambiguous searches, when the user is not sure about the search term, query expansion becomes really useful.

Chapter 6
FULL-TEXT QUERY EXPANSION IN MS SQL SERVER

This chapter briefly describes how MS SQL Server performs full-text query search. For performance analysis and research, MS SQL Server 2008 R2 is used. SQL Server uses the FREETEXT and CONTAINS keywords to perform Query Expansion. The expansion process works in such a way that the source query is modified by the internal full-text search engine after the initial run in order to get more information. The search terms supplied in a query are expanded by the parser before they hit the full-text index. This query expansion is performed by a component called a stemmer, and the expansion depends on rules that are applied explicitly. Query expansion can be used to include the following types of search:

- Searches for plural forms of a word: a search on bike would also return bikes.
- Thesaurus searches for synonymous forms of a word: a search on Leopard may return words such as panther, mountain lion, puma, etc.
- Searches that return all linguistically meaningful variations of a word: a search for ride may include ride, rides, ridden, rode, etc.

There are two types of query: those using CONTAINS and those using FREETEXT. Both are supported in MS SQL Server 2008, and both will be examined in detail.
Querying using CONTAINS

By default, the CONTAINS predicate entails minimal language-specific query-time expansion. For example, consider a search on the term ‘bank’:

    select * from TableName where CONTAINS(*,'bank')

The asterisk (*) is used to search all full-text indexed columns. In SQL Server 2008, a search can be made within all columns, a named column, or a subset of the full-text indexed columns:

    select * from TableName where CONTAINS(col1,'bank')
    select * from TableName where CONTAINS((col1,col2),'bank')

Sometimes there is a need to search for a word both as a noun and as a verb. Searching for ‘bank’ with CONTAINS will not return results for ‘banking’; CONTAINS matches only ‘bank’. The FORMSOF keyword solves this. It accepts one of two expansion types: INFLECTIONAL or THESAURUS. The INFLECTIONAL argument expands the search phrase to all conjugations and declensions, and the THESAURUS argument enables a thesaurus expansion on the search phrase. Here are two examples of what this would look like:

    Select * from TableName where CONTAINS(*,'FORMSOF(INFLECTIONAL,run)')

and

    Select * from TableName where CONTAINS(*,'FORMSOF(THESAURUS,run)')

Querying using FREETEXT

With FREETEXT, the search is by default expanded to encompass all generations (inflectional forms) of the searched word, for the default full-text language setting of the server. To continue the previous ‘bank’ example, a FREETEXT search on bank:

    Select * from TableName where FREETEXT(*,'bank')

would be expanded to the equivalent of:

    Select * from TableName where CONTAINS(*,'"bank" or "banks" or "bank''s" or "banks''" or "banking"')

The search would also return any relevant thesaurus expansions.

Chapter 7

PERFORMANCE ANALYSIS FOR MS SQL SERVER QE

Performance analysis is important in determining how query expansion affects different performance factors such as disk IO, memory, CPU cost, query execution time, and so on.
In this chapter, query execution times for MS SQL Server 2008 are reviewed while running different full-text queries with QE enabled on a relatively large database table. To enable SQL Server to use the full-text search mechanism, the following steps are needed:

1. Select the database to be used and create a full-text catalog.
2. Create a full-text index on the column(s) of the chosen table.
3. Run full-text queries to search for terms.

A series of queries will now be executed on the freshly created database table named Production.ProductDescription. First, run the following query to create a catalog on the database to be used. Here the AdventureWorks database is used.

    create fulltext catalog AdventureWorksCatalog

The table used for querying is Production.ProductDescription in MS SQL Server 2008. It contains detailed descriptions of various products. Next, create a full-text index on the Description column of this table by running the following query:

    Create FullText Index on Production.ProductDescription ([Description])
    Key Index PK_ProductDescription_ProductDescriptionID
    on AdventureWorksCatalog
    with Change_Tracking Auto

There are various ways of querying the SQL Server full-text search engine to demonstrate query expansion. The following query examines the ProductDescription table and fetches rows containing the keyword ‘race’ or ‘bike’, independently of each other.

    Select [Description]
    from Production.ProductDescription
    Where FREETEXT([Description], 'Race Bike')

The output appears as below:

Description
This bike delivers a high-level of performance on a budget. It is responsive and maneuverable, and offers peace-of-mind when you decide to go off-road. For true trail addicts. An extremely durable bike that will go anywhere and keep you in control on challenging terrain - without breaking your budget. Top-of-the-line competition mountain bike.
Performance-enhancing options include the innovative HL Frame, super-smooth front suspension, and traction for all terrain. Entry level adult bike; offers a comfortable ride cross-country or down the block. Quick-release hubs and rims. Value-priced bike with many features of our top-of-the-line models. Has the same light, stiff frame, and the quick acceleration we're famous for. Same technology as all of our Road series bikes, but the frame is sized for a woman. Perfect all-around bike for road or racing. Same technology as all of our Road series bikes. Perfect all-around bike for road or racing. A true multi-sport bike that offers streamlined riding and a revolutionary design. Aerodynamic design lets you ride with the pros, and the gearing will conquer hilly roads. Cross-train, race, or just socialize on a sleek, aerodynamic bike. Advanced seat technology provides comfort all day. Cross-train, race, or just socialize on a sleek, aerodynamic bike designed for a woman. Advanced seat technology provides comfort all day. Alluminum-alloy frame provides a light, stiff ride, whether you are racing in the velodrome or on a demanding club ride on country roads. This bike is ridden by race winners. Developed with the Adventure Works Cycles professional race team, it has a extremely light heat-treated aluminum frame, and steering that allows precision control. All-occasion value bike with our basic comfort and safety features. Offers wider, more stable tires for a ride around town or weekend trip. The plush custom saddle keeps you riding all day, and there's plenty of space to add panniers and bike bags to the newly-redesigned carrier. This bike has stability when fully-loaded. Lightweight kevlar racing saddle. Leather. Carries 4 bikes securely; steel construction, fits 2" receiver hitch. Clip-on fenders fit most mountain bikes. Men's 8-panel racing shorts - lycra with an elastic waistband and leg grippers. Perfect all-purpose bike stand for working on your bike at home.
Quick-adjusting clamps and steel construction.

Figure 7. Output for query executed using FREETEXT

The figure below shows the total query execution time, including client processing time and wait time on server replies, in milliseconds. The time taken by each query will be examined later.

Time Statistics                 ms
Client processing time          3
Total execution time            58
Wait time on server replies     55

CPU Cost                        in %
Clustered Index Seek            0.0001581
Fulltext Match                  0.0033386
Nested Loops                    0.0000902

Figure 8. Performance statistics for result set shown in Figure 7

The chart below displays the three most important operators associated with CPU cost during the query expansion process.

[Chart: CPU cost analysis. Series: Clustered Index Seek, Fulltext Match, Nested Loops]

Figure 9. CPU cost analysis for result set shown in Figure 7

The following query examines the ProductDescription table and fetches rows containing the keywords ‘race’ and ‘bike’ dependent on each other, meaning that both words have to appear in the same row.

    Select [Description]
    from Production.ProductDescription
    Where Contains([Description], '"Race" and "Bike"')

The output appears as below:

Description
Cross-train, race, or just socialize on a sleek, aerodynamic bike. Advanced seat technology provides comfort all day. Cross-train, race, or just socialize on a sleek, aerodynamic bike designed for a woman. Advanced seat technology provides comfort all day. This bike is ridden by race winners. Developed with the Adventure Works Cycles professional race team, it has a extremely light heat-treated aluminum frame, and steering that allows precision control.

Figure 10. Output for query executed using CONTAINS and AND

The figure below shows the total query execution time, including client processing time and wait time on server replies, in milliseconds.

Time Statistics                 ms
Client processing time          3
Total execution time            7
Wait time on server replies     4

CPU Cost                        in %
Clustered Index Seek            0.0001581
Fulltext Match                  0.0033022
Nested Loops                    0.0000084

Figure 11. Performance statistics for result set shown in Figure 10

The chart below shows that the full-text match operation requires more CPU power than the other two operators, and hence its CPU cost is the highest.

[Chart: CPU cost analysis. Series: Clustered Index Seek, Fulltext Match, Nested Loops]

Figure 12. CPU cost analysis for result set shown in Figure 10

The following query examines the ProductDescription table and fetches rows containing the keyword ‘race’ or ‘bike’ independently, with each term weighted as specified.

    Select [Description]
    from Production.ProductDescription
    Where Contains([Description], 'ISABOUT (Race Weight(.4), Bike Weight (.2))')

The output appears as below:

Description
This bike delivers a high-level of performance on a budget. It is responsive and maneuverable, and offers peace-of-mind when you decide to go off-road. For true trail addicts. An extremely durable bike that will go anywhere and keep you in control on challenging terrain - without breaking your budget. Top-of-the-line competition mountain bike. Performance-enhancing options include the innovative HL Frame, super-smooth front suspension, and traction for all terrain. Entry level adult bike; offers a comfortable ride cross-country or down the block. Quick-release hubs and rims. Value-priced bike with many features of our top-of-the-line models. Has the same light, stiff frame, and the quick acceleration we're famous for. Same technology as all of our Road series bikes, but the frame is sized for a woman. Perfect all-around bike for road or racing. Same technology as all of our Road series bikes. Perfect all-around bike for road or racing. A true multi-sport bike that offers streamlined riding and a revolutionary design. Aerodynamic design lets you ride with the pros, and the gearing will conquer hilly roads. Cross-train, race, or just socialize on a sleek, aerodynamic bike. Advanced seat technology provides comfort all day.
Cross-train, race, or just socialize on a sleek, aerodynamic bike designed for a woman. Advanced seat technology provides comfort all day. This bike is ridden by race winners. Developed with the Adventure Works Cycles professional race team, it has a extremely light heat-treated aluminum frame, and steering that allows precision control. All-occasion value bike with our basic comfort and safety features. Offers wider, more stable tires for a ride around town or weekend trip. The plush custom saddle keeps you riding all day, and there's plenty of space to add panniers and bike bags to the newly-redesigned carrier. This bike has stability when fully-loaded. Perfect all-purpose bike stand for working on your bike at home. Quick-adjusting clamps and steel construction.

Figure 13. Output for query executed using CONTAINS and ISABOUT

The figure below shows the total query execution time, including client processing time and wait time on server replies, in milliseconds.

Time Statistics                 ms
Client processing time          5
Total execution time            7
Wait time on server replies     2

CPU Cost                        in %
Clustered Index Seek            0.0001581
Fulltext Match                  0.0033187
Nested Loops                    0.0000069

Figure 14. Performance statistics for result set shown in Figure 13

The following chart shows the CPU cost of the three operators for the result set above.

[Chart: CPU cost analysis. Series: Clustered Index Seek, Fulltext Match, Nested Loops]

Figure 15. CPU cost analysis for result set shown in Figure 13

The following query examines the ProductDescription table and fetches rows containing the keyword ‘ride’ using an inflectional search. By specifying INFLECTIONAL, the query expansion engine searches for all conjugations of that keyword.

    SELECT Description
    FROM Production.ProductDescription
    WHERE CONTAINS(Description, ' FORMSOF (INFLECTIONAL, ride) ')

The output appears as below:

Description
Suitable for any type of riding, on or off-road. Fits any budget. Smooth-shifting with a comfortable ride. Serious back-country riding.
Perfect for all levels of competition. Uses the same HL Frame as the Mountain-100. Entry level adult bike; offers a comfortable ride cross-country or down the block. Quick-release hubs and rims. A true multi-sport bike that offers streamlined riding and a revolutionary design. Aerodynamic design lets you ride with the pros, and the gearing will conquer hilly roads. Alluminum-alloy frame provides a light, stiff ride, whether you are racing in the velodrome or on a demanding club ride on country roads. This bike is ridden by race winners. Developed with the Adventure Works Cycles professional race team, it has a extremely light heat-treated aluminum frame, and steering that allows precision control. All-occasion value bike with our basic comfort and safety features. Offers wider, more stable tires for a ride around town or weekend trip. The plush custom saddle keeps you riding all day, and there's plenty of space to add panniers and bike bags to the newly-redesigned carrier. This bike has stability when fully-loaded. Aerodynamic rims for smooth riding. A light yet stiff aluminum bar for long distance riding. Expanded platform so you can ride in any shoes; great for all-around riding. A stable pedal for all-day riding. Excellent aerodynamic rims guarantee a smooth ride. Anatomic design for a full-day of riding in comfort. Durable leather. New design relieves pressure for long rides. Cut-out shell for a more comfortable ride. Lightweight carbon reinforced for an unrivaled ride at an un-compromised weight. The LL Frame provides a safe comfortable ride, while offering superior bump absorption in a value-priced aluminum frame. Lightweight butted aluminum frame provides a more upright riding position for a trip around town. Our ground-breaking design provides optimum comfort. The HL aluminum frame is custom-shaped for both good looks and strength; it will withstand the most rigorous challenges of daily riding. Men's version.
Affordable light for safe night riding - uses 3 AAA batteries
Warm spandex tights for winter riding; seamless chamois construction eliminates pressure points.

Figure 16. Output for query executed using CONTAINS and INFLECTIONAL

The figure below shows the total query execution time, including client processing time and wait time on server replies, in milliseconds.

Time Statistics                 ms
Client processing time          4
Total execution time            7
Wait time on server replies     3

CPU Cost                        in %
Clustered Index Seek            0.0001581
Fulltext Match                  0.0033187
Nested Loops                    0.0000069

Figure 17. Performance statistics for result set shown in Figure 16

The chart below displays CPU cost information for the query expansion process. For the above query, the full-text matching process clearly consumes the most CPU.

[Chart: CPU cost analysis. Series: Clustered Index Seek, Fulltext Match, Nested Loops]

Figure 18. CPU cost analysis for result set shown in Figure 16

The following query examines the ProductDescription table and fetches rows containing the keywords ‘aluminum’ and ‘spindle’. The AND keyword ensures that only rows containing both keywords are fetched, and the <> condition excludes the row whose ProductDescriptionID is 5.

    SELECT Description
    FROM Production.ProductDescription
    WHERE ProductDescriptionID <> 5
    AND CONTAINS(Description, ' Aluminum AND spindle')

The output appears as below:

Description
Aluminum alloy cups; large diameter spindle.

Figure 19. Output for query executed using CONTAINS, <>, and AND

The figure below shows the total query execution time, including client processing time and wait time on server replies, in milliseconds.

Time Statistics                 ms
Client processing time          7
Total execution time            25
Wait time on server replies     18

CPU Cost                        in %
Clustered Index Seek            0.0001581
Fulltext Match                  0.0033044
Nested Loops                    0.0000084

Figure 20. Performance statistics for result set shown in Figure 19

The chart below depicts the performance analysis in terms of CPU cost, taking the three most important variables into account: Clustered Index Seek, Fulltext Match, and Nested Loops.
[Chart: CPU cost analysis. Series: Clustered Index Seek, Fulltext Match, Nested Loops]

Figure 21. CPU cost analysis for result set shown in Figure 19

To conclude, query expansion has an impact on query execution time. The indexing process also accounts for a larger share of the overall CPU cost. Nested loops (inner join) cost the least of the three factors affecting CPU cost. For the same table, applying full-text query expansion leads to no change in CPU cost for the full-text initialization.

Chapter 8

SUMMARY

8.1 Summary

To summarize, MS SQL Server and MySQL take a fair amount of time performing full-text indexing and initialization when query expansion is enabled. This is due to the process involved in query expansion. When a database user fires a query to search for a keyword using query expansion, the table gets the index for a range to search for, the initial results are fetched, the query expansion algorithm is applied to the initial result set, and finally a new result set containing the expanded search results is displayed.

Query expansion does indeed affect CPU cost. For MS SQL Server, CPU costs are analyzed. For the same table, applying full-text query expansion leads to no change in the CPU cost for full-text initialization. The CPU cost is an important factor in analyzing query performance. The timing statistics also show that the same query takes longer to execute when query expansion is applied.

From the charts shown in the previous chapter, it is clear that MS SQL Server consumes more CPU to perform query expansion, leading to a performance bottleneck. The main operation that results in higher CPU cost is full-text matching. Careful analysis shows that the process involved in query expansion is complex enough that it needs more CPU power to perform its critical operations.
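The observation that full-text matching dominates CPU cost can be checked against the numbers reported in Chapter 7. The sketch below tabulates the CPU cost values from Figures 8, 11, 14, 17 and 20 and computes each operator's share of the three reported operators only (other plan operators, if any, are not covered by the figures). The query labels and the helper name cost_shares are assumptions made for this illustration.

```python
# CPU cost values copied from Figures 8, 11, 14, 17 and 20.
# Tuple order: (Clustered Index Seek, Fulltext Match, Nested Loops).
OPERATORS = ("Clustered Index Seek", "Fulltext Match", "Nested Loops")
CPU_COST = {
    "FREETEXT 'Race Bike'": (0.0001581, 0.0033386, 0.0000902),
    "CONTAINS Race AND Bike": (0.0001581, 0.0033022, 0.0000084),
    "CONTAINS ISABOUT": (0.0001581, 0.0033187, 0.0000069),
    "CONTAINS FORMSOF INFLECTIONAL": (0.0001581, 0.0033187, 0.0000069),
    "CONTAINS Aluminum AND spindle": (0.0001581, 0.0033044, 0.0000084),
}


def cost_shares(costs):
    """Return each cost as a fraction of the sum of the given costs."""
    total = sum(costs)
    return [cost / total for cost in costs]


if __name__ == "__main__":
    for label, costs in CPU_COST.items():
        shares = cost_shares(costs)
        parts = ", ".join(
            f"{name}: {share:.1%}" for name, share in zip(OPERATORS, shares)
        )
        print(f"{label}: {parts}")
```

Under this tabulation, Fulltext Match accounts for well over nine tenths of the reported CPU cost in every query, and Nested Loops is the smallest contributor in each case, which is consistent with the conclusions drawn from the charts.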
Nevertheless, CPU performance can be improved if the operations involved in the internal process are reduced in some way. This relies heavily on the underlying query expansion algorithm. The traditional algorithms are proven but do not necessarily result in better performance. There is a need for a better query expansion algorithm.

For MySQL, applying query expansion adds approximately 0.5 ms to the process of full-text initialization (0.603 ms versus 0.094 ms). For tables with significantly more rows, this time difference would have a large impact on overall query execution performance. Overall, the MySQL query execution time increases by about 53 ms when query expansion is applied.

The query performance results vary depending on the platform, the database, and the database tool. The results are not predictable, since these different factors affect the performance of the entire result set. When query expansion is enabled, the MySQL server takes more time to perform full-text initialization, and the search as a whole takes more than two and a half times as long (84 ms versus 31 ms) as it does when query expansion is disabled. This is because of the process of query expansion, which needs full-text initialization to expand the source query. This is why database users and application developers are advised to use query expansion only when needed.

Chapter 9

FUTURE WORK

9.1 Future work

This project includes performance analysis for QE among various database tools and the Google web search engine. It can be expanded to include research on other web search engines such as Yahoo!, Bing, AltaVista, etc. For database tools, performance analysis is done on MS SQL Server 2008 and MySQL. Analysis might be extended to other database tools such as Oracle, IBM DB2, and so on. In this project, performance analysis is achieved by closely examining query execution time statistics and CPU cost.
There are other factors which can be analyzed, for example, IO cost, memory consumption, operator cost and so on.