Investigating Genetic Algorithms to Optimize the User Query in the Vector Space Model

Mohammad Othman Nassar, Amman Arab University, moanassar@yahoo.com
Feras Fares Al Mashagba, Amman Arab University, ferasfm79@yahoo.com
Eman Fares Al Mashagba, Irbid Private University, Emanfa71@yahoo.com
Amman, Jordan

Abstract: This study discusses the effectiveness of using different genetic algorithm (GA) approaches with different similarity measures (Cosine, Dice, Jaccard, Inner Product) in the vector space model (VSM), based on an Arabic data collection. Most of the work in this area has been carried out for English text; very little research has been carried out on Arabic text. The nature of Arabic text differs from that of English text, and preprocessing Arabic text is more challenging. For each similarity measure (Cosine, Dice, Jaccard, Inner Product) in the VSM we used and compared ten different GA approaches, based on different fitness functions, mutations, and crossover strategies, to find the best strategy and fitness function to use with each similarity measure when the data collection is in Arabic. Our results indicate that the GA approach that uses the one-point crossover operator, point mutation, and Inner Product similarity as a fitness function represents the best IR system in the VSM.

Keywords: information retrieval, vector space model, query optimization, genetic algorithms.

Introduction: Information retrieval (IR) can be defined as the study of how to determine and retrieve, from a corpus of stored information, the portions that are responsive to a particular query [1]. The vector space model, the Boolean model, the fuzzy sets model, and the probabilistic retrieval model are the major information retrieval models. These retrieval models are used to find the similarity between the query and the documents in order to retrieve the documents that reflect the query.
The vector space model has four similarity measures: Cosine, Dice, Jaccard, and Inner Product. The effectiveness of an IR system is evaluated using two measures: precision and recall. A genetic algorithm (GA) is an adaptive heuristic search algorithm premised on the evolutionary ideas of natural selection and genetics [3]. The GA approach is important because it can find global solutions to many problems, such as NP-hard problems and machine learning problems, and can also be used for evolving simple programs. In this paper we investigate the Cosine and Jaccard similarity measures; for each similarity measure we compare ten different genetic algorithm settings (different mutation techniques, different fitness functions, different crossover techniques) to optimize the user query. As a test bed we use an Arabic data collection composed of 242 documents and 59 queries, where the correct answer for each query is known in advance. This collection was used in many information retrieval studies, such as [2, 19, 20]. The difficulty of the Arabic language is due to its differences from the Indo-European languages. Those differences are discussed by many researchers [2, 20, 13, 14, 15]; among them are syntactic, morphological, and semantic differences. The differences become clearer when comparing Arabic to English: Arabic is sparser, which means that for the same text length, English words are repeated more often than Arabic words [14, 15]. This sparseness may negatively affect retrieval quality in Arabic [2, 20]. Other differences are related to the complexity of Arabic roots, to the existence of many written forms of the same letter, and to the marks associated with some letters that may change the meaning of two otherwise identical words.
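As an illustration of the two evaluation measures mentioned above, precision and recall can be computed over sets of retrieved and relevant document IDs (a minimal, set-based sketch that ignores ranking; not the authors' implementation):

```python
def precision_recall(retrieved, relevant):
    """Compute precision and recall for one query (set-based sketch)."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant documents actually retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall
```

For example, a system that retrieves documents {1, 2, 3, 4} when {2, 4, 6} are relevant scores a precision of 0.5 and a recall of 2/3.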
The uniqueness and special properties of the Arabic language, its differences from English and other languages, and the lack of similar studies in the literature motivated us to conduct a deep and rich comparative study based on an Arabic data collection. We use the same data collection as [20], which allows us to compare our results with theirs. Previous Studies: Using GAs in information retrieval systems to optimize the user query is not a new trend in information retrieval, and it will continue, because GAs are powerful and robust optimization techniques. Many studies have been conducted in the literature, such as [2, 4, 5, 6, 7, 9, 10, 11, 12, 18, 20]. The authors in [8, 4, 6] present many methods, all based on the VSM, including: the connectionist Hopfield network, the symbolic ID3/ID5R algorithms, evolution-based genetic algorithms, simulated annealing, neural networks, and genetic programming. They found that these techniques are promising in their ability to analyze user queries, identify users' information needs, and suggest alternatives for search. In [9, 11, 7, 5, 12] the VSM has been used; the authors tried to improve IR performance by creating different mutation probabilities, new crossover operations, and new fitness functions for the GA. Mercy and Naomie [10] propose a data fusion framework based on linear combinations of retrieval status values obtained from a Vector Space Model system and a Probability Model system. They used a genetic algorithm to find the best linear combination of weights assigned to the scores of the different retrieval systems to obtain the most optimal retrieval performance. Using GAs to improve the performance of Arabic information retrieval systems is rare in the literature.
In [17] the performance of an Arabic information retrieval system was improved using genetic algorithms based on the vector space model. The performance was enhanced through the use of an adaptive matching function obtained from a weighted combination of four similarity measures (Inner Product, Cosine, Jaccard, and Dice). Using GAs to improve the query in the vector space model was studied by [20] based on an Arabic data collection; the researchers created and compared different fitness functions, mutations, and crossover strategies to find the best strategy and fitness function to use with two similarity measures (Dice, Inner Product) in the VSM. This paper studies the remaining similarity measures (Cosine, Jaccard) in the VSM and compares them to the work of [20]. In [2] Nassar and his colleagues used GAs to improve the query in the Boolean model; they created different genetic algorithm settings to optimize the user query based on an Arabic collection.

Vector Space Model (VSM): In the VSM the documents and queries are represented as vectors in a multidimensional space whose dimensions are the terms. Lexical scanning is required to identify the terms; after that, an optional stemming process is applied to the words, and then the frequency of the resulting stems is computed. Finally, the query and document vectors are compared using different similarity measures (e.g. Cosine, Dice, Jaccard, Inner Product); Table 1 shows those similarity measures.

Cosine (binary): $sim(d, q) = \frac{|d \cap q|}{|d|^{1/2} \, |q|^{1/2}}$
Cosine (weighted): $sim(d_j, q) = \frac{\sum_{i=1}^{t} w_{i,j} w_{i,q}}{\left(\sum_{i=1}^{t} w_{i,j}^2\right)^{1/2} \left(\sum_{i=1}^{t} w_{i,q}^2\right)^{1/2}}$

Dice (binary): $sim(d, q) = \frac{2\,|d \cap q|}{|d| + |q|}$
Dice (weighted): $sim(d_j, q) = \frac{2 \sum_{i=1}^{t} w_{i,j} w_{i,q}}{\sum_{i=1}^{t} w_{i,j}^2 + \sum_{i=1}^{t} w_{i,q}^2}$

Jaccard (binary): $sim(d, q) = \frac{|d \cap q|}{|d| + |q| - |d \cap q|}$
Jaccard (weighted): $sim(d_j, q) = \frac{\sum_{i=1}^{t} w_{i,j} w_{i,q}}{\sum_{i=1}^{t} w_{i,j}^2 + \sum_{i=1}^{t} w_{i,q}^2 - \sum_{i=1}^{t} w_{i,j} w_{i,q}}$

Inner Product: $sim(d_i, q) = \sum_{k=1}^{t} d_{ik} q_k$

Table 1: Different Similarity Measures.
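The weighted forms of the four measures in Table 1 can be sketched in Python as follows (an illustrative implementation assuming dense, equal-length term-weight vectors; not the authors' code):

```python
import math

def inner_product(d, q):
    # Inner Product: plain dot product of the two weight vectors.
    return sum(wd * wq for wd, wq in zip(d, q))

def cosine(d, q):
    # Cosine: dot product normalized by the two vector lengths.
    den = math.sqrt(sum(w * w for w in d)) * math.sqrt(sum(w * w for w in q))
    return inner_product(d, q) / den if den else 0.0

def dice(d, q):
    # Dice: twice the dot product over the sum of squared weights.
    den = sum(w * w for w in d) + sum(w * w for w in q)
    return 2 * inner_product(d, q) / den if den else 0.0

def jaccard(d, q):
    # Jaccard: dot product over (sum of squares minus the dot product).
    dot = inner_product(d, q)
    den = sum(w * w for w in d) + sum(w * w for w in q) - dot
    return dot / den if den else 0.0
```

For example, with d = [1, 0, 1] and q = [1, 1, 0], the inner product is 1, cosine and Dice both give 0.5, and Jaccard gives 1/3.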
where $w_{i,j}$ and $w_{i,q}$ are the weights of the $i$-th term in document $j$ and in the query, respectively.

Genetic Algorithm (GA): The GA flowchart is illustrated in Figure 1. Genetic algorithm operations are used to generate new and better generations. As shown in Figure 1, the genetic algorithm operations include: 1) Reproduction: the fittest individuals are chosen based on the fitness function. 2) Crossover: exchanging the genes between the two individual chromosomes that are reproducing. There are many crossover strategies, such as n-point crossover [11], restricted crossover [7], uniform crossover [30], the fusion operator [7], and dissociated crossover [7]; for more details about the crossover strategies, see the related references. 3) Mutation: the process of randomly altering the genes in a particular chromosome. There are two types of mutation: a) point mutation, in which a single gene is changed; and b) chromosomal mutation, in which some number of genes is changed.

Figure 1: Flowchart for a Typical Genetic Algorithm (GA). [The flowchart shows: generate initial population, evaluate each individual, reproduction, crossover, mutation; if the stopping criteria are not met, repeat from the evaluation step, otherwise stop.]

GAs are characterized by five basic components [20]: 1) a chromosome representation of the feasible solutions to the optimization problem; 2) an initial population of feasible solutions; 3) a fitness function that evaluates each solution; 4) genetic operators that generate a new population from the existing population; and 5) control parameters such as population size, probabilities of the genetic operators, and number of generations.

Experiment: In this study we used an IR system based on the VSM that was built and implemented by Hanandeh [6] to handle the 242 Arabic abstracts collected from the Proceedings of the Saudi Arabian National Conference [16]. We follow the same procedure implemented by [20], which allows us to compare our results to theirs.
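The five GA components described above can be combined into a minimal GA loop. The sketch below is illustrative only: the population size, gene count, and generation limit are placeholder values, and `fitness` stands for any similarity-based fitness function returning non-negative scores (as the Cosine and Jaccard measures do for non-negative weights).

```python
import random

def run_ga(fitness, pop_size=15, n_genes=20, generations=50, pc=0.8, pm=0.7):
    # 1) + 2) Chromosome representation and initial population: binary vectors.
    pop = [[random.randint(0, 1) for _ in range(n_genes)]
           for _ in range(pop_size)]
    best = max(pop, key=fitness)
    for _ in range(generations):               # 5) stopping criterion: generation count
        # 3) Evaluate each individual; parents are selected with probability
        #    proportional to fitness (reproduction).
        fits = [fitness(c) for c in pop]
        total = sum(fits)
        weights = fits if total > 0 else None  # uniform fallback if all scores are 0
        nxt = []
        while len(nxt) < pop_size:
            a, b = random.choices(pop, weights=weights, k=2)
            # 4) Genetic operators: one-point crossover, then point mutation.
            child = a[:]
            if random.random() < pc:
                cut = random.randrange(1, n_genes)
                child = a[:cut] + b[cut:]
            if random.random() < pm:
                i = random.randrange(n_genes)
                child[i] = 1 - child[i]
            nxt.append(child)
        pop = nxt
        best = max(pop + [best], key=fitness)  # remember the best solution seen
    return best
```

As a sanity check, running the loop with `fitness=sum` steadily accumulates 1-genes, since chromosomes with more ones are selected more often.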
The significant terms are extracted from the relevant and irrelevant documents and then assigned weights. The binary weights of the terms form a query vector, and the query vector is then treated as a chromosome. Finally, the GA is applied to get an optimal or near-optimal query vector. After that, we compared the result of the GA approach with the result of the traditional IR system without a GA. The details of this study are the same as in [20], except that we used the Cosine and Jaccard similarity measures instead of the Dice and Inner Product similarity measures. The steps of this study are as follows: 1) Representation of the chromosomes. The chromosomes are represented as follows: Binary representation: the chromosomes use a binary representation and are converted to a real representation using a random function. Number of genes: we have the same number of genes as there are terms with non-zero weights in the query and the feedback documents. Chromosome size: the size of each chromosome equals the number of terms in the set (feedback documents + the query set). The query vector: the query is represented in binary. Term update: terms are modified by applying the random function to the term weights. GA approach: each GA approach receives an initial population of chromosomes corresponding to the top 15 documents retrieved by the traditional IR system for that query. 2) Fitness function. The fitness function is a performance measure, or reward function, that evaluates how good each solution is. In this study the Cosine and Jaccard similarity measures are used as fitness functions. 3) Selection. Chromosome selection depends on the fitness function: chromosomes with higher fitness values have a higher probability of being selected for the next generation. 4) Operators. We used two GA operators to produce offspring chromosomes: A. Crossover: its function is to mix two chromosomes together to form new offspring.
In this paper crossover occurs only with crossover probability Pc (Pc = 0.8). In this study five crossover strategies were used for the VSM: 1. One-point crossover operator. 2. Restricted crossover operator. 3. Uniform crossover operator. 4. Fusion operator. 5. Dissociated crossover. B. Mutation: the modification of the gene values of a solution with some probability Pm. In this experiment we used a mutation probability Pm = 0.7 and two different mutation strategies: 1. Point mutation. 2. Chromosomal mutation. Finally, based on the previous discussion, we created ten different GA strategies. Those strategies are used with each similarity measure (Cosine, Jaccard) and are as follows:
GA1: GA that uses the one-point crossover operator and point mutation.
GA2: GA that uses the one-point crossover operator and chromosomal mutation.
GA3: GA that uses the restricted crossover operator and point mutation.
GA4: GA that uses the restricted crossover operator and chromosomal mutation.
GA5: GA that uses the uniform crossover operator and point mutation.
GA6: GA that uses the uniform crossover operator and chromosomal mutation.
GA7: GA that uses the fusion operator and point mutation.
GA8: GA that uses the fusion operator and chromosomal mutation.
GA9: GA that uses dissociated crossover and point mutation.
GA10: GA that uses dissociated crossover and chromosomal mutation.
GA Strategies Using Cosine Similarity: The results for the GA strategies using Cosine similarity are shown in Table 2 and Table 3. From those tables we notice that GA1, GA2, GA4, GA5, GA8, GA9, and GA10 improve on the traditional IR system, by 12.4245%, 6.959051%, 7.394054%, 5.40995%, 7.982538%, 7.255469%, and 4.530111% respectively, while GA3, GA6, and GA7 perform worse than the traditional IR system, with -1.36021%, -2.44788%, and -1.26468% respectively. This means that GA1, which uses the one-point crossover operator and point mutation, gives the highest improvement over the traditional approach, at 12.4245%.
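As an illustration of the operators behind the ten strategies above, the one-point crossover and the two mutation types can be sketched as follows (hypothetical Python, not the authors' implementation; the restricted, uniform, fusion, and dissociated crossovers follow the same pattern with different gene-exchange rules):

```python
import random

def one_point_crossover(a, b):
    # Cut both parents at one random position and swap the tails,
    # producing two offspring.
    cut = random.randrange(1, len(a))
    return a[:cut] + b[cut:], b[:cut] + a[cut:]

def point_mutation(chrom):
    # Point mutation: flip exactly one randomly chosen binary gene.
    c = chrom[:]
    i = random.randrange(len(c))
    c[i] = 1 - c[i]
    return c

def chromosomal_mutation(chrom, k=3):
    # Chromosomal mutation: flip some number (here k) of distinct genes.
    c = chrom[:]
    for i in random.sample(range(len(c)), k):
        c[i] = 1 - c[i]
    return c
```

In the experiment above, an offspring undergoes crossover with probability Pc = 0.8 and mutation with probability Pm = 0.7; GA1, for example, pairs `one_point_crossover` with `point_mutation`.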
Recall | Cosine | GA1   | GA2   | GA3   | GA4   | GA5   | GA6   | GA7   | GA8   | GA9   | GA10
0.1    | 0.132  | 0.165 | 0.151 | 0.133 | 0.135 | 0.150 | 0.133 | 0.130 | 0.135 | 0.137 | 0.141
0.2    | 0.140  | 0.164 | 0.157 | 0.135 | 0.160 | 0.166 | 0.141 | 0.138 | 0.162 | 0.163 | 0.151
0.3    | 0.147  | 0.182 | 0.165 | 0.142 | 0.175 | 0.151 | 0.144 | 0.150 | 0.179 | 0.164 | 0.152
0.4    | 0.151  | 0.166 | 0.167 | 0.149 | 0.161 | 0.149 | 0.150 | 0.146 | 0.167 | 0.167 | 0.159
0.5    | 0.156  | 0.179 | 0.172 | 0.153 | 0.178 | 0.172 | 0.152 | 0.152 | 0.177 | 0.179 | 0.171
0.6    | 0.178  | 0.191 | 0.180 | 0.172 | 0.188 | 0.181 | 0.164 | 0.176 | 0.188 | 0.187 | 0.179
0.7    | 0.183  | 0.193 | 0.181 | 0.181 | 0.193 | 0.181 | 0.181 | 0.179 | 0.189 | 0.188 | 0.190
0.8    | 0.234  | 0.244 | 0.239 | 0.236 | 0.231 | 0.241 | 0.222 | 0.230 | 0.231 | 0.232 | 0.240
0.9    | 0.241  | 0.251 | 0.243 | 0.243 | 0.242 | 0.244 | 0.231 | 0.242 | 0.242 | 0.244 | 0.243
Avg    | 0.174  | 0.193 | 0.184 | 0.172 | 0.185 | 0.182 | 0.169 | 0.171 | 0.186 | 0.185 | 0.181
Table 2: Average Recall and Precision Values for the 59 Queries by Applying GAs on Cosine Similarity.

Recall | GA1      | GA2      | GA3      | GA4      | GA5      | GA6      | GA7      | GA8      | GA9      | GA10
0.1    | 25.0     | 14.39394 | 0.757576 | 2.272727 | 13.63636 | 0.757576 | -1.51515 | 2.272727 | 3.787879 | 6.818182
0.2    | 17.14286 | 12.14286 | -3.57143 | 14.28571 | 18.57143 | 0.714286 | -1.42857 | 15.71429 | 16.42857 | 7.857143
0.3    | 23.80952 | 12.2449  | -3.40136 | 19.04762 | 2.721088 | -2.04082 | 2.040816 | 21.76871 | 11.56463 | 3.401361
0.4    | 9.933775 | 10.59603 | -1.3245  | 6.622517 | -1.3245  | -0.66225 | -3.31126 | 10.59603 | 10.59603 | 5.298013
0.5    | 14.74359 | 10.25641 | -1.92308 | 14.10256 | 10.25641 | -2.5641  | -2.5641  | 13.46154 | 14.74359 | 9.615385
0.6    | 7.303371 | 1.123596 | -3.37079 | 5.617978 | 1.685393 | -7.86517 | -1.1236  | 5.617978 | 5.05618  | 0.561798
0.7    | 5.464481 | -1.0929  | -1.0929  | 5.464481 | -1.0929  | -1.0929  | -2.18579 | 3.278689 | 2.73224  | 3.825137
0.8    | 4.273504 | 2.136752 | 0.854701 | -1.28205 | 2.991453 | -5.12821 | -1.7094  | -1.28205 | -0.8547  | 2.564103
0.9    | 4.149378 | 0.829876 | 0.829876 | 0.414938 | 1.244813 | -4.14938 | 0.414938 | 0.414938 | 1.244813 | 0.829876
Avg    | 12.4245  | 6.959051 | -1.36021 | 7.394054 | 5.40995  | -2.44788 | -1.26468 | 7.982538 | 7.255469 | 4.530111
Table 3: GAs' Improvement in Cosine Similarity (GA Improvement %).
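The percentages in Table 3 (and later in Table 5) are relative precision gains of each GA over the traditional IR baseline at each recall level, which can be sketched as:

```python
def improvement_pct(ga_precision, baseline_precision):
    # Relative improvement (%) of a GA run over the traditional IR baseline.
    return (ga_precision - baseline_precision) / baseline_precision * 100
```

For example, GA1's precision of 0.164 at recall 0.2 against the baseline's 0.140 gives an improvement of about 17.14%, matching Table 3.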
GA Strategies Using Jaccard Similarity: The results for the GA strategies using the Jaccard similarity are shown in Table 4 and Table 5. From those tables we notice that GA1, GA2, GA4, GA5, GA8, GA9, and GA10 improve on the traditional IR system, by 3.790779%, 12.47687%, 7.651593%, 8.639001%, 7.81104%, 7.913123%, and 9.920652% respectively, while GA3, GA6, and GA7 perform worse than the traditional IR system, with -1.20423%, -1.72545%, and -3.85975% respectively. This means that GA2, which uses the one-point crossover operator and chromosomal mutation, gives the highest improvement over the traditional approach, at 12.47687%.

Recall | Jaccard  | GA1      | GA2      | GA3      | GA4      | GA5      | GA6      | GA7   | GA8   | GA9   | GA10
0.1    | 0.130    | 0.134    | 0.141    | 0.133    | 0.137    | 0.141    | 0.129    | 0.122 | 0.142 | 0.139 | 0.141
0.2    | 0.170    | 0.176    | 0.199    | 0.165    | 0.182    | 0.184    | 0.165    | 0.162 | 0.182 | 0.185 | 0.191
0.3    | 0.261    | 0.271    | 0.288    | 0.243    | 0.277    | 0.281    | 0.256    | 0.254 | 0.271 | 0.274 | 0.280
0.4    | 0.213    | 0.222    | 0.277    | 0.211    | 0.266    | 0.269    | 0.214    | 0.211 | 0.271 | 0.277 | 0.278
0.5    | 0.355    | 0.377    | 0.387    | 0.342    | 0.373    | 0.375    | 0.345    | 0.333 | 0.373 | 0.377 | 0.385
0.6    | 0.335    | 0.343    | 0.401    | 0.341    | 0.380    | 0.384    | 0.323    | 0.311 | 0.382 | 0.381 | 0.386
0.7    | 0.385    | 0.399    | 0.398    | 0.381    | 0.385    | 0.387    | 0.371    | 0.362 | 0.382 | 0.359 | 0.381
0.8    | 0.389    | 0.401    | 0.415    | 0.392    | 0.406    | 0.410    | 0.385    | 0.389 | 0.407 | 0.411 | 0.414
0.9    | 0.434    | 0.452    | 0.467    | 0.433    | 0.445    | 0.438    | 0.437    | 0.430 | 0.434 | 0.441 | 0.441
Avg    | 0.296889 | 0.308333 | 0.330333 | 0.293444 | 0.316778 | 0.318778 | 0.291667 | 0.286 | 0.316 | 0.316 | 0.321889
Table 4: Average Recall and Precision Values for the 59 Queries by Applying GAs on Jaccard Similarity.
Recall | GA1      | GA2      | GA3      | GA4      | GA5      | GA6      | GA7      | GA8      | GA9      | GA10
0.1    | 3.076923 | 8.461538 | 2.307692 | 5.384615 | 8.461538 | -0.76923 | -6.15385 | 9.230769 | 6.923077 | 8.461538
0.2    | 3.529412 | 17.05882 | -2.94118 | 7.058824 | 8.235294 | -2.94118 | -4.70588 | 7.058824 | 8.823529 | 12.35294
0.3    | 3.831418 | 10.34483 | -6.89655 | 6.130268 | 7.662835 | -1.91571 | -2.68199 | 3.831418 | 4.980843 | 7.279693
0.4    | 4.225352 | 30.04695 | -0.93897 | 24.88263 | 26.29108 | 0.469484 | -0.93897 | 27.23005 | 30.04695 | 30.51643
0.5    | 6.197183 | 9.014085 | -3.66197 | 5.070423 | 5.633803 | -2.8169  | -6.19718 | 5.070423 | 6.197183 | 8.450704
0.6    | 2.38806  | 19.70149 | 1.791045 | 13.43284 | 14.62687 | -3.58209 | -7.16418 | 14.02985 | 13.73134 | 15.22388
0.7    | 3.636364 | 3.376623 | -1.03896 | 0        | 0.519481 | -3.63636 | -5.97403 | -0.77922 | -6.75325 | -1.03896
0.8    | 3.084833 | 6.683805 | 0.771208 | 4.37018  | 5.398458 | -1.02828 | 0        | 4.627249 | 5.655527 | 6.426735
0.9    | 4.147465 | 7.603687 | -0.23041 | 2.534562 | 0.921659 | 0.691244 | -0.92166 | 0        | 1.612903 | 1.612903
Avg    | 3.790779 | 12.47687 | -1.20423 | 7.651593 | 8.639001 | -1.72545 | -3.85975 | 7.81104  | 7.913123 | 9.920652
Table 5: GAs' Improvement in Jaccard Similarity (GA Improvement %).

Comparison Between the Best GA Strategies: To create a detailed and useful comparison, we take the results for Dice and Inner Product from [20] and put them together with our results for Jaccard and Cosine. Table 6 and Figure 2 show the comparison between Cosine (GA1), Jaccard (GA2), Dice (GA9), and Inner Product (GA1); note that only the best GA strategy for each similarity measure (Cosine, Dice, Jaccard, Inner Product) in the VSM is used. From this table we notice that Inner Product (GA1) is the best strategy, outperforming Cosine (GA1), Jaccard (GA2), and Dice (GA9). This means that Inner Product (GA1), which uses the one-point crossover operator, point mutation, and Inner Product similarity as a fitness function, represents the best IR system in the VSM to be used with the Arabic data collection. Figure 2 presents the same data as Table 6.
Recall | Cosine (GA1) | Jaccard (GA2) | Dice (GA9) | Inner Product (GA1)
0.1    | 0.165        | 0.141         | 0.141      | 0.146
0.2    | 0.164        | 0.199         | 0.197      | 0.208
0.3    | 0.182        | 0.288         | 0.298      | 0.301
0.4    | 0.166        | 0.277         | 0.277      | 0.283
0.5    | 0.179        | 0.387         | 0.402      | 0.405
0.6    | 0.191        | 0.401         | 0.408      | 0.409
0.7    | 0.193        | 0.398         | 0.396      | 0.413
0.8    | 0.244        | 0.415         | 0.412      | 0.437
0.9    | 0.251        | 0.467         | 0.441      | 0.487
Avg    | 0.193        | 0.330333      | 0.330222   | 0.343222
Table 6: Comparison Between the Best GA Strategies (One per Similarity Measure).

Figure 2: Comparison Between the Best GA Strategies (One per Similarity Measure). [The figure plots the precision values of Table 6 against the nine recall levels.]

Conclusions: For each similarity measure (Cosine, Dice, Jaccard, Inner Product) in the VSM we compared ten different GA approaches. By calculating the improvement of each approach over the traditional IR system, we noticed that most approaches (GA1, GA2, GA4, GA5, GA8, GA9, and GA10) improved on the traditional IR system. We also noticed that the GA approach that uses the one-point crossover operator, point mutation, and Inner Product similarity as a fitness function represents the best IR system in the VSM to be used with Arabic data collections, with improvements over the traditional approach ranging from 5.626% to 28.0543%.

References:
[1] Tengku M. T. Sembok and C. J. van Rijsbergen, "A simple logical-linguistic document retrieval system", Information Processing & Management, Volume 26, Issue 1, pp. 111-134, 1990.
[2] Mohammad Othman Nassar, Feras Al Mashagba, and Eman Al Mashagba, "Improving the User Query for the Boolean Model Using Genetic Algorithms", International Journal of Computer Science Issues (IJCSI), ISSN (online): 1694-0814, Volume 8, Issue 5, September 2011.
[3] Goldberg, D. E., Genetic Algorithms in Search, Optimization and Machine Learning, Addison-Wesley, 1989.
[4] Hsinchun C., "Machine Learning for Information Retrieval: Neural Networks, Symbolic Learning, and Genetic Algorithms", Journal of the American Society for Information Science, Volume 46, Issue 3, April 1995.
[5] D. Vrajitoru, "Crossover improvement for the genetic algorithm in information retrieval", Information Processing & Management, 34(4), pp. 405-415, 1998.
[6] Hsinchun C., Ganesan S., Linlin S., "A Machine Learning Approach to Inductive Query by Examples: An Experiment Using Relevance Feedback, ID3, Genetic Algorithms, and Simulated Annealing", Journal of the American Society for Information Science, 49(8):693-705, 1998.
[7] Vicente P., Cristina P., "Order-Based Fitness Functions for Genetic Algorithms Applied to Relevance Feedback", Journal of the American Society for Information Science and Technology, 54(2):152-160, 2003.
[8] Andrew T., "An Artificial Intelligence Approach to Information Retrieval", Information Processing and Management, 40(4):619-632, 2004.
[9] Rocio C., Carlos Lorenzetti, Ana M., Nelida B., "Genetic Algorithms for Topical Web Search: A Study of Different Mutation Rates", ACM Transactions on Internet Technology, 4(4):378-419, 2005.
[10] Mercy T., Naomie S., "A Framework for Genetic-Based Fusion of Similarity Measures in Chemical Compound Retrieval", International Symposium on Bio-Inspired Computing, Puteri Pan Pacific Hotel, Johor Bahru, 5-7 September 2005.
[11] Ahmed A. A. Radwan, Bahgat A. Abdel Latef, Abdel Mgeid A. Ali, Osman A. Sadek, "Using Genetic Algorithm to Improve Information Retrieval Systems", Proceedings of the World Academy of Science, Engineering and Technology, Volume 17, ISSN 1307-6884, 2006.
[12] Abdelmgeid A., "Applying Genetic Algorithm in Query Improvement Problem", International Journal "Information Technologies and Knowledge", Vol. 1, pp. 309-316, 2007.
[13] Khoja, S., "APT: Arabic part-of-speech tagger", Proceedings of the Student Workshop at the Second Meeting of the North American Chapter of the Association for Computational Linguistics (NAACL 2001), Pittsburgh, Pennsylvania, pp. 20-26, 2001.
[14] Yahaya, A., "On the Complexity of the Initial Stage of Arabic Text Processing", First Great Lakes Computer Science Conference, Kalamazoo, Michigan, USA, October 1989.
[15] Goweder, A., De Roeck, A., "Assessment of a Significant Arabic Corpus", Arabic Natural Language Processing Workshop (ACL 2001), Toulouse, France. Downloaded from: (http://www.elsnet.org/acl2001 arabic.html).
[16] I. Hmedi, G. Kanaan, and M. Evens, "Design and implementation of automatic indexing for information retrieval with Arabic documents", Journal of the American Society for Information Science, Volume 48, Issue 10, pp. 867-881, 1997.
[17] Bassam Al-Shargabi, Islam Amro, and Ghassan Kanaan, "Exploit Genetic Algorithm to Enhance Arabic Information Retrieval", 3rd International Conference on Arabic Language Processing (CITALA'09), Rabat, Morocco, pp. 37-41, 2009.
[18] Fatemeh Dashti and Solmaz Abdollahi Zad, "Optimizing the data search results in web using Genetic Algorithm", International Journal of Advanced Engineering and Technologies, Vol. 1, Issue No. 1, pp. 016-022, ISSN: 2230-781, 2010.
[19] Mohammad Othman Nassar, Ghassan Kanaan, and Hussain A. H. Awad, "Comparison between different global weighting schemes", Lecture Notes in Engineering and Computer Science, ISSN: 2078-0966 (online), 2078-0958 (print), Volume 2180, Issue 1, pp. 690-692, 2010, published by Newswood Limited.
[20] Eman Al Mashagba, Feras Al Mashagba, and Mohammad Othman Nassar, "Query Optimization Using Genetic Algorithms in the Vector Space Model", International Journal of Computer Science Issues (IJCSI), ISSN (online): 1694-0814, Volume 8, Issue 5, September 2011.