APPLIED GENETIC ALGORITHMS IN INFORMATION RETRIEVAL

BANGORN KLABBANKOH
Faculty of Information Technology
King Mongkut’s Institute of Technology Ladkrabang
Ladkrabang, Bangkok 10520
Tel. (02) 7372551-4 (EXT: 802) Fax. 3269074
E-Mail: S0067034@kmitl.ac.th

OUEN PINNGERN, PH.D.
Department of Computer Engineering, Faculty of Engineering
King Mongkut’s Institute of Technology Ladkrabang
Ladkrabang, Bangkok 10520
Tel. (02) 3269969
E-Mail: kpouen@kmitl.ac.th

Abstract: This article presents an online information retrieval system that uses genetic algorithms to increase retrieval efficiency. Under the vector space model, information retrieval is based on the similarity measured between the query and the documents. Documents with high similarity to the query are judged more relevant to it and should be retrieved first. Under genetic algorithms, each query is represented by a chromosome. These chromosomes are fed into the genetic operator process (selection, crossover, and mutation) until we obtain an optimized query chromosome for document retrieval. Our test results show that retrieval with a crossover probability of 0.8 and a mutation probability of 0.01 gives the highest precision, while a crossover probability of 0.8 and a mutation probability of 0.3 gives the highest recall.

1. INTRODUCTION

Genetic algorithms (GA’s) are probabilistic search methods developed by John Holland in 1975 [1][2]. GA’s apply natural selection and natural genetics in artificial intelligence to find the globally optimal solution to an optimization problem among the feasible solutions. GA’s have been applied to many domains, including timetabling, scheduling, robot control, signature verification, image processing, packing, routing, pipeline control systems, machine learning, and information retrieval [3][5].

2. PRINCIPLE OF GENETIC ALGORITHMS

2.1 BASIC PRINCIPLES

GA’s are characterized by 5 basic components, as follows:
1) A chromosome representation for the feasible solutions to the optimization problem.
2) An initial population of feasible solutions.
3) A fitness function that evaluates each solution.
4) Genetic operators that generate a new population from the existing population.
5) Control parameters such as the population size, the probabilities of the genetic operators, and the number of generations.

2.2 PROCESS OF GENETIC ALGORITHMS

A GA is an iterative procedure that maintains a constant-size population of feasible solutions. During each iteration step, called a generation, the fitness of the current population is evaluated and chromosomes are selected based on their fitness values. The chromosomes with higher fitness are selected for reproduction under the action of crossover and mutation to form a new population. The chromosomes with lower fitness are eliminated. The new population is then evaluated, selected, and fed into the genetic operator process again until we obtain an optimal solution (see Fig. 1).

3. ONLINE INFORMATION RETRIEVAL USING GENETIC ALGORITHMS

3.1 CHROMOSOME REPRESENTATION

Online information retrieval using genetic algorithms is based on the vector space model. Within this model, both documents and queries are represented by vectors. A particular document is represented by a vector of terms, and a particular query is represented by a vector of query terms.

FIGURE 1 THE PROCESS OF GENETIC ALGORITHMS
(Flowchart: Generate Initial Population, Assess Initial Population, Select Population, Crossover New Population, Mutate New Population, Assess New Population, Terminate Search? If No, return to Select Population; if Yes, Stop.)

A document vector (Doc) with n keywords and a query vector with m query terms can be represented as

$Doc = (term_1, term_2, term_3, \ldots, term_n)$
$Query = (qterm_1, qterm_2, qterm_3, \ldots, qterm_m)$

We use binary term vectors, so each $term_i$ (or $qterm_j$) is either 0 or 1.
The value $term_i$ is set to zero when term i is not present in the document and to one when term i is present. For example, suppose a user enters a query for which our system retrieves 5 documents. These documents are

Doc1 = {Relational Databases, Query, Data Retrieval, Computer Networks, DBMS}
Doc2 = {Artificial Intelligence, Internet, Indexing, Natural Language Processing}
Doc3 = {Databases, Expert System, Information Retrieval System, Multimedia}
Doc4 = {Fuzzy Logic, Neural Network, Computer Networks}
Doc5 = {Object-Oriented, DBMS, Query, Indexing}

All keywords of these documents can be arranged in ascending order as

Artificial Intelligence, Computer Networks, Data Retrieval, Databases, DBMS, Expert System, Fuzzy Logic, Indexing, Information Retrieval System, Internet, Multimedia, Natural Language Processing, Neural Network, Object-Oriented, Query, Relational Databases

The documents are then encoded in the chromosome representation as

Doc1 = 0 1 1 0 1 0 0 0 0 0 0 0 0 0 1 1
Doc2 = 1 0 0 0 0 0 0 1 0 1 0 1 0 0 0 0
Doc3 = 0 0 0 1 0 1 0 0 1 0 1 0 0 0 0 0
Doc4 = 0 1 0 0 0 0 1 0 0 0 0 0 1 0 0 0
Doc5 = 0 0 0 0 1 0 0 1 0 0 0 0 0 1 1 0

These chromosomes form the initial population that is fed into the genetic operator process. The length of a chromosome depends on the number of keywords in the documents retrieved by the user query. In our example, each chromosome is 16 bits long.

3.2 FITNESS EVALUATION

The fitness function is a performance measure, or reward function, that evaluates how good each solution is. The information retrieval problem is how to retrieve the documents the user requires. The similarity measures in Table 1 can serve as fitness functions, because each one measures how close a document is to the query. Table 1 gives two forms of each fitness function: one for weighted term vectors and one for binary term vectors.
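
As a minimal sketch of the encoding in Section 3.1 and of the binary-vector similarity measures used as fitness functions, the following Python fragment rebuilds the 16-bit chromosomes of the example and scores them against a query. The function names, the case-insensitive sort, and the sample query {DBMS, Query} are assumptions made for illustration; they are not taken from the authors' system.

```python
from math import sqrt

# Keyword sets copied from the example in Section 3.1.
DOCS = {
    "Doc1": {"Relational Databases", "Query", "Data Retrieval",
             "Computer Networks", "DBMS"},
    "Doc2": {"Artificial Intelligence", "Internet", "Indexing",
             "Natural Language Processing"},
    "Doc3": {"Databases", "Expert System", "Information Retrieval System",
             "Multimedia"},
    "Doc4": {"Fuzzy Logic", "Neural Network", "Computer Networks"},
    "Doc5": {"Object-Oriented", "DBMS", "Query", "Indexing"},
}


def build_vocabulary(docs):
    """Collect every keyword and sort in ascending order, ignoring case."""
    return sorted(set().union(*docs.values()), key=str.lower)


def encode(keywords, vocabulary):
    """Binary chromosome: 1 where the keyword is present, 0 otherwise."""
    return [1 if term in keywords else 0 for term in vocabulary]


def overlap(x, y):
    """Number of terms present in both binary vectors."""
    return sum(a & b for a, b in zip(x, y))


def dice(x, y):
    denom = sum(x) + sum(y)
    return 2 * overlap(x, y) / denom if denom else 0.0


def cosine(x, y):
    denom = sqrt(sum(x)) * sqrt(sum(y))
    return overlap(x, y) / denom if denom else 0.0


def jaccard(x, y):
    common = overlap(x, y)
    denom = sum(x) + sum(y) - common
    return common / denom if denom else 0.0


VOCAB = build_vocabulary(DOCS)
CHROMOSOMES = {name: encode(kw, VOCAB) for name, kw in DOCS.items()}

# Hypothetical query, chosen only to illustrate the scoring.
QUERY = encode({"DBMS", "Query"}, VOCAB)

if __name__ == "__main__":
    for name, chrom in CHROMOSOMES.items():
        bits = "".join(map(str, chrom))
        print(name, bits, round(jaccard(chrom, QUERY), 3))
```

With this sample query, Doc5 shares two of its four keywords with the query, so its Jaccard value is 2 / (4 + 2 - 2) = 0.5, while Doc4 shares none and scores 0.
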
We define $X = (x_1, x_2, x_3, \ldots, x_t)$ and $Y = (y_1, y_2, y_3, \ldots, y_t)$, where $|X|$ is the number of terms that occur in X and $|X \cap Y|$ is the number of terms that occur in both X and Y [6].

TABLE 1 FITNESS FUNCTIONS (similarity measures Sim(X, Y))

Dice coefficient
  Binary term vectors:   $\frac{2\,|X \cap Y|}{|X| + |Y|}$
  Weighted term vectors: $\frac{2 \sum_{i=1}^{t} x_i y_i}{\sum_{i=1}^{t} x_i^2 + \sum_{i=1}^{t} y_i^2}$

Cosine coefficient
  Binary term vectors:   $\frac{|X \cap Y|}{|X|^{1/2} \cdot |Y|^{1/2}}$
  Weighted term vectors: $\frac{\sum_{i=1}^{t} x_i y_i}{\sqrt{\sum_{i=1}^{t} x_i^2 \cdot \sum_{i=1}^{t} y_i^2}}$

Jaccard coefficient
  Binary term vectors:   $\frac{|X \cap Y|}{|X| + |Y| - |X \cap Y|}$
  Weighted term vectors: $\frac{\sum_{i=1}^{t} x_i y_i}{\sum_{i=1}^{t} x_i^2 + \sum_{i=1}^{t} y_i^2 - \sum_{i=1}^{t} x_i y_i}$

The values of these fitness functions lie in the interval from 0 to 1. A value of 1.0 means that the document and the query are identical. Values near 1.0 mean that the document and the query are more relevant to each other, and values near 0.0 mean that they are less relevant. The value returned by a fitness function is called the “fitness”.

3.3 SELECTION

After we evaluate the fitness of the population, the next step is chromosome selection. Selection embodies the principle of ‘survival of the fittest’. Chromosomes with satisfactory fitness are selected for reproduction. Chromosomes with lower fitness may be selected only a few times or not at all.

3.4 CROSSOVER

Crossover is the genetic operator that mixes two chromosomes together to form new offspring. Crossover occurs only with some probability (the crossover probability). Chromosomes that are not subjected to crossover remain unmodified. The intuition behind crossover is the exploration of new solutions and the exploitation of old solutions. GA’s construct a better solution by mixing the good characteristics of chromosomes together. Chromosomes with higher fitness have a greater chance of being selected than those with lower fitness, so good solutions tend to survive into the next generation.

The crossover technique exchanges the segment between two cut points. For example, two chromosomes are crossed over between positions 5 and 11.

101111110011101
100110011110000

The resulting crossover yields two new chromosomes.

101110011111101
100111110010000

3.5 MUTATION

Mutation modifies the value of each gene of a solution with some probability (the mutation probability).
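
The segment exchange in the crossover example of Section 3.4 and the per-gene mutation just defined can be sketched in Python as follows. This is an illustrative sketch, not the authors' implementation: the function names are hypothetical, and the random choice of the two cut points in `reproduce` is an assumption, since the paper does not say how they are chosen.

```python
import random


def two_point_crossover(parent_a, parent_b, first, second):
    """Swap the genes at 1-based positions first+1 .. second."""
    child_a = parent_a[:first] + parent_b[first:second] + parent_a[second:]
    child_b = parent_b[:first] + parent_a[first:second] + parent_b[second:]
    return child_a, child_b


def mutate(chromosome, pm, rng):
    """Flip each gene independently with probability pm."""
    return "".join(
        ("1" if gene == "0" else "0") if rng.random() < pm else gene
        for gene in chromosome
    )


def reproduce(parent_a, parent_b, pc, pm, rng):
    """Cross over with probability pc, then mutate both offspring."""
    if rng.random() < pc:
        # Assumption: the two cut points are drawn uniformly at random.
        first, second = sorted(rng.sample(range(1, len(parent_a)), 2))
        parent_a, parent_b = two_point_crossover(
            parent_a, parent_b, first, second
        )
    return mutate(parent_a, pm, rng), mutate(parent_b, pm, rng)


if __name__ == "__main__":
    # Cut points 5 and 11 reproduce the offspring of Section 3.4.
    print(two_point_crossover("101111110011101", "100110011110000", 5, 11))
    # One reproduction step with Pc = 0.8 and Pm = 0.01.
    rng = random.Random(0)
    print(reproduce("101111110011101", "100110011110000", 0.8, 0.01, rng))
```

Passing an explicit `random.Random` instance keeps a run repeatable, which helps when comparing different mutation probabilities on the same initial population.
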
Changing some bit values of a chromosome produces a different individual. The mutated chromosome may be better or poorer than the old one. If it is poorer, it is eliminated in the selection step. The objective of mutation is to restore lost gene values and to explore a wider variety of solutions. For example, randomly mutating the chromosome at position 10:

101111110011101

Result

101111110111101

3.6 PROCESS OF OUR SYSTEM

1. The user enters a query into our system.
2. Keywords from the user query are matched against the list of keywords.
3. The documents retrieved by the user query are encoded as chromosomes (the initial population).
4. The population is fed into the genetic operator process: selection, crossover, and mutation.
5. Step 4 is repeated until the maximum generation is reached. This gives an optimized query chromosome for document retrieval.
6. The optimized query chromosome is decoded into a query, and documents are retrieved from the database.

4. EXPERIMENTATION

4.1 TEST CASE FORMULATION

The experiment tests 21 queries with 3 different fitness functions: the Jaccard coefficient (F1), the cosine coefficient (F2), and the Dice coefficient (F3). Each fitness function is tested with a set of parameters, namely the probability of crossover (Pc = 0.8) and the probability of mutation (Pm = 0.01, 0.10, 0.30), to compare the efficiency of the retrieval system. Retrieval efficiency is measured by recall and precision. Recall is defined as the proportion of relevant documents that are retrieved (see equation 1) [4][6].

$$\text{Recall} = \frac{\text{Number of documents retrieved and relevant}}{\text{Total relevant in collection}} \qquad (1)$$

Precision is defined as the proportion of retrieved documents that are relevant (see equation 2) [4][6].

$$\text{Precision} = \frac{\text{Number of documents retrieved and relevant}}{\text{Total retrieved}} \qquad (2)$$

The test database consisted of 343 documents taken from student projects of the Faculty of Information Technology, King Mongkut’s Institute of Technology Ladkrabang.

TABLE 2. INFORMATION RETRIEVAL BY 3 FITNESS FUNCTIONS WITH PC = 0.8 AND PM = 0.01

Keywords     Query Chromosome                          F1    F2    F3    RetRel  RetNRel
application  00100000000000000000000000001100          0.84  0.91  0.90  30      1
database     0001000000000000000000000000000010000100  0.59  0.65  0.65  34      8
DNS          0011001001                                1.00  1.00  1.00  6       2
internet     00000000000010000000000001                0.76  0.86  0.84  41
marketing    0110110                                   1.00  1.00  1.00  11      8
recognition  11000                                     0.71  0.75  0.74  7
security     000100100                                 1.00  1.00  1.00  17      57
network      0000100000000010000000                    1.00  1.00  1.00  78      21

4.2 EXPERIMENT RESULTS

Preliminary testing indicated the following:

1. The three fitness functions produce the same optimized queries but different fitness values (F1, F2, and F3), as shown in Table 2. In Table 2, RetRel is the number of retrieved relevant documents and RetNRel is the number of retrieved but not relevant documents.
2. Information retrieval with Pc = 0.8 and Pm = 0.01 yields the highest precision, 0.746, while Pm = 0.10 yields a moderate precision of 0.560 and Pm = 0.30 yields the lowest precision, 0.417, as shown in Figure 2.
3. Information retrieval with Pc = 0.8 and Pm = 0.30 yields the highest recall, 0.976, while Pm = 0.01 yields a moderate recall and Pm = 0.10 yields the lowest recall, 0.786, as shown in Figure 2.

FIG. 2 PRECISION AND RECALL
(Plot of efficiency, from 0.4 to 1, against Pmutation, from 0 to 0.3, with one curve for precision and one for recall.)

5. CONCLUSIONS

The preliminary experiment indicates that precision and recall are inversely related. Which parameters to use depends on what the user wants to retrieve. When high precision is preferred, the parameters should be a high crossover probability and a low mutation probability.
When more of the relevant documents (high recall) are preferred, the parameters should be a high mutation probability and a lower crossover probability. The preliminary experiment also indicates that GA’s can be used in information retrieval. Further study will test larger databases and will rank the retrieved documents by their fitness values, which represent what the user wants.

REFERENCES

[1] Davis, L. Handbook of Genetic Algorithms. New York: Van Nostrand Reinhold. 1991.
[2] Goldberg, D.E. Genetic Algorithms in Search, Optimization, and Machine Learning. New York: Addison-Wesley Publishing Co. Inc. 1989.
[3] Kraft, D.H. et al. “The Use of Genetic Programming to Build Queries for Information Retrieval.” In Proceedings of the First IEEE Conference on Evolutionary Computation. New York: IEEE Press. 1994. pp. 468-473.
[4] Korfhage, R.R. Information Storage and Retrieval. New York: Wiley Computer Publishing. 1997.
[5] Martin-Bautista, M.J. et al. “An Approach to An Adaptive Information Retrieval Agent using Genetic Algorithms with Fuzzy Set Genes.” In Proceedings of the Sixth International Conference on Fuzzy Systems. New York: IEEE Press. 1997. pp. 1227-1232.
[6] Salton, G. Automatic Text Processing: The Transformation, Analysis, and Retrieval of Information by Computer. New York: Addison-Wesley Publishing Co. Inc. 1989.