APPLIED GENETIC ALGORITHMS IN INFORMATION RETRIEVAL
BANGORN KLABBANKOH
Faculty of Information Technology
King Mongkut’s Institute of Technology Ladkrabang
Ladkrabang Bangkok 10520
Tel. (02) 7372551-4(EXT:802) Fax. 3269074 E-Mail:S0067034@kmitl.ac.th
OUEN PINNGERN PH.D.
Department of Computer Engineering, Faculty of Engineering
King Mongkut’s Institute of Technology Ladkrabang
Ladkrabang Bangkok 10520
Tel. (02) 3269969 E-Mail:kpouen@kmitl.ac.th
Abstract: This article presents an online information retrieval system that uses genetic algorithms to
increase information retrieval efficiency. Under the vector space model, information retrieval is based on
the similarity measurement between the query and the documents. Documents with high similarity to the
query are judged more relevant to the query and should be retrieved first. Under genetic algorithms, each
query is represented by a chromosome. These chromosomes are fed into the genetic operator process
(selection, crossover, and mutation) until we get an optimized query chromosome for document retrieval.
Our test results show that information retrieval with 0.8 crossover probability and 0.01 mutation
probability gives the highest precision, while 0.8 crossover probability and 0.3 mutation probability gives
the highest recall.
1. INTRODUCTION
Genetic Algorithms (GA’s) are probabilistic search methods that were developed by John
Holland in 1975 [1][2]. GA’s apply natural selection and natural genetics in artificial intelligence to
search for the globally optimal solution to an optimization problem from among the feasible solutions.
Nowadays GA’s have been applied to various domains, including timetabling, scheduling, robot
control, signature verification, image processing, packing, routing, pipeline control systems, machine
learning, and information retrieval [3][5].
2. PRINCIPLE OF GENETIC ALGORITHMS
2.1 BASIC PRINCIPLES
GA’s are characterized by 5 basic components as follows:
1) Chromosome representation for the feasible solutions to the optimization problem.
2) Initial population of the feasible solutions.
3) A fitness function that evaluates each solution.
4) Genetic operators that generate a new population from the existing population.
5) Control parameters such as population size, probability of genetic operators, number of
generations, etc.
2.2 PROCESS OF GENETIC ALGORITHMS
A GA is an iterative procedure which maintains a constant-size population of feasible solutions.
During each iteration step, called a generation, the fitness of the current population is evaluated, and
individuals are selected based on their fitness values. The higher-fitness chromosomes are selected for
reproduction under the action of crossover and mutation to form a new population. The lower-fitness
chromosomes are eliminated. The new population is evaluated, selected, and fed into the genetic operator
process again until the search terminates with an optimized solution (see Fig. 1).
3. ONLINE INFORMATION RETRIEVAL USING GENETIC ALGORITHMS
3.1 CHROMOSOME REPRESENTATION
Online information retrieval using genetic algorithms is based on the vector space model. Within
this model, both documents and queries are represented by vectors. A particular document is represented
by a vector of terms and a particular query is represented by a vector of query terms.
FIGURE 1 THE PROCESS OF GENETIC ALGORITHMS
(Flowchart: Generate Initial Population, Assess Initial Population, Select Population, Crossover New
Population, Mutate New Population, Assess New Population, Terminate Search? If No, the loop repeats;
if Yes, Stop.)
A document vector (Doc) with n keywords and a query vector (Query) with m query terms can be
represented as
$Doc = (term_1, term_2, term_3, \ldots, term_n)$
$Query = (qterm_1, qterm_2, qterm_3, \ldots, qterm_m)$
We use binary term vectors, so each $term_i$ (or $qterm_j$) is either 0 or 1. $term_i$ is set to zero when
term i is not present in the document and set to one when term i is present in the document.
For example, suppose a user enters a query into our system that retrieves 5 documents. These
documents are
Doc1 = {Relational Databases, Query, Data Retrieval, Computer Networks, DBMS}
Doc2 = {Artificial Intelligence, Internet, Indexing, Natural Language Processing}
Doc3 = {Databases, Expert System, Information Retrieval System, Multimedia}
Doc4 = {Fuzzy Logic, Neural Network, Computer Networks}
Doc5 = {Object-Oriented, DBMS, Query, Indexing}
All keywords of these documents can be arranged in ascending (alphabetical) order as
Artificial Intelligence, Computer Networks, Data Retrieval, Databases, DBMS, Expert
System, Fuzzy Logic, Indexing, Information Retrieval System, Internet, Multimedia, Natural
Language Processing, Neural Network, Object-Oriented, Query, Relational Databases
The documents are then encoded in the chromosome representation as
Doc1 = 0 1 1 0 1 0 0 0 0 0 0 0 0 0 1 1
Doc2 = 1 0 0 0 0 0 0 1 0 1 0 1 0 0 0 0
Doc3 = 0 0 0 1 0 1 0 0 1 0 1 0 0 0 0 0
Doc4 = 0 1 0 0 0 0 1 0 0 0 0 0 1 0 0 0
Doc5 = 0 0 0 0 1 0 0 1 0 0 0 0 0 1 1 0
These chromosomes are called the initial population, which is fed into the genetic operator process.
The length of a chromosome depends on the number of keywords of the documents retrieved by the user
query. In our example the length of each chromosome is 16 bits.
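The encoding above can be reproduced with a short sketch. The function names `build_vocabulary` and `encode` are illustrative and not part of the original system; the sketch assumes that keywords are sorted case-insensitively, which matches the ordering listed above.

```python
# Keyword sets of the five retrieved documents from the example.
DOCS = {
    "Doc1": ["Relational Databases", "Query", "Data Retrieval", "Computer Networks", "DBMS"],
    "Doc2": ["Artificial Intelligence", "Internet", "Indexing", "Natural Language Processing"],
    "Doc3": ["Databases", "Expert System", "Information Retrieval System", "Multimedia"],
    "Doc4": ["Fuzzy Logic", "Neural Network", "Computer Networks"],
    "Doc5": ["Object-Oriented", "DBMS", "Query", "Indexing"],
}


def build_vocabulary(docs):
    """Collect every distinct keyword and sort it in ascending order (case-insensitive)."""
    return sorted({kw for keywords in docs.values() for kw in keywords}, key=str.lower)


def encode(keywords, vocabulary):
    """Return the binary chromosome: 1 if the vocabulary term is present, else 0."""
    present = set(keywords)
    return [1 if term in present else 0 for term in vocabulary]


vocabulary = build_vocabulary(DOCS)
chromosomes = {name: encode(kws, vocabulary) for name, kws in DOCS.items()}

for name, bits in chromosomes.items():
    print(name, "=", " ".join(map(str, bits)))
```

The chromosome length equals the vocabulary size, so it changes with every query that retrieves a different set of documents.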
3.2 FITNESS EVALUATION
The fitness function is a performance measure or reward function which evaluates how good each
solution is. The information retrieval problem is how to retrieve the documents the user requires. We can
use the fitness functions in Table 1 to calculate the similarity between a document and a query.
Table 1 gives each fitness function in 2 forms: for binary term vectors and for weighted term vectors.
We define $X = (x_1, x_2, x_3, \ldots, x_n)$, $|X|$ = number of terms that occur in X, and
$|X \cap Y|$ = number of terms that occur in both X and Y [6].
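TABLE 1 FITNESS FUNCTIONS (t is the number of terms in the vectors)
Similarity Measure Sim(X,Y) | Binary Term Vectors | Weighted Term Vectors
Dice coefficient | $\frac{2\,\lvert X \cap Y \rvert}{\lvert X \rvert + \lvert Y \rvert}$ | $\frac{2 \sum_{i=1}^{t} x_i y_i}{\sum_{i=1}^{t} x_i^2 + \sum_{i=1}^{t} y_i^2}$
Cosine coefficient | $\frac{\lvert X \cap Y \rvert}{\lvert X \rvert^{1/2} \cdot \lvert Y \rvert^{1/2}}$ | $\frac{\sum_{i=1}^{t} x_i y_i}{\sqrt{\sum_{i=1}^{t} x_i^2 \cdot \sum_{i=1}^{t} y_i^2}}$
Jaccard coefficient | $\frac{\lvert X \cap Y \rvert}{\lvert X \rvert + \lvert Y \rvert - \lvert X \cap Y \rvert}$ | $\frac{\sum_{i=1}^{t} x_i y_i}{\sum_{i=1}^{t} x_i^2 + \sum_{i=1}^{t} y_i^2 - \sum_{i=1}^{t} x_i y_i}$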
The results of these fitness functions lie in the interval 0 to 1. A value of 1.0 means the document
and the query are identical. Values near 1.0 mean the document and query are more relevant to each other,
and values near 0.0 mean they are less relevant. The values computed by the fitness functions are called
“fitness”.
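As an illustration, the binary-vector forms of the three coefficients can be sketched as follows. The function names are ours, and returning 0.0 for an empty vector is an assumption made only to avoid division by zero.

```python
from math import sqrt


def overlap(x, y):
    """|X ∩ Y|: number of terms present in both binary vectors."""
    return sum(a & b for a, b in zip(x, y))


def dice(x, y):
    denom = sum(x) + sum(y)
    return 2 * overlap(x, y) / denom if denom else 0.0


def cosine(x, y):
    denom = sqrt(sum(x) * sum(y))
    return overlap(x, y) / denom if denom else 0.0


def jaccard(x, y):
    common = overlap(x, y)
    denom = sum(x) + sum(y) - common
    return common / denom if denom else 0.0


# Doc1 and Doc5 from Section 3.1 share two keywords (DBMS and Query).
doc1 = [0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
doc5 = [0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0]
print(dice(doc1, doc5), cosine(doc1, doc5), jaccard(doc1, doc5))
```

For this pair the values are 4/9, 2/√20, and 2/7 respectively, so the three measures agree on which pairs overlap but assign different fitness values.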
3.3 SELECTION
After we evaluate the population’s fitness, the next step is chromosome selection. Selection
embodies the principle of ‘survival of the fittest’. Chromosomes with satisfactory fitness are selected for
reproduction. Poor chromosomes, i.e. lower-fitness chromosomes, may be selected only a few times or
not at all.
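The text does not name a specific selection scheme. One common choice consistent with this description is fitness-proportionate (roulette-wheel) selection, sketched below as an assumption rather than as the method actually used in our system. The function name `select` and the uniform fallback for an all-zero population are hypothetical.

```python
import random


def select(population, fitnesses, k, rng=random):
    """Draw k chromosomes with probability proportional to fitness.

    Higher-fitness chromosomes tend to be drawn several times; low-fitness
    ones are drawn rarely or not at all.
    """
    if sum(fitnesses) <= 0:
        # Assumed fallback: if every fitness is zero, select uniformly.
        return [rng.choice(population) for _ in range(k)]
    return rng.choices(population, weights=fitnesses, k=k)


population = ["0110", "1000", "0001"]
fitnesses = [0.9, 0.1, 0.0]
print(select(population, fitnesses, k=3, rng=random.Random(0)))
```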
3.4 CROSSOVER
Crossover is the genetic operator that mixes two chromosomes together to form new offspring.
Crossover occurs only with some probability (the crossover probability). Chromosomes that are not
subjected to crossover remain unmodified. The intuition behind crossover is exploration of new solutions
and exploitation of old solutions. GA’s construct a better solution by mixing the good characteristics of
chromosomes together. Higher-fitness chromosomes have a greater opportunity to be selected than
lower ones, so good solutions tend to survive into the next generation.
The crossover technique exchanges the genes lying between crossover points. For example, two
chromosomes are crossed over between positions 5 and 11.
101111110011101
100110011110000
The resulting crossover yields two new chromosomes.
101110011111101
100111110010000
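The example can be checked with a minimal sketch of two-point crossover. It assumes, as the example implies, that positions are 1-based and that the genes strictly between the two positions (here positions 6 to 10) are exchanged. The function names and the random choice of cut points are illustrative assumptions.

```python
import random


def two_point_crossover(parent1, parent2, left, right):
    """Exchange the genes strictly between 1-based positions left and right."""
    child1 = parent1[:left] + parent2[left:right - 1] + parent1[right - 1:]
    child2 = parent2[:left] + parent1[left:right - 1] + parent2[right - 1:]
    return child1, child2


def crossover_with_probability(parent1, parent2, pc=0.8, rng=random):
    """Apply two-point crossover with probability pc; otherwise keep the parents."""
    if rng.random() >= pc:
        return parent1, parent2
    # Hypothetical choice: two distinct cut positions drawn uniformly at random.
    left, right = sorted(rng.sample(range(0, len(parent1) + 2), 2))
    return two_point_crossover(parent1, parent2, left, right)


child1, child2 = two_point_crossover("101111110011101", "100110011110000", 5, 11)
print(child1)
print(child2)
```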
3.5 MUTATION
Mutation involves the modification of the values of each gene of a solution with some
probability (the mutation probability). Changing some bit values of the chromosomes produces
different breeds. The new chromosomes may be better or poorer than the old chromosomes. If they are
poorer than the old chromosomes, they are eliminated in the selection step. The objective of mutation is
to restore lost bit values and to explore a greater variety of data. For example, randomly mutating the
chromosome at position 10:
101111110011101
Result 101111110111101
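A minimal sketch of bit-flip mutation on a bit-string chromosome follows. The helper `flip` reproduces the example above, and `mutate` applies the mutation probability gene by gene. Both names are illustrative.

```python
import random


def flip(chromosome, position):
    """Flip the bit at a 1-based position of a bit-string chromosome."""
    i = position - 1
    flipped = "1" if chromosome[i] == "0" else "0"
    return chromosome[:i] + flipped + chromosome[i + 1:]


def mutate(chromosome, pm=0.01, rng=random):
    """Flip each gene independently with probability pm."""
    return "".join(
        ("1" if gene == "0" else "0") if rng.random() < pm else gene
        for gene in chromosome
    )


print(flip("101111110011101", 10))
```

With Pm = 0.01 most chromosomes pass through unchanged, whereas with Pm = 0.30 roughly a third of the genes change in every generation.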
3.6 PROCESS OF OUR SYSTEM
1. The user enters a query into our system.
2. Match keywords from the user query with the list of keywords.
3. Encode the documents retrieved by the user query as chromosomes (the initial population).
4. Feed the population into the genetic operator process: selection, crossover, and mutation.
5. Repeat step 4 until the maximum generation is reached. We then get an optimized query chromosome
for document retrieval.
6. Decode the optimized query chromosome into a query and retrieve documents from the database.
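Steps 3 to 6 can be sketched end to end as below. This is a sketch under stated assumptions, not the implementation of our system: the fitness of a candidate query is assumed to be its mean Jaccard similarity to the retrieved documents, selection is assumed to be fitness-proportionate, and the best chromosome seen so far is remembered. All function names and the default of 50 generations are hypothetical.

```python
import random

VOCABULARY = [
    "Artificial Intelligence", "Computer Networks", "Data Retrieval", "Databases",
    "DBMS", "Expert System", "Fuzzy Logic", "Indexing",
    "Information Retrieval System", "Internet", "Multimedia",
    "Natural Language Processing", "Neural Network", "Object-Oriented",
    "Query", "Relational Databases",
]
# Chromosomes of Doc1..Doc5 from Section 3.1 (step 3).
DOCUMENTS = [[int(b) for b in bits] for bits in (
    "0110100000000011", "1000000101010000", "0001010010100000",
    "0100001000001000", "0000100100000110",
)]


def jaccard(x, y):
    common = sum(a & b for a, b in zip(x, y))
    union = sum(x) + sum(y) - common
    return common / union if union else 0.0


def fitness(candidate, documents):
    # ASSUMPTION: mean Jaccard similarity to the retrieved documents.
    return sum(jaccard(candidate, d) for d in documents) / len(documents)


def crossover(p1, p2, rng):
    a, b = sorted(rng.sample(range(len(p1) + 1), 2))
    return p1[:a] + p2[a:b] + p1[b:], p2[:a] + p1[a:b] + p2[b:]


def mutate(chromosome, pm, rng):
    return [1 - g if rng.random() < pm else g for g in chromosome]


def optimize_query(documents, pc=0.8, pm=0.01, generations=50, seed=0):
    rng = random.Random(seed)
    population = [list(d) for d in documents]
    best = max(population, key=lambda c: fitness(c, documents))
    for _ in range(generations):  # step 5: repeat until max generation
        scores = [fitness(c, documents) for c in population]
        # Step 4a, selection; the small constant keeps the weights valid.
        weights = [s + 1e-9 for s in scores]
        parents = rng.choices(population, weights=weights, k=len(population))
        offspring = []
        for i in range(0, len(parents) - 1, 2):
            p1, p2 = parents[i], parents[i + 1]
            if rng.random() < pc:  # step 4b, crossover
                p1, p2 = crossover(p1, p2, rng)
            offspring += [mutate(p1, pm, rng), mutate(p2, pm, rng)]  # step 4c
        if len(parents) % 2:  # odd population size: carry the last parent over
            offspring.append(mutate(parents[-1], pm, rng))
        population = offspring
        champion = max(population, key=lambda c: fitness(c, documents))
        if fitness(champion, documents) > fitness(best, documents):
            best = champion
    return best


def decode(chromosome, vocabulary):
    """Step 6: turn an optimized chromosome back into query keywords."""
    return [term for term, gene in zip(vocabulary, chromosome) if gene]


best = optimize_query(DOCUMENTS)
print(decode(best, VOCABULARY))
```

The population size stays constant across generations, in line with Section 2.2, and `pc` and `pm` correspond to the control parameters Pc and Pm varied in the experiments.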
4. EXPERIMENTATION
4.1 TEST CASE FORMULATION
This experiment tests 21 queries with 3 different fitness functions: Jaccard coefficient
(F1), cosine coefficient (F2), and Dice coefficient (F3). Each fitness function is tested with a set of
parameters, probability of crossover (Pc = 0.8) and probability of mutation (Pm = 0.01, 0.10, 0.30), to
compare the efficiency of the retrieval system. Information retrieval efficiency is measured by recall and
precision.
Recall is defined as the proportion of relevant documents that are retrieved (see equation 1) [4][6].
$$\text{Recall} = \frac{\text{Number of documents retrieved and relevant}}{\text{Total relevant in collection}} \qquad (1)$$
Precision is defined as the proportion of retrieved documents that are relevant (see equation 2)
[4][6].
$$\text{Precision} = \frac{\text{Number of documents retrieved and relevant}}{\text{Total retrieved}} \qquad (2)$$
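Equations 1 and 2 translate directly into code. The sketch below uses document identifiers invented purely for illustration; the function name `precision_recall` and the convention of returning 0.0 for an empty set are assumptions.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for sets of retrieved and relevant document ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # documents retrieved and relevant
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Illustrative ids only: 4 documents retrieved, 3 relevant in the collection.
precision, recall = precision_recall(retrieved=[1, 2, 3, 4], relevant=[2, 4, 5])
print(precision, recall)
```

Here 2 of the 4 retrieved documents are relevant (precision 0.5) and 2 of the 3 relevant documents are retrieved (recall 2/3).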
The test database consisted of 343 documents taken from student projects of the Faculty of
Information Technology, King Mongkut’s Institute of Technology Ladkrabang.
TABLE 2. INFORMATION RETRIEVAL BY 3 FITNESS FUNCTIONS
WITH PC = 0.8 AND PM = 0.01
Keywords | Query Chromosome | F1 | F2 | F3 | RetRel | RetNRel
application | 00100000000000000000000000001100 | 0.84 | 0.91 | 0.90 | 30 | 1
database | 0001000000000000000000000000000010000100 | 0.59 | 0.65 | 0.65 | 34 | 8
DNS | 0011001001 | 1.00 | 1.00 | 1.00 | 6 | 2
internet | 00000000000010000000000001 | 0.76 | 0.86 | 0.84 | 41 |
marketing | 0110110 | 1.00 | 1.00 | 1.00 | 11 | 8
recognition | 11000 | 0.71 | 0.75 | 0.74 | 7 |
security | 000100100 | 1.00 | 1.00 | 1.00 | 17 | 57
network | 0000100000000010000000 | 1.00 | 1.00 | 1.00 | 78 | 21
4.2 EXPERIMENT RESULTS
Preliminary testing indicated that
1. Experiments with the 3 fitness functions show that the optimized queries from these fitness
functions are all the same, but they have different fitness values (F1, F2, and F3), as shown in Table
2. In Table 2, RetRel is defined as the number of retrieved relevant documents and RetNRel is defined as
the number of retrieved but not relevant documents.
FIG. 2 PRECISION AND RECALL
(Plot of efficiency, from 0.4 to 1, against Pmutation, from 0 to 0.3, with one curve each for precision
and recall.)
2. Information retrieval with Pc = 0.8 and Pm = 0.01 yields the highest precision, 0.746, while
information retrieval with Pm = 0.10 yields a moderate precision of 0.560 and information retrieval with
Pm = 0.30 yields the lowest precision, 0.417, as shown in Figure 2.
3. Information retrieval with Pc = 0.8 and Pm = 0.30 yields the highest recall, 0.976, while
information retrieval with Pm = 0.01 yields a moderate recall and information retrieval with Pm = 0.10
yields the lowest recall, 0.786, as shown in Figure 2.
5. CONCLUSIONS
The preliminary experiment indicates that precision and recall are inversely related. Which
parameters to use depends on what the user would like to retrieve. When high-precision documents are
preferred, the parameters should be a high crossover probability and a low mutation probability. When
more relevant documents (high recall) are preferred, the parameters should be a high mutation
probability and a lower crossover probability. The preliminary experiment also indicates that GA’s can
be used in information retrieval. Further study will test larger databases and present retrieved
documents in order of their fitness values, which represent the user’s desire.
REFERENCES
[1] Davis, L. Handbook of Genetic Algorithms. New York: Van Nostrand Reinhold. 1991.
[2] Goldberg, D.E. Genetic Algorithms in Search, Optimization, and Machine Learning. New York:
Addison-Wesley Publishing Co. Inc. 1989.
[3] Kraft, D.H. et al. “The Use of Genetic Programming to Build Queries for Information Retrieval.”
In Proceedings of the First IEEE Conference on Evolutionary Computation. New York: IEEE Press.
1994. pp. 468-473.
[4] Korfhage, R.R. Information Storage and Retrieval. New York: Wiley Computer Publishing. 1997.
[5] Martin-Bautista, M.J. et al. “An Approach to An Adaptive Information Retrieval Agent using
Genetic Algorithms with Fuzzy Set Genes.” In Proceedings of the Sixth International Conference on
Fuzzy Systems. New York: IEEE Press. 1997. pp. 1227-1232.
[6] Salton, G. Automatic Text Processing: The Transformation, Analysis, and Retrieval of Information by
Computer. New York: Addison-Wesley Publishing Co. Inc. 1989.