Review on Performance Evaluation Techniques for Information
Retrieval System
Seenuvasan K¹, Janani V²
¹PG Scholar, Department of CSE, Adhiyamaan College of Engineering, Hosur-635109, Tamil Nadu, India
²Assistant Professor, Department of CSE, Adhiyamaan College of Engineering, Hosur-635109, Tamil Nadu, India
Abstract
In the 19th century, people went to the local library and used a card catalogue to access information. From the 20th century onwards, people began to access information from large stored databases. Such a system for accessing information is called an information retrieval system, and the performance of every information retrieval system is evaluated using certain techniques. Many performance evaluation techniques are in use today, and they fall into two categories: non-graphical evaluation techniques and graphical evaluation techniques. In this paper, performance evaluation techniques for information retrieval systems are reviewed for both the non-graphical and graphical categories, such as precision, recall, F1-score, MAP, the ROC curve, AUC and nDCG.
Keywords: Information retrieval system, graphical and non-graphical techniques, precision, recall, F1-score, MAP, PR-curve, ROC curve, AUC, nDCG
1. Introduction
The information retrieval (IR) system is used to search for information in a massive stored database. Different kinds of information retrieval systems have been used in different generations; in other words, as times have changed, the way IR systems are handled has also changed. In the 1920s, mechanical and electro-mechanical devices were used to search for information in large stored collections [1]. In 1948, Holmstrom described a "machine called the Univac" that could retrieve information at about 120 words per minute by using a subject code method, where the subject code holds reference information about the stored data. Holmstrom's searching technique helped create the next generation of search systems: in the 1950s, computer-based information search systems were introduced. From 1950 to 2000 a number of computer-based IR projects were carried out; in the 1960s Gerard Salton formed an IR group at Harvard University. This group established ideas and concepts about IR systems, and one of its major achievements was an algorithm for rank-based retrieval. In the 1990s, Berners-Lee described the World Wide Web, after which information retrieval systems faced new types of problems; to solve them, two important developments were made, namely link analysis and search over anchor text.
Nowadays, many search engines retrieve information using a number of techniques, so how do we know which technique is best? To measure IR system performance, two methodologies are used: the binary judgment measure [2] and the graded judgment measure [3]. A binary judgment measure is a binary assessment of each query-document pair as either relevant or non-relevant; binary measures produce two types of results, ranked list results and unranked set results. A graded judgment measure is a graded relevance assessment [3] that evaluates retrieved documents based on the grade or degree of relevance of each document. nDCG is a popular graded judgment measure that can be studied in future work. The rest of this article is organized as follows: Section 2 describes unranked results, Section 3 describes ranked results, and Section 4 provides conclusions.
1.2 Hierarchical Structure of Performance Evaluation of IR Systems
Figure 1 shows the performance evaluation techniques for different IR systems. The binary judgment measure and the graded judgment measure are the two methodologies used for IR system performance evaluation. The binary judgment measure produces two types of evaluation results: unranked results and ranked results. Unranked results can also be called non-graphical results, and they use three evaluation techniques: precision, recall and F1-score. Ranked results support both graphical and non-graphical evaluation [7][8] and use four types of evaluation techniques: Mean Average Precision (MAP), the PR-curve, the ROC curve and AUC. The graded judgment measure produces rank-based results, and nDCG is best suited for evaluating ranked documents.
2. Unranked Retrieval Results
The unranked retrieval result is defined over an unordered set of documents that is measured to produce non-graphical results. The non-graphical results are produced by three techniques: precision, recall and F1-score.

Figure 1: Performance evaluation techniques (graphical and non-graphical), shown as a hierarchical structure
2.1 Precision, Recall and F1-score
Precision is the fraction of retrieved documents that are relevant; recall is the fraction of relevant documents that are retrieved [4]. Precision and recall are based on a binary assessment of each document as either relevant (positive) or non-relevant (negative).

Precision = TP / (TP + FP)
Recall = TP / (TP + FN)

Where,
TP - True Positive (relevant documents that are retrieved)
TN - True Negative (non-relevant documents that are not retrieved)
FP - False Positive (non-relevant documents that are retrieved)
FN - False Negative (relevant documents that are not retrieved)

True positive means relevant documents are retrieved; true negative means non-relevant documents are not retrieved; false positive means non-relevant documents are retrieved; false negative means relevant documents are not retrieved. Precision and recall are inversely related: when precision is high, recall tends to be low, and when recall is high, precision tends to be low. Precision is more important for web search, whereas recall is more important for patent search.

Figure 2: Relationship between precision and recall

The F-measure, or F1-score, is derived from the precision and recall measures and combines both into a single value. The general formula is given below:

F_beta = (1 + beta²) · Precision · Recall / (beta² · Precision + Recall)

The default is beta = 1, which means precision and recall are weighted equally, giving

F1 = 2 · Precision · Recall / (Precision + Recall)

2.2 Example for Precision, Recall and F1-score

Table 1: Example evaluation of precision, recall and F1-score
Evaluation       Relevant   Not Relevant
Retrieved        20         40
Not Retrieved    60         100

Precision = 20 / (20 + 40) = 0.33
Recall = 20 / (20 + 60) = 0.25

F1-score = (2 × 0.33 × 0.25) / (0.33 + 0.25) = 0.28

The total number of documents is 220. Table 1 gives the numbers of relevant and non-relevant documents among those retrieved and not retrieved. The precision is 0.33, the recall is 0.25 and the F1-score is 0.28: precision is calculated as the fraction of retrieved documents that are relevant, and recall as the fraction of relevant documents that are retrieved. As non-graphical evaluation methods, precision, recall and F1-score each produce a single scalar value (precision = 0.33, recall = 0.25, F1-score = 0.28).
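As an illustrative sketch only (not part of the original paper; the function and variable names are our own), the calculation above can be written in a few lines of Python:

```python
# Illustrative sketch: precision, recall and F-measure from the Table 1 counts.

def precision(tp, fp):
    # Fraction of retrieved documents that are relevant.
    return tp / (tp + fp)

def recall(tp, fn):
    # Fraction of relevant documents that are retrieved.
    return tp / (tp + fn)

def f_measure(p, r, beta=1.0):
    # General F-measure; beta = 1 gives the balanced F1-score.
    return (1 + beta**2) * p * r / (beta**2 * p + r)

tp, fp, fn, tn = 20, 40, 60, 100   # counts taken from Table 1

p = precision(tp, fp)    # 20 / 60 = 0.33
r = recall(tp, fn)       # 20 / 80 = 0.25
f1 = f_measure(p, r)     # ~0.29 with unrounded values; the paper rounds and reports 0.28

print(f"precision={p:.2f} recall={r:.2f} F1={f1:.2f}")
```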
3. Ranked Retrieval Results
Ranked retrieval results produce both graphical and non-graphical results; these measurements slightly extend the unranked measurements and produce higher-level results. The rank-based measures include rank-based precision and recall, the PR-curve, the ROC curve, P-Precision, MAP (Mean Average Precision) and AUC.
3.1 Precision and Recall Rank-Based Results
Rank-based precision and recall are similar to their unranked counterparts; the rank-based measurements extend the unranked measurements and are explained by the following example.
Example 1:

Table 2: Example Data

No.   Name of Document   Relevant
1     doc1               X
2     doc123             X
3     doc456
4     doc45              X
5     doc78
6     doc567             X
7     doc1784
8     doc444
9     doc1123
10    doc1789            X

Table 3: Values of Precision and Recall

Precision   Recall
1.00        0.20
1.00        0.40
0.66        0.40
0.75        0.60
0.60        0.60
0.60        0.80
0.57        0.80
0.50        0.80
0.40        0.80
0.50        1.00

Table 2 contains 10 documents, of which 5 are relevant (marked "X"). The documents are retrieved one by one from top to bottom, and Table 3 gives the precision and recall after each retrieval. When the first document, doc1, is retrieved, it is relevant, so precision is 100% and recall is 20%. When the second document, doc123, is retrieved, it is also relevant, so precision stays at 100% and recall increases to 40%. When the third document, doc456, is retrieved, it is non-relevant, so precision decreases to 66% while recall does not change. Once all 5 relevant documents have been retrieved, recall reaches 100%, but the final precision is only 50%, because 5 non-relevant documents have also been retrieved.
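The progression in Table 3 can be reproduced with the following Python sketch (ours, not from the paper); the relevance flags follow the "X" marks of Table 2, and the exact fractions differ slightly from the rounded values printed in the table:

```python
# Illustrative sketch: precision and recall after each retrieved document (Tables 2 and 3).

ranked = [("doc1", True), ("doc123", True), ("doc456", False), ("doc45", True),
          ("doc78", False), ("doc567", True), ("doc1784", False), ("doc444", False),
          ("doc1123", False), ("doc1789", True)]

total_relevant = sum(rel for _, rel in ranked)   # 5 relevant documents

hits = 0
for k, (name, rel) in enumerate(ranked, start=1):
    hits += rel
    precision_at_k = hits / k                # fraction of retrieved docs that are relevant
    recall_at_k = hits / total_relevant      # fraction of relevant docs retrieved so far
    print(f"{k:2d} {name:8s} P={precision_at_k:.2f} R={recall_at_k:.2f}")
```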
3.1.1 Average Precision
The average precision is defined as the average of the precision values obtained at the ranks where relevant documents are retrieved; it summarizes the precision and recall results in a single value:

AP = Σ_{k=1}^{N} P(k) · Δr(k)

Where, N is the total number of documents, P(k) is the precision at a cutoff of k documents, and Δr(k) is the change in recall between cutoff k-1 and cutoff k. This sum closely approximates the integral of precision over recall from 0 to 1.

From Tables 2 and 3 the average precision has been calculated as follows.
Table 5: Values of Precision and Recall

Precision   Recall
1.00        0.20
1.00        0.40
0.66        0.40
0.75        0.60
0.60        0.60
0.60        0.80
0.57        0.80
0.50        0.80
0.40        0.80
0.50        1.00

Average Precision = (1.00 + 1.00 + 0.75 + 0.60 + 0.50) / 5 = 0.77

Therefore the average precision value is 0.77.
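A minimal sketch of this computation (ours, with the same relevance pattern as Table 2) averages the precision at the ranks where relevant documents occur:

```python
# Illustrative sketch: average precision as the mean of the precision values
# at the ranks where relevant documents are retrieved (Example 1).

relevance = [1, 1, 0, 1, 0, 1, 0, 0, 0, 1]   # relevance flags for ranks 1..10 (Table 2)

precisions_at_relevant = []
hits = 0
for k, rel in enumerate(relevance, start=1):
    hits += rel
    if rel:
        precisions_at_relevant.append(hits / k)

average_precision = sum(precisions_at_relevant) / len(precisions_at_relevant)
print(precisions_at_relevant)       # [1.0, 1.0, 0.75, 0.666..., 0.5]
print(round(average_precision, 2))  # ~0.78; the paper rounds 4/6 to 0.60 and reports 0.77
```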
3.2 Mean Average Precision (MAP)
MAP is a rank-based non-graphical evaluation technique that produces a non-graphical result related to precision and recall. The mean average precision is the average of the average precision values over a set of queries, where the average precision of a query is computed from the precision at each rank at which a relevant document is retrieved. The following formulas express this concept:

AP = (1 / n(Re)) · Σ_{k=1}^{N} rel(k) · P(k),   with P(k) = (Σ_{i=1}^{k} rel(i)) / k

MAP = (1 / |Q|) · Σ_{q=1}^{|Q|} AP(q)

Where, n(Re) is the number of relevant documents, and rel(k) and rel(i) take the value one or zero to indicate a relevant or non-relevant document at positions k and i respectively.

Example 2:
Consider Example 1 above and Example 2, given by Table 4 below.
The average precision of Example 1 is 0.77.
Table 4: Example Data

No.   Name of Document   Relevant
1     doc12              X
2     Doc423
3     doc45
4     doc454
5     doc545
6     doc5
7     doc725             X
8     doc445
9     doc11
10    doc89              X

The average precision of Example 2 is
Average precision = (1.0 + 0.28 + 0.30) / 3 = 0.52

MAP = (0.77 + 0.52) / 2 = 0.64

The mean average precision for Examples 1 and 2 has therefore been calculated as 0.64.
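The MAP value can be checked with this short sketch (ours); the relevance lists correspond to Tables 2 and 4, with the relevant positions of Table 4 inferred from the precision values used in the calculation above:

```python
# Illustrative sketch: Mean Average Precision (MAP) over the two example queries.

def average_precision(relevance):
    # Mean of the precision values at the ranks where relevant documents occur.
    hits, precisions = 0, []
    for k, rel in enumerate(relevance, start=1):
        hits += rel
        if rel:
            precisions.append(hits / k)
    return sum(precisions) / len(precisions)

example1 = [1, 1, 0, 1, 0, 1, 0, 0, 0, 1]   # Table 2: relevant at ranks 1, 2, 4, 6, 10
example2 = [1, 0, 0, 0, 0, 0, 1, 0, 0, 1]   # Table 4: relevant at ranks 1, 7, 10

ap1 = average_precision(example1)   # ~0.78 (reported as 0.77 in the text)
ap2 = average_precision(example2)   # (1.0 + 2/7 + 3/10) / 3 ~ 0.53 (reported as 0.52)
map_score = (ap1 + ap2) / 2         # ~0.66; the text uses rounded values and reports 0.64
print(round(ap1, 2), round(ap2, 2), round(map_score, 2))
```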
3.3 Precision and Recall Curve
One good way to characterize the performance of an information retrieval system graphically is the precision and recall curve [4]. The precision and recall curve shown in Figure 3 is based on the values obtained from Table 2 (listed in Table 3).

Figure 3: The precision and recall curve for our example; it reaches 100% recall with 50% precision.
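As an illustration of how such a curve can be drawn (our sketch, assuming matplotlib is available; it is not code from the paper), the Table 3 points can be plotted as follows:

```python
# Illustrative sketch: plotting the precision-recall curve from the Table 3 values.
import matplotlib.pyplot as plt

recall    = [0.20, 0.40, 0.40, 0.60, 0.60, 0.80, 0.80, 0.80, 0.80, 1.00]
precision = [1.00, 1.00, 0.66, 0.75, 0.60, 0.60, 0.57, 0.50, 0.40, 0.50]

plt.plot(recall, precision, marker="o")
plt.xlabel("Recall")
plt.ylabel("Precision")
plt.xlim(0, 1.05)
plt.ylim(0, 1.05)
plt.title("Precision-recall curve for Example 1")
plt.savefig("pr_curve.png")   # or plt.show()
```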
3.4 ROC Curve
ROC (Receiver Operating Characteristic) analysis is another rank-based graphical performance evaluation technique, and an ROC graph is a technique for visualizing, organizing and selecting classifiers based on their performance [5][6].
ROC curves are described by two terms, sensitivity and specificity. Sensitivity, which is also called recall, is defined as how many of the relevant documents have been retrieved as relevant. Specificity is how many of the non-relevant documents have been identified as non-relevant. The following confusion matrix explains the working principle of the ROC curve.

3.4.1 Confusion Matrix

                      Actual positive   Actual negative
Predicted positive          TP                FP
Predicted negative          FN                TN

Where,
TP - True Positive: relevant documents that are retrieved; the positive prediction is correct.
FP - False Positive: non-relevant documents that are retrieved; the positive prediction is incorrect.
FN - False Negative: relevant documents that are not retrieved; the negative prediction is incorrect.
TN - True Negative: non-relevant documents that are not retrieved; the negative prediction is correct.

Metrics from the confusion matrix:

Total predictions:   TP + FP = TPP,  FN + TN = TPN
Total ground truth:  TP + FN = TAP,  FP + TN = TAN

True Positive Rate (recall):       TP rate = TP / (TP + FN)
False Positive Rate (false alarm): FP rate = FP / (FP + TN)

Using the above confusion matrix, sensitivity and specificity are defined as follows:

Sensitivity = TP / (TP + FN) = TP / TAP
Specificity = TN / (FP + TN) = TN / TAN
3.4.2 ROC Space
Each classification problem considered here uses only two classes, a positive class and a negative class; each instance I is mapped to a positive 'p' or negative 'n' class label. A discrete classifier model produces a single ROC point, while some classification models, such as a neural network or Naive Bayes, produce a continuous output. The discrete classifier model is discussed here, and the remaining classification models are discussed in the next section. ROC graphs are two-dimensional graphs in which each discrete classifier is represented by a (TP rate, FP rate) pair: the TP rate is plotted on the Y axis and the FP rate on the X axis [5].

Consider 100 positive and 100 negative instances, as defined in the confusion matrices below. The ROC points A, B and C each represent a discrete classifier, and Figure 4 shows all three. The lower left point (0, 0) represents a classifier that makes no false positive errors but also finds no true positives; the upper right point (1, 1) represents the opposite strategy.
'A' - Discrete classifier ROC point

                Actual p   Actual n
Predicted p     TP = 63    FP = 28     91
Predicted n     FN = 37    TN = 72    109
                100        100        200

TP rate = 63 / 100 = 0.63
FP rate = 28 / 100 = 0.28

'B' - Discrete classifier ROC point

                Actual p   Actual n
Predicted p     TP = 76    FP = 12     88
Predicted n     FN = 24    TN = 88    112
                100        100        200

TP rate = 76 / 100 = 0.76
FP rate = 12 / 100 = 0.12

'C' - Discrete classifier ROC point

                Actual p   Actual n
Predicted p     TP = 24    FP = 88    112
Predicted n     FN = 76    TN = 12     88
                100        100        200

TP rate = 24 / 100 = 0.24
FP rate = 88 / 100 = 0.88

Figure 4: A basic ROC space (graph) showing the three discrete classifiers A, B and C.

The upper left point (0, 1) represents perfect classification, and the B point lies closest to this perfect performance. The lower right point (1, 0) represents the worst classification or lowest performance, and the C point represents such low performance. Most real-world domains are dominated by a large number of negative instances, so performance in the far left-hand side of the ROC graph becomes more interesting [6].
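The three ROC points can be computed and plotted with a few lines of Python (our sketch, assuming matplotlib is available); each classifier contributes one (FP rate, TP rate) point:

```python
# Illustrative sketch: the three discrete classifiers A, B, C as points in ROC space.
import matplotlib.pyplot as plt

classifiers = {            # name: (TP, FP, FN, TN) from the confusion matrices above
    "A": (63, 28, 37, 72),
    "B": (76, 12, 24, 88),
    "C": (24, 88, 76, 12),
}

for name, (tp, fp, fn, tn) in classifiers.items():
    tp_rate = tp / (tp + fn)   # Y axis
    fp_rate = fp / (fp + tn)   # X axis
    plt.scatter(fp_rate, tp_rate)
    plt.annotate(name, (fp_rate, tp_rate))

plt.plot([0, 1], [0, 1], linestyle="--")   # diagonal: random performance
plt.xlabel("False Positive rate")
plt.ylabel("True Positive rate")
plt.savefig("roc_space.png")
```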
3.4.3 Creating Curves in ROC Space
A discrete classifier represents only a single point in ROC space. Some classifiers (such as a neural network or Naive Bayes) naturally yield an instance probability or score [6]; such a scoring classifier can be used with a thresholding procedure, and each threshold value produces a different point in ROC space.

Table 6: Example Data for ROC

Instance   Class   Score   TP rate   FP rate
1          p       0.03    0.14      0.00
2          p       0.08    0.28      0.00
3          n       0.10    0.28      0.09
4          p       0.11    0.42      0.09
5          n       0.22    0.43      0.18
6          p       0.32    0.57      0.18
7          p       0.35    0.71      0.18
8          n       0.42    0.71      0.27
9          n       0.44    0.71      0.36
10         p       0.48    0.85      0.36
11         n       0.56    0.85      0.45
12         n       0.65    0.85      0.54
13         n       0.71    0.85      0.63
14         n       0.72    0.85      0.72
15         p       0.73    1.00      0.72
16         n       0.80    1.00      0.81
17         n       0.82    1.00      0.90
18         n       0.99    1.00      1.00
Figure 5 illustrates the ROC curve for an example test set of 18 instances, 7 positive and 11 negative, shown in Table 6; the instances are sorted in ascending order of score. The ROC point at (0.1, 0.7) gives the classifier its highest accuracy.
[7] E. Rasmussen, "Evaluation in Information Retrieval," in 3rd
International Conference on Music Information Retrieval,
Paris, France, 2002, pp. 45-49.
[8] K. Zuva and T. Zuva, "Evaluation of Information Retrieval
Systems, " International Journal of Computer Science &
Information Technology (IJCSIT), vol. 4, pp. 35-43, 2012.
1
True Positive
0.8
Method
0.6
0.4
0.2
0
0
0.2
0.4
0.6
0.8
1
False Positive
Figure 5: Example ROC curve generated from the Table 6 data
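To show how the Table 6 columns can be generated, the sketch below (ours, not from the paper) walks through the instances in the same ascending-score order used in the table and accumulates the TP and FP rates; each step corresponds to one ROC point of Figure 5:

```python
# Illustrative sketch: generating ROC points from the scored instances of Table 6.

instances = [  # (class, score) pairs from Table 6
    ("p", 0.03), ("p", 0.08), ("n", 0.10), ("p", 0.11), ("n", 0.22), ("p", 0.32),
    ("p", 0.35), ("n", 0.42), ("n", 0.44), ("p", 0.48), ("n", 0.56), ("n", 0.65),
    ("n", 0.71), ("n", 0.72), ("p", 0.73), ("n", 0.80), ("n", 0.82), ("n", 0.99),
]

num_pos = sum(1 for c, _ in instances if c == "p")   # 7 positive instances
num_neg = len(instances) - num_pos                   # 11 negative instances

tp = fp = 0
roc_points = [(0.0, 0.0)]
for cls, score in sorted(instances, key=lambda x: x[1]):
    if cls == "p":
        tp += 1
    else:
        fp += 1
    roc_points.append((fp / num_neg, tp / num_pos))  # (FP rate, TP rate)

for x, y in roc_points:
    print(f"FP rate={x:.2f}  TP rate={y:.2f}")
```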
3.5 AUC
The AUC (Area Under the ROC Curve) is used to measure the quality of a classification model. It is the portion of the area of the unit square that lies under the ROC curve, so its value is always between 0 and 1.0. A useful classifier has an AUC between 0.5 and 1.0, and a higher value indicates a better-quality classification model.
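Given a set of ROC points such as those above, the area under the curve can be approximated with the trapezoidal rule; the following is a sketch of ours, not the paper's procedure:

```python
# Illustrative sketch: AUC of an ROC curve via the trapezoidal rule.

def auc(points):
    # points: list of (fp_rate, tp_rate) pairs sorted by increasing fp_rate.
    area = 0.0
    for (x1, y1), (x2, y2) in zip(points, points[1:]):
        area += (x2 - x1) * (y1 + y2) / 2.0   # trapezoid between consecutive points
    return area

# Small example: a perfect classifier versus random guessing.
perfect = [(0.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
random_guess = [(0.0, 0.0), (1.0, 1.0)]
print(auc(perfect))       # 1.0
print(auc(random_guess))  # 0.5
```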
4. Conclusion
We have presented various types of evaluation techniques for information retrieval systems, covering the complete set of binary judgment measures, both graphical and non-graphical. The non-graphical techniques (precision, recall, F1-score and MAP) produce single scalar values, while the graphical evaluation techniques (PR-curve, ROC curve, AUC and nDCG) visualize IR system performance for easy viewing by the user. In future work, a complete review of graded judgment measures and techniques will be presented.
References
[1] M. Sanderson and W. B. Croft, "The History of Information Retrieval Research," Proceedings of the IEEE, May 2012.
[2] K. P. Murphy, "Performance Evaluation of Binary Classifiers," Technical Report, University of British Columbia, 2007.
[3] J. Kekäläinen and K. Järvelin, "Using Graded Relevance Assessments in IR Evaluation," Journal of the American Society for Information Science and Technology, vol. 53, no. 13, pp. 1120-1129, November 2002.
[4] J. Davis and M. Goadrich, "The Relationship between Precision-Recall and ROC Curves," Proc. Int'l Conf. Machine Learning, pp. 233-240, 2006.
[5] T. Fawcett, "An Introduction to ROC Analysis," Pattern Recognition Letters, vol. 27, no. 8, pp. 861-874, 2006.
[6] T. Fawcett, "ROC Graphs: Notes and Practical Considerations for Data Mining Researchers," HP Labs, 2003.
[7] E. Rasmussen, "Evaluation in Information Retrieval," in Proc. 3rd International Conference on Music Information Retrieval, Paris, France, 2002, pp. 45-49.
[8] K. Zuva and T. Zuva, "Evaluation of Information Retrieval Systems," International Journal of Computer Science & Information Technology (IJCSIT), vol. 4, pp. 35-43, 2012.