Comparative Study of Two NLP Framework Architectures Yixian Bian

A Comparative Study of Two Natural Language Processing Frameworks
Yixin Bian, Gunes Koru, Hongfang Liu
Department of Information Systems, University of Maryland, Baltimore County, MD 21250, USA
June 11, 2012
Introduction
UIMA (Unstructured Information Management Architecture) is a framework for natural language processing, originally developed by IBM and now maintained by the Apache Software Foundation.
GATE (General Architecture for Text Engineering) is a Java suite of tools originally developed at the University of Sheffield and now used worldwide by a wide community of scientists and companies for all sorts of natural language processing tasks.
• Both are developed in Java. Although they share common goals, the two architectures differ in many aspects.
• Which one to adopt?
In this paper, we compare them from three perspectives:
• Software design quality
    - Code metrics
• Software maintenance
    - Code smells
    - Bugs
    - Bug survival curves
• User's manual
The Comparison of Metrics

Number of classes: UIMA 2,187; GATE 2,822

                UIMA                                   GATE
Metric          Min  Median    Max    Total  Average   Min  Median    Max    Total  Average
Lines of Code     0      25  2,944  169,516    77.51     0      23  3,869  228,454    80.95
CBO               0       2     84   11,822     5.41     0       2     65   11,203     3.97
NOC               0       0     71    1,170     0.53     0       0     81    1,027     0.36
RFC               0       6    347   35,220    16.10     0       3    214   29,909    10.60
DIT               0       1     10    3,837     1.75     0       1      8    4,731     1.68
LCOM              0      16    100   79,374    36.29     0       0    100   85,051    30.14
WMC               0       4    345   15,166     6.93     0       2    180   15,220     5.39
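The Average column can be cross-checked against the other figures on this slide. The slide does not state the formula, so the following Python sketch assumes that each average is the metric total divided by the number of classes; the variable and function names are chosen here for illustration only.

```python
# Figures copied from the metrics comparison above.
# Assumption (not stated on the slide): average = total / number of classes.
CLASSES = {"UIMA": 2187, "GATE": 2822}

TOTALS = {
    "LOC":  {"UIMA": 169516, "GATE": 228454},
    "CBO":  {"UIMA": 11822,  "GATE": 11203},
    "NOC":  {"UIMA": 1170,   "GATE": 1027},
    "RFC":  {"UIMA": 35220,  "GATE": 29909},
    "DIT":  {"UIMA": 3837,   "GATE": 4731},
    "LCOM": {"UIMA": 79374,  "GATE": 85051},
    "WMC":  {"UIMA": 15166,  "GATE": 15220},
}

REPORTED = {
    "LOC":  {"UIMA": 77.51, "GATE": 80.95},
    "CBO":  {"UIMA": 5.41,  "GATE": 3.97},
    "NOC":  {"UIMA": 0.53,  "GATE": 0.36},
    "RFC":  {"UIMA": 16.1,  "GATE": 10.6},
    "DIT":  {"UIMA": 1.75,  "GATE": 1.68},
    "LCOM": {"UIMA": 36.29, "GATE": 30.14},
    "WMC":  {"UIMA": 6.93,  "GATE": 5.39},
}


def average_per_class(metric, framework):
    """Metric total divided by the framework's class count."""
    return TOTALS[metric][framework] / CLASSES[framework]


# Collect every (metric, framework) pair whose recomputed average
# differs from the reported one by 0.01 or more.
mismatches = [
    (metric, fw)
    for metric in TOTALS
    for fw in CLASSES
    if abs(average_per_class(metric, fw) - REPORTED[metric][fw]) >= 0.01
]

for metric in TOTALS:
    uima = average_per_class(metric, "UIMA")
    gate = average_per_class(metric, "GATE")
    print(f"{metric:5s} UIMA {uima:6.2f}  GATE {gate:6.2f}")
print("mismatches:", mismatches)
```

With the figures as transcribed here, the list of mismatches comes out empty, which supports reading the Average column as a per-class mean.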
The Number of Code Smells

Code Smell            UIMA count  UIMA per KLOC  GATE count  GATE per KLOC
Data Class                     6          0.035          11          0.05
Data Clumps                   63          0.372          21          0.091
Feature Envy                  26          0.153           0          0
Refused Bequest              101          0.6           448          2.05
Long Message Chain            19          0.112          30          0.137
Shotgun Surgery               23          0.136         189          0.863
God Class                     16          0.094          48          0.219
Total                        254          1.5           747          3.41
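The per-KLOC columns normalize each smell count by code size. The slide does not state which line count was used as the base, so the following Python sketch assumes, for UIMA, the total lines of code from the metrics comparison (169,516) divided by 1,000. The names are illustrative.

```python
# Assumed base: UIMA total lines of code from the metrics comparison.
UIMA_LOC = 169516

# UIMA smell counts copied from the table above.
UIMA_SMELLS = {
    "Data Class": 6,
    "Data Clumps": 63,
    "Feature Envy": 26,
    "Refused Bequest": 101,
    "Long Message Chain": 19,
    "Shotgun Surgery": 23,
    "God Class": 16,
}


def per_kloc(count, loc):
    """Occurrences per thousand lines of code."""
    return count / (loc / 1000)


uima_density = {
    smell: round(per_kloc(count, UIMA_LOC), 3)
    for smell, count in UIMA_SMELLS.items()
}
uima_total = sum(UIMA_SMELLS.values())
uima_total_density = round(per_kloc(uima_total, UIMA_LOC), 3)

for smell, density in uima_density.items():
    print(f"{smell:20s} {density:.3f}")
print(f"{'Total':20s} {uima_total_density:.3f}")
```

Under this assumption the UIMA densities agree with the slide to the precision shown there, for example 63 / 169.516 ≈ 0.372 for Data Clumps.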
The Number of Bugs

Detection Tool      UIMA   GATE
FindBugs (2.0.0)       6    178
PMD (5.0)          1,798  1,794
Lint4j (0.9.13)       84    494
The Comparison of Bug Survival Curves
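The slides do not say how the survival curves were estimated. One common choice for this kind of analysis is the Kaplan-Meier estimator, which treats bugs that are still open at the end of the observation period as right-censored. The following Python sketch illustrates that estimator on made-up data; the function name, the input layout, and the sample values are all hypothetical and are not taken from the study.

```python
from collections import Counter


def kaplan_meier(durations, fixed):
    """Return [(t, S(t))] at each time t where at least one bug was fixed.

    durations: number of days each bug was observed open
    fixed:     True if the bug was fixed at that time, False if it was
               still open when observation ended (right-censored)
    """
    fixes = Counter(t for t, f in zip(durations, fixed) if f)
    leaving = Counter(durations)   # bugs leaving the risk set at each time
    at_risk = len(durations)
    survival = 1.0
    curve = []
    for t in sorted(leaving):
        d = fixes.get(t, 0)
        if d:
            # Product-limit step: multiply by the share of at-risk bugs
            # that were NOT fixed at time t.
            survival *= 1.0 - d / at_risk
            curve.append((t, survival))
        at_risk -= leaving[t]
    return curve


# Hypothetical sample: five bugs, two of them still open (censored).
durations = [5, 10, 10, 20, 30]
fixed = [True, True, False, True, False]
curve = kaplan_meier(durations, fixed)

for t, s in curve:
    print(f"day {t:3d}: estimated share of bugs still open = {s:.2f}")
```

Plotting one such step curve per framework on shared axes gives a direct visual comparison: a curve that drops faster indicates that bugs tend to be fixed sooner.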
The Comparison of User Manuals

Contents                                               UIMA  GATE
Table of contents                                       √     √
Tutorial                                                √     √
Overview and characteristics of the software product    √     √
Installation and setup                                  √     √
Introduction to product applications                    √     √
Frequently Asked Questions (FAQ)                        √     ×
Known issues and problems with the software             √     ×
Terms, concepts, and their basic definitions            √     ×
Conclusion
• Software design quality
• Software maintenance
• User's manual
• UIMA is better than GATE.
Thank you!