Hadoop Recognition of Biomedical Named Entity Using Conditional

advertisement
Hadoop Recognition of Biomedical Named Entity Using
Conditional Random Fields
ABSTRACT:
Processing large volumes of data has presented a challenging issue, particularly in
data-redundant systems. As one of the most recognized models, the conditional
random fields (CRF) model has been widely applied in biomedical named entity
recognition (Bio-NER). Due to the internally sequential feature, performance
improvement of the CRF model is nontrivial, which requires new parallelized
solutions. By combining and parallelizing the limited-memory Broyden-FletcherGoldfarb-Shanno (LBFGS) and Viterbi algorithms, we propose a parallel CRF
algorithm called MRCRF (MapReduce CRF) in this paper, which contains two
parallel sub-algorithms to handle two time-consuming steps of the CRF model.
The MRLB (MapReduce LBFGS) algorithm leverages the MapReduce framework
to enhance the capability of estimating parameters. Furthermore, the MRVtb
(MapReduce Viterbi) algorithm infers the most likely state sequence by extending
the Viterbi algorithm with another MapReduce job. Experimental results show that
the MRCRF algorithm outperforms other competing methods by exhibiting
significant performance improvement in terms of time efficiency as well as
preserving a guaranteed level of correctness.
EXISTING SYSTEM:
 Current methods for Bio-NER fall into three general classes, i.e., dictionarybased methods, heuristic rule-based methods, and statistical machine
learning methods.
Further Details Contact: A Vinay 9030333433, 08772261612
Email: info@takeoffprojects.com | www.takeoffprojects.com
 In order to handle large data, Jeong et al. proposed an efficient inference
algorithm of CRF for large-scale natural language data which unified the
SFB and TP approaches.
 Lavergne et al. addressed the issue of training very large CRF, containing up
to hundreds output labels and several billion features.
DISADVANTAGES OF EXISTING SYSTEM:
 Relying on dictionary-based methods could cause the low recall due to the
continual appearance of new entities with the advancing biology research.
Biological named entities do not follow any nomenclature, which makes
rule-based methods hard to be perfect.
 Besides, rule-based systems require domain experts, and they are not
portable to other NE types and domains.
 Machine learning methods are more robust and they can identify potential
biomedical entities which are not previously included in standard
dictionaries.
PROPOSED SYSTEM:
 Conditional random fields (CRF), a type of conditional probability model,
has been widely applied in biomedical named entity recognition.
 In this paper, we propose an improved parallel CRF algorithm by combining
the parallel L-BFGS and Viterbi algorithms. The algorithm leverages the
MapReduce framework to enhance the capability of estimating parameters.
Further Details Contact: A Vinay 9030333433, 08772261612
Email: info@takeoffprojects.com | www.takeoffprojects.com
Furthermore, it infers the most likely state sequence by extending the Viterbi
algorithm with another MapReduce job.
 We propose an efficient method called MRCRF (MapReduce CRF) to
partition a large dataset across Hadoop nodes in order to keep the context of
each word in each sentence of Bio-NER, balance the workload and minimize
the need for replication. Compared to the CRF method, the proposed
MRCRF method requires ”partitioning” the data sets.
 We develop two efficient parallel algorithms, i.e., the MRLB (MapReduce
L-BFGS) algorithm and the MRVtb (MapReduce Viterbi) algorithm to
implement the parallel CRF for Bio-NER based on MapReduce. The
algorithms have improved performance compared with an existing
sequential algorithm.
 We conduct performance evaluation which can reveal the performance
benefit of the MRCRF algorithm over the CRF counterpart. The
performance is presented with reported speedup versus the sequential CRF
under different data set sizes and varying Hadoop configurations.
ADVANTAGES OF PROPOSED SYSTEM:
 The advantage of the CRF model is the ability to express long-distancedependent and overlapping features. CRF has shown empirical success
recently in Bio-NER, since it is free from the so-called label bias problem by
using a global normalization.
 We empirically show that, while maintaining a competitive accuracy on the
test data, the algorithm achieves significant speedup compared to the
Further Details Contact: A Vinay 9030333433, 08772261612
Email: info@takeoffprojects.com | www.takeoffprojects.com
baseline CRF algorithm implemented on a single machine. The proposed
algorithms are designed to work in the Hadoop environment, where each
mapper in the nodes only compute a subset of the data.
SYSTEM ARCHITECTURE:
Further Details Contact: A Vinay 9030333433, 08772261612
Email: info@takeoffprojects.com | www.takeoffprojects.com
SYSTEM REQUIREMENTS:
HARDWARE REQUIREMENTS:
 System
:
Pentium IV 2.4 GHz.
 Hard Disk
:
40 GB.
 Floppy Drive
:
1.44 Mb.
 Monitor
:
15 VGA Colour.
 Mouse
:
Logitech.
 Ram
:
512 Mb.
SOFTWARE REQUIREMENTS:
 Operating system :
Windows 7/UBUNTU.
 Coding Language :
Java 1.7 ,Hadoop 0.8.1
 IDE
:
Eclipse
 Database
:
MYSQL
REFERENCE:
Kenli Li, Wei Ai, Zhuo Tang, Fan Zhang, Lingang Jiang, Keqin Li, and Kai
Hwang, “Hadoop Recognition of Biomedical Named Entity Using Conditional
Random Fields”, IEEE Transactions on Parallel and Distributed Systems 2015.
Further Details Contact: A Vinay 9030333433, 08772261612
Email: info@takeoffprojects.com | www.takeoffprojects.com
Download