Biological Literature Miner: Gene Mention Recognition and Protein-Protein Interaction Pair Extraction

CHEN Yifei

ABSTRACT
+++++++++

In this dissertation we describe Biological Literature Miner (BioLMiner), a machine learning-based text mining system that automatically extracts useful information from the biological literature. BioLMiner was designed, implemented and tuned in collaboration with Feng Liu and includes four subsystems: a Gene Mention Recognizer (GMRer), a Gene Normalizer (GNer), an Interaction Article Classifier (IACer) and a Protein-Protein Interaction Pair Extractor (PPIEor). Two of these subsystems, GMRer and PPIEor, are the topic of this dissertation; they are based on Support Vector Machines (SVMs) and Conditional Random Fields (CRFs).

GMRer is the first subsystem in BioLMiner and solves the task of recognizing gene mentions, i.e. gene and protein names, in the biological literature. This step is essential because the subsequent subsystems use the extracted gene mentions to extract further information. GMRer uses a sequence labeling-based model built on CRFs and two classification-based models built on forward SVMs and backward SVMs. To improve performance further, a hybrid model is proposed that fuses the outputs of the three individual models using voting rules.

PPIEor is the last subsystem of BioLMiner and its purpose is to automatically extract protein-protein interaction (PPI) pairs from the biological literature. Finding these interactions is one of the most pressing problems in the biological domain. During the preprocessing phase, PPIEor transforms the sentence-based articles into clause-based ones and distills the candidate PPI pairs. A binary SVM classifier is proposed to predict whether each candidate PPI pair is correct. Finally, a post-processor recovers some self-interacting proteins that the SVM model cannot identify.
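
The fusion step of the hybrid GMRer model can be illustrated with a minimal sketch. The abstract does not specify the voting rules, so the rule used here is an assumption: a simple per-token majority vote over the three label sequences, falling back to the CRF label when all three models disagree. The function name `fuse_by_majority` and the B/I/O label set are hypothetical and chosen only for illustration.

```python
from collections import Counter
from typing import List, Sequence


def fuse_by_majority(
    crf: Sequence[str],
    forward_svm: Sequence[str],
    backward_svm: Sequence[str],
) -> List[str]:
    """Fuse three per-token label sequences by simple majority voting.

    Assumption (not taken from the dissertation): if all three models
    disagree on a token, the CRF label is kept.
    """
    if not (len(crf) == len(forward_svm) == len(backward_svm)):
        raise ValueError("all label sequences must have the same length")

    fused = []
    for votes in zip(crf, forward_svm, backward_svm):
        # most_common(1) returns the label with the highest vote count
        label, count = Counter(votes).most_common(1)[0]
        # With three voters, a count of 2 or 3 is a majority;
        # otherwise fall back to the first voter (the CRF).
        fused.append(label if count >= 2 else votes[0])
    return fused


# Hypothetical per-token labels for a four-token sentence
crf_labels = ["B", "I", "O", "O"]
fwd_labels = ["B", "O", "O", "B"]
bwd_labels = ["B", "I", "B", "I"]
fused_labels = fuse_by_majority(crf_labels, fwd_labels, bwd_labels)
```

Voting token by token can yield label sequences that are not well formed (for example an `I` directly after an `O`), so a real system would likely need a repair step or mention-level voting; which of these BioLMiner uses is not stated here.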