
Prediction of symptoms for genetic disease

Machine Learning Foundations, fall 2013

Final Project

Itay Dangoor

Introduction

Genetic disorders account for a significant share of human disease and attract considerable research attention in the biological and medical fields. In recent years, information connecting human phenotypic anomalies to gene defects has been progressively gathered into large data sets. Having such comprehensive references for genetic disorders and their effects makes it possible to carry out large-scale research involving both cellular activity and phenotypes.

Constraint-based metabolism modeling, abbreviated CBM, is a widely used framework for computationally examining cell metabolic behavior under different states. CBM models describe cell metabolic processes by defining a set of metabolites (biochemical compounds) and a set of convex constraints enforced on them. A solution satisfying all the constraints is usually obtained using Linear Programming methods, allowing instant access to a description of all the cellular processes at once.

Thus CBM models enable systematic, genome-wide computational research of cellular metabolic activity and drive novel characterization of cellular behavior. In 2007, CBM models for a generic human cell were presented [Duarte et al 2007, Ma et al 2007], allowing the modeling community to extend its research focus from microorganism engineering towards major human medical issues.

Construction of a map from the physiological cellular state to physical symptoms is an open challenge that computational biology faces today. In this work I will try to construct such a map by building predictors dedicated to phenotypes of human disease. Based on descriptions of gene defects computed using CBM methods and the human model [Duarte et al 2007], and on observed phenotype-to-gene associations, I will render predictions for the emergence of human phenotypes as a function of gene-defect patterns.


Data

Human Phenotype Ontology

The Human Phenotype Ontology, or HPO, is an ordered reference of human genetic diseases and their associated observed phenotypes [www.human-phenotype-ontology.org]. A set of phenotypes was extracted from this data set, together with the corresponding set of causative genetic diseases. Then, for each disease, the set of known gene defects causing it was extracted. This process yields a two-layer map, from phenotypes to diseases and from diseases to genes. Linking every gene to all the phenotypes that map to a disease associated with that gene results in a many-to-many phenotype-to-causative-gene mapping. The samples in this work are gene defects, and the classes, or labels, are the phenotypes.

A phenotype's causative gene defects are considered positive samples, while the rest of the genes are considered negative samples. In order to have enough training data and render good predictions, phenotypes with fewer than 10 causative genes were filtered out, resulting in 120 phenotypes with between 10 and 45 positive samples (Figure 1).

Figure 1: histogram of phenotypes as a function of the number of positive samples (causative genes) associated with them.
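To make the construction of this mapping concrete, the following is a minimal Python sketch; the input dictionaries phenotype_to_diseases and disease_to_genes are hypothetical names standing for the associations extracted from the HPO files (the parsing itself is not shown).

```python
from collections import defaultdict

def build_phenotype_to_genes(phenotype_to_diseases, disease_to_genes, min_genes=10):
    """Compose the two-layer map (phenotype -> diseases -> genes) into a
    many-to-many phenotype -> causative-gene mapping, keeping only
    phenotypes with at least `min_genes` causative genes."""
    phenotype_to_genes = defaultdict(set)
    for phenotype, diseases in phenotype_to_diseases.items():
        for disease in diseases:
            phenotype_to_genes[phenotype].update(disease_to_genes.get(disease, ()))
    # Filter out phenotypes with too few positive samples.
    return {p: genes for p, genes in phenotype_to_genes.items()
            if len(genes) >= min_genes}
```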

Obtaining descriptors for gene defects

CBM models contain a set of metabolites (biochemical compounds), a set of biochemical reactions formed as convex linear constraints on the metabolites, and a map between reactions and the genes regulating them. The cellular state with a defective gene is simulated using the human CBM model [Duarte et al 2007]. First, in order to enforce a gene malfunction on the model, all the reactions related to that gene are constrained to carry zero flux. A solution that satisfies all the constraints can be found using Linear Programming, but usually the constraints yield more than one feasible solution. To resolve this issue, Flux Balance Analysis (FBA) is used [Varma and Palsson 1994]. FBA is a common CBM optimization method which adds a biologically meaningful objective function to the LP problem; the most prevalent objective simulates a maximization of the cellular growth rate, and the resulting problem is again solved using Linear Programming. Samples are the vectors of flux rates through the reactions obtained from the linear program solution.
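To illustrate this simulation step, here is a minimal sketch that performs the gene knockout and FBA directly with scipy's linear-programming solver rather than with a dedicated CBM toolbox; the stoichiometric matrix S, the flux bounds lb and ub, the biomass-reaction index and the gene-to-reaction map are assumed to have been extracted from the human model beforehand.

```python
import numpy as np
from scipy.optimize import linprog

def fba_knockout(S, lb, ub, biomass_idx, gene_to_reactions, gene):
    """Simulate a gene defect: force zero flux through the gene's reactions,
    then maximize the biomass reaction subject to steady state S v = 0."""
    lb, ub = lb.copy(), ub.copy()
    for r in gene_to_reactions.get(gene, ()):   # knock out the gene
        lb[r] = ub[r] = 0.0
    n = S.shape[1]
    c = np.zeros(n)
    c[biomass_idx] = -1.0                       # linprog minimizes, so negate
    res = linprog(c, A_eq=S, b_eq=np.zeros(S.shape[0]),
                  bounds=list(zip(lb, ub)), method="highs")
    if not res.success:
        raise RuntimeError("infeasible knockout: " + gene)
    return res.x                                # flux vector = one sample
```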

Filtering non-informative dimensions

The cellular state as defined in the solution of the LP problem is a vector in which each entry corresponds to a flux rate through one biochemical reaction in the model. The dimension of these vectors is 3788, as 3788 reactions are defined in the human metabolic model. Due to limitations of the model and the FBA method, some of the reactions always carry constant flux in my simulation, and hence contain no information. In order to increase prediction accuracy and speed, those dimensions with zero variance among all samples were filtered out, leaving 1207 dimensions.
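A minimal sketch of this filtering, assuming the flux vectors are stacked as the rows of a NumPy array:

```python
import numpy as np

def drop_constant_dimensions(X):
    """Remove reaction-flux dimensions whose value is identical across all
    samples (zero variance), since they carry no information."""
    keep = X.var(axis=0) > 0
    return X[:, keep], keep   # the mask also lets future samples be filtered
```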

Learning

Learning process summary

The goal of the learning in this work is to build a predictor per phenotype and thus be able to decide which phenotypes will be expressed for each pattern of gene defects. For that purpose two classification methods are used as predictors: five predictors are trained for each phenotype, the K-Nearest-Neighbor algorithm with k = 1, 3, 5 and the Support Vector Machine algorithm with a linear kernel and a radial kernel. The training and testing groups are built using an equal number of positive and negative samples. To test the prediction accuracy, a 5-fold cross-validation process is utilized. The error rates are measured per phenotype and on the testing group only. The error is defined as the percentage of positive samples predicted negative plus the percentage of negative samples predicted positive.
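The setup can be sketched as follows. For brevity the sketch uses the scikit-learn equivalents of the five classifier configurations (the actual work uses a course K-NN implementation and libSVM directly), and computes the error exactly as defined above: the percentage of misclassified positives plus the percentage of misclassified negatives.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

def make_classifiers(n_dims):
    """The five predictor configurations used for every phenotype."""
    return {
        "1NN": KNeighborsClassifier(n_neighbors=1),
        "3NN": KNeighborsClassifier(n_neighbors=3),
        "5NN": KNeighborsClassifier(n_neighbors=5),
        "SVM-linear": SVC(kernel="linear"),
        "SVM-radial": SVC(kernel="rbf", gamma=1.0 / n_dims),
    }

def balanced_error(y_true, y_pred):
    """Percentage of positives predicted negative plus percentage of
    negatives predicted positive."""
    pos, neg = y_true == 1, y_true == 0
    fnr = 100.0 * np.mean(y_pred[pos] == 0)
    fpr = 100.0 * np.mean(y_pred[neg] == 1)
    return fnr + fpr
```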


K-Nearest-Neighbor

The implementation of K-NN is based on lecture 7 of the course and is the same as the one handed in for exercise 3. The distance between samples is determined using the Euclidean metric. Since the data is of very high dimension (1207), and the topology of the problem is not exactly known, 3 configurations of the K-NN classifier are harnessed. The configurations differ from each other in the classification decision method (a minimal sketch follows the list):

1. NN - return the class of the single nearest neighbor.

2. 3NN - return the majority class among the 3 nearest neighbors.

3. 5NN - return the majority class among the 5 nearest neighbors.
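For illustration, a minimal NumPy version of such a K-NN classifier (not the original course implementation) could look like this:

```python
import numpy as np

def knn_predict(X_train, y_train, X_test, k=1):
    """Classify each test sample by majority vote among its k nearest
    training samples under the Euclidean metric (labels are 0/1)."""
    preds = []
    for x in X_test:
        dists = np.linalg.norm(X_train - x, axis=1)    # Euclidean distances
        nearest = np.argsort(dists)[:k]                # indices of k closest
        votes = y_train[nearest]
        preds.append(1 if votes.sum() * 2 > k else 0)  # majority vote
    return np.array(preds)
```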

Support vector machine

For the implementation of SVM, the libSVM package is used [Chang and Lin 2011]. Here also, since the data is of high dimension (1207) and the topology is not known, 2 configurations of the SVM kernel are used:

1. Linear kernel - the product of two vectors is simply defined as their dot product.

2. Radial kernel - the product of two vectors x and z is defined as exp(-G * ||x - z||^2), where ||x - z|| is the 2-norm of the difference of the vectors and G is a normalization factor, set to one divided by the number of dimensions (a small sketch of this kernel follows the list).
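A small sketch of the radial kernel computation, with G defaulting to one over the number of dimensions as described above:

```python
import numpy as np

def radial_kernel(x, z, gamma=None):
    """exp(-G * ||x - z||^2); by default G = 1 / number of dimensions."""
    if gamma is None:
        gamma = 1.0 / x.shape[0]
    diff = x - z
    return np.exp(-gamma * np.dot(diff, diff))
```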

Cross validation

In order to create a set of test samples, a 5-fold cross-validation process is used. Both the positive and the negative data are divided into 5 equal parts, and over 5 runs each of the parts serves as the test data in turn.
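A simple sketch of the fold construction, applied separately to the positive and to the negative samples:

```python
import numpy as np

def five_fold_indices(n_samples, n_folds=5, seed=None):
    """Shuffle the sample indices and split them into n_folds (nearly) equal
    parts; each part serves as the test set in one of the folds."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_samples)
    return np.array_split(idx, n_folds)
```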

Data selection and multiple runs

Since for all of the classes there exist more negative samples than positive samples, and in order to maintain balanced classifiers and produce balanced error measurements, the number of negative samples is reduced to match the number of positive samples. The negative samples are selected at random. Since in every run the selected negative training samples differ, the results may differ from one run to another. To overcome this issue and make the results more repeatable and significant, the whole 5-fold prediction process was run 50 times.
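Putting the pieces together, here is a sketch of the repeated balanced evaluation for a single phenotype; it assumes the hypothetical helpers sketched earlier (make_classifiers, balanced_error, five_fold_indices), with X_pos and X_neg holding the positive and negative flux-vector samples.

```python
import numpy as np

def repeated_balanced_errors(X_pos, X_neg, n_runs=50, seed=0):
    """For each run, subsample as many negatives as positives, split positives
    and negatives into 5 folds each, and collect the error of every classifier
    on every fold."""
    rng = np.random.default_rng(seed)
    errors = {name: [] for name in make_classifiers(X_pos.shape[1])}
    n_pos = len(X_pos)
    for _ in range(n_runs):
        # Random balanced subsample of the negatives.
        neg = X_neg[rng.choice(len(X_neg), size=n_pos, replace=False)]
        pos_folds = five_fold_indices(n_pos, seed=int(rng.integers(1 << 30)))
        neg_folds = five_fold_indices(n_pos, seed=int(rng.integers(1 << 30)))
        for pf, nf in zip(pos_folds, neg_folds):
            pos_test = np.isin(np.arange(n_pos), pf)
            neg_test = np.isin(np.arange(n_pos), nf)
            X_train = np.vstack([X_pos[~pos_test], neg[~neg_test]])
            y_train = np.concatenate([np.ones((~pos_test).sum()),
                                      np.zeros((~neg_test).sum())])
            X_test = np.vstack([X_pos[pos_test], neg[neg_test]])
            y_test = np.concatenate([np.ones(pos_test.sum()),
                                     np.zeros(neg_test.sum())])
            for name, clf in make_classifiers(X_train.shape[1]).items():
                clf.fit(X_train, y_train)
                errors[name].append(balanced_error(y_test, clf.predict(X_test)))
    return errors
```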


Accuracy Measures

As each predictor is dedicated to a single class, the error rate of the prediction process is calculated for each class separately. The measure used for the error rate is the percentage of test samples that were falsely classified out of the total number of test samples.

Since there are 50 runs of the prediction, each producing a different score, the mean error is taken as the final value for the prediction accuracy. In addition, the standard deviation is calculated over the 50 values, and a p-value for the error being lower than 50% is extracted using a one-sided t-test. As there are multiple classes, the Bonferroni correction is applied, lowering the threshold for a significant p-value from 0.01 to 0.01/120 ≈ 0.000083.
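A sketch of this significance computation for one phenotype and one classifier; errs stands for the array of 50 error values (in percent) collected above, and the corrected threshold follows from dividing 0.01 by the 120 classes.

```python
import numpy as np
from scipy import stats

def error_significance(errs, n_classes=120, alpha=0.01):
    """Mean, standard deviation and a one-sided p-value for the hypothesis
    that the true mean error is below 50% (chance level)."""
    errs = np.asarray(errs, dtype=float)
    t_stat, p_two_sided = stats.ttest_1samp(errs, popmean=50.0)
    # One-sided: only errors below 50% count as evidence of real prediction.
    p_one_sided = p_two_sided / 2 if t_stat < 0 else 1 - p_two_sided / 2
    threshold = alpha / n_classes          # Bonferroni: 0.01 / 120 ~ 0.000083
    return errs.mean(), errs.std(ddof=1), p_one_sided, p_one_sided < threshold
```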

Experimental Results

The following figures describe the error-rate results of the prediction process described in the Learning chapter. As stated before, all of the figures present averages over 50 runs.

Figure 2: mean error rate over all phenotypes.


Figure 3: count of significant predictions out of 120 classes (p-value < 0.000083).

Figure 4: distribution of the error rates for the various phenotype predictors over the 5 classification methods.


Figure 5: error rate distribution of the 5 most significant phenotypes for 1-NN.

Figure 6: error rates of the 5 most significant phenotypes for 3-NN.


Figure 7: error rates of the 5 most significant phenotypes for 5-NN.

Figure 8: error rates of the 5 most significant phenotypes for SVM with linear kernel.


Figure 9: error rates of the 5 most significant phenotypes for SVM with radial kernel.

Discussion

The human cellular metabolic model has many limitations and lacks much information in comparison to the grand complexity of the living human cell. Moreover, the phenotypes predicted in this work are complex phenomena which are not necessarily related to simple metabolic cell behavior. Yet the results indicate that the prediction of most of the phenotypes is successful and significant (Figure 3).

Both K-NN and SVM could handle the classification task presented in this work, but with a high error rate, never lower than 15% and sometimes as high as 45% (Figure 4). Also, not to be ignored are the many cases, about 16%, in which all the classifiers failed to render a good, significant prediction (Figure 3).

Out of the five methods implemented in this work, the radial kernel SVM is the best method for the presented task, as it presents the lowest mean error (Figure 2) and the highest number of significant predictions (Figure 3). The next best classifier for the task is the linear kernel SVM. Although the methods present different success rates, there are phenotypes that seem to be better predicted by a less successful method (Figure 4).

Out of the 3 k-NN methods, the first nearest neighbor came out as the most fit for the current problem (Figure 2, Figure 3). This fact could shed some light on the data, implying that the samples are not scattered in space in a very clustered way. Another explanation for the better performance of the first nearest neighbor might be the curse of dimensionality expressed in this work: the samples are of dimension 1207, making the sample space very sparse under the Euclidean metric.

References

1. The Human Phenotype Ontology, http://www.human-phenotype-ontology.org/

2. Duarte NC, Becker SA, Jamshidi N, Thiele I, Mo ML, Vo TD, Srivas R & Palsson BØ (2007) Global reconstruction of the human metabolic network based on genomic and bibliomic data. Proceedings of the National Academy of Sciences of the United States of America 104: 1777-1782.

3. Ma H, Sorokin A, Mazein A, Selkov A, Selkov E, Demin O & Goryanin I (2007) The Edinburgh human metabolic network reconstruction and its functional analysis. Molecular Systems Biology 3.

4. Varma A & Palsson BO (1994) Metabolic flux balancing: basic concepts, scientific and practical use. Bio/Technology 12: 994-998.

5. Chang C-C & Lin C-J (2011) LIBSVM: a library for support vector machines. ACM Transactions on Intelligent Systems and Technology 2: 27:1-27:27. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm
