Data Mining for Prediction of Genes Associated with a given Biological Function
William Perrizo
ABSTRACT:
Comprehensive genome databases provide several catalogues of information on some genomes,
such as yeast (Saccharomyces cerevisiae). We show the potential for using gene annotation data, which
includes phenotype, localization, protein class, complexes, enzyme catalogues, pathway information,
and protein-protein interaction, in predicting the functional class of yeast genes. We predict a rank
ordered list of genes for each functional class from available information and determine the relevance of
different input data through systematic weighting using a genetic algorithm. The classification task has
many unusual aspects, such as multi-valued attributes, classification into overlapping classes, a
hierarchical structure of attributes, many unknown values, and interactions between genes. We use a
vertical data representation called the P-tree representation, that treats each domain value as an attribute.
The weight optimization uses ROC evaluation in the fitness measure to handle the sparseness of the
data. We include the number of interactions as a prediction criterion, based on the observation that
important genes with many interactions are more likely to have some function. The result of the weight
optimization confirms this observation.
Keywords: data mining, prediction, classification, function, similarity, genetic algorithms, P-tree
1. INTRODUCTION
During the last three decades, the emergence of molecular and computational biology has shifted
classical genetics research toward studying the genome structure of different organisms at the
molecular level. As a result, the rapid development of more accurate and powerful tools for this purpose has
produced overwhelming volumes of genomic data that is being accumulated in more than 300 biological
databases. Gene sequencing projects on different organisms [1] have helped to identify tens of thousands
of potential new genes, yet their biological function often remains unclear. Different approaches have
been proposed for large-scale gene function analysis. Traditionally functions are inferred through
sequence similarity algorithms [2] such as BLAST or PSI-BLAST [3]. Similarity searches have some
shortcomings. The function of genes that are identified as similar may be unknown, or differences in the
sequence may be so significant as to make conclusions unclear. For this reason some researchers use
sequence as input to classification algorithms to predict function [4]. Another common approach to
function prediction uses two steps. Genes are first clustered based on similarity in expression data, and
then clusters are used to infer function from genes in the same cluster with known function [5].
Alternatively function has been directly predicted from gene expression data using classification
techniques such as Support Vector Machines [6].
We show the potential for the use of gene annotation data, which includes phenotype,
localization, protein class, complexes, enzyme catalogues, pathway information, and protein-protein
interactions, in predicting the functional class of yeast genes. Phenotype data has been used to construct
individual rules that can predict function for certain genes based on the C4.5 decision tree algorithm [7].
The gene annotation data we used was extracted from the Munich Information Center for Protein
Sequences (MIPS) database [8] and then processed and formatted in a way that fits our purpose. MIPS
has a genomic repository of data on Saccharomyces cerevisiae (yeast). Functional classes from the
MIPS database include, among many others, Metabolism, Protein Synthesis, and Transcription. Each of
these is in turn divided into classes and then again into subclasses to yield a hierarchy of up to
five levels. Each gene may have more than one function associated with it. In addition, MIPS has
catalogues with other information such as phenotype, localization, protein class, protein complexes,
enzyme catalogue and data on protein-protein interactions. In this paper, we predict function for genes at
the highest level in the functional hierarchy. Despite the fact that yeast is one of the most thoroughly
studied organisms, the function of 30-40% of its ORFs currently remains unknown. For about 30% of
the ORFs no information whatsoever is available, and for the remaining ones unknown attributes are
very common. This lack of information creates interesting challenges that cannot be addressed with
standard data mining techniques. We have developed novel tools with the hope of helping biologists by
providing experimental direction concerning the function of genes with unknown functions.
2. EXPERIMENTAL METHOD
Gene annotations are atypical data mining attributes in many ways. Each of the MIPS
catalogues has a hierarchical structure. Each attribute, including the class label, gene function, is
furthermore multi-valued. We therefore have to consider each domain value as a separate binary
attribute rather than assigning labels to protein classes, phenotypes, etc. For the class label this means
that we must classify into overlapping classes, which is also referred to as multi-label classification,
rather than multi-class classification in which the class label is disjoint. To this end we represent each
domain value of each property as a binary attribute that is either one (gene has the property) or zero
(gene does not have the property). This representation has some similarity to bit-vector representations in
Market Basket Research, in which the items in a shopping cart are represented as 1-values in the
bit-vector of all items in a store. The classification problem is correspondingly broken up into a separate
binary classification problem for each value in the function domain. The resulting classification problem
has more than one thousand binary attributes, each of which is very sparse. Two attribute values should
be considered related if both are 1, i.e., both genes have a particular property. Not much can be
concluded if two attribute values are 0, i.e., both genes do not have a particular property.
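To make this encoding concrete, the following Python sketch (with hypothetical gene names and catalogue values, not actual MIPS entries) flattens multi-valued catalogue data into one binary attribute per domain value:

    # Hypothetical multi-valued catalogue data for two genes.
    GENES = {
        "gene1": {"phenotype": {"lethal"}, "localization": {"nucleus"}},
        "gene2": {"phenotype": {"slow growth"},
                  "localization": {"cytoplasm", "nucleus"}},
    }

    def binarize(genes):
        # One binary attribute per (catalogue, value) pair observed in the data.
        domain = sorted({(cat, val)
                         for props in genes.values()
                         for cat, vals in props.items()
                         for val in vals})
        # Each gene becomes a sparse 0/1 vector over that domain.
        table = {g: [int(val in props.get(cat, set())) for (cat, val) in domain]
                 for g, props in genes.items()}
        return domain, table

    domain, table = binarize(GENES)
    print(table["gene2"])   # [1, 1, 0, 1]: cytoplasm, nucleus, slow growth

On the real data set this expansion yields the more than one thousand sparse binary columns described above.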
Classification is furthermore influenced by the hierarchical nature of the attribute domains. Many
models exist for the integration of hierarchical data into data mining tasks, such as text classification,
mining of association rules, and interactive information retrieval, among others [19], [20], [21], [22].
Recent work [17] introduces similarity measurements that can exploit a hierarchical domain, but focuses
on the case where matching attributes are confined to be at the leaf level of the hierarchy. The data set
we consider in this paper has poly-hierarchical attributes, where attributes must be matched at multiple
levels. Hierarchical information is represented using a separate set of attribute bit columns for each level
where each binary position indicates the presence or the absence of the corresponding category.
Evidence for the use of multiple binary similarity metrics is available in the literature, and the choice
is based on computability and the requirements of the application [25]. In this work we use a similarity
metric identified in the literature as “Russell-Rao”, whose definition is given below [25].
Given two binary vectors Z_i and Z_j with N dimensions (categorical values),

similarity(Z_i, Z_j) = \frac{\sum_{k=0}^{N-1} Z_{i,k} \cdot Z_{j,k}}{N}
The above similarity measure counts the number of positions at which both binary vectors are 1 and
divides by the total number of bits. A similarity metric is defined as a function that assigns a value to the
degree of similarity between objects i and j. For each similarity measure the corresponding dissimilarity
is defined as the distance between two objects; to be categorized as a metric it should be non-negative,
symmetric, reflexive, and satisfy the triangle inequality. Dissimilarity functions that show partial
conformity to the above properties are considered pseudo-metrics. The dissimilarity corresponding to
Russell-Rao similarity only satisfies the triangle inequality when M < N/2, where N is the number of
dimensions and M is max(X·Y) [25]. For this application we find the use of the above appropriate. It is
also important to note that in this application the existence of a categorical attribute for a given object is
more informative than the nonexistence of an attribute; in other words, a “1” is more valuable than a “0”
in the data, so the count of matching “1”s is more important for the task than the count of matching “0”s.
The P-tree data structure we use also allows us to easily count the number of matching “1”s with the use
of a root count operation.
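As an illustration, the following Python sketch computes the Russell-Rao similarity with machine integers standing in for the binary vectors; the popcount of the bitwise AND plays the role of the P-tree root count, which an actual P-tree obtains from the compressed tree rather than from a bit scan:

    def russell_rao(zi: int, zj: int, n: int) -> float:
        # Matching 1-bits (the AND keeps positions where both are 1),
        # divided by the number of dimensions n.
        return bin(zi & zj).count("1") / n

    # Two hypothetical 10-dimensional vectors (bit k encodes value Ak).
    zi = 0b1000100101   # values {A0, A2, A5, A9}
    zj = 0b0000000101   # values {A0, A2}
    print(russell_rao(zi, zj, 10))   # 0.2: two matching 1-bits out of ten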
Similarity is calculated from the matches at each individual level of the hierarchy. The total
similarity is the weighted sum of the individual similarities at each level, so the total weight for
attributes that match at multiple levels is higher, indicating a closer match. Counting
matching values corresponds to a simple additive similarity model. Additive models are often preferred
for problems with many attributes because they can better handle the low density in attribute space, also
referred to as "curse of dimensionality" [9].
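Under stated assumptions (illustrative level boundaries and weights; the real boundaries come from the stored meta-information and the real weights from the optimization of Section 2.1), the level-weighted similarity can be sketched as:

    # Bit ranges of each hierarchy level in the flattened vector,
    # and hypothetical per-level weights.
    LEVEL_BOUNDS = [(0, 4), (4, 12), (12, 30)]
    LEVEL_WEIGHTS = [1.5, 1.0, 0.5]

    def level_weighted_similarity(a: int, b: int) -> float:
        # Weighted sum of matching 1-bits per level; attributes that
        # match at several levels contribute once at each of them.
        total = 0.0
        for (lo, hi), w in zip(LEVEL_BOUNDS, LEVEL_WEIGHTS):
            mask = ((1 << (hi - lo)) - 1) << lo   # bits of this level
            total += w * bin(a & b & mask).count("1")
        return total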
2.1 Similarity & Weight Optimization
Similarity models that consider all attributes as equal, such as K-Nearest-Neighbor classification
(KNN), work well when all attributes are similar in their relevance to the classification task. This is,
however, often not the case. The problem is particularly pronounced for categorical attributes that can
only have two distances, namely distance 0 if attributes are equal and 1 or some other fixed distance if
attributes are different. Many solutions have been proposed that weight dimensions according to their
relevance to the classification problem. The weighting can be derived as part of the algorithm [10]. In
an alternative strategy the attribute dimensions are scaled, using, e.g., a genetic algorithm, to optimize
the classification accuracy of a separate algorithm, such as KNN [11]. Our algorithm is similar to the
second approach, which is slower but more accurate. Modifications were necessary due to the nature of
the data. Because class label values had a relatively low probability of being 1, we chose to use AROC
values instead of accuracy as the criterion for the optimization [12]. Nearest neighbor evaluation was
replaced by the counting of matches as described above. We furthermore included importance measures
into the classification that are entirely independent of the neighbor concept. We evaluate the importance
of a gene based on the number of possible genetic and physical interactions its protein has with the
proteins of other genes. Interactions with lethal genes, i.e., genes that cannot be removed in gene
deletion experiments because the organism cannot survive without them, were considered separately.
The number of items of known information, such as localization and protein class, was also considered
as an importance criterion.
2.2 ROC Evaluation
Many measures of prediction quality exist, with the best-known one being prediction accuracy.
There are several reasons why accuracy is not a suitable tool for our purposes. One main problem
derives from the fact that commonly only a few genes are involved in (i.e., positive for) a given function.
This leads to large fluctuations in the number of correctly predicted participant genes (true positives).
Secondly, we would like to get a ranking of genes rather than a strict separation into participant and
non-participant, since our results may have to be combined with independently derived experimental
probability levels. Furthermore, we have to account for the fact that not all functions of all genes have
been determined yet. Similarly, there may be genes that are similar to ones that are involved in the
function, but are not experimentally seen as such due to masking. Therefore it may be more important
and feasible to recognize a potential candidate than to exclude an unlikely one. This corresponds to the
situation faced in hypothesis testing: a false negative, i.e., a gene that is not recognized as a candidate,
is considered more important than a false positive, i.e., a gene that is considered a candidate although it
is not involved in the function.
The method of choice for this type of situation is ROC (Receiver Operating Characteristic)
analysis [12]. ROC analysis is designed to determine the quality of prediction of a given property, such
as a gene being involved in a phenotype. Samples that are predicted as positive and indeed have that
property are referred to as true positive samples, samples that are negative, but are incorrectly classified
as positive, are false positive. The ROC curve depicts the rate of true positives as a function of the false
positive rate for all possible probability thresholds. A measure of quality of prediction is the area under
the ROC curve. Our prediction results are all given as values for the area under the ROC curve
(AROC). To construct a ROC curve samples are ordered in decreasing likelihood of being positive.
The threshold that delimits prediction as positive is then continuously varied. If all true positive samples
are listed first, the ROC curve will start out by following the y-axis until all positive samples have been
plotted, and then continue horizontally for the negative samples. With appropriate normalization the
area under this curve is 1. If samples are listed in random order, the true positive rate and the false
positive rate will grow equally, and the ROC curve will be a diagonal with area 0.5. The
following table gives some examples of ordered samples and the respective AROC value.
Ordered samples (most likely positive first):

    AROC = 1:     TRUE, TRUE, FALSE, FALSE, FALSE, FALSE
    AROC = 0.625: FALSE, TRUE, FALSE, TRUE, FALSE, FALSE
    AROC = 0.5:   FALSE, TRUE, FALSE, FALSE, TRUE, FALSE

Table 1: Example sample orderings and the respective area under the ROC curve (AROC).
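This construction translates directly into code. The sketch below is a minimal Python version that assumes both classes are present in the list; the pair-counting form used here is equivalent to the area under the thresholded curve:

    def aroc(ranked):
        # 'ranked' lists labels from most to least likely positive.
        # Count the fraction of (positive, negative) pairs in which
        # the positive sample is ranked above the negative one.
        correctly_ordered = positives = negatives_below = 0
        for label in reversed(ranked):   # walk from the bottom of the list
            if label:
                positives += 1
                correctly_ordered += negatives_below
            else:
                negatives_below += 1
        negatives = len(ranked) - positives
        return correctly_ordered / (positives * negatives)

    # The middle example from Table 1:
    print(aroc([False, True, False, True, False, False]))   # 0.625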
2.3 Data Representation
2.3.1 P-TREE¹ VERTICAL DATA REPRESENTATION
The input data was converted to P-trees [13], [14], [15], [16]. P-trees are a lossless, compressed,
and data-mining-ready data structure. This data structure has been successfully applied in data mining
applications ranging from classification and clustering with k-nearest-neighbor, to classification with
decision tree induction, to association rule mining on real-world data [13], [14], [15]. A basic P-tree
represents one attribute bit that is reorganized into a tree structure by recursive sub-division, while
recording the predicate truth value regarding purity for each division. Each level of the tree contains
truth bits that represent pure sub-trees and can then be used for fast computation of counts. This
construction is continued recursively down each tree path until a sub-division is reached that is
entirely pure. The basic and complement P-trees are combined using Boolean algebra operations to
produce P-trees for values, entire tuples, value intervals, or any other attribute pattern. The root count of
any pattern tree indicates the occurrence count of that pattern. The P-tree data structure thus provides an
efficient structure for counting patterns. The data representation can be conceptualized as a flat table in
which each row is a bit vector containing a bit for each attribute, or part of an attribute, for each gene.
Representing each attribute bit as a basic P-tree generates a compressed form of this flat table.

¹ P-tree technology is patent pending. This work is partially supported by GSA Grant ACT#: K96130308.
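As a toy illustration (assuming a one-dimensional bit sequence and binary fan-out, simpler than the division scheme of actual P-trees), the purity-recording construction and the root-count operation look as follows:

    class ToyPTree:
        # Simplified P-tree: recursive halving of a bit sequence,
        # recording purity (1 = all ones, 0 = all zeros) at each node.
        def __init__(self, bits):
            self.size = len(bits)
            if all(bits):
                self.pure, self.children = 1, None
            elif not any(bits):
                self.pure, self.children = 0, None
            else:
                mid = self.size // 2   # mixed segment: subdivide
                self.pure = None
                self.children = (ToyPTree(bits[:mid]), ToyPTree(bits[mid:]))

        def root_count(self):
            # Number of 1-bits, computed from pure nodes only,
            # without scanning individual bits.
            if self.pure is not None:
                return self.pure * self.size
            return sum(c.root_count() for c in self.children)

    print(ToyPTree([1, 1, 1, 1, 0, 0, 1, 0]).root_count())   # 5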
The following figure shows how an attribute from a poly-hierarchical domain is encoded into the
P-tree data structure (bit vector form). A poly-hierarchical attribute groups a set of attributes in a
recursive sub-hierarchy where each node represents an individual attribute from the attribute domain and
no restrictions are imposed on the number of nodes or number of levels. In our representation the
hierarchy is encoded in a depth-first manner. The level boundaries with respect to the P-tree index are
preserved as meta information for the attributes. This information is used in computing the weighted
sum of the similarities between objects.
[Figure 1: (a) An example poly-hierarchical attribute domain with nodes A0 through A9. (b) An example
data attribute object O(A0, A2, A5, A9) drawn from this domain. (c) The bit-vector representation of the
object: O(1, 0, 1, 0, 0, 1, 0, 0, 0, 1). (d) The links that represent the data attribute object in the hierarchy.]
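The depth-first encoding can be sketched as follows; the parent/child links below are hypothetical stand-ins for Figure 1(a), chosen so that the depth-first order visits A0 through A9:

    # Hypothetical domain hierarchy (depth-first order: A0, A1, ..., A9).
    ROOTS = ["A0", "A6"]
    CHILDREN = {"A0": ["A1", "A4"], "A1": ["A2", "A3"], "A4": ["A5"],
                "A6": ["A7", "A9"], "A7": ["A8"]}

    def dfs_index(roots, children):
        # Assign every node a bit position in depth-first order.
        index, stack = {}, list(reversed(roots))
        while stack:
            node = stack.pop()
            index[node] = len(index)
            stack.extend(reversed(children.get(node, [])))
        return index

    def encode(attrs, index):
        # Bit-vector representation: 1 at the position of each attribute.
        bits = ["0"] * len(index)
        for a in attrs:
            bits[index[a]] = "1"
        return "".join(bits)

    idx = dfs_index(ROOTS, CHILDREN)
    print(encode({"A0", "A2", "A5", "A9"}, idx))   # 1010010001, as in Figure 1(c)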
Experimental class labels and the other categorical attributes were each encoded in single bit
columns. Protein interaction was encoded using a bit column for each possible gene in the data set,
where the existence of an interaction with that particular gene was indicated with a truth bit.
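A small sketch of this interaction encoding, with hypothetical gene names:

    # One bit position per gene in the data set.
    GENE_INDEX = {"gene1": 0, "gene2": 1, "gene3": 2, "gene4": 3}

    def interaction_bits(partners):
        # Bit column over all genes; 1 where an interaction is recorded.
        bits = ["0"] * len(GENE_INDEX)
        for p in partners:
            bits[GENE_INDEX[p]] = "1"
        return "".join(bits)

    print(interaction_bits({"gene2", "gene4"}))   # 0101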
2.4 Implementation
The work presented in this paper required data extraction from the MIPS database, data cleaning,
developing a similarity metric, and optimizing that metric. The following figure shows an outline
of the approach.
[Figure 2: Outline of the approach. Gene data is retrieved from the MIPS database by HTML data
extractors and, with domain knowledge, cleaned and augmented with derived attributes; the genes are
split into train and test lists. A nearest-neighbor predictor computes similarity as a weighted sum of
matching attributes and importance, its ranked output is evaluated with ROC analysis, and a standard
GA adjusts the weights Wi, yielding a predictor with optimized weights and the final prediction.]
Gene data from the MIPS database was retrieved using HTML data extractors. With assistance
from domain knowledge a few tools were developed to clean the data and also extract some derived
attributes as training data. One of the steps we took in data cleaning consisted in extracting the Function
sub-hierarchy “Sub-cellular localization” from the MIPS Function hierarchy and treating it as a property
in its own right. The basis for this step was the recognition that Sub-cellular Localization represents
localization information rather than function information. The information in Sub-cellular Localization
can be seen as somewhat redundant to the separate localization hierarchy. We didn’t explicitly decide in
favor of either but rather left that decision to the weight optimization algorithm. Sub-cellular
Localization could be seen to perform somewhat better than the original localization information,
suggesting that the data listed under Sub-cellular Localization may be cleaner than localization data.
The following equation summarizes the computation using P-trees, where Rc: root count, W: weight,
P_x: P-tree for attribute x (P_g is the interaction P-tree for gene g), At: attribute count, ptn: class
partition, Im: gene importance, Lth: lethal gene (cannot be removed in gene deletion experiments),
g: test gene to be classified, f: feature, Ip: interaction property, gn: genetic interaction (derived in vivo),
ph: physical interaction (derived in vitro), lt: lethal, ClassEvl: evaluated value for classification,
\wedge: P-tree AND operator.

ClassEvl_{ptn}(g) = \sum_{f \in g} W_f \cdot Rc(P_{ptn} \wedge P_f) + \sum_{Ip \in \{gn, ph, lt\}} W_{Ip} \cdot Rc(P_g \wedge P_{Ip}) + W_{At} \cdot At(g)

Equation 1: Prediction function for gene g.
For each feature attribute (f) of each test gene, g, the count of matching features for the required
partition was obtained from the root-count by ANDing the respective P-trees. We can obtain the number
of lethal genes interacting with a particular gene, g, with one P-tree AND operation. It is possible to
retrieve the required counts without a database scan.
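A sketch of Equation 1 follows, with plain bit vectors standing in for P-trees (so Rc becomes the popcount of a bitwise AND); the containers for weights, features, and interactions are illustrative stand-ins for the actual data structures:

    def rc(p: int, q: int) -> int:
        # Root count of the AND of two bit vectors (popcount stand-in
        # for the P-tree root-count operation).
        return bin(p & q).count("1")

    def class_evl(p_gene, p_partition, features, interactions, weights, at_count):
        # features: {f: bit vector} for the feature values present in the
        # test gene; interactions: bit vectors for gn, ph, and lt.
        score = sum(weights[f] * rc(p_partition, p_f)
                    for f, p_f in features.items())
        score += sum(weights[ip] * rc(p_gene, p_ip)
                     for ip, p_ip in interactions.items())
        score += weights["At"] * at_count   # attribute-count term
        return score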
Due to the diversity of the function class and the nature of the attribute data, we need a classifier
that does not require a fixed importance measure on its features, i.e., we need an adaptable mechanism
to arrive at the best possible classifier. In our approach we optimize the weight space, W, with a
standard Genetic Algorithm (GA) [23]. The set of weights on the features represented the solution space
that was encoded for the GA. The AROC value of the classified list was used as the GA fitness evaluator
in the search for an optimal solution. AROC evaluation provides a smoother and more accurate
evaluation of the success of the classifier than standard accuracy, which is a vital ingredient for a
successful hill climb.
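A minimal sketch of this optimization loop follows, reusing the aroc routine of Section 2.2 and a scoring function such as class_evl from Equation 1. The selection, crossover, mutation, and weight-range details are illustrative choices for a standard GA; 40 individuals over 100 generations match the 100x40 evaluations reported in Section 3:

    import random

    def fitness(weights, train, score):
        # AROC of the training genes ranked by the weighted score;
        # train is a list of (gene record, class label) pairs.
        ranked = sorted(train, key=lambda gl: score(gl[0], weights), reverse=True)
        return aroc([label for _, label in ranked])

    def optimize(train, score, n_weights, pop_size=40, generations=100):
        population = [[random.uniform(0.0, 2.0) for _ in range(n_weights)]
                      for _ in range(pop_size)]
        for _ in range(generations):
            population.sort(key=lambda w: fitness(w, train, score), reverse=True)
            parents = population[:pop_size // 2]   # truncation selection
            children = []
            while len(parents) + len(children) < pop_size:
                a, b = random.sample(parents, 2)
                cut = random.randrange(1, n_weights)   # one-point crossover
                child = a[:cut] + b[cut:]
                # Point mutation: redraw one weight at random.
                child[random.randrange(n_weights)] = random.uniform(0.0, 2.0)
                children.append(child)
            population = parents + children
        return max(population, key=lambda w: fitness(w, train, score))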
3. RESULTS
We demonstrate the prediction capability of the proposed algorithm through simulations that
predict different function classes at the highest level in the hierarchy. Class sizes range from small
classes with only very few genes, e.g., Protein activity regulation with 6 genes, to large classes such as
Transcription with 769 genes, or more than 10% of the genome. We compare our results with [7] since
this is the most closely related work we have found that uses gene annotation data (phenotype).
The differences with respect to our approach are, however, such that comparison is difficult. In [7] a set
of rules for function classification is produced, and mean accuracies and the biological relevance of
some of the significant rules are reported. The paper does not have the goal of producing an overall high
prediction accuracy, and therefore does not evaluate such an accuracy. Since we build a single
classifier for each function class, we must compare our results with the best rule for each function class.
We were able to download the rule output files referred to in [7] and pick the best accuracy values
recorded for each type of classification. These values were used for the comparison with our classifier. As
discussed previously we strongly discourage the use of standard accuracy calculation for the evaluation
of classification techniques, but use it here for comparison purposes. [7] does not report rules for all the
functional classes we have simulated. ‘ROC’ and ‘Acc’ indicate the accuracy indicators for our
approach for the respective function classes. Accuracy values for the comparative study [7] on
functional classification are given as ‘CK_Acc’. As can be seen in the following figure, for the classes
where accuracy values are available our approach performs better in all cases except Metabolism.
[Figure 3: Classification accuracy. For each top-level function class (Protein Synthesis, Transport
Facilitation, Transcription, Cellular Transport and Transport Mechanisms, Energy, Cellular
Communication/Signal Transduction Mechanism, Cell Cycle and DNA Processing, Regulation of /
Interaction with Cellular Environment, Cell Fate, Protein Activity Regulation, Control of Cellular
Organization, Metabolism), the chart plots our ROC and Acc values against CK_Acc from [7], on an
accuracy axis from 0 to 0.8.]
As observed in previous work [7], [24], different functional classes are predicted better by different
attributes. This is evident in the final optimized weights given in appendix table A1 for the classifier.
The weights do not show a strong pattern indicating a global set of weights that could work for all the
classifications. Overall the protein-class feature had higher weights in predicting the functions for most
of the functional categories, followed by complexes. In the protein synthesis functional category, the
combined similarity in complexes (weight = 1.0), protein class (weight = 1.033), and phenotype
(weight = 1.2) leads to significant prediction accuracy with AROC = 0.90. Transport facilitation had a
high AROC value of 0.87 from the combined similarity in protein class (weight = 1.867), localization
(weight = 1.46), and complexes (weight = 1.1). This observation highlights the importance of using protein
class and complexes as possible experimental features for biologists in the identification of gene
functions.
We include the number of interactions in such a way that many interactions directly favor
classification into any of the function classes. This contribution to the classification is not directly based
on similarity with training samples, and is thereby fundamentally different from the similarity-based
contributions. This is an unusual procedure that is justified by the observation that it improves
classification accuracy, which can be seen from the non-zero weights for interaction numbers. From a
biological perspective the number of interactions is often associated with the importance of a gene. An
important gene is more likely to be involved in many functions than an unimportant one. The weights
in table A1 show that for all function classes interactions benefited classification.
It is worth analyzing whether the effect is simply a result of the number of experiments done on a gene.
If many experiments have been done many properties may be known and consequently the probability
of knowing a particular function may be higher. Such an increased probability of knowing a function
would not help us in predicting unknown functions of genes on which few experiments have been done.
To study this we tried including the total number of known items of information, such as the total
number of known localizations, protein classes, etc. as a component. The weights of this component
were almost all zero as can be seen in table A1. This suggests that the number of interactions is indeed a
predictor of the likelihood of a function rather than of the likelihood of knowing about a function.
Use of a genetic algorithm requires the evaluation of the fitness function on the training data.
This requires a data scan for each evaluation. In this work the reported results were obtained by 100x40
evaluations for the genetic algorithm for each function class prediction. With the P-tree based
implementation each function class prediction takes 0.027 milliseconds on an Intel P4 2.4 GHz machine
with 2 GB of memory. The corresponding time for a horizontal implementation is 0.834 milliseconds. In
the horizontal approach the algorithm needs to go through all the attributes of a gene blindly to compute
the matching attribute count for a given gene. In contrast, in the vertical approach we can count the
matching attributes by consulting only the P-trees for those attributes that are present for the given gene to
be predicted. The large difference in compute time clearly indicates the applicability of the vertical
approach for this application.
4. CONCLUSION
We were able to demonstrate the successful classification of gene function from gene annotation
data. The multi-label classification problem was solved by a separate evaluation of rank ordered lists of
genes associated with each function class. A similarity measure was calculated using weighted counts of
matching attribute values. The weights were optimized through a Genetic Algorithm. Weight
optimization was shown to be indispensable as different functions were predicted by strongly differing
sets of weights. The AROC value was shown to be a successful fitness evaluator for the GA. Results of
the weight optimization can be directly used in giving biologists an indication to what defines a
particular function.
Furthermore, it was interesting to note that quantitative information, such as the number of
interactions, played a significant role. In summary we found that our systematic weighting approach and
P-tree representation allowed us to evaluate the relevance of a rich variety of attributes.
5. REFERENCES
[1] Beck, S. and Sterk, P. Genome-scale DNA sequencing: where are we? Curr. Opin. Biotechnol.
9, 116-120, 1998.
[2] Pellegrini, M., Marcotte, E. M., Thompson, M. J., Eisenberg, D., and Yeates, T. O. Assigning protein
functions by comparative genome analysis: protein phylogenetic profiles. Proc. Natl Acad. Sci.
USA 96, 4285-4288, 1999.
[3] Altschul, S. F., Madden, T. L., Schäffer, A. A., Zhang, J., Zhang, Z., Miller, W., and Lipman, D. J.
Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic
Acids Res. 25, 3389-3402, 1997.
[4] King, R. D., Karwath, A, Clare, A., and Dehaspe, L. The utility of different representations of
protein sequence for predicting functional class. Bioinformatics 17 (5), 445-454, 2001.
[5] Wu, L., Hughes, T. R., Davierwala, A. P., Robinson, M. D., Stoughton, R., and Altschuler, S. J.
Large-scale prediction of Saccharomyces cerevisiae gene function using overlapping transcriptional
clusters. Nature Genetics 31, 255-265, 2002.
[6] Brown, M., Noble Grundy, W., Lin, D., Cristianini, N., Sugnet, C., Furey, T. S., Ares, M. Jr., and
Haussler, D. Knowledge-based analysis of microarray gene expression data using support vector
machines. Proceedings of the National Academy of Sciences USA 97(1), 262-267, 2000.
[7] Clare, A. and King, R. D. Machine learning of functional class from phenotype data.
Bioinformatics 18(1), 160-166, 2002.
[8] http://mips.gsf.de/
[9] Hastie, T., Tibshirani, R., and Friedman, J. The elements of statistical learning: data mining,
inference, and prediction, Springer-Verlag, New York, 2001.
[10] Cost, S. and Salzberg, S., A weighted nearest neighbor algorithm for learning with symbolic
features, Machine Learning, 10, 57-78, 1993.
[11] W.F. Punch, E.D. Goodman, M. Pei, L. Chia-Shun, P. Hovland, and R. Enbody, Further research
on feature selection and classification using genetic algorithms, Proc. of the Fifth Int. Conf. on
Genetic Algorithms, pp 557-564, San Mateo, CA, 1993.
[12] Provost, F., Fawcett, T., and Kohavi, R., The Case Against Accuracy Estimation for Comparing
Induction Algorithms, 15th Int. Conf. on Machine Learning, pp. 445-453, 1998.
[13] Ding, Q., Ding, Q., Perrizo, W., ARM on RSI Using P-trees, Pacific-Asia KDD Conf., pp. 66-79,
Taipei, May 2002.
[14] Ding, Q., Ding, Q., Perrizo, W., Decision Tree Classification of Spatial Data Streams Using Peano
Count Trees, ACM SAC, pp. 426-431, Madrid, Spain, March 2002.
[15] Khan, M., Ding, Q., Perrizo, W., KNN on Data Stream Using P-trees, Pacific-Asia KDD, pp. 517-528,
Taipei, May 2002.
[16] Perrizo, W., Peano Count Tree Lab Notes, CSOR-TR-01-1, NDSU, Fargo, ND, 2001.
[17] Prasanna Ganesan, Hector Garcia-Molina, and Jennifer Widom. Exploiting Hierarchical Domain
Structure to Compute Similarity. ACM Transactions on Information Systems, Vol. 21, No. 1,
January 2003, pages 64-93.
[18] Qin Ding, Maleq Khan, Amalendu Roy and William Perrizo. The P-tree Algebra. Proceedings of
ACM Symposium on Applied Computing (SAC'02), Madrid, Spain, March 2002, pp. 426-431.
[19] Feldman, R. and Dagan, I. 1995. Knowledge discovery in textual databases. In Proceedings of KDD-95.
[20] Han, J. and Fu, Y. 1995. Discovery of multiple-level association rules from large databases. In
Proceedings of VLDB ’95, 420-431.
[21] Scott, S. and Matwin, S. 1998. Text classification using WordNet hypernyms. In Proceedings of the
Workshop on the Use of WordNet in Natural Language Processing Systems, Association for
Computational Linguistics.
[22] Srikant, R. and Agrawal, R. 1995. Mining generalized association rules. In Proceedings of VLDB
’95, 407-419.
[23] Goldberg, D. E., Genetic Algorithms in Search, Optimization, and Machine Learning, Addison-
Wesley, 1989.
[24] Pavlidis, P., J. Weston, J. Cai, and W. Grundy (2001). Gene functional classification from
heterogeneous data. In Proceedings of the Fifth International Conference on Computational
Molecular Biology (RECOMB 2001).
[25] Tubbs, J. D. A note on binary template matching. Pattern Recognition, 22(4), 359-365, 1989.
APPENDIX

[Table A1: Attribute-based convergent weights from the Genetic Algorithm optimization step. Rows give
the optimized weight of each attribute (Location, Subcellular Location, Protein Class, Protein Complex,
Pathway, Enzyme Catalogue, Phenotype, Lethal Interactions, Genetic Interactions, Physical Interactions,
Protein-Protein Interaction, Attribute Count, Lethal Gene) for each of the twelve top-level function
classes shown in Figure 3. The final rows report the prediction AROC per class (0.895, 0.873, 0.836,
0.752, 0.679, 0.677, 0.675, 0.623, 0.554, 0.503, 0.458, 0.428) and the number of genes in each class
(331, 312, 769, 184, 78, 20, 284, 106, 158, 6, 89, 750), with Protein Synthesis (AROC 0.895) and
Transport Facilitation (AROC 0.873) heading the columns.]