A data mining approach based on machine learning techniques to

advertisement
A Data Mining Approach Based on
Machine Learning Techniques to Classify
Biological Sequences
M. Maddouri , M. Elloumi
Knowledge-Based Systems 15 (2002) 217~223
Abstract
Data mining approaches can be helpful to analyze primary structures in
order to extract useful knowledge.
1. Introduction
This approach uses four steps.
1.Construct the set of the discriminant substring, called ‘discriminant
descriptor’ (DD), associated with each family of primary structures.
2.Construct a context which is a table of examples vs attributes.
3.Extract knowledge from the context and represent it by production rules.
4.Use these production rules to do classification of primary structures.
2. Definitions and notations
Let f1, f2 ,…, fn be families of strings and x be a substring of a string of fi ,
1 i  n.
1. The substring x is discriminant between fi and  j  i fj .
2. A discriminant x is minimum.
3. DD, denoted by i( f 1, f 2,..., fn) , the set of discriminant and minimum
substirngs. And k-DD
4. i ( f 1, f 2,..., fn)   k  i ( f 1, f 2,..., fn)
k 0
5. A context is a triplet (F ,  , R) , where F =
i 1 fi ,  
n
n

i 1
i , R is
a binary relation defined between F and  such that (w , s )  R , w  F
and s   , when the string w contains the substring s.
3. Approach
Let f1 = {cabbdbea , eacdabaea} , f2 = {bedeabbad , daaadbbd}.
1.Construction of the DDs
1  1( f 1, f 2)  {c} and 1   2( f 1, f 2)  
2  1( f 1, f 2)  {ae} and 2   2( f 1, f 2)  {aa, ed , ad , de}
3  1( f 1, f 2)  {bea, aba, bdb, dab, dbe} and 3   2( f 1, f 2)  {bba, eab}
4  1( f 1, f 2)  {abbd } and 4   2( f 1, f 2)  
2.Construction of the context
Construct a matrix R , of size F x  , such that R[ i, j ] = 1 if the
i  th string of F contains the j  th substring of  else R[ i ,j] = 0.
3.Induction of rules
Use a punishment/award algorithm to computes rules weights by using
string whose families are known.
During the classification process, we use the applicable rule having the
greatest weight.
4.Classification of primary structures
4.Coclusion
Mining biological data can help biologists to do classification in
biological sequences. And this approaches give such explanation and, hence,
can be helpful to improve biologists know-how and knowledge.
Created by : Yi-Yao Huang
Date : Nov. 14, 2002
Download