Data Mining Project: Reversed Genetics

Data Mining Project: Reversed Linkage
(541 project for 2002 Fall Semester)
A standard problem in functional genomics today is this: given a disease (a so-called
phenotype), find candidate “genes” which can cause it. To do this,
researchers collect data in the form of family trees of people affected by the disease (so-called
pedigrees), and for each individual they obtain a genotype in the form of
so-called markers, which define genes on the chromosome. Typically, tens of such genes
are defined today for each individual, but in the near future, thanks to new marking
techniques, the number of such genes may run into the thousands (so-called SNP
markers). In addition to the genotypical data, so-called phenotypical data is collected as
well. Phenotypical data consists of attributes which describe clinical data as well as
habits (smoker, drinker, etc.) for that individual. Typically, each individual in a pedigree
is characterized by possibly hundreds of such phenotypical attributes in addition to tens
of genotypical ones. Over the last decade or so, there has been considerable success in
identifying, or finding the approximate location of, genes causing (or partially causing) such
diseases as Huntington's disease, breast cancer and cystic fibrosis. The success of such
methods depends very much on the “luck” of finding the right data set: the right set of
pedigrees with a significant number of affected (sick) people in them.
What we term “reversed linkage” is the question:
What possible phenotypes can be caused by a given gene?
In other words, rather than looking for a gene for a given disease, we start from
a given gene and ask which diseases can be caused by that gene. Thus we go from
gene to phenotype (disease) rather than from phenotype (disease) to gene.
This is just an explanation for those who are interested in the motivation behind our
project. The project itself requires no knowledge of genetics at all.
Our project will rely entirely on simulated data. That data will come from our GenMine
pedigree simulator. The simulator will generate pedigrees with up to 15 individuals in
them. Each individual will be characterized by a feature vector: a vector of N binary
features corresponding to physical/clinical characteristics. Additionally, each pedigree,
which from the project perspective is simply a set of individuals, will have special subsets
of individuals called consistent sets. Consistent sets will be further indexed by genes
ranging from 1 to L. Genes are fixed and the same for each pedigree. Pedigrees,
on the other hand, do not share individuals (thus individual “2” in pedigree one is different
from individual “2” in pedigree three). For each gene I the pedigree simulator will
generate a number of consistent sets C1…Cm, all of them subsets of the individuals of a
given pedigree P. Genetically speaking (which is irrelevant for the project), given a gene
I, each of the Ck can be “explained” by a dominant model on that gene. What is
important for the project, though, is that each such set Ck will be modeled as a binary
“membership” attribute called Element-of[k] (that is, there will be as many such
attributes as there are consistent sets for the gene I). Given a person p, p.Element-of[k]
will be true iff p belongs to Ck. Technically, Element-of[k] should also be indexed by
gene I, since there are different consistent sets for each gene. We will drop I from
Element-of[k, I] if it is clear what I is. Thus we will deal with the following data:
Data description and its format
Collection of pedigrees P1,…Pm
Each pedigree Pj is a collection of individuals. Each individual is characterized by a
binary feature vector of length N and also by the set of binary membership attributes
Element-of[k, I] such that
p.Element-of[k,I] = true iff p belongs to k’th consistent set for its pedigree for gene I
Values of these attributes will be generated for you by the GenMine pedigree generator
program. Thus, from the point of view of the project (ignoring the genetic interpretation), you will
deal with very long bit strings organized into m sets (pedigrees). Each such long bit string
will have two types of bit positions: features and membership attributes.
Let me now proceed to the project objectives and explain how the project models the reverse
genetics problem as a data mining problem.
First, let me start with a few more definitions:
By a query we will mean any conjunction of specific features or their negations. A
query q is consistent with a gene I on a pedigree P iff there is some k such that, for every individual p in P,
q is true for p iff p.Element-of[k, I] is true
In other words, the set of individuals who satisfy the query q in the pedigree P is a
consistent set for gene I in that pedigree. A query q is minimally consistent with gene I
on the pedigree P iff it is consistent with I on P and no proper subquery of q is.
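The definition of consistency can be illustrated with a minimal sketch. It assumes a query is written as a pattern string with one character per feature ('1' for the feature, '0' for its negation, '*' for "not mentioned"). The function names and the tiny three-person pedigree are made up for illustration and are not part of GenMine.

```python
def satisfies(bits, pattern):
    """True if a feature vector such as '110' matches a pattern such as '1**'."""
    return all(p == "*" or p == b for b, p in zip(bits, pattern))


def answer_set(pedigree, pattern):
    """Numbers (1-based) of the individuals in the pedigree who satisfy the query."""
    return {i for i, bits in enumerate(pedigree, start=1) if satisfies(bits, pattern)}


def is_consistent(pedigree, pattern, consistent_sets):
    """A query is consistent iff its answer set equals one of the consistent sets."""
    return answer_set(pedigree, pattern) in consistent_sets


# Hypothetical pedigree with three individuals and three features.
pedigree = ["110", "011", "100"]
# Hypothetical consistent sets for one gene on that pedigree.
consistent_sets = [{1, 3}, {2}]

print(answer_set(pedigree, "1**"))                      # individuals 1 and 3
print(is_consistent(pedigree, "1**", consistent_sets))  # True
print(is_consistent(pedigree, "*1*", consistent_sets))  # False: answer is {1, 2}
```

Note that the answer set must equal a consistent set exactly; being a subset or a superset of one is not enough.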
Now we are ready to formulate the project objectives. Given the input data as described
above, generated by the GenMine generator:
1. Find for each gene I and for each pedigree P all minimal queries which are
consistent with that gene.
2. Find for each gene I the top K queries which are consistent with the largest
number of pedigrees on that gene I. Ordering in “the top” is according to the number
of pedigrees “voting” for that query on the gene I.
3. Given a query Q, for each gene I, find for how many pedigrees Q is consistent
with that gene. Here a query may be any SQL SELECT query using the features
from the feature vector.
4. Given a query Q, for each gene, find for how many pedigrees a query (Q AND
Q’), for some Q’, is consistent with that gene. As in (3), Q can have
disjunctions, but assume that Q’ is again, as in (2), a pure conjunctive query
(features or their negations).
5. Create a GUI interface for the tasks (1)-(4) and perhaps other queries you may
think of – including possibly disjunctive queries as well.
Clearly, the queries which are in the top K (task 2) will be the phenotypes which are the most
likely to be “caused” by gene I.
Example
Let's take three pedigrees, each with 5 members. We will use simple numbers to denote
the members, but pedigrees do not share their members. Let's assume a feature vector of 4
features, and 3 genes. The binary features are: Smart, Funny, Friendly and Bossy. Thus
each individual is characterized by a binary feature vector of length four; for example, 1001 denotes
a person who is smart, unfunny, unfriendly and bossy.
Additionally, each of the three pedigrees will have the following consistent sets for each
of the three genes:
Pedigree 1
Gene 1: {1,2}, {1,3}, {1,2,4}, {2,4,5}, {1,3,5}, {2}, {2,3,5}
Gene 2: {1}, {1,2,3}, {4,5}, {2,5}, {1,5}, {3}, {3,4,5}, {1,3,4}, {1,2,5}, {1,2,3,4}
Gene 3: {1,2,3,4}, {4}, {5}, {3,5}, {2,4}, {1,2}, {1,3,4}
Pedigree 2
Gene 1: {1}, {1,4}, {1,2,4}, {2,4,5}, {1,5}, {2}, {2,4,5}
Gene 2: {2}, {1,5}, {1,3,4,5}
Gene 3: {3}, {1,2}, {1,4}, {1,4,5}, {1,5}, {2,3,5}, {2}
Pedigree 3
Gene 1: {1,2}, {1,3}, {1,4}, {1,4,5}, {1,3,5}
Gene 2: {2}, {1,3}, {2,4}, {3,4,5}, {2,3,4,5}
Gene 3: {1}, {3}, {3,4}, {1,4,5}, {2,3,5}, {3,4,5}
These consistent sets will be modeled through Element-of[k, I] attributes. For example,
consider individual “2” in pedigree one.
2.Element-of[1,1] = true (since 2 is a member of {1,2})
2.Element-of[2,1] = false (since 2 is not a member of {1,3})
2.Element-of[2,3] = false (since 2 is not a member of {4}, the second consistent set for gene
3 in that pedigree)
etc.
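The membership attributes for one individual can be derived mechanically from the consistent sets. The sketch below is only an illustration: the function name is made up, and the actual GenMine output format may differ. It builds the Element-of bits for individual “2” of pedigree one from the sets listed above.

```python
def membership_bits(person, consistent_sets):
    """Return Element-of[1..m] for one person as a bit string, one bit per consistent set."""
    return "".join("1" if person in c else "0" for c in consistent_sets)


# Consistent sets of pedigree one, copied from the example above.
gene1_ped1 = [{1, 2}, {1, 3}, {1, 2, 4}, {2, 4, 5}, {1, 3, 5}, {2}, {2, 3, 5}]
gene3_ped1 = [{1, 2, 3, 4}, {4}, {5}, {3, 5}, {2, 4}, {1, 2}, {1, 3, 4}]

bits_gene1 = membership_bits(2, gene1_ped1)
bits_gene3 = membership_bits(2, gene3_ped1)

# Element-of[k] is the k'th character (k counts from 1, Python indexes from 0).
print(bits_gene1[0] == "1")  # True:  2 is a member of {1,2}
print(bits_gene1[1] == "1")  # False: 2 is not a member of {1,3}
print(bits_gene3[1] == "1")  # False: 2 is not a member of {4}
```

Concatenating an individual's feature bits with the membership bits for every gene gives one "long bit string" of the kind described in the data description.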
Additionally, let's assume the following feature vectors for the 15 individuals (5 in each
pedigree), listed in order from 1 to 5.
Pedigree 1
1110
0110
1100
1110
0001
Pedigree 2
0100
1110
1000
0001
0011
Pedigree 3
0000
1010
1100
1110
1111
Now consider a query Q (Friendly AND NOT Bossy), which is “**10” as a binary pattern.
That query is consistent with gene 1 on both Pedigree 1 (the set {1,2,4} is among the
consistent sets) and Pedigree 2 (the set {2} is among the consistent sets for that gene), but not
on Pedigree 3. On the other hand, the query (Smart AND Funny), which is “11**” in binary
notation, is consistent with gene 3 on all three pedigrees. Indeed, {1,3,4} is consistent
with gene 3 on Pedigree 1, {2} is consistent with gene 3 on Pedigree 2, and
finally {3,4,5} is consistent with gene 3 on Pedigree 3.
Thus one may say that the three pedigrees unanimously vote for gene 3 as causing the
phenotype of “being smart and funny”.
The objective of the project is to find the “top” queries: those that the most pedigrees vote for.
The two queries which we mentioned in the example are likely to be among the ones
which we want: the first one got two out of three “pedigree votes”, the second got all
three votes.
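The vote counts in this example can be checked with a short sketch. It encodes the example data for genes 1 and 3 (gene 2 is left out only to keep the listing short), counts the votes for a pattern, and ranks all 81 patterns by brute force. The names are made up for illustration, and the brute-force ranking is acceptable only because the example is tiny; it is not the kind of solution the project asks for.

```python
from itertools import product

# Feature order: Smart, Funny, Friendly, Bossy (data copied from the example).
PEDIGREES = {
    1: ["1110", "0110", "1100", "1110", "0001"],
    2: ["0100", "1110", "1000", "0001", "0011"],
    3: ["0000", "1010", "1100", "1110", "1111"],
}
# CONSISTENT[gene][pedigree] is the list of consistent sets.
CONSISTENT = {
    1: {1: [{1, 2}, {1, 3}, {1, 2, 4}, {2, 4, 5}, {1, 3, 5}, {2}, {2, 3, 5}],
        2: [{1}, {1, 4}, {1, 2, 4}, {2, 4, 5}, {1, 5}, {2}],
        3: [{1, 2}, {1, 3}, {1, 4}, {1, 4, 5}, {1, 3, 5}]},
    3: {1: [{1, 2, 3, 4}, {4}, {5}, {3, 5}, {2, 4}, {1, 2}, {1, 3, 4}],
        2: [{3}, {1, 2}, {1, 4}, {1, 4, 5}, {1, 5}, {2, 3, 5}, {2}],
        3: [{1}, {3}, {3, 4}, {1, 4, 5}, {2, 3, 5}, {3, 4, 5}]},
}


def answer_set(people, pattern):
    """Individuals (1-based) whose bits match the pattern; '*' matches anything."""
    return {i for i, bits in enumerate(people, start=1)
            if all(p == "*" or p == b for b, p in zip(bits, pattern))}


def votes(pattern, gene):
    """Number of pedigrees on which the query is consistent with the gene."""
    return sum(answer_set(PEDIGREES[ped], pattern) in sets
               for ped, sets in CONSISTENT[gene].items())


def top_k(gene, k):
    """Brute-force ranking of all patterns by votes (toy data only)."""
    patterns = ("".join(p) for p in product("01*", repeat=4))
    scored = [(pat, votes(pat, gene)) for pat in patterns]
    scored.sort(key=lambda pv: (-pv[1], pv[0]))  # most votes first, ties by pattern
    return scored[:k]


print(votes("**10", 1))  # 2: Friendly AND NOT Bossy on gene 1
print(votes("11**", 3))  # 3: Smart AND Funny on gene 3
print(top_k(3, 5))
```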
GRADING
A total of 100 points will be given for the project. Steps (1) and (2) will be worth 70
points. You will earn the full 70 points iff (a) your algorithm uses some form of
effective pruning and does not simply enumerate the entire search space, and (b) your
algorithm runs and finds for each gene the queries which are linked to that gene. These
queries have been “planted” by us in the data set.
Steps (3) to (5) will be worth 30 points, 10 points each. Here the key
is again to have running code and to be able to demonstrate that the user can enter a query Q
from the GUI and see that the results are generated correctly. We will have test data which
will allow us to quickly see if your programs are correct.
FIRST HINTS
Wait until the lecture on data mining, mining frequent sets and mining association rules
before you “commit” to a solution. Clearly one does not want to apply a brute-force
solution and try all possible queries! Our test data feature set will have possibly tens of
features and hundreds of consistent sets for each gene. We will also have tens of genes. Thus,
your solution must be as efficient as possible. Also, the GUI has to be user friendly.
Additional points will be given for innovative interface design as well as for efficiency.
We will benchmark your solutions to measure running time and memory consumption
as well.
Your solution should be generic, not hard-coded to the data. It should take metadata such
as the feature vector and membership vectors as parameters, and it should ultimately allow
any SQL query using features to be tested for any gene, returning the number of pedigrees
which support that query for that gene (i.e. for which the answer to the query is a consistent
set for that gene).
Since you will be using SQL, your system should use a simple database; however, your
algorithms will obviously be implemented in the host programming language (Java,
C++).
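The split between the SQL part and the host-language part might look like the following sketch. It is written in Python with the built-in sqlite3 module purely as a stand-in for the Java/JDBC and MySQL setup mentioned in this handout. The table layout, the column names and the function names are assumptions made for illustration, and the data is the toy example (gene 1 only).

```python
import sqlite3

FEATURES = ["smart", "funny", "friendly", "bossy"]  # assumed column names
PEOPLE = {
    1: ["1110", "0110", "1100", "1110", "0001"],
    2: ["0100", "1110", "1000", "0001", "0011"],
    3: ["0000", "1010", "1100", "1110", "1111"],
}
# Consistent sets for gene 1, kept in the host program, NOT in the database.
CONSISTENT_GENE1 = {
    1: [{1, 2}, {1, 3}, {1, 2, 4}, {2, 4, 5}, {1, 3, 5}, {2}, {2, 3, 5}],
    2: [{1}, {1, 4}, {1, 2, 4}, {2, 4, 5}, {1, 5}, {2}],
    3: [{1, 2}, {1, 3}, {1, 4}, {1, 4, 5}, {1, 3, 5}],
}


def build_db():
    """Store only individuals, their pedigree and their feature bits."""
    conn = sqlite3.connect(":memory:")
    cols = ", ".join(f"{name} INTEGER" for name in FEATURES)
    conn.execute(f"CREATE TABLE person (pedigree INTEGER, id INTEGER, {cols})")
    for ped, rows in PEOPLE.items():
        for i, bits in enumerate(rows, start=1):
            conn.execute("INSERT INTO person VALUES (?, ?, ?, ?, ?, ?)",
                         (ped, i, *map(int, bits)))
    return conn


def count_votes(conn, where, consistent):
    """SQL selects the answer set per pedigree; matching is done in the host language.

    `where` is pasted into the statement unchecked, which is tolerable only in
    a classroom sketch where the user is trusted.
    """
    count = 0
    for ped, sets in consistent.items():
        rows = conn.execute(
            f"SELECT id FROM person WHERE pedigree = ? AND ({where})", (ped,))
        answer = {row[0] for row in rows}
        if answer in sets:  # exact match with some consistent set
            count += 1
    return count


conn = build_db()
print(count_votes(conn, "friendly = 1 AND bossy = 0", CONSISTENT_GENE1))  # 2
# Disjunctions are evaluated by SQL in exactly the same way:
print(count_votes(conn, "smart = 1 OR bossy = 1", CONSISTENT_GENE1))
```

The database never sees the membership attributes; it only returns the subset of individuals, which the host program then compares with the consistent sets.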
First you should focus on (1) and (2). You should have that running by Nov 15 in order to
finish in time. I strongly suggest that one member of your group start looking at the database
representation for step (3) and the interface implementation using JDBC immediately. In
this way you will maximize group collaboration. It is probably best to have one member
of the group coding (1) and (2) and another member or members dealing with the
issues of (3)-(5).
MORE DIRECTION
(1) Steps (1) and (2) should be implemented in the host language, not in SQL. That
means also that the source data should NOT be stored in a database. It is much
easier for the program to operate on the flat file, loaded as a data structure in your host
program, than to access a database. On the other hand, (3), (4) and (5) should
use a database representation of just the individuals, their pedigree membership and
their feature vectors (NOT the membership predicates). In this way, when given an arbitrary
SELECT query in MySQL, you can quickly select for each pedigree the subset of
individuals who satisfy that query. Then, given that subset X, you can determine
which of the consistent sets for a given gene match X. That matching step should
be done in the host language again, unless you find a way to do it all as an SQL query,
which I doubt. Furthermore, (3)-(5) should use a simple window as the GUI for
entering a query. Summarizing: for (3)-(5), the initial query evaluation should be
done using SQL, but the SQL job ends with generating the subset of individuals
who satisfy the query in a given pedigree. The next part is the “matching
part”: finding a consistent set for a given gene which matches the query answer
in that pedigree. This should be done outside of SQL and the database
representation, using the membership attributes in whatever
representation you picked for (1) and (2). Hint for step (4): treat the original
query Q as a new binary attribute (true if a person satisfies it, false if not);
then (4) becomes like (1) with one extra feature.
(2) The consistent sets for a gene may include the empty set and the set of all members of a
pedigree. Although our data sets do not contain these special sets, your program
should work in such cases. These cases will lead to many queries (why?), and in
general the complexity of the program will be very much affected by the structure
of the source data set. If there are many sets which are either very large or very small,
your program will run longer, just as in basket data mining when there are many
transactions which buy all the items.
(3) Your solution should use some form of pruning. As you try different conjunctions
of features or their negations, you should ask yourself when you can stop
and conclude that for a given gene and a given pedigree there is no need to continue,
since there is no chance of matching any consistent set for that pedigree and that
gene. Without such pruning your algorithm will be hopeless: just think about the
total number of genes multiplied by the number of consistent sets multiplied by
the total number of queries. This is a hopelessly large search space! In the
worst case your algorithm may be close to exponential, but the question is how
well it will perform “on average”.
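One possible pruning rule, offered only as a sketch and not as the expected solution, uses the fact that adding a literal to a conjunction can only shrink its answer set. If the current answer set no longer contains any consistent set as a subset, no extension of the query can ever match, so the search can stop there. The level-wise search below, in the spirit of frequent-set mining, combines that rule with the minimality requirement of task (1) for one gene on one pedigree. The function names and the representation of a query as a set of (feature index, bit) pairs are assumptions made for illustration, and the empty conjunction is counted as a query.

```python
def answer_set(query, people):
    """Individuals (1-based) whose feature bits satisfy every literal of the query."""
    return frozenset(i for i, bits in enumerate(people, start=1)
                     if all(bits[f] == v for f, v in query))


def minimal_consistent_queries(people, consistent_sets):
    """Level-wise search for minimally consistent queries on one pedigree and gene."""
    targets = {frozenset(s) for s in consistent_sets}
    n = len(people[0])

    def viable(answer):
        # Pruning test: some consistent set must still fit inside the answer.
        return any(t <= answer for t in targets)

    empty = frozenset()
    everyone = answer_set(empty, people)
    if everyone in targets:
        return [empty]          # every other query has a consistent subquery
    if not viable(everyone):
        return []
    minimal = []
    frontier = {empty}          # queries that are viable but not consistent
    while frontier:
        candidates = set()
        for q in frontier:
            used = {f for f, _ in q}
            for f in range(n):
                if f in used:
                    continue    # never combine a feature with its own negation
                for v in "01":
                    c = q | {(f, v)}
                    # Keep c only if every subquery one literal shorter
                    # survived, i.e. none was consistent and none was pruned.
                    if all(c - {lit} in frontier for lit in c):
                        candidates.add(c)
        frontier = set()
        for c in candidates:
            answer = answer_set(c, people)
            if answer in targets:
                minimal.append(c)   # consistent: report it, do not extend it
            elif viable(answer):
                frontier.add(c)     # still promising: extend at the next level
    return minimal


# Pedigree 1 and its consistent sets for gene 1, from the example.
people = ["1110", "0110", "1100", "1110", "0001"]
gene1 = [{1, 2}, {1, 3}, {1, 2, 4}, {2, 4, 5}, {1, 3, 5}, {2}, {2, 3, 5}]
for q in minimal_consistent_queries(people, gene1):
    print(sorted(q))
```

On this data the query Friendly alone, written as the single literal (2, '1'), is reported, so the longer query Friendly AND NOT Bossy is consistent but not minimal. How much the pruning test saves depends heavily on the shape of the consistent sets.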