Data Mining Project: Reversed Linkage (541 project for 2002 Fall Semester)

A standard problem in functional genomics today is: given a disease (a so-called phenotype), find candidate “genes” which can cause it. To do this, researchers collect data in the form of family trees of people affected by the disease (so-called pedigrees), and for each individual they obtain genotypes in the form of so-called markers, which define genes on the chromosome. Typically, tens of such genes are defined today for each individual, but in the near future, due to new marking techniques, their number may be measured in thousands (so-called SNP markers). In addition to the genotypical data, so-called phenotypical data is collected as well. Phenotypical data consists of attributes which describe clinical data as well as habits (smoker, drinker, etc.) of that individual. Typically, each individual in a pedigree is characterized by possibly hundreds of such phenotypical attributes in addition to tens of genotypical ones.

Over the last decade or so, there has been considerable success in identifying, or finding the approximate locations of, genes causing (or partially causing) such diseases as Huntington's disease, breast cancer, cystic fibrosis and others. The success of such methods depends very much on the “luck” of finding the right data set: the right set of pedigrees with a significant number of affected (sick) people in them.

What we term “reversed linkage” is the question: what possible phenotypes can be caused by a given gene? In other words, rather than looking for a gene for a given disease, we ask, for a given gene, which diseases can be caused by that gene. Thus we go from gene to phenotype (disease) rather than from phenotype (disease) to gene. This is just an explanation for those who are interested in the motivation behind our project.
The project itself requires no knowledge of genetics at all.

Our project will rely entirely on simulated data. That data will come from our GenMine pedigree simulator. The simulator will generate pedigrees with up to 15 individuals in them. Each individual will be characterized by a feature vector: a vector of N binary features corresponding to physical/clinical characteristics. Additionally, each pedigree, which from the project perspective is simply a set of individuals, will have special subsets of individuals called consistent sets. Consistent sets will be further indexed by the genes, ranging from 1 to L. Genes are fixed and the same for each pedigree. Individuals in pedigrees, on the other hand, are disjoint (thus individual “2” in pedigree one is different from individual “2” in pedigree three).

For each gene I, the pedigree simulator will generate a number of consistent sets C1…Cm, all of them subsets of the individuals of a given pedigree P. Genetically speaking (which is irrelevant for the project), given a gene I, each of the Ck can be “explained” by a dominant model on that gene. What is important for the project, though, is that each such set Ck will be modeled as a binary “membership” attribute called Element-of[k] (that is, there will be as many such attributes as there are consistent sets for the gene I). Given a person p, p.Element-of[k] will be true iff p belongs to Ck. Technically, Element-of[k] should also be indexed by gene I, since there are different consistent sets for each gene. We will drop I from Element-of[k, I] if it is clear what I is. Thus we will deal with the following data:

Data description and its format

Collection of pedigrees P1,…Pm.
Each pedigree Pj is a collection of individuals.
Each individual is characterized by a binary feature vector of length N and also by the set of binary membership attributes Element-of[k, I] such that p.Element-of[k, I] = true iff p belongs to the k’th consistent set for its pedigree for gene I.

Values of these attributes will be generated for you by the GenMine pedigree generator program. Thus, from the point of view of the project (ignoring the genetic interpretation), you will deal with very long bit strings organized into m sets (pedigrees). Each such long bit string will have two types of bit positions: features and membership attributes.

Let me now proceed to the project objectives and explain how the project models the reverse genetics problem as a data mining problem. First, a few more definitions:

By a query we will mean any conjunction of specific features or their negations.

A query q is consistent with a gene I on a pedigree P iff there is some k such that, for every individual p of P, q is true of p iff p.Element-of[k, I] is true. In other words, the set of individuals which satisfy the query q in the pedigree P is a consistent set for gene I in that pedigree.

A query q is minimally consistent with gene I on the pedigree P iff it is consistent with I on P and no proper subquery of q (a conjunction of only some of its terms) is.

Now we are ready to formulate the project objectives. Given the input data as described above, generated by the GenMine generator:

1. Find, for each gene I and for each pedigree P, all minimal queries which are consistent with that gene.
2. Find, for each gene I, the top K queries which are consistent with the largest number of pedigrees on that gene I. Ordering in “the top” is according to the number of pedigrees “voting” for that query on the gene I.
3. Given a query Q, for each gene I, find for how many pedigrees Q is consistent with that gene. Here Q may be any SQL SELECT query using the features from the feature vector.
4. Given a query Q, for each gene, find for how many pedigrees a query (Q AND Q’), for some Q’, is consistent with that gene.
As in (3), Q can have disjunctions, but assume that Q’ is, as in (2), a pure conjunctive query (features or their negations).
5. Create a GUI for the tasks (1)-(4) and perhaps other queries you may think of, possibly including disjunctive queries as well.

Clearly, the queries which are in the top K (task 2) will be the phenotypes which are the most likely to be “caused” by gene I.

Example

Let's have three pedigrees, each with 5 members. We will use simple numbers to denote the members, but pedigrees do not share their members. Let's assume a feature vector of 4 features and 3 genes. The binary features are: Smart, Funny, Friendly and Bossy. Thus each individual is characterized by a binary feature vector of length four; for example, 1001 denotes a person who is smart, unfunny, unfriendly and bossy.

Additionally, each of the three pedigrees will have the following consistent sets for each of the three genes:

Pedigree 1
Gene 1: {1,2}, {1,3}, {1,2,4}, {2,4,5}, {1,3,5}, {2}, {2,3,5}
Gene 2: {1}, {1,2,3}, {4,5}, {2,5}, {1,5}, {3}, {3,4,5}, {1,3,4}, {1,2,5}, {1,2,3,4}
Gene 3: {1,2,3,4}, {4}, {5}, {3,5}, {2,4}, {1,2}, {1,3,4}

Pedigree 2
Gene 1: {1}, {1,4}, {1,2,4}, {2,4,5}, {1,5}, {2}, {2,4,5}
Gene 2: {2}, {1,5}, {1,3,4,5}
Gene 3: {3}, {1,2}, {1,4}, {1,4,5}, {1,5}, {2,3,5}, {2}

Pedigree 3
Gene 1: {1,2}, {1,3}, {1,4}, {1,4,5}, {1,3,5}
Gene 2: {2}, {1,3}, {2,4}, {3,4,5}, {2,3,4,5}
Gene 3: {1}, {3}, {3,4}, {1,4,5}, {2,3,5}, {3,4,5}

These consistent sets will be modeled through Element-of(k, I) attributes, where k is the position of the set in the list for gene I. For example, consider individual “2” in pedigree one:

2.Element-of(1,1) = true (since 2 is a member of {1,2}, the first consistent set for gene 1)
2.Element-of(2,1) = false (since 2 is not a member of {1,3}, the second consistent set for gene 1)
2.Element-of(2,3) = false (since 2 is not a member of {4}, the second consistent set for gene 3 in that pedigree)
etc.

Additionally, let's assume the following feature vectors for the 15 individuals (5 in each pedigree), ordered from 1 to 5:
Pedigree 1: 1110 0110 1100 1110 0001
Pedigree 2: 0100 1110 1000 0001 0011
Pedigree 3: 0000 1010 1100 1110 1111

Now consider a query Q (Friendly AND NOT Bossy), which is “**10” as a binary pattern. That query is consistent with gene 1 on both Pedigree 1 (it selects {1,2,4}, which is among the consistent sets) and Pedigree 2 (it selects {2}, which is among the consistent sets for that gene), but not on Pedigree 3 (it selects {2,4}, which is not). On the other hand, the query (Smart AND Funny), which is “11**” in binary notation, is consistent with gene 3 on all three pedigrees. Indeed, {1,3,4} is a consistent set for gene 3 on pedigree one, {2} is a consistent set for gene 3 on pedigree two, and finally {3,4,5} is a consistent set for gene 3 on pedigree three. Thus one may say that the three pedigrees unanimously vote for gene 3 as causing the phenotype of “being smart and funny”. The objective of the project is to find the “top” queries, those for which most pedigrees vote. The two queries mentioned in the example are likely to be among the ones we want: the first one got two out of three “pedigree votes”, the second got all three.

GRADING

A total of 100 points will be given for the project. Steps (1) and (2) will be worth 70 points. You will earn the full credit of 70 points iff (a) your algorithm uses some form of effective pruning and does not simply enumerate the entire search space, and (b) your algorithm runs and finds, for each gene, the queries which are linked to that gene; these queries have been “planted” by us in the data set. Steps (3)-(5) will be worth 30 points, 10 points each. Here, the key is again to have running code and to be able to demonstrate that the user can enter a query Q from the GUI and see that the results are generated correctly. We will have test data which will allow us to quickly see if your programs are correct.

FIRST HINTS

Wait until the lecture on data mining, mining frequent sets and mining association rules before you “commit” to a solution.
Clearly one does not want to apply a brute force solution and try all possible queries! Our test data feature set will have possibly tens of features and hundreds of consistent sets for each gene. We will also have tens of genes. Thus, your solution must be as efficient as possible. Also, the GUI has to be user friendly. Additional points will be given for innovative interface design as well as for efficiency. We will benchmark your solutions to measure the running time and memory consumption as well.

Your solution should be generic, not hard coded to the data. It should take metadata such as the feature vector and membership vectors as parameters, and it should ultimately allow any SQL query using features to be tested for any gene, returning the number of pedigrees which support that query for that gene (i.e. for which the answer to the query is a consistent set for that gene). Since you will be using SQL, your system should use a simple database; however, your algorithms will obviously be implemented in the host programming language (Java, C++).

First you should focus on (1) and (2). You should have that running by Nov 15 in order to finish in time. I strongly suggest that one member of your group starts looking at the database representation for step (3) and the interface implementation using JDBC immediately. In this way you will maximize group collaboration. It is probably best to have one member of the group coding (1) and (2) and have another member or members dealing with the issues of (3)-(5).

MORE DIRECTION

(1) Steps (1) and (2) should be implemented in the host language, not in SQL. That also means that the source data should NOT be stored in a database. It is much easier for the program to operate on the flat file, loaded as a data structure in your host program, than to access a database. On the other hand, (3), (4) and (5) should use the database representation of just the individuals, their pedigree membership and feature vectors (NOT membership predicates).
In this way, when given an arbitrary SELECT query in MySQL, you can quickly select, for each pedigree, the subset of individuals who satisfy that query. Then, given that subset X, you can determine which of the consistent sets for a given gene match X. That matching step should be done in the host language again, unless you find a way to do it all as an SQL query, which I doubt. Furthermore, (3)-(5) should use a simple window as the GUI for entering a query.

Summarizing: for (3)-(5), the initial query evaluation should be done using SQL, but the SQL job ends with generating the subset of individuals which satisfy the query in a given pedigree. The next part is the “matching part”: finding a consistent set for a given gene which matches the query answer in that pedigree. This should be done outside of SQL and the database representation, using membership attributes in whatever representation you picked for (1) and (2).

Hint for step (4): treat the original query Q as a new binary attribute (true if a person satisfies it, false otherwise); then (4) becomes like (1) with one extra feature.

(2) The consistent sets for any gene may include the empty set and the set of all members of a pedigree. Although our data sets do not contain these special sets, your program should work in such a case. This case will lead to many queries (why?), and in general the complexity of the program will be very much affected by the structure of the source data set. If there are many sets which are either very large or very small, your program will run longer, just like in basket data mining when there are many transactions which buy all the items.

(3) Your solution should use some form of pruning. As you try different conjunctions of features or their negations, you should ask yourself when you can stop and conclude that, for a given gene and a given pedigree, there is no need to continue, since there is no chance of matching any consistent set for that pedigree and that gene.
Without such pruning your algorithm will be hopeless: just think about the total number of genes multiplied by the number of consistent sets multiplied by the total number of queries. This is a hopelessly large search space! In the worst case your algorithm may be close to exponential, but the question is how well it will perform “on average”.
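
To make the definitions of “consistent” and “voting” concrete, the worked example from the Example section can be checked mechanically. The following is only a minimal sketch in Python (chosen for brevity; the project itself is expected to be written in Java or C++). The function names satisfying_set, is_consistent and votes, the use of “*” as a wildcard in a pattern, and the dictionary layout are assumptions made for this illustration; they are not part of GenMine or its file format. Only the consistent sets for genes 1 and 3 are copied in, since those are the ones the example uses.

```python
from typing import Dict, List, Set, Tuple

# Feature order: Smart, Funny, Friendly, Bossy.
# Feature vectors from the example; individuals are numbered 1..5 in order.
PEDIGREES: Dict[int, List[str]] = {
    1: ["1110", "0110", "1100", "1110", "0001"],
    2: ["0100", "1110", "1000", "0001", "0011"],
    3: ["0000", "1010", "1100", "1110", "1111"],
}

# Consistent sets keyed by (pedigree, gene); genes 1 and 3 only (illustrative subset).
CONSISTENT: Dict[Tuple[int, int], List[Set[int]]] = {
    (1, 1): [{1, 2}, {1, 3}, {1, 2, 4}, {2, 4, 5}, {1, 3, 5}, {2}, {2, 3, 5}],
    (1, 3): [{1, 2, 3, 4}, {4}, {5}, {3, 5}, {2, 4}, {1, 2}, {1, 3, 4}],
    (2, 1): [{1}, {1, 4}, {1, 2, 4}, {2, 4, 5}, {1, 5}, {2}, {2, 4, 5}],
    (2, 3): [{3}, {1, 2}, {1, 4}, {1, 4, 5}, {1, 5}, {2, 3, 5}, {2}],
    (3, 1): [{1, 2}, {1, 3}, {1, 4}, {1, 4, 5}, {1, 3, 5}],
    (3, 3): [{1}, {3}, {3, 4}, {1, 4, 5}, {2, 3, 5}, {3, 4, 5}],
}


def satisfying_set(pattern: str, vectors: List[str]) -> Set[int]:
    """Individuals whose feature vector matches the pattern ('*' matches any bit)."""
    return {
        i
        for i, vec in enumerate(vectors, start=1)
        if all(p in ("*", bit) for p, bit in zip(pattern, vec))
    }


def is_consistent(pattern: str, pedigree: int, gene: int) -> bool:
    """True iff the query's answer set is one of the consistent sets for the gene."""
    return satisfying_set(pattern, PEDIGREES[pedigree]) in CONSISTENT[(pedigree, gene)]


def votes(pattern: str, gene: int) -> int:
    """Number of pedigrees on which the query is consistent with the gene."""
    return sum(is_consistent(pattern, ped, gene) for ped in PEDIGREES)


if __name__ == "__main__":
    print(votes("**10", 1))  # Friendly AND NOT Bossy on gene 1 -> 2
    print(votes("11**", 3))  # Smart AND Funny on gene 3 -> 3
```

This sketch evaluates one given query at a time, which corresponds to the matching step of task (3). It does nothing to search the space of queries, so tasks (1) and (2) still need a pruning strategy of your own.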