IBM Life Sciences Gene Sequence Analysis Demo
Find and characterize novel cancer-related genes in genomic sequences
The IBM Life Sciences Development Team

The Scenario
Challenge: Find and characterize novel cancer-related genes in genomic sequences.
There are various ways one could identify novel genes. In this scenario we identify genes "in silico".
We use bioinformatics tools and various data sources, including recently published literature.
The current process is very manual, painstaking, and error-prone.

Scenario
This screen shows us building a PubMed query to find recent articles dealing with interesting disease-related genes (in this case, cancer-related genes: neoplasms). Notice the complexity of the query.

This screen shows that there were 11 resulting articles. We will examine each article further (one at a time). As an example, let's look at the third article...

This page shows the detailed description of the third article. The accession number of the mouse gene referenced in this article is contained here. We will use this accession number to do further analysis on this gene. If an article does not reference an accession number, we skip it and move on to the next.

For each gene accession number in each of the 11 articles, we submit a BLAST search against the Human Genome database to see whether any human genes are similar to the mouse gene described in the article.

Our query shows that there are 4 Human Genome sequences that have segments similar to our reference sequence. This first sequence has an 895 base pair region that matches ours in 756 base pairs (84%). That meets our acceptance threshold of at least 60% similarity over at least 150 base pairs. The matching base pairs are shown here.

For each of the 4 similar genes, we look up its complete gene sequence in the GenBank public database.

This is the FASTA format of this gene sequence. It goes on for hundreds of pages.
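
The acceptance rule applied to the BLAST hits above (60% similarity over at least 150 base pairs) can be written down as a small check. This is only a minimal sketch in Python: the function and constant names are made up for illustration, and it assumes the rule means "at least 60% of the aligned base pairs match". The only figures taken from the scenario are the thresholds and the 756-of-895 hit.

```python
# Acceptance rule from the scenario: 60% similarity over at least 150 base pairs.
# Assumption: "similarity" is the fraction of aligned base pairs that match.
MIN_IDENTITY = 0.60     # minimum fraction of matching base pairs
MIN_ALIGNED_BP = 150    # minimum length of the aligned region, in base pairs


def identity_fraction(matching_bp: int, aligned_bp: int) -> float:
    """Return the fraction of aligned base pairs that match the reference."""
    if aligned_bp <= 0:
        raise ValueError("aligned_bp must be positive")
    return matching_bp / aligned_bp


def is_acceptable_hit(matching_bp: int, aligned_bp: int) -> bool:
    """Apply the 60% / 150 bp rule to a single BLAST hit."""
    return (aligned_bp >= MIN_ALIGNED_BP
            and identity_fraction(matching_bp, aligned_bp) >= MIN_IDENTITY)


# The first hit from the scenario: 756 matching base pairs in an 895 bp region.
first_hit_identity = identity_fraction(756, 895)
print(f"{first_hit_identity:.0%}", is_acceptable_hit(756, 895))  # prints: 84% True
```

A hit that is long enough but matches poorly, or matches well but over a short region, fails the check.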
We need to copy this sequence and paste it into our next application: Geneid.

We pasted the FASTA-formatted gene sequence into the Geneid application. Geneid will analyze the sequence and predict its putative (or supposed) coding regions and exon-intron structures.

Here's a portion of the output from Geneid showing one of the predicted coding regions. It has removed the introns and translated the coding sequence into a protein sequence. We can now use BLASTP to compare this protein sequence to other known protein sequences. We do this for each of the predicted coding regions (perhaps 10 - 20 coding regions per sequence).

This time we use BLASTP to search the non-redundant protein databases for sequences similar to our putative coding region. We do this for each of the putative coding regions suggested by Geneid.

The BLAST results show 11 proteins that have similar coding sequences. If our putative protein is similar to known proteins, then it is probably not novel and we can ignore it. We move on to the next one until the list is exhausted.

Note: Repeating this step for the 10-20 putative coding regions in each of the 5-10 gene sequences similar to the new genes referenced in perhaps 9 of our 11 original articles will produce our desired set of putative novel cancer-related genes. (I.e., we have to run this step over 500 times, and we'll need at least 5 browser sessions open, between which we must manually cut and paste the information.)

We are now done with the first part of our task: finding novel cancer-related genes and their resulting proteins from recently published literature. We now start the second phase, which is to try to determine the function and origin of these novel proteins.

First, each putative novel protein sequence is analyzed by searching the Protein Family (Pfam) database, which is a protein family database of alignments and Hidden Markov Models (HMMs).
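
Before moving further into the second phase, the repetition count quoted in the note for the first phase can be checked with simple arithmetic. The sketch below multiplies the ranges given in the note. It assumes one referenced gene per article, which the text does not state, and the names are illustrative only.

```python
# Ranges quoted in the note; all are rough estimates from the text.
CODING_REGIONS_PER_SEQUENCE = (10, 20)   # putative coding regions per gene sequence
SIMILAR_SEQUENCES_PER_GENE = (5, 10)     # similar human sequences per referenced gene
ARTICLES_WITH_ACCESSION = 9              # "perhaps 9 of our 11" articles
# Assumption (not stated in the text): each article references exactly one gene.


def blastp_run_range() -> tuple:
    """Return the (low, high) estimate of BLASTP runs for the first phase."""
    low = (CODING_REGIONS_PER_SEQUENCE[0]
           * SIMILAR_SEQUENCES_PER_GENE[0]
           * ARTICLES_WITH_ACCESSION)
    high = (CODING_REGIONS_PER_SEQUENCE[1]
            * SIMILAR_SEQUENCES_PER_GENE[1]
            * ARTICLES_WITH_ACCESSION)
    return low, high


low, high = blastp_run_range()
print(low, high)  # prints: 450 1800
```

Under these assumptions the estimate runs from 450 to 1,800 BLASTP submissions, a range that contains the "over 500" figure in the note.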
We're looking for clues as to what the function of this protein might be, perhaps by finding distant family members.

The results show one similar protein alignment. I.e., our new protein may share a common origin with a known protein sequence that has a well-defined function. By studying this well-known sequence further, we may be able to get an idea of what our protein does.

BLAST can also be used to help us determine the function of our protein. This time we're going to take each of our putative novel proteins and compare it to known proteins in the Swiss-Prot database. We're looking for proteins similar to our novel proteins. Studying these similar proteins might give us a clue as to what our protein does.

ClustalW is another useful tool. We've pasted 3 of our protein sequences into the input field. ClustalW will do a multiple sequence alignment on these 3 sequences.

These are the raw alignment results, which show the best alignment of these three sequences. An alignment viewer can give us a graphical representation of these alignments. We can also run a phylogenetic analysis against the alignment data to determine whether the sequences were likely derived from a common source.

What we have just seen is the process that a bioinformatician might go through for one specific way of solving a problem. Notice that it is quite an involved process with many manual steps. It's easy to forget where you are, and there is plenty of room for error.

Many bioinformaticians with the appropriate skills will attempt to automate this process by writing Perl scripts. Since nothing is standardized at this point, the programmer has to work out the interface to each application separately. The result is a custom solution that is often difficult to maintain and enhance and is not very reusable for other scenarios.
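
To make the shape of such an automation script concrete, here is a rough sketch of the first phase as nested loops. It is written in Python rather than Perl purely for illustration. Every function name and parameter is hypothetical: each callable stands in for one of the manual steps (reading the accession number from an article, the BLAST search against the Human Genome database, Geneid, and the BLASTP novelty check), and none of them is a real interface to those tools.

```python
from typing import Callable, Iterable, List, Optional


def find_novel_proteins(
    articles: Iterable[str],
    accession_of: Callable[[str], Optional[str]],          # article -> accession, or None
    similar_human_sequences: Callable[[str], List[str]],   # stand-in for the BLAST step
    predict_coding_regions: Callable[[str], List[str]],    # stand-in for the Geneid step
    has_known_match: Callable[[str], bool],                 # stand-in for the BLASTP step
) -> List[str]:
    """Chain the manual steps of the first phase; all callables are hypothetical."""
    novel = []
    for article in articles:
        accession = accession_of(article)
        if accession is None:
            continue  # no accession number: skip the article, as in the walkthrough
        for sequence in similar_human_sequences(accession):
            for protein in predict_coding_regions(sequence):
                if not has_known_match(protein):
                    novel.append(protein)  # no similar known protein: putatively novel
    return novel


# Toy stand-ins only: these are not real accession numbers or sequences.
toy_accessions = {"article-1": "ACC1", "article-3": "ACC3"}
toy_similar = {"ACC1": ["seqA"], "ACC3": ["seqB", "seqC"]}
toy_known = {"seqA-p1", "seqC-p2"}

result = find_novel_proteins(
    articles=["article-1", "article-2", "article-3"],
    accession_of=toy_accessions.get,
    similar_human_sequences=lambda acc: toy_similar.get(acc, []),
    predict_coding_regions=lambda seq: [seq + "-p1", seq + "-p2"],
    has_known_match=lambda protein: protein in toy_known,
)
print(result)  # prints: ['seqA-p2', 'seqB-p1', 'seqB-p2', 'seqC-p1']
```

In a real script each stand-in would have to speak to a different application through its own interface, which is where the maintenance burden described above comes from.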
Framework Approach
Build Web Services wrappers around the applications used by the bioinformatician in this scenario
  Some of these applications will be run locally
  Some will be accessed via the Internet
Automate the choreography of the applications through workflow scripts
Provide user interaction through IBM's Portal Server interface

For more information: http://www.ibm.com/lifesciences

BLAST novel genes from PubMed vs. human genomic sequence (blastn with filtering)