IBM Life Sciences
Gene Sequence
Analysis Demo
Find and characterize novel cancer
related genes in genomic sequences
The IBM Life Sciences Development
Team
The Scenario
Challenge: Find and characterize novel
cancer-related genes in genomic
sequences
There are various ways one could identify novel
genes.
In this scenario we identify genes "in silico".
We use bioinformatic tools and various data
sources, including recently published literature.
The current process is very manual, painstaking,
and error-prone.
Scenario
This screen shows us building
a PubMed query to find recent
articles dealing with
interesting disease-related
genes (in this case, cancer-related
genes - neoplasms).
Notice the complexity of the
query.
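A query of this kind can also be assembled programmatically. The sketch below builds a boolean PubMed search term and a URL for NCBI's E-utilities `esearch` endpoint. The search terms, the 90-day window, and the helper names `build_pubmed_query` and `build_esearch_url` are illustrative assumptions; the query shown on the screen is not reproduced here.

```python
from urllib.parse import urlencode

# NCBI E-utilities search endpoint.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_pubmed_query(all_of, any_of):
    """Combine required terms with AND and alternative terms with OR."""
    required = " AND ".join(all_of)
    alternatives = " OR ".join(any_of)
    return f"{required} AND ({alternatives})" if alternatives else required


def build_esearch_url(term, days_back=90, max_results=20):
    """Return an esearch URL restricted to recently published articles."""
    params = {
        "db": "pubmed",
        "term": term,
        "reldate": days_back,   # only articles from the last N days
        "datetype": "pdat",     # filter on publication date
        "retmax": max_results,
    }
    return f"{ESEARCH_URL}?{urlencode(params)}"


# Hypothetical terms, chosen only to show the shape of such a query.
term = build_pubmed_query(
    all_of=["neoplasms[MeSH Terms]", "mice[MeSH Terms]"],
    any_of=["gene[Title/Abstract]", "oncogene[Title/Abstract]"],
)
url = build_esearch_url(term)
print(term)
print(url)
```

Keeping the term construction separate from the URL construction makes it easier to review the boolean logic before anything is submitted.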
This screen shows that
there were 11 resulting
articles. We will examine
each article further (one
at a time). As an
example, let's look at the
third article...
This page shows the detailed
description of the third article. The
accession number of the mouse gene
referenced in this article is contained
here. We will use this accession number
to do further analysis on this gene. If no
accession number were referenced,
we would skip this article and move on
to the next.
For each gene accession number
in each of the 11 articles, we will
submit a BLAST search against
the Human Genome database to
see if there are any human genes
similar to the mouse gene
described in the article.
Our query shows that there are
4 Human Genome sequences
that have segments similar to
our reference sequence.
The first sequence has an 895
base pair region in which 756
base pairs (84%) match ours.
That meets our acceptance
threshold of at least 60% similarity
over at least 150 base pairs. The
matching base pairs are shown
here.
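The acceptance rule can be written down as a small check. This is a minimal sketch, assuming the rule means "at least 60% identical base pairs over an aligned region of at least 150 base pairs"; the function name and argument names are hypothetical.

```python
MIN_IDENTITY = 0.60   # at least 60% identical base pairs
MIN_LENGTH = 150      # over a region of at least 150 base pairs


def is_acceptable_hit(identical_bp, aligned_bp,
                      min_identity=MIN_IDENTITY, min_length=MIN_LENGTH):
    """Return True if a BLAST hit meets both acceptance thresholds."""
    if aligned_bp < min_length:
        return False
    return identical_bp / aligned_bp >= min_identity


# The first hit from the demo: 756 of 895 base pairs match.
identity = 756 / 895
print(f"{identity:.0%}", is_acceptable_hit(756, 895))  # 84% True
```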
For each of the 4 similar
genes, we look up its
complete gene sequence
using the GenBank public
database.
This is the FASTA format of
this gene sequence. It goes
on for hundreds of pages. We
need to copy this sequence
and paste it into our next
application, Geneid.
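Reading the FASTA text in code avoids the manual copy and paste. The sketch below is a minimal parser, assuming the usual FASTA layout of a `>` header line followed by sequence lines; the record shown is made up for illustration.

```python
def parse_fasta(text):
    """Parse FASTA text into a dict mapping each header to its sequence."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            records[header] = []
        elif header is not None:
            records[header].append(line.upper())
    return {name: "".join(chunks) for name, chunks in records.items()}


# A tiny made-up record, for illustration only.
example = """>example_sequence hypothetical record
ACGTACGTAC
GTTGCA
"""
sequences = parse_fasta(example)
print(sequences)
```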
We pasted the FASTA
formatted gene sequence into
the Geneid application.
Geneid will analyze and
predict the putative (or
supposed) coding regions
and exon-intron structures of
this sequence.
Here's a portion of the output from
Geneid showing one of the
predicted coding regions. It has
removed the introns and converted
the gene sequence into a protein
sequence. We can now use
BLASTP to compare this protein
sequence to other known protein
sequences. We do this for each of
the predicted coding regions
(perhaps 10-20 coding regions
per sequence).
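To illustrate what "removing the introns and converting to protein" means, the sketch below joins exon segments and translates them with the standard genetic code. It does not predict genes the way Geneid does; the exon coordinates are assumed to be already known, and the sequence and helper names are made up.

```python
BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
# Standard genetic code: codon -> one-letter amino acid ('*' is stop).
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def splice_exons(genomic, exons):
    """Join exon segments given as 0-based, end-exclusive (start, end) pairs."""
    return "".join(genomic[start:end] for start, end in exons)


def translate(coding):
    """Translate a coding sequence, stopping at the first stop codon."""
    protein = []
    for pos in range(0, len(coding) - 2, 3):
        amino = CODON_TABLE[coding[pos:pos + 3]]
        if amino == "*":
            break
        protein.append(amino)
    return "".join(protein)


# Made-up sequence: two exons separated by a GT...AG intron.
genomic = "ATGGCC" + "GTAAGTCAG" + "AAATGGTAA"
exons = [(0, 6), (15, 24)]
coding = splice_exons(genomic, exons)
print(coding, translate(coding))  # ATGGCCAAATGGTAA MAKW
```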
This time we use BLASTP to
search the non-redundant
protein databases for
sequences similar to our
putative coding region. We do
this for each of the putative
coding regions suggested by
Geneid.
The BLAST results show 11 proteins
that have similar coding sequences. If
our putative protein is similar to
known proteins, then it is probably not
novel and we can ignore it. We move
on to the next one until the list is
exhausted.
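The "ignore it if it resembles a known protein" rule can be sketched as a filter over BLASTP results. The demo does not state a cutoff for protein similarity, so the 30% identity value below is a hypothetical placeholder, as are the hit lists and function names.

```python
def is_putatively_novel(hits, max_identity=0.30):
    """A candidate stays novel if no known protein is too similar to it.

    `hits` is a list of (protein_name, fraction_identity) pairs; the 0.30
    cutoff is a hypothetical value, not one taken from the demo.
    """
    return all(identity < max_identity for _, identity in hits)


def filter_novel(candidates):
    """Keep the names of candidates whose hit lists pass the novelty test."""
    return [name for name, hits in candidates.items()
            if is_putatively_novel(hits)]


# Made-up BLASTP summaries for two predicted coding regions.
candidates = {
    "region_1": [("known_kinase", 0.92), ("known_receptor", 0.41)],
    "region_2": [("distant_protein", 0.18)],
}
print(filter_novel(candidates))  # ['region_2']
```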
Note: Repeating this step for the
10-20 putative coding regions in each of
the 5-10 gene sequences similar to the
new genes referenced in perhaps 9 of our
11 original articles will produce our
desired set of putative novel
cancer-related genes. (That is, we have
to run this step over 500 times, and we'll
need at least 5 browser sessions open,
between which we must manually cut and
paste the information.)
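A quick calculation shows where a figure of this size comes from. The sketch multiplies the ranges quoted in the note, assuming every coding region of every similar sequence for every usable article is searched exactly once.

```python
def blastp_runs(regions_per_sequence, sequences_per_gene, articles_with_genes):
    """Number of BLASTP runs if every combination is searched once."""
    return regions_per_sequence * sequences_per_gene * articles_with_genes


low = blastp_runs(10, 5, 9)     # fewest regions and sequences
high = blastp_runs(20, 10, 9)   # most regions and sequences
print(low, high)  # 450 1800
```

Under these assumptions the number of manual runs lies between 450 and 1800.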
We are now done with the first
part of our task: finding novel
cancer-related genes and their
resulting proteins from recently
published literature.
We now start the second phase
which is to try to determine the
function and origin of these
novel proteins.
First, each putative novel protein
sequence is analyzed by searching
the Protein Family (Pfam) database,
a protein family database of
alignments and hidden Markov
models (HMMs). We're looking for
clues as to what the function of this
protein might be, perhaps by finding
distant family members.
The results show one similar protein
alignment; i.e., our new protein may
share a family origin with a known
protein sequence that has a well-defined
function. By studying this
well-known sequence further, we may be
able to get an idea of what our protein
does.
BLAST can be used to help us
determine the function of our protein.
This time we're going to take each of
our putative novel proteins and
compare them to known proteins in
the Swiss-Prot database. We're
looking for proteins similar to our
novel proteins. Studying these similar
proteins might give us a clue as to
what our protein does.
ClustalW is another
useful tool. We've
pasted in 3 of our
protein sequences
into the input field.
ClustalW will do a
multiple sequence
alignment on these
3 sequences.
These are the raw
alignment results,
which show the
best alignment of
these three
sequences.
An alignment
viewer can give us
a graphical
representation of
these alignments.
We can also run a
phylogenetic
analysis on the
alignment data to
determine whether the
sequences were likely
derived from
the same source.
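One simple input to such an analysis is a table of pairwise distances computed from the alignment. The sketch below uses the p-distance (the fraction of aligned, gap-free positions that differ); this is a swapped-in illustration under that assumption, not the method used by the demo's tools, and the aligned fragments are made up.

```python
from itertools import combinations


def p_distance(seq_a, seq_b):
    """Fraction of differing positions, ignoring columns with a gap."""
    pairs = [(a, b) for a, b in zip(seq_a, seq_b) if a != "-" and b != "-"]
    if not pairs:
        return 0.0  # no comparable columns
    return sum(a != b for a, b in pairs) / len(pairs)


def distance_matrix(alignment):
    """Pairwise p-distances for a dict of equal-length aligned sequences."""
    return {
        (name_a, name_b): p_distance(alignment[name_a], alignment[name_b])
        for name_a, name_b in combinations(alignment, 2)
    }


# Made-up aligned protein fragments ('-' marks a gap).
alignment = {
    "protein_1": "MAKWLT-E",
    "protein_2": "MAKWLTGE",
    "protein_3": "MSKYLT-D",
}
distances = distance_matrix(alignment)
for pair, dist in distances.items():
    print(pair, round(dist, 2))
```

Small distances suggest closely related sequences; a tree-building method would take a matrix like this as its starting point.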
What we have just seen is the process
that a bioinformatician might go
through for a specific way of solving a
problem. Notice that it is quite an
involved process with many manual
steps. It's easy to forget where you are
and there is plenty of room for error.
Many bioinformaticians with the
appropriate skills will attempt to
automate this process by writing Perl
scripts. Since nothing is standardized
at this point, the programmer has to
work out the interface to each
application separately. The
result is a custom solution that is often
difficult to maintain and enhance, and
not very reusable for other scenarios.
Framework Approach
Build Web Services wrappers around the
applications used by the bioinformatician
in this scenario
Some of these applications will be run locally
Some will be accessed via the Internet
Automate the choreography of the
applications through workflow scripts
Provide user interaction through IBM's
Portal Server interface
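The idea of choreographing wrapped applications through a workflow can be sketched as a list of steps applied to each input in turn. This is only an illustration of the pattern: the step functions are hypothetical stand-ins for service wrappers, the accession value is made up, and nothing here reflects the actual IBM framework or its interfaces.

```python
def run_workflow(items, steps):
    """Pass each item through the steps in order; a step returning None
    drops the item, mirroring the 'skip and move on' rule."""
    results = []
    for item in items:
        value = item
        for step in steps:
            value = step(value)
            if value is None:
                break
        else:
            results.append(value)
    return results


# Hypothetical stand-ins for wrapped services; real wrappers would call
# the local or remote application instead of these toy functions.
def extract_accession(article):
    return article.get("accession")  # None if the article cites no gene


def find_similar_sequences(accession):
    return {"accession": accession, "hits": [f"{accession}_hit"]}


# Made-up articles; "X00001" is a placeholder accession number.
articles = [{"title": "A", "accession": "X00001"}, {"title": "B"}]
workflow_results = run_workflow(
    articles, [extract_accession, find_similar_sequences]
)
print(workflow_results)
```

Because every step shares the same call shape, a step can be swapped between a local and a remote implementation without changing the workflow itself.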
For more information:
http://www.ibm.com/lifesciences