
Summary for August 25th-Sept 1st 2011
Krishnakumar Sridharan
Project A – TSS Prediction using Machine-learning
1. Framework for getting more meaningful results from DNAFeatures:
A. While I was testing different negative datasets using DNAFeatures (see previous update), I
was unable to get results for the nucleosome-prediction features for all datasets. I met with
XK about this, and he pointed out a few errors in my approach and suggested how to improve
the nucleosome-prediction part
B. To get meaningful results from the NuPoP code used for nucleosome-based features, larger
sequences (-5kb, +5kb) need to be supplied at the boundaries to account for the
"boundary effects" described by the authors of the NuPoP package. I was not taking these into
consideration in my 400bp-long sequences, so the nucleosome-prediction results I got
for some datasets were not meaningful.
C. The new approach that I plan to use in the future to get nucleosome-prediction features is
as follows:
- Calculate all features except the nucleosome-prediction features from the
DNAFeatures package, using many single standard-length (400bp in the examples) fasta
sequences, and collect the feature values.
- For the nucleosome-based features, extend the standard-length sequence by about
5kbp on both sides, with either "Filler" (random nucleotides and/or unknown
nucleotides, "N") or "Actual" (upstream and downstream) sequences, and then
run NuPoP on them, restricted to the region [5001, 5400].
- This approach takes care of the "boundary effects" while giving the
nucleosome-based feature values for only the selected 400bp (or any standard-length) sequence.
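The padding scheme above can be sketched as follows (Python for illustration only; the actual pipeline uses Perl scripts and the NuPoP package, and the function names here are my own placeholders):

```python
# Sketch: pad a standard-length sequence with 5kb of context on each side so
# that NuPoP's boundary effects fall outside the region of interest, then keep
# only the predictions for the original 400bp window [5001, 5400].

FLANK = 5000          # 5kb of context on each side
CORE_LEN = 400        # standard-length sequence

def pad_for_nupop(core_seq, upstream=None, downstream=None):
    """Return the extended sequence to feed to NuPoP.

    If actual flanking sequences are unavailable, "N" filler
    (unknown nucleotides) is used as a stand-in.
    """
    up = (upstream or "N" * FLANK)[-FLANK:]
    down = (downstream or "N" * FLANK)[:FLANK]
    return up + core_seq + down

def core_window(nupop_scores):
    """Slice per-position NuPoP output back to the original 400bp.

    Positions [5001, 5400] in 1-based coordinates correspond to the
    Python slice [5000:5400].
    """
    return nupop_scores[FLANK:FLANK + CORE_LEN]

seq = "ACGT" * 100                    # a toy 400bp sequence
extended = pad_for_nupop(seq)         # 10,400bp with "N" filler
assert len(extended) == 2 * FLANK + CORE_LEN
```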
2. Discussions on machine-learning: I also discussed with XK the machine-learning
approach that he was following, and got some good directions from him. Some
interesting pointers were that:
A. The actual difference is usually not the machine-learning algorithm used;
instead, performance depends heavily on the datasets and the "features" used for prediction
B. BioBayesNet is a good place to start these analyses, since it has been used in
the group before (YZ and XK) and is user-friendly. WEKA is the next go-to tool, since it
can run many different types of analyses
C. The Support Vector Machine (SVM) approach that I tried to implement is a popular
and well-used algorithm in machine-learning circles, but it is considered
computationally expensive (it frequently runs out of memory; see previous updates).
Since SVM is often recommended and used in the literature, I will definitely come
back to it once I have implemented a run in one of the other machine-learning
algorithms.
3. Research Discussion with Chris Eisley (Dr. Dorman's student):
After a conversation at my poster last Saturday, I met Chris Eisley (CE) from Dr. Karin
Dorman’s lab over lunch and discussed opportunities to collaborate in research.
We explained our research methodologies, approaches, and progress to each
other, so as to identify areas to work together on. CE works with the IMM-based models
and code by Mike Sparks from our group; he is currently expanding some of
the models in this work to predict coding vs. non-coding sequences.
CE works on estimating the effects of various hidden states in the Markov model that
correspond to genomic features such as G/C content. He mentioned developing a
probability-based model that performs binary classification, predicting whether a
given sequence is coding or non-coding.
I explained how I use genomic features as predictors of whether or
not a sequence has a Transcription Start Site and, if it does, where that TSS is. These
discussions helped me think of a statistically sound methodology for the following
parts of my project:
A. Formatting and ensuring the integrity of the positive and negative training
data, as well as the testing data. One idea I came up with for testing data for the
eventual machine-learning-based algorithm is to take a genomic
DNA/chromosome sequence and fragment it; these fragments can together
form an unbiased test set.
B. Cross-validation of the eventual TSS-predicting algorithm – I could use either a
random or a leave-one-out cross-validation approach, separating 4/5 of the
genomic DNA fragments into training data and 1/5 into testing data.
C. CE suggested a machine-learning approach of his choice, Random Forest, and
we discussed how it would compare to SVMs and other machine-learning
methods
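The fragmentation-and-split idea from points A and B could be sketched as follows (Python for illustration; the function names are my own placeholders):

```python
# Sketch: fragment a chromosome into standard-length pieces and hold out
# 1/5 of them as test data, as discussed with CE.
import random

def fragment(chromosome, size=400):
    """Cut a chromosome sequence into non-overlapping fragments."""
    return [chromosome[i:i + size]
            for i in range(0, len(chromosome) - size + 1, size)]

def train_test_split(fragments, test_fraction=0.2, seed=0):
    """Randomly assign ~4/5 of fragments to training and 1/5 to testing."""
    rng = random.Random(seed)
    shuffled = fragments[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

frags = fragment("ACGT" * 2000)       # toy 8kb "chromosome" -> 20 fragments
train, test = train_test_split(frags)
assert len(train) == 16 and len(test) == 4
```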
In conclusion, based on the discussions we had, some of the possible opportunities to
collaborate with CE would be as follows:
1) Adding to the genomic features:
- We discussed implementing a model loosely based on the one that CE is
working on, which would tell whether a sequence has a TSS based on the
200-mer upstream sequence (promoter elements).
- This model outputs a probability score that is higher when the given
upstream sequence is upstream of a TSS; it is a chain-based model that
has been used for predicting coding sequences.
- The concerns are that sequences with no TSS in them have no defined
"upstream sequence", and that the right-sized k-mer must be chosen.
- CE needed a little more time to look at the finer details of implementing
such an algorithm for a promoter sequence, and I agreed to provide him
with some +ve and -ve sequences once he is ready (he will notify me by
email). If we can add this feature, it might increase the
prediction efficiency of my approach.
2) Estimating more hidden states: CE mentioned that the prediction efficiency
of the Markov chain he is working on increases if there is an
estimate for the G/C content. I suggested that the DNAFeatures we have
might help him estimate more "hidden" states or genomic features, so that
he can test whether that increases his prediction accuracy.
3) Machine-learning: Since CE has applied the Random Forest approach in
some previous work, I could learn that method from him.
In turn, the machine-learning part of my work could benefit his side of things.
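As a toy illustration of the chain-based scoring idea in point 1) (this is not CE's actual model; it is a simple k-mer log-likelihood-ratio scorer, with all names and training sequences invented for the example):

```python
# Sketch: a minimal chain-based scorer that assigns higher scores to upstream
# sequences resembling promoters, using smoothed k-mer frequencies estimated
# from +ve and -ve training sequences.
from collections import Counter
import math

def kmer_logprobs(seqs, k=3, pseudo=1.0):
    """Smoothed log-probabilities of each k-mer across a training set."""
    counts = Counter(s[i:i + k] for s in seqs for i in range(len(s) - k + 1))
    total = sum(counts.values()) + pseudo * 4 ** k
    model = {kmer: math.log((n + pseudo) / total) for kmer, n in counts.items()}
    model[None] = math.log(pseudo / total)    # fallback for unseen k-mers
    return model

def score(seq, pos_model, neg_model, k=3):
    """Log-likelihood ratio: positive means 'looks like a promoter'."""
    llr = 0.0
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        llr += pos_model.get(kmer, pos_model[None]) - neg_model.get(kmer, neg_model[None])
    return llr

pos = kmer_logprobs(["TATAAT" * 10, "TATATA" * 10])   # toy promoter-like set
neg = kmer_logprobs(["GCGCGC" * 10, "CCGCGG" * 10])   # toy background set
assert score("TATATATAAT", pos, neg) > 0
assert score("GCGCGCGCGC", pos, neg) < 0
```

The choice of k here is arbitrary (k=3), which echoes the open question above about choosing the right-sized k-mer.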
4. Future Work in the coming weeks:
Based on these discussions I have had and the analyses I have done previously, my future plans
for the coming weeks are as follows:
A. Convert data into the C4.5 machine-learning data format (used by
BioBayesNet and WEKA) using perl scripts of my own (Timeline = mid-next week)
B. In parallel, implement the newer approach to calculating
nucleosome-based features, using perl scripts to format the data and the NuPoP code
within the DNAFeatures package to predict the actual feature values (Timeline = end of
next week)
C. Once the C4.5-format data are ready, run them through machine-learning tools
(starting with BioBayesNet) and see what I get (Timeline = within the next 1.5 weeks).
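A minimal sketch of the C4.5 conversion in step A (Python for illustration; my actual scripts are in perl, and the feature names, class labels, and file stem here are hypothetical placeholders):

```python
# Sketch: write feature vectors to the C4.5 .names/.data file pair.
# All feature names and values below are invented for the example.

def write_c45(stem, feature_names, rows, classes=("TSS", "nonTSS")):
    """rows: list of (feature_values, class_label) tuples."""
    with open(stem + ".names", "w") as names:
        names.write(", ".join(classes) + ".\n")        # class values first
        for feat in feature_names:
            names.write(feat + ": continuous.\n")      # all numeric features
    with open(stem + ".data", "w") as data:
        for values, label in rows:
            data.write(", ".join(str(v) for v in values) + ", " + label + "\n")

write_c45("tss_demo",
          ["gc_content", "nucleosome_score"],
          [([0.42, 0.91], "TSS"), ([0.55, 0.10], "nonTSS")])
```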
Project B – Transcription Initiation and Promoter architecture across species
1. Objective of work this week: To generate the transcription initiation and promoter architecture
data for one species end-to-end, on a proof-of-concept basis, to see the challenges associated
with this task and to establish the hierarchy of script usage.
2. Tasks done: Ran protist (Plasmodium falciparum) data through GeneSeqer and EST2TSS,
following through to the TSS-prediction step. Also looked at the different datasets available for a
given species and formulated an approach to capture information from the various formats of
available data
3. Plasmodium falciparum: the malaria parasite, for which data were released very recently
(May 2011); the data are available at PlasmoDB (http://plasmodb.org/plasmo/)
4. Approach used:
A. P. falciparum data were extracted from PlasmoDB in the form of GFF, EST, and genomic
sequence files
B. A perl script was written to parse the GFF file into coordinates corresponding to
Transcription Start Sites
C. GeneSeqer was run with the EST sequences against each of the 14 chromosomes in the
genomic sequence file (running time: 30-40 minutes per chromosome). The
GeneSeqer output was then given as input to EST2TSS, which predicts possible
Transcription Start Sites along with their orientation
D. EST2TSS can report individual TSSs and can also cluster TSS matches, within a
user-specified window, into a prospective TSR (window sizes of 40-100bp work well
for bundling/clustering TSSs; the best predictions for the example chromosome 14
were seen with a 40bp window)
E. Both the GFF and EST files are admittedly "preliminary" according to the authors, so the
data might not point to an exact TSS, but using the strict match criteria in EST2TSS, we can
compress the various nearby TSSs into plausible TSRs.
F. This little exercise helps us shape an approach for two data types – GFF and
EST/genomic sequence files
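The GFF-parsing step (B) can be sketched as follows (Python for illustration; the actual parser is a perl script, and the example GFF lines and IDs below are invented):

```python
# Sketch: extract putative TSS coordinates from a GFF file, taking the
# annotated gene start (5' end with respect to strand) as the TSS position.

def tss_from_gff(lines, feature_type="gene"):
    """Yield (seqid, tss_position, strand) for each annotated gene."""
    for line in lines:
        if line.startswith("#"):
            continue                              # skip comment/header lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 8 or cols[2] != feature_type:
            continue
        seqid, start, end, strand = cols[0], int(cols[3]), int(cols[4]), cols[6]
        tss = start if strand == "+" else end     # 5' end depends on strand
        yield (seqid, tss, strand)

gff = [
    "##gff-version 3",
    "chr14\tplasmodb\tgene\t1000\t3000\t.\t+\t.\tID=PF14_0001",
    "chr14\tplasmodb\tgene\t5000\t7000\t.\t-\t.\tID=PF14_0002",
]
assert list(tss_from_gff(gff)) == [("chr14", 1000, "+"), ("chr14", 7000, "-")]
```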
5. Plans for each format of data:
A. GFF – Use perl or R scripts to run data and extract possible TSS positions
B. EST/Genomic sequence – Use GeneSeqer+EST2TSS and tweak input parameters in EST2TSS
to give a better supported annotation than the one given in GFF file
C. CAGE – R scripts by TR
D. SAGE and RNA-Seq – Interesting, but attempt only if the package development process is
done or close to done
Contingency plan for incomplete/absent GFF files for some species: For species with
incomplete GFF data, optimize EST2TSS parameters on the existing data and extrapolate to other
data with runs of EST2TSS using the optimized parameters. For absent GFF files, use EST2TSS with a
more lenient/non-restrictive parameter set to obtain some genome annotations.
6. Research discussion over meeting today:
A. At our weekly meeting, we updated each other on the progress made from both our ends
since the last meeting, and discussed the directions to follow after this.
B. TR has been working on the CAGE datasets for humans in the FANTOM dataset. He uses the
BioMart package within R to handle these datasets and obtains an output that contains the
Gene Name, Gene Start Position, Gene Orientation, and Gene Sequence. I showed him the
outputs from EST2TSS.
C. We decided upon the first of many data-format-based “checkpoints” in our workflow. I will
be consolidating the outputs that I get from EST2TSS and the GFF files into the following
formats:
(1) .ClusterFormat – Files with this extension have Gene Names as columns, and all start
positions, with the strands they occur on, as rows. This is the input format for the TSS
clustering part of our workflow
(2) .mod.gsq – A modified GeneSeqer output format containing the columns Gene Name,
Gene Start, Gene Orientation, Gene Description, and Gene Sequence. The purpose is to
connect the EST2TSS outputs to the existing annotations, with EST2TSS used as a
quality-control tool to select only "strongly-supported" TSSs for the next step.
D. These formats will form a common basis for comparing different data types (EST, GFF,
CAGE) from different species and will also act as a checkpoint before we proceed to our next
step, which is clustering TSSs into a TSR.
E. TR will probably travel to Ames in the week of September 24th, and by then we aim to have a
substantial amount of our single-kingdom analyses, some preliminary results, and the set of
scripts that we plan to put together in one package
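The window-based bundling of TSSs into TSRs described above could be sketched as follows (Python for illustration; a simple single-linkage windowing scheme, not the exact EST2TSS or x-means implementation):

```python
# Sketch: bundle individual TSS positions into prospective TSRs using a
# user-specified window (40-100bp windows worked well; 40bp was best for
# the example chromosome 14).

def cluster_tss(positions, window=40):
    """Group TSS positions into TSRs; a new TSR starts whenever the
    gap to the previous TSS exceeds the window size."""
    clusters = []
    for pos in sorted(positions):
        if clusters and pos - clusters[-1][-1] <= window:
            clusters[-1].append(pos)
        else:
            clusters.append([pos])
    # report each TSR as (start, end, support = number of member TSSs)
    return [(c[0], c[-1], len(c)) for c in clusters]

tsrs = cluster_tss([100, 120, 135, 500, 530, 2000], window=40)
assert tsrs == [(100, 135, 3), (500, 530, 2), (2000, 2000, 1)]
```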
7. Future work in the coming weeks:
(i) Use perl scripts to format data into the two afore-mentioned formats (Timeline – next
Wed./Thurs.)
(ii) Once the formats are set, run these scripts on the next protist species, Toxoplasma gondii,
and get species- and chromosome-specific results (Timeline – next Wed./Thurs.)
(iii) Provide these files as input to the x-means clustering algorithm and see what we get
(Timeline – end of next week)