Summary for week ending August 12th
Krishnakumar Sridharan
Project A – TSS Prediction using Machine-learning
1. Summary of project status before Summer:
A. EST2TSS, created by KS, uses GeneSeqer to align EST sequences to a genomic
segment and reports genomic positions that are supported by more than a
threshold number of 5’-EST matches, and that satisfy certain additional
conditions, as candidate Transcription Start Sites (TSSs). (A minimal
sketch of the thresholding step appears after this list.)
B. XK, KS, and HL jointly created a genomic-feature calculation package,
DNAFeatures (presented to the group on 03/04/2011), for use in Transcription
Start Site and Transposon Insertion Site prediction.
C. Features currently calculated include: k-mer frequencies, nucleotide
composition, structural/physical properties, TRANSFAC motif detection, and
nucleosome prediction.
D. In the presentation, KS demonstrated usage of the package with
1) a TSS-positive dataset: the [-200, 200] genomic segment, with a TSS from
the EST2TSS output at position 0
2) a TSS-negative dataset: the [-400, -200] and [200, 400] genomic segments,
after checking for, and removing, any TSSs from the EST2TSS output in
those intervals
E. Results
1) When the feature scores were visualized, the negative dataset showed
trends that looked promising.
2) VB advised proceeding by trying other TSS-negative datasets (e.g.,
introns, random sequences) and then finding the significant features,
i.e., the ones that make a statistical difference between the classes.
3) A preliminary literature review by KS and XK suggested using the
Kolmogorov-Smirnov test, correlations, and t-tests/chi-square tests
for this purpose. (A sketch of the window construction and per-feature
testing appears after this list.)
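
For reference, the thresholding step at the heart of EST2TSS (item A above)
can be sketched in a few lines of R. This is a minimal illustration rather
than the actual EST2TSS code: it assumes the 5'-EST alignment start
coordinates have already been extracted from the GeneSeqer output into a
numeric vector, it ignores the additional filtering conditions mentioned
above, and the min_ests cutoff is a placeholder.

    # Candidate TSSs: genomic positions supported by more than `min_ests`
    # 5'-EST alignment starts. Coordinates are assumed to be pre-extracted
    # from GeneSeqer alignments; the cutoff of 3 is a placeholder.
    candidate_tss <- function(est5_starts, min_ests = 3) {
      counts <- table(est5_starts)                  # EST support per position
      as.integer(names(counts)[counts > min_ests])  # keep well-supported positions
    }

    # Example: position 1052 is supported by four 5' ESTs and is reported.
    candidate_tss(c(1052, 1052, 1052, 1052, 1053, 1090))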
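Similarly, the dataset construction of item D and the per-feature testing
suggested in item E.3 can be sketched as follows. This is a hedged sketch,
not the project code: it assumes TSS positions are plain integers on one
sequence, that the DNAFeatures scores for the two datasets are available as
numeric matrices pos_scores and neg_scores with one named column per feature
(these names are assumptions), and it adds a Benjamini-Hochberg
multiple-testing correction that the text does not mention.

    # Positive window: [-200, 200] around a TSS. Negative windows: the
    # flanking [-400, -200] and [200, 400] intervals, kept only if no
    # known TSS falls inside them (mirroring item D above).
    pos_window <- function(tss) c(tss - 200, tss + 200)
    neg_windows <- function(tss, all_tss) {
      wins <- list(c(tss - 400, tss - 200), c(tss + 200, tss + 400))
      Filter(function(w) !any(all_tss >= w[1] & all_tss <= w[2]), wins)
    }

    # Per-feature Kolmogorov-Smirnov tests between the positive and negative
    # score distributions, with Benjamini-Hochberg correction.
    ks_pvals <- sapply(seq_len(ncol(pos_scores)),
                       function(j) ks.test(pos_scores[, j], neg_scores[, j])$p.value)
    significant <- colnames(pos_scores)[p.adjust(ks_pvals, "BH") < 0.05]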
2. Literature review:
A. Supervised learning methods have been used for both binary and multi-class
classification problems. Support Vector Machines (SVMs) are non-probabilistic
binary linear classifiers. Guyon et al. (2002) used an SVM-based method to
study causative genes for cancer by analyzing microarray data.
B. Guyon et al. used a method called SVM Recursive Feature Elimination
(SVM-RFE), in which a linear classifier is trained on the feature space
repeatedly; at each round the features with the smallest weights are
eliminated, so that "bad" features are pruned recursively and the best
features remain.
C. KS modified and is implementing a variation of this SVM-RFE algorithm
using the R package e1071, with sample output from previous runs of the
DNAFeatures project as input. Currently, this code runs out of memory and
takes quite long (>12-15 hrs) before it fails; I am working to eliminate
these problems. (A minimal sketch of the core loop appears after this list.)
D. Other ways to rank features in feature selection algorithms include
Fisher's criterion, Pearson correlation coefficients, R-squared from linear
regression, and a variable's direct impact on the score.
E. Some notes on feature selection:
1) There are two types of feature selection algorithms: feature ranking,
which tries to find a "satisfactory" set of features and is reasonably
fast, and subset selection, which tries to find an "optimal" set of
features for the classification and is very time-consuming.
2) For classification problems in statistics, models usually implement a
stepwise regression to remove "bad" features, with cross-validation used
to stop the regression.
3) There are three types of subset selection algorithms: wrappers, which
are generic, evaluate each candidate subset by model fit, and are
computationally expensive; embedded methods, which find features that are
good within a specific model; and filters, which are similar to wrappers
but use a filter metric, such as correlation or mutual information,
instead of model fit.
4) The ultimate aim is to find features that are "highly correlated with
the classification but highly un-correlated with other features".
5) The optimality/evaluation criterion of machine-learning-based feature
selection methods can be AIC, BIC, or information loss (as implemented
in BioBayes).
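
For reference, the core loop of SVM-RFE (item C above) can be sketched with
e1071 roughly as follows. This is a minimal sketch under assumptions, not
KS's actual implementation: X is a numeric matrix of DNAFeatures scores
(rows = sequences, named columns = features), y is a two-level factor
(TSS-positive vs. TSS-negative), and one feature is dropped per round.

    library(e1071)

    # Minimal SVM-RFE: repeatedly fit a linear SVM, score each feature by its
    # squared weight, and drop the lowest-scoring feature. Returns the features
    # ordered from most to least important (the survivor comes first).
    svm_rfe <- function(X, y) {
      eliminated <- character(0)
      features <- colnames(X)
      while (length(features) > 1) {
        fit <- svm(X[, features, drop = FALSE], y, kernel = "linear")
        w <- t(fit$coefs) %*% fit$SV          # weight vector of the linear SVM
        worst <- features[which.min(w^2)]     # smallest |w_j| contributes least
        eliminated <- c(worst, eliminated)    # later eliminations rank higher
        features <- setdiff(features, worst)
      }
      c(features, eliminated)
    }

Regarding the memory/runtime problems mentioned above: Guyon et al. note
that more than one feature may be eliminated per round for efficiency, so
dropping, say, the lowest-weighted half of the features in early rounds, and
subsampling rows while debugging, should cut both runtime and memory use
considerably.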
3. Future goals for the coming week:
A. Trying different negative datasets and using visualization via DNAFeatures
as a method to choose the best negative dataset (Timeline = early next week).
There are four approaches I will be comparing:
1) Intergenic regions without flanking exons, from HL's Plant Genome
Table. I will trim the unknown N residues from the ends of the sequences,
choose only sequences longer than 400-500 bp, put the file in a
DNAFeatures-readable format, and visualize the scores to look for a
trend. (A sketch of the trimming/filtering step appears after this list.)
2) Intron sequences
3) Sequences of randomly chosen nucleotides
4) [-400, -200] and [200, 400] genomic segments
B. Choosing and implementing SVM-RFE or other methods to rank feature
subsets, so as to find a meaningful set (Timeline = end of next week).
C. I will attempt to build the negative datasets from one of the above
approaches, run a feature selection/ranking algorithm on them, have some
results by next week, and summarize my findings in the next update.
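
The pre-processing in item A.1 (trimming terminal Ns and keeping only
sufficiently long sequences) is simple in base R; a minimal sketch, assuming
the intergenic sequences are already loaded into a named character vector,
might look like this (the 400 bp cutoff is the low end of the 400-500 bp
range above).

    # Trim runs of unknown N residues from both ends, then keep only the
    # sequences still long enough for windowing (>= 400 bp here).
    prep_intergenic <- function(seqs, min_len = 400) {
      trimmed <- gsub("^N+|N+$", "", toupper(seqs))
      trimmed[nchar(trimmed) >= min_len]
    }

    # Example: the first sequence (600 bp) is kept, the second is dropped.
    prep_intergenic(c(ok = paste(rep("ACGT", 150), collapse = ""),
                      short = "NNNACGTACGTNNN"))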
Project B – Promoter architecture across species
I. Current status:
A. Poster presented at the ISCB Student Council meeting in Vienna, Austria,
by TR. Project workflow and abstract are in previous updates.
B. Datasets – EST/CAGE/other forms of TSS data for 12 species have been
collected from various online resources and readied for the scripts. Some
personal requests for data are pending (PPDB access is in hand).
C. Over in-person/Skype meetings during the summer semester, KS and TR
compiled the data into meaningful formats for running the scripts and also
formalized the idea and workflow of the overall project (see previous
updates).
D. Scripts – TR's R scripts and KS's Perl scripts will be integrated into a
command-line package, TSSPred (Timeline = mid-September). The code currently
lives in an online MobileMe repository and will be moved to svn soon.
II. Future work in the coming week:
A. KS will start compiling the package from all of the scripts and will
write linker scripts in Perl/R to handle data transfer between modules in
the workflow.
B. We will also formalize a large-table format for all input data to the
TSSPred package (Timeline = end of next week). (A hypothetical illustration
of such a format follows.)
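
Since the large-table format has not been formalized yet, the following is
only a hypothetical illustration (all column names and values are made up
for the example): one row per TSS in a tab-delimited file that both the R
and Perl modules could parse.

    # Hypothetical unified input table for TSSPred; columns are illustrative.
    tss_table <- data.frame(
      species  = c("A.thaliana", "O.sativa"),
      chrom    = c("Chr1", "Chr3"),
      position = c(1204522, 884201),
      strand   = c("+", "-"),
      evidence = c("CAGE", "EST")    # data source supporting each TSS call
    )
    write.table(tss_table, "tss_input.tsv",
                sep = "\t", quote = FALSE, row.names = FALSE)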