summary_for_august_1.. - Brendel Group, Indiana University

Summary for August 14-25th 2011
Krishnakumar Sridharan
Project A – TSS Prediction using Machine-learning
1. Objective: To quantitatively analyze and compare different types of non-TSS “negative”
datasets, and to choose the best one for further use in feature selection and machine learning.
2. Materials and Methods:
A. Data for Arabidopsis alone were downloaded from TAIR and the Plant Genome Table (HL), and
perl scripts were used to format the data and cut the sequences to the right length.
B. The DNAFeatures (XK,KS and HL unpublished) package was used for scoring sequence
sets for “features” such as k-mer frequencies, Nucleotide compositions, structural/physical
properties, TRANSFAC motif detection and nucleosome prediction.
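As an illustration of what a composition "feature" score can look like, here is a minimal Python sketch. It is not the DNAFeatures code, which is unpublished. The formulas are common textbook definitions assumed for illustration, and the function names, the window size and the step are hypothetical.

```python
from collections import Counter


def _ratio(numerator, denominator):
    """Return numerator/denominator, or 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def composition_features(seq):
    """Compute simple nucleotide-composition scores for one DNA window.

    Formulas are common textbook definitions, assumed here for illustration;
    DNAFeatures may define or scale these scores differently.
    """
    counts = Counter(seq.upper())
    a, c, g, t = (counts[base] for base in "ACGT")
    total = a + c + g + t  # ambiguous bases such as 'N' are ignored
    return {
        "%G+C": 100.0 * _ratio(g + c, total),
        "%A+T": 100.0 * _ratio(a + t, total),
        "GC_Skew": _ratio(g - c, g + c),
        "AT_Skew": _ratio(a - t, a + t),
        "Keto_Skew": _ratio((g + t) - (a + c), total),
        "Purine_Skew": _ratio((a + g) - (c + t), total),
        "AT/GC_Ratio": _ratio(a + t, g + c),
    }


def sliding_scores(seq, window=20, step=1):
    """Yield (start_position, features) for each window along the sequence."""
    for start in range(0, len(seq) - window + 1, step):
        yield start + 1, composition_features(seq[start:start + window])
```

Averaging the per-window scores at each start position over all sequences in a dataset would give one position-versus-score curve per feature.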
3. Datasets for Analyses:
I downloaded, cleaned, formatted and generated 4 different types of “negative” datasets and
evaluated them against one TSS-containing “positive” dataset. All sequences were 400 bp in
length. The datasets are derived as follows:
A. POSITIVE DATASET – The genomic sequence from [-200,+200], where position 0 is
the TSS. This data can be generated with either EST2TSS or TAIR10 GFF files; I used the
TAIR10 GFF file-based method to generate the data for this particular analysis.
B. NEGATIVE DATASET 1 - 100us_and_300ds – The genomic region [-300,-200] +
[+200, +500] (400 bp region) was calculated and checked for the presence of any
TSSs (from TAIR10 GFF file), before the sequence was extracted by the BLAST+
package. The data used for this particular analysis is from Chromosome 1 alone.
C. NEGATIVE DATASET 2 – INTRONIC – Intronic sequences were downloaded from
TAIR and perl scripts were run on these to format the data and cut up 400 bp
sequences. The perl script (data_cleaner_intronic.pl) ensured that
1) Intronic sequences shorter than 150bp are discarded
2) Sequences 151-399bp long are extended by adding ‘N’ characters at the end
3) Sequences > 400bp long are cut up to 400bp, counted from the left end of
the sequence.
A total of ~49,860 sequences were thus obtained, and 100 of these were used for
our analysis; these sequences are from different chromosomes.
D. NEGATIVE DATASET 3 – INTERGENIC – Intergenic sequences (Without Flanking
Exons) were obtained from the Plant Genome Table (HL) and formatted using the
same criteria as for Intronic Sequences by perl scripts. A total of ~23,100 sequences
were obtained, out of which 1000 400bp-long sequences were used for this analysis.
E. NEGATIVE DATASET 4 – RANDOM NUCLEOTIDE SEQUENCES – Random sequences
were generated using a perl script, and 1000 400bp-long sequences were used for
this analysis.
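The length rules applied to the intronic and intergenic sequences, and the generation of random sequences, can be sketched in Python as below. This is an illustrative sketch, not the original perl scripts. The function names are hypothetical. It assumes that sequences of exactly 150 bp or exactly 400 bp are kept, and that random sequences use equal base frequencies; the report specifies neither.

```python
import random

TARGET_LEN = 400   # all sequences in the analysis are 400 bp
MIN_LEN = 150      # shorter sequences are discarded


def normalize_length(seq, target=TARGET_LEN, minimum=MIN_LEN):
    """Return a sequence of exactly `target` bases, or None if it is too short.

    Assumption: sequences of exactly 150 bp are kept (padded) and sequences of
    exactly 400 bp are returned unchanged; the report does not specify these
    two boundary cases.
    """
    if len(seq) < minimum:
        return None
    if len(seq) < target:
        return seq + "N" * (target - len(seq))  # pad at the right end
    return seq[:target]  # keep the leftmost `target` bases


def random_sequences(count, length=TARGET_LEN, seed=None):
    """Generate `count` random DNA sequences with equal base probabilities.

    Assumption: uniform A/C/G/T frequencies; the original script's base
    composition is not stated.
    """
    rng = random.Random(seed)
    return ["".join(rng.choice("ACGT") for _ in range(length)) for _ in range(count)]
```

For example, `random_sequences(1000, seed=42)` would produce a reproducible set of 1000 sequences of 400 bp each.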
4. Results:
The results are displayed as MS Excel graphs for a limited set of features, for each of the
different datasets, as follows:
A. Nucleotide Composition Features’ Scores
B. Mono-nucleotide Structural Features’ Scores
C. Di-nucleotide Structural Features’ Scores
D. Tri-nucleotide Structural Features’ Scores
A. Nucleotide Composition Features’ Scores
[Figure: five line graphs of feature score (y-axis) against nucleotide position (x-axis), one per dataset: POSITIVE DATA, NEGATIVE_DATA_1, NEGATIVE_DATA_2, NEGATIVE_DATA_3 and NEGATIVE_DATA_4. Series plotted: %G+C, %A+T, GC_Skew, AT_Skew, Keto_Skew, Purine_Skew, AT/GC_Ratio.]
B. Mono-nucleotide Structural Features’ Scores
[Figure: five line graphs of feature score (y-axis) against nucleotide position (x-axis), one per dataset: POSITIVE DATA, NEGATIVE_DATA_1, NEGATIVE_DATA_2, NEGATIVE_DATA_3 and NEGATIVE_DATA_4. Series plotted: Hbondw1, Hbondw2, Hbondw3, Hbondw3', Hbondw2', Hbondw1'.]
C. Di-nucleotide Structural Features’ Scores
[Figure: five line graphs of feature score (y-axis) against nucleotide position (x-axis), one per dataset: POSITIVE DATA, NEGATIVE_DATA_1, NEGATIVE_DATA_2, NEGATIVE_DATA_3 and NEGATIVE_DATA_4. Series plotted: di-Aphylicity, di-BDNAtwistOhler, di-BDNAtwistOlson, di-DNAbendstiff, di-DNAdenature, di-duplexstab.disruptenergy, di-duplexstab.freeenergy, di-pinduceddeform, di-propellertwist, di-proteinDNAtwist, di-stackingenergy, di-ZDNAstabenergy.]
D. Tri-nucleotide Structural Features’ Scores
[Figure: five line graphs of feature score (y-axis) against nucleotide position (x-axis), one per dataset: POSITIVE DATA, NEGATIVE_DATA_1, NEGATIVE_DATA_2, NEGATIVE_DATA_3 and NEGATIVE_DATA_4. Series plotted: tri-gccontent, tri-bendability, tri-nuclpref.]
5. Conclusions:
This analysis and the subsequent visualizations served two purposes: finding the best
negative dataset and finding the most distinguishing features. The conclusions from this
work were as follows:
A. Best Negative Dataset:
1) As evaluated by the package, the “negative” datasets that give the clearest
differentiation from the “positive” dataset (which is needed for the feature-selection
algorithm) are Negative Datasets 3 and 4, i.e. the Intergenic and Random sequences.
2) Negative Dataset 1 shows some trends at 100bp that appear to come from the BLAST+
package (as suggested by Hong), since they were also present in a previous visualization
at the position where blastdbcmd cuts and perl joins the sequences.
3) The Intron Sequence (Negative Dataset 2) visualization suggests that cutting 10bp from
the 5’ end of each sequence would be a good idea.
4) This strongly suggests that Intergenic and Random sequences (and maybe trimmed
Intronic sequences) will be best for the next step.
B. Possible Best Features:
1) This analysis also gives an estimate of the features that will be returned by
the feature-selection algorithm. This will help me cross-check the results of my
feature-selection algorithm and give me some indication of the relative significance of
the different features from the DNAFeatures package.
2) Based on the above visualizations, some good features seem to be: Tri-nucleotide
preference, Di-nucleotide DNA Denaturing, Di-nucleotide Bend Stiff (flexibility),
Mononucleotide Hydrogen Bonding-w1;w1';w2;w2';w3;w3', Nucleotide Compositional
measures- %GC,%AT,AT skew, GC skew, Keto skew, Purine skew and AT/GC Ratio.
6. Future work in the coming weeks:
A. Feature Selection algorithm: The SVM-RFE algorithm that I wrote has been consistently giving
me errors about “memory allocation” and other things. I think this problem arises because I have
been using poorly formatted data from the example files of DNAFeatures. In the coming
weeks, I want to implement either this or another feature-ranking method to select
significant features. I will abandon my old trials and start afresh for this step.
B. A statistical test can be implemented on the output of DNAFeatures to see if there are
significant differences between the outputs for the “positive” and “negative” datasets.
C. Formatting data to a machine-learning compatible format: The data needs to be converted
to a format that machine-learning methods can read. This step will come after construction
of the feature-selection algorithm.
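One possible form for the statistical test mentioned in item B is a permutation test on the difference of mean scores between the two datasets. The Python sketch below illustrates that choice only. The report does not name a test, and the function name, the inputs (one list of scores per dataset for a single feature) and the number of permutations are assumptions.

```python
import random


def permutation_test(pos_scores, neg_scores, n_permutations=10000, seed=None):
    """Two-sided permutation test on the difference of group means.

    Returns (observed_difference, p_value). Both groups must be non-empty.
    """
    rng = random.Random(seed)
    n_pos = len(pos_scores)
    pooled = list(pos_scores) + list(neg_scores)

    def mean_diff(values):
        pos, neg = values[:n_pos], values[n_pos:]
        return sum(pos) / len(pos) - sum(neg) / len(neg)

    observed = mean_diff(pooled)
    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # randomly reassign scores to the two groups
        if abs(mean_diff(pooled)) >= abs(observed):
            extreme += 1
    # Add-one correction keeps the estimated p-value strictly above zero.
    return observed, (extreme + 1) / (n_permutations + 1)
```

Running this once per feature would test many features at once, so a multiple-testing correction would probably be needed before calling any single feature significant.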
Project B – Promoter architecture across species
Current Status and Future Work:
A. TR and KS had a meeting on 08/22/2011 in which we discussed formatting some data
for the Cluster of Orthologous Groups analysis part of the project. We exchanged some
data that TR had collected and formatted. This is a good idea, but not immediately
required.
B. What was immediately needed was a trial run from start to end for at least one species.
As the Protist data is immediately available in good detail (see PlasmoDB), we settled on
running our analyses on this.
C. GeneSeqer was run on the 14-chromosome Plasmodium falciparum genomic sequence
data (~6-7 hrs on my desktop). I will apply EST2TSS to the results to see the
predicted TSSs and provide the data files to Taylor to run his scripts on. The objective is
to complete an end-to-end run for one dataset using all the scripts that will be included
in the package, TSRCalc.
D. I asked TR for an input/output format description for his scripts so that I can begin
tool integration. As TR is out of town, I am awaiting news on this from him.
E. TR emailed and talked to other people who can provide more data for our analyses
(PPDB, for Arabidopsis CAGE tag data from Japan, and Tommy Chu, for C. cinerea data at
Hong Kong University). Tommy agreed to provide the data, but PPDB has yet to get back to
either of us.
F. Work for the coming weeks includes: completing the EST2TSS run on P. falciparum data,
providing results to TR to input into his script, laying the framework for the scripts’ integration
into the TSRCalc package, and meeting Taylor regarding this and future steps.