GOAL: To study the variations in the putative promoter regions of a selected set of genes across all six strains ( {Texas, Arg, Aust} x {virulent, attenuated} )
METHOD
SELECT GENES FOR STUDY: These are for single-copy genes then?
1. Prepare preliminary gene list in Texas virulent from ALL chromosomes
(this is the same as the global list from Audrey for all chromosomes)
DONE
2. Extract only the complete gene sequence (5' to 3'; note: no promoter sequence) for every gene in this preliminary list.
Call this set G. DONE (3,671 genes total)
3. Self-BLAST: G vs. G. DONE
4. Label any sequence i as "paralogous" to sequence j if:
- the BLAST hit between i and j indicates high similarity (suggesting that at least one exon is heavily shared).
For this I plan to use blastn.
We also have to set the parameters appropriately. What if I use, as the minimum cutoff:
80% identity over an alignment that spans 50% or more of one of the two gene sequences? I think this is stringent enough. Based on the MSA I did, I think you can raise the % a bit and still be fine. However, if you want to be conservative, 50% is acceptable. After all, this is such a guessing game.
(Do these parameter values sound all right to you for a start? Yes.)
Length coverage      % identity   #genes    #singleton
(% of gene length)   cutoff       mapped    genes
50                   50           189       3482
50                   80           189       3482
50                   90           164       3507
20                   80           370       3301
10                   80           498       3173
10                   90           457       3214
As can be seen from the above results, it does not seem to matter whether we vary the % identity cutoff from 50% to 80% or even to 90%. The results are nearly the same, which means that if a gene overlaps with another gene, it does so with a fairly high percent identity (>80%).
On the other hand, if we decrease the cutoff for % coverage of the gene’s length (i.e., the fraction of the gene that is included in the alignment) from 50% down to 20% or 10%, then the number of genes that map to other genes gradually increases (189 to 370 to 498). Still, the good news is that the majority of the genes are unique (3,000+).
The parameters in the last row reflect the most stringent setting for calling two genes paralogous, because 10% length coverage implies that roughly 167 bases of a gene (the average gene length is ~1670) align with more than 90% identity to another gene. So I plan to take this data set forward to the rest of the analysis, if that’s OK with you. That way we will be including 3,214 unique genes in our analysis.
The 3,214 unique genes are in this FASTA file: http://www.eecs.wsu.edu/~ananth/files/Aud/Promoter/genes/genes_notmapped.fas
The remaining 457 ambiguously mapped (or multimapped) genes are in this FASTA file: http://www.eecs.wsu.edu/~ananth/files/Aud/Promoter/genes/genes_multimapped.fas
A gene can be located in the genome from its naming format.
For example, BBOV_III010940_2341412_2342331 means:
Gene 010940 on chromosome III, located from index 2341412 to index 2342331.
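The naming format above can be unpacked mechanically. The following is only a minimal sketch: it assumes every name follows the pattern of the single example given (BBOV_, a Roman-numeral chromosome, a numeric gene number, then start and end), and the names GeneLocus and parse_gene_name are hypothetical helpers, not part of any existing pipeline.

```python
import re
from typing import NamedTuple


class GeneLocus(NamedTuple):
    gene_id: str      # e.g. "BBOV_III010940"
    chromosome: str   # Roman numeral, e.g. "III"
    gene_number: str  # kept as a string to preserve leading zeros
    start: int
    end: int


# Assumed pattern, inferred from the one example name in these notes.
_NAME_RE = re.compile(r"^(BBOV_([IVX]+)(\d+))_(\d+)_(\d+)$")


def parse_gene_name(name: str) -> GeneLocus:
    """Split a FASTA header such as BBOV_III010940_2341412_2342331."""
    match = _NAME_RE.match(name.strip())
    if match is None:
        raise ValueError(f"unexpected gene name format: {name!r}")
    gene_id, chromosome, number, start, end = match.groups()
    return GeneLocus(gene_id, chromosome, number, int(start), int(end))


if __name__ == "__main__":
    print(parse_gene_name("BBOV_III010940_2341412_2342331"))
```

Names that do not fit the assumed pattern raise an error rather than being silently mis-parsed, so any header that deviates from the example would show up immediately.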
5. Partition the set of sequences in G into paralogous groups based on the self-BLAST results.
6. Report to Audrey all "singleton" groups - meaning those that didn't map to any other sequence.
These should be the ones with no paralogs.
7. Audrey will then shortlist a subset of genes from the singleton groups. Let this set be denoted by G'.
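Steps 4 to 6 can be sketched as follows. This is an illustration under stated assumptions, not the script that produced the numbers above: it assumes the self-BLAST output is in the standard 12-column tabular format (blastn -outfmt 6: qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore), it judges each hit (HSP) on its own rather than merging several hits between the same gene pair, and the function name paralog_groups and the gene_len dictionary are hypothetical.

```python
from collections import defaultdict


def paralog_groups(hit_lines, gene_len, min_identity=90.0, min_coverage=10.0):
    """Partition genes into groups linked by qualifying self-BLAST hits.

    hit_lines: lines in BLAST 12-column tabular format (assumed input).
    gene_len:  dict mapping every gene name in G to its length in bases.
    """
    parent = {gene: gene for gene in gene_len}

    def find(gene):
        # Union-find lookup with path halving.
        while parent[gene] != gene:
            parent[gene] = parent[parent[gene]]
            gene = parent[gene]
        return gene

    for line in hit_lines:
        fields = line.rstrip("\n").split("\t")
        query, subject = fields[0], fields[1]
        if query == subject:
            continue  # every gene trivially hits itself
        pident = float(fields[2])
        qstart, qend, sstart, send = (int(v) for v in fields[6:10])
        # Coverage = % of the gene's length spanned by this one alignment.
        q_cov = 100.0 * (abs(qend - qstart) + 1) / gene_len[query]
        s_cov = 100.0 * (abs(send - sstart) + 1) / gene_len[subject]
        # "One of the two gene sequences" is read as: either one suffices.
        if pident >= min_identity and max(q_cov, s_cov) >= min_coverage:
            parent[find(query)] = find(subject)

    groups = defaultdict(list)
    for gene in gene_len:
        groups[find(gene)].append(gene)
    return sorted(sorted(members) for members in groups.values())


def singletons(groups):
    """Genes that did not map to any other gene."""
    return [members[0] for members in groups if len(members) == 1]
```

Because groups are built transitively, a gene that hits only one member of a larger group still joins that whole group, which is the conservative reading of "partition into paralogous groups".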
Promoter-centric comparison:
1. For each selected gene in G’, extract the 1,000 bp region upstream of the gene plus the first 100 bp of the gene itself from its 5’ end. Call this extracted sequence the “putative promoter” for this gene.
(Do this extraction using the GenBank annotation.)
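The extraction in step 1 can be sketched as below. This is a minimal illustration that assumes 1-based inclusive gene coordinates and a strand ("+" or "-") taken from the annotation; the function names are hypothetical, and regions that run off the end of a chromosome are simply clipped.

```python
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Reverse-complement a DNA string (A, C, G, T, N only)."""
    return seq.translate(_COMPLEMENT)[::-1]


def putative_promoter(chrom_seq, start, end, strand,
                      upstream=1000, downstream=100):
    """Return upstream region + first bases of the gene, in gene orientation.

    start, end: 1-based inclusive gene coordinates on the forward strand
                (an assumption about how the annotation is read).
    """
    if strand == "+":
        # The 5' end of the gene is at `start`.
        lo = max(start - 1 - upstream, 0)
        hi = min(start - 1 + downstream, len(chrom_seq))
        return chrom_seq[lo:hi]
    if strand == "-":
        # The 5' end of the gene is at `end`; upstream lies to its right.
        lo = max(end - downstream, 0)
        hi = min(end + upstream, len(chrom_seq))
        return reverse_complement(chrom_seq[lo:hi])
    raise ValueError(f"strand must be '+' or '-', got {strand!r}")
```

For a minus-strand gene the slice is reverse-complemented so that every putative promoter reads 5' to 3' relative to its own gene, which keeps the later BLAST hits and alignments comparable across genes.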
2. Make a set P containing the putative promoters for all genes in G’.
3. Make a database D = {all contigs in the other five strains}.
When creating this database, make sure the names of the contigs reflect their respective source strains.
4. BLAST P (as query) vs. D (as database).
a. Here too, we will have to set some BLAST alignment parameters (blastn).
How about if we use: 90% identity over an alignment length of 100 bp or more? I am not too sure here. 90% seems a bit high, but I guess we have to start somewhere. You may find that by setting the criterion at 90%, you will find quite a bit of difference?!
5. Form “gene-groups” based on the above BLAST results.
This can be done by creating one gene-group for every selected gene, and then adding all those sequences from D (that is, from the other strains) that mapped to this gene’s putative promoter.
If a sequence from a strain mapped to more than one putative promoter in Texas virulent, then label it a “multiple mapper” and remove it.
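The grouping rule in step 5 could look roughly like this. It is a sketch only: the input is assumed to be already-parsed BLAST hits as (promoter_id, subject_id, percent_identity, alignment_length) tuples, "sequence from a strain" is taken to mean the subject ID in D, and the function name and default cutoffs (the 90% / 100 bp proposal from step 4) are placeholders.

```python
from collections import defaultdict


def build_gene_groups(promoter_ids, hits, min_identity=90.0, min_length=100):
    """Group subject sequences by the single promoter they map to.

    Returns (groups, multiple_mappers):
      groups           dict promoter_id -> set of subject IDs
      multiple_mappers set of subject IDs that hit more than one promoter
    """
    promoters_by_subject = defaultdict(set)
    for promoter, subject, pident, length in hits:
        if pident >= min_identity and length >= min_length:
            promoters_by_subject[subject].add(promoter)

    # One gene-group per selected gene, even if nothing maps to it.
    groups = {promoter: set() for promoter in promoter_ids}
    multiple_mappers = set()
    for subject, promoters in promoters_by_subject.items():
        if len(promoters) > 1:
            multiple_mappers.add(subject)  # ambiguous: removed from all groups
        else:
            (promoter,) = promoters
            groups.setdefault(promoter, set()).add(subject)
    return groups, multiple_mappers
```

Returning the multiple mappers separately, rather than just dropping them, makes it easy to check how many sequences the rule discards at a given cutoff.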
6. Compute an MSA for each gene-group (program for MSA to be decided).
7. Examine the MSAs for SNPs and strain-specific variations.
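One simple way to start on step 7, once an MSA program has been chosen, is to scan the alignment column by column. The sketch below assumes the MSA has already been read into a dict of equal-length aligned strings keyed by strain name; the function name is hypothetical, gaps ("-") are treated as just another character state, and "strain-specific" is read narrowly as a column where exactly one sequence differs from all the others.

```python
from collections import Counter


def variable_columns(alignment):
    """List alignment columns where the sequences do not all agree.

    alignment: dict strain_name -> aligned sequence (all the same length).
    Returns a list of (column_index, {strain: base}, unique_to), where
    column_index is 0-based and unique_to is the one strain that differs
    from all the others, or None if the split is not one-versus-rest.
    """
    names = list(alignment)
    lengths = {len(seq) for seq in alignment.values()}
    if len(lengths) != 1:
        raise ValueError("aligned sequences must all have the same length")

    result = []
    for i in range(lengths.pop()):
        column = {name: alignment[name][i].upper() for name in names}
        counts = Counter(column.values())
        if len(counts) == 1:
            continue  # conserved column
        unique_to = None
        if len(counts) == 2 and len(names) > 2:
            rare = [base for base, n in counts.items() if n == 1]
            if len(rare) == 1:
                unique_to = next(n for n in names if column[n] == rare[0])
        result.append((i, column, unique_to))
    return result
```

Columns flagged this way are only candidates; whether a difference is a real SNP or an alignment or assembly artifact would still need to be checked by eye in the MSA.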