Plan_Lau


GOAL: To study the variations in the putative promoter regions of a selected set of genes across all six strains ({Texas, Arg, Aust} x {virulent, attenuated}).

METHOD

SELECT GENES FOR STUDY: These are for single-copy genes then?

1. Prepare preliminary gene list in Texas virulent from ALL chromosomes

(this is same as the global list from Audrey for all chromosomes)

DONE

2. Extract the complete gene sequence only (5' to 3'; note: no promoter sequence) for every gene in this preliminary list.

Call this set G. DONE: G has 3,671 genes in total.

3. Self-BLAST: G vs. G. DONE

4. Label any sequence i as "paralogous" to sequence j, if:

- the BLAST hit between i & j indicates high similarity (suggesting that at least one exon is heavily shared).

For this I plan to use blastn.

Also, we have to set the parameters appropriately. What if I use:

80% identity over an alignment that spans 50% or more of one of the two gene sequences, as the minimum cutoff? I think this is stringent enough. Based on the MSA I did, I think you can up the % a bit and still be fine. However, if you want to be conservative, 50% is acceptable. After all, this is such a guessing game.

(Do these parameter values sound alright to you for a start? Yes.)
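The proposed cutoff can be written as a simple test on each BLAST hit. The sketch below is only an illustration: it assumes tabular blastn output (for example -outfmt 6, whose first four default columns are query id, subject id, percent identity and alignment length) and a separately built dictionary of gene lengths. The function names and the gene_len dictionary are hypothetical and not part of the existing pipeline.

```python
def is_paralogous_hit(pident, aln_len, len_i, len_j, min_ident=80.0, min_cov=0.5):
    """Return True if a hit meets the identity cutoff and its alignment spans
    at least min_cov of one of the two gene sequences.

    Note: the BLAST alignment length includes gap columns, so the coverage
    computed here is an approximation.
    """
    if pident < min_ident:
        return False
    return aln_len >= min_cov * len_i or aln_len >= min_cov * len_j


def paralogous_pairs(blast_lines, gene_len, min_ident=80.0, min_cov=0.5):
    """Collect unordered gene pairs that pass the cutoff, skipping self-hits."""
    pairs = set()
    for line in blast_lines:
        if not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        query, subject = fields[0], fields[1]
        pident, aln_len = float(fields[2]), int(fields[3])
        if query == subject:
            continue  # every gene trivially hits itself in a self-BLAST
        if is_paralogous_hit(pident, aln_len, gene_len[query], gene_len[subject],
                             min_ident, min_cov):
            pairs.add(tuple(sorted((query, subject))))
    return pairs
```

Changing min_ident and min_cov in the call is enough to try the other parameter combinations without re-running BLAST, provided the BLAST run itself was not already filtered more strictly.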

Length coverage    % identity    #genes    #singleton
(% over gene)      cutoff        mapped    genes

50                 50            189       3482
50                 80            189       3482
50                 90            164       3507
20                 80            370       3301
10                 80            498       3173
10                 90            457       3214

As the results above show, it does not seem to matter whether we vary the % identity cutoff from 50 to 80% or even to 90%. The results are nearly the same, which means that if a gene overlaps with another gene, it does so with a pretty good percent identity (>80%).

On the other hand, if we decrease the cutoff for % coverage of the gene’s length (i.e., the fraction of the gene that is included in the alignment) from 50% down to 20% or 10%, then the number of genes that map to other genes gradually increases (189 to 370 to 498 at the 80% identity cutoff). But still the good news is that the majority of the genes are unique (3,000+).

The parameters in the last row are a conservative setting for calling a gene unique: 10% length coverage means that about 160 bases of a gene (the average gene length is ~1,670) aligning with more than 90% identity to another gene is enough to call the two paralogous. So I plan to take this data set forward to the rest of the analysis, if that’s OK with you. That way we will be including 3,214 unique genes in our analysis.

The 3,214 unique genes are in this file: http://www.eecs.wsu.edu/~ananth/files/Aud/Promoter/genes/genes_notmapped.fas

The remaining 457 ambiguously mapped (or multimapped) genes are in this FASTA file: http://www.eecs.wsu.edu/~ananth/files/Aud/Promoter/genes/genes_multimapped.fas

To locate any of the above genes in the genome, use its naming format.

For example, BBOV_III010940_2341412_2342331 means:

Gene 010940 on chromosome III, located from index 2341412 to index 2342331.
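As a minimal sketch of how such a name could be split into its parts, the snippet below assumes that every name follows the pattern of the single example above, with the chromosome written as a Roman numeral. The function name and the returned keys are illustrative, and the naming format does not say whether the indices are 0-based or 1-based.

```python
import re

# Pattern inferred from the one example name; treat it as an assumption.
GENE_NAME = re.compile(r"^BBOV_([IVX]+)(\d+)_(\d+)_(\d+)$")


def parse_gene_name(name):
    """Split a name such as BBOV_III010940_2341412_2342331 into its fields."""
    match = GENE_NAME.match(name)
    if match is None:
        raise ValueError(f"unexpected gene name format: {name!r}")
    chromosome, gene, start, end = match.groups()
    return {
        "chromosome": chromosome,  # e.g. "III"
        "gene": gene,              # kept as a string to preserve leading zeros
        "start": int(start),
        "end": int(end),
    }
```

For the example name this yields chromosome "III", gene "010940", start 2341412 and end 2342331.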

5. Partition the set of sequences in G into paralogous groups based on the self-BLAST results.

6. Report to Audrey all "singleton" groups - meaning those that didn't map to any other sequence.

These should be the ones with no paralogs.

7. Audrey will then shortlist a subset of genes from the singleton groups. Let this set be denoted by G'.
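Steps 5 and 6 amount to finding connected components in a graph whose nodes are genes and whose edges are the paralogous pairs. One possible sketch uses a small union-find structure; the plan does not prescribe a method, so this is an assumption, and the function names are illustrative.

```python
def group_paralogs(genes, pairs):
    """Partition genes into groups connected by paralogous pairs."""
    parent = {g: g for g in genes}

    def find(g):
        # Walk up to the group representative, shortening the path as we go.
        while parent[g] != g:
            parent[g] = parent[parent[g]]
            g = parent[g]
        return g

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two groups

    groups = {}
    for g in genes:
        groups.setdefault(find(g), []).append(g)
    return list(groups.values())


def singleton_genes(groups):
    """Genes that did not map to any other sequence (groups of size one)."""
    return [group[0] for group in groups if len(group) == 1]
```

Grouping is transitive here: if i pairs with j and j pairs with k, all three land in one group even when i and k share no direct hit.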

PROMOTER-CENTRIC COMPARISON:

1. For each selected gene in G’, extract the 1,000 bp region upstream of the gene plus the first 100 bp of the gene itself from its 5’ end. Call this extracted sequence the “putative promoter” for this gene.

(Do this extraction using the GenBank annotation.)
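The extraction in step 1 can be sketched as follows. This is a sketch under stated assumptions: gene coordinates are taken to be 1-based and inclusive on the forward strand, and genes annotated on the reverse strand are handled by taking the region beyond the gene's end coordinate and reverse-complementing it. The plan does not mention strand, so that handling, like the function names, is an assumption.

```python
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    return seq.translate(_COMPLEMENT)[::-1]


def putative_promoter(chrom_seq, start, end, strand, upstream=1000, into_gene=100):
    """Return the upstream region plus the first into_gene bases of the gene.

    start, end: 1-based inclusive forward-strand coordinates (assumption).
    strand: "+" or "-". The region is truncated at the sequence boundaries.
    """
    if strand == "+":
        lo = max(0, start - 1 - upstream)
        hi = min(len(chrom_seq), start - 1 + into_gene)
        return chrom_seq[lo:hi]
    if strand == "-":
        # On the reverse strand the gene's 5' end is at `end`.
        lo = max(0, end - into_gene)
        hi = min(len(chrom_seq), end + upstream)
        return reverse_complement(chrom_seq[lo:hi])
    raise ValueError("strand must be '+' or '-'")
```

With the defaults, a gene far enough from the contig ends yields a 1,100 bp sequence. Genes near a contig boundary yield shorter sequences, and these may deserve separate treatment.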

2. Make a set P containing the putative promoters for all genes in G’.

3. Make a database D = {all contigs in the other five strains}.

When creating this database, make sure the names of the contigs reflect their respective source strains.

4. BLAST P (as query) vs. D (as database).

a. Here too, we will have to choose some BLAST alignment parameters (blastn).

How about if we use: 90% identity over an alignment length of 100 bp or more? I am not too sure here. 90% seems a bit high but I guess we have to start somewhere. You may find that setting the criterion at 90% makes quite a bit of difference.

5. Form “gene-groups” based on the above BLAST results.

This can be done by creating one gene-group for every selected gene, and then adding all those sequences from D (from other strains, that is) that mapped to this gene’s putative promoter.

If a sequence from a strain mapped to more than one putative promoter in Texas virulent, then label it a “multiple mapper” and remove it.
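The grouping and the removal of multiple mappers could look like the sketch below. It assumes the BLAST hits have already been filtered by the chosen cutoff and reduced to (promoter id, subject id) pairs. It also assumes each subject id identifies the matched region (for example strain, contig and hit coordinates) rather than a whole contig, since a whole contig would naturally cover many promoters. All identifiers shown are hypothetical.

```python
from collections import defaultdict


def form_gene_groups(hits):
    """Build one group per promoter and drop subjects hitting several promoters.

    hits: iterable of (promoter_id, subject_id) pairs that passed the cutoff.
    Returns (groups, multi_mappers).
    """
    hits = list(hits)

    promoters_per_subject = defaultdict(set)
    for promoter, subject in hits:
        promoters_per_subject[subject].add(promoter)

    # A subject that maps to more than one putative promoter is ambiguous.
    multi_mappers = {s for s, ps in promoters_per_subject.items() if len(ps) > 1}

    groups = defaultdict(set)
    for promoter, subject in hits:
        if subject not in multi_mappers:
            groups[promoter].add(subject)
    return dict(groups), multi_mappers
```

A promoter whose hits were all multiple mappers ends up with no group at all, and it may be worth reporting such promoters separately.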

6. Compute an MSA for each gene-group (program for MSA to be decided).

7. Examine the MSAs for SNPs and strain-specific variations.
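One simple way to start the examination in step 7 is to list the alignment columns where the strains disagree. The sketch below assumes the MSA is already available as a dictionary mapping a strain label to its aligned sequence, with "-" marking gaps. The labels and the function name are illustrative, and no particular MSA program is implied.

```python
def variable_columns(alignment):
    """List alignment columns in which the sequences differ.

    alignment: dict mapping strain label -> aligned sequence (equal lengths).
    Returns (position, kind, column) tuples, where position is 1-based,
    kind is "SNP" (no gap in the column) or "indel" (a gap is present),
    and column maps each strain label to its character at that position.
    """
    names = list(alignment)
    lengths = {len(alignment[name]) for name in names}
    if len(lengths) != 1:
        raise ValueError("aligned sequences must all have the same length")

    sites = []
    for pos in range(lengths.pop()):
        column = {name: alignment[name][pos].upper() for name in names}
        alleles = set(column.values())
        if len(alleles) > 1:
            kind = "indel" if "-" in alleles else "SNP"
            sites.append((pos + 1, kind, column))
    return sites
```

Because each reported column keeps the per-strain characters, strain-specific variants (one strain differing from the other five) can be picked out from the same output.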
