General Instructions:
Deadline: 2/12/13 23:55.
Submit in the published pairs only.
Submission is electronic only, via the course website.
Question 1 – Scoring matrices and protein alignment
You are given a partial multiple sequence alignment of a certain protein and its homologs in other organisms:
Based on this data only, compute the PAM matrix entries of (I,M) and (S,A). Use the following steps:
a. Compute the frequency of each AA (amino acid).
b. For each pair, compute the average substitution frequency (the average of X->Y and Y->X) and multiply by 1000.
Note:
The first sequence (in bold) is the reference sequence.
The overall number of possible substitutions in this alignment is:
(sequence length)*(number of sequences-1) = 20*5 = 100
c. Divide the substitution frequency by the frequency of the AA that was substituted.
d. Convert the ratio to log10 and multiply by 10.
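The steps a to d can be sketched in Python as follows. This is only an illustration under stated assumptions: the toy alignment is hypothetical (it is not the alignment from the assignment), the function name pam_like_entry is made up for this sketch, AA frequencies are taken over all residues in the alignment, and "the AA that was substituted" is taken to be the first residue of the pair. If the course intends a different convention, adjust accordingly.

```python
import math
from collections import Counter


def pam_like_entry(alignment, x, y):
    """Return 10*log10 of the normalized substitution frequency for the pair (x, y).

    alignment[0] is treated as the reference sequence; all rows must have equal length.
    """
    ref, others = alignment[0], alignment[1:]

    # Step a: frequency of each amino acid, taken here over every residue in the
    # alignment (an assumption; the assignment does not say which pool to use).
    counts = Counter("".join(alignment))
    total_residues = sum(counts.values())
    freq = {aa: n / total_residues for aa, n in counts.items()}

    # Step b: count x->y and y->x substitutions relative to the reference,
    # average them, divide by the number of possible substitutions,
    # and multiply by 1000.
    possible = len(ref) * len(others)
    x_to_y = sum(1 for seq in others for r, s in zip(ref, seq) if r == x and s == y)
    y_to_x = sum(1 for seq in others for r, s in zip(ref, seq) if r == y and s == x)
    avg_sub = 1000 * ((x_to_y + y_to_x) / 2) / possible

    if avg_sub == 0:
        return float("-inf")  # pair never observed; log10(0) is undefined

    # Step c: divide by the frequency of the substituted residue (assumed to be x).
    ratio = avg_sub / freq[x]

    # Step d: convert the ratio to log10 and multiply by 10.
    return 10 * math.log10(ratio)


# Hypothetical toy alignment (NOT the alignment from the assignment).
toy_alignment = ["IMSA", "MMSA", "IISA", "IMAA"]
for pair in [("I", "M"), ("S", "A")]:
    print(pair, round(pam_like_entry(toy_alignment, *pair), 2))
```

For the real exercise, the toy alignment would be replaced by the six aligned sequences, with the reference sequence first.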
Compute the alignment score for seq1 with seq2 using two different matrices: PAM250 and BLOSUM62.
e. What are the scores of the alignments using each matrix?
f. Compare the scores of the different substitution matrices. Are the scores on the diagonal the same in the two matrices? Explain why.
BLOSUM62
PAM250
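Scoring an alignment with a substitution matrix amounts to summing the matrix entry for each aligned column. A minimal sketch, assuming an ungapped alignment and a symmetric matrix stored as a dictionary of residue pairs. The two toy matrices and the sequences below are hypothetical placeholders, not real PAM250 or BLOSUM62 values; the entries from the matrices given above should be substituted in.

```python
def score_alignment(seq_a, seq_b, matrix):
    """Sum substitution scores column by column for two aligned, ungapped sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    total = 0
    for a, b in zip(seq_a, seq_b):
        # The matrix is symmetric, so look the pair up in either order.
        total += matrix[(a, b)] if (a, b) in matrix else matrix[(b, a)]
    return total


# Hypothetical placeholder matrices (values chosen for illustration only).
TOY_MATRIX_1 = {("I", "I"): 4, ("M", "M"): 5, ("S", "S"): 4, ("A", "A"): 4,
                ("I", "M"): 1, ("S", "A"): 1}
TOY_MATRIX_2 = {("I", "I"): 5, ("M", "M"): 6, ("S", "S"): 2, ("A", "A"): 2,
                ("I", "M"): 2, ("S", "A"): 1}

# Hypothetical aligned sequences (NOT seq1 and seq2 from the assignment).
toy_seq1, toy_seq2 = "IMSA", "MMAA"
print("toy matrix 1:", score_alignment(toy_seq1, toy_seq2, TOY_MATRIX_1))
print("toy matrix 2:", score_alignment(toy_seq1, toy_seq2, TOY_MATRIX_2))
```

Keeping the matrix as a separate argument makes it easy to score the same pair of sequences under both matrices and compare the totals.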
Question 2 – Phylogenetic trees
a. Create a phylogenetic tree:
i. Go to: http://www.phylogeny.fr/ and use the “A-la-Carte” mode. Change the following parameters:
Multiple alignment: ClustalW
Construction of phylogenetic tree: Distances: BioNJ
ii. Insert the sequences from question 1 and create a phylogenetic tree. Add a screenshot showing your tree.
b.
c. The full protein sequences are available in the file "FullSequences.txt" on the site (it is in multi-FASTA format, which we discussed in the tutorial; each title line, starting with ">", annotates the sequence that follows). Use these sequences to determine the organisms they originated from.
d. Did your answer change compared to that of 2.B? Explain why.
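The multi-FASTA layout described above (a title line starting with ">" followed by one or more sequence lines) can be read with a few lines of Python. This is a minimal sketch: the function name parse_multi_fasta and the example records are hypothetical, and the real contents of FullSequences.txt are not reproduced here.

```python
def parse_multi_fasta(text):
    """Map each title (without the leading '>') to its full sequence."""
    records = {}
    title = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            title = line[1:].strip()
            records[title] = []
        elif title is not None:
            # A sequence may span several lines; collect them all.
            records[title].append(line)
    return {t: "".join(parts) for t, parts in records.items()}


# Hypothetical example text (NOT the contents of FullSequences.txt).
example = """>seqA example protein
MKTAY
IAKQR
>seqB another example
MQQDG
"""

for title, seq in parse_multi_fasta(example).items():
    print(title, len(seq))

# To read the course file instead (assuming it sits in the working directory):
# with open("FullSequences.txt") as handle:
#     records = parse_multi_fasta(handle.read())
```

Each recovered sequence can then be pasted into a search tool of your choice to look up its source organism.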
Question 3 – PSI-BLAST
A researcher is interested in studying the PAX gene family, a family of transcription factors that is involved in the formation of tissues and organs during embryonic development and in many human diseases.
One member of this family, PAX4, is known to be involved in diabetes.
Use the following query sequence to run PSI-BLAST.
Change the following parameters:
‘Max target sequences’ to 250.
‘Matrix’ to BLOSUM45
>pax4
MQQDGLSSVNQLGGLFVNGRPLPLDTRQQIVQLAIRGMRPCDISRSLKVSNGCVSKILGRYYRTGVLEP
KCIGGSKPRLATPAVVARIAQLKDEYPALFAWEIQHQLCTEGLCTQDKAPSVSSINRVLRALQEDQSLH
WTQLRSPAVLAPVLPSPHSNCGAPRGPHPGTSHRNRTIFSPGQAEALEKEFQRGQYPDSVARGKLAAAT
SLPEDTVRVWFSNRRAKWRRQEKLKWEAQLPGASQDLTVPKNSPGIISAQQSPGSVPSAALPVLEPLSP
SFCQLCCGTAPGRCSSDTSSQAYLQPYWDCQSLLPVASSSYVEFAWPCLTTHPVHHLIGGPGQVPSTHC
SNWP
a. Explore and describe the top 10 matches that were found after the first iteration.
From which organism was the sequence taken? What other organisms were found to have a matching sequence? Do you expect the different protein matches found in the top results to have a similar function? Explain.
Attach a screenshot of the top 10 matches’ results.
The PAX4 gene (coding for the PAX4 protein) has both paralogs and orthologs.
Use what you have learned, together with information available in the literature and on the internet about orthologs and paralogs, to answer the following questions:
b. Find 2 paralogs and 2 orthologs of the human PAX4 in your PSI-BLAST results.
c. Where do you expect to see higher levels of similarity: between orthologs or between paralogs? Do the results you received match your expectation? Explain.
d. Search for the following family members in the results: PAX-3, PAX-6, and PAX-7.
Use the information from your PSI-BLAST search to rank the 3 family members relative to PAX-4. What parameter did you use for the ranking? Explain.
Note: for each detected protein, use the information from the first appearance of the protein in the PSI-BLAST results to answer the question.
e. From the first PSI-BLAST iteration, choose only the PAX-7 matches and perform a second iteration of PSI-BLAST. Now repeat question 3d on the second iteration results.
Explain the differences between the results in 3d and 3e.