Proteins dictate function in an organism: What happens as proteins evolve?

Budding yeast: Saccharomyces cerevisiae (sugar fungus)
Fission yeast: Schizosaccharomyces pombe

In our project, we will determine whether functional homologs of S. cerevisiae Met proteins are present in S. pombe.
This semester: five genes from S. pombe will be transferred to S. cerevisiae.
What organism should the class study after we finish the S. pombe genes? A look at the molecular phylogeny should help.

Are there any correlations between the kinds of amino acid substitutions observed over evolution and their chemistry?
How are bioinformatics tools used to analyze the conservation of protein sequences?
How can I identify the regions of proteins that are most strongly conserved and most likely to be important for function?

For proteins to maintain their function, they cannot tolerate drastic changes to their shapes.
Amino acid substitutions that significantly perturb the structure of a protein or alter its chemistry can cause the protein to lose function.

Met16p from S.
cerevisiae complexed with PAP (2OQ2)

Recall that the final folded form of a protein is determined by its primary sequence.
R ("reactive") groups form a variety of bonds important for structure and function.

Cysteine is one of the most evolutionarily constrained amino acids.
Cys-254 is in close proximity to the end product, PAP, suggesting that it plays a role in catalysis.
Custom view of Met16p highlighting Cys. Protein: backbone view. PAP: ball-and-stick. Cysteine: space-fill.

Amino acids can be grouped according to the chemistry and size of their R groups.
[Figure: diagram of overlapping amino acid groups. Group labels: acidic, basic, charged, polar, small, aromatic, neutral, hydrophobic. Acidic: Glu (E), Asp (D). Basic: Arg (R), His (H), Lys (K). Other residues shown: Asn (N), Gln (Q), Gly (G), Thr (T), Cys (C), Tyr (Y), Ser (S), Trp (W), Ala (A), Ile (I), Val (V), Phe (F), Leu (L), Met (M), Pro (P).]

Most amino acids are abbreviated by their first letter (abundant, hydrophobic ones get preference):
  A  Ala  alanine
  C  Cys  cysteine
  G  Gly  glycine
  H  His  histidine
  I  Ile  isoleucine
  L  Leu  leucine
  M  Met  methionine
  P  Pro  proline
  S  Ser  serine
  T  Thr  threonine
  V  Val  valine

Phonetic abbreviations:
  F  Phe  phenylalanine
  R  Arg  arginine

Oddballs (charged, aromatic, some polar):
  D  Asp  aspartic acid
  E  Glu  glutamic acid
  K  Lys  lysine
  N  Asn  asparagine
  Q  Gln  glutamine
  W  Trp  tryptophan
  Y  Tyr  tyrosine

The one-letter code needs to be part of a 21st-century biologist's vocabulary.

Studying the evolutionary conservation of amino acids in sequences provides a sense of how important each amino acid is to protein function.

BLOSUM62 (BLOcks SUbstitution Matrix) is based on the substitution statistics observed in alignments of proteins that are at least 62% identical.
The matrix assigns scores for substitutions:
  The maximum score goes to the same amino acid (completely conserved, possibly essential).
  Positive scores are awarded for common amino acid substitutions, in decreasing order, based on their occurrence in proteins.
  Negative scores mark unlikely substitutions.
  Note the high score for Cys!
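The scoring rules above can be made concrete with a small lookup sketch. The dictionary below is a hand-copied subset of BLOSUM62, not the full matrix, and the names `BLOSUM62_SUBSET` and `substitution_score` are hypothetical helpers introduced only for this illustration. It is meant to show the pattern (identity scores highest, chemically similar swaps score positive, dissimilar swaps score negative), not to stand in for a real scoring library.

```python
# Hand-copied subset of BLOSUM62 (assumption: only these pairs are included).
# Keys are stored in alphabetical order so that lookups are symmetric.
BLOSUM62_SUBSET = {
    ("C", "C"): 9,    # unchanged Cys: one of the highest identity scores
    ("W", "W"): 11,   # unchanged Trp: the highest identity score
    ("A", "A"): 4,    # unchanged Ala
    ("L", "L"): 4,    # unchanged Leu
    ("I", "L"): 2,    # hydrophobic for hydrophobic
    ("D", "E"): 2,    # acidic for acidic
    ("K", "R"): 2,    # basic for basic
    ("D", "L"): -4,   # acidic for hydrophobic
    ("C", "E"): -4,   # Cys for acidic
}


def substitution_score(a, b):
    """Return the BLOSUM62 score for residues a and b (one-letter codes).

    Raises KeyError if the pair is not in the hand-copied subset.
    """
    key = tuple(sorted((a.upper(), b.upper())))
    return BLOSUM62_SUBSET[key]


if __name__ == "__main__":
    for pair in [("C", "C"), ("L", "I"), ("D", "E"), ("L", "D")]:
        print(pair, substitution_score(*pair))
```

In this subset, swapping one hydrophobic residue for another (Leu for Ile) still scores positive, while swapping Leu for the acidic Asp scores strongly negative.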
The biochemical connection: higher scores are frequently correlated with conservative amino acid substitutions, based on the chemistry and size of the amino acids.

BLAST
BLAST is an acronym for Basic Local Alignment Search Tool, a computer algorithm for finding homologous sequences in databases.
  BLASTN compares nucleic acid sequences.
  BLASTP compares protein sequences.

BLOSUM62 is the default scoring matrix for BLASTP.
BLOSUM62 scores relate the frequency of a particular substitution to the probability that it occurs by chance in proteins that are at least 62% identical throughout their length:

\[ \text{Score} = k \log_{10}\left( \frac{P_{ij}}{Q_i \, Q_j} \right) \]

  k is a scaling factor used to produce integral values.
  P_ij is the observed frequency of two amino acids (i and j) replacing each other in homologous sequences.
  Q_i and Q_j are the probabilities of finding i and j randomly in a sequence.

Positive and negative scores suggest that amino acid changes have been selected for (positive) or against (negative) during evolution.
The magnitude of the score suggests the strength of the selection.
A score of zero suggests that a particular substitution can be explained by chance alone.

BLASTP begins with a query sequence (e.g.
your MET sequence).
The query sequence is broken into "words" that will act as seeds in alignments.
BLAST searches for matches (or synonyms) to these words in target entries in the database.
If a target entry has two or more matches to "words" from the query, the alignment is extended in both directions, looking for additional similarity.

"Words" are integral to the BLASTP search
BLASTP uses a sliding window to identify words.
Consider the sequence: E A G L E S
BLASTP would break this down into a series of four 3-letter words:
  E A G
    A G L
      G L E
        L E S
Tip! Use a non-proportional (fixed-width) font such as Courier when working with database entries. The fonts are uglier, but the letters have constant spacing that generates nice columns!

Next: words are given a numerical score
BLASTP uses the BLOSUM62 matrix as its default for assigning values to words:
  E A G   5 + 4 + 6 = 15
  A G L   4 + 6 + 4 = 14
  G L E   6 + 4 + 5 = 15
  L E S   4 + 5 + 4 = 13

BLASTP next checks for word synonyms (1-letter replacements) with a score greater than a default threshold of 10.
Of the 57 possible synonyms for each word (3 positions × 19 alternative amino acids), only a small handful are statistically likely to appear in homologous proteins:
  E A G:  K A G (11), E S G (12), E C G (11), E T G (11), E V G (11)
  A G L:  S G L (11), A G I (12)
  G L E:  G I E (13), G L D (12), G L Q (12)
  L E S:  I E S (11)

BLASTP will search for all of these words and synonyms in the protein database.
Sequences must have at least two word matches for further consideration.

BLASTP uses word matches as a nucleus and extends them in both directions, looking for additional similarity.
  Query:    Q A S T L Y E - A G L E S E A T T N - - R R E I
  Summary:  + A + T + + + G L E S E A + + R + E +
  Target:   N A A T Y W D A S G L E S - - - S Q I I R K E L

As BLASTP extends the alignment out from the match, it calculates a running score. Extension stops when the score drops below a threshold value.
Penalties
are assigned for gaps and mismatches.
Plus signs in the summary line indicate a positive BLOSUM62 value.

Highly conserved protein sequences are often essential for function
You will compare sequences of homologous proteins from model organisms:
  Escherichia coli K-12 (gram negative)
  Caenorhabditis elegans
  Arabidopsis thaliana
  Bacillus subtilis str. 168 (gram positive)
  Mus musculus

Phylogeny.fr provides tools for preparing multiple sequence alignments and phylogenetic trees.
Multiple sequence alignments show regions of conservation. Identical amino acids are shown in blue; conservative changes are shown in grey.

TreeDyn generates a phylogenetic tree.
  The length of a branch reflects the time since divergence from a node.
  Bootstrap values estimate the reliability of nodes in the tree (max = 1.0).
  Length corresponds to 600 million years.

The WebLogo program provides a graphical depiction of multiple sequence alignments.
The size of each amino acid letter reflects the frequency with which that amino acid is found at the position. Note the positions of amino acids with high BLOSUM scores.
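The per-position counting behind a sequence logo can be sketched in a few lines. This is only a rough illustration of tallying residue frequencies in each alignment column; it is not how WebLogo itself is implemented. The function names and `toy_alignment` are hypothetical, and the toy sequences are made up for the example (they are not real Met protein sequences).

```python
from collections import Counter


def column_frequencies(alignment):
    """Return one dict per alignment column mapping residue -> frequency."""
    lengths = {len(seq) for seq in alignment}
    if len(lengths) != 1:
        raise ValueError("all aligned sequences must have the same length")
    n = len(alignment)
    freqs = []
    for column in zip(*alignment):  # walk the alignment column by column
        counts = Counter(column)
        freqs.append({res: c / n for res, c in counts.items()})
    return freqs


def fully_conserved_positions(alignment):
    """Return 1-based positions where every sequence has the same residue."""
    return [
        i
        for i, column in enumerate(zip(*alignment), start=1)
        if len(set(column)) == 1 and column[0] != "-"  # ignore all-gap columns
    ]


# Made-up toy alignment (assumption: illustration only, not real data).
toy_alignment = [
    "ACGLW",
    "ACGIW",
    "SCGLW",
    "ACALW",
]

if __name__ == "__main__":
    for pos, freq in enumerate(column_frequencies(toy_alignment), start=1):
        print(pos, freq)
    print("fully conserved:", fully_conserved_positions(toy_alignment))
```

In the toy data, the Cys and Trp columns are the only fully conserved ones, which mirrors the observation that residues with high BLOSUM identity scores tend to stand out in a logo.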