Computational Approach for Predicting Interaction Sites of Cytochrome and Photosystem I

W. Chen, A. Sekmen, B. Bruce, K. Nguya, P. Mishra, L. Emujakporue, K. Wehbi
Computer Science, Tennessee State University
Biochemistry & Cellular & Molecular Biology, University of Tennessee at Knoxville
Supported by NSF Targeted Infusion Grant (1137484) & TN-SCORE thrust on Nanostructures for Enhancing Energy Efficiency
BICOB 2013

Outline
Research Background
Problem and Challenge
Methods
  Interaction Relation between Cytochrome and Photosystem I
  Prediction Algorithms
Results and Analysis
Summary and Future Work

Research Background: Natural Photosynthetic Process
Hydrogen is a particularly useful energy carrier for transportation. However, there are no sources of molecular hydrogen on the planet. It therefore remains a difficult challenge to find an efficient and environmentally sustainable way of producing, capturing, and storing this highly attractive yet dilute energy source.
The natural photosynthetic process is not efficient or quantitative.

Research Background: Artificial Photosynthetic Process
Research centers at UTK recently demonstrated that the natural process of photosynthesis can be redirected to produce molecular hydrogen. They have characterized and partially optimized protein-metal hybrid complexes that, when exposed to light, generate hydrogen at a high rate and are temporally and thermally stable. Specifically, they are using mutagenesis to increase the affinity between cyt c6 and PSI from the thermophilic cyanobacterium Thermosynechococcus elongatus.
The aim is to artificially redirect or engineer proteins that can donate and accept large numbers of electrons through protein interaction, in order to produce large quantities of energy.

Problem and Challenge
The artificial process requires remodeling the protein-protein interface to include new residues, introduced into the native complexes, that create binding sites similar to those found in green algae and higher plants.
Future improvement involves further kinetic optimization of electron transfer within photosystem I. The lack of a crystal structure for the bound binary complex has so far made traditional structural biology tools unavailable. Some low-resolution structural approaches, such as chemical cross-linking, have been used to investigate this interaction.

Goal of this research
Computationally predict the interaction sites of protein pairs (donors and acceptors) that tap into photosynthetic processes to produce efficient and inexpensive energy.

Interaction Relation between Cytochrome c6 and Photosystem I PsaF
Three types of amino acid bonding:
1. Electrostatic bonding
2. Hydrogen bonding
3. Hydrophobic bonding

Interaction Relation between Cytochrome c6 and Photosystem I PsaF: Electrostatic Bond

$$N = \{E, D\}, \qquad P = \{R, H, K\}, \qquad R_e = \{(x, y) \mid x \in N \text{ and } y \in P\}$$

[Figure: electron-dot structures of the charged groups, one labeled "Lose an electron" and the other "Get an electron".]

$$W_e(x, y) = \begin{cases} \alpha & \text{if } x \in \{E, D\} \text{ and } y \in \{R, K\} \\ 0.1\alpha & \text{if } x \in \{E, D\} \text{ and } y = H \\ -\beta & \text{otherwise} \end{cases}$$

Here α is the weight of a full salt bridge and β is the penalty for a pair that does not interact electrostatically, written here as β; the experiments use the parameter pair W_e: 1, 0.22.

Interaction Relation between Cytochrome c6 and Photosystem I PsaF: Hydrogen Bond

$$R_h = \{(x, y) \mid x \in \{R, H, K, S, T, N, Q, W, Y\}\}$$

$$W_h(x, y) = \begin{cases} h & \text{if } (x, y) \in R_h \\ 0 & \text{if } (x, y) \notin R_h \end{cases}$$

where h is the hydrogen-bond weight parameter (W_h: 0.1 in the experiments).

Approach: Prediction Algorithms
1. Calculate the interaction score for each pair of residue subsequences from the PsaF and c6 proteins by dynamic programming.
2. Trace back to get the k interaction sites with the top k scores.
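The electrostatic weight W_e defined above can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the function name `electrostatic_weight` is hypothetical, and the defaults `alpha=1.0` and `beta=0.22` assume that the parameter pair "W_e: 1, 0.22" reported with the results denotes the salt-bridge weight and the penalty applied to every other pair.

```python
NEGATIVE = {"E", "D"}          # acidic residues (the set N)
STRONG_POSITIVE = {"R", "K"}   # basic residues that form a full salt bridge


def electrostatic_weight(x: str, y: str, alpha: float = 1.0, beta: float = 0.22) -> float:
    """Weight W_e(x, y) for a c6 residue x paired with a PsaF residue y.

    alpha and beta are assumed parameter values, not confirmed by the source.
    """
    if x in NEGATIVE and y in STRONG_POSITIVE:
        return alpha            # E or D paired with R or K
    if x in NEGATIVE and y == "H":
        return 0.1 * alpha      # histidine counts as a weaker positive partner
    return -beta                # every other pair is penalized


if __name__ == "__main__":
    for pair in [("D", "R"), ("E", "H"), ("A", "K")]:
        print(pair, electrostatic_weight(*pair))
```

Keeping the weight as a pure function of the residue pair makes it easy to swap in the windowed variant or to add the hydrogen-bond term without touching the dynamic-programming loop.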
Prediction Algorithms: Algorithm 1, calculate the score using a window

$$W_{ew}(x, y) = \begin{cases} \alpha & \text{if } x \in \{E, D\} \text{ and } y \in \{R, K\} \\ 0.1 & \text{if } x \in \{E, D\} \text{ and } y = H \\ 0.2 & \text{if } x \in \{E, D\},\ y \notin \{R, K, H\} \text{ and } y' \in \{R, K\}, \\ & \text{or if } x \notin \{E, D\},\ x' \in \{E, D\} \text{ and } y \in \{R, K\} \\ 0.02 & \text{if } x \in \{E, D\},\ y \notin \{R, K, H\} \text{ and } y' = H, \\ & \text{or if } x \notin \{E, D\},\ x' \in \{E, D\} \text{ and } y = H \\ -\beta & \text{otherwise} \end{cases}$$

Here x' and y' denote residues inside the window around x and y, and α and β are the two W_e parameters (α = 1, β = 0.22 in the example below).

Examples:
c6    **********D*****        W_e(D, R) = α
PsaF  *******R**********

c6    **********D*****        W_e(D, H) = 0.1
PsaF  *******H**********

c6    **********D*****        W_e(D, Y) = 0.2
PsaF  *******Y*R********

c6    **********D*****        W_e(D, Y) = 0.02
PsaF  *******Y*H********

The score at (x_i, y_j) is decided by the score at (x_{i-1}, y_{j-1}):

$$S[i, j] = \begin{cases} 0 & \text{if } i = 0 \text{ or } j = 0 \\ \max\{S[i-1, j-1] + W(x_i, y_j),\ 0\} & \text{otherwise} \end{cases}$$

where $W(x, y) = W_{ew}(x, y)$ or $W(x, y) = W_{ew}(x, y) + W_h(x, y)$.

Prediction Algorithms: Algorithm 1, calculate the score using a window (of length 7)
W_e: 1, 0.22

S     0     1 A   2 E   3 L   4 M   5 D   6 S   7 E   8 A   9 E
0     0     0     0     0     0     0     0     0     0     0
1 G   0     0     0.2   0     0     0.2   0     0.2   0     0.2
2 P   0     0     0.2   0     0     0.2   0     0.2   0     0.2
3 R   0     0.2   1     0.4   0.2   1     0.4   1     0.4   1
4 F   0     0     0.4   0.78  0.18  0.4   0.78  0.6   0.78  0.6
5 K   0     0.2   1     0.6   0.98  1.18  0.6   1.78  0.8   1.78
6 Y   0     0     0.4   0.78  0.38  1.18  0.96  0.8   1.56  1
7 K   0     0.2   1     0.6   0.98  1.38  1.38  1.96  1     2.56
8 H   0     0.02  0.3   1.02  0.62  1.08  1.4   1.48  1.98  1.1

Interaction site/sequence with the score 2.56:
DSEAE
RFKYK

Prediction Algorithms: Algorithm 2, calculate the score allowing gaps (insertions/deletions)
The score at (x_i, y_j) is decided by the scores at (x_{i-1}, y_{j-1}), (x_{i-1}, y_j), and (x_i, y_{j-1}), and by the weight W(x_i, y_j):

$$S[i, j] = \begin{cases} 0 & \text{if } i = 0 \text{ or } j = 0 \\ \max\{S[i-1, j-1] + W(x_i, y_j),\ S[i-1, j] - g,\ S[i, j-1] - g,\ 0\} & \text{otherwise} \end{cases}$$

where $W(x, y) = W_e(x, y)$ or $W(x, y) = W_e(x, y) + W_h(x, y)$, and g is the gap penalty.

W_e: 1, 0.22

S     0     1 A   2 E   3 L   4 M   5 D   6 S   7 E   8 A   9 E
0     0     0     0     0     0     0     0     0     0     0
1 G   0     0     0     0     0     0     0     0     0     0
2 P   0     0     0     0     0     0     0     0     0     0
3 R   0     0     1     0     0     1     0     1     0     1
4 F   0     0     0.8   0.78  0.58  0.8   0.78  0.8   0.78  0.8
5 K   0     0     1     0.80  0.60  1.58  1.38  1.78  1.58  1.78
6 Y   0     0     0.8   0.78  0.58  1.38  1.36  1.58  1.56  1.58
7 K   0     0     1     0.78  0.58  1.58  1.38  2.36  2.16  2.56
8 H   0     0     0.8   0.78  0.58  1.38  1.36  2.16  2.14  2.36

Interaction site/sequence with the score 2.36:
ELMDSE
R-FKYK

Prediction Algorithms: Speeding up the prediction by parallelization
Theoretically, the algorithms can be similarly executed
1. in O(log m log n) time using O(mn / log m) processors in the CREW PRAM model by A. Apostolico et al.'s approach, where m = min{|X|, |Y|}, n = max{|X|, |Y|}, and X and Y are the pair of protein sequences [11];
2. in O(1) time using m + n processors in the BSR model [12].

Practically, we can use a computer with multiple cores:
Step 1: Divide the |X| × |Y| matrix S into k × k blocks such that each block (|X|/k × |Y|/k elements) can be calculated in O(|X|/k × |Y|/k) time by one processor.
Step 2: Calculate the blocks in the first diagonal, then the ones in the second diagonal, and so on until the (2k-1)th diagonal.

[Figure: the k × k block grid of S, each block labeled with the number of its diagonal, from 1 to 2k-1.]

Time complexity: the blocks of the ith diagonal depend only on blocks of earlier diagonals, so all blocks on the same diagonal can be calculated in parallel. Therefore, the problem can be solved in $O((2k - 1)(mn / k^2)) = O(mn / k)$ time with k processors, where $1 \le k \le m$.

Results and Analysis: Dataset
In total, 86 pairs of protein sequences from cyt c6 and PsaF are used for the test. The datasets were provided by Dr. Bruce's Lab at UTK. The two proteins in each pair belong to the same organism and are able to attract each other electrostatically.

A pair of PsaF and c6:
PsaF: MRRLFALILAIGLWFNFAPQAQALGANLVPCKDSPAFQALAENARNTTADPESGKKRFDRYSQALCGPEGYPHLIVDGRLDRAGDFLIPSILFLYIAGWIGWVGRAYLQAIKKESDTEQKEIQIDLGLALPIISTGFAWPAAAIKELLSGELTAKDSEIPISPR
c6: MENVGCEENLLRLILVNLLLVIALLCNLTIIYPALAAETSNGSKIFNANCAACHIGGANILVEHKTLQKSGLSKYLENYEIEPIQAIINQIQNGKSAMPAFKNKLSEQEILEVTAYIFQKAETGW

For each pair of sequences, the three interaction sites with the top three scores and the corresponding pairs of interaction subsequences are predicted.
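The windowed recurrence of Algorithm 1 can be checked on the small example from the slides (column sequence AELMDSEAE, row sequence GPRFKYKH). The sketch below is illustrative only. The names `window_weight`, `score_matrix`, and `best_site` are hypothetical. It assumes the columns are the c6 residues, a window of length 7 means three positions on either side, and the "otherwise" weight is -0.22.

```python
NEG = set("ED")   # negatively charged c6 residues
POS = set("RK")   # strongly positive PsaF residues


def window_weight(x_seq, i, y_seq, j, alpha=1.0, beta=0.22, half=3):
    """Windowed weight W_ew for x_seq[i] against y_seq[j] (assumed parameters)."""
    x, y = x_seq[i], y_seq[j]
    x_near = x_seq[max(0, i - half): i + half + 1]
    y_near = y_seq[max(0, j - half): j + half + 1]
    if x in NEG:
        if y in POS:
            return alpha
        if y == "H":
            return 0.1
        if any(c in POS for c in y_near):
            return 0.2          # a positive residue sits close to y
        if "H" in y_near:
            return 0.02
    elif any(c in NEG for c in x_near):
        if y in POS:
            return 0.2          # a negative residue sits close to x
        if y == "H":
            return 0.02
    return -beta


def score_matrix(c6, psaf):
    """Fill S[r][c] along diagonals only (no gaps); rows are PsaF, columns c6."""
    rows, cols = len(psaf) + 1, len(c6) + 1
    S = [[0.0] * cols for _ in range(rows)]
    for r in range(1, rows):
        for c in range(1, cols):
            w = window_weight(c6, c - 1, psaf, r - 1)
            S[r][c] = max(S[r - 1][c - 1] + w, 0.0)
    return S


def best_site(c6, psaf):
    """Return (score, c6 subsequence, PsaF subsequence) for the top cell."""
    S = score_matrix(c6, psaf)
    score, r, c = max((S[r][c], r, c)
                      for r in range(len(S)) for c in range(len(S[0])))
    end_r, end_c = r, c
    while r > 0 and c > 0 and S[r][c] > 0:   # walk back along the diagonal
        r, c = r - 1, c - 1
    return score, c6[c:end_c], psaf[r:end_r]


if __name__ == "__main__":
    score, u, v = best_site("AELMDSEAE", "GPRFKYKH")
    print(round(score, 2), u, v)   # prints: 2.56 DSEAE RFKYK
```

To obtain the top three sites per protein pair, one could repeat the trace-back from the next-best cells that do not overlap sites already reported; that selection rule is not specified in the slides and is left out here.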
Parameters in Weight Schemes
W_e: 1, 0.22
W_h: 0.1

Results and Analysis: Result
For each pair of protein sequences, the original sequences, the three interaction sites with their scores, the corresponding interaction subsequences, and the net charge of each subsequence are output as follows:

Psaf: MRRLFALILAIGLWFNFAPQAQALGANLVPCKDSPAFQALAENARNTTADPESGKKRFDRYSQALCGPEGYPHLIVDGRLDRAGDFLIPSILFLYIAGWIGWVGRAYLQAIKKESDTEQKEIQIDLGLALPIISTGFAWPAAAIKELLSGELTAKDSEIPISPR
c6: MENVGCEENLLRLILVNLLLVIALLCNLTIIYPALAAETSNGSKIFNANCAACHIGGANILVEHKTLQKSGLSKYLENYEIEPIQAIINQIQNGKSAMPAFKNKLSEQEILEVTAYIFQKAETGW
1st interaction site information:
Interaction score: 2.76
Interaction site location and subsequence in Psaf: 54-59, u = KKRFDR
Interaction site and subsequence in c6: 106-111, v = EQEILE
Net charge:
when pH = 6.25, net charge for u = 3.00395114057246, net charge for v = -2.98030929177886
when pH = 6.5, net charge for u = 3.00209690387591, net charge for v = -2.98889520136613
……..

Datasets and output: www.tnstate.edu/faculty/wchen/research.aspx

Results and Analysis: Comparison of the algorithm using a window and the algorithm using gaps
For simplicity, we consider only the electrostatic bond in the weight schemes. From the results, we found that the algorithm using gaps tends to give interaction sites that have the same number of positively charged and negatively charged residues. For example, for the pair of protein sequences on the previous slide, the first interaction site and the corresponding interaction residue subsequences are predicted as follows:

From Algorithm 1:
PsaF: 54-59, u = KKRFDR
cyt c6: 106-111, v = EQEILE

From Algorithm 2:
PsaF: 55-59, K_RFDR
cyt c6: 106-111, EQEILE

In the first pair of subsequences there are four positively charged residues (KKRR) and three negatively charged residues (EEE). In the second pair there are three positively charged (KRR) and three negatively charged (EEE) residues. Therefore, the algorithm should be selected based on the property to be investigated.
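To make the comparison concrete, the gapped recurrence of Algorithm 2 can be sketched on the same small example used for the score tables (columns AELMDSEAE, rows GPRFKYKH). This is an illustration under stated assumptions, not the authors' code. The names `pair_weight` and `gapped_scores` are hypothetical. The gap penalty `gap=0.2` is inferred from the drop from 1 to 0.8 between neighboring table cells, the mismatch penalty 0.22 from the parameter pair "W_e: 1, 0.22", and only the electrostatic weight is used. Individual cells are not guaranteed to match every entry of the table in the slides.

```python
def pair_weight(x, y, alpha=1.0, beta=0.22):
    """Electrostatic weight W_e for c6 residue x and PsaF residue y (assumed values)."""
    if x in "ED" and y in "RK":
        return alpha
    if x in "ED" and y == "H":
        return 0.1 * alpha
    return -beta


def gapped_scores(c6, psaf, gap=0.2):
    """Local-alignment style matrix: diagonal step, or a gap in either sequence."""
    rows, cols = len(psaf) + 1, len(c6) + 1
    S = [[0.0] * cols for _ in range(rows)]
    for r in range(1, rows):
        for c in range(1, cols):
            S[r][c] = max(
                S[r - 1][c - 1] + pair_weight(c6[c - 1], psaf[r - 1]),
                S[r - 1][c] - gap,    # skip a PsaF residue
                S[r][c - 1] - gap,    # skip a c6 residue
                0.0,
            )
    return S


if __name__ == "__main__":
    S = gapped_scores("AELMDSEAE", "GPRFKYKH")
    # Cell (7, 7) ends the gapped alignment ELMDSE / R-FKYK shown in the slides.
    print(round(S[7][7], 2), round(S[7][9], 2))   # prints: 2.36 2.56
```

Because a gap costs slightly less than a mismatch under these assumed values, the gapped version can skip a residue to line up further charged pairs, which is consistent with its tendency to balance positive and negative residues.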
Results and Analysis: Comparison of Laboratory and Computational Approaches
Lab approach (Mass Spectrometry)
Model of the electron donor docking sites. Shown are the sites of the complex between PSI (green ribbons) and cyt c6 (white). In yellow are the heme group of cyt c6, the Trp pair B627/A651, and the special chlorophyll pair P700. The distance between the redox cofactors is 14 Å. Glu69 and Glu70 of cyt c6 are able to form strong salt bridges with Lys27 and Lys23 of PsaF, respectively. Lys20 and Lys16 of PsaF form weaker salt bridges with Glu71 of cyt c6 and Glu613 of PsaB, respectively. Interestingly, in this model the conserved positive charge on the northern face of cyt c6 (Arg66) and the adjacent Asp65 can form a strong salt bridge with the pair Arg623/Asp624 of PsaB.

Lab approach (Mass Spectrometry)
PsaF: DIAGLTPCSESKAYAKLEKKELKTLEKRLKQYEADSAPAVALKATMERTKARFANYAKAGLLCGNDGLPHLIADPGLALKYGHAGEVFIPTFGFLYVAGYIGYVGRQYLIAVKGEAKPTDKEIIIDVPLATKLAWQGAGWPLAAVQELQRGTLLEKEENITVSPR
c6: ADLALGAQVFNGNCAACHMGGRNSVMPEKTLDKAALEQYLDGGFKVESIIYQVENGKGAMPAWADRLSEEEIQAVAEYVFKQATDAAWKY

The laboratory approach shows that the cross-linked interaction occurs in the following interaction subsequences:
PsaF: 21-28, ELKTLEKR
cyt c6: 67-81, LSEEEIQAVAEYVFK

Computational Approach
1. Algorithm using a window
PsaF: 22-29, KTLEKRLK
cyt c6: 64-70, DRLSEE_E
2. Algorithm using gaps
PsaF: 15-27, KLEKKELKTLEKR
cyt c6: 64-76, DRLSEEEIQAVAE
Both algorithms predict interaction sites that overlap the cross-linked subsequences found by the laboratory approach (PsaF 21-28 and cyt c6 67-81).

Results and Analysis: Distribution of interaction sites in PsaF

$$\text{Number of interactions at location } i = |S(i)|, \qquad S(i) = \{s : s \text{ is a predicted interaction site that contains location } i\}$$

Results and Analysis: Distribution of interaction score in PsaF

$$\text{Interaction score at location } i = \sum_{s \in S(i)} \text{score}(s), \qquad S(i) = \{s : s \text{ is a predicted interaction site that contains location } i\}$$

Results and Analysis: Distribution of interaction number and score in cyt c6

Results and Analysis: Net charge of PsaF and c6
[Figure: the example peptide Cys-Phe-Ile-Glu-Asn-Cys-Pro-His-His-Gly with its N-terminal, C-terminal, and side chains labeled.]

Amino acid pKa values

Amino acid      Code  α-carboxylic acid  α-amino       Side chain  Side-chain
                      (C-terminal)       (N-terminal)              charge
Alanine         A     2.35               9.87
Arginine        R     2.01               9.04          12.48       Q+
Asparagine      N     2.02               8.80
Aspartic Acid   D     2.10               9.82          3.86        Q-
Cysteine        C     2.05               10.25         8.00        Q-
Glutamic Acid   E     2.10               9.47          4.07        Q-
Glutamine       Q     2.17               9.13
Glycine         G     2.35               9.78
Histidine       H     1.77               9.18          6.10        Q+
Isoleucine      I     2.32               9.76
Leucine         L     2.33               9.74
Lysine          K     2.18               8.95          10.53       Q+
Methionine      M     2.28               9.21
Phenylalanine   F     2.58               9.24
Proline         P     2.00               10.60
Serine          S     2.21               9.15
Threonine       T     2.09               9.10
Tryptophan      W     2.38               9.39
Tyrosine        Y     2.20               9.11          10.07       Q-
Valine          V     2.29               9.72

Results and Analysis: Net charge of PsaF and cyt c6
The net charge s(i) of sequence s at location i is calculated from pH = 6.25 to pH = 8 at intervals of 0.25, using a window of length 7:
s(i) = the net charge of the subsequence of s from position i - 3 to i + 3.
The net charge NetCh(i) at position i for all 86 proteins of PsaF/c6 is defined as

$$\mathrm{NetCh}(i) = \sum_{s \in S} s(i),$$

where S is the set of 86 PsaF/c6 sequences.

Summary and Future Work
We proposed a mathematical model and computational approaches for predicting the interaction sites of Cytochrome and Photosystem I. The results show that the approaches are effective and efficient. In the future, we will add more interaction criteria to the model and algorithms.
We will also seek more laboratory results to compare with the results from the computational approaches.