Rational HIV vaccine design
Nebojsa Jojic and David Heckerman
Machine Learning and Applied Statistics, Microsoft Research

Collaborators
• Vladimir Jojic, Microsoft/U Toronto
• Carl Kadie, Microsoft
• Jennifer Listgarten, Microsoft/U Toronto
• Chris Meek, Microsoft
• Brendan Frey, Microsoft/U Toronto
• Bette Korber, Los Alamos National Laboratory
• Christian Brander, Harvard/MGH
• Nicole Frahm, Harvard/MGH
• Simon Mallal, Royal Perth Hospital
• Jim Mullins, University of Washington

Epitome as a model of diversity in natural signals
[Figure: an input image, a set of image patches, and the epitome as a compact representation]

Using the epitome for recognition
[Figure: epitome of 295 face images with the “smiling point” marked; images with the highest total posterior at the “smiling point”; images with the lowest total posterior at the “smiling point”]

Epitomes may also allow some variability
[Figure: epitome e, shown as means and variances]

Epitomes can be computed for ordered datasets (e.g., 1-D arrays, or 2-D, 3-D, or n-D matrices) with arbitrary measurement types:
• Intensities
• R, G, B values
• Gradient values
• Wavelet coefficients
• Spectral energies
• Nucleotide or amino acid content
• …
We even played with text and MIDI files.

AIDS 101
• AIDS (acquired immune deficiency syndrome) was first described in the early 1980s
• HIV (human immunodeficiency virus) causes AIDS; it was isolated in 1983; 40 million people are now infected
• HIV is an RNA virus: protein coat + copying proteins + regulatory proteins + RNA
  – Copying proteins + RNA enter the cell
  – RNA is reverse transcribed to DNA
  – DNA inserts into the cell’s DNA and is transcribed and translated to more HIV protein
  – Infected cell assembles more copies of HIV
  – Cell bursts, releasing many new copies of HIV

The map of HIV
[Figure: map of the HIV genome, from http://www.mcld.co.uk/hiv (a simplified version of the LANL detailed map)]

HIV diversity (LANL database)
HIV is encoded in an RNA sequence of about 10000 nucleotides, divided into several genes. NEF is one of the shorter and moderately variable ones.
[Figure: the NEF length in the strain; the 73 nucleotides of the NEF gene]
Note the insertions, deletions and mutations. A triplet of nucleotides encodes one amino acid. A change in a single amino acid may lower the cellular immunity to the virus in one patient and increase it in another.

Immune system response
[Figure: MHC-I molecule presenting an epitope]

Known epitopes in a part of HIV’s Gag protein
[Figure: epitopes in variable regions; colors signify different human immune types]

Immunology 101
• “Train and kill” mechanism
  – Immune system sees a virus and trains “killer cells” (T cells) to kill any cell showing a pattern from the virus
  – Patterns are short peptides (8-11 amino acids long) called epitopes
[Figure: 3D structure of an epitope as presented by an infected cell to the killer cells; amino-acid pattern (peptide): SLYNTVATL]

But, HIV is variable…
• The train-and-kill mechanism doesn’t work as well for HIV – the virus adapts through rapid mutation. As soon as the killer cells get the upper hand, the epitopes start changing.
• Possible solution:
  – Find epitopes that occur frequently across a *population* of HIV viruses
  – Compact these epitopes into a small vaccine (small is good: long vaccines are hard to deliver, and less likely to be effective)

The epitome of a virus
[Figure: sequence data and the epitome; colors: different patients. Sequences shown: VLSGGKLDKWEKIRLRPGGKKKYKLKHIVWASRELERF, LSGGKLDRWEKIRLR, KKKYQLKHIVW, KKKYRLKHIVW]

Machine Learning Approach to Vaccine Design
• Use sample HIV strains from multiple patients
• Build models that compactly encode as many epitopes (or likely epitopes) as possible
• Learning techniques:
  – Myopic
  – Split and merge
  – Expectation Maximization
[Figure: coverage of all 10aa blocks from 245 Gag proteins (Perth data)]

A Vaccine for HIV/AIDS
• Typical vaccines are near copies of the virus that is being vaccinated against
• HIV mutates at a high rate – can’t use traditional techniques
• Machine learning allows us to build a compact “pseudo-virus” that covers the diversity of HIV (or rather a pseudo-protein that covers the diversity of a particular HIV protein)
• This pseudo-protein, which we call the epitome, is much shorter than the concatenation of all strains

Expected (weighted) coverage optimization
[Equation not recovered from the slide; its annotations follow]
• p(T), p(S): cleavage, MHC binding, transport
• P(XS|ET): T-cell cross-reactivity
• Callouts: “We have algorithms to predict this!” and “We have some idea about this, too.”

Finding Epitopes and their MHC-I counterparts
[Figure: MHC-I molecule and peptide]
• Important to find both epitopes and the MHC-I types that can present them
• Each patient has six MHC-I types (2 As, 2 Bs, 2 Cs)
• Most epitopes can be presented by only a few MHC-I molecules
• Different populations (China, India, South Africa, etc.) have different MHC-I frequencies

Finding Epitopes and their MHC-I counterparts
• Existing methods:
  – Trial and error in the wet lab
  – Machine learning
• Our methods:
  – More machine learning
  – Machine learning + physics
  – Machine learning + wet lab

Machine Learning
• Examples of “peptide is an epitope for MHC-I type”
• Examples of “peptide is NOT an epitope for MHC-I type”
• Classifier:
  – Logistic regression
  – SVM
  – Neural net
  – Etc.
• p(is epitope | peptide, MHC-I)

Issues (from experience)
• Amount of data
• Feature extraction
• Algorithm choice

Simple feature extraction
SLYNTVATL, A02
• Amino acid at position 1 = S
• Amino acid at position 2 = L
• Amino acid at position 3 = Y
…
• Amino acid at position 9 = L
• MHC-I type = A02

Simple feature extraction (logistic regression)
[Figure: False Epitopes Included (0% to 100%) versus True Epitopes Missed (0% to 100%)]

Better feature extraction
SLYNTVATL, A02
• Previously mentioned features
• Amino acid at position 1 = S & MHC-I = A02
• Amino acid at position 2 = L & MHC-I = A02
…
• Amino acid at position 9 = L & MHC-I = A02

Better feature extraction
[Figure: False Epitopes Included (0% to 100%) versus True Epitopes Missed (0% to 100%)]

Machine learning + physics
with David Baker and Ora Furman, UW
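The two feature-extraction slides above can be sketched in a few lines of Python. This is only an illustration of the feature lists shown for SLYNTVATL and A02; the function names and the string encoding of each feature are assumptions made here, not the encoding used in the actual experiments.

```python
def simple_features(peptide, mhc):
    """Position-specific amino-acid indicators plus the MHC-I type.

    Hypothetical encoding: each feature is a string such as "pos1=S".
    """
    feats = [f"pos{i}={aa}" for i, aa in enumerate(peptide, start=1)]
    feats.append(f"mhc={mhc}")
    return feats


def better_features(peptide, mhc):
    """Simple features plus conjunctions of (position, amino acid) with the MHC-I type."""
    feats = simple_features(peptide, mhc)
    feats += [f"pos{i}={aa}&mhc={mhc}" for i, aa in enumerate(peptide, start=1)]
    return feats


if __name__ == "__main__":
    for feat in better_features("SLYNTVATL", "A02"):
        print(feat)
```

Each feature string would become one binary input to a classifier such as logistic regression. The conjunction features let the classifier learn that a given amino acid at a given position matters for one MHC-I type but not for another.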
Machine learning + wet lab
With Christian Brander & Nicole Frahm, Harvard; Jennifer Listgarten, U. Toronto
[Figure: a peptide, e.g., NYTSLIYTLIEESQNQQEK, tested against patients Pt1, Pt2, Pt3, Pt4, …, PtN]
• If a patient’s blood reacts with a peptide, then it is very likely that some subsequence of the peptide is an epitope for at least one of the patient’s six MHC-I types
• From observations for many patients, tease out the responsible MHC-I type(s)
• Find the subsequence in the lab

What makes a good solution for a peptide?
• The fewer the responsible MHC-I types the better
• An MHC-I type gets “points” for appearing in reacting patients and loses “points” for appearing in non-reacting patients

Not easy…
[Figure labels: A, B, C]
• Lots of noise: p(react | is epitope) ~ 0.25
• “Leaks”: may see a reaction even when the peptide is not an epitope for any MHC-I type of the patient
• “Explaining away”: when a patient has two MHC-I types that can be responsible for a reaction, those two get less credit
• Don’t actually know
  – p(react | is epitope)
  – Leak probabilities

Example solution:
[Figure: reacting patients and non-reacting patients]

Graphical model for a peptide
[Figure: the full model, described under the build slides below]

(Directed Acyclic) Graphical Models
[Figure: network over the variables Fuel, Battery, TurnOver, Gauge, Start]
p(F,B,T,G,S) = p(F) p(B|F) p(T|F,B) p(G|F,B,T) p(S|F,B,T,G) = ∏_vars p(var | parents)

Graphical model for a peptide
[Figure, built up over several slides: nodes for the MHC-I types A01, A02, A03, …, B01, B02, B03, …, C01, C02, C03, …; for each patient, copy nodes for the types that patient carries (e.g., A02c, A03c, B01c, B02c, C01c, C03c), linked to the type nodes with probability p; the copy nodes and a leak node with probability p0 feed an OR node “pt1 reacts”; the same structure repeats for “pt2 reacts”, …]
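The graphical model for a peptide reads as a noisy-OR: each responsible MHC-I type that a patient carries triggers a reaction with probability p, and a leak triggers one with probability p0. The sketch below evaluates the data likelihood for one candidate set of responsible types, under that reading. The function names, the toy patients, and the values p = 0.25 and p0 = 0.05 are assumptions for illustration, and it does not implement the relaxation or gradient steps used to fit the model.

```python
import math


def p_react(patient_types, responsible, p, p0):
    """Noisy-OR probability that a patient reacts to the peptide.

    Each responsible MHC-I type the patient carries fires independently
    with probability p; the leak fires with probability p0.
    """
    n_active = len(set(patient_types) & set(responsible))
    return 1.0 - (1.0 - p0) * (1.0 - p) ** n_active


def log_likelihood(patients, responsible, p, p0):
    """Log-likelihood of the observed reactions for one peptide.

    patients: list of (mhc_types, reacted) pairs; requires 0 < p0 and p < 1.
    """
    total = 0.0
    for types, reacted in patients:
        q = p_react(types, responsible, p, p0)
        total += math.log(q if reacted else 1.0 - q)
    return total


# Hypothetical toy data: three patients with six MHC-I types each.
patients = [
    (["A02", "A03", "B01", "B02", "C01", "C03"], True),
    (["A01", "A02", "B02", "B03", "C01", "C02"], True),
    (["A01", "A03", "B01", "B03", "C02", "C03"], False),
]

if __name__ == "__main__":
    for candidate in (["A02"], ["A03"], ["A02", "A03"]):
        print(candidate, log_likelihood(patients, candidate, p=0.25, p0=0.05))
```

On this toy data the assignment {A02} scores higher than {A03}, because A02 appears only in reacting patients while A03 also appears in the non-reacting one.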
Solving the model
• Principle: find the p, p0 and MHC-I assignments that maximize the likelihood of the data
• Algorithm:
  – Guess p, p0
  – Iterate
    • Use relaxation method to find max likelihood MHC-I assignments
    • Use gradient descent (on the negative log-likelihood) to find values of p, p0 that maximize the likelihood

Status
• Most likely assignments have been confirmed

Summary
• HIV vaccine design is a data-intensive problem
• Data is in the form of discrete sequences, making it ideal for computer-science/machine-learning analysis
• Machine learning approaches are instrumental in finding epitopes and vaccine compression
• Work in progress: our vaccine designs are scheduled to be tested in vitro at Mass General this summer

What if there are fewer epitopes?
[Figure, two panels: % of Possible Score (0% to 100%) versus # Amino Acids in Vaccine (0 to 500); curves “Optimized for LANL + predicted epitopes” and “Optimized for LANL epitopes only”. Panel 1: Nef, no play, assumes only epitopes in the LANL + predicted set are epitopes. Panel 2 (“Fewer epitopes”): Nef, no play, assumes only LANL epitopes are epitopes.]

What if there are more epitopes?
[Figure, two panels: % of Possible Score (0% to 100%) versus # Amino Acids in Vaccine (0 to 500); curves “Optimized for LANL + predicted epitopes” (also labelled “Optimized for predicted epitopes”) and “Optimized assuming all 9mers are epitopes”. Panel 1: Nef, no play, assumes only epitopes in the LANL + predicted set are epitopes. Panel 2 (“More epitopes”): Nef, no play, assumes all 9mers are epitopes.]
• If uncertain, should err in favor of more epitopes (overlap provides some robustness)

Rational Design of HIV/AIDS Vaccines
Many collaborators:
• Microsoft: Nebojsa Jojic, David Heckerman, Vladimir Jojic, Chris Meek, Brendan Frey, Carl Kadie, Jennifer Listgarten
• Royal Perth Hospital: Simon Mallal
• University of Washington: Jim Mullins
• Harvard/Mass General: Bruce Walker, Christian Brander
• Los Alamos National Lab: Bette Korber

AIDS 101
• AIDS (acquired immune deficiency syndrome) was first described in the early 1980s
• HIV (human immunodeficiency virus) causes AIDS; it was isolated in 1983; 40 million people are now infected
• HIV is an RNA virus: protein coat + copying proteins + RNA
  – Copying proteins + RNA enter the cell
  – RNA is reverse transcribed to DNA
  – DNA inserts into the cell’s DNA and is transcribed and translated to more HIV protein
  – Infected cell assembles more copies of HIV
  – Cell bursts, releasing many new copies of HIV

Immunology 101
• Immune system fights viruses through a “train and kill” mechanism
  – Immune system sees a virus and trains “killer cells” (T cells) to kill any cell showing a pattern from the virus
  – Patterns are short peptides (8-11 amino acids long) called epitopes
[Figure: MHC-I molecule and epitope; 3D structure of an epitope as presented by an infected cell to the killer cells; amino-acid pattern (peptide): SLYNTVATL]

HIV is different
• The train-and-kill mechanism doesn’t work for HIV – the virus adapts through rapid mutation.
• As soon as the killer cells get the upper hand, the epitopes start changing.
• Possible solution:
  – Find epitopes that occur commonly across a *population* of HIV viruses
  – Compact these epitopes into a small vaccine (small is good: long vaccines are hard to deliver, and less likely to be effective)
• Important to find both epitopes and the MHC-I types that can present them
• Each patient has six MHC-I types (2 As, 2 Bs, 2 Cs)
• Most epitopes can be presented by only a few MHC-I molecules
• Different populations (China, India, South Africa, etc.) have different MHC-I frequencies

Machine learning, HIV, and SPAM
• SPAM: use machine learning to find patterns of words and phrases that indicate spam, e.g., “Free!”, “Money”, “Click here”, “Vi@gr@”
• HIV: use machine learning to find epitopes that stimulate the immune system, e.g., SLYNTVATL
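The idea of compacting commonly occurring epitopes into a short vaccine can be illustrated with a small greedy sketch. This is not the epitome learning used in the talk (myopic, split and merge, or Expectation Maximization). It is a plain greedy stand-in that treats every 9-mer as a candidate epitope, adds the most frequent ones first, and reuses overlaps to save length. The function names and the length budget are assumptions, and the toy strains are the sequence fragments shown on the epitome slide.

```python
from collections import Counter


def kmer_counts(strains, k):
    """Number of strains in which each k-mer occurs (counted once per strain)."""
    counts = Counter()
    for s in strains:
        counts.update({s[i:i + k] for i in range(len(s) - k + 1)})
    return counts


def greedy_vaccine(strains, k, max_len):
    """Greedily append the most common k-mers, merging suffix/prefix overlaps."""
    vaccine = ""
    for kmer, _ in kmer_counts(strains, k).most_common():
        if kmer in vaccine:
            continue  # already covered by what has been built so far
        # Longest prefix of the k-mer that matches the current end of the vaccine.
        overlap = next(
            (n for n in range(k - 1, 0, -1) if vaccine.endswith(kmer[:n])), 0
        )
        if len(vaccine) + k - overlap > max_len:
            break  # length budget exhausted
        vaccine += kmer[overlap:]
    return vaccine


def coverage(vaccine, strains, k):
    """Fraction of (strain, k-mer) occurrences whose k-mer appears in the vaccine."""
    counts = kmer_counts(strains, k)
    covered = sum(c for kmer, c in counts.items() if kmer in vaccine)
    return covered / sum(counts.values())


# Toy input: sequence fragments from the epitome slide.
strains = [
    "VLSGGKLDKWEKIRLRPGGKKKYKLKHIVWASRELERF",
    "LSGGKLDRWEKIRLR",
    "KKKYQLKHIVW",
    "KKKYRLKHIVW",
]

if __name__ == "__main__":
    v = greedy_vaccine(strains, k=9, max_len=40)
    print(v, len(v), coverage(v, strains, k=9))
```

A weighted version would replace the raw counts with the probability that each k-mer is an epitope, which is the role of the epitope predictors described earlier.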