Rational HIV Vaccine Design

Rational HIV vaccine design
Nebojsa Jojic and David Heckerman
Machine Learning and Applied Statistics
Microsoft Research
Collaborators
Vladimir Jojic, Microsoft/U Toronto
Carl Kadie, Microsoft
Jennifer Listgarten, Microsoft/U Toronto
Chris Meek, Microsoft
Brendan Frey, Microsoft/U Toronto
Bette Korber, Los Alamos National Laboratory
Christian Brander, Harvard/MGH
Nicole Frahm, Harvard/MGH
Simon Mallal, Royal Perth Hospital
Jim Mullins, University of Washington
Epitome as a model of diversity
in natural signals
A set of image patches
Input image
Epitome
Compact representation
Compact representation
Using the epitome for recognition
The smiling point
Epitome of 295 face images
Images with the
highest total
posterior at the
“smiling point”
Images with the
lowest total
posterior at the
“smiling point”
Epitomes may also allow some variability
Epitome e:
Mean 
Variances 
Epitomes can be computed for ordered datasets
(e.g., 1-D arrays, or 2-D, 3-D, or n-D matrices)
with arbitrary measurement types:
Intensities
R, G, B values
Gradient values
Wavelet coefficients
Spectral energies
Nucleotide or amino acid content
…
We even played with text and MIDI files
AIDS 101
AIDS (acquired immune deficiency syndrome) was first
described in the early 1980s
HIV (human immunodeficiency virus), which causes AIDS, was
isolated in 1983; 40 million people are now infected
HIV is an RNA virus: protein coat + copying proteins +
regulatory proteins + RNA
Copying proteins + RNA enter the cell
RNA is reverse transcribed to DNA
DNA inserts into the cell's DNA and is transcribed and
translated into more HIV protein
Infected cell assembles more copies of HIV
Cell bursts, releasing many new copies of HIV
The map of HIV
From http://www.mcld.co.uk/hiv
(A simplified version of the LANL detailed map)
HIV diversity (LANL database)
HIV is encoded in an RNA sequence of about 10,000 nucleotides,
divided into several genes. NEF is one of the shorter, moderately
variable ones.
The NEF length in the strain
The 73 nucleotides of the NEF gene
Note the insertions, deletions and mutations. A triplet of nucleotides encodes
one amino acid. A change in a single amino acid may lower the cellular
immunity to the virus in one patient and increase it in another.
Immune system response
MHC-I Molecule
Epitope
Known epitopes in a part of HIV’s Gag protein
Epitopes in variable regions
Colors signify different human immune types
Immunology 101
“Train and kill” mechanism
Immune system sees a virus and trains “killer
cells” (T cells) to kill any cell showing a pattern
from the virus
Patterns are short peptides (8-11 amino acids
long) called epitopes:
3D structure of an epitope
as presented by an infected
cell to the killer cells
Amino-acid pattern (peptide)
SLYNTVATL
But, HIV is variable…
The train-and-kill mechanism doesn’t work as well
for HIV – the virus adapts through rapid
mutation. As soon as the killer cells get the
upper hand, the epitopes start changing.
Possible solution:
Find epitopes that occur frequently across a
*population* of HIV viruses
Compact these epitopes into a small vaccine
(small is good: long vaccines are hard to
deliver, and less likely to be effective)
The epitome of a virus
Colors:
Different
patients
Sequence data
VLSGGKLDKWEKIRLRPGGKKKYKLKHIVWASRELERF
LSGGKLDRWEKIRLR KKKYQLKHIVW KKKYRLKHIVW
Epitome
Machine Learning Approach to
Vaccine Design
Use sample HIV strains from multiple patients
Build models that compactly encode as many epitopes (or
likely epitopes) as possible
Learning techniques:
Myopic
Split and merge
Expectation Maximization
Coverage of all 10aa blocks from 245 Gag proteins (Perth data)
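A minimal sketch of what a "myopic" (greedy) learner could look like, assuming the goal is to cover as many k-amino-acid blocks from the sample strains as possible within a length budget. The function names, the fixed chunk length and the scoring by raw block counts are illustrative assumptions, not the method used for the results above. This simple version also ignores blocks that span the boundary between two chosen pieces.

```python
from collections import Counter

def kmers(seq, k):
    """All overlapping length-k windows of seq."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def greedy_epitome(strains, k, chunk_len, budget):
    """Greedy sketch: repeatedly pick the strain segment that covers the
    most not-yet-covered k-mer occurrences, until the budget is spent."""
    counts = Counter(km for s in strains for km in kmers(s, k))
    covered, pieces, used = set(), [], 0
    while used + chunk_len <= budget:
        best_gain, best_seg = 0, None
        for s in strains:
            for i in range(len(s) - chunk_len + 1):
                seg = s[i:i + chunk_len]
                gain = sum(counts[km] for km in set(kmers(seg, k)) - covered)
                if gain > best_gain:
                    best_gain, best_seg = gain, seg
        if best_seg is None:  # nothing left to gain
            break
        pieces.append(best_seg)
        covered |= set(kmers(best_seg, k))
        used += chunk_len
    total = sum(counts.values())
    coverage = sum(counts[km] for km in covered) / total if total else 0.0
    return pieces, coverage

# Toy input only: strings borrowed from the sequence slide.
toy = ["VLSGGKLDKWEKIRLRPGGKKKYKLKHIVWASRELERF",
       "LSGGKLDRWEKIRLR", "KKKYQLKHIVW", "KKKYRLKHIVW"]
pieces, coverage = greedy_epitome(toy, k=5, chunk_len=10, budget=20)
```

Split-and-merge or EM would revisit earlier choices; the greedy version never does, which is what makes it myopic.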
A Vaccine for HIV/AIDS
Typical vaccines are near copies of the virus that is
being vaccinated against
HIV mutates at a high rate, so traditional
techniques can’t be used
Machine learning allows us to build a compact
“pseudo-virus” that covers the diversity of HIV
(or rather a pseudo-protein that covers the diversity of a
particular HIV protein)
This pseudo-protein, which we call the epitome, is much
shorter than the concatenation of all strains
Expected (weighted) coverage optimization
We have algorithms to
predict this!
We have some idea about
this, too.
p(T), p(S): Cleavage, MHC binding, transport
P(XS|ET): T-cell cross-reactivity
Finding Epitopes
and their MHC-I counterparts
MHC-I Molecule
Peptide
Important to find both epitopes and
the MHC-I types that can present them



Each patient has six MHC-I types (2 As, 2 Bs,
2 Cs)
Most epitopes can be presented by only a few
MHC-I molecules
Different populations (China, India, South
Africa, etc.) have different MHC-I frequencies
Finding Epitopes and their MHC-I
counterparts
Existing methods:
• Trial and error in the wet lab
• Machine learning
Our methods:
• More machine learning
• Machine learning + physics
• Machine learning + wet lab
Machine Learning
Examples of
peptide is epitope
for MHC-I type
Examples of
peptide is NOT epitope
for MHC-I type
Classifier:
- Logistic regression
- SVM
- Neural net
- Etc.
p(is epitope | peptide, MHC-I)
Issues (from experience)



Amount of data
Feature extraction
Algorithm choice
Simple feature extraction
SLYNTVATL, A02
• Amino acid at position 1=S
• Amino acid at position 2=L
• Amino acid at position 3=Y
…
• Amino acid at position 9=L
• MHC-I type=A02
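The feature list above can be sketched as a small function. The feature-name strings (`pos1=S`, `mhc=A02`) and the function name are illustrative choices, not taken from the original system.

```python
def simple_features(peptide, mhc):
    """One indicator per (position, amino acid) plus one for the MHC-I type."""
    feats = {f"pos{i}={aa}" for i, aa in enumerate(peptide, start=1)}
    feats.add(f"mhc={mhc}")
    return feats

feats = simple_features("SLYNTVATL", "A02")
```

For a 9-mer this gives ten indicator features, which a linear classifier such as logistic regression can consume directly.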
Simple feature extraction
(logistic regression)
[Figure: error trade-off curve. x-axis: True Epitopes Missed (0%-100%); y-axis: False Epitopes Included (0%-100%)]
Better feature extraction
SLYNTVATL, A02
• Previously mentioned features
• Amino acid at position 1 = S & MHC-I = A02
• Amino acid at position 2 = L & MHC-I = A02
…
• Amino acid at position 9 = L & MHC-I = A02
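A sketch of the richer feature set, assuming the same naming scheme as a plain position/amino-acid indicator. The conjunction features let a linear model learn that an amino acid at a given position matters for one MHC-I type but not another, which the simple features alone cannot express. The `predict` helper and its weights are hypothetical, included only to show how the features would be scored.

```python
import math

def conjunction_features(peptide, mhc):
    """Simple features plus (position, amino acid) & MHC-I conjunctions."""
    feats = {f"mhc={mhc}"}
    for i, aa in enumerate(peptide, start=1):
        feats.add(f"pos{i}={aa}")              # previously mentioned feature
        feats.add(f"pos{i}={aa}&mhc={mhc}")    # new conjunction feature
    return feats

def predict(weights, feats, bias=0.0):
    """Logistic score p(is epitope | peptide, MHC-I) for given weights."""
    z = bias + sum(weights.get(f, 0.0) for f in feats)
    return 1.0 / (1.0 + math.exp(-z))

feats = conjunction_features("SLYNTVATL", "A02")
# Hypothetical weight: leucine at position 2 favoured for A02 only.
prob = predict({"pos2=L&mhc=A02": 2.0}, feats)
```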
Better feature extraction
[Figure: error trade-off curve. x-axis: True Epitopes Missed (0%-100%); y-axis: False Epitopes Included (0%-100%)]
Machine learning + physics
with David Baker and Ora Furman, UW
Machine learning + wet lab
With Christian Brander & Nicole Frahm, Harvard
Jennifer Listgarten, U. Toronto
peptide, e.g., NYTSLIYTLIEESQNQQEK
…
Pt1



Pt2
Pt3
Pt4
PtN
If a patient’s blood reacts with a peptide, then it is very
likely that some subsequence of the peptide is an
epitope for at least one of the patient’s six MHC-I types
From observations for many patients, tease out the
responsible MHC-I type(s)
Find the subsequence in the lab
What makes a good solution
for a peptide?


The fewer the responsible MHC-I types the better
An MHC-I type gets “points” for appearing in
reacting patients and loses “points” for appearing
in non-reacting patients
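The "points" idea can be sketched as a simple tally. The slide does not say how many points are gained or lost, so the +1/-1 weights and the data layout below are assumptions for illustration only.

```python
from collections import Counter

def score_mhc_types(patients):
    """patients: list of (set of MHC-I types, reacted?) pairs.
    Each type gains a point per reacting carrier, loses one per non-reacting."""
    scores = Counter()
    for types, reacted in patients:
        for t in types:
            scores[t] += 1 if reacted else -1
    return scores

# Made-up patients (only two types each, to keep the example short).
patients = [({"A02", "B01"}, True),
            ({"A02", "C03"}, True),
            ({"B01", "C03"}, False)]
scores = score_mhc_types(patients)
best_type = max(scores, key=scores.get)
```

A tally like this treats every type independently, which is why it struggles with the noise, leak and explaining-away effects discussed next.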
Not easy…





Lots of noise: p(react | is epitope)~0.25
“Leaks”: may see a reaction even when the peptide is not an epitope for
any MHC-I type of the patient
“Explaining away”: When a patient has two MHC-I types that can be
responsible for a reaction, those two get less credit
Don’t actually know
– p(react | is epitope)
– Leak probabilities
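One common way to combine noisy causes with a leak is a noisy-OR. The sketch below assumes that form: each responsible MHC-I type independently triggers a reaction with probability p, and a leak triggers one with probability p0. The value p = 0.25 echoes the rough figure above; p0 = 0.01 is a made-up placeholder, since the leak probabilities are not known.

```python
def p_react(patient_types, epitope_types, p=0.25, p0=0.01):
    """Noisy-OR with leak: probability that a patient reacts to a peptide."""
    n_causes = len(set(patient_types) & set(epitope_types))
    return 1.0 - (1.0 - p0) * (1.0 - p) ** n_causes

patient = {"A02", "A03", "B01", "B02", "C01", "C03"}
no_cause = p_react(patient, set())             # leak only
one_cause = p_react(patient, {"A02"})          # one responsible type
two_causes = p_react(patient, {"A02", "B01"})  # two responsible types
```

Adding a second responsible type raises the reaction probability by less than the first did, which is one way to see why two candidate types in the same reacting patient each get less credit.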
Example solution:
reacting
patients
non-reacting
patients
Graphical model for a peptide
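[Figure: network for two patients. Top row: one node per MHC-I type (A01, A02, A03, B01, B02, B03, C01, C02, C03, …). Below, apparently one node per MHC-I type carried by each patient: A02c, A03c, B01c, B02c, C01c, C03c for patient 1 and A01c, A03c, B02c, B03c, C01c, C02c for patient 2. Each patient's nodes, together with a leak node (p0), feed an OR node labeled "pt1 reacts" / "pt2 reacts".]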
(Directed Acyclic) Graphical Models
Fuel
Battery
TurnOver
Gauge
Start
\[ p(F,B,T,G,S) = p(F)\,p(B \mid F)\,p(T \mid F,B)\,p(G \mid F,B,T)\,p(S \mid F,B,T,G) = \prod_{\text{var}} p(\text{var} \mid \text{parents}(\text{var})) \]
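A small sketch of the product-over-parents factorization. The arrows of the slide's graph did not survive, so the parent sets below are an assumed structure for these five variables, and every probability is a made-up number for illustration.

```python
from itertools import product

# Assumed structure (hypothetical): Fuel and Battery are roots.
parents = {"F": (), "B": (), "T": ("B",), "G": ("F", "B"), "S": ("F", "T")}

# p(var = True | parent values); all numbers are invented.
cpt = {
    "F": {(): 0.9},
    "B": {(): 0.95},
    "T": {(True,): 0.97, (False,): 0.02},
    "G": {(True, True): 0.99, (True, False): 0.1,
          (False, True): 0.05, (False, False): 0.01},
    "S": {(True, True): 0.98, (True, False): 0.0,
          (False, True): 0.0, (False, False): 0.0},
}

def joint(assignment):
    """p(F,B,T,G,S) as the product of p(var | parents) over all variables."""
    prob = 1.0
    for var, pars in parents.items():
        p_true = cpt[var][tuple(assignment[p] for p in pars)]
        prob *= p_true if assignment[var] else 1.0 - p_true
    return prob

# Summing the joint over all 2^5 assignments should give 1.
total = sum(joint(dict(zip(parents, vals)))
            for vals in product([True, False], repeat=len(parents)))
all_true = joint({v: True for v in parents})
```

With these parent sets the model needs 1 + 1 + 2 + 4 + 4 = 12 numbers instead of the 31 a full joint table over five binary variables would require.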
Graphical model for a peptide
[Figure: one node per MHC-I type: A01, A02, A03, …; B01, B02, B03, …; C01, C02, C03, …]
Graphical model for a peptide
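[Figure: MHC-I type nodes A01, A02, A03, B01, B02, B03, C01, C02, C03, plus a second row of six nodes A02c, A03c, B01c, B02c, C01c, C03c (apparently one per MHC-I type of a single patient), with one link labeled p]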
Graphical model for a peptide
[Figure: same network, with each of the six links into A02c, A03c, B01c, B02c, C01c, C03c labeled p]
Graphical model for a peptide
[Figure: same network, with the six nodes A02c, A03c, B01c, B02c, C01c, C03c and a leak node (p0) feeding an OR node labeled "pt1 reacts"]
Graphical model for a peptide
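[Figure: full network for two patients. Top row: MHC-I type nodes A01, A02, A03, B01, B02, B03, C01, C02, C03, …. Nodes A02c, A03c, B01c, B02c, C01c, C03c and a leak node (p0) feed the OR node "pt1 reacts"; nodes A01c, A03c, B02c, B03c, C01c, C02c and a leak node (p0) feed the OR node "pt2 reacts".]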
Solving the model

Principle: find the p, p0 and MHC-I assignments
that maximize the likelihood of the data

Algorithm:
Guess p, p0
Iterate
• Use a relaxation method to find the maximum-likelihood
MHC-I assignments
• Use gradient ascent to find the values of p, p0 that
maximize the likelihood
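A rough sketch of the alternating scheme, under several assumptions that are not from the slides: the likelihood is a noisy-OR with leak, greedy single-type flips stand in for the relaxation method, and a numerical gradient step stands in for whatever gradient computation was actually used. All names, step sizes and the toy data are hypothetical.

```python
import math

def log_lik(data, epitope_types, p, p0):
    """data: list of (patient MHC-I types, reacted?). Noisy-OR with leak."""
    ll = 0.0
    for types, reacted in data:
        n = len(types & epitope_types)
        pr = 1.0 - (1.0 - p0) * (1.0 - p) ** n
        ll += math.log(pr if reacted else 1.0 - pr)
    return ll

def best_assignment(data, all_types, p, p0):
    """Greedy stand-in for the relaxation step: toggle one MHC-I type
    at a time for as long as the log-likelihood improves."""
    chosen, improved = set(), True
    while improved:
        improved = False
        for t in sorted(all_types):
            trial = chosen ^ {t}
            if log_lik(data, trial, p, p0) > log_lik(data, chosen, p, p0) + 1e-12:
                chosen, improved = trial, True
    return chosen

def clip(x, lo=1e-3, hi=1.0 - 1e-3):
    return min(hi, max(lo, x))

def fit(data, p=0.3, p0=0.05, rounds=10, steps=200, lr=1e-3, h=1e-5):
    all_types = set().union(*(types for types, _ in data))
    chosen = set()
    for _ in range(rounds):
        chosen = best_assignment(data, all_types, p, p0)
        for _ in range(steps):
            # Central-difference gradient of the log-likelihood in p and p0.
            gp = (log_lik(data, chosen, clip(p + h), p0)
                  - log_lik(data, chosen, clip(p - h), p0)) / (2 * h)
            g0 = (log_lik(data, chosen, p, clip(p0 + h))
                  - log_lik(data, chosen, p, clip(p0 - h))) / (2 * h)
            p, p0 = clip(p + lr * gp), clip(p0 + lr * g0)
    return chosen, p, p0

# Made-up, noise-free data in which only A02 carriers react.
toy = [(frozenset(t), r) for t, r in [
    ({"A02", "B01"}, True), ({"A02", "C03"}, True), ({"A02", "B07"}, True),
    ({"A01", "B01"}, False), ({"A03", "C03"}, False), ({"A01", "B07"}, False),
    ({"A03", "B01"}, False), ({"A01", "C03"}, False), ({"A03", "B07"}, False)]]
chosen, p_hat, p0_hat = fit(toy)
```

On this clean toy set the fit drives p toward its upper clip and p0 toward its lower clip; real reaction data is far noisier, so the estimates would sit well inside the interval.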
Status
Most likely assignments have been confirmed
Summary




HIV vaccine design is a data-intensive problem
Data is in the form of discrete sequences, making
it ideal for computer-science/machine-learning
analysis
Machine learning approaches are instrumental in
finding epitopes and compressing vaccines
Work in progress: our vaccine designs are
scheduled to be tested in vitro at Mass General
this summer
What if there are fewer epitopes?
[Figure: two panels, each plotting % of Possible Score (0%-100%) against # Amino Acids in Vaccine (0-500) for two curves: "Optimized for LANL + predicted epitopes" and "Optimized for LANL epitopes only".
Panel 1: Nef -- no play -- assumes only epitopes in LANL + predicted set are epitopes
Panel 2 (fewer epitopes): Nef -- no play -- assumes only LANL epitopes are epitopes]
What if there are more epitopes?
[Figure: two panels, each plotting % of Possible Score (0%-100%) against # Amino Acids in Vaccine (0-500).
Panel 1: Nef -- no play -- assumes only epitopes in LANL + predicted set are epitopes. Curves: "Optimized for LANL + predicted epitopes" and "Optimized assuming all 9mers are epitopes".
Panel 2 (more epitopes): Nef -- no play -- assumes all 9mers are epitopes. Curves: "Optimized for predicted epitopes" and "Optimized assuming all 9mers are epitopes".]
If uncertain, should err in favor of more epitopes
(overlap provides some robustness)
Rational Design of HIV/AIDS Vaccines
Many collaborators:
• Microsoft: Nebojsa Jojic, David Heckerman,
Vladimir Jojic, Chris Meek, Brendan Frey, Carl
Kadie, Jennifer Listgarten
• Royal Perth Hospital: Simon Mallal
• University of Washington: Jim Mullins
• Harvard/Mass General: Bruce Walker,
Christian Brander
• Los Alamos National Lab: Bette Korber
AIDS 101



AIDS (acquired immune deficiency syndrome) was
first described in the early 1980s
HIV (human immunodeficiency virus), which causes
AIDS, was isolated in 1983; 40 million people are
now infected
HIV is an RNA virus: protein coat + copying
proteins + RNA
– Copying proteins + RNA enter the cell
– RNA is reverse transcribed to DNA
– DNA inserts into the cell's DNA and is transcribed and
translated into more HIV protein
– Infected cell assembles more copies of HIV
– Cell bursts, releasing many new copies of HIV
Immunology 101
Immune system fights viruses through “train and kill”
mechanism
• Immune system sees a virus and trains “killer cells” (T
cells) to kill any cell showing a pattern from the virus
• Patterns are short peptides (8-11 amino acids long)
called epitopes:
Amino-acid pattern (peptide)
3D structure of an epitope
as presented by an infected
cell to the killer cells
SLYNTVATL
MHC-I Molecule
Epitope
HIV is different
The train-and-kill mechanism doesn’t work for HIV –
the virus adapts through rapid mutation. As soon as
the killer cells get the upper hand, the epitopes start
changing.
Possible solution:
• Find epitopes that occur commonly across a
*population* of HIV viruses
• Compact these epitopes into a small vaccine (small
is good: long vaccines are hard to deliver, and less
likely to be effective)
Machine learning, HIV, and SPAM
SPAM
Use machine learning to find patterns of words and phrases that indicate spam
• Free!
• Money
• Click here
• Vi@gr@
HIV
Use machine learning to find epitopes that stimulate the immune system
• SLYNTVATL