Written Report - University of California, Santa Cruz

Visualizing the HIV gp-120 Envelope
Protein
Philip Heller, University of California, Santa Cruz
CMPS 261 Final Project Report – Spring 2010
Abstract: More than a quarter of a century of HIV vaccine development has produced little in the way of tangible
results. Recent efforts have focused on interference with the gp-120 envelope protein of the HIV virus.
Researchers in the Phillip Berman lab at U.C. Santa Cruz have identified loci along the gp-120 primary structure
where relevant biochemical reactions take place. Due to protein folding, loci that are distant in a primary
sequence may actually be near enough in the 3-D structure to influence one another. To help researchers better
understand reaction loci, I have modified the JMol protein visualization program to accommodate visual
annotation of identified loci.
Introduction
The HIV virus probably entered the
United States in 1970, around the time
that African doctors started reporting a
rise in opportunistic infections. The name
“AIDS” was coined in 1982. Vaccine
development began in 1984. Based on
historical efforts against other diseases,
the Department of Health and Human
Services predicted success within 2 years
[1]. More than a generation later, recent
results from a large-scale trial in Thailand
have been promising [2], but cautious
optimism has replaced “two-year
turnaround” exuberance in the research
community.
HIV differs in many ways from the pathogens behind
diseases such as tetanus and polio, for
which standard procedures produced
successful vaccines in straightforward
ways. HIV attacks the immune system,
damaging the mechanisms that a vaccine
is supposed to enhance. HIV inserts its
own genome into host cells, where it can
hide from the immune system. HIV
evolves constantly and rapidly, rendering
immune responses obsolete before they
can be effective. And no one has ever
recovered from HIV, so there is no “role
model” mechanism for a vaccine to
imitate. On the other hand, gene
technology and bioinformatics have come
into their own as effective sciences,
providing researchers with analytical
resources that could not have been
foreseen in 1982.
Much research has focused on the gp120
envelope protein of the HIV virus.
Molecules of this protein project like
spikes from the viral surface, and initiate
contact with and penetration of host cells.
The virus becomes significantly more
dangerous after it enters a host cell,
because it subverts the cell’s reproductive
mechanism so that the cell amplifies the
virus. Destroying gp120 would destroy
the virus’ access to reproduction.
The genetic and molecular structures of
gp120 are well known. The complete HIV
DNA sequence was published in Nature
in 1985 [3]. A 3-D molecular structure of
gp-120 was derived by X-ray
crystallography in 1998 [4], and a
corresponding PDB file has been
generated. PDB files specify the locations
in 3-space of amino acid sequences with
atomic granularity; they are used by
numerous protein visualization tools.
(Note that the PDB file actually models
gp120 in association with other proteins;
these are dimly visible in the screenshots
of my application.)
In the past decade, biomedical technology
has enabled the association of certain
viral properties with the specific location
in gp120 that gives rise to the property.
For example, it is possible to determine
the exact location on the protein where
neutralizing antibodies bind. Meanwhile,
visualization tools have allowed
researchers to view such sites as visual
annotations on pictures of the protein,
thus reading the data in its natural
context. In the next section I will give a bit
more detail on the state of the art of HIV
protein visualization, and subsequently
will explain the contribution of my own
work.
Prior Work and State of the Art
I use the term “feature” to mean a
property of a protein that is associated
with a specific location in the protein
(antibody binding sites are an example of
a feature). The visualization challenge at
hand can be thought of as annotating
protein illustrations with feature glyphs.
Protein structure is described
conceptually at four levels. Primary
structure is the protein’s linear amino
acid sequence; it can be represented as a
string over a 20-character alphabet.
Secondary structure refers to the local
folding of the protein chain into simple
shapes such as loops and sheets. Much
secondary structure can be accurately
predicted by software. Secondary
structure can generally be depicted in 2
dimensions. Tertiary structure is the
real-world 3-D shape of the protein. In the
general case it cannot be predicted by
software, but must be determined by
X-ray crystallography. Crystallography is a
very expensive process, especially
compared to genetic sequencing (which
produces primary structure).
Consequently, there are tens of thousands
of nearly identical gp120 primary
structures sampled from many patients
over many studies, but there is only one
trusted tertiary structure. Quaternary
structure describes the joining together of
multiple protein units to form complexes;
it is not relevant to this application.
Feature glyphs can be added to any
primary, secondary, or tertiary structure
image. However, in practice it is easier to
depict primary sequences as text rather
than image, and to annotate with
highlights rather than with glyphs, as
shown in Figure 1. The figure shows part
of a group of 6 highly similar sequences,
with a structural feature highlighted in
yellow.
Figure 1: Primary structure annotation (source: [5])
Secondary structure is simply convoluted
primary structure. Secondary structure
visualization convolutes linear text into
an image, generally with room to spare
for arrows and callouts, as shown in
Figure 2.
Figure 2: Secondary structure annotation (source: [6])
Figure 3, which is a detail of Figure 2,
shows that the dots represent individual
amino acids. Red dots are annotations
indicating a single feature. Red-and-green
dots show that two features have been
observed. The two-colored dots are fairly
easy to read, but with more features the
dots become subdivided into so many
pieces that understanding is impaired.
Figure 3: Secondary structure detail (source: [6])
With tertiary structure, the character of
visualization switches from symbolic to
representational. The PDB file format (see
above) provides 3D coordinates for all
atoms in a protein, along with
information for determining which atoms
belong to which amino acids. Thus all
data necessary to 3D visualization is
available. Figure 4 shows three popular
tertiary structure visualization styles:
wireframe, spacefilling, and backbone-and-ribbon. In most cases, tertiary
visualization is preferred. Amino acids
that are spatially adjacent and interact in
important ways may appear distant in
primary and secondary sequence
depictions; true amino acid relationships
are only seen with tertiary structure.
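To make the PDB description above concrete, here is a minimal Java sketch that reads the fixed-column ATOM records of a PDB file and groups atoms by the amino acid (residue) they belong to. The class and method names (`PdbAtoms`, `parse`, `atomsOfResidue`) are hypothetical and are not part of JMol or of my application; the sketch ignores HETATM records, alternate locations, and insertion codes.

```java
import java.util.ArrayList;
import java.util.List;

/** Minimal reader for the ATOM records of a PDB file (fixed-column text format). */
final class PdbAtoms {

    /** One atom: the residue it belongs to and its position in angstroms. */
    static final class Atom {
        final String atomName;
        final String residueName;
        final char chain;
        final int residueNumber;
        final double x;
        final double y;
        final double z;

        Atom(String atomName, String residueName, char chain, int residueNumber,
             double x, double y, double z) {
            this.atomName = atomName;
            this.residueName = residueName;
            this.chain = chain;
            this.residueNumber = residueNumber;
            this.x = x;
            this.y = y;
            this.z = z;
        }
    }

    /** Parses every ATOM line in the given PDB text; all other record types are skipped. */
    static List<Atom> parse(String pdbText) {
        List<Atom> atoms = new ArrayList<>();
        for (String line : pdbText.split("\\R")) {
            // Coordinates end at column 54, so shorter lines cannot be complete ATOM records.
            if (!line.startsWith("ATOM") || line.length() < 54) {
                continue;
            }
            atoms.add(new Atom(
                    line.substring(12, 16).trim(),                      // columns 13-16: atom name
                    line.substring(17, 20).trim(),                      // columns 18-20: residue name
                    line.charAt(21),                                    // column 22: chain identifier
                    Integer.parseInt(line.substring(22, 26).trim()),    // columns 23-26: residue number
                    Double.parseDouble(line.substring(30, 38).trim()),  // columns 31-38: x
                    Double.parseDouble(line.substring(38, 46).trim()),  // columns 39-46: y
                    Double.parseDouble(line.substring(46, 54).trim())));// columns 47-54: z
        }
        return atoms;
    }

    /** Returns the atoms that make up one amino acid, identified by chain and residue number. */
    static List<Atom> atomsOfResidue(List<Atom> atoms, char chain, int residueNumber) {
        List<Atom> result = new ArrayList<>();
        for (Atom atom : atoms) {
            if (atom.chain == chain && atom.residueNumber == residueNumber) {
                result.add(atom);
            }
        }
        return result;
    }
}
```

Calling `PdbAtoms.parse` on the text of a PDB file and then `atomsOfResidue` with a chain letter and a locus number yields the coordinates that a viewer would need in order to mark that locus.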
Figure 4: Tertiary structure visualization (source: [7])
A number of PDB-compatible applications
are freely available. They generally
permit real-time image rotation and
zooming. Display parameters (such as
which of the three styles of Figure 4 to
use) are set via GUI controls or by typing
commands in a scripting language.
Scripting also supports modifying the size
and color of selected amino acids. It is
fairly common to see articles with figures
of gp120, rendered as in Figure 4, with
features of relevance to the article
highlighted in some way. For examples,
see [8, 9].
The most popular rendering applications
are the RasMol/PyMol/JMol suite [10].
These three programs are nearly identical
implementations in, respectively, X
Windows, Python, and Java. Image
manipulation is done by mouse dragging,
optionally qualified by the SHIFT key.
Scripts may be supplied as files at startup
time, or typed during runtime.
There is no explicit “HIV visualization”
field of study. In the biomedical domain,
“visualization” generally means
microscopy. (See [13] for a typical
example.) Some research has been done
in the more general field of protein
visualization. For example, Meads et al.
[14] have developed a program called
ProtAlign, which visualizes the tertiary
structure of a pairwise protein alignment.
ProtAlign was developed as a tool for
threading, which is the technique of
predicting a protein’s tertiary structure,
given a sufficiently similar protein with
known structure. Threading may prove
valuable in the study of gp120, where
only one version of the protein has known
tertiary structure. In the foreseeable
future we are not likely to see many more
gp120 tertiary structures: X-ray
crystallography remains quite expensive
and time consuming.
Data and Motivation
I had two collections of data. The first was
feature data from the Phillip Berman lab
at U.C. Santa Cruz that formed the basis of
[6]. The article reports locations of some
important gp120 features as measured in
a cohort of patients. These features are
protease binding, receptor binding,
antibody binding, and glycosylation; their
primary-sequence loci are provided in
tables, and they appear as markup glyphs
on a secondary-structure diagram
(partially reproduced here as Figures 2
and 3). Protease binding is reported as a
prevalence percentage within the cohort;
the other features are reported as simply
present or absent. I also had a large Excel
spreadsheet that contained all single-substitution mutations observed at each
locus of the sequence. Mutation sites are
of particular interest because any vaccine
that requires a particular amino acid to be
present at a particular site will be
ineffective if the site mutates into a
different amino acid.
I wanted to distil as much information as
possible from the given data. I asked
myself the following question: “If I can
visualize all five features simultaneously
in three dimensions, will I observe
relationships that would otherwise be
hidden?” In the next section I detail the
novel functionality that I developed in
order to create the visualization I needed
to answer my question.
Design Rationale and Novel Functionality
I decided to leverage as much JMol
functionality as possible, and write new
code where necessary. My primary
requirement was to mark amino acid sites
in order to indicate features. Additionally,
if possible, I wanted a quantitative mark
for quantitative features (e.g. protease
binding, for which I had prevalence
percentage data).
I then had to decide what to do about loci
with multiple features. In 3D, multiple
colors as in Figure 3 would not be helpful.
Consider a site with three colored zones:
from many sightlines, one zone would be
obscured; and in any event, JMol only
supports single-color markup. I could
simply create one JMol view for each
feature, but this approach has a
drawback: after only a few user
interactions, it’s possible for the views to
be so disjoint that no unified notion of the
overall information is evident. This visual
confusion can be overcome by ensuring
that all views show the same orientation
and zoom at all times. JMol has no facility
for doing this, so I implemented an
event-echoing mechanism that applies all
image-manipulating mouse gestures to all
images.
While it is certainly undesirable to display
all features in a single view, it is still
valuable to see combinations of (typically
a few) features, selected ad hoc, in a
single view. I created one extra view,
which I call the “predicate view”, which
displays logical combinations of features.
A dialog box allows modification of the
logical predicate. For example, Figure 5
shows the predicate view, displaying loci
that are protease binding sites (red),
receptor binding sites (blue) or both
(white).
Figure 5: The predicate view
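The event-echoing idea described in this section (every image-manipulating gesture is applied to all views, so that they always share one orientation and zoom) can be sketched abstractly in Java as follows. This is not the code in my application, which echoes mouse events inside JMol; `EchoingViews` and its nested `View` are hypothetical stand-ins, and orientation is simplified to two accumulated angles.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch of gesture echoing: a gesture made in any one view is applied to all views. */
final class EchoingViews {

    /** Hypothetical stand-in for one 3-D feature view; it tracks only orientation and zoom. */
    static final class View {
        final String featureName;
        double rotationX;  // degrees about the horizontal axis
        double rotationY;  // degrees about the vertical axis
        double zoom = 1.0;

        View(String featureName) {
            this.featureName = featureName;
        }
    }

    private final List<View> views = new ArrayList<>();

    /** Registers a view; a late-added view copies the pose the existing views already share. */
    View addView(String featureName) {
        View view = new View(featureName);
        if (!views.isEmpty()) {
            View reference = views.get(0);
            view.rotationX = reference.rotationX;
            view.rotationY = reference.rotationY;
            view.zoom = reference.zoom;
        }
        views.add(view);
        return view;
    }

    /** Called when any view receives a rotation drag; the same rotation is applied everywhere. */
    void echoRotation(double dxDegrees, double dyDegrees) {
        for (View view : views) {
            view.rotationX += dxDegrees;
            view.rotationY += dyDegrees;
        }
    }

    /** Called when any view receives a zoom gesture; the same factor is applied everywhere. */
    void echoZoom(double factor) {
        for (View view : views) {
            view.zoom *= factor;
        }
    }
}
```

Because no view ever changes its pose except through the shared echo methods, the views cannot drift apart, which is the property the design needs.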
The predicate view is configured via the
dialog shown in Figure 6.
Figure 6: Predicate view configuration dialog
In this dialog, combo boxes alternately
support choice of operand and operator. The
operand may be any of the features, and the
operator may be AND, OR, AND NOT, or OR
NOT. Evaluation is in order of appearance.
The “Results” section below shows several
examples of the predicate view.
Locations which bear features are marked
with spheres. Sphere radius is controlled by
the script, and can therefore be varied to
provide information if the data has some
natural quantitative interpretation.
Protease binding sites are reported as
prevalence percentages, which are easily
normalized to the range (0-1). For
mutations, from 1 to 20 amino acids are
possible at any site, and the number of
distinct amino acids actually observed can
be normalized to the same (0-1) range. The
remaining features are reported as either
absent or present, and these values can be
translated to 0 and 1 respectively. With all
feature values normalized to a common
range, sphere radius can be varied between
a minimum and a maximum to reflect the
normalized value. This is best seen with
mutations, where nearly all loci have some
mutability as shown in Figure 7. The large
green balls indicate high incidence of
mutation, while the small balls indicate
conservation.
Figure 7: Variable radius reflecting quantitative value
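The normalization and radius mapping described above can be sketched in Java as follows. Several details are assumptions made only for illustration: the minimum and maximum radii (0.5 and 3.0 angstroms), the linear mapping of 1..20 distinct amino acids onto 0..1, and the class name `FeatureSpheres`. The generated text uses the standard `select`, `spacefill`, and `color` script commands, but it is not the exact script my application emits.

```java
import java.util.Locale;

/** Sketch: normalize feature values to the range 0..1 and turn them into sphere radii. */
final class FeatureSpheres {

    static final double MIN_RADIUS = 0.5; // angstroms; assumed value
    static final double MAX_RADIUS = 3.0; // angstroms; assumed value

    /** Prevalence reported as a percentage (0..100) becomes a fraction (0..1). */
    static double normalizePrevalence(double percent) {
        return clamp(percent / 100.0);
    }

    /** 1 distinct amino acid (fully conserved) maps to 0; all 20 observed maps to 1 (assumed linear). */
    static double normalizeMutability(int distinctAminoAcids) {
        return clamp((distinctAminoAcids - 1) / 19.0);
    }

    /** Presence/absence features translate to 1 and 0 respectively. */
    static double normalizePresence(boolean present) {
        return present ? 1.0 : 0.0;
    }

    /** Interpolates linearly between the minimum and maximum radius. */
    static double radiusFor(double normalizedValue) {
        return MIN_RADIUS + clamp(normalizedValue) * (MAX_RADIUS - MIN_RADIUS);
    }

    /** Builds a script fragment that marks one locus with a colored sphere of the right size. */
    static String markScript(int residueNumber, double normalizedValue, String colorName) {
        return String.format(Locale.ROOT, "select %d; spacefill %.2f; color %s;",
                residueNumber, radiusFor(normalizedValue), colorName);
    }

    private static double clamp(double value) {
        return Math.max(0.0, Math.min(1.0, value));
    }
}
```

With these assumed radii, a fully conserved locus gets the smallest sphere and a locus where all 20 amino acids have been observed gets the largest, matching the small and large green balls of Figure 7.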
Once feature values are treated as
numeric rather than boolean, ordinary
boolean operations can no longer be used.
The Predicate View therefore uses basic
fuzzy logic operations [12] to determine
the sphere radius of marked features.
With values in the range (0-1), the fuzzy
logic equivalent of “x AND y” is min(x, y).
The equivalent of “x OR y” is max(x, y),
and “NOT x” becomes (1-x).
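As a minimal sketch of these rules, the following Java code evaluates a predicate strictly in order of appearance, using min for AND, max for OR, and (1-x) for NOT. The names `FuzzyPredicate`, `Op`, and `evaluate` are hypothetical and chosen for illustration; the application's own classes may be organized differently.

```java
import java.util.List;

/** Sketch of left-to-right fuzzy evaluation of a predicate over normalized feature values. */
final class FuzzyPredicate {

    /** The four operators offered by the configuration dialog. */
    enum Op {
        AND, OR, AND_NOT, OR_NOT;

        /** Combines the running result with the next operand using min, max, and (1 - x). */
        double apply(double left, double right) {
            switch (this) {
                case AND:
                    return Math.min(left, right);
                case OR:
                    return Math.max(left, right);
                case AND_NOT:
                    return Math.min(left, 1.0 - right);
                default: // OR_NOT
                    return Math.max(left, 1.0 - right);
            }
        }
    }

    /**
     * Evaluates "first op[0] rest[0] op[1] rest[1] ..." in order of appearance,
     * with no operator precedence. All operands are expected to lie in 0..1.
     */
    static double evaluate(double first, List<Op> ops, List<Double> rest) {
        if (ops.size() != rest.size()) {
            throw new IllegalArgumentException("each operator needs exactly one operand");
        }
        double result = first;
        for (int i = 0; i < ops.size(); i++) {
            result = ops.get(i).apply(result, rest.get(i));
        }
        return result;
    }
}
```

For a locus where two presence/absence features are both present (value 1) and a third feature has normalized value 0.25, chaining two ANDs gives min(min(1, 1), 0.25) = 0.25, so the sphere at that locus takes its size from the third feature.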
Figure 8 shows all six feature views, as
well as the Predicate View’s configuration
dialog.
Figure 8: The entire application
Results
3-D visualization of gp120 has
provided novel understanding of the
protein. One interesting result is the
green “positive selection map” seen in
Figures 7 and 8. This is derived from
the mutation spreadsheet data. An HIV
population in a human host, like any
population in an ecosystem, is subject
to Darwinian selection: only
mutations that confer positive traits
will survive, be passed on to offspring,
and eventually be noted by researchers.
The HIV virus is known to mutate
vigorously. Sites that are especially
subject to mutation are poor vaccine
targets, as mutations that the vaccine's
products cannot recognize will appear
and sweep through the population.
Thus it is important to be able to
understand which parts of gp120 are
highly mutable, and which are more
stable (“conserved”). In Figures 7 and
8, large green atoms signify high
mutability. Clumps of both high and
low mutability are visible, especially in
Figure 7. This clumping is good news:
if mutability were evenly distributed,
it would be difficult to design a drug
whose target region in the protein is
conserved.
A second application of predicates
concerns antibodies and glycosylation.
Antibodies are the immune system’s
attack mechanism; in a healthy host
they bind to infecting viruses and
destroy them. HIV infects immune-system cells, degrading the body’s
ability to make antibodies to itself or
to other opportunistic infections. It is
important to understand the activity
of the few antibodies that are known
to neutralize HIV. The yellow window
in Figure 8 shows sites that are
vulnerable to antibodies. One of HIV’s
defenses against antibodies is
glycosylation, a process whereby the
virus attaches an armor coat of
carbohydrate to its exterior. The pink
window shows glycosylation
locations.
There are 4 loci where both antibody
activity and glycosylation have been
observed. This is unusual, as it seems
more promising for an antibody to
attack a virus at its weak points rather
than its most protected ones. By
configuring the predicate window for
“Glycosylation AND Antibodies”, we
can see these sites, as in Figure 9:
Figure 9: Glycosylation AND Antibodies
Figure 9 shows a result that the
Berman lab believes to be novel: the 4
sites, which are separated in gp120’s
primary and secondary structure,
form a single clump which is clearly
seen in the upper-left corner of the
figure.
We can now ask whether this clump is
mutable or conserved. To do this,
another operator and operand are
configured into the predicate, which
becomes “Glycosylation AND
Antibodies AND Positive Selection”.
Recall that in fuzzy logic, AND is the
same as MIN. The atom balls in Figure
9 are maximum size, since for
glycosylation and antibodies we only
have presence/absence data (so
present features map to maximum
size). Including “AND Positive
Selection” adjusts the ball sizes of
Figure 9 to sizes indicative of
conservation, as in Figure 10.
Figure 10: Glycosylation AND Antibodies AND Positive
Selection
The clumped white balls are now
quite small, indicating conservation.
Future Directions and Conclusions
Proteins other than gp120 can be
visualized by supplying a different
PDB file in the command line. Feature
names and colors are specified in a
very simple Java enum, which can be
easily modified. One view is created
for each feature in the enum, and
screen space is the only limitation on
the number of features. On my 2.26
GHz dual core MacBook Pro laptop,
rotating and zooming are slightly
sluggish; with many more than 6
features, more compute power might
be necessary. The predicate
configuration dialog can be extended
to support more sophisticated
predicates, for example by offering
precedence operators such as
parentheses. It should be noted
however that complex predicate
calculus, while second nature to
computer programmers, is foreign to
biomedical researchers, and might
enjoy little use.
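A feature enum of the kind mentioned above might look like the following sketch. The constant names and the `viewTitles` helper are hypothetical; the colors are the ones used for each feature in the figures of this report. This is an illustration of the idea, not a copy of the enum in my source code.

```java
import java.awt.Color;
import java.util.ArrayList;
import java.util.List;

/** Sketch of a feature enum: each constant carries a display name and a markup color. */
enum Feature {
    PROTEASE_BINDING("Protease binding", Color.RED),
    RECEPTOR_BINDING("Receptor binding", Color.BLUE),
    ANTIBODIES("Antibodies", Color.YELLOW),
    GLYCOSYLATION("Glycosylation", Color.PINK),
    POSITIVE_SELECTION("Positive selection", Color.GREEN);

    final String displayName;
    final Color color;

    Feature(String displayName, Color color) {
        this.displayName = displayName;
        this.color = color;
    }

    /** One title per feature view, in declaration order; adding a constant adds a view. */
    static List<String> viewTitles() {
        List<String> titles = new ArrayList<>();
        for (Feature feature : values()) {
            titles.add(feature.displayName);
        }
        return titles;
    }
}
```

Supporting a new feature then amounts to adding one constant with its name and color, since the views are created by iterating over `Feature.values()`.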
HIV research has largely escaped the
attention of modern visualization
scientists, and is a fertile field with
many opportunities. The ability to
visualize multiple features as tertiary-structure markup gives vaccine
developers a powerful new tool for
understanding a powerful old enemy
in new ways.
References
[1] “Frontline” interview with Reagan
Administration HHS Secretary Margaret
Heckler, Jan 11, 2006. Transcript at
http://www.pbs.org/wgbh/pages/frontline/aids/interviews/heckler.html.
[2] Liu J. and Ostrowski, M. “Development
of TNFSF as molecular adjuvants for
ALVAC HIV-1 vaccines.” Hum Vaccin.
2010 Apr 6;6(4).
[3] Ratner L, Haseltine W, Patarca R, et al.
(1985). "Complete nucleotide sequence of
the AIDS virus, HTLV-III". Nature 313
(6000): 277–84.
[4] Wyatt R, Kwong PD, Desjardins E,
Sweet RW, Robinson J, Hendrickson WA,
Sodroski JG. “The antigenic structure of
the HIV gp120 envelope glycoprotein.”
Nature. 1998 Jun 18;393(6686):705-11.
[5] Julie M. Decker et al. “Antigenic
conservation and immunogenicity of the HIV
coreceptor binding site.” JEM vol. 201 no.
9, pp. 1407-1419.
[6] Bin Yu, Dora P. A. J. Fonseca, Sara M.
O'Rourke, and Phillip W. Berman.
“Protease Cleavage Sites in HIV-1 gp120
Recognized by Antigen Processing
Enzymes Are Conserved and Located at
Receptor Binding Sites” Journal of
Virology, February 2010, p. 1513-1526,
Vol. 84, No. 3.
[7]
http://www.pdb.org/pdb/static.do?p=education_discussion/Looking-at-Structures/graphics.html.
[8] Martha A. Alexander-Miller, Kenneth
C. Parker, Taku Tsukui, C. David
Pendleton, John E. Coligan and Jay A.
Berzofsky. “Molecular analysis of
presentation by HLA-A2.1 of a
promiscuously binding V3 loop peptide
from the HIV-1 envelope protein to
human cytotoxic T lymphocytes”
International Immunology, Vol. 8, No. 5,
pp. 641-649, May 1996.
[9] Chih-chin Huang, Min Tang, Mei-Yun
Zhang, Shahzad Majeed, Elizabeth
Montabana, Robyn L. Stanfield, Dimiter
S. Dimitrov, Bette Korber, Joseph
Sodroski, Ian A. Wilson, Richard
Wyatt, Peter D. Kwong: “Structure of a
V3-Containing HIV-1 gp120 Core”. Science
11 November 2005: Vol. 310, no. 5750,
pp. 1025-1028.
[10] Angel Herráez: “Jmol TO THE
RESCUE”. Biochemistry and Molecular
Biology Education Vol. 34, No. 4, pp. 255–
261, 2006.
[11] Yun (Kenneth) Kang, Sofija
Andjelic, James M. Binley, Emma T.
Crooks, Michael Franti, Sai Prasad N.
Iyer, Gerald P. Donovan, Antu K. Dey,
Ping Zhu, Kenneth H. Roux, Robert J.
Durso, Thomas F. Parsons, Paul J.
Maddon, John P. Moore and William C.
Olson: “Structural and immunogenicity
studies of a cleaved, stabilized envelope
trimer derived from subtype A HIV-1”
Vaccine Volume 27, Issue 37, 13 August
2009, Pages 5120-5132.
[12] Zadeh, Lotfi: “Knowledge
Representation in Fuzzy Logic”. IEEE
Transactions on Knowledge and Data
Engineering, March 1989 (vol 1 no. 1).
[13] Miroslav P Milev, Chris M Brown, and
Andrew J Mouland. “Live cell visualization
of the interactions between HIV-1 Gag
and the cellular RNA-binding protein
Staufen1”. Retrovirology 2010, 7:41
doi:10.1186/1742-4690-7-41.
[14] Donna Meads, Marc D. Hansen, and
Alex Pang. “Protalign: A 3-dimensional
protein alignment assessment tool.”
Pacific Symposium on Biocomputing,
4:354-367 (1999).