Visualizing the HIV gp-120 Envelope Protein
Philip Heller, University of California, Santa Cruz
CMPS 261 Final Project Report – Spring 2010

Abstract: More than a quarter of a century of HIV vaccine development has produced little in the way of tangible results. Recent efforts have focused on interference with the gp-120 envelope protein of the HIV virus. Researchers in the Phillip Berman lab at U.C. Santa Cruz have identified loci along the gp-120 primary structure where relevant biochemical reactions take place. Due to protein folding, loci that are distant in the primary sequence may actually be near enough in the 3-D structure to influence one another. To help researchers better understand reaction loci, I have modified the JMol protein visualization program to accommodate visual annotation of identified loci.

Introduction

The HIV virus probably entered the United States in 1970, around the time that African doctors started reporting a rise in opportunistic infections. The name “AIDS” was coined in 1982. Vaccine development began in 1984. Based on historical efforts against other diseases, the Department of Health and Human Services predicted success within 2 years [1]. More than a generation later, recent results from a large-scale trial in Thailand have been promising [2], but cautious optimism has replaced “two-year turnaround” exuberance in the research community.

HIV/AIDS differs in many ways from diseases such as tetanus and polio, for which standard procedures produced successful vaccines in straightforward ways. HIV attacks the immune system, damaging the mechanisms that a vaccine is supposed to enhance. HIV inserts its own genome into host cells, where it can hide from the immune system. HIV evolves constantly and rapidly, rendering immune responses obsolete before they can be effective. And no one has ever recovered from HIV, so there is no “role model” mechanism for a vaccine to imitate.
On the other hand, gene technology and bioinformatics have come into their own as effective sciences, providing researchers with analytical resources that could not have been foreseen in 1982.

Much research has focused on the gp120 envelope protein of the HIV virus. Molecules of this protein project like spikes from the viral surface, and initiate contact with and penetration of host cells. The virus becomes significantly more dangerous after it enters a host cell, because it subverts the cell’s reproductive mechanism so that the cell amplifies the virus. Destroying gp120 would destroy the virus’ access to reproduction.

The genetic and molecular structures of gp120 are well known. The complete HIV nucleotide sequence was published in Nature in 1985 [3]. A 3-D molecular structure of gp120 was derived by X-ray crystallography in 1998 [4], and a corresponding PDB file has been generated. PDB files specify the locations in 3-space of amino acid sequences with atomic granularity; they are used by numerous protein visualization tools. (Note that the PDB file actually models gp120 in association with other proteins; these are dimly visible in the screenshots of my application.)

In the past decade, biomedical technology has enabled the association of certain viral properties with the specific location in gp120 that gives rise to the property. For example, it is possible to determine the exact location on the protein where neutralizing antibodies bind. Meanwhile, visualization tools have allowed researchers to view such sites as visual annotations on pictures of the protein, thus reading the data in its natural context. In the next section I will give a bit more detail on the state of the art of HIV protein visualization, and subsequently will explain the contribution of my own work.

Prior Work and State of the Art

I use the term “feature” to mean a property of a protein that is associated with a specific location in the protein (antibody binding sites are an example of a feature). The visualization challenge at hand can be thought of as annotating protein illustrations with feature glyphs.

Protein structure is described conceptually at four levels. Primary structure is the protein’s linear amino acid sequence; it can be represented as a string over a 20-character alphabet. Secondary structure refers to the local folding of the protein chain into simple shapes such as loops and sheets. Much secondary structure can be accurately predicted by software. Secondary structure can generally be depicted in 2 dimensions. Tertiary structure is the real-world 3-D shape of the protein. In the general case it cannot be predicted by software, but must be determined by X-ray crystallography. Crystallography is a very expensive process, especially compared to genetic sequencing (which produces primary structure). Consequently, there are tens of thousands of nearly identical gp120 primary structures sampled from many patients over many studies, but there is only one trusted tertiary structure. Quaternary structure describes the joining together of multiple protein units to form complexes; it is not relevant to this application.

Feature glyphs can be added to any primary, secondary, or tertiary structure image. However, in practice it is easier to depict primary sequences as text rather than as images, and to annotate with highlights rather than with glyphs, as shown in Figure 1. The figure shows part of a group of 6 highly similar sequences, with a structural feature highlighted in yellow.

Figure 1: Primary structure annotation (source: [5])

Secondary structure is simply convoluted primary structure. Secondary structure visualization convolutes linear text into an image, generally with room to spare for arrows and callouts, as shown in Figure 2.
Figure 2: Secondary structure annotation (source: [6])

Figure 3, which is a detail of Figure 2, shows that the dots represent individual amino acids. Red dots are annotations indicating a single feature. Red-and-green dots show that two features have been observed. The two-colored dots are fairly easy to read, but with more features the dots become subdivided into so many pieces that understanding is impaired.

Figure 3: Secondary structure detail (source: [6])

With tertiary structure, the character of visualization switches from symbolic to representational. The PDB file format (see above) provides 3-D coordinates for all atoms in a protein, along with information for determining which atoms belong to which amino acids. Thus all data necessary for 3-D visualization is available. Figure 4 shows three popular tertiary structure visualization styles: wireframe, spacefilling, and backbone-and-ribbon. In most cases, tertiary visualization is preferred. Amino acids that are spatially adjacent and interact in important ways may appear distant in primary and secondary sequence depictions; true amino acid relationships are only seen with tertiary structure.

Figure 4: Tertiary structure visualization (source: [7])

A number of PDB-compatible applications are freely available. They generally permit real-time image rotation and zooming. Display parameters (such as which of the three styles of Figure 4 to use) are set via GUI controls or by typing commands in a scripting language. Scripting also supports modifying the size and color of selected amino acids. It is fairly common to see articles with figures of gp120, rendered as in Figure 4, with features of relevance to the article highlighted in some way. For examples, see [8, 9].

The most popular rendering applications are the RasMol/PyMol/JMol suite [10]. These three programs are nearly identical implementations in, respectively, X Windows, Python, and Java. Image manipulation is done by mouse dragging, optionally qualified by the SHIFT key. Scripts may be supplied as files at startup time, or typed during runtime.

There is no explicit “HIV visualization” field of study. In the biomedical domain, “visualization” generally means microscopy. (See [13] for a typical example.) Some research has been done in the more general field of protein visualization. For example, Meads et al. [14] have developed a program called ProtAlign, which visualizes the tertiary structure of a pairwise protein alignment. ProtAlign was developed as a tool for threading, which is the technique of predicting a protein’s tertiary structure, given a sufficiently similar protein with known structure. Threading may prove valuable in the study of gp120, where only one version of the protein has known tertiary structure. In the foreseeable future we are not likely to see many more gp120 tertiary structures: X-ray crystallography remains quite expensive and time consuming.

Data and Motivation

I had two collections of data. The first was feature data from the Phillip Berman lab at U.C. Santa Cruz that formed the basis of [6]. The article reports locations of some important gp120 features as measured in a cohort of patients. These features are protease binding, receptor binding, antibody binding, and glycosylation; their primary-sequence loci are provided in tables, and they appear as markup glyphs on a secondary-structure diagram (partially reproduced here as Figures 2 and 3). Protease binding is reported as a prevalence percentage within the cohort; the other features are reported as simply present or absent.

I also had a large Excel spreadsheet that contained all single-substitution mutations observed at each locus of the sequence. Mutation sites are of particular interest because any vaccine that requires a particular amino acid to be present at a particular site will be ineffective if the site mutates into a different amino acid.
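To make the shape of the mutation data concrete, here is a minimal sketch, assuming each spreadsheet row has been reduced to a (locus, reference amino acid, substituted amino acid) triple. The row layout, the function name, and the sample loci are hypothetical; the actual spreadsheet format is not described in this report.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple

# One row per observed single-substitution mutation:
# (locus, reference amino acid, substituted amino acid).
# This row layout is an assumption; the real spreadsheet may differ.
Mutation = Tuple[int, str, str]


def residues_by_locus(rows: Iterable[Mutation]) -> Dict[int, Set[str]]:
    """Collect the distinct amino acids seen at each locus, reference included."""
    seen: Dict[int, Set[str]] = defaultdict(set)
    for locus, reference, substituted in rows:
        seen[locus].add(reference)
        seen[locus].add(substituted)
    return dict(seen)


# Hypothetical sample rows, for illustration only.
sample_rows = [
    (295, "N", "T"),
    (295, "N", "K"),
    (332, "N", "D"),
]

variants = residues_by_locus(sample_rows)
for locus in sorted(variants):
    print(locus, sorted(variants[locus]))
```

The size of each set is the number of distinct amino acids observed at that locus. A locus that never appears in the rows has no recorded substitution in this data.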
I wanted to distil as much information as possible from the given data. I asked myself the following question: “If I can visualize all five features simultaneously in three dimensions, will I observe relationships that would otherwise be hidden?” In the next section I detail the novel functionality that I developed in order to create the visualization I needed to answer my question.

Design Rationale and Novel Functionality

I decided to leverage as much JMol functionality as possible, and write new code where necessary. My primary requirement was to mark amino acid sites in order to indicate features. Additionally, if possible, I wanted a quantitative mark for quantitative features (e.g. protease binding, for which I had prevalence percentage data).

I then had to decide what to do about loci with multiple features. In 3-D, multiple colors as in Figure 3 would not be helpful. Consider a site with three colored zones: from many sightlines, one zone would be obscured; and in any event, JMol only supports single-color markup. I could simply create one JMol view for each feature, but this approach has a drawback: after only a few user interactions, it’s possible for the views to be so disjoint that no unified notion of the overall information is evident. This visual confusion can be overcome by ensuring that all views show the same orientation and zoom at all times. JMol has no facility for doing this, so I implemented an event-echoing mechanism that applies all image-manipulating mouse gestures to all images.

While it is certainly undesirable to display all features in a single view, it is still valuable to see combinations of (typically a few) features, selected ad hoc, in a single view. I created one extra view, which I call the “predicate view”, which displays logical combinations of features. A dialog box allows modification of the logical predicate. For example, Figure 5 shows the predicate view, displaying loci that are protease binding sites (red), receptor binding sites (blue) or both (white).

Figure 5: The predicate view

The predicate view is configured via the dialog shown in Figure 6.

Figure 6: Predicate view configuration dialog

In this dialog, combo boxes alternately support choice of operand and operator. The operand may be any of the features, and the operator may be AND, OR, AND NOT, or OR NOT. Evaluation is in order of appearance. The “Results” section below shows several examples of the predicate view.

Locations which bear features are marked with spheres. Sphere radius is controlled by the script, and can therefore be varied to provide information if the data has some natural quantitative interpretation. Protease binding sites are reported as prevalence percentages, which are easily normalized to the range (0-1). For mutations, from 1 to 20 amino acids are possible at any site, and the number of distinct amino acids actually observed can be normalized to the same (0-1) range. The remaining features are reported as either absent or present, and these values can be translated to 0 and 1 respectively. With all feature values normalized to a common range, sphere radius can be varied between a minimum and a maximum to reflect the normalized value. This is best seen with mutations, where nearly all loci have some mutability, as shown in Figure 7. The large green balls indicate high incidence of mutation, while the small balls indicate conservation.

Figure 7: Variable radius reflecting quantitative value

Once feature values are treated as numeric rather than boolean, ordinary boolean operations can no longer be used. The predicate view therefore uses basic fuzzy logic operations [12] to determine the sphere radius of marked features. With values in the range (0-1), the fuzzy logic equivalent of “x AND y” is min(x, y). The equivalent of “x OR y” is max(x, y), and “NOT x” becomes (1-x).
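The normalization and fuzzy operators described above can be sketched as follows. This is a minimal illustration, not the application's actual code: the function names, the radius limits, and the linear mapping of 1 to 20 observed amino acids onto the range 0 to 1 are assumptions. Only min, max, (1-x), the four operators, and the left-to-right evaluation order come from the text.

```python
from typing import Callable, Dict, Sequence

# Assumed sphere radius limits; the report does not state the actual values.
MIN_RADIUS = 0.5
MAX_RADIUS = 3.0


def normalize_percent(percent: float) -> float:
    """Map a prevalence percentage (0 to 100) onto the range 0 to 1."""
    return percent / 100.0


def normalize_mutation_count(distinct_amino_acids: int) -> float:
    """Map 1..20 observed amino acids onto 0..1 (linear mapping is an assumption)."""
    return (distinct_amino_acids - 1) / 19.0


# The four operators offered by the dialog, expressed as fuzzy operations.
OPERATORS: Dict[str, Callable[[float, float], float]] = {
    "AND": lambda x, y: min(x, y),
    "OR": lambda x, y: max(x, y),
    "AND NOT": lambda x, y: min(x, 1.0 - y),
    "OR NOT": lambda x, y: max(x, 1.0 - y),
}


def evaluate(predicate: Sequence[str], values: Dict[str, float]) -> float:
    """Evaluate an alternating operand/operator list in order of appearance.

    `predicate` looks like ["A", "AND", "B", "OR", "C"]; `values` holds the
    normalized value of each feature at a single locus.
    """
    result = values[predicate[0]]
    for operator, operand in zip(predicate[1::2], predicate[2::2]):
        result = OPERATORS[operator](result, values[operand])
    return result


def sphere_radius(value: float) -> float:
    """Interpolate linearly between the minimum and maximum radius."""
    return MIN_RADIUS + value * (MAX_RADIUS - MIN_RADIUS)


# Hypothetical locus: both presence/absence features present, and
# two distinct amino acids observed there in the mutation data.
locus_values = {
    "Glycosylation": 1.0,
    "Antibodies": 1.0,
    "Positive Selection": normalize_mutation_count(2),
}
both = evaluate(["Glycosylation", "AND", "Antibodies"], locus_values)
all_three = evaluate(
    ["Glycosylation", "AND", "Antibodies", "AND", "Positive Selection"],
    locus_values,
)
print(sphere_radius(both), sphere_radius(all_three))
```

Because evaluation runs strictly left to right, “A OR B AND C” is computed as min(max(A, B), C). A predicate that needs a different grouping has to be reordered by the user.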
Figure 8 shows all six feature views, as well as the predicate view’s configuration dialog.

Figure 8: The entire application

Results

3-D visualization of gp120 has provided novel understanding of the protein. One interesting result is the green “positive selection map” seen in Figures 7 and 8. This is derived from the mutation spreadsheet data. An HIV population in a human host, like any population in an ecosystem, is subject to Darwinian selection: only mutations that confer positive traits will survive, be passed on to offspring, and eventually be noted by researchers. The HIV virus is known to mutate vigorously. Sites that are especially subject to mutation are poor vaccine targets, because mutations that the vaccine’s products cannot recognize will appear and sweep through the population. Thus it is important to be able to understand which parts of gp120 are highly mutable, and which are more stable (“conserved”). In Figures 7 and 8, large green atoms signify high mutability. Clumps of both high and low mutability are visible, especially in Figure 7. This clumping is good news: if mutability were evenly distributed, it would be difficult to design a drug whose target region in the protein is conserved.

A second application of predicates concerns antibodies and glycosylation. Antibodies are the immune system’s attack mechanism; in a healthy host they bind to infecting viruses and destroy them. HIV infects immune-system cells, degrading the body’s ability to make antibodies to itself or to other opportunistic infections. It is important to understand the activity of the few antibodies that are known to neutralize HIV. The yellow window in Figure 8 shows sites that are vulnerable to antibodies. One of HIV’s defenses against antibodies is glycosylation, a process whereby the virus attaches an armor coat of carbohydrate to its exterior. The pink window shows glycosylation locations. There are 4 loci where both antibody activity and glycosylation have been observed.
This is unusual, as it seems more promising for an antibody to attack a virus at its weak points rather than its most protected ones. By configuring the predicate window for “Glycosylation AND Antibodies”, we can see these sites, as in Figure 9.

Figure 9: Glycosylation AND Antibodies

Figure 9 shows a result that the Berman lab believes to be novel: the 4 sites, which are separated in gp120’s primary and secondary structure, form a single clump which is clearly seen in the upper-left corner of the figure. We can now ask whether this clump is mutable or conserved. To do this, another operator and operand are configured into the predicate, which becomes “Glycosylation AND Antibodies AND Positive Selection”. Recall that in fuzzy logic, AND is the same as MIN. The atom balls in Figure 9 are maximum size, since for glycosylation and antibodies we only have presence/absence data (so present features map to maximum size). Including “AND Positive Selection” adjusts the ball sizes of Figure 9 to sizes indicative of conservation, as in Figure 10.

Figure 10: Glycosylation AND Antibodies AND Positive Selection

The clumped white balls are now quite small, indicating conservation.

Future Directions and Conclusions

Proteins other than gp120 can be visualized by supplying a different PDB file on the command line. Feature names and colors are specified in a very simple Java enum, which can be easily modified. One view is created for each feature in the enum, and screen space is the only limitation on the number of features. On my 2.26 GHz dual-core MacBook Pro laptop, rotating and zooming are slightly sluggish; with many more than 6 features, more compute power might be necessary. The predicate configuration dialog can be extended to support more sophisticated predicates, for example by offering precedence operators such as parentheses.
It should be noted, however, that complex predicate calculus, while second nature to computer programmers, is foreign to biomedical researchers, and might enjoy little use.

HIV research has largely escaped the attention of modern visualization scientists, and is a fertile field with many opportunities. The ability to visualize multiple features as tertiary-structure markup gives vaccine developers a powerful new tool for understanding a powerful old enemy in new ways.

References

[1] “Frontline” interview with Reagan Administration HHS Secretary Margaret Heckler, Jan 11, 2006. Transcript at http://www.pbs.org/wgbh/pages/frontline/aids/interviews/heckler.html.
[2] Liu J. and Ostrowski, M. “Development of TNFSF as molecular adjuvants for ALVAC HIV-1 vaccines.” Hum Vaccin. 2010 Apr 6;6(4).
[3] Ratner L, Haseltine W, Patarca R, et al. (1985). “Complete nucleotide sequence of the AIDS virus, HTLV-III”. Nature 313 (6000): 277–84.
[4] Wyatt R, Kwong PD, Desjardins E, Sweet RW, Robinson J, Hendrickson WA, Sodroski JG. “The antigenic structure of the HIV gp120 envelope glycoprotein.” Nature. 1998 Jun 18;393(6686):705-11.
[5] Julie M. Decker et al. “Antigenic conservation and immunogenicity of the HIV coreceptor binding site” JEM vol. 201 no. 9, pp. 1407-1419.
[6] Bin Yu, Dora P. A. J. Fonseca, Sara M. O'Rourke, and Phillip W. Berman. “Protease Cleavage Sites in HIV-1 gp120 Recognized by Antigen Processing Enzymes Are Conserved and Located at Receptor Binding Sites” Journal of Virology, February 2010, p. 1513-1526, Vol. 84, No. 3.
[7] http://www.pdb.org/pdb/static.do?p=education_discussion/Looking-at-Structures/graphics.html.
[8] Martha A. Alexander-Miller, Kenneth C. Parker, Taku Tsukui, C. David Pendleton, John E. Coligan and Jay A. Berzofsky. “Molecular analysis of presentation by HLA-A2.1 of a promiscuously binding V3 loop peptide from the HIV-1 envelope protein to human cytotoxic T lymphocytes” International Immunology, Vol. 8, No. 5, pp. 641-649, May 1996.
[9] Chih-chin Huang, Min Tang, Mei-Yun Zhang, Shahzad Majeed, Elizabeth Montabana, Robyn L. Stanfield, Dimiter S. Dimitrov, Bette Korber, Joseph Sodroski, Ian A. Wilson, Richard Wyatt, Peter D. Kwong: “Structure of a V3-Containing HIV-1 gp120 Core”. Science 11 November 2005: Vol. 310, no. 5750, pp. 1025-1028.
[10] Angel Herráez: “Jmol TO THE RESCUE”. Biochemistry and Molecular Biology Education Vol. 34, No. 4, pp. 255–261, 2006.
[11] Yun (Kenneth) Kang, Sofija Andjelic, James M. Binley, Emma T. Crooks, Michael Franti, Sai Prasad N. Iyer, Gerald P. Donovan, Antu K. Dey, Ping Zhu, Kenneth H. Roux, Robert J. Durso, Thomas F. Parsons, Paul J. Maddon, John P. Moore and William C. Olson: “Structural and immunogenicity studies of a cleaved, stabilized envelope trimer derived from subtype A HIV-1” Vaccine Volume 27, Issue 37, 13 August 2009, Pages 5120-5132.
[12] Zadeh, Lotfi: “Knowledge Representation in Fuzzy Logic”. IEEE Transactions on Knowledge and Data Engineering, March 1989 (vol 1 no. 1).
[13] Miroslav P Milev, Chris M Brown, and Andrew J Mouland. “Live cell visualization of the interactions between HIV-1 Gag and the cellular RNA-binding protein Staufen1”. Retrovirology 2010, 7:41 doi:10.1186/1742-4690-7-41.
[14] Donna Meads, Marc D. Hansen, and Alex Pang. “Protalign: A 3-dimensional protein alignment assessment tool.” Pacific Symposium on Biocomputing, 4:354-367 (1999).