Know the Limitations of your Data – X-ray, NMR, EM
PHAR 201/Bioinformatics I
Philip E. Bourne
SSPPS, UCSD
Prerequisite Reading: Structural Bioinformatics Chapters 4-6
PHAR 201 Lecture 3 2012

When You Grab a PDB File, What Are You Starting With?

Data Views
• Depositor/Annotator
• Type of experiment: X-ray, NMR, EM
• Type of molecule: protein, nucleic acid, or protein-nucleic acid complex

[Figure: deposition workflow. Labels: Depositor; Step 1: Deposit, PDB ID; Step 2: Annotate, Validate, Validation Report; Step 3: Corrections, PDB Entry; Step 4: Depositor Approval; Core DB, Distribution Site, Archival Data]

Annotation
• Resolve nomenclature and format problems
• Add missing required data items
• Add higher level classifications
• Review validation report and summary letter to the depositor
• Produce and check final mmCIF and PDB files
• Update status and load database
• Check data consistency across archive

Annotation – More Specifics
• Make sure entry is complete (mandatory items from mmCIF dictionary)
• Format exchange
  – Converts between PDB and mmCIF formats
  – Recognizes most variants of PDB format
• Check nomenclature
  – Residue
  – Polymer atoms
  – Hydrogen atoms
  – Ligand atoms

Validation
• Covalent geometry
  – Comparison with standard values (Engh and Huber [1]; Gelbin et al. [3]; Clowney et al. [2])
  – Identify outliers
• Stereochemistry: check chiral centers
• Close contacts in asymmetric unit and unit cell
• Occupancy
• Sequence agreement between SEQRES and coordinates
• Distant waters
• Experimental (SFCHECK [4])

[1] R.A. Engh and R. Huber. Acta Cryst. A47 (1991): 392-400.
[2] L. Clowney et al. J. Am. Chem. Soc. 118 (1996): 509-518.
[3] A. Gelbin et al. J. Am. Chem. Soc. 118 (1996): 519-529.
[4] A.A. Vaguine, J. Richelle, and S.J. Wodak. Acta Cryst. D55 (1999): 191-205.
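
The covalent geometry check listed under Validation can be sketched as a comparison of observed bond lengths against target values, flagging large deviations. This is a minimal illustration, not the PDB's validation code: the target lengths and standard deviations below are approximate, illustrative numbers in the style of Engh and Huber parameters, and the function names and the 4-sigma cutoff are assumptions made for the example.

```python
import math

# Illustrative backbone bond targets in Angstroms: (mean, standard deviation).
# These are approximate, assumed values for demonstration only.
REFERENCE_BONDS = {
    ("N", "CA"): (1.458, 0.019),
    ("CA", "C"): (1.525, 0.021),
    ("C", "O"): (1.231, 0.020),
}


def geometry_outliers(residue_atoms, z_cutoff=4.0):
    """Flag bonds whose length deviates from the target by more than z_cutoff sigma.

    residue_atoms maps atom names to (x, y, z) coordinates in Angstroms.
    Returns a list of (atom1, atom2, observed_length, z_score) tuples.
    """
    outliers = []
    for (atom1, atom2), (target, sigma) in REFERENCE_BONDS.items():
        if atom1 not in residue_atoms or atom2 not in residue_atoms:
            continue  # missing atoms are a completeness issue, not a geometry one
        length = math.dist(residue_atoms[atom1], residue_atoms[atom2])
        z_score = (length - target) / sigma
        if abs(z_score) > z_cutoff:
            outliers.append((atom1, atom2, round(length, 3), round(z_score, 1)))
    return outliers


if __name__ == "__main__":
    # Hypothetical residue with a deliberately stretched C-O bond.
    residue = {
        "N": (0.0, 0.0, 0.0),
        "CA": (1.458, 0.0, 0.0),
        "C": (2.983, 0.0, 0.0),
        "O": (2.983, 1.5, 0.0),
    }
    for bond in geometry_outliers(residue):
        print(bond)
```

A Z-score cutoff is used rather than a fixed distance tolerance because different bond types have different expected spreads; the cutoff itself is a judgment call and validation tools differ in where they draw it.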

The process by which biological data in a database are annotated and validated changes over time; this introduces a temporal inconsistency

Summary Thus Far
• The biocurators (annotators) are the unsung heroes of modern biology
  P.E. Bourne and J. McEntyre 2006 Biocurators: Contributors to the World of Science. PLoS Comp. Biol. (Editorial) 2(10) e142
  – International Society for Biocuration
• As a resource developer: start right, and the need for data remediation in years to come will be less likely
• As a resource user: be aware of the process used to provide the data and hence the limitations of the data you are using

The quality of the data you use in a bioinformatics experiment is a function of the method used to collect these data, so understand the method

[Figure: chart as of Oct 5, 2011; surviving label: EM 254]

X-ray Crystallography
• Oldest technique
• Majority of the depositions
• A number of Nobel prizes
• International Union of Crystallography (IUCr) .. Acta ..
• Method based on scattering from electrons: hydrogen atoms are usually not seen (sometimes modeled in)
• In fact, modeling hydrogens in is itself an issue
• Atoms with similar numbers of electrons are not readily distinguishable, e.g. O, N, C
• Influence of crystal packing, e.g. malate dehydrogenase (4MDH)
• Environment in crystal highly aqueous
• Produces similar structures to NMR, e.g. thioredoxin (3TRX vs 1SRX)

The X-ray Crystallography Pipeline
Basic Steps: Target Selection → Crystallomics (Isolation, Expression, Purification, Crystallization) → Data Collection → Structure Solution → Structure Refinement → Functional Annotation → Publish

Limitations – Crystallization
• Crystallization:
  – Non-soluble
  – Twinning
  – Micro heterogeneity
  – Disorder

Limitations – Data Collection (figure)

Limitations – Refinement (figure)

Limitations – Map Fitting
• In an intricate study the only way to be sure that the work is correct is to make your own judgment from the electron density; this is never done.
• It can be done at http://eds.bmc.uu.se/eds/
• It requires that the experimental data (the 100d structure factors) be available

Limitations – Non-crystallographic Symmetry (NCS) (figure)

Limitations – Refinement
• Introduces restraints/constraints that may or may not be realistic
• Water molecules have been added unnecessarily
• Resolution quoted wrongly
• Standards have helped
• See for example: H. Weissig and P.E. Bourne 1999 Bioinformatics 15(10) 807-831.
  An Analysis of the Protein Data Bank in Search of Temporal and Global Trends

Limitations – Interpretation of the Biologically Active Molecule
1QQP
http://www.pdb.org/pdb/101/static101.do?p=education_discussion/Looking-at-Structures/bioassembly_tutorial.html

Limitations – Functional Annotation
• Functional annotation is ONLY in the publication, NOT in the PDB
• Attempt to address this with GO assignments
• Attempt to address this with literature integration
• Structural genomics: function unknown
• One structure, one to many functions (power law); functions may be unrecognized since the PDB is relatively static
• Many efforts at functional annotation

Why Is Understanding Limitations Important?
• Later we will study reductionism, a key process in the use of biological data
• As a result of reductionism you will need to choose a representative structure for the task at hand
• Understanding the limitations of the experiment will help us do this

Summary of Important Features in using Structure Data Determined by X-ray Crystallography
• Resolution is a key indicator: think about it relative to atomic resolution, i.e. 1.54 Å for a C-C single bond
• Disorder (i.e. undetermined or alternative atomic coordinates) is a natural part of many structures
• R factor (all) describes the agreement of the model with the experimental data. It should be better than 0.20 (Rfree 0.26)

Summary of Important Features in using Structure Data Determined by X-ray Crystallography Cont.
• B (aka temperature) factors offer indicators both of the accuracy of a structure and of its most mobile regions
• At right is 5EBX drawn with QuickPDB

NMR

Features of NMR
• Limited in size (25-100 kDa), provided labeled samples are obtainable
• Selected information on proteins to ~150 kDa
• Solution study: small sample needed for soluble proteins
• Only a few solid state studies
• Reveals hydrogen positions
• Leads to an ensemble of dynamical structures; these are rarely used in bioinformatics studies
• Useful in high throughput screens to determine protein-ligand interactions
• Used for phasing of X-ray structures, i.e. the methods are synergistic
• Until recently, not applicable to membrane proteins

NMR – Methodology
• Molecules are tumbling and vibrating with thermal motion
• Usually labeled with 1H, 13C, 15N, 31P; in an external magnetic field these nuclei have two spin states, one aligned with and one opposed to the field
• Detects and assigns chemical shifts of atomic nuclei with non-zero spin
• The shifts depend on their electronic environments, i.e. identities and distances of nearby atoms
• The system can be tuned to look at specific features of the characteristic spin moments
• 1H-1H interactions provide NOE constraints
• Better resolution is obtained when the molecule is tumbling fast; size slows this, which is offset by higher magnetic field strengths
• Protein must be soluble at high concentration and stable without aggregation; high-throughput screening can show this, and folded vs unfolded, very quickly

NMR – Methodology cont.
• Result is a set of distance constraints between pairs of atoms, either bonded or non-bonded
• If there are sufficient constraints then an ensemble of possibilities results
• Often this ensemble is averaged and constraints adjusted to conform to normal bond lengths and distances
• Usually left with 15-30 members of the ensemble
• Ideally less than 1 Å RMSD between models (backbone only)
• Portions of the molecule with high motion have tell-tale signals, e.g. apo calmodulin

BMRB - http://www.bmrb.wisc.edu/

NMR Terms
• COSY/NOESY spectra: Allow through-bond (COSY) and through-space (NOESY) interactions between atoms to be measured and used to generate a 3D structure of the protein (what we have discussed).
• TROSY (Transverse Relaxation Optimized Spectroscopy): Invented about 1997. First described by Professor Kurt Wüthrich. A method for getting sharper peaks on large proteins, and so useful for analyzing larger protein systems. TROSY is best at higher fields. If the aim is to study a large complex, or a chemical shift perturbation when a protein binds to a receptor, it is better to use a 900 MHz machine than a more standard lower-field machine.
• Solid state NMR: Requires wider-bore magnets (63 or even 89 mm diameter) than solution state NMR. The higher stored energy of these wide-bore magnets means that they are significantly more difficult to build, and as a result high-field solid state NMR lags behind liquid state in terms of available field strength.
• Multidimensional (three- and four-dimensional) NMR: Introduced about 12-15 years ago. This technology has the advantage of resolving the severe overlap in 2D spectra.
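
The ensemble criterion from the NMR methodology slides (ideally less than 1 Å backbone RMSD between models) can be checked with a short calculation. The sketch below is an illustration under stated assumptions: each model is a list of backbone atom coordinates in the same atom order, the models are assumed to be already superposed (no fitting is performed here), and the function names are hypothetical.

```python
import math
from itertools import combinations


def rmsd(model_a, model_b):
    """Root-mean-square deviation between two equally sized coordinate lists.

    Assumes the models are already superposed and list atoms in the same order.
    """
    if not model_a or len(model_a) != len(model_b):
        raise ValueError("models must be non-empty and contain the same atoms")
    squared = sum(math.dist(p, q) ** 2 for p, q in zip(model_a, model_b))
    return math.sqrt(squared / len(model_a))


def mean_pairwise_rmsd(ensemble):
    """Average RMSD over all pairs of models in the ensemble."""
    pairs = list(combinations(ensemble, 2))
    if not pairs:
        raise ValueError("need at least two models")
    return sum(rmsd(a, b) for a, b in pairs) / len(pairs)


if __name__ == "__main__":
    # Hypothetical three-model ensemble of two backbone atoms each,
    # with successive models shifted by 0.5 Angstrom along x.
    base = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0)]
    ensemble = [[(x + 0.5 * i, y, z) for x, y, z in base] for i in range(3)]
    spread = mean_pairwise_rmsd(ensemble)
    print(f"mean pairwise backbone RMSD: {spread:.2f} A")
    print("tight ensemble" if spread < 1.0 else "loose ensemble")
```

A large spread is not necessarily a defect: as noted above, mobile regions give tell-tale signals, so a per-residue version of the same calculation may be more informative than a single number for the whole molecule.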

In both X-ray crystallography and NMR there is the danger that the final structure reflects the model it was computed against

Additional Validation Checks
• Stereochemical quality
  – Ramachandran plot outliers
  – Dihedrals, bond lengths and angles
  – Fold Deviation Score (FDS)
  – Validation Server http://deposit.rcsb.org/validate/

Use the PDB Geometry Data (figure)

Electron Microscopy
1KVP: Structural analysis of the Spiroplasma virus, SpV4: implications for evolutionary variation to obtain host diversity among the Microviridae
• Able to look at large molecular assemblies
• Resolution now 30 Å to below 4 Å
• Cryo-EM preserves aqueous environment (no staining)
• Experimentally more tractable
• Can resolve images (direct measurement of phases) or diffraction patterns
• Can provide a 3D volumetric reconstruction
• Suitable for the study of membrane proteins, e.g. bacteriorhodopsin (1990)

1P85: Real space refined coordinates of the 50S subunit fitted into the low resolution cryo-EM map of the EF-G.GTP state of E. coli 70S ribosome
• Single particle reconstruction: multiple orientations of the same particle found in the specimen (viruses, ribosome…)
• Electron tomography: 3D reconstruction of a single particle (organelles, whole cells)

Example EM Result
• Example for a hybrid study that combines elements of electron crystallography and helical reconstruction with homology modeling and molecular docking approaches in order to elucidate the structure of an actin-fimbrin crosslink (Volkmann et al., 2001b). Fimbrin is a member of a large superfamily of actin-binding proteins and is responsible for crosslinking of actin filaments into ordered, tightly packed networks such as actin bundles in microvilli or stereocilia of the inner ear.
  The diffraction patterns of ordered paracrystalline actin-fimbrin arrays (background) were used to deduce the spatial relationship between the actin filaments (white surface representation) and the various domains of the crosslinker (the two actin-binding domains of fimbrin are pink and blue, the regulatory domain cyan). Combination of these data with homology modeling and data from docking the crystal structure of fimbrin's N-terminal actin-binding domain into helical reconstructions (Hanein et al., 1998) allowed us to build a complete atomic model of the crosslinking molecule (foreground, color scheme as in surface representation of the array).
• From Structural Bioinformatics 2005 p124

Example EM Result
• Example for a combination of high-resolution structural information from X-ray crystallography and medium-resolution information from electron cryomicroscopy (here 2.1 nm). Actin and myosin were docked into helical reconstructions of actin decorated with smooth-muscle myosin (Volkmann et al., 2000). Interaction of myosin with filamentous actin has been implicated in a variety of biological activities including muscle contraction, cytokinesis, cell movement, membrane transport, and certain signal transduction pathways. Attempts to crystallize actomyosin failed due to the tendency of actin to polymerize. Docking was performed using a global search with a density correlation measure (Volkmann and Hanein, 1999). The estimated accuracy of the fit is 0.22 nm in the myosin portion and 0.18 nm in the actin portion. One actin molecule is shown on the left as a molecular surface representation. The yellow area denotes the largest hydrophobic patch on the exposed surface of the filament, a region expected to participate in actomyosin interactions. The fitted atomic model of myosin is shown on the right. The transparent envelope represents the density corresponding to myosin in the 3D reconstruction.
  The solution set concept (see text) was used to evaluate the results and to assign probabilities for residues to take part in the interaction. The tone of red on the myosin model is proportional to this statistically evaluated probability (the more red, the higher the probability).
• From Structural Bioinformatics 2005 p127

Small-angle X-ray Scattering (SAXS)
http://en.wikipedia.org/wiki/Small-angle_X-ray_scattering
• Reveals shape and size of macromolecules in the range 5-25 nm
• Handles partially ordered systems
• No need for crystalline sample; handles larger molecules than NMR, but at lower resolution
• Leading to hybrid techniques

Summary Regarding Data Limitations
• Pay attention to the method, its pluses and minuses
• Be aware of models
• Be aware of the general limitations of each method
• For NMR be aware of an ensemble of structures
• Be aware of hybrid models
• For all methods be aware of the parameters that govern the accuracy
• You will need to know these limitations for just about any bioinformatics study, since it will be necessary to choose a non-redundant (NR) set; we will visit ASTRAL and PISCES, which are tools for defining an NR set
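
As a rough illustration of using accuracy parameters when choosing a representative structure, the sketch below filters candidate X-ray entries by the thresholds quoted earlier in the lecture (R factor better than 0.20, Rfree better than 0.26) and then prefers the best resolution. The `Entry` record, the function names, the 2.5 Å resolution cutoff, and the example entries are all hypothetical; this is not how ASTRAL or PISCES select their sets.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Entry:
    """Hypothetical summary record for one structure."""
    entry_id: str
    method: str                  # e.g. "X-RAY", "NMR", "EM"
    resolution: Optional[float]  # Angstroms; None when not applicable
    r_factor: Optional[float]
    r_free: Optional[float]


def passes_xray_quality(entry, max_resolution=2.5, max_r=0.20, max_r_free=0.26):
    """True if an X-ray entry beats the R and Rfree thresholds from the lecture.

    max_resolution is an assumed cutoff chosen for this example.
    """
    if entry.method != "X-RAY":
        return False
    if entry.resolution is None or entry.r_factor is None or entry.r_free is None:
        return False  # missing quality data: cannot judge, so exclude
    return (entry.resolution <= max_resolution
            and entry.r_factor < max_r
            and entry.r_free < max_r_free)


def pick_representative(entries: List[Entry]) -> Optional[Entry]:
    """Return the passing entry with the best resolution, then the lowest Rfree."""
    candidates = [e for e in entries if passes_xray_quality(e)]
    return min(candidates, key=lambda e: (e.resolution, e.r_free), default=None)


if __name__ == "__main__":
    # Made-up entries for illustration; the IDs are placeholders, not PDB IDs.
    entries = [
        Entry("demo_a", "X-RAY", 1.8, 0.18, 0.22),
        Entry("demo_b", "X-RAY", 1.5, 0.23, 0.29),  # fails R and Rfree
        Entry("demo_c", "NMR", None, None, None),   # not an X-ray entry
        Entry("demo_d", "X-RAY", 2.1, 0.19, 0.24),
    ]
    best = pick_representative(entries)
    print(best.entry_id if best else "no entry passes")
```

Note that the highest-resolution entry is rejected here because its R factors are poor, which reflects the point of the summary: no single parameter is sufficient on its own.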