Sesame: a Data Management System for Structural Proteomics

John L. Markley
Center for Eukaryotic Structural Genomics
National Magnetic Resonance Facility at Madison
Biological Magnetic Resonance Data Bank
Department of Biochemistry, University of Wisconsin-Madison, Madison WI 53706, USA
markley@nmrfam.wisc.edu

What data do we need to capture and make available to ourselves and others?

Individual research lab (“R01 research”)
• Information gleaned from databases, literature, and collaborators
• Laboratory resources (people, freezers, rooms, instruments ...)
• Protocols
• Projects and their progress
• Products (what they are, where located, where distributed)
• Primary data collected (sequences, gel scans, HPLC, UV-vis, MS, NMR ...)
• Results to be submitted to databases or put in publications

Large consortia (e.g., structural genomics centers)
• Same as above, but on a larger scale
• Data transmitted to specialized databases (TargetDB, PEPCdb)

Shared instrumentation / technology centers
• User requests for data collection, services
• Scheduling / billing
• Protocols (available for distribution and reuse)
• Individual data sets (back-up copy, copy sent to users)
• Reporting for funding agencies and advisory committees

Publicly accessible databases (GenBank, UniProt, PDB, BMRB, Entrez ...)
• Validation tools and pre-submission screening of data
• Tools for data integration and sharing

Necessary ingredients

Vocabularies, ontologies, and data schema for defined domains (with databases based on these)
  GenBank
  UniProt
  Protein Data Bank / Biological Magnetic Resonance Data Bank
  PubMed, PubChem, ...
Data sharing
  Structural genomics groups are leading the way: TargetDB, PEPCdb, Center databases
  Others are planned: Molecular Libraries
Integration of data, simulations, and validation
  General goal in many areas
User interfaces that integrate genomic and other biomedical databases
  Genome (nucleic acid and protein sequences)
  Transcriptome (gene chips, ...)
  Interactome (molecular interaction maps)
  Structures (3D coordinates and underlying data)
  Biophysical and dynamic data (NMR parameters)
  Chemicals (small molecule screening, metabolomics)
  Images (macromolecular complexes → cells → organisms)

NMR structure determination as an example of the integration of data, simulations, and validation

[Flow diagram with columns “Simulations” and “Data”, starting from “Target”; boxes in order of appearance:]
  PDB, BMRB
  Spectra of protein
  Prediction of spectra from sequence / homology modeling
  Peak identifications
  Prediction of 2° structure from sequence and assignments
  Backbone assignments
  Prediction of tertiary contacts
  NOE and RDC restraint analysis
  Sidechain assignments
  Prediction of NMR parameters (shifts, RDCs) from structure
  Structure calculations
  Validation of input data and calculated results
  Final ensemble
  Database deposition
  Publication

“Chemical shift priors” can be used to sharpen probabilities in automated peak identification and assignment

1H-15N probability density plot (chemical shift priors) for mouse protein Mm202773 generated from the sequence of the protein. The colors correspond to the probability scale on the right.

Conformation-dependent chemical shift densities: 13C, Ala

[Figure: 13C chemical shift densities for Ala; panels labeled β-sheet and α-helix]

The parametric descriptions of these distributions are:
  sheet: 1.65 | 1.0 | 0.8 | 50.9 | 0.00 | -0.0 | 0.0 | 0.00
  coil:  1.89 | 0.0 | 0.4 | 54.9 | 1.43 | -0.6 | 0.6 | 52.5

Conformation-dependent chemical shift densities taken two at a time: 1H-13C

Estimated relative densities in two-dimensional 13C-1H chemical shift space for alanine (top four panels) and methionine (bottom four panels). In each set of four panels, the densities represent: (top left) extended strand, E; (top right) alpha helix, H; (lower left) random coil, R; (lower right) combination of E + H.

Eghbalnia et al., J. Biomol. NMR, in press

PECAN (Protein Energetic Conformational Analysis from NMR chemical shifts): analysis of secondary structure from assigned chemical shifts and the protein sequence

[Figure: example of PECAN output; regions labeled helix, transition region, and extended]

Eghbalnia et al., J. Biomol. NMR, in press

Example of the combined use of database, GRID-type computing, and validation: recalculation of structures of >500 proteins from restraint data extracted from the PDB

[Workflow diagram; boxes in order of appearance:]
  NMR restraints: NOE, J-couplings from PDB “MR” text files
  BMRB restraints database (MR grid)
  Database of corrected restraints (DOCR)
  Filtered database of non-redundant restraints (FRED)
  Distributed computation on Condor cluster of >800 workstations
  CYANA
  >500 CYANA structures
  Database of recalculated structures (RECOORD)
  Following final CNS water refinement, validated and compared
  PDB
  CNS
  >500 CNS structures

Collaboration among BMRB, EBI, European Validation Group (NMRQUAL), ....

Example of analysis of results from the RECOORD (REcalculated COORDinates) database

[Figure panels:]
  Improvement in Ramachandran plot appearance (recalculated – original structures)
  Improvement in Z-score packing quality (recalculated – original structures)

Nederveen et al., Proteins, in press

The Sesame Project

Goals: laboratory information management system, collaborative tools, process pipelining

Started in 1998 as a tool for managing data at the National Magnetic Resonance Facility at Madison (NMRFAM)

Adopted by the Center for Eukaryotic Structural Genomics (CESG) in 2000 and continually expanded and refined since then

Currently used by:
• Three structural genomics consortia (CESG, SGPP, and Berlin Structure Factory)
• Proteomics project on enzymes of E. coli
• NMR facilities (NMRFAM, Medical College of Wisconsin, Mayo)
• Molecular Interactions Facility (Madison)
• Biological Magnetic Resonance Data Bank
• “R01” labs

Under construction for:
• Metabolomics consortium
• Molecular screening facility

Zolnai et al., J. Struct. Funct. Genomics (2003) 4:11-23
http://www.sesame.wisc.edu

Sesame: basic design features

Multiple-tier system sitting on a commercial relational database management system

Computational servers do the heavy work and off-load to distributed (GRID) servers

Users interact via Java2 with clients on any computer linked to the Web (desktop, notebook, hand-held)

[Architecture diagram; components in order of appearance: CORBA Name Server (ORB); Alibaba Dispenser (ORB); AlibabaDB (JDBC); User Client (ORB), Tier 1; Alibaba (ORB), Tier 2; Sesame DB and FS, Tier 3]

Setting up a new virtual laboratory under Sesame (from the ‘Help Pages’)
• Users can set up a lab or facility
• The “Lab Master” invites members to join the group and can carry out some customizations

Target Information
• Sesame organizes uploaded information from the annotated proteome (gene identifier, ORF sequence, sequences of flanking regions, relevant data from web-based databases, etc.)
• ORF information includes direct links to all records in the Sesame database associated with that ORF for rapid and efficient data recall and analysis
• Queries can be used to select ORFs on the basis of chosen criteria (physical properties or annotations of various kinds) and to organize them into ‘workgroups’ of defined sizes (usually 96) for entry into either the E. coli or cell-free pipeline

Every record is associated with a protocol, so that laboratory results can be entered

Variables associated with cell growth:
• OD at induction
• Wet weight of cells
• Expression level
• Solubility
• Cleavage

“Actions” provide a controlled vocabulary for searchable outcomes of work
• Each lab can create a unique set of controlled vocabulary terms.
• Actions can represent lab activities (e.g., steps in cloning, expression testing, protein production from cells or cell-free extracts, structure determinations by X-ray or NMR) or stamps of approval (e.g., cloning acceptable, purification complete).
• CESG’s entire data dictionary contains 184 defined actions.
• The list of actions can evolve as warranted by the development of new laboratory processes.
• Actions are linked to a specific workgroup number and protocol so that the data can be mined at a later date and reports generated.
• Actions shown here pertain to target selection and cloning.

Protocols are organized so that only the fields and actions relevant to a given protocol are visible to users

[Screenshot callouts:]
  Body of the protocol
  List of CESG actions
  NIH database tag mapped to each action
  Fields for data entry – default values can also be entered

Other features of the Sesame LIMS

In addition to documenting the progress of targets, Sesame can perform other functions:
• Predict restriction digest patterns
• Store information about laboratory ‘resources’, such as primer prefixes, protease types, and storage locations
• Create orders for oligonucleotide primers
• Store images and files
• Create reports (XML and summary reports)

Multiple windows can be “tiled” on the browser screen for ease in making comparisons

As another example, here is the Sesame record for a sample prepared for mass spectrometry

On scrolling down, one sees that images and files have been appended

JM 7203 xray LA 7196
Match to: gi|22328616 Score: 111
At4g14165 F-box family protein-related [Arabidopsis thaliana]
Found in search of I:\CESG4\jm7203.mgf
Nominal mass (Mr): 30619; Calculated pI value: 9.03
NCBI BLAST search of gi|22328616 against nr
Unformatted sequence string for pasting into other applications
Taxonomy: Arabidopsis thaliana
Cleavage by Trypsin: cuts C-term side of KR unless next residue is P
Sequence Coverage: 16%
Matched peptides shown in Bold Red

  1 MENKHNPTSH TSHTWSELPE LNPCVPLGTL LPDKSCPKTH PLADLIPPRQ
 51 QQQFMHGNSW KLAPYGRSMI MRYAMQVRGQ LAPTVLGINR TWKGDTCWNQ
101 IPKTSDCFHM VYKDHKLYFL NKTGSFKIFD FCGDIPQQTF EWSVKVERSQ
151 FGPPPPSNPW KVLATKLVVT VTGKVLKVEE MGGARPRTWS FRVFESMLLD
201 QGITVLANDT GGFIRNTIYF SASHGNNTHD IYIFNLETQK TEPLHTLDSY
251 SSSVQFYGAQ WFVPSFKH

Expected At4g14165.1 mw 30595

Sesame module for crystallization screening (‘Well’)
• Well is the Sesame module that manages information about crystallization screens
• This includes the composition of the screen (the software performs automatic volume calculations)
• Screens are linked to specific sample records
• Sesame controls the screening robot

In structural genomics, most protein-protein interactions detected are homo-oligomers

At5g06450.1: Arabidopsis fold-space target

X-ray structure 1VK0
• 28% identity over 104 residues with P0445H04.27 from Oryza sativa.
• Structure is similar to trimeric cyclic viral exonucleases, but At5g06450.1 is a homohexamer with C6 symmetry
• Structure is reminiscent of processivity factors in nucleic acid modifying enzymes
• Is a nucleic acid pulled through its center? Each subunit has a disordered positively charged sequence YKYKGS with aromatic rings for stacking?

Sesame is flexible and able to accommodate changes in protocols and decision points

New CESG pipeline being tested

[Pipeline diagram; boxes in order of appearance:]
  Construct design
  PCR cloning -> DNA
  15-50 g scale
  mRNA -> protein -> cell-free production and solubility screen
  Flexi® Vector plasmids
  1-5 mg scale
  Protein from E. coli cells
  Fluidigm chip crystallization screening
  Protein from cell-free
  Screening: Yield, MS, Functional assays, NMR 15N-1H HSQC or 1H screening

Sesame module for protein-protein interactions: ‘Rukh’
• The Rukh module is designed to manage yeast two-hybrid (Y2H) screens and to track the whole Y2H screening process from the initial screen setup through the validation steps.
• Rukh is also used to score positives based on absorbance data, visual selection from gel images, or other criteria, and to generate work lists for a Tecan robot to reformat the plates.

Sesame interface for Y2H screening

Functions
  Plate reader data shown for a bait. Those above the cutoff go forward to the next screen in the process.

Screen evaluation and refinement
  Visualization and scoring of gels are used in creating the next screen to be used in the process.

Example of data retrieval: searching for information on a recent structure determined at CESG (domain from ORF At3g03410.1)

Search by ID reveals that two versions of this gene have been studied: the full-length and a “chunk”

By clicking on the ORF number, one can view available annotation captured from other databases

The list of actions shows that the structure of the domain has been determined and deposited in the PDB

Structure of the 67 aa domain (nCML) of At3g03410.1

[Figure panels: nCML; N-terminal domain from human calmodulin]

Song, Zhao, Thao, Frederick & Markley (2004) J. Biomol. NMR 30, 451-456
PDB 1TIZ; BMRB 6209

EF-hands can be “open” or “closed”

Different responses of the EF-hands of CaM and calbindin to Ca2+ binding are thought to be responsible for their different physiological roles in calcium signaling

Protein                      | First EF-hand | Second EF-hand
Calmodulin (CaM)             | closed        | closed
Calbindin                    | closed        | closed
Ca2+-loaded CaM              | open          | open
Ca2+-loaded calbindin        | closed        | closed
Ca2+-loaded nCML (this work) | open          | closed

The Arabidopsis nCML studied here has a different signature, which may indicate the existence of a new calcium signaling pathway in plants. Arabidopsis has 6 CaM proteins and 50 CaM-related proteins such as nCML, the domain studied here.

Mapping of the ligand binding site of At2g24940.1 by NMR

Red: [1H,15N]-HSQC of At2g24940.1
Green: [1H,15N]-HSQC of At2g24940.1 in the presence of progesterone (ratio 1:1)
Blue: Sites with the greatest chemical shift perturbations ($\Delta\delta_{HN} > 0.15$ ppm)

$\Delta\delta_{HN} = \left\{ \left[ (\Delta\delta_{H})^{2} + (\Delta\delta_{N}/5)^{2} \right] / 2 \right\}^{1/2}$
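The combined 1H/15N perturbation measure on the ligand-mapping slide can be illustrated with a minimal Python sketch. Only the formula, the division of the 15N shift change by 5, and the 0.15 ppm cutoff come from the slide. The function names, residue labels, and peak positions below are hypothetical and invented for illustration; they are not data for At2g24940.1.

```python
import math


def combined_shift_perturbation(delta_h, delta_n, n_scale=5.0):
    """Combined 1H/15N chemical shift perturbation in ppm.

    Implements {[(dH)^2 + (dN / n_scale)^2] / 2}^(1/2), with the 15N shift
    change scaled down by n_scale (5 on the slide).
    """
    return math.sqrt((delta_h ** 2 + (delta_n / n_scale) ** 2) / 2.0)


def perturbed_sites(shifts_free, shifts_bound, cutoff=0.15):
    """Return {residue: perturbation} for residues above the cutoff.

    Both inputs map a residue label to a (1H ppm, 15N ppm) peak position.
    Residues missing from the bound spectrum are skipped.
    """
    hits = {}
    for residue, (h_free, n_free) in shifts_free.items():
        if residue not in shifts_bound:
            continue
        h_bound, n_bound = shifts_bound[residue]
        value = combined_shift_perturbation(h_bound - h_free, n_bound - n_free)
        if value > cutoff:
            hits[residue] = value
    return hits


# Invented peak positions (ppm) for illustration only.
free = {"G12": (8.30, 109.5), "L45": (7.95, 121.0), "K78": (8.60, 118.2)}
bound = {"G12": (8.31, 109.6), "L45": (8.25, 122.0), "K78": (8.60, 118.2)}

hits = perturbed_sites(free, bound)
for residue, value in hits.items():
    print(f"{residue}: {value:.3f} ppm")
```

With these invented numbers only L45 exceeds the cutoff, which is the kind of site that would be colored blue in the overlay.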
Summary

Domains currently covered by Sesame
• Bioinformatics (data from and to data banks)
• Molecular biology
• Protein chemistry
• Proteomics
• Molecular interactions
• NMR spectroscopy
• Crystallomics and X-ray crystallography

Additional domains under construction
• Metabolomics
• Small molecule screening

Challenges
• Education of users of the LIMS
• Motivation of users to capture all data correctly
    Feedback from data summaries used to track progress and guide next steps
    Automate data entry from instrumentation
• Evolution of the LIMS to handle scientific domains at increasing depth and breadth
    Nomenclature issues
    Interfacing with instrumentation and software

Acknowledgments

Sesame staff members: Zsolt Zolnai, John Cao, Peter Lee, Jing Li, Michael Runnels, Jianhua Zhang

Sesame collaborators: Wim Hol (UW, Seattle), Michael Hoffman, Eileen Maher, Hartmut Oschkinat (Berlin), Michael Sussman

NMRFAM staff members: Arash Bahrami, Hamid Eghbalnia, Liya Wang, Milo Westler

BMRB staff members: Eldon Ulrich, Jurgen Doreleijers, Jundong Lin, Steve Mading, Dimitri Maziuk, David Tolmie, Chris Schulte, Kent Wenger*, Hongyang Yao

Collaborators: Hideo Akutsu (Osaka), Helen Berman (PDB), Yannis Ioannidis (Athens, Greece), Miron Livny*, John Westbrook (PDB)

CESG staff members: David Aceti, Craig Bingman, Brian Fox, Zach Miller*, Craig Newman, George Phillips, Jikui Song, Zhaohui Sun, Brian Volkman (MCW)

Others on the recalculation project: Alexandre Bonvin, Peter Güntert, Robert Kaptein, Sander Nabuurs, Aart Nederveen, Michael Nilges, Chris Spronk, Wim Vranken

*UW-Madison Computer Sciences

Grant support
National Institutes of Health
  National Institute of General Medical Sciences
  National Center for Research Resources, Biomedical Technology Program
  National Library of Medicine
National Science Foundation