BBCS-185 Bioinformatics Skill Enhancement Course

UNIT 2 BIOLOGICAL DATABASES AND DATA RETRIEVAL

Structure
2.1 Introduction
    Expected Learning Outcomes
2.2 Classification of Biological Databases
    Classification Based on Source
    Nucleotide Databases
    Protein Database
    Structural Databases
    CATH
2.3 Classification of Biological Databases Based on Nature of Data
2.4 Small Molecular Databases
    PubChem
    Drug Bank
    ZINC Database
    Cambridge Structure Database (CSD)
2.5 Structure Viewing Tools and File Formats
2.6 Summary
2.7 Terminal Questions
2.8 Answers

2.1 INTRODUCTION

In the previous unit, you learned about the basics of computers. In this unit, you will study biological databases. In these databases, information related to DNA, RNA, proteins, and other biomolecules is stored in a systematic way on servers called data servers. Scientists, academicians, and researchers across the globe can retrieve this biological data whenever they need it for analysis. In the BBCCT-101 course (Molecules of Life), you studied various biomolecules and their significance. You are also aware that these molecules interact with each other through various metabolic reactions.

Some important characteristic features of biological databases are:
1. They store data in electronic form in various formats.
2. Each entry in the database is assigned a unique number or ID that cannot be repeated; in other words, the database is non-redundant.
3. Data sharing: the data can be downloaded from the websites of the biological databases or through FTP (File Transfer Protocol).
4. They are well structured and searchable, and the information is updated periodically in line with new publications and innovations in the scientific world.
5. Research publications and books refer to the data by its unique ID. This is called a cross-reference.
Expected Learning Outcomes

After studying this unit, you should be able to:
  define biological databases;
  classify biological databases;
  enlist applications of biological databases in research and data retrieval through web links;
  describe chemical, biological and structural databases; and
  explain the availability of online and offline visualization tools.

2.2 CLASSIFICATION OF BIOLOGICAL DATABASES

In Unit 11 of the BBCCT-105 (Proteins) course, you came across some basic concepts of biological databases. You are advised to recall those concepts before proceeding further in this unit. The classification of biological databases is simple and is based on the source and the nature of the data collected.

2.2.1 Classification Based on Source

1. Primary databases: These databases are built from data collected in laboratory experiments. After the experiments, the data is validated and analyzed before it is uploaded to the database, which is a crucial step in data collection. Primary databases are classified by the type of biological molecule: nucleic acid databases (GenBank, EMBL, DDBJ, NDB), protein databases (PIR, Swiss-Prot, TrEMBL, PDB), metabolic pathway databases (KEGG, EcoCyc, and MetaCyc) and small molecule databases (PubChem, Drug Bank, ZINC, CSD).

2. Secondary databases: These databases are built on primary biological databases and add further information. They comprise data derived from the results of analyzing the primary data available in primary databases. They are often referred to as curated databases, but this is a bit of a misnomer because primary databases are also curated to ensure that the data in them is consistent and accurate. Secondary databases often draw upon information from numerous sources, including other databases (primary and secondary), controlled vocabulary, and scientific literature.
They are highly curated and often use a complex combination of computational algorithms and manual analysis and interpretation to derive new knowledge from published data. Secondary databases have become the molecular biologist's reference library over the past decade or so, providing a wealth of (often daunting) information on just about any gene or gene product that has been investigated by the research community. The potential for mining this information to make new discoveries is vast.

Table 2.1: Differences between primary and secondary databases (Source: https://www.ebi.ac.uk/)

| | Primary database | Secondary database |
|---|---|---|
| Synonyms | Archival database | Curated database; knowledgebase |
| Source of data | Direct submission of experimentally derived data from researchers | Results of analysis, literature research and interpretation, often of data in primary databases |
| Examples | ENA, GenBank and DDBJ (nucleotide sequence); ArrayExpress and GEO (functional genomics data); Protein Data Bank (PDB; coordinates of three-dimensional macromolecular structures) | InterPro (protein families, motifs and domains); UniProt Knowledgebase (sequence and functional information on proteins); Ensembl (variation, function, regulation and more layered onto whole genome sequences) |

So far you have studied the types of biological databases based on source. Now let us look at nucleotide databases.

2.2.2 Nucleotide Databases

DNA and RNA are the major nucleic acids. You have studied their structure in Units 13 and 14 of BBCCT-101. Each protein or enzyme is coded by a specific gene or genes, and each gene is itself a DNA sequence. A database that stores all such gene coding sequences is called a nucleotide database. These databases are repositories used to store and retrieve data in terms of the nucleotides of various genomes (a genome is the set of chromosomes of an organism).
Let us see some examples of nucleotide databases:

GenBank
It is an integral part of the main biological database resource, NCBI (National Center for Biotechnology Information). NCBI has a tool called Entrez, which helps to retrieve data from GenBank.

EMBL
The European Molecular Biology Laboratory database is available at the European Bioinformatics Institute (EBI). SRS (Sequence Retrieval System) is a tool for retrieving the desired protein, DNA, or gene sequences from this database.

DDBJ
The DNA Data Bank of Japan is another nucleotide database.

All three nucleotide databases are interconnected through data sharing and allow access to the data through web links and data servers (Fig. 2.1).

Fig. 2.1: Graphical representation of nucleotide databases.

2.2.3 Protein Database

Swiss-Prot
This database is owned by EMBL and maintained by SIB (Swiss Institute of Bioinformatics).

TrEMBL
It contains the largest number of translated sequences.

PIR
The Protein Information Resource is located at NBRF (National Biomedical Research Foundation). It holds complete protein information, such as the source of the protein and the crystal structures available in the Protein Data Bank (with ID). PIR is divided into four sections:
PIR1: fully classified and annotated entries.
PIR2: a basic database with preliminary protein information.
PIR3: unverified entries.
PIR4: genetically engineered sequences. This section helps to understand the possibilities of engineering proteins for research activities.

SAQ 1
Fill in the blanks:
i) _________________ tool is used to retrieve data from GenBank.
ii) DDBJ is a ____________ database.
iii) EMBL database is maintained by _______.
iv) The full form of NCBI is ________________.
v) ____________________ database is related to classification of protein families.
2.2.4 Structural Databases

The Protein Data Bank (PDB) comprises several member databases. It is part of the Worldwide Protein Data Bank, which collects, organizes, and disseminates data on biological macromolecular structures such as proteins, enzymes, and DNA/RNA. PDBj (Protein Data Bank Japan) maintains a centralized PDB archive of macromolecular structures and provides integrated tools, in collaboration with the Research Collaboratory for Structural Bioinformatics (RCSB) and the Biological Magnetic Resonance Data Bank (BMRB) in the USA, and the PDBe in the EU.

RCSB: Research Collaboratory for Structural Bioinformatics (RCSB-PDB).

Secondary structural databases
The main secondary structural databases are:
1. SCOP (http://scop.mrc-lmb.cam.ac.uk/scop/, Fig. 2.2)
2. CATH
3. PDBSUM

1. SCOP (Structural Classification of Proteins) was started by the Laboratory of Molecular Biology, MRC, Cambridge, UK. The aim of this database is to classify protein 3D structures using a hierarchical scheme. The levels of the SCOP hierarchy are species, proteins, families, superfamilies, folds, and classes. Each class is further divided by structural organization. Folds are grouped into five classes (Fig. 2.3):
a. All alpha
b. All beta
c. Alpha or beta (α/β)
d. Alpha and beta (α+β)
e. Multi-domain folds

Fig. 2.2: Screenshot showing SCOP home page.

Fig. 2.3: SCOP hierarchy chart showing structure-based and evolution-based classification.

2.2.5 CATH

The CATH database (http://www.cathdb.info/) is a free, publicly available online resource that provides information on the evolutionary relationships of protein domains. It was created in the mid-1990s by Professor Christine Orengo and colleagues, and continues to be developed by the Orengo group at University College London. Protein structures in the databank are experimentally determined and are split into their consecutive polypeptide chains, where applicable.
All protein domains are identified within these chains using a mixture of automatic methods and manual curation, and the results are available to the scientific community. The domains are then classified within the CATH structural hierarchy:

Class (C) level: domains are assigned according to their secondary structure content, i.e. all alpha, all beta, a mixture of alpha and beta, or little secondary structure.
Architecture (A) level: information on the arrangement of the secondary structure in three-dimensional space is used.
Topology/fold (T) level: information on how the secondary structure elements are connected and arranged is used.
Homologous superfamily (H) level: assignments are made if there is sufficient evidence that the domains are related by evolution, i.e. they are homologous.

To browse the classification hierarchy, visit the CATH hierarchy web page (Fig. 2.4). Additional sequence data for domains with no experimentally determined structures are provided by sister resources such as Gene3D, which are used to populate the homologous superfamilies. Protein sequences from UniProtKB and Ensembl are scanned against CATH HMMs to predict domain sequence boundaries and make homologous superfamily assignments.

Fig. 2.4: CATH home page.

Learners can explore proteins, enzymes, and receptors using various search options such as 3D structure, protein evolution, protein function, and conserved sites. You can also find information about the development and updates of the database under the Learn more tab of the CATH homepage. You may download the complete database from the download link. This will help learners understand protein classification through the structural organization of proteins.
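The four CATH levels described above are usually written as a dotted numeric code, one number per level. The following is a minimal sketch, not part of the CATH software, of how such a code could be split into its C, A, T and H levels. The function name `parse_cath_code`, the label table `CATH_CLASSES`, and the example code `3.40.50.300` are illustrative assumptions; the class numbering 1 to 4 follows the convention commonly seen on the CATH website and is not stated in this unit.

```python
# Assumed labels for the Class (C) level; the numbering is an assumption
# based on common CATH usage, not something defined in this unit.
CATH_CLASSES = {
    1: "mainly alpha",
    2: "mainly beta",
    3: "mixed alpha and beta",
    4: "little secondary structure",
}


def parse_cath_code(code):
    """Split a dotted CATH code such as '3.40.50.300' into its four levels."""
    parts = code.strip().split(".")
    if len(parts) != 4 or not all(p.isdigit() for p in parts):
        raise ValueError(f"expected four dot-separated integers, got {code!r}")
    c, a, t, h = (int(p) for p in parts)
    return {
        "class": str(c),
        "architecture": f"{c}.{a}",
        "topology": f"{c}.{a}.{t}",
        "homologous_superfamily": f"{c}.{a}.{t}.{h}",
        "class_label": CATH_CLASSES.get(c, "other"),
    }


if __name__ == "__main__":
    # Hypothetical example code used only to show the hierarchy.
    for level, value in parse_cath_code("3.40.50.300").items():
        print(f"{level:>24}: {value}")
```

Each level's identifier is a prefix of the next, which mirrors how every homologous superfamily sits inside exactly one topology, architecture and class.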
SAQ 2
Define the following terms:
i) PDB
ii) RCSB
iii) SCOP

2.3 CLASSIFICATION OF BIOLOGICAL DATABASES BASED ON NATURE OF DATA

So far, you have learned about biological databases and their classification based on source. Now you will learn how to classify biological databases based on the nature of their data. There are currently five types of databases.

1. Sequence databases: These databases consist of DNA, RNA, and protein sequences. You can access a gene or protein sequence by searching with its name or unique ID in the respective database. Examples: EMBL (European Molecular Biology Laboratory), NCBI (National Center for Biotechnology Information).

2. Structural databases: These are specialized databases of protein and DNA structures derived from X-ray or NMR experiments or from theoretical models. Some structural databases hold crystal structures of chemicals. Examples: Protein Data Bank (PDB), Cambridge Crystallographic Data Centre (CCDC).

3. Literature databases: These databases are very important for the development and advancement of science and technology as well as other disciplines. They help researchers, academicians, and scientists search for information and download it in electronic form such as HTML (HyperText Markup Language), PDF (Portable Document Format), JPEG (Joint Photographic Experts Group), and other formats. Examples: PubMed, Medline, National Digital Library.

4. Gene expression databases: These databases help in understanding gene function, such as the up-regulation or down-regulation of cellular activities. The Gene Expression Database (GXD) is a community resource for gene expression information from the laboratory mouse. GXD stores and integrates different types of expression data and makes these data freely available in formats appropriate for comprehensive analysis.
This database helps relate the expression and control of one gene to other genes through various systems biology software. Nowadays scientists are working on gene molecular networks.

5. Metabolic pathway databases: These are curated databases of experimentally elucidated metabolic pathways from all domains of life, and they are well maintained and regularly updated. As you know from carbohydrate metabolism (BBCCT-109), metabolic pathways involve many enzymatic reactions to reach the final product. The main characteristics of metabolic pathway databases are as follows:
1) Online encyclopedia of metabolism
2) Predict metabolic pathways in sequenced genomes
3) Support metabolic engineering via enzyme database
4) Metabolite database aids metabolomics research
Examples: KEGG (Kyoto Encyclopedia of Genes and Genomes), MetaCyc, BioCyc.

2.4 SMALL MOLECULAR DATABASES

In the previous sections, you studied various biological databases such as NCBI, GenBank, Swiss-Prot, EMBL, EBI, DDBJ, CATH, and SCOP. All of these relate to proteins and nucleic acids, which are molecules of high molecular weight, so they are called macromolecular databases. Low molecular weight molecules are stored in and retrieved from databases known as small molecular databases. Most of these molecules are organic molecules such as drugs, antibiotics, vaccines, peptides, elements, and compounds. We will now discuss small molecular databases in detail.

2.4.1 PubChem

The PubChem database is a primary source for various chemicals, drugs, and derivatives. It is a freely accessible chemical information resource and one of the largest in the world. We can search for chemicals by molecular formula, name, structure, and other identifiers.
Further, one can find chemical and physical properties, safety and toxicity information, biological activities, literature citations, patents, and more. New chemicals and substances are added regularly as new information becomes available from the literature or from experimental results. PubChem is very useful for finding vendor-supplied chemicals or new chemicals. Many scientists screen molecules from the PubChem database for the treatment of various diseases such as atherosclerosis and cardiovascular diseases. Anyone can submit scientific data to this database and become a provider. The database is useful for scientists, students, and the general public. Each month, its website and programmatic services provide data about compounds to several million users worldwide. (https://pubchemdocs.ncbi.nlm.nih.gov/)

2.4.2 Drug Bank

DrugBank is a crucial database of existing drugs approved by the Food and Drug Administration to treat various diseases. It is also one of the largest drug databases in the world. DrugBank is a curated pharmaceutical knowledge base, with products commercially available for precision medicine, telehealth, and drug discovery. It provides important drug information in a structured, unified resource. DrugBank Online is a free-to-access website that provides highly detailed information across multiple topics including pharmacology, chemical structures, targets, metabolism, and toxicology. Because the data is integrated, you can search by text, gene sequence, chemical structure, and more. Anyone can download a comprehensive dataset, free for academic and non-commercial research. (Visit https://www.drugbank.com/ to know more about DrugBank.)

2.4.3 ZINC Database

It is a free database of commercially available compounds for virtual screening.
ZINC contains over 230 million purchasable compounds in ready-to-dock, 3D formats. ZINC also contains over 750 million purchasable compounds that anyone can search for analogs in a short span of time. ZINC is maintained by the Irwin and Shoichet Laboratories in the Department of Pharmaceutical Chemistry at the University of California, San Francisco (UCSF). This database is used for drug discovery by researchers, trainers, scientists, biotech companies, research organizations, and university scholars. (Visit https://zinc.docking.org/ to know more about the ZINC database.)

2.4.4 Cambridge Structure Database (CSD)

The CSD (also written Cambridge Structural Database) is a repository of organic molecules and metal-organic crystal structures. Containing over one million structures from X-ray and neutron diffraction analyses, this unique database of accurate 3D structures has become an essential resource for scientists around the world. Newly added entries are checked automatically, and the chemical information is then verified by in-house scientific editors before it is released online to the public. Each chemical structure is enhanced so that it can be visualized and downloaded in good quality and its physical properties can be understood. This knowledge has been applied across academia and industry in pursuit of new drugs, novel materials and a greater understanding of chemical and crystallographic phenomena. (Visit https://www.ccdc.cam.ac.uk/solutions/csdcore/components/csd/ to know more about CSD.)

You will learn more about how to retrieve data from these databases while performing exercise number 3 of this course.

SAQ 3
Do as directed:
i) Write a short note on chemical databases used in drug design and drug discovery.
ii) Define the term curated database. Enlist a few chemical databases developed using the curation method.

2.5 STRUCTURE VIEWING TOOLS AND FILE FORMATS

Molecular structures, whether small organic molecules or biological molecules such as proteins, DNA, RNA, lipids and carbohydrates, can be visualized with specific software. Most molecular visualization software is used not only for viewing but also for modifying structures and for calculating bond lengths, angles, energy, rotatable bonds, molecular weight, and various other parameters. These parameters are an important part of molecular visualization. In addition, a few more advanced programs help us calculate the binding energy between a drug and its receptor and the molecular stability in various solvents and at various pH values, temperatures, etc. Here we discuss the basic software used for the visualization of biomolecules.

Various file formats are available for viewing molecules in three-dimensional space. In each of them, the position of every atom is defined relative to the origin along the X, Y, and Z axes. The format most commonly used in basic software is the pdb (Protein Data Bank) format. It has a standard layout, shown in the example below. While performing exercises 7 and 8 you will learn more about these tools and file formats.

Examples of PDB Format

Glucagon is a small protein of 29 amino acids in a single chain. The first residue is the amino-terminal amino acid, histidine, which is followed by a serine residue and then glutamine.
The coordinate information (entry 1gcn) starts with:

ATOM      1  N   HIS A   1      49.668  24.248  10.436  1.00 25.00           N
ATOM      2  CA  HIS A   1      50.197  25.578  10.784  1.00 16.00           C
ATOM      3  C   HIS A   1      49.169  26.701  10.917  1.00 16.00           C
ATOM      4  O   HIS A   1      48.241  26.524  11.749  1.00 16.00           O
ATOM      5  CB  HIS A   1      51.312  26.048   9.843  1.00 16.00           C
ATOM      6  CG  HIS A   1      50.958  26.068   8.340  1.00 16.00           C
ATOM      7  ND1 HIS A   1      49.636  26.144   7.860  1.00 16.00           N
ATOM      8  CD2 HIS A   1      51.797  26.043   7.286  1.00 16.00           C
ATOM      9  CE1 HIS A   1      49.691  26.152   6.454  1.00 17.00           C
ATOM     10  NE2 HIS A   1      51.046  26.090   6.098  1.00 17.00           N
ATOM     11  N   SER A   2      49.788  27.850  10.784  1.00 16.00           N
ATOM     12  CA  SER A   2      49.138  29.147  10.620  1.00 15.00           C
ATOM     13  C   SER A   2      47.713  29.006  10.110  1.00 15.00           C
ATOM     14  O   SER A   2      46.740  29.251  10.864  1.00 15.00           O
ATOM     15  CB  SER A   2      49.875  29.930   9.569  1.00 16.00           C
ATOM     16  OG  SER A   2      49.145  31.057   9.176  1.00 19.00           O
ATOM     17  N   GLN A   3      47.620  28.367   8.973  1.00 15.00           N
ATOM     18  CA  GLN A   3      46.287  28.193   8.308  1.00 14.00           C
ATOM     19  C   GLN A   3      45.406  27.172   8.963  1.00 14.00           C

Notice that each line, or record, begins with the record type ATOM. The atom serial number is the next item in each record. (Source: https://www.cgl.ucsf.edu/chimera/docs/UsersGuide/tutorials/pdbintro.html)

There are plenty of tools that display structures, from small molecules to higher-level molecular structures, in 3D space. A few of them are listed below.

1. RasMol

Most of the protein structure database tools available today are well equipped with graphical visualization tools. The tool commonly used for academic and research purposes is RasMol. This is a molecular graphics program intended to visualize proteins, nucleic acids and small molecules that are available in a 3-D structure format. In order to display a molecule, RasMol requires an atomic coordinate file that specifies the position of every atom in the molecule through its 3-D Cartesian coordinates (Fig. 2.5).
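Because a coordinate file gives every atom a Cartesian position, quantities such as bond lengths follow directly from it. The sketch below is a simplified illustration, not a full PDB reader: it parses two of the 1gcn ATOM records quoted earlier and computes the distance between the backbone N and CA atoms of the first histidine. The function names `parse_atom_record` and `distance` are hypothetical, and the parser assumes the twelve whitespace-separated fields seen in those records. Real PDB files are fixed-column, so whitespace splitting can fail when fields run together or are blank.

```python
import math


def parse_atom_record(line):
    """Parse one simple ATOM record into a dictionary.

    Assumes 12 whitespace-separated fields: record type, serial number,
    atom name, residue name, chain, residue number, x, y, z, occupancy,
    temperature factor, element (as in the 1gcn excerpt).
    """
    fields = line.split()
    if len(fields) != 12 or fields[0] != "ATOM":
        raise ValueError(f"not a simple 12-field ATOM record: {line!r}")
    return {
        "serial": int(fields[1]),
        "name": fields[2],
        "residue": fields[3],
        "chain": fields[4],
        "residue_number": int(fields[5]),
        "xyz": tuple(float(v) for v in fields[6:9]),
        "occupancy": float(fields[9]),
        "b_factor": float(fields[10]),
        "element": fields[11],
    }


def distance(atom_a, atom_b):
    """Euclidean distance between two parsed atoms, in angstroms."""
    return math.dist(atom_a["xyz"], atom_b["xyz"])


# The first two records of entry 1gcn, as quoted in the text.
n_atom = parse_atom_record("ATOM 1 N HIS A 1 49.668 24.248 10.436 1.00 25.00 N")
ca_atom = parse_atom_record("ATOM 2 CA HIS A 1 50.197 25.578 10.784 1.00 16.00 C")

print(f"N-CA distance: {distance(n_atom, ca_atom):.2f} angstrom")  # prints 1.47
```

Visualization programs perform the same kind of calculation internally when they report bond lengths and angles for a selected pair or triple of atoms.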
RasMol accepts this coordinate file in a variety of formats, including the Protein Data Bank (PDB) format. The visualization tool gives the user a choice of color schemes and molecular representations: wireframe, cylinder (Dreiding) stick bonds, alpha-carbon trace, space-filling (CPK) spheres, macromolecular ribbons (either smooth-shaded solid ribbons or parallel strands), hydrogen bonding and dot surface. Additional features such as text labeling of selected atoms, different color schemes for different parts of the molecule, zoom, and rotation have made it one of the most popular visualization tools. This standalone software can be downloaded from the RasMol website.
Website: http://www.openrasmol.org/

Fig. 2.5: RasMol software with crystals of crambin (PDB ID: 1CRN).

2. Chime

Chime and Protein Explorer are derivatives of RasMol that allow visualization of structures inside web browsers, while RasMol runs independently outside a web browser. Hence, Chime should be used only online, when connected to the Internet. Another feature of Chime is that only certain molecules allowed by the company can be seen, unlike RasMol, where any protein molecule with atomic coordinates can be seen. You can access Chime at: www.umass.edu/microbio/chime
Now also used widely. This can be downloaded on personal computers to view molecules like proteins, DNA and RNA.

3. MolMol

MolMol stands for Molecule analysis and Molecule display. This is also free software, with many features that are not found in RasMol and Chime. MolMol is a molecular graphics program for the display, analysis and manipulation of three-dimensional structures of biological macromolecules, with special emphasis on nuclear magnetic resonance (NMR) solution structures of proteins and nucleic acids. MolMol can be reached at: www.mol.biol.ethz.ch/wuthrich/software/molmol

4. PyMOL

PyMOL is a user-friendly and popular molecular visualization tool built on an open-source foundation, maintained and distributed by Schrödinger. It is widely used in structural bioinformatics, biophysics, computer-aided drug design, and other fields of biology. It offers an advanced graphical user interface for molecular visualization and is used to view a protein binding with a ligand or drug in 3D space. The software allows the viewer to label atoms, bonds, distances, angles, residues, residue numbers, chains, and types of bond interactions (Fig. 2.6). Among its many features, one can model proteins and DNA with secondary structural information. The user can view proteins in different representations such as ball-and-stick, wire, and molecular surface with atomic energy distribution. This tool is freely available to students and academic institutions under a legal agreement or registration. The tutorial and software download are available at https://pymol.org/2/

Fig. 2.6: Visualization of a ligand shown as sticks within a protein cavity using PyMOL.

5. SPDBV

Swiss-PdbViewer is an application that provides a user-friendly interface for analyzing several proteins at the same time. The proteins can be superimposed in order to deduce structural alignments and compare their active sites or any other relevant parts. Amino acid mutations, H-bonds, angles and distances between atoms can be viewed easily. The tool has an intuitive graphic and menu interface (Fig. 2.7).

Swiss-PdbViewer was developed in 1994 by Nicolas Guex. It is closely associated with SWISS-MODEL, an automated homology modeling server developed within the Swiss Institute of Bioinformatics (SIB) at the Structural Bioinformatics Group of the Biozentrum in Basel. Working with SWISS-MODEL and Swiss-PdbViewer together greatly reduces the amount of time required to generate models.
It is possible to thread a protein primary sequence onto a 3D template and get immediate feedback on how well the threaded protein will be accepted by the reference structure before submitting a request to build missing loops and refine sidechain packing.

Fig. 2.7: Protein structure visualization with SPDBV software.

2.6 SUMMARY

Biological databases are used to store experimental data in various formats that can be accessed through the internet.

In biological databases, a unique number or ID is assigned to each entry as a primary key to ensure that the data is non-redundant.

All biological databases are well structured so that data can be retrieved across the globe with ease in a short span of time.

There are two types of biological databases:
1. Primary databases: data collected from laboratory experiments. Examples: GenBank, EMBL, DDBJ, NDB, TrEMBL, PIR, Swiss-Prot and PDB.
2. Secondary databases: data derived from the results of analyzing the primary data in primary databases. Examples: InterPro (protein families, motifs, and domains), UniProt, Ensembl, BRENDA.

Macromolecules like proteins, DNA and RNA are stored in separate databases. Similarly, small molecules have their own databases. Examples of small molecular databases are PubChem, Drug Bank, ZINC, and CSD (Cambridge Structure Database).

SCOP, CATH and PDBSUM serve as secondary structural databases. SCOP has five classes: 1. All alpha, 2. All beta, 3. Alpha or beta (α/β), 4. Alpha and beta (α+β), 5. Multi-domain folds.

Based on the nature of the data, there are five types of biological databases:
1. Sequence databases: EMBL, NCBI, GenBank.
2. Structural databases: RCSB-PDB and CCDC.
3. Literature databases: PubMed, Medline, NDL.
4. Gene expression databases: GXD; Gene Expression Omnibus (GEO), a repository of high-throughput gene expression data from hybridization arrays, chips, and microarrays.
5. Metabolic pathway databases: KEGG (Kyoto Encyclopedia of Genes and Genomes), MetaCyc, BioCyc.

To view a protein or any other molecule in virtual space, the X, Y, and Z coordinates of its atoms relative to the origin are required.

The basic, free molecular visualization software is RasMol. PyMOL provides a well-developed graphical user interface with good rendering options and advanced features.

2.7 TERMINAL QUESTIONS

1. What are the basic principles and characteristics of an ideal biological database?
2. Write a note on primary databases and secondary databases with special emphasis on biological databases.
3. Describe nucleotide databases.
4. Explain secondary structural databases.
5. Write in detail about the classification of biological databases based on nature of data.

2.8 ANSWERS

Self Assessment Questions

1. i) Entrez
   ii) DNA
   iii) European Bioinformatics Institute
   iv) National Center for Biotechnology Information
   v) PIR
2. i) Protein Data Bank
   ii) Research Collaboratory for Structural Bioinformatics
   iii) Structural Classification of Proteins
3. i) Refer Sections 2.4.1 to 2.4.3.
   ii) Refer Metabolic Pathway Databases under Section 2.3.

Terminal Questions

1. Biological databases have the following principles or characteristics:
   i) Biological data can be stored in electronic form in various formats.
   ii) Each entry in the database is assigned a unique number or ID, which ensures non-redundancy.
   iii) Biological data sharing: the data can be downloaded from the websites of the biological databases or through FTP (File Transfer Protocol).
   iv) Biological databases are well structured and searchable, and the information is updated periodically in line with publications and innovations in the scientific world.
   v) Research publications and books refer to the data by its unique ID. This is called a cross-reference.

2.
| | Primary database | Secondary database |
|---|---|---|
| Synonyms | Archival database | Curated database; knowledgebase |
| Source of data | Direct submission of experimentally derived data from researchers | Results of analysis, literature research and interpretation, often of data in primary databases |
| Examples | ENA, GenBank and DDBJ (nucleotide sequence); ArrayExpress and GEO (functional genomics data); Protein Data Bank (PDB; coordinates of three-dimensional macromolecular structures) | InterPro (protein families, motifs and domains); UniProt Knowledgebase (sequence and functional information on proteins); Ensembl (variation, function, regulation and more layered onto whole genome sequences) |

3. Refer Section 2.2.2.
   i) Importance of nucleotide databases.
   ii) Enlist nucleotide databases and describe them in detail.
4. Refer Section 2.2.4.
   i) Write about SCOP, CATH, and PDBSUM in detail.
   ii) Classification of SCOP.
   iii) A few points on the CATH database.
5. Refer Section 2.3.

EXERCISE 4: DATABASES: NCBI, PDB, SCOP, PUBMED, GENBANK, UNIPROT

Structure
4.1 Introduction
    Expected Learning Outcomes
4.2 Databases and Retrieval
4.3 Summary
4.4 Lab Exercises

4.1 INTRODUCTION

In this exercise, you will learn about biological databases that are widely used in the field of bioinformatics. Databases are systematic collections of logically related data. Software packages are used for defining and managing databases. Because of the exponential growth in biological data, publicly accessible databases now hold a great deal of information on biomolecules. Data is no longer published only in the conventional way but is submitted directly to databases. Generally, biological databases can be classified into sequence databases, structural databases, genome databases, proteome databases, specialized databases, etc.
Expected Learning Outcomes

After performing this exercise you shall be able to:
  browse the required information from the databases;
  explain the importance of databases;
  differentiate between biological databases; and
  enlist the applications of various databases for academic and scientific research.

4.2 DATABASES AND RETRIEVAL

In this section, we are going to learn about the various databases that you studied in Unit 2 of this course. The main focus will be on how to access the information available in these databases using existing online tools. You are aware that different databases are available for individual biomolecules like proteins, nucleic acids, and small molecules. Let us explore them one by one.

National Centre for Biotechnology Information (NCBI)

The National Centre for Biotechnology Information (NCBI) is a public resource for analysing molecular, biomedical, and genomic data and for conducting research in computational biology. NCBI maintains over 40 integrated databases for the medical and scientific sectors, as well as the general public. The GenBank nucleotide database is maintained by the NCBI, which is part of the National Institutes of Health (NIH), a federal agency of the US government. Access the NCBI database through the weblink https://www.ncbi.nlm.nih.gov/ and use its popular resources and tools to retrieve information (Fig. 4.1).

Fig. 4.1: Screenshot showing NCBI database.

You can browse NCBI to get to know its popular resources. You will learn how to use NCBI and GenBank to access nucleotide and protein sequences in exercises 5 and 7.

PUBMED

PubMed is a bibliographic database and one of the popular NCBI resources. It is a free resource supporting the search and retrieval of biomedical and life sciences literature with the objective of improving health globally and personally.
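A keyword search of the kind performed in the PubMed search box can also be written as a URL. NCBI offers a programmatic interface called the Entrez E-utilities, which is not covered in this exercise; the sketch below only builds the search URL and does not contact the server. The helper name `esearch_url` and the choice of `retmax` value are illustrative assumptions.

```python
from urllib.parse import urlencode

# Base address of the NCBI Entrez E-utilities.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(database, term, retmax=20):
    """Build an E-utilities ESearch URL for a text query (no request is sent)."""
    params = {"db": database, "term": term, "retmax": retmax}
    return f"{EUTILS_BASE}/esearch.fcgi?{urlencode(params)}"


# Same query text as the example used in the PubMed procedure.
print(esearch_url("pubmed", "corona virus", retmax=5))
```

Opening the printed URL in a browser should return a list of matching record IDs rather than the formatted results page, which is why the web interface remains the easier route for the steps in this exercise.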
As of 2021, PubMed comprises more than 32 million citations (abstracts) for biomedical literature from MEDLINE, life science journals, and online books. Citations do not include full-text journal articles but may include links to full-text content from PubMed Central (PMC) and publisher web sites. PubMed was developed and is maintained by the National Center for Biotechnology Information (NCBI), at the U.S. National Library of Medicine (NLM), located at the National Institutes of Health (NIH).

Procedure
Step 1: Open PubMed in a browser using the following URL: http://www.ncbi.nlm.nih.gov/pubmed/ (Fig. 4.2).
Fig. 4.2: Screenshot showing PubMed home page.
Step 2: Type your text query in the search panel (for example, corona virus).
Step 3: Select the appropriate abstract from the PubMed summary web page (Fig. 4.3).
Fig. 4.3: Screenshot showing PubMed search results.
Step 4: Copy and save the relevant bibliography search for further use.

GenBank
The GenBank nucleotide database is maintained by the NCBI. GenBank is part of the International Nucleotide Sequence Database Collaboration, which comprises the DNA DataBank of Japan (DDBJ), the European Nucleotide Archive (ENA), and GenBank at NCBI. These three organizations exchange data on a daily basis. The GenBank database is intended to provide and encourage access to the most up-to-date and complete DNA sequence information within the scientific community (https://www.ncbi.nlm.nih.gov/genbank/) (Fig. 4.4).
Fig. 4.4: Screenshot showing GenBank.
So far, you have learned about the databases NCBI, PubMed and GenBank, and how to access citations/abstracts from PubMed. To become more familiar with the procedure, repeat the exercises with different search terms such as author names, keywords like antioxidants, curcumin, cholesterol etc.
and text searches. In the next subsection you will learn about the Protein Data Bank, which is widely used for 3-D protein structure-related information. Further, you will learn to use GenBank to access nucleotide sequences in exercises 5 and 7.

PDB
The Protein Data Bank (PDB) is a repository for the 3-D structural data of large biological molecules, such as proteins and nucleic acids. The data, typically obtained by X-ray crystallography or NMR spectroscopy and submitted by biologists and biochemists from around the world, are freely accessible on the Internet via the websites of its member organisations, such as the Research Collaboratory for Structural Bioinformatics (RCSB). The PDB database is intended to provide access to 3-D structural information. To access the PDB database, follow the web link https://www.rcsb.org/ and retrieve structural information from PDB (Fig. 4.5). You can access the PDB to understand structural databases; further, you will learn to use the PDB to access and download 3-D structures of proteins and DNA in exercise 6.
Fig. 4.5: Screenshot showing PDB homepage.

SCOP
Structural Classification of Proteins (SCOP) is maintained at the MRC (Medical Research Council) Laboratory of Molecular Biology and the Centre for Protein Engineering. It describes structural and evolutionary relationships between proteins. Proteins are classified in a hierarchical fashion:
Family: proteins clustered into families with a clear evolutionary relationship.
Superfamily: proteins whose structural and functional characteristics suggest a common evolutionary origin.
Fold: proteins share a common fold if they have the same major secondary structures in the same arrangement.

Procedure:
Step 1: Open SCOP from the following URL: https://scop.mrc-lmb.cam.ac.uk/ (Fig. 4.6).
Fig. 4.6: Screenshot showing SCOP homepage.
Step 2: Type the protein name or another relevant keyword in the search text box (Fig. 4.7).
Fig.
4.7: Screenshot showing keyword in search engine of SCOP.
Step 3: On pressing the search button, the result page (summary) is displayed. To know more about a specific protein, click on it (Fig. 4.8).
Fig. 4.8: Screenshot showing SCOP search results.
Step 4: Choose the appropriate link to display the functional information (Fig. 4.9).
Fig. 4.9: Appropriate links showing family and superfamily have been encircled.

UNIPROT
The Universal Protein Resource (UniProt) is a comprehensive resource for protein sequence and annotation data. The UniProt databases are the UniProt Knowledgebase (UniProtKB), the UniProt Reference Clusters (UniRef), and the UniProt Archive (UniParc). The UniProt consortium and its host institutions, the European Bioinformatics Institute (EBI), the Swiss Institute of Bioinformatics (SIB) and the Protein Information Resource (PIR), are committed to the long-term preservation of the UniProt databases.

Procedure:
Step 1: Open the UniProt website from the following URL: https://www.uniprot.org/ (Fig. 4.10).
Fig. 4.10: Screenshot showing UniProt homepage.
Step 2: Type the protein name or another relevant keyword in the search text box (Fig. 4.11).
Fig. 4.11: Screenshot showing UniProt search column.
Step 3: On pressing the search button, the result page (summary) is displayed (Fig. 4.12).
Fig. 4.12: Screenshot showing search results on UniProt.
Step 4: Choose the first sequence by double-clicking the accession number, then go to the display button and select FASTA format to retrieve the sequence (Fig. 4.13).
Fig. 4.13: Screenshot showing display options for searched protein.
Step 5: Copy and save the protein sequence for further analysis (Fig. 4.14).
Fig. 4.14: Screenshot showing FASTA sequence of the protein.
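Once a sequence has been saved in FASTA format (Step 5), it can also be read by a short program. The following is a minimal sketch in Python, not an official UniProt tool. The function names `parse_fasta` and `split_uniprot_header` are hypothetical, and the record in `EXAMPLE` is made up for illustration (it is not a real UniProt entry). The sketch assumes the usual UniProtKB header layout, `>db|accession|entry_name description`.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    records = []
    header = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header = line[1:]
            chunks = []
        elif header is not None:
            # Sequence lines are joined into one continuous string.
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def split_uniprot_header(header):
    """Split 'db|accession|entry_name description' into its parts (assumed layout)."""
    identifier, _, description = header.partition(" ")
    db, accession, entry_name = identifier.split("|")
    return {
        "db": db,
        "accession": accession,
        "entry_name": entry_name,
        "description": description,
    }


# Made-up record for illustration only; not a real UniProt entry.
EXAMPLE = """>sp|X00000|DEMO_HUMAN Hypothetical demo protein OS=Homo sapiens
MKTAYIAKQR
QISFVKSHFS
"""

if __name__ == "__main__":
    for header, sequence in parse_fasta(EXAMPLE):
        info = split_uniprot_header(header)
        print(info["accession"], info["entry_name"], len(sequence))
```

To use it on a downloaded file, read the file contents into a string and pass that string to `parse_fasta`.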
4.3 SUMMARY
Databases are systematic collections of theoretically related data. Biological databases are generally classified into sequence databases, structural databases, genome databases, proteome databases, specialized databases, etc.
Biological databases are created and maintained so that biological data can be studied and used appropriately.
NCBI is a source of public biomedical databases; it maintains over 40 integrated databases for the medical and scientific sectors, as well as the general public.
The GenBank nucleotide database is maintained by the NCBI and provides access to up-to-date and complete DNA sequence information.
PubMed is a bibliographic database among the popular NCBI resources, comprising more than 32 million citations (abstracts) for biomedical literature.
PDB is a repository for the 3-D structural data of large biological molecules, such as proteins and nucleic acids.
SCOP describes structural and evolutionary relationships between proteins and classifies them hierarchically by family, superfamily and fold.
UniProt is a comprehensive resource for protein sequence and annotation data.

4.4 LAB EXERCISES
1. Retrieve the abstract with PMID: 32643536 from the PubMed bibliographic database and write the title of the abstract and the authors' names.
2. Access nucleotide sequence FJ436056 from the GenBank database in GenBank format and FASTA format, and write the title, molecule type, and number of base pairs.
3. Download the protein 3-D structure with PDB ID 7JMO from the PDB (RCSB) database and view the structure in a viewing tool.
4. Open the SCOP database, run any keyword or text search, and write the functional aspect, name of the protein, family, class and domain.
5. Access a globulin protein sequence from the UniProt database in FASTA format and write the title, UniProtKB ID, and organism name.
Exercise 5 RETRIEVAL OF PROTEIN AND GENE SEQUENCES FROM NCBI

Structure
5.1 Introduction
Expected Learning Outcomes
5.2 Procedure
5.3 Summary
5.4 Lab Exercises

5.1 INTRODUCTION
In the previous exercise, you learned about different databases. In this exercise, you will study the retrieval of protein and gene sequences from the NCBI database. We studied the NCBI database theoretically in Unit-2 and learned about different NCBI resources such as GenBank and GenPept in Exercise 4. Here, we shall access sequences from GenBank and GenPept of NCBI, which will be used in various sequence analysis techniques.
Protein sequences are the fundamental determinants of biological structure and function. The NCBI Protein database is a collection of protein sequences from different sources, including GenPept translations from annotated coding regions in GenBank, Reference Sequence (RefSeq) and Third Party Annotation (TPA), as well as records from Swiss-Prot, Protein Information Resource (PIR), Protein Research Foundation (PRF) and Protein Data Bank (PDB). The Nucleotide database is a collection of nucleotide (gene) sequences from different sources, which include GenBank, RefSeq, TPA, and PDB. Genome, gene, and transcript sequence data provide the foundation for biomedical research and discovery.
The Gene database can be searched by typing a term, preferably a gene name or a disease name, in the query box; this displays the list of genes associated with the search. Users can also search records with the Gene ID, which is a unique identifier issued by NCBI.
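Records identified by an accession number or ID can also be requested by a program instead of through the web page. The sketch below assumes the NCBI E-utilities `efetch` service and shows how a request URL for a sequence record could be built. The helper names `build_efetch_url` and `fetch_text` are hypothetical, and the accession FJ436056 is the one used in the lab exercises of Exercise 4. The `db` values `nuccore` (nucleotide) and `protein`, and the `rettype` values `fasta`, `gb` (GenBank) and `gp` (GenPept), are E-utilities parameter values.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Base address of the NCBI E-utilities efetch service.
EUTILS_EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def build_efetch_url(db, record_id, rettype="fasta"):
    """Build an efetch URL for one record.

    db        -- database name, e.g. "nuccore" or "protein"
    record_id -- accession number or numeric ID of the record
    rettype   -- "fasta", "gb" (GenBank) or "gp" (GenPept)
    """
    params = {"db": db, "id": record_id, "rettype": rettype, "retmode": "text"}
    return EUTILS_EFETCH + "?" + urlencode(params)


def fetch_text(url, timeout=30):
    """Download the record as text (needs an Internet connection)."""
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")


if __name__ == "__main__":
    # Only the URL is printed here; call fetch_text(url) to download the record.
    url = build_efetch_url("nuccore", "FJ436056", rettype="fasta")
    print(url)
```

Building the URL separately from downloading keeps the example usable offline; the same URL can be pasted into a browser to check the result against the web interface.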
Expected Learning Outcomes
After performing this exercise you shall be able to:
use the NCBI-GenPept database and retrieve protein sequences;
explore and retrieve gene information from the NCBI Gene database; and
explain the importance of NCBI in sequence retrieval.

5.2 PROCEDURE
Step 1: Access the home page of NCBI from the following web link: https://www.ncbi.nlm.nih.gov/ (Fig. 5.1).
Fig. 5.1: Screenshot showing NCBI home page.
Step 2: Click on the scrolling button and select the Gene database (Fig. 5.2).
Fig. 5.2: Screenshot showing dropdown menu on NCBI.
Step 3: Type the relevant text or keyword in the search box (for example, gene name, species name etc.) (Fig. 5.3).
Fig. 5.3: Screenshot showing search box on NCBI.
Step 4: On pressing the search button, the result page (summary page) is displayed (Fig. 5.4).
Fig. 5.4: Screenshot showing summary page (results) on NCBI.
Step 5: Choose the desired gene sequence by double-clicking its name or ID, or by check-marking the appropriate sequence; go to the display button and select GenBank or FASTA format to retrieve the sequence (Fig. 5.5). Scroll down and click on the required file format (FASTA or GenBank format).
Fig. 5.5: Screenshot showing how to obtain FASTA or GenBank sequence.
Step 6: Copy and save the required gene sequence for further analysis (Fig. 5.6).
Fig. 5.6: Screenshot showing FASTA sequence.

5.3 SUMMARY
NCBI hosts systematic collections of related biological data, such as sequence databases, genome databases, and specialized databases.
NCBI is a source of public biomedical databases.
The GenBank nucleotide database is maintained by the NCBI and provides complete DNA sequence information.
The GenPept protein sequence database is also maintained by NCBI.
Both gene and protein sequences can be retrieved from the NCBI database for further analysis.

5.4 LAB EXERCISES
1.
Access the nucleocapsid phosphoprotein gene sequence from the NCBI database in GenBank format and write the title, ID, and organism name.
2. Access an envelope protein sequence from the NCBI database in GenBank format and write the title, ID, and organism name.
3. Access a Covid-19 protein sequence from the NCBI database in FASTA format and write the title, ID, and organism name.

Exercise 6 ACCESSING PROTEIN STRUCTURE FROM PDB

Structure
6.1 Introduction
Expected Learning Outcomes
6.2 Procedure
6.3 Summary
6.4 Lab Exercises

6.1 INTRODUCTION
In the previous exercise, you learned how to retrieve protein and gene sequences from the NCBI database. Now, in this exercise, you shall explore the steps involved in downloading protein structures from the PDB database. 3-D structures of proteins from the Protein Data Bank (PDB) are used to understand structural information such as the binding site of a protein or DNA, the active site of enzymes, DNA-protein interactions, and protein-protein interactions, and this has applications in drug design. You learned about the PDB database in Unit-2 and accessed the PDB website in Exercise 4.
Protein structure helps us understand how a protein works, and that information can be used to inhibit, regulate, or modify protein function, and to predict which molecules bind to that protein. It also helps us to understand various biological interactions, assist drug discovery, or even design novel proteins as therapeutic molecules. Likewise, in order to understand the biological function of DNA, we need to study its molecular structure. The PDB is a repository for the 3-D structural data of large biological molecules, such as proteins and nucleic acids. You have learned about PDB in Exercise 4; now, in this exercise, we shall learn how to download a protein structure from PDB.
Expected Learning Outcomes
After performing this exercise you shall be able to:
describe how to access structures of proteins from PDB;
describe how to access structural data of a protein using the PDB database;
explain the PDB file format; and
download and save the 3-D structure of a protein in PDB format.

6.2 PROCEDURE
Step 1: Open the PDB from the following URL: https://www.rcsb.org/ (Fig. 6.1).
Fig. 6.1: Screenshot showing PDB home page.
Step 2: Enter the query (a PDB ID, molecule name or author name) in the text box provided. Click on the search button (Fig. 6.2).
Fig. 6.2: Screenshot showing search column on PDB homepage.
Step 3: From the summary page, click on PDB ID 7LYJ and download the macromolecular 3-D structure in PDB format (Fig. 6.3 and 6.4).
Fig. 6.3: Screenshot showing target protein (7LYJ).
Fig. 6.4: Screenshot showing how to download PDB format.
Step 4: Open the structure file in any one of the visualizing tools (PyMOL, RasMol or Swiss-PdbViewer) to visualize it. You will learn about these tools in exercise number 8 of this course.

6.3 SUMMARY
PDB is the database from which we can access the 3-D structures of proteins; in this exercise it was accessed through the RCSB website.
In this exercise you have practised downloading a protein structure in PDB format.
You have acquired the skills to access PDB pages and learned how to search for the desired protein.
Files in PDB format can be viewed using visualising tools.

6.4 LAB EXERCISES
1. Access the 7LMF protein structure from the PDB database, download it in PDB format, also save it in PDB flat file (text) format, and comment on a few points.
2. Download S. cerevisiae CMG-Pol epsilon-DNA in PDB format; give its PDB ID and source, and comment on a few points.
3. Download the crystal structure of yeast phenylalanine tRNA in PDB format; give its PDB ID and source, and comment on a few points.
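The download step of the procedure in Section 6.2, and the fixed-column layout of the PDB file format, can be illustrated with a short Python sketch. It rests on stated assumptions rather than being a definitive implementation: `pdb_download_url` and `parse_atom_line` are hypothetical helper names, the URL pattern assumes the RCSB file download service at `files.rcsb.org`, and `EXAMPLE_LINE` is a made-up ATOM record, not data from entry 7LYJ. The column positions follow the ATOM record layout of the PDB format (serial number in columns 7-11, atom name in 13-16, residue name in 18-20, chain in 22, residue number in 23-26, and x, y, z coordinates in 31-54).

```python
def pdb_download_url(pdb_id):
    """Return the assumed RCSB download URL for an entry in PDB format."""
    return "https://files.rcsb.org/download/{}.pdb".format(pdb_id.upper())


def parse_atom_line(line):
    """Parse one ATOM/HETATM record using the fixed PDB columns."""
    return {
        "record": line[0:6].strip(),    # columns 1-6: record name
        "serial": int(line[6:11]),      # columns 7-11: atom serial number
        "name": line[12:16].strip(),    # columns 13-16: atom name
        "res_name": line[17:20].strip(),  # columns 18-20: residue name
        "chain": line[21],              # column 22: chain identifier
        "res_seq": int(line[22:26]),    # columns 23-26: residue number
        "x": float(line[30:38]),        # columns 31-38: x coordinate
        "y": float(line[38:46]),        # columns 39-46: y coordinate
        "z": float(line[46:54]),        # columns 47-54: z coordinate
    }


# Made-up ATOM record, assembled piece by piece so the columns are visible.
EXAMPLE_LINE = (
    "ATOM  "      # columns 1-6
    "    1"       # columns 7-11
    " "           # column 12 (blank)
    " N  "        # columns 13-16
    " "           # column 17 (alternate location)
    "MET"         # columns 18-20
    " "           # column 21 (blank)
    "A"           # column 22
    "   1"        # columns 23-26
    "    "        # columns 27-30
    "  11.104"    # columns 31-38
    "  13.207"    # columns 39-46
    "   2.100"    # columns 47-54
    "  1.00"      # columns 55-60 (occupancy)
    " 20.00"      # columns 61-66 (temperature factor)
)

if __name__ == "__main__":
    print(pdb_download_url("7lyj"))
    atom = parse_atom_line(EXAMPLE_LINE)
    print(atom["name"], atom["res_name"], atom["chain"], atom["x"], atom["y"], atom["z"])
```

Opening a downloaded file in a text editor and comparing a few ATOM lines with these column positions is a quick way to become familiar with the flat file format before using a viewing tool.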