IBGP/BMI 730 Introduction to Bioinformatics
Director: Prof. Victor Jin

Basic Molecular Biology
• All living things are made of cells
• Prokaryote, Eukaryote
• Cell signaling
• What is inside the cell: from DNA, to RNA, to proteins

Cells
• Fundamental working units of every living system.
• Every organism is composed of one of two radically different types of cells: prokaryotic cells or eukaryotic cells.
• Prokaryotes and eukaryotes are descended from the same primitive cell.
• All extant prokaryotic and eukaryotic cells are the result of a total of 3.5 billion years of evolution.

Cell Structure
• A cell is the smallest structural unit of an organism that is capable of independent functioning.
• All cells have some common features.

Cell Cycle
• Born, eat, replicate, and die

The Tree of Life
• According to the most recent evidence, there are three main branches to the tree of life.
• Prokaryotes include Archaea (“ancient ones”) and Bacteria.
• Eukaryotes make up the domain Eukarya, which includes plants, animals, fungi and certain algae.
Prokaryotes and Eukaryotes

Prokaryotes                                   Eukaryotes
Single cell                                   Single or multi cell
No nucleus                                    Nucleus
No membrane-bound organelles                  Membrane-bound organelles
One piece of circular DNA                     Chromosomes
No mRNA post-transcriptional modification     Exon/intron splicing

Signaling Pathways: Control Gene Activity
• Instead of having brains, cells make decisions through complex networks of chemical reactions, called pathways:
  – Synthesize new materials
  – Break other materials down for spare parts
  – Signal to eat or die
• An example: cell cycle signaling

Cells: Information and Machinery
• Cells store all the information they need to replicate themselves
  – The human genome is about 3 billion base pairs long
  – Almost every cell in the human body contains the same set of genes
  – But not all genes are used or expressed by those cells
• Machinery:
  – Collect and manufacture components
  – Carry out replication
  – Kick-start its new offspring

Terminology
• Genome: an organism's genetic material
• Gene: a discrete unit of hereditary information located on the chromosomes and consisting of DNA
• Genotype: the genetic makeup of an organism
• Phenotype: the physically expressed traits of an organism
• Nucleic acid: biological molecules (RNA and DNA) that allow organisms to reproduce
• Amino acid: organic molecules that are the building blocks of proteins
• Protein: a large, complex molecule that is an essential part of organisms, participates in every process within cells, and carries out a particular function

Three critical molecules
• DNAs
  – Hold information on how the cell works
• RNAs
  – Act to transfer short pieces of information to different parts of the cell
  – Provide templates to synthesize into protein
• Proteins
  – Form enzymes that send signals to other cells and regulate gene activity
  – Form the body's major components (e.g. hair, skin, etc.)
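The division of labour among the three molecules above can be sketched in a few lines of Python: DNA holds the message, RNA carries a copy of it, and the protein is built from that copy. This is only a minimal sketch under stated assumptions. The function names `transcribe` and `translate` and the example sequence are hypothetical, the input is assumed to be the coding strand written 5' to 3', and the codon table is the standard genetic code (other organisms and organelles use variants).

```python
from itertools import product

# Standard genetic code, laid out with bases in the order T, C, A, G
# ('*' marks a stop codon). Codons are written in the DNA alphabet.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def transcribe(coding_dna: str) -> str:
    """Return the mRNA for a coding-strand DNA sequence (T becomes U)."""
    return coding_dna.upper().replace("T", "U")


def translate(mrna: str) -> str:
    """Translate an mRNA from its first codon until a stop codon."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        codon = mrna[i:i + 3].replace("U", "T")  # table uses DNA letters
        amino_acid = CODON_TABLE[codon]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


# Hypothetical toy gene: start codon, two codons, stop codon.
mrna = transcribe("ATGGCCAAGTAA")
print(mrna)             # AUGGCCAAGUAA
print(translate(mrna))  # MAK
```

The sketch ignores everything a real cell adds on top (promoters, splicing, reading-frame choice); it only shows the two-step information flow.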
Overview of DNA to RNA to Protein
• A gene is expressed in two steps
  – Transcription: RNA synthesis
  – Translation: protein synthesis

DNA: the Genetic Makeup
• Genes are inherited and are expressed
  – genotype (genetic makeup)
  – phenotype (physical expression)
• On the left are the eye phenotypes of the green and black eye genes.

Central Dogmas of Molecular Biology
1) The concept of a gene is historically defined on the basis of genetic inheritance of a phenotype (Mendelian inheritance).
2) The DNA of an organism encodes the genetic information. It is a double-stranded helix with a deoxyribose sugar-phosphate backbone carrying four bases: Adenine (A), Cytosine (C), Guanine (G) and Thymine (T). [Note that only 4 values need be encoded (A, C, G, T), which can be done using 2 bits. But to allow redundant letter combinations (like N, meaning any of the 4 nucleotides), one usually resorts to a 4-bit alphabet.]
3) Each base on one side of the double helix faces its complementary base: A pairs with T, and G pairs with C.
4) The biochemical processes that copy DNA (replication and transcription) always synthesize the new strand from its 5' end towards its 3' end, so sequences are conventionally written and read 5' to 3'.
5) A gene can be located on either the 'plus' strand or the 'minus' strand. Rule 4 imposes the orientation of reading, and rule 3 (complementarity) tells us to complement each base. E.g. if the sequence on the + strand is ACGTGATCGATGCTA, the - strand must be read off by reading the complement of this sequence going 'backwards': TAGCATCGATCACGT.
6) DNA information is copied over to mRNA, which acts as a template to produce proteins. We often concentrate on protein-coding genes, because proteins are the building blocks of cells and the majority of bio-active molecules.
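Rule 5 and the 2-bit note in rule 2 can both be checked with a short Python sketch. The names `reverse_complement` and `pack_2bit` are hypothetical, and the bit assignment A=00, C=01, G=10, T=11 is an assumption made for illustration; the slides only say that 2 bits suffice, not which code to use.

```python
# Watson-Crick pairing from rule 3: A pairs with T, G pairs with C.
COMPLEMENT = str.maketrans("ACGT", "TGCA")

# Assumed 2-bit code for the four unambiguous bases (illustrative choice).
TWO_BIT = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}


def reverse_complement(seq: str) -> str:
    """Return the opposite strand, written 5' to 3'."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def pack_2bit(seq: str) -> int:
    """Pack an A/C/G/T-only sequence into an integer, 2 bits per base."""
    packed = 0
    for base in seq.upper():
        # Raises KeyError for N or any other ambiguity code, which is
        # exactly why a wider (e.g. 4-bit) alphabet is used in practice.
        packed = (packed << 2) | TWO_BIT[base]
    return packed


plus_strand = "ACGTGATCGATGCTA"
print(reverse_complement(plus_strand))  # TAGCATCGATCACGT, as in rule 5
print(pack_2bit("ACGT"))                # 27, i.e. binary 00 01 10 11
```

Applying `reverse_complement` twice returns the original sequence, which is a quick sanity check when handling genes annotated on the minus strand.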
(But let's not forget the various RNA genes.)

Bioinformatics
• Bioinformatics (computational biology) solves biological problems on the molecular level with the use of techniques including:
  – applied mathematics
  – statistics
  – computer science
  – artificial intelligence
• Bioinformatics: Biological Data + Computer Calculations

Molecular Biology as an Information Science
• Central Dogma of Molecular Biology: DNA -> RNA -> Protein -> Phenotype
  – Molecules: Sequence, Structure, Function
  – Processes: Mechanism, Specificity, Regulation
• Central Paradigm for Bioinformatics: Genomic Sequence -> Transcript -> Protein Structure -> Protein Function
  – Large Amounts of Information
  – Data Management
  – Computer Algorithms
  – Statistical Methods

Major research efforts
• Sequence alignment
• Gene finding
• Genome assembly
• RNA structure prediction
• Protein structure prediction
• Analysis of gene regulation
• Prediction of protein-protein interactions
• Modeling of evolution

Major research areas
• Sequence analysis
• Genome annotation
• Computational evolutionary biology
• Measuring biodiversity
• Analysis of gene expression
• Analysis of regulation
• Analysis of protein expression
• Analysis of mutations in cancer
• Analysis of epigenetics in cancer
• High-throughput in vivo binding analysis
• Prediction of protein structure
• Comparative genomics
• Modeling biological systems
• High-throughput image analysis
• Protein-protein docking
• Software and tools
• Databases
• Web services in bioinformatics

Data types
• DNA sequences
• RNA sequences
• Protein sequences
• Gene Expression
  – cDNA, mRNA microarray data
  – Now tiling array technology: 50 M data points to tile the human genome at ~50 bp res.
  – Can only sequence a genome once, but can do an infinite variety of array experiments
• Protein-DNA interactions
  – ChIP-chip, ChIP-seq, ChIP-PET and so on
• Phenotype Experiments
  – Knockouts (KOs)
• Protein Interactions
  – Yeast two-hybrid
  – Proteomics

Other Integrative Data
• Information to understand genomes
  – Metabolic Pathways
  – Regulatory Networks
  – Signaling Networks
  – Whole Organisms Phylogeny
  – The Literature (MEDLINE)

GenBank Growth

GenBank Data
Year    Base Pairs    Sequences
1982    680338        606
1983    2274029       2427
1984    3368765       4175
1985    5204420       5700
1986    9615371       9978
1987    15514776      14584
1988    23800000      20579
1989    34762585      28791
1990    49179285      39533
1991    71947426      55627
1992    101008486     78608
1993    157152442     143492
1994    217102462     215273
1995    384939485     555694
1996    651972984     1021211
1997    1160300687    1765847
1998    2008761784    2837897
1999    3841163011    4864570
2000    8604221980    7077491

Exponential Growth of Data Matched by Development of Computer Technology
• CPU vs Disk & Internet: Driving Force in Bioinformatics
[Figure: Internet hosts, number of protein domain structures in the PDB, and CPU instruction time (ns), each plotted by year]

Types of Relational databases
• The Internet can be thought of as one enormous relational database.
  – The “links”/URLs are the primary keys.
• SQL (Structured Query Language)
  – Sybase; Oracle; Access (database systems)
  – Sybase is used at NCBI.
• SRS (one type of database querying system in use in biology)

XML Database and vocabularies for life science
• HTML: Hypertext Markup Language
• XML: a general-purpose specification for creating custom markup languages. It is classified as an extensible language because it allows the user to define the mark-up elements.
• BSML: an extensible language specification and container for bioinformatic data.
  BSML was developed under a 1997 grant from the National Human Genome Research Institute (NHGRI) as an evolving public domain standard for the bioinformatics community.

Examples of XML
<?xml version="1.0" encoding="UTF-8"?>
<element_name attribute_name="attribute_value">Element Content</element_name>
<book>This is a book... </book>

Primary Databases
• A primary database is a repository of data derived from experiments or from research knowledge.
• GenBank (nucleotide repository), Protein DB, Swiss-Prot and PDB (MMDB) are primary databases.
• PubMed (literature)
• Genome mapping databases
• KEGG database (pathways)

Secondary Databases
• A secondary database contains information derived from other sources.
• RefSeq (curated collection of GenBank at NCBI)
• UniGene (clustering of ESTs at NCBI)
• GeneID (unique ID for each gene at NCBI)
• Organism-specific databases are often a mix between primary and secondary.

Biological Databases
Nucleotide databases:
• GenBank: International Collaboration
  – NCBI (USA), EMBL (Europe), DDBJ (Japan and Asia)
  – A “bank”: no curation. Submission to these databases is required for publication in a journal.
• Organism-specific databases (quick quiz: find the URLs using search engines)
  – FlyBase
  – ChickGBASE
  – pigbase
  – wormpep
  – YPD (Yeast Protein Database)
  – SGD (Saccharomyces Genome Database)
Protein databases:
• NCBI: more on this next week
• Swiss-Prot (free for academic use, otherwise commercial; licensing restrictions on discoveries made using the DB; the 1998 version is free of any licensing)
  – http://www.expasy.ch (latest pay version)
  – NCBI has the latest free version.
• Translated proteins from GenBank submissions
• EMBL
  – TrEMBL is a computer-annotated supplement of SWISS-PROT that contains all the translations of EMBL nucleotide sequence entries not yet integrated in SWISS-PROT
• PIR
Structure databases:
• PDB: Protein structure database.
  – http://www.rcsb.org/pdb/
• MMDB: NCBI's version of PDB with Entrez links.
  – http://www.ncbi.nlm.nih.gov
Genome mapping information:
• http://www.il-st-acad-sci.org/health/genebase.html
• NCBI (Human)
• Genome Centers: Stanford, Washington University, UC Berkeley
• Research Centers and Universities
Literature databases:
• NCBI: PubMed: all biomedical literature.
  – www.ncbi.nlm.nih.gov
  – Abstracts and links to publisher sites for full text retrieval/ordering and journal browsing.
• Publisher web sites.
• Biomednet: commercial site for literature search.
Pathways database:
• KEGG: Kyoto Encyclopedia of Genes and Genomes: www.genome.ad.jp/kegg/kegg/html
Genome search and visualization database:
• UCSC Genome Browser (genome.ucsc.edu/)

Information techniques
• Databases
  – Building, Querying
  – Complex data
• Text String Comparison
  – Text Search
  – 1D Alignment
  – Significance Statistics
  – Alta Vista, grep
• Finding Patterns
  – Machine Learning
  – Clustering
  – Data mining
• Geometry
  – Robotics
  – Graphics (Surfaces, Volumes)
  – Comparison and 3D Matching (Vision, recognition)
• Physical Simulation
  – Newtonian Mechanics
  – Electrostatics
  – Numerical Algorithms
  – Simulation

Bioinformatics as New Paradigm for Scientific Computing
• Physics
  – Prediction based on physical principles
  – EX: Exact determination of rocket trajectory
  – Emphasizes: supercomputer, CPU
• Biology
  – Classifying information and discovering unexpected relationships
  – EX: Gene expression network
  – Emphasizes: networks, “federated” database

Topics -- Genome Sequence
• Finding Genes in Genomic DNA
  – introns
  – exons
  – promoters
• Characterizing Repeats in Genomic DNA
  – Statistics
  – Patterns
• Duplications in the Genome
  – Large scale genomic alignment
• Whole-Genome Comparisons
• Finding Structural RNAs

Topics -- Protein Sequence
• Sequence Alignment
  – How to align two strings optimally via Dynamic Programming
  – Local vs Global Alignment
  – Suboptimal Alignment
  – Hashing to increase speed (BLAST, FASTA)
  – Amino acid substitution scoring matrices
• Multiple Alignment and Consensus Patterns
  – How to align more than one sequence and then fuse the result in a consensus
    representation
  – HMMs, Profiles
  – Motifs
• Scoring schemes and Matching statistics
  – How to tell if a given alignment or match is statistically significant (a P-value or an E-value)?
  – Score Distributions
  – Low Complexity Sequences
• Evolutionary Issues
  – Rates of mutation and change

Topics -- Structures
• Secondary Structure “Prediction”
  – via Propensities
  – Neural Networks, Genetic Alg.
  – Simple Statistics
  – TM-helix finding
  – Assessing Secondary Structure Prediction
• Structure Prediction: Protein vs RNA
• Tertiary Structure Prediction
  – Fold Recognition
  – Threading
  – Ab initio
• Direct Function Prediction
  – Active site identification
• Relation of Sequence Similarity to Structural Similarity

Topics -- Structures
• Structure Comparison
• Basic Protein Geometry and Least-Squares Fitting
  – Distances, Angles, Axes, Rotations
  – Calculating a helix axis in 3D via fitting a line
  – LSQ fit of 2 structures
  – Molecular Graphics
• Calculation of Volume and Surface
  – How to represent a plane
  – How to represent a solid
  – How to calculate an area
  – Hinge prediction
  – Packing Measurement
• Structural Alignment
  – Aligning sequences on the basis of 3D structure.
  – DP does not converge, unlike sequences, what to do?
  – Other Approaches: Distance Matrices, Hashing
  – Fold Library
• Docking and Drug Design as Surface Matching

Topics -- Function Genomics
• Expression Analysis
  – Time Courses clustering
  – Measuring differences
  – Identifying Regulatory Regions
• Large scale cross referencing of information
• Function Classification and Orthologs
• The Genomic vs. Single-molecule Perspective
• Genome Comparisons
  – Ortholog Families, pathways
  – Large-scale censuses
  – Frequent Words Analysis
  – Genome Annotation
• Identification of interacting proteins
• Networks
  – Global structure and local motifs
• Structural Genomics
  – Folds in Genomes, shared & common folds
  – Bulk Structure Prediction
• Genome Trees

Bioinformatics tools
• Sequence comparison (pairwise and multiple alignments, e.g. ClustalW, Blastz)
• Phylogenetic reconstruction (e.g. Phylip, IQPNNI, SplitsTree)
• Database search (e.g.
  BLAST, HMMer)
• Comparative sequence assembly (e.g. OSLay)
• Gene finding (e.g. genscan, FirstEF)
• Motif discovery (e.g. MEME, Weeder)
• Protein structure (e.g. CE)

Bioinformatics algorithms
• Dynamic Programming
• EM algorithms
• Neural Networks
• Hidden Markov Models
• Support Vector Machine
• Phylogenetic Trees
• Clustering

Bioinformatics Topics?
• (YES?) Digital Libraries
  – Automated Bibliographic Search and Textual Comparison
  – Knowledge bases for biological literature
• (YES) Motif Discovery Using Gibbs Sampling
• (YES) Metabolic Pathway Simulation
• (YES) Gene identification by sequence inspection
  – Prediction of splice sites
• (YES) Linkage Analysis
  – Linking specific genes to various traits
• (YES) RNA structure prediction
  – Identification in sequences
• (YES) Homology modeling
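Dynamic programming, listed above among the bioinformatics algorithms and named earlier as the way to align two strings optimally, can be illustrated with a global alignment score in the style of Needleman-Wunsch. This is a minimal sketch, not the course's implementation: the function name `global_alignment_score` is hypothetical, the match, mismatch and gap values are arbitrary assumptions, gaps are scored linearly, and only the optimal score is computed (no traceback of the alignment itself).

```python
def global_alignment_score(a: str, b: str,
                           match: int = 1,
                           mismatch: int = -1,
                           gap: int = -2) -> int:
    """Score of the best global alignment of a and b (linear gap cost)."""
    # Row 0: aligning the empty prefix of a with b[:j] costs j gaps.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        # Column 0: aligning a[:i] with the empty prefix of b costs i gaps.
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            # curr[j] will be the best score for a[:i] versus b[:j].
            substitution = match if a[i - 1] == b[j - 1] else mismatch
            diagonal = prev[j - 1] + substitution  # align a[i-1] with b[j-1]
            up = prev[j] + gap                     # a[i-1] against a gap
            left = curr[j - 1] + gap               # b[j-1] against a gap
            curr.append(max(diagonal, up, left))
        prev = curr
    return prev[-1]


print(global_alignment_score("ACGT", "ACGT"))  # 4: four matches
print(global_alignment_score("ACGT", "AGT"))   # 1: three matches, one gap
```

Keeping only two rows of the table is a design choice that saves memory when just the score is needed; recovering the alignment itself would require storing the full table (or traceback pointers). Local alignment changes the recurrence by never letting a cell drop below zero and taking the maximum over the whole table.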