An Introduction to Bioinformatics
CSE, Marmara University, mimoza.marmara.edu.tr/~m.sakalli/cse546, Oct/12/09
Source: http://bio.fsu.edu/~stevet/BSC5936/BioDataBases.ppt

Terminology

Bioinformatics: using computational techniques to access, analyze, and interpret biological information. It also covers tool building. Biocomputing and computational biology are synonyms.
Sequence analysis: the study of molecular sequence data.
Genomics: analyzes the context of genes or complete genomes.
Proteomics: the subdivision of genomics concerned with analyzing the protein complement, i.e. the proteome.

The Human Genome Project and numerous other projects keep the data coming at alarming rates. Homo sapiens has about 3.2 billion base pairs. Early estimates of the number of genes were in the 100,000 range, but the count turns out to be between 25,000 and 35,000, only about twice as many as a fruit fly! The protein-coding region of the genome is only about 1% or so; a large part of the remainder is 'jumping', 'selfish' DNA, much of which may be involved in regulation and control.

There are three major databases, each with its own specific format. They are mirrored among each other and share accession codes, but NOT identifier names:

1) National Center for Biotechnology Information (NCBI), part of the National Library of Medicine (NLM) at the NIH (GenBank and GenPept).
http://www.ncbi.nlm.nih.gov/
http://www.ncbi.nlm.nih.gov/Genbank/GenbankOverview.html
Georgetown University's National Biomedical Research Foundation (NBRF) Protein Identification Resource (PIR) and the Naval Research Lab sequences of three-dimensional structure (NRL_3D).
http://www-nbrf.georgetown.edu/
http://www-nbrf.georgetown.edu/pirwww/dbinfo/nrl3d.html

2) European Molecular Biology Laboratory (EMBL): http://www.ebi.ac.uk/embl/index.html, http://www.embl-heidelberg.de/
European Bioinformatics Institute (EBI): http://www.ebi.ac.uk/
Swiss Institute of Bioinformatics (SIB), Expert Protein Analysis System (ExPASy): http://www.expasy.ch/, http://www.expasy.org/links.html
Nucleotide sequence database, amino acid sequence databases: http://expasy.cbr.nrc.ca/sprot/

3) The National Institute of Genetics, DNA Data Bank of Japan (DDBJ): http://www.ddbj.nig.ac.jp/

Atlas of Protein Sequence and Structure: the first well-recognized protein sequence database, started in the mid-sixties by Dr. Margaret Dayhoff. DDBJ began in 1984, GenBank in 1982, and EMBL in 1980. They are all attempts at establishing an organized, reliable, comprehensive, and openly available library of genetic sequences.

Each program needs to recognize particular aspects of the sequence files, so program flexibility is a headache. NCBI's ASN.1 format and its Entrez interface attempt to reduce these problems. Unfortunately, there is nothing here comparable to the IEEE working groups or the Internet task force's RFCs, so format issues remain the most confusing and troubling aspect of working with primary sequence data.

Sequence database installations are commonly a complex ASCII/binary mix, neither relational nor object-oriented, and often proprietary. They contain several very long text files, each holding different types of information, all related to particular sequences. Software is usually required to interact with these databases.

ReadSeq, by Don Gilbert: a reformatting program for DNA and protein sequences that accepts single or multiple inputs in 18 different formats and converts them to a specified format.

http://www.molecularevolution.org/

AWTY (Are We There Yet?) is a system for graphically exploring convergence of Markov Chain Monte Carlo (MCMC) chains in Bayesian phylogenetic inference (Nylander et al. 2008).
FigTree is used to view phylogenetic trees graphically.

Clustal W (Thompson et al. 1994) performs global multiple sequence alignment. It uses a progressive alignment algorithm with affine gap penalties and a guide tree based on sequence similarity to align DNA or amino acid sequences. The affine gap cost model penalizes insertions and deletions using a linear function in which one term is length independent and the other is length dependent:

Gap penalty = Gapopen + Len * Gapextend

For instance, with illustrative values Gapopen = 10 and Gapextend = 1, a gap of length 3 costs 10 + 3 * 1 = 13, so one long gap is cheaper than several short gaps of the same total length.

Several reviews compare multiple alignment algorithms (e.g., Hickson et al. 2000, Thompson et al. 1999, and McClure et al. 1994). Morrison and Ellis (1997) discuss the effects of nucleotide sequence alignment on the estimation of phylogenetic hypotheses. The current version is Clustal W2 (Larkin et al. 2007). The program is also available with a graphical user interface, Clustal X.

BEAST (with BEAUti), Bayesian Evolutionary Analysis Sampling Trees, is for evolutionary inference from molecular sequences, by Andrew Rambaut and Alexei Drummond (Drummond et al. 2002; 2005; 2006).

FASTA compares pairs of protein or DNA sequences, and also compares a single protein or DNA sequence to a database or library. It is fast and is available as a local program or as a remote service.

GARLI (Genetic Algorithm for Rapid Likelihood Inference) performs phylogenetic searches on aligned nucleotide datasets using the maximum likelihood criterion.

MAFFT uses the fast Fourier transform (FFT) to optimize protein alignments based on physical properties of the amino acids (Katoh et al., 2002; 2005). The program uses progressive alignment followed by refinement, also known as iterative alignment.

All sequence databases contain (in their own format):
Name (genetic identifier): LOCUS, ENTRY, ID.
Definition: a brief, one-line, textual sequence description.
Accession number: a constant data identifier.
Source and classification (taxonomy) information.
Complete literature references.
Comments and keywords.
The all-important FEATURE table!
A summary or checksum line.
The sequence itself.
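Because the same logical field carries a different line tag in each format, a program usually starts from a lookup table. The following is only a minimal sketch: the tags are those visible in the GenBank, EMBL/SWISSPROT, and PIR entries of this section, while `FIELD_TAGS` and `read_field` are hypothetical names introduced here, and the two short entries are cut-down illustrations rather than complete records.

```python
from typing import Optional

# Line tag used for the same logical field in each flat-file format.
# Tags are taken from the example entries in this section; the table
# itself is an illustrative assumption, not part of any library.
FIELD_TAGS = {
    "genbank": {"name": "LOCUS", "definition": "DEFINITION", "accession": "ACCESSION"},
    "embl": {"name": "ID", "definition": "DE", "accession": "AC"},
    "pir": {"name": "ENTRY", "definition": "TITLE", "accession": "ACCESSIONS"},
}


def read_field(entry_text: str, fmt: str, field: str) -> Optional[str]:
    """Return the text after the first line carrying the tag, or None.

    Only the first matching line is returned; continuation lines
    (for example a second DE line) are ignored in this sketch.
    """
    tag = FIELD_TAGS[fmt][field]
    for line in entry_text.splitlines():
        parts = line.split(None, 1)  # tag, then the rest of the line
        if parts and parts[0] == tag:
            return parts[1].strip() if len(parts) > 1 else ""
    return None


# Cut-down entries using values from the examples in this section.
genbank_entry = (
    "LOCUS       HSEF1AR      1506 bp    mRNA    linear   PRI 12-SEP-1993\n"
    "DEFINITION  Human mRNA for elongation factor 1 alpha subunit (EF-1 alpha).\n"
    "ACCESSION   X03558\n"
)
swissprot_entry = (
    "ID   EF11_HUMAN     STANDARD;      PRT;   462 AA.\n"
    "AC   P04720; P04719;\n"
)

print(read_field(genbank_entry, "genbank", "accession"))
print(read_field(swissprot_entry, "embl", "accession"))
```

Keeping the tags in a table rather than in the parsing logic means that supporting another format is, under these simplifying assumptions, a matter of adding one more row.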
GenBank and GenPept format

LOCUS       HSEF1AR      1506 bp    mRNA    linear   PRI 12-SEP-1993
DEFINITION  Human mRNA for elongation factor 1 alpha subunit (EF-1 alpha).
ACCESSION   X03558
VERSION     X03558.1  GI:31097
KEYWORDS    elongation factor; elongation factor 1.
SOURCE      human.
  ORGANISM  Homo sapiens
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
            Mammalia; Eutheria; Primates; Catarrhini; Hominidae; Homo.
REFERENCE   1  (bases 1 to 1506)
  AUTHORS   Brands,J.H., Maassen,J.A., van Hemert,F.J., Amons,R. and Moller,W.
  TITLE     The primary structure of the alpha subunit of human elongation……
  JOURNAL   Eur. J. Biochem. 155 (1), 167-171 (1986)
  MEDLINE   86136120
FEATURES             Location/Qualifiers
     source          1..1506
                     /organism="Homo sapiens"
                     /db_xref="taxon:9606"
     CDS             54..1442
                     /note="EF-1 alpha (aa 1-463)"
                     /codon_start=1
                     /protein_id="CAA27245.1"
                     /db_xref="GI:31098"
                     /db_xref="SWISS-PROT:P04720"
                     /translation="MGKEKTHINIVVIGHVDSGKSTTTGHLIYKCGGIDKRTIEKFEK
                     EAAEMGKGSFKYAWVLDKLKAERERGITIDISLWKFETSKYYVTIIDAPGHRDFIKNM
                     ……VTKSAQKAQKAK"
BASE COUNT      412 a    337 c    387 g    370 t
ORIGIN
        1 acgggtttgc cgccagaaca caggtgtcgt gaaaactacc cctaaaagcc aaaatgggaa
       61 aggaaaagac tcatatcaac attgtcgtca ttggacacgt agattcgggc aagtccacca……….
     1501 aactgt
//

EMBL and SWISSPROT format

ID   EF11_HUMAN     STANDARD;      PRT;   462 AA.
AC   P04720; P04719;
DT   13-AUG-1987 (Rel. 05, Created)……
DE   Elongation factor 1-alpha 1 (EF-1-alpha-1) (Elongation factor 1 A-1)
DE   (eEF1A-1) (Elongation factor Tu) (EF-Tu).
GN   EEF1A1 OR EEF1A OR EF1A.
OS   Homo sapiens (Human),
OS   Bos taurus (Bovine), and
OS   Oryctolagus cuniculus (Rabbit).
OC   Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
OC   Mammalia; Eutheria; Primates; Catarrhini; Hominidae; Homo.
OX   NCBI_TaxID=9606, 9913, 9986;
RN   [1]
RP   SEQUENCE FROM N.A.
RC   SPECIES=Human;
RX   MEDLINE=86136120; PubMed=3512269;
RA   Brands J.H.G.M., Maassen J.A., van Hemert F.J., Amons R., Moeller W.;
RT   "The primary structure of the alpha subunit of human elongation …. -binding sites.";
RL   Eur. J. Biochem. 155:167-171(1986).……
CC   -!- FUNCTION: THIS PROTEIN PROMOTES THE GTP-DEPENDENT BINDING OF
CC       AMINOACYL-TRNA TO THE A-SITE OF RIBOSOMES DURING PROTEIN
CC       BIOSYNTHESIS.
CC   -!- SUBCELLULAR LOCATION: Cytoplasmic.
CC   -!- TISSUE SPECIFICITY: BRAIN, PLACENTA, LUNG, LIVER, KIDNEY, PANCREAS
CC       BUT BARELY DETECTABLE IN HEART AND SKELETAL MUSCLE.
CC   -!- SIMILARITY: BELONGS TO THE GTP-BINDING ELONGATION FACTOR FAMILY.
CC       EF-TU/EF-1A SUBFAMILY……
DR   EMBL; X03558; CAA27245.1; -……
DR   PIR; S18054; EFRB1……
DR   HSSP; Q01698; 1TUI……
DR   InterPro; IPR004160; GTP_EFTU_D3.
DR   Pfam; PF00009; GTP_EFTU; 1……
DR   PROSITE; PS00301; EFACTOR_GTP; 1.
KW   Elongation factor; Protein biosynthesis; GTP-binding; Methylation;
KW   Multigene family.
FT   NP_BIND      14     21       GTP (BY SIMILARITY).
FT   NP_BIND      91     95       GTP (BY SIMILARITY).
FT   NP_BIND     153    156       GTP (BY SIMILARITY).
FT   MOD_RES      36     36       METHYLATION (TRI-).
FT   MOD_RES      55     55       METHYLATION (DI-).
FT   MOD_RES      79     79       METHYLATION (TRI-).
FT   MOD_RES     165    165       METHYLATION (DI-).
FT   MOD_RES     318    318       METHYLATION (TRI-).
FT   BINDING     301    301       ETHANOLAMINE-PHOSPHOGLYCEROL.
FT   BINDING     374    374       ETHANOLAMINE-PHOSPHOGLYCEROL.
FT   CONFLICT     83     83       S -> A (IN REF. 2).
FT   CONFLICT    232    232       L -> V (IN REF. 3).
SQ   SEQUENCE   462 AA;  50141 MW;  D465615545AF686A CRC64;
     MGKEKTHINI VVIGHVDSGK STTTGHLIYK CGGIDKRTIE KFEKEAAEMG KGSFKYAWVL
     DKLKAERERG …… VTKSAQKAQK AK
//

PIR/NBRF format

ENTRY            EFHU1 #type complete
TITLE            translation elongation factor eEF-1 alpha-1 chain - human
ALTERNATE_NAMES  translation elongation factor Tu
ORGANISM         #formal_name Homo sapiens #common_name man
   #cross-references taxon:9606
DATE             30-Jun-1988 #sequence_revision 05-Apr-1995 #text_change…..
ACCESSIONS       B24977; A25409; A29946; A32863; I37339
REFERENCE        A93610
   #authors      Rao, T.R.; Slobin, L.I.
   #journal      Nucleic Acids Res. (1986) 14:2409
   #title        Structure of the amino-terminal end of mammalian elongation…
   #accession    B24977
      ##molecule_type mRNA
      ##residues 1-82,'A',84-94 ##label RAO
      ##cross-references EMBL:X03689; NID:g31109; PIDN:CAA27325.1; PID:g31110…….
GENETICS
   #gene         GDB:EEF1A1; EEF1A; EF1A
      ##cross-references GDB:118791; OMIM:130590
   #map_position 6q14-6q14
   #introns      48/3; 108/3; 207/3; 258/1; 343/3; 422/1
CLASSIFICATION   SF003007 #superfamily translation elongation factor Tu; translation elongation factor Tu homology
KEYWORDS         GTP binding; methylated amino acid; nucleotide binding; P-loop; phosphoprotein; protein biosynthesis; RNA binding
FEATURE
   1-223              #domain eEF-1 alpha domain I, GTP-binding #status predicted #label EF1\
   8-156              #domain translation elongation factor Tu homology #label ETU\
   14-21              #region nucleotide-binding motif A (P-loop)\
   153-156            #region GTP-binding NKXD motif\
   245-330            #domain eEF-1 alpha domain II, tRNA-binding #status predicted #label EF2\
   332-462            #domain eEF-1 alpha domain III, tRNA-binding #status predicted #label EF3\
   36,55,79,165,318   #modified_site N6,N6,N6-trimethyllysine (Lys) #status predicted\
   301,374            #binding_site glycerylphosphorylethanolamine (Glu) (covalent) #status predicted
SUMMARY          #length 462 #molecular_weight 50141
SEQUENCE
                5        10        15        20        25        30
      1 M G K E K T H I N I V V I G H V D S G K S T T T G H L I Y K
     31 C G G I D K R T I E K F E K E A A E M G K G S F K Y A W V L
     61 D K L K A E R E R …... Q K A Q K A K

Examples of databases with specialized types of sequences

Almost all the links: the Human Genome Ensembl project at http://www.ensembl.org/
Patterns, motifs, and profiles: REBASE, EPD, PROSITE.
Aligned multiple sequence entries: RDP and ALN.
Functionally, structurally, or phylogenetically ordered: iProClass and the HOVERGEN vertebrate gene database.
Organism-specific: the HIV Database and the Giardia lamblia Genome Project.
3D structure: atomic coordinate data is necessary to define the tertiary shape of a particular biological molecule. Examples are the Protein Data Bank and the Rutgers Nucleic Acid Database.
MolBio: molecular visualization with special software.
Genomic linkage mapping databases for H. sapiens, Mus, Drosophila, C. elegans, Saccharomyces, Arabidopsis, E. coli.
OMIM (Online Mendelian Inheritance in Man).
Phylogenetic tree databases: e.g. the Tree of Life.
Metabolic pathway databases: e.g. WIT (What Is There) and Japan's GenomeNet KEGG (the Kyoto Encyclopedia of Genes and Genomes).

Check the links given below.

There is a bewildering assortment of databases and of ways to access and manipulate the information within them. The key is to learn how to use that information in the most efficient manner. For example:

1) Given a novel genome sequence, find all genes and p-genes.
2) I want to design "sequence capture" probes for the exons of 40 genes that cause RP. Obtain the exonic sequence, with at least 100 nt of flanking sequence on each side, and 1000 nt of the promoter from the transcription start.
3) I propose a new way to find disease-causing mutations in humans. I want to look only in genes that have regions that 1) are highly conserved across species, 2) have known functional protein domains (e.g. transmembrane domains), and 3) have mRNA secondary structure. Is this a good idea?
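Once exon coordinates have been retrieved from a genome database, the sequence-capture task above reduces largely to interval arithmetic. The sketch below is only an illustration under stated assumptions: coordinates are 1-based and inclusive, the promoter is taken to be the 1000 nt immediately upstream of the transcription start site (TSS), overlapping padded exons are merged, and the function name `capture_regions` and all coordinates are hypothetical.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # 1-based, inclusive (start, end)


def capture_regions(
    exons: List[Interval],
    tss: int,
    strand: str,
    flank: int = 100,
    promoter: int = 1000,
) -> Tuple[List[Interval], Interval]:
    """Pad each exon by `flank` nt, merge overlaps, and add a promoter window.

    Assumption: the promoter is the `promoter` nt directly upstream of the
    TSS, i.e. at lower coordinates on '+' and higher coordinates on '-'.
    A TSS at the very start of a '+' strand sequence yields an empty
    window (start > end).
    """
    if strand not in ("+", "-"):
        raise ValueError("strand must be '+' or '-'")

    # Pad every exon, clamping at the first base of the sequence.
    padded = sorted((max(1, start - flank), end + flank) for start, end in exons)

    # Merge padded exons that overlap or touch.
    merged: List[Interval] = []
    for start, end in padded:
        if merged and start <= merged[-1][1] + 1:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))

    if strand == "+":
        promoter_window = (max(1, tss - promoter), tss - 1)
    else:
        promoter_window = (tss + 1, tss + promoter)
    return merged, promoter_window


# Hypothetical gene on the '+' strand with three exons.
regions, promoter_window = capture_regions(
    exons=[(5000, 5200), (5350, 5400), (8000, 8300)], tss=5000, strand="+"
)
print(regions)
print(promoter_window)
```

Merging matters for probe design because the first two exons here lie closer together than twice the flank, so their padded intervals overlap and would otherwise be covered twice.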
1859: Charles Darwin's The Origin of Species

Basic Mendelian genetics:
Mendel's laws: independent assortment, independent segregation
mitosis and meiosis
dominant/recessive and pedigrees (the graphs of phenotype)
alleles

Basic molecular genetics:
DNA, RNA, proteins
Central Dogma
genes and gene structure
cells and chromosomes

Reference: Principles of Genetics, Tamarin.

Pearson FastA format

>EFHU1 PIR1 release 71.01
MGKEKTHINIVVIGHVDSGKSTTTGHLIYKCGGIDKRTIEKFEKEAAEMG
KGSFKYAWVLDKLKAERERGITIDISLWKFETSKYYVTIIDAPGHRDFIK
NMITGTSQADCAVLIVAAGVGEFEAGISKNGQTREHALLAYTLGVKQLIV
GVNKMDSTEPPYSQKRYEEIVKEVSTYIKKIGYNPDTVAFVPISGWNGDN
MLEPSANMPWFKGWKVTRKDGNASGTTLLEALDCILPPTRPTDKPLRLPL
QDVYKIGGIGTVPVGRVETGVLKPGMVVTFAPVNVTTEVKSVEMHHEALS
EALPGDNVGFNVKNVSVKDVRRGNVAGDSKNDPPMEAAGFTAQVIILNHP
GQISAGYAPVLDCHTAHIACKFAELKEKIDRRSGKKLEDGPKFLKSGDAA
IVDMVPGKPMCVESFSDYPPLGRFAVRDMRQTVAVGVIKAVDKKAAGAGK
VTKSAQKAQKAK

GCG single sequence format

!!AA_SEQUENCE 1.0
P1;EFHU1 - translation elongation factor eEF-1 alpha-1 chain - human
N;Alternate names: translation elongation factor Tu……
F;1-223/Domain: eEF-1 alpha domain I, GTP-binding #status predicted <EF1>
F;8-156/Domain: translation elongation factor Tu homology <ETU>
F;14-21/Region: nucleotide-binding motif A (P-loop)
F;153-156/Region: GTP-binding NKXD motif
F;245-330/Domain: eEF-1 alpha domain II, tRNA-binding #status predicted <EF2>
F;332-462/Domain: eEF-1 alpha domain III, tRNA-binding #status predicted <EF3>
F;36,55,79,165,318/Modified site: N6,N6,N6-trimethyllysine (Lys) #status predicted
F;301,374/Binding site: glycerylphosphorylethanolamine (Glu) (covalent) #status predicted

EFHU1  Length: 462  January 14, 2002 19:49  Type: P  Check: 5308  ..

       1  MGKEKTHINI VVIGHVDSGK STTTGHLIYK CGGIDKRTIE KFEKE……
     351  GQISAGYAPV LDCHTAHIAC KFAELKEKID RRSGKKLEDG PKFLKSGDAA
     401  IVDMVPGKPM CVESFSDYPP LGRFAVRDMR QTVAVGVIKA VDKKAAGAGK
     451  VTKSAQKAQK AK
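Of the formats shown above, Pearson FastA is the simplest to read in code: a header line beginning with '>' followed by one or more lines of sequence. The following is a minimal sketch rather than a full reader; it assumes the first word of the header is the identifier, keeps everything in memory, and uses a hypothetical function name, `parse_fasta`. The sample input is the first two sequence lines of the EFHU1 record.

```python
from typing import Dict, Iterable


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Map each record identifier to its sequence, joined into one string."""
    records: Dict[str, str] = {}
    name = None
    chunks = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if name is not None:
                records[name] = "".join(chunks)  # close the previous record
            header = line[1:].split()
            name = header[0] if header else ""  # first word is the identifier
            chunks = []
        elif name is not None:
            chunks.append(line)  # sequence line of the current record
    if name is not None:
        records[name] = "".join(chunks)  # close the last record
    return records


example = """>EFHU1 PIR1 release 71.01
MGKEKTHINIVVIGHVDSGKSTTTGHLIYKCGGIDKRTIEKFEKEAAEMG
KGSFKYAWVLDKLKAERERGITIDISLWKFETSKYYVTIIDAPGHRDFIK
"""

records = parse_fasta(example.splitlines())
for identifier, sequence in records.items():
    print(identifier, len(sequence))
```

The richer formats (GenBank, EMBL/SWISSPROT, PIR, GCG) carry annotation that this reader ignores, which is one reason reformatting tools such as ReadSeq exist.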