M.Sc. in Molecular Medicine, Institute of Molecular Medicine, Trinity College Dublin, Ireland.

Introduction to Bioinformatics: February 2005
David Lynn (M.Sc., Ph.D.)
http://www.binf.org/course2005/

Topics for the next 3 days:
Day 1a – Nucleic Acid Sequence Analysis
Day 1b – Protein Sequence Analysis
Day 1c – Accessing Complete Genomes
Day 2a – Alignments & Homology Searching
Day 2b – Phylogenetic Trees

Day 1a
– Introduction
– Interrogating Sequence Databases
– Translating DNA in 6 frames.
– Reverse complement & other tools.
– Calculating some properties of DNA/RNA sequences.
– Primer design.
– Gene prediction.
– Alternative splicing.
– Promoter characterisation.
– Other resources.

1) Translating DNA in 6 frames

5'3' Frame 1   atcacctggtatagtataa   ITWYSI
3'5' Frame 1   ttatactataccaggtgat   LYYTR*
5'3' Frame 2   atcacctggtatagtataa   SPGIV*
3'5' Frame 2   ttatactataccaggtgat   YTIPGD
5'3' Frame 3   atcacctggtatagtataa   HLV*Y
3'5' Frame 3   ttatactataccaggtgat   ILYQV

(* marks a stop codon. Frame 2 starts reading at the second base and Frame 3 at the third; the 3'5' frames read the reverse complement.)

Why?
Translating in all 6 frames is commonly done for a range of bioinformatics applications. One common use is locating ORFs in an mRNA sequence, where the coding region is flanked by 5' and 3' untranslated regions (UTRs).
Try to find the protein sequence encoded by the IL-11 mRNA (link on webpage) using the Translate Tool at Expasy.

2) Search launcher at Baylor College

– Readseq: converts sequences from one format to another.
– RepeatMasker: masks sequences against repeat sequences.
– Primer Selection: PCR primer selection (see primer design later).
– WebCutter: restriction maps using enzymes with recognition sites >= 6 bases.
– 6 Frame Translation: translates a nucleic acid sequence in 6 frames.
– Reverse Complement: reverse complements a nucleic acid sequence.
– Reverse Sequence: reverses sequence order.
– Sequence Chopover: cuts a large protein/DNA sequence into smaller ones with a set amount of overlap.
– HBR: finds E. coli contamination in human sequences.
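
The 6 Frame Translation and Reverse Complement tools listed above can be approximated in a few lines of Python. This is a minimal sketch, assuming the standard genetic code; the function names (reverse_complement, translate, six_frame_translation) are illustrative and are not taken from the Expasy or Baylor tools.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (case preserved)."""
    complement = str.maketrans("ACGTacgt", "TGCAtgca")
    return seq.translate(complement)[::-1]


def translate(seq):
    """Translate a sequence codon by codon; a trailing partial codon is ignored."""
    seq = seq.upper().replace("U", "T")
    codons = (seq[i:i + 3] for i in range(0, len(seq) - 2, 3))
    # Codons containing ambiguous bases are reported as 'X'.
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frame_translation(seq):
    """Return a dict mapping frame labels to their translations."""
    rc = reverse_complement(seq)
    frames = {}
    for offset in range(3):
        frames[f"5'3' Frame {offset + 1}"] = translate(seq[offset:])
        frames[f"3'5' Frame {offset + 1}"] = translate(rc[offset:])
    return frames


if __name__ == "__main__":
    for label, protein in six_frame_translation("atcacctggtatagtataa").items():
        print(label, protein)
```

Run on the example sequence from section 1, this reproduces the six translations shown there. Locating an ORF then amounts to looking for a long stretch between an M and a * in one of the frames.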

3) Oligo Calculator

Calculates the:
– Length
– %GC content
– Melting temperature (Tm): the midpoint of the temperature range at which the nucleic acid strands separate
– Molecular weight
– What an OD = 1 is in picoMolar of your input sequence.

Many of these parameters are useful in primer design.

Beer-Lambert Law
A = εcl
ε = molar extinction coefficient
c = molar concentration
l = light path = 1 cm
A = O.D.

Because absorbance is proportional to concentration at a fixed path length, a reading scales linearly. If O.D. = 1 corresponds to 41 pM, then a reading of O.D. = 0.5 on the spectrometer gives a concentration of 0.5 x 41 pM = 20.5 pM.

5) Gene Prediction

Gene prediction is an area under intensive research in bioinformatics.
The GENSCAN program is one of the major programs used to predict genes in the human genome. It should be useful for predicting genes in most vertebrate species, but use caution with other species, especially prokaryotes, where other programs are more suitable.

See also:
– The Institute for Genomic Research
– The Deambulum Nucleic Acids Sequence Analysis page at Infobiogen

6) Splice site prediction/Alternative splicing

Proper splicing requires some way to distinguish exons from introns. This is accomplished using certain base sequences as signals, which allow the spliceosome (the cellular machinery that does the splicing) to identify the 5' and 3' ends of the intron.
In eukaryotes, the base sequence of an intron begins with GU at its 5' end and ends with AG at its 3' end. Each species has additional bases associated with these splice sites.
Introns also have another important sequence signal called a branch site, containing a tract of pyrimidine bases and a special adenine base, usually approximately 50 bases upstream from the 3' splice site.

[Figure: Consensus splice site sequences]

Alternative splicing

It was long assumed that 1 gene = 1 protein. In fact, multiple mRNA transcripts can be produced from 1 gene, and if translated these transcripts can code for very different proteins. This is alternative splicing.

There are 4 basic methods of alternative splicing:
  1) Splice/Don't Splice
  2) Competing 5' or 3' splice sites
  3) Exon Skipping
  4) Mutually Exclusive Exons

The Human Alternative Splicing Database at UCLA

– Uses ESTs to locate alternative splices.
– The project has resulted in the publication of over six thousand alternatively spliced isoforms of human genes.
– Search the database using any of the following identifiers:
  – Gene Symbol
  – UniGene Sequence Identifier
  – UniGene Cluster Identifier
  – Gene Title
  – GenBank Sequence Identifier

7) Promoter Analysis & Recognition

A promoter is a sequence that is used to initiate and regulate transcription of a gene. Most protein-coding genes in higher eukaryotes have polymerase II dependent promoters.

Features of pol II promoters:
– Combination of multiple individual regulatory elements.
– Most important elements are transcription factor binding sites.
– CAAT or TATA boxes are neither necessary nor sufficient for promoter function.
– In many cases, order and distances of elements are crucial for their function.
– Sequences between elements within a promoter are usually not conserved and of no known function.

[Figure: The promoter region in higher eukaryotes]

PromoterInspector

– Predicts eukaryotic pol II promoter regions with high specificity (~85%) in mammalian genomic sequences.
– The sensitivity of PromoterInspector is about 50%, which means that the current version predicts about every second promoter in the genome.
– Predicts the approximate location of a promoter region, not the exact location of the Transcription Start Site (TSS).

MatInspector professional

– Individual transcription factor binding sites form the basis of the promoter.
– These sites are relatively short stretches of DNA (10-20 nucleotides), sufficiently conserved in sequence to allow specific recognition by the corresponding transcription factor.
– MatInspector utilizes a library of matrix descriptions for transcription factor binding sites to locate matches in sequences.
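
To make the idea of a "matrix description" of a binding site concrete, the sketch below scans a sequence with a simple log-odds position weight matrix. It is a generic illustration only, not MatInspector's own scoring scheme: the count matrix is made up for a hypothetical 6-bp TATA-like site, and the function names, pseudocount, uniform background, and threshold are all assumptions.

```python
import math

# Hypothetical base counts at each position of a 6-bp TATA-like site,
# as if tallied from 20 aligned example sites (made-up numbers).
COUNTS = {
    "A": [2, 18, 1, 17, 12, 16],
    "C": [2, 0, 1, 0, 1, 1],
    "G": [1, 1, 0, 0, 1, 2],
    "T": [15, 1, 18, 3, 6, 1],
}


def log_odds_matrix(counts, pseudocount=1.0, background=0.25):
    """Convert counts into per-position log2 odds against a uniform background."""
    width = len(counts["A"])
    matrix = []
    for pos in range(width):
        total = sum(counts[base][pos] for base in "ACGT") + 4 * pseudocount
        matrix.append({
            base: math.log2((counts[base][pos] + pseudocount) / total / background)
            for base in "ACGT"
        })
    return matrix


def scan(seq, matrix, threshold):
    """Return (start, site, score) for every window scoring at or above threshold."""
    seq = seq.upper()
    width = len(matrix)
    hits = []
    for start in range(len(seq) - width + 1):
        window = seq[start:start + width]
        if any(base not in "ACGT" for base in window):
            continue  # skip windows containing ambiguous bases
        score = sum(matrix[i][base] for i, base in enumerate(window))
        if score >= threshold:
            hits.append((start, window, round(score, 2)))
    return hits


if __name__ == "__main__":
    pwm = log_odds_matrix(COUNTS)
    print(scan("GCGCTATAAAGGCG", pwm, threshold=5.0))
```

A production tool would use a curated matrix library and scan both strands. The sketch shows only the core step of sliding a matrix along the sequence and reporting the windows that score well.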