Introduction to Bioinformatics Course

International Livestock Research Institute,
Nairobi,
Kenya.
Introduction to Bioinformatics:
NOV. 2005
David Lynn (M.Sc., Ph.D.)
Trinity College Dublin
Ireland.
http://www.binf.org/ILRI2005/
Topics for the next 4 days:

Day 1 – Nucleic Acid Sequence Analysis

Day 2 – Protein Sequence Analysis

Day 3 – Accessing Complete Genomes

Day 4 – Alignments & Homology Searching

Day 4 – Phylogenetic Trees
Day 1

Introduction
Interrogating Sequence Databases
Translating DNA in 6 frames.
Reverse complement & other tools.
Calculating some properties of DNA/RNA sequences.
Primer design.
Gene prediction.
Alternative splicing.
Promoter characterisation.
Other resources.
1) Translating DNA in 6 frames
5'3' Frame 1
atcacctggtatagtataa
ITWYSI
3'5' Frame 1
ttatactataccaggtgat
LYYTR*
5'3' Frame 2
atcacctggtatagtataa
SPGIV*
3'5' Frame 2
ttatactataccaggtgat
YTIPGD
5'3' Frame 3
atcacctggtatagtataa
HLV*Y
3'5' Frame 3
ttatactataccaggtgat
ILYQV
Why?

Translating in all 6 frames is commonly done for a range of
bioinformatics applications.

One common use is locating ORFs in an mRNA sequence, which is
flanked by 5’ and 3’ untranslated regions (UTRs).

Try to find the protein sequence encoded by the IL-11 mRNA (link on
webpage) using the Translate Tool at Expasy.
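
The six-frame translation shown on the previous slide can be reproduced with a short script. This is a minimal sketch, assuming the standard genetic code and lowercase DNA input; the function names are illustrative and are not part of the Expasy Translate Tool.

```python
from itertools import product

# Standard genetic code, with codons enumerated in t, c, a, g order.
BASES = "tcag"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (lowercase)."""
    return seq.lower().translate(str.maketrans("acgt", "tgca"))[::-1]


def translate(seq):
    """Translate complete codons only; unknown codons become 'X'."""
    seq = seq.lower()
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frame_translation(seq):
    """Translate both strands in all three reading frames."""
    frames = {}
    for label, strand in (("5'3'", seq), ("3'5'", reverse_complement(seq))):
        for offset in range(3):
            frames[f"{label} Frame {offset + 1}"] = translate(strand[offset:])
    return frames


for name, protein in six_frame_translation("atcacctggtatagtataa").items():
    print(name, protein)
```

Stop codons are printed as `*`, as on the slide. An ORF is then a stretch between a start codon (M) and the next `*` in one of the six frames.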
2) Search Launcher at Baylor College

Readseq - converts sequences from one format to another.
RepeatMasker - masks sequences against repeat sequences.
Primer Selection - PCR primer selection (see primer design later).
WebCutter - restriction maps using enzymes with recognition sites of 6 or more bases.
6 Frame Translation - translates a nucleic acid sequence in 6 frames.
Reverse Complement - reverse complements a nucleic acid sequence.
Reverse Sequence - reverses sequence order.
Sequence Chopover - cuts a large protein/DNA sequence into smaller
ones with a specified amount of overlap.
HBR - finds E. coli contamination in human sequences.
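
Two of these utilities are simple enough to sketch in a few lines. The code below is an illustration of the idea only, not the Search Launcher implementation; the function names and the `size` and `overlap` parameters are assumptions made for this example.

```python
def reverse_sequence(seq):
    """Reverse the order of residues without complementing them."""
    return seq[::-1]


def chop_with_overlap(seq, size, overlap):
    """Cut seq into pieces of at most `size` residues.

    Consecutive pieces share `overlap` residues.
    """
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap
    pieces = []
    for start in range(0, len(seq), step):
        pieces.append(seq[start:start + size])
        if start + size >= len(seq):
            break  # the last piece already reaches the end of the sequence
    return pieces


print(reverse_sequence("atcacctggtatagtataa"))
print(chop_with_overlap("atcacctggtatagtataa", size=8, overlap=2))
```

With `size=8` and `overlap=2`, each piece starts 6 residues after the previous one, so the last 2 residues of one piece are repeated at the start of the next.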
3) Oligo Calculator

Calculates the
– Length
– %GC content
– Melting temperature (Tm), the midpoint of the temperature range over
which the nucleic acid strands separate
– Molecular weight
– The concentration, in picomolar (pM), of your input sequence that
corresponds to an OD of 1.

Many of these parameters are useful in primer design.
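
Some of these values are easy to compute by hand. The sketch below computes length, %GC, and a rough Tm using the Wallace rule, Tm = 2(A+T) + 4(G+C) in °C, which is a common approximation for short oligos. It is an assumption that this matches the method used by any particular Oligo Calculator; such tools may use more detailed formulas. The function name is illustrative.

```python
def oligo_properties(seq):
    """Return length, %GC and a Wallace-rule Tm estimate for a DNA oligo."""
    seq = seq.upper()
    if not seq:
        raise ValueError("sequence must not be empty")
    counts = {base: seq.count(base) for base in "ACGT"}
    at = counts["A"] + counts["T"]
    gc = counts["G"] + counts["C"]
    return {
        "length": len(seq),
        "gc_percent": 100.0 * gc / len(seq),
        # Wallace rule: 2 degrees per A/T, 4 degrees per G/C (short oligos only).
        "wallace_tm_celsius": 2 * at + 4 * gc,
    }


print(oligo_properties("atcacctggtat"))
```

For the 12-mer above, the oligo has 7 A/T and 5 G/C bases, giving an estimated Tm of 2 x 7 + 4 x 5 = 34 °C.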
Beer – Lambert Law

A = ecl
 e = molar extinction coefficient
 c = molar concentration
 l = light path = 1 cm

A = O.D. (optical density)

If O.D. = 1 corresponds to 41 pM, then:

A reading of O.D. = 0.5 on the spectrophotometer
– => concentration = 0.5 x 41 pM = 20.5 pM
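
The calculation is a straight rearrangement of A = ecl to c = A / (el). A minimal sketch follows, using the slide's example value of 41 pM per OD unit; the function names are illustrative.

```python
def concentration(absorbance, extinction_coeff, path_length_cm=1.0):
    """Beer-Lambert law rearranged: c = A / (e * l)."""
    return absorbance / (extinction_coeff * path_length_cm)


def concentration_from_od(od_reading, conc_at_od1):
    """Scale a known 'concentration at OD = 1' linearly with the reading."""
    return od_reading * conc_at_od1


# Slide example: OD = 1 corresponds to 41 pM, so OD = 0.5 gives 20.5 pM.
print(concentration_from_od(0.5, conc_at_od1=41.0))
```

Both functions describe the same relationship: if OD = 1 corresponds to 41 pM with a 1 cm light path, the extinction coefficient is 1/41 per pM per cm.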
5) Gene Prediction

Gene prediction is an area under intensive research in bioinformatics.

GENSCAN - one of the major programs used to predict genes
in the human genome.

It should be useful for predicting genes in most vertebrate species.
Use caution with other species, especially prokaryotes, for which
other programs are more suitable.

The Institute for Genomic Research
The Deambulum Nucleic Acids Sequence Analysis page at Infobiogen

6) Splice site prediction/Alternative splicing

Proper splicing requires some way to distinguish exons from introns.

This is accomplished using certain base sequences as signals.

These signals allow the spliceosome (the cellular machinery that does
the splicing) to identify the 5' and 3' ends of the intron.

Eukaryotes: the base sequence of an intron begins with 5' GU and
ends with 3' AG.

Each species has additional bases associated with these splice sites.

Introns also have another important sequence signal called a branch
site, containing a tract of pyrimidine bases and a special adenine base,
usually approximately 50 bases upstream from the 3' splice site.
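
The GU...AG rule can be checked with a few lines of code. This is a deliberately naive sketch: it only tests the two dinucleotides, ignores the branch site and the species-specific flanking bases, and will report far more candidates than real introns. The function names and the `min_len` parameter are assumptions made for this example.

```python
def has_canonical_ends(intron):
    """True if the sequence starts with GU and ends with AG (DNA T is read as U)."""
    rna = intron.upper().replace("T", "U")
    return len(rna) >= 4 and rna.startswith("GU") and rna.endswith("AG")


def candidate_introns(seq, min_len=6):
    """List (start, end) pairs such that seq[start:end] runs from a GU to an AG."""
    rna = seq.upper().replace("T", "U")
    donors = [i for i in range(len(rna) - 1) if rna[i:i + 2] == "GU"]
    acceptors = [i + 2 for i in range(len(rna) - 1) if rna[i:i + 2] == "AG"]
    return [(d, a) for d in donors for a in acceptors if a - d >= min_len]


print(has_canonical_ends("GTAAGTCCTTTCAG"))
print(candidate_introns("CCGUAAACAGCC"))
```

Real splice site predictors score the wider consensus around each site rather than the dinucleotides alone.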
Consensus splice site sequences
Alternative splicing

A long-held view in molecular biology was that 1 gene = 1 protein.

In fact, multiple mRNA transcripts can be produced from 1 gene,
and if translated these transcripts can code for very different
proteins.
– This is alternative splicing.

4 basic modes of alternative splicing:
1) Splice/Don’t Splice
2) Competing 5’ or 3’ splice sites
3) Exon Skipping
4) Mutually Exclusive Exons
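
A toy model makes the exon-level modes concrete. The exon names and sequences below are invented purely for illustration; they do not come from any real gene.

```python
# Hypothetical exons of an imaginary gene (sequences are made up).
EXONS = {
    "E1": "AUGGCU",
    "E2a": "GAAUUC",
    "E2b": "CCGGGA",
    "E3": "UGGUAA",
}


def splice(exons, keep):
    """Join the named exons, in order, into a mature transcript."""
    return "".join(exons[name] for name in keep)


isoforms = {
    # Exon skipping: E2a is either included or left out.
    "exon_included": splice(EXONS, ["E1", "E2a", "E3"]),
    "exon_skipping": splice(EXONS, ["E1", "E3"]),
    # Mutually exclusive exons: E2a or E2b, never both.
    "mutually_exclusive_a": splice(EXONS, ["E1", "E2a", "E3"]),
    "mutually_exclusive_b": splice(EXONS, ["E1", "E2b", "E3"]),
}

for name, transcript in isoforms.items():
    print(name, transcript)
```

Each distinct transcript can be translated separately, which is how one gene can give rise to several proteins.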
The Human Alternative Splicing Database at UCLA

Uses ESTs to locate alternative splices.

The project has resulted in the publication of over six thousand
alternatively spliced isoforms of human genes.

Search the database using any of the following identifiers:
– Gene Symbol
– UniGene Sequence Identifier
– UniGene Cluster Identifier
– Gene Title
– GenBank Sequence Identifier
7) Promoter Analysis & Recognition

A promoter is a sequence that is used to initiate and regulate
transcription of a gene.
Most protein-coding genes in higher eukaryotes have RNA polymerase II
(pol II) dependent promoters.
Features of pol II promoters:
– Combination of multiple individual regulatory elements.
– Most important elements are transcription factor binding sites.
– CAAT or TATA boxes are neither necessary nor sufficient for
promoter function.
– In many cases, order and distances of elements are crucial for their
function.
– Sequences between elements within a promoter are usually not
conserved and of no known function.
The promoter region in higher eukaryotes
PromoterInspector

Predicts eukaryotic pol II promoter regions with high specificity
(~ 85%) in mammalian genomic sequences.

Sensitivity is about 50%, which means that the current version
detects roughly every second promoter in the genome.

PromoterInspector predicts the approximate location of a
promoter region, not the exact location of the Transcription
Start Site (TSS).
MatInspector professional

Individual transcription factor binding sites form the basis of the
promoter.

These sites are relatively short stretches of DNA (10 - 20 nucleotides).

They are sufficiently conserved in sequence to allow specific
recognition by the corresponding transcription factor.

MatInspector uses a library of matrix descriptions for transcription
factor binding sites to locate matches in sequences.
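
The general idea of matching a sequence against a matrix description can be sketched as a sliding-window scan with a position weight matrix. This is a generic illustration and not MatInspector's actual algorithm or scoring scheme; the 4-column matrix, its scores, and the threshold are invented for the example (real sites are longer, as noted above).

```python
# Hypothetical weight matrix: one dict of per-base scores per position.
# It favours the pattern T-A-T-A; the numbers are made up.
TOY_MATRIX = [
    {"A": -1, "C": -1, "G": -1, "T": 2},
    {"A": 2, "C": -1, "G": -1, "T": -1},
    {"A": -1, "C": -1, "G": -1, "T": 2},
    {"A": 2, "C": -1, "G": -1, "T": 0},
]


def score_window(window, matrix):
    """Sum the matrix score of each base at its position."""
    return sum(matrix[i][base] for i, base in enumerate(window))


def scan(seq, matrix, threshold):
    """Return (position, site, score) for every window scoring >= threshold."""
    seq = seq.upper()
    width = len(matrix)
    hits = []
    for start in range(len(seq) - width + 1):
        window = seq[start:start + width]
        if any(base not in "ACGT" for base in window):
            continue  # skip windows containing ambiguous bases
        score = score_window(window, matrix)
        if score >= threshold:
            hits.append((start, window, score))
    return hits


print(scan("ggtataac", TOY_MATRIX, threshold=7))
```

Lowering the threshold finds more candidate sites at the cost of more false matches, which is the usual trade-off in matrix-based searches.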