6-finding genes

Genes are identified by open reading frames


Synonym(s): ORFs, protein-coding sequences


Definition(s)
An open reading frame (ORF) is a sequence of nucleotides in DNA that
contains no termination codon and so can potentially be translated
into a polypeptide chain.
An ORF begins with a start codon and contains no stop codon for
a distance long enough to encode a protein.
Sources: NCI Thesaurus, via the Unified Medical Language System at the
National Library of Medicine; Merriam-Webster's Medical Dictionary,
Merriam-Webster Inc.
• A reading frame that does not contain a nucleotide triplet which stops
translation before formation of a complete polypeptide -- abbreviation ORF.
• An open reading frame is a portion of a DNA molecule that, when
translated into amino acids, contains no stop codons.
• The genetic code reads DNA sequences in groups of three base pairs,
which means that a double-stranded DNA molecule can be read in any of six
possible reading frames: three in the forward direction and three in the
reverse.
• A long open reading frame is likely part of a gene.
Open Reading Frames (ORF)
On a given piece of DNA, there are 6 possible reading frames. The ORF can
be on either the + or the - strand, in any of 3 possible frames:
Frame 1: the 1st base of the start codon is at base 1, 4, 7, 10, ...
Frame 2: the 1st base of the start codon is at base 2, 5, 8, 11, ...
Frame 3: the 1st base of the start codon is at base 3, 6, 9, 12, ...
(Frames -1, -2 and -3 are on the minus strand.)
An open reading frame starts with ATG in most species and ends with a
stop codon (TAA, TAG or TGA).
The six frames can be translated with a program called SIXFRAME; you
can visit the site directly:
http://searchlauncher.bcm.tmc.edu/seq-util/Options/sixframe.html




ACTGGGAAACCATTAAAACCATTATTTGGGGTATTACCA
Original sequence:
> - 39 nucleotides
actgggaaaccattaaaaccattatttggggtattacca
Translation in forward direction:
frame +1
> - 13 codons
ThrGlyLysProLeuLysProLeuPheGlyValLeuPro
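
The frame +1 translation above can be reproduced with a short script. This is only a minimal sketch, not the SIXFRAME program itself: it assumes the standard genetic code, and the function names (reverse_complement, translate, six_frames) are made up for illustration. It reports one-letter amino acid codes rather than the three-letter codes shown above.

```python
# Standard genetic code, with codons enumerated in T, C, A, G order
# at each of the three positions.
BASES = "TCAG"
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def reverse_complement(seq):
    """Return the minus strand, read 5' to 3'."""
    return seq.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(seq):
    """Translate complete codons only; '*' is a stop, 'X' an unknown codon."""
    seq = seq.upper()
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X")
        for i in range(0, len(seq) - 2, 3)
    )


def six_frames(seq):
    """Return the translations of frames +1, +2, +3, -1, -2 and -3."""
    minus = reverse_complement(seq)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(seq[offset:])
        frames[f"-{offset + 1}"] = translate(minus[offset:])
    return frames


if __name__ == "__main__":
    dna = "actgggaaaccattaaaaccattatttggggtattacca"
    for name, protein in six_frames(dna).items():
        print(name, protein)
```

For the 39-nucleotide sequence above, frame +1 comes out as TGKPLKPLFGVLP, which is the one-letter form of ThrGlyLysProLeuLysProLeuPheGlyValLeuPro.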
Genes Can be Identified within Genomic DNA Sequences
• An ORF is defined as a stretch of DNA of at least 100 bp with a start
codon and a stop codon of translation
• By scanning for open reading frames (ORFs), more than 90% of the genes
in bacteria and yeast have been identified
• Very short genes are missed by this method, and long ORFs that arise by
chance can be mistaken for genes
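
The ORF scan described above can be sketched as follows. This is an illustrative sketch under simple assumptions, not a production gene finder: an ORF is taken to run from the first ATG in a frame to the next in-frame stop codon, the 100 bp minimum comes from the definition above, and the names find_orfs and min_len are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def reverse_complement(seq):
    """Return the minus strand, read 5' to 3'."""
    return seq.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


def find_orfs(seq, min_len=100):
    """Scan all six frames for ATG...stop stretches of at least min_len bp.

    Returns tuples (strand, frame, start, end, length). Coordinates are
    0-based positions on the strand that was scanned, end exclusive, and
    the length includes both the start and the stop codon.
    """
    orfs = []
    for strand, s in (("+", seq.upper()), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i  # first ATG opens the ORF; inner ATGs are ignored
                elif start is not None and codon in STOP_CODONS:
                    length = i + 3 - start
                    if length >= min_len:
                        orfs.append((strand, frame + 1, start, i + 3, length))
                    start = None  # look for the next ORF in this frame
    return orfs


if __name__ == "__main__":
    # Toy sequence: ATG, 40 alanine codons, then a TAA stop, in frame 3.
    toy = "CC" + "ATG" + "GCT" * 40 + "TAA" + "G"
    print(find_orfs(toy))
```

The length threshold is what makes very short genes invisible to this kind of scan: any real gene shorter than min_len is dropped along with the chance ORFs.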
For eukaryotic genes, scanning for ORFs is not a good method of
identifying genes, because of the presence of multiple exons and introns.
1. One needs to use computer programs to compare the genomic DNA
sequences with cDNA sequences, splice site sequences and expressed
sequence tag (EST) sequences.
2. Another powerful method for identifying human genes is to compare the
human genomic sequence with that of the mouse, since human and mouse are
sufficiently related to have most genes in common.


ESTs represent partial sequences of cDNA clones (about 300 to 700 bp).
[Figure: preparation of cDNA clones for EST sequencing. (1) Reverse
transcriptase synthesizes one DNA strand from the poly(A) mRNA. (2) The
RNA is degraded and DNA polymerase synthesizes the second DNA strand.
(3) The cDNA is cloned into the multiple cloning site (MCS) of a vector,
between the T3 and T7 sites, and sequenced from the 5' or 3' end.]
An expressed sequence tag (EST) is a short sub-sequence of a cDNA
sequence. ESTs may be used to identify gene transcripts, and are
instrumental in gene discovery and gene sequence determination.
Because cDNA clones consist of DNA that is complementary to mRNA, the
ESTs represent portions of expressed genes.
They may be represented in databases as either cDNA/mRNA sequence or as
the reverse complement of the mRNA, the template strand.
The identification of ESTs has proceeded rapidly, with approximately
74.2 million ESTs available in public databases (e.g. GenBank,
1 January 2013, all species).
The current understanding of the human set of genes (as of 2006)
includes the existence of thousands of genes based on EST evidence.

In bioinformatics, FASTA format is a text-based format for
representing either nucleotide sequences or peptide
sequences, in which nucleotides or amino acids are
represented using single-letter codes.

The format also allows for sequence names and comments
to precede the sequences.

The format originates from the FASTA software package,
but has now become a standard in the field of
bioinformatics.
Each EST must have the following information:
1. A sequence ID (e.g. the sequence-run ID)
2. Its location with respect to the poly(A) tail (3' or 5')
3. The clone ID from which the EST was generated
4. Organism
5. Tissue and/or conditions
6. The sequence

The EST can be stored in FASTA format:


>T27784 EST16067 Human Endothelial cells Homo sapiens cDNA 5'
CCCCCGTCTCTTTAAAAATATATATATTTTAAATATACTTAAATATATATTTCTAATAT
CTTTAAATATATATATATATTTNAAAGACCAATTTATGGGAGANTTGCACACAGAT
GTGAAATGAATGTAATCTAATAGANGCCTAATCAGCCCACCATGTTCTCCACTGAA
AAATCCTCT TTCTTTGGGGTTTTTCTTTCTTTCTTTTT………
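
A record like the one above can be read with a few lines of code. This is a minimal sketch that assumes plain FASTA text with ">" header lines; the function name parse_fasta is hypothetical, and the example record keeps only the first 20 bases of the sequence shown above.

```python
def parse_fasta(text):
    """Parse FASTA text into a list of (sequence_id, description, sequence).

    The sequence ID is the first word of the header line; the rest of the
    header is returned as a free-text description. Sequence lines are
    joined, and any spaces inside them are removed.
    """
    records = []
    header, chunks = None, []

    def flush():
        # Store the record collected so far, if there is one.
        if header is not None:
            seq_id, _, description = header.partition(" ")
            records.append((seq_id, description.strip(), "".join(chunks)))

    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            flush()
            header, chunks = line[1:].strip(), []
        elif header is not None:
            chunks.append(line.replace(" ", ""))
    flush()
    return records


if __name__ == "__main__":
    example = (
        ">T27784 EST16067 Human Endothelial cells Homo sapiens cDNA 5'\n"
        "CCCCCGTCTC\n"
        "TTTAAAAATA\n"
    )
    for seq_id, description, sequence in parse_fasta(example):
        print(seq_id, "|", description, "|", len(sequence), "bases")
```

Note that the header carries most of the items listed earlier (ID, clone or library name, tissue, organism, read end) as free text, so anything beyond the ID has to be pulled out by convention rather than by a fixed format.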


In this respect,
ESTs have become a tool to refine the predicted transcripts for
those genes, which leads to the prediction of their protein
products and ultimately their function.

An EST also gives information on the conditions in which the
corresponding gene is active, that is, the situation in which the EST
was obtained (tissue, organ, disease state, e.g. cancer).

ESTs contain enough information to permit the design of
precise probes that then can be used to determine the gene
expression.
• Fast and cheap (almost all steps are automated)
• They represent the most extensive available survey of the
transcribed portion of genomes.
• They are necessary for gene structure prediction, gene
discovery and genome mapping:
 -> provide experimental evidence for the position of exons
 -> provide regions coding for potentially new proteins
 -> characterization of splice variants and alternative polyadenylation
• Provide an alternative to library screening
 -> a short tag can lead to a cDNA clone
• Provide an alternative to full-length cDNA sequencing
 -> sequences of multiple ESTs can reconstitute a full-length cDNA
• Single Nucleotide Polymorphism (SNP) data mining



In short, the human and mouse genomes are remarkably
similar not only in the structure of their chromosomes but
also at the level of DNA sequence. Scientists have
reported similarities between the two species for decades
but never with the detail that is possible by lining up two
genome sequences.

The new findings, by researchers at Celera Genomics in
Rockville, Maryland, provide the strongest evidence yet
that the mouse is a useful model for understanding
human health and disease. Almost any gene in humans is
going to be present in mice and vice versa, the team
concludes.

The Celera team compared mouse chromosome 16 with its
corresponding regions of the human genome. Much of this
chromosome corresponds to human chromosome 21, which
contains genes involved in Down syndrome and similar
disorders.

Fourteen genes on mouse chromosome 16 are not found in
humans. All the others—more than 700 mouse genes—have
counterparts in the human genome, most of which are grouped
together and in the same order as in the mouse genome. The
sequence data on mouse chromosome 16 have been deposited in
the public database called GenBank.

The Celera team, led by Richard J. Mural, identified 11,822 short
segments of mouse DNA that correspond to just one region of the
human genome. The order and orientation of DNA in these
segments are nearly identical in both genomes for 99% of the
segments. The segments are about 200 base pairs long and are
called 'syntenic anchors'.

Analyzing & comparing genetic material
from different species to study evolution,
gene function, and inherited disease

Understand what makes different species unique by comparing:
1. Gene location
2. Gene structure
 ▪ Exon number
 ▪ Exon lengths
 ▪ Intron lengths
 ▪ Sequence similarity
3. Gene characteristics
 ▪ Splice sites
 ▪ Codon usage
 ▪ Conserved synteny
Figure 1. Regions of the human and mouse homologous genes: coding
exons (white), noncoding exons (gray), introns (dark gray), and intergenic
regions (black). Corresponding strong (white) and weak (gray) alignment
regions are shown connected with arrows.

By comparing the composition of different genomes, scientists can
better understand the evolutionary history of a given genome.

Genome composition describes the makeup of a haploid genome, which
should include:
1. genome size,
2. the proportions of non-repetitive DNA and repetitive DNA, in detail.
Unique and Repetitive DNA Sequence in Eukaryotes
• Non-repetitive DNA:
 ▪ Present only once per genome ("single copy", "solitary")
 ▪ Repetition frequency R = 1 or 2
 ▪ Found in prokaryotes and eukaryotes
 ▪ Much information, high complexity
• Intermediate (moderately) repetitive DNA:
 ▪ Repeated several times (10-1000X) per genome
 ▪ 10 < R < 10,000
 ▪ Dispersed throughout the genome in eukaryotes
 ▪ Little information, moderate complexity
• Highly repetitive DNA:
 ▪ Short repetitive DNA (<100 bp)
 ▪ Present up to 1 million times in the eukaryotic genome
 ▪ R > 100,000
 ▪ Almost no information, low complexity

Low-complexity regions are often defined as regions of
biased composition containing simple sequence repeats

Sequences like ATATATACTTATATA which are mostly two
letters are called low-complexity.

The low complexity may be pre-conditioned by strong
inequality in nucleotide content (biased composition), by
tandem or dispersed repeats or by palindrome-hairpin
structures, as well as by a combination of all these factors.

The low-complexity sequence can also be hidden at the
translated protein level.

The search for DNA regions with low complexity is one of
the tasks of modern structural analysis of complete
genomes.
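
One simple way to flag low-complexity stretches is to measure the Shannon entropy of the base composition in a sliding window. This is a rough sketch of that idea only, not the method of any particular masking program; the window size and threshold are arbitrary assumptions and the function names are hypothetical.

```python
import math
from collections import Counter


def shannon_entropy(seq):
    """Entropy of the base composition in bits (0 = one letter, 2 = uniform ACGT)."""
    counts = Counter(seq.upper())
    n = len(seq)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def low_complexity_windows(seq, window=12, threshold=1.5):
    """Return (start, entropy) for every window whose entropy is below threshold.

    window and threshold are illustrative defaults, not standard values.
    """
    hits = []
    for start in range(len(seq) - window + 1):
        h = shannon_entropy(seq[start:start + window])
        if h < threshold:
            hits.append((start, round(h, 3)))
    return hits


if __name__ == "__main__":
    print(shannon_entropy("ATATATACTTATATA"))   # mostly two letters: low entropy
    print(shannon_entropy("ACGTACGTACGT"))      # all four letters equally: 2 bits
    print(low_complexity_windows("ACGTACGTACGTATATATATATAT"))
```

Composition alone does not capture every case mentioned above: a perfectly balanced tandem repeat such as ACGTACGT still scores 2 bits, so repeat and hairpin detection need additional checks.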
Solitary genes: About 25-50 percent of the protein-coding genes are
represented only once in the haploid genome.
Duplicated genes: These genes have close but non-identical sequences
and are often located within 5-50 kb of one another. A set of such genes
is called a "gene family".
Each gene family can contain from a few to 30 members.
Total Number of Genes and Duplicated Genes
• In bacteria, most of the genes are unique, so the number of
distinct families is close to the total gene number.
• In eukaryotes, many genes are duplicated, and as a result the
number of different gene families is much less than the total
number of genes.
Proportions of Unique and Duplicated Genes
The proportion of unique genes drops sharply with genome size:
bacteria have the highest proportion of unique genes, and the proportion
is much lower in yeast, flies, worms and Arabidopsis.
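
The relation between total gene number, number of families and proportion of unique genes can be made concrete with a small tally. This is a toy sketch: the gene and family labels are invented, and the function name family_summary is hypothetical. A unique gene is counted as a family with a single member.

```python
from collections import Counter


def family_summary(gene_to_family):
    """Summarize a mapping of gene name -> family label."""
    sizes = Counter(gene_to_family.values())   # family label -> member count
    total = len(gene_to_family)
    unique = sum(1 for n in sizes.values() if n == 1)
    return {
        "total_genes": total,
        "families": len(sizes),
        "unique_genes": unique,
        "unique_fraction": unique / total if total else 0.0,
    }


if __name__ == "__main__":
    # Invented example: one three-member family and two unique genes.
    toy_genome = {
        "geneA": "fam1",
        "geneB": "fam2",
        "geneC": "fam2",
        "geneD": "fam2",
        "geneE": "fam3",
    }
    print(family_summary(toy_genome))
```

When most genes are unique, as in bacteria, the families count stays close to total_genes; duplication pushes the two numbers apart.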
Gene family: A set of duplicated genes that encode proteins with
similar but not identical amino acid sequences.
• A collection of identical or similar genes, derived from a single
ancestral gene
• Clustered or dispersed throughout the genome
• Identical genes: examples include rRNA and histone genes
• Nonidentical genes: globin genes (α and β)
• The genes encoding β-globins are a good example of a gene family; it
contains five functional genes: β, δ, Aγ, Gγ, and ε


Multigene family: a collection of genes that are similar or identical
in sequence and presumably of common ancestral origin.
Multigene families include the genes for the major rRNA molecules; huge
tandem repeats of these genes enable cells to make millions of ribosomes
during active protein synthesis.
• In vertebrates and invertebrates, the genes encoding rRNAs and some
other noncoding RNAs such as snRNA are arranged in tandemly repeated
arrays.
• These tandemly repeated genes appear one after the other and encode
identical or almost identical proteins or functional RNAs.
• The tandemly repeated rRNA and snRNA genes are needed to meet the
great cellular demand for their transcripts. Example: cells have 100
copies or more of 5S rRNA genes.
• Multiple copies of tRNA and histone genes are also present in
clusters, but generally not in tandem repeats.
A Tandem rDNA Gene Cluster
[Figure: electron micrograph of a tandem cluster of rRNA genes being
transcribed into RNA in Xenopus oocytes. The green arrow indicates DNA
and the red arrow indicates RNA. The micrograph was taken by
O.L. Miller, Jr. and Barbara R. Beatty at Oak Ridge National Lab.]
[Figure: ideograms of the human chromosomes, G-bands.]
Tandem repeats occur on every chromosome:
• Telomeres
• Centromeres
• 5 clusters of tandemly repeated rRNA genes, on the short arms of
chromosomes 13, 14, 15, 21 and 22
Nonidentical genes: globin genes (α and β)
Many genes occur as multigene families, which can be clustered on the
same chromosome or scattered throughout the genome.
Families can be
▪ Clustered: nearby on chromosomes (α-globins, Hox A)
▪ Dispersed: on various chromosomes (actin, tubulin)
Members of clusters may
▪ show stage- or tissue-specific expression
▪ This implies a means for co-regulation as well as individual
regulation

They arise over time from mutations that
accumulate in duplicated genes. They
evolved from a common ancestor

Original α & β genes evolved from
duplication of a common ancestral globin
gene
Key to Greek letters in the figure: zeta (ζ), nu (ν), psi (ψ), delta (δ)
Transposition separated the α globin and β globin families, so they
exist on different chromosomes
▪ Globin genes increase in number from primitive fish to humans
▪ Clusters evolve by duplication and divergence


Gene number tends
to increase with
evolutionary
complexity.


The history of gene families can be traced by comparing sequences.
• The molecular clock model holds that the rate of change within a
group is relatively constant.
• The distance between related sequences, combined with the clock,
leads to an inference about when the duplication took place.
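
The clock inference can be illustrated with simple arithmetic. This sketch assumes the simplest possible model: the distance is the uncorrected proportion of differing sites between two aligned, equal-length sequences, both copies change at the same constant rate, and the time since duplication is therefore distance / (2 x rate). The sequences and the rate in the example are invented for illustration, and the function names are hypothetical.

```python
def p_distance(seq_a, seq_b):
    """Proportion of sites that differ between two aligned sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    if not seq_a:
        raise ValueError("sequences must not be empty")
    differences = sum(1 for a, b in zip(seq_a.upper(), seq_b.upper()) if a != b)
    return differences / len(seq_a)


def divergence_time(distance, rate_per_site_per_year):
    """Years since duplication under a strict clock.

    The factor 2 accounts for changes accumulating independently
    in both copies after the duplication.
    """
    return distance / (2 * rate_per_site_per_year)


if __name__ == "__main__":
    copy_1 = "ACGTACGTAC"   # invented sequences, one difference in ten sites
    copy_2 = "ACGTACGTAA"
    d = p_distance(copy_1, copy_2)
    # The rate is an assumed value for illustration, not a measured one.
    print(d, divergence_time(d, rate_per_site_per_year=1e-9))
```

For large distances the uncorrected proportion underestimates the true number of changes, because repeated substitutions at the same site are counted once at most, so a corrected distance would be needed in practice.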