Genes are identified by open reading frames

Synonyms: ORFs, protein-coding sequences

Definitions
• An open reading frame is a sequence of nucleotides in DNA that contains no termination codon and so can potentially be translated into a polypeptide chain.
• An ORF begins with a start codon and contains no stop codon for a distance long enough to encode a protein.
• A reading frame that does not contain a nucleotide triplet which stops translation before formation of a complete polypeptide; abbreviation ORF.
Sources: NCI Thesaurus via the Unified Medical Language System at the National Library of Medicine; Merriam-Webster's Medical Dictionary by Merriam-Webster Inc.

• An open reading frame is a portion of a DNA molecule that, when translated into amino acids, contains no stop codons.
• The genetic code reads DNA sequences in groups of three bases, which means that a double-stranded DNA molecule can be read in any of six possible reading frames: three in the forward direction and three in the reverse.
• A long open reading frame is likely part of a gene.

Open Reading Frames (ORFs)
On a given piece of DNA there are 6 possible reading frames. An ORF can lie on either the + or the - strand, in any of the 3 frames of that strand:
Frame 1: the 1st base of the start codon falls at base 1, 4, 7, 10, ...
Frame 2: the 1st base of the start codon falls at base 2, 5, 8, 11, ...
Frame 3: the 1st base of the start codon falls at base 3, 6, 9, 12, ...
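The frame arithmetic above can be sketched in a few lines of Python. This is only an illustration, not the algorithm of any particular ORF-finding program: the function names and the min_len parameter are hypothetical, and it assumes the usual convention that an ORF runs from an ATG to the first in-frame stop codon (TAA, TAG or TGA).

```python
# Sketch: scan all six reading frames of a DNA string for ORFs.
# Assumptions (for illustration only): the function names, the min_len default,
# and the ATG ... stop-codon definition of an ORF.

STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the minus-strand sequence, read 5' to 3'."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def orfs_in_frame(seq, offset, min_len):
    """Yield (start, end) for each ATG...stop ORF in one frame.

    Positions are 0-based on the strand being scanned; end is exclusive
    and includes the stop codon.
    """
    start = None
    for i in range(offset, len(seq) - 2, 3):
        codon = seq[i:i + 3]
        if start is None and codon == "ATG":
            start = i                      # first in-frame start codon
        elif start is not None and codon in STOP_CODONS:
            if i + 3 - start >= min_len:   # keep only ORFs that are long enough
                yield start, i + 3
            start = None                   # look for the next ORF in this frame


def find_orfs(seq, min_len=100):
    """Return (frame, start, end) tuples over frames +1..+3 and -1..-3."""
    seq = seq.upper()
    hits = []
    for sign, strand in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            for start, end in orfs_in_frame(strand, offset, min_len):
                hits.append((f"{sign}{offset + 1}", start, end))
    return hits
```

Coordinates reported for the minus frames are relative to the reverse-complemented sequence, which keeps the scanning loop identical for both strands.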
Frames -1, -2 and -3 are the corresponding frames on the minus strand.

An open reading frame starts with ATG in most species and ends with a stop codon (TAA, TAG or TGA).

Six-frame translation can be done with a program called SIXFRAME, available at http://searchlauncher.bcm.tmc.edu/seq-util/Options/sixframe.html

Input: ACTGGGAAACCATTAAAACCATTATTTGGGGTATTACCA
Original sequence (39 nucleotides):
actgggaaaccattaaaaccattatttggggtattacca
Translation in forward direction, frame +1 (13 codons):
ThrGlyLysProLeuLysProLeuPheGlyValLeuPro

Genes Can Be Identified within Genomic DNA Sequences
• An ORF is defined here as a stretch of DNA of at least 100 bp that has a translation start codon and a stop codon.
• Scanning for open reading frames has identified more than 90% of the genes in bacteria and yeast.
• Both very short genes and long genes are missed by this method.
• For eukaryotic genes, the presence of multiple exons and introns means that ORF scanning is not a good method for identifying genes. Instead:
1. Computer programs are used to compare the genomic DNA sequence with cDNA sequences, splice-site sequences and the sequences of expressed sequence tags (ESTs).
2. Another powerful method for identifying human genes is to compare the human genomic sequence with that of the mouse, since human and mouse are sufficiently related to have most genes in common.

ESTs represent partial sequences of cDNA clones (300 to 700 bp).

[Figure: generating ESTs. mRNA (poly-A tail) -> first-strand DNA synthesis by reverse transcriptase -> RNA degradation and second-strand DNA synthesis by DNA polymerase -> cloning (vector MCS flanked by T3 and T7) and sequencing from the 5' and 3' ends.]

An expressed sequence tag (EST) is a short sub-sequence of a cDNA sequence. ESTs may be used to identify gene transcripts and are instrumental in gene discovery and gene sequence determination. Because the clones consist of DNA that is complementary to mRNA, the ESTs represent portions of expressed genes.
ESTs may be represented in databases either as cDNA/mRNA sequence or as the reverse complement of the mRNA, the template strand.

The identification of ESTs has proceeded rapidly, with approximately 74.2 million ESTs now available in public databases (e.g. GenBank, 1 January 2013, all species). The current understanding of the human set of genes (as of 2006) includes the existence of thousands of genes based on EST evidence. In this respect, ESTs have become a tool to refine the predicted transcripts for those genes, which leads to the prediction of their protein products and ultimately their function.

In bioinformatics, FASTA format is a text-based format for representing either nucleotide sequences or peptide sequences, in which nucleotides or amino acids are represented using single-letter codes. The format also allows sequence names and comments to precede the sequences. The format originates from the FASTA software package but has become a standard in the field of bioinformatics.

Each EST must have the following information:
1. A sequence ID (e.g. sequence-run ID)
2. Location with respect to the poly-A tail (3' or 5')
3. The clone ID from which the EST was generated
4. Organism
5. Tissue and/or conditions
6. The sequence

The EST can be stored in FASTA format:
>T27784 EST16067 Human Endothelial cells Homo sapiens cDNA 5'
CCCCCGTCTCTTTAAAAATATATATATTTTAAATATACTTAAATATATATTTCTAATAT
CTTTAAATATATATATATATTTNAAAGACCAATTTATGGGAGANTTGCACACAGAT
GTGAAATGAATGTAATCTAATAGANGCCTAATCAGCCCACCATGTTCTCCACTGAA
AAATCCTCT TTCTTTGGGGTTTTTCTTTCTTTCTTTTT………

The situation in which ESTs are obtained (tissue, organ, disease state, e.g. cancer) gives information on the conditions in which the corresponding gene is acting. ESTs also contain enough information to permit the design of precise probes that can then be used to measure gene expression.
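The FASTA layout described above (a header line starting with ">", followed by one or more sequence lines) is simple enough to read with a short function. The following is a minimal sketch, not the parser of the FASTA package or of any library; the name parse_fasta is hypothetical, and it assumes that the first word of the header is the sequence ID and the rest is free-text description.

```python
def parse_fasta(text):
    """Parse FASTA text into a list of (seq_id, description, sequence) tuples."""
    records = []
    seq_id, description, chunks = None, "", []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue                       # skip blank lines
        if line.startswith(">"):
            if seq_id is not None:         # close the previous record
                records.append((seq_id, description, "".join(chunks)))
            header = line[1:].split(None, 1)
            seq_id = header[0] if header else ""
            description = header[1] if len(header) > 1 else ""
            chunks = []
        elif seq_id is not None:
            # Drop any whitespace inside the line and normalise to upper case.
            chunks.append("".join(line.split()).upper())
    if seq_id is not None:                 # close the last record
        records.append((seq_id, description, "".join(chunks)))
    return records


# Usage with the header and first sequence line of the EST record shown earlier.
example = """>T27784 EST16067 Human Endothelial cells Homo sapiens cDNA 5'
CCCCCGTCTCTTTAAAAATATATATATTTTAAATATACTTAAATATATATTTCTAATAT
"""
for seq_id, description, sequence in parse_fasta(example):
    print(seq_id, len(sequence), description)
```

Ambiguity codes such as N are kept as they are, since they are a legitimate part of EST sequences.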
Advantages of ESTs:
• Fast and cheap (almost all steps are automated).
• They represent the most extensive available survey of the transcribed portion of genomes.
• They are necessary for gene structure prediction, gene discovery and genome mapping:
-> provide experimental evidence for the position of exons
-> provide regions coding for potentially new proteins
-> allow characterization of splice variants and alternative polyadenylation
• They provide an alternative to library screening:
-> a short tag can lead to a cDNA clone
• They provide an alternative to full-length cDNA sequencing:
-> the sequences of multiple ESTs can reconstitute a full-length cDNA
• They support single nucleotide polymorphism (SNP) data mining.

In short, the human and mouse genomes are remarkably similar, not only in the structure of their chromosomes but also at the level of DNA sequence. Scientists have reported similarities between the two species for decades, but never with the detail that is possible by lining up two genome sequences. The new findings, by researchers at Celera Genomics in Rockville, Maryland, provide the strongest evidence yet that the mouse is a useful model for understanding human health and disease. Almost any gene in humans is going to be present in mice and vice versa, the team concludes.

The Celera team compared mouse chromosome 16 with its corresponding regions of the human genome. Much of this chromosome corresponds to human chromosome 21, which contains genes involved in Down syndrome and similar disorders. Fourteen genes on mouse chromosome 16 are not found in humans. All the others, more than 700 mouse genes, have counterparts in the human genome, most of which are grouped together and in the same order as in the mouse genome. The sequence data on mouse chromosome 16 have been deposited in the public database GenBank.

The Celera team, led by Richard J. Mural, identified 11,822 short segments of mouse DNA that correspond to just one region of the human genome.
The order and orientation of the DNA in these segments are nearly identical in both genomes for 99% of the segments. The segments are about 200 base pairs long and are called "syntenic anchors."

Comparative genomics: analyzing and comparing genetic material from different species to study evolution, gene function and inherited disease, and to understand what is unique to each species.

Features compared:
1. Gene location
2. Gene structure
-> exon number
-> exon lengths
-> intron lengths
-> sequence similarity
3. Gene characteristics
-> splice sites
-> codon usage
-> conserved synteny

Figure 1. Regions of the human and mouse homologous genes: coding exons (white), noncoding exons (gray), introns (dark gray), and intergenic regions (black). Corresponding strong (white) and weak (gray) alignment regions are shown connected with arrows.

By comparing the composition of different genomes, scientists can better understand the evolutionary history of a given genome. Genome composition describes the makeup of a haploid genome, which should include:
1. genome size
2. the detailed proportions of non-repetitive and repetitive DNA

Unique and Repetitive DNA Sequences in Eukaryotes
(R = repetition frequency)
• Non-repetitive DNA
-> present only once per genome ("single-copy" or "solitary" DNA), R = 1 or 2
-> found in prokaryotes and eukaryotes
-> much information, high complexity
• Intermediate (moderately) repetitive DNA
-> repeated several times (10-1000X) per genome, 10 < R < 10,000
-> dispersed throughout the genome in eukaryotes
-> little information, moderate complexity
• Highly repetitive DNA
-> short repetitive DNA (<100 bp) present up to 1 million times in the eukaryotic genome, R > 100,000
-> almost no information, low complexity

Low-complexity regions are often defined as regions of biased composition containing simple sequence repeats. Sequences like ATATATACTTATATA, which consist mostly of two letters, are called low-complexity.
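One simple way to put a number on "mostly two letters" is to measure how uneven the base composition is. The sketch below is an illustration only, not the method of any particular masking program: the function names are hypothetical, and any cutoff for calling a region low-complexity would have to be chosen by the user. It reports the Shannon entropy of the base composition (2 bits per base for a perfectly even mix of A, C, G and T, 0 for a single repeated base) and the fraction of the sequence made up of its two most common letters.

```python
import math
from collections import Counter


def shannon_entropy(seq):
    """Shannon entropy of the base composition, in bits per base."""
    if not seq:
        return 0.0
    counts = Counter(seq.upper())
    total = sum(counts.values())
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def top_two_fraction(seq):
    """Fraction of the sequence made up of its two most common letters."""
    if not seq:
        return 0.0
    counts = Counter(seq.upper())
    return sum(n for _, n in counts.most_common(2)) / len(seq)


# The example sequence from the text: 14 of its 15 bases are A or T.
example = "ATATATACTTATATA"
print(top_two_fraction(example))
print(shannon_entropy(example))
```

Composition-based measures like these catch biased composition but not every kind of repeat; a sequence such as ACGTACGTACGT has maximal composition entropy even though it is a simple tandem repeat, so real tools usually look at short words rather than single bases.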
Low complexity may result from a strongly unequal nucleotide content (biased composition), from tandem or dispersed repeats, from palindrome-hairpin structures, or from a combination of these factors. Low-complexity sequence can also be hidden at the level of the translated protein. Searching for DNA regions of low complexity is one of the tasks of modern structural analysis of complete genomes.

Solitary and Duplicated Genes
• Solitary genes: about 25-50 percent of protein-coding genes are represented only once in the haploid genome.
• Duplicated genes: genes with close but non-identical sequences, often located within 5-50 kb of one another; such a set is called a "gene family."
• Each gene family may contain from a few to about 30 members.

Total Number of Genes and Duplicated Genes
• In bacteria most genes are unique, so the number of distinct families is close to the total gene number.
• In eukaryotes many genes are duplicated, so the number of different gene families is much smaller than the total number of genes.

Proportions of Unique and Duplicated Genes
The proportion of unique genes drops sharply with genome size: bacteria have the highest proportion of unique genes, and the proportion is much lower in yeast, flies, worms and Arabidopsis.

Gene family: a set of duplicated genes that encode proteins with similar but not identical amino acid sequences.
• A collection of identical or similar genes
• Derived from a single ancestral gene
• Clustered or dispersed throughout the genome
• Identical genes: examples include rRNA and histone genes
• Nonidentical genes: globin genes (α and β)

The genes encoding the β-globins are a good example of a gene family; it contains five functional genes: β, δ, Aγ, Gγ, and ε.

Multigene family: a collection of genes that are similar or identical in sequence and presumably of common ancestral origin. Multigene families include the genes for the major rRNA molecules; huge tandem repeats of these genes enable cells to make millions of ribosomes during active protein synthesis.

• In vertebrates and invertebrates, the genes encoding rRNAs and some other noncoding RNAs such as snRNA are arranged in tandemly repeated arrays.
• These tandemly repeated genes appear one after the other and encode identical or almost identical proteins or functional RNAs.
• The tandemly repeated rRNA and snRNA genes are needed to meet the great cellular demand for their transcripts. Example: cells have 100 or more copies of the 5S rRNA genes.
• Multiple copies of tRNA and histone genes are also present in clusters, but generally not in tandem repeats.

A Tandem rDNA Gene Cluster
A tandem gene cluster of rRNA genes.

Electron Micrograph of DNA Being Transcribed into RNA
• The green arrow indicates DNA and the red arrow indicates RNA.
• This micrograph was taken by O.L. Miller, Jr., and Barbara R.
Beatty at Oak Ridge National Lab, showing the transcription of tandem repeats of rRNA genes in Xenopus oocytes.

Human chromosomes (ideograms, G-bands): tandemly repetitive regions
• Tandem repeats on every chromosome: telomeres, centromeres
• 5 clusters of repeated rRNA genes: short arms of chromosomes 13, 14, 15, 21, 22

2- Nonidentical genes: globin genes (α and β)
• Many genes occur as multigene families, which can be clustered on the same chromosome or scattered throughout the genome.
• Families can be:
-> clustered: nearby on chromosomes (α-globins, Hox A)
-> dispersed: on various chromosomes (actin, tubulin)
• Members of clusters may show stage-specific or tissue-specific expression, which implies a means for co-regulation as well as individual regulation.
• They arise over time from mutations that accumulate in duplicated genes.
• They evolved from a common ancestor: the original α and β genes evolved from duplication of a common ancestral globin gene.

[Figure: globin genes, with a key to the Greek-letter gene names (zeta, nu, epsilon, psi, delta).]

• Transposition separated the α-globin and β-globin families, so they exist on different chromosomes.
• Globin genes increase in number from primitive fish to humans.

Clusters evolve by duplication and divergence
• Gene number tends to increase with evolutionary complexity.
• The history of gene families can be traced by comparing sequences.
• The molecular clock model holds that the rate of change within a group is relatively constant.
• The distance between related sequences, combined with the clock, leads to an inference about when the duplication took place.
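The clock reasoning above can be made concrete with a small calculation. If two duplicated genes differ at a fraction d of their aligned positions, and changes accumulate at a constant rate r per site per year in each copy, then the time since duplication is roughly d / (2r), because both copies have been diverging since the split. The sketch below assumes already-aligned sequences of equal length and uses the raw proportion of differences with no correction for multiple changes at the same site; the function names are hypothetical and the rate is an input the user must supply, not a value from this document.

```python
def p_distance(seq_a, seq_b):
    """Fraction of aligned positions at which two sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    if not seq_a:
        raise ValueError("sequences must not be empty")
    diffs = sum(1 for a, b in zip(seq_a.upper(), seq_b.upper()) if a != b)
    return diffs / len(seq_a)


def divergence_time(distance, rate_per_site_per_year):
    """Estimated years since two copies split, assuming a constant clock.

    Both copies accumulate changes independently, hence the factor of 2.
    """
    if rate_per_site_per_year <= 0:
        raise ValueError("rate must be positive")
    return distance / (2 * rate_per_site_per_year)


# Usage: two toy aligned sequences that differ at 1 of 10 positions,
# with a purely illustrative rate of 1e-9 changes per site per year.
d = p_distance("ACGTACGTAC", "ACGTACGTAA")
print(d, divergence_time(d, 1e-9))
```

For distant duplicates the raw proportion of differences underestimates the true amount of change, so in practice a corrected distance would be used in place of p_distance.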