Genome Analysis & Gene Prediction

Overview of Genes
Gene: the whole nucleic acid sequence necessary for the synthesis of a functional protein (or functional RNA).
A human cell contains approximately 23,000 genes.
- Some are expressed in all cells all the time. These so-called housekeeping genes are responsible for the routine metabolic functions (e.g. respiration) common to all cells.
- Some are expressed as a cell enters a particular pathway of differentiation.
- Some are expressed all the time, but only in cells that have differentiated in a particular way. For example, a liver cell continuously expresses the genes for metabolizing enzymes.
- Some are expressed only as conditions around and in the cell change. For example, the arrival of a hormone (due to environmental or other factors) may turn certain genes in that cell on or off.

How Is Gene Expression Regulated?
To understand gene expression, we first look at the basic structure of a gene.
[Figure: genomic DNA, 5' to 3': upstream region, primary transcript, downstream region.]

About the Upstream Region of a Gene
[Figure: upstream region of a gene. Labels: upstream promoter/regulatory region with distal (GC box), central (CAAT box), and proximal parts; core/basal promoter (TATA box).]

About the Core Promoter
The basal or core promoter is located within about 40 base pairs (bp) of the transcription start site (TSS).
It is found in all protein-coding genes. This is in sharp contrast to the upstream promoter, whose structure and associated binding factors differ from gene to gene.
It contains a TATA box (either the canonical TATA box or a TATA variant).
It is bound by a large complex of some 50 different proteins, including:
- Transcription Factor IID (TFIID), a complex of TATA-binding protein (TBP), which recognizes and binds to the TATA box, and 14 other protein factors, which bind to TBP and to each other but not to the DNA.
- Transcription Factor IIB (TFIIB), which binds both the DNA and RNA polymerase II (pol II).

About Upstream Promoter/Regulatory Regions
The "upstream" promoter may extend over as many as 200 bp farther upstream. It has three regions:
- Proximal region: insulators may be present in this region. Insulators are stretches of DNA (as few as 42 base pairs) located between the enhancer(s) and promoter, or the silencer(s) and promoter, of adjacent genes or clusters of adjacent genes. Their function is to prevent a gene from being influenced by the enhancer (or silencer) of its neighbors.
- Central region: silencers may be present in this region. Silencers are control regions of DNA that may be located thousands of base pairs away from the gene they control. When transcription factors bind to them, expression of the gene they control is repressed.
- Distal region: enhancers may be present in this region. Enhancers are regions of DNA, possibly thousands of base pairs away from the gene they control, to which transcription factors bind. Binding increases the rate of transcription of the gene. Enhancers can be located upstream, downstream, or even within the gene they control.

About the Primary Transcript
[Figure: primary transcript within genomic DNA, 5' to 3'. Labels: TSS; exons and introns; start codon (ATG); donor site (GT) and acceptor site (AG) at the ends of each intron; stop codon (TGA); mRNA running from ATG to TGA.]

The primary transcript consists of:
Cap region: the 5' cap is a specially altered nucleotide on the 5' end of precursor messenger RNA.
5'-UTR: regions of the gene outside the CDS are called UTRs (untranslated regions). They are mostly ignored by gene finders, though they are important for regulatory functions.
Coding sequence (CDS): the CDS of a gene is delimited by four types of signals: start codons (ATG in eukaryotes), stop codons (usually TAG, TGA, or TAA), donor sites (usually GT), and acceptor sites (AG).
3'-UTR: the three prime untranslated region (3' UTR) is the section of messenger RNA (mRNA) that follows the stop codon.
Poly-A tail: polyadenylation is the addition of a poly(A) tail to an RNA molecule. The poly(A) tail consists of multiple adenosine monophosphates.

About Introns and Exons
Intron: the term is derived from "intragenic region", i.e. a region inside a gene. Introns are sometimes called intervening sequences: any of several families of internal nucleic acid sequences that are not present in the final gene product.
Exon: a sequence that is present in the mature form of an RNA molecule after the introns have been removed. The mature RNA molecule can be a messenger RNA or a functional form of a non-coding RNA such as rRNA or tRNA.

More about Exons
Four types of exons are defined:
- initial exons extend from a start codon to the first donor site;
- internal exons extend from one acceptor site to the next donor site;
- final exons extend from the last acceptor site to the stop codon;
- single exons (which occur only in intronless genes) extend from the start codon to the stop codon.
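The exon types and CDS signals described above can be illustrated with a small sketch. It assumes a forward-strand gene whose coding exons are given as 0-based, half-open (start, end) intervals; the function names and the toy sequence are hypothetical and invented for illustration only.

```python
STOP_CODONS = {"TAG", "TGA", "TAA"}

def classify_exons(exons):
    """Label each coding exon as single, initial, internal, or final."""
    if len(exons) == 1:
        return ["single"]
    return ["initial"] + ["internal"] * (len(exons) - 2) + ["final"]

def check_signals(seq, exons):
    """Return a list of CDS signals that are missing from the gene model.

    exons: sorted 0-based half-open (start, end) intervals on the forward strand.
    """
    seq = seq.upper()
    problems = []
    first_start, last_end = exons[0][0], exons[-1][1]
    if seq[first_start:first_start + 3] != "ATG":
        problems.append("no ATG start codon")
    if seq[last_end - 3:last_end] not in STOP_CODONS:
        problems.append("no stop codon")
    # Each intron lies between the end of one exon and the start of the next.
    for (_, end), (next_start, _) in zip(exons, exons[1:]):
        if seq[end:end + 2] != "GT":
            problems.append(f"intron starting at {end} lacks a GT donor site")
        if seq[next_start - 2:next_start] != "AG":
            problems.append(f"intron ending at {next_start} lacks an AG acceptor site")
    return problems

# Toy gene (invented): exon ATGGCC, intron GTAAAG, exon CCCTGA.
toy_seq = "ATGGCCGTAAAGCCCTGA"
toy_exons = [(0, 6), (12, 18)]
print(classify_exons(toy_exons))          # ['initial', 'final']
print(check_signals(toy_seq, toy_exons))  # []
```

Real gene finders score these signals probabilistically rather than requiring exact matches, since donor and acceptor sites are only "usually" GT and AG.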
Structure of a Gene: A Hypothetical Example
[Figure: gene parse tree.]

Gene Prediction
• Analysis by sequence similarity can reliably identify only about 30% of the protein-coding genes in a genome.
• 50-80% of newly identified genes have a partial, marginal, or unidentified homolog.
• Frequently expressed genes tend to be more easily identifiable by homology than rarely expressed genes.
• Gene finding is species-specific:
  • Codon usage patterns vary by species.
  • Functional regions (promoters, translation initiation sites, termination signals) vary by species.
  • Common repeat sequences are species-specific.
• Gene finding programs rely on this information to identify coding regions.

Protein-Coding Genes
Ab initio gene finding using computational methods is best suited to protein-coding genes.
Protein-coding genes have recognizable features:
• open reading frames (ORFs)
• codon bias
• known transcriptional and translational start and stop motifs (promoters, 3' poly-A sites)
• splice consensus sequences at intron-exon boundaries

Ab initio gene discovery
• Protein-coding genes have recognizable features.
• We can design software to scan the genome and identify these features.
• Some of these programs work quite well, especially in bacteria and simpler eukaryotes with smaller and more compact genomes.
• It is a lot harder for the higher eukaryotes, where there are many long introns, genes can be found within the introns of other genes, etc.
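As a minimal illustration of scanning for one of these features, the sketch below finds open reading frames on the forward strand. It is not how any particular gene finder works: it assumes an intron-free sequence (closer to the bacterial case), ignores the reverse strand, and the function name and minimum-length parameter are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(seq, min_codons=3):
    """Return (start, end, frame) for each ATG...stop ORF on the forward strand.

    Coordinates are 0-based, end is exclusive and includes the stop codon.
    min_codons counts the start and stop codons.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None  # position of the first ATG seen since the last stop
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None
    return orfs

# Toy sequence (invented): CC ATG AAA TTT TAG GG
print(find_orfs("CCATGAAATTTTAGGG"))  # [(2, 14, 2)]
```

In eukaryotic genomes an ORF scan alone is not enough, because coding sequence is interrupted by introns; that is why the splice signals and codon bias listed above are combined with it.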
Ab initio gene discovery: validating predictions and refining gene models
Standard types of evidence for validating predictions include:
• match to a previously annotated cDNA
• match to an EST from the same organism
• similarity of the nucleotide or conceptually translated protein sequence to sequences in GenBank
• protein structure prediction match to a Pfam domain
• association with recognized promoter sequences, e.g. TATA box, CpG island
• known phenotype from mutation of the locus

Finding Non-Protein-Coding Genes
Non-protein-coding genes (tRNA, rRNA, snoRNA, siRNA, miRNA, various other ncRNAs) are harder to find than protein-coding genes, because:
• they are often not poly-A tailed, so they do not end up in cDNA libraries
• they have no ORF
• the constraint on sequence divergence acts at the nucleotide level, not the protein level, so homology is harder to detect
To find non-protein-coding genes, we rely on:
• secondary structure
• homology, especially alignment of related species
• experimental methods:
  • isolation through non-poly-A-dependent cloning methods
  • microarrays

Ab initio gene discovery: approaches
Most gene-discovery programs make use of some form of machine learning algorithm. A machine learning algorithm requires a training set of input data that the computer uses to "learn" how to find a pattern.
Common approaches used in gene discovery (and many other bioinformatics applications) include:
• dynamic programming models
• artificial neural networks (ANNs)
• hidden Markov models (HMMs)

Control of Gene Expression: Transcription Factors
Transcription factors (TFs) are proteins that bind to the DNA and help to control gene expression.
The sequences to which they bind are transcription factor binding sites (TFBSs), which are a type of cis-regulatory sequence.
Most transcription factors can bind to a range of similar sequences. These can be described in either of two ways: as a consensus sequence, or as a position weight matrix (PWM).
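A position weight matrix can be sketched as follows. This is only an illustration under stated assumptions: the aligned sites are invented toy data, the background is taken as uniform (0.25 per base), a pseudocount of 1 is added to every count, and the function names are hypothetical rather than taken from any library.

```python
import math

BASES = "ACGT"

def build_pwm(sites, pseudocount=1.0, background=0.25):
    """Build a log2-odds PWM (one dict per position) from equal-length aligned sites."""
    total = len(sites) + 4 * pseudocount
    pwm = []
    for pos in range(len(sites[0])):
        column = [site[pos] for site in sites]
        pwm.append({
            base: math.log2(((column.count(base) + pseudocount) / total) / background)
            for base in BASES
        })
    return pwm

def score(pwm, window):
    """Sum the per-position log-odds scores; window must contain only A, C, G, T."""
    return sum(column[base] for column, base in zip(pwm, window))

def scan(pwm, seq, threshold):
    """Return (position, score) for every window scoring at or above the threshold."""
    width = len(pwm)
    hits = []
    for i in range(len(seq) - width + 1):
        s = score(pwm, seq[i:i + width])
        if s >= threshold:
            hits.append((i, s))
    return hits

# Invented TATA-like training sites, for illustration only.
toy_sites = ["TATAAA", "TATAAA", "TATATA", "TATAAA"]
toy_pwm = build_pwm(toy_sites)
for position, s in scan(toy_pwm, "GGCGTATAAAGGC", threshold=5.0):
    print(position, round(s, 2))
```

Unlike a consensus sequence, the matrix keeps the information that position 5 tolerates either A or T, so a near-match still receives a graded score instead of a simple yes or no.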
Once we know the binding site, we can search the genome to find all of the (predicted) binding sites.

Evidence-Based Approaches
• Comparative or similarity-based gene prediction
• Combine gene models with alignment to known ESTs and protein sequences

Gene Prediction Tools
• SNAP
• TwinScan
• Gnomon (NCBI)
• GeneWise
• Jigsaw
• GLEAN
• Grail
• BLAST
• FASTAX
• BLAT
• WABA
• MZEF, MZEF-SPC
• FGENESH

Genome Annotation: Much Work Remains
Despite good progress in identifying both protein-coding and non-protein-coding genes, much work remains to be done before even the best-studied genomes are fully annotated.
For the higher eukaryotes, only a tiny percentage of features such as TFBSs and other non-gene features have so far been identified.

References
http://users.rcn.com/jkimball.ma.ultranet/BiologyPages/P/Promoter.html