Information Ensembl Glossary

Accession number - An accession number is a unique identifier given to a sequence when it is submitted to one of the DNA repositories (GenBank, EMBL, DDBJ).
Alu - A dispersed, intermediately repetitive DNA sequence found in the human genome in about one million copies. The sequence is about 300 bp long and is commonly found in introns, 3' untranslated regions of genes, and intergenic genomic regions. The name Alu comes from a recognition site for the AluI endonuclease that cleaves it. The Alu universal primer sequence is as follows: 5'-GTG GAT CAC CTG AGG TCA GGA GTT TC-3' (26-mer).
allele - One of the alternate forms of a specific gene. Each allele is an individual member of a gene pair and is inherited from one parent. When genes are considered "simply" as segments of a nucleotide sequence, the term refers to each of the possible alternative nucleotides at a specific position in the sequence.
API - An API (Application Programming Interface) is a set of routines that applications can use to request and carry out lower-level services performed by the operating system.
BAC - A BAC (Bacterial Artificial Chromosome) is a vector used to clone DNA fragments from another species (100 to 300 kb insert size; average, 150 kb) in bacteria. Once the foreign DNA has been cloned into the host bacteria, the bacteria can make many copies of it.
BLAST - Basic Local Alignment Search Tool (Altschul et al., J Mol Biol 215:403-410; 1990). A sequence comparison algorithm optimised for speed, which is used to search sequence databases for optimal local alignments to a query.
BLAT - BLAST-Like Alignment Tool (Kent, W.J. 2002. BLAT -- The BLAST-Like Alignment Tool. Genome Research 12(4): 656-664). An mRNA/DNA and cross-species protein sequence analysis tool that quickly finds sequences of 95% and greater similarity of length 40 bases or more.
BLOSUM 62 - Blocks Substitution Matrix (Henikoff and Henikoff, Proc Natl Acad Sci U S A 89:10915-10919; 1992). A matrix that defines scores for amino acid substitutions, reflecting the similarity of physicochemical properties and observed substitution frequencies. The BLOSUM 62 matrix is built from sequences sharing no more than 62% identity (sequences that are evolutionarily closer were represented by a single sequence in the alignment, to avoid bias from closely related family members).
Centimorgan (cM) - A unit of genetic distance, determined by how frequently two genes on the same chromosome are inherited together. One centimorgan equals 1% recombinant offspring.
cDNA - Complementary DNA obtained by reverse transcription of an mRNA template. In bioinformatics jargon, cDNA is thought of as a DNA version of the mRNA sequence. Generally, cDNAs are denoted in coding or 'sense' orientation.
CCDS - Consensus CDS is a core set of human protein coding regions that are consistently annotated between Ensembl, VEGA and RefSeq. The long-term goal is to support convergence towards a standard set of gene annotations on the human genome.
CDS - (Coding sequence) refers to the portion of a gene or an mRNA that codes for a protein. Introns are not coding sequences, nor are the 5' or 3' UTRs. The coding sequence in a cDNA or mature mRNA includes everything from the start codon through to the stop codon, inclusive.
Cigar - Cigar stands for Compact Idiosyncratic Gapped Alignment Report. The cigar line defines the sequence of matches/mismatches (M) and deletions or gaps (D). For example, the cigar line 2MD3M2D2M means that the alignment contains 2 matches/mismatches, 1 deletion (the number 1 is omitted to save space), 3 matches/mismatches, 2 deletions and 2 matches/mismatches.
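The cigar line described above can be expanded mechanically. The following is a minimal sketch in Python, not Ensembl's own code: the function names expand_cigar and apply_cigar are hypothetical, and only the M and D operations mentioned in this glossary are handled.

```python
import re


def expand_cigar(cigar: str) -> str:
    """Expand a cigar line such as '2MD3M2D2M' into one letter per column.

    A missing count is read as 1, as described in the Cigar entry.
    Only the M and D operations named in the glossary are recognised.
    """
    pairs = re.findall(r"(\d*)([MD])", cigar)
    return "".join(op * int(count or 1) for count, op in pairs)


def apply_cigar(sequence: str, cigar: str) -> str:
    """Insert '-' gap characters into sequence according to the cigar line."""
    columns = expand_cigar(cigar)
    if columns.count("M") != len(sequence):
        raise ValueError("cigar line does not account for every residue")
    residues = iter(sequence)
    # Each M consumes one residue; each D contributes a gap column.
    return "".join(next(residues) if op == "M" else "-" for op in columns)


if __name__ == "__main__":
    print(expand_cigar("2MD3M2D2M"))
    print(apply_cigar("AACGCTT", "2MD3M2D2M"))
```

The length check is a design choice of this sketch: it rejects a cigar line whose M count disagrees with the sequence length instead of silently truncating.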
If the original sequence is AACGCTT, the aligned sequence will be:
o Original sequence: AACGCTT
o Cigar line: 2MD3M2D2M
o Expanded cigar line: M M D M M M D D M M
o Aligned sequence: A A - C G C - - T T
Cosmid - DNA from a bacterial virus spliced with a small fragment of a genome (up to 50 kb) to be amplified and sequenced.
Clone - A segment of DNA that has been inserted into a vector molecule, such as a plasmid, and then replicated to form many identical copies.
Contig - A contig is a contiguous stretch of DNA sequence without gaps that has been assembled solely on the basis of direct sequencing information. The term is also used in other contexts. A clone contig is a group of cloned fragments of DNA covering overlapping regions of a particular chromosome. A sequence contig is an extended sequence created by merging sequences that overlap. A contig map shows the regions of a chromosome where contiguous DNA segments overlap. These maps allow the study of a complete segment of a genome by examining a series of overlapping clones covering a region of interest.
DDBJ - DDBJ is the sole DNA data bank in Japan that is officially certified to collect DNA sequences from researchers and to issue internationally recognized accession numbers to data submitters. Data is exchanged with EMBL/EBI and GenBank/NCBI on a daily basis, and the three data banks share virtually the same data at any given time.
Domain - A region of special biological interest within a single protein sequence. However, a domain may also be defined as a region within the three-dimensional structure of a protein that may encompass regions of several distinct protein sequences and that accomplishes a specific function. A domain class is a group of domains that share a common set of well-defined properties or characteristics.
Dotter - Ensembl DotterView is based on the program Dotter, a dot-matrix program with dynamic threshold control suited for genomic DNA and protein sequence analysis.
The Dotter tool provides a visual display of the sequence alignment it represents. The dotplot displays a detailed comparison of two sequences. Every residue in one sequence is compared to every residue in the other sequence. The first sequence runs along the x-axis and the second sequence along the y-axis. In regions where the two sequences are similar to each other, a row of high scores will run diagonally across the dot matrix. If you compare a sequence against itself to find internal repeats, the main diagonal scores maximally, since it is the 100% perfect self-match. To make the score matrix more intelligible, the pairwise scores are averaged over a sliding window that runs diagonally. The averaged score matrix forms a three-dimensional landscape, with the two sequences in two dimensions and the height of the peaks in the third. This landscape is projected onto two dimensions with the aid of greyscales: higher peaks are indicated by darker greys. Dotter was written by Erik L.L. Sonnhammer and Richard Durbin (Gene 167: GC1-10, 1995).
DWGA - (Derived from Whole Genome Alignments). Human versus chimpanzee exception: the human versus chimpanzee orthologue predictions were obtained in a completely different manner. Since the current chimpanzee genome sequence assembly is the result of low-coverage sequencing, the assembled sequence is of too poor quality to generate a gene set with the classical Ensembl gene build pipeline. The chimpanzee gene set produced by Ensembl has instead been generated by "projecting" human genes onto the chimpanzee genome through whole genome BLASTz alignments between both species and filtering for orthologous sequence alignments. The result of this procedure is de facto the human-chimpanzee orthologue set that has been Derived from Whole Genome Alignments (DWGA). See the Prediction Method section on a relevant Ensembl Gene Report page.
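The sliding-window averaging described in the Dotter entry can be illustrated with a small Python sketch. This is a simplification under stated assumptions, not Dotter's implementation: residues are scored by plain identity (1 for a match, 0 otherwise) instead of a substitution matrix, the window size is fixed, and the names dot_matrix and render are hypothetical. Text characters stand in for greyscale shades.

```python
def dot_matrix(seq_x: str, seq_y: str, window: int = 3) -> list[list[float]]:
    """Return scores[j][i]: the mean identity score over a diagonal window
    starting at position i of seq_x and position j of seq_y."""
    width = len(seq_x) - window + 1
    height = len(seq_y) - window + 1
    scores = []
    for j in range(height):
        row = []
        for i in range(width):
            # Walk down the diagonal and count matching residues.
            matches = sum(seq_x[i + k] == seq_y[j + k] for k in range(window))
            row.append(matches / window)
        scores.append(row)
    return scores


def render(scores: list[list[float]], shades: str = " .:#") -> str:
    """Project the score landscape onto text: denser characters mean higher peaks."""
    top = len(shades) - 1
    return "\n".join(
        "".join(shades[round(score * top)] for score in row) for row in scores
    )


if __name__ == "__main__":
    # Self-comparison of a sequence with an internal repeat (ACGT twice).
    print(render(dot_matrix("ACGTACGT", "ACGTACGT")))
```

In a self-comparison such as the one above, the main diagonal is expected to carry the maximum score, and the internal repeat shows up as a second, shorter diagonal offset by the repeat length.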
DUST - dust is a standalone application that looks for low-complexity sequences.
EMBL - (Nucleotide Sequence Database) Europe's primary nucleotide sequence resource. The main sources of the DNA and RNA sequences in the database are submissions from individual researchers, genome sequencing projects and patent applications.
ENCODE - The ENCyclopedia Of DNA Elements (ENCODE) project uses defined regions of the human genome to test and evaluate different methods and technologies for finding various functional elements in human DNA. The two main criteria for manually selected regions were the presence of well-studied genes or other known sequence elements, and the existence of a substantial amount of comparative sequence data. A total of 14.82 Mb of sequence was manually selected using this approach, consisting of 14 targets that range in size from 500 kb to 2 Mb.
Ensembl genes - Set of Ensembl gene predictions based on experimental evidence from protein sequences and/or near-full-length cDNAs available from public sequence databases. "Ensembl known genes" are predicted on the basis of species-specific database entries from the manually curated UniProt/Swiss-Prot, the partially manually curated RefSeq, and the UniProt/TrEMBL databases. Predictions of "Ensembl novel genes" are based on other experimental evidence, such as protein and cDNA sequence information from related species.
Eponine - Eponine is a probabilistic method for detecting transcription start sites (TSS) in mammalian genomic sequence, with good specificity and excellent positional accuracy. Eponine models consist of a set of DNA weight matrices recognizing specific sequence motifs. Each of these is associated with a position distribution relative to the TSS.
EST - (Expressed Sequence Tags) Coarse sequence reads from flanking vector regions into the inserts of cDNA libraries. ESTs act as physical markers for cloning and full-length sequencing of the cDNAs of expressed genes.
They are typically identified by purifying mRNAs, converting them to cDNAs, and then sequencing a portion of the cDNAs.
EST genes - Set of Ensembl gene predictions based solely on EST evidence. The process of EST gene prediction uses a combination of Exonerate, BLAST and Est2Genome to map ESTs onto the genomic sequence. Redundant ESTs are merged before GenomeWise is used to assign 5' and 3' UTRs to the longest ORF found. See Ensembl EST genes for a more complete explanation of the EST gene prediction process.
Exonerate - Exonerate is a fast gapped DNA-DNA alignment algorithm. It can be used for aligning various types of sequences such as genomic DNA, cDNAs/ESTs, and proteins.
Fgenes - FGENES, also known as Find Genes, is a human gene predictor that is based on pattern recognition of different types of exons, promoters and poly-A signals. It is built on linear discriminant functions for internal, 5'-coding, and 3'-coding exon recognition. It is designed to find the optimal combination of these components and to construct a set of gene models along a given sequence.
GENSCAN - GENSCAN (Burge, C. and Karlin, S. (1997) Prediction of complete gene structures in human genomic DNA. J. Mol. Biol. 268, 78-94) is an application for the identification of complete gene structures in genomic DNA. The splice site models used are described in more detail in: Burge, C. B. (1998) Modeling dependencies in pre-mRNA splicing signals. In Salzberg, S., Searls, D. and Kasif, S., eds. Computational Methods in Molecular Biology, Elsevier Science, Amsterdam, pp. 127-163.
GeneWise - GeneWise is a sequence analysis tool for comparing proteins or profile HMMs to DNA sequences, allowing for introns and frameshifts. The Wise2 package was written by Ewan Birney. More information about the package can be obtained at: http://www.ebi.ac.uk/Wise2/
GO - Gene Ontology Consortium.
An organized hierarchy of terms produced by the Gene Ontology (GO) Consortium, used to describe biological processes, cellular components, and molecular functions.
o Molecular Function Ontology. Tasks performed by individual gene products; examples are carbohydrate binding and ATPase activity.
o Biological Process Ontology. Broad biological goals, such as mitosis or purine metabolism, that are accomplished by ordered assemblies of molecular functions.
o Cellular Component Ontology. Subcellular structures, locations, and macromolecular complexes; examples include nucleus, telomere, and origin recognition complex.
A gene may be indexed under many GO terms, depending on the GO classification system. A gene product has one or more molecular functions and is used in one or more biological processes; it might be associated with one or more cellular components. For instance, cytochrome c can be described by the molecular function term electron transporter activity, the biological process terms oxidative phosphorylation and induction of cell death, and the cellular component terms mitochondrial matrix and mitochondrial inner membrane.
Haplotypes - SNPs are not randomly or independently distributed on different chromosomes, but tend to be associated with one another. Haplotypes are a way of denoting a group of several SNPs closely linked physically on a chromosome.
HGVbase - The Human Genome Variation Database (HGVbase) provides an accurate, comprehensive, high-utility catalog of normal human gene and genome variation to aid research into human phenotypic variation. Records are highly curated and annotated to ensure maximal utility and data accuracy. HGVbase is a collaboration between the Karolinska Institute (Sweden) and the European Bioinformatics Institute (UK).
InterPro - InterPro is an integrated resource for protein families, domains and sites, combining information from several different protein signature databases.
InterPro IDs are linked to the summary of information about that domain or family. InterPro is managed by EBI. A number of databases (SwissProt, TrEMBL, PROSITE, PRINTS, Pfam, ProDom, SMART, TIGRFAMs, PIR SuperFamilies and SUPERFAMILY) with different approaches to biological information are used to derive protein signatures. ProteinView, GeneView and DomainView provide links to the relevant InterPro entries.
Jalview - Jalview is a multiple alignment editor, used by the EBI clustalw server and the Pfam protein domain database, and is available as a general-purpose alignment editor.
Known genes - Known genes are transcripts that have been mapped by Ensembl to near-full-length protein sequences already available in the public sequence databases.
MGI - (Mouse Genome Informatics) houses a database that provides integrated access to data on the genetics, genomics, and biology of the mouse (Mus musculus).
MBRH - (Multiple Best Reciprocal Hit). When, due to gene duplications, there are multiple 'best' hits with identical score, E-value, % identity and % positivity, it is not possible to pick a unique orthologue for a gene. This results in more complex graphs of 'best' relationships. This often occurs when different genes have identical translations, which could be due to a duplication event, an assembly error, or chance. On average, 3% of the genes have an identical translation to some other gene, either within their own genome or in another genome.
o MBRH / DUP 1.# - MBRH set where in one genome there is only one gene, but the other genome has multiple genes, all on the same chromosome and within 1.5 megabases of each other. This could be due to recent gene duplication events where sequences have not diverged, or to a mis-assembly of the genome sequence leading to artificial, apparent gene duplications (e.g. MBRH / DUP 1.2 or MBRH / DUP 1.4).
o MBRH / SYN - This is a more complex MBRH set where there are multiple genes in each genome split across multiple chromosomes.
The one(s) labeled MBRH / SYN satisfy both the MBRH criteria and the RHS search criteria.
o MBRH / COMPLEX - This is a more complex MBRH set where there are multiple genes in each genome split across multiple chromosomes. This MBRH pair does not satisfy the RHS criteria.
Novel genes - Novel genes are genes that have been predicted by Ensembl on the basis of similarity to protein or cDNA sequences and/or ESTs, but could not be mapped with confidence to existing entries in any public sequence database.
OMIM - (Online Mendelian Inheritance in Man) A genetic knowledge database, first published in 1966 as Mendelian Inheritance in Man (MIM; currently in its 12th edition), that includes information and references, including links to MEDLINE and sequence records. Ensembl links both to OMIM entries for any gene, where available, and to a subset of this database, the OMIM Morbid Map, which presents the syndrome- and disease-associated genes described in OMIM.
Orthologue - Orthologues are genes derived from a common ancestor through vertical descent and can be thought of as the direct evolutionary counterpart. In contrast, paralogues are genes within the same genome that have evolved by duplication.
PDB - Protein Data Bank is a repository for 3-D biological macromolecular structure data. PDB archives protein structures deduced from crystallography and nuclear magnetic resonance (NMR) experiments. The Protein Data Bank (PDB) is operated by Rutgers, The State University of New Jersey; the San Diego Supercomputer Center at the University of California, San Diego; and the Center for Advanced Research in Biotechnology of the National Institute of Standards and Technology, three members of the Research Collaboratory for Structural Bioinformatics (RCSB). The RCSB PDB is supported by funds from the National Science Foundation, the Department of Energy, and the National Institutes of Health.
Pfam - Pfam is a large collection of multiple sequence alignments and hidden Markov models covering many common protein domains and families. Pfam can be used to view the domain organization of proteins, multiple alignments, protein domain architectures, protein structures, and species distributions.
Pmatch - Pmatch is a fast, exact matching program for aligning protein sequences with either protein or DNA sequence.
Prints - The PRINTS protein fingerprint database is a compendium of protein fingerprints. A fingerprint is a group of conserved motifs used to characterise a protein family; its diagnostic power is refined by iterative scanning of a SwissProt/TrEMBL composite. Usually the motifs do not overlap, but are separated along a sequence, though they may be contiguous in 3D space. Fingerprints can encode protein folds and functionalities more flexibly and powerfully than single motifs can, with full diagnostic potency deriving from the mutual context provided by motif neighbors.
Prosite - PROSITE is a database of protein families and domains run by the Expert Protein Analysis System (ExPASy) proteomics server of the Swiss Institute of Bioinformatics (SIB). It consists of biologically significant sites, patterns and profiles that help to reliably identify to which known protein family (if any) a new sequence belongs.
Pseudogenes - Processed pseudogenes result from reverse transcription of a mature mRNA and reinsertion into the genomic sequence. Ensembl detects potential processed pseudogenes among the Ensembl transcript predictions. See the Pseudogenes section for more information about how Ensembl detects pseudogenes.
Pre-release site - Initial annotations without gene predictions or validation, as well as improvements to the Web site, are often available on the pre-release site at http://dev.ensembl.org
QTL - (Quantitative Trait Locus). Genetic loci where allelic variation is associated with variation in a quantitative trait (e.g. blood pressure).
The presence of a QTL is inferred from genetic mapping. Total variation is partitioned into components linked to a number of discrete, mapped chromosome markers, described by statistical association with quantitative variation in a particular phenotypic trait that is thought to be controlled by the cumulative action of alleles at multiple loci.
Scaffold - Scaffolds are sets of ordered, oriented contigs positioned relative to each other by mate pairs whose reads are in adjacent contigs.
Synteny - The term synteny was originally defined to indicate that two gene loci share the same chromosome. In a genomic context, regions are referred to as syntenic if both sequence and gene order are conserved between two (closely related) species.
RefSeq - NCBI's Reference Sequences (RefSeq) database is a curated database of GenBank's genomes, mRNAs and proteins. RefSeq attempts to provide a comprehensive, integrated, non-redundant set of sequences, including genomic DNA, transcript (RNA), and protein products, providing a stable reference for gene identification and characterization, mutation analysis, expression studies, polymorphism discovery, and comparative analyses.
RepeatMasker - RepeatMasker (AFA Smit & P Green) is a standard software tool used in computational genomics to identify repetitive elements and low-complexity sequences.
RH map - Radiation Hybrid map. A technique for identifying landmarks (STSs) every 100 kb in the human genome; the ordering is derived from the frequency with which the landmarks are separated by radiation-induced breaks. The frequency is assayed by analysing a panel of human-hamster hybrid cell lines.
RHS - (Reciprocal Hit based on Synteny information). For closely related species (i.e. inside the vertebrate or arthropod phylum), where some gene order conservation is expected, additional orthologous pairs are identified by a combination of reciprocal BLAST and location information.
An RHS is a reciprocal pair where one direction is the best hit, but the reverse hit is less than best. To classify as RHS, the pair must also maintain synteny (conserved gene order) within 1.5 Mb of a UBRH or MBRH / DUP. Because this search looks at less than 'best' hits, a given gene can have both a UBRH orthologue prediction and an RHS orthologue prediction.
SEG - Seg divides sequences into contrasting segments of low complexity and high complexity. Low-complexity segments defined by the algorithm represent "simple sequences" or "compositionally biased regions". Segment lengths and the number of segments per sequence are determined automatically by the algorithm.
SGD - Saccharomyces Genome Database. Canonical database for the molecular biology and genetics of Saccharomyces cerevisiae.
Shotgun method - (also whole genome shotgun) Semi-automated sequencing method that involves sequencing randomly cloned pieces of the genome (size selected, usually 2, 10, 50 and 150 kb), with no prior knowledge of their location. The clones are then sequenced from both ends. The two ends of the same clone are referred to as mate pairs. The distance between two mate pairs can be inferred if the library size is known and has a narrow window of deviation. This approach can be contrasted with "directed" strategies, in which pieces of DNA from known chromosomal locations are sequenced.
SignalP - The SignalP application predicts the presence and location of signal peptide cleavage sites in amino acid sequences from different organisms: Gram-positive prokaryotes, Gram-negative prokaryotes, and eukaryotes. The method incorporates a prediction of cleavage sites and a signal peptide/non-signal peptide prediction based on a combination of several artificial neural networks. Signal peptides indicate a protein that will be secreted. Prediction of signal peptides is quite accurate; however, care must be exercised, and these regions should be verified by other means.
(Henrik Nielsen, Jacob Engelbrecht, Søren Brunak and Gunnar von Heijne. Identification of prokaryotic and eukaryotic signal peptides and prediction of their cleavage sites. Protein Engineering 10: 1-6 (1997).)
SNAP - SNAP (Synonymous/Non-synonymous Analysis Program) calculates synonymous and non-synonymous substitution rates from a set of codon-aligned nucleotide sequences, based on the method of Nei and Gojobori and incorporating a statistic developed in Ota and Nei.
SNAP - An ab initio gene prediction program developed by Ian Korf that models protein coding sequences in genomic DNA by means of hidden Markov models.
SNPs - Single Nucleotide Polymorphisms are common variations that occur in DNA at a frequency of 0.1%. Ensembl displays SNPs obtained from dbSNP (the SNP repository maintained by NCBI), the Human Genic Bi-Allelic Sequences Database (HGVbase) and The SNP Consortium Ltd. (TSC).
SSAHA - (Sequence Search and Alignment by Hashing Algorithm) is designed to detect exact matches, or nearly exact matches, in DNA or protein databases. The SSAHA search has been optimized for alignments of high percentage identity and displays as results the most significant matches for ungapped alignments between sequences. Each exact match in an SSAHA alignment is analogous to finding a high-scoring segment pair in BLAST. A number of consecutive matches on a contig may represent features of a gene such as exons or 5' and 3' untranslated regions, depending on the nature of the query sequence.
STS markers - STS markers are short sequences of genomic DNA that can be uniquely amplified by the polymerase chain reaction (PCR) using a pair of primers. Because each is unique, STSs are often used in linkage and radiation hybrid mapping techniques. STSs serve as landmarks on the physical map of the human genome.
supercontigs - Assemblies consist of sequence contigs combined into scaffolds, also known as supercontigs.
Supercontigs are combined and ordered according to their orientation and the linking information provided by mated sequences from the ends of genomic sub-clones. For some species, supercontigs are combined into ultracontigs, in which neighboring supercontigs are organized into their proper order and orientation using linking information provided by the physical map of BAC clones, independently assembled using restriction fragment patterns and the FPC program.
tandem repeats - Multiple copies of the same base sequence on a chromosome; used as markers in physical mapping.
translation start site - The position within an mRNA at which synthesis of a protein begins. The translation start site is usually an AUG codon, but occasionally GUG or CUG codons are used to initiate protein synthesis.
tRNAs - A class of RNA with triplet nucleotide sequences that are complementary to the triplet nucleotide coding sequences of mRNA. The role of tRNAs in protein synthesis is to bond with amino acids and transfer them to the ribosomes, where proteins are assembled according to the genetic code carried by mRNA.
tRNAscan-SE - tRNAscan-SE identifies transfer RNA genes in genomic DNA or RNA sequences. It combines the specificity of the Cove probabilistic RNA prediction package (Eddy & Durbin, 1994) with the speed and sensitivity of tRNAscan 1.3 (Fichant & Burks, 1991). Ensembl uses the EufindtRNA implementation described by Pavesi and colleagues (1994) to search for eukaryotic pol III tRNA promoters. tRNAscan and EufindtRNA are used as first-pass prefilters to identify candidate tRNA regions of the sequence. These subsequences are then passed to Cove for further analysis, and output if Cove confirms the initial tRNA prediction.
In this way, tRNAscan-SE attains the best of both worlds:
o a false positive rate of less than one per 15 billion nucleotides of random sequence;
o the combined sensitivities of tRNAscan and EufindtRNA (detection of 99% of true tRNAs);
o a search speed 1,000 to 3,000 times faster than Cove analysis and 30 to 90 times faster than the original tRNAscan 1.3 (tRNAscan-SE uses both a code-optimized version of tRNAscan 1.3, which gives a 650-fold increase in speed, and a fast C implementation of the Pavesi et al. algorithm).
This program and the results of its analysis of a number of genomes have been published in Lowe & Eddy, Nucleic Acids Research 25: 955-964 (1997).
TSC - The SNP Consortium is a non-profit foundation that makes SNP-related information available to the public without intellectual property restrictions.
UBRH - (Unique Best Reciprocal Hit). When a query gene translation has an unambiguous 'best' hit to a target translation, and that particular target translation has an unambiguous 'best' hit back to the starting query translation, that gene translation pair is labelled a UBRH orthologue prediction.
Unigene - UniGene is an experimental system for automatically partitioning GenBank sequences into a non-redundant set of gene-oriented clusters. Each UniGene cluster contains sequences that represent a unique gene, as well as related information such as the tissue types in which the gene has been expressed and its map location.
UniProt/TrEMBL - SPTrEMBL is a subset of TrEMBL (Translated EMBL database) containing the computer-annotated protein translations of all coding sequences (CDS) present in the EMBL nucleotide sequence database that are not yet incorporated into the UniProt/Swiss-Prot database.
UniProt/Swiss-Prot - (Universal Protein Resource) is the world's most comprehensive catalogue of information on proteins.
UniProt/Swiss-Prot is a curated protein sequence database that provides a high level of annotation, a minimal level of redundancy and a high level of integration with other databases. Swiss-Prot is maintained collaboratively by the Swiss Institute of Bioinformatics (SIB) and the European Bioinformatics Institute (EBI).
UniSTS - UniSTS is an NCBI resource for non-redundant Sequence Tagged Site (STS) markers. For each marker, UniSTS displays the primer sequences, product size, and mapping information, as well as cross references to dbSNP, RHdb, GDB, MGD, etc. The marker report also lists GenBank and RefSeq records that contain the primer sequences, as determined by ePCR.
UTR - Untranslated Region. The 5' UTR is the portion of an mRNA from the 5' end to the position of the first codon used in translation. The 3' UTR is the portion of an mRNA from the position of the last codon that is used in translation to the 3' end.
Vega genes - Vega genes from the Vertebrate Genome Annotation (VEGA) database include manual annotation of specific human, mouse, and zebrafish clones. Annotation is performed on a clone-by-clone basis using a combination of similarity searches against DNA and protein databases and ab initio gene prediction applications (GENSCAN, Fgenes). Comparative analysis using vertebrate datasets is used to aid novel gene discovery. The data gathered in these steps is then used to manually annotate the clone, adding gene structures, descriptions and poly-A features. The annotation is based on supporting evidence only.
YAC - Yeast Artificial Chromosome. Derived from a bacterial plasmid, a YAC contains a yeast centromeric region (CEN), a yeast origin of DNA replication, a cluster of unique restriction sites, a selectable marker, and a telomere region at the end of each arm. YACs are capable of cloning extremely large segments of DNA (over 1 megabase long) into a host cell, where the DNA is propagated along with the other chromosomes of the yeast cell.
ZFIN - Zebrafish Information Network. ZFIN is a database for the zebrafish model organism that holds information on wild-type stocks, mutants, genes, gene expression data, and map markers.
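
The UBRH definition given earlier in this glossary lends itself to a short illustration. The Python sketch below is an assumption-laden simplification, not the Ensembl pipeline: each hit is reduced to a single numeric score (the glossary mentions score, E-value, % identity and % positivity), the input is a nested dictionary of made-up gene names, and the functions best_hits and ubrh_pairs are hypothetical. A tie for the top score is treated as ambiguous, which is the situation the MBRH entry describes.

```python
def best_hits(scores: dict[str, dict[str, float]]) -> dict[str, str]:
    """Map each query to its unambiguous best target.

    Queries with no hits, or with tied top scores, are left out.
    """
    best = {}
    for query, hits in scores.items():
        if not hits:
            continue
        top = max(hits.values())
        winners = [target for target, score in hits.items() if score == top]
        if len(winners) == 1:
            best[query] = winners[0]
    return best


def ubrh_pairs(a_vs_b, b_vs_a):
    """Return sorted (gene_a, gene_b) pairs that are each other's unique best hit."""
    forward = best_hits(a_vs_b)
    backward = best_hits(b_vs_a)
    return sorted((a, b) for a, b in forward.items() if backward.get(b) == a)


if __name__ == "__main__":
    # Illustrative scores only; the gene names are invented.
    a_vs_b = {
        "geneA1": {"geneB1": 200.0, "geneB2": 50.0},
        "geneA2": {"geneB2": 120.0, "geneB3": 120.0},  # tie: no unique best hit
    }
    b_vs_a = {
        "geneB1": {"geneA1": 198.0},
        "geneB2": {"geneA2": 110.0},
        "geneB3": {"geneA2": 115.0},
    }
    print(ubrh_pairs(a_vs_b, b_vs_a))
```

In this toy input only geneA1 and geneB1 qualify; geneA2 is excluded because its two best hits tie.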
