Information Ensembl Glossary

Accession number - An Accession number is a unique identifier given to a
sequence when it is submitted to one of the DNA repositories (GenBank,
EMBL, DDBJ).
Alu - A dispersed intermediately repetitive DNA sequence found in the human
genome in about one million copies. The sequence is about 300 bp long and is
found commonly in introns, 3' untranslated regions of genes, and intergenic
genomic regions. The name Alu comes from the recognition site for the AluI
endonuclease that cleaves it. The Alu universal primer sequence is as follows:
5'-GTG GAT CAC CTG AGG TCA GGA GTT TC-3' (26-mer).
allele - One of the alternate forms of a specific gene. Each allele is an individual
member of a gene pair and is inherited from one parent. When genes are
considered "simply" as segments of a nucleotide sequence, the term refers to each
of the possible alternative nucleotides at a specific position in the sequence.
API - An API (Application Programming Interface) is a series of routines that
applications can use to make the operating system request and carry out
lower-level services.
BAC - A BAC (Bacterial Artificial Chromosome) is a vector used to clone DNA
fragments (100 to 300-kb insert size; average, 150 kb) from another species
cloned into bacteria. Once the foreign DNA has been cloned into the host
bacteria, the bacteria can make many copies of it.
BLAST - Basic Local Alignment Search Tool (Altschul et al., J Mol Biol
215:403-410; 1990). A sequence comparison algorithm optimised for speed
which is used to search sequence databases for optimal local alignments to a
query.
BLAT - BLAST-Like Alignment Tool (Kent, W.J. 2002. BLAT -- The BLAST-Like
Alignment Tool. Genome Research 12(4): 656-664). An mRNA/DNA and
cross-species protein sequence analysis tool that quickly finds sequences of 95%
and greater similarity of length 40 bases or more.
BLOSUM 62 - Blocks Substitution Matrix (Henikoff and Henikoff, Proc Natl
Acad Sci U S A 89:10915-10919; 1992). A matrix that defines scores for amino
acid substitutions, reflecting the similarity of physicochemical properties, and
observed substitution frequencies. The BLOSUM 62 matrix is tailored using
sequences sharing no more than 62% identity (sequences that are evolutionarily
closer were represented by a single sequence in the alignment to avoid bias from
using related family members).
Centimorgan (cM) - A unit of genetic distance, determined by how frequently
two genes on the same chromosome are inherited together. One centimorgan
equals 1% recombinant offspring.
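As a rough illustration of the 1% rule above, the following Python sketch turns offspring counts into a percentage of recombinants and reads that percentage as centimorgans. The function name and the counts are made up for this example, and the direct reading is only a reasonable approximation for short genetic distances.

```python
def map_distance_cm(recombinant, total):
    """Percent recombinant offspring, read as a distance in centimorgans.

    Hypothetical helper for illustration only; it assumes the simple
    "1 cM = 1% recombinants" reading given in the glossary entry.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= recombinant <= total:
        raise ValueError("recombinant must lie between 0 and total")
    return 100.0 * recombinant / total


# 12 recombinant offspring out of 400 scored (invented counts)
print(map_distance_cm(12, 400))  # 3.0
```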
cDNA - Complementary DNA obtained by reverse transcription of an mRNA
template. In bioinformatics jargon, cDNA is thought of as a DNA version of the
mRNA sequence. Generally, cDNAs are denoted in coding or 'sense' orientation.
CCDS - Consensus CDS is a core set of human protein coding regions that are
consistently annotated between Ensembl, VEGA and RefSeq. The long term
goal is to support convergence towards a standard set of gene annotations on the
human genome.
CDS - (Coding sequence) refers to the portion of a gene or an mRNA that codes
for a protein. Introns are not coding sequences, nor are the 5' or 3' UTR. The
coding sequence in a cDNA or mature mRNA includes everything from the start
codon through to the stop codon, inclusive.
Cigar - Cigar stands for Compact Idiosyncratic Gapped Alignment Report. The cigar
line defines the sequence of matches/mismatches and deletions (or gaps). For
example, the cigar line 2MD3M2D2M means that the alignment contains 2
matches/mismatches, 1 deletion (the number 1 is omitted in order to save some
space), 3 matches/mismatches, 2 deletions and 2 matches/mismatches. If the
original sequence is:
o Original sequence: AACGCTT
The aligned sequence will be:
o cigar line: 2MD3M2D2M
  M M D M M M D D M M
  A A - C G C - - T T
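The expansion described in this entry can be sketched in a few lines of Python. This is only an illustration of the format as the entry describes it: the function names are invented here, and only the M and D operations from the example are handled.

```python
import re


def expand_cigar(cigar):
    """Expand a compact cigar line; a missing count is read as 1."""
    expanded = []
    for count, op in re.findall(r"(\d*)([A-Z])", cigar):
        expanded.append(op * (int(count) if count else 1))
    return "".join(expanded)


def apply_cigar(sequence, cigar):
    """Build the aligned sequence: M consumes one residue, D adds a gap."""
    residues = iter(sequence)
    aligned = []
    for op in expand_cigar(cigar):
        if op == "M":
            aligned.append(next(residues))
        elif op == "D":
            aligned.append("-")
        else:
            raise ValueError(f"unsupported operation: {op}")
    return "".join(aligned)


print(expand_cigar("2MD3M2D2M"))            # MMDMMMDDMM
print(apply_cigar("AACGCTT", "2MD3M2D2M"))  # AA-CGC--TT
```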
Cosmid - DNA from a bacterial virus spliced with a small fragment of a genome
(up to 50 kb) to be amplified and sequenced.
Clone - A segment of DNA that has been inserted into a vector molecule, such
as a plasmid, and then replicated to form many identical copies.
Contig - A contig is a contiguous stretch of DNA sequence without gaps that
has been assembled solely based on direct sequencing information. Contig can
be used in other contexts: A clone contig is a group of cloned fragments of DNA
covering overlapping regions of a particular chromosome. A sequence contig is
an extended sequence created by merging sequences that overlap. A contig map
shows the regions of a chromosome where contiguous DNA segments overlap.
These maps allow the study of a complete segment of a genome by examining a
series of overlapping clones covering a region of interest.
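To make the idea of a sequence contig concrete, here is a minimal Python sketch that merges two reads when the end of one matches the start of the other. The function name, the minimum overlap of 3 bases and the reads are assumptions chosen for the example; real assemblers also deal with sequencing errors, reverse strands and many reads at once.

```python
def merge_overlapping(left, right, min_overlap=3):
    """Merge two reads if a suffix of `left` equals a prefix of `right`.

    Returns the merged sequence, or None when no exact overlap of at
    least `min_overlap` bases exists. The longest overlap is tried first.
    """
    longest = min(len(left), len(right))
    for size in range(longest, min_overlap - 1, -1):
        if left.endswith(right[:size]):
            return left + right[size:]
    return None


# The reads share the 4-base overlap TGCA (invented example reads)
print(merge_overlapping("ACGTTGCA", "TGCAGGT"))  # ACGTTGCAGGT
```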
DDBJ - DDBJ is the sole DNA data bank in Japan, which is officially certified
to collect DNA sequences from researchers and to issue the internationally
recognized accession number to data submitters. Data is exchanged with
EMBL/EBI and GenBank/NCBI on a daily basis, and the three data banks share
virtually the same data at any given time.
Domain - A region of special biological interest within a single protein
sequence. However, a domain may also be defined as a region within the
three-dimensional structure of a protein that may encompass regions of several
distinct protein sequences and that accomplishes a specific function. A domain
class is a
group of domains that share a common set of well-defined properties or
characteristics.
Dotter - Ensembl DotterView is based on the program Dotter, a dot-matrix
program with dynamic threshold control suited for genomic DNA and protein
sequence analysis. The Dotter tool provides a visual display of the sequence
alignment it represents. The dotplot displays detailed comparison of two
sequences. Every residue in one sequence is compared to every residue in the
other sequence. The first sequence runs along the x-axis and the second
sequence along the y-axis. In regions where the two sequences are similar to
each other, a row of high scores will run diagonally across the dot matrix. If
you're comparing a sequence against itself to find internal repeats, you'll notice
that the main diagonal scores maximally, since it's the 100% perfect self-match.
To make the score matrix more intelligible, the pairwise scores are averaged
over a sliding window that runs diagonally. The averaged score matrix forms a
three-dimensional landscape, with the two sequences in two dimensions and the
height of the peaks in the third. This landscape is projected onto two dimensions
by aid of grayscales - higher peaks are indicated by darker grays. Dotter was
written by Erik L.L. Sonnhammer and Richard Durbin (Gene 167: GC1-10, 1995).
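The dot-matrix idea described in this entry can be illustrated with a small Python sketch. It is not Dotter's actual implementation: the 1/0 identity scoring, the window size and the character shades are assumptions made for the example, and Dotter's dynamic threshold control is not modelled.

```python
def dot_matrix(seq_x, seq_y, window=3):
    """Compare every residue pair, then average scores along the diagonal.

    seq_x runs along the x-axis (columns), seq_y along the y-axis (rows).
    Scoring is a simple identity score (1 or 0), an assumption for this sketch.
    """
    raw = [[1 if a == b else 0 for a in seq_x] for b in seq_y]
    half = window // 2
    smoothed = []
    for y in range(len(seq_y)):
        row = []
        for x in range(len(seq_x)):
            total = 0
            count = 0
            # Slide the window diagonally, skipping cells outside the matrix.
            for k in range(-half, half + 1):
                yy, xx = y + k, x + k
                if 0 <= yy < len(seq_y) and 0 <= xx < len(seq_x):
                    total += raw[yy][xx]
                    count += 1
            row.append(total / count)
        smoothed.append(row)
    return smoothed


def render(matrix, shades=" .:#"):
    """Turn scores in [0, 1] into characters; denser characters mean higher scores."""
    lines = []
    for row in matrix:
        line = "".join(
            shades[min(int(value * len(shades)), len(shades) - 1)] for value in row
        )
        lines.append(line)
    return "\n".join(lines)


# Self-comparison of a sequence with an internal repeat (ACGT twice)
sequence = "ACGTACGT"
print(render(dot_matrix(sequence, sequence)))
```

In a self-comparison like this one, the main diagonal takes the top score, and the internal repeat shows up as shorter diagonals parallel to it.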
DWGA - (Derived from Whole Genome Alignments). Human versus
chimpanzee exception: the human versus chimpanzee orthologue predictions
were obtained in a completely different manner. Since the current chimpanzee
genome sequence assembly is the result of low-coverage sequencing, the
assembled sequence is of too poor quality to generate a gene set on the classical
Ensembl gene build pipeline. The chimpanzee gene set produced by Ensembl
has rather been generated by "projecting" human genes to the chimpanzee
genome through whole genome BLASTz alignments between both species and
filtering for orthologue sequence alignments. The result of this procedure is de
facto the human - chimpanzee orthologue set that has been Derived from Whole
Genome Alignments (DWGA). See the Prediction Method section on a relevant
Ensembl Gene Report page.
DUST - dust is a standalone application that looks for low complexity
sequences.
EMBL - (Nucleotide Sequence Database) Europe's primary nucleotide sequence
resource. The main sources of the DNA and RNA sequences in the database are
submissions from individual researchers, genome sequencing projects and patent
applications.
ENCODE - The ENCyclopedia Of DNA Elements (ENCODE) project uses
defined regions of the Human genome to test and evaluate different methods and
technologies for finding various functional elements in Human DNA. The two
main criteria for manually selected regions were presence of well-studied genes
or other known sequence elements, and existence of a substantial amount of
comparative sequence data. A total of 14.82Mb of sequence was manually
selected using this approach, consisting of 14 targets that range in size from
500kb to 2Mb.
Ensembl genes - Set of Ensembl gene predictions based on experimental
evidence from protein sequences and/or near-full-length cDNA available from
public sequence databases. "Ensembl known genes" are predicted on the basis of
species-specific database entries from manually curated UniProt/Swiss-Prot,
partially manually curated RefSeq and UniProt/TrEMBL databases. Predictions
of "Ensembl novel genes" are based on other experimental evidence such as
protein and cDNA sequence information from related species.
Eponine - Eponine is a probabilistic method for detecting transcription start
sites (TSS) in mammalian genomic sequence, with good specificity and
excellent positional accuracy. Eponine models consist of a set of DNA weight
matrices recognizing specific sequence motifs. Each of these is associated with a
position distribution relative to the TSS.
EST - (Expressed Sequence Tags) Coarse sequence reads from flanking vector
regions into the inserts of cDNA libraries. ESTs act as physical markers for
cloning and full length sequencing of the cDNAs of expressed genes. ESTs are
typically identified by purifying mRNAs, converting them to cDNAs, and then
sequencing a portion of the cDNAs.
EST genes - Set of Ensembl gene predictions solely based on EST evidence.
The process of EST gene prediction uses a combination of Exonerate, BLAST
and Est2Genome to map ESTs onto the genomic sequence. Redundant ESTs are
merged, before GenomeWise is used to assign 5' and 3' UTRs to the longest
found ORF. See Ensembl EST genes for a more complete explanation of the
EST gene prediction process.
Exonerate - Exonerate is a fast gapped DNA-DNA alignment algorithm. It can
be used for aligning various types of sequences such as genomic DNA,
cDNAs/ESTs, and proteins.
Fgenes - FGENES, also known as Find Genes, is a Human gene predictor that is
based on pattern recognition of different types of exons, promoters and poly A
signals. It is built based on linear discriminant functions of internal, 5'-coding,
and 3'-coding exon recognition. It is designed to find the optimal combination of
these components and to construct a set of gene models along a given sequence.
GENSCAN - GENSCAN (Burge, C. and Karlin, S. (1997) Prediction of
complete gene structures in human genomic DNA. J. Mol. Biol. 268, 78-94) is
an application for identification of complete gene structures in genomic DNA.
The splice site models used are described in more detail in: Burge, C. B. (1998)
Modeling dependencies in pre-mRNA splicing signals. In Salzberg, S., Searls,
D. and Kasif, S., eds. Computational Methods in Molecular Biology, Elsevier
Science, Amsterdam, pp. 127-163.
GeneWise - GeneWise is a sequence analysis tool for comparing proteins or
profile HMMs to DNA sequences allowing for introns and frameshifts. The
Wise2 package was written by Ewan Birney. More information about the
package can be obtained at: http://www.ebi.ac.uk/Wise2/
GO - Gene Ontology Consortium. An organized hierarchy of terms produced by
the Gene Ontology (GO) Consortium, used to describe biological processes,
cellular components, and molecular functions.
o Molecular Function Ontology: tasks performed by individual gene products;
examples are carbohydrate binding and ATPase activity.
o Biological Process Ontology: broad biological goals, such as mitosis or
purine metabolism, that are accomplished by ordered assemblies of molecular
functions.
o Cellular Component Ontology: subcellular structures, locations, and
macromolecular complexes; examples include nucleus, telomere, and origin
recognition complex.
A gene may be indexed under many GO terms, depending on the GO
classification system. A gene product has one or
more molecular functions and is used in one or more biological processes; it
might be associated with one or more cellular components. For instance,
cytochrome c can be described by the molecular function term electron
transporter activity, the biological process terms oxidative phosphorylation and
induction of cell death, and the cellular component terms mitochondrial matrix
and mitochondrial inner membrane.
Haplotypes - SNPs are not randomly or independently distributed on different
chromosomes, but tend to be associated with one another. Haplotypes are a way
of denoting a group of several SNPs closely linked physically on a chromosome.
HGVbase - The Human Genome Variation Database (HGVbase) provides an
accurate, comprehensive, high utility catalog of normal human gene and genome
variation to aid research of Human phenotypic variation. Records are highly
curated and annotated to ensure maximal utility and data accuracy. HGVbase is
a collaboration between the Karolinska Institute (Sweden) and the European
Bioinformatics Institute (UK).
InterPro - InterPro is an integrated resource for protein families, domains and
sites, combining information from several different protein signature databases.
InterPro IDs are linked to the summary of information about that domain or
family. InterPro is managed by EBI. A number of databases (SwissProt,
TrEMBL, PROSITE, PRINTS, Pfam, ProDom, SMART, TIGRFAMs, PIR
SuperFamilies and SUPERFAMILY) with different approaches to biological
information are used to derive protein signatures. ProteinView, GeneView and
DomainView provide links to the relevant InterPro entries.
Jalview - Jalview is a multiple alignment editor, used by the EBI clustalw server
and the PFAM protein domain database and is available as a general purpose
alignment editor.
Known genes - Known genes are transcripts that have been mapped by Ensembl
to near-full-length protein sequences already available in the public sequence
databases.
MGI - (Mouse Genome Informatics) houses a database that provides integrated
access to data on the genetics, genomics, and biology of mouse (Mus musculus).
MBRH - (Multiple Best Reciprocal Hit). When, due to gene duplications, there
are multiple 'best' hits with identical score, E-value, % identity and %
positivity, one is unable to pick a unique orthologue for a gene. This results in
more complex graphs of 'best' relationships. This often occurs when different
genes have identical translations, which could be due to a duplication event, an
assembly error, or chance. On average 3% of the genes have an identical
translation to some other gene either within its genome or in another genome.
o MBRH / DUP 1.# - MBRH set where in one genome there is only one
gene, but the other genome has multiple genes, all on the same
chromosome and within 1.5 megabases of each other. This could be due
to recent gene duplication events where sequences have not diverged or a
mis-assembly of the genome sequence leading to artificial, apparent gene
duplications. (e.g. MBRH / DUP 1.2 or MBRH / DUP 1.4)
o MBRH / SYN - This is a more complex MBRH set where there are
multiple genes in each genome split across multiple chromosomes. The
one(s) labeled MBRH/SYN satisfies both the MBRH criteria and the
RHS search criteria.
o MBRH / COMPLEX - This is a more complex MBRH set where there
are multiple genes in each genome split across multiple chromosomes.
This MBRH pair does not satisfy the RHS criteria.
Novel genes - Novel genes are genes that have been predicted by Ensembl on
the basis of similarity to protein or cDNA sequences and/or ESTs, but could not
be mapped with confidence to existing entries in any public sequence database.
OMIM - (Online Mendelian Inheritance in Man) Genetic knowledge database,
first published in 1966 as Mendelian Inheritance in Man (MIM; currently in its
12th edition), that includes information and references, including links to
MEDLINE and sequence records. Ensembl links both to OMIM entries for any
gene, where available, and to a subset of this database, the OMIM Morbid Map
presenting the syndrome and disease-associated genes described in OMIM.
Orthologue - Orthologues are genes derived from a common ancestor through
vertical descent and can be thought of as the direct evolutionary counterpart. In
contrast, paralogues are genes within the same genome that have evolved by
duplication.
PDB - Protein Data Bank is a repository for 3-D biological macromolecular
structure data. PDB archives protein structures deduced from crystallography
and nuclear magnetic resonance (NMR) experiments. The
Protein Data Bank (PDB) is operated by Rutgers, The State University of New
Jersey; the San Diego Supercomputer Center at the University of California, San
Diego; and the Center for Advanced Research in Biotechnology of the National
Institute of Standards and Technology -- three members of the Research
Collaboratory for Structural Bioinformatics (RCSB). The RCSB PDB is
supported by funds from the National Science Foundation, the Department of
Energy, and the National Institutes of Health.
Pfam - Pfam is a large collection of multiple sequence alignments and hidden
Markov models covering many common protein domains and families. Pfam can
be used to view the domain organization of proteins, to view multiple
alignments, protein domain architectures, protein structures, and species
distributions.
Pmatch - Pmatch is a fast, exact matching program for aligning protein
sequences with either protein or DNA sequence.
Prints - The PRINTS protein fingerprint database is a compendium of protein
fingerprints. A fingerprint is a group of conserved motifs used to characterise a
protein family; its diagnostic power is refined by iterative scanning of a
SwissProt/TrEMBL composite. Usually the motifs do not overlap, but are
separated along a sequence, though they may be contiguous in 3D-space.
Fingerprints can encode protein folds and functionalities more flexibly and
powerfully than can single motifs, full diagnostic potency deriving from the
mutual context provided by motif neighbors.
Prosite - PROSITE is a database of protein families and domains run by the
Expert Protein Analysis System (ExPASy) proteomics server of the Swiss
Institute of Bioinformatics (SIB). It consists of biologically significant sites,
patterns and profiles that help to reliably identify to which known protein family
(if any) a new sequence belongs.
Pseudogenes - Processed pseudogenes result from reverse transcription of a
mature mRNA and reinsertion into the genomic sequence. Ensembl detects
potential processed pseudogenes among the Ensembl transcript predictions. See
the Pseudogenes section for more information about how Ensembl detects
pseudogenes.
Pre-release site - Initial annotations without gene predictions or validation, and
improvements to the Web site are often available on the pre-release site at
http://dev.ensembl.org
QTL - (Quantitative Trait Locus). Genetic loci where allelic variation is
associated with variation in a quantitative trait (e.g. blood pressure). The
presence of QTL is inferred from genetic mapping. Total variation is
partitioned into components linked to a number of discrete, mapped
chromosome markers described by statistical association to quantitative
variation in a particular phenotypic trait that is thought to be controlled by the
cumulative action of alleles at multiple loci.
Scaffold - Scaffolds are sets of ordered, oriented contigs positioned relative to
each other by mate pairs whose reads are in adjacent contigs.
Synteny - The term synteny was originally defined to indicate that two gene
loci share the same chromosome. In a genomic context, we refer to syntenic
regions if both sequence and gene order are conserved between two (closely
related) species.
RefSeq - NCBI's Reference Sequences (RefSeq) database is a curated database
of GenBank's genomes, mRNAs and proteins. RefSeq attempts to provide a
comprehensive, integrated, non-redundant set of sequences, including genomic
DNA, tRNA, and protein products, providing a stable reference for gene
identification and characterization, mutation analysis, expression studies,
polymorphism discovery, and comparative analyses.
RepeatMasker - RepeatMasker (AFA Smit & P Green) is a standard software
tool used in computational genomics to identify repetitive elements and
low-complexity sequences.
RH map - Radiation Hybrid map. Technique for identifying landmarks (STS)
every 100 kb in the human genome; the ordering is relative to the frequency with
which they are separated by radiation-induced breaks. The frequency is assayed
by analysing a panel of human-hamster hybrid cell lines.
RHS - (Reciprocal Hit based on Synteny information). For closely related
species (i.e. inside the vertebrate or arthropod phylum), where some gene order
conservation is expected, we identify additional orthologous pairs obtained by a
combination of reciprocal BLAST and location information. RHS is a reciprocal
pair, where one direction is the best hit, but the reverse hit is less than best. To
classify as RHS the pair must also maintain synteny (conserved gene order)
within 1.5 Mb of a UBRH or MBRH / DUP. Because this search looks at
less than 'best' hits, it is possible that a given gene can have both a
UBRH orthologue prediction and an RHS orthologue prediction.
SEG - Seg divides sequences into contrasting segments of low-complexity and
high-complexity. Low-complexity segments defined by the algorithm represent
"simple sequences" or "compositionally-biased regions". Segment lengths and
the number of segments per sequence are determined automatically by the
algorithm.
SGD - Saccharomyces Genome Database. Canonical database for the molecular
biology and genetics of Saccharomyces cerevisiae.
Shotgun method - (also whole genome shotgun) Semi-automated sequencing
method that involves sequencing randomly cloned pieces of the genome (size
selected, usually 2, 10, 50 and 150 kb), with no prior knowledge of their location.
The clones are then sequenced from both ends. The two ends of the same clone
are referred to as mate pairs. The distance between two "mate pairs" can be
inferred if the library size is known and has a narrow window of deviation. This
approach can be contrasted with "directed" strategies, in which pieces of DNA
from known chromosomal locations are sequenced.
SignalP - The SignalP application predicts the presence and location of signal
peptide cleavage sites in amino acid sequences from different organisms:
Gram-positive prokaryotes, Gram-negative prokaryotes, and eukaryotes. The method
incorporates a prediction of cleavage sites and a signal peptide/non-signal
peptide prediction based on a combination of several artificial neural networks.
Signal peptides indicate a protein that will be secreted. Prediction of signal
peptides is quite accurate; however, care must be exercised and these regions
should be verified by other means. (Henrik Nielsen, Jacob Engelbrecht, Søren
Brunak and Gunnar von Heijne. Identification of prokaryotic and eukaryotic
signal peptides and prediction of their cleavage sites. Protein Engineering 10:
1-6 (1997).)
SNAP - SNAP (Synonymous/Non-synonymous Analysis Program) calculates
synonymous and non-synonymous substitution rates from a set of codon-aligned
nucleotide sequences, using the method of Nei and Gojobori and
incorporating a statistic developed in Ota and Nei.
SNAP - SNAP is also an ab initio gene prediction program developed by Ian Korf
that models protein coding sequences in genomic DNA by means of hidden
Markov models.
SNPs - Single Nucleotide Polymorphisms are common variations that occur in
DNA with a 0.1% frequency. Ensembl displays SNPs obtained from dbSNP
(the SNP repository maintained by NCBI), the Human Genic Bi-Allelic
Sequences Database (HGVbase) and The SNP Consortium Ltd. (TSC).
SSAHA - (Sequence Search and Alignment by Hashing Algorithm) is designed
to detect exact matches, or nearly exact matches, in DNA or protein databases.
The SSAHA search has been optimized for alignments of high percentage
identity and displays as results the most significant matches for ungapped
alignments between sequences. Each exact match in an SSAHA alignment is
analogous to finding a high-scoring segment pair in BLAST. A number of
consecutive matches on a contig may represent features of a gene such as exons
or 5' and 3' untranslated regions, depending on the nature of the query sequence.
STS markers - STS markers are short sequences of genomic DNA that can be
uniquely amplified by the polymerase chain reaction (PCR) using a pair of
primers. Because each is unique, STSs are often used in linkage and radiation
hybrid mapping techniques. STSs serve as landmarks on the physical map of the
human genome.
supercontigs - Assemblies consist of sequence contigs combined into scaffolds,
also known as supercontigs. Supercontigs are combined and ordered according
to their orientation and linking information provided by mated sequences from
the ends of genomic sub-clones. For some species, supercontigs are combined
into ultracontigs, in which neighboring supercontigs are organized into their
proper order and orientation using linking information provided by the physical
map of BAC clones independently assembled using restriction fragment patterns
and the FPC program.
tandem repeats - Multiple copies of the same base sequence on a chromosome;
used as markers in physical mapping.
translation start site - The position within an mRNA at which synthesis of a
protein begins. The translation start site is usually an AUG codon, but
occasionally, GUG or CUG codons are used to initiate protein synthesis.
tRNAs - A class of RNA with triplet nucleotide sequences that are
complementary to the triplet nucleotide coding sequences of mRNA. The role of
tRNAs in protein synthesis is to bond with amino acids and transfer them to the
ribosomes, where proteins are assembled according to the genetic code carried
by mRNA.
tRNAscan-SE - tRNAscan-SE is an application that identifies
transfer RNA genes in genomic DNA or RNA sequences. It combines the
specificity of the Cove probabilistic RNA prediction package (Eddy & Durbin,
1994) with the speed and sensitivity of tRNAscan 1.3 (Fichant & Burks, 1991).
Ensembl uses the EufindtRNA implementation described by Pavesi and
colleagues (1994) to search for eukaryotic pol III tRNA promoters. tRNAscan
and EufindtRNA are used as first-pass prefilters to identify candidate tRNA
regions of the sequence. These subsequences are then passed to Cove for further
analysis, and output if Cove confirms the initial tRNA prediction. In this way,
tRNAscan-SE attains the best of both worlds: a false positive rate of less than
one per 15 billion nucleotides of random sequence; the combined sensitivities of
tRNAscan and EufindtRNA (detection of 99% of true tRNAs); and a search speed
1,000 to 3,000 times faster than Cove analysis and 30 to 90 times faster than the
original tRNAscan 1.3 (tRNAscan-SE uses both a code-optimized version of
tRNAscan 1.3, which gives a 650-fold increase in speed, and a fast C
implementation of the Pavesi et al. algorithm). This program and results of its
analysis of a number of genomes have been published in Lowe & Eddy, Nucleic
Acids Research 25: 955-964 (1997).
TSC - The SNP Consortium is a non-profit foundation that makes SNP-related
information available to the public without intellectual property
restrictions.
UBRH - (Unique Best Reciprocal Hit). When a query gene translation has an
unambiguous 'best' hit to a target translation, and that particular target translation
has an unambiguous 'best' hit back to the starting query translation, that gene
translation pair is labelled a UBRH orthologue prediction.
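The reciprocal test described above can be sketched as follows. This is a simplified illustration, not the Ensembl pipeline: hits are reduced to a single score per target (the glossary also mentions E-value, % identity and % positivity), and the function names, gene names and scores are invented for the example.

```python
def unique_best_hit(scores):
    """Return the single best-scoring target, or None if the top score is tied."""
    if not scores:
        return None
    best = max(scores.values())
    top = [target for target, score in scores.items() if score == best]
    return top[0] if len(top) == 1 else None


def ubrh_pairs(a_vs_b, b_vs_a):
    """Pairs whose unambiguous best hits point at each other in both directions."""
    pairs = []
    for gene_a, hits in a_vs_b.items():
        gene_b = unique_best_hit(hits)
        if gene_b is None:
            continue  # tied best hits: not a *unique* best hit
        if unique_best_hit(b_vs_a.get(gene_b, {})) == gene_a:
            pairs.append((gene_a, gene_b))
    return pairs


# Invented similarity scores between genes of genome A and genome B
a_vs_b = {
    "a1": {"b1": 250, "b2": 90},
    "a2": {"b2": 180, "b3": 180},  # tie, so no unique best hit
    "a3": {"b1": 120},             # b1 prefers a1, so not reciprocal
}
b_vs_a = {
    "b1": {"a1": 245, "a3": 110},
    "b2": {"a2": 175},
    "b3": {"a2": 170},
}
print(ubrh_pairs(a_vs_b, b_vs_a))  # [('a1', 'b1')]
```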
Unigene - UniGene is an experimental system for automatically partitioning
GenBank sequences into a non-redundant set of gene-oriented clusters. Each
UniGene cluster contains sequences that represent a unique gene, as well as
related information such as the tissue types in which the gene has been
expressed and map location.
UniProt/TrEMBL - SPTrEMBL is a subset of TrEMBL (Translated EMBL
database) containing the computer-annotated protein translations of all coding
sequences (CDS) present in the EMBL nucleotide sequence database that are not
yet incorporated into the UniProt/SwissProt database.
UniProt/Swiss-Prot - (Universal Protein Resource) is the world's most
comprehensive catalogue of information on proteins. UniProt/Swiss-Prot is a
curated protein sequence database that provides a high level of annotation, a
minimal level of redundancy and high level of integration with other databases.
SwissProt is maintained collaboratively by the Swiss Institute for Bioinformatics
(SIB) and the European Bioinformatics Institute (EBI).
UniSTS - UniSTS is an NCBI resource for non-redundant Sequence Tagged Sites
(STS) markers. For each marker, UniSTS displays the primer sequences, product
size, and mapping information, as well as cross references to dbSNP, RHdb,
GDB, MGD, etc. The marker report also lists GenBank and RefSeq records that
contain the primer sequences determined by ePCR.
UTR - Untranslated Region. The 5' UTR is the portion of an mRNA from the 5'
end to the position of the first codon used in translation. The 3' UTR is the
portion of an mRNA from the position of the last codon that is used in
translation to the 3' end.
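The boundaries described in this entry can be shown with a short Python sketch that splits an mRNA into 5' UTR, coding sequence and 3' UTR. It assumes, for simplicity, that translation starts at the first AUG and ends at the first in-frame stop codon; the function name and the example sequence are invented for illustration.

```python
STOP_CODONS = {"UAA", "UAG", "UGA"}


def split_mrna(mrna):
    """Return (five_prime_utr, cds, three_prime_utr), or None if no CDS is found.

    Simplifying assumption: the first AUG is the start codon and the first
    in-frame stop codon ends the coding sequence.
    """
    start = mrna.find("AUG")
    if start == -1:
        return None
    for pos in range(start + 3, len(mrna) - 2, 3):
        if mrna[pos:pos + 3] in STOP_CODONS:
            end = pos + 3  # the stop codon belongs to the coding sequence
            return mrna[:start], mrna[start:end], mrna[end:]
    return None


five_utr, cds, three_utr = split_mrna("GGCACCAUGGCUUGGUAAGCUAAUAAA")
print(five_utr)   # GGCACC
print(cds)        # AUGGCUUGGUAA
print(three_utr)  # GCUAAUAAA
```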
Vega genes - Vega genes from the Vertebrate Genome Annotation (VEGA)
database include manual annotation of specific Human, Mouse, and Zebrafish
clones. Annotation is performed on a clone-by-clone basis using a combination
of similarity searches against DNA and protein databases and ab initio gene
prediction applications (genscan, Fgenes). Comparative analysis using
vertebrate datasets is used to aid novel gene discovery. The data gathered in
these steps is then used to manually annotate the clone adding gene structures,
descriptions and poly-A features. The annotation is based on supporting
evidence only.
YAC - Yeast Artificial Chromosome. Originated from a bacterial plasmid, a
YAC contains a yeast centromeric region (CEN), a yeast origin of DNA
replication, a cluster of unique restriction sites, a selectable marker and a
telomere region at the end of each arm. YACs are capable of cloning extremely
large segments of DNA (over 1 megabase long) into a host cell, where the DNA
is propagated along with the other chromosomes of the yeast cell.
ZFIN - Zebrafish Information Network. ZFIN is a database for the zebrafish
model organism that holds information on wild-type stocks, mutants, genes,
gene expression data, and map markers.