Genome Projects
and Gene Hunting
Wen-chang Lin
Institute of Biomedical Sciences, Academia Sinica
Taipei, Taiwan
R. O. C.
E-mail: wenlin@ibms.sinica.edu.tw
Http://www.ibms.sinica.edu.tw/~wenlin
The Human Genome Project is an ambitious
effort to understand the hereditary instructions
that make each of us unique. The goal of this
effort is to find the location of the 100,000 or so
human genes and to read the entire genetic
script, all 3 billion bits of information, by the
year 2005.
What is the Human Genome Project?
The Human Genome Project (HGP) is an international research program designed to construct
detailed genetic and physical maps of the human genome, to determine the complete nucleotide
sequence of human DNA, to localize the estimated 50,000-100,000 genes within the human
genome, and to perform similar analyses on the genomes of several other organisms used
extensively in research laboratories as model systems. The scientific products of the HGP will
comprise a resource of detailed information about the structure, organization and function of
human DNA, information that constitutes the basic set of inherited "instructions" for the
development and functioning of a human being. Successfully accomplishing these ambitious
goals will demand the development of a variety of new technologies. It will also necessitate
advanced means of making the information widely available to scientists, physicians, and others
in order that the results may be rapidly used for the public good. Improved technology for
biomedical research will thus be another important product of the HGP. From the inception of the
HGP, it was clearly recognized that acquisition and use of such genetic knowledge would have
momentous implications for both individuals and society and would pose a number of policy
choices for public and professional deliberation. Analysis of the ethical, legal, and social
implications of genetic knowledge, and the development of policy options for public
consideration are therefore yet another major component of the human genome research effort.
Specific Goals (Phase I 1993-1998)
Genetic Map
Complete the 2-5 cM map by 1995
Develop technology for rapid genotyping
Develop markers that are easier to use
Develop new mapping technologies
Physical Map
Complete an STS map of the human genome at a resolution of 100 kb
DNA Sequencing
Develop efficient approaches to sequencing one- to several- megabase regions of DNA
of high biological interest.
Develop technology for high throughput sequencing, focusing on systems integration
of all steps from template preparation to data analysis.
Build up sequencing capacity to a collective rate of 50 Mb per year by the end of the
period. This rate should result in an aggregate of 80 Mb of DNA sequence completed
by the end of FY 1998.
Gene Identification
Develop efficient methods of identifying genes and for placement of known genes on physical
maps or sequenced DNA.
Technology Development
Substantially expand support of innovative technological developments as well as improvements
in current technology for DNA sequencing and to meet the needs of the Human Genome Project
as a whole.
Model Organisms
Finish an STS map of the mouse at 300 Kb resolution
Finish the sequence of the E. coli and S. cerevisiae genomes by 1998 or earlier
Continue sequencing C. elegans and Drosophila genomes, with the aim of bringing C. elegans
to near completion by 1998
Sequence selected segments of mouse DNA side by side with corresponding human DNA in
areas of high biological interest
Informatics
Continue to create, develop and operate databases and database tools for easy access to data,
including effective tools and standards for data exchange and links among databases
Consolidate, distribute and continue to develop effective software for large-scale genome
projects
Continue to develop tools for comparing and interpreting genome information
Ethical, Legal and Social Implications (ELSI)
Continue to identify and define issues and develop policy options to address them
Develop and disseminate policy options regarding genetic testing services with widespread
potential use
Foster greater acceptance of human genetic variation
Enhance and expand public and professional education that is sensitive to sociocultural and
psychological issues
Training
Continue to encourage training of scientists in interdisciplinary sciences related to genome research
Technology Transfer
Encourage and enhance technology transfer both into and out of centers of genome research
Outreach
Cooperate with those who would set up distribution centers for genome materials.
Share all information and materials within 6 months of their development. This should be
accomplished by submission to public databases or repositories, or both, where
appropriate.
Specific Goals (Phase II 1998-2003)
Goal 1--The Human DNA Sequence
a) Finish the complete human genome sequence by the end of 2003.
b) Finish one-third of the human DNA sequence by the end of 2001.
c) Achieve coverage of at least 90% of the genome in a working draft based on
mapped clones by the end of 2001.
d) Make the sequence totally and freely accessible.
Goal 2--Sequencing Technology
a) Continue to increase the throughput and reduce the cost of current sequencing
technology.
b) Support research on novel technologies that can lead to significant
improvements in sequencing technology.
c) Develop effective methods for the advanced development and introduction of
new sequencing technologies into the sequencing process.
Goal 3--Human Genome Sequence Variation
a) Develop technologies for rapid, large-scale identification or scoring, or both, of
SNPs and other DNA sequence variants.
b) Identify common variants in the coding regions of the majority of identified genes
during this 5-year period.
c) Create an SNP map of at least 100,000 markers.
d) Develop the intellectual foundations for studies of sequence variation.
e) Create public resources of DNA samples and cell lines.
Goal 4--Technology for Functional Genomics
a) Develop cDNA resources.
b) Support research on methods for studying functions of non-protein-coding
sequences.
c) Develop technology for comprehensive analysis of gene expression.
d) Improve methods for genome-wide mutagenesis.
e) Develop technology for global protein analysis.
Goal 5--Comparative Genomics
a) Complete the sequence of the C. elegans genome in 1998.
b) Complete the sequence of the Drosophila genome by 2002.
c) The mouse genome.
1) Develop physical and genetic mapping resources.
2) Develop additional cDNA resources.
3) Complete the sequence of the mouse genome by 2005.
d) Identify other model organisms that can make major contributions to the
understanding of the human genome and support appropriate genomic
studies.
Goal 6--Ethical, Legal, and Social Implications (ELSI)
U.S. Human Genome Project Funding ($ Millions)
FY      DOE     NIH*     U.S. Total
1988    10.7    17.2     27.9
1989    18.5    28.2     46.7
1990    27.2    59.5     86.7
1991    47.4    87.4     134.8
1992    59.4    104.8    164.2
1993    63.0    106.1    169.1
1994    63.3    127.0    190.3
1995    68.7    153.8    222.5
1996    73.9    169.3    243.2
1997    77.9    188.9    266.8
1998    85.5    217.7    303.2  (NT$9,780,500,000)
1999    89.8    225.7    315.5
Current Progress (Mar. 24, 2000)
Finished sequence: 561,973 kb (17.5% of genome)
Draft sequence: 2,020,129 kb (62.9% of genome)
Breakdown by Chromosome
Chr     Effective    Sequence     Percent     Number of   Longest
        size (kb)    done (kb)    finished    contigs     contig (kb)
1       263000       26571        10.1%       154         928
2       255000       23193        9.1%        109         695
3       214000       10417        4.9%        59          746
4       203000       12521        6.2%        99          393
5       194000       15679        8.1%        94          739
6       183000       45668        25.0%       305         3926
7       171000       81476        47.6%       298         2094
8       155000       8730         5.6%        42          1902
9       145000       4839         3.3%        30          1010
10      144000       6091         4.2%        36          469
11      144000       8398         5.8%        63          817
12      143000       24509        17.1%       99          1526
13      98000        2143         2.2%        7           1416
14      93000        29775        32.0%       106         1450
15      89000        2196         2.5%        17          297
16      98000        19372        19.8%       118         512
17      92000        28861        31.4%       129         1101
18      85000        3734         4.4%        20          349
19      67000        15021        22.4%       144         1008
20      72000        25825        35.9%       137         1187
21      39000        25851        66.3%       72          7223
22      34491        33620        97.5%       12          23051
X       164000       65513        39.9%       347         949
Y       35000        6934         19.8%       27          1104
total   3180491      528043       16.6%       2532
The completed sequence covers 33.4 Mb of 22q with 11 gaps and
has been estimated to be accurate to less than 1 error in 50,000
bases, by internal and external checking exercises. The largest
contiguous segment stretches over 23 Mb. From our gap-size
estimates, we calculate that we have completed 33,464 kb of a total
region spanning 34,491 kb and that therefore the sequence is
complete to 97% coverage of 22q.
545 genes; 134 pseudogenes.
http://www.ornl.gov/hgmis/
3,000 ~ 4,000 genes
http://www.ncbi.nlm.nih.gov/disease/
Completed Genomes
Organism                                 Genome Size (Mb)   Estimated Genes
Caenorhabditis elegans                   100
Saccharomyces cerevisiae                 12.1               6034
Escherichia coli                         4.6                4288
Bacillus subtilis                        4.2                ~4000
Synechocystis sp.                        3.6                3168
*Archaeoglobus fulgidus                  2.2                2471
*Pyrobaculum aerophilum                  2.2                N.A.
Haemophilus influenzae                   1.8                1740
*Methanobacterium thermoautotrophicum    1.8                1855
Helicobacter pylori                      1.7                1590
*Methanococcus jannaschii                1.7                1692
*Aquifex aeolicus                        1.5                1508
Borrelia burgdorferi                     1.3                863
Treponema pallidum                       1.1                1234
Mycoplasma pneumoniae                    0.8                677
*Mycoplasma genitalium                   0.6                470

Organism                                 Genome Size (Mb)
Treponema pallidum                       1.14
Chlamydia trachomatis                    1.05
Plasmodium falciparum Chr2               1
Rickettsia prowazekii                    1.1
Helicobacter pylori                      1.64
Leishmania major Chr1                    0.27
The TIGR Microbial Database provides links to world-wide genome
sequencing projects completed and underway, including the completed
TIGR genomes: Archaeoglobus fulgidus, Borrelia burgdorferi,
Deinococcus radiodurans, Haemophilus influenzae, Helicobacter pylori,
Methanococcus jannaschii, Mycobacterium tuberculosis,
Mycoplasma genitalium, Thermotoga maritima, and Treponema pallidum.
In the last few decades, advances in molecular biology and
the equipment available for research in this field have
allowed the increasingly rapid sequencing of large portions
of the genomes of several species. In fact, to date, several
bacterial genomes, as well as those of some simple
eukaryotes (e.g., Saccharomyces cerevisiae, or baker's yeast)
have been sequenced in full. The Human Genome Project,
designed to sequence all 24 of the human chromosomes, is
also progressing. Popular sequence databases, such as
GenBank and EMBL, have been growing at exponential
rates. This deluge of information has necessitated the careful
storage, organization and indexing of sequence information.
Information science has been applied to biology to produce
the field called bioinformatics.
The most pressing tasks in bioinformatics involve the analysis of
sequence information. Computational Biology is the name given to
this process, and it involves the following:
• Finding the genes in the DNA sequences of various organisms.
• Developing methods to predict the structure and/or function
  of newly discovered proteins and structural RNA sequences.
• Clustering protein sequences into families of related
  sequences and the development of protein models.
• Aligning similar proteins and generating phylogenetic trees
  to examine evolutionary relationships.
Simple Mathematics:
Human genome: 3 x 10^9 bp
Human genes (5% of the genome): 100,000 genes
In a given cell type at a certain stage, it is estimated that around
20% of the genes are transcribed or expressed: 20,000 genes
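The arithmetic on this slide can be checked with a few lines of Python. This is only an illustrative sketch: the variable names are made up for the example, and the "average coding bp per gene" figure is simply derived from the slide's own round numbers, not a value stated in the source.

```python
# Round numbers taken from the slide; integer arithmetic avoids rounding noise.
GENOME_SIZE_BP = 3_000_000_000      # 3 x 10^9 bp
CODING_PERCENT = 5                  # genes make up about 5% of the genome
ESTIMATED_GENES = 100_000           # gene count estimate used on the slide
EXPRESSED_PERCENT = 20              # share of genes expressed in one cell type

coding_bp = GENOME_SIZE_BP * CODING_PERCENT // 100
expressed_genes = ESTIMATED_GENES * EXPRESSED_PERCENT // 100
# Derived for illustration only: average coding sequence per gene.
coding_bp_per_gene = coding_bp // ESTIMATED_GENES

print(f"coding sequence: {coding_bp:,} bp")
print(f"genes expressed in a given cell type: {expressed_genes:,}")
print(f"average coding bp per gene: {coding_bp_per_gene:,}")
```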
Automatic sequencer
The Growth of GenBank sequence database in the past 10 years.
Release   Year   Base pairs       Entries
58        1988   24,690,876       21,248
62        1989   37,183,950       31,229
66        1990   51,306,092       41,057
70        1991   77,337,678       58,952
74        1992   120,242,234      97,084
80        1993   163,802,597      150,744
86        1994   230,485,928      237,775
92        1995   425,860,958      620,765
98        1996   730,552,938      1,114,581
104       1997   1,258,290,513    1,891,953
110       1998   2,162,067,871    3,043,729
115       1999   4,653,932,745    5,354,511
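To put a number on the "exponential" growth shown in the table, the sketch below estimates an average yearly growth factor and a doubling time from the first and last rows only. The function names are hypothetical, and a two-point estimate is a rough simplification rather than a fitted growth model.

```python
import math

# First and last rows of the GenBank growth table (year, base pairs).
FIRST_YEAR, FIRST_BP = 1988, 24_690_876
LAST_YEAR, LAST_BP = 1999, 4_653_932_745


def annual_growth_factor(bp_start, bp_end, years):
    """Average multiplicative growth per year between two data points."""
    return (bp_end / bp_start) ** (1.0 / years)


def doubling_time_years(growth_factor):
    """Years needed to double at a constant yearly growth factor."""
    return math.log(2) / math.log(growth_factor)


factor = annual_growth_factor(FIRST_BP, LAST_BP, LAST_YEAR - FIRST_YEAR)
doubling = doubling_time_years(factor)
print(f"about {factor:.2f}x per year, doubling roughly every {doubling:.1f} years")
```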
Gene Expression Studies
GenBank Overview
What is GenBank?
GenBank is the NIH genetic sequence database, an annotated collection of all publicly available
DNA sequences ( Nucleic Acids Research 1998 Jan 1;26(1):1-7). There are approximately
2,162,000,000 bases in 3,044,000 sequence records as of December 1998. As an example, you may
view the record for the neurofibromatosis gene. The complete release notes for the current
version of GenBank are available. A new release is made every two months. GenBank is part of the
International Nucleotide Sequence Database Collaboration, which is comprised of the DNA DataBank
of Japan (DDBJ), the European Molecular Biology Laboratory (EMBL), and GenBank at NCBI. These
three organizations exchange data on a daily basis.
Submissions to GenBank
Many journals require submission of sequence information to a database prior to publication so
that an accession number may appear in the paper. NCBI has a WWW form, called BankIt, for
convenient and quick submission of sequence data. The beta-test version of Sequin, NCBI's new
stand-alone submission software for MAC, PC, and UNIX platforms, is available by FTP. When using
Sequin, the output files for direct submission should be sent to GenBank by electronic mail.
Alternatively, the data files may be copied to a floppy disk and mailed to NCBI. Authorin, an
older stand-alone program for MACs and PCs, can still be used to format your submission, although
submitters are encouraged to switch to either BankIt or Sequin.
Searching GenBank
Text and Similarity searching
Entrez Browser
GenBank (nucleotides and proteins), PubMed (MEDLINE), 3D structures, genomes, and taxonomy
databases.
BLAST Sequence Similarity Searching
Nucleotide or protein query sequences against the specified database using the BLAST suite of
algorithms.
dbEST Searching
dbEST (Database of Expressed Sequence Tags).
GenBank nr database:
>gi|216185|dbj|D00635|ABCADHCC Acetobacter polyoxogenes genes for alcohol dehydrogenase, cytochrome c, complete cds
GAATTCCGAACTATCCGTTTCATTGCTTATGCGACAGCATGTTCACTTTTTAGTGAGGCTGAACACTAAA
ATGTCAGGAGACGAGCGTGCTAGCCTCAGTATGTTGCCATGAAACGGACCACCTGCTTTGTCTTTCCTGC
CTGAAGCCGGTTTCTGTCTGGCCGGAAAAGAAGCGCTAGCGCGTTTTTTTGCCGGATACATTCAGAAAGC
TGCTCCGGGCAGAAAGTTGCAGCGGCGGCATCCTGAATTCGAAACCGTTAGTTTTCTGAGGACATCACAT
ATGATTTCTGCCGTTTTCGGAAAAAGACGTTCTCTGAGCAGAACGCTTACAGCCGGAACGATATGTGCGG
CTCTCATCTCCGGGTATGCCACCATGGCATCCGCAGATGACGGGCAGGGCGCCACGGGGGAAGCGATCAT
CCATGCCGATGATCACCCCGGTAACTGGATGACCTATGGCCGCACCTATTCTGACCAGCGCTACAGCCCG
CTGGATCAGATCAACCGTTCCAATGTCGGTAACCTGAAGCTGGCCTGGTATCTGGACCTTGATACCAACC
GTGGCCAGGAAGGCACGCCCCTGGTTATTGATGGCGTCATGTACGCCACCACCAACTGGAGCATGATGAA
AGCCGTCGACGCCGCAACCGGCAAGCTGCTGTGGTCCTATGACCCGCGCGTGCCCGGCAACATTGCCGAC
AAGGGCTGCTGTGACACGGTCAACCGTGGCGCGGCATACTGGAATGGCAAGGTCTATTTCGGCACGTTCG
ACGGTCGCCTGATCGCGCTGGACGCCAAGACCGGCAAGCTGGTCTGGAGCGTCAACACCATTCCGCCCGA
AGCGGAACTGGGCAAGCAGCGTTCCTATACGGTTGACGGCGCGCCCCGTATCGCCAAGGGCCGCGTGA>>
FASTA format
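A FASTA record is a header line starting with ">" followed by one or more lines of sequence. The following is a minimal sketch of a reader for that layout; `parse_fasta` is a name chosen for this example (not an NCBI tool), and the two sequence lines are copied from the record shown above.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


example = """>gi|216185|dbj|D00635|ABCADHCC Acetobacter polyoxogenes genes for alcohol dehydrogenase, cytochrome c, complete cds
GAATTCCGAACTATCCGTTTCATTGCTTATGCGACAGCATGTTCACTTTTTAGTGAGGCTGAACACTAAA
ATGTCAGGAGACGAGCGTGCTAGCCTCAGTATGTTGCCATGAAACGGACCACCTGCTTTGTCTTTCCTGC
"""

for header, sequence in parse_fasta(example):
    identifier = header.split()[0]
    print(identifier, len(sequence), "bases")
```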
Medline searches: Academia Sinica Library (local)
Http://igm.nlm.nih.gov/
Given COX-1 and COX-2, can a putative COX-3 be identified?
Text search for COX-3 (and suitable alternative forms)
Acquire human COX-1 and COX-2 sequences
Search for sequence similarities in a full-length sequence database
Search for sequence similarities in an EST database
Merge the results of the full-length and EST searches
ESTs virtually identical to COX-1 and COX-2: may provide tissue localization information
ESTs similar, but not identical, to COX-1/-2: search ESTs back against full-length databases
Strong similarities with other genes indicate close relationship of COX family
to another gene family, probably with a different function
Is it highly similar to COX-1, COX-2 or both?
Is it only weakly similar? If so, might it be more similar to something else,
a putative COX-3?
In silico cloning:
In order to perform an electronic cDNA library screen, the EST sequences
retrieved in this way can be used as queries in a BLASTN search of dbEST to identify
overlapping ESTs. This procedure can be repeated with the newly identified ESTs
until no additional hits are found. The ESTs isolated can then be assembled into
sequence contigs using assembly software.
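The iterate-until-no-new-hits idea can be sketched in a few lines. Note the assumptions: the real procedure uses BLASTN against dbEST and a dedicated assembler, whereas this toy version substitutes exact suffix/prefix string overlap, ignores sequencing errors and reverse-complement reads, and uses made-up function names and short made-up sequences.

```python
def overlap_length(left, right, min_overlap):
    """Length of the longest suffix of `left` equal to a prefix of `right`."""
    for size in range(min(len(left), len(right)), min_overlap - 1, -1):
        if size > 0 and left.endswith(right[:size]):
            return size
    return 0


def extend_contig(query, ests, min_overlap=4):
    """Greedily extend `query` with overlapping ESTs until none can be added."""
    contig = query
    remaining = list(ests)
    extended = True
    while extended:
        extended = False
        for est in remaining:
            if est in contig:
                merged = contig                       # EST adds nothing new
            else:
                right = overlap_length(contig, est, min_overlap)
                left = overlap_length(est, contig, min_overlap)
                if right > 0 and right >= left:
                    merged = contig + est[right:]     # EST extends the 3' side
                elif left > 0:
                    merged = est + contig[left:]      # EST extends the 5' side
                else:
                    continue                          # no usable overlap
            contig = merged
            remaining.remove(est)
            extended = True
            break                                     # restart with longer contig
    return contig, remaining


contig, unused = extend_contig("GATTACAGG", ["CAGGTTTC", "AAAGATTA"])
print(contig, unused)
```

Restarting the scan after every successful merge mirrors the repeated database search: an EST that did not overlap the original query may overlap the extended contig.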
[Figure: Query sequence extended by overlapping EST 1, EST 2 and EST 3]
  1 mdltkmgmiq lqnpshptgl lckanqmrla gtlcdvvimv dsqefhahrt vlactskmfe
 61 ilfhrnsqhy tldflspktf qqileyayta tlqakaedld dllyaaeile ieyleeqclk
121 mletiqasdd ndteatmadg gaeeeedrka rylknifisk hsseesgyas vagqslpgpm
181 vdqspsvsts fglsamsptk aavdslmtig qsllqgtlqp pagpeeptla gggrhpgvae
241 vktemmqvde vpsqdspgaa essisggmgd kveergkegp gtptrssvit sarelhygre
301 esaeqvpppa eagqaptgrp ehpapppekh lgiysvlpnh kadavlsmps svtsglhvqp
361 alavsmdfst yggllpqgfi qrelfsklge lavgmksesr tigeqcsvcg velpdneave
421 qhrklhsgmk tygcelcgkr fldslrlrmh llahsagaka fvcdqcgaqf skedalethr
481 qthtgtdmav fcllcgkrfq aqsalqqhme vhagvrsyic secnrtfpsh talkrhlrsh
541 tgdhpyecef cgscfrdest lkshkrihtg ekpyecngcd kkfslkhqle thyrvhtgek
601 pfecklchqr srdysamikh lrthngaspy qcticteycp slssmqkhmk ghkpeeippd
661 wriektylyl cyv
Sequence Alignment and Similarity Search:
One goal of sequence alignment is to enable the researcher to determine
whether two sequences display sufficient similarity to justify the inference of
homology. Similarity is an observable quantity that might be expressed as, say,
percent identity or some other suitable measure. Homology, on the other hand,
refers to a conclusion drawn from these data that two genes share a common
evolutionary history. While it is presumed that homologous sequences have
diverged from a common ancestral sequence through iterative changes, we do not
actually know what the ancestral sequence was (barring the possibility that DNA
could be recovered from a fossil); all we have to observe are the sequences from
extant organisms. In a residue-by-residue alignment it is often apparent that
certain regions of a protein, or perhaps specific amino acids, are more highly
conserved than others. This information may be suggestive of which residues are
most crucial for maintaining a protein's structure or function.
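Percent identity, the similarity measure mentioned above, is easy to compute once two sequences are aligned. The sketch below is one possible convention (identical columns divided by all aligned columns, gaps included); other definitions use a different denominator. The function name is hypothetical, and the two fragments are the last 26 columns of the PLZF/TZFP alignment block that ends at PLZF residue 598.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent of alignment columns in which both sequences carry the same residue."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    # Columns where both rows are gaps carry no information and are skipped.
    columns = [(a, b) for a, b in zip(aligned_a, aligned_b)
               if not (a == gap and b == gap)]
    if not columns:
        return 0.0
    identical = sum(1 for a, b in columns if a == b and a != gap)
    return 100.0 * identical / len(columns)


plzf_fragment = "PYECNGCDKKFSLKHQLETHYRVHTG"
tzfp_fragment = "PYACSVCGKRFSLKHQMETHYRVHTG"
print(f"{percent_identity(plzf_fragment, tzfp_fragment):.1f}% identity")
```

A high percent identity is an observation; whether it justifies a conclusion of homology remains a separate inference, as the paragraph above stresses.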
hum pLZF p    1 MDLTKMGMIQLQNPSHPTGLLCKANQMRLAGTLCDVVIMVDSQEFHAHRTVLACTSKMFE  60
hum TZFP p    1    MSLPPIRLPSPYGSDRLVQLAARLRPA--LCDTLITVGSQEFPAHSLVLAGVSQQLG  55

hum pLZF p   61 ILFHRNSQHYTLDFLSPKTFQQILEYAYTATLQAKAEDLDDLLYAAEILEIEYLEEQCLK 120
hum TZFP p   56 ----RRGQWALGEGISPSTFAQLLNFVYGESVELQPGELRPLQEAARALGVQSLEEACWR 111

hum pLZF p  121 MLETIQASDDNDTEATMADGGAEEEEDRKARYLKNIFISKHSSEESGYASVAGQSLPGPM 180
hum TZFP p  112 ARGDRAKKPDP--------G-----------------LKKHQEEPEKPSRNPERELGDPG 146

hum pLZF p  181 VDQSP-SVSTSFGLSAMSPTKAAVDSLMTIGQSLLQGTLQPPAGPEEPTLAGGGRHPGVA 239
hum TZFP p  147 EKQKPEQVSRTGGR-----------------EQEMLHKHSPPRG--RPEMAG-------- 179

hum pLZF p  240 EVKTEMMQVDEVPSQDSPGAAESSISGGMGDKVEERGKEGPGTPTRSSVITSARELHYGR 299
hum TZFP p  180 --ATQEAQQEQTRSK------EKRLQAPVG----QRGADG-----KHGVLTWLRENPGGS 222

hum pLZF p  300 EESAEQVPPPAEAGQAPTGRPEHPAPP-PEKHLGIYSVLPNHKADAVLSMPSSVTSGLHV 358
hum TZFP p  223 EESLRKLPGPLP----PAGSLQTSVTPRPSWAEAPWLVGGQPALWSILLMPP-------- 270

hum pLZF p  359 QPALAVSMDFSTYGGLLPQGFIQRELFSKLGELAVGMKSESRTIGEQCSVCGVELPDNEA 418
hum TZFP p  271 RYGIPFYHSTPTTGAWQEVWREQRIPLSLNAPKGLWSQNQ---L-ASSSPTPGSLP---- 322

hum pLZF p  419 VEQHRKLHSGMKTYGCELCGKRFLDSLRLRMHLLAHSAGAKAFVCDQCGAQFSKEDALET 478
hum TZFP p  323 ----------------------------------------------QGPAQLS-PGEMEE 335

hum pLZF p  479 HRQTHTGTDMAVFCLLCGKRFQAQSALQQHMEVHAGVRSYICSECNRTFPSHTALKRHLR 538
hum TZFP p  336 SDQGHTG---------------ALATCAGHEDKAG------CPPRPHPPPAPPARSR--- 371

hum pLZF p  539 SHTGDHPYECEFCGSCFRDESTLKSHKRIHTGEKPYECNGCDKKFSLKHQLETHYRVHTG 598
hum TZFP p  372 ----------------------------------PYACSVCGKRFSLKHQMETHYRVHTG 397

hum pLZF p  599 EKPFECKLCHQRSRDYSAMIKHLRTHNGASPYQCTICTEYCPSLSSMQKHMKGHKPEEIP 658
hum TZFP p  398 EKPFSCSLCPQRSRDFSAMTKHLRTH-GAAPYRCSLCGAGCPSLASMQAHMRGHSPSQLP 456

hum pLZF p  659 PDWRIEKTYLY------------LCYV     673
hum TZFP p  457 PGWTIRSTFLYSSSRPSRPSTSPCCPSSSTT 487
Sequence Alignment and Similarity Search:
Database similarity searching allows us to determine which of the
hundreds of thousands of sequences present in the database are potentially related
to a particular sequence of interest. In database searching, the basic operation is to
sequentially align a query sequence to each subject sequence in the database. The
results are reported as a ranked hit list followed by a series of individual sequence
alignments, plus various scores and statistics. Current sequence databases are
already immense and have continued to increase at an exponential rate, making
straightforward application of dynamic programming methods impractical for
database searching. One solution is to use massively parallel computers. There are
several frequently used programs available on the Internet:
FastA
BLITZ
BLAST
Smith-Waterman based system (GenWeb of NHRI)
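To see why exhaustive dynamic programming becomes costly, here is a minimal sketch of Smith-Waterman local alignment scoring, followed by a tiny "database search" that ranks subjects by score. The scoring values (match +2, mismatch -1, linear gap -2) and the toy sequences are assumptions for illustration; production tools use substitution matrices and affine gap penalties.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score between a and b (linear gap penalty)."""
    previous = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        current = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diagonal = previous[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            current[j] = max(0, diagonal, previous[j] + gap, current[j - 1] + gap)
            best = max(best, current[j])
        previous = current
    return best


# Toy database: every subject is aligned to the query, then hits are ranked.
query = "ACGTTGCA"
database = {"subject1": "TTACGTTGCATT", "subject2": "GGGGCCCC", "subject3": "ACGTAGCA"}
hits = sorted(((smith_waterman_score(query, seq), name) for name, seq in database.items()),
              reverse=True)
for score, name in hits:
    print(name, score)
```

Each comparison fills a table of len(query) x len(subject) cells, so the total work grows with the size of the whole database. That cost is what motivates either parallel hardware or faster heuristic programs.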
Blast Family of Programs
The BLAST family of programs allows all combinations of DNA or protein query sequences with searches
against DNA or protein databases:
blastp    compares an amino acid query sequence against a protein sequence database.
blastn    compares a nucleotide query sequence against a nucleotide sequence database.
blastx    compares the six-frame conceptual translation products of a nucleotide
          query sequence (both strands) against a protein sequence database.
tblastn   compares a protein query sequence against a nucleotide sequence database
          dynamically translated in all six reading frames (both strands).
tblastx   compares the six-frame translations of a nucleotide query sequence against
          the six-frame translations of a nucleotide sequence database.
The default matrix for all protein-protein comparisons is BLOSUM62.
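The "six-frame conceptual translation" used by blastx, tblastn and tblastx can be illustrated as follows. This is a sketch, not BLAST's own code: it assumes the standard genetic code, writes stop codons as "*" and unrecognised codons as "X", and all function names are hypothetical.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order at each position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Sequence of the opposite strand, read 5' to 3'."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons from the first base; trailing bases are dropped."""
    codons = (dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frame_translation(dna):
    """Map frame labels (+1..+3, -1..-3) to conceptual translations."""
    dna = dna.upper()
    frames = {}
    for strand, sequence in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(sequence[offset:])
    return frames


for frame, protein in six_frame_translation("ATGGCC").items():
    print(frame, protein)
```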
Databases available for BLAST search
Protein Sequence Databases
nr          All non-redundant GenBank CDS translations+PDB+SwissProt+PIR+PRF
month       All new or revised GenBank CDS translation+PDB+SwissProt+PIR+PRF released in the last 30 days.
swissprot   The last major release of the SWISS-PROT protein sequence database (no updates)
yeast       Yeast (Saccharomyces cerevisiae) protein sequences.
E. coli     E. coli genomic CDS translations
pdb         Sequences derived from the 3-dimensional structure Brookhaven Protein Data Bank
Nucleotide Sequence Databases
nr          All non-redundant GenBank+EMBL+DDBJ+PDB sequences (but no EST, STS, GSS, or HTGS sequences)
month       All new or revised GenBank+EMBL+DDBJ+PDB sequences released in the last 30 days.
dbest       Non-redundant Database of GenBank+EMBL+DDBJ EST Divisions
dbsts       Non-redundant Database of GenBank+EMBL+DDBJ STS Divisions
yeast       Yeast (Saccharomyces cerevisiae) genomic nucleotide sequences
E. coli     E. coli genomic nucleotide sequences
organism
CLUSTAL W
One of the most widely used multiple sequence alignment programs. Based
on the idea of progressive alignment, this program takes an input set of
sequences and calculates a series of pairwise alignments, comparing
each sequence to every other sequence, one at a time.
Human PLZF: C2H2 zinc finger motif (ZINC1) matches
Hit          Part #1           Part #2                        Part #3
406 ZINC1    406  C   1/1      409  CGVELPDNEAVEQH  14/14     426  H   1/1
434 ZINC1    434  C   1/1      437  CGKRFLDSLRLRMH  14/14     454  H   1/1
463 ZINC1    463  C   1/1      466  CGAQFSKEDALETH  14/14     483  H   1/1
492 ZINC1    492  C   1/1      495  CGKRFQAQSALQQH  14/14     512  H   1/1
520 ZINC1    520  C   1/1      523  CNRTFPSHTALKRH  14/14     540  H   1/1
548 ZINC1    548  C   1/1      551  CGSCFRDESTLKSH  14/14     568  H   1/1
576 ZINC1    576  C   1/1      579  CDKKFSLKHQLETH  14/14     596  H   1/1
604 ZINC1    604  C   1/1      607  CHQRSRDYSAMIKH  14/14     624  H   1/1
632 ZINC1    632  C   1/1      635  CTEYCPSLSSMQKH  14/14     652  H   1/1
BLOCK
ID   ZINC_FINGER_C2H2; BLOCK
AC   BL00028; distance from previous block=(7,2235)
DE   Zinc finger, C2H2 type, domain proteins.
BL   CHP motif; width=29; seqs=135; 99.5%=1594; strength=1246
ADR1_YEAST ( 106) CEVCTRAFARQEHLKRHYRSHTNEKPYPC 10
AEF1_DROME ( 214) CNVCDKTFRQSSTLTNHLKIHTGEKPYNC 10
AZF1_YEAST ( 623) CDYCGKRFTQGGNLRTHERLHTGEKPYSC 10
BASO_HUMAN ( 358) CTACEKTFYDKGTLKIHYNAVHLKIKHKC 39
BRC1_DROME ( 669) CNICKRVYSSLNSLRNHKSIYHRNLKQPK 37
BRC2_DROME ( 471) CAICERVYCSRNSLMTHIYTYHKSRPGEM 27
BRC3_DROME ( 467) GSLAAAVYSLHSHAHGHVLGHATSPPRPG 87
BRLA_EMENI ( 324) EPGCNGRFKRQEHLKRHMKSHSKEKPHVC 22
BTEB_RAT ( 147) YSGCGKVYGKSSHLKAHYRVHTGERPFPC 11
CF23_DROME ( 368) CPDCPKTFKTPGTLAMHRKIHTGEAEREA 24
CF2_DROME ( 403) CSYCGKSFTQSNTLKQHTRIHTGEKPFRC 11
Prosite
ID   ZINC_FINGER_C2H2; PATTERN.
AC   PS00028;
DT   APR-1990 (CREATED); JUN-1994 (DATA UPDATE); NOV-1995 (INFO UPDATE).
DE   Zinc finger, C2H2 type, domain.
PA   C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H.
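A PROSITE pattern such as PS00028 maps almost directly onto a regular expression. The converter below is a minimal sketch covering only the syntax elements needed here (single residues, x wildcards, [..] sets, {..} exclusions and repeat counts); it is not the official PROSITE scanner, and the function names are hypothetical. The test fragment is residues 401-430 of the human PLZF sequence listed earlier.

```python
import re

ELEMENT = re.compile(r"(x|[A-Z]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)(?:,(\d+))?\))?")


def prosite_to_regex(pattern):
    """Convert a simple PROSITE pattern into a Python regular expression."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        match = ELEMENT.fullmatch(element)
        if match is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        residue, low, high = match.groups()
        if residue == "x":
            residue = "."                          # any amino acid
        elif residue.startswith("{"):
            residue = "[^" + residue[1:-1] + "]"   # any residue except these
        if low is None:
            repeat = ""
        elif high is None:
            repeat = "{" + low + "}"
        else:
            repeat = "{" + low + "," + high + "}"
        parts.append(residue + repeat)
    return "".join(parts)


def find_motifs(pattern, protein, first_residue=1):
    """Return (position, matched text) pairs for non-overlapping pattern hits."""
    regex = re.compile(prosite_to_regex(pattern))
    return [(m.start() + first_residue, m.group()) for m in regex.finditer(protein.upper())]


C2H2 = "C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H."
fragment = "tigeqcsvcgvelpdneaveqhrklhsgmk"   # PLZF residues 401-430
print(prosite_to_regex(C2H2))
print(find_motifs(C2H2, fragment, first_residue=401))
```

On this fragment the hit starts at residue 406, which agrees with the first ZINC1 match reported for human PLZF.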
Phylogenetic Analysis:
Phylogenetics is the study of evolutionary relationships. Phylogenetic
analysis is the means of inferring or estimating these relationships. The
evolutionary history inferred from phylogenetic analysis is usually depicted as
branching (treelike) diagrams, which represent a sort of pedigree of the inherited
relationships among molecules (“gene trees”), organisms, or both. The four steps
in phylogenetic analysis of DNA sequences are alignment, determining the
substitution model, tree building, and tree evaluation. While other scientific
analyses generally have empirical bases, phylogenetic analyses do not: the
physical events yielding a phylogeny happened in the past, and can only be
inferred or estimated. The three major tree-building criteria are distance,
maximum parsimony, and maximum likelihood.
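As a small illustration of the "substitution model" and "distance" steps, the sketch below computes the observed proportion of differing sites (p-distance) between two aligned DNA sequences and corrects it with the Jukes-Cantor model. Jukes-Cantor is chosen here only because it is the simplest substitution model; the source does not say which model to use, and the function names and example sequences are made up.

```python
import math


def p_distance(aligned_a, aligned_b, gap="-"):
    """Proportion of differing sites, ignoring columns that contain a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    pairs = [(a, b) for a, b in zip(aligned_a, aligned_b) if a != gap and b != gap]
    if not pairs:
        raise ValueError("no comparable sites")
    return sum(a != b for a, b in pairs) / len(pairs)


def jukes_cantor_distance(p):
    """Estimated substitutions per site under Jukes-Cantor: d = -3/4 ln(1 - 4p/3)."""
    if p >= 0.75:
        return math.inf        # saturation: the correction is undefined
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


p = p_distance("ACGTACGTAC", "ACGTTCGTAA")
print(f"p-distance = {p:.2f}, Jukes-Cantor distance = {jukes_cantor_distance(p):.3f}")
```

The corrected distance is always at least as large as the observed one, because some sites will have changed more than once. A matrix of such pairwise distances is the input to distance-based tree building.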
Over 130 packages available for various platforms
Radial
Slanted
Cladogram
Phylogram
http://www2.ebi.ac.uk/clustalw/
Ortholog:
Homologous genes that have diverged from each other after speciation events (e.g.,
human beta- and chimp beta-globin)
Paralog:
Homologous genes that have diverged from each other after gene duplication events
(e.g., human beta- and gamma-globin)
Xenolog:
Homologous genes that have diverged from each other after lateral gene transfer
events (e.g., antibiotic resistance genes in bacteria)
Homolog:
Genes that are descended from a common ancestor (e.g., all globins)
COG0568 K DNA-dependent RNA polymerase sigma70/sigma32 subunits
EST: Expressed Sequence Tags
dbEST is a division of GenBank that contains sequence data and other
information on "single-pass" cDNA sequences, or Expressed Sequence Tags,
from a number of organisms. There are 1,775,721 entries in human EST and
918,414 entries in mouse EST, out of a total of 3,643,273 sequence entries
in dbEST (Feb. 18, 2000).
EST projects have their roots in the early 1980s, when it was
recognized that short stretches of DNA sequences from
cDNAs could be used to identify genes. The Institute for
Genomic Research (TIGR) was established to generate EST
data on a massive scale. The largest projects conducted
entirely in the public domain include an effort funded by
Merck and Co., which has deposited more than 500,000 human
ESTs into dbEST. A hallmark of this endeavour, carried out
by a collaboration between the Washington University Genome
Sequencing Center and members of the IMAGE (Integrated
Molecular Analysis of Gene Expression) consortium, has been
the rapid deposition of the sequences into the public domain
and the concomitant availability of the sequence-tagged
clones.
dbEST release 021800
Summary by Organism - February 18, 2000
Number of public entries: 3,643,273
Homo sapiens (human)                  1,775,721
Mus musculus + domesticus (mouse)     918,414
Rattus sp. (rat)                      134,685
Caenorhabditis elegans (nematode)     101,232
Drosophila melanogaster (fruit fly)   86,121
Danio rerio (zebrafish)               61,893
Lycopersicon esculentum (tomato)      53,603
Zea mays (maize)                      51,883
Glycine max (soybean)                 50,656
Oryza sativa (rice)                   47,939
Arabidopsis thaliana (thale cress)    45,757
Search: AA927876
dbEST Id:        1659486
IDENTIFIERS
EST name:        om18b09.s1
GenBank Acc:     AA927876
GenBank gi:      3076620
CLONE INFO
Clone Id:        IMAGE:1541369 (3')
Source:          NCI
Insert length:   1074
DNA type:        cDNA
PRIMERS
Sequencing:      -40m13 fwd. ET from Amersham
SEQUENCE
TTTGACGGGAGGGCACAGGAAACTCTTTATTATGGTGATGAGATCGACAATCTCCCCTAC
TGTTAACCTTCGCTCCTGCACACTTCAGTGTCCTCACTCTGTAGGGCTCGCTGGCCTGGG
CTTCTGCGACCCGCGATCGTCCAGGAGAGGGCACTCGGCGCCCTTCCTGGGGTNNTCTGG
GGCGGAATTTGCTAGGCCGCCGTAGCAGCTGTGCCAGGTCAGAAGCCGAGCCGGNCCGCT
TTTCGTTCTTTAATTGGACTCTTGGCTAAGACGCTACCGACACCCCGTCAGTGGTGGAGG
AAGAAGGACAACAGGGAGAGGTCGAGG
Quality:         High quality sequence stops at base: 318
Entry Created:   Apr 17 1998
Last Updated:    Jun 10 1998
COMMENTS
This clone is available royalty-free through LLNL; contact
the IMAGE Consortium (info@image.llnl.gov) for further
information.
LIBRARY
dbEST lib id:    1042
Lib Name:        Soares_NFL_T_GBC_S1
Organism:        Homo sapiens
Organ:           pooled
Lab host:        DH10B
Vector:          pT7T3D-Pac (Pharmacia) with a modified polylinker
R. Site 1:       Not I
R. Site 2:       Eco RI
Description:     Equal amounts of plasmid DNA from three normalized libraries
(fetal lung NbHL19W, testis NHT, and B-cell NCI_CGAP_GCB1) were
mixed, and ss circles were made in vitro. Following HAP
purification, this DNA was used as tracer in a subtractive
hybridization reaction. The driver was PCR-amplified cDNAs from
pools of 5,000 clones made from the same 3 libraries. The pools
consisted of I.M.A.G.E. clones 297480-302087, 682632-687239,
726408-728711, and 729096-731399. Subtraction by Bento Soares
and M. Fatima Bonaldo.
Simple Mathematics:
Summary by Organism - February 18, 2000
Homo sapiens (human) ESTs: 1,775,721
Human genes: 100,000 genes
More than 10-fold coverage!
Clustering is the process of finding subsets of sequences which belong together within a
larger set. This is done by converting discrete similarity scores to boolean links between
sequences. That is, two sequences are considered linked if their similarity exceeds a
threshold. UniGene clustering proceeds in several stages, with each stage adding less
reliable data to the results of the preceding stage. This staged clustering affords greater
control than a more egalitarian treatment of all links between sequences.
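The step of turning similarity scores into boolean links and then into clusters can be sketched with a union-find structure. This is an illustrative single-stage version only: UniGene's actual staged procedure is not reproduced, and the function name, sequence names, scores and threshold are all invented for the example.

```python
def cluster_by_threshold(names, scores, threshold):
    """Group sequences connected by links whose similarity exceeds `threshold`.

    `scores` maps (name_a, name_b) pairs to a similarity score.
    """
    parent = {name: name for name in names}

    def find(name):
        # Follow parent pointers to the cluster representative (with path halving).
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for (a, b), score in scores.items():
        if score > threshold:              # discrete score -> boolean link
            parent[find(a)] = find(b)      # merge the two clusters

    clusters = {}
    for name in names:
        clusters.setdefault(find(name), []).append(name)
    return sorted(sorted(members) for members in clusters.values())


names = ["est1", "est2", "est3", "est4"]
scores = {("est1", "est2"): 0.98, ("est2", "est3"): 0.97, ("est3", "est4"): 0.40}
print(cluster_by_threshold(names, scores, threshold=0.95))
```

Note that est1 and est3 end up in the same cluster without being directly linked. This transitivity is why the reliability of each link matters, and why adding less reliable data in later stages gives more control.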
Unigene_HUMAN: 92,571 clusters    |  HGI: 299,412 clusters
Unigene_MOUSE: 75,963 clusters    |  MGI: 104,927 clusters
Unigene_RAT:   28,680 clusters    |  RGI: 35,875 clusters
(Feb. 19, 2000)                   |  (Jul. 3, 1999)
THCs, "Tentative Human Consensus" sequences, are assemblies of human ESTs.
How does TIGR's Human Gene Index compare with UniGene?
The HGI assemblies (and all of TIGR's Gene Index assemblies) are made by first
clustering the EST sequences and then assembling these clusters into consensus
sequences, or THCs(TCs for non-human data). EST sequences are compared and
clustered together if they meet the following criteria:
a minimum 40 base pair match
greater than 95% similarity in the overlap region
a maximum unmatched overhang of 20 base pairs
These clusters are then assembled into consensus sequences using TIGR's in-house
assembly program.
UniGene links ESTs in a cluster if the sequences have a 50 base pair overlap in the
3' untranslated region (UTR) with 100% identity. These clusters are not run through
the more stringent assembly process and consensus sequences are not made. For this
reason you will often find several TIGR THCs contained within one UniGene cluster.
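The two linking rules described above can be written as simple predicates, which makes the difference in stringency easy to see. The function and parameter names are hypothetical, and the thresholds are only those quoted in the text; neither TIGR's nor NCBI's actual software is reproduced here.

```python
def tigr_link(match_length_bp, percent_similarity, overhang_bp):
    """TIGR criteria: >= 40 bp match, > 95% similarity, <= 20 bp unmatched overhang."""
    return match_length_bp >= 40 and percent_similarity > 95.0 and overhang_bp <= 20


def unigene_link(overlap_bp, percent_identity, in_3prime_utr):
    """UniGene criteria: 50 bp overlap in the 3' UTR with 100% identity."""
    return in_3prime_utr and overlap_bp >= 50 and percent_identity == 100.0


# A made-up pair of ESTs sharing a perfect 60 bp overlap in the 3' UTR,
# but with a long stretch that does not align at one end.
print("TIGR link:   ", tigr_link(match_length_bp=60, percent_similarity=100.0, overhang_bp=35))
print("UniGene link:", unigene_link(overlap_bp=60, percent_identity=100.0, in_3prime_utr=True))
```

A pair like this would share a UniGene cluster while being kept apart by the TIGR criteria, which is consistent with finding several THCs inside one UniGene cluster.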
UniGene Human Release Statistics
Statistics for UniGene build uploaded on: Sat Feb 19 2000
UniGene Build #108
Sequences Included in UniGene
=============================
Known genes are from GenBank 114 (1-Dec-1999)
ESTs are from dbEST through 13-Feb-2000
     30044  mRNAs + gene CDSs
    938584  EST, 3'reads
    347845  EST, 5'reads
  + 157255  EST, other/unknown
  ---------
   1473728  total sequences in clusters
Final Number of Clusters (sets)
===============================
     92571  sets total
     10797  sets contain at least one known gene
     91523  sets contain at least one EST
      9749  sets contain both genes and ESTs
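The release figures above are internally consistent, which can be checked with a few lines of arithmetic. The variable names are illustrative; the numbers are the ones quoted in the statistics.

```python
# Sequence counts from the UniGene build #108 statistics.
sequence_counts = {
    "mRNAs + gene CDSs": 30044,
    "EST, 3' reads": 938584,
    "EST, 5' reads": 347845,
    "EST, other/unknown": 157255,
}
total_sequences = sum(sequence_counts.values())

# Sets with a gene plus sets with an EST counts the "both" sets twice,
# so subtract them once (inclusion-exclusion).
sets_with_gene = 10797
sets_with_est = 91523
sets_with_both = 9749
total_sets = sets_with_gene + sets_with_est - sets_with_both

print(total_sequences, total_sets)
```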
HGI Release 4.5 - Nov. 15, 1999
Total sequences
            in THCs     singletons   total
ESTs        1,066,183   241,110      1,307,293
HTs             5,949     1,165          7,114
Totals      1,072,132   242,275      1,314,407
Total unique sequences
THCs              84,837
singleton ESTs   241,110
singleton HTs      1,165
Total            327,112
AA927876 as query (318 bps)
Database: Unigene_HUMAN
58,791 sequences; 43,055,747 total letters
                                                                      Score    E
Sequences producing significant alignments:                           (bits)  Value
gnl|UG|Hs#S971963 ak43b04.s1 Homo sapiens cDNA, 3' end /clone=IM...     599   e-171
gnl|UG|Hs#S510257 70F12 Homo sapiens cDNA /clone=(not-directiona...      36   0.17
gnl|UG|Hs#S971963 ak43b04.s1 Homo sapiens cDNA, 3' end /clone=IMAGE:1408687
/clone_end=3' /gb=AA868505 /ug=Hs.99430 /len=627
Length = 627
Score = 599 bits (302), Expect = e-171
Identities = 321/327 (98%), Positives = 321/327 (98%)
Hs. 99430
Hs.99430 Homo sapiens
EXPRESSION INFORMATION
cDNA sources:
Blood, Ovary, Testis
EST SEQUENCES (8)
AI150041  cDNA clone IMAGE:1751830  Testis  3' read
AA927876  cDNA clone IMAGE:1541369          3' read
AI223414  cDNA clone IMAGE:1838461  Testis  3' read
AI150330  cDNA clone IMAGE:1751988  Testis  3' read
AA868505  cDNA clone IMAGE:1408687  Testis  3' read
AA476210  cDNA clone IMAGE:771312   Ovary   3' read
AA456628  cDNA clone IMAGE:809583   Ovary   3' read
AI361709  cDNA clone IMAGE:2021901  Blood   3' read
Insert sizes (listed for four clones): 1.1 kb, 1.1 kb, 1.0 kb, 0.6 kb
Hs.434 Homo sapiens
Human heregulin-beta1 gene, complete cds
MAPPING INFORMATION
Chromosome:
8
Gene Map 98: stSG4083 , Chr.8, D8S1820-D8S505
Gene Map 98: WI-18803 , Chr.8, D8S1820-D8S505
Gene Map 98: SHGC-12780 , Chr.8, D8S1820-D8S505
EXPRESSION INFORMATION
cDNA sources:
Brain, Breast, Liver, Testis
AA927876 as query (318 bps)
Database: HGI-HUMAN
234,459 sequences; 111,134,950 total letters
                                                  Score    E
Sequences producing significant alignments:       (bits)  Value
lcl|THC226049                                       581   e-165
lcl|R47793                                           40   0.027
                                                     34   1.7
lcl|THC226049
Length = 436
THC226049
Score = 581 bits (293), Expect = e-165
Identities = 313/320 (97%), Positives = 313/320 (97%), Gaps = 1/320 (0%)
>THC226049
TGAGGGCACAGGAAACTCTTTATTATGGTGATGAGATCGACAATCTCCCCTACTGTTAACCTTCGCTCCTGCACACTTCA
GTGTCCTCACTCTGTAGGGCTCGCTGGCCTGGGCTTCTGCGACCCGCGATCGTCCAGGAGAGGGCACTCGGCGCCCTTCC
TGGGGCgTTcTGGGGCGGAATTTGCTAGGCCGCCGTAGCAGCGGTGCCAGGTCAGAAGCCGAGCCGGCyCGCTTTTCGTT
CTTTAATTGGACTCTTGGCTAAGACGCTACCGACACCCCGTCaGgTGGTGGAGGAAGAAGGACAACAGGGAGAGGTCGAG
GGCCGAGACGGCTCGAGGGAGGAGTAGAGGAAGGTGGAGCGGATGGTCCATCCGGGCGGGAGTTGGCTGGGCGAGTGACC
GCGCATGTGCCGCTGCATGGAGGGCAAGCTGTTACA
1=================================THC226049================================436
----------------------------1--------------------------->
--------------------------------------2-------------------------------------->
#   EST Id        GB#       ATCC#   left  right  library
-------------------------------------------------------------------------------
1   F zw35g01.s1  AA476210          1     317    ovary tumor NbHOT, Soares
2   F zx75d08.s1  AA456628          1     436    ovary tumor NbHOT, Soares
Sequence source codes:
F = WashU/Merck
There are no hits for THC226049.
In silico cloning:
In order to perform an electronic cDNA library screen, the EST sequences
retrieved in this way can be used as queries in a BLASTN search of dbEST to identify
overlapping ESTs. This procedure can be repeated with the newly identified ESTs
until no additional hits are found. The ESTs isolated can then be assembled into
sequence contigs using assembly software.
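The iterative screen described above can be sketched as a loop that keeps searching with each newly found EST until nothing new turns up. This is a toy illustration only: `shares_overlap` is a hypothetical stand-in for a real BLASTN search of dbEST (it looks for an exact shared 8-base substring), and the sequences are made up.

```python
def shares_overlap(a, b, min_overlap=8):
    """Toy stand-in for a BLASTN hit: True if a and b share an exact
    substring of at least min_overlap bases."""
    kmers = {a[i:i + min_overlap] for i in range(len(a) - min_overlap + 1)}
    return any(b[i:i + min_overlap] in kmers
               for i in range(len(b) - min_overlap + 1))

def in_silico_screen(seed_id, database, min_overlap=8):
    """Collect every EST reachable from the seed through chains of overlaps."""
    found = {seed_id}
    queue = [seed_id]
    while queue:                      # stop when no additional hits are found
        query = database[queue.pop()]
        for est_id, sequence in database.items():
            if est_id not in found and shares_overlap(query, sequence, min_overlap):
                found.add(est_id)
                queue.append(est_id)  # new hit becomes a query in the next round
    return found

# Made-up transcript; three ESTs are windows of it, the fourth is unrelated.
transcript = "ATGGCGTACGTTAGCCTAGGATCCGATTACGGCATTCGAAGCTTGACCTGA"
database = {
    "est_a": transcript[0:20],
    "est_b": transcript[12:34],
    "est_c": transcript[26:51],
    "est_d": "CCCCCCCCAAAAAAAATTTTTTTT",
}
hits = in_silico_screen("est_a", database)
print(sorted(hits))
```

Here est_c is only found in the second round, through est_b, which is the point of repeating the search.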
How to start?
TBLASTN
emb|AJ003623|HSJ003623 H.sapiens DNA for EST MPIpl10-4B1
Length = 556
Score = 46.9 bits (109), Expect = 1e-04
Identities = 29/83 (34%), Positives = 42/83 (49%), Gaps = 8/83 (9%)
Query: 23  RLRPALCDTLITVGSQEFPAHSLVLAGVSQQLGRRGQWALGEG--------ISPSTFAQL 74
           RL+  LCD L+ VG Q+F AH  VLA  S+           E           P  F  +
Sbjct: 164 RLKGQLCDVLLIVGDQKFRAHKNVLAASSEYFQSLFTNKENESQTVFQLDFCEPDAFDNV 343
Query: 75  LNFVYGESVELQPGELRPLQEAARALGVQSL 105
           LN++Y  S+ ++   L  +QE   +LG+  L
Sbjct: 344 LNYIYSSSLFVEKSSLAAVQELGYSLGISFL 436
Experimental results:
TTGANNNCCTTTGAANNNCCNNTTNNTCATAGATCTCTCGAGTTTTTTTTTTTTTTTTTTTCTGAAGGGAGGGCACAGGAAAC
TCTTTATTATGGTGATGAGATCGACAATCTCCCCTACTGTTAACCTTCGCTCCTGCACACTTCAGTGTCCTCACTCTGTAGGG
CTCGCTGGCCTGGGCTTCTGCGACCCGCGATCGTCCAGGAGAGGGCACTCGGCGCCCTTCCTGGGGCGCTTCTGGGGCGGAAT
TTGCTAGGCCGCCGTAGCAGCGGTGCCAGGTCAGAAGCCGAGCCGGCCCGCTTTTCGTTCTTTAATTGGACTCTTGGCTAAGA
CGCTACCGACACCCCGTCAGGTGGTGGAGGAAGAAGGACAACAGGGAGAGGTCGAGGGCCGAGACGGCCTCGAGGAGGAGTAG
AGGAAGGTGGAGCGGATGGTCCATCCGGGCGGGAGTTGGCTGGGCGAGTGACCGCGCATGTGCGCCTGCATGGAGGCCAGGCT
GGGACAGCCGGCCCCGCACAGGGAGCAGCGGTACGGAGCGGCCCCGTGTGTCCGCAGGTGCTTGGTCATGGCCGAGAAGTCCC
GGGAGCGCTGAGGACAAAGGCTACAGGAGAAGGGCTTCTCTCCTGTGTGGACTCGGTAGTGCGTCTCCATCTGATGCTTGAGT
GAAAACCTCTTTCACAGACAGAGCACGCATAGGGGCCCAGACCGAGCANGGTCGACGCGGCCCGCGAAATTCGGATCCCCGGG
GCCTTCATGGGCCATATGACCCCCCAAGCTAGCGTAAATCTGGGAACATCGTATGGGTAAAGCCNTNANAGAATCTCTTTTTT
TTTGGGTTTGGGGNGGGGGTNATCTTTCATTNATCGAATTAGANTAGTTATNTNCCATTAATCCATTGNANNGGNNTTTAAAC
ATTCCCTTGAAGGGATTCCNAAACCCTTTTACCNCAATTTTGGGTCCCGTCCAAACCCAGGTTGACAAGNGGGTTTTTGGAAA
TTNTTTNCCCNTNATTCAATTTTTCCT
Yeast two-hybrid experiment;
Differential Display;
Library screening; etc.
BLASTN search to GenBank
Cosmid from chromosome 19; it is a novel gene.
BLASTN search to dbEST; Unigene; TIGR-HGI
cDNA and genomic DNA alignment and matrix analysis:
Gene prediction programs:
http://CCR-081.mit.edu/GENSCAN.html
GRAIL 2
10138 - 11018   +
12608 - 12748   x
13530 - 13923   x
GENSCAN
10138 - 11018   +
11268 - 11341   +
11450 - 11518   +
11644 - 11808   +
11989 - 12144   +
12360 - 12454   x
12608 - 12748   x
FGENES
 1880 -  1908   x
 5061 -  5175   x
 5900 -  6049   x
 8317 -  8544   +
10357 - 11018   +
11268 - 11341   +
11450 - 11518   +
11644 - 11864   +
polyA: 12521    +
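One simple way to compare the three programs is to ask which predicted exons are reported with identical coordinates by more than one of them. The sketch below uses the exon coordinates listed above; the tallying approach is an illustration and not part of any of the three programs.

```python
from collections import defaultdict

# Predicted exon coordinates (start, end) as listed for each program.
predictions = {
    "GRAIL 2": [(10138, 11018), (12608, 12748), (13530, 13923)],
    "GENSCAN": [(10138, 11018), (11268, 11341), (11450, 11518),
                (11644, 11808), (11989, 12144), (12360, 12454),
                (12608, 12748)],
    "FGENES": [(1880, 1908), (5061, 5175), (5900, 6049), (8317, 8544),
               (10357, 11018), (11268, 11341), (11450, 11518),
               (11644, 11864)],
}

# Record which programs support each exact exon.
support = defaultdict(list)
for program, exons in predictions.items():
    for exon in exons:
        support[exon].append(program)

shared = {exon: programs for exon, programs in support.items()
          if len(programs) > 1}
for exon in sorted(shared):
    print(exon, shared[exon])
```

Exact matching is deliberately strict: 10138-11018 and 10357-11018 share a 3' boundary but count as different exons here.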
(Start)
ATGTCCCTGCCCCCCATAAGACTGCCCAGCCCCTATGGCTCTGATCGGCTGGTACAGCTAGCAGCCAGGCTCCGGCCAGCACTCTGTGATACTCTGATCACCGTAGGGAGCCAGGAGTTC
M S L P P I R L P S P Y G S D R L V Q L A A R L R P A L C D T L I T V G S Q E F>
CCCGCCCACAGCCTGGTGCTAGCAGGTGTCAGCCAGCAGCTGGGCCGCAGGGGCCAGTGGGCTCTGGGAGAAGGCATCAGCCCTTCTACCTTTGCCCAGCTCCTGAACTTTGTGTATGGG
P A H S L V L A G V S Q Q L G R R G Q W A L G E G I S P S T F A Q L L N F V Y G>
GAGAGTGTAGAGCTGCAGCCTGGAGAGCTAAGGCCCCTTCAGGAGGCGGCCAGGGCCTTGGGAGTGCAGTCCCTGGAAGAGGCATGCTGGAGGGCTCGAGGGGACAGGGCTAAAAAGCCA
E S V E L Q P G E L R P L Q E A A R A L G V Q S L E E A C W R A R G D R A K K P>
GATCCAGGCCTGAAGAAACATCAGGAGGAGCCAGAGAAACCCTCAAGGAATCCTGAGAGAGAACTGGGGGACCCTGGAGAGAAGCAGAAACCAGAACAGGTTTCTAGAACTGGTGGGAGA
D P G L K K H Q E E P E K P S R N P E R E L G D P G E K Q K P E Q V S R T G G R>
GAACAGGAGATGTTGCACAAGCACTCGCCACCAAGAGGCAGACCCGAGATGGCAGGAGCAACGCAGGAGGCTCAGCAGGAACAGACCAGGTCAAAGGAGAAACGCCTCCAAGCCCCTGTT
E Q E M L H K H S P P R G R P E M A G A T Q E A Q Q E Q T R S K E K R L Q A P V>
GGCCAAAGGGGAGCAGATGGGAAGCATGGAGTGCTCACGTGGTTGAGGGAAAATCCAGGGGGCTCTGAGGAAAGTCTGCGCAAGCTCCCTGGCCCCCTTCCCCCAGCAGGCTCCCTGCAA
G Q R G A D G K H G V L T W L R E N P G G S E E S L R K L P G P L P P A G S L Q>
ACCAGCGTCACCCCTAGGCCCTCGTGGGCTGAGGCCCCTTGGTTGGTGGGGGGCCAGCCTGCCCTGTGGAGCATCCTGCTGATGCCGCCCAGATATGGCATTCCCTTCTACCATAGCACC
T S V T P R P S W A E A P W L V G G Q P A L W S I L L M P P R Y G I P F Y H S T>
CCCACCACTGGAGCCTGGCAGGAGGTCTGGCGGGAACAGAGGATCCCACTGTCCCTAAATGCCCCCAAAGGGCTCTGGAGCCAGAACCAGTTGGCCTCCTCCAGCCCTACCCCAGGTTCC
P T T G A W Q E V W R E Q R I P L S L N A P K G L W S Q N Q L A S S S P T P G S>
CTCCCCCAGGGCCCCGCACAGCTCAGCCCTGGGGAGATGGAAGAGTCTGATCAGGGGCACACAGGCGCACTTGCAACCTGTGCGGGTCATGAGGACAAGGCAGGCTGCCCACCTCGCCCG
L P Q G P A Q L S P G E M E E S D Q G H T G A L A T C A G H E D K A G C P P R P>
CACCCTCCCCCGGCCCCTCCTGCTCGGTCTCGGCCCTATGCGTGCTCTGTCTGTGGAAAGAGGTTTTCACTCAAGCATCAGATGGAGACGCACTACCGAGTCCACACAGGAGAGAAGCCC
H P P P A P P A R S R P Y A C S V C G K R F S L K H Q M E T H Y R V H T G E K P>
TTCTCCTGTAGCCTTTGTCCTCAGCGCTCCCGGGACTTCTCGGCCATGACCAAGCACCTGCGGACACACGGGGCCGCTCCGTACCGCTGCTCCCTGTGCGGGGCCGGCTGTCCCAGCCTG
F S C S L C P Q R S R D F S A M T K H L R T H G A A P Y R C S L C G A G C P S L>
GCCTCCATGCAGGCGCACATGCGCGGTCACTCGCCCAGCCAACTCCCGCCCGGATGGACCATCCGCTCCACCTTCCTCTACTCCTCCTCGAGGCCGTCTCGGCCCTCGACCTCTCCCTGT
A S M Q A H M R G H S P S Q L P P G W T I R S T F L Y S S S R P S R P S T S P C>
TGTCCTTCTTCCTCCACCACCTGACGGGGTGTCGGTAGCGTCTTAGCCAAGAGTCCAATTAAAGAACGAAAAGCGGGCCGGCTCGGCTTCTGACCTGGCACCGCTGCTACGGCGGCCTAG
C P S S S T T *
hum TZF p
hum pLZF p
mus pLZF p
1
MSLPPIRLPSPYGSDRLVQLAARLRPALCDTLITVGSQEFPAHSLVLAGVSQQLG----RRGQWALGEGISPSTFAQLLNFVYGESVELQPGELR 91
1 MDLTKMGMIQLQNPSHPTGLLCKANQMRLAGTLCDVVIMVDSQEFHAHRTVLACTSKMFEILFHRNSQHYTLDFLSPKTFQQILEYAYTATLQAKAEDLD 100
1 MDLTKMGMIQLQNPSHPTGLLCKANQMRLAGTLCDVVIMVDSQEFHAHRTVLACTSKMFEILFHRNSQHYTLDFLSPKTFQQILEYAYTATLQAKAEDLD 100
M : :: PS
RL :LCD :I V SQEF AH VLA S:
R Q
: :SP TF Q:L : Y ::: : :L
hum TZF p
hum pLZF p
mus pLZF p
92 PLQEAARALGVQSLEEACW------RARGD---RAKKPDPG----------------LKKHQEEPEKPSRNPERELGDPGEKQKP--------------- 151
101 DLLYAAEILEIEYLEEQCLKMLETIQASDDNDTEATMADGGAEEEEDRKARYLKNIFISKHSSEESGYASVAGQSLPGPMVDQSPSVSTSFGLSAMSPTK 200
101 DLLYAAEILEIEYLEEQCLKILETIQASDDNDTEATMADGGGEEEDDRKARYLKNIFISKHSSEESGYASVAGQSLPGPMVDQSPSVSTSFGLSAMSPTK 200
L AA L :: LEE C
:A D
A
D G
: KH E
:
: L P
Q P
hum TZF p
hum pLZF p
mus pLZF p
152 EQVSRTGGREQEMLH-KHSPPRG--RPEMAG-----ATQEAQQEQTRSKEKRLQ-AP------VG--------QRGADG-----KHGVLTWLRENPGGSE 223
201 AAVDSLMTIGQSLLQGTLQPPAGPEEPTLAGGGRHPGVAEVKTEMMQVDEVPSQDSPGAAESSISGGMGDKVEERGKEGPGTPTRSSVITSARELHYGRE 300
201 AAVDSLMSIGQSLLQGTLQPPAGPEEPTLAGGGRHPGVAEVKMEMMQVDEAPCQDSPGAAESSISGGMGDKFEERSKEGPGTPTRRSVITSARELHYGRE 300
V
Q :L:
PP G
P :AG
E : E : E
Q :P
:
:R :G
: V:T RE
G E
hum TZF p
hum pLZF p
mus pLZF p
224 ESLRKLPGPLP----PAGSLQTSVTP--RP--SWAEAP----WLVGGQP-ALWSILLMPPRYGIPFYHST-----PTTGAWQEVWR-----------EQR 294
301 ESAEQVPPPAEAGQAPTGRPEHPAPPPEKHLGIYSVLPNHKADAVLSMPSSVTSGLHVQPALAVSMDFSTYGGLLPQGFIQRELFSKLGELAVGMKSESR 400
301 ESGEQLSPPVEAGQGPPGRQEPLAPPVEKHLGIYSVLPNHKADAVLSMPSSVTSGLHVQPALAVSMDFSTYGGLLPQGFIQRELFSKLGELAVGMKAESR 400
ES :: P
P G :
P :
: P
V
P :: S L : P
:
ST
P
:E:
E R
hum TZF p
hum pLZF p
mus pLZF p
295 ----------IPLSLN--------APKGLWSQ----------N-----Q--LASSSPTPGSLP-QGPAQLSP-GEMEESDQGHTGALAT-----CAG--- 349
401 TIGEQCSVCGVELPDNEAVEQHRKLHSGMKTYGCELCGKRFLDSLRLRMHLLAHSAGAKAFVCDQCGAQFSKEDALETHRQTHTGTDMAVFCLLCGKRFQ 500
401 PLGEQCSVCGVELPDNEAVEQHRKLHSGMKTYGCELCGKRFLDSLRLRMHLLAHSAGAKAFVCDQCGAQFSKEDALETHRQTHTGTDMAVFCLLCGKRFQ 500
: L N
G: :
LA S: :
: Q AQ S
:E
Q HTG: :
C
hum TZF p
hum pLZF p
mus pLZF p
350 --------HEDKAG--------CP---P---------RPHPPPAPPARS------R----------------PYACSVCGKRFSLKHQMETHYRVHTGEK 399
501 AQSALQQHMEVHAGVRSYICSECNRTFPSHTALKRHLRSHTGDHPYECEFCGSCFRDESTLKSHKRIHTGEKPYECNGCDKKFSLKHQLETHYRVHTGEK 600
501 AQSALQQHMEVHAGVRSYICSECNRTFPSHTALKRHLRSHTGDHPYECEFCGSCFRDESTLKSHKRIHTGEKPYECNGCGKKFSLKHQLETHYRVHTGEK 600
E :AG
C
P
R H
P
R
PY C C K:FSLKHQ:ETHYRVHTGEK
hum TZF p
hum pLZF p
mus pLZF p
400 PFSCSLCPQRSRDFSAMTKHLRTH-GAAPYRCSLCGAGCPSLASMQAHMRGHSPSQLPPGWTIRSTFLYSSSRPSRPSTSPCCPSSSTT 487
601 PFECKLCHQRSRDYSAMIKHLRTHNGASPYQCTICTEYCPSLSSMQKHMKGHKPEEIPPDWRIEKTYLYLCY-V
673
601 PFECKLCHQRSRDYSAMIKHLRTHNGASPYQCTICTEYCPSLSSMQKHMKGHKPEEIPPDWRIEKTYLYLCYV
673
PF C LC QRSRD:SAM KHLRTH GA:PY:C::C
CPSL:SMQ HM:GH P ::PP W I T:LY :
Hs.99430 Homo sapiens
EXPRESSION INFORMATION
cDNA sources:
Blood, Ovary, Testis
EST SEQUENCES (8)
AI150041  cDNA clone IMAGE:1751830  Testis  3' read
AA927876  cDNA clone IMAGE:1541369          3' read
AI223414  cDNA clone IMAGE:1838461  Testis  3' read
AI150330  cDNA clone IMAGE:1751988  Testis  3' read
AA868505  cDNA clone IMAGE:1408687  Testis  3' read
AA476210  cDNA clone IMAGE:771312   Ovary   3' read
AA456628  cDNA clone IMAGE:809583   Ovary   3' read
AI361709  cDNA clone IMAGE:2021901  Blood   3' read
Insert sizes (listed for four clones): 1.1 kb, 1.1 kb, 1.0 kb, 0.6 kb
Northern Blotting
LOCUS       AF130255     1960 bp    mRNA            PRI       22-FEB-1999
DEFINITION  Homo sapiens testis zinc finger protein (TZFP) mRNA, complete cds.
ACCESSION   AF130255
KEYWORDS    .
SOURCE      human.
  ORGANISM  Homo sapiens
            Eukaryota; Metazoa; Chordata; Vertebrata; Mammalia; Eutheria;
            Primates; Catarrhini; Hominidae; Homo.
REFERENCE   1  (bases 1 to 1960)
  AUTHORS   Tang,Tang K., Lai,Chun-Hung, Tang,Chieh-Ju C., Huang,Chang-Jen and
            Lin,Wen-chang.
  TITLE     Identification and gene structure of a novel human PLZF related
            transcription factor gene, TZFP
  JOURNAL   Unpublished
REFERENCE   2  (bases 1 to 1960)
  AUTHORS   Tang,T. K., Tang,C.-J. C. and Lin,W.-c.
  TITLE     Direct Submission
  JOURNAL   Submitted (22-FEB-1999) Institute of Biomedical Sciences, Academia
            Sinica, No. 128, Sec. 2, Academia Road, Taipei, Taiwan 11529,
            TAIWAN
Search: AA927876
IDENTIFIERS
dbEST Id:       1659486
EST name:       om18b09.s1
GenBank Acc:    AA927876
GenBank gi:     3076620
CLONE INFO
Clone Id:       IMAGE:1541369 (3')
Source:         NCI
Insert length:  1074
DNA type:       cDNA
PRIMERS
Sequencing:     -40m13 fwd. ET from Amersham
SEQUENCE
                TTTGACGGGAGGGCACAGGAAACTCTTTATTATGGTGATGAGATCGACAATCTCCCCTAC
                TGTTAACCTTCGCTCCTGCACACTTCAGTGTCCTCACTCTGTAGGGCTCGCTGGCCTGGG
                CTTCTGCGACCCGCGATCGTCCAGGAGAGGGCACTCGGCGCCCTTCCTGGGGTNNTCTGG
                GGCGGAATTTGCTAGGCCGCCGTAGCAGCTGTGCCAGGTCAGAAGCCGAGCCGGNCCGCT
                TTTCGTTCTTTAATTGGACTCTTGGCTAAGACGCTACCGACACCCCGTCAGTGGTGGAGG
                AAGAAGGACAACAGGGAGAGGTCGAGG
Quality:        High quality sequence stops at base: 318
Entry Created:  Apr 17 1998
Last Updated:   Jun 10 1998
COMMENTS
This clone is available royalty-free through LLNL ; contact
the IMAGE Consortium (info@image.llnl.gov) for further
information.
LIBRARY
dbEST lib id:   1042
Lib Name:       Soares_NFL_T_GBC_S1
Organism:       Homo sapiens
Organ:          pooled
Lab host:       DH10B
Vector:         pT7T3D-Pac (Pharmacia) with a modified polylinker
R. Site 1:      Not I
R. Site 2:      Eco RI
Description:    Equal amounts of plasmid DNA from three normalized libraries
                (fetal lung NbHL19W, testis NHT, and B-cell NCI_CGAP_GCB1) were
                mixed, and ss circles were made in vitro. Following HAP
                purification, this DNA was used as tracer in a subtractive
                hybridization reaction. The driver was PCR-amplified cDNAs from
                pools of 5,000 clones made from the same 3 libraries. The pools
                consisted of I.M.A.G.E. clones 297480-302087, 682632-687239,
                726408-728711, and 729096-731399. Subtraction by Bento Soares
                and M. Fatima Bonaldo.
Human cDNA Library Details:
470 different libraries so far
covering more than 40 tissues
Q&A
CGAP
Stomach
202.NCI_CGAP_Gas1 gastric tumor
203.NCI_CGAP_Gas4 gastric tumor
Testis
204.Barstead HPL-RB5 testis
205.Soares testis NHT
206.Life Tech. testis (10426-013)
Thymus
207.NCI_CGAP_Thym1 thymoma
Thyroid
208.NCI_CGAP_Thy1 invasive thyroid tumor
Uterus
209.NCI_CGAP_Ut1 uterine tumor
210.NCI_CGAP_Ut2 uterine tumor
211.NCI_CGAP_Ut3 uterine tumor
212.NCI_CGAP_Ut4 uterine tumor
213.Soares pregnant uterus NbHPU
CGAP: Cancer Genome Anatomy Project
Why CGAP?
In the last two decades we have learned that genetic changes lie at the
root of all cancers. In response, the Cancer Genome Anatomy Project
(CGAP) will unite the newest technologies, along with those both cost-effective and capable of high-throughput, to identify all the genes
responsible for the establishment and growth of cancer.
Project Goals
To achieve a comprehensive molecular characterization of normal,
precancerous, and malignant cells.
Normal Cells
Cancer Cells
Comparing the fingerprints of a normal
versus a cancer cell will highlight genes
that by their suspicious absence or
presence (such as Gene H ) deserve
further scientific scrutiny to determine
whether such suspects play a role in
cancer, or can be exploited in a test for
early detection.
Identifying the genetic differences among normal cells, precancerous cells, and cancer cells, will
contribute to our understanding of cancer as it
fosters the discovery of genes that directly cause cancer
provides us with a way to identify early precancerous cells and thus
enhances our methods for early detection
improves our ability to match patients with appropriate treatment
Pre-cancer
Time line
Malignant Tumor
The research results displayed in this graph demonstrate that for patients suffering from the
cancer neuroblastoma, the presence or absence of a specific set of genes found on Chromosome 1
strongly correlates with patient outcome. Therefore, in the future this characteristic of the tumor
can be used to identify those patients that would benefit from more aggressive treatment, and
those best served by the current treatment protocol.
Laser Capture Microdissection
(LCM)
1999  CGAP sequences: 473,746   CGAP genes: 20,665
2000  CGAP sequences: 925,746   CGAP genes: 79,844
Not in all others
Not in all others
Not in all others
Sequencing of Expressed Sequence Tags (ESTs)
Serial Analysis of Gene Expression
Differential Display Approaches
Hybridization Analysis
Digital Differential Display
The foundation of DDD is UniGene. UniGene employs a conservative method to assign all the
human EST sequences that meet minimal standards of quality to distinct "clusters", each
representing a unique human expressed gene. DDD takes advantage of UniGene by comparing
the number of times sequences from different libraries were assigned to a particular UniGene
cluster. This has the advantage that DDD will only report on sequences that we have confidence
represent bona fide human expressed genes.
There will of course be many differences in the number of sequences contained in each library
that are assigned to a particular UniGene cluster, but only some of these differences are likely to
reflect biological reality. Therefore DDD employs a statistical method of comparison - The Fisher
Exact Test - to identify only those differences that are likely to be real. One important factor in
determining statistical relevance is the absolute number of sequences in each library that have
been successfully assigned to a UniGene cluster. In many cases there are not enough sequences
in dbEST libraries to meet the threshold of significance employed in the Fisher Exact Test. Since
DDD will only yield a report if there are differences that exceed this threshold, it is expected that
many comparisons will yield nothing.
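To make the statistical step concrete, here is a minimal sketch of a two-sided Fisher Exact Test on a 2x2 table (ESTs in or out of one UniGene cluster, for two libraries). It is not the DDD implementation: the function name and the library counts are hypothetical, and the p-value is computed directly from the hypergeometric distribution.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the table [[a, b], [c, d]].

    a, b: ESTs from library A inside / outside the cluster
    c, d: ESTs from library B inside / outside the cluster
    """
    row1, row2, col1 = a + b, c + d, a + c
    total = row1 + row2

    def probability(x):
        # Hypergeometric probability of seeing x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / comb(total, col1)

    observed = probability(a)
    low, high = max(0, col1 - row2), min(row1, col1)
    # Sum every table that is no more likely than the observed one.
    return sum(p for p in (probability(x) for x in range(low, high + 1))
               if p <= observed * (1 + 1e-9))

# Hypothetical counts: 30 of 5,000 ESTs in library A fall in the cluster,
# versus 5 of 5,000 in library B.
p_large = fisher_exact_two_sided(30, 4970, 5, 4995)
# The same proportions in two small libraries of 500 ESTs each.
p_small = fisher_exact_two_sided(3, 497, 0, 500)
print(p_large, p_small)
```

The second call illustrates the point made above: with few sequences per library, even a clear-looking difference does not reach significance.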
the fraction of sequences within the pool
visual aid that reflects the numerical values
statistically significant pairwise comparison
THREE PRINCIPLES UNDERLIE THE SAGE TECHNOLOGY:
One short oligonucleotide sequence from a defined location within a transcript ("tag") allows accurate
quantitation.
Tag size (10-14bp) is optimal for high throughput while maintaining accurate gene identification and
quantitation.
The combined power of serial and parallel processing increases data throughput by orders of
magnitude when compared to conventional approaches.
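The first principle, a short tag from a defined position, can be sketched in a few lines. The details here are assumptions not stated above: the defined location is taken to be the 10 bases immediately after the 3'-most CATG site, and the transcripts are made up. Counting identical tags then gives a digital expression estimate.

```python
from collections import Counter

def sage_tag(transcript, anchor="CATG", tag_length=10):
    """Return the tag that follows the 3'-most anchor site, or None."""
    site = transcript.rfind(anchor)
    if site == -1:
        return None  # no anchoring site, so no tag
    start = site + len(anchor)
    tag = transcript[start:start + tag_length]
    return tag if len(tag) == tag_length else None

# Made-up transcripts: two copies of one transcript and one of another.
transcripts = [
    "AAACATGCCCCATGACGTACGTTAGG",
    "AAACATGCCCCATGACGTACGTTAGG",
    "GGGCATGTTTTGGGGCCCCAA",
]
tag_counts = Counter(sage_tag(t) for t in transcripts)
print(tag_counts)
```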
Ortholog:
Homologous genes that have diverged from each other after speciation events (e.g.,
human beta- and chimp beta-globin)
Paralog:
Homologous genes that have diverged from each other after gene duplication events
(e.g., human beta- and gamma-globin)
Xenolog:
Homologous genes that have diverged from each other after lateral gene transfer
events (e.g., antibiotic resistance genes in bacteria)
Homolog:
Genes that are descended from a common ancestor (e.g., all globins)
Dec. 11, 1998:
C. elegans: Sequence to Biology
-Jonathan Hodgkin, H. Robert Horvitz, Barbara R. Jasny, Judith Kimble*
This special issue of Science celebrates a landmark in biology: determination of
the essentially complete DNA sequence of an animal genome. The animal is a small
invertebrate, the nematode (or roundworm) Caenorhabditis elegans, and the
sequence consists of about 97 million base pairs of DNA, approximately
one-thirtieth the number in the human genome. Nonetheless, the information content
is enormous--eight times that of the budding yeast Saccharomyces cerevisiae,
the only other eukaryote with a sequenced genome.
Genomic sequence of the nematode C. elegans:
A platform for investigating biology
The C. elegans Sequencing Consortium
97 Mb
257 YACs (20% only in YAC)
2527 cosmids
113 fosmids
44 PCR
19,099 predicted genes
18,891 proteins here
(16,260 reviewed)
EST: 67,815 ESTs from 40,379 clones
7432 genes
A multicellular organism genome
Genefinder program:
** trans-splicing **
40% of predicted genes have EST matches.
16,260/19,099 genes have been interactively reviewed.
Average of one gene per 5 kb.
Average of five introns per gene.
27% of genome resides in exons.
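The rounded figures above follow from the raw counts quoted on the preceding slide. The short check below assumes that "7432 genes" is the number of predicted genes with EST matches; that reading is an interpretation of the slide, not something it states.

```python
genome_bp = 97_000_000      # about 97 million base pairs
predicted_genes = 19_099
reviewed_genes = 16_260
genes_with_est = 7_432      # assumed: predicted genes matched by ESTs

kb_per_gene = genome_bp / predicted_genes / 1000
reviewed_fraction = reviewed_genes / predicted_genes
est_fraction = genes_with_est / predicted_genes

print(f"{kb_per_gene:.1f} kb per gene")
print(f"{reviewed_fraction:.0%} of genes reviewed")
print(f"{est_fraction:.0%} of genes with EST matches")
```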
pFAM protein family search :
Intracellular communication
Transcriptional regulation
Table 1. The 20 most common protein domains in C. elegans (41). RRM, RNA recognition motif;
RBD, RNA binding domain; RNP, ribonuclear protein motif; UDP, uridine 5'-diphosphate.
-------------------------------------------------------------------
Number  Description
-------------------------------------------------------------------
650     7 TM chemoreceptor
410     Eukaryotic protein kinase domain
240     Zinc finger, C4 type (two domains)
170     Collagen
140     7 TM receptor (rhodopsin family)
130     Zinc finger, C2H2 type
120     Lectin C-type domain short and long forms
100     RNA recognition motif (RRM, RBD, or RNP domain)
90      Zinc finger, C3HC4 type (RING finger)
90      Protein-tyrosine phosphatase
90      Ankyrin repeat
90      WD domain, G-beta repeats
80      Homeobox domain
80      Neurotransmitter-gated ion channel
80      Cytochrome P450
80      Helicases conserved C-terminal domain
80      Alcohol/other dehydrogenases, short-chain type
70      UDP-glucoronosyl and UDP-glucosyl transferases
70      EGF-like domain
70      Immunoglobulin superfamily
Worming secrets from the C. elegans genome:
Dec. 11, 1998, Science
Washington University Genome Sequencing Center.
Sanger Centre
8-year effort: Sydney Brenner started it all.
By 1992, they were doing a million bases per year. ~$200 M
High-throughput sequencing.
Human genome project.
“We will be doing a lot of jumping back and forth between species” - F. Collins
Ping-Pong homology search
In silico cloning:
In order to perform an electronic cDNA library screen, the EST sequences
retrieved in this way can be used as queries in a BLASTN search of dbEST to identify
overlapping ESTs. This procedure can be repeated with the newly identified ESTs
until no additional hits are found. The ESTs isolated can then be assembled into
sequence contigs using assembly software.
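The final assembly step can be illustrated with a greedy merge of overlapping reads. This is a deliberately simplified sketch with made-up reads: it requires exact suffix-prefix overlaps on one strand, whereas real EST assemblers must tolerate sequencing errors and reverse-complemented reads.

```python
def overlap_length(left, right, min_overlap):
    """Length of the longest suffix of left that equals a prefix of right."""
    for size in range(min(len(left), len(right)), min_overlap - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0

def greedy_assemble(reads, min_overlap=5):
    """Repeatedly merge the pair of contigs with the longest overlap."""
    contigs = list(reads)
    while True:
        best_size, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    size = overlap_length(a, b, min_overlap)
                    if size > best_size:
                        best_size, best_i, best_j = size, i, j
        if best_size == 0:
            return contigs  # nothing left to merge
        merged = contigs[best_i] + contigs[best_j][best_size:]
        contigs = [c for k, c in enumerate(contigs)
                   if k not in (best_i, best_j)] + [merged]

# Three made-up ESTs cut from one reference with 6-base overlaps.
reference = "ATGGCGTACGTTAGCCTAGGATCCGATTACGG"
reads = [reference[0:14], reference[8:24], reference[18:32]]
contigs = greedy_assemble(reads)
print(contigs)
```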
EST 2
EST 3
EST 1
There are many sequencing related errors in the dbEST.
C. elegans
a.a. sequences
Human EST sequences
Comparative Gene Identification
Query=
(597 letters)
                                                  Score    E
Sequences producing significant alignments:       (bits)  Value
lcl|THC200240                                       224   4e-58
lcl|THC151579                                       181   3e-45
lcl|AA099787                                        127   8e-29
lcl|THC200240
Length = 764
Score = 224 bits (565), Expect = 4e-58
Identities = 106/187 (56%), Positives = 136/187 (72%)
Query: 248 SGMKKNKYGNIEDLVVHLNFVCPKGIIQKQCQVPRMSSGPDIHQIILGSEGTLGVVSEVT 307
           SGMKKN YGNIEDLVVH+  V P+GII+K CQ PRMS+GPDIH  I+GSEGTLGV++E T
Sbjct: 3   SGMKKNIYGNIEDLVVHIKXVTPRGIIEKSCQGPRMSTGPDIHHFIMGSEGTLGVITEAT 182
lcl|THC151579
Length = 698
Score = 181 bits (455), Expect = 3e-45
Identities = 81/142 (57%), Positives = 106/142 (74%)
Query: 446 LGMNHGVLGESFETSVPWDKVLSLCRNVKELMKREAKAQGVTHPVLANCRVTQVYDAGAC 505
           L + + VLGESFETS PWD+V+ LCRNVKE + RE K +GV     + CRVTQ YDAGAC
Sbjct: 41  LALEYXVLGESFETSAPWDRVVDLCRNVKERITRECKEKGVQFAPFSTCRVTQTYDAGAC 220
THC200240
sp|O00116|ADAS_HUMAN ALKYLDIHYDROXYACETONEPHOSPHATE SYNTHASE PRECURSOR (ALKYL-DHAP
SYNTHASE) (ALKYLGLYCERONE-PHOSPHATE SYNTHASE)
Length = 658
Score = 124 bits (309), Expect = 5e-29
Identities = 59/60 (98%), Positives = 59/60 (98%)
248
Query: 1   SGMKKNIYGNIEDLVVHIKXVTPRGIIEKSCQGPRMSTGPDIHHFIMGSEGTLGVITEAT 60
           SGMKKNIYGNIEDLVVHIK VTPRGIIEKSCQGPRMSTGPDIHHFIMGSEGTLGVITEAT
Sbjct: 319 SGMKKNIYGNIEDLVVHIKMVTPRGIIEKSCQGPRMSTGPDIHHFIMGSEGTLGVITEAT 378
446-248=198
517-319=198
THC151579
sp|O00116|ADAS_HUMAN ALKYLDIHYDROXYACETONEPHOSPHATE SYNTHASE PRECURSOR (ALKYL-DHAP
SYNTHASE) (ALKYLGLYCERONE-PHOSPHATE SYNTHASE)
Length = 658
Score = 127 bits (315), Expect = 1e-29
Identities = 59/60 (98%), Positives = 59/60 (98%)
446
Query: 1   LALEYXVLGESFETSAPWDRVVDLCRNVKERITRECKEKGVQFAPFSTCRVTQTYDAGAC 60
           LALEY VLGESFETSAPWDRVVDLCRNVKERITRECKEKGVQFAPFSTCRVTQTYDAGAC
Sbjct: 517 LALEYYVLGESFETSAPWDRVVDLCRNVKERITRECKEKGVQFAPFSTCRVTQTYDAGAC 576
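The two subtractions noted with these alignments (446-248=198 and 517-319=198) appear to compare the spacing of the two THC hits along the C. elegans query with their spacing along the human protein ADAS_HUMAN. That reading is an assumption about the slide's intent. Equal spacing suggests both THCs are fragments of the same human gene, laid out in the same order as in the worm protein. A minimal check, using the start positions from the alignments:

```python
# Start positions taken from the BLAST alignments shown above.
hits = [
    {"thc": "THC200240", "query_start": 248, "protein_start": 319},
    {"thc": "THC151579", "query_start": 446, "protein_start": 517},
]

# Distance between the two hits along the C. elegans query ...
query_gap = hits[1]["query_start"] - hits[0]["query_start"]
# ... and along the human protein.
protein_gap = hits[1]["protein_start"] - hits[0]["protein_start"]

same_spacing = query_gap == protein_gap
print(query_gap, protein_gap, same_spacing)
```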
U58746
[THC195737--------------------------------------------MTRHGKNSTAASVYTYHERRRDAKASGYGTLHARLGADSIKEFHCCSLTLQPCRNPVISPTGYIF
--------]
DREAILENILAQKKAYAKKLKEYEKQVAEESAAAKIAEGQAETFTKRTQFSAIESTPSRTGAVAT
[THC195737-------------------PRPEVGSLKRQGGVMSTEIAAKVKAHGEEGVMSNMKGDKSTSLPSFWIPELNPTAVATKLEKPSS
----------------------------------------------------]
KVLCPVSGKPIKLKELLEVKFTPMPGTETAAHRKFLCPVTRDELTNTTRCAYLKKSKSVVKYDVV
[THC195737----------------------]
EKLIKGDGIDPINGEPMSEDDIIELQRGGTGYSATNETKAKLIRPQLELQ*
(44%/59%)
Translation of
U58746
1 MTRHGKNCTAGAVYTYHEKKKDTAASGYGTQNIRLSRDAVKDFDCCCLSLQPCHD
1 MTRHGKNSTAASVYTYHERRRDAKASGYGTLHARLGADSIKEFHCCSLTLQPCRN
*******.** .******...*. ****** . ** *..*.* **.*.****.
55
55
Translation of
U58746
56 PVVTPDGYLYEREAILEYILHQKKEIARQMKAYEKQRGTRREEQKELQRAASQDH 110
56 PVISPTGYIFDREAILENILAQKKAYAKKLKEYEKQVAEESAAAKIAEGQAETFT 110
**..* **...****** ** *** *...* ****
* . *
Translation of
U58746
111 VRGFLEKESAIVSRPLNPFTAKALSGTSPD-----------DVQPGPSVGPPSKD 154
111 KRTQFSAIESTPSRTGAVATPRPEVGSLKRQGGVMSTEIAAKVKAHGEEGVMSNM 165
*
. **
* .
*.
*.
* *
Translation of
U58746
155 K-DK--VLPSFWIPSLTPEAKATKLEKPSRTVTCPMSGKPLRMSDLTPVHFTPLD 206
166 KGDKSTSLPSFWIPELNPTAVATKLEKPSSKVLCPVSGKPIKLKELLEVKFTPMP 220
* **
******* *.* * ******** * **.****... .* *.***.
Translation of
U58746
207 SSVDRVGLITRSER-YVCAVTRDSLSNATPCAVLRPSGAVVTLECVEKLIRKDMV 260
221 ------GTETAAHRKFLCPVTRDELTNTTRCAYLKKSKSVVKYDVVEKLIKGDGI 269
* * . * ..* **** *.*.* ** *. * .** . *****. * .
Translation of
U58746
261 DPVTGDKLTDRDIIVLQRGGTGFAGSGVKLQAEKSRPVMQA 301
270 DPINGEPMSEDDIIELQRGGTGYSAT-NETKAKLIRPQLELQ 310
**..*. ... *** *******.. .
.*
** ..
U50199
[THC171302-MVFGENQDLIRTHFQKEADKVRAMKTNWGLFTRTRMIAQSDYDFIVTYQQAENEAERSTVLSVFKEK
------------------------------------------------------------------AVYAFVHLMSQISKDDYVRYTLTLIDDMLREDVTRTIIFEDVAVLLKRSPFSFFMGLLHRQDQYIVH
------------------------------------------------------------------ITFSILTKMAVFGNIKLSGDELDYCMGSLKEAMNRGTNNDYIVTAVRCMQTLFRFDPYRVSFVNING
------------------------------------------------------------------YDSLTHALYSTRKCGFQIQYQIIFCMWLLTFNGHAAEVALSGNLIQTISGILGNCQKEKVIRIVVST
-----------------]
[THC177150-------------------------------------------LRNLITSNQDVYMKKQAALQMIQNRIPTKLDHLENRKFTDVDLVEDMVYLQTELKKVVQVLTSFDEY
------------------------------------------------------------------ENELRQGSLHWSPAHKCEVFWNENAHRLNDNRQELLKLLVAMLEKSNDPLVLCVAAHDIGEFVRYYP
------------------------------------------------]
RGKLKVEQLGGKEAMMRLLTVKDPNVRYHALLAAQKLMINNWKDLGLEI
Human gene: 483 aa
gi|2895578 (AF041338) vacuolar proton pump subunit SFD alpha is...   927   0.0
gi|2895576 (AF041337) vacuolar proton pump subunit SFD beta iso...   885   0.0
gi|1213557 (U50199) coded for by C. elegans cDNA yk89e9.5; code...   468   e-131
gi|1086810 (U41109) similar to S. cerevisiae vacular H(+)-ATPas...   335   5e-91
gnl|PID|e351278 (Z99532) hypothetical protein [Schizosaccharomy...   185   5e-46
sp|P41807|VM13_YEAST VACUOLAR ATP SYNTHASE 54 KD SUBUNIT (V-ATP...   123   2e-27
gi|1213557 (U50199) coded for by C. elegans cDNA yk89e9.5; coded for by C.
elegans cDNA cm7g5; coded for by C. elegans cDNA cm14b9;
coded for by C. elegans cDNA yk52g5.5; coded for by C.
elegans cDNA yk76e5.5; coded for by C. elegans cDNA
yk131f11.5; c...
Length = 470
Score = 468 bits (1192), Expect = e-131
Identities = 243/477 (50%), Positives = 314/477 (64%), Gaps = 20/477 (4%)
gi|2895578 (AF041338) vacuolar proton pump subunit SFD alpha isoform [Bos
taurus]
Length = 483
Score = 927 bits (2369), Expect = 0.0
Identities = 460/483 (95%), Positives = 465/483 (96%)
Query: 1   MTKMDIRGAVDAAVPTNIIAAKAAEVRANKVNWQSYLQGQMISAEDCEFIQRFEMKRSPE 60
           MTKMDIRGAVDAAVPTNIIAAKAAEVRANKVNWQSYLQGQMIS+EDCEFIQRFEMKRSPE
Sbjct: 1   MTKMDIRGAVDAAVPTNIIAAKAAEVRANKVNWQSYLQGQMISSEDCEFIQRFEMKRSPE 60
Query: 61  EKQEMLQTEGSQCAKTFINLMTHICKEQTVQYILTMVDDMLQENHQRVSIFFDYARCSKN 120
           EKQEMLQTEGSQ AKTFINLMTHI KEQTVQYILT+VDD LQENHQRVSIFFDYA+ SKN
Sbjct: 61  EKQEMLQTEGSQRAKTFINLMTHISKEQTVQYILTLVDDTLQENHQRVSIFFDYAKRSKN 120
Query: 121 TAWPYFLPILNRQDPFTVHMAARIIAKLAAWGKELMEGSDLNYYFNWIKTQLSSQKLRGS 180
TAW YFLP+LNRQD FTVHM ARIIAKLAAWGKELMEGSDLNYYFNWIKTQLSSQKLRGS
Sbjct: 121 TAWSYFLPMLNRQDLFTVHMTARIIAKLAAWGKELMEGSDLNYYFNWIKTQLSSQKLRGS 180
Query: 181 GVAVETGTVSSSDSSQYVQCVAGCLQLMLRVNEYRFAWVEADGVNCIMGVLSNKCGFQLQ 240
GV ETGTVSSSDSSQYVQCVAGCLQLMLRVNEYRFAWVEADGVNCIMGVLSNKCGFQLQ
Sbjct: 181 GVTAETGTVSSSDSSQYVQCVAGCLQLMLRVNEYRFAWVEADGVNCIMGVLSNKCGFQLQ 240
Query: 241 YQMIFSIWLLAFSPQMCEHLRRYNIIPVLSDILQESVKEKVTRIILAAFRNFLEKSTERE 300
YQMIFS+WLLAFSPQMCEHLRRYNIIPVLSDILQESVKEKVTRIILAAFRNFLEKS ERE
Sbjct: 241 YQMIFSVWLLAFSPQMCEHLRRYNIIPVLSDILQESVKEKVTRIILAAFRNFLEKSVERE 300
Query: 301 TRQEYALAMIQCKVLKQLENLEQQKYDDEDISEDIKFLLEKLGESVQDLSSFDEYSSELK 360
TRQEYALAMIQCKVLKQLENLEQQKYDDEDISEDIKFLLEKLGESVQDLSSFDEYSSELK
Sbjct: 301 TRQEYALAMIQCKVLKQLENLEQQKYDDEDISEDIKFLLEKLGESVQDLSSFDEYSSELK 360
Query: 361 SGRLEWSPVHKSEKFWRENAVRLNEKNYELLKILTKLLEVSDDPQXLAVAAHDVGXYVRX 420
SGRLEWSPVHKSEKFWREN RLNEKNYELLKILTKLLEVSDDPQ LAVAAHDVG YVR
Sbjct: 361 SGRLEWSPVHKSEKFWRENPARLNEKNYELLKILTKLLEVSDDPQVLAVAAHDVGEYVRH 420
Query: 421 YPRGKRVIEQXGGKQLVMNHMHHEXQQVRYNALLAVQKLMVHNWEYLGKQLQSEQPQTXA 480
YPRGKRVIEQ GGKQLVMNHMHHE QQVRYNALLAVQKLMVHNWEYLGKQLQSEQPQT A
Sbjct: 421 YPRGKRVIEQLGGKQLVMNHMHHEDQQVRYNALLAVQKLMVHNWEYLGKQLQSEQPQTAA 480
Query: 481 ARS 483
ARS
Sbjct: 481 ARS 483
U64857
[AA134689----------------------------------------------MSLNGFGEHTRSASHAGSWYNANQRDLDRQLTKWLDNAGPRIGTARALISPHAGYSYCGETAAYAF
--------------------------]
KQVVSSAVERVFILGPSHVVALNGCAITTCSKYRTPLGDLIVDHKINEELRATRHFDLMDRRDEES
[THC196496------------------------------------EHSIEMQLPFIAKVMGSKRYTIVPVLVGSLPGSRQQTYGNIFAHYMEDPRNLFVISSDFCHWGERF
-----------------------------------------------------------------SFSPYDRHSSIPIYEQITNMDKQGMSAIETLNPAAFNDYLKKTQNTICGRNPILIMLQAAEHFRIS
-----------------------------------]
NNHTHEFRFLHYTQSNKVRSSVDSSVSYASGVLFVHPN
Translation of
U64857
1 MSNR---VVCREASHAGSWYTASGPQLNAQLEGWLSQVQSTKRPARAIIAPHAGY
1 MSLNGFGEHTRSASHAGSWYNANQRDLDRQLTKWLDNAGPRIGTARALISPHAGY
**
.* ********.*
* ** ** .
***.*.*****
52
55
Translation of
U64857
53 TYCGSCAAHAYKQVDPSITRRIFILGPSHHVPLSRCALSSVDIYRTPLYDLRIDQ 107
56 SYCGETAAYAFKQVVSSAVERVFILGPSHVVALNGCAITTCSKYRTPLGDLIVDH 110
.*** .** *.*** *
*.******* * * **...
***** ** .*.
Translation of
U64857
108 KIYGELWKTGMFERMSLQTDEDEHSIEMHLPYTAKAMESHKDEFTIIPVLVGALS 162
111 KINEELRATRHFDLMDRRDEESEHSIEMQLPFIAKVMGSKR--YTIVPVLVGSLP 163
** ** * *. * . .* ******.**. ** * *.. .**.*****.*
Translation of
U64857
163 ESKEQEFGKLFSKYLADPSNLFVVSSDFCHWGQRFRYSYYD-ESQGEIYRSIEHL 216
164 GSRQQTYGNIFAHYMEDPRNLFVISSDFCHWGERFSFSPYDRHSSIPIYEQITNM 218
*..* .* .*..*. ** ****.********.** .* ** *
** * ..
Translation of
U64857
217 DKMGMSIIEQLDPVSFSNYLKKYHNTICGRHPIGVLLNAITELQK-NGMNMSFSF 270
219 DKQGMSAIETLNPAAFNDYLKKTQNTICGRNPILIMLQAAEHFRISNNHTHEFRF 273
** *** ** * * .* **** .******.** ..*.*
. *. . * *
Translation of
U64857
271 LNYAQSSQCRNWQDSSVSYAAGALTVH
297
274 LHYTQSNKVRSSVDSSVSYASGVLFVHPN 302
*.*.** . *
*******.* * **
BLASTP (Jan. 10, 1999)
gi|1465834 (U64857) No definition line found [Caenorhabditis el...   300   1e-80
sp|Q10212|YAY4_SCHPO HYPOTHETICAL 34.8 KD PROTEIN C4H3.04C IN C...   215   3e-55
sp|P47085|YJX8_YEAST HYPOTHETICAL 38.5 KD PROTEIN IN SUI2-TDH2 ...   195   3e-49
gi|2425141 (AF020286) similar to C. elegans CEESS08F encoded by...   155   4e-37
gnl|PID|d1031681 (AP000006) 294aa long hypothetical protein [Py...    87   1e-16
gi|2983422 (AE000712) hypothetical protein [Aquifex aeolicus]         85   7e-16
gi|2621080 (AE000796) conserved protein [Methanobacterium therm...    79   4e-14
gnl|PID|e283857 (Y08257) orf c05005 [Sulfolobus solfataricus]         78   9e-14
sp|Q57846|Y403_METJA HYPOTHETICAL PROTEIN MJ0403 >gi|2129073|pi...    77   2e-13
gi|2983762 (AE000735) hypothetical protein [Aquifex aeolicus]         68   1e-10
gi|1465834 (U64857) No definition line found [Caenorhabditis elegans]
Length = 302
Score = 300 bits (759), Expect = 1e-80
Identities = 153/292 (52%), Positives = 198/292 (67%), Gaps = 4/292 (1%)
Query: 8   REASHAGSWYTASGPQLNAQLEGWLSQVQSTKRPARAIIAPHAGYTYCGSCAAHAYKQVD 67
           R ASHAGSWY A+   L+ QL  WL         ARA+I+PHAGY+YCG  AA+A+KQV
Sbjct: 11  RSASHAGSWYNANQRDLDRQLTKWLDNAGPRIGTARALISPHAGYSYCGETAAYAFKQVV 70
Z36238
[THC132858-------------------]
MKQFKRGIERDGTGFVVLMAEEAEDMWHIYNLIRIGDIIKASTIRKVVSETSTGTTSSQRVHTM
LTVSVESIDFDPGAQELHLKGRNIEENDIVKLGAYHTIDLEPNRKFTLQKTEWDSIDLERLNLA
[THC85433-----------------------------------------LDPAQAADVAAVVLHEGLANVCLITPAMTLTRAKIDMTIPRKRKGFTSQHEKGLEKFYEAVSTA
--------------------------------------------]
{AA938998*****************
FMRHVNLQVVKCVIVASRGFVKDAFMQHLIAHADANGKKFTTEQRAKFMLTHSSSGFKHALKEV
*******}
[THC200182---------------------------------------------------
LETPQVALRLADTKAQGEVKALNQFLELMSTEPDRAFYGFNHVNRANQELAIETLLVADSLFRA
-----------------------------------------------]
QDIETRRKYVRLVESVREQNGKVHIFSSMHVSGEQLAQLTGCAAILRFPMPDLDDEPMDEN
Translation of
Z36238
1 MKLVRKNIEKDNAGQVTLVPEEPEDMWHTYNLVQVGDSLRASTIRKVQTESSTGS 55
1 MKQFKRGIERDGTGFVVLMAEEAEDMWHIYNLIRIGDIIKASTIRKVVSETSTGT 55
** ...**.*..* * *. ** ***** ***...** ..******* .*.***.
Translation of
Z36238
56 VGSNRVRTTLTLCVEAIDFDSQACQLRVKGTNIQENEYVKMGAYHTIELEPNRQF 110
56 TSSQRVHTMLTVSVESIDFDPGAQELHLKGRNIEENDIVKLGAYHTIDLEPNRKF 110
*.**.* **..**.**** * .*..** **.**. **.******.*****.*
Translation of
Z36238
111 TLAKKQWDSVVLERIEQACDPAWSADVAAVVMQEGLAHICLVTPSMTLTRAKVEV 165
111 TLQKTEWDSIDLERLNLALDPAQAADVAAVVLHEGLANVCLITPAMTLTRAKIDM 165
** * .***. ***. * *** .*******..****..**.**.*******...
Translation of
Z36238
166 NIPRKRKGNCSQHDRALERFYEQVVQAIQRHIHFDVVKCILVASPGFVREQFCDY 220
166 TIPRKRKGFTSQHEKGLEKFYEAVSTAFMRHVNLQVVKCVIVASRGFVKDAFMQH 220
.******* .***.. **.*** * * **.. ****..*** ***.. *
Translation of
Z36238
221 MFQQAVKTDNKLLLGNRSKFLQVHASSGHKYSLKEALCDPTVLARLSDTKAAGEV 275
221 LIAHADANGKKFTTEQRAKFMLTHSSSGFKHALKEVLETPQVALRLADTKAQGEV 275
. .* . * .*.**. *.*** * .*** * * * **.**** ***
Translation of
Z36238
276 KALDDSYKMLQHEPDRAFYGLKQVEKANEAMAIDTLLISDELFRHQDVATRSRYV 330
276 KALNQFLELMSTEPDRAFYGFNHVNRANQELAIETLLVADSLFRAQDIETRRKYV 330
*** .. ******** .* .**. .**.***..* *** **. ** .**
Translation of
Z36238
331 RLVDSVKENAGTVRIFSSLHVSGEQLSQLTGVAAILRFPVPELSDQEGDS-SSEE 384
331 RLVESVREQNGKVHIFSSMHVSGEQLAQLTGCAAILRFPMPDLDDEPMDEN 381
***.**.*. * *.****.*******.**** *******.*.* *. *
Translation of
Z36238
385 D 385
382 381
BLASTP (Jan. 10, 1999)
Sequence                                                             Score (bits)  E value
sp|P48612|PELO_DROME PELOTA PROTEIN >gi|973224 (U27197) pelota ...            520  e-147
sp|P50444|YNU6_CAEEL HYPOTHETICAL 42.9 KD PROTEIN R74.6 IN CHRO...            446  e-125
gi|3941543 (AF069497) pelota [Arabidopsis thaliana]                           385  e-106
pir||S45456 DOM34 protein - yeast (Saccharomyces cerevisiae) >g...            236  2e-61
sp|P33309|DO34_YEAST DOM34 PROTEIN >gi|295608 (L11277) DOM34 [S...            212  2e-54
gnl|PID|e304505 (Z86109) unknown [Saccharomyces pastorianus]                  199  3e-50
gi|2622770 (AE000923) cell division protein [Methanobacterium t...            155  4e-37
gnl|PID|d1031529 (AP000006) 356aa long hypothetical protein [Py...            146  3e-34
sp|Q57638|Y174_METJA HYPOTHETICAL PROTEIN MJ0174 >gi|2127805|pi...            145  6e-34
gi|2649765 (AE001046) cell division protein pelota (pelA) [Arch...            116  3e-25
sp|P50444|YNU6_CAEEL HYPOTHETICAL 42.9 KD PROTEIN R74.6 IN CHROMOSOME III
>gi|3879163|gnl|PID|e1348805 (Z36238) Similar to the
DOM34 protein of saccharomyces cerevisiae (Swiss Prot
accession number P33309) [Caenorhabditis elegans]
Length = 381
Score = 446 bits (1136), Expect = e-125
Identities = 215/371 (57%), Positives = 282/371 (75%)
[Figure: scatter plot of C. elegans protein length (y-axis, 0 to 1200) against CGI protein length (x-axis, 0 to 1000).]
[Figure: scatter plot of match area length (y-axis, 0 to 800) against CGI protein length (x-axis, 0 to 1000).]
Protein similarity between CGI and C. elegans
[Figure: scatter plot of protein similarity (y-axis ticks 30 to 100) against CGI protein length (x-axis, 0 to 1000).]
C. elegans from WormPept: 18,452 entries
HGI searches (5 days for TBLASTN analysis)
*Families        3,934
*Known Gene      7,954
*New Contig      3,456
*Undetermined    2,070
<100 aa          1,038
83% between Human & C. elegans
11% C. elegans specific
*150 full length genes so far, more expected following GAP closure and 5' RACE.
C. elegans from WormPept: 18,452 entries
MGI searches (5 days for TBLASTN analysis)
*Families        5,602
*Known Gene      4,151
*New Contig      5,805
*Undetermined    1,856
<100 aa          1,038
84% between Mouse & C. elegans
10% C. elegans specific
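The percentages on the HGI and MGI slides can be checked against the category counts. The sketch below rests on two assumptions that the slides do not state explicitly: that the counts belong to the categories in the order listed, and that the "between" figure is the share of WormPept entries falling in the Families, Known Gene and New Contig categories, with "C. elegans specific" corresponding to Undetermined. The names `counts` and `summarize` are illustrative only.

```python
WORMPEPT_ENTRIES = 18452  # total C. elegans entries quoted on both slides

# Category counts as listed on the slides (assumed to be in listing order).
counts = {
    "HGI": {"Families": 3934, "Known Gene": 7954, "New Contig": 3456,
            "Undetermined": 2070, "<100 aa": 1038},
    "MGI": {"Families": 5602, "Known Gene": 4151, "New Contig": 5805,
            "Undetermined": 1856, "<100 aa": 1038},
}


def summarize(cats, total=WORMPEPT_ENTRIES):
    """Return the category total and the two percentage shares (assumed definitions)."""
    matched = cats["Families"] + cats["Known Gene"] + cats["New Contig"]
    return {
        "total": sum(cats.values()),
        "matched_pct": 100 * matched / total,
        "undetermined_pct": 100 * cats["Undetermined"] / total,
    }


for name, cats in counts.items():
    s = summarize(cats)
    print(f"{name}: total={s['total']}, "
          f"matched={s['matched_pct']:.1f}%, "
          f"undetermined={s['undetermined_pct']:.1f}%")
```

Under these assumptions the five categories of each search add up to the 18,452 WormPept entries, and the computed shares round to the 83% and 11% (human) and 84% and 10% (mouse) quoted on the slides.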
http://www.ibms.sinica.edu.tw/~wenlin/