Gibson Second Edition

Genome Science
Ka-Lok Ng
Dept. of Bioinformatics
Asia University
The Core Aims of Genomics Science
(1)
An integrated web-based database and research
interface
access to the enormous volume of data
web interfaces
Relational databases
Generic Model Organism Database (GMOD)
project http://www.gmod.org/  to develop reusable
components suitable for creating new community
databases of biology
The Core Aims of Genomics Science
(2)
To assemble physical and genetic maps
location of genes in a genome
physical distance and relative position defined by
recombination frequencies
the map is crucial for comparing the genomes of related
species
related phenotypic and genetics data
used in animal and plant breeding
extend to more species with greater accuracy
The Core Aims of Genomics Science
(3) To generate and order genomic
and expressed gene sequences
High-volume sequencing
The basic technique was developed by
Fred Sanger
“Shotgun” approach → assemble reads
into contigs, then scaffolds (sets of
contigs), then whole
chromosomes
mRNA is unstable
Coding parts → cDNA clones,
cloned from mRNA transcripts
Expressed sequence tags (ESTs)
Obtaining full-length cDNA is not easy
because of mRNA structure
The Core Aims of Genomics Science
(3) To generate and order genomic and expressed gene
sequences
mRNA → cDNA → EST
Reverse transcription → cDNA
EST - partial cDNA sequences
sequenced from either the 5' or the 3' end
Alternative splicing → not a one-to-one
correspondence between ESTs and genes
Whole genome reconstruction
The Core Aims of Genomics Science
(4)
Identify and annotate the complete set of genes encoded
within a genome
From the complete sequence of a genome → gene identification
Alignment of cDNA, DNA and protein sequences – BLAST
Gene finding software – ORFs, transcription start and
termination sites, exon/intron boundaries
Then gene annotation → linking sequence to genetic function,
expression, locus information, comparative data from homologous
proteins
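The ORF search that gene-finding software performs can be illustrated with a minimal sketch. It is not how any particular gene finder works: it scans only the forward strand, ignores introns, and uses a made-up example sequence. The function name `find_orfs` and the `min_codons` cutoff are assumptions introduced here for illustration.

```python
# Toy ORF scan: forward strand only, no introns, standard stop codons.
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(dna, min_codons=3):
    """Return sorted (start, end) index pairs of ATG...stop ORFs in the three forward frames."""
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i                      # first start codon opens an ORF
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))  # end index is exclusive
                start = None                   # look for the next ORF in this frame
    return sorted(orfs)

# Hypothetical sequence, not taken from any real genome.
example = "CCATGGCTTGATAGATGAAACCCGGGTAACC"
for start, end in find_orfs(example):
    print(start, end, example[start:end])
```

Real gene finders add splice-site, promoter, and poly-A signals on top of this kind of scan, which is why eukaryotic gene prediction is much harder than the sketch suggests.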
The Core Aims of Genomics Science
(5) To characterize DNA sequence
diversity
Single-nucleotide polymorphisms (SNPs)
About 90 percent of human genome variation
comes in the form of single nucleotide
polymorphisms (most are neither harmful nor
beneficial)
Theoretically, a SNP could have four
possible forms, or alleles (different seq.
alternative), since there are four types of
bases in DNA. But in reality, most SNPs
have only two alleles. For example, if some
people have a T at a certain place in their
genome while everyone else has a G, that
place in the genome is a SNP with a T allele
and a G allele.
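The T/G example above can be made concrete with a small sketch that tallies alleles at each position of a set of aligned sequences. The sequences are invented for illustration, and the function name `variable_sites` is an assumption, not part of any standard tool.

```python
from collections import Counter

def variable_sites(sequences):
    """Map each position with more than one allele to its allele counts.

    Assumes the sequences are already aligned and of equal length.
    """
    sites = {}
    for pos, column in enumerate(zip(*sequences)):
        counts = Counter(column)
        if len(counts) > 1:          # more than one base seen: a candidate SNP
            sites[pos] = dict(counts)
    return sites

# Five hypothetical individuals; position 4 carries a T allele and a G allele.
sample = ["ACGTTGCA", "ACGTGGCA", "ACGTGGCA", "ACGTTGCA", "ACGTGGCA"]
print(variable_sites(sample))
```

In practice a site is usually called a SNP only if the rarer allele reaches some minimum frequency in the population; that threshold is left out of this sketch.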
The human genome contains more than 10
million SNPs → about one every 100 to 300 bp
Find associations between SNP variation
and phenotypic variation, e.g. sickle-cell
anemia
SNP
mutation
Sickle-cell anemia and SNP
http://users.rcn.com/jkimball.ma.ultranet/BiologyPages/R/RFLPs.html
The Core Aims of Genomics Science
(5) To characterize DNA sequence
diversity
Characterize the level of haplotype
structure due to linkage disequilibrium (LD)
haplotype = a set of adjacent
polymorphisms found on a single
chromosome
LD = the non-random association of alleles at
closely linked loci, which tend to be inherited
together; it can be used to map human disease
genes very accurately
Knowledge of LD is used for
disease locus mapping
In the human genome, haplotypes tend to
be approximately 60,000 bp in size and
therefore contain up to 60 SNPs that travel
as a group.
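One common way to quantify LD between two biallelic loci is the coefficient D = p(AB) - p(A)p(B) and its normalised form r². The sketch below computes both from haplotype counts. The counts are hypothetical, and the function name and the 'AB'/'Ab'/'aB'/'ab' labelling are assumptions made for this example.

```python
def linkage_disequilibrium(hap_counts):
    """Return (D, r_squared) for two biallelic loci.

    hap_counts maps the four haplotypes 'AB', 'Ab', 'aB', 'ab' to observed counts.
    Both loci must be polymorphic, otherwise r_squared is undefined.
    """
    total = sum(hap_counts.values())
    p_ab = hap_counts["AB"] / total                        # frequency of haplotype AB
    p_a = (hap_counts["AB"] + hap_counts["Ab"]) / total    # frequency of allele A
    p_b = (hap_counts["AB"] + hap_counts["aB"]) / total    # frequency of allele B
    d = p_ab - p_a * p_b                                   # zero when alleles associate at random
    r_squared = d * d / (p_a * (1 - p_a) * p_b * (1 - p_b))
    return d, r_squared

# Hypothetical sample of 100 chromosomes in which A tends to travel with B.
counts = {"AB": 40, "Ab": 10, "aB": 10, "ab": 40}
d, r_squared = linkage_disequilibrium(counts)
print(f"D = {d:.2f}, r^2 = {r_squared:.2f}")
```

With these made-up counts D is positive, meaning the AB and ab haplotypes are more common than expected if the two loci were independent.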
Haplotype
The Core Aims of Genomics Science
Mendel's Laws enable the outcome of genetic crosses to be predicted.
A and B on different chromosome
The Core Aims of Genomics Science
Genes on the same chromosome should display linkage.
Genes A and B are on the same chromosome and so should be inherited
together. Mendel's Second Law should therefore not apply to the
inheritance of A and B, but holds for the inheritance of A and C, or B
and C. Mendel did not discover linkage because the seven genes that
he studied were each on a different pea chromosome.
Partial linkage
Partial linkage was discovered in the early 20th
century. The cross shown here was carried out
by Bateson, Saunders and Punnett in 1905 with
sweet peas. The parental cross gives the typical
dihybrid result (see Figure on the right ), with all
the F1 plants displaying the same phenotype,
indicating that the dominant alleles are purple
flowers and long pollen grains. The F1 cross
gives unexpected results as the progeny
show neither a 9 : 3 : 3 : 1 ratio (expected for
genes on different chromosomes) nor a 3 : 1 ratio
(expected if the genes are completely linked). An
unusual ratio is typical of partial linkage.
The Core Aims of Genomics Science
(5) To characterize DNA sequence diversity
the farther apart two genes are, the more they
tend to assort independently (randomly) →
recombination frequency ↑
Higher freq. → farther apart
Vermilion – bright red
The Core Aims of Genomics Science
(6) To compile atlases of gene
expression
analyzing profiles of transcription
and protein synthesis
traditional method: Northern blots,
hybridization
modern technology – microarray
relative level of expression
(differential expression)
patterns of covariation in gene
expression  clues to unknown
gene function (guilt by association)
The Core Aims of Genomics Science
(7) To accumulate functional data, including biochemical and
phenotypic properties of genes
Near-saturation mutagenesis (screening hundreds of thousands of
mutants to identify genes that affect traits as diverse as
embryogenesis, immunology, and behavior)
high-throughput reverse genetics (methods to systematically and
specifically inactivate individual genes).
Yeast Genome Deletion Project http://www-sequence.stanford.edu/group/yeast_deletion_project/deletions3.html
Mouse http://www.bioscience.org/knockout/knochome.htm
Proteomics – detecting protein expression and protein-protein
interactions
Pharmacogenomicists – study the interactions between small
molecules (i.e. potential drugs) and proteins
Functional genomics – a crucial component is to study various
model organisms
Clone library – collections of DNA fragments that are cloned into a
vector
The Core Aims of Genomics Science
With Smith's site-directed
mutagenesis the researchers can
study in detail how proteins
function and how they interact
with other biological molecules.
Site-directed mutagenesis can be
used, for example, to
systematically change amino
acids in enzymes, in order to
better understand the function of
these important biocatalysts. The
researchers can also analyze how
a protein is folded into its
biologically active three-dimensional structure. The
method can also be used to study
the complex cellular regulation of
the genes and to increase our
understanding of the mechanism
behind genetic and infectious
diseases, including cancer.
GTC → Valine
GCC → Alanine
Site-directed mutagenesis
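The GTC → GCC example can be written as a single-base substitution followed by a codon lookup. This is only an illustration of the idea. The helper `point_mutate` is hypothetical, and the table holds just the two codons shown on the slide rather than the full genetic code.

```python
# Two entries from the standard genetic code, as shown on the slide.
CODON_TABLE = {"GTC": "Valine", "GCC": "Alanine"}

def point_mutate(codon, position, new_base):
    """Return the codon with the base at `position` (0-based) replaced by `new_base`."""
    return codon[:position] + new_base + codon[position + 1:]

wild_type = "GTC"
mutant = point_mutate(wild_type, 1, "C")   # change the middle base T -> C

print(f"{wild_type} ({CODON_TABLE[wild_type]}) -> {mutant} ({CODON_TABLE[mutant]})")
```

A single nucleotide change in the second codon position is enough here to swap one amino acid for another, which is the kind of targeted change site-directed mutagenesis is used to make.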
The Core Aims of Genomics Science
(8) To provide the resources for
comparison with other genomes.
Comparative maps → allow genetic data
from one species to be used in the other
species
Comparative maps → local gene order
along a chromosome tends to be
conserved → Synteny (human and
mouse genome)
Even without synteny, the conservation
of gene function is known (say from fly
to primate)
Gene order conservation (GOC)
Mapping Genomes – Genetic Maps
Genetic map – the relative order of genetic markers in linkage groups, in
which the distance between markers is expressed in units of recombination
Genetic markers – sequence tags, repeats, restriction enzyme
polymorphisms (cutting sites)
In diploid organisms, genetic maps are assembled from data
on the co-segregation of genetic markers either in pedigrees
or in the progeny of controlled crosses.
•Genetic distance unit → centimorgan (cM)
•In humans, 1 cM = 1% recombination frequency
•In humans, 1 cM ~ 1 Mbp
•100 cM → on average 1 crossover per chromosome per generation
•Markers on different chromosomes have a 50-50 chance of co-segregation →
50 cM (0.5 crossover per generation)
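The relation "1 cM = 1% recombination frequency" can be sketched as a short calculation. The progeny counts are hypothetical, and the linear conversion is only a reasonable approximation for closely linked markers, since observed recombination frequency cannot exceed 50%. Both function names are assumptions made for this example.

```python
def recombination_frequency(recombinants, total):
    """Fraction of progeny that are recombinant for a pair of markers."""
    if total <= 0:
        raise ValueError("total must be positive")
    return recombinants / total

def map_distance_cm(rf):
    """Convert a recombination frequency to centimorgans using 1% = 1 cM.

    This simple rule ignores double crossovers, so it underestimates
    distance for loosely linked markers.
    """
    if not 0 <= rf <= 0.5:
        raise ValueError("recombination frequency must lie between 0 and 0.5")
    return 100 * rf

# Hypothetical cross: 110 recombinant progeny out of 1000 scored.
rf = recombination_frequency(110, 1000)
print(f"RF = {rf:.2f}, distance ~ {map_distance_cm(rf):.0f} cM")
```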
Mapping Genomes – Genetic Maps
(A)
A pair of different parental chromosomes
(green and blue colors).
(B)
A table showing the frequency of recombinants
between each marker. Larger number indicates
that the genes are farther apart.
(C)
The most likely genetic map from the entire data.
In this hypothetical example, two linkage groups
are inferred, the top one is longer than 50 cM.
Recombination fraction → genetic distance: 0.11 → 11 cM,
0.22 → 21 cM, 0.25 → 24 cM, 0.33 → 33 cM
Figure 1.1
Mapping Genomes – Genetic Maps
Software of the assembly of genetic maps
http://linkage.rockefeller.edu/soft/list.html
Multiple factors lead to high variation in the
correspondence between physical and genetic distances
There is variability of recombination rate along a
chromosome (centromeres and telomeres are less
recombinogenic than general euchromatin) → hot spots
and cold spots of recombination
Exercise 1.1 (Part 1) Constructing a genetic map
Constructing a genetic map - four recessive loci – thickskin, reddish, sour, petite.
After identifying two true-breeding trees that are either completely wild-type or mutant
for all four loci, the breeder crosses them, and then plants an orchard of F2 (second
generation) trees.
Q. Based on the following frequencies of mutant classes, determine which loci are
likely to be on the same chromosome and which are the most closely linked.
Exercise 1.1 (Part 2) Constructing a genetic map
Answer: There are 968 trees in total. Under independent assortment, each
recessive phenotype is expected in ¼ of the trees, i.e. 242 trees. Observed:
242 petite (127+42+38+12+10+8+3+2), 249 reddish, 247 sour
and 236 thickskin.
Unlinked loci segregate independently, so each double-mutant class is
expected to contain ~60 trees (1/4 × 1/4 × 968 ≈ 60.5).
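The expected counts used in the answer can be checked with a few lines of arithmetic. The numbers are those given in the exercise and the variable names are chosen here for illustration. Double-mutant classes that are much larger than the unlinked expectation point to linked loci.

```python
total_trees = 968

# Observed number of trees showing each recessive phenotype (from the exercise).
observed = {"petite": 242, "reddish": 249, "sour": 247, "thickskin": 236}

# The petite total is the sum of all phenotype classes that include petite.
petite_classes = [127, 42, 38, 12, 10, 8, 3, 2]

# In an F2, a recessive phenotype is expected in 1/4 of the trees;
# two unlinked recessive phenotypes together in 1/4 * 1/4 = 1/16.
expected_single = total_trees * (1 / 4)
expected_double = total_trees * (1 / 4) ** 2

print("sum of petite classes:", sum(petite_classes))
for locus, count in observed.items():
    print(f"{locus}: observed {count}, expected {expected_single:.0f}")
print(f"expected per double-mutant class if unlinked: {expected_double:.1f}")
```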
Mapping Genomes – Genetic Maps
Approximate solution (figure): map of the loci s, r, t and p
Mapping Genomes – Physical Maps
Physical maps
• is an assembly of contiguous stretches of chromosomal
DNA – contigs – in which the distance between
landmark sequences of DNA is expressed in kilobases
• the ultimate physical map is the complete sequence
Applications
(1) provide a scaffold upon which polymorphic markers can be
placed
(2) facilitating finer scale linkage mapping
(3) confirm linkages inferred from recombination frequencies
(4) resolve ambiguities about the order of closely linked
genes
(5) enable detailed comparisons of regions of synteny
between genomes
Mapping Genomes – Physical Maps
Two strategies used to assemble contigs
(1) Alignment of randomly isolated clones based on shared
restriction fragment length profiles
• YAC – ~1 Mbp long fragments
• BAC – ~100 kbp long fragments
• Plasmid – ~ kbp long fragments
• Automated restriction profiling (Ch. 2) is used to assemble contigs
(short for "contiguous sequences").
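The idea behind strategy (1) is that two clones which share many restriction fragment sizes probably overlap. The sketch below is a toy version: the fragment sizes are invented, exact size matches are required, and the 0.5 threshold is an arbitrary assumption. Real fingerprinting software allows for measurement error in fragment sizes.

```python
from itertools import combinations

def shared_fraction(profile_a, profile_b):
    """Fraction of fragment sizes shared, relative to the smaller (non-empty) profile."""
    a, b = set(profile_a), set(profile_b)
    return len(a & b) / min(len(a), len(b))

def overlapping_pairs(profiles, threshold=0.5):
    """Return clone pairs whose restriction profiles suggest that they overlap."""
    return [
        (x, y)
        for x, y in combinations(sorted(profiles), 2)
        if shared_fraction(profiles[x], profiles[y]) >= threshold
    ]

# Hypothetical restriction fragment sizes (bp) for three clones.
profiles = {
    "clone_a": [1200, 3400, 560, 8000],
    "clone_b": [3400, 560, 8000, 2100],
    "clone_c": [2100, 950, 4300, 700],
}
print(overlapping_pairs(profiles))
```

Pairs that pass the threshold would then be merged step by step into contigs.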
Genomic clone library
Unlike the case of φX174, no large genome
could be completely sequenced without
an extra round of fragmentation into
manageable sized chunks. In other words
it had to be transferred into one or
more clone libraries from which
individual clones were picked to be
"subcloned" in M13 for sequencing.
The general outline of the procedure is
shown at right. You can see that φX174
bypassed the first stage, the construction of
a clone library from the target genome.
cDNA library – made from RNA that has
been reverse transcribed into cDNA; it
is used for EST sequencing projects.
Cloning vectors
Mapping Genomes – Physical Maps
(2) Hybridization-based approaches –
chromosome walking
Chromosome walking is used as a means of
finding adjacent genes (positional cloning), or
parts of a gene which are missing in the original
clone as well as to analyze long stretches of
eukaryotic DNA. This task requires finding a set of
overlapping fragments of DNA that spans the
distance between the marker and the gene.
Genomic DNA is shown in blue. Selected clones
from a library of cloned genomic DNA fragments
are shown in red. The initial probe, probe a, is
specific to gene A or exon A and allows
identification of clones 1 and 2. A new probe, probe
b, is prepared from one end of clone 2 and used to
isolate new clones 3 and 4 from the genomic library.
Probe c, prepared from clone 4 is used to identify
clone 5, etc. The orientation of the clones is
determined by restriction mapping of the clones.
Clone 6 contains the desired gene B or exon B.
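The walk described above can be imitated with clones represented as intervals on a coordinate axis. The coordinates are invented for illustration, the walk proceeds in one direction only, and "hybridization" is reduced to checking whether a clone covers the probe position. The function name `chromosome_walk` is an assumption made for this sketch.

```python
def chromosome_walk(clones, start_pos, target_pos):
    """Return the ordered clone names linking start_pos to target_pos, or None if there is a gap.

    clones maps a clone name to its (begin, end) coordinates.
    Each step probes with the far end of the last clone picked.
    """
    path = []
    probe = start_pos
    while True:
        # Clones "identified" by the probe: those that cover the probe position.
        hits = [
            (end, name)
            for name, (begin, end) in clones.items()
            if begin <= probe <= end and name not in path
        ]
        if not hits:
            return None               # no overlapping clone in the library
        end, name = max(hits)         # keep the clone that extends farthest
        path.append(name)
        if end >= target_pos:
            return path               # this clone contains the target
        probe = end                   # next probe comes from the end of this clone

# Hypothetical library (coordinates in kb); gene A lies at 20, gene B at 200.
library = {
    "clone1": (0, 40), "clone2": (10, 60), "clone3": (50, 100),
    "clone4": (55, 120), "clone5": (110, 170), "clone6": (160, 220),
}
print(chromosome_walk(library, start_pos=20, target_pos=200))
```

As in the figure description, the first probe picks up clones 1 and 2, the end of clone 2 picks up clones 3 and 4, and the walk ends at clone 6.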
Mapping Genomes – Cytogenetic Maps
Historically – aid in the alignment of physical and genetic
maps
Cytogenetic maps are the banding patterns observed through
a microscope on stained chromosome spreads
Traditional preparation – salivary gland polytene
chromosomes (greatly enlarged
relative to their usual condition) of insects and Giemsa-banded mammalian metaphase karyotypes
http://book.tngs.tn.edu.tw/database/scientieic/content/1970/00100010/images/0053b.jpg
Chromosomes → the genetic material → phenotypes or
medical conditions correlate with the deletion or
rearrangement of chromosome sections
Cytogenetic maps are aligned with the physical map through
in situ hybridization – a cloned fragment is
annealed to a single location on the cytogenetic map
NCBI Genomic Biology
http://www.ncbi.nlm.nih.gov/Genomes/
Keyword: HOX AND homo[ORGN]
Karyotypes
Mapping Genomes – Cytogenetic Maps
Alignment of cytological, physical, and genetic maps.
Cytological map – a representation of a chromosome based on the pattern of
staining of bands
Physical map – the location of transcripts and sites of insertions and deletions
Genetic map – recombination rates vary along a chromosome, typically reduced
near the telomere and centromere
Distances between genetic, physical and cytological markers are not uniform
How to search for genes on a genome map ? See my lecture notes on Bioinformatics class.
Comparative Genomics
Synteny – conservation of gene order
between chromosome segments of
two or more organisms.
Homologs – highly conserved loci
derived from a common ancestral
locus
Orthologs – homologous genes that arose
as a result of speciation
(an evolutionary split)
Paralogs – homologous genes that arose as a
result of duplication
• Conservation of gene order is an inverse function of the time since
divergence from the ancestral locus.
• Note – rates of divergence vary considerably at all taxonomic levels.
• Japanese pufferfish – genome 7.5 times smaller than the human genome, yet shows
extensive gene order similarity with humans; around 50%–80% of genes are in the same
order as in the human genome
Comparative Genomics
1. Chromosome painting – used to define regions of synteny that cover large regions (~0.1
of a chromosome arm)
2. Each chromosome of one species is labeled with a set of fluorescent dyes, and
hybridized to chromosome spreads of the other genome.
3. Uses the fluorescent in situ hybridization (FISH) technique to detect DNA
sequences in metaphase spreads of animal cells. The fluorescently labeled hybrid
karyotype is shown at the bottom.
Comparative Genomics
Synteny between cat and human genomes. Ideograms for each of
the 24 human chromosomes, shown on the right in each pair, are aligned against color-coded
representations of the corresponding cat chromosomes.
CAT – six groups (A – F) of 2 – 4 chromosomes each.
Top row – 12 autosomes that are essentially syntenic along their length, except for some
rearrangements
Bottom row – 10 autosomes that have at least one major rearrangement
The two sex chromosomes are essentially syntenic between cat and human
Comparative Genomics
• Sequence conservation = functional importance
• High-resolution comparative physical mapping – found ~1 Mbp regions of synteny
between human and mouse
• These may contain hundreds of genes, with local inversions and insertions/deletions
involving one or a few genes
• Families of genes organized in tandem clusters
• Considerable size variation in intergenic “junk” DNA
Comparative Genomics
• Identifying genes and regulatory regions in sequenced
genomes is challenging
• Open reading frames (ORFs) are usually good indication
of genes
• However, it is difficult to determine which ORFs belong
to a gene
– Many mammalian genes have small exons and large
introns
• Regulatory sequences even more difficult
Comparative Genomics
• Computer programs analyze genomic sequence
– GRAIL
– GeneFinder
• Look for ORFs, splice sites, poly A addition sites, etc.
• Predict gene structure
• Frequently wrong
– Usually miss exons at beginning or end of gene
– Sometimes predict exon when one doesn’t really
exist
Comparative Genomics
• When comparing genomes of different species, the
genes normally have the same exon–intron structure
• Look for conserved ORFs in both genomes
• Frequently permit accurate identification of genes
– Fugu–human comparison found >1,000 genes
– Mouse–human comparison indicates only about 25,000
genes in the human genome
Example of sequence comparison
• Comparison of the human and mouse spermidine
synthase genes revealed an additional intron in the
human gene that is not found in the mouse
homologue
Figure: human and mouse spermidine synthase genes compared (5,500 bp)
The Human Genome Project (HGP)
Objectives
1. Generation of high-resolution genetic and physical maps that will help in the
localization of disease-associated genes.
2. The attainment of sequence benchmarks, leading to generation of a complete
genome sequence by the year 2005. (A draft version was announced in June 2000,
but finished sequence required an error rate of less than 1 in 10,000 bp)
3. Identification of each and every gene in the genome by a combination of
bioinformatic identification of open reading frames (ORFs), generation of voluminous
EST databases, and collation of functional data including comparative data from
other animal genome projects.
4. Compilation of exhaustive polymorphism databases, in particular of SNPs, to
facilitate integration of genomic and clinical data, as well as studies of human
diversity and evolution.
The Human Genome Project (HGP)
Table 1.1 Initial Goals of the HGP
From the First 5-Year Plan: 1993-1998
Table 1.2 A Blueprint for the Future of the HGP
15 Grand Challenges in the Third 5-Year Plan:
2003 – 2005
HGP budget – part is set aside for research on the ethical,
legal, and social implications of genetic research
(the ELSI project)
The Human Genome Project
The architecture of the Human Genome Project in the twenty-first century.
Three major themes for future genome research are founded on six pillars
of genome resources.
ELSI
Box 1.1 The Ethical, Legal, and Social Implications of
the HGP
Funding – The National Human Genome Research Institute
(NHGRI) → 5% of its annual budget goes to ELSI
Funding three types of activities: regular research grants,
education grants, and intramural programs at the NIH
campus
Web sites: http://www.genome.gov/10001618
http://www.ornl.gov/sci/techresources/Human_Genome/research/elsi.html
4 major objectives
4 main subject areas
ELSI
Of great concern is the privacy and confidentiality of genetic information.
Especially – Iceland (located between Greenland and Norway,
http://www.tita.org.tw/view/iceland.html) and Estonia (Republic of Estonia,
http://www.suntravel.com.tw/zone/Europe/Estonia-136.htm)
→ government-sponsored databases of medical records have been
supplied to medical research companies.
Psychological impact and potential for stigmatization
inherent in the generation of genetic data → racial mistrust
and socioeconomic differences in gathering of and access to genetic
information
Reproductive issues
Potential moral (possible legal) obligations once data has been obtained.
Philosophical discussions – human responsibility, human right to
“play God” with genetic material, meaning of free will in relation to
genetically influenced behaviors
Genetically Modified Organisms (GMOs)
1998 – Five new major aims
1.7 (Part 1) Whose genome was sequenced?
The content of the Human Genome
Completion of the first draft of the HGP was announced at a press conference in June
2000, but publication of the result was delayed until Feb. of 2001.
Needed refinement of the seq. assembly, including gap closure, gene annotation, and
prediction
It is estimated that the total number of genes is somewhere around 25,000 (~ two
times greater than the gene content of the fruit fly and C. elegans, and five times
greater than yeast, see Table 1.3 for more details)
Table 1.3 Comparison of Gene Content in some Representative Genomes
No dramatic differences in gene content between humans and other mammals.
Sep. 1994 – the first high-resolution genetic map of the complete genome – 23
linkage groups (one per chromosome) with 1200 markers at an average of 1cM
intervals
Around 1995 – physical map – 52,000 sequence-tagged sites (STSs) at ~60 kbp
intervals
1998 – 3000 SNPs
Middle of 2004 – 1.8 million mapped SNPs, see The SNP Consortium (TSC)
http://snp.cshl.org
Providing polymorphic markers at 2 kb intervals and placing 85% of all exons within
5 kb of a SNP.
2000 – the first draft of the smallest human chromosome, chromosome 21, was
published
The content of the Human Genome
Two questions for the HGP
(1) Whose genome was sequenced ?
The sequence is derived from a collection of several libraries obtained from
a set of anonymous donors. Both the IHGSC and the private firm Celera Genomics
assembled their seq. from multiple libraries of ethnically diverse individuals
In each case, one particular individual's DNA contributed 3/4 and 2/3 of the raw seq., respectively.
Size of shaded sector ~ amount of seq.
contributed by a single individual
The content of the Human Genome
The Celera sample included at least one individual from each of four
ethnic groups, as well as both males and females.
Craig Venter admitted that his own DNA contributed substantially to the
Celera sequence
His own poodle contributed to the first-draft canine
genome seq.
The Human Genome Project
(2) When can we regard it as finished ?
• The complete seq. of 99% of human euchromatin has been
published to an estimated error rate of ~ 1 event in 100,000
bases.
• Human polymorphism is an order of magnitude greater than
this → at least 10 SNPs for each seq. error
• Extensive tracts of heterochromatin (regions with few or no
genes), mostly
associated with centromeres and telomeres, which may account for as much as
20% of the total genome, will probably never be sequenced.
• Since the completion of the first draft → the HGP has focused on
characterizing human diversity.
• International HapMap project – map all of the major
haplotypes in the human genome and characterize their
distribution among populations, as a step toward identification
of human disease susceptibility factors, see
http://www.hapmap.org
Internet Resources
– NCBI and Ensembl
NCBI http://www.ncbi.nlm.nih.gov
Ensembl http://www.ensembl.org
– a collaboration between EMBL-EBI and
the Sanger Center in the UK.
Both sites provide high-resolution physical
maps of any segment of the genome.
Several genome views
UCSC Genome Browser
http://genome.cse.ucsc.edu
Commercial web sites - Incyte Genomics,
Celera, Rosetta Inpharmatics, Informax, and
LION Biosciences
http://consert-lpg.obs.ujfgrenoble.fr/html/en/rosetta_section2_wrapper.shtml
Figure 1.8 The National Center for
Biotechnology Information (NCBI) Web site.
Internet Resources – NCBI and Ensembl
Ex. 1.2 Use the NCBI and Ensembl genome browsers to
examine a human disease gene. Use OMIM to identify
a gene that is implicated in the etiology of the
disease.
Ans.
Go to http://www.ncbi.nlm.nih.gov → search OMIM for Asthma → pick
a gene of interest → for example, Interleukin 13 (IL13).
This page gives a lot of textual information plus links to
other sites, including the Human Gene Mutation Database
(HGMD) or Entrez Gene
(a) What are the various identifiers of the gene ?
*147683
(b) Where is the gene located on the chromosome
(cytologically and physically) ?
The cytological location is 5q31 (chromosome 5, long arm).
Click on Gene map locus → 5q31 → click location
5q31 → click NCBI MapViewer
→ position 132.02 Mb; the Gene ID for IL13 is 3596
→ Gene aliases: ALRH; P600; IL-13; MGC116786;
MGC116788; MGC116789
(c) What is the RefSeq for the gene ?
The RefSeq is NM_002188, an mRNA seq.
Internet Resources – NCBI and Ensembl
(d) How many exons are there in the major transcript, and how long is it?
From Entrez Gene → Display ‘Gene table’ → 4 exons, 1282 bp long,
encoding a 146 amino acid protein; or use NCBI MapViewer →
Consensus CDS (CCDS)
From the RefSeq ID NM_002188, link to GenBank → signal peptide
(interleukin 13 precursor), 34 aa (seq. 15 – 116),
mat_peptide (interleukin 13 precursor) 98 aa
(e) What is known about the function of the gene?
See NCBI description - This gene encodes an immunoregulatory
cytokine produced primarily by activated Th2 cells. This cytokine is
involved in several stages of B-cell maturation and differentiation.
(f) Do the two annotations agree? Which browser do you prefer,
and why?
Ensembl http://www.ensembl.org, select gene → type IL-13 →
Ensembl gene ID ENSG00000169194
GeneView shows Exons: 4, Transcript length: 1,282 bp,
Protein length: 146 residues
Internet Resources - OMIM
• Online Mendelian Inheritance in Man
• A database that provides text summarizing recent
genetic research in response to a query about a
particular disease, as well as links to MedLine and
GenBank and other information.
• Intended for physicians and human geneticists; covers
disease types such as muscle, metabolism,
cardiovascular, and physiological disorders.
• OMIM lists in excess of 15,000 known disease-causing Mendelian disorders.
• GEO BLAST tool – search for all genes in the gene
expression database that have similar seq., and
then compare levels of expression of the genes
across species and experimental conditions.
Figure 1.9 The Mendelian
Inheritance in Man (OMIM) Web site
Internet Resources - OMIM
OMIM http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=OMIM
Use OMIM help
Internet Resources - OMIM
OMIM has a defined numbering
system – certain positions within that
number indicate information about
the genetic disorder itself.
The first digit – the mode of
inheritance of the disorder
1 = autosomal dominant
2 = autosomal recessive
3 = X-linked locus or phenotype
4 = Y-linked locus or phenotype
5 = mitochondrial
6 = autosomal locus or phenotype
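The first-digit rule listed above can be expressed as a small lookup. This sketch uses only the six categories given on the slide and the two MIM numbers that appear in these notes (147683 for IL13 and 604896 for MKKS). The function name `omim_category` is an assumption introduced here and is not part of any OMIM interface.

```python
# First digit of a six-digit MIM number -> category, as listed on the slide.
OMIM_FIRST_DIGIT = {
    "1": "autosomal dominant",
    "2": "autosomal recessive",
    "3": "X-linked locus or phenotype",
    "4": "Y-linked locus or phenotype",
    "5": "mitochondrial",
    "6": "autosomal locus or phenotype",
}

def omim_category(mim_number):
    """Return the category implied by the first digit of a MIM number.

    Accepts an int or a string; a leading '*' or '#' prefix is ignored.
    """
    digits = str(mim_number).lstrip("*#")
    if len(digits) != 6 or not digits.isdigit() or digits[0] not in OMIM_FIRST_DIGIT:
        raise ValueError(f"not a recognised MIM number: {mim_number!r}")
    return OMIM_FIRST_DIGIT[digits[0]]

print("*147683 (IL13):", omim_category("*147683"))
print("604896 (MKKS):", omim_category(604896))
```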
Internet Resources - OMIM
• The distinction between 1 or 2 and 6 is that entries
cataloged before May 1994 were assigned
either a 1 or 2, whereas entries after that date
were assigned a 6 regardless of whether the
mode of inheritance was dominant or recessive.
• * = the phenotype caused by the gene at this
locus is not influenced by genes at other loci;
however, the disorder itself may be caused by
mutations at multiple loci
• # = the phenotype is caused by two or more
genetic mutations
Internet Resources - OMIM
Example: 604896 (MKKS)
Display → allelic variants
Allelic variants – after each allelic variant, a description is given of the
clinical or biochemical outcome of that particular mutation
Figure: allelic variants for MKKS
Internet Resources - OMIM
OMIM indicates that the gene SRY
encodes a transcription factor that is a
member of the high-mobility group-box
family of DNA binding proteins. Mutations
in this gene give rise to XY females with
gonadal dysgenesis,
whereas translocation of part of the Y
chromosome containing this gene to the X
chromosome gives rise to XX males.
Q 1a. An allelic variant of SRY causing sex
reversal with partial ovarian function has been
cataloged in OMIM. What was the mutation at
the amino acid level and what is observed in
XY mice carrying this mutation?
Ans. Use “SRY AND human” for the OMIM
search → then view the list of allelic variants.
Variant 0020 is the correct entry. Mutation is
Gln2Ter; XY mice are fertile females, although
fertility is reduced and ovaries fail early.
Internet Resources - OMIM
Q1b. Follow the Gene Map link in the left
sidebar to access the MIM gene map,
one other gene is found at the same
cytogenetic map location. What is the
name of this gene, and what methods
were used to map the gene to this
location?
Ans. Click GeneMap in the left sidebar.
Correct gene is ZFY. Under the Methods
columns, REn and A are listed. Clicking
on the Methods hyperlink at the top of the
column shows the key to the
abbreviations. REn stands for neighbor
analysis in restriction fragments; A stands
for in situ hybridization.
Animal Genome Projects
The International Sequencing
Consortium (ISC)
http://www.intlgenome.org
- A database of animal and plant
genome sequencing projects
- Some of these organisms are
shown in Figure 1.10
Figure 1.10 (Part 1) A gallery of animal genome
sequencing projects
Animal Genome Projects
- At the National Human Genome
Research Institute (NHGRI), the
decision to commit the tens of millions
of dollars required for any new
genome is made by a council of senior
genome scientists – a 10 page “white
paper”
- Weigh the expected impact of the
sequence on enabling biomedical
research and the annotation of
sequence function
- A draft genome can be produced for
most animals within 3-6 months
Figure 1.10 (Part 2) A gallery of animal
genome sequencing projects
1.10 (Part 3) A gallery of animal genome sequencing projects
GenBank Files – Box 1.2
There are many ways to present the structure and
annotation of a gene or seq.
due to alternative splicing and alternative transcription start sites (TSS), and small
errors that occur during cDNA cloning
all genomes are full of polymorphisms
The same gene may be represented by multiple
different seq. or annotations in the genome
database
RefSeq – hand curation by experts
Example – human HoxA1, 11421562
Go to http://www.ncbi.nlm.nih.gov/
1. LOCUS: XM_004915, GI:14751246
2. Followed by the reference, ….
3. Features section (CDS, misc_feature, .. etc),
links to GeneID, MIM, CDD
4. Next comes the seq. in FASTA format,
‘Display’ in XML or ASN.1 file format
GenBank Files – Box 1.2
Use Entrez Gene – HOXA1
Two isoforms
GenBank format
Graph display – HOXA1
GenBank Files – Box 1.2
Ensembl - http://www.ensembl.org/index.html
Gene – HOXA1
GenBank Files – Box 1.2
UCSC Genome Browser http://genome.cse.ucsc.edu
Gene – HOXA1
Rodent Genome Projects
Mouse Genome Informatics (MGI)
http://www.informatics.jax.org/
Three major advantages of rodent research are
1. Existence of a large number of mutant strains
that, combined with whole genome mutagenesis,
→ lead to genetic analysis of every identified
locus in the genome
2. Existence of a panel of approximately 100
commonly used lab. mouse strains
with well-characterized genealogy – a
resource for the study of genetic variation
3. The existence of conserved seq. blocks is
generally an indicator of functional constraint
2002 – draft of the Mouse genome
2004 – draft of the rat genome
Figure 1.11 The Mouse Genome Informatics (MGI) Web site
Rodent Genome Projects
Functional genomic analysis of rat has been stimulated by three
major advances achieved in the 1990s
1. The technology for targeted (site-directed) mutagenesis by
homologous recombination of the wild-type locus with a
disrupted copy
2. Saturation random (unbiased) mutagenesis programs - gather
information about the entire “sequence space”, i.e., the relationship
between aa sequence, 3D protein structure and function
3. Emergence of ‘phenomic’ analysis, in which
mutagenized lines are subject to biochemical, physiological,
immunological, morphological, and behavioral tests in parallel →
large-scale identification of genes required for non-lethal
phenotypes
Rodent Genome Projects
Conservation of gene order and DNA seq.
between the human and mouse genomes
http://www.ncbi.nlm.nih.gov/Homology/
(A) Blocks of synteny between mouse (chr.
11) and parts of five different human
chromosomes
(B) Enlarged view of a small region – human
5q31. In this approximately 1 Mb region
there is almost perfect correspondence in
the order, orientation, and spacing of 23
putative genes, including four interleukins.
(C) Enlargement of the alignment of 50 kb
that includes the genes KIF-3A, IL-4 and
IL-13. Blue dots show the distribution of
conserved seq. (with 50%-100% identity).
Two of the conserved blocks (red bars)
fall between genes, whereas most of the
others (blue bars) are in the introns and
exons of the genes.
Use PipMaker
http://nog.cse.psu.edu/pipmaker
Figure 1.12 Mouse-human synteny
and sequence conservation
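The blue dots in panel (C) mark stretches with 50%-100% identity between human and mouse. A rough sketch of that idea is to slide over an ungapped alignment and report percent identity per window. This is not PipMaker's algorithm; the function names, the window size and the two toy sequences are assumptions made only for illustration.

```python
def window_identity(seq_a, seq_b, window):
    """Percent identity in consecutive, non-overlapping windows of an
    ungapped alignment (both sequences must have the same length)."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    result = []
    for start in range(0, len(seq_a) - window + 1, window):
        a = seq_a[start:start + window]
        b = seq_b[start:start + window]
        matches = sum(x == y for x, y in zip(a, b))
        result.append((start, 100.0 * matches / window))
    return result


def conserved_windows(seq_a, seq_b, window, min_identity=50.0):
    """Keep only windows at or above the identity threshold."""
    return [(s, p) for s, p in window_identity(seq_a, seq_b, window)
            if p >= min_identity]


# Hypothetical aligned fragments, not real human or mouse sequence.
human = "ACGTACGTACTTTTTTTTTT"
mouse = "ACGTACGTAGAAAAAAAAAA"
print(conserved_windows(human, mouse, window=10))
```

With the toy input only the first window passes the 50% cutoff, which mirrors how conserved blocks stand out against unconserved flanking sequence in the figure.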
Exercise 1.3 Compare the structure of a gene in a mouse and a human
Rodent Genome Projects
Use NCBI http://www.ncbi.nlm.nih.gov
choose Genome biology
mouse chr.11
use Maps and options
add human gene map
Rodent Genome Projects
Mouse Genome Informatics (MGI)
http://www.informatics.jax.org
- Integrate physical and genetic maps
- Search for ortholog genes
- Online comparison of the mouse and
human genome
Rodent Genome Projects
Ex. 1.3
Use either NCBI or Ensembl
browser, explore the
structure of the gene used in
Fig. 1.2 in a mouse and a
human (and other
vertebrates)
Ans. Ensembl
http://www.ensembl.org
– type in human IL13
(ENSG00000169194)
‘Orthologue Prediction’ →
view all genes in
‘MultiContigView’ → IL13 is
on mouse chr. 11, human chr.
5, and rat chr. 10
Box 1.2 (Part 2) GenBank Files
Other Vertebrate Biomedical Models
2004 – chicken (G. gallus) and dog (C. familiaris) genomes are fully sequenced
Motivation – biomedical
Chickens – model for oncogenesis and virology
Dog – model for complex diseases such as asthma, parasite infection, cancer,
arthritis, diabetes, and behavioral disorders
Applications
• Artificial selection on breed diversity
• Research into avian evolution
Vertebrate development → Zebrafish
• transparent embryogenesis, ease of culture, existence of dense genetic map
• Thousands of genes were found to be required for proper development of organs
• http://zfin.org
• A variety of ecologically and commercially important fish species, such as sticklebacks,
cichlids, and salmonids
Other Vertebrate Biomedical Models
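Dog genome map: sequencing completed
Washington Post, 2005/12/8
http://www.udn.com/2005/12/9/NEWS/WORLD/WOR4/3052845.shtml
Dogs can serve as a major tool for studying human genetic diseases, because some
breeds are far more likely than others to develop certain diseases, e.g.
Samoyeds are prone to diabetes,
Rottweilers are prone to bone cancer,
Spaniels are a high-risk group for epilepsy,
Dobermans suffer from narcolepsy far more often than other dogs. These diseases are also common in humans.
Dolly, the cloned sheep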
http://scc.bookzone.com.tw/sccc/sccc.asp?ser=302
Other Vertebrate Biomedical Models
Sequencing nonhuman primates, such as the rhesus
macaque and the chimpanzee – intended
to understand the origins of diversity in the
immune system as well as mechanisms of
pathogen resistance
Comparison of human and chimp seq.
• Many genes seem to have been positively
selected
• Humans are differentiated from chimps by small
deletions up to 10 kb in length, which occur on
average every 500 kb along chromosome 21
Animal Breeding Projects
OMIA (Australia) – genome maps for over a dozen species of
agricultural importance
http://www.angis.org.au/Databases/BIRX/omia
• Access data on inheritance patterns for species other than human
and mouse
• Benefits of breeding programs lie in improvements in yield,
infectious disease resistance, adaptation to climatic conditions,
improved food quality, and maximizing the benefits of transgenic
technology
• These goals will be met both through enhanced genetic map
development and association studies using SNP technology
ArkDBs (UK, Roslin Institute in Edinburgh)
http://www.thearkdb.org
• genome resources for ~10 species
Invertebrate Model Organisms
Generic Model Organism Database (GMOD)
http://www.gmod.org
- A coordinated effort of the mammalian,
invertebrate, and plant genome communities to
standardize web tool construction and
implementation and to provide open source
software for database management
Figure 1.13 The GMOD project
Invertebrate Model Organisms
A 40 kb region of cytological
band 43E of fruit fly, centered on
the saxophone gene.
Figure 1.14 Drosophila gene annotation
Invertebrate Model Organisms
Flybase
http://www.flybase.org/
- Search for the gene symbol : sax
- click the ‘gene region map’
http://www.flybase.org/cgibin/gbrowse_fb/dmel?ref=2R;id=FBgn0003317
- each gene either has a number
beginning with CG or is identified by its
standard name (e.g. sax)
- show gene and mRNA
- transposable element insertions
(Burdock, one is shown in pink)
Invertebrate Model Organisms
• The first multicellular eukaryote to be completely
sequenced was C. elegans, in 1998 http://www.wormbase.org
• Fruit fly – sequence completed in 2000
• Decades of genetic analysis have led to the molecular
characterization of up to 20% of the complement of genes in
these two organisms
• Over 90% of the true genes seem to have been identified
• Assigned a tentative function based on seq. similarity
• 1/4 to 1/3 of the predicted genes remain ‘orphans’ with no
known seq. similarity to genes in any other organism →
without functional data
Invertebrate Model Organisms
• Ongoing EST sequencing, gene structure and mutational analysis
• Unexpected – there may be roughly 40–50% more genes in the C. elegans genome
(19,000) than there are in the fly genome (13,500), despite the fact that
the fly is much more complex at several levels, including (1) the number
of cells, (2) number of cell types, and (3) organization of the nervous
system
• Nematode – a surprising surplus of steroid hormone receptors
• Fruit fly – olfactory receptor family
• There is no simple relationship between gene number and tissue
complexity
• The high degree of conservation of all the major regulatory and
biochemical pathways, most of which are identifiable not only in both
nematodes and flies but also in the unicellular eukaryote yeast and in
vertebrate genomes
Invertebrate Model Organisms
Functional genomics → a major impact of the
invertebrate genome projects is the prospect of
obtaining mutations in every single gene of the
genomes
In fly – by a combination of saturation mutagenesis
+ a library of overlapping deficiencies (deletions)
that remove every segment of each
chromosome
In nematode – saturation mutagenesis + RNAi
(double-stranded RNA fed to the worms)
Invertebrate Model Organisms
>60% of a sample of 289 human disease
genes have an orthologous gene in the fly
<60% in nematode
~20% in yeast
Fig. 1.15 shows the fraction of human disease
genes in each of six categories that have
orthologs in the fly, nematode and yeast
genomes, as detected by seq. similarity at three
levels of significance
Conservation of genetic interactions across the
animal kingdom → uncover genes that
interact with known disease-promoting loci
Pharmaceutical companies – interested in
invertebrate genomics for its potential to
identify drugs that affect neural function
Example: fluoxetine resistance in nematodes,
alcohol tolerance in flies
Because molecular interactions between gene products
can be conserved, genes can be compared
functionally across species
Figure 1.15 Human disease genes in
model organisms
Honey bee genome sequencing
http://www.udn.com/2006/10/31/NEWS/WORLD/WOR4/3581547.shtml
Sea urchin genome sequencing
http://tw.news.yahoo.com/article/url/d/a/061110/2/6cqy.html
Box 1.3 Managing and Distributing Genome Data
Internet technology is essential for genomic scientists
NCBI, EBI, LIMS (laboratory information management
systems)
DB – RDB (relational DB) and OODB (object-oriented DB)
RDB – very effective for sorting, searching, and distributing
data that fits into table form
OODB – good at handling complex data structures and
useful for performing analyses on sequence ‘objects’ (data
+ functions for operating on the data) → a very
efficient programming approach
DB query language = SQL = Structured Query Language
http://www.geocities.com/SiliconValley/Vista/2207/sg17.html
Scripting language (no need to compile) = Perl = good for
extracting and processing text files
http://bio.perl.org
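To make the RDB versus object distinction in Box 1.3 concrete, here is a small sketch in Python (used here instead of Perl purely for illustration). The first half stores tabular data in an in-memory SQLite database and queries it with SQL; the chromosome assignments for IL13 are the ones quoted in the Exercise 1.3 answer. The second half shows a sequence "object" that bundles data with functions. The table name `gene`, the class `DnaSequence` and the toy sequence are assumptions, not part of any real genome database schema.

```python
import sqlite3
from dataclasses import dataclass

# --- Relational view: data that fits a table, queried with SQL ---
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE gene (symbol TEXT, species TEXT, chromosome TEXT)")
conn.executemany(
    "INSERT INTO gene VALUES (?, ?, ?)",
    [("IL13", "human", "5"), ("IL13", "mouse", "11"), ("IL13", "rat", "10")],
)
rows = conn.execute(
    "SELECT species, chromosome FROM gene WHERE symbol = ? ORDER BY species",
    ("IL13",),
).fetchall()
print(rows)


# --- Object view: data plus the functions that operate on it ---
@dataclass
class DnaSequence:
    name: str
    residues: str

    def gc_content(self):
        """Fraction of G and C bases."""
        s = self.residues.upper()
        return (s.count("G") + s.count("C")) / len(s)

    def reverse_complement(self):
        """Return a new object holding the reverse-complement strand."""
        table = str.maketrans("ACGT", "TGCA")
        return DnaSequence(self.name + "_rc",
                           self.residues.upper().translate(table)[::-1])


toy = DnaSequence("toy", "ATGC")  # hypothetical sequence
print(toy.gc_content(), toy.reverse_complement().residues)
```

The table is convenient for sorting and searching; the object is convenient when an analysis (GC content, reverse complement) should travel with the sequence itself.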
Box 1.3 Managing and Distributing Genome Data
Plant Genome Projects
Arabidopsis thaliana – the first plant genome to be
sequenced, between 1999 and 2000
• ~115 Mb, ~25,000 genes, about twice the number of fly genes
• Evolved via two rounds of whole genome duplication →
shuffling of chromosome regions and
considerable gene loss
• >1500 tandem arrays (generally 2 or 3 copies) of
repeated genes have been identified, ~11,000 gene
families
• Some geneticists regard this number as representative
of the minimal complexity required to support
multicellularity
• It is believed that all plant and animal genomes
represent modifications of a ‘toolkit’ of gene families that
evolved >10^9 years ago
Plant Genome Projects
>30 segmental duplications
(A) 7 intra-chromosomal duplications
are shown as duplicated blocks of
color within three of the five
chromosomes; five duplications
occur in the first chromosome, and
the fourth and fifth chromosomes
each display one duplicated piece
(B) Another two dozen interchromosomal segmental
duplications. A twist in the band
→ an inversion accompanied the
duplication event
Figure 1.16 Chromosome
duplications in the
Arabidopsis thaliana genome
Plant Genome Projects
Plant genomes – plant-specific genes
Enzymes required for cell wall biosynthesis
Transport proteins that move organic nutrients, inorganic
ions, toxic compounds, metabolites, and even proteins
and nucleic acids between cells
Enzymes required for photosynthesis, such as Rubisco and
electron transport proteins
Products involved in plant turgor,
phototropic and gravitropic responses
Enzymes and cytochromes involved in the production of
secondary metabolites found in flowering plants
A large number of pathogen resistance (R) genes. Unlike
mammalian immune system genes, R genes are dispersed
throughout the genome rather than localized in a single
complex
Plant Genome Projects
• Plants share with animals many of
the gene families - Intercellular
communication, transcriptional
regulation, signal transduction
• A. thaliana lacks homologs of the
Ras G-protein family and tyrosine
kinase receptors, Rel, forkhead, and
nuclear steroid receptor transcription
factors
• TAIR – The Arabidopsis Information
Resource http://www.arabidopsis.org
• UK CropNet http://ukcrop.net/
Grasses and Legumes
Genome projects for >50 different plant species are under way
The most important – major feed crops – the
grasses maize, rice, wheat, sorghum, and barley;
the forage legumes soybean and alfalfa;
forage rye grasses and fescues → several genomes are very large →
whole genome sequencing is impractical
Both rice (Oryza sativa) and maize (Zea mays)
have relatively small genomes
Two major rice cultivars,
japonica rice and indica rice
MaizeGDB http://www.maizegdb.org
waxy (glutinous) rice
Rice-Arabidopsis synteny
• Comparison of the genome sequences of rice and Arabidopsis →
extensive, complex patterns of synteny
• 20 of 54 genes in a 340 kb region of the rice genome (top)
retain the same order in five different 80- to 200-kb regions
of the Arabidopsis genome (below).
• Conserved genes (red and green boxes) are found on
both rice and Arabidopsis strands, but are interspersed with
a variable number of different genes (yellow boxes) in
Arabidopsis. Shaded boxes above the rice chromosome
indicate that the conserved gene is in the opposite
relative orientation on the Arabidopsis chromosomes.
Figure 1.17 Rice-Arabidopsis synteny
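"Retain the same order" can be checked computationally. One simple approach, assumed here for illustration and not necessarily the method behind Figure 1.17, is to list the Arabidopsis position of each rice gene's ortholog in rice order and take the longest increasing subsequence: its length is the number of genes that stay collinear. The positions below are invented, and the sketch ignores strand orientation.

```python
from bisect import bisect_left


def longest_collinear_run(positions):
    """Length of the longest strictly increasing subsequence,
    i.e. the largest set of orthologs that keeps the same order."""
    tails = []  # tails[k] = smallest end value of an increasing run of length k+1
    for p in positions:
        i = bisect_left(tails, p)
        if i == len(tails):
            tails.append(p)
        else:
            tails[i] = p
    return len(tails)


# Hypothetical data: rice genes in chromosomal order, each mapped to the
# position of its Arabidopsis ortholog (None = no ortholog in the region).
ortholog_positions = [3, None, 5, 1, 8, None, 9, 12]
mapped = [p for p in ortholog_positions if p is not None]
print(longest_collinear_run(mapped), "of", len(ortholog_positions), "genes collinear")
```

In the toy data five of eight genes keep their relative order; the slide's "20 of 54 genes" is the same kind of count.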
Grasses and Legumes
Economically important traits include resistance to
a broad range of pathogens; flowering time,
seed set, grain morphology, and related yield
traits; tolerance to drought, salt, heavy metals
and other extreme environmental circumstances;
and measures of feed quality such as protein
and sugar content.
Improved through genetic engineering +
specialized plant breeding techniques
Genome projects → reveal much information
regarding the evolution of domesticated species
Grasses and Legumes
Teosinte versus Maize
• Modern maize is a derivative of the wild progenitor
teosinte, which had multiple tillers.
• Throughout the coding region of tb1, the level of
polymorphism is substantially the same in a sample of
maize and teosinte. However, in the 5’ UTR, there is a
dramatic reduction in the level of polymorphism in
maize relative to that seen in teosinte.
Figure 1.18 Teosinte branched 1
and the evolution of maize
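"Level of polymorphism" in the tb1 comparison can be quantified in several ways. One common summary, assumed here for illustration, is nucleotide diversity: the average number of pairwise differences per site in an alignment. The sequences below are invented toy data, not real tb1 alleles; they only mimic the pattern of reduced variation in the maize 5’ UTR.

```python
from itertools import combinations


def nucleotide_diversity(seqs):
    """Average pairwise differences per site across aligned sequences."""
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("sequences must be aligned to the same length")
    pairs = list(combinations(seqs, 2))
    total_diffs = sum(sum(a != b for a, b in zip(x, y)) for x, y in pairs)
    return total_diffs / (len(pairs) * length)


# Hypothetical 5' UTR alignments (three alleles each).
teosinte_utr = ["ACGTAC", "ACCTAC", "ATGTAA"]
maize_utr = ["ACGTAC", "ACGTAC", "ACGTAC"]

pi_teosinte = nucleotide_diversity(teosinte_utr)
pi_maize = nucleotide_diversity(maize_utr)
print(f"teosinte pi = {pi_teosinte:.3f}, maize pi = {pi_maize:.3f}")
```

A sharp local drop in diversity in the crop relative to its wild progenitor, as in the toy numbers, is the kind of signal the figure attributes to selection during domestication.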
Other Flowering Plants
• >90 angiosperm genome projects are listed on the US
Department of Agriculture web site
http://www.nal.usda.gov/pgdic/Map_proj
• African, Australian, European, US projects
• Genetic maps and search for a common set of plant genes
• For some species, large EST seq. projects are also in
place → enable comparative genomic analysis
• Arabidopsis + grasses + several model organisms → shed
light on plant evolution
Other Flowering Plants
Forest trees – potential for economic impact
High-density genetic maps of spruce, loblolly and
several pines, a few species of Eucalyptus
Trait – wood quality, growth and flowering parameters
Dendrome web site http://dendrome.ucdavis.edu
Comparative analyses and transcription profiling of
genes involved in wood properties, including lignins
and enzymes that regulate cell wall biosynthesis
Crop plants – potato, tomato, tobacco, beans, cotton
Analyzing the genome diversity that affects productivity,
yield and quality improvements
No plant equivalent of the HGP’s ELSI initiative has
been established.
Figure 1.19 Forest genomics
Microbial Genome Projects
The minimal genome
• 1995 – the 1st complete genome, H.
influenzae → M. genitalium → 3 other
bacteria
• 1997 – E. coli
• Seq. information – genome structures (GC
content, transposable elements,
recombination), genome content (total
number of genes, conserved gene families)
• Gene annotation for prokaryotes is
more straightforward – ORFs tend to be
uninterrupted and genes tend to be closely
spaced; however, the assignment of genes to
operons is not trivial
• ~3/4 of the genes in a microbial genome can be assigned a
function based on their similarity to genes
in other organisms or by identifying protein
domains
• TIGR http://www.tigr.org
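Because prokaryotic ORFs are usually uninterrupted, a naive scan for ATG...stop stretches already finds many genes. The sketch below is a simplified illustration, not a real gene finder: it scans only the forward strand, uses only ATG as a start codon, and the function name `find_orfs`, the `min_codons` cutoff and the toy sequence are assumptions.

```python
STOPS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=3):
    """Return (start, end, frame) for ATG...stop ORFs on the forward strand.
    Coordinates are 0-based; end is exclusive and includes the stop codon."""
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i                      # open a candidate ORF
            elif start is not None and codon in STOPS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None                   # close it at the first stop
    return orfs


toy = "CCATGAAATTTGGGTAACC"  # hypothetical sequence with one short ORF
print(find_orfs(toy))
```

A practical annotation pipeline would also scan the reverse strand, allow alternative start codons, and score candidates by similarity or codon usage; grouping the resulting genes into operons, as the slide notes, is a separate and harder problem.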
Microbial Gene contents
M. genitalium 0.6 Mb, 471 genes
H. influenzae 1.8 Mb, 1750 genes
E. coli K12 4.6 Mb, 4288 genes → average gene length ~1.1 kb
Gene duplication and divergence in large genomes,
gene loss in small genomes
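The "~1.1 kb" figure follows from dividing genome size by gene count, using the three genomes listed above. Strictly, this gives the average spacing per gene, which is an upper bound on average gene length because it includes intergenic DNA. Since microbial genes are closely spaced, the two numbers are close. The function name `bp_per_gene` is a hypothetical helper.

```python
# Genome size in bp and gene count, as given on the slide.
genomes = {
    "M. genitalium": (0.6e6, 471),
    "H. influenzae": (1.8e6, 1750),
    "E. coli K12": (4.6e6, 4288),
}


def bp_per_gene(size_bp, n_genes):
    """Average number of base pairs of genome per annotated gene."""
    return size_bp / n_genes


for name, (size_bp, n_genes) in genomes.items():
    print(f"{name}: {bp_per_gene(size_bp, n_genes) / 1000:.2f} kb per gene")
```

All three genomes come out near 1 kb per gene despite an almost eightfold difference in genome size, so gene number scales roughly with genome size in these bacteria.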
Exercise 1.4 Compare two microbial genomes using the CMR
The minimal genome
– the minimum complement of genes that are
necessary and sufficient to maintain a living
organism
To define genetically ‘What is life’?
Two general strategies
Bioinformatics strategy – identify which genes are
present in each and every sequenced genome
• Some functions can be performed by nonorthologous genes
• Conserved orthologs + a small number of
alternatives → ~256 genes
The minimal genome
Experimental strategy – systematically knock out the function of
individual genes: mutations that cannot be recovered define
genes that are likely to be components of the minimal genome
• M. genitalium – viable mutations recovered in 120 of the 470 genes
• B. subtilis (~4100 genes) – 271 genes are indispensable
under favorable growth conditions: metabolism, cell
division and shape, synthesis of the cellular envelope
• Synthetic lethal – the nonviability
of a combination of two or more individually viable mutations
• Infer that life can be supported by a genome of between 250
and 350 genes
• Build a viable organism from scratch by stitching
together artificially synthesized genes – build a poliovirus
The minimal genome
Deeper color → presence of a
gene
Pale color → the gene is absent in
that species
Genes a, d, and f are present in all
species, so they are inferred to be
necessary for life.
Figure 1.20A Describing the minimal genome
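The presence/absence comparison in Figure 1.20A amounts to a set intersection across genomes. In this minimal sketch the three gene sets are invented so that the result matches the figure's conclusion (genes a, d and f are shared by all species); the species names are placeholders.

```python
# Hypothetical gene content per species (labels a-f follow the figure).
presence = {
    "species_1": {"a", "b", "d", "f"},
    "species_2": {"a", "c", "d", "e", "f"},
    "species_3": {"a", "d", "f"},
}

# Genes present in each and every genome.
core = set.intersection(*presence.values())
print(sorted(core))
```

As the bioinformatics-strategy slide notes, a strict intersection misses functions carried out by nonorthologous genes in different species, so the real estimate adds a small number of such alternatives on top of the conserved orthologs.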
The minimal genome
Mutagenesis experiments
- Establish which genes are
essential by systematically
knocking out each functional
gene and seeing whether the
organism can survive without it.
- The overlap of these two
approaches may define the
minimal genome.
Figure 1.20B Describing the minimal genome
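The experimental strategy and its overlap with the comparative one can be sketched the same way. Genes for which no viable knockout is recovered are candidate essentials; intersecting them with the conserved set gives a candidate minimal genome. The sketch also flags synthetic lethal pairs. All gene labels and viability results are invented toy data, and the function names are hypothetical.

```python
def essential_candidates(all_genes, viable_knockouts):
    """Genes for which no viable knockout mutant was recovered."""
    return set(all_genes) - set(viable_knockouts)


def synthetic_lethal_pairs(double_mutant_viable):
    """Pairs of individually viable knockouts that are nonviable together."""
    return sorted(pair for pair, viable in double_mutant_viable.items()
                  if not viable)


# Hypothetical toy data.
all_genes = {"a", "b", "c", "d", "e", "f"}
viable_knockouts = {"b", "c", "e", "f"}      # single mutants that survive
conserved_core = {"a", "d", "f"}             # from the comparative strategy
double_mutant_viable = {("b", "c"): True, ("c", "e"): False}

experimental = essential_candidates(all_genes, viable_knockouts)
overlap = experimental.intersection(conserved_core)
print(sorted(experimental), sorted(overlap))
print(synthetic_lethal_pairs(double_mutant_viable))
```

In the toy data gene f is conserved everywhere yet dispensable in the knockout screen, and genes c and e are each dispensable but not both at once, which is why neither approach alone settles the minimal gene set.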
Figure 1.21 TIGR representation of a typical microbial genome
Sequenced Microbial Genomes
TIGR – Comprehensive Microbial Resource (CMR)
http://www.tigr.org/tigrscripts/CMR2/CMRHomePage.spl
New site http://pathema.tigr.org/tigrscripts/CMR/CmrHomePage.cgi
39 genomes were generated by TIGR, and the rest
by Brazil, Japan … Omniome DB
Streptococcus pneumoniae TIGR4
The outer and inner circles represent genes
encoded on the two strands of the
chromosomes
Genes from HMM – blue
BLAST – yellow, Omniome – pink
Click ‘align genome’ – MUMMER
Click ‘Analyses’ – for more tools, such as
COG/TIGRFAM/PFAM
Box 1.2 (Part 1) GenBank Files
Environmental Sequencing
Sequencing DNA extracted from an environment such as ocean,
soil, or intestinal flora
The main reason is that the vast majority of bacteria cannot be
cultured in vitro → our knowledge of microflora is both limited
by and biased by sampling
Pilot projects – identifying novel genes has the potential to change
oceanographers’ understanding of the mechanisms of
photosynthesis and global carbon and nitrogen cycling
Proteorhodopsin genes – suggesting that light harvesting need
not be coupled to chlorophyll in cyanobacteria
C. Venter – identified >1M new genes and almost 150 new types of
bacteria
Fecal material – human gut contains >500 different species of
bacteria, <30% can be cultured outside the body
Yeast
Completed in 1997
MIPS
http://mips.gsf.de/genre/proj/yeast/index.jsp
SGD http://www.yeastgenome.org
Parasite Genomics
World Health Organization (WHO)
• 10 tropical diseases that affect billions of
people worldwide
• Eradicating the pathogenic agents
• Crop damage caused by parasitic plant
nematodes costs billions of dollars
Parasite Genomics
Aims
1. Identify species-specific genes
2. Understanding the developmental
genetics
3. Polymorphism surveys that address the
population biology of the parasites
4. Mapping the genomics of the mosquito
100 genomes, 10 days and 10 million dollars Awards
2006 News, http://www.biotechnews.com.au/index.php/id;1321634104
World first: human sperm created in the laboratory, 2009/07/09
http://udn.com/NEWS/WORLD/WOR4/5008159.shtml
The End