Gene Annotation in Genomics Experiments
With a Focus on Tools in BioConductor
Carlo Colantuoni
&
Rafael Irizarry
April 19, 2006
ccolantu@jhsph.edu
Biological Setup
Every cell in the human body contains the entire
human genome: 3.3 Gb or ~30K genes.
The investigation of gene expression is meaningful
because different cells, in different environments,
doing different jobs express different genes.
Tasks necessary for gene expression analysis:
Define what a gene is.
Identify genes in a sea of genomic DNA where <3% of
DNA is contained in genes.
Design and implement probes that will effectively
assay expression of ALL (most? many?) genes
simultaneously. Cross-reference these probes.
Cellular Biology, Gene Expression, and
Microarray Analysis
DNA
RNA
Protein
Gene: Protein coding unit of genomic DNA
with an mRNA intermediate.
~30K genes
Sequence is a Necessity
[Figure: gene structure diagram. Labels: DNA probe, genomic DNA (3.3 Gb), mRNA, 5’ UTR, START, protein coding, STOP, 3’ UTR, AAAAA.]
From Genomic DNA to mRNA Transcripts
EXONS
INTRONS
Protein-coding genes are not easy to find - gene density is low, and exons are interrupted by introns.
~30K
>30K
Alternative splicing
Alternative start & stop sites in same RNA molecule
RNA editing & SNPs
Transcript coverage
Homology to other transcripts
Hybridization dynamics
3’ bias
Unsurpassed as source of expressed sequence
Sequence Quality!
Redundancy!
Completeness?
Chaos?!?
From Genomic DNA to mRNA Transcripts
~30K
>30K
>>30K
Transcript-Based
Gene-Centered Information
Design of Gene Expression Probes
Content: UniGene, Incyte, Celera Expressed vs. Genomic
Source: cDNA libraries, clone collections, oligos
Cross-referencing of array probes (across platforms):
Sequence <> GenBank <> UniGene <> HomoloGene
Possible mis-referencing:
Genomic GenBank Acc.#’s
Referenced ID has more NT’s than probe
Old DB builds
DB or table errors – copying and pasting 30K rows in excel …
Using RefSeq’s can help.
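The mis-referencing checks listed above can be sketched as a small screening pass over a probe table. This is only an illustration under stated assumptions: the probe rows, accessions, lengths, and cluster IDs are invented, and the function names `flag_probe` and `cross_reference` are hypothetical. The one external fact relied on is that RefSeq genomic records use prefixes such as `NC_`, `NT_`, `NG_`, and `NW_`, while RefSeq mRNAs use `NM_`.

```python
# Hypothetical probe annotation rows; all IDs and lengths are invented.
PROBES = [
    {"probe": "p1", "acc": "NM_000001", "probe_len": 60, "ref_len": 60},
    {"probe": "p2", "acc": "NT_000002", "probe_len": 60, "ref_len": 150000},
    {"probe": "p3", "acc": "X00003", "probe_len": 500, "ref_len": 2100},
    {"probe": "p4", "acc": "X00004", "probe_len": 500, "ref_len": 500},
]

# Hypothetical accession -> UniGene cluster table from the current DB build.
ACC_TO_CLUSTER = {"NM_000001": "Hs.1", "X00003": "Hs.3"}

# RefSeq prefixes that denote genomic (not transcript) records.
GENOMIC_REFSEQ_PREFIXES = ("NC_", "NT_", "NG_", "NW_")


def flag_probe(row, acc_to_cluster):
    """Return a list of possible mis-referencing problems for one probe."""
    flags = []
    if row["acc"].startswith(GENOMIC_REFSEQ_PREFIXES):
        flags.append("genomic accession")
    if row["ref_len"] > row["probe_len"]:
        flags.append("reference longer than probe")
    if row["acc"] not in acc_to_cluster:
        flags.append("accession missing from current build")
    return flags


def cross_reference(probes, acc_to_cluster):
    """Map each probe to (cluster or None, list of flags)."""
    return {
        row["probe"]: (acc_to_cluster.get(row["acc"]), flag_probe(row, acc_to_cluster))
        for row in probes
    }


REPORT = cross_reference(PROBES, ACC_TO_CLUSTER)
```

A flag is a prompt for manual review, not proof of an error: a reference longer than the probe, for example, may still point to the correct transcript.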
From Genomic DNA to mRNA Transcripts
http://www.ncbi.nlm.nih.gov/Entrez/
Functional Annotation of Lists of Genes
KEGG
PFAM
SWISS-PROT
GO
DRAGON
DAVID
BioConductor
Analysis of Functional Gene Groups
•One of the largest challenges in analyzing genomic data is associating the experimental data with the available metadata, e.g. sequence, gene annotation, chromosomal maps, literature.
•The annotate and AnnBuilder packages provide some tools for carrying this out.
•The annotate package maps to GenBank accession number, LocusLink LocusID, gene symbol, gene name, UniGene cluster, chromosome, cytoband, physical distance (bp), orientation, Gene Ontology Consortium (GO), PubMed PMID.
•Using AnnBuilder, it is possible to build associations with specific gene lists, e.g. the hgu95a package for Affymetrix HGU95A GeneChips.
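The idea of associating probe identifiers with metadata can be pictured as a keyed lookup. The sketch below is a language-neutral illustration written in Python. It is not the annotate or AnnBuilder API, and the table contents, field names, and the `lookup` helper are all hypothetical.

```python
# Hypothetical annotation table keyed by probe ID; every value is invented.
ANNOTATION = {
    "probe_0001": {
        "accession": "X00001",
        "symbol": "GENE1",
        "chromosome": "1",
        "pmid": ["00000001"],
    },
    "probe_0002": {
        "accession": "X00002",
        "symbol": "GENE2",
        "chromosome": "X",
        "pmid": [],
    },
}


def lookup(probe_ids, field, table=ANNOTATION):
    """Return the requested field for each probe ID, or None when unknown."""
    return [table.get(pid, {}).get(field) for pid in probe_ids]


# Example: symbols for a gene list that includes an unannotated probe.
SYMBOLS = lookup(["probe_0002", "probe_9999"], "symbol")
```

Returning `None` for unknown probes keeps the output aligned with the input gene list, which matters when the result is joined back to expression data.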
Functional Gene/Protein Networks
DIP
BIND
MINT
HPRD
PubGene
Predicted Protein Interactions
Analysis of Gene Networks
9606 is the NCBI Taxonomy ID for Homo sapiens
Predicted Human Protein Interactions
Used high-throughput protein interaction experiments
from fly, worm, and yeast to predict human protein
interactions.
A human protein interaction is predicted if both proteins
in an interaction pair from another organism have high
sequence homology to human proteins.
>70K Hs interactions predicted
>6K Hs genes
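The prediction rule above can be sketched in a few lines. This is a minimal illustration, not the pipeline used for the >70K predictions: the protein IDs are invented, the homology table is assumed to have been filtered for "high sequence homology" beforehand (the threshold is not given here), and `predict_human_interactions` is a hypothetical name.

```python
from itertools import product


def predict_human_interactions(model_pairs, homologs):
    """Transfer interactions from a model organism to human.

    model_pairs: iterable of (protein_a, protein_b) observed interactions.
    homologs: dict mapping a model-organism protein to the human proteins
              judged highly homologous to it (assumed pre-filtered).
    A human pair is predicted only when BOTH partners have a human homolog.
    """
    predicted = set()
    for a, b in model_pairs:
        for ha, hb in product(homologs.get(a, ()), homologs.get(b, ())):
            # Sort so that (H1, H2) and (H2, H1) count as one interaction.
            predicted.add(tuple(sorted((ha, hb))))
    return predicted


# Invented example: y3 has no human homolog, so the pair (y2, y3) is dropped.
MODEL_PAIRS = [("y1", "y2"), ("y2", "y3")]
HOMOLOGS = {"y1": ["H1"], "y2": ["H2", "H3"]}
PREDICTED = predict_human_interactions(MODEL_PAIRS, HOMOLOGS)
```

When one model-organism protein has several human homologs, every combination is emitted, which is one reason such predicted sets grow large.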
Thanks to …
Rafael Irizarry
Scott Zeger
Jonathan Pevsner
Carlo Colantuoni
Clinical Brain Disorders Branch, NIMH, NIH
Dept. Biostatistics, JHSPH
ccolantu@jhsph.edu
colantuc@intra.nimh.nih.gov
NCBI Web Links
http://www.ncbi.nlm.nih.gov
http://www.ncbi.nlm.nih.gov/Entrez/
http://www.ncbi.nih.gov/Genbank/
http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Nucleotide
http://www.ncbi.nlm.nih.gov/dbEST/
http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Protein
http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=gene
http://www.ncbi.nlm.nih.gov/LocusLink/
http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=unigene
http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=homologene
http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=OMIM
http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=PubMed
http://www.ncbi.nlm.nih.gov/PubMed/
http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=cdd
http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd.shtml
http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=snp
http://www.ncbi.nlm.nih.gov/SNP/
http://eutils.ncbi.nlm.nih.gov/entrez/query/static/advancedentrez.html
http://www.ncbi.nlm.nih.gov/geo/
http://www.ncbi.nlm.nih.gov/RefSeq/
FTP:
ftp://ftp.ncbi.nlm.nih.gov/
ftp://ftp.ncbi.nlm.nih.gov/repository/UniGene
ftp://ftp.ncbi.nih.gov/pub/HomoloGene/
More Web Links
NUCLEOTIDE:
http://genome.ucsc.edu/
http://www.embl-heidelberg.de/
http://www.ensembl.org/
http://www.ebi.ac.uk/
http://www.gdb.org/
http://bioinfo.weizmann.ac.il/cards/index.html
http://www.gene.ucl.ac.uk/cgi-bin/nomenclature/searchgenes.pl
PROTEIN:
http://us.expasy.org/
ftp://us.expasy.org/
http://www.sanger.ac.uk/Software/Pfam/
http://www.sanger.ac.uk/Software/Pfam/ftp.shtml
http://smart.embl-heidelberg.de/
http://www.ebi.ac.uk/interpro/
http://us.expasy.org/prosite/
ftp://us.expasy.org/databases/prosite/
PATHWAYS and NETWORKS:
http://www.genome.ad.jp/kegg/
ftp://ftp.genome.ad.jp/pub/kegg/ (http://www.genome.ad.jp/anonftp/)
http://dip.doe-mbi.ucla.edu
http://dip.doe-mbi.ucla.edu/dip/Download.cgi
http://www.blueprint.org/bind/
http://www.blueprint.org/bind/bind_downloads.html
http://160.80.34.4/mint/index.php
http://160.80.34.4/mint/release/main.php
http://www.hprd.org/
http://www.hprd.org/FAQ?selectedtab=DOWNLOAD+REQUESTS
http://www.pubgene.org/ (also .com)
FUNCTIONAL ANNOTATION:
http://www.bioconductor.org/
http://apps1.niaid.nih.gov/david/
http://www.geneontology.org/
http://discover.nci.nih.gov/gominer/index.jsp
http://pubmatrix.grc.nia.nih.gov/
http://pevsnerlab.kennedykrieger.org/dragon.htm
SAVAGE:
Detection of More Subtle
Functionally Related Groups
of Gene Expression Changes
Differential Expression of Functional
Gene Groups within One Experiment
[Figure: plots of expression changes for functional gene groups in EXP#1. Labels: DRAGON, SAVAGE, Swiss-Prot, PFAM, KEGG, 30K, 10K, ~40K annotations, ~3K.]
Differential Expression of a Single Functional
Gene Group Across Multiple Experiments
[Figure: four panels, EXP#1 to EXP#4. Labels: BioDB, DRAGON, SAVAGE.]
Similar Differential Expression Patterns Across Multiple Experiments
[Figure: panels for EXP#1 and EXP#2. Labels: X, CN, ALL, p value <0.1.]
The distribution of gene expression values for each
gene group in each sample is plotted as a single
point in low dimensional space. This is achieved
using Principal Components Analysis along with
Non-Metric Multi-Dimensional Scaling.
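As a rough illustration of the first half of that procedure, the sketch below summarizes each gene group's expression values as a histogram and projects the histograms to two dimensions with PCA (computed by power iteration in plain Python). It is an assumption-laden sketch, not the SAVAGE implementation: the bin range, bin count, and all function names are hypothetical, and the non-metric multi-dimensional scaling step is omitted entirely.

```python
import math


def histogram(values, lo=-4.0, hi=4.0, bins=8):
    """Summarize a distribution as bin fractions (bin layout is an assumption)."""
    counts = [0] * bins
    width = (hi - lo) / bins
    for v in values:
        idx = min(bins - 1, max(0, int((v - lo) / width)))
        counts[idx] += 1
    return [c / len(values) for c in counts]


def center(rows):
    """Subtract the column means so PCA describes variation around the mean."""
    means = [sum(col) / len(rows) for col in zip(*rows)]
    return [[x - m for x, m in zip(row, means)] for row in rows]


def top_component(rows, iters=200):
    """Leading principal direction of centered rows via power iteration."""
    d = len(rows[0])
    # Non-uniform start: histogram rows sum to 1, so centered rows are
    # orthogonal to the all-ones vector and a uniform start would vanish.
    v = [float(i + 1) for i in range(d)]
    norm = math.sqrt(sum(x * x for x in v))
    v = [x / norm for x in v]
    for _ in range(iters):
        scores = [sum(a * b for a, b in zip(row, v)) for row in rows]
        w = [sum(s * row[j] for s, row in zip(scores, rows)) for j in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    return v


def pca_2d(rows):
    """Return one (PC1, PC2) coordinate pair per input row."""
    x = center(rows)
    coords = [[] for _ in x]
    for _ in range(2):
        v = top_component(x)
        scores = [sum(a * b for a, b in zip(row, v)) for row in x]
        for c, s in zip(coords, scores):
            c.append(s)
        # Deflate: remove the component just found before finding the next.
        x = [[a - s * b for a, b in zip(row, v)] for row, s in zip(x, scores)]
    return coords
```

Each point passed to `pca_2d` would be one gene group in one sample, so groups with similar expression distributions land close together.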
PING:
Detection of Differential
Expression in Functional
Networks of Proteins
Interaction Networks in
Gene Expression Data
[Figure: a large protein interaction network, with the networks regulated in Sample #1, Sample #2, and Sample #3 highlighted in turn. Labels: PING, Network of Interest.]
[Figure: NT's in GenBank (millions), axis from 1 to 100,000 in tenfold steps, 1984 to 2004.]
Genomic DNA Content
1. Interspersed repeats (~1/2 Hs. genome)
2. (Processed) pseudogenes
3. Simple sequence repeats
4. Segmental duplications (~5% Hs. genome)
5. Blocks of tandem repeats (can be very large)
6. Genes: Promoters - Exons - Introns (<3%)
defining what a gene is - protein coding unit of genomic DNA with
an mRNA intermediate
identifying genes within genomic DNA
protein-coding genes (mRNA)
functional RNA genes - tRNA, rRNA, snoRNA, snRNA, miRNA
How is a gene defined
in “wet” biology and in silico?
Array probe design:
Source – cDNA libraries, oligos, clone collections
Content – UniGene, Celera, Incyte
Cross-referencing of array probes:
GenBank <> UniGene <> HomoloGene
Possible mis-referencing:
Genomic GenBank Acc.#’s
Referenced ID has more NT’s than probe
Old DB builds
DB or table errors
Seq. from mRNA sample vs. Seq. on array:
Alt. splicing - known and not
Alt. start / stop site in same RNA molecule
Less important: RNA editing, SNPs
Transcript coverage
Homology to other transcripts
Hybridization dynamics – hyper-multiplex hyb rxn
Empirical validation
3’ bias
Finding genes in eukaryotic DNA
ORF identification – three-letter genetic code (codons):
4*4*4 = 64 possible codons. It is possible to translate any stretch of
genomic DNA into protein, but that doesn’t mean we have identified a
protein-coding gene!
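To make the point concrete, here is a minimal sketch of a naive ORF scan. It enumerates the 64 codons and reports every ATG-to-stop stretch in the three forward reading frames. The function name `find_orfs` and the `min_codons` cutoff are assumptions for illustration, and the scan ignores the reverse strand, introns, and every other signal a real gene finder would use, which is exactly why its hits are not evidence of a gene.

```python
from itertools import product

# All three-letter words over a four-letter alphabet: 4 * 4 * 4 = 64 codons.
CODONS = ["".join(c) for c in product("ACGT", repeat=3)]

# Stop codons of the standard genetic code.
STOPS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_codons=2):
    """Return (start, end) slices of ATG...stop runs in the 3 forward frames.

    min_codons counts codons from ATG up to, but not including, the stop.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOPS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None
    return orfs


# Toy sequence containing one short ORF in the third frame.
ORFS = find_orfs("CCATGAAATAGGG")
```

On random sequence, short ORFs like this appear by chance all the time, so length thresholds and independent evidence (for example expressed sequence) are needed before calling anything a gene.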
There are several kinds of exons:
-- non-coding
-- initial coding exons
-- internal exons
-- terminal exons
-- some single-exon genes are intronless
What We Are Going To Cover
Cells, Genes, Transcripts –> Genomics Experiments
Sequence Knowledge Behind Genomics Experiments
Annotation of Genes in Genomics Experiments