Gene Annotation in Genomics Experiments With a Focus on Tools in BioConductor Carlo Colantuoni & Rafael Irizarry April 19, 2006 ccolantu@jhsph.edu Biological Setup Every cell in the human body contains the entire human genome: 3.3 Gb or ~30K genes. The investigation of gene expression is meaningful because different cells, in different environments, doing different jobs express different genes. Tasks necessary for gene expression analysis: Define what a gene is. Identify genes in a sea of genomic DNA where <3% of DNA is contained in genes. Design and implement probes that will effectively assay expression of ALL (most? many?) genes simultaneously. Cross-reference these probes. Cellular Biology, Gene Expression, and Microarray Analysis DNA RNA Protein Gene: Protein coding unit of genomic DNA with an mRNA intermediate. ~30K genes Sequence is a Necessity DNA Probe START mRNA Genomic DNA 5’ UTR protein coding STOP AAAAA 3’ UTR 3.3 Gb From Genomic DNA to mRNA Transcripts EXONS INTRONS Protein-coding genes are not easy to find - gene density is low, and exons are interrupted by introns. ~30K >30K Alternative splicing Alternative start & stop sites in same RNA molecule RNA editing & SNPs Transcript coverage Homology to other transcripts Hybridization dynamics 3’ bias Unsurpassed as source of expressed sequence Sequence Quality! Redundancy! Completeness? Chaos?!? From Genomic DNA to mRNA Transcripts ~30K >30K >>30K Transcript-Based Gene-Centered Information Design of Gene Expression Probes Content: UniGene, Incyte, Celera Expressed vs. Genomic Source: cDNA libraries, clone collections, oligos Cross-referencing of array probes (across platforms): Sequence <> GenBank <> UniGene <> HomoloGene Possible mis-referencing: Genomic GenBank Acc.#’s Referenced ID has more NT’s than probe Old DB builds DB or table errors – copying and pasting 30K rows in excel … Using RefSeq’s can help. From Genomic DNA to mRNA Transcripts From Genomic DNA to mRNA Transcripts http://www.ncbi.nlm.nih.gov/Entrez/ Functional Annotation of Lists of Genes KEGG PFAM SWISS-PROT GO DRAGON DAVID BioConductor Analysis of Functional Gene Groups 0 . 3 0 . 2 0 . 1 0 . 0 4 2 0 2 4 •One of theAnnBuilder. largest challenges analyzingto genomic •Using It isinpossible •The annotate package maps to GenBank data is associating the experimental data with the accession number,e.g. LocusLink LocusID, gene build associations with specific gene available metadata, sequence, gene annotation, symbol, gene name,literature. UniGene chromosomal maps, lists, eg. hgu95a packagecluster, for chromosome, cytoband, physical distance (bp), AffymetrixGene HGU95A GeneChips. orientation, Ontology Consortium (GO), PubMed PMID.and AnnBuilder packages •The annotate provides some tools for carrying this out. Analysis of Functional Gene Groups 0 . 3 0 . 2 0 . 1 0 . 0 4 2 0 2 4 Functional Gene/Protein Networks DIP BIND MINT HPRD PubGene Predicted Protein Interactions Analysis of Gene Networks 9606 is the Taxonomy ID for Homo Sapiens Predicted Human Protein Interactions Predicted Human Protein Interactions Used high-throughput protein interaction experiments from fly, worm, and yeast to predict human protein interactions. Human protein interaction is predicted if both proteins in an interaction pair from other organism have high sequence homology to human proteins. >70K Hs interactions predicted >6K Hs genes Analysis of Gene Networks Thanks to … Rafael Irizarry Scott Zeger Jonathan Pevsner Carlo Colantuoni Clinical Brain Disorders Branch, NIMH, NIH Dept. Biostatistics, JHSPH ccolantu@jhsph.edu colantuc@intra.nimh.nih.gov NCBI Web Links http://www.ncbi.nlm.nih.gov http://www.ncbi.nlm.nih.gov/Entrez/ http://www.ncbi.nih.gov/Genbank/ http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Nucleotide http://www.ncbi.nlm.nih.gov/dbEST/ http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Protein http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=gene http://www.ncbi.nlm.nih.gov/LocusLink/ http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=unigene http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=homologene http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=OMIM http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=PubMed http://www.ncbi.nlm.nih.gov/PubMed/ http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=cdd http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd.shtml http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=snp http://www.ncbi.nlm.nih.gov/SNP/ http://eutils.ncbi.nlm.nih.gov/entrez/query/static/advancedentrez.html http://www.ncbi.nlm.nih.gov/geo/ http://www.ncbi.nlm.nih.gov/RefSeq/ FTP: ftp://ftp.ncbi.nlm.nih.gov/ ftp://ftp.ncbi.nlm.nih.gov/repository/UniGene ftp://ftp.ncbi.nih.gov/pub/HomoloGene/ NUCLEOTIDE: PROTEIN: http://genome.ucsc.edu/ http://us.expasy.org/ ftp://us.expasy.org/ http://www.embl-heidelberg.de/ http://www.ensembl.org/ http://www.sanger.ac.uk/Software/Pfam/ http://www.sanger.ac.uk/Software/Pfam/ftp.shtml http://www.ebi.ac.uk/ http://www.gdb.org/ http://smart.embl-heidelberg.de/ http://bioinfo.weizmann.ac.il/cards/index.html http://www.ebi.ac.uk/interpro/ http://www.gene.ucl.ac.uk/cgi-bin/nomenclature/searchgenes.pl http://us.expasy.org/prosite/ PATHWAYS and NETWORKS: ftp://us.expasy.org/databases/prosite/ http://www.genome.ad.jp/kegg/ ftp://ftp.genome.ad.jp/pub/kegg/ (http://www.genome.ad.jp/anonftp/) http://dip.doe-mbi.ucla.edu More Web Links http://dip.doe-mbi.ucla.edu/dip/Download.cgi http://www.blueprint.org/bind/ http://www.bioconductor.org/ http://www.blueprint.org/bind/bind_downloads.html http://apps1.niaid.nih.gov/david/ http://www.geneontology.org/ http://160.80.34.4/mint/index.php http://discover.nci.nih.gov/gominer/index.jsp http://pubmatrix.grc.nia.nih.gov/ http://160.80.34.4/mint/release/main.php http://pevsnerlab.kennedykrieger.org/dragon.htm http://www.hprd.org/ http://www.hprd.org/FAQ?selectedtab=DOWNLOAD+REQUESTS http://www.pubgene.org/ (also .com) SAVAGE: Detection of More Subtle Functionally Related Groups of Gene Expression Changes Differential Expression of Functional Gene Groups within One Experiment 0 . 3 Swiss-Prot DRAGON EXP#1 SAVAGE 0 . 2 0 . 1 PFAM 0 . 0 4 0 . 3 2 0 2 4 2 0 2 4 2 0 2 4 30K 10K 0 . 2 0 . 1 KEGG 0 . 0 4 0 . 3 0 . 2 0 . 1 0 . 0 4 ~40K annotations ~3K Differential Expression of a Single Functional Gene Group Across Multiple Experiments BioDB EXP#1 EXP#2 EXP#3 EXP#4 DRAGON SAVAGE 0 . 4 0 . 4 0 . 4 0 . 4 0 . 3 0 . 3 0 . 3 0 . 3 0 . 2 0 . 2 0 . 2 0 . 2 0 . 1 0 . 1 0 . 1 0 . 1 0 . 0 4 2 0 2 4 0 . 0 4 2 0 2 4 0 . 0 4 2 0 2 4 0 . 0 4 2 0 2 4 Similar Differential Expression Patterns Across Multiple Experiments 0 . 3 3 X 1 0.0 EXP#1 0 . 3 0 . 2 2 5 CN EXP#1 p value <0.1 0 . 3 0 . 2 0 . 2 0 . 1 0 . 1 4 CN 0 . 0 4 0 . 3 0 . 1 2 0 2 0 . 0 4 4 0 . 3 2 0 2 4 2 4 4 CN 3 CN 0 . 2 CN 0 . 0 4 0 . 2 2 0 . 1 5 EXP#2 EXP#2 ALL 0 2 4 0 . 1 X 0 . 0 4 1 2 The distribution of gene expression values for each gene group in each sample is plotted as a single point in low dimensional space. This is achieved using Principal Components Analysis along with Non-Metric Multi-Dimensional Scaling. 2 0 2 0 . 0 4 4 2 0 PING: Detection of Differential Expression in Functional Networks of Proteins Interaction Networks in Gene Expression Data Large Protein Interaction Network Network Regulated in Sample #1 Large Protein Interaction Network Network Regulated in Sample #2 Network Regulated in Sample #1 Large Protein Interaction Network Network Regulated in Sample #3 Network Regulated in Sample #1 Network Regulated in Sample #2 Large Protein Interaction Network Network of Interest PING Network Regulated in Sample #1 Network Regulated in Sample #2 Network Regulated in Sample #3 NT's in GenBank (millions) 100000 10000 1000 100 10 1 1984 1994 2004 Genomic DNA Content 1. 2. 3. 4. 5. 6. Interspersed repeats (~1/2 Hs. genome) (Processed) pseudogenes prokaryotes eukaryotes Simple sequence repeats Segmental duplications (~5% Hs. genome) Blocks of tandem repeats (can be very large) Genes: Promoters - Exons – Introns <3% defining what a gene is - protein coding unit of genomic DNA with an mRNA intermediate identifying genes within genomic DNA protein-coding genes (mRNA) functional RNA genes - tRNA, rRNA, snoRNA, snRNA, miRNA Gene: Protein coding unit of genomic DNA with an mRNA intermediate. Protein START mRNA Genomic DNA 5’ UTR protein coding STOP AAAAA 3’ UTR 3.3 Gb Gene: Protein coding unit of genomic DNA with an mRNA intermediate. ~30K genes Sequence is a Necessity Protein START mRNA Genomic DNA 5’ UTR protein coding STOP AAAAA 3’ UTR 3.3 Gb How is a gene defined in “wet” biology and in silico? Seq. from Array mRNA probe sample design: Cross-referencing splicing - known of array and probes: not Seq. onAlt. array Source – cDNA libraries, oligos, clone collections GenBank <> UniGene <> HomoloGene Alt. start / stop site in same RNA molecule Content – UniGene, Celera, Incyte Possible Less important: mis-referencing: RNA editing, SNPs Transcript coverage Genomic GenBank Acc.#’s Referenced Homology ID to other has more transcripts NT’s than probe Old DB builds Hybridization dynamics – hyper-multiplex hyb rxn DB or table errors Empirical validation 3’ bias Finding genes in eukaryotic DNA ORF identification – Three Letter Genetic Code (codons) 4*4*4. It is possible to translate any stretch of genomic DNA into protein, but that doesn’t mean we have identified a protein coding gene! There are several kinds of exons: -- non-coding -- initial coding exons -- internal exons -- terminal exons -- some single-exon genes are intronless What We Are Going To Cover Cells, Genes, Transcripts –> Genomics Experiments Sequence Knowledge Behind Genomics Experiments Annotation of Genes in Genomics Experiments