Modeling Functional Genomics Datasets
CVM8890-101, Lesson 1
13 June 2007
Bindu Nanduri

Lesson 1: Data to Biological Sense
- What we are trying to achieve
- Introduction to functional genomics modeling strategies

Transcriptomics and Proteomics

Why study gene expression changes?
- Transcription is the predominant form of regulation

Northern Blots
Mol Vis. 1996 Nov 4;2:11

Microarrays
- Basic concept: a reverse Northern blot on a large scale
- High throughput: control and experimental samples are hybridized simultaneously using distinct fluorescent dyes
- Many assays can be carried out in parallel

Affymetrix oligo array design
- Probes usually come from the most 3´ area of the transcript, often the UTR
- Figure: 25-mer probes (11 to 16) placed toward the poly(A) end (AAAA..)
http://www.affymetrix.com

Genomic Tiling Array Design
- Figure: multiple probes tiled along the genome sequence (5´ to 3´)
- Center-to-center resolution: 38 bp
ISB Systems Biology Course 2006

Is mRNA level = protein level? Is there a correlation?
- Comparison of protein levels (MS, 2D gels) and RNA levels (SAGE) for 156 genes in yeast
- mRNA levels unchanged, but protein levels varied by up to 20X
- Protein levels unchanged, but mRNA levels varied by up to 30X
- Highly expressed mRNAs correlate well with protein levels
Gygi et al. (1999) Mol. Cell. Biol.
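
The question on the slide above (do mRNA and protein levels move together?) is often quantified with a rank correlation. The sketch below is a minimal pure-Python Spearman correlation, offered only as an illustration. The abundance values are made up and are not taken from Gygi et al.; the function names are hypothetical.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical abundances for five genes (illustration only, not real data).
mrna = [1.0, 2.0, 3.0, 50.0, 200.0]
protein = [30.0, 5.0, 12.0, 400.0, 900.0]
rho = spearman(mrna, protein)
print(f"Spearman rho = {rho:.2f}")
```

Ranks are used instead of raw values because expression measurements span several orders of magnitude, so a few highly expressed genes would otherwise dominate the coefficient.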

Expressed Sequence Tags
- ESTs are pieces of DNA sequence (usually 200 to 500 nt) generated by sequencing one or both ends of an expressed gene
- They are bits of DNA that represent genes expressed in certain cells, tissues, or organs from different organisms
- They can be useful "tags" to fish a gene out of a portion of chromosomal DNA by matching base pairs
http://www.ncbi.nlm.nih.gov/About/primer/est.html

EST Sequence Clustering
- A gene can be expressed as mRNA many, many times, so ESTs derived from this mRNA may be redundant: many identical, or similar, copies of the same EST
- Because of this redundancy and overlap, someone who searches dbEST for a particular EST may retrieve a long list of tags, many of which may represent the same gene
- The UniGene database automatically partitions GenBank sequences into a non-redundant set of gene-oriented clusters
http://www.ncbi.nlm.nih.gov/About/primer/est.html

- ESTs: EST mapping to the genome, annotation, differential expression
- Transcriptome: clustering, differential expression analysis
- Proteome: differential expression analysis

Multiple data analysis platforms
- Proteomics
- Transcriptomics
- EST analysis
LIST of elements

#   | Reference | Accession | Peptides (Hits) | Sc
#1  | ALBU_CHICK Serum albumin precursor (Alpha-livetin) (Allergen Gal d 5) | 113575 | 255 (244 6 1 3 1
#2  | APA1_CHICK Apolipoprotein A-I precursor (Apo-AI) | 113990 | 94 (84 4 4 2 0)
#3  | FIBA_CHICK Fibrinogen alpha/alpha-E chain precursor [Contains: Fibrinopeptide | 1706798 | 40 (37 1 1 0 1)
#4  | Mol_id: 1; Molecule: Ovotransferrin; Chain: Null; Synonym: Conalbumin; Hetero | 1127086 | 34 (30 3 0 1 0)
#5  | PB2 protein [Influenza A virus (A/chicken/Taiwan/7-5/99(H6N1))] | 9954387 | 28 (0 22 3 2 1)
#6  | C Chain C, Crystal Structure Of Native Chicken Fibrinogen | 8569623 | 15 (15 0 0 0 0)
#7  | I50711 complement C3 precursor - chicken | 2118406 | 11 (9 2 0 0 0)
#8  | TTHY_CHICK Transthyretin precursor (Prealbumin) (TBPA) | 136463 | 10 (10 0 0 0 0)
#9  | TIM2_CHICK Metalloproteinase inhibitor 2 precursor (TIMP-2) (Tissue inhibitor | 3122960 | 13 (0 11 1 0 1)
#10 | AAA6469 | 3645997 | 10 (9 0 1 0 0)
#11 | MYH9_CHICK Myosin heavy chain, nonmuscle (Cellular myosin heavy chain) (NMMHC) | 127759 | 14 (5 0 5 3 1)
#12 | S19188 myosin-V - chicken | 104779 | 13 (4 3 3 2 1)
#13 | FIBB_CHICK Fibrinogen beta chain precursor [Contains: Fibrinopeptide B] | 399491 | 9 (8 1 0 0 0)
#14 | A Chain A, Crystal Structure Of Wild Type Turkey Delta 1 Crystallin (Eye Lens | 14278427 | 12 (0 2 10 0 0)
#15 | type I polyketide synthase AVES 2 [Streptomyces avermitilis MA-4680] | 29827480 | 8 (6 1 0 1 0)
#16 | Hyperion protein, 419 kD isoform [Gallus gallus] | 4582571 | 11 (0 5 3 3 0)
#17 | vitronectin [Gallus gallus] | 1922282 | 7 (7 0 0 0 0)
#18 | CA36_CHICK Collagen alpha 3(VI) chain precursor | 1345652 | 10 (3 2 3 1 1)
#19 | paired-type homeobox Atx [Gallus gallus] | 18252581 | 11 (0 4 5 1 1)
#20 | I51298 transforming protein sno-N - chicken | 2147397 | 7 (6 1 0 0 0)
#21 | TP2A_CHICK DNA topoisomerase II, alpha isozyme | 13959708 | 7 (5 2 0 0 0)
#22 | ITA6_CHICK Integrin alpha-6 precursor (VLA-6) | 124948 | 12 (0 5 1 4 2)
#23 | glucose regulated thiol oxidoreductase protein precursor [Gallus gallus] | 22651801 | 11 (1 3 4 0 3)
#24 | spectrin alpha chain [Gallus gallus] | 1334744 | 9 (3 2 1 2 1)
#25 | ATP-binding cassette transporter 1 [Gallus gallus] | 18028983 | 9 (2 3 2 1 1)
#26 | cone-type transducin alpha subunit [Gallus gallus] | 11066401 | 8 (1 6 0 1 0)
#27 | condensin complex subunit [Gallus gallus] | 26801168 | 12 (0 2 4 4 2)
#28 | BA2B_CHICK Bromodomain adjacent to zinc finger domain 2B (Extracellular matri | 22653663 | 8 (3 1 3 1 0)
#29 | ryanodine receptor type 3 [Gallus gallus] | 1212912 | 9 (0 5 2 1 1)
#30 | type I polyketide synthase AVES 4 [Streptomyces avermitilis MA-4680] | 29827484 | 7 (2 4 1 0 0)
#31 | structural muscle protein titin [Gallus gallus] | 7024535 | 9 (0 5 2 0 2)
#32 | breast cancer susceptibility protein [Gallus gallus] | 19568157 | 7 (4 1 1 0 1)
#33 | FAS_CHICK Fatty acid synthase [Includes: EC 2.3.1.38; EC 2.3.1.39; EC 2.3.1.41 | 1345958 | 8 (1 4 1 1 1)

Modeling Function
Modeling function requires:
- knowing the components of the system (structural annotation)
- knowing what these components do and how they interact (functional annotation)
http://www.protonet.cs.huji.ac.il/ProToGO/Introduction.html

Where do you begin?

Specifics

Transcriptome Analysis: Clustering
- Similar expression patterns = similar regulation?
- Clustering algorithms help us identify patterns in complex data
- Key goal: identify co-regulated groups of genes
- Hierarchical clustering
- K-means clustering
- Self-organizing feature maps
- Principal component analysis

Proteomics
- Qualitative: total number of identified proteins, data intersections
- Quantitative: changes in protein expression
- Proteomic data analysis tools

Use GO for...
- Grouping gene products by biological function
- Determining which classes of gene products are over-represented or under-represented
- Focusing on particular biological pathways and functions (hypothesis-driven data interrogation)
- Relating a protein’s location to its function

Course Overview
Introduction to functional annotation.
Orthologs and homologs; clusters of orthologous genes (COGs) and the gene ontology (GO); and how to find what functional annotation is available.
Tools for functional annotation. Accessing functional data; computational strategies to obtain more complete functional annotation; the AgBase GO annotation pipeline.
Introduction to pathways analysis. Theory and strategies for pathway analysis modeling in different species, and tools for pathway analysis.
Functional genomics modeling: prokaryotic and eukaryotic examples.

Some Useful Links
http://www.genomesonline.org/ (comprehensive access to information regarding complete and ongoing genome projects around the world)
http://www.geneontology.org/ (provides a controlled vocabulary to describe gene and gene product attributes in any organism)
http://pir.georgetown.edu/ (integrated protein informatics resource for genomics and proteomics)
http://www.pir.uniprot.org/ (protein database)
http://mips.gsf.de/ (maintains a set of generic databases as well as the systematic comparative analysis of microbial, fungal, and plant genomes)
http://www.ncbi.nlm.nih.gov/ (comprehensive resource for public databases, literature and tools)
http://www.ebi.ac.uk/ensembl/ (system that maintains automatic annotation of large eukaryotic genomes)
http://expasy.org/ (expert protein analysis system)
http://www.biocyc.org/ (BioCyc is a collection of 260 Pathway/Genome Databases: metabolic pathways)
http://www.genome.jp/kegg/ ("biological systems" database integrating both molecular building block information and higher-level systemic information)
http://pfgrc.tigr.org/index.shtml (functional genomics studies on a variety of pathogens for which genomic sequence information is currently, or will soon be, available)
http://www.tigr.org/ (comprehensive resource for microbial genomics)
http://www.cs.ualberta.ca/~bioinfo/PA/ (high-throughput proteome annotations)
http://garnet.arabidopsis.org.uk/systems_biology_tools.htm (Arabidopsis resources)
http://www.systems-biology.org/002/ (systems biology portal)
http://www.ebi.ac.uk/biomodels/ (mathematical models of biological interest)
http://www.genmapp.org/current_databases.html (species-specific collections of genes and annotation)
http://bioinfo.bgu.ac.il/bsu/microarrays/links/ (microarray analysis resources)
http://david.abcc.ncifcrf.gov/ (Database for Annotation, Visualization and Integrated Discovery)
http://www.animalgenome.org/pigs/community/links.html (swine genetics community)
http://www.biocarta.com/FeaturedProducts/index.asp (pathways and tools for analysis)
http://www.genecards.org/index.shtml (database of human genes that includes automatically mined genomic, proteomic and transcriptomic information, as well as orthologies, disease relationships, SNPs, gene expression, gene function, and service links for ordering assays and antibodies)
http://www.proteomecommons.org/ (proteomics tools)
http://harvester.embl.de/
http://bioinformatics.org/ (open access institute)
http://www.ihop-net.org/UniPub/iHOP/ (a network of genes and proteins extends through the scientific literature)
http://www1.jcsg.org/psat/help/document.html (comparative analysis of protein sequence)
http://orthomcl.cbil.upenn.edu/cgi-bin/OrthoMclWeb.cgi (genome-scale algorithm for grouping orthologous protein sequences)
http://www.pathogenomics.ca/ortholuge/ (ortholog prediction program)
http://www.gene-regulation.com/pub/databases.html (transcription factor database)
http://www.reactome.org/ (curated knowledgebase of biological pathways)
http://www.biochemweb.org/systems.shtml (The Virtual Library of Biochemistry, Molecular Biology and Cell Biology)
http://genome-www.stanford.edu/ (Stanford genomic resources)
http://www.softberry.com/berry.phtml (collection of tools for annotation and analysis of sequences)
http://sosui.proteome.bio.tuat.ac.jp/sosuiframe0E.html (prediction of transmembrane domains in proteins)
http://www.psort.org/psortb/ (subcellular localization predictions)
http://www.ch.embnet.org/software/TMPRED_form.html (prediction of membrane-spanning regions and their orientation)
http://www.agbase.msstate.edu/ (functional analysis of agricultural plant and animal gene products)
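
One of the GO uses listed earlier in the lesson is determining which classes of gene products are over-represented in a dataset. A common way to score this is a one-sided hypergeometric test. The sketch below is a minimal pure-Python version under that assumption; it is not necessarily the method used by any of the tools linked above, the function name is hypothetical, and the counts in the example call are invented for illustration.

```python
from math import comb


def hypergeom_upper_tail(k, population, annotated, sample):
    """P(X >= k): chance of seeing at least k annotated genes in the sample.

    population: total number of genes in the background set
    annotated:  genes in the background carrying the GO term
    sample:     number of genes in the list of interest
    k:          genes in the list of interest carrying the GO term
    """
    upper = min(annotated, sample)
    total = comb(population, sample)
    favourable = sum(
        comb(annotated, i) * comb(population - annotated, sample - i)
        for i in range(k, upper + 1)
    )
    return favourable / total


# Hypothetical counts: 10000 background genes, 200 carry the term,
# 100 genes are differentially expressed, 8 of those carry the term.
p_value = hypergeom_upper_tail(8, 10000, 200, 100)
print(f"P(X >= 8) = {p_value:.3g}")
```

A small value suggests the term appears in the list more often than expected by chance. In practice many GO terms are tested at once, so the resulting p-values would also need a multiple-testing correction.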