Working with gene lists: Finding data using GEO & BioMart
June 5, 2014

Analyzing a gene list
- With hundreds of genes but a limited budget and lab personnel, you need to narrow the gene list down to candidate genes for follow-up
- Pick ones that are “interesting”:
  - Known to be involved in other related processes but not (yet) in your process of interest
  - Has protein features that suggest a function in your process, but has not been characterized
  - No known function or domain, but shows up in other, related high-throughput experiments, suggesting a key role in your process of interest

Our approach
Analyzing gene lists by:
1. Finding overlap with other high-throughput experiments
2. Finding additional information using BioMart
   1. Mouse/human homologs
   2. Protein domain content
   3. GO classification

GEO (Gene Expression Omnibus)
- GEO Datasets
  - Curated gene expression datasets, i.e. there is a backlog of experiments that haven’t made it into the database
  - Can search for experiments and conduct differential gene expression queries on some datasets
  - Can download datasets & do offline analyses
- GEO Profiles
  - Profiles of expression data for genes

Why search GEO?
- What other experiments have been done that are similar to yours? (GEO Datasets)
- How do my genes of interest behave in other large-scale experiments? (GEO Profiles)

GEO Profile search
- Search on a gene name (C04F5.7)

GEO Dataset search
- “C. elegans”: 4434

GEO Dataset searches

Query                          Total datasets   C. elegans datasets
C. elegans                     4434             4072
C. elegans AND response        131              121
C. elegans AND host response   5                5
C. elegans AND immune          24               20
C. elegans AND antimicrobial   109              94

Once dataset identified
- Download data
  - SOFT format: tab-delimited data
- Issues:
  - Data are not necessarily processed into experiment/control ratios
  - If starting with raw data, you may not be able to replicate exactly what the authors did, or may lack the expertise/software to generate a list of differentially expressed (DE) genes
- Look for supplementary data from the publication
  - Usually they provide a list of all DE genes

Choice of dataset for comparison
- In-class demo

BioMart: EBI Ensembl
- Use series of menus
  - Data source: organism (genes, variation, etc.)
  - Filters: reduce the number of results
  - Attributes: what data to return
- Can set up very precise and multilayered queries
- Can query across multiple organisms
- Simple query: given a list of gene IDs, you can obtain attributes or sequences for the entire list
- Tools
  - ID converter: very useful, easy to use

Two sites for BioMart access
- www.biomart.org
- Database journal issue on BioMart

Filtering in BioMart

Attributes in BioMart

BioMart
- Filters
  - C. elegans genes with a human homolog
  - Specify only genes with >= # isoforms
  - Protein-coding genes with a transmembrane domain
- Attributes
  - Entrez Gene IDs, WormBase IDs, Affy IDs
  - Sequence data: transcript, protein, UTRs, flanking regions, etc.

BioMart
- In-class demo

Today’s exercise
- Compare current dataset from PLoS Pathogens paper to data from a different dataset
- Identify & retrieve additional information about C. elegans genes using BioMart
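
Comparing two datasets, as in today's exercise, largely comes down to intersecting two gene lists. The following is a minimal Python sketch of that step, not the procedure used in class. It assumes both lists already use the same ID type (for example after running one of them through the BioMart ID converter). The function name `overlap` and every gene ID except C04F5.7 are hypothetical placeholders.

```python
def overlap(list_a, list_b):
    """Return (shared, only_a, only_b) as sorted lists of gene IDs.

    Whitespace around each ID is stripped and empty entries are dropped.
    Matching is exact and case-sensitive, so both lists must use the
    same ID type (an assumption of this sketch).
    """
    a = {g.strip() for g in list_a if g.strip()}
    b = {g.strip() for g in list_b if g.strip()}
    return sorted(a & b), sorted(a - b), sorted(b - a)


# Hypothetical input: GENE_A, GENE_B and GENE_C are placeholder IDs.
my_genes = ["C04F5.7", "GENE_A", "GENE_B"]
other_genes = ["C04F5.7 ", "GENE_B", "GENE_C", ""]

shared, only_mine, only_other = overlap(my_genes, other_genes)
print("In both datasets:", shared)
print("Only in my list:", only_mine)
print("Only in the other dataset:", only_other)
```

Sets are used so that duplicate IDs within a list count once. The genes in `shared` are the ones that show up in both experiments and are natural candidates to look up in BioMart next.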