NGS Bioinformatics Workshop
2.5 Meta-Analysis of Genomic Data
May 30th, 2012
IRMACS 10900
Facilitator: Richard Bruskiewich, Adjunct Professor, MBB
Acknowledgment: Several slides courtesy of Professor Fiona Brinkman, MBB

Today’s Agenda
• A brief overview of the bioinformatics for:
  – SNP detection software
  – Proteins
  – Systems biology
  – Metagenomics (some resources; very brief…)
• Group feedback: bioinformatics needs at SFU?

NGS-based SNP Analysis Programs
From: Nielsen et al. 2011. Nature Reviews Genetics 12:443-451

BIOINFORMATICS OF PROTEINS

From DNA to Protein to Systems
ATGGAATTC…

Amino Acid Properties – Venn Diagram

Polypeptides
[Figure: polypeptide backbone with side chains R1 to R4]

Ramachandran Plot

Secondary Structure (SS) Prediction
• Note the major assumptions in all methods:
  – The entire information for forming secondary structure is contained in the primary sequence
  – Side groups of residues will determine structure
• Pattern recognition
  – Looks for patterns in common secondary structures such as amphipathic alpha-helices (e.g. a pattern of polar and non-polar residues)
• Homology
  – Predict the secondary structure of the central residue of a given segment from homologous segments (neighbors)
  – Based on alignments of homologous residues from a protein family
  – Assumption: homologous proteins = similar structure
  – Extension: use BLOSUM to detect similarity or, better, use a Position Specific Scoring Matrix (PSSM)

SS Prediction Programs
• PredictProtein-PHD (72%)
  – http://www.predictprotein.org/
• PREDATOR (75%)
  – http://www-db.embl-heidelberg.de/jss/servlet/de.embl.bk.wwwTools.GroupLeftEMBL/argos/predator/predator_info.html
• PSIpred (77%)
  – http://bioinf.cs.ucl.ac.uk/psipred/
  – PSSM generated by PSI-BLAST, better sequence database, won the CASP competition for many years
• Jpred (81%)
  – http://www.compbio.dundee.ac.uk/jpred/

Tertiary Structure
• Lactate Dehydrogenase: mixed α/β
• Immunoglobulin Fold: β
• Hemoglobin B Chain: α

Tertiary Structure: Protein Folds
Holm, L. and Sander, C.
(1996) Mapping the protein universe. Science, 273, 595-603.

Protein Folds
• Folds: definition is difficult, and different criteria are used by different classification systems
  – Normally formed around a separate hydrophobic core
• Current protein fold taxonomy (very roughly):
  – Approx. 1000-2000 different folds are estimated, depending on the method of analysis
  – About half of these are estimated to be known (500-1000)
  – Average domain size approx. 150 aa (50 – 250 aa approx std dev)

Protein Fold Major Classes
• All alpha proteins (all α)
• All beta proteins (all β)
• Alpha/beta proteins (α/β)
  – Parallel strands connected by helices (βαβ motifs)
• Alpha plus beta proteins (α+β)
  – More irregular α and β combinations
• “Other”
  – Often subclassified now

Protein Fold Classification
• Curated/Semi-Manual Classification
  – SCOP (Structural Classification Of Proteins) http://scop.mrc-lmb.cam.ac.uk/scop/
  – CATH (Class, Architecture, Topology, Homologous superfamily) http://www.cathdb.info/

SCOP classification
• Family: clear evolutionary relationship
  – Residue identities >= 30%, OR known similar functions and structures (example: globins form a family though some are only 15% identical)
• Superfamily: probable common evolutionary origin
  – Low sequence identities, but structural and functional features suggest a common evolutionary origin (example: actin, the ATPase domain of heat shock proteins, and hexokinase form a superfamily)
• Fold: major structural similarity
  – Same major secondary structures in the same arrangement with the same topological connections
  – May occur by convergent evolution

SCOP example

CATH example

Protein Fold Classification
• Automated Classification
  – DALI http://ekhidna.biocenter.helsinki.fi/dali
  – VAST (Vector Alignment Search Tool) http://www.ncbi.nlm.nih.gov/Structure/VAST/vast.shtml

DALI/FSSP – Automated classification
• Exhaustive all-against-all 3D structure comparison of the protein structures currently in the PDB
• Domain Classification # (DC_l_m_n_p)
  – l: fold space attractor region
  – m: globular folding topology/fold type (clusters of structural neighbours in fold space with average pairwise Z-scores, by Dali, above 2)
  – n: functional family (PSI-BLAST, clusters of identically conserved functional residues, E.C. numbers, Swissprot keywords)
  – p: sequence family (>25% identities)

VAST – Automated classification
http://www.ncbi.nlm.nih.gov/Structure/VAST/vasthelp.html
• All-against-all BLAST comparison of NCBI’s MMDB (database of known protein structures at NCBI, derived from the PDB)
• Clustered into groups by a neighbor joining procedure, using BLAST p-value cutoffs of C or less (where C=10e-7, 10e-40 or 10e-80, to reflect three different levels of redundancy)
• A fourth level of classification is based on sequence identity

Motif and Domain Searching
• InterPro – an integration of tools (PROSITE, PFAM, PRINTS, PRODOM)
  – http://www.ebi.ac.uk/interpro/
• Expasy Tools has more…
  – PATTINPROT, to search for patterns in proteins yourself, etc…

But first…
Check if the analysis you want to do has already been done!
e.g. www.ebi.ac.uk/proteome/
     db.psort.org

Phylofacts
http://phylogenomics.berkeley.edu/phylofacts/
PhyloFacts includes hidden Markov models for classification of user-submitted protein sequences to protein families across the Tree of Life.
Subcellular Localization Prediction
– Example of the benefit of integrating results with a Bayesian approach

Localization Prediction - methods
• Several programs analyze single features:
  – TargetP
• Initially one program analyzed multiple features:
  – PSORT I (eukaryotes and prokaryotes)
  – Developed in 1990

PSORT I prediction method: Rule based
Nakai & Kanehisa, Proteins: Structure, Function, Genetics (1991)

Compositional Analysis
• Molecular Weight
• Amino Acid Frequency
• Isoelectric Point
• UV Absorptivity
• Solubility, Size, Shape

SYSTEMS BIOLOGY

Systems Biology
What is systems biology?
① Considers all (or many) of the proteins and genes in the system
② Links proteins and genes using interactions and functions
③ Uses computational models to study the system
④ Provides insights into mechanisms, system dynamics, global properties

Molecular Interaction (MI) Network
• Nodes = Gene / Protein
• Edge = Interaction
• Possible interactions:
  – phosphorylation
  – physical binding
  – transcriptional regulation
  – others?

Cytoscape
Cytoscape supports many use cases in molecular and systems biology, genomics, and proteomics:
• Load molecular and genetic interaction data sets in many formats
• Project and integrate global datasets and functional annotations
• Establish powerful visual mappings across these data
• Perform advanced analysis and modeling using Cytoscape plugins
• Visualize and analyze human-curated pathway datasets such as Reactome or KEGG
http://www.cytoscape.org/

Cytoscape
[Screenshot labels]
• Control tabs: Network, VizMapper, plugin tabs
• Search for nodes
• Visible networks
• Network navigation
• Change visible attributes
• Attributes for highlighted nodes / edges

Cytoscape – Loading Data
Data Files:
1. Network (Simple Interaction Format)
2. Node attributes (tab-delimited)
3. Gene expression (tab-delimited)

Cytoscape – Loading Data 1.
Network (Simple Interaction Format)
• Format: gene1 interaction_type gene2
• E.g.:
    C1QB  pp  C1R
    C1R   pp  C2
    C2    pp  C4
    …

Cytoscape – Loading Data 2. Gene Attribute (tab-delimited table)
• Maps data values to nodes
• Steps:
  – Load File
  – Check “Show Text File Import Options”
  – Check “Transfer first line as attribute names..”
  – Preview

Cytoscape – Loading Data 3. Gene expression (tab-delimited table)
• Format: gene1 exp_cond1 exp_cond2 … sig_cond1 sig_cond2 …
• Expression value: fold-change or intensity from microarray
• Significance value: P-value for the difference in expression between conditions (a smaller value means stronger evidence that expression differs)

Cytoscape – Network Style
In the “VizMapper” tab…
• Double-click “Node color”
• Select expression fold-change values (CMexp)
• Select “Continuous Mapping” as the mapping type
• Colors can be changed by double-clicking on the arrows

Systems Biology Analyses
1. Differentially-expressed subnetworks
   • jActiveModules
2. Functional enrichment
   • BiNGO

Differentially-Expressed Subnetworks
• Search for sub-networks that contain a significant number of differentially-expressed genes (nodes)
• All genes in a sub-network interact…
• SO these highly differentially-expressed sub-networks may represent a critical pathway or complex involved in a condition of interest

Differentially-Expressed Subnetworks
jActive algorithm:
• Searches for sub-networks that contain a significant number of differentially-expressed genes (or nodes)
• Heuristic – won’t always find the optimum result
• The Z-score signifies how likely it is to find a subnetwork with a similar number of DE genes

jActive - Inputs
• Select expression significance (p-values)
• Search from highlighted nodes

jActive - Results
• Subnetworks are listed here
• Highlight a result and click “Create Network”

Functional Enrichment
Functional Enrichment:
• Also called over-representation analysis
• Searches for common or related functions in a gene set
• Is there a common annotation (e.g. pathway, GO term) for a set of genes that is more frequent than you would expect by chance?
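
A minimal sketch of the "more frequent than you would expect by chance?" question, assuming a hypergeometric test (the test these slides name for BiNGO). The function name and the toy gene counts are illustrative assumptions, not values from the workshop data, and no multiple hypothesis correction is applied here.

```python
from math import comb


def enrichment_p_value(k, n, K, N):
    """Upper-tail hypergeometric probability P(X >= k).

    N: genes in the reference set (hypothetical universe)
    K: reference genes that carry the annotation
    n: genes in the selected set
    k: selected genes that carry the annotation
    """
    total = comb(N, n)  # all ways to draw n genes from N
    max_hits = min(n, K)  # cannot see more annotated genes than exist or were drawn
    favourable = sum(
        comb(K, i) * comb(N - K, n - i) for i in range(k, max_hits + 1)
    )
    return favourable / total


# Toy numbers (assumed for illustration): 10 genes in total, 3 annotated,
# 3 selected, and all 3 selected genes carry the annotation.
p = enrichment_p_value(k=3, n=3, K=3, N=10)
print(p)
```

A small p-value suggests the annotation occurs in the gene set more often than random sampling would explain. When many annotations are tested at once, the p-values would still need a multiple testing correction.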
Gene Ontology
• Controlled vocabulary describing functions, processes and cell components
• Consistency between organisms and gene products
• GO terms are linked by relationships (is-a, part-of) and have a hierarchy (parent – child)
[Figure: example GO hierarchy with the terms protein complex, organelle, mitochondrion, fatty acid beta-oxidation multienzyme complex, [other protein complexes] and [other organelles], linked by is-a and part-of relationships]

Functional Enrichment
BiNGO:
• Looks for GO terms that are over-represented in a set of genes
• Displays the results in two ways:
  – A table with p-values
  – A graph showing relationships between terms
• Uses the hypergeometric test to statistically test for over-representation of each GO term
• Performs multiple hypothesis correction (since we are testing multiple GO terms for over-representation)

BiNGO - Inputs
• Fill in Name
• Lower the significance level
• Select “Custom” and then load the go.annot file
• Click Start BiNGO

BiNGO - Results
[Screenshot labels]
• General GO Terms
• Significance
• Specific GO Terms

EGAN: Exploratory Gene Association Networks
http://akt.ucsf.edu/EGAN/

METAGENOMICS

What is Metagenomics?
• The culture-independent isolation and characterization of DNA from uncultured microorganism communities
• Nice reading list on the topic: http://www.cbcb.umd.edu/confcour/CMSC828Gmaterials/reading-list.html
• See also: Torsten Thomas, Jack Gilbert and Folker Meyer. 2012. Metagenomics - a guide from sampling to data analysis. Microb. Inform. Exp. doi:10.1186/2042-5783-2-3 http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3351745/
• I will just mention a few relevant bioinformatics tools here (no specific endorsements implied).

MG-RAST server
http://metagenomics.nmpdr.org/
Meyer, F. et al. 2008. The metagenomics RAST server – a public resource for the automatic phylogenetic and functional analysis of metagenomes. BMC Bioinformatics. 9:386 doi:10.1186/1471-2105-9-386

MEGAN - MEtaGenome ANalyzer
http://ab.inf.uni-tuebingen.de/software/megan/
Huson DH et al.
2007. MEGAN analysis of metagenomic data. Genome Res. 17: 377-386
