Dr G P S Raghava Institute of Microbial Technology, Chandigarh, India Bioinformatics Informatics | Drug Informatics | Chemoinformatics | Vaccine Email: raghava@imtech.res.in http://crdd.osdd.net/ http://www.imtech.res.in/raghava/ Studying Central Dogma with “Omis” Transcriptomics Proteomics Protein Mature mRNA mRNA Genomics DNA Metabolomics (all the endogenous metabolites produced) Complexity of the system NEXT GENERATION SEQUENCING • Sequence full genome of an organism in a few days at a very low cost. • Produce high throughput data in form of short reads. Illumina ABI’s Solid Roche’s 454 FLX Ion torrent Genome assembly and annotation done at IMTECH • Burkholderia sp. SJ98 (Kumar et al. 2012). • Debaryomyces hansenii MTCC 234 (Kumar et al. 2012). • Imtechella halotolerans K1T (Kumar et al. 2012). • Marinilabilia salmonicolor JCM 21150T (Kumar et al. 2012). • Rhodococcus imtechensis sp. RKJ300 (Vikram et al. 2012). • Rhodosporidium toruloides MTCC 457 (Kumar et al. 2012). RNA sequencing RNA-Seq is a recently developed approach to transcriptome profiling that uses deepsequencing technologies to measure levels of transcripts and their isoforms. Samples of interest Condition (colon tumor) Isolate RNAs Generate cDNA, fragment, size select, add linkers Sequence ends Map to genome, transcriptome, and predicted exon junctions Downstream analysis 100s of millions of paired reads 10s of billions bases of sequence Methods for Protein Analysis Gel electrophoresis, northern/western blot (fluorescence/radio active label) X-ray crystallography Protein microarrays Mass Spectrometry Protein arrays High throughput analysis of hundreds of thousands of proteins. Proteins are immobilized on glass chip. Various probes (protein, lipids, DNA, peptides, etc) are used. Cons Require a priori knowledge of the proteins of interest. Availability of suitable antibodies. Measure only a small fraction of the proteome Mass Spectrometry Find a way to “charge” an atom or molecule (ionization). Place charged atom or molecule in a magnetic field or electric field and measure its speed or radius of curvature relative to its mass-to-charge ratio (mass analyzer). Detect ions using microchannel plate or photomultiplier tube(Detection). Sample Ion source: makes ions Mass analyzer: separates ions Mass spectrum Presents information Protein Identification by MS Spot removed Library from gel Theoretical spectra built Fragmented using trypsin Spectrum of fragments generated MATCH Experimental spectra Artificially trypsinated Database of sequences (i.e. SwissProt) Instrumentation High Vacuum System Inlet • • • HPLC Flow injection Sample plate • • Ion Source Mass Analyzer MALDI ESI Time of flight (TOF) Quadrupole Ion Trap Magnetic Sector FTMS Detector • • • Data System Microchannel Plate Electron Multiplier Hybrid with photomultiplier Ion Sources make ions from sample molecules (Ions are easier to detect than neutral molecules.) MALDI Electrospray ionization Sample plate Sample Inlet Nozzle Pressure = 1 atm Inner tube diam. = 100 um(Lower Voltage) Partial vacuum Laser MH+ N2 Sample in solution N2 gas + + ++ ++ ++++ ++ + + ++ ++ + + + + + + + + + + + + + +++ +++ +++ +++ + ++ MH2+ MH+ MH3+ High voltage applied to metal sheath (~4 kV) Charged droplets Grid (0 V) +/- 20 kV Mass analyzers separate ions based on their mass-to-charge ratio (m/z) Operate under high vacuum (keeps ions from bumping into gas molecules) Actually measure mass-to-charge ratio of ions (m/z) Key specifications are resolution, mass measurement accuracy, and sensitivity. Several kinds exist: for bioanalysis, quadrupole, time-of-flight and ion traps are most used. Tandem Mass Spectrometry S#: 1707 RT: 54.44 AV: 1 NL: 2.41E7 F: + c Full ms [ 300.00 - 2000.00] RT: 0.01 - 80.02 100 90 80 1409 LC NL: 1.52E8 1991 1615 2149 1621 1411 2147 1387 60 1593 1995 1655 1435 50 1987 1445 1661 40 2001 2177 1937 1779 30 2155 2205 2135 2017 1095 80 75 70 65 60 55 801.0 50 45 40 35 638.9 25 2207 1105 85 30 1307 1313 20 MS 90 Relative Abundance 70 95 Base Peak F: + c Full ms [ 300.00 2000.00] 1611 Relative Abundance 638.0 100 1389 872.3 1275.3 15 1707 687.6 10 2331 10 1173.8 20 2329 944.7 783.3 1212.0 1048.3 1413.9 1617.7 1400 1600 1742.1 1884.5 5 0 200 400 600 800 1000 m/z 0 5 10 15 20 25 30 35 40 45 Time (min) MS-1 50 55 60 65 70 collision cell 75 1200 1800 2000 80 S#: 1708 RT: 54.47 AV: 1 NL: 5.27E6 T: + c d Full ms2 638.00 [ 165.00 - 1925.00] 850.3 100 95 MS-2 687.3 MS/MS 90 85 588.1 80 75 70 Ion Source Relative Abundance 65 60 55 851.4 425.0 50 45 949.4 40 326.0 35 524.9 30 25 Parent Ions Fragment Ions 20 589.2 226.9 1048.6 1049.6 397.1 489.1 15 10 629.0 5 0 200 400 600 800 1000 m/z 1200 1400 1600 1800 2000 What’s in a Mass Spectrum? Fragment Ions Derived from molecular ion or higher weight fragments “molecular ion” In molecular ions, adduct ions, [M+reagent gas]+ High mass Mass, as m/z. Z is the charge, and for doubly charged ions (often seen in macromolecules), masses show up at half their proper value Peptide Fragmentation H...-HN-CH-CO-NH-CH-CO-NH-CH-CO-…OH Ri-1 Ri Ri+1 N-terminus C-terminus AA residuei-1 AA residuei AA residuei+1 Collision Induced Dissociation H+ H...-HN-CH-CO . . . NH-CH-CO-NH-CH-CO-…OH Ri-1 Prefix Fragment Ri Ri+1 Suffix Fragment • Peptides tend to fragment along the backbone. • Fragments can also loose neutral chemical groups like NH3 and H2O. B ions and Y ions http://www.molgen.mpg.de/101151/Proteomics Mass spectra searching techniques Commercial Software SEQUEST (Yates et al., 1995) MASCOT (Perkins, Pappin, Creasy, & Cottrell, 1999) Open Database search tools Myrimatch X!Tandem MSGF OMSSA More accurate than Mascot and sequest (Kim & Pevzner, 2014) Target-Decoy Search Strategy for Mass Spectrometry-Based Proteomics • incorrect “decoy” sequences added to the search space will correspond with incorrect search results that might otherwise be deemed to be correct. PSM in target and decoy 450 400 350 PSM matches 300 250 200 Series1, 150 150 100 50 Mass spectrum 0 target Target and reversed Decoy database decoy Proportion of matches in decoy database represent false matches Applications • Analyzing Protein Modifications • Finding all modifications on a single protein • Proteome wide scanning of modifications • Protein Profiling • Generate large scale proteome maps • Annotate and correct genomic sequences • Analyze protein expression as a function of cellular state • Detection of amino acid substitutions • Protein sample identification/confirmation • Protein sample purity determination Major Proteomics Repositories Repositories PeptideAtlas PRIDE RAW DATA Result' Raw'Data' Metadata' Proteome Central PASSEL Other DBs result' raw' JOURNALS Other DBs GPMDB UniProt NextProt Major Challenge iden fied uniden fied Large number of unidentified spectra May be peptides are missing in the database searched….. Are all the reference databases complete ??? Proteogenomics • • • • Term coined in the literature in 2004. Genomic for generating customized databases. Identify novel peptides. Disease biomarkes based on novel mutation intensity Proteomics Non-Tumor Sample Searched against intensity mass/charge Tumor Sample Refseq Uniprot Aberrant proteins Variant proteins Fusion proteins mass/charge NonTumor Sample intensity Proteogenomics Genome sequencing Germline variants Tumor Sample intensity mass/charge RNA-Seq mass/charge Alternative splicing, somatic variants, expression Tumor Specific Protein DB TYPES OF PEPTIDES IDENTIFIED IN PROTEOGENOMICS Novel Peptides mapping on Non Coding Region 5’UTR Novel Peptides mapping on Non Coding Region 3’UTR Novel Junctions Intergenic peptides Alternating Reading Frame Methods of generation of customized databases 6 Frame Translation of Reference Database Ab initio gene prediction. RNA-seq data Whole genome Sequencing Other Databases • Perl or python scripts • Perl and python scripts • CustomProdb, Galaxy-P system, sapFinder • Peppy • OMIM, neXtProt, Ecgene, ChimerDB, COSMIC What does proteogenomics offer? Proteomics Genomics Transcriptomics Novel N termini Novel signal peptide Novel Exons Novel Junctions Variant peptides Novel ORFs Novel signal peptide cleavage sites Open source Web Services Chemoinformatics and Pharmacoinformatics Molecular Interactions Biological Databases Genome Annotations Immunoinformatics or Vaccine Informatics Functional Annotation of Proteins Proteins Structure Prediction GPSR: A Resource for Genomics Proteomics and Systems Biology • A journey from simple computer programs to drug/vaccine informatics • Limitations of existing web services • History repeats (Web to Standalone) • Graphics vs command mode • General purpose programs • Small programs as building unit • Integration of methods in GPSR Types of Prediction Methods TP Senstivity= ´100 TP+FN TN Specifity= ´100 TN+FP Accuracy= TP + TN ´100 TP + TN+FP + FN (TP ´ TN) - (FP ´ FN) MCC= ´100 (TP+FP)(TP+FN)(TN+FP)(TN+FN) Live Server Live CD Pkg Repository Installation Webserver Standalone Galaxy platform All in ONE Osddlinux desktop is ready for use. Password for sudo : osddlinux root : osddlinux Osddlinux installation on system hard drive Operating System for Drug Discovery