Multivariate Data Analysis for Metabolomics Data generated by MS / NMR Spectroscopy Metabolomics Workshop Research Triangle Park / NC July 14th-15th H. Thiele, Bruker Daltonik, Bremen Why do Metabolic Profiling ? Clinical Diagnostics • find metabolic markers for disease progression (e.g. cancer) • diagnose inborn errors or other diseases • study of genetic differences Toxicology • markers for drug toxicity and drug efficacy • analyze time course of toxicological response Food Science • quality control / classification of origin • health/flavor enhancement of agrochemical products MS + NMR together: • Parallel Statistics >> more confidence • Hyphenation >> ultimate characterization tool Fundamental Issue in MVS: Dimension Reduction Bucketing • several bucketing techniques for optimum design of variables Dynamic Peak Bucketing Scheme a1 a2 a3 a4 a5 a1 a2 a3 a4 a5 b1 b2 b3 b 4 b5b6 a1 0 a2 a3 0 a4 a5 b1 b2 b3 0 b4 b5 b6 c1c2c3 c4c5 c6 0 0 a1 0 a2 a3 0 a4 a5 0 0 b1 b2 b3 0 b 4 b5 b6 c1 c2 c3 0 0 c4 c5 c6 0 • Spectra are bucketed one by one • Bucket table gets a new column whenever a new peak occurs • Spectra not having peaks at new positions get corresponding 0 Kernel Bucketing for LC-MS Data Bucketing Parameter e.g. m/z bucket width = 1, Kernel 0.3 Da Time bucket width = 60s, Kernel 10s LC-MS chromatograms of N samples Bucketing Intensity Intensity 100 100 Time m/z 1Da 100 0.3 Da 10s 100 30 100 Intensity Rt-m/z ... 1min - 1min - 1 min - pairs 535 m/z 536 m/z 537 m/z Sample X 500 50 5 Sample Y 0 25 0 Sample Z 500 50 5 ... 50 100 60s Time [s] Table of N samples Which Bucketing Technique to be used ? • Rectangular, equidistant bucketing standard, good compromise if no a priori knowledge • Variable sized bucketing makes shifts ineffective, allows selective usage • Point wise bucketing often used for broad line spectra as a special case of rectangular, equidistant bucketing • Dynamic peak bucketing allows very fine bucketing without getting huge tables, requires stable shifts or masses • Kernelized bucketing variant of rectangle bucketing to reduce effect of shifts Data Preprocessing : Spectral Background Subtraction • Measured Data are contaminated by solvents and chemical noise • Intensity of contaminants may dominate the relevant data Chemical Noise Solvent at m=75.2, Baseline and Scaled Noise Estimate Detection of Traces by dynamic grouping Data Preprocessing : Spectral Background Subtraction • Subtraction of spectral background makes relevant data visible Hidden traces of m=180.2 and m=208.2 in BPC Intens. x10 5 1.25 Intensity Base Peak Chromatogram (BPC) before and after Background Subtraction 1.00 0.75 0.50 0.25 0.00 12.0 12.5 13.0 13.5 Intens. x10 5 3 14.0 Intensity tim e 14.5 15.0 15.5 16.0 16.5 Time [min] 2 Visible traces of m=180.2 and m=208.2 in BPC 1 0 0 2 4 6 8 10 14 16 18 Time [min] Intens. x10 4 6 Intensity tim e 12 4 2 0 12.0 12.5 13.0 13.5 14.0 tim e 14.5 15.0 15.5 16.0 16.5 Time [min] Peak Picking Tasks • Find compounds defined by RT, m/z, z and area • Take together isotopic peaks and charge states Multivariate Statistics in Metabolomics Spectroscopy How to analyze large numbers of complex LC-MS chromatograms or NMR spectra with the target of simple discrimination or grouping? - healthy / non-healthy - high / low quality NMR or MS = sensor LC-MS chromatogram or NMR spectrum= fingerprint Use Pattern Recognition Techniques! Pattern Recognition (PR) Objectives of PR: • Statistical characterization • Model building • Classification Methods of PR: Exploratory Data Analysis • Statistical Tests • Principle Component Analysis (PCA) >> variance analysis Unsupervised Pattern Recognition • Cluster Analysis Supervised Pattern Recognition • Discriminant Analysis: LDA, PCA-DA • Classification of samples by various means e.g. Genetic Algorithm, SVM, … Idea of Principal Component Analysis (PCA) y y y x y x x x PC3 PC2 PC1 e.g. Principal Component Analysis Classification using PCA Input spectra list Bucketing coordinate transformation distance measures comparison to critical values classification model Coordinate Transformation PC2 ppm1 ppm2 PC1 Scores Loadings Baby Urine Samples : PCA - NMR mevalonic aciduria maple syrup disease PC 1,2 orotic aciduria 200 normal candidates Pattern in PC1/PC2 scores plot reveals candidates with inborn errors. New Born Screening by NMR > 400 baby urines, PCA, disease vectors indicating strength of metabolic disorders PC1/PC2 PC3/PC4 PC11/PC12 Bucket Analysis from 9 to 0.4ppm in 0.04ppm steps Excluded: 6 to 4.5 ppm residual water and urea Results of BEST-NMR at 600 MHz 1D-spectra Noesy presat 64 scans 6c 6 New Born Screening by NMR Distance from normals distribution is a measure for concentration of the molecule representing an inborn error Hippuric acid vector CH2group of hippuric acid 4c 4 PC2-Scores PCA : NMR vs. LC-MS PC1-Scores Fig. 1: Scores plot of NMR data from baby urines (born 2003). Fig. 2: Scores plot of LC-MS data from a subset of baby urines (born 2003). LC-MS data of Samples 114 and 94 Sample 114 BPC: -All MS 8 6 4 -MS, 5.5-6.8min 263.1037 2 264.1062 Intensity * 104 0 Sample 94 BPC: -All MS 8 263.1037 6 C13 H15 N2 O4 ,263.10 264.1068 4 260 261 262 263 Measured Pattern 264 Calculated Pattern 265 266 267 m/z 0 8 eXpose 94 vs. 114 BPC: -All MS 6 263.1 2 4 2 0 2 4 6 8 t [min] Generate Molecular Formula of mass 263.1037 m/z @ 5.9min. The determined formula C13H15N2O4 corresponds to phenyl-acetylglutamate. Combining Spectra and Statistical Data Interpretation scores / loadings Loadings in PCA indicate the importance of the original variables (buckets) in the variance space. In ideal cases a set of loadings refers to signals of a compound. Analysis of Bucket Variables Menu bar / Options Data Viewer Loadings Plot Bucket table Covariance matrix The covariance matrix looks like a TOCSY, cross - peaks indicate correlated fluctuations. This includes multi molecular fluctuations. Rows at cursor position are shown on top of the 2D matrix. Covariance Analysis row from covariance matrix reference spectrum from spectra base Interesting rows can be saved to disk as 1D NMR spectra and used for spectra base searching as any other 1D spectrum. Often, a small number of compounds from the spectra base match well while others do not. PCA analysis of 69 newborn urine LC-MS spectra 1: selecting two LC-MS runs differing in the PC1 values from scores plot 2: selecting bucket (spectral region) from loadings plot with high PC1 value Scores plot (PC1-PC2) Loadings plot (PC1-PC2) Generate SumFormula no peak LC-MS (run 1) peak LC-MS (run 2) Sum-Formula Generation Electron Configuration Intensity M+ M+* C/H Ratio, Elemental Limits m/z N-rule; isotope distribution double bond equiv. Isotope Masses, Abundances Experimental Peak Fast, exact Intensities & Masses calculation Molecular Constraints Fast Formula Generator using CHNO Algorithm List of Hits & Mass/Intensity Patterns of isotopic Patterns Formula Scoring: Isotopic pattern as additional decision criteria for elemental composition List of Hits & Intensity Mass/Intensity Patterns Three independent Scores: • Intensity Ratios • Intensity weighted mean Masses • Intensity weighted Peak Distances m/z Experimental Intensities & Masses { Theoretical Intensities & Masses } Calculating the elemental composition Simulated mass spectrum of Chlorpyriphos 12C isotope peak 13C isotope peak (11% int.) 3 x Cl isotope peaks H3C Cl Cl S O P O O H3C N Cl Clinical Proteomics The samples are different The experimental and mathematical techniques are similar But the goal is the same Workflow Clinical Proteomics Binding Washing Patients Serum Samples Isolation Normal Normal Disease * * * Analysis Elution Detection Cluster analysis Normal Disease Clinical Results MALDI-TOF MS W. Pusch et. al., Pharmacogenomics (2003) 4(4), 463-476 Data Preparation Aim: Extraction of the same set of features from each individual spectrum. These features will be used for model generation and later for classification of new spectra. As with metabonomics the identification of the features is of large interest Steps: • Quality Checks for spectra • Recalibration, Baseline correction, Noise Reduction • Peak detection and area calculation • Normalization of peak areas Tasks for Clicinal Proteomics Data Analysis 1. Data preprocessing 2. Peak annotation 3. Statistical characterization 4. Discriminance analysis Most of the tasks are quite similar for both kinds of applications except for the dimensionality of the original data: 1D MALDI 2D LCMS-ESI MS Data Preprocessing : Recalibration Problem: peaks are not aligned to each other as in this example; For LC-MS it is usually the retention time Solution: application of a recalibration algorithm Result: peaks are aligned to each other Data Preprocessing : Recalibration Intensity Mass shift Above mass tolerance Prominent peak shifted In tol. but not prominent Spec 1 Spec 2 linear mass shift m/z • Selection of prominent peaks (e.g. 30% occurrence) • Use this peak list with average masses as calibrants • Assignment of peaks and calibrants with a mass tolerance • Recalibration of all spectra by solving least square problems for a linear mass shift Data Transformation Wavelet vs. Fourier Aim: Transformation of spectrum from • time-amplitude (mass spectra) domain • into time-frequency (Fourier) or • time-scale (Wavelet) representation Benefit: - decomposition into distinct frequency/scale bands - significant features (peaks, patterns) occur on specific frequencies/scales Steps for wavelet decomposition: low pass filter approximation coefficients high pass filter detail coefficients Wavelets for Feature Extraction Aim: Determine features from the spectra which are discriminant for class separation Method: Wavelet-Transformation • gives information about the signal • localized in time (m/z) and frequency • lower freq. corresponds to raw structural information of the spectrum • higher freq. corresponds to fine/detailed information • in contrast to FFT we get knowledge were the feature is located in time (m/z) Feature selection: • we get much features (dep. on time resolution) • Brute force + sophisticated feature selection needed Peak detection Problem: •Common sets of peaks needed for later model stage •Different peaks vary to a different extent over all spectra •Small peaks , nevertheless giving a good separation between classes, might be overlooked only considering single spectra average spectrum single spectrum Peak detection Solution - ClinProTools: • Determination of peak positions by use of Average-Spectrum • Integration over start and end Masses for detected Peaks Blue areas indicate picked peaks average spectrum Red area for picked peaks in Model (see later) Average spectra per class Peak at ca. 2022Da – Idx 24 in Model of most imp. 15 Peaks for GA and SVM Avg.-spec class 1 Avg.-spec class 2 Avg.-spec class 3 Avg.-spec class 4 Univariate Statistics: Getting a basic idea about the data Calculation based on: peak intensities / peak areas • descriptive / robust statistics • Welch`s t-test / Wilcoxen test Statistic peak area sorted according p-value Algorithms for Discriminate Analysis Some alternatives to classical linear DA: • Feature selection: Genetic Algorithms (GA) + Cluster Analysis • Support Vector Machines (SVM) GA: Application to MS data • Solution = combinations of peaks • Initial population: randomly generated solutions • Start with multiple initial populations using a migration schema • Each solution is assigned a fitness value according to its ability to separate two or more classes (using centroid or KNNclustering and by determination of between and within class distances) • New generations of population are formed using – Selection: the fitter a solution, the higher the chance for being selected as a parent – Crossover: parents form new solutions by exchanging some of their peaks, new solutions replace parents – Mutation: random changes in solutions • Result: combinations of peaks, which separate classes best GA: Genetic Evolution Chromosome 1 Start Set Mutation Chromosome 2 1000 1200 1700 2100 2500 1700 1800 2000 2200 2300 1000 1200 1500 2100 2500 1700 1800 2000 2150 2300 1000 1200 1500 2150 2300 1700 1800 2000 2100 2500 50-500 cycles Cross Over 1000 1200 1500 Selection 2150 Discard disadvantages 2300 1700 1800 2000 Fitness Test using k-NN 2100 2500 Keep advantages KNN: k-nearest neighbor clustering • Spectrum = point in Rn (e.g. areas of selected peaks) • Determination of k nearest neighbors for each spectrum • Classification of all points using classes of neighboring points • Example: point A is classified as class 2, point B as class 1 • Fitness value: – percentage of correctly classified points – calculation of between/within distances A B Legend: class 1 class 2 Centroid clustering • Spectrum = point in Rn (e.g. areas of selected peaks) • Spectrum by spectrum is analyzed (iterative process): – if it is the first spectrum or too far away from all existing clusters, a new cluster with just this spectrum is created – otherwise it is assigned to the nearest cluster, the centroid is recalculated • Fitness value: – pureness & #clusters (optimal: k clusters, with all spectra of one class) – calculation of between/within distances 3 4 1 2 GA: Results Prediction capability of GA (plot for best 2 peaks) ~76% pred.* ~91% pred.* ~66% pred.* * Prediction acc. for a model with 25 peaks SVM: Support Vector Machine • SVM: Calculation of direction in Rn, which separates best between two classes (supervised method) • PCA (principal component analysis): calculation of direction in Rn, which best explains variability (unsupervised method, i.e. without looking at class memberships) PCA = PCA SVM SVM support vectors What is SVM – Basic problem • Assume we have two classes of data points with two peaks • Now I look for that line which – optimal - separates these 2 classes Class 2 Many decision boundaries can separate these two classes. Which should be chosen? Class 1 The green boundaries are valid but bad ones What is SVM – Basic idea The decision boundary should be as far away from the data of both classes as possible. We should maximize the margin, m: Class 2 Class 1 m This problem can be solved by mathematical optimization theory SVM: Application to MS data •SVM: quadratic optimization problem, solved by an iterative process using Sequential Minimal Optimization •In simplest case a hyperplane separating classes is calculated •Therefrom contribution of individual peaks is calculated From Spectra in 3 classes : We get: • Separating hyperplanes • Recognition & Prediction accuracy • Peak ranking highlighting potential biomarker-patterns SVM: Application to MS data - results Note its plotted in 2D but in fact it is high dimenensional Rec. Pred. Class 1 90 69 Class 2 86 83 Class 3 90 76 Rank 1 2 3 … Index 7 17 18 … Mass 1212 1450 1469 … SVM: Results Prediction capability of SVM (plot for best 2 peaks) ~93% pred.* ~70% pred. * ~75% pred. * * Prediction acc. for a model with 25 peaks