Copy Number Data Analysis
Xiaowen Wang, Field Applications Specialist
support@partek.com

Topics
• Copy Number Analysis
  • Data import and copy number creation
  • Detect regions of amplification and deletion
  • Detect copy number variation among different populations
  • Overlap genes with copy number regions
  • Chromosome visualization
• LOH Analysis
  • Data import
  • Detect LOH
  • Integration of LOH with copy number
• Allele Specific Copy Number Analysis
  • Data import
  • ASCN creation
  • Detect allelic imbalance

Copyright © Partek Inc.

Copy Number Analysis

Standard Copy Number Processing Workflow
• Import allele intensity (Affy .cel files), or import from Affymetrix, Agilent, Illumina, NimbleGen, etc.
• Copy number/log ratio
• Detect regions on each sample
• Analysis on regions across samples
• Find genes that overlap with regions
• Biological interpretation
• Genomic integration
• Visualize data at any of the steps

Assay Vendors for Copy Number/aCGH
• Affymetrix 10K through SNP6, Cyto2.7M, MIP: CEL and text files
• Illumina 317K, 550K, & 1M bead arrays: GenomeStudio plug-in, text
• Agilent aCGH: Feature Extraction output, gpr output
• Roche/NimbleGen aCGH: paired files

Import Samples
• Import allele intensity to create copy number
• Import copy number/log ratio

Import Allele Intensity from Affymetrix .CEL Files
• Specify .CEL files to import
• Probes are adjusted for fragment length and sequence bias
• Needed library files are downloaded automatically
• Output is normalized allele intensity in log scale
• Two columns per SNP & one column per CNV probeset
Import Allele Intensity from Illumina GenomeStudio
• Use Partek plug-in for GenomeStudio
• Analysis > Reports > Report Wizard
• Choose Custom Report and select Partek Report Plug-in
• Specify Type as X & Y
• No normalization is performed
• Output is three spreadsheets in the project:
  • Allele intensity
  • B allele frequency
  • Genotype call

Paired/Unpaired Copy Number Creation
UNPAIRED
PAIRED
• Two samples (case/control) taken from each subject
• The normal sample is the baseline for the case sample for each subject
• Output is copy number values only for the case sample in each subject
Affy: SNP6, SNP5, 100K, 500K
Illumina: 1M, Omni1-Quad

Estimating Copy Number from Allele Intensities
• The allele intensities are compared to intensities from normal subjects
• Normal sample(s) allele intensity serves as the baseline
• If the intensity of a probe is 2 times brighter than baseline, the sample has 2 times as much DNA at the genomic location the probe targets
  • Normal = 2 copies
  • 2 times normal = 4 copies
  • ½ of normal intensity = 1 copy
  • No intensity = 0 copies

Baseline Choices
Listed from better ability to detect true copy number (top) to more robustness to sources of noise (bottom):
• Paired: DNA and reference from same patient
• Unpaired Experimental Reference: reference from similar samples run in same lab
• Unpaired Lab Reference: reference from larger unrelated group of samples run in same lab
• Unpaired Universal Reference: large HapMap baseline run in third-party lab

GC Wave Correction on Copy Number
• Adjust copy number/log ratio based on local GC content
• Needs reference genome in .2bit format
Diskin et al., Adjustment of genomic waves in signal intensities from whole-genome SNP genotyping platforms, Nucleic Acids Res., 2008, 36(19)

Import Samples
• Import allele intensity to create copy number
• Import copy number/log ratio
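The intensity-to-copy-number relationship listed under "Estimating Copy Number from Allele Intensities" can be sketched in a few lines. This is only an illustration of the arithmetic on that slide, not Partek's implementation; the function names are hypothetical, and the log2 variant assumes the input is log2(sample / baseline).

```python
NORMAL_COPIES = 2  # a normal (baseline) intensity corresponds to 2 copies


def copy_number_from_intensity(sample_intensity, baseline_intensity):
    """Scale the sample/baseline intensity ratio by the normal copy count."""
    if baseline_intensity <= 0:
        raise ValueError("baseline intensity must be positive")
    return NORMAL_COPIES * sample_intensity / baseline_intensity


def copy_number_from_log2_ratio(log2_ratio):
    """Same relationship when the input is log2(sample / baseline) (assumed)."""
    return NORMAL_COPIES * 2.0 ** log2_ratio


if __name__ == "__main__":
    # Baseline intensity of 100 is an arbitrary illustrative value.
    for intensity in (100, 200, 50, 0):
        print(intensity, copy_number_from_intensity(intensity, 100))
```

A probe at baseline intensity maps to 2 copies, twice the baseline to 4, half to 1, and zero intensity to 0, matching the four cases on the slide.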
Import Agilent aCGH
• Import from Feature Extraction
• Select the FE output .txt files
• Choose LogRatio to import
• Change the log base to 2
• Annotation will be generated during import
• Output is a LogRatio spreadsheet

Import from Illumina GenomeStudio
• Use Partek plug-in for GenomeStudio
• Analysis > Reports > Report Wizard
• Choose Custom Report and select Partek Report Plug-in
• Specify Type as Illumina Copy Number Analysis
• No normalization is performed
• Output is a Partek project containing three spreadsheets:
  • LogRRatio
  • B allele frequency
  • Genotype call

Import from Illumina Text Report
• Choose text file output from Illumina
• Select fields to import:
  • Sample ID
  • Probeset ID
  • Data
• Only one type of data can be imported at a time
• Specify output file
• Annotation
  • Automatically linked to annotation if it is in the text file
  • Choose File > Properties to manually specify the text format of the annotation file

Import from NimbleGen
• Specify either project folder or specific files (.pair or *normalized.txt)
• Specify annotation file (.pos)
• Two file formats
  • .pair in raw data folder
  • *normalized.txt in processed data folder
• Output of *normalized.txt files is the corrected log ratio in the text file
• Partek uses the .pair files and LOESS normalization to create the log ratio of one color to the other

Import Affymetrix MIP Chip
• File > Import > Affymetrix > MIP Copy Number text file
• Choose input file, annotation file
• Three files to choose from:
  • ASCN
  • Total Copy Number
  • Allele Ratio
• Values are pre-calculated by Affymetrix
• Copy number values below zero must be adjusted before the analysis options can be used
• Set values below zero to a small number
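Two of the import adjustments above are simple value transformations and can be sketched as follows. This is a hedged illustration, not Partek's code: the function names and the floor value of 0.01 are hypothetical, and the base-10 default for the log conversion is an assumption that should be checked against the actual input file.

```python
import math


def floor_negative_values(values, floor=0.01):
    """Replace copy number values below zero with a small positive number.

    The floor of 0.01 is an illustrative choice, not a documented default.
    """
    return [floor if v < 0 else v for v in values]


def to_log2(log_value, from_base=10.0):
    """Convert a log ratio from another base to base 2.

    log2(x) = log_b(x) * log2(b); from_base=10 is an assumption.
    """
    return log_value * math.log2(from_base)


if __name__ == "__main__":
    print(floor_negative_values([-0.5, 0.0, 1.9, 2.4]))
    print(to_log2(0.30103))  # a base-10 log ratio expressed in base 2
```

Only values strictly below zero are replaced, following the wording of the MIP slide; exact zeros are left unchanged.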
Import from NimbleGen
• Paired files in raw data folder
  • Need to specify baseline channel
  • Output is LOESS-normalized log2 ratio
• *normalized.txt in processed data folder
  • Output is corrected log ratio

Assign Sample Attributes
• There are many ways to assign sample information (treatments, phenotype, other clinical information):
1) From a “sampleInfo” file (.cel import only)
2) By creating treatment/phenotype groups and dragging the samples into the appropriate group
3) By splitting apart the filename
4) By manually adding columns and filling them in (similar to Excel)

2) “Drag & Drop” Specification of Groups
• Name the new attribute and all categories of that attribute
• Group samples by dragging and dropping (e.g. attribute name is “Type”, and categories are “Down Syndrome” and “Normal”)

Exploratory Analysis - PCA Scatterplot & Histogram
• Identify outliers
• Identify clustering patterns
• Identify the distribution of the data

Chromosome View on Copy Number
• View one chromosome at a time
• Change the order of tracks
• Add/remove tracks
Heatmap track
• Displays the copy number/log ratio for all the samples in the spreadsheet
• Color represents the copy number value
Profile track
• Displays the selected sample
• Y axis is the copy number value
• Shows raw and smoothed copy number

How to find regions of CNV? (Amplifications & Deletions)
• Monitor trends across multiple adjacent markers
• Define chromosomal breakpoints where these trends in chromosomal abundance change
[Figure label: 2 = “normal”]
• Methods in Partek:
  • Hidden Markov Model
  • Genomic segmentation
Partek Genomic Segmentation
• Find a breakpoint that produces different neighboring regions
Segmentation Parameters
• Specify the minimum number of genomic markers
• A two-sided t-test compares two neighboring regions
• The significance and the amount of change decide whether a breakpoint is inserted
Region Report
• Two one-sided t-tests compare the mean of the region with the expected range to determine aberration status
• Expected range: the range around each expected copy number. In a diploid region, the expected range would be 2 ± 0.3, which is from 1.7 to 2.3.
Signal to Noise
• Example: (2.6 - 2.2) = 0.4 > 0.3

Hidden Markov Model
• Specify expected states (copy number/log ratio)
• Specify the maximum probability of retaining the same state between neighboring markers
• Genomic decay describes how quickly the retention of the state decays to the initial probability
• Find the most likely state sequence given the data
• States are compared with the normal state of 2 to determine aberration status

Result Spreadsheet
• One row per segment per sample
• Mean is the average copy number of all the markers in the region
• First 3 columns are the genomic location: chromosome, start, end
• HMM has stat column
• Copy number status is based on the report parameters
• Segmentation has p-value column

HMM vs Segmentation
• HMM: good for homogeneous samples with anticipated states (copy number)
• Segmentation: good for heterogeneous samples when you don’t know the copy number state

Segmentation Result Spreadsheet
• One row per segment per sample
• First 3 columns are the genomic location: chromosome, start, end
• Copy number status is based on the report parameters
• Right-click on a region row header > Browse to location
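The expected-range idea from the Region Report can be illustrated with a deliberately simplified sketch. It compares only the region mean against the range 2 ± 0.3 and omits the two one-sided t-tests that the slide describes, so it is not Partek's procedure; the function name and status labels are hypothetical.

```python
def aberration_status(region_mean, expected_copy_number=2.0, half_width=0.3):
    """Label a region by where its mean falls relative to the expected range.

    Simplification: uses the mean alone, without the two one-sided t-tests.
    """
    low = expected_copy_number - half_width
    high = expected_copy_number + half_width
    if region_mean > high:
        return "amplification"
    if region_mean < low:
        return "deletion"
    return "unchanged"


if __name__ == "__main__":
    # Region means below are illustrative values, not real data.
    for mean in (2.6, 2.1, 1.2):
        print(mean, aberration_status(mean))
```

In a real analysis the t-tests account for the number of markers and the noise within the region, so a short, noisy region with a mean just outside the range would not necessarily be called aberrant.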
Plot Detected Regions
• Karyoview (Histogram View): sample frequency of aberration regions
• Classification View: view each region in each sample separately

Analyze Detected Segments
• Analyze regions across multiple samples
[Figure: regions of samples 1 to 5 and the resulting regions]
• Regions can be smaller than the default number of markers chosen during segmentation
• Lack of defined borders in copy number
• You can filter out small and less commonly shared regions

Detect CN Variation on Different Categories
• A chi-square test is used to detect copy number changes among different categories
• Imbalance between samples will increase the significance of the categorical contribution to aberration
• Right-click on a region row header to invoke an HTML report

Create Region List
• Specify criteria to filter down to interesting regions based on
  • p-value
  • length
  • number of markers
  • chromosome
  • number of aberration samples

Find Overlapping Genes
• Overlap with RefSeq, AceView, Ensembl, CNV or custom database
• Output format:
  • In a new column in the region spreadsheet
  • New spreadsheet

Test for Known Abnormalities
• Input files:
  • Filtered segmentation/HMM result spreadsheet
  • Abnormality database
• If a region overlaps a known abnormality, the result is positive
• Output: each row is a feature tested in each sample

Cluster Genome
The copy number spreadsheet is used to verify how the samples cluster on the whole genome or on selected chromosomes
• Default shows the clustering on chromosome 1
• Click the Show All button to cluster on the whole genome
• Combine left and right click on chromosome numbers to select chromosomes
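The "Find Overlapping Genes" step is, at its core, an interval-overlap query between detected regions and gene annotations. The sketch below shows that idea under stated assumptions: coordinates are treated as closed intervals, the tuple layouts are hypothetical, and the gene names and positions are made up for illustration. It does not reproduce Partek's output format.

```python
def intervals_overlap(a_start, a_end, b_start, b_end):
    """True if two closed intervals share at least one position."""
    return a_start <= b_end and b_start <= a_end


def genes_per_region(regions, genes):
    """Pair each region with the names of genes that overlap it.

    regions: list of (chromosome, start, end)
    genes:   list of (name, chromosome, start, end)
    """
    result = []
    for chrom, start, end in regions:
        hits = [
            name
            for name, g_chrom, g_start, g_end in genes
            if g_chrom == chrom and intervals_overlap(start, end, g_start, g_end)
        ]
        result.append(((chrom, start, end), hits))
    return result


if __name__ == "__main__":
    # Hypothetical regions and genes, for illustration only.
    regions = [("chr1", 1000, 5000), ("chr2", 200, 300)]
    genes = [
        ("GENE_A", "chr1", 4500, 6000),
        ("GENE_B", "chr1", 7000, 8000),
        ("GENE_C", "chr2", 100, 250),
    ]
    for region, names in genes_per_region(regions, genes):
        print(region, names)
```

The nested loop is fine for a short region list; for genome-wide annotation sets, sorting by start position or using an interval tree would scale better.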
Chromosome View
Tracks
• Can be easily added and removed
• Drag and drop to change the order of the tracks
• Select a track to change its configuration at the bottom
• Heatmap track on copy number data
  • Click on the color tab to change the heatmap color
• Profile track on copy number data
  • By default, only the selected sample is shown
  • Click on the color tab to change the dot color
  • Click on the samples table to select samples to display
• Add new track
  • Any annotation information
  • Any spreadsheet with genomic location

Loss of Heterozygosity

LOH Workflow
• Import genotype calls
• Both paired and unpaired analyses are available
• Paired is preferred (if possible) because it focuses only on genomic regions specific to the disease phenotype and minimizes differences between individuals

Import Samples
• Platform:
  • Affymetrix CHP (500K, SNP6 etc.)
  • Affymetrix text file (Mouse Diversity Genotyping Array)
  • Illumina output from GenomeStudio
  • Illumina text file
• Specify input file(s)
• Specify output file(s)

QA/QC
Sample QC
• Report the rate of no-call (NC) and heterozygous calls in each sample
SNP QC
• Hardy-Weinberg Equilibrium: allele frequencies in a population remain constant
• A chi-square test is performed on expected genotype frequency vs. observed genotype frequency
• Output includes the frequencies of each allele

Create LOH
• A Hidden Markov Model is used to find the LOH regions based on genotype error and expected heterozygous frequency at each SNP
• Paired and unpaired
• Paired is preferred when possible as it is more accurate in its expected genotype frequencies

Paired LOH
• Specify the tumor/normal pair information
• Specify parameters for the HMM
• Homozygous SNPs in the normal sample are excluded
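The SNP QC step above compares observed genotype frequencies with those expected under Hardy-Weinberg Equilibrium. A minimal sketch of that chi-square test for one biallelic SNP follows. It assumes the textbook form of the test with 1 degree of freedom and ignores no-calls; the function name is hypothetical and the source does not state which variant Partek uses.

```python
import math


def hwe_chi_square(n_aa, n_ab, n_bb):
    """Chi-square test of observed genotype counts against HWE expectations.

    Returns (chi_square, p_value), assuming 1 degree of freedom.
    """
    n = n_aa + n_ab + n_bb
    if n == 0:
        raise ValueError("no genotype calls")
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1.0 - p                      # frequency of allele B
    if p == 0.0 or p == 1.0:
        # Monomorphic SNP: expected counts equal observed, nothing to test.
        return 0.0, 1.0
    observed = (n_aa, n_ab, n_bb)
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Survival function of the chi-square distribution with 1 df.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


if __name__ == "__main__":
    # Illustrative genotype counts, not real data.
    print(hwe_chi_square(25, 50, 25))  # counts in exact HWE proportions
    print(hwe_chi_square(50, 0, 50))   # no heterozygotes at all
```

A very small p-value flags a SNP whose genotype counts depart from equilibrium, which in a QC setting often points to genotyping problems rather than biology.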
Unpaired LOH
• Need to create a baseline using normal samples
• Specify baseline file
• Specify parameters for the HMM
• The default heterozygous frequency is used if no baseline is provided

LOH creates a segment table
• One row per sample per LOH region
• Heterozygous rate is the number of AB calls divided by the total number of genotype calls in the region
• Paired LOH displays only the case sample LOH regions

Intro to CN & LOH merge
• Sample IDs should match
• Input files are the region spreadsheets detected on each sample
• Output is the union of the regions, in different categories:
  • Amplification with LOH
  • Amplification without LOH
  • Deletion with LOH
  • Deletion without LOH
  • Copy-Neutral LOH

LOH & CN Overlay
[Figure: LOH, amplification and deletion regions overlaid, giving the categories Amplification, Amp w/ LOH, Copy-neutral LOH, Del w/ LOH, Deletion]

Output of CN overlap with LOH
• Genomic location of the region
• Sample ID
• Description
• Average copy number
• Heterozygous rate

LOH and Copy Number Overlap
• Chromosome view of the region in the five categories

Find regions in multiple samples
• Specify that an output region must be common to a given number of samples
• Output is the number of samples in the region and the sample IDs

Visualization
• Histogram of the number of samples in each region for each category

Allele Specific Copy Number
Copyright © 2009 Partek Incorporated. All rights reserved.

Why ASCN
• Estimate the number of copies for each allele
• Able to detect imbalance in copy number between alleles, which is important for mixed tissue, e.g. tumor samples
• Helps interpret total copy number analysis results
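Two pieces of the LOH material above are easy to state as code: the heterozygous rate of a region and the five categories produced by merging copy number with LOH. The sketch below is an illustration only. The function names, the status strings, and the choice to count every call passed in (including any no-calls) in the denominator are assumptions, not Partek's documented behavior.

```python
def heterozygous_rate(genotype_calls):
    """Number of AB calls divided by the total number of calls in a region."""
    if not genotype_calls:
        return None
    ab_calls = sum(1 for call in genotype_calls if call == "AB")
    return ab_calls / len(genotype_calls)


def overlay_category(cn_status, in_loh):
    """Map a copy number status and an LOH flag to one of the five categories.

    cn_status is one of "amplification", "deletion", "unchanged"
    (hypothetical labels). Returns None for unchanged regions without LOH.
    """
    if cn_status == "amplification":
        return "Amplification with LOH" if in_loh else "Amplification without LOH"
    if cn_status == "deletion":
        return "Deletion with LOH" if in_loh else "Deletion without LOH"
    if cn_status == "unchanged":
        return "Copy-Neutral LOH" if in_loh else None
    raise ValueError(f"unknown copy number status: {cn_status!r}")


if __name__ == "__main__":
    print(heterozygous_rate(["AA", "AB", "BB", "AB"]))
    print(overlay_category("unchanged", True))
```

In practice the two inputs come from separate region spreadsheets, so the regions would first be intersected per sample before a category is assigned to each resulting piece.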
ASCN Workflow
• Import samples
  • Genotype calls
  • Allele intensity
  • Sample IDs should match in both spreadsheets
  • Normal sample(s) are required in the spreadsheet
• Analyze both paired and unpaired designs
• Detect allelic imbalance
• Overlap with copy number analysis

Create ASCN
• Not all SNPs are used
• Paired is preferred (if possible)
  • Minimizes differences between individuals
  • Informative SNPs are heterozygous calls in the normal sample
• Unpaired analysis
  • Informative SNPs are heterozygous calls in the tumor sample
  • LOH regions in the tumor sample mean no informative SNPs (missing values)

ASCN Result
• Each row is one allele copy number in each sample
• “?” means the SNP is not informative in the sample
• Min/max are assigned by value: the smaller and the larger of the two allele copy numbers
• In a diploid region, min and max should both be around 1

Detect Allelic Imbalance
• The proportion score of each informative SNP is: Proportion = (Max - Min)/(Max + Min)
• The score ranges from 0 to 1; a large proportion score means allelic imbalance
• Genomic segmentation is performed
• The average proportion score on informative SNPs is reported

Questions?

Hands On & Demo

Data Set
• 20 paired tumor/normal samples
• Kindly provided by Ian Campbell, Peter MacCallum Cancer Centre
• Could easily be run paired
• Run on the Affymetrix Human SNP 6.0 Arrays
  • 900K SNPs & 900K CNVs for 1.8M total genetic markers
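The proportion score from "Detect Allelic Imbalance" is a one-line formula, sketched here together with the averaging over informative SNPs. The function names are hypothetical, non-informative SNPs (the “?” entries) are represented as None by assumption, and the segmentation step itself is not shown.

```python
def proportion_score(max_cn, min_cn):
    """Proportion = (Max - Min) / (Max + Min) for one informative SNP."""
    total = max_cn + min_cn
    if total == 0:
        return None  # undefined when both allele copy numbers are zero
    return (max_cn - min_cn) / total


def mean_proportion(allele_pairs):
    """Average proportion score over informative SNPs in a region.

    allele_pairs: list of (max_cn, min_cn) tuples, with None standing in
    for non-informative SNPs (an assumed representation of "?").
    """
    scores = []
    for pair in allele_pairs:
        if pair is None:
            continue
        score = proportion_score(*pair)
        if score is not None:
            scores.append(score)
    if not scores:
        return None
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Illustrative allele copy numbers, not real data.
    print(proportion_score(1.0, 1.0))  # balanced diploid SNP
    print(proportion_score(2.0, 0.0))  # one allele lost, the other duplicated
    print(mean_proportion([(3.0, 1.0), None, (1.0, 1.0)]))
```

With non-negative allele copy numbers the score stays between 0 and 1: equal alleles give 0, and complete loss of one allele gives 1.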