Copy Number Data Analysis
Xiaowen Wang
Field Applications Specialist
support@partek.com
Topics
• Copy Number Analysis
  • Data import and copy number creation
  • Detect regions of amplification and deletion
  • Detect copy number variation among different populations
  • Overlap genes with copy number regions
  • Chromosome visualization
• LOH Analysis
  • Data import
  • Detect LOH
  • Integration of LOH with copy number
• Allele Specific Copy Number Analysis
  • Data import
  • ASCN creation
  • Detect allelic imbalance
Copyright © Partek Inc.
Copy Number Analysis
Standard Copy Number Processing Workflow
• Import allele intensity (Affy .cel files), or import from Affymetrix, Agilent, Illumina, NimbleGen, etc.
• Copy Number/LogRatio
• Detect regions on each sample
• Analysis on regions across samples
• Find genes that overlap with regions
• Biological interpretation
• Genomic integration
Visualize data at any of the steps
Assay Vendors for Copy Number/aCGH
• Affymetrix 10K through SNP6, Cyto2.7M, MIP: CEL and text files
• Illumina 317K, 550K, & 1M bead arrays: GenomeStudio plug-in, text
• Agilent aCGH: Feature Extraction output, gpr output
• Roche/NimbleGen aCGH: paired files
Import Samples
• Import allele intensity to create copy number
• Import copy number/log ratio
Import Allele Intensity from Affymetrix .CEL Files
• Specify .CEL files to import
• Probes are adjusted for fragment length and sequence bias
• Automatically download needed library files
Import Allele Intensity from Affymetrix .CEL Files
• Output normalized allele intensity in log scale
• Two columns per SNP & one column per CNV probeset
Import Allele Intensity from Illumina GenomeStudio
• Use Partek plug-in for GenomeStudio
• Analysis > Reports > Report Wizard
• Choose Custom Report and select Partek Report Plug-in
• Specify Type as X & Y
• No normalization is performed
• Output three spreadsheets in the project:
  • allele intensity
  • B allele frequency
  • Genotype call
Paired/Unpaired Copy Number Creation
UNPAIRED
• PAIRED
  • Two samples (case/control) taken from each subject
  • The normal sample is the baseline for the case sample for each subject
  • Output copy number values only for the case sample in each subject
Affy: SNP6, SNP5, 100K, 500K
Illumina: 1M, Omni1-Quad
Estimating Copy Number from Allele Intensities
• The allele intensities are compared to intensities from normal subjects
• Normal sample(s) allele intensity will serve as baseline
• If a probe is 2 times as bright as the baseline, the sample has 2 times as much DNA at the genomic location that the probe targets:
  • Normal = 2 copies
  • 2 times normal = 4 copies
  • ½ of normal intensity = 1 copy
  • No intensity = 0 copies
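The intensity-to-copy-number rule above can be sketched in a few lines. This is only an illustration of the arithmetic on the slide, not Partek's implementation; it assumes linear-scale (not log-scale) intensities, and the function names and the baseline value of 100 are hypothetical.

```python
import math

def copy_number(intensity, baseline, normal_copies=2.0):
    """Scale the normal copy count by the sample-to-baseline intensity ratio."""
    if baseline <= 0:
        raise ValueError("baseline intensity must be positive")
    return normal_copies * intensity / baseline

def log2_ratio(intensity, baseline):
    """Log2 of sample over baseline intensity; 0 means no change from baseline."""
    if intensity <= 0 or baseline <= 0:
        raise ValueError("intensities must be positive for a log ratio")
    return math.log2(intensity / baseline)

if __name__ == "__main__":
    baseline = 100.0  # hypothetical linear-scale baseline intensity
    for sample in (100.0, 200.0, 50.0, 0.0):
        print(sample, copy_number(sample, baseline))
```

The same ratio expressed as a log2 ratio is 0 for a normal region, +1 for a doubling and -1 for a halving, which is the scale used by the log ratio imports.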
Baseline Choices
Better ability to detect true copy number
• Paired: DNA and reference from same patient
• Unpaired, Experimental Reference: reference from similar samples run in same lab
• Unpaired, Lab Reference: reference from larger unrelated group of samples run in same lab
• Unpaired, Universal Reference: large HapMap baseline run in third-party lab
More robustness to sources of noise
GC Wave Correction on Copy Number
• Adjust copy number/log ratio based on local GC content
• Need reference genome in .2bit format
Diskin et al., Adjustment of genomic waves in signal intensities from whole-genome SNP genotyping platforms, Nucleic Acids Res., 2008, 36(19)
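As a rough illustration of adjusting log ratios for local GC content, the sketch below fits a straight line of log ratio against GC fraction and removes the fitted trend. It is a simplified stand-in, not the procedure from the cited paper and not Partek's implementation; the function names and the use of a single linear fit are assumptions made for the example.

```python
def fit_line(x, y):
    """Ordinary least-squares slope and intercept of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("GC content must vary between markers")
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def gc_correct(log_ratios, gc_fraction):
    """Remove the linear GC trend while keeping the overall mean log ratio."""
    slope, _ = fit_line(gc_fraction, log_ratios)
    mean_gc = sum(gc_fraction) / len(gc_fraction)
    return [lr - slope * (gc - mean_gc)
            for lr, gc in zip(log_ratios, gc_fraction)]
```

Here `gc_fraction` would hold the GC fraction of the reference sequence around each marker, which is why a reference genome file is needed.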
Import Samples
• Import allele intensity to create copy number
• Import copy number/log ratio
Import Agilent aCGH
• Import from Feature Extraction
• Select the FE output .txt files
• Choose LogRatio to import
• Change the log base to 2
• Annotation will be generated during import
• Output is LogRatio spreadsheet
Import from Illumina GenomeStudio
• Use Partek plug-in for GenomeStudio
• Analysis > Reports > Report Wizard
• Choose Custom Report and select Partek Report Plug-in
• Specify Type as Illumina Copy Number Analysis
• No normalization is performed
• Output is Partek project containing three spreadsheets:
  • LogRRatio
  • B allele frequency
  • Genotype call
Import from Illumina Text Report
• Choose text file output from Illumina
• Select fields to import:
  • Sample ID
  • Probeset ID
  • Data
• Only one type of data can be imported at a time
• Specify output file
• Annotation
  • Automatically link to annotation if it is in the text file
  • Choose File > Properties to manually specify the text format of the annotation file
Import from NimbleGen
• Specify either project folder or specific files (.pair or *normalized.txt)
• Specify annotation file (.pos)
• Two file formats:
  • .pair in raw data folder
  • *normalized.txt in processed data folder
• Output of normalized.txt files is the corrected log ratio in the text file
• Partek uses the .pair files and LOESS normalization to create the log ratio of one color to the other
Import Affymetrix MIP Chip
File > Import > Affymetrix > MIP Copy Number text file
• Choose input file, annotation file
• Three files to choose from:
  • ASCN
  • Total Copy Number
  • Allele Ratio
• Values pre-calculated by Affymetrix
• Copy number values below zero must be adjusted before the analysis options can be used
  • Set values below zero to a small number
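The adjustment of negative MIP copy number values can be pictured as a simple floor. This is a minimal sketch of the idea only; the function name and the floor of 0.01 are hypothetical, since the slide does not say which small number to use.

```python
def floor_negative_values(values, small_number=0.01):
    """Replace copy number values below zero with a small positive number.

    small_number is a hypothetical choice; the source only says "small number".
    Values of zero or above are returned unchanged.
    """
    if small_number <= 0:
        raise ValueError("small_number must be positive")
    return [small_number if v < 0 else v for v in values]
```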
Import from NimbleGen
• Paired files in raw data folder
  • Need to specify baseline channel
  • Output LOESS normalized log2 ratio
• .normalized.txt in processed data folder
  • Output corrected log ratio
Assign Sample Attributes
• There are many ways to assign sample information (treatments, phenotype, other clinical information):
1) From a “sampleInfo” file (.cel import only)
2) By creating treatment/phenotype groups and dragging the samples into the appropriate group
3) By splitting apart the filename
4) By manually adding columns and filling them in (similar to Excel)
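Option 3, splitting apart the filename, amounts to cutting the file name at a separator and naming the pieces. The sketch below shows the idea outside of Partek; the underscore separator, the attribute names and the example file name are all hypothetical.

```python
import os

def attributes_from_filename(filename, names, sep="_"):
    """Split a file name (without folder or extension) into named attributes."""
    stem = os.path.splitext(os.path.basename(filename))[0]
    parts = stem.split(sep)
    if len(parts) != len(names):
        raise ValueError(
            f"expected {len(names)} fields in {stem!r}, found {len(parts)}"
        )
    return dict(zip(names, parts))

if __name__ == "__main__":
    # Hypothetical naming scheme: subject, sample type, batch.
    print(attributes_from_filename("P01_Tumor_Batch2.CEL",
                                   ["Subject", "Type", "Batch"]))
```

This only works when every file follows the same naming scheme, which is why the function refuses names with an unexpected number of fields.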
2) “Drag & Drop” Specification of Groups
• Name the new attribute and all categories of that attribute.
• Group samples by dragging and dropping (e.g. attribute name is “Type”, and categories are “Down Syndrome” and “Normal”)
Exploratory Analysis - PCA Scatterplot & Histogram
• Identify outliers
• Clustering pattern
• Identify the distribution of the data
Chromosome View on Copy Number
• View one chromosome at a time
• Change the order of tracks
• Add/remove tracks
Heatmap track
• Displays the copy number/log ratio for all the samples in the spreadsheet
• Color represents copy number value
Profile track
• Displays the selected sample
• Y axis is copy number value
• Raw and smoothed copy number
How to find regions of CNV? (Amplifications & Deletions)
• Monitor trends across multiple adjacent markers
• Define chromosomal breakpoints where these trends in chromosomal abundance change (a copy number of 2 is “normal”)
• Methods in Partek:
  • Hidden Markov Model
  • Genomic segmentation
Partek Genomic Segmentation
• Find a breakpoint that produces different neighboring regions
Segmentation Parameters
• Specify the minimum number of genomic markers
• A two-sided t-test compares two neighboring regions
• Whether to insert a breakpoint is decided based on the significance and the amount of change
Region Report
• Two one-sided t-tests compare the mean of the region with the expected range to determine aberration status
• Expected range: the range around each expected copy number. In a diploid region, the expected range would be 2 ± 0.3, which is from 1.7 to 2.3.
Signal to Noise (figure values): 2.6 and 2.2; (2.6 - 2.2) = 0.4 > 0.3
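The two checks described above can be sketched as follows: a t statistic comparing two neighboring regions, and a status call from the region mean and the expected range. This is a simplified illustration, not Partek's algorithm: it uses a Welch t statistic without computing a p-value, and it calls the status from the mean alone instead of running the two one-sided t-tests. Function names are hypothetical; the 2 ± 0.3 range is the diploid example from the slide.

```python
import math
from statistics import mean, variance

def welch_t(left, right):
    """Welch t statistic for the difference in mean between two neighboring regions."""
    se = math.sqrt(variance(left) / len(left) + variance(right) / len(right))
    if se == 0:
        raise ValueError("both regions have zero variance")
    return (mean(left) - mean(right)) / se

def aberration_status(region_mean, expected=2.0, tolerance=0.3):
    """Compare a region mean with the expected range (default 1.7 to 2.3)."""
    if region_mean > expected + tolerance:
        return "amplification"
    if region_mean < expected - tolerance:
        return "deletion"
    return "unchanged"

if __name__ == "__main__":
    left = [2.5, 2.6, 2.7]    # hypothetical marker values, mean 2.6
    right = [2.1, 2.2, 2.3]   # hypothetical marker values, mean 2.2
    print(welch_t(left, right), aberration_status(mean(left)),
          aberration_status(mean(right)))
```

A breakpoint between the two regions would be supported when the statistic is large and the means differ by more than the chosen amount of change.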
Hidden Markov Model
• Specify expected states (copy number/log ratio)
• Specify maximum probability of retaining the same state between neighboring markers
• Genomic decay describes how quickly the probability of retaining the state decays to the initial probability
• Find the most likely state sequence given the data
• Compare each state with 2 (normal) to determine aberration status
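Finding the most likely state sequence is the job of the Viterbi algorithm. The sketch below is a minimal version under stated assumptions, not Partek's HMM: emissions are Gaussian around each expected state with a shared standard deviation, the probability of retaining a state is fixed between all neighboring markers (genomic decay is left out), and all parameter values are hypothetical.

```python
import math

def viterbi_copy_number(values, states=(1.0, 2.0, 3.0), stay_prob=0.99, sigma=0.3):
    """Return the most likely expected state for each marker value."""
    if not values:
        return []
    n = len(states)
    log_stay = math.log(stay_prob)
    log_switch = math.log((1.0 - stay_prob) / (n - 1))

    def log_emission(value, state):
        # Gaussian log density, dropping the constant shared by all states.
        return -((value - state) ** 2) / (2.0 * sigma ** 2)

    # score[j]: best log probability of any path ending in state j.
    score = [-math.log(n) + log_emission(values[0], s) for s in states]
    backpointers = []
    for value in values[1:]:
        new_score, pointers = [], []
        for j, state in enumerate(states):
            candidates = [score[i] + (log_stay if i == j else log_switch)
                          for i in range(n)]
            best = max(range(n), key=candidates.__getitem__)
            pointers.append(best)
            new_score.append(candidates[best] + log_emission(value, state))
        score = new_score
        backpointers.append(pointers)

    # Trace back from the best final state.
    current = max(range(n), key=score.__getitem__)
    path = [current]
    for pointers in reversed(backpointers):
        current = pointers[current]
        path.append(current)
    path.reverse()
    return [states[i] for i in path]
```

With a high probability of retaining the state, a single noisy marker does not change the call, while a run of markers near 3 does.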
Result Spreadsheet
• One row per segment per sample
• Mean is the average copy number of all the markers in the region
• First 3 columns are the genomic location: chromosome, start, end
• HMM has stat column
• Copy Number status is based on the report parameters
• Segmentation has p-value column
HMM vs Segmentation
• HMM: good for homogeneous samples with anticipated states (copy number)
• Segmentation: good for heterogeneous samples when you don’t know the copy number state
Segmentation Result Spreadsheet
• One row per segment per sample
• First 3 columns are the genomic location: chromosome, start, end
• Copy Number status is based on the report parameters
• Right-click on a region row header > Browse to location
Plot Detected Regions
Karyoview (Histogram View)
— Sample frequency on aberration regions
Classification View
— View each region in each sample separately
Analyze Detected Segments
Analyze regions across multiple samples
[Figure: regions of samples 1 to 5 aligned along the genome, with the resulting regions shown below them]
• Regions can be smaller than the number of markers chosen during segmentation
• Copy number regions lack clearly defined borders
• You can apply a filter on small and less commonly shared regions
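One way to picture how per-sample regions turn into result regions is to cut the genome at every region boundary and record which samples cover each piece. The sketch below does this for one chromosome; it is an illustration of the idea, not Partek's procedure, and it assumes half-open (start, end) coordinates and hypothetical sample names.

```python
def shared_regions(regions_by_sample, min_samples=1):
    """Split at every boundary and list the samples covering each piece.

    regions_by_sample maps a sample name to a list of (start, end) regions
    on one chromosome. Returns (start, end, samples) tuples.
    """
    breakpoints = sorted({
        point
        for regions in regions_by_sample.values()
        for start, end in regions
        for point in (start, end)
    })
    result = []
    for left, right in zip(breakpoints, breakpoints[1:]):
        samples = sorted(
            sample
            for sample, regions in regions_by_sample.items()
            if any(start <= left and right <= end for start, end in regions)
        )
        if len(samples) >= min_samples:
            result.append((left, right, samples))
    return result
```

Because every boundary from every sample becomes a cut point, the pieces can be much shorter than any single sample's region, and `min_samples` acts as the filter on less commonly shared regions.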
Detect CN Variation on Different Categories
• A chi-square test is used to detect copy number changes among different categories
• An imbalance between samples increases the significance of the categorical contribution to aberration
• Right-click on a region header to open the HTML report
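For a single region, the chi-square test can be sketched on a table of sample counts, with one row per category and one column per copy number status. The layout of the table and the example counts are assumptions for illustration; the code computes the statistic and degrees of freedom only and leaves the p-value lookup out.

```python
def chi_square(table):
    """Pearson chi-square statistic and degrees of freedom for a count table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            if expected == 0:
                raise ValueError("table has an empty row or column")
            statistic += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, dof

if __name__ == "__main__":
    # Hypothetical counts: rows are two categories,
    # columns are samples with and without an amplification in the region.
    print(chi_square([[8, 2], [2, 8]]))
```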
Create Region List
• Specify criteria to filter down to interesting regions based on:
  • p-value
  • length
  • number of markers
  • chromosome
  • number of samples with the aberration
Find Overlapping Genes
• Overlap with RefSeq, AceView, Ensembl, CNV or custom database
• Output format:
  • a new column in the region spreadsheet
  • a new spreadsheet
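Overlapping genes with regions is an interval test: a gene overlaps a region when both are on the same chromosome and neither ends before the other starts. The sketch below assumes closed coordinates and uses hypothetical gene names and positions rather than entries from any of the databases listed above.

```python
def overlapping_genes(region, genes):
    """Return names of genes that overlap a (chromosome, start, end) region.

    genes is a list of (name, chromosome, start, end) tuples.
    """
    chrom, start, end = region
    return [
        name
        for name, gene_chrom, gene_start, gene_end in genes
        if gene_chrom == chrom and gene_start <= end and start <= gene_end
    ]

if __name__ == "__main__":
    genes = [  # hypothetical annotation rows
        ("GENE_A", "chr1", 100, 200),
        ("GENE_B", "chr1", 450, 600),
        ("GENE_C", "chr2", 150, 300),
    ]
    print(overlapping_genes(("chr1", 150, 500), genes))
```

The returned list corresponds to the "new column in the region spreadsheet" style of output; emitting one row per gene would give the separate spreadsheet instead.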
Test for Known Abnormalities
• Input files:
  • Filtered segmentation/HMM result spreadsheet
  • Abnormality database
• An overlap counts as positive
• Output: each row is a feature tested in each sample
Cluster Genome
The copy number spreadsheet is used to verify how the samples are clustered on the whole genome or on selected chromosomes
• Default shows the clustering on chromosome 1
• Click the Show All button to cluster on the whole genome
• Combine left and right click on chromosome numbers to select chromosomes
Chromosome View
Tracks
• Can be easily added and removed
• Drag and drop to change the order of the tracks
• Select a track to change the configuration at the bottom
• Heatmap track on copy number data
  • Click on the color tab to change the heatmap color
• Profile track on copy number data
  • By default, only the selected sample is shown
  • Click on the color tab to change the dot color
  • Click on the samples table to select samples to display
• Add New track
  • Any annotation information
  • Any spreadsheet with genomic location
Loss of Heterozygosity
LOH Workflow
• Import genotype calls
• Both paired and unpaired analyses are available.
• Paired is preferred (if possible) because it focuses only on genomic regions specific to the disease phenotype and minimizes differences between individuals
Import Samples
• Platform:
  • Affymetrix CHP (500K, SNP6 etc.)
  • Affymetrix text file (Mouse Diversity Genotyping Array)
  • Illumina output from GenomeStudio
  • Illumina text file
• Specify input file(s)
• Specify output file(s)
QA/QC
Sample QC
• Report rate of no-call (NC) and heterozygous calls in each sample
SNP QC
• Hardy-Weinberg Equilibrium
  • Allele frequencies in a population remain constant
• Chi-square test is performed on expected genotype frequency vs. observed genotype frequency
• Output frequencies of each allele
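The Hardy-Weinberg check for one SNP can be sketched from its genotype counts: estimate the allele frequencies, derive the expected genotype counts, and compare them with the observed counts. This is a generic textbook calculation rather than Partek's code; it assumes no-calls have already been excluded and stops at the chi-square statistic without a p-value.

```python
def hwe_chi_square(n_aa, n_ab, n_bb):
    """Allele frequencies and chi-square statistic for one SNP.

    n_aa, n_ab, n_bb are the observed AA, AB and BB genotype counts.
    """
    n = n_aa + n_ab + n_bb
    if n == 0:
        raise ValueError("no genotype calls")
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1.0 - p                      # frequency of allele B
    expected = (p * p * n, 2 * p * q * n, q * q * n)
    observed = (n_aa, n_ab, n_bb)
    # Skip genotype classes with zero expected count (monomorphic SNPs).
    statistic = sum((o - e) ** 2 / e
                    for o, e in zip(observed, expected) if e > 0)
    return p, q, statistic
```

A SNP with far fewer heterozygous calls than expected produces a large statistic, which flags it in SNP QC.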
Create LOH
• Hidden Markov Model is used to find the LOH regions based on genotype error and expected heterozygous frequency at each SNP
• Paired and unpaired
• Paired is preferred when possible as it is more accurate in its expected genotype frequencies.
Paired LOH
• Specify the tumor/normal pair information
• Specify parameters for HMM
• Homozygous SNPs in normal sample are excluded
Unpaired LOH
• Need to create baseline using normal samples
• Specify baseline file
• Specify parameters for HMM
• Default heterozygous frequency is used if no baseline is provided
LOH creates a segment table
• One row per sample per LOH region
• Heterozygous rate is the number of AB calls divided by the total number of genotype calls in the region
• Paired LOH displays only the case sample LOH regions
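The heterozygous rate defined above is a simple fraction. In the sketch below, every entry passed in counts toward the total; whether no-calls are part of that total is not stated on the slide, so treating them as counted is an assumption of this example.

```python
def heterozygous_rate(calls):
    """Number of AB calls divided by the total number of calls in a region."""
    if not calls:
        raise ValueError("region has no genotype calls")
    return sum(1 for call in calls if call == "AB") / len(calls)
```

A rate near zero over many consecutive SNPs is what an LOH region looks like in the segment table.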
Intro to CN & LOH merge
• Sample ID should match
• Input file is regions spreadsheet detected on each sample
• Output is the union of the regions with different categories:
  • Amplification with LOH
  • Amplification without LOH
  • Deletion with LOH
  • Deletion without LOH
  • Copy-Neutral LOH
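The five categories follow from overlaying the copy number regions and the LOH regions of one sample and labelling each resulting piece. The sketch below is an illustration under assumptions, not Partek's merge: coordinates are half-open on a single chromosome, and the status strings "amplification" and "deletion" are hypothetical labels for the copy number calls.

```python
def merge_category(cn_status, has_loh):
    """Name the merged category for one piece of the genome, or None."""
    if cn_status == "amplification":
        return "Amplification with LOH" if has_loh else "Amplification without LOH"
    if cn_status == "deletion":
        return "Deletion with LOH" if has_loh else "Deletion without LOH"
    return "Copy-Neutral LOH" if has_loh else None

def overlay(cn_regions, loh_regions):
    """Union of copy number and LOH regions for one sample, split and labelled.

    cn_regions: list of (start, end, status); loh_regions: list of (start, end).
    """
    points = sorted({p for start, end, *_ in cn_regions + loh_regions
                     for p in (start, end)})
    merged = []
    for left, right in zip(points, points[1:]):
        status = next((st for start, end, st in cn_regions
                       if start <= left and right <= end), None)
        has_loh = any(start <= left and right <= end
                      for start, end in loh_regions)
        label = merge_category(status, has_loh)
        if label is not None:
            merged.append((left, right, label))
    return merged
```

Pieces with neither a copy number change nor LOH get no label and are left out of the output.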
LOH & CN Overlay
[Figure: LOH, amplification and deletion regions overlaid, giving amplification with LOH, copy-neutral LOH and deletion with LOH segments]
Output of CN overlap with LOH
• Genomic location of the region
• Sample ID
• Description
• Average copy number
• Heterozygous rate
LOH and Copy Number Overlap
Chromosome view of the region in the five categories
Find regions in multiple samples
• Specify the number of samples in which an output region must be in common
• Output the number of samples in the region and the sample IDs
Visualization
Histogram on number of samples in each region for each category
Allele Specific Copy Number
Copyright © 2009 Partek Incorporated.
All rights reserved.
Why ASCN
• Estimate number of copies for each allele
• Able to detect imbalance in copy number between alleles, which is important for mixed tissue, e.g. tumor samples
• Helps interpret total copy number analysis results
ASCN Workflow
• Import samples
  • genotype calls
  • allele intensity
• Sample ID should match in both spreadsheets
• Normal sample(s) are required on the spreadsheet
• Analyze both paired and unpaired design
• Detect allelic imbalance
• Overlap with copy number analysis
Create ASCN
• Not all SNPs are used
• Paired is preferred (if possible)
  • Minimizes differences between individuals
  • Informative SNPs are those with a heterozygous call in the normal sample
• Unpaired analysis
  • Informative SNPs are those with a heterozygous call in the tumor sample
  • LOH regions in the tumor sample have no informative SNPs (missing values)
ASCN Result
• Each row is one allele copy number in each sample
• “?” means the SNP is not informative in the sample
• Min/max is determined by the allele copy number value
• In a diploid region, min and max should both be around 1
Detect Allelic Imbalance
• Each informative SNP gets a proportion score:
  Proportion = (Max - Min)/(Max + Min)
• The proportion score ranges from 0 to 1; a large score means allelic imbalance
• Genomic segmentation is performed
• The average proportion score on informative SNPs is reported
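The proportion score can be computed directly from the two allele copy numbers of a SNP. The sketch below implements the formula above and the averaging over informative SNPs; representing a non-informative SNP as `None` values is an assumption of the example, standing in for the “?” entries in the result spreadsheet.

```python
def proportion_score(allele_1, allele_2):
    """(Max - Min) / (Max + Min) for the two allele copy numbers of one SNP."""
    low, high = sorted((allele_1, allele_2))
    if high + low <= 0:
        raise ValueError("total allele copy number must be positive")
    return (high - low) / (high + low)

def mean_proportion(snps):
    """Average proportion score over informative SNPs; None if there are none.

    snps is a list of (allele_1, allele_2) pairs; pairs containing None are
    treated as non-informative and skipped.
    """
    scores = [proportion_score(a, b)
              for a, b in snps if a is not None and b is not None]
    return sum(scores) / len(scores) if scores else None
```

A balanced diploid SNP with one copy of each allele scores 0, while a SNP with two copies of one allele and none of the other scores 1.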
Questions?
Hands On & Demo
Data Set
• 20 paired tumor/normal samples
• Kindly provided by Ian Campbell, Peter MacCallum Cancer Centre
• Could easily be run paired
• Run on the Affymetrix Human SNP 6.0 Arrays
  • 900K SNPs & 900K CNVs for 1.8M total genetic markers