R and Eddie for Breast Cancer Bioinformatics Duncan Sproul

Microarrays
• Assay Samples on a Genome Wide Scale
• Developed in 1990s
• Widely Applied in Biology
Microarrays Widely Applied in Breast Cancer
Sorlie et al Nature 2000
Microarray Approaches in Breast Cancer
• Unsupervised
– Molecular subtyping: ‘Intrinsic set’ → Molecular subtypes
• Supervised
– Patient Based: Recurrence/Metastasis/Survival → Poor / Good
– Model Based: ER/Grade/Hypoxia/Proliferation → Low / High
• Prognostic Implications?
Sims (2009). J. Clin Path. (available online)
What Can They Measure?
• Genomic DNA
– VARIATION: SNP arrays
– COPY NUMBER: CGH arrays
– METHYLATION: Methylation arrays
• TRANSCRIPTION (Gene → mRNA, miRNA)
– mRNA: Gene expression (mRNA) arrays
– miRNA: miRNA arrays
• TRANSLATION
– Protein sequence: Peptide arrays
• POST-TRANSLATIONAL MODIFICATIONS
– 3D Protein structure
How We Use Eddie…
Integration of Previous Studies
1. Richardson et al. (2006) Cancer Cell 9: 121-32
– 40 tumours, U133 plus2, standard labelling
– 18 ‘basal-like’, 20 ‘non-basal-like’, 2 BRCA
2. Farmer et al. (2005) Oncogene 24(29): 4660-71
– 49 tumours, U133A, amplified RNA
– 27 luminal, 16 basal, 6 ‘molecular apocrine’
[Figure: the Richardson et al. and Farmer et al. datasets shown together, in two versions with differently ordered rows. Row labels (probe sets): ERBB2 (216836_s_at, 210930_s_at), GRB7 (210761_s_at), GATA3 (209602_s_at, 209603_s_at, 209604_s_at), FBP1 (209696_at), ESR1 (205225_at), NAT1 (214440_at), SEMA3C (203789_at), XBP1 (200670_at), KRT17 (212236_at, 205157_at), KRT5 (201820_at). SEMA3C appears in the first version only.]
Sims et al. (2008) BMC Medical Genomics 1:42
Data Repositories
Microarray Hybridisations by Year in Array Express
[Figure: Number of Hybridisations (0 to 300,000) by Year, 2004 to 2009]
Statistics taken from Array Express for January of each year
Semi-Automated Pre-Processing of Studies from Repositories
Running Analyses with Varying Parameters Using Array Jobs
[Figure: Number of Loci by Distance Parameter. y-axis: Number of Loci (Relative to Distance = 0), 0% to 100%; x-axis: Distance Parameter, 0 to 1000; series: Mouse, Human Interphase, Human Mitotic]
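The array-job idea can be illustrated with a short sketch: each task in the array picks one value of the distance parameter and runs the same analysis on it. This is written in Python purely for illustration and is not the code used for the talk. It assumes a Grid Engine style scheduler that exports the task number as `SGE_TASK_ID`; the parameter grid (0 to 1000 in steps of 100, read off the x-axis) and the function names `distance_for_task` and `current_task_id` are hypothetical.

```python
import os

# Hypothetical parameter grid: 0, 100, ..., 1000, read off the plot's x-axis.
DISTANCES = list(range(0, 1001, 100))


def distance_for_task(task_id, distances=DISTANCES):
    """Map a 1-based array-job task ID to one distance parameter."""
    if not 1 <= task_id <= len(distances):
        raise ValueError(f"task ID {task_id} outside 1..{len(distances)}")
    return distances[task_id - 1]


def current_task_id(environ=os.environ):
    """Read the task ID from SGE_TASK_ID (assumed scheduler convention).

    Falls back to 1 when the variable is missing or not a number,
    e.g. when the script is run outside an array job.
    """
    raw = environ.get("SGE_TASK_ID", "1")
    return int(raw) if raw.isdigit() else 1


if __name__ == "__main__":
    distance = distance_for_task(current_task_id())
    # The real analysis for this parameter value would be called here.
    print(f"Running analysis with distance parameter {distance}")
```

Under these assumptions, submitting the script as an array of 11 tasks (for example with `qsub -t 1-11` on Grid Engine) would run one task per distance value, each independent of the others.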
Analysis of WNT Signalling in Breast Cancer
• Process Multiple Breast Cancer Datasets
• WNT Gene Sets
• 2 Datasets
• Groups of Functionally Related Genes
• Remaining Datasets
• Determine Association with Clinical Variables
Detecting Regions of Co-Regulation in Breast Cancer
[Figure: Number of Loci by Distance Parameter. y-axis: Number of Loci (Relative to Distance = 0), 0% to 100%; x-axis: Distance Parameter, 0 to 1000; series: Mouse, Human Interphase, Human Mitotic]
• Process Breast Cancer Datasets
• Decide on Parameters
• Find Regions of Interest
• Significance Testing by Permutation
• Regions for Further Analysis
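The permutation step in this workflow can be sketched generically. The slides do not say which statistic was permuted, so the example below assumes a simple one: the mean score of the loci in a candidate region compared with the mean of equally sized random draws from all loci. It is in Python for illustration only; `permutation_p_value` and its arguments are hypothetical names.

```python
import random


def permutation_p_value(region_scores, background_scores, n_perm=1000, seed=0):
    """One-sided permutation p-value for mean(region) > mean(random draw).

    Assumed statistic: the mean of the region's scores. Region and
    background scores are pooled and reshuffled n_perm times.
    """
    rng = random.Random(seed)
    k = len(region_scores)
    pooled = list(region_scores) + list(background_scores)
    observed = sum(region_scores) / k
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        # Count permutations at least as extreme as the observed mean.
        if sum(pooled[:k]) / k >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_perm + 1)
```

Because every permutation is independent, a large `n_perm` can be split across several array-job tasks (each with its own seed) and the hit counts summed afterwards.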
Mapping of Short Sequence Tags to Transcriptional Start Sites
• Short Sequence Tags Mapped to Genome
• Gene Locations
• Enrichment by Gene Group
[Figure: Number of Tags (0 to 900) against Distance from TSS (-2000 to 2000); series: Low, Medium, High]
Parallelization of Mapping Reduces Estimated Time from ~11hrs to ~2hrs
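A minimal sketch of why this kind of mapping parallelises well: each tag's distance to its nearest TSS is computed independently, so the tags can be split into chunks, processed separately, and the per-chunk counts added together. The slides do not describe the actual implementation; the Python below is illustrative only, ignores strand, handles a single chromosome, and uses hypothetical names (`distance_to_nearest_tss`, `histogram_chunk`, `chunked`) and a hypothetical bin size.

```python
from bisect import bisect_left
from collections import Counter


def distance_to_nearest_tss(tag_pos, sorted_tss):
    """Signed distance (tag minus TSS) to the closest TSS.

    sorted_tss must be a non-empty, ascending list of positions.
    """
    i = bisect_left(sorted_tss, tag_pos)
    # Only the neighbours on either side of the insertion point can be nearest.
    candidates = sorted_tss[max(i - 1, 0):i + 1]
    nearest = min(candidates, key=lambda tss: abs(tag_pos - tss))
    return tag_pos - nearest


def histogram_chunk(tags, sorted_tss, window=2000, bin_size=500):
    """Count tags per distance bin (keyed by bin start) for one chunk."""
    counts = Counter()
    for pos in tags:
        d = distance_to_nearest_tss(pos, sorted_tss)
        if -window <= d < window:
            counts[(d // bin_size) * bin_size] += 1
    return counts


def chunked(items, n_chunks):
    """Split items into at most n_chunks contiguous pieces."""
    size = max(1, -(-len(items) // n_chunks))  # ceiling division
    return [items[i:i + size] for i in range(0, len(items), size)]


if __name__ == "__main__":
    tss = [1000, 5000]
    tags = [900, 1200, 4900, 6000, 1000]
    # Each chunk could be handled by a separate job; results merge by addition.
    total = sum((histogram_chunk(c, tss) for c in chunked(tags, 2)), Counter())
    print(sorted(total.items()))
```

Merging by addition gives the same histogram as a single pass over all tags, which is what makes splitting the work across jobs safe in this sketch.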
The Future
• More Parallel Jobs
• Bigger Jobs and Parallel Processing
– E.g. the affyPara Bioconductor package
• More Mass Sequencing
– More Data!
Thanks!
Andy Sims
Arran Turnbull
Liang Liang
Colette Meyer
Robert Kitchen
Breakthrough Unit
Elad Katz
Sylvie Dubois-Marshall
Charlene Kay
Bartlett Lab
Nick Gilbert
Bernie Ramsahoye
Catherine Naughton
Jayne Culley
Ben Skerry
Jacqueline Dickson
Melanie Spears
Karen Taylor
Carrie Cunningham
Meehan Lab
Colm Nestor
Donncha Dunican
Bickmore Lab
Sehrish Rafique
Lee Murphy
Angie Fawkes
Louise Evenden
WTCRF
Bauke Ylstra, VU Medisch Centrum, Amsterdam
Dimitra Dafou
Kate Lawrenson
UCL EGA
Institute for Women’s Health