Bioinformatics, Translational Informatics & Personalized Medicine



Uma Chandran, MSIS, PhD

Department of Biomedical Informatics

University of Pittsburgh

chandran@pitt.edu

412-648-9326

07/17/2013

Outline of lecture

• What is Bioinformatics?

– Examples of bioinformatics

– Past to present

• What is translational bioinformatics?

• Personalized Medicine

– Bioinformatics and Personalized Medicine

What is Bioinformatics?

• http://en.wikipedia.org/wiki/Bioinformatics

• Application of information technology to molecular biology

• Databases

• Algorithms

• Statistical techniques

Bioinformatics examples

• Sequence analysis

• Genome annotation

• Evolutionary biology

• Literature analysis

• Analysis of Gene Expression

• Analysis of regulation

• Analysis of protein expression

• Analysis of mutations in cancer

• Comparative genomics

• Systems Biology

• Image analysis

• Protein structure prediction

From Wikipedia

Early Bioinformatics

• Robert Ledley and Margaret Dayhoff

– First bioinformaticians

– Used an IBM 7090 and punch cards to analyze the amino acid structure of proteins

– Created an amino acid scoring matrix

– Protein evolution

– Protein sequence alignment

http://blog.openhelix.eu/?p=1078

Sequence analysis

• Databases to store sequence info

– Phage Φ-X174 sequenced in 1977

– GenBank

• 30,000 organisms

• 143 billion base pairs

– BLAST program for sequence searching

• Algorithms, databases, software tools
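Sequence searching rests on scoring how well a query aligns to each database sequence. The sketch below computes a Smith-Waterman local alignment score as an illustration of that idea. It is not BLAST itself, which relies on faster heuristics, and the match, mismatch and gap values are illustrative assumptions rather than a published scoring matrix.

```python
def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best Smith-Waterman local alignment score of a and b."""
    # prev holds the previous row of the dynamic-programming table
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            same = a[i - 1] == b[j - 1]
            diag = prev[j - 1] + (match if same else mismatch)
            # A local alignment may restart anywhere, so scores never drop below 0
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


# The query "TTAC" occurs exactly inside "GATTACA": four matches at 2 points each
print(local_alignment_score("GATTACA", "TTAC"))
```

Only two rows of the table are kept in memory because the score alone is needed. Recovering the alignment itself would require storing the full table and tracing back.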

Evolutionary biology

• Infer relationships between organisms by comparing

– DNA sequences

– Now whole genomes

• Can even find single base changes, duplications, insertions, deletions

• Uses advanced algorithms, programs and computational resources

Literature mining

• Millions of articles in the literature

• How to find meaningful information

– Natural language processing techniques

• Example

– Typing p53 or PTEN into PubMed retrieves 1000s of publications

– How to summarize all the information for a particular gene?

– Function, disease, mutations, drugs

– The iHOP database creates a network between genes and proteins for 30,000 genes

Genome annotation

• Marking genes and other features in DNA

• Algorithms, software

Bioinformatics

• Interdisciplinary field

– Genes, proteins, function: Biologist

– In cancer: Physician/Scientist/Biologist

– Algorithms, for example BLAST: Math/CS

– Separating signal from noise, differential gene expression, correlation with disease: Statistician

– Tools, software, databases: Software developers, programmers

• Aim to make sense of biological data

Translational bioinformatics

• Translational = benchside to bedside

– Bringing discoveries made at the benchside to clinical use

• The development of storage, analytic, and interpretive methods to optimize the transformation of increasingly voluminous biomedical data into proactive, predictive, preventative, and participatory health. Translational bioinformatics includes research on the development of novel techniques for the integration of biological and clinical data and the evolution of clinical informatics methodology to encompass biological observations. The end product of translational bioinformatics is newly found knowledge from these integrative efforts that can be disseminated to a variety of stakeholders, including biomedical scientists, clinicians, and patients.


Atul Butte, JAMIA 2008;15:709-714 doi:10.1197

Central dogma

• DNA is transcribed to RNA

• RNA is translated to protein

• Many regulatory processes control these steps

Molecular Biology Primer

• 20,000 genes

• Many transcripts, many proteins

• More than 20,000 proteins

• Southern, Northern, Western Blots

Biological questions

• DNA

– Are there any mutations?

• Sickle cell anemia

• Cystic fibrosis

• Hemophilia

• Other diseases such as diabetes and cancer?

– Polymorphisms

• Variation in the population

• Mutation

DNA amplification

• Are there regions of amplification or deletion that correlate with disease?

– If so, what genes are present in these regions?

– HER2 amplification in breast cancer

– EGFR mutations in lung cancer

• RNA

– DNA is transcribed to RNA

– Approximately 20,000 genes

• RNA levels will differ in different conditions

– Liver, kidney, cancer, normal, treatment etc.

– Diagnostic or prognostic

– microRNA levels

– lncRNAs

– Splicing differences


Clinical questions

• DNA level

– Are there mutations or polymorphisms that differ between cancer patient groups?

• Good outcome vs bad outcome

• Early stage vs late stage

• Therapy responders vs non-responders

• Examples: renal cell, prostate cancer etc.

• RNA

– Are there specific transcripts (mRNA, microRNA) that are up or down and are a signature for outcome, disease and response?

– 1000s of studies

– Consortia projects

• TCGA: The Cancer Genome Atlas project

• Profiles 500 samples of each cancer for DNA and RNA changes


Base pairing

• Microarrays and Northern/Southern blots

– Exploit the ability of nucleotides to hybridize to each other

– Base pairing

– Complementary bases

• A:T (U)

• G:C
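The pairing rules above are enough to predict which strand a probe will hybridize to. A minimal sketch, assuming perfect complementarity and ignoring everything else that affects real hybridization (temperature, salt, partial matches); the function names are hypothetical:

```python
# Watson-Crick partners; U (RNA) pairs with A just as T does
PAIRS = {"A": "T", "T": "A", "G": "C", "C": "G", "U": "A"}


def reverse_complement(seq):
    """Return the DNA strand that would base-pair with seq (DNA or RNA)."""
    return "".join(PAIRS[base] for base in reversed(seq.upper()))


def can_hybridize(probe, target):
    """True if a DNA probe is perfectly complementary to a stretch of target."""
    # Treat an RNA target as DNA letters so the two strands are comparable
    return reverse_complement(probe) in target.upper().replace("U", "T")


print(reverse_complement("ATGC"))
print(can_hybridize("GCAT", "AAAUGCAA"))
```

The sequence is reversed as well as complemented because the two strands of a duplex run in opposite directions.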

Northern blot: sensitivity and dynamic range are low

How are these changes measured?

• Example: Northern blot (measures RNA)

– http://www.youtube.com/watch?v=KfHZFyADnNg

– Workflow of Northern blot

• Key points

– mRNA is run on a gel and separated by size

– Transferred to a membrane and immobilized

– Start with a hypothesis, for example studying the RNA level of BRCA in normal and cancer tissue

– Only the probe for one mRNA or transcript is labeled or tagged

– Probe is prepared and labeled with radioactivity

– Probe is hybridized to the membrane, which is then exposed to X-ray film

– Only that mRNA is detected and quantitated

Microarrays

• Solid surface

– Many different technologies

• Affymetrix, Illumina, Agilent

– Probes are synthesized on the solid surface

• Synthesized using proprietary technology

– Probes are selected using proprietary algorithms

– RNA (or DNA) is in solution

– RNA is labeled or tagged

– Hybridized to the chip

– Tagged RNA is quantitated

– Compare between conditions

Affymetrix

Need for computational methods

• Data Management

– Each file for a chip experiment is large

• 100 MB x 10 = 1 GB

• Generates Gigabytes of data

• Data preprocessing

– Convert raw image into signal values

• Data analysis

– 1000s of genes (or SNPs) and few samples

– How to find differences between samples

– What statistical methods to use?

– Like finding needle in a haystack

How to analyze?

• Noise reduction

• Background subtraction

• Normalization

[Figure: data matrix of expression signal values, with genes (Affymetrix probe sets such as 1008_f_at, 1017_at and 1104_s_at) as rows and normal and tumor samples as columns]
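The preprocessing steps named on this slide (background subtraction and normalization) can be sketched on a small signal matrix. This is a minimal illustration, not the vendor's pipeline: the slide does not say which normalization is meant, so quantile normalization is an assumption here, as are the function names, the floor value and the toy numbers.

```python
import math


def background_subtract(signal, background, floor=1.0):
    """Subtract a background estimate, keeping values positive for the log."""
    return [max(s - background, floor) for s in signal]


def log2_transform(signal):
    """Put intensities on a log2 scale, as is usual for expression data."""
    return [math.log2(s) for s in signal]


def quantile_normalize(arrays):
    """Force every array (one list per chip) onto the same distribution.

    Ties within an array are broken by position.
    """
    n = len(arrays[0])
    sorted_arrays = [sorted(a) for a in arrays]
    # Mean of the k-th smallest value across all arrays
    rank_means = [sum(a[k] for a in sorted_arrays) / len(arrays) for k in range(n)]
    normalized = []
    for a in arrays:
        order = sorted(range(n), key=lambda i: a[i])
        out = [0.0] * n
        for rank, idx in enumerate(order):
            out[idx] = rank_means[rank]
        normalized.append(out)
    return normalized


# Toy values for two chips and three probe sets (not taken from the slide)
chips = [[5.0, 2.0, 3.0], [4.0, 1.0, 6.0]]
print(quantile_normalize(chips))
```

After quantile normalization each chip contains the same set of values, so remaining differences between chips reflect which genes rank high or low rather than overall chip brightness.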

Data analysis

• Class discovery

– Are there novel subclasses within data?

• Class comparison

– How are tumor and normal different in expression?

– Which SNPs are different?

• Class prediction

– Predict class of new sample

• Advanced pathway analysis
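Class comparison, for instance tumor versus normal, is often approached gene by gene. The sketch below ranks genes by Welch's t statistic and reports the log2 fold change. It is one common approach rather than the method of any particular study; the gene names and values are invented for illustration, and each group is assumed to have at least two samples and non-zero variance.

```python
import math
from statistics import mean, variance


def welch_t(group_a, group_b):
    """Welch's t statistic for two lists of log2 expression values."""
    se = math.sqrt(variance(group_a) / len(group_a)
                   + variance(group_b) / len(group_b))
    return (mean(group_a) - mean(group_b)) / se


def rank_genes(tumor, normal):
    """Return (gene, log2 fold change, t) tuples, largest |t| first.

    tumor and normal map each gene name to a list of log2 values.
    """
    results = []
    for gene in tumor:
        fold_change = mean(tumor[gene]) - mean(normal[gene])
        results.append((gene, fold_change, welch_t(tumor[gene], normal[gene])))
    return sorted(results, key=lambda r: abs(r[2]), reverse=True)


# Hypothetical log2 values: geneA differs between classes, geneB does not
tumor = {"geneA": [5.0, 5.2, 4.8], "geneB": [3.0, 3.5, 2.5]}
normal = {"geneA": [2.0, 2.1, 1.9], "geneB": [3.1, 2.9, 3.0]}
for gene, fold_change, t in rank_genes(tumor, normal):
    print(gene, round(fold_change, 2), round(t, 2))
```

With thousands of genes and few samples, some large t values arise by chance alone, so a real analysis would also correct for multiple testing.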

Pathway Analysis

Analytic methods – many studies, many methods

Dupuy and Simon, JNCI; 2007

SNPs to detect Copy Number changes

[Figure: SNP array profile with regions labeled amplification, diploid and deletion]

Hagenkord et al; Modern Pathology, 21:599
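One simple way to turn SNP array intensities into copy number calls is to compare each signal with a diploid reference on a log2 scale. The sketch below is illustrative only: the thresholds of +0.3 and -0.3 are assumptions, not values from Hagenkord et al, and the function name is hypothetical.

```python
import math


def call_copy_number(sample_signal, reference_signal, gain=0.3, loss=-0.3):
    """Classify a region from its log2(sample / reference) intensity ratio."""
    ratio = math.log2(sample_signal / reference_signal)
    if ratio >= gain:
        return "amplification"
    if ratio <= loss:
        return "deletion"
    return "diploid"


# Equal signal gives a log2 ratio of 0, which falls between the thresholds
print(call_copy_number(100.0, 100.0))
```

A ratio near 0 means the sample matches the two-copy reference. Real data are noisy, so calls are usually made on segments of neighboring SNPs rather than on single probes.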

What is personalized medicine

Personalized medicine is the tailoring of medical treatment to the individual characteristics of each patient.

• Based on scientific breakthroughs in understanding of how a person’s unique molecular and genetic profile makes them susceptible to certain diseases.

• Ability to predict which medical treatments will be safe and effective for each patient, and which will not

From ageofpersonalizedmedicine.org

Personalized Medicine

From ageofpersonalizedmedicine.org

Personalized Medicine

From Fernald et al; Bioinformatics, 13: 1741

Examples of personalized medicine

• Breast cancer

– 30% of patients overexpress HER2

– Treated with Herceptin

– Oncotype Dx: gene expression predicting recurrence

• Cardiovascular

– Patient response to warfarin, a blood thinner

– Response determined by polymorphisms in CYP genes

Personalized Medicine

• Examples of personalized medicine resulted from studies that

– Generate lots of data

– Rely on bioinformatics methods to discover these associations

• Oncotype Dx:

– Gene expression studies of a large number of patients

• CYP polymorphisms

– Discover single nucleotide polymorphisms in patient populations and their association with response

» Initial studies done with PCR methods

Personalized Medicine

• Current examples are few in number

• Making personalized medicine a reality

– Generate the data

– Discover the associations

– Find targeted therapies

– Genome sequencing prices are dropping

– Large scale genome information is coming:

• 1000 Genomes Project

• TCGA

• ICGC

• Also possible to commercially sequence a person’s genome

• Processing all this data and translating these discoveries into medical practice poses many challenges

Bioinformatics challenges in personalized medicine

• Processing large scale robust genomic data

• Interpreting the functional impact of variants

• Integrating data to relate complex interactions with phenotypes

• Translating into medical practice

Fernald et al; Bioinformatics: 13: 1741

Era of Personalized medicine

• Shift from microarrays to Next Gen Sequencing


Next Gen Sequencing

• Directly sequence DNA to determine

– SNP

– Copy number (CN)

– Expression: mRNA, microRNA

– Protein binding sites

– Methylation

• Initial steps depend not only on hybridization but also on base pairing (complementarity) and DNA synthesis

• Bioinformatics is extremely challenging

Next Gen Sequencing

NGS in personalized medicine

• Whole genome sequencing

– Sequence genomes and find variants (1000 Genomes Project)

• Find variants associated with disease phenotype

• Sequence exomes only

– Find coding region variants associated with phenotypes

• RNA-seq

– RNA sequence signatures associated with phenotype

Microarrays vs NGS RNA-seq

• Microarrays

– Restricted to probes on chips

– Only transcripts with probes

– File sizes in MBs to GB

– Algorithms, methods

– Typically done on PCs

– Storage on hard drives

• NGS RNA-seq

– No predetermined probes

– Can detect everything that is sequenced

– More applications than microarrays

– Very large file sizes

– Computationally very intensive

– Clusters, supercomputers

– Large scale storage solutions

Microarrays vs RNA-seq expression analysis

• Microarrays

– Dynamic range is low

– Statistics to determine expression are based on signal

– Many methods in the last 10 years

• RNA-seq

– Dynamic range is high (based on reads)

– Statistics are based on counts, which are affected by read length, total number of transcripts and lack of replicates
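Because raw read counts depend on how long a transcript is and how deeply a sample was sequenced, counts are usually rescaled before samples are compared. One widely used rescaling is RPKM (reads per kilobase of transcript per million mapped reads). It is shown here as an illustration and is not necessarily the statistic the slide has in mind; the numbers in the usage line are made up.

```python
def rpkm(read_count, transcript_length_bp, total_mapped_reads):
    """Reads per kilobase of transcript per million mapped reads."""
    # 1e9 = 1e3 (bases per kilobase) * 1e6 (reads per million)
    return read_count * 1e9 / (transcript_length_bp * total_mapped_reads)


# 1,000 reads on a 2,000 bp transcript in a library of 10 million mapped reads
print(rpkm(1000, 2000, 10_000_000))
```

Dividing by transcript length keeps long genes from looking more highly expressed merely because they yield more fragments, and dividing by library size makes samples sequenced to different depths comparable.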

Read mapping and alignment

• De novo assembly

• Mapping to reference genome

– Based on matching a given 35-nucleotide read against the entire genome

– Computationally intensive

• Millions of 35 bp reads have to be searched against the reference and aligned specifically to a given region

– Large file sizes

• Sequence files in the TB range

• Aligned files (BAM files): several hundred GB
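Mapping a read to a reference can be sketched as a seed-and-verify search: index every k-mer of the reference, use pieces of the read to look up candidate positions, then count mismatches at each candidate. This is a toy illustration with hypothetical function names and a made-up reference. Production aligners use far more compact indexes and also handle insertions and deletions, which this sketch does not.

```python
from collections import defaultdict


def build_index(reference, k):
    """Map every k-mer in the reference to the positions where it starts."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index


def map_read(read, reference, index, k, max_mismatches=1):
    """Return sorted reference positions where read aligns with few mismatches."""
    hits = set()
    # Use each non-overlapping k-mer of the read as a seed
    for offset in range(0, len(read) - k + 1, k):
        for pos in index.get(read[offset:offset + k], []):
            start = pos - offset
            if start < 0 or start + len(read) > len(reference):
                continue
            window = reference[start:start + len(read)]
            mismatches = sum(1 for x, y in zip(read, window) if x != y)
            if mismatches <= max_mismatches:
                hits.add(start)
    return sorted(hits)


reference = "TTGACCATGGCATTACGGA"  # made-up sequence
index = build_index(reference, 4)
print(map_read("CATGGCAT", reference, index, 4))
```

The `max_mismatches` parameter is where the tolerance question shows up: set too low, reads that carry a true variant fail to map; set too high, reads map to the wrong place.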

Reference genome

Sequence variation

Bioinformatics challenges in personalized medicine

• Processing large scale robust genomic data

– Suppose we want to identify DNA variants associated with disease

• Which technology

• How much data

• How to analyze the data

• How to identify variants

• Each genome can have millions of variants

• 300,000 new variants, i.e. not in existing databases

– Will have to separate error from true variants

– 1 error per 100 kb can lead to 30,000 errors in a single experiment

• Why do these errors happen?

Fernald et al; Bioinformatics: 13: 1741
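The figure of 30,000 errors follows from simple arithmetic if the genome is taken to be about 3 billion base pairs. That genome size is an assumption supplied here for the worked example; it is not stated on the slide.

```python
HUMAN_GENOME_BP = 3_000_000_000  # assumption: roughly 3 billion base pairs


def expected_errors(genome_size_bp, bases_per_error):
    """Expected number of erroneous calls at one error per bases_per_error."""
    return genome_size_bp / bases_per_error


# One error per 100 kb across the whole genome
print(expected_errors(HUMAN_GENOME_BP, 100_000))
```

Set against roughly 300,000 novel variants per genome, errors on this scale are a sizeable fraction of the candidates, which is why separating errors from true variants matters.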

Bioinformatics Challenges

• Data

• Which technology to use

– Each technology has a different error rate: Ion Torrent (higher error rate), SOLiD, Illumina

– Speed of data generation: Ion Torrent is faster

• Application: whole genome, exome or targeted exome

• Analysis

– Algorithms, speed, accuracy

– BLAST is not good for WGS

– Other new algorithms

• Speed of analysis

– Alignment can take days

• Alignment relies on matches between sequence and reference genome

– How many mismatches to tolerate?

– Is a mismatch a sequencing error or a true variant, i.e. a SNP?

• Quality of reference genome

• Large amounts of data

– Each whole genome sequencing experiment can generate TB of data

• Where to store – patient privacy

– Servers, locations, networking

• Sample sizes – how many samples to sequence to discover the association with disease

Bioinformatics Challenges

• Technology

– Ion Torrent, SOLiD, Illumina

– Each has its own error rates

– Speed of data generation

– Dependent on application: WGS or exome

• Analysis

– Algorithms, speed, accuracy

• Speed of analysis

– Alignment can take days

• Alignment relies on matches between sequence and reference genome

– How many mismatches to tolerate?

– Is a mismatch a sequencing error or a true variant, i.e. a SNP?

• Quality of reference genome

From Mark Boguski’s presentation at the IOM, July 19, 2011


Molecular Diagnostics using NGS

From Mark Boguski’s presentation at the IOM, July 19, 2011

NGS Bioinformatics - medicine

• Infrastructure

– Storage, backup, archive

– Where – HIPAA compliant?

– Network

• How to move data

• Analysis

– Methods – statistics, annotation

– Computing resources

– How many samples can be handled at a time?

– Time to report

NGS and bioinformatics

Next Gen Sequencing

From Mark Boguski’s presentation at the IOM, July 19, 2011
