computational biology, high-throughput data - Frank Emmert

3rd Summer School
in Computational Biology
September 8, 2014
Frank Emmert-Streib
Computational Biology and Machine Learning Laboratory
Center for Cancer Research and Cell Biology
Queen’s University Belfast, UK
Organizers of the summer school
General questions:
Frank Emmert-Streib
f.emmert-streib@qub.ac.uk
Shu-Dong Zhang
s.zhang@qub.ac.uk
Lecturers of the summer school
Ken Mills
Darragh McArt
Ricardo de Matos Simoes
Shailesh Tripathi
Salissou Moutari
James McCann
Alexey Stupnikov
Frank Emmert-Streib
Shu-Dong Zhang
& Kevin Keenan, David Simpson, Caroline Meharg, Myrto Kostadima, Bori Mifsud
We thank our sponsors
History of the summer school
[Figure: number of participants per year, 2012 to 2014; bar labels 18, 25 and 35]
Organizational notes
• Coffee breaks (short - foyer)
• Lunch (1 hour)
• Sign-in sheets
• Internet access:
– Students from QUB: Use your QUB account
– External students: Guest account (Shailesh Tripathi)
Schedule
Introductory level (undergraduate level with respect to computational biology!)
What will we learn?
• different high-throughput data types:
– Microarray data
– Sequencing data (DNA-seq, RNA-seq, ChIP-seq)
• basic statistics and machine learning methods
– Hypothesis testing
– Supervised & unsupervised learning
• basic data visualization
• importance of large-scale data in modern
biology
systems biology
Interdisciplinary summer school
Vision of the VC
Universities require interdisciplinary
engagement in the educational and research
effort
Professor Patrick Johnston, President and
Vice-Chancellor (VC) of Queen’s University
What will we not learn?
(Adjusting expectations)
• Example:
– When learning a foreign language, how much can you
learn in 3 days?
• Analogy:
– programming language
– statistics/machine learning
– biology
The time it takes to become proficient in
computational biology is comparable to the time to
learn a language.
Good news!
• The summer school in computational biology
provides you with a guided start.
• If you are from Belfast:
– Journal club: computational biology and biostatistics
(every Monday in the HSB, 3pm)
– Degree: MSc in Computational Genomics &
Bioinformatics
– General problems/questions: Frank Emmert-Streib
High-throughput data
Data Types
Central Dogma of Molecular Biology
Francis Crick, 1956
Reproducible Research
What is reproducible research?
Reproducible research means that an
entire study can be reproduced, either by the
same researcher or by an independent researcher.
In this context
is important.
Example
In order to understand the meaning of
reproducible research let’s consider the
following examples.
Task: Produce the figure.
[Figure: diagram with the labels x, y, f, R, P(R) and "target concept h"]
Approach:
• Adobe Illustrator
• GIMP
• CorelDraw
• PowerPoint
Summary:
• How long did it take? t = 30 min
• How did you do it? Describe it in a report.
Example
When you publish results, e.g.,
[Figure: the same diagram as in the task above]
and someone wants to repeat the same or a similar
analysis
– How long does a re-analysis take? – 30 min
– How is a re-analysis done? – It depends on the report
you provided and the availability of the software
Alternative way to generate results
Create the figure by writing a program.
• LaTeX
• freely available
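As a rough illustration of creating a figure by writing a program, here is a minimal sketch. The slides name LaTeX as the tool and the course itself uses R; Python is used here only as a neutral stand-in. The function name `make_figure_svg` and the data points are hypothetical and not taken from the slides.

```python
"""Generate a small figure from code so that anyone can regenerate it."""


def make_figure_svg(points, width=200, height=200, margin=20):
    """Return an SVG scatter plot of (x, y) points in the unit square as a string."""
    span_x = width - 2 * margin
    span_y = height - 2 * margin
    x_axis = (
        f'<line x1="{margin}" y1="{height - margin}" '
        f'x2="{width - margin}" y2="{height - margin}" stroke="black"/>'
    )
    y_axis = (
        f'<line x1="{margin}" y1="{margin}" '
        f'x2="{margin}" y2="{height - margin}" stroke="black"/>'
    )
    parts = [
        f'<svg xmlns="http://www.w3.org/2000/svg" width="{width}" height="{height}">',
        x_axis,
        y_axis,
    ]
    for x, y in points:
        cx = margin + x * span_x
        cy = height - margin - y * span_y  # the SVG y axis points downwards
        parts.append(f'<circle cx="{cx:.1f}" cy="{cy:.1f}" r="3" fill="steelblue"/>')
    parts.append("</svg>")
    return "\n".join(parts)


# Hypothetical example data (assumption for illustration, not from the slides).
POINTS = [(0.1, 0.2), (0.4, 0.5), (0.7, 0.6), (0.9, 0.9)]

if __name__ == "__main__":
    # Print the figure; redirect the output to a file to save it.
    print(make_figure_svg(POINTS))
```

Running the script a second time, or on someone else's computer, yields exactly the same figure in a fraction of a second, e.g. `python make_figure.py > figure.svg` (the file names are placeholders).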
Comparison
Criterion | Proprietary software with GUI | Programming language
Time for you to create the figure for the first time | t = 30 min | t = 30 min
Time for you to create the figure for the n-th time | ts < t (ts = 20 min) | tp < t (tp << 1 sec)
Time for someone else to create the same figure for the first time | t' ~ t (t' = 30 min) | t'' ~ tp (t'' << 1 sec)
Need to pay for a license? | Yes | No
Figure reproducible by everyone? | No | Yes
Back to data analysis
The same line of argumentation holds for the
analysis of data.
• Create a figure -> conduct a data analysis
• Adobe Illustrator -> Partek, GenomeStudio etc.
In order to obtain reproducible results in
‘genomics’ we use R.
Reproducible research
• Analyze data by writing programs in R.
• Share your data & your programs with others.
Other groups can reproduce
your results.
For this reason we use R in this summer school.
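The idea of sharing data and programs so that others can reproduce a result can be sketched as follows. The course exercises are in R; this sketch uses Python only for illustration. The data values, the function names `analyse` and `fingerprint`, and the use of a hash to compare runs are assumptions and do not come from the slides.

```python
import hashlib
import json
import statistics

# Hypothetical expression values for two groups (illustration only, not real data).
DATA = {
    "control": [5.1, 4.9, 5.3, 5.0, 5.2],
    "treated": [6.0, 6.2, 5.8, 6.1, 5.9],
}


def analyse(data):
    """Return per-group mean and standard deviation, rounded for stable output."""
    return {
        group: {
            "mean": round(statistics.mean(values), 3),
            "sd": round(statistics.stdev(values), 3),
        }
        for group, values in data.items()
    }


def fingerprint(result):
    """Hash the result so that two runs can be compared with a single string."""
    encoded = json.dumps(result, sort_keys=True).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()


if __name__ == "__main__":
    result = analyse(DATA)
    print(json.dumps(result, indent=2, sort_keys=True))
    print("fingerprint:", fingerprint(result))
```

If the data and the script are shared together, another group can rerun the analysis and check that their fingerprint matches the published one.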
Data sharing
The US National Institutes of Health (NIH) requires
that all genomics data generated with NIH funding
be shared online.
Nature, 4 September 2014
Mandatory!
Enjoy the summer school!