Using the biclust package in the R Commander

Biostatistics and Statistical Bioinformatics
Setia Pramana
Universitas Brawijaya Malang, 7 October 2011
BECOMING A STATISTICIAN?
Who Needs Statisticians?
• Can a statistician only become a lecturer or teacher?
• No. There are many applied fields.
• My classmates work in:
– Information and communication technology
– Research and development
– Government: Ministry of Finance, PLN, Bank Indonesia, Danareksa, etc.
– Entrepreneurship
– Many more...
• Writer
• Read the book: 9 Summers 10 Autumns
BIOSTATISTICIANS
Biostatistics
• The study of statistics as applied to biological areas such as biological laboratory experiments, medical research (including clinical research), and public health services research.
• "Biostatistics, far from being an unrelated mathematical science, is a discipline essential to modern medicine – a pillar in its edifice" (Journal of the American Medical Association, 1966).
Biostatistics
• Public Health:
– Epidemiology
– Modeling Infectious Diseases: HIV, HCV
– Disease Mapping
– Genetics: family-related diseases
• Bioinformatics
– Image Processing
– Data Mining
– Pattern recognition
– etc
Biostatistics
• Agriculture
– Experimental Design
– Genetics
• Biomedical Research
• Evidence-based medicine
• Clinical studies
• Drug Development
Statistical Methods?
• t-test
• ANOVA
• Regression
• Cluster analysis
• Discriminant analysis
• Non-linear modeling
• Multiple comparison
• Linear mixed model
• Bayesian methods
• etc.
BIOSTATISTICIANS IN DRUG DEVELOPMENT
Drug Development
• Takes 10-15 years
• Costs hundreds of millions to more than 1 billion USD, by commonly cited estimates
• The goal is to ensure that only drugs that are both safe and effective can be marketed.
• Stages:
– Drug discovery
– Pre-clinical development
– Clinical development -> 4 phases
• Statisticians are involved in all stages (a must)
• Pharmaceutical development: discovery of compound; synthesis and purification of drug substance; manufacturing procedures
• Pre-clinical (animal) studies: pharmacological profile; acute toxicity; effects of long-term usage
• Investigational New Drug application
• Phase I clinical trials: small; focus on safety
• Phase II clinical trials: medium size; focus on safety and short-term efficacy
• Phase III clinical trials: large and comparative; focus on efficacy and cost benefits
• New Drug Application
• Phase IV clinical trials: "real world" experience; demonstrate cost benefits; rare adverse reactions
International Conference on Harmonization (ICH)
• The international harmonization of requirements for drug research and development, so that information generated in one country or area is acceptable to other countries or areas.
• Regions: Europe, USA, Japan.
• All clinical trials must follow ICH guidelines.
• Statistics plays an important role.
• Statistical Principles for Clinical Trials (ICH E9).
Preclinical and Clinical Development
• Statisticians are involved from the beginning
of the study
• Planning the study
– Formulating the hypothesis
– Choosing the endpoint
– Choosing the design and sample size
• Conduct of the study
– Patient accrual
– Data collection
• Data quality control and data analysis
• Publication of results
BIOINFORMATICS
Bioinformatics
• Bioinformatics is a science straddling the domains of biomedicine, informatics, mathematics, and statistics.
• It applies computational techniques to biological data:
– Functional genomics
– Proteomics
– Sequence analysis
– Phylogenetics
– etc.
“Informatics” in Bioinformatics
• Databases
– Building, Querying
– Object DB
• Text String Comparison
– Text Search
• Finding Patterns
– AI / Machine Learning
– Clustering
– Data mining
• etc
Central Dogma of Molecular Biology
• Genes contain construction information
• All structure and function are made up of proteins
Genomics
• Premise: Physiological changes -> Gene
expression changes -> mRNA abundance
level changes
• Objective: Use gene expression levels
measured via DNA microarrays to identify a
set of genes that are differentially expressed
across two sets of samples (e.g., in diseased
cells compared to normal cells)
Microarray Technology
• DNA microarrays are a new and promising biotechnology that allows the expression of thousands of genes to be monitored simultaneously.
Gene Expression Analysis
• Overview of the process of generating high-throughput gene expression data using microarrays.
Preprocessed data
Genes   C1    C2    C3    T1    T2    T3
G8521   6.89  7.18  6.60  7.40  7.15  7.40
G8522   6.78  6.55  6.37  6.89  6.78  6.92
G8523   6.52  6.61  6.72  6.51  6.59  6.46
G8524   5.67  5.69  5.88  7.43  7.16  7.31
G8525   5.64  5.91  5.61  7.41  7.49  7.41
G8526   4.63  4.85  5.72  5.71  5.47  5.79
G8527   8.28  7.88  7.84  8.12  7.99  7.97
G8528   7.81  7.58  7.24  7.79  7.38  8.60
G8529   4.26  4.20  4.82  3.11  4.94  3.08
G8530   7.36  7.45  7.31  7.46  7.53  7.35
G8531   5.30  5.36  5.70  5.41  5.73  5.77
G8532   5.84  5.48  5.93  5.84  5.73  5.75
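To make the preprocessed data concrete, the following Python sketch computes a Welch t-statistic per gene, comparing the three C columns with the three T columns. It assumes that C1 to C3 are control arrays and T1 to T3 are treated arrays, which the slide does not state. Only four of the twelve genes are copied in, and the name `welch_t` is illustrative.

```python
from math import sqrt
from statistics import mean, variance

# Four genes copied from the preprocessed data: (C1, C2, C3), (T1, T2, T3).
# Assumption: C = control arrays, T = treated arrays.
expression = {
    "G8523": ([6.52, 6.61, 6.72], [6.51, 6.59, 6.46]),
    "G8524": ([5.67, 5.69, 5.88], [7.43, 7.16, 7.31]),
    "G8525": ([5.64, 5.91, 5.61], [7.41, 7.49, 7.41]),
    "G8529": ([4.26, 4.20, 4.82], [3.11, 4.94, 3.08]),
}


def welch_t(control, treated):
    """Welch t-statistic: difference in means divided by its standard error."""
    standard_error = sqrt(variance(control) / len(control)
                          + variance(treated) / len(treated))
    return (mean(treated) - mean(control)) / standard_error


t_stats = {gene: welch_t(c, t) for gene, (c, t) in expression.items()}

# Rank genes by the absolute size of the statistic, largest first.
ranked = sorted(t_stats, key=lambda gene: abs(t_stats[gene]), reverse=True)

for gene in ranked:
    print(f"{gene}: t = {t_stats[gene]:.2f}")
```

On these rows G8524 and G8525 stand out with large positive statistics, while G8523 and G8529 do not. With only three arrays per group the variance estimates are unstable, which is one motivation for moderated approaches such as SAM and limma.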
Applications
• Drugs with high efficacy and few or no side effects
• Personalized medicine
• Disease-related genes
• Biological discovery
– new and better molecular diagnostics
– new molecular targets for therapy
– finding and refining biological pathways
• Molecular diagnosis of leukemia and breast cancer
• Appropriate treatment for a given genetic signature
• Potential new drug targets
Challenges
• Very large data sets that are difficult to visualize
• Too few records (columns/samples), usually < 100
• Too many rows (genes), usually > 1,000
• Too many genes (rows) are likely to lead to false positives
• For exploration, a large set of all relevant genes is desired
• For diagnostics or identification of therapeutic targets, the smallest set of genes is needed
• The model needs to be explainable to biologists
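The false-positive problem can be made concrete. If 1,000 genes are each tested at the 0.05 level and none is truly differentially expressed, about 1,000 × 0.05 = 50 genes are still expected to look significant. One common remedy, not named on the slide, is the Benjamini-Hochberg procedure, which controls the false discovery rate. The sketch below is a minimal pure-Python version under that assumption; the p-values are made up for illustration.

```python
def benjamini_hochberg(pvalues, alpha=0.05):
    """Return one True/False flag per p-value: True means 'reject'.

    Finds the largest rank k such that the k-th smallest p-value is at
    most (k / m) * alpha, then rejects the k smallest p-values.
    """
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    largest_rank = 0
    for rank, index in enumerate(order, start=1):
        if pvalues[index] <= rank / m * alpha:
            largest_rank = rank
    rejected = [False] * m
    for index in order[:largest_rank]:
        rejected[index] = True
    return rejected


# Hypothetical p-values for ten genes (illustration only).
pvalues = [0.001, 0.008, 0.039, 0.041, 0.042,
           0.060, 0.074, 0.205, 0.212, 0.216]

naive_hits = sum(p <= 0.05 for p in pvalues)      # no correction
bh_hits = sum(benjamini_hochberg(pvalues, 0.05))  # with correction

print("significant without correction:", naive_hits)
print("significant after Benjamini-Hochberg:", bh_hits)
```

The corrected list is shorter than the uncorrected one, which is the intended trade-off: fewer candidate genes, but fewer of them are expected to be false leads.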
Microarray Data Analysis Types
• Gene Selection
– find genes for therapeutic targets
• Classification (Supervised)
– identify disease (biomarker study)
– predict outcome / select best treatment
• Clustering (Unsupervised)
– find new biological classes / refine existing ones
– understand regulatory relationships/pathways
– exploration
Gene Selection
• Modified t-test
• Significance Analysis of Microarrays (SAM)
• Limma (linear models for microarray data)
• Random forest
• Lasso (least absolute shrinkage and selection operator)
• Linear mixed model
• Elastic net
• etc.
Visualization
• Dimensionality reduction
• PCA (Principal Component Analysis)
• Biplot
• Multidimensional scaling
• etc.
Clustering
• Cluster the genes
• Cluster the arrays/conditions
• Cluster both simultaneously
• K-means
• Hierarchical
• Biclustering algorithms
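As an illustration of clustering the genes, here is a minimal k-means sketch in plain Python. It is a teaching sketch, not the routine of any particular package. The four profiles are the rows for G8524, G8530, G8525 and G8527 from the preprocessed data shown earlier. Using the first k profiles as starting centroids is an arbitrary choice made so that the result is reproducible.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length profiles."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, max_iter=50):
    """Plain k-means; the first k points serve as starting centroids."""
    centroids = [list(p) for p in points[:k]]
    labels = []
    for _ in range(max_iter):
        # Assignment step: each profile joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        updated = []
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                updated.append([sum(col) / len(members) for col in zip(*members)])
            else:
                updated.append(centroids[c])  # keep an empty cluster in place
        if updated == centroids:
            break  # converged: assignments can no longer change
        centroids = updated
    return labels, centroids


genes = ["G8524", "G8530", "G8525", "G8527"]
profiles = [
    [5.67, 5.69, 5.88, 7.43, 7.16, 7.31],  # G8524
    [7.36, 7.45, 7.31, 7.46, 7.53, 7.35],  # G8530
    [5.64, 5.91, 5.61, 7.41, 7.49, 7.41],  # G8525
    [8.28, 7.88, 7.84, 8.12, 7.99, 7.97],  # G8527
]

labels, centroids = kmeans(profiles, k=2)
clusters = dict(zip(genes, labels))
print(clusters)
```

G8524 and G8525, whose expression rises from the C to the T columns, end up in one cluster, and the two flat, highly expressed genes end up in the other. Clustering the arrays instead would apply the same function to the transposed matrix.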
Clustering
• Cluster or classify genes according to tumors
• Cluster tumors according to genes
Biclustering
• A biclustering method is an unsupervised learning method that looks for submatrices in a data matrix whose elements are highly similar.
• Algorithms: statistics-based, AI, machine learning.
• BiclustGUI: A User Friendly Interface for Biclustering Analysis
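What "high similarity of elements" means depends on the algorithm. One widely used score, chosen here only as an example and not taken from the slides, is the mean squared residue of Cheng and Church: a submatrix scores 0 when every entry equals its row effect plus its column effect. The sketch below computes that score for a chosen set of rows and columns; the matrix and the function name are made up for illustration.

```python
def mean_squared_residue(matrix, rows, cols):
    """Mean squared residue of the submatrix given by rows and cols.

    Residue of an entry = value - row mean - column mean + overall mean,
    all taken within the submatrix. Lower scores mean more coherence.
    """
    sub = [[matrix[i][j] for j in cols] for i in rows]
    n_rows, n_cols = len(rows), len(cols)
    row_means = [sum(row) / n_cols for row in sub]
    col_means = [sum(sub[i][j] for i in range(n_rows)) / n_rows
                 for j in range(n_cols)]
    overall = sum(row_means) / n_rows
    total = sum(
        (sub[i][j] - row_means[i] - col_means[j] + overall) ** 2
        for i in range(n_rows)
        for j in range(n_cols)
    )
    return total / (n_rows * n_cols)


# Toy matrix (illustration only): rows 0-2 and columns 0-2 follow a
# purely additive pattern; the last row and column break it.
data = [
    [1.0, 2.0, 3.0, 9.0],
    [2.0, 3.0, 4.0, 1.0],
    [4.0, 5.0, 6.0, 7.0],
    [8.0, 1.0, 5.0, 2.0],
]

coherent = mean_squared_residue(data, rows=[0, 1, 2], cols=[0, 1, 2])
whole = mean_squared_residue(data, rows=[0, 1, 2, 3], cols=[0, 1, 2, 3])

print("additive submatrix:", coherent)
print("whole matrix:", whole)
```

A biclustering algorithm built on this score searches for large row and column subsets whose residue stays below a threshold. Other algorithms define similarity differently, for example as constant values.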
Bicluster Structure
Software/Statistical Packages
• Minitab
• SAS
• SPSS
• R
• S-Plus
• Matlab
• Stata
• R is growing rapidly, especially in bioinformatics
– Statistics, data analysis, machine learning
– Free
– High quality
– Open source
– Extensible (you can submit and publish your own package!)
– Can be integrated with other languages (C/C++, Java, Python)
– Large, active user community
– Command-based (a drawback for some users)
Summary
• Statisticians can get involved flexibly in many fields.
• Statistics provides only the tools; the applications range widely.
• Biostatisticians have many opportunities in public health services (e.g., the Centers for Disease Control and Prevention, CDC), pharmaceutical companies, research institutions, etc.
• Statistical bioinformatics: cutting-edge technology -> methods are growing -> many more developments in the future.
Thank you for your attention...
hafidztio@yahoo.com
http://setiopramono.wordpress.com