Lecture 10

L10:
Classic RF to uncover
biological interactions
Kirill Bessonov
GBIO0002
Nov 24th 2015
1
Talk Plan
• Trees
– Basic concepts
– Examples
• Tree-based algorithms
– Regression trees
– Random Forest
• Practical on RF
– RF variable selection
• Networks
– network vocabulary
– biological networks
2
Data Structures
• arrangement of data in a computer's memory
• Convenient access by algorithms
• Main types
– Arrays
– Lists
– Stack
– Queue
– Binary Trees
– Graph
[Figures: examples of a tree, a stack, a queue and a graph]
3
Trees
• A data structure with
– hierarchical relationships
• Basic elements
– Nodes (N)
• Variables
• Features
• e.g. files, genes, cities
– Edges (E)
• directed links from lower
to higher depth
4
Nodes (N)
• Usually defined by one variable
• Selected from the {x1 … xp} variables
– Different selection criteria
• E.g. strength of association to response (Y~X)
• E.g. “best” split
• Others
• Node variable could take many forms
– question
– feature
– data point
5
Binary splits
• A node can be split in
– two ways (binary split)
• two child nodes
– multiple ways (multi-split)
• several child nodes
• Travelling from the top to the bottom of a tree
– is like being lost in a cave maze
6
Edges (E)
• Edges connect
– parent and child nodes (parent → child)
– they are directional
• Do not have weights
• Represent node splits
[Figure: a parent node linked to its children]
7
Tree types
• Decision
– To move to the next node, a decision needs to be taken
• Classification
– Allows predicting a class
• use input to predict output → class label
• i.e. classify the input
• Regression
– Allows predicting an output value
• i.e. use input to predict a (continuous) output
8
Predicting response(s)
• Input: data on a sample; Nodes: variables
• Travel from root down to child nodes
9
Decision Tree example
10
Classification tree
Will a banking customer accept a personal loan? Classes = {yes, no}
• Leaf nodes predict class of input (i.e. customer)
• Output class label – yes or no answer
11
Classification example
Name   Cough   Fever   Weight   Pain   Class
Marie  yes     yes     skinny   none   ?
Jean   no      no      normal   none   ?
Marc   yes     no      normal   none   ?
12
Classification example
Name   Cough   Fever   Weight   Pain   Class
Marie  yes     yes     skinny   none   flu
Jean   no      no      normal   none   none
Marc   yes     no      normal   none   cold
13
Regression trees
• Predict outcome – e.g. price of a car
14
Trees
• Purpose 1: recursively partition the data
– cut the data space with perpendicular hyperplanes
• Purpose 2: classify data
• class label at the leaf node
• E.g. will a potential customer respond to
a direct mailing?
– predicted binary class: YES or NO
Source: DECISION TREES by Lior Rokach
Tree growth and splitting
• In a top-down approach
– assign all data to the root node
– select attribute(s)/feature(s) to split the node
• Stop tree growth when
– max depth is reached
– the splitting criterion is not met
[Figure: successive splits X < x / X > x and Y < y / Y > y ending in leaf/terminal nodes]
Recursive splitting (1)
[Figure: data plotted against Variable 1 and Variable 2, with the first splits shown]
Recursive splitting (2)
• Each node corresponds to a quadrant of the data space
Recursive splitting (3)
19
Stopping rules
• When is splitting enough?
– Max depth reached
• No new levels wanted
– Node contains too few samples
• Prone to unreliable results
– Further splits do not improve
• purity of child nodes
• association of Y~X is below threshold
20
Other tree uses
• Trees can be used also for
– Clustering
– Hierarchy determination
• E.g. phylogenetic trees
• Convenient visualization
– effective visual condensation of the
clustering results
• Gene Ontology
– Directed acyclic graph (DAG)
– Example of functional hierarchy
21
GO tree example
22
last common ancestor
23
Alignment trees
24
Tree Ensembles
Random Forest (RF)
25
Random Forest
• Introduced by Breiman in 1999
– Ensemble of tree predictors
– Let each tree “vote” for the most popular class
• Significant performance improvement
– over previous classification algorithms
• CART and C4.5
• Relies on randomness
• Variable selection based on purity of a split
– GINI index
26
Random Forests
Randomly build an ensemble of trees:
1. Bootstrap a sample of the data and start building a tree
2. Create a node by
  a. randomly selecting m variables out of M
  b. keeping m constant (except for terminal nodes)
3. Split the node based on the m variables
4. Grow the tree until no more splits are possible
5. Repeat steps 1-4 n times
→ generates an ensemble of trees
6. Calculate variable importance for each predictor variable X
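A minimal R sketch of steps 1-5 above (illustrative only, not the actual randomForest internals; the gini() helper, the iris-based data frame and the fixed depth limit are assumptions):

gini <- function(y) 1 - sum((table(y) / length(y))^2)

grow <- function(d, mtry, depth = 1, max_depth = 5) {
  y <- d$y
  if (gini(y) == 0 || nrow(d) < 2 || depth == max_depth)
    return(names(which.max(table(y))))            # leaf: majority class
  vars <- sample(setdiff(names(d), "y"), mtry)    # step 2a: m random variables
  best <- NULL
  for (v in vars) for (s in unique(d[[v]])) {     # step 3: best split among them
    left <- d[[v]] <= s
    if (!any(left) || all(left)) next
    g <- mean(left) * gini(y[left]) + mean(!left) * gini(y[!left])
    if (is.null(best) || g < best$g) best <- list(v = v, s = s, g = g)
  }
  if (is.null(best)) return(names(which.max(table(y))))
  list(var = best$v, split = best$s,              # step 4: keep splitting
       left  = grow(d[d[[best$v]] <= best$s, ], mtry, depth + 1, max_depth),
       right = grow(d[d[[best$v]] >  best$s, ], mtry, depth + 1, max_depth))
}

# steps 1 and 5: bootstrap the rows and repeat to obtain an ensemble of trees
d <- data.frame(iris[1:4], y = iris$Species)
forest <- replicate(25, grow(d[sample(nrow(d), replace = TRUE), ], mtry = 2),
                    simplify = FALSE)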
Random Forest animation
[Animation: bootstrap samples of the rows {A, B, C, D} are drawn repeatedly; each tree of the ensemble is grown by splitting the sampled data on randomly chosen variables and thresholds]
Building a forest
• Forest
– Collection of several trees
• Random Forest
– Aggregation of several decision trees
• Logic
– a single tree: performance is too variable
– a forest of trees: good and stable performance
• predictions are averaged over several trees
29
Splits
• Based on the purity of a node's split
– Gini impurity criterion (GINI index)
– measures the impurity of the outputs after a split
• A split of a node is made on the variable m
– with the lowest Gini index of the split (slide 33)
gini(N) = 1 − Σ_j p_j²
where j is a class and p_j is the probability of class j (proportion of samples with class j)
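A quick R check of this formula (the helper name is ours, not a library function):

gini_p <- function(p) 1 - sum(p^2)   # p = vector of class proportions in the node
gini_p(c(0.5, 0.5))   # 0.5 -> maximally impure two-class node
gini_p(1)             # 0   -> pure node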
30
Goodness of split
Which two splits would give
– the highest purity ?
– The lowest Gini Index ?
31
GINI Index calculation
• At a given tree node the probabilities are
p(normal) = 0.4 and p(asthma) = 0.6. Calculate the
node's GINI index.
Gini index = 1 − (0.4² + 0.6²) = 0.48
• What will be the GINI index of a “pure” node?
– Gini index = 0
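Both answers can be checked with the gini_p() helper defined earlier:

gini_p(c(0.4, 0.6))   # 1 - (0.4^2 + 0.6^2) = 0.48
gini_p(c(1, 0))       # a "pure" node gives 0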
32
GINI Index of a split
• When building a tree
– the GINI index of a split is used instead
• Given a node N with S samples and a splitting value j*
– the left child (N_left) has S_left samples
– the right child (N_right) has S_right samples
Gini index_split(N) = (S_left / S) · gini(N_left) + (S_right / S) · gini(N_right)
33
GINI Index split
• Given tax evasion data sorted by the income variable, which split value is the best according to the GINI index of the split?

Income        10K   20K   30K   40K   50K    60K   70K   80K    90K    100K  110K
Yes (<= | >)  0|3   0|3   0|3   0|3   1|2    2|1   3|0   3|0    3|0    3|0   3|0
No  (<= | >)  0|7   1|6   2|5   3|4   3|4    3|4   3|4   4|3    5|2    6|1   7|0
Gini split    0.42  0.40  0.375 0.343 0.417  0.40  0.30  0.343  0.375  0.40  0.42
For the best split (income ≤ 70K vs > 70K):
Gini index_split = (6/10) · (1 − (3/6)² − (3/6)²) + (4/10) · (1 − (0/4)² − (4/4)²)
               = 0.6 · 0.5 + 0.4 · 0
               = 0.3
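The whole table can be reproduced in R. A hedged reconstruction: the exact incomes are assumptions (any value inside each 10K bin gives the same counts); only the class pattern follows the table above.

income <- c(15, 25, 35, 45, 55, 65, 75, 85, 95, 105)                 # in K
evade  <- factor(c("No","No","No","Yes","Yes","Yes","No","No","No","No"))

gini_node <- function(y) if (length(y) == 0) 0 else 1 - sum((table(y) / length(y))^2)

gini_split <- function(thr) {
  left <- income <= thr
  mean(left) * gini_node(evade[left]) + mean(!left) * gini_node(evade[!left])
}

round(sapply(seq(10, 110, by = 10), gini_split), 3)
# 0.420 0.400 0.375 0.343 0.417 0.400 0.300 0.343 0.375 0.400 0.420
# -> the lowest Gini index of the split (0.3) is reached at the 70K threshold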
34
mtry parameter
• To build each node, m variables are selected
• Specified by the mtry parameter
• Allows building different trees
– randomized selection of variables at each split
– gives heterogeneity to the forest
• Given X = {A,B,C,D,E,F,G} and mtry = 2
– Node N1 = {A,B}
– Node N2 = {C,G}
– Node N3 = {A,D}
– …
• Default mtry = sqrt(p)
– p = number of predictor variables
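A hedged illustration with the randomForest library (iris is used only as a stand-in dataset): for classification the default mtry is floor(sqrt(p)), and it can be overridden.

library(randomForest)
rf_default <- randomForest(Species ~ ., data = iris)            # p = 4, so mtry = floor(sqrt(4)) = 2
rf_mtry3   <- randomForest(Species ~ ., data = iris, mtry = 3)  # 3 candidate variables per split
rf_default$mtry   # 2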
35
RF stopping criteria
• RF – an ensemble of non-pruned decision trees
• Grow each tree until
– the node has maximum purity (GINI index = 0)
• all samples are of the same class
– no more samples are left for the next split
• 1 sample in a node
• Greedy splitting
• randomForest library minimum samples per node
– 1 in classification
– 5 in regression
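These minima correspond to the nodesize argument of randomForest(); a hedged illustration (iris again only as a stand-in):

library(randomForest)
rf_class <- randomForest(Species ~ ., data = iris)                     # nodesize defaults to 1
rf_reg   <- randomForest(Sepal.Length ~ ., data = iris, nodesize = 5)  # 5 is the regression default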
36
Performance comparison
• RF handles
– missing values
– continuous and categorical predictors (i.e. X)
– and high-dimensional datasets where p >> N
• Improves over single tree performance
* Lower is better
Source: Breiman, Leo. "Statistical modeling: The two cultures." Quality control and applied statistics 48.1 (2003): 81-82.
37
RF variable importance (1)
• Need to estimate the “importance” of each
predictor {x1 … xp} in predicting a response y
• Ranking of predictors
• Variable importance measure (VIM)
– Classification: misclassification rate (MR)
– Regression: mean square error (MSE)
• VIM – the increase in the mean of the errors (MR
or MSE) over the forest when the values of a
predictor are randomly permuted in the OOB samples
38
Out-of-bag (OOB) samples
• Dataset divided into
– a bootstrap sample (i.e. training)
– the OOB sample (i.e. testing)
• Trees are built on the bootstrap sample
• Predictions are made on the OOB sample
• Benefits
– avoids over-fitting
• i.e. overly optimistic (false) results
[Figure: data split into bootstrap and OOB parts; OOB predictions give a VIM per variable, e.g. X1 → 1.5, X2 → 1.2, X3 → −0.5]
RF variable importance (2)
1. Predict the classes of the OOB samples using each
tree of the RF
2. Calculate the misclassification rate = out-of-bag
error rate (OOBerror_obs) for each tree
3. For each variable in the tree, permute the
variable's values
4. Using the tree, compute the permutation-based
out-of-bag error (OOBerror_perm)
5. Aggregate OOBerror_perm over all trees
6. Compute the final VIM
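A minimal R sketch of the permutation idea in steps 3-4 (a simplification: it permutes a predictor in a held-out data frame rather than in the per-tree OOB samples; the function name is ours):

perm_vim <- function(model, data, response) {
  base_err <- mean(predict(model, data) != data[[response]])
  sapply(setdiff(names(data), response), function(v) {
    d <- data
    d[[v]] <- sample(d[[v]])      # permute this predictor, destroying its link to the response
    mean(predict(model, d) != d[[response]]) - base_err
  })
}

library(randomForest)
rf <- randomForest(Species ~ ., data = iris)
perm_vim(rf, iris, "Species")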
40
RF variable importance (3)
• Mathematically the VIM of a variable x_i is defined as
VIM(x_i) = (1 / ntree) · Σ_{t=1..ntree} ( OOBerror_perm(t) − OOBerror_obs(t) )
• VIM domain: (−∞; +∞)
– Thus, the VIM could be negative
• Means that the variable is uninformative
• The ideal situation is when OOBerror_perm > OOBerror_obs
41
Aims of variable selection
• Find variables related to response
– E.g. predictive of class with highest probability
• Simplify problem
– Summarize the dataset by fewer variables
– Decrease dimensionality
42
RFs in R
randomForest library
Titanic example
43
randomForest
• Well implemented library
• Good performance
• Main functions
– randomForest()
– importance()
• Install library
–
install.packages("randomForest", repos="http://cran.freestatistics.org")
• Load titanic_example.Rdata
44
randomForest
Which factor was the most important in survival of passengers?
library(randomForest);
titanic_data =
read.table(file="http://biostat.mc.vanderbilt.edu/wiki/pub/Main/DataSets/titanic.txt",
header=T, sep=",");
train_idx = sample(1:1313, round(0.75*1313));
test_idx = which(!1:1313 %in% train_idx);
titanic_data.train = titanic_data[train_idx,];
titanic_data.test = titanic_data[test_idx,];
titanic.survival.train.rf = randomForest(as.factor(survived) ~ pclass + sex + age,
data=titanic_data.train, ntree=1000, importance=TRUE, na.action=na.omit);
c_matrix = titanic.survival.train.rf$confusion[1:2,1:2];
print("accuracy in training data");
sum( diag(c_matrix) ) / sum(c_matrix);
imp = importance(titanic.survival.train.rf);
print(imp);
          0         1   MeanDecreaseAccuracy   MeanDecreaseGini
pclass  25.08341  27.00470       30.59483           16.38874
sex     77.77933  82.17791       84.19724           74.51846
age     22.82038  22.48106       30.55145           18.07370
(columns 0 and 1 are the two classes of the survived response)
varImpPlot(titanic.survival.train.rf);
45
Variable importance
• Which variable was the most important in
survival of passengers?
46
RF to uncover
networks of interactions
47
One to many
• So far we have seen one response Y and many predictors X
• We can use RF to predict many Ys sequentially
– to create an interaction network
48
RF to interaction networks (1)
• Can we build networks from tree ensembles?
– Need to consider all possible interactions
• Y1~X, then Y2~X … Yp~X
– Need to “shift” Y to new variable
• Assign Y to new variable (previously X)
– Complete matrix of interactions (i.e. network)
49
RF to interaction networks (1)
• For example given continuous data and A,B,C
variables, build an interaction network
– Consider interaction scenarios
• A ~ {B,C}
• B ~ {A,C}
• C ~ {A,B}
– Need to have 3 RF runs giving 3 sets of VIMs
– Fill out interaction network matrix A
      A   B   C
  A   0   .   .
  B   .   0   .
  C   .   .   0
Interaction network (p × p matrix, diagonal = 0, to be filled with VIMs)
50
RF to interaction networks (2)
• Read in input data matrix D with 3 variables
– Continuous scale
• Aim: variable ranking and response prediction
• Load RFtoNetworks.Rdata
rf = randomForest(A ~ B + C, data=D, importance=T);
importance(rf, 1)
   %IncMSE
B 8.706760
C 8.961513

rf = randomForest(B ~ A + C, data=D, importance=T);
importance(rf, 1)
   %IncMSE
A 9.603829
C 3.271325

rf = randomForest(C ~ A + B, data=D, importance=T);
importance(rf, 1)
   %IncMSE
A 8.830840
B 1.519951
D =
       A      B      C
1    6.27   4.41   7.95
2    3.35  11.18   5.76
3    1.11   1.32   7.74
4    3.64   6.82   3.83
5   10.42  10.51   5.05
6    1.83   2.67   7.78
7    6.07   6.24   5.30
8    3.50   6.85   1.56
9    8.44  10.50   4.89
10   9.73   9.15   8.05
51
RF to interaction networks (3)
• Fill out the interaction matrix A

         A          B          C
A        0     8.706760   8.961513
B   9.603829       0      3.271325
C   8.830840   1.519951       0
• The resulting network
[Figure: network with nodes A, B and C, edges weighted by the VIMs above (e.g. A–B: 8.70/9.60, B–C: 3.27/1.51, A–C: 8.96/8.83)]
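A sketch of how the three runs on the previous slide could be automated for any number of variables (it assumes the data frame D from above; the loop and the matrix orientation, rows = responses and columns = predictors, are ours):

library(randomForest)
vars <- colnames(D)
net  <- matrix(0, length(vars), length(vars), dimnames = list(vars, vars))
for (v in vars) {
  rf  <- randomForest(reformulate(setdiff(vars, v), response = v),
                      data = D, importance = TRUE)
  imp <- importance(rf, type = 1)     # %IncMSE of each predictor for this response
  net[v, rownames(imp)] <- imp[, 1]
}
net    # the interaction matrix A filled programmatically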
52
Networks of
biological interactions
53
Networks
• What comes to your mind? Related terms?
• Where can we find networks?
• Why should we care to study them?
54
We are surrounded by networks
55
56
Transportation Networks
57
Computer Networks
58
Social networks
59
Internet submarine cable map
60
Social interaction patterns
61
PPI (Protein Interaction Networks)
• Nodes
– protein names
• Links
– physical binding events
62
Network Definitions
63
Network components
• Networks also called graphs
– Graph (G) contains
• Nodes (N): genes, SNPs, cities, PCs, etc.
• Edges (E): links connecting two nodes
64
Characteristics
• Networks are
– Complex
– Dynamic
– Can be used to reduce data dimensionality
[Figure: a network changing from time = t0 to time = t]
65
Topology
• Refers to connection pattern of a network
– The pattern of links
66
Modules
• Sub-networks with
– Specific topology
– Function
• Biological context
– Protein complex
– Common function
• E.g. energy production
clique
67
Network types
• Directed
– Edges have directionality
– Some links are unidirectional
– Direction matters
• Going A → B is not the same as going B → A
– Analogous to chemical reactions
• Forward rate might not be the same as reverse
– E.g. directed gene regulatory networks (TF → gene)
• Undirected
– Edges have no directionality
– Simpler to describe and work with
– E.g. co-expression networks
68
Edges Types
A graph has N nodes and E edges; edges can be directed or undirected.
Neighbours of node(s)
• Neighbours(node, order) = {node1 … nodep}
• Neighbours(3,1) = {2,4}
• Neighbours(2,2) = {1,3,5,4}
70
Node degree (k)
• the number of edges connected to the node
• k(6) = 1
• k(4) = 3
71
Connectivity matrix
(also known as adjacency matrix)
• Size: N × N
• Entries can be binary or weighted
• A = [Figure: adjacency matrix of the example graph]
Degree distribution (P(k))
• Determines the statistical properties of
uncorrelated networks
source: http://www.network-science.org/powerlaw_scalefree_node_degree_distribution.html
73
Topology: random
Degree distribution of nodes is statistically independent
Topology: Scale-free
• Biological processes are characterized by this topology
– Few hubs (highly connected nodes)
– Predominance of poorly connected nodes
– New vertices attach preferentially to highly connected ones
• Barabási, Albert-László, and Réka Albert. "Emergence of scaling in random networks." Science 286.5439 (1999): 509-512.
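A small igraph sketch of the preferential-attachment mechanism described above (igraph and the plotting choices are assumptions, not part of the lecture material):

library(igraph)
set.seed(1)
g  <- sample_pa(1000, power = 1, directed = FALSE)   # Barabasi-Albert growth model
dd <- degree_distribution(g)                          # P(k) for k = 0, 1, 2, ...
k  <- which(dd > 0) - 1                               # degrees actually observed
plot(k, dd[k + 1], log = "xy", xlab = "degree k", ylab = "P(k)")   # roughly a straight line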
75
Topologies: scale-free
Most real networks have a degree distribution that follows a power law, for example:
• the sizes of earthquakes and of craters on the moon
• solar flares
• the sizes of activity patterns of neuronal
populations
• the frequencies of words in most languages
• frequencies of family names
• sizes of power outages
• criminal charges per convict
• and many more
Shortest path (p)
• Indicates the distance between i and j in
terms of geodesics (unweighted)
• p(1,3) = ? Candidate paths:
– {1-5-4-3}
– {1-5-2-3}
– {1-2-5-4-3}
– {1-2-3} ← the shortest path (2 edges)
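A hedged igraph reconstruction of the example graph used on the last few slides (the edge list is inferred from the listed neighbours, degrees and paths, so it is an assumption):

library(igraph)
g <- graph_from_edgelist(rbind(c(1,2), c(1,5), c(2,3), c(2,5),
                               c(3,4), c(4,5), c(4,6)), directed = FALSE)
degree(g)[c(6, 4)]                          # k(6) = 1, k(4) = 3
neighborhood(g, order = 2, nodes = 2)[[1]]  # node 2 plus all neighbours within distance 2
as_adjacency_matrix(g)                      # the 6 x 6 connectivity (adjacency) matrix
shortest_paths(g, from = 1, to = 3)$vpath   # the geodesic 1-2-3 (length 2)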
77
Cliques
• A clique of a graph G is a complete subgraph of G
– i.e. maximally interconnected subgraph
• The highlighted clique is the maximal clique of size 4 (nodes)
78
“The richest people in the world look for and
build networks. Everyone else looks for work.”
–Robert Kiyosaki
Biological context
80
Biological Networks
81
Biological examples
• Co-expression
– for genes that have similar expression profiles
• Directed gene regulatory networks (GRNs)
– show directionality of gene interactions
• transcription factor → target gene expression
– show the direction of information flow
– E.g. a transcription factor activating a target gene
• Protein-Protein Interaction Networks (PPI)
– Show physical interaction between proteins
– Concentrate on binding events
• Others
– Metabolic, differential, Bayesian, etc.
82
Biological networks
• Three main classes
Name    Nodes            Edges                                 Resource
PPI     proteins         physical bonds                        BioGRID
DTI     drugs/targets    physical bonds, molecular interactions  PubChem
GI      genes            genetic interactions                  BioGRID
ON      Gene Ontology    functional relations/associations     GO
GDA     genes/diseases   associations                          OMIM
Co-Ex   genes            expression profile similarity         GEO, ArrayExpress
PStrS   proteins         functional/structural similarities    PDB
Source: Gligorijević, Vladimir, and Nataša Pržulj. "Methods for biological data integration: perspectives and challenges." Journal of The Royal Society Interface 12.112 (2015): 20150571.
83
Summary
• Trees are powerful techniques for
– Response prediction
• e.g. Classification
• Random Forest is a powerful tree ensemble
technique for variable selection
• RF can be used to build networks
– assess pair-wise variable associations
• Networks are well suited for representing interactions
– Biological networks are scale-free
84
References
1) https://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm
2) Breiman, Leo. "Statistical modeling: The two cultures." Quality control and applied
statistics 48.1 (2003): 81-82.
3) Liaw, Andy, and Matthew Wiener. "Classification and regression by
randomForest." R news 2.3 (2002): 18-22.
4) Loh, Wei-Yin, and Yu-Shan Shih. "Split selection methods for classification
trees." Statistica sinica 7.4 (1997): 815-840.
85