Regression

Usman Roshan
CS 675
Machine Learning
Regression
• Same problem as classification, except that the
target variable yi is continuous.
• Popular solutions
– Linear regression (a linear model, like the perceptron)
– Support vector regression
– Logistic regression (for regression)
Linear regression
• Suppose target values are generated by a
function yi = f(xi) + ei, where ei is noise.
• We will estimate f(xi) by g(xi,θ).
• Suppose each ei is generated by a Gaussian
distribution with mean 0 and variance σ^2 (the same
variance for all ei).
• This implies that the probability of yi given the
input xi and parameters θ, denoted p(yi|xi,θ), is
normally distributed with mean g(xi,θ) and
variance σ^2.
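Written out, the Gaussian noise assumption gives the following conditional density. This only restates the bullet above in symbols and assumes the estimate g(xi,θ) stands in for f(xi):

```latex
p(y_i \mid x_i, \theta)
  = \frac{1}{\sqrt{2\pi\sigma^2}}
    \exp\!\left( -\frac{\bigl(y_i - g(x_i,\theta)\bigr)^2}{2\sigma^2} \right)
```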
Linear regression
• Apply maximum likelihood to estimate the parameters θ of g(x,θ)
• Assume the pairs (xi,yi) are i.i.d.
• Then the probability of the data given the model
(the likelihood) is P(X|θ) = p(x1,y1)p(x2,y2)…p(xn,yn)
• Each p(xi,yi) = p(yi|xi)p(xi)
• p(yi|xi) is normally distributed with
mean g(xi,θ) and variance σ^2
• Maximizing the log likelihood (as for
classification) gives least squares (linear
regression), since p(xi) does not depend on θ
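A minimal sketch of the link between maximum likelihood and least squares, assuming a single input feature and g(x,θ) = w*x + w0. Up to a constant, the Gaussian log likelihood is minus the sum of squared residuals divided by 2σ^2, so the least-squares fit should also maximize it. The function names and the toy data are made up for illustration.

```python
import math

def fit_least_squares(xs, ys):
    """Closed-form least-squares fit of y = w*x + w0 for one input feature."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    w = sxy / sxx
    w0 = mean_y - w * mean_x
    return w, w0

def gaussian_log_likelihood(xs, ys, w, w0, sigma2=1.0):
    """Sum over the data of log N(y_i; w*x_i + w0, sigma2)."""
    n = len(xs)
    sse = sum((y - (w * x + w0)) ** 2 for x, y in zip(xs, ys))
    return -0.5 * n * math.log(2 * math.pi * sigma2) - sse / (2 * sigma2)

# Toy data (hypothetical), roughly y = 2x + 1 plus noise.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.1, 2.9, 5.2, 6.8, 9.1]

w, w0 = fit_least_squares(xs, ys)
best = gaussian_log_likelihood(xs, ys, w, w0)

# Moving the parameters away from the least-squares fit lowers the likelihood.
for dw, dw0 in [(0.1, 0.0), (-0.1, 0.0), (0.0, 0.1), (0.0, -0.1)]:
    print(dw, dw0, gaussian_log_likelihood(xs, ys, w + dw, w0 + dw0) < best)
```

The value of sigma2 only rescales and shifts the log likelihood, so it does not change which (w, w0) is the maximizer.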
Logistic regression
• Similar to the linear regression derivation
• Minimize the sum of squares between the predicted
and actual values
• However
– the predicted value is given by the sigmoid function and
– yi is constrained to the range [0,1]
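A small sketch of the formulation above, assuming one input feature and plain gradient descent as the optimizer (the slides do not say how the minimization is carried out). The function names, learning rate, and toy targets are illustrative only.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sse(xs, ys, w, w0):
    """Sum of squares between sigmoid predictions and targets in [0, 1]."""
    return sum((sigmoid(w * x + w0) - y) ** 2 for x, y in zip(xs, ys))

def fit_sigmoid_least_squares(xs, ys, lr=0.5, epochs=5000):
    """Gradient descent on the mean of (sigmoid(w*x_i + w0) - y_i)^2."""
    w, w0 = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        grad_w = 0.0
        grad_w0 = 0.0
        for x, y in zip(xs, ys):
            p = sigmoid(w * x + w0)
            # Chain rule: d/dz (p - y)^2 = 2 (p - y) p (1 - p)
            common = 2.0 * (p - y) * p * (1.0 - p)
            grad_w += common * x
            grad_w0 += common
        w -= lr * grad_w / n
        w0 -= lr * grad_w0 / n
    return w, w0

# Toy targets in [0, 1] (hypothetical data).
xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
ys = [0.1, 0.25, 0.5, 0.75, 0.9]

w, w0 = fit_sigmoid_least_squares(xs, ys)
print(w, w0, sse(xs, ys, w, w0))
```

Every prediction sigmoid(w*x + w0) lies strictly between 0 and 1, which is why the targets yi must be in that range.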
Support vector regression
• Makes no assumptions about the probability
distribution of the data and output (like the
support vector machine).
• Change the loss function in the support vector
machine problem to the ε-insensitive loss to
obtain support vector regression
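For illustration, the ε-insensitive loss on a single prediction can be sketched as below, next to the squared loss for comparison. The function names and the default eps value are assumptions made for this example.

```python
def eps_insensitive_loss(y_true, y_pred, eps=0.1):
    """Zero inside the eps-wide tube around the target, linear outside it."""
    return max(0.0, abs(y_true - y_pred) - eps)

def squared_loss(y_true, y_pred):
    return (y_true - y_pred) ** 2

# A small error inside the tube costs nothing under the eps-insensitive loss,
# while the squared loss penalizes it slightly.
print(eps_insensitive_loss(1.0, 1.05), squared_loss(1.0, 1.05))

# A large error grows linearly rather than quadratically.
print(eps_insensitive_loss(1.0, 3.0), squared_loss(1.0, 3.0))
```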
Support vector regression
• Solved by applying Lagrange multipliers like in
SVM
• Solution w is given by a linear combination of
support vectors (like in SVM)
• The solution w can also be used for ranking
features.
• From regularized risk minimization the loss
would be

\[ \frac{1}{n} \sum_{i=1}^{n} \max\left(0,\ \left| y_i - (w^T x_i + w_0) \right| - \epsilon \right) \]
Application
• Prediction of continuous phenotypes in mice
from genotype (Predicting unobserved phen…)
• Data are vectors xi where each feature takes on
the values 0, 1, or 2 to denote the number of alleles of
a particular single nucleotide polymorphism
(SNP)
• The data have about 1500 samples and 12,000 SNPs
• Output yi is a phenotype value, for example coat
color (represented by integers) or chemical levels in
blood
Mouse phenotype prediction from
genotype
• Rank SNPs by Wald test
– First perform linear regression y = wx + w0
– Calculate p-value on w using t-test
  • t-test: (w - wnull)/stderr(w)
  • wnull = 0
  • so the t-test statistic is w/stderr(w)
  • stderr(w) given by sqrt( [Σi(yi-wxi-w0)^2 / (n-2)] / Σi(xi-mean(x))^2 )
– Rank SNPs by p-values
– OR by Σi(yi-wxi-w0)^2
• Rank SNPs by Pearson correlation coefficient
• Rank SNPs by support vector regression (w vector in SVR)
• Rank SNPs by ridge regression (w vector)
• Run SVR and ridge regression on the top k ranked SNPs under
cross-validation.
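A sketch of the first two ranking schemes, one SNP at a time. It ranks by the absolute t statistic instead of the p-value; with the same number of samples for every SNP, a larger |t| corresponds to a smaller two-sided p-value, so the order is the same. Function names and the tiny genotype matrix are made up for illustration, and the code assumes no SNP is constant and no fit is exact.

```python
import math

def slope_t_statistic(xs, ys):
    """t = w / stderr(w) for the least-squares fit y = w*x + w0 (null: w = 0)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    w = sxy / sxx
    w0 = mean_y - w * mean_x
    rss = sum((y - (w * x + w0)) ** 2 for x, y in zip(xs, ys))
    stderr_w = math.sqrt(rss / (n - 2) / sxx)
    return w / stderr_w

def pearson(xs, ys):
    """Pearson correlation coefficient between one SNP and the phenotype."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / math.sqrt(sxx * syy)

def rank_snps(genotypes, phenotype, score):
    """Return SNP indices sorted by decreasing absolute score."""
    scores = [abs(score(col, phenotype)) for col in genotypes]
    return sorted(range(len(genotypes)), key=lambda j: -scores[j])

# Hypothetical data: 3 SNPs (one list per SNP, allele counts 0/1/2), 6 samples.
genotypes = [
    [1, 1, 0, 2, 1, 1],
    [0, 1, 2, 0, 2, 1],
    [2, 0, 1, 1, 0, 2],
]
phenotype = [1.0, 2.1, 2.9, 1.2, 3.1, 2.0]

print(rank_snps(genotypes, phenotype, slope_t_statistic))
print(rank_snps(genotypes, phenotype, pearson))
```

For a single SNP the two rankings agree, because the t statistic is a monotone function of the correlation coefficient when n is fixed. The top k indices would then be passed to SVR or ridge regression inside each cross-validation split.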
MCH phenotype in mice
MCH, mean of 10 splits: ranking with the w vector and predicting with SVR and ridge, as well as ranking with PCC and predicting with SVR and ridge.
[Figure: x-axis is the number of top-ranked SNPs used (Top 100 to Top 9K); y-axis ticks run from 0.35 to 0.65; legend: SVR-Ridge, SVR-SVR, PCC-Ridge, PCC-SVR, Ridge-Ridge, Ridge-SVR, All]
CD8 phenotype in mice
CD8, mean of 10 splits: ranking with the w vector and predicting with SVR and ridge, as well as ranking with PCC and predicting with SVR and ridge.
[Figure: x-axis is the number of top-ranked SNPs used (Top 100 to Top 9K); y-axis ticks run from 0.55 to 0.75; legend: SVR-Ridge, SVR-SVR, PCC-Ridge, PCC-SVR, Ridge-Ridge, Ridge-SVR, All]
Rice phenotype prediction from
genotype
• Same experimental study as before (mouse phenotypes)
• Improving the Accuracy of Whole Genome
Prediction for Complex Traits Using the Results
of Genome Wide Association Studies
• Data has 413 samples and 37,000 SNPs
(features)
• Best linear unbiased prediction (BLUP)
method improved by prior SNP knowledge
(given in genome-wide association studies)
Days to flower
[Figure: x-axis is the number of top-ranked SNPs used (Top 100 to Top 13K); y-axis ticks run from 0.4 to 0.7; legend: SVR-Ridge, SVR-SVR, PCC-Ridge, PCC-SVR]
Flag leaf length
[Figure: x-axis is the number of top-ranked SNPs used (Top 100 to Top 10K); y-axis ticks run from 0.2 to 0.55; legend: SVR-Ridge, SVR-SVR, PCC-Ridge, PCC-SVR]
Panicle length
[Figure: x-axis is the number of top-ranked SNPs used (Top 100 to Top 13K); y-axis ticks run from 0.45 to 0.7; legend: SVR-Ridge, SVR-SVR, PCC-Ridge, PCC-SVR]