Outline
• Time series prediction
• Find k-nearest neighbors
• Lag selection
• Weighted LS-SVM
Time series prediction
• Suppose we have a univariate time series x(t) for t = 1, 2, …, N. We want to predict the value of x(N + p).
• If p = 1, this is called one-step prediction.
• If p > 1, it is called multi-step prediction.
Flowchart
Find k-nearest neighbors
• Assume the current time index is 20.
• First we reconstruct the query
  $q = [x_{20-1},\; y_{20-1},\; x_{20},\; y_{20}]$
• Then the distance between the query and the historical data, e.g. the point at $t_2$, is
  $D_{Trend}(q, t_2) = \{[(x_{20} - x_{20-1}) - (x_2 - x_{2-1})]^2 + [(y_{20} - y_{20-1}) - (y_2 - y_{2-1})]^2\}^{0.5}$
  $D_{Original}(q, t_2) = [(x_{20-1} - x_{2-1})^2 + (y_{20-1} - y_{2-1})^2 + (x_{20} - x_2)^2 + (y_{20} - y_2)^2]^{0.5}$
  $D(q, t_2) = \dfrac{N(D_{Trend}) + N(D_{Original})}{2}$
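• A minimal MATLAB sketch of this distance computation, assuming the two series are stored in vectors x and y and that N(·) is a min-max normalization over the candidate set (the normalization is not spelled out above):

```matlab
% Sketch: combined distance between the query (built at time index cur = 20)
% and every earlier candidate point t; x and y are assumed to hold the series.
cur  = 20;
cand = 2:(cur - 1);                        % candidate time indices t
dTrend = zeros(size(cand));
dOrig  = zeros(size(cand));
for m = 1:numel(cand)
    t = cand(m);
    dTrend(m) = sqrt(((x(cur)-x(cur-1)) - (x(t)-x(t-1)))^2 + ...
                     ((y(cur)-y(cur-1)) - (y(t)-y(t-1)))^2);
    dOrig(m)  = sqrt((x(cur-1)-x(t-1))^2 + (y(cur-1)-y(t-1))^2 + ...
                     (x(cur)-x(t))^2     + (y(cur)-y(t))^2);
end
nrm = @(d) (d - min(d)) ./ (max(d) - min(d));   % assumed form of N(.)
D   = (nrm(dTrend) + nrm(dOrig)) / 2;           % combined distance per candidate
[~, order] = sort(D);                            % k-nearest neighbors: cand(order(1:k))
```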
Find k-nearest neighbors
• If k = 3 and the k closest neighbors are t14, t15, and t16, then we can construct the smaller data set; a code sketch of this construction follows the matrix.
  $D = \begin{bmatrix} x_{13} & y_{13} & x_{14} & y_{14} & y_{15} \\ x_{14} & y_{14} & x_{15} & y_{15} & y_{16} \\ x_{15} & y_{15} & x_{16} & y_{16} & y_{17} \end{bmatrix}$
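• A small MATLAB sketch of the construction above, assuming x and y are the stored series and the last column $y_{t+1}$ is the prediction target:

```matlab
% Sketch: assemble the reduced training set from the neighbor indices
% t = 14, 15, 16, one row per neighbor (last column is the target y(t+1)).
nbr = [14 15 16];
D   = zeros(numel(nbr), 5);
for m = 1:numel(nbr)
    t = nbr(m);
    D(m, :) = [x(t-1)  y(t-1)  x(t)  y(t)  y(t+1)];
end
inputs  = D(:, 1:4);      % regressors, same layout as the query q
targets = D(:, 5);        % values the model is trained to predict
```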
Flowchart
Lag selection
• Lag selection is the process of selecting a subset of relevant features for use in model construction.
• Why do we need lags?
• Lag selection is like feature selection, not feature extraction.
Lag selection
• Usually, lag selection methods can be divided into two broad classes: filter methods and wrapper methods.
• The lag subset is chosen by an evaluation criterion, which measures the relationship of each subset of lags with the target or output.
Wrapper method
• The best lag subset is selected according to the performance of the model itself.
• Lag selection is thus a part of the learning process.
Filter method
• In this method, we need a criterion that can measure correlation or dependence.
• For example: correlation, mutual information, … .
Lag selection
• Which is better?
• The wrapper method solves the real problem, but needs more time.
• The filter method provides a lag subset that may perform worse.
• We use the filter method because of the architecture.
Entropy
• Entropy is a measure of the uncertainty of a random variable.
• The entropy of a discrete random variable is defined by
  $H(X) = -\sum_{x \in X} p(x)\log p(x)$
• with the convention $0\log 0 = 0$.
Entropy
• Example: let
  $X = \begin{cases} 1 & \text{with probability } p \\ 0 & \text{with probability } 1-p \end{cases}$
• Then
  $H(X) = -p\log p - (1-p)\log(1-p)$
Entropy
• Example: let
  $X = \begin{cases} a & \text{with probability } 1/2 \\ b & \text{with probability } 1/4 \\ c & \text{with probability } 1/8 \\ d & \text{with probability } 1/8 \end{cases}$
• Then
  $H(X) = -\tfrac{1}{2}\log\tfrac{1}{2} - \tfrac{1}{4}\log\tfrac{1}{4} - \tfrac{1}{8}\log\tfrac{1}{8} - \tfrac{1}{8}\log\tfrac{1}{8} = \tfrac{7}{4}\text{ bits}$
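• A quick MATLAB check of the two entropy examples above (log base 2, so the result is in bits):

```matlab
% Numerical check of the entropy examples (log base 2, unit = bits).
H = @(p) -sum(p(p > 0) .* log2(p(p > 0)));   % convention: 0*log 0 = 0
H([1/2 1/4 1/8 1/8])                          % returns 1.75 = 7/4 bits
p = 1/2;  H([p 1-p])                          % binary entropy, 1 bit at p = 1/2
```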
Joint entropy
• Definition: The joint entropy of a pair of
discrete random variables (X, Y) is defined as
  $H(X, Y) = -\sum_{x \in X}\sum_{y \in Y} p(x, y)\log p(x, y)$
Conditional entropy
• Definition: The conditional entropy is defined
as
  $H(Y \mid X) = \sum_{x \in X} p(x)\,H(Y \mid X = x)$
• And
  $H(X, Y) = H(X) + H(Y \mid X)$
Proof
$$
\begin{aligned}
H(X, Y) &= -\sum_{x \in X}\sum_{y \in Y} p(x, y)\log p(x, y) \\
        &= -\sum_{x \in X}\sum_{y \in Y} p(x, y)\log p(x)\,p(y \mid x) \\
        &= -\sum_{x \in X}\sum_{y \in Y} p(x, y)\log p(x) - \sum_{x \in X}\sum_{y \in Y} p(x, y)\log p(y \mid x) \\
        &= -\sum_{x \in X} p(x)\log p(x) - \sum_{x \in X}\sum_{y \in Y} p(x, y)\log p(y \mid x) \\
        &= H(X) + H(Y \mid X)
\end{aligned}
$$
Mutual information
• The mutual information is a measure of the
amount of information one random variable
contains about another.
• It extends the notion of entropy to a pair of variables.
• Definition: The mutual information of two discrete random variables is
  $I(X, Y) = \sum_{x \in X}\sum_{y \in Y} p(x, y)\log\dfrac{p(x, y)}{p(x)\,p(y)} = H(X) - H(X \mid Y)$
Proof
$$
\begin{aligned}
I(X, Y) &= \sum_{x, y} p(x, y)\log\frac{p(x, y)}{p(x)\,p(y)} \\
        &= \sum_{x, y} p(x, y)\log\frac{p(x \mid y)}{p(x)} \\
        &= -\sum_{x, y} p(x, y)\log p(x) + \sum_{x, y} p(x, y)\log p(x \mid y) \\
        &= -\sum_{x} p(x)\log p(x) - \Bigl(-\sum_{x, y} p(x, y)\log p(x \mid y)\Bigr) \\
        &= H(X) - H(X \mid Y) \\
        &= H(Y) - H(Y \mid X) \\
        &= H(X) + H(Y) - H(X, Y)
\end{aligned}
$$
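• These identities can be checked numerically; a minimal MATLAB sketch on an arbitrary 2×2 joint distribution (the table values are for illustration only):

```matlab
% Sketch: check I(X;Y) = H(X) - H(X|Y) = H(X) + H(Y) - H(X,Y) on a small
% joint distribution (the 2x2 table below is arbitrary, for illustration only).
Pxy = [3/8 1/8; 1/8 3/8];                        % p(x,y): rows = x, columns = y
Px  = sum(Pxy, 2);    Py = sum(Pxy, 1);          % marginals
H   = @(p) -sum(p(p > 0) .* log2(p(p > 0)));
HX = H(Px);  HY = H(Py);  HXY = H(Pxy(:));
I_entropies   = HX + HY - HXY;                   % H(X) + H(Y) - H(X,Y)
I_conditional = HX - (HXY - HY);                 % H(X) - H(X|Y), same value
I_definition  = sum(sum(Pxy .* log2(Pxy ./ (Px * Py))));   % direct definition
```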
The relationship between entropy and mutual information
Mutual information
• Definition: The mutual information of two continuous random variables is
  $I(X, Y) = \iint dx\,dy\; p(x, y)\log\dfrac{p(x, y)}{p(x)\,p(y)}$
• The problem is that the joint probability density function of X and Y is hard to compute.
Binned Mutual information
• The most straightforward and widespread
approach for estimating MI consists in
partitioning the supports of X and Y into bins
of finite size
  $I(X, Y) \approx I_{binned}(X, Y) = \sum_{ij} p(i, j)\log\dfrac{p(i, j)}{p_x(i)\,p_y(j)}$
  where $p_x(i) = \dfrac{n_x(i)}{N}$, $p_y(j) = \dfrac{n_y(j)}{N}$, and $p(i, j) = \dfrac{n(i, j)}{N}$.
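• A generic MATLAB sketch of this estimate from raw samples, assuming equal-width bins (the binning used in the worked example below comes from the original figure and may differ); save as binned_mi.m:

```matlab
% Sketch: binned MI estimate from samples x and y with nbins equal-width
% bins per axis (natural log, so the value is in nats).
function I = binned_mi(x, y, nbins)
    ix  = discretize(x(:), linspace(min(x), max(x), nbins + 1));
    iy  = discretize(y(:), linspace(min(y), max(y), nbins + 1));
    nij = accumarray([ix iy], 1, [nbins nbins]);     % joint counts n(i,j)
    Pij = nij / numel(x);                            % p(i,j)
    Pi  = sum(Pij, 2);    Pj = sum(Pij, 1);          % marginals p_x(i), p_y(j)
    T   = Pij .* log(Pij ./ (Pi * Pj));              % empty bins give NaN (0*log 0)
    I   = sum(T(~isnan(T)));
end
```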
Binned Mutual information
• For example, consider a set of 5 bivariate measurements, z_i = (x_i, y_i), where i = 1, 2, …, 5. The values of these points are:

  Index  Feature 1  Feature 2
  1      0          1
  2      0.5        5
  3      1          3
  4      3          4
  5      4          1
Binned Mutual information
  $p_x(1) = \tfrac{3}{5},\quad p_x(2) = \tfrac{2}{5}$
  $p_y(1) = \tfrac{2}{5},\quad p_y(2) = \tfrac{2}{5},\quad p_y(3) = \tfrac{1}{5}$
  $p_{xy}(1,1) = \tfrac{1}{5},\quad p_{xy}(1,2) = \tfrac{1}{5},\quad p_{xy}(1,3) = \tfrac{1}{5},\quad p_{xy}(2,1) = \tfrac{1}{5},\quad p_{xy}(2,2) = \tfrac{1}{5}$
Binned Mutual information
  $\hat I(X, Y) = \sum_{i=1}^{3}\sum_{j=1}^{3} p_{xy}(i, j)\log\dfrac{p_{xy}(i, j)}{p_x(i)\,p_y(j)}$
  $= \tfrac{1}{5}\log\dfrac{1/5}{(3/5)(2/5)} + \tfrac{1}{5}\log\dfrac{1/5}{(3/5)(2/5)} + \tfrac{1}{5}\log\dfrac{1/5}{(3/5)(1/5)} + \tfrac{1}{5}\log\dfrac{1/5}{(2/5)(2/5)} + \tfrac{1}{5}\log\dfrac{1/5}{(2/5)(2/5)}$
  $\approx 0.1996$
Estimating Mutual information
• Another approach for estimating mutual information: consider the case with two variables, where the 2-dimensional space Z is spanned by X and Y. Then we can compute the distance between each pair of points:
  $\|z_i - z_j\| = \max\{\,|x_i - x_j|,\; |y_i - y_j|\,\}, \qquad i, j = 1, 2, \ldots, N,\; i \neq j$
Estimating Mutual information
• Let us denote by $\varepsilon(i)/2$ the distance from $z_i$ to its k-th nearest neighbor, and by $\varepsilon_x(i)/2$ and $\varepsilon_y(i)/2$ the distances between the same points projected onto the X and Y subspaces.
• Then we can count the number $n_x(i)$ of points $x_j$ whose distance from $x_i$ is strictly less than $\varepsilon(i)/2$, and similarly for y instead of x.
Estimating Mutual information
• The estimate for MI is then
  $\hat I^{(1)}(X, Y) = \psi(k) - \frac{1}{N}\sum_{i=1}^{N}\bigl[\psi(n_x(i) + 1) + \psi(n_y(i) + 1)\bigr] + \psi(N)$
  where $\psi(\cdot)$ is the digamma function.
• Alternatively, in the second algorithm, we replace $n_x(i)$ and $n_y(i)$ by the number of points with
  $|x_i - x_j| \le \varepsilon_x(i)/2, \qquad |y_i - y_j| \le \varepsilon_y(i)/2$
Estimating Mutual information
• Then
  $\hat I^{(2)}(X, Y) = \psi(k) - \frac{1}{k} - \frac{1}{N}\sum_{i=1}^{N}\bigl[\psi(n_x(i)) + \psi(n_y(i))\bigr] + \psi(N)$
Estimating Mutual information
• For the same example, with k = 2:
• For the point $p_1(0, 1)$:
  $D_{12} = \max\{0.5, 4\} = 4$, $D_{13} = \max\{1, 2\} = 2$, $D_{14} = \max\{3, 3\} = 3$, $D_{15} = \max\{4, 0\} = 4$
  $n_x(1) = 2, \quad n_y(1) = 2$
• For the point $p_2(0.5, 5)$:
  $D_{21} = \max\{0.5, 4\} = 4$, $D_{23} = \max\{0.5, 2\} = 2$, $D_{24} = \max\{2.5, 1\} = 2.5$, $D_{25} = \max\{3.5, 4\} = 4$
  $n_x(2) = 2, \quad n_y(2) = 2$
Estimating Mutual information
• For the point $p_3(1, 3)$:
  $D_{31} = \max\{1, 2\} = 2$, $D_{32} = \max\{0.5, 2\} = 2$, $D_{34} = \max\{2, 1\} = 2$, $D_{35} = \max\{3, 2\} = 3$
  $n_x(3) = 2, \quad n_y(3) = 1$
• For the point $p_4(3, 4)$:
  $D_{41} = \max\{3, 3\} = 3$, $D_{42} = \max\{2.5, 1\} = 2.5$, $D_{43} = \max\{2, 1\} = 2$, $D_{45} = \max\{1, 3\} = 3$
  $n_x(4) = 2, \quad n_y(4) = 2$
Estimating Mutual information
• For the point $p_5(4, 1)$:
  $D_{51} = \max\{4, 0\} = 4$, $D_{52} = \max\{3.5, 4\} = 4$, $D_{53} = \max\{3, 2\} = 3$, $D_{54} = \max\{1, 3\} = 3$
  $n_x(5) = 1, \quad n_y(5) = 2$
• Then
  $\hat I^{(1)}(X, Y) = \psi(k) - \frac{1}{N}\sum_{i=1}^{N}\bigl[\psi(n_x(i) + 1) + \psi(n_y(i) + 1)\bigr] + \psi(N) = \psi(2) - \frac{1}{5}\sum_{i=1}^{5}\bigl[\psi(n_x(i) + 1) + \psi(n_y(i) + 1)\bigr] + \psi(5) \approx 0.2833$
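• A minimal MATLAB sketch of this first estimator on the five example points; running it reproduces the value above:

```matlab
% Kraskov-type estimate (first algorithm) on the five example points, k = 2.
% Output: I1 ~ 0.2833 (natural log; psi is MATLAB's digamma function).
Z = [0 1; 0.5 5; 1 3; 3 4; 4 1];          % rows are z_i = (x_i, y_i)
N = size(Z, 1);   k = 2;
s = 0;
for i = 1:N
    d = max(abs(Z - Z(i,:)), [], 2);      % Chebyshev (max-norm) distances D_ij
    d(i) = Inf;                            % exclude the point itself
    d = sort(d);
    eps_half = d(k);                       % distance to the k-th neighbor = eps(i)/2
    dx = abs(Z(:,1) - Z(i,1));  dx(i) = Inf;
    dy = abs(Z(:,2) - Z(i,2));  dy(i) = Inf;
    nx = sum(dx < eps_half);               % n_x(i): strictly closer in the X projection
    ny = sum(dy < eps_half);               % n_y(i): strictly closer in the Y projection
    s = s + psi(nx + 1) + psi(ny + 1);
end
I1 = psi(k) - s/N + psi(N)                 % ~ 0.2833
```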
Estimating Mutual information
• Example:
  – a = rand(1,100)
  – b = rand(1,100)
  – c = a*2
• Then
  $I(a, b) \approx 0.0427, \qquad I(a, c) \approx 2.7274$
Estimating Mutual information
• Example:
  – a = rand(1,100)
  – b = rand(1,100)
  – d = 2*a + 3*b
• Then
  $I(a, d) \approx 0.2218, \qquad I(b, d) \approx 0.6384, \qquad I((a, b), d) \approx 1.4183$
Flowchart
Model
• Now we have a training data set which contains k records; next we need a model to make the prediction.
Instance-based learning
• The points that are close to the query have large weights, and the points far from the query have small weights.
• Locally weighted regression
• General Regression Neural Network (GRNN), sketched below
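• As a concrete illustration of this idea, a minimal GRNN-style predictor (Xtr: k-by-d stored inputs, ytr: stored targets, q: 1-by-d query; the names and the Gaussian kernel bandwidth sigma are illustrative choices):

```matlab
% Sketch: a GRNN-style prediction is a kernel-weighted average of the stored
% targets, so points close to the query dominate; sigma is a bandwidth to tune.
grnn_predict = @(Xtr, ytr, q, sigma) ...
    sum(ytr(:) .* exp(-sum((Xtr - q).^2, 2) / (2*sigma^2))) / ...
    sum(exp(-sum((Xtr - q).^2, 2) / (2*sigma^2)));
```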
Property of the local frame
Weighted LS-SVM
• The goal of the standard LS-SVM is to
minimize the risk function:
  $\min J(\omega, e) = \frac{1}{2}\omega^T\omega + \frac{1}{2}\gamma\sum_{j=1}^{k} e_j^2$
• where γ is the regularization parameter.
Weighted LS-SVM
• The modified risk function of the weighted
LS-SVM is
  $\min J(\omega, e) = \frac{1}{2}\omega^T\omega + \frac{1}{2}\sum_{j=1}^{k} \gamma_j e_j^2$
• And $\gamma_j = w_j\,\gamma$, for $j = 1, 2, \ldots, N$.
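• For reference, a minimal MATLAB sketch of how such a weighted problem is typically solved in the dual with an RBF kernel; this follows the standard LS-SVM linear system and is not taken from these slides (names are illustrative; save as wlssvm_rbf.m):

```matlab
% Sketch: weighted LS-SVM with an RBF kernel, trained by solving the usual
% dual linear system; sample j is regularized by gamma_j = w_j * gam.
function [alpha, b, yq] = wlssvm_rbf(Xtr, ytr, w, gam, sig2, Xq)
    k  = size(Xtr, 1);
    sq = sum(Xtr.^2, 2);
    K  = exp(-(sq + sq' - 2*(Xtr*Xtr')) / (2*sig2));       % RBF kernel matrix
    A  = [0, ones(1, k); ones(k, 1), K + diag(1 ./ (gam * w(:)))];
    sol   = A \ [0; ytr(:)];                                % solves for [b; alpha]
    b     = sol(1);
    alpha = sol(2:end);
    Kq = exp(-(sum(Xq.^2, 2) + sq' - 2*(Xq*Xtr')) / (2*sig2));
    yq = Kq * alpha + b;                                    % predictions at Xq
end
```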
Weighted LS-SVM
• The weight is designed as
  $w_{1j} = D_{Nj}$
  $w_{2j} = u_j - \mathrm{median}(U)$
  $w_j = \exp\!\Bigl(-1 \cdot \dfrac{N(w_{1j}) + N(w_{2j})}{2}\Bigr)$
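• A minimal sketch of this weight computation, treating the distances $D_{Nj}$ (here a vector Dk) and the vector u as given inputs, and assuming N(·) is a min-max normalization (neither u nor N(·) is defined in the text above):

```matlab
% Sketch of the weight design: Dk holds the distances D_Nj of the k neighbors
% and u the quantities u_j; both are assumed to be available already.
nrm = @(v) (v - min(v)) ./ (max(v) - min(v));    % assumed form of N(.)
w1  = Dk(:);                                     % w_1j = D_Nj
w2  = u(:) - median(u(:));                       % w_2j = u_j - median(U)
w   = exp(-1 * (nrm(w1) + nrm(w2)) / 2);         % final weights w_j for the LS-SVM
```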