Outline
• Time series prediction
• Find k-nearest neighbors
• Lag selection
• Weighted LS-SVM

Time series prediction
• Suppose we have a univariate time series x(t) for t = 1, 2, …, N. We want to predict the value of x(N + p).
• If p = 1, this is called one-step prediction.
• If p > 1, this is called multi-step prediction.

Flowchart
(figure)

Find k-nearest neighbors
• Assume the current time index is 20.
• First we reconstruct the query
$$q = [\,x_{19} \;\; y_{19} \;\; x_{20} \;\; y_{20}\,]$$
• Then the distance between the query and a historical record, e.g. the one at t = 2, combines a trend part and an original part:
$$D_{\mathrm{Trend}}(q, t_2) = \big[(x_{20} - x_{19}) - (x_2 - x_1)\big]^2 + \big[(y_{20} - y_{19}) - (y_2 - y_1)\big]^2$$
$$D_{\mathrm{Original}}(q, t_2) = \big[(x_{19} - x_1)^2 + (y_{19} - y_1)^2 + (x_{20} - x_2)^2 + (y_{20} - y_2)^2\big]^{0.5}$$
$$D(q, t_2) = \frac{N(D_{\mathrm{Trend}}) + N(D_{\mathrm{Original}})}{2}$$
where N(·) denotes normalization of the two distance components.

Find k-nearest neighbors
• If k = 3 and the k closest neighbors are t14, t15 and t16, we can construct the smaller data set
$$D = \begin{bmatrix} x_{13} & y_{13} & x_{14} & y_{14} & y_{15} \\ x_{14} & y_{14} & x_{15} & y_{15} & y_{16} \\ x_{15} & y_{15} & x_{16} & y_{16} & y_{17} \end{bmatrix}$$
where each row mirrors the structure of the query and the last column is the corresponding one-step-ahead target.

Flowchart
(figure)

Lag selection
• Lag selection is the process of selecting a subset of relevant lags (features) for use in model construction.
• Why do we need lag selection?
• Lag selection is like feature selection, not feature extraction.

Lag selection
• Lag selection methods can usually be divided into two broad classes: filter methods and wrapper methods.
• The lag subset is chosen by an evaluation criterion that measures the relationship of each subset of lags with the target (output).

Wrapper method
• The best lag subset is selected according to the performance of the model itself.
• Lag selection is part of the learning.

Filter method
• This method needs a criterion that measures correlation or dependence.
• For example: correlation, mutual information, … .

Lag selection
• Which is better?
• The wrapper method solves the real problem, but needs more time.
• The filter method is cheaper, but the lag subset it provides may perform worse.
• We use the filter method because of the architecture.

Entropy
• Entropy is a measure of the uncertainty of a random variable.
• The entropy of a discrete random variable is defined by
$$H(X) = -\sum_{x \in \mathcal{X}} p(x) \log p(x)$$
• with the convention 0 log 0 = 0.

Entropy
• Example: let
$$X = \begin{cases} 1 & \text{with probability } p \\ 0 & \text{with probability } 1 - p \end{cases}$$
• Then
$$H(X) = -p \log p - (1 - p) \log(1 - p)$$

Entropy
(figure)

Entropy
• Example: let
$$X = \begin{cases} a & \text{with probability } 1/2 \\ b & \text{with probability } 1/4 \\ c & \text{with probability } 1/8 \\ d & \text{with probability } 1/8 \end{cases}$$
• Then
$$H(X) = -\tfrac{1}{2}\log\tfrac{1}{2} - \tfrac{1}{4}\log\tfrac{1}{4} - \tfrac{1}{8}\log\tfrac{1}{8} - \tfrac{1}{8}\log\tfrac{1}{8} = \tfrac{7}{4}\ \text{bits}$$

Joint entropy
• Definition: the joint entropy of a pair of discrete random variables (X, Y) is defined as
$$H(X, Y) = -\sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x, y) \log p(x, y)$$

Conditional entropy
• Definition: the conditional entropy is defined as
$$H(Y \mid X) = \sum_{x \in \mathcal{X}} p(x)\, H(Y \mid X = x)$$
• And (the chain rule)
$$H(X, Y) = H(X) + H(Y \mid X)$$

Proof
$$\begin{aligned}
H(X, Y) &= -\sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x, y) \log p(x, y)
          = -\sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x, y) \log\big[p(x)\, p(y \mid x)\big] \\
        &= -\sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x, y) \log p(x)
           -\sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x, y) \log p(y \mid x) \\
        &= -\sum_{x \in \mathcal{X}} p(x) \log p(x)
           -\sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x, y) \log p(y \mid x) \\
        &= H(X) + H(Y \mid X)
\end{aligned}$$
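As a numerical illustration of the definitions above, the sketch below computes H(X), H(Y | X) and H(X, Y) for a small joint probability table and checks the chain rule. The table and the script are illustrative assumptions, not material from the slides.

```matlab
% Entropy, conditional entropy and the chain rule for an assumed toy joint
% distribution P (rows index x, columns index y); log2 gives values in bits.
P = [1/4 1/4; 1/8 1/8; 1/8 1/8];              % toy p(x, y), sums to 1

H = @(p) -sum(p(p > 0) .* log2(p(p > 0)));    % entropy of a probability vector,
                                              % with the convention 0*log 0 = 0
px  = sum(P, 2);                              % marginal p(x)
HX  = H(px);                                  % H(X)
HXY = H(P(:));                                % joint entropy H(X, Y)

HYgX = 0;                                     % H(Y|X) = sum_x p(x) H(Y | X = x)
for i = 1:size(P, 1)
    HYgX = HYgX + px(i) * H(P(i, :) / px(i));
end

[HX + HYgX, HXY]                              % chain rule: both entries are 2.5 bits

H([1/2 1/4 1/8 1/8])                          % the slide example: 1.75 = 7/4 bits
```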
Mutual information
• Mutual information is a measure of the amount of information one random variable contains about another.
• It extends the notion of entropy.
• Definition: the mutual information of two discrete random variables is
$$I(X; Y) = \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x, y) \log \frac{p(x, y)}{p(x)\, p(y)} = H(X) - H(X \mid Y)$$

Proof
$$\begin{aligned}
I(X; Y) &= \sum_{x, y} p(x, y) \log \frac{p(x, y)}{p(x)\, p(y)}
         = \sum_{x, y} p(x, y) \log \frac{p(x \mid y)}{p(x)} \\
        &= -\sum_{x, y} p(x, y) \log p(x) + \sum_{x, y} p(x, y) \log p(x \mid y) \\
        &= -\sum_{x} p(x) \log p(x) - \Big(-\sum_{x, y} p(x, y) \log p(x \mid y)\Big) \\
        &= H(X) - H(X \mid Y)
\end{aligned}$$
• By symmetry, I(X; Y) = H(Y) − H(Y | X) = H(X) + H(Y) − H(X, Y).

The relationship between entropy and mutual information
(figure)

Mutual information
• Definition: the mutual information of two continuous random variables is
$$I(X; Y) = \iint p(x, y) \log \frac{p(x, y)}{p(x)\, p(y)} \, dx\, dy$$
• The problem is that the joint probability density function of X and Y is hard to compute.

Binned Mutual information
• The most straightforward and widespread approach for estimating MI consists in partitioning the supports of X and Y into bins of finite size:
$$I(X, Y) \approx I_{\mathrm{binned}}(X, Y) = \sum_{ij} p(i, j) \log \frac{p(i, j)}{p_x(i)\, p_y(j)}$$
$$p_x(i) = \frac{n_x(i)}{N}, \qquad p_y(j) = \frac{n_y(j)}{N}, \qquad p(i, j) = \frac{n(i, j)}{N}$$
where n_x(i), n_y(j) and n(i, j) are the numbers of points falling into the corresponding bins.

Binned Mutual information
• For example, consider a set of 5 bivariate measurements z_i = (x_i, y_i), i = 1, 2, …, 5, with values

  Index      1    2    3    4    5
  Feature 1  0    0.5  1    3    4
  Feature 2  1    5    3    4    1

Binned Mutual information
(figure)
$$p_x(1) = \tfrac{3}{5}, \quad p_x(2) = \tfrac{2}{5}, \quad p_y(1) = \tfrac{2}{5}, \quad p_y(2) = \tfrac{2}{5}, \quad p_y(3) = \tfrac{1}{5}$$
$$p_{xy}(1,1) = p_{xy}(1,2) = p_{xy}(1,3) = p_{xy}(2,1) = p_{xy}(2,2) = \tfrac{1}{5}$$

Binned Mutual information
$$\hat{I}(X, Y) = \sum_{i=1}^{3} \sum_{j=1}^{3} p_{xy}(i, j) \log \frac{p_{xy}(i, j)}{p_x(i)\, p_y(j)}
= \tfrac{1}{5}\log\tfrac{1/5}{(3/5)(2/5)} + \tfrac{1}{5}\log\tfrac{1/5}{(3/5)(2/5)} + \tfrac{1}{5}\log\tfrac{1/5}{(3/5)(1/5)} + \tfrac{1}{5}\log\tfrac{1/5}{(2/5)(2/5)} + \tfrac{1}{5}\log\tfrac{1/5}{(2/5)(2/5)} = 0.1996$$

Estimating Mutual information
• Another approach for estimating mutual information: consider the case with two variables, where the 2-dimensional space Z is spanned by X and Y. We can compute the distance between each pair of points as
$$\|z_i - z_j\| = \max\{\, |x_i - x_j|,\ |y_i - y_j|\, \}, \qquad i, j = 1, 2, \ldots, N,\ i \neq j$$

Estimating Mutual information
• Let us denote by ε(i)/2 the distance from z_i to its k-th nearest neighbor, and by ε_x(i)/2 and ε_y(i)/2 the distances between the same points projected onto the X and Y subspaces.
• Then we count the number n_x(i) of points x_j whose distance from x_i is strictly less than ε(i)/2, and similarly for y instead of x.

Estimating Mutual information
(figure)

Estimating Mutual information
• The estimate for MI is then
$$\hat{I}^{(1)}(X, Y) = \psi(k) - \frac{1}{N} \sum_{i=1}^{N} \big[\psi(n_x(i) + 1) + \psi(n_y(i) + 1)\big] + \psi(N)$$
where ψ(·) is the digamma function.
• Alternatively, in the second algorithm, we replace n_x(i) and n_y(i) by the numbers of points with
$$|x_i - x_j| \le \varepsilon_x(i)/2, \qquad |y_i - y_j| \le \varepsilon_y(i)/2$$

Estimating Mutual information
(figures)

Estimating Mutual information
• Then
$$\hat{I}^{(2)}(X, Y) = \psi(k) - \frac{1}{k} - \frac{1}{N} \sum_{i=1}^{N} \big[\psi(n_x(i)) + \psi(n_y(i))\big] + \psi(N)$$
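The worked example that follows can be reproduced with a short script. The sketch below is an illustrative implementation of the first estimator $\hat{I}^{(1)}$ for the five points above with k = 2 (MATLAB's psi is the digamma function); it is not the exact code behind the reported results, but it returns the same value, ≈ 0.2833.

```matlab
% Nearest-neighbour MI estimate (first algorithm) for the 5-point example, k = 2.
x = [0 0.5 1 3 4];
y = [1 5   3 4 1];
N = numel(x);
k = 2;

s = 0;
for i = 1:N
    dx = abs(x - x(i));
    dy = abs(y - y(i));
    dz = max(dx, dy);            % max-norm distance in the joint (X, Y) space
    dz(i) = Inf;                 % exclude the point itself
    dsort = sort(dz);
    eps_i = dsort(k);            % distance to the k-th nearest neighbour, eps(i)/2
    nx = sum(dx < eps_i) - 1;    % points strictly closer than eps(i)/2 in X
    ny = sum(dy < eps_i) - 1;    % ... and in Y (the -1 removes the point itself)
    s  = s + psi(nx + 1) + psi(ny + 1);
end
I_hat = psi(k) - s/N + psi(N)    % approx. 0.2833 for this data set
```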
Estimating Mutual information
• For the same example, with k = 2:
• For the point p1 = (0, 1):
$$D_{12} = \max\{0.5, 4\} = 4, \quad D_{13} = \max\{1, 2\} = 2, \quad D_{14} = \max\{3, 3\} = 3, \quad D_{15} = \max\{4, 0\} = 4, \qquad n_x(1) = 2, \quad n_y(1) = 2$$
• For the point p2 = (0.5, 5):
$$D_{21} = \max\{0.5, 4\} = 4, \quad D_{23} = \max\{0.5, 2\} = 2, \quad D_{24} = \max\{2.5, 1\} = 2.5, \quad D_{25} = \max\{3.5, 4\} = 4, \qquad n_x(2) = 2, \quad n_y(2) = 2$$

Estimating Mutual information
• For the point p3 = (1, 3):
$$D_{31} = \max\{1, 2\} = 2, \quad D_{32} = \max\{0.5, 2\} = 2, \quad D_{34} = \max\{2, 1\} = 2, \quad D_{35} = \max\{3, 2\} = 3, \qquad n_x(3) = 2, \quad n_y(3) = 1$$
• For the point p4 = (3, 4):
$$D_{41} = \max\{3, 3\} = 3, \quad D_{42} = \max\{2.5, 1\} = 2.5, \quad D_{43} = \max\{2, 1\} = 2, \quad D_{45} = \max\{1, 3\} = 3, \qquad n_x(4) = 2, \quad n_y(4) = 2$$

Estimating Mutual information
• For the point p5 = (4, 1):
$$D_{51} = \max\{4, 0\} = 4, \quad D_{52} = \max\{3.5, 4\} = 4, \quad D_{53} = \max\{3, 2\} = 3, \quad D_{54} = \max\{1, 3\} = 3, \qquad n_x(5) = 1, \quad n_y(5) = 2$$
• Then
$$\hat{I}^{(1)}(X, Y) = \psi(k) - \frac{1}{N} \sum_{i=1}^{N} \big[\psi(n_x(i) + 1) + \psi(n_y(i) + 1)\big] + \psi(N)
= \psi(2) - \frac{1}{5} \sum_{i=1}^{5} \big[\psi(n_x(i) + 1) + \psi(n_y(i) + 1)\big] + \psi(5) = 0.2833$$

Estimating Mutual information
• Example:
  – a = rand(1,100)
  – b = rand(1,100)
  – c = a*2
• Then
$$I(a, b) = 0.0427, \qquad I(a, c) = 2.7274$$

Estimating Mutual information
• Example:
  – a = rand(1,100)
  – b = rand(1,100)
  – d = 2*a + 3*b
• Then
$$I(a, d) = 0.2218, \qquad I(b, d) = 0.6384, \qquad I((a, b), d) = 1.4183$$

Flowchart
(figure)

Model
• Now we have a training data set containing the k nearest records, and we need a model to make the prediction.

Instance-based learning
• Points that are close to the query get large weights, and points far from the query get small weights.
• Locally weighted regression
• General Regression Neural Network (GRNN)

Property of the local frame
(figures)

Weighted LS-SVM
• The goal of the standard LS-SVM is to minimize the risk function
$$\min_{\omega, e}\ J(\omega, e) = \frac{1}{2}\, \omega^T \omega + \frac{\gamma}{2} \sum_{j=1}^{k} e_j^2$$
where γ is the regularization parameter.

Weighted LS-SVM
• The modified risk function of the weighted LS-SVM is
$$\min_{\omega, e}\ J(\omega, e) = \frac{1}{2}\, \omega^T \omega + \frac{1}{2} \sum_{j=1}^{k} \gamma_j e_j^2$$
• with
$$\gamma_j = \gamma\, w_j, \qquad j = 1, 2, \ldots, k$$

Weighted LS-SVM
• The weights are designed as
$$w_{1j} = D_{Nj}, \qquad w_{2j} = \frac{u_j}{\mathrm{median}(U)}, \qquad w_j = \exp\!\Big(-\frac{N(w_{1j}) + N(w_{2j})}{2}\Big)$$
where D_{Nj} is the distance between the query and the j-th record and N(·) again denotes normalization.
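For concreteness, here is a minimal sketch of one weighted LS-SVM regression step, assuming the usual dual (KKT) system with an RBF kernel. The data, weights and hyper-parameters below are toy values for illustration only; in the setting described above, the rows would be the k nearest-neighbour records and w the weights w_j defined on the last slide.

```matlab
% Weighted LS-SVM regression sketch (RBF kernel), assuming the usual dual system
%   [ 0   1'                        ] [ b     ]   [ 0 ]
%   [ 1   Omega + diag(1./(gamma*w))] [ alpha ] = [ y ]
% Xtr, ytr, w, gamma, sigma are toy/assumed values, not taken from the slides.
Xtr = [0 1; 1 2; 2 3; 3 4];        % toy training inputs (rows = neighbour records)
ytr = [1.1; 2.0; 2.9; 4.2];        % toy targets
w   = [1.0; 0.8; 0.6; 0.3];        % sample weights w_j (larger = more influence)
gamma = 10;                        % regularization parameter (to be tuned)
sigma = 1;                         % RBF kernel width (to be tuned)

sqd = @(A, B) sum(A.^2, 2) + sum(B.^2, 2).' - 2*A*B.';  % pairwise squared distances
rbf = @(A, B) exp(-sqd(A, B) / (2*sigma^2));            % RBF kernel matrix

n     = size(Xtr, 1);
Omega = rbf(Xtr, Xtr);
M     = [0, ones(1, n); ones(n, 1), Omega + diag(1 ./ (gamma*w))];
sol   = M \ [0; ytr];              % solve the KKT system
b     = sol(1);
alpha = sol(2:end);

xq   = [1.5 2.5];                  % query point
yhat = rbf(xq, Xtr) * alpha + b    % weighted LS-SVM prediction at the query
```

A larger weight w_j shrinks the effective regularization 1/(γ w_j) on that record, so near neighbours constrain the fit more strongly than distant ones, matching the instance-based-learning idea above.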