Unsupervised Learning Networks

PCA Network
PCA is a representation network useful for signal, image, and video processing.

PCA Networks
To analyze multi-dimensional input vectors, a representation that retains maximum information is the principal component analysis (PCA). PCA aims to
• per component: extract the most significant features,
• inter-component: avoid duplication or redundancy between the neurons.

An estimate of the autocorrelation matrix is obtained by taking the time average over the M sample vectors:
  R̂x = (1/M) Σt x(t) xᵀ(t),   Rx = U Λ Uᵀ
The optimal matrix W is formed by the first m eigenvectors (singular vectors) of Rx, giving the representation
  a(t) = Wᵀ x(t),   x(t) ≈ W a(t)
The errors of this optimal estimate are [Jain89]:
• matrix-2-norm error = λm+1
• least-mean-square error = Σ_{i=m+1}^{n} λi

First PC
  a(t) = wᵀ(t) x(t)
To enhance the correlation between the input x(t) and the extracted component a(t), it is natural to use a Hebbian-type rule:
  w(t+1) = w(t) + β x(t) a(t)

Oja Learning Rule
  Δw(t) = β [ x(t) a(t) − a²(t) w(t) ]
The Oja learning rule is equivalent to a normalized Hebbian rule. (Exercise: show the procedure.)

Convergence Theorem: Single Component
Under the Oja learning rule, w(t) converges asymptotically (with probability 1) to w = w(∞) = e1, where e1 is the principal eigenvector of Rx.

Proof:
  Δw(t) = β [ x(t) a(t) − a²(t) w(t) ]
        = β [ x(t) xᵀ(t) w(t) − a²(t) w(t) ]
Take the average over a block of data, let ť denote the block index, and let σ(ť) denote the block average of a²(t):
  Δw(ť) = β [ Rx − σ(ť) I ] w(ť)
        = β [ U Λ Uᵀ − σ(ť) I ] w(ť)
        = β U [ Λ − σ(ť) I ] Uᵀ w(ť)
  Δ Uᵀ w(ť) = β [ Λ − σ(ť) I ] Uᵀ w(ť)
  Δ Θ(ť) = β [ Λ − σ(ť) I ] Θ(ť),   where Θ(ť) ≡ Uᵀ w(ť)

Convergence Rates
  Θ(ť) = [ θ1(ť) θ2(ť) … θn(ť) ]ᵀ
Each eigen-component is enhanced or dampened by
  θi(ť+1) = [ 1 + β′ (λi − σ(ť)) ] θi(ť)
The relative dominance of the principal component grows: for i > 1, the ratio θi(ť)/θ1(ť) changes per block by the factor
  [ 1 + β′ (λi − σ(ť)) ] / [ 1 + β′ (λ1 − σ(ť)) ] < 1,
so every minor component decays relative to the first.

Simulation: Decay Rates of PCs

How to Extract Multiple Principal Components
Let W denote an n×m weight matrix:
  ΔW(t) = β [ x(t) − W(t) a(t) ] aᵀ(t)
Concern: duplication/redundancy between the extracted components.

Deflation Method
Assume that the first component is already obtained; then the input can be "deflated" by the following transformation:
  x̃ = ( I − w1 w1ᵀ ) x

Lateral Orthogonalization Network
The basic idea is to allow the old hidden units to influence the new units so that the new ones do not duplicate information (in full or in part) already provided by the old units. With this approach, the deflation process is effectively implemented in an adaptive manner.

APEX Network (multiple PCs)
APEX: Adaptive Principal-component Extractor
The Oja rule for the i-th component (e.g. i = 2):
  Δwi(t) = β [ x(t) ai(t) − ai²(t) wi(t) ]
Dynamic orthogonalization rule (e.g. i = 2, j = 1):
  Δαij(t) = β [ ai(t) aj(t) − αij(t) ai²(t) ]
Minimal sketches of the single-component Oja rule and of the APEX update for a second unit are given below.
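As a concrete illustration of the single-component case, here is a minimal NumPy sketch (not from the original course material): it estimates R̂x from synthetic data, runs the Oja rule, and checks that the learned w aligns with the principal eigenvector e1 of R̂x. The data model, the fixed step size beta, and the sample count are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic zero-mean data with well-separated eigenvalues (illustrative assumption).
n, M = 5, 20000
Q, _ = np.linalg.qr(rng.normal(size=(n, n)))   # random orthonormal basis
stds = np.array([5.0, 3.0, 1.0, 0.5, 0.2])     # per-component standard deviations
X = rng.normal(size=(M, n)) * stds @ Q.T       # rows are the sample vectors x(t)

# Batch estimate of the autocorrelation matrix: R̂x = (1/M) Σt x(t) xᵀ(t).
Rx = X.T @ X / M
eigval, eigvec = np.linalg.eigh(Rx)            # eigenvalues in ascending order
e1 = eigvec[:, -1]                             # principal eigenvector of R̂x

# Oja learning rule: Δw = β [ x a − a² w ], with a = wᵀ x.
w = rng.normal(size=n)
w /= np.linalg.norm(w)
beta = 1e-3                                    # illustrative fixed step size
for t in range(M):
    x = X[t]
    a = w @ x
    w += beta * (a * x - a * a * w)

# w should align with ±e1 and have norm close to 1.
print("|cos(w, e1)| =", abs(w @ e1) / np.linalg.norm(w))
print("||w||        =", np.linalg.norm(w))
```

With a fixed β the weight vector keeps fluctuating around e1; shrinking β over time (as in the APEX learning-rate schedule later in this section) reduces this residual jitter.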
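The APEX update for a second unit (i = 2, j = 1), referenced above, can be sketched in the same style. The version below assumes the first unit has already converged, so w1 is simply fixed to the principal eigenvector; the data, step size, and this simplification are illustrative assumptions rather than the original APEX implementation.

```python
import numpy as np

rng = np.random.default_rng(1)

# Same kind of synthetic data as in the previous sketch (illustrative assumption).
n, M = 5, 20000
Q, _ = np.linalg.qr(rng.normal(size=(n, n)))
stds = np.array([5.0, 3.0, 1.0, 0.5, 0.2])
X = rng.normal(size=(M, n)) * stds @ Q.T

eigval, eigvec = np.linalg.eigh(X.T @ X / M)
e1, e2 = eigvec[:, -1], eigvec[:, -2]          # first and second principal eigenvectors

w1 = e1.copy()               # assume the first unit has already converged
w2 = rng.normal(size=n)
w2 /= np.linalg.norm(w2)
alpha = 0.0                  # lateral (anti-Hebbian) weight from unit 1 to unit 2
beta = 1e-3                  # illustrative fixed step size

for t in range(M):
    x = X[t]
    a1 = w1 @ x
    a2 = w2 @ x - alpha * a1                     # output with lateral orthogonalization
    w2 += beta * (a2 * x - a2 * a2 * w2)         # Oja rule for the second component
    alpha += beta * (a1 * a2 - alpha * a2 * a2)  # dynamic orthogonalization rule

# w2 should align with ±e2, and alpha should track w1ᵀw2 (both close to 0 here).
print("|cos(w2, e2)| =", abs(w2 @ e2) / np.linalg.norm(w2))
print("w1.w2 - alpha =", w1 @ w2 - alpha)
```

Further units follow the same pattern, each receiving lateral weights from all previously trained units, so the deflation is performed adaptively rather than by an explicit transformation of the input.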
Convergence Theorem: Multiple Components
The Hebbian weight matrix W(t) in APEX converges asymptotically to a matrix formed by the m largest principal components: with probability 1,
  W(∞) = W,
where W is the matrix formed by the m row vectors wiᵀ, with wi = wi(∞) = ei.

Proof (for i = 2, j = 1):
  Δw2(t) = β [ x(t) a2(t) − a2²(t) w2(t) ]
  Δα(t)  = β [ a1(t) a2(t) − α(t) a2²(t) ]
Left-multiplying the first update by w1ᵀ and using a1(t) = w1ᵀ x(t):
  w1ᵀ Δw2(t) = β [ w1ᵀ x(t) a2(t) − w1ᵀ w2(t) a2²(t) ]
  Δ w1ᵀ w2(t) = β [ a1(t) a2(t) − w1ᵀ w2(t) a2²(t) ]
Subtracting the update for α(t),
  Δ [ w1ᵀ w2(t) − α(t) ] = −β [ w1ᵀ w2(t) − α(t) ] a2²(t)
  [ w1ᵀ w2(t+1) − α(t+1) ] = [ 1 − β a2²(t) ] [ w1ᵀ w2(t) − α(t) ]
so w1ᵀ w2(t) − α(t) → 0, i.e. α(t) → w1ᵀ w2(t). Consequently,
  a2(t) = xᵀ(t) w2(t) − α(t) a1(t) → xᵀ(t) [ I − w1 w1ᵀ ] w2(t),
i.e. the deflation transformation is implemented adaptively.

Learning Rates of APEX
In block-averaged form, with σ(ť) the block average of a2²(t),
  [ w1ᵀ w2(ť+1) − α(ť+1) ] = [ 1 − β′ σ(ť) ] [ w1ᵀ w2(ť) − α(ť) ],
so the fastest orthogonalization is obtained with β′ = 1/σ(ť).

Learning Rates
• β = 1 / [ Σt a2²(t) ]
• β = 1 / [ Σt γ^(T−t) a2²(t) ]   (exponentially weighted sum with forgetting factor γ; see the sketch after the list of extensions below)

Other Extensions
• PAPEX: Hierarchical Extraction
• DCA: Discriminant Component Analysis
• ICA: Independent Component Analysis
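As a follow-up to the learning-rate rules above, here is a hedged sketch of the exponentially weighted schedule, applied for simplicity to a single Oja unit using its own squared output. The forgetting factor γ = 0.99, the synthetic data, and the small initial value of the running sum are illustrative assumptions, not the course's settings.

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic data as in the earlier sketches (illustrative assumption).
n, M = 5, 20000
Q, _ = np.linalg.qr(rng.normal(size=(n, n)))
X = rng.normal(size=(M, n)) * np.array([5.0, 3.0, 1.0, 0.5, 0.2]) @ Q.T

w = rng.normal(size=n)
w /= np.linalg.norm(w)

gamma = 0.99                 # forgetting factor (illustrative assumption)
s = 1e-6                     # running, exponentially weighted sum of a²(t)
for t in range(M):
    x = X[t]
    a = w @ x
    s = gamma * s + a * a    # recursive form of Σt γ^(T−t) a²(t)
    beta = 1.0 / s           # adaptive learning rate β(t)
    w += beta * (a * x - a * a * w)

e1 = np.linalg.eigh(X.T @ X / M)[1][:, -1]
print("|cos(w, e1)| =", abs(w @ e1) / np.linalg.norm(w))
```

Choosing β(t) as the inverse of the accumulated output energy approximates the optimal block choice β′ = 1/σ(ť) derived above, so the step size is large early on and decays as the unit's output stabilizes.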