Manifold Learning and Its Applications: Papers from the AAAI Fall Symposium (FS-10-06)

Detection of Anomaly Trends in Dynamically Evolving Systems

Neta Rabin (Yale University, Mathematics Dept., neta.rabin@yale.edu) and Amir Averbuch (Tel Aviv University, Computer Science Dept., amir@math.tau.ac.il)

Copyright © 2010, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Abstract

We propose a learning framework, based on diffusion methodology, that performs data fusion and anomaly detection in multi-dimensional time series data. Real-life applications and processes usually contain a large number of sensors that generate parameters (features), where each sensor collects partial information about the running process. These input sensors are fused to describe the behavior of the whole process. The proposed data fusion algorithm works in a hierarchical fashion: first it re-scales the input sensors, and then the re-formulated inputs are fused together by the application of diffusion maps, which reveal the nonlinear relationships among them. This process constructs an embedding that provides a low-dimensional description of the system. The embedding separates between sensors (parameters) that cause stable and unstable behavior of the system. This unsupervised algorithm first studies the system's profile from a training dataset by reducing its dimensions. Then, the coordinates of newly arrived data points are determined by the application of multi-scale Gaussian approximation. To achieve this, a hierarchical processing of the incoming data is introduced.

Introduction

Data mining via dimensionality reduction occupies a central role in many fields such as compression and coding, statistics, machine learning, image and video processing, sampling of networking systems, fault diagnosis, performance analysis, and many more. In essence, the goal is to change the representation of the captured datasets, originally given in a form that involves a large number of dynamically changing independent variables (features), into a low-dimensional description that uses only a small number of free parameters. The relevant low-dimensional features provide insight into the phenomena that generated and govern the inspected data.

We focus on the task of analyzing a dynamic process with a typical behavior. The typical behavior is defined by an equilibrium among the system components. The sensors, which capture the system's behavior, do not capture the phenomena that govern the data directly. Their fusion should express the equilibrium that exists in normal data samples. Abnormal system behavior is expressed by a system profile that deviates from the captured normal profile. The deviation is often expressed by an uncommon balance between the sensors rather than by exceptionally high or low values within each sensor. The proposed algorithm detects the data points that deviate from normal system behavior.

One real-life example that fits the described model is blood tests. The result of a blood test should determine whether one is healthy or not healthy based on a group of measured parameters (sensors) that measure different quantities in the blood. Today, the analysis is done by setting linear thresholds on each of the sampled quantities (sensors). However, it is clear that these measured values have linear and nonlinear relations among each other, and that a blood test result may be abnormal even when all of the measured values lie within the thresholds.

Hierarchical Sensor Fusion

This section describes how to organize a training dataset that was captured from a dynamic system. We assume that the sensor data captured in the training set is stationary. The data captured by the sensors can be numerical or categorical; the application of DM to both types of data is straightforward.

Let T = {T_1, T_2, ..., T_K} of size N × K be a multi-dimensional sensor array that collects data on a dynamic process. Each sensor T_i in T is a column vector, which holds a one-dimensional time series. The dataset is processed in a bottom-up approach to form a hierarchical tree-shaped structure. The hierarchical tree consists of nodes. We refer to the leaf nodes as bottom level nodes, to the tree root as the top level node, and to the rest of the nodes as intermediate level nodes. In the bottom level, each sensor is processed separately: a bottom level tree node holds an embedding that is constructed from a single sensor. The intermediate (middle) levels fuse together groups of sensors. The top level node in the hierarchy combines data from its child nodes to construct an embedding that gives a full description of the dynamic process. The hierarchical embedding tree provides a flexible structure, which can be adapted to the input data.

In each level, the processing is based on the application of Diffusion Maps (DM) (Coifman and Lafon 2006a) to the input. The DM unfolds the geometric structure of the data. The diffusion coordinates describe the data in a reliable way, in which points that capture stable behavior of the system lie in high-density areas of the embedding manifold, while anomalous trends can be detected because they lie in low-density areas. The following describes the embedding process that is applied to the sensor data.
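The paper does not include code; purely as an illustration of the tree organization just described, the following sketch shows one way the nodes could be represented in Python/NumPy. The class and field names (e.g., EmbeddingNode) are ours, not the paper's, and the frequency score field refers to the function s(x) defined later in the Detection of Abnormal Behavior section.

```python
from dataclasses import dataclass, field
from typing import List, Optional
import numpy as np

@dataclass
class EmbeddingNode:
    """One node of the hierarchical embedding tree: bottom level nodes embed a
    single sensor, intermediate nodes fuse a group of children, and the top
    node fuses all of them."""
    name: str
    children: List["EmbeddingNode"] = field(default_factory=list)
    embedding: Optional[np.ndarray] = None   # DM coordinates, one row per time point
    score: Optional[np.ndarray] = None       # frequency score s(x) (defined later)

    @property
    def is_leaf(self) -> bool:
        return not self.children

# Example: K = 4 sensors fused by a single top level node (cf. Figure 1).
leaves = [EmbeddingNode(f"sensor_{i + 1}") for i in range(4)]
root = EmbeddingNode("system", children=leaves)
```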
Bottom level processing: The bottom level of the hierarchical structure holds K embedding manifolds, one for each sensor T_i in T, i = 1, ..., K. The bottom level nodes are constructed in the following steps:

1. By using a sliding window of length μ, which slides along the sensor T_i, a dynamic-matrix T̃_i is constructed. T̃_i is of size (N − μ + 1) × μ. Each dynamic-path, which is a row in the dynamic-matrix, holds μ consecutive measurements from T_i. For the next step, we think of a dynamic-path as a point in a μ-dimensional space.

2. The DM is applied to each of the dynamic-matrices T̃_i, i = 1, ..., K:

(a) A Gaussian kernel
\[ w(x, y) = e^{-\frac{\|x - y\|^2}{2\varepsilon}}, \]
which is a weight function that measures the pairwise similarity between the points in T̃_i, is constructed.

(b) The kernel is normalized by
\[ \hat{w}(x, y) = \frac{w(x, y)}{q^{0.5}(x)\, q^{0.5}(y)}, \qquad q(x) = \sum_{y \in \tilde{T}_i} w(x, y). \tag{1} \]

(c) A transition matrix P is constructed such that
\[ p(x, y) = \frac{\hat{w}(x, y)}{d(y)}, \qquad d(y) = \sum_{x \in \tilde{T}_i} \hat{w}(x, y). \tag{2} \]

(d) If {φ_k} and {ψ_k} are the corresponding left and right eigenvectors of P, then the eigendecomposition of the transition matrix is given by
\[ p(x, y) = \sum_{k \ge 0} \lambda_k \psi_k(x)\, \phi_k(y). \tag{3} \]

(e) The family of DM, which embeds T̃_i, is defined by
\[ \Psi_i(x) = \left( \lambda_1 \psi_1(x),\; \lambda_2 \psi_2(x),\; \lambda_3 \psi_3(x),\; \cdots \right). \]
These coordinates embed T̃_i into a Euclidean space. Usually, a small number of diffusion coordinates is sufficient to describe the behavior of a single sensor.

The manifold organizes the sensor's short-time dynamics. Dynamic-paths that appear frequently are embedded close together, while abnormal paths have a small number of neighboring points on the manifold. Since the diffusion kernel measures distances within a single sensor, the original input sensors can be of different scales. This bypasses the need to scale the sensors (parameters, features) when heterogeneous datasets are processed.
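The paper gives no implementation. As a rough sketch of the bottom-level construction (steps 1–2 and (a)–(e) above) in Python/NumPy, one might write something like the following; the function names, the kernel scale eps, the window length mu, and the number of retained coordinates are illustrative assumptions, not the authors' code.

```python
import numpy as np

def dynamic_matrix(sensor, mu):
    """Step 1: slide a window of length mu along a one-dimensional sensor;
    each row of the returned (N - mu + 1) x mu matrix is one dynamic-path."""
    N = len(sensor)
    return np.stack([sensor[t:t + mu] for t in range(N - mu + 1)])

def diffusion_maps(X, eps, n_coords=2):
    """Steps (a)-(e): Gaussian kernel, normalization (Eqs. 1-2) and spectral
    decomposition of the transition matrix (Eq. 3); returns the first
    n_coords non-trivial diffusion coordinates lambda_k * psi_k(x)."""
    d2 = np.square(X[:, None, :] - X[None, :, :]).sum(-1)   # pairwise squared distances
    W = np.exp(-d2 / (2.0 * eps))                           # (a) Gaussian kernel
    q = W.sum(axis=1)
    W_hat = W / np.sqrt(np.outer(q, q))                     # (b) Eq. 1
    d = W_hat.sum(axis=1)                                   # (c) Eq. 2 (W_hat is symmetric,
                                                            #     so row and column sums agree)
    A = W_hat / np.sqrt(np.outer(d, d))                     # (d) symmetric conjugate of P
    lam, U = np.linalg.eigh(A)
    idx = np.argsort(-lam)
    lam, U = lam[idx], U[:, idx]
    psi = U / np.sqrt(d)[:, None]                           # right eigenvectors of P
    # (e) diffusion coordinates; the first eigenvector is constant and dropped
    return lam[1:n_coords + 1] * psi[:, 1:n_coords + 1]

# A bottom level node for one sensor T_i (the values of mu and eps are illustrative):
# Psi_i = diffusion_maps(dynamic_matrix(T_i, mu=10), eps=1.0, n_coords=2)
```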
Higher level processing: The bottom level nodes embed a single sensor by replacing it with a small number of embedding coordinates. Denote the embedding coordinates, which describe the sensor T_i, by Ψ_i. Higher levels merge the embeddings Ψ_i, i = 1, ..., K, into a joint embedding by the following steps:

1. For each bottom level node, the embedding coordinates that belong to Ψ_i are scaled,
\[ \Psi_i(x) = \left( \frac{\lambda_1 \psi_1(x)}{\|\lambda_1 \psi_1\|},\; \frac{\lambda_2 \psi_2(x)}{\|\lambda_2 \psi_2\|},\; \frac{\lambda_3 \psi_3(x)}{\|\lambda_3 \psi_3\|},\; \cdots \right), \]
so that each coordinate has a norm that is equal to 1. This makes the diffusion coordinates of the different nodes comparable.

2. The scaled diffusion coordinates are gathered as a new input matrix V = {Ψ_1, Ψ_2, Ψ_3, Ψ_4, ...}. The DM is applied to the matrix V. The resulting embedding manifold describes the mutual behavior of the sensors T_i, i = 1, ..., K.

If the sensors are naturally separated into groups, the process can first be applied to each group to form intermediate level nodes and then repeated for integrating all the sensors.
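A minimal sketch of the higher-level fusion, reusing the diffusion_maps function from the bottom-level sketch above; it assumes all child embeddings are aligned over the same time indices (rows), which the paper does not spell out, and the parameter values are illustrative.

```python
import numpy as np
# diffusion_maps is defined in the bottom-level sketch above

def unit_normalize(Psi):
    """Step 1: scale every diffusion coordinate (column) to unit norm so that
    coordinates coming from different child nodes become comparable."""
    return Psi / np.linalg.norm(Psi, axis=0, keepdims=True)

def fuse_nodes(child_embeddings, eps, n_coords=6):
    """Step 2: gather the scaled coordinates into the matrix V and apply DM to V;
    assumes the child embeddings share the same rows (time points)."""
    V = np.hstack([unit_normalize(Psi) for Psi in child_embeddings])
    return diffusion_maps(V, eps, n_coords)
```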
Detection of Abnormal Behavior

The hierarchical embedding structure provides a scheme for studying dynamically evolving systems that are monitored by a set of sensors. The manifolds form a reliable space for tracking and detecting abnormal system behavior. Points that appear frequently are embedded close together, whereas abnormal points have a small number of neighbors in the embedded space. In general, we assume that the typical anomalies in such systems are not expressed by outliers, which receive extremely high or low values in individual sensors, but rather by an unbalanced combination of the sensors, which indicates an unusual system state. In order to evaluate the appearance probability of the embedded points, a frequency score function is defined on the embedded space. Once evaluated, this function is used as an additional embedding coordinate.

We first recall the diffusion distance, as defined in (Nadler et al. 2006; Coifman and Lafon 2006a). For a given dataset Γ, the diffusion distance between two data points x and y, which belong to Γ, is the weighted L2 distance
\[ D^2(x, y) = \sum_{z \in \Gamma} \frac{\left( p(x, z) - p(y, z) \right)^2}{\phi_0(z)}. \tag{4} \]
This distance reflects the geometry of the dataset, where the value of 1/φ_0(x) depends on the points' density. Two data points are close if there is a large number of paths that connect them. Substituting Eq. 3 into Eq. 4, together with the biorthogonality property, the diffusion distance can be expressed with the right eigenvectors of the transition matrix P as
\[ D^2(x, y) = \sum_{k \ge 1} \lambda_k^2 \left( \psi_k(x) - \psi_k(y) \right)^2. \tag{5} \]
In these new coordinates, the Euclidean distance between two points in the embedded space represents the distance between the two high-dimensional points as defined by a random walk.

The diffusion distance is used for the construction of the frequency score function, denoted by s(x), on the set of embedding coordinates Ψ(x). For each point in the embedded space, the value of s(x) is determined by the sum of the point's diffusion distances from its nearest neighbors,
\[ s(x) = \frac{1}{1 + \sum_{y \in S} \|\Psi(x) - \Psi(y)\|}, \qquad S = \{ l \text{ nearest neighbors of } x \text{ in } \Psi \}. \tag{6} \]
For normally behaved points, the sum in the denominator of Eq. 6 is small and the score s(x) is close to 1. Abnormally behaved points receive scores that are close to zero. The function s(x) thus gives a one-dimensional measure in each tree node for detecting exceptional points.

For the sensor fusion process, the frequency score function s(x) is scaled to have a norm that is equal to 1. Then s(x) is added to the diffusion coordinates: the original sensor or group of sensors that contributed to the node is replaced by {Ψ(x) ∪ s(x)}. Figure 1 shows a hierarchical two-level embedding model that constructs the function s(x) on each of the embeddings. The bottom level score functions indicate whether a single sensor behaves normally at a given point in time. The frequency score functions are used as inputs for the next level, which fuses the data. An additional score function is constructed in the same way in the top level node. This function tracks the mutual behavior of the system and alerts on unusual behavior.

Figure 1: An example of a hierarchical two-level embedding model with the constructed frequency score functions s(x) for fusing 4 sensors to detect anomalous behavior.
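A small sketch of Eq. 6 and of the node output {Ψ(x) ∪ s(x)}; the brute-force nearest-neighbor search and the default l are illustrative choices, not prescribed by the paper.

```python
import numpy as np

def frequency_score(Psi, l=10):
    """Eq. 6: s(x) = 1 / (1 + sum of the distances from x to its l nearest
    neighbors in the embedded space); ~1 for normal points, ~0 for anomalies."""
    dists = np.sqrt(np.square(Psi[:, None, :] - Psi[None, :, :]).sum(-1))
    nearest = np.sort(dists, axis=1)[:, 1:l + 1]   # column 0 is the point itself
    return 1.0 / (1.0 + nearest.sum(axis=1))

def node_output(Psi, l=10):
    """Scale s(x) to unit norm and append it to the diffusion coordinates,
    i.e., replace the node's input by {Psi(x), s(x)}."""
    s = frequency_score(Psi, l)
    return np.column_stack([Psi, s / np.linalg.norm(s)])
```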
Processing New Sensor Data

The hierarchical embedding structure can be extended to newly arrived data points. The embedding manifolds in each of its levels are extended by iterative applications of the Geometric Harmonics (Coifman and Lafon 2006b), similarly to multi-scale Gaussian approximations (Burt 1981; Burt and Adelson 1983). Once the manifolds are extended, the values of the frequency score functions can be calculated directly.
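The paper's extension is an iterative, multi-scale geometric-harmonics scheme; the following is only a single-scale, Nyström-style simplification of that idea, included to make the mechanism concrete. The kernel scale eps and the weight normalization are assumptions.

```python
import numpy as np

def extend_embedding(X_train, Psi_train, X_new, eps):
    """Single-scale Gaussian extension of the diffusion coordinates to newly
    arrived points: each new coordinate is a kernel-weighted average of the
    training coordinates.  The paper's scheme repeats such approximations
    over a sequence of decreasing kernel scales (geometric harmonics)."""
    d2 = np.square(X_new[:, None, :] - X_train[None, :, :]).sum(-1)
    K = np.exp(-d2 / (2.0 * eps))
    K /= K.sum(axis=1, keepdims=True)      # normalize the Gaussian weights
    return K @ Psi_train                   # extended embedding coordinates
```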
Evaluation

Dataset

The hierarchical embedding process is applied to high-dimensional data that was collected from a performance monitor. The performance monitor resides inside a system that handles incoming transactions. The data consists of sensors of different types and scales. Similarly to the synthetic example, the transaction system has several states in which the system performance is stable. The number of profiled normally-behaved system states is unknown. The analysis is carried out based only on the inputs from the sensors. Each of the sensors captures partial data regarding the process. The goal is to fuse the sensors in order to detect anomalous system behavior. The anomalies are detected when the equilibrium between the sensors is unusual compared to the training data that captures the system's normal behavior.

For this specific analysis, 13 measured sensors were used. The learning phase implementation can be represented by the hierarchical three-level embedding model that is shown in Fig. 2. In each node, the DM embedding coordinates and the frequency score functions are constructed from the data.

Figure 2: Hierarchical embedding tree for embedding the sensors that track the performance monitor.

The 13 sensors are separated into three groups, which measure different activities in the system (see Fig. 2). The first group consists of six sensors. These sensors measure the average response time of the different transactions that ran in the system in the previous time step. The measured scale (time resolution) is in minutes, while the quantization step is ten seconds long. The sensors in this group are usually correlated. The second group of sensors measures the percentage of executed transactions that wait for a specific system resource. At each time step, the performance monitor tracks all the running transactions in the system and calculates the percentage of the running transactions that wait for a particular system resource such as I/O, database access, etc. Anomalous behavior in the system can be expressed, for example, by an unusual distribution profile of the running transactions over the resources. The last group holds two sensors that capture the capacity (in percentage) of two different memories that the system uses. A middle level was added to the hierarchical structure in order to first embed the joint behavior of the groups and then embed the entire system by fusing these groups.

The single sensors, which form the bottom level of the hierarchical tree, are embedded by carrying out the steps described for bottom level processing in the Hierarchical Sensor Fusion section. Two diffusion coordinates are used to embed each sensor. Then, the frequency score function s(x) is constructed for each sensor. For this application, the number of nearest neighbors l, which is used to calculate s(x), was set to 10. Figures 3, 4 and 5 present a few bottom level embeddings. Figure 3 shows the embedding of 2 out of the 6 sensors that belong to the first group. These nodes are located on the bottom left side of the hierarchical embedding tree that is shown in Fig. 2. The horizontal axes are the first two diffusion maps embedding coordinates. The vertical axis is the constructed frequency score function s(x), which was defined in Eq. 6.

Figure 3: Embedding of 2 sensors that belong to the first group of sensors. These embeddings are located in the left bottom part of the three-level hierarchical embedding tree shown in Fig. 2.

Figure 4 presents two of the five embeddings that belong to the second group. These sensors measure the percentage of running transactions that wait for a specific system resource.

Figure 4: The embedding of 2 sensors that are located in the center bottom part of the three-level hierarchical embedding tree presented in Fig. 2.

The third group of sensors, which is seen on the right bottom part of the tree in Fig. 2, contains two sensors that track the capacity of two memories that are used in the system. Figure 5 shows their embedding.

Figure 5: The embedding of the 2 sensors that belong to the third group and capture the system's memory usage. These sensors are located on the right bottom part of the hierarchical tree (see Fig. 2), which models the system.

The bottom level nodes are fused together, first in groups. The fusing procedure is applied to create the intermediate and top level nodes of the hierarchical tree. The embedding manifolds that are constructed in the intermediate and top level nodes use the first 6 computed DM coordinates to describe the fusion of the data gathered from their child nodes. In addition, the frequency score function s(x) is constructed on the embedding coordinates of each intermediate and top level node. Figures 6, 7 and 8 show the embeddings of the three intermediate level nodes, respectively. The horizontal axes are the first two diffusion maps coordinates and the vertical axis is the function s(x).

Figure 6: The embedding manifold that describes the fusion of the 6 sensors that track the system's transaction behavior. This manifold belongs to the left intermediate level node, which is seen in Fig. 2.

Figure 7: The manifold that belongs to the center intermediate level node from Fig. 2. The manifold embeds 5 sensors and describes the joint behavior of the sensors that track the queues on the system's resources.

Figure 8: The manifold that belongs to the right intermediate level node from Fig. 2. An embedding manifold that expresses the fusion of the 2 sensors that track the system's memory capacity.

The top level node of the hierarchical tree holds an embedding that expresses the behavior of the entire system, obtained by fusing the embedding coordinates from the intermediate level. The embedding is shown in Fig. 9.

Figure 9: The low-dimensional embedding of the dynamic system. This embedding belongs to the top node of the hierarchical tree. Points that have z values close to 0 express abnormal system behavior.

Detection of anomalous behavior is done by tracking the values of the function s(x), which is defined for each node. The tracking is done in a top-down approach. Figure 10 shows the values of the frequency score function s(x) that was constructed for the top level node in the hierarchical tree. This one-dimensional function assigns a score that indicates the state of the system at each time point. The last two points on the right in Fig. 10 indicate an anomalous condition. This anomaly was followed by a system crash.

Figure 10: Left: The frequency score function s(x), which was constructed on the DM embedding coordinates of the top level tree node, which is marked on the right. The horizontal axis is time and the vertical axis is a score. The last two points, which have a score close to 0, are anomaly points.

Figures 11, 12 and 13 present the frequency score functions that belong to the intermediate level nodes. The anomalous behavior that is seen at the last two time points of the top level score function is a result of an unusual combination of the intermediate level embeddings and score functions. Although most of the scores in the intermediate level were normal during the last two time points, their fusion, which deviates from the usual system profile, is clearly seen in the top level score function (Fig. 10).

Figure 11: Left: The frequency score function s(x) that was constructed for the intermediate node that fused 6 sensors. Right: The score function on the left belongs to the marked intermediate level node on the tree.

Figure 12: Left: The frequency score function s(x) that tracks the behavior of 5 of the system's sensors. Right: The marked intermediate level node in the tree holds the score function seen on the left.

Figure 13: Left: Tracking the joint behavior of 2 of the system's sensors by a frequency score function s(x). Anomalous behavior in this group is seen around t = 2500 and at the last point. Right: The score function on the left belongs to the marked intermediate level node on the tree.

For some types of anomalies, meaningful information about the source of the system's failure can be gained by analyzing the score functions s(x) further down the tree, toward the bottom level. Unusual values of the score functions in the bottom level can direct the system's users to specific components in the system that need to be checked.
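To make the top-down tracking concrete, a small illustrative sketch: flag time points whose top-level score drops below a user-chosen threshold and descend the tree (reusing the EmbeddingNode sketch from the Hierarchical Sensor Fusion section) to see which sub-systems or sensors also score low at those times. The threshold value is purely illustrative; the paper does not specify one.

```python
import numpy as np

def detect_anomalies(top_score, threshold=0.2):
    """Return the time indices whose top-level frequency score falls below a
    user-chosen threshold (the value 0.2 is illustrative only)."""
    return np.flatnonzero(top_score < threshold)

def localize(node, t, threshold=0.2):
    """Descend the embedding tree and collect the child nodes whose score is
    also low at time t, pointing to the suspicious sub-system or sensor."""
    suspects = []
    for child in node.children:
        if child.score is not None and child.score[t] < threshold:
            suspects.append(child.name)
            suspects.extend(localize(child, t, threshold))
    return suspects
```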
Conclusion

In this paper, we introduced a general unsupervised data mining approach for organizing and tracking the behavior of dynamically evolving systems. First, a training step is carried out on a training dataset. The high-dimensional dynamic data is embedded into a low-dimensional space by using a hierarchical tree structure. The system is decomposed into sub-systems of different resolutions. The top node of the hierarchical tree describes the entire system, middle level nodes describe different sub-systems, and bottom level nodes describe the dynamic behavior of single input sensors. Diffusion Maps processes are applied at each tree node for projecting the sub-system into a reliable space. This embedding space is described by the Diffusion Maps coordinates. A frequency score function, which is defined on the points in their embedded space, gives a measure for identifying normal and abnormal input points, i.e., points that deviate from the normal system profile.

Acknowledgment

This work was done with the support of the Israeli Ministry of Science and Technology.

References

Burt, P. J. 1981. Fast filter transforms for image processing. Computer Vision, Graphics and Image Processing 16:20–51.

Burt, P. J., and Adelson, E. H. 1983. The Laplacian pyramid as a compact image code. IEEE Transactions on Communications 31(4):532–540.

Coifman, R. R., and Lafon, S. 2006a. Diffusion maps. Applied and Computational Harmonic Analysis: Special Issue on Diffusion Maps and Wavelets 21:5–30.

Coifman, R. R., and Lafon, S. 2006b. Geometric harmonics: A novel tool for multiscale out-of-sample extension of empirical functions. Applied and Computational Harmonic Analysis: Special Issue on Diffusion Maps and Wavelets 21:31–52.

Nadler, B.; Lafon, S.; Coifman, R. R.; and Kevrekidis, I. G. 2006. Diffusion maps, spectral clustering and reaction coordinates of dynamical systems. Applied and Computational Harmonic Analysis: Special Issue on Diffusion Maps and Wavelets 21:113–127.