Manifold Learning and Its Applications: Papers from the AAAI Fall Symposium (FS-10-06)

Applying Diffusion Distance for Multi-Scale Analysis of an Experience Space

Meng Su, Xiaocong Fan
The Behrend College, The Pennsylvania State University
Erie, PA 16563, USA
Email: {mengsu, xfan}@psu.edu

WeiLi Ge
School of Information Engineering, Zhengzhou University
Zhengzhou, 450001, China
Email: iewlge@zzu.edu.cn

Copyright © 2010, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Abstract

Diffusion distance has been shown to be significantly more effective than Euclidean distance in multi-scale recognition of similar experiences in Recognition-Primed Decision making (Fan and Su 2010). In this paper, we first examine the experience data set used in the previous study. The visualization of the data set (using the first three dominant eigenvectors of the diffusion space) suggests the applicability of the diffusion approach. Second, we investigate two approaches to the computation of diffusion distance: Spectrum-based and Probability-Matching based. Specifically, by the 'Spectrum-based' approach we refer to the one derived in terms of the eigenvalues/eigenvectors of the normalized diffusion matrix (Coifman and Lafon 2006). We use the term 'Probability-Matching' to refer to the use of various probability distances, where the original L2 diffusion distance is treated as a special case. Our preliminary results indicate that the L2 diffusion distance performs at least as well as the Spectrum-based distance. Furthermore, when the Spectrum-based approach is applied, we have to use embedding and extension techniques to label new experience data (Lieu and Saito 2009), while such re-computation is not necessary when the L2 diffusion distance is used: we do not need to re-compute the diffusion matrix, and hence the diffusion map, each time a new data point is added. This is more natural and robust, especially for labeling a single new experience. The numerical examples also show an improvement in performance. We are currently working on several other Probability-Matching approaches (e.g. the Earth-Mover's Distance).

1 Introduction

Multi-agent research, including large-scale agent organization and experience-based decision making, often needs to identify similar events or patterns of events buried in a massive, high-dimensional vector space. These events may cross policy boundaries, where multi-scale information analysis is the key to effective reasoning at multiple policy/priority levels. Diffusion geometry provides a natural framework to study the clustering and labeling of such high-dimensional data. The results can lead to accurate decision making.

The diffusion geometry framework (Coifman and Lafon 2006) introduces the diffusion distance (and diffusion maps, diffusion wavelets) and offers a general foundation for multiscale analysis on massive data sets. For a data set $X$ with $n$ observations, suppose a pairwise similarity matrix $W = \{w_{i,j}\}$ can be built. The $n \times n$ matrix $W$ is called a kernel function, representing some notion of affinity or similarity between pairs of points of $X$. A typical choice is the Gaussian kernel $w_{i,j} = e^{-(\|x_i - x_j\|/\varepsilon)^2}$, where $\varepsilon$ is a scale (precision) parameter of the Gaussian distribution. $W$ is then normalized to obtain a matrix $P = D^{-1} W$, where $D$ is the diagonal matrix with entries $D_{ii} = \sum_{j=1}^{n} w_{i,j}$. The matrix $P$ is the diffusion operator on $X$, where each entry $p(i, j) = w_{i,j}/D_{ii}$ can be viewed as the transition probability of going from node $i$ to node $j$ in one time step of the Markov chain on $X$. Since $P$ encodes the geometric information about the data set $X$, the transitions directly reflect the local geometry defined by the immediate neighbors of each node in the graph of the data. Carrying this "random walk" view further, $P^t$ gives the probability of transitions from one node to another in $t$ steps. A family of diffusion distances $D_t$ at step $t$ is defined as

$$D_t(x, y) \triangleq \left( \sum_{z=1}^{n} \frac{\left(p(z, t \mid x) - p(z, t \mid y)\right)^2}{w(z)} \right)^{1/2} = \left\| P^t_{x\cdot} - P^t_{y\cdot} \right\|_{L^2(X)} \tag{1}$$

where $p(z, t \mid x)$ is the probability of the random walk from node $x$ to node $z$ after $t$ steps, $w(z) = \sum_{x} w_{z,x}$, and $P^t_{x\cdot}$ is the row vector in matrix $P^t$ for point $x$.

Diffusion distances can be computed using the eigenvectors $\{\psi_k\}$ $(0 \le k \le L)$ and eigenvalues $\{\gamma_k\}$ of $P$ (Coifman and Lafon 2006), where $1 = \gamma_0 > |\gamma_1| \ge |\gamma_2| \ge \cdots \ge |\gamma_L|$:

$$D_t(x, y)^2 = \sum_{k=1}^{L} \gamma_k^{2t} \left(\psi_k(x) - \psi_k(y)\right)^2 \tag{2}$$

A family of diffusion maps $\{\Psi_t\}_{t \in \mathbb{N}}$ is defined by

$$\Psi_t(x) \triangleq \begin{pmatrix} \gamma_1^t \psi_1(x) \\ \gamma_2^t \psi_2(x) \\ \vdots \\ \gamma_L^t \psi_L(x) \end{pmatrix} \tag{3}$$

That is, the ordinary Euclidean distance in the diffusion space measures the intrinsic diffusion distance on the data. Euclidean distance often fails to capture the global spatial relations among the points of a massive data set, while the diffusion distance has a global meaning for data sets with a nonlinear geometric structure (manifold), and is very robust to noisy data. Since in many practical applications the spectrum of the matrix $P$ has a spectral gap, with only a few eigenvalues close to 1 and all the others much smaller than 1, the diffusion distance at a large enough scale $t$ can be well approximated using only the first few ($\delta$) eigenvectors. This observation provides a theoretical justification for dimensionality reduction.

However, computing the diffusion distance by formula (2) requires evaluating eigenvalues and eigenvectors. This is especially awkward when it must be repeated every time a new data point is added, because the matrix changes.
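As an illustration of definition (1), the following is a minimal sketch that evaluates the diffusion distance directly from the rows of $P^t$, with no eigen-decomposition. It is not the authors' implementation: the function names and the toy points used in the test are hypothetical, the weight $w(z)$ is taken to be the row sum of $W$ as stated above, and plain Python lists are used only to keep the sketch self-contained (a numerical library would be the practical choice for a data set of realistic size).

```python
import math


def gaussian_kernel(points, eps):
    """Kernel matrix with W[i][j] = exp(-(||x_i - x_j|| / eps)^2)."""
    return [[math.exp(-(math.dist(a, b) / eps) ** 2) for b in points]
            for a in points]


def diffusion_operator(W):
    """Row-normalize W into P = D^{-1} W; also return the row sums D_ii."""
    degrees = [sum(row) for row in W]
    P = [[w / d for w in row] for row, d in zip(W, degrees)]
    return P, degrees


def mat_mul(A, B):
    """Plain matrix product of two square list-of-lists matrices."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols]
            for row in A]


def mat_power(P, t):
    """P^t for an integer t >= 1 (t-step transition probabilities)."""
    if t < 1:
        raise ValueError("t must be a positive integer")
    result = P
    for _ in range(t - 1):
        result = mat_mul(result, P)
    return result


def diffusion_distance(Pt, degrees, x, y):
    """Definition (1): weighted L2 distance between rows x and y of P^t,
    with weights 1 / w(z), where w(z) is the row sum of W for node z."""
    return math.sqrt(sum((px - py) ** 2 / w
                         for px, py, w in zip(Pt[x], Pt[y], degrees)))
```

With two well-separated groups of points, rows of $P^t$ belonging to the same group are nearly identical, so their distance is small, while rows from different groups place their probability mass on disjoint sets of nodes. Increasing `t` in `mat_power` corresponds to moving to a coarser diffusion level.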
Lieu and Saito proposed to compute the diffusion distance directly from definition (1) for t = 1, an approach they named NCM (Node Connectivities Matching) (Lieu and Saito 2009). They applied this approach to image pixel type recognition and showed that it outperforms the methods based on the diffusion distance defined by (2).

The rest of the paper is organized as follows. We frame our problem in Section 2. In Section 3, we introduce our method by generalizing NCM to multi-scale levels for any t, present our experimental results, and discuss our further work. The last section summarizes the results.

2 Problem Definition

Naturalistic decision making (NDM) focuses on how people actually make decisions in realistic settings that typically involve ill-structured problems, uncertain dynamic environments, shifting/competing goals, and time stress (Zsambok and Klein 1997). One particular model is Klein's Recognition-Primed Decision framework (RPD) (Klein 1989). The RPD model is based on the supposition that in complex situations human experts usually make decisions based on the recognition of similarities between the current decision situation and previous decision experiences. Cognitive studies have shown that over 95% of human decisions conform to the RPD model in various time-stressed situations (Klein 1998).

The RPD model (Klein 1989) has a recognition phase and an evaluation phase. In the recognition phase, a decision maker synthesizes the observed information about the current decision situation into appropriate cues or patterns of cues, then employs a strategy called "feature-matching" to recall experiences that worked before in a similar situation. These similar experiences are then used to construct candidate solutions, each of which is a course of action that might be applicable to the current situation to achieve the goal of concern. In the evaluation phase, the RPD model emphasizes Simon's satisficing criterion (Simon 1955) rather than objective optimization: a decision maker considers the candidate solutions one by one, terminating the evaluation as soon as a workable solution is obtained.

We can frame our problem as follows. Suppose an agent has a large set $E$ of experiences (knowledge collected from domain experts) about decision making on a certain task type, and the set $F = \{F_i \mid 1 \le i \le m\}$ of features (types of information) relevant to this task type is fixed. Each experience $e_i = \langle S_i, A_i \rangle \in E$ has two parts: a feature-based situation description $S_i = (f_1, f_2, \cdots, f_m)$, where $f_i$ is a value for feature $F_i$, and a course of action $A_i = (a_1, a_2, \cdots, a_k)$ (i.e. a solution successfully implemented in the situation described by $S_i$). Given a new decision situation $D = (d_1, d_2, \cdots, d_m)$, where $d_i$ is a value for feature $F_i$, the feature-matching problem is to find a set $E^* \subset E$ such that the experiences in $E^*$ are similar to $D$ in terms of all the features being considered. The solution construction problem is to synthesize a new course of action such that the solution part of each experience $e_i \in E^*$ has an appropriate influence on the newly synthesized solution. Our objective here is to examine whether the use of diffusion distance, as compared with the traditional Euclidean distance, can perform better in computing the set $E^*$, so that solutions of higher quality can be produced afterward.

Experiences in real-world problems are high-dimensional data. For this study, we choose to use a data set $E$ including 16,383 decision making experiences about how C2 teams react to potential threats that emerge unexpectedly in a metropolis. Members of a C2 team, consisting of an S2 agent (intelligence cell) and an S3 agent (operations cell), need to work collaboratively to handle three types of targets: crowds, insurgents, and IEDs (Improvised Explosive Devices). Two types of friendly units are under S3's charge: Squad units and EOD (Explosive Ordnance Disposal) teams. The S3 agent, when making decisions on how to handle a specific threat, needs to consider information about 34 features (which are classified as target-specific, situation-specific, or weather-related) and decide on resource allocation actions appropriate for that threat. Table 1 gives a portion of the situation description of three example experiences, and Table 2 gives the fixed set of action types and the corresponding parameters.

Table 1: Feature-based situation description of example experiences

Target-specific                        Situation-specific                 Weather
Insurgent  IED  Crowd  Level  Speed    CloseToRoute  UnitReadiness  ...   PrecipitationType  Rate   Visibility
Yes        No   No     XHigh  Slow     Yes           85             ...   Rain               High   Haze
No         No   Yes    Low    Fast     Yes           80             ...   Hail               Light  Fog
No         Yes  No     High   None     No            60             ...   Snow               High   Fog

Table 2: Course of action and related parameters

Seq  Action                                                            Description
1    AgentRecall(Squad, X1)                                            Recall X1 number of squads
2    AgentAssignment(Squad, X2)                                        Assign X2 number of squads
3    AgentRecall(EOD, X3)                                              Recall X3 number of EODs
4    AgentAssignment(EOD, X4)                                          Assign X4 number of EODs
5    ReadinessRecover(X5, X6)                                          Recover X6 number of agents' readiness to X5 percentage
6    MoveTo/RushTo/CreepOver                                           X7: get to the target's location
7    CaptureInsurgent/Monitor/DisperseF/DisperseW/DisperseP/RemoveIED  X8: approach to handle threats

All the experiences in $E$ are complete (both the situation description part and the course of action part are available). Since the courses of action of all the experiences in $E$ follow the same sequence as shown in Table 2, all that matters is the values of the parameters X1 through X8. For each experience $e_i \in E$, the values of X1 through X8 can be concatenated into one string, which will be referred to as the label of $e_i$ below.

3 Methodology and Experiment

Fan and Su (2010) described an approach to compare diffusion distance with the traditional Euclidean distance in identifying similar experiences. This is a scale-up approach at a diffusion level i, where the diffusion distance, evaluated by formula (2) for t = i, is applied to the "transformed" experience space for similar experience identification. It is noted that diffusion maps can filter out high-frequency noise, which suggests that noise can be reduced as the diffusion level increases. They utilized this property of the diffusion process and designed an anytime algorithm for solution construction in recognition-primed decision making.

However, computing the diffusion distance by formula (2) requires evaluating eigenvalues and eigenvectors. This is especially awkward when it must be repeated every time a new data point is added, because the matrix changes. Lieu and Saito (2009) proposed to compute the diffusion distance directly from definition (1) for t = 1, an approach named NCM (Node Connectivities Matching).

As explained in Section 2, the data set $E$ used in this experiment includes 16,383 decision making experiences, which are represented in a 34-dimensional feature space. Since the value ranges of the 34 features are different (some are indicator variables, some are percentages, some are integers with fixed ranges), the data set is first standardized such that all the features have the same range [0, 1]. We denote this standardized set by $X_{n \times m}$, where n = 16383 and m = 34. Due to the huge number of different possible labels (>> 28 actions, see Table 2) for each experience, we did not try to cluster the experiences by labels. However, when we test a small number of clusters (e.g. 6), the distinguishable cluster pattern in the diffusion space (Figure 1) shows that the diffusion distance can reflect the intrinsic features of the experience data more effectively.
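The standardization step can be sketched as follows. The text only states that every feature is brought to the range [0, 1]; column-wise min-max scaling is assumed here as one common way to do that, and both the function name `min_max_standardize` and the handling of constant columns (mapped to 0.0) are illustrative choices rather than details taken from the paper.

```python
def min_max_standardize(rows):
    """Rescale each column of a list-of-lists data set to the range [0, 1].

    Assumption: column-wise min-max scaling. A constant column has no
    spread to rescale, so it is mapped to 0.0 (a hypothetical convention).
    """
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    spans = [max(col) - low for col, low in zip(columns, lows)]
    return [
        [(value - low) / span if span > 0 else 0.0
         for value, low, span in zip(row, lows, spans)]
        for row in rows
    ]
```

After this step an indicator variable, a percentage and a bounded integer all contribute on the same scale to the Euclidean norm inside the Gaussian kernel, which matters because a single scale parameter ε is shared by all 34 features.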
As stated in Section 1, from the labeled set $X$, we first build a symmetric matrix $W$ where $w_{i,j}$ is the similarity between points $x_i$ and $x_j$. $W$ can be taken as a graph, where points $x_i, x_j \in X$ are connected by an edge of weight $w_{i,j}$. For any new unlabeled experience data $y$, we connect $y$ with all the points in $X$ and define the similarity vector $p = (w_{i,y})_{i=1}^{n}$, where $w_{i,y} = e^{-(\|x_i - y\|/\varepsilon)^2}$, and $q = p/\|p\|_2$. It is easy to obtain that the diffusion distance at level $t$ from $y$ to $x_i$ is

$$D_t(x_i, y)^2 = \left\| P^t_{x_i\cdot} - q^T P^t \right\|^2_{L^2(X)} \tag{4}$$

[Figure 1: The clusters of experience data set in the diffusion space]

Our approach evaluates the diffusion distance directly at any level $t$ of the random walk from definition (4). It takes into account all incidences relating the unlabeled new data to the labeled training data instead of performing spectral embedding. This makes it robust to noise, saves time and improves accuracy.

We apply this algorithm to the data set of 16,383 decision making experiences mentioned above at t = 1, 2, 4, 8, and 16, and compare the performance with the result obtained using the traditional Euclidean measurement. The performance was evaluated in terms of the recoverability of labels. In particular, suppose we are given a decision situation $D$ together with its label $\zeta_D$. After a set $E^* = \{e_1, e_2, \cdots, e_k\}$ of $k$ similar experiences is identified for $D$, we can generate another label $\zeta^*_D$ for $D$ by majority vote: each part (C1 through C8) of $\zeta^*_D$ is determined by majority vote of the corresponding part of the experiences in $E^*$. The score of $\zeta^*_D$ is the weighted sum of the correctness (0 or 1) of each part as compared with the corresponding part of the known label $\zeta_D$. We chose $\varepsilon = 1.0$ and varied the parameter $k$ (kNN = 3, 10, 30) for the nearest neighbor search.

[Figure 2: (a) Performance on L2 diffusion distance definition; (b) Performance on eigenvalue/eigenvector diffusion distance. Panel titles: (a) L2 Diffusion Distance (Probability-Matching), (b) Spectrum-Based Diffusion Distance. Horizontal axis: diffusion levels (O, 1, 2, 4, 8, 16); vertical axis: labeling performance (×10^5); curves: kNN = 3, kNN = 10, kNN = 30.]

Figure 2(a) plots the result. It shows that the performance of using directly calculated diffusion distances (levels 1-16) can be significantly better than using Euclidean distance in the original space (level O). For instance, the level-1 performance was lower than the level-O performance, but the performance increased considerably to its peak as the diffusion level increased from 1 to 4 (regardless of the value of kNN). However, it is about the same as using the 'Spectrum-based' diffusion distance in Figure 2(b) (Fan and Su 2010).

We can extend the definition of diffusion distance in (1) from L2 to various other probability or statistical distances (Lieu and Saito 2009). We are currently experimenting with these Probability-Matching approaches (e.g. the Earth-Mover's Distance). More results will be available and, hopefully, will be presented at the Symposium.

4 Conclusion

In this study, we first investigated the agent's large experience data set in its mapped diffusion space and visualized its k-means clustering (e.g. k = 6) in the 3-dimensional space spanned by the first three dominant eigenvectors of the diffusion matrix. The "fireworks" pattern of the clusters suggests the applicability of the diffusion geometry approach to the experience data set. Second, we compared two approaches to the computation of diffusion distance between two experience vectors: one is derived in terms of the eigenvalues/eigenvectors of the diffusion matrix; the other follows the original definition of diffusion distance. Our preliminary results indicate that the second method performs at least as well as the Spectrum-based distance on our data sets. However, the second method is simpler because the evaluation of the eigenvalues and eigenvectors is not necessary. Furthermore, our second approach, which is basically the L2 histogram discriminant of the diffusion distribution of each point, can be generalized to any Probability-Matching approach (e.g. the Earth-Mover's Distance).

References

Coifman, R. R., and Lafon, S. 2006. Diffusion maps. Appl. Comput. Harmon. Anal. 21(1):5–30.

Fan, X., and Su, M. 2010. Using geometric diffusions for recognition-primed multi-agent decision making. In van der Hoek; Kaminka; Lesperance; Luck; and Sen, eds., Proc. of 9th Int. Conf. on Autonomous Agents and Multiagent Systems (AAMAS 2010).

Klein, G. A. 1989. Recognition-primed decisions. In Rouse, W. B., ed., Advances in man-machine systems research, volume 5. Greenwich, CT: JAI Press. 47–92.

Klein, G. A. 1998. Recognition-primed decision making. In Sources of power: How people make decisions. MIT Press. 15–30.

Lieu, L., and Saito, N. 2009. Signal classification by matching node connectivities. In Statistical Signal Processing, 2009. SSP '09. IEEE/SP 15th Workshop on, 81–84.

Simon, H. 1955. A behavioral model of rational choice. Quarterly Journal of Economics 69:99–118.

Zsambok, C. E., and Klein, G., eds. 1997. Naturalistic Decision Making. Lawrence Erlbaum Associates.