Approaches to evaluation of document map creation algorithms

Mieczysław A. Kłopotek, Artur Wilkowski
Institute of Computer Science, Polish Academy of Sciences
Faculty of Mathematics and Information Sciences, Warsaw University of Technology

1. Introduction

Several map creation algorithms were evaluated for their efficiency as well as the quality of the maps they created. The map quality was assessed by the universal evaluation measures described below and, more subjectively, by the final user. The evaluation included comparisons between different types of vector projection (LSA, random projectors with dense and sparse random matrices) and different efficiency improvements to the main SOM algorithm (addressing old winners, initial construction of sparse maps, the batch map algorithm).

2. Map evaluation measures

2.1. Tests for measuring vector projection precision

A good projection scheme is characterized by its capability to preserve spatial relationships between vectors in the input space. A fairly simple way to assess the quality of a projection is to compare distances between vectors in the input and output spaces. For normalized vectors we can use a simple average square error of distances in both spaces:

E_sq = (1/M²) Σ_{i,j=1; i≠j}^{M} (d_ij^p - d_ij^n)²

where
M – the number of documents
d_ij^p – distance between documents i and j in the low-dimensional space
d_ij^n – distance between documents i and j in the high-dimensional space

Another, highly correlated measure, which does not require normalization of vectors, is the Sammon error [21]:

E_Sammon = (1 / Σ_{i,j=1; i≠j}^{M} d_ij^n) Σ_{i,j=1; i≠j}^{M} (d_ij^p - d_ij^n)² / d_ij^n

with M, d_ij^p and d_ij^n defined as above.

2.2. Generic methods measuring the SOM map quality

For measuring the SOM map quality we can use either the average quantization error or the topography error.
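As a concrete illustration, the two projection-precision measures of section 2.1 can be sketched in a few lines of numpy (a sketch only; the function names are ours, not taken from the cited literature):

```python
import numpy as np

def _pairwise(X):
    """Euclidean distance matrix for the rows of X."""
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def avg_square_error(X_high, X_low):
    """E_sq: average squared difference of pairwise distances (i != j)."""
    M = len(X_high)
    d_n, d_p = _pairwise(X_high), _pairwise(X_low)
    mask = ~np.eye(M, dtype=bool)            # exclude i == j pairs
    return ((d_p - d_n) ** 2)[mask].sum() / M ** 2

def sammon_error(X_high, X_low):
    """E_Sammon: distance-weighted stress; needs no vector normalization."""
    d_n, d_p = _pairwise(X_high), _pairwise(X_low)
    mask = ~np.eye(len(X_high), dtype=bool)
    d_n, d_p = d_n[mask], d_p[mask]
    return ((d_p - d_n) ** 2 / d_n).sum() / d_n.sum()
```

An identity "projection" yields zero error under both measures, which is a convenient sanity check.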
The average quantization error [6] tests how well model vectors approximate the documents assigned to them. It is computed as the average distance between document vectors and their closest model vectors:

E_q = (1/M) Σ_{i=1}^{M} ||x_i - m_c||

where
M – the number of documents
x_i – a document vector
m_c – the winner model vector for the document vector x_i

The other measure is the topography error [19]. It measures map orderliness – how well the map topology reflects the similarity relationships between clusters. To this end, the proportion of documents for which the 2 best matching model vectors do not lie in adjacent map units is calculated:

E_t = (1/M) Σ_{i=1}^{M} u(x_i)

where
M – the number of documents
u(x_i) = 1 when the 2 best matching model vectors of the document x_i are not neighbours, and 0 otherwise

These two measures are dependent on the data set used. The proportion between them reflects the compromise between the local ordering of data (their assignment to particular map units) and the data density approximation by the model vectors. This proportion depends on the final width of the neighbourhood function, the diversity of documents being mapped and the size of the map. Generally, the greater the final neighbourhood width, the lower the topography error and the greater the average quantization error.

Additionally, we used a very simple map smoothness measure to check how similar adjacent vectors on the map are to each other (map smoothness is one of the goals of the SOM algorithm). The map smoothness measure is obtained by calculating the average distance between adjacent map units over the whole map.

In order to compare the map quality achieved by several modified SOM algorithms, a good idea is to first create a reference map using the basic algorithm and then match the outcomes of the other algorithms against it. However, several experiments revealed that it is extremely hard to compare two maps created even by the same algorithm.
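The two generic measures defined above can be sketched together in numpy (our own helper; a rectangular map is assumed, given by unit coordinates, with diagonal units counting as neighbours):

```python
import numpy as np

def som_errors(docs, model, grid):
    """Average quantization error E_q and topography error E_t.
    docs  – (M, d) document vectors
    model – (U, d) model vectors, one per map unit
    grid  – (U, 2) map coordinates of the units"""
    # Distance from every document to every model vector.
    dist = np.linalg.norm(docs[:, None, :] - model[None, :, :], axis=-1)
    order = np.argsort(dist, axis=1)
    winner, second = order[:, 0], order[:, 1]

    # E_q: average distance to the winning (closest) model vector.
    e_q = dist[np.arange(len(docs)), winner].mean()

    # E_t: share of documents whose two best units are not adjacent on
    # the map (grid distance above 1.5 excludes even diagonal neighbours).
    apart = np.linalg.norm(grid[winner] - grid[second], axis=1) > 1.5
    e_t = apart.mean()
    return e_q, e_t
```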
The SOM map as a whole behaves in a highly chaotic way: if even a slight disturbance of the data appears or some random factor is introduced (which is always the case, for the document vectors are presented in a random order and the model vectors are initialized randomly at the beginning), the resulting maps can differ dramatically, although both may be of the same quality. Different parts of the map can be rotated, scaled and twisted in many ways, yet they may still reflect the specific clustering structure that is enforced by the data set and the algorithm principles. Although sophisticated measures for comparing two SOM maps exist [20], for the purpose of the tests we developed a simplified one that seems to be sufficient provided that the maps are not very large (as it only takes into consideration the immediate neighbours of a map unit). This measure computes the proportion of pairs of documents that are neighbours on one map but are situated farther apart on the other:

E_cmp = (1/M²) Σ_{i,j=1; i≠j}^{M} testDist(d_ij^I, d_ij^II)

testDist(d_ij^I, d_ij^II) = 1 if (d_ij^I ≤ 1,5 and d_ij^II > 1,5) or (d_ij^II ≤ 1,5 and d_ij^I > 1,5), and 0 otherwise

where
M – the number of documents
d_ij^I – distance between documents i and j on the first map
d_ij^II – distance between documents i and j on the second map

2.3. Data-set specific methods for measuring map quality

An efficient measure of map quality using a priori knowledge about the structure of the data set has been proposed by Kohonen in [6]. Although the categorization of the web pages that this work is concerned with may not be as obvious as the categorization of scientific document abstracts, it is still possible. For this purpose we explored the portal republika.pl, which has the advantage of cataloguing a variety of web pages that are manually organized into a topical structure.
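The simplified map comparison measure just defined can be sketched as follows (document positions on each map are given as grid coordinates; 1,5 is the neighbourhood threshold from the formula):

```python
import numpy as np

def map_comparison_error(pos1, pos2, thr=1.5):
    """Fraction of document pairs (over M^2) that are neighbours
    (grid distance <= thr) on one map but not on the other."""
    d1 = np.linalg.norm(pos1[:, None] - pos1[None, :], axis=-1)
    d2 = np.linalg.norm(pos2[:, None] - pos2[None, :], axis=-1)
    mask = ~np.eye(len(pos1), dtype=bool)        # skip i == j pairs
    near1, near2 = d1[mask] <= thr, d2[mask] <= thr
    return (near1 != near2).sum() / len(pos1) ** 2
```

Two identical maps give an error of 0, and the measure is symmetric in its two arguments, as the formula requires.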
What is more, this hierarchical structure can easily be established based on the URL addresses of the web pages. We used this property to estimate the degree of expected similarity between web pages (without in fact performing any categorization). As a measure of topical similarity we took the distance between documents in the directory structure. Two documents occupying the same directory were assigned the distance 0, those from directories in a parent-child relationship – distance 1, those from sibling directories – distance 2, and so on.

Having thus established a metric of category distance, we directly used a cluster purity quality measure similar to the one suggested by Kohonen. For each map unit the number of documents belonging to the dominant category is calculated. Documents from other categories are treated as errors. Then the average is computed over the whole map. The cluster purity measure adopted in this work is only a very imperfect approximation of a measure utilizing manually established categories. For instance, it does not take overlapping categories into consideration. During the tests several problems also stemmed from the fact that even topically similar web pages were assigned to completely different categories (web pages devoted to horses, for example, could appear both in the category sport/horseriding and in hobby/animals), and therefore some of them were classified as errors. However, the strong correlation observed between the use of simplified map generation algorithms and the decline in the measure value proves that this measure generally evaluates the quality of the generated map clusters well.

Bearing in mind that similar documents are bound to occupy not only single map units but also adjacent units (map clusters may contain several map units), we also developed a measure which calculates the average distances between documents belonging to categories of different degrees of similarity (e.g.
the average distance between documents belonging to the same category, between documents with a degree of similarity of 1, 2 and so on, according to the explicit "category distance" measure described above). Note that this measure must be used together with other measures, as it assigns the lowest values (low degrees of similarity are considered the best) to a map where all documents are localized in only one map unit.

3. The evaluation of documents' vector projection

The main projection experiment was carried out on a corpus of 1000 documents collected from servers of the web portal republika.pl. Of the 33450 keywords that remained after filtering with stop-word lists and word stemming, we removed those that appeared in fewer than three documents in the whole text corpus. This left us with an input space of 4599 entries. Apart from applying direct measures for assessing the quality of the projected vectors (like the Sammon error), for every configuration we ran the 20-step (100 iterations per map unit) basic SOM algorithm (utilizing the Kohonen Learning Rule) to evaluate the accuracy of classification for different projection settings. The map taught by the algorithm consisted of 196 map units (a rectangular map of 14 x 14 units). The test was carried out for output vector dimensions 20, 30, 50, 100 and 200. As the projection results for vector dimension 200 proved comparable to those obtained using fully-dimensional vectors, we abandoned tests for higher dimensions. Of the projection methods described in the Methodology chapter, we evaluated the normally distributed random projector, the random projector with bipolar entries, sparse random projectors with the number of nonzero elements ranging from 1 to 5, and the projector based on SVD decomposition (the LSA projector) utilizing the Lanczos algorithm. The maps generated using the different projections were compared against the map created using the fully-dimensional space.
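Before turning to the results, the two data-set specific measures of section 2.3 can be sketched as follows (a simplification of our procedure; category paths are assumed to be already extracted from the URLs):

```python
from collections import Counter

def category_distance(path_a, path_b):
    """Tree distance between category paths: 0 for the same directory,
    1 for parent and child, 2 for siblings, and so on."""
    a, b = path_a.strip('/').split('/'), path_b.strip('/').split('/')
    common = 0                      # length of the common prefix
    for x, y in zip(a, b):
        if x != y:
            break
        common += 1
    return (len(a) - common) + (len(b) - common)

def cluster_purity(units):
    """units: one list of document categories per map unit. Returns the
    fraction of documents belonging to their unit's dominant category."""
    correct = sum(max(Counter(cats).values()) for cats in units if cats)
    total = sum(len(cats) for cats in units)
    return correct / total
```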
One instance of this full-dimensional map was also used as a reference map for direct map comparisons. To increase statistical accuracy each test consisted of six consecutive algorithm runs. The whole test took about 20 hours. All tests were completed on a Celeron 600 computer with 192 MB RAM.

Projection                   | Sammon error    | cluster purity (%) | avg. quantization error | topography error | category distance (0) | category distance (1) | distance to reference map | projection time (s) | SOM algorithm time (s)
full size vectors            | –               | 67,4 ± 0,8 | 0,3596 ± 0,0005 | 0,0180 ± 0,0070 | 0,67 ± 0,06 | 2,57 ± 0,08 | 0,0250 ± 0,0010 | –              | 1076,0 ± 7,0
LSA                          | 0,0168          | 68,4 ± 1,4 | 0,2850 ± 0,0011 | 0,0300 ± 0,0100 | 0,65 ± 0,08 | 2,50 ± 0,20 | 0,0254 ± 0,0009 | 19,990 ± 0,180 | 410,9 ± 1,9
Random pr gauss              | 0,0032 ± 0,0015 | 66,2 ± 1,3 | 0,3620 ± 0,0100 | 0,0302 ± 0,0060 | 0,89 ± 0,09 | 2,63 ± 0,18 | 0,0314 ± 0,0013 | 8,121 ± 0,005  | 409,4 ± 0,6
Random pr bipolar            | 0,0050 ± 0,0020 | 66,5 ± 1,3 | 0,3460 ± 0,0160 | 0,0308 ± 0,0050 | 0,98 ± 0,05 | 2,90 ± 0,30 | 0,0311 ± 0,0009 | 8,159 ± 0,008  | 410,5 ± 1,5
Random pr sparse (5 nonzero) | 0,0027 ± 0,0004 | 66,3 ± 0,7 | 0,3540 ± 0,0080 | 0,0220 ± 0,0050 | 0,86 ± 0,10 | 2,65 ± 0,16 | 0,0301 ± 0,0014 | 0,820 ± 0,050  | 345,0 ± 3,0

Figure 1. Comparison of different types of projection for model vector dimension 200 (standard deviations are taken as error margins)

The most distinct thing that can be observed in figure 1 is the very good performance of the LSA projector. Its classification accuracy, measured by cluster purity, is even slightly better than the one obtained using full size model vectors. This is also reflected by the lower values of the average quantization error. Other measures, except for the topography error, are comparable to those obtained using fully-dimensional vectors, particularly the average distances between documents belonging to the same category (category distance (0)). It must also be noted that the degree of similarity to the reference map is virtually the same for the LSA projection and full size document vectors.
These results can of course be partly attributed to possible deficiencies of the evaluation methods adopted (which only aim at approximating human perception of the map), but they can also demonstrate the LSA algorithm's ability to remove noise from the data and to emphasize important relationships, as suggested in [7]. The second important issue about the Lanczos algorithm is its fairly good efficiency in comparison with the duration of the basic version of the SOM algorithm used subsequently. Additionally, the projection performed using the Lanczos algorithm proved to be only slightly more than 2 times slower than the random projection utilizing a completely filled random matrix. However, the efficiency comparison with the sparse matrix random projection is much in favour of the latter (20 s vs. 0,8 s).

Similarly to the results obtained in [6], the performance of the random projector group turned out to be very good. The classification accuracy for random projectors was no more than 1% worse than the one achieved with full size document vectors. Regarding the other measures, the random projector family also seems to work very well. Sparse random projectors even tend to have a better topological structure (measured by the topography error) than maps based on any other projection scheme. However, the degree of similarity to the reference map is much lower for randomly projected documents than for the LSA projection. (At this point it is worth noting that the measure directly comparing the original and projected vector spatial relationships (the Sammon error) does not necessarily reflect the quality of the projection, as the LSA projection, despite scoring worse in this test, performs better in every other.) It is also interesting to observe that the map quality measures are to a large extent independent of the sparsity of the matrix used for random projection. The projection results for different numbers of nonzero elements in a matrix column are presented in figure 2.
Number of nonzero elements in a column | cluster purity (%) | projection time (s) | SOM algorithm time (s)
1   | 64,0 ± 0,2 | 0,430 ± 0,030 | 196,0 ± 3,0
2   | 64,0 ± 2,0 | 0,550 ± 0,040 | 266,0 ± 4,0
3   | 67,0 ± 1,1 | 0,636 ± 0,019 | 307,0 ± 4,0
4   | 65,5 ± 1,7 | 0,720 ± 0,040 | 331,0 ± 3,0
5   | 66,3 ± 0,7 | 0,820 ± 0,050 | 345,0 ± 3,0
200 | 66,5 ± 1,3 | 8,159 ± 0,008 | 410,5 ± 1,5

Figure 2. Relationships between the number of nonzero elements in a column of a random projector, cluster purity, and the projection and SOM algorithm durations. Document vector dimension 200 (standard deviations are taken as error margins)

As can easily be noted, judging only by the number of nonzero elements in the random matrix it is quite difficult to distinguish a better or worse projection. What is more, the sparse random projector with just one nonzero element in each column, for output vector dimensionality 200, performed not much worse than the random projection with a completely filled matrix. This can be attributed to the quite small number of documents involved in the experiment in comparison with the output vector dimensionality. This allows a direct mapping of one element from the high-dimensional space to an element of the destination space without running a great risk of confusing two different words represented by the same unit in the output vector. Therefore, if a great number of varying documents is concerned, it is better to ensure that every word is mapped onto a large enough pattern of entries in the output space. The number of nonzero elements in the random matrix affects not only the projection time but also the time in which the main SOM algorithm runs (up to a factor of 2 when the two matrix density extremes are compared). This is the result of the final structure of the vectors that underwent projection: the sparser the random projection matrix, the sparser the resulting vectors.
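A sparse random projector of the kind compared in figure 2 can be generated as below (a sketch only; the ±1 entries and the normalization by the square root of the number of nonzeros are our assumptions, chosen to keep expected vector norms roughly unchanged):

```python
import numpy as np

def sparse_random_projector(n_in, n_out, nonzero=5, seed=0):
    """Projection matrix with exactly `nonzero` random +/-1 entries in
    each column (one column per input-space dimension)."""
    rng = np.random.default_rng(seed)
    R = np.zeros((n_out, n_in))
    for j in range(n_in):
        rows = rng.choice(n_out, size=nonzero, replace=False)
        R[rows, j] = rng.choice([-1.0, 1.0], size=nonzero)
    return R / np.sqrt(nonzero)

# Projecting a document vector x of dimension n_in is then simply y = R @ x.
```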
This of course contributes to much more effective calculation of inner products between documents during the winner search phase, if it is optimized for handling sparse document vectors. For this reason it does not seem reasonable to invest in a dense random matrix, as it greatly increases the computational overhead while the quality decline is almost insignificant. In [6] the authors give 5 as the number of nonzero elements in each random matrix column, which should be sufficient even for large document collections. It should be mentioned that the LSA projector also produces dense output vectors, so the algorithm efficiency cannot be boosted by quick inner product computations. In this case the opportunity for improvement lies rather in its quality potential, namely in the further reduction of vector sizes (the cluster purity for 100-dimensional vectors and the LSA algorithm was close to 65%!).

During the tests both vector sizes 200 and 100 performed well in terms of map quality. For lower dimensionalities there was a more rapid decrease in the cluster purity and other quality measures. Although vector sizes 100 and 200 work well with this specific document collection, this need not be the rule for all document collections, especially ones much larger or more topically varied and with a larger vocabulary. In [6] the authors adopted the value of 500 as sufficient even for a document collection counted in millions of documents. The cluster purity as a function of the projected vector dimensionality (for LSA and "dense" random projection) is displayed in figure 3.

[Plot omitted: cluster purity (0-80%) versus document vector dimension (0-250) for random projection and LSA.]

Figure 3. Relation between the cluster purity and the size of the document vectors used

4.
The evaluation of the sped-up SOM creation methods

In order to evaluate the performance of different fast SOM map creation algorithms against the main algorithm, we carried out a series of small-scale experiments on a text corpus containing 1000 web pages from the domain republika.pl. The documents were placed on a map of 14 x 14 units. Before that, the document vectors of dimensionality 4599 (after removing words appearing in fewer than three documents) were projected onto a 100-dimensional space using a random projector with 5 nonzero entries in each column of the matrix. With the basic SOM algorithm the map was taught for 20 major iterations; in each, all documents from the corpus were presented to the neural network in random order. This gave more than 100 iterations per map unit. During the learning process the learning coefficient and the neighbourhood size were gradually reduced – fast at the beginning and slowly at the end of the process. During the final stages about 11% of the map units were updated in each step (this meant a final neighbourhood size of about 3 units).

The basic algorithm was compared with modified SOM methods. The speedup was obtained by initially creating a sparse map approximating the large map and then fine-tuning it. The fine-tuning algorithms we tested were the SOM algorithm based on the Kohonen Learning Rule and the Batch Map Algorithm described in the Methodology chapter. Assuming that the resulting map unit length was 1, the initially created sparse maps were tested with unit lengths 1.7, 2.4, 2.7, 3.3, 3.7 and 5.3. The unit lengths were chosen purposefully to avoid multiples of the resulting map unit length. We noticed that in such cases, especially for low unit lengths (2, 3) and when using the batch map algorithm, the clusters "inherited" from the sparse map (created around map points directly underlying their counterparts on the sparse map) tend to remain stable and "dissolve" only very little into adjacent map points.
This problem could be addressed by reducing the final neighbourhood size, but this results in a worse overall map ordering and an increase of the topography error. Thankfully, the problem turned out to be much less relevant when non-integer sparse map unit sizes were applied – so we adopted the latter approach.

For the tests of the batch map algorithm the sparse map was carefully taught for 20 major iterations, then a dense map was approximated and the batch map fine-tuning algorithm was run for 5 iterations (the number of iterations suggested in [6]). The tests of fine-tuning using the simple Kohonen Rule comprised teaching the sparse map for 5, 10 or 15 major iterations and fine-tuning the map for the remaining 15, 10 or 5 iterations with the Kohonen Rule. For each set of parameters we conducted a series of 6 experiments, which as a whole took about 12 hours. All tests were completed on a Celeron 600 computer with 192 MB RAM.

The performance of the fine-tuning process using the Kohonen Rule, measured by the topography error, average quantization error and cluster purity, was for fine-tuning taking more than 10 iterations the same as or very close to the performance of the basic SOM algorithm. However, the speedup factor was less than 2 in this case. The remaining methods proved more promising as far as the efficiency gain is concerned. The relations between the duration of the algorithms and the cluster purity are presented in figure 4.

[Plot omitted: classification accuracy (50-64%) versus time (200-0 s) for the basic Kohonen algorithm, Kohonen 15+5 and Batch Map 5.]

Figure 4. The comparison of the classification accuracy for the basic Kohonen algorithm, the Kohonen algorithm with 15 main iterations and 5 fine-tuning Kohonen steps, and the Kohonen algorithm with 5 batch map fine-tuning steps. The subsequent points on the graph represent increasing unit sizes of the sparse maps (obviously, the greater the unit sizes, the shorter the duration time but the lesser the map precision too).
What we can see from the graph is that fine-tuning by the simple Kohonen Rule outperforms batch mapping for the same sparse map unit sizes. The difference is quite small; for the cluster purity measure it does not exceed 2 percent. The batch map algorithm dominates slightly where efficiency is concerned and creates the map about 10% faster than the module utilizing the Kohonen Rule. It is interesting to note that the values of cluster purity for the batch map algorithm stabilize a little above 55% even for very sparse initial maps. This indicates that 5 batch map steps (unlike the corresponding 5 Kohonen major iterations) suffice to perform local ordering of the data even for very coarse map approximations. These good results might be deceptive, however – so we should not rely on them to judge the map as a whole. If we look at figure 5, which visualises the relationship between the algorithm duration and the topography error, we can note a quick rise of the topography error in all cases for larger sparse map unit sizes (the turning point can be located close to the map unit size 2,7).

[Plot omitted: topography error (0-0,1) versus time (200-0 s) for the basic Kohonen algorithm, Kohonen 15+5 and Batch Map 5.]

Figure 5. The comparison of the topography error for the basic Kohonen algorithm, the Kohonen algorithm with 15 main iterations and 5 fine-tuning Kohonen steps, and the Kohonen algorithm with 5 batch map fine-tuning steps.

This well illustrates the fact that the tuning methods (even the Batch Map with 5 steps for a small neighbourhood size), although able to improve the map locally, can do very little to make up for a poor global ordering of the map. With respect to this measure, too, the Kohonen Rule fine-tuning seems to give results of the best quality, but in a relatively larger amount of time than the Batch Map algorithm.
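For concreteness, one Batch Map fine-tuning iteration can be sketched as follows (a sketch only: a Gaussian neighbourhood function is assumed here, while the actual neighbourhood schedule of the tested implementation may differ):

```python
import numpy as np

def batch_map_step(docs, model, grid, radius):
    """One Batch Map iteration: every model vector becomes a
    neighbourhood-weighted mean of the documents won by nearby units."""
    # Winner unit for every document.
    dist = np.linalg.norm(docs[:, None, :] - model[None, :, :], axis=-1)
    winner = dist.argmin(axis=1)

    # Per-unit sums and counts of the assigned documents.
    sums = np.zeros_like(model)
    counts = np.zeros(len(model))
    np.add.at(sums, winner, docs)
    np.add.at(counts, winner, 1.0)

    # Gaussian neighbourhood weights between map units.
    gdist = np.linalg.norm(grid[:, None] - grid[None, :], axis=-1)
    h = np.exp(-gdist ** 2 / (2.0 * radius ** 2))

    num = h @ sums
    den = (h @ counts)[:, None]
    return np.where(den > 0, num / den, model)
```

With a small radius each model vector collapses to the mean of its own documents; larger radii smooth the map, which is the knob the final neighbourhood width discussion above refers to.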
For the next tests we chose the sparse map unit length 2,7, as this offers considerable speedup and still lies before the "steep slope" of the topography error function.

In the next stage of the experiment we evaluated the principle of addressing winners from previous iterations in order to speed up the winner search process. During preliminary tests we noticed that the hit ratio of choosing the next winner starting from the previous ones is very high near the end of the learning phase (over 99%) and quite low at the beginning (about 50%). Therefore we decided that a better solution than intermittently applying a full winner search every fixed number of iterations would be to adjust the number of iterations between full winner searches to the stage of the learning process. We related the number of iterations to the current size of the neighbourhood as, intuitively, the hit ratio largely depends on whether the map is updated globally or only locally. So the number of iterations between consecutive full winner searches is increased in an inversely linear manner – more slowly with each iteration – bringing it nearer to some pre-specified constant value (for the experiment we chose a maximum of 4 old-winner-search iterations between consecutive full winner searches).

Algorithm | cluster purity (%) | avg. quantization error | topography error | category distance (0) | category distance (1) | distance to reference map | SOM algorithm time (s)
Kohonen – full size vectors | 67,4 ± 0,8 | 0,3596 ± 0,0005 | 0,018 ± 0,007 | 0,67 ± 0,06 | 2,57 ± 0,08 | 0,0250 ± 0,0010 | 1076 ± 7
Kohonen – projection dim 100 | 63,4 ± 1,6 | 0,3470 ± 0,0160 | 0,036 ± 0,006 | 1,30 ± 0,30 | 3,10 ± 0,30 | 0,0360 ± 0,0010 | 197 ± 3
Kohonen – 15-5 | 57,4 ± 1,4 | 0,3600 ± 0,0200 | 0,033 ± 0,007 | 1,22 ± 0,06 | 3,20 ± 0,20 | 0,0359 ± 0,0017 | 68 ± 1
Kohonen – 15-5, old winner search | 56,5 ± 2,3 | 0,3630 ± 0,0080 | 0,058 ± 0,007 | 1,39 ± 0,17 | 3,40 ± 0,30 | 0,0357 ± 0,0011 | 44 ± 1
Batch map (5 iterations, dim 100) | 55,5 ± 1,6 | 0,3620 ± 0,0190 | 0,039 ± 0,009 | 1,27 ± 0,04 | 3,30 ± 0,20 | 0,0362 ± 0,0013 | 66 ± 1
Batch map (5 iterations, dim 100), old winner search | 55,5 ± 1,6 | 0,3630 ± 0,0100 | 0,046 ± 0,008 | 1,29 ± 0,10 | 3,60 ± 0,40 | 0,0359 ± 0,0014 | 41 ± 1

Figure 6. Comparison of efficiency and accuracy of basic and sped-up map creation algorithms for the random projection algorithms (2,7 is the unit length of the initial sparse map; standard deviations are taken as errors)

As could be expected, the impact of the old winner search improvement on the cluster purity of the map and the average quantization error was very little or, in the case of the batch map algorithm, none at all (see figure 6). On the other hand, the values of the topography error increased noticeably. This directly results from the fact that the lack of a full winner search in each iteration affects only the global organisation of the map, which can still be locally very well ordered. It is worth noting that the decline in topographical accuracy was more prominent for the simple Kohonen algorithm than for the batch map rule. This, together with a significant time saving (from over 60 to about 40 seconds), makes the old winner search an especially effective supplement to the batch map algorithm.

Algorithm | cluster purity (%) | avg. quantization error | topography error | category distance (0) | category distance (1) | distance to reference map | SOM algorithm time (s)
Kohonen – full size vectors | 67,4 ± 0,8 | 0,3596 ± 0,0005 | 0,018 ± 0,007 | 0,67 ± 0,06 | 2,57 ± 0,08 | 0,0250 ± 0,0010 | 1076 ± 7
Kohonen – LSA projection dim 100 | 64,8 ± 1,2 | 0,2355 ± 0,0009 | 0,041 ± 0,010 | 0,63 ± 0,03 | 2,42 ± 0,16 | 0,0251 ± 0,0018 | 211 ± 1
Kohonen – LSA, 15-5 iterations | 56,8 ± 1,3 | 0,2407 ± 0,0017 | 0,044 ± 0,017 | 0,75 ± 0,07 | 2,70 ± 0,30 | 0,0259 ± 0,0019 | 73 ± 1
Kohonen – LSA, 15-5 iterations, old winner search | 57,7 ± 1,3 | 0,2407 ± 0,0016 | 0,040 ± 0,011 | 0,72 ± 0,08 | 2,60 ± 0,30 | 0,0254 ± 0,0001 | 48 ± 1
Batch map – LSA (5 iterations, dim 100) | 58,2 ± 2,2 | 0,2381 ± 0,0018 | 0,032 ± 0,005 | 0,69 ± 0,06 | 2,51 ± 0,20 | 0,0260 ± 0,0020 | 72 ± 1
Batch map – LSA (5 iterations, dim 100), old winner search | 57,7 ± 1,4 | 0,2366 ± 0,0010 | 0,044 ± 0,007 | 0,71 ± 0,04 | 2,55 ± 0,15 | 0,0266 ± 0,0011 | 44 ± 1

Figure 7. Comparison of efficiency and accuracy of basic and sped-up map creation algorithms for the LSA projection algorithm (2,7 is the unit length of the initial sparse map; standard deviations are taken as errors)

To finish off the tests we performed another experiment for the same choice of parameters, but for document vectors projected using the LSA algorithm (figure 7). In each of the tests the methods using LSA projection exhibit about a 2% advantage in cluster purity over the random projection methods. They also show much lower quantization error values; however, this measure may not allow a direct comparison of methods, as it is data set dependent (and therefore projection dependent). The more important fact, which can be measured more reliably, is that all the algorithms based on LSA projection are able to preserve a very low average distance between documents from the same category – category distance (0) – (0,7 for LSA and 1,3 for the random projection) and a very high degree of similarity to the reference map created with the simple Kohonen rule using full-size vectors (the measured distance was about 0,026 for the LSA-based algorithms and about 0,036 for the random projection based algorithms).
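The old winner search shortcut evaluated in figures 6 and 7 can be sketched as a local hill climb on the map grid (a sketch only; the real implementation also interleaves the periodic full winner searches described earlier):

```python
import numpy as np

def local_winner(doc, model, grid, prev_winner):
    """Winner search starting from the previous winner: repeatedly move
    to the closest model vector among the current unit and its grid
    neighbours. It may miss the true winner, which is why full searches
    are still performed periodically."""
    current = int(prev_winner)
    while True:
        near = np.linalg.norm(grid - grid[current], axis=1) <= 1.5
        cand = np.flatnonzero(near)          # current unit + neighbours
        d = np.linalg.norm(model[cand] - doc, axis=1)
        if d.min() >= np.linalg.norm(model[current] - doc):
            return current                   # no neighbour strictly closer
        current = int(cand[d.argmin()])
```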
The values of the topography error for the simple algorithm versions were slightly greater in the case of the LSA-projected vectors (it is difficult to make accurate comparisons due to the quite high standard deviations of the results). The use of sped-up versions of the algorithms did not result in an increase of this parameter, as it did in the case of the random projection vectors, so for the sped-up algorithms (the batch map and the batch map with old winner search) the results were comparable or even a little in favour of the LSA algorithm.

The LSA-based algorithms are about 10% slower than the corresponding random projection based algorithms (with a sparse projection matrix) due to denser document vectors and slower winner vector computation. If we sum up the durations of both the projection and the SOM algorithm, this ratio increases to about 25%. However, this still seems a reasonable computational overhead if we take into account the quality advantages of the LSA algorithm. This makes the LSA projection with the batch map and old winner speedups an interesting alternative to the random projection based solutions.

Both of the recent figures demonstrate that the batch map algorithm and the simple SOM algorithm based on the principle of map densening turn out to be about 3 times faster than the SOM algorithm working on a full-size map from the very beginning. The application of the old winner search increases this ratio to about 4,8. As a result, the fastest version of the algorithm is 24-26 times faster than the basic algorithm working on full-size document vectors. According to [6] this proportion is likely to change further in favour of the sped-up algorithms for larger maps. This is because the unit size of the sparse map should be adjusted to the final neighbourhood size in the learning process. For larger maps, where larger neighbourhood sizes are used, a larger sparse map unit can be used and a significantly greater efficiency gain obtained.
For larger maps several map densening steps can also be taken – so the time-consuming work on a full-size map can be reduced to a minimum while enough map quality is preserved at the same time.

5. The manual evaluation of the sped-up map creation methods

The parametric results are obviously not the most important factor of map creation. What really counts is how users perceive the maps created – how concise and logical they find the cluster structure displayed on the map. In order to compare different map creation algorithms using the "human factor", two people were independently asked to assess the quality of several maps as well as the accuracy and usefulness of the search facility for the given maps. The test procedure went as follows. For the first test the users were asked to manually assess the purity of the map units (a measure similar to the cluster purity but performed in a more intuitive and "intelligent" manner). As the second test the users were asked to assess the general map organisation: to evaluate the extent to which similar documents tend to lie in nearby map units and whether document clusters lying not far from one another are topically much different (this measure reflects the topography quality of the map). Both tests were marked on a scale from 0 to 10. In addition to these two tests, after some preliminary map exploration the users were asked to choose 10 words at random (5 words each). Then, in subsequent experiments, the users were asked to query the map for these words and evaluate the quality of the response. Each query result was assessed on a scale of 0-1 (depending on the relevance to the query of each of the subsequent best matches marked on the map).
Manual tests were performed for the simple SOM algorithm working on document vectors projected by the LSA and various random projectors to the 100-dimensional space, the same algorithm with a dense (gaussian) random projector and a 300-dimensional space, and the batch map algorithm using the old winner search principle and 100-dimensional LSA-projected vectors (all other settings for all algorithms were the same as during the previous experiment). The average results of the experiments are shown in figure 8.

Map creation algorithm | cluster purity | topography quality | query results
LSA – 100 | 9,5 | 8,5 | 10
Random pr gauss – 100 | 7 | 4,5 | 1,75
Random pr bipolar – 100 | 7 | 6 | 1,5
Random pr sparse – 100 | 7 | 7 | 2,5
Random pr gauss – 300 | 9 | 9 | 5
LSA – 100 + map densening, batch map + old winners | 8,5 | 9,5 | 9,5

Figure 8. User evaluation of different map creation techniques

As is clearly visible in figure 8, the testers confirm the very high quality of the map created from vectors projected using the LSA algorithm. If the random projection is used (for the same model vector dimension, 100), both users agree that the maps are relatively poorer than the map based on the LSA projection algorithm. However, they still score quite high (in most cases about 7), which means that they are still regarded as reflecting the structure of the document collection well. It turned out that in order to achieve, with the random projection, map quality comparable to that obtained with the LSA algorithm, larger model vector sizes must be used. In the experiment, the use of the random projector and model vectors of dimension 300 gave results very similar to those of the LSA algorithm for vector dimensionality 100. It must be noted that the computational complexity of the SOM algorithm for vectors with 300 entries is 3 times the computational complexity for vectors with just 100 entries.
At this point the question must be asked whether it is profitable to use quick random projection algorithms, bearing in mind that the not much slower LSA algorithm can give results of considerably higher quality. The answer in favour of the LSA algorithm is straightforward in the case of small document collections and low model vector dimensionalities. When the dimensionality of model vectors is larger, the LSA projector would not prove as effective as for lower dimensionalities, since its main strength is choosing the most important factors of the document-by-term matrix. Therefore, for model vector dimensionalities over 500, the results of the SVD decomposition and of random projection might to a large extent be similar. Figure 8 also shows that the main SOM algorithm speedups (map densening, the batch map algorithm, old winner search), although they much shorten the running time, do not have much negative impact on the map quality as perceived by users. This means that the fast approximation of the SOM algorithm suggested in [6], when applied to information retrieval problems, gives results almost indistinguishable from the outcome of the original algorithm. The difference in projection quality between random projection and the SVD decomposition is best depicted by the difference in query accuracy results. For all model vector dimensions, querying accuracy for the LSA algorithm is almost perfect, while for the random projection algorithms it is very poor. The latter improves only when 300-dimensional vectors are utilized, and even then reaches only about half of the accuracy of the LSA algorithm. Finally, it is interesting to observe that the user-perceived quality of the map was somewhat correlated with the distance-between-categories measure. The poorly marked solutions had a distance-between-categories measure above 1.0, while all positively marked algorithms (LSA and random projection of dimension 300) scored significantly below 1.0.
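The old winner search speedup exploits the fact that, late in training, the map changes slowly, so a document's new winner is almost always at or near its previous one. A minimal sketch of the idea on a rectangular grid follows; the function names and the hill-climbing stopping rule are our own, and the actual Websom implementation in [6] differs in detail:

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def grid_neighbours(unit, width, height):
    # 4-neighbourhood of a unit on a width x height rectangular map,
    # with units indexed row by row.
    x, y = unit % width, unit // width
    for dx, dy in ((-1, 0), (1, 0), (0, -1), (0, 1)):
        nx, ny = x + dx, y + dy
        if 0 <= nx < width and 0 <= ny < height:
            yield ny * width + nx

def find_winner(doc, model, prev_winner, width, height):
    # Start from the document's previous winner and hill-climb towards
    # closer model vectors, instead of scanning the whole map. When the
    # map is nearly ordered, this visits only a handful of units.
    best = prev_winner
    best_d = euclidean(doc, model[best])
    improved = True
    while improved:
        improved = False
        for n in grid_neighbours(best, width, height):
            d = euclidean(doc, model[n])
            if d < best_d:
                best, best_d, improved = n, d, True
    return best
```

A periodic full scan can be kept as a safeguard, since hill-climbing can stall in a local minimum on a still-disordered map.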
6. Conclusion

It has been shown in this thesis that a SOM-created map can be a useful and attractive way of presenting Internet document collections to the user. Moreover, applying the SOM algorithm can also reveal the hidden structure of the collection. The SOM map, due to its integrated browsing and searching properties, can not only fill the gap between purely search-based and purely browse-based engines, but may even set a new standard for web exploration. It has also been shown that a basic, unimproved version of the Websom algorithm is sufficient for processing data from small document collections. However, when the number of documents and gathered keywords increases, several improvements, including vector projections and more advanced methods of map creation, must be applied to keep the algorithm's computational complexity within reasonable bounds. During the evaluation process it was established that the use of a language recognition tool, stop-word lists and word stemming is essential for maintaining good quality of document vectors, which directly influences the map quality, and especially map labelling and the search facility. It was also shown that the choice of an appropriate vector projection algorithm can drastically boost the SOM algorithm's efficiency; however, this choice should be backed up by a careful analysis of the possible speedups and the possible decline in map quality. Finally, the improved SOM map creation algorithms (including map densening, the batch map algorithm and old winner search) proved to be useful and even desirable alternatives to the basic SOM algorithm in the information retrieval domain, especially for improving efficiency and for applying the SOM algorithm to larger Internet document collections.

7. Bibliography

[1] S. Brin, L. Page, The Anatomy of a Large-Scale Hypertextual Web Search Engine, http://citeseer.nj.nec.com/brin98anatomy.html, 2000
[2] K.
Lagus, Text Mining with the Websom, Acta Polytechnica Scandinavica, Computing Series No. 110, 2000
[3] M. Porter, The Porter Stemming Algorithm, http://www.tartarus.org/~martin/PorterStemmer/, April 2003
[4] M. Porter, The English (Porter2) Stemming Algorithm Web Page, http://snowball.tartarus.org/english/stemmer.html, April 2003
[5] The Lovins Stemming Algorithm, http://snowball.tartarus.org/lovins/stemmer.html, April 2003
[6] T. Kohonen, S. Kaski, K. Lagus, J. Salojärvi, J. Honkela, V. Paatero, A. Saarela, Self Organization of a Massive Document Collection, IEEE Transactions on Neural Networks, vol. 11, no. 3, May 2000
[7] S. Deerwester, S. T. Dumais, G. W. Furnas, T. K. Landauer, R. Harshman, Indexing by Latent Semantic Analysis, http://citeseer.nj.nec.com/deerwester90indexing.html, 1990
[8] T. Hofmann, Probabilistic Latent Semantic Analysis, UAI, Stockholm, http://citeseer.nj.nec.com/hofmann99probabilistic.html, 1999
[9] A. Ultsch, G. Guimaraes, D. Korus, H. Li, Knowledge Extraction from Artificial Neural Networks and Applications, World Transputer Congress 93, Springer, 1993
[10] K. Lagus, S. Kaski, Keyword Selection Method for Characterizing Text Document Maps, Proceedings of ICANN'99, 1999
[11] D. Cohn, T. Hofmann, The Missing Link – A Probabilistic Model of Document Content and Hypertext Connectivity, http://citeseer.nj.nec.com/cohn01missing.html, 2001
[12] T. Dunning, Statistical Identification of Language, March 1994
[13] D. Weiss, Polish Stemmer, http://www.cs.put.poznan.pl/dweiss/index.php/projects/lametyzator/index.xml?lang=en, September 2003
[14] T. Kowaltowski, C. L. Lucchesi, J. Stolfi, Finite Automata and Efficient Lexicon Implementation, Relatorio Tecnico IC-98-2, Janeiro, 1998
[15] J. Daciuk, Finite State Utilities, http://www.eti.pg.gda.pl/~jandac/fsa.html, September 2003
[16] W. B. Cavnar, J. M. Trenkle, N-Gram-Based Text Categorization, Environmental Research Institute of Michigan, http://citeseer.nj.nec.com/68861.html, 1997
[17] R. Ganesan, A.
T. Sherman, Statistical Techniques for Language Recognition: An Introduction and Guide for Cryptanalysts, http://citeseer.nj.nec.com/ravi93statistical.html, February 1993
[18] M. Berry, T. Do, G. O'Brien, V. Krishna, S. Varadhan, SVDPACKC (Version 1.0) User's Guide, http://citeseer.nj.nec.com/9643.html, October 1993
[19] K. Kiviluoto, Topology Preservation in Self-Organizing Maps, Proceedings of ICANN'96, IEEE International Conference on Neural Networks, 1996
[20] S. Kaski, K. Lagus, Comparing Self-Organizing Maps, Proceedings of ICANN'96, International Conference on Artificial Neural Networks, Lecture Notes in Computer Science vol. 1112, pp. 809-814, Springer, Berlin, 1996
[21] Sammon Mapping, http://www.eng.man.ac.uk/mech/merg/research/datafusion.org.uk/techniques/sammon.html, November 2003
[22] A. Wilkowski, M.Sc. Thesis, Warsaw University of Technology (supervised by M. A. Kłopotek)

Research partially supported by the KBN research project 4 T11C 026 25 "Maps and intelligent navigation in WWW using Bayesian networks and artificial immune systems"

Contents
Approaches to evaluation of document map creation algorithms
1. Introduction
2. Map evaluation measures
2.1. Tests for measuring vector projection precision
2.2. Generic methods measuring the SOM map quality
2.3. Data-set specific methods for measuring map quality
3. The evaluation of documents' vector projection
4. The evaluation of the sped-up SOM creation methods
5. The manual evaluation of the sped-up map creation methods
6. Conclusion
7. Bibliography