25th Workshop of the Austrian Association for Pattern Recognition, ÖAGM/AAPR 2001, Berchtesgaden, Germany, June 7–8, 2001
OCG, Vienna, 2001, ISBN 3-85403-147-5, S. Scherer (ed.), pp. 191–198

Simulating Perceptual Clustering by Gestalt Principles

Erich Rome
GMD – German National Research Center for Information Technology
Institute for Autonomous Intelligent Systems
Schloss Birlinghoven, D-53754 Sankt Augustin, Germany
rome@gmd.de, http://ais.gmd.de/

Abstract: In this paper we propose a method for detecting salient non-local structures in vector graphics. Non-local structures may consist of similar graphical objects (the constituents of a vector graphics) or of objects that are arranged in an orderly fashion. They may be perceived immediately, but they are not explicitly represented in the internal description of a graphics. Information on such cognitively relevant structures may serve as additional indices to the graphics data base of a graphics retrieval system or may guide higher-level scene interpretation routines. Non-local structures emerge as a result of the grouping processes of visual perception. We detect them by simulating models of the organizing phenomena of human visual perception, in particular Treisman’s feature map model and Palmer’s transformational approach.

1 Introduction

Non-local relational image structures, such as three rectangles of equal dimensions and color that are equidistantly positioned on a horizontal line (fig. 1), may be perceived immediately as a whole, but they are usually not explicitly represented as an entity in the internal description of a graphics or in a scene segmentation. Such non-local structures can be found in photographs, bitmapped images, and graphical images. They may be salient and visually striking, which suggests that they are of some cognitive relevance to a viewer of such images.
It therefore seems desirable to develop methods that detect these structures automatically, in order to aid scene interpretation or to assist a user in designing or retrieving graphical images.

Figure 1: A non-local structure: perceivable, but not explicitly represented.

The process of perceiving non-local relational structures as entities is called perceptual grouping. At the beginning of the 20th century, adherents of the Gestalt school of psychology investigated perceptual grouping processes and found a number of principles that guide them [13, 25]. They qualitatively formulated about a dozen grouping principles, among them the well-known Gestalt “laws” of grouping by proximity, good continuation, and similarity [9, 3]. Examples are shown in figure 2.

Figure 2: Gestalt principles of proximity, good continuation, and similarity (by color).

It was not until the late 20th century, however, that perceptual psychologists and cognitive scientists proposed promising quantitative models of Gestalt grouping. Anne Treisman’s Feature Map Model [23] and Palmer’s Transformational Approach to human visual perception [17] provide a basis for the computer simulation of grouping processes and are used in our own Gestalt grouping simulator MAX. MAX has been incorporated into an innovative Graphics Design Assistant System, the GraphicsDesigner [11]. The first application has been the graphics retrieval component SkaGra [10], which enables a user to search a data base of over 300 vector graphics slides for Gestalt-like arrangements of graphical elements. MAX simulates the clustering of primitive graphical objects according to the Gestalt grouping principles of proximity, similarity, and good continuation.

The paper is organized as follows. First, the state of the art in computational approaches to perceptual grouping is critically reviewed. A brief survey of modern models and theories of perceptual grouping follows. We then describe how MAX employs ideas from two of these models.
Next, test results of MAX’s performance on Gestalt test patterns and on presentation graphics are presented. We conclude by summarizing the main features of MAX, commenting on purely computational approaches, and pointing out promising research trends.

2 State of the Art in Perceptual Grouping Simulation

Computational approaches to perceptual grouping have been investigated for a long time. The earliest such paper the author has found is a 1961 article by Guiliano et al. [7]. Many approaches modify well-known techniques like the Hough transform [4], simulated annealing [12], or methods from spectral graph theory [21, 22] to construct groups of pictorial elements. Some perceptual grouping algorithms have themselves become classic techniques, like Zahn’s minimum spanning tree method [27].

One of the most cited publications in this literature is David G. Lowe’s classic 1984 dissertation [15]. This work stands out for its combination, unique at the time, of a perceptual grouping stage, model-based form and viewpoint matching, and a component for reasoning under uncertainty. Lowe adopts Witkin and Tenenbaum’s [26] hypothesis that non-accidental patterns are significant for visual perception, and he applies this principle in his vision system. Lowe’s grouping component clusters points by proximity and then tries to construct continuous curved line segments from proximity groups at increasingly larger scales. A gap of a given size is more likely to be accidental (i.e., not a genuine discontinuity) at a large scale than at a small scale. Thus, at a large scale a continuation will be constructed to fill the gap, whereas a gap of the same size will not be bridged at a small scale. Compared with other work on perceptual grouping methods, however, Lowe’s grouping component alone does not stand out.

A number of computational approaches employ minimization or optimization techniques. In [28], a distance function is constructed that weights several grouping terms; the “best” grouping is the one with the minimum total weight.
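Zahn’s minimum spanning tree method [27], mentioned above as a classic technique, can be illustrated with a short sketch. It is a simplified variant, not Zahn’s exact procedure: here an MST edge is treated as “inconsistent” and cut when it is longer than a fixed factor times the mean MST edge length, whereas Zahn compares each edge with the edges in its local neighborhood. The function name `mst_clusters` and the default `factor=2.0` are illustrative assumptions.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]


def mst_clusters(points: List[Point], factor: float = 2.0) -> List[List[int]]:
    """Group point indices by cutting long edges of a Euclidean MST.

    Simplified, global variant of Zahn's idea: an MST edge is cut when it is
    longer than `factor` times the mean MST edge length.
    """
    n = len(points)
    if n == 0:
        return []
    # Prim's algorithm on the complete graph (O(n^2), fine for small inputs).
    in_tree = [False] * n
    best = [math.inf] * n  # cheapest known connection of each point to the tree
    parent = [-1] * n
    best[0] = 0.0
    edges = []  # MST edges as (length, i, j)
    for _ in range(n):
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        in_tree[u] = True
        if parent[u] >= 0:
            edges.append((best[u], parent[u], u))
        for v in range(n):
            if not in_tree[v]:
                d = math.dist(points[u], points[v])
                if d < best[v]:
                    best[v] = d
                    parent[v] = u
    # Keep only "consistent" edges, then collect connected components.
    mean_len = sum(e[0] for e in edges) / len(edges) if edges else 0.0
    adj = {i: [] for i in range(n)}
    for length, a, b in edges:
        if length <= factor * mean_len:
            adj[a].append(b)
            adj[b].append(a)
    seen, clusters = set(), []
    for start in range(n):
        if start in seen:
            continue
        stack, comp = [start], []
        seen.add(start)
        while stack:
            u = stack.pop()
            comp.append(u)
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    stack.append(v)
        clusters.append(sorted(comp))
    return clusters
```

For two rows of three points, `mst_clusters([(0, 0), (1, 0), (2, 0), (10, 0), (11, 0), (12, 0)])` returns `[[0, 1, 2], [3, 4, 5]]`: the MST edge of length 8 exceeds twice the mean edge length of 2.4 and is cut. A global threshold is easy to state but, unlike Zahn’s local criterion, it reacts badly to clusters of very different densities.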
An idea similar to that of [28] can be found in [16] and [12], where modern energy-minimization techniques are used to construct groups. The basic idea that a “good” grouping is simpler, or has less energy, than alternative groupings has a long tradition and goes back to the original Gestalt principle of Prägnanz, or figural goodness. In the 1950s, several information-processing definitions of the Prägnanz principle were proposed, “... suggesting that figural goodness may be inversely proportional to the amount of information necessary to describe or specify a figure” [8, p. 194]. Critics objected that such definitions depend strongly on how the descriptive data are coded.

Computational approaches differ in several respects: the number of phenomena they simulate, the kind of pictorial data on which they operate, their modus operandi (i.e., supervised versus unsupervised operation), the use of 3D cues, attentive or preattentive grouping, hierarchical versus flat grouping, the kind of algorithm employed, and the application area. Some methods have been shown to improve certain grouping tasks in their application area, whereas others are merely conceptual studies. Most of these approaches, however, are only inspired by Gestalt grouping phenomena and do not rely on contemporary models and theories of human visual perception. It is the author’s belief that contemporary results of cognitive psychology are well suited for designing improved grouping methods.

3 Theories and Models of Perceptual Grouping

The recent research interest in Gestalt phenomena (itself a phenomenon) has been called Neo-Gestaltism.
Neo-Gestaltist research of the last decade has produced new Gestalt principles (“connectedness” and “common region” [19]), psychophysical studies quantifying proximity, similarity, and temporal grouping [14], and investigations of the mutual effects of different grouping principles and of the influence of 3D cues on grouping [19]. In psychophysical experiments, Kubovy and Gepshtein [14] recently found the pure distance law, which quantitatively describes proximity grouping. Incorporating this law into proximity grouping algorithms should yield more plausible grouping results than other methods.

With the emergence of artificial neural networks, several connectionist approaches to perceptual grouping have been proposed. One of the most elaborate is Grossberg’s FACADE theory of visual perception, which has been under development for more than a decade [6, 2, 18, 5]. This body of work is too large to be described in this article, but it should be noted that FACADE models various Gestalt phenomena, such as figure-ground separation, Gestalt grouping, and subjective contours, to name a few. This work is clearly a valuable source of quantitative models of Gestalt phenomena. In the next section, we briefly describe those aspects of two models [23, 17] of human visual perception that we have used for our own grouping simulator MAX.

4 Perceptual Grouping Simulator MAX

The grouping simulator MAX operates on vector representations of 2.5 D graphical images. The primitive graphical objects of its representation language EPICT are dots, lines, arrows, rectangles, rounded rectangles, ellipses, polygons, splines, text objects, and bitmap objects. Typical attributes of these objects are fill color, line color, position, and size. For grouping purposes, additional attributes such as orientation and brightness are computed.
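To make the derived attributes more concrete, here is a minimal sketch of a hypothetical object record. The class `GraphicObject`, its fields, and both formulas are illustrative assumptions, not the EPICT format or MAX’s code: brightness is computed as luma with the ITU-R BT.601 weights, and orientation is folded into [0°, 180°) because a line has no preferred direction for grouping purposes.

```python
import math
from dataclasses import dataclass
from typing import Tuple

RGB = Tuple[int, int, int]  # 0..255 per channel


@dataclass
class GraphicObject:
    """Hypothetical stand-in for an EPICT primitive (not the real EPICT format)."""
    kind: str                   # e.g. "line", "arrow", "rectangle", "ellipse"
    start: Tuple[float, float]  # line start point, or one bounding-box corner
    end: Tuple[float, float]    # line end point, or the opposite corner
    fill: RGB = (0, 0, 0)

    @property
    def center(self) -> Tuple[float, float]:
        """Midpoint of start and end (bounding-box center for box-like objects)."""
        return ((self.start[0] + self.end[0]) / 2, (self.start[1] + self.end[1]) / 2)

    @property
    def brightness(self) -> float:
        """Luma in [0, 1] using ITU-R BT.601 weights (an assumption)."""
        r, g, b = self.fill
        return (0.299 * r + 0.587 * g + 0.114 * b) / 255.0

    @property
    def orientation(self) -> float:
        """Orientation in degrees, folded into [0, 180)."""
        dx = self.end[0] - self.start[0]
        dy = self.end[1] - self.start[1]
        if self.kind not in ("line", "arrow"):
            # Box-like objects: orientation of the longer bounding-box side.
            return 0.0 if abs(dx) >= abs(dy) else 90.0
        return math.degrees(math.atan2(dy, dx)) % 180.0
```

For example, a line from (0, 0) to (1, −1) gets an orientation of about 135°, and a white rectangle a brightness of 1.0. Real oriented objects (rotated ellipses, polygons) would need a more careful definition, for instance via principal axes.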
In Treisman’s Feature Map Model [23], a perceived picture is split up into a number of intermediate pictures along several feature dimensions. These intermediate pictures are called feature maps. Each feature map consists of those parts of the picture that contain a certain feature. Two basic questions concern this model:

1. Which features are relevant for human vision?
2. How are scene elements assigned to feature maps, i.e., how are feature values “coded”?

Treisman found some evidence that brightness, color, size, orientation, curvature, blobness, and line discontinuities may be relevant visual features. In each dimension, feature values are coded relative to a given standard value. Since the feature maps may be conceived as similarity classes, we have adapted the preattentive part of the model for our simulation of similarity grouping. During this simulation, every EPICT object of a presentation graphics is assigned to at most five feature maps, one in each of the feature dimensions form, color, brightness, orientation, and size. For example, if a graphics contains only red and blue squares and circles of similar size, orientation, and brightness, then there is one feature map that contains all the circles, one that contains all the squares, one that contains all the red objects, and one that contains all the blue objects. Since the same object may be contained in several feature maps, it may also belong to several of the groupings found, if any.

The overall procedure for simulating Gestalt grouping in a vector graphics is organized roughly as follows.

1. Grouping by similarity of form, color, brightness, size, and orientation: Put each constituent graphical object into five similarity classes according to its actual values for these features. Here, in contrast to the relative coding of the original model, we use absolute feature value coding based on psychophysical laws and findings.
2. For each object in each similarity class, perform the following steps:
   a) Proximity grouping: Construct a local environment with a fixed radius that contains at most eight nearest neighbors.
   b) Continuity grouping: Construct good continuations between neighboring local environments (this idea stems from [17]).
3. Repeat step 2 for all objects that remained ungrouped, using a larger radius.
4. Stop if all objects are grouped or if the radius of the local environments has reached its maximum, i.e., the maximum extent of the presentation graphics in any direction.
5. Collect the groupings found in all similarity classes, unite the results where possible, and sort them according to heuristic saliency criteria.

Groups of primitive objects may themselves be grouped in a hierarchical fashion. One important aspect of the form of a group is the spatial arrangement of its constituent parts. Such arrangements are represented by polygons whose vertices are the center points of the group elements. Arrangement polygons are compared by a matching algorithm that computes a similarity metric for polygonal shapes [1]; this makes it possible to retrieve similarly arranged groups from a data base of presentation graphics. For finding classes of similar polygonal forms, the metric is combined with an unsupervised numerical clustering algorithm [24].

5 Experimental Results

MAX has been tested with classical Gestalt test patterns, variations of those, and presentation graphics. In the three patterns of figure 2, MAX correctly constructs the three proximity pairs of circles in the first pattern, the two crossing continuations in the second pattern, and, in the third pattern, the five dominant vertical columns based on similar color and good continuation.

Figure 3: Grouping by a) similar size and b) orientation. c) No dominant similarity grouping. d) Continuity grouping with varying distances.
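As an illustration of proximity grouping with a growing radius (steps 2a, 3, and 4 of the procedure in Section 4) on a pattern like the first one in figure 2, the following sketch groups object center points. It is a simplified stand-in, not MAX’s implementation: similarity classes and continuity grouping (step 2b) are omitted, and the function name `proximity_groups`, the merging of overlapping local environments into one group, the doubling of the radius (`growth=2.0`), and the start radius are all assumptions.

```python
import math
from typing import Dict, List, Tuple

Point = Tuple[float, float]


def proximity_groups(points: List[Point], radius: float, max_radius: float,
                     max_neighbors: int = 8, growth: float = 2.0) -> List[List[int]]:
    """Group point indices by proximity, enlarging the radius for ungrouped points."""
    if radius <= 0 or growth <= 1:
        raise ValueError("radius must be positive and growth greater than 1")
    n = len(points)
    parent = list(range(n))  # union-find forest over point indices

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    grouped = [False] * n
    while True:
        # Step 2a for every object that is ungrouped at the start of this round.
        for i in [k for k in range(n) if not grouped[k]]:
            neighbors = sorted(
                (math.dist(points[i], points[j]), j) for j in range(n) if j != i
            )
            # Local environment: points within the radius, at most max_neighbors nearest.
            env = [j for d, j in neighbors if d <= radius][:max_neighbors]
            for j in env:
                parent[find(i)] = find(j)  # merge the two groups
                grouped[i] = grouped[j] = True
        # Step 4: stop when everything is grouped or the radius is exhausted.
        if all(grouped) or radius >= max_radius:
            break
        radius = min(radius * growth, max_radius)  # step 3: retry with a larger radius
    groups: Dict[int, List[int]] = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    # Only components with at least two members count as groups.
    return sorted(g for g in groups.values() if len(g) > 1)
```

For three pairs of points separated by gaps of 3, `proximity_groups([(0, 0), (1, 0), (4, 0), (5, 0), (8, 0), (9, 0)], radius=1.5, max_radius=20)` returns the three pairs `[[0, 1], [2, 3], [4, 5]]` after the first round. An isolated point is only attached once the radius has grown enough to reach it, or it stays ungrouped if `max_radius` is hit first.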
Similarly, MAX generates the five dominant row structures in figures 3 a and b, based on continuity combined with similar size and similar orientation, respectively. Pattern 3 c, however, does not exhibit a dominant similarity grouping: all elements look alike, and only their positions differ. Based on proximity and continuity, MAX generates five row groups and five column groups of five circles each. The pattern in figure 3 d can be perceived as two continuity groups. Here, we have varied the distances between group elements so that they lie within the tolerance of proximity grouping, except for the circles close to the junction of the bowed, T-shaped pattern. This yields two disjoint groups; narrowing the gap would make the junction circle belong to both groups.

Figure 4 shows a typical presentation graphics generated with the GraphicsDesigner. The dominant global structure is the diagonal arrangement of shadowed text boxes, which form a group according to the Gestalt principles of similarity and good continuation.

Figure 4: A vector graphics example: Presentation slide exhibiting a salient Gestalt structure.

For the graphics of fig. 4, MAX found 10 groups, among them the most salient one, the diagonal arrangement of text boxes. Of the other groups, four could be considered salient, among them the text rows in the upper left corner and the footnote-like subtitles. The remaining five are of rather low saliency. More details on the algorithms, summaries of the models used, and an examination of the state of the art in Gestalt clustering by machine can be found in [20].

In general, MAX’s grouping results can be roughly divided into three categories. Salient groups that can be easily perceived belong to the first category.
Groups of the second category are perceivable, but usually not considered salient. The third category contains groups that can only be explained with knowledge of MAX’s grouping algorithm.

6 Conclusion

A method has been presented that is able to find perceptually salient non-local structures in 2.5 D vector graphics. Primitive graphical objects are grouped according to the Gestalt principles of proximity, good continuation, and similarity of form, color, size, brightness, and orientation. The method employs graph-based techniques and is mainly based on the Feature Map Model of preattentive visual perception [23]. For the implementation, some ideas have been adopted from the Transformational Approach to human visual perception [17].

The presented method MAX differs from many other related approaches in the following aspects:
1) MAX works without tuning or supervision by a user.
2) MAX provides some grouping criteria that are rarely found in other approaches, namely grouping by similarity of form and by similar orientation, as well as a basic hierarchical grouping feature.
3) MAX’s application area is graphical search in 2.5 D vector graphics, whereas all other known approaches perform grouping on dot patterns or on preprocessed or raw bitmapped images.
4) MAX is based on modern psychological models of preattentive perceptual grouping processes that can be quantified.

The original Gestalt theory offered only qualitative descriptions of grouping phenomena. Many other computational approaches modify well-known techniques like energy minimization or the Hough transform in order to achieve Gestalt-like grouping of pictorial elements, but lack the foundation of contemporary psychological models and findings. Experimental results have shown that many of the classical test patterns of Gestalt psychology are correctly grouped by MAX. Good results have also been achieved with varied patterns and combinations of pattern elements.
In a test data base of 16 presentation graphics, MAX formed almost all of the groups that may be perceived as salient non-local structures, but also a number of non-salient groups. This can be attributed to the aimless preattentive grouping method and to the lack of appropriate “goodness” criteria for the groups found. While this matters less for content-based retrieval (apart from increasing search time), it might be undesirable in other application areas. The presented work and the reviewed related work show that there are special application areas in which Gestalt-like grouping aids image analysis. It is the author’s opinion that the most successful grouping procedures do not rely on computational approaches alone, but have a sound foundation in contemporary theories or psychological models of human visual perception. With more interdisciplinary approaches, we will see better and more general artificial vision systems in the future. Approaches that incorporate attentional mechanisms and exploit 3D cues seem to be a step in the right direction.

References

[1] E.M. Arkin, P. Chew, D.P. Huttenlocher, K. Kedem, and J.S.B. Mitchell. An efficiently computable metric for comparing polygonal shapes. Technical Report TR 89-1007, Cornell University, Dept. of Comp. Sci., Ithaca, NY, 1989.
[2] G.A. Carpenter and S. Grossberg. The art of adaptive pattern recognition by a self-organizing neural network. In J. Diederich, editor, Artificial Neural Networks: Concept Learning, pages 69–80. IEEE Computer Society Press, Washington, D.C., 1990.
[3] W.D. Ellis, editor. A Source Book of Gestalt Psychology. Lowe & Brydone, Thetford, UK, 1974. 5th impression, first published 1938.
[4] D.F. Gillies and G.N. Khan. Perceptual grouping and the Hough transform. In Proceedings of the SPIE – The International Society for Optical Engineering, volume 1607, pages 188–196, 1992. Conference: Intelligent Robots and Computer Vision X: Algorithms and Techniques, Boston, MA, USA, 11–13 Nov 1991.
[5] S. Grossberg. Publications, 2001. URL: http://cns-web.bu.edu/Profiles/Grossberg/abstracts.html.
[6] S. Grossberg and E. Mingolla. Neural dynamics of perceptual grouping: Textures, boundaries, and emergent segmentations. In S. Grossberg, editor, The Adaptive Brain II: Vision, Speech, Language, and Motor Control, volume II, pages 144–210. North-Holland, Amsterdam, 1987.
[7] V.E. Guiliano, P.E. Jones, G.E. Kimball, R.F. Meyer, and B.A. Stein. Automatic pattern recognition by a Gestalt method. Information and Control, 4:332–345, 1961.
[8] R.N. Haber and M. Hershenson. The Psychology of Visual Perception. Holt, Rinehart and Winston, Inc., New York, NY, 1973.
[9] H.H. Helson. The fundamental propositions of Gestalt psychology. Psychological Review, 40:13–32, 1933.
[10] P. Henne and G. Schmitgen. Graphisches Suchsystem SkaGra – Konzepte und Realisierung [Graphical search system SkaGra: concepts and implementation]. TASSO Report 48, GMD, Sankt Augustin, Germany, 1993.
[11] P. Hoschka, editor. Computers as Assistants. Lawrence Erlbaum Associates, Hillsdale, NJ, 1996.
[12] P. Kahn, A. Winkler, and C.Y. Chong. Perceptual grouping as energy minimization. In 1990 IEEE International Conference on Systems, Man and Cybernetics, pages 542–546, New York, NY, USA, 1990. IEEE. Conference: Los Angeles, CA, USA, Nov 4–7, 1990.
[13] W. Köhler. Gestalt Psychology. Liveright, New York, NY, 1929.
[14] M. Kubovy and S. Gepshtein. Gestalt: From phenomena to laws. In K.L. Boyer and S. Sarkar, editors, Perceptual Organization for Artificial Vision Systems, pages 69–80. Kluwer Academic Publishers, Boston, 2000.
[15] D.G. Lowe. Perceptual Organization and Visual Recognition. PhD thesis, Stanford University, Stanford, CA, September 1984. Report No. STAN-CS-84-1020.
[16] J.D. McCafferty. Human and Machine Vision: Computing Perceptual Organisation. Ellis Horwood, Chichester, UK, 1990.
[17] S.E. Palmer. The psychology of perceptual organization: A transformational approach. In J. Beck, B. Hope, and A. Rosenfeld, editors, Human and Machine Vision, pages 269–339, Orlando, FL, 1983. Academic Press.
[18] R. Raizada and S. Grossberg. Context-sensitive bindings by the laminar circuits of V1 and V2: A unified model of perceptual grouping, attention, and orientation contrast. Visual Cognition, 2001. In press.
[19] I. Rock and S.E. Palmer. The legacy of Gestalt psychology. Scientific American, pages 84–90, December 1990.
[20] E. Rome. Simulierte Gestalt-Erkennung in Präsentationsgrafiken [Simulated Gestalt recognition in presentation graphics]. PhD thesis, University of Bremen, Bremen, Germany, 1995.
[21] S. Sarkar. Supervised learning of large perceptual organization: Graph spectral partitioning and learning automata. IEEE Trans. on Pattern Analysis and Machine Intelligence, 22(5):504–525, May 2000.
[22] J. Shi and J. Malik. Normalized cuts and image segmentation. IEEE Trans. on Pattern Analysis and Machine Intelligence, 22(8):888–905, Aug 2000.
[23] A. Treisman. Preattentive processing in vision. Computer Vision, Graphics, and Image Processing, 31:156–177, 1985.
[24] R.M. Umesh. A technique for cluster formation. Pattern Recognition, 21(4):393–400, 1988.
[25] M. Wertheimer. Untersuchungen zur Lehre von der Gestalt II [Investigations on Gestalt theory, II]. Psychologische Forschung, 4:301–350, 1923.
[26] A.P. Witkin and J.M. Tenenbaum. On the role of structure in vision. In J. Beck, B. Hope, and A. Rosenfeld, editors, Human and Machine Vision, pages 481–543, Orlando, FL, 1983. Academic Press.
[27] C.T. Zahn. Graph-theoretical methods for detecting and describing Gestalt clusters. IEEE Transactions on Computers, C-20(1):68–86, January 1971.
[28] A.L. Zobrist and W.B. Thompson. Building a distance function for Gestalt grouping. IEEE Transactions on Computers, C-24(7):718–728, July 1975.