Creating and Exploring a Large Photorealistic Virtual Space

Josef Sivic1∗, Biliana Kaneva2, Antonio Torralba2, Shai Avidan3 and William T. Freeman2
1 INRIA, WILLOW Project, Laboratoire d'Informatique de l'Ecole Normale Superieure, Paris, France
2 CSAIL, Massachusetts Institute of Technology, Cambridge, MA 02139, U.S.A.
3 Adobe Research, Newton, MA, U.S.A.

The supplementary video can be viewed at: http://people.csail.mit.edu/josef/iv08/video.avi

∗ This work was carried out whilst J.S. was at MIT, CSAIL.

Abstract

We present a system for exploring large collections of photos in a virtual 3D space. Our system does not assume the photographs are of a single real 3D location, nor that they were taken at the same time. Instead, we organize the photos into themes, such as city streets or skylines, and let users navigate within each theme using intuitive 3D controls that include move left/right, zoom and rotate. Themes allow us to maintain a coherent semantic meaning of the tour, while visual similarity allows us to create a "being there" impression, as if the images were of a particular location. We present results on a collection of several million images downloaded from Flickr and broken into themes that consist of a few hundred thousand images each. A byproduct of our system is the ability to construct extremely long panoramas, as well as the image taxi, a program that generates a virtual tour between user-supplied start and finish images. The system and its underlying technology can be used in a variety of applications such as games, movies and online virtual 3D spaces like Second Life.

1. Introduction

Exploring large collections of images requires one to determine how to organize images and how to present them. A leading approach to organizing photo collections is to use tags. The user navigates the collection by typing a query tag and reviewing the retrieved images. Typically, the results are shown as a page of thumbnails or a slide show, which is a practical but not engaging way to browse images. Moreover, automatic image tagging is still an unsolved problem, and manual image tagging is a time-consuming process that does not always occur in practice.

Of particular interest are geo-referenced images that can be automatically and accurately tagged with GPS data. This allows users to explore the collection based on location, but more importantly offers a new way to navigate. Instead of showing the retrieved images as thumbnails, one can display images in a 3D context, enhancing the user experience. Instead of typing query tags, the user can use simple and intuitive 3D controls to navigate. Indeed, the Street View approach offered by several companies allows users to virtually wander the streets of a city in a 3D-like fashion. This was underscored by the success of the Photo Tourism interface [27], which places all images of a given scene in a common 3D space using bundle adjustment; users browse the image collection by moving in 3D space. If all images are collected from the same point of view, then a large gigapixel image can be constructed and viewed interactively using familiar 2D image controls [16]. Alternatively, if the image dataset consists of a random collection of images of different scenes, taken at different times, one can construct an AutoCollage [25]. This gives visually pleasing collages, but fails to scale to large collections where thousands or millions of images are involved. Another option considered in the past is to treat images as points in a high-dimensional space, compute the distances between them, and use multi-dimensional scaling to display them in the image plane for the user to navigate through [26]. However, there is no effort to create a virtual 3D world and, as a result, there is no sense of "being there".

In this work, we use intuitive 3D navigation controls to explore a virtual world created from a large collection of images, as illustrated in figure 1. This side-steps the problem of image tagging altogether, relying instead on an automatic image similarity measure. In our approach the user is free to move from one image to the next using intuitive 3D controls such as move left/right, zoom and rotate. In response to user commands, the system retrieves the most visually similar images from a dataset of several million images and displays them in correct geometric and photometric alignment with respect to the current photo. This gives the illusion of moving in a large virtual space. We preprocess the collection ahead of time, allowing the system to respond interactively to user input.

Figure 1: Given the user-supplied starting image, our system lets the user navigate through a collection of images as if in a 3D world.

There are a number of applications of this technology. First, imagine large 3D virtual online worlds like Second Life, where users are free to roam around. We make it possible for them to tour a photorealistic environment instead of a computer-generated one. Alternatively, our method can be used to construct large photorealistic environments for games. Other applications include movies or advertisements, where very wide backgrounds for a particular scene may be needed. A byproduct of our method is the ability to create different types of panoramas, such as infinitely long panoramas, or an image taxi. The image taxi allows the user to specify the first and last images of a tour, and the program will automatically generate a virtual tour through the image collection that starts and ends at the user-specified images. This extends the large body of work on multi-perspective panoramas [22, 24], where a large number of images taken from different viewpoints are stitched together; our method does not assume images to be of the same scene or taken at the same time. In a similar fashion we can create infinite zoom effects that resemble the "Droste effect"1. The effect is named after a particular image that appeared on the tins and boxes of the Dutch Droste cocoa powder. It displays a nurse carrying a serving tray with a cup of hot chocolate and a box of Droste cocoa. The recursive effect first appeared in 1904, and eventually became a household notion. This effect has been used by various artistic groups to generate infinite zooms2.

To achieve these goals we use a number of components. First, we use a large collection of images, which defines the image space the user can explore. Second, we use a convenient interface to break images into themes to ensure the semantic coherence of the tour. Third, we use an image similarity metric that allows us to retrieve images similar to a transformed version of the current image (e.g., to support a "move left/right" operation we need to shift the current image before retrieving similar images).
Finally, we show how to combine two images in a visually pleasing manner to create the illusion of moving in a single, coherent 3D space.

1 http://en.wikipedia.org/wiki/Droste_effect
2 http://zoomquilt2.madmindworx.com/zoomquilt2.swf

2. Constructing the image space

In this section we describe how we collect and arrange a large collection of images into a virtual 3D space for the user to explore as if walking through the pictures. The images do not need to correspond to a real, unique 3D space as in Photo Tourism [27]. Instead, our images are expected to correspond to unrelated places. We collected a dataset of more than 6 million images from Flickr by querying for suitable image tags and relevant groups. We queried for particular locations in conjunction with tags such as 'New York street' or 'California beach', and also downloaded images from Flickr groups returned by searching on terms relevant to particular themes, such as 'Landscape' or 'Forest'. The typical resolution of the downloaded images is 500 × 375 pixels. One million JPEG-compressed images take about 120 GB of hard-disk space.

Traditional image retrieval deals with the task of finding images that are similar to a particular query image. Here we want to extend this query by also allowing transformations between images. For instance, we want to find images similar to the photo we would have taken had we rotated the camera by 45 degrees with respect to the query image. We hope that the candidate set contains images that can be seamlessly blended with the query image after applying the appropriate camera transformation. We show how a simple planar image model can be used to retrieve images that are related to the query image by 3D transformations.

2.1. Geometric scene matching

Our goal is to extend a given image using images from the theme database to give an impression of a particular camera motion. We consider three camera motions: (i) translation left/right, (ii) horizontal rotation (pan), and (iii) zoom (forward motion). The camera motions are illustrated in figure 2. First, we synthesize a new view of the current image as seen from the new, desired camera location. Camera translation is approximated by a translation in the image plane, ignoring parallax effects; horizontal camera rotation is achieved by applying an appropriate homography transformation to the image; and zoom is approximated by scaling the image. We then find semantically similar images in the database that coarsely match the geometry of the observed portion of the synthesized view. For example, when the camera rotation changes a fronto-parallel view of a building to a view with a strong perspective (as shown in the middle column of figure 2), we find that most retrieved images depict scenes looking down a street.

Figure 2: Scene matching with camera view transformations. First row: the input image with the desired camera view overlaid in green. Second row: the synthesized view from the new camera. The goal is to find images which can fill in the unseen portion of the image (shown in black) while matching the visible portion. The third row shows the top matching image found in the dataset of street scenes for each motion. The fourth row illustrates the induced camera motion between the two pictures. The final row shows the composite after Poisson blending.
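The synthesized views above amount to simple planar warps of the current image. The following is a minimal sketch of this step, not the authors' implementation, assuming OpenCV and NumPy; the focal length and motion magnitudes are illustrative placeholders rather than values from the paper.

```python
# Sketch of the view synthesis of Sec. 2.1 under the planar image model.
import cv2
import numpy as np

def synthesize_view(img, motion, f=500.0, shift=0.5, pan_deg=45.0, zoom=1.5):
    """Warp `img` to approximate the requested camera motion; pixels that fall
    outside the original view stay black, as in the second row of figure 2."""
    h, w = img.shape[:2]
    K = np.array([[f, 0, w / 2.0],
                  [0, f, h / 2.0],
                  [0, 0, 1.0]])
    if motion == 'translate':        # translation in the image plane (no parallax)
        H = np.array([[1, 0, -shift * w],
                      [0, 1, 0],
                      [0, 0, 1.0]])
    elif motion == 'rotate':         # horizontal pan: homography H = K R K^-1
        t = np.deg2rad(pan_deg)
        R = np.array([[np.cos(t), 0, np.sin(t)],
                      [0, 1, 0],
                      [-np.sin(t), 0, np.cos(t)]])
        H = K @ R @ np.linalg.inv(K)
    elif motion == 'zoom':           # forward motion approximated by scaling
        H = np.array([[zoom, 0, (1 - zoom) * w / 2.0],
                      [0, zoom, (1 - zoom) * h / 2.0],
                      [0, 0, 1.0]])
    else:
        raise ValueError('unknown motion: %s' % motion)
    return cv2.warpPerspective(img, H, (w, h))
```

The warped view then serves as the query for the scene matcher described next.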
2.2. Image representation

A key component of our approach is finding a set of semantically similar images to a given query image. For example, if the query image contains a cityscape at sunset with a park in the foreground, we would like to find a candidate set of images with similar objects, camera viewpoint and lighting. We describe the image features we use to achieve this and how to extract a semantic scene description and camera properties from them. Semantic scene matching is a difficult task, but some success has recently been shown using large databases of millions of images [13, 29]. In this paper we show that we can also induce camera transformations without an explicit model of the 3D geometry of the scene.

Matching results are sometimes noisy; for example, a river is sometimes mismatched for a road. In recent work on image completion [13] this issue was addressed by relying on user interaction, essentially allowing the user to select a visually pleasing result among a set of candidate completions. We wish to find matching images automatically or with minimal user interaction. To reduce the difficulty of scene matching, we train classifiers to pre-select images of particular scene types, or themes, from the entire image collection. The user can then navigate in a visual space that is built only from images of a particular theme or a combination of themes.

Images are represented using the GIST descriptor, which was found to perform well for scene classification [21]. The GIST descriptor measures the oriented edge energy at multiple scales aggregated into coarse spatial bins. In this work we use three scales with (from coarse to fine) 8, 8 and 4 filter orientations aggregated into 6 × 4 spatial bins, resulting in a 480-dimensional image descriptor. In addition, we represent a rough spatial layout of colors by down-sampling each of the RGB channels to 8 × 8 pixels. We normalize both the GIST and the color layout vectors to have unit norm, which makes the two measures comparable. As illustrated in figure 2, not all pixels are observed in the transformed image, so only descriptor values extracted from the observed descriptor cells in the query image are considered for matching. In this case, the GIST and the color layout vectors are renormalized to have unit norm over the visible region. The distance between two images is evaluated as the sum of squared differences between the GIST descriptors and the low-resolution color layouts of the two images. We set the weight between GIST and color to 0.5, which we found to be a good compromise between matching on the geometric structure of the scene captured by GIST and the layout of colors. Currently we consider only images in landscape format with an aspect ratio close to 4:3; this represents about 75% of all collected images. Figure 3 shows an example of a query image, the bins used to compute the image descriptor, and the closest matching images from a dataset of 10,000 street images. The images returned belong to similar scene categories and have similar camera properties (same perspective pattern and similar depth).
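When only part of the synthesized view is visible, the comparison is restricted to the observed portion of the descriptors. Below is a minimal sketch of this masked distance, assuming NumPy, a cell-major 480-dimensional GIST layout (20 orientation channels per spatial cell) and an interleaved-RGB 8 × 8 color layout; these orderings are our assumptions and are not specified in the paper.

```python
# Sketch of the masked GIST + color-layout distance of Sec. 2.2 (our reading).
import numpy as np

def masked_distance(gist_q, color_q, gist_db, color_db,
                    cell_mask, pixel_mask, w_color=0.5):
    """Distance between a partially observed query and each database image.
    gist_*: 480-D GIST vectors; color_*: 192-D flattened 8x8x3 color layouts;
    cell_mask: 24 zeros/ones over the 6x4 grid; pixel_mask: 64 zeros/ones."""
    def renorm(v, mask):
        v = v * mask                          # keep only the visible entries
        n = np.linalg.norm(v)
        return v / n if n > 0 else v
    g_mask = np.repeat(cell_mask, 20)         # 20 orientation channels per cell
    c_mask = np.repeat(pixel_mask, 3)         # 3 color channels per pixel
    gq, cq = renorm(gist_q, g_mask), renorm(color_q, c_mask)
    gd = np.array([renorm(g, g_mask) for g in gist_db])
    cd = np.array([renorm(c, c_mask) for c in color_db])
    d_gist = ((gd - gq) ** 2).sum(axis=1)     # SSD on the visible GIST cells
    d_color = ((cd - cq) ** 2).sum(axis=1)    # SSD on the visible color layout
    return (1 - w_color) * d_gist + w_color * d_color   # 0.5/0.5 weighting
```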
2.3. Alignment and compositing

Images returned by the scene matcher tend to contain similar scenes, but they can still be misaligned, as the GIST descriptor matches only the rough spatial structure given by the 6 × 4 grid of cells. For example, in the case of city skylines, the horizon line can be at a slightly different height. To fine-tune the alignment, we apply a standard gradient descent alignment [20, 28], minimizing the mean of the sum of squared pixel color differences in the overlap region between the two images. To do this, we search over three parameters: the translation offset in the x and y directions, and scale. The alignment is performed on images down-sampled to 1/6 of their original resolution. In the case of translation and zoom, the alignment search is initialized by the synthesized view transformation. In the case of camera rotation, we initialize the alignment with a translation in the x direction matching the image overlap induced by the rotation, e.g. half the image width for the example in the middle column of figure 2. As a result, the camera rotation is approximated by a translation and scale in the image domain; the camera rotation is only used in the scene matching to induce a change in the geometry of the scene.

Figure 3: Each row shows the input image, the 6 × 4 spatial bins used to compute the GIST description, the best match on a dataset of 10,000 images, and the next 6 best matches. Top row: we look for images that match the full GIST descriptor. Bottom row: result of a query after simulating a camera rotation. The returned images contain a new perspective, similar to the one that we would have obtained by rotating the camera 45 degrees to the right.

The aligned images are blended along a seam in their region of overlap using Poisson blending [23]. In the case of camera translation and rotation, we find a seam running from the top to the bottom of the overlap region minimizing the sum of absolute pixel color differences by solving a dynamic program [10]. In the case of zoom, where one image lies within the other, we find four seams, one for each side of the overlap region. In addition, to preserve a significant portion of the zoomed-in image, we constrain each seam to lie close to the boundary of the overlap region. Finally, images are blended in the gradient domain using the Poisson solver of [2]. To obtain a sequence of images with a particular camera motion, the above process is applied iteratively using the most recently added image as a query.
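The seam search is a standard minimum-cost path problem. A sketch of the top-to-bottom dynamic program, in the spirit of image quilting [10], assuming the two overlapping regions are given as NumPy color arrays (this is not the authors' code):

```python
# Sketch of the vertical-seam dynamic program used for compositing (Sec. 2.3).
import numpy as np

def vertical_seam(overlap_a, overlap_b):
    """Return, for each row, the column where the seam crosses the overlap.
    The per-pixel cost is the sum of absolute color differences between the two
    images; the seam runs top to bottom, moving at most one column per row."""
    cost = np.abs(overlap_a.astype(float) - overlap_b.astype(float)).sum(axis=2)
    h, w = cost.shape
    acc = cost.copy()
    for y in range(1, h):                      # accumulate minimal path costs
        for x in range(w):
            lo, hi = max(0, x - 1), min(w, x + 2)
            acc[y, x] += acc[y - 1, lo:hi].min()
    seam = np.zeros(h, dtype=int)              # backtrack the cheapest path
    seam[-1] = int(np.argmin(acc[-1]))
    for y in range(h - 2, -1, -1):
        x = seam[y + 1]
        lo, hi = max(0, x - 1), min(w, x + 2)
        seam[y] = lo + int(np.argmin(acc[y, lo:hi]))
    return seam   # pixels left of the seam come from one image, right from the other
```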
2.4. Estimation of camera parameters

Although we do not perform a full estimation of the 3D scene structure, we show that estimating some simple camera parameters can improve the quality of the results when navigating through the collection. Specifically, we show how to estimate 3D camera properties from a single image using the GIST descriptor. Here we follow [18, 30], in which the estimation of the horizon line is posed as a regression problem. As opposed to [7, 9], in which camera orientation is estimated by an explicit model of the scene geometry, we use machine learning to train a regressor. The main advantage of our method is that it works even when the scene lacks clear 3D features such as a vanishing point. We collected 3,000 training images for which we entered the location of the horizon line manually (for pictures taken by a person standing on the ground, the horizon line can be approximated by the average vertical location of all the faces present in the image). We use a weighted mixture of linear regressors [12] to estimate the location of the horizon line from the GIST features, as described in [30].

Figure 4: The top row shows the queried region when we take the central image portion. The bottom row shows the results obtained when centering the queried region on the horizon line. The retrieved images contain roads with similar perspective to the input image.

Figure 4 illustrates the importance of estimating the location of the horizon line. In this example, we want forward motion to represent a person walking on the ground plane. As shown in the top row, if we zoom into the picture using the image center as the focus of expansion, we move into the sky region. However, if we zoom in on the horizon line, we simulate a person moving on the ground plane. To generate a motion that simulates a person moving on the ground plane, it is important to keep the horizon line within the queried region.

2.5. Organizing pictures into themes

When performing a query after a camera motion, the information available for matching is weaker than the original GIST descriptor (due to the smaller image overlap after the transformation). This results in semantically incoherent transitions between images when creating long sequences (in fact, out of 10 sequences, none produced satisfactory results when using a random set of 0.5 million images). To enhance the GIST descriptor, we use the theme as an additional constraint. The theme of an image is, in most cases, invariant with respect to camera motions. Therefore, we can use it as an additional constraint by only matching pictures that belong to the same theme. Using the themes dramatically improves the results. It also lowers the memory and CPU requirements, as only part of the database needs to be searched. Examples of themes we consider in this work are shown in figure 5.

Figure 5: Examples of images belonging to different scene themes. Partitioning a large collection of images improves the quality of the results. The classification is done automatically. When navigating through the image collection, it is important to keep the theme constant when moving from one picture to another to avoid undesired transitions. The user can also allow transitions across themes.

To obtain a visually and semantically coherent set of images for each theme, we train theme classifiers from manually labeled training data in a manner similar to [5, 19]. For each theme we train a 1-vs-all linear SVM classifier from about 1,000 positive and 2,000 negative training images. We have developed a suitable labeling interface so that the classifier can be trained interactively: at each iteration, the most uncertain images are shown to the user for labeling, and the user can visually assess the classifier performance and label additional images if needed.
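A minimal sketch of such a theme classifier, assuming scikit-learn and precomputed image descriptors; the regularization constant and the margin-based selection of "most uncertain" images are illustrative assumptions rather than details given in the paper.

```python
# Sketch of the 1-vs-all linear SVM theme classifiers of Sec. 2.5.
import numpy as np
from sklearn.svm import LinearSVC

def train_theme_classifier(pos_feats, neg_feats, C=1.0):
    """pos_feats: descriptors of ~1,000 theme images; neg_feats: ~2,000 others."""
    X = np.vstack([pos_feats, neg_feats])
    y = np.hstack([np.ones(len(pos_feats)), -np.ones(len(neg_feats))])
    return LinearSVC(C=C).fit(X, y)

def most_uncertain(clf, unlabeled_feats, k=50):
    """Pick the images closest to the decision boundary to show the user next
    (our interpretation of the interactive labeling loop)."""
    margins = np.abs(clf.decision_function(unlabeled_feats))
    return np.argsort(margins)[:k]
```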
3. Navigating the virtual 3D space

We process the image database and create a graph where the nodes represent images and different types of (directed) edges, corresponding to different types of motion, connect the nodes. The weight of each edge corresponds to the matching cost between the two images under the particular camera motion associated with that edge. We typically retain only the top 10 matches for every particular motion (figure 6). This gives us a sparse graph that allows our system to tour the virtual space efficiently. The main bottleneck in processing the database is computing the GIST descriptors for every image, which we do at about 5 images per second.

Figure 6: In the original graph, each image is a node and there are different types of edges (color coded in the upper left graph) that correspond to different camera motions. For each motion, there are many good matches and we typically keep just the top 10 for each edge type. All the images of a collection can be organized in a virtual world by greedily assigning to each image the best neighbors and camera motions that minimize the matching cost.

Once the image database is processed and the graph is constructed, the user can start navigating in it using intuitive 3D controls. There are different modes of operation, all relying on the same basic machinery. Given a query image, the user can interactively tour the virtual space, ask for a predefined path, or generate a tour between two images using the image taxi metaphor. It takes about a second to retrieve the matching images in our MATLAB implementation, and a couple of seconds to align and blend the retrieved images with the query image.

3.1. Sequences with different camera transformations

The user selects an image that serves as a gateway to the virtual 3D space and starts navigating interactively within a particular theme using intuitive 3D commands (move left/right, zoom, rotate). The system can either choose the top matching image automatically or let the user choose the best one from the top 5 matches found. The tour can be recorded and then shown as a video, seamlessly moving between different images. In some cases, users are interested in a particular type of tour. For example, the user might be interested in generating a tour that starts with a short forward motion, followed by a long left translation and concluding with a right rotation. The system will then query the graph and construct a video sequence that satisfies the user's request. In addition, the user can choose the image taxi option: the user only specifies the first and last images of the tour, and the image taxi takes the user along the shortest path in the graph connecting these two images.
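Such a constrained tour can be read as a walk over the motion graph. The sketch below uses our own assumptions, since the paper does not give this level of detail: the graph is stored as graph[node] = [(neighbor, motion, cost), ...], and each requested motion greedily follows the cheapest edge of that type.

```python
# Sketch of a scripted tour over the motion graph (Sec. 3.1); hypothetical
# data layout and greedy edge choice, not a detail from the paper.
def scripted_tour(graph, start, motions):
    """Follow a user-specified list of motions, e.g.
    ['forward', 'left', 'left', 'rotate_right'], taking the cheapest edge of
    the requested type at each step. Returns the visited images."""
    path, node = [start], start
    for motion in motions:
        candidates = [(cost, nbr) for nbr, m, cost in graph.get(node, []) if m == motion]
        if not candidates:
            break                     # no edge of this type; end the tour early
        _, node = min(candidates, key=lambda c: c[0])
        path.append(node)
    return path
```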
Figures 7, 8(a) and 8(b) show image sequences for camera zoom, translation and rotation, respectively, obtained from a 'street' theme. Note that the left/right translation motion tends to preserve the camera orientation with respect to the scene (the scene remains roughly fronto-parallel), while the rotation motion induces a change in perspective between consecutive images (figure 9). The translation and rotation sequences were produced fully automatically. The zoom-in sequence was produced interactively by letting the user specify the zoom-in direction, toward the horizon of the image. The reader is encouraged to view the accompanying video that presents the different tours generated by our system. All the sequences in the video were generated automatically, except for the 'Hollywood' and 'Windows' sequences, for which the user chose the best match out of the top 5 candidates.

To generate the motion videos, we produce the composite (including Poisson blending) of each image of the sequence with its neighboring images. The intermediate frames of the motion are synthesized from these composited images by temporal blending to hide small color differences resulting from Poisson blending subsets of images independently. In the case of translation and rotation, we look at only one image on each side of the current image. In the case of zoom, a particular composited image can contain pixels from several images ahead, as illustrated in figure 7(d). The translation is simulated by translating with constant per-pixel speed between consecutive images of the sequence (also applying small scale changes if the consecutive images are of different sizes). The case of rotation is similar to translation but, in addition, we display the resulting images on a cylinder. In the case of zoom, we apply a constant scale change between consecutive frames of the video to create the impression of a person walking at a constant speed in the 3D world. The zoom is performed toward the center of the next image in the sequence, which can result in (typically small) changes in zoom direction between consecutive images.

Infinite panoramas and the infinite street: As a byproduct of our system we can construct infinite panoramas by automatically moving the camera at fixed intervals. Figure 7 gives an example of an infinite perspective street generated with our system. Figure 10 shows examples of long panoramas created for various themes.

Figure 7: Zoom sequence of 16 images. (a) The query image. (b) The query image composited with the consecutive images in the sequence. (c) Image boundaries of the following images in the sequence overlaid over the composited image (b). (d) Masks indicating areas in (b) coming from different images. (e) Images 2–4 used to generate the zoom sequence. Each consecutive image was obtained by inducing camera zoom as illustrated in figure 2. The zoom direction was indicated interactively by clicking on the horizon line.

Figure 8: Sequences generated from a large database of street images by inducing two camera motions: a) Camera translation. Each consecutive image was obtained by inducing a camera translation of half the image size, as illustrated in figure 2. b) Camera rotation. Each consecutive image was obtained by inducing a camera rotation of 45 degrees, as illustrated in figure 2. This produces changes in perspective between consecutive images.

Figure 9: A camera rotation can be used to generate, from a single picture, a good guess of the surrounding environment not covered by the camera. Here, the system hallucinates what could be behind the camera (the original image is marked with a frustum). Note that the back of the camera is also a perspective street, aligned with the camera view.

Figure 10: Various panoramas created by our system. The top two panoramas were created automatically from the 'landscape' and 'skyline' themes, respectively. The bottom three panoramas were created interactively from the 'people', 'forest' and 'graffiti' themes, respectively.

The image taxi: finding a path between two images: The image taxi allows the user to specify start and end images and lets the system take them on a tour along the shortest path between the two. Given two input images, we first connect them to the graph and then find the shortest path between them using Dijkstra's algorithm [6]. We then follow the path, creating a tour based on the different edges along the way.
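The image-taxi search itself is plain Dijkstra over the sparse motion graph. A sketch, reusing the same hypothetical graph layout as in the tour sketch above (graph[node] = list of (neighbor, motion, cost) tuples):

```python
# Sketch of the image-taxi shortest-path search (Sec. 3) using Dijkstra [6].
import heapq
from itertools import count

def image_taxi(graph, start, goal):
    """Return the list of (image, motion) steps leading from start to goal,
    or None if the two images lie in different components of the graph."""
    tie = count()                      # breaks ties so nodes are never compared
    pq = [(0.0, next(tie), start, [])]
    done = set()
    while pq:
        cost, _, node, path = heapq.heappop(pq)
        if node in done:
            continue
        done.add(node)
        if node == goal:
            return path
        for nbr, motion, w in graph.get(node, []):
            if nbr not in done:
                heapq.heappush(pq, (cost + w, next(tie), nbr, path + [(nbr, motion)]))
    return None
```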
4. Limitations and Conclusion

In the course of developing the system we have learned about some of its limitations. First, the system is only as good as the underlying database: the larger the database, the better the results. We also found that there are three typical failure modes. The first is when a semantically wrong image is retrieved. This is mitigated by the use of themes, but it remains a problem for future research. Second, compositing two distinct images is always a challenge and, at times, the seam is quite visible. Finally, there are cases in which the seam runs through important objects in the image, which produces noticeable artifacts.

Nevertheless, the proposed system offers an intuitive, 3D-based navigation approach to browsing large photo collections. This can be used to create photorealistic visual content for large online virtual 3D worlds like Second Life, computer games or movies. Our system arranges the images into themes and relies on image content to retrieve the next image. Another novelty of our system, as opposed to existing image retrieval systems, is the use of transformed image retrieval: we first transform the query image, before performing the query, to simulate various camera motions. Our infrastructure also allows us to create infinite panoramas or use the image taxi to generate tailor-made tours in the virtual 3D space. These applications can find use in games, movies and other media creation processes.

References

[1] A. Agarwala, M. Agrawala, M. Cohen, D. Salesin, and R. Szeliski. Photographing long scenes with multi-viewpoint panoramas. ACM Trans. Graph., 2006.
[2] A. Agrawal, R. Raskar, and R. Chellappa. What is the range of surface reconstructions from a gradient field? In Proc. European Conf. Computer Vision, 2006.
[3] S. Avidan and A. Shamir. Seam carving for content-aware image retargeting. ACM Trans. Graph., 2007.
[4] S. Bae, S. Paris, and F. Durand. Two-scale tone management for photographic look. ACM Transactions on Graphics, 25(3):637–645, 2006.
[5] A. Bosch, A. Zisserman, and X. Munoz. Scene classification using a hybrid generative/discriminative approach. IEEE Trans. Pattern Analysis and Machine Intelligence, 30(4), 2008.
[6] T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein. Introduction to Algorithms. The MIT Press, 2nd edition, 2001.
[7] J. Coughlan and A. L. Yuille. Manhattan world: Orientation and outlier detection by Bayesian inference. Neural Computation, 15:1063–1088, 2003.
[8] J. S. De Bonet. Multiresolution sampling procedure for analysis and synthesis of texture images. In Proc. ACM SIGGRAPH, 1997.
[9] J. Deutscher, M. Isard, and J. MacCormick. Automatic camera calibration from a single Manhattan image. In Proc. European Conf. Computer Vision, pages 175–188, 2002.
[10] A. A. Efros and W. T. Freeman. Image quilting for texture synthesis and transfer. In Proc. ACM SIGGRAPH, 2001.
[11] W. T. Freeman, T. R. Jones, and E. C. Pasztor. Example-based super-resolution. IEEE Computer Graphics & Applications, 22(2):56–65, 2002.
[12] N. Gershenfeld. The Nature of Mathematical Modeling. Cambridge University Press, 1998.
[13] J. Hays and A. A. Efros. Scene completion using millions of photographs. ACM Transactions on Graphics, 26(3):4:1–4:7, 2007.
[14] A. Hertzmann, C. E. Jacobs, N. Oliver, B. Curless, and D. H. Salesin. Image analogies. In Proc. ACM SIGGRAPH, pages 327–340, 2001.
[15] D. J. Heeger and J. R. Bergen. Pyramid-based texture analysis/synthesis. In Proc. ACM SIGGRAPH, 1995.
[16] J. Kopf, M. Uyttendaele, O. Deussen, and M. Cohen. Capturing and viewing gigapixel images. ACM Transactions on Graphics, 26(3), 2007.
[17] V. Kwatra, A. Schödl, I. Essa, G. Turk, and A. Bobick. Graphcut textures: Image and video synthesis using graph cuts. In Proc. ACM SIGGRAPH, 2003.
[18] J. F. Lalonde, D. Hoiem, A. A. Efros, C. Rother, J. Winn, and A. Criminisi. Photo clip art. ACM Transactions on Graphics, 26(3):3:1–3:10, 2007.
[19] S. Lazebnik, C. Schmid, and J. Ponce. Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2006.
[20] B. D. Lucas and T. Kanade. An iterative image registration technique with an application to stereo vision. In Proc. of the 7th Int. Joint Conf. on Artificial Intelligence, pages 674–679, 1981.
[21] A. Oliva and A. Torralba. Modeling the shape of the scene: A holistic representation of the spatial envelope. Int. Journal of Computer Vision, 42(3):145–175, 2001.
[22] S. Peleg, B. Rousso, A. Rav-Acha, and A. Zomet. Mosaicing on adaptive manifolds. IEEE Trans. Pattern Analysis and Machine Intelligence, 22(10):1144–1154, 2000.
[23] P. Perez, M. Gangnet, and A. Blake. Poisson image editing. ACM Trans. Graph., 22(3):313–318, 2003.
[24] P. Rademacher and G. Bishop. Multiple-center-of-projection images. ACM Trans. Graph., 1998.
[25] C. Rother, L. Bordeaux, Y. Hamadi, and A. Blake. AutoCollage. In Proc. ACM SIGGRAPH, 2006.
[26] Y. Rubner, C. Tomasi, and L. J. Guibas. Adaptive color-image embeddings for database navigation. In Asian Conference on Computer Vision, pages 104–111, 1998.
[27] N. Snavely, S. M. Seitz, and R. Szeliski. Photo tourism: Exploring photo collections in 3D. In Proc. ACM SIGGRAPH, 2006.
[28] R. Szeliski. Image alignment and stitching: A tutorial. Foundations and Trends in Computer Graphics and Computer Vision, 2006.
[29] A. Torralba, R. Fergus, and W. T. Freeman. Tiny images. Technical Report MIT-CSAIL-TR-2007-024, MIT, 2007.
[30] A. Torralba and P. Sinha. Statistical context priming for object detection. In Proc. IEEE Int. Conf. Computer Vision, pages 763–770, 2001.
[31] Y. Weiss and W. Freeman. What makes a good model of natural images? In Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2007.
[32] D. N. Wood, A. Finkelstein, J. F. Hughes, C. E. Thayer, and D. Salesin. Multiperspective panoramas for cel animation. ACM Trans. Graph., 1997.