When Location Meets Social Multimedia: A Survey on Vision-Based Recognition and Mining for Geo-Social Multimedia Analytics

RONGRONG JI, Department of Cognitive Science, Xiamen University; YUE GAO, School of Computing, National University of Singapore; WEI LIU, IBM T. J. Watson Research Center; XING XIE, Microsoft Research Asia; QI TIAN, Department of Computer Science, University of Texas at San Antonio; XUELONG LI, Chinese Academy of Sciences

With the popularity of multimedia sharing platforms such as Facebook and Flickr, recent years have witnessed an explosive growth of geographical tags on social multimedia content. This trend enables a wide variety of emerging applications, for example, mobile location search, landmark recognition, scene reconstruction, and touristic recommendation, which range from pure research prototypes to commercial systems. In this article, we give a comprehensive survey of these applications, covering recent advances in the recognition and mining of geography-aware social multimedia. We review related work in the past decade regarding location recognition, scene summarization, tourism suggestion, 3D building modeling, mobile visual search, and city navigation. At the end, we further discuss potential challenges, future topics, and open issues related to geo-social multimedia computing, recognition, mining, and analytics.

Categories and Subject Descriptors: H.4 [Information Systems Applications]: Miscellaneous; I.4.8 [Image Processing and Computer Vision]: Scene Analysis; H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval

General Terms: Algorithms, System, Measurement

Additional Key Words and Phrases: Algorithms, multimedia systems, knowledge representation, Internet, image analysis

ACM Reference Format: Rongrong Ji, Yue Gao, Wei Liu, Xing Xie, Qi Tian, and Xuelong Li. 2015. When location meets social multimedia: A survey on vision-based recognition and mining for geo-social multimedia analytics.
ACM Trans. Intell. Syst. Technol. 6, 1, Article 1 (March 2015), 18 pages. DOI: http://dx.doi.org/10.1145/2597181

This work is supported by the Natural Science Foundation of China (No. 61373076), the Fundamental Research Funds for the Central Universities (No. 2013121026), and the 985 Project of Xiamen University.

Authors' addresses: R. Ji, School of Information Science and Engineering, Xiamen University, No. 422, Siming South Road, Xiamen, Fujian, China, 361005; email: rrji@xmu.edu.cn; Y. Gao (corresponding author), Department of Automation, Tsinghua University, Beijing, China, 100084; email: kevin.gaoy@gmail.com; W. Liu, IBM T. J. Watson Research Center, 1101 Kitchawan Rd, Yorktown Heights, NY, USA, 10598; email: wliu.cu@gmail.com; X. Xie, Microsoft Research Asia, Tower 2, No. 5 Danling Street, Haidian District, Beijing, China, 100080; email: xingx@microsoft.com; Q. Tian, Department of Computer Science, University of Texas at San Antonio, One UTSA Circle, San Antonio, TX, USA, 78249-1604; email: qtian@cs.utsa.edu; X. Li, State Key Laboratory of Transient Optics and Photonics, Chinese Academy of Sciences, Xi'an, Shaanxi, China; email: xuelong_li@ieee.org.
© 2015 ACM 2157-6904/2015/03-ART1. ACM Transactions on Intelligent Systems and Technology, Vol. 6, No. 1, Article 1, Publication date: March 2015.

1. INTRODUCTION
Nowadays, mobile devices (e.g., mobile phones and personal digital assistants) are widely equipped with embedded cameras and Global Positioning System (GPS) chips. As a result, visual data associated with geographical or location tags can be easily produced in our daily lives. In addition, many social multimedia sharing platforms such as Flickr and Picasa now enable users to tag the locations of their uploaded multimedia content on the geographical map. It is also feasible to infer implicit location tags from multimedia metadata such as location names, telephone numbers, zip codes, and URL postfixes. Such explicit and implicit geo-tagging has brought a rich connection between virtual multimedia content and our physical world, resulting in a massive and ever-increasing amount of "geo-social multimedia" available on the web. This geo-social multimedia has opened a new gate for the browsing, search, analysis, and mining of social multimedia, adding a novel "dimension" to the original visual content and metadata context. In the recent decade, extensive efforts have been devoted to the computing, recognition, mining, and analytics of geo-social multimedia, with emerging applications ranging from location recognition to scene summarization, tourism suggestion, 3D building modeling, and city navigation. We believe the time has come for a survey summarizing recent advances, fundamental issues, and open questions in geo-social multimedia.
Topics covered in this survey are itemized in Table II and detailed in the respective sections, in each of which, in addition to a comprehensive review, we also identify the underlying challenges and highlight future trends from both academic research and industrial practice perspectives. In the rest of this section, we briefly discuss the data origins and application scenarios of geo-social multimedia.

1.1. Data Origins
In principle, social multimedia refers to the means of interaction among people by which they create, share, and/or exchange multimedia information (e.g., images, videos, audio, and text) in virtual communities and networks. We briefly review the data origins of social multimedia in the following.

Explicit GPS Tagging. A dominant proportion of geo-social multimedia is produced during visual content creation (e.g., during photo capture, using GPS chips embedded along with the cameras). A comprehensive resource comes from aligning the timestamps between GPS devices and video streams, as in the works of Jia et al. [2006] and Ji et al. [2009a]. In addition to such automatic geographical tagging, explicit geographical labels can also be obtained via collaborative tagging by the web user community. Today, social multimedia sharing websites such as Flickr (www.flickr.com), Picasa (www.picasa.com), Panoramio (www.panoramio.com), and Moblog (www.moblog.net) enable users to tag the latitude and longitude, precisely or approximately, of their uploaded photos on the geographical map. For instance, as of May 2010, Flickr had collected over 100 million geo-tagged photos since launching its geographical tagging functionality in 2006.

Implicit Geographical Tagging.
Another major proportion of geo-social multimedia is produced from the implicit metadata descriptions of multimedia, from which the geographical locations can be parsed using "location extraction" techniques [Buyukkokten et al. 1999; Li et al. 2002, 2003; Wang et al. 2005]. Such implicit descriptions may include, but are not restricted to, URL address prefixes, language types, and geography-related nouns (e.g., city and country names, telephone numbers, and zip codes). For instance, image metadata containing "Statue of Liberty and Manhattan" indicates that the image was taken in "Manhattan ⇒ New York ⇒ United States." As another instance, image metadata with the zip code "10027" indicates that the image was also taken at the same address as mentioned earlier. Note that, different from explicit geographical tags, implicit geographical tags also provide the landmark or location identities, which can further facilitate direct visual analysis of the target location.

Table I. Summarization of Data Origin vs. Application Scenarios. The table organizes representative works by application (Location Recognition, Landmark Mining, and Scene Modeling) and by data origin (Explicit GPS Tagging vs. Implicit Geographical Tagging, each under Controlled Collection vs. Community Contributed). Representative works listed in the table include [Jia et al. 2006], [Lee et al. 2008], [Li et al. 2008], [Tsai et al. 2005], [O'Hare et al. 2005], [Debevec et al. 1996], [Pollefeys et al. 2002], [Longuet-Higgins et al. 1981], [Robertson et al. 2004], [Brockmann et al. 2006], [Aloimonos et al. 1993], [Schindler et al. 2007], [Hufnagel et al. 2004], [Oliensis et al. 1999], [Paletta et al. 2005], [Gonzalez et al. 2008], [Triggs et al. 1999], [Xiao et al. 2008], [Kalogerakis et al. 2009], [Ji et al. 2009b], [Teller et al. 2003], [Irschara et al. 2009], [Akbarzadeh et al. 2006], [Shao et al. 2003], [Tomasi et al. 1992], [Goedeme et al. 2004], [Szeliski et al. 1994], [Eade et al. 2008], [Zhang et al. 2006], [Hays et al. 2008], [Snavely et al. 2006], [Liu et al. 2009], [Li et al. 2009], [Jing et al. 2006], [Vergauwen et al. 2006], [Kennedy et al. 2007], [Cristani et al. 2008], [Zheng et al. 2009], [Aliaga et al. 2003], [Philbin et al. 2007], [Simmon et al. 2007], [Crandall et al. 2009], [Cao et al. 2009], [Yeh et al. 2004], and WikiTravel.

Community Consensus. The web user community is the major contributor of geo-social multimedia, which differs from traditional data acquisition in its ever-increasing scale and uncontrolled quality. From this perspective, geo-social multimedia also reflects user preference or "intelligence." One example is that the geo-tagged photos in Flickr or Panoramio usually exhibit a distribution bias toward popular landmarks. Within a given geographical location, the visual statistics also exhibit a good consensus on the representative landmarks, as well as their representative views, which are typically the focus of photographing for social multimedia users. In the aforementioned cases, by analyzing the visual or metadata statistics, the popularity of given landmarks or views can be estimated, which can be further exploited for the purpose of touristic recommendation.

1.2. Application Scenarios
There is an increasing number of diverse applications related to geo-social multimedia, which can be briefly categorized into the following four groups: (1) geographical location recognition, (2) landmark mining and summarization, (3) tourism recommendation, and (4) 3D scene modeling and city navigation. Table I summarizes the corresponding application scenarios and representative works.
Geographical Location Recognition aims to recognize the geographical location of the input visual data, such as an image or a video sequence. The recognition is typically achieved by near-duplicate visual search from the query to the reference image or video dataset. Vision-based location recognition can be adopted as a complementary or alternative solution to GPS-based location recognition. On one hand, the GPS signal is not stable in locations with dense buildings. On the other hand, vision-based location recognition can provide more fine-grained guidance, that is, the viewing angle of the identified location, which can facilitate applications such as augmented reality.

Landmark Mining and Recommendation aims at discovering representative landmarks from geo-tagged photos. The mined landmarks can be summarized and recommended for online or visual city tourism. In this case, the geographical tags can be either explicit or implicit, and the mining is mainly based on exploring the consensus among the context and content of such photos. The mined landmarks can then be leveraged to help users (e.g., how to optimize the travel route or what is the best viewing angle for photographing).

3D Scene Modeling and City Navigation aims to reconstruct 3D models, either for individual buildings or for the entire city. The key idea is to perform image correspondence to estimate the depth or 3D structure from images within a certain location, which are identified using their GPS or location tags. The built 3D models can be further used for virtual navigation (e.g., in commercial systems such as Google Earth [www.earth.google.com/] and Apple Maps [www.apple.com/ios/maps/]).

1.3. Survey Organization
The rest of this survey is organized as follows: In Sections 2 through 4, we review the four groups of applications as briefed earlier.
In each section, we review the state-of-the-art approaches, the developed systems, as well as the benchmark datasets. Finally, we discuss potential trends and challenges for geo-social multimedia in Section 5.

2. GEOGRAPHICAL LOCATION RECOGNITION
In this section, we review related work on geographical location recognition. In principle, recognition is achieved by matching the query to a set of reference photos that are associated with geographical locations. The query may come from a photo or a video sequence, either captured using a mobile device or uploaded by a web user. Similar photos to the query are identified using near-duplicate visual search techniques, based on which the location(s) of the most similar photo(s) can be returned as the recognition output. Here, the key issue lies in ensuring both recognition scalability and discriminability, which relies on designing a good visual feature representation. To this end, Bag-of-Features [Sivic and Zisserman 2003; Nister and Stewenius 2006; Ji et al. 2009a] is widely adopted for image representation. One significant advantage of Bag-of-Features lies in its online matching efficiency through inverted indexing: the matching between two images can be computed by counting how many of their local descriptors are quantized into identical visual words. In Zhang and Kosecka [2006], visual matching is applied to a captured image sequence, which enables estimation of the photographer's motion on the geographical map. In Lee et al. [2008] and Irschara et al. [2009], visual matching is conducted between the query and multiple reference images using wide baseline matching or structure-from-motion, which can recover the camera pose and orientation for augmented reality applications. One significant argument for vision-based location recognition is the challenge of directly using GPS signals.
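The inverted-index voting described earlier can be sketched minimally as follows. This is an illustrative toy, not any particular system from the literature: the visual-word ids are made up, and the descriptor-to-word quantization step is assumed to have already happened.

```python
from collections import defaultdict

def build_inverted_index(database):
    """Map each visual word to the images containing it.

    `database` maps image ids to their quantized visual words
    (one word id per local descriptor).
    """
    index = defaultdict(list)
    for image_id, words in database.items():
        for word in set(words):
            index[word].append(image_id)
    return index

def score_by_shared_words(query_words, index):
    """Count, per database image, how many distinct query words it shares."""
    votes = defaultdict(int)
    for word in set(query_words):
        for image_id in index.get(word, []):
            votes[image_id] += 1
    return sorted(votes.items(), key=lambda kv: -kv[1])

# Toy example with made-up visual-word ids.
db = {"img_a": [1, 2, 3, 5], "img_b": [2, 7, 8], "img_c": [9, 10]}
index = build_inverted_index(db)
ranking = score_by_shared_words([1, 2, 5, 8], index)  # img_a shares 3 words
```

Real systems additionally weight words by TF-IDF and normalize by image length, but the scalability argument is visible even here: only images sharing at least one word with the query are ever touched.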
In our understanding, vision-based location recognition has unique and irreplaceable advantages:
—The GPS signal in certain locations is limited (e.g., dense urban areas and indoors, where the location can only be identified through visual matching).
—Camera pose and orientation are required in applications such as augmented reality, and can only be estimated using vision-based matching rather than GPS.1 To this end, GPS coordinates are too coarse.

1 Given the trajectory of GPS coordinates, it is possible to recover the motion of the target, as pointed out in Ji et al. [2009a].

GPS- and vision-based techniques can be further combined to improve recognition accuracy. A typical strategy is to use GPS for coarse recognition, which is then refined by visual matching. Instead of GPS, Liu et al. [2009] leveraged base station tags to accomplish coarse recognition, which serves as a preliminary step before visual matching. In vision-based markerless augmented reality, the alignment between the camera screen and the physical world highly depends on the efficiency and accuracy of visual matching as well as the identification of camera pose and orientation.

2.1. Literature Review
Earlier work on vision-based location recognition operated on a small scale, say 200 to 1,000 images. These photos were typically collected by professional image acquisition procedures, as in Robertson et al. [2004], Shao et al. [2003], and Zhang et al. [2006]. For example, the location recognition benchmark used by Robertson and Cipolla [2004] contains only 200 images, each of which corresponds to a unique building facade.
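The coarse-to-fine strategy mentioned earlier, GPS for coarse candidate filtering followed by visual matching for refinement, can be sketched as below. The `visual_score` function and the 0.5 km search radius are hypothetical placeholders; any image-to-image similarity (e.g., shared visual-word counting) could be plugged in.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two GPS coordinates, in kilometers."""
    r = 6371.0  # mean Earth radius
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def recognize(query_gps, references, visual_score, radius_km=0.5):
    """Coarse GPS filter, then rank surviving candidates by visual similarity.

    `references` is a list of (image_id, lat, lon) tuples; `visual_score`
    maps an image id to a similarity against the query image.
    """
    lat, lon = query_gps
    candidates = [r for r in references
                  if haversine_km(lat, lon, r[1], r[2]) <= radius_km]
    return sorted(candidates, key=lambda r: visual_score(r[0]), reverse=True)

# Reference ~55 m away survives the coarse filter; one ~111 km away does not.
refs = [("near", 40.0005, -74.0), ("far", 41.0, -74.0)]
result = recognize((40.0, -74.0), refs, visual_score=lambda image_id: 1.0)
```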
In Robertson and Cipolla [2004], vertical vanishing points are used to align the views between the query and reference images, based on which the query is transformed to the identical viewing angle and shape-based features are extracted to perform location recognition. Wide baseline matching has also been investigated for location recognition, in which scenario the query may be taken from widely separated views. For example, the work in Goedeme and Tuytelaars [2004] performed recognition by matching line segments and their associated descriptors, and rejected false matches using epipolar geometry constraints. The relative pose is then recovered, so as to "stitch" the unordered set of views between query and reference images from approximately the same location.

Near-Duplicate Matching. Recently, advances in near-duplicate image retrieval [Nister and Stewenius 2006; Philbin et al. 2007] have well addressed the scalability issue for vision-based location recognition. Image search in large databases is the focus of Sivic and Zisserman [2003] and later Nister and Stewenius [2006] and Philbin et al. [2007]; the latter two are both based on a hierarchical visual dictionary with inverted indexing. The idea comes from the work in Brin et al. [1995], and earlier in Fukunaga and Narendra [1975], which originally targeted retrieval over large-scale text document corpora. In visual search and recognition, the hierarchical visual vocabulary model (e.g., the vocabulary tree [Nister and Stewenius 2006]) has been well advocated for developing scalable systems. For example, Paletta and Fritz [2005] proposed to detect and match buildings using SIFT descriptors with indexed word voting. Schindler and Brown [2007] presented a method that uses geo-tagged video streams for large-scale location recognition; such streams are captured using camera-equipped vehicles driven through the main streets of the city.
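The hierarchical vocabulary just described can be illustrated with a minimal greedy root-to-leaf quantization sketch. This is an assumption-laden toy: in practice the cluster centers come from hierarchical k-means over millions of training descriptors, the branching factor and depth are much larger, and (as discussed next) multipath descent may keep several candidate branches instead of one.

```python
import math

class VocabNode:
    """One node of a hierarchical visual vocabulary (vocabulary-tree style).

    An internal node holds one cluster center per child; a descriptor
    descends greedily to the nearest child until it reaches a leaf, whose
    `word_id` is the descriptor's visual word.
    """
    def __init__(self, centers, children=None, word_id=None):
        self.centers = centers        # list of centers, None at leaves
        self.children = children or []
        self.word_id = word_id        # set only at leaves

def quantize(node, descriptor):
    """Greedy root-to-leaf descent: return the visual-word id of a descriptor."""
    while node.children:
        i = min(range(len(node.centers)),
                key=lambda k: math.dist(node.centers[k], descriptor))
        node = node.children[i]
    return node.word_id

# Tiny hand-built tree: branching factor 2, depth 1, over 2-D "descriptors".
leaves = [VocabNode(None, word_id=0), VocabNode(None, word_id=1)]
root = VocabNode([[0.0, 0.0], [10.0, 10.0]], children=leaves)
word = quantize(root, [9.0, 8.5])  # nearest center is [10, 10]
```

The point of the tree is that quantization cost grows with depth times branching factor rather than with vocabulary size, which is what makes million-word vocabularies practical.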
In vocabulary tree–based recognition, a multipath visual word search has further been proposed to reduce the error of local descriptor quantization. The vocabulary tree–based approach was also adopted by Eade and Drummond [2008] for real-time loop closing, in which a compact local descriptor is extracted from the original SIFT descriptor using principal component analysis. In our previous work [Ji et al. 2009a], density-based metric learning is presented to refine the vocabulary tree construction. To perform scalable scene recognition, another choice is to use global features to identify a small number of candidates before local feature matching kicks in. For instance, Yeh et al. [2004] proposed a mobile location recognition approach that combines a color histogram and local features to perform hybrid visual similarity ranking. Stepping forward from city-scale location recognition, Cristani et al. [2008] proposed a vision-based location search system to search outdoor images all over southeastern France. Crandall et al. [2009] presented a system to identify landmark buildings based on metadata and images taken in a consecutive 15-minute window. The IM2GPS system [Hays et al. 2008] proposed to estimate the location distribution of a query image by searching within a worldwide geo-referenced image database. As a follow-up work, Kalogerakis et al. [2009] studied how to combine visual matching over an image sequence to improve the overall matching accuracy.

Learning-Based Recognition. In learning-based location recognition, each location "instance" (e.g., a landmark or a building) corresponds to a category, from which a recognition classifier is trained. For example, Symeon et al.
[2011] presented a scheme to detect landmarks and events in geo-tagged image collections, based on mining and detection over the image similarity graph as a means to combine heterogeneous similarities between query and reference images. Kretzschmar et al. [2008] proposed a vision-based location recognition system for robot navigation, in which the recognition model is trained from images collected from Flickr and the Google image search engine. Torniai et al. [2007] presented an overview of the creation, sharing, and usage of geo-tagged images, based on which a new photo-browsing interface was provided using both the location cue and the metadata context. Joshi et al. [2010] proposed to use image tags along with visual content to infer the image location, in which visual recognition is carried out on a basis of over one million geo-tagged images. Keiji and Qiu [2010] proposed to automatically select representative images for a given landmark by clustering bag-of-features with density-based popularity estimation and topic discovery. Luo et al. [2011] presented a detailed literature review on geographical tagging, which, different from this article, treats location recognition as a "geographical tagging" task, similar to the recognition reviewed in this section.

Combining with 3D Models. Recently, Irschara et al. [2009] adopted structure-from-motion to build 3D scene models for vision-based location recognition. In Irschara et al. [2009], 3D models are combined with a vocabulary tree to simultaneously recognize the location as well as the query orientation and pose. Beyond structure-from-motion, visual simultaneous localization and mapping (SLAM) has also been adopted; it was originally popular in robotic vision for self-localization in indoor environments. For instance, Xiao et al. [2008] proposed to combine bag-of-features with geometric verification to improve the precision of object recognition for robotic navigation.
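Geometric verification of the kind mentioned above can be illustrated with a minimal RANSAC sketch. For brevity this toy fits only a 2-D translation between matched keypoint coordinates; real systems fit an affine or homography model, and the tolerance and iteration count here are arbitrary illustrative choices.

```python
import random

def ransac_translation_inliers(matches, iters=100, tol=5.0, seed=0):
    """Geometric verification sketch: fit a 2-D translation between putative
    keypoint correspondences with RANSAC and return the largest inlier set.

    `matches` is a list of ((xq, yq), (xr, yr)) putative correspondences
    between query and reference keypoints.
    """
    rng = random.Random(seed)
    best = []
    for _ in range(iters):
        (xq, yq), (xr, yr) = rng.choice(matches)
        dx, dy = xr - xq, yr - yq   # translation hypothesis from one match
        inliers = [m for m in matches
                   if abs(m[1][0] - m[0][0] - dx) <= tol
                   and abs(m[1][1] - m[0][1] - dy) <= tol]
        if len(inliers) > len(best):
            best = inliers
    return best

# Three matches consistent with a (10, 5) shift, plus one gross outlier.
matches = [((0, 0), (10, 5)), ((3, 4), (13, 9)),
           ((7, 2), (17, 7)), ((1, 1), (90, 40))]
inliers = ransac_translation_inliers(matches)
```

The verified inlier count, rather than the raw number of shared visual words, is what separates a true near-duplicate from an accidental word-level collision.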
Incremental Recognition. Besides improving recognition accuracy and efficiency, how to maintain the recognition system in real-world applications is also an important issue. For example, is it possible to directly use the dictionary learned from one location dataset on another? Similarly, when facing the visual appearance changes of newly added images, how can the search system be maintained incrementally without rebuilding individual components? Such visual appearance variations are commonplace in real-world scenarios, for instance, in images captured across season changes, rain, snow, building appearance changes, and occlusions during data acquisition. Rebuilding every individual component would yield optimal accuracy, but it lacks efficiency and is indeed unacceptable in large-scale applications. As investigated in Ji et al. [2008], an alternative solution is to incrementally maintain the location recognition system, that is, to update, rather than rebuild, its core components (e.g., the visual vocabulary model and the inverted indexing system). In Ji et al. [2008], we explored the incremental updating of both aforementioned components to efficiently maintain the recognition system in a scalable and changing scenario. The key idea is to maintain a "balanced" vocabulary tree [Nister and Stewenius 2006] as new batches of images come in, after which the inverted indexing system is updated correspondingly. This algorithm has achieved accuracy very comparable to the time-consuming alternative of rebuilding the overall visual vocabulary and inverted indexing system.

Fig. 1. The ZuBuD Zurich building database.
Fig. 2. The Oxford Building database.

2.2.
Databases
The reference image datasets associated with GPS tags are built by sensing location entities such as street views, buildings, and landmarks. We review some widely used datasets in the following.

ZuBuD. The Zurich city building dataset [Shao et al. 2003] contains 1,005 images captured from 201 buildings in Zurich. This database has been widely adopted to evaluate the accuracy of vision-based location recognition at a small scale. All images were captured with a Panasonic NV-MX300 digital camera and a Pentax Opti430 camera, at a fixed resolution of 640 × 480 without flash. For each building or scene, five images were acquired from arbitrary viewpoints. In addition to the use of two different cameras, images were taken across different seasons and under occlusions to include wide variation in visual appearance. Figure 1 shows some examples from ZuBuD.

Oxford Buildings. The Oxford Building dataset [Philbin et al. 2007] contains 5,062 images collected from Flickr by searching for particular Oxford landmarks. The entire dataset has been manually annotated to generate a comprehensive ground truth for 11 different landmarks. Each landmark has 5 queries, producing 55 queries in total to evaluate location recognition accuracy. One of the following labels is assigned to each image: (1) Good: a nice, clear picture of the object/building; (2) OK: more than 25% of the object is clearly visible; (3) Bad: the object is not present; and (4) Junk: less than 25% of the object is visible, or there are very high levels of occlusion or distortion. Figure 2 shows some examples from Oxford Buildings.

SCity. The SCity database contains 20,000 street view photos, as shown in Figure 3. It was captured from the main streets of Seattle by a car with six surrounding cameras and a GPS device. The GPS location of each captured photo was obtained
by aligning the time stamps between photo streams and GPS tag sequences. The entire database contains over 1.2 million street view photos, from which 20,000 representative scene photos were selected to build the benchmark. We resized these photos to 400 × 300 pixels and extracted 300 features from each photo, on average. Every six successive photos were grouped as a scene. This database has been used in our previous works [Ji et al. 2008, 2009a]. Figure 3 shows some examples from SCity.

Fig. 3. Examples from the SCity database captured from Seattle street-side views, in which every six consecutive images are defined as a scene.

PKUBench. The Peking University Benchmark dataset contains 5,574 images of 198 landmarks captured on the Peking University (PKU) campus, among which there are 567 queries, 3,805 matching pairs, and 48,675 nonmatching pairs. Five groups of exemplar mobile query scenarios (178 in total) are provided to evaluate real-world visual search performance in challenging situations: (1) an occlusive query set containing 20 mobile queries and 20 corresponding digital camera queries, where the occlusion comes from foreground cars, people, or buildings; (2) a background-cluttered query set containing 20 mobile queries and 20 corresponding digital camera queries captured far away from a landmark with cluttered backgrounds; (3) a night query set containing 9 mobile phone queries and 9 digital camera queries, in which the photo quality heavily depends on the lighting conditions; (4) a blurring-and-shaking query set containing 20 mobile queries with blurring or shaking, as well as 20 corresponding mobile queries without any blurring or shaking; and (5) a distractor set, added to evaluate how imprecise contextual tags affect visual search accuracy. We collected a distractor set of 6,629 photos from the Summer Palace2 and 2,012 photos from other locations of the PKU campus.
We then randomly assigned these distractor images GPS tags valid in the original PKUBench dataset. Finally, 20 locations with 40 queries were selected to evaluate the search accuracy degradation. Figure 4 shows some examples from PKUBench.

2 The landmarks in the Summer Palace are visually similar to those on the Peking University campus.

Fig. 4. Exemplar illustration of the data acquisition for the PKUBench dataset.
Fig. 5. Exemplar photos of worldwide landmarks from the geo-tagged Flickr photo collection. © Flickr, copyright reserved.

3. LANDMARK DISCOVERY AND RECOMMENDATION
In recent years, there has been increasing interest in the recognition and mining of landmarks from geo-tagged photos on photo sharing websites such as Flickr and Picasa [Kennedy et al. 2007; Hays et al. 2008; Ji et al. 2009a; Zheng et al. 2009; Li et al. 2008]. As a sort of user-generated content, these photos typically reveal the behavioral consensus of social media users, resulting in a dominant portion of landmarks in these photos. From the perspective of visual appearance, such user consensus also corresponds to the near-duplicate statistics of photos tagged nearby. Figure 5 shows several well-recognized landmarks all over the world. Mining landmarks from social multimedia enables a wide variety of applications, ranging from city scene summarization and 3D modeling to tourism recommendation. In the aforementioned applications, a fundamental step is to mine representative landmarks from unorganized photo collections. To this end, discovering representative data is a longstanding research topic in the literature, some of which can be traced back to the 2000s. For instance, discovering popular news is a hot topic, as studied in Wayne et al. [2000], Mannilla et al. [1997], and Hu et al. [2006].
One representative work is the Topic Detection and Tracking project [Wayne 2000], which aimed to discover topics in newswires and broadcast news in an incremental manner. Mannilla et al. [1997] proposed to discover frequent episodes from time-ordered event sequences. Hu et al. [2006] proposed to group news stories from multiple sources and dig out the most important stories through mutual reinforcement between news articles and sources. Regarding web page importance evaluation, PageRank [Brin and Page 1998] is one of the most important techniques used in state-of-the-art commercial search engines such as Google. It is a static ranking algorithm that assigns a numerical weighting to each web page in a hyperlinked environment, with the purpose of "measuring" its relative importance to other pages. PageRank has been used for representative image discovery. For instance, Jing and Baluja [2008] modeled image content associations as visual "links" to simulate PageRank-like representative image selection. In Jing and Baluja [2008], a visual link between two images is built once local features extracted from both images are quantized into an identical visual word. Given the links among images, the image set is treated as a web page collection, on which PageRank can be run to select the most "popular" images. An alternative to PageRank is the HITS algorithm [Kleinberg et al. 1999], which aims to infer page importance in a dynamic ranking scenario. It identifies two properties of a page: authority and hub. Authority stands for the importance of a web page, calculated from the ensemble of links from other pages. Hub stands for the confidence of the page in producing its links to other pages. Learning is carried out by semisupervised propagation between the authority and hub layers.

3.1. Existing Approaches
Landmark Discovery.
Different from geographical location recognition as reviewed in Section 2, landmark discovery refers to recognizing a landmark identity, for instance, "Central Park," "Statue of Liberty," or "Times Square."3 Earlier work on landmark discovery usually resorted to supervised recognition methods based on visual or contextual similarity among geo-tagged multimedia data [Hays and Efros 2008; Zheng et al. 2009; Li et al. 2009; Tsai et al. 2005; Jing et al. 2006; O'Hare et al. 2005; Crandall et al. 2009; Snavely et al. 2006]. For instance, Tsai et al. [2005] presented an image-based landmark matching system to find predefined landmarks with interactive geo-location selection over a map. O'Hare et al. [2005] presented an image-based search system to match scenes of a given location in a manually collected, geo-tagged photo dataset. Simon et al. [2007] proposed a SIFT-based image clustering scheme with a hierarchical presentation interface to select representative canonical views from geo-tagged Flickr photos. Jing et al. [2006] presented a text-based photo recommendation approach in which user comments from forums were used to estimate, albeit imprecisely, the popularity of landmark photos. Crandall et al. [2009] proposed to combine visual, textual, and temporal features to register Internet images to worldwide maps. Similarly, the IM2GPS system [Hays and Efros 2008] performed image-based matching to locate query photos in the world. Zheng et al. [2009] deployed an Internet-scale landmark recognition engine, which acquires candidate images by querying online image search engines with a set of predefined landmarks; landmarks are then discovered via reclustering and pruning of the image search results. Li et al. [2009] presented a large-scale landmark classification system deployed over geo-tagged Flickr images using multiclass support vector machines. Gao et al.
[2010] proposed to recognize and rank landmarks using both Yahoo! Travel Guide information and the Flickr image corpus. In this method, landmarks are located with the geo-tag information, and popular tourist attractions are selected based on both the subjective opinions of travel editors and their popularity among tourists.

3 In this article, landmark recognition refers to recognizing a specific landmark or building facade to identify the name of the landmark. Under such circumstances, the training and testing images should capture the same or similar (overlapping) viewing angles of a given landmark. Such landmarks typically refer to specific famous locations in a city, such as the Statue of Liberty in New York or the Forbidden City in Beijing.

Landmark Mining. Unsupervised mining of representative landmarks is widely studied in the literature. For instance, Kennedy et al. [2007] presented a tag-location-vision-based processing pipeline to mine landmarks from Flickr photos geo-tagged within San Francisco, as shown in Figure 5. In this approach, metadata consensus is considered by weighting author confidences, time durations, and context variations. Ji et al. [2009b] presented a graph-based landmark mining framework to discover landmarks from blogs. Similar to Kennedy et al. [2007], this approach integrates author, context, and content cues into a HITS reinforcement model, together with a collaborative filtering operation to find and summarize city scenes for landmark recommendation. Cao et al. [2009] proposed to model the canonical correlation between heterogeneous features and an annotation lexicon of interest; a generalized semantic and geographical annotation engine is built on these canonical correlations to improve the annotation accuracy of Internet images.

Tourism Recommendation. For a given location, the system collects online recommendations and photos from Internet users. Then, it mines landmarks and traveling
characteristics to facilitate user navigation and recommendation. For instance, the IM2GPS project [Hays and Efros 2008; Kalogerakis et al. 2009] enables location estimation from sequential geo-tagged Flickr photos. These mining results are then used to model human activities for personalized traveling recommendation. The mined knowledge provides valuable information about tourism, which is a supplement to existing text-based recommendation systems. For the purpose of modeling human activity, Brockmann et al. [2006] analyzed the distribution of human traveling logs using random walks. Hufnagel et al. [2004] analyzed airline traffic data for traveling activity understanding. Gonzalez et al. [2008] analyzed traces of mobile phones captured via base station records to study human mobility statistics.

Application Systems. There are emerging applications and real-world systems devoted to the recognition and mining of geo-social multimedia. Representative systems include TripAdvisor, Lonely Planet, National Geographic, Yelp, and so forth, which feature the use of community knowledge to collaboratively edit, recommend, and evaluate touristic information. Two famous systems are WikiTravel (http://wikitravel.org/) and Yahoo! Travel Guide (http://travel.yahoo.com/). WikiTravel was founded in 2003 and has served as one of the most effective tourism recommendation systems. As of October 2009, it contained 22,490 destination guides contributed by WikiTravelers all over the world. For each destination, the articles in WikiTravel generally include history, climate, landmarks, food, and shopping information. Yahoo! Travel Guide provides an area-based guide service. It lists several main cities in each country. For each city, Yahoo!
Travel Guide provides information on key attractions, where a series of ranked landmarks is shown to users along with their comments. Apart from the aforementioned systems, which require manual editing, there are also systems that derive touristic recommendations by mining human traveling logs recorded on traveling websites. For instance, the Microsoft Travel Guide [2008] system offers tourism recommendations drawn from the collaborative knowledge of online users.

3.2. Databases

Most existing large-scale landmark mining systems operate on data crawled from photo sharing websites such as Flickr and Picasa. We review some representative sources in the following.

Geo-Tagged Flickr Photos. Ji et al. [2009b] show the distribution of geo-tagged Flickr photos in 20 worldwide metropolises through 2009, including Beijing, Berlin, Boston, Chicago, Houston, London, Los Angeles, Moscow, New York, Philadelphia, Rome, Seattle, Shanghai, Sydney, Tokyo, Paris, Toronto, Vancouver, Venice, and Vienna. Interestingly, this distribution can serve as a hint reflecting the touristic popularity of these cities to a certain degree.

Blog Photos. Blogs serve as another source of geo-social multimedia. Blogs such as Windows Live Spaces and Google Blogger contain tens of millions of sight photos with various forms of metadata tagging, such as photographer, time, location names, and user comments. Note that visually similar photos densely distributed in nearby geographic locations can identify the landmarks of a given location. For instance, tourists in New York City usually take photos of the front view of the Statue of Liberty and tag them in their blogs. These photos occupy a large portion of sight photos in New York City, which indicates that the Statue of Liberty is a famous landmark of the city. In Ji et al. [2009b], blog photos from 20 worldwide metropolises are crawled from Live Spaces's RSS feeds, from which representative landmarks in each city are mined.
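The geographic density intuition above can be sketched as a toy Python routine: photos are bucketed into latitude/longitude grid cells, and a cell is reported as a candidate landmark only when enough distinct users have posted there. This is a minimal illustration under assumed parameters (the cell size, user threshold, and `Photo` record are hypothetical), not the actual pipeline of Kennedy et al. [2007] or Ji et al. [2009b], which additionally verify visual near-duplicates and weight author and context cues.

```python
from collections import defaultdict
from typing import NamedTuple

class Photo(NamedTuple):   # hypothetical record; real crawls carry far richer metadata
    lat: float
    lon: float
    user: str

def mine_landmark_cells(photos, cell_deg=0.01, min_users=3):
    """Bucket geo-tagged photos into cell_deg x cell_deg grid cells and
    return the cells supported by at least min_users distinct users,
    most-supported cell first."""
    users_per_cell = defaultdict(set)
    for p in photos:
        cell = (int(p.lat / cell_deg), int(p.lon / cell_deg))
        users_per_cell[cell].add(p.user)   # count users, not photos
    candidates = [(len(users), cell) for cell, users in users_per_cell.items()
                  if len(users) >= min_users]
    return [cell for _, cell in sorted(candidates, reverse=True)]

# Toy data: three users photograph roughly the same spot; one user is elsewhere.
photos = [Photo(40.6892 + i * 1e-4, -74.0445, f"u{i}") for i in range(3)]
photos.append(Photo(40.7500, -73.9800, "u9"))
print(mine_landmark_cells(photos))   # only the densely photographed cell survives
```

Counting distinct users rather than raw photos follows the consensus idea above: one prolific uploader should not be able to turn a private backyard into a "landmark."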
During August 2008, 380,000 sight photos were crawled together with their contextual descriptors.

Fig. 6. The Google Earth interface at Berlin and the Virtual Earth interface at Las Vegas. Google Earth, 2011.

Online Landmark Corpus. Systems such as PhotoSynth [Snavely et al. 2006] further enable Internet users to upload tourism photos to build 3D landmark models online. An advantage of online uploaded photos is the interactive functionality; that is, users are able to interactively guide the construction process of the 3D models.

4. 3D SCENE MODELING AND CITY NAVIGATION

Most of the world's famous locations have been photographed under various conditions, both from the ground and from the air. This has produced a very rich imagery of landmarks and cities, enabling 3D modeling and navigation. Given the 3D model of a certain scene or landmark, vision-based augmented reality becomes much easier; the gain comes from eliminating the previous ambiguity in matching real-world scenes to 2D images.4 In the following, we review related work on 3D modeling and augmented (virtual) reality navigation using geo-social multimedia.

4.1. Geographical Modeling

City-Scale Modeling. Currently, 3D modeling of worldwide cities has been launched in several commercial systems, such as Google Earth (earth.google.com, a snapshot of which is shown in Figure 6) and Virtual Earth (maps.live.com). In addition to these commercial systems, several research projects target 3D modeling at the city scale. At an early stage, the semiautomatic Facade system [Debevec et al. 1996] showed good results in reconstructing compelling fly-throughs of the UC Berkeley campus. The MIT City Scanning Project [Teller et al. 2003] captured thousands of calibrated images from an instrumented rig to construct a 3D model of the MIT campus.
Similar work can also be found in the Stanford CityBlock Project [Roman et al. 2004] and the UrbanScape Project [Akbarzadeh et al. 2006]. Hu et al. [2003] gave a detailed review of the earlier work on this topic. Xiao et al. [2009] presented a scheme to reconstruct 3D street-view models from street-view photo collections. Its main idea is to first adopt multiview semantic segmentation to recognize and segment images into semantically meaningful areas such as buildings, sky, ground, vegetation, and cars. Then, 3D building structures are obtained through an inverse patch-based orthographic composition and structure analysis.

Landmark-Scale Modeling. Aiming at modeling individual landmarks, the PhotoSynth system [Snavely et al. 2006] adopted image-based reconstruction with structure-from-motion and bundle adjustment. Li et al. [2008] proposed an iconic scene graph representation to build stereo models within a given scene location for 3D landmark recognition.

4 Matching from a 2D image to a 3D scene is much more challenging because an ambiguity arises when matching 2D points to 3D points, as the depth information is missing. On the contrary, 3D-to-3D matching can be more accurate by reducing such depth ambiguity, once dense 3D points can be obtained.

Fig. 7. Examples of Google Street View images. Google Street View, 2009.

In either city-scale or landmark-scale modeling, the Structure-from-Motion (SfM) technique [Longuet-Higgins 1981] is widely used for recovering the 3D structure of scenes [Hartley and Zisserman 2004; Snavely et al. 2006; Li et al. 2008]. Given a set of images captured at different viewing angles and positions, SfM aims to simultaneously reconstruct the 3D scene model and recover the camera positions and orientations.
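As a minimal, self-contained illustration of the geometry underlying SfM (only the triangulation step, with the two camera matrices assumed known rather than estimated, and synthetic data in place of real photographs), the sketch below recovers a 3D point from its two projections via the direct linear transform:

```python
import numpy as np

def triangulate(P1, P2, x1, x2):
    """DLT triangulation: recover a 3D point from its projections x1, x2
    in two views with known 3x4 camera matrices P1, P2."""
    A = np.vstack([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    _, _, Vt = np.linalg.svd(A)   # null vector of A is the homogeneous 3D point
    X = Vt[-1]
    return X[:3] / X[3]           # dehomogenize

def project(P, X):
    x = P @ np.append(X, 1.0)
    return x[:2] / x[2]

# Synthetic setup: identity intrinsics, second camera translated along the x-axis.
P1 = np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = np.hstack([np.eye(3), np.array([[-1.0], [0.0], [0.0]])])
X_true = np.array([0.5, 0.2, 4.0])

X_rec = triangulate(P1, P2, project(P1, X_true), project(P2, X_true))
print(np.allclose(X_rec, X_true))   # noise-free projections triangulate exactly
```

A full SfM pipeline additionally estimates P1 and P2 themselves from feature correspondences and refines all unknowns jointly via bundle adjustment, as in the systems cited above.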
SfM is achieved by building feature correspondences across the image collection. As variations, multiframe SfM methods have been developed based on factorization with global optimization [Tomasi and Kanade 1992; Aloimonos 1993; Szeliski and Kang 1994; Oliensis 1999]. To handle unknown camera calibration parameters, self-calibration has been used for projective 3D reconstruction [Pollefeys and Van Gool 2002]. Vergauwen and Van Gool [2006] developed a similar approach for use in cultural heritage applications. SfM is also used in the PhotoTourism system [Snavely et al. 2006], which has been demonstrated to be very successful in building 3D models from real-world image collections uploaded by social multimedia users. It is worth mentioning that the input of PhotoTourism may come from different cameras, at multiple zoom levels and resolutions, and at different times, days, or seasons, with changing illumination, weather, and occlusions.

Another widely used technique is Image-Based Rendering (IbR), which aims to synthesize new views of a scene from a set of input photographs. Earlier work on IbR includes the Aspen MovieMap Project [Lippman 1980], in which thousands of images of Aspen, Colorado, captured from a moving car were registered to a street map of the city and stored on laserdisc. The Sea of Images Project [Aliaga et al. 2003] used a large collection of images taken throughout an architectural space, in which features are matched across multiple reference images for IbR. Recently, companies like Google and EveryScape [2009] have created similar "surrogate travel" applications that can be viewed in a web browser; many of them now cover millions of photos worldwide.

Representative View Selection. Rather than building exact 3D models, the work by Simon et al. [2007] aimed to compute a minimal canonical subset of views. Such a set of views can best represent a large collection of given images to facilitate user navigation.
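One simple way to realize such a canonical subset is a greedy maximum-coverage heuristic over bag-of-visual-words descriptors, offered here as an illustrative stand-in (with made-up word sets) rather than the actual clustering procedure of Simon et al. [2007]: repeatedly pick the view that covers the most visual words not yet covered.

```python
def canonical_views(word_sets, k=3):
    """Greedily select up to k views whose visual-word sets jointly cover
    as many distinct visual words as possible (greedy max-coverage)."""
    covered, chosen = set(), []
    for _ in range(k):
        remaining = [i for i in range(len(word_sets)) if i not in chosen]
        if not remaining:
            break
        best = max(remaining, key=lambda i: len(word_sets[i] - covered))
        if not word_sets[best] - covered:
            break                 # no remaining view contributes new words
        chosen.append(best)
        covered |= word_sets[best]
    return chosen

# Toy collection: each set holds the quantized visual words of one image.
views = [{1, 2, 3}, {3, 4}, {1, 2}, {5, 6, 7, 8}]
print(canonical_views(views, k=2))   # the two most complementary views win
```

The greedy choice is the classic (1 - 1/e)-approximation to maximum coverage, which makes it a reasonable surrogate when computing an exactly minimal subset is intractable.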
Simon et al. [2007] used SIFT-based image clustering with a hierarchical selection interface to find representative canonical views from Flickr photos.

4.2. Geo-Navigation

The massive amount of geo-social multimedia also enables effective navigation of landmarks, street views, and cities. For instance, the Google Street View project, as shown in Figure 7, has recorded panoramas in worldwide cities, organized by map coordinates at both the semantic and symbolic levels. Google Street View has blurred pedestrian faces since 2008 for privacy reasons. It is worth noting that the Google Ocean and Google Sky projects have also been launched recently.

5. FUTURE CHALLENGES AND TRENDS

Scalability. One of the main driving forces behind geographical-aware social multimedia is the prosperity of the user community, which largely reduces the extensive cost of human labor in producing or collecting media content and metadata. Such a data scale raises another point: at an ordinary data scale, due to data sparsity and bias, existing approaches usually adopt parametric prediction models [Ji et al. 2009b; Kennedy et al. 2007; Crandall et al. 2009]. However, whether parametric models remain suitable for modeling the massive and ever-growing data scale is questionable. In such scenarios, nonparametric models are a promising direction to investigate.

The most promising application of geo-social multimedia lies in its integration with mobile devices. In such a case, geographical information and social cues can be deployed in numerous applications, such as vision-based city navigation [Schindler and Brown 2007] and location-sensitive tourism recommendation [Hays and Efros 2008; Kalogerakis et al. 2009]. One potential trend left unexploited lies in the analytics of social statistics from geo-social multimedia.
Indeed, photos, videos, and even GPS tags can all be regarded as signals reflecting the social activity of users. In particular, the recognition of user groups from their geo-social multimedia is still an open problem [Ji et al. 2012]. Solving this problem has many potential applications, such as personalized tourism recommendation, friend recommendation, and commercial suggestion.

One feasible solution to the scalability issue is to distribute computing and services on the cloud. In such a case, the key design question lies in how to parallelize the computing pipeline and distribute the computing structure. One consideration is to combine the original geographical information to parallelize the data storage. For instance, in building a near-duplicate visual matching system for location recognition, it is not always necessary to build the indexing model over the entire city [Ji et al. 2008, 2009a]. Instead, the geo-tagged photos can be subdivided into subregions, for example, by dividing data into geographical groups using either GPS or base station tags [Ji et al. 2012]. Under such circumstances, more precise and more efficient indexing models can be expected.

REFERENCES

A. Akbarzadeh, J. M. Frahm, P. Mordohai, and B. Clipp. 2006. Towards urban 3D reconstruction from video. In 3DPVT.
D. G. Aliaga, D. Yanovsky, T. Funkhouser, and I. Carlbom. 2003. Interactive image-based rendering using feature globalization. In International Symposium on Interactive 3D Graphics.
Y. Aloimonos. 1993. Active Perception. Lawrence Erlbaum Associates.
S. Ben-David, J. Blitzer, K. Crammer, and F. Pereira. 2006. Analysis of representations for domain adaptation. In NIPS.
D. Blei, A. Ng, and M. Jordan. 2003. Latent Dirichlet allocation. Journal of Machine Learning Research 3, 993–1022.
S. Brin and L. Page. 1998. The anatomy of a large-scale hypertextual Web search engine. In World Wide Web.
S. Brin. 1995.
Near neighbor search in large metric spaces. In VLDB. 574–584.
D. Brockmann, L. Hufnagel, and T. Geisel. 2006. The scaling laws of human travel. Nature 439, 7075, 462–465.
O. Buyukkokten, J. Cho, H. Garcia-Molina, L. Gravano, and N. Shivakumar. 1999. Exploiting geographic location information of web pages. In ACM SIGMOD Workshop on the Web and Databases.
I. Cadez and P. Bradley. 2001. Model based population tracking and automatic detection of distribution changes. In NIPS.
C. Campbell and K. P. Bennett. 2001. A linear programming approach to novelty detection. In NIPS.
L. Cao, Y. Gao, Q. Liu, and R. Ji. 2012. Geographical retagging. In Multimedia Modeling.
L.-L. Cao, J. Yu, J. Luo, and T. S. Huang. 2009. Enhancing semantic and geographic annotation of web images via logistic canonical correlation regression. In ACM Multimedia.
D. Crandall, L. Backstrom, D. Huttenlocher, and J. Kleinberg. 2009. Mapping the world's photos. In WWW.
M. Cristani, A. Perina, U. Castellani, and V. Murino. 2008. Geolocated image analysis using latent representations. In CVPR.
P. E. Debevec, C. J. Taylor, and J. Malik. 1996. Modeling and rendering architecture from photographs: A hybrid geometry and image-based approach. In ACM SIGGRAPH. 11–20.
D. Donoho. 2006. For most large underdetermined systems of equations, the minimal l1-norm near-solution approximates the sparsest near-solution. Communications on Pure and Applied Mathematics 59, 7, 907–934.
M. Dundar and J. Bi. 2007. Joint optimization of cascaded classifiers for computer aided detection. In CVPR.
E. D. Eade and T. W. Drummond. 2008. Unified loop closing and recovery for real time monocular SLAM. In BMVC.
EveryScape. 2009. Homepage. Retrieved from www.everyscape.com.
K. Fukunaga and P. M. Narendra. 1975.
A branch and bound algorithm for computing k-nearest neighbors. IEEE Transactions on Computers 24, 7, 750–753.
L. Fei-Fei and P. Perona. 2007. A Bayesian hierarchical model for learning natural scene categories. In ICCV.
Y. Gao, J. Tang, R. Hong, Q. Dai, T.-S. Chua, and R. Jain. 2010. W2Go: A travel guidance system by automatic landmark ranking. In International Conference on Multimedia.
T. Goedeme and T. Tuytelaars. 2004. Fast wide baseline matching for visual navigation. In CVPR. 24–29.
M. C. Gonzalez, C. A. Hidalgo, and A.-L. Barabasi. 2008. Understanding individual human mobility patterns. Nature 453, 7196, 779–782.
K. Grauman and T. Darrell. 2007. Approximate correspondences in high dimensions. In NIPS.
A. Irschara, C. Zach, J. M. Frahm, and H. Bischof. 2009. From structure-from-motion point clouds to fast location recognition. In CVPR.
R. I. Hartley and A. Zisserman. 2004. Multiple View Geometry. Cambridge University Press.
J. Hays and A. Efros. 2008. IM2GPS: Estimating geographic information from a single image. In CVPR.
T. Hofmann. 2001. Unsupervised learning by probabilistic latent semantic analysis. Machine Learning 41, 177–196.
Y. Hu, M.-J. Li, Z. Li, and W.-Y. Ma. 2006. Discovering authoritative news sources and top news stories. In Information Retrieval Technology, Section 2A: Web Information Retrieval.
J. Hu, S. You, and U. Neumann. 2003. Approaches to large-scale urban modeling. IEEE Computer Graphics and Applications.
L. Hufnagel, D. Brockmann, and T. Geisel. 2004. Forecast and control of epidemics in a globalized world. PNAS 101, 24, 15124–15129.
Y. Li, D. J. Crandall, and D. P. Huttenlocher. 2009. Landmark recognition in large-scale image collections. In ICCV.
H. Li, R. K. Srihari, C. Niu, and W. Li. 2002. Location normalization for information extraction. In COLING.
H. Li, R. K. Srihari, C. Niu, and W. Li.
2003. InfoXtract location normalizations: A hybrid approach to geographic references in information extraction. In International Workshop on the Analysis of Geographic References.
A. Lippman. 1980. Movie maps: An application of the optical videodisc to computer graphics. In ACM SIGGRAPH. 32–43.
D. Liu, M. Scott, R. Ji, H. Yao, and X. Xie. 2009. Geolocation sensitive image based advertisement platform. In ACM Multimedia.
J. Kleinberg. 1999. Authoritative sources in a hyperlinked environment. Journal of the ACM 46, 5, 604–632.
H. Jegou, H. Harzallah, and C. Schmid. 2007. A contextual dissimilarity measure for accurate and efficient image search. In CVPR.
R. Ji, Y. Gao, B. Zhong, H. Yao, and Q. Tian. 2012. Mining Flickr landmarks by modeling reconstruction sparsity. ACM Transactions on Multimedia Computing, Communications, and Applications.
R. Ji, X. Xie, H. Yao, W.-Y. Ma, and Y. Wu. 2008. Vocabulary tree incremental indexing for scalable scene recognition. In IEEE International Conference on Multimedia and Expo.
R. Ji, X. Xie, H. Yao, and W.-Y. Ma. 2009a. Hierarchical optimization of visual vocabulary for effective and transferable retrieval. In CVPR.
R. Ji, X. Xie, H. Yao, and W.-Y. Ma. 2009b. Mining city landmarks from blogs by graph modeling. In ACM Multimedia.
M. Jia, X. Fan, X. Xie, M. Li, and W.-Y. Ma. 2006. Photo-to-search: Using camera phones to inquire of the surrounding world. In Mobile Data Management.
Y. Jing and S. Baluja. 2008. PageRank for product image search. In World Wide Web.
F. Jing, L. Zhang, and W.-Y. Ma. 2006. VirtualTour: An online travel assistant based on high quality images. In ACM Multimedia.
D. Joshi, A. Gallagher, J. Yu, and J. Luo. 2010. Inferring photographic location using geotagged web images. Multimedia Tools and Applications.
F. Jurie and B. Triggs. 2005. Creating efficient codebooks for visual recognition. In ICCV.
E.
Kalogerakis, O. Vesselova, J. Hays, A. Efros, and A. Hertzmann. 2009. Image sequence geolocation with human travel priors. In ICCV.
K. Yanai and B. Qiu. 2010. Mining regional representative photos from consumer-generated geotagged photos. In Handbook of Social Network Technologies and Applications.
L. Kennedy, M. Naaman, and S. Ahern. 2007. How Flickr helps us make sense of the world: Context and content in community-contributed media collections. In ACM Multimedia.
H. Kretzschmar, C. Stachniss, C. Plagemann, and W. Burgard. 2008. Estimating landmark locations from geo-referenced photographs. In IEEE Conference on Intelligent Robots and Systems.
J. A. Lee, K.-C. Yow, and A. Sluzek. 2008. Image-based information guide on mobile devices. In Advances in Visual Computing.
T. Leung and J. Malik. 2005. Representing and recognizing the visual appearance of materials using 3-D textons. International Journal of Computer Vision 32, 1, 29–44.
X. Li, C. Wu, C. Zach, S. Lazebnik, and J.-M. Frahm. 2008. Modeling and recognition of landmark image collections using iconic scene graphs. In ECCV.
H. C. Longuet-Higgins. 1981. A computer algorithm for reconstructing a scene from two projections. Nature 293, 133–135.
D. G. Lowe. 2004. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision 60, 2, 91–110.
J. Luo, D. Joshi, J. Yu, and A. Gallagher. 2011. Geotagging in multimedia and computer vision—a survey. Multimedia Tools and Applications 51, 1, 187–211.
Y. Ma, H. Derksen, W. Hong, and J. Wright. 2007. Segmentation of multivariate mixed data via lossy coding and compression. IEEE Transactions on Pattern Analysis and Machine Intelligence.
H. Mannila, H. Toivonen, and A. Verkamo. 1997. Discovery of frequent episodes in event sequences. In ACM SIGKDD.
J. Matas, O. Chum, M. Urban, and T. Pajdla. 2004. Robust wide-baseline stereo from maximally stable extremal regions. Image and Vision Computing 22, 10, 761–767.
K. Mikolajczyk, B. Leibe, and B.
Schiele. 2005. Local features for object class recognition. In ICCV.
D. Nister and H. Stewenius. 2006. Scalable recognition with a vocabulary tree. In CVPR.
J. Oliensis. 1999. A multi-frame structure-from-motion algorithm under perspective projection. International Journal of Computer Vision 34, 2–3, 163–192.
N. O'Hare, C. Gurrin, G. Jones, and A. Smeaton. 2005. Combination of content analysis and context features for digital photograph retrieval. In European Workshop on the Integration of Knowledge, Semantic and Digital Media Technology.
L. Paletta and G. Fritz. 2005. Urban object detection from mobile phone imagery using informative SIFT descriptors. In Scandinavian Conference on Image Analysis.
J. Philbin, O. Chum, M. Isard, J. Sivic, and A. Zisserman. 2007. Object retrieval with large vocabularies and fast spatial matching. In CVPR.
M. Pollefeys, R. Koch, and L. Van Gool. 1999. Self-calibration and metric reconstruction in spite of varying and unknown internal camera parameters. International Journal of Computer Vision 32, 1, 7–25.
M. Pollefeys and L. Van Gool. 2002. From images to 3D models. Communications of the ACM 45, 7, 50–55.
P. Resnick, N. Iacovou, M. Suchak, P. Bergstrom, and J. Riedl. 1994. GroupLens: An open architecture for collaborative filtering of netnews. In ACM Conference on Computer Supported Cooperative Work.
D. Robertson and R. Cipolla. 2004. An image-based system for urban navigation. In BMVC.
A. Roman, G. Garg, and M. Levoy. 2004. Interactive design of multi-perspective images for visualizing urban landscapes. In IEEE Conference on Visualization.
G. Salton and C. Buckley. 1988. Term-weighting approaches in automatic text retrieval. Information Processing and Management 24, 5, 513–523.
G. Schindler and M. Brown. 2007. City-scale location recognition. In CVPR.
H.
Shao, T. Svoboda, T. Tuytelaars, and L. J. Van Gool. 2003. HPAT indexing for fast object/scene recognition based on local appearance. In CIVR.
H. Shao, T. Svoboda, and L. Van Gool. 2003. ZuBuD: Zurich Buildings Database for Image Based Recognition. Technical Report.
I. Simon, N. Snavely, and S. M. Seitz. 2007. Scene summarization for online image collections. In ICCV.
J. Sivic and A. Zisserman. 2003. Video Google: A text retrieval approach to object matching in videos. In ICCV.
N. Snavely, S. Seitz, and R. Szeliski. 2006. PhotoTourism: Exploring photo collections in 3D. In ACM SIGGRAPH.
S. Papadopoulos, C. Zigkolis, Y. Kompatsiaris, and A. Vakali. 2011. Cluster-based landmark and event detection for tagged photo collections. IEEE Multimedia 18, 1, 52–63.
R. Szeliski and S. B. Kang. 1994. Recovering 3D shape and motion from image streams using nonlinear least squares. Journal of Visual Communication and Image Representation 5, 1, 10–28.
R. Szeliski. 2006. Image alignment and stitching: A tutorial. Foundations and Trends in Computer Graphics and Computer Vision 2, 1, 1–104.
S. Teller, M. Antone, Z. Bodnar, M. Bosse, S. Coorg, M. Jethwa, and N. Master. 2003. Calibrated, registered images of an extended urban area. International Journal of Computer Vision 53, 1, 93–107.
R. Tibshirani. 1997. Regression shrinkage and selection via the Lasso. Journal of the Royal Statistical Society 58, 1, 267–288.
C. Tomasi and T. Kanade. 1992. Shape and motion from image streams under orthography: A factorization method. International Journal of Computer Vision 9, 2, 133–154.
C. Torniai, S. Batte, and S. Cayzer. 2007. Sharing, Discovering and Browsing Geotagged Pictures on the Web. HP Labs Technical Report.
Travel Guide. 2008. Homepage. Retrieved from www.travel.msra.cn.
B. Triggs, A. Fitzgibbon, R. Hartley, and P. F. McLauchlan. 1999. Bundle adjustment—A modern synthesis. In International Workshop on Vision Algorithms. 298–372.
C. Tsai, A. Qamra, and E. Chang. 2005.
Extent: Inferring image metadata from context and content. In ICME.
M. Vergauwen and L. Van Gool. 2006. Web-based 3D reconstruction service. Machine Vision and Applications 17, 2, 321–329.
P. Viola and M. Jones. 2001. Rapid object detection using a boosted cascade of simple features. In CVPR.
L. Wang. 2007. Toward a discriminative codebook: Codeword selection across multi-resolution. In CVPR.
C. Wang, X. Xie, L. Wang, Y. Lu, and W.-Y. Ma. 2005. Detecting geographic locations from web resources. In ACM Geographical Information Systems Workshop.
C. L. Wayne. 2000. Multilingual topic detection and tracking: Successful research enabled by corpora and evaluation. In International Conference on Language Resources and Evaluation.
J. Wright, A. Yang, A. Ganesh, S. Sastry, and Y. Ma. 2009. Robust face recognition via sparse representation. IEEE Transactions on Pattern Analysis and Machine Intelligence.
J. X. Xiao, J. N. Chen, D. Y. Yeung, and L. Quan. 2008. Structuring visual words in 3D for arbitrary-view object localization. In ECCV.
J. Xiao, T. Fang, P. Zhao, M. Lhuillier, and L. Quan. 2009. Image-based street-side city modeling. In ACM SIGGRAPH Asia.
J. Yang, Y.-G. Jiang, A. Hauptmann, and C.-W. Ngo. 2007. Evaluating bag-of-visual-words representations in scene classification. In ACM Multimedia.
J. Yang, J. Wright, T. Huang, and Y. Ma. 2008. Image super-resolution as sparse representation of raw image patches. In CVPR.
R. Baeza-Yates and B. Ribeiro-Neto. 1999. Modern Information Retrieval. ACM Press.
T. Yeh, J. Lee, and T. Darrell. 2007. Adaptive vocabulary forest for dynamic indexing and category learning. In CVPR.
T. Yeh, K. Tollmar, and T. Darrell. 2004. Searching the web with mobile images for location recognition. In CVPR.
W. Zhang and J. Kosecka. 2006. Image based localization in urban environments. In International Symposium on 3D Data Processing, Visualization and Transmission.
Y. Zheng, L. Liu, L. Wang, and X. Xie. 2008. Learning transportation modes from raw GPS data for geographic applications on the web. In World Wide Web.
Y.-T. Zheng, M. Zhao, Y. Song, and H. Adam. 2009. Tour the world: Building a web-scale landmark recognition engine. In CVPR.

Received August 2013; revised October 2013; accepted December 2013