What Makes Paris Look like Paris?
Carl Doersch¹, Saurabh Singh¹, Abhinav Gupta¹, Josef Sivic², Alexei A. Efros¹,²
¹Carnegie Mellon University, ²INRIA / Ecole Normale Supérieure, Paris
SIGGRAPH 2012
Presenter: Yunhai@VCC

Outline
• Problem
• Related Work
• Approach
• Results and Validation
• Application

Problem
• Given a large repository of geotagged imagery, how can we automatically find the visual elements that are most distinctive for a certain geo-spatial area?

Characteristic
• Given all possible patches in all images, which of them are both frequently occurring and geographically informative?
  – Sidewalks and cars occur frequently in Paris but are hardly discriminative.
  – The Eiffel Tower is very discriminative, but too rare to be useful.

Motivation
• Understand which visual elements are fundamental to our perception of a complex visual concept
• Help CG modelers generate “reference art” for a city
• Provide a stylistic narrative for a visual experience of a place

Mining geotagged images
• Model photographer-defined frequency maps of cities
• Model worldwide human travel priors
• Place recognition

Object discovery from geotagged imagery
• Unsupervised methods
• Supervised methods

Procedural modeling
• Generate 3D models of entire cities
• Parse images of facades

Data
• Google Street View imagery
  – Approximately 10,000 perspective images (936x537 pixels) are extracted for each city
  – 12 cities: Paris, London, Prague, Barcelona, Milan, New York, Boston, Philadelphia, San Francisco, São Paulo, Mexico City, and Tokyo

Data Organization
• Visual elements are represented by square image patches at various resolutions.
• The database is divided into two parts:
  – the positive set, containing images from the location whose visual elements we wish to discover (e.g. Paris);
  – the negative set, containing images from the rest of the world

Challenge
• Matching the occurrences of the rare interesting elements is like finding a few needles in a haystack
  – The overwhelming majority of the data is uninteresting, occurs in both the positive and negative sets, and should be filtered out.

Existing methods
• Clustering on image patches represented by SIFT descriptors tends to be dominated by low-level features
• k-means clustering of larger image patches (HOG) behaves poorly in very high dimensions

Existing methods
• Use the geographic information as part of the clustering, extracting elements that are both repeated and discriminative.
  – However, these methods either produce inhomogeneous clusters or focus too much on the most common visual features.
  – The reason is that such approaches include at least one step that partitions the entire feature space.

Approach
• Start with a large number of randomly sampled candidate patches, and then give each candidate a chance to see if it can converge to a cluster that is both frequent and discriminative.
  – Compute the nearest neighbors of each candidate, and reject candidates with too many neighbors in the negative set.
  – Gradually build clusters by applying iterative discriminative learning to each surviving candidate.

Approach
• Discriminative clustering
  – Alternates between clustering and training a discriminative classifier
  – Applies cross-validation to prevent overfitting
• Pipeline: 25,000 candidates → kNN selection → 1,000 candidates → SVM detector training → cross-validation

Image descriptor
• Patches are square, with scales ranging from 80-by-80 pixels to height-of-image size.
• Patches are represented with standard HOG (8x8x31 cells), plus an 8x8 color image in L*a*b color space (a and b channels only).

Initial Candidate Selection
• Randomly sample a subset of 25,000 high-contrast patches to serve as candidates for seeding the clusters.
• The initial geo-informativeness of each patch is estimated by finding its top 20 nearest-neighbor patches in the full dataset. Candidates with too many neighbors in the negative set are rejected.

Iterative clustering
• Train an SVM detector for each visual element, using the top k nearest neighbors from the positive set as positive examples, and all negative-set patches as negative examples.
• Iterate the SVM learning, using the top k detections from the previous round as positives
• Cross-validation
  – Divide the dataset into l equally sized subsets
  – Apply the detectors trained in the previous round to a new, unseen subset of the data to select the top k detections for retraining.
  – Three iterations are enough to reach convergence

Approach
Figure: Steps of this algorithm for two sample candidate patches in Paris. The first row: initial candidate and its NN matches. Rows 2-4: iterations of SVM learning (trained using patches on left). Red boxes indicate matches outside Paris. Rows show every 7th match for clarity. Notice how the number of not-Paris matches decreases with each iteration, except for the right cluster, which is eventually discarded.

Performance
• A soft-margin SVM with C fixed to 0.1 is used.
• The full mining computation is quite expensive; a single city requires approximately 1,800 CPU-hours.

Result
[Three slides of result images; figures not preserved.]

Trouble with US cities
• Some of the discovered geo-informative elements turned out to be different brands of cars, road tunnels, etc.

Evaluation
1. Do the discovered visual elements correspond to an expert opinion of what visually characterizes a particular city?
2. Are they indeed objectively geo-informative?
3. Do users find them subjectively geo-informative in a visual discrimination task?
4. Can the elements be potentially useful for some practical task?
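
Sketch: initial candidate selection
The nearest-neighbor filtering step from the Approach slides can be illustrated with a small sketch. It is not the authors' implementation. The function names, the squared Euclidean distance, and the `max_negative` threshold are assumptions made for illustration; the slides only say that candidates with "too many" negative-set neighbors among the top 20 are rejected.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two descriptor vectors (assumed metric)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def filter_candidates(candidates, patches, is_positive, k=20, max_negative=5):
    """Return indices of candidates whose k nearest patches are mostly positive.

    candidates   -- descriptor vectors of the sampled candidate patches
    patches      -- descriptor vectors of all patches in the full dataset
    is_positive  -- parallel list of bools, True if the patch is from the positive set
    k            -- number of nearest neighbors to inspect (20 in the slides)
    max_negative -- hypothetical rejection threshold, not given in the slides
    """
    kept = []
    for index, candidate in enumerate(candidates):
        # Rank every dataset patch by distance to this candidate, keep the k closest.
        nearest = sorted(
            range(len(patches)),
            key=lambda i: squared_distance(candidate, patches[i]),
        )[:k]
        negatives = sum(1 for i in nearest if not is_positive[i])
        if negatives <= max_negative:
            kept.append(index)
    return kept


if __name__ == "__main__":
    # Toy 2-D "descriptors": three positive patches, three negative, two mixed.
    patches = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (5, 5), (5, 6)]
    is_positive = [True, True, True, False, False, False, True, False]
    candidates = [(0.2, 0.2), (10.2, 10.2), (5, 5.4)]
    print(filter_candidates(candidates, patches, is_positive, k=3, max_negative=1))
```

A brute-force sort per candidate is only practical for toy data; at the scale in the slides (25,000 candidates against a full city dataset) an indexed or batched nearest-neighbor search would be needed.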
First question
• Consulted a respected volume on 19th century Paris architecture [Loyer 1988]

How geo-informative are the discovered visual elements?
• Ran the top 100 Paris element detectors over an unseen dataset which was 50% from Paris and 50% from elsewhere. The average accuracy of the top detectors was 83% (where chance is 50%).
• Repeated this for the top 100 Prague detectors, and found the average accuracy on an unseen dataset of Prague to be 92%.

How geo-informative are the discovered visual elements?
• Repeated the above experiment with people rather than computers. Reduced the dataset to 100 visual elements, 50 from Paris and 50 from Prague.
  – 50% of the elements were selected by the algorithm for Paris and Prague. The other 50% were randomly sampled patches of Paris and Prague.
  – 22 naive subjects were asked to label each patch as belonging to either Paris or Prague
  – Average classification performance for the algorithm-selected patches was 78.5% (std = 11.8), while for random patches it was 58.1% (std = 6.1)

Reference art
• Asked an artist to make a sketch from a photo of Paris and then sketch it again after showing her the top discovered visual elements for this image

Applications: Mapping Patterns of Visual Elements

Applications: Exploring Different Geospatial Scales

Applications: Visual Correspondences Across Cities

Applications: Geographically-informed Image Retrieval

Conclusion
• Argued that the “look and feel” of a city rests on a set of stylistic elements, the visual minutiae of daily urban life
• Automatically found a subset of such visual elements from a large dataset offered by Google Street View.

Video

Future work
• Capture larger structures, both urban and natural
• What makes an Apple product?
• Can we use discriminative clustering for other problems, such as co-segmentation?
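
Sketch: detector accuracy on a balanced held-out set
The slides report an "average accuracy" for the top detectors on an unseen set that is 50% target city and 50% elsewhere, without defining the measure. One plausible reading, assumed here, is the fraction of each detector's highest-scoring detections that fall in the target city, averaged over detectors. The function names and the `top_k` cutoff are hypothetical.

```python
def detector_accuracy(detections, top_k):
    """Fraction of a detector's top_k highest-scoring detections from the target city.

    detections -- list of (score, is_target_city) pairs from the held-out set
    top_k      -- hypothetical cutoff on how many detections to inspect
    """
    ranked = sorted(detections, key=lambda d: d[0], reverse=True)[:top_k]
    if not ranked:
        raise ValueError("detector produced no detections")
    hits = sum(1 for _, is_target in ranked if is_target)
    return hits / len(ranked)


def average_accuracy(per_detector_detections, top_k):
    """Mean of detector_accuracy over all detectors (assumed aggregation)."""
    accuracies = [detector_accuracy(d, top_k) for d in per_detector_detections]
    return sum(accuracies) / len(accuracies)


if __name__ == "__main__":
    # Two toy detectors; True marks a detection inside the target city.
    detector_a = [(0.9, True), (0.8, True), (0.7, False), (0.1, False)]
    detector_b = [(0.5, False), (0.4, True), (0.3, True)]
    print(average_accuracy([detector_a, detector_b], top_k=2))
```

On a balanced set a detector firing at random would score about 0.5 under this measure, which matches the chance level quoted in the slides.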