Image Miner: An Architecture to Support Deep Mining of Images

by Edwin Meng Zhang

Submitted to the Department of Electrical Engineering and Computer Science in partial fulfillment of the requirements for the degree of Master of Science in Computer Science and Engineering at the Massachusetts Institute of Technology, June 2015.

© Massachusetts Institute of Technology 2015. All rights reserved.

Author: Department of Electrical Engineering and Computer Science, May 22, 2015
Certified by: Kalyan Veeramachaneni, Research Scientist, Thesis Supervisor
Accepted by: Professor Albert Meyer, Chairman, Masters of Engineering Thesis Committee

Submitted to the Department of Electrical Engineering and Computer Science on May 22, 2015, in partial fulfillment of the requirements for the degree of Master of Science in Computer Science and Engineering.

Abstract

In this thesis, I designed a cloud-based system, called ImageMiner, to tune the parameters of the feature extraction process in a machine learning pipeline for images. Feature extraction is a key component of the machine learning pipeline, and tuning its parameters to extract the best features can have a significant effect on the accuracy achieved by the machine learning system. To enable scalable parameter tuning, I designed a master-slave architecture that runs on the Amazon cloud. To overcome the computational bottlenecks caused by large datasets, I used a data-parallel approach in which each worker runs independently on a subset of the data. Each worker uses a Gaussian Copula Process to tune parameters and determines the best set of parameters and model to use.

Thesis Supervisor: Kalyan Veeramachaneni
Title: Research Scientist

Acknowledgments

I would like to thank Kalyan Veeramachaneni for supporting and guiding me through my thesis project. His guidance was critical to the completion of this project. I would also like to thank everyone in the ALFA lab for helping me and providing a good working environment. I would like to thank my girlfriend, Riana Lo Bu, for her support and for keeping me grounded through all the highs and lows during this process. I would also like to thank all my friends who helped me through MIT and who made the experience fun. Finally, I would like to thank my family, and especially my parents, for getting me to where I am today and helping me through every step of the process.

Contents

1 Introduction
  1.1 What is ImageMiner?
2 The Geolocation Problem
  2.1 The Data Set
  2.2 Multiple Methodologies
    2.2.1 Classification
    2.2.2 Regression
    2.2.3 Hierarchical
  2.3 Past Approaches
  2.4 Our Approach
3 ImageMiner
  3.1 Goal
  3.2 Possible designs
    3.2.1 Data Parallel Design
    3.2.2 Parallel Iteration Design
  3.3 ImageMiner Architecture
    3.3.1 Master
    3.3.2 Slave
    3.3.3 Other Systems
  3.4 Training Module
    3.4.1 User Input
    3.4.2 Master
    3.4.3 Slave
  3.5 Testing Module
    3.5.1 User Input
    3.5.2 Master
    3.5.3 Slave
  3.6 Parameter Tuning Algorithm
  3.7 Summary of ImageMiner
4 Experiments and Results
5 Experience of designing the system
  5.1 Data pre-processing and preparation
  5.2 ImageMiner Architecture
  5.3 System issues
    5.3.1 Installing libraries
    5.3.2 Instance limits
  5.4 Integrating Python Libraries
  5.5 Feature Extraction
  5.6 SVM Classifier
  5.7 Processing results
  5.8 Miscellaneous
  5.9 Lessons Learned
6 Conclusion
  6.1 Future Work
    6.1.1 User Flexibility
    6.1.2 Speed
  6.2 Future Goals
A ImageMiner Interface
  A.1 Training
  A.2 Testing
B Results
Chapter 1

Introduction

With the adoption of smart phones, the number of pictures available on the web has increased tremendously. For example, on Flickr in 2005, fewer than 10 million images were being uploaded per month.
In 2014, that number was over 60 million images per month [12]. As the number of pictures on the internet grows, there is ever-increasing interest in performing machine learning on these images. Researchers want to extract information from these pictures and do something useful with it. Common examples are performing an image search for similar pictures, identifying where an image was taken, or doing object recognition or scene understanding. Algorithms for these tasks fall under the area of machine learning called computer vision.

Most approaches in computer vision follow the same basic steps:

Pre-processing step: After acquiring the images, pre-process each image to get it into the desired format and extract useful metadata from it.

Tag and label retrieval: Tags, labels, and latitude and longitude information are extracted from the images for future use, when available.

Feature Extraction: Extract visual features from the images via one of the many feature extraction libraries available, such as SIFT or GIST.

Training step: Use the extracted features to train a classification or regression model on the training data to predict the label of the data.

Testing: Test the model produced from the training step on a set of test images.

An example of a computer vision problem is trying to predict the tags for an image, given the image, relevant metadata, and any related images.

Problems with the current workflow for machine learning on images

Each step is done separately: This is inefficient because researchers have to manually build and run each step. They need to store the results from one step so that the next step can use them, which takes space and time. By combining all the steps in the computer vision pipeline into an end-to-end framework, we can save researchers time and space by allowing them to provide inputs to one system and receive their desired outputs without having to handle all the data in between.

Large Datasets: Image datasets are becoming larger and larger as more images become available online. Yahoo!, for example, has provided researchers access to a dataset of 99.3 million images and 0.7 million videos, which takes about 12 GB just for the metadata [15]. If the actual images and videos were included, the dataset would be even larger. We want to build a parallel computing architecture to process and perform machine learning on larger datasets.

Storing features takes a huge amount of space: For the dataset Yahoo! provided, they planned to pre-compute and store a number of computer vision and audio features for each image and video. They estimated that storing all these features would take 50 TB [15]! If, instead, we design a system that can extract features on demand and enable analyses, we no longer need to store features in a database and perform queries and database lookups, which can be quite expensive. This would greatly decrease the storage overhead required for many problems in computer vision.

Parameter Tuning: There are many different ways to process the data, many different machine learning algorithms, and many parameters that can be tuned when extracting features. Not extracting the best possible features can limit how well the machine learning algorithm performs.
Current research mainly focuses on tuning the hyperparameters/parameters of the classifier and on feature selection, rather than on optimizing the whole pipeline. All of the other methods we looked at first extract features and then focus on figuring out the best regression or classification model to build. We want to optimize not only the prediction model, but also the feature extraction and image pre-processing steps. Many computer vision feature extractors come with a set of parameters that the user can tune while extracting these features. For example, with the SIFT feature extractor, one can tune parameters like the edge threshold, the number of octaves, and the peak threshold.

1.1 What is ImageMiner?

The goal of ImageMiner is to build an architecture to support deep mining of images.

Definition 1. Deep mining is defined as a parameter tuning system that attempts to tune the entire machine learning pipeline, including the parameters involved in feature extraction.

As we mentioned above, typically only the parameters for the classifier are tuned. By tuning the parameters of the feature extraction process, we can extract better features, which in turn can lead to better classifiers and better results. To the best of our knowledge, ImageMiner is an attempt to build the first system of its kind to tune parameters and extract features on the cloud without storing any features. However, feature extraction is an expensive process. Properly tuning feature extraction parameters requires multiple iterations with different sets of parameter values, which means that we perform feature extraction several times over the same data to tune one set of parameters. This is a major challenge that ImageMiner aims to address.

To implement all this, we built an architecture in Java and Python that runs on Amazon Web Services (AWS), using a Master-Slave framework. We parallelized the process by creating multiple Amazon EC2 instances, one per slave, so that multiple slaves may be running ImageMiner at one time. ImageMiner is split into two modules: the Training module, which tunes parameters and produces a model, and the Testing module, which tests the models produced by the Training module. We downloaded the Flickr Creative Commons dataset provided by Yahoo! to get a database of images to train and test ImageMiner on [15].

The rest of the thesis is laid out as follows. Chapter 2 discusses the geolocation problem. Chapter 3 describes the ImageMiner architecture. Chapter 4 goes over the experiments we performed on ImageMiner and the results from those experiments. Chapter 5 discusses the challenges that were faced while building ImageMiner. Chapter 6 talks about future work left for ImageMiner.

Chapter 2

The Geolocation Problem

Geolocation is the problem of identifying the location of something, in our case an image, based on the information the image provides. Geolocation is an important problem because people like to see a visual picture of whatever they are searching for or studying. When people see a picture of a location, they immediately want to know when, and more importantly, where the picture was taken. The location of a picture can provide a lot of context for an image and change its meaning. Geolocation affects our personal lives as well. If a friend on Facebook posts a picture, we want to know where that picture was taken.
If we see a picture of somewhere beautiful online, we want to know where that picture was taken so that we may perhaps visit that place.

2.1 The Data Set

The dataset that we use for this geolocation problem comes from the Flickr Creative Commons dataset provided by Yahoo! for research purposes [15]. This dataset consists of 99.3 million images, 49 million of which are geo-tagged, and 0.7 million videos. The metadata for each image or video contains the title of the image or video, the user id of the uploader, the URL for the image, tags for the image, latitude and longitude for geo-tagged images, and several other useful facts about the image. We had to submit a request to the Yahoo! Webscope program for approval to obtain use of the dataset, which is hosted on Amazon S3.

Table 2.1: Data for one image metadata file
    Number of total images: 9975030
    Number of images with geolocation: 1541252
    Number of images with tags: 6014968
    Number of images with user tags: 5969236

For our project, we narrowed the dataset down from 99.3 million images to about 1.5 million images. The metadata we received from Yahoo! was simply a list of 10 text files with one line of metadata for each image. To narrow the dataset down to a manageable size, we selected one of the 10 files and took all the images with a geolocation, which was a little more than a tenth of the images, to give us our 1.54 million image dataset, as shown in Table 2.1.

Figure 2-1 shows a heat map of the locations of the 1.54 million images in our dataset. As expected, most of the images are located on the coasts of the continental United States, as well as in central and western Europe, while almost no images are present close to the poles.

Figure 2-1: A heat map showing the distribution of all the images in our dataset. The heat map was generated using heatmap.py [5] and OSM Viz [13].

2.2 Multiple Methodologies

When doing any type of machine learning, there are many different ways to define the problem. The three main methods for this problem are classification, regression, and a hierarchical approach.

2.2.1 Classification

The first step in a classification problem is to assign a label to each image. There are two common ways to do this. The first is by clustering the images. Clustering images can be done in a number of ways, such as with k-means. Once the images are clustered, each image is given a label corresponding to the cluster it is in. Features are then extracted from each image and a classifier is trained using the extracted features and the image labels. Future images are classified by extracting their features and running them through the trained classifier. A variation of this method is to set a threshold after the images are clustered; only images or clusters that pass the threshold are used for the classifier.

The second method is to assign each image to a city, depending on what cities the image is close to. For example, one image can be assigned to Paris, while another is assigned to New York. When trying to classify an image, the classifier produces a list of possible cities and the image is assigned to the most likely city. To determine the performance of the classifier, we count the percentage of images that have been assigned the correct label.

2.2.2 Regression

Regression for geo-tagging problems is done by extracting features from an image and, after performing some computation on those features, building a model between the image features and the latitude and longitude of the image.
To perform regression, we first extract features from a number of images. Then, given each image's latitude and longitude, we train a regression model that links an image's features to its geo-location. Future images are geotagged by extracting their features and running them through the model to give each image a latitude and longitude. To evaluate the results, the estimated latitude and longitude are compared to the actual latitude and longitude to see how far off the guess actually was.

2.2.3 Hierarchical

A hierarchical approach is done by dividing the world map into sections and iteratively dividing those sections into smaller sections. After extracting features, the images are iteratively assigned to a section and, within that section, assigned to another section, until they have been assigned to a section on the lowest level. An estimated location is considered correct if it is assigned to the same section as the actual location.

2.3 Past Approaches

Several efforts have already been made to geo-tag images and videos based on the features and metadata of the images. Many of these have come as a result of the MediaEval Placing Task, which challenges researchers to accurately locate where an image or video was taken based on the features of the image and the metadata for that image [1]. There were several different approaches to trying to accurately place these multimedia items; Table 2.2 summarizes them.

Table 2.2: Information about previous approaches to geotagging
    Study | Features Used | Methodology | Algorithm
    [9]  | Tags, Color histogram, Texton histogram | Hierarchical | Border detection, Iteration
    [4]  | Tags, FCTH, CEDD, Tamura, Gist | Regression | Graphical Models, Conditional dependency, Gaussian Mixture Model
    [10] | Tags, User Profile, Color histogram, FCTH, CEDD, Tamura, Gabor, Edge histogram | Classification | Prior Distribution
    [16] | Tags, User Profile, SIFT | Hierarchical | IR Frequency, Frequency matching, Filtering, Prior Distribution
    [7]  | Color histogram, Texton histogram, Tiny images, Line features, Gist, Geometric context | Classification | k-Nearest Neighbor, Probability Map
    [3]  | Tags, Color histogram, Tiny images, Gist | Regression | Canonical Correlation Analysis, Logistic Canonical Correlation Regression, Mean shift algorithm
    [14] | Color histogram, FCTH, CEDD, Tamura, Gabor, Edge histogram, SIFT | Classification | Bag-of-scenes, Visual dictionary

One group extracted a variety of different features, such as Gist and color histograms, and then performed a k-Nearest Neighbors search to form a distribution of likely locations across the world, also known as a probability map. The location with the highest probability was taken as the likely location of the image [7]. However, this approach had less than a 20% accuracy within 200 km.

Another group had a different approach to feature extraction. Their approach centered on extracting features, such as SIFT and color histograms, from an image to create a feature vector and storing that feature vector in a dictionary of scenes. When trying to geo-tag an image, they extract the feature vectors from the image and compare them to those already in the dictionary to determine the most likely location [14]. This approach, although slightly better than [7], still only had roughly a 25% success rate within 200 km.

On the other end of the spectrum, another group only used feature extraction as a last resort [16]. This group would first search for any tags for the image and use frequency matching to place the most likely location of the image, given the set of tags.
If no tags were available, then the user profile of the person who uploaded the image was used. They would use information such as the user's upload history, the user's hometown, or the user's social network information to guess the location of the image. If nothing useful could be extracted from the user's profile, the group would extract features from the image using SIFT, and a nearest neighbor search would then be performed to determine the location of the image. They performed better than [7], but still only had roughly a 50% accuracy within 200 km. Although this seems significantly better than the previous two groups, most of this boost comes from the use of tags, and the challenge of accurately predicting the location of images from image features alone still remains.

Several other groups combined feature extraction with metadata from the images, such as user information and tags, to try to determine where the images were taken [9, 4, 10]. Out of these, [9] performed the best, with over a 98% accuracy within 200 km. They divided the world map along national borders to narrow down the possible areas where the image could have been taken, combined their feature extraction and image metadata with a probabilistic model, and then used a centroid-based candidate fusion to finally estimate where the image was taken [9].

One thing all the groups had in common was that when they attempted to estimate the location where the multimedia item was taken, their algorithm would return a latitude and longitude. To test the performance and accuracy of their algorithm, they would measure how far the predicted latitude and longitude was from the actual latitude and longitude. Most groups would then determine the accuracy of their algorithm based on how far their predictions were from the actual locations within a variety of distances, such as 100 km and 200 km.

2.4 Our Approach

Our approach to the problem was to use the classification method. To cluster the images, we implemented a very simple clustering algorithm that initially puts all the images in a single cluster and then iteratively divides each cluster. Only clusters larger than 200 km were divided, while those smaller than 200 km were left as is. We defined the size of a cluster to be the largest distance between two images in the cluster. To divide a cluster, we picked two images that were at least 500 km apart, or at least 200 km apart if no such images could be found. These images became the first images in our two newer and smaller clusters. We then went through every image in the original cluster and assigned it to one of the newer clusters, depending on which cluster that image was closer to. We repeat this process until all clusters are smaller than 200 km. (A sketch of this procedure appears at the end of this section.)

We ended up with 29789 clusters for our 1.5 million images. The distribution of the images in the clusters is shown in Figure 2-2. A cutoff was then set to only select clusters with 100 or more images in them, to reduce the dataset and to eliminate smaller clusters that did not have enough images to be useful for training our classifier.

Figure 2-2: Distribution of images for clusters with greater than 100 images in them.

This gave us our final dataset of 825659 images with 2766 clusters. We split off two-thirds of the images to use as training data, while designating the remaining one-third as test data.
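To make the cluster-splitting rule from Section 2.4 concrete, the following is a minimal sketch of the procedure. It assumes each image is represented by a (latitude, longitude) pair and uses the haversine great-circle distance; the function and variable names are illustrative and are not taken from the thesis code.

    from math import radians, sin, cos, asin, sqrt

    def haversine_km(a, b):
        # Great-circle distance in km between two (lat, lon) pairs.
        lat1, lon1, lat2, lon2 = map(radians, (a[0], a[1], b[0], b[1]))
        h = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
        return 2 * 6371 * asin(sqrt(h))

    def find_seeds(cluster, preferred=500, fallback=200):
        # Return the first pair of images at least `preferred` km apart, falling back
        # to `fallback` km; return None if no pair exceeds 200 km (cluster is final).
        for threshold in (preferred, fallback):
            for i, a in enumerate(cluster):
                for b in cluster[i + 1:]:
                    if haversine_km(a, b) >= threshold:
                        return a, b
        return None

    def split_clusters(images, max_diameter=200):
        # images: list of (lat, lon) pairs; returns clusters no wider than max_diameter km.
        pending, done = [images], []
        while pending:
            cluster = pending.pop()
            seeds = find_seeds(cluster)
            if seeds is None:                 # every pair is within 200 km
                done.append(cluster)
                continue
            left, right = [seeds[0]], [seeds[1]]
            for img in cluster:
                if img is seeds[0] or img is seeds[1]:
                    continue
                if haversine_km(img, seeds[0]) <= haversine_km(img, seeds[1]):
                    left.append(img)
                else:
                    right.append(img)
            pending.extend([left, right])
        return done

On the full 1.5-million-image set, the pairwise search inside find_seeds is the expensive part, which is why Section 5.1 describes stopping at the first sufficiently distant pair rather than searching for the two farthest images.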
Chapter 3

ImageMiner

ImageMiner is an end-to-end image mining system designed to automatically tune feature extraction parameters. It is split into two modules: the Training module, which trains the classifier and tunes the parameters, and the Testing module, which takes in and tests all the classifiers generated by the Training module. In this chapter, we discuss the design decisions, as well as the architecture behind ImageMiner.

ImageMiner uses a Master-Slave framework to train and test the models. Each module uses Amazon Web Services (AWS) to get image metadata, store results, and run each slave. In addition, we use a variety of other systems to train our classifier models and run feature extraction. This chapter discusses all of these systems in greater depth.

3.1 Goal

The goal of ImageMiner is to find the best set of parameters for feature extraction when doing machine learning on images and geo-tagging them. There are a few goals ImageMiner strives to achieve:

Usability - It should be easy to understand and use.

Flexibility - Users should be able to customize inputs, such as what features to extract, what parameters to tune, and what images to train and test on.

Scalability - Tuning on one hundred images should be just as easy as tuning on twenty thousand images.

Fault Tolerance - The system should be able to deal with errors seamlessly in the background while processing images and extracting features. Ideally, the user will not even know that an error occurred.

Speed - Users will rely on ImageMiner to determine the best set of parameters. It should run fast enough that it does not become a bottleneck in whatever situation it is being used.

3.2 Possible designs

To find the best set of parameters for feature extraction, while also meeting the goals mentioned above, we had to think carefully about the design of ImageMiner. To determine the best set of parameters, we need to run multiple iterations of the entire machine learning pipeline, using a parameter tuning algorithm to generate the next set of parameters to test. Several steps are repeated during every iteration:

1. Extract visual features from each image
2. Process the data for classification
3. Do cross-fold classification on 𝑙 folds
4. Report performance

The performance of a set of parameters is judged using the cross-validation accuracy of the entire process. For a small number of images, we can easily run the entire process on one machine. However, as the number of images and iterations increases, it becomes infeasible to run everything on one machine. There are two main designs we considered for ImageMiner: one, similar to PhysioMiner, that divides the problem into separate tasks where each worker runs on a subset of the data [6], and one with a centralized database where each slave runs one iteration.

3.2.1 Data Parallel Design

The idea of this approach is to create a master worker, which divides the problem into several tasks. Each task is to run all iterations of ImageMiner on a subset of the images. The master worker creates slave workers to perform each task. Each slave picks a fixed subset of images to train and test on. The slaves individually tune their own parameters before reporting back to the database with their best model and corresponding set of parameters. This means that each slave runs independently of the other slaves. Each slave produces one classifier, so ImageMiner ends up with as many classifiers as there are slaves.
A variation of this design is Noisy Parameter Tuning. In this design, instead of each slave having a fixed subset of images to train and test on, it randomly selects a subset of images for each iteration and reports the performance on that random subset.

3.2.2 Parallel Iteration Design

The design of this approach was to create a centralized database that stores the best set of parameters and updates those parameters every time a worker runs. Each worker runs one iteration of ImageMiner by getting the best parameters from the centralized database and tuning from those parameters, before reporting its results back to the centralized database. The next worker grabs the results the previous worker reported and repeats the process. Thus, each worker is dependent on the results of the previous workers. This produces one classifier in total, which is deemed the best classifier for feature extraction.

We ended up going with the data parallel design for scalability. This design is the least expensive since it requires the fewest calls to the database. In addition, each worker runs independently of the other workers, so if one worker goes down or produces bad results, the other workers can compensate for that.

3.3 ImageMiner Architecture

ImageMiner is designed with a master-slave framework that uses Amazon AWS to communicate between the master and the slaves. The workflow is shown in Figure 3-1.

Figure 3-1: The basic system architecture for ImageMiner, showing both modules. There are three main components: the database, the master-slave framework, and the file storage. Please refer to the numbered steps in Section 3.3 for more details.

There are 11 steps to ImageMiner:

1. The user passes in the location of the images, information about the features and parameters, as well as the S3 bucket and DynamoDB table to use.
2. The Master starts running the Training module by creating ImageMiner messages based on the user inputs and sending those messages to Amazon Simple Queue Service (SQS).
3. Each EC2 Worker grabs an ImageMiner message from SQS to run.
4. Using the DynamoDB table information provided by the user, each worker grabs a set of image metadata from the DynamoDB database to train on.
5. After running the Training module, each EC2 worker writes the results of the module to another table in DynamoDB.
6. Each worker also writes the best model and the corresponding set of parameters to Amazon Simple Storage Service (S3).
7. Once the Training module finishes running, the user prompts the Master to start running the Testing module. The Master starts running the Testing module by creating Testing messages that are then passed to Amazon SQS.
8. Each EC2 Worker grabs a Testing message off the queue.
9. From the DynamoDB table information provided by the user, each worker grabs a set of images from the DynamoDB database to test on.
10. From the S3 bucket information provided by the user, each worker also grabs the model and parameter information stored in S3 by the Training module.
11. After running the Testing module, each worker then writes the results to S3.
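To make steps 2 and 3 of this workflow concrete, the following is a minimal sketch of the message flow between the master and a slave. The actual framework is written in Java with the AWS SDK; this Python/boto3 version is illustrative only, and the queue name, message fields, and example values are assumptions rather than the thesis message format.

    import json
    import boto3

    sqs = boto3.client("sqs")
    queue_url = sqs.create_queue(QueueName="imageminer-training")["QueueUrl"]

    # Master (step 2): one message per slave task, built from the user's inputs.
    task = {
        "s3_bucket": "my-imageminer-bucket",      # where models/parameters are written
        "images_table": "images",                 # DynamoDB table with image metadata
        "results_table": "feature_parameters",    # DynamoDB table for results
        "features": [{"name": "sift", "executable": "sift"}],
        "parameters": [{"name": "peak_thresh", "type": "float", "default": 0.0, "min": 0.0, "max": 10.0}],
        "num_images": 200,
        "num_iterations": 10,
    }
    sqs.send_message(QueueUrl=queue_url, MessageBody=json.dumps(task))

    # Slave (step 3): pop a task, run the Training module, then delete the message.
    response = sqs.receive_message(QueueUrl=queue_url, MaxNumberOfMessages=1)
    for message in response.get("Messages", []):
        task = json.loads(message["Body"])
        # ... download metadata, extract features, tune parameters, cross-validate ...
        sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=message["ReceiptHandle"])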
3.3.1 Master

The user can give the master a variety of inputs, but they must provide an S3 bucket, a table name, a file with a list of features and feature locations, a file with a list of parameters for each feature and the bounds and default value for each parameter, the number of slave workers to create, and the IAM profile to use on Amazon AWS. The master processes the input from the user and then uses the information to create the slave workers and populate the message queue.

The master is responsible for creating tasks. Each ImageMiner job is split up into messages that are then put into the message queue for slave workers to read. The master can be run on any machine (personal laptop, EC2 instance, etc.). Dividing the tasks up usually takes less than a minute, so the master runs quite quickly. The jobs for the Training and Testing modules differ quite a bit. We will go into more detail on what each job looks like later in the chapter.

3.3.2 Slave

The slaves are responsible for completing the tasks in the message queue. When a slave starts running, it queries the message queue for a message. Once it receives a message, it parses the message to extract the task, as well as any other information required to complete the task. Once the slave finishes a task, it writes the result to the database, deletes the message from the queue, and looks for another message to grab. If there are no more messages on the queue, the worker shuts down. Because the tasks for the Training and Testing modules are completely different, the slaves for each module function completely differently. We will detail how each slave functions later in the chapter.

ImageMiner runs each slave on an Amazon EC2 instance. Each instance is its own virtual machine, with its own memory and storage, and runs independently of all the other instances. These instances are automatically created by the master, so there is no need for the user to create them. The number of instances created by the master is specified by the user.

3.3.3 Other Systems

ImageMiner uses a number of different systems to run and support the Master-Slave framework.

AWS

Amazon Web Services (AWS) is the backbone of the Master-Slave framework. AWS services are used to pass messages between the master and slaves, to store results and files, and to run the EC2 instances for the slaves.

DynamoDB

DynamoDB is used to store the image metadata that ImageMiner uses, as well as the results from ImageMiner. It is a NoSQL database designed to be fast and scalable, and it supports hash keys and range keys for each table. A very simple API is provided for writing to and reading from the database [6].

Images Table

The images table stores the metadata for all the images that can be used for training and testing the classifier and tuning the parameters. Each row represents an image, which has a unique Image ID. Group type represents whether the image will be used for training or testing. File path refers to the location of the file on the Internet. The setup of the table is shown in Table 3.1.

Feature Parameters Table

The feature parameters table stores the results of the classifiers generated by the Training module. Each row represents a result, and each result has a unique ID. Each row contains the cross-validation accuracy and standard deviation, as well as the testing accuracy and the parameters used to attain these results. The setup of the table is shown in Table 3.2.

EC2

Elastic Compute Cloud (EC2) is a scalable cloud service on AWS.
It provides a simple API to start and stop virtual machines, also known as instances. These instances are used to run the slave workers. EC2 gives the user full control over each instance, and there are many different instance types that can be created depending on the amount of RAM, CPU, and disk space needed [6]. The default instance type for ImageMiner is r3.large, but the user can provide a different instance type depending on their needs. We decided to use r3.large instances because they were relatively cheap and could easily support large amounts of memory. Each EC2 instance is created from an "image", which is essentially the template for every instance created from it and provides the necessary information to launch and run the instance. Instances created from the same image start out completely identical.

S3

Simple Storage Service (S3) is a scalable storage mechanism on AWS that allows the user to store files. It is used to store the best model and the corresponding set of parameters from each slave.

SQS

Simple Queue Service (SQS) is an AWS messaging system. It contains a message queue that is used by ImageMiner to pass messages from the master to the slaves. The master adds messages onto the queue, and each slave pops messages off the queue to perform the tasks. It is designed to handle concurrency, as multiple workers can access the queue at the same time [6].

Scikit-Learn

Scikit-Learn is a machine learning library in Python. It provides a variety of machine learning tools, such as classification, regression, and clustering. For our purposes, we used Scikit-Learn to perform k-means clustering on our extracted SIFT descriptors to provide a uniform number of features for each image.

SVMLight

SVMLight is a Support Vector Machine (SVM) implementation written in C [8]. It provides a variety of different SVMs for different use cases, but ImageMiner uses SVMLight Multiclass to classify our images. This is necessary because each cluster label is a class, so we have 2766 different classes into which an image can be classified.

VLFeat

VLFeat is an open source computer vision library [17]. VLFeat provides many different tools for image feature extraction, as well as many other algorithms relating to image processing. We used VLFeat's implementation of SIFT as our feature extractor, and ImageMiner tunes the parameters of SIFT to determine the best set of parameters.

SIFT

SIFT stands for Scale-Invariant Feature Transform. The idea behind SIFT is to find the key points of an image and compute their descriptors. These points should be invariant to any type of scaling, orientation change, distortion, or illumination change. Once the key points for each image are found and the descriptors are calculated, k-means clustering is run on all descriptors across all images to cluster the descriptors. To get the features of an image, a histogram of the distribution of its descriptors across the clusters is calculated. To compare two images, the histograms of their SIFT descriptors are compared [11].
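The following is a minimal sketch of this "bag of visual words" step, using scikit-learn's KMeans and NumPy. It assumes the descriptors have already been extracted (for example by the VLFeat SIFT binary) and loaded as one array per image; the 100-cluster vocabulary matches the number used by the slaves (Section 3.4.3), but the function names are illustrative rather than taken from the thesis code.

    import numpy as np
    from sklearn.cluster import KMeans

    def build_histograms(descriptors_per_image, n_clusters=100):
        # descriptors_per_image: list of arrays, one per image, each of shape (n_i, 128)
        all_descriptors = np.vstack(descriptors_per_image)

        # Build the visual vocabulary by clustering every descriptor from every image.
        n_clusters = min(n_clusters, len(all_descriptors))   # guard against tiny inputs
        kmeans = KMeans(n_clusters=n_clusters, random_state=0).fit(all_descriptors)

        # Each image becomes a histogram of its descriptors over the shared clusters.
        features = []
        for descriptors in descriptors_per_image:
            assignments = kmeans.predict(descriptors)
            features.append(np.bincount(assignments, minlength=n_clusters))
        return np.array(features)

The resulting fixed-length vectors are what the slaves pass on to the classifier. The guard on n_clusters mirrors the fix described in Section 5.5, where images with very few descriptors forced the cluster count to be lowered.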
3.4 Training Module

The Training module is the module that performs the parameter tuning and trains the classifier. It tunes the parameters for the SIFT extractor and trains a classifier using those parameters. It reports the best classifier and the best set of parameters.

3.4.1 User Input

To run the Training module, users need to specify the following items:

1. S3 bucket - The bucket to store the models and parameters from ImageMiner
2. Features Text File - A text file containing each feature's name, along with the location of the feature executable and the format of the output. If the output is written to an output stream, then no format is provided.
3. Parameter Text File - A text file containing the parameters for each feature, along with the default value for each parameter, the bounds of the parameter, and whether the parameter is an integer or a real value.
4. Table name - The name of the DynamoDB table to write the results out to, as well as the name of the DynamoDB table to grab the image metadata from.

The user can also choose to specify a variety of other inputs, such as the number of training and test images to use or the number of cross-validations to do. If no input is provided, a default value is used.

3.4.2 Master

The master receives several text files from the user. The master processes these text files and puts the information into an ImageMiner message that goes onto the message queue. The slave can then parse the information about the features and parameters to generate parameter values and run the feature extraction. The master also receives input about the S3 bucket for ImageMiner to use, as well as the table name and information about how to create the EC2 instances. The master uses some of this information to create the EC2 instances, and passes the rest to the slaves via the ImageMiner message.

3.4.3 Slave

Once a slave reads an ImageMiner message from the queue, it performs the following steps, as shown in Figure 3-2:

1. Downloads image metadata from DynamoDB
2. Processes the metadata, downloads the image from the metadata, and converts the image to PGM format
3. Generates parameters
   (a) For iteration 1, use the default parameter values.
   (b) For iterations 2-5, randomly generate parameter values.
   (c) For the rest of the iterations, run a parameter tuning algorithm based on previous results.
4. Runs SIFT on the PGM file using the generated parameters
5. Processes the SIFT descriptors to make suitable features for training, as sketched at the end of Section 3.3.3
   (a) Cluster the SIFT descriptors into 100 different clusters using k-means.
   (b) For each image, create a histogram of the distribution of its descriptors over the 100 clusters.
   (c) The counts of descriptors in each cluster are now the features for each image.
6. Performs cross-validation (see the sketch following this list)
   (a) Divide the data into 𝑙 folds.
   (b) Choose 𝑙-1 folds to train the classifier on.
   (c) Test on the remaining fold.
   (d) Repeat steps a-c until all folds have been tested on.
7. Repeats steps 3-6 for a specified number of iterations
8. Determines the best model out of all the models generated
9. Stores the performance of the model in DynamoDB, and the model and the set of parameters that generated it in S3
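A minimal sketch of the cross-validation scoring in step 6 follows, written against scikit-learn for brevity. The slaves actually shell out to SVMLight Multiclass; the LinearSVC stand-in and the function name are assumptions used only for illustration.

    import numpy as np
    from sklearn.model_selection import KFold
    from sklearn.svm import LinearSVC

    def cross_validation_accuracy(features, labels, n_folds=10):
        # features: (n_images, 100) histogram matrix; labels: cluster id per image.
        accuracies = []
        for train_idx, test_idx in KFold(n_splits=n_folds, shuffle=True).split(features):
            classifier = LinearSVC().fit(features[train_idx], labels[train_idx])
            accuracies.append(classifier.score(features[test_idx], labels[test_idx]))
        # The mean and standard deviation are what a slave reports for one iteration.
        return np.mean(accuracies), np.std(accuracies)

The mean and standard deviation returned here correspond to the cross-validation accuracy and standard deviation columns stored in the feature parameters table (Table 3.2).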
3.5 Testing Module

The Testing module grabs all the classifiers and parameters generated by the Training module and tests all of them to determine the overall performance of the ImageMiner system. It writes the results out to DynamoDB.

3.5.1 User Input

To run the Testing module, users need to specify the following items:

1. S3 bucket - The bucket needed to access the models and parameters and to write the predictions out to
2. Table name - The name of the DynamoDB table to grab the image metadata from

The user can also choose to provide the number of images to test on. If no number is provided, a default value is used.

3.5.2 Master

The master is given the S3 bucket that contains the models and parameters generated by the Training module, as well as the table name to write results out to. The master passes these specifications to the slaves via a Testing message that goes onto the message queue.

3.5.3 Slave

Once a slave receives a Testing message, it performs the following steps, as shown in Figure 3-3:

1. Downloads image metadata from DynamoDB
2. Gets the set of best parameters and models from Amazon S3
3. Processes the image metadata, downloads the image from the metadata, and converts the image to PGM format
4. For each model/parameter pair:
   (a) Run SIFT on the PGM file using the given parameters.
   (b) Process the SIFT descriptors to make suitable features for testing the images:
       i. Cluster the SIFT descriptors into 100 different clusters using k-means.
       ii. For each image, create a histogram of the distribution of its descriptors over the 100 clusters.
       iii. Each histogram is now the feature vector for that image.
   (c) Put the results of the k-means clustering through the given classifier model.
   (d) Store the prediction from the classifier.
5. For each image, get the predictions from all the models, write them to a file, and put that file on S3.

3.6 Parameter Tuning Algorithm

To tune our parameters, we used a parameter tuning model based on a Gaussian Copula Process (GCP). The model, when given a list of previously used parameter sets, as well as the corresponding results, generates the next set of parameters to test. The Gaussian Copula Process works like a Gaussian process, but the marginal distributions and mappings are modified to deal with the instability of the Gaussian process and offer greater flexibility.

To generate the next set of parameters to use for the machine learning pipeline, ImageMiner passed a list of tested parameter sets and the performance obtained with those parameters to the parameter tuning model, which then uses the upper confidence bound criterion [2] as an acquisition function. The next set of parameters was chosen by maximizing the acquisition function on the Gaussian Copula distribution. For ImageMiner, the first iteration always used the default parameters supplied, while iterations two through five used randomly generated parameters. After the fifth iteration, the GCP parameter tuning algorithm was run to determine the next set of parameters to use.
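To illustrate the tuning loop, the following is a minimal sketch of upper-confidence-bound parameter selection. It substitutes an ordinary Gaussian process (scikit-learn's GaussianProcessRegressor) for the Gaussian Copula Process actually used by ImageMiner, samples candidates at random instead of running a formal optimizer, and uses illustrative names throughout; it is a sketch of the idea, not the thesis implementation.

    import numpy as np
    from sklearn.gaussian_process import GaussianProcessRegressor

    def propose_parameters(tested_params, accuracies, bounds, kappa=2.0, n_candidates=1000):
        # tested_params: (n_tried, n_params) array of parameter sets already evaluated.
        # accuracies: cross-validation accuracy obtained with each tested set.
        # bounds: list of (low, high) pairs, one per parameter.
        gp = GaussianProcessRegressor().fit(np.asarray(tested_params), np.asarray(accuracies))

        # Random candidate parameter sets drawn inside the user-supplied bounds.
        lows = np.array([b[0] for b in bounds])
        highs = np.array([b[1] for b in bounds])
        candidates = np.random.uniform(lows, highs, size=(n_candidates, len(bounds)))

        # Upper confidence bound: prefer candidates predicted to do well (mean)
        # or that the model is still uncertain about (std).
        mean, std = gp.predict(candidates, return_std=True)
        return candidates[np.argmax(mean + kappa * std)]

In the real system this proposal step is only invoked from iteration six onward, after the default and random iterations have seeded the model with initial results.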
3.7 Summary of ImageMiner

ImageMiner acts as a black box: users simply input a few specifications, and ImageMiner runs and tries to determine the best set of parameters. There are a number of goals that ImageMiner strives to achieve in order to be the most helpful to users.

First, ImageMiner needs to be usable. To accomplish this, ImageMiner provides a simple command line interface to run the system, as well as a command line help menu that describes all the possible arguments that ImageMiner takes.

Second, ImageMiner needs to be flexible. ImageMiner allows each user to provide a simple text file detailing what feature extraction scripts should be run and another text file describing the parameters that should be tuned. In addition, users can change various aspects of the ImageMiner system by supplying a few extra command line arguments.

Third, ImageMiner needs to be scalable. To make sure this goal was achieved, we made sure that each of the external systems integrated into ImageMiner was also scalable. Additionally, increasing the number of workers easily allows the user to scale the number of images used to tune and test the classifier.

Fourth, ImageMiner needs to be fault tolerant. While building ImageMiner, we included a variety of error-handling methods to deal with errors in every part of the system, from converting each image to PGM, to parameter generation, to training the classifier.

Lastly, ImageMiner needs to be fast. This goal is yet to be achieved, especially as the number of images increases, but we are hopeful that it can be accomplished.

Table 3.1: The fields of the images table. Each row represents an image with a unique image id. The group type of an image specifies whether the image will be used for training or testing. The file path is the location of the image on the Internet. The latitude and longitude specify the location of the image. The tags are the tags that were given to the image, either by the user or from machine tags. The user id is the id of the user who uploaded the picture.

    Image ID | Group type | Cluster number | File path | Latitude | Longitude | Tags | User ID
    220145245 | Training | 1626 | image flickr URL | 33.050786 | -117.29169 | ds, game | 48018609@N00
    245225901 | Testing | 981 | image flickr URL | 52.295462 | 13.25715 | digimax, himmel, ludwigsfelde, samsung, sun | 30794983@N00

Table 3.2: The fields of the feature parameters table. Each row represents the result from a model generated by a slave worker. Each result has a unique ID, along with the cross-validation accuracy and standard deviation. Each row also includes the testing accuracy and the parameter values used to obtain these results.

    ID | Cross-validation Accuracy | Cross-validation Standard Deviation | Test Accuracy | parameter 1 | parameter 2 | parameter 3

Figure 3-2: The flow of an ImageMiner Training module worker.

Figure 3-3: The flow of a Testing module worker.

Chapter 4

Experiments and Results

We ran two different experiments to test the performance of ImageMiner. Our first experiment was run with 5 workers and 20 images per worker. The experiment used 10 iterations with 10 cross-validations in each iteration. The Testing module for this experiment ran with 4 workers and 13 test images per worker on 5 different models. The second experiment was run with 50 training workers and 200 images per worker, with 10 iterations and 10 cross-validations per iteration. The Testing module ran with 100 workers and 250 images per worker on the 67 models generated by the Training module. The experiments are also summarized in Table 4.1.

The results from our first experiment are shown in Table 4.2. The results of the Testing module are shown in Table 4.3. Given the small number of images per worker and the large number of clusters, it is no surprise that the testing results are poor. However, the cross-validation accuracies are much better, albeit with extremely large standard deviations. This makes sense because the workers pick the best classifier, so it is not surprising, given 10 iterations, that one of the iterations correctly classified more than a couple of images.
The results from our second experiment are shown in Appendix B, while a graph of the results is shown in Figure 4-2. The results from the Testing module are also shown in Appendix B.

The results from the Testing module are not great, with each model having less than a 1% accuracy. Given the total number of clusters, even a 0.1% to 0.9% accuracy is actually more than 3 times better than random guessing. However, the majority predictions actually have the worst accuracy, 0.008%, which is even worse than random guessing. Looking at the predictions, because of the large number of images in cluster 1, many of the models end up predicting, incorrectly, that an image is in cluster 1, so the majority prediction for nearly all the images is cluster 1, which is incorrect most of the time. Looking at the per-class accuracy, unsurprisingly, cluster 1 is predicted correctly more than 10 times as often as the next cluster. Only 10% of the clusters had a non-zero accuracy, which is lower than we would have preferred, but, given the number of predictions for cluster 1, not surprising.

Figure 4-1: Performance of the models generated by the Training module at each iteration.

Figure 4-1 shows the average accuracy of all the models for each of the 10 iterations. Interestingly, the first iteration, where ImageMiner used the default feature values, performed the best. The iterations with randomly generated parameters performed much worse, which was expected, while each of the iterations that used parameter tuning performed better than each of the iterations that generated random parameters. There is a general upward trend across the iterations that use parameter tuning, which ends with the last iteration producing the best results.

Looking at the graph, it is encouraging to see that each iteration of the parameter tuning algorithm produces better or comparable results to previous iterations. However, from the tests, it seems that using the default parameters actually produces the best results. It would be interesting to run an experiment with more iterations to see if the performance of the parameter tuning algorithm can eventually overtake the performance of the classifier that uses the default parameter values.

Figure 4-2: Performance of each of the classifiers generated by ImageMiner.

Figure 4-2 shows the cross-validation accuracy and standard deviation of each classifier generated by ImageMiner. Most of the models have an accuracy between 0% and 5%, although a few models perform especially well, with accuracies over 10%. Although 5% accuracy appears to be fairly low, it is important to note that there are 2766 different classes, so a 5% accuracy is significantly better than random guessing. For more in-depth results, please refer to Appendix B.

Many related studies we looked at had much higher accuracies, ranging from 20% all the way up to over 90%. In comparison, ImageMiner looks much worse. However, it is important to remember that ImageMiner is first and foremost an architecture to improve parameter selection. The results produced are mainly to make sure the architecture is in place. Although a 5% accuracy is relatively low compared to the other studies we looked at, ImageMiner only used one feature, SIFT, while other studies used at least three features. In addition, using tags as a feature, which we do not do, produced the best results.
The performance of studies using only image features was only around 20% to 25%. Looking at our results in this light, 5% is not as poor as we initially believed.

Table 4.1: A summary of the two experiments we ran with ImageMiner.

    | Experiment 1 | Experiment 2
    Number of training workers | 5 | 50
    Images per worker (training) | 20 | 200
    Number of iterations | 10 | 10
    Number of cross-validations | 10 | 10
    Number of testing workers | 4 | 100
    Images per worker (testing) | 13 | 250
    Number of models | 5 | 67

Table 4.2: Results for Experiment 1, with 5 workers, 20 images per worker, 10 iterations, and 10 cross-validations.

    Worker | Cross-Validation Accuracy | Cross-Validation STD
    1 | 25 | 38.18813079
    2 | 22.22222222 | 34.24674446
    3 | 5.555555556 | 15.71348403
    4 | 12.5 | 21.65063509
    5 | 5 | 15

Table 4.3: Results for the Testing module test 1, with 4 workers and 13 test images per worker on 5 different models.

    Worker | Number correct | Number of total images
    1 | 0 | 13
    2 | 0 | 13
    3 | 0 | 13
    4 | 0 | 13

Chapter 5

Experience of designing the system

We dealt with a variety of challenges while building ImageMiner during this project, which ended up limiting the amount of time available for experiments. These challenges ranged from issues that had to be resolved before I could even start building the architecture, to processing results from ImageMiner, and everything in between. We will discuss some of the challenges I faced below.

5.1 Data pre-processing and preparation

I dealt with several challenges before even writing ImageMiner. The first issue was that once I had obtained the image metadata, I needed to cluster the images. Since our goal was to cluster images into clusters of 200 km or less, using k-means clustering was impractical. This is because k-means does not set a limit on how large or how spread out a cluster is. In addition, to use k-means one needs to know the number of clusters to create, which is borderline impossible in our case. My first solution was to put all the images in one cluster and then find the two images furthest apart and divide the cluster in two, based on which of the two images each image was closer to. However, this took too long, because finding the two images furthest apart in a cluster required multiple passes through all the images in that cluster. To solve this problem, I realized I just needed to find two images that were more than 200 km apart to divide the cluster. I ended up modifying my clustering algorithm to find two images that were 500 km apart to divide the
Along the lines of user inputs, one problem I had to tackle was how to pass in feature extractor and parameter information to ImageMiner. For the feature extractor, ImageMiner needed to know what to call in the command line to run the feature extractor1 where to download the feature extractor and what the output of the feature extractor was. The parameter information had to include the feature and parameter names, the default value, the bounds on parameter values, and the type of the parameter (integer or float). My solution was to have the user pass in two files: one for feature information and one about the parameters. Each line in the file was a new feature or parameter and the inputs on each line were tab-separated. ImageMiner would read in and process each file and store the information and pass it onto the worker to use. Once the Training module finished running, I had to figure out how to pass the resulting models and the corresponding set of parameters to the Testing Module. The easiest way to do this was to write the model, which was already a file, and the corresponding set of parameters, which needed to be written to a file, to Amazon S3. Each model and parameter 1 Since that was how to run VLFeat’s version of SIFT 48 would have a hash of the images used to test and train on in their filename so the Testing module would know how to link each model to the correct set of parameters. The Testing module would then grab the models and parameters and process the file to use for testing. 5.3 System issues After the ImageMiner system was built, I ran into several issues dealing with the underlying systems that ImageMiner was using. 5.3.1 Installing libraries The first challenge was to make sure that SVMLight, the SVM classifier ImageMiner was using, and SIFT, the feature extractor, ran on the EC2 instances that each worker was running on. I had been testing on my local machine, so I made sure the SVMLight and SIFT executables ran on my 64-bit Mac laptop. However, each EC2 instance ran on a Linux Operating System, so I had to download different executables for SVMLight and SIFT to ensure that both would run on the EC2 instances. Another issue was that the instances I originally used did not have many of the libraries I needed installed. For example, I was using Python’s sklearn library for my k-means clustering. However, the EC2 instances I was creating did not have that library installed. My first solution was to attempt to have each worker attempt to install sklearn via the command line while running the program, but that ran into its own set of problems. What I ended up doing, instead, was to create a new image, similar to the one I were using already but with the necessary libraries installed. I then changed the image I was using to create the instances to my own created image. 5.3.2 Instance limits One of the biggest issues I ran into was the limit on number of instances I could create. I were limited to 100 r3.large EC2 instances, which meant severely slowed down the runtime of ImageMiner, especially the Testing module. I looked into other types of instances to run 49 on that had a significantly higher limit. However, those types were more expensive and also required some additional setup, which I did not have time to do, so I ended up sticking with the r3.large EC2 instances. 
5.3 System issues

After the ImageMiner system was built, I ran into several issues with the underlying systems that ImageMiner was using.

5.3.1 Installing libraries

The first challenge was to make sure that SVMLight, the SVM classifier ImageMiner uses, and SIFT, the feature extractor, ran on the EC2 instances that each worker was running on. I had been testing on my local machine, so I had made sure the SVMLight and SIFT executables ran on my 64-bit Mac laptop. However, each EC2 instance ran Linux, so I had to download different executables for SVMLight and SIFT to ensure that both would run on the instances.

Another issue was that the instances I originally used were missing many of the libraries I needed. For example, I was using Python's sklearn library for k-means clustering, but the EC2 instances I was creating did not have it installed. My first solution was to have each worker install sklearn from the command line while running the program, but that ran into its own set of problems. What I ended up doing instead was creating a new machine image, similar to the one I was already using but with the necessary libraries installed, and launching my instances from that custom image.

5.3.2 Instance limits

One of the biggest issues I ran into was the limit on the number of instances I could create. I was limited to 100 r3.large EC2 instances, which severely slowed down the runtime of ImageMiner, especially the Testing module. I looked into other instance types with significantly higher limits, but they were more expensive and required additional setup that I did not have time for, so I ended up sticking with the r3.large instances.

Since I was limited to 100 instances, I could only create a few instances at a time for my Testing module, which meant that each instance had to train and test on thousands of images instead of hundreds. The other option was to wait for all the Training slaves to finish before launching my Testing slaves, so that each slave could run on fewer images and finish faster; the drawback is that a single particularly slow instance can then delay the whole run. I ended up waiting for all the Training slaves to finish before launching my Testing slaves, so that each Testing slave would run faster.

5.4 Integrating Python Libraries

ImageMiner also uses several Python files and libraries to extract features and generate parameters, so I had to figure out how best to connect the Java and Python code and pass information between them. I considered Jython, a library that allows Python to run on the JVM, but felt it required too much setup. What I did instead was have the Java code invoke the Python files through the command line, passing the necessary information as arguments. Each Python file receives the command line arguments, processes them, uses them to run its program, and prints its results to the command line; meanwhile, the Java program waits on the Python process and reads the output as it is written.

Another problem with the Python files was that they would occasionally run into errors that were out of my control. I had to make sure these errors did not interrupt the module each worker was running, and I had to decide how to handle each one. In some cases, such as parameter generation, I would simulate a plausible output from the Python file; in other cases, I would simply ignore the error and keep running.

5.5 Feature Extraction

A critical requirement of the SIFT feature extraction was that the images had to be in PGM format, but the URLs provided for each image returned JPG files, so I had to convert each JPG to a PGM file. After writing code to do just that, I ran into an issue where my ImageMiner jar file could not find the library I was using for the conversion: the library was not on the build path, so it was never added to the jar when it was created. Adding it to the build path solved that problem. I ran into another issue where the jar sometimes could not download the JPG from its URL, or running SIFT on the PGM image caused an error; I had to handle these errors while also making sure the worker knew that the image was no longer usable for training or testing.

Once the SIFT descriptors were extracted, I needed to run k-means clustering on the descriptors so that each image could be described by the clusters of its descriptors, which makes the images and features much easier to test on. However, I ran into several bugs while writing the k-means clustering code. One issue was that a few SIFT descriptors had only 2 or 3 numbers, while the rest had 132. This prevented sklearn from clustering the images, because Python could not turn the list of descriptors into a numpy array when the descriptors were not all the same length. I dealt with that by removing any descriptors that did not have 132 numbers, which fortunately were not too many. Sometimes, removing these descriptors, or processing SIFT files with very few descriptors, left fewer descriptors than the desired number of clusters; I remedied this by lowering the number of clusters to the number of descriptors whenever that happened. A short sketch of this clean-up step is shown below.
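Below is a minimal Python sketch of that clean-up step, assuming the SIFT output has already been parsed into a list of numeric rows; the default number of clusters is an illustrative assumption, not an ImageMiner setting.

import numpy as np
from sklearn.cluster import KMeans

def cluster_descriptors(descriptor_rows, n_clusters=100):
    # Keep only well-formed descriptors (132 numbers per row); a handful of
    # truncated rows would otherwise prevent building a rectangular array.
    rows = [r for r in descriptor_rows if len(r) == 132]
    if not rows:
        return None
    data = np.array(rows, dtype=float)
    # If there are fewer descriptors than requested clusters, shrink k so
    # that k-means can still run.
    k = min(n_clusters, len(rows))
    return KMeans(n_clusters=k).fit_predict(data)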
5.6 SVM Classifier

One issue with the SVMLight classifier was that the training executable would occasionally take an extremely long time to run, or never finish. When that happened, the testing executable could not find the corresponding model to test with and would throw an error. In these cases, I assumed an accuracy of 0% for that test.

5.7 Processing results

Once the modules finished running, I had to process the results into a format that could be easily read and displayed in a graph. Some of this was relatively easy: the performance of the classifiers was stored in DynamoDB, so I simply queried the database and processed the results into a CSV file (a small sketch of this appears at the end of this section). Other parts were more complicated. To get the results of each iteration of the parameter tuning, I had to log into each worker manually and look through its output file for the accuracy at each iteration. To get the results of the Testing workers, the predictions for each model and each worker were written to files stored on Amazon S3. When trying to download this data, however, I ran into a Java OutOfMemoryError due to a lack of heap space. I dealt with this by increasing the size of the memory allocation pool with the '-Xmx' option, and by modifying my code to write the results out to a file after the predictions for a certain number of images had been downloaded, then combining the files manually.
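For the straightforward part, a sketch along the following lines pulls the performance records out of DynamoDB and writes them to a CSV file. ImageMiner's own result processing is written in Java, and the attribute names here are assumptions; the table name follows the '<table_name>_feature_parameters' convention described in Appendix A.

import csv
import boto3

def export_results(table_name="testing_feature_parameters", out_path="results.csv"):
    table = boto3.resource("dynamodb").Table(table_name)
    response = table.scan()
    items = list(response["Items"])
    # Scans are paginated, so keep fetching until there is no next page.
    while "LastEvaluatedKey" in response:
        response = table.scan(ExclusiveStartKey=response["LastEvaluatedKey"])
        items.extend(response["Items"])
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["worker", "cv_accuracy", "cv_std"])
        for item in items:
            writer.writerow([item.get("worker"), item.get("cv_accuracy"), item.get("cv_std")])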
5.8 Miscellaneous

One of the biggest issues with ImageMiner was the run-time of the two modules. Because each run was very slow (2-4 days for the Training module with 200 images per worker, 1 day for the Testing module with 250 workers), debugging was difficult: it often took a day or two to discover a bug at run-time, which would delay the next steps by a day. Luckily, I was able to discover many of the bugs before too much time had passed, so the amount of wasted time was minimized.

In addition, because each module took so long to run and the system was so big, individual pieces, such as the parameter generation or the k-means clustering, were difficult or impossible to test within the system. I had to write code specifically designed to test each part before integrating it. Of course, once each part was integrated, the transitions between parts would occasionally malfunction, but those problems were easier and faster to debug and fix.

Another minor issue was that, since I had stored all the executables and Python files on Dropbox, launching 100 instances at once generated so much download traffic that the links were temporarily suspended. I dealt with this by moving the Python files to Amazon S3, so that less traffic went to Dropbox.

5.9 Lessons Learned

In addition to all the problems listed above, I had to deal with the typical programming problems, such as deciding on the best algorithm for a given task or debugging an error caused by a typo. Although dealing with all these errors was at times extremely frustrating, it was a great learning experience, and I am glad to have been able to work through them. I learned a lot while building ImageMiner. For example:

∙ Get the data pre-processing step done as soon as possible.
∙ Design the system thoroughly before building it.
∙ When building a system, it helps a lot to build on other people's work, but do not be afraid to customize your own components.
∙ Testing on a local machine and testing on the cloud are completely different.

I wish I had known all of this beforehand, but I am very grateful to have learned these lessons.

Chapter 6

Conclusion

6.1 Future Work

There is a lot of future work left for ImageMiner. First, due to time constraints, we were not able to perform as many experiments as we would have liked. More experiments with different test cases should be run to better determine the effectiveness of ImageMiner and of parameter tuning. One future test is to increase the number of iterations, to see whether the parameter tuning algorithm will eventually produce better results than the default parameter values. It would also be worthwhile to experiment with the clustering algorithm: currently, a large share of the images fall into a few clusters, which could skew the results of the Training and Testing modules, and modifying the algorithm so that each cluster has roughly an equal number of images could improve performance as well.

In addition, there are several areas that can be improved. Some of these were touched on earlier, but we go into more detail here.

6.1.1 User Flexibility

Currently, ImageMiner is only designed to handle SIFT feature extraction. Since all of the feature processing is done by ImageMiner, it requires the features to be output in a specific format. One possible solution is to have users write the feature-processing code themselves and allow them to plug it in to run within ImageMiner. This would allow features to be output in any format the user desires and allow ImageMiner to handle a variety of features.

Additionally, ImageMiner can only handle one feature at a time. Ideally, the user could supply multiple features and have ImageMiner tune the parameters for all of them. There are several possible approaches, but the easiest would be to modify the code to process and tune each feature one by one, instead of trying to extract and tune all the provided features at once. ImageMiner could also create dedicated workers for each feature, so that each worker only has to tune one feature. To increase flexibility, ImageMiner should be able to handle a variety of feature extractors and multiple features at once.

6.1.2 Speed

Although ImageMiner tests a variety of parameters and produces a best prediction model and set of parameters, it runs fairly slowly: an ImageMiner EC2 instance running on 200 images takes anywhere from 1 to 5 days. Improving the speed of ImageMiner would greatly increase its usability. The bottleneck is extracting the features and running k-means on the SIFT descriptors, so finding a faster way to extract and process the features would greatly speed up ImageMiner; one possible direction is sketched below.
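One option, not something ImageMiner currently does, would be to replace the full k-means step with scikit-learn's MiniBatchKMeans, which fits centroids from small random batches of descriptors and is usually much faster on large descriptor sets at a small cost in clustering quality.

import numpy as np
from sklearn.cluster import MiniBatchKMeans

def cluster_descriptors_fast(descriptors, n_clusters=100, batch_size=1000):
    # Mini-batch k-means updates the centroids from small random batches
    # instead of the full descriptor matrix on every iteration.
    data = np.asarray(descriptors, dtype=float)
    model = MiniBatchKMeans(n_clusters=n_clusters, batch_size=batch_size)
    return model.fit_predict(data)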
6.2 Future Goals

The goal of ImageMiner is to tune parameter values for feature extraction so as to extract the best possible features for a machine learning algorithm. Although there is still room for improvement, we hope that ImageMiner can help researchers extract better features and obtain better results.

Appendix A

ImageMiner Interface

This appendix describes how a user interacts with ImageMiner and provides a guide to how ImageMiner works. The source code is at https://github.mit.edu/ALFAGroup/DeepMining.Images/tree/master/ImageMiner.

The user should first either build the jar from the source code or download the JAR from https://github.mit.edu/ALFAGroup/DeepMining.Images/tree/master/ImageMiner. The user also needs JRE 7 and an AWS account to get started. Make sure the AWS account credentials can be found by the jar; if not, the user should set them in the command line with

export AWS_SECRET_ACCESS_KEY=xxxxxxxx
export AWS_ACCESS_KEY_ID=xxxxxxxx

A.1 Training

The Training module determines the best parameters to use for feature extraction and builds a classifier from those parameters. It accepts the following command line arguments:

-a,--ami <arg>              Amazon Machine Image to launch instances with (default is the public ImageMiner AMI)
-b,--bucket <arg>           Bucket to upload models to
-d,--num-iterations <arg>   Number of iterations for each worker
-f,--features <arg>         File containing feature extraction scripts. This file should contain a row for each feature in the format <feature_name> <dropbox_url_of_script> <output_file_format(s)>. There can be multiple output file formats. The output file format should give the type of the output file(s) (.doc, .txt, etc.). If no output file format is specified, the output is assumed to be written to stdout.
-g,--tag <arg>              Each instance will be tagged with this value.
-h,--help                   Show the help menu.
-i,--initialize-tables      Pass this argument to initialize the tables for the first time. If tables with the given table name already exist, they will be deleted and recreated.
-k,--num-train <arg>        Number of images to train on
-l,--num-cross <arg>        Number of cross-validations
-m,--num-test <arg>         Number of images to test on
-n,--num-instances <arg>    Number of instances to create
-p,--iam-profile <arg>      Name of the IAM profile to use for EC2 instances (must have access to DynamoDB, S3, and SQS)
-r,--parameters <arg>       File containing parameter info. This file contains a row for each parameter in the format '<feature> <param_name> <default_value> <lower_bound>,<upper_bound> 0 (for integer-only parameters) or 1 (for real-valued parameters)'
-t,--table <arg>            Name of the table to store in the database (names will be '<table_name>_feature_parameters')
-y,--instance-type <arg>    Amazon EC2 instance type to launch (possibilities: r3.large, r3.xlarge, r3.2xlarge, r3.4xlarge, r3.8xlarge, c3.large, c3.xlarge, c3.2xlarge, c3.4xlarge, c3.8xlarge)

An example command for running the Training module might look like

java beatDB/Main imageminer -b imageminer -d 10 -f features.txt -k 8000 -l 10 -m 2000 -n 50 -p beatdbtestrole -r parameters.txt -t testing

This will create 50 r3.large instances using the public ImageMiner AMI. The feature and parameter information is stored in features.txt and parameters.txt, respectively. Each worker will grab 200 images (160 training, 40 testing) from the images table to train and test on, run 10 cross-validations on the training data, and run 10 iterations to produce 10 models. It will pick the best model, write that model and the corresponding set of parameters to S3, and write the performance of the best model to the feature_parameters table. Example contents for the two input files are shown below.
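As an illustration of the expected formats, here is a hypothetical pair of input files; the Dropbox URL, parameter names, default values, and bounds are made up for the example and are not ImageMiner defaults. Fields within each row are tab-separated.

features.txt:

sift	https://www.dropbox.com/s/xxxxxxxx/run_sift.py	.txt

parameters.txt:

sift	peak_thresh	0.01	0.0,10.0	1
sift	edge_thresh	10	1,30	0
sift	num_clusters	100	10,500	0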
A.2 Testing The Testing module grabs all the models and parameters generated by the ImageMiner module and tests the models to determine how good the modules are. It takes the following command line arguments: An example command for running the Testing module might look like java beatDB/Main testing -b imageminer -m 5000 -n 50 -p beatdbtestrole -t testing This creates 50 r3.large instances using the public ImageMiner AMI. It downloads the module and parameters from the imageminer bucket on S3. Each worker then grabs 100 images to test on and for each model-parameter set, extract features based on the parameter and runs a prediction using the model. The worker then gets every model’s prediction for each image and does majority rules to determine what the final classification is before comparing it to the actual classification. 59 -a,–ami <arg> Amazon Machine Image to launch instances with (default is the public ImageMiner AMI) -b,–bucket <arg> Bucket to grab models and parameters from -g,–tag <arg> Each instance will be tagged with this value. -h,–help Show the help menu. -i,–initialize-tables Pass this argument to initialize the tables for the first time. If tables with the given table name already exist, they will be deleted and recreated. -m,–num-test <arg> Number of images to test on -n,–num-instances <arg> Number of instances to create -p,–iam-profile <arg> Name of IamProfile to use for EC2 instances (must have access to DynamoDB, S3, and SQS) -t,–table <arg> Name of table to store in the database (names will be ’<table_name>_test_results’) -y,–instance-type <arg> Amazon EC2 Instance type to launch, (possibilities: r3.large, r3.xlarge, r3.2xlarge, r3.4xlarge, r3.8xlarge, c3.large, c3.xlarge, c3.2xlarge, c3.4xlarge, c3.8xlarge) 60 Appendix B Results The following table displays the cross validation accurracy and standard deviation of the models generated from the Training module. Table B.1: The performance of each model generated from the ImageMiner module Model Crossvalidation Accuracy Crossvalidation STD 1 0.625 1.875 2 1 3 3 1 3 4 1 3 5 1.111 3.333 6 1.111 3.333 7 1.234444444 3.491536151 8 1.234444444 3.491536151 9 1.25 3.75 10 1.25 3.75 11 1.25 3.75 12 1.25 3.75 13 1.25 3.75 14 1.25 3.75 15 1.25 3.75 16 1.25 3.75 17 1.25 3.75 61 Table B.1 – continued. Model Crossvalidation Accuracy Crossvalidation STD 18 1.429 4.287 19 1.429 4.287 20 1.667 5.001 21 2 4 22 2 4 23 2 6 24 2 6 25 2 6 26 2 6 27 2 4 28 2.111 4.22928942 29 2.111 4.22928942 30 2.222 6.666 31 2.222 4.444 32 2.222 4.444 33 2.222 4.444 34 2.361 4.73221819 35 2.5 7.5 36 2.5 7.5 37 2.5 7.5 38 2.5 5 39 2.5 5 40 2.679 5.372929276 41 2.777777778 5.196746371 42 2.777777778 5.196746371 43 2.777777778 5.196746371 44 2.857 8.571 45 2.857 4.738552627 46 2.858 5.716 47 2.976666667 5.584792844 48 3 4.582575695 49 3.241111111 6.14270741 50 3.25 6.712860791 62 Table B.1 – continued. 
Model Crossvalidation Accuracy Crossvalidation STD 51 3.25 6.712860791 52 3.333 5.091241597 53 3.333 9.999 54 3.333 5.091241597 55 3.333 5.091241597 56 3.333 7.113871028 57 3.333 9.999 58 3.429 5.353724778 59 3.651 5.63711176 60 3.703333333 10.47460845 61 3.75 8.003905297 62 3.75 5.728219619 63 3.75 8.003905297 64 3.75 5.728219619 65 3.75 8.003905297 66 3.75 5.728219619 67 3.929 6.019416002 68 3.929 8.21482617 69 3.929 8.21482617 70 4.166666667 5.89255651 71 4.166666667 8.333333333 72 4.286 12.858 73 4.287 6.548500668 74 4.287 6.548500668 75 4.287 6.548500668 76 4.444 5.442766208 77 4.444 5.442766208 78 4.5 7.141428429 79 4.5 7.141428429 80 4.583 7.505808484 81 4.583 10.2815369 82 5 6.123724357 83 5 15 63 Table B.1 – continued. Model Crossvalidation Accuracy Crossvalidation STD 84 5 8.291561976 85 5 15 86 5 6.123724357 87 5 8.291561976 88 5 15 89 5 15 90 5 11.45643924 91 5 6.123724357 92 5.555555556 15.71348403 93 5.833 11.8137632 94 6.217777778 7.04466824 95 6.429 15.13566546 96 7.407777778 20.9523596 97 7.5 10 98 8.691 11.78542443 99 10 30 100 10 30 101 10 30 102 10 30 103 10 30 104 15 32.01562119 The following table displays the test accuracy of the models generated from the Training module on test data. Table B.2: Accuracy of each model from the Testing module Model Test Accuracy 1 2 3 0.0 0.12650995755794972 0.11128972424879437 4 5 6 7 8 9 64 0.16661112962345886 0.6896833170094678 0.2226069750185506 0.37136793992817496 0.7876265099575579 0.12242899118511265 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 0.11834802481227553 0.5523723154293252 0.041218416388442355 0.049464138499587806 0.0741992662517004 0.07831499113804048 0.7060071825008162 0.09386222657525302 0.14283382304929806 0.08161932745674175 0.07421150278293136 0.3998186389678909 0.11834802481227553 0.1030715316429602 0.7753836108390467 0.15099575579497226 0.21629121776036564 0.11426705843943846 0.7794645772118838 0.15915768854064644 0.35096310806398956 0.1566622691292876 0.5081639453515495 0.3789456150578829 0.2061430632859204 0.7625721352019786 0.1305909239307868 0.17316017316017315 0.04946209966613083 0.3297609233305853 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 majority 0.044890630101207966 0.06594946622150777 0.21221025138752855 0.6692784851452824 0.08570029382957885 0.05359719645433931 0.12650995755794972 0.3795298726738492 0.5305256284688215 0.267941794797807 0.10610512569376428 0.1850917045263335 0.1401772830344259 0.5958210904342148 0.16079158936301793 0.11834802481227553 0.38769180541952336 0.15507672216780935 0.5468494939601698 0.820246486130003 0.7998694090760692 0.21846661170651277 0.0948258091115234 0.07753836108390467 0.30093165141396655 0.3131954174565235 0.5672543258243552 0.12202819272038713 0.008202742409402547 The following table shows the per-class accuracy for each class from testing the models on the test data. Only clusters with non-zero accuracies were shown. Only 274 out of the 2766 clusters had non-zero accuracies. 
Table B.3: Per-class accuracy for each cluster from the Testing module Cluster Test Accuracy 1 2 93 23.92312789927104 2.0620179442782938 1.6635859519408502 545 2071 173 16 294 65 1.596351197263398 1.4925373134328357 1.4796547472256474 1.478743068391867 1.476510067114094 1461 679 442 314 737 1278 306 1765 195 21 120 67 28 1962 2477 1915 361 1315 1171 1441 2640 75 13 136 590 1602 373 999 1498 33 14 139 562 1674 1098 2095 995 2628 94 58 220 43 2149 40 115 1.4705882352941175 1.3513513513513513 1.3084112149532712 1.3054830287206265 1.2048192771084338 0.9852216748768473 0.9358288770053476 0.8902077151335311 0.7944389275074478 0.7875953728771843 0.7619047619047619 0.7541995200548508 0.7501071581654523 0.7462686567164178 0.7462686567164178 0.7407407407407408 0.7393715341959335 0.7380073800738007 0.7371007371007371 0.7352941176470588 0.7352941176470588 0.6947660954145438 0.6730137885751806 0.6260671599317018 0.6157635467980296 0.60790273556231 0.5988023952095809 0.591715976331361 0.5555555555555556 0.5286529921759358 0.5085464048594435 0.49813200498132004 0.4975124378109453 0.49504950495049505 0.49261083743842365 0.49261083743842365 0.49019607843137253 0.49019607843137253 0.46982291290205996 0.46748831279218017 0.431832202344232 0.4244282008960151 0.42194092827004215 0.412829469672912 0.4108885464817668 7 419 486 285 26 1876 1225 835 420 649 824 943 2405 1272 1365 2197 36 509 125 565 128 27 221 323 2339 1509 828 1671 2410 570 1095 1875 2292 714 1183 1245 1404 50 644 217 31 104 62 289 593 66 0.4089775561097257 0.40540540540540543 0.4037685060565276 0.4016064257028112 0.3923766816143498 0.38314176245210724 0.37174721189591076 0.3703703703703704 0.36900369003690037 0.36900369003690037 0.36900369003690037 0.36900369003690037 0.36900369003690037 0.3676470588235294 0.3676470588235294 0.3676470588235294 0.36322360953461974 0.3401360544217687 0.3336510962821735 0.32786885245901637 0.32377428307123035 0.3114658360911038 0.3110419906687403 0.31007751937984496 0.3048780487804878 0.3003003003003003 0.2958579881656805 0.2958579881656805 0.2958579881656805 0.2949852507374631 0.2949852507374631 0.2949852507374631 0.2949852507374631 0.29411764705882354 0.29411764705882354 0.29411764705882354 0.29411764705882354 0.27165710836100215 0.2684563758389262 0.2682763246143528 0.2658396101019052 0.25665704202759065 0.2548853016142736 0.24968789013732834 0.24630541871921183 1246 181 156 365 471 1392 5 304 23 20 86 817 11 163 210 34 378 454 300 532 658 363 214 346 275 553 865 673 726 1114 84 4 161 1253 1342 315 1082 331 460 232 445 580 71 166 3 0.2457002457002457 0.24554941682013504 0.24549918166939444 0.24509803921568626 0.24509803921568626 0.24509803921568626 0.23443910444262106 0.2288329519450801 0.22742779167614283 0.22711787417669774 0.22361359570661896 0.2178649237472767 0.21671407287010702 0.21398002853067047 0.21261516654854712 0.21136683889149835 0.21119324181626187 0.21097046413502107 0.21052631578947367 0.21052631578947367 0.21052631578947367 0.2103049421661409 0.20147750167897915 0.19685039370078738 0.1926782273603083 0.18832391713747645 0.1851851851851852 0.18484288354898337 0.18484288354898337 0.18484288354898337 0.18475750577367206 0.1846892603195124 0.18450184501845018 0.18450184501845018 0.18450184501845018 0.1841620626151013 0.1838235294117647 0.17436791630340018 0.1737619461337967 0.17301038062283738 0.16366612111292964 0.16366612111292964 0.1596169193934557 0.15337423312883436 0.15293442936341045 144 171 514 97 242 80 754 140 149 1019 1068 567 608 6 101 74 347 8 250 186 333 353 87 19 142 63 105 615 35 462 
286 324 272 398 10 266 29 45 612 230 264 334 370 1088 150 67 0.1525165226232842 0.15236160487557138 0.15037593984962408 0.14829461196243204 0.14814814814814814 0.14787430683918668 0.14771048744460857 0.14756517461878996 0.14749262536873156 0.14749262536873156 0.14749262536873156 0.14705882352941177 0.14705882352941177 0.14513788098693758 0.14238253440911247 0.14231499051233396 0.1402524544179523 0.14023457419683832 0.13568521031207598 0.13513513513513514 0.13513513513513514 0.13513513513513514 0.13452914798206278 0.1306701512040321 0.13054830287206268 0.1303780964797914 0.13009540329575023 0.125 0.12315270935960591 0.12300123001230012 0.12285012285012285 0.12285012285012285 0.1226993865030675 0.12254901960784313 0.12181916621548457 0.11785503830288745 0.11693171188026193 0.11574074074074073 0.11441647597254005 0.11402508551881414 0.11376564277588168 0.11376564277588168 0.11350737797956867 0.11312217194570137 0.10911074740861974 531 329 169 54 79 842 527 317 349 309 212 621 188 876 32 291 207 240 257 274 49 22 280 423 197 350 59 227 308 82 12 60 24 406 219 284 107 262 25 138 15 260 216 0.10660980810234541 0.10504201680672269 0.10183299389002036 0.10131712259371835 0.10090817356205853 0.09871668311944717 0.09861932938856016 0.09842519685039369 0.09842519685039369 0.09832841691248771 0.09828009828009827 0.09823182711198428 0.09813542688910697 0.09803921568627451 0.09420631182289213 0.09250693802035154 0.09216589861751152 0.09208103130755065 0.09191176470588235 0.09191176470588235 0.09078529278256922 0.08976660682226212 0.08703220191470844 0.08673026886383348 0.08665511265164644 0.08650519031141869 0.08389261744966443 0.08278145695364239 0.08244023083264633 0.08240626287597858 0.08119519324455993 0.0790722192936215 0.07791195948578107 0.07776049766718507 0.07739938080495357 0.07446016381236038 0.07390983000739099 0.07390983000739099 0.07316627034936894 0.07102272727272728 0.07023705004389816 0.0702247191011236 0.0700770847932726 30 175 296 53 213 193 251 218 126 38 162 52 92 141 98 88 151 131 102 153 143 154 37 44 47 108 95 159 119 168 89 69 9 64 118 46 65 57 56 42 39 17 51 68 0.06944444444444445 0.0675219446320054 0.0675219446320054 0.06702412868632708 0.06697923643670461 0.06693440428380187 0.06150061500615006 0.06146281499692685 0.05945303210463733 0.05732301519059903 0.05694760820045558 0.05688282138794084 0.05672149744753262 0.05665722379603399 0.0546448087431694 0.05279831045406547 0.051150895140664954 0.04962779156327543 0.04940711462450593 0.049164208456243856 0.04906771344455348 0.047709923664122134 0.047505938242280284 0.04691531785127844 0.046264168401572985 0.046125461254612546 0.04601932811780948 0.0447427293064877 0.044722719141323794 0.04342162396873643 0.04230118443316413 0.04106776180698152 0.03756574004507889 0.03607503607503607 0.03594536304816679 0.03506311360448808 0.03428179636612959 0.030759766225776686 0.027932960893854747 0.026281208935611037 0.02382654276864427 0.0209819555182543 0.020475020475020478 Bibliography [1] MediaEval Multimedia Benchmark. Mediaeval benchmarking initiative for multimedia evaluation. http://www.multimediaeval.org/mediaeval2014/placing2014/. [2] Eric Brochu, Vlad M Cora, and Nando de Freitas. A tutorial on bayesian optimization of expensive cost functions, with application to active user modeling and hierarchical reinforcement learning. eprint arXiv:1012.2599, arXiv.org, December 2010. [3] L. Cao, J. Yu, J. Luo, and T. Huang. Enhancing semantic and geographic annotation of web images via logistic canonical correlation regression. 
ACM International Conference on Multimedia, pages 125–134, 2009.

[4] J. Choi, H. Lei, V. Ekambaram, P. Kelm, L. Gottlieb, T. Sikora, K. Ramchandran, and G. Friedland. Human vs. machine: Establishing a human baseline for multimodal location estimation. ACM International Conference on Multimedia, pages 866–867, 2013.

[5] Seth Golub. heatmap.py. http://www.sethoscope.net/heatmap/.

[6] Vineet Gopal. PhysioMiner: A scalable cloud based framework for physiological waveform mining. Master's thesis, MIT, 2014.

[7] J. Hays and A. A. Efros. Im2gps: Estimating geographic information from a single image. CVPR Computer Vision and Pattern Recognition Conference, 2008.

[8] Thorsten Joachims. SVMLight. http://svmlight.joachims.org.

[9] P. Kelm, S. Schmiedeke, J. Choi, G. Friedland, V. Ekambaram, K. Ramchandran, and T. Sikora. A novel fusion method for integrating multiple modalities and knowledge for multimodal location estimation. ACM Multimedia Workshop on Geotagging and Its Applications in Multimedia, 2013.

[10] M. Larson, M. Soleymani, P. Serdyukov, S. Rudinac, C. Wartena, V. Murdock, G. Friedland, R. Ordelman, and G. J. F. Jones. Automatic tagging and geotagging in video collections and communities. ACM ICMR International Conference, pages 51–54, 2011.

[11] David G. Lowe. Object recognition from local scale-invariant features. Computer Vision, 2:1150–1157, 1999.

[12] Frank Michel. How many public photos are uploaded to flickr every day, month, year? https://www.flickr.com/photos/franckmichel/6855169886/in/photostream/.

[13] OpenStreetMap. OSM Viz. http://cbick.github.io/osmviz/html/index.html.

[14] O. A. B. Penatti, L. T. Li, J. Almeida, and R. da S. Torres. A visual approach for video geocoding using bag-of-scenes. ACM ICMR International Conference on Multimedia Retrieval, 2012.

[15] David A. Shamma. One hundred million creative commons flickr images for research. http://yahoolabs.tumblr.com/post/89783581601/one-hundred-million-creative-commons-flickr-images.

[16] M. Trevisiol, H. Jégou, J. Delhumeau, and G. Gravier. Retrieving geo-location of videos with a divide and conquer hierarchical multimodal approach. ACM ICMR International Conference on Multimedia Retrieval, 2013.

[17] VLFeat. VLFeat. http://www.vlfeat.org/.