Andrew Hovingh
CS 5950, Summer 2 2010
Western Michigan University, Computer Science

MAPREDUCE, SENSOR DATA, AND THE CLOUD

Objectives
- Use libraries/APIs to automate obtaining a large dataset of sensor information
- Apply MapReduce in an appropriate environment to produce aggregate information, e.g. statistics
- Use Hadoop for MapReduce on a local Hadoop cluster of machines
- Use the same MapReduce program on a cluster of virtual machines using Amazon's Elastic MapReduce service (and S3 storage)
- Process the MapReduce output

Sensor Input
- Western Michigan University's Parkview campus uses a MySQL database of sensor data from the rooms in the building, with the columns:
  - space (the room the data is from)
  - subsystem
  - logtime (time of measurement)
  - temp (temperature in °F)
  - occupied
  - alarmed
- A Java program acquires all the rows in this history database and generates a simple comma-separated text file of the contents as input for MapReduce

MapReduce for Statistics
- Mapper:
  - Input to the mapper is a key/value pair, where the key is the position in the input text file and the value is a whole line of text
  - If the line is valid, it is split on commas and the space (the room) and temperature are extracted
  - The emitted key/value pair uses the space text as the key and, as the value, a tuple of:
    - minimum (the temperature so far)
    - maximum (the temperature so far)
    - sum (the temperature so far)
    - sum of squares (the squared temperature so far)
    - number of points (1 so far)
  - An additional key/value pair is emitted with the key "global" and the same value tuple as above; this provides global statistics across all rooms

MapReduce for Statistics Cont'd
- Reducer:
  - Input to the reducer is the space key (either a room or "global" for all rooms) and a list of value tuples as described for the mapper
  - The minimum, maximum, sum, sum of squares, and total number of points are calculated over all the value tuples to form a new tuple of the same format, which the Hadoop MapReduce framework writes to the output text files
- The combiner uses the same method as the reducer (both are sketched in code after the next slide)

Processing MapReduce Output
- The MapReduce program generates files named "part-#####" consisting of the key/value pairs described for the mapper and reducer
- A Java program was written to read these files in order and store the key/value pairs in an internal associative array, reducing values for the same key together with the same reducer logic where necessary
- The associative array is written to an XML document whose entries consist of:
  - the space (either a room or "global")
  - minimum
  - maximum
  - mean (sum / number of points)
  - standard deviation: sqrt((number of points * mean^2 - 2 * mean * sum + sum of squares) / (number of points - 1))
  - number of points
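The mapper and reducer described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual source: the class names, the column positions in the input file, and the comma-separated encoding of the (min, max, sum, sum of squares, count) tuple are assumptions made here for brevity. Because merging these tuples is associative and commutative, the same reducer class can also serve as the combiner, as the slides describe.

```java
// Illustrative sketch only: field positions, class names, and the
// comma-separated value encoding are assumptions, not the author's code.
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class TempStats {

    /** Emits (space, "min,max,sum,sumOfSquares,count") for each valid input line,
     *  plus the same value under the key "global" for all-rooms statistics. */
    public static class StatsMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            String[] fields = line.toString().split(",");
            if (fields.length < 4) return;               // skip malformed rows
            String space = fields[0].trim();             // room identifier (assumed column 0)
            double temp;
            try {
                temp = Double.parseDouble(fields[3]);    // temperature in °F (assumed column 3)
            } catch (NumberFormatException e) {
                return;                                  // skip rows with a non-numeric temperature
            }
            // Partial statistics for a single reading: min, max, sum, sum of squares, count.
            Text value = new Text(temp + "," + temp + "," + temp + "," + (temp * temp) + ",1");
            context.write(new Text(space), value);
            context.write(new Text("global"), value);    // contributes to the global statistics
        }
    }

    /** Merges partial statistic tuples; also usable as the combiner. */
    public static class StatsReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text space, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            double min = Double.POSITIVE_INFINITY, max = Double.NEGATIVE_INFINITY;
            double sum = 0, sumSq = 0;
            long count = 0;
            for (Text v : values) {
                String[] p = v.toString().split(",");
                min = Math.min(min, Double.parseDouble(p[0]));
                max = Math.max(max, Double.parseDouble(p[1]));
                sum += Double.parseDouble(p[2]);
                sumSq += Double.parseDouble(p[3]);
                count += Long.parseLong(p[4]);
            }
            context.write(space, new Text(min + "," + max + "," + sum + "," + sumSq + "," + count));
        }
    }
}
```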
Java Program Solution Overview
- All control of the process is done from the runnable Program.jar
- Takes a path to a local XML file containing access information for the machines and accounts required to do the job (access.xml) and parses that file for the information
- Reads the MySQL sensor database and generates a local text file of the database data
- If requested, uploads the sensor text file, runs MapReduce on the local Hadoop cluster, and gets back a statistics XML file
- If requested, uploads the sensor text file and the MapReduce jar file to Amazon S3 storage, runs MapReduce on Amazon's Elastic MapReduce service, gets back the output files, and generates a statistics XML file

Java Program for Hadoop Cluster
- Processes access.xml to get access information (hostnames, passwords, etc.)
- Reads the MySQL database and generates a text file
- Uploads the text file to the Hadoop cluster
- Puts the text file into the Hadoop distributed file system
- Deletes the output directory on the cluster and in the distributed file system
- Runs MapReduce with the jar stored on the cluster (automatically uploaded with my Ant script, but needs uploading only once)
- Gets the output directory from the distributed file system
- Runs FileReducer to process the output files into an XML document, using a jar stored on the cluster (automatically uploaded with my Ant script; needs uploading only once)
- Downloads the output XML from the cluster back to the local machine

Java Program for Amazon Web Services, Elastic MapReduce and S3
- Processes access.xml to get access information (hostnames, passwords, etc.)
- Reads the MySQL database and generates a text file
- Deletes the appropriate files and directories in S3 storage (MapReduce jar file, input and output folders)
- Uploads the text file and the MapReduce jar file to S3 storage
- Runs a MapReduce job flow in Elastic MapReduce (a brief code sketch appears after the Conclusions)
- Downloads the output directory from S3 storage back to the local machine
- Runs the FileReducer logic on the local output files to generate an output XML file

Custom Ant Build
- To facilitate development, a custom Ant script is used to automatically generate all the jar files used in the processes described previously, upload the jar files to the Hadoop cluster (using scp), and run Program.jar
- After the Ant build has been run once, output XML files can be generated simply by running Program.jar alone (this can be set up in a script that runs at a given frequency to update the output automatically)
- Everything is automated and well defined!

Setting Up and Running the Project
- Make sure the latest Java JDK is installed (in Ubuntu, open Synaptic Package Manager (SPM), search for the "default-jdk" package, mark it for installation, and apply)
- Get the latest MySQL Java driver, Connector/J, and note the location of the installed driver jar (in Linux at the time this was /usr/share/java/mysql-connector-java5.1.10.jar) (in Ubuntu use SPM to install the "libmysql-java" package)
- Set up Eclipse with the AWS SDK for Java (instructions can be found at http://aws.amazon.com/eclipse/); this amounts to adding a software site (http://aws.amazon.com/eclipse) and installing the plugins from that site
- Note the access key and secret key of your AWS account (instructions provided on the website)
- Import the existing Eclipse project source (File > New Project…, select existing project and browse to the source directory)
- Edit the access.xml file with all the machine and account information required; reflect some of that data in the Ant build.xml script in the project
- Run the Ant build.xml script (right-click build.xml and select Run As… > Ant Build)
- Subsequent runs of the built program can be made simply by calling the runnable jar Program.jar in the project's build/jars/ folder

Conclusions
- Java and Ant were used to automate the entire process of getting sensor data and running Hadoop MapReduce to get statistics results, both on the local cluster and in AWS
- Please refer to the source code for file details
- Any questions?
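As a supplement to the AWS workflow described earlier, the sketch below shows roughly how a MapReduce job flow can be submitted to Elastic MapReduce with the AWS SDK for Java. It is illustrative only: the bucket name, S3 keys, instance types and counts, and the class name are hypothetical placeholders, not values taken from the project, and the real program reads its credentials from access.xml rather than hard-coding them.

```java
// Hedged sketch of submitting an Elastic MapReduce job flow with the AWS SDK for Java.
// Bucket names, key paths, instance types, and jar arguments are hypothetical placeholders.
import com.amazonaws.auth.BasicAWSCredentials;
import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduceClient;
import com.amazonaws.services.elasticmapreduce.model.HadoopJarStepConfig;
import com.amazonaws.services.elasticmapreduce.model.JobFlowInstancesConfig;
import com.amazonaws.services.elasticmapreduce.model.RunJobFlowRequest;
import com.amazonaws.services.elasticmapreduce.model.RunJobFlowResult;
import com.amazonaws.services.elasticmapreduce.model.StepConfig;

public class RunSensorStatsJobFlow {
    public static void main(String[] args) {
        // In the real program the access key and secret key come from access.xml.
        BasicAWSCredentials credentials =
                new BasicAWSCredentials("ACCESS_KEY", "SECRET_KEY");
        AmazonElasticMapReduceClient emr = new AmazonElasticMapReduceClient(credentials);

        // One step: run the MapReduce jar already uploaded to S3, pointing it
        // at the sensor text file and an output prefix.
        HadoopJarStepConfig jarStep = new HadoopJarStepConfig()
                .withJar("s3n://my-bucket/jars/MapReduce.jar")      // hypothetical key
                .withArgs("s3n://my-bucket/input/sensors.txt",      // hypothetical input
                          "s3n://my-bucket/output/");               // hypothetical output

        StepConfig step = new StepConfig()
                .withName("Sensor statistics")
                .withHadoopJarStep(jarStep);

        JobFlowInstancesConfig instances = new JobFlowInstancesConfig()
                .withInstanceCount(3)                // 1 master + 2 slaves (illustrative)
                .withMasterInstanceType("m1.small")
                .withSlaveInstanceType("m1.small");

        RunJobFlowRequest request = new RunJobFlowRequest()
                .withName("Sensor statistics job flow")
                .withLogUri("s3n://my-bucket/logs/")
                .withInstances(instances)
                .withSteps(step);

        RunJobFlowResult result = emr.runJobFlow(request);
        System.out.println("Started job flow: " + result.getJobFlowId());
    }
}
```

Once the job flow has been started, the program would wait for it to finish (for example by polling its state through the SDK) before downloading the output prefix from S3, which corresponds to the download step in the slides above.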