MapReduce, Sensor Data, and the Cloud

Andrew Hovingh
CS 5950 Summer 2 2010
Western Michigan University Computer Science
 Use libraries/APIs to automate obtaining a large
dataset of sensor information
 Apply MapReduce in an appropriate
environment to produce aggregate information,
e.g. statistics
 Use Hadoop for MapReduce in a local Hadoop
cluster of machines
 Use the same MapReduce program in a cluster of
virtual machines using Amazon’s Elastic
MapReduce service (and S3 storage)
 Process the MapReduce output
Sensor Input
 Western Michigan University Parkview campus uses
a MySQL database of sensor data from the rooms in
the building, consisting of columns:
space (the room the data is from)
logtime (time of measurement)
temp (temperature in °F)
 A Java program acquires all the rows in this
history database and generates a simple comma-
separated text file of the contents as input for
MapReduce
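The export step described above could be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the class name, the table name `history`, the column types, and the exact line layout (`space,logtime,temp`) are assumptions based only on the column list on this slide.

```java
import java.io.IOException;
import java.io.PrintWriter;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;
import java.sql.Timestamp;

class SensorExportSketch {
    // Formats one database row as a comma-separated line: space,logtime,temp
    static String toCsvLine(String space, Timestamp logtime, double temp) {
        return space + "," + logtime + "," + temp;
    }

    // Writes every row of the (hypothetical) "history" table to a text file.
    // Requires the MySQL Connector/J driver on the classpath at run time.
    static void export(String jdbcUrl, String user, String password, String outPath)
            throws SQLException, IOException {
        try (Connection conn = DriverManager.getConnection(jdbcUrl, user, password);
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("SELECT space, logtime, temp FROM history");
             PrintWriter out = new PrintWriter(outPath, "UTF-8")) {
            while (rs.next()) {
                out.println(toCsvLine(rs.getString("space"),
                                      rs.getTimestamp("logtime"),
                                      rs.getDouble("temp")));
            }
        }
    }
}
```

Keeping the line formatting in its own method makes the file layout easy to check without a database connection.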
MapReduce for Statistics
 Mapper:
 Input to mapper is a key value pair, where the key is the position
in the input text file and the value is a whole line of text.
 If the line is validated, it is split by commas and the space (the
room) and temperature are extracted
 The key value pair emitted uses the space text as the key; the value
(called a value pair below) holds five fields, initialized from the single reading:
Minimum (the temperature)
Maximum (the temperature)
Sum (the temperature)
Sum of Squares (the temperature squared)
Number of points (1)
 An additional key value pair is emitted using the key “global” and
the same value pair as above. This is for global statistics for all
rooms.
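The mapper logic on this slide can be sketched without the Hadoop classes, as plain Java. The class name, the validation rule (exactly three fields and a numeric temperature), and the field order `space,logtime,temp` are assumptions for illustration; in a real Hadoop job the same logic would sit inside a Mapper's map method and write to the framework's output collector instead of returning a list.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

class StatsMapperSketch {
    // Value layout: {min, max, sum, sumOfSquares, count}
    static double[] single(double t) {
        return new double[] {t, t, t, t * t, 1};
    }

    // offset is the position of the line in the input file (unused, as in the slide).
    static List<Map.Entry<String, double[]>> map(long offset, String line) {
        List<Map.Entry<String, double[]>> out = new ArrayList<Map.Entry<String, double[]>>();
        String[] fields = line.split(",");
        if (fields.length != 3) {
            return out; // assumed validation: skip malformed lines
        }
        double temp;
        try {
            temp = Double.parseDouble(fields[2].trim());
        } catch (NumberFormatException e) {
            return out; // assumed validation: skip non-numeric temperatures
        }
        // One record for the room, one for the "global" key.
        out.add(new SimpleEntry<String, double[]>(fields[0].trim(), single(temp)));
        out.add(new SimpleEntry<String, double[]>("global", single(temp)));
        return out;
    }
}
```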
MapReduce for Statistics
 Reducer:
 Input to reducer is the space key (either a room or
“global” for all rooms) and a list of value pairs as
described in mapper
 The minimum, maximum, sum, sum of squares,
and total number of points is calculated from all
the values pairs to form a new value pair of the
same format, which is written to output text files
from the Hadoop MapReduce framework
 The combiner uses the same method as the reducer
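A framework-independent sketch of the reduce step follows, assuming each value is stored as an array {min, max, sum, sum of squares, count}; the class and method names are hypothetical. Because the merge is associative and commutative and its output has the same format as its input, the same logic can serve as a combiner.

```java
class StatsReducerSketch {
    // Merges two partial results in the layout {min, max, sum, sumOfSquares, count}.
    static double[] merge(double[] a, double[] b) {
        return new double[] {
            Math.min(a[0], b[0]),
            Math.max(a[1], b[1]),
            a[2] + b[2],
            a[3] + b[3],
            a[4] + b[4]
        };
    }

    // Folds all values for one key into a single value of the same format.
    // Returns null if there are no values.
    static double[] reduce(Iterable<double[]> values) {
        double[] acc = null;
        for (double[] v : values) {
            acc = (acc == null) ? v.clone() : merge(acc, v);
        }
        return acc;
    }
}
```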
Processing MapReduce Output
 The MapReduce program generates files named “part-#####”,
consisting of the key value pairs described in the mapper
and reducer.
 A Java program was written to read these files in order and store
the key value pairs in an internal associative array, reducing the
values together with the same reducer logic if necessary
 The associative array was written to an xml document, where the
entries consist of:
The space (either room or “global”)
Mean (Sum/number of points)
Standard Deviation (square root of (number of points * mean^2 - 2*mean*sum + sum
of squares)/(number of points - 1))
 Number of Points
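As a worked check of these formulas, here is a small sketch with hypothetical class and method names. The quotient (number of points * mean^2 - 2*mean*sum + sum of squares)/(number of points - 1) is the sample variance; the standard deviation is its square root. The sketch assumes at least two points, since the divisor is (number of points - 1).

```java
class StatsFinalizerSketch {
    static double mean(double sum, double n) {
        return sum / n;
    }

    // Sample variance computed from the aggregated sum, sum of squares, and count.
    static double sampleVariance(double sum, double sumSq, double n) {
        double mean = sum / n;
        return (n * mean * mean - 2 * mean * sum + sumSq) / (n - 1);
    }

    // Standard deviation is the square root of the sample variance.
    static double stdDev(double sum, double sumSq, double n) {
        return Math.sqrt(sampleVariance(sum, sumSq, n));
    }
}
```

For two readings of 70 and 72 (sum 142, sum of squares 10084, 2 points), the mean is 71, the sample variance is 2, and the standard deviation is the square root of 2.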
Java Program Solution
 All control of the process is handled by the runnable Program.jar
 Takes a path to a local xml file containing access
information to the machines and accounts required to do
the job (access.xml) and parses that file for that information
 Reads MySQL sensor database and generates local text file
of database data
 If requested, uploads sensor text file and runs MapReduce
on local Hadoop cluster and gets back statistics xml file
 If requested, uploads sensor text file and MapReduce jar file
to Amazon S3 storage, runs MapReduce on Amazon Elastic
MapReduce service, and gets back output files and
generates statistics xml file
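The slides do not show the layout of access.xml, so the following is only a sketch of how such a file might be read with the standard Java DOM parser. The class name and the element names used in the usage note (`access`, `mysqlHost`, `hadoopUser`) are invented for illustration. The sketch parses a string to stay self-contained; `DocumentBuilder.parse` also accepts a `java.io.File` for reading from a path.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

class AccessConfigSketch {
    // Returns the text of every direct child element of the root, keyed by element name.
    static Map<String, String> parse(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            Map<String, String> settings = new LinkedHashMap<String, String>();
            NodeList children = doc.getDocumentElement().getChildNodes();
            for (int i = 0; i < children.getLength(); i++) {
                Node node = children.item(i);
                if (node.getNodeType() == Node.ELEMENT_NODE) {
                    settings.put(node.getNodeName(), node.getTextContent().trim());
                }
            }
            return settings;
        } catch (Exception e) {
            // Wrap parser and I/O failures so callers get a single unchecked error type.
            throw new IllegalArgumentException("Could not parse access file", e);
        }
    }
}
```

For example, parsing `<access><mysqlHost>db.example.edu</mysqlHost><hadoopUser>hduser</hadoopUser></access>` would yield a map with the keys `mysqlHost` and `hadoopUser`.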
Java Program for Hadoop
 Processes access.xml to get access information (hostnames,
passwords, etc.)
Reads MySQL database and generates text file
Uploads text file to Hadoop cluster
Puts text file into Hadoop distributed file system
Deletes output directory in cluster and distributed file system
Runs MapReduce with jar stored on cluster (automatically
uploaded with my Ant script, but needs uploading only once)
Gets output directory from distributed file system
Runs FileReducer to process output files to xml document with
jar stored on the cluster (automatically uploaded with my Ant
script, needs uploading only once)
Downloads output xml from cluster back to local machine
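The distributed file system steps above map onto standard `hadoop` command-line calls. The sketch below only builds the argument lists in the order the slide gives; it does not show how the author's program actually invoked them on the cluster. The class name and all paths are hypothetical, and the `hadoop jar` line assumes the jar's manifest names the main class and that the driver takes input and output paths as its arguments. `-rmr` is the recursive delete of Hadoop releases from that period; later releases use `-rm -r`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class HadoopStepsSketch {
    // Builds the command sequence: put input, clear old output, run job, fetch output.
    static List<List<String>> commands(String localCsv, String hdfsIn, String hdfsOut,
                                       String jobJar, String localOut) {
        List<List<String>> steps = new ArrayList<List<String>>();
        steps.add(Arrays.asList("hadoop", "fs", "-put", localCsv, hdfsIn));
        steps.add(Arrays.asList("hadoop", "fs", "-rmr", hdfsOut));
        steps.add(Arrays.asList("hadoop", "jar", jobJar, hdfsIn, hdfsOut));
        steps.add(Arrays.asList("hadoop", "fs", "-get", hdfsOut, localOut));
        return steps;
    }
}
```

Each list could be passed to `new ProcessBuilder(step)` on a machine where the `hadoop` command is available.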
Java Program for Amazon Web Services, Elastic MapReduce and S3
 Processes access.xml to get access information
(hostnames, passwords, etc.)
Reads MySQL database and generates text file
Deletes appropriate files and directories in S3 storage
(MapReduce jar file, input and output folders)
Uploads text file and MapReduce jar file to S3 storage
Runs a MapReduce Job Flow in Elastic MapReduce
Downloads the output directory from S3 storage back to
the local machine
Runs the FileReducer logic on the local output files to
generate an output xml file
Custom Ant Build
 To facilitate development, a custom Ant script
was used to automatically generate all the jar
files for use in the processes described
previously, upload the jar files to the Hadoop
cluster (using scp), and run Program.jar
 After the Ant build is run once, output xml files
can be generated simply by running Program.jar
alone (this can be set up in a script to run
at a given frequency and update the output automatically)
 Everything is automated and well defined!
Setting Up and Running the
Make sure to have the latest Java JDK installed (in Ubuntu, open Synaptic Package
Manager (SPM), search for the “default-jdk” package, mark it for installation, and apply)
Get the latest MySQL Java driver, Connector/J, and note the location of the installed
driver jar (on Linux at the time this was /usr/share/java/mysql-connector-java5.1.10.jar) (in Ubuntu, use SPM to install the “libmysql-java” package)
Setup Eclipse with AWS SDK for Java (Instructions can be found at
Amounts to adding a software site ( and installing plugins
from the site
Note access key and secret key of your AWS account (instructions provided on website)
Import the existing Eclipse project source (File > New Project…, select existing
project and browse to source directory)
Edit the access.xml file with all the machine and account information required.
Reflect some of that data to the Ant build.xml script in the project.
Run the Ant build.xml script (right click on build.xml and select Run As… > Ant Build)
Subsequent runs only need to call the runnable jar Program.jar in the
build/jars/ folder of the project.
 Java and Ant were used to automate the entire
process of getting sensor data and running
Hadoop MapReduce to get statistics results
both on the local cluster and in AWS
 Please refer to the source code for file details
 Any questions?