COMP3019 Coursework 2014 - Electronics and Computer Science

UNIVERSITY OF SOUTHAMPTON
ELECTRONICS AND COMPUTER SCIENCE
Lecturer Coursework Hand-out Form
Course Code: COMP3019
Lecturer: Jeff Reeve
Due Time: 4pm
Assignment Number: 1
Due Date: Mar 27 2014
Mark Contribution: 30%
Coursework is the continuously assessed part of the examination and is a required part of
the degree assessment.
Students' attention is drawn to the appropriate section of the course handbook that
discusses the originality of work.
Aim:
The overall aim is for you to develop an understanding of grid systems on two levels:
understanding the architectural and operational concepts associated with grid systems, and
developing a basic workflow that uses services provided by a real grid implementation.
The coursework is therefore in two distinct parts. The first part aims to improve your
understanding of grid concepts via a lightweight grid implementation, m-grid.
The second part aims to familiarize you with the Google MapReduce framework, and
requires you to extend pseudocode to implement MapReduce, distributing Map jobs
against an example grid service provider.
Objectives:
1. To equip students to drive a lightweight grid implementation to solve a problem
that can benefit from using grid technology.
2. To develop an understanding of the basic mechanisms used to solve such
problems.
3. To develop a general architectural and operational understanding of typical
production-level grid software.
4. To develop the programming skills required to drive typical services on a
production-level grid.
Resources
Resources for this coursework can be found at
http://www.ecs.soton.ac.uk/~stc/COMP3019/.
Part 1
This involves writing a Java program to perform a simple task, submitting it to m-grid,
and obtaining the results.
(a) Write an m-grid Java applet that takes two arguments: a single character and a text file.
The program should analyse the text file and output how many times the character appears
in it. A valid character is any of a-z (lowercase only), a full stop ‘.’ or a space ‘ ’.
(b) Run the program locally using Appletviewer, using the supplied text file and counting
the number of times the letter ‘e’ occurs in it. Describe the results.
(c) Submit the applet to an m-grid server with an appropriate set of parameters in an
m-grid parameter file to count the following characters in the supplied text file: a-z
(inclusive, lowercase only), full stop ‘.’ and space ‘ ’. When submitting, ensure that you
‘volunteer’ a browser to m-grid as an execution node. Monitor the progress of the jobs
until they are complete.
Explain, using your submitted jobs as examples, the process of submitting jobs to m-grid
and their subsequent execution, with particular emphasis on the following:
- How m-grid processes the job with respect to input
- The role of execution nodes
Reference the appropriate web page of m-grid to illustrate how each is achieved.
(d) Using m-grid as an example grid platform, describe an approach that uses duplicate
job submissions to verify results (i.e. to detect errors in the results) in each of the
following cases:
- The duplicate submissions are made manually by a user.
- The duplicate submissions are performed automatically by the m-grid server itself for
  every job submission.
What are the advantages and disadvantages of each of the above approaches?
(e) Explain the type(s) of problems that can benefit from grid technologies, using at least
one example application to illustrate the advantages.
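As an illustration of the counting task in part (a), the following is a minimal Java sketch of the core logic only. It assumes the text has already been read into memory and does not use the m-grid applet interface, which is described in the course resources. The class name CharCounter and its method names are hypothetical.

```java
/** Hypothetical helper: counts occurrences of one valid character in a text. */
final class CharCounter {

    /** A character is valid if it is a-z (lowercase), a full stop, or a space. */
    static boolean isValid(char c) {
        return (c >= 'a' && c <= 'z') || c == '.' || c == ' ';
    }

    /** Returns how many times target occurs in text; rejects invalid targets. */
    static long count(CharSequence text, char target) {
        if (!isValid(target)) {
            throw new IllegalArgumentException("Invalid character: '" + target + "'");
        }
        long total = 0;
        for (int i = 0; i < text.length(); i++) {
            if (text.charAt(i) == target) {
                total++;
            }
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println(count("the end.", 'e')); // prints 2
    }
}
```

Validating the character before counting keeps the m-grid parameter file honest: a job submitted with an unsupported character fails immediately rather than silently returning zero.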
Part 2
This involves writing Java code that uses the GridSAM API to accomplish job submission and
monitoring, and extending a basic MapReduce framework in pseudocode.
Download GridSAM from the link provided and install it by following the instructions in the
slides. Next, download the COMP3019-materials.tgz file from the link on the resources
page and unpack it within the GridSAM client install directory (e.g.
/home/user/gridsam-2.3.0-client). This contains the Java code and the compile and run
scripts needed for this part of the coursework. You’ll also need to use the GridSAM FTP
server (also covered in the slides) to stage input data into GridSAM and stage output data
out of GridSAM. Alternatively, you can use a different FTP server (e.g. vsftpd).
See the resource pages for the location of the servers that can be used.
(a) Basic job submission and monitoring using GridSAM. See the GridSAM Java API docs
(accessible from the coursework help page) and the JSDL examples in
gridsam-2.3.0-client/examples for reference.
The GridSAMExample.java code is unfinished. It only includes code for setting up a job
manager and submitting to a GridSAM server:
(i) Complete the createJSDLDescription() stub so that it returns valid JSDL that
executes execName with arguments args and returns standard error and standard
output to a local client FTP server.
(ii) Complete the process so that the code is also able to monitor the job until
completion using its unique job ID as a reference (hint: look at the JobManager
interface to start).
(iii) Run the resultant program against a GridSAM service using ‘/bin/echo’ as the
execName, and ‘Hello World!’ as the args. Briefly describe the process of how
GridSAM handles your submission from job submission to completion.
NB: You’ll need to launch a GridSAM FTP server on your client machine so that
the standard error and standard output files can be staged back to your program.
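To give a feel for the kind of document that step (i) asks for, here is a minimal sketch of a JSDL description built as a Java string. It uses the standard JSDL and JSDL-POSIX element names, but the three-argument method signature, the class name JsdlSketch, the output file names and the FTP address ftp://localhost:2121/ are assumptions made for illustration; the real stub in GridSAMExample.java and the examples shipped with GridSAM take precedence.

```java
/** Hypothetical builder for a minimal JSDL document; not the coursework stub itself. */
final class JsdlSketch {

    private static final String JSDL_NS = "http://schemas.ggf.org/jsdl/2005/11/jsdl";
    private static final String POSIX_NS = "http://schemas.ggf.org/jsdl/2005/11/jsdl-posix";

    /** Escapes the characters that are special inside XML text content. */
    static String escape(String text) {
        return text.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    /** Builds one DataStaging element that copies fileName to ftpBase when the job ends. */
    static String stageOut(String fileName, String ftpBase) {
        return "    <DataStaging>\n"
             + "      <FileName>" + escape(fileName) + "</FileName>\n"
             + "      <CreationFlag>overwrite</CreationFlag>\n"
             + "      <Target><URI>" + escape(ftpBase + fileName) + "</URI></Target>\n"
             + "    </DataStaging>\n";
    }

    /**
     * Returns JSDL that runs execName with args passed as a single argument.
     * ASSUMPTION: ftpBase (e.g. "ftp://localhost:2121/") points at the client FTP server.
     */
    static String createJSDLDescription(String execName, String args, String ftpBase) {
        return "<JobDefinition xmlns=\"" + JSDL_NS + "\">\n"
             + "  <JobDescription>\n"
             + "    <Application>\n"
             + "      <POSIXApplication xmlns=\"" + POSIX_NS + "\">\n"
             + "        <Executable>" + escape(execName) + "</Executable>\n"
             + "        <Argument>" + escape(args) + "</Argument>\n"
             + "        <Output>stdout.txt</Output>\n"
             + "        <Error>stderr.txt</Error>\n"
             + "      </POSIXApplication>\n"
             + "    </Application>\n"
             + stageOut("stdout.txt", ftpBase)
             + stageOut("stderr.txt", ftpBase)
             + "  </JobDescription>\n"
             + "</JobDefinition>\n";
    }

    public static void main(String[] argv) {
        System.out.print(
            createJSDLDescription("/bin/echo", "Hello World!", "ftp://localhost:2121/"));
    }
}
```

The two DataStaging elements with a Target are what cause standard output and standard error to be copied back to the client once the job has finished.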
(b) Implementing basic MapReduce. A basic MapReduce system, initially designed to
run on a single machine to count words within multiple files, looks like the following in
pseudocode.
The MapReduce() top-level function takes an inputList of (fileName, fileLocation)
elements, and runs:
- Map() with a given mapFunction that’s applied to each element in that list to
  produce an intermediateList of (fileName, wordCount) elements.
- Reduce(), which takes the intermediateList, firstly groups the elements with the
  same key together, adding them to groupList, and runs the reduceFunction on
  each element within that list, returning outputList with (fileName, totalCount)
  elements.
Function MapReduce(inputList, mapFunction, reduceFunction)
    intermediateList = Map(inputList, mapFunction)
    finalResultsList = Reduce(intermediateList, reduceFunction)
    Return finalResultsList
End Function

Function Map(inputList, mapFunction)
    outputList = new List
    # Apply mapFunction to each (fileName, fileLocation) in inputList,
    # output to outputList
    For Each (fileName, fileLocation) In inputList
        (fileName, wordCount) = mapFunction(fileName, fileLocation)
        Add (fileName, wordCount) To outputList
    Next
    Return outputList
End Function

Function Reduce(intermediateList, reduceFunction)
    groupList = new List
    outputList = new List
    # Group together value elements in intermediateList by their key,
    # output to groupList
    For Each (fileName, wordCount) In intermediateList
        # If fileName can be found in groupList, append wordCount to its list
        found = false
        For Each (s_fileName, s_wordCountList) In groupList
            If (fileName = s_fileName) Then
                Replace (s_fileName, s_wordCountList) In groupList
                    With (s_fileName, s_wordCountList + [wordCount])
                found = true
            End If
        Next
        # If fileName can’t be found in groupList, add it to groupList
        If Not found Then Add (fileName, [wordCount]) To groupList
    Next
    # Apply reduceFunction to each key/value pair in groupList,
    # output to outputList
    For Each (fileName, wordCountList) In groupList
        totalCount = reduceFunction(fileName, wordCountList)
        Add (fileName, totalCount) To outputList
    Next
    Return outputList
End Function
The following are the actual functions that perform the map and reduce tasks specific to
wordcount:
Function mapFunction(fileName, fileLocation)
    matchCount = countMatches(<some_word>, fileLocation/fileName)
    Return (fileName, matchCount)
End Function

Function reduceFunction(fileName, countList)
    totalCount = 0
    For Each count In countList
        totalCount += count
    Next
    Return totalCount
End Function
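For readers who prefer to see the single-machine version in Java, the pseudocode above might be sketched as follows. This is an illustration under stated assumptions, not part of the supplied materials: the class LocalMapReduce is hypothetical, the "file location" in the demo is replaced by the file's text held in memory so that the example runs without any files, and the result is returned as an ordered map rather than a list of pairs.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BiFunction;

/** Hypothetical single-machine rendering of the MapReduce pseudocode. */
final class LocalMapReduce {

    /** Applies mapFunction to every (fileName, fileLocation) pair. */
    static List<Map.Entry<String, Integer>> map(
            List<Map.Entry<String, String>> inputList,
            BiFunction<String, String, Integer> mapFunction) {
        List<Map.Entry<String, Integer>> outputList = new ArrayList<>();
        for (Map.Entry<String, String> input : inputList) {
            int wordCount = mapFunction.apply(input.getKey(), input.getValue());
            outputList.add(Map.entry(input.getKey(), wordCount));
        }
        return outputList;
    }

    /** Groups counts by fileName, then applies reduceFunction to each group. */
    static Map<String, Integer> reduce(
            List<Map.Entry<String, Integer>> intermediateList,
            BiFunction<String, List<Integer>, Integer> reduceFunction) {
        Map<String, List<Integer>> groups = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> entry : intermediateList) {
            groups.computeIfAbsent(entry.getKey(), k -> new ArrayList<>()).add(entry.getValue());
        }
        Map<String, Integer> output = new LinkedHashMap<>();
        for (Map.Entry<String, List<Integer>> group : groups.entrySet()) {
            output.put(group.getKey(), reduceFunction.apply(group.getKey(), group.getValue()));
        }
        return output;
    }

    static Map<String, Integer> mapReduce(
            List<Map.Entry<String, String>> inputList,
            BiFunction<String, String, Integer> mapFunction,
            BiFunction<String, List<Integer>, Integer> reduceFunction) {
        return reduce(map(inputList, mapFunction), reduceFunction);
    }

    /** Stand-in for countMatches: counts whitespace-separated tokens equal to word. */
    static int countMatches(String word, String text) {
        int count = 0;
        for (String token : text.split("\\s+")) {
            if (token.equals(word)) {
                count++;
            }
        }
        return count;
    }

    /** The wordcount reduceFunction: sums the counts for one file. */
    static int sum(String fileName, List<Integer> counts) {
        int total = 0;
        for (int count : counts) {
            total += count;
        }
        return total;
    }

    /** ASSUMPTION: the second element of each pair holds the text itself, not a path. */
    static Map<String, Integer> demo() {
        List<Map.Entry<String, String>> inputList = List.of(
                Map.entry("a.txt", "the cat the dog"),
                Map.entry("b.txt", "the end"),
                Map.entry("a.txt", "the the the"));
        return mapReduce(inputList, (name, text) -> countMatches("the", text), LocalMapReduce::sum);
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints {a.txt=5, b.txt=1}
    }
}
```

The demo deliberately lists a.txt twice to show why the grouping step matters: both counts for that file end up in one group before they are summed.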
However, we want to extend this so that the Map() function can submit mapFunction jobs
to the Grid to parallelise their computation (whilst retaining the Reduce() processing on
the local machine). Let us define the following conceptual primitives for dealing with
jobs within the Grid (which could be managed by GridSAM):
- gridSubmit(function, argumentList): writes JSDL that executes function
  executable on its argumentList on the Grid service, and submits that job to the
  Grid service.
- gridJobFinished(jobID): queries the status of the given job, and returns true if the
  job is complete, else returns false.
- hostedInputFileLocation = gridCopyToDataServer(fileLocation/fileName): copies
  the given local file to a locally hosted directory that is exposed through FTP (so
  the Grid service can retrieve the given file from it for input data). Returns where
  the file has been copied to.
- fileContents = gridCopyFromDataServer(hostedOutputFileLocation): retrieves
  the given output file generated by the job and returns the file contents.
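One way to picture the primitives listed above is as a Java interface, together with a helper that polls a job until it completes. Everything in this sketch is hypothetical: the interface GridPrimitives, the assumption that gridSubmit returns a job ID, the fixed polling interval, and the in-memory FakeGrid that stands in for a real Grid service by reporting each job as finished on its third status query.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper showing the submit-then-poll pattern for one job. */
final class GridPolling {

    /** Blocks until the job is finished, polling at a fixed interval. */
    static void awaitJob(GridPrimitives grid, String jobID, long pollMillis) {
        while (!grid.gridJobFinished(jobID)) {
            try {
                Thread.sleep(pollMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("Interrupted while waiting for " + jobID, e);
            }
        }
    }

    public static void main(String[] args) {
        GridPrimitives grid = new FakeGrid();
        String hosted = grid.gridCopyToDataServer("a.txt");
        String jobID = grid.gridSubmit("mapFunction", List.of(hosted));
        awaitJob(grid, jobID, 10);
        System.out.println(jobID + " finished"); // prints job-1 finished
    }
}

/** Hypothetical Java rendering of the four conceptual primitives. */
interface GridPrimitives {
    /** ASSUMPTION: returns a job ID that gridJobFinished accepts. */
    String gridSubmit(String function, List<String> argumentList);

    boolean gridJobFinished(String jobID);

    String gridCopyToDataServer(String localFile);

    String gridCopyFromDataServer(String hostedOutputFileLocation);
}

/** In-memory stand-in: every job reports finished from its third status query on. */
final class FakeGrid implements GridPrimitives {
    private final Map<String, Integer> pollsSeen = new HashMap<>();
    private int nextId = 1;

    @Override
    public String gridSubmit(String function, List<String> argumentList) {
        String jobID = "job-" + nextId++;
        pollsSeen.put(jobID, 0);
        return jobID;
    }

    @Override
    public boolean gridJobFinished(String jobID) {
        if (!pollsSeen.containsKey(jobID)) {
            throw new IllegalArgumentException("Unknown job: " + jobID);
        }
        return pollsSeen.merge(jobID, 1, Integer::sum) >= 3;
    }

    @Override
    public String gridCopyToDataServer(String localFile) {
        return "ftp://localhost/" + localFile; // hypothetical hosted location
    }

    @Override
    public String gridCopyFromDataServer(String hostedOutputFileLocation) {
        return ""; // the fake produces no output
    }
}
```

The point of the fake is that submission returns immediately while completion is only observable through repeated status queries, which is the asynchronous behaviour any code built on these primitives has to accommodate.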
Define a replacement Map() function, written in pseudocode, that takes the same
arguments and uses the above primitives to submit the wordcount mapFunction jobs to
the Grid service for each input file in the inputList. You’ll need to take into account the
possibility that the same file (with the same filename) may be processed more than once,
and the asynchronous nature of submitting jobs to the Grid service and monitoring them
to completion. The Map() function should return the outputList only after all jobs are
complete and their results have been processed.
NB: you don’t need to write any pseudocode descriptions for the above function
primitives, but state any assumptions you make about how they operate where necessary.
What to Hand In
Submit source code for parts 1(a), 2(a)(i) and 2(a)(ii), the m-grid parameter file and results
file(s) for 1(c), and trace output where it helps to answer the question. The other parts
that require written answers should form a separate document (in text, PDF or Microsoft
Word format), up to a maximum of 800 words in length, not including any trace output.
- Part 1 – 15%: (a): 3%, (b): 1%, (c): 4%, (d): 4%, (e): 3%
- Part 2 – 15%: (a)(i): 3%, (a)(ii): 3%, (a)(iii): 3%, (b): 6%
Submission instructions will be available on the ECS Coursework Submission System.
Deadline
The deadline is March 27 2014 at 4pm.