UNIVERSITY OF SOUTHAMPTON
ELECTRONICS AND COMPUTER SCIENCE

Lecturer Coursework Hand-out Form

Course Code         Lecturer      Due Time
COMP3019            Jeff Reeve    4pm

Assignment Number   Due Date      Mark Contribution
1                   Mar 27 2014   30%

Coursework is the continuously assessed part of the examination and is a required part of the degree assessment. Students' attention is drawn to the appropriate section of the course handbook that discusses the originality of work.

Aim:
The overall aim is for you to develop an understanding of grid systems on two levels: the architectural and operational concepts associated with grid systems, and developing a basic workflow that uses services provided by a real grid implementation. The coursework is therefore in two distinct parts. The first part aims to improve your understanding of grid concepts via a lightweight grid implementation, m-grid. The second part aims to familiarize you with the Google MapReduce framework, and requires you to extend pseudocode to implement MapReduce, distributing Map jobs against an example grid service provider.

Objectives:
1. To equip students to drive a lightweight grid implementation to solve a problem that can benefit from using grid technology.
2. To develop an understanding of the basic mechanisms used to solve such problems.
3. To develop a general architectural and operational understanding of typical production-level grid software.
4. To develop the programming skills required to drive typical services on a production-level grid.

Resources
Resources for this coursework can be found at http://www.ecs.soton.ac.uk/~stc/COMP3019/.

Part 1
This involves writing a Java program to perform a simple task, submitting it to m-grid, and obtaining the results.

(a) Write an m-grid Java applet that takes two arguments: a single character and a textfile. The program should analyse the textfile and output how many times the character appears in the textfile.
A valid character can be anything from a-z (lowercase only), a full stop ‘.’ or a space ‘ ’.

(b) Run the program locally using Appletviewer, using the supplied textfile and counting the number of times the letter ‘e’ occurs in the textfile. Describe the results.

(c) Submit the applet to an m-grid server with an appropriate set of parameters in an m-grid parameter file to count the following characters in the supplied textfile: a-z (inclusive, lowercase only), full stop ‘.’ and space ‘ ’. When submitting, ensure that you ‘volunteer’ a browser to m-grid as an execution node. Monitor the progress of the jobs until they are complete. Explain, using your submitted jobs as examples, the process of submitting jobs to m-grid and their subsequent execution, with particular emphasis on the following:
- How m-grid processes the job with respect to input
- The role of execution nodes
Reference the appropriate web page of m-grid to illustrate how each is achieved.

(d) Using m-grid as an example grid platform, describe an approach that uses duplicate job submissions to verify results (i.e. for detecting errors in the results) for each of the following cases:
- Done manually by a user?
- Performed automatically by the m-grid server itself for every job submission?
What are the advantages and disadvantages of each of the above approaches?

(e) Explain the type(s) of problems that can benefit from grid technologies, using at least one example application to illustrate the advantages.

Part 2
This involves writing Java that uses the GridSAM API to accomplish job submission and monitoring, and extending a basic MapReduce framework in pseudocode.

Download GridSAM from the link provided and install it following the instructions in the slides. Next, download the COMP3019-materials.tgz file from the link on the resources page and unpack it within the GridSAM client install directory (e.g. /home/user/gridsam2.3.0-client).
This contains the necessary Java code and the compile and run scripts needed for this part of the coursework. You’ll also need to use the GridSAM FTP server (also covered in the slides) to stage input data into GridSAM and stage output data out of GridSAM. Alternatively, you can use a different FTP server (e.g. vsftpd). See the resources page for the location of the servers that can be used.

(a) Basic job submission and monitoring using GridSAM. See the GridSAM Java API docs (accessible from the coursework help page) and the JSDL examples in gridsam-2.3.0client/examples for reference. The GridSAMExample.java code is unfinished. It only includes code for setting up a job manager and submitting to a GridSAM server:
(i) Complete the createJSDLDescription() stub so that it returns valid JSDL that executes execName with arguments args and returns standard error and standard output to a local client FTP server.
(ii) Complete the process so that the code is also able to monitor the job until completion using its unique job ID as a reference (hint: look at the JobManager interface to start).
(iii) Run the resultant program against a GridSAM service using ‘/bin/echo’ as the execName, and ‘Hello World!’ as the args. Briefly describe the process of how GridSAM handles your submission from job submission to completion.
NB: You’ll need to launch a GridSAM FTP server on your client machine so that the standard error and standard output files can be staged back to your program.

(b) Implementing basic MapReduce. A basic MapReduce system, initially designed to run on a single machine to count words within multiple files, looks like the following in pseudocode. The MapReduce() top-level function takes an inputList of (fileName, fileLocation) elements, and runs:
- Map() with a given mapFunction that’s applied to each element in that list to produce an intermediateList of (fileName, wordCount) elements.
- Reduce(), which takes the intermediateList, first groups the elements with the same key together, adding them to groupList, and then runs the reduceFunction on each element within that list, returning outputList with (fileName, totalCount) elements.

    Function MapReduce(inputList, mapFunction, reduceFunction)
        intermediateList = Map(inputList, mapFunction)
        finalResultsList = Reduce(intermediateList, reduceFunction)
        Return finalResultsList
    End Function

    Function Map(inputList, mapFunction)
        outputList = new List
        # Apply mapFunction to each fileName/fileLocation in inputList
        # output to outputList
        For Each (fileName, fileLocation) In inputList
            (fileName, wordCount) = mapFunction(fileName, fileLocation)
            Add (fileName, wordCount) To outputList
        Next
        Return outputList
    End Function

    Function Reduce(intermediateList, reduceFunction)
        groupList = new List
        outputList = new List
        # Group together value elements in intermediateList by their key
        # output to groupList
        For Each (fileName, wordCount) In intermediateList
            # If it can be found in groupList, add to its groupList total
            found = false
            For Each (s_fileName, s_wordCountList) In groupList
                If (fileName = s_fileName) Then
                    Replace (s_fileName, s_wordCountList) In groupList With (s_fileName, s_wordCountList + [wordCount])
                    found = true
                End If
            Next
            # If it can’t be found in groupList, add it to groupList
            If Not found Then
                Add (fileName, [wordCount]) To groupList
            End If
        Next
        # Apply reduceFunction to each key/value pair in groupList
        # output to outputList
        For Each (fileName, wordCountList) In groupList
            totalCount = reduceFunction(fileName, wordCountList)
            Add (fileName, totalCount) To outputList
        Next
        Return outputList
    End Function

The following are the actual functions that perform the map and reduce tasks specific to wordcount:

    Function mapFunction(fileName, fileLocation)
        matchCount = countMatches(<some_word>, fileLocation/fileName)
        Return (fileName, matchCount)
    End Function

    Function reduceFunction(fileName, countList)
        totalCount = 0
        For Each count In countList
            totalCount += count
        Next
        Return totalCount
    End Function

However, we want to extend this so that the Map() function can submit mapFunction jobs to the Grid to parallelise their computation (whilst retaining the Reduce() processing on the local machine). Let us define the following conceptual primitives for dealing with jobs within the Grid (which could be managed by GridSAM):
- gridSubmit(function, argumentList): writes JSDL that executes the function executable on its argumentList on the Grid service, and submits that job to the Grid service.
- gridJobFinished(jobID): queries the status of the given job, and returns true if the job is complete, else returns false.
- hostedInputFileLocation = gridCopyToDataServer(fileLocation/fileName): copies the given local file to a locally hosted directory that is exposed through FTP (so the Grid service can retrieve the given file from it for input data). Returns where the file has been copied to.
- fileContents = gridCopyFromDataServer(hostedOutputFileLocation): retrieves the given output file generated by the job and returns the file contents.

Define a replacement Map() function, written in pseudocode, that takes the same arguments and uses the above primitives to submit the wordcount mapFunction jobs to the Grid service for each input file in the inputList. You’ll need to take into account the possibility that the same file (with the same filename) may be processed more than once, and the asynchronous nature of submitting jobs to the Grid service and monitoring them to completion. The Map() function should return the outputList only after all jobs are complete and their results have been processed.

NB: you don’t need to write any pseudocode descriptions for the above function primitives, but state any assumptions you make about how they operate where necessary.
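
For reference only, the single-machine pseudocode in part (b) can be transcribed into runnable Python roughly as follows. This is a minimal sketch under stated assumptions, not part of the required answer and not the Grid-enabled Map(): TARGET_WORD is a hypothetical stand-in for <some_word>, count_matches is an assumed whitespace-token implementation of countMatches, and map_function is assumed to return a single (fileName, matchCount) pair, which is how Map() unpacks it. A dictionary replaces the nested search through groupList.

```python
import os
from collections import defaultdict

# Hypothetical stand-in for <some_word> in the pseudocode.
TARGET_WORD = "grid"


def count_matches(word, path):
    """Assumed countMatches: count whitespace-separated tokens equal to word."""
    with open(path, encoding="utf-8") as handle:
        return sum(1 for token in handle.read().split() if token == word)


def map_function(file_name, file_location):
    # Emit one (key, value) pair per input file.
    path = os.path.join(file_location, file_name)
    return (file_name, count_matches(TARGET_WORD, path))


def reduce_function(file_name, count_list):
    # Sum every count collected for the same file name.
    return sum(count_list)


def map_phase(input_list, map_fn):
    # Apply map_fn to each (fileName, fileLocation) element, in order.
    return [map_fn(name, location) for name, location in input_list]


def reduce_phase(intermediate_list, reduce_fn):
    # Group values by key; dictionaries preserve first-seen key order.
    groups = defaultdict(list)
    for name, count in intermediate_list:
        groups[name].append(count)
    # Apply reduce_fn once per key.
    return [(name, reduce_fn(name, counts)) for name, counts in groups.items()]


def map_reduce(input_list, map_fn, reduce_fn):
    return reduce_phase(map_phase(input_list, map_fn), reduce_fn)
```

Calling map_reduce(input_list, map_function, reduce_function) with an input_list in which the same file name appears twice yields a single output element for that file, with the two counts added together. Your replacement Map() needs to preserve this behaviour.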
What to Hand In
Submit source code for parts 1(a), 2(a)(i) and 2(a)(ii), the m-grid parameter file and results file(s) for 1(c), and trace output where it helps to answer the question. The other parts that require written answers should form a separate document (in text, PDF or Microsoft Word format), up to a maximum of 800 words in length, not including any trace output.

Part 1 – 15%: (a): 3%, (b): 1%, (c): 4%, (d): 4%, (e): 3%
Part 2 – 15%: (a)(i): 3%, (a)(ii): 3%, (a)(iii): 3%, (b): 6%

Submission instructions will be available on the ECS Coursework Submission System.

Deadline
The deadline is March 27 2014 at 4pm.