Revised Presentation One
3/21/2013

Outline
• Title
• Outline
• Members
• Mentor
• Societal Issue
• History
• Dr. Li
• Cluster Computing
• Case Study
• Accuracy
• Current Major Functional Component Diagram
• Current Process Flow
• Problem Statement
• Proposed Major Functional Component Diagram
• Proposed Process Flow
• Dinosolve Walkthrough
• Dinosolve Issues
• Software
• Hardware
• Solution Statement
• Competition Identified
• 508 Compliance
• Objectives
• Benefits of Solution
• Conclusion
• References
• Appendix

Group Members and Roles
• Scott Pardue (Team Leader)
• Michael Rajs (Risk Manager)
• Adam Willis (Algorithm Specialist)
• Sybil Acotanza (Documentation Specialist)
• Jordan Heinrichs (Database Designer)
• David Crook (User Interface Designer)

Dr. Yaohang Li
• Associate Professor in the Department of Computer Science at Old Dominion University
• Research interests include:
  – Computational Biology: applies computational simulation techniques to solve biological problems
  – Markov Chain Monte Carlo (MCMC) methods: statistical algorithms for sampling from probability distributions
  – Parallel Distributed Grid Computing: uses multiple computers communicating via the Internet to solve a problem

How do researchers handle the massive amounts of data they are collecting in order to benefit their research?

“Every day, [mankind creates] 2.5 quintillion bytes of data, so much that 90% of the data in the world today has been created in the last two years alone.” [1]
Source: http://www-01.ibm.com/software/data/bigdata/

Data Management Examples
• Large Hadron Collider [2]
  – 150 million sensors report 40 million times per second
• Facebook, per day [3]
  – 2.5 billion content items shared
  – 2.7 billion “Likes”
  – 300 million photos uploaded
• Walmart [2]
  – 1 million customer transactions
  – 2.5 x 10^15 bytes of data
Source: http://techcrunch.com/2012/08/22/how-big-is-facebooks-data-2-5-billion-pieces-of-content-and-500-terabytes-ingested-every-day/

Dr. Li’s Research
• Ideally, his research can be used to develop new protein-modeling programs. Computational approaches can be more efficient and less expensive than experiments run in lab settings by biologists, chemists, and others.
• This work could lead to the manufacturing of additional drugs to fight conditions as varied as Alzheimer’s disease, cystic fibrosis, and mad cow disease.
Source: http://diverseeducation.com/article/13348/

Dr. Li’s Grants
• His current project, Dinosolve, is funded by a five-year, $400,000 CAREER Award from the National Science Foundation
• Dr. Li has been the principal or co-principal investigator on research grants totaling more than $15.3 million

Big Data Analysis Hardware
• Cluster Computing [4]
  – A cluster consists of many nodes (computers).
  – Big data can be generated and analyzed more quickly by spreading the workload among the nodes.
• Head node
  – Logging data
  – Job submission
• 3 computation nodes
  – 2 processors each
  – 4 execution slots per processor
  – 24 total execution slots (3 x 2 x 4)
The head node packages data from the computation nodes and presents it in a readable format so that it is usable by the research community.

Managing the Cluster
Distributed Resource Management Systems (D-RMS)
– Job management subsystem
– Physical resource management subsystem
– Scheduling and queuing subsystem

Dr.
Yaohang Li and Dinosolve
• Dinosolve examines a protein’s amino acid sequence and determines whether the protein can be manipulated by adding a disulfide bond
• Each computational result improves the prediction accuracy of future results
Source: http://hpcr.cs.odu.edu/dinosolve/index.php

Dinosolve Case Study
• Bioinformatics [7]
  – Disulfide bond prediction program
  – Disulfide bond creation is important to the research community

Dinosolve Users
• Drug design
  – Pharmaceutical companies
• Antibody design
  – To combat viruses
• Bio-energy development
  – Creation of new fuels to replace diminishing fossil fuels
• Genetic mapping [5]
  – Research to cure cancer, HIV, and other diseases

Accuracy of Popular Tools
Tool                        Accuracy
Dinosolve                   90.8%
DiANNA                      81%
Scratch Protein Predictor   87%
More users choose Dinosolve because of its higher accuracy.
References 13, 14, and 15

[Figure: Current Major Functional Component Diagram]

[Figure: Current Process Flow]

What is the problem?
• Processing big data sets is computationally expensive, and as the volume of queries grows, performance will progressively degrade until the system fails.
• 300 simultaneous requests will cause the web server to crash

[Figure: Proposed Major Functional Component Diagram]

[Figure: Proposed Process Flow]

User interface will be improved to be more aesthetically pleasing

Working with Dinosolve
• Input title
• Input protein sequence
• Input e-mail address
• Submit, then wait for confirmation...
Protein Sequence: a string of alphabetic characters, each of which represents a particular amino acid in the protein

Working with Dinosolve
• Confirmation of request
• Now wait for results

Working with Dinosolve
• Check your e-mail
• Click the link provided
• The results are displayed

Dinosolve Issues
As Dinosolve continues to grow in popularity, these issues are expected to occur:
• Strain on hardware resources for computation
  – CPU cycles
  – Memory
  – Disk space
  – Network bandwidth
• Server crashes
The goal is to prepare the system to keep supporting the research community as the number of requests grows.

Software
• Unix operating system installed on the Dinosolve cluster
• Dinosolve algorithm
• Sun Grid Engine, which will be our Distributed Resource Management System (D-RMS), installed on the cluster
• MySQL (database software)
• Web-based user interface (website)

Hardware
• MySQL database server
• A computer cluster to run the Dinosolve algorithm
• Web server for the web-based user interface

How will we correct the problem?
Configure a distributed resource management system

Competing Distributed Resource Management Systems
• Sun Grid Engine (SGE)
• Portable Batch System (PBS)
• Load Sharing Facility (LSF)

508.22 compliance percentage
Tool                        Compliance
Dinosolve                   67%
DiANNA                      85%
Scratch Protein Predictor   67%

508 compliance
• Section 508 of the Rehabilitation Act of 1973, as amended in 1998
  – Requires Federal agencies to make their electronic and information technology accessible to people with disabilities [32]
  – Enacted to eliminate barriers in information technology, to make available new opportunities for people with disabilities, and to encourage development of technologies that will help achieve these goals [32]

Why is it important to be compliant?
If an entity wishes to receive government funding, then any electronic form the entity uses must be 508 compliant.
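
Returning to the solution statement (configure a distributed resource management system): a rough sketch of why queueing helps when 300 requests arrive at once. It assumes, hypothetically, that every request is one equal-length job, that each job occupies exactly one execution slot, and that jobs are dispatched first-come, first-served. This is not Sun Grid Engine's actual scheduling policy, and the function name `schedule_in_waves` is made up for illustration. The slot counts and the 300-request figure come from the earlier slides.

```python
from collections import deque

# Cluster layout taken from the "Big Data Analysis Hardware" slide.
NODES = 3
PROCESSORS_PER_NODE = 2
SLOTS_PER_PROCESSOR = 4
TOTAL_SLOTS = NODES * PROCESSORS_PER_NODE * SLOTS_PER_PROCESSOR


def schedule_in_waves(num_requests, slots):
    """Group queued request ids into waves of at most `slots` jobs.

    Simplifying assumption: all jobs take the same time, so each wave
    finishes before the next one starts.
    """
    queue = deque(range(num_requests))
    waves = []
    while queue:
        wave_size = min(slots, len(queue))
        waves.append([queue.popleft() for _ in range(wave_size)])
    return waves


waves = schedule_in_waves(300, TOTAL_SLOTS)
print(TOTAL_SLOTS)      # 24
print(len(waves))       # 13
print(len(waves[-1]))   # 12
```

Under these assumptions the cluster never runs more than 24 jobs at a time, and the 300 requests are served in 13 waves instead of all at once.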

Objectives
• Interpret and visualize current usage statistics
• Configure, utilize, and optimize the SGE
• Build an aesthetically pleasing and professional user interface

What benefits will come from attaining the goals?
• Efficient utilization of available resources
• Increased throughput of the cluster
• An intuitive and professional user interface
• Rise in popularity due to excellent accuracy, efficiency, and professional design

Conclusion
With the updated user interface and a correctly configured Sun Grid Engine, Dr. Li hopes to establish a reputable, reliable, and aesthetically pleasing disulfide bonding prediction server.

References for history
1. http://www-01.ibm.com/software/data/bigdata/
2. http://en.wikipedia.org/wiki/Big_data
3. http://techcrunch.com/2012/08/22/how-big-is-facebooks-data-2-5-billion-pieces-of-content-and-500-terabytes-ingested-every-day/
4. http://en.wikipedia.org/wiki/Computer_cluster

References for case study
5. Li, Y. (2010, September 1). CAREER: Novel Sampling Approaches for Protein Modeling Applications [Abstract]. National Science Foundation Award Abstract #1066471.
6. Li, Y., & Yaseen, A. (2012). Enhancing Protein Disulfide Bonding Prediction Accuracy with Context-based Features. Biotechnology and Bioinformatics Symposium.
7. Bioinformatics. (2011). In Merriam-Webster.com. Retrieved February 15, 2013, from http://www.merriam-webster.com/dictionary/bioinformatics
8. Cronk, J. D. (2012). Disulfide Bond. Retrieved February 15, 2013, from Biochemistry Dictionary: http://guweb2.gonzaga.edu/faculty/cronk/biochem/Dindex.cfm?definition=disulfide_bond
9. Yan, Y., & Chapman, B. (2008). Comparative Study of Distributed Resource Management Systems: SGE, LSF, PBS Pro, and LoadLeveler. Technical report, CiteSeerX.
10. Li, Y., & Yaseen, A. (2012). Dinosolve. Retrieved from http://hpcr.cs.odu.edu/dinosolve/

References for competition
11. Arvind Krishna, “Why Big Data?
Why Now?”, IBM, 2011. URL: http://almaden.ibm.com/colloquium/resources/Why%20Big%20Data%20Krishna.PDF
12. Yonghong Yan and Barbara M. Chapman, “Comparative Study of Distributed Resource Management Systems - SGE, LSF, PBS Pro, and LoadLeveler”, Department of Computer Science, University of Houston, May 2005 (PDF).
13. Dr. Li’s site: http://hpcr.cs.odu.edu/dinosolve/
14. Scratch Predictor: http://scratch.proteomics.ics.uci.edu/
15. DiANNA server: http://clavius.bc.edu/~clotelab/DiANNA/

Portable Batch System (PBS)
16. http://resources.altair.com/pbs/documentation/support/PBSProUserGuide12-2.pdf
17. http://www.pbsworks.com/SupportDocuments.aspx?AspxAutoDetectCookieSupport=1
18. http://resources.altair.com/pbs/documentation/support/PBSProRefGuide12-2.pdf
19. http://resources.altair.com/pbs/documentation/support/PBSProAdminGuide12-2.pdf
20. http://www.pbsworks.com/(S(tykrsyqbemmlf3o5zwrmjrgf))/images/solutions-en-US/PBS-Pro_Datasheet-USA_WEB.pdf
21. http://agendafisica.files.wordpress.com/2011/05/pbs.pdf

Moab HPC Suite
22. http://www.adaptivecomputing.com/publication/420/wppa_open/

IBM Platform LSF
23. http://public.dhe.ibm.com/common/ssi/ecm/en/dcd12354usen/DCD12354USEN.PDF

Apache Hadoop with Zookeeper
24. http://zookeeper.apache.org/doc/current/zookeeperOver.html
25. http://www.cloud-net.org/~swsellis/tech/solaris/performance/doc/blueprints/0102/jobsys.pdf

Reference for 508 Compliance
26.
http://en.wikipedia.org/wiki/Section_508_Amendment_to_the_Rehabilitation_Act_of_1973

Appendix
• Competition Matrix for Resource Management Systems
• 508.22 Compliance Statistics for Dinosolve

Competing Resource Management Systems
Features of systems                              PBS                 LSF                 SGE
Supported platforms                              Unix                Unix & NT           Unix
Multi-cluster support                            Yes                 Yes                 No
System level checkpoint restart                  No                  Yes                 Yes
User level checkpoint restart                    No                  No                  Yes
Large computational grid support                 No                  No                  No
Massive scalability                              Yes                 Yes                 Yes
Parallel job support with Sun HPC ClusterTools   Loose integration   Tight integration   Loose integration
Distribution format of end product               Source              Binary only         Binary and source
Free?                                            Yes                 No                  Yes
POSIX 1003.2d compliance                         Yes                 No                  Yes
Reference 19

[Figures: 508.22 Compliance Statistics for Dinosolve]
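
The competition matrix can also be treated as data. A minimal sketch, assuming a hypothetical pair of selection criteria (free of charge, with user-level checkpoint restart) that the slides do not state as the team's actual decision rule. The dictionary layout and the function name `shortlist` are illustrative, and only the yes/no rows of the matrix are encoded.

```python
# Yes/No rows of the competition matrix, encoded as booleans.
# Key names are illustrative; values are copied from the table.
SYSTEMS = {
    "PBS": {
        "multi_cluster": True,
        "system_checkpoint": False,
        "user_checkpoint": False,
        "large_grid": False,
        "massive_scalability": True,
        "free": True,
        "posix_batch_compliant": True,
    },
    "LSF": {
        "multi_cluster": True,
        "system_checkpoint": True,
        "user_checkpoint": False,
        "large_grid": False,
        "massive_scalability": True,
        "free": False,
        "posix_batch_compliant": False,
    },
    "SGE": {
        "multi_cluster": False,
        "system_checkpoint": True,
        "user_checkpoint": True,
        "large_grid": False,
        "massive_scalability": True,
        "free": True,
        "posix_batch_compliant": True,
    },
}


def shortlist(systems, required):
    """Return the names of systems that offer every required feature."""
    return sorted(
        name
        for name, features in systems.items()
        if all(features[key] for key in required)
    )


# Hypothetical criteria: must be free and support user-level checkpointing.
print(shortlist(SYSTEMS, ["free", "user_checkpoint"]))  # ['SGE']
```

Changing the `required` list shows how the shortlist shifts; for example, requiring multi-cluster support instead would exclude SGE.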