Presentation Two
4/09/2013

Agenda
• Title
• Outline
• Members
• Mentor
• Societal Issue
• History
• Dr. Li
• Cluster Computing
• Case Study
• Accuracy
• Current Major Functional Component Diagram
• Current Process Flow
• Problem Statement
• Proposed Major Functional Component Diagram
• Proposed Process Flow
• Dinosolve Walkthrough
• Dinosolve Issues
• Software
• Hardware
• Solution Statement
• Competition Identified
• 508 Compliance
• Objectives
• Benefits of Solution
• Milestones
• Sitemap
• Database Schema
• Entity Relationship Diagram
• Risks
• Conclusion
• References
• Appendix

Group Members and Roles
• Scott Pardue (Team Leader)
• Michael Rajs (Risk Manager)
• Adam Willis (Algorithm Specialist)
• Sybil Acotanza (Documentation Specialist)
• Jordan Heinrichs (Database Designer)
• David Crook (User Interface Designer)

Dr. Yaohang Li
• Associate Professor in the Department of Computer Science at Old Dominion University
• Research interests include:
  • Computational Biology: applies computational simulation techniques to solve biological problems
  • Markov Chain Monte Carlo (MCMC) methods: statistical algorithms for sampling from probability distributions
  • Parallel Distributed Grid Computing: uses multiple computers communicating via the Internet to solve a problem

How do researchers manage the massive amounts of data they are collecting in order to benefit their research?

“Every day, [mankind] creates 2.5 quintillion (2.5*10^18) bytes of data - so much that 90% of the data in the world today has been created in the last two years alone.” - IBM
http://www-01.ibm.com/software/data/bigdata/

Data Management Examples
• Large Hadron Collider [2]
  • 150 million sensors report 40 million times per second
• Watson on Jeopardy
  • 200 million pages
  • Structured and unstructured
  • 4 terabytes of information
• DinoSolve Protein Prediction Server
  • Proteins are made up of sequences of amino acids
  • 20 different amino acids
  • If a protein is made up of 5 amino acids, then the number of possible sequences is 20^5, or 3,200,000

Big Data Analysis
[Diagram: Hardware; Scheduling and Queuing Subsystem; Job Management Subsystem; Physical Resource Management Subsystem]

Dr. Li’s Cluster Configuration
• Hardware
  • Servers
    • Database Server
    • Web Server
  • Server Cluster
    • Head Node: Dell PowerEdge R410 Server
    • Three Computational Nodes: Dell PowerEdge R410 Servers, each with two Intel E5506 processors

Dinosolve Issues
As Dinosolve continues to grow in popularity, these issues are expected to occur:
• Limited hardware resources for computation
  • CPU cycles
  • Memory
  • Disk space
  • Network bandwidth
• Server crashes
The goal is to prepare the system to keep supporting the research community as the number of requests grows, and to enhance the design of the user interface.

Job Management Subsystem
[Diagram: the cluster configuration tree, repeated for the Job Management Subsystem]

Physical Resource Management
[Diagram: the cluster configuration tree, repeated for Physical Resource Management]

Scheduling and Queueing
[Diagram: the cluster configuration tree, repeated for Scheduling and Queueing]

Dr. Li’s Grants
• DinoSolve
  • Secured a five-year, $400,000 CAREER Award from the National Science Foundation
• Dr. Li
  • Principal or co-principal investigator
  • Research grants totaling more than $15.3 million

Dr. Yaohang Li and Dinosolve
• Dinosolve examines a protein's amino acid sequence and determines whether the protein can be manipulated by the addition of a disulfide bond
• Each computational result enhances the prediction accuracy of future results
• 40^20 (larger than 10^32) different possible combinations for only the shortest sequence

What is the problem?
• 300 simultaneous requests will cause the web server to crash
[Chart: system throughput (Mb/sec)]
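
The sequence-space figures quoted on the slides above (20^5 and 40^20) can be checked with a few lines of Python. This is only an illustrative sketch, not part of the DinoSolve code; the function name `sequence_space` is hypothetical.

```python
def sequence_space(alphabet_size: int, length: int) -> int:
    """Count the distinct sequences of `length` symbols drawn from
    an alphabet of `alphabet_size` symbols (hypothetical helper)."""
    return alphabet_size ** length


# Five positions, 20 amino acids each (figure from the slides).
five_residue = sequence_space(20, 5)
print(f"{five_residue:,}")  # 3,200,000

# The 40^20 figure from the slides, compared against 10^32.
slide_figure = sequence_space(40, 20)
print(slide_figure > 10**32)  # True
```

Python integers have arbitrary precision, so both values are computed exactly rather than as floating-point approximations.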

Dinosolve Case Study
• Bioinformatics [7]
• Disulfide bond prediction program
• Disulfide bond creation is important to the research community
http://www.merriam-webster.com/dictionary/bioinformatics

Dinosolve Users
• Drug design
  • Pharmaceutical companies
• Antibody design
  • To combat viruses
• Bio-energy development
  • Creation of new fuels to replace diminishing fossil fuels
• Genetic mapping [5]
  • Research to cure cancer, HIV, and other diseases

Accuracy of Popular Tools
Tool                         Accuracy
Dinosolve                    90.8%
DiANNA                       81%
Scratch Protein Predictor    87%
More users choose Dinosolve because of its higher accuracy

Current Major Functional Component Diagram
[Diagram components: Researcher, Internet, Web Server, Email, MySQL Database Server, DinoSolve Algorithm]

Current Process Flow
[Flowchart, reconstructed from the slide labels. Swimlanes: User (Researcher), Web server, Validity check, DinoSolve engine, Database]
• Start
• Visit DinoSolve
• Enter sequence and email address
• Send sequence
• Input valid?
  • No: Display error
  • Yes: CYS?
    • CYS < 2: Display No Reaction
    • CYS > 1: Execute Algorithm (Big Data Calculation)
• Bond formed
• Store results (Big Data Storage)
• Email link to results
• View results
• End

RWP Major Functional Component Diagram
[Diagram components: Researcher, Internet, Web Server, Email, MySQL Database Server, SGE scheduler, two Execution Hosts]

RWP Process Flow
[Flowchart, reconstructed from the slide labels: Proposed Process Flow. Swimlanes: User (Researcher), Web server, Validity check, SGE scheduler, SGE execution host, Database]
• Start
• Visit DinoSolve
• Enter sequence and email address
• Send sequence
• Input valid?
  • No: Display error
  • Yes: Accept and schedule job
• CYS?
  • CYS < 2: Display No Reaction
  • CYS > 1: Execute Algorithm (Big Data Calculation)
• Bond formed
• Store results (Big Data Storage)
• Email link to results
• View results
• End

Objectives
• Configure, utilize, and optimize the SGE
• Provide an aesthetically pleasing and professional user interface
• Achieve 508 compliance
• Improve the existing database schema and add user accounts

Benefits from Goals
• Efficient utilization of available resources and increased throughput of the cluster
• Professional user interface leading to a rise in popularity
• Accessibility
• Security and efficient access to previous submissions

User interface will be improved to be more aesthetically pleasing

Working with Dinosolve
• Input title
• Input protein sequence
• Input e-mail address
• Submit, then wait for confirmation...
Protein sequence: a string of alphabetic characters, each of which represents a particular amino acid in the protein

Working with Dinosolve
• Confirmation of request
• Now wait for results

Working with Dinosolve
• Check your e-mail
• Click the link provided
• The results are displayed

Why is it important to be compliant?
If an entity wishes to receive government funding, then any electronic form the entity uses must be 508 compliant

508 Compliance
• 1998 amendment to the Rehabilitation Act of 1973
  • Requires Federal agencies to make their electronic and information technology accessible to people with disabilities [26]
  • Enacted to eliminate barriers in information technology, to make available new opportunities for people with disabilities, and to encourage development of technologies that will help achieve these goals [26]
http://en.wikipedia.org/wiki/Section_508_Amendment_to_the_Rehabilitation_Act_of_1973

Compliance of Popular Tools
Tool                         508.22 compliance percentage
Dinosolve                    67%
DiANNA                       85%
Scratch Protein Predictor    67%

Milestones
[Diagram: ARMS milestone tree with branches Hardware, Software, and Testing; legend: Within Scope / Unaffected]

Three Computational Nodes
• Dell PowerEdge R410 Server Computational Node
  • Two Intel E5506 processors
• Each processor has four execution slots

Processors
• Hardware
  • Servers
    • Database Server
    • Web Server
  • Server Cluster
    • Head Node: Dell PowerEdge R410 Server
    • Three Computational Nodes: Dell PowerEdge R410 Servers, each with two Intel E5506 processors
• Each computational node has two processors
• 6 processors yield 24 execution slots

Software Milestones
[Diagram: Software branch of the milestone tree; nodes: User Interface, Database, Algorithm, Disulfide Bond Predictor, Sun Grid Engine, Scheduler; legend: Within Scope / Unaffected]

Testing Milestones
• Server Cluster Performance
  • Stress testing
  • Prevention of denial-of-service attacks
• Database Performance
  • Stress testing
  • Prevention of SQL injection attacks against the MySQL database

Complete Milestone Tree
[Diagram: full ARMS milestone tree; legend: Within Scope / Unaffected]
• Hardware
  • Servers: Database Server, Web Server
  • Server Cluster: one Head Node and three Computational Nodes (Dell PowerEdge R410 Servers), each computational node with two Intel E5506 processors
• Software: User Interface, Database, Algorithm, Disulfide Bond Predictor, Sun Grid Engine, Scheduler
• Testing: Server Cluster Performance, Database Performance

Sitemap
[Diagram: DinoSolve Homepage linking to Admin, User Information, References, Statistics, Help, and Contact]

Database Schema
[Diagram]

Entity Relationship
[Diagram]

Risks
[Risk matrix: Probability vs. Impact, plotting T1, T2, T3, T4, C1, and C2]
• T1: Larger volumes of queries could slow processing, possibly because of limited hardware capacity
• T2: Improper synchronization of cluster resources could lead to a deadlock
• T3: Race conditions between the HPCR cluster and the MySQL database
• T4: A local attacker could exploit these vulnerabilities and cause a crash or execute arbitrary code on the system
• C1: Users may not like the new design
• C2: SGE does not enforce exclusive access to the reserved processors

Technical Risks and Mitigations
• T1: Larger volumes of queries could slow processing, possibly because of limited hardware capacity
  • Probability: 1
  • Impact: 5
  • Mitigation: Create indexes; use specialized data structures and aggregate tables.
• T2: Improper synchronization of cluster resources can lead to a deadlock
  • Probability: 2
  • Impact: 4
  • Mitigation: Modify and read application data. Alter the execution logic and basic software configuration of SGE.

Technical Risks and Mitigations
• T3: Race conditions between the HPCR cluster and the MySQL database
  • Probability: 3
  • Impact: 3
  • Mitigation: Use software control on the SGE.
• T4: A local attacker could exploit these vulnerabilities and cause a crash or execute arbitrary code on the system
  • Probability: 2
  • Impact: 2
  • Mitigation: Keep virus protection up to date. Enforce specific password requirements. Keep scripts current, because attackers look for outdated scripts that are likely to contain vulnerabilities. Limit access to certain files.

Risks
• C1: Users may not like the new design
  • Probability: 3
  • Impact: 3
  • Mitigation: Create a new, more aesthetically pleasing design.
• C2: SGE does not enforce exclusive access to the reserved processors
  • Probability: 4
  • Impact: 4
  • Mitigation: Use qsub and knowledge of node memory capacity.

With the updated user interface and a correctly configured Sun Grid Engine, Dr. Li hopes to establish a reputable, reliable, and aesthetically pleasing Disulfide Bonding Prediction Server

References for history
1. http://www-01.ibm.com/software/data/bigdata/
2. http://en.wikipedia.org/wiki/Big_data
3. http://techcrunch.com/2012/08/22/how-big-is-facebooks-data-2-5-billion-pieces-of-content-and-500-terabytes-ingested-every-day/
4. http://en.wikipedia.org/wiki/Computer_cluster

References for case study
5. Li, Y. (2010, September 1). CAREER: Novel Sampling Approaches for Protein Modeling Applications [Abstract]. National Science Foundation Award Abstract #1066471.
6. Li, Y., & Yaseen, A. (2012). Enhancing Protein Disulfide Bonding Prediction Accuracy with Context-based Features. Biotechnology and Bioinformatics Symposium.
7. bioinformatics. 2011. In Merriam-Webster.com. Retrieved February 15, 2013, from http://www.merriam-webster.com/dictionary/bioinformatics
8. Cronk, J. D. (2012). Disulfide Bond. Retrieved February 15, 2013, from Biochemistry Dictionary: http://guweb2.gonzaga.edu/faculty/cronk/biochem/Dindex.cfm?definition=disulfide_bond
9. Yan, Y., & Chapman, B. (2008). Comparative Study of Distributed Resource Management Systems - SGE, LSF, PBS Pro, and LoadLeveler. Technical Report, Citeseerx.
10. Li, Y., & Yaseen, A. (2012). Dinosolve. Retrieved from http://hpcr.cs.odu.edu/dinosolve/

References for competition
11. Arvind Krishna, “Why Big Data? Why Now?”, IBM, 2011. URL: http://almaden.ibm.com/colloquium/resources/Why%20Big%20Data%20Krishna.PDF
12. Yonghong Yan, Barbara M. Chapman, Comparative Study of Distributed Resource Management Systems - SGE, LSF, PBS Pro, and LoadLeveler, Department of Computer Science, University of Houston, May 2005 (pdf)
13. Dr. Li’s site: http://hpcr.cs.odu.edu/dinosolve/
14. Scratch Predictor: http://scratch.proteomics.ics.uci.edu/
15. DiANNA server: http://clavius.bc.edu/~clotelab/DiANNA/
Portable Batch System (PBS)
16. http://resources.altair.com/pbs/documentation/support/PBSProUserGuide12-2.pdf
17. http://www.pbsworks.com/SupportDocuments.aspx?AspxAutoDetectCookieSupport=1
18. http://resources.altair.com/pbs/documentation/support/PBSProRefGuide12-2.pdf
19. http://resources.altair.com/pbs/documentation/support/PBSProAdminGuide12-2.pdf
20. http://www.pbsworks.com/(S(tykrsyqbemmlf3o5zwrmjrgf))/images/solutions-en-US/PBS-Pro_Datasheet-USA_WEB.pdf
21. http://agendafisica.files.wordpress.com/2011/05/pbs.pdf
Moab HPC Suite
22. http://www.adaptivecomputing.com/publication/420/wppa_open/
IBM Platform LSF
23. http://public.dhe.ibm.com/common/ssi/ecm/en/dcd12354usen/DCD12354USEN.PDF
Apache Hadoop with Zookeeper
24. http://zookeeper.apache.org/doc/current/zookeeperOver.html
25. http://www.cloud-net.org/~swsellis/tech/solaris/performance/doc/blueprints/0102/jobsys.pdf

Reference for 508 Compliance
26. http://en.wikipedia.org/wiki/Section_508_Amendment_to_the_Rehabilitation_Act_of_1973

Appendix
• Competition Matrix for Resource Management Systems
• 508.22 Compliance Statistics for Dinosolve

Competing Resource Management Systems
Feature                                           PBS                  LSF                  SGE
Supported platforms                               Unix                 Unix & NT            Unix
Multi-cluster support                             Yes                  Yes                  No
System level checkpoint restart                   No                   Yes                  Yes
User level checkpoint restart                     No                   No                   Yes
Large computational grid support                  No                   No                   No
Massive scalability                               Yes                  Yes                  Yes
Parallel job support with Sun HPC ClusterTools    Loose Integration    Tight Integration    Loose Integration
Distribution format of end product                Source               Binary only          Binary and Source
Free?                                             Yes                  No                   Yes
POSIX 1003.2d compliance                          Yes                  No                   Yes

508.22 Compliance Statistics for Dinosolve
[Charts not recovered]
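
The branching in the process-flow slides (validity check, then the cysteine count) can be sketched in Python as follows. This is a rough illustration under stated assumptions, not the DinoSolve implementation: the function name `triage_request` is hypothetical, the validity rule (only the 20 standard one-letter amino acid codes, plus an "@" in the e-mail address) is an assumption, and the returned strings simply reuse the labels from the flowchart.

```python
# The 20 standard amino acids in one-letter code; "C" is cysteine (CYS).
STANDARD_AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


def triage_request(sequence: str, email: str) -> str:
    """Return the flowchart step a submission would reach (hypothetical helper)."""
    seq = sequence.strip().upper()

    # Assumed validity check: non-empty, standard residues only, plausible e-mail.
    if not seq or not set(seq) <= STANDARD_AMINO_ACIDS or "@" not in email:
        return "Display error"

    # A disulfide bond needs two cysteines, so CYS < 2 means no reaction.
    if seq.count("C") < 2:
        return "Display No Reaction"

    # CYS > 1: hand the sequence to the prediction algorithm.
    return "Execute Algorithm"


print(triage_request("ACDC", "user@example.org"))
print(triage_request("MKTAYIAKQR", "user@example.org"))
print(triage_request("AC1D", "user@example.org"))
```

Checking the cysteine count before running the algorithm keeps sequences that cannot form a bond away from the cluster, which matters when execution slots are the limited resource.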