Revised Presentation One 3/21/2013 1

Outline
• Title
• Outline
• Members
• Mentor
• Societal Issue
• History
• Dr. Li
• Cluster Computing
• Case Study
• Accuracy
• Current Major Functional Component Diagram
• Current Process Flow
• Problem Statement
• Proposed Major Functional Component Diagram
• Proposed Process Flow
• Dinosolve Walkthrough
• Dinosolve Issues
• Software
• Hardware
• Solution Statement
• Competition Identified
• 508 Compliance
• Objectives
• Benefits of Solution
• Conclusion
• References
• Appendix
Group Members and Roles
• Scott Pardue (Team Leader)
• Michael Rajs (Risk Manager)
• Adam Willis (Algorithm Specialist)
• Sybil Acotanza (Documentation Specialist)
• Jordan Heinrichs (Database Designer)
• David Crook (User Interface Designer)
Dr. Yaohang Li
• Associate Professor in the Department of Computer Science at Old Dominion University
• Research interests include:
– Computational Biology: applies computational simulation techniques to solve biological problems
– Markov Chain Monte Carlo (MCMC) methods: statistical algorithms for sampling from probability distributions
– Parallel Distributed Grid Computing: uses multiple computers communicating via the Internet to solve a problem
How do researchers handle the massive amounts of data they are collecting in order to benefit their research?
“Every day, [mankind creates] 2.5
quintillion bytes of data — so much
that 90% of the data in the world today
has been created in the last two years
alone.” [1]
http://www-01.ibm.com/software/data/bigdata/
Data Management Examples
• Large Hadron Collider [2]
– 150 million sensors report 40 million times per second
• Facebook, per day [3]
– 2.5 billion content items shared
– 2.7 billion “Likes”
– 300 million photos uploaded
• Walmart [2]
– 1 million customer transactions per hour
– 2.5 x 10^15 bytes (2.5 petabytes) of data
http://techcrunch.com/2012/08/22/how-big-is-facebooks-data-2-5-billion-pieces-of-content-and-500-terabytes-ingested-every-day/
Dr. Li’s Research
• Ideally, his research can be used to develop new protein-modeling programs. Computational approaches can be more efficient and less expensive than experiments carried out by biologists, chemists and others in lab settings.
• This could lead to the manufacture of additional drugs to fight conditions as varied as Alzheimer’s disease, cystic fibrosis and mad cow disease.
http://diverseeducation.com/article/13348/
Dr. Li’s Grants
• Dinosolve, his current project, is funded by a five-year, $400,000 CAREER Award from the National Science Foundation.
• Dr. Li has been the principal or co-principal investigator on research grants totaling more than $15.3 million.
Big Data Analysis Hardware
• Cluster Computing [4]
– A cluster consists of many nodes (computers).
– Big data can be generated and analyzed more quickly by spreading the workload among the nodes.
• Head Node
– Logging data
– Job submission
• 3 Computation Nodes
– 2 processors each
– 4 execution slots per processor
– 24 total execution slots
• The head node packages data from the computation nodes and presents it in a readable format so that it is usable by the research community.
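The slot count for this cluster follows from simple multiplication. The short sketch below only restates that arithmetic; the constant and function names are illustrative and do not come from the Dinosolve code base.

```python
# Cluster layout as described on the slide; the head node runs no jobs itself.
COMPUTATION_NODES = 3
PROCESSORS_PER_NODE = 2
SLOTS_PER_PROCESSOR = 4


def total_execution_slots(nodes: int, processors_per_node: int, slots_per_processor: int) -> int:
    """Return how many jobs the cluster can run at the same time."""
    return nodes * processors_per_node * slots_per_processor


total_slots = total_execution_slots(
    COMPUTATION_NODES, PROCESSORS_PER_NODE, SLOTS_PER_PROCESSOR
)
print(total_slots)  # 3 * 2 * 4 = 24
```

Any request beyond these 24 slots has to wait until a running job releases its slot.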
Managing the Cluster
Distributed Resource Management Systems (D-RMS)
– Job management subsystem
– Physical resource management subsystem
– Scheduling and queuing subsystem
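To make the three subsystems concrete, here is a toy first-come, first-served scheduler. It is a sketch for illustration only, not how Sun Grid Engine or any other real D-RMS is implemented; the class name ToyScheduler and the job identifiers are made up.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class ToyScheduler:
    """Illustrative stand-in for a D-RMS; not a real scheduler."""

    total_slots: int
    queue: deque = field(default_factory=deque)  # scheduling and queuing
    running: set = field(default_factory=set)    # job management

    @property
    def free_slots(self) -> int:
        # Physical resource management: slots not held by a running job.
        return self.total_slots - len(self.running)

    def submit(self, job_id: str) -> None:
        # New jobs always enter the queue first, then run if a slot is free.
        self.queue.append(job_id)
        self._dispatch()

    def finish(self, job_id: str) -> None:
        # A finished job releases its slot for the next queued job.
        self.running.remove(job_id)
        self._dispatch()

    def _dispatch(self) -> None:
        # Start queued jobs in arrival order while slots remain.
        while self.queue and self.free_slots > 0:
            self.running.add(self.queue.popleft())


scheduler = ToyScheduler(total_slots=24)
for n in range(30):
    scheduler.submit(f"job-{n}")
print(len(scheduler.running), len(scheduler.queue))  # 24 6

scheduler.finish("job-0")
print(len(scheduler.running), len(scheduler.queue))  # 24 5
```

The point of the sketch is that extra requests wait in a queue instead of overloading the nodes.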
Dr. Yaohang Li and Dinosolve
• Dinosolve examines a protein's amino acid sequence and determines whether the protein can be manipulated by the addition of a disulfide bond
• Each computational result improves the prediction accuracy of future results
http://hpcr.cs.odu.edu/dinosolve/index.php
Dinosolve Case Study
• Bioinformatics [7]
– Disulfide bond prediction program
– Disulfide bond creation is important to the research community
Dinosolve Users
• Drug design
– Pharmaceutical companies
• Antibody design
– To combat viruses
• Bio-energy development
– Creation of new fuels to replace diminishing fossil fuels
• Genetic mapping [5]
– Research to cure cancer, HIV, and other diseases
Accuracy of Popular Tools
Tool                        Accuracy
Dinosolve                   90.8%
DiANNA                      81%
Scratch Protein Predictor   87%
More users choose Dinosolve because of its higher accuracy
References 13, 14 and 15
What is the problem?
• Processing big data sets is computationally expensive, and as the volume of queries grows, system performance will progressively degrade until the system fails.
• 300 simultaneous requests will cause the web server to crash
User interface will be improved to be more aesthetically pleasing
Working with Dinosolve
• Input title
• Input protein sequence
• Input e-mail address
• Submit, then wait for confirmation...
Protein Sequence: a string of alphabetic characters, each of which represents a particular amino acid in the protein
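As an illustration of what such a protein sequence looks like, the sketch below checks an input string against the 20 standard one-letter amino acid codes and lists the cysteine (C) positions, since disulfide bonds form between cysteine residues. It is an assumption for illustration, not Dinosolve's actual input handling: the function names and example sequence are made up, and the real server may accept additional codes.

```python
from typing import List

# The 20 standard one-letter amino acid codes.
STANDARD_AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


def normalize_sequence(raw: str) -> str:
    """Strip whitespace, upper-case, and reject non-standard residue codes."""
    sequence = "".join(raw.split()).upper()
    if not sequence:
        raise ValueError("empty protein sequence")
    invalid = sorted(set(sequence) - STANDARD_AMINO_ACIDS)
    if invalid:
        raise ValueError(f"invalid residue codes: {''.join(invalid)}")
    return sequence


def cysteine_positions(sequence: str) -> List[int]:
    """Return 1-based positions of cysteine, the residue involved in disulfide bonds."""
    return [i for i, residue in enumerate(sequence, start=1) if residue == "C"]


example = normalize_sequence("mkcl aacg\nhc")  # made-up sequence
print(example)                      # MKCLAACGHC
print(cysteine_positions(example))  # [3, 7, 10]
```

Rejecting malformed input before it reaches the cluster avoids spending execution slots on jobs that cannot succeed.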
Working with Dinosolve
Confirmation of request
Now wait for results
Working with Dinosolve
Check your e-mail
Click the link provided
The results are displayed
Dinosolve Issues
As Dinosolve continues to grow in popularity, the following issues are expected:
• Strain on hard resources for computation
– CPU cycles
– Memory
– Disk space
– Network bandwidth
• Server crashes
The goal is to prepare the system to continue supporting the research community as requests grow as expected
Software
• Unix operating system installed on the Dinosolve cluster
• Dinosolve algorithm
• Sun Grid Engine, installed on the cluster, which will serve as our Distributed Resource Management System (D-RMS)
• MySQL (database software)
• Web-based user interface (website)
Hardware
• MySQL database server
• A computer cluster to run the Dinosolve algorithm
• Web server for web-based user interface
How will we correct the problem?
Configure a distributed resource management system
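One way the web front end could hand work to such a system is by building a job submission command. The sketch below assembles a Sun Grid Engine qsub command line using the standard options -N (job name), -cwd (run in the current directory), -o and -e (output and error files). It only builds and prints the command and does not run it. The script name run_dinosolve.sh, the input file name, and the name-sanitizing rule are hypothetical and not taken from the Dinosolve project.

```python
import re
import shlex
from typing import List


def build_qsub_command(job_title: str, script_path: str, sequence_file: str) -> List[str]:
    """Build (but do not run) a qsub command line for one prediction job."""
    # Conservative sanitizing (an assumption): keep letters, digits and underscores only.
    safe_name = re.sub(r"[^A-Za-z0-9_]", "_", job_title) or "job"
    if safe_name[0].isdigit():
        safe_name = "j" + safe_name
    return [
        "qsub",
        "-N", safe_name,           # job name shown in the queue
        "-cwd",                    # run from the current working directory
        "-o", f"{safe_name}.out",  # standard output file
        "-e", f"{safe_name}.err",  # standard error file
        script_path,               # hypothetical wrapper script
        sequence_file,             # argument passed to the script
    ]


command = build_qsub_command("My protein #1", "run_dinosolve.sh", "input.fasta")
print(shlex.join(command))
```

Passing the command as a list, rather than one shell string, keeps user-supplied titles from being interpreted by the shell.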
Competing Distributed Resource Management Systems
• Sun Grid Engine (SGE)
• Portable Batch System (PBS)
• Load Sharing Facility (LSF)
508.22 compliance percentage
Tool                        Compliance
Dinosolve                   67%
DiANNA                      85%
Scratch Protein Predictor   67%
508 compliance
• Section 508 of the Rehabilitation Act of 1973, as amended in 1998
– requires Federal agencies to make their electronic and information technology accessible to people with disabilities [32]
– was enacted to eliminate barriers in information technology, to make available new opportunities for people with disabilities, and to encourage development of technologies that will help achieve these goals [32]
Why is it important to be compliant?
If an entity wishes to receive government funding, then any electronic form the entity uses must be 508 compliant.
Objectives
• Interpret and visualize current usage statistics
• Configure, utilize, and optimize the SGE
• Deliver an aesthetically pleasing and professional user interface
What benefits will come from attaining the goals?
• Efficient utilization of available resources
• Increased throughput of the cluster
• An intuitive and professional user interface
• Rise in popularity due to excellent accuracy, efficiency, and professional design
Conclusion
With the updated user interface and correctly configured Sun Grid Engine, Dr. Li hopes to establish a reputable, reliable, and aesthetically pleasing Disulfide Bonding Prediction Server.
References for history
1. http://www-01.ibm.com/software/data/bigdata/
2. http://en.wikipedia.org/wiki/Big_data
3. http://techcrunch.com/2012/08/22/how-big-is-facebooks-data-2-5-billion-pieces-of-content-and-500-terabytes-ingested-every-day/
4. http://en.wikipedia.org/wiki/Computer_cluster
References for case study
5. Li, Y. (2010, September 1). CAREER: Novel Sampling Approaches for Protein Modeling
Applications [Abstract]. National Science Foundation Award Abstract #1066471.
6. Li, Y., & Yaseen, A. (2012). Enhancing Protein Disulfide Bonding Prediction Accuracy with
Context-based Features. Biotechnology and Bioinformatics Symposium
7. bioinformatics. 2011. In Merriam-Webster.com. Retrieved February 15, 2013,
from http://www.merriam-webster.com/dictionary/bioinformatics
8. Cronk, J. D. (2012). Disulfide Bond. Retrieved February 15, 2013, from Biochemistry
Dictionary:
http://guweb2.gonzaga.edu/faculty/cronk/biochem/Dindex.cfm?definition=disulfide_bond
9. Yan, Y., & Chapman, B. (2008). Comparative Study of Distributed Resource Management
Systems–SGE, LSF, PBS Pro, and LoadLeveler. Technical Report-Citeseerx.
10. Li, Y., & Yaseen, A. (2012). Dinosolve. Retrieved from http://hpcr.cs.odu.edu/dinosolve/
References for competition
11. Arvind Krishna, “Why Big Data? Why Now?”, IBM , 2011
URL: http://almaden.ibm.com/colloquium/resources/Why%20Big%20Data%20Krishna.PDF
12. Yonghong Yan, Barbara M. Chapman, Comparative Study of Distributed Resource Management Systems - SGE, LSF,
PBS Pro, and LoadLeveler, Department of Computer Science, University of Houston, May 2005 (pdf)
13. Dr. Li’s site
http://hpcr.cs.odu.edu/dinosolve/
14. Scratch Predictor
http://scratch.proteomics.ics.uci.edu/
15. DiANNA server
http://clavius.bc.edu/~clotelab/DiANNA/
Portable Batch System (PBS)
16. http://resources.altair.com/pbs/documentation/support/PBSProUserGuide12-2.pdf
17. http://www.pbsworks.com/SupportDocuments.aspx?AspxAutoDetectCookieSupport=1
18. http://resources.altair.com/pbs/documentation/support/PBSProRefGuide12-2.pdf
19. http://resources.altair.com/pbs/documentation/support/PBSProAdminGuide12-2.pdf
20. http://www.pbsworks.com/(S(tykrsyqbemmlf3o5zwrmjrgf))/images/solutions-en-US/PBS-Pro_Datasheet-USA_WEB.pdf
21. http://agendafisica.files.wordpress.com/2011/05/pbs.pdf
Moab HPC Suite
22. http://www.adaptivecomputing.com/publication/420/wppa_open/
IBM Platform LSF
23. http://public.dhe.ibm.com/common/ssi/ecm/en/dcd12354usen/DCD12354USEN.PDF
Apache Hadoop with Zookeeper
24. http://zookeeper.apache.org/doc/current/zookeeperOver.html
25. http://www.cloud-net.org/~swsellis/tech/solaris/performance/doc/blueprints/0102/jobsys.pdf
Reference for 508 Compliance
26. http://en.wikipedia.org/wiki/Section_508_Amendment_to_the_Rehabilitation_Act_of_1973
Appendix
• Competition Matrix for Resource Management Systems
• 508.22 Compliance Statistics for Dinosolve
Competing Resource Management Systems
Features of systems                              PBS                 LSF                 SGE
Supported platforms                              Unix                Unix & NT           Unix
Multi-cluster support                            Yes                 Yes                 No
System level checkpoint restart                  No                  Yes                 Yes
User level checkpoint restart                    No                  No                  Yes
Large computational grid support                 No                  No                  No
Massive scalability                              Yes                 Yes                 Yes
Parallel job support with Sun HPC ClusterTools   Loose integration   Tight integration   Loose integration
Distribution format of end product               Source              Binary only         Binary and source
Free?                                            Yes                 No                  Yes
POSIX 1003.2d compliance                         Yes                 No                  Yes
Reference 19