Presentation Two 4/09/2013 1

Agenda
• 1: Title
• 2: Outline
• 3: Members
• 4: Mentor
• 5-6: Societal Issue
• 7: History
• 8-9: Dr. Li
• 10-11: Cluster Computing
• 12-14: Case Study
• 15: Accuracy
• 16: Current Major Functional Component Diagram
• 17: Current Process Flow
• 18: Problem Statement
• 19: Proposed Major Functional Component Diagram
• 20: Proposed Process Flow
• 21-24: Dinosolve Walkthrough
• 25: Dinosolve Issues
• 26: Software
• 27: Hardware
• 28: Solution Statement
• 29: Competition Identified
• 30-32: 508 Compliance
• 33: Objectives
• 34: Benefits of Solution
• 35-41: Milestones
• 42: Sitemap
• 43: Database Schema
• 44: Entity Relationship Diagram
• 45: Risks
• 46: Conclusion
• 47-50: References
• 51-54: Appendix
2
Group Members and Roles
• Scott Pardue (Team Leader)
• Michael Rajs (Risk Manager)
• Adam Willis (Algorithm Specialist)
• Sybil Acotanza (Documentation Specialist)
• Jordan Heinrichs (Database Designer)
• David Crook (User Interface Designer)
3
Dr. Yaohang Li
• Associate Professor in the Department of Computer Science at Old Dominion University
• Research interests include:
  • Computational Biology: applies computational simulation techniques to solve biological problems
  • Markov Chain Monte Carlo (MCMC) methods: statistical algorithms for sampling from probability distributions
  • Parallel Distributed Grid Computing: uses multiple computers communicating via the Internet to solve a problem
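To make the MCMC bullet concrete, here is a minimal sketch of one well-known MCMC method, the random-walk Metropolis algorithm, sampling from a standard normal distribution. The function names, step size, and target distribution are illustrative assumptions and are not taken from Dr. Li's research code.

```python
import math
import random


def metropolis_samples(log_density, start, steps, step_size=1.0, seed=0):
    """Random-walk Metropolis sampler (illustrative sketch, not Dr. Li's code)."""
    rng = random.Random(seed)
    x = start
    samples = []
    for _ in range(steps):
        # Propose a nearby point with a symmetric uniform random walk.
        proposal = x + rng.uniform(-step_size, step_size)
        # Accept with probability min(1, p(proposal) / p(x)), computed in log space.
        log_accept = log_density(proposal) - log_density(x)
        if log_accept >= 0 or rng.random() < math.exp(log_accept):
            x = proposal
        samples.append(x)
    return samples


def standard_normal_log_density(x):
    """Log of the standard normal density, up to an additive constant."""
    return -0.5 * x * x


samples = metropolis_samples(standard_normal_log_density, start=0.0, steps=20000)
mean = sum(samples) / len(samples)
print(f"sample mean after {len(samples)} steps: {mean:.3f}")
```

The sampler only needs the target density up to a constant factor, which is the property that makes MCMC practical when the normalizing constant is unknown.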
4
How do researchers manage the massive
amounts of data they are collecting in
order to benefit their research?
5
“Every day, [mankind] creates 2.5
quintillion (2.5*10^18) bytes of data —
so much that 90% of the data in the
world today has been created in the
last two years alone.” - IBM
http://www-01.ibm.com/software/data/bigdata/
6
Data Management Examples
• Large Hadron Collider [2]
  • 150 million sensors report 40 million times per second
• Watson on Jeopardy
  • 200 million pages
  • Structured and unstructured
  • 4 terabytes of information
• DinoSolve Protein Prediction Server
  • Proteins are chains of amino acids
  • There are 20 different amino acids
  • A protein made up of 5 amino acids therefore has 20^5, or 3,200,000, possible sequences
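The protein count above is simple exponentiation: with 20 amino acids to choose from at each position, a chain of length n has 20^n possible sequences. The short sketch below reproduces the slide's figure and shows how quickly the count grows; the function name is a hypothetical helper introduced only for illustration.

```python
AMINO_ACID_COUNT = 20  # number of different amino acids, as stated on the slide


def possible_sequences(length: int) -> int:
    """Number of distinct amino-acid sequences of the given length."""
    if length < 0:
        raise ValueError("length must be non-negative")
    return AMINO_ACID_COUNT ** length


for n in (1, 5, 10):
    print(f"length {n:>2}: {possible_sequences(n):,} possible sequences")
```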
7
Big Data Analysis Hardware
• Scheduling and Queuing Subsystem
• Job Management Subsystem
• Physical Resource Management Subsystem
8
Dr. Li’s Cluster Configuration
[Diagram: servers (database server, web server) and hardware (a server cluster of four Dell PowerEdge R410 servers: one head node and three computational nodes, each computational node with two Intel E5506 processors)]
9
Dinosolve Issues
As Dinosolve continues to grow in popularity, the following problems are expected:
• Limited hard resources for computation
  • CPU cycles
  • Memory
  • Disk space
  • Network bandwidth
• Server crashes
The goal is to prepare the system to keep supporting the research community as the number of requests grows, and to enhance the design of the user interface.
10
Job Management Subsystem
[Diagram: cluster configuration (database server, web server, head node, and three computational nodes on Dell PowerEdge R410 servers)]
11
Physical Resource Management
[Diagram: cluster configuration (database server, web server, head node, and three computational nodes on Dell PowerEdge R410 servers)]
12
Scheduling and Queuing
[Diagram: cluster configuration (database server, web server, head node, and three computational nodes on Dell PowerEdge R410 servers)]
13
Dr. Li’s Grants
• DinoSolve
  • Funded by a five-year, $400,000 CAREER Award from the National Science Foundation
• Dr. Li
  • Principal or co-principal investigator on research grants totaling more than $15.3 million
14
Dr. Yaohang Li and Dinosolve
• Dinosolve examines a protein's amino acid sequence and determines whether the protein can be manipulated by the addition of a disulfide bond
• Each computational result enhances the prediction accuracy of future results
• 40^20 (larger than 10^32) different possible combinations for only the shortest sequence
15
What is the problem?
300 simultaneous requests will cause the web server to crash
[Chart: system throughput (Mb/sec)]
16
Dinosolve Case Study
• Bioinformatics [7]
• Disulfide bond prediction program
• Disulfide bond creation is important to the research community
http://www.merriam-webster.com/dictionary/bioinformatics
17
Dinosolve Users
• Drug design
  • Pharmaceutical companies
• Antibody design
  • To combat viruses
• Bio-energy development
  • Creation of new fuels to replace diminishing fossil fuels
• Genetic mapping [5]
  • Research to cure cancer, HIV, and other diseases
18
Accuracy of Popular Tools
Tool                        Accuracy
Dinosolve                   90.8%
DiANNA                      81%
Scratch Protein Predictor   87%
More users choose Dinosolve because of its higher accuracy
19
Current Major Functional Component Diagram
[Diagram components: Researcher, Internet, Web Server, Email, MySQL Database Server, DinoSolve Algorithm]
20
Current Process Flow
[Flowchart with lanes for the user (researcher), web server, database, and DinoSolve engine]
• Start: the researcher visits DinoSolve
• The researcher enters a sequence and an email address
• Validity check: if the input is not valid, display an error
• If the input is valid, send the sequence to the DinoSolve engine
• CYS check: if CYS < 2, display "No Reaction"
• If CYS > 1, execute the algorithm (Big Data Calculation); bond formed
• Store results (Big Data Storage)
• Email a link to the results
• The researcher views the results
• End
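The current process flow can be sketched as a single request handler. This is a hedged illustration only: the names (`Submission`, `is_valid`, `handle_submission`), the validity rule (letters of the 20 standard amino acids plus an "@" in the email address), and the stand-in algorithm are assumptions, not DinoSolve's actual code.

```python
from dataclasses import dataclass

# One-letter codes of the 20 standard amino acids; "C" is cysteine (CYS).
STANDARD_AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


@dataclass
class Submission:
    sequence: str
    email: str


def is_valid(sub: Submission) -> bool:
    """Assumed validity check: known amino-acid letters and a plausible email."""
    seq = sub.sequence.strip().upper()
    return bool(seq) and set(seq) <= STANDARD_AMINO_ACIDS and "@" in sub.email


def handle_submission(sub: Submission, run_algorithm, results_store: dict) -> str:
    """Follow the flowchart: validate, check cysteine count, run, store, notify."""
    if not is_valid(sub):
        return "error: invalid input"
    seq = sub.sequence.strip().upper()
    if seq.count("C") < 2:  # CYS < 2: no disulfide bond is possible
        return "no reaction"
    results_store[sub.email] = run_algorithm(seq)  # stands in for the database
    return f"results stored; link emailed to {sub.email}"


def cysteine_positions(seq: str) -> list:
    """Hypothetical stand-in for the prediction algorithm."""
    return [i for i, residue in enumerate(seq) if residue == "C"]


store = {}
print(handle_submission(Submission("ACDC", "researcher@example.org"), cysteine_positions, store))
```

Every step runs inside the request itself in this sketch, which mirrors the current design: the web server has no scheduler between it and the computation.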
21
RWP Major Functional Component Diagram
[Diagram components: Researcher, Internet, Web Server, Email, MySQL Database Server, SGE scheduler, two Execution Hosts]
22
RWP Process Flow
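[Flowchart of the proposed process, with lanes for the user (researcher), web server, database, SGE scheduler, and SGE execution host]
• Start: the researcher visits DinoSolve
• The researcher enters a sequence and an email address
• Validity check: if the input is not valid, display an error
• If the input is valid, send the sequence
• The SGE scheduler accepts and schedules the job
• CYS check on the SGE execution host: if CYS < 2, display "No Reaction"
• If CYS > 1, execute the algorithm (Big Data Calculation); bond formed
• Store results (Big Data Storage)
• Email a link to the results
• The researcher views the results
• End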
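The "accept and schedule job" step could hand each validated sequence to the SGE scheduler through its `qsub` command. The sketch below only builds the command line and does not run it; the job name, script name, and input file are hypothetical, while `-N` (job name), `-cwd` (run in the current directory), and `-pe` (parallel environment and slot count) are standard `qsub` options.

```python
import shlex


def build_qsub_command(job_name, script_path, sequence_file,
                       slots=1, parallel_env=None):
    """Build (but do not run) an SGE qsub command for one prediction job."""
    cmd = ["qsub", "-N", job_name, "-cwd"]
    if parallel_env is not None:
        # Request several slots only when a parallel environment is configured.
        cmd += ["-pe", parallel_env, str(slots)]
    # The job script and its argument come last.
    cmd += [script_path, sequence_file]
    return cmd


# Hypothetical names: run_dinosolve.sh and seq_42.fasta are placeholders.
command = build_qsub_command("dino_42", "run_dinosolve.sh", "seq_42.fasta")
print(shlex.join(command))
```

On the head node, the resulting list could be passed to `subprocess.run`; queuing jobs this way lets the scheduler hold requests until an execution slot is free instead of starting them all at once on the web server.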
23
Objectives
• Configure, utilize, and optimize the Sun Grid Engine (SGE)
• Provide an aesthetically pleasing and professional user interface
• Achieve 508 compliance
• Improve the existing database schema and add user accounts
24
Benefits from Goals
• Efficient utilization of available resources
and increased throughput of the cluster
• Professional user interface leading to a rise
in popularity
• Accessibility
• Security and efficient access of previous
submissions
25
The user interface will be improved to be more aesthetically pleasing
26
Working with Dinosolve
• Input title
• Input protein sequence
• Input e-mail address
• Submit, then wait for confirmation...
Protein sequence: a string of alphabetic characters, each of which represents a particular amino acid in the protein
27
Working with Dinosolve
Confirmation of request
Now wait for results
28
Working with Dinosolve
Check your e-mail,
Click the link provided
The results are displayed
29
Why is it important to be compliant?
If an entity wishes to receive
government funding then any
electronic form the entity uses must
be 508 compliant
30
508 Compliance
• Section 508 of the Rehabilitation Act of 1973, as amended in 1998
  • Requires Federal agencies to make their electronic and information technology accessible to people with disabilities [32]
  • Enacted to eliminate barriers in information technology, to make available new opportunities for people with disabilities, and to encourage development of technologies that will help achieve these goals [32]
http://en.wikipedia.org/wiki/Section_508_Amendment_to_the_Rehabilitation_Act_of_1973
31
Compliance of Popular Tools
Tool                        508.22 compliance percentage
Dinosolve                   67%
DiANNA                      85%
Scratch Protein Predictor   67%
32
Milestones
[Milestone tree: ARMS branches into Hardware, Software, and Testing. Legend: Within Scope, Unaffected]
33
Three Computational Nodes
• Each computational node is a Dell PowerEdge R410 server with two Intel E5506 processors
• Each processor has four execution slots
34
Processors
• 6 processors yield 24 execution slots
• Each computational node has two processors
[Diagram: servers (database server, web server) and the server cluster of Dell PowerEdge R410 servers: one head node and three computational nodes, each computational node with two Intel E5506 processors]
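The slot count follows directly from the hardware figures on these slides: three computational nodes, two processors per node, four execution slots per processor. The sketch below reproduces that arithmetic and, assuming one slot per job (an assumption, not something the deck states), estimates how many rounds of jobs the 300 simultaneous requests quoted earlier in the deck would need.

```python
import math

COMPUTATIONAL_NODES = 3   # from the cluster configuration slides
PROCESSORS_PER_NODE = 2   # two Intel E5506 processors per node
SLOTS_PER_PROCESSOR = 4   # four execution slots per processor

total_processors = COMPUTATIONAL_NODES * PROCESSORS_PER_NODE
total_slots = total_processors * SLOTS_PER_PROCESSOR

# Assumption: each request occupies exactly one slot for its whole run.
simultaneous_requests = 300
rounds_needed = math.ceil(simultaneous_requests / total_slots)

print(f"{total_processors} processors yield {total_slots} execution slots")
print(f"{simultaneous_requests} requests need {rounds_needed} rounds of {total_slots} jobs")
```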
35
Software Milestones
[Milestone tree: Software branches into User Interface, Database, Algorithm (Disulfide Bond Predictor), and Sun Grid Engine (Scheduler). Legend: Within Scope, Unaffected]
36
Testing Milestones
Testing
• Server Cluster Performance
  • Stress testing
  • Prevention of denial-of-service attacks
• Database Performance
  • Stress testing
  • Prevention of SQL injection attacks against the MySQL database
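One common way to prevent SQL injection is to pass user input as query parameters instead of pasting it into the SQL string. The sketch below demonstrates the idea with Python's built-in `sqlite3` module standing in for MySQL; the table and column names are hypothetical and do not come from the DinoSolve schema.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for the MySQL server.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE submissions (id INTEGER PRIMARY KEY, email TEXT, sequence TEXT)"
)
conn.execute(
    "INSERT INTO submissions (email, sequence) VALUES (?, ?)",
    ("researcher@example.org", "ACCA"),
)


def find_by_email(connection, email):
    """Look up submissions with a parameterized query (no string concatenation)."""
    cursor = connection.execute(
        "SELECT id, sequence FROM submissions WHERE email = ?", (email,)
    )
    return cursor.fetchall()


# A classic injection attempt is treated as a literal string and matches nothing.
hostile_input = "x' OR '1'='1"
print(find_by_email(conn, hostile_input))
print(find_by_email(conn, "researcher@example.org"))
```

Had the query been built by concatenating `hostile_input` into the SQL text, the `OR '1'='1'` clause would have matched every row. MySQL client libraries for Python offer the same parameter-passing approach, typically with `%s` placeholders instead of `?`.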
37
Complete Milestone Tree
• ARMS
  • Hardware
    • Servers: Database Server, Web Server
    • Server Cluster: Dell PowerEdge R410 servers (one head node and three computational nodes, each computational node with two Intel E5506 processors)
  • Software
    • User Interface
    • Database
    • Algorithm: Disulfide Bond Predictor
    • Sun Grid Engine: Scheduler
  • Testing
    • Server Cluster Performance
    • Database Performance
Legend: Within Scope, Unaffected
38
Sitemap
[Sitemap: DinoSolve Homepage, with pages Admin, User, Information, References, Statistics, Help, and Contact]
39
Database Schema
40
Entity Relationship
41
Risks
[Risk matrix: probability versus impact, plotting T1, T2, T3, T4, C1, and C2]
• T1: Larger volumes of queries could slow processing, possibly as a result of limited hardware capacity
• T2: Improper synchronization of cluster resources could lead to a deadlock
• T3: Race conditions between the HPCR cluster and the MySQL database
• T4: A local attacker could exploit vulnerabilities in the system and cause a crash or execute arbitrary code
• C1: Users may not like the new design
• C2: SGE does not enforce exclusive access to the reserved processors
Technical Risks and Mitigations
[Risk matrix: probability versus impact, plotting T1 and T2]
T1: Larger volumes of queries could slow processing, possibly as a result of limited hardware capacity
Probability: 1
Impact: 5
Mitigation: Create indexes; use specialized data structures and aggregate tables.
T2: Improper synchronization of cluster resources can lead to a deadlock
Probability: 2
Impact: 4
Mitigation: Modify and read application data. Alter execution logic and basic software configuration of SGE.
43
Technical Risks and Mitigations
[Risk matrix: probability versus impact, plotting T3 and T4]
T3: Race conditions between the HPCR cluster and the MySQL database
Probability: 3
Impact: 3
Mitigation: Use software control on the SGE.
T4: A local attacker could exploit vulnerabilities in the system and cause a crash or execute arbitrary code
Probability: 2
Impact: 2
Mitigation: Keep virus protection up to date. Use very specific types of passwords. Keep scripts current, because attackers look for dated scripts that are likely to contain security holes. Limit access to certain files.
44
Risks
[Risk matrix: probability versus impact, plotting C1 and C2]
C1: Users may not like the new design
Probability: 3
Impact: 3
Mitigation: Create a new, more aesthetically pleasing design
C2: SGE does not enforce exclusive access to the reserved processors
Probability: 4
Impact: 4
Mitigation: Qsub and knowledge of node memory capacity
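The probability and impact ratings from the risk slides can be collected in one table and ranked. The sketch below uses probability multiplied by impact as the ranking score; that scoring rule is a common convention assumed here for illustration, since the slides themselves only plot the two values on a matrix.

```python
# (probability, impact) pairs as listed on the risk slides.
risks = {
    "T1": (1, 5),  # slower processing under larger query volumes
    "T2": (2, 4),  # deadlock from improper synchronization
    "T3": (3, 3),  # race conditions between cluster and database
    "T4": (2, 2),  # local attacker exploiting vulnerabilities
    "C1": (3, 3),  # users may not like the new design
    "C2": (4, 4),  # SGE does not enforce exclusive processor access
}


def exposure(probability: int, impact: int) -> int:
    """Assumed scoring rule: probability times impact."""
    return probability * impact


# Highest exposure first; ties keep the order in which the risks were listed.
ranked = sorted(risks, key=lambda name: exposure(*risks[name]), reverse=True)

for name in ranked:
    probability, impact = risks[name]
    print(f"{name}: probability {probability}, impact {impact}, "
          f"exposure {exposure(probability, impact)}")
```

Under this assumed rule, C2 ranks highest, which is consistent with the deck's emphasis on configuring SGE correctly.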
45
With the updated user interface and
correctly configured Sun Grid Engine,
Dr. Li hopes to establish a reputable,
reliable, and aesthetically pleasing
Disulfide Bonding Prediction Server
46
References for history
1. http://www-01.ibm.com/software/data/bigdata/
2. http://en.wikipedia.org/wiki/Big_data
3. http://techcrunch.com/2012/08/22/how-big-is-facebooks-data-2-5-billion-pieces-of-content-and-500-terabytes-ingested-every-day/
4. http://en.wikipedia.org/wiki/Computer_cluster
47
References for case study
5. Li, Y. (2010, September 1). CAREER: Novel Sampling Approaches for Protein Modeling
Applications [Abstract]. National Science Foundation Award Abstract #1066471.
6. Li, Y., & Yaseen, A. (2012). Enhancing Protein Disulfide Bonding Prediction Accuracy with
Context-based Features. Biotechnology and Bioinformatics Symposium
7. bioinformatics. 2011. In Merriam-Webster.com. Retrieved February 15, 2013,
from http://www.merriam-webster.com/dictionary/bioinformatics
8. Cronk, J. D. (2012). Disulfide Bond. Retrieved February 15, 2013, from Biochemistry
Dictionary:
http://guweb2.gonzaga.edu/faculty/cronk/biochem/Dindex.cfm?definition=disulfide_bond
9. Yan, Y., & Chapman, B. (2008). Comparative Study of Distributed Resource Management
Systems–SGE, LSF, PBS Pro, and LoadLeveler. Technical Report-Citeseerx.
10. Li, Y., & Yaseen, A. (2012). Dinosolve. Retrieved from http://hpcr.cs.odu.edu/dinosolve/
48
References for competition
11. Arvind Krishna, “Why Big Data? Why Now?”, IBM , 2011
URL: http://almaden.ibm.com/colloquium/resources/Why%20Big%20Data%20Krishna.PDF
12. Yonghong Yan, Barbara M. Chapman, Comparative Study of Distributed Resource Management Systems - SGE, LSF,
PBS Pro, and LoadLeveler, Department of Computer Science, University of Houston, May 2005 (pdf)
13. Dr. Li’s site
http://hpcr.cs.odu.edu/dinosolve/
14. Scratch Predictor
http://scratch.proteomics.ics.uci.edu/
15. DiANNA server
http://clavius.bc.edu/~clotelab/DiANNA/
Portable Batch System (PBS)
16. http://resources.altair.com/pbs/documentation/support/PBSProUserGuide12-2.pdf
17. http://www.pbsworks.com/SupportDocuments.aspx?AspxAutoDetectCookieSupport=1
18. http://resources.altair.com/pbs/documentation/support/PBSProRefGuide12-2.pdf
19. http://resources.altair.com/pbs/documentation/support/PBSProAdminGuide12-2.pdf
20. http://www.pbsworks.com/(S(tykrsyqbemmlf3o5zwrmjrgf))/images/solutions-en-US/PBS-Pro_Datasheet-USA_WEB.pdf
21. http://agendafisica.files.wordpress.com/2011/05/pbs.pdf
Moab HPC Suite
22. http://www.adaptivecomputing.com/publication/420/wppa_open/
IBM Platform LSF
23. http://public.dhe.ibm.com/common/ssi/ecm/en/dcd12354usen/DCD12354USEN.PDF
Apache Hadoop with Zookeeper
24. http://zookeeper.apache.org/doc/current/zookeeperOver.html
25. http://www.cloud-net.org/~swsellis/tech/solaris/performance/doc/blueprints/0102/jobsys.pdf
49
Reference for 508 Compliance
26. http://en.wikipedia.org/wiki/Section_508_Amendment_to_the_Rehabilitation_Act_of_1973
50
Appendix
• 52: Competition Matrix for Resource Management Systems
• 53-55: 508.22 Compliance Statistics for Dinosolve
51
Competing Resource Management Systems
Feature                                          PBS                LSF                SGE
Supported platforms                              Unix               Unix & NT          Unix
Multi-cluster support                            Yes                Yes                No
System-level checkpoint restart                  No                 Yes                Yes
User-level checkpoint restart                    No                 No                 Yes
Large computational grid support                 No                 No                 No
Massive scalability                              Yes                Yes                Yes
Parallel job support with Sun HPC ClusterTools   Loose integration  Tight integration  Loose integration
Distribution format of end product               Source             Binary only        Binary and source
Free?                                            Yes                No                 Yes
POSIX 1003.2d compliance                         Yes                No                 Yes
52