Data Decomposition Algorithm

advertisement
Clustering Technology
Clustering Schematic
Cluster Components
• Cluster hardware (processor, main
memory, hard disk, …)
• Cluster network (Fast Ethernet, Gigabit
Ethernet, Myrinet, …)
• Cluster Software (operating system,
programming environment, …)
Cluster Operating System
Characteristics
•
•
•
•
•
•
•
Manageability: An absolute necessity is remote and intuitive system administration; this is
often associated with a Single System Image (SSI) which can be realized on different levels,
ranging from a high-level set of special scripts down to real state-sharing on the OS level.
Stability: The most important characteristics are robustness against crashing processes,
failure recovery by dynamic reconfiguration, and usability under heavy load.
Performance: The performance critical parts of the OS, such as memory management,
process and thread scheduler, file I/O and communication protocols should work in as efficiently
as possible.
Extensibility: The OS should allow the easy integration of cluster-specific extensions, which
will most likely be related to the inter-node cooperation. A good example for this is the MOSIX
system that is based on Linux.
Scalability: The scalability of a cluster is mainly influenced by the provision of the contained
nodes, which is dominated by the performance characteristics of the interconnect.
Support: Many intelligent and technically superior approaches in computing failed due to the
lack of support in its various aspects: which tools, hardware drivers and middleware environments
are available.
Heterogeneity: Clusters provide a dynamic and evolving environment in that they can be
extended or updated with standard hardware just as the user needs to or can afford. Therefore, a
cluster environment does not necessarily consist of homogenous hardware.
Cluster Solution
• Hardware of Cluster nodes is typical PC’s (this selection
reduces the cost per performance ratio).
• Fast Ethernet was used as Cluster interconnection (17
nodes connected by Fast Ethernet infrastructure).
• Linux was choosed as Cluster OS to provide our needs
as much as possible (we can recompile and tune the
kernel to meet our needs).
• Using VMware software for virtualizing our computing
resources.
• Message Passing Interface was selected as the parallel
programming environment.
Configuring Cluster
• Configuring the Cluster nodes (network
configuration, packages installation, …).
• Optimizing and securing the Linux OS to
extract the maximum utilization from
Cluster resources.
• Cluster administration (Samba service,
ssh, rlogin, rcp, administration scripts and
…).
Algorithm Identification
Integer Factorization
Sieving
Trial Division
QS Algorithm
MPQS Algorithm
SIQS Algorithm
Algorithm Complexity
Improvement
QS  MPQS  SIQS  NFS
Optimizing
Serial Implementation
Optimizing Serial Implementation
• Algorithm level optimizations (the most
important step for optimizing serial codes
is to reduce the complexity of algorithm
maximally).
• Code level optimizations (in this phase we
use
• Compiler level optimizations
Algorithm Optimization
• In computation-intensive software programs, we
will often find that 99% of the CPU time is used
in the innermost loop.
• Identifying the most critical part of your software
is therefore necessary if you want to improve the
speed of computation (by profilers).
• Study the algorithm used in the critical part of
your code and see if it can be improved.
Innermost loop
(Conventional Sieving)
Pentium4 Memory Access Times
An Optimized Sieving Approach (1)
An Optimized Sieving Approach (2)
Code level optimization
techniques
• Loop unrolling (Unrolling amortizes the branch overhead,
since it eliminates branches and some of the code to
manage induction variables. Unrolling allows you to
aggressively schedule (or pipeline) the loop to hide
latencies).
• Function inlining (We can instruct the compiler to insert
the code of a function into the code of its callers, to the
point where actually the call is to be made. inlining
method reduces the function-call overhead. In a
compiler, inlining a function exposes more opportunity for
optimization).
• gcc inline assembly (Assembly routines written as inline
functions. They are handy, speedy and very much
useful).
• Builtin gcc functions (__builtin_prefetch) (This function is
used to minimize cache-miss latency by moving data into
a cache before it is accessed).
• Using “unsigned int” type only (Use 32-bit integers
instead of integers with smaller sizes (16-bit or 8-bit) to
reduce the machine cycles needed).
• Division-free arithmetic (change division to use
multiplication by reciprocals).
• Release allocated memory blocks.
• and …
Loop unrolling (code level opt.)
gcc compiler optimizations
Parallel Algorithm Design
Parallel Algorithm Design
Methodology
• Partitioning (Domain decomposition or
Functional Decomposition)
• Communication
• Agglomeration
• Mapping
Methodical Design (1)
• Partitioning: The computation that is to be performed and the data operated
on by this computation are decomposed into small tasks. Practical issues such as the
number of processors in the target computer are ignored, and attention is focused on
recognizing opportunities for parallel execution.
• Communication: The communication required to coordinate task
execution is determined, and appropriate communication structures and algorithms
are defined.
• Agglomeration: The task and communication structures defined in the
first two stages of a design are evaluated with respect to performance requirements
and implementation costs. If necessary, tasks are combined into larger tasks to
improve performance or to reduce development costs.
• Mapping: Each task is assigned to a processor in a manner that attempts to
satisfy the competing goals of maximizing processor utilization and minimizing
communication costs. Mapping can be specified statically or determined at runtime by
load-balancing algorithms.
Methodical Design (2)
Load Balancing Mechanism
• For load balancing Master/Slave
mechanism was used.
• Master node sends the initial data and
assign the jobs to slave nodes.
Data Decomposition Algorithm
• Using SPMD model (SPMD program that
creates exactly one task per processor).
• We can sieve with multiple polynomials in SIQS.
• To generate these polynomials, we must first
compute ‘a’ factors.
• Sieving with separated ‘a’ values can be done
independently on different processors.
• Thus we need to build ‘a’ values on several
tasks without any coordination.
• Duplicated ‘a’ factor conclude to weak
concurrency.
Data Decomposition Algorithm (1)
(Initialization Data)
Data Decomposition Algorithm (2)
(Determining the number
and size of ‘a’ value’s factors)
Data Decomposition Algorithm (2)
(Determining the number
and size of ‘a’ value’s factors)
Data Decomposition Algorithm (3)
(Computing the factors of ‘a’ values)
Data Decomposition Algorithm (3)
(Computing the factors of ‘a’ values)
Master Node Algorithm
Slave Node Algorithm
Double Large Prime
Variation effects
Master Node Algorithm (1)
(Improved version)
Master Node Algorithm (2)
(Improved version)
Slave Node Algorithm
(Improved version)
Cluster Benchmarks
Performance Evaluation
• Speedup: Amdahl’s law gives the ideal speedup
Sp:
• Efficiency: The efficiency, p, of a p-node
computation with speed-up Sp is given by:
Total Execution Time
Sieving Execution Time
Total Speedup
Sieving Speedup
Total Efficiency
Sieving Efficiency
Sources of inefficiency
• Static load balancing (overestimate our needs
and waste Cluster resources).
• Communication overhead (Ethernet protocol and
tcp/ip stack operations).
• Load imbalance.
• Non parallelized stages of the application such
as Linear algebra stage (Admahl’s law).
• Inefficient software and hardware technologies
(MPI overhead and OS inefficiencies).
Conclusions and Future works
• The distributed computing by Cluster technology is an open
problem yet. The Mosix project is one of the most successful
solutions that builds an single system image and concrete
Linux kernels to simulate one single system for running
processes.
• But Mosix and other similar solutions not work optimally for
every desired application.
• In fact supplying a SSI at the operating system level, while a
definite boon in terms of manageability, drastically inhibits
scalability.
• The availability of the source code in conjunction with the
possibility to extend (and thus modify) the operating system
on this base. This property has a negative influence on the
stability and manageability of the system: over time, many
variants of the operating system will develop, and the different
extensions may conflict when there is no single supplier.
Conclusions and Future works
• In this thesis our goal was to present an global approach to
run computational applications on distributed systems
optimally.
• Algorithm optimization is the major step in presented
approach. In the computation-intensive software programs,
we must in first step identify the most critical part of the
software and try to optimize this algorithm as much as
possible (most efficient step).
• In the second phase we implement the best optimized
algorithm serially to exploit the maximum usage of local
resources. In fact we do distributed computing when local
computation cost is bigger than remote execution).
• In the last stage, parallelize the serial algorithm optimally. You
must focus on most computational part of the serial algorithm
and try to parallelize this part efficiently.
Conclusions and Future works
• The performance benchmarks shows that
our strategy works well, and therefore we
can apply this method to every other
application similarly.
• Using NFS algorithm for RSA key
cracking.
• Parallelize linear algebra stage for
factorization of very large numbers
(greater than 120 digits).
Download