Soft Coherence: Preliminary Experiments with Error-Tolerant Cache Coherence in Numerical Applications

Guoping Long†, Frederic T. Chong‡, Diana Franklin‡, John Gilbert‡, Dongrui Fan†
† Institute of Computing Technology, Chinese Academy of Sciences
‡ Department of Computer Science, UC Santa Barbara
Abstract
As we scale into the multi-core era, we face severe challenges in the scalability and performance
of on-chip cache-coherent shared memory mechanisms. We explore application error-tolerance as
an extra degree of freedom to meet these challenges. Iterative numerical algorithms, in particular, can cope with the occasional stale value
with little or no effect on accuracy or convergence
time. We explore analysis methods to distinguish
between critical and non-critical data in such algorithms. We exploit this distinction to design
soft coherence protocols that provide strong guarantees for critical data and weak guarantees for
non-critical data. Our preliminary results use a conjugate gradient solver as an example, with experiments on five sparse matrices showing 6.9%-12.6% performance improvement with little loss in precision.
1 Introduction
Many multi-processor systems assume invalidation-based coherence protocols [1] to provide good programmability and reasonable performance. The cost of maintaining strict cache coherence of data has been a constant struggle for researchers, leading to proposals such as prefetching invalidated data [2], decoupled coherence protocols [3], etc.
The performance overhead is caused by two services of the invalidation protocol. First, all writes
to the same address must be serialized, which requires access to a centralized location when accessing a currently shared location. Second, each
read must return the value written by the most recent write, requiring a long-latency load if data has
been changed. This places high overhead on both
the writing and reading of shared data.
In this paper, we propose to bypass this overhead when accessing some shared data. When a
load requests a line that resides in the cache but has
been invalidated by another processor, the processor simply uses the stale value while the request
for the updated line is being fulfilled. That is, the
processor does not have to wait for the completion
of the cache miss to continue execution.
In the soft approach for cache coherence, the
system allows a load operation to return stale values. This clearly breaks an important assumption
of previous cache coherence protocols, which require a load to only return the value written by the
most recent store operation. Therefore, naive adoption of this approach may cause correctness problems. However, some applications allow a certain
degree of algorithmic error resilience. For example, given a parallel numerical application, if the
required precision (the difference between the execution output and the theoretical result) is 1e − 10,
and strict implementation of cache coherence can
achieve the precision of 1e − 15, then there is room
for relaxation of operations which are not critical to
the output precision. This observation motivates us
to employ a conservative coherence protocol (such as an invalidation-based protocol) as the correctness substrate, and to explore the use of soft coherence for selected memory operations to gain additional performance.
To support soft coherence, we provide two hardware modes for cache coherence. One is the conservative mode, in which all loads receive the most
recently stored value. The other one is the aggres-
sive mode, in which the soft coherence protocol is
enabled to seek as much performance gain as possible. The policy which specifies which execution
mode to use is determined by the relaxation plan.
In this work, we evaluate the potential of the soft
coherence protocol on the Godson-T many core
platform [4, 5, 6]. The application is the CG program from the NPB benchmark suite [7]. Besides
the default input matrix, we select four sparse matrices from the UF sparse matrix collection [8] for
our experiments. Experimental results show that our approach brings an average performance improvement of 8.8% across all matrices.
The rest of the paper is organized as follows. Section 2 discusses the parallel computation structure of CG. Section 3 presents the methodology for generating an appropriate relaxation plan for each input sparse matrix. Section 4 discusses the details of the soft cache coherence design on Godson-T. Section 5 reports our preliminary experimental results. Section 6 discusses related work, and Section 7 concludes the paper and outlines future work.
2 Parallel Conjugate Gradient Solver
The conjugate gradient (CG) solver here is
adapted from the CG program of the NPB benchmark suite. This program estimates the largest
eigenvalue of a symmetric positive definite sparse
matrix with the inverse power method. The basic
structure of the kernel loop of CG, which is used to
solve the sparse matrix equation Ax = b, is shown
in Figure 1. We use this algorithm here to illustrate our approach. More discussions on how to
optimize the kernel loop can be found in [9].
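For concreteness, the following is a minimal serial sketch in C of the kernel-loop structure that Figure 1 conveys. It assumes a CSR storage format and a zero initial guess; the helper names and the csr_t layout are illustrative, not the benchmark's actual code.

/* Hypothetical CSR storage; the paper's actual data layout is not given. */
typedef struct { int n; const int *rowptr, *col; const double *val; } csr_t;

static void spmv(const csr_t *A, const double *p, double *q) {
    for (int i = 0; i < A->n; i++) {
        double s = 0.0;
        for (int k = A->rowptr[i]; k < A->rowptr[i+1]; k++)
            s += A->val[k] * p[A->col[k]];
        q[i] = s;
    }
}

static double dot(int n, const double *a, const double *b) {
    double s = 0.0;
    for (int j = 0; j < n; j++) s += a[j] * b[j];
    return s;
}

/* One CG solve of A x = b, mirroring the kernel-loop structure of Figure 1. */
void cg_kernel(const csr_t *A, const double *b, double *x,
               double *r, double *p, double *q, int iters)
{
    int n = A->n;
    for (int j = 0; j < n; j++) { x[j] = 0.0; r[j] = p[j] = b[j]; }
    double rho = dot(n, r, r);
    for (int it = 0; it < iters; it++) {
        spmv(A, p, q);                      /* q = A*p dominates the runtime    */
        double alpha = rho / dot(n, p, q);  /* inner products imply global sync */
        for (int j = 0; j < n; j++) { x[j] += alpha * p[j]; r[j] -= alpha * q[j]; }
        double rho0 = rho;
        rho = dot(n, r, r);
        double beta = rho / rho0;
        for (int j = 0; j < n; j++) p[j] = r[j] + beta * p[j];  /* next direction */
    }
}

The two operations that matter for the rest of the paper are the sparse matrix-vector product q = A*p, which dominates the runtime, and the inner products, which impose global synchronization.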
When the sparse matrix becomes large, the matrix-vector multiply (Ax) dominates the execution time of the kernel loop. Since naive parallelization of Ax by row or column cannot achieve good load balance, we first partition A with a sparse matrix partitioning tool, Mondriaan [10, 11]. Mondriaan distributes the computation load evenly among processors and at the same time tries to minimize inter-processor communication.
Figure 1: Parallel Implementation of the Kernel Loop

We allocate a temporary vector for each processor to store its own part of the partial sums. When all partial sum vectors have been generated, we sum them together to form the solution vector q, which should be equal to A * p. The problem is how to sum up these partial vectors. In the current implementation, we parallelize this summation embarrassingly: each processor is responsible for generating an equal part of the resultant vector q. This approach incurs much coherence traffic, because each processor has to access parts of all the other partial sum vectors.
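As a rough illustration of why this reduction generates coherence traffic, the sketch below shows one possible per-thread reduction phase, assuming P threads that have already filled their partial-sum vectors; the types, field names, and barrier usage are ours, not the actual Godson-T code.

#include <pthread.h>

/* A minimal sketch of the reduction phase of A*p, assuming nthreads threads,
 * a shared n-element result q, and per-thread partial-sum vectors partial[p].
 * Names and layout are illustrative.                                         */
typedef struct {
    int tid, nthreads, n;
    double **partial;           /* partial[p] = thread p's partial-sum vector */
    double *q;                  /* shared result vector                       */
    pthread_barrier_t *bar;
} reduce_args_t;

void reduce_partial_sums(reduce_args_t *a)
{
    pthread_barrier_wait(a->bar);              /* all partial sums are written */
    int lo = a->tid * a->n / a->nthreads;
    int hi = (a->tid + 1) * a->n / a->nthreads;
    for (int j = lo; j < hi; j++) {            /* each thread owns a slice     */
        double s = 0.0;
        for (int p = 0; p < a->nthreads; p++)  /* reads every thread's vector: */
            s += a->partial[p][j];             /* the main source of coherence */
        a->q[j] = s;                           /* traffic discussed above      */
    }
    pthread_barrier_wait(a->bar);              /* q is complete                */
}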
Aside from A ∗ x, the rest of the vector computations can all be parallelized embarrassingly.
Note that although the inner products incur a rather
small volume of communication and computation,
they represent global data dependence and synchronization. Relaxation of any of these partial
sums will have tremendous impact on the output
precision. Among all the shared vectors or global
scalars in Figure 1, our detailed profiling shows
that approximately 95% of the coherence misses are caused by the vector p and the partial sums of the A * x operation. We therefore only consider relaxing p and those important partial sums in our work.
3 Relaxation Plan Generation
In this section, we study in detail how to relax
the coherence requirements of memory operations
on the vector p and partial sums of Ax. We will
present a technique to evaluate the relative importance of different vector elements of p. Then we
discuss details on how to generate relaxation plans.
3.1 Sensitivity Analysis
Now we evaluate the importance of a particular vector element x. We can model the computation in Figure 1 as a function of x, y = f(x), and use this function to study the impact of a variation in the input x on the output y. The Taylor expansion of y = f(x) can be written as follows:

f(x) = f(a) + \sum_{i=1}^{n} \frac{f^{(i)}(a)}{i!} (x - a)^i + O(n)    (1)

In particular, if \lim_{n \to +\infty} O(n) = 0, we have:

f(x) \approx f(a) + \sum_{i=1}^{n} \frac{f^{(i)}(a)}{i!} (x - a)^i    (2)

In this work, we choose n = 5 and rewrite equation (2) as follows:

f(x) \approx f(a) + C_1 (x-a) + C_2 (x-a)^2 + C_3 (x-a)^3 + C_4 (x-a)^4 + C_5 (x-a)^5    (3)

Given a serial C implementation of the algorithm in Figure 1, we can obtain the coefficients C1-C5 with automatic differentiation tools such as Rapsodia [12].

As shown in Figure 1, the kernel loop iterates ITERS times; the constant ITERS differs between input matrices. In each iteration i, the variable x is read during the computation of A * p and later re-written. Assume x is changed from a to b; we then use the following formula to represent the importance of x at iteration i:

W(x, i) = C_1 (a-b) + C_2 (a-b)^2 + C_3 (a-b)^3 + C_4 (a-b)^4 + C_5 (a-b)^5    (4)

In reality, we relax the coherence requirements of memory operations at cache-line granularity. Therefore, not only should we evaluate the importance of a particular variable x, we should also evaluate the importance of the corresponding cache line. The L1 cache line size of Godson-T is 32 bytes, so a cache line can hold four consecutive vector elements. Assume x1, x2, x3, x4 are in the same cache line, and let w(x1, i), w(x2, i), w(x3, i), and w(x4, i) be the weights (importance factors) of x1, x2, x3, x4 at iteration i, respectively. We use the following formula to measure the importance of the cache line at iteration i:

W(x_1, x_2, x_3, x_4, i) = \sqrt{w(x_1, i)^2 + w(x_2, i)^2 + w(x_3, i)^2 + w(x_4, i)^2}    (5)

Note that the larger the value of W(x1, x2, x3, x4, i), the more important the cache line.
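As an illustration of how equations (4) and (5) might be evaluated offline, the following sketch computes an element weight from assumed per-element coefficients C1-C5 and then the cache-line weight for four elements sharing a 32-byte line; the function names and data layout are hypothetical.

#include <math.h>

/* Illustrative sketch of equations (4) and (5): given per-element Taylor
 * coefficients C[0..4] = C1..C5 (obtained offline with an automatic
 * differentiation tool) and the element's old/new values a and b, compute
 * the element weight, then combine four elements sharing a cache line.     */
double element_weight(const double C[5], double a, double b)
{
    double d = a - b, dp = d, w = 0.0;
    for (int k = 0; k < 5; k++) {      /* C1*d + C2*d^2 + ... + C5*d^5      */
        w += C[k] * dp;
        dp *= d;
    }
    return w;
}

double line_weight(const double w[4])  /* eq. (5): four elements per line   */
{
    return sqrt(w[0]*w[0] + w[1]*w[1] + w[2]*w[2] + w[3]*w[3]);
}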
3.2 Relaxation Plan Generation
For iterative algorithms such as the CG solver, it is important to ensure convergence even after relaxing memory operations; arbitrary relaxation may produce results that do not converge. To ensure convergence, the key observation is that we should not relax a vector element for an arbitrary number of consecutive iterations. Specifically, we introduce a tolerance factor d. For example, if we relax some element in iteration i, we may still relax the same element in iterations i + 1, i + 2, ..., i + d - 1, but we never relax it in iteration i + d. With this relaxation constraint, at each iteration i, each processor reads, at worst, a value written in iteration i - d. We assume d = 2 throughout all experiments in this work.
Now it is time to discuss the relaxation plan for the vector elements of p. Assume the vector has N elements. Since a cache line can hold 4 consecutive elements, there are approximately N/4 cache lines for this vector. We partition the cache lines into two groups, and each group has N/8 cache lines. The relaxation plan for the vector p is determined as follows: (a) at iteration 1, the N/8 least important elements of p are considered to be relaxation candidates; (b) starting from iteration 2 up to the last iteration, we try to select the N/8 least important elements under the condition that no selected element has been relaxed in the preceding iteration.
A relaxation plan only specifies candidates for relaxation. An interesting property of the conjugate gradient solver is that the output precision is very sensitive to the results of the early iterations; therefore, for all input matrices we decide not to relax any operation during the first several iterations. A candidate is relaxed at iteration i only if two additional conditions are satisfied: (1) we decide to relax operations on the candidates specified in the relaxation plan; and (2) the candidate is used by more than one processor.
Recall that we allocate a temporary vector for each processor to store its own part of the partial sums during the computation of A * p. The strategy for relaxing partial sum vector elements is almost the same as for the vector p. For each processor, assume the set of elements being accessed is {e1, e2, e3, e4, e5, e6, ...}. Then at even iterations we relax elements {e2, e4, e6, ...}, and at odd iterations we relax elements {e1, e3, e5, ...}. The basic rationale is the same: we do not relax the same element in two consecutive iterations, to ensure convergence.
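The sketch below shows one way the relaxation plan for p could be generated offline from the per-line weights of equation (5), selecting the least important half of the cache lines at each iteration while honoring the constraint that no line is relaxed in two consecutive iterations; all names and the data layout are illustrative.

#include <stdlib.h>

/* Sketch of relaxation-plan generation for p (Section 3.2). Assumes the
 * per-line weights W[t][l] of eq. (5) are available for every iteration t,
 * and that plan[][] is zero-initialized. plan[t][l] = 1 marks line l as a
 * relaxation candidate at iteration t.                                     */
typedef struct { int line; double w; } cand_t;

static int by_weight(const void *a, const void *b)
{
    double d = ((const cand_t *)a)->w - ((const cand_t *)b)->w;
    return (d > 0) - (d < 0);
}

void build_plan(int iters, int n_lines, double **W, char **plan)
{
    cand_t *c = malloc(n_lines * sizeof *c);
    for (int t = 0; t < iters; t++) {
        for (int l = 0; l < n_lines; l++) { c[l].line = l; c[l].w = W[t][l]; }
        qsort(c, n_lines, sizeof *c, by_weight);      /* least important first */
        int picked = 0;
        for (int k = 0; k < n_lines && picked < n_lines / 2; k++) {
            int l = c[k].line;
            if (t > 0 && plan[t - 1][l])              /* never relax a line in  */
                continue;                             /* two consecutive iters  */
            plan[t][l] = 1;
            picked++;
        }
    }
    free(c);
}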
4 Hardware Support for Soft Coherence
In this section, we present our hardware design
for soft cache coherence protocols. We first discuss the cache coherence mechanisms present in
Godson-T, our initial simulation platform. We then
present the enhancements we have made to approximate the performance of an invalidation protocol.
Finally, we present our implementation of soft coherence.
4.1 Cache Coherence in Godson-T

Godson-T provides basic hardware mechanisms to implement cache coherence without a centralized directory. The basic semantics of all cache coherence protocols are the same: each load operation may only obtain the value written by the most recent store operation. While directory-based invalidation protocols rely entirely on sophisticated hardware to proactively invalidate any stale copies in the system, Godson-T provides simpler mechanisms that are triggered at synchronization points.

In the Godson-T synchronization-based cache coherence scheme, all L1 caches are private, whereas the L2 cache is shared. Coherence is accomplished by flushing data from the L1 cache that has been changed by this or another processor. This flushing serves both to write the new data into the L2 cache and to force other processors to retrieve the new data from the L2 cache. As long as the sharing patterns are known in software, this provides very efficient coherence.

In order to illustrate the behavior of synchronization-based cache coherence, we present the behavior in two situations: critical sections and barriers.

Outside of a critical section, the L1 cache acts as a traditional write-back cache. Inside a critical section, however, shared memory is being accessed, so the coherence mechanisms must be employed. Since stale copies will not be invalidated, all stores within the critical section write through to the shared L2 cache. In order to load the latest values from the L2 cache, the first load to a particular address within the critical section obtains its data from the L2 cache; subsequent loads to the same address within the critical section can trust the local copy after the first reference.

In order to guarantee that all changes are reflected following a barrier, the L1 caches are flushed entirely: all dirty data is written to the L2 cache to communicate the new values, and all clean data is invalidated to ensure that subsequent accesses go to the L2 cache for the new value. This overhead is expensive, especially for applications like the CG solver, which relies heavily on barrier synchronization, as shown in Figure 1.
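To make the critical-section rules concrete, here is a simulator-style sketch of how loads and stores inside a critical section could be modeled under this scheme (write-through stores, first load refreshed from the L2); the helper functions and line fields are assumptions, not Godson-T's actual implementation.

#include <stdbool.h>
#include <stdint.h>

extern void     l2_write(uint64_t addr, uint64_t v);
extern uint64_t l2_read(uint64_t addr);

/* seen_in_cs should be cleared for all lines when the critical section is
 * entered or exited; that bookkeeping is omitted here.                     */
typedef struct { bool valid; bool seen_in_cs; uint64_t data; } l1e_t;

void cs_store(l1e_t *e, uint64_t addr, uint64_t v)
{
    e->data = v; e->valid = true;
    l2_write(addr, v);                 /* write-through inside critical section */
}

uint64_t cs_load(l1e_t *e, uint64_t addr)
{
    if (!e->valid || !e->seen_in_cs) { /* first touch in this critical section  */
        e->data = l2_read(addr);       /* refresh from the shared L2            */
        e->valid = true; e->seen_in_cs = true;
    }
    return e->data;                    /* later loads trust the local copy      */
}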
4.2 Approximating Invalidation-Based Coherence

In order to more closely approximate the performance of an invalidation-based coherence protocol in Godson-T, we must limit accesses to the L2 cache to lines that are modified by one processor and read by another. Our CG solver employs barriers, so we modified the design to selectively invalidate cache lines based on software hints. Specifically, at the barrier point, two categories of cache lines are no longer flushed. First, read-only shared data is kept, which is especially important for the CG solver because most of its data (the large sparse matrix) is read-only. Second, private processor data is held, since it cannot have been modified by any other processor. We are currently working on adding full support for the invalidation-based protocol to the Godson-T simulator.

Table 1: Architectural parameters of Godson-T V3
Core: in-order dual-issue pipeline with eight stages; user-level MIPS ISA; 1 ALU, 1 FPU, and 1 LSU.
On-chip network: 8x8 mesh with a static X-Y routing algorithm.
Router: 4-stage router with 16 GB/s peak bandwidth; two virtual channels, one request buffer each.
L1 cache: 64 KB per core; 32-byte line size; 1-cycle access latency; does not support outstanding misses.
L2 cache: 64 banks of 256 KB each; request buffer of 4 entries; 64-byte line size; 4-cycle access latency.
DDR2 controller: four DDR2 memory controllers; 32 GB/s peak memory bandwidth.
4.3 Soft Cache Coherence in Godson-T
The purpose of soft coherence is to allow lines
identified as soft to use old values some of the time.
To support this, each cache line is augmented with a "stale" bit, which indicates that the line is shared and may have been modified by another processor, but the new value has not yet been obtained.

When a barrier is reached, the hardware marks all soft cache lines as stale rather than flushing them. If a processor reads a stale cache line, the stale data is returned; at the same time, a load request is issued to the shared L2 cache to fetch the correct copy of the line. In the current design, only one such outstanding request is allowed. That is, if the hardware reads a stale cache line but there is already an outstanding stale-line request, it does not issue a new request; it simply returns the stale value to the processor pipeline.

If the hardware reads the cache and there is a cache miss, then an ordinary refill request is issued unconditionally. Note that an ordinary refill request can be issued even if there is an outstanding stale-line refill request. If there is an ordinary cache miss, the processor pipeline stops sending new memory requests to the data cache; that is, the design does not support multiple outstanding ordinary misses.
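The load-handling policy just described could be modeled along the following lines; the enum, structure, and helper names are illustrative, and the sketch only captures the single-outstanding-stale-refill rule and the stall on ordinary misses.

#include <stdbool.h>
#include <stdint.h>

typedef enum { LOAD_HIT, LOAD_STALE_HIT, LOAD_MISS_STALL } load_result_t;

typedef struct {
    bool valid, stale;          /* stale bit is cleared when the refill lands */
    uint64_t data;
} l1_line_t;

extern bool stale_refill_in_flight;          /* at most one per core          */
extern void issue_refill(uint64_t addr, bool is_stale_refill);

load_result_t l1_load(l1_line_t *line, uint64_t addr, uint64_t *value)
{
    if (line->valid && !line->stale) {       /* ordinary hit                  */
        *value = line->data;
        return LOAD_HIT;
    }
    if (line->valid && line->stale) {        /* stale hit: use the old value  */
        *value = line->data;
        if (!stale_refill_in_flight) {       /* only one outstanding stale    */
            stale_refill_in_flight = true;   /* refill is allowed             */
            issue_refill(addr, true);
        }
        return LOAD_STALE_HIT;               /* pipeline keeps running        */
    }
    issue_refill(addr, false);               /* ordinary miss: refill and     */
    return LOAD_MISS_STALL;                  /* stall until it completes      */
}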
Let’s explain this design in more detail with an
example. Assume there are three cache lines, A, B,
and C. Both A and B are stale lines, C is an ordinary
line. Assume the processor accesses A first, then B
and then C. When the processor accesses A, while
it returns the stale value back to the register, it also
issues a request to the L2 cache to refill the stale
line. When the refill is complete, the line becomes
up to date.
When the processor accesses B, there are two
cases. First, the request for A is finished. If this
is the case, then B is processed in the same way as
A. Second, the request for A is outstanding. In this case, the stale value is still returned, but no request is sent to the L2 cache to update line B, because the system does not support two simultaneous outstanding stale refill requests. When the processor accesses B again, if the line is still stale, and
there are no other stale lines being refilled, then the
hardware sends a refill request for B to bring back
the updated value.
If the processor experiences a cache miss when accessing C, there are also two cases, but neither depends on the accesses to A or B. As long as there is no outstanding conventional cache miss, the processor sends the request to the L2 cache. If there is an outstanding miss, then the processor must wait until that miss has completed before satisfying the request for C.
5 Experimental Results

5.1 Experimental Setup
Our experimental platform is a modified version of the Godson-T V3 infrastructure [4, 5, 6]. It is a tiled many-core architecture with 64 homogeneous processing cores interconnected by an 8x8 mesh network. Each core is a general-purpose in-order dual-issue processor that implements a core subset of the MIPS instruction set, with a private L1 cache and a local slice of the shared L2 cache. As discussed before, Godson-T employs a lock-based cache coherence scheme that eliminates the directory completely. We implement the soft (optimistic) cache coherence protocol on the Godson-T V3 platform. Important parameters of the platform are summarized in Table 1; more details can be found in [4, 5].

Godson-T provides a pthread-like interface for multi-threaded programming. We implement the conjugate gradient solver with this interface and compile it with a gcc-3.3.3 x64-to-MIPS cross compiler at the -O3 optimization level.
Table 2 lists the input matrices for the conjugate gradient solver. Among them, fv1-fv3 and Chem97ZtZ are from the UF sparse matrix collection [8], while cg.mtx is a random input matrix (class S) generated by the makea routine of the CG benchmark. Chem97ZtZ is a statistical matrix from the Bates group, and fv1-fv3 model finite elements of human body parts. These matrices are all symmetric positive definite and converge in fewer than 200 inner-loop iterations for our conjugate gradient solver. The second column of Table 2 shows the problem size of each matrix; for example, cg.mtx is a 1400x1400 matrix with 78148 non-zeros. The third column shows the number of inner-loop iterations for each matrix.

Table 2: Input Matrix Description
Name            Size               ITERS
cg.mtx          1400x1400, 78148   25
fv1.mtx         9604x9604, 85264   50
fv2.mtx         9801x9801, 87025   50
fv3.mtx         9801x9801, 87024   50
Chem97ZtZ.mtx   2541x2541, 7361    140
5.2 Analysis of the Comm/Comp Ratio

For each input matrix, the communication-to-computation ratio has a tremendous impact on the potential for relaxation. In this work, we measure this ratio as the number of elements (of double type) transferred between different processing cores relative to the number of floating-point multiply and add operations. For each input matrix, the amount of computation is Θ(ITERS * (nnz + 5 * n)).

The amount of communication traffic caused by the vector p depends on how the sparse matrix is partitioned. This is done by the Mondriaan sparse matrix partitioning tool [10, 11]. In all the experiments in this work, each input matrix is partitioned among 32 processors with at most 1% load imbalance. Once the partitioned matrix is given, the communication traffic caused by the vector p is determined; let Cp denote this traffic. The other part of the communication traffic is caused by summing up the partial sums in the A * p operation. The total amount of communication is Θ(Cp + ITERS * n * (p - 1)).

Note that the communication traffic Cp varies with the number of processing cores. We measure the communication-to-computation ratio for each input matrix from 2 cores to 64 cores and summarize the data in Table 3. Since the ratio is quite high (> 1/8) when the core number is larger than 16, the performance is highly sensitive to the coherence traffic. This offers much room for relaxation.

Table 3: The Comp to Comm Ratio
matrix      2C     4C    8C    16C   32C   64C
cg          196.8  65.9  28.5  13.9  7.0   3.5
fv1         55.0   17.0  7.6   3.6   1.7   0.87
fv2         53.7   17.7  7.7   3.6   1.8   0.88
fv3         53.7   17.7  7.6   3.6   1.7   0.87
Chem97ZtZ   27.7   9.8   4.3   2.1   1.0   0.49
5.3 Relaxation and Precision
For each input matrix, we perform experiments with several relaxation plans; the plans for each input matrix are shown in Table 4. Each relaxation plan implies a certain degree of relaxation of memory operations in the inner loop of Figure 1 after the iteration specified. For example, for the input matrix cg.mtx, we consider relaxing memory operations when the inner-loop iteration number is larger than 2, 3, 4, and 5, respectively. P0 is a special case: it means no relaxation for all input matrices (the iteration bound given is the last iteration). There is a trade-off between output precision and performance improvement: if we relax more iterations, more performance improvement can be achieved, but the output precision will be lower.

Table 4: Relaxation Plan Description
Plan   cg     fv1    fv2    fv3    Chem97ZtZ
P0     > 25   > 50   > 50   > 50   > 140
P1     > 2    > 5    > 5    > 5    > 70
P2     > 3    > 10   > 10   > 10   > 80
P3     > 4    > 15   > 15   > 15   > 90
P4     > 5    > 20   > 20   > 20   > 100

Figure 2 shows the trade-off between the relaxation level and the output precision. The x-axis denotes the relaxation plans, and the y-axis denotes the precision level achieved with each relaxation plan for each matrix, plotted on a logarithmic scale for readability. The specification of the CG benchmark sets the acceptable output precision at 1e-10. For cg.mtx and fv3, both relaxation plans P3 and P4 meet the requirement well. For the three other matrices, there is a small loss of precision even for P4. Further experiments show that if we make the relaxation plan slightly more conservative (by relaxing one or two fewer iterations), all matrices achieve the precision of 1e-10 with almost the same performance gain as P4.

Figure 2: Relaxation Level vs. Precision
5.4 Performance Potential of Relaxation

Figures 3 to 7 show the performance potential for each input matrix. The x-axis denotes the number of processing cores, and the y-axis denotes the performance speedup achieved in each configuration. Since we are interested in how much the relaxation can improve scalability, for each matrix we normalize the performance data to the one-processor configuration. In each figure, we plot the performance scalability for four configurations: the "no-relax" configuration is the one without any relaxation, and the other three represent relaxation plans for different output precision requirements, 1e-10, 1e-8, and 1e-6.

When the number of processors is small (< 8), there is limited improvement for all relaxation plans. There are two reasons. First, as can be seen in Table 3, the computation-to-communication ratio is high, so there is inherently little room for relaxation. Second, when the processor number is small, the working set cannot be held in the private caches, and capacity cache misses become the first-order performance bottleneck.

When the processor number grows beyond 16, we observe 6.9% to 13.3% performance improvement for all input matrices. Importantly, the system scalability is better as well.
Figure 3: Performance Results for cg
Figure 4: Performance Results for fv1
Figure 5: Performance Results for fv2
Figure 6: Performance Results for fv3
Figure 7: Performance Results for Chem97ZtZ
6 Related Works

An important goal of soft coherence is to mitigate the long latency of the coherence messages required to maintain cache coherence. It should be noted that some important previous works have pursued similar goals. Value prediction [13] and silent stores [14] seek to mitigate the overheads of memory loads and stores, respectively. Another interesting work is coherence decoupling [3], which allows loads to return values prematurely and relies on a verification mechanism to ensure correct protocol operation. The difference between soft coherence and previous research is that here we need neither speculative execution nor a retry mechanism to ensure correctness; instead, we exploit the arithmetic resilience of applications to tolerate coherence errors.

It should also be noted that this work is not the first to look at loose cache coherence protocols [3, 15]. Our work shares with previous research the property that the processor can continue execution without waiting for the completion of the coherence operations of load instructions. An important difference is that we exploit the numerical resilience of numerical applications to tolerate cache incoherence. We propose an analytical technique based on automatic differentiation tools to identify the set of memory operations which can be relaxed. Based on this analysis, the hardware design for our work is much simpler: first, we do not need write updates that consume extra bandwidth; second, we do not need any verification mechanism. Therefore, instead of making speculative use of incoherence as explored before, we tolerate incoherence to enable better scalability.
7 Conclusion and Future Work
Conventional cache coherence designs all adhere to the same restriction: any load operation
must return the value written by the most recent
write operation. In this work, we propose soft coherence for applications with algorithmic error resilience. Our approach relaxes this requirement for
some memory locations, allowing some loads to
return stale values without much sacrifice of the
output precision. We propose an analytical approach based on automatic differentiation to identify those operations which can be relaxed. Experimental results on the conjugate gradient solver
show that 6.9% to 12.6% performance improvement can be achieved.
Our initial evaluation was performed on the Godson-T architecture, which differs in a few important ways from traditional directory-based invalidation protocols. With the use of software hints, the timing of coherence operations is known, removing the requirement of accessing a distant directory and invalidating other copies. We expect more performance improvement in a system with an invalidation-based coherence protocol. In future work, we will evaluate more applications on a more general system running an invalidation-based cache coherence protocol.
References
[1] D. Culler, J. P. Singh, and A. Gupta, Parallel Computer Architecture: A Hardware/Software
Approach, Morgan Kaufmann, 1998.
[2] D. Koufaty and J. Torrellas, “Comparing data
forwarding and prefetching for communicationinduced misses in shared-memory mps,” in Proceedings of the 12th International Conference on
Supercomputing, 1998.
[3] J. Huh, J. C. Chang, D. Burger, and G. S. Sohi,
“Coherence decoupling: Making use of incoherence,” in Proceedings of International Conference
on Architectural Support for Programming Languages and Operating Systems, October 2004.
[4] N. Yuan, L. Yu, and D. Fan, “An efficient and
flexible task management for many-core architecture,” in Proceedings of Workshop on Software
and Hardware Challenges of Manycore Platforms. In conjunction with the 35th International
Symposium on Computer Architecture, June 2008.
[5] H. Huang, N. Yuan, W. Lin, G. P. Long, F. L.
Song, L. Yu, L. Liu, Y. Zhou, X. Ye, J. Zhang, and
D. Fan, “Architecture supported synchronizationbased cache coherence protocol for many-core
processors,” in Proceedings of the 2nd Workshop
on Chip Multiprocessor Memory Systems and Interconnects. In conjunction with the International
Symposium on Computer Architecture, June 2008.
[6] G. P. Long, D. R. Fan, and J. C. Zhang, “Architectural support for cilk computations on many core
architectures,” in Proceedings of ACM SIGPLAN
Symposium on Principles and Practice of Parallel
Programming, February 2009.
[7] D. Bailey, E. Barszcz, J. Barton, D. Browning,
R. Carter, L. Dagum, R. Fatoohi, S. Fineberg,
P. Frederickson, T. Lasinski, R. Schreiber, H. Simon, V. Venkatakrishnan, and S. Weeratunga,
“The nas parallel benchmarks,” Technical Report
RNR-94-007, NASA Advanced Supercomputing
(NAS) Division, March 1994.
[8] T. Davis, "University of Florida sparse matrix collection," 2009.
[9] H. Lof and J. Rantakokko, “Algorithmic optimizations of a conjugate gradient solver on shared
memory architectures,” The International Journal
of Parallel, Emergent and Distributed Systems,
Vol. 21, No. 5, pages 345-363 , 2006.
[10] B. Vastenhouw and R. H. Bisseling, “A twodimensional data distribution method for parallel sparse matrix-vector multiplication,” SIAM Review, Vol. 47, No. 1, page 67-95 , January 2005.
[11] B. Vastenhouw and W. Meesen, “Communication
balancing in parallel sparse matrix-vector multiplication,” Electronic Transactions on Numerical
Analysis, Vol. 21, pages 47-65, special issue on
Combinatorial Scientific Computing , 2005.
[12] I. Charpentier and J. Utke, “Fast higher-order
derivative tensors with rapsodia,” Optimization
Methods Software , 2009.
[13] M. H. Lipasti, C. B. Wilkerson, and J. P. Shen, "Value locality and load value prediction," in Proceedings of the International Conference on Architectural Support for Programming Languages and Operating Systems, October 1996.
[14] K. M. Lepak and M. H. Lipasti, “Silent stores for
free,” in Proceedings of International Symposium
on Microarchitecture, December 2000.
[15] “Memory systems for parallel programming,”
Ph.D. thesis, Computer Sciences Department,
University of Wisconsin - Madison , August 1996.