Soft Coherence: Preliminary Experiments with Error-Tolerant Cache Coherence in Numerical Applications

Guoping Long†, Frederic T. Chong‡, Diana Franklin‡, John Gilbert‡, Dongrui Fan†
† Institute of Computing Technology, Chinese Academy of Sciences
‡ Department of Computer Science, UC Santa Barbara

Abstract

As we scale into the multi-core era, we face severe challenges in the scalability and performance of on-chip cache-coherent shared memory mechanisms. We explore application error-tolerance as an extra degree of freedom to meet these challenges. Iterative numerical algorithms, in particular, can cope with the occasional stale value with little or no effect on accuracy or convergence time. We explore analysis methods to distinguish between critical and non-critical data in such algorithms, and we exploit this distinction to design soft coherence protocols that provide strong guarantees for critical data and weak guarantees for non-critical data. Our preliminary results use a conjugate gradient solver as an example, with experiments on five sparse matrices showing 6.9%–12.6% performance improvement with little loss in precision.

1 Introduction

Many multi-processor systems assume invalidation-based coherence protocols [1] to provide good programmability and reasonable performance. The cost of maintaining strict cache coherence has been a constant struggle for researchers, leading to proposals to prefetch invalidated data [2], decoupled coherence protocols [3], and others. The performance overhead is caused by two services of the invalidation protocol. First, all writes to the same address must be serialized, which requires access to a centralized location when writing a currently shared location. Second, each read must return the value written by the most recent write, requiring a long-latency load if the data has been changed. This places high overhead on both the writing and the reading of shared data.

In this paper, we propose to bypass this overhead when accessing some shared data. When a load requests a line that resides in the cache but has been invalidated by another processor, the processor simply uses the stale value while the request for the updated line is being fulfilled. That is, the processor does not have to wait for the completion of the cache miss to continue execution.

In this soft approach to cache coherence, the system allows a load operation to return stale values. This clearly breaks an important assumption of previous cache coherence protocols, which require a load to return only the value written by the most recent store operation. Naive adoption of this approach may therefore cause correctness problems. However, some applications allow a certain degree of algorithmic error resilience. For example, given a parallel numerical application, if the required precision (the difference between the execution output and the theoretical result) is 1e-10, and a strict implementation of cache coherence achieves a precision of 1e-15, then there is room to relax operations that are not critical to the output precision. This observation motivates us to employ a conservative coherence protocol (such as an invalidation-based protocol) as the correctness substrate, and to explore the use of soft coherence for selected memory operations to achieve additional performance gains.

To support soft coherence, we provide two hardware modes for cache coherence. One is the conservative mode, in which all loads receive the most recently stored value.
The other is the aggressive mode, in which the soft coherence protocol is enabled to seek as much performance gain as possible. The policy that specifies which execution mode to use is determined by the relaxation plan.

In this work, we evaluate the potential of the soft coherence protocol on the Godson-T many-core platform [4, 5, 6]. The application is the CG program from the NPB benchmark suite [7]. Besides the default input matrix, we select four sparse matrices from the UF sparse matrix collection [8] for our experiments. Experimental results show that our approach brings an 8.8% performance improvement on average across all matrices.

The rest of the paper is organized as follows. Section 2 discusses the parallel computation structure of CG. Section 3 presents the methodology for generating an appropriate relaxation plan for each input sparse matrix. Section 4 discusses the details of the soft cache coherence design on Godson-T. Section 5 reports our preliminary experimental results. Section 6 discusses related work, and Section 7 concludes the paper and discusses future work.

2 Parallel Conjugate Gradient Solver

The conjugate gradient (CG) solver here is adapted from the CG program of the NPB benchmark suite. This program estimates the largest eigenvalue of a symmetric positive definite sparse matrix with the inverse power method. The basic structure of the kernel loop of CG, which is used to solve the sparse matrix equation Ax = b, is shown in Figure 1. We use this algorithm to illustrate our approach; more discussion of how to optimize the kernel loop can be found in [9].

[Figure 1: Parallel Implementation of the Kernel Loop]

When the sparse matrix becomes large, the matrix-vector multiply (Ax) dominates the execution time of the kernel loop. Since naive parallelization of Ax by row or by column cannot achieve good load balance, we first partition A with a sparse matrix partitioning tool, Mondriaan [10, 11]. Mondriaan distributes the computation load evenly among processors while trying to minimize interprocessor communication. We allocate a temporary vector for each processor to store its own part of the partial sums. When all partial-sum vectors have been generated, we sum them up to form the solution vector q, which should equal A ∗ p. The question is how to sum up these partial vectors. In the current implementation, we parallelize this reduction embarrassingly: each processor is responsible for generating an equal part of the resultant vector q. This approach incurs much coherence traffic because each processor has to access parts of all the other partial-sum vectors.

Aside from A ∗ x, the remaining vector computations can all be parallelized embarrassingly. Note that although the inner products incur a rather small volume of communication and computation, they represent global data dependence and synchronization, and relaxation of any of their partial sums would have a tremendous impact on the output precision. Among all the shared vectors and global scalars in Figure 1, our detailed profiling shows that approximately 95% of the coherence misses are caused by the vector p and the partial sums of the A ∗ x operation. We therefore only consider relaxing p and those important partial sums in this work.
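To make this structure concrete, the following C sketch shows the two phases described above: each processor accumulates partial sums over its Mondriaan partition, and then all processors cooperatively reduce the partial vectors into q. The data layout, the partition_t type, and the barrier() primitive are illustrative assumptions of ours, not the benchmark code or the Godson-T threading interface.

#include <stddef.h>

/* Non-zeros assigned to one processor by the Mondriaan partitioning
 * (coordinate form; illustrative layout only). */
typedef struct {
    size_t  nnz;
    size_t *row, *col;
    double *val;
} partition_t;

void barrier(void);   /* assumed synchronization primitive */

/* Phase 1: accumulate this processor's partial sums of A*p.
 * Assumes partial[] has been zeroed before the call. */
void spmv_partial(const partition_t *part, const double *p, double *partial)
{
    for (size_t k = 0; k < part->nnz; k++)
        partial[part->row[k]] += part->val[k] * p[part->col[k]];
}

/* Phase 2: "embarrassingly parallel" reduction.  Processor 'me' of 'nproc'
 * owns an equal slice of q and sums the corresponding slice of every
 * partial vector; reading the other processors' partial vectors is where
 * most of the coherence traffic arises. */
void reduce_partials(double *q, double *const *partials,
                     size_t n, int nproc, int me)
{
    size_t lo = (size_t)me * n / nproc;
    size_t hi = (size_t)(me + 1) * n / nproc;

    barrier();   /* wait until all partial vectors are complete */
    for (size_t i = lo; i < hi; i++) {
        double s = 0.0;
        for (int pr = 0; pr < nproc; pr++)
            s += partials[pr][i];
        q[i] = s;
    }
}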
3 Relaxation Plan Generation

In this section, we study in detail how to relax the coherence requirements of memory operations on the vector p and the partial sums of Ax. We first present a technique to evaluate the relative importance of different elements of p, and then discuss how to generate relaxation plans.

3.1 Sensitivity Analysis

We first evaluate the importance of a particular vector element x. We model the computation in Figure 1 as a function of x, y = f(x), and use this function to study the impact of a variation of the input x on the output y. The Taylor expansion of y = f(x) can be written as:

f(x) = f(a) + \sum_{i=1}^{n} \frac{f^{(i)}(a)}{i!} (x - a)^i + O(n)    (1)

In particular, if \lim_{n \to +\infty} O(n) = 0, we have:

f(x) \approx f(a) + \sum_{i=1}^{n} \frac{f^{(i)}(a)}{i!} (x - a)^i    (2)

In this work we choose n = 5 and rewrite equation (2) as:

f(x) \approx f(a) + C_1 (x - a) + C_2 (x - a)^2 + C_3 (x - a)^3 + C_4 (x - a)^4 + C_5 (x - a)^5    (3)

Given a serial C implementation of the algorithm in Figure 1, we can obtain the coefficients C1–C5 with automatic differentiation tools such as Rapsodia [12]. As shown in Figure 1, the kernel loop iterates ITERS times; the constant ITERS differs across input matrices. In each iteration i, the variable x is read during the computation of A ∗ p and later rewritten. Assume x changes from a to b; we then use the following formula to represent the importance of x at iteration i:

W(x, i) = C_1 (a - b) + C_2 (a - b)^2 + C_3 (a - b)^3 + C_4 (a - b)^4 + C_5 (a - b)^5    (4)

In practice, we relax the coherence requirements of memory operations at cache-line granularity. Therefore, we must evaluate not only the importance of a particular variable x but also the importance of the corresponding cache line. The L1 cache line size of Godson-T is 32 bytes, so a cache line can hold four consecutive vector elements. Assume x1, x2, x3, x4 are in the same cache line, and let w(x1, i), w(x2, i), w(x3, i), and w(x4, i) be their weights (importance factors) at iteration i. We use the following formula to measure the importance of the cache line at iteration i:

W(x_1, x_2, x_3, x_4, i) = \sqrt{w(x_1, i)^2 + w(x_2, i)^2 + w(x_3, i)^2 + w(x_4, i)^2}    (5)

The larger the value of W(x1, x2, x3, x4, i), the more important the cache line.

3.2 Relaxation Plan Generation

For iterative algorithms such as the CG solver, it is important to ensure convergence even after relaxing memory operations; arbitrary relaxation may produce results that do not converge. The key observation is that we should not relax a vector element for an arbitrary number of consecutive iterations. Specifically, we introduce a tolerance factor d: if we relax an element in iteration i, we may still relax it in iterations i + 1, i + 2, ..., i + d − 1, but we must never relax it in iteration i + d. With this relaxation constraint, at each iteration i a processor reads, at worst, a value written in iteration i − d. We assume d = 2 throughout all experiments in this work.

We now discuss the relaxation plan for the elements of the vector p. Assume the vector has N elements. Since a cache line holds four consecutive elements, there are approximately N/4 cache lines for this vector. We partition the cache lines into two groups of N/8 cache lines each. The relaxation plan for the vector p is determined as follows:
a. At iteration 1, the N/8 least important elements of p are considered relaxation candidates;
b. From iteration 2 to the last iteration, we select the N/8 least important elements under the condition that no selected element was relaxed in the previous iteration.

A relaxation plan only specifies candidates for relaxation. An interesting property of the conjugate gradient solver is that the output precision is very sensitive to the results of the early iterations. Therefore, for all input matrices we do not relax any operation during the first several iterations. A candidate is actually relaxed at iteration i only if two additional conditions are satisfied: (1) we have decided to relax operations on the candidates specified in the relaxation plan; and (2) the candidate is used by more than one processor.

Recall that we allocate a temporary vector for each processor to store its part of the partial sums during the computation of A ∗ p. The strategy for relaxing partial-sum vector elements is almost the same as for the vector p. For each processor, assume the set of elements it accesses is {e1, e2, e3, e4, e5, e6, ...}. At even iterations, we relax elements {e2, e4, e6, ...}; at odd iterations, we relax elements {e1, e3, e5, ...}. The rationale is the same: we do not relax the same element in two consecutive iterations, to preserve convergence.
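To make the analysis and the plan concrete, the following C sketch shows how the per-element weight of equation (4), the cache-line importance of equation (5), and the relaxation constraints described above could be evaluated. It is a minimal illustration under our own naming assumptions (the coefficient array, the consecutive-relaxation counter, and the 1-based element indices are ours); in the real flow the coefficients come from automatic differentiation of the kernel loop.

#include <math.h>

/* Equation (4): importance of an element that changed from a to b in one
 * iteration; C[0..4] hold the coefficients C1..C5 obtained by automatic
 * differentiation of the kernel loop. */
static double element_weight(const double C[5], double a, double b)
{
    double d = a - b, term = 1.0, w = 0.0;
    for (int i = 0; i < 5; i++) {
        term *= d;              /* (a - b)^(i+1) */
        w += C[i] * term;
    }
    return w;
}

/* Equation (5): importance of a 32-byte cache line holding four consecutive
 * elements, taken as the 2-norm of the four element weights. */
static double line_weight(const double w[4])
{
    return sqrt(w[0] * w[0] + w[1] * w[1] + w[2] * w[2] + w[3] * w[3]);
}

/* Relaxation constraints for one element of p at inner-loop iteration iter:
 * the plan must allow relaxation (iter beyond the plan's start iteration),
 * the element must be shared by more than one processor, and it must not
 * already have been relaxed for d consecutive iterations (d = 2 here). */
static int may_relax(int iter, int plan_start, int shared,
                     int *consecutive_relaxed, int d)
{
    if (iter <= plan_start || !shared || *consecutive_relaxed >= d) {
        *consecutive_relaxed = 0;   /* forced refresh this iteration */
        return 0;
    }
    (*consecutive_relaxed)++;
    return 1;
}

/* Partial-sum elements alternate by parity (1-based indices e1, e2, ...):
 * even-indexed elements are relaxed in even iterations, odd-indexed in odd
 * iterations, so no element is relaxed in two consecutive iterations. */
static int may_relax_partial_sum(int iter, int index, int plan_start)
{
    return iter > plan_start && (index % 2) == (iter % 2);
}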
4 Hardware Support for Soft Coherence

In this section, we present our hardware design for soft cache coherence protocols. We first discuss the cache coherence mechanisms present in Godson-T, our initial simulation platform. We then present the enhancements we have made to approximate the performance of an invalidation protocol. Finally, we present our implementation of soft coherence.

4.1 Cache Coherence in Godson-T

Godson-T provides basic hardware mechanisms to implement cache coherence without a centralized directory. The basic semantics of all cache coherence protocols are the same: each load operation may only obtain the value written by the most recent store operation. While directory-based invalidation protocols rely entirely on sophisticated hardware to proactively invalidate any stale copies in the system, Godson-T provides simpler mechanisms that are triggered at synchronization points.

In the Godson-T synchronization-based cache coherence scheme, all L1 caches are private, whereas the L2 cache is shared. Coherence is accomplished by flushing data from the L1 cache that has been changed by this or another processor. This flushing serves both to write the new data into the L2 cache and to force other processors to retrieve the new data from the L2 cache. As long as the sharing patterns are known in software, this provides very efficient coherence.

To illustrate the behavior of synchronization-based cache coherence, we describe two situations: critical sections and barriers. Outside a critical section, the L1 cache acts as a traditional write-back cache. Inside a critical section, however, shared memory is being accessed, so the coherence mechanisms must be employed. Since stale copies will not be invalidated, all stores within the critical section write through to the shared L2 cache. In order to load the latest values from the L2 cache, the first load to a particular address within the critical section obtains its data from the L2 cache; subsequent loads to the same address within the critical section can trust the local copy after that first reference.

In order to guarantee that all changes are reflected after a barrier, the L1 caches are flushed entirely. All dirty data is written to the L2 cache to communicate the new values, and all clean data is invalidated so that subsequent accesses go to the L2 cache for the new value. This overhead is expensive, especially for applications like the CG solver, which relies heavily on barrier synchronization, as shown in Figure 1.
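As a rough illustration of the barrier behavior just described, the sketch below models what happens to each L1 line when a barrier is reached under the baseline synchronization-based scheme. The line structure and the l2_writeback() hook are assumptions made for illustration, not the simulator's interface, and treating dirty lines as invalidated after write-back is one plausible reading of "flushed entirely".

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    bool     valid;
    bool     dirty;
    uint64_t tag;
    uint8_t  data[32];   /* 32-byte L1 line on Godson-T */
} l1_line;

/* Assumed hook that writes one line back to the shared L2 cache. */
void l2_writeback(const l1_line *line);

/* Baseline behavior at a barrier: dirty lines are written back to L2 so
 * other cores can see the new values, and every line is invalidated so the
 * next access after the barrier fetches fresh data from L2. */
void l1_flush_at_barrier(l1_line *lines, size_t nlines)
{
    for (size_t i = 0; i < nlines; i++) {
        if (lines[i].valid && lines[i].dirty)
            l2_writeback(&lines[i]);
        lines[i].valid = false;   /* clean and dirty lines both invalidated */
        lines[i].dirty = false;
    }
}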
4.2 Approximating Invalidation-Based Coherence

In order to more closely approximate the performance of an invalidation-based coherence protocol on Godson-T, we must limit accesses to the L2 cache to lines that are modified by one processor and read by another. Our CG solver employs barriers, so we modified the design to selectively invalidate cache lines based on software hints. Specifically, at the barrier point, two categories of cache lines are no longer flushed. First, read-only shared data is kept, which is especially important for the CG solver because most of the data (the large sparse matrix) is read-only. Second, private processor data is kept, since it cannot have been modified by any other processor. We are currently working on adding full support for an invalidation-based protocol to the Godson-T simulator.

Table 1: Architectural parameters of Godson-T V3
  Component         Description
  Core              In-order dual-issue pipeline with eight stages. User-level MIPS ISA; 1 ALU, 1 FPU, and 1 LSU.
  On-chip network   8x8 mesh network with static X-Y routing.
  Router            4-stage router with 16 GB/s peak bandwidth. Two virtual channels, one request buffer for each.
  L1 cache          64 KB per core. Line size: 32 bytes. Access latency: 1 cycle. Does not support outstanding misses.
  L2 cache          64 banks, 256 KB each. Request buffer size: 4 entries. 64-byte line size, 4-cycle access latency.
  DDR2 controller   Four DDR2 memory controllers. 32 GB/s peak memory bandwidth.

4.3 Soft Cache Coherence in Godson-T

The purpose of soft coherence is to allow lines identified as soft to use old values some of the time. To support this, each cache line is augmented with a "stale" bit, which indicates that this is a shared line that may have been modified by another processor but whose new value has not yet been obtained. When a barrier is reached, the hardware marks all soft cache lines as stale rather than flushing them.

If a processor reads a stale cache line, the stale data is returned. At the same time, a load request is issued to the shared L2 cache to fetch the correct copy of the line. In the current design, only one such outstanding request is allowed: if the hardware reads a stale cache line while a stale-line request is already outstanding, it does not issue a new request and simply returns the stale value to the processor pipeline. If a read misses in the cache, an ordinary refill request is issued unconditionally; an ordinary refill request can be issued even if a stale-line refill request is outstanding. On an ordinary cache miss, however, the processor pipeline stops sending new memory requests to the data cache; that is, the design does not support multiple ordinary outstanding misses.
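The load path just described can be summarized with a small behavioral model. The sketch below is ours (the types, field names, and single-pending-request flags are illustrative assumptions, and refill completion is not modeled); it only captures the decision made on each load.

#include <stdbool.h>
#include <stdint.h>

typedef struct {
    bool     valid;
    bool     stale;    /* set at a barrier for soft lines instead of flushing */
    uint64_t data;     /* the word being loaded (simplified to one value)     */
} soft_line;

typedef struct {
    bool stale_refill_pending;    /* at most one outstanding stale refill     */
    bool ordinary_miss_pending;   /* at most one outstanding ordinary miss    */
} soft_l1_state;

/* Returns true if the load completes immediately (ordinary hit or stale hit);
 * returns false on an ordinary miss, which stalls the pipeline until the
 * refill from L2 returns. */
bool l1_load(soft_l1_state *st, soft_line *line, uint64_t *out)
{
    if (line->valid && !line->stale) {        /* ordinary hit                 */
        *out = line->data;
        return true;
    }
    if (line->valid && line->stale) {         /* stale hit: use the old value */
        *out = line->data;
        if (!st->stale_refill_pending)
            st->stale_refill_pending = true;  /* issue one background refill  */
        return true;                          /* otherwise just keep going    */
    }
    /* Ordinary miss: the refill is issued even if a stale refill is in
     * flight, but the core waits here until it completes. */
    st->ordinary_miss_pending = true;
    return false;
}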
Let us explain this design in more detail with an example. Assume there are three cache lines, A, B, and C; A and B are stale lines, and C is an ordinary line. Assume the processor accesses A first, then B, then C. When the processor accesses A, it returns the stale value to the register and also issues a request to the L2 cache to refill the stale line; when the refill completes, the line becomes up to date. When the processor accesses B, there are two cases. First, the request for A has finished; in this case, B is processed in the same way as A. Second, the request for A is still outstanding; in this case, the stale value is still returned, but no request is sent to the L2 cache to update line B, because the system does not support two simultaneous outstanding stale refill requests. When the processor accesses B again, if the line is still stale and no other stale line is being refilled, the hardware sends a refill request for B to bring in the updated value. If the processor experiences a cache miss when accessing C, there are also two cases, but neither depends on the accesses to A or B. As long as there is no outstanding conventional cache miss, the request is sent to the L2 cache; if there is an outstanding miss, the processor must wait until it has completed before the request for C can be satisfied.

5 Experimental Results

5.1 Experimental Setup

Our experimental platform is a modified version of the Godson-T V3 infrastructure [4, 5, 6]. It is a tiled many-core architecture with 64 homogeneous processing cores interconnected by an 8x8 mesh network. Each core is a general-purpose in-order dual-issue processor that implements a core subset of the MIPS instruction set, and each core has a private L1 cache and a local slice of the shared L2 cache. As discussed before, Godson-T employs a synchronization-based cache coherence scheme that eliminates the directory completely. We implement the soft cache coherence protocol on the Godson-T V3 platform. Important parameters of the platform are summarized in Table 1; more details can be found in [4, 5].

Godson-T provides a pthread-like interface for multi-threaded programming. We implement the conjugate gradient solver with this interface and compile it with the gcc-3.3.3 x64-to-MIPS cross compiler at the -O3 optimization level.

Table 2 lists the input matrices for the conjugate gradient solver. Among them, fv1–fv3 and Chem97ZtZ are from the UF sparse matrix collection [8]; cg.mtx is a random input matrix (class S) generated by the makea routine of the CG benchmark. Chem97ZtZ is a statistical matrix from the Bates group, and fv1–fv3 model finite elements of human body parts. These matrices are all symmetric positive definite and converge in fewer than 200 inner-loop iterations for our conjugate gradient solver. The second column shows the problem size of each matrix; for example, cg.mtx is a 1400x1400 matrix with 78148 non-zeros. The third column shows the number of inner-loop iterations for each matrix.

Table 2: Input Matrix Description
  Name            Size (dimensions, non-zeros)   ITERS
  cg.mtx          1400x1400, 78148                25
  fv1.mtx         9604x9604, 85264                50
  fv2.mtx         9801x9801, 87025                50
  fv3.mtx         9801x9801, 87024                50
  Chem97ZtZ.mtx   2541x2541, 7361                140

5.2 Analysis of the Comm/Comp Ratio

For each input matrix, the communication-to-computation ratio has a tremendous impact on the potential for relaxation. In this work, we measure this ratio as the number of elements (of double type) transferred between different processing cores divided by the number of floating-point multiply and add operations. For each input matrix, the amount of computation is Θ(ITERS · (nnz + 5n)). The amount of communication traffic caused by the vector p depends on how the sparse matrix is partitioned, which is done by the Mondriaan sparse matrix partitioning tool [10, 11]; in all experiments in this work, each input matrix is partitioned among 32 processors with at most 1% load imbalance. Once the partitioned matrix is given, the communication traffic caused by the vector p is determined; let Cp denote this traffic. The other part of the communication traffic is caused by summing up the partial sums in the A ∗ p operation. The total amount of communication is Θ(Cp + ITERS · n · (p − 1)).
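As a worked example of these two estimates, a helper like the one below computes the ratio for a given matrix and processor count. This is our own illustrative code: Cp must be supplied from the Mondriaan partitioning, and the constant factors hidden by the Θ-notation are ignored.

/* Rough comm/comp estimate following the formulas above:
 *   computation   ~ ITERS * (nnz + 5*n)         multiply-add operations
 *   communication ~ Cp + ITERS * n * (p - 1)    double-precision elements
 * cp (traffic due to the vector p) comes from the Mondriaan partition. */
static double comm_to_comp_ratio(long iters, long nnz, long n, int p, long cp)
{
    double comp = (double)iters * ((double)nnz + 5.0 * (double)n);
    double comm = (double)cp + (double)iters * (double)n * (double)(p - 1);
    return comm / comp;
}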
Note that the communication traffic Cp varies with the number of processing cores. We measure the communication-to-computation ratio for each input matrix from two cores to 64 cores and summarize the data in Table 3. Since the ratio is quite high (greater than 1/8) when the core count exceeds 16, performance is highly sensitive to coherence traffic. This offers much room for relaxation.

Table 3: The Comp to Comm Ratio
  Matrix       2C      4C     8C     16C    32C    64C
  cg          196.8   65.9   28.5   13.9   7.0    3.5
  fv1          55.0   17.0    7.6    3.6   1.7    0.87
  fv2          53.7   17.7    7.7    3.6   1.8    0.88
  fv3          53.7   17.7    7.6    3.6   1.7    0.87
  Chem97ZtZ    27.7    9.8    4.3    2.1   1.0    0.49

5.3 Relaxation and Precision

For each input matrix, we perform experiments with several relaxation plans; the plans for each input matrix are shown in Table 4. Each relaxation plan relaxes memory operations in the inner loop of Figure 1 after the iteration specified. For example, for the input matrix cg.mtx we consider relaxing memory operations when the inner-loop iteration number is larger than 2, 3, 4, and 5, respectively. P0 is a special case: it means no relaxation for any input matrix (the iteration bound given is the last iteration).

Table 4: Relaxation Plan Description
  Matrix       P0      P1     P2     P3     P4
  cg          > 25    > 2    > 3    > 4    > 5
  fv1         > 50    > 5    > 10   > 15   > 20
  fv2         > 50    > 5    > 10   > 15   > 20
  fv3         > 50    > 5    > 10   > 15   > 20
  Chem97ZtZ   > 140   > 70   > 80   > 90   > 100

There is a trade-off between output precision and performance improvement: if we relax too many iterations, more performance improvement can be achieved, but the output precision will be lower. Figure 2 shows the trade-off between the relaxation level and the output precision. The x-axis denotes the relaxation plans and the y-axis denotes the precision level achieved with each relaxation plan for each matrix; the plot uses a logarithmic scale for readability. The specification of the CG benchmark sets the acceptable output precision at 1e-10. For cg.mtx and fv3, relaxation plans P3 and P4 both meet the requirement well. For the three other matrices, there is a small sacrifice of precision even for P4. Further experiments show that if we make the relaxation plan slightly more conservative (by relaxing one or two fewer iterations), all matrices achieve the 1e-10 precision with almost the same performance gain as P4.

[Figure 2: Relaxation Level vs. Precision — output precision achieved under plans P0–P4 for cg, fv1, fv2, fv3, and Chem97ZtZ (logarithmic scale).]
5.4 Performance Potential of Relaxation

Figures 3 to 7 show the performance potential for each input matrix. The x-axis denotes the number of processing cores and the y-axis denotes the speedup achieved in each configuration. Since we are interested in how much the relaxation can improve scalability, for each matrix we normalize the performance data to the one-processor configuration. In each figure, we plot the performance scalability of four configurations: the "no-relax" configuration is the one without any relaxation, and the other three represent relaxation plans for different output precision requirements (1e-10, 1e-8, and 1e-6).

When the number of processors is small (< 8), there is limited improvement for all relaxation plans, for two reasons. First, as can be seen in Table 3, the computation-to-communication ratio is high, so there is inherently little room for relaxation. Second, when the processor count is small, the working set cannot be held in the private caches, and capacity cache misses become the first-order performance bottleneck. When the processor count grows beyond 16, we observe 6.9% to 13.3% performance improvement for all input matrices, and, importantly, the system scalability improves as well.

[Figure 3: Performance Results for cg — speedup vs. number of processors (2–64) for the no-relax, 1e-10, 1e-8, and 1e-6 configurations.]
[Figure 4: Performance Results for fv1]
[Figure 5: Performance Results for fv2]
[Figure 6: Performance Results for fv3]
[Figure 7: Performance Results for Chem97ZtZ]

6 Related Works

An important goal of soft coherence is to mitigate the long latency of the coherence messages required to maintain cache coherence. Some important previous works have pursued similar goals. Value prediction [13] and silent stores [14] seek to mitigate the overheads of memory loads and stores, respectively. Another interesting work is coherence decoupling [3], which allows loads to return values prematurely and relies on a verification mechanism to ensure correct protocol operation. The difference between soft coherence and this previous research is that we need neither speculative execution nor a retry mechanism to ensure correctness; instead, we exploit the arithmetic resilience of applications to tolerate coherence errors.

It should also be noted that this work is not the first to look at loose cache coherence protocols [3, 15]. Our work shares with previous research the property that the processor can continue execution without waiting for the completion of coherence operations for load instructions. An important difference of our work is that we exploit the numerical resilience of numerical applications to tolerate cache incoherence. We propose an analytical technique based on automatic differentiation tools to identify the set of memory operations that can be relaxed. Based on this analysis, the hardware design for our approach can be much simpler. First, we do not need write updates that consume extra bandwidth. Second, we do not need any verification mechanism. Therefore, instead of making use of incoherence speculatively, as explored before, we tolerate incoherence to enable better scalability.

7 Conclusion and Future Work

Conventional cache coherence designs all adhere to the same restriction: any load operation must return the value written by the most recent write operation. In this work, we propose soft coherence for applications with algorithmic error resilience. Our approach relaxes this requirement for some memory locations, allowing some loads to return stale values without much sacrifice of output precision. We propose an analytical approach based on automatic differentiation to identify the operations that can be relaxed.
Experimental results on the conjugate gradient solver show that a 6.9% to 12.6% performance improvement can be achieved. Our initial evaluation was performed on the Godson-T architecture, which differs in a few important ways from traditional directory-based invalidation protocols. With the use of software hints, the timing of coherence operations is known, removing the need to access a distant directory and to invalidate other copies. We expect more performance improvement on a system with an invalidation-based coherence protocol. In future work, we will evaluate more applications on a more general system running an invalidation-based cache coherence protocol.

References

[1] D. Culler, J. P. Singh, and A. Gupta, Parallel Computer Architecture: A Hardware/Software Approach, Morgan Kaufmann, 1998.
[2] D. Koufaty and J. Torrellas, "Comparing data forwarding and prefetching for communication-induced misses in shared-memory MPs," in Proceedings of the 12th International Conference on Supercomputing, 1998.
[3] J. Huh, J. C. Chang, D. Burger, and G. S. Sohi, "Coherence decoupling: Making use of incoherence," in Proceedings of the International Conference on Architectural Support for Programming Languages and Operating Systems, October 2004.
[4] N. Yuan, L. Yu, and D. Fan, "An efficient and flexible task management for many-core architecture," in Proceedings of the Workshop on Software and Hardware Challenges of Manycore Platforms, held in conjunction with the 35th International Symposium on Computer Architecture, June 2008.
[5] H. Huang, N. Yuan, W. Lin, G. P. Long, F. L. Song, L. Yu, L. Liu, Y. Zhou, X. Ye, J. Zhang, and D. Fan, "Architecture supported synchronization-based cache coherence protocol for many-core processors," in Proceedings of the 2nd Workshop on Chip Multiprocessor Memory Systems and Interconnects, held in conjunction with the International Symposium on Computer Architecture, June 2008.
[6] G. P. Long, D. R. Fan, and J. C. Zhang, "Architectural support for Cilk computations on many-core architectures," in Proceedings of the ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, February 2009.
[7] D. Bailey, E. Barszcz, J. Barton, D. Browning, R. Carter, L. Dagum, R. Fatoohi, S. Fineberg, P. Frederickson, T. Lasinski, R. Schreiber, H. Simon, V. Venkatakrishnan, and S. Weeratunga, "The NAS parallel benchmarks," Technical Report RNR-94-007, NASA Advanced Supercomputing (NAS) Division, March 1994.
[8] T. Davis, "University of Florida sparse matrix collection," 2009.
[9] H. Lof and J. Rantakokko, "Algorithmic optimizations of a conjugate gradient solver on shared memory architectures," The International Journal of Parallel, Emergent and Distributed Systems, Vol. 21, No. 5, pages 345-363, 2006.
[10] B. Vastenhouw and R. H. Bisseling, "A two-dimensional data distribution method for parallel sparse matrix-vector multiplication," SIAM Review, Vol. 47, No. 1, pages 67-95, January 2005.
[11] B. Vastenhouw and W. Meesen, "Communication balancing in parallel sparse matrix-vector multiplication," Electronic Transactions on Numerical Analysis, Vol. 21, pages 47-65, special issue on Combinatorial Scientific Computing, 2005.
[12] I. Charpentier and J. Utke, "Fast higher-order derivative tensors with Rapsodia," Optimization Methods and Software, 2009.
[13] M. H. Lipasti, C. B. Wilkerson, and J. P. Shen, "Value locality and load value prediction," in Proceedings of the International Conference on Architectural Support for Programming Languages and Operating Systems, October 1996.
[14] K. M. Lepak and M. H. Lipasti, "Silent stores for free," in Proceedings of the International Symposium on Microarchitecture, December 2000.
[15] "Memory systems for parallel programming," Ph.D. thesis, Computer Sciences Department, University of Wisconsin-Madison, August 1996.