Efficient and highly parallel computation

Jeffrey Finkelstein
Computer Science Department, Boston University

October 17, 2014

Copyright 2012, 2014 Jeffrey Finkelstein 〈jeffreyf@bu.edu〉. This document is licensed under the Creative Commons Attribution-ShareAlike 4.0 International License, which is available at https://creativecommons.org/licenses/by-sa/4.0/. The LaTeX markup that generated this document can be downloaded from its website at https://github.com/jfinkels/parallel. The markup is distributed under the same license.

1 Models of computation

1.1 Parallel random access machines

There are many variations of the basic parallel random access machine (PRAM), which consists of an arbitrary number of processors sharing an arbitrary amount of random access memory. An exclusive-read, exclusive-write PRAM (EREW PRAM) does not allow any concurrent memory accesses, so all concurrent memory accesses must be explicitly serialized in the program. An EREW PRAM is useful because any algorithm running on such a machine experiences no contention due to concurrent memory accesses. A concurrent-read, concurrent-write PRAM (CRCW PRAM), on the other hand, has a unit cost for any number of processors concurrently accessing any single memory location. In this model, contention due to concurrent memory accesses is not necessarily included in the running time. Can an EREW PRAM simulate a CRCW PRAM (with the intention of reducing or eliminating time spent queuing concurrent memory accesses in physical hardware implementations)? Due to a refinement of a simulation originally by Vishkin [28], a CRCW PRAM algorithm running in time t(n) on p(n) processors can be simulated by an EREW PRAM algorithm running in time O(t(n) · lg n) on p(n) processors.

1.2 Queue-read, queue-write parallel random access machines

Gibbons, Matias, and Ramachandran [14] define the queue-read, queue-write PRAM (QRQW PRAM). Whereas the EREW PRAM does not allow concurrent memory accesses and the CRCW PRAM has a unit cost for each concurrent memory access, the QRQW PRAM has a time penalty at each step directly proportional to the number of concurrent memory accesses. This is the cost due to memory contention (or the "queuing delay"), intended to model the cost of servicing memory access requests sequentially at the memory module of the machine. The QRQW PRAM is "more powerful" than the EREW PRAM, because we may be able to collapse multiple exclusive memory accesses into a single concurrent one. (Even so, there are some algorithms for which such a simulation does not necessarily result in a decreased running time.) The QRQW PRAM is "less powerful" than the CRCW PRAM, because the CRCW PRAM does not account for the cost due to contention during memory accesses. [14] briefly provides the simple simulation of a QRQW PRAM by a CRCW PRAM, but I could not find a work which provides a simulation of a QRQW PRAM by an EREW PRAM (though we examine such a simulation in subsection 5.1). The same work also provides some theorems about "time separations" among different variants of PRAMs, but it does not seem to define what that means; it has something to do with time lower bounds.

1.3 Variants of the queue-read, queue-write parallel random access machine

In [13], the authors define single instruction, multiple data (SIMD) QRQW PRAMs and multiple instruction, multiple data (MIMD) QRQW PRAMs. In [14], they define QRQW asynchronous PRAMs, in which the computation at each processor proceeds asynchronously, but concurrent memory accesses to the same memory location truly form a queue which is processed serially; at each time step, the head of the queue, if the queue is not empty, is dequeued and processed.
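These variants all retain the queue-based accounting for memory accesses. To make the contention cost concrete, the following sketch (a toy illustration in Python, not code from any of the cited papers; the function step_cost and its interface are invented here) charges a single synchronous step under the three cost models discussed so far: the EREW PRAM simply forbids concurrent accesses, the CRCW PRAM charges unit time regardless of contention, and the QRQW PRAM charges the length of the longest queue at any memory cell.

from collections import Counter

def step_cost(accesses, model):
    # Time charged for one synchronous step, given the list of memory
    # addresses touched by the processors during that step.  Illustrative
    # sketch only; not code from any of the cited papers.
    contention = max(Counter(accesses).values(), default=0)
    if model == "EREW":
        # Concurrent accesses are forbidden outright; a legal EREW step
        # has contention at most one and costs unit time.
        assert contention <= 1, "EREW PRAM forbids concurrent accesses"
        return 1
    if model == "CRCW":
        # Any number of concurrent accesses to a cell costs unit time.
        return 1
    if model == "QRQW":
        # The step is charged the length of the longest queue at any cell.
        return max(1, contention)
    raise ValueError(model)

# Four processors, three of which touch address 7 in the same step.
print(step_cost([7, 7, 7, 42], "CRCW"))  # 1
print(step_cost([7, 7, 7, 42], "QRQW"))  # 3

Section 5 below examines how much of this per-step queuing cost can be traded for extra computation depth.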
1.4 Queuing shared memory machines

The queuing shared memory (QSM) machine [12] is essentially a parameterized QRQW PRAM in which the parameter g scales the cost (of each step in each processor) due to memory accesses. It is considered a "bulk-synchrony" model because it allows each processor to operate asynchronously between barrier synchronizations. (This is a compromise between a completely synchronous model and a completely asynchronous model.)

1.5 Bulk synchronous parallel random access machine

The bulk synchronous parallel (BSP) [27] model is really a distributed computation model, in that it has an arbitrary number of processors, each with local memory, connected via a (stateless) network with limited bandwidth.

Todo 2. Get a copy of [27].

The bulk synchronous PRAM (BSPRAM) [26] is a generalization of the BSP model which has an arbitrary number of PRAMs connected to a bandwidth-limited shared memory (in place of a bandwidth-limited network). If we choose the PRAMs in the BSPRAM to be QRQW PRAMs, this essentially yields a generalization of a QSM machine with an extra parameter, l, which measures latency at shared memory. A QSM machine is essentially a QRQW BSPRAM with l = 1 (since the QSM machine does not penalize latency at shared memory).

2.1 Memory access complexity

In [33, 32], Zhang et al. suggest that as memory access becomes a bottleneck, it pays to examine computational complexity not only in terms of time, space, and number of processors, but also in terms of memory access complexity. Memory access complexity measures the cost due to temporal and spatial locality of memory accesses in an algorithm (as well as "contiguous and non-contiguous accesses, short message[s] and long message[s], etc." [34]). In [2], the authors suggest the input/output model, also known as the disk access model, in which there are two layers of memory: a (fast) cache with a finite number of blocks and a (slow) disk. Some analysis of the number of memory transfers required by common algorithms appears in lecture notes from Erik Demaine's advanced data structures course [9].

2.2 Parameterized parallel random access machine

The Parameterized PRAM (PPRAM) defined in [15] is a general family of PRAM models which assigns a different cost to each type of operation. Several other PRAM-like models are special cases of this generalization. The PPRAM is parameterized by a four-tuple of positive reals, (α, β, γ, δ), which are the costs for an EREW memory access, a CRCW memory access, a multiprefix operation (for example, a "prefix-sum" operation), and a global synchronization, respectively. (Local computation has a cost of one.) The benefit of this parameterized model is that one can analyze the tradeoffs in resource usage among distinct algorithms for a single computational problem.

Todo 3. Augment this model with two more parameters: one for the cost due to round trips to memory and ζ for the cost due to queuing delay.
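As a rough illustration of how the four parameters above enter an analysis (the additional parameters contemplated in Todo 3 would simply add more terms), the following sketch totals the PPRAM cost of an algorithm from counts of each operation type. It is a toy example; the function, the operation counts, and the parameter values are all invented here.

def ppram_cost(local_ops, erew_accesses, crcw_accesses, multiprefix_ops,
               synchronizations, alpha, beta, gamma, delta):
    # Total PPRAM cost: local computation costs one, and the four
    # parameters (alpha, beta, gamma, delta) scale the remaining
    # operation types, as in the definition above.  The function name
    # and the counts below are invented for this illustration.
    return (local_ops
            + alpha * erew_accesses
            + beta * crcw_accesses
            + gamma * multiprefix_ops
            + delta * synchronizations)

# A hypothetical algorithm profile on a hypothetical machine.
print(ppram_cost(local_ops=1000, erew_accesses=200, crcw_accesses=50,
                 multiprefix_ops=10, synchronizations=5,
                 alpha=2.0, beta=8.0, gamma=4.0, delta=16.0))  # 1920.0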
3.1 Other models of parallel computation

There are many, many other models of parallel computing machines; see [34] for details and a listing of modern parallel computation models. In that work, all the PRAMs as well as the QSM discussed in the previous sections are considered "first generation" models because they use a shared memory model. The second-generation models use distributed memory, and the third-generation models use hierarchical memory. It seems that most models can be simulated on most other models with small costs.

Todo 4. Which of the models mentioned in [34] is closest to our current hardware? Which will be closest to the next generation of hardware?

4.1 Lower bounds for parallel models of computation

[17] provides lower bounds for certain problems on the QSM and BSP models by using a generalized shared memory (GSM) machine, which is a generalization of both QSM and BSP.

4.2 The explicit multi-threading machine

Vishkin, Caragea, and Lee [29] describe complexity measures for the Explicit Multi-Threading (XMT) machine (introduced in [30]). When computing the overall complexity of an algorithm written for the XMT machine, they include the computation depth, the number of round-trips to memory, and the queuing delay. They also include an additional cost due to thread and processor management when there are more threads than processors (which could be called "processor contention"). The XMT model combines features of several different models of parallel computation, including a shared memory model, queuing delays, hierarchical memory access, memory segregated into modules, etc.

5 Tradeoffs in resource usage

One of our goals is to examine the tradeoff among time, space, number of processors, number of round-trips to memory, queuing delay, etc.

5.1 Simulating a QRQW PRAM on an EREW PRAM

Our first thought was to investigate eliminating queuing delay at the expense of computation depth. To do this, we consider Vishkin's EREW PRAM simulation of a CRCW PRAM and try to adapt it to simulate a QRQW PRAM.

5.1.1 Simulating a CRCW PRAM on an EREW PRAM

Vishkin showed that a PRIORITY CRCW PRAM can be simulated on an EREW PRAM with an O(lg² n) increase in time. A PRIORITY CRCW PRAM is a CRCW PRAM in which write conflicts are resolved by choosing the processor with the smallest ID. This model is more powerful than both the COMMON CRCW PRAM (in which concurrent writes to the same address are only allowed when the processors write the same data) and the ARBITRARY CRCW PRAM (in which an arbitrary processor succeeds on concurrent writes), so simulating the PRIORITY CRCW PRAM model suffices to simulate these other two.

Theorem 5.1 ([28]). An algorithm which runs on an input of size n on a CRCW PRAM with p(n) processors in t(n) time and s(n) space can be simulated on an EREW PRAM with p(n) processors in O(t(n) · lg² n) time and O(p(n) · s(n)) space.

Proof idea. The PRIORITY CRCW PRAM and EREW PRAM models are synchronous, so it suffices to show separately how to simulate a write step and a read step. To simulate a write step, each processor writes the target address, its ID number, and the data to be written into a scratch memory location. Perform Batcher sort, an O(lg² n) EREW PRAM (stable) sorting algorithm, on the addresses (breaking ties by processor ID number). Each processor reads one of the entries and the one before it (in the sorted sequence). If the two addresses differ, this processor has the smallest ID of all processors attempting to write to its address, so it writes its value. All other processors do nothing. To simulate a read step, sort the sequence of addresses and IDs in O(lg² n) time as in the write step. Each processor reads an entry and the one before it, as above. Create a bitmap with a 1 wherever a processor sees a new address and a 0 wherever it does not. Each processor corresponding to a 1 reads from the corresponding address. Perform a "segmented prefix copy-right", using the bitmap as the segment delimiters, to disseminate the data read in O(lg n) time. Finally, "unsort" the addresses and IDs along with the data read.
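The heart of the write-step simulation is the stable sort by address (breaking ties by processor ID) followed by a single comparison with the preceding entry. The sequential sketch below shows only that resolution logic; the function and its data layout are invented for illustration, and on an actual EREW PRAM the sort would be performed by Batcher's (or Cole's) sorting algorithm with the comparisons done by the processors in parallel.

def simulate_priority_write(requests, memory):
    # Resolve one PRIORITY CRCW write step using only exclusive accesses.
    # `requests` is a list of (address, processor_id, value) triples, one
    # per writing processor.  Sorting by (address, processor_id) places,
    # for each address, the lowest-numbered processor first; that
    # processor "wins" and performs the write, as in the proof idea above.
    # Sequential sketch only; the parallel version uses an O(lg^2 n)
    # EREW sorting network and one comparison per processor.
    ordered = sorted(requests, key=lambda r: (r[0], r[1]))
    for i, (address, pid, value) in enumerate(ordered):
        # Write only if this entry is the first request for its address.
        if i == 0 or ordered[i - 1][0] != address:
            memory[address] = value
    return memory

memory = {}
simulate_priority_write([(7, 3, "c"), (7, 1, "a"), (9, 2, "b")], memory)
print(memory)  # {7: 'a', 9: 'b'}: processor 1 wins the conflict at address 7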
A later O(lg n) EREW PRAM sorting algorithm due to Cole improves the above theorem.

Theorem 5.2. An algorithm which runs on an input of size n on a CRCW PRAM with p(n) processors in t(n) time and s(n) space can be simulated on an EREW PRAM with p(n) processors in O(t(n) · lg n) time and O(p(n) · s(n)) space.

See [24] for more information on the Batcher sorting algorithm, the Cole sorting algorithm, and this simulation in general. [22] also has the statement and proof of the earlier version of this theorem.

5.1.2 Simulating a SIMD-QRQW PRAM on an EREW PRAM

A SIMD-QRQW PRAM is just a CRCW PRAM with an added time penalty due to concurrent reads and writes. Therefore, the same proof technique for simulating a CRCW PRAM on an EREW PRAM applies for simulating a SIMD-QRQW PRAM on an EREW PRAM. At each step of a SIMD-QRQW PRAM with p processors, the time cost due to contention is at most p (if all processors access the same memory location). However, the simulated step on the EREW PRAM with p processors still takes O(lg p) time. Suppose the SIMD-QRQW PRAM step has contention κ. If κ ∈ o(lg p), then the EREW PRAM simulation runs more slowly than the original algorithm; if κ ∈ ω(lg p), then the EREW PRAM simulation runs more quickly than the original algorithm. This last fact is a bit unsettling; we expect a more restricted model of computation to take longer to decide a computational problem. We take this fact as evidence that the EREW PRAM is not refined enough to explore tradeoffs between time, space, number of processors, number of memory accesses, and queuing delays.

5.1.3 Simulating a MIMD-QRQW PRAM on an EREW PRAM

We first note the simulation of a MIMD-QRQW PRAM on a SIMD-QRQW PRAM.

Theorem 5.3 ([14], Observation 2.1). An algorithm which runs on an input of size n on a MIMD-QRQW PRAM with p(n) processors in t(n) time and s(n) space, with a maximum time cost of τ(n) for a single step, can be simulated on a SIMD-QRQW PRAM with p(n) · τ(n) processors in O(t(n)) time and s(n) + τ(n) · p(n) space.

(The increase in space comes from the proof; it is not explicitly stated in the original theorem.)

Combining Theorem 5.3 with the results stated in subsubsection 5.1.2, we find that simulating a step of a MIMD-QRQW PRAM on an EREW PRAM is also subject to the same unusual behavior. A step on the MIMD-QRQW PRAM with contention κ becomes multiple steps on the SIMD-QRQW PRAM, some of which are subject to contention κ as well, so the same analysis as above applies.

5.2 Commonly studied problems for PRAMs

Tradeoffs between the amount of communication and the number of processors in a partitioned PRAM (in which each of a finite set of processors receives only a subset of the input and there is a finite amount of shared memory) are studied in [1] for the following problems:

• k-selection
• minimum
• prefix-sum
• integer sorting
• k-k routing
• permutation routing
• list reversal

Two algorithms for majority are given in [15]. Many algorithms for minimum weight spanning tree also exist because of the many applications, for example, in networking. There are also a number of PRAM approximation algorithms in [10].
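Prefix-sum, which appears in the list above, is the archetypal parallel primitive, and a short sketch makes the "one loop iteration per synchronous step" way of thinking concrete. The following is an illustration only; it is not one of the algorithms from [1], and it performs Θ(n lg n) work rather than the work-optimal Θ(n).

def parallel_prefix_sum(values):
    # Inclusive prefix sums by simulating the classic logarithmic-round
    # parallel scan: in round i, every position j >= 2**i adds the value
    # that sat 2**i places to its left at the start of the round.  One
    # loop iteration corresponds to one synchronous parallel step; the
    # exclusivity and contention details of a real PRAM are glossed over.
    x = list(values)
    stride = 1
    while stride < len(x):
        previous = list(x)  # all positions "simultaneously" read old values
        for j in range(stride, len(x)):
            x[j] = previous[j] + previous[j - stride]
        stride *= 2
    return x

print(parallel_prefix_sum([3, 1, 4, 1, 5, 9, 2, 6]))
# [3, 4, 8, 9, 14, 23, 25, 31] after lg 8 = 3 simulated parallel rounds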
[29] gives algorithms and analyses for

• array compaction
• k-ary tree summation
• breadth-first search
• sparse-matrix dense-vector multiplication

Todo 6. Examine these algorithms with respect to queuing delay, number of round-trips to memory, etc., as performed in [29].

7 Inherently sequential and highly parallel computational problems

7.1 Highly parallel computational problems

NC is the class of all languages which can be decided by a (logarithmic space) uniform family of Boolean circuits with polynomial size, polylogarithmic depth, and fan-in two. If we view each gate of the circuit as a processor and each layer of the circuit as a single time step, then NC represents the class of languages decidable by a CREW PRAM with a polynomial number of processors running in polylogarithmic parallel time. Languages in NC are considered highly parallel. We know L ⊆ NL ⊆ NC² ⊆ NC ⊆ P. The first inclusion is obvious, the second is due to [6], the third follows from the definition, and the last holds because NC is logarithmic space uniform (so the circuit can be generated in polynomial time) and because circuit evaluation can be performed in polynomial time.

7.2 Inherently sequential computational problems

Languages which are complete for P under ≤^NC_m reductions are considered inherently sequential, because a highly parallel solution (that is, an NC algorithm) for any one of them would imply a highly parallel solution for all problems in P.

Todo 8. What is the relationship between logarithmic space many-one reductions (≤^L_m) and NC many-one reductions (≤^NC_m)? Since L ⊆ NC, the former reduction implies the latter.

Such problems would be more accurately described as "inherently sequential in the worst case", since the complexity classes P and NC are defined by worst-case time bounds (and a worst-case number of processors in the latter class). The canonical natural P-complete problem is the Circuit Value Problem (CVP): given the encoding of both a Boolean circuit and its input, decide whether the output of the circuit is 1.
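Although CVP is trivial to solve sequentially, no highly parallel algorithm for it is known. The sketch below only makes the problem statement concrete, using an invented encoding of the circuit as a topologically ordered list of gates; it is not a proposed parallel algorithm.

def circuit_value(gates, inputs):
    # Evaluate a Boolean circuit given in topological order.  `gates` is a
    # list of (operation, argument indices) pairs, where an argument index
    # refers either to an input bit or to the output of an earlier gate.
    # The encoding is invented for this illustration; the point is only
    # that a single sequential pass evaluates the circuit in polynomial
    # (indeed linear) time, whereas no NC algorithm is known for CVP.
    values = list(inputs)
    for op, args in gates:
        if op == "NOT":
            values.append(not values[args[0]])
        elif op == "AND":
            values.append(values[args[0]] and values[args[1]])
        elif op == "OR":
            values.append(values[args[0]] or values[args[1]])
        else:
            raise ValueError(op)
    return int(values[-1])  # the last gate is taken to be the output gate

# (x0 AND x1) OR (NOT x2) evaluated on the input 1, 0, 0.
gates = [("AND", (0, 1)), ("NOT", (2,)), ("OR", (3, 4))]
print(circuit_value(gates, [True, False, False]))  # 1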
8.1 Highly parallel approximations

An inherently sequential (that is, P-complete) optimization problem may still admit a highly parallel approximation. Indeed, so do some NP- and PSPACE-complete optimization problems (for example, see [16] for NC approximations for NP- and PSPACE-hard problems). Analogous to the chain of inclusions

P ⊆ PO ⊆ FPTAS ⊆ PTAS ⊆ ApxPO ⊆ poly-ApxPO ⊆ exp-ApxPO ⊆ NPO

for approximating NP optimization problems, we have

NC ⊆ NCO ⊆ FNCAS ⊆ NCAS ⊆ ApxNCO ⊆ poly-ApxNCO ⊆ exp-ApxNCO

for approximating NC optimization problems (see working definitions in [11]) and

L ⊆ LO ⊆ FLAS ⊆ LAS ⊆ ApxLO ⊆ poly-ApxLO ⊆ exp-ApxLO ⊆ NLO

for approximating NL optimization problems (classes defined in [25]).

8.2 Fixed parameter parallelizable problems

As an analog to the class of fixed parameter tractable problems, FPT, we might also consider the class of fixed parameter parallelizable problems, FPP, introduced (in a sketchy manner) in [8].

Todo 9. Survey and create a listing of known problems in FPP.

9.1 Fixed parameter parallelizable approximation

In [18], the author presents a cohesive view of fixed parameter tractable approximation algorithms. See also [7].

Todo 10. No work seems to exist which studies fixed parameter parallelizable approximations.

10.1 Average-case complexity

P, NP, and NP-completeness are all defined using worst-case time complexity. Average-case complexity classes are defined using expected running times over a probability distribution on inputs of each size. There is a wealth of complexity classes related to the notion of average-case complexity (AvgP, DistP, HeurP, NearlyP, etc.). For a survey of the average-case complexity of NP problems, see [5]. There are some definitions for smaller average-case complexity classes, but they are a bit harder to find. See [31, Section 3.5.1] for general definitions of average-case complexity classes and [4, Section 7] for average-case logarithmic space. The latter paper very briefly (that is, in a single sentence) states that its definitions can be used to define average-case logarithmic space-uniform NC. This is a bit indirect because the average-case hardness comes from the uniform Turing machine which generates the circuit, and not from the circuit itself.

10.2 Average-case highly parallel problems

One possible set of requirements for a problem to be considered "highly parallel on average" is

1. the expected depth of the circuit deciding it is polylogarithmic, and
2. the expected size of the circuit is polynomial,

but average-case highly parallel computation does not seem to be explicitly defined anywhere.

Todo 11. What is the appropriate definition for average-case NC? Is it the definition given by [4], or would it make more sense to define it using the CREW PRAM characterization of NC (see subsection 7.1)?

There do seem to be a few examples of average-case highly parallel PRAM algorithms:

• Sublist Selection Problem [20]
• k-Colorability of Permutation Graphs [3]
• Single-Source Shortest-Paths [19]
• various other graph problems [21]

Todo 12. There has been some work on the average-case complexity of P-complete problems in [23], though I have not examined that yet. Also, there does not seem to be much follow-up to that work. There is also plenty of work on the expected depth of random circuits, which may be of use.

References

[1] Adnan Agbaria, Yosi Ben-Asher, and Ilan Newman. "Communication-processor tradeoffs in limited resources PRAM". In: Proceedings of the Eleventh Annual ACM Symposium on Parallel Algorithms and Architectures. SPAA '99. Saint Malo, France: ACM, 1999, pp. 74–82. isbn: 1-58113-124-0. doi: 10.1145/305619.305628.

[2] Alok Aggarwal and Jeffrey Vitter. "The input/output complexity of sorting and related problems". In: Communications of the ACM 31.9 (1988), pp. 1116–1127.

[3] Maria Andreou and Stavros D. Nikolopoulos. "NC Coloring Algorithms for Permutation Graphs". In: Nordic J. of Computing 6.4 (Dec. 1999), pp. 422–445. issn: 1236-6064. url: http://dl.acm.org/citation.cfm?id=642160.642164.

[4] Shai Ben-David et al. "On the Theory of Average Case Complexity". In: STOC. Ed. by David S. Johnson. Seattle, Washington, USA: ACM, May 14–17, 1989, pp. 204–216. isbn: 0-89791-307-8. doi: 10.1145/73007.73027.

[5] Andrej Bogdanov and Luca Trevisan. "Average-Case Complexity". In: CoRR abs/cs/0606037 (2006).

[6] Allan Borodin. "On Relating Time and Space to Size and Depth". In: SIAM J. Comput. 6.4 (1977), pp. 733–744.

[7] Liming Cai and Xiuzhen Huang. "Fixed-Parameter Approximation: Conceptual Framework and Approximability Results". In: Algorithmica 57 (2 2010), pp. 398–412. issn: 0178-4617. doi: 10.1007/s00453-008-9223-x.

[8] Marco Cesati and Miriam Di Ianni. "Parameterized parallel complexity". In: Euro-Par'98 Parallel Processing. Ed. by David Pritchard and Jeff Reeve. Vol. 1470. Lecture Notes in Computer Science. Springer Berlin / Heidelberg, 1998, pp. 892–896.
isbn: 978-3-540-64952-6. doi: 10.1007/BFb0057945.

[9] Erik Demaine. Advanced Data Structures, Lecture 19. http://courses.csail.mit.edu/6.851/spring07/scribe/lec19.pdf. Accessed April 15, 2014. Apr. 25, 2007.

[10] J. Díaz et al. Paradigms for fast parallel approximability. Cambridge International Series on Parallel Computation. Cambridge University Press, 1997. isbn: 9780521117920. url: http://books.google.com/books?id=tC9gCQ2lmVcC.

[11] Jeffrey Finkelstein. "Highly parallel approximations for inherently sequential problems". Unpublished manuscript. 2013. url: https://github.com/jfinkels/ncapproximation.

[12] P. B. Gibbons, Y. Matias, and V. Ramachandran. "Can a Shared-Memory Model Serve as a Bridging Model for Parallel Computation?" In: Theory of Computing Systems 32 (3 1999), pp. 327–359. issn: 1432-4350. doi: 10.1007/s002240000121.

[13] Phillip B. Gibbons, Yossi Matias, and Vijaya Ramachandran. "The Queue-Read Queue-Write Asynchronous PRAM Model". In: Euro-Par'96 Parallel Processing, Lecture Notes in Computer Science. Springer, 1998, pp. 279–292.

[14] Phillip B. Gibbons, Yossi Matias, and Vijaya Ramachandran. "The Queue-Read Queue-Write PRAM Model: Accounting for Contention in Parallel Algorithms". In: SIAM J. Comput. 28.2 (1998), pp. 733–769. doi: 10.1137/S009753979427491.

[15] Timothy J. Harris and Murray I. Cole. The Parameterized PRAM. ????

[16] Harry B. Hunt III et al. "NC-Approximation Schemes for NP- and PSPACE-Hard Problems for Geometric Graphs". In: Journal of Algorithms 26.2 (1998), pp. 238–274. issn: 0196-6774. doi: 10.1006/jagm.1997.0903.

[17] Philip D. MacKenzie and Vijaya Ramachandran. "Computational bounds for fundamental problems on general-purpose parallel models". In: Proceedings of the Tenth Annual ACM Symposium on Parallel Algorithms and Architectures. SPAA '98. Puerto Vallarta, Mexico: ACM, 1998, pp. 152–163. isbn: 0-89791-989-0. doi: 10.1145/277651.277681.

[18] Dániel Marx. "Parameterized Complexity and Approximation Algorithms". In: The Computer Journal 51.1 (2007), pp. 60–78. doi: 10.1093/comjnl/bxm048. eprint: http://comjnl.oxfordjournals.org/content/51/1/60.full.pdf+html.

[19] Ulrich Meyer. "Design and Analysis of Sequential and Parallel Single-Source Shortest-Paths Algorithms". PhD thesis. Saarbrücken: Saarland University, 2002.

[20] G. Pucci and W. Zimmerman. On the Average Case Complexity of Parallel Sublist Selection. Tech. rep. International Computer Science Institute, Mar. 1991.

[21] John H. Reif and Paul G. Spirakis. "Expected Parallel Time and Sequential Space Complexity of Graph and Digraph Problems". In: Algorithmica 7.5&6 (1992), pp. 597–630. doi: 10.1007/BF01758779.

[22] John E. Savage. Models of Computation: Exploring the Power of Computing. Addison-Wesley, 1998.

[23] Maria Serna and Fatos Xhafa. "On the average case complexity of some P-complete problems". In: Informatique Théorique et Applications 33 (1 1999), pp. 33–45.

[24] Justin R. Smith. The Design and Analysis of Parallel Algorithms. New York, NY, USA: Oxford University Press, Inc., 1993.

[25] Till Tantau. "Logspace Optimization Problems and Their Approximability Properties". In: Theory Comput. Syst. 41.2 (2007), pp. 327–350. doi: 10.1007/s00224-007-2011-1.

[26] Alexandre Tiskin. "The Design and Analysis of Bulk-Synchronous Parallel Algorithms". PhD thesis. Oxford University, 1998.

[27] L. G. Valiant. "Bulk-synchronous parallel computers". In: Parallel processing and artificial intelligence. Ed. by M. Reeve and S.E. Zenith. Wiley communicating process architecture. Wiley, 1989. Chap. 2, pp. 15–22. isbn: 9780471924975. url: http://books.google.com/books?id=tWUlAAAAMAAJ.
[28] Uzi Vishkin. "Implementation of simultaneous memory address access in models that forbid it". In: Journal of Algorithms 4.1 (1983), pp. 45–50. issn: 0196-6774. doi: 10.1016/0196-6774(83)90033-0.

[29] Uzi Vishkin, George C. Caragea, and Bryant Lee. Models for Advancing PRAM and Other Algorithms into Parallel Programs for a PRAM-On-Chip Platform. Tech. rep. University of Maryland, Apr. 2006.

[30] Uzi Vishkin et al. "Explicit Multi-Threading (XMT) Bridging Models for Instruction Parallelism". In: Proc. 10th ACM Symposium on Parallel Algorithms and Architectures (SPAA). ACM Press, 1998, pp. 140–151.

[31] Tomoyuki Yamakami. "Average Case Computational Complexity Theory". PhD thesis. Toronto, Canada: University of Toronto, 1997.

[32] Yunquan Zhang. "Performance Optimizations on Parallel Numerical Software Package and Studys on Memory Complexity". PhD thesis. Beijing: Institute of Software, Chinese Academy of Sciences, June 2000.

[33] Yunquan Zhang et al. "Memory Complexity in High Performance Computing". In: Proceedings of the Third International Conference on High Performance Computing in Asia-Pacific Region. Singapore, Sept. 1998, pp. 142–151.

[34] Yunquan Zhang et al. "Models of parallel computation: a survey and classification". In: Frontiers of Computer Science in China 1.2 (2007), pp. 156–165. issn: 1673-7350. doi: 10.1007/s11704-007-0016-1.