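Record number: 10783
Status: G0496516011
Teaching assistant check: Filing complete
Call number: Check complete
School: Fu Jen Catholic University (輔仁大學)
Department: Department of Computer Science and Information Engineering (資訊工程學系)
Former department name:
Student ID: 496516011
Graduate student (Chinese): 鍾華元
Graduate student (English): Hua-Yuan Chung
Thesis title (Chinese): 用VHDL實做出已排班的資料流架構和暫存器內文
Thesis title (English): VHDL Implementation of Scheduled Dataflow Architecture and the Register Context
Other title:
Advisor (Chinese): 周賜福
Advisor (English): Joseph M. Arul
On-campus full-text release date: 2012.8.31
Off-campus full-text release date: 2012.8.31
Reason for withholding full text:
Electronic full text submitted to the National Central Library: Agreed
National Central Library full-text release date: 2012.8.31
File descriptions: Cover, Abstract, Acknowledgments, Table of Contents, Chapter 1, Chapter 2, Chapter 3, Chapter 4, Chapter 5, References
Electronic full-text files: 01 02 03 04 05 06 07 08 09 10
Degree: Master's
Academic year of graduation: 98 (ROC calendar, i.e. 2009-2010)
Year of publication: 99 (ROC calendar, i.e. 2010)
Language: English
Keywords (Chinese, translated): non-blocking, multithreaded execution, scheduling, dataflow architecture
Keywords (English): Nonblocking Multi-threaded Scheduled Dataflow Architecture

Abstract (Chinese, translated):
Since microprocessors began to develop around 1970, most improvements in CPU performance in industry have come from ILP. Around 2000, ILP appeared to hit a bottleneck, and concerns about power consumption and CPU overheating shifted the focus of CPU development from ILP to TLP and to the effective use of multiple processors. Current CPU designs, however, still rely on complex hardware to detect RAW hazards, which raises CPU power consumption and complicates the design. In this thesis we propose an entirely different architecture and method for resolving RAW hazards. By using the dataflow concept, RAW hazards are removed naturally. In addition, the architecture combines control-flow and dataflow concepts to improve both ILP and TLP. This architecture is the Scheduled Dataflow architecture (SDF). SDF is a non-blocking, multithreaded dataflow architecture that decouples memory access from computation. Because memory access and arithmetic are separated, the synchronization processor (SP) handles memory accesses for data while the execution processor (EP) performs all arithmetic operations. SDF was previously simulated in C++ and C; to model hardware details more precisely, in this thesis SDF is implemented in VHDL and simulated with ModelSIM. Besides simulation, we also tested SDF in hardware. This thesis also examines how much performance can be gained by increasing the register context. In a multithreaded architecture, threads normally pass data to each other through frame memory. If a thread can pass data or results to other threads through the register context, memory accesses can be avoided and performance improves. To test SDF, we programmed it onto the Cyclone II FPGA chip on a DE2 board. The study shows that synthesizing SDF uses at least 50% of the Cyclone II's resources. The Cyclone II can synthesize an SDF with at most four register sets. The study analyzes each configuration of SDF synthesized on the Cyclone II and finds that SDF needs at least two register sets for a multithreaded program to run concurrently.

Abstract (English):
Since the invention of the microprocessor around 1970, improving CPU performance through ILP has been the main focus of the computer industry. Around the year 2000, ILP seemed to reach a limit, and concerns about power consumption and heat dissipation ushered in the multi-core era. The focus has shifted from ILP to TLP and to the efficient use of multi-core processors.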
However, current computers still rely on complex hardware to detect RAW hazards, which increases the CPU's energy consumption and makes its design more complex. In this research we propose a totally different architecture and a different way to solve the RAW hazard. By using the dataflow paradigm, we can naturally eliminate RAW hazards. In addition, this architecture closely links ILP and TLP by combining the sequential and dataflow paradigms. It is named the Scheduled Dataflow Architecture (SDF). SDF is a non-blocking, multithreaded, decoupled dataflow architecture whose main engine relies on the dataflow paradigm. Because it is a decoupled architecture, the synchronization processor is responsible for data access and the execution processor is responsible for executing all the instructions. SDF was previously simulated in C++ and C [19-20]. To model the hardware complexity more precisely, this work implements SDF in VHDL and simulates it with ModelSIM. We have also tested it on Altera DE2 hardware. The main focus of this research is to measure the performance gained from having more register contexts. In a multithreaded architecture, data can be passed between threads through the frame memory. If we instead use the register context to pass data efficiently to the following threads that need the results of the previous thread, several memory accesses can be avoided, thus improving the performance of a program. To test SDF, we also loaded it onto the Cyclone II FPGA chip of the DE2 board. SDF uses at least 50% of the resources of the Cyclone II. The Cyclone II can synthesize SDF with at most four register sets. We examined all of these synthesis configurations and found that SDF requires at least two register sets to run a multithreaded program concurrently.
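
Illustrative note: the abstract's claim that passing results through the register context avoids memory accesses can be pictured with a toy access counter. The sketch below is not taken from the thesis and does not model the SDF pipelines; the names FrameMemory, pass_via_frame_memory and pass_via_register_context are hypothetical, and the cost model (one access per store or load, none for register reads) is an assumption made only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class FrameMemory:
    """Toy frame memory (hypothetical) that counts every load and store."""
    cells: dict = field(default_factory=dict)
    accesses: int = 0

    def store(self, addr, value):
        self.accesses += 1
        self.cells[addr] = value

    def load(self, addr):
        self.accesses += 1
        return self.cells[addr]


def pass_via_frame_memory(values):
    """Producer thread stores its results; consumer thread loads them back."""
    mem = FrameMemory()
    for addr, value in enumerate(values):   # producer: one store per result
        mem.store(addr, value)
    # consumer: one load per result before it can compute
    total = sum(mem.load(addr) for addr in range(len(values)))
    return total, mem.accesses


def pass_via_register_context(values):
    """Producer leaves results in a register set that the consumer inherits."""
    mem = FrameMemory()
    register_set = list(values)             # results stay in registers
    total = sum(register_set)               # consumer reads registers directly
    return total, mem.accesses               # frame memory is never touched
```

Under these assumptions, handing four values to the next thread costs eight frame-memory accesses in the first function and none in the second, which is the kind of saving the abstract describes in qualitative terms.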
Table of contents:
Abstract (Chinese)
Abstract
Acknowledgments
List of Figures
List of Tables
Chapter 1 Introduction
1.1 Introduction
1.2 Introduction to FPGA Technology
1.3 Motivation
1.4 Organization of This Thesis
Chapter 2 Background and Related Work
2.1 Background
2.1.1 Background of Data flow Architecture
2.1.2 Background to Decoupled Memory Architecture
2.2 Related Work
2.3 SDF Background
Chapter 3 Scheduled Dataflow Architecture
3.1 The Hardware Composition of SDF
3.1.1 Control Unit
3.1.2 Synchronization Pipeline
3.1.3 Execution Pipeline
3.1.4 Linking of Register Sets to SP and EP
3.2 Register to Register Method
3.3 Memory Management Unit
3.4 Thread Status
3.5 Implementation of Common Application
3.5.1 The Detailed Explanations of RTM Method
3.5.2 The Detailed Explanations of RTR Method
3.5.3 Branch Implementation in an SDF Thread
3.5.4 Loop Implementation in SDF by RTR and RTM Method
Chapter 4 Architecture Implementation Analysis
4.1 SDF Environment
4.1.1 Basic Elements of DE2
4.1.2 SDF Processor Logic Elements
4.1.3 Memory Bit used by SDF
4.2 Measure multithreaded program run on DE2
4.2.1 Experiment method
4.2.2 Multithread Summation Program
4.3 Conclusion from the Experiments
Chapter 5 Conclusion and Future Work
5.1 Conclusion
5.2 Future Work
Reference

References:
[1] Richard M. Karp and Raymond E. Miller, “Properties of a Model for Parallel Computation: Determinacy, Termination, Queueing,” SIAM Journal on Applied Mathematics, Vol. 14, No. 6, pp. 1390-1411, Nov., 1966.
[2] Jack B. Dennis and David P. Misunas, “A Preliminary Architecture for a Basic Data-Flow Processor,” ACM SIGARCH Computer Architecture News, Vol. 3, Issue 4, pp. 126-132, 1975.
[3] K. Arvind and Rishiyur S. Nikhil, “Executing a Program on the MIT Tagged-Token Dataflow Architecture,” IEEE Transactions on Computers, Vol. 39, Issue 3, pp. 300-318, Mar., 1990.
[4] Gregory M. Papadopoulos and David E. Culler, “Monsoon: an explicit token-store architecture,” in Proceedings of the 17th annual international symposium on Computer Architecture, pp. 82-91, Seattle, WA, May, 1990.
[5] Mitsuhisa Sato, Yuetsu Kodama, Shuichi Sakai, Yoshinori Yamaguchi, and Yasuhito Koumura, “Thread-based programming for the EM-4 hybrid dataflow machine,” ACM SIGARCH Computer Architecture News, Vol. 20, Issue 2, pp. 224-233, May, 1992.
[6] James E. Smith, “Decoupled access/execute computer architectures,” in Proceedings of the 9th annual symposium on Computer Architecture, pp. 112-119, Austin, Texas, United States, Apr., 1982.
[7] J. Kreuzinger and T. Ungerer, “Context-switching techniques for decoupled multithreaded processors,” in Euromicro ’99, pp. 248-251, Milan, Italy, 1999.
[8] James E. Smith, G.E. Dermer, B.D. Vanderwarn, S.D. Klinger, C.M. Rozewski, D.L. Fowler, K.R. Scidmore, and J.P. Laudon, “The ZS-1 Central Processor,” ACM SIGOPS Operating Systems Review, Vol. 21, Issue 4, pp. 199-204, Oct., 1987.
[9] James E. Smith, Shlomo Weiss, and Nicholas Y. Pang, “A Simulation Study of Decoupled Architecture Computers,” IEEE Trans. Computers, Vol. 35, No. 8, pp. 692-702, Aug., 1986.
[10] Won W. Ro, Stephen P. Crago, Alvin M. Despain, and Jean-Luc Gaudiot, “HiDISC: A Decoupled Architecture for Data-Intensive Applications,” in Proceedings of the 17th International Symposium on Parallel and Distributed Processing, pp. 3.2, Nice, France, Apr., 2003.
[11] Kyriakos Stavrou, Costas Kyriacou, Paraskevas Evripidou, and Pedro Trancoso, “Chip multiprocessor based on data-driven,” International Journal of High Performance Systems Architecture, Vol. 1, No. 1, pp. 34-43, 2007.
[12] Roberto M. Giorgi, Zdravko Popovic, and Nikola Puzovic, “DTA-C: A Decoupled multi-Threaded Architecture for CMP Systems,” in Computer Architecture and High Performance Computing, SBAC-PAD 2007. 19th International Symposium, pp. 263-270, Gramado, RS, Brazil, Oct., 2007.
[13] John L. Hennessy and David A. Patterson, “Computer Architecture: A Quantitative Approach,” 4th ed., Elsevier, 2006.
[14] David M. Harris and Sarah L. Harris, “Digital design and computer architecture,” Elsevier, 2007.
[15] Michael Sung, Ronny Krashinsky, and Krste Asanović, “Multithreading decoupled architectures for complexity-effective general purpose computing,” ACM SIGARCH Computer Architecture News, Vol. 29, Issue 5, pp. 56-61, Dec., 2001.
[16] DE2 User Manual, ftp://ftp.altera.com/up/pub/Webdocs/DE2_UserManual.pdf
[17] Cyclone II Device Handbook, Volume 1, http://www.altera.com/literature/hb/cyc2/cyc2_cii5v1.pdf
[18] Krishna M. Kavi, Roberto M. Giorgi, and Joseph M. Arul, “Scheduled Dataflow: Execution Paradigm, Architecture, and Performance Evaluation,” IEEE Trans. on Computers, Vol. 50, No. 8, pp. 834-846, Aug., 2001.
[19] Joseph M. Arul, Tso-Zen Yeh, Chia-Cheng Hsu, and Jan-Jr Li, “An Efficient Way of Passing of Data in a Multithreaded Scheduled Dataflow Architecture,” in Proceedings of 8th International Conference on High-Performance Computing in Asia-Pacific Region, pp. 487-492, Beijing, China, Dec., 2005.

Number of pages: 51
Notes:
Full-text view count:
Record created: 2010/8/31
File conversion date: 2010/09/01