スーパースカラ Super Scalar

スーパースカラ Super Scalar • From CPI（Clock/Instruction）to IPC（Instruction/clock） • スーパースカラ/Super Scalar • 考え方 - 複数命令の同時実行構造 Basic idea: Simultaneous issues of several instructions • インオーダとアウトオブオーダー発行 In order/Out of order issues • Stage Decoupling with the pipeline • Reservation Station（FIFO） • インオーダとアウトオブオーダー完了 In order/Out of order completion • データコンフリクトと例外処理 Data conflict and exception handling • ROB（ReOrder Buffer） • レジスタリネーミング; register renaming • 分岐予測; Branch prediction 福永力; Chikara Fukunaga 1 From CPI to IPC • MIPS=動作周波数（MHz）÷CPI MIPS=Machine Frequency (MHz)/CPI • もし複数の命令を実行できるようになればCPIという概念は拡張される（命令を1クロックで複数こなせるようになる）．すると If several instructions can be issued at once, several instructions will be executed seemingly in a clock cycle. Then we may regard IPC=1/CPI（Instructions per clock）がCPUを評価する基本単位となりうる．その場合 as a basic unit for the machine performance. In this case MISP will be redefined MIPS=周波数（MHz）×IPC • その複数命令同時実行構造をスーパースカラと呼ぶ． Structure of this idea (Execution of Several instructions at once) is referred to as the Super Scalar. • （注）MIPS評価には他の方法もある（Dhrystone MIPS）．この評価をもとにしてIPCを計算すると2本の命令が同時実行可能なスーパースカラでもIPC>2などという数字がでてくるので注意．（Note） MIPS can be estimated with the another way (Dhrystone MIPS). This Dhrystone over-estimates IPC>2, for example, for superscalar with double instruction stream. 福永力; Chikara Fukunaga 2 スーパースカラの概念 Idea of Super scalar • • • 複数パイプラインを設け複数命令の同時発行（issue）を可能にする Simultaneous Issues of several instructions with multiple pipelines 命令は相互に独立してあるべきが理想 Ideally an instruction must be independent with the other ones. プログラムから独立した複数の命令を選ぶ（独立した命令をピックアップし矛盾なく並び替える仕事はCPUハードウェアによる）．もちろんプログラム側（コンパイラ）の配慮も必要 Multiple independent instructions should be selected and rearranged the order by CPU 福永力; Chikara Fukunaga 3 スーパースカラの方式・形態（１） Methods and types of Super scalar (1) • 一般的に命令処理にかかるクロック数（レイテンシー）は命令により異なることに注意（今までは簡単のため同じとして扱ってきたが）． Note the clock count of instructions may not be fixed (variable), although so far we implicitly assumed it as constant. • 複数の命令フェッチ（issueあるいはway数） Fetch of Multiple instructions; Number of instructions fetch simultaneously is called number of issues or ways. • フェッチとフェッチ以後のステージを分離しここに命令キュー（FIFO）を挟む（このFIFOをReservation StationあるいはCentral Instruction Window（集中型命令ウィンドウ）と呼ぶ）のが一般的． An instruction queue (FIFO) is implemented after Fetch stage but before other stages. This FIFO is called Reservation Station or Central Instruction Window 福永力; Chikara Fukunaga 4 スーパースカラの方式・形態（2） Methods and types of Super scalar (2) • 命令の発行，完了がオリジナルプログラム通りをインオーダー（In-order），違ってくる場合をアウトオブオーダー(Out-of-order）と呼ぶ．若干異なる構造をとる． If the execution of instructions is done in order with the original program, we call “In-order”, otherwise “Out-of-Order” • 命令依存性は命令キュー内待機命令に対して専用ハードウェアにより確認あるいは積極的に解消させる（レジスタリネーミング；後述）．この時点でのデータハザードによるストールは原則としておきないようにする． Dependency is checked with instructions waited in the FIFO queue. • ロード/ストア命令の実行ユニットは他の命令のそれと別にしてストール発生を抑えるようにしている（キャッシュ，メモリアクセスの遅さ考慮）． Load/Store instruction execution unit is normally independently installed in order to avoid the stall for data access in the main storage. 福永力; Chikara Fukunaga 5 命令のインオーダー発行 In order issue of instructions • 命令キュー出口より2つの命令の（オペランドの）依存関係をチェック，独立であれば2つを発行． IF two instructions out of the queue are confirmed independet, two are issued. • 依存性があれば1命令のみ発行．次の命令はその次の命令とペアを組む． If there is an dependency, only one instr. is taken, one with the later one is paired. • 命令の発行順はオリジナルプログラムのそれがキープされる． Issued order is kept in execution. 福永力; Chikara Fukunaga 6 メモリ研究Study（1） FIFO（Fast In Fast Out）形態 • メモリに入った順序でデータが保持される Data are kept in order with the ones entered. • 待ち行列の構造 Waiting queue structure • カウンタに多くの人が列を作ると行列は伸びる If many people come to a counter, queue is longer. • しかし列の先頭から処理がなされていくと行列は縮む But the queue is shrink if the process is done turn by turn • 押し掛ける人の数/時間（Producer）と処理される人の数/時間（Consumer）でFIFOの大きさ（深さ）を決めて両サイドの単位時間あたりのデータ量の違いを調整 Producer (number of people to counter) and Consumer (number of people accessed) rates will be adjusted with this queue. 福永力; Chikara Fukunaga 7 FIFO 構造 structure • 待ち行列のh/wでの実現をFIFO．アドレスを持つ必要がないので単純な構造でメモリを構築できる．It is implemented with less hardware ingredient than usual memory since this structure needs not the address. • s/wでも実現できる（循環バッファ） The circulation Buffer can be implemented with the software. 福永力; Chikara Fukunaga 8 命令のアウトオブオーダー発行 Out of order issue of instructions • 命令キュー全体（あるいは既定数の命令）について（オペランドの）依存関係を調べる All the instructions in the queue are checked with the independency • 依存関係のない命令をissue数分パイプラインステージに送る Independent instructions for number of ways are send to the pipelines． • 発行される順序はプログラムのそれと異なる可能性がある（アウトオブオーダー）． Execution order will be different with the one arranged in the program. • すべてに依存関係があれば 1命令発行も止むなし． If all the instructions have dependency, only one instruction will be employed 福永力; Chikara Fukunaga 9 命令の依存関係チェックのハードウェア機構 H/W check system for Dependency of instructions （Wake-up logic） • Instruction format is taken as “instruction D S1 S2”, Destination & sources For example, “add r1,r3,r1” = r1←r3+r1． • ある命令のDと他の命令のSの関係より命令の依存関係が判明する． Relation of D of an instruction with S of other instruction are used to check the dependency • 実行されなかった命令を実行させるようにするという意味でウェイクアップロジックとも呼ばれる． The instructions that are not able to be executed can be executed => Wakeup logic.s • このロジックの高速化と回路規模の縮小が課題 Fast processing and down-sizing of this logic is an issue for this circuit. 福永力; Chikara Fukunaga 10 （常に）アウトオブオーダー完了 (Constant) Out of order completion • 命令を並列処理するとレイテンシ（命令実行に要するクロック数）の違いで完了（リタイア）時のタイミングは異なる．これはインオーダー発行でも同じである．つまり基本的にスーパースカラではアウトオブオーダー完了の形をとる． Completion timing is different for instructions executed in parallel due to the difference of latency (Number of clock cycles required for execution). Thus the completion timing is always out of order in the super scalar machine. • 注意すべきはWAW hazardと割込み発生後の再開命令 It should be noted WAW hazard as well as the restart instruction after interrupt. • Data hazard with WAW（Write After Write） mulf r5,r3,r2 ; late ori r5,r4,0 ; early • 発生後の再開 Restart after interrupt mulf r1,r2,r3 ; late ori r3,r4,0 ; early  Interrupt occurs at this moment （この時点でr3は更新され，r1はまだだと，再開はmulfからでないとまずい; At this moment r3is updated, but r1 is kept old, so the restarting after the interrupt should be from mulf instruction 福永力; Chikara Fukunaga 11 ROBを入れてインオーダー完了 But in order completion is possible with ROB • スーパースカラはアウトオブオーダー完了が基本（前スライド）だが，前述の如く種々の問題を含んでいる．そこで命令完了の最終ステージであるレジスタライトはプログラム順番通りにさせる．これをインオーダー完了 There are problems in the out of order completion as seen in the previous slide. An idea to keep in order for Register writing at the last stage of instruction execution • このために（さらに）新たなるハードウェアを導入する．これをリオーダーバッファ（ReOrder Buffer; ROB）と呼ぶ．ここに各命令のプログラム上での順番を記録，各命令に対して実行が終了したかどうか記憶させる． In order to do so, an another hardware must be implemented. “Reorder Buffer; ROB” to memorize the execution of the instructions 福永力; Chikara Fukunaga 12 インオーダーかアウトオブオーダーか In order or Out of order? • Out-of-order発行/完了のほうがプロセッサのパフォーマンスは確実に向上する． Processor performance is better with out of order than in order. • しかしそのためには例えば動的スケジューリング機能としてのウェークアップロジックのようなハードウェア回路を必要とする． But need more hardware components like wakeup logic etc. • In order発行では複雑なハードウェアを組込む必要はないが平行して処理された命令に予期せぬ遅滞が生じた場合（キャッシュミスヒットとか）その完了をいたずらに待たなければならない．効率的ではない． An instruction must be waited even completed but if the other instructions take longer time with, for example, cache miss-hit in the case of in order issue. 福永力; Chikara Fukunaga 13 レジスタリネーミング Register renaming • アウトオブオーダー発行により生じる新たなデータハザード（WAR,WAW）の完全解消を目指す Out of order will make new data hazard with WAR and WAW • WAW（Write After Write）:後続命令のレジスタ変更を遅れて実行された先行命令が書き直す． • WAR（Write After Read）:先行する命令のソースレジスタを後続の命令実行で書き直す． • これらは相前後する命令で同じレジスタを利用している．そのレジスタの利用を他のレジスタで置き換えればこれらのハザードは回避できる． The hazards occur because instructions uses the same registers. We can avoid these hazard using other registers instead. • そのため（ハードウェアとして密かに）複数のレジスタを用意しそのような状況に陥ったときそれを利用すれば（ソフトウェアのコンシステンシは保たれたまま）ハザードを回避できる． If the hardware can prepare more (shadow) registers, WAR and WAW will be avoided. 福永力; Chikara Fukunaga 14 レジスタリネーミング例 Example of register renaming • If latency of mulf is longer than addi with 4-issues Multi scalar processor, WAR or WAW will be happened mult r3,r3,r5 RAW addi r4,r3,1 WAW WAR addi r3,r5,1 RAW mult r7,r3,r4 • レジスタリネーミングを行いデスティネーションレジスタを置き換えるとWAWは解消できる WAW can be eliminated with register renaming mult P1,r3,r5 RAW addi P2,P1,1 addi P3,r5,1 RAW mult P4,P3,P2 • RAWはレジスタリネーミングでも解消できない RAW can not be eliminated in principle 福永力; Chikara Fukunaga 15 一般的なスーパースカラ構造 Super scalar structure 福永力; Chikara Fukunaga 16 ここでも分岐予測は問題 Branch prediction is still serious issue • スーパースカラでも分岐命令の取り扱いには手を焼く（前回パイプラインと同じ状況） The situation is the same as one with the simple pipeline. • 静的予測/Static Prediction • 例：通常taken ,ループでnot takenと固定 Normally default is set to not-taken，but taken in a loop • 動的予測/Dynamic Prediction • Branch Prediction Buffer（BPB or BHT；takenするかどうか予測） &Branch Target Buffer（BTB；分岐先アドレス情報）方式の併用 • 投機実行/Speculative Prediction taken/not-takenどちらの分岐先の命令も先を見越（予測）して実行させる福永力; Chikara Fukunaga 17 ウェイクアップロジックについて Wake-up Logic Structure • ここではメモリの別形態CAMを用いた処理がなされるのが一般的 In general CAM will be used for this logic • CAMとは（means）Contents Addressable Memory あるいは（or）Contents Associative Memory • アドレス参照ではなく内容参照のメモリ；連想メモリとも呼ばれる A memory for referring data not from the address but from the data themselves • 例えば電話番号からその持ち主を知る（telephone book; number → name） Telephone Owner name • CPUではキャッシュヒット/TLBの判定 090-1234-8765 吉田で一般的に使われる CAM is used in Cash or TLB hit decision 080-9991-0101 長島 0233-220-1987 福永力; Chikara Fukunaga 藤田 18 Cacheの内部構成 Internal structure of Cache • 抽象化されたCacheの構造 Abstractive structure of Cache • コントロール部（アドレスデコーダとtag，コンパレータ）とCacheメモリ部 Control (Address decoder and Tag comparator) stage and Cache memory • メインストレージのアドレスとtag情報の素早い参照が重要（マッピング） From quick search of tag field, CPU knows whether the address in MS is involved in Cache or not. • データの格納法にもさまさまな工夫 Careful specification for Tag associated from the address 福永力; Chikara Fukunaga 19 メモリ研究Study（2） CAM構造/Structure • 基本構成は右図のようにまとめられる： Basic idea of CAM is summarized in this figure. • Input Dataは（必要とあれば）マスクで不必要な情報を無視できる Mask registers will mask unnecessary bit if it is needed. • 結果はマスクをかけず Hit（1）/No hit（0）だけでなくオリジナルとどれだけ食い違っているかの判定するものもある There is a tagging without the mask but with estimating number of bit (un)matched. 福永力; Chikara Fukunaga 比較 20 CAMの比較機構 Bit comparator used in CAM • 入力データとサンプルメモリのデータをビットごとに高速で比較する． High speed bit by bit comparison of Input data with reference • Word単位で並列化できる Parallelization is indispensable • 回路構成は単純だが1word並列処理でそのフルビット分回路を用意しなければならない Simple logic but need for all the bit • 右図では各ビットで比較しそのデータを下位から来た結果（Li-1）と論理積をとり Liとしさらに上位に送ることにより並列化を達成 AND (Li and Li-1) enables parallelization 福永力; Chikara Fukunaga 21 最小距離検索連想メモリLSI CAM LSI with the Minimum Distance Search （広島大学ナノデバイスシステム研究センター Hiroshima Univ. Nano-Device Research Center ; 2004）連想メモリのパターン認識への応用 Application of CAM for pattern recognition 検索（入力）データとメモリデータの比較に範囲（距離）を導入 Tagging of a Search word to reference data not exactly but with distance 入力データにもっとも近傍のメモリデータを探索する Pick up a reference data for search word with minimum distance with or 福永力; Chikara Fukunaga 22 Minimum-Distance-Search(MDS) CAM Core part • Word length （語長）1≦i<W bit • Number of References （標本数） 1≦j≦R • Samples (Ref) are stored in advance in Storage Cells SC (SCij) • Search word are distributed vertically downward. • Then all the comparsion are done paralell in Unit Comparator Cells UC (UCij) • Results are accumulated in analog manner in Word Comparator (WCi) 福永力; Chikara Fukunaga 23 MDS CAM 回路構成 Circuits in detail • Unit Comparator and Word Comparator (Manhattan Distance case) • Self Adaptive Winner Line-up Amplification 福永力; Chikara Fukunaga 24

スーパースカラ Super Scalar

Related documents

Products

Support

スーパースカラ Super Scalar

Related documents

Add this document to collection(s)

Add this document to saved

Suggest us how to improve StudyLib