A Fully Buffered Memory System Simulator
Rami Nasr
M.S. Thesis, and ENEE 759H Course Project
Thursday May 12th, 2005

Another Simulator?
• Sim-DRAM exists and supports FB-DIMM. Why write another simulator?
• Sim-DRAM still had a few unworkable bugs in its FB-DIMM model when I began my study.
• FB-DIMM is radically different from other memory architectures. New simulator => fresh start.
• FBsim is made exclusively for simulating and studying the FB-DIMM architecture. Easier to study FB-DIMM with an exclusive simulator.
• Different scheduler, mapping algorithm, approach, style, and area of study in the FB-DIMM design space.
• FBsim is ideal for simulating ‘unreasonably’ high memory request rates and studying channel saturation effects.
• The two simulators can be used to validate each other’s results in FB-DIMM studies.
• Writing a memory simulator was a great experience for me.

FBsim Overview
• All code written from scratch. Standalone product.
• Does not currently interface with CPU simulators or memory traces. Instead probabilistically models memory transactions according to user specifications. => Does not actually store memory data.
• Written in ANSI C. ~5000 lines of code. Code organized into header files, commented, quite easy to hack.
• Fast. For each memory channel, 1 second of run time simulates ~10 ms of memory activity (or ~1 ms during channel saturation) on a 2.4 GHz Pentium 4.
• Supports Open & Closed Page Mode, Fixed & Variable Latency Mode.
• Supports output of macro and micro (frame by frame) simulation data.
• Does not model channel init, maintenance, or sync overhead.
• Does not model memory refresh.
• Does not model power consumption or power timing limitations (tFAW etc.).
• The above options can be incorporated readily into future versions.

FBsim Overview 2
[Diagram: Input Transaction Generator, Address Mapper, Channel Scheduler 0, Channel Scheduler 1, ...]

A Frame Iteration
• Try to generate transactions.
• Map any generated transactions to their channel schedulers.
• Fire each scheduler once.
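The three steps of a frame iteration can be sketched as follows. This is a minimal Python illustration, not FBsim's code (FBsim is written in ANSI C). `ChannelScheduler`, `generate_transactions`, `map_to_channel`, and `run` are hypothetical names, the modulo channel mapping is a placeholder, and the one-retirement-per-frame rule and the 2:1 read/write mix are toy assumptions.

```python
import random


class ChannelScheduler:
    """Toy stand-in for a per-channel scheduler (hypothetical, not FBsim's)."""

    def __init__(self):
        self.queue = []
        self.retired = 0

    def accept(self, txn):
        self.queue.append(txn)

    def fire(self):
        # Toy rule (assumption): retire at most one queued transaction per frame.
        if self.queue:
            self.queue.pop(0)
            self.retired += 1


def generate_transactions(rng, rate):
    """Return zero or one (address, is_read) tuples for this frame."""
    if rng.random() < rate:
        # Assumed 2:1 read/write mix: a transaction is a read with probability 2/3.
        return [(rng.randrange(1 << 20), rng.random() < 2 / 3)]
    return []


def map_to_channel(address, n_channels):
    # Placeholder mapping (assumption): plain modulo interleaving across channels.
    return address % n_channels


def run(n_frames, n_channels=8, rate=0.5, seed=0):
    rng = random.Random(seed)
    schedulers = [ChannelScheduler() for _ in range(n_channels)]
    generated = 0
    for _ in range(n_frames):
        # 1. Try to generate transactions.
        for txn in generate_transactions(rng, rate):
            generated += 1
            # 2. Map each generated transaction to its channel scheduler.
            schedulers[map_to_channel(txn[0], n_channels)].accept(txn)
        # 3. Fire each scheduler once.
        for scheduler in schedulers:
            scheduler.fire()
    return generated, schedulers
```

Calling `run(1000)` steps eight toy channels through 1000 frames. In this simplified model every generated transaction is retired in the frame it arrives, so the queues never build up; a real scheduler would hold transactions while timing constraints are unmet.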
[Diagram label: Channel Scheduler 7]

Input Transaction Model
• Step Distributions
• Normal (Gaussian) Distributions

Input Transaction Model 2

Bus Trace Viewer

FBsim Model

Address Mapping
• Physical address must be mapped somehow to the right channel, DIMM, rank, bank, row, and column.
• FBsim built to support different DIMM capacities, even different channel capacities, unbalanced configurations.
• => Algorithm needed to map incoming transaction to DIMM.

[Figure labels: Closed Page Mode, Open Page Mode]

WHILE (a non zero row sum exists)
{
    WHILE (visit each channel with a non zero row sum exactly once)
    {
        The next 'result' is the channel's DIMM with the highest number.
        Decrement that DIMM's number by 1.
        Decrement the row sum by 1.
    }
}
Modulus = 4+2+1+2 = 9

Channel Scheduler

FB-DIMM Frame Format Review
SouthBound (SB) Frame could be a:
• Channel Frame (not modeled in FBsim)
• Command Frame (up to three DRAM commands, with only one command possible to each DIMM in the channel)
• Command + Wdata Frame (holds one DRAM command, plus one DDR beat of write data)
NorthBound (NB) Frame could be a:
• Channel Frame (not modeled in FBsim)
• Read Response Frame (holds two DDR beats of returned read data)

Some of my Results
Case Study
• 1x8 achieved 7.9 GBps before saturating (82%)
• 2x4 achieved 15.6 GBps (82%)
• 4x2 achieved 31.3 GBps (82%)
• 8x1 achieved 45.2 GBps (59%!)
Conclusion
• With at least two DIMMs on each channel, performance scales very well in FB-DIMM
• More than two DIMMs only increases capacity, not throughput
• Adding each DIMM adds ~5ns average channel latency in FLM, and slightly over half that in VLM
• In closed page mode, only 82% of peak theoretical throughput of a channel can be reached.

Some of my Results 2
• In Closed Page Mode with 2:1 read/write ratio, a reordering window of size ~12 transactions achieves best possible performance (channel saturation) for a FB-DIMM channel scheduler. Increasing window-size over this has no benefit.
• The more skewed the read/write ratio, the bigger the scheduling window needs to be (at 4:1, it's ~18).
• In Variable Latency Mode, a reordering window of size ~20 achieves best possible performance.

Some of my Results 3
• Micro-study shows that in Closed Page Mode, the FB channel can at most reach ~93% write data utilization on the SB, and ~84% read data utilization on the NB.
• Micro-study showed that FBsim channel utilization was slightly worse for non 2:1 read/write ratios (it was 2% worse for 4:1).
• FBsim scheduler can quite straightforwardly be made more adaptive to the read/write ratio of the transactions in the scheduler.

Future Ideas with FBsim
• I’m graduating this semester (if Dr Jacob and Mr (Dr?) Wang so please), and escaping to the corporate world.
• => Writing a guide for FBsim along with some ideas for future work. Anyone who wishes to take over development is eagerly encouraged to.
• If so, I would be happy to help get things rolling by email or in person. Feel free to access & use anything in FBsim or my thesis paper.
• I strongly believe a very interesting paper or three can quite quickly come out of this research area (me)

Future Ideas with FBsim 2
• For credibility in a paper, add an interface between FBsim and a CPU simulator or memory traces. Run real benchmarks through FBsim. Compare and contrast these results with the transaction modeling results.
• AND/OR add more functionality and provable realism to the transaction modeler. Study this.
• Best yet, integrate FBsim into the Sim-DRAM package as an added option.
• Add modeling for channel overhead, memory refresh overhead, error simulation and error handling, power consumption constraints and metrics.
• Enhance adaptivity of FBsim scheduler to non 2:1 read/write ratios.
• Experiment with address mapping algorithm and load balancing.
• Experiment with different types of scheduler implementations (e.g. ones not based on pattern matching). *involved*
• Study hardware constraints in FB-DIMM channel scheduling.
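As a starting point for the address-mapping experiments listed above, the WHILE loop from the Address Mapping slide can be sketched in Python. This is one reading of that pseudocode, not FBsim's implementation. The names `build_mapping_table` and `map_address` are hypothetical, splitting the weights 4, 2, 1, 2 across two channels is an assumed example, and breaking ties toward the lowest DIMM index is an assumption the slide does not state.

```python
def build_mapping_table(weights):
    """Build the interleaving sequence of (channel, dimm) targets.

    weights[c][d] is the 'number' of DIMM d on channel c (assumed here to be
    its relative capacity). The table length equals the sum of all weights,
    which the slide calls the modulus.
    """
    remaining = [list(row) for row in weights]  # copy so the input is untouched
    table = []
    # WHILE a non zero row sum exists
    while any(sum(row) > 0 for row in remaining):
        # Visit each channel with a non zero row sum exactly once.
        for channel, row in enumerate(remaining):
            if sum(row) == 0:
                continue
            # Pick the DIMM with the highest number
            # (ties go to the lowest index: an assumption).
            dimm = max(range(len(row)), key=lambda i: row[i])
            row[dimm] -= 1  # also decrements the row sum by 1
            table.append((channel, dimm))
    return table


def map_address(address, table):
    """Map an address to a (channel, dimm) pair by indexing modulo the table length."""
    return table[address % len(table)]


# Assumed example: channel 0 holds DIMMs weighted 4 and 2, channel 1 holds 1 and 2.
example_table = build_mapping_table([[4, 2], [1, 2]])
```

With these example weights the table has 4+2+1+2 = 9 entries, and each DIMM appears as many times as its weight, so larger DIMMs receive proportionally more of the address space while consecutive addresses alternate between channels for as long as both still have weight left.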

More Possible FB-DIMM Studies
• Channel utilization and configuration trade-offs for Open Page Mode
• Performance degradation of shrinking scheduler reorder window size
• Relaxation on critical DRAM device parameters (density, nBanks, timing constraints, clock frequency) allowed by FB-DIMM architecture OR optimizing the FB-DIMM architecture by increasing the SB and NB channel widths (adding lines) or bitrates, and maybe modifying the frame protocol
• AMB is a logic device on a memory module!! Can add buffers, arithmetic units, processing power, etc.

Special Thanks to..
• Dr Jacob for introducing me to the field and guiding my progress
• David Wang for the course lectures and material