Platforms II.
Dezső Sima
December 2011 (Ver. 1.6)
Sima Dezső, 2011

3. Platform architectures

Contents
• 3.1. Design space of the basic platform architecture
• 3.2. The driving force for the evolution of platform architectures
• 3.3. DT platforms
  • 3.3.1. Design space of the basic architecture of DT platforms
  • 3.3.2. Evolution of Intel's home user oriented multicore DT platforms
  • 3.3.3. Evolution of Intel's business user oriented multicore DT platforms
• 3.4. DP server platforms
  • 3.4.1. Design space of the basic architecture of DP server platforms
  • 3.4.2. Evolution of Intel's low cost oriented multicore DP server platforms
  • 3.4.3. Evolution of Intel's performance oriented multicore DP server platforms
• 3.5. MP server platforms
  • 3.5.1. Design space of the basic architecture of MP server platforms
  • 3.5.2. Evolution of Intel's multicore MP server platforms
  • 3.5.3. Evolution of AMD's multicore MP server platforms

3.1. Design space of the basic platform architecture

3.1 Design space of the basic platform architecture (1)

The platform architecture comprises
• the architecture of the processor subsystem
  • Interpreted only for DP/MP systems
  • In SMPs it specifies the interconnection of the processors and the chipset
  • In NUMAs it specifies the interconnections between the processors
• the architecture of the memory subsystem
  • Specifies the point and the layout of the interconnection
• the architecture of the I/O subsystem
  • Specifies the structure of the I/O subsystem (will not be discussed)

Example: Core 2/Penryn based MP SMP platform
Figure: Four processors connected to the MCH by individual FSBs; memory is attached to the MCH over serial FB-DIMM channels; the chipset consists of two parts, designated as the MCH and the ICH.

3.1 Design space of the basic platform architecture (2)

The notion of the basic platform architecture
The basic platform architecture covers the architecture of the processor subsystem and the architecture of the memory subsystem, but not the architecture of the I/O subsystem.

3.1 Design space of the basic platform architecture (3)

Architecture of the processor subsystem
Interpreted only for DP and MP systems. The interpretation depends on whether the multiprocessor system is an SMP or a NUMA.
• SMP systems: scheme of attaching the processors to the rest of the platform
• NUMA systems: scheme of interconnecting the processors
Figure: Examples — an SMP with processors attached to the MCH/ICH over an FSB, and a NUMA with directly interconnected processors and per-processor memory.

3.1 Design space of the basic platform architecture (4)

a) Scheme of attaching the processors to the rest of the platform (in case of SMP systems)
• DP platforms: single FSB or dual FSBs
• MP platforms: single FSB, dual FSBs or quad FSBs
Figure: Attachment schemes — processors sharing a single FSB to the MCH, or attached over two or four FSBs to the MCH.
3.1 Design space of the basic platform architecture (5)

b) Scheme of interconnecting the processors (in case of NUMA systems)
• Partially connected mesh
• Fully connected mesh
Figure: Four processors, each with local memory, interconnected either by a partially connected mesh or by a fully connected mesh.

3.1 Design space of the basic platform architecture (6)

The notion of the basic platform architecture (recap)
The basic platform architecture covers the architecture of the processor subsystem and of the memory subsystem; the architecture of the I/O subsystem is not part of it.

3.1 Design space of the basic platform architecture (7)

Architecture of the memory subsystem (MSS)
It is given by
• the point of attaching the MSS, and
• the layout of the interconnection.

3.1 Design space of the basic platform architecture (8)

a) Point of attaching the MSS (memory subsystem) (1)
Figure: The memory can be attached either to the MCH or to the processor.

3.1 Design space of the basic platform architecture (9)

Point of attaching the MSS — assessing the basic design options (2)
Attaching memory to the MCH (Memory Control Hub):
• Longer access time (by ~20–70 %).
• As the memory controller is on the MCH die, the memory type (e.g. DDR2 or DDR3) and speed grade are not bound to the processor chip design.
Attaching memory to the processor(s):
• Shorter access time (by ~20–70 %).
• As the memory controller is on the processor die, the memory type (e.g. DDR2 or DDR3) and speed grade are bound to the processor chip design.

3.1 Design space of the basic platform architecture (10)

Related terminology
Attaching memory to the MCH:
• DT platforms: DT systems with off-die memory controllers
• DP/MP platforms: shared memory DP/MP systems, i.e. SMP systems (Symmetrical Multiprocessors)
Attaching memory to the processor(s):
• DT platforms: DT systems with on-die memory controllers
• DP/MP platforms: distributed memory DP/MP systems, i.e. NUMA systems (systems with non-uniform memory access)

3.1 Design space of the basic platform architecture (11)

Example 1: Point of attaching the MSS in DT systems
Figure: A DT system with an off-die memory controller (memory attached to the MCH, processor on the FSB) vs. a DT system with an on-die memory controller (memory attached directly to the processor).
Examples: Intel's processors before Nehalem vs. Intel's Nehalem and subsequent processors.

3.1 Design space of the basic platform architecture (12)

Example 2: Point of attaching the MSS in SMP-based DP servers
Attaching memory to the MCH:
• Shared memory DP server, aka Symmetrical Multiprocessor (SMP)
• Memory does not scale with the number of processors
Attaching memory to the processor(s):
• Distributed memory DP server, aka system with non-uniform memory access (NUMA)
• Memory scales with the number of processors
Examples: Intel's processors before Nehalem vs. Intel's Nehalem and subsequent processors.
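Remark: the scaling contrast above can be made concrete with a small back-of-the-envelope sketch. The channel counts and the transfer rate used below are illustrative assumptions, not values taken from the slides; the point is only that channels hanging off the MCH stay constant as processors are added, while channels attached to the processors grow with the processor count.

    # Illustrative sketch: aggregate peak memory bandwidth of a DP/MP platform
    # when memory is attached to the MCH (SMP) vs. to the processors (NUMA).
    # Channel counts and transfer rate are assumed example values.

    def peak_bw_gbs(n_channels, width_bytes, transfer_rate_mt_s):
        """Peak bandwidth in GB/s: channels x width x transfer rate."""
        return n_channels * width_bytes * transfer_rate_mt_s / 1000.0

    WIDTH = 8                 # 8-byte wide DIMM datapath
    RATE  = 1067              # e.g. DDR3-1067, in MT/s

    for n_proc in (2, 4, 8):
        smp_channels  = 4                # fixed: channels hang off the MCH
        numa_channels = 3 * n_proc       # e.g. 3 channels per processor
        print(f"{n_proc} processors: "
              f"SMP {peak_bw_gbs(smp_channels, WIDTH, RATE):5.1f} GB/s total, "
              f"NUMA {peak_bw_gbs(numa_channels, WIDTH, RATE):5.1f} GB/s total")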
3.1 Design space of the basic platform architecture (13)

Examples for the point of attaching the MSS
Attaching memory to the MCH:
• UltraSPARC II (1C) (~1997)
• AMD's K7 lines (1C) (1999-2003)
• POWER4 (2C) (2001)
• PA-8800 (2004), PA-8900 (2005) and all previous PA lines
• Core 2 Duo line (2C) (2006) and all preceding Intel lines
• Core 2 Quad line (2x2C) (2006/2007), Penryn line (2x2C) (2008)
• Montecito (2C) (2006)
Attaching memory to the processor(s):
• UltraSPARC III (2001) and all subsequent Sun lines
• Opteron server lines (2C) (2003) and all subsequent AMD lines
• POWER5 (2C) (2005) and subsequent POWER families
• Nehalem lines (4C) (2008) and all subsequent Intel lines
• Tukwila (4C) (2010??)
Figure: Point of attaching the MSS

3.1 Design space of the basic platform architecture (14)

b) Layout of the interconnection
Attaching memory via parallel channels:
• Data are transferred over parallel buses,
• e.g. 64 data bits plus address, command and control as well as clock signals in each cycle.
Attaching memory via serial links:
• Data are transferred over point-to-point links in the form of packets,
• e.g. 16 cycles/packet on a 1-bit wide link or 4 cycles/packet on a 4-bit wide link.
Figure: Attaching memory via parallel channels or serial links

3.1 Design space of the basic platform architecture (15)

b1) Attaching memory via parallel channels
The memory controller and the DIMMs are connected
• by a single parallel memory channel
• or by a small number of memory channels
to synchronous DIMMs, such as SDRAM, DDR, DDR2 or DDR3 DIMMs.
Example 1: Attaching DIMMs via a single parallel memory channel to a memory controller implemented on the chipset [45]

3.1 Design space of the basic platform architecture (16)

Example 2: Attaching DIMMs via 3 parallel memory channels to memory controllers implemented on the processor die
(This is Intel's Tylersburg DP platform, aimed at the Nehalem-EP processor, used for up to 6 cores) [46]

3.1 Design space of the basic platform architecture (17)

The number of lines of the parallel channels
The number of lines needed depends on the kind of memory modules used:
• SDRAM: 168-pin
• DDR: 184-pin
• DDR2: 240-pin
• DDR3: 240-pin
All these DIMM modules provide an 8-byte wide datapath and optionally ECC and registering.
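Remark: since each of these DIMM types presents an 8-byte datapath, the peak bandwidth of one parallel channel follows directly from the transfer rate of the memory used. A minimal sketch; the speed grades listed are common examples, not a statement about any particular platform.

    # Peak bandwidth of a single parallel memory channel:
    # an 8-byte wide datapath transferring once per MT/s of the DIMM.

    CHANNEL_WIDTH_BYTES = 8   # common to SDRAM/DDR/DDR2/DDR3 DIMMs

    def channel_peak_bw_gbs(transfer_rate_mt_s):
        return CHANNEL_WIDTH_BYTES * transfer_rate_mt_s / 1000.0

    # Example speed grades (MT/s); e.g. DDR3-1333 -> 8 B x 1333 MT/s = 10.7 GB/s
    for name, rate in [("DDR-400", 400), ("DDR2-800", 800),
                       ("DDR3-1333", 1333), ("DDR3-1600", 1600)]:
        print(f"{name:10s}: {channel_peak_bw_gbs(rate):5.1f} GB/s per channel")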
3.1 Design space of the basic platform architecture (18)

b2) Attaching memory via serial links
Serial memory links are point-to-point interconnects that use differential signaling.
Two options:
• Serial links attach FB-DIMMs (the FB-DIMMs provide buffering and serial/parallel conversion)
• Serial links attach S/P converters with parallel channels behind them
Figure: Serial links running from the processor/MCH either directly to FB-DIMMs or to S/P converters that drive parallel channels.

3.1 Design space of the basic platform architecture (19)

Example 1: FB-DIMM links in Intel's Bensley DP platform aimed at Core 2 processors (1)
Figure: Two 65 nm Pentium 4 Prescott DP (2x1C)/Core 2 (2C/2x2C) processors — Xeon 5000 (Dempsey) 2x1C, Xeon 5100 (Woodcrest) 2C, Xeon 5300 (Clovertown) 2x2C, Xeon 5200 2C or Xeon 5400 (Harpertown) 2x2C — attached over dual FSBs to the E5000 MCH, which drives FB-DIMM channels w/DDR2-533; the ESI connects the MCH to the 631xESB/632xESB IOH.
ESI: Enterprise System Interface — 4 PCIe lanes, 0.25 GB/s per lane (like the DMI interface, providing 1 GB/s transfer rate in each direction).

3.1 Design space of the basic platform architecture (20)

Example 2: SMI links in Intel's Boxboro-EX platform aimed at the Nehalem-EX processors (1)
Figure: Two Nehalem-EX (8C)/Westmere-EX (10C) processors — Xeon 6500 (Becton) or Xeon E7-2800 (Westmere-EX) — interconnected by QPI and connected by QPI to the 7500 IOH (which attaches the ICH10 over ESI and includes the ME); each processor drives four SMI links, each ending in an SMB with DDR3-1067 channels behind it.
SMI: serial link between the processor and the SMB.
SMB: Scalable Memory Buffer with parallel/serial conversion.
Nehalem-EX aimed Boxboro-EX scalable DP server platform (for up to 10 cores).

3.1 Design space of the basic platform architecture (21)

Example 2: The SMI link of Intel's Boxboro-EX platform aimed at the Nehalem-EX processors (2) [26]
• The SMI interface builds on the Fully Buffered DIMM architecture with a few protocol changes, such as those intended to support DDR3 memory devices.
• It has the same layout as FB-DIMM links (14 outbound and 10 inbound differential lanes as well as a few clock and control lanes).
• It needs altogether about 50 PCB traces.
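Remark: the attraction of serial memory links shows up already in the pin budget. A rough sketch of the arithmetic behind the ~50-trace figure quoted above, contrasted with a 240-pin parallel DDR2/DDR3 channel; the exact clock/control lane count is an approximation made here.

    # Rough pin/trace budget: FB-DIMM/SMI-style serial link vs. a parallel DDR channel.
    # Differential signaling doubles the wire count per lane.

    outbound_lanes = 14          # southbound lanes (commands + write data)
    inbound_lanes  = 10          # northbound lanes (read data)
    clock_control  = 2           # approximate: a few clock/control lanes

    serial_traces = (outbound_lanes + inbound_lanes) * 2 + clock_control * 2
    parallel_pins = 240          # DDR2/DDR3 DIMM connector

    print(f"Serial FB-DIMM/SMI link : ~{serial_traces} traces")
    print(f"Parallel DDR2/DDR3 slot : {parallel_pins} pins")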
3.1 Design space of the basic platform architecture (22)

Design space of the architecture of the MSS
Figure: A design space spanned by the point of attaching memory (to the MCH or to the processor(s)) and the layout of the interconnection (parallel channels attach DIMMs, serial links attach FB-DIMMs, serial links attach S/P converters with parallel channels).

3.1 Design space of the basic platform architecture (23)

Max. number of memory channels that can be implemented while using particular design options of the MSS
Subsequent fields of the design space of the architecture of the MSS, taken from left to right and from top to bottom, allow an increasing number of memory channels (nM) to be implemented, as discussed in Section 4.2.5 and indicated in the next figure.

3.1 Design space of the basic platform architecture (24)

Figure: The same design space of the architecture of the MSS, annotated with the trend of an increasing number of implementable memory channels (nM).

3.1 Design space of the basic platform architecture (25)

The design space of the basic platform architecture (1)
The basic platform architecture covers the architecture of the processor subsystem and the architecture of the memory subsystem; the architecture of the I/O subsystem is not considered.

3.1 Design space of the basic platform architecture (26)

The design space of the basic platform architectures (2)
Obtained as the combinations of the options available for the main aspects discussed:
• Architecture of the processor subsystem
  • Scheme of attaching the processors (in case of SMP systems)
  • Scheme of interconnecting the processors (in case of NUMA systems)
• Architecture of the memory subsystem (MSS)
  • Point of attaching the MSS
  • Layout of the interconnection

3.1 Design space of the basic platform architecture (27)

The design spaces of the basic platform architectures of DT, DP and MP platforms will be discussed next:
• Design space of the basic architecture of DT platforms: Section 3.3.1
• Design space of the basic architecture of DP server platforms: Section 3.4.1
• Design space of the basic architecture of MP server platforms: Section 3.5.1

3.2. The driving force for the evolution of platform architectures

3.2 The driving force for the evolution of platform architectures (1)

The peak per processor bandwidth demand of a platform
Let's consider a single processor of a platform and the bandwidth available to it (BW).
The available (peak) memory bandwidth of a processor (BW) is the product of
• the number of memory channels available per processor (nM),
• their width (w), and
• the transfer rate of the memory used (fM):
BW = nM x w x fM
• BW needs to be scaled with the peak performance of the processor.
• The peak performance of the processor increases linearly with the core count (nC).
Consequently, the per processor memory bandwidth (BW) needs to be scaled with the core count (nC).

3.2 The driving force for the evolution of platform architectures (2)

If we assume a constant width for the memory channel (w = 8 bytes), it can be stated that nM x fM needs to be scaled with the number of cores, that is, it needs to be doubled approximately every two years.
This statement summarizes the driving force for raising the bandwidth of the memory subsystem and is at the same time the major motivation for the evolution of platform architectures.

3.2 The driving force for the evolution of platform architectures (3)

The bandwidth wall
• As core counts (nC) have recently doubled roughly every two years, the per processor bandwidth demand of platforms (BW) also doubles roughly every two years, as discussed before.
• On the other hand, memory speed (fM) doubles only approximately every four years, as indicated e.g. in the next figure for Samsung's memory technology.

3.2 The driving force for the evolution of platform architectures (4)

Figure: Evolution of the memory technology of Samsung [12]
The time span between e.g. DDR-400 and DDR3-1600 is approximately 7 years, which means roughly a doubling of memory speed (fM) every 4 years.

3.2 The driving force for the evolution of platform architectures (5)

• This causes a widening gap between the bandwidth demand and the bandwidth growth achievable through increasing memory speed alone.
• This gap can be designated as the bandwidth wall.
It is the task of the developers of platform architectures to overcome the bandwidth wall by providing the needed number of memory channels.
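Remark: the widening gap can be illustrated numerically. A minimal sketch, assuming cores double every 2 years and memory speed doubles every 4 years as stated above; the baseline core count, channel count and memory rate are placeholder values, only the growth rates matter.

    # Bandwidth wall sketch: demand grows with core count (x2 every 2 years),
    # while per-channel bandwidth grows with memory speed (x2 every 4 years).
    # With a fixed channel count, the shortfall must be covered by adding channels.

    W = 8                       # channel width in bytes (constant)
    n_m = 2                     # baseline: memory channels per processor
    n_c0, f_m0 = 2, 400         # baseline: 2 cores, DDR-400 (400 MT/s)
    bw_per_core = n_m * W * f_m0 / n_c0   # per-core bandwidth to be preserved

    for year in range(0, 9, 2):
        n_c = n_c0 * 2 ** (year / 2)      # cores double every 2 years
        f_m = f_m0 * 2 ** (year / 4)      # memory speed doubles every 4 years
        demand = n_c * bw_per_core                # MB/s needed
        supply = n_m * W * f_m                    # MB/s with unchanged channel count
        needed_channels = demand / (W * f_m)      # channels needed to close the gap
        print(f"year {year}: demand/supply = {demand/supply:4.2f}, "
              f"channels needed = {needed_channels:4.1f}")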
3.2 The driving force for the evolution of platform architectures (6)

The square root rule of scaling the number of memory channels
It can be shown that when the core count (nC) increases according to Moore's law and memory subsystems evolve by using the fastest available memory devices, as is typical, then the number of memory channels available per processor needs to be scaled as
nM(nC) = √2 x √nC
to provide a linear scaling of the bandwidth with the core count (nC).
The scaled number of memory channels available per processor (nM(nC)) and the increased device speed (fM) together then provide the needed linear scaling of the per processor bandwidth (BW) with nC.
The above relationship can be termed the square root rule of scaling the number of memory channels.
Remark
For multiprocessors incorporating nP processors, the total number of memory channels of the platform (NM) amounts to
NM = nP x nM
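A short sketch of how the rule plays out, under the same growth-rate assumptions as before (cores double every 2 years, memory speed every 4 years, so fM grows as √nC); it also applies the remark NM = nP x nM to a hypothetical DP (nP = 2) configuration. The baseline rate is again a placeholder.

    import math

    # Square root rule sketch: if f_M grows as sqrt(n_C) (memory speed doubles
    # every 4 years while cores double every 2), then choosing
    #   n_M(n_C) = sqrt(2) * sqrt(n_C)
    # keeps BW = n_M * w * f_M growing linearly with the core count n_C.

    W = 8                                   # channel width in bytes
    f_m0 = 400                              # baseline memory rate at n_C = 2 (MT/s)

    def f_m(n_c):                           # memory speed grows as sqrt(n_C)
        return f_m0 * math.sqrt(n_c / 2)

    def n_m(n_c):                           # the square root rule
        return math.sqrt(2) * math.sqrt(n_c)

    for n_c in (2, 4, 8, 16):
        bw = n_m(n_c) * W * f_m(n_c)        # per-processor bandwidth in MB/s
        total_channels = 2 * n_m(n_c)       # N_M = n_P x n_M for a DP platform
        print(f"n_C={n_c:2d}: n_M={n_m(n_c):4.1f}, BW={bw/1000:5.1f} GB/s "
              f"({bw/n_c/1000:.2f} GB/s per core), N_M(DP)={total_channels:4.1f}")

The per-core bandwidth stays constant, i.e. the total per-processor bandwidth scales linearly with nC, which is exactly what the rule is meant to guarantee.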
3.3. DT platforms

• 3.3.1. Design space of the basic architecture of DT platforms
• 3.3.2. Evolution of Intel's home user oriented multicore DT platforms
• 3.3.3. Evolution of Intel's business user oriented multicore DT platforms

3.3.1 Design space of the basic architecture of DT platforms (1)

Figure: Design space of DT platforms, spanned by the point of attaching the MSS (to the MCH or to the processor) and the layout of the interconnection (parallel channels attach DIMMs, serial links attach FB-DIMMs, serial links attach S/P converters with parallel channels); the number of memory channels increases along these options.
• Attaching memory to the MCH via parallel channels: Pentium D/EE to Penryn (up to 4C)
• Attaching memory to the processor via parallel channels: 1. gen. Nehalem to Sandy Bridge (up to 6C)

3.3.1 Design space of the basic architecture of DT platforms (2)

Evolution of Intel's DT platforms (overview)
Attaching memory to the MCH, parallel channels attach DIMMs:
• Pentium D/EE 2x1C (2005/6)
• Core 2 2C (2006)
• Core 2 Quad 2x2C (2007)
• Penryn 2C/2x2C (2008)
Attaching memory to the processor, parallel channels attach DIMMs:
• 1. gen. Nehalem 4C (2008), Westmere-EP 6C (2010)
• 2. gen. Nehalem 4C (2009), Westmere-EP 2C+G (2010)
• Sandy Bridge 2C/4C+G (2011), Sandy Bridge-E 6C (2011)
In DT platforms there was no need for the higher memory bandwidth attainable through serial memory interconnections.

3.3.2 Evolution of Intel's home user oriented multicore DT platforms (1)

• Anchor Creek (2005): Pentium D/Pentium EE (2x1C), FSB, 945/955X/975X MCH (up to DDR2-667, 2/4 DDR2 DIMMs, up to 4 ranks), DMI, ICH7.
• Bridge Creek (2006, Core 2 aimed), Salt Creek (2007, Core 2 Quad aimed), Boulder Creek (2008, Penryn aimed): Core 2 (2C)/Core 2 Quad (2x2C)/Penryn (2C/2x2C), FSB, 965/3-/4-Series MCH (up to DDR2-800 or DDR3-1067), DMI, ICH8/9/10.
• Tylersburg (2008): 1. gen. Nehalem (4C)/Westmere-EP (6C), memory (up to DDR3-1067) attached to the processor, QPI, X58 IOH, DMI, ICH10.

3.3.2 Evolution of Intel's home user oriented multicore DT platforms (2)

• Tylersburg (2008): 1. gen. Nehalem (4C)/Westmere-EP (6C), up to DDR3-1067, QPI, X58 IOH, DMI, ICH10.
• Kings Creek (2009): 2. gen. Nehalem (4C)/Westmere-EP (2C+G), up to DDR3-1333, DMI, FDI, 5-Series PCH.
• Sugar Bay (2011): Sandy Bridge (4C+G), up to DDR3-1333, DMI2, FDI, 6-Series PCH.

3.3.2 Evolution of Intel's home user oriented multicore DT platforms (3)

• Tylersburg (2008): 1. gen. Nehalem (4C)/Westmere-EP (6C), up to DDR3-1067, QPI, X58 IOH, DMI, ICH10.
• Waimea Bay (2011): Sandy Bridge-E (4C/6C), up to DDR3-1600, DMI2, X79 PCH.
  DDR3-1600: up to 1 DIMM per channel; DDR3-1333: up to 2 DIMMs per channel.

3.3.3 Evolution of Intel's business user oriented multicore DT platforms (1)

• Lyndon (2005): Pentium D/Pentium EE (2x1C), FSB, 945/955X/975X MCH (up to DDR2-667, 2/4 DDR2 DIMMs, up to 4 ranks), DMI, ICH7, LCI to the 82573E GbE controller (Tekoe), ME, Gigabit Ethernet LAN connection.
• Averill Creek (2006, Core 2 aimed), Weybridge (2007, Core 2 Quad aimed), McCreary (2008, Penryn aimed): Core 2 (2C)/Core 2 Quad (2x2C)/Penryn (2C/2x2C), FSB, Q965/Q35/Q45 MCH (up to DDR2-800 or DDR3-1067), ME, DMI, C-link, ICH8/9/10, LCI/GLCI to the 82566/82567 LAN PHY, Gigabit Ethernet LAN connection.
• Piketon (2009): 2. gen. Nehalem (4C)/Westmere-EP (2C+G), up to DDR3-1333, FDI, DMI, Q57 PCH, ME, PCIe 2.0/SMBus 2.0 to the 82578 GbE LAN PHY, Gigabit Ethernet LAN connection.

3.3.3 Evolution of Intel's business user oriented multicore DT platforms (2)

• Piketon (2009): 2. gen. Nehalem (4C)/Westmere-EP (2C+G), up to DDR3-1333, FDI, DMI, Q57 PCH, ME, PCIe 2.0/SMBus 2.0 to the 82578 GbE LAN PHY, Gigabit Ethernet LAN connection.
• Sugar Bay (2011): Sandy Bridge (4C+G), up to DDR3-1333, FDI, DMI2, Q67 PCH, ME, PCIe 2.0/SMBus 2.0 to the GbE LAN PHY, Gigabit Ethernet LAN connection.
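Remark: the DT evolution above is the driving force of Section 3.2 at work. A small sketch comparing the peak memory bandwidth of a few of the platforms listed; the per-platform channel counts are assumptions added here for illustration (e.g. two channels for the MCH-based platforms, three for Tylersburg, four for Waimea Bay), not figures taken from the slides.

    # Peak memory bandwidth of selected DT platforms: BW = n_M x w x f_M.
    # Channel counts below are illustrative assumptions for the named platforms.

    W = 8  # bytes per channel

    platforms = [
        # (name,                n_M, f_M in MT/s)
        ("Anchor Creek (2005)",   2,  667),   # assumed 2 x DDR2-667 at the MCH
        ("Boulder Creek (2008)",  2, 1067),   # assumed 2 x DDR3-1067 at the MCH
        ("Tylersburg (2008)",     3, 1067),   # assumed 3 x DDR3-1067 at the CPU
        ("Waimea Bay (2011)",     4, 1600),   # assumed 4 x DDR3-1600 at the CPU
    ]

    for name, n_m, f_m in platforms:
        print(f"{name:22s}: {n_m * W * f_m / 1000:5.1f} GB/s peak")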
3.4. DP server platforms

• 3.4.1. Design space of the basic architecture of DP server platforms
• 3.4.2. Evolution of Intel's low cost oriented multicore DP server platforms
• 3.4.3. Evolution of Intel's performance oriented multicore DP server platforms

3.4.1 Design space of the basic architecture of DP server platforms (1)

Figure: Design space of DP platforms, spanned by the architecture of the processor subsystem (single FSB, dual FSBs, NUMA) and the layout of the memory interconnection; the number of memory channels (nM) increases along these options.
• Single FSB, parallel channels attach DIMMs: 90 nm Pentium 4 DP 2x1C (2005); Core 2/Penryn 2C/2x2C (2006/7)
• Dual FSBs, serial links attach FB-DIMMs: 65 nm Pentium 4 DP 2x1C, Core 2/Penryn 2C/2x2C (2006/7)
• NUMA, parallel channels attach DIMMs: Nehalem-EP to Sandy Bridge-EP/EN, up to 8C (2009/11)
• NUMA, serial links attach S/P converters with parallel channels: Nehalem-EX/Westmere-EX 8C/10C (2010/11)

3.4.1 Design space of the basic architecture of DP server platforms (2)

Scheme of attaching and interconnecting DP processors — evolution of Intel's DP platforms (overview)
SMP platforms:
• Single FSB, parallel channels attach DIMMs (efficiency oriented): 90 nm Pentium 4 DP (Paxville DP) 2x1C (2006); Core 2 2C, Core 2 Quad 2x2C, Penryn 2C/2x2C (2006/2007) (Cranberry Lake)
• Dual FSBs, serial links attach FB-DIMMs (high performance): 65 nm Pentium 4 DP 2x1C, Core 2 2C, Core 2 Quad 2x2C, Penryn 2C/2x2C (2006/2007) (Bensley)
NUMA platforms:
• Parallel channels attach DIMMs (high performance): Nehalem-EP 4C (2009), Westmere-EP 6C (2010) (Tylersburg-EP); Sandy Bridge-EP 8C (2011) (Romley-EP)
• Parallel channels attach DIMMs (efficiency oriented): Sandy Bridge-EN 8C (2011) (Romley-EN)
• Serial links attach S/P converters with parallel channels (high performance): Nehalem-EX/Westmere-EX 8C/10C (2010/2011) (Boxboro-EX)
The number of memory channels (nM) increases along these options.

3.4.2 Evolution of Intel's low cost oriented multicore DP server platforms (1)

3.4.2 Evolution of Intel's low cost oriented multicore DP server platforms (2)

Evolution from the Pentium 4 Prescott DP aimed DP platform (up to 2 cores) to the Penryn aimed Cranberry Lake DP platform (up to 4 cores)
• 90 nm Pentium 4 Prescott DP aimed DP server platform (for up to 2 C): two Xeon DP 2.8 (Paxville DP) / 90 nm Pentium 4 Prescott DP (2C) processors on a single FSB, E7520 MCH (DDR-266/333, DDR2-400), HI 1.5 to the ICH5R/6300ESB IOH.
• Penryn aimed Cranberry Lake DP server platform (for up to 4 C): two Core 2 (2C)/Core 2 Quad (2x2C)/Penryn (2C/2x2C) processors — Xeon 5300 (Clovertown) 2x2C, Xeon 5400 (Harpertown) 4C or Xeon 5200 2C — on a single FSB, E5100 MCH (DDR2-533/667), ESI to the ICH9R.
HI 1.5 (Hub Interface 1.5): 8 bit wide, 66 MHz clock, QDR, 266 MB/s peak transfer rate.
ESI: Enterprise System Interface — 4 PCIe lanes, 0.25 GB/s per lane (like the DMI interface, providing 1 GB/s transfer rate in each direction).

3.4.2 Evolution of Intel's low cost oriented multicore DP server platforms (3)

Evolution from the Penryn aimed Cranberry Lake DP platform (up to 4 cores) to the Sandy Bridge-EN aimed Romley-EN DP platform (up to 8 cores)
• Penryn aimed Cranberry Lake DP platform (for up to 4 C): two Core 2 (2C/2x2C)/Penryn (2C/4C) processors (Xeon 5300/5400/5200) on a single FSB, E5100 MCH (DDR2-533/667), ESI to the ICH9R.
• Sandy Bridge-EN (Socket B2) aimed Romley-EN DP server platform (for up to 8 cores): two E5-2400 (Sandy Bridge-EN, 8C, Socket B2) processors interconnected by QPI, memory (DDR3-1600) attached to the processors, DMI2 to the C600 PCH.
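Remark: the peak rates quoted for the two chipset interconnects in the footnotes follow from width x clock x transfers per clock. A minimal cross-check sketch; the ESI lane rate is taken from the 0.25 GB/s per PCIe lane figure above.

    # Peak transfer rates of the two chipset interconnects mentioned above.

    # HI 1.5: 8-bit wide, 66 MHz clock (nominally 66.6 MHz), quad data rate (QDR).
    hi15_mb_s = (8 / 8) * 66.6 * 4        # bytes/transfer x MHz x transfers/clock
    print(f"HI 1.5 : ~{hi15_mb_s:.0f} MB/s peak")          # ~266 MB/s

    # ESI: 4 PCIe lanes at 0.25 GB/s per lane, per direction (like DMI).
    esi_gb_s = 4 * 0.25
    print(f"ESI    : {esi_gb_s:.1f} GB/s per direction")   # 1 GB/s each way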
3.4.3 Evolution of Intel's performance oriented multicore DP server platforms (1)

3.4.3 Evolution of Intel's performance oriented multicore DP server platforms (2)

Evolution from the Pentium 4 Prescott DP aimed DP platform (up to 2 cores) to the Core 2 aimed Bensley DP platform (up to 4 cores)
• 90 nm Pentium 4 Prescott DP aimed DP server platform (for up to 2 C): two Xeon DP 2.8 (Paxville DP) / 90 nm Pentium 4 Prescott DP (2x1C) processors on a single FSB, E7520 MCH (DDR-266/333, DDR2-400), HI 1.5 to the ICH5R/6300ESB IOH.
• Core 2 aimed Bensley DP server platform (for up to 4 C): two 65 nm Pentium 4 Prescott DP (2x1C)/Core 2 (2C/2x2C) processors — Xeon 5000 (Dempsey) 2x1C, Xeon 5100 (Woodcrest) 2C, Xeon 5300 (Clovertown) 2x2C, Xeon 5200 2C or Xeon 5400 (Harpertown) 2x2C — on dual FSBs, E5000 MCH with FB-DIMM channels w/DDR2-533, ESI to the 631xESB/632xESB IOH.
HI 1.5 (Hub Interface 1.5): 8 bit wide, 66 MHz clock, QDR, 266 MB/s peak transfer rate.
ESI: Enterprise System Interface — 4 PCIe lanes, 0.25 GB/s per lane (like the DMI interface, providing 1 GB/s transfer rate in each direction).

3.4.3 Evolution of Intel's performance oriented multicore DP server platforms (3)

Evolution from the Core 2 aimed Bensley DP platform (up to 4 cores) to the Nehalem-EP aimed Tylersburg-EP DP platform (up to 6 cores)
• 65 nm Core 2 aimed high performance Bensley DP server platform (for up to 4 C): two 65 nm Pentium 4 Prescott DP (2C)/Core 2 (2C/2x2C) processors on dual FSBs, 5000 MCH with FB-DIMM channels w/DDR2-533, ESI to the 631xESB/632xESB IOH.
• Nehalem-EP aimed Tylersburg-EP DP server platform with a single IOH (for up to 6 C): two Nehalem-EP (4C)/Westmere-EP (6C) processors with DDR3-1333, interconnected by QPI, connected by QPI to the 55xx IOH¹, ESI/CLink to the ICH9/ICH10, ME.
• Nehalem-EP aimed Tylersburg-EP DP server platform with dual IOHs (for up to 6 C): as above, but with two 55xx IOHs¹ interconnected by QPI.
ME: Management Engine.
¹ Chipset with PCIe 2.0.

3.4.3 Evolution of Intel's performance oriented multicore DP server platforms (4)

Basic system architecture of the Sandy Bridge-EN and -EP aimed Romley-EN and -EP DP server platforms
• Nehalem-EP aimed Tylersburg-EP DP server platform (for up to 6 cores): two Xeon 55xx (Gainestown)/Xeon 56xx (Gulftown) — Nehalem-EP (4C)/Westmere-EP (6C) — processors with DDR3-1333, interconnected by QPI, DMI to the 34xx PCH, ME.
• Sandy Bridge-EP (Socket R, LGA 2011) aimed Romley-EP DP server platform (for up to 8 cores): two E5-2600 (Sandy Bridge-EP, 8C, Socket R) processors with DDR3-1600, interconnected by two QPI 1.1 links, DMI2 to the C600 PCH.

3.4.3 Evolution of Intel's performance oriented multicore DP server platforms (5)

Contrasting the Nehalem-EP aimed Tylersburg-EP DP platform (up to 6 cores) with the Nehalem-EX aimed very high performance scalable Boxboro-EX DP platform (up to 10 cores)
• Nehalem-EP aimed Tylersburg-EP DP server platform (for up to 6 cores): two Xeon 5500 (Gainestown)/Xeon 5600 (Gulftown) — Nehalem-EP (4C)/Westmere-EP (6C) — processors with DDR3-1333, interconnected by QPI, ESI to the 34xx PCH, ME.
• Nehalem-EX aimed Boxboro-EX scalable DP server platform (for up to 10 cores): two Xeon 6500 (Nehalem-EX, Becton) or Xeon E7-2800 (Westmere-EX) — Nehalem-EX (8C)/Westmere-EX (10C) — processors, each driving four SMI links to SMBs with DDR3-1067 behind them, interconnected by QPI and connected by QPI to the 7500 IOH, ESI to the ICH10, ME.
SMI: serial link between the processor and the SMB.
SMB: Scalable Memory Buffer with parallel/serial conversion.
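Remark: seen through the lens of Section 3.2, each DP generation above raises the per-processor memory bandwidth roughly in step with the growing core count. The channel counts in this sketch are assumptions made here for illustration (4 FB-DIMM channels shared by the two sockets on Bensley, 3 DDR3 channels per socket on Tylersburg-EP, 4 on Romley-EP), not values stated on the slides.

    # Per-processor peak memory bandwidth across DP platform generations,
    # BW = n_M x w x f_M. Channel counts are illustrative assumptions.

    W = 8  # bytes per channel

    generations = [
        # (platform,              channels/processor, f_M MT/s, note)
        ("Bensley (2006)",        4 / 2, 533,  "assumed 4 FB-DIMM channels shared by 2 CPUs"),
        ("Tylersburg-EP (2009)",  3,     1333, "assumed 3 DDR3 channels per CPU"),
        ("Romley-EP (2011)",      4,     1600, "assumed 4 DDR3 channels per CPU"),
    ]

    for name, n_m, f_m, note in generations:
        print(f"{name:22s}: {n_m * W * f_m / 1000:5.1f} GB/s per processor ({note})")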
3.5. MP server platforms

• 3.5.1. Design space of the basic architecture of MP server platforms
• 3.5.2. Evolution of Intel's multicore MP server platforms
• 3.5.3. Evolution of AMD's multicore MP server platforms

3.5.1 Design space of the basic architecture of MP server platforms (1)

Figure: Design space of MP SMP platforms, spanned by the scheme of attaching the processors (single FSB, dual FSBs, quad FSBs) and the layout of the memory interconnection.
• Single FSB, parallel channels attach DIMMs: Pentium 4 MP 1C (2004)
• Dual FSBs, serial links attach S/P converters with parallel channels: 90 nm Pentium 4 MP 2x1C
• Quad FSBs, serial links attach FB-DIMMs: Core 2/Penryn up to 6C

3.5.1 Design space of the basic architecture of MP server platforms (2)

Figure: Design space of MP NUMA platforms, spanned by the scheme of interconnecting the processors (partially or fully connected mesh, determining the inter-processor bandwidth) and the layout of the memory interconnection (determining the memory bandwidth).
• Partially connected mesh, parallel channels attach DIMMs: AMD Direct Connect Architecture 1.0 (2003)
• Fully connected mesh, parallel channels attach DIMMs: AMD Direct Connect Architecture 2.0 (2010)
• Fully connected mesh, serial links attach S/P converters with parallel channels: Nehalem-EX/Westmere up to 10C (2010/11)

3.5.1 Design space of the basic architecture of MP server platforms (3)

Scheme of attaching and interconnecting MP processors — evolution of Intel's MP platforms (overview)
SMP platforms:
• Single FSB, parallel channels attach DIMMs: Pentium 4 MP 1C (2004) (not named)
• Dual FSBs, serial links attach S/P converters with parallel channels: 90 nm Pentium 4 MP 2x1C (2006) (Truland)
• Quad FSBs, serial links attach FB-DIMMs: Core 2/Penryn up to 6C (2006/2007) (Caneland)
NUMA platforms:
• Partially connected mesh: AMD DCA 1.0 (2003)
• Fully connected mesh: AMD DCA 2.0 (2010); Nehalem-EX/Westmere up to 10C (2010/11) (Boxboro-EX)
Along these options both the number of memory channels and the inter-processor bandwidth increase.
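Remark: the relation NM = nP x nM of Section 3.2 applies directly to these MP platforms. A small sketch; the per-platform channel counts are assumptions used here for illustration (4 XMB links behind the Truland NB, 4 FB-DIMM channels behind the Caneland MCH, 2 DDR channels per socket for AMD DCA 1.0, 4 SMI channels per socket for Boxboro-EX).

    # Total memory channels of an MP platform: N_M = n_P x n_M for NUMA designs,
    # while SMP designs share a fixed set of channels behind the MCH.
    # Channel counts below are illustrative assumptions for the named platforms.

    n_p = 4  # MP: four processors

    platforms = [
        # (name,                    per-processor?, channels)
        ("Truland (SMP, XMBs)",        False, 4),   # 4 XMB links at the NB, shared
        ("Caneland (SMP, FB-DIMM)",    False, 4),   # 4 FB-DIMM channels at the MCH, shared
        ("AMD DCA 1.0 (NUMA)",         True,  2),   # 2 DDR channels per processor
        ("Boxboro-EX (NUMA, SMI)",     True,  4),   # 4 SMI channels per processor
    ]

    for name, per_proc, n_m in platforms:
        total = n_p * n_m if per_proc else n_m
        print(f"{name:28s}: {total:2d} memory channels in total")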
3.5.2 Evolution of Intel's multicore MP server platforms (1)

3.5.2 Evolution of Intel's multicore MP server platforms (2)

Evolution from the first generation MP servers supporting single-core processors to the 90 nm Pentium 4 Prescott MP aimed Truland MP server platform (supporting up to 2 cores)
• Previous Pentium 4 MP aimed MP server platform (for single-core processors): four Xeon MP (SC) processors on a single FSB, preceding NBs (e.g. DDR-200/266), HI 1.5 to the preceding ICH.
• 90 nm Pentium 4 Prescott MP aimed Truland MP server platform (for up to 2 C): Xeon MP (Potomac) 1C, Xeon 7000 (Paxville MP) 2x1C or Xeon 7100 (Tulsa) 2C — four Pentium 4 Xeon MP 1C/2x1C processors on dual FSBs, 8500¹/8501 NB with four XMBs (DDR-266/333, DDR2-400), HI 1.5 (266 MB/s) to the ICH5.

3.5.2 Evolution of Intel's multicore MP server platforms (3)

Evolution from the 90 nm Pentium 4 Prescott MP aimed Truland MP platform (up to 2 cores) to the Core 2 aimed Caneland MP platform (up to 6 cores)
• 90 nm Pentium 4 Prescott MP aimed Truland MP server platform (for up to 2 C): Xeon MP (Potomac) 1C, Xeon 7000 (Paxville MP) 2x1C or Xeon 7100 (Tulsa) 2C on dual FSBs, 8500¹/8501 NB with four XMBs (DDR-266/333, DDR2-400), HI 1.5 to the ICH5.
• Core 2 aimed Caneland MP server platform (for up to 6 C): Xeon 7200 (Tigerton DC) 1x2C, Xeon 7300 (Tigerton QC) 2x2C or Xeon 7400 (Dunnington) 6C — four Core 2 (2C/2x2C)/Penryn (6C) processors on quad FSBs, 7300 MCH with 4 FB-DIMM channels (DDR2-533/667, up to 8 DIMMs per channel), ESI to the 631xESB/632xESB.
HI 1.5 (Hub Interface 1.5): 8 bit wide, 66 MHz clock, QDR, 266 MB/s peak transfer rate.
ESI: Enterprise System Interface — 4 PCIe lanes, 0.25 GB/s per lane (like the DMI interface, providing 1 GB/s transfer rate in each direction).
¹ The E8500 MCH supports an FSB of 667 MT/s and consequently only the SC Xeon MP (Potomac).

3.5.2 Evolution of Intel's multicore MP server platforms (4)

Evolution to the Nehalem-EX aimed Boxboro-EX MP platform (supporting up to 10 cores)
(In the basic system architecture we show the single IOH alternative.)
• Nehalem-EX aimed Boxboro-EX MP server platform (for up to 10 C): four Xeon 7500 (Nehalem-EX, Becton) 8C / Xeon E7-4800 (Westmere-EX) 10C processors, fully interconnected by QPI, each driving 4 SMI channels (2x4 SMI channels per processor pair as drawn) to SMBs with DDR3-1067 behind them; the 7500 IOH is attached by QPI and connects the ICH10 over ESI; ME.
SMI: serial link between the processor and the SMBs.
SMB: Scalable Memory Buffer (parallel/serial converter).
ME: Management Engine.

3.5.3 Evolution of AMD's multicore MP server platforms (1)

AMD Direct Connect Architecture 1.0 [47]
Introduced with the single-core K8-based Opteron DP/MP servers (AMD 24x/84x) (6/2003).
Memory: 2 channels of DDR-200/333 per processor, 4 DIMMs per channel.

3.5.3 Evolution of AMD's multicore MP server platforms (2)

AMD Direct Connect Architecture 2.0 [47]
Introduced with the 2x6 core K10-based Magny-Cours (AMD 6100) (3/2010).
Memory: 2x2 channels of DDR3-1333 per processor, 3 DIMMs per channel.

5. References

5.
References (1) [1]: Wikipedia: Centrino, http://en.wikipedia.org/wiki/Centrino [2]: Industry Uniting Around Intel Server Architecture; Platform Initiatives Complement Strong Intel IA-32 and IA-64 Targeted Processor Roadmap for 1999, Business Wire, Febr. 24 1999, http://www.thefreelibrary.com/Industry+Uniting+Around+Intel+Server +Architecture%3B+Platform...-a053949226 [3]: Intel Core 2 Duo Processor, http://www.intel.com/pressroom/kits/core2duo/ [4]: Keutzer K., Malik S., Newton R., Rabaey J., Sangiovanni-Vincentelli A., System Level Design: Orthogonalization of Concerns and Platform-Based Design, IEEE Transactions on Computer-Aided Design of Circuits and Systems, Vol. 19, No. 12, Dec. 2000, pp. 1-29. [5]: Krazit T., Intel Sheds Light on 2005 Desktop Strategy, IDG News Service, Dec. 07 2004, http://pcworld.about.net/news/Dec072004id118866.htm [6]: Perich D., Intel Volume platforms Technology Leadership, Presentation at HP World 2004, http://98.190.245.141:8080/Proceed/HPW04CD/papers/4194.pdf [7] Powerful New Intel Server Platforms Feature Array Of Enterprise-Class Innovations. Intel’s Press release, Aug. 2, 2004 , http://www.intel.com/pressroom/archive/releases/2004/20040802comp.htm [8]: Smith S., Multi-Core Briefing, IDF Spring 2005, San Francisco, Press presentation, March 1 2005, http://www.silentpcreview.com/article224-page2 [9]: An Introduction to the Intel QuickPath Interconnect, Jan. 2009, http://www.intel.com/ content/dam/doc/white-paper/quick-path-interconnect-introduction-paper.pdf [10]: Davis L. PCI Express Bus, http://www.interfacebus.com/PCI-Express-Bus-PCIe-Description.html 5. References (2) [11]: Ng P. K., “High End Desktop Platform Design Overview for the Next Generation Intel Microarchitecture (Nehalem) Processor,” IDF Taipei, TDPS001, 2008, http://intel.wingateweb.com/taiwan08/published/sessions/TDPS001/FA08%20IDFTaipei_TDPS001_100.pdf [12]: Computing DRAM, Samsung.com, http://www.samsung.com/global/business/semiconductor /products/dram/Products_ComputingDRAM.html [13]: Samsung’s Green DDR3 – Solution 3, 20nm class 1.35V, Sept. 2011, http://www.samsung.com/global/business/semiconductor/Greenmemory/Downloads/ Documents/downloads/green_ddr3_2011.pdf [14]: DDR SDRAM Registered DIMM Design Specification, JEDEC Standard No. 21-C, Page 4.20.4-1, Jan. 2002, http://www.jedec.org [15]: Datasheet, http://download.micron.com/pdf/datasheets/modules/sdram/ SD9C16_32x72.pdf [16]: Solanki V., „Design Guide Lines for Registered DDR DIMM Module,” Application Note AN37, Pericom, Nov. 2001, http://www.pericom.com/pdf/applications/AN037.pdf [17]: Fisher S., “Technical Overview of the 45 nm Next Generation Intel Core Microarchitecture (Penryn),” IDF 2007, ITPS001, http://isdlibrary.intel-dispatch.com/isd/89/45nm.pdf [18]: Razin A., Core, Nehalem, Gesher. Intel: New Architecture Every Two Years, Xbit Laboratories, 04/28/2006, http://www.xbitlabs.com/news/cpu/display/20060428162855.html [19]: Haas, J. & Vogt P., Fully buffered DIMM Technology Moves Enterprise Platforms to the Next Level,” Technology Intel Magazine, March 2005, pp. 1-7 5. References (3) [20]: „Introducing FB-DIMM Memory: Birth of Serial RAM?,” PCStats, Dec. 23, 2005, http://www.pcstats.com/articleview.cfm?articleid=1812&page=1 [21]: McTague M. & David H., „ Fully Buffered DIMM (FB-DIMM) Design Considerations,” Febr. 
18, 2004, Intel Developer Forum, http://www.idt.com/content/OSA-S009.pdf [22]: Ganesh B., Jaleel A., Wang D., Jacob B., Fully-Buffered DIMM Memory Architectures: Understanding Mechanisms, Overheads and Scaling, 2007, [23]: DRAM Pricing – A White Paper, Tachyon Semiconductors, http://www.tachyonsemi.com/about/papers/DRAM%Pricing.pdf [24]: Detecting Memory Bandwidth Saturation in Threaded Applications, Intel, March 2 2010, http://software.intel.com/en-us/articles/detecting-memory-bandwidth-saturation-inthreaded-applications/ [25]: McCalpin J. D., STREAM Memory Bandwidth, July 21 2011, http://www.cs.virginia.edu/stream/by_date/Bandwidth.html [26]: Rogers B., Krishna A., Bell G., Vu K., Jiang X., Solihin Y., Scaling the Bandwidth Wall: Challenges in and Avenues for CMP Scaling, ISCA 2009, Vol. 37, Issue 1, pp. 371-382 [27]: Vogt P., Fully Buffered DIMM (FB-DIMM) Server Memory Architecture: Capacity, Performance, Reliability, and Longevity, Febr. 18 2004, http://www.idt.com/content/OSA_S008_FB-DIMM-Arch.pdf [28]: Wikipedia: Intel X58, 2011, http://en.wikipedia.org/wiki/Intel_X58 5. References (4) [29]: Sharma D. D., Intel 5520 Chipset: An I/O Hub Chipset for Server, Workstation, and High End Desktop, Hotchips 2009, http://www.hotchips.org/archives/hc21/2_mon/ HC21.24.200.I-O-Epub/HC21.24.230.DasSharma-Intel-5520-Chipset.pdf [30]: DDR2 SDRAM FBDIMM, Micron Technology, 2005, http://download.micron.com/pdf/datasheets/modules/ddr2/HTF18C64_128_256x72F.pdf [31]: Wikipedia: Fully Buffered DIMM, 2011, http://en.wikipedia.org/wiki/Fully_Buffered_DIMM [32]: Intel E8500 Chipset eXternal Memory Bridge (XMB) Datasheet, March 2005, http://www.intel.com/content/dam/doc/datasheet/e8500-chipset-external-memorybridge-datasheet.pdf [33]: Intel 7500/7510/7512 Scalable Memory Buffer Datasheet, April 2011, http://www.intel.com/content/dam/doc/datasheet/7500-7510-7512-scalable-memorybuffer-datasheet.pdf [34]: AMD Unveils Forward-Looking Technology Innovation To Extend Memory Footprint for Server Computing, July 25 2007, http://www.amd.com/us/press-releases/Pages/Press_Release_118446.aspx [35]: Chiappetta M., More AMD G3MX Details Emerge, Aug. 22 2007, Hot Hardware, http://hothardware.com/News/More-AMD-G3MX-Details-Emerge/ [36]: Goto S. H., The following server platforms AMD, May 20 2008, PC Watch, http://pc.watch.impress.co.jp/docs/2008/0520/kaigai440.htm [37]: Wikipedia: Socket G3 Memory Extender, 2011, http://en.wikipedia.org/wiki/Socket_G3_Memory_Extender 5. References (5) [38]: Interfacing to DDR SDRAM with CoolRunner-II CPLDs, Application Note XAPP384, Febr. 2003, XILINC inc. [39]: Ahn J.-H., „Memory Design Overview,” March 2007, Hynix, http://netro.ajou.ac.kr/~jungyol/memory2.pdf [40]: Ebeling C., Koontz T., Krueger R., „System Clock Management Simplified with Virtex-II Pro FPGAs”, WP190, Febr. 25 2003, Xilinx, http://www.xilinx.com/support/documentation/white_papers/wp190.pdf [41]: Kirstein B., „Practical timing analysis for 100-MHz digital design,”, EDN, Aug. 8, 2002, www.edn.com [42]: Jacob B., Ng S. W., Wang D. T., Memory Systems, Elsevier, 2008 [43]: Allan G., „The outlook for DRAMs in consumer electronics”, EETIMES Europe Online, 01/12/2007, http://eetimes.eu/showArticle.jhtml?articleID=196901366&queryText =calibrated [44]: Ebeling C., Koontz T., Krueger R., „System Clock Management Simplified with Virtex-II Pro FPGAs”, WP190, Febr. 
25 2003, Xilinx, http://www.xilinx.com/support/documentation/white_papers/wp190.pdf [45]: Memory technology evolution: an overview of system memory technologies, Technology brief, 9th edition, HP, Dec. 2010, http://h20000.www2.hp.com/bc/docs/support/SupportManual/c00256987/c00256987.pdf 5. References (6) [46]: Kane L., Nguyen H., Take the Lead with Jasper Forest, the Future Intel Xeon Processor for Embedded and Storage, IDF 2009, July 27 2009, ftp://download.intel.com/embedded/processor/prez/SF09_EMBS001_100.pdf [47]: The AMD Opteron™ 6000 Series Platform: More Cores, More Memory, Better Value, March 29 2010, http://www.slideshare.net/AMDUnprocessed/amd-opteron-6000-series -platform-press-presentation-final-3564470 [48]: Memory Module Picture 2007, Simmtester, Febr. 21 2007, http://www.simmtester.com/page/news/showpubnews.asp?num=150