Intel® Itanium™ Architecture
-- Satya P. Vedula

Agenda
1. History
2. Introduction
3. Block Diagram
4. Pipeline
5. Register Set
6. Instruction Set
7. EPIC
8. x86 Compatibility
9. Database on Itanium
10. Security & Itanium
11. Itanium and Java
12. Itanium and Win64

History

Generation  Processor     Year     Transistors  FPU            Cache
1           8086/8088     1978-81  29k          8087           -
2           80286         1984     134k         80287          -
3           80386 DX/SX   1987-88  275k         80387          -
4           80486 SX/DX   1990-92  1.2M         None/built-in  8k L1
5           Pentium       1993-95  3.1M         built-in       16k L1
5+          Pentium MMX   1997     4.5M         built-in       32k L1

History (contd.)

Processor       Year
Pentium Pro     1995
Pentium II      1997
Mobile Pentium  1997
Pentium III     1999
Pentium 4       2001
Itanium         2001

Generations: 6, 6+, 7, 8
Transistors: 5.5M - 7.5M, 27.4M, 42M, 25M, 9.3M
Cache: 16k L1, 512k L2; 32k L1, 96k L2, 4M L3; 32k L1

Introduction - Itanium

The Intel® Itanium™ processor is the first in a family of processors based on the new Itanium architecture.

Product Highlights
- Explicitly Parallel Instruction Computing (EPIC) technology enables up to 20 operations/clock.
- Three levels of cache reduce memory latency: 2MB or 4MB Level 3 cache, 96K Level 2 cache, and 32K Level 1 cache.
- Operating frequencies of 733MHz and 800MHz.
- 266MHz data bus enables fast system bus transactions with 2.1 GB/sec bandwidth.
- Advanced error detection, correction and containment provided by Machine Check Architecture (MCA), comprehensive error logging, and Error Correcting Code (ECC) on caches and the system bus.
- IA-32 instruction binary compatibility in hardware.
- 6.4 gigaflops at peak performance.
Block Diagram

[Figure: Simple block diagram]
[Figure: Complex block diagram]

Pipeline

Comparison with others
- Itanium: 10 stages
- Pentium III: 12 stages
- Alpha 21264: 8 stages
- Pentium 4: 20 stages
- Athlon: 10 stages

[Figure: 10-stage in-order pipeline]

Register Set

Each task can have its own set of registers.
- General-purpose integer registers (each 64 bits wide): 128
- Floating-point registers (each 82 bits wide): 128
- 1-bit predicate registers: 64
- Branch registers: 8

Instruction Set

- Instructions are 41 bits long.
- It takes 7 bits to specify one of 128 GPRs, so two source-operand fields and a destination field = 21 bits.
- Predication = 6 bits (64 combinations).
- Bundles = 128 bits (instructions are given in bundles): three 41-bit instructions (making 123 bits), plus one 5-bit template.
- Instruction categories = 4: integer, load/store, floating-point, and branch operations.

EPIC

EPIC: Explicitly Parallel Instruction Computing
It is a combination of features from RISC and VLIW.

Advantages
- Conditional (predicated) execution
- Hinted and speculative loads (LD.A, Load Advanced, uses a special buffer, the ALAT)
- 64 free-form predicate bits (earlier chips have Z (zero), V (overflow), S (sign), and N (negative) flags)
- One conditional branch with 64 predicate bits
- VLIW features
  - Groups of independent instructions
  - Simple hardware
  - Exploits Instruction Level Parallelism (ILP) with the compiler

Disadvantages
- Large increase in code size
- Blocking caches

EPIC - Power to Compilers

C source code: if (x == 4) z = 9; else z = 0;

Compiled on Pentium (32-bit compiled code)
1. Compare x to 4
2. If not equal go to line 5
3. z = 9
4. go to line 6
5. z = 0
6. // Program continues from here

Compiled on Itanium (64-bit compiled code)
1. Compare x to 4 and store result in a predicate bit (we'll call it A)
2. If A==1; z = 9
3. If A==0; z = 0

EPIC Features

Data speculation
A sequence of instructions consisting of an advanced load, zero or more instructions dependent on the value of that load, and a check instruction.

Control (code) speculation
A compiler concept. An instruction or a sequence of instructions is executed before it is known that the dynamic control flow of the program will actually reach the point in the program where the sequence of instructions is needed.

Prediction
Branch prediction is now exposed to programmers, who can supply hints for dynamic run-time branch prediction.

Preprocessing
1) Register use, 2) loop optimization, 3) instruction execution order, and 4) logical program layout.

EPIC Features (contd.)

Compiler advantages
- Complexity shifts to compilers
- Methods to express compile-time information
- Optimized FPUs for multimedia applications
- Reliability and performance on the server side

x86 compatibility

- Supports all x86 instructions including MMX and SSE (not SSE2), plus Protected, Virtual 8086, and Real mode features.
- Run an entire OS in x86 mode, or run x86 applications under a new IA-64 OS.
- x86-compatible registers: AR24 through AR31
- JMPE: instruction that switches between x86 mode and the new mode

[Figure: x86 - Register compatibility]

What does it look like?
Transistors: 325 million (25 million + 4 x 75 million)
- Processor chip: 25 million (including L1 and L2 caches)
- Each of the four L3 cache chips: 75 million
- For comparison, Pentium III: 24 million; Pentium 4: 42 million

Itanium code size: 2x Pentium (estimated), 30% more than other RISC

Itanium - anatomy

[Figure: Itanium anatomy]

Other 64 bit processors

[Figures: IBM Power4 module; MIPS 20K processor; photograph of Alpha 21264 Slot B module; UltraSPARC-III chips]

Overview of the processors

[Figure: overview of the processors]

It's just beginning

Itanium code names: Deerfield, Madison, McKinley, Merced

Databases

A quantum leap

Databases - Storage needs (contd.)

[Figure: The Coming Content "Big Bang", a timeline of information milestones plotted against stored content in gigabytes]
- Milestones: 40,000 BCE cave paintings, bone tools; 3500 BCE writing; 105 C.E. paper; 1450 printing; 1870 electricity, telephone; 1947 transistor; 1950 computing; late 1960s Internet (DARPA); 1993 the web
- Content in gigabytes: 2000: 3B; 2001: 6B; 2002: 12B; 2003: 24B

Source: IBM Informix Conference, 2001 Las Vegas

Databases - Storage - Requirements

Data Explosion!
• We are in the midst of a data explosion, "The Big Bang"!
• Terabytes of data
  - A common corporate expression
  - Petabytes (10^15) and exabytes (10^18) are fast approaching
• 2-3 exabytes = total volume of all information generated worldwide annually
• Storage capacities are growing
  - 72 GB hard drive (HD) becoming the industry standard
  - 180 GB high-density HD in production

Source: IBM Informix Conference, 2001 Las Vegas

Databases (contd.)
The Need for Speed

• Memory access speeds are desired (long term)
  - Memory latency averaging 235-360 nanoseconds
  - Max = 256 GB of RAM
  - 64 bit => 20 exabytes of addressing capability (2^64 bytes is about 18.4 x 10^18)
• Disk access speeds are the reality (near term)
  - Disk latency averaging 3-4 milliseconds
  - 4 "orders of magnitude slower"
• Data warehouse (DW) tables contain billions of rows
• Light table scan
  - 100 byte row @ 1 GB/s
  - ~ 9 million rows/sec
  - ~ 540 million rows/minute
  - 5.4 billion rows (500GB) ~ 10 minutes

Source: IBM Informix Conference, 2001 Las Vegas

Databases - Itanium advantages

64-bit addressing
Tens of gigabytes to thousands of terabytes can be held in main memory with nanosecond access, which eliminates millisecond disk access times and improves application response time.

Large number of registers and innovative register model
Data and intermediate calculations stored in on-chip registers reduce the repetitive load and store of intermediate data values, improving the response time of an application's database request.

Instruction set parallelism
The ability to execute instructions in parallel allows quick, simultaneous access to and manipulation of data derived from multiple rows and columns of a large in-memory database table or tables.

Predication
Predication allows the conditional execution of instructions before it is known whether the execution is needed. Because predication allows more code to execute in parallel, the performance penalty of branch-dependent code is lower and applications with heavy branching speed up.

Databases - Itanium advantages (contd.)

Control/Data Speculation
Control speculation allows certain load instructions to be scheduled before conditional branch instructions, rather than after. Data speculation is similar to control speculation but allows loads to be scheduled above stores. Both reduce the CPU wait states generated by branch-intensive code with high-latency RAM accesses, speeding application performance.
Instruction/Data Prefetch
Instruction prefetches can be signaled on branch instructions. Data can be prefetched with explicit prefetch instructions. Both prefetches speed application performance by reducing wait states.

Advantages
Big databases such as:
- Data warehousing
- Decision support
- Web-enabled ERP

Security

- Common encryption algorithms run 3-5 times faster
- EPIC parallelism with register rotation makes algorithms even faster
- Performance boost to CAD/CAE applications due to the increased number of floating-point registers
- Performance boost to 3D applications
- 82-bit floating-point unit offers high precision
- RSA computations are 512 bits to 1024 bits in length
- The new multiply-add instruction helps
- Parallelism helps (two 128-bit computations are performed in parallel)
- Predication eliminates branches (if) from RSA computations
- RSA, AES, and SHA-1 algorithms improve, as they use only counted loops that can use register rotation
- Vast number of registers
- Large physical memory for a security cache: directory services can be stored in memory
- Network traffic can be encrypted

Security (contd.)

Performance statistics - Encryption algorithms

Algorithms (columns): RSA, ECC, AES, DES, RC6, SHA
Multi-precision arithmetic: X X X X
Multi-precision logical operation: X X X X X
Fixed data rotate: X X
Variable data rotate: X X X X
Integer multiplication: X X X X
Sbox lookup: X X X
Logical operation: X X

Java

Common Java Limitations (J2SE 1.3)
- Garbage collection
- Object-oriented programming (OOP)
- Byte code vs. native machine code
- Variability of performance because of interpretation
- Multithreaded applications
- Java Native Interface vs. Native Method Interface
- Network performance

Limitations with current architectures
- EJB involves frequent invocation of method calls
- Java needs dynamic bounds checking, null checking, and exception handling
- Java has a 64-bit integer data type, long
- Java object handles (ObjId) are 64-bit

Java (contd.)

Advantages using IBM Java2
- Streamlined garbage collection reduces pause time
- OOP: IBM Java uses thread-local heaps, allowing variable-sized thread-local heaps
- Just-In-Time compiler translates to optimized native code
- Mixed-mode interpreter does selective compilation
- Multi-threading now has lightweight and full-power modes
- JNI enhanced and NMI removed in Java 2
- Network performance: Java Socket API overhead removed

Java (contd.)

Advantages using Itanium
- Predication: the branching caused by Java's bounds checking benefits from predicated execution
- Speculation: multiway branching allows address locations and data needed for Java's bounds and null checks to be prefetched, increasing performance
- Instruction parallelism: multiple execution units run instructions concurrently, increasing performance
- Register set: smaller methods need not contend for registers, as more registers are available

Win64

Win64 data types

Type Name                       What it is
LONG32, INT32                   32-bit signed
LONG64, INT64                   64-bit signed
ULONG32, UINT32, DWORD32        32-bit unsigned
ULONG64, UINT64, DWORD64        64-bit unsigned
INT_PTR, LONG_PTR               Signed int, pointer precision
UINT_PTR, ULONG_PTR, DWORD_PTR  Unsigned int, pointer precision
SIZE_T                          Unsigned count, pointer precision
SSIZE_T                         Signed count, pointer precision
Win64 Issues

- LLP64 issues
- Porting issues (32-bit to 64-bit)
  - Polymorphic data usage
  - Pointer/length combinations
- RPC and COM
  - Supports RPC between IA-32 and IA-64
  - Supports LocalServer-style (out-of-proc) COM between IA-32 and IA-64 processes
  - An IA-32 DLL cannot be loaded into a 64-bit process
  - An IA-64 DLL cannot be loaded into a 32-bit process
  - Use COM as out-of-proc (solves the previous two problems)
- PnP should be RPC-enabled

Questions?