Intel 8086/8088 Microprocessors
- The Intel 8086 and 8088 microprocessors are the basis of all IBM-PC compatible computers (8086 introduced in 1978, first IBM PC released in 1981)
- All Intel, AMD and other advanced x86 microprocessors are based on, and compatible with, the original 8086/8
- At power-up and reset, Pentiums, Athlons etc. all look like 8086 processors
06/03/2005 ET4508_p2 (KR)

Intel 8086/8088 Microprocessors
- The Intel 8086 is a 16b microprocessor: 16b data registers, 16b ALU
- Width of external data bus:
  - 8086: 16b
  - 8088: 8b
- Width of external address bus: 16b+4b=20b
- Some techniques to optimise CPU performance while it is executing programs
- Segment:Offset memory model
- Little-endian data format

8086/8088 (1)
- The original IBM PC used the 8088 microprocessor
- The 8088 is similar to the 8086, but it has an external 8b data bus and only a 4B-deep queue
  - For cost reduction reasons
- We can consider the 8086 and 8088 together
- PC clones often used the 8086 for better performance
- The 8-bit bus reduces performance, but meant cheaper computers

8086/8088 (2)
- Remember the Fetch-Decode-Execute cycle?
- Fetching from EXTERNAL MEMORY is SLOW
- The 8086/8 uses an instruction queue to speed up performance
- While the processor is decoding and executing an instruction, its bus interface can be reading new instructions, since the bus is otherwise not in use at that time

8086/8088 Functional Units
[Figure: the 8086/8088 MPU contains a Bus Interface Unit (BIU), which fetches opcodes, reads operands and writes data, and an Execution Unit (EU).]

8086/8088 (3)
- The 8086/8088 consists of two internal units
  - The execution unit (EU) executes the instructions
  - The bus interface unit (BIU) fetches instructions, reads operands and writes results
- The 8086 has a 6B prefetch queue
- The 8088 has a 4B prefetch queue

8086/8088 Internal Organisation
[Figure: block diagram. EU: registers AH/AL, BH/BL, CH/CL, DH/DL, SP, BP, DI, temporary registers, ALU, flags and EU control. BIU: segment registers CS, DS, SS, ES, internal communications registers, the address summation feeding the 20-bit address bus, the data bus, bus control (8088 bus) and the instruction queue (bytes 1 to 4).]

BIU Elements
- Instruction Queue: the next instructions or data can be fetched from memory while the processor is executing the current instruction
  - The memory interface is slower than the processor execution time, so this speeds up overall performance
- Segment Registers:
  - CS, DS, SS and ES are 16b registers
  - Used with the 16b base registers (which supply the offset) to generate the 20b address
  - Allow the 8086/8088 to address 1MB of memory
  - Changed under program control to point to different segments as a program executes
- Instruction Pointer (IP): contains the offset address of the next instruction, i.e. the distance in bytes from the address given by the current CS register

8086/8088 20-bit Addresses
[Figure: the 16-bit segment base address in CS is extended with four zero bits (0000) and added to the 16-bit offset address in IP to give the 20-bit physical address.]

Exercise: 20-bit Addressing
1. CS contains 0A820h, IP contains 0CE24h. What is the resulting physical address?
2. CS contains 0B500h, IP contains 0024h. What is the resulting physical address?
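The segment:offset calculation can be checked with a few lines of Python. This is a minimal sketch and not part of the original slides: the function name `physical_address` is an assumption chosen for illustration. It shifts the 16-bit segment value left by four bits (the appended 0000), adds the 16-bit offset, and keeps only the low 20 bits, since the 8086/8 has 20 address lines.

```python
def physical_address(segment: int, offset: int) -> int:
    """Return the 20-bit physical address for a segment:offset pair.

    Hypothetical helper for illustration; it models the 8086/8 address adder.
    """
    segment &= 0xFFFF  # segment registers are 16 bits wide
    offset &= 0xFFFF   # offsets such as IP are 16 bits wide
    # Appending four zero bits to the segment is a left shift by 4 (x16).
    # Only 20 address lines exist, so any carry out of bit 19 is dropped.
    return ((segment << 4) + offset) & 0xFFFFF


if __name__ == "__main__":
    # The two register pairs from the exercise.
    for cs, ip in [(0xA820, 0xCE24), (0xB500, 0x0024)]:
        print(f"{cs:04X}h:{ip:04X}h -> {physical_address(cs, ip):05X}h")
```

Both exercise pairs evaluate to the same physical address, 0B5024h, which illustrates that different segment:offset combinations can refer to one memory location.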
8086/8 In Circuit (1)
- 8086/8 microprocessors need support circuits in a microcomputer system
- The 8086/8 multiplex the address and data buses on the same pins
- This saves pins, but at a price: demultiplexing logic is needed to build up separate address and data buses to interface with RAMs and ROMs

[Figure: 8086 40-pin pinout (GND on pins 1 and 20, Vcc on pin 40). Pins common to both modes include AD0-AD15, A16/S3, A17/S4, A18/S5, A19/S6, /BHE/S7, MN,/MX, /RD, NMI, INTR, /TEST, CLK, READY and RESET. Pins whose function depends on the mode:]

Maximum mode    Minimum mode
/RQ,/GT0        HOLD
/RQ,/GT1        HLDA
/LOCK           /WR
/S2             IO/M
/S1             DT/R
/S0             /DEN
QS0             ALE
QS1             /INTA

[Figure: 8086 and 8088 pinouts side by side. On the 8088 only AD0-AD7 are multiplexed, A8-A15 are address-only pins, and the /BHE,S7 pin is replaced by /SS0 in minimum mode (high in maximum mode).]

8086/8 In Circuit (2)
- In maximum mode the 8086/8 needs at least the following: 8288 Bus Controller, 8284A Clock Generator, 74HC373s and 74HC245s
- With the aid of these devices, the 8086 begins to look like the ideal microprocessor we looked at earlier

i8086 Circuit - Maximum Mode
[Figure: the 8284A clock generator supplies CLK, READY and RESET to the 8086 CPU (MN/MX# selects the mode). The 8288 bus controller takes S0#, S1#, S2# and CLK and produces MRDC#, MWTC#, AMWC#, IORC#, IOWC#, AIOWC# and INTA#, plus DEN, DT/R# and ALE. Three 74LS373 latches (LE, OE#) demultiplex AD15:AD0, A19:A16 and BHE# into A19:A0 and BHE#. Two 74LS245 transceivers (DIR, EN#) buffer the data bus D15:D0.]

8086/8 Maximum Mode
- In maximum mode, the 8288 uses a set of status signals (S0, S1, S2) to rebuild the normal bus control signals of the microprocessor
  - MRDC#, MWTC#, IORC#, IOWC# etc.
  - Equivalent to MEMR# etc.
- Look at some special signals briefly

RESET# Signal
- The reset input puts the 8086/8 into a defined state (the RESET pin on the CPU itself is active high; it is driven by the 8284A, whose RES# input is active low)
- Clears the flags register, IP, DS, SS, ES etc.
- Sets the effective program address to 0FFFF0h (CS=0FFFFh, IP=0000h on the 8086/8; later x86 processors use CS=0F000h, IP=0FFF0h)
- 8086/8 programs always start at 0FFFF0h after reset has been asserted and removed
- Continues into latest-generation CPUs

BHE# Signal (8086 Only)
- The 8086 processor can address memory a byte at a time
- Its data bus is 16b wide
- It uses the BHE# signal and A0 (sometimes called BLE#) to address bytes using its 16b bus

Use of BHE#/A0 (BLE#)
[Figure: memory organisation. 8088: a single byte-wide bank covering 00000 to FFFFF. 8086: an odd-address bank (00001, 00003, ... FFFFF) on D15:D8, enabled by BHE#, and an even-address bank (00000, 00002, ... FFFFE) on D7:D0, enabled by A0/BLE#. Both banks are addressed by A19..A1.]

Use of BHE#/BLE#

BHE#   A0/BLE#   Selection
0      0         Whole word (16 bits)
0      1         High byte to/from odd address
1      0         Low byte to/from even address
1      1         No selection

ALE and Address/Data Bus Multiplexing
- The 8086/8 multiplexes the address and data signals onto the same set of pins
- Off-chip logic is needed to separate the signals
- Transparent latches are designed just for address demultiplexing

ALE and 74HC373 Transparent Latch
[Figure: timing diagram showing Clock, the Address/Data bus (address time followed by data time), ALE, and the output of the 74HC373 (the microcomputer address bus). Circuit: the Address/Data bus drives In0:In7 of a 74HC373 or equivalent, ALE drives LE, and Q0:Q7 form the system address bus. The tri-state control signal OE# is shown connected to GND for simplicity.]

Use of ALE (Address Latch Enable)
- ALE is used with an external latch (74HC373) to demultiplex the address and data lines
- The 74HC373 is transparent when its LE input (connected to ALE) is high
- When ALE goes low, the '373 holds the last value until ALE goes high again
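The latch behaviour and the BHE#/A0 selection table above can be sketched as a small behavioural model. This is an illustrative sketch under stated assumptions, not a description of real hardware timing: the names `TransparentLatch`, `byte_lanes` and `demo_bus_cycle` are hypothetical, the bus cycle is reduced to one (ALE, AD value) pair per T-state, and OE# is assumed tied low as in the figure.

```python
class TransparentLatch:
    """Behavioural model of a 74HC373-style transparent latch (OE# tied low)."""

    def __init__(self) -> None:
        self.q = 0  # value currently present on the Q outputs

    def update(self, le: int, d: int) -> int:
        # While LE (wired to ALE) is high, the latch is transparent: Q follows D.
        # While LE is low, the latch holds the last value it saw.
        if le:
            self.q = d
        return self.q


def byte_lanes(bhe_n: int, a0: int) -> str:
    """Decode BHE# and A0/BLE# (both active low) following the selection table."""
    table = {
        (0, 0): "whole word (16 bits)",
        (0, 1): "high byte to/from odd address",
        (1, 0): "low byte to/from even address",
        (1, 1): "no selection",
    }
    return table[(bhe_n, a0)]


def demo_bus_cycle(address: int, data: int) -> list:
    """Return the latched address seen in T1..T4 of one simplified bus cycle."""
    latch = TransparentLatch()
    # Assumed simplification: ALE is high only in T1, when the AD pins carry
    # the address; in T2..T4 the same pins carry data and ALE stays low.
    states = [(1, address), (0, data), (0, data), (0, data)]
    return [latch.update(ale, ad) for ale, ad in states]


if __name__ == "__main__":
    print([hex(v) for v in demo_bus_cycle(0x1234, 0xABCD)])
    print(byte_lanes(0, 1))
```

The latched output stays at the address for the whole cycle even though the multiplexed pins switch to data after T1, which is the point of using ALE with a transparent latch.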
8288 Bus Controller and Bus Transceivers
- The 8288 Bus Controller also generates direction and enable signals for bidirectional transceivers
- Supports buffering the system data bus
[Figure: the 8288 bus controller's DEN# and DT/R# outputs drive the EN# and DIR inputs of two 74HC245 transceivers. One connects CPU [D15:D8] to Buffered [D15:D8], the other CPU [D7:D0] to Buffered [D7:D0]. The buffered bus goes to the memory and I/O systems.]

8086 Read Cycle
[Figure: timing diagram over T1, T2, T3, T4. Signals: CLK; /S0, /S1, /S2 (001 or 101); A16..A19, /BHE (address, then status S3..S6); ALE; AD0..AD15 (address, float, valid data, float); A0..A19 (valid address); DT/R; DEN; /MRDC or /IORC.]

8086 Write Cycle
[Figure: timing diagram over T1, T2, T3, T4. Signals: CLK; /S0, /S1, /S2 (010 or 110); A16..A19, /BHE (address, then status S3..S6); ALE; AD0..AD15 (address, then valid data); A0..A19 (valid address); DT/R; DEN; /MWTC or /IOWC.]

8086 Read Cycle (1 Wait State)
[Figure: timing diagram over T1, T2, T3, Tw, T4. Same signals as the read cycle, plus 8284 RDY and READY, which insert the wait state Tw.]

8086/8088 Summary
- First generation (introduced June 1978)
- One of the first 16b processors on the market
- 16b internal registers
- 16/8b external data bus
- 20b address bus (1MB addressable)
- Used in 1st-generation IBM PCs (1981)

80186/80188
- Evolution of the 8086/8088
- Increased instruction set
- On-chip system components (clock generator, DMA, interrupt controller, timers...)
- Unsuccessful in PCs
- Popular in embedded systems...

2nd Generation Processor 286
- P2 (286) = 2nd generation processor
- Introduced in 1982
- CPU behind the IBM AT
- Throughput of the original IBM AT (6MHz) was about 500% of the IBM PC (4.77MHz)
- Level of integration: 134k transistors (vs 29k in the 8086)
- Still a 16b processor...
- Available in higher clock frequencies: up to 25MHz

2nd Generation Processors 286
- Fully backwards compatible with the 8086
  - The 80286 runs 8086 software without modification
- Improved instruction execution
  - Average instruction takes 4.5 cycles vs. 12 cycles (8086)
- Improved instruction set
- Real mode and protected mode
  - Multitasking support: what happens in one area of memory doesn't affect other programs
  - Protected mode supported by Windows 3.0
- 16MB addressable physical memory
- On-chip MMU (1GB virtual memory)
- Non-multiplexed address bus and data bus

Improving Computer Performance
- We've seen how 16b computer technology based on the 8086 and 80286 processors developed
- These computers are not powerful enough for today's applications
- How do you improve the performance of your computer?
- Let's start with the CPU

CPU Performance (1)
- MOST OBVIOUS: processor clock frequency
  - Increased frequency means increased execution rate
  - State of the art: >4GHz (03/2005)
- Memory and I/O access times can be a performance bottleneck, unless you take some special measures

CPU Performance (2)
- ALU register width
  - A processor is an N-bit processor, where N represents the precision of the ALU; N can be 4, 8, 16, 32 or 64
  - The wider the registers, the more processing per clock
- Data bus width
  - The wider the data bus, the faster we can transfer data
  - Since the memory and I/O device access times are finite, the more bits transferred per cycle the better

CPU Performance (3)
- Address bus width
  - Increased address width doesn't provide a 'speed' increase as such
  - The CPU can directly address more memory
  - PCs use big programs, which would not fit in a smaller address space
  - Overcoming a small address space takes time
  - Impacts on overall system performance

3rd Generation Processor 386
- P3 (386) = 3rd generation processor
- Introduced: 10/1985
- Full 32b processor (32b registers, 32b internal and external data bus, 32b address bus)
- 275k transistors. CMOS. 132-pin PGA package. (Supply current Icc=400mA, roughly the same as the 8086!)
- Clock speeds: 16-33MHz
- P3 processors were far ahead of their time: it took 10 years before 32b operating systems became mainstream!
- First 386 PCs early 1987 (COMPAQ)

3rd Generation Processor 386
- Modes of operation: real, protected, virtual real
- Protected mode of the 386 is fully compatible with the 286
- Protected mode = native mode of operation. The chips are designed for advanced operating systems such as Windows NT
- New virtual real mode
  - The processor can run with hardware memory protection while simulating the 8086's real-mode operation
  - Multiple copies of e.g. DOS can run simultaneously, each in a protected area of memory
  - If a program in one memory area crashes, the rest of the system is protected

Intel 32-bit Architecture: IA-32
[Figure: 80386 block diagram. Bus Unit (BU) with prefetch queue, connected to the address and data buses; Addressing Unit (AU); Instruction Unit (IU); Execution Unit (EU) with ALU, Control Unit (CU) and registers.]
The 80386 includes a Bus Interface Unit, with a prefetch queue, for reading and providing data and instructions; an IU for controlling the EU with its registers; and an AU for generating memory and I/O addresses.

80386 Features
- 32b general and offset registers
- 16B prefetch queue
- Memory management unit with segmentation unit and paging unit
- 32b address and data bus
- 4GB physical address space
- 64TB virtual address space
- i387 numerical coprocessor
- Implementation of real, protected and virtual 8086 modes

80386 Operating Modes
- Protected Mode
  - For multitasking support
- Real Mode (native 8086 mode)
  - The processor powers up in Real Mode
- System Management Mode
  - Power management or system security
  - The processor switches to a separate address space, while saving the entire context of the currently running program or task

80386 Register Set
[Figure: register layout.
 Instruction pointer: EIP (bits 31..0), whose low 16 bits are IP.
 EFLAG register: EFLAG (bits 31..0), whose low 16 bits are FLAG.
 General-purpose registers (bits 31..0, with 16-bit and 8-bit parts): EAX (AH, AL), EBX (BH, BL), ECX (CH, CL), EDX (DH, DL), ESI (SI), EDI (DI), EBP (BP), ESP (SP).
 Segment registers (bits 15..0): CS, SS, DS, ES, FS, GS.]

80386 Prefetch Queue
[Figure: the execution unit fetches from the 16-byte deep on-chip instruction queue, which is fast; the bus interface unit fills the queue over the 32-bit data bus from off-chip memory, which is slow.]

80386 Prefetch Queue
- The 80386 prefetch queue is 16B deep
1. The instruction fetch can read from the prefetch queue faster than from memory
2. The prefetcher can do some work while the execution unit is doing other tasks in parallel

Coprocessor: i387
- The hardware implementation of floating-point processing in the i387 means floating-point operations run at much higher speed
- Without an i387, the i386 can still execute all mathematical expressions using software emulation of the i387

80386: Classic CISC Processor
- CISC = Complex Instruction Set Computer
- Complex instructions
  - ...but code-size efficient
- Micro-encoding of the machine instructions
- Extensive addressing capabilities for memory operations
- Few, but very useful, CPU registers

80386 Execution Sequence
[Figure: CISC processor block diagram. Bus interface, prefetch queue, decoding unit, microcode ROM, microcode queue, control unit, execution unit with ALU and registers, coprocessor.]
In a microprogrammed CISC, the processor fetches the instructions via the bus interface into a prefetch queue, which transfers them to a decoding unit. The decoding unit breaks the machine instruction into many elementary micro-instructions and applies them to a microcode queue. The micro-instructions are transferred from the microcode queue to the control and execution unit, which drives the ALU and the registers.

80386 Complex Instructions
- CISC drawback: most instructions are so complicated that they have to be broken into a sequence of micro-steps
- These steps are called microcode
  - Stored in a ROM in the processor core
  - Microcode ROM: access time and size...
  - They require extra ROM and decode logic

RISC: Less is More
- RISC = Reduced Instruction Set Computer
- 20/80 rule: 20% of the instructions take up 80% of the time
- Sometimes executing a sequence of simple instructions runs quicker than a single complex machine instruction that has the same effect

RISC Ideas (1)
- Reduce the instruction set to simplify the decoding
  - Smaller instruction set -> simpler logic -> smaller logic -> faster execution
- Eliminate microcode: hardwire all instruction execution
- Pipeline instruction decoding and executing: do more operations in parallel

RISC Ideas (2)
- Load/Store architecture: only the load and store instructions can access memory
- All other instructions work with the processor's internal registers
- This is necessary for single-cycle execution: the execution unit can't wait for data to be read or written

RISC Ideas (3)
- Increase the number of internal registers, because of the Load/Store architecture
- Registers are also more general purpose and less associated with specific functions
- Compiler designed along with the RISC processor design.
- The compiler has to be aware of the processor architecture to produce code that can be executed efficiently

Instruction Pipelining - Operations Can Be Carried Out in Parallel
- Read the instruction from memory or the prefetch queue (instruction fetch phase)
- Decode the instruction (decode phase)
- Where necessary, fetch the operands (operand fetch phase)
- Execute the instruction (execute phase)
- Write back the result (write-back phase)

Pipelined Execution
(Entries are instruction numbers; each row shows which instruction occupies each stage in that cycle.)

Cycle   Instruction Fetch   Decode   Operand Fetch   Execution   Write-back   Result
n       k                   k-1      k-2             k-3         k-4          k-4
n+1     k+1                 k        k-1             k-2         k-3          k-3
n+2     k+2                 k+1      k               k-1         k-2          k-2
n+3     k+3                 k+2      k+1             k           k-1          k-1
n+4     k+4                 k+3      k+2             k+1         k            k

Superscalar Architecture
- The processor may have more than one pipeline (Pentium...)
- Where possible, each pipeline works independently
  - Not always possible
- May achieve an average of more than one completed instruction per clock cycle

Pipeline Challenges
- More logic per pipeline stage: the same resource can't be used twice
  - E.g. can't re-use the ALU for computing implied addresses
- Synchronisation problems
  - Delayed jump/branch
  - Data and register dependency, e.g.
      ADD reg1, reg2, reg7
      AND reg6, reg1, reg3

Getting the Benefits of Pipelining
- Simplified instruction decoding
  - Simpler, faster logic
- On-chip cache memories
  - Local memory on-chip to avoid memory access bottlenecks
- Floating-point pipeline for the FP coprocessor
- Speculative execution to get around pipeline flushes

Software Implications of RISCs
- The optimising compiler must know how the pipeline works (it must be aware of pipeline delays, and insert NOPs if need be)
- Lower code density in RISC because instructions are less efficient
  - PowerPC code takes up to 30% more code to do the same tasks as an x86 CPU
  - More memory accesses, potential performance impact...

80486: IA-32 with RISC elements
- Introduced 04/89
- Greatly improved 80386 CPU
- Hard-wired implementation of frequently used instructions (as in RISCs). On average 2 clock cycles/instruction
- 5-stage instruction pipeline
- Internal L1 cache memory (8kB) + cache controller
- On-chip floating-point coprocessor (FPU)
- Longer prefetch queue (32 bytes as opposed to 16 on the 80386)
- Higher frequency operation: up to 120MHz
- >1.2M transistors, 0.8µm CMOS. 168-pin PGA

80486 Block Diagram
[Figure: i486 CPU. The bus interface connects to D31-D0, A31-A0 and the control and status signals. Internal blocks: cache (8K bytes), prefetcher (32-byte queue), decoding unit, control unit, register and ALU, floating point unit, segmentation unit, paging unit.]

80486 Pipeline
Example: ADD eax, mem32

Cycle   Stage                       Action
n       Instruction Fetch           ADD eax, mem32
n+1     Decode 1 (memory access)    Decode ADD, fetch mem32
n+2     Decode 2                    Decode ADD (continued)
n+3     Execution                   Add eax and mem32
n+4     Write-back                  Write result into eax
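The five-phase pipeline from the Instruction Pipelining slide, and the register dependency from the Pipeline Challenges slide, can be sketched in Python. This is a simplified illustration under stated assumptions rather than a model of any real processor: it assumes one instruction enters the pipeline per clock with no stalls, the function names `pipeline_schedule` and `raw_hazard` are hypothetical, and the hazard check assumes a (destination, source1, source2) operand order, which the slides do not state.

```python
STAGES = ["Instruction Fetch", "Decode", "Operand Fetch", "Execution", "Write-back"]


def pipeline_schedule(num_instructions: int, num_stages: int = len(STAGES)) -> list:
    """Return one row per clock cycle; row[s] is the instruction index in stage s.

    Empty stages are marked with None. Assumes an ideal pipeline with no stalls.
    """
    if num_instructions <= 0:
        return []
    total_cycles = num_instructions + num_stages - 1
    schedule = []
    for cycle in range(total_cycles):
        row = []
        for stage in range(num_stages):
            instr = cycle - stage  # instruction i reaches stage s in cycle i + s
            row.append(instr if 0 <= instr < num_instructions else None)
        schedule.append(row)
    return schedule


def raw_hazard(first: tuple, second: tuple) -> bool:
    """True if `second` reads the register that `first` writes.

    Assumption: operands are ordered (dest, src1, src2).
    """
    dest = first[0]
    return dest in second[1:]


if __name__ == "__main__":
    for cycle, row in enumerate(pipeline_schedule(5)):
        print(cycle, row)
    # The dependency example from the slides: AND reads reg1, which ADD writes.
    print(raw_hazard(("reg1", "reg2", "reg7"), ("reg6", "reg1", "reg3")))
```

Under these assumptions, n instructions on a k-stage pipeline need n + k - 1 cycles, compared with n * k cycles if each instruction had to finish before the next one started. A dependency such as the ADD/AND pair breaks the ideal case, because the second instruction needs a result that the first has not yet written back.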