Basic Microcomputer Design

A Simple Computer consists of a
Processor (CPU-Central Processing
Unit), Memory, and I/O
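[Figure: block diagram of a simple computer. Input and Output (I/O) and Memory connect to the Processor (CPU), which contains the Arithmetic Logic Unit, the Control Unit, and Registers.]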
Basic Functional Units of a
Computer
• Input – accepts coded information from
human operators, from electromechanical
devices (such as keyboards), or from other
digital devices via digital communication
lines.
• The information received is either stored in
the memory or immediately used by the
arithmetic and logic unit (ALU) to perform
the desired operations.
• The results are sent back out through the
output medium.
• All actions are coordinated through the
control unit.
The Information
• Categorized as either instructions or data
• Instructions (or machine instructions) are
explicit commands that
– Govern the transfer of information within a
computer as well as between the computer
and its I/O devices.
– Specify the arithmetic and logic operations to
be performed.
Programs
• A list of instructions that performs a task is
called a program.
• Usually the program is stored in memory.
• The processor fetches the instructions from
memory, one after another, and performs the
desired operations.
• The computer is completely controlled by the
stored program, except for possible
interruption by an operator or by I/O devices
connected to the machine.
• Data are numbers and encoded characters
that are used as operands by the instructions.
Computer System Organization
Inside the CPU
• Control Unit (CU)
coordinates the sequencing of steps
involved in executing machine instructions
• Arithmetic Logic Unit (ALU)
performs arithmetic and logical operations
• Registers
storage locations
• Clock
synchronizes the internal operations of the
CPU with the other system components
Bus Structure
• Bus - a group of parallel wires that transfer
information from one part of the computer to
another.
– Control Bus
synchronizes the actions of all of the devices
attached to the system bus.
– Address Bus
passes the addresses of instructions and data
between the CPU and memory (or I/O).
– Data Bus
transfers instructions and data between the
CPU and memory (or I/O).
Bus Sizes
• For the 8086 Processor
– Address Bus – 20 bits
• can access 1 MB (2^20 bytes) of memory
• Addresses defined as $00000-$FFFFF
– Data Bus – 16 bits (16-bit processor)
• A word is 16 bits
• Memory is byte addressable, so each word
spans two byte addresses
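The link between address bus width and addressable memory can be checked with a short calculation. The following is a minimal Python sketch; the function names addressable_bytes and address_range are illustrative and do not come from the slides.

```python
def addressable_bytes(address_bits):
    """Number of distinct byte addresses reachable with the given bus width."""
    return 2 ** address_bits


def address_range(address_bits):
    """Lowest and highest address as hex strings, padded to the bus width."""
    digits = (address_bits + 3) // 4          # hex digits needed for the top address
    top = addressable_bytes(address_bits) - 1
    return format(0, "0{}X".format(digits)), format(top, "0{}X".format(digits))


# 8086: a 20-bit address bus gives 2**20 byte addresses (1 MB).
print(addressable_bytes(20))   # 1048576
print(address_range(20))       # ('00000', 'FFFFF')
```

The same function applied to a 32-bit address bus gives 4 GB of addresses.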
More Facts on The 8086 Processor
Generation                 P1
External Data Bus Width    16 bits
Internal Register Width    16 bits
Address Bus Width          20 bits
Numeric Data Processor     External
L1 Cache                   None
L2 Cache                   None
The Intel CPU Family
Chip         Date     MHz      Transistors  Memory  Notes
4004         4/1971   0.108    2,300        640 B   First microprocessor on a chip
8008         4/1972   0.108    3,500        16KB    First 8-bit processor
8080         4/1974   2-3      6,000        64KB    First general-purpose CPU on a chip
8085         4/1976   3-8      6,500        64KB
8086         6/1978   5-10     29,000       1MB     First 16-bit CPU on a chip
8088         6/1979   5-8      29,000       1MB     Used in IBM PC
80286        2/1982   8-12     134,000      16MB    Memory protection present
80386        10/1985  16-33    275,000      4GB     First 32-bit CPU
80486        4/1989   25-100   1.2M         4GB     Built-in 8K cache memory
Pentium      3/1993   60-233   3.1M         4GB     Two pipelines; later models had MMX
Pentium Pro  3/1995   150-200  5.5M         4GB     Two levels of cache built in
Pentium II   5/1997   233-400  7.5M         4GB     Pentium Pro plus MMX
Pentium III  1998     550      9.5M         4GB     Streaming SIMD extensions (SSE)
Notes from Intel Family Chart
• Notice that the 386 through the Pentium 4 are
32-bit processors (32-bit data bus – 4 bytes)
• Notice that the 386 and later have a 32-bit
address bus and can access 4 GB of memory
addresses.
Machine Cycle
• Most basic unit of time for machine
instructions
• = the time required for one complete clock
cycle.
• Machine instructions require at least 1
clock cycle to execute. Most require more.
• Wait states – empty clock cycles of
machine execution time (due to memory
access time being slower than speed of
clock).
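The effect of clock rate and wait states on instruction time can be sketched numerically. The example below is a rough Python illustration; the cycle count of 4 and the single wait state are hypothetical values chosen for the example, and 5 MHz is the low end of the 8086 clock range.

```python
def cycle_time_ns(clock_mhz):
    """Duration of one clock cycle in nanoseconds."""
    return 1000.0 / clock_mhz


def instruction_time_ns(clock_mhz, cycles, wait_states=0):
    """Time for an instruction, counting wait states as extra empty cycles."""
    return (cycles + wait_states) * cycle_time_ns(clock_mhz)


print(cycle_time_ns(5))                                  # 200.0
# Hypothetical instruction: 4 cycles plus 1 wait state for slow memory.
print(instruction_time_ns(5, cycles=4, wait_states=1))   # 1000.0
```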
Instruction Execution Cycle
• If using Memory operand (mov ax, 0A69Bh)
– Calculate address of operand
– Place address of operand on address bus
– Wait for memory to get operand and pass it on
data bus
The data path of a typical
von Neumann Machine
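[Figure: operands A and B are read from the Registers into the ALU Input Registers, travel over the ALU Input Bus to the ALU, and the result A+B is placed in the ALU Output Register and written back to the Registers.]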
Instruction Execution Cycle
The CPU executes each instruction in a series of
small steps
1. Fetch the next instruction from memory into
the instruction register.
2. Change the program counter to point to the
next instruction.
3. Decode the instruction.
4. Fetch any memory operands necessary into a
CPU register.
5. Execute the instruction.
6. Store output operand into a CPU register.
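The six steps above can be sketched as a loop over a toy accumulator machine. This is a simplified Python illustration, not the 8086 instruction set; the opcodes LOAD, ADD, and HALT and the single register acc are assumptions made for the example.

```python
def run(program, memory):
    """Execute (opcode, address) pairs on a toy one-register machine."""
    acc = 0   # the only CPU register (an accumulator)
    pc = 0    # program counter
    while pc < len(program):
        ir = program[pc]             # 1. fetch into the instruction register
        pc += 1                      # 2. point the program counter at the next instruction
        opcode, address = ir         # 3. decode
        if opcode == "HALT":
            break
        operand = memory[address]    # 4. fetch the memory operand
        if opcode == "LOAD":         # 5. execute
            result = operand
        elif opcode == "ADD":
            result = acc + operand
        else:
            raise ValueError("unknown opcode: " + str(opcode))
        acc = result                 # 6. store the output in a CPU register
    return acc


memory = {0: 7, 1: 35}
program = [("LOAD", 0), ("ADD", 1), ("HALT", None)]
print(run(program, memory))   # 42
```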
Execution of von Neumann Machines
Fetching the next instruction while the current one is executing speeds up the machine.
Instructions are stored in prefetch buffers (registers), where they can be accessed more quickly
than by waiting for a fetch from memory. Prefetching divides instruction execution into
two parts: fetching and actual execution.
Pipelining divides instruction execution into many parts, each one handled by a
piece of dedicated hardware, all of which can run in parallel.
2-stage Pipelining
• Execution Unit: executes the
microcode instructions.
• Bus Interface Unit: accesses memory
and provides I/O
A Five-stage Pipeline
• S1: Code Prefetch Unit
• S2: Instruction Decode Unit
• S3: Operand Fetch Unit
• S4: Instruction Execution Unit
• S5: Write Back Unit
A five-stage pipeline.
      t1  t2  t3  t4  t5  t6  t7  t8  t9
S1     1   2   3   4   5   6   7   8   9
S2     -   1   2   3   4   5   6   7   8
S3     -   -   1   2   3   4   5   6   7
S4     -   -   -   1   2   3   4   5   6
S5     -   -   -   -   1   2   3   4   5
The state of each stage as a function of time.
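The stage-by-time pattern can be generated rather than drawn by hand. The following Python sketch assumes an ideal pipeline with no stalls, in which a new instruction enters S1 every cycle; the function name pipeline_schedule is illustrative.

```python
def pipeline_schedule(stages, cycles):
    """rows[s][t] is the instruction in stage s+1 at time t+1, or None if idle."""
    rows = []
    for s in range(stages):
        row = []
        for t in range(cycles):
            instr = t - s + 1   # each stage lags the one before it by one cycle
            row.append(instr if instr >= 1 else None)
        rows.append(row)
    return rows


for number, row in enumerate(pipeline_schedule(5, 9), start=1):
    cells = " ".join(("-" if i is None else str(i)).rjust(2) for i in row)
    print("S{}: {}".format(number, cells))
```

Instruction 1 leaves S5 at t5, and one further instruction completes at every cycle after that.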
How Fast Does This Machine Run?
• Suppose that the cycle time of this machine
is 2 nsec.
• Then it takes 10 nsec for an instruction to
progress all the way through the five-stage
pipeline.
• Does the machine run at 100 MIPS (one
instruction per 10 nsec)?
• No, at every clock cycle (2 nsec) a new
instruction is completed, so the actual rate
of processing is 500 MIPS.
How many cycles are required to
execute n instructions?
(Pipelined Versus Non-Pipelined Systems)
• For a system with k stages
• In non-pipelined systems, n instructions require
(n*k) cycles to process.
– With k = 5 stages, 5 instructions require (5*5) = 25 clock cycles
• Using a pipelined system with k pipeline stages,
n instructions require (k + (n-1)) cycles to
complete.
– 5 instructions require (5 + (5-1)) = 9 clock cycles
(*refer to slide #14)
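The two cycle-count formulas can be compared directly. This is a small Python sketch that assumes an ideal pipeline with no stalls; the function names are illustrative.

```python
def cycles_non_pipelined(n, k):
    """Each of n instructions passes through all k stages before the next starts."""
    return n * k


def cycles_pipelined(n, k):
    """The first instruction takes k cycles; each later one finishes a cycle apart."""
    if n == 0:
        return 0
    return k + (n - 1)


print(cycles_non_pipelined(5, 5))   # 25
print(cycles_pipelined(5, 5))       # 9
```

As n grows, the ratio n*k / (k + n - 1) approaches k, so a k-stage pipeline gives at most a k-fold speedup.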
Tradeoffs
Pipelining allows a tradeoff between
– Latency
• How long it takes to execute an instruction
• Latency = nT nanosec (where cycle time is T nanosec and
the number of stages is n)
• And
– Processor Bandwidth
• How many MIPS the CPU has
• Bandwidth = 1000/T MIPS
*logically we should measure CPU bandwidth in BIPS or GIPS since we are measuring T
in nanosec, but nobody does this.
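Both formulas can be checked against the 2 nsec, five-stage machine discussed earlier. A minimal Python sketch, with illustrative function names:

```python
def latency_ns(stages, t_ns):
    """Time for one instruction to pass through every stage: n * T."""
    return stages * t_ns


def bandwidth_mips(t_ns):
    """One instruction completes every T nsec, giving 1000 / T MIPS."""
    return 1000.0 / t_ns


print(latency_ns(5, 2))     # 10
print(bandwidth_mips(2))    # 500.0
```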
IA-32 Processor Pipelining
(6-stage Execution Cycle)
• Bus Interface Unit: accesses memory and provides
I/O
• Code Prefetch Unit: receives instructions from the
BIU and inserts them into a holding area (instruction
queue)
• Instruction Decode Unit: decodes machine
instructions from the prefetch queue and translates
them into microcode.
• Execution Unit: executes the microcode instructions.
• Segment Unit: translates logical addresses into
linear addresses and performs protection checks
• Paging Unit: translates linear addresses into
physical addresses, performs page protection
checks and keeps a list of recently accessed pages
Superscalar Architecture
• If one pipeline is good, then two pipelines
must be better.
• Parallel paths exist through which different
instructions can be executed in parallel.
• It is possible to start the execution of
several instructions in every clock cycle.
• The logical correctness of programs must
be maintained.
Dual five-stage pipelines with a
common Code Prefetch Unit
The code prefetch unit fetches pairs of instructions together and puts each one into
its own pipeline, complete with its own ALU for parallel operation.
Superscalar processor with 5
functional units
Four pipelines would duplicate too much hardware.
Instead, use a single pipeline and give it
multiple functional units. This assumes that
the S3 stage can issue instructions faster than
the S4 stage can execute them. (Pentium II)
Parallelism
• So far we have dealt with instruction-level
parallelism.
• There is also processor-level parallelism
– Array processors
– Multiprocessors
– Multicomputers
CISC
Complex Instruction Set Computer
• A large number of variable length
instructions (more than 128)
• Multiple addressing modes
• A small number of internal processor
registers
• Instructions that require multiple
clock cycles to execute
8086
(A Real CISC)
• Over 3000 different instruction forms, each
requiring anywhere from one to six bytes
• Nine different addressing modes are
supported
• The processor only has eight general
purpose registers
• Instruction execution times range from 2
clock cycles to more than 80 cycles for the
ASCII adjust for multiplication (AAM) instruction.
Intel’s i860 RISC Processor
• 82 instructions, each 32 bits in length
• Four addressing modes
• 32 general purpose registers
• All instructions execute in one clock cycle
Why hasn’t RISC won out?
• Backward compatibility (companies have
spent billions of dollars on Intel processor
software).
• Intel has built CPU cores with a RISC-like
structure that executes the simplest and
most common instructions in a single data
path, while interpreting the more complex
instructions in the usual CISC way.
Design Principles of Modern
Computers
• All instructions are directly executed in
hardware
• Maximize the rate at which instructions are
issued
• Instructions should be easy to decode
• Only loads and stores should be able to
reference memory
• Provide plenty of registers
Application Specific Microprocessors
Digital Signal Processors
• Previously, analog signals had to be
handled with discrete circuits (op-amps,
capacitors, inductors, and resistors
forming filters, amplifiers, etc…)
• Now low-cost analog-to-digital and
digital-to-analog converters are available.
• => thus we have digital signal processing
systems
DSP systems
• DSPs are used to perform repetitive complex
mathematical computations on the converted
analog data.
• One computation may require as many as
500,000 add-multiply operations.
DSP Architecture
• Data and instructions are stored in two different
memory areas each with their own buses
(Harvard Architecture)
• Hardware multipliers and adders are built into
the processor and optimized to perform a
calculation in a single clock cycle.
• Arithmetic pipelining is used so that several
instructions can be operated on at once.
• Hardware DO loops are provided to speed up
repetitive operations
• Multiple (serial) I/O ports are provided for
communication with other processors.
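The add-multiply workload that this hardware targets can be illustrated with a finite impulse response (FIR) filter. The slides do not name a specific algorithm, so the FIR filter is an assumption chosen as a typical case, and this Python sketch only counts the multiply-accumulate steps that a DSP would perform in hardware.

```python
def fir_filter(samples, coefficients):
    """Filter the samples and count the multiply-accumulate steps used."""
    outputs = []
    macs = 0
    for n in range(len(samples)):
        acc = 0.0
        for k, c in enumerate(coefficients):
            if n - k >= 0:
                acc += c * samples[n - k]   # one multiply-accumulate operation
                macs += 1
        outputs.append(acc)
    return outputs, macs


# Two-tap averaging filter over four samples.
out, macs = fir_filter([1, 2, 3, 4], [0.5, 0.5])
print(out)    # [0.5, 1.5, 2.5, 3.5]
print(macs)   # 7
```

The inner loop is the kind of repetitive operation that a hardware DO loop and a single-cycle multiplier-adder are meant to speed up.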
DSP Applications
• Multimedia sound cards (used to
compress speech and music signals)
• DSPs can be reprogrammed (which allows
some sound cards to double as a modem)
• Cellular phones
• Speech and image compression
• Optical character recognition
• Video conferencing
Operating System
• A collection of programs (in effect, one large
program) that is used to control the sharing of and
interaction among various computer units as they
execute application programs.
• Performs the tasks required to assign computer
resources to individual application programs.
– Assigning memory and magnetic disk space to
program and data files
– Moving data between memory and disk units
– Handling I/O operations
Example of How an Operating System Manages
the execution of more than one application
program at the same time
• An application program has been compiled from a high-level language
form into machine language form and is stored on disk.
• Assume that somewhere in the program, a data file must be read,
some computation performed on the data, and the results printed.
– Transfer file into memory
– When transfer is complete, begin execution
– When point in program is reached that data file is needed, the
program requests the operating system to transfer the data file
from the disk to memory.
• The OS performs this task and passes execution control back to the
application program, which then proceeds to perform the required
computation.
• When the computation is completed and the results are ready to be
printed,
Can Multitasking be used for concurrent
execution of application programs?
[Figure: timeline from t0 to t5 showing when the Printer, the Disk, the OS Routines, and the Program are active.]