CS 61C: Great Ideas in Computer
Architecture (Machine Structures)
SIMD II
Instructors:
Randy H. Katz
David A. Patterson
http://inst.eecs.Berkeley.edu/~cs61c/sp11
6/27/2016
Spring 2011 -- Lecture #14
New-School Machine Structures
(It’s a bit more complicated!)
Software and Hardware: Harness Parallelism & Achieve High Performance
• Parallel Requests
  Assigned to computer
  e.g., Search “Katz”
• Parallel Threads
  Assigned to core
  e.g., Lookup, Ads
• Parallel Instructions
  >1 instruction @ one time
  e.g., 5 pipelined instructions
• Parallel Data (Today’s Lecture)
  >1 data item @ one time
  e.g., Add of 4 pairs of words
• Hardware descriptions
  All gates @ one time
[Figure: hardware levels, from Warehouse Scale Computer and Smart Phone, to Computer (Core … Core, Memory (Cache), Input/Output), to Core (Instruction Unit(s), Functional Unit(s) computing A0+B0 A1+B1 A2+B2 A3+B3), to Main Memory and Logic Gates]
Review
• Flynn Taxonomy of Parallel Architectures
– SIMD: Single Instruction Multiple Data
– MIMD: Multiple Instruction Multiple Data
– SISD: Single Instruction Single Data
– MISD: Multiple Instruction Single Data (unused)
• Intel SSE SIMD Instructions
– One instruction fetch that operates on multiple operands
simultaneously
– 128/64 bit XMM registers
• SSE Instructions in C
– Embed the SSE machine instructions directly into C programs
through use of intrinsics
– Achieve efficiency beyond that of optimizing compiler
Agenda
• Amdahl’s Law
• Administrivia
• SIMD and Loop Unrolling
• Technology Break
• Memory Performance for Caches
• Review of 1st Half of 61C
Big Idea: Amdahl’s (Heartbreaking) Law
• Speedup due to enhancement E is
  Speedup w/ E = (Exec time w/o E) / (Exec time w/ E)
• Suppose that enhancement E accelerates a fraction F (F < 1)
of the task by a factor S (S > 1) and the remainder of the task is
unaffected
  Execution Time w/ E = Execution Time w/o E × [ (1-F) + F/S ]
  Speedup w/ E = 1 / [ (1-F) + F/S ]
Big Idea: Amdahl’s Law
Speedup = 1 / [ (1 - F) + F/S ]
  (1 - F) is the non-speed-up part; F/S is the speed-up part
Example: the execution time of half of the
program can be accelerated by a factor of 2.
What is the program speed-up overall?
Speedup = 1 / (0.5 + 0.5/2) = 1 / (0.5 + 0.25) = 1.33
Big Idea: Amdahl’s Law
If the portion of the program that can be parallelized is small, then the speedup is limited.
The non-parallel portion limits the performance.
Example #1: Amdahl’s Law
Speedup w/ E = 1 / [ (1-F) + F/S ]
• Consider an enhancement which runs 20 times faster but
which is only usable 25% of the time
Speedup w/ E = 1/(.75 + .25/20) = 1.31
• What if it’s usable only 15% of the time?
Speedup w/ E = 1/(.85 + .15/20) = 1.17
• Amdahl’s Law tells us that to achieve linear speedup with
100 processors, none of the original computation can be
scalar!
• To get a speedup of 90 from 100 processors, the
percentage of the original program that could be scalar
would have to be 0.1% or less
Speedup w/ E = 1/(.001 + .999/100) = 90.99
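The formula on this slide can be checked with a few lines of C. This is a minimal sketch, not part of the original lecture: the function name amdahl_speedup is made up for illustration, and f and s stand for the slide's F and S.

```c
#include <assert.h>

/* Amdahl's Law: overall speedup when a fraction f of the original
 * execution time (0 <= f <= 1) is accelerated by a factor s (s > 0)
 * and the remaining (1 - f) is unaffected.
 * The name amdahl_speedup is hypothetical, chosen for this sketch. */
double amdahl_speedup(double f, double s)
{
    assert(f >= 0.0 && f <= 1.0 && s > 0.0);
    return 1.0 / ((1.0 - f) + f / s);
}
```

Calling amdahl_speedup(0.25, 20.0) and amdahl_speedup(0.15, 20.0) reproduces the first two results above (about 1.31 and 1.17), and amdahl_speedup(0.999, 100.0) gives about 90.99.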
Parallel Speed-up Example
Non-parallel part:
  Z0 + Z1 + … + Z10
Parallel part (partition 10 ways and perform on 10 parallel processing units):
  [ X1,1  … X1,10  ]     [ Y1,1  … Y1,10  ]
  [   …       …    ]  +  [   …       …    ]
  [ X10,1 … X10,10 ]     [ Y10,1 … Y10,10 ]
• 10 “scalar” operations (non-parallelizable)
• 100 parallelizable operations
• 110 operations
– 100/110 = .909 Parallelizable, 10/110 = .091 Scalar
Example #2: Amdahl’s Law
Speedup w/ E = 1 / [ (1-F) + F/S ]
• Consider summing 10 scalar variables and two 10 by
10 matrices (matrix sum) on 10 processors
Speedup w/ E = 1/(.091 + .909/10) = 1/0.1819 = 5.5
• What if there are 100 processors ?
Speedup w/ E = 1/(.091 + .909/100) = 1/0.10009 = 10.0
• What if the matrices are 33 by 33 (or 1099 adds in total) on
10 processors? (increase parallel data by roughly 10x)
Speedup w/ E = 1/(.009 + .991/10) = 1/0.108 = 9.2
• What if there are 100 processors ?
Speedup w/ E = 1/(.009 + .991/100) = 1/0.019 = 52.6
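The same four results can be computed directly from operation counts rather than from rounded fractions. The sketch below assumes every add takes one time unit and that the parallel adds divide evenly across processors; the function name speedup_from_counts is invented for this example.

```c
#include <assert.h>

/* Speedup for a job with scalar_ops serial additions and parallel_ops
 * additions that can be spread over `processors` units.
 * Assumptions (not from the slides): every add costs one time unit and
 * the parallel work divides perfectly. This is Amdahl's Law with
 * F = parallel_ops / (scalar_ops + parallel_ops) and S = processors. */
double speedup_from_counts(double scalar_ops, double parallel_ops,
                           double processors)
{
    assert(scalar_ops >= 0.0 && parallel_ops >= 0.0 && processors >= 1.0);
    double time_before = scalar_ops + parallel_ops;
    double time_after  = scalar_ops + parallel_ops / processors;
    return time_before / time_after;
}
```

With 10 scalar adds and a 10 by 10 matrix (100 adds), 10 processors give 110/20 = 5.5 and 100 processors give 110/11 = 10.0. With a 33 by 33 matrix (1089 adds) the same calls give about 9.2 and 52.6.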
Strong and Weak Scaling
• To get good speedup on a multiprocessor while keeping
the problem size fixed is harder than getting good
speedup by increasing the size of the problem.
– Strong scaling: when speedup can be achieved on a parallel
processor without increasing the size of the problem (e.g.,
10x10 Matrix on 10 processors to 100)
– Weak scaling: when speedup is achieved on a parallel
processor by increasing the size of the problem
proportionally to the increase in the number of processors
(e.g., 10x10 Matrix on 10 processors => 33x33 Matrix on 100)
• Load balancing is another important factor: every
processor doing same amount of work
– Just 1 unit with twice the load of others cuts speedup almost
in half
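The load-balancing claim can be illustrated with a small model: if all processors start together, the parallel part finishes only when the busiest one finishes, so the speedup is the total work divided by the largest per-processor load. This is a sketch under that assumption; the function name speedup_from_loads is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Speedup of a fully parallel job split across p processors, where
 * load[i] is the work given to processor i.
 * Model assumed for this sketch: run time equals the largest load,
 * so speedup = (sum of loads) / (largest load). */
double speedup_from_loads(const double *load, size_t p)
{
    assert(p > 0);
    double total = 0.0;
    double busiest = 0.0;
    for (size_t i = 0; i < p; i++) {
        assert(load[i] >= 0.0);
        total += load[i];
        if (load[i] > busiest)
            busiest = load[i];
    }
    assert(busiest > 0.0);
    return total / busiest;
}
```

For 10 processors with one unit of work each, the model gives a speedup of 10. If one processor instead carries two units while the other nine carry one, it gives 11/2 = 5.5.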
Peer Review
• Suppose a program spends 80% of its time in a
square root routine. How much must you
speed up square root to make the program run
5 times faster?
Speedup w/ E = 1 / [ (1-F) + F/S ]
A red) 4
B orange) 5
C green) 10
D) 20
E pink) None of the above
Administrivia
• Lab #7 posted
• No Homework, no project this week!
• TA Review: Su, Mar 6, 2-5 PM, 2050 VLSB
• Midterm Exam: Tu, Mar 8, 6-9 PM, 145/155 Dwinelle
– Split: A-Lew in 145, Li-Z in 155
– Small number of special consideration cases, due to class
conflicts, etc.; contact Dave or Randy
• No discussion during exam week; no lecture that day
• Sent (anonymous) 61C midway survey before Midterm:
Please fill out! (Only 1/3 so far; have your voice heard!)
• https://www.surveymonkey.com/s/QS3ZLW7
61C in the News
“Remapping Computer Circuitry to Avert Impending Bottlenecks,”
John Markoff, NY Times, Feb 28, 2011

Hewlett-Packard researchers have proposed a fundamental rethinking of the modern computer for the coming era of nanoelectronics: a marriage of memory and computing power that could drastically limit the energy used by computers.

Today the microprocessor is in the center of the computing universe, and information is moved, at heavy energy cost, first to be used in computation and then stored. The new approach would be to marry processing to memory to cut down transportation of data and reduce energy use.

The semiconductor industry has long warned about a set of impending bottlenecks described as “the wall,” a point in time where more than five decades of progress in continuously shrinking the size of transistors used in computation will end.

… systems will be based on memory chips he calls “nanostores” as distinct from today’s microprocessors. They will be hybrids, three-dimensional systems in which lower-level circuits will be based on a nanoelectronic technology called the memristor, which Hewlett-Packard is developing to store data.

The nanostore chips will have a multistory design, and computing circuits made with conventional silicon will sit directly on top of the memory to process the data, with minimal energy costs.
Getting to Know Profs
• Ride with sons in MS Charity Bike Ride every September since 2002
– “Waves to Wine”
– 150 miles over 2 days from SF to Sonoma
• Team: “Berkeley Anti-MS Crew”
– If want to join team, let me know
– Always a Top 10 fundraising team despite small size
• I was top fundraiser 2006, 2007, 2008, 2009, 2010 due to computing
– Can offer fund raising advice: order of sending, when to send during week, who to send to, …
Data Level Parallelism and SIMD
• SIMD wants adjacent values in memory that
can be operated on in parallel
• Usually specified in programs as loops
    for(i=1000; i>0; i=i-1)
        x[i] = x[i] + s;
• How can we reveal more data level parallelism
than is available in a single iteration of a loop?
• Unroll loop and adjust iteration rate
Looping in MIPS
Assumptions:
- R1 is initially the address of the element in the array with the highest
address
- F2 contains the scalar value s
- 8(R2) is the address of the last element to operate on.
CODE:
Loop: l.d    F0,0(R1)     ; F0 = array element
      add.d  F4,F0,F2     ; add s to F0
      s.d    F4,0(R1)     ; store result
      addui  R1,R1,#-8    ; decrement pointer by 8 bytes
      bne    R1,R2,Loop   ; repeat loop if R1 != R2
Loop Unrolled
Loop: l.d    F0,0(R1)
      add.d  F4,F0,F2
      s.d    F4,0(R1)
      l.d    F6,-8(R1)
      add.d  F8,F6,F2
      s.d    F8,-8(R1)
      l.d    F10,-16(R1)
      add.d  F12,F10,F2
      s.d    F12,-16(R1)
      l.d    F14,-24(R1)
      add.d  F16,F14,F2
      s.d    F16,-24(R1)
      addui  R1,R1,#-32
      bne    R1,R2,Loop
NOTE:
1. Different Registers eliminate stalls
2. Only 1 Loop Overhead every 4 iterations
3. This unrolling works if loop_limit (mod 4) = 0
Loop Unrolled Scheduled
Loop: l.d    F0,0(R1)
      l.d    F6,-8(R1)
      l.d    F10,-16(R1)
      l.d    F14,-24(R1)
      add.d  F4,F0,F2
      add.d  F8,F6,F2
      add.d  F12,F10,F2
      add.d  F16,F14,F2
      s.d    F4,0(R1)
      s.d    F8,-8(R1)
      s.d    F12,-16(R1)
      s.d    F16,-24(R1)
      addui  R1,R1,#-32
      bne    R1,R2,Loop
4 Loads side-by-side: Could replace with 4 wide SIMD Load
4 Adds side-by-side: Could replace with 4 wide SIMD Add
4 Stores side-by-side: Could replace with 4 wide SIMD Store
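As a rough illustration of that replacement, the sketch below uses Intel SSE intrinsics in C. It makes assumptions the slides do not: the array holds single-precision floats (so that four elements fit in one 128-bit XMM register, whereas the MIPS code above uses doubles), the loop walks upward from index 0, and the function name add_scalar is invented. A plain scalar loop handles leftover elements and serves as the fallback where SSE is unavailable.

```c
#include <stddef.h>
#if defined(__SSE__)
#include <xmmintrin.h>   /* SSE intrinsics: __m128, _mm_add_ps, ... */
#endif

/* Adds s to each of the n floats in x.
 * Hypothetical helper for illustration; uses float, not double, so that
 * one 128-bit register holds 4 elements. */
void add_scalar(float *x, size_t n, float s)
{
    size_t i = 0;
#if defined(__SSE__)
    __m128 vs = _mm_set1_ps(s);             /* s copied into all 4 lanes */
    for (; i + 4 <= n; i += 4) {
        __m128 v = _mm_loadu_ps(&x[i]);     /* one 4-wide load  */
        v = _mm_add_ps(v, vs);              /* one 4-wide add   */
        _mm_storeu_ps(&x[i], v);            /* one 4-wide store */
    }
#endif
    for (; i < n; i++)                      /* leftovers, or everything without SSE */
        x[i] = x[i] + s;
}
```

The unaligned load and store variants are used so the sketch does not depend on how x was allocated; aligned versions exist but require 16-byte-aligned addresses.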
Loop Unrolling in C
• Instead of compiler doing loop unrolling, could do it
yourself in C
    for(i=1000; i>0; i=i-1)
        x[i] = x[i] + s;
• Could be rewritten
    for(i=1000; i>0; i=i-4) {
        x[i] = x[i] + s;
        x[i-1] = x[i-1] + s;
        x[i-2] = x[i-2] + s;
        x[i-3] = x[i-3] + s;
    }
• What is downside of doing it in C?
Generalizing Loop Unrolling
• A loop of n iterations
• k copies of the body of the loop
Then we will run the loop with 1 copy of the
body n(mod k) times and
with k copies of the body floor(n/k) times
• (Will revisit loop unrolling again when get to
pipelining later in semester)
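The general scheme can be sketched in C for k = 4. The function name add_scalar_unrolled and the choice to run the single-copy iterations first are assumptions made for this example; the cleanup iterations could just as well come last.

```c
#include <stddef.h>

/* Adds s to each of the n doubles in x, unrolled by k = 4.
 * Hypothetical helper for illustration. */
void add_scalar_unrolled(double *x, size_t n, double s)
{
    size_t i = 0;
    size_t leftover = n % 4;

    /* 1 copy of the body, run n (mod k) times */
    for (; i < leftover; i++)
        x[i] = x[i] + s;

    /* k copies of the body, run floor(n/k) times */
    for (; i < n; i += 4) {
        x[i]     = x[i]     + s;
        x[i + 1] = x[i + 1] + s;
        x[i + 2] = x[i + 2] + s;
        x[i + 3] = x[i + 3] + s;
    }
}
```

With n = 10, the first loop runs 2 times and the unrolled loop runs 2 times, covering all 10 elements without requiring n to be a multiple of 4.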
Reading Miss Penalty:
Memory Systems that Support Caches
• The off-chip interconnect and memory architecture
affect overall system performance in dramatic ways
[Figure: on-chip CPU and Cache connected by a bus (32-bit data & 32-bit addr per cycle) to off-chip DRAM Memory]
One word wide organization (one word wide bus and one
word wide memory)
Assume
• 1 memory bus clock cycle to send address
• 15 memory bus clock cycles to get the 1st word in the
block from DRAM (row cycle time), 5 memory bus
clock cycles for 2nd, 3rd, 4th words (subsequent column
access time); note effect of latency!
• 1 memory bus clock cycle to return a word of data
Memory-Bus to Cache bandwidth
• Number of bytes accessed from memory and
transferred to cache/CPU per memory bus clock cycle
(DDR) SDRAM Operation
After a row is read into the SRAM register
• Input CAS as the starting “burst” address along with a burst length
• Transfers a burst of data (ideally a cache block) from a series of
sequential addresses within that row
- Memory bus clock controls transfer of successive words in the burst
[Figure: DRAM array of N rows x N cols with M bit planes; the Row Address selects a row into an N x M SRAM register, and the Column Address (incremented by +1 during the burst) selects the M-bit Output. Timing diagram: RAS with Row Address, CAS with Col Address, then the 1st, 2nd, 3rd, and 4th M-bit accesses within one Cycle Time, followed by the next Row Add]
One Word Wide Bus, One Word Blocks
• If block size is one word, then for a
memory access due to a cache miss, the
pipeline will have to stall for the number
of cycles required to return one data
word from memory
    1 memory bus clock cycle to send address
    15 memory bus clock cycles to read DRAM
    1 memory bus clock cycle to return data
    17 total clock cycles miss penalty
• Number of bytes transferred per clock
cycle (bandwidth) for a single miss is
    4/17 = 0.235 bytes per memory bus clock cycle
[Figure: on-chip CPU and Cache connected by a bus to DRAM Memory]
One Word Wide Bus, Four Word Blocks
• What if block size is four words and each
word is in a different DRAM row?
    1 cycle to send 1st address
    4 x 15 = 60 cycles to read DRAM
    1 cycle to return last data word
    62 total clock cycles miss penalty
• Number of bytes transferred per clock
cycle (bandwidth) for a single miss is
    (4 x 4)/62 = 0.258 bytes per clock
[Figure: on-chip CPU and Cache connected by a bus to DRAM Memory; four back-to-back DRAM accesses of 15 cycles each]
One Word Wide Bus, Four Word Blocks
• What if the block size is four words and
all words are in the same DRAM row?
    1 cycle to send 1st address
    15 + 3*5 = 30 cycles to read DRAM
    1 cycle to return last data word
    32 total clock cycles miss penalty
• Number of bytes transferred per clock
cycle (bandwidth) for a single miss is
    (4 x 4)/32 = 0.5 bytes per clock
[Figure: on-chip CPU and Cache connected by a bus to DRAM Memory; one 15-cycle access followed by three 5-cycle accesses]
Interleaved Memory, One Word Wide Bus
• For a block size of four words
    1 cycle to send 1st address
    15 cycles to read DRAM banks
    4*1 = 4 cycles to return last data word
    20 total clock cycles miss penalty
• Number of bytes transferred per clock
cycle (bandwidth) for a single miss is
    (4 x 4)/20 = 0.8 bytes per clock
[Figure: on-chip CPU and Cache connected by a bus to DRAM Memory banks 0, 1, 2, and 3; the four 15-cycle bank reads overlap]
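The four miss-penalty calculations on these slides follow one pattern, sketched below in C. The cycle counts are the ones assumed in the lecture; the function and constant names are invented for this example, and the interleaved case assumes one bank per word in the block.

```c
#include <assert.h>

/* Cycle counts assumed on the slides (memory bus clock cycles). */
enum {
    SEND_ADDR      = 1,   /* send the address                     */
    ROW_ACCESS     = 15,  /* first word from a DRAM row           */
    COL_ACCESS     = 5,   /* each later word from the same row    */
    RETURN_WORD    = 1,   /* return one word over the bus         */
    BYTES_PER_WORD = 4
};

/* Every word of the block sits in a different DRAM row. */
int penalty_different_rows(int words)
{
    assert(words >= 1);
    return SEND_ADDR + words * ROW_ACCESS + RETURN_WORD;
}

/* All words of the block sit in the same DRAM row. */
int penalty_same_row(int words)
{
    assert(words >= 1);
    return SEND_ADDR + ROW_ACCESS + (words - 1) * COL_ACCESS + RETURN_WORD;
}

/* One bank per word: bank reads overlap, words return one per cycle. */
int penalty_interleaved(int words)
{
    assert(words >= 1);
    return SEND_ADDR + ROW_ACCESS + words * RETURN_WORD;
}

/* Bytes transferred per memory bus clock cycle for a single miss. */
double miss_bandwidth(int words, int penalty_cycles)
{
    assert(penalty_cycles > 0);
    return (double)(words * BYTES_PER_WORD) / penalty_cycles;
}
```

For a one-word block, penalty_different_rows(1) gives the 17-cycle penalty; for four-word blocks the three functions give 62, 32, and 20 cycles, and miss_bandwidth turns those into roughly 0.258, 0.5, and 0.8 bytes per clock.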
DRAM Memory System Observations
• It’s important to match the cache characteristics
– Caches access one block at a time (usually more than
one word)
1) With the DRAM characteristics
– Use DRAMs that support fast multiple word accesses,
preferably ones that match the block size of the cache
2) With the memory-bus characteristics
– Make sure the memory-bus can support the DRAM
access rates and patterns
– With the goal of increasing the Memory-Bus to Cache
bandwidth
New-School Machine Structures
(It’s a bit more complicated!)
Software and Hardware: Harness Parallelism & Achieve High Performance
• Parallel Requests
  Assigned to computer
  e.g., Search “Katz”
• Parallel Threads
  Assigned to core
  e.g., Lookup, Ads
• Parallel Instructions
  >1 instruction @ one time
  e.g., 5 pipelined instructions
• Parallel Data
  >1 data item @ one time
  e.g., Add of 4 pairs of words
• Hardware descriptions
  All gates functioning in parallel at same time
[Figure: hardware levels, from Warehouse Scale Computer and Smart Phone, to Computer (Core … Core, Memory (Cache), Input/Output), to Core (Instruction Unit(s), Functional Unit(s) computing A0+B0 A1+B1 A2+B2 A3+B3), to Main Memory and Logic Gates; levels are annotated with Project 1, Project 2, Project 3, and Project 4]
6 Great Ideas in Computer Architecture
1. Layers of Representation/Interpretation
2. Moore’s Law
3. Principle of Locality/Memory Hierarchy
4. Parallelism
5. Performance Measurement & Improvement
6. Dependability via Redundancy
Great Idea #1: Levels of Representation/Interpretation
(First half 61C)
High Level Language Program (e.g., C)
    temp = v[k];
    v[k] = v[k+1];
    v[k+1] = temp;
  Compiler
Assembly Language Program (e.g., MIPS)
    lw $t0, 0($2)
    lw $t1, 4($2)
    sw $t1, 0($2)
    sw $t0, 4($2)
  Assembler
Machine Language Program (MIPS)
    0000 1001 1100 0110 1010 1111 0101 1000
    1010 1111 0101 1000 0000 1001 1100 0110
    1100 0110 1010 1111 0101 1000 0000 1001
    0101 1000 0000 1001 1100 0110 1010 1111
  Machine Interpretation
Hardware Architecture Description (e.g., block diagrams)
  Architecture Implementation
Logic Circuit Description (Circuit Schematic Diagrams)
Anything can be represented as a number, i.e., data or instructions
#2: Moore’s Law
Predicts: 2X Transistors / chip every 2 years
[Figure: plot of # of transistors on an integrated circuit (IC) versus Year]
Gordon Moore
Intel Cofounder
B.S. Cal 1950!
Great Idea #3: Principle of Locality/Memory Hierarchy
(First half 61C)
Great Idea #4: Parallelism
• Data Level Parallelism in 1st half 61C
– Lots of data in memory that can be operated on
in parallel (e.g., adding together 2 arrays)
– Lots of data on many disks that can be operated
on in parallel (e.g., searching for documents)
• 1st project: DLP across 10s of servers and disks
using MapReduce
• Next week’s lab, 3rd project: DLP in memory
Summary
• Amdahl’s Cruel Law: Law of Diminishing Returns
• Loop Unrolling to Expose Parallelism
• Optimize Miss Penalty via Memory system
• As the field changes, cs61c has to change too!
• Still about the software-hardware interface
– Programming for performance via measurement!
– Understanding the memory hierarchy and its
impact on application performance
– Unlocking the capabilities of the architecture for
performance: SIMD