What Happens When Processing, Storage, and Bandwidth are Free and Infinite?
Jim Gray
Microsoft Research
1
Outline

• Clusters of Hardware CyberBricks
  – all nodes are very intelligent
• Software CyberBricks
  – standard way to interconnect intelligent nodes
• What next?
  – Processing migrates to where the power is
    • Disk, network, display controllers have full-blown OS
    • Send them RPCs (SQL, Java, HTTP, DCOM, CORBA)
    • Computer is a federated distributed system.
2
When Computers & Communication are Free

• Traditional computer industry is 0 B$/year
• All the costs are in
  – Content (good)
  – System Management (bad)
    • A vendor claims it costs 8 $/MB/year to manage disk storage.
      => WebTV (1 GB drive) costs 8,000 $/year to manage!
      => a 10 PB DB costs 80 billion $/year to manage!
    • Automatic management is ESSENTIAL
• In the meantime….
3
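The management-cost arithmetic above is easy to verify; here is a minimal sketch in Python (the 8 $/MB/year figure is the vendor claim quoted on the slide, the unit conversions are mine):

```python
# Back-of-envelope check of the storage-management cost claim.
COST_PER_MB_YEAR = 8.0                     # $/MB/year, the vendor figure quoted above

webtv_mb = 1 * 1024                        # a 1 GB WebTV drive, in MB
big_db_mb = 10 * 1024**3                   # a 10 PB database, in MB

print(f"WebTV drive: ${COST_PER_MB_YEAR * webtv_mb:>15,.0f} per year")   # ~ $8,000/year
print(f"10 PB DB:    ${COST_PER_MB_YEAR * big_db_mb:>15,.0f} per year")  # ~ $80+ billion/year
```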
1980 Rule of Thumb


• You need a systems programmer per MIPS
• You need a Data Administrator per 10 GB
4
One Person per MegaBuck

• 1 Breadbox ~ 5x 1987 machine room
• 48 GB is hand-held
• One person does all the work
• Cost/tps is 1,000x less
• 25 micro-dollars per transaction
• A megabuck buys 40 of these!!!
[Figure: one person is the hardware, OS, net, DB, and app expert for a box with 4 x 200 MHz cpus, 1/2 GB DRAM, 12 x 4 GB disks, and 3 x 7 x 4 GB disk arrays]
5
All God’s Children Have Clusters!
Buying Computing By the Slice

• People are buying computers by the gross
  – After all, they only cost 1 k$/slice!
• Clustering them together
6
A cluster is a cluster is a cluster

• It’s so natural, even mainframes cluster!
• Looking closer at usage patterns, a few models emerge
• Looking closer at sites, hierarchies, bunches, and functional specialization emerge
• Which are the roses? Which are the briars?
7
“Commercial” NT Clusters

• 16-node Tandem Cluster
  – 64 cpus
  – 2 TB of disk
  – Decision support
• 45-node Compaq Cluster
  – 140 cpus
  – 14 GB DRAM
  – 4 TB RAID disk
  – OLTP (Debit Credit)
    • 1 B tpd (14 k tps)
8
Tandem Oracle/NT

• 27,383 tpmC
• 71.50 $/tpmC
• 4 x 6 cpus
• 384 disks = 2.7 TB
9
Microsoft.com: ~150x4 nodes
[Diagram: the Microsoft.Com site, ~150 four-processor nodes spread across Building 11 (internal WWW, log processing, staging servers, live SQL servers; all servers in Building 11 are accessible from corpnet), the MOSWest admin LAN, and the Japanese and European data centers. Server pools — www.microsoft.com, home.microsoft.com, premium.microsoft.com, register.microsoft.com, register.msn.com, msid.msn.com, search.microsoft.com, support.microsoft.com, activex.microsoft.com, cdm.microsoft.com, FTP download servers, HTTP download servers, SQL servers, SQL consolidators, SQL reporting, and DMZ/IDC staging servers — hang off routers, switched Ethernet, FDDI rings (MIS1–MIS4), and primary/secondary Gigaswitches, reaching the Internet over DS3 (45 Mb/s) and OC3 links. Typical node: 4 x P5 or P6, 256 MB–1 GB RAM, 12–180 GB disk, $24K–$128K each, with FY98 growth forecasts per pool.]
13
HotMail: ~400 Computers
14
Inktomi (HotBot), WebTV: > 200 nodes

• Inktomi: ~250 UltraSparcs
  – web crawl
  – index crawled web and save index
  – return search results on demand
  – track ads and click-thrus
  – ACID vs BASE (Basic Availability, Serialized Eventually)
• WebTV
  – ~200 UltraSparcs
    • render pages, provide email
  – ~4 Network Appliance NFS file servers
  – a large Oracle app tracking customers
15
Loki: Pentium Clusters for Science
http://loki-www.lanl.gov/

• 16 Pentium Pro processors
  x 5 Fast Ethernet interfaces
  + 2 GBytes RAM
  + 50 GBytes disk
  + 2 Fast Ethernet switches
  + Linux…
  = 1.2 real Gflops for $63,000
  (but that is the 1996 price)
• Beowulf project is similar
  http://cesdis.gsfc.nasa.gov/pub/people/becker/beowulf.html
• Scientists want cheap mips.
16
Your Tax Dollars At Work
ASCI for Stockpile Stewardship

• Intel/Sandia: 9000 x 1-node PPro
• LLNL/IBM: 512 x 8 PowerPC (SP2)
• LANL/Cray: ?
• Maui Supercomputer Center
  – 512 x 1 SP2
17
Berkeley NOW (Network Of Workstations) Project
http://now.cs.berkeley.edu/

• 105 nodes
  – Sun UltraSparc 170, 128 MB, 2 x 2 GB disk
  – Myrinet interconnect (2 x 160 MBps per node)
  – SBus (30 MBps) limited
• GLUNIX layer above Solaris
• Inktomi (HotBot search)
• NAS Parallel Benchmarks
• Crypto cracker
• Sort 9 GB per second
18
Wisconsin COW

• 40 UltraSparcs
  – 64 MB + 2 x 2 GB disk
  – + Myrinet
• SunOS
• Used as a compute engine
19
Andrew Chien’s JBOB
http://www-csag.cs.uiuc.edu/individual/achien.html

• 48 nodes
  – 36 HP Kayak boxes (2 x PII, 128 MB, 1 disk)
  – 10 Compaq Workstation 6000 boxes (2 x PII, 128 MB, 1 disk)
• 32 Myrinet-connected & 16 ServerNet-connected
• Operational
• All running NT
20
NCSA Cluster

• The National Center for Supercomputing Applications,
  University of Illinois @ Urbana
• 500 Pentium cpus, 2k disks, SAN
• Compaq + HP + Myricom
• A supercomputer for 3 M$
• Classic Fortran/MPI programming
• NT + DCOM programming model
21
The Bricks of Cyberspace

• 4 B PC’s (1 Bips, .1 GB DRAM, 10 GB disk, 1 Gbps net; B = G)
• Cost 1,000 $
• Come with
  – NT
  – DBMS
  – High-speed net
  – System management
  – GUI / OOUI
  – Tools
• Compatible with everyone else
• CyberBricks
22
Super Server: 4T Machine

• Array of 1,000 4B machines
  – 1 Bips processors
  – 1 BB DRAM
  – 10 BB disks
  – 1 Bbps comm lines
  – 1 TB tape robot
• A few megabucks
• Challenge:
  – Manageability
  – Programmability
  – Security
  – Availability
  – Scaleability
  – Affordability
• As easy as a single system
• Future servers are CLUSTERS of processors, discs
• Distributed database techniques make clusters work
[Figure: a CyberBrick (a 4B machine) with CPU, 5 GB RAM, 50 GB disc]
23
Cluster Vision
Buying Computers by the Slice

• Rack & stack
  – Mail-order components
  – Plug them into the cluster
• Modular growth without limits
  – Grow by adding small modules
• Fault tolerance
  – Spare modules mask failures
• Parallel execution & data search
  – Use multiple processors and disks
• Clients and servers made from the same stuff
  – Inexpensive: built with commodity CyberBricks
24
Nostalgia: Behemoth in the Basement

• Today’s PC is yesterday’s supercomputer
• Can use LOTS of them
• Main apps changed:
  – scientific → commercial → web
  – Web & transaction servers
  – Data mining, web farming
25
SMP -> nUMA: BIG FAT SERVERS

• Directory-based caching lets you build large SMPs
• Every vendor is building a HUGE SMP
  – 256 way
  – 3x slower remote memory
  – 8-level memory hierarchy
    • L1, L2 cache
    • DRAM
    • remote DRAM (3, 6, 9, …)
    • disk cache
    • disk
    • tape cache
    • tape
• Needs
  – 64-bit addressing
  – nUMA-sensitive OS
    • (not clear who will do it)
• Or a hypervisor
  – like IBM LSF,
  – Stanford Disco
    www-flash.stanford.edu/Hive/papers.html
• You get an expensive cluster-in-a-box with a very fast network
26
Thesis
Many little beat few big
[Figure: price vs. package, from a $1 million mainframe (14" discs) through $100 K minis (9") and $10 K micros (5.25", 3.5") down to 2.5" and 1.8" pico/nano processors; memory hierarchy from 1 MB / 10 pico-second RAM through 100 MB / 10 nano-second RAM, 10 GB / 10 microsecond RAM, 1 TB / 10 millisecond disc, to a 100 TB / 10 second tape archive]

• The “smoking, hairy golf ball” processor:
  – 1 M SPECmarks, 1 TFLOP
  – 10^6 clocks to bulk RAM
  – Event-horizon on chip
  – VM reincarnated
  – Multi-program cache, on-chip SMP
• How to connect the many little parts?
• How to program the many little parts?
• Fault tolerance?
28
A Hypothetical Question
Taking things to the limit

• Moore’s law, 100x per decade:
  – Exa-instructions per second in 30 years
  – Exa-bit memory chips
  – Exa-byte disks
• Gilder’s Law of the Telecosm: 3x/year more bandwidth
  – 60,000x per decade!
  – 40 Gbps per fiber today
29
Grove’s Law

• Link bandwidth doubles every 100 years!
• Not much has happened to telephones lately
• Still twisted pair
30
Gilder’s Telecosm Law:
3x bandwidth/year for 25 more years

• Today:
  – 10 Gbps per channel
  – 4 channels per fiber: 40 Gbps
  – 32 fibers/bundle = 1.2 Tbps/bundle
• In the lab: 3 Tbps/fiber (400 x WDM)
• In theory: 25 Tbps per fiber
• 1 Tbps = USA 1996 WAN bisection bandwidth
• 1 fiber = 25 Tbps
31
Networking
BIG!! Changes coming!

• Technology
  – 10 GBps bus “now”
  – 1 Gbps links “now”
  – 1 Tbps links in 10 years
  – Fast & cheap switches
• Standard interconnects
  – processor-processor
  – processor-device (device = processor)
• Deregulation WILL work someday
• CHALLENGE: reduce the software tax on messages
  – Today: 30 K ins + 10 ins/byte
  – Goal: 1 K ins + .01 ins/byte
• Best bet:
  – SAN/VIA
  – Smart NICs
  – Special protocol
  – User-level net IO (like disk)
32
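To see what the "software tax on messages" amounts to, a minimal sketch of the arithmetic (the 30 K + 10/byte and 1 K + .01/byte costs are the slide's; the 8 KB message size is an assumed example):

```python
# CPU instructions to send one message: fixed per-message overhead plus a per-byte cost.
def send_cost(fixed_ins, ins_per_byte, msg_bytes):
    return fixed_ins + ins_per_byte * msg_bytes

MSG_BYTES = 8 * 1024                              # assume an 8 KB message
today = send_cost(30_000, 10, MSG_BYTES)          # today's stack
goal = send_cost(1_000, 0.01, MSG_BYTES)          # the slide's goal

print(f"today: {today:,.0f} instructions")        # ~112,000
print(f"goal:  {goal:,.0f} instructions")         # ~1,100
print(f"roughly {today / goal:.0f}x less CPU per message")
```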
What if Networking Was as Cheap As Disk IO?

• TCP/IP
  – Unix/NT: 100% cpu @ 40 MBps
• Disk
  – Unix/NT: 8% cpu @ 40 MBps
• Why the difference?
  – Host does TCP/IP packetizing, checksum, flow control, small buffers
  – Host Bus Adapter does SCSI packetizing, checksum, flow control, DMA
33
The Promise of SAN/VIA:
10x better in 2 years

• Today:
  – wires are 10 MBps (100 Mbps Ethernet)
  – ~20 MBps tcp/ip saturates 2 cpus
  – round-trip latency is ~300 us
• In two years:
  – wires are 100 MBps (1 Gbps Ethernet, ServerNet, …)
  – tcp/ip ~100 MBps at 10% of each processor
  – round-trip latency is 20 us
• Works in the lab today
• Assumes the app uses the zero-copy Winsock2 API.
• See http://www.viarch.org/
[Chart: bandwidth, latency, and overhead, now vs. soon]
34
Functionally Specialized Cards

• Storage, Network, and Display cards
  – each an ASIC plus a P-mips processor with M MB of DRAM
• Today: P = 50 mips, M = 2 MB
• In a few years: P = 200 mips, M = 64 MB
36
It’s Already True of Printers
Peripheral = CyberBrick

• You buy a printer
• You get
  – several network interfaces
  – a PostScript engine
    • cpu,
    • memory,
    • software,
    • a spooler (soon)
  – and… a print engine.
37
System On A Chip

• Integrate processing with memory on one chip
  – chip is 75% memory now
  – 1 MB cache >> 1960 supercomputers
  – 256 Mb memory chip is 32 MB!
  – IRAM, CRAM, PIM, … projects abound
• Integrate networking with processing on one chip
  – system bus is a kind of network
  – ATM, FiberChannel, Ethernet, … logic on chip.
  – Direct IO (no intermediate bus)
• Functionally specialized cards shrink to a chip.
38
All Device Controllers will be Cray 1’s

• TODAY
  – Disk controller is a 10 mips risc engine with 2 MB DRAM
  – NIC is similar power
• SOON
  – Will become 100 mips systems with 100 MB DRAM.
• They are nodes in a federation
  (can run Oracle on NT in the disk controller).
• Advantages
  – Uniform programming model
  – Great tools
  – Security
  – Economics (cyberbricks)
  – Move computation to data (minimize traffic)
[Figure: central processor & memory and devices connected by a Tera Byte Backplane]
39
With Tera Byte Interconnect and Super Computer Adapters

• Processing is incidental to
  – Networking
  – Storage
  – UI
• Disk controller/NIC is
  – faster than the device
  – close to the device
  – can borrow the device package & power
• So use the idle capacity for computation.
• Run the app in the device.
[Figure: devices on a Tera Byte Backplane]
40
Implications

• Conventional
  – Offload device handling to NIC/HBA
  – Higher-level protocols: I2O, NASD, VIA, …
  – SMP and cluster parallelism is important.
• Radical
  – Move the app to the NIC/device controller
  – Higher-higher level protocols: CORBA / DCOM.
  – Cluster parallelism is VERY important.
[Figure: central processor & memory and devices on a Tera Byte Backplane]
41
How Do They Talk to Each Other?

• Each node has an OS
• Each node has local resources: a federation.
• Each node does not completely trust the others.
• Nodes use RPC to talk to each other
  – CORBA? DCOM? IIOP? RMI?
  – One or all of the above.
• Huge leverage in high-level interfaces.
• Same old distributed system story.
[Figure: on each node, applications sit above RPC / streams / datagrams above VIAL/VIPL, joined by the wire(s)]
42
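A minimal sketch of the federation idea, using Python's built-in XML-RPC purely as a stand-in for DCOM/CORBA/IIOP/RMI (the "smart disk" node and its scan call are hypothetical):

```python
# A "smart disk" node exports a high-level operation; the host sends it one RPC
# instead of issuing many low-level block reads.
import threading
from xmlrpc.server import SimpleXMLRPCServer
from xmlrpc.client import ServerProxy

server = SimpleXMLRPCServer(("localhost", 8000), logRequests=False, allow_none=True)
server.register_function(
    lambda table, predicate: f"rows of {table} where {predicate}",  # hypothetical device-side scan
    "scan")
threading.Thread(target=server.serve_forever, daemon=True).start()

disk = ServerProxy("http://localhost:8000")          # a peer in the federation
print(disk.scan("landsat", "snow_cover(image) > 0.7"))
```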
Punch Line

The huge clusters we saw are prototypes for this:
• A federation of functionally specialized nodes
• Each node shrinks to a “point” device with embedded processing.
• Each node / device is autonomous
• Each talks a high-level protocol
43
Outline

• Hardware CyberBricks
  – all nodes are very intelligent
• Software CyberBricks
  – standard way to interconnect intelligent nodes
• What next?
  – Processing migrates to where the power is
    • Disk, network, display controllers have full-blown OS
    • Send them RPCs (SQL, Java, HTTP, DCOM, CORBA)
    • Computer is a federated distributed system.
44
Software CyberBricks: Objects!

• It’s a zoo
• Objects and 3-tier computing (transactions)
  – Give natural distribution & parallelism
  – Give remote management!
  – TP & Web: dispatch RPCs to a pool of object servers
• Components are a 1 B$ business today!
45
The COMponent Promise

• Objects are Software CyberBricks
  – productivity breakthrough (plug-ins)
  – manageability breakthrough (modules)
• Microsoft promise: DCOM + ActiveX + …
• IBM/Sun/Oracle/Netscape promise: CORBA + OpenDoc + Java Beans + …
• Both promise
  – parallel distributed execution
  – centralized management of a distributed system
• Both camps share key goals:
  – Encapsulation: hide implementation
  – Polymorphism: generic ops, key to GUI and reuse
  – Uniform naming
  – Discovery: finding a service
  – Fault handling: transactions
  – Versioning: allow upgrades
  – Transparency: local/remote
  – Security: who has authority
  – Shrink-wrap: minimal inheritance
  – Automation: easy
46
History and Alphabet Soup
[Timeline: 1985 — Open Software Foundation (OSF); 1990 — X/Open, UNIX International; 1995 — CORBA from the Object Management Group (OMG), Solaris, the Open Group (OSF DCE), NT COM]
• Microsoft DCOM is based on OSF-DCE technology
• DCOM and ActiveX extend it
47
Objects Meet Databases
basis for universal data servers, access, & integration

• Object-oriented (COM-oriented) interface to data
• Breaks the DBMS into components
• Anything can be a data source
• Optimization/navigation “on top of” other data sources
• Makes an RDBMS an O-R DBMS, assuming the optimizer understands objects
[Figure: a DBMS engine federating a database, spreadsheet, photos, mail, map, and documents]
50
The BIG Picture
Components and transactions

• Software modules are objects
• An Object Request Broker (a.k.a. Transaction Processing Monitor) connects objects (clients to servers)
• Standard interfaces allow software plug-ins
• A transaction ties execution of a “job” into an atomic unit: all-or-nothing, durable, isolated
[Figure: clients and servers connected through an Object Request Broker]
51
The OO Points So Far

• Objects are software CyberBricks
• Object interconnect standards are emerging
• CyberBricks become federated systems.
• Next points:
  – put processing close to data
  – do parallel processing.
53
Transaction Processing Evolution to Three Tier
Intelligence migrated to clients

• Mainframe batch processing (centralized), punched cards
• Dumb terminals & Remote Job Entry: green-screen 3270s to the mainframe
• TP monitor: intelligent terminals, database backends
• Now: workflow systems, Object Request Brokers, application generators; active clients to an ORB and server
56
Web Evolution to Three Tier
Intelligence migrated to clients (like TP)

• Character-mode clients, smart servers: WAIS, archie, gopher (green screen to web server)
• GUI browsers - web file servers (Mosaic)
• GUI plug-ins - web dispatchers - CGI
• Smart clients - web dispatcher (ORB)
  – pools of app servers (ISAPI, Viper)
  – workflow scripts at client & server (NS & IE, Active)
57
PC Evolution to Three Tier
Intelligence migrated to server

• Stand-alone PC (centralized)
• PC + file & print server
  – one message per IO (IO request / reply, disk I/O)
• PC + database server
  – one message per SQL statement
• PC + app server
  – one message per transaction
• ActiveX client, ORB, ActiveX server, X-script
58
Why Did Everyone Go To Three-Tier?

• Manageability
  – Business rules must be with the data
  – Middleware operations tools
• Performance (scaleability)
  – Server resources are precious
  – ORB dispatches requests to server pools
• Technology & physics
  – Put UI processing near the user
  – Put shared-data processing near the shared data
  – Minimizes data moves
  – Encapsulate / modularity
[Figure: presentation, workflow, business objects, database tiers]
59
Why Put Business Objects at the Server?

• DAD’s Raw Data
  – Customer comes to store
  – Takes what he wants
  – Fills out invoice
  – Leaves money for goods
  – Easy to build
  – No clerks
• MOM’s Business Objects
  – Customer comes to store with list
  – Gives list to clerk
  – Clerk gets goods, makes invoice
  – Customer pays clerk, gets goods
  – Easy to manage
  – Clerks control access
  – Encapsulation
60
The OO Points So Far

• Objects are software CyberBricks
• Object interconnect standards are emerging
• CyberBricks become federated systems.
• Put processing close to data
• Next point:
  – do parallel processing.
61
Kinds of Parallel Execution
• Pipeline: one sequential program’s output streams into the next sequential program.
• Partition: outputs split N ways, inputs merge M ways; each partition runs its own copy of an ordinary sequential program.
63
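A minimal sketch of the two forms in Python (the stage functions and the five-way split are illustrative assumptions, not part of the slide):

```python
# Pipeline parallelism: ordinary sequential programs composed so tuples stream
# from one stage to the next.
def scan(rows):
    yield from rows                                  # stage 1: produce tuples

def keep_big(stream):
    return (r for r in stream if r > 10)             # stage 2: filter and re-emit

# Partition parallelism: split the input N ways and run the SAME sequential
# program on every partition, then merge the partial results.
from concurrent.futures import ProcessPoolExecutor

def count(partition):
    return sum(1 for _ in partition)

if __name__ == "__main__":
    rows = list(range(100))
    print(sum(1 for _ in keep_big(scan(rows))))       # pipelined: 89
    parts = [rows[i::5] for i in range(5)]            # 5-way partition
    with ProcessPoolExecutor() as pool:
        print(sum(pool.map(count, parts)))            # partitioned: 100
```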
Object Oriented Programming
Parallelism From Many Little Jobs

• Gives location transparency
• ORB / web server / TP monitor multiplexes clients to servers
• Enables distribution
• Exploits embarrassingly parallel apps (transactions)
• HTTP and RPC (DCOM, CORBA, RMI, IIOP, …) are the basis
64
Why Parallel Access To Data?
• At 10 MB/s it takes 1.2 days to scan 1 Terabyte
• With 1,000 x parallelism: a 100-second scan
• Parallelism: divide a big problem into many smaller ones to be solved in parallel.
65
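The scan arithmetic is worth checking; a minimal sketch (1 TB, 10 MB/s, and 1,000-way parallelism are the slide's numbers):

```python
# Time to scan 1 Terabyte serially vs. with 1,000 disks working in parallel.
terabyte = 10**12            # bytes
rate = 10 * 10**6            # 10 MB/s per disk

serial_seconds = terabyte / rate
parallel_seconds = terabyte / (rate * 1_000)

print(f"serial scan:   {serial_seconds / 86_400:.1f} days")    # ~1.2 days
print(f"parallel scan: {parallel_seconds:.0f} seconds")        # ~100 seconds
```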
Why are Relational Operators Successful for Parallelism?

• Relational data model: uniform operators on uniform data streams
• Closed under composition
• Each operator consumes 1 or 2 input streams
• Each stream is a uniform collection of data
• Sequential data in and out: pure dataflow
• Partitioning some operators (e.g. aggregates, non-equi-join, sort, …) requires innovation
• => AUTOMATIC PARALLELISM
66
Database Systems “Hide” Parallelism

• Automate system management via tools
  – data placement
  – data organization (indexing)
  – periodic tasks (dump / recover / reorganize)
• Automatic fault tolerance
  – duplex & failover
  – transactions
• Automatic parallelism
  – among transactions (locking)
  – within a transaction (parallel execution)
67
Automatic Parallel Object Relational DB
Select image
from landsat
where date between 1970 and 1990
and overlaps(location, :Rockies)
and snow_cover(image) >.7;
Landsat
date loc image
1/2/72
.
.
.
.
.
..
.
.
4/8/95
33N
120W
.
.
.
.
.
.
.
34N
120W
Temporal
Spatial
Image
Assign one process per processor/disk:
find images with right data & location
analyze image, if 70% snow, return it
Answer
image
date, location,
& image tests
69
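A minimal sketch of how such a query runs with one process per partition (the overlaps and snow_cover helpers and the tiny in-memory partitions are illustrative assumptions):

```python
# One worker per processor/disk: each scans its own slice of the landsat table,
# applies the cheap date/location predicates, then the expensive image test.
from concurrent.futures import ThreadPoolExecutor

def overlaps(location, region):                # hypothetical spatial predicate
    return location in region

def snow_cover(image):                         # hypothetical image analysis
    return image["snow_fraction"]

def scan_partition(rows, region):
    return [r["image"] for r in rows
            if 1970 <= r["date"] <= 1990
            and overlaps(r["loc"], region)
            and snow_cover(r["image"]) > 0.7]

partitions = [                                 # each list stands for one disk's slice
    [{"date": 1972, "loc": "33N 120W", "image": {"snow_fraction": 0.9}}],
    [{"date": 1995, "loc": "34N 120W", "image": {"snow_fraction": 0.8}}],
]
rockies = {"33N 120W", "34N 120W"}
with ThreadPoolExecutor() as pool:
    results = pool.map(scan_partition, partitions, [rockies] * len(partitions))
print([img for part in results for img in part])   # only the 1972 image qualifies
```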
Automatic Data Partitioning
Split a SQL table to a subset of nodes & disks

Partition within the set (e.g. five partitions A…E, F…J, K…N, O…S, T…Z):
• Range
  – Good for equi-joins, range queries, group-by
• Hash
  – Good for equi-joins
• Round Robin
  – Good to spread load
Shared disk and memory are less sensitive to partitioning;
shared-nothing benefits from “good” partitioning
70
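A minimal sketch of the three placement rules (the specific key-to-partition mappings are generic illustrations, not any particular product's):

```python
# Deciding which of five partitions a row lands in.
RANGE_BOUNDS = ["E", "J", "N", "S", "Z"]       # upper bound of each range partition

def range_partition(key):                      # range: keys stay in sorted buckets
    first = key[0].upper()
    return next(i for i, hi in enumerate(RANGE_BOUNDS) if first <= hi)

def hash_partition(key, n=5):                  # hash: equal keys always collocate
    return hash(key) % n                       # (value varies across runs here)

def round_robin_partition(row_number, n=5):    # round robin: pure load spreading
    return row_number % n

for i, name in enumerate(["Adams", "Jones", "Smith"]):
    print(name, range_partition(name), hash_partition(name), round_robin_partition(i))
```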
Partitioned Execution
Spreads computation and IO among processors

[Figure: a table partitioned A…E, F…J, K…N, O…S, T…Z, with a Count operator running on each partition]

• Partitioned data gives NATURAL parallelism
74
N x M way Parallelism
[Figure: five partitions (A…E, F…J, K…N, O…S, T…Z) feed five Join operators, whose outputs are repartitioned into five Sorts and combined by three Merge operators]

• N inputs, M outputs, no bottlenecks.
• Partitioned data
• Partitioned and pipelined data flows
75
Hash Join: Combining Two Tables

• Hash the smaller table into N buckets (hope N=1)
• If N=1: read the larger table, hash-probe into the smaller
• Else: hash the outer to disk, then do a bucket-by-bucket hash join.
• Purely sequential data behavior
• Always beats sort-merge and nested-loops unless the data is clustered.
• Good for equi, outer, and exclusion joins
• Hash reduces skew
• Lots of papers; products just appearing (what went wrong?)
81
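A minimal sketch of the N = 1 case, assuming the smaller table fits in memory (the customer/order tables are made up for illustration):

```python
# Hash join on an equi-join key: build a hash table on the smaller table,
# then stream the larger table past it in one sequential pass.
from collections import defaultdict

def hash_join(small, large, key):
    buckets = defaultdict(list)
    for row in small:                          # build phase
        buckets[row[key]].append(row)
    for row in large:                          # probe phase
        for match in buckets.get(row[key], ()):
            yield {**match, **row}

customers = [{"cid": 1, "name": "Ann"}, {"cid": 2, "name": "Bob"}]
orders = [{"cid": 1, "item": "disk"}, {"cid": 1, "item": "tape"}, {"cid": 3, "item": "cpu"}]
print(list(hash_join(customers, orders, "cid")))   # Ann joins twice; cid=3 finds no match
```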
Parallel Hash Join
• ICL implemented hash join with bitmaps in the CAFS machine (1976)!
• Kitsuregawa pointed out the parallelism benefits of hash join in the early 1980’s (it partitions beautifully)
• We ignored them! (why?)
• But now everybody’s doing it (or promises to do it).
• Hashing minimizes skew, requires little thinking for redistribution
• Hashing uses massive main memory
82
Main Message

• Technology trends give
  – many processors and storage units
  – inexpensively
• To analyze large quantities of data
  – sequential (regular) access patterns are 100x faster
  – parallelism is 1000x faster (trades time for money)
  – Relational systems show many parallel algorithms.
84
Summary

• All God’s Children Got Clusters!
• Technology trends imply processors migrate to transducers
• Components (Software CyberBricks)
• Programming & managing clusters
• Database experience
  – Parallelism via transaction processing
  – Parallelism via data flow
  – Auto everything, always up
86
End:
86 slides is more than enough for an hour.
87
Clusters Have Advantages

• Clients and servers made from the same stuff.
• Inexpensive:
  – Built with commodity components
• Fault tolerance:
  – Spare modules mask failures
• Modular growth
  – grow by adding small modules
98
Meta-Message:
Technology Ratios Are Important

• If everything gets faster & cheaper at the same rate,
  THEN nothing really changes.
• Things getting MUCH BETTER:
  – communication speed & cost 1,000x
  – processor speed & cost 100x
  – storage size & cost 100x
• Things staying about the same:
  – speed of light (more or less constant)
  – people (10x more expensive)
  – storage speed (only 10x better)
99
Storage Ratios Changed

• 10x better access time
• 10x more bandwidth
• 4,000x lower media price
• DRAM/DISK media price ratio changed: 100:1 → 10:1 → 50:1
[Charts, 1980–2000: disk performance vs. time (seeks per second, bandwidth in MB/s, capacity in GB), disk accesses per second vs. time, and storage price vs. time (megabytes per kilo-dollar)]
100
Performance = Storage Accesses,
not Instructions Executed

• In the “old days” we counted instructions and IOs
• Now we count memory references
• Processors wait most of the time
[Chart: where the time goes — clock ticks used by AlphaSort components: disc wait, sort, OS, memory wait, B-cache data miss, I-cache miss, D-cache miss]
• 70 MIPS; “real” apps have worse I-cache misses, so they run at 60 MIPS if well tuned, 20 MIPS if not
104
Storage Latency:
How Far Away is the Data?
Clock ticks   Level                  At one tick per minute, the data is…
10^9          Tape / optical robot   in Andromeda (2,000 years away)
10^6          Disk                   on Pluto (2 years away)
100           Memory                 in Sacramento (1.5 hr away)
10            On-board cache         on this campus (10 min away)
2             On-chip cache          in this room
1             Registers              in my head (1 min)
105
Tape Farms for Tertiary Storage,
Not Mainframe Silos

• One 10 K$ robot: 14 tapes, 500 GB, 5 MB/s, 20 $/GB, 30 Maps, scan in 27 hours
• 100 robots: 1 M$, 50 TB, 50 $/GB, 3 K Maps
• Many independent tape robots (like a disc farm)
106
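The 27-hour scan follows directly from the per-robot figures; a minimal check (500 GB and 5 MB/s are the slide's numbers):

```python
# One tape robot: 500 GB of media read at 5 MB/s.
capacity_bytes = 500 * 10**9
rate_bytes_per_s = 5 * 10**6

hours = capacity_bytes / rate_bytes_per_s / 3600
print(f"{hours:.0f} hours to scan one robot")   # ~28 hours, the slide's "27 hr scan"
# A farm scans in the same elapsed time: every robot reads its own tapes in parallel.
```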
The Metrics:
Disk and Tape Farms Win
[Chart, log scale 0.01–1,000,000: GB/k$, Kaps (kilo-accesses per second), Maps (media accesses per second), and scans/day for a 1000-disc farm, an STC tape robot (6,000 tapes, 8 readers), and a 100-drive DLT tape farm]
• Data Motel: data checks in, but it never checks out
107
Tape & Optical:
Beware of the Media Myth
• Optical is cheap: 200 $/platter, 2 GB/platter
  => 100 $/GB (2x cheaper than disc)
• Tape is cheap: 50 $/tape, 20 GB/tape
  => 2.5 $/GB (100x cheaper than disc).
108
Tape & Optical Reality:
Media is 10% of System Cost
• Tape needs a robot (10 k$ … 3 m$)
  – 10 … 1000 tapes (at 20 GB each) => 20 $/GB … 200 $/GB
    (1x … 10x cheaper than disc)
• Optical needs a robot (100 k$)
  – 100 platters = 200 GB (TODAY) => 400 $/GB
    (more expensive than mag disc)
• Robots have poor access times
  – Not good for Library of Congress (25 TB)
  – Data motel: data checks in but it never checks out!
109
The Access Time Myth

• The myth: seek or pick time dominates
• The reality:
  (1) queuing dominates
  (2) transfer dominates BLOBs
  (3) disk seeks are often short
• Implication: many cheap servers beat one fast expensive server
  – shorter queues
  – parallel transfer
  – lower cost/access and cost/byte
• This is now obvious for disk arrays
• This will be obvious for tape arrays
[Pie chart: request time split among wait, transfer, rotate, and seek]
110
Billions Of Clients

• Every device will be “intelligent”
• Doors, rooms, cars…
• Computing will be ubiquitous
111
Billions Of Clients
Need Millions Of Servers

• All clients networked to servers
  – may be nomadic or on-demand
• Fast clients want faster servers
• Servers provide
  – shared data
  – control
  – coordination
  – communication
[Figure: mobile and fixed clients connected to servers and super-servers]
112
1987: 256 tps Benchmark

• 14 M$ computer (Tandem)
• A dozen people
• False floor, 2 rooms of machines
[Figure: the team — admin expert, hardware experts, network expert, manager, performance expert, DB expert, auditor, OS expert — around a 32-node processor array, a 40 GB disk array (80 drives), and a simulated 25,600 clients]
113
1988: DB2 + CICS Mainframe
65 tps

• IBM 4391
• Simulated network of 800 clients
• 2 m$ computer
• Staff of 6 to do the benchmark
[Figure: refrigerator-sized CPU, 2 x 3725 network controllers, 16 GB disk farm (4 x 8 x .5 GB)]
114
1997: 10 years later
1 Person and 1 box = 1250 tps

• 1 Breadbox ~ 5x 1987 machine room
• 23 GB is hand-held
• One person does all the work
• Cost/tps is 1,000x less
• 25 micro-dollars per transaction
[Figure: one person is the hardware, OS, net, DB, and app expert for a box with 4 x 200 MHz cpus, 1/2 GB DRAM, 12 x 4 GB disks, and 3 x 7 x 4 GB disk arrays]
115
What Happened?

• Moore’s law: things get 4x better every 3 years
  (applies to computers, storage, and networks)
• New economics: commodity

  class            price/mips ($/mips)   software (k$/year)
  mainframe        10,000                100
  minicomputer     100                   10
  microcomputer    10                    1

• GUI: human-computer tradeoff
  – optimize for people, not computers
116
What Happens Next

• Last 10 years: 1000x improvement
• Next 10 years: ????
• Today: text and image servers are free
  – 25 m$/hit => advertising pays for them
• Future: video, audio, … servers are free
• “You ain’t seen nothing yet!”
[Timeline: 1985, 1995, 2005]
117
Smart Cards
• Then (1979): Bull CP8 two-chip card; first public demonstration 1979
• Now (1997): EMV card with dynamic authentication
  (EMV = Europay, MasterCard, Visa standard)
• Door key, vending machines, photocopiers
• Courtesy of Dennis Roberson, NCR
118
Smart Card Memory Capacity

• 16 KB today, but growing super-exponentially
[Chart: memory size in bits vs. year, from 3 K (1990) through 10 K and 1 M to 300 M (2004), with “You are here” near today]
• Applications: cards will be able to store
  – data (e.g. medical)
  – books, movies, …
  – money
• Source: PIN/Card-Tech / Courtesy of Dennis Roberson, NCR
119