What Happens When Processing, Storage, and Bandwidth are Free and Infinite?
Jim Gray
Microsoft Research
1
Outline

Hardware CyberBricks
– all nodes are very intelligent

Software CyberBricks
– standard way to interconnect intelligent nodes

What next?
– Processing migrates to where the power is
• Disk, network, display controllers have full-blown OS
• Send RPCs to them (SQL, Java, HTTP, DCOM, CORBA)
• Computer is a federated distributed system.
2
A Hypothetical Question
Taking things to the limit

Moore’s law 100x per decade:
– Exa-instructions per second in 30 years
– Exa-bit memory chips
– Exa-byte disks

Gilder’s Law of the Telecosm
3x/year more bandwidth
60,000x per decade!
– 40 Gbps per fiber today
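
The growth rates on this slide are compound rates, so the per-decade figures can be checked with a few lines of arithmetic. The sketch below is only an illustration; the helper name `growth` is an assumption, not something from the talk.

```python
def growth(yearly_factor: float, years: int) -> float:
    """Total improvement after compounding a yearly factor for `years` years."""
    return yearly_factor ** years

gilder_decade = growth(3, 10)      # bandwidth: 3x per year for ten years
moore_yearly = 100 ** (1 / 10)     # yearly factor implied by 100x per decade
moore_30_years = growth(100, 3)    # three decades at 100x each

print(f"Gilder: {gilder_decade:,.0f}x per decade")
print(f"Moore: {moore_yearly:.2f}x per year, {moore_30_years:.0e}x in 30 years")
```

3x per year compounds to roughly the 60,000x per decade quoted, and 100x per decade compounds to a factor of a million over 30 years.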
3
Grove’s Law



Link Bandwidth doubles every 100 years!
Not much has happened to telephones lately
Still twisted pair
4
Gilder’s Telecosm Law:
3x bandwidth/year for 25 more years

Today:
– 10 Gbps per channel
– 4 channels per fiber: 40 Gbps
– 32 fibers/bundle = 1.2 Tbps/bundle



In lab 3 Tbps/fiber (400 x WDM)
In theory 25 Tbps per fiber
1 Tbps = USA 1996 WAN bisection bandwidth
1 fiber = 25 Tbps
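
The per-fiber and per-bundle figures multiply out as follows. This is a small check using only the numbers on the slide; the constant names are illustrative.

```python
GBPS_PER_CHANNEL = 10
CHANNELS_PER_FIBER = 4
FIBERS_PER_BUNDLE = 32

fiber_gbps = GBPS_PER_CHANNEL * CHANNELS_PER_FIBER    # 40 Gbps per fiber
bundle_tbps = fiber_gbps * FIBERS_PER_BUNDLE / 1000   # 1.28 Tbps, which the slide rounds to 1.2

THEORY_TBPS_PER_FIBER = 25
USA_1996_WAN_TBPS = 1
# How many 1996 USA WAN bisections fit in one fiber at the theoretical limit.
wan_equivalents = THEORY_TBPS_PER_FIBER / USA_1996_WAN_TBPS

print(fiber_gbps, bundle_tbps, wan_equivalents)
```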
5
Thesis
Many little beat few big
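[Figure: machine classes shrinking from Mainframe through Mini, Micro, and Nano to a Pico Processor of 1 mm^3, against prices of $1 million, $100 K, and $10 K; disc form factors shrinking from 14" through 9", 5.25", 3.5", and 2.5" to 1.8"]
Pico Processor storage hierarchy:
– 1 MB of 10 pico-second ram
– 100 MB of 10 nano-second ram
– 10 GB of 10 microsecond ram
– 1 TB of 10 millisecond disc
– 100 TB of 10 second tape archive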
Smoking, hairy golf ball
How to connect the many little parts?
How to program the many little parts?
Fault tolerance?
1 M SPEC marks, 1TFLOP
10^6 clocks to bulk ram
Event-horizon on chip
VM reincarnated
Multi-program cache,
On-Chip SMP
6
Year 2000 4B Machine
The Year 2000 commodity PC
– Billion Instructions/Sec (1 Bips Processor)
– .1 Billion Bytes RAM
– Billion Bits/s Net
– 10 Billion Bytes Disk
– Billion Pixel display (3000 x 3000 x 24)
– 1,000 $
7
4 B PC’s:
The Bricks of Cyberspace
Cost 1,000 $
Come with
– OS (NT, POSIX,..)
– DBMS
– High speed Net
– System management
– GUI / OOUI
– Tools
Compatible with everyone else
CyberBricks

8
Super Server: 4T Machine

Array of 1,000 4B machines
– 1 Bips processors
– 1 BB DRAM
– 10 BB disks
– 1 Bbps comm lines
– 1 TB tape robot
A few megabucks
Challenge:
– Manageability
– Programmability
– Security
– Availability
– Scaleability
– Affordability
– As easy as a single system
Future servers are CLUSTERS of processors, discs
Distributed database techniques make clusters work
[Figure: Cyber Brick, a 4B machine: CPU, 5 GB RAM, 50 GB Disc]
9
Functionally Specialized Cards

Storage, Network, and Display cards: each has an ASIC, a P mips processor, and M MB DRAM
Today: P= 50 mips, M= 2 MB
In a few years: P= 200 mips, M= 64 MB
10
It’s Already True of Printers
Peripheral = CyberBrick


You buy a printer
You get
– several network interfaces
– A Postscript engine
• cpu,
• memory,
• software,
• a spooler (soon)
– and… a print engine.
11
System On A Chip

Integrate Processing with memory on one chip
– chip is 75% memory now
– 1MB cache >> 1960 supercomputers
– 256 Mb memory chip is 32 MB!
– IRAM, CRAM, PIM,… projects abound

Integrate Networking with processing on one chip
– system bus is a kind of network
– ATM, FiberChannel, Ethernet,.. Logic on chip.
– Direct IO (no intermediate bus)

Functionally specialized cards shrink to a chip.
12
All Device Controllers will be Cray 1’s

TODAY
– Disk controller is 10 mips risc engine
with 2MB DRAM
– NIC is similar power

SOON
– Will become 100 mips systems
with 100 MB DRAM.

They are nodes in a federation
(can run Oracle on NT in disk controller).

Advantages
– Uniform programming model
– Great tools
– Security
– economics (cyberbricks)
– Move computation to data (minimize traffic)
[Figure: Central Processor & Memory on a Tera Byte Backplane]
13
With Tera Byte Interconnect
and Super Computer Adapters

Processing is incidental to
– Networking
– Storage
– UI

Disk Controller/NIC is
– faster than device
– close to device
– Can borrow device
package & power


Tera Byte
Backplane
So use idle capacity for computation.
Run app in device.
14
Implications
Conventional
– Offload device handling to NIC/HBA
– higher level protocols: I2O, NASD, VIA…
– SMP and Cluster parallelism is important.
Radical
– Move app to NIC/device controller
– higher-higher level protocols: CORBA / DCOM.
– Cluster parallelism is VERY important.
[Figure: Central Processor & Memory on a Tera Byte Backplane]
15
How Do They Talk to Each Other?





Each node has an OS
Each node has local resources: A federation.
Each node does not completely trust the others.
Nodes use RPC to talk to each other
– CORBA? DCOM? IIOP? RMI?
– One or all of the above.
Huge leverage in high-level interfaces.
Same old distributed system story.
[Figure: on each node, Applications sit on RPC, streams, and datagrams (marked "?"), over VIAL/VIPL, joined by Wire(s)]
16
Objects!




It’s a zoo
ORBs, COM, CORBA,..
Object Relational Databases
Objects and 3-tier computing
18
History and Alphabet Soup
Microsoft DCOM based on OSF-DCE Technology
DCOM and ActiveX extend it
[Figure: timeline 1985, 1990, 1995 showing Open Software Foundation (OSF), X/Open, UNIX International, Object Management Group (OMG), and the Open Group, with OSF DCE, CORBA, Solaris, NT, and COM]
19
The Promise





Objects are Software CyberBricks
– productivity breakthrough (plug ins)
– manageability breakthrough (modules)
Microsoft Promises Cairo: distributed objects, secure, transparent, fast invocation
IBM/Sun/Oracle/Netscape promise CORBA + Open Doc + Java Beans +
All will deliver
Customers can pick the best one
Both camps share key goals:
– Encapsulation: hide implementation
– Polymorphism: generic ops, key to GUI and reuse
– Uniform Naming
– Discovery: finding a service
– Fault handling: transactions
– Versioning: allow upgrades
– Transparency: local/remote
– Security: who has authority
– Shrink-wrap: minimal inheritance
– Automation: easy
20
The OLE-COM Experience


Macintosh had Publish & Subscribe
PowerPoint needed graphs:
– plugged MS Graph in as a component.

Office adopted OLE
– one graph program for all of office

Internet arrived
– URLs are object references,
– Office is Web Enabled right away!


Office97 smaller than Office95
because of shared components
It works!!
21
Linking And Embedding
Objects are data modules;
transactions are execution modules

Link: pointer to object
somewhere else
– Think URL in Internet


Embed: bytes
are here
Objects may be active; can
callback to subscribers
22
Objects Meet Databases
basis for universal data servers, access, & integration
– Object-oriented (COM oriented) interface to data
– Breaks DBMS into components
– Anything can be a data source
– Optimization/navigation “on top of” other data sources
– Makes an RDBMS an O-R DBMS assuming optimizer understands objects
[Figure: DBMS engine over data sources: Database, Spreadsheet, Photos, Mail, Map, Document]
23
The BIG Picture
Components and transactions





Software modules are objects
Object Request Broker (a.k.a., Transaction Processing Monitor)
connects objects
(clients to servers)
Standard interfaces allow software plug-ins
Transaction ties execution of a “job” into an atomic unit:
all-or-nothing, durable, isolated
ActiveX Components are a 250M$/year business.
Object Request Broker
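
The all-or-nothing property of a transaction can be seen in a few lines. The sketch below uses Python's standard `sqlite3` module purely as a stand-in for whatever resource manager an ORB or TP monitor would coordinate; the `account` table, the names, and the `transfer` function are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE account (name TEXT PRIMARY KEY, "
    "balance INTEGER CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO account VALUES (?, ?)", [("mom", 100), ("dad", 50)])
conn.commit()

def transfer(conn, src, dst, amount):
    """Move money as one atomic unit: both updates happen or neither does."""
    with conn:  # commits on success, rolls back if an exception escapes
        conn.execute(
            "UPDATE account SET balance = balance + ? WHERE name = ?", (amount, dst)
        )
        conn.execute(
            "UPDATE account SET balance = balance - ? WHERE name = ?", (amount, src)
        )

transfer(conn, "mom", "dad", 30)        # succeeds
try:
    transfer(conn, "mom", "dad", 500)   # debit violates the CHECK constraint
except sqlite3.IntegrityError:
    pass                                # the credit to dad was rolled back too

balances = dict(conn.execute("SELECT name, balance FROM account"))
print(balances)
```

The failed transfer leaves no trace: the credit that had already been applied is undone when the debit fails.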
24
Object Request Broker (ORB)







Orchestrates RPC
Registers Servers
Manages pools of servers
Connects clients to servers
Does Naming, request-level authorization,
Provides transaction coordination
Direct and queued invocation
Old names:
– Transaction Processing Monitor,
– Web server,
– NetWare
25
The OO Points So Far

Objects are software Cyber Bricks
Object interconnect standards are emerging
Cyber Bricks become Federated Systems.

Next points:


– put processing close to data
– do parallel processing.
26
Three Tier Computing

Clients do presentation, gather input
Clients do some workflow (Xscript)
Clients send high-level requests to ORB
ORB dispatches work-flows and business objects -- proxies for client, orchestrate flows & queues
Server-side workflow scripts call on distributed business objects to execute task
[Figure: tiers from Presentation and workflow, through Business Objects, to Database]
27
The Three Tiers
[Figure: Web Client (HTML, VB or Java plug-ins, VBScript, JavaScript) connects over the Internet via HTTP+DCOM to Middleware (ORB, TP Monitor, Web Server..., with a VB or Java Script Engine, a VB or Java Virt Machine, and an Object server Pool), which uses DCOM (oleDB, ODBC,...) to reach the Object & Data server and IBM Legacy Gateways]
28
Transaction Processing
Evolution to Three Tier




Intelligence migrated to clients
– Mainframe Batch processing (centralized)
– Dumb terminals & Remote Job Entry
– Intelligent terminals, database backends
– Workflow Systems, Object Request Brokers, Application Generators
[Figure: each stage drawn between clients and a Mainframe or Server, labeled cards, green screen 3270, TP Monitor, ORB, Active]
29
Web Evolution to Three Tier
Intelligence migrated to clients (like TP)
– Character-mode clients, smart servers (WAIS, archie, gopher; green screen)
– GUI Browsers - Web file servers (Mosaic)
– GUI Plugins - Web dispatchers - CGI (NS & IE)
– Smart clients - Web dispatcher (ORB), pools of app servers (ISAPI, Viper), workflow scripts at client & server (Active)
[Figure: clients talking to a Web Server at each stage]
30
PC Evolution to Three Tier
Intelligence migrated to server
– Stand-alone PC (centralized)
– PC + File & print server: message per I/O
– PC + Database server: message per SQL statement
– PC + App server: message per transaction
[Figure: the messages at each stage: IO request and reply with disk I/O; SQL Statement; Transaction between ActiveX Client, ORB, ActiveX server, Xscript]
31
Why Did Everyone Go To Three Tier?

Manageability
– Business rules must be with data
– Middleware operations tools

Performance (scaleability)
– Server resources are precious
– ORB dispatches requests to server pools

Technology & Physics
– Put UI processing near user
– Put shared data processing near shared data
– Minimizes data moves
– Encapsulate / modularity
[Figure: Presentation, workflow, Business Objects, Database]
32
Why Put Business Objects at
Server?
DAD’s Raw Data
– Customer comes to store
– Takes what he wants
– Fills out invoice
– Leaves money for goods
– Easy to build
– No clerks
MOM’s Business Objects
– Customer comes to store with list
– Gives list to clerk
– Clerk gets goods, makes invoice
– Customer pays clerk, gets goods
– Easy to manage
– Clerk controls access
– Encapsulation
33
The OO Points So Far





Objects are software Cyber Bricks
Object interconnect standards are emerging
Cyber Bricks become Federated Systems.
Put processing close to data
Next point:
– do parallel processing.
34
Parallelism:
the OTHER half of Super-Servers

Clusters of machines allow two kinds of
parallelism
– Many little jobs:
Online transaction processing
• TPC A, B, C,…
– A few big jobs: data search & analysis
• TPC D, DSS, OLAP

Both give automatic Parallelism
35
Why Parallel Access To Data?
At 10 MB/s: 1.2 days to scan 1 Terabyte
1,000 x parallel (each at 10 MB/s): 100 second SCAN of 1 Terabyte
Parallelism:
divide a big problem
into many smaller ones
to be solved in parallel.
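
The scan times on this slide follow directly from size divided by aggregate bandwidth. A minimal sketch, assuming decimal units (1 Terabyte = 10^12 bytes) and perfectly even partitioning; the function name `scan_seconds` is illustrative.

```python
TERABYTE = 10**12
MB = 10**6

def scan_seconds(size_bytes: int, bytes_per_second: int, disks: int = 1) -> float:
    """Time to read everything once when the data is spread evenly over `disks`."""
    return size_bytes / (bytes_per_second * disks)

serial = scan_seconds(TERABYTE, 10 * MB)                # one 10 MB/s stream
parallel = scan_seconds(TERABYTE, 10 * MB, disks=1000)  # 1,000 streams at once

print(f"serial:   {serial:,.0f} s = {serial / 86400:.2f} days")
print(f"parallel: {parallel:,.0f} s")
```

The serial scan takes 100,000 seconds, a little over a day, while the 1,000-way scan takes 100 seconds.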
36
Kinds of Parallel Execution
Pipeline
[Figure: any sequential program feeding the next sequential program]
Partition
outputs split N ways
inputs merge M ways
[Figure: copies of any sequential program running side by side between a split and a merge]
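
Both kinds of parallel execution can be sketched around an ordinary sequential program. In the illustration below the stage names (`select_even`, `square`), the round-robin `split`, and the use of a thread pool are all assumptions made for the example. Python threads show the structure rather than real speedup.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor

def select_even(rows):
    """Pipeline stage: pass along only the even values."""
    for row in rows:
        if row % 2 == 0:
            yield row

def square(rows):
    """Pipeline stage: consume one stream, emit another."""
    for row in rows:
        yield row * row

def sequential_program(rows):
    # Pipeline parallelism: each stage streams into the next one.
    return sorted(square(select_even(rows)))

def split(rows, n):
    """Partition the input n ways (round-robin)."""
    return [rows[i::n] for i in range(n)]

data = list(range(20))
N = 4
# Partition parallelism: run the same sequential program on each partition.
with ThreadPoolExecutor(max_workers=N) as pool:
    partial_results = list(pool.map(sequential_program, split(data, N)))

# Merge the N sorted outputs back into one stream.
merged = list(heapq.merge(*partial_results))
print(merged)
```

The merged result equals what the unpartitioned program produces, which is the point: the sequential program itself did not have to change.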
37
Why are Relational Operators
Successful for Parallelism?
Relational data model
uniform operators
on uniform data stream
Closed under composition
Each operator consumes 1 or 2 input streams
Each stream is a uniform collection of data
Sequential data in and out: Pure dataflow
partitioning some operators (e.g. aggregates, non-equi-join, sort,..)
requires innovation
AUTOMATIC PARALLELISM
38
Database Systems
“Hide” Parallelism

Automate system management via tools
– data placement
– data organization (indexing)
– periodic tasks (dump / recover / reorganize)

Automatic fault tolerance
– duplex & failover
– transactions

Automatic parallelism
– among transactions (locking)
– within a transaction (parallel execution)
39
SQL a Non-Procedural
Programming Language
SQL: functional programming language
describes answer set.
Optimizer picks best execution plan
– Picks data flow web (pipeline),
– degree of parallelism (partitioning)
– other execution parameters (process placement, memory,...)
[Figure: Execution Planning and Monitor: Schema, GUI, Optimizer, Plan, Executors, Rivers]
40
Automatic Data Partitioning
Split a SQL table to subset of nodes & disks
Partition within set:
– Range (A...E, F...J, K...N, O...S, T...Z): good for equijoins, range queries, group-by
– Hash: good for equijoins
– Round Robin: good to spread load
Shared disk and memory less sensitive to partitioning,
Shared nothing benefits from "good" partitioning
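
The three partitioning schemes amount to three ways of mapping a row key to a node. The sketch below assumes five nodes and the A...E through T...Z ranges from the slide. The function names and the use of CRC32 as the hash are illustrative choices, not part of any particular DBMS.

```python
import zlib
from itertools import count

NODES = 5
RANGE_UPPER_BOUNDS = ["E", "J", "N", "S", "Z"]  # A...E, F...J, K...N, O...S, T...Z

def range_partition(key: str) -> int:
    """Keys that sort near each other land on the same node."""
    first = key[0].upper()
    for node, upper in enumerate(RANGE_UPPER_BOUNDS):
        if first <= upper:
            return node
    raise ValueError(f"key {key!r} is outside the partitioned range")

def hash_partition(key: str, nodes: int = NODES) -> int:
    """Equal keys always land on the same node, so equijoins stay local."""
    return zlib.crc32(key.encode("utf-8")) % nodes

def make_round_robin(nodes: int = NODES):
    """Ignore the key entirely and deal rows out in turn to spread load."""
    counter = count()
    return lambda _key: next(counter) % nodes

round_robin = make_round_robin()
for name in ["Adams", "Gray", "Moore", "Turing"]:
    print(name, range_partition(name), hash_partition(name), round_robin(name))
```

Range partitioning keeps neighbouring keys together, which helps range queries and group-by. Hashing only guarantees that equal keys meet, and round robin guarantees nothing about placement except balance.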
41
N x M way Parallelism
[Figure: data partitioned A...E, F...J, K...N, O...S, T...Z feeds parallel Join operators, whose outputs flow to parallel Sort operators and then to Merge operators]
N inputs, M outputs, no bottlenecks.
42
Parallel Objects?


How does all this DB parallelism connect to
hardware/software Cyber Bricks?
To scale to large client sets
– need lots of independent parallel execution.
– Comes for free from ORB.

To scale to large data sets
– need intra-program parallelism (like parallel DBs)
– Requires some invention.
43
Outline

Hardware CyberBricks
– all nodes are very intelligent

Software CyberBricks
– standard way to interconnect intelligent nodes

What next?
– Processing migrates to where the power is
• Disk, network, display controllers have full-blown OS
• Send RPCs to them (SQL, Java, HTTP, DCOM, CORBA)
• Computer is a federated distributed system.
• Parallel execution is important
44
MORE SLIDES
but there is only so
much time.
Too bad
45
The Disk Farm On a Card
The 100GB disc card
An array of discs
Can be used as
100 discs
1 striped disc
10 Fault Tolerant discs
....etc
LOTS of accesses/second
bandwidth
14"
Life is cheap, it’s the accessories that cost ya.
Processors are cheap, it’s the peripherals that cost ya
(a 10k$ disc card).
46
Parallelism:
Performance is the Goal
Goal is to get 'good' performance.
Trade time for money.
Law 1: parallel system should be
faster than serial system
Law 2: parallel system should give
near-linear scaleup or
near-linear speedup or
both.
Parallel DBMSs obey these laws
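
Speedup and scaleup are usually defined as ratios of elapsed times. The slide does not spell the definitions out, so the sketch below assumes the usual ones from the parallel database literature. The timings in the example are hypothetical.

```python
def speedup(small_system_seconds: float, big_system_seconds: float) -> float:
    """Same problem on a bigger system: how many times faster it finishes."""
    return small_system_seconds / big_system_seconds

def scaleup(small_system_small_problem_seconds: float,
            big_system_big_problem_seconds: float) -> float:
    """N times the hardware on N times the problem: 1.0 means linear scaleup."""
    return small_system_small_problem_seconds / big_system_big_problem_seconds

# Hypothetical timings: 1 disk scans 1 TB in 100,000 s; 1,000 disks take 100 s.
s = speedup(100_000, 100)

# Hypothetical timings: 1 disk scans 1 GB in 100 s; 1,000 disks scan 1 TB in 100 s.
c = scaleup(100, 100)

print(f"speedup = {s:.0f}x on 1,000x hardware, scaleup = {c:.1f}")
```

Linear speedup means N times the hardware gives N times the speed. Linear scaleup means the elapsed time stays flat as problem and hardware grow together.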
47
Success Stories

Online Transaction Processing
– many little jobs
– SQL systems support
• 50 k tpm-C (44 cpu, 600 disk, 2 node)

Batch (decision support and Utility)
– few big jobs, parallelism inside
– Scan data at 100 MB/s
– Linear Scaleup to 1,000 processors
48
The New Law of Computing
Grosch's Law: 2x $ is 4x performance
– 1 MIPS costs 1 $
– 1,000 MIPS costs 32 $ (.03 $/MIPS)
Parallel Law: 2x $ is 2x performance
– 1 MIPS costs 1 $
– 1,000 MIPS costs 1,000 $
– Needs Linear Speedup and Linear Scaleup
– Not always possible
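
The two cost curves on this slide can be reproduced from their rules of thumb. Under Grosch's Law performance grows as the square of cost, so cost grows as the square root of performance. Under the parallel law cost grows linearly. The function names below are illustrative.

```python
import math

def grosch_cost(mips: float, base_cost: float = 1.0, base_mips: float = 1.0) -> float:
    """Grosch's Law: 2x the money buys 4x the performance."""
    return base_cost * math.sqrt(mips / base_mips)

def parallel_cost(mips: float, base_cost: float = 1.0, base_mips: float = 1.0) -> float:
    """Parallel law: 2x the money buys 2x the performance (at best)."""
    return base_cost * mips / base_mips

g = grosch_cost(1000)
p = parallel_cost(1000)
print(f"Grosch:   {g:.0f} $ for 1,000 MIPS ({g / 1000:.2f} $/MIPS)")
print(f"Parallel: {p:.0f} $ for 1,000 MIPS ({p / 1000:.2f} $/MIPS)")
```

The square root of 1,000 is about 32, which gives the slide's 32 $ and .03 $/MIPS. The parallel system stays at 1 $/MIPS however large it gets.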
49
Clusters being built







Teradata 1,000 nodes
(30k$/slice)
Tandem,VMScluster 150 nodes
(100k$/slice)
Intel, 9,000 nodes @ 55M$
( 6k$/slice)
Teradata, Tandem, DEC
moving to NT+low slice price
IBM: 512 nodes ASCI @ 100m$
(200k$/slice)
PC clusters (bare handed) at dozens of nodes
web servers (msn, PointCast,…), DB servers
KEY TECHNOLOGY HERE IS THE APPS.
– Apps distribute data
– Apps distribute execution
50
BOTH SMP and Cluster?
Grow Up with SMP
4xP6 is now standard
SMP
Super Server
Grow Out with Cluster
Cluster has inexpensive parts
Departmental
Server
Cluster
of PCs
Personal
System
52
Clusters Have Advantages

Clients and Servers made from the same stuff.

Inexpensive:
– Built with commodity components

Fault tolerance:
– Spare modules mask failures

Modular growth
– grow by adding small modules
53
Meta-Message:
Technology Ratios Are Important

If everything gets faster & cheaper
at the same rate
THEN nothing really changes.

Things getting MUCH BETTER:
– communication speed & cost 1,000x
– processor speed & cost 100x
– storage size & cost 100x

Things staying about the same
– speed of light (more or less constant)
– people (10x more expensive)
– storage speed (only 10x better)
54
Storage Ratios Changed



10x better access time
10x more bandwidth
4,000x lower media price
DRAM/DISK 100:1 to 10:10 to 50:1
[Figure: three plots against Year (1980, 1990, 2000): Disk Performance vs Time (Capacity (GB), seeks per second, bandwidth: MB/s); Disk accesses/second vs Time (Accesses per Second); Storage Price vs Time (Megabytes per kilo-dollar, MB/k$)]
55
Performance = Storage Accesses
not Instructions Executed
In the “old days” we counted instructions and IO’s
Now we count memory references
Processors wait most of the time
70 MIPS
“real” apps have worse Icache misses so run at 60 MIPS if well tuned, 20 MIPS if not
[Figure: Where the time goes: clock ticks used by AlphaSort components (Disc Wait, Sort, OS, Memory Wait, B-Cache Data Miss, I-Cache Miss, D-Cache Miss)]
59
Storage Latency:
How Far Away is the Data?
Clock ticks to reach the data, and how far away that feels:
– Registers: 1 (My Head, 1 min)
– On Chip Cache: 2 (This Room)
– On Board Cache: 10 (This Campus, 10 min)
– Memory: 100 (Sacramento, 1.5 hr)
– Disk: 10^6 (Pluto, 2 Years)
– Tape /Optical Robot: 10^9 (Andromeda, 2,000 Years)
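
The human-scale distances in this figure come from reading one clock tick as one minute. The sketch below redoes that conversion. The tick counts are the ones in the figure, while the helper name `human_scale` and the rounding choices are assumptions.

```python
LEVELS = [
    ("Registers", 1),
    ("On Chip Cache", 2),
    ("On Board Cache", 10),
    ("Memory", 100),
    ("Disk", 10**6),
    ("Tape /Optical Robot", 10**9),
]

def human_scale(ticks: int) -> str:
    """Express a latency in clock ticks as if each tick took one minute."""
    minutes = ticks
    if minutes < 60:
        return f"{minutes:g} min"
    hours = minutes / 60
    if hours < 24:
        return f"{hours:.1f} hr"
    years = hours / 24 / 365
    return f"{years:,.0f} years"

for name, ticks in LEVELS:
    print(f"{name:<20} {ticks:>13,} ticks  ~ {human_scale(ticks)}")
```

The results land close to the figure's rounded labels: about 1.7 hours for memory, about 2 years for disk, and about 1,900 years for the tape robot.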
60
Tape Farms for Tertiary Storage
Not Mainframe Silos
One 10K$ robot: 14 tapes, 500 GB, 5 MB/s, 20$/GB, 30 Maps, Scan in 27 hours.
100 robots: 1M$, 50TB, 50$/GB, 3K Maps, 27 hr Scan
many independent tape robots (like a disc farm)
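
The single-robot figures and the farm totals can be checked from the robot's price, capacity, and drive speed. A minimal sketch, assuming decimal units and that every robot scans its own tapes independently; the constant names are illustrative.

```python
GB = 10**9
MB = 10**6

ROBOT_PRICE_DOLLARS = 10_000
ROBOT_CAPACITY_GB = 500
DRIVE_MB_PER_SECOND = 5
ROBOTS_IN_FARM = 100

# One robot reading all of its own tapes through its own drive.
scan_hours = ROBOT_CAPACITY_GB * GB / (DRIVE_MB_PER_SECOND * MB) / 3600
robot_dollars_per_gb = ROBOT_PRICE_DOLLARS / ROBOT_CAPACITY_GB

# A farm of independent robots: price and capacity add up, scan time does not.
farm_price_dollars = ROBOTS_IN_FARM * ROBOT_PRICE_DOLLARS
farm_capacity_tb = ROBOTS_IN_FARM * ROBOT_CAPACITY_GB / 1000

print(f"scan: {scan_hours:.1f} hr, robot: {robot_dollars_per_gb:.0f} $/GB")
print(f"farm: {farm_price_dollars:,} $ for {farm_capacity_tb:.0f} TB")
```

Because the robots work in parallel, the farm scans 50 TB in the same 27 or so hours that one robot needs for 500 GB.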
61
The Metrics:
Disk and Tape Farms Win
Data Motel: Data checks in, but it never checks out
[Figure: log-scale bar chart (0.01 to 1,000,000) of GB/K$, Kaps, Maps, and SCANS/Day for a 1000 x Disc Farm, an STC Tape Robot (6,000 tapes, 8 readers), and a 100x DLT Tape Farm]
62
Tape & Optical:
Beware of the Media Myth
Optical is cheap: 200 $/platter, 2 GB/platter => 100$/GB (2x cheaper than disc)
Tape is cheap: 30 $/tape, 20 GB/tape => 1.5 $/GB (100x cheaper than disc).
63
Tape & Optical Reality:
Media is 10% of System Cost
Tape needs a robot (10 k$ ... 3 m$ )
10 ... 1000 tapes (at 20GB each) => 20$/GB ... 200$/GB
(1x…10x cheaper than disc)
Optical needs a robot (100 k$ )
100 platters = 200GB ( TODAY ) => 400 $/GB
( more expensive than mag disc )
Robots have poor access times
Not good for Library of Congress (25TB)
Data motel: data checks in but it never checks out!
64
The Access Time Myth
The Myth: seek or pick time dominates
The reality: (1) Queuing dominates
(2) Transfer dominates BLOBs
(3) Disk seeks often short
Implication: many cheap servers
better than one fast expensive server
– shorter queues
– parallel transfer
– lower cost/access and cost/byte
This is now obvious for disk arrays
This will be obvious for tape arrays
[Figure: access timelines broken into Wait, Seek, Rotate, and Transfer]
65
Billions Of Clients



Every device will be “intelligent”
Doors, rooms, cars…
Computing will be ubiquitous
66
Billions Of Clients
Need Millions Of Servers

All clients networked to servers
– May be nomadic or on-demand
Fast clients want faster servers
Servers provide
– Shared Data
– Control
– Coordination
– Communication
[Figure: Clients (mobile clients, fixed clients) connected to Servers (server, super server)]
67
1987: 256 tps Benchmark



14 M$ computer (Tandem)
A dozen people
False floor, 2 rooms of machines
A 32 node processor array
Simulate 25,600 clients
A 40 GB disk array (80 drives)
[Figure: the people involved: Manager, Admin expert, Hardware experts, Network expert, Performance expert, DB expert, OS expert, Auditor]
68
1988: DB2 + CICS Mainframe
65 tps




IBM 4391
Simulated network of 800 clients
2m$ computer
Staff of 6 to do benchmark
2 x 3725
network controllers
Refrigerator-sized
CPU
16 GB disk farm
4 x 8 x .5GB
69
1997: 10 years later
1 Person and 1 box = 1250 tps




1 Breadbox ~ 5x 1987 machine room
23 GB is hand-held
One person does all the work
Cost/tps is 1,000x less
25 micro dollars per transaction
4x200 MHz cpu
1/2 GB DRAM
12 x 4GB disk
3 x 7 x 4GB disk arrays
[Figure: one person acting as Hardware expert, OS expert, Net expert, DB expert, and App expert]
70
What Happened?

Moore’s law:
Things get 4x better every 3 years
(applies to computers, storage, and networks)

New Economics: Commodity
class           price/mips ($/mips)   software (k$/year)
mainframe       10,000                100
minicomputer    100                   10
microcomputer   10                    1

GUI: Human - computer tradeoff
optimize for people, not computers
71
What Happens Next




Last 10 years: 1000x improvement
Next 10 years: ????
Today: text and image servers are free
25 m$/hit => advertising pays for them
Future: video, audio, … servers are free
“You ain’t seen nothing yet!”
[Figure: improvement over time, with the axis marked 1985, 1995, 2005]
72
Smart Cards
Then (1979): Bull CP8 two chip card, first public demonstration 1979
Now (1997): EMV card with dynamic authentication (EMV=Europay, MasterCard, Visa standard)
door key, vending machines, photocopiers
Courtesy of Dennis Roberson NCR
73
Smart Card Memory Capacity
16 KB today but growing super-exponentially
Cards will be able to store
– data (e.g. medical)
– books, movies,…
– money
[Figure: Memory Size (Bits), from 3K through 10 K and 1M to 300 M, by year from 1990 to 2004, with a “You are here” marker and Applications]
Source: PIN/Card -Tech/ Courtesy of Dennis Roberson NCR
74