Computer
Technology Forecast
Jim Gray
Microsoft Research
Gray@Microsoft.com
http://research.Microsoft.com/~Gray
Reality Check
• Good news
– In the limit, processing & storage & network is free
– Processing & network is infinitely fast
• Bad news
– Most of us live in the present.
– People are getting more expensive.
Management/programming cost exceeds hardware cost.
– Speed of light not improving.
– WAN prices have not changed much in last 8 years.
Interesting Topics
• I’ll talk about server-side hardware
• What about client hardware?
– Displays, cameras, speech,….
• What about Software?
– Databases, data mining, PDB, OODB
– Objects / class libraries …
– Visualization
– Open Source movement
How Much Information
Is there?
• Soon everything can be recorded and indexed
• Most data will never be seen by humans
• Precious resource: human attention
• Auto-summarization and auto-search are the key technologies.
[Chart: a byte-scale ladder from kilo (10^3) through mega, giga, tera, peta, exa, zetta, to yotta (10^24), with examples at each rung: a book, a photo, a movie, all LoC books (words), all books (multimedia), and "Everything Recorded!" near the top. Source: www.lesk.com/mlesk/ksg97/ksg.html]
Moore’s Law
• Performance/Price doubles every 18 months
• 100x per decade
• Progress in next 18 months
= ALL previous progress
– New storage = sum of all old storage (ever)
– New processing = sum of all old processing.
• E. coli doubles every 20 minutes!
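A quick back-of-the-envelope check of the doubling arithmetic, as a sketch in Python using the slide's own 18-month period:

```python
# Moore's law arithmetic: doubling every 18 months gives ~100x per decade,
# and each new doubling slightly exceeds the sum of all prior progress.
doubling_months = 18
decade_months = 120

print(f"Growth per decade: {2 ** (decade_months / doubling_months):.0f}x")  # ~100x

# Geometric series: 1 + 2 + 4 + ... + 2^(n-1) = 2^n - 1 < 2^n
n = 10
print(sum(2 ** k for k in range(n)), "vs next step", 2 ** n)  # 1023 vs 1024
```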
Trends:
ops/s/$ Had Three Growth Phases
• 1890-1945: Mechanical, then relay: 7-year doubling
• 1945-1985: Tube, transistor, …: 2.3-year doubling
• 1985-2000: Microprocessor: 1.0-year doubling
[Chart: ops per second per $, 1880-2000, log scale from 1.E-06 to 1.E+09; doubling every 7.5 years, then every 2.3 years, then every 1.0 years across the three phases.]
What’s a Balanced System?
[Diagram: a balanced system: system bus feeding two PCI buses.]
Storage capacity
beating Moore's law
• 5 k$/TB today (raw disk)
• Disk TB shipped per year grows 112%/year, vs. Moore's law at 58.7%/year
[Chart: disk TB shipped per year, 1988-2000, log scale 1E+3 to 1E+7, heading toward an exabyte per year. Source: 1998 Disk Trend (Jim Porter), http://www.disktrend.com/pdf/portrpkg.pdf]
Annual growth rates:
  Moore's law:    58.70%
  Revenue:         7.47%
  TB growth:     112.30% (since 1993)
  Price decline:  50.70% (since 1993)
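To see how fast a 112%/year curve outruns the Moore's-law curve, a small compounding sketch using the slide's rates:

```python
# Compound the slide's annual growth rates over a decade.
disk_tb_growth = 1.0 + 1.123    # 112.3%/year (since 1993)
moores_law     = 1.0 + 0.587    # 58.7%/year

years = 10
print(f"Disk TB shipped: {disk_tb_growth ** years:,.0f}x per decade")  # ~1,900x
print(f"Moore's law:     {moores_law ** years:,.0f}x per decade")      # ~100x
```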
Cheap Storage
• Disks are getting cheap:
• 7 k$/TB disks (25 x 40 GB disks @ 230$ each)
[Charts: drive price ($, 0-900) and raw k$/TB (0-40) vs. disk unit size (0-60 GB) for IDE and SCSI drives. Linear fits: y = 15.895x + 13.446 for the SCSI series and y = 5.7156x + 47.857 for the IDE series (x in GB, y in $).]
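The chart's fitted lines give drive price as a function of capacity; a quick sketch evaluating them (assuming, per the chart's label placement, that the steeper fit is SCSI and the shallower one IDE):

```python
# Linear price fits from the chart: price ($) vs raw drive capacity (GB).
# Assumption: the steeper line is the SCSI series, the shallower one IDE.
def scsi_price(gb): return 15.895 * gb + 13.446
def ide_price(gb):  return 5.7156 * gb + 47.857

for gb in (20, 40, 60):
    print(f"{gb} GB   IDE ${ide_price(gb):4.0f}   SCSI ${scsi_price(gb):4.0f}")

# Marginal cost per GB is the slope: ~$5.7/GB IDE (~5.7 k$/TB) vs ~$16/GB SCSI,
# in line with the deck's ~5-7 k$/TB raw-disk figures.
```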
Cheap Storage or Balanced System
• Low cost storage (2 x 1.5 k$ servers): 7 k$/TB
  2 x (1 k$ system + 8 x 60 GB disks + 100 Mb Ethernet)
• Balanced server (7 k$ / 0.5 TB):
  – 2 x 800 MHz (2 k$)
  – 256 MB (400 $)
  – 8 x 60 GB drives (3 k$)
  – Gbps Ethernet + switch (1.5 k$)
  – 14 k$/TB, 28 k$ per RAIDed TB
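The cost arithmetic behind the balanced-server figure, as a quick sketch using the slide's own prices:

```python
# Balanced-server configuration from the slide: ~0.5 TB per ~7 k$ box.
cpu      = 2000   # 2 x 800 MHz
ram      = 400    # 256 MB
disks    = 3000   # 8 x 60 GB
network  = 1500   # Gbps Ethernet + switch
box_cost = cpu + ram + disks + network           # ~6.9 k$
box_tb   = 8 * 60 / 1000                         # 0.48 TB

print(f"{box_cost/box_tb/1000:.0f} k$/TB raw")      # ~14 k$/TB
print(f"{2*box_cost/box_tb/1000:.0f} k$/TB mirrored")  # ~29 k$/TB (slide quotes 28)
```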
The “Absurd” Disk
• 2.5 hr scan time (poor sequential access)
• 1 aps / 5 GB (VERY cold data)
• It’s a tape!
[Diagram: a 1 TB drive, 100 MB/s, 200 Kaps]
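A quick check of the scan-time claim, sketched with the diagram's 1 TB and 100 MB/s figures:

```python
# Why a 1 TB drive behaves "like a tape": capacity outruns bandwidth and access rate.
capacity_gb   = 1000          # 1 TB
bandwidth_mbs = 100           # 100 MB/s sequential

scan_seconds = capacity_gb * 1000 / bandwidth_mbs
print(f"Full scan: {scan_seconds / 3600:.1f} hours")   # ~2.8 h, the slide's ~2.5 hr

# "1 aps / 5 GB": spread over the whole drive, each 5 GB of data sees roughly
# one access per second, i.e. the data is very cold.
```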
Hot Swap Drives for Archive or
Data Interchange
• 25 MBps write (so can write N x 60 GB in 40 minutes)
• 60 GB/overnite = ~N x 2 MB/second @ 19.95$/nite
• 4 x 60 GB IDE (2 hot pluggable): 1,100$
• SCSI-IDE bridge: 200$
• Box: 500 MHz cpu, 256 MB SRAM, fan, power, Enet: 700$
• Or 8 disks/box: 600 GB for ~3 k$ (or 300 GB RAID)
[Diagram: hot-swap drive box; annotations: 240 GB, 2 k$ (now); 300 GB by year end; 17$; 260$.]
Hot Swap Drives for Archive or
Data Interchange
• 25 MBps write (so can write N x 74 GB in 3 hours)
• 74 GB/overnite = ~N x 2 MB/second @ 19.95$/nite
It’s Hard to Archive a Petabyte
It takes a LONG time to restore it.
• At 1GBps it takes 12 days!
• Store it in two (or more) places online (on disk?).
A geo-plex
• Scrub it continuously (look for errors)
• On failure,
– use other copy until failure repaired,
– refresh lost copy from safe copy.
• Can organize the two copies differently
(e.g.: one by time, one by space)
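The 12-day figure is easy to check, as a quick sketch:

```python
# Restoring a petabyte over a 1 GB/s pipe.
petabyte_bytes   = 10 ** 15
rate_bytes_per_s = 10 ** 9          # 1 GBps

days = petabyte_bytes / rate_bytes_per_s / 86400
print(f"{days:.1f} days")           # ~11.6 days, roughly the slide's 12 days
```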
Disk vs Tape
• Disk
  – 60 GB
  – 30 MBps
  – 5 ms seek time
  – 3 ms rotate latency
  – 7$/GB for drive, 3$/GB for ctlrs/cabinet
  – 4 TB/rack
  – 1 hour scan
• Tape
  – 40 GB
  – 10 MBps
  – 10 sec pick time
  – 30-120 second seek time
  – 2$/GB for media, 8$/GB for drive+library
  – 10 TB/rack
  – 1 week scan
Guestimates (CERN: 200 TB; 3480 tapes; 2 col = 50 GB; rack = 1 TB = 20 drives)
The price advantage of tape is narrowing, and
the performance advantage of disk is growing.
At 10 k$/TB, disk is competitive with nearline tape.
Trends: Gilder’s Law:
3x bandwidth/year for 25 more years
• Today:
– 10 Gbps per channel
– 4 channels per fiber: 40 Gbps
– 32 fibers/bundle = 1.2 Tbps/bundle
• In lab: 3 Tbps/fiber (400 x WDM)
• In theory: 25 Tbps per fiber
• 1 Tbps = USA 1996 WAN bisection bandwidth
• Aggregate bandwidth doubles every 8 months!
• 1 fiber = 25 Tbps
Sense of scale
• How fat is your pipe?
• Fattest pipe on MS campus is the WAN!
  – 300 MBps OC48 = G2, or memcpy()
  – 94 MBps coast to coast
  – 90 MBps PCI
  – 20 MBps disk / ATM / OC3
[Map: the HSCC (High Speed Connectivity Consortium, DARPA) path from the Information Sciences Institute in Arlington, VA through Qwest (via New York and San Francisco, CA) to the Pacific Northwest Gigapop, University of Washington, and Microsoft in Redmond/Seattle, WA: 5626 km, 10 hops.]
The Path
DC -> SEA
C:\> tracert -d 131.107.151.194
Tracing route to 131.107.151.194 over a maximum of 30 hops

Hop  RTT (3 probes)           Address           Location                                  Equipment
  0                                             Arlington, Virginia, ISI                  DELL 4400 Win2K WKS, Alteon GbE
  1  16 ms   <10 ms  <10 ms   140.173.170.65    Arlington, Virginia, ISI interface ISIe   Juniper M40, GbE
  2  <10 ms  <10 ms  <10 ms   205.171.40.61     Arlington, Virginia, Qwest DC Edge        Cisco GSR, OC48
  3  <10 ms  <10 ms  <10 ms   205.171.24.85     Arlington, Virginia, Qwest DC Core        Cisco GSR, OC48
  4  <10 ms  <10 ms  16 ms    205.171.5.233     New York, New York, Qwest NYC Core        Cisco GSR, OC48
  5  62 ms   63 ms   62 ms    205.171.5.115     San Francisco, CA, Qwest SF Core          Cisco GSR, OC48
  6  78 ms   78 ms   78 ms    205.171.5.108     Seattle, Washington, Qwest Sea Core       Cisco GSR, OC48
  7  78 ms   78 ms   94 ms    205.171.26.42     Seattle, Washington, Qwest Sea Edge       Juniper M40, OC48
  8  78 ms   79 ms   78 ms    208.46.239.90     Seattle, Washington, PNW Gigapop          Juniper M40, OC48
  9  78 ms   78 ms   94 ms    198.48.91.30      Redmond, Washington, Microsoft            Cisco GSR, OC48
 10  78 ms   78 ms   94 ms    131.107.151.194   Redmond, Washington, Microsoft            Compaq SP750 Win2K WKS, SysKonnect GbE
“PetaBumps”
• 751 Mbps for 300 seconds = ~28 GB
  single-thread, single-stream tcp/ip, desktop-to-desktop, out-of-the-box performance*
• 5626 km x 751 Mbps = ~4.2e15 bit-meters/second ~ 4.2 Peta bmps
• Multi-stream is 952 Mbps ~ 5.2 Peta bmps
• 4470-byte MTUs were enabled on all routers.
• 20 MB window size
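The bit-meter-per-second arithmetic behind those numbers, as a quick sketch:

```python
# Internet2 land-speed-record metric: distance x throughput, in bit-meters/second.
distance_m = 5626e3          # the 5626 km DC-to-Seattle path
single_bps = 751e6           # 751 Mbps single stream

print(f"{distance_m * single_bps:.1e} bit-m/s")          # ~4.2e15, i.e. ~4.2 peta bmps
print(f"{single_bps * 300 / 8 / 1e9:.0f} GB in 300 s")   # ~28 GB moved in the run
```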
The Promise of SAN/VIA:10x in 2 years
http://www.ViArch.org/
• Yesterday:
  – 10 MBps (100 Mbps Ethernet)
  – ~20 MBps tcp/ip saturates 2 cpus
  – round-trip latency ~250 µs
• Now:
  – Wires are 10x faster: Myrinet, Gbps Ethernet, ServerNet, …
  – Fast user-level communication
    • tcp/ip ~ 100 MBps, 10% cpu
    • round-trip latency is 15 µs
    • 1.6 Gbps demoed on a WAN
[Chart: time (µs) to send 1 KB, split into sender cpu, receiver cpu, and transmit time, for 100 Mbps Ethernet, Gbps Ethernet, and SAN; 0-250 µs scale.]
Pointers
• The single-stream submission:
http://research.microsoft.com/~gray/papers/
Windows2000_I2_land_Speed_Contest_Entry_(Single_Stream_mail).htm
• The multi-stream submission:
http://research.Microsoft.com/~gray/papers/
Windows2000_I2_land_Speed_Contest_Entry_(Multi_Stream_mail).htm
• The code:
http://research.Microsoft.com/~gray/papers/speedy.htm
speedy.h
speedy.c
• And a PowerPoint presentation about it:
http://research.Microsoft.com/~gray/papers/
Windows2000_WAN_Speed_Record.ppt
Networking
• WANS are getting faster than LANS
G8 = OC192 = 8Gbps is “standard”
• Link bandwidth improves 4x per 3 years
• Speed of light (60 ms round trip in US)
• Software stacks have always been the problem:
  Time = SenderCPU + ReceiverCPU + bytes/bandwidth
  (the SenderCPU + ReceiverCPU terms have been the problem)
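A minimal sketch of that cost model; the per-side CPU figure and link speed below are illustrative assumptions, not measurements from the deck:

```python
# Time = SenderCPU + ReceiverCPU + bytes/bandwidth
# The point: the two software (CPU) terms, not the wire, usually dominate.

def transfer_time(nbytes, bandwidth_bps, sender_cpu_s, receiver_cpu_s):
    return sender_cpu_s + receiver_cpu_s + nbytes * 8 / bandwidth_bps

# Illustrative numbers: 1 KB message, 1 Gbps link, 50 µs of CPU on each side.
t = transfer_time(1024, 1e9, 50e-6, 50e-6)
print(f"{t*1e6:.0f} µs total, of which only {1024*8/1e9*1e6:.0f} µs is wire time")
```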
Rules of Thumb in Data Engineering
• Moore's law -> an address bit per 18 months.
• Storage grows 100x/decade (except 1000x last decade!)
• Disk data of 10 years ago now fits in RAM (iso-price).
• Device bandwidth grows 10x/decade, so need parallelism.
• RAM:disk:tape price is 1:10:30, going to 1:10:10.
• Amdahl's speedup law: S/(S+P).
• Amdahl's IO law: a bit of IO per instruction/second
  (~1 TBps per 10 teraOPs: 50,000 disks per 10 teraOPs: 100 M$).
• Amdahl's memory law: a byte per instruction/second (going to 10)
  (1 TB RAM per TOP: 1 TeraDollars).
• PetaOps anyone?
• Gilder's law: aggregate bandwidth doubles every 8 months.
• 5 Minute rule: cache disk data that is reused in 5 minutes.
• Web rule: cache everything!
http://research.Microsoft.com/~gray/papers/MS_TR_99_100_Rules_of_Thumb_in_Data_Engineering.doc
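To make the Amdahl balance rules concrete, a sketch for a hypothetical 10 teraOP machine; the 25 MB/s per-disk rate is an assumed round number, not a figure from the deck:

```python
# Amdahl's balance rules applied to a hypothetical 10 teraOP machine.
ops_per_second = 10e12

io_bits_per_s  = ops_per_second * 1          # 1 bit of I/O per instruction/second
io_bytes_per_s = io_bits_per_s / 8           # ~1.25 TB/s of I/O
disk_MBps      = 25                          # assumed per-disk sequential rate
print(f"disks needed: {io_bytes_per_s / (disk_MBps * 1e6):,.0f}")   # ~50,000

ram_bytes = ops_per_second * 1               # 1 byte of memory per instruction/second
print(f"RAM needed: {ram_bytes / 1e12:.0f} TB")                     # 10 TB
```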
Dealing With TeraBytes (Petabytes):
Requires Parallelism
Parallelism: use many little devices in parallel.
• At 10 MB/s: 1.2 days to scan 1 Terabyte.
• 1,000 x parallel: 100 second scan of 1 Terabyte.
• Use 100 processors & 1,000 disks.
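The scan arithmetic behind that picture, as a sketch:

```python
# Scanning 1 TB with one disk vs. 1,000 disks in parallel.
terabyte_MB   = 1_000_000
per_disk_MBps = 10

serial_s = terabyte_MB / per_disk_MBps
print(f"one disk:    {serial_s / 86400:.1f} days")   # ~1.2 days

parallel_s = serial_s / 1000                         # 1,000-way parallel scan
print(f"1,000 disks: {parallel_s:.0f} seconds")      # 100 seconds
```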
Parallelism Must Be Automatic
• There are thousands of MPI programmers.
• There are hundreds-of-millions of people using
parallel database search.
• Parallel programming is HARD!
• Find design patterns and automate them.
• Data search/mining has parallel design patterns.
Scalability: Up and Out
• “Scale Up”
  – Use “big iron” (SMP)
  – Cluster into packs for availability
• “Scale Out”: clones & partitions
  – Use commodity servers
  – Add clones & partitions as needed
• Everyone scales out.
What’s the Brick?
• 1M$/slice
– IBM S390?
– Sun E 10,000?
• 100 K$/slice
– HPUX/AIX/Solaris/IRIX/EMC
• 10 K$/slice
– Utel / Wintel 4x
• 1 K$/slice
– Beowulf / Wintel 1x
Terminology for scaleability
• Farms of servers:
  – Clones: identical
    • Scaleability + availability
  – Partitions:
    • Scaleability
  – Packs:
    • Partition availability via fail-over
• GeoPlex for disaster tolerance.
[Diagram: a farm contains clones (shared-nothing or shared-disk) and partitions grouped into packs (shared-nothing, active-active or active-passive).]
[Diagram: the taxonomy illustrated: shared-nothing clones and shared-disk clones within a farm; partitions and packed partitions (shared-nothing, active-active or active-passive).]
Unpredictable Growth
• The TerraServer Story:
– We expected 5 M hits per day
– We got 50 M hits on day 1
– We peak at 15-20 M hpd on a “hot” day
– Average 5 M hpd after 1 year
• Most of us cannot predict demand
– Must be able to deal with NO demand
– Must be able to deal with HUGE demand
An Architecture for Internet Services?
• Need to be able to add capacity
– New processing
– New storage
– New networking
• Need continuous service
– Online change of all components (hardware and software)
– Multiple service sites
– Multiple network providers
• Need great development tools
– Change the application several times per year.
– Add new services several times per year.
Premise: Each Site is a Farm
• Buy computing by the slice (brick):
  – Rack of servers + disks.
• Grow by adding slices
  – Spread data and computation to new slices
• Two styles:
  – Clones: anonymous servers
  – Parts+Packs: Partitions fail over within a pack
• In both cases, remote farm for disaster recovery

[Diagram: The Microsoft.Com Site (FY98). Building 11 (internal WWW, staging servers, log processing, feeder LAN, live SQL servers; all Building 11 servers accessible from corpnet) and the MOSWest admin LAN feed farms of 4xP5/4xP6 servers (256 MB-1 GB RAM, 12-180 GB disk, roughly $24K-$128K per slice) hosting www.microsoft.com, home.microsoft.com, register.microsoft.com, register.msn.com, msid.msn.com, search.microsoft.com, support.microsoft.com, premium.microsoft.com, activex.microsoft.com, cdm.microsoft.com, FTP/HTTP download servers, SQL servers, SQL consolidators and SQL reporting, plus DMZ and IDC staging servers. The machines hang off FDDI rings (MIS1-MIS4), switched Ethernet, and primary/secondary gigaswitches behind routers, reaching the Internet over 13 DS3s (45 Mb/s each), 2 OC3s, and 2 Ethernet links (100 Mb/s each); there are also Japan and European data centers.]
Clones: Availability+Scalability
• Some applications are
– Read-mostly
– Low consistency requirements
– Modest storage requirement (less than 1TB)
• Examples:
– HTML web servers (IP sprayer/sieve + replication)
– LDAP servers (replication via gossip)
• Replicate app at all nodes (clones)
• Spray requests across nodes (see the sketch after this list).
• Grow by adding clones.
• Fault tolerance: stop sending to that clone.
• Growth: add a clone.
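A minimal sketch of the sprayer-over-clones idea, with hypothetical clone names; a real IP sprayer would also health-check and weight its targets:

```python
# Round-robin "sprayer" over identical clones: requests go to any live clone,
# growth is adding a clone, fault handling is skipping a dead one.
class CloneFarm:
    def __init__(self, clones):
        self.clones = list(clones)      # e.g. ["web1", "web2", "web3"]
        self.down = set()
        self._next = 0

    def route(self, request):
        for _ in range(len(self.clones)):
            clone = self.clones[self._next % len(self.clones)]
            self._next += 1
            if clone not in self.down:
                return (clone, request)
        raise RuntimeError("no clone available")

farm = CloneFarm(["web1", "web2", "web3"])
farm.down.add("web2")                   # stop sending to a failed clone
farm.clones.append("web4")              # grow by adding a clone
print([farm.route(f"GET /{i}")[0] for i in range(4)])  # web1, web3, web4, web1
```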
Two Clone Geometries
• Shared-Nothing: exact replicas
• Shared-Disk (state stored in server)
Shared Nothing Clones
Shared Disk Clones
Facilities Clones Need
• Automatic replication
– Applications (and system software)
– Data
• Automatic request routing
– Spray or sieve
• Management:
– Who is up?
– Update management & propagation
– Application monitoring.
• Clones are very easy to manage:
– Rule of thumb: 100’s of clones per admin
Partitions for Scalability
• Clones are not appropriate for some apps.
– Stateful apps do not replicate well
– High update rates do not replicate well
• Examples
– Email / chat / …
– Databases
• Partition state among servers
• Scalability (online):
– Partition split/merge
– Partitioning must be transparent to client.
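A minimal sketch of partitioning state by key with client-transparent routing; the mailbox servers named here are hypothetical, and a real system would use a directory or range map (rather than simple modulo hashing) so partitions can split and merge online:

```python
import hashlib

# Hash-partition stateful data (e.g. mailboxes) across servers.
# Clients call partition_for(); which server holds the key stays hidden,
# so the partitioning scheme can change behind this function.
PARTITIONS = ["mail1", "mail2", "mail3", "mail4"]

def partition_for(key: str) -> str:
    h = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return PARTITIONS[h % len(PARTITIONS)]

print(partition_for("alice@example.com"))
print(partition_for("bob@example.com"))
```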
Partitioned/Clustered Apps
• Mail servers
– Perfectly partitionable
• Business Object Servers
– Partition by set of objects.
• Parallel Databases
• Transparent access to partitioned tables
• Parallel Query
Packs for Availability
• Each partition may fail (independent of others)
• Partitions migrate to new node via fail-over
– Fail-over in seconds
• Pack: the nodes supporting a partition
  – VMS Cluster
  – Tandem Process Pair
  – SP2 HACMP
  – Sysplex™
  – WinNT MSCS (wolfpack)
• Cluster In A Box now commodity
• Partitions typically grow in packs.
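A minimal active-passive failover sketch for one pack, with illustrative node names; real cluster services such as MSCS add leases, fencing, and state hand-off:

```python
# One pack: the set of nodes that can host a given partition.
# Active-passive: one node serves the partition; on failure the partition
# migrates (fails over) to the next live node in the pack.
class Pack:
    def __init__(self, nodes):
        self.nodes = list(nodes)        # e.g. ["nodeA", "nodeB"]
        self.live = set(nodes)
        self.active = self.nodes[0]

    def fail(self, node):
        self.live.discard(node)
        if node == self.active:
            self.active = next(n for n in self.nodes if n in self.live)

pack = Pack(["nodeA", "nodeB"])
pack.fail("nodeA")
print(pack.active)                      # nodeB now serves the partition
```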
What Parts+Packs Need
• Automatic partitioning (in dbms, mail, files,…)
– Location transparent
– Partition split/merge
– Grow without limits (100x10TB)
• Simple failover model
– Partition migration is transparent
– MSCS-like model for services
• Application-centric request routing
• Management:
– Who is up?
– Automatic partition management (split/merge)
– Application monitoring.
Partitions and Packs
• Packs for availability
[Diagram: partitions vs. packed partitions]
GeoPlex: Farm pairs
• Two farms
• Changes from one sent to the other
• When one farm fails, the other provides service
• Masks
  – Hardware/software faults
  – Operations tasks (reorganize, upgrade, move)
  – Environmental faults (power fail)
Services on Clones & Partitions
• Application provides a set of services
• If cloned:
– Services are on subset of clones
• If partitioned:
– Services run at each partition
• System load balancing routes request to
– Any clone
– Correct partition.
– Routes around failures.
Cluster Scenarios: 3- tier systems
[Diagram: a simple web site: web clients -> load balance -> front end (clones for availability) -> web file store and SQL temp state -> SQL database (packs for availability).]
Cluster Scale Out Scenarios
The FARM: Clones and Packs of Partitions
[Diagram: web clients -> cloned front ends (firewall, sprayer, web server) with load balancing -> cloned, packed file servers (Web File Store A/B with replication) and SQL temp state -> packed database partitions (SQL Partition 1-3): database transparency.]
Terminology
• Farms of servers:
  – Clones: identical (scaleability + availability)
  – Partitions (scaleability)
  – Packs (partition availability via fail-over)
• GeoPlex for disaster tolerance.
[Diagram: the same farm / clone / partition / pack taxonomy as before.]
What we have been doing with SDSS
• Helping move the data to SQL
  – Database design
  – Data loading
• Experimenting with queries on a 4 M object DB
  – 20 questions like “find gravitational lens candidates”
  – Queries use parallelism; most run in a few seconds (auto parallel)
  – Some run in hours (neighbors within 1 arcsec)
  – EASY to ask questions.
• Helping with an “outreach” website: SkyServer
• Personal goal: try datamining techniques to “re-discover” Astronomy
[Chart: Color Magnitude Diff/Ratio Distribution; counts (1.E+0 to 1.E+7, log scale) vs. magnitude diff/ratio (-30 to 30) for u-g, g-r, r-i, i-z.]
References (.doc or .pdf)
• Technology forecast:
http://research.microsoft.com/~gray/papers/
MS_TR_99_100_Rules_of_Thumb_in_Data_Engineering.doc
• Gbps experiments:
http://research.microsoft.com/~gray/
• Disk experiments (10K$ TB)
http://research.microsoft.com/~gray/papers/Win2K_IO_MSTR_2000_55.doc
• Scaleability Terminology
http://research.microsoft.com/~gray/papers/MS_TR_99_85_Scalability_Terminology.doc