What I have been Doing
Peta Bumps
10k$ TB
Scaleable Computing
Sloan Digital Sky Survey
Sense of scale
• How fat is your pipe? (a memcpy() timing sketch follows below)
• The fattest pipe on the MS campus is the WAN!
    300 MBps  OC48 = G2, or memcpy()
     94 MBps  coast to coast
     90 MBps  PCI
     20 MBps  disk / ATM / OC3
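A minimal sketch (C, not from the talk) of how one might measure the local memcpy() number above; the buffer size and repetition count are arbitrary choices.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

int main(void)
{
    const size_t len = 64 * 1024 * 1024;   /* 64 MB buffers */
    const int iters = 16;                  /* ~1 GB copied in total */
    char *src = malloc(len), *dst = malloc(len);
    if (!src || !dst) return 1;
    memset(src, 1, len);                   /* touch pages before timing */
    memset(dst, 0, len);

    clock_t start = clock();
    for (int i = 0; i < iters; i++)
        memcpy(dst, src, len);
    double secs = (double)(clock() - start) / CLOCKS_PER_SEC;
    if (secs <= 0) secs = 1e-9;            /* guard a too-coarse clock */

    printf("memcpy: %.1f MBps\n",
           (double)iters * len / (1024.0 * 1024.0) / secs);
    free(src);
    free(dst);
    return 0;
}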
[Map: the path from Arlington, VA to Redmond/Seattle, WA via New York and
San Francisco, CA: 5626 km, 10 hops. Partners: Information Sciences
Institute, Microsoft, Qwest, University of Washington, Pacific Northwest
Gigapop, HSCC (high speed connectivity consortium), DARPA.]
The Path: DC -> SEA
C:\>tracert -d 131.107.151.194
Tracing route to 131.107.151.194 over a maximum of 30 hops

  0                             Arlington, Virginia: ISI
  1    16 ms  <10 ms  <10 ms  140.173.170.65   Arlington, Virginia: ISI interface ISIe
  2   <10 ms  <10 ms  <10 ms  205.171.40.61    Arlington, Virginia: Qwest DC Edge
  3   <10 ms  <10 ms  <10 ms  205.171.24.85    Arlington, Virginia: Qwest DC Core
  4   <10 ms  <10 ms   16 ms  205.171.5.233    New York, New York: Qwest NYC Core
  5    62 ms   63 ms   62 ms  205.171.5.115    San Francisco, CA: Qwest SF Core
  6    78 ms   78 ms   78 ms  205.171.5.108    Seattle, Washington: Qwest Sea Core
  7    78 ms   78 ms   94 ms  205.171.26.42    Seattle, Washington: Qwest Sea Edge
  8    78 ms   79 ms   78 ms  208.46.239.90    Seattle, Washington: PNW Gigapop
  9    78 ms   78 ms   94 ms  198.48.91.30     Redmond, Washington: Microsoft
 10    78 ms   78 ms   94 ms  131.107.151.194  Redmond, Washington: Microsoft
Equipment along the path:
  hop 0:     DELL 4400 Win2K WKS, Alteon GbE
  hop 1:     Juniper M40, GbE
  hops 2-6:  Cisco GSR, OC48
  hops 7-8:  Juniper M40, OC48
  hop 9:     Cisco GSR, OC48
  hop 10:    Compaq SP750 Win2K WKS, SysKonnect GbE
750 Mbps over 5000 km (957 Mbps multi-stream)
~ 4e15 bit-meters per second = 4 Peta bmps (“peta bumps”)
single-stream TCP/IP throughput
5 Peta bmps multi-stream
“PetaBumps”
• 751 Mbps for 300 seconds (~28 GB):
  single-thread, single-stream TCP/IP,
  desktop-to-desktop, out-of-the-box performance*
• 5626 km x 751 Mbps = ~4.2e15 bit-meters / second ~ 4.2 Peta bmps
• Multi-stream is 952 Mbps, ~5.2 Peta bmps
• 4470-byte MTUs were enabled on all routers.
• 20 MB window size (see the sketch below)
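A sketch (C, illustrative; not from speedy.c) of the arithmetic above: the bit-meters-per-second figure, and the bandwidth-delay product that shows why a 20 MB window is ample for this path.

#include <stdio.h>

int main(void)
{
    double km   = 5626.0;    /* Arlington -> Redmond path length    */
    double mbps = 751.0;     /* single-stream TCP/IP throughput     */
    double rtt  = 0.078;     /* ~78 ms round trip, from the tracert */

    double bmps   = km * 1000.0 * mbps * 1e6;      /* bit-meters/sec */
    double bdp_mb = mbps * 1e6 / 8.0 * rtt / 1e6;  /* MB in flight   */

    printf("%.2g bit-meters/second (~4.2 Peta bmps)\n", bmps);
    printf("bandwidth-delay product: %.1f MB -- a 20 MB window covers it\n",
           bdp_mb);
    return 0;
}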
Pointers
• The single-stream submission:
  http://research.microsoft.com/~gray/papers/Windows2000_I2_land_Speed_Contest_Entry_(Single_Stream_mail).htm
• The multi-stream submission:
  http://research.Microsoft.com/~gray/papers/Windows2000_I2_land_Speed_Contest_Entry_(Multi_Stream_mail).htm
• The code:
  http://research.Microsoft.com/~gray/papers/speedy.htm
  speedy.h, speedy.c
• And a PowerPoint presentation about it:
  http://research.Microsoft.com/~gray/papers/Windows2000_WAN_Speed_Record.ppt
What I have been Doing
Peta Bumps
10k$ TB
Scaleable Computing
Sloan Digital Sky Survey
TPC-C high performance clusters
• Standard transaction processing benchmark
  – Mix of 5 simple transaction types
  – Database scales with workload
  – Measures a balanced system
• Single Site Clusters
  – Billions of transactions per day
  – Tera-Ops & Peta-Bytes (10 k node clusters)
  – Micro-dollar/transaction
Scalability Successes
[Chart: tpmC vs Time, Jan-95 to Jan-99, y-axis 0 to 90,000 tpmC.]
• Hardware + Software advances
– TPC & Sort examples (2x/year)
– Many other examples
[Charts: “Records Sorted per Second Doubles Every Year” and “GB Sorted per
Dollar Doubles Every Year”, log scale (1.E-03 to 1.E+06), 1985 to 2000.]
Progress since Jan 99: Running out of gas?
• 50% better peak performance (not 2x)
• 2x better price/performance
• At a cost ceiling: systems cost 7M$-13M$
• The June 98 result was a “hero” effort
  (Compaq/Alpha/Oracle, 96 cpu, 8-node cluster,
  102,542 tpmC @ 139$/tpmC, 5/5/98)
[Chart: tpmC vs Time, Jan-95 to Jan-00, scale up to 240,000 tpmC; the trend
stalls near 90,000 (“Out’a gas?”), with the 102,542-tpmC hero result above it.]
2/17/00: Back on schedule!
• First proof point of commoditized scale-out
• 1.7x better performance
• 3x better price/performance
• 4M$ vs 7M$-13M$
• Much more to do, but… a great start!
[Chart: tpmC vs Time, Jan-95 to Jan-00, scale up to 240,000 tpmC; the new
result sits far above the old 90,000-tpmC trend (“off-scale good!”).]
Year 2000 Sort Results

Penny sort (Daytona and Indy):
  4.5 GB (45 M records) in 886 seconds on a $1010 Win2K/Intel system.
  HMsort: doc (74KB), pdf (32KB).
  Brad Helmkamp, Keith McCready, Stenograph LLC.

Minute sort:
  Daytona: 7.6 GB in 60 seconds. Ordinal Nsort, SGI 32-cpu Origin, IRIX.
  Indy: 21.8 GB (218 M records) in 56.51 sec. NOW+HPVMsort, 64 nodes,
  WinNT, pdf (170KB). Luis Rivera, Xianan Zhang, Andrew Chien, UCSD.

TeraByte sort:
  Daytona: 49 minutes. 68x2 Compaq Tandem, Sandia Labs.
  David Cossock, Sam Fineberg, Pankaj Mehra, John Peck.
  Indy: 1057 seconds. SPsort, 1952-node SP cluster, 2168 disks.
  Jim Wyllie. PDF: SPsort.pdf (80KB).

Datamation sort:
  1 M records in .998 seconds (doc 703KB) or (pdf 50KB).
  Mitsubishi DIAPRISM Hardware Sorter with HP 4 x 550MHz Xeon PC server
  + 32 SCSI disks, Windows NT4.
  Shinsuke Azuma, Takao Sakuma, Tetsuya Takeo, Takaaki Ando, Kenji Shirai,
  Mitsubishi Electric Corp.
What’s a Balanced System?
[Diagram: a balanced system block diagram, with cpu and memory on the
system bus and devices on two PCI buses.]
Rules of Thumb in Data Engineering
• Moore’s law -> an address bit per 18 months.
• Storage grows 100x/decade (except 1000x last decade!).
• Disk data of 10 years ago now fits in RAM (iso-price).
• Device bandwidth grows 10x/decade – so need parallelism.
• RAM:disk:tape price is 1:10:30, going to 1:10:10.
• Amdahl’s speedup law: with serial part S and parallel part P,
  maximum speedup is (S+P)/S.
• Amdahl’s IO law: one bit of IO per instruction per second
  (a TBps per 10 teraOPS! 50,000 disks per 10 teraOPS: 100 M$).
• Amdahl’s memory law: one byte per instruction per second (going to 10)
  (1 TB RAM per TOP: 1 TeraDollars).
• PetaOps anyone?
• Gilder’s law: aggregate bandwidth doubles every 8 months.
• 5 Minute rule: cache disk data that is reused in 5 minutes
  (a break-even sketch follows below).
• Web rule: cache everything!
http://research.Microsoft.com/~gray/papers/MS_TR_99_100_Rules_of_Thumb_in_Data_Engineering.doc
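A sketch (C) of the break-even computation behind the 5-minute rule, following the formula in the TR above. The parameter values are illustrative assumptions: the disk price and access rate echo the IDE figures on the “Doing Studies of IO bandwidth” slide later in this deck, and the DRAM price is a rough year-2000 guess.

#include <stdio.h>

int main(void)
{
    double pages_per_mb_ram = 128.0;  /* 8 KB pages                    */
    double accesses_per_sec = 80.0;   /* random IOs/sec per disk       */
    double disk_price       = 250.0;  /* $ per drive (assumed)         */
    double ram_price_per_mb = 1.5;    /* $ per MB of DRAM (assumed)    */

    /* Cache a page if it is re-referenced within this interval:
       (pages per MB of RAM / accesses per second per disk)
       x (price per disk drive / price per MB of DRAM)        */
    double breakeven = (pages_per_mb_ram / accesses_per_sec)
                     * (disk_price / ram_price_per_mb);

    printf("break-even interval: %.0f seconds (~%.0f minutes)\n",
           breakeven, breakeven / 60.0);   /* ~267 s, i.e. ~5 minutes */
    return 0;
}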
Cheap Storage
• Disks are getting cheap:
• 7 k$/TB disks (25 x 40 GB disks @ 230$ each); a $/TB check follows below
[Charts: “Price vs disk capacity”: drive price ($) vs. raw disk unit size
(GB, 0-60) for IDE and SCSI, with linear fits y = 15.895x + 13.446 (SCSI)
and y = 5.7156x + 47.857 (IDE); and raw k$/TB vs. disk unit size (GB).]
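A quick arithmetic check (C, illustrative) of the price points above: dollars per raw terabyte from drive price and capacity.

#include <stdio.h>

/* dollars per raw terabyte, from drive price ($) and capacity (GB) */
static double dollars_per_tb(double price, double gb)
{
    return price / gb * 1000.0;
}

int main(void)
{
    /* 25 x 40 GB drives @ 230$: raw drives alone come to ~5.8 k$/TB;
       enclosures, power, and controllers bring it to the ~7 k$/TB
       quoted on the slide. */
    printf("40 GB @ 230$: %.0f $/TB raw\n", dollars_per_tb(230.0, 40.0));

    /* the balanced-server drives below: 8 x 73 GB for 4 k$ (500$ each) */
    printf("73 GB @ 500$: %.0f $/TB raw\n", dollars_per_tb(500.0, 73.0));
    return 0;
}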
Cheap Storage or Balanced System
• Low-cost storage (2 x 1.5k$ servers): 10K$/TB
  2 x (1K$ system + 8 x 70 GB disks + 100 Mb Ethernet)
• Balanced server (9k$ per 0.5 TB):
  – 2 x 800 MHz (2k$)
  – 256 MB (500$)
  – 8 x 73 GB drives (4K$)
  – Gbps Ethernet + switch (1.5k$)
  – 18k$/TB, 36K$ per RAIDed TB
  [Photo: 2 x 800 MHz, 256 MB, 160 GB, 2k$ (now); 300 GB by year end.]
• 4 x 40 GB IDE (2 hot-pluggable): 1,100$
• SCSI-IDE bridge: 200$
• Box: 500 MHz cpu, 256 MB SRAM, fan, power, Enet: 700$
• Or 8 disks/box: 600 GB for ~3K$ (or 300 GB RAID)
Hot Swap Drives for Archive or Data Interchange
• 25 MBps write (so can write N x 74 GB in 3 hours)
• 74 GB/overnite = ~N x 2 MB/second @ 19.95$/nite
  (arithmetic sketch below)
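The overnight-shipping arithmetic as a sketch (C); the 10-hour transit time is an assumption, not a figure from the talk.

#include <stdio.h>

int main(void)
{
    double gb    = 74.0;   /* one hot-swap drive                */
    double hours = 10.0;   /* assumed overnight courier transit */

    double mb_per_sec = gb * 1024.0 / (hours * 3600.0);
    printf("%.0f GB overnight = %.1f MB/second per drive shipped\n",
           gb, mb_per_sec);   /* ~2 MB/s, times N drives */
    return 0;
}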
Doing Studies of IO bandwidth
• SCSI & IDE bandwidth (a measurement sketch follows below)
  – ~15-30 MBps sequential
  – SCSI 10k rpm: ~110 kaps @ 600$
  – IDE 7.2k rpm: ~80 kaps @ 250$
• Get 2 disks for the price of 1
  – More bandwidth for reads
  – RAID
  – 10K$ RAID TB by 2001
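A minimal sketch (C, POSIX rather than the Win2K boxes in this deck) of the kind of sequential-bandwidth measurement such studies involve; the file path "testfile" is a placeholder, and the coarse time() clock is good enough for multi-second runs.

#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <time.h>

int main(int argc, char **argv)
{
    const size_t block = 1 << 20;              /* 1 MB reads */
    char *buf = malloc(block);
    int fd = open(argc > 1 ? argv[1] : "testfile", O_RDONLY);
    if (fd < 0 || !buf) return 1;

    time_t start = time(NULL);
    long long total = 0;
    ssize_t n;
    while ((n = read(fd, buf, block)) > 0)     /* sequential scan */
        total += n;
    double secs = difftime(time(NULL), start);

    if (secs > 0)
        printf("%.1f MBps sequential read\n",
               total / (1024.0 * 1024.0) / secs);
    close(fd);
    free(buf);
    return 0;
}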
What I have been Doing
Peta Bumps
10k$ TB
Scaleable Computing
Sloan Digital Sky Survey
The Sloan Digital Sky Survey
A project run by the Astrophysical Research Consortium (ARC)
The University of Chicago
Princeton University
The Johns Hopkins University
The University of Washington
Fermi National Accelerator Laboratory
US Naval Observatory
The Japanese Participation Group
The Institute for Advanced Study
SLOAN Foundation, NSF, DOE, NASA
Goal: To create a detailed multicolor map of the Northern Sky
over 5 years, with a budget of approximately $80M
Data Size: 40 TB raw, 1 TB processed
Scientific Motivation
Create the ultimate map of the Universe:
  • The Cosmic Genome Project!
Study the distribution of galaxies:
  • What is the origin of fluctuations?
  • What is the topology of the distribution?
Measure the global properties of the Universe:
  • How much dark matter is there?
Local census of the galaxy population:
  • How did galaxies form?
Find the most distant objects in the Universe:
  • What are the highest quasar redshifts?
First Light Images
• Telescope: first light May 9th 1998, equatorial scans
• Camera
The First Stripes
• 5-color imaging of >100 square degrees
• Multiple scans across the same fields
• Photometric limits as expected
SDSS Data Flow
SDSS Data Products
• Object catalog: parameters of >10^8 objects (400 GB)
• Redshift catalog: parameters of 10^6 objects (1 GB)
• Atlas images: 5-color cutouts of >10^8 objects (1.5 TB)
• Spectra: in a one-dimensional form (60 GB)
• Derived catalogs: clusters, QSO absorption lines (20 GB)
• 4x4 pixel all-sky map: heavily compressed (60 GB)
All raw data saved in a tape vault at Fermilab.
Distributed Implementation
[Diagram: a user interface and analysis engine connect to a master running
the SX Engine over an Objectivity federation; slave nodes each run
Objectivity over RAID storage.]
What We Have Been Doing
• Helping move the data to SQL
  – Database design
  – Data loading
• Experimenting with queries on a 4 M object DB
  – 20 questions like “find gravitational lens candidates”
  – Queries use parallelism; most run in a few seconds (auto parallel)
  – Some run in hours (neighbors within 1 arcsec; see the sketch below)
  – EASY to ask questions
• Helping with an “outreach” website: SkyServer
• Personal goal:
  Try datamining techniques to “re-discover” Astronomy
[Chart: color magnitude diff/ratio distribution; counts (1.E+0 to 1.E+7,
log scale) vs. magnitude diff/ratio (-30 to 30), for u-g, g-r, r-i, i-z.]
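A sketch (C; standard spherical trigonometry, not the SDSS code) of the geometry behind the slow query: each candidate pair needs an angular-separation test like the one below, which is why a naive 1-arcsec neighbor search over millions of objects takes hours without a spatial index. Compile with -lm.

#include <stdio.h>
#include <math.h>

#define PI 3.14159265358979323846

/* Angular distance in arcseconds between two sky positions given in
   degrees of right ascension (ra) and declination (dec). */
static double sep_arcsec(double ra1, double dec1, double ra2, double dec2)
{
    const double d2r = PI / 180.0;
    double c = sin(dec1 * d2r) * sin(dec2 * d2r)
             + cos(dec1 * d2r) * cos(dec2 * d2r)
             * cos((ra1 - ra2) * d2r);
    if (c > 1.0) c = 1.0;                 /* guard rounding error */
    return acos(c) / d2r * 3600.0;
}

int main(void)
{
    /* two hypothetical objects ~1 arcsec apart in declination */
    printf("separation: %.2f arcsec\n",
           sep_arcsec(180.0, 0.0, 180.0, 1.0 / 3600.0));
    return 0;
}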
What I have been Doing
Peta Bumps
10k$ TB
Scaleable Computing
Sloan Digital Sky Survey