Tandem Daytona TeraByte Sort: Tsort 1 TB in 47.5 Minutes

David Cossock,
Sam Fineberg,
Pankaj Mehra,
John Peck
Trophy presentation by Jim Gray
1
Benchmark History
[Figure: benchmark timeline, 1970 to 2000. Labels shown: IBM TP 1-7 (CA and Tony Lukes); Debit Credit (Gray); Wisconsin (Bitton, Boral, DeWitt, Turbyfill); Datamation (Anon et al); Sort; TPC-A; TPC-B; TPC-C; TPC-D; TPC-W ?; MCC (Boral &...); Teradata (Bollinger &...); PennySort; MinuteSort]
2
A Short History of Sort
• April Fools 1985: Datamation Sort
– Sort 1M 100-byte records
– An IO benchmark: 15-min to 1 hr!
• 1993: {Minute | Penny}x{Daytona | Indy}
• 1998: TeraByte Sort
• Web site:
http://research.Microsoft.com/barc/SortBenchmark/
3
Ground Rules
• How much can you sort for a penny (in a minute).
– Hardware and Software cost
– Depreciated over 3 years
– 1M$ system gets about 1 second,
– 1K$ system gets about 1,000 seconds.
– Time (seconds) = 946,080 / SystemPrice ($)
• Input and output are disk resident
• Input is
– 100-byte records (random data)
– key is first 10 bytes.
• Must create output file
and fill with sorted version of input file.
• Daytona (product) and Indy (special) categories
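The penny budget in the ground rules can be checked with a few lines of arithmetic. The sketch below assumes straight-line depreciation over three 365-day years (94,608,000 seconds), so one penny buys 946,080 / price seconds. The function and constant names are illustrative, not part of the benchmark definition.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60   # 31,536,000 (assumes 365-day years)
DEPRECIATION_YEARS = 3                  # from the ground rules


def penny_budget_seconds(system_price_dollars: float) -> float:
    """Seconds of machine time that one penny buys on a system of this price."""
    lifetime_seconds = DEPRECIATION_YEARS * SECONDS_PER_YEAR  # 94,608,000
    dollars_per_second = system_price_dollars / lifetime_seconds
    return 0.01 / dollars_per_second


if __name__ == "__main__":
    # 1M$ and 1K$ are the two examples in the ground rules;
    # 1107$ is the PennySort machine price quoted later in the deck.
    for price in (1_000_000, 1_000, 1_107):
        budget = penny_budget_seconds(price)
        print(f"{price:>9}$ system: {budget:8.1f} s per penny")
```

Under these assumptions the 1107$ PennySort machine gets a budget of roughly 855 seconds, which is consistent with the 820 sec elapsed time reported for it.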
4
Bottleneck Analysis
• Drawn to linear scale
[Figure: bandwidth bars, drawn to linear scale]
– Disk R/W: ~15 MBps
– MemCopy: ~50 MBps
– Memory Read/Write: ~150 MBps
– Theoretical Bus Bandwidth: 422 MBps = 66 MHz x 64 bits
5
Bottleneck Analysis
• NTFS Read/Write
• 18 Ultra 3 SCSI on 4 strings (2x4 and 2x5)
3 PCI 64
~ 155 MBps Unbuffered read (175 raw)
~ 95 MBps Unbuffered write
• Recently: SQL Server on Xeon: 190MBps scan.
Good, but 10x down from S390/SGI/UE10k
[Figure: adapters on PCI buses feeding memory. Adapter: ~70 MBps; PCI: ~110 MBps; Memory Read/Write: ~250 MBps]
6
PennySort
• Hardware
– 266 Mhz Intel PPro
– 64 MB SDRAM (10ns)
– Dual Fujitsu DMA 3.2GB EIDE disks
• Software
– NT workstation 4.3
– NT 5 sort
• Performance
– sort 15 M 100-byte records (~1.5 GB)
– Disk to disk
– elapsed time 820 sec
• cpu time = 404 sec
PennySort Machine (1107$) cost breakdown: cpu 32%, Disk 25%, board 13%, Memory 8%, Other 22% (Cabinet + Assembly 7%; Network, Video, floppy 9%; Software 6%)
7
Recent Results
• NCSAsort: 10.3 GB in .9 minute, 60 Intel/NT/Myrinet nodes
• MillenniumSort: 16x Dell NT cluster:
100 MB in 1.08 Sec (Datamation)
8
1999 PennySort
• Daytona & Indy:
2.58 GB in 917 sec
• HMsort:
Brad Helmkamp,
Keith McCready,
Stenograph LLC
• Intel 400Mhz
2 IDE disks
9
1998 TB Sort
• Chris Nyberg
Nsort
SGI 32x Origin2000
151 Minutes
10
1999 Terabyte Sort
• Daytona:
David Cossock, Sam Fineberg,
Pankaj Mehra, John Peck
Tandem/Sandia TSort:
68 CPU ServerNet
47 minutes
• Indy:
IBM SPsort
488 nodes, 1952 cpu, 2168 disks
17.6 minutes = 1057sec
(all for 1/3 of 94M$, slice price is 64k$ for 4cpu, 2GB ram, 6 9GB disks + interconnect)
11
Sandia/Compaq/ServerNet/NT Sort
• Sort 1.1 Terabyte
(13 Billion records)
in 47 minutes
• 68 nodes (dual 450 MHz processors), 543 disks, 1.5 M$
• 1.2 GBps network rap (2.8 GBps pap)
• 5.2 GBps of disk rap (same as pap)
• (rap = real application performance, pap = peak advertised performance)
[Figure: node and fabric diagram. Node: Compaq Proliant 1850R Server with 2 400 MHz CPUs, 512 MB SDRAM, PCI Bus, ServerNet I dual-ported PCI NIC (to X fabric and to Y fabric, each via a 6-port ServerNet I crossbar switch), and 4 SCSI busses, each with 2 data disks. Bisection Line: each switch on this line adds 3 links to bisection width. X Fabric: 10 bidirectional bisection links. Y Fabric: 14 bidirectional bisection links. Caption: The 72-Node 48-Switch ServerNet-I Topology Deployed at Sandia National Labs]
12
SP sort
• 2 – 4 GBps!
[Figure: GB/s (0.0 to 4.0) versus elapsed time (0 to 900 seconds); series: GPFS read, GPFS write, Local read, Local write]
Storage: 56 nodes, 18 racks
Compute: 432 nodes, 37 racks
Total: 488 nodes, 55 racks
1952 processors, 732 GB RAM, 2168 disks
56 storage nodes manage 1680 4GB disks
336 4+P twin tail RAID5 arrays (30/node)
Compute rack:
16 nodes, each has
4x332Mhz PowerPC604e
1.5 GB RAM
1 32x33 PCI bus
9 GB scsi disk
150MBps full duplex SP switch
Storage rack:
8 nodes, each has
4x332Mhz PowerPC604e
1.5 GB RAM
3 32x33 PCI bus
30x4 GB scsi disk (4+1 RAID5)
150MBps full duplex SP switch
13
1999 Sort Records
• Penny
– Daytona: 2.58 GB in 917 sec. HMsort: Brad Helmkamp, Keith McCready, Stenograph LLC
– Indy: 2.58 GB in 917 sec. HMsort: Brad Helmkamp, Keith McCready, Stenograph LLC
• Minute
– Daytona: 7.6 GB in 60 seconds. Ordinal Nsort, SGI 32 cpu Origin IRIX
– Indy: 10.3 GB in 56.51 sec. NOW+MPI HPVMsort, Luis Rivera UIUC & Andrew Chien UCSD
• TeraByte
– Daytona: 49 minutes. David Cossock, Sam Fineberg, Pankaj Mehra, John Peck, 68x2 Compaq & Sandia Labs
– Indy: 1057 seconds. SPsort, 1952 SP cluster, 2168 disks, Jim Wyllie
• Datamation
– 1.18 Seconds. Phillip Buonadonna, Spencer Low, Josh Coates, UC Berkeley Millennium Sort, 16x2 Dell NT Myrinet
14
2x/year!
• Partly hardware
• Partly software
• Partly economics
[Figure: two log-scale trend charts, 1985 to 2000: Records Sorted per Second Doubles Every Year; GB Sorted per Dollar Doubles Every Year]
15
Progress on Sorting
• Speedup comes from Moore’s law 40%/year
• Processor/Disk/Network arrays: 60%/year
(this is a software speedup).
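The two rates above compound to roughly the doubling per year claimed on the charts. A minimal sketch, assuming the 40% and 60% yearly gains are independent and therefore multiply; the helper names are illustrative only.

```python
MOORE_RATE = 0.40   # yearly gain attributed to Moore's law (from the slide)
ARRAY_RATE = 0.60   # yearly gain from processor/disk/network arrays (from the slide)


def combined_yearly_factor(*rates: float) -> float:
    """Multiply independent yearly growth rates into one yearly speedup factor."""
    factor = 1.0
    for rate in rates:
        factor *= 1.0 + rate
    return factor


def growth_over(years: int, yearly_factor: float) -> float:
    """Total speedup after compounding the yearly factor for the given years."""
    return yearly_factor ** years


if __name__ == "__main__":
    factor = combined_yearly_factor(MOORE_RATE, ARRAY_RATE)  # 1.4 * 1.6
    print(f"combined yearly factor: {factor:.2f}x")
    print(f"compounded over 15 years: {growth_over(15, factor):.2e}x")
```

The combined factor is 1.4 x 1.6 = 2.24, slightly better than 2x per year.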
[Figure: Sort Records/second vs Time, 1985 to 2000, log scale from 1.E+02 to 1.E+08 (Records Sorted per Second Doubles Every Year), overlaid with GB Sorted per Dollar, log scale from 1.E-03 to 1.E+06 (Doubles Every Year). Data points: Bitton M68000, Tandem, Kitsuregawa Hardware Sorter, Sequent, Intel HyperCube, Cray YMP, IBM 3090, IBM RS6000, Alpha, NOW, Ordinal+SGI, Penny NT sort, NT/PennySort, Compaq/NT, Sandia/Compaq/NT, SPsort, SPsort/ IB]
16
Musings: PennySort=TBsort
• Sorts 1TB in 1Minute
• 2 pass so 3TB of disk
• = 10 disks if 330GB/disk
• = 5 GBps (if each disk is 500 MBps)
• So, 600 seconds (3TB/5GBps)
• So, node costs 1.5k$
• Costs 100x that today
• maybe in 10 years?
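The arithmetic behind this musing can be worked through in a short sketch. It assumes decimal units (1 TB = 10^12 bytes) and reuses the penny budget rule of 946,080 / price seconds; the function names are hypothetical.

```python
import math

TB = 10**12   # decimal units assumed
GB = 10**9
MB = 10**6


def disks_needed(total_bytes: int, disk_capacity_bytes: int) -> int:
    """Smallest whole number of disks that holds total_bytes."""
    return math.ceil(total_bytes / disk_capacity_bytes)


def required_bandwidth(total_bytes: int, seconds: float) -> float:
    """Aggregate bytes/second needed to move total_bytes in the given time."""
    return total_bytes / seconds


if __name__ == "__main__":
    disk_space = 3 * TB                          # "2 pass so 3TB of disk"
    n_disks = disks_needed(disk_space, 330 * GB)
    aggregate = required_bandwidth(disk_space, 600)
    per_disk = aggregate / n_disks
    budget = 946_080 / 1_500                     # penny budget for a 1.5k$ node

    print(f"disks needed: {n_disks}")
    print(f"aggregate bandwidth: {aggregate / GB:.1f} GBps")
    print(f"per-disk bandwidth: {per_disk / MB:.0f} MBps")
    print(f"penny budget for 1.5k$ node: {budget:.0f} s")
```

Under these assumptions a 1.5k$ node has about 630 seconds per penny, so the 600-second target needs about 5 GBps in aggregate across the 10 disks.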
17
Data Gravity
Processing Moves to Transducers
• Move Processing to data sources
• Move to where the power (and sheet metal) is
• Processor in
– Modem
– Display
– Microphones (speech recognition)
& cameras (vision)
– Storage: Data storage and analysis
• System is “distributed” (a cluster/mob)
18
Disk = Node
• has magnetic storage (100 GB?)
• has processor & DRAM
• has SAN attachment
• has execution environment
[Figure: software stack on the disk node: Applications; Services; DBMS; RPC, ...; File System; SAN driver; Disk driver; OS Kernel]
19
SAN:
Standard Interconnect
Gbps SAN: 110 MBps
PCI: 70 MBps
UW Scsi: 40 MBps
FW scsi: 20 MBps
scsi: 5 MBps
• LAN faster than memory bus?
• 1 GBps links in lab.
• 100$ port cost soon
• Port is computer
• Winsock: 110 MBps (10% cpu utilization at each end)
20