DATABASE DEPLOYMENT SCENARIOS AND PERFORMANCE ON SSD ARRAYS Mark Holliman Wide Field Astronomy Unit Institute for Astronomy University of Edinburgh Summary Some Astronomical survey databases are becoming so large that some curation tasks are approaching unreasonable timeframes. This is evident in such surveys as the Galactic Plane Survey (GPS) for UKIDSS, and the VISTA Variables in the Via Lactea (VVV). The VVV alone contains >10 of TB of data and includes a detection table with >1010 rows. While RDBMSs are capable of handling this much data, the execution times for curation activities can stretch into weeks at a time. To address this issue, there are two main approaches: 1. 2. Switch from a RDBMS to a Key-Pair based model or Column oriented DB (i.e. Hadoop, MonetDB, etc) Throw Hardware at it (i.e. SSDs, SAN, LUSTRE, GPFS, etc) As you can probably guess, this talk is about #2 The Big Database Problem In an RDBMS the main bottleneck on almost all operations is disk I/O. While parallel processing and increased RAM can address some performance issues, ultimate performance figures are dictated by how fast data can be moved to/from the storage medium. For large databases (>1TB) the most cost efficient storage medium are RAID5/6/10 arrays of spinning disks (HDD). These disks range in size from 1TB to 4TB at present. These arrays provide redundancy in case of disk failure, while at the same time speed up I/O by spreading disk operations across multiple devices simultaneously The rotational speed on the hard drives is the main factor in determining an HDD’s performance. The faster the disk spins, the faster read/write operations can occur. Most enterprise disks run at 7200RPM, though 10000RPM and 15000RPM disks are available (at a serious jump in price) SSDs are just beginning to approach the size/price ratio necessary for large DBs. Test Server Details Intel Xeon 2.8GHz, 24 cores 16GB RAM Disk Subsystem (6Gbps SAS RAID) HDD1,2: 7 x 1TB HDD RAID5 arrays SSD1: 1 x 512GB Crucial SSD SSD2: 6 x 512GB Crucial SSD RAID5 Array OS: Windows Server 2008 R2 DBMS: SQL Server 2008 R2 Full DB Tests The first tests involved placing an entire database on each particular disk subsystem and running a set of queries to measure performance. The queries were constructed to represent 3 specific use cases in order to identify exactly where the SSDs provide performance gains. Database: 2MASS Query 1, All Indexed Columns: Select * From twomass_psc Where ((ra> 240) AND (ra<242)) OR ((ra> 120) AND (ra<122)) AND ((dec> -47) AND (dec<-44)) OR ((dec> 120) AND (dec<123)) Query 2, All Nonindexed Columns: Select * From twomass_psc Where (j_m>16) AND (h_m>15.6 AND h_m<15.8) AND (k_m>15.1 AND k_m<15.3) Query 3, Mixed Index Columns: Select * From twomass_psc Where (j_m>16) AND (h_m>15.6 AND h_m<15.8) AND (k_m>15.1 AND k_m<15.3) AND ((ra> 230) AND (ra<241)) OR ((ra> 110) AND (ra<121)) AND ((dec> -47) AND (dec<-44)) OR ((dec> 120) AND (dec<123)) AND ((k_psfchi>1) OR (k_psfchi<0.4)) AND (dist_opt<.3) AND (k_msig_stdap<1) Full DB Test Results Indexed Col Select Time (s) Nonindexed Col Select Time (s) 300 3500 3000 253 250 3000 212 2500 200 150 150 150 2000 1500 100 1140 1000 50 500 0 293 255 SSD SSD RAID5 0 RAID5 RAID0 SSD SSD RAID5 RAID5 RAID0 Mixed Index Col Select Time (s) 1400 1200 1200 1000 800 600 305 400 104 200 0 RAID5 SSD SSD RAID5 SSD Cost Issue While putting an entire Database on SSDs is certainly preferable, it is unfortunately cost prohibitive for very large archive databases (where 10’s of TB are required). At recent pricing: 1TB on SSD = ~£500 vs 1TB on HDD = ~£87 So with this in mind we ran a second battery of tests to assess whether sufficient performance gains can be achieved with a hybrid system, whereby some database files are placed on SSDs while the rest remain on standard HDD arrays SSD File Location Tests Database: UKIDSS DR5 4 different representative queries were run (email me if you want to see the actual SQL) #1: produce light curves #2: produce variable light curves #3: JOIN Source and Detection Tables (single index) #4: JOIN Source and Detection Tables (multi indices) 4 different file/disk configurations All files on HDD Indices on SSD (RAID) Source Tables on SSD (RAID) Detection Tables on SSD (RAID) SSD File Location test results Query 1 Execution Time 25 11 12 21 19 20 Query 2 Execution Time 9 10 15 8 10 6 5 4 2 1 SSD Index SSD Source 4 2 0 Spin Disk 6 SSD Detection 0 Spin Disk Query 3 Execution Time 1400 1271 SSD Source SSD Detection Query 4 Execution Time 1289 60 1167 1200 SSD Index 50 1000 49 39 40 800 33 37 30 600 20 400 130 200 0 10 0 Spin Disk SSD Index SSD Source SSD Detection Spin Disk SSD Index SSD Source SSD Detection Conclusions The biggest performance gains from SSDs are seen on queries involving non-indexed columns RAID arrays of SSDs do provide performance gains over individual disks, though these gains are not universal or consistently scaled Placing Source or Detection tables on SSDs both result in large performance gains Performance gains are heavily dependant on the query being run, and how much of the query uses the tables on the SSDs Due to SSD size constraints, Source tables are more likely to fit as they tend to be smaller Curation activities could be optimized by moving database files from HDD to SSD for intensive operations, then moved back to HDD once complete. This only works if the file movement speeds are high enough to overcome the extra time necessary for running operations on the HDD stored files Index file operations are actually slower on SSD than on HDD This is a serious surprise Further testing is necessary to determine if this is due to the inherent sequential access method of HDDs, or if this is due to some quirk of the MS SQL index file format