Data Movement & Storage Using the Data Capacitor Filesystem
Justin Miller, jupmille@indiana.edu
http://pti.iu.edu/dc
Big Data for Science Workshop, July 2010

Challenges for DISC
• The keynote by Alex Szalay identified the challenges that researchers face
  – "Scientific data doubles every year"
• The sheer amount of data is a barrier to extracting knowledge
  – The problem of today: data access
  – How can we minimize data movement?

Workflow Example – Single Compute
[Diagram: data moves from the Data Source to a Compute Resource and on to the Researcher's Computer]

Workflow Example – Multiple Compute
[Diagram: data moves from the Data Source to Compute Resource #1 and Compute Resource #2, then to the Researcher's Computer]

Workflow Example – Visualization
[Diagram: the Data Source, Compute Resources #1 and #2, and the Researcher's Computer, with a Visualization Resource added]

Workflow Example – Archive
[Diagram: the same workflow with a Tape Archive added alongside the Data Source, Compute Resources, Visualization Resource, and Researcher's Computer]

Data Movement & Storage
• This is an unsustainable workflow
  – It works at the GB scale, maybe a single TB, but not more
• Every added resource means another series of transfers
  – Data movement gets in the way of doing the work
  – Yet there are good reasons to add resources to a workflow
• And we haven't addressed the other drawbacks

IU Central Filesystem Workflow
[Diagram: the Data Capacitor sits at the center, connected to the Data Source, Compute Resources #1 and #2, the Visualization Resource, the Tape Archive, and the Researcher's Computer]

IU's Data Capacitor Filesystem
• Funded by the National Science Foundation in 2005
• The funds purchased 535TB of Lustre storage
  – 339TB available as a production service
• The Data Capacitor name comes from electronics
  – a capacitor provides transient storage of electrons
  – it absorbs and evens out peaks in flow
  – it provides consistent output

Idea of Data Capacitor
• Centralized short-term storage for IU resources
  – Store the data you compute against, and use it as "scratch space" during your run
  – The possibility exists for mid-term storage

Data Capacitor Centralized Storage
• Compute using IU's supercomputer Big Red
• Compute using IU's Quarry cluster
• Archive to IU's massive HPSS tape archive
  – hierarchical storage
  – archive your data to tape

Central to IU Cyberinfrastructure

Physics Research
• Dr. Chuck Horowitz, IU physicist
  – Interested in the behavior of neutron stars
  – Studying the behavior of nuclear matter near saturation density
    • it can form an interesting phase, "nuclear pasta"
  – Using MDGRAPE-2 hardware for increased performance

Physics Research
• Particle interactions are simulated via molecular dynamics using specialized MDGRAPE-2 hardware
  – configurations are saved
• Post-processing
  – creates VTK frames (a rough sketch of this step appears below)
• Visualization system
  – ingests the frame data
  – displays it as a movie

Physics Research
[Diagram: Compute Resource, Data Capacitor, Visualization Resource, and Tape Archive in the simulation workflow]
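The slides do not describe the actual post-processing tools Horowitz's group uses, so the following is only a minimal illustration of the "configurations in, VTK frames out" step: a Python sketch that assumes a hypothetical plain-text configuration file with one "x y z" position per line and writes a legacy-format VTK point file that a visualization system could ingest.

```python
# Hypothetical post-processing step: convert one saved MD configuration
# (assumed here to be a text file with "x y z" positions, one per line)
# into an ASCII legacy-format VTK point-cloud frame for visualization.
import sys

def configuration_to_vtk(config_path: str, frame_path: str) -> None:
    # Read particle positions from the saved configuration.
    with open(config_path) as f:
        points = [line.split()[:3] for line in f if line.strip()]

    # Write a minimal ASCII legacy VTK file containing just the points.
    with open(frame_path, "w") as out:
        out.write("# vtk DataFile Version 3.0\n")
        out.write("nuclear pasta frame\n")
        out.write("ASCII\n")
        out.write("DATASET POLYDATA\n")
        out.write(f"POINTS {len(points)} float\n")
        for x, y, z in points:
            out.write(f"{x} {y} {z}\n")

if __name__ == "__main__":
    # e.g. python config_to_vtk.py config_00042.txt frame_00042.vtk
    configuration_to_vtk(sys.argv[1], sys.argv[2])
```

Each saved configuration becomes one frame file; running the script over every configuration produces the sequence of frames the visualization system turns into a movie.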
Earth Science Research
• Linked Environments for Atmospheric Discovery (LEAD)
• WxChallenge
  – The WxChallenge is a meteorological forecast competition
  – Competitors forecast maximum and minimum temperatures, precipitation, and maximum wind speeds for selected U.S. cities over a ten-week period each semester

LEAD Workflow
[Diagram: weather data flows to a compute resource and into the Data Capacitor, with transfers on to a computer science cluster and other resources]

Extend the Centralized FS Model
• The natural progression is to be central to more resources
  – Make data available to more resources
• IU did this by extending the filesystem across the wide-area network (WAN)
  – Data Capacitor WAN (DC-WAN)
  – A new filesystem, separate from the original DC

Data Capacitor WAN

Data Capacitor WAN Tradeoffs
• The benefit of a centralized WAN filesystem is the illusion of locality
• Your data is transferred behind the scenes across the network
  – At worst, your data is transferred more slowly than you would like
  – At best, it is as fast as, or faster than, local storage; typically it is comparable across research networks

DC-WAN Namespace Mapping
• The WAN filesystem challenge is heterogeneous user identification across sites
• The numeric user identification (UID) for a particular user is not the same across sites
• You don't have to worry about this, because DC-WAN does the conversion

  Site      Username   UID
  Indiana   jupmille   uid=648424
  TACC      tg803934   uid=803934
  PSC       jupmille   uid=43415
  NCSA      jupmille   uid=40436
  SDSC      jupmille   uid=502639

Physics Research with DC-WAN
[Diagram: simulation on a remote compute resource in Austin, TX; analysis on another compute resource; visualization resource; Data Capacitor WAN; tape archive]

Astronomy with DC-WAN
[Diagram: the WIYN Telescope's One Degree Imager (ODI) in Tucson, AZ feeds the Data Capacitor WAN, with analysis on a compute resource and a tape archive. Image: NOAO/AURA/NSF]

Center for the Remote Sensing of Ice Sheets (CReSIS) Workflow
[Diagram: field data from Greenland and Antarctica, a compute resource in Lawrence, KS, the Data Capacitor WAN, and a tape archive]

Gas Giant Planet Research
[Diagram: the Data Capacitor WAN links a visualization resource and tape archive with remote compute resources in Pittsburgh, PA, Urbana, IL, and Starkville, MS]

Demo
• A small sample of the Gas Giant Planet Research
• The data is on DC-WAN, which is mounted on two different resources
  – Compute on PSC's Pople (SGI Altix 4700)
  – Post-process and visualize the results on an IU machine that has the proprietary software (IDL v7.0); view over the network

IU's Data Capacitor WAN Filesystem
• Funded by Indiana University in 2008
• 339TB of storage available as a production service
• Centralized short-term storage for nationwide resources, including the TeraGrid
  – Use your data on the best resource for your needs
  – Short-term storage like the DC; the possibility exists for mid-term storage

Based on Lustre Filesystem
• Lustre is a parallel distributed filesystem
• Available under the GNU GPL
• Used by the U.S. government, movie studios, financial institutions, and the oil and gas industry
• Lustre ran on 7 of the top 10 HPC systems on the June 2009 "Top 500" list
  – 52 of the top 100 run Lustre in 2010

Based on Lustre Filesystem
• "Lustre filesystems can support up to tens of thousands of client systems, petabytes (PBs) of storage and hundreds of gigabytes per second (GB/s) of I/O throughput."
• A scalable filesystem
  – uses separate servers and aggregates them for performance
  – the storage backend is hidden from the client

Lustre Filesystem Architecture
• Lustre presents all clients with a standard POSIX filesystem interface
  – Filesystem mount
    • My scratch directory, for example, is the same path at every site:
      – IU: /N/dcwan/scratch/jupmille/
      – PSC: /N/dcwan/scratch/jupmille/
      – TACC: /N/dcwan/scratch/jupmille/
      – NCSA: /N/dcwan/scratch/jupmille/
  – Standard commands
    • ls, cp, cat, etc. from the command line
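Because the mount is an ordinary POSIX filesystem, application code needs no Lustre-specific API to use it. The sketch below uses the DC-WAN scratch path from the slide above; the filename and file contents are made up for illustration.

```python
# Ordinary POSIX file I/O works on the Lustre mount; no special API is needed.
# The scratch directory is the DC-WAN path from the slide; the filename and
# contents are made up for illustration.
import os

SCRATCH = "/N/dcwan/scratch/jupmille"
RESULT = os.path.join(SCRATCH, "run042_results.dat")

# Written on one resource (say, a compute node at PSC)...
with open(RESULT, "w") as f:
    f.write("step=42 energy=-1.234e5\n")

# ...and readable unchanged from any other site that mounts DC-WAN
# (IU, TACC, NCSA), because every client sees the same namespace.
with open(RESULT) as f:
    print(f.read())
```

The same holds for command-line tools: ls, cp, and cat on any mounting site operate on the identical path, which is what makes the "illusion of locality" work.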
Lustre Filesystem Architecture
• Metadata Server (MDS)
  – stores the filesystem metadata, such as filenames, directories, and permissions
  – handles file operations such as open/close
• Object Storage Server (OSS)
  – bulk I/O servers
• Object Storage Targets (OST)
  – back-end storage devices

Lustre Filesystem Architecture
[Diagram: clients talk to a single MDS and to multiple OSSes, each of which fronts several OSTs]

Data Capacitor Hardware
• 8 pairs of Dell PowerEdge 2950 servers
  – 2 x 3.0 GHz dual-core Xeon
  – Myrinet 10G Ethernet
  – dual-port QLogic 2432 HBA (4 x FC)
  – Linux 2.6 kernel (RHEL 5), Lustre 1.8
• 4 DDN S2A9550 controllers
  – over 2.4 GB/sec measured throughput each
  – 339TB of spinning SATA disk

Data Capacitor WAN Hardware
• 2 pairs of Dell PowerEdge 2950 servers
  – 2 x 3.0 GHz dual-core Xeon
  – Myrinet 10G Ethernet
  – dual-port QLogic 2432 HBA (4 x FC)
  – Linux 2.6 kernel (RHEL 5), Lustre 1.8
• 1 DDN S2A9550 controller
  – over 2.4 GB/sec measured throughput
  – 339TB of spinning SATA disk

Getting the Most out of Lustre
• Lustre is optimized for large files (where large means >1MB) and is less suited to small files
• Lustre has aggressive client-side caching
  – if you plan to read the same files more than once, this is a big win
• Lustre lets you control how your data is striped across the OSTs, so tuning striping to your I/O patterns can yield real gains in throughput (a short striping example follows at the end)

Lustre WAN Future
• DC-WAN will be mounted on the India and Sierra FutureGrid clusters
  – currently in the testing phase
• IU's Lustre UID-mapping code will be used in a new TeraGrid Lustre-WAN project now in development

Thank you for listening.
• Questions are welcome.
  – Please use the moderators for Q&A

Justin Miller, jupmille@indiana.edu
Data Capacitor Team, dc-team-l@indiana.edu
http://pti.iu.edu/dc
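A short companion to the "Getting the Most out of Lustre" slide. Striping is normally adjusted with Lustre's lfs command-line utility; the sketch below just wraps those calls from Python. The directory is hypothetical and the stripe count and size are illustrative values, not recommendations from the talk (on Lustre 1.8 the stripe-size flag is -s; newer releases use -S).

```python
# Illustrative only: set and inspect Lustre striping for a directory by
# calling the standard `lfs` utility. The directory is hypothetical and the
# stripe count/size are example values, not recommendations from the talk.
import subprocess

SCRATCH_DIR = "/N/dcwan/scratch/jupmille/big_output"  # hypothetical directory

# Stripe new files created in this directory across 8 OSTs in 4 MB chunks,
# which typically helps large, sequential I/O.
subprocess.run(
    ["lfs", "setstripe", "-c", "8", "-s", "4M", SCRATCH_DIR],
    check=True,
)

# Show the striping that files created here will inherit.
subprocess.run(["lfs", "getstripe", SCRATCH_DIR], check=True)
```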