Big Data Open Source Software and Projects
ABDS in Summary VIII: Level 11A
I590 Data Science Curriculum, August 15 2014
Geoffrey Fox, gcf@indiana.edu, http://www.infomall.org
School of Informatics and Computing, Digital Science Center, Indiana University Bloomington

HPC-ABDS Layers
Here are the 17 functionalities. Technologies are presented in this order: the 4 cross-cutting layers at the top, then the 13 layers of the layered diagram starting at the bottom. This part of the collection (Level 11A) covers File management and Formats.
1) Message Protocols
2) Distributed Coordination
3) Security & Privacy
4) Monitoring
5) IaaS Management from HPC to hypervisors
6) DevOps
7) Interoperability
8) File systems
9) Cluster Resource Management
10) Data Transport
11) SQL / NoSQL / File management
12) In-memory databases & caches / Object-relational mapping / Extraction Tools
13) Inter process communication: Collectives, point-to-point, publish-subscribe
14) Basic Programming model and runtime: SPMD, Streaming, MapReduce, MPI
15) High level Programming
16) Application and Analytics
17) Workflow-Orchestration

iRODS
• The Integrated Rule-Oriented Data System (iRODS) is BSD-licensed open source data management software (http://irods.org/) in use at research organizations and government agencies worldwide.
• iRODS is a production-level distribution aimed at deployment in mission-critical environments.
• It functions independently of storage resources and abstracts data control away from storage devices and device location, allowing users to take control of their data.
• iRODS abstracts data services from data storage to facilitate executing services across heterogeneous, distributed storage systems.
• iRODS empowers data stewards by partitioning the policies, rules, and services they develop from repository management.
• iRODS executes your data policies on your schedule.
• iRODS virtualizes data policy by separating data management policy enforcement from repository management.

Scientific Data/File Formats I: NetCDF, CDF
• NetCDF (Network Common Data Form) http://en.wikipedia.org/wiki/NetCDF is a set of software libraries and self-describing, machine-independent data formats that support the creation, access, and sharing of array-oriented scientific data (a short usage sketch follows below). The project homepage is hosted by the University Corporation for Atmospheric Research (UCAR). This work started in 1989 and was based on CDF.
• Common Data Format (CDF) http://en.wikipedia.org/wiki/Common_Data_Format is a library and toolkit that was developed by the National Space Science Data Center (NSSDC) at NASA starting in 1985. The software is an interface for the storage and manipulation of multi-dimensional data sets.
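To make the "self-describing, array-oriented" model concrete, here is a minimal sketch using the netCDF4 Python bindings. The file name, dimension, and variable names are illustrative assumptions, not anything mandated by NetCDF.

```python
# Minimal NetCDF sketch using the netCDF4 Python bindings (pip install netCDF4).
# File, dimension, and variable names are illustrative assumptions.
from netCDF4 import Dataset

# Write a small self-describing file: a dimension, a variable, and attributes.
with Dataset("example.nc", "w", format="NETCDF4") as ds:
    ds.createDimension("time", None)                 # unlimited dimension
    temp = ds.createVariable("temperature", "f4", ("time",))
    temp.units = "K"                                 # metadata travels with the data
    temp[0:3] = [280.0, 281.5, 279.8]

# Read it back: the file describes its own structure and metadata.
with Dataset("example.nc", "r") as ds:
    t = ds.variables["temperature"]
    print(t.units, t[:])
```

Note that NetCDF version 4 (as stated later in these notes) is layered on HDF5, so the same file can also be inspected with HDF5 tools.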
Scientific Data/File Formats II: HDF
• Hierarchical Data Format (HDF) http://en.wikipedia.org/wiki/Hierarchical_Data_Format is a set of file formats (HDF4, HDF5) designed to store and organize large amounts of numerical data. Originally developed at the National Center for Supercomputing Applications, it is supported by the non-profit HDF Group, whose mission is to ensure continued development of HDF5 technologies and the continued accessibility of data stored in HDF.
• In keeping with this goal, the HDF format, libraries, and associated tools are available under a liberal, BSD-like license for general use. HDF is supported by many commercial and noncommercial software platforms, including Java, MATLAB/Scilab, Octave, IDL, Python, and R. The freely available HDF distribution consists of the library, command-line utilities, test suite source, Java interface, and the Java-based HDF Viewer (HDFView).
• The current version, HDF5, differs significantly in design and API from the major legacy version HDF4. HDF5 simplifies the file structure to include only two major types of object:
  – Datasets, which are multidimensional arrays of a homogeneous type
  – Groups, which are container structures that can hold datasets and other groups
• This results in a truly hierarchical, filesystem-like data format. In fact, resources in an HDF5 file are accessed using the POSIX-like syntax /path/to/resource (see the sketch below). Metadata is stored in the form of user-defined, named attributes attached to groups and datasets. More complex storage APIs representing images and tables can then be built up using datasets, groups, and attributes.
• In addition to these advances in the file format, HDF5 includes an improved type system, and dataspace objects which represent selections over dataset regions. The API is also object-oriented with respect to datasets, groups, attributes, types, dataspaces, and property lists.
• The latest version of NetCDF, version 4, is based on HDF5.
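As a minimal sketch of the datasets / groups / attributes model and the POSIX-like path syntax, assuming the h5py Python bindings; the file, group, and dataset names are invented for illustration.

```python
# Minimal HDF5 sketch using h5py (pip install h5py).
# File, group, and dataset names are illustrative assumptions.
import h5py
import numpy as np

with h5py.File("example.h5", "w") as f:
    grp = f.create_group("experiment1")                       # group = container
    dset = grp.create_dataset("temperature", data=np.arange(10, dtype="f4"))
    dset.attrs["units"] = "K"                                  # named attribute on a dataset
    grp.attrs["instrument"] = "sensor-A"                       # named attribute on a group

with h5py.File("example.h5", "r") as f:
    # Resources are addressed with POSIX-like paths:
    d = f["/experiment1/temperature"]
    print(d.attrs["units"], d[:5])
```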
Scientific Data/File Formats III: OPeNDAP
• OPeNDAP ("Open-source Project for a Network Data Access Protocol") http://en.wikipedia.org/wiki/OPeNDAP is a data transport architecture and protocol widely used by earth scientists. The protocol is based on HTTP and the current specification is the OPeNDAP 2.0 draft.
  – OPeNDAP includes standards for encapsulating structured data, annotating the data with attributes, and adding semantics that describe the data.
  – The protocol is maintained by OPeNDAP.org, a publicly funded non-profit organization that also provides free reference implementations of OPeNDAP servers and clients.
• An OPeNDAP client could be an ordinary browser or a specialized visualization/analysis engine.
• An OPeNDAP client sends requests to an OPeNDAP server and receives various types of documents or binary data as a response. One such document is the DDS (received when a DDS request is sent), which describes the structure of a data set. A data set, seen from the server side, may be a file, a collection of files, or a database. Another document type that may be received is the DAS, which gives attribute values for the fields described in the DDS. Binary data is received when the client sends a DODS request.
• An OPeNDAP server can serve an arbitrarily large collection of data. Data on the server is often in HDF or NetCDF format, but can be in any format, including a user-defined format.
• Compared to ordinary file transfer protocols (e.g. FTP), a major advantage of OPeNDAP is the ability to retrieve subsets of files, and also the ability to aggregate data from several files in one transfer operation.
• OPeNDAP is widely used by governmental agencies such as NASA and NOAA to serve satellite, weather, and other observed earth science data.

Scientific Data/File Formats IV: FITS
• Flexible Image Transport System (FITS) http://en.wikipedia.org/wiki/FITS is an open astronomy standard defining a digital file format useful for storage, transmission, and processing of scientific and other images.
  – FITS is the most commonly used digital file format in astronomy.
  – Unlike many image formats, FITS is designed specifically for scientific data and hence includes many provisions for describing photometric and spatial calibration information, together with image origin metadata.
• The FITS format was first standardized in 1981; it has evolved gradually since then, and the most recent version (3.0) was standardized in 2008.
• FITS was designed with an eye towards long-term archival storage, and the maxim "once FITS, always FITS" represents the requirement that developments to the format must be backwards compatible.
• A major feature of the FITS format is that image metadata is stored in a human-readable ASCII header, so that an interested user can examine the headers to investigate a file of unknown provenance. The information in the header is designed to allow calculating the byte offset of some information in the subsequent data unit, supporting direct access to the data cells.
• Each FITS file consists of one or more headers containing ASCII card images (80-character fixed-length strings) that carry keyword/value pairs, interleaved between data blocks. The keyword/value pairs provide information such as size, origin, coordinates, binary data format, free-form comments, history of the data, and anything else the creator desires: while many keywords are reserved for FITS use, the standard allows arbitrary use of the rest of the namespace.

RCFile Hive Data Format I
• See Hive's RCFile (Record Columnar File), introduced in 2011
• http://www.infomall.org/I590ABDSSoftware/Resources/ICDEworkshop-2014_XiaodongZhang.pptx
• Distributes row groups among nodes as in a row store, but organizes data by columns within each group
• Minimizes unnecessary I/O operations
  – Within a row group, the table is partitioned by columns
  – Only the needed columns are read from disk
• Minimizes network costs in row construction
  – All columns of a row are located in the same HDFS block
• Data loading speed comparable to a row store
  – Only adds a vertical-partitioning operation to the row-store data loading procedure
• Applies efficient data compression algorithms
  – Can use the compression schemes used in column stores

RCFile Hive Data Format II
• [Figure: RCFile combines row-store and column-store layouts. Row groups 1-3, 4-6, and 7-9 are stored as HDFS blocks on DataNodes 1-3; within each row group the data is laid out as compressed metadata plus compressed columns A-D. Accompanying charts compare unnecessary I/O transfers and unnecessary network transfers (MBytes) for row-store, column-store, and RCFile.]

ORC Hive Data Format
• http://docs.hortonworks.com/HDPDocuments/HDP2/HDP2.0.0.2/ds_Hive/orcfile.html
• Hive's RCFile has been the standard format for storing Hive data since 2011. However, RCFile has limitations because it treats each column as a binary blob without semantics.
• Hive 0.11 adds a new file format named Optimized Row Columnar (ORC) that uses and retains the type information from the table definition. ORC uses type-specific readers and writers that provide lightweight compression techniques such as dictionary encoding, bit packing, delta encoding, and run-length encoding -- resulting in dramatically smaller files.
• Additionally, ORC can apply generic compression using zlib, LZO, or Snappy on top of the lightweight compression for even smaller files. However, storage savings are only part of the gain. ORC supports projection, which selects subsets of the columns for reading, so that queries reading only one column read only the required bytes (see the sketch below).
• Furthermore, ORC files include lightweight indexes recording the minimum and maximum values for each column in each set of 10,000 rows and in the entire file. Using pushdown filters from Hive, the file reader can skip entire sets of rows that are not relevant to the query.
• Finally, ORC works together with the query vectorization work, providing a high-bandwidth reader/writer interface.
• http://2013.berlinbuzzwords.de/sessions/orc-file-improving-hive-data-storage
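The native route to ORC is Hive DDL (declaring a table STORED AS ORC). As a hedged Python sketch of the same ideas -- typed columnar storage, column projection, and predicate pushdown on read -- the example below uses PySpark's ORC reader/writer instead of the Hive CLI; paths, column names, and values are illustrative assumptions.

```python
# Hedged sketch: writing and reading ORC files with PySpark rather than Hive.
# Paths, columns, and values are illustrative assumptions.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("orc-sketch").getOrCreate()

df = spark.createDataFrame(
    [(1, "alice", 3.5), (2, "bob", 4.0), (3, "carol", 2.5)],
    ["id", "name", "score"],
)

# Write columnar ORC files; type information from the schema is retained.
df.write.mode("overwrite").orc("/tmp/scores_orc")

# Projection plus a filter: only the 'score' column's bytes need to be read,
# and ORC's min/max indexes let the reader skip row groups via predicate pushdown.
high = spark.read.orc("/tmp/scores_orc").select("score").where("score > 3.0")
high.show()

spark.stop()
```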
Apache Parquet IO/Format I
• https://github.com/Parquet/apache-proposal
• In its relatively short lifetime (co-founded by Twitter and Cloudera in July 2013), Parquet has already become the de facto standard for efficient columnar storage of Apache Hadoop data,
  – with native support in Impala, Apache Hive, Apache Pig, Apache Spark, MapReduce, Apache Tajo, Apache Drill, Apache Crunch, Cascading, Scalding, Kite, Presto, and Shark.
• Supports the data models of Apache Avro, Thrift, and Google Protocol Buffers
• Based on the Google Dremel paper
• http://blog.cloudera.com/blog/2014/05/congratulations-to-parquet-now-an-apache-incubator-project/
• http://www.slideshare.net/cloudera/hadoop-summit-36479635

Apache Parquet IO/Format II
• https://github.com/apache/incubator-parquet-format
• Parquet makes the advantages of compressed, efficient columnar data representation available to any project in the Hadoop ecosystem.
• Parquet is built from the ground up with complex nested data structures in mind, and uses the record shredding and assembly algorithm described in the Dremel paper. We believe this approach is superior to simple flattening of nested namespaces.
• Parquet is built to support very efficient compression and encoding schemes. Multiple projects have demonstrated the performance impact of applying the right compression and encoding scheme to the data. Parquet allows compression schemes to be specified on a per-column level (see the sketch below), and is future-proofed to allow adding more encodings as they are invented and implemented.
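As a sketch of compressed columnar storage with per-column codecs and column projection, assuming the pyarrow Python library; the file name, column names, and codec choices are illustrative, not prescribed by Parquet.

```python
# Hedged sketch: writing/reading Parquet with pyarrow (pip install pyarrow).
# File name, column names, and codec choices are illustrative assumptions.
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({
    "user_id": [1, 2, 3, 4],
    "event":   ["click", "view", "click", "purchase"],
    "value":   [0.0, 0.0, 0.0, 19.99],
})

# Compression can be chosen per column, matching the per-column scheme
# flexibility described above.
pq.write_table(
    table,
    "events.parquet",
    compression={"user_id": "snappy", "event": "gzip", "value": "snappy"},
)

# Column projection: read only the columns a query needs.
subset = pq.read_table("events.parquet", columns=["user_id", "value"])
print(subset)
```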