Substituting HDF5 tools with Python/H5py scripts
Daniel Kahn
Science Systems and Applications Inc.
HDF and HDF-EOS Workshop XIV, 28 Sep. 2010

What are HDF5 tools?

HDF5 tools are command line programs distributed with the HDF5 library. They allow users to manipulate HDF5 files.

  h5dump: dump HDF5 data as ASCII text.
  h5import: convert non-HDF5 data to HDF5.
  h5diff: show differences between HDF5 files.
  h5copy: copy objects between HDF5 files.
  h5repack: copy an entire file while changing storage properties of HDF5 objects.
  h5edit: (proposed) add attributes to HDF5 objects.

HDF5 tools have a long history as the first (and for a long time the only) way to manipulate HDF5 files conveniently, that is, without writing a C or Java program or buying expensive commercial software such as IDL or Matlab.

The tools can be characterized as having three parts:

  Text Processing: evaluate command arguments, process input text files, match group names.
  Tree Walking: search the HDF5 file hierarchy for objects by name.
  Object Level Operations: operate on the objects (copy, diff, repack, etc.).

The tools are simple to use and convenient, as they are distributed with the HDF5 library.

Disadvantages of HDF5 tools:

The command line arguments limit tool capability. Adding new features with a command line syntax that is both readable and does not break the legacy syntax becomes difficult.

Development time for designing and implementing new features is long (weeks to months). Use cases must be evaluated, a solution proposed in an RFC, the proposal implemented, and the new code distributed in the next release.

Here's an example from the HDF documentation:

    h5copy -v -i "test1.h5" -o "test1.out.h5" -s "/array" -d "/array"

But suppose we had multiple datasets named arrayNNN, where each N is a digit 0-9. We'd like to write something like:

    h5copy -v -i "test1.h5" -o "test1.out.h5" -s "/array\d{3}"

so that \d{3} would match all such objects.
Extending the tool syntax to meet this use case, and then again for the next use case, would be a never-ending game of catch-up. A more flexible substitute is desirable...

...Python?

What is Python?

Python is a programming language.

  It features dynamic binding of variables, like Perl, shell scripts, IDL, and Matlab, but unlike C or Fortran.
  Unlike Perl, it supports native floating point numbers.
  It has scientific array support in the style of IDL or Matlab (the numpy module). Array operations can be programmed using normal arithmetic operators.
  It has access to the HDF5 library (Andrew Collette's h5py module).

Python is currently the only programming language in widespread use to have all these features. They are essential to the success of the language for easy HDF5 file manipulation.

Real World Experience: learning Python and h5py is quick.

In the summer of 2010 SSAI hired a summer intern. Equipped with some Perl programming experience, the intern was able to come up to speed on Python, HDF5, h5py, and numpy within one to two weeks and, over the summer, develop a specialized file/dataset merging tool and a dataset conversion tool.

Python and h5py are the best way to introduce HDF5 because they allow the user to concentrate on the H in HDF5, rather than on the C API syntax.

Python is well suited to HDF5

Python is well suited to HDF5 because numpy array objects carry the dimensionality, extent, and element data type information, just as HDF5 datasets do. The object-oriented nature of Python allows these objects to be manipulated at a high level. C, by contrast, lacks a scientific array object and the ability to define object methods.
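The parallel between numpy arrays and HDF5 datasets described above can be checked directly. The following is a minimal sketch, assuming h5py and numpy are installed; the scratch file name metadata_demo.h5 and the variable names are made up for illustration.

```python
import os
import tempfile

import h5py
import numpy

# A numpy array knows its own rank, extent, and element type.
data = numpy.arange(1, 25, dtype="int32").reshape(4, 6)

# Hypothetical scratch file; any writable path would do.
path = os.path.join(tempfile.mkdtemp(), "metadata_demo.h5")

with h5py.File(path, "w") as f:
    # No explicit dataspace or datatype calls: h5py takes both from the array.
    f["/TestDataset"] = data

with h5py.File(path, "r") as f:
    dset = f["/TestDataset"]
    # Gather the same three facts from the array and from the stored dataset.
    array_info = (data.ndim, data.shape, data.dtype)
    dataset_info = (len(dset.shape), dset.shape, dset.dtype)

print("numpy array :", array_info)
print("HDF5 dataset:", dataset_info)
```

Both tuples should describe a rank-2, 4 by 6, 32-bit integer object, which is why the assignment in the write step needs no extra bookkeeping.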
Example: Creating and Writing a Dataset to a New File

Python:

    import h5py
    import numpy

    # Build a 4 x 6 array of 32-bit integers holding 1..24.
    TestData = numpy.array(range(1,25),dtype='int32').reshape(4,6)
    # Create the file, write the dataset, and close the file on exit.
    with h5py.File("WrittenByH5PY.h5","w") as fid:
        fid['/TestDataset'] = TestData

Compare to the C version:

    #include "hdf5.h"

    int main() {
        hid_t file_id, dataspace_id, dataset_id; /* identifiers */
        herr_t status;
        hsize_t dims[2];
        const int FirstIndex = 4, SecondIndex = 6;
        int i, j, dset_data[4][6];

        for (i = 0; i < 4; i++) /* Initialize the dataset. */
            for (j = 0; j < 6; j++)
                dset_data[i][j] = i * 6 + j + 1;

        dims[0] = FirstIndex;
        dims[1] = SecondIndex;

        /* Create a new file, truncating any existing one. */
        file_id = H5Fcreate("WrittenByC.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);
        /* Create the dataspace and the dataset. */
        dataspace_id = H5Screate_simple(2, dims, NULL);
        dataset_id = H5Dcreate(file_id, "/TestDataset", H5T_STD_I32LE, dataspace_id,
                               H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
        /* Write the dataset. */
        status = H5Dwrite(dataset_id, H5T_NATIVE_INT, H5S_ALL, H5S_ALL, H5P_DEFAULT, dset_data);

        status = H5Dclose(dataset_id);   /* Close the dataset. */
        status = H5Sclose(dataspace_id); /* Close the dataspace. */
        status = H5Fclose(file_id);      /* Close the file. */
        return 0;
    }

And here's the output:

    h5dump WrittenByH5PY.h5
    HDF5 "WrittenByH5PY.h5" {
    GROUP "/" {
       DATASET "TestDataset" {
          DATATYPE  H5T_STD_I32LE
          DATASPACE  SIMPLE { ( 4, 6 ) / ( 4, 6 ) }
          DATA {
          (0,0): 1, 2, 3, 4, 5, 6,
          (1,0): 7, 8, 9, 10, 11, 12,
          (2,0): 13, 14, 15, 16, 17, 18,
          (3,0): 19, 20, 21, 22, 23, 24
          }
       }
    }
    }

Python and the Three Pillars of HDF5 Tools

Python is well suited to Text Processing. Python has a wide range of string manipulation functions, an easy-to-use regular expression module, and list and dictionary (hash table) objects. No segmentation faults!

Python is well suited to Tree Walking. Recursive functions and loops over lists are easy to write.

Object Level Operations... not so much. Object Level Operations (e.g. copy, diff) are challenging to write efficiently and should be provided as part of the API by the HDF Group, for example H5Ocopy.
API functions are available to the Python programmer via h5py.

Why use Python to substitute HDF5 tools?

Python is available now. Some HDF5 tools are still under development as new use cases are presented. For example, users have requested a tool to add attributes to HDF5 files. Such a capability already exists with h5py:

    python -c "import h5py ; fid = h5py.File('FileForAttributeAddition.h5','r+') ; fid['/TestDataset'].attrs['CmdLine1'] = 'NewValue' ; fid.close()"

It's a little ugly, but it is available today.

Python is a full programming language. It can accomplish tasks which HDF5 tools cannot.

Further Resources:

  http://groups.google.com/group/h5py
  http://h5py.alfven.org/

Recommendations:

Users should consider Python and h5py to accomplish their HDF5 file manipulation projects.

The HDF Group should concentrate on providing efficient API functions for object level tasks: object copy, dataset difference, etc.

The HDF Group should avoid complex enhancements to tools where Python/h5py could be used instead.

An easily searched repository of contributed applications on the HDF Group website, with user ratings, would be very helpful.
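As an illustration of these recommendations, the arrayNNN use case from the h5copy discussion can be sketched in a few lines of h5py. This is only a sketch under stated assumptions: the function name copy_matching and the file names are hypothetical, the regular expression is matched against the full object path, and the copy itself is delegated to h5py's Group.copy so that the object-level work stays inside the HDF5 library.

```python
import os
import re
import tempfile

import h5py
import numpy


def copy_matching(src_path, dst_path, pattern):
    """Copy every dataset whose full path matches `pattern`.

    Hypothetical helper for illustration; returns the sorted paths copied.
    """
    regex = re.compile(pattern)
    matches = []

    def collect(name, obj):
        # Tree walking: visititems calls this for every object below the root.
        # Text processing: the regular expression decides what to keep.
        if isinstance(obj, h5py.Dataset) and regex.fullmatch("/" + name):
            matches.append(name)

    with h5py.File(src_path, "r") as src, h5py.File(dst_path, "w") as dst:
        src.visititems(collect)
        for name in sorted(matches):
            # Object level operation: left to the library's own object copy.
            src.copy(name, dst, name=name)
    return ["/" + name for name in sorted(matches)]


# Build a small input file (made-up names) and copy only the arrayNNN datasets.
workdir = tempfile.mkdtemp()
src_path = os.path.join(workdir, "test1.h5")
dst_path = os.path.join(workdir, "test1.out.h5")

with h5py.File(src_path, "w") as f:
    for name in ("array000", "array001", "array123", "array12", "other"):
        f[name] = numpy.arange(6, dtype="int32")

copied = copy_matching(src_path, dst_path, r"/array\d{3}")
print(copied)
```

The two passes (collect names, then copy) keep the tree walk separate from the writes; a different matching rule only requires a different pattern argument, with no change to any tool syntax.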