7 +/- 2 Maybe Good Ideas
John Caron
June 2011

(1)
• NetCDF-Java (aka CDM) has lots of functionality, but only available in Java
  – NcML Aggregation
  – Access to lots of other file formats
  – Feature types (eg collections of point data)
  – Ironically, some functionality (eg aggregation) already available for remote datasets through opendap
    – But not for local datasets
How can we get the CDM into other languages?
  – Replicate in C and maintain two software stacks
  – Use reverse JNI (call Java from C)
  – Or …

CdmRemote Server (aka TDS Lite)
• Lightweight server for CDM datasets
  – Zero configuration: use queries to configure
  – Local filesystem
  – Cache expensive objects
  – Allow non-Java applications access to CDM stack
  – Create virtual datasets: aggregations, logical views
  – Coordinate space queries
  – Feature Type subsetting
  – New API (!)

[Diagram: CdmRemote Server (aka TDS Lite). Labels: Python / ? and C Client applications, cdmRemote, cdmRemote Server, CDM; layers Point Feature API, Coordinate Systems, Data Access, Data]

(2) Ncstream as a netCDF file format
• Write-optimized
• Append only
• Encode the full CDM object model
• Uses Google’s protobuf for serialization
• Java, C Libraries can read and access through the standard netCDF API
• Tools to convert to netcdf-3 and 4 formats

(3) BUFR/GRIB Table registration
• Unidata sponsored web service
• Registered users can upload BUFR/GRIB tables
  – Unique id is assigned (MD5 16 byte checksum?)
  – Convince producers to include the id into the data: unambiguous which table was used
  – Anyone can download.
• GRIB and BUFR Decoding
  – Using CDM: find bugs!
  – Might become (ad-hoc) reference library
  – Might spur objections from “the experts”
  – Turn over to WMO if they want it
• Survival of Human Race is at stake here

(4) Streaming data / standing queries
• The proposal Dennis and I submitted last year
• “As soon as it arrives on IDD, send me PrecipTotal from NCEP/RUC2 model subsetted by lat/lon bounding box in netCDF-4 / CF format”
• “As it arrives, send me GTS BUFR data in lat/lon bounding box in CSV”

Current IDD data access
[Diagram labels: IDD Data, Push (header), LDM, FILE, Pull requests, TDS, CDM library, Dataset (x4)]

Content based filtering (standing requests)
[Diagram labels: IDD Data, LDM, PIPE, Message Service (Content filtering, Change encoding, Protocol?), Push (content), Request (x3), Content Filter, Standing request service]

(5) Python
• Unidata should choose a scripting language to support, and give scientists full access to all of our tools in it
• Python wants to be the open-source Matlab
• DOE, BADC have bought into Python
• Python is a safe choice

(6) NetCDF management tools
• Develop consistent set of tools for managing collections of netCDF files
  – Use existing tools (ncgen, nccopy, ncdump, nco, etc) under the covers, but don’t be constrained by their interfaces
• Look at RDBMS management languages
• Use a scripting language like Python

(7) Hadoop
  – Open Source started by Doug Cutting (Lucene) and Yahoo
  – Based on Google’s Map-Reduce for parallel processing
  – Lots of industry use, part of new data ecosystems
  – Objects in distributed, replicated file system
  – Commodity, shared-nothing hardware nodes
  – Simple key-value store
  – Append-only, sequential reading
  – Scale to arbitrarily large amount of data
  – (batch) Gather many queries and run them over the data

(8) SciDB
• Michael Stonebraker, David DeWitt
  – “SciDB will be optimized for data management of big data and for big analytics.
  – “The scientists that are participating in our open source project believe that the SciDB database, when completed, will dramatically impact their ability to conduct their experiments faster and more efficiently and further improve the quality of life on our planet by enabling them to run experiments that were previously impossible due to the limitations of existing database systems and infrastructure.”
• Getting involved:
  1. Load netcdf/hdf5 into SciDB
  2. “Native mode”: leave data in netcdf/hdf5
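
Sketch for idea (3), BUFR/GRIB table registration: one way the unique table id could work, assuming the id is simply the MD5 digest (16 bytes) of the uploaded table's raw bytes. The slide leaves the choice open with a question mark, so this is an illustration and not a settled design. The names `table_id` and `TableRegistry` are hypothetical and do not come from any Unidata service.

```python
import hashlib


def table_id(table_bytes: bytes) -> bytes:
    """Return the 16-byte MD5 digest of a table's raw bytes.

    Assumption: the id depends only on the table content, so the same
    table uploaded twice gets the same id.
    """
    return hashlib.md5(table_bytes).digest()


class TableRegistry:
    """Hypothetical in-memory stand-in for the registration web service."""

    def __init__(self):
        self._tables = {}

    def upload(self, table_bytes: bytes) -> str:
        """Store a table and return its id as a hex string."""
        tid = table_id(table_bytes).hex()
        self._tables[tid] = table_bytes
        return tid

    def download(self, tid: str) -> bytes:
        """Return the table registered under this id (KeyError if unknown)."""
        return self._tables[tid]
```

A producer would embed the returned id in the data; a decoder that finds the id can then fetch exactly the table that was used, with `registry.download(tid)`.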
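
Sketch for idea (4), standing queries: a minimal content filter in the spirit of “As it arrives, send me ... data in lat/lon bounding box in CSV”. Everything here is an assumption made for illustration: the `Observation` fields, the `BoundingBox` class, and the `standing_query` function are hypothetical, and a plain Python iterable stands in for data arriving on the IDD. The box test does not handle boxes that cross the date line.

```python
import csv
import io
from dataclasses import dataclass
from typing import Iterable, Iterator


@dataclass(frozen=True)
class BoundingBox:
    south: float
    north: float
    west: float
    east: float

    def contains(self, lat: float, lon: float) -> bool:
        # Simple containment test; assumes west <= east (no date-line crossing).
        return self.south <= lat <= self.north and self.west <= lon <= self.east


@dataclass(frozen=True)
class Observation:
    # Hypothetical decoded record; real BUFR messages carry far more fields.
    station: str
    lat: float
    lon: float
    value: float


def standing_query(stream: Iterable[Observation], bbox: BoundingBox) -> Iterator[str]:
    """Yield one CSV line for each arriving observation inside bbox."""
    for obs in stream:
        if bbox.contains(obs.lat, obs.lon):
            buf = io.StringIO()
            writer = csv.writer(buf, lineterminator="\n")
            writer.writerow([obs.station, obs.lat, obs.lon, obs.value])
            yield buf.getvalue().rstrip("\n")
```

Because `standing_query` is a generator, each matching record is pushed out as soon as it is read from the stream, which is the difference from the current pull model where the client asks after the data has been written to a file.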