Enabling Eco-Science Analysis with MatLab and DataCubes in the Cloud

Jayant Gupchup† and Catharine van Ingen*
†Computer Science Department, The Johns Hopkins University
*Microsoft Research

Abstract: The ecological sciences are rapidly becoming data intensive sciences. Several groups have been pioneering the use of databases, datacubes, and web services to address some of the data handling challenges caused by the avalanche/tsunami/flood of data. Science happens only when the data are actually analyzed, and today that very often happens with one of the common scientific desktop analysis tools such as Excel, MatLab, ArcGIS, or SPlus. The challenge then is how to connect the data in the cloud to the analysis tool on the desktop without requiring a full data download. This article describes our prototype connection between one such service and one such tool. We describe how this approach can be generalized across a number of different science questions and tools. We also explain why this is a good solution from a scientist's perspective.

1 Introduction

The combination of rapid advances in sensor technology, remote sensing, and internet data availability is causing a dramatic increase in the amount and diversity of ecological science data. At the same time, scientists are collaborating to attempt regional or even global scale studies. These analyses require mixing time-series data from different sources with site property or other ancillary data from still different sources. Using a database to assemble and curate such data collections has been documented in depth elsewhere [OZER], [SDSS], [CUAHSI-ODM].

At the Berkeley Water Center [BWC], we have been building a number of related environmental datasets. We describe one of these datasets and the kinds of analyses commonly performed. We then describe the important aspects of our databases and datacubes. Our focus here is on how to enable common data analyses using tools already in use by scientists. We chose to use MatLab for two reasons [MATLAB]. First, a number of our scientists use it. Second, we also wanted to use it for simple visualizations which are tedious to do in Excel. We describe how we have connected MatLab on the desktop over the internet to one of our datacubes. The approach can be generalized to connect other tools to our family of datacubes to give our scientists a choice in analysis tools. We believe this has implications for other researchers exploring how to enable ecological scientists.

1.1 The Ameriflux carbon-climate data set

The Ameriflux network [AMERIFLUX] is a scientific collaboration of over 50 institutions across America and operates approximately 120 measurement sites. Each site provides continuous observations of ecosystem level exchanges of CO2, water, energy and momentum spanning diurnal, synoptic, seasonal, and inter-annual time scales. Ameriflux is one of several regional networks that together form the FLUXNET global collaboration.

Each Ameriflux tower site contributes 22 common measurements to the Ameriflux archive at ORNL. The ORNL archive works with researchers in the sister CarboEuropeIP network to produce science-ready data products which are gap-filled, quality assessed, and contain additional computed variables. The data are used to understand how climate interacts with plants at a systems level to influence carbon flux and global warming. In the past, such studies have been primarily individual site investigations. Today, regional and global analyses are being attempted.
At the same time, the data are also used by other non-field scientists to provide ground truth for climate modeling efforts and satellite-based remote sensing data.

Carbon-climate data is similar to many other environmental data sets in the following ways. The data has strong temporal characteristics: understanding diurnal, seasonal, long-term, and other time variations is important to the science. The data has important spatial characteristics; for example, micro-climate is affected by latitude, longitude, and proximity to the coast. There are strong and weak correlations between the observed and computed variables; understanding those relationships, such as the change in leaf production as the result of temperature and precipitation correlations, is at the center of the science. Analysis of the time series data often requires knowledge of other site parameters such as vegetative cover or soil composition, or site disturbances such as fires, floods, or harvests.

Similarly, the sorts of data analyses are similar to other environmental analyses. Examples include:
- Look for trends or changes in variables outside of the common diurnal and seasonal fluctuations.
- Look for changes in variables after a relatively rare event or disturbance such as a flood or fire.
- Look for similarities and differences in variables across sites of similar characteristics (such as tropical rainforests).
- Integrate with maps.

These characteristics are very common to other ecological sciences such as hydrology, oceanography, and meteorology.

1.2 Scientific Datacubes

The what-where-when nature of time series data drives much of our database schema and datacube dimensions. A datacube is a database specifically optimized for data mining or OLAP [GRAY1996], [SSAS]. Datacube abstractions include:
- Simple aggregations such as sum, minimum, or maximum can be precomputed for speed.
- Hierarchies such as year to day of year to hour of day can be defined for simple filtering with drilldown capability.
- Additional calculations such as median or standard deviation can be computed dynamically or pre-computed.
- All operate along dimensions such as time, site, or measurement variable.

Datacubes can be constructed from relational databases using commercial tools.

Figure 1. Example environmental datacube dimensions. Common dimensions include what (datumtype and exdatumtype), where (site and offset), when (timeline), which (WorkingGroup), and how (quality).

As shown in Figure 1, our cubes have five common dimensions:
- What or datumtype and exdatumtype: measurement variable such as precipitation or latent heat flux. Because of the large number of variables we handle, we sometimes parse the variable as a primary variable (datumtype) and one or more extensions (exdatumtype). Most analyses need only the primary variable.
- Where or site and offset: (x, y, z) location where (x, y) is the site location and (z) is the vertical elevation at the site. The site dimension also surfaces important site attributes such as climate classification or vegetative cover to allow locations to be grouped or filtered along those characteristics. The site dimension includes hierarchies such as latitude band which enable drilldown as well as grouping.
- When or time: timeline. This dimension allows aggregation across different time granularities such as day of year or hour of day. We also build a number of hierarchies to enable drilldown in time from decade to year through to minute. Some of our cubes also include science-specific time attributes and hierarchies such as water-year or MODIS-week.
- Which or working group or dataset: data versioning or other collections such as "all Boreal forest sites" or "real-time data" useful for analyses. As shown in Figure 1, this is a many-to-many dimension; a given site can be a member of multiple datasets.
- How or quality: this dimension varies the most across our cubes, although all have some notion of data quality. This may include spike detection, gap-filling, or other data "goodness" metrics.

We have been including a few computed members in addition to the usual count, sum, minimum and maximum:
- hasDataRatio: fraction of data present across time and/or variables. This measure includes both original data and any gap-filled data.
- DailyCalc: average, sum or maximum depending on variable; includes units conversion.
- YearlyCalc: similar to DailyCalc.
- RMS or sigma: standard deviation or variance for fast error or spread viewing.
- gapPercent: percentage of contributing data that is either missing or has been gap-filled.
Datacubes are queried with the multidimensional query language MDX [MDX], [MDXTutorial]. MDX is similar to the SQL query language but has some prominent differences. A SQL query returns a flat table; each column relates to one query element (e.g., time, datumtype, site). An MDX query returns a matrix with a notion of a column axis and a row axis; each cell relates to two or more elements. Each axis can contain one or more dimensions or attributes. Thus each axis can be viewed as a join of all the dimensions on that axis.

1.3 Datacube Clients

In recent years, there has been considerable growth in the attention given to simple access to datacubes. Most of these tools are GUI-based and intended for business applications. Tableau [TAB], Proclarity [PRO], and Cognos [COG] are three such business-oriented applications which provide a GUI and additional analysis features. At present, the most common way of accessing a datacube is the Excel PivotTable [EXCEL]. Excel PivotTables allow you to set up a connection with the datacube and then browse and select the data using a drag-and-drop mechanism. The MDX query is generated by Excel and passed over an OLEDB connection.

Figure 2 shows how MDX queries are rendered in Excel. The PivotTable columns correspond to the MDX query columns; the PivotTable rows correspond to the MDX query rows. The returned measures populate the PivotTable array. The query shown in Figure 2 is:

    -- COLUMN AXIS (Data DIMENSIONS: variables and sites)
    SELECT NON EMPTY CROSSJOIN (
        {[Datumtype].[Datumtype].[Datumtype]},
        {[Site].[IGBPClass].[IGBPClass]} )
        DIMENSION PROPERTIES PARENT_UNIQUE_NAME ON COLUMNS,
    -- ROW AXIS (Time DIMENSIONS: Year, day)
    NON EMPTY CROSSJOIN (
        {[Timeline].[Year].&[2003]},
        {[Timeline].[day].[day]} )
        DIMENSION PROPERTIES PARENT_UNIQUE_NAME ON ROWS
    FROM [LatestAmfluxL3L4Daily]
    -- Aggregate Measures
    WHERE ([Measures].[Average])

Figure 2: Rendering of an MDX query in Excel. The various fragments of the query and the rendering are marked in the same box style (background color and font) to make it easier to identify the mapping.

Despite this ease of use, Excel has a number of restrictions from a scientist's viewpoint. Excel PivotTables have limited plotting capabilities. To make a scatter plot, you must cut-paste the data from the PivotTable, thereby losing the ability to update the data via query. Excel does not have a scripting facility. Scientists often make collections of very similar graphs, for example to look at different variables across sites. To graph each column in a returned PivotTable requires a lot of tedious select-cut-paste. While Excel includes some scientific libraries such as histograms or Fourier transforms, the selection is not as wide as in tools intended for scientists such as MatLab. The libraries are also not well integrated with PivotTables, again likely leading to fragile cut-paste.

These same limitations apply to the above commercial tools as well. These tools also suffer from the difference between scientific graphics and business graphics: the colors, shapes and axis labeling are foreign to scientists. Familiarity is important to scientists. At a minimum, the difference means that the plot must be repeated with another tool prior to publication. Our preliminary survey suggests that pairing a datacube with a rich scientific client application should offer the best of both. The datacube provides simple slice and dice to aggregates; the rich client provides scripting, familiar graphics and powerful analysis libraries.
2 System Overview

The components of our solution are shown in Figure 3. This section explains each component and identifies those which can be reused with other clients or datacube structures.

Figure 3: System Architecture. On the MatLab side, the GUI Builder collects GUI selections, the MDX Field Picker passes fields and filters to the Query Builder (driven by menu and cube configuration files), the Handle Manager maps column indices to handles and column names, and the Deserializer reconstructs the results object. Credentials and the MDX query travel over HTTP to a web server; the Serializer passes the authenticated query to the datacube via ADO MD and returns the serialized results.

2.1 GUI Builder

The GUI Builder allows the scientist to select the dimension attributes, hierarchy levels, and measures to be retrieved for inclusion in the analysis. As shown in Figure 4, the GUI is divided into two major panes. The Field Axis acts as the column axis whereas the Time Axis acts as the row axis. The Field Axis supplies the what-where-which; the Time Axis supplies the when.

"What" is determined by the "Select Datum" box. In Figure 4, "LE" (latent heat flux) is selected. Multiple datums can be selected by control-clicking.

"Where" is chosen with the "Select Groupings" box and the associated "Drill down sites" check box. Latitude and longitude bands are common selection criteria. If the drill down sites check box is selected, data are returned for each site within the latitude bands; if the check box is not selected, data are aggregated across the band.

"Which" is chosen with the "Dataset" box and the associated "Use dataset filter" check box. By default, all data are included in the returned aggregate. If the check box is selected, only the highlighted datasets are included in the query.

"When" is selected in the "Time Axis" pane. The time range is selected by the start and stop years. The hierarchy to be used and the depth of the hierarchy to be traversed are selected in the "Select Time" box. The data aggregate is chosen in the "Select Measure" box.

Note that the interface does not support setting a filter on a date-time window. This is a limitation of MDX: there is no construct that allows specification of, say, months 04-12 for 1999, months 08-12 for 2006, and full months for the years in between. We chose to set a filter at the year granularity.

The contents of each menu are determined by configuration files. For example, each entry in the "datum.txt" file is entered on a new line, and each entry is of the format <alias, MDX representation>. The aliases are shown in the GUI box and the MDX representations are used by the MDX Field Picker to create the lists that are passed to the Query Builder module. As an illustration, the entry for the LE datum shown in Figure 4 looks like:

LE,[Datumtype].[Datumtype].&[11]
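For concreteness, a small datum.txt built this way might read as sketched below. The LE and Precip member keys are the ones used in the example queries later in this article; the TA entry is a made-up placeholder rather than an actual cube key.

    LE,[Datumtype].[Datumtype].&[11]
    Precip,[Datumtype].[Datumtype].&[19]
    TA,[Datumtype].[Datumtype].&[3]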
Figure 4. GUI Builder. The GUI exposed by the GUI Builder is used to select the what-where-when-which.

Note also that the prototype does not include the "how" or quality dimension.

2.2 MDX Field Picker

The MDX Field Picker module provides the primary MatLab programming interface. The field picker invokes the GUI Builder, passes the obtained user selections to the Query Builder, and returns the results object. To invoke the GUI and make selections, the MDX Field Picker is called as:

[v1 v2 v3 res] = MdxFieldPicker();

where v1, v2, v3 are GUI variables and res is the returned results object. After the query parameters are selected, the user hits submit to exit the GUI and the above call returns. To retrieve the results, the field picker is invoked a second time:

[v1 v2 v3 res] = MdxFieldPicker('MdxFieldPickerOutputFcn', v1, v2, v3);

2.3 Query Builder

The Query Builder module builds the MDX query based on the GUI Builder selections. The Query Builder module accepts as input:
- List of groups (sites) and datums
- Time hierarchies (Year to day, etc.)
- Filters: time range filter and dataset filter
- Variable measure(s)

The selected datum(s) and group(s) form the column axis. The Query Builder looks at the number of dimensions needed and then cross-joins dimensions as necessary. Similarly, the row axis is generated from the time range and hierarchies. The SELECT clause is then constructed by combining the row axis MDX and the column axis MDX. The FROM clause is specified in the cube configuration file. The measures and dataset filters are used to generate the WHERE clause. Finally, the clauses are combined to complete the MDX query. Our prototype Query Builder can generate queries where each axis has up to 3 dimensions. This limit was chosen for coding simplicity and accommodates our family of related eco-datacubes.

2.4 Serializer (Cube Access)

The Query Builder invokes the ASP-based web service Serializer by HTTP POST. The Serializer unpacks the post, passes the query to the datacube and then produces a results stream. An example post is below:

http://<xxxx>/mdxconnect/Default.aspx?db=LatestAmfluxL3L4Daily&mdx=SELECT%20%20NON%20EMPTY%20CROSSJOIN%20({[Datumtype].[Datumtype].%26[1],[Datumtype].[Datumtype].%26[19]},{[Site].[Site].&[477]})%20%20DIMENSION%20PROPERTIES%20PARENT_UNIQUE_NAME%20ON%20COLUMNS,%20%20NON%20EMPTY%20{[Timeline].[Year].%26[1990]:[Timeline].[Year].%26[2006]}%20%20DIMENSION%20PROPERTIES%20PARENT_UNIQUE_NAME%20ON%20ROWS%20%20FROM%20%20LatestAmfluxL3L4Daily%20%20WHERE%20%20([Measures].[YearlyCalc])

The natural question is why one needs to produce the results as a stream. Excel and other ADO-compatible applications [ADO] can talk to the cube using the OLE DB (ADO MD) drivers. The OLE DB driver maintains the relationship of each returned data cell with the associated two or more dimensions. After much investigation, we found that no such driver exists for environments that cannot handle ADO objects; MatLab is one such environment. To solve this problem, we made use of the underlying structure of an MDX result: we serialize the results in a manner that can be reconstructed at the client end. To do this, we:
- convert the query results into a stream using the ADO MD driver,
- convert that stream to a text stream,
- pass that text stream over the internet, and
- reconstruct the stream into an object that maintains the cell-dimension(s) association on the client.
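Before turning to the stream layout, the sketch below shows roughly how the client half of this exchange could be reproduced from MatLab. It is a minimal sketch under stated assumptions, not the prototype code: the LE and Precip member keys and cube name are copied from the examples in this article, urlencode and urlread are standard MatLab functions, the host name is the same placeholder used above, and the basic-authentication credentials the prototype sends with the request are omitted.

% Column axis: cross-join two datums with one site (keys from the examples).
colAxis = ['CROSSJOIN({[Datumtype].[Datumtype].&[11],' ...
           '[Datumtype].[Datumtype].&[19]},' ...
           '{[Site].[Site].&[477]})'];
% Row axis: a year range on the Timeline dimension.
rowAxis = '{[Timeline].[Year].&[2000]:[Timeline].[Year].&[2006]}';

% Combine the clauses into the full MDX query.
mdx = ['SELECT NON EMPTY ' colAxis ...
       ' DIMENSION PROPERTIES PARENT_UNIQUE_NAME ON COLUMNS, ' ...
       'NON EMPTY ' rowAxis ...
       ' DIMENSION PROPERTIES PARENT_UNIQUE_NAME ON ROWS ' ...
       'FROM LatestAmfluxL3L4Daily WHERE ([Measures].[YearlyCalc])'];

% Post the query to the Serializer and read back the serialized text
% stream. <xxxx> is a placeholder host name; the prototype also supplies
% HTTP Basic Authentication credentials with the request.
url = ['http://<xxxx>/mdxconnect/Default.aspx?db=LatestAmfluxL3L4Daily' ...
       '&mdx=' urlencode(mdx)];
stream = urlread(url);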
The organization of the stream is as follows. The first two numbers give the number of rows and columns. This is followed by the number of dimensions on the column axis and then the actual column-dimension attributes; knowing the number of columns and the number of dimensions on the column axis tells us how many column-dimension attributes to write. Next we write the number of dimensions on the row axis followed by the row-dimension attributes; again, combining the number of rows and the number of dimensions on the row axis tells us how many row-dimension attributes to write. After the dimensions on the row and column axes, we write the data matrix [row x col].

As an illustration, consider the results of the following MDX query:

SELECT NON EMPTY CROSSJOIN (
    {[Datumtype].[Datumtype].&[11],[Datumtype].[Datumtype].&[19]},
    {[Site].[SiteID].&[477],[Site].[SiteID].&[480]} )
    DIMENSION PROPERTIES PARENT_UNIQUE_NAME ON COLUMNS,
NON EMPTY {[Timeline].[Year].&[2000]:[Timeline].[Year].&[2006]}
    DIMENSION PROPERTIES PARENT_UNIQUE_NAME ON ROWS
FROM LatestAmfluxL3L4Daily
WHERE ([Measures].[YearlyCalc])

The result of this query is serialized as:

6,4,2,LE,US-Ton,LE,US-Var,Precip,US-Ton,Precip,US-Var,1,2001,2002,2003,2004,2005,2006,46.1471300802596,NaN,NaN,14.022134347994,NaN,38.6128757119495,NaN,33.8220144215576,NaN,NaN,NaN,81.4135755203902,87.5986066925887,44.6077928524156,267.782040508823,116.888838782413,267.167004732928,295.245106825869 ...

The first two numbers tell us the number of rows and columns in the result: here the number of rows is 6 and the number of columns is 4. The next (third) number tells us the number of dimensions on the column axis. In this example, we have 2 dimensions on the column axis and 4 columns, therefore we must have 2 x 4 = 8 attributes on the column dimensions (LE,US-Ton through Precip,US-Var). The row axis follows this; with only one dimension and 6 rows, there are 6 attributes (the years 2001 through 2006). Lastly the data [row x col] are written.

Access to the Serializer is secured with HTTP Basic Authentication [HTTP] and a dedicated machine-local no-login account. The Serializer then accesses the datacube using the NT AUTHORITY\NETWORK SERVICE account. We realize that basic authentication is not a long-term solution, as the credentials are encoded as Base64 in clear text and can be decoded quite easily [BASIC]. It does, however, demonstrate that some level of security can be achieved. Basic authentication also prevents web crawlers and robots from accessing the data and overloading the system.

2.5 Deserializer

The results stream is deserialized by reversing the serialization mechanism. We construct a MatLab object that associates the cells with the dimensions. The pseudo-class is represented as follows:

Struct MdxResults {
    Integer: Number of Rows
    Integer: Number of Cols
    Integer: Number of dimensions on Col axis
    Integer: Number of dimensions on Row axis
    Struct Axis[2]: Axis structure
    Double[,]: Data
}

Struct Axis {
    String [Number of dimensions in axis][Number of attributes in each dimension]: Header
}

The MatLab object, res, that implements the above results structure is shown below:

res =
    rows: 27
    cols: 37
    dim1axes: 2
    dim2axes: 1
    axis: [1x2 struct]
    data: [27x37 double]
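A minimal sketch of this deserialization step is shown below, assuming a stream laid out exactly as described above. The function name parse_mdx_stream is hypothetical, the field names mirror the res object, and the orientation of the axis header arrays is chosen to match the stream layout; the prototype implementation may differ in detail.

function res = parse_mdx_stream(stream)
% Reconstruct an MdxResults-like struct from the serialized text stream.
% Layout: rows, cols, then for each axis the number of dimensions followed
% by its attribute strings, then the data values written row by row.
tok = textscan(stream, '%s', 'Delimiter', ',');
tok = tok{1};                        % cell array of tokens
res.rows = str2double(tok{1});
res.cols = str2double(tok{2});
p = 3;

% Column axis: number of dimensions, then (dims x cols) attribute strings.
res.dim1axes = str2double(tok{p}); p = p + 1;
n = res.dim1axes * res.cols;
res.axis(1).dim = reshape(tok(p:p+n-1), res.dim1axes, res.cols); p = p + n;

% Row axis: number of dimensions, then (dims x rows) attribute strings.
res.dim2axes = str2double(tok{p}); p = p + 1;
n = res.dim2axes * res.rows;
res.axis(2).dim = reshape(tok(p:p+n-1), res.dim2axes, res.rows); p = p + n;

% Remaining tokens are the data; str2double maps 'NaN' tokens to NaN.
vals = str2double(tok(p:p+res.rows*res.cols-1));
res.data = reshape(vals, res.cols, res.rows)';   % stream is row-major
end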
The MatLab user has access to this structure and the query results without having to construct the MDX query. A typical MDX result contains many dimensions and many attributes associated with those dimensions. As such, we need a mechanism that enables the MatLab user to make the column-attribute association using some form of search. The Handle Manager is that mechanism.

2.6 Handle Manager

The Handle Manager associates the datacube dimensions and attributes with the returned results columns. The Handle Manager is invoked by:

hm = handle_manager(res.axis(1).dim)

Consider an MDX query with 3 dimensions on the column axis, each of which has 10 associated attributes. The total number of columns in the result set will be 10 x 10 x 10 = 1000 columns. The Handle Manager provides a MatLab-user-friendly way to find, say, the right two columns for a scatter plot. The prototype Handle Manager provides two mechanisms to make the association. The user can:
- provide the column number (index) and get back the fully qualified name of the column, formed by concatenating the attributes along the different dimensions; or
- provide the attribute names and obtain the indices (or handles) at which those attributes are found. The name can be either partially or fully qualified.

To further illustrate this point, consider Figure 5 to be the output of a small, simple MDX query. There are two sites (US-Ton and US-Var) and two datumtypes (LE and Precip).

Figure 5: Sample result set of an MDX query. Yearly values of two datumtypes (LE and Precip) are returned for two sites (US-Ton and US-Var) for the years 2001 through 2006.

To discover the contents of column 3, the user can retrieve the fully qualified column name "US-Var_LE":

header = get_header(hm, 3)

The user can also retrieve the columns with names containing "US-Var" (columns 3 and 4), "LE" (columns 1 and 3), or both "US-Var" and "LE" (column 3):

index = find_dims(hm, 'US-Var', 'LE')
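Putting the pieces together, a typical interactive session might look like the sketch below. It is an illustration of the workflow built only from the calls documented above, with the column names assuming the small result set of Figure 5; it is not taken verbatim from the prototype.

% Select datums, sites, and time range in the GUI, then retrieve the results.
[v1 v2 v3 res] = MdxFieldPicker();
[v1 v2 v3 res] = MdxFieldPicker('MdxFieldPickerOutputFcn', v1, v2, v3);

% Map column names to indices with the Handle Manager.
hm = handle_manager(res.axis(1).dim);
leCol = find_dims(hm, 'US-Var', 'LE');        % column 3 in Figure 5
prCol = find_dims(hm, 'US-Var', 'Precip');    % column 4 in Figure 5

% The scatter plot that is tedious in Excel is a few lines in MatLab.
scatter(res.data(:, prCol), res.data(:, leCol));
xlabel(get_header(hm, prCol));
ylabel(get_header(hm, leCol));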
3 Conclusions

We have had a great deal of interest in our prototype from our colleague scientists. We are still very early in gaining experience with the connection. One unexpected benefit is that many of our scientists have non-Windows desktops. Macintosh Excel PivotTables do not support datacube access, so MatLab is the most accessible option.

The scripting facility and improved rendering facility are already helping us. A collection of plots from one of our Russian River hydrology cubes is shown in Figure 6. The upper right pane is a simple time plot of two variables (discharge and turbidity). The upper left pane shows the results of an FFT (Fast Fourier Transform). This can be done with Excel, but requires careful cut-paste which is not updated across PivotTable changes. The lower pane shows a color-coded plot of discharge as a function of site (aggregated by the drainage area property) in 2003. This sort of plot is often used by our scientists and is not possible with Excel.

Figure 6: Example MatLab-generated plots from our Russian River cube. The lower color-coded plot of discharge in 2003 is not possible to create with Excel.

Our solution is also faster than Excel over potentially slow lines to a scientist's desktop: Excel uses a SOAP-based approach, and the XML headers make the result bulkier than our text-based approach. As the amount of data returned by the query gets large, the performance can become sluggish. This is a combination of the time necessary to retrieve the data, the network transport time, and the scaling of MatLab when handling large amounts of data. The good news is that the datacube approach can postpone that slowdown when the analysis is not at the leaf nodes of the hierarchies. The datacube can precompute the aggregates, and only those aggregates need to be passed to the desktop application and handled by that application.

Of course, we are describing only a prototype. Our query generator cannot handle more than 3 dimensions on an axis. Thus, the maximum number of dimensions that the query generator can accept is 6 (3 on the column axis and 3 on the row axis). This is not a limitation for our cubes, but could be in the future. We have also not attempted to include the (very widely varying) quality dimension. Lastly, we are using only basic authentication.

4 Future Work

Near term, we want to convert the prototype into an easy-to-deploy technology artifact. We need to add support for selecting a datacube, including specifying credentials and menu configuration files, and would like to move to HTTPS [HTTPS]. Our scientists have asked for a command line interface in addition to the GUI. They have also suggested returning an n-dimensional array rather than using the Handle Manager; that would be more intuitive to them. We need to consider how to abstract the differing quality dimensions across our data sets; this is much more of a user model question than a GUI or query generation question. Lastly, we have some performance testing to do on our generated queries given the CROSS JOINs; we have demonstrated feasibility and correctness, but not optimal coding.

5 Acknowledgements

We would like to acknowledge the valuable contributions made by Deb Agarwal, Monte Goode, Matt Rodriguez, and Robin Weber of the Berkeley Water Center in getting the data ready and testing various modules during our development and deployment. We would also like to thank Rebecca Leonardson, our first user, for many terrific suggestions. As always, we rely on Stuart Ozer for his continued datacube wisdom.

6 References

[ADO]: ActiveX Data Objects (ADO), a language-neutral object model that exposes data raised by an underlying OLE DB provider, http://support.microsoft.com/kb/183606

[AMERIFLUX]: AmeriFlux Network, http://public.ornl.gov/ameriflux/

[BASIC]: Peter Burkholder, SSL Man-in-the-Middle Attacks, February 1, 2002, http://www.sans.org/reading_room/whitepapers/threats/480.php

[BWC]: Berkeley Water Center, http://esd.lbl.gov/BWC/

[COG]: Cognos, http://www.cognos.com/solutions/index.html

[CUAHSI-ODM]: Consortium of Universities for the Advancement of Hydrologic Science, Observations Data Model, http://www.cuahsi.org/his/odm.html

[EXCEL]: Excel PivotTables, http://www.microsoft.com/dynamics/using/excel_pivot_tables_collins.mspx

[GRAY1996]: J. Gray, A. Bosworth, A. Layman, and H. Pirahesh, "Data Cube: A Relational Aggregation Operator Generalizing Group-By, Cross-Tab, and Sub-Totals," ICDE 1996, pages 152-159, 1996.

[HTTP]: J. Franks et al., HTTP Authentication: Basic and Digest Access Authentication, IETF RFC 2617, June 1999.
[HTTPS]: HTTPS, http://technet2.microsoft.com/windowsserver/en/library/052d2ea9-586c-4e33-9c56-ecc0c2b203be1033.mspx?mfr=true

[MATLAB]: MatLab, the language of technical computing, http://www.mathworks.com/products/MatLab/

[MDX]: Multi Dimensional eXpressions (MDX), a query language for SQL Server Analysis Services (SSAS), http://msdn2.microsoft.com/en-us/library/ms345116.aspx

[MDXTutorial]: MDX Tutorial, http://msdn2.microsoft.com/en-us/library/ms144884.aspx

[OZER]: Stuart Ozer, Alex Szalay, Katalin Szlavecz, Andreas Terzis, Razvan Musǎloiu-E., Joshua Cogan, Using Data-Cubes in Science: an Example from Environmental Monitoring of the Soil Ecosystem, MSR-TR-2006-134, 2006.

[PRO]: Proclarity, http://www.proclarity.com

[SDSS]: The Sloan Digital Sky Survey SkyServer, http://skyserver.sdss.org/

[SSAS]: SQL Server Analysis Services, an integrated view of business data for reporting, OLAP analysis, Key Performance Indicator (KPI) scorecards, and data mining, http://www.microsoft.com/sql/technologies/analysis/default.mspx

[TAB]: Tableau, a tool for querying and analyzing OLAP databases without any knowledge of MDX, http://www.tableausoftware.com/info/OLAP_Front_End/OLAP_Front_End_fw.php