Big Data Storage and Access Issues for Phenotyping of Agricultural Data

Stephen George, Susan Urban, Eric Hequet, and Hamed Sari-Sarraf
Texas Tech 2013 NSF Research Experiences for Undergraduates Site Program

Abstract

Plant phenotyping involves the assessment of plant traits such as growth, tolerance, resistance, and yield. The Texas Tech Phenotyping Project is specifically studying which cross-breeds of cotton plants will better survive the harsh climate of West Texas. Using robotics, images of individual plants in a field are being collected and analyzed over time to support the study, generating massive amounts of plant data. This research project is investigating the big data storage and organizational issues for phenotyping data. A conceptual design of the phenotyping data requirements has been generated to illustrate the large scope of the data required. NoSQL database technology has also been investigated as an alternative to relational databases to provide more efficient storage and retrieval. In particular, the NoSQL-based Couchbase system has been investigated for its high scalability and cost-effective storage of massive data. Temporal data management with respect to NoSQL databases has also been explored because of the time-oriented nature of phenotyping data collection and analysis. This research provides a prototype implementation of image data storage using Couchbase, together with examples of temporal queries and a performance analysis.

Objectives

1. Comparing different types of NoSQL databases to determine which form is appropriate for the phenotyping project requirements.
2. Modeling the entity and attribute data requirements of the phenotyping project.
3. Capturing the temporal aspects of the data and applying them as a data organization method.
4. Supporting retrieval and querying of data over time.
5. Prototyping with the wicking data application.

The Phenotyping Project

• Plant phenotyping is the comprehensive assessment of complex plant traits such as growth, development, tolerance, resistance, architecture, physiology, ecology, and yield, together with the basic measurement of the individual quantitative parameters that form the basis for the more complex traits (LemnaTec).
• Robotics is being used to monitor and capture each plant's growth over time and to keep track of the plant's environment.
• The navigation aspect of the project provides location information for each of the cotton plants in the fields.
• Each individual plant will have multiple images that capture growth attributes over time.
• Goals of the Texas Tech Phenotyping Project:
  • Determining which cross-breed would survive in the harsh climate of West Texas.
  • Being able to store and analyze massive amounts of plant data over time.
• The F1 cross contains 430,000 plants over the site of one cotton field.
• The system can potentially store close to 4 million images/attributes over a 10-week span for one generation.
• 20 crosses × 200 lines × 2 reps × 5 environments = 40,000 combinations; multiplied by the 430,000 plants per cross, this gives 17.2 billion plant data items spanning a year. Massive amounts of data are being produced.

Figure 1: Image of a cotton farm with respect to the time aspects of cotton growth.
Figure 2: Data Model for the Phenotyping Project.

NoSQL

• NoSQL groups together the data stores created to handle data that does not fit into a table/column/row structure.
• Many NoSQL systems deliver better write performance than traditional relational databases.
• For many workloads, NoSQL systems handle high volumes of data faster than relational databases.
• NoSQL provides greater flexibility when storing different data types such as images, documents, and other objects.
• Key-value store: e.g., Redis, Riak
• Document store: MongoDB, Couch, Raven
• Column store: HBase, Cassandra

Couchbase Database

1. The primary unit of storage on the server is the JSON document.
2. JSON documents offer a flexible structure that allows a document to be modeled as an object.
3. Couchbase Server 2.0 uses a JavaScript-based query system that operates on field values within JSON documents.
4. Using views to query specific data makes it possible to combine multiple attributes and retrieve documents that match a given specification.

Wicking Data

1. Because plant data is not yet available at this stage of the project, the experiment was conducted on wicking data.
2. What is wicking?
   1. The ability of a fabric to absorb moisture from a surface (skin).
   2. Used in active wear and performance fabrics.

Figure 3: Data Model for the Wicking Experiment.
Figure 4: Sequence of frames of the drying cycle of active wear fabric.
Figure 5: Query code for displaying Area based on Experiment 1.
Figure 6: Query code for displaying Temperature based on Experiment 1.
Figure 7: Graph for Figure 5 query results (chart title "Area of Frames"; y-axis Area/cm, 0 to 14; x-axis 1 to 9185).
Figure 8: Displays the write speeds for the wicking experiment.
Figure 9: Displays the read speeds for the wicking experiment.

Summary

• After looking at various NoSQL databases, it was determined that Couchbase, a document-store database, would not only satisfy the project requirements but also provide built-in crash protection, bringing system durability close to that of relational databases.
• A data model for the phenotyping project has been created and is ready for implementation. It supports not only the physical attributes of the plant but also the environment variables that affect plant growth.
• This work also experimented with another form of data (wicking data) to see whether a similar data model, based on the phenotyping project, could be implemented.

Future Work

• Implement the phenotyping database in Couchbase in order to store and handle attributes captured by the robot.
• Create different views to fit the specifications for querying plant data based on physical attributes, spatio-temporal data, and environment.

References

Chen, S. (2010). Multimedia Databases and Data Management: A Survey. International Journal of Multimedia Data Engineering and Management (IJMDEM), 1(1), 1-11. doi:10.4018/jmdem.2010111201
Monger, M. D., Mata-Toledo, R. A., & Gupta, P. (2012). Temporal Data Management in NoSQL Databases. Journal of Information Systems & Operations Management, 6(2), 237-243.

DISCLAIMER: This material is based upon work supported by the National Science Foundation and the Department of Defense under Grant No. CNS-1263183. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation or the Department of Defense.
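
Illustrative sketch: The view-based querying described in the Couchbase section, and the per-experiment queries captioned in Figures 5 and 6, can be illustrated with a small emulation. Couchbase views themselves are written in JavaScript and run on the server; the Python below is not Couchbase API and is not the authors' query code. It only mimics the general pattern of a map step that emits a compound key, followed by a key-range lookup. The document fields (type, experiment, frame, area, temperature), the sample values, and the helper names are all assumptions made for this example.

```python
from typing import Any, Callable, Dict, Iterator, List, Tuple

Doc = Dict[str, Any]
Row = Tuple[List[Any], Any]  # (key, value), like one row of a view index

# Made-up wicking documents; every field name and value is hypothetical.
DOCS: List[Doc] = [
    {"type": "frame", "experiment": 1, "frame": 2, "area": 11.0, "temperature": 29.5},
    {"type": "frame", "experiment": 2, "frame": 1, "area": 13.2, "temperature": 31.0},
    {"type": "frame", "experiment": 1, "frame": 1, "area": 12.5, "temperature": 30.0},
    {"type": "experiment", "experiment": 1, "fabric": "active wear"},
    {"type": "frame", "experiment": 1, "frame": 3, "area": 9.0, "temperature": 29.0},
]


def area_by_experiment(doc: Doc) -> Iterator[Row]:
    """Map step: emit ([experiment, frame], area) for frame documents only."""
    if doc.get("type") == "frame" and "area" in doc:
        yield [doc["experiment"], doc["frame"]], doc["area"]


def build_view(docs: List[Doc], map_fn: Callable[[Doc], Iterator[Row]]) -> List[Row]:
    """Run the map step over all documents and sort the emitted rows by key."""
    rows = [row for doc in docs for row in map_fn(doc)]
    return sorted(rows, key=lambda row: row[0])


def query(view: List[Row], startkey: List[Any], endkey: List[Any]) -> List[Row]:
    """Return rows whose key lies in the inclusive range [startkey, endkey]."""
    return [(key, value) for key, value in view if startkey <= key <= endkey]


view = build_view(DOCS, area_by_experiment)
# All frames of experiment 1 in frame order; the frame number stands in for time.
experiment_1 = query(view, [1], [1, float("inf")])
for key, area in experiment_1:
    print(key, area)
```

Putting the experiment first and the frame second in the key means one range lookup returns a single experiment's measurements already ordered over time, which is the kind of temporal retrieval the objectives call for. A temperature view would differ only in the value emitted by the map step.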