Building Simulation Modelers – Are We Big Data Ready?
Jibonananda Sanyal and Joshua New
Oak Ridge National Laboratory

Learning Objectives
• Appreciate emerging big-data needs in the building sciences
• Identify bottlenecks early and devise solutions for unanticipated situations involving large amounts of data

ASHRAE is a Registered Provider with The American Institute of Architects Continuing Education Systems. Credit earned on completion of this program will be reported to ASHRAE Records for AIA members. Certificates of Completion for non-AIA members are available on request. This program is registered with the AIA/ASHRAE for continuing professional education. As such, it does not include content that may be deemed or construed to be an approval or endorsement by the AIA of any material of construction or any method or manner of handling, using, distributing, or dealing in any material or product. Questions related to specific materials, methods, and services will be addressed at the conclusion of this presentation.

Outline
• Background and scope
• Motivation
• Big data in the building sciences
• Data from simulations
• Data from sensors
• Conclusion

Sustainability is the defining challenge
• Buildings in the U.S. – 41% of primary energy/carbon, 73% of electricity, 34% of natural gas
• Buildings in China – 60% of urban building floor space in 2030 has yet to be built
• Buildings in India – 67% of all building floor space in 2030 has yet to be built

Energy Consumption and Production
[Chart: commercial site energy consumption by end use]

Whole 'test buildings' for system/building integration research
• Evaluating emerging energy-efficiency technologies in realistic test beds is an essential step before market introduction.
• Some technologies (e.g., whole-building fault detection and diagnostics) benefit from the use of test buildings during the development process.
Fleet of Residential 'Test Buildings' and Two Light Commercial 'Test Buildings'
• Real demonstration facilities
• Residential homes
• 2,800 ft² residence with 269 sensors reporting at 15-minute intervals
• 50–60% energy savings
• Heavily instrumented and equipped with occupancy simulation:
  – Temperature
  – Plugs
  – Lights
  – Range
  – Washer
  – Dryer
  – Refrigerator
  – Dishwasher
  – Heat pump air flow
  – Shower water flow
  – Radiated heat

What is Big Data?
• Volume – scale
• Velocity – streaming
• Variety – data types
• Veracity – quality
But still, what is big data?

Why do we care?
• Trending technologies
  – Simulation is playing a bigger role
    • Parametric analysis determinants
    • Understanding uncertainty
  – Sensor data
    • Unprecedented levels of resolution
    • New control algorithms
  – Calibration
  – Internet of Things
  – Demand response
  – B2G integration
• Common/traditional tools and methods of analysis break

How big is big data in the building sciences?
• Depends on
  – Size
  – Capability of tools for conventional analysis
• What is the purpose of the data?
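To make the "volume" dimension concrete, here is a back-of-the-envelope sketch for the instrumented residence described above (269 channels at 15-minute resolution, figures from the deck); the bytes-per-reading value is an assumption for illustration, not a measured number:

```python
# Rough annual data volume for one instrumented residence.
CHANNELS = 269
SAMPLES_PER_DAY = 24 * 60 // 15          # 96 readings per channel per day
BYTES_PER_READING = 32                   # assumed: timestamp + channel id + value

readings_per_year = CHANNELS * SAMPLES_PER_DAY * 365
mib_per_year = readings_per_year * BYTES_PER_READING / 2**20

print(f"{readings_per_year:,} readings/year")   # 9,425,760
print(f"~{mib_per_year:.0f} MiB/year")          # ~288 MiB
```

A single house is modest by itself; the pressure comes from fleets of buildings, higher sampling rates, and large simulation ensembles, where conventional tools start to break.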
Where the management and analysis of data pose a non-trivial challenge

For simulation output
• Scalability, analysis requirements, and adaptability of the data storage mechanism to changing needs
• Unique situations
  – Non-standard EnergyPlus time stamps, e.g., 2012-01-10 24:00:00
• Data movement and network performance
  – Lag, bandwidth, connection stability
• Logical aspects
  – Synchronization, storage schema, logical partitioning
• Analytics on the data

Additionally, for sensor data
• Fault detection, management, and correction
  – 3-sigma rule for automated outlier detection
  – Slopes and trends
  – Physical redundancy
  – Chicken-and-egg problem
• Quality control/quality assurance
  – Filtering, statistical gap filling, machine learning approaches

Managing simulation ensembles
• Uncertainty
• Simulation input – design of experiments
  – Random sampling
  – Uniform sampling
  – Markov order sampling
  – Latin square designs
  – Fractional factorial designs
• Simulation output
  – Larger in size, post-processing overheads
  – Saving raw output vs. summaries
  – Trade-off of re-computing vs. storage and retrieval

Big-data management
• Generic considerations
  – Weak to unstructured collection of data units
  – Balance of storage to computational needs (e.g., the cost of unzipping)
  – Access patterns
    • Do you access individual files or groups of files?
    • Do you calculate summaries often?
    • Do you repeat the same calculations?
    • Physical location on disk – can you exploit parallelism?
  – Design for the generic analysis/use case
  – Design for fault resilience

Data transfer and storage methods
• Moving big data is expensive
  – 10 days to move 45 TB that took only 68 minutes to generate!
• Logical partitions in data movement
  – Overarching logical data schema
  – Use of parallel file-transfer tools such as bbcp, GridFTP, and rsync
• Traditionally, building simulation data are CSV files
  – Use database technologies
  – Atomicity, Consistency, Isolation, Durability (ACID)
  – SQL vs. NoSQL
  – Compression in databases, row vs. columnar
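The non-standard EnergyPlus time stamps called out above are a real parsing hazard: EnergyPlus labels the end of a day as hour 24 (e.g., 2012-01-10 24:00:00), which standard datetime parsers reject. A minimal normalization sketch, assuming timestamps in the YYYY-MM-DD HH:MM:SS form shown in the slide:

```python
from datetime import datetime, timedelta

def parse_eplus_timestamp(ts: str) -> datetime:
    """Parse a timestamp, rolling EnergyPlus's end-of-day '24:00:00'
    over to 00:00:00 of the following day."""
    date_part, time_part = ts.strip().split()
    if time_part == "24:00:00":
        return datetime.strptime(date_part, "%Y-%m-%d") + timedelta(days=1)
    return datetime.strptime(f"{date_part} {time_part}", "%Y-%m-%d %H:%M:%S")

print(parse_eplus_timestamp("2012-01-10 24:00:00"))  # 2012-01-11 00:00:00
print(parse_eplus_timestamp("2012-01-10 23:45:00"))  # 2012-01-10 23:45:00
```

Normalizing at ingest keeps downstream tools (databases, plotting, joins against sensor data) from silently dropping or misplacing the last record of each day.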
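The 3-sigma rule listed above for automated outlier detection flags any reading that falls more than three standard deviations from the channel mean. A minimal sketch (the temperature series is synthetic, for illustration only):

```python
from statistics import mean, stdev

def three_sigma_outliers(values, k=3.0):
    """Return indices of readings more than k standard deviations
    from the mean -- the classic 3-sigma screen for sensor faults."""
    mu, sigma = mean(values), stdev(values)
    return [i for i, v in enumerate(values) if abs(v - mu) > k * sigma]

# A day of 15-minute room temperatures with one stuck-high reading injected.
temps = [20.9, 21.1] * 50        # readings oscillating around 21 deg C
temps[42] = 85.0                 # injected sensor fault
print(three_sigma_outliers(temps))  # [42]
```

In practice the mean and standard deviation are themselves contaminated by the very faults being screened for, so rolling windows or robust alternatives (median and MAD) are common refinements.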
Performance metrics for simulation data
• 15-minute EnergyPlus output
  – 35,040 records of 96 variables, ~35 MB
  – 7–8 MB compressed, i.e., ~20–22% of the original size
• 200 CSVs inserted into a MySQL database with row compression
  – 7M records, 10.27 MB average
  – Read-only: 6.8 MB
• ALTER TABLE command on a 386 GB table
  – 8 hr 10 min with 12 partitions
  – More than 1 week when unpartitioned
• Hadoop approaches
  – Key-value pairs, no ACID compliance

Comparison of database performance
[Chart comparing database performance]

Other considerations
• Data sensitivity and user permissions
• Backups
  – Sensor data
  – Simulation data
• Is it worth it to back up?
  – Analysis to be applied
  – Changes in simulation code
  – Reference for derivative products
• Provenance
• Workflow tools

Case study
• Parametric ensemble
  – 2 residential buildings
  – Commercial buildings: medium office, warehouse, stand-alone retail
• Several supercomputers used
  – Titan (299,008 cores)
  – Frost (2,048 cores)
  – Nautilus (1,024 cores)
• Several challenges
  – EnergyPlus on supercomputers
  – File system, data transience and its movement
  – Analysis

Tipping point

E+ simulations   Processors   Wall-clock time (mm:ss)   Data size
64               16           18:14                     5 GB
128              32           18:19                     11 GB
256              64           18:34                     22 GB
512              128          18:22                     44 GB
1,024            256          20:30                     88 GB
2,048            512          20:43                     176 GB
4,096            1,024        21:03                     351 GB
8,192            2,048        21:11                     703 GB
16,384           4,096        20:00                     1.4 TB
32,768           8,192        26:14                     2.8 TB
65,536           16,384       26:11                     5.6 TB
131,072          32,768       31:29                     11.5 TB
262,144          65,536       44:52                     23 TB
524,288          131,072      68:08                     45 TB

Conclusion
• What is big data?
• Why should we care?
• Considerations in working with big data
• Analysis requirements
• Managing big data
• Anticipating issues of scale

Questions?
Jibonananda Sanyal
sanyalj@ornl.gov