Jaqueline Beckhelling, Loughborough University
Presentation for Second TEDDINET Workshop, 4/5th June 2014
Day 2, Theme 1, Digital innovation for energy savings in buildings

Data management for the DEFACTO project

What is the purpose of data management on a project? There are many different answers to that question, and I can't deal with all of them in 5 minutes, but one aspect of data management which I have been thinking about a lot recently is: what will happen to the data after the project is complete? In the case of the DEFACTO project, the data should be shared, which means I need to create a dataset that can be accessed by researchers who are not familiar with the DEFACTO project. One vital thing I can do to facilitate access is to document the data well. However, the importance of good documentation and the need to develop metadata standards were acknowledged yesterday, so I will not discuss them again today.

Instead I want to ask you a different question, to which I have been giving a lot of thought: with whom am I going to be sharing these data? Or, to be more precise, how do I need to structure this dataset so that it is as easy as possible for other users to work with? I can structure the dataset so that it will be easy to use with most statistical analysis packages (I have worked with all of the major packages), and I think the same structure will also be easily accessed by Matlab users, based on my experience of that program. However, I have little experience of preparing data for EnergyPlus and similar building modelling programs, so I need to make sure the data will be useful for those users too. I think some compromises will have to be made!

There is also the method of creating the dataset to consider. The data are coming from a variety of sources and will not fit together neatly, like a jigsaw; I will need the programmatic equivalent of a crowbar and a large hammer to get some of it to fit. That means using a database for the data manipulation, because databases have very flexible and sophisticated methods of manipulating large amounts of data. However, the best structure for a database will not be the best structure for use with programs which do not have the data handling capacities of a database. A suboptimal database structure can also affect the speed with which the database operates, which could be a problem for the main DEFACTO project, as it will include hundreds of homes and we will be monitoring the internal temperature of every room for up to 3 years. At the moment, though, we are carrying out the pilot, which is based on only 12 homes. So I have decided to build the database with the structure I think is needed for the final dataset. I will be extracting the data to be used in the energy modelling programs, so I will see whether the data require a lot of restructuring for input into those programs and, if necessary, I can adjust the final dataset structure accordingly. I suspect the final structure of the dataset will not be ideal for anybody, but I hope I can produce something which will not present major access problems for anybody either.

I have made a major assumption in what I am doing currently, which is that the final dataset needs to be accessible independently of a database. I have assumed this because few of the people I know working in energy monitoring seem to have database skills, so a dataset which needs to be imported into a database and processed before it can be used would, I assume, be of limited use.
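To make the kind of pipeline I have in mind a little more concrete, here is a minimal sketch in Python. It is illustrative only: the file names, table name and column names (temperature, home_id, temp_c and so on) are all invented, and the real DEFACTO sources will be messier. The idea is simply to crowbar the heterogeneous source files into one long-format database table, then export a single flat file so that the final dataset can be used without any database skills.

    import csv
    import sqlite3

    # Hypothetical example: consolidate room-temperature readings from
    # per-home logger files into one long-format table, then export a
    # flat CSV. All file, table and column names are invented.
    con = sqlite3.connect("defacto_pilot.db")
    con.execute("""
        CREATE TABLE IF NOT EXISTS temperature (
            home_id   TEXT,   -- anonymised home identifier
            room      TEXT,   -- e.g. 'living room', 'bedroom 1'
            timestamp TEXT,   -- ISO 8601, UTC
            temp_c    REAL    -- internal temperature in degrees C
        )
    """)

    def load_source(path, home_id):
        """Load one home's readings, whatever shape the logger produced."""
        with open(path, newline="") as f:
            for row in csv.DictReader(f):
                con.execute(
                    "INSERT INTO temperature VALUES (?, ?, ?, ?)",
                    (home_id, row["room"], row["time"], float(row["temp"])),
                )

    load_source("home_01_logger.csv", "home_01")  # one call per source file
    con.commit()

    # Export the long-format table as a single flat file, so it can be
    # analysed in statistical packages or Matlab without a database
    # import step.
    cur = con.execute(
        "SELECT home_id, room, timestamp, temp_c FROM temperature "
        "ORDER BY home_id, room, timestamp"
    )
    with open("defacto_temperatures.csv", "w", newline="") as out:
        writer = csv.writer(out)
        writer.writerow([d[0] for d in cur.description])
        writer.writerows(cur)
    con.close()

A long format like this, with one row per reading, is usually the easiest starting point: statistical packages can read it directly, and a pivot to one column per room can be generated for modelling tools that expect a wide layout.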
My main message for today is that I think I am unilaterally making a lot of key decisions which will have major implications for the long-term use of the DEFACTO data. Should I be doing this? I do discuss the decisions I make with the other members of the DEFACTO team, but a lot of what I do is still based on my experience alone. Is this the ideal situation for a dataset which could potentially be useful for a range of purposes and to a range of researchers? I think it would be better if a group of TEDDINET researchers thought about the structure we want for our shared data. Maybe it will be possible to come up with common structures for at least some of the data we want to store; maybe we can only come up with guidelines. But if we had some degree of conformity in how we store data, it would really improve ease of access. As someone who has spent more time than I like to think about working out how complex datasets work for secondary analysis, I can assure you: it will be time well spent!