http://www.sheffield.ac.uk/is/research/projects/rdmrose

Session 7.1 Case Study: Oncology
Transcript of interview extracts

Extract 1: Introduction

My name is Bernard Corfe. I'm a senior lecturer in the Department of Oncology, and I've worked in Sheffield for about ten years. I've had a number of large external projects funded by BBSRC and by the Food Standards Agency, and I currently have a research group of around ten people at Ph.D., M.D., and bachelor's level.

Extract 2: Research and Data

I do a number of different types of research. We do some in vitro work where we're taking cell lines and treating them with different concentrations of a drug or a compound of interest and measuring outcomes; very often that's by using microscopy approaches, sometimes by extracting proteins and using western immunoblot approaches. Those allow us to do highly mechanistic cell-based assays. But we also have research projects that have involved human clinical trials. So typically we've done a few projects, for example with master's students, where we've recruited subjects who either have normal physiology or perhaps have IBS, and we've treated them and collected information about their response to an intervention like a probiotic.

[0.50] And then I've run a couple of larger cross-sectional studies where we've recruited people who are patients, so this falls within NHS ethics. They've been presenting at gastroenterology clinics and we've asked to collect additional samples from them, so biopsies and history, and we've used the biopsies to analyse protein changes and also histological changes. And then we try to store that data and manage it in such a way that we can link all of the outcomes (the protein changes, the histology, and the family history) and see how those might link together to disease outcome.

[1.24] So the data that are generated in vitro would typically be spreadsheets and imaging data, so images of gels or images of cells, which have quite a range of sizes, sometimes from 1 MB up to several TB of data, depending on the particular type of imaging that we use. We also have lab books, and we also have the original project proposals that describe the experiments we've done. For a clinical trial not involving NHS staff, so something that we might do with a self-reported group where we're using an over-the-counter type remedy, the data might also include the same sort of imaging-type analysis if we wanted to process samples in a certain way, but would also typically and necessarily include the site file and all of the associated documentation: copies of the original ethics application, any amendments that are made, and copies of all of the forms that are used in dealing with subjects in a trial. We'd also have subject-specific information, so every patient's or subject's recruitment information and any sampling information that was taken, for example how severe the symptoms were, or information to allow us to link that record to a specific sample, for example a urine or stool or blood sample.

[2.43] An NHS-type clinical trial would be similar, except that the level of documentation is substantially larger, because of the longer ethics application needed to go through NHS ethics, the more detailed information required, and sometimes the larger number of people involved in such a trial. And I guess to a degree, as well as the data itself, we would also have samples, so our information needs to be linked to patients' samples.
We do omics-type experiments, so generally these are proteomics experiments, where we extract all the proteins and do mass spectrometric analysis. [3.18] The raw data for that is stored on servers in Chemical Engineering, and the processed data is stored, generally in the form of spreadsheets, in the Medical School. So it's all online. The raw data from the proteomics experiments again could run into many GB or TB, whereas the processed data would usually be a spreadsheet of a couple of MB. Histology data tends to be in imaging form, so again the individual file sizes are quite large, and they tend to be stored on a separate hard drive in the Medical School.

Extract 3: Two RDM issues

We've had an issue in the past where a computer was stolen that had a lot of imaging on it, so we lost a lot of data when that happened. That's some years ago now. So as a general routine, I ask people to back up, but then they don't always do it, and I don't always do it. So that's a sort of risk. What other issues have we had? [0.21] I'm not sure it's entirely a data management issue, but I had a problem with a research assistant some years ago. She was lying about recruitment and lying about the samples that she'd collected, and she was having problems keeping track of individuals. So that led to a pretty catastrophic failure in the process, partly because somebody was deliberately setting out to deceive. I've been considerably more fastidious since that experience; you don't expect that sort of thing, but it can happen. So there's a degree to which I'm sure I've got more oversight of projects since that experience, but you are always subject to that possibility of somebody just making up information, or deceiving you, or claiming that they've done something.

Extract 4: Example project

A project funded by the Food Standards Agency looked at the effects of fibre and its fermentation products on the gut lining. We specifically wanted to take biopsies of the gut lining, so we wanted to work with people who were having biopsies taken anyway; this was subjects who were coming to gastroenterology clinics. I first made an application to the Food Standards Agency in 2004 and it was turned down, but they were prepared to see a revised version of the application in 2005. That's a fairly substantive document, which is all fully costed out, and it involved two named clinicians and three academics on board with the project. It then employed three staff on research contracts. So that was awarded in 2005, started in 2006, and we got started in late 2006 in terms of the recruitment. So we're looking at a good two years from the initial application to the first person actually being recruited onto the trial.

[1.10] From each individual we wanted to collect a diet history and a stool sample, and then when they were in the GI clinic, we wanted to collect some extra biopsies and do various analyses with these. We had a target of 30 normal people, 30 people with a benign polyp, and 30 people with a malignant disease in the colon. It took us, I think, about two years all in all to collect those numbers. And this was the project where the one initially appointed researcher had been lying about recruitment rates, had been lying about… so this added to the complexity of the management.
Because that person had also been charged with the primary role of managing the site file and the patient consent forms, because she was the person actually physically doing the recruiting. [1.57] So we had a couple of false starts on that, and then she had to leave, and we got somebody else in who was extremely fastidious and extremely efficient. We met our recruiting targets. But it was only once we had completed recruiting and completed some initial sample analysis of the stool samples that we were able to go into things like the proteomic analysis, which involved collaboration with Chemical Engineering down here. So really, we were only starting to get data out by, say, about 2009. You're then into five years from initially thinking up the project until your first actual integrated dataset of some information on the patients, some information on the stool, the diet, and the proteomics.

[2.42] And that's led to a sort of further round of experimentation, because we've got histology samples, so the proteomics can then lead us to ask some questions of the histological samples. So it becomes like a loop of scientific questions: you ask that, and then it tells you that, and we've been going through a cycle of that. And it's only really around this time now that we're starting to think about writing up the paper. So now it's eight years; this is an extraordinarily slow one, but it's eight years on from the initial proposal, seven years on from the one that got funded, and we're only just starting to get the… we have got some papers off it, but we're only just starting to get the major published outputs.

[3.24] This wasn't a clinical trial in the sense of a medicine or an intervention; it was a clinical trial in the sense that it was done in a clinic. But it was cross-sectional: we didn't follow the patients through, and we didn't treat them with a medicine or look for any kind of clinical benefit. We're perhaps less at the treatment and curative end of disease. We're more interested in the mechanisms of disease, so it fits in with the Medical School remit.

Extract 5: Data in example project

Different types of data generated… you know, the same project has got western blot data on one computer, it's got raw mass spec data in Chemical Engineering, it's got histology data on a microscope computer. The processed outcomes are then linked into a spreadsheet that's on an M: drive, but individually, the data are distributed across multiple points. If I had the resource, what I would do would be to network all of those computers, put them onto some sort of network drive or server, and archive everything from that project in a single place, so that at least I would know it was secure if somebody came in and stole a microscope or a mass spec and so forth. But that depends on a certain level of resource.

[0.43] So there would be a whole stream of different endpoints. Let's say I have 100 patients, so they're called 001 through to 100.
And I might have a researcher who's looking at a series of histology endpoints, so they would be scoring endpoint 1, and they'd have a spreadsheet with the study numbers of the patients, into which they would put their endpoints. For the purposes of their project, they might have made ten repeated measurements and ended with an average and some variability, so they can condense that down into a single output, or two or three outputs if they're looking at two or three endpoints.

[1.22] And then the same might be true for somebody else who's looked at, perhaps, the concentration of metabolites in the stool sample. They would have a spreadsheet of their own which just had that same set of patient identifiers. And then there would be another one for another protein, and another one for the proteomics, and so forth. But then what I do is merge all of that information together into a larger sheet, and look at the interrelationship between all of these endpoints, and the degree to which one influences another. So I have the kind of overview of the project.

[1.51] So individuals get something that is substantive but very personal to them, whereas I am more interested in looking at the big picture, if you like, so less interested in variations in a single endpoint, which might form a substantive project for one of my bachelor's students or a placement student. But through them doing that, I've then got a very large set of information on these 100 patients. The other senior members of the research team have the same sort of overview, but given that I was the PI on the project, and other people have sort of drifted in and drifted out, they have different levels of involvement at different stages. But as we're approaching a writing phase, an analytical phase, they'll be drawn back together again and we'll work on the whole dataset together in order to get the output.

Extract 6: Open access and open data

I think open data is really important. A lot of my work is in the area of bioinformatics, and that is extremely driven by open data, and also open source. The argument in bioinformatics is that it's not necessarily the person who generates the data who is the best person to analyse it, and also that there is more to be gained from merging datasets and analysing much larger datasets than there is from one person working on a single dataset. From those bioinformatics principles, I can see the same applying to all sorts of areas of science: if you can generate datasets in such a way that they can be merged, you've got a much more powerful way of analysing work, and also a better way of seeing why one group might get one result and another group might get a completely opposing result. If you can share that data and see the underlying differences, be it different age groups of the populations, or different socioeconomic groups, or different genetics, or whatever, it gives us better understanding.

[0.59] And that is in itself informative. If you can see why two studies end up with different outcomes, then you can see how you might reach through to personalised medicine or personalised healthcare. So I think there are enormous amounts to be gained from open data, but tied into that is an absolute need for data reporting standards: proper use of controlled vocabularies, and proper structuring and organisation of data, rather than just chucking your spreadsheets into the ether.
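[The merging workflow described in Extract 5, with each researcher keeping a per-endpoint spreadsheet keyed on the shared patient study number and the PI combining them into one overview sheet, can be sketched in a few lines of Python with pandas. The file names and column headings below are illustrative assumptions, not taken from the project:]

    import pandas as pd

    # Each researcher keeps their own spreadsheet, keyed on the shared
    # patient study number (001 through 100). File and column names here
    # are hypothetical; dtype=str preserves the leading zeros in the IDs.
    histology = pd.read_csv("histology_endpoints.csv", dtype={"study_id": str})
    metabolites = pd.read_csv("stool_metabolites.csv", dtype={"study_id": str})
    proteomics = pd.read_csv("proteomics_summary.csv", dtype={"study_id": str})

    # Merge everything on the shared identifier to build the overview sheet.
    overview = histology
    for sheet in (metabolites, proteomics):
        overview = overview.merge(sheet, on="study_id", how="outer")

    # The interrelationships between endpoints can then be examined in one
    # place, e.g. pairwise correlations across all numeric columns.
    print(overview.corr(numeric_only=True))
    overview.to_csv("project_overview.csv", index=False)

[An outer join is assumed so that a patient who appears in only some of the spreadsheets still gets a row in the overview, with the gaps left visible rather than silently dropped.]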
Reporting standards of that sort already work really well for people who are working on metabolomics or proteomics. But if we can also generate reporting standards for work in nutrition, or work in sociology, or whatever, and generate datasets that can be merged, then, you know, the future looks very, very bright, I think. My argument is, there's no point in having open data without standards… without really tight standards.

[1.50] So areas like proteomics, microarray, and genomics are absolutely fantastic; they're kind of leaders in reporting standards. Other areas that I work in, in nutrition, are, you know, back in the 1950s, and absolutely terrible. They don't share, they don't have their own reporting standards, and they have no repositories. And so I can see, from the part of me that stands in the bioinformatics world, the potential, and from the part of me that stands in the nutrition world, what a mountain there is to climb in some disciplines. It's fairly simple: it needs to come down from top journals saying you can't publish unless the data are in a repository; but in that case you need a repository, and in that case you need structured vocabularies. So there are steps, a good few, you know, not too many.

[2.39] These are interesting projects for people to do in their own right. But they need to be done, because otherwise there's just no point. So I have no problem in principle with open access papers. What I have a problem with is people standing up and saying, "Well, the existing publishers are just profiteering, and their costs are very high, and they're costing a fortune to libraries and so forth, and there's something absolutely brilliant about open access publishers". Because whenever I've gone to an open access publisher, they want something in the order of £1,500 to £2,000 to publish a paper in an open access journal. And if you think, "Well, I'm publishing maybe a paper a month", that would be in the order of £15,000 to £20,000 a year to publish at my current rate. So that's one issue. And if you extend that across an entire university, that's far more than the cost of subscription journals.

[3.33] So I think that's a really major issue. It's all very well saying "Closed access publishers are profiteering", but so are open access publishers: they're just profiteering from somebody else. My other issue is that a paid access publisher and an open access publisher have got different clients; they're not like-for-like. A paid access publisher has to produce a journal of high quality in order to sell it, and that gives you some sense that if you're purchasing that journal, you're purchasing something of substance. Whereas with open access, there is a sense in the community, and there is some evidence behind this, that the thresholds for publication, which might be indicated by the rejection rate of the journal, are far lower. So this means we're led to a kind of explosion of journals and papers, and no way of actually judging what is of quality and what is not.