http://www.sheffield.ac.uk/is/research/projects/rdmrose
Session 7.1
Case Study: Oncology
Transcript of interview extracts
Extract 1: Introduction
My name is Bernard Corfe. I’m a senior lecturer in the Department of Oncology, and I’ve worked in
Sheffield for about ten years. I’ve had a number of large externally funded projects from the BBSRC
and the Food Standards Agency, and I currently have a research group of around ten people at Ph.D.,
M.D., and bachelor’s level.
Extract 2: Research and Data
I do a number of different types of research. We do some in vitro work, where we’re taking cell
lines, treating them with different concentrations of a drug or a compound of interest, and
measuring outcomes; very often that’s using microscopy approaches, sometimes extracting
proteins and using western immunoblot approaches. Those allow us to do highly mechanistic
cell-based assays. But we also have research projects that have involved human clinical trials.
Typically we’ve done a few projects, for example with master’s students, where we’ve recruited
subjects who either have normal physiology or perhaps have IBS, and we’ve treated them and
collected information about their response to an intervention like a probiotic.
[0.50] And then I’ve run a couple of larger cross-sectional studies where we’ve recruited people who
are patients, so this falls within NHS ethics. They’ve been presenting at gastroenterology clinics,
and we’ve asked to collect additional samples from them, so biopsies and history, and we’ve used
the biopsies to analyse protein changes and also histological changes. And then we try to store that
data and manage it in such a way that we can link all of the outcomes—the protein changes, the
histology, and the family history—and see how those might link together to disease outcome.
[1.24] So the data that are generated in vitro typically would be spreadsheets and imaging data,
so images of gels or images of cells, which have got quite a range of sizes, sometimes from
1 MB up to several TB of data, depending on the particular type of imaging that we use. We also have
lab books, and we also have the original project proposals that set out the experiments we’ve done. For a
clinical trial not involving NHS staff, so something that we might do with a self-reported group
where we’re using an over-the-counter-type remedy, the data might also include the same sort of
imaging-type analysis if we wanted to analyse samples in a certain way, but would also typically and
necessarily include the site file and all of the associated documentation: copies of the original
ethics application, any amendments that are made, and copies of all of the forms that are used in
dealing with subjects in a trial. We’d also have subject-specific information, so every patient’s or
subject’s recruitment information and any sampling information that was taken: for example,
how severe the symptoms were, or information to allow us to link that to a specific sample, such as
a urine, stool, or blood sample.
[2.43] An NHS-type clinical trial would be similar, except the level of documentation is substantially
larger, because of the longer ethics application required to go through NHS ethics, the more detailed
information involved, and sometimes the larger number of people involved in such a trial. And I guess,
to a degree, as well as the data itself we would also have samples, so our information needs to be linked
to patients’ samples. We do omics-type experiments, so generally these are proteomics
experiments, where we extract all the proteins and do mass spectrometric analysis.
[3.18] The raw data for that is stored on servers in Chemical Engineering, and then the processed
data is stored, generally in the form of spreadsheets, in the Medical School. So it’s all online. The raw
data from the proteomics experiments again could run to many GB or TB, whereas the processed
data would usually be a spreadsheet of a couple of MB. Histology data tends to be in imaging
form, so again the individual file sizes are quite large; they tend to be stored on a separate hard
drive in the Medical School.
Extract 3: Two RDM issues
We’ve had an issue in the past where a computer was stolen that had a lot of imaging on it, so we
lost a lot of data when that happened. That was some years ago now. So as a general routine I ask
people to back up, but then they don’t always do it, and I don’t always do it. So that’s a sort of risk.
What other issues have we had?
[0.21] I’m not sure it’s entirely a data management issue, but I had a problem with a research
assistant some years ago. She was lying about recruitment and lying about the samples
that she’d collected, and she was having problems keeping track of individuals. So that led to a pretty
catastrophic failure in the process, partly because somebody was deliberately setting out to deceive.
I’ve been considerably more fastidious since that experience; nonetheless, you don’t expect
that sort of thing, but it can happen. So there’s a degree to which I am sure that I’ve got more
oversight of projects since that experience, but you are always subject to that possibility of
somebody just making up information, or deceiving you, or claiming that they’ve done something.
Extract 4: Example project
This was a project funded by the Food Standards Agency, which looked at the effects of fibre and its
fermentation products on the gut lining. We specifically wanted to take biopsies of the gut
lining, so we wanted to work with people who were having biopsies taken anyway; this was
subjects who were coming to gastroenterology clinics. I first made an application to the Food
Standards Agency in 2004 and it was turned down, but they were prepared to see a revised version
of the application in 2005. That’s a fairly substantive document, fully costed out, which involved
two named clinicians and three academics on board with the project, and it then employed three
staff on research contracts. So that was awarded in 2005 and started in 2006, and we got started in
late 2006 in terms of the recruitment. So we’re looking at a good two years from the initial
application to the first person actually being recruited onto the trial.
[1.10] From each individual we wanted to collect a diet history and a stool sample, and then, when
they were in the GI clinic, we wanted to collect some extra biopsies and do various analyses with these.
We had a target of 30 normal people, 30 people with a benign polyp, and 30 people with a malignant
disease in the colon. It took us, I think, about two years all in all to collect those numbers. And this
was the project where the initially appointed researcher had been lying about recruitment rates, had
been lying about…so this added to the complexity of the management, because that person had also
been charged with the primary role of managing the site file and the patient consent forms, as she
was the person actually physically doing the recruiting.
[1.57] So we had a couple of false starts on that, and then she had to leave, and we got somebody
else in who was extremely fastidious and extremely efficient. We met our recruiting targets. But it was
only once we had completed recruiting and completed some initial sample analysis of the stool
samples that we were then able to go into things like the proteomic analysis, which involved
collaboration with Chemical Engineering down here. So really, we were only starting to get data
out by, say, about 2009. You’re then into five years from initially thinking up the project until your
first actual integrated dataset: some information on the patients, some information on the stool,
the diet, and the proteomics.
[2.42] And that’s led to a sort of further round of experimentation, because we’ve got histology
samples, so the proteomics can then lead us to ask some questions of the histological samples. It
becomes like a loop of scientific questions: you ask one thing, and then it tells you another, and
we’ve been going through a cycle of that. And it’s only really around this time now that
we’re starting to think about writing up the paper. So now it’s eight years, so this is an
extraordinarily slow one, but it’s eight years on from the initial proposal, seven years on from the
one that got funded, and while we have got some papers off it, we’re only just starting to get the
major published outputs.
[3.24] This wasn’t a clinical trial in the sense of a medicine or an intervention; it was a clinical trial in
the sense that it was done in a clinic. But it was cross-sectional: we didn’t follow the patients
through, and we didn’t treat them with a medicine or look for any kind of clinical benefit. We’re
perhaps less at the treatment and curative end of disease; we’re more interested in the mechanisms
of disease, so it fits in with the Medical School remit.
Extract 5: Data in example project
Different types of data are generated…you know, the same project has got western blot data on one
computer, it’s got raw mass spec data in Chemical Engineering, and it’s got histology data on a
microscope computer. The processed outcomes are then linked into a spreadsheet that’s on an M:
drive, but individually the data are distributed across multiple points. If I had the resource, what I
would do would be to network all of those computers, put them onto some sort of network drive
or server, and archive everything from that project in a single place, so that at least I would know it
was secure if somebody came in and stole a microscope or a mass spec and so forth. But that
depends on a certain level of resource.
[0.43] So there would be a whole stream of different endpoints. Let’s say I have 100 patients, so
they’re called 001 through to 100. I might have a researcher who’s looking at a series of
histology endpoints, and so they would be scoring endpoint 1: they’d have a spreadsheet with
the study numbers of the patients, and they would put their endpoints into a file. For the
purposes of their project, they might have made ten repeated measurements and ended up with an
average and some variability, so that they can then condense that down into a single output, or two
or three outputs if they’re looking at two or three endpoints.
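
As a rough illustration of the condensing step described above, here is a hypothetical sketch in Python with pandas; the file and column names are assumptions for illustration, not taken from the project itself:

    import pandas as pd

    # Hypothetical raw sheet: one row per repeated measurement,
    # keyed on the anonymised study number (001 through 100).
    raw = pd.read_csv("endpoint1_raw.csv", dtype={"study_id": str})

    # Condense the ten repeated measurements per patient down to
    # a single output: an average plus some variability.
    summary = (
        raw.groupby("study_id")["measurement"]
           .agg(endpoint1_mean="mean", endpoint1_sd="std")
           .reset_index()
    )
    summary.to_csv("endpoint1_summary.csv", index=False)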
[1.22] And then the same might be true for somebody else who’s looked at, perhaps, the
concentration of metabolites in the stool sample. They would have a spreadsheet of their own
which just had that same set of patient identifiers. And then there would be another one for another
protein, and then another one for the proteomics, and so forth. But then what I do is to merge all of
that information together into a larger sheet, and look at the interrelationships between all of these
endpoints and the degree to which one influences another. So I have that kind of overview of the
project.
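
A minimal sketch of that merging step, on the assumption that each researcher’s summary sheet shares the same study-number column (again, the file and column names are illustrative rather than the project’s own):

    import pandas as pd
    from functools import reduce

    # Each researcher's condensed sheet: study_id plus that
    # endpoint's own columns (e.g. endpoint1_mean, endpoint1_sd).
    sheets = [
        pd.read_csv(name, dtype={"study_id": str})
        for name in (
            "endpoint1_summary.csv",
            "stool_metabolites_summary.csv",
            "proteomics_summary.csv",
        )
    ]

    # Outer-join on the shared study number, so a patient missing
    # one endpoint still appears in the project-wide overview sheet.
    overview = reduce(
        lambda left, right: left.merge(right, on="study_id", how="outer"),
        sheets,
    )
    overview.to_csv("project_overview.csv", index=False)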
[1.51] So individuals get something that is substantive but very personal to them, whereas I am
more interested in looking at the big picture, if you like; I’m less interested in variations in a single
endpoint, which might form a substantive project for one of my bachelor’s students or a placement
student. But through them doing that, I’ve then got a very large set of information on these 100
patients. The other senior members of the research team have the same sort of overview, but given
that I was the PI on the project, and other people have sort of drifted in and drifted out, they have
different levels of involvement at different stages. As we’re approaching a writing phase, an
analytical phase, they’ll be drawn back together again and we’ll work on the whole dataset together
in order to get the output.
Extract 6: Open access and open data
I think open data is really important. A lot of my work is in the area of bioinformatics, and that is
extremely driven by open data, and also open source. The argument in bioinformatics is that it’s
not necessarily the person who generates the data who is the best person to analyse it, and also that
there is more to be gained from merging datasets and analysing much larger datasets than there is
from one person working on a single dataset. From those bioinformatics principles, I can see
them applying to all sorts of areas of science: if you can generate datasets in such a way that
they can be merged, you’ve got a much more powerful way of analysing the work, and also a better
way of seeing why one group gets one result and another group might get a completely opposing
result. If you can share that data and see the underlying differences, be it different age groups in the
populations, or different socioeconomic groups, or different genetics, or whatever, it gives us better
understanding.
[0.59] And then that is in itself informative. If you can see why two studies end up with different
outcomes, then you can see how you might reach through to personalised medicine or personalised
healthcare. So I think there are enormous amounts to be gained from open data, but I think tied into
that is an absolute need for data reporting standards: proper use of controlled vocabularies, and
proper structuring and organisation of data, rather than just chucking your spreadsheets into the
ether. That works really well for people who are working on metabolomics or proteomics. But
actually, if we can generate reporting standards for work in nutrition, or work in sociology, or
whatever, and generate datasets that can be merged, then, you know, the future looks very, very
bright, I think. My argument is, there’s no point in having open data without standards…without
really tight standards.
[1.50] Areas like proteomics, microarrays, and genomics are absolutely fantastic; they’re kind of
leaders in reporting standards. Other areas that I work in, in nutrition, are, you know, back in the
1950s, and absolutely terrible. They don’t share, they don’t have their own reporting standards, and
they have no repositories. So I can see, from the part of me that stands in the bioinformatics world,
the potential, and from the part of me that stands in the nutrition world, what a mountain there is to
climb in some disciplines. It’s fairly simple: it needs to come down from top journals saying you
can’t publish unless you’ve deposited in a repository; but in that case you need a repository, and in
that case you need structured vocabularies. So there are steps, a good few, you know, not too many.
[2.39] These are interesting projects for people to do in their own right. But they need to be done,
because otherwise there’s just no point. So I have no problem in principle with open access papers.
What I have a problem with is people standing up and saying, “Well, the existing publishers are just
profiteering, and their costs are very high, and they’re costing a fortune to libraries and so forth, and
there’s something absolutely brilliant about open access publishers”. Because whenever I’ve gone
to an open access publisher, they want something in the order of £1,500 to £2,000 to publish a
paper in an open access journal. And if you think, “Well, I’m publishing maybe a paper a month”,
that would be in the order of £15,000 to £20,000 a year to publish at my current rate. So that’s one
issue. And if you extend that across an entire university, that’s far more than the cost of subscription
journals.
[3.33] So I think that’s a really major issue. It’s all very well saying “Closed access publishers are
profiteering”, but so are open access publishers: they’re just profiteering from somebody else. My
other issue is that a paid access and an open access publisher have got different clients; they’re
not like-for-like. A paid access publisher has to produce a journal of high quality in order to sell it,
and that gives you some sense that if you’re purchasing that journal, you’re purchasing something of
substance. Whereas with open access, there is a sense in the community, and there is some evidence
behind this, that the thresholds for publication, which might be indicated by the rejection rate of
the journal, are far lower. So this means we’re led to a kind of explosion of journals and papers, with
no way of actually judging what is of quality and what is not.