Distributed BioSensors in Genetically Modified crop trial monitoring
S. Hassard*, M. Osmond*, F. Pereira#, M. Howard#, S. Klier#, R. Martin#, J. Hassard#
*Imperial College London, South Kensington, London SW7 2AZ
#deltaDOT Ltd, Charing Cross Hospital, London W6 8RF
The DiscoveryNET consortium
Biotechnology is moving towards the use of small, cheap instruments and sensors connected using distributed systems
technologies. This nascent distributed system can use a concomitant GRID architecture to acquire, collate and analyse the
data from each biosensor. Distributed Protein Analysis Systems can, for example, be used to monitor the possible spread of
GM in crops by identification of marker proteins. This data, collated with meteorological and geographic data, can be used to advise the appropriate authorities of any release or other problems, aiding rapid decision-making.
Motivation
The increase in biological data and the advances
in data acquisition systems will have two major
impacts over the next 5-10 years.
1. The latest release statistics from the two major
biological databases shown in Figure 1 show that the
trend for information generation has been linear in the
case of protein, but exponential for the addition of
sequence data. This disparity is largely due to the method of data acquisition (DNA sequencing has, for the last five or so years, been largely automated), but the trend is obvious in both cases and set to continue. There has been speculation that, now the human and certain other genomes have been completed, the need for DNA sequencing will diminish. This conjecture is incorrect: the expansion in DNA data acquisition will continue at, if not exceed, current rates. This increase will be due, in part, to new sequencing technologies increasing the usability of DNA sequencing in everyday diagnostic processes, such as drug choice and prescription, as well as pre-clinical cohort screening.
Protein database entries are apparently increasing at a lower rate than DNA entries because the complexity of protein analysis does not allow the same degree of automation in proteomic systems. However,
polyacrylamide gel electrophoresis (PAGE) systems
are now gradually being replaced by higher throughput
capillary and microchip systems in pharmaceutical and
biotechnology laboratories, and this trend will continue
in the future.
If the amount of proteomic and genomic information being gathered continues to grow at current or increased rates, then new technologies that further increase this already phenomenal rate of data acquisition must have an inherent, structured data-handling function.
Figure 1. The growth trends of typical existing biological databases.
Sources: http://www.ncbi.nlm.nih.gov/Genbank/genbanstats.html and http://ca.expasy.org/sprot/relnotes/relstat.html
From 1982 to the latest release, the number of bases in GenBank has doubled approximately every 14 months.
GenBank: Release 3 (Dec 1982): 680,338 base pairs, 606 entries; Release 141 (Apr 2004): 38,989,342,565 base pairs, 33,676,218 entries.
SWISSPROT has increased by about 8% per year.
SWISSPROT: Release 2 (Sept 1986): 900,163 amino acids, 3,939 entries; Release 43 (March 2004): 54,093,154 amino acids, 146,720 entries.
2. Advances in hardware technology will lead to
increased miniaturisation and simplification of
biosensor systems. This will have a concomitant effect
in the distribution of technology out of the specialised
laboratory environment, and into the wider world. The
movement towards smaller technology will allow
wider distribution of instrumentation usable in both
genomics and proteomics. Both systems are dependent
on electrophoresis: the separation and sizing of
biomolecules by their charge/mass ratios in a complex
sieving matrix. This can be achieved in either capillary
or microfluidic chip systems, allowing smaller or
cheaper bioanalysis machines. It is only with the
development of such systems that the widespread
distribution of biosensors for a wide range of
applications can be feasible.
One of the major factors essential to this type of operation is cost. Currently, the only systems capable of this type of field-based distributed bioanalysis are found in the military arena, where cost is not a first-tier issue. A fairly typical unit [1] could cost up to £250K. The cost issue reflects both the capital expenditure of buying the machine and the operating expenditure: the cost of the chemical and biological (‘wet’) consumable components the machine needs to function. Antibodies, the core of most existing systems, are extremely expensive, costing up to £100K to develop and upwards of £195 per 100 mg to buy. In traditional labelled systems, the cost of the label used to visualise the biomolecule can also be a major factor. The running costs of such systems are therefore prohibitive for non-military applications. In a manner analogous to the evolution from mainframe computers to desktops, laptops and hand-held devices, biology too is becoming more distributed and more personalised; and just as costs fell in computing as the hardware shrank, both the initial and the running costs of biosensors have to be brought down. One way of achieving this is to remove the need for labels and antibodies, coupled with technologically simple systems in which software algorithms are a key component of the analysis.
This leads to a further consideration in terms of system usability. All existing biosensor systems are extremely complex and require a technologically adept operator. The new deltaDOT biosensor systems are designed to be relatively simple to use, particularly where data handling is concerned. This, coupled with the potential of the GRID as a dissemination and analysis medium, makes a distributed biosensor system a feasible prospect.
Another, deceptively trivial, factor in such systems is the ‘cold chain’: the vital need to keep many of the ‘wet’ components of a distributed system within a narrow and constant temperature range, usually around 4°C, the temperature of a domestic fridge. Should the cold chain fail, the validity of the components, and therefore the amount of trust that can be placed in the biosensor, is diminished.
Therefore, in order for cheap distributed systems to be viable, all of these issues must be addressed.
Label Free Intrinsic Imaging (LFII) and
Distributed BioSensors
This distributed system of data analysis and
dissemination is the first step to enabling the future
role of miniaturised high throughput system (HTS)
diagnostic technologies currently in development.
These biosensor systems exploit Label Free Intrinsic
Imaging (LFII), a novel combination of several
innovative fields [2], using no labels, advanced signal
processing and microfluidic systems, such as
capillaries and micro-fabricated chip structures.
At the current time ~90% of biological research
utilises labels to image the biomolecule. Taking
proteins as the model, conventional analysis usually
takes the form of electrophoresis, a process based on
simple chromatography, where proteins are separated
in some sort of a sieving matrix. The separation is
achieved in an electric field, proteins having a net
negative charge. Under specific chemical conditions
the proteins move through the matrix, smaller ones
migrating faster as they can move through the matrix
more easily. As the proteins reach a steady state they
separate. The pattern of proteins appears as a ladder of
bands in the matrix, but this can only be seen when
chemical dyes are added to stain the protein. These
dyes are a major source of complexity, error, cost, and
time.
Label free systems work by exploiting the
natural absorptivity of biomolecules at certain
wavelengths in the spectrum. For example, proteins
preferentially absorb energy at 214nm and 280nm.
Instruments that take advantage of this feature are rare because the large loss in signal-to-noise ratio has, to date, made such systems complex and very expensive. However, the new LFII systems that form
the core of this potential biosensor system are
comparatively simple and use advanced signal
processing algorithms adapted from particle physics to
extract useful results from the smaller signal produced
by these unlabelled samples.
These algorithms not only make the system mechanically simpler but also more sensitive; and, as there are fewer chemicals to buy, it is also cheaper, more reliable and easier to use. Rather than being an add-on feature, basic data-handling software has been designed into the biosensors from the start.
The DiscoveryNET project [3, 4] is developing
GRID based methods for the integration and the
analysis of data generated from distributed high
throughput devices in a variety of application areas
including environmental science and remote sensing.
The goal is to develop an advanced generic
computing infrastructure that supports real-time
processing, interpretation, integration, visualisation and
mining of massive amounts of time-critical data
generated from such devices. In this paper we describe
the application of the sensors and Discovery Net
technology in a GM crop-monitoring scenario.
Hardware
The same basic instrumentation will be able to analyse protein, DNA and RNA, whilst a slightly more sophisticated unit will perform DNA sequencing. Such distributed systems will be usable in a wide range of applications in both genomic and proteomic analysis, such as:
• DNA fragment sizing
• Distributed DNA sequencing
• Real-time epidemiology
• Real-time diagnostics
• Personal Drug targeting
• Single Nucleotide Polymorphisms
• Disease marker proteins
• Up-regulation events
The LFII system consists of a UV light source and optical components, such as filters and lenses, arrayed on an optical rail. These are arranged to focus the UV light on a separation matrix, in this case in a 50 µm internal diameter quartz silica capillary. The capillary is filled with a thick sieving matrix through which the proteins migrate under the influence of an electrical potential. As the proteins move through the matrix they absorb energy from the UV light source, which causes a drop in the signal as it is detected on a Photo Diode Array (PDA). It is the subsequent signal processing of this data that gives the system such power. The basic machine, shown in Figure 2, called the deltaDOT protein analysis system (PAS), is small enough to fit on a standard laboratory bench and runs via a USB cable from a standard laptop computer.
Figure 2. The deltaDOT PAS. The unit has twin 24-pot carousels, allowing 24 samples on one side and a wide variety of ‘house-keeping’ chemicals, such as sieving matrices, cleaning chemicals and buffers, on the other. It is fully automated for high throughput analysis.
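The signal-processing algorithms themselves are beyond the scope of this paper, but the sketch below illustrates, in highly simplified form, the kind of step involved: estimating a baseline on the raw PDA trace and picking out peaks that rise above the noise. It is a minimal illustration only; the function name, the median-filter baseline and the prominence threshold are assumptions made here, not deltaDOT's actual algorithms.

```python
# Minimal sketch of peak detection on a raw photodiode-array trace.
# Assumptions: the trace is a 1-D array of absorbance values sampled at a
# fixed scan rate, and the first 500 scans contain no peaks.
import numpy as np
from scipy.signal import find_peaks, medfilt

def detect_peaks(absorbance: np.ndarray, scan_rate_hz: float = 10.0):
    """Return (scan index, migration time in seconds, height) for each peak."""
    # Rolling-median baseline estimate; the real system uses far more
    # sophisticated signal processing adapted from particle physics.
    baseline = medfilt(absorbance, kernel_size=201)
    corrected = absorbance - baseline
    # Use an assumed empty leading region to estimate the noise level.
    noise = np.std(corrected[:500])
    idx, _ = find_peaks(corrected, prominence=5 * noise)
    return [(int(i), i / scan_rate_hz, float(corrected[i])) for i in idx]
```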
Output
The data output can be tuned to a variety of
applications, but in this case, the mirror plot is the
most appropriate visualisation. In an advanced system
all analysis would be automated, diminishing the need
for specialist interpretational skills. Figure 3 shows a
mirror plot of two complex protein mixtures. One is a
standard bacterial lysate (black line), and the other is
the same sample, to which an unknown protein was
added. This was done in order to test both the ability
of the system to spot anomalies such as the added
protein and also to demonstrate the display software.
Figure 3. A mirror plot to show the comparisons
between two E. coli samples. The added protein is
clearly visible and the mirror plot allows simple
visualisation and comparison of the samples.
This demonstration shows how the technology
can be used to detect anomalies between two similar
systems. In basic research and diagnostics this means
it can be used to look for indications of disease, be
they causative or symptoms. This is because diseases,
whether infectious or cancerous, generally result in a
change in the proteomic profile. Such a change will
also be visible in a Genetically Modified (GM) crop
analysis.
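A mirror plot of the kind shown in Figure 3 can be produced very simply: one trace is drawn normally and the comparison trace is drawn inverted below the axis, so that an extra or missing peak in either sample stands out. The sketch below is a hypothetical illustration of this display idea, not the deltaDOT display software itself.

```python
# Sketch of a mirror plot for comparing two electropherogram traces.
import numpy as np
import matplotlib.pyplot as plt

def mirror_plot(reference: np.ndarray, sample: np.ndarray,
                label_a: str = "reference", label_b: str = "sample"):
    fig, ax = plt.subplots(figsize=(8, 3))
    ax.plot(np.arange(len(reference)), reference, color="black", label=label_a)
    # The second trace is inverted so the two profiles face each other.
    ax.plot(np.arange(len(sample)), -sample, color="red", label=label_b)
    ax.axhline(0.0, color="grey", linewidth=0.5)
    ax.set_xlabel("scan number")
    ax.set_ylabel("absorbance (mirrored)")
    ax.legend()
    return fig
```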
GM crop monitoring
For environmental and safety reasons, since gene
expression and regulation are complex co-ordinated
processes, any changes made to the genome of
consumer crops have to pass strict regulations set down
by the UK’s Food Standards Agency (FSA) before
being approved. Therefore genomic or proteomic
analysis can be used to check that the produce has had
no modification or only an approved one. This could be performed on genomic or proteomic samples, either in the field or, for example, at the Goods-In station of a supermarket, where incoming produce could be monitored.
The Protein Analysis System produces a protein
profile for each sample – a list of proteins present,
distinguished by mass and quantity.
Figure 4 shows protein profiles of genetically modified and unmodified cereal samples supplied by the FSA. The top, non-GM crop shows a normal or standard protein profile, while the lower GM profile clearly shows a massive extra peak at about 9300 scans.
Figure 4. Protein profiles of non-GM (top) and GM (bottom) cereal samples; the GM profile shows an additional peak at approximately 9300 scans.
As an initial validation step, the profile may be
compared with a known standard profile for the
species of crop sampled. This will allow basic
confirmation that the sample is of the species
expected, and normalisation to the standard profile,
after which any statistically significant differences
from the standard profile can be determined.
Variations from the standard protein profile could be
caused by transgene introduction.
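A minimal sketch of this validation step is given below, assuming each profile is a list of peaks described by apparent mass and quantity. The peak-matching tolerance, the crude total-area normalisation and the significance threshold are illustrative assumptions rather than the procedure actually used by the system.

```python
# Sketch: normalise a sample profile to a standard profile and flag peaks
# whose quantities deviate significantly. All thresholds are illustrative.
from dataclasses import dataclass

@dataclass
class Peak:
    mass_kda: float   # apparent molecular mass of the protein
    quantity: float   # peak area, used as a measure of quantity

def significant_differences(sample, standard, mass_tol=0.5, n_sigma=3.0):
    """Return (peak, expected, observed) for peaks deviating from the standard."""
    # Crude normalisation: scale the sample so its total peak area
    # matches that of the standard profile.
    scale = sum(p.quantity for p in standard) / sum(p.quantity for p in sample)
    flagged = []
    for p in sample:
        match = next((s for s in standard
                      if abs(s.mass_kda - p.mass_kda) <= mass_tol), None)
        expected = match.quantity if match else 0.0
        observed = p.quantity * scale
        # Illustrative counting-statistics error model.
        sigma = max(expected, 1.0) ** 0.5
        if abs(observed - expected) > n_sigma * sigma:
            flagged.append((p, expected, observed))
    return flagged
```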
The next stage of the analysis is therefore to
identify likely GM traits that could have produced the
observed variation, by referring in real time to online reference sources which would contain libraries of transgenes and their expected effects on the protein profiles of different crop types. These could be either agency databases, e.g. from the FSA, or standard sources such as SwissPROT [5], the main protein database. Further
levels of complexity could include the effects of
multiple transgenes present simultaneously, and the
expected natural variation in protein profiles under
different conditions.
The grid infrastructure (Figure 5) permits
efficient management of this type of distributed
analysis, as well as authentication and authorisation
for access to restricted data sources. However, as such
sources are not yet available, our analysis for this
demonstration uses a simpler algorithm which checks
for the presence or absence of one of a number of very specific transgenic peaks.
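This presence/absence check can be expressed in a few lines, as sketched below. The marker-mass library and mass tolerance are placeholders which, in a full system, would be drawn from agency or SwissPROT-style reference sources; the function name is an assumption for illustration.

```python
# Sketch of the simpler detection used in the demonstration: does the
# profile contain any of a small set of known transgenic marker peaks?
def contains_transgenic_peak(profile, marker_masses_kda, mass_tol=0.5):
    """True if any known transgenic marker peak appears in the profile.

    `profile` is a list of Peak objects (see the earlier sketch) and
    `marker_masses_kda` is a library of marker peak masses.
    """
    return any(abs(p.mass_kda - m) <= mass_tol
               for p in profile
               for m in marker_masses_kda)
```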
There will be numerous other factors that will
influence possible transfer of GM traits between trial
and neighbouring sites. Basic considerations include
the mechanism and timing of pollination [6]. GM crops can be engineered to reduce the chances of cross-pollination, but this would still need to be monitored to ensure 100% efficiency. The timing of pollination events is also a major factor: in each individual case, does it coincide with that of neighbouring crops, and is it possible to get interspecies cross-pollination? Adding
complexities to this already complex scenario are
possible mechanisms of transfer. What are the local
pollinator fauna (i.e. bees etc.) populations and what
are their feeding preferences and habits? Is there
transfer by mechanical means, such as ramblers and
farm machinery? Two final and immense impacts on
transfer could be physical geography and the weather.
These myriad factors are all addressable, but
form a potentially huge and complex data set that will
need a GRID solution.
Figure 5. The workflow shown performs the described
analysis, identifying GM samples from the simulated
protein profile data. It collates the results for each field,
and annotates them with information about the transgenes
identified and background information on the fields
themselves. The data analysis components within the
workflow are distributed computations co-ordinated by the
Discovery Net workflow engine.
Grid-based GM crop monitoring
A GM crop monitoring scenario has been
developed to demonstrate this application.
An area near Cambridge around the site of a
previous GM field test was selected for the model. Five
scenarios, each considering different levels and mechanisms of GM contamination, were constructed,
assuming that 100 samples would be taken and tested
from each of 102 fields. Each sample taken from a field
would be tested by a PAS biosensor, to produce a
protein profile.
Each protein profile is represented in tabular
form, where each row corresponds to a peak for a
particular protein present in the sample, and the peak
area is calculated to find the protein quantity. This is,
from the point of view of Discovery Net, the raw data.
Each sample may be tested many times, and so may
have multiple associated protein profiles; in this case, however, it was decided to simplify matters by examining only one profile per sample.
The example protein profiles used were provided by deltaDOT for GM and non-GM cereal samples obtained from the FSA. These were reduced to a normal (non-GM) profile containing 38 protein peaks, plus a GM ‘feature’ profile. Each sample in each field
was then assigned a profile, potentially modified with
the GM feature according to the percentage defined in
the scenario description. This resulted in total raw data
for the five scenarios of nearly 2 million rows.
Thus even this simplistic model results in a
considerable amount of data which benefits from the use
of an automated analysis system.
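A sketch of how such raw data could be generated is shown below: for each field, every sample receives the 38-peak non-GM base profile, the GM feature peak is appended at the contamination rate defined in the scenario, and one row is written per peak. The function, file format and random model are illustrative assumptions, not the simulation code actually used.

```python
# Sketch of scenario simulation producing the tabular raw data described
# above (one row per peak, per sample, per field).
import csv
import random

def simulate_scenario(fields, base_profile, gm_feature, out_path,
                      samples_per_field=100, seed=0):
    """Write one CSV row per peak for every sample in every field.

    fields:        {field_id: contamination_fraction in [0.0, 1.0]}
    base_profile:  list of (mass_kda, peak_area) for the non-GM peaks
    gm_feature:    (mass_kda, peak_area) of the transgenic marker peak
    """
    rng = random.Random(seed)
    with open(out_path, "w", newline="") as fh:
        writer = csv.writer(fh)
        writer.writerow(["field_id", "sample_id", "mass_kda", "peak_area"])
        for field_id, fraction in fields.items():
            for sample_id in range(samples_per_field):
                peaks = list(base_profile)
                if rng.random() < fraction:   # this sample is contaminated
                    peaks.append(gm_feature)
                for mass, area in peaks:
                    writer.writerow([field_id, sample_id, mass, area])
```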
Data analysis
Having simulated the raw data containing protein
profiles, the aim of data analysis was now to recreate
the original scenario specifications, showing the
percentage of GM contamination in each field.
The most important step in this analysis is the
identification of GM features in a protein profile, which
in this case was simple and only required the detection
of a particular peak. This is the stage which would
traditionally require expert interpretation of the
experimental data.
A more advanced implementation of this step
could potentially be performed if a central database
existed documenting features associated with different
types of GM, and preferably another which recorded
typical protein profiles for standard crop samples. A
generic analysis procedure for detecting and identifying
transgenic modifications could then be outlined as
follows:
• Determine if the sample matches the expected
species profile (elementary verification that the result
is sensible and has not been contaminated by another
sample). This attempts to provide some basic quality
assurance which would have been done as a matter of
course if a human expert were performing the
analysis.
• Normalize it if necessary, for example
accounting for expected natural variations. Again, this
is an aspect of analysis at which a human expert
would excel, but it is more of a challenge for
unassisted algorithms.
• Identify statistically significant differences from the 'normal' profile.
• Match a profile of differences to one or more known GM features. These features may well be complicated, involving multiple proteins and up/down-regulation, and variations due to the conditions when taking the sample may have to be accounted for.
• If a likely GM candidate has not yet been found, the researcher may choose to look up the interesting peaks to learn more about them, using Discovery Net online annotation and lookup tools.
The remainder of the analysis workflow deals with transforming and grouping the data to aggregate results within fields, producing the required GM percentages per field in each scenario. It also annotates the results with further information about the GM feature discovered, along with scenario and field descriptions, in preparation for display by the GIS visualisation tool (GISViz).
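As an illustration of this aggregation step, the sketch below groups per-sample GM calls by field and computes the percentage of contaminated samples. The input format and function name are assumptions for illustration, not part of the Discovery Net workflow implementation itself.

```python
# Sketch of per-field aggregation of sample-level GM calls.
from collections import defaultdict

def gm_percentage_per_field(sample_calls):
    """sample_calls: iterable of (field_id, is_gm) pairs -> {field_id: % GM}."""
    totals = defaultdict(int)
    positives = defaultdict(int)
    for field_id, is_gm in sample_calls:
        totals[field_id] += 1
        if is_gm:
            positives[field_id] += 1
    return {f: 100.0 * positives[f] / totals[f] for f in totals}
```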
Visualisation
Visualisation aids the user in interpreting the
data and attempting to find factors which influence GM
spread. As previously described, the plethora of
additional factors that would influence any possible
release need to be factored in and also displayed in a
simple, understandable way so that decisions can be
made, possibly by non-technical users.
The main form of visualisation chosen (Figure 6) was to show, for each field, a pie chart of the percentage of GM and non-GM samples, displayed on the map over that field. The system allows switching between
different scenarios, and a number of different
background images were supplied to aid interpretation
by the user. The selection of background images
provided reflects several possible influencing factors
considered: an aerial photo, showing the field
boundaries; a traditional map showing transport
networks and contour lines; a shaded map showing the
types of crops grown in each field; and a diagrammatic
representation of the average wind direction, created from source data provided by the Iceni Weather Station [7].
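A rough sketch of this kind of overlay is given below, drawing a small pie chart at each field's position on top of a background image. The coordinate scheme, colours and use of matplotlib are illustrative assumptions; the actual display is produced by the Discovery Net GIS visualisation tool (GISViz).

```python
# Sketch: overlay per-field GM/non-GM pie charts on a background map image.
import matplotlib.pyplot as plt

def overlay_field_pies(background_png, field_positions, gm_percentages):
    """field_positions: {field_id: (x, y)} in image pixel coordinates;
    gm_percentages: {field_id: percentage of GM samples in that field}."""
    img = plt.imread(background_png)
    fig, ax = plt.subplots()
    ax.imshow(img)
    for field_id, (x, y) in field_positions.items():
        gm = gm_percentages.get(field_id, 0.0)
        # Each pie is drawn directly onto the map at the field's position.
        ax.pie([gm, 100.0 - gm], colors=["red", "green"],
               radius=12, center=(x, y), frame=True)
    ax.set_xlim(0, img.shape[1])
    ax.set_ylim(img.shape[0], 0)   # preserve image orientation
    ax.set_axis_off()
    return fig
```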
Figure 6. The visualisation of a modelled GM contamination event. The pie charts indicate the level of modelled contamination across the site. The wind direction, while not indicated, appears to have influenced the movement of pollen.
The GIS system allows integration of many different
types of information. Bringing in these external sources
would allow further types of analysis and ways of
aiding decision making, for example identifying organic
farms in the area.
A cross-disciplinary platform
The reusable workflow and analysis technology provided by Discovery Net are demonstrated in this example scenario, the implementation of which required minimal custom coding and reused many features developed for the GUSTO project, particularly the GIS Visualisation system.
Figure 7. GM levels in the fields surrounding the test site, shown as circles sized by the amount of GM material present, with the prevailing wind direction indicated by white sectors.
Figure 7 shows how both release and wind
direction can be shown on one visualisation: the levels
of GM material found in fields surrounding the test site
are now shown as circles of different sizes: the larger the circle, the more GM material present; e.g. the circle in the centre is the test field itself, where GM material will be ~100% prevalent. The prevailing SW-to-NE wind direction is represented by the white sectors, again with size being relative.
To further aid interpretation, filters may be
applied so that only levels of GM above a user-specified
threshold are displayed.
Additional weather information, such as rainfall, air pressure, humidity and temperature (data that are easily available from the UK’s excellent network of weather stations), could be added to these visualisations
as required. The same system could be used to display
results over a number of years. Any analysis of GM
spread would then also have to take into account factors
such as crop rotation.
The current demonstration assumes that a human
will interpret the results aided by visualisation.
However, data mining algorithms could be used to
explore correlations in space and time, particularly if
data were gathered for a number of years. For example:
• Identifying original sources of GM contamination. This could become a significant challenge if there were many potential sources, of different crop and GM types.
• Analytically determining the rate, direction and range of spread, and the factors by which these are affected.
References
1. Smiths, Cerberus Mobile NBC Detection System. 2002, Smiths Detection Ltd.
2. J. Hassard and S. Hassard, Molecular imaging. 2000: US and International.
3. Salman AlSairafi, Filippia-Sofia Emmanouil, Moustafa Ghanem, Nikolaos Giannadakis, Yike Guo, Dimitrios Kalaitzopoulos, Michelle Osmond, Anthony Rowe, Jameel Syed and Patrick Wendel, The Design of Discovery Net: Towards Open Grid Services for Knowledge Discovery. Special issue of The International Journal on High Performance Computing Applications on Grid Computing: Infrastructure and Applications, 2003. 17(3).
4. Anthony Rowe, Michelle Osmond, Moustafa Ghanem and Yike Guo, The Discovery Net System for High Throughput Bioinformatics. In Proceedings of the Eleventh International Conference on Intelligent Systems for Molecular Biology, 2003. Also appears in ISMB (Supplement of Bioinformatics) 2003: 225-231.
5. SwissPROT, The analysis of protein sequences and structures as well as 2-D PAGE. 2004, The Swiss Institute of Bioinformatics.
6. Bayer, Notification C/BE/96/01 Oilseed Rape Ms8xRf3. 2003, Bayer CropScience. p. 28.
7. Barker, R., The Iceni weather station. 2004.