Distributed BioSensors in Genetically Modified Crop Trial Monitoring

S. Hassard*, M. Osmond*, F. Pereira#, M. Howard#, S. Klier#, R. Martin#, J. Hassard#
*Imperial College London, South Kensington, London SW7 2AZ
# deltaDOT Ltd, Charing Cross Hospital, W6 8RF
The DiscoveryNET consortium

Biotechnology is moving towards the use of small, cheap instruments and sensors connected using distributed systems technologies. This nascent distributed system can use a concomitant GRID architecture to acquire, collate and analyse the data from each biosensor. Distributed protein analysis systems can, for example, be used to monitor the possible spread of GM traits in crops by identifying marker proteins. These data, collated with meteorological, geographic and other factors, can be used to advise the appropriate authorities of any release or other problem, aiding rapid decision-making.

Motivation

The increase in biological data and the advances in data acquisition systems will have two major impacts over the next 5-10 years.

1. The latest release statistics from the two major biological databases, shown in Figure 1, show that information generation has grown roughly linearly for protein data but exponentially for sequence data. This disparity is largely due to the method of data acquisition (DNA sequencing has been largely automated for the last five or so years), but the trend is clear in both cases and set to continue. There has been speculation that, now the human and certain other genomes have been completed, the need for DNA sequencing will diminish. This conjecture is incorrect: the expansion in DNA data acquisition will continue at, if not exceed, current rates. The increase will be due, in part, to new sequencing technologies making DNA sequencing usable in everyday diagnostic processes such as drug choice and prescription, as well as pre-clinical cohort screening. Protein database entries are growing at a lower rate than DNA entries because the complexity of protein analysis has so far prevented the automation of proteomic systems. However, polyacrylamide gel electrophoresis (PAGE) systems are now gradually being replaced by higher-throughput capillary and microchip systems in pharmaceutical and biotechnology laboratories, and this trend will continue. If the amount of proteomic and genomic information being gathered continues to grow at current or increased rates, new technologies that further increase this already phenomenal rate of data acquisition must have an inherent, structured data handling function.

Figure 1. The growth trends of typical existing biological databases (sources: http://www.ncbi.nlm.nih.gov/Genbank/genbankstats.html and http://ca.expasy.org/sprot/relnotes/relstat.html).

From 1982 to the latest release, the number of bases in GenBank has doubled approximately every 14 months.

GenBank
  Release   Date        Base Pairs        Entries
  3         Dec 1982    680,338           606
  141       Apr 2004    38,989,342,565    33,676,218

SWISSPROT has increased by about 8% per year.

SWISSPROT
  Release   Date          Amino Acids    Entries
  2         Sept 1986     900,163        3,939
  43        March 2004    54,093,154     146,720

2. Advances in hardware technology will lead to increased miniaturisation and simplification of biosensor systems. This will have a concomitant effect on the distribution of technology out of the specialised laboratory environment and into the wider world. The movement towards smaller technology will allow wider distribution of instrumentation usable in both genomics and proteomics.
Both systems are dependent on electrophoresis: the separation and sizing of biomolecules by their charge/mass ratios in a complex sieving matrix. This can be achieved in either capillary or microfluidic chip systems, allowing smaller and cheaper bioanalysis machines. It is only with the development of such systems that the widespread distribution of biosensors for a wide range of applications becomes feasible.

One of the major factors essential to this type of operation is cost. Currently the only systems capable of field-based distributed bioanalysis are found in the military arena, where cost is not a first-tier issue. A fairly typical unit [1] can cost up to £250K. The cost issue reflects both the capital expenditure in buying the machine and the operating expenditure: the cost of the chemical and biological (‘wet’) consumable components the machine needs to function. Antibodies, the core of most existing systems, are extremely expensive, costing up to £100K to develop and upwards of £195 per 100 mg. In traditional labelled systems, the cost of the label used to visualise the biomolecule can also be a major factor. The running costs of such systems are therefore prohibitive for non-military applications.

In a manner analogous to the evolution from mainframe computers to desktops, laptops and hand-held devices, biology too is becoming more distributed and more personalised, and just as cost was reduced in computing as a function of size, both the initial and running costs of biosensors have to be brought down. One way of achieving this is to remove the need for labels and antibodies, coupled with technologically simple systems in which software algorithms are a key component of the analysis.

This leads to a further consideration: system usability. All existing biosensor systems are extremely complex and require a technologically adept operator. The new deltaDOT biosensor systems are designed to be relatively simple to use, particularly where data handling is concerned. This, coupled with the potential of the GRID as a dissemination and analysis medium, presents the future of a distributed biosensor system in a feasible light.

Another, deceptively trivial, factor in such systems is the ‘cold chain’: the vital need to keep many of the ‘wet’ components of a distributed system within a narrow and constant temperature range, usually around 4°C, the temperature of a domestic fridge. Should the cold chain fail, the validity of the components, and therefore the trust that can be placed in the biosensor, is diminished. For cheap distributed systems to be viable, all of these issues must be addressed.

Label Free Intrinsic Imaging (LFII) and Distributed BioSensors

This distributed system of data analysis and dissemination is the first step to enabling the future role of miniaturised high throughput system (HTS) diagnostic technologies currently in development. These biosensor systems exploit Label Free Intrinsic Imaging (LFII), a novel combination of several innovative fields [2], using no labels, advanced signal processing and microfluidic systems such as capillaries and micro-fabricated chip structures. At the current time ~90% of biological research uses labels to image the biomolecule. Taking proteins as the model, conventional analysis usually takes the form of electrophoresis, a process based on simple chromatography in which proteins are separated in a sieving matrix.
The separation is achieved in an electric field, the proteins having a net negative charge. Under specific chemical conditions the proteins move through the matrix, smaller ones migrating faster as they can move through the matrix more easily. As the proteins reach a steady state they separate, and the pattern of proteins appears as a ladder of bands in the matrix. This pattern can only be seen when chemical dyes are added to stain the protein, and these dyes are a major source of complexity, error, cost and time.

Label-free systems work by exploiting the natural absorptivity of biomolecules at certain wavelengths. For example, proteins preferentially absorb energy at 214 nm and 280 nm. Instruments that take advantage of this feature are rare because the large loss of signal-to-noise has, to date, made such systems complex and very expensive. However, the new LFII systems that form the core of this potential biosensor system are comparatively simple and use advanced signal processing algorithms, adapted from particle physics, to extract useful results from the smaller signal produced by these unlabelled samples. These algorithms not only make the system mechanically simpler, but also more sensitive and, as there are fewer chemicals to buy, cheaper, more reliable and easier to use. Rather than being an add-on feature, basic data handling software has been designed into the biosensors from the start.

The DiscoveryNET project [3, 4] is developing GRID-based methods for the integration and analysis of data generated from distributed high throughput devices in a variety of application areas, including environmental science and remote sensing. The goal is to develop an advanced generic computing infrastructure that supports real-time processing, interpretation, integration, visualisation and mining of massive amounts of time-critical data generated from such devices. In this paper we describe the application of the sensors and Discovery Net technology in a GM crop-monitoring scenario.

Hardware

The same basic instrumentation will be able to analyse protein, DNA and RNA, whilst a slightly more sophisticated unit will perform DNA sequencing. Such distributed systems will be usable in a wide range of genomic and proteomic applications, such as:
• DNA fragment sizing
• Distributed DNA sequencing
• Real-time epidemiology
• Real-time diagnostics
• Personal drug targeting
• Single nucleotide polymorphisms
• Disease marker proteins
• Up-regulation events

The LFII system consists of a UV light source and optical components such as filters and lenses arrayed on an optical rail. These are arranged to focus the UV light on a separation matrix, in this case a 50 µm internal diameter quartz silica capillary. The capillary is filled with a thick sieving matrix through which the proteins migrate under the influence of an electrical potential. As the proteins move through the matrix they absorb energy from the UV light source, which causes a drop in the signal detected on a Photo Diode Array (PDA). It is the subsequent signal processing of this data that gives the system its power. The basic machine, shown in Figure 2, called the deltaDOT Protein Analysis System (PAS), is small enough to fit on a standard laboratory bench and runs via a USB cable from a standard laptop computer.

Figure 2. The deltaDOT PAS.
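To make the detection and signal-processing step concrete, the sketch below simulates the kind of trace a PDA might record (absorbance peaks on a noisy, drifting baseline), detects the peaks, and sizes them against a marker ladder using a conventional log-linear standard curve. It is a minimal illustration only: the peak shapes, noise level, marker masses and the use of scipy.signal.find_peaks are assumptions made for this sketch, not a description of deltaDOT's actual particle-physics-derived algorithms.

```python
# Minimal sketch: peak-finding and sizing on a simulated UV-absorbance trace.
# All numbers (peak positions, widths, marker masses) are illustrative only.
import numpy as np
from scipy.signal import find_peaks

rng = np.random.default_rng(0)
scans = np.arange(12000)

def gaussian(centre, width, height):
    return height * np.exp(-0.5 * ((scans - centre) / width) ** 2)

# Simulated electropherogram: three protein peaks on a noisy, drifting baseline.
true_peaks = [(2500, 40, 0.020), (4800, 50, 0.035), (9300, 60, 0.050)]
trace = sum(gaussian(c, w, h) for c, w, h in true_peaks)
trace += 0.002 * np.sin(scans / 3000.0)        # slow baseline drift
trace += rng.normal(0, 0.001, scans.size)       # detector noise

# Crude baseline removal with a wide moving average, then peak detection.
kernel = np.ones(801) / 801
baseline = np.convolve(trace, kernel, mode="same")
signal = trace - baseline
peaks, _ = find_peaks(signal, prominence=0.01, distance=200)

# Sizing: calibrate scan number against log10(mass) using a hypothetical
# marker ladder of known masses, then interpolate for the detected peaks.
marker_scans = np.array([2000, 4000, 6000, 8000, 10000])
marker_kda = np.array([10.0, 25.0, 50.0, 100.0, 200.0])
slope, intercept = np.polyfit(marker_scans, np.log10(marker_kda), 1)

for p in peaks:
    est_kda = 10 ** (slope * p + intercept)
    area = float(signal[max(p - 200, 0):p + 200].sum())   # crude peak area
    print(f"peak at scan {p}: ~{est_kda:.1f} kDa, area {area:.3f}")
```

The log-linear standard curve is the textbook way of turning migration position into an estimated molecular weight; the real instrument's sensitivity comes from far more sophisticated treatment of the low signal-to-noise data than this moving-average baseline.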
The unit has twin 24-pot carousels, allowing 24 samples in one side and a wide variety of ‘house-keeping’ chemicals, such as sieving matrices, cleaning chemicals and buffers, in the other. It is fully automated for high throughput analysis.

Output

The data output can be tuned to a variety of applications, but in this case the mirror plot is the most appropriate visualisation. In an advanced system all analysis would be automated, diminishing the need for specialist interpretational skills. Figure 3 shows a mirror plot of two complex protein mixtures. One is a standard bacterial lysate (black line); the other is the same sample to which an unknown protein was added. This was done both to test the ability of the system to spot anomalies such as the added protein and to demonstrate the display software.

Figure 3. A mirror plot showing the comparison between two E. coli samples. The added protein is clearly visible, and the mirror plot allows simple visualisation and comparison of the samples.

This demonstration shows how the technology can be used to detect anomalies between two similar systems. In basic research and diagnostics this means it can be used to look for indications of disease, be they causative or symptomatic, because diseases, whether infectious or cancerous, generally result in a change in the proteomic profile. Such a change will also be visible in a Genetically Modified (GM) crop analysis.

GM crop monitoring

For environmental and safety reasons, and because gene expression and regulation are complex, co-ordinated processes, any changes made to the genome of consumer crops have to pass strict regulations set down by the UK's Food Standards Agency (FSA) before being approved. Genomic or proteomic analysis can therefore be used to check that the produce has had no modification, or only an approved one. This could be performed on genomic or proteomic samples, either in the field or at, for example, the goods-in station of a supermarket, where incoming produce could be monitored.

The Protein Analysis System produces a protein profile for each sample: a list of the proteins present, distinguished by mass and quantity. Figure 4 shows protein profiles of genetically modified cereal supplied by the FSA. The top, non-GM crop shows a normal or standard protein profile, while the lower, GM profile clearly shows a massive extra peak at about 9300 scans.

Figure 4.

As an initial validation step, the profile may be compared with a known standard profile for the species of crop sampled. This allows basic confirmation that the sample is of the expected species, and normalisation to the standard profile, after which any statistically significant differences from the standard profile can be determined. Variations from the standard protein profile could be caused by transgene introduction. The next stage of the analysis is therefore to identify likely GM traits that could have produced the observed variation, by referring in real time to online reference sources containing libraries of transgenes and their expected effects on the protein profiles of different crop types. These could be either agency databases (e.g. the FSA's) or standard sources such as SwissPROT [5], the main protein database. Further levels of complexity could include the effects of multiple transgenes present simultaneously, and the expected natural variation in protein profiles under different conditions.
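The validation and normalisation steps just described can be sketched in a few lines. Assuming each profile is held as a set of (mass, quantity) peaks, the code below normalises a sample against a hypothetical standard profile for the crop species and flags peaks that either have no counterpart in the standard or deviate strongly in quantity. The peak-matching tolerance, the deviation threshold and all profile values are invented for illustration; they are not the FSA's or deltaDOT's actual criteria.

```python
# Sketch: compare a sample protein profile against a standard species profile.
# Profiles are dicts of {peak_mass_kda: quantity}; all values are illustrative.

MASS_TOL = 0.5         # kDa window within which two peaks are considered the same
DEVIATION_LIMIT = 3.0  # flag peaks whose normalised quantity differs by more than 3x

standard = {12.0: 1.0, 23.5: 0.6, 31.0: 2.2, 45.0: 0.9, 66.0: 1.5}

def normalise(profile):
    """Scale quantities so they sum to 1, removing run-to-run loading differences."""
    total = sum(profile.values())
    return {m: q / total for m, q in profile.items()}

def match(mass, reference):
    """Return the reference peak mass closest to `mass`, if within tolerance."""
    best = min(reference, key=lambda m: abs(m - mass))
    return best if abs(best - mass) <= MASS_TOL else None

def compare(sample, reference):
    """Flag sample peaks absent from, or strongly deviating from, the reference."""
    sample, reference = normalise(sample), normalise(reference)
    anomalies = []
    for mass, qty in sample.items():
        ref_mass = match(mass, reference)
        if ref_mass is None:
            anomalies.append((mass, "no counterpart in standard profile"))
        else:
            ratio = qty / reference[ref_mass]
            if ratio > DEVIATION_LIMIT or ratio < 1 / DEVIATION_LIMIT:
                anomalies.append((mass, f"{ratio:.1f}x the standard level"))
    return anomalies

# A sample with one extra peak that has no counterpart in the standard profile,
# analogous to the extra GM peak visible in Figure 4.
sample = {12.1: 1.0, 23.4: 0.6, 31.0: 2.3, 45.2: 0.9, 66.1: 1.4, 9.3: 1.8}
for mass, reason in compare(sample, standard):
    print(f"anomalous peak at ~{mass} kDa: {reason}")
```

A mirror plot of the kind shown in Figure 3 is essentially the graphical form of this comparison: one trace plotted upwards and the other downwards, so extra or shifted peaks stand out immediately.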
The grid infrastructure (Figure 5) permits efficient management of this type of distributed analysis, as well as authentication and authorisation for access to restricted data sources. However, as such sources are not yet available, our analysis for this demonstration uses a simpler algorithm which checks for the presence or absence of one of a number of very specific transgenic peaks.

There are numerous other factors that will influence the possible transfer of GM traits between trial and neighbouring sites. Basic considerations include the mechanism and timing of pollination [6]. GM crops can be engineered to reduce the chances of cross-pollination, but this would still need to be monitored to ensure 100% efficiency. The timing of pollination events is also a major factor: in each individual case, does it coincide with that of neighbouring crops, and is interspecies cross-pollination possible? Adding to this already complex scenario are the possible mechanisms of transfer. What are the local pollinator fauna (bees, etc.), and what are their feeding preferences and habits? Is there transfer by mechanical means, such as ramblers and farm machinery? Two final and immense influences on transfer are physical geography and the weather. These myriad factors are all addressable, but they form a potentially huge and complex data set that will need a GRID solution.

Figure 5. The workflow shown performs the described analysis, identifying GM samples from the simulated protein profile data. It collates the results for each field and annotates them with information about the transgenes identified and background information on the fields themselves. The data analysis components within the workflow are distributed computations co-ordinated by the Discovery Net workflow engine.

Grid-based GM crop monitoring

A GM crop monitoring scenario has been developed to demonstrate this application. An area near Cambridge around the site of a previous GM field test was selected for the model. Five scenarios, each considering different levels and mechanisms of GM contamination, were constructed, assuming that 100 samples would be taken and tested from each of 102 fields. Each sample taken from a field would be tested by a PAS biosensor to produce a protein profile. Each protein profile is represented in tabular form, where each row corresponds to a peak for a particular protein present in the sample, and the peak area is calculated to give the protein quantity. This is, from the point of view of Discovery Net, the raw data. Each sample may be tested many times and so may have multiple associated protein profiles; in this case, however, we simplified matters by examining only one profile per sample.

The example protein profiles used were provided by deltaDOT for GM and non-GM cereal samples obtained from the FSA. This was reduced to a normal (non-GM) sample containing 38 protein peaks, plus a GM ‘feature’ profile. Each sample in each field was then assigned a profile, potentially modified with the GM feature according to the percentage defined in the scenario description. This resulted in total raw data for the five scenarios of nearly 2 million rows. Even this simplistic model therefore results in a considerable amount of data, which benefits from the use of an automated analysis system.
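To give a feel for how such raw data are shaped, the sketch below generates tabular data of the kind described: five scenarios, 102 fields and 100 samples per field, each sample carrying the 38-peak non-GM profile and, with a per-field probability, the extra GM ‘feature’ peak. The contamination function, peak values and column names are invented for illustration; only the overall structure (one row per peak per sample) follows the description above.

```python
# Sketch: generate simulated raw data shaped as described in the text --
# one row per protein peak per sample, for 5 scenarios x 102 fields x 100 samples.
# Contamination levels and peak values are invented for illustration.
import csv
import random

random.seed(1)

N_SCENARIOS, N_FIELDS, N_SAMPLES = 5, 102, 100

# 38-peak non-GM reference profile: (peak_id, quantity) pairs, values invented.
normal_profile = [(i, round(random.uniform(0.1, 2.0), 3)) for i in range(1, 39)]
GM_FEATURE_PEAK = (99, 1.8)   # hypothetical extra transgenic peak

def contamination(scenario, field):
    """Invented per-field GM contamination fraction for each scenario."""
    return min(1.0, 0.02 * scenario * (1 + field % 5))

rows = 0
with open("simulated_profiles.csv", "w", newline="") as fh:
    writer = csv.writer(fh)
    writer.writerow(["scenario", "field", "sample", "peak_id", "quantity"])
    for scenario in range(1, N_SCENARIOS + 1):
        for field in range(N_FIELDS):
            gm_fraction = contamination(scenario, field)
            for sample in range(N_SAMPLES):
                peaks = list(normal_profile)
                if random.random() < gm_fraction:
                    peaks.append(GM_FEATURE_PEAK)   # this sample is 'GM'
                for peak_id, quantity in peaks:
                    writer.writerow([scenario, field, sample, peak_id, quantity])
                    rows += 1

print(f"wrote {rows} rows")   # close to the 'nearly 2 million rows' quoted above
```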
Data analysis

Having simulated the raw data containing protein profiles, the aim of the data analysis was to recreate the original scenario specifications, showing the percentage of GM contamination in each field. The most important step in this analysis is the identification of GM features in a protein profile, which in this case was simple and only required the detection of a particular peak. This is the stage that would traditionally require expert interpretation of the experimental data. A more advanced implementation of this step would be possible if a central database existed documenting the features associated with different types of GM, and preferably another recording typical protein profiles for standard crop samples. A generic analysis procedure for detecting and identifying transgenic modifications could then be outlined as follows:

• Determine whether the sample matches the expected species profile (an elementary verification that the result is sensible and has not been contaminated by another sample). This provides some basic quality assurance that would have been done as a matter of course if a human expert were performing the analysis.
• Normalise the profile if necessary, for example to account for expected natural variation. Again, this is an aspect of the analysis at which a human expert would excel, but it is more of a challenge for unassisted algorithms.
• Identify statistically significant differences from the 'normal' profile.
• Match the profile of differences to one or more known GM features. These features may well be complicated, involving multiple proteins and up/down-regulation, and variations due to the conditions when taking the sample may have to be accounted for.
• If a likely GM candidate has not been found, the researcher may choose to look up the interesting peaks to learn more about them, using Discovery Net online annotation and lookup tools.

The remainder of the analysis workflow deals with transforming and grouping the data to aggregate results within fields, producing the required GM percentages per field in each scenario. It also annotates the results with further information about the GM feature discovered, along with scenario and field descriptions, in preparation for display by the GIS visualisation tool (GISViz).

Visualisation

Visualisation aids the user in interpreting the data and in attempting to find factors that influence GM spread. As previously described, the plethora of additional factors that would influence any possible release need to be factored in and displayed in a simple, understandable way so that decisions can be made, possibly by non-technical users. The main form of visualisation chosen (Figure 6) was to show pie charts of the percentage of GM/non-GM samples in each field, displayed on the map over each field. The system allows switching between different scenarios, and a number of different background images were supplied to aid interpretation by the user. The selection of background images reflects several possible influencing factors: an aerial photo showing the field boundaries; a traditional map showing transport networks and contour lines; a shaded map showing the types of crops grown in each field; and a diagrammatic representation of the average wind direction, created from source data provided by the Iceni Weather Station [7].

Figure 6. The visualisation of a modelled contamination event. The pie charts indicate the GM level of modelled contamination across the site. The wind direction, while not indicated, appears to have influenced the movement of pollen.
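The demonstration's detection step (checking for a specific transgenic peak) and the subsequent grouping of results by field can be sketched as below, continuing the hypothetical CSV layout from the simulation sketch above. In the Discovery Net implementation these operations are distributed workflow components rather than a standalone script, so this is only an illustrative equivalent; the per-field GM percentages it produces are the quantities displayed by the pie charts in Figure 6.

```python
# Sketch: flag each sample as GM if the specific transgenic peak is present,
# then aggregate to a GM percentage per field (the figure shown in the pie charts).
# Continues the hypothetical CSV layout from the simulation sketch above.
import csv
from collections import defaultdict

GM_FEATURE_PEAK = 99   # hypothetical transgenic peak identifier

samples = defaultdict(set)   # (scenario, field, sample) -> set of peak ids
with open("simulated_profiles.csv", newline="") as fh:
    for row in csv.DictReader(fh):
        key = (int(row["scenario"]), int(row["field"]), int(row["sample"]))
        samples[key].add(int(row["peak_id"]))

# Count GM-positive samples per (scenario, field).
totals = defaultdict(int)
gm_hits = defaultdict(int)
for (scenario, field, _sample), peak_ids in samples.items():
    totals[(scenario, field)] += 1
    if GM_FEATURE_PEAK in peak_ids:        # the simple presence/absence test
        gm_hits[(scenario, field)] += 1

# Per-field GM percentages, ready to be joined with field and transgene
# annotations and passed to a visualisation layer such as GISViz.
for (scenario, field), n in sorted(totals.items()):
    pct = 100.0 * gm_hits[(scenario, field)] / n
    if pct > 0:
        print(f"scenario {scenario}, field {field}: {pct:.0f}% GM samples")
```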
The GIS system allows integration of many different types of information. Bringing in these external sources would allow further types of analysis and additional ways of aiding decision-making, for example identifying organic farms in the area.

A cross-disciplinary platform

The reusable workflow and analysis technology provided by Discovery Net are demonstrated in this example scenario, the implementation of which required minimal custom coding and reuses many features developed for the GUSTO project, particularly the GIS visualisation system.

Figure 7.

Figure 7 shows how both release and wind direction can be shown in one visualisation: the levels of GM material found in fields surrounding the test site are now shown as circles of different sizes, the larger the circle the more GM material present; for example, the circle in the centre is the test field, where GM material will be ~100% prevalent. The prevailing SW to NE wind direction is represented by the white sectors, again with size being relative. To further aid interpretation, filters may be applied so that only levels of GM above a user-specified threshold are displayed.

Additional weather information, such as rainfall, air pressure, humidity and temperature, data that is easily available from the UK's excellent network of weather stations, could be added to these visualisations as required. The same system could be used to display results over a number of years; any analysis of GM spread would then also have to take into account factors such as crop rotation. The current demonstration assumes that a human will interpret the results, aided by visualisation. However, data mining algorithms could be used to explore correlations in space and time, particularly if data were gathered over a number of years, for example:

• identifying the original sources of GM contamination, which could become a significant challenge if there were many potential sources of different crop and GM types;
• analytically determining the rate, direction and range of spread, and the factors that affect them.

References

1. Smiths Detection Ltd, Cerberus Mobile NBC Detection System, 2002.
2. J. Hassard and S. Hassard, Molecular Imaging, 2000: US and International.
3. Salman AlSairafi, Filippia-Sofia Emmanouil, Moustafa Ghanem, Nikolaos Giannadakis, Yike Guo, Dimitrios Kalaitzopoulos, Michelle Osmond, Anthony Rowe, Jameel Syed and Patrick Wendel, The Design of Discovery Net: Towards Open Grid Services for Knowledge Discovery. Special issue of The International Journal on High Performance Computing Applications on Grid Computing: Infrastructure and Applications, 2003, 17(3).
4. Anthony Rowe, Michelle Osmond, Moustafa Ghanem and Yike Guo, The Discovery Net System for High Throughput Bioinformatics. In Proceedings of the Eleventh International Conference on Intelligent Systems for Molecular Biology; also appears in ISMB (Supplement of Bioinformatics), 2003, 225-231.
5. SwissPROT, The analysis of protein sequences and structures as well as 2-D PAGE. The Swiss Institute of Bioinformatics, 2004.
6. Bayer CropScience, Notification C/BE/96/01 Oilseed Rape Ms8xRf3, 2003, p. 28.
7. Barker, R., The Iceni Weather Station, 2004.