Bioinformatics – Opportunities for Mauritius

Oveeyen Moonian (1), Shakuntala Baichoo (1), Yasmina Jauferally-Fakim (2), Zahra Mungloo-Dilmohamud (1), Sunilduth Baichoo (1)

(1) Department of Computer Science and Engineering, University of Mauritius
(2) Department of Agricultural and Food Sciences, University of Mauritius

Abstract

Traditional research experiments in molecular biology have been largely overtaken by high-throughput methods, which generate far more data in much shorter periods of time. Genome sequences and gene expression studies have become crucial in understanding biological processes. The handling, analysis and storage of such data are only possible with computational tools, hence the development of the interdisciplinary field of bioinformatics, which brings together biology, biochemistry and computer science. Over the past decade or so, bioinformatics has moved to the forefront of research on living organisms. The challenges presented by the scale of data produced in the genomic and post-genomic era have been addressed by developers of computer programs in order to provide efficient means for data analysis and management. A whole new realm of bioinformatics resources has become available to scientists, allowing for rapid discovery of new genes and proteins. All disciplines of the biological sciences, including medical, environmental, microbial and plant sciences, are set to benefit from such developments. This paper describes some of those resources and how they are being used. It also presents an overview of the different bioinformatics organisations which are driving forces behind the rapid implementation of facilities in this field, and of the ethical issues related to bioinformatics development. The paper finally highlights opportunities for Mauritius in the field of bioinformatics.

1. Introduction

Advances in biological sciences over the past few decades have been marked by major developments in technical methods for studying living cells and tissues more closely.
Primarily, the advent of molecular approaches, such as genetic engineering and DNA amplification, has revealed the complexities of cellular interactions which determine physiological and biochemical characteristics. Such methods were, however, relatively limited given that only one or a few genes could be studied at a time. High-throughput technologies have revolutionised experimental outputs to the point that the data coming out of research activities have to be analysed with powerful computational tools. DNA sequencing, microarray technology, DNA and protein chips, and molecular markers have provided new platforms for understanding how biological information is organised and utilised in different organisms. They have allowed an insight into the causes of diseases and into how hosts and pathogens interact, and altogether depict a much more detailed picture of living organisms.

Bioinformatics is an area where computational applications are used for interpreting biological data, mainly from sequences of DNA, RNA or proteins, and from patterns of gene expression. Determination and comparison of protein structures have also become possible through various tools. For this purpose, specific software and algorithms have been developed for particular uses. The field of bioinformatics has developed very rapidly over the last decade and has become indispensable in life sciences research. It integrates various disciplines such as computer science, molecular biology and biochemistry, as well as statistics and mathematics. Data from experiments have to be captured, stored, and made easily accessible to users. Large databases store large amounts of information that can be retrieved and queried by scientists across the world. Many tools are integrated within web-based applications. This paper discusses bioinformatics resources and tools that are currently used and the opportunities the area presents for Mauritius.
The rest of the paper is organized as follows: Section 2 covers the resources available to support research in the area. Section 3 discusses the initiatives taken to develop a Bioinformatics industry in different regions. Section 4 discusses Bioinformatics initiatives on the African continent. Section 5 draws attention to the legal and ethical issues to be handled when developing the area of Bioinformatics. Section 6 discusses the prospects of Bioinformatics for Mauritius. Section 7 makes recommendations for Mauritius to better seize the opportunities and meet potential challenges, and concludes the discussion.

2. Bioinformatics resources worldwide

In order to facilitate ongoing research in bioinformatics, a number of resources are available to researchers. These tools can be broadly categorized as programming tools, databases and data analysis tools.

2.1 Programming Tools in Bioinformatics

The main activities in the Bioinformatics discipline consist of analyzing biological data, which is composed of the following sub-tasks:
– Alignment of DNA sequences for comparison
– Finding motifs within DNA sequences
– Genome assembly following sequencing
– Development of methods to predict the structure and/or function of newly discovered proteins and structural RNA sequences
– Clustering protein sequences into families of related sequences and the development of protein models
– Aligning similar proteins and generating phylogenetic trees to determine evolutionary relationships

Programming tools are software development supports that can be used to create bioinformatics tools. These programming tools need to deal with a huge amount of scattered and complex information (data/text) accurately, reliably, and effectively.
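To make the first sub-task concrete, the following is a minimal sketch of local sequence alignment using the Smith-Waterman dynamic programming recurrence (the algorithm discussed in Section 2.3.1). It is an illustration only: the function name `smith_waterman` and the scoring parameters (`match`, `mismatch`, `gap`, with a simple linear gap penalty) are assumptions chosen for readability, not the settings of any tool described in this paper.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (score, aligned_a, aligned_b) for one best local alignment.

    Scoring values are illustrative assumptions; production tools use
    substitution matrices and affine gap penalties instead.
    """
    rows, cols = len(a) + 1, len(b) + 1
    # score[i][j] = best score of a local alignment ending at a[i-1], b[j-1]
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = score[i - 1][j] + gap      # gap in b
            left = score[i][j - 1] + gap    # gap in a
            # The floor at 0 is what makes the alignment local.
            score[i][j] = max(0, diag, up, left)
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Trace back from the best cell until a zero score is reached.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and score[i][j] > 0:
        diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
        if score[i][j] == diag:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


if __name__ == "__main__":
    print(smith_waterman("TTTACGTAAA", "GGACGTCC"))
```

Because every cell is floored at zero, the sketch reports only the best-matching region of the two sequences rather than forcing an end-to-end alignment. Its running time and memory grow with the product of the two sequence lengths, which is one reason heuristic tools such as BLAST are preferred for database-scale searches.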
Some of the programming tools can be classified as follows:

BioJava (Biojava, 09): BioJava is an open source project that provides Java tools for processing biological data, including sequence manipulation features, dynamic programming, file parsers and simple statistical routines. It contains a collection of Java programs that represent and manipulate biological data and assist bioinformatics research. It was started at EBI/Sanger (European Bioinformatics Institute (EBI, 09)) in 1998 by Matthew Pocock and Thomas Down.

BioPerl (BioPerl, 09): BioPerl consists of Perl tools for bioinformatics and provides online resources for modules, scripts and web links for developers of Perl-based software. It has a bioinformatics toolkit for:
– format conversion
– report processing
– data manipulation
– sequence analyses
– batch processing

Biopython (Biopython, 09): Biopython is also an open source project, with goals very similar to those of BioPerl. Biopython is a set of freely available tools for biological computation written in Python. It is a distributed collaborative effort to develop Python libraries and applications which address the needs of current and future work in bioinformatics.

MATLAB Bioinformatics Toolbox: Toolboxes (e.g., bioinformatics) are comprehensive collections of MATLAB functions (M-files) that extend the MATLAB environment to solve particular classes of problems. The Bioinformatics Toolbox extends MATLAB to provide an integrated and extendable software environment for genome and proteome analysis. Together, MATLAB and the Bioinformatics Toolbox give scientists and engineers a set of computational tools to solve problems and build applications in drug discovery, genetic engineering and biological research.

R-language for Statistical Computing (R-project, 09): R is a free software environment for statistical computing and graphics. It compiles and runs on a wide variety of UNIX platforms, Windows and MacOS.
Its bioinformatics counterpart is Bioconductor (Bioconductor, 09). Bioconductor provides tools for the analysis and comprehension of genomic data. The broad goals of Bioconductor are to:
– provide access to a wide range of powerful statistical and graphical methods for the analysis of genomic data
– facilitate the integration of biological metadata in the analysis of experimental data, e.g. literature data from PubMed and annotation data from LocusLink
– allow the rapid development of extensible, scalable, and interoperable software
– promote high-quality and reproducible research
– provide training in computational and statistical methods for the analysis of genomic data.

2.2 Databases for Bioinformatics

There is a very large number of databases covering a wide range of scientific data available to researchers in bioinformatics (Zvelebil, 08). Data is highly duplicated across different databases. An important feature of many databases is that they store not only sequence data but also a lot of relevant non-sequence data, known as annotation, which can include links to related entries in other databases, interpretation of data and relevant research citations. In addition to simply providing information, some of the databases also provide web-based interfaces to programs for online analysis of their data.

A distinction is sometimes made between databases of primary data and those that contain secondary data derived from these primary sources. In some cases, the primary data include raw experimental results such as scans of gene-expression arrays and two-dimensional proteomic gels, but in many cases they include the initial experimental interpretation, e.g. nucleotide sequences. An example of a database containing primary data is SWISS-PROT for protein sequences.
Examples of secondary databases are those that contain collections of conserved protein motifs, or comparisons of multiple sequences that give measures of sequence similarity and relatedness and are only based on data existing at that time.

Databases can be categorized as follows:

Sequence databases

Nucleotide sequence related databases include major international collaborations such as GenBank (NCBI), the EMBL-EBI Nucleotide Sequence database (EBI, 09), and DDBJ (DNA Data Bank of Japan). In addition, there are resources that are more gene-specific, with information on introns, exons, and splice sites, as well as motifs and transcriptional regulators and sites. There are a number of different types of DNA sequences stored in these databases, differing in the way they have been obtained, and each type provides different biological information. They are:
– the raw genomic sequence, representing the sequence of chromosomal DNA, which is deposited in GenBank (produced at the National Center for Biotechnology Information (NCBI)) (NCBI, 09) and the organism-specific DNA sequence databases
– the cDNAs, which are the sequences of DNA molecules that have been synthesized by reverse transcription of mRNA molecules, indicating the range of genes being expressed in the sample used at the time of experimentation
– Expressed Sequence Tags (ESTs), which are partial cDNA sequences, also indicating the range of genes being expressed in the sample used at the time of experimentation.

Protein sequence databases include the major sequence databases such as UniProtKB (UniProt, 09) and the NCBI Protein Database (NCBI-Protein, 09), both being efforts to collect information on all protein sequences. These protein databases are often compiled from raw nucleotide sequence data. UniProtKB is produced by analysis of all translations of the EMBL database nucleotide sequences. It has two components, namely Swiss-Prot, which is manually annotated, and TrEMBL, which is only computer annotated.
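The translation step that turns a coding nucleotide sequence into a protein sequence can be sketched as follows. This is a minimal illustration, not the pipeline used to build any of the databases above: the function name `translate` and its `to_stop` parameter are hypothetical, only the standard genetic code is used, and reading starts at the first base of the input with no attempt to locate the correct reading frame.

```python
from itertools import product

# Standard genetic code, listed in TCAG order (first, second, third base).
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"   # codons starting with T
    "LLLLPPPPHHQQRRRR"   # codons starting with C
    "IIIMTTTTNNKKSSRR"   # codons starting with A
    "VVVVAAAADDEEGGGG"   # codons starting with G
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(cdna, to_stop=True):
    """Translate a cDNA (or mRNA) string into one-letter amino acid codes.

    Unknown codons (for example those containing N) become 'X'.
    If to_stop is True, translation ends at the first stop codon;
    otherwise stop codons are written as '*'.
    """
    seq = cdna.upper().replace("U", "T")
    protein = []
    # Trailing bases that do not form a complete codon are ignored.
    for i in range(0, len(seq) - 2, 3):
        aa = CODON_TABLE.get(seq[i:i + 3], "X")
        if aa == "*" and to_stop:
            break
        protein.append(aa)
    return "".join(protein)


if __name__ == "__main__":
    print(translate("ATGGCCTAA"))
```

A computationally annotated entry of the kind described above would additionally need to choose a reading frame and strand, which this sketch deliberately leaves out.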
In addition, a multitude of organism-specific or protein-family databases have been set up, allowing a more structured organisation of information. Examples include FlyBase (for Drosophila melanogaster), TAIR (for Arabidopsis thaliana), VectorBase, PlasmoDB (malaria), and the KEGG Pathway Database, which provides pathway maps based on known molecular interactions. Most of the databases also provide analysis tools for both DNA and proteins.

Microarray databases and Gene expression databases

Microarray databases are repositories of data from microarray experiments, often accompanied by data analysis and tools to visualize the raw image. Gene expression databases also contain expression data collected by other experimental methods such as SAGE (Serial Analysis of Gene Expression) and EST sequencing. The databases contain expression data and often extensive annotation, as well as visualization techniques and numerical and statistical analysis programs. One such database is the Stanford Microarray Database (SMD), which includes data from more than 7000 microarray experiments. ArrayExpress (ArrayExpress, 09) is another repository for microarray data, which additionally includes the ArrayExpress Data Warehouse that stores gene-indexed expression profiles from a curated subset of experiments from the database.

Protein interaction databases

Proteins have to interact with other molecules, including other proteins, to carry out their functions. The protein interaction databases provide an understanding of the functions of the proteins and help in building up biological networks that can be used in systems biology.
There are a number of such databases, namely:
– the Database of Interacting Proteins (DIP) (DIP, 09), which contains information only on protein-protein interactions
– the Molecular INTeraction database (MINT) (MINT, 09), which contains additional information on protein, nucleic acid, and lipid interactions
– the Biomolecular Interaction Network Database (BIND) (BIND, 09), which describes interactions at the atomic level for protein, DNA, and RNA
– the protein Signaling, Transcriptional Interaction and Inflammation Networks Gateway (pSTIING) (pSTIING, 09), which is a web-based application as well as an interaction database for protein-protein and other protein interactions, as well as transcriptional associations
– the Munich Information Center for Protein Sequences (MIPS), which hosts a comprehensive, manually curated database of mammalian protein-protein interactions
– Proteome (Proteome, 10), which is a useful reference for a list of protein interaction databases.

Structural databases

Structural databases include those containing information on the structure of small molecules, carbohydrates, nucleic acids (DNA, RNA), and proteins. These are the results obtained using various experimental techniques, such as X-ray crystallography or Nuclear Magnetic Resonance (NMR). The most common structural databases are the Research Collaboratory for Structural Bioinformatics Protein Data Bank (RCSB PDB) (rcsb, 09) and the Macromolecular Structure Database (MSD) (MSD, 09) at EBI. CATH is a classification of protein structural domains. SCOP, the Structural Classification of Proteins, provides detailed information on folds, superfamilies and families with the aim of being able to reconstruct structural and evolutionary relationships among proteins.

2.3 Data analysis tools

A number of organisations which host databases for bioinformatics applications also provide data analysis tools. The two main ones are the EBI and the NCBI toolboxes.
2.3.1 Toolbox at EBI

The European Bioinformatics Institute (EBI) (EBITools, 09) provides a comprehensive range of tools for the field of bioinformatics. These are subdivided into the following categories:

Homology and Similarity Programs

BLAST (Basic Local Alignment Search Tool) enables a researcher to compare a query sequence (protein or nucleotide) with a database of sequences, and to identify sequences that resemble the query sequence above a certain threshold. The Smith-Waterman algorithm is used for performing local sequence alignment; that is, for determining similar regions between two protein sequences. Instead of looking at the total sequence, the Smith-Waterman algorithm compares segments of all possible lengths and optimizes the similarity measure.

Protein Functional Analysis

EBI provides protein functional analysis through InterPro and the InterProScan tool (InterProScan, 09). InterPro is an integrated database of predictive protein "signatures" used for the classification and automatic annotation of proteins and genomes. It classifies sequences at superfamily, family and subfamily levels, predicting the occurrence of functional domains, repeats and important sites. It adds in-depth annotation, including GO (Gene Ontology) terms, to the protein signatures. The InterProScan tool allows users to query a protein sequence against InterPro and to search InterPro by accession number or sequence. It can be used to search for protein repeats, motifs, biochemical function and family.

Structural Analysis

The determination of a protein's 2D/3D structure is crucial in the study of its functions. EBI provides a set of tools for protein structure analysis and secondary structure prediction. Some of them are:
– DaliLite: This program is used for pairwise structure comparison, i.e. it compares the given structure (first structure) to a reference structure (second structure).
– EMSearch: This is a search tool for electron microscopy depositions.
– MaxSprout: Allows for the reconstruction of 3D coordinates from a C-alpha trace.
– PQS and PQS-Quick: These tools are used to search for Protein Quaternary Structure.

Sequence Analysis

Sequence analysis encompasses the use of various bioinformatics methods to determine the biological function and/or structure of genes and the proteins they code for. Unknown structure and function can be elucidated through comparison with databases of known structures, sequences and functions. EBI provides a number of tools for sequence analysis, some of which are:
– ClustalW is a general-purpose Multiple Sequence Alignment tool for nucleotides or proteins. It produces biologically meaningful multiple sequence alignments of divergent sequences. It calculates the best match for the selected sequences, and lines them up so that the identities, similarities and differences can be seen. Evolutionary relationships can be seen by viewing cladograms or phylograms.
– EMBOSS-Align contains two programs, each using a different algorithm. For an alignment that covers the whole length of both sequences, the Needle program (based on the Needleman-Wunsch algorithm (Needleman, 70)) is used. In order to find the best region of similarity between two sequences, the Water program (based on the Smith-Waterman algorithm (Waterman, 76)) is used.

There are also a number of gene finding tools and translation tools.

2.3.2 Tools at NCBI

The NCBI (NCBITools, 09) provides a comprehensive range of tools for the field of bioinformatics, which can be categorized as follows:

Nucleotide Sequence Analysis

The nucleotide sequence analysis tools at the NCBI can be summarised as follows:
– BLAST, used for comparing gene and protein sequences against others in public databases, comes in several forms including PSI-BLAST, PHI-BLAST, and BLAST 2 sequences. Specialized BLASTs are also available for human, microbial, malaria, and other genomes, as well as for vector contamination, immunoglobulins, and tentative human consensus sequences.
– Electronic PCR allows a user to search a query DNA sequence for sequence tagged sites (STSs) that have been used as landmarks in various types of genomic maps. It compares the query sequence against data in NCBI's UniSTS, which is a unified, non-redundant view of STSs from a wide range of sources.
– Model Maker allows a user to view the sequences (mRNAs, ESTs, and gene predictions) that were aligned to assembled genomic sequence to build a gene model. It is then possible to edit the model by selecting or removing putative exons. The mRNA sequence and potential ORFs for the edited model can be viewed and the mRNA sequence data saved for use in other programs. Model Maker is accessible from sequence maps that were analyzed at NCBI and displayed in Map Viewer.
– ORF Finder identifies all possible ORFs in a DNA sequence by locating the standard and alternative stop and start codons. The deduced amino acid sequences can then be used to BLAST against GenBank.

Protein Sequence Analysis and Proteomics

BLAST programs are also available for comparing protein sequences.
– Blink ("BLAST Link") displays the results of BLAST searches that have been carried out for every protein sequence in the Entrez Proteins data domain.
– CDART takes a given protein query sequence, displays the functional domains that make up the protein, and lists proteins with similar domain architectures.
– TaxPlot is a tool for 3-way comparisons of genomes on the basis of the protein sequences they encode. In TaxPlot, one selects a reference genome to which two other genomes are compared. Pre-computed BLAST results are then used to plot a point for each predicted protein in the reference genome, based on the best alignment with proteins in each of the two genomes being compared.

Structural Analysis

Cn3D is a helper application for web browsers that allows a user to view 3-dimensional structures from NCBI's Entrez retrieval service. VAST Search is NCBI's structure-structure similarity search service.
It compares 3D coordinates of a newly determined protein structure to those in the MMDB/PDB (Molecular Modeling Database/Protein Data Bank) database.

Genome Analysis

Entrez Genomes hosts whole genomes of over 1000 organisms. The genomes represent both completely sequenced organisms and those for which sequencing is in progress. All three main domains of life (bacteria, archaea, and eukaryota) are represented, as well as many viruses, phages, viroids, plasmids, and organelles. Entrez Genomes provides graphical overviews of complete genomes/chromosomes and the ability to explore regions of interest in progressively greater detail.

Clusters of Orthologous Groups (COGs), a system of gene families, were delineated by comparing protein sequences encoded in 43 complete genomes, representing 30 major phylogenetic lineages. Each COG consists of individual proteins or groups of paralogs from at least 3 lineages and thus corresponds to an ancient conserved domain.

Gene Expression

Gene Expression Omnibus (GEO) provides several tools to assist with the visualization and exploration of GEO data. Datasets may be viewed as hierarchical cluster heat maps, providing insight into the relationships between samples and co-regulated genes.

SAGEmap provides a tool for performing statistical tests designed specifically for differential-type analyses of Serial Analysis of Gene Expression (SAGE) data. The data include SAGE libraries generated by individual labs as well as those generated by the Cancer Genome Anatomy Project (CGAP), which have been submitted to GEO.

The Cancer Genome Anatomy Project (CGAP) aims to decipher the molecular anatomy of cancer cells. CGAP develops profiles of cancer cells by comparing gene expression in normal, precancerous, and malignant cells from a wide variety of tissues.

2.3.3 Swiss Institute of Bioinformatics

The Expert Protein Analysis System (ExPASy) is a proteomics server of the SIB that hosts a variety of proteomics tools along with a structural viewer.
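To illustrate the kind of sequence-derived properties such proteomics tools report, the sketch below computes the amino acid composition and an approximate average molecular weight of a protein from its one-letter sequence. It is not the ExPASy implementation: the function names are hypothetical, the residue masses are commonly tabulated average values in daltons (treated here as assumed reference data), and post-translational modifications are ignored.

```python
from collections import Counter

# Average residue masses in daltons (amino acid minus one water).
# Assumed reference values, rounded to four decimal places.
AVERAGE_RESIDUE_MASS = {
    "G": 57.0519, "A": 71.0788, "S": 87.0782, "P": 97.1167, "V": 99.1326,
    "T": 101.1051, "C": 103.1388, "L": 113.1594, "I": 113.1594, "N": 114.1038,
    "D": 115.0886, "Q": 128.1307, "K": 128.1741, "E": 129.1155, "M": 131.1926,
    "H": 137.1411, "F": 147.1766, "R": 156.1875, "Y": 163.1760, "W": 186.2132,
}
WATER_MASS = 18.01528  # one water is added back for the free chain termini


def composition(protein):
    """Return the fraction of each amino acid present in the sequence."""
    if not protein:
        raise ValueError("empty protein sequence")
    counts = Counter(protein.upper())
    total = sum(counts.values())
    return {aa: n / total for aa, n in counts.items()}


def molecular_weight(protein):
    """Return the approximate average molecular weight in daltons.

    Raises KeyError for letters outside the 20 standard amino acids.
    """
    if not protein:
        raise ValueError("empty protein sequence")
    return sum(AVERAGE_RESIDUE_MASS[aa] for aa in protein.upper()) + WATER_MASS


if __name__ == "__main__":
    peptide = "MKTAYIAK"  # arbitrary example peptide
    print(composition(peptide))
    print(round(molecular_weight(peptide), 2))
```

The isoelectric point is omitted because it requires a set of pKa values and an iterative charge calculation, and results differ depending on the pKa table chosen.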
Protein identification and characterisation can be performed using a number of different tools that distinguish the different molecular properties of proteins such as isoelectric point, molecular weight and amino acid composition. Similarity search as well as pattern and profile search are also available. Also provided are ViralZone and HAMAP for microbial proteomes.

3. Initiatives in other parts of the world

The opportunities presented by biotechnology and bioinformatics have motivated the setting up of research nodes by most nations throughout the world. Many associations share resources and make up regional networks. Three such associations are EMBnet (European Molecular Biology Network), RIBioNet (Latin America) and APBioNet (Asia-Pacific).

3.1 EMBnet

EMBnet (EMBnet, 09) is a science-based group of collaborating nodes throughout Europe and a number of nodes outside Europe. The combined expertise of the nodes allows EMBnet to provide the European molecular biology community with services that encompass more than can be provided by a single node. The EMBnet website gives an overview of the organization and of its members. It provides visitors with news of the EMBnet community and new links related to bioinformatics. It also combines the services available on the nodes and publishes EMBnet.news, an electronic newsletter devoted to providing information about what is happening at the national and special nodes.

Since its creation in 1988, EMBnet has evolved from an informal network of individuals in charge of maintaining biological databases into the only organization worldwide bringing bioinformatics professionals together to serve the expanding fields of genetics and molecular biology. Although composed predominantly of academic nodes, EMBnet gains an important added dimension from its industrial members. The success of EMBnet has attracted an increasing number of organizations outside Europe to join the group.
EMBnet has a tried-and-tested infrastructure to organise training courses, give technical help and help its members to interact effectively and respond to the rapidly changing needs of biological research in a way no single institute is able to do. In 2005 the organization created additional types of node to allow more than one member per country, and the "associated node" was born.

The following are some of the main achievements of EMBnet:
– Development of the first complete e-learning system for teaching Bioinformatics (EMBER)
– EMBnet's commitment to society is reflected in its active involvement in dealing with relevant problems and diseases (AntiSARS, RBMDB, p53FamTaG)
– EMBnet has pioneered the use of Grid technologies in the Biosciences and has been involved in seminal Grid projects (SWEGrid, EGEE, EMBRACE, HealthGrid, WebServices)
– From very early on, EMBnet has promoted the development of distributed computing service initiatives to share workload among international servers (HASSLE, SRSfed, MRSfed, FedBLAST, SIMDAT)
– EMBnet is committed to bringing the latest software algorithms to the user, free of charge (EGCG, Pratt, BITS, HoxPred), and continues to develop state-of-the-art public software (EMBOSS) and powerful, easy-to-use, intuitive interfaces (CINEMA, W2H, GeneDoc, WWW2GCG, TOPS, Jalview, wEMBOSS, Jemboss, STACKpack, EMBOSSrunner, eBiotools, WebLab, UTOPIA)
– EMBnet has made major contributions to supercomputing in the Life Sciences as a means to deliver more powerful and advanced services (Bioccelerator, MPSRCH, INSECTS+MOLLUSCS)
– EMBnet has contributed to the development and maintenance of advanced database systems for the Life Sciences (SRS, Bioimage, CpGisle, CLEANUP, Webin & Seqin, GQserv, PRINTS, InterPro, STACKdb, UniProt, NyHITS, ENSEMBL, MitoDrome, YeastBASE, MRS, MitoRes)
– EMBnet was the first to come up with advanced solutions for automated database distribution using the Internet (NDT, SynCron)
– The Ping project was for a long time the only existing project giving continuous information about network efficiency across the whole of Europe
– EMBnet had the first gopher and World Wide Web servers in biology (CSC BioBox).

3.2 APBioNet

The Asia Pacific Bioinformatics Network (APBioNet) (APBioNet, 09) is a non-profit, non-governmental, international organization. It focuses on the promotion of bioinformatics in the Asia Pacific region. Since 1998, its mission has been to pioneer the growth and development of bioinformatics awareness, training, education, infrastructure, resources and research amongst member countries and economies. Its work includes technical coordination and liaison with other international bodies such as EMBnet. APBioNet has more than 20 organizational and 300 individual members from over 12 countries in the region, and its members come from industry, academia, research, government, investors and international organisations. APBioNet has coordinated or co-organised more than 20 international and national meetings in cooperation with members in different economies. It is spearheading a number of key bioinformatics initiatives in the region in collaboration with international organisations such as APAN, APEC, S* Alliance and A-IMBN.

4. The African initiatives

Africa is set to take up the challenge of bringing solutions to its major problems of health and food through the applications of biotechnology and bioinformatics. Capacity building in this area will ensure that scientists have the right tools to address the research issues relevant to the continent. South Africa dominates the scene with well-established bioinformatics centres, and various universities there have engaged in this direction in order to ensure manpower training. The research output is evidence of the high level of activities that are currently ongoing.
Both Malawi and Zambia have many projects in the health and agricultural sectors that are molecular biology based and therefore need bioinformatics to make good progress. East Africa has several institutions engaged in the utilisation of bioinformatics applications. ILRI, the International Livestock Research Institute in Nairobi, has a state-of-the-art centre where several pathogen genomes have been sequenced. KEMRI, the Kenya Medical Research Institute, is also involved in the application of bioinformatics tools in malaria research. This institute has longstanding support from the Wellcome Trust in the UK, which coordinates major sequencing projects at the Sanger Institute, Cambridge, UK. North Africa is also active in developing bioinformatics; the Pasteur Institute in Tunis has close collaborations with French research centres while working on local problems. Similarly, West Africa runs several health-related projects where bioinformatics tools are widely applied.

The New Partnership for Africa's Development (NEPAD), with the objective of stimulating Africa's development by bridging existing gaps in priority sectors, has identified that the future of Africa lies in the development of Science & Technology. In this respect, in 2003, it adopted an outline of an action plan containing a number of flagship programme areas. It has been recognized that investment in Biosciences can help Africa to ensure food security and better health for its population. Flagship programmes related to biosciences have been clustered to form the Bioscience initiative, which has created four regional networks in the continent. These are:
1. Biosciences Eastern and Central Africa Network (BecA Net).
2. Southern Africa Biosciences Network (SANBio).
3. West Africa Biosciences Network (WAB Net).
4. North Africa Biosciences Network (NAB Net).

Each of these networks consists of a hub and a number of nodes that work towards the development of biosciences, including bioinformatics, in the respective region.
These networks provide coordination and financial support for the nodes for capacity building and development of research projects.

BecA Net has drawn up a 4-year business plan for achieving its objectives. The BecA Hub has a number of service units, one of which is for bioinformatics. The BecA Hub has a bioinformatics platform, hosted on a High Performance Computer (HPC) platform located on the BecA campus in Nairobi, which provides advanced computational capabilities in bioinformatics to all BecA Hub scientists to:
• Uncover the wealth of biological information hidden in the mass of DNA sequences, structure, literature and other biological data
• Obtain a clearer insight into the fundamental biology of organisms
• Use this information to enhance the standard of life for mankind.

The Southern Africa Biosciences Network (SANBio) is to cater for the development of biosciences and related areas in 12 countries of the Southern African region, including Mauritius. The strategic objectives of SANBio are to:
• Address Southern African problems in agriculture, health, and environment through the application of bioscience technologies
• Use new developments in biosciences to protect the environment and conserve biodiversity in Southern Africa
• Build and strengthen human capacity in biosciences in Southern Africa
• Promote access to affordable, world-class research facilities within Southern Africa
• Harness indigenous knowledge and technology of the Southern African people for sustainable utilization of natural resources and wealth generation.

Due to its ability to enhance research and development in Biosciences, Bioinformatics can play an important role in supporting the objectives of SANBio. It is acknowledged that in Biosciences in general, including Bioinformatics, capacity building is an important stepping stone. SANBio has recently launched a capacity building project for the training of scientists in the region in the various applications of bioinformatics.
The aim is to equip university academics and researchers with the skills to teach and implement activities in this field. Several collaborations have been set up for this purpose with the European Molecular Biology Network and with ILRI. The University of Mauritius has been selected as the SANBio regional node for Bioinformatics capacity building (Jauferally-Fakim et al., 09).

5. Legal and ethical issues

The prospects of Bioinformatics have aroused a lot of interest and enthusiasm in the research community and the public at large. In the agricultural industry many plants have already been genetically modified (Windley, 08) to produce fruits which are resistant to pests, cold and other adverse conditions. Many benefits of genetically modified plants have been reported (Wolfenbarger and Phifer, 00), such as reduced environmental impacts from pesticides, easier soil conservation, increased yield and phytoremediation (remediation of polluted soils, sediments, surface waters, and aquifers).

Research in bioinformatics and genetic engineering is also being carried out on human cells to find more effective cures. Bernard Moss reported in 1996 (Moss, 96) that the vaccinia virus, no longer required for immunization against smallpox, now serves as a unique vector for expressing genes within the cytoplasm of mammalian cells. As a research tool, recombinant vaccinia viruses are used to synthesize and analyze the structure-function relationships of proteins, to determine the targets of humoral and cell-mediated immunity, and to investigate the types of immune response needed for protection against specific infectious diseases and cancer. The vaccine potential of recombinant vaccinia virus has been realized in the form of an effective oral wild-life rabies vaccine, although no product for humans has been licensed.
A genetically altered vaccinia virus that is unable to replicate in mammalian cells and produces diminished cytopathic effects retains the capacity for high-level gene expression and immunogenicity while promising exceptional safety for laboratory workers and potential vaccine recipients. Rosenberg et al. (Rosenberg, 06) reported cancer regression in patients after the transfer of genetically engineered lymphocytes. The search for the genes responsible for particular diseases is a very common research topic. Some of the causes have already been identified, and simple tests can now determine who is prone to which disease.

One can definitely appreciate the benefits that biotechnology and bioinformatics offer the health sector, and their potential as a solution to the food crisis. However, researchers have raised several concerns over the safety of genetically modified foods, in particular about the effects of interfering with the DNA of these crops. What happens to the crops? What happens to the animals and the humans who eat them? Are these plants a problem now? Will they be a problem in the future? Can the bacteria and viruses used to alter the DNA in these plants also affect the bacteria in our body? These issues offer avenues for further research.

With the progress of the Human Genome Project, it will soon be possible to identify the genes that cause different diseases. Simple tests can determine whether a person is prone to certain diseases or has a high risk of developing certain severe diseases. This raises several ethical issues about how such information can be used. Can a parent decide to abort a child that may be at risk? Can insurance companies decide not to insure a person with a high risk? Can a company decide to reject a job application on the same basis? Will one want to check his/her partner’s genetic information before getting into a relationship?
Béatrice Godard and her co-authors (Godard et al, 03) examine the professional and scientific views on the social, ethical and legal issues that affect the use of genetic information and testing in insurance and employment in Europe. For this purpose, many aspects were considered, such as the concerns of medical geneticists, of insurers and employers, and of the public, as well as the regulatory frameworks and unresolved issues. The work was based on debates among 47 experts from 14 European countries invited to an international workshop organized by the European Society of Human Genetics Public and Professional Policy Committee in Manchester, UK, 25–27 February 2000. The results stress the need for clear definitions of the terms used in genetics, for declaring the grounds on which genetic information is or is not used, and for promoting confidence between the public and the insurance industry. In Europe, there is currently very little use of genetic information in relation to employment, but the situation should be kept under review.

6. Prospects of Bioinformatics in Mauritius

Two of the areas impacted by bioinformatics that are of high relevance to Mauritius are healthcare applications and food security. However, the first line of action should target education at tertiary level. Bioinformatics has been introduced into existing programmes at the University of Mauritius (UoM), but it is crucial that additional resources be allocated for implementing programmes and initiating research in this area.

6.1 Healthcare Applications

Traditional drug discovery has relied on the isolation or synthesis of molecules whose activities are then screened through a lengthy and costly process. Pharmacokinetic properties and toxicity have to be determined. This is being replaced by a molecular targeting approach in which compounds are screened in silico for their ability to bind to proteins and modify their function. This has become possible through improved knowledge of the molecular basis of diseases.
Most large pharmaceutical firms are already applying this technology. Drug targets can be validated through their 3-D structure using proteomics tools. Molecular epidemiology of infectious diseases relies on knowledge of the genetic variability of pathogens in order to devise adequate control measures. Bacterial and protozoan genomes have become available over the past years, and their sequences can be compared with appropriate comparison tools. These methods are promising for vaccine development as well as for finding new antibiotics. In silico vaccinology allows the identification of appropriate binding molecules to antigenic epitopes that will enhance an immune response in the vaccinated individual.

6.2 Food Security

Food production relies on a limited number of plant varieties which are bred for optimal yield and agronomic characters. Major crops, like rice, have already been sequenced, while other cereals’ genomes are in the pipeline. It is expected that genome sequences of crops will help improve the quality of food products and ensure adequate production in the future. Bioinformatics is promising for finding useful genes and mapping them on the genomes of both plants and animals. DNA sequence data as well as gene expression patterns are promising means of finding ways to deal with insect vectors as well as disease-causing organisms. More effective vaccines are being designed in this way.

6.3 Opportunities for Mauritius

The ICT sector has been identified as one of the important pillars of the Mauritian economy, and software development is to play an important role in it. This activity can be extended to include bioinformatics software development. Mauritius can participate actively in developing software for data mining, simulation and visualization. With the advent of Next Generation Sequencing there will be a high demand for trained manpower to work with applications in genome assembly and annotation. Mauritius can take advantage of such prospects in outsourcing.
However, to seize these opportunities, Mauritius will need to invest in the resources required to support bioinformatics activity. These include the development of the required human resources and of high performance computing facilities to support the development of databases and computing tools. Equally important, there is an urgent need to invest in research facilities to carry out studies in the fields of genomics and proteomics. Mauritius has a high degree of endemicity, with unique terrestrial and marine species. The country can derive substantial economic prospects from studying the genomics of these species, in particular those with medicinal properties. A database of genomic information about these species would be extremely valuable. The population of Mauritius comes from different origins, thus providing unique opportunities for understanding the effects of genotypes on diseases. This offers interesting prospects from the genomic perspective. Recent epidemics of both human and animal diseases in the region have resulted in severe setbacks to the economy, thus emphasizing the need to strengthen research in the molecular epidemiology of pathogens.

6.4 Bioinformatics at the UoM

In order to support the above-mentioned development, academic institutions need to take the lead in driving research and capacity building in the area. The University of Mauritius, conscious of its important role in this development, has been proactive in initiating appropriate steps. Researchers from the Faculties of Science, Agriculture and Engineering have joined efforts to embark on research in the field of bioinformatics. Among other initiatives, a Bioinformatics Computing Research Group was set up in 2006. Additionally, an increasing number of programmes related to bioinformatics, or with bioinformatics components, are being offered at both undergraduate and postgraduate levels in the different faculties of the University of Mauritius.
New programmes with a greater emphasis on bioinformatics are in the pipeline. Recently, the SANBio (SANBio, 09) Steering Committee approved the designation of the University of Mauritius as a SANBio Node for capacity building in bioinformatics. Among other activities, the University of Mauritius, through the Faculty of Agriculture, will coordinate the implementation of training programmes in bioinformatics in the SADC region under the auspices of NEPAD. Under this initiative, a computer laboratory (equipped with the necessary hardware and software), sponsored by SANBio, is being set up at the University of Mauritius to support the capacity building.

7. Recommendations and Conclusion

Bioinformatics is relevant to many fields of life, namely:
• Basic science, for understanding living systems at the molecular level.
• Medicine, more specifically for clinical informatics.
• Agriculture and fisheries, so as to improve yield and disease resistance.
• Environment, so as to better understand the biosphere and support biological spill clean-up.

In Mauritius, a number of institutions are concerned with bioinformatics research due to the nature of their activities. Among others, we have the Mauritius Sugar Industry Research Institute (MSIRI), the Mauritius Oceanographic Institute (MOI), the Food and Agricultural Research Council (FARC), the Ministry of Agriculture, the Ministry of Health and academic institutions such as the University of Mauritius. Development of bioinformatics at the national level requires coordination and collaboration among these institutions. Bioinformatics involves large amounts of data and intensive processing power. In order to support research in this area, there is a need to increase resources for information infrastructure and build the appropriate computing environment.
Extensive training programmes in the field, including hands-on practice with the above-mentioned tools, can kick-start research in the area of bioinformatics, and the University of Mauritius can play a key role in this respect. Mauritius should aim at building the necessary infrastructure to maintain bioinformatics databases for storing and archiving local data. Such databases should be highly protected against piracy and unethical use, so access to the data should be properly controlled. However, overprotection may stifle useful research. Currently Mauritius is equipped only with the Data Protection Act 2004. More research should be conducted to fine-tune the legal aspects of data protection and use.

The field of bioinformatics presents a number of interesting challenges and opportunities for biologists, computer scientists, information scientists and bioinformaticians. These challenges sit at the intersection of biology and information. Ideally, larger scale work in this broad area involves a partnership between those with expertise in relevant foundational domains (e.g. computer scientists) and application domains (e.g. biologists), with bioinformaticians serving as a bridge. The potential benefits of addressing some of the above-mentioned challenges are numerous, both in terms of improving our general understanding of how biological systems work and in terms of applying the knowledge to help improve health and treat diseases. Above all, bioinformatics has brought together researchers, organisations and institutions from different areas with the aim of strengthening collaborative output in scientific discovery.
References

[APBioNet, 09] APBioNet Homepage, http://www.APBionet.org/, accessed on 17 Dec 2009
[ArrayExpress, 09] ArrayExpress Homepage, http://www.ebi.ac.uk/microarray-as/ae/, accessed on 16 Dec 2009
[BIND, 09] Biomolecular Interaction Database Homepage, http://www.ncbi.nlm.nih.gov/pubmed/11125103, accessed on 16 Dec 2009
[Bioconductor, 09] Bioconductor Homepage, http://www.bioconductor.org, accessed on 16 Dec 2009
[Biojava, 09] Biojava Homepage, http://www.biojava.org, accessed on 16 Dec 2009
[BioPerl, 09] BioPerl Homepage, http://www.bioperl.org, accessed on 16 Dec 2009
[Biopython, 09] BioPython Wiki, http://biopython.org/wiki/Main_Page, accessed on 16 Dec 2009
[DIP, 09] Database of Interacting Proteins Homepage, http://dip.doe-mbi.ucla.edu, accessed on 16 Dec 2009
[EBI, 09] EBI Homepage, http://www.ebi.ac.uk, accessed on 16 Dec 2009
[EBI, 09] EMBL-EBI Homepage, http://www.ebi.ac.uk/embl/, accessed on 16 Dec 2009
[EBITools, 09] EBI Tools Homepage, http://www.ebi.ac.uk/Tools/, accessed on 16 Dec 2009
[EMBnet, 09] EMBnet Homepage, http://www.Embnet.org/, accessed on 17 Dec 2009
[Godard, 03] Godard Béatrice, Raeburn Sandy, Pembrey Marcus, Bobrow Martin, Farndon Peter and Aymé Ségolène, “Genetic information and testing in insurance and employment: technical, social and ethical issues”, European Journal of Human Genetics (2003) 11, Suppl 2, S123–S142
[InterProScan, 09] InterProScan Sequence Search, http://www.ebi.ac.uk/Tools/InterProScan/, accessed on 16 Dec 2009
[Jauferally-Fakim, 09] Jauferally-Fakim Y., Puchooa D., Mumba L., “Status of Bioinformatics in Southern Africa: Challenges and Opportunities”, EMBnet.news, Vol. 15, No. 3, October 2009
[MINT, 09] Molecular INTeraction Database Homepage, http://mint.bio.uniroma2.it/mint/Welcome.do, accessed on 16 Dec 2009
[Moss, 96] Moss Bernard, 1996, “Genetically engineered poxviruses for recombinant gene expression, vaccination, and safety”, Proc. Natl. Acad. Sci. USA, Vol. 93, pp. 11341-11348, October 1996
[MSD, 09] Macromolecular Structure Database Home Page, http://www.ebi.ac.uk/msd/, accessed on 16 Dec 2009
[NCBI, 09] NCBI Homepage, http://www.ncbi.nlm.nih.gov, accessed on 16 Dec 2009
[NCBI-Protein, 09] NCBI Protein Database Homepage, http://www.ncbi.nlm.nih.gov/protein/, accessed on 16 Dec 2009
[NCBITools, 09] NCBI Tools, http://www.ncbi.nlm.nih.gov/Tools/index.html, accessed on 16 Dec 2009
[Needleman, 70] Needleman, S. B. & Wunsch, C. D. (1970). Journal of Molecular Biology, 48, 443-453.
[Proteome, 10] Proteome Homepage, http://proteome.wayne.edu/PIDBL.html, accessed on 11 Jan 2010
[pSTIING, 09] protein Signaling, Transcriptional Interaction and Inflammation Networks Gateway, http://pstiing.licr.org, accessed on 16 Dec 2009
[Rcsb, 09] Structural Bioinformatics Protein Databank Homepage, http://www.rcsb.org/pdb/home/home.do, accessed on 16 Dec 2009
[Rosenberg, 06] Rosenberg S. A., Morgan R. A., Dudley M. E., Wunderlich J. R., Hughes M. S., Yang J. C., Sherry R. M., Royal R. E., Topalian S. L., Kammula U. S., Restifo N. P., Zhili Zheng, Azam N., Christiaan R. de Vries, Linda J. Rogers-Freezer, Sharon A. M., 2006, “Cancer Regression in Patients After Transfer of Genetically Engineered Lymphocytes”, Science, 6 October 2006, Vol. 314, no. 5796, pp. 126-129
[R-project, 09] R-Project Homepage, http://www.r-project.org/, accessed on 16 Dec 2009
[SANBio, 06] “Southern African Network For Biosciences (SANBio) Business Plan 2006-2011”, Prepared by SANBio Secretariat, c/o CSIR, Box 395, Pretoria 0001, Republic of South Africa, April 2006
[SANBio, 09] SANBio Home, http://www.san-bio.com/, accessed on 16 Dec 2009
[Swiss-Prot, 09] Swiss-Prot Homepage, http://www.expasy.ch/sprot/, accessed on 16 Dec 2009
[UniProt, 09] UniProt Homepage, http://www.uniprot.org, accessed on 16 Dec 2009
[Waterman, 76] Waterman, M. S., Smith, T. F. & Beyer, W. A. (1976). Advances in Mathematics, 20, 367-387.
[Windley, 08] Windley Steve, 2008, “Genetically Modified Foods”, PureHealthMD.com, Pure Health Corporation, Fort Wayne IN, USA, 2008
[Wolfenbarger, 00] Wolfenbarger L. L., and Phifer P. R., 2000, “The Ecological Risks and Benefits of Genetically Engineered Plants”, Science, 15 December 2000, Vol. 290, no. 5499, pp. 2088-2093
[Zvelebil, 08] Zvelebil M., Baum J.O., “Understanding Bioinformatics”, Garland Science, ISBN 0-8153-4024-9, 2008