THE CBS MAP
CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS

ABOUT CBS

The Center for Biological Sequence Analysis (CBS) at the Technical University of Denmark was established in 1993 by the Danish National Research Foundation in response to the growing importance of the field of bioinformatics. After more than ten years, CBS has become an internationally recognized research center and a resource for computational analysis of a wide array of biological data. The center profile now incorporates many novel aspects from the general area of systems biology. CBS is a highly multi-disciplinary group consisting of approximately 65 employees, making it one of the larger European groups within academic bioinformatics and systems biology.

In a recent review of the center’s production over a ten-year period, the reviewers were impressed by the standard of research, with a strong production of high-impact publications, several significant textbooks, and highly popular WWW services. The quality and quantity of this output also reflect the originality and the creative atmosphere within the center. Overall, CBS has gained a reputation in the field of bioinformatics and systems biology as a highly dynamic and collaborative group, establishing itself among the internationally leading groups. The research groups in the center participate in more than 15 projects funded by the EU and the NIH.

While the first wave of computational research at CBS focused on sequence analysis, where many highly important unsolved problems still remain, the current and future needs will concern sophisticated, large-scale integration of extremely diverse sets of data: gene sequences and their control regions, gene expression profiles, protein-protein interaction networks, temporal knowledge on protein complex formation, and signaling cascades, to mention but a few.
Integrative approaches will form the basis for advanced, quantitative and qualitative types of systems biology, where simulation and modeling will be key for the understanding of the complex dynamics of entire cells, organs or organisms.

FACTS ABOUT CBS

PUBLICATION PROFILE: CBS has a strong publication profile with more than 300 peer-reviewed papers, many in high-impact journals and many with high citation levels. Since 2003, the average annual publication rate has been 40 papers in peer-reviewed journals. In addition to scientific papers, the CBS staff has authored or co-authored seven textbooks and edited four proceedings. A 2005 publication highlight was the Science paper entitled “Dynamic protein complex formation during the cell cycle,” authored by U. de Lichtenberg, L. J. Jensen, S. Brunak and P. Bork, a joint work with EMBL in Heidelberg.

CITATION PROFILE: The most cited CBS publication has more than 3,000 citations: “Identification of prokaryotic and eukaryotic signal peptides and prediction of their cleavage sites,” H. Nielsen, J. Engelbrecht, S. Brunak and G. von Heijne, Protein Eng., 10, 1-6, 1997, describing a method for prediction of signal peptides in prokaryotic and eukaryotic proteins. This paper was included in the ISI Red Hot List for 1997. Fifteen other papers have more than 100 citations, with eight of these having between 200 and 2,000 citations. Two of these have been on the ISI Red Hot List. Except for a single year, CBS has in each year of the 1997-2004 period produced a paper that is among the ten most cited of the approximately 20,000 papers co-authored by Danish scientists each year.

ONLINE SERVICES: CBS has established a highly popular service component, with 35 different web servers serving 1,500,000 pageviews/month. The Institute for Scientific Information has ranked the CBS web site as one of the most useful within the field. In addition to web-based services, software packages for many of the methods have been installed at hundreds of other academic and industrial sites world-wide.

TEACHING: Strong, innovative teaching component with ten highly popular courses involving combinations of lectures and web-based hands-on exercises at both PhD and MSc levels. Since 2003, CBS has been responsible for the international MSc programme in bioinformatics at DTU, and in collaboration with other research groups at DTU, the center also coordinates an international MSc programme in systems biology. From 2005, CBS also offers internet-transmitted courses combining live, real-time transmitted lectures, web-based exercises, discussion fora and chat lines maintained by CBS staff.

INDUSTRY: High level of industrial collaboration, which, in addition to conventional joint industry-university projects, includes three industrial bioinformatics satellites established within the center framework. CBS also hosts BioSys, a high-technology network covering bioinformatics and systems biology (www.biosys.dk). The network consists of seven academic partners and 16 biotech and medico companies. The purpose of the network is to further the collaboration between academia and industry by creating an environment where network partners can meet, develop and exchange knowledge and ideas.

EXTERNAL FUNDING: The center staff has attracted considerable external funding. Funding sources include the EU, NIH, Nordic sources, private Danish foundations, industry, and most of the Danish research councils, in addition to the founding sponsor, the Danish National Research Foundation. CBS is also participating in many EU consortia and NIH projects.

COMPUTE INFRASTRUCTURE: A strong compute and database infrastructure equipped with SGI Altix shared-memory computers with 250 processors, 500 GB RAM, and a fast fibre channel 30 TB storage RAID. Most of the computer hardware has been funded by grants from the Danish Center for Scientific Computing (www.dcsc.dk). The database management maintains a data warehouse with more than 300 public databases that are particularly useful in the context of data integration within systems biology.

ACADEMIC ENVIRONMENT: CBS is a highly multi-disciplinary center with a strong international profile. More than 12 nationalities are currently represented, including many European countries, Russia, China and the USA. Together, the CBS group has key competences within the fields of molecular biology, biochemistry, pharmacology, medicine, chemistry, physics, mathematics, computer science and chemical engineering. CBS is one of several strong research centers at the BioCentrum department at the Technical University of Denmark. With more than 400 employees, BioCentrum-DTU represents the largest concentration of biotech research in Denmark.

MANAGEMENT: CBS has since 1993 been led by professor Søren Brunak, who, together with the other group leaders, the administrative staff, and the compute group, manages the center activities.

INTEGRATIVE SYSTEMS BIOLOGY
Group leader: Center director, professor Søren Brunak, brunak@cbs.dtu.dk

The study of life at the cellular and molecular level has brought about insight and change beyond anyone’s imagination over the last thirty years. Until quite recently, this type of research has been carried out in a reductionistic way in which a few components were studied at a time. However, the latest breakthroughs and advances in nano- and biotechnology have created new possibilities for cataloging and studying hundreds and thousands of biomolecules simultaneously and have thereby paved the way for a new systems-scale view of living cells and organisms. The new field is called systems biology.
The Integrative Systems Biology Group at CBS is at the leading edge of these developments, focusing mainly on understanding how intracellular networks of genes, proteins, metabolites and other small molecules regulate cellular behaviour and how perturbations to these regulatory systems may lead to disease. Unlike related efforts in other areas, such as physics, modeling in systems biology relies on integration of massive amounts of experimental data rather than on theoretical modeling alone. For this reason, the group consists of biologists, pharmacologists, biochemists, engineers and physicists, working together on both the experimental and the computational side to tackle the challenges of modeling, mining and integrating massive amounts of heterogeneous data into systems biology.

The group recently published a first proof-of-concept study of the cell cycle in Science. The results reveal how different protein complexes, or molecular machines, are built and activated inside the cell during the cell division process. Apart from being of great value for basic research, these results may help to understand how mutations in these molecular components lead to diseases such as cancer.

The ongoing effort in the group is to construct models that will aid in identifying new disease genes and in uncovering the mechanisms behind complex, multifactorial diseases. This effort has recently led to the discovery of a large number of likely disease genes in various disorders such as breast cancer, Parkinson’s disease and hypertension. The ultimate goal of the group is to expand these efforts into models of entire cells and organisms, which will enable simulations of cellular and physiological response to perturbations associated with disease, drug targets and drug treatment.
[Figure: cell cycle diagram showing protein complexes by phase (M, G1, S, G2). Labels: regulation of meiosis, MCM/ORC, protein kinase A, DNA replication, glycogen synthesis, nucleosome/bud formation, mitotic exit, APC, Cdc28-cyclin, histones, cation transport, SCF, Pho85-cyclin, sister chromatid cohesion, tubulin related, transcription factors, SPB, cell wall.]

IMMUNOLOGICAL BIOINFORMATICS
Group leader: Associate professor Ole Lund, lund@cbs.dtu.dk

The immune system normally does a good job of keeping us free from diseases, but sometimes it fails. One approach towards understanding why this happens is to produce advanced simulation models of the immune system and to understand the relationship between hosts and pathogens in this manner. Depending on the complexity of these models and the input given, they can be used to simulate what happens when a host gets infected by a pathogen, thereby predicting the co-evolution of pathogens and immune systems. One aim of the modeling is to identify parts of proteins known as epitopes, which are recognized by the immune system, thereby inducing a protective response. This knowledge is very valuable in the development of better vaccines and provides important insights into the progression of cancer, allergy and autoimmune diseases.

The Immunological Bioinformatics Group at CBS is developing new technologies for epitope discovery that can aid in the search for new vaccines and therapies for HIV, malaria, and tuberculosis, as well as for diseases such as influenza and pox, which may evolve into a threat either naturally or intentionally through bioterrorism. The group has built a simulation model of the human immune system and has constructed a database with all human pathogens. Using this database and a database of the human genome, the group is working on using the prediction methods to simulate the co-evolution of pathogens and immune systems, and in particular to identify epitopes from the different arms of the immune system.
In most of the projects the predicted epitopes are being validated through experimental collaborations with partners doing wet-lab research. The group seeks to develop methods for the three main types of epitopes: B cell epitopes, which are used to recognize microorganisms outside cells; helper T lymphocyte (HTL) epitopes, which are used to activate cells that have taken up foreign substances; and cytotoxic T lymphocyte (CTL) epitopes, which are used to detect and kill infected cells.

MOLECULAR EVOLUTION
Group leader: Associate professor Anders Gorm Pedersen, gorm@cbs.dtu.dk

Evolutionary theory is the conceptual foundation of the life sciences. The famous geneticist Theodosius Dobzhansky expressed this very well when he said, “Nothing in biology makes sense, except in the light of evolution”. In the post-genomic era this insight is more relevant than ever, and only by taking the theory of evolution into account is it possible to get a handle on organizing and analyzing the massive amount of biological data now available. Specifically, it is important to realize that any group of present-day species that one might choose to investigate will in fact have evolved from a common ancestor through a process of “descent with modification”, and this will have an impact on how similarities and differences between the molecules within these organisms should be interpreted.

The Molecular Evolution Group at CBS applies phylogenetic methods to analyze specific biological systems, but also develops methods for analyzing the flood of sequence data available in the public domain in order to learn about the evolutionary process itself. Current projects focus, among other things, on how viruses such as HIV and the hepatitis C virus (HCV) evolve within infected patients, and in this context the group investigates how antiviral drug use influences selection for resistance.
The evolution of bacterial resistance to antibiotics and horizontal transfer of resistance-associated genes are other topics that the group explores. The group’s mainly computer-based research into bacterial and viral evolution is done in close collaboration with experimental groups at the University of Copenhagen, Copenhagen University Hospital, Hvidovre Hospital and the State Serum Institute.

Other current projects in the Molecular Evolution Group include investigations into the evolution of non-coding DNA, evolution and origin of introns, and evolution of evolvability (the ability of biological systems to evolve). Generally, the group is interested in all aspects of evolution, and while using state-of-the-art computational tools, the focus is always on analyzing problems that are interesting from a biological point of view.

POST-TRANSLATIONAL MODIFICATION
Group leader: Associate professor Nikolaj Blom, nikob@cbs.dtu.dk

Protein function and modification is the focus of the Post-Translational Modification Group at CBS. Disturbances of post-translational modifications (PTMs) are the direct or indirect cause of many diseases, including cancer and infections, and a greater understanding may therefore lead to therapeutic interventions. The PTM group studies a large range of protein modifications, both to elucidate the function of proteins that have still not been fully characterized and to provide tools for the discovery of proteins with particular properties, e.g. localization signals and processing sites.

Many PTMs occur at specific, yet variable, motifs in the target proteins. In contrast to simple consensus patterns, machine learning techniques, such as artificial neural networks, are often well suited to integrate the subtleties of sequence variations, which can also be visualized by so-called sequence logos.
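As an illustration of the sequence logos mentioned above, the following minimal Python sketch computes the information content, in bits, at each position of a set of aligned sites, which is the quantity that sets the total height of each column in a logo. It is not the code used by the group: the function name `information_content` and the toy peptides are hypothetical, and the small-sample correction applied in published logo methods is omitted for brevity.

```python
import math
from collections import Counter


def information_content(aligned_sites, alphabet_size=20):
    """Return the information content (bits) of each alignment column.

    For each position, this is log2(alphabet_size) minus the Shannon
    entropy of the observed residue frequencies. No small-sample
    correction is applied (an assumption made to keep the sketch short).
    """
    if not aligned_sites:
        raise ValueError("need at least one sequence")
    length = len(aligned_sites[0])
    if any(len(site) != length for site in aligned_sites):
        raise ValueError("all sequences must have the same length")

    max_bits = math.log2(alphabet_size)
    bits_per_position = []
    for pos in range(length):
        counts = Counter(site[pos] for site in aligned_sites)
        total = sum(counts.values())
        entropy = -sum(
            (n / total) * math.log2(n / total) for n in counts.values()
        )
        bits_per_position.append(max_bits - entropy)
    return bits_per_position


# Hypothetical toy peptides with an invariant serine at index 3 (not real data).
toy_sites = ["RRASL", "KRSSV", "RKTSL", "RRLSI"]
bits = information_content(toy_sites)

for pos, value in enumerate(bits):
    print(f"position {pos}: {value:.2f} bits")
```

Conserved positions, such as the invariant serine in the toy data, carry the most information and would appear as the tallest columns in a logo, while variable positions contribute less.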
The PTM Group at CBS has a successful record of developing useful prediction tools, including SignalP (signal peptide sequences), NetOGlyc and NetNGlyc (glycosylation sites) and NetPhos (phosphorylation sites), which are made available over the internet and used by large parts of the molecular biology community. Current projects focus on kinase-specific phosphorylation sites, apoptotic caspase targets, GPI-attachment sites and pro-protein processing sites.

Taking the knowledge of protein modifications further, the group is working on an integration of features at a systems biology and proteome-wide level. This means that certain classes of proteins, e.g. nucleolus-localized or cell-cycle regulated proteins, may be classified based on their features. These features include calculated as well as predicted properties such as PTMs. In one such project, the group is aiming at predicting the ability of proteins to fold with or without the assistance of chaperones.

To test the PTM predictors developed by the group, an in-house experimental validation scheme has been initiated. The first approach involves peptide microarrays which contain up to 10,000 different peptides on a microscope slide. By incubation with, for example, a specific kinase, it is possible to deduce much information about the specificity of the given kinase and use this in the refinement of a prediction method.

COMPARATIVE MICROBIAL GENOMICS
Group leader: Associate professor David W. Ussery, dave@cbs.dtu.dk

Today, hundreds of bacterial genome sequences are available in the public databases and several more genomes are being sequenced every month. Many of these genomes belong to known human pathogens. The sequence data represent a vast amount of information, and their comparison and analysis are important for a deeper understanding of virulence factors and of whether new organisms constitute a potential food safety problem.
The Comparative Microbial Genomics Group at CBS uses a combination of computational predictions and experiments to explore the relationships between the hundreds of sequenced bacterial genomes. The approach is “DNA-centric” in that the DNA sequence is used to predict DNA structures, which can in turn be indicators of useful biology (for example, localization of a promoter based on DNA curvature and melting profiles). Currently, the four major focus areas of the group are: 1) prediction of transcripts, including promoters, operons (containing genes coding for proteins, rRNAs, tRNAs or other ncRNAs), and terminators; 2) prediction of highly expressed genes (based on chromatin properties of the genomic DNA sequence, as well as CAI (codon adaptation index) values for genes encoding proteins); 3) developing models of gene interaction networks involved in bacterial pathogenesis; and 4) developing novel methods for comparison of bacterial genomes.

The analysis of a single genome can yield much information, and coupled with experimental data, such as transcriptomic, proteomic, and metabolomic results, the information for even one organism can be overwhelming. Handling and maintaining this large amount of data for hundreds of sequenced organisms requires a structured database system. For this purpose, the GenomeAtlas database (www.cbs.dtu.dk/services/GenomeAtlas) has been developed, including a web interface for presenting much of this information from a genomic perspective. The GenomeAtlas database also includes visualisation methods for viewing and comparing genomic properties for all the sequenced microbial genomes.

The group also designs high-density microarrays for bacterial genomes and performs laboratory experiments to test the predictions and to generate new data for models and new predictions, in an iterative manner.
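The codon adaptation index (CAI) named among the focus areas can be illustrated with a short Python sketch. It follows the commonly used definition of CAI as the geometric mean of the relative adaptiveness of the codons in a gene, measured against a reference set of highly expressed genes. This is not the group's implementation: the reduced codon table, the toy sequences, the function names and the pseudocount of 0.5 (used to avoid taking the logarithm of zero) are all assumptions made for illustration.

```python
import math
from collections import Counter

# Reduced synonymous-codon table covering only the amino acids used in the
# toy example below. A real analysis would use the full genetic code.
SYNONYMOUS = {
    "K": ["AAA", "AAG"],
    "E": ["GAA", "GAG"],
    "F": ["TTT", "TTC"],
}


def codons(seq):
    """Split a coding sequence into complete codons, dropping any remainder."""
    usable = len(seq) - len(seq) % 3
    return [seq[i:i + 3] for i in range(0, usable, 3)]


def relative_adaptiveness(reference_genes, pseudocount=0.5):
    """Weight each codon by its count relative to the most used synonym."""
    counts = Counter(c for gene in reference_genes for c in codons(gene))
    weights = {}
    for family in SYNONYMOUS.values():
        family_counts = {c: counts.get(c, 0) + pseudocount for c in family}
        best = max(family_counts.values())
        for codon, n in family_counts.items():
            weights[codon] = n / best
    return weights


def cai(gene, weights):
    """Geometric mean of the weights of all scorable codons in the gene."""
    logs = [math.log(weights[c]) for c in codons(gene) if c in weights]
    if not logs:
        raise ValueError("no scorable codons in sequence")
    return math.exp(sum(logs) / len(logs))


# Hypothetical reference set standing in for highly expressed genes.
reference = ["AAGAAGAAAGAAGAATTC"]
weights = relative_adaptiveness(reference)

optimal = "AAGGAATTC"  # uses only the preferred codon of each family
poor = "AAAGAGTTT"     # uses only the less frequent synonyms

print(f"CAI optimal: {cai(optimal, weights):.3f}")
print(f"CAI poor:    {cai(poor, weights):.3f}")
```

A gene built only from the preferred codons scores 1.0, while genes that rely on rarely used synonyms score lower, which is why CAI is often used as a rough indicator of expression level.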
The microarrays are designed to test predictions of transcriptional start sites, non-coding RNA, and conserved and unique coding regions within a bacterial species.

SYSTEMS BIOLOGY OF GENE EXPRESSION
Group coordinator: Assistant professor Henrik Bjørn Nielsen, hbjorn@cbs.dtu.dk

In the post-genomic era, understanding the dynamics of biological systems is increasingly in focus. The activity of genes and their encoded products can be regulated in several ways, but transcription is the primary level of regulation in most systems. The recent flood of global expression analyses has underscored the importance of transcriptional regulation.

The Systems Biology of Gene Expression Group at CBS conducts research into systems biology with its primary starting point in transcriptomics. The group has evolved from a microarray-focused activity and is currently focused on addressing questions at the systems biology level. Current research topics in the group include integration and mining of diverse data domains such as gene expression data, protein-protein interaction data, gene ontology, regulatory sequence motifs, ChIP-on-chip data and non-coding RNAs. In addition, the group studies the cellular networks whose perturbations are measured experimentally.

The group takes advantage of the CBS laboratory facilities both for data collection and verification of hypotheses. The facilities include an RNA lab, Affymetrix facilities, custom-designed oligo array equipment, and RT-PCR instruments. Even though the group holds expertise in experiments, computer science and statistics, the main focus is on biological and medical problems. These problems are formulated in collaboration with other teams from industry and academia.
The group has also contributed to the scientific community with important tools for microarray design and analysis: the microarray normalization algorithm qspline, OligoWiz for microarray probe selection, and the widely used ‘affy’ package, comprising a complete framework for Affymetrix GeneChip analysis.

CHEMOINFORMATICS
Group leader: Associate professor Svava Ósk Jonsdottir, svava@cbs.dtu.dk

The search for new drugs is a very challenging and costly endeavor. The possibility of using computational methods for screening compounds at an earlier stage can significantly improve the success rate among drug candidates, as many late drug failures due to toxicity and other factors can thus be avoided.

The Chemoinformatics Group at CBS works with the development of new and innovative computational tools for use in the drug discovery and optimization process. The research is presently focused mainly on analysis of large compound and property databases, and the development of predictive tools using machine learning and computational chemistry methods. Such models are based on the structural features of the drug molecules, combined with relevant biological and chemical information in such a way that it becomes possible to predict the behavior of unknown compounds. This research is carried out in close collaboration with scientists at the Danish University of Pharmaceutical Sciences.

Examples of current research projects are: development of pre-screening methods used for selecting compounds for a drug discovery pipeline, prediction methods for properties like solubility and various types of toxicity, prediction of drug toxicity based on NMR metabonomics data from rat urine, and modeling of hERG ion channel blockers. An integrated part of this research effort is building an in-house infrastructure of accessible data by collecting a number of relevant compound databases and data sets.
New links between chemoinformatics, bioinformatics and systems biology are also explored.

[Figure: neural network diagram with input nodes, hidden layer and output.]

WWW.CBS.DTU.DK
Center for Biological Sequence Analysis
BioCentrum-DTU, Technical University of Denmark
Kemitorvet, Building 208
DK-2800 Kgs. Lyngby, Denmark
phone: +45 4525 2477, fax: +45 4593 1585
e-mail: cbs@cbs.dtu.dk