Building and running large scale computational neuroscience models

Fred Howell and Jean-Alain Grunchec
Institute for Adaptive and Neural Computation, Informatics, University of Edinburgh

Abstract

The ambitious aim of computational neuroscience is to gain a better understanding of the operation of real neurons and brain circuits. But real neurons are extremely complicated electronic and biochemical devices whose dynamics we still understand poorly, and they are connected in extremely large, complex and specific networks whose connectivity and dynamics we understand even less well. We are working on three of the software problems associated with modelling brain networks: obtaining access to sufficient experimental data to build models; developing GUIs and standards for declarative XML based model specifications, containing all parameters of multi-level models of neurons, ion channels and networks; and automated techniques for scaling large models across clusters without the modeller having to become a parallel programming expert.

1. Introduction

The first major success of computational neuroscience was the phenomenal work of Hodgkin and Huxley, who in 1952 succeeded in producing an elegant mathematical model of the propagation of the electrical signal in a nerve axon: the action potential. More recent work has extended the electrical study of membrane properties to include large numbers of different ion channels (the "transistors" of the neuron), receptors and synapses. Software tools such as NEURON [5] and GENESIS [6] include scripting languages for building arbitrarily complex models of electrical activity over the neural membrane (similar to SPICE models in analogue electronics), incorporating the branched shape of dendrites, ion channel switching characteristics, and intracellular biochemical pathways.

Modellers would like to be able to scale up models to include a fraction of the connectivity of a real circuit. It is the number of synapses which usually limits the scale of model one can run; a small area of brain tissue may have 10,000 neurons, each connected to 10,000 others, i.e. 100,000,000 synapses. One may wish to model the dynamics of each synapse with a number of state variables, which makes modelling even a small area of brain tissue intractable.

In this paper we present our work addressing three of the e-science challenges related to modelling brain circuits: obtaining access to experimental data; building multi-level model descriptions; and scaling models to run across clusters.

2. Data issues in neural modeling

The single biggest challenge limiting the development of working models of neural circuits is the lack of availability of suitable experimental data. Ideally one would like to know the precise shapes of all neurons in a given circuit (since morphology is one determinant of electrical behaviour), along with the positions and shapes of all synapses, electrical recordings taken concurrently from all cells, and the localisation of the electrically active receptors and channels across the membrane. However, it is not yet feasible to gather detailed connectivity information, or detailed electrical recordings from more than a few cells at once, although exciting technical developments in high throughput electron microscope level tomography [10] and simultaneous optical recording from hundreds of cells using calcium imaging and confocal microscopes promise orders of magnitude improvements in this area. Such techniques will require novel high throughput image analysis algorithms to generate useful data. Truly large scale and systematic data collection to allow us to model small brain circuits will require an industrial activity in the manner of the human genome project [11].

We have developed a number of tools for publishing experimental data, including the ratbrain project [9] and Axiope Catalyzer [12]. Particular computer science challenges are caused by the heterogeneity of data types; in the extreme, each experiment could require its own object oriented data model. Ambitious attempts to construct data models for biological experiments (e.g. MAGE-OM and MAGE-ML) illustrate the complexity, with hundreds of related classes for describing a subset of the restricted domain of a single experimental technique (microarrays). Neuroscience research requires new usable software which bridges the gap between unstructured text and structured databases, and which is not so complex that its use requires ontology specialists.

We emphasise this data issue in a paper on software for modeling because it is currently more of a limiting factor on progress than other significant technical issues such as model scaling or parameter searching. Databases are not yet in widespread use in neuroscience, and there has not been enough encouragement for researchers to publish their raw datasets alongside journal articles. Our suggestion is to learn from the successes of data publication in bioinformatics, which established useful community databases by the simple mechanism of journals requiring an accession number in a public database of raw data before an article is accepted for publication. Funding council moves towards requiring data publication will also help.

Fig 1. The network editor GUI is used for specifying large networks of neurons. The GUI for adjusting parameters for different components of the model is generated on the fly from the schema, and the model is stored in XML.

3. XML for model description

We are collaborating with the developers of the major software tools for neuroscience models (including NEURON and GENESIS) to move the declarative aspects of model specification to a simulator independent XML format. Models in neuroscience can be complex and multi-level. They can combine elements of intracellular pathways (the focus of the systems biology standard for pathways, SBML [4]); ion channels and receptors; compartmental neurons; and networks. At each level of scale there are different levels of detail suitable for modeling; e.g. a single neuron may be modelled as a single isopotential compartment or as a complex tree structure, and a synapse may be modelled as a simple weight, a dynamical system, or a complex pathway incorporating calcium buffers and states of receptors.

Because of this complexity, it is important for the description of the model to be as clean as possible, and separated from the implementation concerns of a particular simulator. The current situation is that many models are coded using a script language, which provides convenient automation for running models but often means that the model can only run on a single simulator.

Simulator developers are keen to move towards language independent XML based standards for model descriptions, as XML is sufficiently flexible and clear for people and programs to read, and copes with arbitrarily complex structures. We established the NeuroML project [1] as a focal point for tools and standards efforts in neuroscience modelling. For areas of modelling where the methods and data structures are agreed and stable, it makes sense to move towards standard languages. One example is the MorphML [13] standard for describing the complex morphology of reconstructed neurons and populations, which is useful for neuroanatomists as well as modellers. Another is the ChannelML project for standardising models of ion channel dynamics. Developer meetings and the open source resources provided by Sourceforge [1] have proved useful for coordinating activity. The neuroConstruct tool [3] developed by our collaborators at UCL provides support for creating declarative model descriptions focused on cerebellar network models. There remains the question of how simulators should represent aspects of novel models for which there is no agreed standard; it would be advantageous for these parts of model descriptions to be stored in XML as well, with a simulator-specific schema.

3.1 Developing schema-independent GUIs

One consequence of the complexity and dynamic nature of neuroscience models is the software development overhead of recoding user interfaces with every change in schema. Because of this, we made a model editor which builds itself on the fly from the schema definition of the object model, so no coding is required to extend the interface or add additional parameters. This technique has proved extremely powerful and time saving, and allows one to edit any XML format which provides a schema for its underlying object model. Figure 1 illustrates the interface. Similar techniques have been used in a small number of tools [7,8,12].
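To illustrate the idea, the following hypothetical sketch (not our editor's actual code) builds an editing panel on the fly from an object model; it uses JavaBeans introspection as a stand-in for the XML schema definition, and the IonChannel class with its parameters is invented for the example.

import java.beans.BeanInfo;
import java.beans.Introspector;
import java.beans.PropertyDescriptor;
import javax.swing.*;
import java.awt.GridLayout;

/** Sketch: generate an editor panel from the object model itself,
 *  so adding a new parameter needs no GUI code. */
public class SchemaDrivenEditor {

    /** One label + text field per readable/writable property. */
    public static JPanel editorFor(Object model) throws Exception {
        BeanInfo info = Introspector.getBeanInfo(model.getClass(), Object.class);
        JPanel panel = new JPanel(new GridLayout(0, 2, 4, 4));
        for (PropertyDescriptor p : info.getPropertyDescriptors()) {
            if (p.getReadMethod() == null || p.getWriteMethod() == null) continue;
            panel.add(new JLabel(p.getDisplayName()));
            Object value = p.getReadMethod().invoke(model);
            panel.add(new JTextField(String.valueOf(value)));
        }
        return panel;
    }

    /** Invented model element, standing in for a schema-defined component. */
    public static class IonChannel {
        private double conductance = 0.12;        // mS/cm^2 (illustrative)
        private double reversalPotential = -77.0; // mV (illustrative)
        public double getConductance() { return conductance; }
        public void setConductance(double g) { conductance = g; }
        public double getReversalPotential() { return reversalPotential; }
        public void setReversalPotential(double e) { reversalPotential = e; }
    }

    public static void main(String[] args) throws Exception {
        JFrame frame = new JFrame("Channel parameters");
        frame.add(editorFor(new IonChannel()));
        frame.pack();
        frame.setDefaultCloseOperation(JFrame.EXIT_ON_CLOSE);
        frame.setVisible(true);
    }
}

The same pattern applies whether the properties are enumerated from a bean, as here, or from an XML schema: the interface is derived from the model description rather than hand coded against it.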
4. Scaling models to run across clusters

One consequence of moving model specifications from a script language to a declarative XML notation is that it becomes possible to run models across a number of machines, allowing scaling to more realistic sizes without having to recode the model as an explicitly parallel program. This is useful for network models, as the number of synapses can be very large.

4.1 Dynamic modules

One of the challenges of neuroscience modelling is that the types of model people want to build are constantly changing. In order to avoid having to rewrite our software with every new demand, we adopted a system of dynamic modules, with new simulation and GUI components loaded on the fly (using JavaBeans). Models can be run locally or automatically distributed across a cluster, with visualisation streamed to the workstation. One of the 3D visualisation modules we have written allows the user to display the voltage activity of any neuron in the simulation, as well as population activity statistics.

Fig 2. The dynamic state of a running simulation can be viewed from the workstation. This allows graphs, animations and summary statistics for the model running on a cluster to be monitored.

4.2 Parallel algorithms

We use the Java remote method invocation (RMI) layer for communications between nodes. This high level interface has the flexibility of automatically serialising objects, so once the basic communications layer is in place one can send arbitrarily complex structures between nodes (e.g. the shape and voltage distribution of a neuron) without having to write custom messaging code. The flexibility comes at a performance cost, however: the built-in Java object serialisation is fairly slow. For the communications during a simulation run (voltage spikes between neurons) we developed our own optimised communications based directly on low level sockets. This allows us to reduce memory copying and approach the maximum performance of the underlying hardware.
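As a rough illustration of this kind of socket-level messaging (our actual wire format is not detailed here, so the record layout below, an int count followed by (neuron id, spike time) pairs, is an assumption), batching a timestep's spikes into one buffered write avoids both object serialisation and per-spike system calls:

import java.io.*;

/** Sketch of low-level spike messaging between nodes. The streams would
 *  wrap a java.net.Socket, e.g.
 *  new DataOutputStream(new BufferedOutputStream(socket.getOutputStream())). */
public class SpikeLink {

    /** Send a batch of spikes for one timestep. */
    public static void sendSpikes(DataOutputStream out,
                                  int[] neuronIds, double[] times, int count)
            throws IOException {
        out.writeInt(count);
        for (int i = 0; i < count; i++) {
            out.writeInt(neuronIds[i]);
            out.writeDouble(times[i]);
        }
        out.flush();  // one flush per timestep, not per spike
    }

    /** Receive a batch of spikes into preallocated arrays (no garbage). */
    public static int receiveSpikes(DataInputStream in,
                                    int[] neuronIds, double[] times)
            throws IOException {
        int count = in.readInt();
        for (int i = 0; i < count; i++) {
            neuronIds[i] = in.readInt();
            times[i] = in.readDouble();
        }
        return count;
    }
}

Reusing the preallocated arrays on every timestep also avoids the per-message object allocation that RMI serialisation would incur.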
4.3 Java optimisation techniques

We chose to develop our software using Java, because of its convenience and the availability of useful libraries for GUIs. In order to obtain reasonable performance we found it necessary to be extremely careful in selecting which Java features to use. Optimisations we found to work particularly well were:

• using arrays of primitive types in preference to objects. For example, rather than having a "Connection" class with a separate object for each connection, use a "Connections" table which holds arrays of integers or bytes. This allows the just-in-time compiler to reach C++ levels of performance and memory efficiency, at a cost in programming convenience. (When C# support on Unix reaches maturity it may be possible to combine convenience and efficiency.)

• using a variable length integer encoding scheme to reduce memory usage per integer from four bytes to one. The encoding scheme we use places values in the bottom 7 bits of a byte, so small numbers (0-127) take a single byte of space; the high bit is set if additional bytes are needed to store larger numbers. The technique is similar to that used by the Apache Lucene text index software for its high performance. A sketch combining this with the primitive array technique is given after this list.

• using on-the-fly in-memory compression of connectivity tables. Standard techniques including run length encoding and difference encoding are used, and these are particularly effective in conjunction with the variable length integers.
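The following hypothetical sketch (class and method names are ours, for illustration) combines the first two optimisations: a connection table held as one flat byte array of variable length integers, rather than one object per connection.

/** Connection target ids stored as variable length integers in a flat
 *  byte array: 7 payload bits per byte, high bit = "more bytes follow",
 *  so ids 0-127 occupy a single byte. */
public class ConnectionTable {
    private byte[] data = new byte[1024];
    private int size = 0;  // bytes used

    /** Append one non-negative target id, varint encoded. */
    public void add(int value) {
        while ((value & ~0x7F) != 0) {                      // >7 bits left?
            ensureCapacity(1);
            data[size++] = (byte) ((value & 0x7F) | 0x80);  // set high bit
            value >>>= 7;
        }
        ensureCapacity(1);
        data[size++] = (byte) value;  // final byte, high bit clear
    }

    /** Decode all stored ids, invoking the callback for each. */
    public void forEach(java.util.function.IntConsumer callback) {
        int pos = 0;
        while (pos < size) {
            int value = 0, shift = 0;
            byte b;
            do {
                b = data[pos++];
                value |= (b & 0x7F) << shift;
                shift += 7;
            } while ((b & 0x80) != 0);  // continue while high bit set
            callback.accept(value);
        }
    }

    private void ensureCapacity(int extra) {
        if (size + extra > data.length)
            data = java.util.Arrays.copyOf(data, data.length * 2);
    }
}

In conjunction with the third optimisation, storing sorted target ids as successive differences before the varint step keeps most deltas small enough to fit in a single byte.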
With these optimisations one can obtain performance comparable to native code, with fewer cross platform issues. Memory use per connection is 10 bytes, and the communications overhead is proportional to the logarithm of the number of nodes. We have been running simulations with 100 million synapses per node on our 24 processor cluster, and are extending the package to run across clusters in Edinburgh and UCL. The software will be released on our NeuroGEMS site [2].

References

[1] F. Howell et al., NeuroML home page: www.neuroml.org
[2] NeuroGEMS neuroinformatics modules: www.neurogems.org
[3] P. Gleeson and A. Silver, neuroConstruct: http://www.physiol.ucl.ac.uk/research/silver_a/neuroConstruct/index.html
[4] SBML: www.sbml.org
[5] M. Hines, NEURON: www.neuron.yale.edu
[6] GENESIS: www.genesis-sim.org
[7] R. Cannon, Catacomb: http://www.enorg.org
[8] PEDRO: pedro.man.ac.uk
[9] L. Zaborsky, F. Howell, N. McDonnell and P. Varsanyi (2005), The Ratbrain Project: www.ratbrain.org
[10] W. Denk and H. Horstmann (2004), Serial Block-Face Scanning Electron Microscopy to Reconstruct Three-Dimensional Tissue Nanostructure, PLoS Biol 2(11): e329
[11] R. Merkle (1989), Large scale analysis of neural structures: www.merkle.com/merkleDir/
[12] F. Howell, R. Cannon et al. (2004), Catalyzer: www.axiope.com
[13] S. Crook (2004), MorphML: www.morphml.org