Cognitive Science, 2015 (forthcoming)

Building Cognition: The Construction of Computational Representations for Scientific Discovery

Sanjay Chandrasekharan1 (sanjay@hbcse.tifr.res.in)
Nancy J. Nersessian2 (nancynersessian@fas.harvard.edu)
1 Homi Bhabha Centre for Science Education, Tata Institute of Fundamental Research, Mumbai, India 400088
2 Department of Psychology, Harvard University, 33 Kirkland St., Cambridge, MA 02138

Abstract

Novel computational representations, such as simulation models of complex systems and video games for scientific discovery (Foldit, EteRNA, etc.), are dramatically changing the way discoveries emerge in science and engineering. The cognitive roles played by such computational representations in discovery are not well understood. We present a theoretical analysis of the cognitive roles such representations play, based on an ethnographic study of the building of computational models in a systems biology laboratory. Specifically, we focus on a case of model-building by an engineer that led to a remarkable discovery in basic bioscience. Accounting for such discoveries requires a distributed cognition (DC) analysis, as DC focuses on the roles played by external representations in cognitive processes. However, DC analyses by-and-large have not examined scientific discovery, and mostly focus on memory offloading, particularly how the use of existing external representations changes the nature of cognitive tasks. In contrast, we study discovery processes, and argue that discoveries emerge from the processes of building the computational representation. The building process integrates manipulations in imagination and in the representation, creating a coupled cognitive system of model and modeler, where the model is incorporated into the modeler's imagination. This account extends DC significantly, and we present some of its theoretical and application implications.

Keywords: Distributed Cognition, External Representations, Scientific Cognition, Discovery, Digital Media, Systems Biology

Introduction

New computational representations are radically changing the way scientific knowledge is generated, most notably in the biological sciences and bioengineering. A striking example is the case of Foldit, a video game (built on top of a computational model) that allows novel protein folds to be designed by web-based groups of people not formally trained in biochemistry. Using Foldit, a 13-year-old player (Aristides Poehlman) designed protein folds that were judged better than the best biochemists' folds in CASP (Critical Assessment of Techniques for Protein Structure Prediction), the top international competition on protein folding (Bohannon, 2009). This remarkable result provides an interesting cognitive insight: the process of building new protein folds, using the video game interface, allowed the novice player to implicitly develop an accurate/veridical sense of the mechanics and dynamics of the protein-folding problem. In this paper, we examine this building process more generally, and develop a theoretical account of how discoveries could emerge from building. The approach of 'crowdsourcing' difficult scientific problems to novices using novel interfaces is now widely accepted, especially after Nature published a paper (Cooper et al., 2010) where roughly 200,000 Foldit players were included as authors.
The paper proposed that harnessing people's implicit spatial reasoning abilities using such model-based games could be a new method to solve challenging scientific problems. This proposal is now confirmed, with Foldit players making some remarkable discoveries, including building the structure of a protein causing AIDS in rhesus monkeys, which had been an unresolved problem for 15 years (Khatib et al., 2011). The game is currently being refined to support the development of new drugs by the players. A spin-off game from Foldit, EteRNA, allows players to build RNA folds, and every week the most promising folds from the gamers are synthesized by a Stanford lab. The synthesis results are then fed back to the gamers, who use these real-world results to improve their designs. This closed-loop building process has led to the gamers discovering fundamental design principles underlying RNA structure (Lee et al., 2014; Koerner, 2012). Other similar crowdsourcing games include Phylo (which helps optimize DNA sequences) and Eyewire (which helps map the 3D structure of neurons). Eyewire recently helped answer some basic research questions about the way retinal cells detect motion (Kim et al., 2014).

These games mark an important shift in the direction of knowledge flow in science, which has traditionally been from implicit to explicit. For instance, in many areas of biology, the effort is to capture implicit procedural knowledge (such as flight patterns and navigation of birds) in explicit declarative terms (such as aerodynamics and signaling). In physics, procedural knowledge (such as the qualitative understanding of force) is considered to lead to misconceptions, and declarative knowledge (such as Newton's Laws) is used to explain many aspects of phenomenal experience. Given this procedural-to-declarative trajectory of scientific knowledge, the case of Foldit and similar games marks a new approach to discovering scientific knowledge, as such cases re-represent declarative knowledge using computational models and a manipulable interface, so that naive participants can use their procedural knowledge to build up novel patterns.

At the heart of such games and other similar digital media for discovery is a re-representation – converting explicit conceptual knowledge developed by science (the structure of proteins, possible folds, hydrophobic/hydrophilic interactions, etc.) into a control interface that can be manipulated using a set of actions. This interface allows novices to build new representations, using their implicit spatial knowledge. These games thus present a fundamental shift in the practice of science, particularly an acknowledgment of the role played by tacit/implicit sensorimotor processes in scientific cognition (Polanyi, 1958, 1966). The success of this approach suggests that there is a close connection between procedural and declarative knowledge.

This is a radical epistemic shift, and it is driven by two irreversible factors. One is the focus on understanding interdisciplinary problems such as climate change, where the phenomena under investigation are spread across many time-scales and spatial levels, and complex feedback loops are standard features of the domain. Existing theory and automated methods are not able to solve the multiscale combinatorial problems that emerge in such areas. It is also possible that in these domains, as von Neumann (1951) observed, the phenomena themselves are their simplest possible descriptions, and any good model would need to be more complex than the phenomena.
A second factor is the emergence of 'Big Data', where petabytes of data are generated routinely in labs, particularly in the biological sciences. It is not possible to analyze this avalanche of data without computational models and methods, which themselves fail to work for many problems. A good example is the classification of galaxies using data from the Hubble space telescope, a difficult problem that led to the development of Galaxy Zoo, the first effort to crowdsource science. This web-based citizen-science project has led to at least 30 peer-reviewed papers, and a new astronomical object (Hanny's Voorwerp) named after the Dutch schoolteacher who identified it.

The crowdsourcing approach to scientific problem-solving is new, but the idea of using the human sensorimotor system to detect patterns, particularly in dynamic data generated by computational models, has been applied right from the beginning of computational modeling. Entire methodologies and disciplines, as well as phenomena that challenge existing models, have been built just from visualized patterns on computer screens. These include Complexity Theory (Langton, 1984, 1990), Artificial Life (Reynolds, 1987; Sims, 1994), models of plant growth (Prusinkiewicz, Lindenmayer, & Hanan, 1988; Runions et al., 2005), computational biochemistry (Banzhaf, 1994; Edwards, Peng, & Reggia, 1998), computational nanotechnology (reported in Lenhard, 2004; Winsberg, 2006), and climate change (Schneider, 2012). All these novel areas of exploration are based on visualizing data from computational models. Apart from the visual modality, protein structure has been generated as music (Dunn & Clark, 1999), and scanning microscope output has been used to generate haptic feedback (Sincell, 2000).

This approach to making scientific discoveries, by coupling the sensorimotor systems of a crowd of novice humans to data embedded in novel computational media, raises a number of questions about cognition. In particular, what cognitive mechanisms mediate the re-representation of scientific knowledge as manipulable on-screen structures (and back)? What is the relationship between declarative and procedural knowledge, such that this conversion is possible and new discoveries could emerge from this conversion process? At a more applied level, how could the visual and tactile manipulation of model elements on screen, by groups of non-scientists, quickly lead them to build valid structures representing imperceptible molecular entities they have never encountered, especially structures that have eluded practicing senior scientists for many years? What cognitive and biological mechanisms support this manipulation-based discovery process? How can these mechanisms be harnessed better, to develop other collaborative games/interfaces that address more complex and abstract scientific and engineering problems with wider applicability?

These questions are critical for practicing, as well as learning, this new form of science and engineering. Addressing them requires developing a general theoretical account that captures how discoveries could emerge from the building of new computational representations, particularly computational models, and the re-representation of data from these models.
Previous theoretical work from our group has described how building physical models, and computational models linked to such physical models, leads to discoveries (Nersessian & Chandrasekharan, 2009; MacLeod & Nersessian, 2013a, 2013b) and innovation (Aurigemma, Chandrasekharan, Newstetter & Nersessian, 2013). We have also developed an account of the possible cognitive/neural mechanisms involved in this building-to-discover process (Chandrasekharan, 2009). In this paper, we seek to address four broader questions: What cognitive powers are developed by building new computational representations, particularly computational models, and how do these new cognitive powers lead to new discoveries? What other roles does the process of building new computational representations play in the discovery process, and more broadly, in scientific cognition? What kinds of extensions to existing cognitive theory are required to develop an account of how discoveries emerge from building new computational representations? What broader implications and possibilities are offered by such a theoretical model?

Scientific Cognition as Distributed Cognition

In our view, these questions are best addressed within the distributed cognition (DC) framework (Hutchins, 1995a, 1995b), which was developed to study cognitive processes in complex (usually technical and scientific) task environments, particularly environments where external representations and other cognitive artifacts are used by groups of people. The DC approach was first outlined by Cole and Engestrom (1993), Pea (1993), and Salomon (1993), and apart from the currently dominant model presented by Hutchins (1995a, 1995b), significant contributions to the initial framework were made by Cox (1999), Hollan, Hutchins, and Kirsh (2000), and Kirsh (1996, 2001, 2010). Most work in DC is focused on understanding how internal and external representations work together to create and help coordinate complex socio-technical systems. The primary unit of analysis in DC is a distributed socio-technical system, consisting of people working together (or individually) to accomplish a task and the artifacts they use in the process. The people and artifacts are described, respectively, as agents and nodes. Behavior is considered to result from the interaction between external and internal representational structures.

The canonical example of external representational structures in DC is the use of speed bugs in a cockpit (Hutchins, 1995a). Speed bugs are physical tabs that can be moved over the airspeed indicator to mark critical settings for a particular flight. When landing an aircraft, pilots have to adjust the speed at which they lose altitude, based on the weight of the aircraft during landing for that particular flight. Before speed bugs were introduced, pilots did this calculation while performing the landing operation, using a chart and calculations in memory. With the bugs, once these markers are set between two critical speed values (based on the weight of the aircraft for a particular flight), instead of doing a numerical comparison of the current airspeed and wing configuration with critical speeds stored in memory or a chart, pilots simply glance at the dial to see where the speed-indicating needle is in relation to the bug position. This external representation allows pilots to 'read off' the current speed in relation to permissible speeds using perception. They can then calibrate their actions in response to the perceived speed difference.
The speed bugs (an external artifact) thus lower the pilot's cognitive load at a critical time period (landing), by cutting down on calculations and replacing these complex cognitive operations with a perceptual operation. The setting of the speed bugs also creates a public structure, which is shared by everyone in the cockpit. This results in the coordination of expectations and actions between the pilots. These two roles of the speed bug (lowering cognitive load and promoting coordination between pilots) are difficult to understand without considering the human and the artifact as forming a distributed cognitive system.

Starting from the above example, one way to extend the DC framework to develop an account of the role of building computational representations in discovery, particularly the process of crowdsourced discovery, would be to show how the building of external representations, specifically computational representations, helps offload not just memory, but also processes of imagination. This is roughly the direction we will be following in this paper. However, we will argue that offloading is not the right metaphor to understand the imagination process developed through the building of novel computational representations. Rather, the metaphor should be that of coupling between internal and external representations (Chandrasekharan & Stewart, 2007; Nersessian et al., 2003; Osbeck & Nersessian, 2006; Nersessian, 2009). This extension of DC is complex, as it requires moving the analysis from the use of external structures to lower cognitive load – the focus of DC until now – to the process of building external representations to create coupled systems for imagination.

Much of the work on external representations within DC has focused on capturing detailed descriptions of the way external representations are used in highly structured task environments, such as ship navigation and the landing of aircraft, and the way these representations change the nature/cost of cognitive tasks. Less understood are the processes of generating/building external representations to alter task environments (Kirsh, 1996) and the role this building process plays in cognition during problem-solving (Chandrasekharan & Stewart, 2007). As Schwartz & Martin (2006) observe, "most cognitive research has been silent about the signature capacity of humans for altering the structure of their social and physical environment". However, a central premise of the DC perspective is precisely this, as Hutchins has succinctly stated: "Humans create their cognitive powers by creating the environments in which they exercise those powers" (1995b, p. 169). Since building problem-solving environments is a major component of scientific research (Nersessian, 2012), scientific practices provide an especially good locus for examining the human capability to extend and create cognitive powers, particularly through building new external representations. In this paper, we focus on an exemplar of the building of a computational model – a complex external representation – and examine the role this external representation, specifically the process of building it, plays in structuring, as well as altering, the task of making scientific discoveries collaboratively in a systems biology laboratory.
A handful of studies have examined problem solving in scientific research from a DC perspective (see, e.g., Nersessian et al., 2003; Nersessian, 2010; Nersessian, 2012; Alac & Hutchins, 2004; Becvar et al., 2008; Hall, Stevens, & Torrobla, 2007; Hall, Wieckert, & Wright, 2010; Goodwin, 1997; Giere, 2002). Most of these studies do not consider in detail how external representations are built, largely because the development of a novel external representation, and the changes this process makes to the scientific task environment, are complex events that occur over long periods of time, and are therefore not easily captured. Even when such data are available (Chandrasekharan, 2009; Nersessian & Chandrasekharan, 2009), it is not easy to understand the cognitive roles played by the building of external representations by direct application of the current DC framework, which is derived from studies of well-structured tasks with set goals, and therefore does not transfer well to the ill-structured and open-ended task environment of a scientific laboratory that focuses on discovery. Further, the DC framework, as it stands now, largely focuses on the use of existing external representations, not on the processes of generating representations, which play a significant role in scientific practice. Building representations is part of what Hall et al. (2010) call the process of "distributing" cognition. The DC framework therefore needs to be extended to understand scientific practices that require building novel representations for problem solving.

Other theoretical frameworks for studying complex cognition in group settings, such as Situated Cognition (Clancey, 1997; Lave, 1988; Suchman, 1987), do not examine the interaction between internal and external representations, and in some instances these approaches deny the very existence of representations. With a few notable exceptions (e.g., Larkin & Simon, 1987; Schwartz, 1995; Hegarty, 2004), the classical symbol manipulation model of cognition (Vera & Simon, 1993; Newell, 1980) has also largely ignored the interaction between internal and external representations. Without a focus on the interaction between internal and external representations, it is difficult to explain how new external representations come into being, particularly ones that lead to new concepts (Nersessian, 2009, 2012). This is a central reason why DC is the most suitable framework for developing a theoretical account of the role of computational representations in discovery.

Secondly, a central aspect of the DC account is the role played by perception in lowering cognitive load (as illustrated in the speed-bug example). This critical role of perception in complex cognitive tasks provides a focused entry point for understanding the link between procedural and declarative knowledge, as well as the mechanisms involved in building external representations for discovery (Chandrasekharan, 2009; Chandrasekharan, 2014). A third reason is that, as noted above, there already exists a literature placing scientific practices, particularly accounts based on ethnographic studies of scientific cognition, within the DC framework, and therefore the DC framework has a base from which to extend, compared to other related theoretical frameworks, such as the extended mind framework developed by Andy Clark (2003).
Also, unlike this latter philosophical framework, the DC framework's close connection with cognitive science research provides a basis for examining the specific cognitive functions of the interaction between internal and external representations, and the cognitive/neural mechanisms that support this interaction (Chandrasekharan & Stewart, 2007; Chandrasekharan, 2009; Chandrasekharan, 2014).

In social studies of science, scientific laboratories have been studied to examine how they develop new scientific practices and technologies and how representations such as scientific papers are generated as part of science practice (Cetina, 1999; Galison, 1997; Latour, 1989; Rheinberger, 1997). Our account shares some features with such accounts, particularly the study of scientific labs and examination of the socio-cultural nature and effects of building new representations. However, our central objective is very different from the social studies approaches – our objective is understanding how the mind develops new powers through the building of new modeling media – physical and computational – and how these can lead to discovery and innovation. The analyses in science studies are mostly sociological or historical in nature, where the cognitive processes of the agents involved are explicitly ignored or treated as black boxes. Some of this research, though, does hint at the cognitive saliency of the material objects, including external representations, used in research and of the collaborations among researchers (e.g., Baird, 2004; Cetina, 1999; Galison, 1997; Latour, 1986; Meli, 2006). However, discussing these contributions in detail would lead us too far astray from the central issue of this paper, which is understanding the cognitive effects of building computational representations and media, determining how this understanding could help in identifying the mechanisms involved in discoveries based on such media, and exploring ways in which this understanding could be exploited to analyze and develop new digital media that support scientific discovery. In other work, we have addressed how these studies can provide important insights that can be used by researchers in the cognitive science of science. Further, we have been arguing that DC provides the basis for creating a rapprochement between purely socio-cultural and purely cognitive accounts (Nersessian, 2005, 2008; Osbeck et al., 2011).

The paper is structured as follows. Section 1 provides a basic description of computational modeling, particularly ordinary differential equation (ODE) modeling, for readers not familiar with this method. Readers familiar with ODE modeling can skip this section. Section 2 outlines our empirical and theoretical approach. Here we provide an overview of the scientific lab environment we are studying, and how it is different from the traditional environments studied by DC. Section 3 describes the general process of building computational models in the Lab, and captures it using a flow chart. Section 4 goes deeper into the modeling process, presenting a specific case of building a model that led to a remarkable discovery in basic science. Based on this case, Section 5 examines different cognitive powers that are created or enhanced by building simulation modeling environments. Section 6 examines how our account extends Distributed Cognition theory, and discusses some of the broader implications and possibilities offered by our account.
1. Computational Modeling: A Primer

Building models of various kinds is essential to scientific problem solving. Models are usually built to investigate complex real-world phenomena – a comparison of the model's behavior with the real system's behavior provides a way of understanding the complex phenomena. For instance, some plants such as mimosa fold their leaves at dusk, and also when touched. One could try to understand the mechanism underlying this behavior by building a physical model of a leaf, where folding is accomplished using small electric motors that are activated by light and touch sensors. Once such an electromechanical folding model of the leaf is created, its performance can be compared with the actual plant's folding, qualitatively (Is the model's closing as graceful as the plant's? Do all leaves close at once, or in order?), and quantitatively (What is the difference in speed-of-closing between model and plant? How sensitive is the model to light and touch, compared to the plant leaf?). These comparisons allow the model's design to be optimized, until the model's behavior parallels that of the plant. Once the behavior of the electromechanical model matches that of the real leaf, the nature of the mechanism that is used to fold the mechanical leaf can provide an understanding of the mechanism used by the plant leaf, particularly in terms of the complexity required and the limits of the mechanism. At this point, the model can be used to predict the nature of other folding behavior exhibited by mimosa leaves, such as in response to heat.

A more abstract kind of modeling, based on mathematical equations, is possible if there are quantitative data about the real plant's behavior, usually captured as a graph with one variable (say folding speed) changing in relation to another standard variable (say light). Suppose many speed-of-folding values of the leaf are known, for different standard light conditions, and these real-world data are plotted as a graph. Now, a mathematical equation could be developed, which captures the relation between folding speed and light, represented in the graph, using variables that can take different values. For instance, let's say the equation is 7L - h = 2S/f, where L is the amount of light, S is the folding speed, h is the height of the plant and f is the size of its foliage. Here S and L are variables, and h and f are parameters, which vary for different plants. This equation is a possible mathematical model, trying to capture the plant's folding behavior using a set of variables and parameters, starting with a graph representation of one aspect of leaf folding (speed).

To see if this equation is a good model for folding behavior, the model should generate its own data, just like the electromechanical model, and these data need to be compared with the real-world data from the plant. To get data from the mathematical model, you give h and f constant values (say the height and foliage values from the plant for which you have real-world data), and then give different numbers for light (L), and track the speed (S) output by the equation. If the graph made using these S values output by the equation closely matches the graph made using the real-world S data from the plant, the model 'fits' the data. If this fit holds for many mimosa plants, with varying heights (h) and foliage (f), then the equation can be considered a robust/good mathematical model of the folding behavior of the plant.
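To make the notion of 'fit' concrete, the following minimal sketch (in Python) implements the comparison just described. Only the toy equation 7L - h = 2S/f comes from the example above; the observations, the parameter values, and the simple sum-of-squared-errors measure are invented purely for illustration.

```python
# A minimal sketch of the 'fit' idea from the primer, using the toy equation
# 7L - h = 2S/f, i.e. S = f * (7L - h) / 2. All numbers are invented for
# illustration; only the equation comes from the text.

# Hypothetical real-world observations: light levels and measured folding speeds
observed_L = [1.0, 2.0, 3.0, 4.0]      # light (arbitrary units)
observed_S = [2.9, 6.2, 8.8, 12.1]     # folding speed (arbitrary units)

h, f = 1.0, 1.0                        # parameters: plant height and foliage size

def model_speed(L, h, f):
    """Folding speed predicted by the toy model for a given light level."""
    return f * (7 * L - h) / 2

predicted_S = [model_speed(L, h, f) for L in observed_L]

# A simple measure of fit: sum of squared errors between model and plant data
sse = sum((p - o) ** 2 for p, o in zip(predicted_S, observed_S))
print(predicted_S, sse)
```

A smaller error for one pair of (h, f) values than for another would indicate a better fit, which is the same comparison that is made graphically in the text.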
One key advantage provided by such an abstract model is that it allows you to make predictions quickly about how fast the plant will fold for light/touch values you have not tracked, or light/touch values you cannot track experimentally because of the limitations of your equipment.

The next possibility is the computational model. It is possible to develop mathematical equations from a graph when there are only two variables, but this approach breaks down when the studied behavior is complex, with many graphs whose variables are interconnected. For instance, you can have a chain of reactions, where one reaction's output becomes another reaction's input. Such a chain of biochemical reactions is termed a pathway. A mathematical model of a pathway requires developing a set of equations, one for each reaction, and 'coupling' together the equations, i.e., the first equation's output becomes the next equation's input. In this case, equations are not developed based on the patterns seen in the graphs, but based on the idea that any reaction can be considered as a rate of change of one set of metabolites into another, in the presence/absence of some regulating agents. A rate of change is usually modeled mathematically using differential equations. In the case of a pathway, ordinary differential equations (ODEs) are commonly used, in a chained (coupled) fashion, as the reactions they represent are coupled.

To be effective, such complex models with coupled equations need to generate data that compare well with real-world data. Data generation is difficult in such cases, because the different values for the variables and parameters need to be changed systematically and in many combinations, to get model data that 'fit' real-world data. Since many combinations of numbers can generate a 'fit', the space of possible numbers (the parameter space) needs to be explored to find the combinations that generate the best 'fit' with the real-world data. This exploration, known as parameter estimation, is difficult, because to get a unique set of numbers that gives a best 'fit' in such complex cases, the set of equations has to generate data that match all the graphs, i.e., data from all the reactions in the pathway (for a detailed analysis of parameter fitting processes, see MacLeod & Nersessian, forthcoming).

A computer can be used to solve both these problems (systematic exploration of the parameter space, and comparing model data with experimental data for fit) by quickly 'running' or simulating the dynamics of the model many times, plugging in different values for the variables and parameters, thus generating a range of model data. This is usually done using standard programs known as ODE solvers. Simulations can also be used to compare the 'fit' between a range of real-world data and the data generated by the model. Such automated solving of mathematical equations using numerical methods (i.e., by trying different numbers, and not through purely algebraic methods) is one influential form of computational modeling, and our case study focuses on such modeling using ODEs. We have omitted many details in this primer, and there are many other forms of computational modeling, but for our purposes it provides a reasonably good starting point to understand the modeling case we discuss. The most important point to keep in mind is that in the model validation process ('fit'), the data the model generates need to match the real-world data.
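As an illustration of what such a coupled-ODE model and its 'fit' look like in practice, here is a minimal Python sketch of a hypothetical two-reaction pathway (A -> B -> C), solved numerically with a standard ODE solver and compared against made-up 'experimental' time-series data. The pathway, rate constants, and numbers are all invented; only the overall workflow, simulating the coupled equations for different parameter values and then comparing model data with real-world data, follows the description above.

```python
# A sketch of a coupled-ODE pathway model and a crude parameter-space search.
import numpy as np
from scipy.integrate import solve_ivp

def pathway(t, y, k1, k2):
    """Coupled ODEs: A is consumed to make B, B is consumed to make C."""
    A, B, C = y
    dA = -k1 * A
    dB = k1 * A - k2 * B
    dC = k2 * B
    return [dA, dB, dC]

# Hypothetical measured concentrations of metabolite B at a few time points
t_data = np.array([0.0, 1.0, 2.0, 4.0, 8.0])
B_data = np.array([0.0, 0.25, 0.35, 0.30, 0.12])

def simulate(k1, k2):
    sol = solve_ivp(pathway, (0, 8), [1.0, 0.0, 0.0], args=(k1, k2),
                    t_eval=t_data)
    return sol.y[1]          # model data for metabolite B

def sse(k1, k2):
    """Sum of squared errors between model data and experimental data."""
    return float(np.sum((simulate(k1, k2) - B_data) ** 2))

# Crude exploration of the parameter space: try a grid of rate constants
best = min((sse(k1, k2), k1, k2)
           for k1 in np.linspace(0.1, 1.0, 10)
           for k2 in np.linspace(0.1, 1.0, 10))
print("best fit (SSE, k1, k2):", best)
```

Even in this tiny example, several (k1, k2) combinations can produce similar errors, which is a small-scale version of the non-uniqueness problem the modelers face with far larger pathways.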
This data generation and matching process becomes very difficult when the real-world system has many elements, which raises the number of elements (variables and parameters) in the model, and the number of graphs to be matched. Also keep in mind that the modeler can change the structure of the model (such as the number of equations, number of parameters, etc.) while trying to get a fit.

2. Lab G as a Distributed Cognitive System

Our research involves studying distributed cognition in scientific laboratories. This is a relatively new research area, and our focus is on understanding scientific cognition as situated in complex environments, comprising researchers and artifacts. Specifically, we are interested in understanding how innovations and discoveries arise from the building and use of cognitive artifacts in scientific research labs. This approach to studying scientific cognition provides substance to Hutchins' claim, cited above, about creating cognitive powers (see, e.g., Nersessian, 2012). In our current project, we have been conducting a four-year ethnographic study of cognitive and learning practices in two systems biology labs. This project is part of a longer (twelve-year) effort to both understand these aspects of research practices in bioengineering science laboratories, and develop instructional contexts that reflect salient dimensions of these practices. We use ethnographic data collection methods of participant observation, informant interviewing, and artifact collection. In both labs we conducted unstructured interviews with the lab members as well as some collaborators outside of the lab. We collected and analyzed relevant artifacts, including presentation files, paper drafts, published papers, grant proposals, dissertation proposals, and completed dissertations. We have collected 97 interviews and have audiotaped 24 research meetings.

We focus here on one lab that does only computational modeling ("Lab G"). In this lab, the modelers come mainly from engineering fields, but work on building computational models of biochemical pathways to simulate and understand phenomena as varied as Parkinson's disease, plant systems for biofuels, atherosclerosis, and heat shock response in yeast. The problems Lab G modelers work on are provided by outside experimental collaborators, who see modeling primarily as a means to extract patterns from the large amounts of real-world data they have, thus helping to isolate/predict key experiments of scientific or commercial importance. The collaborators provide experimental data for modeling, and sometimes also generate data needed for developing a model, or for validating a model's predictions.

In broad terms, the Lab G modeling processes can be understood as occurring within a distributed socio-technical system, which is the primary unit of analysis in DC. This system comprises people working together (modelers, experimentalists) to accomplish a task (discover fruitful changes to biological pathways), and the artifacts they use (models, pathways, diagrams, graphs, papers, databases, search engines) in the process. We will ignore the physical models and instrumentation systems used by the experimental collaborators in this analysis, since Lab G modelers do not interact with these. The task environment of the lab and the external representations used by modelers differ significantly from those usually examined in DC.
The main differences can be classified as follows:

Actors and Goals: The lab does not have a structured task environment, with synchronous actions connecting individuals or groups. The objective of the lab is to make discoveries, so the lab task environment is one where the specific goal is not set or known in advance. There are very general goals, such as "discover interesting reactions", and less general goals, such as "fit model". These general goals are spread across people who share a resource (experimental data), but do not share a tightly integrated task environment. The actors have different goals; they work in different settings, at different times, and using different instruments.

Conflicts: The community sharing the data has conflicting interests. Even though the modelers are working on a problem of specific interest to the experimentalists, it is very hard to get data from experimentalists, even when they have initiated the collaboration. One reason is that the experimental labs have other experimental projects underway, and the modeler's requests are often not an immediate priority for lab members. Other reasons are the experimentalists' insufficient understanding of the model and the modeler's requirements, as well as their inclination to publish experimental results before sharing the data with modelers, even when the data are necessary for building the model. Finally, the experimentalists collect and report data suitable for their own interests and community standards (such as data showing a statistically significant rise/fall from a baseline level, but only at one time-point), but the modeler often requires a different type of data (such as time-series data, which report many measurements across a dense series of time-points) that allows her to lower the mathematical complexity of her model. Further complicating the interaction, the two communities also work at conflicting time-scales. For instance, once developed, the models run blazingly fast, and can produce interesting predictions in a few days. But experimenters take weeks and months to gather data based on these predictions, and this phase-lag frustrates the modeler. Conflicts also arise over different epistemic values, such as placing importance on specific data points versus considering only trends in the data.

Artifacts: The lab researchers do not simply use external representations to reach a goal. The task of the lab is to build novel representations (biological pathway diagrams, mathematical and computational models) and use them to make discoveries. These representations are themselves built from other representations (papers, data files, online databases, code), which provide information in a scattered fashion. Modelers with little biological background need to drill deeply into a very specific experimental literature, about which they have no prior knowledge. There are significant judgments involved in acquiring and assessing this scattered information (Is this database curated? Is this cell line compatible with my problem?), and integrating the information into a coherent representation (Should this reaction be included in my pathway? Are there other regulations missing here?). The engineers involved in building the models are largely novices in making these judgments, and they gain knowledge by discussing these judgments with the experimentalists, who, in turn, have little to no understanding of how the model works, the components of the model, and what the modeler needs to build the model.
These differences suggest that understanding the lab as a distributed cognitive system requires extending the current DC framework – to task environments where goals are not clearly specified, where many kinds of conflicts exist, and where building representations is the central component of the task. Such an extension requires developing an understanding of the cognitive roles played by external representations in such environments, and how the features of these representations and their building processes meet the demands of the task. This expansion is critical for understanding scientific cognition, because building computational models is fast becoming a requirement in contemporary science, and the cognitive roles played by these models by-and-large have not been addressed.

3. Constructing the Pathway and the Model: The General Process

Lab G researchers mostly build ordinary differential equation (ODE) models of metabolic systems, which capture how the concentration levels of different metabolites in a given biological pathway (a series of reactions) change over time. The first step in this building process is the development of a pathway diagram, which shows the main reactions involved. The number of reactions in the pathways studied by the lab ranges from 14 to 34, but the number of equations derived from this set of reactions varies for specific models, depending on the question the modeler is exploring and the computational/data resources available. The pathway diagram also captures positive and negative regulation effects, which specify how the presence of different metabolites has a positive or negative influence on different reactions (Figure 1). A rough diagram of the pathway is sometimes provided by the experimental collaborators, but most often the modelers have to stitch together the pathway by searching for and reading the relevant biology literature. Once the pathway is constructed in sufficient detail, the modelers, who mostly come from engineering backgrounds, have to estimate the details of the pathway by themselves, particularly values of parameters related to metabolites, such as the speed of the reaction (rate constant) and an index of the reaction mechanism (kinetic order), which are usually not measured by experimenters. Some of this information is available in rough form (with varying degrees of reliability) from online databases, but most often these values need to be estimated, usually through iterative testing of the model, using a range of numbers as parameter values.

[Figure 1 about here]

Modelers also add some components to the pathway, usually metabolites that are known to interact with the network provided by the experimenters. These components are found by reading and searching biology journal articles and databases related to the problem being modeled, and also based on results from preliminary models. Even when much of the pathway is provided by experimentalists, these kinds of additions based on literature searches are required, because the provided pathway does not identify all the components, and the regulatory influences they have on the reactions. The pathway developed by the modeler thus brings together pieces of information that are spread over a wide set of papers, databases, and unreported data from the experimentalists. This pathway is usually trimmed, based on some simplifying assumptions, mostly to lower the mathematical and computational complexity involved in numerically solving the differential equations.
After the trimming, differential equations are generated to capture the trimmed pathway, usually directly as MATLAB (MathWorks Inc.) code. A variable is used to represent each metabolite, while the speed of its change (rate constant) and an index of its reaction mechanism (kinetic order) are represented by parameters. The next step involves estimating rough values for these parameters, and these values are then used to initialize simulations of the models. The simulation results (model data) are then compared to actual experimental results (real-world data), to judge the 'fit' of the model. Usually, modelers split the available experimental data into two sets: one set is used to develop and fit the model (training data), and the other set is used to validate/test the completed/fitted model (test data). When the model data do not fit the test data, the parameters are "tuned" to get model results that fit. Once the model fits the test data, it is run through a series of diagnostic tests, such as for stability (e.g., it does not crash for a range of values), sensitivity (e.g., input is proportional to output) and consistency (e.g., reactant material is not lost or added). If the diagnostic tests fail, the parameters are tuned again, and in some cases the pathway is changed, until the model meets both the fit and diagnostic tests. Figure 2 provides a broad outline of the modeling process. Sometimes, when data are scarce and the model can be fitted using many parameter values (see below), the diagnostic tests are run early, to narrow the space of parameter values. Lab G modelers also run Monte Carlo simulations (which involve randomly testing numbers from a set) to explore the dynamics of different parameter values/ranges.

[Figure 2 about here]

Lab G models do not use real-time dynamic visualizations. Parameter values are changed manually or using scripts, and the equation code is numerically solved using a MATLAB ODE solver. Model results for different parameter values are compared using a deck of graphs, where each graph plots the concentration value of a molecule in the pathway across time, for model and experimental data. These graphs (see, e.g., Figure 3) are used by the modeler while discussing the model with collaborators and other team members. A significant chunk of the parameter estimation problem is tackled using optimization algorithms (such as simulated annealing and genetic algorithms), which automatically do the 'tuning' of parameters to get a fit, by comparing the output values of the model (for different parameter inputs) against a desired value or range of values (the objective function).

[Figure 3 about here]

Importantly, the linear workflow suggested by the above description is deceptive – the modeling process is highly iterative and incremental. For instance, to develop the pathway diagram, preliminary models are built using the pathway network provided by the experimenters, these are run using tentative parameter values, and the generated model data are fit to the training data. The parameter values are then revised based on this fit. If the model data do not fit after a large number of these parameter revisions – particularly if the data trends are the exact opposite of the experimental data – the modeler will add some components to the pathway network, based on elements that are known (in the literature) to be related to the pathway.
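The following Python sketch illustrates the generic fit-and-diagnose loop described above, using a toy one-reaction model in place of the lab's MATLAB/ODE-solver code. The training/test split, the sum-of-squared-errors objective, the stochastic optimizer (scipy's dual_annealing, standing in here for simulated annealing), and the mass-conservation diagnostic follow the general workflow in the text; the model itself and all concrete numbers are hypothetical.

```python
# A sketch of the fit-and-diagnose loop: tune parameters on training data,
# validate on test data, then run a simple consistency diagnostic.
import numpy as np
from scipy.integrate import solve_ivp
from scipy.optimize import dual_annealing

def model(t_eval, k):
    """Toy pathway A -> B; returns concentrations of A and B over time."""
    rhs = lambda t, y: [-k * y[0], k * y[0]]
    return solve_ivp(rhs, (0, t_eval[-1]), [1.0, 0.0], t_eval=t_eval).y

# Hypothetical time-series data for metabolite B, split into training and test sets
t_train, B_train = np.array([0.5, 1.0, 2.0]), np.array([0.21, 0.39, 0.63])
t_test,  B_test  = np.array([3.0, 4.0]),      np.array([0.78, 0.86])

def objective(params):
    k = params[0]
    return float(np.sum((model(t_train, k)[1] - B_train) ** 2))

# "Tune" the parameter against the training data with a stochastic optimizer
result = dual_annealing(objective, bounds=[(0.01, 5.0)], seed=0)
k_fit = result.x[0]

# Validate against the held-out test data
test_sse = float(np.sum((model(t_test, k_fit)[1] - B_test) ** 2))

# A simple consistency diagnostic: total material should be conserved
A_end, B_end = model(np.array([4.0]), k_fit)[:, -1]
assert abs((A_end + B_end) - 1.0) < 1e-3, "reactant material lost or added"

print("fitted k:", k_fit, "test SSE:", test_sse)
```

In the actual Lab G workflow, when even repeated tuning of this kind fails, the pathway itself is revised, and those revisions are negotiated with the experimental collaborators, as described next.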
These pathway revisions, and their justifications, are discussed with the collaborators, and if a revision is considered "reasonable" by the experimenter, it becomes a stable component of the pathway. This pathway identification process is usually bottom-up, and creates a composite network, made up of parameter values and metabolites extracted from experiments in different species, different cell lines, and so forth. This composite is unique, and does not exist anywhere else in the literature.

One of the central problems the lab members face is the unavailability of rich and dependable data. One central use of data is to establish that the model captures a possible biological mechanism, and this is done by showing that the model's output matches the output from experiments (fitting data). A second use of data is to tune parameter values during the training phase of building the model. The fit with the experimental data from each training simulation can indicate how the model parameters need to be changed, so that the generated model data fit the training data. This use is highly dependent on the type of data available. Most of the time, the available data are 'qualitative' in nature – usually data showing how an experimental manipulation led to a change in a metabolite level from a baseline. Mostly, this is reported as a single data point, indicating the level going up or down, and then holding steady. However, when this type of (steady-state) data fits the results of the model, this fit does not indicate that the model has captured the biological mechanism, because a range of parameter values can generate model results that fit such sparse data – the fit is not unique. Further, since the pathway is an approximation, the modeler is uncertain in such cases as to whether the lack of a unique solution is due to poor estimation of parameters, or because some elements are missing from her pathway.

As a general example of modeling in this lab, consider G12, an electrical engineer by training, who is modeling atherosclerosis. When she started modeling, she had no background in atherosclerosis. She was provided a rough outline of the pathway by her experimental collaborators, and she learned more about the pathway by reading papers. The initial papers were from the collaborating lab, but then she spread out using the reference lists of those papers. The data available were mostly steady-state data. Once she had read a number of papers, she started building rudimentary computer models and testing these using available data. She then added some components to the model based on connections in the literature. It is worth noting here that while her problem mostly concerned endothelial cells, some of her parameters were taken from experiments with neurons, a very different cell class, and a domain of research (neuroscience) that is not usually connected to research on endothelial cells. After discussion, her collaborators endorsed some of her additions to the pathway as "reasonable".

Estimating parameter values for her model was a tough problem, since the data were sparse. To get parameter values that generated model data fitting the training data, she ran a large number of simulations and compared the model results with the training data. Finally, she arrived at a set of values that generated data that roughly matched the training data. Once this was done, she tested her model against the test data, and got a rough fit there as well.
Based on this fit, she generated a number of predictions from the model, by changing the parameter values. Some of these predictions would be tested by her experimental collaborators.

This exemplar is representative of much of the modeling in Lab G, where external representations (pathways and models) are built up from scattered and unreliable information, and discussion. These representations are built by modelers (engineers with no background in biology) using an iterative building strategy, starting from rough data and guidelines from experimental domain experts. This building process requires interaction between the modelers and the experimentalists, and, when working well, the interaction can foster a rich collaboration. The modeler is dependent on the experimentalist to validate the model's predictions, and the completed model's predictions can guide experimental decisions and lead to discoveries in critical areas such as biofuel production. The data from these experiments are then incorporated into the model, leading to another cycle of experiments and discoveries. In the next section, we outline a specific case of model building in Lab G, where the process of building an ODE model of lignin led to a remarkable discovery, changing the very basic-science knowledge that was used to create the model.

4. A remedy for recalcitrance: modeling of monolignol biosynthesis

One of the projects we studied in Lab G was the modeling of the pathway of lignin, a natural polymer that helps harden plant cell walls. The objective of the model was to help develop transgenic plants with lower lignin content, which would make commercial production of biofuel possible. The hardening property of lignin provides the plant with structural rigidity, and supports growth. While biologically useful for the plant, and also potentially useful for making carbon nanofibres (Ago et al., 2012), this hardening property is a problem for the biofuel industry, because lignin is difficult to break down (exhibits "recalcitrance") when biomass is processed into fermentable sugars (using enzymes/microbes). This recalcitrance of lignin makes the extraction of sugars difficult and costly, and biofuel production is therefore uncompetitive in comparison to fossil fuels. To solve this problem, genetically engineered plant varieties with lower lignin content have been developed (for various plant species such as alfalfa and poplar; Wilkerson et al., 2014), but these transgenic species are not optimal, as they exhibit unforeseen consequences such as decreasing only one of the three monomer building blocks of lignin (termed monolignols: H, G, S). Computational modeling has the potential to help in understanding the mechanisms underlying lignin production, and this understanding could contribute to the development of transgenic species that have low lignin content, but also good growth (very low lignin content will lead to plants not having structural integrity, and this could prevent growth). The modeling could also help develop plants with different ratios of the lignin monomers H, G and S, such as a lower S/G ratio, which also helps in improving the extraction of sugar from plant cellulose.

Lab G was approached by a research lab in another US state to model the pathway involved in lignin biosynthesis. The researchers wanted to understand how the monolignol components (monomers: H, G, S) of lignin are generated by the components of the lignin pathway.
This is a new modeling area; most of the other modeling efforts in the biofuel domain involve developing bioinformatics models, or models of organisms that are used to break up the plant biomass. G10 was the principal researcher working on the lignin biosynthesis model. Our analysis of this case is based on 5 interviews (1 to 1.5 hours each; the initial interview was open format, with semi-structured follow-ups), drafts of papers, presentation files, the dissertation proposal and its defense, the dissertation and its defense, and field notes on research presentations. G10 mostly worked from home, so field observations of the modeling process were not possible. G10 was a bioengineering Ph.D. student, with bachelor's and master's degrees in electrical engineering from outside the US. He had no previous knowledge of lignin or of biofuels. In the collaboration, G10 started by building a model of the lignin pathway in alfalfa (the species used by the experimental lab, which, along with switchgrass, is the plant species of choice in the US, given its geography), based on existing knowledge about the pathway from the literature, and data from papers. To progress, he needed new data from the experimental lab. However, he faced significant problems in getting data from his collaborators. The problems were three-fold.

One was that the nature of the data he wanted was different from the data collected by the collaborators.

….the biologist(s) produce the data they want. But those data are not actually what we want when we do the parameter estimation. ...so there ... might be some gap between these two, between us. … they only focus on one species. But even so they don't produce enough data. They don't produce, they don't measure the concentration for example. And they have few kinetic data. ...most of the data they have is just the output, the final output, ... the composition of the lignin. That's what they have. ... we [modelers] try to use only that kind of data to construct a full model.

A second problem was delays, and not having good access to the collaborators.

...the problem for me is that I cannot, I cannot ask the biologists question[s] as often as I want… because they are in [different US state]. So I ... email ... him and so the man who, who have contact [with] me is a research scientist…he is very busy. So sometimes you want to ask him [a] question, and he would get back to you in a month… or even two month… or ... don't even reply. ...that's a problem… because we are not expert ...[in] that field. We try to do the modeling right, but we need more information, not this data in the literature. They have more information than we know from the literature... I think the major problem is the communication channel between us and the biologists. ...if we can build up a, a very solid channel between us, I think it will be great for our work.

A third problem was the unwillingness of the collaborators to part with data before publication, and this issue was complicated by G10's status as a junior researcher.

... right now they just give us the data they have published. But we want more data which they … [have] not yet publish[ed]. And I cannot, from my past experience with them ... they just told me if they're working on something they need to get published first…and then they can give me the data later. And sometimes it's a problem of you know, it's the different status [between] me and the research scientist.
These three problems illustrate the inherent conflicts involved in the collaboration, and these conflicts make the interaction between the modeler and the experimentalist very different from the cooperative interaction that is seen in other distributed cognition analyses.

4.1 The Poplar model

While waiting to get data and feedback on the alfalfa model from his collaborators, G10 developed a model of lignin biosynthesis in poplar. A significant amount of data was available for this species, which is preferred for biofuel production in Europe. This model helped him to understand the lignin pathway better, and also to develop a two-step modeling technique that helped in dealing with the complexity of the lignin pathway. The model also led to the development of a novel parameter estimation method, suited to the available data (consisting mostly of S/G ratios in genetically engineered plants). This method involved a three-step process.

• First, each parameter value was constrained to a physiologically realistic range. This technique uses a biological constraint – survival – to simplify the mathematical problem of estimating parameter values. A Monte Carlo type random sampling of parameter values within this range was then done to simulate the model. These simulations generated a large set of model data (different S/G ratios).

• In a second step, correlations were established between the parameter values used in these simulations and the S/G ratios, to select significant parameters (any parameter where a small change led to a large change in the S/G ratio).

• In the final step, these significant parameters were optimized using computational techniques (linear programming and simulated annealing) so that the difference (SSE: sum of squared errors) between model results and experimental data was minimal. This generated an 'ensemble' of models with low SSE. This ensemble of models helped identify the key reactions that influenced the S/G ratio.

The ensemble of models was then used to simulate the test data (two transgenic experimental results not included in the training data). The models (note the plural; there was no unique model) were able to approximate these test data, as well as provide some mechanistic insight into the working of the pathway, which was found to be supported by available experimental evidence. Based on this validation, the ensemble of models was used to make some predictions about how the pathway could be engineered to reduce the S/G ratio. Even though this model of the lignin mechanism in poplar was developed based just on pathway information and data available in the literature, it helped create a modeling template for the lignin domain. In particular, the three-step process was useful in managing the mathematical complexity of the system, and the parameter estimation process was developed in a way that was tailored to data in this domain. These templates were then applied, with variations, to the alfalfa model, to make significant discoveries relating to the lignin pathway.

4.2 The (second) Alfalfa model

Once G10's collaborators had published some of their experimental work, they shared their data (Excel files) with him, which he incorporated into the alfalfa model. The collaborators did not provide G10 with the pathway structure; only data were given.

They don't tell us what the pathway looks like. We just get this information from the literature…., the pathway structure is from the literature…. everybody is using it.
The black arrows is [sic] what everybody thinks is right…. The red ones are the new findings.

There were also some other results that appeared from the experimental group at that time, which were also incorporated into the model, but based on the data the group reported in the literature. The newly added components can be seen in the figure below.

[Figure 4 about here]

The blue arrows are new components added by G10, based on his modeling and analysis. These are revisions to the established pathway that "everybody is using".

The blue ones are actually a hypothesis from our analysis of data – from the results of our analysis. So based on our analysis we suggest there are a few, for example, there are three, we need to add these three arrows [blue] here… so that our model can fit the data. And also we need to… we need to set these three reactions to be reversible so that our data can be explained by our model. So they are, these blue arrows are actually our findings, our new findings…. we have data from our collaborators and we analyze it with very simple linear models and based on our analysis results, and we suggest there… this original pathway needs to be modified so that this data can be explained.

The alfalfa model was more complex (24 equations) than the poplar one, since there was more real-world data to account for (7 transgenic species). Further, the data included the lignin levels at different points of growth of the plant stem (8 different internodes), and the lignin levels were different for each of these growth points. Developing a model whose output matched this differential expression of lignin at different growth points was a challenge. However, there were now lots of data available:

...we have many data...we have data for six or seven transgenic experiments and each experiment generate uh, about seven sets of data. So we have many data...

…our collaborators ...are generating transgenic plants in alfalfa and they...actually modify every single enzyme within the pathway. So they have more data than … we need to know.

The modeling approach used was a variation of the three-step one used in the case of poplar. First, the model assumed that the lignified stem tissues of wild-type alfalfa plants evolved to maximize the production of lignin monomers, and this biological assumption was used to develop a mathematical term for optimizing parameter values (the objective function). In the second step, the transgenic plant data (for every internode of the plant) were modeled using a method that assumes a genetically modified strain tries to function as similarly to the wild type as possible, within the limitations imposed by the genetic modification. This again is a biological assumption that provides a mathematical term. Finally, a Monte Carlo type simulation similar to the poplar case (random sampling of many parameter values) was performed, to understand the role of kinetic features of the participating enzymes (a schematic sketch of this estimation strategy is given below). This three-step modeling process led to a series of insights (six postulates), including the reversibility of some reactions (blue arrows pointing up in Figure 4) and the possibility of independent pathways for the synthesis of G and S monolignols.
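To make this estimation strategy concrete, the following is a minimal, hypothetical sketch of the Monte Carlo/ensemble idea used, with variations, in both the poplar and alfalfa models. The toy steady-state model, parameter ranges, 'knockdown' mechanism, and "experimental" values below are all invented for illustration and do not correspond to Lab G's actual equations or data; SciPy's dual_annealing is used here simply as a stand-in for the simulated annealing and linear programming routines mentioned above.

```python
# Hypothetical sketch of the three-step ensemble estimation idea.
# The toy model, parameter ranges, and "experimental" S/G ratios are invented.
import numpy as np
from scipy.stats import pearsonr
from scipy.optimize import dual_annealing

rng = np.random.default_rng(0)

# Toy steady-state model: S/G ratio as a function of four rate parameters.
def sg_ratio(params, knockdown=1.0):
    k1, k2, k3, k4 = params
    # 'knockdown' mimics a transgenic down-regulation acting on one enzyme.
    return (k1 * knockdown + k3) / (k2 + k4)

# Invented "experimental" S/G ratios: wild type plus two knockdown lines.
knockdowns = np.array([1.0, 0.5, 0.1])
observed   = np.array([2.0, 1.4, 0.9])

# Step 1: Monte Carlo sampling within physiologically plausible ranges.
bounds = [(0.1, 10.0)] * 4
samples = rng.uniform([b[0] for b in bounds], [b[1] for b in bounds],
                      size=(5000, 4))
ratios = np.array([sg_ratio(p) for p in samples])

# Step 2: correlate each parameter with the simulated output and keep the
# 'significant' ones (a small change produces a large change in S/G).
corrs = np.array([abs(pearsonr(samples[:, i], ratios)[0]) for i in range(4)])
significant = list(np.where(corrs > 0.5 * corrs.max())[0])

# Step 3: optimize the significant parameters (others held at mid-range)
# to minimize the SSE between model output and the "experimental" ratios.
def sse(free):
    params = np.full(4, 5.0)
    params[significant] = free
    pred = np.array([sg_ratio(params, kd) for kd in knockdowns])
    return float(np.sum((pred - observed) ** 2))

result = dual_annealing(sse, [bounds[i] for i in significant], seed=1)

# An 'ensemble' could be built by restarting the optimization from many
# points and keeping every parameter set whose SSE stays below a cutoff.
print("significant parameters:", significant, "best SSE:", round(result.fun, 3))
```

In the actual models, repeated cycles of this kind of sampling, screening, and refitting, across many more parameters, equations, and data sets, produced the postulates described above.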
Among these insights, however, one spectacular finding stood out: the modeling showed G10 that the traditional pathway, used by almost everyone in the field for twenty years, is incomplete, and that an element (termed X by G10; this naming is significant, as we discuss below) outside the standard pathway has a significant regulatory effect on the behavior of the lignin pathway.

So this is actually the biggest finding from our model. So by adding this reaction you can see that we hypothesize that there is another compound that can give a regulation… give a feed forward regulation to other parts of the pathway.

The figure below captures the regulatory role of X in two different scenarios. The thickness of the red and blue lines indicates the size of the regulatory effect. The size of the violet circle indicates the amount of X generated. A violet line ending in an arrow indicates activation; a violet line ending in a horizontal bar indicates inhibition.

[Figure 5 about here]

This finding would not have been possible without modeling, and the proposed role of X in the lignin pathway, if correct, would significantly revise the scientific consensus on the lignin pathway.

And this finding will not be possible if we haven't done any modeling… because well if you just look at the data, the data only tells you the composition of these three lignin...

G10's collaborators found this proposal interesting, and it significantly increased their willingness to collaborate: they conducted experiments to test the model's prediction. Their experiments identified a possible candidate metabolite that played the specific roles X played in G10's models. A paper outlining the modeling and experimental results, written jointly with the experimental collaborators, was published in a high-impact modeling journal. According to G10, the experimental collaborators knew about the existence of the candidate metabolite (outside the lignin pathway), but they did not consider the metabolite either as important or as influencing the production of lignin.

They just don't focus on it… because it's not important. I mean it is not…not important because they just focus on how the three products are being produced by this pathway. But they don't know how this part could affect this…the production of these products. So originally they're just focused on ok this is our precursor and how is this precursor being converted into the products we want. But they don't know how this compound… which is also derived from this precursor could affect the pathway.

This result clearly illustrates the ideal case of modeling: the model makes a significant experimental prediction, which is then tested and validated by the experimentalists. It shows the value modeling can provide for experimentalists. Based on this finding, and the collaboration that resulted from it, G10 was optimistic about an enhanced interdisciplinary collaboration that would provide more data from his collaborators.

I guess this findings [sic] will give them more confidence in what we are doing so maybe in the future they could be more willing to give us…to share more data.

In contrast to his statements at the beginning of the modeling, when he was frustrated about not having access to data, he was now cautiously optimistic about his collaborators' willingness to share data and to do additional experimental work to validate the model's predictions.

If our model really produce [sic] something very new, they would want to validate that.
...we also [have] come up with another hypothesis which they are looking into right now because it involves more complex regulation… the transcription, the transcriptional regulation that would involve more enzymes, more proteins, and they don't know if that is correct or not. ... so they, I believe they will do experiments to validate.

In the following section we examine some of the cognitive effects of the model building process.

5. The cognitive effects of building

The above case study illustrates two critical cognitive effects of building an external computational model in the process of scientific research:

1) New cognitive powers that facilitate scientific discovery emerge from building an external model.
2) Collaboration ecosystems emerge from building an external model.

We discuss these points below, and the following section outlines how they help extend the DC framework.

5.1 The emergence of cognitive powers

The original goal of the lignin project was tweaking a given pathway so as to make lignin break down more readily for biofuel production, which is an engineering goal. But G10 ended up changing the standardized pathway, the scientific consensus on the mechanism underlying lignin production. This is a basic biological science discovery, generated by an electrical engineer, based on a few months of modeling. The finding is remarkable. The discovery shows that the built external model is not just a replica of an existing standardized structure (the pathway) for the purpose of tweaking. The external model, and the process of building it, is a mechanism that affords discovering unknown features of the pathway. Approaching this discovery event from the point of view of understanding the role of external representations in cognition, a key question is: what are the cognitive changes involved in building the external simulation model, and how could these changes lead to the discovery?

The key cognitive change is that over the course of many iterations of model building and simulation, the model gradually becomes coupled with the modeler's mental system, particularly his imagination (mental model simulation) of the phenomena he is modeling, based on which the modeler explores different scenarios. The building process slowly creates an 'external imagination' that is closely coupled to the modeler's imagination system. This coupling allows "what if" questions in the mind of the modeler to be turned into detailed, and close to actual, explorations of the system. It is important to note that the model acquires this external imagination role only gradually, through its incrementally acquired ability to enact the behavior of the system it is modeling. As it is built over many iterations (as in the first poplar model), using many data sets, the model's output/behavior comes to parallel the pathway's dynamics. Each replication of experimental results by the model adds complexity to the model, and this process continues until the model fits all available experimental data well. At this point, the model can enact the behavior of the real system (the pathway being examined) and thus support detailed "what if" explorations that are not possible to do in the mind alone (see also Kirsh, 2010), or in experiments.
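As an illustration of the kind of "what if" exploration such a coupled model supports, the sketch below simulates a hypothetical two-metabolite pathway, inspects its state mid-run, and then re-runs it with an enzyme knocked down halfway through. The equations, rate constants, and perturbation are invented and are vastly simpler than Lab G's lignin model; they only illustrate the style of manipulation.

```python
# Hypothetical 'what if' exploration on a toy two-metabolite pathway.
# All equations and numbers are invented for illustration.
import numpy as np
from scipy.integrate import solve_ivp

def pathway(t, y, k_in, k12, k2out):
    a, b = y                       # concentrations of metabolites A and B
    da = k_in - k12 * a            # A is produced and converted to B
    db = k12 * a - k2out * b       # B is produced from A and consumed
    return [da, db]

base = dict(k_in=1.0, k12=0.5, k2out=0.3)

# Baseline run: simulate, then inspect the state at an intermediate time,
# something trivial in the model but impossible 'in the head'.
mid = solve_ivp(pathway, (0, 20), [0.0, 0.0], args=tuple(base.values()),
                dense_output=True)
print("state at t=10:", mid.sol(10.0))

# 'What if' run: knock the conversion step down to 20% halfway through,
# mimicking a time-lagged intervention on one enzyme.
first = solve_ivp(pathway, (0, 10), [0.0, 0.0], args=tuple(base.values()))
perturbed = dict(base, k12=0.1)
second = solve_ivp(pathway, (10, 20), first.y[:, -1],
                   args=tuple(perturbed.values()))
print("final state, baseline :", mid.y[:, -1])
print("final state, perturbed:", second.y[:, -1])
```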
Importantly, the model's ability to enact the real system behavior is a very complex judgment made by the modeler, based on a large number of iterations, in which a range of factors, such as sensitivity, stability, consistency, computational complexity, and the nature of the pathway, are explored. The gradual confidence in the model is thus a complex intuition about its overall performance, emerging over a long series of interactions and revisions, and does not depend on data fitting alone, even though fitting is the most critical process leading to this judgment. As the enaction ability of the model develops gradually through the building process, the model starts making manifest many behaviors the modeler might previously have only imagined. But the model goes further, as it also makes visible many details of the system's behavior which the modeler could not imagine (Kirsh, 2010), because of the fine grain and complexity of these details. The gradual process of building creates a close coupling between the model and the modeler's imagination, with each influencing the other. The computational model now works as an external component of the imagination system. This coupling significantly enhances the researcher's natural capacity for simulative model-based reasoning (Nersessian et al., 2003; Nersessian, 2008; Nersessian, 2009; Chandrasekharan, 2009; Chandrasekharan et al., 2012), particularly in the following ways:

1. It allows running many more simulations, with many variables at gradients not perceivable or manipulable by the mind (say, 0.0025 of metabolites a and b). These can then be compared and contrasted, which would be difficult to do in the mind.

2. It allows testing what-if scenarios that are impossible to run in the researcher's mind, such as: what would happen if I change variables 1 and 2 downwards, switch off 6 and 21, and raise 7 and 11, with a time lag between 16 and 19?

3. It allows stopping the simulation in between and checking its state. It also allows tracking the simulation's states at every time point, and, if something desirable is seen, tweaking the variables to get that effect more often and more consistently. This 'reverse simulation' is impossible to do in the mind, or in experiments.

4. It allows taking apart different parts of the system as modules, simulating them, and putting them together in different combinations.

5. It allows changing the time at which some in-between process kicks in (say, making it start earlier or later), and this can be done for many processes, which is very difficult to do in the mind, or in experiments.

6. It exposes the modeler to system behavior that experimenters would never encounter, as most of the above manipulations are not possible in experiments.

The process of building this distributed model-based reasoning system comprising researcher(s) and model leads to the creation of new or enhanced cognitive capacities. We list some of these here.

Integration/Synthesis

The model building process brings together a range of experimental data and creates a synthesis that exists nowhere else in the literature, and that would not be possible for the biology researchers to produce on their own. Further, given internet search engines and extensive online databases, current models synthesize more data than was ever possible before. In effect, the model-building process creates a running literature review.
The structure of the external model, however, is arbitrary, because the modeler is free to add/delete elements from the model, based on what he wants to focus on and on the computational/data resources available. Given this arbitrariness, how can a model enact a real-world system? Roughly, the enactive ability is achieved by infusing data into the model, using a highly recursive process. In the particular case of Lab G modeling, there are three distinct elements of the model (data fit, parameter values, and network structure) that can be altered in many ways to replicate experimental data. During the process of building the model using the training data, these elements are tuned iteratively and in tandem, until they lock together like pieces of a jigsaw puzzle. The lock-in happens because the iterative changes constrain each element, and their interactions. The engine behind this process is the fitting of data. The notion of fit is complex, as it is not a point-by-point replication of all experimental data for all variables. Rather, 'fit' usually means that the model replicates the trends (metabolite production going up/down) in the experimental data, for most of the major variables. In other words, fit is a global pattern, and it is approximate. While estimating the values for parameters, the modeler uses the fit with the experimental data as an anchor, in the following way. For each change in a parameter value, the way the model's output maps to the experimental results also changes. But only parameter values that improve the fit, or keep the fit at an acceptable level, are considered. The building process proceeds by using the global behavior of the model (fit) as an anchor to specify the local structure (parameter values), which is itself involved in generating the fit. The fit is also used to add/delete components in the pathway. Importantly, this process of synthesis is not planned; it emerges from exploration, and is best thought of as a coagulation process, where the three changeable elements (pathway structure, parameters, fit) are fluid in the beginning, but become more and more constrained by their interactions.

Abstraction

A critical cognitive effect of this sophisticated external component of the modeler's imagination is that it provides the modeler with an external, system view of the pathway: a global perspective on the system as a whole, gradually developed from the thousands of simulation runs, and from the analysis of the system dynamics for each simulation. This global view would not be possible to develop from mental simulation alone, particularly when the interactions between the elements are very complex and difficult to keep track of separately. It is developed solely through the model-building process. The modeler does not gain expertise in biology from this process. However, the system view, together with the detailed understanding of the dynamics, provides the modeler with an intuitive sense of the biological mechanism, that is, how the equation set used (the pathway structure) could generate different types of experimental data. This intuitive understanding of biological mechanism enables her to extend the pathway structure in a highly constrained fashion, to account for experimental data that could not be accounted for by the current pathway structure. The model-building process thus creates abstractive capabilities as well as intuition about the pathway's dynamics (what the Lab G director often calls "a feel for the model").
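This "feel for the model" accumulates partly through repeated judgments of fit of the kind described above. Because fit is a global, approximate pattern rather than a point-by-point replication, one hypothetical way to operationalize such a judgment is to score whether the model reproduces the direction of change of each major variable; the function and data below are invented for illustration and are not Lab G's actual fitting criterion.

```python
# Hypothetical trend-based notion of 'fit': the model is judged on whether it
# reproduces the direction of change (up/down) of each major variable, not on
# point-by-point agreement. All numbers below are invented.
import numpy as np

def trend_fit(model_series, experiment_series):
    """Average fraction of time steps where model and data change in the same direction."""
    matches = 0.0
    for var in experiment_series:
        exp_dir = np.sign(np.diff(experiment_series[var]))
        mod_dir = np.sign(np.diff(model_series[var]))
        matches += np.mean(exp_dir == mod_dir)
    return matches / len(experiment_series)

experiment = {"S": [1.0, 1.4, 1.9], "G": [2.0, 1.6, 1.1], "H": [0.2, 0.3, 0.3]}
model      = {"S": [0.8, 1.5, 2.2], "G": [2.4, 1.5, 1.0], "H": [0.2, 0.25, 0.4]}

print("trend fit:", round(trend_fit(model, experiment), 2))  # 1.0 = all trends match
```

A score like this would be only one ingredient of the judgment; as noted above, the modeler's confidence also rests on sensitivity, stability, and consistency across many runs.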
Developing such an intuitive sense of mechanism through interaction with the model also helps explain the success of Foldit, EteRNA and EyeWire players (also see Chandrasekharan, 2014).

Possible world thinking

The model-building process begins by capturing a reaction using variables, and then proceeds by identifying ideal combinations of numbers for these variables: combinations that generate model data similar to experimental data. Variables are a way of getting the building process going by representing the unknown using place-holders. But this place-holder representation has an interesting side effect. The variable representation provides the modeler with a more flexible way of thinking about the reaction, compared to the experimentalist, who works only with the one set of values she can control (all possible experimental results); these are privileged values, arising from a set of spatial/structural/thermal properties of the molecules, and from the possibility of controlling them. While the modeler also starts from this privileged view of experimental values, once the model is built, the variables can take any set of values, as long as they generate a fit with the experimental data. The variable representation enables the modeler to think of the real-world experimental value as one possible scenario, and to examine why this scenario, and not others, is commonly seen in nature. This helps her to think of generic design patterns and design principles that generate the natural order, such as thermodynamic principles, biological systems' preference for many small changes, their bias for reusing existing structures, and so forth. Identifying such meta-mechanisms is an ongoing effort in the lab. Thinking in variables also supports the modeler's objective of altering the structure of the reaction, in such a way that patterns commonly seen in nature (such as the thickness of lignin in plant cell walls) can be redesigned. This objective requires: 1) not fixating on the given natural order, and 2) thinking of the design principles underlying this natural order. The variable representation facilitates both these cognitive steps. More broadly, the variable representation puts the modeler in a counterfactual thinking mode, where it is very natural to think of reality, or the data from the real world, as representing one possible world (Chandrasekharan & Nersessian, 2008). This stance significantly expands the imagination space of the modeler, compared to the experimentalist, and even compared to the modeler's own imagination before building the model. Note that this does not mean that experiments do not provide counterfactual explorations, or that they are not, or cannot be treated as, part of the cognitive system. See Aurigemma et al. (2013) for an account where we examine experiments and prototype development from this perspective. Finally, since the process of constraining variables is gradual, the modeler encounters many variations and extremes in pathway dynamics, along with the parameter/network settings that generate these variations. These variations are not exhibited by the final constrained model. But since they are encountered, they provide insight into 1) which variables are most influential, 2) how they constrain each other, and 3) how they contribute to the nature of the real-world data (a simple illustration of such an exploration is sketched below). This gradual contraction of the imagination space, via encountering many extremes and variations, provides a much richer and more focused perspective than is possible by just mentally simulating the pathway.
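One hypothetical way to picture this exploration of extremes is a simple parameter sweep: varying one rate constant across several orders of magnitude and watching how a toy pathway's steady-state output changes, thereby exposing regimes of behavior that the final constrained model never exhibits. The toy model and numbers below are invented and are not Lab G's.

```python
# Hypothetical parameter sweep: exploring extremes of a toy pathway's
# behaviour before the parameters are constrained by fitting.
import numpy as np

def steady_state_sg(k1, k2, k3, k4):
    # Toy closed-form steady-state S/G ratio (not the real lignin model).
    return (k1 + k3) / (k2 + k4)

# Sweep one rate constant over four orders of magnitude, holding the rest
# fixed, to see which regimes of behaviour the pathway could in principle show.
for k2 in np.logspace(-2, 2, 9):
    print(f"k2 = {k2:8.3f}  ->  S/G = {steady_state_sg(1.0, k2, 0.5, 0.3):7.3f}")
```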
In sum, the coupling between the model and the modeler's imagination changes what is available to the mind and what it can do with it. One of the things the mind can do with the new configurations, manipulative abilities, and the system view is to ask the question: what happens if the very representational mapping of the model (the pathway) is changed? This question can be asked only after the modeler has created lots and lots of variation, and thus has a firm grasp of the system behavior, and every possible change other than this one has failed to provide a good fit. It is a very bold move, and the building process (particularly variable thinking, the variations seen during the modeling process, and the allocentric system view abstracted from the variations) provides the researcher with the warrant (and confidence) to make this bold proposal.

5.2 The emergence of collaboration ecospaces

As the case study shows, the modeler-experimentalist relation often starts out strained. However, in this case the successful building of the model, and the predictions that followed from it, led to a very close collaboration, in which the experimentalists tested and supported the modeler's prediction, and the two groups published a joint paper in a high-profile modeling journal. Thus model building can lead to a deepening of collaborations. The relationship between the modeler and experimenter(s) started off bumpy for a variety of reasons. The researchers had different representations of the mechanism, different levels of control, different goals/objectives, and little understanding of the nature of these differences. Once built, the model generated some interesting predictions, which the experimental collaborators also found interesting, even though the modeling process by which the predictions were made was opaque to them. The predictions generated by the model were then tested by the experimentalists, and the results supported the predictions. Although it does not always happen that experimentalists use predictions in this way, when it does, this process, over time, creates a new collaboration space, and offers the potential for bringing the modeling and experimental communities together, leading to a transdisciplinary research field that is distinct from the background disciplines of the researchers in both streams.

The building process also leads to new overlapping mental representations of the problem. For instance, each reaction occurs in a specific location in the cell (nucleus, organelles, cytoplasm), and every reaction is determined by the structural properties of the molecules involved. The experimentalist's judgments are based on this spatial complexity. But the ODE models are based on rates of change of metabolite concentrations, and thus do not take into account any of this spatial complexity, and the modeler with an engineering background is largely unaware of it. (An indication of this lack of structural understanding is G10's naming the unknown element outside the pathway 'X', instead of providing a possible metabolite name.) Over time, the building of the model, and the discussions with experimentalists about possible additions, can lead to the modeler developing more awareness of this spatial complexity, and sometimes to new modeling strategies (such as agent-based models or molecular dynamics models) that take the complexity into account.
In the other direction, discussions about the mathematical advantages provided by time-series data could influence experimentalists to report data across time, even if the results are not statistically significant. The building process thus can lead to overlapping problem representations, and to approaches that fit the other community's task better. We term this growth over time of a shared collaboration space and overlapping mental representations the mangrove function of external representations, after Clark's (1997) example of the growth of a mangrove tree, used to illustrate how writing can generate new thought capacities. A mangrove tree germinates from a seed floating in shallow water. It then sends out a complex web of roots to the ground, creating a "plant on stilts." This structure traps floating debris, and over time sand accumulates around this debris, creating a little island around the plant. The tree thus generates its own land to grow on. This is similar to the way the piecemeal building of the model generates its own task environment, collaboration space, research domain, and shared representations.

Another way in which simulation modeling environments build their own ground for future research is by providing the basis for novices in the lab to start at a more complex level than if they had to start from scratch. Further, built models link together a range of results in a domain, and thus prevent the dissipation of data and concepts. Simulation models are thus cultural artifacts that provide what Tomasello (1999) has called the ratchet effect. In this case, the effect enables modelers and experimentalists to build upon previous work.

6. Extending Distributed Cognition

The current DC framework considers external representations that are by and large already existing in the system. We have argued that understanding scientific cognition, particularly discoveries based on building new computational representations, requires examining the processes through which representations are built, and treating these building processes as part of the DC systems that solve problems and generate scientific discoveries. Abstracting from the case study, we have outlined above ways in which building an external simulation model changes the task environment: by affording new cognitive operations, by helping make discoveries, and by furthering collaboration. These effects of building, and their sub-components, extend the DC framework in the following ways.

DC has discussed how structures in the environment can be made part of the cognitive system to lower processing load, for instance in the process of landing an aircraft (Hutchins, 1995a) or playing Tetris (Kirsh & Maglio, 1994). In accounting for how G10 made the discovery about the pathway structure, we proposed that the discovery emerged from a coupling that gradually developed between his imagination system and the external model. This proposal extends the DC idea of making external structures part of the cognitive system to include complex mental processes, such as mental simulations, and complex external structures, such as computational simulations. Further, we propose that this coupling emerges gradually, from the start of the modeling process, and develops incrementally through the process of building the external model.
Current DC models do not provide an account of the process by which external structures are made part of the cognitive system, though Kirsh & Maglio (1994) discuss how mental rotation is offloaded to physical rotation in the video game Tetris, and how this skill develops with expertise. In our account of model-building, the internal-external coupling is driven not by the modeler's expertise, but by the way the model 'gains expertise', i.e., how well it enacts the real-world phenomena. This enaction feature, as well as the gradual integration of the internal and external imagination 'spaces' through systematic exploration of the possibilities of the external model, makes the process in our account different from offloading. Ours is an 'incorporation' account (see Chandrasekharan, 2014), where the building process leads to two kinds of integration. First, incorporation of real-world data into the model, which allows the model to enact the behavior of the system it parallels. Second, incorporation of the model as part of the imagination system, such that imagined scenarios are tried out in the model, and the results are integrated into the internal model of the system the model parallels. This notion of incorporation is novel, and the cognitive mechanisms involved in this process would be wider than just perception (as, for example, the highly visual nature of crowdsourcing computational media such as Foldit might lead one to infer); they would involve cognitive systems relating to the processing and understanding of motor control and tool use (Chandrasekharan, 2014).

Current DC accounts examine how external artifacts serve a coordination function. For instance, Hutchins (1995a) discusses how the speed bug provides a shared representation of critical speed values for the two pilots in the cockpit. However, DC models do not provide an account of how coordination emerges among actors through the use of external representations (but see Galantucci, 2005; Chandrasekharan & Tovey, 2012), particularly when actors have conflicting interests. The G10 case presents an instance of how building an external simulation model led to collaborations emerging between groups with conflicting interests. This case contributes to pushing the DC notion of coordination further, to include task environments where many kinds of conflicts exist between the actors. The mangrove effect captures aspects of this process metaphorically, but it also suggests that shared representations need not pre-exist for collaborations to develop; building an external model (such as a protein fold using Foldit) can lead to the emergence of shared representations of complex problems, and, partly through these, the building process can lead to better collaborations and discoveries.

DC assumes that new cognitive capacities emerge from building new environments, but does not discuss in detail the nature of such new cognitive capacities and their relation to the built environments. Our discussion of the way imagination is augmented by simulation model-building provides support for the notion that new cognitive capacities emerge from building new environments, and also shows how they are interwoven with the building of the external model. In particular, we argue that operations done using the external model are impossible to execute internally, and this is the reason the model is built.
This view rejects the equivalence between internal and external operations, an equivalence which suggests that building the external model is optional, and that the model is just "called on" when cognitive load becomes high. Our view suggests that the external model is a requirement for running many of the more complex imagination processes, and that the model therefore needs to be incorporated into the imagination system. This coupling between the external and internal models leads to discoveries. In the following section, we examine some of the theoretical and application possibilities offered by this significant extension of the DC framework.

6.1 Broader implications

Our account extends the notion of distributed cognition in two new directions. One is to the problem of discovery, where we propose that novel computational representations help generate discoveries through a process of incorporation, in which imagination based on internal representations is integrated with the building and behavior of a dynamic external representation. Second, in contrast to traditional DC accounts, we provide a process account of representation-integration and coordination, where both emerge through the process of building an external representation. These theoretical extensions open ways of addressing the following important issues.

Mechanisms underlying model-based discovery

Our account suggests that discoveries based on models emerge from a gradual 'incorporation' process that is akin to learning, where the model becomes part of the modeler's imagination system through systematic actions executed in the model and feedback from these actions. In this account, the model expands the action space of the modeler, similar to the way tools expand the action space of their users (Maravita & Iriki, 2004). The cognitive/neural mechanisms involved in this process are thus possibly similar to those involved in tool use, as described by Maravita & Iriki (2004). Our account thus provides a specific testable hypothesis, and extends the previous mechanism account of building-based discovery (Chandrasekharan, 2009; Chandrasekharan, 2014).

Design of novel digital media

Our account of discovery suggests a way of understanding the success of novel digital media for discovery, such as Foldit and EteRNA. Essentially, the process of incorporation is based on building new structures, then executing actions on this model, and getting feedback. The crowdsourcing games' re-representation of valid declarative knowledge as a control interface allows novices to build and then execute actions on the model and get feedback. Once the gamers' imaginations and the external media (which embed real-world system behavior) are coupled through these processes, the gamers get an implicit feel for the behavior of the real system as a whole, and they can then use this implicit understanding to design novel structures that can stand up to real-world testing. Importantly, the incorporation also provides a 'coordination space' that allows gamers to build on, and extend, others' designs. The success of the crowdsourcing model is thus based on the development of this shared action space. This approach to understanding new computational media for scientific discovery suggests that it would be possible to design similar control interfaces for non-structural modeling problems, such as the metabolic engineering case we have described, and for numerical simulations in general.
Working closely with Lab G, we have developed a prototype tangible control interface that turns model-building, parameter estimation and model fitting into tangible actions (see Wu et al., 2011), and we are currently extending this prototype further (NSF award HCC1320350: Getting a Grip on the Numerical World: Kinesthetic Interaction with Simulations to Support Collaborative Discovery in Systems Biology). This design is driven by the theoretical framework we have developed, and we hope to refine our framework through this design-based research project.

Understanding model-based learning

In mathematics and science education, manipulatives and models are commonly used to improve the learning of abstract concepts, such as fraction and area concepts, and of patterns that cannot be perceived directly, such as DNA structure and stereochemistry. More broadly, there are standard approaches to learning based on actions and feedback, such as learning-by-doing and activity-based learning, and software platforms that promote action-based learning, such as GeoGebra, NetLogo (Wilensky & Reisman, 2006) and Kill Math, which seeks to promote learning of math and science concepts through manipulations of objects and numbers on screen. The incorporation account provides a way of understanding how these model-based learning approaches work, and how they are related to scientific practice and discovery based on games such as Foldit. Essentially, scientific discovery games work by re-representing conceptual knowledge as a control interface, where global knowledge of the system can be gained through actions on models and feedback from these actions. In model-based learning, conceptual knowledge is gained through similar actions and feedback, via the manipulation of models and physical artifacts. In our account, the underlying mechanism in both cases would be the gradual integration of the internal imagination process and the external model, and the implicit understanding of the system's behavior that emerges from this incorporation. This account of model-based learning allows the use of DC as an analysis framework for understanding learning situations involving manipulable models and novel digital media (Landy & Goldstone, 2009; Ottmar, Landy & Goldstone, 2012; Landy, Allen & Zednik, 2014; Marghetis & Núñez, 2013; Majumdar et al., 2014), and also allows extending learning frameworks based on modeling (such as Modeling Theory; Hestenes, 2011), thus taking the DC framework back to its original learning roots, as proposed by Pea and Salomon (1993).

Conclusion

The study of scientific laboratories as distributed cognitive systems is in its infancy. We contend that such analyses can help establish distributed cognition as a leading framework for understanding how discovery and innovation happen in science and engineering. The incorporation account we provide extends the DC framework, and the implications we outline offer initial glimpses of how our account could help in understanding the way cognitive powers are developed through the building of novel computational representations, in the domain of scientific discovery and beyond.

Acknowledgments

We gratefully acknowledge the support of National Science Foundation REESE grant DRL097394084. Our analysis has benefited from discussions with Wendy Newstetter (co-PI), Lisa Osbeck, and Vrishali Subramanian. We thank the members of Lab G for granting us interviews.
We appreciate the helpful comments of the journal reviewers on two earlier versions, particularly David Landy, who sought an explanation of why we are using DC and an elaboration of what the new account makes possible. Thanks to Joshua Aurigemma for fine-tuning the figures and to Miles MacLeod for his comments on the penultimate draft.

References

Ago, M., Okajima, K., Jakes, J. E., Park, S., & Rojas, O. J. (2012). Lignin-based electrospun nanofibers reinforced with cellulose nanocrystals. Biomacromolecules, 13(3), 918-926.

Alac, M., & Hutchins, E. (2004). I See What You are Saying: Action as Cognition in fMRI Brain Mapping Practice. Journal of Cognition and Culture, 4, 629-661.

Aurigemma, J., Chandrasekharan, S., Newstetter, W., & Nersessian, N. J. (2013). Turning experiments into objects: The cognitive processes involved in the design of a lab-on-a-chip device. Journal of Engineering Education, 102(1), 117-140.

Baird, D. (2004). Thing knowledge: A philosophy of scientific instruments. University of California Press.

Banzhaf, W. (1994). Self-organization in a system of binary strings. In R. Brooks & P. Maes (Eds.), Artificial life IV (pp. 109-119). Cambridge, MA: MIT Press.

Becvar, A., Hollan, J., & Hutchins, E. (2008). Representing gestures as cognitive artifacts. In M. S. Ackerman, C. Halverson, T. Erickson & W. A. Kellog (Eds.), Resources, Co-evolution, and Artifacts: Theory in CSCW (pp. 117-143). New York: Springer.

Bohannon, J. (2009). Gamers unravel the secret life of protein. Wired, 17.05, April 20, 2009.

Cetina, K. K. (1999). Epistemic Cultures: How the Sciences Make Knowledge. Cambridge, MA: Harvard University Press.

Chandrasekharan, S., & Stewart, T. (2007). The origin of epistemic structures and proto-representations. Adaptive Behavior, 15(3), 329-353.

Chandrasekharan, S., & Nersessian, N. J. (2008). Counterfactuals in science and engineering. Comment on the target article "The Rational Imagination", by Ruth Byrne. Behavioral and Brain Sciences, 30, 454.

Chandrasekharan, S. (2009). Building to Discover: A Common Coding Model. Cognitive Science, 33(6), 1059-1086.

Chandrasekharan, S., & Tovey, M. (2012). Sum, Quorum, Tether: Design principles underlying external representations that promote sustainability. Pragmatics & Cognition, 20(3), 447-482.

Chandrasekharan, S., Nersessian, N. J., & Subramanian, V. (2012). Computational Modeling: Is this the end of thought experimenting in science? In J. Brown, M. Frappier, & L. Meynell (Eds.), Thought Experiments in Philosophy, Science and the Arts (pp. 239-260). London: Routledge.

Chandrasekharan, S. (2014). Becoming Knowledge: Cognitive and Neural Mechanisms That Support Scientific Intuition. In L. M. Osbeck & B. S. Held (Eds.), Rational Intuition: Philosophical Roots, Scientific Investigations (pp. 307-337). Cambridge University Press.

Clancey, W. J. (1997). Situated cognition: On human knowledge and computer representations. Cambridge University Press.

Clark, A. (1997). Being There: Putting Brain, Body, and World Together Again. Cambridge, MA: MIT Press.

Clark, A. (2004). Natural-born cyborgs: Minds, technologies, and the future of human intelligence. Oxford University Press.

Cole, M., & Engestrom, Y. (1993). A cultural-historical approach to distributed cognition. In G. Salomon (Ed.), Distributed cognitions: Psychological and educational considerations (pp. 1-46). Cambridge, UK: Cambridge University Press.

Cooper, S., et al. (2010). Predicting protein structures with a multiplayer online game. Nature, 466(7307), 756-60.

Cox, R.
(1999). Representation construction, externalized cognition and individual differences. Learning and Instruction, 9, 343-363.

Dunn, J., & Clark, M. (1999). Life music: The sonification of proteins. Leonardo, 32(1), 25-32.

Edwards, L., Peng, Y., & Reggia, J. (1998). Computational models for the formation of protocell structure. Artificial Life, 4(1), 61-77.

Galantucci, B. (2005). An experimental study of the emergence of human communication systems. Cognitive Science, 29, 737-767.

Galison, P. (1997). Image and Logic: A Material Culture of Microphysics. Chicago: University of Chicago Press.

Giere, R. (2002). Scientific Cognition as Distributed Cognition. In P. Carruthers, S. Stitch and M. Siegal (Eds.), The Cognitive Bases of Science. Cambridge: Cambridge University Press.

Goodwin, C. (1995). Seeing in depth. Social Studies of Science, 25, 237-274.

Goodwin, C. (1997). The Blackness of Black: Color Categories as Situated Practice. In L. B. Resnick, R. Säljö, C. Pontecorvo, & B. Burge (Eds.), Discourse, Tools and Reasoning: Essays on Situated Cognition (pp. 111-140). Berlin, Heidelberg, New York: Springer.

Hall, R., Stevens, R., & Torralba, T. (2002). Disrupting representational infrastructure in conversation across disciplines. Mind, Culture, and Activity, 9, 179-210.

Hall, R., Wieckert, K., & Wright, K. (2010). How does cognition get distributed? Case studies of making concepts general in technical and scientific work. In M. Banich & D. Caccamise (Eds.), Generalization of knowledge: Multidisciplinary perspectives. Psychology Press.

Hegarty, M. (2004). Diagrams in the mind and in the world: Relations between internal and external visualizations. In A. Blackwell, K. Mariott & A. Shimojima (Eds.), Diagrammatic Representation and Inference. Berlin: Springer.

Hestenes, D. (2011). Notes for a Modeling Theory. In Proceedings of the 2006 GIREP conference: Modeling in physics and physics education (Vol. 31).

Hollan, J., Hutchins, E., & Kirsh, D. (2000). Distributed cognition: A new theoretical foundation for human-computer interaction research. ACM Transactions on Human-Computer Interaction, 7, 174-196.

Hutchins, E. (1995). Cognition in the wild. Cambridge, MA: MIT Press.

Hutchins, E. (1995a). How a cockpit remembers its speeds. Cognitive Science, 19, 265-288.

Khatib, F., DiMaio, F., Foldit Contenders Group, Foldit Void Crushers Group, Cooper, S., Kazmierczyk, M., Gilski, M., Krzywda, S., Zabranska, H., Pichova, I., Thompson, J., Popovic, Z., Jaskolski, M., & Baker, D. (2011). Crystal structure of a monomeric retroviral protease solved by protein folding game players. Nature Structural & Molecular Biology, 18, 1175-1177.

Kim, J. S., Greene, M. J., Zlateski, A., Lee, K., Richardson, M., Turaga, S. C., ... & Seung, H. S. (2014). Space-time wiring specificity supports direction selectivity in the retina. Nature, 509(7500), 331-336.

Kirsh, D., & Maglio, P. (1994). On Distinguishing Epistemic from Pragmatic Actions. Cognitive Science, 18(4), 513-549.

Kirsh, D. (1996). Adapting the environment instead of oneself. Adaptive Behavior, 4, 415-452.

Kirsh, D. (2001). The context of work. Human-Computer Interaction, 16, 305-322.

Kirsh, D. (2010). Thinking With External Representations. AI & Society, 25(4), 441-454.

Koerner, B. I. (2012). New Videogame Lets Amateur Researchers Mess With RNA. Wired, July 5, 2012.

Landy, D. H., & Goldstone, R. L. (2009). How much of symbolic manipulation is just symbol pushing? Proceedings of the Thirty-First Annual Conference of the Cognitive Science Society, 1072-1077.
Amsterdam, Netherlands: Cognitive Science Society.

Landy, D., Allen, C., & Zednik, C. (2014). A perceptual account of symbolic reasoning. Frontiers in Psychology, 5.

Langton, C. G. (1984). Self-reproduction in cellular automata. Physica D, 10, 135-144.

Langton, C. (1990). Computation at the edge of chaos: Phase transitions and emergent computation. Physica D, 42, 12-37.

Larkin, J. H., & Simon, H. A. (1987). Why a diagram is (sometimes) worth ten thousand words. Cognitive Science, 11, 65-100.

Latour, B. (1986). Visualisation and cognition: Thinking with eyes and hands. Knowledge and Society, 6, 1-40.

Latour, B., & Woolgar, S. (1986). Laboratory Life: The Construction of Scientific Facts. Princeton: Princeton University Press.

Lave, J. (1988). Cognition in Practice: Mind, Mathematics, and Culture in Everyday Life. New York: Cambridge University Press.

Lee, J., Kladwang, W., Lee, M., Cantu, D., Azizyan, M., Kim, H., ... & Das, R. (2014). RNA design rules from a massive open laboratory. Proceedings of the National Academy of Sciences, 111(6), 2122-2127.

Lenhard, J. (2004). Surprised by a nanowire: Simulation, control, and understanding. Philosophy of Science, 73, 605-616.

MacLeod, M., & Nersessian, N. J. (2013). Coupling simulation and experiment: The bimodal strategy in integrative systems biology. Studies in History and Philosophy of Science Part C: Studies in History and Philosophy of Biological and Biomedical Sciences, 44(4), 572-584.

MacLeod, M., & Nersessian, N. J. (2013). Building Simulations from the Ground-Up: Modeling and Theory in Systems Biology. Philosophy of Science, 80, 533-556.

MacLeod, M., & Nersessian, N. J. (in press). Modeling systems-level dynamics: Understanding without mechanistic explanation in integrative systems biology. Studies in History and Philosophy of the Biological and Biomedical Sciences.

Majumdar, R., Kothiyal, A., Pande, P., Agarwal, H., Ranka, A., Murthy, S., & Chandrasekharan, S. (2014). The enactive equation: Exploring how multiple external representations are integrated, using a fully controllable interface and eye-tracking. In Proceedings of the Sixth International Conference on Technology for Education (T4E). IEEE.

Maravita, A., & Iriki, A. (2004). Tools for the body (schema). Trends in Cognitive Sciences, 8(2), 79-86.

Marghetis, T., & Núñez, R. (2013). The motion behind the symbols: A vital role for dynamism in the conceptualization of limits and continuity in expert mathematics. Topics in Cognitive Science, 5(2), 299-316.

Meli, D. B. (2006). Thinking with objects: The transformation of mechanics in the seventeenth century. Johns Hopkins University Press.

Nersessian, N. J. (1991). Why do thought experiments work? Proceedings of the 13th Annual Meeting of the Cognitive Science Society, pp. 480-486.

Nersessian, N. J. (1992). In the Theoretician's Laboratory: Thought Experimenting as Mental Modeling. PSA: Proceedings of the Biennial Meeting of the Philosophy of Science Association, 2, 291-301.

Nersessian, N. (2002). The cognitive basis of model-based reasoning in science. In P. Carruthers, S. Stich, & M. Siegal (Eds.), The cognitive basis of science (pp. 133-153). Cambridge, England: Cambridge University Press.

Nersessian, N. J., Kurz-Milcke, E., Newstetter, W. C., & Davies, J. (2003). Research laboratories as evolving distributed cognitive systems. Proceedings of the 25th Annual Conference of the Cognitive Science Society, pp. 857-862.

Nersessian, N. J. (2005).
Interpreting scientific and engineering practices: Integrating the cognitive, social, and cultural dimensions. In M. Gorman, R. D. Tweney, D. Gooding & A. Kincannon (Eds.), Scientific and Technological Thinking (pp. 17-56). Hillsdale, NJ: Lawrence Erlbaum.

Nersessian, N. J. (2008). Creating Scientific Concepts. Cambridge, MA: MIT Press.

Nersessian, N. J. (2009). How do engineering scientists think? Model-based simulation in biomedical engineering laboratories. Topics in Cognitive Science, 1, 730-757.

Nersessian, N. J. (2012). Engineering concepts: The interplay between concept formation and modeling practices in bioengineering sciences. Mind, Culture, and Activity, 19(3), 222-239.

Nersessian, N. J., & Chandrasekharan, S. (2009). Hybrid analogies in conceptual innovation in science. Cognitive Systems Research, 10, 178-188.

Newstetter, W., Kurz-Milcke, E., & Nersessian, N. (2004). Cognitive partnerships on the bench tops. In Y. B. Kafai, W. A. Sandoval, N. Enyedy, A. S. Nixon, & F. Herrera (Eds.), Proceedings of the 6th international conference on learning sciences (pp. 372-379). Hillsdale, NJ: Erlbaum.

Newell, A. (1980). Physical symbol systems. Cognitive Science, 4(2), 135-183.

Osbeck, L., & Nersessian, N. J. (2006). The distribution of representation. The Journal for the Theory of Social Behaviour, 36, 141-160.

Osbeck, L. M., Nersessian, N. J., Malone, K. R., & Newstetter, W. C. (2010). Science as psychology: Sense-making and identity in science practice. Cambridge University Press.

Ottmar, E., Landy, D., & Goldstone, R. L. (2012). Teaching the perceptual structure of algebraic expressions: Preliminary findings from the pushing symbols intervention. In Proceedings of the Thirty-Fourth Annual Conference of the Cognitive Science Society (pp. 2156-2161).

Pea, R. (1993). Practices of distributed intelligence and designs for education. In G. Salomon (Ed.), Distributed cognition: Psychological and educational considerations (pp. 47-87). Cambridge, UK: Cambridge University Press.

Polanyi, M. (1958). Personal Knowledge: Towards a Post-Critical Philosophy. University of Chicago Press.

Polanyi, M. (1966). The Tacit Dimension. London: Routledge.

Prusinkiewicz, P., Lindenmayer, A., & Hanan, J. (1988). Developmental models of herbaceous plants for computer imagery purposes. Computer Graphics, 22(4), 141-150.

Reynolds, C. (1987). Flocks, herds, and schools: A distributed behavioral model. Computer Graphics, 21(4), 25-34.

Rheinberger, H.-J. (1997). Toward a History of Epistemic Things: Synthesizing Proteins in the Test Tube. Stanford, CA: Stanford University Press.

Runions, A., Fuhrer, M., Lane, B., Federl, P., Rollang-Lagan, A., & Prusinkiewicz, P. (2005). Modeling and visualization of leaf venation patterns. ACM Transactions on Graphics, 24(3), 702-711.

Salomon, G. (1993). No distribution without individuals' cognition: A dynamic interactional view. In G. Salomon (Ed.), Distributed cognition: Psychological and educational considerations (pp. 111-138). Cambridge, UK: Cambridge University Press.

Schwartz, D. L. (1995). Reasoning about the referent of a picture versus reasoning about the picture as the referent. Memory and Cognition, 23, 709-722.

Schwartz, D. L., & Martin, T. (2006). Distributed learning and mutual adaptation. Pragmatics & Cognition, 14, 313-332.
Schneider, B. (2012). Climate model simulation visualization from a visual studies perspective. Wiley Interdisciplinary Reviews: Climate Change, 3(2), 185-193.

Sims, K. (1994). Evolving virtual creatures. Computer Graphics, 8, 15-22.

Sincell, M. (2000). NanoManipulator lets chemists go mano a mano with molecules. Science, 290, 1530.

Suchman, L. (1987). Plans and situated actions: The problem of human/machine communication. Cambridge, UK: Cambridge University Press.

Tomasello, M. (1999). The Cultural Origins of Human Cognition. Harvard University Press.

Vera, A., & Simon, H. (1993). Situated action: A symbolic interpretation. Cognitive Science, 17, 7-48.

Von Neumann, J. (1951). The general and logical theory of automata. Cerebral Mechanisms in Behavior, 1, 41.

Wilensky, U., & Reisman, K. (2006). Thinking like a wolf, a sheep, or a firefly: Learning biology through constructing and testing computational theories—an embodied modeling approach. Cognition and Instruction, 24(2), 171-209.

Wilkerson, C. G., Mansfield, S. D., Lu, F., Withers, S., Park, J. Y., Karlen, S. D., ... & Ralph, J. (2014). Monolignol ferulate transferase introduces chemically labile linkages into the lignin backbone. Science, 344(6179), 90-93.

Winsberg, E. (2006). Handshaking your way to the top: Simulation at the nanoscale. Philosophy of Science, 73, 582-594.

Wu, A., Caspary, E., Yim, J., Mazalek, A., Chandrasekharan, S., & Nersessian, N. J. (2011). Kinesthetic Pathways: A Tabletop Visualization to Support Discovery in Systems Biology. Proceedings of the ACM Creativity and Cognition Conference, Atlanta, USA.

Figure Captions

Figure 1: A sample pathway diagram. Metabolite names have been replaced with letters. The dark lines indicate connections where material moves across nodes; the dotted lines indicate regulatory connections. Note the question marks over some connections that are postulated by the modeler.

Figure 2: An outline of the modeling process in Lab G.

Figure 3: A sample set of graphs that show a 'fit'. The blue and red lines indicate experimental data, the dots indicate model results, and blue indicates the baseline. Note that the green dot in the last graph in the top row shows a model result (rising above the baseline) that is the exact opposite of the experimental result (falling below the baseline). The previous graph also shows a model result that falls well outside the experimental data points. However, this set of graphs, from a published paper from Lab G, is presented as a good 'fit', indicating that fit is a global feature.

Figure 4: The lignin pathway in the second alfalfa model. Metabolite names have been replaced with letters and random numbers. The red lines and elements indicate new data added. The blue lines indicate G10's findings.

Figure 5: The lignin pathway in the second alfalfa model, with the new element X and the role it plays in the lignin pathway at different levels of concentration. Metabolite names have been replaced with letters and random numbers. The red and blue lines indicate channels that lead to S and G monomers. The thickness of these lines indicates the size of the channel (which determines the rate of the reaction). The violet lines indicate the influence of X on these channels. Violet lines ending with an arrow indicate a positive influence; horizontal ends indicate a negative influence. The size of the purple circle around X indicates the concentration of X. The red cross indicates that the reaction is blocked.

[Figures 1-5 follow in the original manuscript]