Why Data Science?

David Skillicorn
School of Computing, Queen's University
skill@cs.queensu.ca
research.cs.queensu.ca/home/skill
skillicorn.wordpress.com

“Now is the time to begin thinking of Data Science as a profession not a job, as a corporate culture not a corporate agenda, as a strategy not a stratagem, as a core competency not a course, and as a way of doing things not a thing to do.”
– Kirk Borne

Data Science = inductive modelling of systems from (big) data about them

I. Changing science
II. Big data
III. Inductive modelling
IV. An example payoff – social networks

I. Changing Science

Enlightenment view of natural history (i.e. science until the middle of the 20th century):
• Create a hypothesis.
• Devise an experiment to test the hypothesis.
• If the experiment provides evidence that supports the hypothesis, promote it to a theory.
The first step is really difficult!

Popperian view of science (middle of the 20th century onwards):
• Create a hypothesis.
• Devise an experiment to falsify the hypothesis.
• If the experiment provides evidence that fails to falsify the hypothesis, add it to the list of plausible theories.
• Rank the list of theories by how difficult they are to falsify.

Modifications to the Popperian view:
• Kuhn's point that reordering the list is subjective
• Quantum physics issues around observers
• Post-modern worldviews discounting “objectivity”
• The tendency of effects to disappear in repeated experiments (the “decline effect”)

The heart of hypothesis testing/falsifying is the controlled experiment. It is always problematic:
• it can't be done in e.g. astrophysics or national economies;
• it is often unethical in e.g. medical experiments (drugs that work really well or really badly);
• subjects are chosen in a controlled way, but the pool is not controlled, e.g. psych experiments on undergrads.

[Diagram: science by controlled experiment – the experiment pokes the system with a clever stick / asks a clever question, and statistical machinery turns the system's responses into a model.]

II. Why BIG Data

“With enough data, you can find any pattern” (aka “data dredging”).

Key insight: if a pattern is really present, more data makes it more obvious, not less obvious. As the data increases, real structures become more obvious and accidental structures become less obvious.

[Figure: a scatterplot – what line fits these points?]

III. Inductive modelling

Reverse engineering a system from the data about it.

[Diagram: the human, the system being studied, plentiful but surface data about the system, models induced from the data, a model evaluation process, and the model of choice.]

Replace the “controlled experiment” by “natural experiments” – use the data that's available. There's typically lots of data.

The availability of data allows the use of “test sets” – data describing the system, but never seen by the modelling process. Accuracy of modelling on this test set data builds confidence in the model(s), maybe more than statistical machinery does.

Key benefits:
1. Less need for “creativity” (clever data collection replaces clever hypothesis generation).
2. Models are generated inductively, i.e. pushed to the scientist.
3. Models that would not even have occurred to the scientist get considered.

Weaknesses:
1. Choice of model-building technique(s).
2. Most techniques are parametric, but it's hard to choose the parameters in a principled way.
These are addressed by iterative modelling, using knowledge from early attempts to improve later modelling.

Even within data science there are two approaches. Leo Breiman, Statistical Modeling: The Two Cultures:
“There are two cultures in the use of statistical modeling to reach conclusions from data. One assumes that the data are generated by a given stochastic data model [statisticians]. The other uses algorithmic models and treats the data mechanism as unknown [data miners].”
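As a deliberately minimal illustration of the two cultures and of the test-set idea above (not from the talk, and assuming scikit-learn plus purely synthetic data): the sketch below fits one model from each culture – a linear regression (a stochastic data model) and a random forest (an algorithmic model) – and judges both on held-out data the modelling process never sees.

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression        # "stochastic data model" culture
from sklearn.ensemble import RandomForestRegressor       # "algorithmic model" culture
from sklearn.metrics import r2_score

# Synthetic system: a nonlinear signal plus noise (stand-in for "data about the system").
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(2000, 1))
y = np.sin(X[:, 0]) + 0.3 * rng.normal(size=2000)

# Hold out a test set never seen during modelling; accuracy here builds confidence.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

for model in (LinearRegression(), RandomForestRegressor(n_estimators=200, random_state=0)):
    model.fit(X_train, y_train)
    print(type(model).__name__, "test R^2:", round(r2_score(y_test, model.predict(X_test)), 3))

Rerunning with more rows makes the gap between the two models steadier – a small echo of the point that more data makes real structure more obvious.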
Places where Data Science is making a difference:
• Physics, e.g. CERN
• Climate
• Fluid dynamics
• Genome-wide association studies
• Natural language (meaning from syntax), e.g. Google
• Construction of large software systems
• Shopping patterns / supply chain management
• Mortgage approvals
• Insurance pricing
• Traffic/congestion prediction
• Behaviour modelling
• Police scheduling
• Recommender systems

Both the social sciences and the humanities can use this new scientific method (where they could not use the old one).

A major application: human social behaviour, aka social network analysis.

Human social networks are based on local decisions to connect, like, or dislike, but produce global, emergent regularities and structures.

Online activity makes it easy to capture when people create relationships, and often what kind of relationships these are:
• email
• forum/web page posts
• Facebook/LinkedIn
• physical co-presence (via GPS devices)

The results have mostly been surprising:
• Milgram's 6 degrees of separation – social networks have much smaller diameters than we expect.
• Dunbar's Number – social networks have degrees (125) that match real-world connectivity (so probably cognitive).
• Powers of almost-three – from each individual's perspective, connections are layered: 3 + 9 + 27 + 81 = 120.
• Influence flows “over the horizon” – happiness, sadness, smoking, weight, … are affected by friends of friends whom we haven't even met.
• Hairball and whiskers – clusters or communities are in the eye of the beholder.

The connections (relationships) in social networks encode many different properties:
• symmetric relationship → undirected edge
• asymmetric (liking/influence) → directed edge
• qualitatively different (friend/colleague) → typed edge
• like/hate → signed edge
• relationship intensity, which can change with time
Most of these properties have not been studied much.

The Medici have been studied extensively because of their meteoric rise from obscurity to become Dukes of Florence (and to supply 3 popes). The conventional view: they acquired power by acting as intermediaries between the oligarchs and the nouveau riche. We can examine this using the social network of Florence.

Social networks are hard to study directly, but ways to embed them faithfully in low-dimensional spaces are known.

[Figures: embeddings of the Florentine social network]
• The undirected social network (oligarchs in black, Medici-aligned families in red). Closeness to the centre = importance; closeness to another family = similarity of connections/role.
• The typed network: red edges = business relationships, green edges = social relationships, blue edges = the difference between the two roles.
• An embedding accounting for the direction of relationships, using the Chung embedding technique.
• An embedding accounting for direction using a new technique developed by my group – the difference between in and out measures gives the net flow across each node.

All of these embeddings support the conventional view that the Medici do occupy a central position, mediating between the nouveau riche and the oligarch bloc. The variants (especially the differences between embedded versions of the same group) give insights into exactly how this worked – and also allow edge prediction: guesses about the characteristics of ‘missing’ edges.

However, the relationships among the most important subset of families are more subtle …

[Figure: the most important families – typed, undirected edges (red = business; green = personal; blue = role difference).]
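For readers who want to poke at the Florence network themselves, here is a minimal sketch, assuming NetworkX is installed. It uses the Padgett Florentine-families marriage network that ships with NetworkX (a close relative of the network above, not the dataset used in the talk), a generic spectral embedding rather than the Chung or in/out-flow embeddings, and betweenness centrality as a crude stand-in for “acting as an intermediary”.

import networkx as nx

# Padgett's Florentine marriage network: 15 families, undirected ties.
G = nx.florentine_families_graph()

# A generic low-dimensional embedding (eigenvectors of the graph Laplacian).
pos = nx.spectral_layout(G)

# Betweenness centrality: how often a family lies on shortest paths between others,
# a rough proxy for mediating between other families.
centrality = nx.betweenness_centrality(G)

for family, score in sorted(centrality.items(), key=lambda kv: -kv[1])[:5]:
    x, y = pos[family]
    print(f"{family:12s} betweenness={score:.3f}  embedded at ({x:+.2f}, {y:+.2f})")

Even on this small marriage network the Medici come out with by far the highest betweenness, consistent with the conventional view; the directed and typed analyses described above need the richer embeddings, which are not sketched here.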
Summary

Data “science” (which includes data humanities and data social science) is a new way of understanding complex systems – systems that are more complex than conventional science can handle. This includes systems that involve many humans – economic, social, political.

I've shown some hints of the payoffs, using social network analysis as an application domain. We learn many unexpected aspects of how we relate to one another – both macro-scale and micro-scale.

Thank you – and questions?

David Skillicorn
skill@cs.queensu.ca
www.cs.queensu.ca/home/skill
skillicorn.wordpress.com