Why data science (ppt slides)

Why Data Science?
David Skillicorn
School of Computing
Queen’s University
skill@cs.queensu.ca
research.cs.queensu.ca/home/skill
skillicorn.wordpress.com
“Now is the time to begin thinking of Data Science
as a profession not a job,
as a corporate culture not a corporate agenda,
as a strategy not a stratagem,
as a core competency not a course, and
as a way of doing things not a thing to do.”
- Kirk Borne
Data Science = Inductive modelling of systems from (big)
data about them
I. Changing science
II. Big data
III. Inductive modelling
IV. An example payoff – social networks
I. Changing Science
Enlightenment view of natural history:
(i.e. science until the middle of C20)
Create hypothesis
Devise experiment to test hypothesis
If experiment provides evidence that supports
hypothesis, promote it to a theory
The first step is really difficult!
Popperian view of science:
(middle of C20 onwards)
Create hypothesis
Devise experiment to falsify hypothesis
If experiment provides evidence that fails to
falsify hypothesis, add it to the list of
plausible theories
Rank the list of theories by how difficult they are
to falsify
Modifications to the Popperian view:
Kuhn’s point that reordering the list is subjective
Quantum physics issues around observers
Post-modern worldviews discounting “objectivity”
The tendency of effects to disappear in repeated
experiments (the “decline effect”)
The heart of hypothesis testing/falsifying:
The controlled experiment
Always problematic
• can’t be done in e.g. astrophysics, national economies
• often unethical in e.g. medical experiments (drugs that
work really well or really badly)
• subjects chosen in a controlled way but the pool not
controlled e.g. psych experiments on undergrads
Science by controlled experiment
[Diagram: System, Statistical machinery, Model. The experiment pokes the system with a clever stick, i.e. asks a clever question.]
II. Why BIG Data
“With enough data, you can find any pattern” (aka “data
dredging”)
Key insight: If a pattern is really present, more data
makes it more obvious, not less obvious.
As the data increases, real structures become more obvious; accidental structures less obvious.
What line fits these points?
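The key insight above can be checked with a small simulation. This is only a sketch under assumed settings: the true line y = 2x + 1, the noise level, and the sample sizes are all invented for illustration. It fits a least-squares line to increasingly large samples and, alongside, measures the correlation between x and a column of pure noise (an "accidental" structure).

```python
import random
import statistics


def fit_slope(xs, ys):
    """Ordinary least-squares slope of y on x."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx


def correlation(xs, ys):
    """Pearson correlation coefficient."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def experiment(n, rng):
    """Return (fitted slope, correlation of x with unrelated noise)."""
    xs = [rng.uniform(0, 10) for _ in range(n)]
    # Real structure (assumed for this sketch): y = 2x + 1 plus noise.
    ys = [2.0 * x + 1.0 + rng.gauss(0, 3.0) for x in xs]
    # Accidental structure: a variable with no relation to x at all.
    unrelated = [rng.gauss(0, 1.0) for _ in range(n)]
    return fit_slope(xs, ys), correlation(xs, unrelated)


rng = random.Random(0)
results = {n: experiment(n, rng) for n in (20, 200, 2000, 20000)}
for n, (slope, spurious) in results.items():
    print(f"n={n:>6}  slope error={abs(slope - 2.0):.4f}  "
          f"spurious correlation={abs(spurious):.4f}")
```

On typical runs the slope error and the spurious correlation both shrink roughly like 1/sqrt(n), which is the sense in which more data sharpens real structure and washes out accidental structure.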
III. Inductive modelling
Reverse engineering a system from the data about it
[Diagram labels: Human; System being studied; Plentiful, but surface, data about the system; Models induced from the data; Model evaluation process; Model of choice]
Replace “controlled experiment” by “natural experiments”
– use the data that’s available.
There’s typically lots of data.
The availability of data allows the use of “test sets” – data
describing the system, but never seen by the modelling
process. Accuracy of modelling using this test set data
builds confidence in the model(s), maybe more than
statistical machinery.
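A minimal sketch of the test-set idea, assuming synthetic data and a straight-line model (neither comes from the talk): the model is fitted on one part of the data and scored only on records it has never seen, and its error is compared with a trivial baseline that always predicts the training mean.

```python
import random
import statistics


def fit_line(points):
    """Least-squares slope and intercept for a list of (x, y) pairs."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in points)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, my - slope * mx


def mean_squared_error(points, predict):
    return statistics.fmean([(y - predict(x)) ** 2 for x, y in points])


rng = random.Random(1)
# Hypothetical system: y = 2x + 1 with Gaussian noise.
data = [(x, 2.0 * x + 1.0 + rng.gauss(0, 1.0))
        for x in (rng.uniform(0, 10) for _ in range(500))]

# Hold out 20% of the records; the modelling step never sees them.
rng.shuffle(data)
cut = int(0.8 * len(data))
train, test = data[:cut], data[cut:]

slope, intercept = fit_line(train)
train_mean = statistics.fmean([y for _, y in train])

model_error = mean_squared_error(test, lambda x: slope * x + intercept)
baseline_error = mean_squared_error(test, lambda x: train_mean)
print(f"test MSE: model={model_error:.2f}  baseline={baseline_error:.2f}")
```

A model that beats the baseline on held-out data has earned some confidence without any appeal to distributional assumptions.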
Key benefits:
1. Less need for “creativity” (clever data collection replaces clever
hypothesis generation)
2. Models are generated inductively, i.e. pushed to the
scientist
3. Models that would not even have occurred to the
scientist get considered
Weaknesses:
1. Choice of model building technique(s)
2. Most techniques are parametric, but it’s hard to choose
the parameters in a principled way
This is addressed by iterative modelling, using knowledge
from early attempts to improve later modelling.
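One simple form of iterative modelling can be sketched as a coarse-to-fine parameter search scored on held-out data. Everything here is an assumption made for illustration: the technique (k-nearest-neighbour regression), the sine-shaped data, and the candidate values of k. The first pass tries a few widely spaced values of k; the second pass uses what was learned to search more finely around the best one.

```python
import random
import statistics
from math import sin


def knn_predict(train, x, k):
    """Average the y values of the k training points nearest to x."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    return statistics.fmean([y for _, y in nearest])


def validation_error(train, valid, k):
    """Mean squared error of k-NN predictions on the validation set."""
    return statistics.fmean(
        [(y - knn_predict(train, x, k)) ** 2 for x, y in valid])


rng = random.Random(2)


def sample(n):
    # Hypothetical system: y = sin(x) plus Gaussian noise.
    return [(x, sin(x) + rng.gauss(0, 0.3))
            for x in (rng.uniform(0, 6) for _ in range(n))]


train, valid = sample(200), sample(100)

# Pass 1: coarse, widely spaced guesses for the parameter k.
coarse_errors = {k: validation_error(train, valid, k) for k in (1, 4, 16, 64)}
best_coarse = min(coarse_errors, key=coarse_errors.get)

# Pass 2: use the first pass to focus the search around its best value.
fine_range = range(max(1, best_coarse // 2), 2 * best_coarse + 1)
fine_errors = {k: validation_error(train, valid, k) for k in fine_range}
best_k = min(fine_errors, key=fine_errors.get)

print(f"coarse best k={best_coarse}, refined best k={best_k}, "
      f"validation MSE={fine_errors[best_k]:.3f}")
```

This does not remove the first weakness (the modelling technique itself was still chosen by hand), but it makes the parameter choice repeatable rather than ad hoc.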
Even within data science there are two approaches:
Leo Breiman, Statistical Modeling: The Two Cultures
“There are two cultures in the use of statistical modeling
to reach conclusions from data. One assumes that the data
are generated by a given stochastic data model
[statisticians]. The other uses algorithmic models and
treats the data mechanism as unknown [data miners].”
Places where Data Science is making a difference:
• Physics, e.g. CERN
• Climate
• Fluid dynamics
• Genome wide association studies
• Natural language (meaning from syntax) e.g. Google
• Construction of large software systems
• Shopping patterns/supply chain management
• Mortgage approvals
• Insurance pricing
• Traffic/congestion prediction
• Behaviour modelling
• Police scheduling
• Recommender systems
Both social sciences and humanities can use this new
scientific method (where they could not use the old one).
A major application: human social behaviour
aka social network analysis
Human social networks are
based on local decisions to connect, like, dislike
but produce
global, emergent regularities and structures
Online activity makes it easy to capture when people
create relationships, and often what kind of relationships
these are:
email
forum/web page posts
Facebook/LinkedIn
physical co-presence (via GPS devices)
The results have mostly been surprising:
Milgram’s 6 degrees of separation
- social networks have much smaller diameters than
we expect
Dunbar’s Number
- social networks have degrees (125) that match
real-world connectivity (so probably cognitive)
Powers of almost-three
- from each individual’s perspective, connections are
layered
3 + 9 + 27 + 81 = 120
Influence flows “over the horizon”
- happiness, sadness, smoking, weight, … are affected
by friends of friends whom we haven’t even met
Hairball and whiskers
- clusters or communities are in the eye of the beholder
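The small-diameter result can be illustrated on a toy network. This sketch assumes a made-up graph, not real social data: 200 people on a ring, each knowing only their four nearest neighbours, to which 20 random long-range acquaintances are then added (in the spirit of small-world models). Breadth-first search gives the diameter and the mean shortest-path length before and after.

```python
import random
from collections import deque


def ring_lattice(n, k):
    """Each node links to its k nearest neighbours on each side of a ring."""
    adj = {v: set() for v in range(n)}
    for v in range(n):
        for step in range(1, k + 1):
            w = (v + step) % n
            adj[v].add(w)
            adj[w].add(v)
    return adj


def bfs_distances(adj, source):
    """Hop counts from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        v = queue.popleft()
        for w in adj[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    return dist


def path_stats(adj):
    """Return (diameter, mean shortest-path length) of a connected graph."""
    longest, total, pairs = 0, 0, 0
    for v in adj:
        dist = bfs_distances(adj, v)
        longest = max(longest, max(dist.values()))
        total += sum(dist.values())
        pairs += len(dist) - 1
    return longest, total / pairs


rng = random.Random(3)
local_only = ring_lattice(200, 2)

# Copy the local network and add a few random long-range links.
with_shortcuts = {v: set(nbrs) for v, nbrs in local_only.items()}
for _ in range(20):
    a, b = rng.sample(range(200), 2)
    with_shortcuts[a].add(b)
    with_shortcuts[b].add(a)

before_diameter, before_mean = path_stats(local_only)
after_diameter, after_mean = path_stats(with_shortcuts)
print(f"local only:     diameter={before_diameter}, mean path={before_mean:.1f}")
print(f"with shortcuts: diameter={after_diameter}, mean path={after_mean:.1f}")
```

A handful of long-range links is enough to shorten typical paths considerably, which is one common explanation for why real social networks feel much "smaller" than their size suggests.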
The connections (relationships) in social networks encode
many different properties:
Symmetric relationship → undirected edge
Asymmetric (liking/influence) → directed edge
Qualitatively different (friend/colleague) → typed edge
Like/Hate → signed edge
+ relationship intensity can change with time
Most of these properties have not been studied much.
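The edge properties listed above can be carried by a single record type. The sketch below is one possible representation, not one taken from the talk; the field names, the people "A", "B", "C", and the yearly intensities are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Edge:
    source: str
    target: str
    directed: bool = False        # asymmetric relationship (liking/influence)
    kind: str = "unspecified"     # typed edge, e.g. "friend" or "colleague"
    sign: int = 1                 # signed edge: +1 for like, -1 for hate
    # Relationship intensity over time, keyed by year (hypothetical scale).
    intensity: Dict[int, float] = field(default_factory=dict)


def successors(edges, node):
    """Nodes reachable from `node` in one step, respecting edge direction."""
    reached = set()
    for e in edges:
        if e.source == node:
            reached.add(e.target)
        elif e.target == node and not e.directed:
            reached.add(e.source)
    return reached


edges = [
    Edge("A", "B", kind="friend"),                    # undirected, typed
    Edge("A", "C", directed=True, kind="influence"),  # A influences C only
    Edge("B", "C", kind="colleague", sign=-1,
         intensity={2019: 0.2, 2020: 0.8}),           # signed, time-varying
]

print(successors(edges, "A"))
print(successors(edges, "C"))
```

Keeping direction, type, sign, and intensity on the same record means an analysis can ignore any of them (recovering the plain undirected network) or combine them.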
The Medici have been studied extensively because of
their meteoric rise from obscurity to become Dukes of
Florence (and to supply 3 popes).
Conventional view: they acquired power by acting as
intermediaries between oligarchs and nouveau riche.
We can examine this using the social network of Florence.
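One standard way to quantify "acting as an intermediary" is betweenness centrality: the number of shortest paths between other pairs that pass through a node. The sketch below computes it with Brandes' algorithm on a deliberately artificial graph (two tightly knit blocs joined only through a single "broker"); it does not use the Florentine data, and it is a different tool from the embedding techniques described in the talk.

```python
from collections import deque


def betweenness(adj):
    """Brandes' algorithm for an unweighted, undirected graph."""
    score = {v: 0.0 for v in adj}
    for s in adj:
        # Breadth-first search from s, counting shortest paths (sigma).
        sigma = {v: 0 for v in adj}
        sigma[s] = 1
        dist = {s: 0}
        preds = {v: [] for v in adj}
        order = []
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Accumulate dependencies from the farthest nodes back towards s.
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                score[w] += delta[w]
    # Each unordered pair was counted from both endpoints, so halve.
    return {v: c / 2 for v, c in score.items()}


def build_graph(pairs):
    adj = {}
    for a, b in pairs:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj


# Hypothetical toy network: two blocs linked only through one broker.
old = ["old1", "old2", "old3"]
new = ["new1", "new2", "new3"]
pairs = [(a, b) for i, a in enumerate(old) for b in old[i + 1:]]
pairs += [(a, b) for i, a in enumerate(new) for b in new[i + 1:]]
pairs += [("broker", v) for v in old + new]

scores = betweenness(build_graph(pairs))
for name, value in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{name:>7}: {value:.1f}")
```

In this toy graph every path between the two blocs runs through the broker, so it is the only node with non-zero betweenness; a real network would give a graded ranking rather than such a clean split.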
Social networks are hard to study directly, but ways to
embed them faithfully in low-dimensional spaces are
known.
The undirected social network (oligarchs – black, Medici-aligned – red)
Closeness to the centre = importance; closeness to another family =
similarity of connections/role
Red edges = business relationships, green edges = social relationships,
blue edges = difference between the two roles
Accounting for the direction of relationships, using the Chung
embedding technique
Accounting for direction using a new embedding technique
developed by my group – the difference between in and out
measures net flow across each node
All of these embeddings support the conventional view that the
Medici do occupy a central position, mediating between the nouveau
riche and the oligarch bloc.
The variants (especially the differences between embedded
versions of the same group) give insights into exactly how this
worked – and also allow edge prediction, guesses about the
characteristics of ‘missing’ edges.
However, the relationships among the most important subset of
families are more subtle …
Typed, undirected (red – business; green – personal; blue – role difference)
Summary:
Data “science” (which includes data humanities and data
social science) is a new way of understanding complex
systems – systems that are more complex than
conventional science can handle
This includes systems that involve many humans –
economic, social, political.
I’ve shown some hints of the payoffs, using social
network analysis as an application domain.
We learn many unexpected aspects of how we relate to
one another – both macro-scale and micro-scale.
Thank you
And
Questions?
David Skillicorn
skill@cs.queensu.ca
www.cs.queensu.ca/home/skill
skillicorn.wordpress.com