The Impact of Big Data on Social Research David Rhind Sharon Witherspoon

www.nuffieldfoundation.org
The landscape to be covered…
• What is Big Data? Just consultants’ hype?
• Key questions for SRA
• Technology + other drivers of change
• New sources of data and their uses
• Big challenges
• Back to the future – the next Census
• Presentation also matters
• Conclusions
What is/are Big Data?
VOLUME: too large to handle with standard contemporary analytical tools, i.e. a
subjective / relative measure
“the total amount of data has grown exponentially: it has been estimated that
more data was harvested between 2010 and 2012 than in all of preceding
human history.”
Source: http://www.bbc.co.uk/news/business-17682304
Claim certainly made by Mike Lynch; original source IBM?
VELOCITY: how fast data is being produced and how fast it must be
produced to meet demand.
VARIETY: many different forms of data which are used – structured and
unstructured (the majority), held in different types of databases as text
documents, emails, imagery, videos and much else
PROBLEMS: hype, bias in (large) sample, focus on correlations not
causality, understanding the results
Context and key questions for SRA
• Current practice mostly survey-based
• Divide exists between expertise in data collection and
analysis skills
• National shortfall in quantitative analytical skills
• Will Big Data etc. change the ground rules of research
practice?
• Are established practices becoming obsolete?
• Or do we need to assimilate what’s new into
established principles of research?
Drivers of change
• Extraordinary rate of technological enhancement
• Austerity – better vfm sought
• Transparency
• Job creation / increase wealth
• Calls for better / more up-to-date data/info/evidence
• Threats to traditional approaches e.g. EU Parliament and Data Protection – ‘Specific and explicit consent’
Public sector manifestations of change: data scientists sought by government, support of Open Data Institute, ONS exploration of options, data.gov, ESRC £64m funding & ADRCs
Technology change
• Apollo 11, 1969
• The iPhone 4S (2012) in my pocket: more computing power than Apollo; 3000 x storage of the IBM 305 disk drive; $150 / year
• IBM 305 disk drive, 1956: leased for $35,000/year
New(ish) sources of data
• Mobile phone sensors
• Proxy: satellite remote sensing 31cm resolution
(how to reflect people data?)
• Proxy: web scraping (e.g. inflation measures)
• Crowd sourcing e.g. OpenStreetMap
• Management/ administrative data (public and
private sector)
• Modelling starting from historic data
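As a rough illustration of the web-scraping proxy mentioned above, the sketch below turns two sets of scraped prices into a simple price index. It assumes the Jevons formula (the geometric mean of price relatives), one common choice for elementary aggregates and not necessarily what any statistical office applies to scraped data. The product names and prices are hypothetical.

```python
import math


def jevons_index(base_prices, current_prices):
    """Geometric mean of price relatives (base period = 100).

    Only products observed in both periods are used; products that
    appear or disappear between scrapes are ignored in this sketch.
    """
    common = sorted(set(base_prices) & set(current_prices))
    if not common:
        raise ValueError("no products in common between the two periods")
    log_sum = sum(math.log(current_prices[p] / base_prices[p]) for p in common)
    return 100.0 * math.exp(log_sum / len(common))


# Hypothetical scraped prices in pounds (illustrative values only)
january = {"bread": 1.00, "milk": 0.50, "eggs": 2.00}
february = {"bread": 1.10, "milk": 0.50, "eggs": 2.20, "butter": 1.50}

index = jevons_index(january, february)
print(round(index, 2))
```

A real pipeline would also need to handle product churn, quality change and site coverage, which is where the "how good is good enough" question arises.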
Visitors and locals in Paris
Source: Eric Fischer
Uses of different data types
• Obtaining data about ‘things’ easy? – see remote
sensing examples
• People:
  - location and movement of people technically easy via CCTVs, smartphones
  - ethnicity, age data approximations from names
  - profiles from private sector data or linked governmental administrative data technically easy
The best solution is usually a combination of data types, e.g. land cover and use from imagery and company records
Real time data collection now routine for some applications
Source: UK MoD under the Open Government license, Google and US Geological Survey
Different uses of imagery at different resolutions
• 10m resolution: see roads and water features
• 1 to 2 metres resolution: see some cars and individual houses
• 30 to 60cm resolution: see all visible cars, manholes
Source: DigitalGlobe 2014
Extreme crowd sourcing: Pyongyang Open Street Map
Also MH 370
Source: UK MoD under the Open Government license, Google and US Geological Survey
Admin data / management information
• Obvious advantages – already exists, often
continuously maintained, linkage of personal admin
data facilitates valuable research and fraud
reduction
BUT
• You get (at best) what is created for other purposes
• Content or classification changes mess up time
series
• Personal admin data sharing and privacy debate…
• Has raw data quality been audited properly
(English police recorded crime statistics)?
Ratio between CSEW incidents and crime
recorded by the police
Adding value = a commercial asset
Can have huge value e.g. Climate Corporation:
  - 2006 start-up by 2 ex-Google staff
  - Linked US government weather, crop yield and soil data
  - Provide yield forecasting and planting advice, weather and crop insurance
  - Bought by Monsanto October 2013 for $930m
Big Challenges
• Trade-off between data integrity and currency. How good is ‘good enough’? How fast is fast enough?
• Want to anticipate the future as well as know the past
• Private sector increasingly active in data collection and exploitation e.g. Markit surveys used by Bank of England. Internationalisation of data collection/assembly growing.
• Public understanding: problem with use of technical language e.g. the public doesn’t really understand the ‘n year flood’ concept. PM confusion of deficit and debt.
• Changed role of data constructor/statistician? As mentors and advocates?
• Is this all a matter for the very young?
Back to the future with
surveys?
The 2011 Census
• 2011 Census survey data collection went well but total cost was £480m
• Basically very similar to what has been done for decades; 16% completed on-line
• Results started to become available 15 months after the survey but much was still being published after 3.5 years
• Changing society makes it more difficult to complete forms
• Statistics Commission, Treasury Select Committee and UKSA said ‘no more traditional census’
LFS Response Rates 1993 to 2008
Source: ONS
US experience is similar – an average of 20% reduction in 20 years
The 2021 ‘Census’
• Very strong support from public consultation for
continuation of some form of Census
• ONS plan now accepted in principle by government
• Model is for an on-line Census+:
  - aim to achieve a high rate (e.g. 65%) of online completion of forms
  - aim to enrich census data by adding variables derived from admin data wherever possible
  - much research under way…
• US Bureau of Census experimenting with use of
smartphone-derived data
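To make the idea of enriching census data with admin-derived variables concrete, here is a minimal sketch of deterministic record linkage. It is an illustration under stated assumptions, not a description of the ONS method: the match fields (surname, date of birth, postcode), the field names and all records are hypothetical, and a plain unsalted hash like this one would not on its own be an adequate privacy safeguard.

```python
import hashlib


def link_key(record):
    """Build a match key from normalised identifiers (hypothetical field choice)."""
    parts = [
        record["surname"].strip().upper(),
        record["dob"],
        record["postcode"].replace(" ", "").upper(),
    ]
    return hashlib.sha256("|".join(parts).encode("utf-8")).hexdigest()


def enrich(census_records, admin_records, admin_fields):
    """Copy the named admin fields onto each census record that has an exact key match.

    Unmatched census records get None for those fields; inputs are not modified.
    """
    admin_by_key = {link_key(r): r for r in admin_records}
    enriched = []
    for rec in census_records:
        match = admin_by_key.get(link_key(rec))
        out = dict(rec)
        for field in admin_fields:
            out[field] = match[field] if match is not None else None
        enriched.append(out)
    return enriched


# Hypothetical records for illustration only
census = [
    {"surname": "Smith", "dob": "1980-01-31", "postcode": "AB1 2CD", "household_size": 3},
    {"surname": "Jones", "dob": "1975-06-15", "postcode": "EF3 4GH", "household_size": 1},
]
admin = [
    {"surname": "SMITH ", "dob": "1980-01-31", "postcode": "ab12cd", "benefit_claimant": True},
]

result = enrich(census, admin, ["benefit_claimant"])
```

Exact matching misses records with typos or changed addresses, so real linkage work typically adds probabilistic matching and quality audits of the linked output.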
Source: ONS
Data presentation also
matters!
Basic arithmetical error – it should be “almost £400” not “almost £4000”!
PM confusing deficit and debt…
National Infrastructure Plan: Pipeline Value by sector (£m)
[Chart: Pipeline value by sector; vertical axis: Capital Value £ million, 0 to 250,000; sectors shown: Communications, Flood, Transport, Water]
Moral: how information is presented can seriously mislead (note log scale on Chart 2)
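The point about log scales can be checked with a little arithmetic. The sketch below compares how tall a small bar looks relative to a large one on a linear axis and on a log axis. The two sector values are hypothetical and are not the National Infrastructure Plan figures; `axis_min` is an assumed lower bound for the log axis.

```python
import math


def bar_heights(values, log_scale=False, axis_min=1.0):
    """Return each bar's drawn height as a fraction of the tallest bar.

    On a log axis the height is proportional to log10(value / axis_min),
    so every value must be greater than axis_min.
    """
    if log_scale:
        scaled = {k: math.log10(v / axis_min) for k, v in values.items()}
    else:
        scaled = dict(values)
    tallest = max(scaled.values())
    return {k: s / tallest for k, s in scaled.items()}


# Hypothetical values in £ million (illustrative only)
pipeline = {"Sector A": 100_000.0, "Sector B": 1_000.0}

linear = bar_heights(pipeline)
logged = bar_heights(pipeline, log_scale=True)

for sector in pipeline:
    print(sector, round(linear[sector], 2), round(logged[sector], 2))
```

With these numbers the smaller sector is 1% of the larger one, yet on the log axis its bar is drawn at about 60% of the height, which is how a log-scaled chart can make very unequal sectors look comparable.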
Conclusions
• Much Big Data hype but a revolution is under way
• This will change the way we assemble data and do
social science to extract added value
• Much more work will be by multi-disciplinary teams
with higher level analytic, quantitative and
presentational skills in various disciplines
• Greater focus still needed on data quality issues
• Need focus on data sharing governance, ethics and
safeguards and on advocacy of benefits
• Q-Step will help – a BIT. But organisations like the SRA
and its members have an important role!
Thank you