GSS Symposium - Big Data

ONS Big Data Project
GSS Methodology Symposium
3 July 2014
Session objectives
• Provide an overview of big data, particularly
in Official Statistics
• Introduce the ONS Big Data Project
• Provide a brief overview of our 4 pilot studies
and other project objectives
• Provide links to more information
What is Big Data?
“Big data are high volume, high velocity, and high variety
information assets that require new forms of processing to
enable enhanced decision making, insight discovery and
process optimization” (Gartner 2012)
Volume
- exceeds limits of traditional column and row relational DB
- constantly growing
Requires: Vertical scalability
- ability to grow storage to accommodate new ‘records’
Velocity
- arrives rapidly, often in real time
Requires: Data streaming
- real time processing, analysis and transformation
Variety
- does not have a standard structure, e.g. text, images
Requires: Horizontal scalability
- ability to add additional data structures
How is big data generated?
Sensors gathering information: e.g.
Climate, traffic etc.
Social media: posts, pictures and
videos
Digital satellite images
Purchase transaction records
Mobile phone GPS signals
High volume administrative
& transactional records
Big Data Technologies
Cloud Computing
Parallel Computing
NoSQL Databases
General Programming
Data Visualization
Machine Learning
Big Data and Official Statistics
• Replace existing outputs
• Produce entirely new outputs
• Complement other sources:
• Filling in gaps
• Auxiliary variables for statistical models
• Improve operational processes
• Quality assurance
What is the ONS Big Data Project?
• A one year project which aims to:
• investigate the potential for big data in official statistics while
understanding the challenges
• establish an ONS policy and longer term strategy which
incorporates ONS’s position within Government and
internationally in this field
• Recommend next steps to support the strategy going
forward
Big Data Project work packages
• Management and Strategy
• Stakeholder Engagement
• Communication
• Analysis and infrastructure – pilot projects:
Smart meter
Mobile Phones
Prices
Twitter
Stakeholder Engagement
• International:
• UNECE / ESS
• Leading NSIs are Italy and Netherlands
• Cross-government:
• HMG Data Science Community of Interest Group
• Big data for statistics vs other types of analysis
• UK Government Big Data Champion (Jane Naylor)
• Academia:
• University of Southampton
• ESRC Big data network
Analysis & Infrastructure: Technical
challenges
• Huge and continuously growing data streams,
requiring new data architectures and software
• Feasibility and efficiency of processing,
typically requiring parallel computing on a
large scale
• New skills will be required, bringing together
statistical and technological expertise
Analysis & Infrastructure: The Data
Science Skill Set
No living person is
expert in all these
disciplines.
A very rare person
would be proficient in
them all.
An individual might
be expert in one or
two, and proficient in
another two or three.
Data science is a TEAM
SPORT.
Source: http://en.wikibooks.org/wiki/Data_Science:_An_Introduction
Pilot 1: Smart meter project
Irish smart meter pilot study:
Single meter, total daily electricity consumption
[Figure: total daily electricity consumption (kWh) by day, July 2009 to December 2010. Annotations: Christmas 2009; Christmas 2010; "Consecutive days with low consumption, possibly a week away?"]
Pilot 1: Smart meter project
Research Question: Investigate the potential
of smart meter electricity data (high frequency
– 30 mins) to identify household occupancy
levels, potentially household structure
• England and Ireland both conducted pilots of
rollout in 2009-2010 – data now available for
research
• Southampton University commissioned by
Beyond 2011 to conduct preliminary research
(due mid Feb 2014)
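As a rough illustration of the occupancy idea, the sketch below sums half-hourly readings into daily totals and flags runs of consecutive low-consumption days, such as a possible week away. The function names, the 5 kWh threshold, the three-day minimum and the synthetic readings are all assumptions made for this example; they are not taken from the pilot or from the Southampton research.

```python
from collections import defaultdict
from datetime import date, datetime, timedelta


def daily_totals(readings):
    """Sum (timestamp, kWh) readings into a total per calendar day."""
    totals = defaultdict(float)
    for timestamp, kwh in readings:
        totals[timestamp.date()] += kwh
    return dict(sorted(totals.items()))


def low_consumption_runs(totals, threshold_kwh, min_days):
    """Find runs of at least min_days consecutive days below threshold_kwh."""
    runs, current = [], []

    def close_run():
        # Keep the run only if it is long enough, then start afresh.
        if len(current) >= min_days:
            runs.append((current[0], current[-1]))
        current.clear()

    for day, total in sorted(totals.items()):
        is_low = total < threshold_kwh
        follows_on = bool(current) and (day - current[-1]).days == 1
        if current and not (is_low and follows_on):
            close_run()
        if is_low:
            current.append(day)
    close_run()
    return runs


# Synthetic data (invented): ten days of half-hourly readings,
# with very low consumption on 4-6 January 2010.
start = datetime(2010, 1, 1)
readings = []
for slot in range(10 * 48):
    timestamp = start + timedelta(minutes=30 * slot)
    away = date(2010, 1, 4) <= timestamp.date() <= date(2010, 1, 6)
    readings.append((timestamp, 0.05 if away else 0.5))

totals = daily_totals(readings)
# Threshold and minimum run length are illustrative assumptions.
runs = low_consumption_runs(totals, threshold_kwh=5.0, min_days=3)
print(runs)
```

A fixed threshold is the simplest possible rule. A real analysis would more likely compare each day with that household's own typical consumption.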
Pilot 2: Mobile Phone Project
• 4 pilot projects:
Smart meter
Mobile Phones
Prices
Twitter
• RD&I Research Innovation Labs
Pilot 2: Mobile Phone Project
Research Question: To investigate the
possibility of using mobile phone data to
model population flows, e.g. travel to work
statistics
• Location data:
Telefonica proposal to provide aggregate data on
origin-destination flows
• Requirement to engage with GDS before
proceeding further
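To make "aggregate data on origin-destination flows" concrete, here is a minimal sketch of how such records might be combined into an origin-destination matrix and summarised as net inflow per area. The record layout (origin, destination, count), the area codes and the counts are hypothetical. The actual format of the proposed data is not described in these slides.

```python
from collections import defaultdict


def od_matrix(flows):
    """Combine (origin, destination, count) records into a nested dict."""
    matrix = defaultdict(lambda: defaultdict(int))
    for origin, destination, count in flows:
        matrix[origin][destination] += count
    return {origin: dict(row) for origin, row in matrix.items()}


def net_inflow(matrix):
    """Net inflow per area: arrivals minus departures, ignoring internal flows."""
    net = defaultdict(int)
    for origin, row in matrix.items():
        for destination, count in row.items():
            if origin != destination:
                net[origin] -= count
                net[destination] += count
    return dict(net)


# Hypothetical aggregate records; area codes and counts are invented.
flows = [
    ("A", "B", 120),
    ("A", "C", 30),
    ("B", "A", 40),
    ("C", "B", 15),
    ("A", "B", 10),   # a second batch for the same pair is added to the first
    ("B", "B", 200),  # flow within one area, excluded from net inflow
]

matrix = od_matrix(flows)
net = net_inflow(matrix)
print(matrix)
print(net)
```

Because every flow leaves one area and enters another, the net inflows sum to zero, which gives a simple consistency check on the aggregated data.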
Pilot 3: Prices Project
Research Question: To investigate how we
can scrape prices data from the internet and
how this data could be used within price
statistics
• 2-day workshop held with big data experts
from Statistics Netherlands
• Focus on groceries
• Early prototype code in place
• Engagement with Billion Prices Project
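One way scraped prices could feed into price statistics is through an elementary price index. The sketch below uses the Jevons formula (the geometric mean of price relatives for items present in both periods), a standard elementary aggregate. It is shown only as an illustration: the slides do not say which formula the project used, and the product names and prices are invented.

```python
import math


def jevons_index(base_prices, current_prices):
    """Jevons index (base period = 100) over items priced in both periods."""
    matched = sorted(set(base_prices) & set(current_prices))
    if not matched:
        raise ValueError("no items are priced in both periods")
    log_relatives = [
        math.log(current_prices[item] / base_prices[item]) for item in matched
    ]
    # The exponential of the mean log relative is the geometric mean.
    return 100 * math.exp(sum(log_relatives) / len(log_relatives))


# Invented prices in pounds; "eggs" has no base price, so it is left out.
base = {"bread": 1.00, "milk": 2.00}
current = {"bread": 1.21, "milk": 2.00, "eggs": 3.00}

index = jevons_index(base, current)
print(round(index, 2))
```

Matching only items seen in both periods sidesteps product churn, which is one of the practical difficulties with web-scraped prices.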
Pilot 3: Prices by webscraping
Rendered webpage:
HTML code:
......
</div><div class="productLists" id="endFacets-1"><ul class="cf products line"><li id="p-254942348-3" class=" first"><div
class="desc"><h3 class="inBasketInfoContainer"><a id="h-254942348" href="/groceries/Product/Details/?id=254942348"
class="si_pl_254942348-title"><span class="image"><img
src="http://img.tesco.com/Groceries/pi/121\5010044000121\IDShot_90x90.jpg" alt="" /><!----></span>Warburtons Toastie Sliced
White Bread 800G</a></h3><p class="limitedLife"><a href="http://www.tesco.com/groceries/zones/default.aspx?name=quality-andfreshness">Delivering the freshest food to your door- Find out more ></a></p><div class="descContent"><!----><div
class="promo"><a href="/groceries/SpecialOffers/SpecialOfferDetail/Default.aspx?promoId=A31234788" title="All products
available for this offer" id="flyout-254942348-promo-A31234788--pos" class="promoFlyout"><span class="promoImgBox"><img
src="/Groceries/UIAssets/I/Sites/Retail/Superstore/Online/Product/pos/2for.png" class="promoFlyout promo" alt="Special Offer"
id="flyout-254942348-promo-A31234788--posimg" /></span><em>Any 2 for £2.00</em></a><span> valid from 21/1/2014 until
10/2/2014</span></div><div class="tools"><div class="moreInfo"><a href="/groceries/Product/Details/?id=254942348"
class="midiFlyout" id="flyout-254942348-midi-0-"><img class="midiFlyout hd"
src="http://ui.tescoassets.com/groceries/UIAssets/I/../Compressed/I_635209615845382232/Sites/Retail/Superstore/Online/Product/infoBlue.gif"
alt="" title="View product information" id="flyout-254942348-midi-1-" /></a></div><!----><div
class="links"><ul><li><a
href="http://www.tesco.com/groceries/product/browse/default.aspx?notepad=white%20sliced%20loaf%20800g&N=4294793217"
class="shelfFlyout active plaintooltip" id="s-tt-254942348" title="Premium White Bread"> Rest of <span class="hide">Premium
White Bread <!----></span>shelf </a></li></ul></div></div></div></div><div class="quantity"><div class="content addToBasket"><p
class="price"><span class="linePrice">£1.45<!----></span><span class="linePriceAbbr"> (£0.18/100g)</span></p><h4
class="hide">Add to basket</h4><form method="post" id="fMultisearch-254942348"
.....
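A product name and price can be pulled out of markup like the fragment above with a small parser. The sketch below uses only Python's standard `html.parser` module and keys on the `inBasketInfoContainer` and `linePrice` class names that appear in the fragment. The `ProductPriceParser` class and the shortened sample string are assumptions made for illustration, not the project's prototype code.

```python
from decimal import Decimal
from html.parser import HTMLParser


class ProductPriceParser(HTMLParser):
    """Collect (name, price) pairs from product-listing markup."""

    def __init__(self):
        super().__init__()
        self.products = []
        self._field = None  # which field the current text belongs to
        self._name = ""

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "h3" and "inBasketInfoContainer" in classes:
            self._field, self._name = "name", ""
        elif tag == "span" and "linePrice" in classes:
            self._field = "price"

    def handle_endtag(self, tag):
        if (tag == "h3" and self._field == "name") or (
            tag == "span" and self._field == "price"
        ):
            self._field = None

    def handle_data(self, data):
        if self._field == "name":
            self._name += data
        elif self._field == "price":
            name = " ".join(self._name.split())  # normalise whitespace
            price = Decimal(data.strip().lstrip("£"))
            self.products.append((name, price))


# Shortened version of the fragment shown on the slide.
sample_html = (
    '<h3 class="inBasketInfoContainer"><a id="h-254942348" '
    'href="/groceries/Product/Details/?id=254942348">'
    '<span class="image"><img src="IDShot_90x90.jpg" alt="" /><!----></span>'
    'Warburtons Toastie Sliced White Bread 800G</a></h3>'
    '<p class="price"><span class="linePrice">£1.45<!----></span>'
    '<span class="linePriceAbbr"> (£0.18/100g)</span></p>'
)

parser = ProductPriceParser()
parser.feed(sample_html)
parser.close()
print(parser.products)
```

A parser tied to class names like this breaks whenever the retailer changes its page layout, so scraped series need monitoring for sudden gaps.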
Pilot 3:
The Billion Prices Project @ MIT
Lehman Brothers files
for bankruptcy (15 Sept
2008)
Daily Online
Price Index
(United
States)
Pilot 4: Twitter Project
Research Question: To investigate how to
capture geo-located tweets from Twitter and
how this data might provide insights on
commuting patterns and internal migration
• Opportunity to start experimenting early on
with big data technologies
• Pilot work has successfully harvested geo-located tweets from the live Twitter feed using
Python and the Twitter API
• Need to determine whether planned
application will exceed rate-limits
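Once tweets have been harvested, a first processing step might be to keep only those with a point location inside an area of interest. The sketch below assumes each tweet is a JSON object whose `coordinates` field is either null or a GeoJSON-style point with `[longitude, latitude]`. The function names, sample tweets and bounding box are invented for illustration; no live API call is made.

```python
import json


def geolocated_points(tweets):
    """Yield (tweet_id, longitude, latitude) for tweets with a point location."""
    for tweet in tweets:
        geo = tweet.get("coordinates")
        if geo and geo.get("type") == "Point":
            longitude, latitude = geo["coordinates"]
            yield tweet.get("id"), longitude, latitude


def in_bounding_box(longitude, latitude, west, south, east, north):
    """True if the point lies inside the box (edges included)."""
    return west <= longitude <= east and south <= latitude <= north


# Invented sample records, one JSON object per line.
raw_lines = [
    '{"id": 1, "coordinates": {"type": "Point", "coordinates": [1.31, 51.13]}}',
    '{"id": 2, "coordinates": null}',
    '{"id": 3, "coordinates": {"type": "Point", "coordinates": [-0.13, 51.51]}}',
]
tweets = [json.loads(line) for line in raw_lines]

# Hypothetical box roughly covering the Dover-Calais crossing.
box = dict(west=1.0, south=50.8, east=2.0, north=51.3)
inside = [
    tweet_id
    for tweet_id, longitude, latitude in geolocated_points(tweets)
    if in_bounding_box(longitude, latitude, **box)
]
print(inside)
```

Only a minority of tweets carry exact coordinates, so any mobility estimate built this way describes a self-selected subset of users.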
Pilot 4: Twitter Project
Temporal Patterns of International Mobility by selected country
Pilot 4: Mobility patterns from Twitter
Dover
Calais
Where to from here?
• We need to think hard about how we can
exploit this deluge of data, new tools and
technologies
• We must share and collaborate in
applications of big data for official statistics
• We need to be able to respond to challenges
about our statistical outputs arising from big
data sources
• We need to look beyond our national borders
Finding out more information
Questions
• ?