Statistical analysis in the government sector using the R language:

Experiences from MBIE and DOC
Peter Ellis
Ministry of Business, Innovation & Employment
Ian Westbrooke
Department of Conservation
Today





Assessing needs and why R at MBIE
Assessing needs and meeting them at DOC
R and official statistics production – the Regional Tourism Indicators
Training MBIE staff in R
Reviewing progress and looking forward
Tourism data in 2011




$4 million per year
Ranges from departure cards to business surveys
Combination of in-house, contracted and Statistics New Zealand
About to have significant change of content and orientation
Tourism data in 2011
Diagram: Storage, Web dissemination, Custom extraction, Analysis
Some things we couldn’t do then
Chinese visitors’ disappointments
All visitors’ disappointments
Options…
Package
Strengths
Weaknesses
SAS
Solid
Graphics
Good reputation
Cost
Already used in production
Support
SPSS
GUI a good introduction
Already used for storage
Stata
Cheap
GUI a good introduction
Lacks basic capability without additional modules
Cost
Bad with more than one rectangle of data
Used elsewhere in MED
R
Free
Intimidating
Best graphics
Fear, Uncertainty, Doubt – how can it be any good if it’s free and was developed in NZ?
Cutting-edge techniques
User community
“In terms of taking advantage of modern statistical techniques, R clearly dominates. When analysing data, it is undoubtedly the most important software development during the last quarter of a century. And it is free…
“It is powerful, flexible, and it provides a relatively simple way of applying cutting-edge techniques…. SAS software could be written for applying the modern methods mentioned in this book, but for many of the techniques to be described this has not been done.”
(Rand Wilcox)
http://r4stats.com/articles/popularity/
Number of R- or SAS-related posts to Stack Overflow by week
Number of blogs by software
R: 452
SAS: 40
Stata: 8
Others: 0-3
Where is the R Activity?
http://spatial.ly/2013/06/r_activity/
18
Assessing needs at
Dept of Conservation
A third of New Zealand’s land in conservation management
Department of
Conservation


Part of central government
Conserving heritage


natural
historic
21
30+ marine
reserves
22
Protecting indigenous
biodiversity


Unique ecosystems
Many unique species




birds
lizards
marine mammals
plants
23
Kiwi
24
Kiwi – threatened due to stoats
killing chicks
25
Kiwi – survival analysis tools
Figure: survival curves by day from hatching (0 to 150) for stoat control and no control
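A comparison like the one on the kiwi survival slide could be sketched as follows. This is only an illustration, assuming the `survival` package (one of R's recommended packages) and simulated data; the data frame `chicks`, its columns and the death rates are all hypothetical, not DOC's data or method.

```r
library(survival)  # recommended package shipped with standard R installations

set.seed(1)
n <- 60  # hypothetical number of chicks per group

# Simulated (not real) days from hatching until death, one rate per group
days <- c(rexp(n, rate = 1 / 200), rexp(n, rate = 1 / 40))
chicks <- data.frame(
  treatment = rep(c("Stoat control", "No control"), each = n),
  died      = as.integer(days <= 150),  # 1 = death observed, 0 = censored at day 150
  day       = pmin(days, 150)
)

# Kaplan-Meier survival curves, one per treatment group
km <- survfit(Surv(day, died) ~ treatment, data = chicks)

plot(km, lty = 1:2, xlab = "Day from hatching", ylab = "Proportion surviving")
legend("bottomleft", legend = names(km$strata), lty = 1:2)

# Log-rank test of the difference between the two curves
survdiff(Surv(day, died) ~ treatment, data = chicks)
```

Chicks still alive at day 150 are treated as censored, which is what lets the Kaplan-Meier estimate use partial follow-up rather than discarding it.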
Tuatara
27
Tuatara responses to rat removal
Conservation Biology Volume 21, No. 4, 2007
28
Marine mammals - Hooker’s sea lion
Plants – Rata
30
Plants – planting Pingao
31
Invasive threats to
unique flora & fauna
32
Promoting recreation
33
Promoting recreation
Franz Josef Glacier
– Southern Alps, New Zealand
34
Tongariro alpine crossing
35
Generalised additive model: crowding on Tongariro alpine crossing
Figure: predicted proportion crowded (0.0 to 0.8) against daily track count (100 to 700)
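A model of this kind could be fitted roughly as below. This is a sketch only, assuming the `mgcv` package (a recommended package) and simulated survey responses; the variable names `count` and `crowded` and the simulated relationship are hypothetical.

```r
library(mgcv)  # recommended package providing gam()

set.seed(42)
n <- 400
# Simulated (not real) survey: daily track count and whether a visitor felt crowded
visits <- data.frame(count = round(runif(n, 50, 700)))
visits$crowded <- rbinom(n, 1, plogis(-3 + 0.006 * visits$count))

# Binomial GAM: a smooth function of track count on the logit scale
fit <- gam(crowded ~ s(count), family = binomial, data = visits)

# Predicted proportion crowded across a grid of daily track counts
grid <- data.frame(count = seq(100, 700, by = 100))
grid$predicted <- as.numeric(predict(fit, newdata = grid, type = "response"))
print(grid)

plot(fit, trans = plogis, shift = coef(fit)[1],
     xlab = "Daily track count", ylab = "Predicted proportion crowded")
```

The smooth term lets the data decide the shape of the relationship instead of forcing a straight line on the logit scale.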
Facilitating tourism Fiordland
37
1800-plus staff
Several hundred science graduates
science and technical work at national, regional and local levels
Effective conservation
management


Requires evidence based on data
Typical questions

What are the trends in abundance and health for native species and ecosystems?
How can management make a difference?
How to deal with threats effectively?
How are visitors using parks and facilities?
What visitor issues need to be managed?
Effective conservation
management

Moving beyond broad qualitative statements
demands quantitative assessments based on data
Statistics are essential
Evidence-based
conservation management

Internally
need to know what to do
increasing emphasis on optimisation based on evidence & monitoring
Externally
need to demonstrate making a difference
maintain government funding
One permanent DOC statistician since 2000

Statistics infrastructure



Software
Training
Consulting


Design
analysis
42
Statistical skills needed

Statistical modelling skills essential for leading science and technical staff



starting from the linear model
through its extensions
mixed models for repeated measures
43
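As an illustration of a mixed model for repeated measures, the sketch below fits a random-intercept model with the `nlme` package (a recommended package). The sites, visits and abundance values are simulated and hypothetical.

```r
library(nlme)  # recommended package providing lme()

set.seed(3)
n_sites  <- 12  # hypothetical monitoring sites
n_visits <- 5   # repeated visits to each site

d <- expand.grid(visit = 1:n_visits, site = factor(1:n_sites))
site_effect <- rnorm(n_sites, sd = 3)  # each site has its own baseline
d$abundance <- 20 + 1.5 * d$visit +
  site_effect[as.integer(d$site)] + rnorm(nrow(d), sd = 1)

# Linear trend over visits, with a random intercept for each site
fit <- lme(abundance ~ visit, random = ~ 1 | site, data = d)

fixef(fit)                       # estimated intercept and trend per visit
intervals(fit, which = "fixed")  # 95% confidence intervals for the fixed effects
```

The random intercept accounts for repeated measurements at the same site being more alike than measurements from different sites.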
Meeting needs at
Dept of Conservation
Increasing statistical
skills

Developed and promoted courses

using a mixture of in-house and external expertise
First emphasis –
data analysis

Much more data collected than analysed
Figure: mean score in percent (with 95% confidence interval), on a 50 to 90 scale, for each assessment class: monitoring objectives, monitoring design & methodology, sampling design, data collection, data analysis, reporting, continuity & review
Statistical Modelling –
Key Area

Most data observational



not experimental
interested in estimating the size of effects with confidence intervals more than in testing null hypotheses
But university focus
designed experiments
hypothesis tests
ANOVA
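The emphasis on effect sizes with confidence intervals can be illustrated with a small base R sketch. The data are simulated, and the names `plots`, `treated` and `count` are hypothetical.

```r
set.seed(10)
# Simulated (not real) counts on 20 untreated and 20 treated plots
plots <- data.frame(treated = rep(c(0, 1), each = 20))
plots$count <- 12 + 3 * plots$treated + rnorm(40, sd = 2)

fit <- lm(count ~ treated, data = plots)

# Estimated size of the treatment effect and its 95% confidence interval
effect <- coef(fit)["treated"]
ci <- confint(fit, "treated", level = 0.95)
print(effect)
print(ci)
```

Reporting the interval shows how large the effect could plausibly be, which a bare p-value from a hypothesis test does not.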
Statistical Modelling

Developed 3-day internal course
drawing together ANOVA and regression into the linear model
extending to generalised linear models
logistic and Poisson regression
binary and count data common
plus graphing & using R software effectively
if time, introduce generalised additive models and/or tree-based models
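The step from the linear model to generalised linear models for binary and count data might look like this in base R. This is a sketch with simulated data; the data frame `sites` and its columns are hypothetical.

```r
set.seed(2)
n <- 200
x <- runif(n, 0, 10)  # hypothetical habitat covariate

# Simulated (not real) presence/absence and count responses
sites <- data.frame(
  x       = x,
  present = rbinom(n, 1, plogis(-2 + 0.4 * x)),
  count   = rpois(n, exp(0.2 + 0.15 * x))
)

# Logistic regression for binary data
logit_fit <- glm(present ~ x, family = binomial, data = sites)

# Poisson regression for count data
pois_fit <- glm(count ~ x, family = poisson, data = sites)

exp(coef(logit_fit)["x"])  # odds ratio per unit increase in x
exp(coef(pois_fit)["x"])   # rate ratio per unit increase in x
```

Both calls keep the formula interface of `lm()`; only the `family` argument changes, which is one reason the linear model is a natural starting point.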
Modelling course

Each student works at a computer






accessing data
creating graphs
applying models
as the trainer demonstrates
Using real data from DOC
Context and relevance of data
very important when teaching in the workplace
Statistical software

We use R


amazingly powerful and flexible
free
BUT

a steep learning curve
50
Software used at DOC



SPSS from about 1998
S-plus added for “upper end” in 2003
R replaced S-plus in 2006
SPSS dropped from 2008
Barriers to adoption of R:
Creating code

Typing code is new for most
used to point-and-click menus
R Commander helps
provides a menu-based interface to R
provides a bridge to R code
Have converted our basic modelling course to use R Commander
works very well
Demonstrate R
Commander

Opening
53
Demonstrate R
Commander

3 windows
55
Demonstrate R
Commander

Milford

Import
56
Demonstrate R
Commander

View data
58
Demonstrate R
Commander
Milford

Graph

Using ggplot
Figure: scatter plot of Annoyance against DayVisitors
Modifying script
- simplified ggplot code
60
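The original slide showed its code as an image, which is not recoverable here. A plot of this kind might be written as below, assuming the `ggplot2` package is installed and using simulated stand-in data; only the variable names `DayVisitors` and `Annoyance` come from the slides.

```r
library(ggplot2)  # assumed to be installed; not part of base R

set.seed(5)
# Simulated stand-in for the Milford data (values are hypothetical)
milford <- data.frame(DayVisitors = round(runif(30, 5, 40)))
milford$Annoyance <- pmax(0, round(0.2 * milford$DayVisitors + rnorm(30, sd = 1.5)))

# Scatter plot with a fitted straight line and its confidence band
p <- ggplot(milford, aes(x = DayVisitors, y = Annoyance)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x)
print(p)
```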
Demonstrate R
Commander

Milford

Linear model
61
Demonstrate R
Commander
63
Barriers to adoption:
Getting help within R

Highly variable
most help files are of limited use to the uninitiated
R help/support needs further development
to make R more accessible beyond statisticians/programmers
R Commander is a big step forward
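For readers new to R, the built-in routes to help are worth knowing even though the quality of individual help pages varies. A minimal sketch using standard functions from the `utils` package:

```r
help("glm")                            # help page for a known function (same as ?glm)
help.search("seasonal decomposition")  # search installed help pages by topic (same as ??)
apropos("^read\\.")                    # visible objects whose names start with "read."
example("mean")                        # run the examples from a help page
```

Running `example()` is often the quickest way in, because it shows working calls rather than formal argument descriptions.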
Graphs

Aimed to improve



data exploration
quality of presentation
Developed a course on graphs


drew heavily on Tufte and Cleveland
exercises to allow students to learn for themselves
From main planning
document in 2002
66
Graphs course
Exercises:
graphical perception hierarchy
demonstrating the inadequacies of pie graphs
improving Excel default graphs
Exercise figure: seven graphs encoding numbers by position on a common scale, position on identical non-aligned scales, length, angle, slope, area and grey scale
List the seven graphs by how easy it is to estimate the SIZE of the number represented
1 is easiest, 7 hardest
Plan to revamp
Including R graphs
ggplot2
See manual online
Google – “designing graphs Westbrooke”
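The contrast between angle or area and position on a common scale can be seen by drawing the same numbers twice in base R graphics. The values below are hypothetical and chosen to be close together, which is where pie graphs are hardest to read.

```r
# Hypothetical percentages that are close in size
values <- c(A = 23, B = 21, C = 20, D = 19, E = 17)

op <- par(mfrow = c(1, 2))  # two plots side by side

# Angle and area: hard to rank the slices by eye
pie(values, main = "Angle / area")

# Position on a common scale: the ordering is easy to read
dotchart(sort(values), xlim = c(0, max(values)),
         main = "Position on a common scale", xlab = "Value")

par(op)  # restore the previous graphics settings
```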
Workplace-based
Training

We emphasise


practical applications and examples using real data
only a basic outline of the theoretical background
formulae and notation kept to a minimum
Intensive block courses (1-3 days)
easier for staff to commit for a short block
staff are dispersed, often in remote areas
Workplace-based
Training…

Small classes
maximum of 12
high trainer to student ratios (1:6)
No formal assessment of students
would use precious classroom time
students highly motivated to learn and apply
Students assess course and applicability to their work
New challenge



Develop web-based learning environments
Face-to-face courses - core role
But on a wider platform
resources
feedback and interaction
Moodle-based
Tourism data in 2011
Diagram: Storage, Web dissemination, Custom extraction, Analysis
Tourism data in 2014
Diagram: Storage, Web dissemination, Custom extraction, Analysis
Regional Tourism information example
One of two top priorities from 2011 review of tourism data
Two big developments for official tourism statistics in 2012
World-first use of administrative (electronic transactions) rather than survey data for regional tourism
Unprecedented reliability and validity at regional level
Major analytical and consultative task over 18 months
Sophisticated statistical techniques used to combine multiple datasets to estimate dollar values
Approximately 500,000 rows of data per month
Growth in tourist spend 2008 - 2012
What did R contribute?

During development





Flexible, scalable data validation
High quality presentation graphics
Flexible experimenting with reports
Fast! (in combination with an RDBMS)
In production
Automated dissemination products
Automated data checking integrating statistical techniques with graphics
E.g. forecast for > 500 series each month and comparison with actual
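The idea of forecasting each series and comparing the forecast with the newly arrived value could be sketched in base R as below. This is not MBIE's production method: the seasonal ARIMA model, the helper `check_series()`, the region names and the simulated data are all assumptions made for illustration.

```r
# Hypothetical helper: forecast one month ahead and flag a surprising actual value
check_series <- function(history, actual, level = 0.95) {
  # Seasonal ARIMA ("airline") model, chosen here only as a simple default
  fit <- arima(history, order = c(0, 1, 1),
               seasonal = list(order = c(0, 1, 1), period = 12))
  fc <- predict(fit, n.ahead = 1)
  z <- qnorm(1 - (1 - level) / 2)
  lower <- as.numeric(fc$pred - z * fc$se)
  upper <- as.numeric(fc$pred + z * fc$se)
  data.frame(forecast = as.numeric(fc$pred), lower = lower, upper = upper,
             actual = actual, flagged = actual < lower | actual > upper)
}

set.seed(7)
months <- 1:61
# Simulated (not real) monthly series with trend and seasonality
make_series <- function() {
  100 + 0.5 * months + 10 * sin(2 * pi * months / 12) + rnorm(61, sd = 2)
}
series <- list(region_a = make_series(), region_b = make_series())
series$region_b[61] <- series$region_b[61] + 40  # inject an obvious data error

# Fit on the first 60 months, then compare month 61 with its forecast
checks <- do.call(rbind, lapply(series, function(x) {
  check_series(ts(x[1:60], frequency = 12), x[61])
}))
print(checks)
```

Because the check is a function applied over a list, the same code scales from two series to several hundred, with only the flagged rows needing a human look.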
Meeting needs at
MBIE
Training at MBIE

Many similarities to DOC

Equip analysts

To do more & better analysis


Bring data to life
Into decision making
83
Training at MBIE

Different context

Initial focus on tourism research team of 6 or 7
Aim to make R tool of choice for much of work
Allow other MBIE staff to observe
Big commitment for team once decision made to use R as workbench
Approach

Group seminars each month

Handling data in R


Dates and seasonal adjustment in R




relevant R Commander menus
decompose(); stl(); interface to X12
Linear model; glm; gams; tree models
Use tourism data and examples
Hands on – limited at first

computer lab already booked
85
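The seasonal adjustment functions named in the seminar list can be tried on a built-in dataset. The sketch below uses `AirPassengers`, which ships with R, as a stand-in for a monthly tourism series; it does not cover the X12 interface.

```r
x <- AirPassengers  # built-in monthly series, used here as a stand-in

# Classical decomposition into trend, seasonal and random components
dec <- decompose(x, type = "multiplicative")
adjusted_dec <- x / dec$seasonal  # seasonally adjusted series

# Loess-based decomposition; logs make the seasonal pattern additive
fit_stl <- stl(log(x), s.window = "periodic")
adjusted_stl <- exp(log(x) - fit_stl$time.series[, "seasonal"])

plot(fit_stl)  # data, seasonal, trend and remainder panels
```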
Approach – first stage

Individual coaching

Assisting with real work problems



Present some at monthly seminars
In person 2 days a month
By phone and shared desktop

Weekly
86
Rapid progress after a few months
First had to solve basic challenges
Accessing the data and common tasks
R became a tool of choice for analysis
International visitors survey
Regional tourism indicators
“Critical mass” of R users
Supporting each other
Approach – new stage


2 days a month
Monthly seminar series

“Intermediate” level


4-6 staff each session
New basic series for


new staff in core team
Wider MBIE staff




Fully subscribed – 10 staff each session
Using R Commander
Computer lab now available
Limited individual coaching
88
Lessons from DOC & MBIE
experiences
R features




Latest statistical methodology
Fast & flexible
reading, manipulating and writing data
In tandem with RDBMS
Reproducible - Peer review
Repeatable in new context
Update data
Great presentation tools
tables and graphics - to LaTeX, Word, etc
Large, growing user community
Free
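Reading, manipulating and writing data in a script is what makes an analysis repeatable when the data are updated. A minimal base R sketch, using temporary files and a hypothetical spend table:

```r
# Hypothetical monthly spend by region (values are made up)
spend <- data.frame(
  region = c("Auckland", "Auckland", "Otago", "Otago"),
  month  = c("2012-01", "2012-02", "2012-01", "2012-02"),
  spend  = c(120, 130, 80, 95)
)

# Write the data out, then read it back as a script would read a new extract
infile <- tempfile(fileext = ".csv")
write.csv(spend, infile, row.names = FALSE)
raw <- read.csv(infile, stringsAsFactors = FALSE)

# Manipulate: total spend for each region
totals <- aggregate(spend ~ region, data = raw, FUN = sum)

# Write the summary table for use in a report
outfile <- tempfile(fileext = ".csv")
write.csv(totals, outfile, row.names = FALSE)
print(totals)
```

Rerunning the same script on next month's file repeats the whole analysis, and a reviewer can read exactly what was done.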
R: learning curve


lack of a point-and-click interface
geeky documentation
R: Dealing with
learning curve


R Commander as bridge
Access resources

Internet




Search, ask questions
Books
Appropriate training and mentoring
Join and develop community of users

World-wide & local communities



Want to share
Colleagues
User groups
92
R usage in state sector
R in govt departments

Core statistical package at DOC


Key package for tourism in MBIE





70+ users
Three (and growing) important official statistics datasets now use R in production
Statistics NZ …
IRD
MSD
+ ???
Who is here from other departments?
Statistics NZ


SAS for production systems
R available to staff for analysis purposes.


many people come with at least basic R skills
SNZ interest group on future use of R

Improve deployment and support, and improve quality of code




e.g. training, mentoring
Identify where R can improve efficiency and quality
Share experiences with other statistics offices using R
Should R be limited to analysis rather than production?
R in wider govt sector





Food Standards
NZ Qualifications Authority
BRANZ
Wellington Regional Council
Health




Midcentral & Canterbury DHB
Midlands health network
NZ Brain Research Institute
plus …??
96
R in research sector increasing


Wide use in universities
Growing use in research institutes


Most CRIs, especially at 2 largest:
AgResearch

“younger statisticians arrive as experienced R users”




encouraged to continue to use R
other statisticians making increasing use of R
R courses for scientists
Plant & Food Research


“10-20 staff rely on R for their work”
“another 20-30+ use R on a reasonably regular basis”
97
R is growing


Working for us at DOC & MBIE
Growing in rest of NZ state sector

As in rest of world
98
Ways forward

Is there interest in an R user list or group focussed on the state sector?
virtual communication?
physical meetings?
May circulate request to express interest
Via email list from this seminar
Concluding remarks

Questions/comments
100
Acknowledgements

Official Statistics System

Sponsor & organise seminar


Andrew Tideswell & Kam Theobald
The people behind R


Core developers
Developers of R packages



R Commander & ggplot plug-in
ggplot2/reshape2/plyr
MBIE & DOC
For supporting progress with R
Staff who have taken to R enthusiastically
Respondents to informal survey on R use