Statistical analysis in the government sector using the R language: Experiences from MBIE and DOC

Peter Ellis, Ministry of Business, Innovation & Employment
Ian Westbrooke, Department of Conservation

Today
- Assessing needs and why R at MBIE
- Assessing needs and meeting them at DOC
- R and official statistics production: the Regional Tourism Indicators
- Training MBIE staff in R
- Reviewing progress and looking forward

Tourism data in 2011
- $4 million per year
- Ranges from departure cards to business surveys
- Combination of in-house, contracted and Statistics New Zealand
- About to have significant change of content and orientation

Tourism data in 2011
- Storage
- Web dissemination
- Custom extraction
- Analysis

Some things we couldn’t do then

[Figure panels: Chinese visitors’ disappointments; All visitors’ disappointments]

Options…

Package: SAS
  Strengths: Solid; good reputation; already used in production; support
  Weaknesses: Graphics; cost
Package: SPSS
  Strengths: GUI a good introduction; already used for storage
  Weaknesses: Lacks basic capability without additional modules; cost
Package: Stata
  Strengths: Cheap; GUI a good introduction; used elsewhere in MED
  Weaknesses: Bad with more than one rectangle of data
Package: R
  Strengths: Free; best graphics; cutting-edge techniques; user community
  Weaknesses: Intimidating; Fear, Uncertainty, Doubt: how can it be any good if it’s free and was developed in NZ?

“In terms of taking advantage of modern statistical techniques, R clearly dominates. When analysing data, it is undoubtedly the most important software development during the last quarter of a century. And it is free…”

“It is powerful, flexible, and it provides a relatively simple way of applying cutting-edge techniques…. SAS software could be written for applying the modern methods mentioned in this book, but for many of the techniques to be described this has not been done.”

(Rand Wilcox) http://r4stats.com/articles/popularity/

[Figure: Number of R- or SAS-related posts to Stack Overflow by week]

Number of blogs by software
  R: 452
  SAS: 40
  Stata: 8
  Others: 0-3

Where is the R Activity?
http://spatial.ly/2013/06/r_activity/

Assessing needs at Dept of Conservation

A third of New Zealand’s land in conservation management

Department of Conservation
- Part of central government
- Conserving heritage
  - natural
  - historic

30+ marine reserves

Protecting indigenous biodiversity
- Unique ecosystems
- Many unique species
  - birds
  - lizards
  - marine mammals
  - plants

Kiwi

Kiwi: threatened due to stoats killing chicks

Kiwi: survival analysis tools
[Figure: curves labelled "Stoat control" and "No control"; x-axis: Day from hatching (0 to 150); y-axis: 0.0 to 1.0]

Tuatara

Tuatara responses to rat removal
Conservation Biology, Volume 21, No. 4, 2007

Marine mammals: Hooker’s sea lion

Plants: Rata

Plants: planting Pingao

Invasive threats to unique flora & fauna

Promoting recreation

Promoting recreation
Franz Josef Glacier, Southern Alps, New Zealand

Tongariro alpine crossing

Generalised additive model: crowding on Tongariro alpine crossing
[Figure: x-axis: Daily track count (100 to 700); y-axis: Predicted proportion crowded (0.0 to 0.8)]

Facilitating tourism
Fiordland

1800 plus staff
- Several hundred science graduates
- science and technical work at national, regional and local levels

Effective conservation management
- Requires evidence based on data
- Typical questions
  - What are the trends in abundance and health for native species and ecosystems?
  - How can management make a difference?
  - How to deal with threats effectively?
  - How are visitors using parks and facilities?
  - What visitor issues need to be managed?
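
The kiwi slide above refers to survival analysis tools without naming a method. One widely used tool for comparing survival with and without stoat control is the Kaplan-Meier estimator, and the sketch below shows the idea; whether the curves on the slide were produced this way is an assumption. It is written in Python using only the standard library purely for illustration (in R, `survfit()` in the survival package provides this estimator). The function name `kaplan_meier` and the toy data are hypothetical and are not DOC monitoring results.

```python
from collections import Counter


def kaplan_meier(durations, observed):
    """Return [(time, survival_probability)] at each distinct death time.

    durations: days from hatching until death or end of monitoring
    observed:  True if the chick died at that time, False if censored
               (still alive when monitoring stopped)
    """
    deaths = Counter(t for t, d in zip(durations, observed) if d)
    survival = 1.0
    curve = []
    for t in sorted(deaths):
        # Chicks still being followed just before time t
        at_risk = sum(1 for u in durations if u >= t)
        # Multiply by the fraction of at-risk chicks surviving past t
        survival *= 1 - deaths[t] / at_risk
        curve.append((t, survival))
    return curve


# Hypothetical toy data for six chicks (assumption, for illustration only)
days = [10, 20, 20, 35, 50, 50]
died = [True, True, False, True, False, False]
curve = kaplan_meier(days, died)

for day, prob in curve:
    print(f"day {day}: estimated survival {prob:.3f}")
```

Running the same function separately on the chicks from each treatment group would give one curve per group, which is the kind of comparison the slide's figure presents.
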

Effective conservation management
- Moving beyond broad qualitative statements demands quantitative assessments based on data
- Statistics are essential

Evidence-based conservation management
- Internally
  - need to know what to do
  - increasing emphasis on optimisation based on evidence & monitoring
- Externally
  - need to demonstrate making a difference
  - maintain government funding

One permanent DOC statistician since 2000
- Statistics infrastructure
  - Software
  - Training
- Consulting
  - Design
  - Analysis

Statistical skills needed
- Statistical modelling skills essential for leading science and technical staff
  - starting from the linear model through its extensions
  - mixed models for repeated measures

Meeting needs at Dept of Conservation

Increasing statistical skills
- Developed and promoted courses using a mixture of in-house and external expertise

First emphasis: data analysis
- Much more data collected than analysed
[Figure: Mean score in percent (with 95% confidence interval), axis 50 to 90, by assessment class: Monitoring objectives; Monitoring design & methodology; Sampling design; Data collection; Data analysis; Reporting; Continuity & review]

Statistical Modelling: Key Area
- Most data observational, not experimental
  - interested in estimating the size of effects with confidence intervals
  - more than in testing null hypotheses
- But University focus
  - designed experiments
  - hypothesis tests
  - ANOVA

Statistical Modelling
- Developed 3-day internal course
  - drawing together ANOVA and regression into the linear model
  - extending to generalised linear models
    - logistic and Poisson regression
    - binary and count data common
  - plus graphing & using R software effectively
  - if time, introduce generalised additive models and/or tree-based models

Modelling course
- Each student works at a computer
  - accessing data
  - creating graphs
  - applying models
  - as the trainer demonstrates
- Using real data from DOC
  - Context and relevance of data very important when teaching in the workplace

Statistical software
- We use R
  - amazingly powerful and flexible
  - free
  - BUT a steep learning curve

Software used at DOC
- SPSS from about 1998
- S-plus added for “upper end” in 2003
- R replaced S-plus 2006
- SPSS dropped from 2008

Barriers to adoption of R: Creating code
- Typing code is new for most
  - used to point-and-click menus
- R Commander helps
  - provides a menu-based interface to R
  - provides a bridge to R code
- Have converted our basic modelling course to use R Commander
  - works very well

Demonstrate R Commander
- Opening
- 3 windows
- Milford: Import
- View data
- Milford: Graph, using ggplot
  [Figure: Annoyance (0 to 10) against DayVisitors (10 to 40)]
- Modifying script: simplified ggplot code
- Milford: Linear model

Barriers to adoption: Getting help within R
- Highly variable
  - most help files are of limited use to the uninitiated
- R help/support needs further development to make R more accessible beyond statisticians/programmers
- R Commander is a big step forward

Graphs
- Aimed to improve
  - data exploration
  - quality of presentation
- Developed a course on graphs
  - drew heavily on Tufte and Cleveland
  - exercises to allow students to learn for themselves

From main planning document in 2002

Graphs course
- Exercises:
  - graphical perception hierarchy
  - demonstrating the inadequacies of pie graphs
  - improving Excel default graphs
- Plan to revamp
  - Including R graphs
  - ggplot2
- See manual online
  - Google: “designing graphs Westbrooke”
[Figure: exercise "List the seven graphs by how easy it is to estimate the SIZE of the number represented; 1 is easiest, 7 hardest"; panels: Position on a common scale; Position on identical non-aligned scales; Length; Angle; Slope; Area; Grey scale]

Workplace-based Training
- We emphasise practical applications and examples using real data
  - only a basic outline of the theoretical background
  - formulae and notation kept to a minimum
- Intensive block courses (1-3 days)
  - easier for staff to commit for a short block
  - staff are dispersed, often in remote areas

Workplace-based Training…
- Small classes
  - maximum of 12
  - high trainer to student ratios (1:6)
- No formal assessment of students
  - would use precious classroom time
  - students highly motivated to learn and apply
- Students assess course and applicability to their work

New challenge
- Develop web-based learning environments
  - Face-to-face courses: core role
  - But on a wider platform
    - resources
    - feedback and interaction
  - Moodle-based

Tourism data in 2011
- Storage
- Web dissemination
- Custom extraction
- Analysis

Tourism data in 2014
- Storage
- Web dissemination
- Custom extraction
- Analysis

Regional Tourism information example
- One of two top priorities from 2011 review of tourism data
- Two big developments for official tourism statistics in 2012
- World-first use of administrative (electronic transactions) rather than survey data for regional tourism
- Unprecedented reliability and validity at regional level
- Major analytical and consultative task over 18 months
- Sophisticated statistical techniques used to combine multiple datasets to estimate dollar values
- Approximately 500,000 rows of data per month

[Figure: Growth in tourist spend 2008 - 2012]

What did R contribute?
- During development
  - Flexible, scalable data validation
  - High quality presentation graphics
  - Flexible experimenting with reports
  - Fast! (in combination with an RDBMS)
- In production
  - Automated dissemination products
  - Automated data checking integrating statistical techniques with graphics
    - E.g. forecast for > 500 series each month and comparison with actual

Meeting needs at MBIE

Training at MBIE
- Many similarities to DOC
- Equip analysts
  - To do more & better analysis
  - Bring data to life
  - Into decision making

Training at MBIE
- Different context
  - Initial focus on tourism research team of 6 or 7
  - Aim to make R tool of choice for much of work
  - Allow other MBIE staff to observe
- Big commitment for team once decision made to use R as workbench

Approach
- Group seminars each month
  - Handling data in R
    - relevant R Commander menus
  - Dates and seasonal adjustment in R
    - decompose(); stl(); interface to X12
  - Linear model; glm; gams; tree models
- Use tourism data and examples
- Hands on: limited at first
  - computer lab already booked

Approach: first stage
- Individual coaching
  - Assisting with real work problems
  - Present some at monthly seminars
- In person
  - 2 days a month
- By phone and shared desktop
  - Weekly

Rapid progress after a few months
- First had to solve basic challenges
  - Accessing the data and common tasks
- R became a tool of choice for analysis
  - International visitors survey
  - Regional tourism indicators
- “Critical mass” of R users
  - Supporting each other

Approach: new stage
- 2 days a month
- Monthly seminar series
  - “Intermediate” level
  - 4-6 staff each session
- New basic series for
  - new staff in core team
  - wider MBIE staff
  - Fully subscribed: 10 staff each session
  - Using R Commander
  - Computer lab now available
- Limited individual coaching

Lessons from DOC & MBIE experiences

R features
- Latest statistical methodology
- Fast & flexible reading, manipulating and writing data
  - In tandem with RDBMS
- Reproducible
  - Peer review
  - Repeatable in new context
- Great presentation tools
  - Update data tables and graphics: to LaTeX, Word, etc.
- Large, growing user community
- Free

R: learning curve
- lack of a point-and-click interface
- geeky documentation

R: Dealing with learning curve
- R Commander as bridge
- Access resources
  - Internet: search, ask questions
  - Books
- Appropriate training and mentoring
- Join and develop community of users
  - World-wide & local communities
    - Want to share
  - Colleagues
  - User groups

R usage in state sector

R in govt departments
- Core statistical package at DOC
- Key package for tourism in MBIE
- 70+ users
- Three (and growing) important official statistics datasets now use R in production
- Statistics NZ …
- IRD
- MSD
- + ??? Who is here from other departments?

Statistics NZ
- SAS for production systems
- R available to staff for analysis purposes
  - many people come with at least basic R skills
- SNZ interest group on future use of R
  - Improve deployment & quality of code
  - Improve support, e.g. training, mentoring
  - Identify where R can improve efficiency and quality
  - Share experiences with other statistics offices using R
  - Should R be limited to analysis rather than production?

R in wider govt sector
- Food Standards
- NZ Qualifications Authority
- BRANZ
- Wellington Regional Council
- Health
  - Midcentral & Canterbury DHB
  - Midlands health network
  - NZ Brain Research Institute
- plus …??

R in research sector increasing
- Wide use in universities
- Growing use in research institutes
  - Most CRIs, especially at 2 largest:
- AgResearch
  - “younger statisticians arrive as experienced R users”
    - encouraged to continue to use R
  - other statisticians making increasing use of R
  - R courses for scientists
- Plant & Food Research
  - “10-20 staff rely on R for their work”
  - “another 20-30+ use R on a reasonably regular basis”

R is growing
- Working for us at DOC & MBIE
- Growing in rest of NZ state sector
- As in rest of world

Ways forward
- Is there interest in R user list or group focussed on state sector?
  - virtual communication?
  - physical meetings?
- May circulate request to express interest
  - Via email list from this seminar

Concluding remarks
- Questions/comments

Acknowledgements
- Official Statistics System
  - Sponsor & organise seminar
  - Andrew Tideswell & Kam Theobald
- The people behind R
  - Core developers
  - Developers of R packages
    - R Commander & ggplot plug-in
    - ggplot2/reshape2/plyr
- MBIE & DOC
  - For supporting progress with R
  - Staff who have taken to R enthusiastically
- Respondents to informal survey on R use
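
The Regional Tourism Indicators slides describe automated data checking in which each of more than 500 series is forecast every month and compared with the actual value. The sketch below illustrates that kind of check in Python using only the standard library. The deck does not state which forecasting method or threshold was used, so the seasonal naive forecast (same month one year earlier), the 25% tolerance, the function names and the two toy series are all assumptions made for illustration.

```python
def seasonal_naive_forecast(history, period=12):
    """Forecast the next value as the value observed one season earlier."""
    if len(history) < period:
        raise ValueError("need at least one full season of history")
    return history[-period]


def flag_unexpected(histories, latest, period=12, tolerance=0.25):
    """Return (name, forecast, actual) for each series whose latest value
    differs from its forecast by more than `tolerance` in relative terms."""
    flagged = []
    for name, history in histories.items():
        forecast = seasonal_naive_forecast(history, period)
        actual = latest[name]
        scale = max(abs(forecast), 1e-9)  # guard against division by zero
        if abs(actual - forecast) / scale > tolerance:
            flagged.append((name, forecast, actual))
    return flagged


# Hypothetical monthly spend indices for two made-up series (assumption)
histories = {
    "region_a": [100, 90, 95, 105, 110, 120, 130, 125, 115, 105, 95, 100],
    "region_b": [50, 48, 47, 52, 55, 60, 66, 64, 58, 53, 49, 51],
}
latest = {"region_a": 104, "region_b": 80}

flagged = flag_unexpected(histories, latest)
for name, forecast, actual in flagged:
    print(f"{name}: forecast {forecast}, actual {actual} - check before release")
```

A production check would likely use a richer forecasting model and plot the flagged series, in line with the slide's point about integrating statistical techniques with graphics; the structure of forecast, compare and flag stays the same.
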