Small team workflow in government analytics
Peter Ellis
Manager, Sector Performance
18 March 2014

Today’s talk
• Who are we and why is our experience important?
• What are “data-intensive economic reports”?
• The challenge
• The solution
• Reflections on analytics in government

The Sector Performance team
• 9-10 staff
• $5 million budget – mostly for outsourced data collection
• One of 3, 4 or 9 analytical teams in MBIE
• Depending on definitions
• But diverse approaches from different teams
• Variety of roles
  • Manage collection of tourism and science and innovation data
  • Analyse and publicly disseminate tourism data
  • Analyse data on all sectors for policy teams and Ministers
  • Support policy teams in other areas
• Midway through the 5-year Tourism Data Improvement Programme
• Since MBIE’s creation, now applying the tools, skills and techniques to a wider range of data

Whatever the terminology, tools and content, your organisation’s “analytics” team/s need to be in this space
http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram

Capability building for an analytical team
• Five key areas needed
  1. Workflow, document management and teamwork
  2. Analytical techniques
  3. Tools
  4. Data reshaping and management
  5. Data storage
• Many programmes don’t take all five into account…
• IT-led BI programmes may focus on only #3 and #5
• Universities typically only teach #2

Data-intensive economic reports
http://www.mbie.govt.nz/what-we-do/business-growth-agenda

The challenge – update the draft overview Sectors Report
• Current version had evolved over 24 months – over 200 plots and 50 tables of data
  • Not all the data sources fully defined
  • Some of the Excel workbooks lost
  • Some data was custom-cut by Statistics New Zealand
  • Home-grown (and inconsistent) concordances to “sector”
  • Some data hard keyed in, and not clear what was original, what was analysis, and what was grooming/reshaping
• Tight timeframe
• High profile, and quality guarantee essential

This is just one worksheet of around 30 – only 20 of which we could find…

Principles for a solution
• Separate the data from the grooming and analysis
• Reproducibility
• Systemised constant teamwork and peer review, requiring:
  • Repository-based version control
  • Centralised and disciplined folder and file structure
  • Modular code with custom functions, palettes and themes
  • Frequent integration and continuous testing
• Cut the dependencies on externals
• Extreme code-based plot polishing
• And for our next project (Small Business Report):
  • Frequent iteration with the client (policy team and Minister)
  • Separate exploratory analysis from polishing

The toolkit
[Diagram: data flow from sources to final products. Labels recovered from the diagram, grouped by column:]
• DATA SOURCES: One-off data slices from various sources; MBIE data; Statistics NZ data; International data; Other regularly acquired data
• DATA MANAGEMENT (SQL Server): Messy custom data; Project specific database; DATA WAREHOUSE (future) with tidy data in datamarts; “ad hoc ETL” adds data to the datamart if suitable
• DATA PROCESSING AND ANALYSIS (R and Git; code-based / auditable / reproducible / version-controlled): Reproducible grooming of data; Exploratory data analysis; Production of visuals and text
• INTERMEDIATE OUTPUTS: Hard copy plots, tables and text; Data for web version
• FINAL PRODUCTS: Hard copy and PDF (LaTeX (preferred), or MS Word / MS PowerPoint; Adobe Illustrator); Interactive web version (HTML, JavaScript); Design, build and touch up for final products

The folder structure
• raw_data
  • concordances
  • NZ.Stat
  • Infoshare
  • custom
• grooming_code
• data
• analysis_code
• output
  • Part I
  • Part II – dashboards
• R
• .git

Held together with key files in the project’s root directory:
• integrate.r (in future to replace with makefile)
• sector_report.rproj
• .Rprofile

[Diagram: Git workflow. John’s PC and Jane’s PC (and their memory sticks) each hold a clone of the Master repository on the shared fileserver (P:/OTSP/somewhere). On each PC the cycle is change, save, then add and commit; work is exchanged with the shared Master by pull and push. Steps labelled on the diagram: init, add, commit, branch; clone; merge; pull; push.]

Particular things that make this approach hum
• Git
• RStudio projects are a great way of organising
• But Notepad++ users can still participate if they use R shortcuts in the root folder of the repo
• Clean, pared back, modular scripts essential for readability
• Create your own palette, ggplot2 themes, font variables and functions for image dimensions and resolution
• Resource for oversight, coordination, ensuring the build works
• Manager needs to be technical enough to dive into the repo
• You wouldn’t have a policy manager who couldn’t use Word
• Clear spec – or ability to have agile iterative approach with client

Joel’s 12 point test for software developer teams
1. Do you use version control for your code?*
2. Can you make a build in one step?
3. Do you make frequent builds (at least daily)?
4. Do you use an issues tracking system?*
5. Do you fix bugs before writing new code?
6. Do you have an up-to-date schedule?
7. Do you have a spec?
8. Do programmers have quiet working conditions?
9. Do you use the best tools available?*
10. Do you have testers? (not sure this one’s relevant)
11. Do new candidates write code during their selection?
12. Do you do hallway usability testing?

Tweaked (*) from http://www.joelonsoftware.com/articles/fog0000000043.html
Surprisingly relevant for analytics teams too

Five things needed for successful capability building
1. External demand
2. Sustained management commitment
3. Resourcing for trialling, experiments and intensive customised training
4. Supportive IT team and environment
5. Preparedness for the process to take years rather than months

Different needs and roles
[Diagram: four groups and their characteristics]

IT BAU
· Web support
· Network support
· Applications packaging

Policy world
· Instantaneous needs
· Fast delivery time
· Unclear and changing data needs
· Non-specialist tools

IT project land
· Capital projects
· Waterfall projects; big design up front
· 12 month + delivery time
· Deliver vital infrastructure to empower analytics teams
· Tools like SSIS
· Roles shown: Project office, Programme manager, PM, BAs, Developers, Architects

Analytics team
· Relies on the infrastructure provided by IT
· Bilingual in IT and Policy languages
· Agile, scrum and “extreme programming” project methods; iterate as it goes
· Tools like R, SQL, Shiny, JavaScript
· Translate policy needs into demands for basic infrastructure from IT
· Use the infrastructure to deliver flexible products

Some particular issues in government
• Demand from Ministers and senior management essential
• Courage required to raise the expectations
• Need to push some boundaries
• Work with, not against, your ICT team
• Common goals
• Recognise where ICT projects are needed and when to use “BAU”
• Balance of waterfall v. agile and beyond
• But - be prepared to use personal machines as a trial environment for new tools and techniques
• Only way to know what you want to invest in – high costs in packaging up new software for locked down networks
• A significant sized team essential to build momentum
• Recent developments only possible for us with the creation of MBIE
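
Appendix: the shared-Master Git workflow as commands

The Git workflow diagram earlier in the deck (two analysts each cloning a Master repository on the shared fileserver, committing locally, then pushing and pulling) can be sketched as a shell session. This is a minimal sketch under stated assumptions, not the team's actual setup: a temporary directory stands in for the shared drive and for both PCs, and the script name groom_tourism.r, the user details and the commit messages are all hypothetical.

```shell
#!/bin/sh
# Sketch of a shared-Master Git workflow. All paths, names and file
# contents below are hypothetical stand-ins for illustration.
set -e

work=$(mktemp -d)              # stands in for the fileserver and both PCs
shared="$work/shared.git"      # the "Master" repository on the shared drive

# 1. Create the bare Master repository and fix its default branch name.
git init --quiet --bare "$shared"
git --git-dir="$shared" symbolic-ref HEAD refs/heads/master

# 2. John clones the (still empty) Master, adds a script, commits, pushes.
git clone --quiet "$shared" "$work/john" 2>/dev/null
cd "$work/john"
git symbolic-ref HEAD refs/heads/master
git config user.name "John"
git config user.email "john@example.invalid"
mkdir grooming_code
echo "# groom tourism data" > grooming_code/groom_tourism.r
git add grooming_code/groom_tourism.r
git commit --quiet -m "Add tourism grooming script"
git push --quiet origin master

# 3. Jane clones Master, edits the same script, commits, pushes.
git clone --quiet "$shared" "$work/jane"
cd "$work/jane"
git config user.name "Jane"
git config user.email "jane@example.invalid"
echo "# peer reviewed by Jane" >> grooming_code/groom_tourism.r
git add grooming_code/groom_tourism.r
git commit --quiet -m "Peer review tweaks"
git push --quiet origin master

# 4. John pulls Jane's change; fast-forward only, so nothing is merged silently.
cd "$work/john"
git pull --quiet --ff-only origin master

john_commits=$(git rev-list --count HEAD)
echo "John now has $john_commits commits"
```

On a locked-down Windows network the same commands would run from Git Bash, with the bare repository created in a folder on the shared drive rather than in a temporary directory. The `--ff-only` flag is a design choice for the sketch: it makes a pull fail loudly when two people have diverged, which prompts a deliberate merge rather than an accidental one.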