ADVISORY EXPERT GROUP
BIG DATA
Statistics Canada • Statistique Canada

Outline
• Big data and the National Accounts
• Establishing the right infrastructure
• Lessons learned: case studies from Statistics Canada
  • Traditional big data
  • Scanner data
  • Electricity consumption
  • Credit card and Interac
  • Remote sensing

Big data and the National Accounts
From a business perspective:
"Big data is high volume, high velocity, and/or high variety information assets that require new forms of processing to enable enhanced decision making, insight discovery and process optimization…" – Gartner (2012), via Wikipedia
From an NSO perspective:
"Big data is high volume, high velocity, and/or high variety information assets that require new forms of processing and which could reduce respondent burden, increase quality, develop new statistical products or enhance the detail of existing statistical products…" – ????

Big data and the National Accounts
Mick Couper of the University of Michigan’s Survey Research Center cites the following limitations NSOs will face when confronting big data:
• lack of covariates in the datasets;
• self-selection and self-reporting biases;
• lack of stability;
• privacy issues;
• access issues;
• opportunity for mischief;
• size issues; and
• selective reporting of results (file drawer problem).
You could add to that:
1. Sustainability: data sources disappear, systems change, perceptions change.
Couper, Mick P., Is the Sky Falling: New Technology, Changing Media, and the Future of Surveys. (Presentation, European Survey Research Association, 5th Conference, Ljubljana, Slovenia, July 2013)

Big data and the National Accounts
There needs to be up-front acknowledgement that we are trying to fit a square peg in a round hole.
The needs of business (big data to increase business intelligence) and of national accountants (big data to produce comprehensive macroeconomic statistics) are quite different.

Dimension of the data | Needs of national accountants | Needs of business
Scope of the dataset | Comprehensive | Limited to the needs of the business
Use of the dataset | Produce meaningful aggregate statistics | Find patterns, explore the detail
Structure of the dataset | Ongoing, stable, regular | Structure can change as required by the business

Putting in place the appropriate infrastructure
To determine how best to leverage big data, an NSO needs to put in place the proper infrastructure to:
1. Obtain the data
2. Process the data
3. Evaluate the data
4. Integrate the data

Putting in place the appropriate infrastructure: Obtaining the data
Use of legislation
• For example, Section 13 of Canada’s Statistics Act states that “A person having the custody or charge of any documents or records that are maintained in any department or in any municipal office, corporation, business or organization, from which information sought in respect of the objects of this Act can be obtained or that would aid in the completion or correction of that information, shall grant access thereto for those purposes to a person authorized by the Chief Statistician to obtain that information or aid in the completion or correction of that information.” 1970-71-72, c. 15, s. 12.
Memoranda of understanding (MOUs), which outline:
• Roles and responsibilities
• Delivery mechanism
• Uses of data
• Termination of the agreement
Purchasing big data
• Many firms sell big data that can be used for business intelligence; it could also be purchased for statistical purposes. Under what conditions and terms should NSOs purchase big data?
Putting in place the appropriate infrastructure: Processing the data
• File transfer system: NSOs need a secure, high-capacity file transfer system to move data from the data provider to the NSO.
• Storage and processing capacity: In most NSOs (especially NA divisions) the processing capacity for big data does not exist.
• Software: Statistics Canada is leveraging the SAS distributed computing solution called “SAS Grid” to shorten the time needed to process and analyze its larger data holdings. In addition, the Data Analysis Resource Center at Statistics Canada maintains a research computer with analytical software installed, offering a wide range of add-ons that provide advanced analytical and visualization tools particular to big data analytics.
• Information management policies: access, privacy, confidentiality, retention.

Putting in place the appropriate infrastructure: Evaluating the data
Big data community of practice
• There needs to be a structure in place that allows analysts and programs to gain knowledge and share experiences with respect to big data, to engage with colleagues internally or externally when needed and to report findings to senior managers when appropriate.
Big data needs to be evaluated with respect to its:
• Quality
• Coverage
• Timeliness
• Detail
• Regularity
In order to leverage big data we need to develop a research and development orientation.

Examples of big data research at Statistics Canada: International merchandise trade statistics
Collection/access agreement: Access to detailed customs data is governed by two memoranda of understanding: one with the Canada Revenue Agency and one with the U.S.
Census Bureau.
Cost: Nil
Dimensions: 1.5 terabytes, 60 attributes
Uses: Balance of Payments, International Merchandise Trade Statistics
Timeliness: 35 days following the reference period
Frequency: Daily, if required
Potential uses: Creating an importer and exporter characteristics file that can be used to analyze the entry and exit of Canadian traders within the Canadian economy, and in studies of globalization, global production, goods for processing and foreign affiliate statistics.

Examples of big data research at Statistics Canada: Taxation statistics
Collection/access agreement: Access to detailed taxation statistics is governed by a memorandum of understanding with the Canada Revenue Agency.
Cost: Approximately $1.6 million
Dimensions: 6 terabytes and growing
Uses: Benchmark estimates of wages and salaries; output; property incomes, taxes, etc.
Timeliness: Earliest use is 45 days following the reference period
Frequency: Mainly annual, some monthly (goods and services taxation statistics)
Potential uses: Creation of a National Accounts longitudinal file, a business-level micro-data file that can be used to undertake studies such as GDP by city, GDP by firm size and productivity by firm size.

Examples of big data research at Statistics Canada: Government finance statistics
Collection/access agreement: No formal agreement in place; there is an institutional understanding between Statistics Canada and the government jurisdictions.
Cost: Nil
Dimensions: 40 million financial transactions, 200 GB
Uses: Government Finance Statistics; government sector of the National Accounts
Timeliness: Earliest is 15 days following the reference period
Frequency: Monthly, quarterly, annual
Potential uses: Local government remains a ‘survey of municipalities’; access to electronic files will increase our ability to provide CMA-level data as well as more revenue and expenditure detail.
Potential data uses for the health, education and justice programs.

Examples of big data research at Statistics Canada: Electronic household transactions (credit and debit)
Collection/access agreement: Memorandum of understanding outlining the roles and responsibilities of both Statistics Canada and the data provider.
Cost: Nil
Dimensions: “Aggregated” big data: number and value of transactions, aggregated by merchant group, by place of transaction (domestic, international) and by class of transactor (personal or commercial).
Uses: Indicator for household final consumption expenditure and international travel abroad
Timeliness: Earliest is 15 days following the reference period
Frequency: Monthly
Potential uses: International travel services, monthly household final consumption expenditure.

Examples of big data research at Statistics Canada: Electronic household transactions (credit and debit)
[Figure: Growth rates (% change), 2008 to 2012. Series: credit transactions domestically acquired by residents; total household final consumption expenditure.]

Examples of big data research at Statistics Canada: Electronic household transactions (credit and debit)
[Figure: Growth rates (% change), 2008 to 2012, accommodation services. Series: credit transactions domestically acquired by residents (accommodation); household final consumption expenditure, accommodation services.]

Examples of big data research at Statistics Canada: Scanner data: vendor specific
Collection/access agreement: MOU in negotiation
Cost: Current costs are nil, though the long-term approach being proposed would involve a quid pro quo
agreement where CPD would provide the company its data back with value added (i.e., an implicit cost would be borne by the division).
Dimensions: Sales, quantities and item descriptions of all goods sold for a given store over a given period
Uses: Consumer prices and household expenditure weights to feed the CPI
Timeliness: TBD, though potentially as little as a one-day lag (e.g., weekly data for a given week could be delivered on the first day of the following week).
Frequency: Initial data have been provided on a weekly aggregated basis. Future work will look at daily and/or transaction-level data.
Dataset size: For one week of sales data (aggregated over the week) for one store,
• roughly 4,000 KB
• roughly 30,000 rows (i.e., unique items sold)
• implies roughly 200 MB for one year of weekly aggregated data for one store (4,000 KB × 52 weeks ≈ 208,000 KB).
Potential uses moving forward: Direct input into the calculation of the CPI (potential replacement for collected prices), studies on consumer behaviour, CPI weights, household final consumption expenditure, retail sales.

Examples of big data research at Statistics Canada
Smart meter: household electricity consumption
Collection/access agreement: Two memoranda of understanding with two regional electricity distributors
Cost: Nil
Dimensions: Roughly 200 GB of raw hourly electricity consumption data have been obtained, providing detailed information on approximately 120,000 customers for the years 2008 to 2013
Uses: Household electricity consumption
Timeliness: Earliest is 15 days following the reference period
Frequency: Hourly
Potential uses: Household final consumption expenditure; the utilities component of monthly Gross Domestic Product.
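The smart meter data above arrive hourly, while the potential uses listed are monthly indicators, so some aggregation step is implied. The following is a minimal sketch of that step, not the method Statistics Canada uses: the record layout (an ISO timestamp paired with a kWh reading) and the function name `monthly_totals` are assumptions made for illustration.

```python
from collections import defaultdict
from datetime import datetime


def monthly_totals(readings):
    """Sum hourly kWh readings into totals keyed by (year, month).

    `readings` is an iterable of (iso_timestamp, kwh) pairs. This layout
    is hypothetical; real distributor files will differ.
    """
    totals = defaultdict(float)
    for timestamp, kwh in readings:
        t = datetime.fromisoformat(timestamp)
        totals[(t.year, t.month)] += kwh
    return dict(totals)


# Illustrative readings straddling a month boundary (made-up values).
sample = [
    ("2012-01-31T22:00:00", 1.2),
    ("2012-01-31T23:00:00", 0.8),
    ("2012-02-01T00:00:00", 0.5),
]
result = monthly_totals(sample)
```

In practice the same pass would also have to handle missing hours, meter changes and time zones, which is part of the processing cost discussed in the lessons learned.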
Examples of big data research at Statistics Canada
Smart meter: household electricity consumption
[Figure: Total residential consumption]

Examples of big data research at Statistics Canada
Satellite imaging: land account
Collection/access agreement: Public data
Cost: Nil
Dimensions: 20 GB. Although not apparent here, the “dimension” of this type of big data (which is not really big data, strictly speaking) may well explode in the coming years. LIDAR datasets (high-resolution laser ranging), as well as higher-resolution (space and time) satellite data, will require terabytes of storage and “terahertz” of processing capacity.
Uses: Land accounts: Land cover / land use change 2000 and 2010 2013
Timeliness: 3-year lag
Frequency: Annual
Potential uses moving forward: Landscape and freshwater ecosystem accounts

Examples of big data research at Statistics Canada
Remote sensing: land use

Examples of big data research at Statistics Canada
Water measurement instruments: water account
Collection/access agreement: Informal agreement with the Water Survey of Canada (WSC)
Cost: Nil
Dimensions: Original WSC data are 5 GB; derived water yield data are 90 GB
Uses: Water accounts: water yield
Timeliness: From real time to a lag of several years
Frequency: Daily
Potential uses moving forward: Freshwater ecosystem accounts

Some lessons learned so far
1. Quid pro quo: this is important when trying to obtain ‘big data’. Firms are more willing to part with their ‘big data’ if you show them how they will receive a ‘business intelligence’ benefit on their side.
2. Cost: ‘big data’ is not always the cheapest option. It is sometimes easier to have the firm complete the survey than to create an infrastructure to receive and process their data.
For example, the data received from local electricity providers is equivalent to the completion of two questions on our current survey.
3. Classification systems: ‘big data’ does not follow any standard classification system. For example, electronic retail transactions are classified according to merchant groups rather than industries.
4. Big data aggregates: asking firms to aggregate their ‘big data’ is an option.
5. Data formats: we need to work with new data formats that we are often not familiar with.

Discussion points for the AEG
• In order to exploit the potential of big data, NSOs need to make significant investments. How can we leverage the work taking place across various NSOs to minimize the investment and maximize the return?
• How do we promote the development of new data products using big data over using big data to re-construct existing data products? Do we adjust our frameworks to accommodate big data or do we adjust big data to accommodate our frameworks?
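Lesson 3 and the last discussion question both turn on classification: transaction data come grouped by merchant group, while the accounts are organized by industry. One way to adjust big data to an existing framework is a concordance table. The sketch below assumes a hypothetical concordance; the merchant-group names, the industry labels chosen and the function name `reclassify` are illustrative only and are not Statistics Canada's actual mapping.

```python
from collections import defaultdict

# Hypothetical concordance: merchant group -> industry label.
# Both sides are made up for illustration.
CONCORDANCE = {
    "grocery stores": "Retail trade",
    "gas stations": "Retail trade",
    "hotels and motels": "Accommodation and food services",
    "restaurants": "Accommodation and food services",
}


def reclassify(transactions, concordance):
    """Aggregate (merchant_group, value) pairs by industry.

    Groups missing from the concordance are returned separately so that
    coverage gaps stay visible instead of being silently dropped.
    """
    by_industry = defaultdict(float)
    unmapped = defaultdict(float)
    for group, value in transactions:
        industry = concordance.get(group)
        if industry is None:
            unmapped[group] += value
        else:
            by_industry[industry] += value
    return dict(by_industry), dict(unmapped)


# Illustrative aggregated transaction values (made-up numbers).
sample = [
    ("grocery stores", 120.0),
    ("restaurants", 45.5),
    ("hotels and motels", 200.0),
    ("online marketplaces", 30.0),
]
by_industry, unmapped = reclassify(sample, CONCORDANCE)
```

Returning the unmapped groups alongside the totals is a deliberate choice here: it gives a direct measure of the coverage dimension against which big data is to be evaluated.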