Big Data

What’s Going on in Survey Research?
Lars Lyberg
Stockholm University
Frimis, November 11, 2015

A Changing Survey Landscape
• Probability and nonprobability sampling
• Total survey error
• New technology
• Big data
• International surveys
• Hard-to-survey populations

Probability Sample
Every object in the target population has a known non-zero probability of being selected
• Very few samples in market, opinion and social research live up to this definition
• Reasons include nonresponse, frame problems, and special research goals

The Origins of Probability Sampling
• Introduced in 1934
• Basically a financial breakthrough
• Data collection was expensive
• To be able to say something about a population based on a relatively small sample, with a margin of error to go with it, was almost like magic

A Couple of Giants
• Sir Ronald Fisher
• Jerzy Neyman

Problems
• It took a while for probability sampling to be accepted
• The sampling theory did not handle other error sources very well
• Basically the only “allowed” error source is sampling

Issues Associated with Sampling
• Ridiculous response rates
• Increased demands for timely data
• Access to large volumes of (inexpensive) data
• Margins of error are understated
• Discussions about nonprobability sampling
• New less expensive ways of collecting data
• The advent of opt-in panels
• Proper inference not always possible

Examples of Statements
• Probability sampling is the only reasonable way to achieve representativity
• Probability samples are not representative due to nonresponse
• There is no theoretical foundation for opt-in panels
• There are theories and methods based on modeling and weighting

More Statements
• Studies show that probability sampling is more accurate than nonprobability sampling
• Some of these comparisons are flawed since weighting of the nonprobability samples has not been sufficiently ambitious
• Even though results from opt-in panels might be biased to some extent, they come at a fraction of the cost of a probability sample and much quicker

The Current Situation
• Both probability and nonprobability sampling have problems
• Bayesian inference gaining ground
• Lots of experimentation needed
• Quality criteria need to be defined

The Recent British Election
• Whilst the Conservatives won convincingly, 18% of the campaign polls had suggested a dead heat and a further 46% had suggested Labour leads.
• Of the 36% of polls that registered Conservative leads, three out of four showed leads that were less than half the actual outcome.
• Both probability sampling and panels failed.
• The British Polling Council has initiated an investigation into why things went wrong.

Total Survey Error
• Sampling error: due to selecting a sample instead of the entire population
• Nonsampling error: errors due to mistakes or system deficiencies

Risk of Bias and Variance by Error Source

MSE component            Var     Bias
Sampling error           High    Low
Specification error      Low     High
Nonresponse error        Low     High
Frame error              Low     High
Measurement error        High    High
Data processing error    High    High

What to do about Total Survey Error
• Minimize variances and biases through QA, QC, QM, and best practices
• Estimate the size of the total error
• Apply risk management

New Technology
• Smartphones as a data collection mode
• Social media as an information source
• GPS

Big data is a term that describes data sets so large and complex that they cannot be processed and analyzed with conventional software systems.
Sources:
• Transaction databases
• Social media
• The Internet of Things

A Black Swan
A black swan is an undirected and unpredicted event. It is rare and has an extreme impact, but in retrospect we saw it coming
• Internet - yes
• 9/11 - yes
• The Lehman Brothers crash - yes
• The advent of Big Data - ?
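
A small numerical sketch may help connect the Total Survey Error slides above with the claim that margins of error are understated. It relies on the standard decomposition MSE = variance + bias squared and on the usual simple-random-sampling margin of error for a proportion. The sample sizes and the 3-point bias are assumptions chosen for illustration only; they do not come from the slides, and the function names are hypothetical.

```python
import math


def margin_of_error(p: float, n: int, z: float = 1.96) -> float:
    """Approximate 95% margin of error for a proportion under simple random sampling."""
    return z * math.sqrt(p * (1.0 - p) / n)


def mse(variance: float, bias: float) -> float:
    """Mean squared error: variance plus squared bias."""
    return variance + bias ** 2


# Illustrative numbers only (assumptions, not taken from the slides).
p = 0.5                  # true proportion
small_n = 1_000          # probability sample, assumed free of bias
large_n = 1_000_000      # large opt-in or big-data source
assumed_bias = 0.03      # assumed selection bias of 3 percentage points

var_small = p * (1 - p) / small_n
var_large = p * (1 - p) / large_n

rmse_small = math.sqrt(mse(var_small, 0.0))
rmse_large = math.sqrt(mse(var_large, assumed_bias))

print(f"Small probability sample: MoE = {margin_of_error(p, small_n):.4f}, RMSE = {rmse_small:.4f}")
print(f"Large biased source:      MoE = {margin_of_error(p, large_n):.4f}, RMSE = {rmse_large:.4f}")
```

Under these assumptions the large source reports a far smaller sampling-only margin of error, yet its root mean squared error is roughly twice that of the small sample, because the bias term does not shrink as n grows.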

The Three V’s
• Volume
    • Tera- to Peta- to Exabytes of data, stored and processed
• Variability
    • Structured, unstructured, text, images, maps, multimedia
    • Varying sources
• Velocity
    • Streaming data, from seconds to milliseconds
• Veracity
    • Can we trust Big Data? Can we use it? Proxies, indicators

Big Data
Examples of Big Data with use or potential use in statistics production
• Google searches (flu trends)
• Traffic camera data
• Retail scanner data
• Credit card and transaction data
• GPS data

Hype of Big Data
[Figure: Gartner’s hype curve. Source: Wikipedia]

Happiness and Well-being
The common survey question: How satisfied are you with your life?
Big data alternative:
• 10 million tweets that are coded for happiness (rainbow, love, beauty, hope, wonderful, wine…) and non-happiness (damn, boo, ugly, smoke, hate, lied,…)
• Happiest states: Hawaii, Utah, Idaho, Maine, Washington
• Saddest states: Louisiana, Mississippi, Maryland, Michigan, Delaware

Big Data Challenges
• Data quality
• Data analytics
• Confidentiality concerns

Mono Surveys vs 3MC Surveys
• 3MC = multinational, multiregional and multicultural contexts
• One population vs more than one population
• In 3MC, TSE or MSE as planning criteria must be complemented by equivalence or comparability
• 3MC surveys need to be designed with a mixture of standardization and flexibility to achieve operational equivalence
• Implementation and control much more demanding in 3MC surveys

Examples of 3MC Surveys
• Gallup World Poll (GWP)
• Student assessment (PISA)
• European Statistical System
• European Social Survey (ESS)
• World values (WVS)
• Health, ageing and retirement (SHARE)
• Marketing surveys on customer satisfaction, brand names, attitudes, finances etc.
• Pure entertainment surveys
• Adult literacy (IALS)
• Adult skills (PIAAC)
• Electoral systems (CSES)

Some Special Features in a 3MC Survey Setting
• Comparability is the main goal
• Concepts must have a uniform meaning
• Risk management differs
• Financial and methodological resources differ (3MC’s are expensive)
• National and international interests are in conflict
• Scientific challenge
• Administrative challenge
• National pride is at stake

Response Rates in PIAAC, Cycle I (%)

Country           Rate    Country                Rate
Australia          71     Japan                   50
Austria            53     Korea                   75
Belgium            62     Netherlands             51
Canada             58     Norway                  62
Cyprus             73     Poland                  54
Czech Republic     66     Slovak Republic         66
Denmark            50     Spain                   48
Estonia            63     Sweden                  45
Finland            66     UK-England              59
Germany            55     UK-Northern Ireland     65
Ireland            72     USA                     70
Italy              56

Challenges in 3MC Surveys
• Design (what can vary, what is rigid)
• Translation
• Adaptation
• Culturally different error structures
• Data fabrication
• Quality control
• Often too many countries involved

Hard-to-survey Populations (H2S)
• Homeless
• Prostitutes
• Refugees
• Victims
• Persons with disabilities
• Minorities
• Illegal aliens
• Rare (fans, musicians, language groups, extremists)
• Mobile populations (nomads, migrants, students)

Methodological Approaches to H2S
• Innovative sampling methods
• Venue-based (red light districts, voting facilities)
• Indirect sampling
• Snowball and respondent driven
• Qualitative studies (anthropology etc.)
• Formative research

The End of Theory
Faced with massive data, this approach to science - hypothesize, model, test - is becoming obsolete. Petabytes allow us to say: ‘Correlation is enough.’ We can stop looking for models. We can analyze the data without hypotheses about what it might show. We can throw the numbers into the biggest computing clusters the world has ever seen and let statistical algorithms find patterns where science cannot.
Chris Anderson 2008

The Future of Surveys is Uncertain
• Too many surveys, too many off-the-shelf tools
• Active participation going down
• Passive participation going up
• Many problems are global
• Decision makers need data fast and at low cost
• The design-based approach needs refreshment
• Decision makers need data from different sources
• The big survey institutes are worried

Endnote
• Our industry needs innovations and less fighting
• We need to merge with other research cultures
• We need to know more about combining data sources
• We need to account for all major sources of uncertainty that are associated with data collection and analysis of data
• We need to develop new theories for handling error structures, combining data sources, and reaching equivalence

Over and Out
