Handling social science data: Challenges and responses
Paul Lambert, University of Stirling
DAMES research Node, www.dames.org.uk
17/MAR/2010 DIR workshop: Handling Social Science Data

What is social science data?
Example: Accessing surveys via the UK Data Archive
• Shibboleth authentication
• Download and analyse in Stata, SPSS, etc.

Principal forms of data…
• ‘Large and complex social surveys’
  Longitudinal; cross-national; hierarchical
• Small scale social surveys
• Administrative data (e.g. ADMIN node; ADLS; commercial data)
• Supplementary (digital) data
  E.g. ‘GESDE’ services at DAMES
• Qualitative material: audio / video / textual

Large and complex social surveys
• several thousand variables
• tens of thousands of cases (micro-data)
• additional complex survey data features (e.g. household clustering)

Complex data example: British Household Panel Survey dataset [SN 5151]

    . xtdes, i(pid) t(year)

         pid:  10002251, 10004491, ..., 1.794e+08             n =      31877
        year:  1991, 1992, ..., 2007                          T =         17
               Delta(year) = 1 unit
               Span(year)  = 17 periods
               (pid*year uniquely identifies each observation)

    Distribution of T_i:   min      5%     25%     50%     75%     95%     max
                             1       1       2       6       9      17      17

         Freq.  Percent    Cum. |  Pattern
     ---------------------------+-------------------
         4294     13.47   13.47 |  11111111111111111
         2726      8.55   22.02 |  ........111111111
         2032      6.37   28.40 |  ..........1111111
         1224      3.84   32.24 |  ......11111......
          964      3.02   35.26 |  1................
          840      2.64   37.90 |  ..........1......
          632      1.98   39.88 |  ........1........
          631      1.98   41.86 |  ................1
          593      1.86   43.72 |  11...............
        17941     56.28  100.00 | (other patterns)
     ---------------------------+-------------------
        31877    100.00         |  XXXXXXXXXXXXXXXXX

    . tab year

           year |      Freq.     Percent        Cum.
    ------------+-----------------------------------
           1991 |     10,264        4.57        4.57
           1992 |      9,845        4.38        8.95
           1993 |      9,600        4.27       13.23
           1994 |      9,481        4.22       17.45
           1995 |      9,249        4.12       21.56
           1996 |      9,438        4.20       25.77
           1997 |     11,193        4.98       30.75
           1998 |     10,906        4.86       35.60
           1999 |     15,623        6.96       42.56
           2000 |     15,603        6.95       49.51
           2001 |     18,867        8.40       57.91
           2002 |     16,597        7.39       65.29
           2003 |     16,238        7.23       72.52
           2004 |     15,791        7.03       79.55
           2005 |     15,627        6.96       86.51
           2006 |     15,392        6.85       93.36
           2007 |     14,910        6.64      100.00
    ------------+-----------------------------------
          Total |    224,624      100.00

This example shows the BHPS being analysed in Stata. The BHPS has re-contacted its subjects annually since 1991; 4,294 respondents were interviewed as adults in every one of the 17 years. Analysis methods, and measurement issues over time, are challenging.

Supplementary (digital) data
• E.g. ‘Occupational information resources’ (OIRs) = data files with information on occupations, which can be usefully linked to micro-data about occupations
  e.g. GEODE acts as a library of OIRs, www.geode.stir.ac.uk
Such resources are often not widely known about, but they can enhance analysis.

Example: Qualitative data used by ‘Digital Records for e-Social Science’ (DReSS)
• transcribed talk
• audio / video
• digital records
• system logs
• location
[Figure: screenshot with panels labelled video, code tree, transcript, system log]

Three well-known challenges
• We’re data rich, but analyst poor
  • UK Data Forum (2007); Wiles et al. (2009)
  • Under-use of suitably complex statistical models
• Coordination and communication on data processing
  • Recodes / standardisation / harmonisation / documentation
  • Not rewarded or incentivised for researchers
  • Lack of generic/accessible representation of tasks
• Limited disciplinary/project/researcher cross-over when dealing with data
  • Specific software orientations
These are not generally problems of scale, but of organisation.

‘Managed’ responses?
• Data handling/analysis capacity-building
  ESRC programmes (NCRM, RDI, RMP); training workshops/materials; P/G funds; strategic research grant investment
• Documentation/replication policies
  Dale (2006)
• Software for data access and analysis
  NESSTAR: UK Data Archive data/metadata browser
  Long (2009) on the Stata software
  Remote access to data (e.g. SDS)

..train and/or constrain the analysts..
Train them ->

..constrain the analysis..

Non-hierarchical responses?
Technological collaborative services might support effective, unmanaged data access, coordination and exploitation (in principle)
UK e-Social Science investment in data-oriented social science research support:
NeISS; E-Stat; DAMES; Obesity e-Lab; CQeSS

..some examples..
E-Stat
• Design a tool to specify complex statistical models in generic / visual terms
• Multilevel models
• Multiple data permutations and analytical alternatives
• Ready access to a suite of complex modelling tools
National e-Infrastructure for Social Simulation
• Expert-led simulation demonstrations
• Combining data resources
• Workflows for the simulation analysis
• Modify and re-specify existing simulation templates

DAMES: online services for data coordination/organisation
Tools for handling variables in social science data
Recoding measures; standardisation / harmonisation; linking; curating

GESDE: Search and browse supplementary data on occupations; educational qualifications; ethnicity

• Data curation tool (for collecting metadata)

Handling data: analysis-oriented data management priorities
• {Data collection or creation}
• Data preservation or curation
• Data enhancement/modification
• Data analysis
• Multiple permutations of related analyses
• Documentation and replication

Ideas on the future of social science research data
• Enduring challenges of documentation for replication, and coordination
• More and more comparative analysis
• Harmonisation and standardisation
• Data linkage and data enhancement
• Models for complex multiprocess systems
• Fluency: increasing uptake by more users

References and Links
• ADLS: http://www.adls.ac.uk/
• ADMIN Node: http://www.ncrm.ac.uk/about/organisation/Nodes/ADMIN/
• DAMES Node: http://www.dames.org.uk/
• DReSS: http://web.mac.com/andy.crabtree/NCeSS_Digital_Records_Node/
• Secure Data Service: http://securedata.ukda.ac.uk/
• UK Data Archive: http://www.data-archive.ac.uk/

• Dale, A. (2006). Quality Issues with Survey Research. International Journal of Social Research Methodology, 9(2), 143-158.
• Long, J. S. (2009). The Workflow of Data Analysis Using Stata. Boca Raton: CRC Press.
• Wiles, R., Bardsley, N., & Powell, J. L. (2009). Consultation on research needs in research methods in the UK social sciences. Southampton: University of Southampton / ESRC National Centre for Research Methods. http://eprints.ncrm.ac.uk/810/
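
Worked sketch: linking an occupational information resource to micro-data

The slides describe occupational information resources as data files about occupations that can be linked to micro-data on occupations, and list linking among the DAMES data-handling tasks. The Python sketch below illustrates one plausible form of that step, a left join on an occupation code that also reports codes with no match. It is not the DAMES or GEODE implementation. The function name, the field names (`occ_code`, `status_score`) and all data values are invented for illustration.

```python
from typing import Any, Dict, List, Tuple

Record = Dict[str, Any]


def link_occupational_info(
    micro_data: List[Record],
    oir: List[Record],
    key: str = "occ_code",
) -> Tuple[List[Record], List[str]]:
    """Attach OIR fields to each micro-data record that shares the same code.

    Every micro-data record is kept (a left join). Codes that do not
    appear in the OIR are returned separately so they can be documented.
    """
    # Index the occupational information resource by occupation code.
    lookup = {row[key]: row for row in oir}

    linked: List[Record] = []
    unmatched = set()
    for record in micro_data:
        merged = dict(record)  # copy, so the input records are not modified
        extra = lookup.get(record[key])
        if extra is None:
            unmatched.add(record[key])
        else:
            merged.update({k: v for k, v in extra.items() if k != key})
        linked.append(merged)
    return linked, sorted(unmatched)


# Hypothetical person-year records (field names and values are made up).
micro_data = [
    {"pid": 1, "year": 1991, "occ_code": "X01"},
    {"pid": 1, "year": 1992, "occ_code": "X02"},
    {"pid": 2, "year": 1991, "occ_code": "X99"},
]

# Hypothetical occupational information resource: one row per occupation.
oir = [
    {"occ_code": "X01", "status_score": 62.5},
    {"occ_code": "X02", "status_score": 41.0},
]

linked, unmatched = link_occupational_info(micro_data, oir)
for row in linked:
    print(row)
print("Codes with no OIR entry:", unmatched)
```

Returning the unmatched codes, rather than silently dropping those records, keeps the linkage step documented and replicable, which the slides list as a priority.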