ECDF and Social Science Research
Paul Norris
Research Fellow (School of Law)
pnorris@staffmail.ed.ac.uk

Data Analysis in the Social Sciences
- Data analysis in the social sciences can appear small compared to other subjects
  - Datasets typically have only a few thousand cases
  - Models often include only a handful of variables
- Analytical methods are increasingly computer intensive
- Social scientists typically use GUI-based software

Potential Benefits of Using ECDF in the Social Sciences
There are three major benefits to using ECDF:
1) Ability to run models which won't run on a standard PC
2) Increased speed
3) Reduced cost

Multiple Ways to Use Multiple Cores
- Some statistical routines within R are written to run across multiple cores
  - Propensity score matching using GenMatch
  - SPRINT project at EPCC
- Some statistical routines are "embarrassingly parallel"
  - Multiple imputation
  - Latent class based models
- Some models use only one core but multiple chunks of memory

Reporting Crime to the Police
- Between 1992 and 2002 the proportion of crime reported to the police in Scotland fell
- Changes in an aggregate pattern can be attributed to two types of underlying shift:
  - Model change effects: the behaviour of individuals (with identical characteristics) changes over time
  - Distributional effects: the makeup of the "population" changes over time
- Propensity score matching is one technique which can help to address this question

Reporting Crime to the Police

                           Reporting behaviour
Population distribution    1992      2002
1992                       55.7      55.0
2002                       50.1      49.3

Matching on crime type, gender, age, social class, ethnicity, household income, weapon used, threat used, doctor visited, insurance claimed, value of damage/theft, injury, took place at home, tenure and marital status

- Change in the population of crimes and victims seems to have lowered reporting rates
- Reporting behaviour also slipped (but the change is non-significant)
- Change in reporting seems to be most related to distributional changes

Speed Gains From Multiple Cores
- Genetic matching is very computer intensive
- The R routine can be used on a computer cluster

[Figure: time in seconds (axis from 0 to 2,500) against number of processors used for calculations (desktop, single core, then 2 to 7 processors)]

Analysis based on the example dataset from Sekhon (2007), which contains 185 treatment cases and matches on 10 variables

Reduced Costs From Using ECDF
- One alternative to using ECDF might be a separate desktop PC
- A "Select PC" costs £700
- ECDF costs for the police reporting analysis were tiny in comparison
- ECDF resources are also better suited to short bursts of intensive analysis

Conclusions
- ECDF may seem of limited relevance to the social sciences
- Likely to have a steep learning curve
- Benefits can include:
  - Chance to run models "too big" for a normal PC
  - Increased speed of analysis
  - Reduced costs
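
Worked example: decomposing the fall in reporting

The four reporting rates from the matching analysis can be read as a simple decomposition of the 1992 to 2002 change. The sketch below is illustrative only. It assumes the figures are percentages and that each one pairs a population distribution year with a reporting behaviour year, which is the reading that agrees with the slide's conclusions. The names `rates` and `change` are hypothetical and not part of the original analysis.

```python
# Reporting rates (assumed to be percentages), keyed by
# (population distribution year, reporting behaviour year).
# The row/column assignment is an assumption inferred from the slide text.
rates = {
    (1992, 1992): 55.7,
    (1992, 2002): 55.0,
    (2002, 1992): 50.1,
    (2002, 2002): 49.3,
}


def change(start, end):
    """Percentage-point change between two cells, rounded to one decimal."""
    return round(rates[end] - rates[start], 1)


# Overall change: 1992 population and behaviour -> 2002 population and behaviour.
total = change((1992, 1992), (2002, 2002))  # -6.4

# Distributional effect: swap in the 2002 population, keep 1992 behaviour.
distributional = change((1992, 1992), (2002, 1992))  # -5.6

# Model change effect: swap in 2002 behaviour, keep the 2002 population.
behavioural = change((2002, 1992), (2002, 2002))  # -0.8

print("total:", total)
print("distributional:", distributional)
print("behavioural:", behavioural)
```

Taken along this path the two effects add up to the overall fall of 6.4 points, with the distributional shift accounting for 5.6 of them. The code says nothing about statistical significance, which comes from the matching analysis itself.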
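
Worked example: an "embarrassingly parallel" task

Multiple imputation is listed among the embarrassingly parallel routines because each imputed dataset can be built and analysed without reference to the others. The sketch below shows that structure in Python rather than R, purely for illustration. The toy data in `OBSERVED`, the random-draw imputation rule and the function names are all assumptions made for this example. They are a stand-in for a proper imputation model and are not the method used in the analyses described above.

```python
import random
import statistics
from concurrent.futures import ProcessPoolExecutor

# Hypothetical toy data: one variable with two missing values (None).
OBSERVED = [21.0, 35.5, None, 18.2, 40.1, None, 27.3, 30.0]


def impute_and_estimate(seed):
    """Build one completed dataset and return its mean.

    Missing values are filled with random draws from the observed values.
    This is a deliberately crude stand-in for a real imputation model.
    """
    rng = random.Random(seed)  # a separate seed per task keeps runs reproducible
    known = [x for x in OBSERVED if x is not None]
    completed = [x if x is not None else rng.choice(known) for x in OBSERVED]
    return statistics.mean(completed)


def pooled_estimate(seeds, max_workers=None):
    """Run one imputation per seed in separate processes and pool the results.

    The tasks share nothing, so they can simply be mapped across cores.
    The pooled point estimate is the average of the per-imputation estimates.
    """
    with ProcessPoolExecutor(max_workers=max_workers) as pool:
        estimates = list(pool.map(impute_and_estimate, seeds))
    return statistics.mean(estimates), estimates
```

A script would call, for example, `pooled_estimate(range(5), max_workers=2)` from inside an `if __name__ == "__main__":` guard, which process-based parallelism in Python requires on some platforms. Because no task waits on another, adding cores should shorten the run roughly in proportion until there are more cores than imputations.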