Mind the Gap
How holes in your data can lead to stories

Thomas Hargrove, Scripps News Washington Bureau
Jennifer LaFleur, Center for Investigative Reporting

NICAR Baltimore: 2 p.m. March 1, 2014, Salon DEF

• Never assume data are whole: check!
• Simple techniques like sorting
• Many of these we already use for integrity checks
• Graphing over time
• Matching to other data sets
• Statistical tools

• Look for research already done on the topic
• Find experts
• Talk to reporters who have done similar stories
• If possible, talk to records personnel who assembled the data
• Follow data to their source, which is usually people
• Finding stories in the holes
  – Agencies' failure to report
  – Varying reporting rules across geography or agency
  – Government computer system failures
  – Find patterns among missing records
  – Find the reasons behind missing records

How This Project Started

Dr. David Icove
Researcher, University of Tennessee
Retired member of FBI Behavioral Science Unit

For many years, NFIRS reported that only 5% of building fires in the U.S. are intentionally set.

The Impossible Variance of America’s Rate of Arson: 2006 to 2011

Department          State    Fires   Arson Rate
Indianapolis        IN       1,207       0%
San Diego           CA       1,022       0%
New York City       NY      18,988       1%
Gwinnett County     GA       1,678       2%
Houston             TX       7,740       2%
Arlington           TX       1,511       3%
Chicago             IL       5,075       4%
Los Angeles City    CA       7,975      10%
Phoenix             AZ       5,359      12%
Memphis             TN       5,331      16%
Tulsa               OK       3,076      22%
Gary                IN         424      28%
Cleveland           OH       5,742      28%
Toledo              OH       2,544      28%
Saginaw             MI       1,377      32%
Dayton              OH       1,930      33%
Buffalo             NY       1,606      33%
Youngstown          OH       2,125      36%
Highland Park       MI         748      45%
North Las Vegas     NV         435      49%

How Rare is Arson?

But They Should Have Reported:

“Arson is grossly under reported. The true rate, I believe, is 40% to 50% -- in that range.”
--Bill Degnan, President, National Association of State Fire Marshals
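
The sorting check behind the arson-rate table can be sketched in a few lines of Python. This is only an illustration, not the project's actual syntax: the rows are copied from the slide, while the function names and the 5% cutoff (borrowed from the NFIRS national figure quoted earlier) are assumptions made for this example.

```python
# Rows copied from the slide: (department, state, fires, reported arson rate in %).
ROWS = [
    ("Indianapolis", "IN", 1207, 0), ("San Diego", "CA", 1022, 0),
    ("New York City", "NY", 18988, 1), ("Gwinnett County", "GA", 1678, 2),
    ("Houston", "TX", 7740, 2), ("Arlington", "TX", 1511, 3),
    ("Chicago", "IL", 5075, 4), ("Los Angeles City", "CA", 7975, 10),
    ("Phoenix", "AZ", 5359, 12), ("Memphis", "TN", 5331, 16),
    ("Tulsa", "OK", 3076, 22), ("Gary", "IN", 424, 28),
    ("Cleveland", "OH", 5742, 28), ("Toledo", "OH", 2544, 28),
    ("Saginaw", "MI", 1377, 32), ("Dayton", "OH", 1930, 33),
    ("Buffalo", "NY", 1606, 33), ("Youngstown", "OH", 2125, 36),
    ("Highland Park", "MI", 748, 45), ("North Las Vegas", "NV", 435, 49),
]

# Assumption for illustration only: treat rates below 5% as worth a phone call.
SUSPECT_BELOW_PCT = 5


def flag_low_reporters(rows, threshold_pct=SUSPECT_BELOW_PCT):
    """Return rows with a reported rate below threshold_pct.

    Sorted by lowest rate first, then by most fires first.
    """
    flagged = [r for r in rows if r[3] < threshold_pct]
    return sorted(flagged, key=lambda r: (r[3], -r[2]))


def implied_arson_fires(fires, rate_pct):
    """Approximate number of fires reported as arson, from the rounded rate."""
    return round(fires * rate_pct / 100)


if __name__ == "__main__":
    for dept, state, fires, rate in flag_low_reporters(ROWS):
        count = implied_arson_fires(fires, rate)
        print(f"{dept}, {state}: {rate}% of {fires:,} fires (about {count} arsons)")
```

A department that lands on this list is a lead, not a finding: the next step is still to ask the agency why its number is so low.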
“There isn’t a day that goes by that I don’t think: ‘Man, I was a monster.’ I’m just thankful no one was hurt.”
--Kenneth Allen, Muncie, Indiana

The Allen Conspiracy: 46 people set 73 home and vehicle fires to collect $3.8 million from insurance

Lessons Learned from 1 million fires:

• 54,860 fires at ‘unlucky’ buildings that, like Allen’s home, experienced multiple fires, none of which was reported as arson.
• 42,434 fires at buildings that experienced foreclosure, according to the national mortgage monitoring firm RealtyTrac.
• 3,561 fires that had multiple points of ignition, suggesting someone set several fires at once.
• 77,596 fires in unoccupied or vacant buildings.

What’s Next?

• Collect data on 4.8 million fires
• Calculate geographic rates by merging aggregated fire counts to Census Bureau tract data
• Correlate rates of suspicious fires to tracts with unusually high occurrences of fire
• Contact local fire/police authorities to determine if serial arson is suspected or should be investigated

Local gap-mining stories

Here’s FBI data you were never supposed to see

Truck accidents by year and agency

Sometimes you find piles

Statistical tools

• Time series correlation: are your ups and downs real?
• Project/predict data and compare to actual results. What causes the differences?
• Population counts are pretty accurate. Use them to determine reporting rates
• Regression with dummy variables

Make sure the holes are real

EE000132 might actually be the same as EE-000-132

A word of caution

• Do spot checks to make sure what you found is real
• Run your findings by experts
• If possible, engage government sources of data early. They may not be the enemy.
• Challenge your assumptions. Data are only a clue, never an end result

Questions?
Jennifer LaFleur
jlafleur@cironline.org
@j_la28

Thomas Hargrove
hargrovet@scripps.com
202-408-2703

Arson Project syntax files: https://www.dropbox.com/l/LPB7l3kpz7wxvGsHSdTOy9

A copy of this presentation will be at www.jenster.com/2014
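
As a footnote to "Make sure the holes are real": one way to avoid counting EE000132 and EE-000-132 as two different records is to normalize IDs before matching. The sketch below is a minimal Python illustration under assumptions, not the presenters' method. The helper names are hypothetical, and it assumes that punctuation, spacing, and letter case carry no meaning in the ID, which should be confirmed with the agency that issued them.

```python
import re


def normalize_id(raw):
    """Uppercase the ID and drop everything except letters and digits.

    Assumption: hyphens, spaces, and case are formatting noise, not part of the ID.
    """
    return re.sub(r"[^A-Z0-9]", "", raw.upper())


def truly_missing(expected_ids, found_ids):
    """Return the expected IDs that have no match even after normalization."""
    found = {normalize_id(i) for i in found_ids}
    return sorted(i for i in expected_ids if normalize_id(i) not in found)


if __name__ == "__main__":
    # EE000133 is a made-up ID used only to show a record that is really absent.
    expected = ["EE000132", "EE000133"]
    found = ["ee-000-132"]
    print(truly_missing(expected, found))
```

Only the IDs that survive this kind of cleanup are candidates for a real gap, and even those deserve a spot check against the source records.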