Improving the Quality of Tax Statistics: Recent Innovations in Editing and Imputation Techniques at the Statistics of Income Division of the U.S. Internal Revenue Service

Scott Hollenbeck – Scott.M.Hollenbeck@irs.gov
Barry Johnson – Barry.W.Johnson@irs.gov
Melissa Ludlum – Melissa.R.Ludlum@irs.gov

Today’s Presentation
- Overview of Statistics of Income (SOI)
- Dealing with Missing Data
- Recent Innovations
- Future Plans

What Does SOI Do?
- Primary source of U.S. tax data
  - Data from 110 tax returns and information documents
  - Test and correct data collected during administrative processing (IRS Masterfile)
  - Collect extensive additional data from forms, schedules and attachments
  - Most projects collect data from samples
- Products
  - Micro data files for U.S. Treasury Department & Congress
  - Public-use files
  - Tables and analysis (www.irs.gov/taxstats)

SOI Data Collection Systems
- Maintains computer network separate from main IRS processing
- Data collection takes place in IRS Submissions Processing Centers
- Graphical User Interface (GUI) systems based in ORACLE
- Data tested for internal consistency
- Post-edit processing overseen by headquarters staff

Three Major SOI Programs
- Individual Income Tax
  - Filed by individuals and married couples to report most forms of personal income
  - 133 million returns filed in 2006
- Corporation Income Tax
  - Filed by incorporated businesses to report income from parent corporation and subsidiaries
  - 2.5 million returns filed in 2006
- Tax-exempt Organizations
  - Annual information returns report assets, income, expenses
  - 833,000 returns filed in 2006

Missing Data – Unit Nonresponse
- Causes
  - Extensions/late-filed returns
  - Tax evasion
- Strategies
  - Update values from prior year using survey responses
  - Utilize records for recent prior years filed during the selection period

Missing Data – Item Nonresponse
- Causes
  - Taxpayer neglects to provide attachments
  - Paper return is being used by another IRS function
- Strategies
  - Use IRS Masterfile data for key values
  - Impute values based on existing data
    and information provided on prior and/or subsequent returns
  - Surveys and direct contact with preparers

What’s New?
- Digital images of tax returns
- Electronic filing
- Automated error correction/imputation routines

Digital Return Images
- In 1998, SOI began scanning operations
- Images stored in Tagged Image File Format (TIFF)
- In 2006, imaged more than 71.5 million pages from 30 different tax and information returns
- Many users:
  - SOI headquarters staff
  - SOI edit operations
  - IRS functions
  - General public (tax-exempt organizations only)

Split-Screen Edit Systems
- Combines scanned image and GUI edit system on a single 24-inch wide-aspect monitor
- Image displayed using Adobe Acrobat or specially adapted ORACLE programs
- Image and edit systems are synchronized
- Online access to instructions, dictionaries, other tools

Split-Screen Edit Systems
- Positive feedback from editors
- Slight overall improvement in productivity and quality
- Images available to geographically dispersed work force
- Reduced storage of paper documents
- Reduced impact on other IRS functions

Electronic Filing of Tax Returns
- 2004 – Modernized electronic filing (MeF) began
- Uses Extensible Markup Language (XML) to capture:
  - Numeric and character strings supplied by taxpayer
  - Information tags
- 2005 – Mandatory e-file for large business and tax-exempt organizations
  - 20.5% SOI sample of corporate income taxes
  - 13.5% SOI sample of tax-exempt organizations

SOI Use of MeF Data
- In 2006, SOI developed programs to render digital images from XML data
  - Edit returns using split-screen applications
- In 2007, will populate ORACLE data tables directly with XML data
  - Editors will validate data, supply codes and allocate certain data items

Electronic Filing of Tax Returns
- Individual income tax returns
  - 1986 – E-file through paid preparers
  - 1992 – E-file from home computers allowed
  - 1994 – 98% of all filers eligible to e-file
  - 2006 – 73 million returns, or 54%, e-filed
- Data stored in Tax Return Database (TRDB)
  - ASCII data, not tagged XML
  - 2010 – Scheduled for
    conversion to MeF

SOI Individual Income Tax Program
- Sample of returns processed differently depending on certain criteria
  - Edited returns
  - “Missing returns”
  - Forced closed returns

Individual Processing Programs
- Online editing system – editors transcribe, code and review any potential data discrepancies
- Post Edit Reconciliation Process (PERP) – automated computer program that validates and adjusts data

Edited Returns
- Edited returns are processed through the online editing system by an editor, then reviewed using the PERP program
- Prior to Tax Year 2004, all sampled returns that were not “missing” were manually edited
- Currently, only paper returns and electronically filed returns with specific characteristics are edited through the online system

“Missing Returns”
- Each year, approximately 250 paper returns selected for the sample are not located
- Limited IRS Masterfile data available
- PERP program used to impute missing details of forms and schedules

Forced Closed Returns
- Automated processing of certain e-filed returns in the SOI sample
- These returns bypass the online editing system and are processed through the PERP program
- Returns with possible discrepancies are reviewed by a National Office analyst
- Returns that pass all tests are considered “forced closed” and added to the final data file

Results from Forced Closing Returns
- Tax Year 2004 – First year using automated closing of selected electronically filed returns
  - Total sample size – 200,295 returns
  - Electronically filed – 64,670 returns
  - “Forced Closed” – 18,193 returns
  - Editing hours saved – 1,400 hours

Results from Forced Closing Returns
- Tax Year 2005 – Second year of program, expanded criteria for returns eligible to be “forced closed”
  - Total sample size – 292,837 returns
  - Electronically filed – 114,897 returns
  - “Forced Closed” – 47,753 returns
  - Editing hours saved – 4,100 hours

The Future – Data
- More returns and information documents will be filed electronically
- Optical Character Recognition or Intelligent Character Recognition will be used to
  capture data from paper-filed returns
- Data will be available in real time
- Enable larger sample sizes and increased use of population files

The Future – Field Operations
- Increased resources dedicated to resolving data inconsistencies as opposed to data transcription
- Paperless environment – use of electronic data or digital images created from paper returns
- Increased use of prior year data to identify and correct data anomalies

The Future – Products
- Improvements in technology and increased use of electronic filing will allow SOI to produce more data, more quickly and more efficiently
- Increased sample sizes will allow small area estimates
- Population files will allow for creation of ad hoc panels, linkage of data items across tax form types and research on infrequent data items
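
Illustration: routing a sampled individual return

The Individual Income Tax Program slides describe three paths for a sampled return: online editing followed by PERP review, imputation for "missing" returns, and automated forced closing of eligible e-filed returns. The Python sketch below shows one way that routing could be expressed. It is not SOI's actual PERP logic. The field names (needs_manual_edit, amounts), the single consistency test, and the route function are hypothetical and exist only for illustration; the slides do not specify the real eligibility criteria or tests.

```python
from dataclasses import dataclass, field
from enum import Enum


class Route(Enum):
    ONLINE_EDIT = "edited online by an editor, then reviewed by PERP"
    IMPUTE = "missing return: details imputed by PERP"
    ANALYST_REVIEW = "possible discrepancy: reviewed by an analyst"
    FORCED_CLOSED = "passed all tests: added to the final data file"


@dataclass
class SampledReturn:
    e_filed: bool
    located: bool = True             # False when a paper return cannot be found
    needs_manual_edit: bool = False  # hypothetical stand-in for "specific characteristics"
    amounts: dict = field(default_factory=dict)  # hypothetical line items


def passes_consistency_tests(ret: SampledReturn) -> bool:
    """Hypothetical internal-consistency test: components must sum to the total."""
    a = ret.amounts
    return a.get("wages", 0) + a.get("interest", 0) == a.get("total_income", 0)


def route(ret: SampledReturn) -> Route:
    """Choose a processing path for one sampled return (illustrative only)."""
    if not ret.located:
        return Route.IMPUTE
    if not ret.e_filed or ret.needs_manual_edit:
        return Route.ONLINE_EDIT
    if passes_consistency_tests(ret):
        return Route.FORCED_CLOSED
    return Route.ANALYST_REVIEW


if __name__ == "__main__":
    example = SampledReturn(
        e_filed=True,
        amounts={"wages": 50_000, "interest": 200, "total_income": 50_200},
    )
    print(route(example).name)
```

The missing-return check comes first because a return that cannot be located cannot be edited or tested, whatever its other attributes. A real system would replace the single sum check with the full set of PERP tests.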