Methods for detecting errors in VAT Turnover data. Phil Lewis and Daniel Lewis, Office for National Statistics. E-mail: philip.a.lewis@ons.gov.uk daniel.lewis@ons.gov.uk 1) Introduction Turnover from Value Added Tax (VAT) is one of the main administrative sources used in the production of official statistics. When the VAT data reach the statistical office they have generally not been subject to the cleaning required for statistical uses. Cleaning that does happen is more likely to focus on the needs of Her Majesty’s Revenue and Customs (HMRC). This paper describes methods to detect errors in VAT Turnover and reports on work to evaluate the effectiveness of each method. The detection methods are mainly drawn from the most promising findings of a literature review on international best practice in this area. Section 2 discusses methods for detecting suspicious VAT Turnover values, section 3 discusses methods for detecting suspicious patterns in reported VAT Turnover and section 4 discusses methods for detecting unit errors in reported VAT Turnover. Note that the work discussed in this paper represents the first steps in an investigation into using administrative data both within the Office for National Statistics and the wider international statistical community. 2) Methods for detecting suspicious VAT Turnover values When using VAT Turnover for statistical purposes it is important to ensure that the reported figures look reasonable. Five methods are described below for identifying suspicious VAT Turnover values. The results of testing these methods with VAT Turnover data can be found later in this paper. In some cases, the methods concentrate on unusually large Turnover values, since these are the ones that have the potential to cause the biggest errors in statistical outputs. Other methods detect both unusually large and unusually small Turnover values. Method 1 – Quartile distances in industry Turnover The first method is based on a method described in Hoogland and Van Haren (2007). It identifies unusually large or small Turnover by locating extreme values in the distribution of VAT Turnover within a particular industry and size class. The method identifies suspicious businesses as those with Turnover greater than a specified multiple of inter-quartile distances from the third quartile, Q3 (for large Turnover) or less than the same multiple of inter-quartile distances from the first quartile, Q1 (for small Turnover). More formally, suspicious Turnover is identified as follows. 1 If Turnover > Q3 + [C × (Q3 – Median)] Or Turnover < Q1 – [C × (Median – Q1)] then treat at suspicious. Where the quartiles and median are derived from the VAT Turnover data within the industry and size class for the period under consideration. The parameter C is derived from analysing past data to find a threshold that successfully identifies extreme values. C may be given different values for different industry and size classes, dependent on the distribution of data in those classes. Note that this method is easily tailored to the situation where it is only desired to detect unusually large Turnover values, by using just the first inequality. Method 2 – Period on period ratios The second method comes from De Jong (2003) and involves calculating period on period ratios for each business based on the contribution that business’s Turnover makes to its class. As in method 1, the class would commonly be defined as a crossclassification of industry and size. The period on period ratios are called test ratios. Any business with a test ratio above a pre-defined threshold is judged as suspicious. It is then recommended to check the data to identify which of the periods’ data looks most likely to be in error. The method is described below. Calculate Score = VAT Turnover Median VAT Turnover in class for each business. Then calculate Scoret Score t-1 TestRatio = Scoret-1 Score t if Scoret > Scoret-1 otherwise. Where Score t is the value of Score in period t and Score t-1 is the value of Score in period t-1. Any business with a value of TestRatio greater than a pre-defined threshold is deemed to be suspicious. The threshold can be set by analysing past data to see at what point the values of TestRatio appear to indicate suspicious returns. De Jong (2003) describes using a threshold of 40 when using VAT data for monthly statistics and a threshold of 20 when using VAT data for quarterly statistics. 2 Note that the method identifies Turnover values that are both suspiciously large and suspiciously small, since the TestRatio can be large for either of these reasons. Method 3 – Comparison with reporting history for the business The third method compares the current VAT Turnover figure with previous historic values for the same business to determine likely suspicious values. The method is described in slightly different forms in Hoogland and Van Haren (2007) and Lorenz (2010). In its most basic form, the method is described as follows. If Turnover >£100 million And Turnover > 10 × mean Turnover for the business in the past 24 months then treat as suspicious. In this form, the method focuses on businesses with very large reported VAT Turnover. The method can be enhanced by defining different thresholds, both for the size of Turnover and number of multiples of previous mean Turnover, for different sizes of business. A further refinement would be to also vary the thresholds by industry. In each case, it is recommended to test different thresholds on past Turnover data and evaluate which thresholds perform best at identifying extreme values for the desired types of business (whether that be just the larger businesses or a wider range of business sizes). Note that this method only identifies suspiciously large Turnover. Method 4 – Quartile differences combined with measure of influence The fourth method is a refinement to method 1 (Quartile distances in industry Turnover) and is inspired by Hoogland et al (2009). The idea is to combine the detection of suspicious values using quartile differences with some measure of the influence or importance of the business. In Hoogland et al (2009), the influence is measured as the contribution the business makes to its output cell for a particular survey. However, for our purpose a measure of influence that is independent of any survey that the VAT Turnover may be used for is preferable. Therefore in this method the influence is simply calculated as the proportion of VAT Turnover the business contributes to the total VAT Turnover in the industry and size class. The method is as follows. Identify unusual Turnover values using the quartile distances measure described in method 1. Calculate VAT Turnover Influence = Total VAT Turnover in class for each business. 3 Treat any business which has both unusual Turnover values from the quartile distance method and whose Influence is greater than a pre-defined threshold as suspicious. The threshold for influential businesses should be set by analysing past VAT Turnover data. Note that this method effectively subsets businesses failing the quartile distance method, so that only the most influential are viewed as being suspicious. One practical way to implement method 4 is to first set reasonable thresholds for method 1 (Quartile distances in industry Turnover), ensuring that the businesses being identified as suspicious appear to be genuine extremes in the distribution. Then, for method 4, slightly reduce the value of parameter C in the quartile distance method so that more businesses are viewed as being extreme. The threshold for influential businesses can then be set to ensure that a similar number of businesses fail method 4 as failed method 1. The difference between the two methods is that the suspicious businesses in method 4 are determined as a combination of their relationship to the distribution of VAT Turnover in the class and their influence in that class. Method 1 does not take account of the influence. Method 5 – Hidiroglou-Berthelot method The fifth method considered is the Hidiroglou-Berthelot method, which is based on the simple ratio between the VAT Turnover value for a business in the current period and the VAT Turnover value for the same business in the previous period. The method adds some refinements to overcome various drawbacks of using a simple ratio and also to allow for the size of the business to be taken into account. The method comes from Hidiroglou and Berthelot (1986). This method differs from the other methods in that it is drawn from the literature on business survey editing. However, it has been considered in this context to see whether a well-specified survey edit rule is helpful for detecting errors in VAT Turnover data. The Hidiroglou-Berthelot method is defined as follows. current VAT Turnover I. Calculate the ratio r = for each business. previous VAT Turnover II. Transform the ratio so that the data are symmetric and centred around zero (this gives an equal chance of identifying large and small errors): Calculate the median of the ratios, r. If r < median then calculate t = Otherwise calculate t = r - median r r - median median 4 III. Define E = t max (current VAT Turnover , previous VAT Turnover)V This step gives the option of giving greater importance to businesses with larger VAT Turnover. The parameter V can take any value between 0 and 1. A value of 0 results in every business having the same importance, whereas a value of 1 gives greatest importance to businesses with higher Turnover. When setting a value for V, it is recommended to consider whether errors in businesses with larger Turnover are more important. The final choice of V should be chosen after analysing its effect on past data. IV. Calculate the first and third quartiles (Q1 and Q3) and the median (Q2) of the E values calculated in step 3. Then calculate d Q1 = max (Q2 - Q1) , A Q2 d Q3 = max (Q3 - Q2) , A Q2 In most cases, these distances are just the difference between the median and the first and third quartile of the E values respectively. However, in some unusual distributions it is possible for the data to group around the median leading to businesses being erroneously identified as suspicious. To safeguard against this happening, a small multiple (A) of the median of the E values is used instead. The parameter A is commonly given the value 0.05 for this purpose. V. Suspicious businesses are then identified as follows: If E < Q2 – C × d Q1 Or E > Q2 + C × d Q3 Then treat as suspicious. The parameter C effectively controls the number of businesses being viewed as suspicious and should therefore be set by analysing past data to determine the value of C that appears to accurately identify suspicious Turnover. 5 3) Methods for detecting suspicious patterns in reported VAT Turnover It is often possible to identify errors in VAT Turnover data by considering the pattern of reported Turnover over a period. Hoogland (2010) describes a variety of suspicious patterns identified in quarterly VAT Turnover. These patterns can be summarised as follows. Zero Turnover in three quarters, positive Turnover in the other quarter. Zero Turnover in one quarter, positive Turnover in the other three quarters. Same Turnover in all four quarters. Same Turnover for three quarters, a different (positive) Turnover value in the other quarter. Negative Turnover in any of the quarters. These patterns all suggest inaccurate reporting of Turnover. When businesses return positive Turnover for some quarters, a zero response suggests that the true data may have been withheld. When the same Turnover is reported for four quarters, it suggests that the business may have simply divided their annual figure into four. Reporting the same Turnover for three quarters and a different value for the fourth often means that the business has estimated the Turnover for three of the quarters and used the fourth to balance the total Turnover based on an annual figure. The presence of negative Turnover is an error, since by definition Turnover cannot be negative. Hoogland (2010) notes that the presence of these patterns does not always imply that data are suspicious and gives the example of real estate businesses, which may expect to have the same Turnover in each quarter. As a further example, for some seasonal businesses it may be reasonable to have zero Turnover in some quarters but not others. The same logic may be applied to monthly VAT Turnover. Some of these suspicious patterns may also be identified in annually reported Turnover by considering multiple years of data. 4) Methods for detecting unit errors in reported VAT Turnover It may be possible to deal effectively and relatively easily with some errors in VAT Turnover, if they are due to a systematic cause. One common systematic error in the reporting of financial data is a unit error. For example in business surveys, Turnover is often requested to be reported in thousands of pounds. Some businesses report their figures in pounds. This leads to a large error for the business, but it is relatively straightforward to identify and correct that error as long as something is known about the expected Turnover for that business. For VAT Turnover, the error may be in the other direction. Tax offices often require VAT Turnover in pounds, but businesses used to reporting in thousands of pounds for 6 other purposes sometimes report in thousands of pounds for VAT Turnover. These errors can often be detected by comparing with previous VAT returns for the same business, using a rule of the following form. If current VAT Turnover B previous VAT Turnover then assume the current VAT Turnover has been reported in thousands of pounds and multiply by 1000 to get a figure in pounds. A The parameters A and B are used to determine the range of values which indicate a likely thousand pounds error. In deriving A and B, it is recommended to analyse the effect of the rule using past data. The aim is to identify as many thousand pounds errors as possible, without risking too many false corrections. One of the problems with VAT Turnover data is that it is often not possible to recontact businesses to get an idea of their true Turnover figure. Bearing this in mind, it is possible that a business may make an uncorrected thousand pound error in more than one period so that the rule above would not necessarily identify such an error. To identify more of these errors, it may be helpful to consider multiple periods of past Turnover data to get a feel for the true level of Turnover expected for the business. 5) Evaluation of detection methods A key difference between survey and administrative data is that with administrative data it is not possible to re-contact the business and ask them to confirm any suspicious values. One implication of this is that the only option for adjusting suspicious values is to remove the original value and impute a figure which is more in line with expectations. Because of this, it is generally not possible to discover the truth about suspicious VAT Turnover. This means that the evaluation of detection methods is not straightforward and cannot usually be definitive. A number of diagnostics were used to allow for some comparison between methods, especially in relation to the five methods for detecting suspiciously large and suspiciously small Turnover values. These include the proportion of businesses identified as suspicious within each industry and size class and the average size (employment) and VAT Turnover of suspicious businesses compared with the rest of the class. If an independent measure of Turnover is available, it should be possible to estimate the effectiveness of the methods in detecting errors. However, care should be taken over differences in definitions of variables. For example, survey Turnover is often defined differently to VAT Turnover. In this case it would not be valid to conclude that a business with VAT Turnover different to survey Turnover in the same period is an error. However, if a business is detected as an error and has VAT Turnover similar to the survey Turnover, then this can be viewed as being a likely ‘false hit’. 7 When deciding which method to use to identify suspicious Turnover, it is also important to bear in mind the available data and the statistical use of the VAT Turnover. If businesses with larger Turnover values are of more importance, then methods 4 (Quartile differences combined with measure of influence) and 5 (HidiroglouBerthelot) offer the flexibility to give higher weight to those businesses. If good quality historic data are available for the business, then methods 2 (Period on period ratios) and 3 (Comparison with reporting history for the business) are likely to give good results – with method 3 being particularly useful when there is a reasonably long time series of past data and method 2 having the advantage of taking account of the contribution of the business to its class. Method 1 (and the related method 4) should be effective in identifying extreme values when only the current period data are available. Ultimately the decision of which method to use should be informed by analysis of the effect of using the methods on the VAT Turnover data set. 6) Results Methods for detecting suspicious VAT Turnover values The five methods described above for detecting suspicious VAT Turnover were tested on two years of VAT data, 2004 and 2005. Note that a different data set was available for each month in this period, but that the data itself are a mixture of monthly, quarterly and annual Turnover depending on the reporting frequency and pattern of the business. For each period, the detection methods were applied separately for the monthly and quarterly responders. The annual data were not used, as they account for only 1% of the data and the lack of frequency of the data means that some of the detection methods could not be applied easily. The size of the data sets was not the same for every method. This is due to the fact that some methods involve taking averages over previous periods and so can only be calculated for those businesses with data in all the required previous periods. This led to a greatly reduced data set for those methods. This requirement for historical data in itself may be a reason to favour other methods over, for example method 3 (Comparison with reporting history for the business). Table 1 shows the proportion of suspicious businesses identified (on average across all industries), average VAT Turnover and average business register employment for each of the five detection methods for both suspicious and non-suspicious businesses. These values are all averaged over the periods in the study. For methods 1 (Quartile distances in industry Turnover), 4 (Quartile differences combined with measure of influence) and 5 (Hidiroglou-Berthelot) the objective was to set thresholds to identify similar numbers of suspicious businesses, so that the types of businesses identified could be compared. This was not possible for methods 2 (Period on period ratios) and 3 (Comparison with reporting history for the business), since in practice these methods appear to identify much fewer suspicious values than the others. Therefore, parameters similar to those recommended in the literature were used. For method 2, a 8 threshold of 25 was used as a compromise between the monthly and quarterly data. For method 3, the thresholds described in Hoogland and Van Haren (2007) was used. Note that for method 5, the Hidiroglou-Berthelot rule, a value of V = 1 was used to give extra weight to businesses with larger Turnover, as this has been shown to work well with business data in the past. The value of C for this method was 250. Method 1 used a value of C = 10 in the quartile method to give the same proportion of failures. For method 4 a value of C = 8 was chosen in the quartile method and then prioritised the businesses failing that method by VAT Turnover to give a similar proportion of failures as methods 1 and 5. Table 1. Results of testing detection methods 1 – 5 Proportion of suspicious businesses (%) Average VAT Turnover (£000) for suspicious businesses Average VAT Turnover (£000) for nonsuspicious businesses Average employment for suspicious businesses Average employment for nonsuspicious businesses Method 1 4.7 1,296,158 24,033 1,143 125 Method 2 0.1 1,110 10,050 88 204 Method 3 0.1 2,780,112 393,002 1,144 261 Method 4 4.2 1,779,091 27,740 1,266 105 Method 5 4.8 67,034 3,541 727 159 Note that the difference in the average magnitude of businesses between methods can be partially attributed to the reduced size data sets available for methods 2 (Period on period ratios), 3 (Comparison with reporting history for the business) and 5 (Hidiroglou-Berthelot). Table 2 shows the estimated false hit rate for each of the methods. That is the proportion of businesses that were identified as suspicious, but which had VAT Turnover similar to reported survey Turnover. The survey Turnover came from the Annual Business Survey, used for Structural Business Statistics. VAT Turnover data were annualised to make this comparison. False hits could only be identified for those businesses in both the VAT and survey data sets. The figures in table 2 are therefore only an estimate of the rate for the whole VAT data set. In making these comparisons, similarity in Turnover between the two sources was defined as being within 20%. 9 Table 2. Estimated false hits for methods 1 – 5 False hits (%) Method 1 - Quartile distances in industry Turnover 37 Method 2 - Period on period ratios 22 Method 3 – Comparison with reporting history for 30 the business Method 4 – Quartile differences combined with measure of influence 17 Method 5 – Hidiroglou-Berthelot method 32 The best performing methods were those that only use the current period data, that is methods 1 (Quartile distances in industry Turnover) and 4 (Quartile differences combined with measure of influence). This is because using multiple periods led to a reduced data set and therefore less reliable results. Based on the estimated false hits, method 4 performs better than method 1. The average VAT Turnover and employment for method 4 is higher than for method 1, which suggests that more influential businesses are being identified as errors, on average. This is a natural consequence of method 4. It should be noted that methods 2 (Period on period ratios) and 3 (Comparison with reporting history for the business) appear to differ in nature to the other methods, in that they only identify a very small proportion of suspicious values – these methods can be seen as focusing on the most extreme parts of the distribution over time. When choosing a method it is important to bear in mind how many suspicious businesses it is desired to detect. Methods for detecting suspicious patterns and unit errors in reported VAT Turnover The methods for detecting suspicious patterns and unit errors in VAT Turnover were also applied to the VAT data. Due to the way that the VAT Turnover is supplied by HMRC, there is no negative VAT Turnover in the data set. The data were checked for all of the other suspicious Turnover patterns described. Numerous examples of all of these patterns were found in the data. Any business displaying one of these patterns was assigned a marker, describing which of the patterns was present. The unit error method was applied to the data to identify businesses which had incorrectly returned in thousands of pounds instead of pounds. Values of A = 0.00065 and B = 0.00135 were used, corresponding to a 35% variation around a thousand times difference between the current and previous VAT Turnover. These thresholds 10 are in line with those used for thousand pound errors in business surveys. It was thought that some businesses may even have returned in millions of pounds, but a similar check failed to identify any examples of this. 7) Conclusion and recommendations This paper has described methods for detecting errors in VAT Turnover data. Following a comprehensive literature study, five methods for detecting unusual VAT Turnover values have been described and tested. Each of these methods uses parameters which can be fine-tuned to identify an appropriate number of suspicious businesses. The effective values of these parameters are likely to differ between data sources. Therefore, rather than prescribe specific values, it is recommended that the parameters are set through analysis of the effect of the method on the VAT data under consideration. The aim is always for the method to identify a suitable amount of extreme businesses. The interpretation of ‘suitable’ should take account of the expected proportion and magnitude of errors in the data and how the data will be used. The choice of method will depend partly on the availability of data. Methods 1 (Quartile distances in industry Turnover) and 4 (Quartile differences combined with measure of influence) rely only on current period data and detect extremes in the distribution of values for that period. Methods 2 (Period on period ratios), 3 (Comparison with reporting history for the business) and 5 (Hidiroglou-Berthelot) all use previous period data in some way. Previous returns for a specific business often provide an accurate method for detecting errors, since it is expected that the current period value will be similar to the previous value. This type of method is most useful when previous values are available and easily matched for the majority of businesses. Methods 4 (Quartile differences combined with measure of influence) and 5 (Hidiroglou-Berthelot) both give the opportunity to prioritise error detection on larger businesses. Depending on the use of the data, it is often the case that larger businesses are of more importance in statistical outputs. Method 4 is actually a refinement to method 1 (Quartile distances in industry Turnover), and a comparison between the performances of these two methods suggests that prioritising on larger businesses can lead to more accurate detection of errors. The descriptions and examples in this paper will hopefully give useful information on selecting an appropriate method for detecting suspicious VAT Turnover values. However, it is always recommended to test the performance of methods on the data source itself before implementation. This paper has also described suspicious patterns that are commonly observed in VAT Turnover data. These are easily identified and generally imply that the business has made a reporting error. It is recommended that VAT data are checked for these patterns before implementing any other error detection method. Another type of error that is relatively easy to identify and correct is the unit error. If businesses are used to reporting Turnover in thousands of pounds, they may continue to report in this way even if VAT Turnover should be reported in pounds. Other unit errors are also possible. Left untreated, unit errors are often the largest errors in 11 Turnover data. It is recommended that an automatic method is developed to detect and correct any unit errors in VAT Turnover data, before applying any other rules to detect suspicious Turnover values. The final recommendation is that in developing methods for detecting errors in VAT Turnover data, it is always useful to understand the data source and the possible errors that may be found in it. In many cases, it will be necessary to liaise with the data providers to get this information. The next stage is to test the above methods that take specific outputs into account and to develop strategies for what to do with the suspicious data identified for the specific output. References De Jong, A. "Impect: Recent developments in harmonized processing and selective editing", Proceedings of UNECE Work Session on Statistical Data Editing, Madrid, October 2003: Web. Hidiroglou, M. A. and Berthelot, J.-M. “Statistical Editing and Imputation for Periodic Business Surveys”, Survey Methodology, June 1986, Vol. 12, No. 1, pp 7383: Journal. Hoogland, J. "Editing strategies for VAT data", Seminar on 'Using administrative data in the Production of Business Statistics - Member States experiences', Rome, March 2010: Web. Hoogland, J. and Van Haren, G. "Editing and integrating VAT and SBS data", Proceedings of the third International Conference on Establishment Surveys (ICESIII), Montreal, June 2007: CD ROM. Hoogland, J., Van Bemmel, K. and De Wolf, P-P. "Detection of potential influential errors in VAT turnover data used for short term statistics", Proceedings of UNECE Work Session on Statistical Data Editing, Neuchatel, October 2009: Web. Lorenz, R. "The integrated system of editing administrative data for STS in Germany", Seminar on 'Using administrative data in the Production of Business Statistics - Member States experiences', Rome, March 2010: Web. 12