Probabilistic approach for statistical learning in administrative archives

Vincenzo Spinelli

Istat, Italy’s National Statistical Institute, Via Tuscolana 1782, 00173, Rome, Italy
vispinel@istat.it

Summary. Official statistics on social protection schemes constitute the context in which a stochastic approach to a supervised classification problem is developed. This problem is taken from a more general data mining model based on administrative data sources. In this context, data mining means searching for a specific set of patterns in a huge number of records whose variables of interest contain noisy values. These variables can be partially filled, and this implies that conditional probability distributions are naturally defined. We point out that Bayesian classifiers have some drawbacks when they choose the class with the maximum probability; therefore we define a stochastic classifier that has a “similar behavior on average” but can differ on specific classification instances.

Key words: Administrative archive, Supervised classification, Bayesian and stochastic classifier.

1 Introduction

At Istat, Italy’s National Statistical Institute [ISTAT], an IT system for providing official statistics on social protection schemes is being implemented. In particular, the system aims at collecting and processing archives on Non-Pension Cash Benefits (Npcb) by using administrative data sources. In this context, we define Npcb as “unilateral temporary transfers leading to the flow of cash resources from public and private institutions, insurance companies, and employers to families in order to relieve them of the burden of a well-defined set of risk and needs involved in social protection” [CON03]. There are two main units of analysis: social benefits and recipients.
The first unit concerns quantifying the financial resources used for Npcb, while the second concerns identifying the actual recipients of social protection actions. The main characteristics recorded for the recipients are: age, gender, residence, occupation, and household composition. Several Bodies are responsible for paying Npcb, each with a specific competence to protect certain categories of workers. This heterogeneity of Bodies leads to a multiplicity of data sources. One of the main administrative data sources containing micro-level information on recipients is the “770 Form tax register” (770 Form) of the Agenzia delle Entrate [AGEN].

This work describes a specific step of the statistical process for data quality analysis of the 770 Form. A supervised classification problem is defined in section 2 for the sequence of coordinates of the family allowances. A statistical learning process is described in sections 3 and 4. Finally, some examples of this probabilistic approach are described in detail in section 5.

2 Problem Definition

The information of interest included in the 770 Form is: the characteristics of the recipients (age, gender, occupation, residence), the sequence of months in which they received benefits paid in advance by the employer on behalf of Inps [INPS], the family allowances (FA), and the wages [EMI]. In this context, we deal with the treatment of the FA variables for each recipient.

Table 1. Structure of the FA variables (T, C, N)

| Variable | Description | Feasible values |
|---|---|---|
| T - Table | Family type | 11-19, 20A-D, 21A-D |
| C - Class | Make-up of the family unit’s income | 1-16 |
| N - Number | Number of members of the family | 1-8 |

Each recipient can have at most one legal instance of (T, C, N) for each month. Moreover, these variables are needed to establish the amount of cash benefit that the FA recipient gets each month during a specific period of 12 months.

Table 2.
Table 20B for FA cash benefits (rows: C - Class; columns: N - Number of members of the family)

| C | N = 1 | N = 2 | N = 3 | N = 4 | N = 5 | N = 6 | N = 7 |
|---|---|---|---|---|---|---|---|
| 1 | − | 56.81 | 98.13 | 142.03 | 185.92 | 229.82 | 273.72 |
| 2 | − | 43.90 | 87.80 | 123.95 | 173.01 | 222.08 | 260.81 |
| 3 | − | 30.99 | 67.14 | 105.87 | 154.94 | 216.91 | 247.90 |
| 4 | − | 12.91 | 49.06 | 87.80 | 136.86 | 204.00 | 234.99 |
| 5 | − | − | 30.99 | 67.14 | 123.95 | 198.84 | 222.08 |
| 6 | − | − | 12.91 | 49.06 | 105.87 | 185.92 | 211.75 |
| 7 | − | − | − | 30.99 | 74.89 | 167.85 | 191.09 |
| 8 | − | − | − | 12.91 | 43.90 | 149.77 | 173.01 |
| 9 | − | − | − | − | 12.91 | 129.11 | 160.10 |
| 10 | − | − | − | − | − | 61.97 | 142.03 |
| 11 | − | − | − | − | − | − | 61.97 |

From a legal sequence of (T, C, N), we can easily get the monthly amount of cash benefit: T selects a table out of those available, and (C, N) are the indexes into this matrix. Every FA table has an upper triangular structure, as we can see in table 2, where the cash benefits are given in euros. In table 1 we have a rectangular domain for (T, C, N), but some FA sequences can have a null value in matrices like the one shown in table 2: (20B, 8, 2) has a null entry, but (20B, 6, 4) returns 49.06. This situation has some ambiguity that a learning process must take into account: what does it mean if a sequence of (T, C, N) gets a null value or is an illegal sequence? Two reasonable answers are as follows:

1. if the recipient did not get any FA cash benefit, then the sequence of (T, C, N) has been unnecessarily typed;
2. if the recipient did get an FA cash benefit, then the sequence of (T, C, N) has been wrongly typed.

We do not have any further information to distinguish these two opposite situations, and therefore we need to state a hypothesis.

Conjecture 1. “The second situation is stronger than the first one, and this means that if a sequence of (T, C, N) has been typed in some way, it must lead to a non-null value; if it does not, we try to change the sequence to get a significant cash benefit.
If we cannot choose any other (T, C, N) sequence, then we hold the first situation to be true and disregard the (T, C, N) sequence”.

This problem is compounded by another one. The 770 Form archive that we used was assembled from raw data coming from firms without any checking process, and it is therefore characterized by the presence of noise. Generally, this noise is neither well defined nor easily detectable. It originates mainly from typing errors and/or misunderstanding of the compilation rules of the 770 Form. For all these reasons, the learning process on the (T, C, N) variables must answer the following hierarchy of questions:

Question 1. Are the (T, C, N) variables definitely null?

Question 2. If the (T, C, N) variables are not null, are they feasible?

Question 3. If the (T, C, N) variables are not null but unfeasible (e.g. partially filled), what is the right sequence?

3 Learning process of (T, C, N) sequences

All the considerations in section 2 have been gathered to define a stochastic learning process. In this process, we consider no variables other than (T, C, N). This means we have evaluated both the marginal effect of each (T, C, N) variable and the interaction effect (cross-difference) depending on all the covariates in the model [CHU03]. The learning process has many steps and needs some definitions:

Definition 1. Starting Set: $R_{770}$ = 770 Form, 2003 version [M770].

Definition 2. Smoothing function: $\forall x \in R_{770} : ANF(x) = \Pi_{T,C,N}(x)$. This function selects the (T, C, N) variables from vector x, erases the characters that are not present in table 1, and creates a formatted sequence, adding blank characters (ƀ) where necessary according to definition 8 (see below). Some specific typing errors are fixed by heuristic methods. Some examples of its behavior are shown in table 3.

Definition 3. FA Universe: $U = \{x \in R_{770} \mid ANF(x) \neq \text{null}\}$.

Definition 4.
(T, C, N) Set: $L = \{\text{sequences defined by law}\}$ [TNC1].

Definition 5. Training Set: $\Omega = \{x \in U \mid ANF(x) \in L\}$.

Definition 6. Working Set: $\Gamma = \{x \in U \mid ANF(x) \notin L\}$.

These definitions answer the questions of section 2: U answers question 1, Ω answers question 2, and Γ partially answers question 3. If we apply the conjecture of section 2 to Γ, then question 3 is fully answered.

Table 3. Examples for the ANF function

| Input (T, C, N) | Cleaning step | Formatted output | Result |
|---|---|---|---|
| (-11,03,200) | (11,3,2) | (ƀ11,ƀ3,ƀ2) | ∈ Ω |
| (/A x3) ” | (A 3) ” | (ƀƀA,ƀƀ,ƀ3) | ∈ Γ |
| ( 123) ” | ( 0) ” | (ƀƀƀ,ƀƀ,ƀƀ) | ∈ R770 − U |

4 Learning Process

The learning process is basically defined as a stochastic choice based on a “conditional probability distribution” (CPD). This distribution must be evaluated for every x ∈ Γ from a unique prior distribution. We differ from Bayesian classifiers in that we do not select the class having the maximum probability. We hold that no finite amount of evidence can determine an instance’s class membership. This means that the final class assignment is obtained by simulating a random variable distributed according to the CPD. We need some formal steps to define this learning process:

Definition 7. Prior distribution: $\forall l \in L : P(l) = \dfrac{\|\{x \in \Omega \mid ANF(x) = l\}\|}{\|\Omega\|} = \pi_l$

Definition 8. Inclusion relationship: let x, y be two strings of n characters; then

$$x \subseteq y \iff \forall i = 1, \dots, n : \; (y_i \neq \text{ƀ} \wedge x_i = y_i) \;\vee\; (y_i = \text{ƀ})$$

This relationship can be extended to x, y ∈ Γ through the ANF() function.

Definition 9. Conditional training set: $\forall x \in \Gamma : \Omega|_x = \{y \in \Omega \mid ANF(y) \subseteq ANF(x)\}$

Definition 10. Conditional probability distribution: $\forall x \in \Gamma, \forall l \in L : P(l|x) = \dfrac{\|\{y \in \Omega|_x \mid ANF(y) = l\}\|}{\|\Omega|_x\|}$. If $\Omega|_x = \emptyset$ then we set $P(l|x) = 0$. If x ∈ Ω and l = ANF(x) then $P(l|x) = 1$.

Definition 11.
Bayes classification: $\forall x \in \Gamma : \text{new } ANF(x) = \hat{l}$ if $P(\hat{l}|x) = \max_{l \in L} P(l|x)$

The Bayesian approach has some aspects that we want to avoid.

Problem 1. The choices are generally concentrated on the biggest classes of the CPDs. All the other classes have no chance of being chosen in this classification process, even if they are feasible candidates for the right answer.

Problem 2. There is an ambiguity in choosing the solution class when two or more classes are solutions of the optimization problem.

Problem 3. If two or more classes have almost the same size, then the Bayesian solution does not take this situation into account. This is a subtler problem than the previous one.

The approach that we used tries to solve all these problems. If we consider an ordering in L (e.g. lexicographic order), we can simulate a conditional random variable for each CPD P(l|x). When we simulate a random variable we always get a unique value that is an expression of the whole distribution. This means that any class can be chosen, but the CPD gives each of them the right chance.

5 Examples

The examples presented in this section are based on R770. The summary of the learning process is shown in table 4, where we can see the efficacy of the ANF() function: the size of U is 3.74% of the size of R770, and the records in Ω are about 94.5% of those in U. This means that we need to use a classifier for only 0.21% of the R770 input records, namely those in Γ. The two examples that we show below describe some typical situations we met during the classification process: one or two variables are to be fixed.

The examples in the following paragraphs are described by showing the conditional universe L|x, the CPD P(L = l | x), and the conditional cumulative distribution P(L ≤ l | x). This cumulative distribution depends on the order in which the classes in L are read and processed by the system.
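The difference between the two classification rules can be illustrated with a short Python sketch. It is not the implementation used for this work, which relied on GSL; the function names `bayes_choice` and `stochastic_choice`, the `_` character standing in for the blank symbol, and the rounding rule (scaling the uniform draw by the normalization factor and rounding up) are assumptions made for illustration. The class counts are those reported in table 5.

```python
import math
import random
from itertools import accumulate

# Ordered conditional universe L|x with raw (non-normalized) counts,
# taken from table 5. "_" stands in for the blank character (assumption).
TABLE5_COUNTS = [
    (("_11", "_7", "12"), 1),
    (("_11", "14", "12"), 1),
    (("_11", "_4", "12"), 1),
    (("_11", "_3", "12"), 2),
    (("_11", "_1", "12"), 6),
]


def bayes_choice(counts):
    """Return the class with the largest count (first one wins on ties)."""
    return max(counts, key=lambda item: item[1])[0]


def stochastic_choice(counts, u):
    """Map a uniform draw u in [0, 1] onto the cumulative counts.

    Assumed rule: scale u by the normalization factor, round up, and
    return the first class whose cumulative count reaches that value.
    """
    total = sum(n for _, n in counts)
    target = max(1, math.ceil(u * total))
    cumulative = accumulate(n for _, n in counts)
    for (label, _), cum in zip(counts, cumulative):
        if target <= cum:
            return label
    raise ValueError("u must lie in [0, 1]")


if __name__ == "__main__":
    print(bayes_choice(TABLE5_COUNTS))                        # largest class
    print(stochastic_choice(TABLE5_COUNTS, 0.25))             # draw of section 5.1
    print(stochastic_choice(TABLE5_COUNTS, random.random()))  # a fresh draw
```

With the draw 0.25 the sketch selects the third class in the ordering, while the maximum-probability rule always selects the fifth. Over many draws each class is chosen in proportion to its count, which is the sense in which the two rules behave similarly on average.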
None of the distributions is normalized in [0, 1]; for this reason, all the values are the actual numbers of elements of U falling in each class of L|x.

Table 4. Global results for the probabilistic learning process

| Phase | Working Set | Number of records |
|---|---|---|
| Initial state | L | 2,429 |
| Initial state | R770 | 52,361,853 |
| Learning | U | 1,960,387 |
| Learning | Ω | 1,852,177 |
| Learning | Γ | 108,210 |
| Classification | Fixed sequences | 68,253 |
| Classification | Disregarded sequences | 39,957 |

5.1 Missing variable C

Consider the situation ANF(x) = (ƀ11, ƀƀ, ƀ2), where the L|x set has 5 classes, as shown in table 5.

Table 5. Empirical CPD (normalization factor = 11)

|  | $L\vert_x$ | $P(L = l \mid x)$ | $P(L \le l \mid x)$ |
|---|---|---|---|
| 1 | (ƀ11, ƀ7, 12) | 1 | 1 |
| 2 | (ƀ11, 14, 12) | 1 | 2 |
| 3 | (ƀ11, ƀ4, 12) | 1 | 3 |
| 4 | (ƀ11, ƀ3, 12) | 2 | 5 |
| 5 | (ƀ11, ƀ1, 12) | 6 | 11 |

• Bayesian classifier: the class (ƀ11, ƀ1, 12) has the maximum value of the CPD in table 5.
• Stochastic classifier: simulate a uniform random variable in [0, 1]: random = 0.25 → 3 ∈ [0, 11] (0.25 × 11 = 2.75, which rounds to 3); from the conditional cumulative distribution of table 5, we get (ƀ11, ƀ4, 12), a different choice from the Bayesian classifier.

5.2 Missing variables (T, N)

Consider the situation ANF(x) = (ƀƀ, 12, ƀƀ), where the L|x set has 12 classes, as shown in table 6.

Table 6. Empirical CPD (normalization factor = 18,146)

|  | $L\vert_x$ | $P(L = l \mid x)$ | $P(L \le l \mid x)$ |
|---|---|---|---|
| 1 | (ƀ14, 12, ƀ6) | 4 | 4 |
| 2 | (ƀ11, 12, 11) | 1 | 5 |
| 3 | (ƀ11, 12, ƀ9) | 4 | 9 |
| 4 | (ƀ11, 12, ƀ8) | 7 | 16 |
| 5 | (ƀ12, 12, ƀ5) | 3 | 19 |
| 6 | (ƀ12, 12, ƀ4) | 12 | 31 |
| 7 | (ƀ14, 12, ƀ5) | 29 | 60 |
| 8 | (ƀ11, 12, ƀ6) | 180 | 240 |
| 9 | (ƀ14, 12, ƀ4) | 86 | 326 |
| 10 | (ƀ11, 12, ƀ7) | 36 | 362 |
| 11 | (ƀ11, 12, ƀ5) | 2,079 | 2,441 |
| 12 | (ƀ11, 12, ƀ4) | 15,705 | 18,146 |

• Bayesian classifier: the class (ƀ11, 12, ƀ4) has the maximum value of the CPD in table 6.
• Stochastic classifier: simulate a uniform random variable in [0, 1]: random = 0.189849002534 → 3,445 ∈ [0, 18,146]; from the conditional cumulative distribution of table 6, we get the same class as the Bayesian classifier.

6 Conclusion

This work must be seen as part of a more general statistical project mainly based on administrative archives. These archives can contain a huge amount of data, but it is not always possible to get a cross validation by deterministic processes alone. For this reason we have designed a stochastic classifier. We are now defining and testing a more sophisticated model for defining L|x, based on a multi-criterion optimization problem. The (T, C, N) variables have been supplemented with some others (e.g. age, gender, job level) to get a final CPD with a smaller number of classes. Furthermore, from the analysis of situations like the one presented in paragraph 5.1, we are trying to define a mechanism that guarantees a minimum size for each class considered by the classifier.

Finally, some technological considerations on the system used for this work: it was implemented using only open source software. In particular, the random simulation is based on the GSL library [GSL] and the Mersenne Twister routines [MAT98].

References

[SPV04] Spinelli, V., Tancioni, M.: Non Pension Cash Benefits: Computational Approach and Results of the activity on the DM10-Inps archives. Collana Contributi, Istat, Rome (2004).
[TAN04] Spinelli, V., Tancioni, M.: Data Mining for Administrative Data: heuristic and formal model for missing values and outliers. KDNet Symposium “Knowledge-Based Services for the Public Sector”, Petersberg, Bonn (Germany), June 3-4 (2004).
[SPI04] Spinelli, V., Tancioni, M.: Automatic Evaluation Procedures for Elementary Administrative Data. The case of missing values and outliers in Non Pension Cash Benefits Archives. Q2004 - European Conference on Quality and Methodology in Official Statistics, Mainz (Germany), May 24-26 (2004).
[CON03] Consolini, P.: Administrative Data Based Statistics: the Case of non-Pension Cash Benefits (Npcb). Proceedings of the 17th Roundtable on Business Survey Frames, Rome, October 26-31, Volume II, 423–430 (2003).
[CHU03] Chunrong, A., Norton, E.C.: Interaction terms in logit and probit models. Economics Letters, 80, 123–129 (2003).
[CON02] Consolini, P., De Carli, R.: Non Pension Cash Benefits: units of analysis, sources and statistical representation of data. Collana Contributi, Istat, Rome (2002).
[CON00] Consolini, P.: Non-Pension Cash Benefits: Istitutional Aspects and Statistical Classifications. Collana Documenti, Istat, Rome (2000).
[MAT98] Mersenne Twister: (http://www.math.keio.ac.jp/matumoto/emt.html)
[HEC96] Heckerman: Bayesian networks for knowledge discovery. Advances in Knowledge Discovery and Data Mining, Fayyad, Shapiro, Smyth, Uthurusamy. 273–305 (1996).
[CHE96] Cheeseman, P., Stutz, J.: Bayesian classification (AutoClass): theory and result. Advances in Knowledge Discovery and Data Mining, Fayyad, Shapiro, Smyth, Uthurusamy. 153–180 (1996).
[USA96] Fayyad, U.M. et al.: Advances in knowledge discovery and data mining. Menlo Park: AAAI press; Cambridge; London: MIT XIV (1996).
[ISTAT] Istat: Italy’s National Statistical Institute (http://www.istat.it).
[AGEN] Agenzia delle Entrate (http://www.agenziaentrate.it).
[TNC1] (T, C, N) in Inps: (http://www.inps.it/home/default.asp?sID=%3B0%3B4740%3B&lastMenu=4741&iMenu=1&itemDir=4953)
[INPS] Inps: National Institute of Social Security for Private Sector (http://www.inps.it).
[M770] Repository for full information of 2003 Version of 770 Form (http://www1.agenziaentrate.it/modulistica/dichiarazione/2003/770).
[EMI] Emire - Italy - Family Allowance: (http://www.eurofound.eu.int/emire)
[GSL] GSL: Gnu Scientific Library (http://www.gnu.org/software/gsl)