Probabilistic approach for statistical learning in administrative archives

Vincenzo Spinelli
Istat, Italy’s National Statistical Institute, Via Tuscolana 1782, 00173, Rome, Italy
vispinel@istat.it
Summary. Official statistics on social protection schemes constitute the context in which a stochastic approach to a supervised classification problem is developed. This problem has been excerpted from a more general data mining model based on administrative data sources. In this context, data mining means searching for a specific set of patterns inside a huge number of records characterized by noisy values of the variables of interest. These variables can be partially filled, which implies that conditional probability distributions are naturally defined. We point out that Bayesian classifiers have some drawbacks when they choose the class with the maximum probability value; we therefore define a stochastic classifier that has a "similar behavior on average", but can differ on specific classification instances.
Key words: Administrative archive, Supervised classification, Bayesian and
stochastic classifier.
1 Introduction
In Istat, Italy's National Statistical Institute [ISTAT], an IT system for producing official statistics on social protection schemes is being implemented. In particular, the system aims at collecting and processing archives on the Non-Pension Cash Benefits (Npcb) by using administrative data sources. In this context, we define Npcb as "unilateral temporary transfers leading to the flow of cash resources from public and private institutions, insurance companies, and employers to families in order to relieve them of the burden of a well-defined set of risks and needs involved in social protection" [CON03].
There are two main units of analysis: social benefits and recipients. The first unit refers to quantifying the finances used for Npcb, while the second one identifies the actual recipients of social protection actions. The main characteristics of the recipients are age, gender, residence, occupation, and composition of their household. There is a plurality of Bodies responsible for paying Npcb, each with a specific competence to protect certain categories of workers. The heterogeneity of Bodies causes a multiplicity of
data sources. One of the main administrative data sources, containing information on recipients at the micro level, is the "770 Form" tax register (770 Form) of the Agenzia delle Entrate [AGEN].
This work describes a specific step of the statistical process for the data quality analysis of the 770 Form. A supervised classification problem is defined in section 2 for a sequence of coordinates for the family allowances. A statistical learning process is described in sections 3 and 4. Finally, some examples of this probabilistic approach are described in detail in section 5.
2 Problem Definition
The information of interest included in the 770 Form is: the characteristics of the recipients (age, gender, occupation, residence), the sequence of months in which they received benefits paid in advance by the employer on behalf of Inps [INPS], the family allowances (FA), and the wages [EMI]. In this context, we deal with the treatment of the FA variables for each recipient.
Table 1. Structure of the FA variables (T, C, N)

Variable     Description                           Feasible values
T - Table    Family type                           11-19, 20A-D, 21A-D
C - Class    Make-up of the family unit's income   1-16
N - Number   Number of members of the family       1-8
Each recipient can have at most one legal instance of (T, C, N) for each month. At the same time, these variables are mandatory to establish the amount of cash benefit the FA recipient gets each month during a specific period of 12 months.
Table 2. Table 20B for FA cash benefits (amounts in euros)

             N - Number of members of the family
C - Class      1      2       3       4       5       6       7
 1             −    56.81   98.13  142.03  185.92  229.82  273.72
 2             −    43.90   87.80  123.95  173.01  222.08  260.81
 3             −    30.99   67.14  105.87  154.94  216.91  247.90
 4             −    12.91   49.06   87.80  136.86  204.00  234.99
 5             −      −     30.99   67.14  123.95  198.84  222.08
 6             −      −     12.91   49.06  105.87  185.92  211.75
 7             −      −       −     30.99   74.89  167.85  191.09
 8             −      −       −     12.91   43.90  149.77  173.01
 9             −      −       −       −     12.91  129.11  160.10
10             −      −       −       −       −     61.97  142.03
11             −      −       −       −       −       −     61.97
From a legal sequence of (T, C, N), we can easily get the monthly amount of cash benefit: T selects a table out of those available, and (C, N) are the indexes in that matrix. Every FA table has an upper triangular structure, as we can see in table 2, where the cash benefits are given in euros.
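To make the lookup concrete, the following minimal sketch (in Python, which is not the language of the original system) stores an FA table as a dictionary keyed by (C, N); only a small excerpt of table 2 is reproduced, and the names FA_TABLES and monthly_amount are illustrative.

```python
# Minimal sketch of the FA cash-benefit lookup: T selects a table and (C, N)
# index the corresponding matrix. Only a small excerpt of table 2 (table 20B)
# is reproduced here; None stands for a null entry.
FA_TABLES = {
    "20B": {
        (6, 3): 12.91, (6, 4): 49.06,   # row C = 6 of table 2
        (8, 4): 12.91,                  # row C = 8; (8, 2) is a null entry
    },
}

def monthly_amount(t, c, n):
    """Return the monthly FA cash benefit for a (T, C, N) sequence, or None."""
    table = FA_TABLES.get(t)
    if table is None:
        return None                     # unknown table code
    return table.get((c, n))            # None for null entries such as (20B, 8, 2)

print(monthly_amount("20B", 6, 4))      # 49.06
print(monthly_amount("20B", 8, 2))      # None
```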
In table 1 we have a rectangular domain for (T, C, N), but some FA sequences map to a null value in the matrices, like the one shown in table 2: (20B, 8, 2) has a null entry, while (20B, 6, 4) returns 49.06. This situation has some aspects of ambiguity that a learning process must take into account: what does it mean if a sequence of (T, C, N) gets a null value or is an illegal sequence? Two reasonable answers are as follows:
1. if the recipient did not get any FA cash benefit, then the sequence of (T, C, N) has been unnecessarily typed;
2. if the recipient did get an FA cash benefit, then the sequence of (T, C, N) has been wrongly typed.
We do not have any further information to distinguish these two opposite situations, and therefore we need to state some hypotheses.
Conjecture 1. The second situation is stronger than the first one: if a sequence of (T, C, N) has been typed in some way, it must lead to a non-null value; if it does not, we try to change the sequence so as to get a significant cash benefit. If we cannot choose any other (T, C, N) sequence, then we hold the first situation true and disregard the (T, C, N) sequence.
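The conjecture can be read as a small decision cascade; the sketch below is illustrative only: classify stands for any procedure (such as the classifier of sections 3 and 4) proposing an alternative legal sequence, and monthly_amount is the lookup sketch given above.

```python
# Reading of conjecture 1 as a decision cascade: a typed (T, C, N) that does
# not lead to a non-null amount is first handed to a classifier that proposes
# an alternative legal sequence; only if that also fails is it disregarded.
def treat_sequence(t, c, n, classify):
    """Return a usable (T, C, N) sequence, or None when it must be disregarded."""
    if monthly_amount(t, c, n) is not None:
        return (t, c, n)                        # non-null value: keep as typed
    fixed = classify((t, c, n))                 # try to reach a significant benefit
    if fixed is not None and monthly_amount(*fixed) is not None:
        return fixed                            # situation 2: sequence wrongly typed
    return None                                 # situation 1: unnecessarily typed
```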
This problem is magnified by a second one. The 770 Form archive that we used was gathered from raw data coming from firms without any checking process, and it is therefore characterized by the presence of noise. Generally, this noise is not well defined and detectable. Its origin is mainly due to typing operations and/or misunderstanding of the compiling constraints of the 770 Form. For all these reasons, the learning process on the (T, C, N) variables must get an answer to the following hierarchy of questions:
Question 1. Are the (T, C, N) variables definitely null?
Question 2. If the (T, C, N) variables are not null, are they feasible?
Question 3. If the (T, C, N) variables are not null but unfeasible (e.g. partially filled), what is the right sequence?
3 Learning process of (T, C, N) sequences
All the considerations in section 2 have been gathered to define a stochastic learning process. In this process, we do not consider the contribution of any variables other than the (T, C, N) ones. This means we have evaluated both the marginal effect of each (T, C, N) variable and the interaction effect (cross-difference) depending on all the covariates in the model [CHU03].
The learning process has many steps and needs some definitions:
Definition 1. Starting Set: R770 = 770 Form, 2003 version [M770].
Definition 2. Smoothing function: ∀x ∈ R770 : ANF(x) = Π_{T,C,N}(x). This function selects the (T, C, N) variables from the vector x, erases the characters that are not present in table 1, and creates a formatted sequence, adding blank characters (♭) where necessary according to definition 8 (see below). There are specific typing errors that are fixed by heuristic methods. Some examples of its behavior are shown in table 3.
Definition 3. FA Universe: U = {x ∈ R770 | ANF(x) ≠ null}.
Definition 4. (T, C, N) Set: L = {sequences defined by law} [TNC1].
Definition 5. Training Set: Ω = {x ∈ U | ANF(x) ∈ L}.
Definition 6. Working Set: Γ = {x ∈ U | ANF(x) ∉ L}.
These definitions answer the questions of section 2: U satisfies question 1, Ω satisfies question 2, and Γ partially satisfies question 3. If we apply the conjecture of section 2 to Γ, then question 3 is fully satisfied.
Table 3. Examples for the ANF function

Input (T, C, N)   Cleaning step   Formatted output   Result
(-11,03,200)      (11,3,2)        (♭11, ♭3, ♭2)      ∈ Ω
(/A x3)           (A 3)           (♭♭A, ♭♭, ♭3)      ∈ Γ
( 123)            ( 0)            (♭♭♭, ♭♭, ♭♭)      ∈ R770 − U
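The following sketch illustrates definitions 2-6 under stated assumptions: the field widths (3, 2, 2), the allowed characters, and the cleaning rule are inferred from tables 1 and 3, the blank symbol is written ♭, and the heuristic fixes for specific typing errors mentioned in definition 2 are not reproduced. LEGAL in the comments stands for the set L of sequences defined by law [TNC1].

```python
# Sketch of the ANF() smoothing function (definition 2) and of the partition
# of the FA universe into Ω and Γ (definitions 3-6). Field widths, allowed
# characters and the cleaning rule are assumptions inferred from tables 1 and 3.
BLANK = "♭"
ALLOWED = set("0123456789ABCD")                 # characters appearing in table 1

def anf(t, c, n, widths=(3, 2, 2)):
    """Clean the (T, C, N) fields and pad them with blanks; None if all blank."""
    fields = []
    for raw, width in zip((t, c, n), widths):
        cleaned = "".join(ch for ch in str(raw) if ch in ALLOWED).lstrip("0")
        fields.append(cleaned[:width].rjust(width, BLANK))
    if all(f == BLANK * w for f, w in zip(fields, widths)):
        return None                             # plays the role of the null value
    return tuple(fields)

def partition(records, legal):
    """Split records into the FA universe U, the training set Ω and the working set Γ."""
    universe = [x for x in records if anf(*x) is not None]
    omega = [x for x in universe if anf(*x) in legal]       # LEGAL = set L
    gamma = [x for x in universe if anf(*x) not in legal]
    return universe, omega, gamma
```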
4 Learning Process
The learning process is basically defined as a stochastic choice based on a "conditional probability distribution" (CPD). This distribution must be evaluated for every x ∈ Γ from a unique prior distribution. We differ from the Bayesian classifiers in that we do not pick the class having the maximum probability value: we hold that no finite amount of evidence can determine an instance's class membership. This means that the final choice for the class assignment is based on the simulation of a random variable driven by the CPD.
We need some formal steps to define this learning process:
Definition 7. The prior distribution:
∀l ∈ L : P(l) = ‖{x ∈ Ω | ANF(x) = l}‖ / ‖Ω‖ = π_l
Definition 8. Inclusion relationship: let x, y be two strings of characters of the same length n; then
x ⊆ y ⇐⇒ ∀i = 1, . . . , n : (y_i ≠ ♭ ∧ x_i = y_i) ∨ (y_i = ♭)
This relationship can be extended to x, y ∈ Γ through the ANF() function.
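A direct transcription of definition 8 as a sketch; it assumes the two sequences are compared as flat strings of equal length, padded with the blank symbol ♭ as in the ANF() sketch above.

```python
# Direct transcription of definition 8: x ⊆ y holds when x agrees with y on
# every position where y is not the blank character. Sequences are compared
# here as flat strings of equal length (e.g. the joined fields of anf()).
def included(x, y, blank="♭"):
    """True when x ⊆ y in the sense of definition 8."""
    return len(x) == len(y) and all(
        yi == blank or xi == yi for xi, yi in zip(x, y)
    )
```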
Definition 9. Conditional training set:
∀x ∈ Γ : Ω|x = {y ∈ Ω | ANF(y) ⊆ ANF(x)}
Definition 10. Conditional probability distribution:
∀x ∈ Γ, ∀l ∈ L : P(l|x) = ‖{y ∈ Ω|x | ANF(y) = l}‖ / ‖Ω|x‖
If Ω|x = ∅, then we set P(l|x) = 0. If x ∈ Ω and l = ANF(x), then P(l|x) = 1.
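Definitions 9 and 10 can be read as a counting procedure; the sketch below keeps the conditional distribution as raw counts, exactly as in tables 5 and 6 (no normalization), and relies on the anf() and included() sketches given earlier.

```python
# Counting version of definitions 9 and 10: Ω|x collects the training records
# compatible with x, and the conditional distribution is kept as raw counts.
from collections import Counter

def conditional_counts(x, omega):
    """Counts of each legal sequence among the records of Ω compatible with x."""
    target = "".join(anf(*x))                               # ANF(x)
    compatible = [y for y in omega if included("".join(anf(*y)), target)]
    return Counter("".join(anf(*y)) for y in compatible)    # empty if Ω|x = ∅
```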
Definition 11. Bayes classification:
∀x ∈ Γ : new ANF(x) = l̂ if P(l̂|x) = max_{l∈L} P(l|x)
The Bayesian approach has some aspects that we want to avoid.
Problem 1. The choices are generally concentrated on the biggest classes of the CPDs. All the other classes have no chance of being chosen in this classification process, even if they are feasible candidates for the right answer.
Problem 2. There is an ambiguity in choosing the solution class when two or more classes are solutions of the optimization problem.
Problem 3. If two or more classes have almost the same size, then the Bayesian solution leaves this situation out of account. This is a subtler problem than the previous one.
The approach that we use tries to solve all these problems. If we consider an ordering in L (e.g. the lexicographic order), we can simulate a conditional random variable for each CPD P(l|x). When we simulate a random variable, we always get a unique value that is an expression of the whole distribution. This means that all the classes can be chosen, but the CPD gives each of them the right chance.
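A minimal sketch of the two classification rules discussed in this section: the Bayesian rule takes the class with the largest count, while the stochastic rule draws a uniform number and walks the cumulative counts, so every class keeps a chance proportional to its size. The class order is the fixed order in which the classes of L are read, as in tables 5 and 6; Python's random module is used here only as a stand-in for the GSL / Mersenne Twister routines of the real system.

```python
# Sketch of the Bayesian rule (definition 11) and of the stochastic rule:
# draw a uniform number in [0, total] and walk the cumulative counts.
import random

def bayes_classify(counts):
    """Class with the maximum conditional probability (definition 11)."""
    return max(counts, key=counts.get) if counts else None

def stochastic_classify(counts, rng=random):
    """One class drawn from the empirical CPD through its cumulative counts."""
    if not counts:
        return None
    threshold = rng.uniform(0.0, 1.0) * sum(counts.values())
    cumulative = 0
    for label, count in counts.items():         # fixed reading order of the classes
        cumulative += count
        if threshold <= cumulative:
            return label
    return label
```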
5 Examples
The examples presented in this section are based on R770. The summary of the learning process is shown in table 4, where we can see the efficacy of the ANF() function: the size of U is 3.74% of the size of R770, and the records in Ω are about 94.5% of those in U. This means that we need to use a classifier only for the records in Γ, about 0.21% of the R770 input records.
The two examples that we show below describe some typical situations that we met during the classification process: one or two variables have to be fixed. The examples in the following paragraphs are described by showing the conditional universe L|x, the CPD P(L = l | x), and the conditional cumulative distribution P(L ≤ l | x). This CPD is based on the order in which the classes in L are read and processed by the system. The distributions are not normalized in [0, 1]; for this reason, all the values are the effective numbers of elements of U falling in each class of L|x.
Table 4. Global results for the probabilistic learning process

Phase            Set                      Number of records
Initial state    L                             2,429
                 R770                     52,361,853
Learning         U                         1,960,387
                 Ω                         1,852,177
Classification   Γ (Working Set)             108,210
                 Fixed sequences              68,253
                 Disregarded sequences        39,957
Table 5. Empirical CPD (normalization factor = 11)

    L|x              P(L = l | x)   P(L ≤ l | x)
1   (♭11, ♭7, 12)          1              1
2   (♭11, 14, 12)          1              2
3   (♭11, ♭4, 12)          1              3
4   (♭11, ♭3, 12)          2              5
5   (♭11, ♭1, 12)          6             11
5.1 Missing variable C
Consider the situation ANF(x) = (♭11, ♭♭, ♭2), where the set L|x has 5 classes, as shown in table 5.
• Bayesian classifier: the class (♭11, ♭1, 12) has the maximum value of the CPD in table 5.
• Stochastic classifier: simulate a uniform random variable in [0, 1]: random = 0.25 → 3 ∈ [0, 11]; from the conditional cumulative distribution of table 5, we get (♭11, ♭4, 12), a choice different from that of the Bayesian classifier.
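As a check, the draw of this paragraph can be reproduced with the sketches of section 4; the counts and the class order are those of table 5, and FixedDraw is a hypothetical deterministic stand-in that forces the uniform draw to 0.25.

```python
# Reproducing the example of paragraph 5.1 with the sketches of section 4.
# Counts and class order are those of table 5 (normalization factor 11).
counts = {
    "(♭11, ♭7, 12)": 1,
    "(♭11, 14, 12)": 1,
    "(♭11, ♭4, 12)": 1,
    "(♭11, ♭3, 12)": 2,
    "(♭11, ♭1, 12)": 6,
}

class FixedDraw:                                 # deterministic stand-in for the RNG
    @staticmethod
    def uniform(a, b):
        return 0.25

print(bayes_classify(counts))                      # (♭11, ♭1, 12)
print(stochastic_classify(counts, rng=FixedDraw))  # 0.25 * 11 = 2.75 -> (♭11, ♭4, 12)
```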
5.2 Missing variables (T, N)
Consider the situation ANF(x) = (♭♭, 12, ♭♭), where the set L|x has 12 classes, as shown in table 6.
• Bayesian classifier: the class (♭11, 12, ♭4) has the maximum value of the CPD in table 6.
• Stochastic classifier: simulate a uniform random variable in [0, 1]: random = 0.189849002534 → 3,445 ∈ [0, 18,146]; from the conditional cumulative distribution of table 6, we get the same class as the Bayesian classifier.

Table 6. Empirical CPD (normalization factor = 18,146)

     L|x              P(L = l | x)   P(L ≤ l | x)
 1   (♭14, 12, ♭6)           4              4
 2   (♭11, 12, 11)           1              5
 3   (♭11, 12, ♭9)           4              9
 4   (♭11, 12, ♭8)           7             16
 5   (♭12, 12, ♭5)           3             19
 6   (♭12, 12, ♭4)          12             31
 7   (♭14, 12, ♭5)          29             60
 8   (♭11, 12, ♭6)         180            240
 9   (♭14, 12, ♭4)          86            326
10   (♭11, 12, ♭7)          36            362
11   (♭11, 12, ♭5)       2,079          2,441
12   (♭11, 12, ♭4)      15,705         18,146
6 Conclusion
This work must be seen as part of a more general statistical project mainly based on administrative archives. These archives can contain a huge amount of data, but it is not always possible to get a cross validation by deterministic processes alone. For this reason we have designed a stochastic classifier. We are now defining and testing a more sophisticated model for defining L|x, based on a multi-criterion optimization problem. The (T, C, N) variables have been integrated with some other variables (e.g. age, gender, job level) to get a final CPD with a smaller number of classes. Furthermore, from the analysis of situations like the one presented in paragraph 5.1, we are trying to define a mechanism that guarantees a minimum size level for each class considered by the classifier. Finally, some technological considerations about the system used for this work: it was implemented using only open source software. In particular, the random simulation is based on the GSL library [GSL] and the Mersenne Twister routines [MAT98].
References
[SPV04] Spinelli, V., Tancioni, M.: Non Pension Cash Benefits: Computational Approach and Results of the Activity on the DM10-Inps Archives. Collana Contributi, Istat, Rome (2004).
[TAN04] Spinelli, V., Tancioni, M.: Data Mining for Administrative Data: Heuristic and Formal Model for Missing Values and Outliers. KDNet Symposium "Knowledge-Based Services for the Public Sector", Petersberg, Bonn (Germany), June 3-4, 2004.
[SPI04] Spinelli, V., Tancioni, M.: Automatic Evaluation Procedures for Elementary Administrative Data. The Case of Missing Values and Outliers in Non Pension Cash Benefits Archives. Q2004 - European Conference on Quality and Methodology in Official Statistics, Mainz (Germany), May 24-26, 2004.
[CON03] Consolini, P.: Administrative Data Based Statistics: the Case of Non-Pension Cash Benefits (Npcb). Proceedings of the 17th Roundtable on Business Survey Frames, Rome, October 26-31, Volume II, 423–430 (2003).
[CHU03] Ai, C., Norton, E.C.: Interaction terms in logit and probit models. Economics Letters, 80, 123–129 (2003).
[CON02] Consolini, P., De Carli, R.: Non Pension Cash Benefits: Units of Analysis, Sources and Statistical Representation of Data. Collana Contributi, Istat, Rome (2002).
[CON00] Consolini, P.: Non-Pension Cash Benefits: Institutional Aspects and Statistical Classifications. Collana Documenti, Istat, Rome (2000).
[MAT98] Mersenne Twister: http://www.math.keio.ac.jp/matumoto/emt.html
[HEC96] Heckerman, D.: Bayesian Networks for Knowledge Discovery. In: Fayyad, Shapiro, Smyth, Uthurusamy (eds.) Advances in Knowledge Discovery and Data Mining, 273–305 (1996).
[CHE96] Cheeseman, P., Stutz, J.: Bayesian Classification (AutoClass): Theory and Results. In: Fayyad, Shapiro, Smyth, Uthurusamy (eds.) Advances in Knowledge Discovery and Data Mining, 153–180 (1996).
[USA96] Fayyad, U.M. et al.: Advances in Knowledge Discovery and Data Mining. AAAI Press / MIT Press, Menlo Park, Cambridge, London (1996).
[ISTAT] Istat, Italy's National Statistical Institute: http://www.istat.it
[AGEN] Agenzia delle Entrate: http://www.agenziaentrate.it
[TNC1] (T, C, N) tables in Inps: http://www.inps.it/home/default.asp?sID=%3B0%3B4740%3B&lastMenu=4741&iMenu=1&itemDir=4953
[INPS] Inps, National Institute of Social Security for the Private Sector: http://www.inps.it
[M770] Repository for full information on the 2003 version of the 770 Form: http://www1.agenziaentrate.it/modulistica/dichiarazione/2003/770
[EMI] Emire - Italy - Family Allowance: http://www.eurofound.eu.int/emire
[GSL] GSL, Gnu Scientific Library: http://www.gnu.org/software/gsl