Reengineering French structural business statistics

Reengineering French structural business statistics: redesign
of the annual survey
Olivier HAAG
Insee : business statistics directorate
18, bd Adolphe Pinard
75675 Paris cedex 14
France
olivier.haag@insee.fr
keywords: breakdown of turnover, legal restructuring, merging of data, companies recall
Since September 2004, Insee has conducted a major program (Depoutot R., 2010) to reduce the
number of questions in the Annual Survey on Businesses, to improve its internal productivity, to
increase the timeliness of the derived statistics and to introduce new statistical units
(companies derived from profiling or companies which take part in a legal restructuring).
This paper will focus on:
- The survey content
- The organisation of the survey staff in order to reach a sufficient quality level
- The specific treatment of enterprises' legal restructuring (mergers, for example) which has
been implemented
- The merging of the survey data and the available tax and employment data
1. Presentation and aims of the survey
The ESA (Enquête Structurelle Annuelle, i.e. the Insee Annual Survey on Businesses) has been
substantially lightened (Insee, 2009). Except for turnover, the survey does not include any
question already covered by another administrative or fiscal source. The survey serves in
particular to publish the level and growth rate of aggregates for the different characteristics,
and to determine the main activity of the surveyed businesses for the statistical register. In the
manufacturing industry it is integrated with the Prodcom survey. It is the most important business
survey carried out in six of the main production sectors [1]. It comprises two parts for each sector:
- A core section, which includes in particular the breakdown of turnover by activity at a very
detailed level (much more detailed than NAF [2] classes in Distributive trade and Services), and
questions on employment and legal restructuring
- A sectoral part of the questionnaire covering characteristics that are specific to a given
sector: sales area for companies in the trade sector, fuel expenditure for companies in the
transport sector, etc.
The statistical burden placed on companies by the preparation of the French response to the SBS
regulation derives almost exclusively from the response to this survey. All companies above a
certain size threshold [3] are surveyed (about 83 000 companies, of which 30 000 in the
manufacturing industry); below the threshold, only a sample is surveyed (about 79 000 companies
among 2 500 000).
[1] Industry excluding the agri-food industry, the agri-food industry, transport, construction,
trade and services. NB: banks and insurance companies are not included in this survey.
[2] The NAF is the French equivalent of the NACE.
[3] The threshold can vary from sector to sector; it ranges from 20 to 50 employees.
For the first time, and to facilitate the use of administrative data, we asked firms to report
their turnover for the accounting period that covers at least 6 months of the reference year. For
example, with 2008 as the reference year, a company that closes its accounts on March 31st was
asked to answer the ESA questionnaire for the accounting period closed on March 31st 2009. One
resulting issue is that, since we send the questionnaire in January, some companies that close
their accounts later are unwilling to answer because their accounts have not yet been validated.
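The 6-month rule can be illustrated with a short sketch. It assumes, purely for illustration, that an accounting period consists of whole calendar months and ends with the month of the closing date; the function names are hypothetical and are not part of the ESA system.

```python
from datetime import date

def months_in_reference_year(closing: date, ref_year: int, length: int = 12) -> int:
    """Count the months of an accounting period that fall in ref_year.

    Assumption (illustration only): the period is made of `length` whole
    calendar months and ends with the month of `closing`.
    """
    year, month = closing.year, closing.month
    count = 0
    for _ in range(length):
        if year == ref_year:
            count += 1
        month -= 1
        if month == 0:  # step back from January to December of the previous year
            year, month = year - 1, 12
    return count

def belongs_to_reference_year(closing: date, ref_year: int, length: int = 12) -> bool:
    """True if at least 6 months of the period lie in ref_year."""
    return months_in_reference_year(closing, ref_year, length) >= 6

# Example from the text: accounts closed on 31 March 2009, reference year 2008.
print(months_in_reference_year(date(2009, 3, 31), 2008))   # 9
print(belongs_to_reference_year(date(2009, 3, 31), 2008))  # True
print(belongs_to_reference_year(date(2009, 3, 31), 2009))  # False
```

Note that a 12-month period closed on June 30th has exactly 6 months in each calendar year; the text does not say how such a tie is resolved, so the sketch leaves it open.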
1.1. The breakdown of turnover
The breakdown of the company's turnover by activity, which makes it possible to identify its
different branches [4], has three main statistical uses. It makes it possible to:
- compile the sector-branch transition matrix of the national accounts
- comply with the Prodcom regulation for the manufacturing industry
- calculate the main activity (MA) of the company using an algorithm. The main activity is
therefore determined from the company's response to this survey and not merely from its
declaration.
To make answering easier, the questionnaires are customised: they are pre-printed with the
activities that the company reported the year before and with the activities most often reported
by companies with the same MA.
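The paper does not describe the algorithm used to derive the MA from the breakdown of turnover. As an illustration only, the sketch below applies the top-down rule commonly used with NACE-type classifications: pick the largest group at the most aggregated level, then descend within it. The activity codes are hypothetical digit strings whose prefixes are assumed to encode the hierarchy, and the choice of levels is an assumption.

```python
from collections import defaultdict

def main_activity_top_down(breakdown, levels=(2, 3, 4)):
    """Return the main activity code from a {code: turnover} breakdown.

    Assumptions (not from the paper): codes are digit strings whose
    prefixes of length 2, 3 and 4 define nested classification levels.
    """
    candidates = dict(breakdown)
    for n in levels:
        # Total turnover of each group at this level of the classification.
        totals = defaultdict(float)
        for code, turnover in candidates.items():
            totals[code[:n]] += turnover
        best = max(totals, key=totals.get)
        # Keep only the activities that belong to the dominant group.
        candidates = {c: t for c, t in candidates.items() if c[:n] == best}
    return max(candidates, key=candidates.get)

# The largest single activity is "1071", but group "47" dominates overall.
example = {"1071": 40.0, "4711": 35.0, "4719": 30.0}
print(main_activity_top_down(example))  # 4711
```

The example shows why such a rule can differ from simply taking the activity with the largest turnover line.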
1.2. Common questions for all sectors
The questionnaire contains only questions that are not already covered by tax, customs or social
declarations. The aim of the survey is also to complete the statistical information. That is why,
for employment for example, the questionnaire contains questions about
- agency employment
- non-salaried staff
but does not ask for
- the number of salaried employees (included in the social declarations)
- the salary (present in the tax declarations), etc.
The core questions common to all sectors can be divided into three main topics:
- employment
- professional income (questions about subcontracting, for example)
- legal restructuring. This part is important for the definition of the new statistical units
which are necessary to establish the growth rates of aggregates (see §3 of this document).
The companies have to answer questions about their partners in the legal restructuring:
o the name of the partner
o its statistical ID
o the date of the legal restructuring
o the type of legal restructuring (merger, split, etc.)
o the amount of the transaction
[4] Called homogeneous production units in the EU classification of statistical units.
1.3. Specific questions for each sector
A section of the questionnaire is specific to each activity sector. The principle is the same as
for the common questions: the survey questionnaire cannot ask for information already available in
tax or social declarations.
1.4. The consequences on data collection
Whereas the former annual survey system was partly decentralised according to the main sectors
covered (Agri-food Industry, Manufacturing and energy, Construction, Distributive Trade,
Transport, Services), the data collection of the new survey is concentrated in two statistical teams.
The six teams of administrators of the previous survey have been replaced by a single team
within Insee, to which a project management role has been delegated (the ministerial statistical
services will continue to be the contracting authority in their sector, but will delegate to Insee
responsibility for developing the global information system). In the industrial sector, matters are a
little more complex because the annual statistics also include the response to the Prodcom
regulation. That survey is still managed by the statistical service of the Ministry for Industry.
Because of the sector-specific questions and the customisation of the questionnaires, Insee's
staff has to check about 200 different types of questionnaire. Since a clerk cannot know every
type of questionnaire, the survey staff is organised by sector and has to carry out many distinct
tasks. The first year was therefore largely devoted to training the staff, and the level of
productivity expected by management was not reached.
2. The tasks of the survey staff to improve the quality of the answers
The new system for the production of structural business statistics comes with several
methodological changes compared with the former one, in particular with regard to data editing.
The new data editing process puts great emphasis on selective editing, with the twofold
objective of improving statistical quality and reducing the manual checking burden. The selective
editing method implemented relies mainly on local scores and, as usual, the most delicate point is
tuning the thresholds that determine which units have to be checked manually and which units
can be edited automatically.
2.1. The principles of the selective editing used
Data editing in the new system for the production of structural business statistics is based on a
two-step process, which combines automatic micro-editing and selective editing. Here, the
macro-editing process takes place second, after automatic corrections have been applied to the
microdata.
In the first step, raw data are automatically checked at the micro level by a set of classical
micro-edits: for each variable, the individual plausibility of the record is checked, as well as its
coherence with the rest of the questionnaire. The micro-editing process thus serves a double
purpose:
− to prepare the data for the selective editing process. Many problems occur when
selective editing is applied directly to raw data: non-response prevents the computation of
relevant aggregates, as well as of the scores of the units concerned. Moreover, very large
errors in the raw data may disrupt the selective editing process by distorting some
aggregates. By detecting and imputing very atypical records as well as non-response, the
micro-editing process overcomes this problem;
− to quantify the quality of each record through quality indicators, which the survey clerks
use as a diagnostic aid during the manual checking of the units flagged by selective editing.
The second step is a selective editing process, which constitutes the cornerstone of data
editing in the new system. It rests on two kinds of methods (Gros E., 2009): on the one hand,
“drop-out” methods, using score functions that measure the impact of each unit on a given ratio,
and on the other hand “diff” methods, using score functions that measure the weighted difference
between the raw value of a variable given on the questionnaire and an expected value of this
variable. Each method is applied to micro-edited data, and concerns respondents as well as partial
respondents. Total non-respondents are not checked by selective editing and are subject to a specific
follow-up procedure.
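The exact score functions are given in Gros (2009) and are not reproduced in this paper. The sketch below only illustrates the two ideas in their simplest form, under assumed definitions: a "drop-out" score as the relative change of a ratio when one unit is removed, and a "diff" score as a weighted gap between raw and expected values relative to a total. Function names, normalisations and the example data are hypothetical.

```python
def dropout_score(numerators, denominators, i):
    """Relative change of the ratio sum(num)/sum(den) when unit i is dropped.

    Assumed form, for illustration; totals are taken to be non-zero.
    """
    full = sum(numerators) / sum(denominators)
    reduced = (sum(numerators) - numerators[i]) / (sum(denominators) - denominators[i])
    return abs(full - reduced) / abs(full)

def diff_score(raw, expected, weight, total):
    """Weighted gap between the raw and the expected value, relative to a total."""
    return weight * abs(raw - expected) / total

# Hypothetical data: turnover and staff size of three units in one sector.
turnover = [10.0, 20.0, 300.0]
staff = [10.0, 20.0, 30.0]
scores = [dropout_score(turnover, staff, i) for i in range(3)]
most_influential = max(range(3), key=scores.__getitem__)
print(most_influential)  # 2
```

In this toy example the third unit has by far the largest impact on the turnover-per-head ratio, so it would be the first candidate for a manual check.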
2.2. The follow up to complete answers
Before the answers are checked as detailed above, the clerks have to follow up some
companies to:
- obtain the answer for a list of very important companies. The size and economic weight of
these companies do not allow us to impute their answers. In the 2008 survey, among the 1500
companies whose answer was considered compulsory, an answer was obtained for more than
1450. Three hundred had to be recalled.
- complete the answer of the companies. The main variable concerned was the breakdown of
turnover. As mentioned previously, this is one of the most important variables of the system,
especially since the sector-based statistics will use it. Enterprises are asked to fill in the
turnover lines for a predefined list of activities, but may also fill in a line for the turnover
of an activity which is not in the predefined list. In this case, the enterprise gives the amount
of turnover relating to this activity, and writes the “name” of the activity on the
questionnaire. If the amount of turnover exceeds a threshold, the clerks have to find the
corresponding code (of the French classification NAF); otherwise the corresponding code is
inferred by the statistical office.
This issue was underestimated for the first survey campaign, during which the clerks
processed more than 23 000 businesses. A mass processing was set up for the first year in
order to treat about 15 000 answers more quickly without any recall. Nevertheless, the survey
staff had to recall more than 1000 companies to find a corresponding code. For the next year,
the amount of manual processing will be reduced by using SICORE, Insee's automatic coding
software, updated for our survey.
- confirm the breakdown of turnover when it leads to a change of the company's principal
activity code (APE code) compared with the value in the register. This year, about 15 000
businesses were checked for this item, which generated about 2000 phone calls.
- complete the part of the questionnaire concerning legal restructuring. Clerks have to obtain
all the partner information we need. This information is sent to the statistical register which
constitutes the new statistical units (see §3). For this job, the clerks can consult legal sources
or recall the companies. About 6000 companies were checked for this work and clerks
recalled about 1000 companies.
2.3. The follow up to control answers
For a given characteristic of interest and a given level of validation, the joint use of two
local “drop-out” scores and one local “diff” score makes it possible to organise the checks into a
hierarchy. However, since units (i.e. questionnaires) need to be treated on a “unit by unit” basis,
and not item by item, the results of the local scores are synthesised into a global priority
indicator, according to a three-step procedure:
− firstly, for each variable and each local score, two thresholds, a “high” threshold and a
“medium” threshold, divide the whole set of units into three groups: very influential
units, moderately influential units and non-influential units;
− then, the status of each variable is defined as the “maximum status” of the different local
scores relating to this variable. The status S(Xi) of a given variable Xi is thus set to I
if the unit is very influential for at least one local score, to S if the unit is only
moderately influential for at least one local score, and to O otherwise;
− lastly, the global priority indicator (GPI) is defined as
where A represents the importance attached to the “very influential” status compared with
the “moderately influential” status, and Ki represents the importance of each variable.
Finally, the whole set of units is divided into four roughly equal-sized groups, according
to the value of their global priority indicator: priority units if GPI>α, important units if α≥GPI>β,
secondary units if β≥GPI>γ and units which may be edited automatically if γ≥GPI. Priority
units are checked manually first, then important units and lastly secondary units, according to
available time and means. This mechanism makes it possible to manage the amount of work during
the campaign, and thus to respect practical constraints while ensuring a good level of quality for
the statistics.
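The three-step procedure can be sketched as follows. The formula of the global priority indicator is not legible in this text, so the weighted sum used here (each variable contributes Ki times A when its status is I, Ki times 1 when its status is S, and 0 when it is O) is only an assumption that is consistent with the roles described for A and Ki. All names, thresholds and values are hypothetical.

```python
RANK = {"O": 0, "S": 1, "I": 2}  # non-influential < moderately < very influential

def local_status(score, medium, high):
    """Step 1: classify one local score against its two thresholds."""
    if score > high:
        return "I"
    if score > medium:
        return "S"
    return "O"

def variable_status(local_statuses):
    """Step 2: the status of a variable is the maximum of its local statuses."""
    return max(local_statuses, key=RANK.get)

def global_priority_indicator(statuses, importance, a):
    """Step 3 (ASSUMED form): sum of Ki * (A if status I, 1 if S, 0 if O)."""
    weight = {"I": a, "S": 1.0, "O": 0.0}
    return sum(importance[var] * weight[status] for var, status in statuses.items())

def priority_group(gpi, alpha, beta, gamma):
    """Assign a unit to one of the four groups, with alpha > beta > gamma."""
    if gpi > alpha:
        return "priority"
    if gpi > beta:
        return "important"
    if gpi > gamma:
        return "secondary"
    return "automatic"

statuses = {
    "turnover": variable_status(["S", "I", "O"]),    # -> "I"
    "employment": variable_status(["O", "S", "O"]),  # -> "S"
}
gpi = global_priority_indicator(statuses, {"turnover": 3, "employment": 1}, a=2)
print(gpi)                                                # 7.0
print(priority_group(gpi, alpha=6.0, beta=3.0, gamma=1.0))  # priority
```

Whatever the exact formula, the design point is the same: item-level scores are collapsed into a single number per questionnaire so that clerks can work unit by unit, in decreasing order of priority.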
For the first campaign, only the turnover and its breakdown were checked in this way. Because
of delays, the survey staff was not able to validate the other characteristics of the questionnaire.
For the selective editing, the level of aggregation was the first three positions of the APE code.
Thus, if the score of a company exceeded a threshold of 1% of the total of its sector (companies
with the same first three positions of the APE code), its answer had to be checked. About 8000
companies were checked by the survey staff, which generated about 1500 phone calls.
3. The particular processing of the companies which take part in a legal restructuring
3.1. Example
Let us take the example of a merger, since it is the most frequent type of legal restructuring in
France.
Figure 1: example of merging (year N-1: company 4, APE sector B, and company 1, APE sector A;
year N: company 14, APE sector A; amounts shown: 400 K€, 100 K€, 1000 K€ and 1 500 K€)
In this example, the legal restructuring concerns 3 companies. It is thus made up of two pairs
of companies:
- company 4 (called the grantor), which transfers its tangible fixed assets to company 14
(called the grantee)
- company 1, which transfers its tangible fixed assets to company 14 (called the grantee)
In the ESA questionnaire, company 14 has to give us the description of its two partners
(companies 4 and 1). The data on these pairs of companies are transmitted to the statistical
register (called CITRUS), which centralises all the information about legal restructuring. The main
goal of this register is to create new statistical units (called envelopes) from the pairs of
companies that it receives from different statistical sources.
3.2. Definition of the restructuring envelope
The definition of the envelope boundary [5] is managed by CITRUS. The restructuring envelope
includes the companies which take part in the restructuring for the current year and the year
before. In the previous example, the envelope contains companies 1 and 4 for the year N-1 and
company 14 for the year N.
To build the envelopes, CITRUS applies one main rule: a company belongs to one and only one
envelope for a given year. An envelope can therefore contain many companies, some of which may
be unrelated to one another. If company A exchanges tangible assets with B, and B with C, the
envelope will contain companies A, B and C even if there is no link between A and C.
The envelope is considered as a statistical unit whose main activity (MA) is calculated from
the APE codes and the turnover of its constituent companies in the current year. The envelope has
the same MA for both years (N-1 and N), which avoids excessive sector shifts. CITRUS
calculates the main characteristics of the envelope (MA, size, aggregation coefficient [6]).
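The rule that a company belongs to one and only one envelope means that pairs sharing a company are chained into the same envelope. One way to sketch this is as connected components over the reported pairs, computed here with a small union-find structure. This illustrates the rule only and is not a description of how CITRUS is implemented; the function name is hypothetical, and the split of members between years N-1 and N is left out.

```python
from collections import defaultdict

def build_envelopes(pairs):
    """Group companies linked by restructuring pairs into envelopes.

    `pairs` is an iterable of (grantor, grantee) identifiers. Two companies
    end up in the same envelope if a chain of pairs connects them.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = defaultdict(set)
    for company in list(parent):
        groups[find(company)].add(company)
    return sorted(sorted(members) for members in groups.values())

# A exchanges assets with B, and B with C: one envelope, even without an A-C link.
print(build_envelopes([("A", "B"), ("B", "C")]))  # [['A', 'B', 'C']]
# The merger example: companies 4 and 1 both transfer assets to company 14.
print(build_envelopes([(4, 14), (1, 14)]))        # [[1, 4, 14]]
```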
3.3. The use of the envelope for the survey
A questionnaire is created for this envelope as follows:
- The value of a characteristic X of the envelope in year N is obtained by adding the values
of the companies constituting the envelope in N:
$$X^{N} = \sum_{i \in \mathrm{env}_{N}} X_i^{N}$$
- The value in N-1 is obtained by adding the values of the companies constituting the
envelope in N-1 and subtracting the flow intra-envelope generated by the restructuring, if it
exists and is known. When the flow intra-envelope is unknown, it is also possible to weight
the sum of the X values of the companies that compose the envelope in N-1 by a coefficient
called the ‘aggregation coefficient’. By convention, the flow intra-envelope is removed in
N-1. So if the flow intra-envelope appears in N, it must be added in N-1: the flow
intra-envelope is then negative.
If we know the flow intra-envelope:
$$X^{N-1} = \sum_{i \in \mathrm{env}_{N-1}} X_i^{N-1} - F$$
where F is the flow intra-envelope.
If we do not know the flow intra-envelope:
$$X^{N-1} = K \cdot \sum_{i \in \mathrm{env}_{N-1}} X_i^{N-1}$$
where K is the aggregation coefficient.
A characteristic is considered additive if its value is independent of the legal
restructuring in the short term (e.g. value added). The flow intra-envelope is null for this
characteristic.
A characteristic is considered non-additive if the restructuring has an immediate impact
without any apparent economic change (e.g. turnover): restructuring can indeed generate an
increase or a decrease (called flow intra-envelope) of this variable. This flow intra-envelope
does not reflect an economic reality but is merely due to a change in legal structure.
[5] List of companies in N and N-1.
[6] Aggregation coefficient = (turnover_N / value added_N) / (turnover_{N-1} / value added_{N-1}).
In our example (Figure 1):
- For the year N: the envelope’s turnover is the turnover of company 14, which is the only
company belonging to the envelope for this year. The turnover value is 1500 K€.
- For the year N-1: the envelope’s turnover is the sum of the turnovers of companies 1 and 4
minus the value of the flow intra-envelope. The turnover value is: 500 (turnover of 4) +
1000 (turnover of 1) - 100 (flow intra-envelope) = 1400 K€. In Figure 1, the flow
intra-envelope is shown by the red arrow.
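The arithmetic of this example can be checked with a short sketch of the envelope rules: the sum over members in N, the sum minus the flow intra-envelope in N-1 when the flow is known, and the sum weighted by the aggregation coefficient when it is not. Function and parameter names are hypothetical; only the turnover figures come from the example.

```python
def current_year_value(values_n):
    """Envelope value in N: sum over the companies of the envelope in N."""
    return sum(values_n.values())

def previous_year_value(values_prev, flow=None, k=None):
    """Envelope value in N-1.

    If the flow intra-envelope is known, it is subtracted from the sum;
    otherwise the sum is weighted by the aggregation coefficient k.
    For an additive characteristic, pass flow=0.
    """
    total = sum(values_prev.values())
    if flow is not None:
        return total - flow
    if k is not None:
        return k * total
    raise ValueError("either the flow or the aggregation coefficient is required")

def aggregation_coefficient(turnover_n, value_added_n, turnover_prev, value_added_prev):
    """Aggregation coefficient as defined in the footnote of the paper."""
    return (turnover_n / value_added_n) / (turnover_prev / value_added_prev)

# Turnover in K EUR, from the merger example.
turnover_n = current_year_value({"company 14": 1500})
turnover_prev = previous_year_value({"company 4": 500, "company 1": 1000}, flow=100)
print(turnover_n)     # 1500
print(turnover_prev)  # 1400
```

With the flow removed, the envelope's turnover moves from 1400 K€ to 1500 K€, whereas a naive comparison of 1500 K€ in both years would show no change.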
3.4. The clerk’s tasks for checking the envelopes characteristics
The envelopes take part in the selective editing process, but scores are only calculated for the
temporal drop-out. If an envelope has to be checked by the clerks, it means that its turnover
change between N-1 and N is atypical. This is generally due to the flow intra-envelope, but there
is no question about the flow intra-envelope in the survey. In this case, the clerk has to recall
the largest company of the envelope in order to obtain an estimate of the turnover flow
intra-envelope and, if feasible, an estimate of the breakdown of this flow by activity.
4. The rules to merge the survey data with the available tax and employment data
4.1. The definition of the frame of reference
The French business register SIRENE is used to define the frame of reference of the system.
All French legal units have to be registered in it (for example to obtain a loan from a
bank). Since SIRENE is an inter-administrative register whose id-number has to be used by all
administrations, it offers exhaustive coverage of the field of legal units. The universal
use of the id-number makes the merging of different files easy.
However, the register should not be considered as a base for producing statistics directly. For
example, the number of active enterprises in a given economic sector will not be obtained directly
from SIRENE, because some units have stopped their activity without yet notifying the register. It
will result from specific estimates using the different components of the system, especially the
statistical survey. That is why a statistical register (called OCSANE) was built specifically for
the production of structural statistics. This register contains all the companies which belong to
the structural statistical field [7] and for which we are expecting information (tax, employment,
surveys, etc.).
The business register has to be considered as the backbone of the system. It may happen that
in some sources, some units are missing, or existing as two records (for example in case of multiple
tax declarations for the same enterprise). It is by using the id-number of the register that these cases
will be settled (Chami S. 2010).
Figure 2: The definition of the different populations of statistical or administrative sources
(statistical frame, ESA's sampling frame, ESA's sample, ESA's respondents, tax data available,
employment data available)
The figure distinguishes the following populations:
- The OCSANE population (the grey shape), which represents the statistical frame of our system.
- The ESA sampling frame (in purple), which does not correspond exactly to the OCSANE
population, since some activities do not belong to the survey but are necessary for the national
accounts (sports activities, for example).
- The ESA sample (in pink).
- The ESA respondents (in yellow). The correction of total non-response in the ESA relies on
weighting methods: calibration techniques (Deville, Särndal, 1992) adjust the weights so that
they satisfy calibration equations. These equations are defined using the classification of every
enterprise within the register and the tax turnover.
- The tax data available (in red). The correction of total non-response relies on imputation
methods.
- The employment data available (in blue). Total non-response is not treated.
For each population, the level of information at our disposal is different. For the ESA
respondents, for example, we have tax, employment and survey data. For the other populations
we only have tax and/or employment data.
[7] For example, companies from the financial sector do not belong to the structural statistical
field; therefore, even if we obtain tax data for these businesses, they are of no interest to us.
4.2. The rules to merge data
Thanks to the register, we can merge all the data collected for a statistical ID. However, we
have to check that the different data correspond to the same statistical unit and to the same
reference period. The three main sources of information for a company (survey data, tax data
and employment data) share common characteristics which make this check possible and allow
all the information available for a company to be merged, as Figure 3 shows. The production
process whose goal is to merge the data is called REDI.
Figure 3: common characteristics between the three main sources (survey data [8]: turnover,
goods sales, products sales, services sales; tax data: turnover, goods sales, products sales,
services sales, salary, staff size; employment data: salary, staff size; final data: definitive
turnover, definitive sales, definitive salary, definitive staff size)
For the companies which answered the survey (in yellow in Figure 2), we merge the 3
sources. For the other companies of the statistical frame (in grey and pink in Figure 2), we
only merge the tax and employment data.
The principle of the merging is as follows:
- For each common characteristic, a major source is defined.
Table 1: The definition of the major source for each common characteristic
Common characteristic | Condition                                                  | Major source
Turnover              | We have tax data for the company                           | Tax
Turnover              | We have no tax data for the company                        | Survey [9]
Sales                 | We have an answer in the survey for the year of reference  | Survey
Sales                 | We have no answer in the survey for the year of reference  | Tax
Salary                | We have tax data for the company                           | Tax
Salary                | We have no tax data for the company                        | Employment
Staff size            | We have employment data for the company                    | Employment
Staff size            | We have no employment data for the company                 | Tax
[8] The sales (goods, services and products) are estimated in the survey by using the breakdown
of turnover by activity. The goods sales are the sum of the turnover of the trade activities; the
services sales are the sum of the turnover of the service activities.
[9] In this case, the turnover of the survey will be used to impute the tax data for the company.
- For each common characteristic, we calculate the difference between the two sources.
We then calculate a score which determines the companies that the clerks are going to check:
$$\mathrm{score} = \frac{X_{s1} - X_{s2}}{T(X_p)}$$
where X_{s1} is the value of the characteristic X in source 1, X_{s2} is the value of the
characteristic X in source 2, and T(X_p) is the total of the characteristic X in the major
source at the level of aggregation used for the check.
- If the score is below the threshold, the final value is the one from the major source.
- If the score is above the threshold, a clerk has to determine the correct values. The clerk
also has to determine the linked characteristics. For example, if the goods sales are changed,
the clerk has to check that the goods purchases remain consistent with the new value of the
sales and, if not, adjust the purchases.
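The merging rule can be sketched as follows: choose the major source according to Table 1, compute the score, and either keep the major-source value or flag the company for a clerk. The sketch takes the absolute value of the difference so that gaps in either direction are compared with the threshold, which is an assumption. The function names, the threshold and the figures are hypothetical and do not describe the REDI implementation.

```python
def major_source(characteristic, has_tax, has_survey, has_employment):
    """Major source of a common characteristic, following Table 1."""
    if characteristic == "turnover":
        return "tax" if has_tax else "survey"
    if characteristic == "sales":
        return "survey" if has_survey else "tax"
    if characteristic == "salary":
        return "tax" if has_tax else "employment"
    if characteristic == "staff_size":
        return "employment" if has_employment else "tax"
    raise ValueError(f"unknown common characteristic: {characteristic}")

def merge_score(value_s1, value_s2, total_major):
    """Gap between the two sources relative to the major-source total.

    ASSUMPTION: the absolute value is used so that the sign of the gap
    does not matter when comparing with the threshold.
    """
    return abs(value_s1 - value_s2) / total_major

def reconcile(value_major, value_other, total_major, threshold):
    """Return (final_value, needs_clerk); final_value is None if a clerk must decide."""
    needs_clerk = merge_score(value_major, value_other, total_major) > threshold
    return (None if needs_clerk else value_major), needs_clerk

# Hypothetical turnover figures (K EUR) and a hypothetical 1% threshold.
print(major_source("turnover", has_tax=True, has_survey=True, has_employment=True))  # tax
print(reconcile(1000.0, 1040.0, total_major=100000.0, threshold=0.01))  # (1000.0, False)
print(reconcile(1000.0, 3000.0, total_major=100000.0, threshold=0.01))  # (None, True)
```

The text does not say what happens when the score is exactly equal to the threshold; the sketch keeps the major-source value in that case.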
4.3. The consequences on the other data
As we have just noted, the common characteristics are often linked with other characteristics
in the different sources (purchases and turnover in the tax data, turnover and the value of
subcontracting in the surveys, etc.). This is why, once the final values of the common
characteristics have been defined by the REDI process, they have to be reintroduced into the
original sources (survey, tax and employment data). The characteristics from each source which are
correlated with a common characteristic are recalculated to preserve the consistency between the
values within a source. Besides, in the surveys, the turnover is often used to process partial
non-response, so this automatic correction process has to be rerun using the final values.
Conclusion
In order to reduce the statistical burden on companies, the annual survey was lightened
thanks to a systematic use of administrative sources. Statistics obtained through the new
system therefore derive from different kinds of information:
- some are obtained directly from administrative sources;
- some use only the variables of the statistical survey;
- many combine survey data and administrative data, especially for the sector-based estimates.
The raw material for producing these statistics is a non-rectangular table: some data are
exhaustive (those coming from the administrative sources), others not (those coming from the
statistical survey). In order to obtain the most complete information for a company, we have to
merge these different sources.
Besides, a new estimator (Brion P. 2008) which combines administrative and survey data has
been implemented to calculate the aggregates of the characteristics.
Finally, although the data of the three main sources (survey, tax and employment data) can be
checked and validated separately, we cannot release the results of one source without the other
sources. This is a new constraint, but it makes it possible to provide a single set of information
of better quality.
References:
Brion Ph., “The future system of French structural business statistics: the role of the
estimates”, UN/ECE Work Session on Statistical Data Editing, Vienna, 2008.
Chami S., “Reengineering French structural business statistics: an extended use of
administrative data”, work session of the Q2010 conference, Helsinki.
Depoutot R., “Reengineering French structural business statistics: an overview”, work
session of the Q2010 conference, Helsinki.
Deville J.-C., Särndal C.-E. (1992), “Calibration estimators in survey sampling”, Journal of
the American Statistical Association, 87, pp. 376-382.
Gros E., “Setting cut off scores for selective editing in structural business statistics: an
automatic procedure using simulations study”, UN/ECE work session of the Conference of
European Statisticians, Neuchâtel, 2009.
Insee, “Redesigning French structural business statistics: how can the response burden on
companies be lessened”, ECE/CES/209/35 Work Session on Statistical Data Editing, Geneva, 2009.