Methodology employed in the calculation of the mortality tables of the population in Spain 1992-2005

Madrid, July 2007

Index

1 Introduction
2 Obtaining death probability series
3 Obtaining derivative series
4 Obtaining prospective series
5 Synthesis of the smoothing procedure employed
6 Tables calculated and base information used
Glossary of symbols

1 Introduction

Mortality tables are compiled to measure the incidence of mortality on the population under study, independently of its age structure. The type of table used here results from a cross-sectional analysis of mortality, which examines how the phenomenon affects the population, classified by age or age group, at a given moment in time. Because mortality usually evolves without abrupt changes, these tables are an acceptable description of the phenomenon for short periods close to the moment they refer to.

To calculate the functions of a complete mortality table, information is needed on deaths and on the population, both classified by age and referring to the same time period.

Since the figures for deaths classified by age are quite small (except for the oldest age groups), not only in provinces and Autonomous Communities but also at the national level, counting errors and possible disruptions that exceptionally affect mortality in a given year have a notable bearing on this information. Consequently, these anomalies must be eliminated, since leaving them in the data would give an incorrect picture of the phenomenon under study. This elimination is performed in a first stage.
To calculate the mortality table for a given moment in time, the deaths considered for each age are the average of those corresponding to a number of years (generally two to four) centred on that moment.

In a second stage, it is necessary to eliminate the disruptions, in both the number of deaths and the population, caused by errors in the stated age. These errors inflate the values observed for certain ages to the detriment of the contiguous ones, distorting the series of death probabilities of the mortality table. This problem is usually avoided by applying a smoothing procedure to the original data.

2 Obtaining death probability series

The death probability at age x, $q_x$, is defined as the probability that a person from a specific generation, exactly x years old, dies before reaching age x+1. It is therefore necessary to consider the possible cases, that is, the persons who could die, as well as the actual events, that is, the persons of that age and generation who have actually died.

The possible cases are the persons aged x, calculated as the sum of the inhabitants of that age at the end of the year and half of the persons who died aged x during that year, since deaths are assumed to be uniformly distributed throughout the year.

Accepting the hypothesis that the deaths at age x of persons from a given generation occur half in one year and half in the following one, the death probability is expressed by:

\[ q_x = \frac{\tfrac{1}{2}\left(D_x^{z} + D_x^{z+1}\right)}{P_x^{z} + \tfrac{1}{2} D_x^{z}} \]

where:

$D_x^{z}$ represents the deaths occurred in year z at age x.
$D_x^{z+1}$ represents the deaths occurred in year z+1 at age x.
$P_x^{z}$ is the population aged x on December 31st of year z.

The previous expression has been used to calculate $q_x$ for all ages from two to ninety years old, both inclusive.
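As an illustration only, the expression for the death probability at age x can be sketched in Python as follows. The function name, the argument names and the figures in the usage line are hypothetical and do not come from the tables themselves; only the formula is taken from the text.

```python
def death_probability(deaths_z, deaths_z1, pop_end_z):
    """Death probability q_x at a single age x.

    deaths_z   -- deaths at age x occurred in year z
    deaths_z1  -- deaths at age x occurred in year z+1
    pop_end_z  -- population aged x on December 31st of year z
    """
    # Possible cases: population at the end of year z plus half of the
    # deaths of year z (deaths assumed uniform throughout the year).
    possible_cases = pop_end_z + 0.5 * deaths_z
    if possible_cases <= 0:
        raise ValueError("there are no persons exposed to the risk of dying")
    # Actual events: half of the deaths of year z plus half of those of z+1.
    return 0.5 * (deaths_z + deaths_z1) / possible_cases


# Hypothetical figures, chosen only to make the arithmetic easy to follow:
# (120 + 130) / 2 = 125 deaths over 49,940 + 60 = 50,000 possible cases.
q = death_probability(deaths_z=120.0, deaths_z1=130.0, pop_end_z=49940.0)
print(q)
```

With these made-up inputs the numerator is 125 and the denominator 50,000, so the sketch returns a probability of 2.5 per thousand.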
Since the deaths of babies under one year old occur mainly during the first weeks of life, the hypothesis of a uniform distribution throughout the year cannot be applied. Therefore, for this age, the death probability has been calculated using:

\[ q_0 = \frac{D_{0,g(z)}^{z} + D_{0,g(z)}^{z+1}}{P_0^{z} + D_{0,g(z)}^{z}} \]

where:

$D_{0,g(z)}^{z}$ deaths occurred in year z, at 0 years old, from the generation born that year.
$D_{0,g(z)}^{z+1}$ deaths occurred in year z+1, at 0 years old, from the generation born the previous year.
$P_0^{z}$ population aged 0 on December 31st of year z.

Consequently, for babies aged one year old, $q_1$ has been calculated using:

\[ q_1 = \frac{D_{1,g(z-1)}^{z} + D_{1,g(z-1)}^{z+1}}{P_1^{z} + D_{1,g(z-1)}^{z}} \]

where:

$D_{1,g(z-1)}^{z}$ deaths in year z, aged 1, from generation z-1.
$D_{1,g(z-1)}^{z+1}$ deaths in year z+1, aged 1, from generation z-1.
$P_1^{z}$ population aged 1 on December 31st of year z.

The low number of deaths registered for persons over ninety years old, and the greater impact of errors in the stated age, distort the death probability series at these ages. Therefore, these probabilities have been estimated by fitting a third-degree parabola, by least squares, to the $q_x$ calculated with the general expression for x = 90, 91, 92, 93 and 94.

The following conditions were established in order to perform this fit:

a) the cubic parabola passes through the point $q_{90}$, which ensures continuity between the fitted $q_x$ and those calculated for ages under 90 years old;
b) $q_{110} = 1$, with the parabola truncated from this point onwards, which means that, a priori, there are no survivors over one hundred and ten years old; and
c) the cubic parabola has a tangent parallel to the x axis at x = 110, which implies an accelerated increase in mortality from the point of inflection of the cubic parabola, giving high mortality at ages around one hundred and ten years old.

3 Obtaining derivative series

The death probability series yields the mortality table functions described below.

PROBABILITY OF LIFE OR SURVIVAL AT AGE x, $p_x$

The probability of surviving between two exact ages, x and x+1. Therefore, for each age x,

\[ p_x = 1 - q_x \]

SURVIVORS AGED x YEARS OLD, $l_x$

Number of persons reaching age x among the initial $l_0$ of the mortality table. Therefore, for each age x,

\[ l_x = l_{x-1}\, p_{x-1} \]

It is usual to work with $l_0 = 100{,}000$.

THEORETICAL DEATHS AGED x YEARS OLD, $d_x$

Deaths occurring between the two exact ages x and x+1, obtained from the mortality table. Therefore, for each age x,

\[ d_x = l_x q_x = l_x - l_{x+1} \]

LIFE EXPECTANCY AT AGE x, $e_x$

Average number of years that each person aged exactly x is expected to live, under the assumption that the years lived by the survivors who reach that age are shared equally among them. Considering the hypothesis that all persons who die at a certain age live, on average, half of the year in which they die, life expectancy is calculated as

\[ e_x = \frac{1}{2} + \frac{1}{l_x} \sum_{i=x+1}^{\omega} l_i \]

with $\omega$ representing the oldest age, for which there are supposedly no survivors.

4 Obtaining prospective series

In addition to the previous classical biometric series or functions, it was considered important to include the two prospective series specified below.

SURVIVORS AGED x YEARS OLD, $L_x$

Represents the number of survivors of the mortality table who are x years old. This function has been estimated with the following formula (see N. Keyfitz, Introduction to the Mathematics of Population, Addison-Wesley):

\[ L_x = \frac{13}{24}\left(l_x + l_{x+1}\right) - \frac{1}{24}\left(l_{x-1} + l_{x+2}\right) \]

for x = 1, 2, ..., 98. For the remaining ages,

\[ L_0 = a_0 l_0 + a_1 l_1, \qquad a_0 + a_1 = 1 \]

in which

\[ a_0 = \frac{D_{0,g(z)}^{z+1}}{D_{0,g(z)}^{z+1} + D_{0,g(z+1)}^{z+1}} \]

where $D_{0,g(z)}^{z+1}$ represents the deaths of children under one year old occurred in year z+1 among those born in generation g(z).

For x = 99 and x = 100,

\[ L_{99} = e_{99}\, l_{99} - e_{100}\, l_{100} \]
\[ L_{100} = e_{100}\, l_{100} \]

where $L_{100}$ are the survivors aged 100 years old and older.

PROBABILITY OF SURVIVAL AT x YEARS OLD, $T_x$

The probability that persons aged x years old survive to be aged x+1. This is easily obtained from the former using

\[ T_x = \frac{L_{x+1}}{L_x} \]

and, for the population aged 99 years old and older, the probability of reaching 100 years old or over is

\[ T_{99} = \frac{L_{100}}{L_{99} + L_{100}} \]

5 Synthesis of the smoothing procedure employed

Both the population stocks obtained from population censuses and register renewals and the data on deaths obtained from the Vital Statistics sometimes contain errors that arise when respondents state their age. This inflates the values for some ages to the detriment of neighbouring ages, which distorts the calculated death probability series. To avoid this problem, a smoothing procedure must be applied to the original data before they are used.

The smoothing procedure applied to the original data was the Variate Difference Method, which the National Statistics Institute had already used to compile the earlier complete mortality tables. A full account of its application, with an extensive bibliography, can be found in G. Tintner, The Variate Difference Method, 1940, in the Cowles Commission collection. The following paragraphs briefly explain the foundations of the procedure.

The basic hypothesis of the method is that the observed series is the additive superposition of two other series: one expresses the correct or expected value for each age x, and the other the random distortion that alters the observed value. In this case, the latter would be the sum of all the causes and circumstances that lead persons to state an incorrect age.
Therefore, the model is:

\[ y_x = u_x + e_x \]

where, for each age x:

$y_x$ is the observed value.
$u_x$ is the expected or correct value.
$e_x$ is the error or random distortion.

In this application, the values $u_x$ are assumed to follow a slow trend, without sharp zigzags, and the random errors are assumed to be uncorrelated. The hypothesis that the random errors are uncorrelated could be relaxed.

A second essential hypothesis, which makes the Variate Difference Method applicable, is that the expected value $u_x$ is simply a polynomial of degree n, where n is unknown. The Variate Difference Method determines the value of n. Once n has been obtained, a polynomial of that degree is fitted to the observed data $y_x$.

In this respect, there is a close relationship between the moving average method and the variate difference method. Specifically, M. G. Kendall (A Theorem in Trend Analysis, Biometrika, vol. 48, 1961; Advanced Theory of Statistics) has proven that the calculations of the moving average method result from applying the variate difference method to a linear combination of some of the successive terms of the observed values $y_x$. More precisely, every moving average formula that fits a polynomial of degree p - 1 to 2K + 1 successive terms can be written with 2K - p + 1 numbers $b_i$ (with $p - K \le i \le K$), so that

\[ \hat{u}_x = y_x - \Delta^{p} \sum_{i=p-K}^{K} b_i\, y_{x+i} \]

where:

$\Delta^{p}$ is the difference of order p.
$\hat{u}_x$ is the estimated value of $u_x$.
$b_i$ are the coefficients of Sheppard's smoothing formula.

If the expected value $u_x$ follows a polynomial of degree n, the task is to determine that degree. To do so, an iterative process calculates the successive finite differences. At some point in this process the expected value $u_x$ disappears, because differencing cancels the polynomial of degree n: the n-th difference is constant, and all subsequent differences are therefore zero.
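The cancellation of a polynomial under repeated differencing can be checked with a short Python sketch. The helper name and the degree-2 polynomial below are hypothetical choices made for the illustration; they are not data or code from the mortality tables.

```python
def finite_differences(values, order):
    """Return the forward finite differences of the given order."""
    result = list(values)
    for _ in range(order):
        # Each pass replaces the series by the differences of
        # consecutive terms, shortening it by one element.
        result = [b - a for a, b in zip(result, result[1:])]
    return result


# Hypothetical expected values u_x following a polynomial of degree n = 2.
u = [3 * x * x - 2 * x + 5 for x in range(8)]

second = finite_differences(u, 2)  # constant series (2! * 3 = 6)
third = finite_differences(u, 3)   # all zeros: the polynomial is cancelled

print(second)
print(third)
```

In this noise-free example the second differences are all equal to 6 and the third differences are all zero. With observed values, which also contain the random element, the differences no longer vanish exactly.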
Nevertheless, since the calculations are performed on the observed values $y_x$, it is necessary to know at what point in this process of successive finite differences the expected value has been cancelled, leaving only a residue due to the random errors $e_x$.

This question can be answered with the following consideration: if a series contains only a random element, the variances of the successive series of finite differences are equal, after correcting them with a binomial coefficient, given that the series, being random, is not ordered in time. Consequently, the corrected variance of the first and second differences is the same as that of the original series.

This provides a criterion to determine when the expected value $u_x$ has disappeared. If the variance of the k-th difference is equal to that of the (k+1)-th difference, to that of the (k+2)-th, and so on, the expected value $u_x$ can be said to have been cancelled on taking the k-th difference. In practice, exact equality between two variances is never reached, since there is always a residue of random variation. However, since this is a probabilistic method, it suffices that the difference between the variances of two successive series of finite differences be smaller than three times the standard error of the lowest difference.

When this is applied to the compilation of the mortality tables, the series of expected values always disappears in the first or second differences. This implies that moving averages are applied throughout when smoothing the original series.

After determining the degree of the polynomial to be fitted, it only remains to apply the corresponding weighted average, with the coefficients of Sheppard's smoothing formula. The type of moving average is determined as follows. If the non-random element or expected value $u_x$ is approximately cancelled in the first or second difference, n = 1 is used, that is, a moving average equivalent to fitting a straight line to a certain number (not determined by the method) of consecutive observed values $y_x$. If the expected value is cancelled in the third or fourth finite differences, n = 2, and the moving average selected is equivalent to fitting a second-degree parabola to a certain number of consecutive observed values. If the non-random element is cancelled in the fifth or sixth differences, n = 3, and the weighted average used is equivalent to fitting a third-degree (cubic) polynomial to a selected number of consecutive observed values, and so on. In general, if the non-random element is cancelled in the k-th finite difference, n = k/2 when k is even, or n = (k+1)/2 when k is odd.

As mentioned above, moving averages are applied to a specific number of consecutive observed values, suitably centred. This number, however, is not determined by the method: the criterion is left to experience and to the specific nature of the problem to be resolved. Nevertheless, the number of values included in each average should be chosen with regard to the length of the main cycle that is to be cancelled. In this case, moving averages have been taken over each set of five consecutive observed values $y_x$.

6 Tables calculated and base information used

Mortality tables have been calculated annually for the population of Spain and its Autonomous Communities for the period 1992-2005. In the case of Ceuta and Melilla, the tables obtained are for the two cities together, although since 2002 they have been presented separately as two autonomous cities. In all the geographical areas considered, tables have been obtained for males, females and the total population.
The deaths used in calculating the mortality tables (national and by Autonomous Community) for each year have been obtained as the average of the figures registered by age in the Vital Statistics for two consecutive years, the reference year and the previous year, for the male, female and total population tables. The irregularities in these original death figures, caused by possible errors in the classification by age, have been removed using the smoothing procedure explained in the previous section.

The populations used, by Autonomous Community, sex and single year of age, referring to 1 January of each year, correspond to the Intercensal population estimates obtained between the Population Censuses of 1991 and 2001 and to the Estimates of the current population calculated from the latter census. The figures used are published alongside the biometric functions of the calculated mortality tables.

Glossary of symbols

Q(X) = Risk or probability of death between the exact ages X and X+1.
L(X) = Survivors aged exactly X among 100,000 initial persons.
D(X) = Theoretical deaths occurred between the two exact ages X and X+1.
E(X) = Life expectancy at exact age X.
LL(X) = Survivors aged X years old.
T(X) = Probability of survival from age X to age X+1 for persons aged X years old.
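As an illustration of how Q(X), L(X), D(X) and E(X) relate to one another, the following Python sketch derives the last three from a series of death probabilities, using the relations of section 3 (survival probability as the complement of the death probability, survivors obtained recursively, and life expectancy with the half-year assumption). The function name and the three-age series in the usage lines are hypothetical and far shorter than a real table; LL(X) and T(X) are not covered.

```python
def life_table(q, radix=100_000.0):
    """Derive p_x, l_x, d_x and e_x from death probabilities q_x.

    Assumes the last probability equals 1.0, so that there are no
    survivors beyond the oldest age, and that earlier ones are below 1.
    """
    if not q or q[-1] != 1.0:
        raise ValueError("the last death probability must be 1.0")
    if any(not 0.0 <= qx < 1.0 for qx in q[:-1]):
        raise ValueError("earlier death probabilities must lie in [0, 1)")

    p = [1.0 - qx for qx in q]                 # p_x = 1 - q_x
    l = [radix]
    for px in p:
        l.append(l[-1] * px)                   # l_{x+1} = l_x * p_x
    d = [l[x] * q[x] for x in range(len(q))]   # d_x = l_x * q_x
    # e_x = 1/2 + (sum of survivors at later exact ages) / l_x
    e = [0.5 + sum(l[x + 1:]) / l[x] for x in range(len(q))]
    return p, l, d, e


# Hypothetical toy series with three ages and an initial group of 100.
p, l, d, e = life_table([0.5, 0.5, 1.0], radix=100.0)
print(l)  # survivors at exact ages 0, 1, 2 and 3
print(d)  # theoretical deaths at ages 0, 1 and 2
print(e)  # life expectancy at ages 0, 1 and 2
```

In this toy case the survivors are 100, 50, 25 and 0, the theoretical deaths add up to the initial 100 persons, and life expectancy at age 0 is 1.25 years.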