A Transformation-based Nonparametric Estimator of Multivariate Densities with an Application to Global Financial Markets

Meng-Shiuh Chang∗ and Ximing Wu†

September 25, 2011

Abstract

We propose a probability-integral-transformation-based estimator of multivariate densities. Given a sample of random vectors, we first transform the data into their corresponding marginal distributions. We then estimate the density of the transformed data via the Exponential Series Estimator in Wu (2010). The density of the original data is then estimated as the product of the density of the transformed data and the marginal densities of the original data. This construction coincides with the copula decomposition of multivariate densities. We decompose the Kullback-Leibler Information Criterion (KLIC) between the true density and our estimate into the KLIC of the marginal densities and that between the true copula density and a variant of the estimated copula density. This result is of independent interest and provides a framework for our asymptotic analysis. We derive the large sample properties of the proposed estimator, and further propose a stepwise hierarchical method of basis function selection that features a preliminary subset selection within each candidate set. Monte Carlo simulations demonstrate the superior performance of the proposed method. We employ the proposed method to model the joint densities of the US and UK stock market returns under different Asian market conditions. The estimated copula density function, a by-product of our estimation, provides useful insight into the conditional dependence structure between the US and UK markets, and suggests a certain resilience against financial contagions originating from the Asian market.

∗ School of Public Finance and Taxation, Southwestern University of Finance and Economics, Sichuan, China; Email: mslibretto@hotmail.com
† Texas A&M University, College Station, TX 77843; Email: xwu@tamu.edu.
We are grateful to David Bessler, Jianhua Huang, Qi Li, Victoria Salin, Robin Sickles and Natalia Sizova for helpful comments and suggestions. The usual disclaimer applies.

1 Introduction

Estimating probability distributions is one of the most fundamental tasks in many fields of science. Although many distributions can be conveniently summarized by a number of sample statistics, such as moments, quantiles or cumulants, the distribution/density function itself is the 'complete package' in the sense that all the aforementioned summary measures can be calculated from it. In addition, distributions or densities offer an important extra advantage: researchers may obtain useful insights into the quantities in question via visual examination of distributions/densities, insights that might remain elusive even if a set of summary measures were closely scrutinized.

This paper is concerned with the estimation of multivariate density functions. Multidimensional analysis has played an increasingly important role in modern economics. For instance, the recent financial crisis has called for a more comprehensive approach to risk assessment of the financial market, in which multivariate analysis of the market, especially under extreme conditions, plays a crucial role. Another example is welfare analysis. Conventional welfare analysis tends to focus on a single attribute such as income or consumption. However, the recent literature has contended that one should take into account not only income/consumption but also other important attributes such as health, environment and civil rights, to arrive at more balanced welfare inferences. In either example, estimation of the joint distribution/density of the quantities in question is of independent interest and constitutes an important first step that aids in the construction of more definite answers.

Density functions can be estimated by either parametric or nonparametric methods.
Parametric estimators are asymptotically efficient if they are correctly specified, but inconsistent under faulty distributional assumptions. In contrast, nonparametric estimators are consistent, though they converge at a slower-than-root-n rate. The slower convergence rate is due to the fact that the number of (nuisance) parameters in nonparametric estimators increases with the sample size to achieve consistency. This so-called curse of dimensionality is particularly severe for multivariate density estimation, in which the number of parameters increases with the sample size and exponentially with the dimension of the random vectors.

In this paper, we propose a new multivariate density estimator that is shown to mitigate the curse of dimensionality. Let {X_t}_{t=1}^n, where X_t = (X_{1t}, . . . , X_{dt}), be an iid random sample from a d-dimensional distribution F with density f. We first transform each margin of the random sample to F̂_j(X_{jt}) for j = 1, . . . , d and t = 1, . . . , n, where F̂_j is an estimate of the jth marginal distribution. Let ĉ be an estimate of the joint density of the transformed data. It follows that one can construct a density estimator of the original data as

f̂(x) = ĉ(F̂_1(x_1), . . . , F̂_d(x_d)) ∏_{j=1}^d f̂_j(x_j),    (1)

where the f̂_j, j = 1, . . . , d, are estimates of the marginal densities of the original data. This probability-integral-transformation-based density estimator is used in Ruppert and Cline (1994) for univariate densities. Their estimator is bias-reducing because the transformed data converge in distribution to the standard uniform distribution, whose derivatives of all orders are zero. Although bias reduction is a potential benefit of employing this transformation in multivariate density estimation, our estimator is largely motivated by a different consideration. Equation (1) indicates that the joint density can be constructed as the product of marginal densities and the density of the transformed data.
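The two-step construction in (1) can be sketched numerically. The snippet below is a minimal illustration only: it substitutes kernel density estimates (scipy's `gaussian_kde`) for the paper's Exponential Series Estimator in both stages, uses rescaled empirical CDFs for the F̂_j, and all variable names are ours.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Simulated bivariate sample (d = 2) with correlated margins.
n, rho = 500, 0.6
X = rng.multivariate_normal([0.0, 0.0], [[1.0, rho], [rho, 1.0]], size=n)

# Step 1: estimate each marginal CDF (rescaled empirical CDF here) and transform.
def ecdf(sample):
    s = np.sort(sample)
    return lambda x: np.searchsorted(s, x, side="right") / (len(s) + 1.0)

F_hat = [ecdf(X[:, j]) for j in range(2)]
U = np.column_stack([F_hat[j](X[:, j]) for j in range(2)])  # transformed data

# Step 2: estimate the density c-hat of the transformed data
# (a kernel stand-in for the paper's exponential series estimator)
# and the marginal densities f-hat_j of the original data.
c_hat = stats.gaussian_kde(U.T)
f_hat = [stats.gaussian_kde(X[:, j]) for j in range(2)]

def joint_density(x):
    # assemble the estimate (1): c-hat(F-hat(x)) times the product of margins
    u = np.array([F_hat[j](x[j]) for j in range(2)])
    return c_hat(u.reshape(2, 1))[0] * np.prod([f_hat[j](x[j])[0] for j in range(2)])

print(joint_density([0.0, 0.0]))  # estimated bivariate density near the mode
```

Any consistent marginal estimators could replace these stand-ins; the product structure of (1) is unchanged.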
As a matter of fact, this construction coincides with the copula decomposition of multivariate densities according to the celebrated Sklar's Theorem (1959), in which the first factor in (1) is termed the copula density function. This decomposition allows one to assemble a joint density by first estimating the marginal densities and the copula density separately. A valuable by-product of this estimator is the copula density function, which completely summarizes the dependence structure among the margins and oftentimes provides useful insight into the variables in question.

We present a decomposition of the convergence rate of estimator (1), in terms of the Kullback-Leibler Information Criterion (KLIC), into that of the marginal densities and the KLIC between the true copula density and a variant of the estimated copula density. This result is of independent interest and provides the necessary framework for the asymptotic analysis of the proposed estimator. The KLIC, being an expected log ratio between two densities, arises as a natural metric for our analysis because it conveniently transforms the product of densities in (1) into a sum of log densities.

We then propose a nonparametric estimator of the empirical copula density using the Exponential Series Estimator (ESE) of Wu (2010). The ESE is particularly suitable for copula density estimation since it is defined explicitly on a bounded support and thus does not suffer from the boundary biases of the usual kernel density estimators, which are particularly severe for copula density estimation because copula densities tend to peak towards the boundaries of their domain. This estimator has an appealing information-theoretic interpretation and lends itself to asymptotic analysis in terms of the KLIC.
We derive the large sample properties of the transformation-based multivariate density estimator, and discuss how the convergence rates of the marginal densities and the copula density together determine the convergence rate of the joint density. Inevitably, the number of basis functions in the ESE increases rapidly with the dimension and the sample size. To facilitate the selection of basis functions, we propose a stepwise hierarchical approach for basis function selection, which features a preliminary subset selection within each group that maximizes the degree of similarity between the candidate set and a subset of a given size (see, e.g., Cadima et al., 2004 for subset selection). Finally, we use an information criterion such as the Akaike Information Criterion (AIC) or the Bayesian Information Criterion (BIC) to select a preferred model.

To examine the finite sample performance of the transformation-based estimator, we conduct two sets of Monte Carlo simulations. The first experiment compares the proposed estimator with a direct density estimator using the kernel method in terms of overall density estimation performance. The second experiment compares estimates of joint tail probabilities based on the proposed method, a kernel estimate and the empirical distribution. In both experiments, the proposed method outperforms the alternatives, oftentimes by substantial margins.

Lastly, we apply the proposed method to estimate the joint distribution of the US and UK stock markets under different Asian market conditions. Our analysis reveals how fluctuations and extreme movements of the Asian market influenced the western markets. In particular, the influence is asymmetric in the sense that the western markets responded more strongly when the Asian markets were up than when they were down, indicating a certain degree of resilience against global financial contagions originating from the Asian market.
We note that the asymmetric relation, albeit obscure in the joint densities of the US and UK markets, is quite evident in their copula densities – a valuable by-product of our transformation-based estimator.

This paper proceeds as follows. Section 2 briefly reviews various approaches to density estimation. Section 3 presents a two-stage transformation-based estimator of multivariate densities and its large sample properties, followed by a sequential updating method of basis function selection. Section 4 presents a series of Monte Carlo simulations of the proposed method. Section 5 provides an empirical application on global financial markets. Some concluding remarks are given in the last section.

2 Multivariate density estimation

Let {X_t}_{t=1}^n be a d-dimensional i.i.d. random sample from an unknown continuous distribution F with density f defined on R^d, d ≥ 2. We are interested in estimating f. There exist two general approaches: parametric and nonparametric. The parametric approach entails functional form assumptions up to a finite set of unknown parameters. Multivariate normal distributions or, more generally, the elliptical families are commonly used due to their simplicity. The nonparametric approach provides a flexible alternative that seeks a functional approximation to the unknown density. Instead of imposing functional form assumptions, this approach allows the number of (nuisance) parameters to increase with sample size to achieve consistency. One can also combine these two approaches to balance parsimony and goodness-of-fit. Below we briefly review various methods for multivariate density estimation, with a focus on nonparametric estimators.
2.1 Direct estimation

One of the most commonly used density estimators is the kernel density estimator (KDE), which takes the form

f_h(x) = (1/n) ∑_{t=1}^n K_h(X_t − x),

where K_h(x) is a d-dimensional non-negative kernel function that peaks at x = 0 and h, the so-called bandwidth, controls how fast K_h(x) decays as x moves away from zero. A popular choice of K is the Gaussian kernel, i.e., the standard normal density function. For multivariate densities, the product kernel is commonly used. It is well known that the performance of the KDE crucially depends on the choice of bandwidth. Data-driven methods, such as cross validation, are often used for bandwidth selection (see, e.g., Li and Racine, 2007).

Another popular method for density estimation is the series estimator. Let g_i, i = 1, 2, . . . , m, be a series of linearly independent real-valued basis functions defined on R^d. A series estimator is given by

f_m(x) = ∑_{i=1}^m λ_i g_i(x),

where the number of basis functions m plays a role similar to the bandwidth in kernel estimation, and is usually determined by some data-driven method, such as generalized cross validation. Examples of basis functions include the power series, trigonometric series, splines, and wavelets.

For a d-dimensional random variable with an r-times continuously differentiable density, both the kernel and series estimators can achieve the optimal convergence rate O_p(n^{−r/(2r+d)}) in the L2 norm under some regularity conditions. However, for kernel estimators to achieve a convergence rate faster than n^{−2/(4+d)}, one needs to use a higher (than second) order kernel, which may lead to negative density estimates. The optimal series estimator has the appealing property of automatically adapting to the unknown smoothness of the underlying distribution, but it does not guarantee the positivity of density estimates either.
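The product-kernel KDE above can be sketched directly. This is a minimal illustration with hand-picked bandwidths (not the cross-validated choices the text recommends):

```python
import numpy as np

def product_kernel_kde(X, x, h):
    """Gaussian product-kernel density estimate at the point x.
    X: (n, d) sample; x: (d,) evaluation point; h: (d,) bandwidths."""
    Z = (X - x) / h                                   # standardized distances, (n, d)
    K = np.exp(-0.5 * Z**2) / np.sqrt(2 * np.pi)      # univariate Gaussian kernels
    return np.mean(np.prod(K, axis=1)) / np.prod(h)   # product kernel, averaged

rng = np.random.default_rng(1)
X = rng.standard_normal((2000, 2))
est = product_kernel_kde(X, np.zeros(2), np.array([0.3, 0.3]))
true = 1.0 / (2 * np.pi)   # standard bivariate normal density at the origin
print(est, true)           # the estimate is close to, but biased below, the truth
```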
One advantage of these estimators is their linearity, which makes it easy to use cross validation to determine their smoothing parameters and relatively straightforward to derive their asymptotic properties. But the linearity is also their weakness in the sense that their likelihood functions, being products of sums, are complicated and admit no sufficient statistics.

Alternatively, there exist likelihood-based nonparametric density estimators. A regular exponential family of estimators takes the form

f_m(x) = exp(∑_{i=1}^m λ_i g_i(x) − λ_0),    (2)

where g_i, i = 1, . . . , m, are a series of bounded, linearly independent basis functions, and λ_0 ≡ log ∫ exp(∑_{i=1}^m λ_i g_i(x)) dx < ∞ ensures that f_m integrates to unity. The estimation of a probability density function by sequences of exponential families, which is equivalent to approximating the logarithm of a density by a series estimator, has long been studied. Earlier studies on the approximation of log densities using polynomials include Neyman (1937) and Good (1963). Transforming the polynomial estimate of the log density back to its original scale yields a density estimator in the exponential family. The maximum likelihood method provides efficient estimates of this canonical exponential family. Crain (1974) establishes the existence and consistency of the maximum likelihood estimator. Zellner and Highfield (1988) and Wu (2003) discuss the numerical aspects of this estimator, which typically requires nonlinear optimization. By letting the number of basis functions increase with sample size, one obtains a nonparametric estimator of the form (2). Stone (1990) and Kooperberg and Stone (1991) provide in-depth analyses of the log-spline density estimator, which is a special case of (2) with spline basis functions in its exponent. Barron and Sheu (1991) establish the asymptotic properties of (2) for general basis functions, including the power series, splines and trigonometric series, in a unified framework.
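A hedged one-dimensional sketch of fitting the exponential family (2): for this canonical family, maximum likelihood is equivalent to minimizing the convex dual λ_0(λ) − λᵀμ̂, i.e., matching the fitted moments to the sample moments μ̂_i. The polynomial basis g_i(x) = x^i, the grid quadrature and the BFGS optimizer are our illustrative assumptions, not prescriptions from the paper.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.integrate import trapezoid

rng = np.random.default_rng(2)
X = rng.beta(2.0, 5.0, size=1000)   # data supported on [0, 1]
m = 3
mu_hat = np.array([np.mean(X**i) for i in range(1, m + 1)])  # sample moments

grid = np.linspace(0.0, 1.0, 2001)
G = np.vstack([grid**i for i in range(1, m + 1)])  # basis evaluated on the grid

def neg_loglik(lam):
    # lam0 = log integral of exp(sum_i lam_i g_i); trapezoid quadrature.
    lam0 = np.log(trapezoid(np.exp(lam @ G), grid))
    # Average negative log-likelihood up to a constant: the convex dual.
    return lam0 - lam @ mu_hat

lam = minimize(neg_loglik, np.zeros(m), method="BFGS").x
lam0 = np.log(trapezoid(np.exp(lam @ G), grid))
f_hat = np.exp(lam @ G - lam0)      # fitted density (2) on the grid
print(trapezoid(f_hat, grid))       # integrates to one by construction
```

At the optimum the fitted moments ∫ x^i f_m(x) dx reproduce μ̂_i, which is the sufficient-statistic property noted below.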
They show that under suitable regularity conditions, this estimator achieves the optimal rate specified in Stone (1982) in terms of the Kullback-Leibler information criterion. Wu (2010) further generalizes their results to multivariate density estimation. Following the spirit of Barron and Sheu (1991), we call this family of density estimators the Exponential Series Estimator (ESE) to reflect its nonparametric nature. Like the series estimator, an optimal ESE adapts to the smoothness of the underlying distribution automatically. On the other hand, it is strictly positive and has a set of sufficient statistics, μ̂_i = (1/n) ∑_{t=1}^n g_i(X_t), i = 1, . . . , m, thanks to its canonical exponential form. In addition, the ESE has an appealing information-theoretic interpretation: it can be derived as the maximum entropy density by maximizing Shannon's information entropy subject to given moment constraints (Jaynes, 1957).

The methods discussed so far are general methods that can be applied to univariate and multivariate densities alike. However, due to the 'curse of dimensionality', the estimation of multivariate densities is significantly more difficult. Below we discuss transformation-based, multiple-stage estimation methods that facilitate multivariate density estimation by mitigating the curse of dimensionality.

2.2 Transformation-based estimation

Transformation of variables of interest to facilitate model construction and estimation is a common practice in statistical and econometric analyses. For example, the logarithm transformation of a positive dependent variable in regression analysis oftentimes mitigates heteroskedasticity. More generally, the Box-Cox transformation, which nests the logarithm transformation as a limiting case, is often used to remedy deviations from normality in residuals. Although less common, transformations have also been used in density estimation.
In the context of nonparametric density estimation, transformations can be used to reduce bias. Wand et al. (1991) propose a transformation-based kernel density estimator. They note that the usual kernel estimators with a single global bandwidth work well for densities that are not far from Gaussian in shape, but can perform quite poorly when the densities deviate further from Gaussian. In a spirit close to the Box-Cox transformation, they propose transformations of the data so that the density of the transformed data is closer to normal and can be adequately estimated by kernel estimators with a global bandwidth. In particular, they suggest applying the shifted power transformation to right-skewed data. They demonstrate that if a transformation is carefully selected, it is then more appropriate to use the typical kernel estimator with a global bandwidth on the transformed data; the resulting density estimate of the raw data obtained by back-transformation has a smaller bias. Yang and Marron (1999) show that multiple families of transformations can be employed at the same time, and this process can be iterated for further improvements.

Wand et al. (1991) and Yang and Marron (1999) consider only parametric transformations, which reduce biases but do not improve the convergence rate of nonparametric estimators. Ruppert and Cline (1994) propose an estimator, based on the probability integral transformation, that not only reduces bias but also improves the convergence rate for univariate densities. Suppose for now that X_t is a scalar. First the data are transformed to F̂(X_t), where F̂ is a smooth estimate of the CDF of X. The estimated kernel density of the raw data then takes the form

f̃(x) = (1/n) ∑_{t=1}^n K_h(F̂(X_t) − F̂(x)) f̂(x),

where f̂(x) ≡ dF̂(x)/dx. Because the distribution of the transformed data converges to the standard uniform distribution, whose density has all derivatives equal to zero, the bias of the second stage estimate is asymptotically negligible.
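A one-dimensional sketch of the transformation estimator f̃ above, assuming Gaussian kernels and hand-picked bandwidths (both are illustrative choices of ours, not the authors'):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(3)
X = rng.exponential(size=1000)   # a right-skewed sample, true density exp(-x)

def kde(sample, x, h):
    # one-dimensional Gaussian kernel density estimate at a point x
    return np.mean(np.exp(-0.5 * ((sample - x) / h) ** 2)) / (h * np.sqrt(2 * np.pi))

def F_smooth(x, h=0.2):
    # smooth CDF estimate F-hat: integrated Gaussian kernel
    return np.mean(norm.cdf((x - X) / h))

def f_pilot(x, h=0.2):
    # pilot density estimate f-hat = dF-hat/dx
    return kde(X, x, h)

def f_tilde(x, h2=0.1):
    # second-stage kernel estimate on the transformed (near-uniform) scale,
    # multiplied back by f-hat as in the displayed formula
    U = np.array([F_smooth(xi) for xi in X])
    return kde(U, F_smooth(x), h2) * f_pilot(x)

print(f_tilde(1.0))   # the true Exp(1) density at 1 is exp(-1)
```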
They further show that if the bandwidths of the first and second steps are chosen to be of order n^{−1/9}, then the squared error of f̃ is of order O_p(n^{−8/9}) as n → ∞ rather than O_p(n^{−4/5}), the rate of the usual kernel density estimators. This procedure can also be iterated to obtain further rate improvements, although in practice the benefits may be rather small. Intuitively, both parametric and nonparametric transformations achieve bias reduction by choosing a transformation such that the density of the transformed data is easier to estimate in terms of, say, smaller squared errors or integrated squared error.

Motivated by Ruppert and Cline (1994), we apply the nonparametric transformation approach to multivariate density estimation. We show that this transformation benefits the estimation by mitigating the curse of dimensionality. In addition, it provides useful insight into the dependence relation among variables, since the decomposition resulting from the said transformation coincides with the copula decomposition of multivariate densities. Let F̂_j and f̂_j, j = 1, . . . , d, be the estimated marginal CDF and PDF for the jth margin of d-dimensional data X = [X_1, . . . , X_d]. The probability-integral-transformation-based density estimator of X is given by

f̂(x) = f̂_1(x_1) · · · f̂_d(x_d) ĉ(F̂_1(x_1), . . . , F̂_d(x_d)),    (3)

where ĉ is the estimate of the density of the transformed data {F̂_1(x_1), . . . , F̂_d(x_d)}.

Interestingly, (3) can also be derived using Sklar's theorem (Sklar, 1959). Let f be the density of a d-dimensional random variable, with F_j and f_j being its jth marginal CDF and PDF for j = 1, . . . , d. Sklar (1959) shows that the joint density can be decomposed as

f(x) = f_1(x_1) · · · f_d(x_d) c(F_1(x_1), . . . , F_d(x_d)).    (4)

When all marginal distributions are differentiable, the decomposition is unique.
The last factor in (4) is termed the copula density, which completely summarizes the dependence structure of x (see, e.g., Nelsen, 2006 for a general treatment of copulas). The copula decomposition allows the separation of marginal distributions from their dependence and thus facilitates the construction of flexible multivariate distributions.

2.3 Entropy of Copula Decompositions

Although it has been suggested that the copula approach helps mitigate the curse of dimensionality in multivariate density estimation (see, e.g., Hall and Neumeyer, 2006), to the best of our knowledge this conjecture has not been established formally. Below we provide a formal characterization of the potential benefit of the copula approach in multivariate density estimation.

Generally, the more features a density has (or the more bumpy a density is), the more difficult is its estimation. There exist many criteria for the degree of bumpiness. For instance, the integrated squared derivatives, ∫ (f^{(r)}(x))^2 dx, r = 1, 2, . . ., are commonly used for this purpose. For discrete distributions, Shannon's information entropy is sometimes used. Note that throughout we define the entropy with the sign opposite to Shannon's convention: the discrete entropy is given by H = ∑_{k=1}^K p_k ln p_k, where p_k ≥ 0 and ∑_{k=1}^K p_k = 1. One can show that H attains its minimum, −ln K, when p_k = 1/K for all k; namely, when the distribution is uniform across all possible states. In this sense H, which up to the constant ln K equals the KLIC between a given discrete distribution and the uniform distribution, can be viewed as an indicator of the degree of difficulty in density estimation. For a continuous univariate density f, the entropy is defined analogously as H(f) = ∫ f(x) ln f(x) dx. The entropy of a continuous distribution is not restricted in sign and generally does not admit the intuitive interpretation of the discrete entropy.
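A quick numerical check of the discrete case: with the sign convention used here, H = ∑_k p_k ln p_k is smallest (−ln K) for the uniform distribution and grows as the distribution becomes more peaked.

```python
import numpy as np

def H(p):
    # the paper's sign convention: H = sum_k p_k ln p_k (zero entries contribute 0)
    p = np.asarray(p, dtype=float)
    return float(np.sum(p * np.log(p, where=p > 0, out=np.zeros_like(p))))

K = 4
uniform = np.full(K, 1.0 / K)
peaked = np.array([0.7, 0.1, 0.1, 0.1])
print(H(uniform), H(peaked))   # the uniform distribution gives the minimum, -ln K
```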
Nonetheless, if we impose the restriction that a given distribution has a bounded support, then a similar interpretation is available. To see this, we first introduce a closely related concept, the Kullback-Leibler information criterion (KLIC). The KLIC between two densities f and g is given by

D(f||g) = ∫ f(x) log (f(x)/g(x)) dx,

where f is absolutely continuous with respect to g, and D(f||g) = +∞ otherwise. It is well known that D(f||g) ≥ 0, with equality if and only if f = g almost everywhere. Suppose that the common support of f and g is bounded. Without loss of generality, the support can be assumed to be the unit interval. Under this condition, when g is the standard uniform density, D(f||g) = H(f). Thus the entropy of a continuous distribution defined on a bounded support, like the discrete entropy, captures its discrepancy from the uniform distribution, and therefore reflects the degree of difficulty in its density estimation.

Below we show that for a multivariate density defined on a bounded support, the entropy of the joint density can be decomposed into the entropies of its margins and its copula density. In particular, let H(f) = ∫ f(x_1, . . . , x_d) ln f(x_1, . . . , x_d) dx_1 · · · dx_d be the entropy of a d-dimensional distribution. We establish the following:

Theorem 1 Suppose x is a d-dimensional random variable from a distribution F defined on [0, 1]^d, with density function f. Then

H(f) = ∑_{j=1}^d H(f_j) + H(c),    (5)

where H(f_j), j = 1, . . . , d, is the entropy of the jth marginal density, and H(c) is the entropy of the corresponding copula density c(F_1(x_1), . . . , F_d(x_d)).

Remark 1 The assumption that X is defined on [0, 1]^d is no more restrictive than the assumption of a bounded support, since one can transform an arbitrary bounded support to the unit hypercube through, e.g., the probability integral transformation.
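The identity (5) can be checked by Monte Carlo. Strictly, Theorem 1 is stated on [0, 1]^d, but the identity itself follows from Sklar's decomposition whenever the densities exist; the check below uses a bivariate normal, for which H(c) has the closed form −(1/2) ln(1 − ρ²) (the Gaussian mutual information), using H(g) = E[ln g(X)].

```python
import numpy as np
from scipy.stats import multivariate_normal, norm

rng = np.random.default_rng(4)
rho, n = 0.6, 200_000
cov = [[1.0, rho], [rho, 1.0]]
X = rng.multivariate_normal([0.0, 0.0], cov, size=n)

# Entropies in the paper's convention H(g) = E[ln g(X)], estimated by sample means.
H_joint = np.mean(multivariate_normal([0, 0], cov).logpdf(X))
H_margins = np.mean(norm.logpdf(X[:, 0])) + np.mean(norm.logpdf(X[:, 1]))
H_copula = H_joint - H_margins          # copula entropy implied by identity (5)
print(H_copula, -0.5 * np.log(1 - rho**2))
```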
Remark 2 We stress that the assumption of bounded support is imposed only to facilitate our theoretical exploration below. In practical implementation of the transformation-based estimation, the supports of the marginal distributions are not required to be bounded. In principle, any reasonable parametric/nonparametric method can be used to estimate the marginal densities and distributions. After the first stage probability integral transformation, the domain of the second stage density, or the copula density, is naturally restricted to [0, 1]^d.

Remark 3 The decomposition (5) resembles the structure of additive models in nonparametric estimation. (There is one subtle difference: the additive structure of general additive models is oftentimes assumed, while that of the copula decomposition holds exactly.) The difficult task of estimating the joint density is divided into the estimation of the individual marginal densities and that of the copula density. Viewing the entropy as an indicator of the degree of difficulty in density estimation, each individual task is at least as easy as the estimation of the joint density, since all entropies on the right hand side are non-negative. Usually multivariate densities are considerably more difficult to estimate than univariate ones. Theorem 1 suggests that when there are significant features in the marginal distributions, the two-step estimation approach can be more effective than direct estimation, since the joint density of the transformed data, or the copula density, is easier to estimate after the features attributed to the marginal distributions have been 'removed'. Obviously, the higher the dimension d, the more substantial the potential benefit can be.

Remark 4 The copula decomposition of multivariate densities induced by the probability integral transformation is in spirit close to the so-called 'divide and conquer' algorithm, as it is dubbed in the computer science literature.
Due to its flexibility, the copula approach has been used widely in multivariate analyses; see, e.g., Chen and Fan (2006a, 2006b), Chen et al. (2006), Patton (2006) and the references therein. Hall and Neumeyer (2006) show that the copula method can benefit the estimation of joint densities when there are additional data for the margins. Chui and Wu (2009) provide simulation evidence that two-step estimation via an empirical copula density often outperforms direct estimation of joint densities. Both papers consider only bivariate densities. Below we present formally a transformation-based estimator for the general d-dimensional case and establish its large sample properties.

3 Transformation-based multivariate density estimation

In this section we present a nonparametric transformation-based multivariate density estimator and establish its asymptotic properties. We then propose a method of model specification for the second stage estimation of the density of the transformed data, which can be viewed as estimation of the empirical copula density function.

3.1 The estimator

The transformation-based estimator for an iid d-dimensional random sample {X_t}_{t=1}^n is constructed in two steps. We first obtain consistent estimates of the marginal densities and distributions, denoted by f̂_j and F̂_j respectively for j = 1, . . . , d. Note that it is not necessary that f̂_j(x) = F̂_j′(x); the densities and distributions of the margins can be estimated separately. For instance, we can combine smoothed estimates of the marginal densities with the empirical CDFs of the corresponding margins. The second step estimates the density of the transformed data F̂_t = (F̂_1(X_{1t}), . . . , F̂_d(X_{dt})), t = 1, . . . , n. To ease notation, we define û_t = (û_{1t}, . . . , û_{dt}) with û_{jt} = F̂_j(X_{jt}) for j = 1, . . . , d. As discussed above, the density of {û_t}_{t=1}^n converges to the copula density as n → ∞.
Like an ordinary density function, a copula density can be estimated by parametric or nonparametric methods. Parametric copula density functions are usually parameterized by one or two parameters. This parsimony in functional form, however, oftentimes imposes restrictions on the dependence structure among the margins. For example, the popular Gaussian copula is known to exhibit zero tail dependence. Consequently, it may be inappropriate to use simple Gaussian copulae to investigate the co-movements of extreme stock returns. Another limitation of parametric copulas is that they are usually defined only for bivariate distributions (with the exception of the multivariate Gaussian copula), and extensions to higher dimensions are generally not available.

Nonparametric estimation of copula densities provides a flexible alternative that ensures consistency. However, compared with their parametric counterparts, nonparametric estimators are marked by their slower convergence rates. In addition, since copula densities are defined on a bounded support, the treatment of boundary biases warrants special care. Boundary biases are negligible if the curves to be estimated vanish at the boundaries, but can pose a considerable problem otherwise. Although the boundary bias problem exists generally in nonparametric estimation, it is particularly severe in copula density estimation. This is because, unlike many densities or curves that vanish at the boundaries, copula densities often peak at the boundaries and corners. For example, consider the joint distribution of two stock returns: their dependence structure is often dominated by the co-movements of their extreme tails, giving rise to a copula density that peaks at either end of the diagonal of the unit square. In this case, a nonparametric estimate, say a kernel estimate, of the copula density without proper boundary bias corrections may significantly underestimate the degree of tail dependence between the two stocks.
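A small illustration of the boundary bias just described: an ordinary Gaussian KDE applied to Uniform(0, 1) data loses roughly half the kernel mass at the edge of the support, so it returns about 1/2 there instead of the true value 1. The effect is worse exactly where copula densities peak.

```python
import numpy as np

rng = np.random.default_rng(7)
U = rng.uniform(size=50_000)   # true density is 1 on [0, 1]
h = 0.05

def kde(x):
    # plain Gaussian KDE with no boundary correction
    return np.mean(np.exp(-0.5 * ((U - x) / h) ** 2)) / (h * np.sqrt(2 * np.pi))

print(kde(0.5), kde(0.0))   # interior estimate near 1.0, boundary estimate near 0.5
```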
In this study, we adopt the ESE to estimate copula densities. This estimator has some appealing properties that make it suitable for copula density estimation. The prominent advantage is that the ESE is explicitly defined on a bounded support. With optimally selected basis functions (by data-driven methods), the estimator is free of boundary bias. This is particularly useful for the estimation of copula densities, which often peak at the boundaries and corners. In addition, the ESE adapts automatically to the unknown smoothness of the underlying copula density, and unlike the higher order kernel and series estimators, it guarantees the strict positivity of the density estimate.

Define a multi-index i = (i_1, i_2, . . . , i_d) and |i| = ∑_{j=1}^d i_j. Given two multi-indices i and m, i ≥ m indicates i_j ≥ m_j elementwise; when m is a scalar, i ≥ m means i_j ≥ m for all j. Recall that the multivariate ESE of a copula density can be derived from the maximization of Shannon's entropy of the copula density subject to some given moment conditions. Suppose m = (m_1, . . . , m_d) and M ≡ {i : |i| > 0 and i ≤ m}. Let {μ̂_i = n^{−1} ∑_{t=1}^n g_i(û_t) : i ∈ M} be a set of moment conditions for the copula density, where the g_i are a sequence of real-valued, bounded and linearly independent basis functions defined on [0, 1]^d. The ESE of the corresponding copula density can be obtained by maximizing Shannon's entropy

H = ∫_{[0,1]^d} −c(u) log c(u) du,    (6)

subject to the integration-to-unity condition

∫_{[0,1]^d} c(u) du = 1    (7)

and the side moment conditions

∫_{[0,1]^d} g_i(u) c(u) du = μ̂_i,  i ∈ M,    (8)

where du = du_1 du_2 · · · du_d for simplicity. The estimated multivariate copula density is then given by

ĉ(u; λ̂) = exp(∑_{i∈M} λ̂_i g_i(u) − λ̂_0),    (9)

where

λ̂_0 = log (∫_{[0,1]^d} exp(∑_{i∈M} λ̂_i g_i(u)) du).    (10)

Note that λ̂_0 is finite due to the boundedness of the domain of copula functions and that of the basis functions.
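A minimal sketch of the program (6)-(10) in two dimensions, assuming a polynomial basis g_i(u) = u_1^{i_1} u_2^{i_2} with m = (2, 2), rank-based probability integral transforms, trapezoid quadrature on a grid, and BFGS on the convex dual λ_0(λ) − λᵀμ̂; none of these implementation choices are prescribed by the paper.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import logsumexp

rng = np.random.default_rng(5)
n = 1000
Z = rng.multivariate_normal([0, 0], [[1, 0.5], [0.5, 1]], size=n)
U = (np.argsort(np.argsort(Z, axis=0), axis=0) + 1) / (n + 1.0)  # ranks -> (0, 1)

# Multi-index set M = {i : |i| > 0, i <= (2, 2)} and the sample moments mu-hat.
idx = [(a, b) for a in range(3) for b in range(3) if a + b > 0]
mu_hat = np.array([np.mean(U[:, 0]**a * U[:, 1]**b) for a, b in idx])

# 2-D trapezoid quadrature weights on a 101 x 101 grid of [0, 1]^2.
g = np.linspace(0, 1, 101)
W = np.full((101, 101), (1 / 100) ** 2)
W[0, :] /= 2; W[-1, :] /= 2; W[:, 0] /= 2; W[:, -1] /= 2
U1, U2 = np.meshgrid(g, g, indexing="ij")
G = np.stack([U1**a * U2**b for a, b in idx])   # basis on the grid, shape (8, 101, 101)

def objective(lam):
    # Convex dual of the max-entropy problem: lam0(lam) - <lam, mu_hat>.
    lam0 = logsumexp(np.tensordot(lam, G, axes=1), b=W)
    return lam0 - lam @ mu_hat

lam = minimize(objective, np.zeros(len(idx)), method="BFGS").x
lam0 = logsumexp(np.tensordot(lam, G, axes=1), b=W)
c_hat = np.exp(np.tensordot(lam, G, axes=1) - lam0)  # fitted copula density (9)
print(np.sum(W * c_hat))                             # satisfies (7) by construction
```

At the optimum the fitted moments reproduce μ̂_i up to the optimizer's tolerance, which is exactly the constraint set (8).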
Given the estimated marginal density functions, the multivariate density of the original data is then estimated by
$$\hat f(x) = \Big(\prod_{j=1}^d \hat f_j(x_j)\Big)\hat c(\hat F(x); \hat\lambda), \qquad (11)$$
where $\hat F(x) = (\hat F_1(x_1), \ldots, \hat F_d(x_d))$.

3.2 Asymptotic Properties of Two-Stage Multivariate ESE

In this section, we present the large sample properties of the proposed transformation-based estimator of multivariate densities. The copula decomposition transforms a joint density into the product of marginal densities and a copula density. Thus a discrepancy measure in terms of the logarithm of densities is particularly useful, since we can then transform the said product into a sum of log densities. It transpires that the Kullback-Leibler Information Criterion (KLIC), discussed in the previous section, is a natural candidate for this task. Below we establish the convergence rates of the transformation-based estimator in terms of the KLIC.

Denote $f(x) = f_1(x_1)\cdots f_d(x_d)c_f(F_1(x_1), \ldots, F_d(x_d))$, where $f_j$ and $F_j$, $j = 1, \ldots, d$, are marginal densities and distributions, and $c_f$ is the copula density for the joint distribution $F(x)$. Similarly, let $g(x) = g_1(x_1)\cdots g_d(x_d)c_g(G_1(x_1), \ldots, G_d(x_d))$, where $g_j$ and $G_j$, $j = 1, \ldots, d$, are marginal densities and distributions, and $c_g$ is the copula density for the joint distribution $G(x)$. We first present a theorem that decomposes the KLIC between $f$ and $g$ as the sum of the KLICs of their components.

Theorem 2 Suppose $F$ and $G$ are continuous distributions defined on a common support with densities $f$ and $g$ respectively. We then have
$$D(f\|g) = \sum_{j=1}^d D(f_j\|g_j) + D(c_f\|\tilde c_g),$$
where $\tilde c_g(u_1, \ldots, u_d) = c_g(G_1(F_1^{-1}(u_1)), \ldots, G_d(F_d^{-1}(u_d)))$ for $(u_1, \ldots, u_d) \in [0,1]^d$.
Remark 5 Analogous to the decomposition of entropy in Theorem 1, Theorem 2 suggests that the KLIC between two multivariate densities can be expressed as the sum of the KLICs between the individual marginal distributions and a term that resembles the KLIC between two copula densities, although $\tilde c_g$ is generally not a copula density function. Only under the condition that $G_j = F_j$ for all $j$ is $\tilde c_g$ a copula density. Nonetheless, this distinction is asymptotically negligible for our analysis below, in which $\hat F_j$ plays the role of $G_j$ and converges to $F_j$ so long as $\hat F_j$ is a consistent estimate of $F_j$.

Remark 6 There are two noteworthy special cases of Theorem 2. When the two distributions share common marginal distributions, or $f_j = g_j$ for all $j$'s, $D(f_j\|g_j) = 0$ and $\tilde c_g = c_g$. It follows that $D(f\|g) = D(c_f\|c_g)$. On the other hand, when the two distributions share a common copula density, or $c_f = c_g$, we have $D(f\|g) = \sum_{j=1}^d D(f_j\|g_j) + D(c_f\|\tilde c_f)$, with $\tilde c_f(u_1, \ldots, u_d) = c_f(G_1(F_1^{-1}(u_1)), \ldots, G_d(F_d^{-1}(u_d)))$. Thus the 'remainder' term attributed to the copula densities generally exists even when two distributions share a common copula function, so long as their marginal distributions differ.

Theorem 2 indicates that one can derive the convergence rate of the estimator given by (11) as a sum of the convergence rates of its components. To facilitate our asymptotic analysis in terms of convergence in the KLIC, we focus on the following estimation strategy:

1. The marginal densities are estimated by the ESE. For $j = 1, \ldots, d$, we have
$$\hat f_j(x_j) = \exp\Big(\sum_{i=1}^{l_j} \hat\lambda_{j,i} g_i(x_j) - \hat\lambda_{j,0}\Big), \qquad (12)$$
where the $g_i$'s are a series of linearly independent, bounded real-valued basis functions defined on the support of $x_j$.

2. The marginal distributions are estimated by their corresponding empirical CDFs. To ease notation, we denote $u_j = F_j(x_j)$ and $\hat u_j = \hat F_j(x_j)$.

3. The copula density $c$ is estimated by the ESE of the form
$$\hat c(u) = \exp\Big(\sum_{i\in\mathcal{M}} \hat\lambda_i g_i(u) - \hat\lambda_0\Big), \qquad (13)$$
where the $g_i$'s are a series of linearly independent, bounded real-valued functions defined on $[0,1]^d$, and $\mathcal{M} = \{i : |i| > 0,\ i \le m\}$ with $m = (m_1, m_2, \ldots, m_d)$.

Only the third condition is essential to our analysis; the first two are assumed to ease our exposition. We stress that the above estimators for the marginal densities and distributions can be replaced by other suitable estimators in our theoretical analysis if we examine convergence in terms of criteria other than the KLIC. As discussed above, in practice the marginal quantities can be estimated by any reasonable estimators. The following conditions are assumed for the rest of this section.

Assumption 1 The observed data $X^1 = [X_1^1, X_2^1, \ldots, X_d^1]$, $X^2 = [X_1^2, X_2^2, \ldots, X_d^2]$, $\ldots$, $X^n = [X_1^n, X_2^n, \ldots, X_d^n]$ are i.i.d. random samples from a continuous distribution $F$ defined on a bounded support, with joint density $f$, marginal densities $f_j$ and marginal distributions $F_j$ for $j = 1, \ldots, d$.

Assumption 2 For each $j$ in $1, \ldots, d$, let $q_j(x_j) = \log f_j(x_j)$. $q_j^{(s_j-1)}(\cdot)$ is absolutely continuous and $\int (q_j^{(s_j)}(x_j))^2 dx_j < \infty$, where $s_j$ is a positive integer greater than 1.

Assumption 3 $l_j \to \infty$, with $l_j^3/n \to 0$ when the $g_i$'s are the power series and $l_j^2/n \to 0$ when the $g_i$'s are the trigonometric series or splines, as $n \to \infty$.

Assumption 4 Let $q_c(u) = \log c(u)$. For nonnegative integers $r_j$, $j = 1, \ldots, d$, define $q_c^{(r)}(u) = \partial^r q_c(u)/\partial u_1^{r_1}\cdots\partial u_d^{r_d}$, where $r = \sum_{j=1}^d r_j$. $q_c^{(r-1)}(u)$ is absolutely continuous and $\int (q_c^{(r)}(u))^2 du < \infty$ for $r > d$.

Assumption 5 $\prod_{j=1}^d m_j \to \infty$, with $\prod_{j=1}^d m_j^3/n \to 0$ when the $g_i$'s are the power series and $\prod_{j=1}^d m_j^2/n \to 0$ when the $g_i$'s are the trigonometric series or splines, as $n \to \infty$.

The convergence rate of the marginal densities is given by Barron and Sheu (1991). To ease reference and for the sake of completeness, it is stated below.
Theorem 3 Under Assumptions 1, 2 and 3, the estimated marginal density $\hat f_j$, $j = 1, \ldots, d$, given by (12) converges to $f_j$ in terms of the KLIC such that
$$D(f_j\|\hat f_j) = O_p\big(l_j^{-2s_j} + l_j/n\big),$$
as $n \to \infty$.

Next we derive the convergence rate of the ESE of a copula density. Suppose for now that the marginal distributions are known, such that $u_{j,t}(X_{j,t}) = F_j(X_{j,t})$. We can then establish the convergence rate of the copula density estimator $\hat c(u)$ based on the $u_{j,t}$'s:

Theorem 4 Under Assumptions 1, 4 and 5, the copula density estimator given by (13) converges to the copula density $c$ in terms of the KLIC at rate
$$D(c\|\hat c) = O_p\Big(\prod_{j=1}^d m_j^{-2r_j} + \prod_{j=1}^d m_j/n\Big),$$
as $n \to \infty$.

Lastly, since the $u_{j,t}$'s are unknown, they need to be replaced by their corresponding estimates. Let $\hat u_{j,t}(X_{j,t}) = \hat F_j(X_{j,t})$, $j = 1, \ldots, d$, $t = 1, \ldots, n$. Denote the ESE copula density estimator based on $\hat u$ by $\hat c(\hat u)$. Our joint density estimator is then given by
$$\hat f(x) = \prod_{j=1}^d \hat f_j(x_j)\,\hat c(\hat u), \qquad (14)$$
where $\hat u = \{\hat u_1, \ldots, \hat u_d\}$ with $\hat u_j = \hat F_j(x_j)$ for $j = 1, \ldots, d$. Combining the results of Theorems 3 and 4 via the decomposition given in Theorem 2, we obtain the following result for our transformation-based density estimator:

Theorem 5 Under Assumptions 1, 2, 3, 4 and 5, the joint density estimator $\hat f$ given by (14) converges to $f$ in terms of the KLIC with rate
$$D(f\|\hat f) = O_p\Big(\sum_{j=1}^d \big\{l_j^{-2s_j} + l_j/n\big\} + \prod_{j=1}^d m_j^{-2r_j} + \prod_{j=1}^d m_j/n\Big),$$
as $n \to \infty$.

Remark 7 Replacing $u$ by $\hat u$ in the estimation of the empirical copula function does not affect the final convergence rate because $\hat u$, as the empirical CDF, has an $n^{-1/2}$ convergence rate, which is faster than any nonparametric rate.

Remark 8 The optimal convergence rates for the marginal densities are obtained if we set $l_j = O(n^{1/(2s_j+1)})$, leading to a convergence rate of $O_p(n^{-2s_j/(2s_j+1)})$ in terms of the KLIC for each $j = 1, \ldots, d$.
Similarly, the optimal convergence rate for the copula density is $O_p(n^{-2r/(2r+d)})$ if we set $m_j = O(n^{1/(2r+d)})$ for each $j$. It follows that the optimal convergence rate of the joint density is given by $O_p\big(\sum_{j=1}^d n^{-2s_j/(2s_j+1)} + n^{-2r/(2r+d)}\big) = O_p\big(\max_j(n^{-2s_j/(2s_j+1)}) + n^{-2r/(2r+d)}\big)$. Thus the best possible rate of convergence is either the slowest convergence rate among the marginal densities or that of the copula density. Usually the convergence rates of multivariate density estimations are slower than those of univariate cases. In our case, the convergence rate of the copula density is the binding rate unless $s_j < r/d$ for at least one $j \in \{1, \ldots, d\}$; namely, unless the degree of smoothness of at least one marginal density is especially low relative to that of the copula density.

3.3 Model specification

The selection of smoothing parameters plays a crucial role in nonparametric estimation. In series estimation, the number of basis functions is the smoothing parameter. The asymptotic analysis presented above is invariant to the choice of basis function space; on the other hand, it does not provide guidance on the actual selection of basis functions within a given basis function space. In this subsection, we present a practical strategy of model specification.

One manifestation of the curse of dimensionality in nonparametric estimation is that the number of nuisance parameters increases rapidly with the dimension of the problem. For instance, consider the $d$-dimensional multi-index set $\mathcal{M} = \{i : 0 < |i| \le m\}$, where $m$ is a positive integer. Denote the number of elements of the set $\mathcal{M}$ by $\#(\mathcal{M})$. One can show that $\#(\mathcal{M}) = \binom{m+d}{d} - 1$. A complete subset selection, which entails $2^{\#(\mathcal{M})}$ estimations, is prohibitively expensive. The computational cost is even more severe for the ESE, which requires numerical integration in multiple dimensions. One practical strategy is to use stepwise algorithms. Denote $\mathcal{G}_m = \{i : |i| = m\}$, $m = 1, \ldots, M$.
We consider a hierarchical basis function selection procedure. Starting with $\mathcal{G}_1$, we keep only basis functions whose coefficients are statistically significant in terms of their t-statistics.² Denote the significant subset of $\mathcal{G}_1$ by $\hat{\mathcal{G}}_1$. We next estimate an ESE using basis functions $\{\hat{\mathcal{G}}_1, \mathcal{G}_2\}$, and eliminate the elements of $\mathcal{G}_2$ that are not statistically significant, leading to a 'degree-two' significant set $\{\hat{\mathcal{G}}_1, \hat{\mathcal{G}}_2\}$. This procedure is repeated for $m = 3, \ldots, M$, yielding a set of significant basis functions $\{\hat{\mathcal{G}}_1, \ldots, \hat{\mathcal{G}}_M\}$ along the way. Denote by $\hat f_m$ an ESE using basis functions $\mathcal{M}_m = \{\hat{\mathcal{G}}_j\}_{j=1}^m$, $m = 1, \ldots, M$. One can then select a preferred model according to an information criterion, such as the AIC or BIC, or the method of cross validation or generalized cross validation.

There exists, however, a practical difficulty associated with the above stepwise selection of basis functions: for $d \ge 2$, the number of elements of $\mathcal{G}_m$ increases rapidly with $m$ (as well as with $d$). One can show that $\#(\mathcal{G}_m) = \binom{m+d-1}{d-1}$. For instance, with $d = 4$, $\#(\mathcal{G}_m) = 4, 10, 20$ and $35$ respectively for $m = 1, \ldots, 4$. Thus the stepwise algorithm entails ever larger candidate sets as the selection proceeds. Incorporating a large number of basis functions indiscriminately is computationally expensive. A more severe consequence is that it may deflate the t-values of informative basis functions, leading to their omission from the significant set.

To tackle this problem, we further propose a refinement of the stepwise algorithm that entails a preliminary selection of significant terms at each stage of the stepwise selection. Let $\hat f_m$ be the preferred model at stage $m$ of the estimation. Denote
$$\bar\mu_{m+1} = \Big\{\frac{1}{n}\sum_{t=1}^n g_i(X_t) : |i| = m+1\Big\},$$
$$\hat\mu_{m+1} = \Big\{\int g_i(x)\hat f_m(x)\,dx : |i| = m+1\Big\}.$$
Recall that the sample moments associated with given basis functions are sufficient statistics of the resultant ESE.
If $\bar\mu_{m+1}$ is well predicted by the 'one-step-ahead' estimates $\hat\mu_{m+1}$, one can argue that the moments associated with the basis functions $\mathcal{G}_{m+1}$ are not informative given the set of moments associated with $\{\hat{\mathcal{G}}_1, \ldots, \hat{\mathcal{G}}_m\}$; consequently, it is not necessary to incorporate $\mathcal{G}_{m+1}$ into the estimation. In practice, it is more likely that some elements of $\mathcal{G}_{m+1}$ are informative given $\{\hat{\mathcal{G}}_1, \ldots, \hat{\mathcal{G}}_m\}$. We call these informative elements of $\mathcal{G}_{m+1}$ its significant subset, denoted by $\tilde{\mathcal{G}}_{m+1}$. Let $\rho_{m+1}$ be the correlation between $\bar\mu_{m+1}$ and $\hat\mu_{m+1}$. We estimate the size of $\tilde{\mathcal{G}}_{m+1}$ according to
$$\#(\tilde{\mathcal{G}}_{m+1}) = \big\lceil\sqrt{1-\rho_{m+1}} \times \#(\mathcal{G}_{m+1})\big\rceil, \qquad (15)$$
where $\lceil a \rceil$ denotes the smallest integer greater than or equal to $a$.

After calculating $\#(\tilde{\mathcal{G}}_{m+1})$, we need to select the members of $\tilde{\mathcal{G}}_{m+1}$ from $\mathcal{G}_{m+1}$. For this purpose, we employ the method of subset selection. This method identifies a significant subset of a vector of variables by selecting subsets that are optimal for a given criterion measuring how well each subset approximates the whole set (see, e.g., McCabe, 1984, 1986; Cadima and Jolliffe, 2001; Cadima et al., 2004). In particular, we adopt the RM criterion, which measures the correlation between an $n \times p$ matrix $Z$ and its orthogonal projection onto an $n \times q$ submatrix $Q$, where $q \le p$. This matrix correlation is defined as
$$RM(Z, P_Q Z) = \mathrm{cor}(Z, P_Q Z) = \sqrt{\frac{\mathrm{trace}(Z^T P_Q Z)}{\mathrm{trace}(Z^T Z)}}, \qquad (16)$$
where $P_Q$ is the linear orthogonal projection matrix onto $Q$.³ Given $\#(\tilde{\mathcal{G}}_{m+1})$, we then select the preliminary significant subset $\tilde{\mathcal{G}}_{m+1}$ as the one that maximizes the RM criterion given in (16). This procedure is fast since it involves only linear operations (see Cadima et al., 2004 for details on implementing this method).

² See Delaigle et al. (2011) on the robustness and accuracy of methods for high dimensional data analysis based on the t-statistic.

We conclude this section with a step-by-step description of the proposed model specification procedure.
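The RM criterion in (16) involves only linear algebra. A small sketch of an exhaustive search over $q$-column subsets, feasible for the modest candidate sets arising at each stage (function names are ours, not the paper's):

```python
import numpy as np
from itertools import combinations

def rm_criterion(Z, cols):
    """RM correlation (eq. (16)) between Z and its orthogonal
    projection onto the columns of Z indexed by cols."""
    Q = Z[:, list(cols)]
    PQ = Q @ np.linalg.pinv(Q.T @ Q) @ Q.T       # projection matrix P_Q
    return np.sqrt(np.trace(Z.T @ PQ @ Z) / np.trace(Z.T @ Z))

def best_subset(Z, q):
    """q-column subset of Z maximizing the RM criterion."""
    p = Z.shape[1]
    return max(combinations(range(p), q),
               key=lambda c: rm_criterion(Z, c))

rng = np.random.default_rng(3)
Z = rng.normal(size=(200, 5))
Z[:, 4] = Z[:, 0] + 0.01 * rng.normal(size=200)  # column 4 nearly duplicates 0
subset = best_subset(Z, 4)
```

Because columns 0 and 4 are nearly collinear, the selected subset retains the three independent columns and only one member of the redundant pair, which is exactly the behavior the preliminary selection exploits.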
1. $j = 1$: Fit an ESE using $\mathcal{G}_1$; denote by $\hat{\mathcal{G}}_1$ the subset of $\mathcal{G}_1$ with significant t-values; fit an ESE using $\hat{\mathcal{G}}_1$ and denote the resultant estimate by $\hat f_1$.

2. $j = 2$:
(a) Estimate the size of the preliminary significant subset $\#(\tilde{\mathcal{G}}_2)$ according to (15);
(b) Select the preliminary significant subset $\tilde{\mathcal{G}}_2$ of $\mathcal{G}_2$ according to (16);
(c) Estimate an ESE using $\{\hat{\mathcal{G}}_1, \tilde{\mathcal{G}}_2\}$; denote the significant subset of $\tilde{\mathcal{G}}_2$ by $\hat{\mathcal{G}}_2$;
(d) Estimate an ESE using $\{\hat{\mathcal{G}}_1, \hat{\mathcal{G}}_2\}$; denote the resultant estimate by $\hat f_2$.

3. Repeat step 2 for $j = 3, \ldots, M$ to obtain $\hat f_3, \ldots, \hat f_M$ respectively.

4. Select the best model among $\{\hat f_1, \ldots, \hat f_M\}$ according to a given criterion such as the AIC, BIC, cross validation or generalized cross validation.

³ To deal with the curse of dimensionality, a typical approach is principal component analysis (PCA). However, dimensionality reduction via PCA still involves all of the moments, which may worsen estimation performance in our case since the trivial moments introduce extra noise. The subset selection method is closely related to PCA; there exists, however, one key difference: principal components are linear combinations of all elements of a candidate set, while the subset selection method selects only the most informative elements.

4 Monte Carlo Simulation

In this section, we conduct Monte Carlo simulations to investigate the finite sample performance of the proposed estimator. We consider bivariate and trivariate distributions constructed as mixtures of normal distributions. In particular, following Wand and Jones (1993), we construct multivariate normal mixture distributions characterized as uncorrelated normal, correlated normal, skewed, kurtotic, bimodal I and bimodal II respectively.⁴ For the sake of comparison, we also evaluate the performance of a direct estimator of multivariate densities using the KDE.
For the transformation-based estimator, the marginal densities and distributions are estimated by the KDE, and the copula densities are estimated via the ESE, for which the highest degree of basis functions is set to four (i.e., $M = 4$). We use polynomial basis functions, and the actual selection of basis functions is conducted according to the stepwise algorithm described in the previous section. The best model is then selected according to the AIC. For all KDEs (the marginal densities and distributions in the two-step estimation, and the direct estimation), the bandwidths are selected by least squares cross validation.

Our first example concerns the estimation of density functions. The sample sizes are 100, 200 and 500, and each experiment is repeated 300 times. The performance is gauged by the integrated squared error (ISE) evaluated on $[-3,3]^d$, $d = 2$ and $3$, with an increment of 0.15 in each dimension. The results are reported in Table 1. It is seen that in all experiments the ESE outperforms the KDE. For the bivariate cases reported in the top panel, the average ratios of the ISE between the ESE and the KDE across all six distributions are 78%, 75% and 54% respectively when the sample sizes are 100, 200 and 500. The results for the trivariate cases are reported in the bottom panel; the general pattern of performance remains the same, with corresponding average ISE ratios between the ESE and the KDE of 69%, 59% and 52%. Overall, the average ISE ratios between the ESE and the KDE decrease with the sample size for a given dimension, and with the dimension for a given sample size, indicating the superior finite sample performance of the transformation-based estimator.

⁴ Details on the distributions investigated in the simulations are given in the Appendix.
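The ISE scoring used above is straightforward to reproduce in spirit. This sketch evaluates a plain kernel estimate of a standard bivariate normal on the same $[-3,3]^2$ grid with increment 0.15 (the off-the-shelf KDE bandwidth and the seed are our illustrative choices, not the paper's exact setup):

```python
import numpy as np
from scipy.stats import gaussian_kde, multivariate_normal

rng = np.random.default_rng(4)
x = rng.multivariate_normal([0, 0], np.eye(2), size=200)
kde = gaussian_kde(x.T)                        # direct bivariate KDE

# evaluation grid on [-3, 3]^2 with increment 0.15, as in the simulations
g = np.arange(-3, 3 + 1e-9, 0.15)
X1, X2 = np.meshgrid(g, g)
pts = np.vstack([X1.ravel(), X2.ravel()])

truth = multivariate_normal([0, 0], np.eye(2)).pdf(pts.T)
ise = np.sum((kde(pts) - truth) ** 2) * 0.15**2   # Riemann-sum ISE
```

Averaging `ise` over repeated draws gives the kind of Monte Carlo figure reported in Table 1.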
Table 1: ISE of joint density estimations (KDE: kernel estimator; ESE: exponential series estimator; standard errors in parentheses)

                          N=100                 N=200                 N=500
                       KDE      ESE          KDE      ESE          KDE      ESE
d=2
uncorrelated normal   0.0103   0.0086       0.0064   0.0052       0.0035   0.0023
                     (0.0047) (0.0075)     (0.0026) (0.0050)     (0.0012) (0.0019)
correlated normal     0.0094   0.0081       0.0058   0.0048       0.0034   0.0028
                     (0.0035) (0.0074)     (0.0017) (0.0033)     (0.0008) (0.0011)
skewed                0.0223   0.0141       0.0190   0.0104       0.0146   0.0072
                     (0.0061) (0.0082)     (0.0043) (0.0046)     (0.0029) (0.0025)
kurtotic              0.0202   0.0153       0.0152   0.0129       0.0099   0.0010
                     (0.0052) (0.0037)     (0.0042) (0.0023)     (0.0026) (0.0010)
bimodal I             0.0075   0.0065       0.0049   0.0038       0.0028   0.0018
                     (0.0029) (0.0045)     (0.0017) (0.0024)     (0.0009) (0.0010)
bimodal II            0.0142   0.0107       0.0095   0.0065       0.0054   0.0029
                     (0.0050) (0.0080)     (0.0027) (0.0044)     (0.0014) (0.0015)
d=3
uncorrelated normal   0.0078   0.0058       0.0053   0.0028       0.0033   0.0013
                     (0.0024) (0.0041)     (0.0015) (0.0019)     (0.0008) (0.0007)
correlated normal     0.0058   0.0041       0.0040   0.0024       0.0026   0.0015
                     (0.0014) (0.0031)     (0.0008) (0.0011)     (0.0005) (0.0005)
skewed                0.0132   0.0090       0.0098   0.0053       0.0067   0.0027
                     (0.0032) (0.0064)     (0.0022) (0.0021)     (0.0013) (0.0008)
kurtotic              0.0218   0.0160       0.0192   0.0148       0.0146   0.0136
                     (0.0027) (0.0025)     (0.0022) (0.0015)     (0.0017) (0.0010)
bimodal I             0.0059   0.0042       0.0040   0.0023       0.0026   0.0011
                     (0.0015) (0.0031)     (0.0008) (0.0014)     (0.0005) (0.0005)
bimodal II            0.0072   0.0042       0.0051   0.0026       0.0033   0.0012
                     (0.0018) (0.0025)     (0.0011) (0.0014)     (0.0006) (0.0005)

Our second experiment concerns the estimation of tail probabilities of multivariate densities, which is of fundamental importance in many areas of economics, especially financial economics. In particular, we are interested in the estimation of joint tail probabilities of multivariate distributions. The tail index of a distribution is given by
$$T = \int_{(-\infty, q_1]\times\cdots\times(-\infty, q_d]} f(x)\,dx,$$
where $q_j \equiv F_j^{-1}(\alpha)$, $j = 1, \ldots, d$, and $\alpha$ is a small positive number close to zero.
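The tail index $T$ can also be estimated by simple empirical counting: the fraction of observations falling at or below all of their marginal $\alpha$-quantiles at once. A sketch on simulated data (names and seeds are ours):

```python
import numpy as np

def empirical_tail_index(x, alpha):
    """Fraction of observations with every coordinate at or below
    its marginal alpha-quantile q_j = F_j^{-1}(alpha)."""
    q = np.quantile(x, alpha, axis=0)
    return np.mean(np.all(x <= q, axis=1))

rng = np.random.default_rng(5)
# strong positive dependence pushes the joint tail well above alpha**2
dep = rng.multivariate_normal([0, 0], [[1, 0.9], [0.9, 1]], size=5000)
ind = rng.normal(size=(5000, 2))
t_dep = empirical_tail_index(dep, 0.05)
t_ind = empirical_tail_index(ind, 0.05)
```

Under independence the joint tail probability is near $\alpha^d$; positive dependence inflates it, which is precisely what a copula estimate with proper boundary behavior must capture.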
The same set of distributions investigated in the first experiment is used in the second experiment. The sample size is 100, and each experiment is repeated 500 times. The 5% and 10% lower tails (i.e., $\alpha = 5\%$ and $10\%$) are considered. The transformation-based estimators and the KDEs are computed in the same manner as in the first experiment. Denote the estimated densities via the ESE and KDE by $\hat f_{ESE}$ and $\hat f_{KDE}$ respectively. The corresponding estimated tail indices are given by
$$\hat T_{ESE} = \int_{(-\infty, q_1]\times\cdots\times(-\infty, q_d]} \hat f_{ESE}(x)\,dx,$$
$$\hat T_{KDE} = \int_{(-\infty, q_1]\times\cdots\times(-\infty, q_d]} \hat f_{KDE}(x)\,dx.$$
In addition, we also consider an estimator based on the empirical distribution:
$$\hat T_{EM} = \frac{1}{n}\sum_{t=1}^n \Big\{\prod_{j=1}^d I(X_{t,j} \le q_j)\Big\}.$$
The performance of the tail index estimators is measured by the mean squared error (MSE), the average squared difference between the estimated and the true tail index. The results are reported in Table 2. It is seen that the overall performance of the ESE is substantially better than that of the other two estimators. For $d=2$, the average ratio of the MSE between the ESE and the KDE across the six distributions is 37% at the 5% marginal quantile and 53% at the 10% marginal quantile; for $d=3$, the corresponding average MSE ratios are 25% and 47%. The average MSE ratio between the ESE and the KDE thus improves with the dimensionality of the sample space. In addition, the variation of the estimated tail index based on the ESE is considerably smaller than that of the other two.

5 Empirical Application

In this section, we provide an illustrative example of multivariate analysis using the transformation-based density estimation. We also demonstrate that the copula density, a by-product of the two-step density estimator, provides useful insight into multivariate dependence structures that cannot be easily detected by examining the joint densities. We study the joint density of the U.S. and U.K.
financial market returns under various conditions of the Asian financial markets. Since the 1990s, the Asian financial markets have played an increasingly important role in the global financial system. The purpose of our investigation is to examine how the Asian markets influence the western markets. The scale and scope of global financial contagion have been under close scrutiny, especially since the 1997-1998 Asian financial crisis.

Table 2: MSE of tail probability estimation (EM: empirical estimator; KDE: kernel estimator; ESE: exponential series estimator; standard errors in parentheses)

                              α = 5%                          α = 10%
                       EM       KDE      ESE           EM       KDE      ESE
d=2
uncorrelated normal   1.0320   1.7570   0.2949        2.6780   4.6047   1.1665
                     (1.7724) (2.3339) (0.4981)      (4.0155) (5.7700) (1.9452)
correlated normal     1.8451   1.3867   0.9333        4.2959   3.2371   2.7643
                     (2.8861) (2.4799) (0.8883)      (5.9669) (4.5019) (3.0975)
skewed                1.0076   0.7697   0.3493        3.0475   2.7579   1.8587
                     (1.6747) (1.2471) (0.4207)      (4.2814) (4.0362) (2.1085)
kurtotic              1.3930   1.1677   0.6777        2.9170   2.5453   2.3564
                     (2.4307) (2.0369) (0.6613)      (4.3901) (4.2056) (2.1718)
bimodal I             0.2045   0.3210   0.0483        1.0200   1.5901   0.3142
                     (0.3951) (0.5513) (0.0904)      (1.7069) (1.9928) (0.5367)
bimodal II            0.2095   0.3652   0.0602        0.9720   1.5850   0.3962
                     (0.4327) (0.5585) (0.0910)      (1.4723) (1.9710) (0.7425)
d=3
uncorrelated normal   0.1331   0.1352   0.0103        0.4535   0.6440   0.0805
                     (0.4129) (0.2539) (0.0218)      (0.8620) (0.8642) (0.1819)
correlated normal     0.5059   0.2822   0.0971        1.7460   0.9812   0.5602
                     (1.2809) (0.6333) (0.0616)      (2.5603) (1.7711) (0.5968)
skewed                0.1829   0.1111   0.0296        0.8814   0.5762   0.4666
                     (0.3992) (0.2475) (0.0134)      (1.2628) (1.0695) (0.2811)
kurtotic              0.5730   0.3607   0.2699        1.8735   1.2931   1.4767
                     (0.9976) (0.6043) (0.0855)      (2.8469) (2.1702) (0.9358)
bimodal I             0.0119   0.0080   0.0003        0.0996   0.0909   0.0076
                     (0.1063) (0.0263) (0.0007)      (0.3228) (0.1711) (0.0134)
bimodal II            0.0080   0.0108   0.0004        0.1172   0.0995   0.0080
                     (0.0869) (0.0371) (0.0008)      (0.3940) (0.2023) (0.0177)
Therefore we are particularly interested in examining the general pattern of the western markets under extreme Asian market conditions. In our investigation, we focus on the monthly stock return indices of the S&P 500 (US), FTSE 100 (UK), Hang Seng (HK) and Nikkei 225 (JP) markets. Instead of assuming a specific parametric model, we opt to investigate their relations in a fully nonparametric manner. We first estimate the joint distribution of the four markets, and then calculate the joint conditional distribution of the US and UK markets under various conditions of the HK and JP markets.

Our data consist of the monthly indices of the four markets between February 1978 and May 2006. For each market, we calculate the rate of return by $Y_t = \ln P_t - \ln P_{t-1}$. Following standard practice, we apply a GARCH(1,1) model to each series and base our investigation on the standardized residuals. The transformation-based density estimator is used in our analysis: the marginal densities and distributions are estimated by the kernel method, and the copula density is estimated by the ESE. Denote the returns for the US, UK, HK and JP markets by $Y_j$, $j = 1, \ldots, 4$, respectively. After obtaining the joint density, we calculate the joint density of the US and UK markets conditional on the HK and JP markets:
$$\hat f\big(y_1, y_2 \,\big|\, (y_3, y_4) \in \Delta\big),$$
where $\Delta$ refers to a given region of the HK and JP distribution. In particular, we consider three scenarios of the Asian markets:
$$\Delta_L = \{(y_3, y_4) : F_j^{-1}(0) \le y_j \le F_j^{-1}(15\%),\ j = 3, 4\},$$
$$\Delta_M = \{(y_3, y_4) : F_j^{-1}(40\%) \le y_j \le F_j^{-1}(60\%),\ j = 3, 4\},$$
$$\Delta_H = \{(y_3, y_4) : F_j^{-1}(85\%) \le y_j \le F_j^{-1}(100\%),\ j = 3, 4\},$$
where $F_j$, $j = 3, 4$, are the marginal distributions of the HK and JP markets. Thus our analysis focuses on the joint distributions of the US and UK markets when the Asian markets are in the low, middle and high regions of their distributions respectively.
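The conditioning step can be mimicked directly on data: keep the observations whose HK and JP coordinates fall in $\Delta$ and estimate the US-UK density from the retained rows. A Monte Carlo sketch with an equicorrelated Gaussian stand-in for the four markets (the correlation matrix and the plain KDE are our illustrative assumptions, not the paper's two-stage estimator):

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(6)
S = 0.5 * np.eye(4) + 0.5 * np.ones((4, 4))      # equicorrelated stand-in
y = rng.multivariate_normal(np.zeros(4), S, size=2000)  # cols: US, UK, HK, JP

# Delta_L: HK and JP both in the lower 15% of their marginal distributions
q3 = np.quantile(y[:, 2], 0.15)
q4 = np.quantile(y[:, 3], 0.15)
low = (y[:, 2] <= q3) & (y[:, 3] <= q4)

f_cond = gaussian_kde(y[low, :2].T)     # f(y1, y2 | (y3, y4) in Delta_L)
f_uncond = gaussian_kde(y[:, :2].T)
```

With positive cross-market dependence, the conditional US-UK mass sits visibly to the lower left of the unconditional one, the same qualitative shift described for Figure 1.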
The conditional copula density $\hat c(u_1, u_2 \,|\, (y_3, y_4) \in \Delta)$ of the US and UK markets is obtained in a similar manner.

Figure 1 reports the estimated conditional joint densities of the US and UK markets under various conditions of the Asian markets. For comparison, the unconditional joint density is also reported. The positive dependence between the two markets is evident under all market conditions. There is little difference between the unconditional density and the conditional one when the Asian markets are in the middle region. When the Asian markets are low, the bulk of the US and UK distribution moves visibly toward the lower left corner; in contrast, when the Asian markets are high, the entire distribution shifts toward the upper right corner. The overall pictures of the joint densities are consistent with the general consensus that the western markets are influenced by fluctuations in the Asian markets and tend to move in similar directions.

Figure 1: Estimated US and UK joint densities (Top left: unconditional; Top right: Asian market low; Bottom left: Asian market middle; Bottom right: Asian market high)

Next we report in Figure 2 the conditional copula densities of the US and UK markets under various Asian market conditions and show that additional insight can be obtained by examining the copula densities.
The top left panel reports the unconditional copula density, which has a saddle shape with a ridge along the diagonal. It is seen that the positive dependence between the US and UK markets is largely driven by the co-movements of their tails, and the relationship appears to be symmetric. The bottom left panel shows the conditional copula when the Asian markets are in the middle region. The saddle shape is still visible, with an elevated peak at the upper right corner. This result suggests that when there is little action in the Asian markets, the US and UK markets are more likely to perform above average simultaneously. The top right panel reports the conditional copula when the Asian markets are low. The copula density clearly peaks at the lower left corner, indicating that the US and UK markets have a high joint probability of underperformance. In contrast, as indicated by the bottom right panel, when the Asian markets are high, the US and UK markets have a high joint probability of above-average performance. In addition, what is not clearly visible from the figures is that the peak of the copula density when the Asian markets are high is considerably higher than that when the Asian markets are low. Therefore, the copula densities suggest that although the US and UK markets tend to move together with the Asian markets, the dependence relation between the western and Asian markets is not symmetric: the relation is stronger when the Asian markets are high. In this sense, the western markets are somewhat resilient against extremely bad Asian markets.
Figure 2: Estimated US and UK copula densities (Top left: unconditional; Top right: Asian market low; Bottom left: Asian market middle; Bottom right: Asian market high)

The asymmetric relation between the western and Asian markets revealed in our analysis of the copula densities prompts us to look into this issue more carefully. Below we calculate some dependence indices that can be obtained readily from the estimated copula densities. The first is Kendall's $\tau$, a rank-based dependence index, which can be calculated from a copula distribution as follows:
$$\tau = 4\int_{[0,1]^2} C(u, v)\,dC(u, v) - 1.$$
Although Kendall's $\tau$ offers some advantages over the linear correlation coefficient, it does not directly capture the dependence structure at the tails of a distribution, which is of critical importance in financial economics; nor does it discriminate between symmetric and asymmetric dependence. Therefore we also explore the tail dependence structure based on the tail dependence coefficients (TDC). The upper and lower TDC between two random variables $X$ and $Y$ are given by
$$UTD = \lim_{\alpha\to 1^-} \Pr\big[X > F_X^{-1}(\alpha)\,\big|\,Y > F_Y^{-1}(\alpha)\big] = \lim_{\alpha\to 1^-} \frac{1 - 2\alpha + C(\alpha, \alpha)}{1 - \alpha}$$
and
$$LTD = \lim_{\alpha\to 0^+} \Pr\big[X < F_X^{-1}(\alpha)\,\big|\,Y < F_Y^{-1}(\alpha)\big] = \lim_{\alpha\to 0^+} \frac{C(\alpha, \alpha)}{\alpha},$$
provided that these limits exist and fall into $[0, 1]$.
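Both indices are easy to approximate from data: Kendall's $\tau$ directly, and the TDC limits by evaluating the conditional tail probabilities at a small finite $\alpha$. A sketch (the Gaussian sample is illustrative; for a Gaussian copula the population value is $\tau = (2/\pi)\arcsin\rho$):

```python
import numpy as np
from scipy.stats import kendalltau

rng = np.random.default_rng(7)
z = rng.multivariate_normal([0, 0], [[1, 0.7], [0.7, 1]], size=4000)
x, y = z[:, 0], z[:, 1]

tau, _ = kendalltau(x, y)   # rank-based dependence; here near (2/pi)*arcsin(0.7)

# finite-alpha approximations of the LTD/UTD limits
alpha = 0.05
qx, qy = np.quantile(x, alpha), np.quantile(y, alpha)
ltd = np.mean((x < qx) & (y < qy)) / alpha
qx, qy = np.quantile(x, 1 - alpha), np.quantile(y, 1 - alpha)
utd = np.mean((x > qx) & (y > qy)) / alpha
```

A Gaussian copula has zero tail dependence in the limit, so these finite-$\alpha$ values shrink toward zero as $\alpha$ does; the slow decay is one reason flexible copula estimation matters for tail analysis.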
We report in Table 3 the estimated Kendall's $\tau$ and tail dependence indices of the US market relative to the UK market, given various conditions of the Asian markets.⁵ Kendall's $\tau$ is higher when the Asian markets are in the middle than in the tails. What is particularly interesting is the comparison between the lower tail dependence index when the Asian markets are low and the upper tail dependence index when the Asian markets are high. The estimated numbers are respectively 0.1057 and 0.2248 when $\alpha = 3\%$, and 0.1746 and 0.3511 when $\alpha = 5\%$. These results confirm our visual inspection of the copula densities: the dependence is stronger when the Asian markets are high. Therefore global financial contagions originating from the Asian markets are weaker when the Asian markets are low than when they are high.

Table 3: Conditional dependence measures between US and UK under different Asian market conditions

                               α = 3%               α = 5%
Asian Markets       τ        LTD      UTD        LTD      UTD
0-15%            0.2917    0.1057   0.0111     0.1746   0.0190
40-60%           0.3324    0.0499   0.0479     0.0834   0.0781
85-100%          0.2928    0.0149   0.2248     0.0249   0.3511

⁵ Similar patterns are observed for the tail dependence index of the UK market relative to the US market.

6 Concluding remarks

Our numerical experiments and empirical examples demonstrate the usefulness of the proposed method, and show that valuable insight can be obtained from the estimated copula density, a by-product of our transformation-based estimator. We expect that the proposed method will find many useful applications in multivariate analysis, especially in financial economics. We consider only the iid case in this paper. Although it is beyond the scope of the current paper, extensions to accommodate dependent time series will be an interesting subject for future studies.

References

[1] Barron, A.R. and C.H. Sheu, 1991, Approximation of Density Functions by Sequences of Exponential Families, Annals of Statistics, 19, 1347–1369.
[2] Cadima, J., J. O. Cerdeira, and M. Minhoto, 2004, Computational Aspects of Algorithms for Variable Selection in the Context of Principal Components, Computational Statistics and Data Analysis, 47, 225–236.
[3] Cadima, J. and I. T. Jolliffe, 2001, Variable Selection and the Interpretation of Principal Subspaces, Journal of Agricultural, Biological, and Environmental Statistics, 6, 62–79.
[4] Chen, X. and Y. Fan, 2006a, Estimation of Copula-based Semiparametric Time Series Models, Journal of Econometrics, 130, 307–335.
[5] Chen, X. and Y. Fan, 2006b, Estimation and Model Selection of Semiparametric Copula-based Multivariate Dynamic Models under Copula Misspecification, Journal of Econometrics, 135, 125–154.
[6] Chen, X., Y. Fan and V. Tsyrennikov, 2006, Efficient Estimation of Semiparametric Multivariate Copula Models, Journal of the American Statistical Association, 101, 1228–1240.
[7] Chui, C. and X. Wu, 2009, Exponential Series Estimation of Empirical Copulas with Application to Financial Returns, Advances in Econometrics, 25, 263–290.
[8] Csiszár, I., 1975, I-divergence Geometry of Probability Distributions and Minimization Problems, Annals of Probability, 3, 146–158.
[9] Crain, B. R., 1974, Estimation of Distributions Using Orthogonal Expansion, Annals of Statistics, 2, 454–463.
[10] Delaigle, A., P. Hall and J. Jin, 2011, Robustness and Accuracy of Methods for High Dimensional Data Analysis Based on Student's t-statistic, Journal of the Royal Statistical Society, Series B, 73, 283–301.
[11] Good, I. J., 1963, Maximum Entropy for Hypothesis Formulation, Especially for Multidimensional Contingency Tables, Annals of Mathematical Statistics, 34, 911–934.
[12] Hall, P. and N. Neumeyer, 2006, Estimating a Bivariate Density When There Are Extra Data on One or Both Components, Biometrika, 93, 439–450.
[13] Jaynes, E. T., 1957, Information Theory and Statistical Mechanics, Physical Review, 106, 620–630.
[14] Kooperberg, C. and C. J. Stone, 1991, A Study of Logspline Density Estimation, Computational Statistics and Data Analysis, 12, 327–347.
[15] Li, Q. and J. Racine, 2007, Nonparametric Econometrics: Theory and Practice, Princeton University Press, New Jersey.
[16] McCabe, G. P., 1984, Principal Variables, Technometrics, 26, 137–144.
[17] McCabe, G. P., 1986, Prediction of Principal Components by Variable Subsets, Technical Report 86-19, Department of Statistics, Purdue University.
[18] Nelsen, R. B., 2006, An Introduction to Copulas, 2nd Edition, Springer-Verlag, New York.
[19] Neyman, J., 1937, "Smooth" Test for Goodness of Fit, Skandinavisk Aktuarietidskrift, 20, 149–199.
[20] Patton, A. J., 2006, Estimation of Multivariate Models for Time Series of Possibly Different Lengths, Journal of Applied Econometrics, 21, 147–173.
[21] Ruppert, D. and D. B. H. Cline, 1994, Bias Reduction in Kernel Density Estimation by Smoothed Empirical Transformations, Annals of Statistics, 22, 185–210.
[22] Sklar, A., 1959, Fonctions de répartition à n dimensions et leurs marges, Publ. Inst. Statist. Univ. Paris, 8, 229–231.
[23] Stone, C. J., 1982, Optimal Global Rates of Convergence for Nonparametric Regression, Annals of Statistics, 10, 1040–1053.
[24] Stone, C. J., 1990, Large-sample Inference for Log-spline Models, Annals of Statistics, 18, 717–741.
[25] Wand, M. P., J. S. Marron, and D. Ruppert, 1991, Transformations in Density Estimation, Journal of the American Statistical Association, 86, 343–361.
[26] Wand, M. P. and M. C. Jones, 1993, Comparison of Smoothing Parameterizations in Bivariate Kernel Density Estimation, Journal of the American Statistical Association, 88, 520–528.
[27] Wu, X., 2003, Calculation of Maximum Entropy Densities with Application to Income Distribution, Journal of Econometrics, 115, 347–354.
[28] Wu, X., 2010, Exponential Series Estimator of Multivariate Density, Journal of Econometrics, 156, 354–366.
[29] Yang, L. and J. S.
Marron, 1999, Iterated Transformation-kernel Density Estimation, Journal of the American Statistical Association, 94, 580–589.
[30] Zellner, A. and R. A. Highfield, 1988, Calculation of Maximum Entropy Distribution and Approximation of Marginal Posterior Distributions, Journal of Econometrics, 37, 195–209.

Appendix A: Technical Proofs

Proof of Theorem 1. For simplicity, below we only present the proof for the case of $d = 2$. Extensions to the more general $d > 2$ cases follow immediately. Given $f(x) = f_1(x_1)f_2(x_2)c(F_1(x_1), F_2(x_2))$, through the copula decomposition, we have
$$\begin{aligned}
H(f) &= \int f(x_1,x_2)\log\left\{f_1(x_1)f_2(x_2)c(F_1(x_1),F_2(x_2))\right\}dx_1dx_2\\
&= \int\left\{\int f(x_1,x_2)\,dx_2\right\}\log f_1(x_1)\,dx_1 + \int\left\{\int f(x_1,x_2)\,dx_1\right\}\log f_2(x_2)\,dx_2\\
&\quad + \int f_1(x_1)f_2(x_2)c(F_1(x_1),F_2(x_2))\log c(F_1(x_1),F_2(x_2))\,dx_1dx_2.
\end{aligned}$$
The first term on the right hand side simplifies to
$$\int f_1(x_1)\log f_1(x_1)\,dx_1 = H(f_1).$$
Similarly, the second term equals $H(f_2)$. By the change of variables $u = F_1(x_1)$ and $v = F_2(x_2)$, the third term can be written as
$$\begin{aligned}
&\int f_1(F_1^{-1}(u))f_2(F_2^{-1}(v))c(u,v)\log c(u,v)\,dF_1^{-1}(u)\,dF_2^{-1}(v)\\
&= \int c(u,v)\log c(u,v)\,du\,dv = H(c),
\end{aligned}$$
which completes the proof.

Proof of Theorem 2. For simplicity, we only present the proof for the case of $d = 2$. Extensions to the more general $d > 2$ cases follow immediately. By definition,
$$\begin{aligned}
D(f\|g) &= \int f(x_1,x_2)\log\frac{f_1(x_1)f_2(x_2)c_f(F_1(x_1),F_2(x_2))}{g_1(x_1)g_2(x_2)c_g(G_1(x_1),G_2(x_2))}\,dx_1dx_2\\
&= \int f(x_1,x_2)\log\frac{f_1(x_1)}{g_1(x_1)}\,dx_1dx_2 + \int f(x_1,x_2)\log\frac{f_2(x_2)}{g_2(x_2)}\,dx_1dx_2\\
&\quad + \int f(x_1,x_2)\log\frac{c_f(F_1(x_1),F_2(x_2))}{c_g(G_1(x_1),G_2(x_2))}\,dx_1dx_2.
\end{aligned}$$
The first term of the above summation can be rewritten as
$$\int\left\{\int f(x_1,x_2)\,dx_2\right\}\log\frac{f_1(x_1)}{g_1(x_1)}\,dx_1 = \int f_1(x_1)\log\frac{f_1(x_1)}{g_1(x_1)}\,dx_1 = D(f_1\|g_1).$$
Similarly, the second term equals $D(f_2\|g_2)$.
Next, through the changes of variables $u = F_1(x_1)$ and $v = F_2(x_2)$, the third term can be written as
$$\begin{aligned}
&\int f_1(x_1)f_2(x_2)c_f(F_1(x_1),F_2(x_2))\log\frac{c_f(F_1(x_1),F_2(x_2))}{c_g(G_1(x_1),G_2(x_2))}\,dx_1dx_2\\
&= \int f_1(F_1^{-1}(u))f_2(F_2^{-1}(v))c_f(u,v)\log\frac{c_f(u,v)}{c_g(G_1(F_1^{-1}(u)),G_2(F_2^{-1}(v)))}\,dF_1^{-1}(u)\,dF_2^{-1}(v)\\
&= \int c_f(u,v)\log\frac{c_f(u,v)}{\tilde c_g(u,v)}\,du\,dv = D(c_f\|\tilde c_g).
\end{aligned}$$
Collecting the three terms then completes the proof.

Proof of Theorem 3. See Theorem 1 of Barron and Sheu (1991).

Proof of Theorem 4. The key to proving the convergence rate of the ESE is the information projection (Csiszár, 1975). The ESE (13), being in the regular exponential family, can be characterized by a set of sufficient statistics $\hat\mu_{\mathcal M} = \{\hat\mu_i = n^{-1}\sum_{t=1}^n g_i(u_t) : i \in \mathcal M\}$. Denote their population counterparts by $\mu_{\mathcal M}$. Let the ESEs associated with $\mu_{\mathcal M}$ and $\hat\mu_{\mathcal M}$ be $\hat c(\cdot;\lambda)$ and $\hat c(\cdot;\hat\lambda)$ respectively. The coefficients of these ESEs are implicitly defined by the moment conditions:
$$\mu_{\mathcal M} = \left\{\int \hat c(u;\lambda)g_i(u)\,du : i \in \mathcal M\right\},\qquad
\hat\mu_{\mathcal M} = \left\{\int \hat c(u;\hat\lambda)g_i(u)\,du : i \in \mathcal M\right\}.$$
We then have
$$D(c(\cdot)\|\hat c(\cdot;\hat\lambda)) = D(c(\cdot)\|\hat c(\cdot;\lambda)) + D(\hat c(\cdot;\lambda)\|\hat c(\cdot;\hat\lambda)), \qquad (A.1)$$
where the equality holds because $\int \hat c(u;\lambda)g_i(u)\,du = \int c(u)g_i(u)\,du$ for all $i \in \mathcal M$. The two components in (A.1) can be viewed as the approximation error and the estimation error respectively. Without loss of generality, suppose that the $g_i$'s are a series of orthonormal bounded basis functions with respect to the Lebesgue measure on $[0,1]^d$. Under Assumption 4, we obtain $\|\log c(\cdot) - \log\hat c(\cdot;\lambda)\|^2 = O(\prod_{j=1}^d m_j^{-2r_j})$ by Lemma A1 of Wu (2010). Next, to establish the convergence result in terms of the copula densities, we also require the boundedness of $\|\log c(\cdot) - \log\hat c(\cdot;\lambda)\|_\infty$. This is established in Lemma A2 of Wu (2010), which gives $\|\log c(\cdot) - \log\hat c(\cdot;\lambda)\|_\infty = O(\prod_{j=1}^d m_j^{-r_j+1})$. Lemma 1 of Barron and Sheu (1991) suggests that $D(p\|q)$ is proportional to a squared norm of $\log(p/q)$.
Using this result and the boundedness of $\|\log c(\cdot) - \log\hat c(\cdot;\lambda)\|_\infty$, we then have
$$D(c(\cdot)\|\hat c(\cdot;\lambda)) = O\Big(\prod_{j=1}^d m_j^{-2r_j}\Big). \qquad (A.2)$$
Next, the second term of (A.1), being a KLIC between two ESEs of the same family, is determined solely by the discrepancy between $\mu_{\mathcal M}$ and $\hat\mu_{\mathcal M}$. Lemma 4 of Barron and Sheu (1991) indicates that $D(\hat c(\cdot;\lambda)\|\hat c(\cdot;\hat\lambda)) = O_p\big(\|\lambda - \hat\lambda\|^2\big)$. In addition, Lemma 5 of Barron and Sheu (1991) shows that $\|\lambda - \hat\lambda\| = O_p(\|\mu_{\mathcal M} - \hat\mu_{\mathcal M}\|)$. It follows that
$$D(\hat c(\cdot;\lambda)\|\hat c(\cdot;\hat\lambda)) = O_p\Big(\prod_{j=1}^d m_j/n\Big). \qquad (A.3)$$
Plugging (A.2) and (A.3) into the KLIC decomposition (A.1) completes the proof.

Proof of Theorem 5. Note that the joint density is estimated by
$$\hat f(x) = \prod_{j=1}^d \hat f_j(x_j)\,\hat c(\hat F_1(x_1),\dots,\hat F_d(x_d)).$$
By Theorems 2 and 3, we can decompose the KLIC between $f$ and $\hat f$ as
$$\begin{aligned}
D(f\|\hat f) &= \sum_{j=1}^d D(f_j\|\hat f_j) + \int c(u)\log\frac{c(u)}{\hat c(\hat F_1(F_1^{-1}(u_1)),\dots,\hat F_d(F_d^{-1}(u_d)))}\,du\\
&= \sum_{j=1}^d O_p(l_j^{-2s_j} + l_j/n) + \int c(u)\log\frac{c(u)}{\hat c(\hat F_1(F_1^{-1}(u_1)),\dots,\hat F_d(F_d^{-1}(u_d)))}\,du. \qquad (A.4)
\end{aligned}$$
The second term on the right hand side of (A.4) can be further decomposed as
$$\begin{aligned}
&\int c(u)\log\frac{c(u)}{\hat c(\hat F_1(F_1^{-1}(u_1)),\dots,\hat F_d(F_d^{-1}(u_d)))}\,du\\
&= \int c(u)\log\frac{c(u)}{\hat c(u)}\,du + \int c(u)\log\frac{\hat c(u)}{\hat c(\hat F_1(F_1^{-1}(u_1)),\dots,\hat F_d(F_d^{-1}(u_d)))}\,du\\
&= \int c(u)\log\frac{c(u)}{\hat c(u)}\,du + \int \hat c(u)\log\frac{\hat c(u)}{\hat c(\hat F_1(F_1^{-1}(u_1)),\dots,\hat F_d(F_d^{-1}(u_d)))}\,du\\
&\quad + \int \big(c(u) - \hat c(u)\big)\log\frac{\hat c(u)}{\hat c(\hat F_1(F_1^{-1}(u_1)),\dots,\hat F_d(F_d^{-1}(u_d)))}\,du. \qquad (A.5)
\end{aligned}$$
Since the $\hat F_j$'s are the empirical counterparts of the $F_j$'s, they converge to the marginal distributions at the rate $n^{-1/2}$. By Theorem 4, the first term of (A.5) is of order $O_p(\prod_{j=1}^d m_j^{-2r_j} + \prod_{j=1}^d m_j/n)$. Next note that the second term is the KLIC between two densities within the same regular exponential family, implying that only estimation errors remain. Thus the second term is of order $O_p(\prod_{j=1}^d m_j/n)$. It follows immediately that
$$\begin{aligned}
&\int c(u)\log\frac{c(u)}{\hat c(\hat F_1(F_1^{-1}(u_1)),\dots,\hat F_d(F_d^{-1}(u_d)))}\,du\\
&= O_p\Big(\prod_{j=1}^d m_j^{-2r_j} + \prod_{j=1}^d m_j/n\Big) + O_p\Big(\prod_{j=1}^d m_j/n\Big) + \text{s.o.}
= O_p\Big(\prod_{j=1}^d m_j^{-2r_j} + \prod_{j=1}^d m_j/n\Big). \qquad (A.6)
\end{aligned}$$
Combining (A.4) and (A.6) completes the proof of this theorem.

Appendix B: Coefficients for Normal Mixtures

The coefficients for the bivariate normal mixtures can be found in Table 1 of Wand and Jones (1993). A trivariate normal random variable is denoted by $N(\mu, \sigma, \rho)$, where $\mu = (\mu_1, \mu_2, \mu_3)$, $\sigma = (\sigma_1, \sigma_2, \sigma_3)$ and $\rho = (\rho_{12}, \rho_{13}, \rho_{23})$. The coefficients for the trivariate normal mixtures used in the simulation are as follows.

1. Uncorrelated normal: $N((0,0,0), (1/2, 1/\sqrt{2}, 1), (0,0,0))$
2. Correlated normal: $N((0,0,0), (1,1,1), (3/10, 5/10, 7/10))$
3. Skewed: $\frac{1}{5}N((0,0,0), (1,1,1), (0,0,0)) + \frac{1}{5}N((\frac{1}{2},\frac{1}{2},\frac{1}{2}), (\frac{2}{3},\frac{2}{3},\frac{2}{3}), (0,0,0)) + \frac{3}{5}N((\frac{13}{12},\frac{13}{12},\frac{13}{12}), (\frac{5}{9},\frac{5}{9},\frac{5}{9}), (0,0,0))$
4. Kurtotic: $\frac{2}{3}N((0,0,0), (1,\sqrt{2},\sqrt{2}), (\frac{1}{2},\frac{1}{2},\frac{1}{2})) + \frac{1}{3}N((0,0,0), (\frac{2}{3},\frac{2}{3},\frac{1}{3}), (-\frac{1}{2},-\frac{1}{2},\frac{1}{2}))$
5. Bimodal I: $\frac{1}{2}N((-1,0,0), (\frac{2}{3},\frac{2}{3},\frac{2}{3}), (0,0,0)) + \frac{1}{2}N((1,0,0), (\frac{2}{3},\frac{2}{3},\frac{2}{3}), (0,0,0))$
6. Bimodal II: $\frac{1}{2}N((-\frac{3}{2},0,0), (\frac{1}{4},1,1), (0,0,0)) + \frac{1}{2}N((\frac{3}{2},0,0), (\frac{1}{4},1,1), (0,0,0))$
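For replication purposes, these designs are straightforward to sample from. The sketch below is our own helper, not code from the paper; it encodes each component as (weight, mean, standard deviations, pairwise correlations) in the $N(\mu, \sigma, \rho)$ notation of this appendix, and the example component list is our reading of the "bimodal I" design.

```python
import numpy as np

def sample_trivariate_mixture(n, comps, rng):
    """Draw n points from a mixture of trivariate normals.

    Each component is (weight, mu, sigma, rho) with mu and sigma of length 3
    and rho = (rho12, rho13, rho23), matching the N(mu, sigma, rho) notation.
    """
    weights = np.array([c[0] for c in comps])
    idx = rng.choice(len(comps), size=n, p=weights)   # component labels
    out = np.empty((n, 3))
    for k, (_, mu, sig, rho) in enumerate(comps):
        r12, r13, r23 = rho
        corr = np.array([[1.0, r12, r13],
                         [r12, 1.0, r23],
                         [r13, r23, 1.0]])
        cov = corr * np.outer(sig, sig)               # Cov_ij = rho_ij * s_i * s_j
        m = idx == k
        out[m] = rng.multivariate_normal(np.asarray(mu, float), cov, size=int(m.sum()))
    return out

# e.g., the "bimodal I" design: equal-weight components centered at +/-1
bimodal1 = [(0.5, (-1, 0, 0), (2/3, 2/3, 2/3), (0, 0, 0)),
            (0.5, ( 1, 0, 0), (2/3, 2/3, 2/3), (0, 0, 0))]
```

A quick sanity check on this design: the first coordinate should have mean 0 and variance $1 + (2/3)^2 \approx 1.44$ (between-component plus within-component variance).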