Estimating Intrinsic Dimension

Justin Eberhardt
Department of Mathematics and Statistics
University of Minnesota Duluth
Duluth, MN 55812
June 2007

A project submitted to the faculty of the Graduate School of the University of Minnesota by Justin Eberhardt, in partial fulfillment of the requirements for the degree of Master of Science in Applied and Computational Mathematics, June 2007.

Abstract

The intrinsic dimension of a dataset is often much less than the dimension of the space in which the dataset is presented. Knowing the intrinsic dimension is valuable because a high-dimensional dataset can then be replaced by a lower-dimensional one that is easier to manipulate. Traditional intrinsic dimension estimators, such as principal component analysis, can only be used on linear spaces. Non-linear manifolds require other methods, such as nearest-neighbor estimators. We compare three nearest-neighbor estimators based on several criteria and show that two of the estimators perform well on a wide range of non-linear datasets.

I would like to thank my advisors, Dr. Kang James and Dr. Barry James, for their guidance and support throughout this entire project.

Contents

1. Introduction
2. Nearest Neighbor Estimators
   2.1 Nearest-Neighbor Information
   2.2 Nearest-Neighbor Regression Estimator Overview
   2.3 Nearest-Neighbor Maximum Likelihood Estimator Overview
   2.4 Nearest-Neighbor Regression Estimator Derivation
   2.5 Nearest-Neighbor Maximum Likelihood Estimator
   2.6 Revised Nearest-Neighbor Maximum Likelihood Estimator
3. Datasets
   3.1 Gaussian Sphere
   3.2 Swiss Roll
   3.3 Double Swiss Roll
   3.4 Artificial Face
   3.5 25-Dimensional Gaussian Sphere (ID = 15)
4. Results
   4.1 Accuracy
   4.2 Dependence on Number of Neighbors
   4.3 Dependence on Distribution Type
   4.4 Effectiveness on Datasets with High Intrinsic Dimension
   4.5 Summary
References
A. Appendix
   A.1 Program Overview
   A.2 Code
      int_dim.cpp
      int_dim_reg.h
      int_dim_mle.h
      random_gen.h
      gauss_gen.h

1. Introduction

High dimensionality limits the usefulness of practical data, but it is often possible to represent high-dimensional data in a low-dimensional space. The smallest dimension that describes a dataset without significant loss of feature is the intrinsic dimension of the dataset [1]. Examples of high-dimensional datasets with low intrinsic dimensionality include biometric datasets such as face images, fingerprints, and iris scans. Genetic information is also believed to have low intrinsic dimension [2]. Intrinsic dimension estimators are a useful tool for determining whether or not a dataset can be represented in a lower-dimensional space.

Several intrinsic dimension estimators are currently available. One traditional method for determining intrinsic dimension is principal component analysis (PCA). More recently, nearest-neighbor (NN) methods, including NN regression estimation and NN maximum likelihood estimation, have been proposed, and preliminary results on real and simulated datasets are promising [1], [2].

PCA can be used as an intrinsic dimension estimator, but its effectiveness is limited in many practical applications. Implementing PCA as an intrinsic dimension estimator requires the covariance matrix of the input dataset. The eigenvalues of the covariance matrix are computed, and the number of eigenvalues greater than a specified threshold value is taken as the intrinsic dimension (a sketch of this procedure is given below). PCA is useful for certain applications, including data compression algorithms; however, when working with high-dimensional matrices the process is computationally expensive. The greatest limitation of PCA is that it is only useful for linear manifolds, that is, manifolds that can be represented by linear transformations of Euclidean spaces.
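As a concrete illustration of the eigenvalue-threshold procedure just described, the following is a minimal sketch of PCA-based intrinsic dimension estimation. It assumes the Eigen linear-algebra library, which is not used elsewhere in this project; the function name and the threshold parameter are illustrative, and the threshold must be chosen for the data at hand.

#include <Eigen/Dense>

// Returns the number of covariance eigenvalues above `threshold`.
// Each row of X is one observation; each column is one coordinate.
int pca_intrinsic_dim(const Eigen::MatrixXd& X, double threshold) {
    Eigen::MatrixXd centered = X.rowwise() - X.colwise().mean();
    Eigen::MatrixXd cov = (centered.adjoint() * centered) / double(X.rows() - 1);
    // SelfAdjointEigenSolver returns the eigenvalues in increasing order
    Eigen::SelfAdjointEigenSolver<Eigen::MatrixXd> es(cov);
    int dim = 0;
    for (int i = 0; i < es.eigenvalues().size(); ++i)
        if (es.eigenvalues()(i) > threshold) ++dim;
    return dim;
}

Building the full covariance matrix and its eigendecomposition is exactly the computational expense the text refers to: both steps grow quickly with the ambient dimension p.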
Biometric and genetic data are often non-linear, so PCA is not useful for determining their actual intrinsic dimension. Consider the data manifold pictured below, typically referred to as a Swiss Roll. The manifold is a two-dimensional plane that has been "rolled" so that the shape occupies three-dimensional space. Since the underlying manifold is non-linear, traditional methods for calculating intrinsic dimension fail.

Figure 1: A typical Swiss Roll manifold.

Data containing more than three dimensions cannot be represented in Cartesian space; however, it is often useful to visualize such data. Multi-Dimensional Scaling (MDS) is a method that seeks to reduce dimensionality while preserving the Euclidean distances between data points. Through MDS, data can be "flattened" without significant loss of information. Because MDS flattens data in a way that preserves the distances between observations, it requires computationally expensive iteration. The quality of the flattening is measured by the amount of stress in the dataset: stress measures the difference between distances in the original dataset and the corresponding distances in the flattened dataset, so low stress implies that the reduced dataset is similar to the original.

    Traditional Estimator    Limitation
    PCA                      Fails on non-linear manifolds
    MDS                      Computationally expensive

The limitations of PCA and MDS on high-dimensional, non-linear datasets led to the development of a new type of estimator that uses nearest-neighbor information. The nearest-neighbor estimators rely on the assumption that the density of observations in a small neighborhood around each observation in the dataset is constant. The literature shows that, on both practical and simulated datasets, the nearest-neighbor estimators are very accurate [1], [2], [3]. We give a detailed explanation of the theory behind nearest-neighbor estimators in Section 2, including full derivations of a regression estimator and a maximum likelihood estimator. Section 3 describes the datasets used in our simulations, and in Section 4 we compare the general accuracy and characteristics of the estimators.

    Proposed Estimators
    Nearest-Neighbor Regression Estimator (NN REG)
    Nearest-Neighbor Maximum Likelihood Estimator (NN MLE)

2. Nearest Neighbor Estimators

2.1 Nearest-Neighbor Information

Nearest-neighbor information is extracted from a dataset by calculating the Euclidean distance between each pair of observations; the distances are stored in an n x n matrix (where n is the number of observations). The nearest-neighbor estimators require that each row of the distance matrix be sorted so that the distance from each observation to its kth neighbor is known. This row-sorted matrix is called the nearest-neighbor matrix; a code sketch of its construction follows Figure 2.

    Distance Matrix                          Nearest-Neighbor Matrix
         1      2      3    ...   n               1     2      3    ...   n
    1    0    d1,2   d1,3   ...  d1,n         1   0   t1,2   t1,3   ...  t1,n
    2  d2,1     0    d2,3   ...  d2,n         2   0   t2,2   t2,3   ...  t2,n
    3  d3,1   d3,2     0    ...  d3,n         3   0   t3,2   t3,3   ...  t3,n
    :    :      :      :          :           :   :     :      :          :
    n  dn,1   dn,2   dn,3   ...    0          n   0   tn,2   tn,3   ...  tn,n

Figure 2: Distance matrix and nearest-neighbor matrix. Row i, column j of the distance matrix is the distance from observation i to observation j. Row i, column k of the nearest-neighbor matrix is the distance from observation i to its kth NN (the first column is always zero, each observation being at distance zero from itself).
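A minimal, self-contained sketch of this construction is given below; the full implementation used in our simulations appears in the Appendix, and the function name here is illustrative.

#include <algorithm>
#include <cmath>
#include <vector>
using std::vector;

// Build the nearest-neighbor matrix of Section 2.1: row i holds the sorted
// distances from observation i to every observation (column 0 is always 0).
// `data` holds n observations, each a p-dimensional point.
vector<vector<float> > nn_matrix(const vector<vector<float> >& data) {
    const size_t n = data.size();
    vector<vector<float> > nn(n, vector<float>(n, 0.0f));
    for (size_t i = 0; i < n; ++i) {
        for (size_t j = 0; j < n; ++j) {
            float sum = 0.0f;                      // squared Euclidean distance
            for (size_t k = 0; k < data[i].size(); ++k) {
                float d = data[i][k] - data[j][k];
                sum += d * d;
            }
            nn[i][j] = std::sqrt(sum);
        }
        std::sort(nn[i].begin(), nn[i].end());     // row-sort: column k = kth NN
    }
    return nn;
}

Note the cost: the distance matrix alone takes O(n^2 p) work and O(n^2) storage, which is why n = 2000 is a typical size in the simulations of Section 4.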
2.2 Nearest-Neighbor Regression Estimator Overview

    Pdf of the distance to the kth NN (approximated as Poisson)
      -> Expected distance to the kth NN
      -> Expected value of the sample-averaged distance to the kth NN
      -> NN REG estimator

Figure 3: Outline of the nearest-neighbor regression estimator.

The nearest-neighbor regression estimator (NN REG) is derived by finding the density of the distance to the kth nearest neighbor for an individual observation in the dataset. From this density, the expected distance to the kth nearest neighbor can be obtained. We will show that, apart from a constant and a small correction term, the natural log of the expected sample-averaged distance to the kth NN equals the product of the inverse of the intrinsic dimension and the natural log of k. By computing the sample averages for a range of values of k, we can estimate m with a simple linear regression model.

2.3 Nearest-Neighbor Maximum Likelihood Estimator Overview

    Counting process: binomial (approximated as Poisson)
      -> Joint occurrence probability
      -> Joint occurrence density
      -> Log-likelihood function
      -> NN MLE estimator

Figure 4: Outline of the nearest-neighbor maximum likelihood estimator.

The derivation of the nearest-neighbor maximum likelihood estimator is composed of three steps. It is assumed that the density of observations in a small neighborhood around each observation is constant. Therefore, the number of observations within distance t of observation x is binomially distributed, with probability of success proportional to f(x). The joint counting probability and joint occurrence density are calculated based on the Poisson approximation to this binomial. The final formula for the estimator is obtained by maximizing the log-likelihood function that follows from the joint occurrence density.

2.4 Nearest-Neighbor Regression Estimator Derivation
Pettis, Bailey, Jain, & Dubes (1979)

The following derivation is based on [1] and [4]. We have added additional explanation where needed. Let:

    x : an observation in the input dataset, a high-dimensional dataset (HDD)
    p : dimension of the HDD
    m : the intrinsic dimension of the dataset
    f(x) : the (locally constant) density of observations at x
    T_k, T_{x,k} : the distance from x to its kth NN
    N(t, x) : the number of observations within distance t of observation x
    N_{r,s} : the number of observations between distances r and s from observation x
    V_t = V(m) t^m : the volume of a sphere of radius t in m dimensions
    V(m) : the volume of the m-dimensional unit sphere

Counting the number of observations within distance t of x is a binomial counting process with n - 1 trials (the observations other than x) and probability of success equal to the density times the volume of a sphere of radius t. We assume the density is constant over the neighborhood and depends only on x. Thus the probability that there are k observations within distance t of x can be written as

    P[N(t,x) = k] = \binom{n-1}{k} [f(x)V_t]^k [1 - f(x)V_t]^{n-1-k}.    (1)

We begin by finding the probability density function (pdf) f_{x,k} of T_{x,k}, the distance from x to its kth NN:

    f_{x,k}(t) = \lim_{\Delta t \to 0} \frac{1}{\Delta t} P[t < T_{x,k} \le t + \Delta t]
               = \lim_{\Delta t \to 0} \frac{1}{\Delta t} P[N(t,x) = k-1, \; N(t+\Delta t, x) = k].    (2)

The event in (2) splits the other n - 1 observations three ways, so its probability is trinomial.

[Figure 5: the trinomial event: k-1 observations within distance t of x, one observation between t and t + \Delta t, and the remaining n-k-1 observations farther than t + \Delta t.]

    f_{x,k}(t) = \lim_{\Delta t \to 0} \frac{1}{\Delta t} \, \frac{(n-1)!}{(k-1)! \, 1! \, (n-k-1)!} \, [f(x)V_t]^{k-1} \, [f(x)(V_{t+\Delta t} - V_t)] \, [1 - f(x)V_{t+\Delta t}]^{n-k-1}.    (3)

Since \lim_{\Delta t \to 0} (V_{t+\Delta t} - V_t)/\Delta t = V_t' = V(m) m t^{m-1}, this becomes

    f_{x,k}(t) = (n-1) f(x) V(m) m t^{m-1} \binom{n-2}{k-1} [f(x)V_t]^{k-1} [1 - f(x)V_t]^{n-k-1}.    (4)

The factor \binom{n-2}{k-1} [f(x)V_t]^{k-1} [1 - f(x)V_t]^{n-k-1} is a binomial probability with n-2 trials and probability of success f(x)V_t. Approximating this binomial by a Poisson distribution with mean (n-2) f(x) V_t, we obtain the pdf

    f_{x,k}(t) \approx \frac{(n-1) f(x) V(m) m t^{m-1} \, [(n-2) f(x) V(m) t^m]^{k-1} \, e^{-(n-2) f(x) V(m) t^m}}{(k-1)!}.    (5)
Let c = (n-2) f(x) V(m), and note that \Gamma(k) = (k-1)!. Then (5) can be written compactly as

    f_{x,k}(t) = \frac{(n-1) f(x) V(m) \, m t^{m-1} (c t^m)^{k-1} e^{-c t^m}}{\Gamma(k)}, \quad t > 0,    (6)

and f_{x,k}(t) = 0 elsewhere. Now we find the expected value of T_{x,k} by integrating over all values of t:

    E(T_{x,k}) = \int_0^\infty t \, f_{x,k}(t) \, dt = \frac{(n-1) f(x) V(m) m}{\Gamma(k)} \int_0^\infty t^m (c t^m)^{k-1} e^{-c t^m} \, dt.    (7)

Substituting u = c t^m, so that t = (u/c)^{1/m} and dt = \frac{1}{m c^{1/m}} u^{\frac{1}{m}-1} du, gives

    E(T_{x,k}) = \frac{(n-1) f(x) V(m)}{c^{1+1/m} \, \Gamma(k)} \int_0^\infty u^{(k + \frac{1}{m}) - 1} e^{-u} \, du,    (8)

where the remaining integral is a gamma function:

    \int_0^\infty u^{(k + \frac{1}{m}) - 1} e^{-u} \, du = \Gamma(k + \tfrac{1}{m}).    (9)

Since (n-1) f(x) V(m) / c = (n-1)/(n-2) \approx 1 for large n,

    E(T_{x,k}) \approx \frac{\Gamma(k + \frac{1}{m})}{k^{1/m} \, \Gamma(k)} \, k^{1/m} \, [(n-1) f(x) V(m)]^{-1/m}.    (10)

We define a sample-averaged distance from the observations x_i to their kth nearest neighbors,

    \bar{T}_k = \frac{1}{n} \sum_{i=1}^{n} T_{x_i,k},    (11)

and, combining the preceding two equations, we have the following expected value of \bar{T}_k:

    E(\bar{T}_k) = \frac{1}{n} \sum_{i=1}^{n} E(T_{x_i,k}) = \frac{\Gamma(k + \frac{1}{m})}{k^{1/m} \Gamma(k)} \, k^{1/m} \, \frac{1}{n} \sum_{i=1}^{n} [(n-1) f(x_i) V(m)]^{-1/m}.    (12)

Let

    G_{k,m} = \frac{k^{1/m} \, \Gamma(k)}{\Gamma(k + \frac{1}{m})}, \qquad C_n = \frac{1}{n} \sum_{i=1}^{n} [(n-1) f(x_i) V(m)]^{-1/m},    (13)

so that

    E(\bar{T}_k) = G_{k,m}^{-1} \, k^{1/m} \, C_n.    (14)

Taking the logarithm of both sides and estimating E(\bar{T}_k) by \bar{T}_k gives the result

    \log(G_{k,m}) + \log(\bar{T}_k) \approx \frac{1}{m} \log(k) + \log(C_n).    (15)

This equation has the form Y = \beta_0 + \beta_1 X, where Y = \log(G_{k,m}) + \log(\bar{T}_k), X = \log(k), \beta_1 = 1/m, and \beta_0 = \log(C_n), which is independent of k. Since \log(G_{k,m}) is close to zero for all k and m, plotting X = \log(k) against Y = \log(\bar{t}_k) for values of k ranging from 1 to an arbitrarily chosen K, we can estimate the intrinsic dimension m as the inverse slope of a regression line through these points. K should be chosen so that the density is relatively constant out to the distance to the Kth NN. The least-squares estimate is

    \hat{m} = \frac{K \sum_{k=1}^{K} X_k^2 - \left( \sum_{k=1}^{K} X_k \right)^2}{K \sum_{k=1}^{K} X_k Y_k - \left( \sum_{k=1}^{K} X_k \right)\left( \sum_{k=1}^{K} Y_k \right)}.    (16)

A slightly better estimate can be obtained by writing \log(G_{k,m}) as a series in k and m and iterating:

    \log(G_{k,m}) = \frac{m-1}{2km^2} + \frac{(m-1)(m-2)}{12k^2 m^3} - \frac{(m-1)^2}{12k^3 m^4} - \frac{(m-1)(m-2)(m^2+3m-3)}{120k^4 m^5} + \cdots    (17)

    NN REG
    Calculate the NN matrix
    Initial estimate:
        X_k = log(k), Y_k = log(t̄_k)
        m̂_0 = [K Σ X² - (Σ X)²] / [K Σ XY - (Σ X)(Σ Y)]
    While |m̂_i - m̂_{i-1}| > ε (up to a maximum number of iterations):
        Calculate log(G_{k, m̂_{i-1}}) with the series approximation (17)
        X_k = log(k), Y_k = log(t̄_k) + log(G_{k, m̂_{i-1}})
        m̂_i = [K Σ X² - (Σ X)²] / [K Σ XY - (Σ X)(Σ Y)]
        Increment i

Figure 6: Pseudo-code for the NN REG program. ■
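The following is a compact sketch of the initial (non-iterated) slope step of Figure 6; it implements equation (16) directly. The iteration with the log(G) correction follows the same pattern and appears in full in int_dim_reg.h in the Appendix. The function name is illustrative; t_bar[k] is assumed to hold the sample-averaged distance to the kth NN, with t_bar[0] unused.

#include <cmath>
#include <vector>
using std::vector;

// Initial NN REG estimate: regress Y_k = log(t_bar_k) on X_k = log(k)
// for k = 1..K and return the inverse slope, as in equation (16).
float nnreg_initial(const vector<float>& t_bar, int K) {
    float sx = 0, sy = 0, sxy = 0, sxx = 0;
    for (int k = 1; k <= K; ++k) {
        float x = std::log(float(k));
        float y = std::log(t_bar[k]);
        sx += x; sy += y; sxy += x * y; sxx += x * x;
    }
    // slope = (K*sxy - sx*sy) / (K*sxx - sx*sx); m-hat is its inverse
    return (K * sxx - sx * sx) / (K * sxy - sx * sy);
}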
2.5 Nearest-Neighbor Maximum Likelihood Estimator
Levina & Bickel (2004)

Like the estimator in [1], the NN MLE exploits neighborhood information. The following derivation is based on [2], [4], and [5]. We have added additional explanation where needed. Fix an observation x and consider the counting process

    N(t, x) = \sum_{i=1}^{n} 1\{ X_i \in S_x(t) \},    (18)

where S_x(r) is the sphere of radius r about x. To find the joint occurrence density, we model the binomial counting process N(t,x) as a nonhomogeneous Poisson process N_t with mean \Lambda(t) = n f(x) V(m) t^m and rate

    \lambda(t) = \Lambda'(t) = n f(x) V(m) m t^{m-1}.    (19)

For such a process, the number of observations N_{r,s} falling between distances r and s from x satisfies

    P[N_{r,s} = j] = \frac{e^{-(\Lambda(s) - \Lambda(r))} \, (\Lambda(s) - \Lambda(r))^j}{j!}, \quad 0 < r < s,    (20)

and increments over disjoint intervals are independent. The joint density of the distances T_1, ..., T_K from x to its first K nearest neighbors is therefore

    f_{T_1,...,T_K}(t_1,...,t_K) = \lim_{\Delta t \to 0} \frac{1}{(\Delta t)^K} \, P[N_{0,t_1} = 0, \, N_{t_1,t_1+\Delta t} = 1, \, N_{t_1+\Delta t,t_2} = 0, \, \ldots, \, N_{t_K,t_K+\Delta t} = 1].    (21)

Because the increments are independent, the probability factors into

    P[N_{0,t_1} = 0] \; \prod_{i=1}^{K-1} P[N_{t_i,t_i+\Delta t} = 1] \, P[N_{t_i+\Delta t,t_{i+1}} = 0] \; \cdot \; P[N_{t_K,t_K+\Delta t} = 1].    (22)

Each one-point factor equals e^{-(\Lambda(t_i+\Delta t) - \Lambda(t_i))} (\Lambda(t_i+\Delta t) - \Lambda(t_i)), each zero-point factor is a pure exponential, and the exponents combine to -\Lambda(t_K + \Delta t). Dividing by (\Delta t)^K and letting \Delta t \to 0, each (\Lambda(t_i+\Delta t) - \Lambda(t_i))/\Delta t \to \lambda(t_i), which gives the joint occurrence density

    f_{T_1,...,T_K}(t_1,...,t_K) = \left[ \prod_{i=1}^{K} \lambda(t_i) \right] e^{-\int_0^{t_K} \lambda(t) \, dt}.    (23)

Taking logarithms of both sides, we have the log-likelihood

    L_x = \sum_{i=1}^{K} \ln \lambda(t_i) - \int_0^{t_K} \lambda(t) \, dt = \int_0^{t_K} \ln \lambda(t) \, dN_t - \int_0^{t_K} \lambda(t) \, dt.    (24)

Substituting \lambda(t) = n f(x) V(m) m t^{m-1} and writing N_{t_K} = N(t_K, x),

    L_x = [\ln(n f(x)) + \ln V(m) + \ln m] \, N_{t_K} + (m-1) \sum_{j=1}^{N_{t_K}} \ln T_j - n f(x) V(m) t_K^m,    (25)

where T_j is the distance from observation x to its jth nearest neighbor. Let \theta = \ln f(x), R = t_K, and N(R,x) = N(t_K, x):

    L_x(m, \theta) = [\ln n + \theta + \ln V(m) + \ln m] \, N(R,x) + (m-1) \sum_{j=1}^{N(R,x)} \ln T_j - n e^{\theta} V(m) R^m.    (26)

Setting the partial derivatives to zero,

    \frac{\partial L_x}{\partial \theta} = N(R,x) - n e^{\theta} V(m) R^m = 0,    (27)

    \frac{\partial L_x}{\partial m} = \left[ \frac{V'(m)}{V(m)} + \frac{1}{m} \right] N(R,x) + \sum_{j=1}^{N(R,x)} \ln T_j - n e^{\theta} [V'(m) + V(m) \ln R] R^m = 0.    (28)

Substituting (27) into (28), the V'(m) terms cancel, leaving

    \frac{N(R,x)}{m} + \sum_{j=1}^{N(R,x)} \ln T_j - N(R,x) \ln R = 0,    (29)

so that

    \hat{m}_x = \left[ \frac{1}{N(R,x)} \sum_{j=1}^{N(R,x)} \ln \frac{R}{T_j} \right]^{-1}.    (30)

In the program, R is taken to be T_K, the distance from x to its Kth nearest neighbor, so that K - 1 observations lie strictly inside the sphere:

    NN MLE
    Calculate the NN matrix
    For each observation x:
        m̂_x = [ (1/(K-1)) Σ_{j=1}^{K-1} ln( T_K / T_j ) ]^{-1}
    m̂ = (1/N) Σ_x m̂_x

Figure 7: Pseudo-code for the NN MLE program.

2.6 Revised Nearest-Neighbor Maximum Likelihood Estimator

MacKay and Ghahramani noted in 2005 [3] that the standard NN MLE is biased for low values of K. As shown in Figure 7, the standard NN MLE is calculated by averaging the point estimates over all observations. MacKay and Ghahramani argued that writing the likelihood equation for the entire dataset results in an estimator identical to that of Levina and Bickel, except that the final estimate must be obtained by averaging the inverses of the point estimates over all observations and then taking the inverse of the result. As our simulations show, the revised NN MLE does not appear to be biased at small values of K, which makes it a much improved estimator.

    NN MLE (Revised)
    Calculate the NN matrix
    For each observation x:
        m̂_x = [ (1/(K-1)) Σ_{j=1}^{K-1} ln( T_K / T_j ) ]^{-1}
    m̂ = [ (1/N) Σ_x (m̂_x)^{-1} ]^{-1}

Figure 8: Pseudo-code for the revised NN MLE program. ■
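The only difference between the two estimators is the final averaging step. The sketch below computes both aggregates from the row-sorted nearest-neighbor matrix of Section 2.1; it mirrors the behavior of int_dim_mle.h in the Appendix, though the function names here are illustrative. It assumes the distances T_1, ..., T_K are non-zero (duplicate observations would make the log ratios degenerate).

#include <cmath>
#include <vector>
using std::vector;

// Point estimate m_x from one sorted row of the NN matrix (equation (30)
// with R = T_K): the inverse of the average of ln(T_K / T_j), j = 1..K-1.
float mle_point(const vector<float>& row, int K) {
    float sum = 0.0f;
    for (int j = 1; j < K; ++j)
        sum += std::log(row[K] / row[j]);   // row[0] = 0 (distance to self) is skipped
    return (K - 1) / sum;
}

// Levina & Bickel: average the point estimates (Figure 7).
// MacKay & Ghahramani: average the inverses, then invert (Figure 8).
void both_estimates(const vector<vector<float> >& nn, int K,
                    float& levina, float& mackay) {
    float sum_m = 0.0f, sum_inv = 0.0f;
    for (size_t x = 0; x < nn.size(); ++x) {
        float m_x = mle_point(nn[x], K);
        sum_m   += m_x;
        sum_inv += 1.0f / m_x;
    }
    levina = sum_m / nn.size();
    mackay = nn.size() / sum_inv;
}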
3. Datasets

Several datasets have been developed to test the efficacy of intrinsic dimension estimators. These datasets range from the practical, such as human faces, to mathematically generated random datasets such as the Swiss Roll. The datasets used in our simulations are described below.

3.1 Gaussian Sphere

The Gaussian sphere is a three-dimensional dataset whose intrinsic dimension is also three. Due to this characteristic, the dataset is useful as a baseline test for intrinsic dimension estimators. Each observation is generated from three independent Gaussian random numbers with mean zero and variance one.

    3-D Gaussian Sphere w/ ID = 3
    Xi = (xi0, xi1, xi2)
    xi0 - xi2 : random numbers ~ N(0,1)

Figure 9: Pseudo-code for generating Gaussian sphere datasets.

3.2 Swiss Roll

The Swiss Roll is a three-dimensional dataset with a non-linear, two-dimensional manifold. The name comes from the shape of the object when viewed in three-dimensional space (see Figure 1). Each observation has the form (θ cos θ, θ sin θ, z), where θ and z are random numbers. Two main features make this dataset a popular test set for intrinsic dimension estimation: it is easily generated, and traditional methods of dimension estimation, such as PCA, fail on it.

    Swiss Roll Procedure
    Xi = (xi0, xi1, xi2)
    r1 : random number ~ U(0,1) * 6.28
    r2 : random number ~ U(0,1) * 6.28
    xi0 = r1 * cos(r1)
    xi1 = r1 * sin(r1)
    xi2 = r2

Figure 10: Pseudo-code for generating the Swiss Roll dataset. (The second random number gives the roll its second dimension; setting xi2 = r1 instead would trace a one-dimensional curve.)

Figure 11: 2000 observations plotted on the Swiss Roll manifold.

3.3 Double Swiss Roll

Two nested Swiss Rolls make up the Double Swiss Roll dataset. Both manifolds have intrinsic dimension two; however, where the two Swiss Roll manifolds meet at the center of the Double Swiss Roll, the intrinsic dimension appears to be greater than two due to the high density of observations in that region.

Figure 12: 2000 observations plotted on the Double Swiss Roll manifold.

3.4 Artificial Face

This face database includes many pictures of a single artificial face under various lighting conditions and in various horizontal and vertical orientations. Because three conditions vary, the intrinsic dimensionality should be three, but since the pictures are two-dimensional projections, we do not know the exact intrinsic dimension. Levina and Bickel propose that the intrinsic dimension of the dataset is approximately four [2].

Figure 13: Images of an artificial face under various lighting conditions and various poses. [2]

3.5 25-Dimensional Gaussian Sphere (ID = 15)

This dataset embeds a 15-dimensional Gaussian cloud in 25 dimensions: the last ten coordinates are deterministic (sine) functions of the first ten, so the intrinsic dimension remains 15.

    25-D Gaussian Sphere w/ ID = 15
    Xi = (xi0, xi1, ..., xi24)
    xi0 - xi14 : random numbers ~ N(0,1) * 10
    xi15 - xi24 : xij = sin(xi(j-15))
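A minimal sketch of the generator for this dataset follows; the Appendix code covers it only through a commented-out branch of int_dim.cpp, so the version below uses the standard <random> library in place of the gasdev() routine, and the function name is illustrative.

#include <cmath>
#include <random>
#include <vector>
using std::vector;

// Generate the 25-dimensional Gaussian sphere of Section 3.5: the first 15
// coordinates are independent N(0,1) values scaled by 10, and the last 10 are
// sines of the first 10, so the intrinsic dimension is 15.
vector<vector<float> > gaussian_25d(int n, unsigned seed) {
    std::mt19937 gen(seed);
    std::normal_distribution<float> gauss(0.0f, 1.0f);
    vector<vector<float> > data(n, vector<float>(25));
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j < 15; ++j)  data[i][j] = 10.0f * gauss(gen);
        for (int j = 15; j < 25; ++j) data[i][j] = std::sin(data[i][j - 15]);
    }
    return data;
}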
4. Results

Previous studies have shown that nearest-neighbor estimators produce accurate results when tested on simulated datasets with moderate to low intrinsic dimension. Levina & Bickel noted that the estimators depend on K, the number of nearest neighbors used in the estimate. MacKay and Ghahramani stated that the choice of K is less of an issue with their revised NN MLE; however, our preliminary simulations have shown that a dependence on K exists even in the revised NN MLE. To further investigate the characteristics of the estimators with respect to K, our simulations report the estimates over a wide range of K. Levina & Bickel have also reported simulation results indicating that the estimators are less accurate in high dimensions due to underestimation; the 25-dimensional Gaussian sphere is used to test the high-intrinsic-dimension behavior of the estimators. Finally, to see what effect the distribution type has on estimation, we test the estimators on a Gaussian sphere and a three-dimensional uniform cube of similar size.

4.1 Accuracy

[Plot: intrinsic dimension estimate (0-6) vs. number of neighbors K (1-1000, log scale) for NN MLE, revised NN MLE, and NN REG. Input data: three-dimensional Gaussian sphere, 2000 observations, intrinsic dimension 3.]

Figure 14: Three-dimensional Gaussian sphere. We expect an intrinsic dimension estimate of three. The graph shows that the MacKay & Ghahramani estimator (revised NN MLE) and the Jain et al. estimator (NN REG) both provide accurate estimates over a wide range of K values.

[Plot: intrinsic dimension estimate (0-4) vs. number of neighbors K (1-1000, log scale) for NN MLE, revised NN MLE, and NN REG. Input data: three-dimensional Swiss Roll, 2000 observations, intrinsic dimension 2.]

Figure 15: Swiss Roll. The Swiss Roll has an intrinsic dimension of two, which is correctly estimated by the revised NN MLE and NN REG estimators.

[Plot: intrinsic dimension estimate (0-4) vs. number of neighbors K (1-1000, log scale) for NN MLE, revised NN MLE, and NN REG. Input data: three-dimensional Double Swiss Roll, 2000 observations, intrinsic dimension 2.]

Figure 16: Double Swiss Roll. The estimates of intrinsic dimension for the Double Swiss Roll are not accurate for values of K greater than 100, due to the high concentration of observations near the center of the dataset.

[Plot: intrinsic dimension estimate (0-10) vs. number of neighbors K (1-100, log scale) for NN MLE, revised NN MLE, and NN REG. Input data: artificial face dataset, 128 data points (128 images).]

Figure 17: Artificial face. An estimate near 3.5 is reasonable for the artificial face data, since the dataset consists of a single 3-D face projected onto a 2-D plane under varying lighting conditions and variation in horizontal and vertical orientation.

Both the MacKay & Ghahramani estimator and the Jain et al. estimator perform exceptionally well on these four datasets. The best estimates occur when K, the number of nearest neighbors used in the estimate, is small relative to the total number of observations. This is reasonable, since we assume that the density near each observation is constant, and that assumption is good when K is small. As expected, the Levina & Bickel estimator is not useful when K is small [3].

4.2 Dependence on Number of Neighbors

The artificial face and Double Swiss Roll results show that the estimators may be heavily dependent on the number of nearest neighbors used in the estimate. This dependence on K is exaggerated when the intrinsic dimension is large and N is moderate, as shown in Figure 19.

4.3 Dependence on Distribution Type

[Plot: intrinsic dimension estimate (0-6) vs. number of neighbors K (1-1000, log scale) for Gaussian-distributed and uniform-distributed data. Input data: three-dimensional Gaussian sphere / uniform cube, 1000 observations, intrinsic dimension 3.]

Figure 18: Distribution comparison. The graph shows that the estimators are not highly dependent on the distribution type in the case of three dimensions and 1000 observations.

Figure 18 shows similar estimates for a Gaussian-distributed sphere and a uniformly distributed cube in three-dimensional space. There is a slight overestimate for the Gaussian sphere and a slight underestimate for the uniform cube; however, each estimate would be rounded to the same nearest integer.

4.4 Effectiveness on Datasets with High Intrinsic Dimension

[Plot: intrinsic dimension estimate (10-15) vs. number of neighbors K (1-20) for N = 250, 500, 1000, and 2000 observations. Input data: 25-dimensional Gaussian dataset, intrinsic dimension 15.]

Figure 19: 25-dimensional dataset with 250, 500, 1000, and 2000 observations. As N increases, the estimators become less dependent on K.

Previous studies have shown that nearest-neighbor estimators perform poorly at high intrinsic dimension, although the dependence on K at high intrinsic dimension has not been fully explored [2]. Each curve in Figure 19 represents the average of the MacKay & Ghahramani estimates when the number of observations in the simulated dataset is set at 250, 500, 1000, and 2000. Our results agree with the findings of Levina and Bickel, which show an estimate of approximately 12-13 when the true intrinsic dimension is 15. We also found that at high intrinsic dimension the estimate becomes highly dependent on K: for example, when N equals 1000, the intrinsic dimension estimate is 13.3 for K equal to 2 but quickly falls to 12.1 when K is increased to 20. Increasing N partially alleviates this problem; however, the estimate is still better when K is small.
4.5 Summary

Our simulations have shown that nearest-neighbor intrinsic dimension estimators are effective on datasets with non-linear manifolds and intrinsic dimensions less than ten. The results of the artificial face simulations are encouraging for biometric applications such as facial and iris recognition. In each of our simulations, the best estimates occur when the number of nearest neighbors is small; in general, K less than ten appears to be the most accurate, although the exact choice of K is dictated by the specific dataset. When the intrinsic dimension is 15 or greater, the estimators begin to underestimate the true intrinsic dimension, and the problem worsens as the intrinsic dimension increases.

References

[1] K.W. Pettis, T.A. Bailey, A.K. Jain, and R.C. Dubes. An intrinsic dimensionality estimator from near-neighbor information. IEEE Trans. Pattern Anal. Machine Intell., vol. 1, pp. 25-37, 1979.
[2] E. Levina and P.J. Bickel. Maximum likelihood estimation of intrinsic dimension. Advances in NIPS 17, 2005.
[3] D.J.C. MacKay and Z. Ghahramani. Comments on 'Maximum Likelihood Estimation of Intrinsic Dimension' by E. Levina and P. Bickel. Available online: http://www.inference.phy.cam.ac.uk/mackay/dimension/, 2005.
[4] R.O. Duda and P.E. Hart. Pattern Classification and Scene Analysis. Wiley, New York, 1973.
[5] D.L. Snyder. Random Point Processes. Wiley, New York, 1975.
[6] J.B. Tenenbaum, V. de Silva, and J.C. Langford. A global geometric framework for nonlinear dimensionality reduction. Science, 290:2319-2323, 2000.
[7] W.H. Press et al. Numerical Recipes in C++: The Art of Scientific Computing. Cambridge University Press, New York, 2002.

A. Appendix

A.1 Program Overview

    Create/import the dataset
    Start loop
        Set K (K: the number of NN to use in the estimate)
        Execute the Levina & Bickel MLE (with the MacKay & Ghahramani revision)
        Execute the Jain et al. regression estimator
    End loop
    Output the results in .csv table format (dimension estimate vs. number of NN)
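A typical build-and-run session might look like the following; the compiler invocation is illustrative (any standard C++ compiler should work). The program then prompts for the dataset choice, the number of observations n, the number of parameters p, and the simulated intrinsic dimension, and writes one row of estimates per value of K to output.csv.

    g++ -O2 -o int_dim int_dim.cpp
    ./int_dim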
A.2 Code

int_dim.cpp

/* Main Program: int_dim.cpp
   Req Header Files: int_dim_reg.h, int_dim_mle.h, random_gen.h and gauss_gen.h */

/* Research Project
   Justin Eberhardt
   Project: Intrinsic Dimension Estimation
   Advisor: Dr. Kang James */

#include <stdio.h>
#include <math.h>
#include <stdlib.h>
#include <iomanip>
#include <iostream>
#include <fstream>
#include <string>
#include <algorithm>
#include <ctime>
#include <cstdlib>
#include <vector>

#include "random_gen.h"
#include "gauss_gen.h"
#include "int_dim_mle.h"
#include "int_dim_reg.h"
// #include "int_dim_rev.h"   // revised-MLE experiments; not listed in this appendix
// #include "timer.h"         // timing utility; not listed in this appendix

using namespace std;

float ran1(int &idum);
float gasdev(int &idum);
void int_dim_mle(int &n, int &p, float &levina_set_estimate, float &mackay_set_estimate,
                 vector<vector<float> > &nearest_neighbor, int &neighbor_k);
float int_dim_reg(int &n, int &p, vector<vector<float> > &nearest_neighbor, int &neighbor_k);

int main(void)
{
    /* VARIABLES */
    int p;                // high dimension (of the original input_data)
    float m;              // low dimension (trying to find)
    int n;                // number of observations (data points)
    int idum;             // random-number seed
    int sim_m;            // intrinsic dimension of the simulated dataset
    int dist_type;
    int function;
    float r_a;
    float r_b;
    float r_c;
    float r_noise;
    int neighbor_k;       // K (notation from paper)
    float levina_set_estimate;
    float mackay_set_estimate;
    int trials;
    float d_denominator;  // d_i denominator estimator from paper
    float d_this;

    /* INITIALIZATION */
    m = 0;
    n = 0;
    sim_m = 0;

    // open files
    ofstream outfile;
    outfile.open("output.csv");
    outfile << "Neighbors, Levina & Bickel, MacKay & Ghahramani, Jain et. al." << endl;
    ofstream indatafile;
    indatafile.open("input_data.csv");

    /* DATA GENERATION */
    int which_input;
    which_input = 0;

    // prompt the user for the input_data source
    cout << endl << "Please select a dataset: " << endl
         << " 1) From File (input_data.txt)" << endl
         << " 2) Simulated Data" << endl
         << " 3) 1D Swiss Roll" << endl
         << " 4) 2D Swiss Roll" << endl
         << " 5) 2D Double Swiss Roll" << endl
         << " 6) 2D Swiss Roll with Noise" << endl;
    which_input = 2;
    cout << "Please enter 1 - 6: ";
    cin >> which_input;
    cout << endl << "Number of Observations: ";
    cin >> n;
    cout << endl << "Number of Parameters: ";
    cin >> p;
    cout << endl << "Simulated Int Dim: ";
    cin >> sim_m;
    cout << endl;

    // initialize the input_data vector
    vector<float> rows(p);
    vector<vector<float> > input_data(n, rows);

    /* USE input_data.txt */
    if(which_input == 1) {
        cout << "How many dimensions does the data in 'input_data.txt' contain? " << endl
             << "Please enter an integer: ";
        cin >> p;
        cout << endl;
        float data_point;
        string row; // one observation with 'p' dimensions
        ifstream myfile ("input_data.txt");
        if ( myfile.is_open() ) {
            int current_row = 0;
            int current_column = 0;
            // If the last observation does not contain a full set of dimensions,
            // the program fills the remaining spots with the last known entry
            while (! myfile.eof() ) {
                while(current_column < p) {
                    myfile >> data_point;
                    input_data[current_row][current_column] = data_point;
                    current_column++;
                }
                current_column = 0;
                current_row++;
            }
            myfile.close();
            n = current_row; // number of observations
        }
        else {
            cout << "Unable to open file";
        }
    }
    /* GENERATE A UNIFORM OR GAUSSIAN DISTRIBUTED DATASET */
    if(which_input == 2) {
        cout << "(1) Gaussian, (2) Uniform: ";
        cin >> dist_type;
        cout << endl;
        cout << "Function (1) sin(x), (2) ln(x^2), (3) x^(1/3): ";
        cin >> function;
        cout << endl;
        idum = time(0);
        for(int i=0; i < n; i++) {
            for(int j=0; j < p; j++) {
                if(j < sim_m) {
                    // the first sim_m coordinates are random
                    if(dist_type == 1) { input_data[i][j] = gasdev(idum); }
                    if(dist_type == 2) { input_data[i][j] = ran1(idum); }
                }
                else {
                    // the remaining coordinates are deterministic functions of the first sim_m
                    if(function == 1) { input_data[i][j] = sin( input_data[i][j-sim_m] ); }
                    if(function == 2) { input_data[i][j] = log( pow(input_data[i][j-sim_m],2) ); }
                    if(function == 3) { input_data[i][j] = pow(input_data[i][j-sim_m],(1.0f/3)); } // 1/3 must be a float, not integer division
                }
            }
        }
    }

    /* GENERATE A 3-D DATASET W/ ID=1 */
    if(which_input == 3) {
        if(p != 3) cout << "ERROR Parameter # does not Match";
        idum = time(0);
        for(int i=0; i < n; i++) {
            r_a = ran1(idum) * 6.28318; // 2*pi
            input_data[i][0] = r_a * cos(r_a);
            input_data[i][1] = r_a * sin(r_a);
            input_data[i][2] = r_a;
        }
    }

    /* GENERATE A 3-D DATASET W/ ID=2 (SWISS ROLL) */
    if(which_input == 4) {
        if(p != 3) cout << "ERROR Parameter # does not Match";
        idum = time(0);
        for(int i=0; i < n; i++) {
            r_a = ran1(idum) * 6.28318;
            r_b = gasdev(idum) * 6.28318;
            input_data[i][0] = r_a * cos(r_a);
            input_data[i][1] = r_a * sin(r_a);
            input_data[i][2] = r_b;
        }
    }

    /* GENERATE A 3-D DATASET W/ ID=2 (DOUBLE SWISS ROLL) */
    if(which_input == 5) {
        if(p != 3) cout << "ERROR Parameter # does not Match";
        idum = time(0);
        for(int i=0; i < n; i++) {
            r_a = ran1(idum) * 6.28318;
            r_b = ran1(idum) * 6.28318;
            if( i % 2 == 0 ) { // alternate observations between the two nested rolls
                input_data[i][0] = r_a * cos(r_a);
                input_data[i][1] = r_a * sin(r_a);
            }
            else {
                input_data[i][0] = r_a * .5 * cos(r_a);
                input_data[i][1] = r_a * .5 * sin(r_a);
            }
            input_data[i][2] = r_b;
        }
    }

    /* GENERATE A 3-D DATASET W/ ID=2 & NOISE */
    if(which_input == 6) {
        if(p != 3) cout << "ERROR Parameter # does not Match";
        idum = time(0);
        for(int i=0; i < n; i++) {
            r_a = ran1(idum) * 6.28318;
            r_b = gasdev(idum) * 6.28318;
            r_noise = gasdev(idum) * .25;
            input_data[i][0] = r_a * cos(r_a) + r_noise;
            input_data[i][1] = r_a * sin(r_a) + r_noise;
            input_data[i][2] = r_b;
        }
    }

    /* SHOW DATA */
    char show_data;
    cout << "Would you like to view the dataset? y/n: " << endl;
    cin >> show_data;
    cout << endl;
    if(show_data == 'y') {
        cout << endl;
        for(int i = 0; i < n; i++) {
            for(int j = 0; j < p; j++) {
                cout << setw(10) << input_data[i][j];
            }
            cout << endl;
        }
    }

    /* RECORD INPUT DATA IN input_data.csv */
    show_data = 'n';
    cout << "Would you like to store the dataset to input_data.csv? y/n: " << endl;
    cin >> show_data;
    cout << endl;
    if(show_data == 'y') {
        indatafile << endl;
        for(int i = 0; i < n; i++) {
            for(int j = 0; j < p; j++) {
                indatafile << setw(10) << input_data[i][j] << ", ";
            }
            indatafile << endl;
        }
    }

    // The data to be processed is now stored in input_data
    /* NEAREST NEIGHBOR MATRIX */
    vector< vector<float> > nearest_neighbor(n, vector<float>(n+1,0)); // matrix size: n x (n+1)
    float sum;
    float sum_dist;               // sum of distances to all NN
    vector<float> sort_vector(n); // scratch row for sorting (replaces a non-standard variable-length array)

    /* The following loop produces an n x n matrix that contains the distance
       between each pair of observations in the dataset.
       Output: nearest_neighbor (n x (n+1) matrix), with
       nearest_neighbor[i][n] = sum of all distances from observation i */
    for(int i = 0; i < n; i++) {
        sum_dist = 0;
        for(int j = 0; j < n; j++) {
            // distance from observation i to observation j
            sum = 0;
            for(int k = 0; k < p; k++) {
                sum += pow(float(input_data[i][k] - input_data[j][k]), 2);
            }
            nearest_neighbor[i][j] = sqrt(sum);
            // running sum of distances
            sum_dist += nearest_neighbor[i][j];
        }
        nearest_neighbor[i][n] = sum_dist;
    }

    /* In the following loop, the nearest_neighbor matrix is sorted. Each row
       corresponds to an observation and becomes a sorted vector of distances
       to each neighbor.
       Output: nearest_neighbor (n x n sorted matrix; column n, the distance
       sum, is left untouched) */
    for(int i = 0; i < n; i++) {
        for(int j = 0; j < n; j++) {
            sort_vector[j] = nearest_neighbor[i][j];
        }
        sort(sort_vector.begin(), sort_vector.end());
        for(int j = 0; j < n; j++) {
            nearest_neighbor[i][j] = sort_vector[j];
        }
    }

    /* Run the Estimators */
    // K starts at n-1: column n holds the distance sum, so it is not a valid NN index
    for( neighbor_k = n - 1; neighbor_k > 1; neighbor_k--) {
        if( neighbor_k > 40 ) {
            neighbor_k = neighbor_k - 20; // take coarser steps while K is large
        }

        /* LEVINA & BICKEL MLE (with the MacKay & Ghahramani revision) */
        int_dim_mle(n, p, levina_set_estimate, mackay_set_estimate, nearest_neighbor, neighbor_k);

        // output the results to the outfile
        cout << " " << neighbor_k - 1;
        outfile << neighbor_k - 1 << "," << levina_set_estimate << "," << mackay_set_estimate << ",";

        /* JAIN ESTIMATOR */
        d_this = int_dim_reg(n, p, nearest_neighbor, neighbor_k);

        // output the results to the outfile
        outfile << d_this << "," << endl;
    }
    cout << endl << "Estimates are in: output.csv" << endl;

    // close the files
    outfile.close();
    indatafile.close();
    return 0;
}

int_dim_reg.h

/* Header File: int_dim_reg.h
   Main Program: int_dim.cpp
   Req Header Files: int_dim_reg.h, int_dim_mle.h, random_gen.h and gauss_gen.h */

#include <cmath>
#include <math.h>
#include <algorithm>

using namespace std;

float int_dim_reg(int &n, int &p, vector< vector<float> >& nearest_neighbor, int &neighbor_k)
{
    int count;
    float k_float;
    float epsilon = 0.01;   // convergence tolerance (same notation as paper)
    int maxiter = 10;       // iteration maximum
    float mmax;             // notation from paper
    float s2max;            // notation from paper
    vector<float> log_t_hat(neighbor_k + 1); // index k runs 1..neighbor_k (replaces a variable-length array)
    int n_mod;
    vector<float> log_g(neighbor_k + 1);
    float d_this;           // estimated dimension at THIS iteration
    float d_previous;       // estimated dimension at the PREVIOUS iteration
    float sum1;
    float sum2;
    float sum_log_k;
    float sum;
    float d_denominator;    // d_i denominator estimator from paper

    sum_log_k = 0;

    // use the "neighbor_k" argument for K (notation from paper)

    /* The following two loops calculate the values required to remove outliers.
       Output: mmax, s2max */
    sum = 0;
    for(int j = 0; j < n; j++) {
        sum += nearest_neighbor[j][neighbor_k];
    }
    mmax = ( 1 / float( n ) ) * sum;
    sum = 0;
    for(int j = 0; j < n; j++) {
        sum += pow( ( nearest_neighbor[j][neighbor_k] - mmax ) , 2);
    }
    s2max = (1 / (float( n ) - 1) ) * sum;

    /* Remove outliers and calculate the sample-averaged distance to the kth NN.
       Output: log_t_hat[k] (log of the sample-averaged distance to the kth NN) */
    for(int k = 1; k <= neighbor_k; k++) {
        sum = 0;
        n_mod = n;
        for(int j = 0; j < n; j++) {
            if( nearest_neighbor[j][neighbor_k] <= (mmax + sqrt(s2max) ) ) {
                sum += nearest_neighbor[j][k];
            }
Output: d_this (contains the initial estimate of ID) */ d_previous = 0; d_this = 0; for(int k = 0; k <= neighbor_k; k++) { log_g[k] = 0; // initialize log_g to zero } sum1 = 0; sum2 = 0; d_denominator = 0; for(int k = 1; k <= neighbor_k; k++) { sum_log_k += log( float(k) ); d_denominator += pow( log( float(k) ) , 2); sum1 += log( float(k) ) * log_t_hat[k]; sum2 += log_t_hat[k]; } d_denominator = float( neighbor_k ) * d_denominator - pow( sum_log_k, 2 ); d_this = d_denominator / ( float( neighbor_k ) * sum1 - ( sum_log_k * sum2 ) ); /* while loop Description: Iterate to better estimate ID. At each iteration, log_g[k] is calculated based on previous dimension estimate. Output: d_this (final ID estimate) */ count = 0; while( d_this - d_previous > epsilon && count < maxiter ) { d_previous = d_this; count ++; for(int k = 1; k <= neighbor_k; k++) { k_float = float( k ); log_g[k] = ( ( d_previous - 1 ) / ( 2 * k_float * pow(d_previous, 2) ) ) + ( ( d_previous - 1 ) * ( d_previous - 2 ) / ( 12 * pow(k_float, 2) * pow(d_previous, 3) ) ) - ( pow( ( d_previous - 1 ), 2 )/ ( 12 * pow(k_float, 3) * pow(d_previous, 4) ) ) - ( (d_previous - 1)*(d_previous - 2)*( pow(d_previous, 2) + 3*d_previous - 3 ) / ( 120 * pow(k_float, 4) * pow(d_previous, 5) ) ); } sum1=0; sum2=0; sum_log_k=0; d_denominator=0; for(int k = 1; k <= neighbor_k; k++) { sum_log_k += log( float(k) ); d_denominator += pow( log( float(k) ) , 2); sum1 += log( float(k) ) * (log_t_hat[k] + log_g[k]); sum2 += (log_t_hat[k] + log_g[k]); } d_denominator = float( neighbor_k ) * d_denominator - pow( sum_log_k, 2 ); d_this = d_denominator / ( float( neighbor_k ) * sum1 - ( sum_log_k * sum2 ) ); } return d_this; } 32 int_dim_mle.h /* Main Program: int_dim.cpp Req Header Files: int_dim_reg.h, int_dim_mle.h, random_gen.h and gauss_gen.h /* #include <cmath> using namespace std; void int_dim_mle(int &n, int &p, float &levina_set_estimate, float &mackay_set_estimate, vector< vector<float> >& nearest_neighbor, int &neighbor_k) { float mackay_point_estimate; float levina_point_estimate; float sum; // running total of a SUM(...) int x; //observation # //initializing variables for levina estimate levina_set_estimate = 0; mackay_set_estimate = 0; /* for loop Description: Provides a point estimate of intrinsic dimension for each observation in the dataset. 
       Number of iterations: n
       Output: levina_set_estimate (average of the Levina & Bickel point estimates)
       Output: mackay_set_estimate (inverse of the averaged inverse point estimates) */
    for(x = 0; x < n; x++ ) {
        levina_point_estimate = 0;
        sum = 0;
        for(int i = 1; i < neighbor_k; i++) {
            if(nearest_neighbor[x][neighbor_k] != 0 && nearest_neighbor[x][i] != 0 ) {
                sum += log( nearest_neighbor[x][neighbor_k] / nearest_neighbor[x][i] );
            }
        }
        levina_point_estimate = (neighbor_k - 1) * 1/sum;
        mackay_point_estimate = sum;
        levina_set_estimate += levina_point_estimate;
        mackay_set_estimate += mackay_point_estimate;
    }
    levina_set_estimate = levina_set_estimate / n;
    mackay_set_estimate = mackay_set_estimate / (n * (neighbor_k - 1) );
    mackay_set_estimate = 1 / mackay_set_estimate;
}

random_gen.h

/* Header File: random_gen.h
   Main Program: int_dim.cpp
   Req Header Files: int_dim_reg.h, int_dim_mle.h, random_gen.h and gauss_gen.h */

/* from "Numerical Recipes in C++", Second Edition,
   Press, Teukolsky, Vetterling, Flannery. ISBN 0-521-75033-4 [7] */

float ran1(int &idum)
{
    const int IA=16807, IM=2147483647, IQ=127773, IR=2836, NTAB=32;
    const int NDIV = (1+(IM-1)/NTAB);
    const double EPS=3.0e-16, AM=1.0/IM, RNMX=(1.0-EPS);
    static int iy=0;
    static int iv[NTAB];
    int j,k;
    double temp;

    if (idum <= 0 || !iy) {
        if (-idum < 1) idum = 1;
        else idum = -idum;
        for (j=NTAB+7; j>=0; j--) {
            k = idum/IQ;
            idum = IA*(idum-k*IQ)-IR*k;
            if (idum < 0) idum += IM;
            if (j < NTAB) iv[j] = idum;
        }
        iy = iv[0];
    }
    k = idum/IQ;
    idum = IA*(idum-k*IQ)-IR*k;
    if (idum < 0) idum += IM;
    j = iy/NDIV;
    iy = iv[j];
    iv[j] = idum;
    if ((temp=AM*iy) > RNMX) return float( RNMX );
    else return float( temp );
}

gauss_gen.h

/* Header File: gauss_gen.h
   Main Program: int_dim.cpp
   Req Header Files: int_dim_reg.h, int_dim_mle.h, random_gen.h and gauss_gen.h */

/* from "Numerical Recipes in C++", Second Edition,
   Press, Teukolsky, Vetterling, Flannery. ISBN 0-521-75033-4 [7] */

#include <cmath>

using namespace std;

float gasdev(int &idum)
{
    static int iset = 0;
    static double gset;
    double fac, rsq, v1, v2;

    if (idum < 0) iset = 0;
    if (iset == 0) {
        do {
            v1 = 2.0*ran1(idum)-1.0;
            v2 = 2.0*ran1(idum)-1.0;
            rsq = v1*v1+v2*v2;
        } while (rsq >= 1.0 || rsq == 0.0);
        fac = sqrt(-2.0*log(rsq)/rsq);
        gset = v1*fac;
        iset = 1;
        return float( v2*fac );
    }
    else {
        iset = 0;
        return float( gset );
    }
}