Manifold Learning and Its Applications: Papers from the AAAI Fall Symposium (FS-10-06)

Building a Job Landscape from Directional Transition Data

Dominique Perrault-Joncas and Marina Meilă
Department of Statistics
University of Washington

Marc Scott
Department of Humanities and Social Sciences
New York University

robust to sampling fluctuations, but that some of the coordinates are related to significant demographic variables like gender, wages, and time/age. In doing this, our goal is to provide a robust instrument for visualizing data on careers that will allow social scientists to uncover trajectory differences between different demographic groups, or groups with different levels of education. For example, do all workers, regardless of education, begin their careers fairly similarly? What distinguishes the careers of economic winners and losers, in terms of the timing and structure of their traversals? The job embedding will allow scientists to answer questions such as these.

In the following sections, we present the data used for the study (Section 2), we show how we built the jobs landscape (Section 3) and we examine the landscape’s robustness to sampling noise by using the bootstrap method (Section 4). We then use the jobs landscape to visualize various characteristics of the job market (Section 5), and the evolution in time of individuals with varying education levels (Section 6). We also compare our embedding to other embedding methods from the literature (Section 7). A discussion (Section 8) concludes the paper.

Abstract

The analysis of career paths suffers from a lack of exploratory tools and dynamic models, due in part to the inherent high dimensionality of the problem. Paths may be understood as directed traversals through a graph whose nodes consist of “job types”, which we define as industry and occupation pairs.
We want to develop tools to understand and detect high-level features of both the labor market and the workers moving through it – career dynamics. To do this, we map the discrete space of jobs into a d-dimensional continuous space; proximity between jobs will mean that they are “close” to each other in a non-negligible subset of career paths. This embedding allows one to visualize the job landscape. Moreover, we can map individual career paths, or groups of them, to this space, extract features of their collective structure, and construct statistical tests comparing groups by means of this mapping.

1 Introduction

At the origin of this work is an analysis of career mobility using data from the National Longitudinal Survey of Youth (NLSY), a study that followed several thousand men and women in the U.S. from 1979 until recently. The participants were aged 14-21 at the beginning of the study, and constitute a representative sample from the U.S. population. The NLSY is often used to better understand the forces and factors that influence a person’s career path in its early stages. Thus, the data contains each individual’s work history from age 20 to 36, with job status recorded quarterly, for a total of 64 job tokens per individual. We call this a career.

One immediately recognizes a fundamental challenge with data of this nature. Jobs (a set of nominal states described in the next section) have no natural ordering or structure, and the number of states is potentially very large. Thus, the comparison of individuals’ career paths is limited to methods for comparing sequences over discrete alphabets; see (Abbott 1995) and (Durbin et al. 1998).

Our methodology utilizes information in the transitions between job types to construct a Euclidean space in which our fundamental unit, the job type, will reside. This embedding method is derived from the WCut algorithm of (Meilă and Pentney 2007), a spectral algorithm used for clustering in directed graphs.
We show that this embedding is not only

2 The Career Data

As noted in Section 1, when we refer to a career, we mean a sequence of job tokens of the form (occupation, industry). We use 25 unique industry and 20 unique occupation codes from the larger set of 3-digit 1970 Census Classification codes. The details of the jobs aggregation as well as other data cleaning operations are described in (Scott forthcoming). All career sequences have length 64 (quarterly job state over 16 years). In the 16-year age span studied, approximately 450 unique industry and occupation pairings (hereafter referred to as IxOs) occur. The sample size is 7,816 individual careers. We reweight the sample so that it is consistent with the demographics of the original baseline sample. The weights reflecting this are called the population weights.

In the next section, we build our embedding based only on the discrete career data described above. After the embedding is constructed, we examine the mapping of various available demographic data into this manifold. Demographic variables used in this study include self-reported race/ethnicity (Black, Hispanic, Non-Black/Non-

Copyright © 2010, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

ply not looking for work. This was done for two reasons: (1) these tokens can dominate the transition count matrix even though they provide limited (or non-content) information; (2) these states are often seasonal and hence contain little information about career progression. To avoid losing too much information by removing the transitions to and from the removed states, we use a step-over approach. Specifically, if we observe the sequence j → x → x → ... → x → i, where x is a state to remove, the transition j → i will be recorded. This retains the continuity in the sequence and recognizes the fact that j and i are related (through x).
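The step-over rule described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `step_over_transitions` and the job labels are hypothetical, and a career is assumed to be a plain sequence of hashable state labels.

```python
from typing import Hashable, Iterable, List, Set, Tuple


def step_over_transitions(
    career: Iterable[Hashable], removed: Set[Hashable]
) -> List[Tuple[Hashable, Hashable]]:
    """Return consecutive (from, to) pairs of a career, skipping removed states."""
    transitions = []
    previous = None  # most recent retained job, if any
    for state in career:
        if state in removed:
            # Step over: `previous` is kept, so j -> x ... x -> i yields (j, i).
            continue
        if previous is not None:
            transitions.append((previous, state))
        previous = state
    return transitions


# Hypothetical labels, for illustration only.
career = ["retail", "retail", "UNEMPLOYED", "UNEMPLOYED", "clerical", "SCHOOL", "clerical"]
pairs = step_over_transitions(career, removed={"UNEMPLOYED", "SCHOOL"})
```

In this sketch, self-transitions such as ("retail", "retail") are still emitted; they correspond to diagonal counts, which can be discarded in a separate step.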
Once low-frequency jobs and non-content-specific states are removed, we are left with n = 356 jobs out of the original total of 457.

Hispanic), sex, age and education. While education changes over time, we simplify its inclusion by dichotomizing workers into those who complete at least a two-year degree by age 24 and those who do not. Age 24 was chosen because it is a year or two past the traditional timeline for completing undergraduate education. Workers who have not completed a two-year degree by this point face very different prospects in the labor market. Wages are hourly and inflation-adjusted to reflect 2008 dollars.

3 The job landscape – embedding the career data in d-dimensional space

Our first goal is to map the jobs to a d-dimensional space, in a way that renders closest the jobs between which frequent transitions are observed. In order to do so, we compute the affinity matrix $A$ from the original data, where $A_{ij}$ represents the number of times a transition from job $i$ to job $j$ is observed. Here the workers’ population weights are used in weighting each transition so that the resulting transition matrix is consistent with the demographics of the original baseline sample. This asymmetric affinity matrix ($A_{ij} \neq A_{ji}$ in general) will be used to produce the embedding.

It is worth noting explicitly that our map creation process collapses information about the timing of the transitions between jobs; as such, transitions that occur early in individuals’ careers are indistinguishable from transitions that occur later. While it may be interesting to allow the job market to change over time, or even to examine early transitions separately from later transitions, we do not take this approach in our paper.

3.1

3.2 Embedding by the Weighted Cuts method

We now have an n × n affinity matrix $A$ that contains transition counts between the n = 356 retained tokens, obtained as described in the previous section.
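As a rough sketch of how such a weighted transition-count matrix could be accumulated, the snippet below assumes that each career is a sequence of integer job indices and that each worker carries one population weight. The name `weighted_affinity` and the toy data are hypothetical.

```python
import numpy as np


def weighted_affinity(careers, weights, n_jobs):
    """Accumulate A[i, j] = population-weighted count of transitions i -> j."""
    A = np.zeros((n_jobs, n_jobs))
    for career, w in zip(careers, weights):
        # Pair every token with its successor in the same career.
        for src, dst in zip(career[:-1], career[1:]):
            A[src, dst] += w
    return A


# Toy data: two workers with population weights 2.0 and 0.5.
careers = [[0, 1, 1, 2], [1, 0]]
weights = [2.0, 0.5]
A = weighted_affinity(careers, weights, n_jobs=3)
```

Because a transition from job 0 to job 1 is counted separately from a transition from job 1 to job 0, the resulting matrix is generally not symmetric.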
Mapping the tokens to a d-dimensional space is done via the Weighted Cut (WCut) algorithm introduced by (Meilă and Pentney 2007), which we briefly describe here. The input data is the matrix of affinities $A$ and a user-specified dimension $d$. For the purposes of visualization, we will use d = 3 or d = 4; for other purposes, such as computing statistics from the embedding, we extend to d = 10. It is worth noting that since the WCut algorithm is based on eigenvalue decomposition, selecting the number of dimensions is not a critical operation. Indeed, every dimension is computed independently of the others, so adding or removing dimensions does not affect the rest of the embedding.

Preprocessing the affinity matrix

In preliminary embedding experiments, we observed that embeddings suffer from outliers: jobs that are rarely observed. To improve the embedding, we removed all jobs observed fewer than 5 times (according to population weights), reasoning that so few observations prevent us from making any significant observations about these jobs. We verified experimentally that the embedding of the high and medium frequency jobs is not sensitive to the exact cutoff value used to eliminate outliers.

The second operation performed on the affinity matrix before embedding was to remove the diagonal by setting $A_{ii}$ to zero. The diagonal element $A_{ii}$ counts the transitions of $i$ into itself (i.e., the times job $i$ was not changed for another job during a quarter). These counts dominate the data, with the diagonal of $A$ totalling about 8 times more than its off-diagonal elements. The elements $A_{ii}$ tell us very little about the job in relation to others, and also surprisingly little about the job itself. On average, workers remain at their jobs for about two years regardless of the particular job. Hence, the embedding of the data will only consider the transitions between jobs, and not how long one stays in the same job before transitioning.
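The two preprocessing operations above (dropping rarely observed jobs and zeroing the diagonal) might look like the following sketch. It assumes a separate vector `job_counts` of population-weighted observation counts per job; that vector, the function name and the toy numbers are illustrative assumptions rather than part of the original analysis.

```python
import numpy as np


def preprocess_affinity(A, job_counts, min_count=5.0):
    """Drop jobs observed fewer than `min_count` times and zero the diagonal."""
    A = np.asarray(A, dtype=float)
    job_counts = np.asarray(job_counts, dtype=float)
    keep = np.flatnonzero(job_counts >= min_count)  # indices of retained jobs
    A_kept = A[np.ix_(keep, keep)].copy()           # rows and columns of kept jobs
    np.fill_diagonal(A_kept, 0.0)                   # discard self-transitions
    return A_kept, keep


# Toy example: job 1 is observed only 3 times and is removed.
A = np.array([[40.0, 1.0, 3.0],
              [2.0, 1.0, 0.0],
              [4.0, 0.0, 30.0]])
job_counts = np.array([44.0, 3.0, 34.0])
A_kept, keep = preprocess_affinity(A, job_counts)
```

Returning `keep` alongside the reduced matrix makes it possible to map embedding rows back to the original job indices.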
This is akin to considering the data in the framework of a Semi-Markov process where only the Markov component is of interest. The third step is to remove the data that represents time spent unemployed, in school, in vocational training, or sim-

Algorithm WCut
1. Input $A$, $d$
2. Calculate the degrees $D_i = \sum_j A_{ji}$ and form the diagonal matrix $D = \mathrm{diag}\{D_i\}$
3. Calculate the $d$ largest eigenvalues $\lambda_{1:d}$ of the matrix $D^{-1/2}\,\frac{A + A^T}{2}\,D^{-1/2}$ and the corresponding eigenvectors $y^{(1)} \ldots y^{(d)}$. Let $Y = [y^{(1)} \ldots y^{(d)}]$.
4. Compute $X = D^{-1/2} Y$, $X = [x^{(1)} \ldots x^{(d)}]$.
5. Map every job $i = 1, 2, \ldots, n$ to the d-dimensional point $(x_i^{(1)} \ldots x_i^{(d)})$

Figure 1: The WCut embedding algorithm.

The idea behind this method is derived from clustering. If one wanted to separate the data into d clusters, a “good” clustering would satisfy two conditions: (1) put tokens with high affinity in the same cluster, and (2) keep the cluster sizes balanced. In (Meilă and Pentney 2007) it is proved that both conditions can be optimized by mapping the data into the principal subspace of a symmetric matrix obtained from $A$ (which is asymmetric in general). Thus, the WCut embedding method is reminiscent of Principal Components Analysis (PCA), and could loosely be thought of as PCA for directed graph data. Although our goal is not clustering but embedding, this method will be satisfactory because it will pull together in space tokens that have high affinity. Meanwhile, the “balance” requirement will attempt to spread the tokens about evenly, instead of collapsing them all together.

In (Meilă and Pentney 2007) the user is allowed to choose positive token weights by which the balancing will be judged. For our study, we chose the weight of job $i$ to be equal to the number of transitions into $i$, that is $\sum_j A_{ji}$. Other possible alternatives are to have equal weights, or to have weights equal to the row sums of $A$, which represent the number of transitions out of $i$.
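The steps of Figure 1 can be sketched with NumPy as below. This is an illustrative reading of the algorithm as stated in this section (in-degree weights, symmetrized affinities, top-d eigenvectors), not the authors' implementation; the function name `wcut_embedding` and the toy matrix are assumptions.

```python
import numpy as np


def wcut_embedding(A, d):
    """Embed the n nodes of a directed affinity matrix A into d dimensions."""
    A = np.asarray(A, dtype=float)
    degrees = A.sum(axis=0)  # D_i = sum_j A_ji: weighted transitions into job i
    if np.any(degrees <= 0):
        raise ValueError("every job needs at least one incoming transition")
    inv_sqrt_d = 1.0 / np.sqrt(degrees)
    # D^{-1/2} (A + A^T)/2 D^{-1/2}, symmetric by construction.
    M = inv_sqrt_d[:, None] * (0.5 * (A + A.T)) * inv_sqrt_d[None, :]
    eigvals, eigvecs = np.linalg.eigh(M)   # eigenvalues in ascending order
    top = np.argsort(eigvals)[::-1][:d]    # indices of the d largest eigenvalues
    Y = eigvecs[:, top]
    X = inv_sqrt_d[:, None] * Y            # X = D^{-1/2} Y
    return X, eigvals[top]


# Toy symmetric example: the first coordinate is then constant across nodes.
A = np.array([[0.0, 2.0, 1.0],
              [2.0, 0.0, 3.0],
              [1.0, 3.0, 0.0]])
X, lam = wcut_embedding(A, d=2)
```

Row `i` of `X` is the d-dimensional location of job `i`. Eigenvectors are only defined up to sign, so two runs on slightly different data may need to be aligned before they are compared.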
Our choice is preferred because the resulting embedding has fewer outliers (practically none) and is more robust. In addition, the frequency of transitions into a job can measure the relative desirability of that job, just as in other domains the number of links to a page, or the number of times a paper is cited, measures its authority and therefore its “weight” more accurately than the outgoing links. In the career landscape, there are jobs that function as sources (i.e. have many outgoing transitions) and, as such, are more typical of the early career stage. Examples include many clerical jobs, retail sales, and food services. When these jobs serve only as a source, we would not want our configuration of the career landscape to reflect them to any great extent. However, some of the same jobs that are common in the early career are also common to low-wage careers; to the extent that workers are making transitions into such jobs, we want them reflected in the embedding through the weights (long-term careers in retail and waiter/waitress jobs are potential low-wage examples).

3.3 Perturbation analysis and the first coordinate

The WCut being an extension of the Multiway Normalized Cut (MNCut) algorithm (Shi and Malik 2000; Meilă and Shi 2001), the MNCut can serve as a good starting point to build some intuition about the WCut. One interesting approach is to focus on how the two embeddings differ. Doing this gives some insight into how WCut embeds the asymmetric information of the graph, which is absent in the purely symmetric MNCut. This is an important aspect of embedding the job market data as this asymmetry is the only information that pertains to the natural career progression.

To determine how the WCut differs from the MNCut, we consider the directed graph represented by the affinity matrix $A$ as a perturbation of the undirected graph $(A + A^t)/2$. That is, we decompose $A$ into its hermitian $H(A) = (A + A^t)/2$ and anti-hermitian $AH(A) = (A - A^t)/2$ components and assume that the anti-hermitian component is a small perturbation to the undirected graph. To make this assumption explicit, we define $A_\epsilon = H(A) + \epsilon\, AH(A)$ and we consider how the WCut embeds this anti-hermitian component, i.e. the directional perturbation of the undirected graph described by $H(A_\epsilon) \equiv H(A)$. Going back to the definition of the WCut algorithm (Figure 1), we are interested in the eigenvalues and eigenvectors of $H(B_\epsilon)$:

$$H(B_\epsilon) = D_\epsilon^{-1/2}\,\bigl(D_\epsilon - H(A)\bigr)\,D_\epsilon^{-1/2}, \tag{1}$$

with $D_\epsilon = \mathrm{diag}(A_\epsilon \mathbf{1})$. The perturbed outdegree matrix $D_\epsilon$ can be decomposed as

$$D_\epsilon = \mathrm{diag}(H(A)\mathbf{1}) + \epsilon\,\mathrm{diag}(AH(A)\mathbf{1}) \equiv S + \epsilon C, \tag{2}$$

where both $S$ and $C$ are diagonal. This means that we can apply a Taylor series to the diagonal elements of $D_\epsilon$ to obtain $D_\epsilon^{-1/2}$:

$$D_\epsilon^{-1/2} = S^{-1/2} - \frac{\epsilon}{2}\, S^{-3/2} C + o(\epsilon). \tag{3}$$

Substituting (2) and (3) into (1) gives:

$$H(B_\epsilon) = H(B_0) + \epsilon \left( \frac{S^{-1}C}{2}\, H(B_0) + H(B_0)\, \frac{S^{-1}C}{2} + S^{-1}C \right) + o(\epsilon) \equiv H^{(0)} + \epsilon H^{(1)} + o(\epsilon). \tag{4}$$

With this definition, both $H^{(0)}$ and its perturbation $H^{(1)}$ are hermitian. This means that we can use regular perturbation theory in obtaining the first order effect of the directional perturbation $\epsilon\, AH(A)$. We assume that we know the eigenvectors of the MNCut for the undirected graph $H(A)$:

$$H^{(0)} y_i^{(0)} = \lambda_i^{(0)} y_i^{(0)}, \quad i = 1, \ldots, n, \tag{5}$$

where the eigenvectors are orthonormal, $y_i^{(0)t} y_j^{(0)} = \delta_{i,j}$. Meanwhile, the eigenproblem $H(B_\epsilon)\, y_i = \lambda_i y_i$ is assumed to have the expansion:

$$y_i = y_i^{(0)} + \epsilon\, y_i^{(1)} + o(\epsilon), \qquad \lambda_i = \lambda_i^{(0)} + \epsilon\, \lambda_i^{(1)} + o(\epsilon).$$

Expanding $y_i^{(1)}$ in terms of the $y_j^{(0)}$’s gives the standard first order perturbation:

$$y_i^{(1)} = \sum_{j \neq i} \frac{y_j^{(0)t} H^{(1)} y_i^{(0)}}{\lambda_i^{(0)} - \lambda_j^{(0)}}\; y_j^{(0)}, \tag{6}$$

$$\lambda_i^{(1)} = y_i^{(0)t} H^{(1)} y_i^{(0)}. \tag{7}$$

Though an interesting exercise, expressing the directional perturbation of the graph in terms of the eigenvectors of the MNCut embedding has not provided any substantial insight so far.
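The first order formulas (6) and (7) can be checked numerically on a small example. The sketch below is a generic illustration: it treats $H^{(0)}$ and $H^{(1)}$ as arbitrary symmetric matrices with distinct unperturbed eigenvalues (an assumption made here for simplicity) rather than matrices built from career data, and the helper names are hypothetical.

```python
import numpy as np


def split_direction(A):
    """Return the hermitian part (A + A^t)/2 and anti-hermitian part (A - A^t)/2."""
    A = np.asarray(A, dtype=float)
    return 0.5 * (A + A.T), 0.5 * (A - A.T)


def first_order_eigen(H0, H1):
    """First-order eigenvalue and eigenvector corrections for H0 + eps * H1."""
    lam0, Y0 = np.linalg.eigh(H0)
    G = Y0.T @ H1 @ Y0                    # G[j, i] = y_j^t H1 y_i
    lam1 = np.diag(G).copy()              # equation (7)
    gaps = lam0[None, :] - lam0[:, None]  # gaps[j, i] = lam0_i - lam0_j
    np.fill_diagonal(gaps, np.inf)        # removes the j == i term from the sum
    Y1 = Y0 @ (G / gaps)                  # equation (6); column i is y_i^(1)
    return lam0, lam1, Y0, Y1


# Arbitrary symmetric test matrices with well separated eigenvalues.
H0 = np.diag([0.0, 1.0, 2.5, 4.0])
H1 = np.array([[0.2, 0.5, -0.1, 0.3],
               [0.5, -0.4, 0.2, 0.0],
               [-0.1, 0.2, 0.1, 0.6],
               [0.3, 0.0, 0.6, -0.3]])
eps = 1e-4
lam0, lam1, Y0, Y1 = first_order_eigen(H0, H1)
exact = np.linalg.eigvalsh(H0 + eps * H1)  # compare with lam0 + eps * lam1
```

For small `eps`, `exact` and `lam0 + eps * lam1` should differ only at second order in `eps`. The helper `split_direction` also makes it easy to verify that the row sums of the anti-hermitian part equal half the difference between out-degrees and in-degrees.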
To extract meaning from (6) and (7), we need to appeal to the alternative interpretation of the MNCut, specifically the Markov Chain with transition matrix $P = S^{-1} H(A)$. The eigenproblem $P x_i^{(0)} = \gamma_i^{(0)} x_i^{(0)}$ for the transition matrix is equivalent to (5) through $x_i^{(0)} = S^{-1/2} y_i^{(0)}$ and $\gamma_i^{(0)} = 1 - \lambda_i^{(0)}$.

The largest eigenvalue $\gamma_1^{(0)} = 1$ contains no information about the graph since $P\mathbf{1} = \mathbf{1}$ by virtue of $P$ being a stochastic matrix. So any information contained in $x_1 = \mathbf{1} + \epsilon\, x_1^{(1)} + o(\epsilon)$ can only come from the directional perturbation to the graph, making the first coordinate of the embedding particularly relevant here.

To assess what directional information $x_1$, and hence $y_1$, contains, it is worth taking a closer look at (6). Specifically, the interesting question is which 0th order eigenvectors will most contribute to the first order perturbation. In other words, for which $y_j^{(0)}$ is the coefficient

$$\frac{y_j^{(0)t} H^{(1)} y_1^{(0)}}{-\lambda_j^{(0)}}, \quad j \neq 1 \tag{8}$$

going to be large? Obviously, a smaller $\lambda_j^{(0)}$ implies a larger coefficient, but this is not saying much beyond the fact that smaller $\lambda_j^{(0)}$’s are generally associated with the important eigenvectors. What is interesting is to determine when the numerator is large.

From (4) and the fact that the $y_j^{(0)}$’s are eigenvectors of $H(B_0)$, the numerator of (8) takes the form:

$$y_j^{(0)t} H^{(1)} y_1^{(0)} = \left( \frac{\lambda_j^{(0)}}{2} + 1 \right) x_j^{(0)t} C \mathbf{1}. \tag{9}$$

Figure 2: Embedding for coordinates $x^{(2)}$, $x^{(3)}$, $x^{(4)}$. The color map for this embedding corresponds to the job frequency with red for low frequency, green for medium frequency and blue for high frequency.

Also, somewhat remarkably, no fragmentation or clustering is visible; the job landscape appears continuous. In the figure, the tokens are colored by frequency, making it visible that the embedding is not stratified by this feature.
Both these characteristics suggest that the geometry obtained represents collective properties of the set of jobs, and is not overwhelmingly dependent on a small subset of high frequency jobs. Before proceeding to draw conclusions from the mapping, we need to validate that the embedding represents genuine, albeit yet unknown, features of the population and that it is not an artifact of the algorithm employed.

The term of interest is $x_j^{(0)t} C \mathbf{1} = x_j^{(0)t} (AH(A)\mathbf{1})$. This term will be largest if $x_j^{(0)}$ is parallel to $AH(A)\mathbf{1}$, meaning that the first order perturbation to $x_1$ will favor eigenvectors $x_j^{(0)}$ which are closely aligned with $AH(A)\mathbf{1}$. This vector, by definition, corresponds to the out-degrees minus in-degrees of each job of the graph divided by two. In other words, the first order perturbation of $x_1$ will tend to align itself with the net divergence/flow of each job. As such, the directionality of the graph is partially embedded in the first eigenvector $x_1$ in that it separates nodes with different divergence, such as source vs. sink, while grouping nodes with similar divergence.

In the context of the job market, this is borne out by survey data. Although the first coordinate shows more structure than simply the first order perturbation described above, it does separate jobs according to whether they have net positive or negative divergence quite well. This interpretation of the first eigenvector is especially interesting here in that directional information, including divergence, is mainly temporal, i.e. along the natural career progression. Hence, the first eigenvector of the WCut embedding is highly correlated with time.

4.1 Stability analysis by Bootstrap

For this purpose we Bootstrap the data to obtain new embeddings. We then compare these new embeddings with the original embedding by using Procrustes and then computing the covariance of the coordinates for each job. The Bootstrap confirms that the embedding is stable.
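A minimal sketch of the comparison step is given below. It assumes an orthogonal Procrustes alignment (rotation and reflection only, with no rescaling or translation), which may differ from the exact variant used in the study, and it assumes that every Bootstrap embedding has the same rows (jobs) in the same order as the reference. The function names and synthetic data are hypothetical.

```python
import numpy as np


def procrustes_align(X_ref, X_new):
    """Rotate/reflect X_new to best match X_ref in the least-squares sense."""
    U, _, Vt = np.linalg.svd(X_new.T @ X_ref)
    return X_new @ (U @ Vt)  # orthogonal Procrustes solution R = U V^T


def bootstrap_covariances(X_ref, boot_embeddings):
    """Per-job covariance of aligned Bootstrap coordinates, shape (n, d, d)."""
    aligned = np.stack([procrustes_align(X_ref, Xb) for Xb in boot_embeddings])
    centered = aligned - aligned.mean(axis=0)
    n_boot = aligned.shape[0]
    return np.einsum("bni,bnj->nij", centered, centered) / (n_boot - 1)


# Synthetic stand-in for Bootstrap embeddings: rotated, slightly noisy copies.
rng = np.random.default_rng(0)
X_ref = rng.normal(size=(6, 3))
Q, _ = np.linalg.qr(rng.normal(size=(3, 3)))  # random orthogonal matrix
noisy = [X_ref @ Q + 0.01 * rng.normal(size=X_ref.shape) for _ in range(50)]
cov = bootstrap_covariances(X_ref, noisy)
```

In a real analysis, each element of `boot_embeddings` would come from resampling careers with replacement, rebuilding the affinity matrix, and re-running the embedding; the alignment removes the arbitrary sign and rotation of the eigenvectors before covariances are computed.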
For clarity, we plot the Bootstrap covariance ellipsoids (which would correspond to 68% confidence regions for normally distributed data) for the high frequency jobs only. Here, we define as high frequency those jobs that appear in 95% of our B = 1000 Bootstrap samples. These covariances are shown in Figure 3, while all the low frequency token locations are marked by gray dots. Note that while the displays are in three dimensions, the embeddings and Procrustes alignments were performed in d = 5 dimensions. The figure demonstrates that for jobs that are well represented in the sample, their relative locations are extremely stable. This effect is not just a predictable consequence of the concentration of frequency estimates. The mapping from the counts $A_{ij}$ to the locations $X$ is highly non-linear; in particular, it involves division by the counts $D_j$. To have

4 Validating the embedding

We applied the algorithm WCut to the 356 × 356 matrix $A$ obtained by the preprocessing described in Section 3.1. The resulting embedding is presented in Figure 2 in three dimensions. At first glance, one notes that the token distribution in space is relatively even, and that there are virtually no outliers in the 3 dimensions plotted. While these qualities by themselves do not guarantee success, they are necessary for having an informative set of coordinates.

that have held that token over the course of the study. For salary, the color axis corresponds to the mean salary in cents for the specific job (note: these are inflation-adjusted 2008 dollars). Finally, for the time variable, the color axis represents the mean time, in years, at which the token has been observed in the study. The plots also weight the area of each token in proportion to its frequency in the study so as to display its relative importance. For gender, we also show a “pair” plot of each coordinate along with the same color axis as the three-dimensional plot in Figure 5.
This gives a more detailed picture of the embedding for the first 4 coordinates.

Figure 3: Job embedding with covariances estimated by Bootstrap, dimensions $x^{(2)}$ to $x^{(4)}$. Covariances of high frequency jobs are shown along with a color map representing mean gender (with females depicted in red and males in blue). Low frequency jobs are shown as grey dots.

stability of the Bootstrap estimates, one needs the Jacobian of the mapping to be stable as well. This is what our Figure 3 demonstrates.

5 Visualizing the jobs landscape: meaning of the coordinates

Figure 4: Gender and the embedding (coordinates $x^{(1)}$, $x^{(2)}$, $x^{(4)}$). Each token color represents its proportion of females, with 100% female in red and 100% male in blue (the male/female proportion in the survey is 0.48 vs 0.52). A token’s size is proportional to its frequency relative to other tokens.

Next, we considered how the embedding relates to demographic variables such as gender, salary, and education. We find that certain coordinates are strongly associated with these variables. Specifically, gender is related to $x^{(2)}$, while salary is related to $x^{(1)}$ and, to a lesser extent, $x^{(3)}$ and $x^{(4)}$. As for race/ethnicity, it does not seem to be associated with any coordinates, at least among the first 10 considered so far. This remains true even when the data is resampled so that the racial/ethnic groups are present in the same proportion. For this reason, race/ethnicity will not be considered further in this paper.

There is another variable of interest, which is time. For obvious reasons, this variable is correlated with salary, and is associated with $x^{(1)}$ and $x^{(4)}$. In light of the perturbation analysis of the WCut algorithm, this is not surprising. Indeed, time (and to some extent salary) is the obvious source of asymmetry in the job market.
As the first coordinate’s deviation from a constant vector is controlled by this asymmetry (specifically, the first eigenvector tends to separate points according to whether they are in-degree or out-degree dominant), it is clear why the first coordinate orders tokens based on their average temporal position in individuals’ careers.

Figures 4, 6, and 7 show the embeddings that correspond to the demographic variables gender, wages and time. For gender, the color axis corresponds to the proportion of females

As a confirmation of which coordinate is associated with a given demographic variable, the regression results are shown in Table 1. This table shows the variation of the linear component of the regression model over the range of each coordinate (for the coordinates retained using BIC as a model selection criterion). As such, higher coefficients account for a larger linear change of a demographic variable over the range of the embedding.

6 Evolution in time

Another point of interest is to understand how groups with different education levels evolve in time. Education is tied to mobility, but the interplay between the amount of education, the type of work and gender is not fully understood. Under our fairly granular IxO scheme, are less-educated workers doing a qualitatively different type of work, or are they just on the lower end of the pay scale and perhaps skill set?
Figure 5: Gender and the embedding, four first coordinates (pairwise plots of $x^{(1)}$ to $x^{(4)}$). Each token color represents its proportion of females, with 100% female in red and 100% male in blue.

Figure 6: Wages and the embedding (coordinates $x^{(1)}$, $x^{(2)}$, $x^{(4)}$). Each token’s color represents the average wage, on a logarithmic scale (base 10). A token’s size is proportional to its frequency in the population.

We define four types of workers, classified by education at age 24 (two-year degree being the separating point) and gender to examine how each group moves through the embedding. We know that time and salary are most strongly associated with the first coordinate, and that the second coordinate separates IxOs by gender. Hence, we use coordinates $x^{(1)}$ and $x^{(2)}$ to study these four groups.

To get an idea of how each group evolves, we tracked the center of mass (mean) of each group in the embedding at successive time periods. We find significant differences in behavior between the four groups for $x^{(1)}$ and $x^{(2)}$. Figure 8 shows the evolution of each group along $x^{(1)}$ and $x^{(2)}$, while Figure 9 shows their evolution with time along $x^{(1)}$ only. For coordinate one, the less educated workers move through time (as indicated by the decreasing level of that coordinate) more slowly; this is suggestive of making less “typical” progress, such as shedding entry-level positions.
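The center-of-mass tracking described above could be computed along the following lines. This is a sketch under stated assumptions: careers are stored as an integer array of job indices with -1 marking quarters spent in removed states, group labels are plain Python values, and means are optionally weighted by per-worker weights. None of these conventions is taken from the original study.

```python
import numpy as np


def group_mean_trajectories(careers, groups, X, weights=None):
    """Mean embedding position of each group at each time step.

    careers: (n_workers, T) integer job indices, -1 where no retained job is held.
    groups:  length n_workers sequence of group labels.
    X:       (n_jobs, d) embedding coordinates.
    Returns a dict mapping each label to a (T, d) array (NaN where undefined).
    """
    careers = np.asarray(careers)
    X = np.asarray(X, dtype=float)
    labels = list(groups)
    w = np.ones(len(labels)) if weights is None else np.asarray(weights, dtype=float)
    n_steps = careers.shape[1]
    out = {}
    for g in sorted(set(labels)):
        rows = np.array([lab == g for lab in labels])
        traj = np.full((n_steps, X.shape[1]), np.nan)
        for t in range(n_steps):
            jobs = careers[rows, t]
            valid = jobs >= 0  # ignore quarters without a retained job
            if valid.any():
                traj[t] = np.average(X[jobs[valid]], axis=0, weights=w[rows][valid])
        out[g] = traj
    return out


# Toy example with three jobs embedded in two dimensions.
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])
careers = np.array([[0, 1], [2, -1], [1, 1]])
groups = ["a", "a", "b"]
traj = group_mean_trajectories(careers, groups, X)
```

Plotting `traj[g][:, 0]` against time for each group `g` would give a curve of the kind summarized for the first coordinate, one line per education and gender group.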
The second coordinate, which showed some relationship to gender, now suggests something more subtle: the genders are well separated within education groups, so the type of work that forms the career for these workers is fairly distinct and organized around the information in this coordinate.

are: (Pentney and Meilă 2005), an MNCut embedding for complex eigenvectors; (Zhou, Schölkopf, and Hofmann 2005), which constructs the symmetric matrix $A_z = (D_i^{-1/2} A D_o^{-1/2})(D_i^{-1/2} A D_o^{-1/2})^T$ and then applies MNCut to this matrix; and the directed Laplacian of (Andersen, Chung, and Lang 2007).

We applied the preprocessing in Section 3 to all embeddings.¹ Each embedding was aligned with the WCut embedding using coordinate axes 2 to 7 by the Procrustes transformation. Coordinate 1 is constant in all but the WCut embedding and was omitted. The Procrustes distortion, quoted for each embedding, represents the proportion of the variance not explained by the alignment and is a measure of agreement with the WCut (0 being perfect).

The embeddings by MNCut and Diffusion map are similar to the WCut embedding, with the Diffusion map exhibiting the effect of the rescaling of the axes by the λs (distortions of 0.01 and 0.04 respectively). This is exactly what we expect, given that the asymmetry in $A$ is weak (i.e. $(A - A^T)/2$ is small).² Interestingly, the directed embeddings of (Pentney and Meilă 2005) and of (Zhou, Schölkopf, and Hofmann 2005) are also relatively close to the WCut embedding (distortions 0.11 and 0.31). However, the directed Laplacian embedding is very different (distortion 0.97). This mapping is neither smooth nor informative, having much of the data collapsed in the origin, and the rest as outliers at different distances.

In addition, we also performed correlation tests of all the demographic variables on all the coordinates from these embeddings. The correla-

7 Related work

We compared our embedding with other graph embedding methods from the literature.
Most graph embedding methods assume that the graph is symmetric. In order to use them, we symmetrize the matrix $A$ by $A_s = (A + A^T)/2$. The best known of these are the MNCut/Laplacian eigenmap method (Shi and Malik 2000; Meilă and Shi 2001), (Belkin and Niyogi 2002) and the Diffusion map (Lafon and Lee 2006). The WCut embedding is identical to the MNCut embedding if $A$ is symmetric. The Diffusion map differs from the MNCut by rescaling the eigenvectors with the corresponding λ to a positive power. In our embedding this power is 1. Embedding methods that work for directed graphs

¹ We also obtained embeddings by these methods which omit some of the preprocessing steps, but the results were uniformly worse in terms of outliers.
² We omit the plots for lack of space.

Table 1: Variation of the linear part of the models for each demographic variable. We used linear regressions for continuous variables (wages and time) and logistic regressions for the categorical variables (gender, education groups and education factors). The coordinate selection (removed coordinates are replaced with a dash “-”) is performed using BMA (BIC) on the first 10 coordinates of the embedding.

Coordinates, in order: $x^{(1)}$ $x^{(2)}$ $x^{(3)}$ $x^{(4)}$ $x^{(5)}$ $x^{(6)}$ $x^{(7)}$ $x^{(8)}$ $x^{(9)}$ $x^{(10)}$
Gender: 1.49 -6.28 1.22 1.04 1.00 -0.58 1.13 1.22 -3.29 -0.68
log10 Wages: -0.54 0.10 0.18 -0.16 0.08 -0.07 0.07 -0.20 0.13 0.02
Time: -11.64 1.42 -0.96 2.67 0.48 0.84 0.68 0.91
Educ. Gr.: 0.30 -6.14 2.01 0.41 -1.10 0.60 0.93 -1.50 -

Figure 7: Time and the embedding (coordinates $x^{(1)}$, $x^{(2)}$, $x^{(4)}$). Each token’s color represents the average position of this token in individual careers. A token’s size is proportional to its frequency in the population.

itative way.
Our pilot experiment on mean trajectories for various gender × education groups, which shows early and strong segregation for some groups, illustrates one of the possible insights. While the existence of segregation with respect to the selected demographics is not a new finding (in fact, it is relatively easy to demonstrate such segregation without resorting to embedding), we show that the segregation relates to many demographic variables in a concerted way, and that it is discernible from individual transitions alone. Following this finding, future studies could focus on smaller and more homogeneous groups to discover the speed at which they advance on their career paths and the location of their niche in the manifold. In addition, as the reader has perhaps noted, not all manifold coordinates represent known demographic features. Thus, finding their meaning represents an opportunity for further research.

From a methodological perspective, we found it interesting that a method designed for clustering (see the motivation in (Meilă and Pentney 2007)) works so remarkably well in a pure embedding task (the manifold we obtain exhibits very little, if any, clustering). This strongly suggests that the WCut should be considered a competitive algorithm for other embedding tasks.

Another feature of the WCut that benefits our task is the explicit retrieval of a global directionality axis in the first coordinate of the embedding. This is a previously unknown result. For the career data, this axis naturally aligned with time.

As a final point of interest on the methodology side, we note that it can be shown (although we do not do so here) that by rescaling the coordinate axes by $\lambda_i$, as we did in our embedding, one can find a relation between the Euclidean distance in the embedding and the original transition matrix $P$ that is similar to a diffusion distance.

tions with the WCut coordinates were uniformly stronger and agreed with the regression results of Table 1.
In short, all methods that find a smooth, informative embedding find essentially the same mapping, with small variations. Of them, the WC UT has the unique advantage of extracting the time component in the first eigenvector, something that is impossible using any of the other methods. We saw that this coordinate is significant for the career data. 8 Discussion and conclusions Establishing a coherent embedding for categorical sequences is an inherently challenging problem given the lack of a natural metric in this domain. This is the first time spectral embedding has been applied to career data. The embedding method used is a variation of the WC UT algorithm, one of the few existing methods that explicitly take into account directionality in a graph’s edges. This approach to mapping has two main traits: (1) it discards explicit time information, but incorporates asymmetry in the transition information, and (2) it maps jobs, which are discrete tokens of the form (occupation, industry) into a continuous, low-dimensional space. Our goal was first scientific and then methodological. We aimed to obtain a manifold that captures meaningful features of existing jobs. We demonstrated that the principal dimensions of the embedding relate to important demographic variables, which were not input into the algorithm. In fact, not only are these variables correlated with the embedding coordinates, but they exhibit visible continuity along various coordinates in spite of considerable noise. These results suggest the possibility of using the manifold as a tool for answering a variety of questions of interest to social scientists and economists in a quantitative and qual- 42 −3 −3 x 10 3.6 x 10 3.4 4 3.7 3.3 2 3.8 3.2 0 3.9 (1) x x(2) 3.1 −2 −4 −6 4 3 Low Educ. Male Low Educ. Female High Educ. Male High Educ. 
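The symmetrization and eigenvalue rescaling used when comparing embedding methods can be illustrated with a minimal sketch. It is not the WCut algorithm, and it does not compute any eigenvectors: it only shows $A_s = (A + A^T)/2$ and the multiplication of each embedding coordinate by its eigenvalue raised to a power. The function names, the toy transition counts, and the example eigenpairs are all hypothetical and chosen purely for illustration.

```python
from typing import List

Matrix = List[List[float]]


def symmetrize(a: Matrix) -> Matrix:
    """Return A_s = (A + A^T) / 2 for a square matrix A."""
    n = len(a)
    if any(len(row) != n for row in a):
        raise ValueError("A must be square")
    return [[(a[i][j] + a[j][i]) / 2.0 for j in range(n)] for i in range(n)]


def rescale_coordinates(coords: Matrix, eigvals: List[float],
                        power: float = 1.0) -> Matrix:
    """Multiply coordinate k of every point by eigvals[k] ** power.

    coords[i][k] is the k-th eigenvector entry for point i (assumed to be
    computed elsewhere). power = 0 leaves the coordinates unscaled;
    power = 1 matches the choice reported in the text. Fractional powers
    assume non-negative eigenvalues.
    """
    return [[row[k] * eigvals[k] ** power for k in range(len(eigvals))]
            for row in coords]


# Hypothetical transition counts between three job types (row = origin).
counts = [[0.0, 5.0, 0.0],
          [1.0, 0.0, 4.0],
          [0.0, 2.0, 0.0]]

# After symmetrization the 5-vs-1 and 4-vs-2 asymmetries are averaged away.
sym = symmetrize(counts)

# Hypothetical eigenvector entries and eigenvalues, for illustration only.
scaled = rescale_coordinates([[1.0, 2.0], [3.0, 4.0]], [0.5, 0.25], power=1.0)
```

In the toy example, the direction of flow between job types is lost once the counts are averaged, which is consistent with the observation that only a method using the directed transitions can recover a directionality axis.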
Figure 8: Education Groups' Progression. Blue = Low Male, Green = Low Female, Red = High Male, Yellow = High Female. The gray scale corresponds to log10 of the mean monthly wages in cents.

Figure 9: Education Groups' Progression over Time for First Coordinate with Standard Error (Black). Blue = Low Male, Green = Low Female, Red = High Male, Yellow = High Female. Horizontal axis: Time (in years).

References

Abbott, A. 1995. Sequence analysis. Annual Review of Sociology 21:93–113.

Andersen, R.; Chung, F. R. K.; and Lang, K. J. 2007. Local partitioning for directed graphs using pagerank. In WAW, 166–178.

Belkin, M., and Niyogi, P. 2002. Laplacian eigenmaps and spectral techniques for embedding and clustering. In Dietterich, T. G.; Becker, S.; and Ghahramani, Z., eds., Advances in Neural Information Processing Systems 14. Cambridge, MA: MIT Press.

Durbin, R.; Eddy, S.; Krogh, A.; and Mitchison, G. 1998. Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. New York: Cambridge University Press.

Lafon, S., and Lee, A. B. 2006. Diffusion maps and coarse-graining: a unified framework for dimensionality reduction, graph partitioning, and data set parametrization. IEEE Transactions on Pattern Analysis and Machine Intelligence 28(9):1393–1403.

Meilă, M., and Pentney, W. 2007. Clustering by weighted cuts in directed graphs. In SIAM Conference on Data Mining.

Meilă, M., and Shi, J. 2001. A random walks view of spectral segmentation. In Jaakkola, T., and Richardson, T., eds., Artificial Intelligence and Statistics AISTATS.

Pentney, W., and Meilă, M. 2005. Spectral clustering of biological sequence data. In Veloso, M., and Kambhampati, S., eds., Proceedings of Twentieth National Conference on Artificial Intelligence (AAAI-05), 845–850. Menlo Park, California: The AAAI Press.

Scott, M. forthcoming. Affinity models for career sequences. Journal of the Royal Statistical Society - Series C.

Shi, J., and Malik, J. 2000. Normalized cuts and image segmentation. PAMI.

Zhou, D.; Schölkopf, B.; and Hofmann, T. 2005. Semi-supervised learning on directed graphs. In Saul, L. K.; Weiss, Y.; and Bottou, L., eds., Advances in Neural Information Processing Systems, number 17. MIT Press.