The Frisch Centre for Economic Research
Abstract
Factor loading as a method of dimension reduction is explained in the context of our flexible MPH competing risk models. This is fairly simple matrix manipulation; it mainly serves the purpose of clarifying to myself what’s going on. THIS IS A DRAFT. DO NOT QUOTE.
This note came out of a discussion of a concrete estimation. We have two transitions, to programme and to employment. For these transitions we estimate the joint discrete heterogeneity distribution as random intercepts $v_p$ and $v_e$. I.e. if we have $K$ points (masspoints) in our distribution, we estimate vectors of location points $v_p = (v_p^1, v_p^2, \ldots, v_p^K)$ and $v_e = (v_e^1, v_e^2, \ldots, v_e^K)$ together with the probabilities $p = (p_1, p_2, \ldots, p_K)$. (Our masspoint number is a superscript, except in the probabilities.)
These location points enter our competing risk hazards (in masspoint $i$) as $h_p^i = \exp(X \cdot \beta + v_p^i)$ and $h_e^i = \exp(X \cdot \beta + v_e^i)$. (As usual, $\beta$ is a parameter vector and $X$ is a covariate vector.)
We also want to compute the treatment effect. This is done by having a dummy covariate $X_\alpha$ which equals 1 if the subject has undergone treatment, and 0 otherwise.
Since there is heterogeneity everywhere, we model it as a random coefficient in our heterogeneity distribution; thus our hazards become
\[ h_p^i = \exp(X \cdot \beta + v_p^i) \quad\text{and}\quad h_e^i = \exp(X \cdot \beta + X_\alpha \alpha^i + v_e^i). \]
It turns out to be difficult to estimate the α-vector. There’s perhaps too much flexibility. We will try to factor load it; we need factor loading anyway for other purposes. With many transitions we may save quite a few parameters when the number of masspoints increases.
Rather than estimating the location points $\alpha = (\alpha^1, \alpha^2, \ldots, \alpha^K)$, we will introduce some weights. That is, we let
\[ \alpha^i = \alpha_0 + \alpha_e v_e^i + \alpha_p v_p^i \]
enter the hazard as before, but we estimate the three weights $(\alpha_0, \alpha_p, \alpha_e)$. These weights are constant across masspoints. We will have quite some freedom in the first few moments of the α distribution, but it will not be completely flexible.
To put it in slightly different words, we let $(1, v_e, v_p)$ be a factor and we let α be factor loaded. The hope is that our α in this way will be sufficiently restricted.
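As a minimal sketch of the construction just described, the following NumPy snippet computes the factor-loaded α from three weights and the two random intercepts, once elementwise and once as a matrix product. All numerical values and variable names are hypothetical and chosen only for illustration.

```python
import numpy as np

# Hypothetical location points for K = 4 masspoints (illustrative values only).
v_e = np.array([-1.2, -0.3, 0.4, 1.1])
v_p = np.array([0.5, -0.8, 0.2, -0.1])

# The three weights, constant across masspoints (made-up numbers).
alpha_0, alpha_e, alpha_p = 0.3, 0.5, -0.2

# alpha^i = alpha_0 + alpha_e * v_e^i + alpha_p * v_p^i for each masspoint i.
alpha = alpha_0 + alpha_e * v_e + alpha_p * v_p

# The same thing as a matrix product: the factor has rows (1, v_e, v_p),
# and the weight row must list the weights in the same order.
Phi = np.vstack([np.ones_like(v_e), v_e, v_p])   # 3 x K
W = np.array([[alpha_0, alpha_e, alpha_p]])      # 1 x 3
alpha_matrix = (W @ Phi).ravel()
```

Only three numbers are estimated for α here, regardless of K, which is where the parameter saving comes from.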
We will implement more general factor loading, i.e. where e.g. $(\alpha, v_e, v_p)$ are all loaded from, say, two freely estimated random variables. In this case it’s actually equivalent; it might even happen that such free factors are all we need (I think). Or perhaps not. We will shortly see how a simple concept can be made arbitrarily complex with the use of some linear algebra.
We will use the above α -construction as an example, but we state things more generally.
Say we have $K$ points in our multivariate heterogeneity distribution, and we have $T$ random coefficients (or intercepts) to estimate, say $v = (v_1, v_2, \ldots, v_T)$.
We have a vector of probabilities $P = (p_1, \ldots, p_K)$ with $0 < p_i < 1$ and $\sum_i p_i = 1$. We have a $T \times K$ matrix of location points:
\[
V = \begin{pmatrix}
v_1^1 & v_1^2 & \cdots & v_1^K \\
v_2^1 & v_2^2 & \cdots & v_2^K \\
\vdots & \vdots & & \vdots \\
v_T^1 & v_T^2 & \cdots & v_T^K
\end{pmatrix}
\]
Instead of estimating these individually, we want to estimate a factor , i.e.
an ( F + 1) × K matrix (with F < T ):
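\[
\Phi = \begin{pmatrix}
1 & 1 & \cdots & 1 \\
\phi_1^1 & \phi_1^2 & \cdots & \phi_1^K \\
\phi_2^1 & \phi_2^2 & \cdots & \phi_2^K \\
\vdots & \vdots & & \vdots \\
\phi_F^1 & \phi_F^2 & \cdots & \phi_F^K
\end{pmatrix}
\]
and a set of weights, a $T \times (F+1)$ matrix:
\[
W = \begin{pmatrix}
w_1^0 & w_1^1 & w_1^2 & \cdots & w_1^F \\
w_2^0 & w_2^1 & w_2^2 & \cdots & w_2^F \\
\vdots & \vdots & \vdots & & \vdots \\
w_T^0 & w_T^1 & w_T^2 & \cdots & w_T^F
\end{pmatrix}
\]
and we let
\[ V = W\Phi \]
be the locations occurring in the hazard.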
That is, we factor the $T$-dimensional location points $V$ into a fixed part $W$ and $F$-dimensional location points Φ.
In other words, we write a high-dimensional matrix as a product of two lower-dimensional matrices.
Row $i$ in the matrix Φ consists of the support points of a discrete random variable $\phi_i$ from the multivariate distribution $\phi = (\phi_1, \ldots, \phi_F)$.
We have augmented Φ with a constant first row, and $W$ with a first column used for estimating fixed parts. In the example in the introduction, we have $T = 1$, $v_1^i = \alpha^i$, $\phi_1^i = v_e^i$ and $\phi_2^i = v_p^i$ for $i = 1..K$.
The intuition is that the low-dimensional factor Φ captures the heterogeneous variation with $F$ dimensions, whereas the weight $W$ captures the variation between the random variables $v = (v_1, \ldots, v_T)$, but no heterogeneous variation.
For this subsection, which illustrates how covariances are affected by factor loading, we assume there is no constant row in Φ, no $w^0$ column in $W$, and that $E(\phi_i) = 0$ for $i = 1..F$, in matrix notation $\Phi P = 0$ (see footnote 1).
We therefore have $VP = W\Phi P = 0$, so $E(v_i) = 0$ for $i = 1..T$ too.
Our probabilities are $P = (p_1, \ldots, p_K)$; let $D_P$ be the diagonal matrix with the vector $P$ on the diagonal. We have the covariance matrix
\[ \operatorname{cov}(v) = [\operatorname{cov}(v_i, v_j)]_{ij} = V D_P V'. \]
(For any matrix $A$, we let $A'$ denote the transpose of $A$.) Now, we use the factorization $V = W\Phi$ to get
\[ \operatorname{cov}(v) = W \Phi D_P \Phi' W' \]
(recollecting from primary school that $(W\Phi)' = \Phi' W'$). Clearly, $\operatorname{cov}(\phi) = \Phi D_P \Phi'$, thus we can write
\[ \operatorname{cov}(v) = W \operatorname{cov}(\phi)\, W' \tag{1} \]
That is, the heterogeneous variation in v comes from φ , and it’s merely dimension-expanded by the weights W . This is the sense in which Φ captures the heterogeneous variation.
It’s also clear that cov( v ) only depends on the fixed dimensional W and the fixed dimensional cov( φ ). It does not depend on the number of support points
( K ) other than through the low-dimensional cov( φ ).
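Equation (1) is easy to check numerically. The sketch below draws a random centred factor (no constant row, as assumed in this subsection) and random weights, and compares both sides. The dimensions, the seed and all variable names are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
T, F, K = 4, 2, 6                      # illustrative sizes with F < T

# Probabilities of the K masspoints.
p = rng.random(K)
p /= p.sum()

# Factor without the constant row, centred so that Phi @ p == 0.
Phi = rng.normal(size=(F, K))
Phi -= (Phi @ p)[:, None]

# Weights without the constant column.
W = rng.normal(size=(T, F))
V = W @ Phi

D_P = np.diag(p)
cov_v = V @ D_P @ V.T                  # T x T
cov_phi = Phi @ D_P @ Phi.T            # F x F
rhs = W @ cov_phi @ W.T                # right-hand side of equation (1)
```

Here `cov_v` and `rhs` agree up to rounding, and `cov_phi` stays an F × F matrix however large K is made.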
We mentioned in the introduction that we wanted our factor Φ to be bound , i.e.
the 3 × K matrix
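\[
\Phi = \begin{pmatrix}
1 & 1 & \cdots & 1 \\
v_e^1 & v_e^2 & \cdots & v_e^K \\
v_p^1 & v_p^2 & \cdots & v_p^K
\end{pmatrix}
\]
where $(v_e, v_p)$ are estimated as random intercepts, and our $W$ is
\[ W = \begin{pmatrix} \alpha_0 & \alpha_1 & \alpha_2 \end{pmatrix}. \]
Footnote 1: Note that multiplication on the right by $P$ is integration over the distribution.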
We end up with a 3-dimensional heterogeneity distribution $(W\Phi, v_e, v_p)$.
This has the advantage that W has a nice interpretation as a decomposition of the treatment heterogeneity into an employment heterogeneity and a programme heterogeneity.
However, it might seem technically simpler to treat all the random variables in the same fashion, in particular when it comes to computing gradients and
Jacobians which are needed in the estimation procedures. There are also some philosophical advantages (with, as usual, no technical merit). It may be the case that the underlying unobserved heterogeneity is not distributed in terms of employment and programme heterogeneity even though this is their concrete expression.
Let’s see if there’s any difference between these conceptually different approaches.
We fix an F .
Assume first that our full heterogeneity distribution V is identified (by this we mean that W Φ is unique). We also assume that the rank of V is at least F , i.e. that our heterogeneity distribution is not degenerate.
Note that this means that we must have K ≥ F .
If we choose the bound alternative we estimate $\Phi_b$ and $W_b$, where $W_b$ has the block form
\[ W_b = \begin{pmatrix} P_1 \\ I_F \end{pmatrix} \]
where the $(T-F) \times F$ matrix $P_1$ contains the decomposition weights for the random coefficients $(v_1, \ldots, v_{T-F})$. By uniqueness of $V$, we get $W_b \Phi_b = V$, thus we force the factor to consist of $(v_{T-F+1}, \ldots, v_T)$. Clearly, for a given $V$, $\Phi_b$ is unique, though $P_1$ may not be unique.
By choosing the free alternative we will estimate a pair $(W_f, \Phi_f)$ with $W_f \Phi_f = V$.
Do we get anything more from this more flexible construction? Let’s have a closer look at $W_f$; write it as a block matrix
\[ W_f = \begin{pmatrix} Q_1 \\ Q_2 \end{pmatrix} \]
where $Q_1$ is a $(T-F) \times F$ matrix, and $Q_2$ is $F \times F$. Now, if $Q_2$ is not of full rank, it means that the lower $F$ rows of $V$ are linearly dependent. They are by assumption not, so $Q_2$ has full rank, thus it is invertible.
We have
\[
V = W_b \Phi_b = \begin{pmatrix} P_1 \Phi_b \\ \Phi_b \end{pmatrix}
= W_f \Phi_f = W_f (Q_2^{-1} Q_2) \Phi_f
= \begin{pmatrix} Q_1 Q_2^{-1} \\ I \end{pmatrix} Q_2 \Phi_f
= \begin{pmatrix} Q_1 \Phi_f \\ Q_2 \Phi_f \end{pmatrix}
\]
which means that $Q_2 \Phi_f = \Phi_b$. Thus, we may recover the bound factor from the free factor. If $P_1$ is unique given $(V, \Phi_b)$, then we must also have $Q_1 Q_2^{-1} = P_1$.
Thus, if the decomposition weights are unique for a bound factor, we may recover them from the free weights.
Is there any additional information in the free $(W_f, \Phi_f)$? Pick any invertible $F \times F$ matrix $A$; then the pair $W_g = W_f A$ and $\Phi_g = A^{-1} \Phi_f$ will satisfy $W_g \Phi_g = V$.
On the other hand, say we have two free estimates, i.e. in addition to our $(W_f, \Phi_f)$ we have
\[ W_g = \begin{pmatrix} R_1 \\ R_2 \end{pmatrix} \]
and the corresponding $\Phi_g$. Again, $R_2$ is invertible, so as before we have $R_2 \Phi_g = \Phi_b$ and, if $W_b$ is unique, $W_g R_2^{-1} = W_b$, thus we get
\[ W_g = W_f A \quad\text{and}\quad \Phi_g = A^{-1} \Phi_f \quad\text{for } A = Q_2^{-1} R_2. \]
Thus, no specific freely estimated pair $(W_f, \Phi_f)$ is identified. They are all indistinguishable from $(W_b, \Phi_b)$ given our data, and we may find every one of them from $(W_b, \Phi_b)$ by translation with an invertible $A$. This means that we lose nothing by letting our factor Φ consist of estimated random coefficients. To avoid ambiguity in the estimation, we will stick to such factors.
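The argument above can be illustrated with a small numerical sketch: start from a bound pair, move to a free pair with an arbitrary invertible matrix, and recover the bound quantities again from the lower block of the free weights. Constant row and column are left out, as in the argument. The matrix `A`, the sizes and the random values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
T, F, K = 5, 2, 7                       # illustrative sizes

# Bound pair: W_b = [P1; I_F], and Phi_b holds the last F rows of V.
P1 = rng.normal(size=(T - F, F))
W_b = np.vstack([P1, np.eye(F)])
Phi_b = rng.normal(size=(F, K))
V = W_b @ Phi_b

# A "free" pair obtained with an arbitrary invertible A (determinant 1 here).
A = np.array([[2.0, 1.0],
              [1.0, 1.0]])
W_f = W_b @ A
Phi_f = np.linalg.inv(A) @ Phi_b

# Split the free weights into the blocks Q1 (upper) and Q2 (lower F x F).
Q1, Q2 = W_f[:T - F], W_f[T - F:]

# Recover the bound factor and the decomposition weights.
Phi_b_recovered = Q2 @ Phi_f
P1_recovered = Q1 @ np.linalg.inv(Q2)
```

Both pairs reproduce the same V, so the data cannot tell them apart; only the bound normalisation pins down a single representative.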
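That is, our $W$ is a $T \times (F+1)$ matrix of the form
\[
W = \begin{pmatrix}
w_1^0 & w_1^1 & w_1^2 & \cdots & w_1^F \\
w_2^0 & w_2^1 & w_2^2 & \cdots & w_2^F \\
\vdots & \vdots & \vdots & & \vdots \\
w_{T-F}^0 & w_{T-F}^1 & w_{T-F}^2 & \cdots & w_{T-F}^F \\
0 & 1 & 0 & \cdots & 0 \\
0 & 0 & 1 & \cdots & 0 \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
0 & 0 & 0 & \cdots & 1
\end{pmatrix}
\]
and our Φ is an $(F+1) \times K$ matrix:
\[
\Phi = \begin{pmatrix}
1 & 1 & \cdots & 1 \\
\phi_1^1 & \phi_1^2 & \cdots & \phi_1^K \\
\vdots & \vdots & & \vdots \\
\phi_F^1 & \phi_F^2 & \cdots & \phi_F^K
\end{pmatrix}
\]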
It’s essential that our random coefficients V = W Φ are identified by our data.
It’s also essential that the Fisher matrix, which we use as a surrogate for the (negative) Hessian, is definite. Let’s see how our factor loading affects the Fisher matrix.
Let’s first see in detail how it affects any function.
Where appropriate, we row-expand $V$ to be a vector of functions. Or, more precisely, we want to view our matrices as argument vectors to get a “flat” Jacobian. We know that some of our entries in $W$ and Φ are not variables, so we make “reduced” versions of these. That is, we remove the first row of Φ and the last $F$ rows of $W$. So, let $V = (v_1^1, v_1^2, \ldots, v_1^K, \ldots, v_T^1, \ldots, v_T^K)$, and $\Phi_r = (\phi_1^1, \ldots, \phi_1^K, \ldots, \phi_F^1, \ldots, \phi_F^K)$. To match $\Phi_r$ we must have a version of $W$ without the first column; we call it $W_1$. We also need a reduced argument vector version of $W$, $W_r = (w_1^0, \ldots, w_1^F, \ldots, w_{T-F}^0, \ldots, w_{T-F}^F)$. We note that $W$ and Φ can be reconstructed from $W_r$ and $\Phi_r$.
Let $f(\beta, V)$ be a function of our β’s and $V$.
We let $V$ be a function $V(W_r, \Phi_r) = W\Phi$ and consider the function
\[ g(\beta, W_r, \Phi_r) = f(\beta, V(W_r, \Phi_r)). \]
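We have the chain rule in terms of Jacobians
\[
\frac{\partial g}{\partial \Phi_r}
= \frac{\partial f(\beta, V(W, \Phi))}{\partial \Phi_r}
= \frac{\partial f(\beta, V)}{\partial V} \frac{\partial V}{\partial \Phi_r}
= \frac{\partial f(\beta, V)}{\partial V} (W_1 \otimes I_K)
\tag{2}
\]
where $\otimes$ denotes the Kronecker (tensor) product. In general, $I_n$ denotes the $n \times n$ identity matrix; $I$ (with no subscript) is an identity matrix of appropriate size.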
The derivatives with respect to $W_r$ are slightly more complicated. It’s still a tensor product, this time involving $\Phi'$, but it’s not with a square matrix. Since $v_i^j = \sum_l w_i^l \phi_l^j$, we have
\[
\frac{\partial v_i^j}{\partial w_k^l} =
\begin{cases}
0 & \text{if } i \neq k \text{ or } i > T-F \\
\phi_l^j & \text{if } i = k \leq T-F
\end{cases}
\]
(remembering that $\phi_0 = (1, \ldots, 1)$).
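Thus, we need the $T \times (T-F+1)$ block matrix
\[ I_{T,T-F+1} = \begin{pmatrix} I_{T-F+1} \\ 0 \end{pmatrix}. \]
We get
\[
\frac{\partial g}{\partial W_r}
= \frac{\partial f(\beta, V(W, \Phi))}{\partial W_r}
= \frac{\partial f(\beta, V)}{\partial V} \frac{\partial V}{\partial W_r}
= \frac{\partial f(\beta, V)}{\partial V} (I_{T,T-F+1} \otimes \Phi')
\tag{3}
\]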
This enables us to compute the gradient of the likelihood with respect to $W_r$, $\Phi_r$ in terms of $W$, Φ and the gradient with respect to $V$.
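A hedged numerical check of the two Jacobians: the sketch below flattens all matrices row by row, builds the Kronecker forms, and compares them against central finite differences. The helper names (`build`, `V_flat`, `numeric_jacobian`) are hypothetical. For the weights, the sketch simply keeps those columns of $I_T \otimes \Phi'$ that belong to the free rows of $W$, which is an assumption about how the selection is implemented rather than a statement about the actual estimation code.

```python
import numpy as np

rng = np.random.default_rng(2)
T, F, K = 4, 2, 5                       # illustrative sizes

def build(W_r, Phi_r):
    """Rebuild W (T x (F+1)) and Phi ((F+1) x K) from the reduced vectors."""
    W = np.vstack([W_r.reshape(T - F, F + 1),
                   np.hstack([np.zeros((F, 1)), np.eye(F)])])
    Phi = np.vstack([np.ones((1, K)), Phi_r.reshape(F, K)])
    return W, Phi

def V_flat(W_r, Phi_r):
    """Row-expanded V = W Phi as a vector of length T*K."""
    W, Phi = build(W_r, Phi_r)
    return (W @ Phi).ravel()

def numeric_jacobian(fun, x, eps=1e-6):
    """Central finite differences, one column per argument."""
    cols = []
    for n in range(x.size):
        d = np.zeros_like(x)
        d[n] = eps
        cols.append((fun(x + d) - fun(x - d)) / (2 * eps))
    return np.column_stack(cols)

W_r = rng.normal(size=(T - F) * (F + 1))
Phi_r = rng.normal(size=F * K)
W, Phi = build(W_r, Phi_r)

# dV/dPhi_r = W_1 (x) I_K, where W_1 is W without its first column.
J_phi = np.kron(W[:, 1:], np.eye(K))

# dV/dW_r: columns of I_T (x) Phi' that belong to the free rows of W.
J_w = np.kron(np.eye(T), Phi.T)[:, :(T - F) * (F + 1)]

num_phi = numeric_jacobian(lambda x: V_flat(W_r, x), Phi_r)
num_w = numeric_jacobian(lambda x: V_flat(x, Phi_r), W_r)
```

Because $V$ is linear in each argument separately, the finite differences agree with the Kronecker forms up to rounding.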
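Now, what happens to the Fisher matrix? We have $N$ observation vectors, $X = (X_i)_{i=1}^N$, a vector of fixed coefficients β, and a matrix $V$ of random coefficients. There’s a log-likelihood
\[ L(\beta, V, X) = \sum_{i=1}^{N} \ell(\beta, X_i, V) = \sum_{i=1}^{N} \ell_i(\beta, V) \]
to be maximized. (The dependence on the data will just be present as the index $i$ in $\ell_i$.) We approximate the Fisher matrix by the sum of outer products of the gradients:
\[ \mathcal{F}(\beta, V) = \sum_{i=1}^{N} (\nabla \ell_i(\beta, V)) (\nabla \ell_i(\beta, V))'. \]
With factor loading, the likelihood contributions are $\hat\ell_i(\beta, W_r, \Phi_r) = \ell_i(\beta, W\Phi)$. The derivatives w.r.t. β are unchanged, thus by equations (2) and (3) we get the 1-row matrix
\[
\begin{aligned}
(\nabla \hat\ell_i)'
&= \begin{bmatrix} \dfrac{\partial \ell_i(\beta, V)}{\partial \beta} & \dfrac{\partial \hat\ell_i}{\partial W_r} & \dfrac{\partial \hat\ell_i}{\partial \Phi_r} \end{bmatrix} \\
&= \begin{bmatrix} \dfrac{\partial \ell_i(\beta, V)}{\partial \beta} & \dfrac{\partial \ell_i(\beta, V)}{\partial V} (I_{T,T-F+1} \otimes \Phi') & \dfrac{\partial \ell_i(\beta, V)}{\partial V} (W_1 \otimes I_K) \end{bmatrix} \\
&= (\nabla \ell_i)' J
\end{aligned}
\]
where $J$ is the Jacobian block matrix
\[
J = \begin{pmatrix}
I & 0 & 0 \\
0 & I_{T,T-F+1} \otimes \Phi' & W_1 \otimes I_K
\end{pmatrix}.
\]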
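Thus, $\nabla \hat\ell_i = J' \nabla \ell_i$, so we get
\[
\begin{aligned}
\hat{\mathcal{F}}(\beta, W_r, \Phi_r)
&= \sum_i (\nabla \hat\ell_i)(\nabla \hat\ell_i)' \\
&= \sum_i J' (\nabla \ell_i)(\nabla \ell_i)' J \\
&= J' \Big( \sum_i (\nabla \ell_i)(\nabla \ell_i)' \Big) J \\
&= J' \mathcal{F}(\beta, V) J.
\end{aligned}
\tag{4}
\]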
It’s of course no surprise that a linear transformation of variables leads to familiar expressions for the derivatives.
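Equation (4) can be sanity-checked with stand-in data. In the sketch below the per-observation gradients and the Jacobian are random placeholders (not derived from any likelihood), and the outer-product sum is accumulated in blocks of 64 columns. The blocking is done with plain NumPy products here, not with an actual BLAS `dsyrk` call.

```python
import numpy as np

rng = np.random.default_rng(3)
n_obs, n_full, n_reduced = 200, 6, 4    # illustrative sizes

# Column i of A stands in for the gradient of ell_i w.r.t. (beta, V).
A = rng.normal(size=(n_full, n_obs))
# Stand-in for the Jacobian block matrix J (random, for illustration only).
J = rng.normal(size=(n_full, n_reduced))

# Accumulate the sum of outer products in blocks of 64 columns.
fisher = np.zeros((n_full, n_full))
for start in range(0, n_obs, 64):
    block = A[:, start:start + 64]
    fisher += block @ block.T

# Gradients in the reduced parameters are J' times the original gradients.
A_hat = J.T @ A
fisher_hat = A_hat @ A_hat.T

# Equation (4): the same matrix obtained by transforming the Fisher matrix.
fisher_transformed = J.T @ fisher @ J
```

Either route gives the same matrix, so in practice one can transform the accumulated matrix once instead of transforming every gradient.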
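Can this transformation with $J$ introduce rank-deficiency in $\hat{\mathcal{F}}$? This is known to only happen if $J'J$ is rank-deficient. Let’s consider only the lower right part of $J$ (the rest is irrelevant):
\[ B = \begin{pmatrix} I_{T,T-F+1} \otimes \Phi' & W_1 \otimes I_K \end{pmatrix}. \]
Assuming $\mathcal{F}$ is of full rank (see footnote 2), $\hat{\mathcal{F}}$ is rank-deficient if and only if $B'B$ is, i.e. if
\[
\begin{aligned}
B'B
&= \begin{pmatrix} I_{T,T-F+1}' \otimes \Phi \\ W_1' \otimes I_K \end{pmatrix}
   \begin{pmatrix} I_{T,T-F+1} \otimes \Phi' & W_1 \otimes I_K \end{pmatrix} \\
&= \begin{pmatrix}
   (I_{T,T-F+1}' \otimes \Phi)(I_{T,T-F+1} \otimes \Phi') & (I_{T,T-F+1}' \otimes \Phi)(W_1 \otimes I_K) \\
   (W_1' \otimes I_K)(I_{T,T-F+1} \otimes \Phi') & (W_1' \otimes I_K)(W_1 \otimes I_K)
   \end{pmatrix} \\
&= \begin{pmatrix}
   I_{T-F+1} \otimes (\Phi\Phi') & (I_{T,T-F+1}' W_1) \otimes \Phi \\
   (W_1' I_{T,T-F+1}) \otimes \Phi' & (W_1' W_1) \otimes I_K
   \end{pmatrix}
\end{aligned}
\]
is rank-deficient.
Footnote 2: We note that we may write $\mathcal{F} = AA'$ where column $i$ of $A$ is $\nabla \ell_i$, thus $\mathcal{F}$ is positive semi-definite. This is actually how we compute $\mathcal{F}$ numerically: we collect 64 or 128 columns of $A$ at a time, then use the BLAS level 3 routine dsyrk to update $\mathcal{F}$ efficiently.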
For this to happen, either $\Phi\Phi'$ or $W_1'W_1$ must be rank-deficient. I.e. as long as $\operatorname{rank} \Phi = F+1$ and $\operatorname{rank} W_1 = F$ we will not accidentally introduce rank-deficiency in $\hat{\mathcal{F}}$. The former condition says that our factor Φ is not degenerate (it could be if we choose $F$ too high); the latter condition says that there is no more linear dependence between our random coefficients than what must result from doing factor loading.
A necessary condition is found by counting rows and columns, i.e.
K > F and F ≤ T . Not surprisingly, this condition just ensures that we don’t estimate more parameters with factor loading than without. It probably has a much simpler derivation, but we anyway need to be concrete when implementing this stuff.
We may actually use this to find a reasonable factor dimension F . If we end up with a rank-deficient Φ, we have chosen F too high, we should then reestimate with F ≤ rank Φ.
Similarly, if $W_1'W_1$ is singular, then, hmm, what then?
We might want to have one set of random coefficients dependent on one factor, and another set dependent on another factor. Thus, conceptually we might like e.g. two factors and two sets of weights:
\[ V_1 = W_1 \Phi_1 \quad\text{and}\quad V_2 = W_2 \Phi_2. \]
I.e., $\Phi_1$ is an $F_1 \times K$ matrix, and $\Phi_2$ is $F_2 \times K$.
These may be combined into a single factor with particular constraints on the weight matrix, using block matrices:
\[
V = \begin{pmatrix} V_1 \\ V_2 \end{pmatrix}, \qquad
W = \begin{pmatrix} W_1 & 0 \\ 0 & W_2 \end{pmatrix} \qquad\text{and}\qquad
\Phi = \begin{pmatrix} \Phi_1 \\ \Phi_2 \end{pmatrix}
\]
with the obvious modifications to our $J$ in (4), but otherwise there is no obvious technical benefit from doing it this way. We should stick to the possibility of having several factors.
In our estimations we start out with $K = 1$ and gradually increase $K$. We need $K > F$ to use a factor of dimension $F$. Thus, in the initial iterations we do not use factor loading. When we have an estimate with $K = F+1$ we need to switch to factor loading for the next iteration. Thus we need to split our estimated distribution (the $T \times (F+1)$ matrix $V$) into a $T \times (F+1)$ weight matrix $W$ and an $(F+1) \times (F+1)$ factor matrix Φ. We know that the lower $F$ rows of $V$ form the lower $F$ rows of the factor Φ, and the first row of Φ consists of 1’s. Thus we know what Φ must be.
So, we just find the weights $W$ from the matrix equation
\[ V = W\Phi \tag{5} \]
Thus, when we reach estimates with $K = F+1$, we switch to factor loading by setting Φ to the subset of estimated random coefficients which we want in our factor, and we let $W = V\Phi^{-1}$. If Φ is not invertible we have either chosen too large an $F$, or we have been unlucky.
Unfortunately, sometimes, in particular with small $F$, Φ is not of full rank, or is very close to being non-invertible. Then (5) should not be used to define $W$ (if Φ is close to being singular, the entries of $W$ will become ridiculously large).
We then have a couple of choices. One of them is to estimate a couple more points without factor loading, hope that the problems were spurious due to too small $K$, and then find a $W$. Unfortunately, we can then not hope to find a $W$ with $V = W\Phi$ exactly; we try instead to find the “best” $W$. By “best” we mean that the rows of $W\Phi$ should be as close to the rows of $V$ as possible. This is merely a linear regression; it takes the following form in terms of the matrices at hand:
The rows of Φ span a subspace $H \subset \mathbb{R}^K$. Given a vector $v$ in $\mathbb{R}^K$, the point in $H$ which is closest to $v$ is the orthogonal projection of $v$ onto $H$.
Thus, we project each row of $V$ onto $H$ to get a linear combination of rows in Φ, and use the corresponding coefficients as a row in $W$. Or, in implicit terms, we write $V = V_1 + V_2$ where the rows of $V_1$ are in $H$ and the rows of $V_2$ are orthogonal to $H$. We then find the $W$ satisfying $W\Phi = V_1$.
Let $\langle \cdot, \cdot \rangle$ denote the ordinary (Euclidean) inner product in $\mathbb{R}^K$. For $\phi \in H$ we have by definition $\langle V_2, \phi \rangle = 0$. Since the rows of Φ are in $H$ we have $V_2 \Phi' = 0$, thus $V\Phi' = (V_1 + V_2)\Phi' = V_1 \Phi' = W\Phi\Phi'$, so $W$ must satisfy
\[ W \Phi\Phi' = V\Phi'. \tag{6} \]
If the rows of Φ are linearly independent then $\Phi\Phi'$ is invertible, thus $W = V\Phi'(\Phi\Phi')^{-1}$.
If $\Phi\Phi'$ is not invertible, we should estimate one more point before we try to find a $W$. Or, alternatively, if this never happens, we might try to reestimate with a smaller $F$.
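Equation (6) is the normal equation of a linear regression of the rows of $V$ on the rows of Φ. The following sketch solves it directly and, as a cross-check, with a least-squares routine. The matrices are random placeholders and the variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(4)
T, F, K = 3, 2, 6                       # illustrative sizes with K > F + 1

# Factor with a constant first row, and an arbitrary location matrix V.
Phi = np.vstack([np.ones((1, K)), rng.normal(size=(F, K))])
V = rng.normal(size=(T, K))

# Solve W (Phi Phi') = V Phi' without forming an explicit inverse.
gram = Phi @ Phi.T                      # (F+1) x (F+1), symmetric
W = np.linalg.solve(gram, (V @ Phi.T).T).T

# The same W from a least-squares fit of Phi' W' ~ V'.
W_lstsq = np.linalg.lstsq(Phi.T, V.T, rcond=None)[0].T

# Residual V_2 = V - W Phi; its rows are orthogonal to the rows of Phi.
V2 = V - W @ Phi
```

Using `solve` or `lstsq` rather than an explicit inverse is a small design choice that behaves better when $\Phi\Phi'$ is close to singular, which is exactly the situation described above.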
We might even want to approximate “important” columns of $V$ better than “unimportant” ones. Each column is a masspoint of a probability distribution, and we may use the probabilities as weights. I.e. we want to minimize the weighted Euclidean distance. Let Π be the diagonal matrix with the probabilities on the diagonal. We adjust the inner product on $\mathbb{R}^K$ to $\langle x, y \rangle = y' \Pi x$. The new metric we get from this inner product is appropriately weighted, and the orthogonal projection (with respect to the new inner product) minimizes the distance given by the new metric.
Arguing as above, we get a new defining equation for $W$:
\[ W \Phi\Pi\Phi' = V\Pi\Phi'. \]
I.e. we get $W = V\Pi\Phi'(\Phi\Pi\Phi')^{-1}$. Note that $\Phi\Pi\Phi'$ is the matrix of second moments about 0 of the factor components.
Any positive definite $K \times K$ matrix may be used in place of Π to define an inner product. Thus, we may put stronger emphasis on higher probabilities by using $\Pi^\lambda$ with $\lambda > 1$, and lower emphasis by using $\lambda < 1$. The special case $\lambda = 0$ is equation (6). (Since Π is diagonal the exponentiation is elementwise, though we may do it for general positive Π with spectral theory.)
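A minimal sketch of the weighted version, with the exponent λ as a parameter. The function name `weighted_weights` and all example values are hypothetical; the point is only to show how the probabilities enter and that λ = 0 gives back the unweighted solution of (6).

```python
import numpy as np

def weighted_weights(V, Phi, p, lam=1.0):
    """Solve W (Phi Pi^lam Phi') = V Pi^lam Phi' for W, with Pi = diag(p)."""
    Pi = np.diag(p ** lam)              # elementwise power on the diagonal
    gram = Phi @ Pi @ Phi.T             # symmetric (F+1) x (F+1)
    rhs = V @ Pi @ Phi.T
    return np.linalg.solve(gram, rhs.T).T

rng = np.random.default_rng(5)
T, F, K = 3, 2, 5                       # illustrative sizes
p = rng.random(K)
p /= p.sum()
Phi = np.vstack([np.ones((1, K)), rng.normal(size=(F, K))])
V = rng.normal(size=(T, K))

W_weighted = weighted_weights(V, Phi, p, lam=1.0)
W_unweighted = weighted_weights(V, Phi, p, lam=0.0)

# Reference: the unweighted normal-equation solution from equation (6).
W_eq6 = V @ Phi.T @ np.linalg.inv(Phi @ Phi.T)

# Square invertible factor (K = F + 1): the weights drop out entirely.
Phi_sq = np.array([[1.0, 1.0, 1.0],
                   [0.0, 1.0, 2.0],
                   [1.0, 0.0, 2.0]])
V_sq = rng.normal(size=(T, 3))
p_sq = np.array([0.2, 0.3, 0.5])
W_sq = weighted_weights(V_sq, Phi_sq, p_sq, lam=2.0)
```

With the square factor, `W_sq` equals `V_sq @ inv(Phi_sq)` whatever λ is used.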
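Note that in the case that Φ is an invertible square matrix (as in the previous section), our new definition of $W$ reduces to $W = V\Phi^{-1}$, since $V\Pi\Phi'(\Phi\Pi\Phi')^{-1} = V\Pi\Phi'(\Phi')^{-1}\Pi^{-1}\Phi^{-1} = V\Phi^{-1}$.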
In fact, instead of using the criterion $K = F+1$ for when we should attempt to find $W$, we might as well use the single criterion that $\Phi\Phi'$ (or $\Phi\Pi\Phi'$) is invertible, since this cannot happen with $K < F+1$, and we anyway have to wait until it happens.
Another approach for when we should switch to factor loading is to initially estimate the factor loaded coefficients as fixed coefficients. Add as many points as seems appropriate, then switch to factor loading with the obvious factor components and a weight matrix with all entries zero except for the first column, which should be set equal to the fixed estimates. There’s of course still the question of at how many points this should be done. By the way, note that this is the special case of (6) where the rows of $V$ are filled with the corresponding fixed estimate.
Without factor loading, when we add a new point to our distribution we do it by searching for a new location point where a certain directional derivative
(often referred to as “the Gateaux derivative” in the literature) is positive.
Technically, we do it as follows. Assume we have $K$ points of support, thus $V$ is a $T \times K$ matrix of location points, and we have a vector of probabilities $p = (p_1, \ldots, p_K)$. We have the log-likelihood function
\[ L(p, V) = \sum_i \log \sum_k p_k \ell_i(V_k) \]
where $\ell_i$ is the individual likelihood and $V_k$ is the $k$’th column of $V$.
Let $\rho \in \mathbb{R}$ satisfy $0 \leq \rho \leq 1$, let $p_\rho$ be the new $K+1$ probability vector $p_\rho = ((1-\rho)p, \rho)$, and let $V_n$ be the new $T \times (K+1)$ location matrix formed by adding a column $v^{K+1} = (v_1^{K+1}, \ldots, v_T^{K+1})$ to $V$. Let
\[ G(\rho, v^{K+1}) = L(p_\rho, V_n) \quad\text{and}\quad g(\rho, v^{K+1}) = \frac{\partial G}{\partial \rho}. \]
Clearly, we have $G(0, v^{K+1}) = L(p, V)$ for every vector $v^{K+1}$. When adding a new point we search for a vector $v^{K+1}$ with $g(0, v^{K+1}) > 0$, thus guaranteeing that the likelihood will improve when we perturb the new probability in the positive direction.
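Differentiating $G$ at $\rho = 0$ gives $g(0, v^{K+1}) = \sum_i \big( \ell_i(v^{K+1}) / \sum_k p_k \ell_i(V_k) - 1 \big)$. The sketch below evaluates this for a toy model with $T = 1$ and a unit-variance normal density standing in for the individual likelihood; that density, the data and the function names are assumptions made purely for illustration, not the hazard model of this note.

```python
import numpy as np

def lik(y, v):
    """Toy individual likelihoods ell_i(v): unit-variance normal density at y_i."""
    return np.exp(-0.5 * (y - v) ** 2) / np.sqrt(2 * np.pi)

def loglik(y, p, V):
    """L(p, V) = sum_i log sum_k p_k ell_i(V_k); T = 1, so V is a vector."""
    return np.sum(np.log(lik(y[:, None], V[None, :]) @ p))

def G(rho, y, p, V, v_new):
    """Log-likelihood after adding the point v_new with probability rho."""
    p_rho = np.append((1.0 - rho) * p, rho)
    return loglik(y, p_rho, np.append(V, v_new))

def gateaux(y, p, V, v_new):
    """g(0, v_new) = sum_i (ell_i(v_new) / sum_k p_k ell_i(V_k) - 1)."""
    mix = lik(y[:, None], V[None, :]) @ p
    return np.sum(lik(y, v_new) / mix - 1.0)

# Made-up data with two clusters, and a current one-point distribution at 0.
y = np.array([-2.1, -1.9, -2.0, 2.0, 2.2, 1.8])
p = np.array([1.0])
V = np.array([0.0])

g_new = gateaux(y, p, V, 2.0)           # candidate point near one cluster
g_same = gateaux(y, p, V, 0.0)          # candidate equal to the existing point

# Finite-difference check of the analytic derivative at rho = 0.
h = 1e-6
g_fd = (G(h, y, p, V, 2.0) - G(0.0, y, p, V, 2.0)) / h
```

A candidate that coincides with an existing point gives a derivative of zero, while a candidate near an unexplained cluster gives a positive value, which is the signal used when searching for a new point.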
When we use factor loading we will do something very similar: we keep $p_\rho$ as before, but we let $\Phi_n$ be the new $(F+1) \times (K+1)$ factor matrix formed by adding a column $\phi^{K+1} = (1, \phi_1^{K+1}, \ldots, \phi_F^{K+1})$ to Φ, thus we let
\[ \hat G(\rho, \phi^{K+1}) = L(p_\rho, W\Phi_n) \quad\text{and}\quad \hat g(\rho, \phi^{K+1}) = \frac{\partial \hat G}{\partial \rho}. \]
Now, as before, we search for a vector $\phi^{K+1}$ with $\hat g(0, \phi^{K+1}) > 0$.