Labels or Attributes? Rethinking the Neighbors for
Collective Classification in Sparsely-Labeled Networks
Luke K. McDowell
Department of Computer Science, U.S. Naval Academy, Annapolis, MD, U.S.A.
lmcdowel@usna.edu

David W. Aha
Navy Center for Applied Research in AI, Naval Research Laboratory, Code 5514, Washington, DC, U.S.A.
david.aha@nrl.navy.mil
ABSTRACT
Many classification tasks involve linked nodes, such as people connected by friendship links. For such networks, accuracy might be increased by including, for each node, the (a)
labels or (b) attributes of neighboring nodes as model features. Recent work has focused on option (a), because early
work showed it was more accurate and because option (b)
fit poorly with discriminative classifiers. We show, however,
that when the network is sparsely labeled, “relational classification” based on neighbor attributes often has higher accuracy than “collective classification” based on neighbor labels.
Moreover, we introduce an efficient method that enables discriminative classifiers to be used with neighbor attributes,
yielding further accuracy gains. We show that these effects
are consistent across a range of datasets, learning choices,
and inference algorithms, and that using both neighbor attributes and labels often produces the best accuracy.
Categories and Subject Descriptors
H.2.8 [Database Management]: Database Applications—Data Mining
Keywords
Link-based classification; statistical relational learning; semi-supervised learning; collective inference; social networks
1. INTRODUCTION
Many problems in communications, social networks, biology, business, etc. involve classifying nodes in a graph.
For instance, consider predicting a class label for each page
(node) in a set of linked webpages, where some node labels
are provided for learning. A traditional method would use
the attributes of each page (e.g., words in the page) to predict its label. In contrast, link-based classification [2, 13] also
uses, for each node, the attributes or labels of neighboring
pages as model features. If “neighbor labels” are used, then
an iterative algorithm for collective inference is needed, since
many labels are initially unknown [3]. If, on the other hand,
“neighbor attributes” are used, then a single step of relational
inference suffices, since all attribute values are known.
Despite the additional complexity of inference, recent work
has used collective inference (CI) much more frequently than
relational inference (RI) for two reasons. First, multiple algorithms for CI (e.g., belief propagation, Gibbs sampling,
ICA) can substantially increase classification accuracy [8,
10]. In contrast, comparisons found RI to be inferior to
CI [3] and to sometimes even decrease accuracy compared to
methods that ignore links [2]. Second, although RI does not
require multiple inference steps, using neighbor attributes as
model features is more complex than with neighbor labels,
due to the interplay between the larger number of attributes
(vs. one label) and a varying number of neighbors for each
node. In particular, RI does not naturally mesh with popular, discriminative classifiers such as logistic regression.
Most work on link-based classification assumes a fully-labeled training graph. However, often (e.g., for social and
webpage networks) collecting the node attributes and link
structure for this graph may be easy, but acquiring the desired labels can be much more expensive [11, 6]. In response,
recent studies have examined CI methods with partially-labeled training graphs, using some semi-supervised learning
(SSL) to leverage the unlabeled portion of the graph [15, 1,
6]. However, because using neighbor attributes seemed difficult and unnecessary, none evaluated RI.
Our contributions are as follows. First, we provide the
first evaluation of link-based classification that compares
models based on neighbor labels (CI) vs. models based on
neighbor attributes (RI), for sparsely-labeled networks. Unlike prior studies with fully-labeled training networks, we
find that RI is often significantly more accurate than CI.
Second, we introduce an efficient technique, Multi-Neighbor
Attribute Classification (MNAC), that enables discriminative classifiers like logistic regression to be used with neighbor attributes, further increasing accuracy. Finally, we show
that the advantages of RI with MNAC remain with a variety
of datasets, learning algorithms, and inference algorithms.
We find that, surprisingly, RI’s gains remain even for data
conditions where CI was thought to be clearly preferable [3].
2. LINK-BASED CLASSIFICATION
Assume we are given a graph G = (V, E, X, Y, C), where V is a set of nodes, E is a set of edges (links), each x_i ∈ X is an attribute vector for a node v_i ∈ V, each Y_i ∈ Y is a label variable for v_i, and C is the set of possible labels. We are also given a set of "known" values Y^K for nodes V^K ⊂ V, so that Y^K = {y_i | v_i ∈ V^K}. Then the within-network classification task is to infer Y^U, the values of Y_i for the remaining nodes V^U with "unknown" values (V^U = V \ V^K).

Table 1: Types of models, based on the kinds of features used. Some notation is adapted from Jensen et al. [3].

Model         Self attr.   Neigh. attr.   Neigh. labels
RCI               X             X               X
RI                X             X
CI                X                             X
SelfAttrs         X
NeighAttrs                      X
NeighLabels                                     X
For example, given a (partially-labeled) set of interlinked
university webpages, consider the task of predicting whether
each page belongs to a professor or a student. There are
three kinds of features typically used for this task (a construction sketch follows the list):
• Self attributes: features based on the textual
content of each page (node), e.g., the presence or absence of the word “teaching” for node v.
• Neighbor attributes: features based on the attributes
of pages that link to v. These may be useful because,
e.g., pages often link to others with the same label.
• Neighbor labels: features based on the labels of
pages that link to v, such as “Count the number of
v’s neighbors with label Student.”
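For concreteness, the sketch below (ours, not taken from the paper; the graph, attrs, and labels structures are illustrative) shows one simple way these three kinds of features could be computed for a node v.

```python
from collections import Counter

def node_features(v, graph, attrs, labels, classes):
    """Illustrative construction of the three feature kinds for node v.

    graph[v]  -> list of v's neighbors
    attrs[u]  -> attribute vector (e.g., 0/1 word indicators) for node u
    labels[u] -> known or currently predicted label for node u (may be None)
    """
    # Self attributes: the node's own attribute vector.
    self_attrs = list(attrs[v])

    # Neighbor attributes: here, the element-wise average of the
    # neighbors' attribute vectors (one simple aggregation choice).
    neighbors = graph[v]
    neigh_attrs = [sum(attrs[u][k] for u in neighbors) / len(neighbors)
                   for k in range(len(self_attrs))]

    # Neighbor labels: e.g., the count of neighbors with each label.
    counts = Counter(labels[u] for u in neighbors if labels[u] is not None)
    neigh_label_counts = [counts[c] for c in classes]

    return self_attrs, neigh_attrs, neigh_label_counts
```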
Table 1 characterizes classification models based on the
kinds of features they use. The simplest, baseline models use
only one kind. First, SelfAttrs uses only self attributes.
Second, NeighAttrs classifies a node v using only v's neighbors' attributes. This model has not previously been
studied, but we use it to help measure the value of neighbor
attributes on their own. Finally, NeighLabels uses only
neighbor labels. For instance, it might repeatedly average
the predicted label distributions of a node’s neighbors; this
performs surprisingly well for some datasets [5].
Other models combine self attributes with other features.
If a model also uses neighbor attributes, then it is performing “relational inference” and we call it RI. A CI model uses
neighbor labels instead, via features like the “count Students” described above. However, this is challenging, because some labels are unknown and must be estimated, typically with an iterative process of collective inference (i.e.,
CI) [3]. CI methods include Gibbs sampling, belief propagation, and ICA (Iterative Classification Algorithm) [10].
We focus on ICA, a simple, popular, and effective algorithm [10, 1, 6]. ICA first predicts a label for every node in
V U using only self attributes. It then constructs additional
relational features XR using the known and predicted node
labels (Y K and Y U ), and re-predicts labels for V U using
both self attributes and XR . This process of feature computation and prediction is repeated, e.g., until convergence.
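A minimal sketch of this ICA loop (the classifier interface and the make_rel_features helper are our assumptions, not the paper's code):

```python
def ica(unknown_nodes, graph, attrs, labels, self_clf, rel_clf,
        make_rel_features, n_iters=10):
    """Iterative Classification Algorithm (sketch).

    labels holds the known labels for V^K; predictions for V^U are filled in.
    self_clf predicts from self attributes only; rel_clf predicts from
    self attributes plus the relational features X_R.
    """
    # Bootstrap: predict every unknown node from its own attributes.
    for v in unknown_nodes:
        labels[v] = self_clf.predict([attrs[v]])[0]

    # Repeat feature computation and prediction (here, a fixed number of times).
    for _ in range(n_iters):
        for v in unknown_nodes:
            x_r = make_rel_features(v, graph, labels)  # uses known + predicted labels
            labels[v] = rel_clf.predict([list(attrs[v]) + list(x_r)])[0]
    return labels
```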
Finally, RCI uses all three kinds of features. Because it
uses neighbor labels, it also must use some kind of collective
inference such as ICA.
2.1 Prior Work with Neighbor Attributes
Some early work on link-based classification evaluated models that included neighbor attributes [2, 13]. However, recent work has used such models (including RI and RCI)
very rarely for two primary reasons. First, prior work found
that, while using neighbor labels can increase accuracy, using neighbor attributes can actually decrease accuracy [2].
Figure 1: Accuracy for "Gene" using Naive Bayes and SSL-Once (see Section 5), plotted against label density (log scale, 1% to 80%) for RCI, RI, and CI. Within a column, a circle or triangle that is filled in indicates a result that was significantly different (better or worse) than that of CI.
Later, Jensen et al. [3] compared CI vs. RI and RCI. They
found that CI somewhat outperformed RCI and performed
much better than RI. They describe how neighbor attributes
greatly increased the parameter space, leading to lower accuracy, especially for small training graphs.
Second, it is unclear how to include neighbor attributes in
popular classifiers. In particular, nodes usually have a varying number of neighbors. Thus, with neighbor attributes
as features, there is no direct way to represent a node in a
fixed-sized feature vector as expected by classifiers such as
logistic regression or SVMs. With neighbor labels, CI algorithms address this issue with aggregation functions (such as
“Count”) that summarize all neighboring labels into a few
feature values. This works well for labels, which are discrete and highly informative, but is more challenging for attributes, which are more numerous, may be continuous, and
are individually less informative than a node’s label (thus,
this approach fared very poorly in early work [2]).
These two factors have produced a prevailing wisdom that
CI based on neighbor labels is better than RI based on neighbor attributes (cf., [10, 9]). This conclusion rested on studies
with fully-labeled training graphs, but has been carried into
the important domain [11, 6] of sparsely-labeled graphs. In
particular, Table 2 summarizes the models used by the most
relevant prior work with such graphs. Only one study [15]
used models with neighbor attributes (e.g., with RI or RCI),
and it did not evaluate whether they were helpful.¹
We next show that this prevailing wisdom was partly correct, but that, for sparsely-labeled networks, neighbor attributes are often more useful than previously thought.
3. THE IMPACT OF LABEL SPARSITY
To aid our discussion, we first preview our results. Figure 1 plots the average accuracy of RCI, RI, and CI for the
Gene dataset (also used by Jensen et al. [3]). The x-axis
varies the label density (i.e., the fraction of nodes with known
labels). When label density is high (≥ 40%), CI significantly
outperforms RI, and RCI as well when density is very high.
High density is similar to learning with a fully-labeled graph,
and these results are consistent with those of Jensen et al.
However, when density is low (< 20%), CI is significantly
worse than RI or RCI. To our knowledge, no prior work has
reported this effect. Why does it occur?
¹ Prior work with neighbor attributes, including that of Xiang & Neville [15], used methods like decision trees or Naive Bayes that do not require a fixed-length feature vector.
Table 2: Related work on link-based classification that has used some variant of semi-supervised ICA (with partially-labeled networks). The first row is an exception; it used fully-labeled training networks but is included for comparison.

Study                 Classifier type   RCI  RI  CI  SelfAttrs  NeighAttrs  NeighLabels  Inference alg.        Datasets used
Jensen et al. [3]     RBC                X   X   X      X                                Gibbs                 Gene, synthetic
Lu & Getoor [4]       Log. reg. (LR)             X      X                                ICA                   Cora, Citeseer
Xiang & Neville [15]  Decision tree      X              X                                Soft ICA              Gene, synthetic
Bilgic et al. [1]     Log. reg. (LR)             X      X                                ICA                   Cora, Citeseer
Shi et al. [11]       Log. reg. (LR)             X                                       ICA                   Cora, Citeseer, Gene
McDowell & Aha [6]    Log. reg. (LR)             X      X                       X        ICA                   Cora, Citeseer, Gene
This paper            NB & LR            X   X   X      X          X           X        ICA, Soft ICA, Gibbs  Cora, Citeseer, Gene, synthetic
First, during inference, many neighbor labels are unknown.
Thus, a potential disadvantage of CI vs. RI is that some predicted labels used by CI will be incorrect, while the neighbor
attributes used by RI are all known. However, prior work
shows that CI—when learning uses a fully-labeled graph—
can be effective even when all labels in a separate test graph
are initially unknown [8, 7]. Thus, having a large number of
unknown labels during CI’s inference, while a drawback, is
not enough to explain the substantial differences vs. RI.
The key problem is that, at low label density, CI struggles to learn the parameters related to label-based features.
A link can be used for learning such features only when both of its nodes have known labels. In contrast, with
neighbor attributes a single node’s label makes a link useful
for learning, since the node’s neighbors’ attributes are all
known. Thus, with RI a single labeled node can provide
multiple examples for learning (one for each of its links).
For example, for the Gene dataset of Figure 1, when the
density is 10% (110 known nodes), CI can learn from an
average of only 20 links, while RI can use an average of 340.
Thus, for sparsely-labeled networks, the effective training
size for features based on neighbor attributes is much larger
than for those based on neighbor labels. This compensates
for the greater number of parameters required by neighbor
attributes, leading to higher accuracy.
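This difference in effective training size is easy to quantify; a small sketch (assuming a simple edge-list representation, not the paper's code) counts the link-based training examples available to each approach:

```python
def usable_training_links(edges, known):
    """Count link-based training examples given the set of labeled nodes.

    edges: iterable of (u, v) node pairs; known: set of nodes with known labels.
    Label-based features (CI) need both endpoints labeled; attribute-based
    features (RI) get one example per (labeled node, incident link) pair,
    since all attributes are known.
    """
    ci_examples = sum(1 for u, v in edges if u in known and v in known)
    ri_examples = sum(int(u in known) + int(v in known) for u, v in edges)
    return ci_examples, ri_examples
```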
Recent work with partially-labeled networks has also observed these problems with neighbor labels, but did not consider neighbor attributes. Instead, such studies avoid the problem by discarding label-based features in favor of latent features and/or links [12, 11]. Others have proposed
non-learning methods that use label propagation or random
walks [5]. Yet others have proposed SSL variants to first
predict (noisy) labels for the entire network [15, 1, 6].
We later compare against some of these methods (e.g.,
with [5]). However, Figure 1’s results already include some
effective SSL methods [1, 6] (see Section 5.3), and yet substantial problems remain for CI at low density. Section 6.2
compares RCI, RI, and CI more completely and shows that
neighbor attributes are helpful with or without SSL.
4. LEVERAGING NEIGHBOR ATTRIBUTES
This section explains existing methods for predicting with
neighbor attributes, then introduces a new technique that
enables the use of discriminative classifiers.
4.1 Existing Methods for Neighbor Attributes
Let N_i be the set of nodes that are adjacent to node v_i, i.e., in the "neighborhood" of v_i (for simplicity, we assume undirected links here). Furthermore, let X_{N_i} be the set of attribute vectors for all nodes in N_i (X_{N_i} = {x_j | v_j ∈ N_i}). Suppose we wish to predict the label y_i for v_i based on its attributes and the attributes of N_i. As described above, the variable size of N_i presents a challenge. To address this general issue, prior studies (with neighbor labels) often assume that the labels of nodes in N_i are conditionally independent given y_i. This assumption is not necessarily true, but often works well in practice [3, 8, 7]. In our context (with neighbor attributes), we can make the analogous assumption that the attribute vectors of the nodes in N_i (and the attribute vector x_i of v_i itself) are conditionally independent given y_i.
Using Bayes' rule and this assumption yields

$$
\begin{aligned}
p(y_i \mid \vec{x}_i, X_{N_i}) &= p(y_i)\,\frac{p(\vec{x}_i, X_{N_i} \mid y_i)}{p(\vec{x}_i, X_{N_i})} \\
&= p(y_i)\,\frac{p(\vec{x}_i \mid y_i)}{p(\vec{x}_i, X_{N_i})} \prod_{v_j \in N_i} p(\vec{x}_j \mid y_i) \\
&\propto p(y_i)\, p(\vec{x}_i \mid y_i) \prod_{v_j \in N_i} p(\vec{x}_j \mid y_i)
\end{aligned} \qquad (1)
$$

where the last step drops all values independent of y_i.
To use Equation 1, we must compute p(x_i | y_i) and p(x_j | y_i). The same technique works for both; we now explain for the latter. We further assume that all attribute values for v_j (i.e., the values inside x_j) are independent given y_i. If nodes have M attributes, then

$$
p(\vec{x}_j \mid y_i) = p(x_{j1}, x_{j2}, \ldots, x_{jM} \mid y_i) = \prod_{k=1}^{M} p(x_{jk} \mid y_i). \qquad (2)
$$
Plugging this equation (and the equivalent one for p(x_i | y_i))
into Equation 1 yields a (relational) Naive Bayes classifier [8]. In particular, the features used to predict the label
for v_i are v_i's attributes and v_i's neighbors' attributes, and
these values are assumed to be conditionally independent.
Jensen et al. [3] used this simple generative classifier for
RI, and a simple extension to add the labels of neighboring
nodes as features yields the equations needed for RCI.
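A sketch of scoring a node with this relational Naive Bayes classifier (Equations 1 and 2), computed in log space; the probability tables are assumed to be estimated elsewhere, and for brevity the sketch shares one conditional table for self and neighbor attributes:

```python
import math

def relational_nb_score(v, graph, attrs, prior, p_attr, classes):
    """Estimate p(y_i | x_i, X_Ni) up to normalization, per Equations 1-2.

    prior[c]         -> p(y = c)
    p_attr[c][k][x]  -> p(attribute k takes value x | y = c)
    """
    log_scores = {}
    for c in classes:
        log_p = math.log(prior[c])
        # p(x_i | y_i): the node's own attributes (Equation 2).
        for k, val in enumerate(attrs[v]):
            log_p += math.log(p_attr[c][k][val])
        # Product over neighbors of p(x_j | y_i) (Equation 1).
        for u in graph[v]:
            for k, val in enumerate(attrs[u]):
                log_p += math.log(p_attr[c][k][val])
        log_scores[c] = log_p
    # Normalize into a distribution over the class labels.
    m = max(log_scores.values())
    exp_s = {c: math.exp(s - m) for c, s in log_scores.items()}
    z = sum(exp_s.values())
    return {c: s / z for c, s in exp_s.items()}
```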
4.2 Multi-Neighbor Attribute Classification
The method described above can predict with neighbor attributes, and can increase classification accuracy, as we show
later. However, it has two potential shortcomings. First, it
ignores dependencies among the attributes within a single
node. Second, it requires using probabilities like p(x_j | y_i), whereas a discriminative classifier (e.g., logistic regression) would compute p(y_i | x_j). Thus, the set of classifiers that can
be used is constrained, and overall accuracy may suffer.
A new idea is to take Equation 1 and apply Bayes' rule to each conditional probability separately. This yields

$$
\begin{aligned}
p(y_i \mid \vec{x}_i, X_{N_i}) &\propto p(y_i)\,\frac{p(y_i \mid \vec{x}_i)\, p(\vec{x}_i)}{p(y_i)} \prod_{v_j \in N_i} \frac{p(y_i \mid \vec{x}_j)\, p(\vec{x}_j)}{p(y_i)} \\
&\propto p(y_i \mid \vec{x}_i) \prod_{v_j \in N_i} \frac{p(y_i \mid \vec{x}_j)}{p(y_i)}
\end{aligned} \qquad (3)
$$

where the last step drops all values independent of y_i.
We call classification based on Equation 3 Multi-Neighbor
Attribute Classification (MNAC). This approach requires
two conditional models, p(y_i | x_i) and p(y_i | x_j). The first is
a standard (self)attribute-only classifier, while the second is
more unusual: a classifier that predicts the label of a node
based on the attributes of one of its neighbors. Because
the prediction for this latter model is based on just a single
attribute vector, any probabilistic classifier can be used, including discriminative classifiers such as logistic regression.
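A sketch of MNAC with scikit-learn logistic regression (our illustration; the training-set construction and model names are assumptions): one model is trained on a node's own attributes, the other on single-neighbor attributes, and Equation 3 combines them.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_mnac(known_nodes, graph, attrs, labels):
    """Train the two conditional models that MNAC needs (sketch)."""
    # p(y_i | x_i): a standard self-attribute classifier.
    self_clf = LogisticRegression(max_iter=1000).fit(
        np.array([attrs[v] for v in known_nodes]),
        np.array([labels[v] for v in known_nodes]))

    # p(y_i | x_j): one example per (labeled node, neighbor) pair, using the
    # neighbor's attributes as input and the labeled node's label as target.
    X_nb = np.array([attrs[u] for v in known_nodes for u in graph[v]])
    y_nb = np.array([labels[v] for v in known_nodes for _ in graph[v]])
    nb_clf = LogisticRegression(max_iter=1000).fit(X_nb, y_nb)
    return self_clf, nb_clf

def mnac_predict(v, graph, attrs, self_clf, nb_clf, prior):
    """Combine the two models per Equation 3, in log space, then normalize.

    Assumes both classifiers saw the same label set, so classes_ align.
    """
    classes = self_clf.classes_
    log_prior = np.log(np.array([prior[c] for c in classes]))
    log_p = np.log(self_clf.predict_proba([attrs[v]])[0])
    for u in graph[v]:
        log_p += np.log(nb_clf.predict_proba([attrs[u]])[0]) - log_prior
    p = np.exp(log_p - log_p.max())
    return dict(zip(classes, p / p.sum()))
```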
MNAC’s derivation is simple, but it has not been previously used for link-based classification. McDowell & Aha [6]
used a somewhat similar technique to produce “hybrid models” that combine two classifiers, but the derivation is different and they did not consider neighbor attributes.
5. EXPERIMENTAL METHOD
5.1 Datasets and Features
We use all of the real datasets used in prior studies with
semi-supervised ICA, and some synthetic data (see Tables 2
& 3). We removed all nodes with no links.
Cora (cf., [10]) is a collection of machine learning papers.
Citeseer (cf., [10]) is a collection of research papers. Attributes represent the presence of certain words, and links
indicate citations. We mimic Bilgic et al. [1] by ignoring link
direction, and also by using the 100 top attribute features
after applying PCA to all nodes’ attributes.
Gene (cf., [3]) describes the yeast genome at the protein
level; links represent protein interactions. We mimic Xiang
& Neville [15] and predict protein localization using four
attributes: Phenotype, Class, Essential, and Chromosome.
When using LR, we binarized these, yielding 54 attributes.
We create synthetic data using Sen et al.’s graph generator
[10]. Degree of homophily is how likely a node is to link to
another node with the same label; we use their defaults of
0.7 with a link density of 0.2. Each node has ten binary
attributes with an attribute predictiveness of 0.6 (see [7]).
Table 3: Data sets summary.

Characteristics      Cora   CiteSeer   Gene   Syn.
Total nodes          2708   3312       1103   1000
Total links          5278   4536       1672   1250
Class labels         7      6          2      5
% dominant class     16%    21%        56%    20%

5.2 Classifiers and Regularization
All models shown in Table 1 (except NeighLabels) require learning a classifier to predict the label based on self
attributes and/or a classifier to predict based on neighbor
attributes. We evaluate Naive Bayes (NB), because of its
past use with neighbor attributes [3], and logistic regression (LR), because it usually outperformed NB [10, 1]. For
neighbor attributes, LR uses the new MNAC method.
RCI and CI also require a classifier to predict based on
neighbor labels. McDowell & Aha [6] found that NB with
“multiset” features was superior to LR with “proportion” features as used by Bilgic et al. [1]. Thus, we use NB for neighbor labels, and combine these results with the NB or LR
classifiers used for attributes (described above), using the
“hybrid model” method mentioned in Section 4.2 [6].
For sparsely-labeled data, regularization can have a large
impact on accuracy. We used five-fold cross-validation on
the labeled data, selecting the value of the regularization
hyperparameter that maximized accuracy on the held-out
labeled data. We used a Gaussian prior with all LR’s features, a Dirichlet prior with NB’s discrete features, and a
Normal-Gamma prior with NB’s continuous features.
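For the LR case, a sketch of this hyperparameter selection using scikit-learn (an assumed implementation; the candidate grid is illustrative). The C parameter is the inverse regularization strength, so larger C corresponds to a weaker Gaussian prior.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

def select_lr_regularization(X_labeled, y_labeled):
    """Pick the LR regularization strength by 5-fold CV on the labeled data,
    maximizing held-out accuracy (sketch)."""
    search = GridSearchCV(
        LogisticRegression(max_iter=1000),
        param_grid={"C": [0.01, 0.1, 1.0, 10.0, 100.0]},  # illustrative grid
        cv=5, scoring="accuracy")
    search.fit(np.asarray(X_labeled), np.asarray(y_labeled))
    return search.best_estimator_
```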
5.3 Learning and Collective Inference
We chose default learning and inference algorithms that
performed well in prior studies; Section 6.2 considers others.
For learning, we use the SSL-Once strategy: first, learn
the classifiers using the attributes and the known labels.
Next, run inference to predict labels for every “unknown”
node. Finally, learn new classifiers, using the attributes,
known labels, and newly predicted labels. With LR, we also
use “label regularization” [6] which biases the learning towards models that yield sensible label distributions (it does
not apply to NB). McDowell & Aha [6] found that these
choices performed well overall and had consistent accuracy
gains compared to not using SSL. The three baseline models
(see Table 1) do not use SSL or label regularization.
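A sketch of the SSL-Once strategy just described; the train_classifiers and run_inference callables are placeholders (label regularization is omitted here):

```python
def ssl_once(graph, attrs, known_labels, unknown_nodes,
             train_classifiers, run_inference):
    """SSL-Once (sketch): learn, infer once over the unlabeled nodes,
    then relearn with the newly predicted labels included."""
    # Step 1: learn from the attributes and the known labels only.
    clfs = train_classifiers(graph, attrs, known_labels)

    # Step 2: run inference to predict labels for every "unknown" node.
    predicted = run_inference(clfs, graph, attrs, known_labels, unknown_nodes)

    # Step 3: learn new classifiers, treating predictions as additional
    # (noisy) labels alongside the known ones.
    augmented = {**known_labels, **predicted}
    return train_classifiers(graph, attrs, augmented)
```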
For inference, CI and RCI use 10 iterations of ICA. For
NeighLabels, we use a common baseline based on relaxation labeling (wvRN+RL [5], see Section 2).
6. RESULTS
We report accuracy averaged over 20 trials. For each, we
randomly select some fraction of V (the “label density” d) to
be "known" nodes V^K; those remaining form the "unknown label" test set V^U. We focus on the important sparsely-labeled case [11, 6], e.g., d ≤ 10%. To assess results, we use
paired t-tests with a 5% significance level, with appropriate
corrections because the 20 test sets are not disjoint [14].
6.1 Key Results with Real Datasets
Figure 2 shows average classification accuracy for the real
datasets. The solid lines show results with RCI, RI, or CI,
using NB as the attribute classifier. The results for Gene
were already discussed in Section 3, and match closely the
trends for Cora and Citeseer. In particular, for high label
density d, CI (and sometimes RCI) performs best; but, when
labels are sparse, using neighbor attributes (with RCI or RI)
is essential. When d < 10%, the gains of RCI and RI vs. CI
are, with one exception, all significant (see caption). Results
without SSL-Once (not shown) showed very similar trends.
With LR (see Table 4 for partial details), the trends are
very similar: CI is significantly better than RI for high d,
but the opposite holds for low d. Fortunately, RCI (using
neighbor labels and attributes) yields good relative performance regardless of d, as was true with NB.
For comparison, the dashed lines in Figure 2 show results
for RCI with LR (using MNAC). For Cora and Citeseer, this
method beats RCI with NB in all cases. The gains are substantial (usually 8-10%) for very sparse graphs, and smaller
but still significant for higher d. Here, LR outperforms NB
Figure 2: Accuracy vs. label density (d) for Cora, Citeseer, and Gene. Solid lines show results with NB; for these, filled (not hollow) symbols indicate sig. differences vs. CI. Dashed lines show RCI+LR for comparison; for these, filled symbols mean sig. differences vs. RCI+NB. These results (as with all others, unless otherwise stated) use SSL-Once and, where needed, ICA for inference.
Table 4: Average accuracy for different combinations of models, learning, and inference algorithms, using LR. Within each column, we bold the best result and all values that were not significantly different from that result. For d > 10% (not shown), learning and inference choices usually affected accuracy by 1% or less.

                            Cora                    Citeseer                Gene
Label density (d)     1%    3%    5%    10%    1%    3%    5%    10%    1%    3%    5%    10%
RCI+No-SSL           58.1  74.7  78.9  82.6   51.5  63.7  67.1  71.2   63.4  70.6  75.7  78.5
RCI+SSL-Once         71.3  80.0  81.6  83.2   59.1  67.2  69.7  71.4   67.1  73.6  77.6  79.4
RCI+SSL-EM           73.5  79.6  81.3  82.7   63.1  68.2  69.8  71.2   70.6  75.6  78.0  78.9
RCI+SSL-Once+Gibbs   69.7  79.7  81.3  83.0   59.1  65.1  67.6  69.0   67.6  74.2  78.2  79.8
RI+No-SSL            57.4  73.1  77.1  80.6   51.2  63.1  66.8  70.5   63.1  68.8  73.6  76.3
RI+SSL-Once          67.8  78.7  80.5  81.8   57.9  66.7  69.4  71.3   65.9  71.6  75.7  78.0
RI+SSL-EM            70.2  78.9  80.2  81.3   62.9  68.4  69.7  71.2   68.3  71.4  76.4  76.8
CI+No-SSL            40.9  63.1  70.6  79.5   44.3  59.5  63.4  69.6   60.5  67.8  72.2  75.7
CI+SSL-Once          57.2  75.1  78.3  81.3   52.5  64.2  67.7  69.6   60.9  66.7  70.6  74.2
CI+SSL-EM            64.6  78.2  80.2  82.0   57.4  65.8  68.0  69.6   62.8  67.0  75.3  77.2
CI+SSL-Once+Gibbs    36.6  62.8  71.2  78.0   44.3  54.8  53.2  60.8   61.0  67.0  70.4  74.8
SelfAttrs            37.1  54.1  59.5  65.8   40.7  55.8  61.8  66.7   59.8  65.5  68.4  71.6
NeighAttrs           52.1  70.9  74.6  77.8   45.1  58.4  63.5  65.8   59.4  65.9  69.5  71.8
NeighLabels          41.2  63.4  72.3  77.9   31.5  47.9  52.6  55.8   55.7  62.6  66.6  71.6
because the attributes are continuous (a challenging scenario
for NB) and because LR does not assume that the (many)
attributes are independent. For Gene, however, NB is generally better, likely because NB can use the 4 discrete attributes, whereas LR must use the 54 binarized attributes.
Overall, for sparsely-labeled graphs, neighbor attributes
appear to be much more useful than previously recognized,
and our new LR with MNAC sometimes significantly increases accuracy. For simplicity, below we only consider LR.
6.2 Varying Learning & Inference Algorithms
The results above all used SSL-Once for learning and,
where needed, ICA for collective inference. Table 4 examines
results where we vary these choices. First, there are different
learning variants: No-SSL, where SSL is not used (though
label regularization still is), and SSL-EM, where SSL-Once
is repeated 10 times, as with McDowell & Aha [6]. Second,
for RCI and CI we consider Gibbs sampling instead of ICA.
Gibbs has been frequently used, including in the RI vs. CI
comparisons of Jensen et al. [3], but sometimes has erratic
behavior [7].
These choices can impact accuracy. For instance, SSL-Once+Gibbs sometimes beats SSL-Once with ICA, but
always by less than 1%, and decreases accuracy substantially
in several cases with CI. Using SSL-EM instead of SSL-Once leads to more consistent gains that are sometimes
substantial, especially for CI.
Thus, for different datasets and classifiers, the best SSL
and inference methods will vary. However, our default use
of ICA with SSL-Once was rarely far from maximal when
RCI, the best model, was used. Moreover, the values in bold
show that, regardless of the learning and inference choices,
RCI or RI yielded the best overall accuracy (at least for the
data of Table 4, which focuses on sparse networks). Also, using RCI or RI with SSL-Once almost always outperformed
CI with any learning/inference combination shown.
For comparison, the bottom of Table 4 also shows results
with three baseline algorithms. NeighAttrs performs best
on average, indicating the predictive value of neighbor attributes, even when used alone.
6.3 Varying the Data Characteristics
We compared RCI, RI, and CI on the synthetic data.
First, varying the label density produced results (not shown)
very similar to Gene in Figure 2. Second, in Figure 3(a), as
homophily increases, both RI and CI improve by exploiting
greater link-based correlations. RCI does even better by
using both kinds of link features, except at low homophily
where using the (uninformative) links decreases accuracy.
Figure 3(b) shows the impact of adding some random attributes for each node. We expected CI’s relative performance to improve as these were added, based on the results
and argument of Jensen et al. [3]: CI has fewer parameters than RI, and thus should suffer less from high variance
due to the random attributes. Instead, we found that RI’s
(and RCI’s) gains over CI only increase, for two reasons.
Figure 3: Synthetic data results, using LR and 5% label density, comparing RCI, RI, and CI: (a) accuracy vs. degree of homophily (dh); (b) accuracy vs. number of random attributes. Filled symbols indicate sig. differences vs. CI.
First, Jensen et al. had fully-labeled training data; for our
sparse setting, RI has a larger effective training size (see Section 3), reducing variance. Second, unlike Jensen et al. we
use cross-validation to select regularization parameters for
the features. The regularization reduces variance for RI and
CI, and is especially helpful for RI as random attributes are
added. If we remove regularization and increase label density, the differences between CI and RI decrease markedly.
6.4 Additional Experiments
Due to lack of space, we can only summarize two additional experiments. First, we evaluated “soft ICA”, where
continuous label estimates are retained and used at each
step of ICA instead of choosing the most likely label for
each node [15]. This slightly increased accuracy in some
cases, but did not change the relative performance of RCI
and RI vs. CI. Second, we compared our results (using a
single partially-labeled graph) vs. “ideal” results obtained
using a fully-labeled learning graph or a fully-labeled inference graph. We found that ideal learning influenced results
much more than ideal inference for all methods, but that
CI was (negatively) affected by realistic (non-ideal) learning much more than RI or RCI were. Thus, as argued in
Section 3, RI’s and RCI’s gains vs. CI for sparsely-labeled
networks seem to arise more because of CI’s difficulties with
learning rather than with inference.
7. CONCLUSION
Link-based classification is an important task, for which
the most common methods involve computing relational features. Almost no recent work has considered neighbor attributes for such features, because prior work showed they
performed poorly and they were not compatible with many
classifiers. We showed, however, that for sparsely-labeled
graphs using neighbor attributes (with RI) often significantly outperformed neighbor labels (with CI), and that using both (with RCI) yielded high accuracy regardless of label
density. We also introduced a new method, MNAC, which
enables classifiers like logistic regression to use neighbor attributes, further increasing accuracy for some datasets. Naturally, the best classifier depends upon data characteristics;
MNAC greatly expands the set of possibilities.
Our findings should encourage future researchers to consider neighbor attributes in models for link-based classification, to include RI and RCI as baselines in comparisons,
and to re-evaluate some earlier work that did not consider
such models (see Table 2). For instance, Bilgic et al. [1]
studied active learning with sparse graphs; their optimal
strategies could be quite different if RI or RCI were considered, since they could tolerate learning with fewer and more widely-dispersed labels. Likewise, Shi et al. [11] used semi-supervised ICA with CI as a baseline, but these results could change substantially with RCI instead.

Our results need to be confirmed with additional datasets and learning algorithms. Also, the best methods should be compared against others discussed in Section 2. Finally, we intend to explore these effects in networks that include nodes with different types and some missing attribute values.
8. ACKNOWLEDGMENTS
Thanks to Gavin Taylor and the anonymous referees for
comments that aided this work. This work was supported in
part by NSF award number 1116439 and a grant from ONR.
9. REFERENCES
[1] M. Bilgic, L. Mihalkova, and L. Getoor. Active
learning for networked data. In Proc. of ICML, pages
79–86, 2010.
[2] S. Chakrabarti, B. Dom, and P. Indyk. Enhanced
hypertext categorization using hyperlinks. In Proc. of
SIGMOD, pages 307–318, 1998.
[3] D. Jensen, J. Neville, and B. Gallagher. Why
collective inference improves relational classification.
In Proc. of KDD, pages 593–598, 2004.
[4] Q. Lu and L. Getoor. Link-based classification using
labeled and unlabeled data. In ICML Workshop on
the Continuum from Labeled to Unlabeled data, 2003.
[5] S. Macskassy and F. Provost. Classification in
networked data: A toolkit and a univariate case study.
J. of Machine Learning Research, 8:935–983, 2007.
[6] L. McDowell and D. Aha. Semi-supervised collective
classification via hybrid label regularization. In Proc.
of ICML, pages 975–982, 2012.
[7] L. McDowell, K. Gupta, and D. Aha. Cautious
collective classification. J. of Machine Learning
Research, 10:2777–2836, 2009.
[8] J. Neville and D. Jensen. Relational dependency
networks. J. of Machine Learning Research, 8:653–692,
2007.
[9] R. Rossi, L. McDowell, D. Aha, and J. Neville.
Transforming graph data for statistical relational
learning. J. of Artificial Intelligence Research,
45:363–441, 2012.
[10] P. Sen, G. Namata, M. Bilgic, L. Getoor, B. Gallagher,
and T. Eliassi-Rad. Collective classification in network
data. AI Magazine, 29(3):93–106, 2008.
[11] X. Shi, Y. Li, and P. Yu. Collective prediction with
latent graphs. In Proc. of CIKM, pages 1127–1136,
2011.
[12] L. Tang and H. Liu. Relational learning via latent
social dimensions. In Proc. of KDD, pages 817–826,
2009.
[13] B. Taskar, E. Segal, and D. Koller. Probabilistic
classification and clustering in relational data. In Proc.
of IJCAI, pages 870–878, 2001.
[14] T. Wang, J. Neville, B. Gallagher, and T. Eliassi-Rad.
Correcting bias in statistical tests for network classifier
evaluation. In Proc. of ECML, pages 506–521, 2011.
[15] R. Xiang and J. Neville. Pseudolikelihood EM for
within-network relational learning. In Proc. of ICDM,
pages 1103–1108, 2008.