Semi-supervised NMF for Heterogeneous Data Co-clustering
Problem Definition
- Pairwise heterogeneous co-clustering is the
simultaneous clustering of objects of two related
data types. In semi-supervised
co-clustering, supervision is typically provided as
two sets of pairwise constraints derived from given
labels on the central data: must-link constraints M
= {(xi, xj)} and cannot-link constraints C = {(xi,
xj)}, where (xi, xj) in M implies that xi and xj
are labeled as belonging to the same cluster, while
(xi, xj) in C implies that xi and xj are labeled as
belonging to different clusters. Note that if we can
successfully co-cluster such data, the corresponding
technique can be easily extended to structures
involving more data types.
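To make the constraint sets concrete, the following minimal sketch derives M and C from a small set of known labels. The function name `constraints_from_labels` and the dictionary input format are assumptions made for illustration; they are not taken from the original work.

```python
from itertools import combinations


def constraints_from_labels(labels):
    """Build pairwise constraints from a partial labeling.

    `labels` maps an object index to its known cluster label and covers
    only the labeled subset of objects (hypothetical input format).
    Returns (must_link, cannot_link) as sets of index pairs (i, j), i < j.
    """
    must_link, cannot_link = set(), set()
    # Visit every unordered pair of labeled objects exactly once.
    for (i, label_i), (j, label_j) in combinations(sorted(labels.items()), 2):
        if label_i == label_j:
            must_link.add((i, j))      # same label: must-link
        else:
            cannot_link.add((i, j))    # different labels: cannot-link
    return must_link, cannot_link
```

For example, labeling objects 0 and 1 as one cluster and object 2 as another gives one must-link pair and two cannot-link pairs.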
Algorithms
- The above figure shows the pairwise data (e.g.,
documents and words, images and features, genes and
conditions), which is a basic element of the general
pairwise heterogeneous semi-supervised co-clustering
structure. The relation between data type 1 and data
type 2 is denoted by a matrix R(12). The green edges
indicate the must-link constraints M, while the red
edges denote cannot-link constraints C. The dotted
line shows the optimal co-clustering result.
- Using an iterative algorithm, we perform
tri-factorizations of the new relational matrix
R(12), obtained with the learned distance metric, to
infer the clusters of the two data types.
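The tri-factorization step can be illustrated with a generic sketch. It approximates a nonnegative relational matrix R by F S G^T using standard multiplicative updates for the squared-error objective. This is not the update scheme of SS-NMF itself, and it assumes R has already been modified by the learned distance metric; the names `tri_factorize`, `F`, `S` and `G` are illustrative assumptions.

```python
import numpy as np


def tri_factorize(R, k1, k2, n_iter=200, eps=1e-9, seed=0):
    """Approximate nonnegative R (n x m) as F @ S @ G.T.

    F (n x k1) holds cluster indicators for data type 1,
    G (m x k2) holds cluster indicators for data type 2,
    S (k1 x k2) links the two sets of clusters.
    Generic multiplicative updates, shown for illustration only.
    """
    rng = np.random.default_rng(seed)
    n, m = R.shape
    F = rng.random((n, k1)) + eps
    S = rng.random((k1, k2)) + eps
    G = rng.random((m, k2)) + eps
    errors = []
    for _ in range(n_iter):
        # Each factor is rescaled elementwise; eps avoids division by zero.
        F *= (R @ G @ S.T) / (F @ S @ G.T @ G @ S.T + eps)
        G *= (R.T @ F @ S) / (G @ S.T @ F.T @ F @ S + eps)
        S *= (F.T @ R @ G) / (F.T @ F @ S @ G.T @ G + eps)
        # Track the Frobenius reconstruction error after each sweep.
        errors.append(np.linalg.norm(R - F @ S @ G.T))
    return F, S, G, errors
```

Under these assumptions, cluster labels for the two data types can be read off as `F.argmax(axis=1)` and `G.argmax(axis=1)`.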
Results
- The following figure shows the AC value against
increasing percentage of pairwise constraints for
SS-NMF. SS-NMF clearly outperforms BSGP and
unsupervised NMF on all
the document-word data sets. Another important
observation is that the accuracy of SS-NMF
consistently increases as the percentage of
pairwise constraints grows (from 0.5% to 10%).
Moreover, in certain cases, SS-NMF
generates significantly better results by quickly
learning from only a few constraints (0.5%), as
demonstrated on the data sets CT2, CT4, CT5 and CT7.
Thus, clustering performance can be greatly
improved even with very limited prior knowledge. It
is also worth pointing out that the AC value of
SS-NMF is as high as 99% on the data sets CT2, CT5
and CT7 with 10% constraints. In other words, SS-NMF
provides near-perfect clustering results on these
data sets.
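The text does not define the AC value. Assuming it denotes the usual clustering accuracy (the fraction of objects correctly grouped under the best one-to-one mapping between predicted clusters and true classes), it could be computed as in the sketch below. The function name `clustering_accuracy` is an illustrative assumption; the matching uses SciPy's `linear_sum_assignment`.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def clustering_accuracy(true_labels, pred_labels):
    """Fraction of objects matched under the best cluster-to-class mapping."""
    true_labels = np.asarray(true_labels)
    pred_labels = np.asarray(pred_labels)
    # Re-index arbitrary labels as 0..k-1.
    classes, t = np.unique(true_labels, return_inverse=True)
    clusters, p = np.unique(pred_labels, return_inverse=True)
    # counts[c, k] = number of objects in predicted cluster c with true class k.
    counts = np.zeros((len(clusters), len(classes)), dtype=int)
    np.add.at(counts, (p, t), 1)
    # The assignment solver minimizes cost, so negate to maximize matches.
    rows, cols = linear_sum_assignment(-counts)
    return counts[rows, cols].sum() / len(true_labels)
```

Because cluster identifiers are arbitrary, a prediction that merely swaps the names of two clusters still scores 1.0 under this measure.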
Related Publications
- Yanhua Chen, Lijun Wang and Ming Dong, "A
Matrix-based Approach for Semi-supervised Document
Co-clustering", Proc. of ACM 17th Conference on
Information and Knowledge Management (CIKM), Napa
Valley, CA, 2008 (acceptance rate: 33%).