VII Center for Visual Informatics and Intelligence wsu

Topics

Semi-supervised NMF for Heterogeneous Data Co-clustering



Problem Definition


  • Pairwise heterogeneous co-clustering is the task of clustering objects of two related data types at the same time. In semi-supervised co-clustering, supervision is typically provided as two sets of pairwise constraints derived from given labels on the central data: must-link constraints M = {(xi, xj)} and cannot-link constraints C = {(xi, xj)}. Here (xi, xj) in M means that xi and xj are labeled as belonging to the same cluster, while (xi, xj) in C means that xi and xj are labeled as belonging to different clusters. A technique that successfully co-clusters such pairwise data can be easily extended to structures involving more data types.
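
As an illustration of how the two constraint sets follow from a partially labeled data set, here is a minimal Python sketch. The function name `pairwise_constraints` and the dictionary input format are assumptions made for this example; they do not come from the project itself.

```python
from itertools import combinations


def pairwise_constraints(labels):
    """Build must-link and cannot-link sets from a partial labeling.

    `labels` maps the index of each labeled object to its cluster label
    (hypothetical input format). Unlabeled objects are simply absent.
    """
    must_link, cannot_link = set(), set()
    # Visit every unordered pair of labeled objects exactly once.
    for i, j in combinations(sorted(labels), 2):
        if labels[i] == labels[j]:
            must_link.add((i, j))      # same label: pair goes into M
        else:
            cannot_link.add((i, j))    # different labels: pair goes into C
    return must_link, cannot_link


# Three labeled objects out of a larger collection.
M, C = pairwise_constraints({0: "a", 3: "a", 5: "b"})
print(sorted(M))  # [(0, 3)]
print(sorted(C))  # [(0, 5), (3, 5)]
```

Each unordered pair is stored once with the smaller index first, so M and C never overlap.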

Algorithms

  • The above figure shows the pairwise data (e.g., documents and words, images and features, genes and conditions), which is a basic element of the general pairwise heterogeneous semi-supervised co-clustering structure. The relation between data type 1 and data type 2 is denoted by a matrix R(12). The green edges indicate the must-link constraints M, while the red edges denote the cannot-link constraints C. The dotted line shows the optimal co-clustering result.
  • Using an iterative algorithm, we perform a tri-factorization of the new relational matrix R(12), which is obtained with the learned distance metric, to infer the clusters of the two data types.
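
The tri-factorization step can be sketched as follows. This is a generic nonnegative matrix tri-factorization R ≈ G S Fᵀ with standard multiplicative updates for the squared Frobenius error, not necessarily the exact update rules of SS-NMF. The metric-learning step is omitted, so `R` is taken as given, and the names `tri_factorize`, `k`, `l`, `n_iter`, `eps` and `seed` are assumptions introduced for illustration.

```python
import numpy as np


def tri_factorize(R, k, l, n_iter=200, eps=1e-9, seed=0):
    """Approximate a nonnegative matrix R (n x m) as G @ S @ F.T.

    G (n x k) holds cluster memberships for data type 1,
    F (m x l) holds cluster memberships for data type 2,
    S (k x l) links the two sets of clusters.
    """
    rng = np.random.default_rng(seed)
    n, m = R.shape
    G = rng.random((n, k)) + eps
    S = rng.random((k, l)) + eps
    F = rng.random((m, l)) + eps
    errors = []
    for _ in range(n_iter):
        # Multiplicative updates keep every factor nonnegative.
        G *= (R @ F @ S.T) / (G @ S @ F.T @ F @ S.T + eps)
        F *= (R.T @ G @ S) / (F @ S.T @ G.T @ G @ S + eps)
        S *= (G.T @ R @ F) / (G.T @ G @ S @ F.T @ F + eps)
        # Track the reconstruction error after each full sweep.
        errors.append(np.linalg.norm(R - G @ S @ F.T))
    return G, S, F, errors


# Toy relational matrix with two obvious row/column blocks.
R = np.array([[5.0, 4.0, 0.0, 0.0],
              [4.0, 5.0, 0.0, 0.0],
              [0.0, 0.0, 3.0, 4.0],
              [0.0, 0.0, 4.0, 3.0]])
G, S, F, errors = tri_factorize(R, k=2, l=2)
row_clusters = G.argmax(axis=1)   # cluster index per object of type 1
col_clusters = F.argmax(axis=1)   # cluster index per object of type 2
```

Reading the clusters off with `argmax` is one simple choice; the result depends on the random initialization, so in practice several seeds may be worth trying.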

Results

  • The following figure shows the AC value of SS-NMF as the percentage of pairwise constraints increases. SS-NMF outperforms BSGP and unsupervised NMF on all the document-word data sets. Another important observation is that the accuracy of SS-NMF increases consistently as the percentage of pairwise constraints grows from 0.5% to 10%. Moreover, in certain cases SS-NMF generates significantly better results by learning from only a few constraints (0.5%), as demonstrated on the data sets CT2, CT4, CT5 and CT7, so clustering performance can be greatly improved even with very limited prior knowledge. It is also worth pointing out that the AC value of SS-NMF is as high as 99% on the data sets CT2, CT5 and CT7 with 10% constraints. In other words, SS-NMF provides near-perfect clustering results on these data sets.
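
AC is not defined on this page. The sketch below assumes the definition commonly used for clustering accuracy: the fraction of objects labeled correctly under the best one-to-one mapping from predicted clusters to true classes. The function name `clustering_accuracy` is hypothetical, and the brute-force search over mappings is only practical for a small number of clusters.

```python
from itertools import permutations

_UNMATCHED = object()  # placeholder for clusters that map to no class


def clustering_accuracy(true_labels, pred_labels):
    """Accuracy under the best one-to-one cluster-to-class mapping."""
    if len(true_labels) != len(pred_labels):
        raise ValueError("label sequences must have the same length")
    classes = sorted(set(true_labels))
    clusters = sorted(set(pred_labels))
    # Pad so that every cluster can be assigned a target, even if there
    # are more clusters than classes.
    targets = classes + [_UNMATCHED] * max(0, len(clusters) - len(classes))
    best_hits = 0
    # Brute force: try every assignment of distinct targets to clusters.
    for assignment in permutations(targets, len(clusters)):
        mapping = dict(zip(clusters, assignment))
        hits = sum(mapping[p] == t for t, p in zip(true_labels, pred_labels))
        best_hits = max(best_hits, hits)
    return best_hits / len(true_labels)


true = [0, 0, 1, 1, 2, 2]
pred = [1, 1, 0, 0, 2, 0]   # cluster ids are arbitrary; one object is wrong
print(clustering_accuracy(true, pred))  # 5 of 6 objects match
```

For many clusters, the same mapping can be found with an assignment solver such as the Hungarian algorithm instead of enumerating permutations.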

Related Publications

  • Yanhua Chen, Lijun Wang and Ming Dong, "A Matrix-based Approach for Semi-supervised Document Co-clustering", Proc. of ACM 17th Conference on Information and Knowledge Management (CIKM), Napa Valley, CA, 2008 (acceptance rate: 33%).

    Center for Visual Informatics and Intelligence(VII) © 2009