|
|
Topics
Isoperimetric Co-clustering Algorithm (ICA) for pairwise data
co-clustering
|
Problem
Definition
-
Data co-clustering refers to the problem of
simultaneous clustering of two data types.
Typically, the data is stored in a contingency or
co-occurrence matrix C where rows and columns of the
matrix represent the data types to be co-clustered.
An entry Cij of the matrix signifies the relation
between the data type represented by row i and
column j. Co-clustering is the problem of deriving
sub-matrices from the larger data matrix by
simultaneously clustering rows and columns of the
data matrix.
Algorithms
- The two data types are modeled as the two sets of
vertices of a weighted bipartite graph (see above
Figure).
- Isoperimetric Co-clustering Algorithm (ICA)
requires a simple solution to a sparse system of
linear equations instead of the eigenvalue or SVD
problem in the popular spectral co-clustering
approach.
Results
- In the following figures, (a) shows a typical
word-document matrix before co-clustering.
Co-clustering leads to rearranging of the documents
and words such that the most co-related get grouped
together simultaneously. (b) shows a dataset having
2 clusters in the ground truth, bipartitioned using
ICA. (c) Similarly, dataset having 6 clusters in the
ground truth has been k-partitioned by ICA.
- The following figures show the performance of ICA
and Spectral-SVD in the presence of additive and
multiplicative Gaussian noise on some datasets.
X-axis has the variance of the noise while the
isoperimetric ratio of the partitioning is along the
Y-axis. Additive noise had zero mean with variance
increasing from 1 to the maximum value in the
original data. Multiplicative noise had mean of 1
with its variance going from 1 to a maximum of 5.
First two plots are bipartitioning in the presence
of additive noise on Interest-Trade dataset and
multiplicative noise on ArachidonicAcids-Hematocrit
dataset. Similarly, next two plots are for
k-partitioning with additive noise on wap and
multiplicative on re0 datasets. From these results,
we can see that inspite of the varying amounts and
kinds of noise in the data, ICA is able to perform
optimal partitioning indicated by its low
isoperimetric ratio. Second noticeable fact is in
regards to stability. Rising ratios as the variance
increases indicates that the performance of
algorithm is gradually decreasing. However,
fluctuating ratios indicates instability and
inconsistency to partition optimally.
 |
 |
 |
 |
- We plotted the computational speed of ICA and Spectral-SVD in the following
figure. The time required to compute the indicator vector is along the Y-axis
and the number of vertices in the bipartite graph are shown along the X-axis.
Time for ICA gradually increases as the number of vertices increase. However,
Spectral-SVD time increases more rapidly comparatively and hence is clearly
outperformed by ICA.
Related Publications
- Manjeet Rege, Ming Dong and Farshad Fotouhi,
“Bipartite Isoperimetric Graph Partitioning for Data
Co-clustering”, Data Mining and Knowledge Discovery
Journal, Vol. 16, No. 3, 2008.
- Manjeet Rege, Ming Dong and Farshad Fotouhi,
“Co-clustering documents and words using Bipartite
Isoperimetric Graph Partitioning”, in proceedings of
IEEE International Conference on Data Mining, 2006.
(regular paper - acceptance rate 10%, oral
presentation)
|
|
| |
|