|
|
Topics
Consistent Isoperimetric High-order Co-clustering (CIHC) for
high-order data co-clustering
|
Problem
Definition
-
Clustering objects of more than two data types is
high-order heterogeneous clustering. The data
structure is star-structured co-clustering problems
in which a central data type is connected to all the
other data types. Intuitively, one might expect that
this kind of co-clustering can be achieved by
trivially extending the pairwise co-clustering
techniques previously developed. However, this
approach does not consider the constraints enforced
on the central data type. A basic element of this
problem is the star-structured triplet.
Algorithms
- Model the triplet using a tripartite graph
consisting of the 3 data types - f, m and w. We
treat this tripartite graph as two bipartite graphs
of f-m and m-w. Co-clustering is then achieved by
simultaneously partitioning these two bipartite
graphs together such that the star-structured
constraint is respected (see above Figure).
- Consistent Isoperimetric High-order Co-clustering
(CIHC) requires a simple and very quick solution to
a sparse system of linear equations.
Results
- In the following sub-figures show the partitioning
obtained for categories, documents and words,
respectively. The dotted line in the second
sub-figure separates the documents from the two
categories. Good document clustering should
partition documents on either side of the line into
different clusters.
- The following figures show the performance of CIHC
and CBGC in the presence of additive and
multiplicative Gaussian noise on some datasets.
X-axis has the variance of the noise while the
isoperimetric ratio of the partitioning is along the
Y-axis. First two plots are tripartitioning in the
presence of additive noise on DS1 and DS2 datasets.
Similarly, next two plots are in the presence of
multiplicative noise on these two datasets. From
these results, we can see that inspite of the
varying amounts and kinds of noise in the data, CIHC
is able to perform optimal partitioning indicated by
its low isoperimetric ratio. Second noticeable fact
is in regards to stability. Rising ratios as the
variance increases indicates that the performance of
the algorithm is decreasing. However, uctuating
ratios indicates instability and inconsistency to
partition optimally. While the ratios of CIHC
uctuate in limits, CBGC ratio fuctuation clearly
shows instability.
 |
 |
 |
 |
- We plotted the computational speed of CIHC and CBGC in the following figure.
The time required to compute the indicator vector is along the Y-axis and the
number of vertices in the tripartite graph are shown along the X-axis. Time for
CIHC gradually increases with the number of vertices. For the maximum number of
vertices we increased to - about almost 4,000, CIHC required about 98 seconds
only. CBGC on the other hand, was unable to keep up with CIHC. As can be seen,
the time required by CBGC really shoots up for a few hundred vertices in the
graph. Moreover, CBGC is unscalable and is unable to handle larger sized graphs.
In our experiment, CBGC was unable to handle graphs with more than 1,500
vertices. This clearly demonstrates the computational efficiency of CIHC and the
potential for applicability in large-scale real-world applications.
Related Publications
- Manjeet Rege, Ming Dong and Jing Hua, "Graph
Theoretical Framework for Simultaneously Integrating
Visual and Textual Features for Efficient Web Image
Clustering", in proceedings of 17th International
Conference on World Wide Web, 2008 (acceptance rate:
10.9%).
- Manjeet Rege, Ming Dong and Jing Hua, “Clustering
Web Images with Multi-modal features”, in
proceedings of ACM International Conference on
Multimedia, 2007 (acceptance rate: 24%).
|
|
| |
|