CRI: IAD Acquisition of Research Infrastructure for
Knowledge-enhanced Large-scale Learning of Multimodality Visual
Data
|
Summary
Compared to the development of visual data acquisition technology
and the explosive collection of acquired datasets, computational
techniques for knowledge discovery and learning from very large,
diverse, heterogeneous visual datasets have evolved only modestly,
which impedes the effective utilization and understanding of these
data. The project aims to bridge this gap
and foster a strong research program in geometry-guided knowledge
discovery in multimodality visual data, with an emphasis on
neuroimaging applications. Specifically, the project focuses on: (1)
exploring new tools based on Riemannian geometry for computing
geometric structures of 3-manifolds and developing a novel
volumetric mapping with geometric flow; (2) developing a rigorous
mathematical foundation for semi-supervised data clustering; and (3)
extending our approaches on geometric mapping and semi-supervised
learning to (high-order) heterogeneous volumetric visual data
analysis. Once developed, these novel algorithms will be applied to
computer-assisted diagnosis of important brain diseases,
such as tumors and functional brain disorders. This project can help
identify disease patterns in the human brain, and thus may
provide both clinical and social benefits to a large sector of the
population. Moreover, the project can immediately help to elevate
the existing resources and ongoing research to a unified,
systematic level and strengthen computer science education. The
research results will be widely disseminated to both the computer
science and medical communities through free Web access to the
software tools (including source code) and a set of sample data
(including raw neuroimaging data and processed data such as
high-resolution brain surface meshes) via the project Web site.
|
Acknowledgments
The support from the National Science Foundation (NSF) under Award Number 0751045 is gratefully acknowledged.
Collaborators
Students
|
Manjeet Rege
|
Yanhua Chen
|
Lijun Wang
|
Chang Liu
|
|
Zhaoqiang Lai
|
Dashan Pai
|
Jiaxi Hu
|
Vahid Taimouri
|
|
Areen Al.Bashir
|
Gutam Bahal
|
Samuel Barnes
|
|
Software and Datasets
Further details are available
here.
Topics
- Low-rank Kernel Matrix Factorization for Large Scale Evolutionary Clustering
Traditional clustering techniques are inapplicable to problems where the relationships between data points evolve over time.
Not only is it important for the clustering algorithm to adapt to the recent changes in the evolving data, but it also needs to take the
historical relationship between the data points into consideration. In this paper, we propose ECKF, a general framework for evolutionary
clustering of large-scale data based on low-rank kernel matrix factorization. To the best of our knowledge, this is the first work that clusters
large evolutionary datasets by the amalgamation of low-rank matrix approximation methods and matrix factorization based clustering.
Since the low-rank approximation provides a compact representation of the original matrix, and especially, the near-optimal low-rank
approximation can preserve the sparsity of the original data, ECKF gains computational efficiency and hence is applicable to large
evolutionary datasets. Moreover, matrix factorization based methods have been shown to effectively cluster high dimensional data in
text mining and multimedia data analysis. From a theoretical standpoint, we mathematically prove the convergence and correctness of
ECKF, and provide detailed analysis of its computational efficiency (both time and space). Through extensive experiments performed
on synthetic and real datasets, we show that ECKF outperforms the existing methods in evolutionary clustering.
Further details of this work are available
here.
|
- Non-Negative Matrix Factorization
for Semisupervised Heterogeneous Data Coclustering
Coclustering heterogeneous data has attracted extensive attention recently due to its high impact on various important
applications, such as text mining, image retrieval, and bioinformatics. However, data coclustering without any prior knowledge or
background information is still a challenging problem. In this paper, we propose a Semisupervised Non-negative Matrix Factorization
(SS-NMF) framework for data coclustering. Specifically, our method computes new relational matrices by incorporating user provided
constraints through simultaneous distance metric learning and modality selection. Using an iterative algorithm, we then perform
trifactorizations of the new matrices to infer the clusters of different data types and their correspondence. Theoretically, we prove the
convergence and correctness of SS-NMF coclustering and show the relationship between SS-NMF and other well-known coclustering
models. Through extensive experiments conducted on publicly available text, gene expression, and image data sets, we demonstrate
the superior performance of SS-NMF for heterogeneous data coclustering.
Further details of this work are available
here.
|
- Selection-fusion approach for classification of datasets with missing values
This paper proposes a new approach based on missing value pattern discovery for classifying incomplete data. This approach is particularly designed for classification of datasets with a small number of samples and a high percentage of missing values, where available missing value treatment approaches do not usually work well. Based on the pattern of the missing values, the proposed approach finds subsets of samples for which most of the features are available and trains a classifier for each subset. Then, it combines the outputs of the classifiers. Subset selection is translated into a clustering problem, allowing derivation of a mathematical framework for it. A trade-off is established between the computational complexity (number of subsets) and the accuracy of the overall classifier. To deal with this trade-off, a numerical criterion is proposed for the prediction of the overall performance. The proposed method is applied to seven datasets from the popular University of California, Irvine data mining archive and an epilepsy dataset from Henry Ford Hospital, Detroit, Michigan (a total of eight datasets). Experimental results show that the classification accuracy of the proposed method is superior to that of the widely used multiple imputation method and four other methods. They also show that the level of superiority depends on the pattern and percentage of missing values.
Further details of this work are available
here.
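The selection and fusion steps described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the method from the paper: samples are grouped by their exact missing-value pattern (the paper instead formulates subset selection as a clustering problem), each subset gets a simple nearest-centroid classifier, and outputs are fused by majority vote. The function names `pattern`, `train_subset_models`, and `predict_fused` are hypothetical, and missing values are assumed to be encoded as `None`.

```python
from collections import Counter, defaultdict

def pattern(sample):
    """Tuple of booleans marking which features are present (not None)."""
    return tuple(v is not None for v in sample)

def train_subset_models(X, y):
    """Group samples by missing-value pattern and fit one
    nearest-centroid model per group, using only the available features."""
    groups = defaultdict(list)
    for sample, label in zip(X, y):
        groups[pattern(sample)].append((sample, label))
    models = []
    for mask, members in groups.items():
        feats = [i for i, present in enumerate(mask) if present]
        if not feats:
            continue  # a sample with no observed features carries no information
        by_label = defaultdict(list)
        for sample, label in members:
            by_label[label].append([sample[i] for i in feats])
        # Class centroid = per-feature mean over the rows of that class.
        centroids = {label: [sum(col) / len(rows) for col in zip(*rows)]
                     for label, rows in by_label.items()}
        models.append((feats, centroids))
    return models

def predict_fused(models, sample):
    """Majority vote over the models whose features are all present in `sample`.
    Returns None when no model is applicable."""
    votes = Counter()
    for feats, centroids in models:
        if any(sample[i] is None for i in feats):
            continue
        values = [sample[i] for i in feats]

        def sq_dist(label):
            return sum((v - c) ** 2 for v, c in zip(values, centroids[label]))

        votes[min(centroids, key=sq_dist)] += 1
    return votes.most_common(1)[0][0] if votes else None
```

Grouping by exact pattern yields as many classifiers as there are distinct patterns; merging similar patterns into fewer subsets is where the complexity versus accuracy trade-off mentioned above would come into play.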
|
- Intra-Patient Supine-Prone Colon Registration
in CT Colonography Using Shape Spectrum
CT colonography (CTC) is a minimally invasive screening
technique for colorectal polyps and colon cancer. Since electronic colon
cleansing (ECC) cannot completely remove the presence of
pseudo-polyps, most CTC protocols acquire both prone and supine images
to improve the visualization of the lumen wall and to reduce false
positives. Comparisons between the prone and supine images can be facilitated
by computerized registration between the scans. In this paper, we
develop a fully automatic method for registering colon surfaces extracted
from prone and supine images. The algorithm uses the shape spectrum to
extract the shape characteristics, which are employed as the surface signature
to find the corresponding regions between the prone and supine
lumen surfaces. Our experimental results demonstrate an accuracy of
12.6 ± 4.20 mm over 20 datasets. The method also shows excellent potential for reducing
false positives when it is used to determine polyps through
correspondences between prone and supine images.
Further details of this work are available
here.
|
- Isoperimetric Co-clustering
Algorithm (ICA) for pairwise data co-clustering
Data co-clustering refers to the problem of simultaneous clustering
of two data types. Typically, the data is stored in a contingency or
co-occurrence matrix C where rows and columns of the matrix
represent the data types to be co-clustered. An entry C_ij
of the matrix signifies the relation between the data type
represented by row i and column j. Co-clustering is the problem of
deriving sub-matrices from the larger data matrix by simultaneously
clustering rows and columns of the data matrix. We present a novel
graph theoretic approach to data co-clustering. The two data types
are modeled as the two sets of vertices of a weighted bipartite
graph. We use the Isoperimetric Co-clustering Algorithm (ICA), a new
method for partitioning the bipartite graph. ICA requires a simple
solution to a sparse system of linear equations instead of the
eigenvalue or SVD problem in the popular spectral co-clustering
approach. Our theoretical analysis and extensive experiments
performed on publicly available datasets demonstrate the advantages
of ICA over other approaches in terms of the quality, efficiency and
stability in partitioning the bipartite graph.
Further details of this work are available
here.
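The idea of partitioning the bipartite graph by solving a linear system can be sketched as follows. This is a minimal sketch assuming the standard isoperimetric partitioning recipe (ground one vertex, solve the reduced Laplacian system with the degree vector as right-hand side, then sweep thresholds for the lowest isoperimetric ratio); the exact formulation used by ICA may differ. The names `solve_linear` and `isoperimetric_cocluster` are hypothetical, a dense solver stands in for the sparse solver a real implementation would use, and the graph is assumed to be connected.

```python
def solve_linear(A, b):
    """Gaussian elimination with partial pivoting for a small dense system."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("singular system: is the graph connected?")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - s) / M[r][r]
    return x

def isoperimetric_cocluster(C):
    """Bipartition rows and columns of co-occurrence matrix C together.
    Returns (row_labels, col_labels) with labels in {0, 1}."""
    m, n = len(C), len(C[0])
    N = m + n
    # Bipartite adjacency: row vertices 0..m-1, column vertices m..m+n-1.
    W = [[0.0] * N for _ in range(N)]
    for i in range(m):
        for j in range(n):
            W[i][m + j] = W[m + j][i] = float(C[i][j])
    deg = [sum(row) for row in W]
    ground = max(range(N), key=lambda v: deg[v])  # assumption: ground the max-degree vertex
    keep = [v for v in range(N) if v != ground]
    # Reduced Laplacian L0 = D - W with the ground row and column removed.
    L0 = [[(deg[u] if u == v else 0.0) - W[u][v] for v in keep] for u in keep]
    x0 = solve_linear(L0, [deg[u] for u in keep])
    x = [0.0] * N
    for u, val in zip(keep, x0):
        x[u] = val
    # Sweep thresholds over the sorted solution; keep the best ratio cut.
    order = sorted(range(N), key=lambda v: x[v])
    total = sum(deg)
    best_ratio, best_k = float("inf"), 1
    for k in range(1, N):
        S = set(order[:k])
        cut = sum(W[u][v] for u in S for v in range(N) if v not in S)
        vol = sum(deg[u] for u in S)
        ratio = cut / min(vol, total - vol)
        if ratio < best_ratio:
            best_ratio, best_k = ratio, k
    S = set(order[:best_k])
    labels = [0 if v in S else 1 for v in range(N)]
    return labels[:m], labels[m:]
```

Because only one linear solve is needed, no eigenvector or singular vector computation is involved, which is the contrast with spectral co-clustering drawn above.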
|
-
Consistent Isoperimetric High-order
Co-clustering (CIHC) for high-order data
co-clustering
Many of the real world clustering problems arising in data mining
applications are heterogeneous in nature. Heterogeneous
co-clustering involves simultaneous clustering of objects of two or
more data types. While pairwise co-clustering of two data types has
been well studied in the literature, research on high-order
heterogeneous co-clustering is still limited. We propose a graph
theoretical framework for addressing star-structured co-clustering
problems in which a central data type is connected to all the other
data types. Partitioning this graph leads to co-clustering of all
the data types under the constraints of the star-structure.
Although graph partitioning approaches have been adopted before to
address star-structured heterogeneous complex problems, the main
contribution of this work lies in an efficient algorithm that we
propose for partitioning the star-structured graph. Computationally,
our algorithm is very quick as it requires a simple solution to a
sparse system of overdetermined linear equations. Theoretical
analysis and extensive experiments performed on toy and real
datasets demonstrate the quality, efficiency and stability of the
proposed algorithm. Further details of this work are available
here.
|
-
Semi-supervised NMF for Homogeneous Data
Clustering
Traditional clustering algorithms are inapplicable to many
real-world problems where limited knowledge from domain experts is
available. Incorporating the domain knowledge can guide a clustering
algorithm, consequently improving the quality of clustering. We
propose SS-NMF: a semi-supervised non-negative matrix factorization
framework for data clustering. In SS-NMF, users are able to provide
supervision for clustering in terms of pairwise constraints on a few
data objects specifying whether they "must" or "cannot" be clustered
together. Through an iterative algorithm, we perform symmetric
trifactorization of the data similarity matrix to infer the
clusters. Theoretically, we show the correctness and convergence of
SS-NMF and that it provides a general framework for semi-supervised
clustering. Through extensive experiments conducted on publicly
available datasets, we demonstrate the superior performance of
SS-NMF for clustering. Further details of this work are
available here.
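The two ingredients described above, pairwise constraints and symmetric trifactorization of the similarity matrix, can be sketched as follows. This is an illustrative sketch under assumptions, not the SS-NMF algorithm as published: constraints are folded in by adding a reward to must-link entries and subtracting a penalty (clipped at zero) from cannot-link entries, and the factorization A ≈ G S Gᵀ uses multiplicative update rules commonly applied to symmetric non-negative trifactorization. The function names, the `reward` and `penalty` values, and the initialization are all hypothetical choices.

```python
import random

def matmul(A, B):
    """Dense matrix product for lists of lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def apply_constraints(A, must, cannot, reward=1.0, penalty=1.0):
    """Return a copy of similarity matrix A adjusted by pairwise constraints."""
    A2 = [row[:] for row in A]
    for i, j in must:
        A2[i][j] = A2[j][i] = A[i][j] + reward
    for i, j in cannot:
        # Clip at zero so the matrix stays non-negative.
        A2[i][j] = A2[j][i] = max(0.0, A[i][j] - penalty)
    return A2

def sym_trifactorize(A, k, iters=200, seed=0, eps=1e-9):
    """Approximate A by G S G^T with non-negative G (n x k) and S (k x k)."""
    rng = random.Random(seed)
    n = len(A)
    G = [[rng.uniform(0.1, 1.0) for _ in range(k)] for _ in range(n)]
    S = [[1.0 if a == b else 0.1 for b in range(k)] for a in range(k)]
    for _ in range(iters):
        # G <- G * sqrt( (A G S) / (G G^T A G S) ), elementwise.
        AGS = matmul(matmul(A, G), S)
        den_G = matmul(G, matmul(transpose(G), AGS))
        G = [[G[i][c] * (AGS[i][c] / (den_G[i][c] + eps)) ** 0.5
              for c in range(k)] for i in range(n)]
        # S <- S * sqrt( (G^T A G) / (G^T G S G^T G) ), elementwise.
        Gt = transpose(G)
        GtAG = matmul(matmul(Gt, A), G)
        GtG = matmul(Gt, G)
        den_S = matmul(matmul(GtG, S), GtG)
        S = [[S[a][b] * (GtAG[a][b] / (den_S[a][b] + eps)) ** 0.5
              for b in range(k)] for a in range(k)]
    # Each object is assigned to the cluster with the largest entry in its row of G.
    labels = [max(range(k), key=lambda c: G[i][c]) for i in range(n)]
    return G, S, labels
```

A typical call would first adjust the similarity matrix with `apply_constraints` and then pass the result to `sym_trifactorize`; multiplicative updates keep both factors non-negative by construction.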
|
-
Semi-supervised NMF for Heterogeneous
Data Clustering
Co-clustering heterogeneous data has attracted extensive attention recently due to its high impact on various important
applications, such as text mining, image retrieval and bioinformatics. However, data co-clustering without any prior knowledge or
background information is still a challenging problem. In this work, we propose a Semi-Supervised Non-negative Matrix Factorization
(SS-NMF) framework for data co-clustering. Specifically, our method computes new relational matrices by incorporating user provided
constraints through simultaneous distance metric learning and modality selection. Using an iterative algorithm, we then perform trifactorizations
of the new matrices to infer the clusters of different data types and their correspondence. Theoretically, we prove the
convergence and correctness of SS-NMF co-clustering and show the relationship between SS-NMF and other well-known co-clustering
models. Through extensive experiments conducted on publicly available text, gene expression, and image data sets, we demonstrate
the superior performance of SS-NMF for heterogeneous data co-clustering. Further details of this work
are available here.
|
- Physically Based Modeling and
Simulation with Dynamic Spherical Volumetric Simplex Splines
In this work, we present a novel computational modeling and
simulation framework based on dynamic spherical volumetric simplex
splines. The framework can handle the modeling and simulation of
genus-zero objects with real physical properties. In this framework,
we first develop an accurate and efficient algorithm to reconstruct
the high-fidelity digital model of a real-world object with
spherical volumetric simplex splines which can represent with
accuracy geometric, material, and other properties of the object
simultaneously. With the tight coupling of Lagrangian mechanics, the
dynamic volumetric simplex splines representing the object can
accurately simulate its physical behavior because it can unify the
geometric and material properties in the simulation. The
visualization can be directly computed from the object's geometric
or physical representation based on the dynamic spherical volumetric
simplex splines during simulation without interpolation or
resampling. We have applied the framework to biomechanical simulation
of brain deformations, such as brain shift during surgery and
brain injury under blunt impact. We have compared our simulation
results with the ground truth obtained through intra-operative
magnetic resonance imaging and real biomechanical experiments. The
evaluations demonstrate the excellent performance of our new
technique.
Further details of this work are available
here.
|
- Geodesic Distance-Weighted Shape
Vector Image Diffusion
This work proposes a novel and efficient surface matching and
visualization framework through the geodesic distance-weighted shape
vector image diffusion. Based on conformal geometry, our approach
can uniquely map a 3D surface to a canonical rectangular domain and
encode the shape characteristics (e.g., mean curvatures and
conformal factors) of the surface in the 2D domain to construct a
geodesic distance-weighted shape vector image, where the distances
between sampling pixels are not uniform but the actual geodesic
distances on the manifold. Through the novel geodesic
distance-weighted shape vector image diffusion, we can create a
multiscale diffusion space, in which the cross-scale extrema can be
detected as the robust geometric features for the matching and
registration of surfaces. Therefore, statistical analysis and
visualization of surface properties across subjects become readily
available. The experiments on scanned surface models show that our
method is very robust for feature extraction and surface matching
even under noise and resolution change. We have also applied the
framework on the real 3D human neocortical surfaces, and
demonstrated the excellent performance of our approach in
statistical analysis and integrated visualization of the
multimodality volumetric data over the shape vector image.
Further details of this work are available
here.
|
- Simultaneous Localized Feature
Selection and Model Detection for Gaussian Mixtures
This work proposes a novel approach of simultaneous localized
feature selection and model detection for unsupervised learning. In
our approach, local feature saliency, together with other parameters
of Gaussian mixtures, are estimated by Bayesian variational
learning. Experiments performed on both synthetic and real-world
data sets demonstrate that our approach is superior to both global
feature selection and subspace clustering methods.
Further details of this work are available
here.
|
- Exemplar-based Visualization of Large
Document Corpus
With the rapid growth of the World Wide Web and electronic
information services, text corpora are becoming available online at an
incredible rate. By displaying text data in a logical layout (e.g.,
color graphs), text visualization presents a direct way to observe
the documents as well as understand the relationships between them.
In this work, we propose a novel technique, Exemplar-based
Visualization (EV), to visualize an extremely large text corpus.
Capitalizing on recent advances in matrix approximation and
decomposition, EV presents a probabilistic multidimensional
projection model in the low-rank text subspace with a sound
objective function. Each document's proportions over the
topics are obtained through iterative optimization and embedded
in a low-dimensional space using parameter embedding. By selecting
the representative exemplars, we obtain a compact approximation of
the data. This makes the visualization highly efficient and
flexible. In addition, the selected exemplars neatly summarize the
entire data set and greatly reduce the cognitive overload in the
visualization, leading to an easier interpretation of a large text
corpus. Empirically, we demonstrate the superior performance of EV
through extensive experiments performed on publicly available
text data sets.
Further details of this work are available
here.
Exemplar-based Visualization software demo is available
here.
The 10Pubmed data set used for the software is available
here.
|
- Intrinsic Geometric Scale Space by
Shape Diffusion
This work formalizes a novel, intrinsic geometric scale space (IGSS)
of 3D surface shapes. The intrinsic geometry of a surface is
diffused by means of the Ricci flow for the generation of a
geometric scale space. We rigorously prove that this multiscale
shape representation satisfies the axiomatic causality property.
Within the theoretical framework, we further present a feature-based
shape representation derived from IGSS processing, which is shown to
be theoretically plausible and practically effective. By integrating
the concept of scale-dependent saliency into the shape description,
this representation is not only highly descriptive of the local
structures, but also exhibits several desired characteristics of
global shape representations, such as being compact, robust to noise
and computationally efficient. We demonstrate the capabilities of
our approach through salient geometric feature detection and highly
discriminative matching of 3D scans.
Further details of this work are available
here.
|