Parallel Clustering and Classification
Clustering is the grouping of documents into sets of similar documents.
The underlying assumption is that documents that are similar under a
given similarity measure are relevant, as a set, to similar queries.
Under this assumption, a query can be processed more efficiently by
comparing it against a single representative of each cluster, called a
centroid, rather than against every document. If the centroid is deemed
relevant, so are the documents in the corresponding cluster. Our work
describes an efficient and scalable parallel approach to clustering
and classifying a large document corpus.
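The centroid-based query processing described above can be sketched roughly as follows. This is only an illustration, not the method from the publications listed here: the bag-of-words vectors, the cosine similarity measure, the mean-vector centroid, the `threshold` value, and all function names are assumptions chosen for the example.

```python
import math
from collections import Counter

def vectorize(text):
    """Bag-of-words term-frequency vector (assumed representation)."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse term vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def centroid(vectors):
    """Mean term vector of a cluster (one common way to form a centroid)."""
    total = Counter()
    for vec in vectors:
        total.update(vec)
    return {term: weight / len(vectors) for term, weight in total.items()}

def search(query, clusters, threshold=0.2):
    """Return the documents of every cluster whose centroid matches the query."""
    q = vectorize(query)
    hits = []
    for docs in clusters:
        # One comparison per cluster instead of one per document.
        c = centroid([vectorize(d) for d in docs])
        if cosine(q, c) >= threshold:
            hits.extend(docs)
    return hits

# Hypothetical toy corpus, already grouped into two clusters.
clusters = [
    ["parallel document clustering", "scalable clustering of documents"],
    ["stock market prices", "market trading volume"],
]
results = search("clustering documents", clusters)
```

In a real system the centroids would be computed once when the clusters are built rather than on every query; they are recomputed here only to keep the sketch short.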
Publications:
R. Cathey, E. Jensen, S. Beitzel, O. Frieder, and D. Grossman, "Exploiting Parallelism to Support Scalable Hierarchical Clustering," Journal of the American Society for Information Science and Technology, 58(8), June 2007.
E. Jensen, S. Beitzel, A. Pilotto, N. Goharian, and O. Frieder, "Parallelizing the Buckshot Algorithm for Efficient Document Clustering," Proceedings of the 2002 ACM International Conference on Information and Knowledge Management (ACM-CIKM), Washington D.C., November 2002.
A. Ruocco and O. Frieder, "Clustering and Classification of Large Document Bases in a Parallel Environment," Journal of the American Society for Information Science, 48(10), October 1997.