Research Article

Document Clustering: A Detailed Review

by Neepa Shah, Sunita Mahajan
International Journal of Applied Information Systems
Foundation of Computer Science (FCS), NY, USA
Volume 4 - Number 5
Year of Publication: 2012
Authors: Neepa Shah, Sunita Mahajan
DOI: 10.5120/ijais12-450691

Neepa Shah, Sunita Mahajan. Document Clustering: A Detailed Review. International Journal of Applied Information Systems. 4, 5 (October 2012), 30-38. DOI=10.5120/ijais12-450691

@article{ 10.5120/ijais12-450691,
author = { Neepa Shah and Sunita Mahajan },
title = { Document Clustering: A Detailed Review },
journal = { International Journal of Applied Information Systems },
issue_date = { October 2012 },
volume = { 4 },
number = { 5 },
month = { October },
year = { 2012 },
issn = { 2249-0868 },
pages = { 30-38 },
numpages = {9},
url = { https://www.ijais.org/archives/volume4/number5/300-0691/ },
doi = { 10.5120/ijais12-450691 },
publisher = {Foundation of Computer Science (FCS), NY, USA},
address = {New York, USA}
}
%0 Journal Article
%1 2023-07-05T10:47:26.237113+05:30
%A Neepa Shah
%A Sunita Mahajan
%T Document Clustering: A Detailed Review
%J International Journal of Applied Information Systems
%@ 2249-0868
%V 4
%N 5
%P 30-38
%D 2012
%I Foundation of Computer Science (FCS), NY, USA
Abstract

Document clustering is the automatic organization of documents into clusters so that documents within a cluster are highly similar to one another compared with documents in other clusters. It has been studied intensively because of its wide applicability in areas such as web mining, search engines, and information retrieval. It involves measuring the similarity between documents and grouping similar documents together. It provides an efficient representation and visualization of the documents and thus helps in easy navigation. In this paper, we give an overview of the document clustering methods studied and researched over the last few years, starting from basic traditional methods and moving on to fuzzy-based, genetic, co-clustering, and heuristic-oriented methods, among others. The document clustering procedure with its feature selection process, applications, challenges in document clustering, similarity measures, and the evaluation of document clustering algorithms are also explained.
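The two steps named in the abstract, measuring similarity between documents and grouping similar documents together, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the whitespace tokenizer, raw term counts as the document representation, cosine similarity as the measure, the greedy single-pass grouping, and the function names term_frequencies, cosine_similarity, and group_by_threshold along with the threshold value of 0.5.

```python
import math
from collections import Counter


def term_frequencies(text):
    """Count terms after lowercasing and splitting on whitespace (a naive, assumed tokenizer)."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-count vectors."""
    shared = a.keys() & b.keys()
    dot = sum(a[t] * b[t] for t in shared)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_a * norm_b)


def group_by_threshold(docs, threshold=0.5):
    """Greedy single-pass grouping (illustrative only, not a method from the paper).

    Each document joins the first cluster whose first member is at least
    `threshold` similar to it; otherwise it starts a new cluster.
    Returns a list of clusters, each a list of document indices.
    """
    vectors = [term_frequencies(d) for d in docs]
    clusters = []
    for i, vec in enumerate(vectors):
        for cluster in clusters:
            if cosine_similarity(vec, vectors[cluster[0]]) >= threshold:
                cluster.append(i)
                break
        else:
            clusters.append([i])
    return clusters


if __name__ == "__main__":
    docs = [
        "web mining search engines",
        "search engines and web mining",
        "genetic fuzzy clustering",
    ]
    print(group_by_threshold(docs))
```

A practical system would typically replace the raw counts with weighted terms and the greedy pass with one of the clustering algorithms the paper surveys; the sketch only shows how a similarity measure drives the grouping.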

Index Terms

Computer Science
Information Sciences

Keywords

Document clustering, document clustering applications, document clustering procedure, similarity measures for document clustering, evaluation of document clustering algorithm, challenges in document clustering