Research Article

Eliminating Noisy Information in Web Pages using featured DOM tree

by Shine N. Das, Pramod K. Vijayaraghavan, Midhun Mathew
International Journal of Applied Information Systems
Foundation of Computer Science (FCS), NY, USA
Volume 2 - Number 2
Year of Publication: 2012
Authors: Shine N. Das, Pramod K. Vijayaraghavan, Midhun Mathew
10.5120/ijais12-450272

Shine N. Das, Pramod K. Vijayaraghavan, Midhun Mathew. Eliminating Noisy Information in Web Pages using featured DOM tree. International Journal of Applied Information Systems. 2, 2 (May 2012), 27-34. DOI=10.5120/ijais12-450272

@article{ 10.5120/ijais12-450272,
author = { Shine N. Das and Pramod K. Vijayaraghavan and Midhun Mathew },
title = { Eliminating Noisy Information in Web Pages using featured DOM tree },
journal = { International Journal of Applied Information Systems },
issue_date = { May 2012 },
volume = { 2 },
number = { 2 },
month = { May },
year = { 2012 },
issn = { 2249-0868 },
pages = { 27-34 },
numpages = {9},
url = { https://www.ijais.org/archives/volume2/number2/133-0272/ },
doi = { 10.5120/ijais12-450272 },
publisher = {Foundation of Computer Science (FCS), NY, USA},
address = {New York, USA}
}
%0 Journal Article
%1 2023-07-05T10:43:15.440994+05:30
%A Shine N. Das
%A Pramod K. Vijayaraghavan
%A Midhun Mathew
%T Eliminating Noisy Information in Web Pages using featured DOM tree
%J International Journal of Applied Information Systems
%@ 2249-0868
%V 2
%N 2
%P 27-34
%D 2012
%I Foundation of Computer Science (FCS), NY, USA
Abstract

Exact information retrieval from the Web is now a great challenge, and it requires researchers to devise new methodologies for web mining. The volume of information on the Web appears to be growing at an exponential rate, and it is often accompanied by a large amount of noise such as banner advertisements, navigation bars, and copyright notices. Although such items are functionally useful for human viewers and necessary for web site owners, they often hamper automated information gathering and web data mining. The presence of such noisy information degrades the efficiency of feature extraction and, ultimately, classification accuracy. Cleaning web pages before mining is therefore critical for improving mining results. In this work, we focus on identifying and removing local noise in web pages to improve mining performance. We propose a novel and simple idea for detecting and removing local noise using a new tree structure called the featured DOM tree. A three-stage algorithm is proposed: features are selected in the first stage, a featured DOM tree is created in the second stage, and noise is marked and pruned in the third stage. The experimental results show that our algorithm performs well on various benchmark measures, and that automatic web page classification achieves a higher F score and accuracy as a result.
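The abstract names the three stages but gives no implementation details, so the following is only a minimal Python sketch of how such a pipeline could be organised. It rests on assumptions that are not taken from the paper: document frequency over a few sample texts stands in for the authors' feature selection, a node's "feature" value is simply the number of selected terms in its subtree, and any subtree with no selected terms is treated as noise. All names (`Node`, `TreeBuilder`, `select_features`, `annotate`, `prune`, `clean_page`) are hypothetical.

```python
from collections import Counter
from html.parser import HTMLParser
import re


class Node:
    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        self.text = ""      # text directly inside this element
        self.features = 0   # number of feature terms in the whole subtree


class TreeBuilder(HTMLParser):
    """Builds a minimal DOM tree. Assumes well-formed HTML without void tags."""

    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        self.current = node

    def handle_endtag(self, tag):
        if self.current.parent is not None:
            self.current = self.current.parent

    def handle_data(self, data):
        self.current.text += data


def tokens(text):
    return re.findall(r"[a-z]+", text.lower())


def select_features(documents, k):
    """Stage 1 (assumed): keep the k terms with the highest document frequency."""
    df = Counter()
    for doc in documents:
        df.update(set(tokens(doc)))
    return {term for term, _ in df.most_common(k)}


def annotate(node, features):
    """Stage 2 (assumed): store each subtree's feature-term count on its root."""
    own = sum(1 for t in tokens(node.text) if t in features)
    node.features = own + sum(annotate(c, features) for c in node.children)
    return node.features


def prune(node):
    """Stage 3 (assumed): drop every subtree that contains no feature term."""
    node.children = [c for c in node.children if c.features > 0]
    for child in node.children:
        prune(child)


def extract_text(node):
    parts = [node.text.strip()] + [extract_text(c) for c in node.children]
    return " ".join(p for p in parts if p)


def clean_page(html, features):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    annotate(builder.root, features)
    prune(builder.root)
    return extract_text(builder.root)


samples = ["dom tree noise pruning", "web page noise removal", "dom tree cleaning"]
page = ("<html><body><div><p>Noise in the DOM tree</p></div>"
        "<div><a>Home</a><a>Contact</a></div><p>Copyright 2012</p></body></html>")
print(clean_page(page, select_features(samples, 3)))  # Noise in the DOM tree
```

In this toy run the navigation links and the copyright notice contain none of the selected terms, so their subtrees are pruned and only the content paragraph survives. A real system would need a more robust parser and a noise criterion less blunt than a zero count.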

Index Terms

Computer Science
Information Sciences

Keywords

Noise Elimination, Featured DOM Tree, Web Page Cleaning, Web Page Classification, Minimum Weight Overlapping