Comparative Analysis of N-gram Text Representation on Igbo Text Document Similarity

Ifeanyi-Reuben Nkechi J., Ugwu Chidiebere, Nwachukwu E. O. Published in Information Sciences

International Journal of Applied Information Systems
Year of Publication: 2017
Publisher: Foundation of Computer Science (FCS), NY, USA
Authors: Ifeanyi-Reuben Nkechi J., Ugwu Chidiebere, Nwachukwu E. O.
DOI: 10.5120/ijais2017451724
  1. Ifeanyi-Reuben Nkechi J., Ugwu Chidiebere and Nwachukwu E. O. Comparative Analysis of N-gram Text Representation on Igbo Text Document Similarity. International Journal of Applied Information Systems 12(9):1-7, December 2017.

    @article{10.5120/ijais2017451724,
    	author = "Ifeanyi-Reuben Nkechi J. and Ugwu Chidiebere and Nwachukwu E. O.",
    	title = "Comparative Analysis of N-gram Text Representation on Igbo Text Document Similarity",
    	journal = "International Journal of Applied Information Systems",
    	issue_date = "December 2017",
    	volume = 12,
    	number = 9,
    	month = "Dec",
    	year = 2017,
    	issn = "2249-0868",
    	pages = "1-7",
    	url = "http://www.ijais.org/archives/volume12/number9/1012-2017451724",
    	doi = "10.5120/ijais2017451724",
    	publisher = "Foundation of Computer Science (FCS), NY, USA",
    	address = "New York, USA"
    }
    

Abstract

Advances in Information Technology have encouraged the use of Igbo in creating online text such as resources and news articles. Text similarity is of great importance in any text-based application. This paper presents a comparative analysis of n-gram text representation on Igbo text document similarity. It adopts Euclidean distance as the similarity measure to determine the similarity between Igbo text documents represented with two word-based n-gram text representation models (unigram and bigram). The similarity measure is evaluated on each of the adopted text representation models. The model is designed with Object-Oriented Methodology and implemented in the Python programming language with tools from the Natural Language Toolkit (NLTK). The results show that unigram-represented text has the highest distance values, whereas bigram-represented text has the lowest corresponding distance values. The lower the distance value, the more similar the two documents are, and the better the quality of the model when used for a task that requires a similarity measure; the similarity of two documents increases as the distance value approaches zero (0). The analysis of the results revealed that Igbo text document similarity measured on bigram-represented text gives the more accurate similarity result. This will give better, more effective and more accurate results when used for tasks such as text classification, clustering and ranking on Igbo text.
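As a rough illustration of the comparison described in the abstract, the sketch below computes the Euclidean distance between two documents under unigram and bigram word representations. It is not the authors' implementation: the function names `word_ngrams` and `euclidean_distance`, the whitespace tokenizer, the use of raw n-gram counts as vector components, and the placeholder English strings are all assumptions made for this example. The paper reports using NLTK, whereas this sketch uses only the Python standard library.

```python
from collections import Counter
from math import sqrt


def word_ngrams(text, n):
    """Return the word-based n-grams of a text as a list of tuples.

    Assumption: lowercase + whitespace tokenization; the paper's actual
    Igbo preprocessing is not described in the abstract.
    """
    tokens = text.lower().split()
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def euclidean_distance(doc_a, doc_b, n):
    """Euclidean distance between the raw n-gram count vectors of two texts."""
    counts_a = Counter(word_ngrams(doc_a, n))
    counts_b = Counter(word_ngrams(doc_b, n))
    # Compare over the union of n-grams; Counter returns 0 for missing keys.
    vocabulary = set(counts_a) | set(counts_b)
    return sqrt(sum((counts_a[g] - counts_b[g]) ** 2 for g in vocabulary))


if __name__ == "__main__":
    # Placeholder strings, not Igbo data and not taken from the paper.
    doc_a = "the cat sat on the mat"
    doc_b = "the cat lay on the mat"
    for n, label in ((1, "unigram"), (2, "bigram")):
        print(label, euclidean_distance(doc_a, doc_b, n))
```

A distance of zero means the two count vectors are identical, which matches the abstract's reading that similarity increases as the distance approaches zero. The toy strings only show the mechanics; they say nothing about which representation performs better on Igbo text.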

Keywords

Similarity measure, Igbo text, N-gram model, Euclidean distance, Text representation