Research Article

Comparative Analysis of N-gram Text Representation on Igbo Text Document Similarity

by Ifeanyi-Reuben Nkechi J., Ugwu Chidiebere, Nwachukwu E. O.
International Journal of Applied Information Systems
Foundation of Computer Science (FCS), NY, USA
Volume 12 - Number 9
Year of Publication: 2017
Authors: Ifeanyi-Reuben Nkechi J., Ugwu Chidiebere, Nwachukwu E. O.
10.5120/ijais2017451724

Ifeanyi-Reuben Nkechi J., Ugwu Chidiebere, Nwachukwu E. O. Comparative Analysis of N-gram Text Representation on Igbo Text Document Similarity. International Journal of Applied Information Systems. 12, 9 (Dec 2017), 1-7. DOI=10.5120/ijais2017451724

@article{ 10.5120/ijais2017451724,
author = { Ifeanyi-Reuben Nkechi J., Ugwu Chidiebere, Nwachukwu E. O. },
title = { Comparative Analysis of N-gram Text Representation on Igbo Text Document Similarity },
journal = { International Journal of Applied Information Systems },
issue_date = { Dec 2017 },
volume = { 12 },
number = { 9 },
month = { Dec },
year = { 2017 },
issn = { 2249-0868 },
pages = { 1-7 },
numpages = {9},
url = { https://www.ijais.org/archives/volume12/number9/1012-2017451724/ },
doi = { 10.5120/ijais2017451724 },
publisher = {Foundation of Computer Science (FCS), NY, USA},
address = {New York, USA}
}
%0 Journal Article
%1 2023-07-05T19:08:35.460821+05:30
%A Ifeanyi-Reuben Nkechi J.
%A Ugwu Chidiebere
%A Nwachukwu E. O.
%T Comparative Analysis of N-gram Text Representation on Igbo Text Document Similarity
%J International Journal of Applied Information Systems
%@ 2249-0868
%V 12
%N 9
%P 1-7
%D 2017
%I Foundation of Computer Science (FCS), NY, USA
Abstract

Advances in information technology have encouraged the use of Igbo in creating online text such as resources and news articles. Text similarity is of great importance in any text-based application. This paper presents a comparative analysis of n-gram text representation on Igbo text document similarity. It adopts the Euclidean similarity measure to determine the similarity between Igbo text documents represented with two word-based n-gram text representation models (unigram and bigram). The similarity measure is evaluated on each of the adopted text representation models. The model is designed with Object-Oriented Methodology and implemented in the Python programming language with tools from the Natural Language Toolkit (NLTK). The results show that unigram-represented text has the highest distance values, whereas bigram-represented text has the lowest corresponding distance values. The lower the distance value, the more similar the two documents, and the better the quality of the model when used for a task that requires a similarity measure; the similarity of two documents increases as the distance value approaches zero. The analysis of the results revealed that Igbo text document similarity measured on bigram-represented text gives accurate similarity results. This is expected to give better, more effective and more accurate results when used for tasks such as text classification, clustering and ranking on Igbo text.
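The comparison described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it assumes whitespace tokenization, raw n-gram counts as the document vectors, and two short made-up sample strings that are not drawn from the paper's corpus. The function names `word_ngrams` and `euclidean_distance` are hypothetical, and only the Python standard library is used in place of NLTK.

```python
import math
from collections import Counter


def word_ngrams(text, n):
    """Return the word-based n-grams of `text` as a list of tuples."""
    # Assumption: lowercase + whitespace split stands in for real tokenization.
    tokens = text.lower().split()
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def euclidean_distance(doc_a, doc_b, n):
    """Euclidean distance between the n-gram count vectors of two texts."""
    counts_a = Counter(word_ngrams(doc_a, n))
    counts_b = Counter(word_ngrams(doc_b, n))
    # The shared vector space is the union of n-grams seen in either text;
    # a Counter returns 0 for n-grams that a text does not contain.
    vocabulary = set(counts_a) | set(counts_b)
    return math.sqrt(sum((counts_a[g] - counts_b[g]) ** 2 for g in vocabulary))


# Made-up sample strings for illustration only.
doc_1 = "nne m gara ahịa"
doc_2 = "nne m gara ụlọ"

unigram_distance = euclidean_distance(doc_1, doc_2, 1)  # n = 1
bigram_distance = euclidean_distance(doc_1, doc_2, 2)   # n = 2
same_text_distance = euclidean_distance(doc_1, doc_1, 2)

print(unigram_distance, bigram_distance, same_text_distance)
```

A distance of zero means the two count vectors are identical, which matches the abstract's reading that similarity increases as the distance approaches zero. How the unigram and bigram distances compare on real documents depends on the corpus and on any weighting or normalization applied, so this sketch should not be read as reproducing the paper's results.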

Index Terms

Computer Science
Information Sciences

Keywords

Similarity measure
Igbo text
N-gram model
Euclidean distance
Text representation