Comparative Analysis of N-gram Text Representation on Igbo Text Document Similarity

Ifeanyi-Reuben Nkechi J.; Ugwu Chidiebere; Nwachukwu E. O.

Research Article

International Journal of Applied Information Systems

Foundation of Computer Science (FCS), NY, USA

Volume 12 - Number 9

Year of Publication: 2017

10.5120/ijais2017451724

Ifeanyi-Reuben Nkechi J., Ugwu Chidiebere, Nwachukwu E. O. Comparative Analysis of N-gram Text Representation on Igbo Text Document Similarity. International Journal of Applied Information Systems. 12, 9 (Dec 2017), 1-7. DOI=10.5120/ijais2017451724

@article{10.5120/ijais2017451724,
  author = {Ifeanyi-Reuben Nkechi J., Ugwu Chidiebere, Nwachukwu E. O.},
  title = {Comparative Analysis of N-gram Text Representation on Igbo Text Document Similarity},
  journal = {International Journal of Applied Information Systems},
  issue_date = {Dec 2017},
  volume = {12},
  number = {9},
  month = {Dec},
  year = {2017},
  issn = {2249-0868},
  pages = {1-7},
  numpages = {9},
  url = {https://www.ijais.org/archives/volume12/number9/1012-2017451724/},
  doi = {10.5120/ijais2017451724},
  publisher = {Foundation of Computer Science (FCS), NY, USA},
  address = {New York, USA}
}

%0 Journal Article

%1 2023-07-05T19:08:35.460821+05:30

%A Ifeanyi-Reuben Nkechi J.

%A Ugwu Chidiebere

%A Nwachukwu E. O.

%T Comparative Analysis of N-gram Text Representation on Igbo Text Document Similarity

%J International Journal of Applied Information Systems

%@ 2249-0868

%V 12

%N 9

%P 1-7

%D 2017

%I Foundation of Computer Science (FCS), NY, USA

Abstract

Advances in information technology have encouraged the use of Igbo in creating online text such as resources and news articles. Text similarity is of great importance in any text-based application. This paper presents a comparative analysis of n-gram text representations for Igbo text document similarity. It adopts the Euclidean distance measure to determine the similarity between Igbo text documents represented with two word-based n-gram models, unigram and bigram, and evaluates the similarity measure against each representation. The model is designed with an object-oriented methodology and implemented in Python with tools from the Natural Language Toolkit (NLTK). The results show that unigram-represented text has the highest distance values, whereas bigram-represented text has the lowest corresponding distance values. The lower the distance value, the more similar the two documents, and the better the quality of the model when used for a task that requires a similarity measure; similarity increases as the distance approaches zero. The analysis indicates that Igbo text document similarity measured on bigram-represented text gives the more accurate similarity result, which should in turn give better and more accurate results when used for tasks such as text classification, clustering and ranking on Igbo text.
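The comparison described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the helper names `word_ngrams` and `euclidean_distance`, the whitespace tokenizer, the raw-count weighting, and the two short sample strings are all assumptions made for illustration. The abstract does not state how n-gram features were weighted, and the paper's own code uses NLTK tools rather than the standard-library helpers shown here.

```python
import math
from collections import Counter


def word_ngrams(text, n):
    """Return the word-based n-grams of a text as a list of tuples.

    Assumption: lowercasing and whitespace splitting stand in for
    whatever tokenization and normalization the paper applied.
    """
    tokens = text.lower().split()
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def euclidean_distance(doc_a, doc_b, n):
    """Euclidean distance between the n-gram count vectors of two texts.

    Assumption: features are raw n-gram counts over the union of the
    n-grams seen in either document.
    """
    counts_a = Counter(word_ngrams(doc_a, n))
    counts_b = Counter(word_ngrams(doc_b, n))
    vocabulary = set(counts_a) | set(counts_b)
    return math.sqrt(
        sum((counts_a[gram] - counts_b[gram]) ** 2 for gram in vocabulary)
    )


# Illustrative sample strings only; they are not taken from the paper's corpus.
doc1 = "nne m na-esi nri"
doc2 = "nna m na-esi nri"

unigram_distance = euclidean_distance(doc1, doc2, 1)  # n = 1: unigram model
bigram_distance = euclidean_distance(doc1, doc2, 2)   # n = 2: bigram model
print(unigram_distance, bigram_distance)
```

A distance of zero means the two count vectors are identical, and larger values mean the documents share fewer n-grams. Whether the bigram distance falls below the unigram distance depends on the documents and on the weighting, so this toy pair should not be read as reproducing the paper's result.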

Index Terms

Computer Science

Information Sciences

Keywords

Similarity measure, Igbo text, N-gram model, Euclidean distance, Text representation