Observations from Statistical Processing of BDNC01 Corpus

Md. Farukuzzaman Khan; M. Abdus Sobhan

Research Article

International Journal of Applied Information Systems

Foundation of Computer Science (FCS), NY, USA

Volume 3 - Number 3

Year of Publication: 2012

Authors: Md. Farukuzzaman Khan, M. Abdus Sobhan

http:/ijais12-450474

Md. Farukuzzaman Khan, M. Abdus Sobhan. Observations from Statistical Processing of BDNC01 Corpus. International Journal of Applied Information Systems. 3, 3 (July 2012), 1-7. DOI=http:/ijais12-450474

@article{ http:/ijais12-450474,
  author = { Md. Farukuzzaman Khan and M. Abdus Sobhan },
  title = { Observations from Statistical Processing of BDNC01 Corpus },
  journal = { International Journal of Applied Information Systems },
  issue_date = { July 2012 },
  volume = { 3 },
  number = { 3 },
  month = { July },
  year = { 2012 },
  issn = { 2249-0868 },
  pages = { 1-7 },
  numpages = {9},
  url = { https://www.ijais.org/archives/volume3/number3/209-0474/ },
  doi = { http:/ijais12-450474 },
  publisher = {Foundation of Computer Science (FCS), NY, USA},
  address = {New York, USA}
}

%0 Journal Article

%1 2023-07-05T10:45:31.247985+05:30

%A Md. Farukuzzaman Khan

%A M. Abdus Sobhan

%T Observations from Statistical Processing of BDNC01 Corpus

%J International Journal of Applied Information Systems

%@ 2249-0868

%V 3

%N 3

%P 1-7

%D 2012

%I Foundation of Computer Science (FCS), NY, USA

Abstract

Recent developments in language technology depend on suitable language resources and on the knowledge that can be extracted from them. For this reason, corpus-based methods are receiving strong support in Bangla language processing from laboratories around the world. This paper discusses the compilation of the BdNC01 corpus and observations from its statistical processing. BdNC01 is a new Bangla text corpus collected from the web editions of several influential Bangla daily newspapers and contains more than eleven million word tokens. Several statistics, including the vocabulary list and its total count, individual word frequencies, and prior probabilities, were computed and stored in the final repository. The relation of word frequency to Zipf's law, the dependence of word frequencies on time and source, and the character distribution were also observed. The software tools required for this processing were implemented in C. The paper concludes with a discussion of the usability of the corpus and the computed statistical database.
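
The statistics named in the abstract can be illustrated with a minimal sketch. It is written in Python for brevity, whereas the authors' tools were implemented in C, so it is not their implementation. The function names, the whitespace tokenization, the per-code-point character counting, and the transliterated toy sentence are all assumptions made for illustration. Under Zipf's law, frequency is roughly inversely proportional to rank, so the product rank × frequency should stay roughly constant across a large corpus; the last column of the table exposes that product.

```python
from collections import Counter


def word_statistics(text):
    """Return (vocabulary size, token count, frequency table, prior probabilities).

    Assumption: tokens are separated by whitespace. A real corpus pipeline
    would also need punctuation handling and text normalization.
    """
    tokens = text.split()
    freq = Counter(tokens)
    total = len(tokens)
    # Prior probability of a word = its count divided by the total token count.
    prior = {word: count / total for word, count in freq.items()}
    return len(freq), total, freq, prior


def zipf_table(freq):
    """List (rank, word, frequency, rank * frequency), most frequent word first."""
    # Ties in frequency are broken alphabetically so the ranking is deterministic.
    ranked = sorted(freq.items(), key=lambda item: (-item[1], item[0]))
    return [(rank, word, count, rank * count)
            for rank, (word, count) in enumerate(ranked, start=1)]


def character_distribution(text):
    """Relative frequency of each non-whitespace character (Unicode code point)."""
    chars = [ch for ch in text if not ch.isspace()]
    counts = Counter(chars)
    return {ch: count / len(chars) for ch, count in counts.items()}


# Hypothetical toy input (transliterated); a real run would read the corpus files.
sample = "ami bhat khai ami boi pori ami jai"

vocab_size, total_tokens, freq, prior = word_statistics(sample)
for rank, word, count, product in zipf_table(freq):
    print(rank, word, count, product, round(prior[word], 3))
```

On a corpus of millions of tokens, the same tables would normally be written to disk rather than printed, and counting characters per code point treats Bangla dependent vowel signs and conjunct components as separate characters, which may or may not match the paper's definition of a character.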

Index Terms

Computer Science

Information Sciences

Keywords

Corpus, Vocabulary, Word Frequency, Prior Probability, Zipf's Law, Character Distribution