Research Article

Observations from Statistical Processing of BDNC01 Corpus

by Md. Farukuzzaman Khan, M. Abdus Sobhan
International Journal of Applied Information Systems
Foundation of Computer Science (FCS), NY, USA
Volume 3 - Number 3
Year of Publication: 2012
Authors: Md. Farukuzzaman Khan, M. Abdus Sobhan
http:/ijais12-450474

Md. Farukuzzaman Khan, M. Abdus Sobhan. Observations from Statistical Processing of BDNC01 Corpus. International Journal of Applied Information Systems. 3, 3 (July 2012), 1-7. DOI=http:/ijais12-450474

@article{ http:/ijais12-450474,
author = { Md. Farukuzzaman Khan and M. Abdus Sobhan },
title = { Observations from Statistical Processing of BDNC01 Corpus },
journal = { International Journal of Applied Information Systems },
issue_date = { July 2012 },
volume = { 3 },
number = { 3 },
month = { July },
year = { 2012 },
issn = { 2249-0868 },
pages = { 1-7 },
numpages = {9},
url = { https://www.ijais.org/archives/volume3/number3/209-0474/ },
doi = { http:/ijais12-450474 },
publisher = {Foundation of Computer Science (FCS), NY, USA},
address = {New York, USA}
}
%0 Journal Article
%1 2023-07-05T10:45:31.247985+05:30
%A Md. Farukuzzaman Khan
%A M. Abdus Sobhan
%T Observations from Statistical Processing of BDNC01 Corpus
%J International Journal of Applied Information Systems
%@ 2249-0868
%V 3
%N 3
%P 1-7
%D 2012
%I Foundation of Computer Science (FCS), NY, USA
Abstract

Recent developments in language technology depend on relevant language resources and on the knowledge that can be extracted from them. In this context, corpus-based methods for Bangla language processing are receiving strong support from laboratories around the world. This paper discusses the compilation of the BdNC01 corpus and observations from its statistical processing. BdNC01 is a new Bangla text corpus, collected from the web editions of several influential Bangla daily newspapers, that contains more than eleven million word tokens. Several statistics were computed and stored in the final repository: the vocabulary list and its total count, individual word frequencies, and prior probabilities. The relation of word frequency to Zipf's law, the time and source dependency of word frequencies, and the character distribution were also examined. The software tools required for this processing were implemented in C. The paper concludes with a discussion of the usability of the corpus and the computed statistical database.
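
The statistics named in the abstract can be illustrated with a minimal sketch. It is not the authors' tooling, which was written in C; it is a Python illustration under stated assumptions. It assumes whitespace tokenization, takes the prior probability of a word to be its count divided by the total number of tokens (the usual unigram estimate), and checks Zipf's law through the product of rank and frequency, which the law predicts to be roughly constant. The function names and the toy sentence are hypothetical and are not drawn from BdNC01.

```python
from collections import Counter


def word_statistics(tokens):
    """Return (frequency table, prior probabilities) for a list of tokens."""
    freq = Counter(tokens)
    total = sum(freq.values())
    # Assumed definition: prior probability = count / total token count.
    prior = {word: count / total for word, count in freq.items()}
    return freq, prior


def zipf_table(freq):
    """List (rank, word, frequency, rank * frequency), most frequent first."""
    # Ties are broken alphabetically so the ranking is deterministic.
    ranked = sorted(freq.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(rank, word, count, rank * count)
            for rank, (word, count) in enumerate(ranked, start=1)]


def char_distribution(tokens):
    """Relative frequency of each character across all tokens."""
    chars = Counter(ch for tok in tokens for ch in tok)
    total = sum(chars.values())
    return {ch: n / total for ch, n in chars.items()}


# Toy sample (hypothetical; not taken from BdNC01).
sample = "আমি ভাত খাই আমি বই পড়ি আমি যাই".split()
freq, prior = word_statistics(sample)
table = zipf_table(freq)
chars = char_distribution(sample)
```

On a real corpus, plotting the logarithm of frequency against the logarithm of rank from `zipf_table` and looking for an approximately straight line is a common way to inspect the fit to Zipf's law.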

Index Terms

Computer Science
Information Sciences

Keywords

Corpus, Vocabulary, Word Frequency, Prior Probability, Zipf's Law, Character Distribution