Observations from Statistical Processing of BDNC01 Corpus

Md. Farukuzzaman Khan, M. Abdus Sobhan
Published in Artificial Intelligence

Recent trends in the development of language-related technology show an unavoidable need for relevant resources and for acquiring knowledge from them. In this context, corpus-based methods for Bangla language processing are receiving a strong push from laboratories throughout the world. In this paper we discuss the compilation of the BdNC01 corpus and observations from its statistical processing. BdNC01 is a new Bangla text corpus, collected from the web editions of several influential Bangla daily newspapers, containing more than eleven million word tokens. Several statistics, including the vocabulary list and its total count, individual word frequencies, and prior probabilities, were computed and preserved in the final repository. The relation of word frequency to Zipf's law, the time and source dependency of word frequencies, and the character distribution were also observed. The software support tools required for the various processing steps were implemented in the C language. The paper concludes with a discussion of the usability of the corpus and the computed statistical database.


Corpus, Vocabulary, Word Frequency, Prior Probability, Zipf’s Law, Character Distribution
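
The statistics named in the abstract (vocabulary size, word frequencies, prior probabilities, and the rank-frequency relation behind Zipf's law) can be illustrated with a minimal C sketch. It is not the authors' tooling: every name here (WordTable, count_words, prior_probability, rank_words, zipf_product) and both capacity limits are hypothetical. It assumes UTF-8 text whose tokens are separated by ASCII whitespace, so Bangla words pass through as opaque byte sequences, and it takes the prior probability of a word to be its count divided by the total token count.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define MAX_WORDS 1024  /* illustrative capacity, far below a real vocabulary */
#define MAX_LEN   64    /* bytes per stored token, including the terminator */

typedef struct {
    char word[MAX_LEN];
    unsigned long count;
} WordEntry;

typedef struct {
    WordEntry entries[MAX_WORDS];
    size_t vocab;          /* number of distinct word types */
    unsigned long tokens;  /* total number of word tokens */
} WordTable;

/* Count one token. A linear search keeps the sketch short; a real tool
   for millions of tokens would use a hash table or a trie instead. */
void add_token(WordTable *t, const char *tok)
{
    assert(strlen(tok) < MAX_LEN);
    t->tokens++;
    for (size_t i = 0; i < t->vocab; i++) {
        if (strcmp(t->entries[i].word, tok) == 0) {
            t->entries[i].count++;
            return;
        }
    }
    assert(t->vocab < MAX_WORDS);
    strcpy(t->entries[t->vocab].word, tok);
    t->entries[t->vocab].count = 1;
    t->vocab++;
}

/* Split text on ASCII whitespace and count every token. Tokens longer
   than MAX_LEN - 1 bytes are truncated (possibly mid-character). */
void count_words(WordTable *t, const char *text)
{
    char tok[MAX_LEN];
    size_t n = 0;
    for (const char *p = text; ; p++) {
        int sep = (*p == '\0' || *p == ' ' || *p == '\t' ||
                   *p == '\n' || *p == '\r');
        if (!sep) {
            if (n < MAX_LEN - 1)
                tok[n++] = *p;
        } else {
            if (n > 0) {
                tok[n] = '\0';
                add_token(t, tok);
                n = 0;
            }
            if (*p == '\0')
                break;
        }
    }
}

/* Prior probability of a word: its count divided by the total tokens. */
double prior_probability(const WordTable *t, const char *word)
{
    if (t->tokens == 0)
        return 0.0;
    for (size_t i = 0; i < t->vocab; i++)
        if (strcmp(t->entries[i].word, word) == 0)
            return (double)t->entries[i].count / (double)t->tokens;
    return 0.0;
}

/* Order entries by descending count, breaking ties by byte order. */
static int by_count_desc(const void *a, const void *b)
{
    const WordEntry *x = a, *y = b;
    if (x->count != y->count)
        return (x->count < y->count) ? 1 : -1;
    return strcmp(x->word, y->word);
}

void rank_words(WordTable *t)
{
    qsort(t->entries, t->vocab, sizeof t->entries[0], by_count_desc);
}

/* Rank times frequency for a 1-based rank; call rank_words first. */
unsigned long zipf_product(const WordTable *t, size_t rank)
{
    assert(rank >= 1 && rank <= t->vocab);
    return (unsigned long)rank * t->entries[rank - 1].count;
}
```

A caller would zero-initialise a WordTable, feed it text through count_words, and then call rank_words before reading zipf_product. If a corpus follows Zipf's law, that product stays roughly constant across ranks.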