Approach for Transforming Monolingual Text Corpus into XML Corpus
Deepak Sharma and Prakash R. Devale. Approach for Transforming Monolingual Text Corpus into XML Corpus. International Journal of Applied Information Systems 1(9):1-5, April 2012. BibTeX
@article{key:article,
  author  = "Deepak Sharma and Prakash R. Devale",
  title   = "Approach for Transforming Monolingual Text Corpus into XML Corpus",
  journal = "International Journal of Applied Information Systems",
  year    = 2012,
  volume  = 1,
  number  = 9,
  pages   = "1-5",
  month   = "April",
  note    = "Published by Foundation of Computer Science, New York, USA"
}
Abstract
In this paper, we present an approach for converting a text-based monolingual corpus into an XML corpus. The text is first annotated with Part-Of-Speech (POS) tags using a standard tagging tool, which produces a tagged file. The tagged file is then parsed and converted into XML that conforms to a defined DTD (Document Type Definition). The resulting XML corpus can be used for Information Retrieval, Text-To-Speech conversion and Word Sense Disambiguation. It is also useful as a preprocessing step for parsing, because assigning a unique tag to each word reduces the number of candidate parses.
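The conversion step described in the abstract can be illustrated with a minimal Java sketch built on the standard DOM API (javax.xml.parsers and javax.xml.transform). It is not the authors' implementation. It assumes that the tagger writes one sentence per line with tokens of the form word_TAG, and it uses hypothetical element names (corpus, sentence, w) and a hypothetical pos attribute, since the abstract does not give the DTD.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

public class TaggedToXml {

    // Builds a DOM tree from POS-tagged text and serializes it to an XML string.
    // Assumptions (not taken from the paper): one sentence per line, tokens shaped
    // like "word_TAG", and the element/attribute names corpus, sentence, w, pos.
    public static String toXml(String taggedText) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .newDocument();
        Element corpus = doc.createElement("corpus");
        doc.appendChild(corpus);

        for (String line : taggedText.split("\\R")) {
            if (line.trim().isEmpty()) {
                continue; // ignore blank lines
            }
            Element sentence = doc.createElement("sentence");
            corpus.appendChild(sentence);

            for (String token : line.trim().split("\\s+")) {
                // Split on the last underscore so words containing '_' survive.
                int sep = token.lastIndexOf('_');
                if (sep <= 0) {
                    continue; // skip tokens that do not look like word_TAG
                }
                Element w = doc.createElement("w");
                w.setAttribute("pos", token.substring(sep + 1));
                w.setTextContent(token.substring(0, sep));
                sentence.appendChild(w);
            }
        }

        // Serialize the DOM tree; the XML declaration is omitted for brevity.
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        transformer.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(toXml("The_DT dog_NN barks_VBZ ._."));
    }
}
```

In this sketch each word carries exactly one pos value, which corresponds to the "unique tag to each word" property that the abstract credits with reducing the number of parses. Validating the output against an actual DTD would need the DTD itself, which the abstract does not reproduce.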
Keywords
Part-of-Speech Tagging, Java XML Library, DOM Parser