Journal of Research (Urdu), BZU

Abstract

This article argues for the need for an Urdu corpus modelled on the Bank of English and the Corpus of Contemporary American English (COCA), which serve as the backbone of English language engineering, discourse analysis, corpus and lexicon development, and related work. The proposed Urdu corpus, named The Bank of Urdu (TBU), will be a repository of written and spoken Urdu texts gathered in platform-independent, machine-readable Indo-Perso-Arabic script. Since the two English corpora mentioned have exactly the same architecture and interface, the name "English Corpus" refers to both repositories wherever this document compares the TBU with their structure. In addition to defining the scope of the TBU, this introductory paper discusses technical and design issues of its architecture and interface. Issues such as code-mixing, false friends and homonyms in Urdu are addressed, and a solution is given for standardizing Urdu orthography for this work. An exemplary web view of the user interface is provided. Available written Urdu texts are mostly literature-oriented, so from the data-gathering standpoint the proposed TBU must depart from the standard practice of the English corpora in many instances; this fact is dealt with specifically. A study of word counts and of the lexicalization of high-frequency Urdu words in notable Urdu dictionaries is made part of this paper. Aimed at discourse analysis, language engineering and natural language processing in Urdu and, of course, at providing a vital base for contemporary Urdu lexicon development, the proposed portal will not only separate the Urdu language from Urdu literature but will also serve as a model for regional Pakistani languages in housing their scholarly resources in their own scripts for such research.
This paper on the TBU is a proposal by Dr Hafiz Safwan Muhammad Chohan to give initial shape to the idea of the Urdu Data Bank (UDB) of the Center of Excellence for Urdu Informatics (CEUI), National Language Authority (NLA), Islamabad. Because the abbreviation UDB could also be read as Urdu Data Base, the UDB was renamed the TBU at the CEUI by consensus among scholars of Urdu, IT professionals and representatives of the GoP from the Cabinet Division and the Planning Division. The renaming took place at the national workshop "Urdu Informatics Today & Tomorrow", held on 7-8 June 2008 at the NLA, where Dr Chohan also coined the Urdu equivalent of TBU, اردو مثال گھر, which was accepted by the participants.

Acknowledgement & Dedication: Dr Hafiz Safwan Muhammad Chohan was in contact with Prof John McHardy Sinclair (June 14, 1933 - March 13, 2007), Professor of Modern English Language at Birmingham University, 1965-2000. Prof Sinclair pioneered work in corpus linguistics, discourse analysis, lexicography and language teaching, and was the driving force behind the British National Corpus (BNC) and the Collins COBUILD dictionaries. It is not customary to dedicate a research paper to a person, but, with deep regret that this paper (in both Urdu and English) was not written while he was alive, this effort is dedicated to him.