Towards a Corpus of Polish Dialect Texts

Author(s): Halina Karaś, Monika Kresa, Aleksandra Krawczyk-Wieczorek
Subject(s): Language and Literature Studies
Published by: Wydział Polonistyki Uniwersytetu Warszawskiego
Keywords: gwary polskie; korpus tekstów; innosłowiańskie korpusy gwarowe; Polish dialects; text corpus; corpora of other Slavic dialects

Summary/Abstract: The paper describes a research project whose aim is to develop the Corpus of Polish Dialect Texts (KGP). The corpus is to include dialect texts recorded over the last 60 years by dialectologists from various research centres (UW, UAM, UJ) and to cover all Polish dialects. The article discusses its aims, scope, materials and theoretical prerequisites. Since the guidelines to be followed are based on experience gathered during earlier corpus compilation, the paper provides a detailed outline of the development of the National Corpus of Polish, the corpus of the Maćkowce village in Podolia, as well as corpora of other Slavonic languages or spoken language corpora exhibiting non-standard features. The need for a corpus of Polish dialects is indisputable, particularly in view of European and worldwide progress in the study of non-standard language varieties. The project aims to compile a rich corpus of Polish dialectal texts (spoken Polish), comprising between 1 million and 10 million tokens, accessible online and expandable. The KGP must be a balanced and representative corpus (representative in the broad sense of the term, i.e. with respect to region, topic, genre and chronology). A manually annotated, balanced training subcorpus of about 0.5 million tokens is planned; it may be used both in linguistic research and in adapting existing NLP tools to the analysis of Polish dialects. The ultimate aim of the KGP team is to create a phonetically and morphologically annotated, lemmatized corpus containing meta-information (e.g. on the dialect, the village, the informant), one that will be of use not only to dialectologists but also to anyone interested in language issues. To that end, semi-orthographic notation will be employed, and the corpus will be accessible via the Internet and searchable through a user-friendly interface that supports even sophisticated queries.

  • Issue Year: 2012
  • Issue No: 63
  • Page Range: 129-146
  • Page Count: 18
  • Language: English