Automatic lemmatization of a text in phonetic transcription. A corpus of Polish local dialect from the Southern Borderland

Automatyczna lematyzacja tekstu w zapisie fonetycznym. Korpus polskiej gwary południowokresowej
Automatic lemmatization of a text in phonetic transcription. A corpus of Polish local dialect from the Southern Borderland

Author(s): Aleksandra Krawczyk-Wieczorek
Subject(s): Language and Literature Studies
Published by: Towarzystwo Miłośników Języka Polskiego
Keywords: text corpus; Polish from the borderland; vocabulary; lemmatization

Summary/Abstract: The paper presents an electronic corpus of the Polish dialect spoken in the village of Maćkowce in Ukraine. To build it, a dedicated computer tool, FonOrt, was created by M. Wieczorek. The texts, phonetically transcribed in MS Word files, were converted to XML and then lemmatized. Lemmatization proceeded by assigning to each token a character string that a morphological analyzer of Polish could interpret, usually the corresponding standard Polish form (e.g. kubita → kobieta, chudz’ima → chodzimy). The program then assigned lemmas to the resulting word forms using the Morfeusz SIaT analyzer. To lemmatize lexical borrowings and Polish dialectal words, which were selected from the texts manually, a list of their word forms was generated automatically. In the resulting corpus each token is annotated with its lemma and with additional information such as the speaker. The corpus can be searched with the Poliqarp tool.
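
The two-step procedure described in the abstract (phonetic token → standard form → lemma, with a separate word-form list for dialectal words and borrowings) can be illustrated with a minimal Python sketch. This is not FonOrt itself: the names `NORMALIZED`, `ANALYZER`, `Token`, and `lemmatize` are hypothetical, and the small `ANALYZER` dictionary is only a toy stand-in for a real morphological analyzer such as Morfeusz. The two normalization pairs are the examples given in the abstract.

```python
from dataclasses import dataclass
from typing import Dict, Optional

# Hypothetical normalization table: phonetic token -> standard Polish form.
# The two entries are the examples quoted in the abstract.
NORMALIZED: Dict[str, str] = {
    "kubita": "kobieta",
    "chudz’ima": "chodzimy",
}

# Toy stand-in for a morphological analyzer (assumption: the real tool
# calls an external analyzer here): standard word form -> lemma.
ANALYZER: Dict[str, str] = {
    "kobieta": "kobieta",
    "chodzimy": "chodzić",
}


@dataclass
class Token:
    form: str                  # token as transcribed phonetically
    normalized: Optional[str]  # standard Polish form, if one was assigned
    lemma: Optional[str]       # lemma, or None if it could not be resolved
    speaker: str               # additional annotation, e.g. the speaker


def lemmatize(form: str, speaker: str,
              dialect_forms: Optional[Dict[str, str]] = None) -> Token:
    """Annotate one token with a lemma.

    dialect_forms is a hypothetical word-form -> lemma list for dialectal
    words and borrowings; it is consulted before the standard-form route.
    """
    dialect_forms = dialect_forms or {}
    if form in dialect_forms:
        return Token(form, None, dialect_forms[form], speaker)
    normalized = NORMALIZED.get(form)
    lemma = ANALYZER.get(normalized) if normalized is not None else None
    return Token(form, normalized, lemma, speaker)


if __name__ == "__main__":
    for word in ("kubita", "chudz’ima"):
        print(lemmatize(word, speaker="A"))
```

Tokens that match neither table come back with `lemma=None`, which in a sketch like this marks them for manual review rather than guessing a lemma.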

  • Issue Year: 2012
  • Issue No: 1
  • Page Range: 11-19
  • Page Count: 9
  • Language: Polish