Mūsdienu latgaliešu valodas runas korpusa izveide mazāk lietoto valodu dokumentēšanas kontekstā
Creation of Contemporary Latgalian Speech Corpus in the Context of Documenting Lesser Used Languages

Author(s): Angelika Juško-Štekele, Antra Kļavinska
Subject(s): Sociolinguistics, Baltic Languages, Social Theory, Crowd Psychology: Mass phenomena and political interactions, Sociology of Culture
Published by: Latvijas Universitātes Literatūras, folkloras un mākslas institūts
Keywords: corpus linguistics; representativeness; corpus design; metadata; transcription; convention

Summary/Abstract: According to UNESCO data, in 2013 Latgalian, with 150,000 users, was recognised as one of the world’s endangered and vulnerable languages: all generations still use the oral form, but the sustainability of the language is seriously jeopardised because the number of young users is decreasing. Pursuant to the EU directives and recommendations on the preservation, research and development of regional and endangered languages, and to the Guidelines for the State Language Policy 2021–2027 regarding the development, online publication and accessibility of varied text corpora, in 2020 a group of researchers at the Rēzekne Academy of Technologies began work, within the State Research Programme project Digital Resources of Humanities: Integration and Development (No. VPP-IZM-DH-2020/1-0001), on the Contemporary Latgalian Speech Corpus (MuLaR), which is intended for the documentation, research, study and acquisition of Latgalian. The aim of the article is to identify and analyse the issues that are important in creating MuLaR, applying referential analysis of the scientific literature and comparative methodology.
Applying the analytical-synthetic method and drawing on the experience accumulated by the corpus creators, the authors developed an initial model of the corpus architecture and technological solutions. The model covers the following issues: ensuring that the Latgalian speech corpus is representative, bearing in mind the territorial distribution of Latgalian language communities and the diversity of Latgalian patois; the most appropriate methods for documenting natural, spontaneous language, namely collection of new data and use of existing recordings (interviews, TV and radio broadcasts, field research data collections) and other databases (reiti.rta.lv); understanding of metadata; ethical aspects of the speech corpus; transcription (software, and conventions that reveal the features of spoken text as accurately as possible); and creation of an accessible, easy-to-use open-access platform, drawing on the experience of creating speech corpora for lesser-used languages and dialects in other countries. The article outlines the main challenges for corpus development after the initial validation of the corpus data, including those related to the possibilities of morphological tagging of the corpus.

  • Issue Year: 2022
  • Issue No: 47
  • Page Range: 226-242
  • Page Count: 17
  • Language: Latvian