Specializēta latviešu valodas runas korpusa un izrunas vārdnīcas izveide vizuālās diagnostikas izmeklējumu lingvistiskai analīzei un sistemātiskai transkribēšanai
Development of a Specialized Latvian Speech Corpus and Pronunciation Dictionary for the Linguistic Analysis and Systematic Transcription of Visual Diagnostic Examinations

Author(s): Ilze Auziņa, Roberts Darģis, Baiba Saulite, Normunds Grūzītis, Mikus Grasmanis, Andrejs Spektors, Kaspars Stepanovs
Subject(s): Information Architecture, Electronic information storage and retrieval, Baltic Languages
Published by: Latvijas Universitātes Literatūras, folkloras un mākslas institūts
Keywords: speech corpus; pronunciation dictionary; medical terminology; digital language resources; automatic speech recognition; natural language processing; post-editing

Summary/Abstract: The Laboratory of Artificial Intelligence (AiLab) at the Institute of Mathematics and Computer Science of the University of Latvia (IMCS UL), in cooperation with the Riga East University Hospital (REUH), has developed the RUTA:MED platform for automated transcription of medical audio recordings. The work was carried out within an ERDF-funded, industry-driven research project aimed at developing Latvian speech recognition systems for the medical domain. This paper describes the creation of Latvian language resources for the medical domain, focusing on digital imaging, which are needed to build a medical speech recognition system for Latvian. The language resources include a pronunciation dictionary, a text corpus for language modelling, and an orthographically transcribed speech corpus used for (i) adaptation of the acoustic model, (ii) evaluation of speech recognition accuracy, and (iii) development and testing of rewrite rules that automatically convert text to its spoken form and back to its written form. Experiments to date in adapting speech synthesis and speech recognition systems to medical applications have demonstrated the importance of industry-specific data. A general ASR system can be adapted to a specific field, in this case radiology, using specialized data sets: a specialized text corpus, a speech corpus, and a pronunciation dictionary. In addition to the corpora required for developing the language model and the acoustic model, a grammar for text expansion and compression is also important, because it reduces the amount of manual post-processing of automatically generated examination and disease descriptions.

  • Issue Year: 2022
  • Issue No: 47
  • Page Range: 244-262
  • Page Count: 19
  • Language: Latvian