
Automaatse segmentimise hindamine
Evaluation of automatic speech segmentation

Author(s): Einar Meister, Lya Meister
Subject(s): Customs / Folklore, Theoretical Linguistics, Applied Linguistics, Cultural Anthropology / Ethnology, Culture and social structure
Published by: Eesti Kirjandusmuuseum
Keywords: automatic segmentation; Estonian; phone boundaries; segment durations; speech corpora; word boundaries

Summary/Abstract: The use of large speech corpora in phonetic research depends to a great extent on the availability and quality of phonetic segmentation and transcription. As a rule, the best segmentation quality is achieved by human transcribers, whose manual work is time-consuming and tedious. However, tools for automatic segmentation, typically based on HMM forced alignment, have been developed for different languages. In recent years, two automatic systems have become available for Estonian as free online services: (1) the system developed at Tallinn University of Technology (TUT) (https://phon.ioc.ee/dokuwiki/doku.php?id=projects:tuvastus:est-align.et), and (2) the multilingual tool WebMAUS (https://clarin.phonetik.uni-muenchen.de/BASWebServices/). In this study we evaluate the performance of the two systems against human transcribers. The test set consists of Estonian read speech produced by (1) four L1 adult subjects, (2) six L1 adolescents, and (3) four L2 adult subjects. The reference segmentations, comprising 27 sentences from L1 subjects and 10 sentences from the other subjects, were produced manually as Praat TextGrid files with two tiers (word-level orthographic and phoneme-level SAMPA transcription); the automatic systems produced similar TextGrid files. In total, 1179 word boundaries and 5050 phone boundaries were compared. The results show that both systems were more accurate for L1 adult speech than for adolescent and L2 speech. The TUT system outperformed WebMAUS on L1 adult speech, whereas WebMAUS produced more accurate results for L1 adolescent and L2 speech. Despite the deviations in phone boundaries, the durations of vowel and consonant segments measured from automatic and manual segmentations of L1 adult speech differ only marginally.
This suggests that both automatic systems are accurate enough for speech technology needs and could also be used in acoustic studies of L1 adult speech. However, both systems need improvement in order to reach the accuracy of the automatic segmentation tools available for English.
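The boundary comparison described in the abstract can be illustrated with a minimal sketch. It assumes that manual and automatic boundaries are already paired one-to-one and given as times in seconds, and it uses a 20 ms tolerance, which is a common convention in the segmentation literature and not necessarily the criterion used in the study. The function names and the sample values are hypothetical.

```python
from statistics import mean


def boundary_deviations(reference, automatic):
    """Absolute time differences (in seconds) between paired boundaries."""
    if len(reference) != len(automatic):
        raise ValueError("boundary lists must have the same length")
    return [abs(a - r) for r, a in zip(reference, automatic)]


def share_within(deviations, tolerance=0.020):
    """Fraction of boundaries deviating by at most `tolerance` seconds.

    The 20 ms default is an assumed convention, not a value from the study.
    """
    return sum(d <= tolerance for d in deviations) / len(deviations)


# Made-up boundary times in seconds, for illustration only.
manual = [0.10, 0.25, 0.40, 0.62]
auto = [0.11, 0.24, 0.45, 0.62]

devs = boundary_deviations(manual, auto)
print("mean absolute deviation (s):", round(mean(devs), 4))
print("share within 20 ms:", share_within(devs))
```

In practice the boundary times would be read from the word and phone tiers of the TextGrid files rather than typed in by hand.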

  • Issue Year: 2017
  • Issue No: 68
  • Page Range: 145-160
  • Page Count: 16
  • Language: Estonian