Statistilise keelemudeli adapteerimine eesti keele kõnetuvastuses
Statistical language model adaptation for Estonian speech recognition

Author(s): Tanel Alumäe
Subject(s): Language and Literature Studies
Published by: Eesti Rakenduslingvistika Ühing (ERÜ)
Keywords: speech recognition; language model adaptation; latent semantic analysis; fast marginal adaptation; morphemes; lemmatization

Summary/Abstract: This paper presents a statistical language model adaptation framework for Estonian large vocabulary speech recognition. Estonian is a highly inflected, agglutinative and compounding language. To reduce lexical variety, morphemes are used as basic units in a statistical language model. For language model adaptation, we use a small set of topic-specific sentences as an adaptation seed. Then, latent semantic analysis (LSA) is applied for finding semantically close texts from a large document corpus. The resulting adaptation corpus is used for compiling a topic-specific unigram language model for each story. The unigrams are combined with a background N-gram model using fast marginal adaptation, resulting in an adapted N-gram model. We compare words, lemmas and morphemes as basic units in the LSA model. The method is tested on an Estonian broadcast news transcription task. In the first pass of the recognition, a general background language model is used for finding recognition hypotheses for all utterances. The hypotheses are then used as an adaptation seed to compile an adapted language model for each news story. In the second recognition pass, the adapted models are applied to find new recognition hypotheses. We observe a significant improvement in speech recognition quality after applying the adapted models. The 10% drop in letter error rate when using morpheme-based adaptation is significantly better than when using either word or lemma-based adaptation. The article also discusses some possible reasons behind this observation.
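The abstract does not spell out the adaptation formula, so the following is only a minimal sketch of fast marginal adaptation in its commonly cited form, not the paper's implementation. It assumes that each background conditional probability P_bg(w|h) is scaled by alpha(w) = (P_topic(w) / P_bg(w))^beta and then renormalized per history. The function names, the add-one smoothing, the value of beta and the toy three-unit vocabulary are all assumptions made for illustration.

```python
from collections import Counter


def unigram_probs(tokens, vocab, smoothing=1.0):
    """Add-k smoothed unigram model over a fixed vocabulary (assumed smoothing)."""
    counts = Counter(t for t in tokens if t in vocab)
    total = sum(counts.values()) + smoothing * len(vocab)
    return {w: (counts[w] + smoothing) / total for w in vocab}


def fast_marginal_adaptation(bg_cond, bg_unigram, topic_unigram, beta=0.5):
    """Scale background conditionals by (P_topic / P_bg)^beta and renormalize.

    bg_cond maps a history tuple to a dict {unit: P_bg(unit | history)}.
    beta is a hypothetical tuning value, not taken from the paper.
    """
    alpha = {w: (topic_unigram[w] / bg_unigram[w]) ** beta for w in bg_unigram}
    adapted = {}
    for history, dist in bg_cond.items():
        scaled = {w: p * alpha[w] for w, p in dist.items()}
        z = sum(scaled.values())  # per-history normalization constant
        adapted[history] = {w: p / z for w, p in scaled.items()}
    return adapted


# Toy data (invented): three units, one bigram history.
vocab = ["ilm", "sport", "on"]
bg_unigram = {"ilm": 0.4, "sport": 0.3, "on": 0.3}
bg_cond = {("täna",): {"ilm": 0.4, "sport": 0.4, "on": 0.2}}

# Stand-in for the adaptation corpus that LSA retrieval would return for a story.
adaptation_tokens = ["sport", "sport", "sport", "on"]
topic_unigram = unigram_probs(adaptation_tokens, vocab)

adapted = fast_marginal_adaptation(bg_cond, bg_unigram, topic_unigram, beta=0.5)
```

In this toy setting the adapted model shifts probability mass toward units that are more frequent in the adaptation corpus than in the background data, while each conditional distribution still sums to one. In the two-pass setup described above, the first-pass hypotheses would play the role of the adaptation seed used to retrieve the adaptation corpus.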

  • Issue Year: 2008
  • Issue No: 4
  • Page Range: 005-016
  • Page Count: 12
  • Language: Estonian