Lietuvių kalbos homografų vienareikšminimas remiantis leksemų ir morfologinių pažymų vartosenos dažniais
Disambiguation of Lithuanian Homographs Based on the Frequencies of Lexemes and Morphological Tags

Author(s): Pijus Kasparaitis, Tomas Anbinderis
Subject(s): Language and Literature Studies
Published by: Kauno Technologijos Universitetas
Keywords: text stressing; homographs; disambiguation; lexeme; morphological tag; speech synthesis

Summary/Abstract: Text-to-speech synthesis requires that the input text be stressed. The main problem is that existing stress-assignment algorithms for Lithuanian produce more than one possible stressing for some words (homographs). This work proposes a method based on the frequencies of occurrence of lexemes and morphological tags; such a method has not previously been used for Lithuanian. The frequencies were calculated from a text corpus of 1 million words, which was stressed automatically and then corrected manually. Homographs are disambiguated by removing the less frequently used grammatical forms and lexemes. Additional problems arise because a single word can correspond to more than two grammatical forms; to address this, a method based on the frequencies of pairs of grammatical forms is proposed. The frequencies of morphological tags were shown to play a more important role than the frequencies of lexemes. The proposed method disambiguates homographs with an accuracy of 85.01%. Although it does not use contextual information, its results are comparable with those achieved with the ID3 algorithm, which does use context.
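
The frequency-based selection described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the `Candidate` type, the `disambiguate` function, the tag and lexeme labels, and all counts are hypothetical placeholders, and the rule of ranking by tag frequency first and lexeme frequency second is an assumption motivated only by the abstract's remark that tag frequencies matter more. The method based on pairs of grammatical forms is not reproduced here.

```python
from collections import Counter
from typing import NamedTuple


class Candidate(NamedTuple):
    """One possible reading of a homograph (hypothetical structure)."""
    stressed: str  # stressed variant of the written word
    lexeme: str    # lexeme the variant belongs to
    tag: str       # morphological tag of the variant


def disambiguate(candidates, tag_freq, lexeme_freq):
    """Keep the reading with the most frequent tag.

    Ties on tag frequency are broken by lexeme frequency. This ordering is
    an assumption of the sketch, not a rule stated in the source.
    """
    if not candidates:
        raise ValueError("at least one candidate is required")
    return max(
        candidates,
        key=lambda c: (tag_freq[c.tag], lexeme_freq[c.lexeme]),
    )


# Toy counts standing in for corpus statistics (invented for illustration).
tag_freq = Counter({"noun.sg.nom": 120, "noun.sg.ins": 45, "verb.prs.3": 120})
lexeme_freq = Counter({"lex_a": 10, "lex_b": 30})

candidates = [
    Candidate("variant_1", "lex_a", "noun.sg.nom"),
    Candidate("variant_2", "lex_a", "noun.sg.ins"),
    Candidate("variant_3", "lex_b", "verb.prs.3"),
]

best = disambiguate(candidates, tag_freq, lexeme_freq)
print(best.stressed)
```

In this toy input two readings share the highest tag count, so the lexeme count decides between them. Because `Counter` returns 0 for unseen keys, a reading with a tag or lexeme absent from the counts simply ranks last instead of raising an error.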

  • Issue Year: 2009
  • Issue No: 14
  • Page Range: 25-31
  • Page Count: 7
  • Language: Lithuanian