Morfologinis dabartinės lietuvių kalbos tekstyno anotavimas
Morphologically Annotated Corpus of Contemporary Lithuanian Language

Author(s): Erika Rimkutė, Vidas Daudaravičius
Subject(s): Language and Literature Studies
Published by: Kauno Technologijos Universitetas
Keywords: morphological annotation; corpus; morphological analysis; ambiguity; statistical morphological disambiguation

Summary/Abstract: The article presents research on morphological disambiguation and morphological annotation of the 100-million-word Lithuanian corpus. Statistical methods made it possible to develop an automatic morphological annotation tool for Lithuanian. Using hidden Markov models for morphological annotation yields a precision of 94%, which is comparable to the precision achieved for other languages when a 1-million-word training corpus is used. A precision of 99% is reached in establishing the headwords (lemmas) of Lithuanian words. The precision measure evaluates only the disambiguation process; unrecognised words are excluded from the precision test. Unrecognised words make up 5.6% of all word-forms used (more than 800,000 different word-forms). A 1-million-word morphologically annotated corpus is sufficient for analysing morphological phenomena in the Lithuanian language, as the distribution of parts of speech in the whole 100-million-word corpus does not differ significantly from the distribution in the training corpus.

  • Issue Year: 2007
  • Issue No: 11
  • Page Range: 30-35
  • Page Count: 6
  • Language: Lithuanian