Analysing Accuracy of Slovak Language Lemmatization and MSD Tagging


Author(s): Radovan Garabík, Denis Mitana
Subject(s): Language and Literature Studies, Applied Linguistics, Computational linguistics
Published by: Jazykovedný ústav Ľudovíta Štúra Slovenskej akadémie vied
Keywords: lemmatization; MSD tagging; POS tagging; Slovak

Summary/Abstract: Lemmatization and morphological tagging are indispensable steps in Slovak corpus linguistics. In this article, we evaluate two state-of-the-art Slovak language lemmatizers and MSD taggers, one based on MorphoDiTa and the other on spaCy. We measured accuracy on the test subset of a manually lemmatized and MSD-annotated corpus and found that the combination of lemma and tag achieved 93.5% accuracy with MorphoDiTa and 95.6% accuracy with spaCy. Most errors occurred in disambiguating MSD tags for homonymous uninflected parts of speech, such as particles, conjunctions, and adverbs, and in disambiguating between the singular masculine inanimate nominative and accusative. In these cases, spaCy shows a noticeable improvement over MorphoDiTa, likely owing to better exploitation of the context of the words.
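
The abstract reports accuracy for the combination of lemma and tag. As a minimal sketch of what such a metric could look like, assuming a token counts as correct only when both its lemma and its MSD tag match the gold annotation, the function below computes lemma, tag, and joint accuracy. The function name `accuracies`, the `(lemma, tag)` tuple layout, and the placeholder data are illustrative assumptions, not the authors' evaluation code or real corpus data.

```python
from typing import Dict, Sequence, Tuple

# Hypothetical token representation: (lemma, MSD tag).
Token = Tuple[str, str]


def accuracies(gold: Sequence[Token], pred: Sequence[Token]) -> Dict[str, float]:
    """Return lemma, tag, and joint (lemma + tag) accuracy over aligned tokens."""
    if len(gold) != len(pred):
        raise ValueError("gold and predicted sequences must be aligned token by token")
    if not gold:
        raise ValueError("cannot compute accuracy on an empty sequence")

    n = len(gold)
    lemma_ok = sum(g[0] == p[0] for g, p in zip(gold, pred))
    tag_ok = sum(g[1] == p[1] for g, p in zip(gold, pred))
    # Joint accuracy: both the lemma and the tag must match.
    joint_ok = sum(g == p for g, p in zip(gold, pred))

    return {"lemma": lemma_ok / n, "tag": tag_ok / n, "joint": joint_ok / n}


# Placeholder data (not real Slovak lemmas or MSD tags).
gold = [("a", "X"), ("b", "Y"), ("c", "Z"), ("d", "W")]
pred = [("a", "X"), ("b", "Q"), ("e", "Z"), ("d", "W")]
result = accuracies(gold, pred)
```

Under this definition the joint figure can never exceed either the lemma or the tag accuracy, since a token with one wrong component is counted as an error.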

  • Issue Year: 88/2023
  • Issue No: 2
  • Page Range: 129-140
  • Page Count: 12
  • Language: English