
The InterCorp Parallel Corpus with a Uniform Annotation for All Languages

Author(s): Alexandr Rosen
Subject(s): Language and Literature Studies, Theoretical Linguistics, Applied Linguistics, Morphology, Syntax
Published by: Jazykovedný ústav Ľudovíta Štúra Slovenskej akadémie vied
Keywords: parallel corpus; Universal Dependencies; multilinguality; syntactic annotation; language-universal categories

Summary/Abstract: The language-specific morphosyntactic annotation of InterCorp, a large multilingual parallel corpus, has recently been replaced by a language-uniform morphosyntactic and syntactic annotation that follows the guidelines of the Universal Dependencies project. Because the corpus is used predominantly by human users via a token-based concordancer, the CoNLL-U format produced by the UDPipe parser has been extended with attributes such as the lemma of the token’s syntactic head or the morphosyntactic categories of the content verb’s auxiliary. We conclude that, despite some theoretical and practical issues, the new annotation is a promising solution to the problem of mutually incompatible tagsets within a single corpus.
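
The kind of token-level extension mentioned in the abstract can be illustrated with a minimal sketch. It is not the InterCorp pipeline: the function names, the `head_lemma` and `aux_feats` attribute names, the `ROOT` placeholder, and the sample sentence are all assumptions made for illustration. Only the standard ten-column CoNLL-U layout (ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC) is taken as given, and numeric HEAD values are assumed.

```python
# Illustrative sketch only: names and output attributes are hypothetical,
# not those used in InterCorp. Assumes well-formed CoNLL-U with numeric HEADs.

SAMPLE = """\
# text = She has arrived.
1\tShe\tshe\tPRON\t_\t_\t3\tnsubj\t_\t_
2\thas\thave\tAUX\t_\tMood=Ind|Tense=Pres\t3\taux\t_\t_
3\tarrived\tarrive\tVERB\t_\tTense=Past|VerbForm=Part\t0\troot\t_\t_
4\t.\t.\tPUNCT\t_\t_\t3\tpunct\t_\t_
"""


def parse_sentence(block):
    """Read one CoNLL-U sentence into a list of token dictionaries."""
    rows = []
    for line in block.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and sentence-level comments
        cols = line.split("\t")
        if not cols[0].isdigit():
            continue  # skip multiword ranges ("1-2") and empty nodes ("1.1")
        rows.append({
            "id": int(cols[0]),
            "form": cols[1],
            "lemma": cols[2],
            "upos": cols[3],
            "feats": cols[5],
            "head": int(cols[6]),
            "deprel": cols[7],
        })
    return rows


def add_head_lemma(rows):
    """Give every token the lemma of its syntactic head."""
    lemma_by_id = {r["id"]: r["lemma"] for r in rows}
    for r in rows:
        # HEAD 0 marks the root; "ROOT" is a placeholder chosen for this sketch.
        r["head_lemma"] = lemma_by_id.get(r["head"], "ROOT")
    return rows


def add_aux_feats(rows):
    """Copy the FEATS of each auxiliary onto the verb it depends on."""
    by_id = {r["id"]: r for r in rows}
    for r in rows:
        r["aux_feats"] = []
    for r in rows:
        # Treat subtyped relations such as "aux:pass" as auxiliaries too.
        if r["deprel"].split(":")[0] == "aux" and r["head"] in by_id:
            by_id[r["head"]]["aux_feats"].append(r["feats"])
    return rows


if __name__ == "__main__":
    tokens = add_aux_feats(add_head_lemma(parse_sentence(SAMPLE)))
    for t in tokens:
        print(t["form"], t["head_lemma"], t["aux_feats"])
```

With attributes like these stored on each token, a token-based concordancer could match a word by properties of its head or of its auxiliary without having to traverse the dependency tree at query time.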

  • Issue Year: 74/2023
  • Issue No: 1
  • Page Range: 254-265
  • Page Count: 12
  • Language: English