On The Art of Taming and Exploiting Parallel Tags in a Multilingual Corpus

Author(s): Alexandr Rosen
Subject(s): Language and Literature Studies
Published by: Wydział Polonistyki Uniwersytetu Warszawskiego
Keywords: parallel corpus; morphosyntactic tags; linguistic ontology; formal concept analysis; multilinguality

Summary/Abstract: The paper illustrates principles for adjusting morphosyntactic tags for various languages in multilingual corpora. Texts in multilingual parallel corpora can be annotated with the same tools as monolingual ones. However, even taggers for typologically similar languages use incompatible tagsets, often because the tagsets are rooted in different linguistic perspectives. The approach discussed in the article grounds multiple tagsets in an abstract interlingual representation of linguistic categories. The proposed hierarchy takes three views of word class: inflectional, syntactic and semantic (lexical). It involves Formal Concept Analysis, which helps resolve mismatches between language-specific tagsets. The approach also makes it possible to refine word-class tagsets by projecting distinctions made in one tagset onto a conceptually different one, after automatic word-to-word alignment. The procedure has been tested on the Czech tagset paired with the English and Polish tagsets.
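
The projection step mentioned in the abstract can be pictured with a minimal sketch. This is not the author's implementation: the function name `project_distinctions`, the toy tags (`PRON`, `DEM`, `VERB`, `ADV`) and the `/` notation for refined tags are all hypothetical, and the alignment links are assumed to come from some external word aligner. The sketch only shows how a distinction made in one tagset might be copied onto an aligned token that carries a coarser tag; it does not cover the concept hierarchy or Formal Concept Analysis.

```python
def project_distinctions(src_tags, tgt_tags, alignment):
    """Refine source tags with the tags of their aligned target tokens.

    src_tags, tgt_tags: lists of tag strings, one per token (toy tags).
    alignment: iterable of (src_index, tgt_index) word-to-word links,
               assumed to be produced by an external aligner.
    Unaligned tokens, and tokens whose aligned tag is identical,
    keep their original tag.
    """
    refined = list(src_tags)
    for i, j in alignment:
        if src_tags[i] != tgt_tags[j]:
            # Hypothetical notation: coarse source tag plus projected tag.
            refined[i] = f"{src_tags[i]}/{tgt_tags[j]}"
    return refined


# Toy sentence pair: the first source token has a coarse tag that the
# aligned target token's tag subdivides; the third token is unaligned.
src = ["PRON", "VERB", "ADV"]
tgt = ["DEM", "VERB"]
links = [(0, 0), (1, 1)]
refined = project_distinctions(src, tgt, links)
```

In a real setting the projected tag would presumably be accepted only where the two tagsets are known to be related, rather than copied unconditionally as here.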

  • Issue Year: 2012
  • Issue No: 63
  • Page Range: 241-256
  • Page Count: 16
  • Language: English