Syntaktická proměna Českého akademického korpusu
The syntactic transformation of the Czech Academic Corpus

Author(s): Alla Bémová, Zdeňka Urešová, Barbora Hladká
Subject(s): Language and Literature Studies
Published by: AV ČR - Akademie věd České republiky - Ústav pro jazyk český
Keywords: corpus; syntactic annotation; annotation guidelines; annotation checking

Summary/Abstract: The idea of the Czech Academic Corpus (CAC) came to life in 1971 thanks to the Department of Mathematical Linguistics within the Czech Language Institute. By the mid-1980s, a total of 540,000 words had been manually annotated, both morphologically and syntactically. After the Prague Dependency Treebank (PDT), the largest annotated treebank of written Czech texts, was built, the conversion of the CAC to the PDT format began. The main goal was to make the CAC and the PDT compatible, and thus to enable the integration of the CAC into the PDT. The second version of the CAC is therefore a complete conversion of both the internal format and the annotation schemes. The conversion of the syntactic annotation began three years after the syntactic annotation of the PDT was finished. Such a situation is exceptional because, to our knowledge, there is no other language for which such a significant amount of data has been annotated in two successive projects. This article summarizes the experience acquired during the conversion of the CAC syntactic annotation.

  • Issue Year: 72/2011
  • Issue No: 4
  • Page Range: 268-286
  • Page Count: 19
  • Language: Czech