The Syntactic Annotation of the National Corpus of Polish

Author(s): Katarzyna Głowińska
Subject(s): Language and Literature Studies
Published by: Wydział Polonistyki Uniwersytetu Warszawskiego
Keywords: anotacja składniowa; Narodowy Korpus Języka Polskiego; słowa składniowe; grupy składniowe; parsowanie powierzchniowe; syntactic annotation; National Corpus of Polish; syntactic words; syntactic groups; shallow parsing

Summary/Abstract: Syntactic annotation in the National Corpus of Polish (NKJP) joins words into constituents, first at the level of syntactic words and then at the level of syntactic groups. In the first step, word-level segments, which are divided into narrower classes, are replaced by larger units called syntactic words. These include, among others, analytical tense and mood forms, analytical degree forms, reflexive verbs and discontinuous conjunctions. The segment-level NKJP tagset and the tagset for syntactic words differ substantially, since the latter was devised to cover broader grammatical classes and traditional grammatical categories such as tense, mood and reflexivity. Ten types of syntactic groups are distinguished: nominal group, numeral group, adjectival group, prepositional-nominal group, prepositional-adjectival group, prepositional-numeral group, adverbial group, discourse group, subordinate clause and interrogative clause. Special subgroups of the nominal group (NGadres, NGgodz, NGdata) were introduced to describe addresses, times and dates. For each group, a syntactic head and a semantic head are identified. The problems encountered during annotation included, among others, recognizing group boundaries, multi-word units, abbreviations, discontinuous groups and syntactic words, and atypical syntactic constructions. The annotation procedure combines automatic annotation by a shallow parsing system with manual post-editing by two independent annotators; in cases of disagreement, an adjudicator makes the final decision. The manually constructed grammar for syntactic words and syntactic groups was encoded in the shallow parsing system Spejd.

  • Issue Year: 2012
  • Issue No: 63
  • Page Range: 121-128
  • Page Count: 8
  • Language: English