The ORAL corpus: construction, lemmatization and morphological tagging

Korpus ORAL: sestavení, lemmatizace a morfologické značkování
The ORAL corpus: construction, lemmatization and morphological tagging

Author(s): Marie Kopřivová, Zuzana Komrsková, David Lukeš, Petra Poukarová
Subject(s): Language and Literature Studies, Applied Linguistics
Published by: AV ČR - Akademie věd České republiky - Ústav pro jazyk český
Keywords: spoken Czech; spoken language corpora; lemmatization; tagging; morphological analysis

Summary/Abstract: The goal of this paper is to provide an overview of the structure and contents of the soon-to-be available ORAL corpus, which combines previously published corpora (ORAL2006, ORAL2008 and ORAL2013) with newly transcribed material into a single conveniently accessible and more richly annotated resource, about 6 million running words in length. The recordings and corresponding transcripts span a decade between 2002 and 2011; most of them capture interactions of mutually well-acquainted speakers, in informal situations and natural settings. The corpus is complemented by a marginal portion of more formal data, mostly public talks. It is tagged and lemmatized, and an effort was made to adapt existing tools (targeted at written language) to yield better results on spoken data. We hope the availability of such a resource will spawn further discussions on the morphological and syntactic analysis of spoken language, perhaps resulting in more radical departures in the future from the part-of-speech classification inherited from the linguistic analysis of written language.

  • Issue Year: 2017
  • Issue No: 15
  • Page Range: 47-67
  • Page Count: 21
  • Language: Czech