Korpusleksikograafia uued võimalused eesti keele kollokatsioonisõnastiku näitel.
New possibilities in corpus lexicography based on the example of the Estonian Collocations Dictionary.

Author(s): Maria Tuulik, Jelena Kallas, Kristina Koppel
Subject(s): Theoretical Linguistics, Lexis
Published by: Eesti Rakenduslingvistika Ühing (ERÜ)
Keywords: corpus lexicography; collocations dictionary; corpus query system; dictionary writing system; Estonian

Summary/Abstract: This article introduces new resources and methods used in Estonian corpus lexicography to compile monolingual Estonian dictionaries. Corpora support many lexicographic tasks: developing headword lists, assigning grammatical and frequency labels, dividing word senses, identifying collocations, finding good dictionary examples and finding translation equivalents (Kilgarriff 2013). The paper focuses on the features offered by Sketch Engine (Kilgarriff et al. 2004), a state-of-the-art lexicographic tool for corpus analysis. For Estonian, Sketch Engine hosts several corpora, including the recently created 260-million-word web corpus etTenTen13 and the 463-million-word Estonian National Corpus. Using the Estonian Collocations Dictionary as an example, we analyse how corpus data (headwords, collocations and example sentences) can be automatically extracted from the Estonian National Corpus. The Estonian Collocations Dictionary contains approx. 10 000 headwords (nouns, adjectives, verbs and adverbs). Within each entry, the collocates are grouped according to the lexico-grammatical structure of the collocational phrase, and one or two example sentences are provided for each collocation. The main elements needed to develop the algorithm for automatic data extraction are the Sketch Grammar and the Good Dictionary Example (GDEX; Kilgarriff et al. 2008) configurations. The new Sketch Grammar, version 1.6, includes all of the lexico-grammatical structures that will be presented in the collocations dictionary; it contains 116 rules in total. For the extraction of dictionary examples, the first version of GDEX for Estonian was developed. Classifiers covering optimum sentence length, optimum word length, the number and type of punctuation marks, word frequency, tokens starting with capital letters, abbreviations etc. were proposed and implemented. The use of classifiers brought significant improvements to the output. The data was extracted in XML format and imported into the EELex dictionary writing system, where it will be examined, edited and supplemented by lexicographers. The Estonian Collocations Dictionary will be published in 2018.
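
The kinds of classifiers listed in the abstract (sentence length, word length, punctuation, word frequency, capitalised tokens, abbreviations) can be illustrated with a minimal Python sketch of a GDEX-style scorer. It is not the Estonian GDEX configuration or the Sketch Engine API: the function names (`tokenize`, `gdex_score`, `best_examples`), every threshold and penalty weight, and the short abbreviation list are hypothetical, chosen only to show how such penalties might combine into a single ranking score.

```python
import re

# Illustrative assumptions only: these limits, weights and the abbreviation
# list are NOT taken from the Estonian GDEX configuration.
MIN_TOKENS, MAX_TOKENS = 6, 20
MAX_WORD_LEN = 20
ABBREVIATIONS = {"jne", "nt", "vt", "jt"}


def tokenize(sentence):
    """Split a sentence into word tokens and single punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", sentence)


def gdex_score(sentence, frequent_words):
    """Return a score in [0, 1]; higher means a better example candidate."""
    tokens = tokenize(sentence)
    words = [t for t in tokens if t[0].isalnum()]
    if not words:
        return 0.0
    score = 1.0
    # Sentence length: penalise sentences outside the preferred range.
    if not MIN_TOKENS <= len(tokens) <= MAX_TOKENS:
        score -= 0.3
    # Word length: penalise sentences containing very long words.
    if any(len(w) > MAX_WORD_LEN for w in words):
        score -= 0.2
    # Punctuation: commas and the sentence-final token are not penalised.
    odd_punct = [t for t in tokens[:-1] if not t[0].isalnum() and t != ","]
    score -= 0.1 * len(odd_punct)
    # Word frequency: penalise words missing from the frequent-word list.
    score -= 0.1 * sum(1 for w in words if w.lower() not in frequent_words)
    # Capital letters: penalise capitalised tokens after the first word.
    score -= 0.1 * sum(1 for w in words[1:] if w[0].isupper())
    # Abbreviations: penalise sentences containing a listed abbreviation.
    if any(w.lower() in ABBREVIATIONS for w in words):
        score -= 0.2
    return max(score, 0.0)


def best_examples(sentences, frequent_words, n=2):
    """Return the n highest-scoring sentences, best first."""
    ranked = sorted(sentences, key=lambda s: gdex_score(s, frequent_words),
                    reverse=True)
    return ranked[:n]
```

Calling `best_examples(candidates, frequent_words, n=2)` on the corpus sentences retrieved for one collocation would return up to two candidates, matching the one or two example sentences per collocation described above. A real configuration would tune the limits and weights against lexicographers' judgements.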

  • Issue Year: 2015
  • Issue No: 11
  • Page Range: 75-94
  • Page Count: 20
  • Language: Estonian