Testing word embeddings for Polish

Author(s): Agnieszka Mykowiecka, Małgorzata Marciniak, Piotr Rychlik
Subject(s): Semantics, Computational linguistics, Western Slavic Languages
Published by: Instytut Slawistyki Polskiej Akademii Nauk
Keywords: distributional semantics; word embeddings; model evaluation; synonymy; analogy

Summary/Abstract: Distributional semantics represents word meaning as numeric vectors derived from the contexts in which words occur in large text collections. This paper addresses the problem of constructing such models for the Polish language. It compares the effectiveness of models built from lemmas and from word forms, trained with the Continuous Bag of Words (CBOW) and skip-gram approaches on different Polish corpora. The comparison rests on the results of two typical tasks solved with the help of distributional semantics: synonymy recognition and analogy recognition. The results show that no single approach to vector creation works best across tasks. The quality and size of the data matter most, but different strategy choices can also lead to significantly different results.
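
To make the two evaluation tasks concrete, the following is a minimal sketch of how synonymy and analogy queries are commonly posed against a set of word vectors. It is not the authors' evaluation code: the vector-offset formulation of the analogy task, the function names, and the three-dimensional toy vectors are all assumptions chosen for illustration, and real embeddings would be trained on a corpus with CBOW or skip-gram.

```python
import math

# Hypothetical toy vectors; real models have hundreds of dimensions
# and are learned from a corpus rather than written by hand.
TOY_VECTORS = {
    "król": [0.9, 0.8, 0.1],
    "królowa": [0.9, 0.1, 0.8],
    "mężczyzna": [0.1, 0.9, 0.1],
    "kobieta": [0.1, 0.2, 0.8],
    "władca": [0.85, 0.75, 0.15],
    "stół": [0.1, 0.1, 0.1],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(query, vectors, exclude=()):
    """Return the vocabulary word whose vector is closest to `query`."""
    candidates = (word for word in vectors if word not in exclude)
    return max(candidates, key=lambda word: cosine(query, vectors[word]))


def nearest_word(word, vectors):
    """Synonymy-style query: nearest neighbour of `word`, excluding itself."""
    return most_similar(vectors[word], vectors, exclude={word})


def solve_analogy(a, b, c, vectors):
    """Analogy query 'a is to b as c is to ?' via the vector offset b - a + c."""
    query = [vb - va + vc
             for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    # The three question words are excluded from the candidate answers.
    return most_similar(query, vectors, exclude={a, b, c})


synonym_guess = nearest_word("król", TOY_VECTORS)
analogy_guess = solve_analogy("mężczyzna", "król", "kobieta", TOY_VECTORS)
```

A model is then scored by the share of test items for which the returned word matches the expected one, which allows lemma-based and form-based models to be compared on the same items.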

  • Issue Year: 2017
  • Issue No: 17
  • Page Range: 1-19
  • Page Count: 19
  • Language: English