Stability of the syntagmatic probability distributions

Author(s): Strahinja Dimitrijević, Aleksandar Kostić, Petar Milin
Subject(s): Phonetics / Phonology, Lexis, Computational linguistics, South Slavic Languages, Experimental Psychology
Published by: Društvo psihologa Srbije
Keywords: corpus linguistics; quantitative linguistics; optimal sample size; conditional probabilities; Serbian language

Summary/Abstract: The aim of the present study is to establish criteria for the optimal size of a corpus that can provide stable conditional probabilities of morphological and/or syntagmatic types. Optimal corpus size is defined as the smallest sample that generates a probability distribution equal to the distribution derived from a large sample that yields stable probabilities. We refer to the latter as the “target distribution”. To establish these criteria, we varied the sample size, the word sequence size (bigrams and trigrams), the sampling procedure (randomly chosen words versus continuous text), and the position of the target word in a sequence. The distributions of conditional probabilities derived from smaller samples were correlated with the target distributions. The sample size at which a probability distribution reaches maximal correlation (r = 1) with the target distribution was taken as optimal. The research was conducted on the Corpus of Serbian Language. For random word selection, the optimal sample size is 65,000 words for bigrams and 281,000 words for trigrams. In contrast, continuous text sampling requires much larger samples to reach stability: 810,000 words for bigrams and 868,000 words for trigrams. The factors that caused these differences remain unclear and need additional empirical investigation.
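
The procedure summarized in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes bigrams only, continuous-text sampling taken as a prefix of the token list, and Pearson correlation computed over the bigrams observed in the target, with bigrams unseen in the smaller sample counted as probability 0. The function names and the `threshold` parameter are hypothetical; the abstract specifies r = 1, so a threshold slightly below 1 is an assumption made here to tolerate rounding.

```python
from collections import Counter
from math import sqrt


def bigram_conditional_probs(tokens):
    """Estimate P(w2 | w1) from adjacent token pairs."""
    pair_counts = Counter(zip(tokens, tokens[1:]))
    first_counts = Counter(tokens[:-1])
    return {pair: n / first_counts[pair[0]] for pair, n in pair_counts.items()}


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 if either series has no variance."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def correlation_with_target(sample_probs, target_probs):
    """Correlate a sample distribution with the target distribution.

    Assumption: the comparison runs over bigrams seen in the target;
    bigrams missing from the sample contribute probability 0.
    """
    keys = sorted(target_probs)
    return pearson(
        [sample_probs.get(k, 0.0) for k in keys],
        [target_probs[k] for k in keys],
    )


def smallest_stable_size(tokens, sizes, threshold=0.999):
    """Return (size, r) for the smallest prefix whose r meets the threshold."""
    target = bigram_conditional_probs(tokens)
    for size in sorted(sizes):
        sample = tokens[:size]  # continuous-text sampling (assumed: a prefix)
        r = correlation_with_target(bigram_conditional_probs(sample), target)
        if r >= threshold:
            return size, r
    return None
```

Calling `smallest_stable_size(tokens, sizes)` on a tokenized corpus with a list of candidate sample sizes returns the first size whose distribution matches the target closely enough, or `None` if no candidate does.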

  • Issue Year: 42/2009
  • Issue No: 1
  • Page Range: 107-120
  • Page Count: 14
  • Language: English