Stratified historical corpus of Estonian 1800–1940
The article introduces a stratified historical corpus of Estonian covering 1800–1940. A stratified corpus allows sociolinguistic comparisons of language use between past authors, taking into account their background and biographical details (e.g. native dialect area, age cohort, attained education) or publication details (e.g. genre or publisher). The corpus assembles texts from a number of public archives and combines them with metadata on their publication details and the authors’ backgrounds. At the time of publication, the corpus consists of 4,412 works attributed to 1,188 author names, constituting 11% of the works registered in the Estonian National Bibliography for 1800–1940. The author names are linked to biographical information where possible. Three use cases on orthographic variation are introduced as examples of how the corpus can help study past language communities. The corpus is published online so that it can be updated as data is improved and more texts are digitized.