Latento Dirihlē sadalījumu modeļa izmantojums laikraksta Latvijas Kareivis tematu analīzē: Oskara Kalpaka gadījuma izpēte
The Model of Latent Dirichlet Allocation in the Topic Analysis of Latvian Soldier: Oskars Kalpaks’ Case Study

Author(s): Anda Baklāne, Valdis Saulespurēns
Subject(s): Cultural history, Archiving, Electronic information storage and retrieval, Sociology of Culture
Published by: Latvijas Universitātes Literatūras, folkloras un mākslas institūts
Keywords: topic modelling; digitized newspapers; digital history; topic coherence; National Library of Latvia

Summary/Abstract: The paper presents a case study applying the latent Dirichlet allocation (LDA) model to analyse topics in the corpus of Latvian Soldier (Latvijas Kareivis), the historical daily newspaper of the Latvian armed forces (1925–1940). Although topic modelling is one of the most popular text-analysis techniques in the digital humanities, it has not been extensively tested on texts in Latvian. The case study explored the possibility of implementing topic models as new functionality for exploring newspapers in the digital library of the National Library of Latvia. To imitate different use cases of topic modelling, two models were created: a 50-topic model of the whole Latvian Soldier corpus and a six-topic model of a subcorpus compiled from articles that contain the name ‘Kalpaks’. Both models produced usable, semantically coherent topics that could aid the exploration of historical newspapers. The authors conclude that the current quality of the models is sufficient for the approach of topic instrumentalism, which views topics as incomplete representations of texts that nonetheless usefully augment the investigative process. The topic models obtained seem particularly useful for combining the research practices of distant and close reading. Further testing and parameter adjustment are needed to produce concise and unambiguous topics that could be used reliably in research situations where extensive analysis and verification of the sources are not expected.

  • Issue Year: 2022
  • Issue No: 47
  • Page Range: 150-166
  • Page Count: 17
  • Language: Latvian