Eesti vanade sõnakujude tuvastamisest suurte keelemudelitega
Identifying old Estonian word forms using large language models

Author(s): Madis Jürviste, Tiina Paet, Sven-Erik Soosaar
Subject(s): Language studies, Historical Linguistics, Finno-Ugrian studies
Published by: Eesti Rakenduslingvistika Ühing (ERÜ)
Keywords: historical lexicography; history of Estonian written language; large language models; Estonian

Summary/Abstract: As large language models (LLMs) have gained visibility and momentum in society since 2022, numerous researchers have studied how these new technologies can be applied to lexicographic research. This article deals with historical sources: how useful are LLMs in identifying old word forms in 17th- and 18th-century German-Estonian and Estonian-German dictionaries? More precisely, can these technologies reduce the time human researchers need to identify old word forms and connect them with the same words' modern written forms (even if the original word itself has been replaced by a completely new one over the centuries)? To answer these questions, the authors conducted an empirical qualitative study with three major LLMs: GPT-4o, Gemini 1.5 Pro and Claude 3 Opus. The study consisted of analysing the LLMs' capacities and success rates using API-request-based prompts in three main tests, each with a different sample: 30 old professional titles and denominations of societal roles (6 sources ranging from Stahl 1637 up to Hupel 1780); 54 dialectal words (in Gutslaff 1648); and 20 borrowed words (in 3 sources: Stahl 1637, Gutslaff 1648, and Göseken 1660). In these tests, Claude generally outperformed the other two models. However, the results vary with the characteristics of the sample words (words whose old orthography is similar to the modern one are more easily recognised). The high success rates, ranging from 74% to 90%, encourage the authors to consider carrying out tests with a larger sample, possibly encompassing whole dictionaries. This would significantly help lexicographers to trace the diachronic development of individual words in the entries of large Estonian monolingual explanatory dictionaries.

  • Issue Year: 2025
  • Issue No: 21
  • Page Range: 63-84
  • Page Count: 22
  • Language: Estonian