Erinevused, kaugused ja sõrmejäljed. Stilomeetria ja mitmemõõtmelise tekstianalüüsi alused
Differences, distances and fingerprints. The fundamentals of stylometry and multivariate text analysis

Author(s): Artjoms Sela
Subject(s): Computational linguistics, Estonian Literature, Methodology and research technology, Sociology of Literature
Published by: SA Kultuurileht
Keywords: stylometry; computational text analysis; authorship attribution; Estonian fiction

Summary/Abstract: The recent rapid expansion of computational methods and tools into the humanities has rekindled the conversation about the relationship between an object of study and its mathematical representation, or model. The paper serves as a conceptual introduction to stylometry, a subfield of computational text analysis that studies differences between texts quantitatively, and shows how simple models of texts can be used to uncover complex relationships between them. The term “stylometry” and the field’s development are closely linked to the problem of authorship attribution and identification. The paper briefly introduces the early history of stylometry, highlighting the assumptions about texts and authorship that the field has inherited. It shows how text analysis methods have shifted from the analysis of single features (such as word length) to multivariate computations. The paper explains the basics of modern multivariate text analysis and the notion of a “distance” between texts as a proxy for their difference, focusing primarily on word frequencies as units of analysis. The paper concludes with a series of preliminary authorship attribution experiments on a small corpus of Estonian fiction, which test the behavior of a few foundational variables and serve as a general proof of concept. The experiments show that the Cosine Delta distance performs reliably on the non-lemmatized Estonian corpus when the frequencies of at least the 100 most frequent words are used. Cosine Delta also achieves stable attribution accuracy for random samples of at least 5,000 words, which suggests that texts shorter than this may not be reliably attributed with the proposed methodology. Both findings are consistent with observations made for other languages, but should be treated as preliminary for Estonian.
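
For readers unfamiliar with the measure named in the abstract, the following is a minimal Python sketch of Cosine Delta as it is commonly formulated in the stylometric literature: relative frequencies of the most frequent words are standardized to z-scores across the corpus, and the cosine distance is taken between the resulting vectors. It is an illustration only and not necessarily the procedure used in the paper. All function names are hypothetical, and ranking the most frequent words by pooled raw counts is an assumption of this sketch.

```python
import math
from collections import Counter


def most_frequent_words(tokenized_texts, n):
    """Return the n most frequent words, ranked by pooled raw counts (an assumption)."""
    corpus_counts = Counter()
    for tokens in tokenized_texts:
        corpus_counts.update(tokens)
    return [word for word, _ in corpus_counts.most_common(n)]


def relative_frequencies(tokens, vocabulary):
    """Relative frequency of each vocabulary word in one text."""
    counts = Counter(tokens)
    total = len(tokens)
    return [counts[word] / total for word in vocabulary]


def z_scores(freq_rows):
    """Standardize each word (column) across texts; needs at least two texts."""
    columns = list(zip(*freq_rows))
    means = [sum(col) / len(col) for col in columns]
    stds = [
        math.sqrt(sum((x - m) ** 2 for x in col) / (len(col) - 1))
        for col, m in zip(columns, means)
    ]
    # A word with zero variance carries no information, so it is mapped to 0.
    return [
        [(x - m) / s if s > 0 else 0.0 for x, m, s in zip(row, means, stds)]
        for row in freq_rows
    ]


def cosine_distance(u, v):
    """1 minus the cosine similarity; defined as 1.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)


def cosine_delta_matrix(tokenized_texts, n_mfw=100):
    """Pairwise Cosine Delta distances between texts given as token lists."""
    vocabulary = most_frequent_words(tokenized_texts, n_mfw)
    freqs = [relative_frequencies(tokens, vocabulary) for tokens in tokenized_texts]
    standardized = z_scores(freqs)
    return [[cosine_distance(a, b) for b in standardized] for a in standardized]
```

Calling `cosine_delta_matrix(texts, n_mfw=100)` on a list of tokenized texts returns a symmetric matrix in which smaller values indicate more similar word-frequency profiles. The default of 100 words mirrors the threshold mentioned in the abstract; tokenization and sampling are left to the caller.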

Journal: Keel ja Kirjandus

Issue Year: LXIV/2021
Issue No: 8-9
Page Range: 696-718
Page Count: 23
Language: Estonian
