AUTOMATISERT KLASSIFIKASJON AV NORSKE MÅLFORMER VHA. DATAUTVINNING AV UANNOTERT TEKST

Fartein Th. Øverland

AUTOMATED CLASSIFICATION OF VARIANTS OF NORWEGIAN BY MEANS OF TEXT MINING OF UNANNOTATED TEXT

Author(s): Fartein Th. Øverland
Subject(s): Language and Literature Studies, Studies of Literature, Other Language Literature
Published by: Studia Universitatis Babes-Bolyai
Keywords: Language Variation; Text mining; Orange Data Mining; Text Clustering; Text Classification; Bag-of-Words; Logistic Regression; Predictive Model; Norwegian Language; Nynorsk; Bokmål

Summary/Abstract: This article presents a model for automatically classifying variants of modern Norwegian (bokmål and nynorsk, in texts ranging from 1930 to 2011) by data mining unannotated text. The model is built in the Orange visual programming interface and is a modification of an example model presented by the project, originally intended for the semantic classification of fairy tale types in the Aarne-Thompson-Uther Index. The core modules of the model are Bag-of-Words and Logistic Regression. The model is trained on four different translations of the Gospel of John and cross-validated with various random texts. It proves very sound for classifying Norwegian language variation, yielding correct classifications in 100% of the realistic tests.
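
As a rough illustration of how a Bag-of-Words representation can feed a Logistic Regression classifier, the sketch below implements both steps in plain Python. It is not the article's Orange workflow: the function names, the learning rate, the number of epochs, and the four toy sentences are assumptions made up for this example, and the article's actual training corpus (the Gospel of John translations) is not reproduced here.

```python
import math
from collections import Counter

def tokenize(text):
    # Deliberately naive: lowercase and split on whitespace.
    return text.lower().split()

def build_vocab(texts):
    # Map every distinct token in the training texts to a column index.
    tokens = sorted({tok for t in texts for tok in tokenize(t)})
    return {tok: i for i, tok in enumerate(tokens)}

def bag_of_words(text, vocab):
    # Count vector over the vocabulary; unknown tokens are ignored.
    vec = [0.0] * len(vocab)
    for tok, n in Counter(tokenize(text)).items():
        if tok in vocab:
            vec[vocab[tok]] = float(n)
    return vec

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logreg(X, y, lr=0.5, epochs=200):
    # Binary logistic regression fitted with plain stochastic gradient descent.
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            w = [wj - lr * err * xj for wj, xj in zip(w, xi)]
            b -= lr * err
    return w, b

def predict(text, vocab, w, b):
    x = bag_of_words(text, vocab)
    p = sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)
    return "nynorsk" if p >= 0.5 else "bokmål"

# Hypothetical toy training data: label 0 = bokmål, label 1 = nynorsk.
train_texts = [
    "jeg vet ikke hva hun heter",
    "vi kommer fra en liten by",
    "eg veit ikkje kva ho heiter",
    "vi kjem frå ein liten by",
]
train_labels = [0, 0, 1, 1]

vocab = build_vocab(train_texts)
X = [bag_of_words(t, vocab) for t in train_texts]
w, b = train_logreg(X, train_labels)

print(predict("eg veit ikkje kvar dei bur", vocab, w, b))
print(predict("jeg vet ikke hvor de bor", vocab, w, b))
```

The classifier here relies only on word counts, so function words that differ between the two written standards (such as "jeg"/"eg" or "ikke"/"ikkje") end up carrying most of the weight. A real experiment would need far more text and a proper held-out evaluation.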


Journal: Studia Universitatis Babes-Bolyai - Philologia

Issue Year: 65/2020
Issue No: 3
Page Range: 107-124
Page Count: 18
Language: Norwegian
