Tekstinių dokumentų panašumų paieška naudojant saviorganizuojančius neuroninius tinklus ir k vidurkių metodą
SIMILARITY ANALYSIS OF TEXT DOCUMENTS BY SELF-ORGANIZING MAPS AND K-MEANS

Author(s): Pavel Stefanovič, Olga Kurasova
Subject(s): Social Sciences
Published by: Vilniaus Universiteto Leidykla

Summary/Abstract: In this paper, we search for similarities between text documents using the self-organizing map (SOM) and the k-means method. One of the main goals of both methods is to cluster a dataset. With SOM, the similarities between documents can also be observed visually. Both methods work only with numerical data, so we analyse different options for converting text data into numerical form in order to get better results. To estimate SOM quality when classified data are analysed, we propose two new measures: the distances between SOM cells corresponding to data items assigned to the same class, and the distance between the centres of SOM cells corresponding to different classes. We also analyse the visualization results produced by self-organizing maps. To estimate k-means quality, we calculate the sum of distances between cluster centres and class members, and we also estimate how the data from particular classes are assigned to the clusters. The experiments have been carried out using three datasets acquired from the document database of the Seimas of the Republic of Lithuania.
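
The k-means part of the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say which text-to-number conversions or k-means settings were used, so the plain term-frequency vectors, the seeding from the first k documents, the function names, and the toy documents below are all assumptions made for illustration.

```python
import math
from collections import Counter


def term_frequency_vectors(docs):
    """Convert documents to word-count vectors over a shared vocabulary.

    Assumption: whitespace tokenization and raw counts; the paper compares
    several conversion options that are not listed in the abstract.
    """
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({word for tokens in tokenized for word in tokens})
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append([float(counts[word]) for word in vocab])
    return vocab, vectors


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def assign(vectors, centres):
    """Index of the nearest centre for every vector."""
    return [
        min(range(len(centres)), key=lambda c: euclidean(v, centres[c]))
        for v in vectors
    ]


def k_means(vectors, k, iterations=20):
    """Lloyd's algorithm, seeded with the first k vectors (a simplification)."""
    centres = [list(v) for v in vectors[:k]]
    for _ in range(iterations):
        labels = assign(vectors, centres)
        for c in range(k):
            members = [v for v, label in zip(vectors, labels) if label == c]
            if members:  # keep the old centre if a cluster is empty
                centres[c] = [sum(col) / len(members) for col in zip(*members)]
    return centres, assign(vectors, centres)


def sum_of_distances(vectors, centres, labels):
    """Sum of distances from each vector to the centre of its cluster."""
    return sum(euclidean(v, centres[label]) for v, label in zip(vectors, labels))


# Hypothetical toy documents on two topics.
docs = [
    "tax law budget",
    "forest park nature",
    "budget tax reform",
    "nature forest protection",
]
vocab, vectors = term_frequency_vectors(docs)
centres, labels = k_means(vectors, k=2)
total = sum_of_distances(vectors, centres, labels)
```

A lower `total` means tighter clusters. When the true class of every document is known, comparing `labels` with those classes shows how the data from each class are distributed over the clusters.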

  • Issue Year: 2013
  • Issue No: 65
  • Page Range: 24-33
  • Page Count: 10
  • Language: Lithuanian