Does pre-processing affect the correlation indicator between Twitter message volume and stock market trading volume?

Author(s): Joanna Michalak
Subject(s): Financial Markets, ICT Information and Communications Technologies
Published by: Wydawnictwo Naukowe Uniwersytetu Mikołaja Kopernika
Keywords: Twitter sentiment analysis; behavioral economy; data mining

Summary/Abstract: Motivation: A growing number of authors empirically verify the relationship between the volume of tweets and stock market indicators. The patterns extracted from Twitter most often take the form of time series that represent users’ activity at different levels of granularity (moods, emotions, relevant topics, or query-related messages). Sentiment analysis is a technique used to transform text data into information about mood and related behavioral categories, and supervised machine learning is the most commonly used approach to it. The results of an empirical analysis of the relationship between social media and the stock market therefore depend on the quality of the classification results, in which the quality of the features used to train the classifier plays a key role. The feature space is modified using various data pre-processing scenarios that aim to increase classification accuracy. The impact of data pre-processing on classification quality is often discussed in studies, but very few authors discuss its impact on the correlation indicator between Twitter and the stock market. Aim: To analyze the impact of tweet pre-processing on the Pearson correlation indicator between the mood of Twitter users and stock market trading volume. Results: The correlation between stock market trading volume and the volume of tweets was empirically confirmed. An effect of pre-processing on the correlation index was noted for the variables ‘all_tweets’ and ‘negative_tweets’, which is attributed to the significant number of tweets containing negation in the training set. The results are not conclusive, however: the differences between the Pearson correlation index calculated for scenario one and for scenario four are not significant. Even so, they indicate that noisy data may reduce the quality and precision of conclusions, especially when a certain category of noise is repeated frequently.
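
The comparison described in the abstract can be illustrated with a minimal sketch: compute the Pearson correlation between a daily tweet-count series and daily trading volume, once per pre-processing scenario. This is not the author's pipeline. The function name `pearson`, the scenario labels, and every number below are invented for illustration only, and the sketch assumes the tweets have already been classified and counted per day.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-deviations and sums of squared deviations
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)


# Hypothetical daily counts of tweets classified as negative, one series per
# pre-processing scenario (made-up values, not data from the paper).
negative_tweets = {
    "scenario_1": [120, 135, 128, 160, 155, 170, 142],
    "scenario_4": [118, 131, 130, 151, 158, 166, 140],
}
# Hypothetical daily trading volume for the same days (made-up values).
trading_volume = [1.10e6, 1.25e6, 1.18e6, 1.52e6, 1.40e6, 1.61e6, 1.30e6]

for scenario, counts in negative_tweets.items():
    print(scenario, round(pearson(counts, trading_volume), 3))
```

Comparing the coefficients across scenarios shows whether pre-processing shifts the indicator. Judging whether such a shift is significant needs a separate statistical test, which this sketch does not attempt.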

  • Issue Year: 19/2020
  • Issue No: 4
  • Page Range: 739-755
  • Page Count: 18
  • Language: English