ON THE ROBUSTNESS OF SENTIMENT ANALYSIS FOR STOCK PRICE FORECASTING

Anonymous authors
Paper under double-blind review

Abstract

Machine learning (ML) models are known to be vulnerable to attacks at both training and test time. Despite the extensive literature on adversarial ML, prior efforts focus primarily on applications of computer vision to object recognition or of sentiment analysis to movie reviews. In these settings, the incentives for adversaries to manipulate the model's prediction are often unclear, and attacks require extensive control over direct inputs to the model. This makes it difficult to evaluate how severe the exposed vulnerabilities are for systems that deploy ML with few provenance guarantees for their input data. In this paper, we study adversarial ML in the context of stock price forecasting, where adversarial incentives are clear and can be quantified experimentally through a simulated portfolio. We replicate an industry-standard pipeline that performs sentiment analysis of Twitter data to forecast trends in stock prices. We show that an adversary can exploit the lack of provenance to indirectly use tweets to manipulate the model's perceived sentiment about a target company and, in turn, force the model to forecast prices erroneously. Our attack is mounted at test time and does not modify the training data. Given past market anomalies, we conclude with a series of recommendations for the use of machine learning as an input signal to trading algorithms.

1. INTRODUCTION

Research on the vulnerability of machine learning (ML) to adversarial examples (Biggio et al., 2013; Szegedy et al., 2013) has focused, with few exceptions (Kurakin et al., 2016; Brown et al., 2017), on adversaries with immediate control over the inputs to an ML model. Yet, ML systems are often applied to large corpora of data collected from sources only partially under the control of adversaries. Recent advances in language modelling (Devlin et al., 2019; Brown et al., 2020) illustrate this well: they rely on training large architectures on unstructured corpora of text crawled from the Internet. This raises a natural question: when the provenance of training or test inputs to ML systems is ill-defined, does this advantage model developers or adversaries? Here, by provenance we refer to a detailed history of the flow of information into a computer system (Muniswamy-Reddy et al.).

We study the example of such an ML system for stock price forecasting. In this application, ML predictions can serve as inputs to algorithmic trading or assist human traders. We choose stock price forecasting because it involves several structured applications of ML, including sentiment analysis over a spectrum of public information sources (e.g., news, Twitter, etc.) with few provenance guarantees. There is also a long history of leveraging knowledge not accessible to all market participants to gain an edge in predicting the prices of securities: Thales used his knowledge of astronomy to corner the market in olive-oil presses and generate a profit.

We first reproduce an ML pipeline for stock price prediction, inspired by practices common in industry. We note that choosing the right time scale is of paramount importance: ML is better suited to low-frequency intra-day and weekly trading than to high-frequency trading, because the latter requires decision speeds much faster than achievable with ML hardware accelerators.
Although there has been prior work on attacking ML for high-frequency trading (Goldblum et al., 2020), their experimental setting is 7 orders of magnitude slower than the NASDAQ timestamps (NASDAQ, 2020) on which high-frequency trading firms operate. In contrast, ML in low-frequency trading has attracted greater practical interest from industry, with two major financial data vendors vastly expanding their sentiment data API offerings in the past decade (Bloomberg, 2017; Reuters, 2014). This expansion serves the growing demand from institutional and advanced retail market players to use ML models for sentiment analysis at lower frequencies.

We adopt this low-frequency setting and collect our data from Twitter and Yahoo Finance, which provides stock price data at 1-minute frequency. Using these services, we collected tweets related to Tesla, together with Tesla's stock prices, over 3.5 years. We then used FinBERT (Araci, 2019), a financial sentiment classifier, to extract sentiment-based features from every tweet. We describe these methods in more depth in Section 3.1; they are general and applicable to any company of interest. We use the sentiment features as inputs and the change in price as the target to train multiple probabilistic forecasting models. We show that including sentiment features extracted from Twitter can halve the mean absolute error of a model that learns only from historical stock prices. Section 3.2 introduces the probabilistic models used in our work.

Predicting stock prices is a notoriously hard task. However, even a limited per-trade edge can lead to a large gain when scaled by the number of trades performed (Laughlin, 2014). Moreover, sentiment analysis is typically only one of many indicators used in a trading decision. Hence, our model only needs to provide a slight, non-trivial advantage over random baselines to be effective in practice.
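The feature-extraction step described above can be sketched as follows. This is a minimal illustration, not the paper's exact feature set: the (positive, negative, neutral) probability triples stand in for FinBERT's per-tweet outputs, and the specific aggregates (mean sentiment, net sentiment, tweet volume) are assumptions about how tweet-level scores might be pooled into interval-level model inputs.

```python
from statistics import mean

# Hypothetical per-tweet FinBERT outputs for one trading interval:
# each entry is (p_positive, p_negative, p_neutral), summing to 1.
tweets = [
    (0.70, 0.10, 0.20),
    (0.05, 0.85, 0.10),
    (0.40, 0.30, 0.30),
]

def sentiment_features(scores):
    """Aggregate tweet-level sentiment scores into one feature
    vector per trading interval."""
    pos = [s[0] for s in scores]
    neg = [s[1] for s in scores]
    return {
        "mean_pos": mean(pos),
        "mean_neg": mean(neg),
        "net_sentiment": mean(pos) - mean(neg),  # bullish minus bearish
        "tweet_count": len(scores),              # discussion volume
    }

feats = sentiment_features(tweets)
```

A vector like this, computed per interval, could then be concatenated with lagged price changes as input to the forecasting models.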
We used our forecasts to implement portfolio strategies with positive returns, and measured their performance across other metrics to showcase their advantage.

Equipped with this realistic ML pipeline for stock price forecasting, which takes input data from a source with few provenance guarantees (i.e., Twitter), we set out to study its robustness. Unlike previous settings for adversarial examples such as vision, where the realistic incentives for an adversary to change a classification are often unclear (Gilmer et al., 2018), in our setting there are clear financial interests at stake. Furthermore, Twitter is already subject to vast disinformation campaigns (Zannettou et al., 2019), which make it even more complicated to assess the provenance of data analyzed by ML pipelines.

To investigate the robustness of our stock price prediction pipeline, we use adversarial examples to show that our financial sentiment analysis model is brittle in this setting. While attacks against NLP models are not new in research settings, our work demonstrates the practical impact that attacks on an NLP component of a system (i.e., sentiment analysis) can have on downstream tasks like stock price prediction. We show that an adversary can significantly manipulate the forecasted distribution's parameters in a targeted direction, while perturbing the sentiment features only minimally, and at test time only. The adversary picks a parameter and a direction, such as increasing the variance of forecasted stock prices, and computes a corresponding perturbation. If given control over the training data, the adversary's capabilities would only increase further.

The contributions of this paper are the following:
• We propose a realistic setting for adversarial examples through the task of sentiment analysis for stock price forecasting. We develop a dataset collection pipeline using ticker data and Twitter discussion for any ticker (i.e., company). This includes querying Twitter, processing the data into a format fit for training, and a suitable sentiment analysis model.
• We implement several sentiment-based probabilistic forecasting models that perform better than a naive approach at forecasting stock price changes given the Twitter discussion surrounding a company. DeepAR-G, a Gaussian probabilistic forecasting model, outperformed all other models in our experiments.
• We subject our pipeline to adversarial manipulation, leveraging information from our model to minimally modify its inputs into adversarial examples, while achieving substantial changes in the model's output, such as shifting the parameters of the forecasted distribution in any direction we wish.

We intend to release our code and data should the manuscript be accepted.

Beyond the implications of our findings in settings where an adversary is present, we stress that capturing model performance in the worst case is important for the domain of finance. These threats are very real: past market anomalies led to the collapse of the Knight Capital Group, and legal proceedings followed the suspected manipulation of Tesla stock via Twitter. It is therefore important for financial institutions to understand how their ML systems could behave in worst-case settings, lest market anomalies impact these systems in unprecedented ways. Furthermore, unforeseen catastrophic events (e.g., a natural disaster or a pandemic) are often hard to model via standard testing procedures, be they order-generation simulators or backtesting (i.e., simulating trading on replayed past data). Our methodology based on adversarial examples enables institutional traders to assess maximum-loss risk effectively.
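The test-time manipulation of forecasted distribution parameters can be sketched as follows. This is a toy illustration, not the paper's attack: a linear model with a softplus scale head stands in for the actual forecaster (e.g., DeepAR-G), and the weights, step sizes, and L2 perturbation budget are all assumptions. The adversary performs gradient ascent on the forecasted scale σ with respect to the sentiment features, projecting each step back into the perturbation budget.

```python
import numpy as np

# Toy stand-in for a probabilistic forecaster: sentiment features x
# are mapped to a Gaussian(mu, sigma) over the next price change.
rng = np.random.default_rng(0)
w_mu = rng.normal(size=5)   # hypothetical weights of the mean head
w_sig = rng.normal(size=5)  # hypothetical weights of the scale head

def forecast(x):
    mu = w_mu @ x
    sigma = np.log1p(np.exp(w_sig @ x))  # softplus keeps sigma > 0
    return mu, sigma

def attack_increase_sigma(x, eps=0.1, steps=50, lr=0.01):
    """Gradient ascent on sigma w.r.t. the input features, projected
    onto an L2 ball of radius eps around the clean input."""
    x_adv = x.copy()
    for _ in range(steps):
        # d sigma / d x = sigmoid(w_sig @ x) * w_sig (softplus derivative)
        grad = w_sig / (1.0 + np.exp(-(w_sig @ x_adv)))
        x_adv = x_adv + lr * grad
        delta = x_adv - x
        norm = np.linalg.norm(delta)
        if norm > eps:
            x_adv = x + delta * (eps / norm)  # project back into budget
    return x_adv

x = rng.normal(size=5)  # clean sentiment features for one interval
x_adv = attack_increase_sigma(x)
_, sig_clean = forecast(x)
_, sig_adv = forecast(x_adv)
assert sig_adv > sig_clean  # forecast uncertainty was inflated
```

Targeting a different parameter or direction (e.g., lowering the forecasted mean) amounts to swapping the objective whose gradient is followed; against a neural forecaster, the gradient would be obtained by backpropagation rather than in closed form.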

