Predicting Oil Prices Using Twitter

The drop in oil price during late 2014 has had a significant impact across the globe. While some countries may have reaped the benefits of lower costs, others have suffered greatly. Russia, for example, loses approximately $2 billion in annual revenues for every dollar the oil price drops. As a result, it is no surprise that many have attempted to develop reliable models to forecast the price of oil.

Traditionally, economists have used financial models with features such as historical prices and production levels to identify trends in the oil market. More recently, machine learning models, including Artificial Neural Networks and Support Vector Machines, have also gained prominence in this field. Despite the use of sophisticated models, oil price forecasts remain unreliable, with accuracy only marginally better than a coin toss.

Recognizing the limitations of existing forecasting methods and the growing use of social media as a corpus for extracting business insight, I decided to explore Twitter's ability to predict oil prices as part of my undergraduate dissertation. Half a million tweets, going back five years, were collected using the Twitter API. All the tweets were authored by think tanks, oil corporations, and prominent energy journals. The features extracted from the data were the frequency of “oil”, the frequency of mentions of OPEC members, and the sentiment of tweets from oil companies, energy journals, and think tanks. The Stanford NLP and SentiStrength sentiment analyzers were used to score the sentiment of the tweets.
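To make the frequency features concrete, here is a minimal sketch of how weekly counts might be computed from dated tweets. It is an illustration only, not the dissertation's code: the function name `weekly_keyword_counts`, the partial `OPEC_TERMS` list, the weekly bucketing, and the sample tweets are all assumptions.

```python
import re
from collections import Counter
from datetime import date

# Hypothetical, partial keyword list; the study's exact list is not given.
OPEC_TERMS = ("saudi arabia", "iran", "iraq", "kuwait", "venezuela", "nigeria")

# Match "oil" as a whole word, ignoring case.
OIL_PATTERN = re.compile(r"\boil\b", re.IGNORECASE)


def weekly_keyword_counts(tweets):
    """Count mentions of "oil" and of OPEC members per ISO week.

    `tweets` is an iterable of (date, text) pairs.
    Returns {(iso_year, iso_week): Counter({"oil": n, "opec": m})}.
    """
    counts = {}
    for day, text in tweets:
        iso = day.isocalendar()
        key = (iso[0], iso[1])  # (ISO year, ISO week number)
        bucket = counts.setdefault(key, Counter())
        bucket["oil"] += len(OIL_PATTERN.findall(text))
        lowered = text.lower()
        bucket["opec"] += sum(lowered.count(term) for term in OPEC_TERMS)
    return counts


# Made-up sample tweets, for illustration only.
sample = [
    (date(2014, 11, 24), "Oil slides as OPEC meets; Saudi Arabia holds output"),
    (date(2014, 11, 27), "Crude oil falls again. Oil at four-year low."),
    (date(2014, 12, 1), "Iran and Iraq comment on oil prices"),
]
features = weekly_keyword_counts(sample)
for week, bucket in sorted(features.items()):
    print(week, dict(bucket))
```

Sentiment scores from an external analyzer could be aggregated per week and per author group in the same way, giving one feature row per week.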

The dissertation was divided into two main studies. The first study used the Granger causality test to identify a correlation between each of the features mentioned above and oil prices. The second study used the features to build a supervised learning model that predicted movements in the oil market (increase or decrease).

The results of the first study revealed a significant correlation between the sentiment of the tweets and the oil price. It also confirmed the hypothesis that the frequency of “oil” on Twitter is positively correlated with shifts in the oil market. The Granger causality test indicated that the tweets Granger-cause a change in price with a seven-week lag.
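For readers unfamiliar with the test, the sketch below shows the idea behind a Granger causality F-statistic: fit the target series on its own lags, fit it again with lags of the candidate series added, and check how much the residual error drops. It is a simplified illustration on synthetic data, not the dissertation's analysis. The helper names `lagged_matrix` and `granger_f_stat` and the generated series are assumptions, and the p-value step is omitted.

```python
import numpy as np


def lagged_matrix(series, lags, start):
    """Columns are series[t-1], ..., series[t-lags] for t = start..len-1."""
    n = len(series)
    return np.column_stack([series[start - k : n - k] for k in range(1, lags + 1)])


def granger_f_stat(y, x, lags):
    """F-statistic for H0: lags of x add nothing beyond y's own lags."""
    y = np.asarray(y, dtype=float)
    x = np.asarray(x, dtype=float)
    target = y[lags:]
    ones = np.ones((len(target), 1))
    # Restricted model: intercept plus lags of y only.
    restricted = np.hstack([ones, lagged_matrix(y, lags, lags)])
    # Unrestricted model: the same regressors plus lags of x.
    unrestricted = np.hstack([restricted, lagged_matrix(x, lags, lags)])

    def rss(design):
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        resid = target - design @ coef
        return float(resid @ resid)

    rss_r, rss_u = rss(restricted), rss(unrestricted)
    dof = len(target) - unrestricted.shape[1]
    return ((rss_r - rss_u) / lags) / (rss_u / dof)


# Synthetic weekly data: the "price" series follows the "tweet" series
# seven weeks later, plus noise. These are not the study's data.
rng = np.random.default_rng(0)
n_weeks, lag = 300, 7
tweet_signal = rng.normal(size=n_weeks)
price_change = 0.1 * rng.normal(size=n_weeks)
price_change[lag:] += 0.8 * tweet_signal[:-lag]

f_forward = granger_f_stat(price_change, tweet_signal, lags=lag)
f_reverse = granger_f_stat(tweet_signal, price_change, lags=lag)
print(f"tweets -> price F = {f_forward:.1f}, price -> tweets F = {f_reverse:.1f}")
```

On data built this way, the forward statistic should be far larger than the reverse one. In practice, a library routine such as `grangercausalitytests` from statsmodels also reports p-values for each lag.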

The second study aimed to investigate the significance of the features as inputs to a predictive model. This was done using Artificial Neural Networks, Support Vector Machines, and Naive Bayes classifiers. The models were built to forecast the directional shift in the oil market seven weeks in the future. The best model, an SVM, achieved a classification accuracy of 74.29%, outperforming existing methods referenced in the literature.
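A seven-week-ahead directional forecast can be framed as a classification dataset roughly as follows. This is a sketch under stated assumptions rather than the dissertation's pipeline: the label definition (price higher seven weeks later than in the current week), the chronological split, the function names, and the toy numbers are all hypothetical.

```python
HORIZON_WEEKS = 7  # forecast horizon reported in the text


def make_direction_dataset(features, prices, horizon=HORIZON_WEEKS):
    """Pair week-t features with the price direction `horizon` weeks later.

    The label is 1 if prices[t + horizon] > prices[t], else 0.
    This label definition is an assumption made for illustration.
    """
    rows = []
    for t in range(len(prices) - horizon):
        label = 1 if prices[t + horizon] > prices[t] else 0
        rows.append((features[t], label))
    return rows


def chronological_split(rows, train_fraction=0.8):
    """Split without shuffling so the test weeks come after the training weeks."""
    cut = int(len(rows) * train_fraction)
    return rows[:cut], rows[cut:]


def accuracy(predicted, actual):
    """Fraction of predictions that match the true labels."""
    hits = sum(p == a for p, a in zip(predicted, actual))
    return hits / len(actual)


# Toy weekly data: one feature vector per week (e.g. [oil_count, sentiment]).
prices = [100, 98, 97, 95, 96, 94, 90, 92, 99, 91]
features = [[3, 0.1], [5, -0.2], [4, 0.0], [6, -0.4], [2, 0.3],
            [7, -0.5], [8, -0.6], [3, 0.2], [1, 0.4], [5, -0.1]]

rows = make_direction_dataset(features, prices)
train, test = chronological_split(rows)
labels = [label for _, label in rows]
print(labels)  # → [0, 1, 0]
```

The training rows could then be passed to any classifier, for example scikit-learn's `SVC`, and `accuracy` used to compare its predictions on the held-out weeks against a coin-toss baseline.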

The results of both studies indicated that there is indeed a significant correlation between Twitter activity and future shifts in the oil price. However, there remains substantial room for improvement in the model, specifically in feature selection and natural language processing.