Kai Wu

Deep Learning in Investing: Opportunity in Unstructured Data

July 2020


Executive Summary

We discuss the potential role of deep learning in investment management. We explain how deep learning can help investors streamline their consumption of unstructured data. We apply transfer learning to adapt models originally trained on large-scale, out-of-domain datasets for highly specialized investment applications. Transfer learning allows even small niche firms to harness the massive resources of big tech companies. Despite its transformative potential in unstructured data, most investors are still trying to apply deep learning directly to asset price prediction. We run simulations on a large panel of alphas to demonstrate the limitations of this approach.



Introduction

Deep learning is a machine learning technique utilizing complex, multi-layered statistical models, often with tens of millions or billions of parameters. Its recent ascent has been fueled by the rise of vast datasets and cheap computing.


Deep learning is widely used in the fields of computer vision, natural language processing, and speech recognition, which are characterized by large, complex, unstructured datasets. However, it has seen limited adoption in investment management. We believe this is because most investors are still trying to use it on traditional structured data to directly predict asset prices -- and structured financial data is not fertile ground for deep learning.


Exhibit 1

Powered by Deep Learning

Source: Sparkline, Waymo, Apple


In general, artificial intelligence begins its wave of disruption by first automating the most routine parts of our jobs. A significant portion of the financial analyst’s day is spent reading textual documents ranging from financial news to broker research. In the age of big data, this has become an increasingly overwhelming task. Fortunately, deep learning can greatly streamline the way we consume this data.


Investing is a niche industry with specialized documents only accessible to highly trained domain experts. Transfer learning helps us transcend this limitation by bringing in knowledge gained from bigger, broader domains. It also lowers barriers to entry, so that deep learning is no longer the plaything of the big tech oligopoly. Multimillion-dollar datasets and hardware not required!


This paper revolves around two practical investment case studies. First, we show how transfer learning can be used to produce state-of-the-art results in earnings call sentiment analysis. Second, we use a proprietary dataset of 1,000 alphas to show the limitations of using deep learning directly to predict asset prices.


Part 1: Unstructured Data

Warning: Natural language processing (NLP) is an extremely fast-moving field, and some of the ideas here may become outdated or even contradicted in the near future.


From Word Vectors to Language Models

Our June 2019 paper, Investment Management in the Machine Learning Age, discussed word embeddings (word2vec). Introduced in 2013, word embeddings are matrices that encode the relationships between words. We showed the graphic below, which illustrates how the words used in 10-Ks cluster based on common meaning.


Exhibit 2

10-K Word Embeddings

Source: Sparkline, SEC


However, word embeddings have a major limitation -- each word can only have a single vector representation and thus a single meaning. Yet words derive much of their meaning from context, which a single static vector cannot capture.


Exhibit 3

The Value of Context


Margin: profit margin vs. margin for error

Cloud: cloud computing vs. cloud cover

Turnover: employee turnover vs. share turnover


Source: Sparkline


Word embeddings are matrices trained using a simple two-layer neural network architecture. In order to capture context, we can use deeper neural networks (i.e., add more layers). The additional layers allow the model to learn more complex semantic representations.


Recall that word embeddings are trained using the “continuous bag of words” algorithm, which tries to predict a word using its neighbors. The deep learning models are trained to perform a conceptually identical task, called language modeling. Language models estimate the probability distribution of words given previous or future words. One representative training method is the “masked language modeling” approach used by the popular BERT model, in which we randomly mask words and train the model to predict them from context.


Exhibit 4

Guess What’s Behind the [MASK]?


A statistical language model is a probability [MASK] over sequences of [MASK]. Given such a [MASK], it assigns a probability to the [MASK] sequence.


Source: Sparkline, Wikipedia
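For readers who want to see this in action, here is a minimal sketch of masked-word prediction using the open-source Hugging Face transformers library. It illustrates the general technique, not the exact setup used in this paper, and the checkpoint name is just a common public model.

```python
# A minimal sketch of masked language modeling with a pre-trained model,
# using the Hugging Face transformers library (illustrative only).
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

# BERT's mask token is [MASK]; the model returns its top guesses with probabilities.
for guess in fill_mask(
    "A statistical language model is a probability [MASK] over sequences of words."
):
    print(f"{guess['token_str']:>15s}  {guess['score']:.3f}")
```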


After training on millions of documents, language models can do some cool stuff. The most obvious application is autocompletion, where we guess the word (or sequence of words) given a prompt.


Exhibit 5

Autocompletion

Source: Google


One important feature of language models is that they do not require humans to manually create the training data. Text can be automatically parsed into training examples, such as by randomly masking words. This enables us to cheaply create massive training corpuses from millions of websites, books, articles, and other written media.


However, moving from word embeddings to language models has its drawbacks. More complex models are more powerful but require more data and compute to train. We showed that, despite its simple architecture, word2vec produced impressive results when trained on a relatively small sample of 100,000 10-Ks, with the training process taking only a few minutes on standard hardware.


By comparison, the language model GPT-2 has 1.5 billion parameters and was trained on 8 million web pages. It has been estimated that training GPT-2 cost $20-50K of compute, spread over one to ten months of training time. Even putting aside time and money, there simply aren't enough 10-Ks in existence to train a model of this size. We could of course use a smaller model, but then we would have to sacrifice performance.


Transfer Learning

The big breakthrough came in early 2018 when language modeling was combined with transfer learning. The idea behind transfer learning is to first “pre-train” a model on a large general-purpose dataset, then “fine tune” it on a smaller domain-specific dataset for a specialized task. In our example above, we could pre-train GPT-2 on 8 million web pages then fine tune it on our 100,000 10-Ks. This avoids having to train the model from scratch on a small dataset.


Exhibit 6

Transfer Learning

Source: Sparkline


Language models are extremely useful for the pre-training stage of transfer learning. It turns out the ability to predict words requires a significant level of semantic awareness. This broad linguistic understanding is foundational for many other NLP tasks. For example, tasks as disparate as translation, question answering, and named entity recognition all benefit from starting with a pre-trained language model.


In practice, fine tuning involves starting with a pre-trained language model and swapping out its final layer for the specific building block that meets your needs. For example, if we want to do classification, we replace the final layer of the language model with a classifier head. We then retrain the model for the new task, adjusting the existing weights to incorporate learnings from the fine-tuning dataset.
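As a concrete illustration, here is a hedged sketch of the head-swapping step using the Hugging Face transformers library. The checkpoint name and the two-class setup are illustrative assumptions, not a description of our production pipeline.

```python
# A sketch of the "swap the final layer" step described above. Loading a pre-trained
# checkpoint into AutoModelForSequenceClassification reuses the encoder weights and
# attaches a freshly initialized classification head.
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "bert-base-uncased"  # any pre-trained language model checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)  # used to encode the fine-tuning text
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# From here, the whole network (encoder + new head) is fine-tuned on the
# domain-specific labeled dataset, e.g. with transformers.Trainer or a PyTorch loop.
```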


Exhibit 7

One Model, Many Uses

Source: Sparkline


One way to better understand fine tuning is by analogy to computer vision, where transfer learning had been widely applied prior to its crossover to NLP. In these models, the lower layers capture basic features such as edges and textures, while the higher layers depict more complete objects such as eyes, faces, legs, and dogs.


Exhibit 8

Lower Layers Encode Lower-Level Features

Source: Sparkline, Distill (h/t Sebastian Ruder)


In our context, the lower layers of the neural network capture the fundamental building blocks of language (e.g. words), while the higher layers contain higher-level linguistic concepts. The final layer is dedicated to our specific task. Fine tuning allows our model to utilize the fundamental knowledge from earlier layers, while adjusting the end output to our specific task.


We are extremely blessed that the NLP research community has embraced the open source philosophy. Anyone can freely download massive language models that have been pre-trained on millions of documents. This saves hundreds of thousands of dollars and weeks of training time, and spares researchers from constantly reinventing the wheel. With the heavy lifting out of the way, the fine tuning process is quite cheap and tractable even for less-resourced teams.


The NLP 🚀

The combination of language modeling and transfer learning opened the floodgates for a wave of innovation. Over the past couple of years, Google, Facebook, Microsoft, OpenAI and others have introduced a succession of models building on this foundational concept.


These models have gotten bigger and bigger as datasets, computing resources and modeling techniques have improved. In Feb 2018, the state-of-the-art ELMo model had 94 million parameters. By Oct 2019, the T5 transformer had pushed the frontier to 11 billion parameters. Last month, GPT-3 was released with 175 billion parameters. The exponential trendline shows that we have experienced a 10x increase in model size every 8.5 months since pre-trained language models were introduced in 2018.


Exhibit 9

NLP 🚀

Source: Sparkline (Adapted from HuggingFace)


This new wave of models has delivered state-of-the-art results across every NLP benchmark. For example, Exhibit 10 shows progress on SQuAD 2.0, a reading comprehension benchmark where crowdworkers pose questions based on a set of Wikipedia articles.


Exhibit 10

SQuAD 2.0 Leaderboard

Source: Sparkline, Papers With Code


BERT produced an inflection point when it was introduced in Oct 2018, breaking previous records and paving the way for the dozens of BERT descendants (e.g., ALBERT, RoBERTa, SemBERT) that have dominated the leaderboard ever since. In Mar 2019, another milestone was achieved as deep learning surpassed human performance on the test for the first time (86.8% accuracy).


These breakthroughs have made their way into the real world. In Oct 2019, Google began incorporating BERT into its search engine. The results are so good that researchers are being forced to confront the ethical implications of these models. For example, OpenAI decided to release GPT-2 in multiple phases to give cybersecurity officers more time to set up defenses against bad actors attempting to use the model to spread fake news.


These technologies are clearly very useful for big tech companies with enormous datasets. But are they also useful for investment firms with smaller and more specialized datasets? We will address this question now.


Earnings Call Sentiment

Sentiment analysis is a foundational NLP task. It involves classifying text into sentiment categories (e.g., positive vs. negative). Sentiment analysis allows us to convert complex unstructured data into concise numerical ratings. This is a valuable tool for investors trying to avoid being drowned by the modern firehose of information.


We use earnings calls as our test case. These quarterly calls are a forum for public company executives to discuss their financial results and outlook for the future. While regulatory filings, such as 10-Ks, tend to be written using standard templates and boilerplate legal language, earnings calls offer executives greater latitude to express sentiment (whether intentionally or not).


The Q&A section of the conference call tends to be particularly informative. For example, consider these two highly polarized comments by CEOs in response to analyst questions on recent calls.


Exhibit 11

Earnings Call Sentiment


“Yes. So we've never really disclosed beds per door, anything like that. What I will say is, we actually just completed a pretty big deep dive on this with cohort views. And no matter how we cut it, we are continuing to see same-store sales increase, which is terrific, and Q4 was no exception to that. So our strength in the marketplace continues to grow.”


- Joe Megibow, CEO, Purple Innovation Inc. (Mar 13, 2020)


Sentiment: Positive


“Understood. I'd say that we probably lost $0.5 million to $0.75 million in the fourth quarter of the year due to some of those headwinds as an approximation for the combination of outages, weathers and the like.”


- Vincent J. Arnone, Chairman, CEO, and President, Fuel Tech, Inc. (Mar 30, 2020)


Sentiment: Negative


Source: Sparkline, S&P


The Small Data Problem

For supervised learning tasks such as sentiment analysis, the biggest challenge is often obtaining large, high-quality labeled datasets. Data labeling is the process of associating each text with a "ground truth" target (i.e., positive or negative sentiment).


Researchers found a hack to create large labeled datasets for sentiment analysis: crowdsourced online reviews. For example, the IMDb Large Movie Review Dataset consists of 50K labeled movie reviews and is widely used in the field.


Exhibit 12

IMDb Reviews: 🔋Included

Source: IMDb


These reviews arrive labeled right out of the box. Each review has a star rating from 1 to 10. Researchers use similar techniques to compile large training datasets from Yelp and Amazon.
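For the curious, here is one way to pull this corpus using the Hugging Face datasets library. This is just a sketch; the raw review files can also be downloaded directly from the dataset's maintainers.

```python
# Load the IMDb Large Movie Review Dataset: 25,000 labeled training reviews
# and 25,000 labeled test reviews, already split and labeled out of the box.
from datasets import load_dataset

imdb = load_dataset("imdb")
print(imdb["train"][0]["text"][:200])  # raw review text
print(imdb["train"][0]["label"])       # 0 = negative, 1 = positive
```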


However, not all datasets come so nicely pre-packaged. In general, you will only have the raw text and will need to undertake the manual, time-consuming labeling process yourself. Given the specificity of the domain, it shouldn't be a surprise that no open-source dataset of earnings calls with binary sentiment labels exists.


This leads us to two more challenges faced by those in niche domains such as investing. First, it is a general principle that cost per label increases with domain specificity. While pretty much anyone can identify images of stop signs, it requires years of training to recognize signs of financial fraud. Second, even if money were no object, large datasets in niche industries may simply not exist. For instance, there are only a finite number of observations on which to train a model to find the next Enron or Wirecard.


While the media are obsessed with hyping “big data”, in many cases it is unrealistic to simply throw more data at the problem. We may be better served working to extract the most insight from the limited data we do have.


Cross-Training for Computers

With this in mind, we ran an experiment to see how well we could do in an extremely data-constrained environment. We labeled 100 earnings call transcript snippets by hand, classifying each as positive or negative. We used 50 to train the model and 50 to evaluate its out-of-sample performance. Compared to the 25,000 training samples in IMDb, a 50-observation training set is extremely small.


We used BERT as our representative deep learning model. BERT has 340 million parameters, so it should be no surprise that training on 50 observations did not work. We achieved a test accuracy of 54%, indistinguishable from random chance. For comparison, we also trained a simpler model -- logistic regression. This also did not work. Natural language is very complex.


As a benchmark, we tested the old-school dictionary approach. We used the Loughran-McDonald lexicon, which was created by two finance professors and is widely used in the industry. We classified texts based on the net occurrence of positive and negative words. Loughran-McDonald achieved a respectable accuracy of 68%. In a sense, dictionary methods are a form of transfer learning. Instead of artificial neural networks, we rely on Profs. Loughran and McDonald's actual neurons, pre-trained over their many years of experience in the field.
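A minimal sketch of this dictionary benchmark is below. The word lists shown are tiny placeholder subsets (the full Loughran-McDonald lexicon must be obtained from the authors), and the net-count scoring rule is a simplification of how such lexicons are typically applied.

```python
# Score each snippet by the net count of positive vs. negative dictionary words.
import re

LM_POSITIVE = {"strength", "terrific", "gain", "improve"}   # placeholder subset
LM_NEGATIVE = {"lost", "headwinds", "outages", "decline"}   # placeholder subset

def dictionary_sentiment(text: str) -> str:
    words = re.findall(r"[a-z']+", text.lower())
    score = sum(w in LM_POSITIVE for w in words) - sum(w in LM_NEGATIVE for w in words)
    return "positive" if score >= 0 else "negative"
```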


So far, machine learning has failed. In order to get reasonable results, we would need a significantly larger training sample. But now let’s see if transfer learning can help. We use the training progression below.


Exhibit 13

Transfer Learning for Earnings Call Sentiment

Source: Sparkline


BERT was originally pre-trained to perform language modeling on a large corpus of books and Wikipedia articles. Instead of initializing our model with random values, we can use these pre-trained weights. But books and Wikipedia articles differ greatly from earnings calls in structure, tone, and vocabulary. Thus, we continue BERT’s education. This time we have it read earnings call transcripts. Fortunately, language model training does not require us to manually label any data. Thus, we can give BERT tens of thousands of unlabeled transcripts to study without our supervision.
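Here is a hedged sketch of this domain-adaptation step using the Hugging Face transformers Trainer. The transcript file, sequence length, and training hyperparameters are illustrative placeholders, not our production configuration.

```python
# Continue BERT's masked language model training on unlabeled earnings call transcripts.
from datasets import load_dataset
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

# "transcripts.txt" is a placeholder: one transcript snippet per line, no labels needed.
transcripts = load_dataset("text", data_files={"train": "transcripts.txt"})["train"]
tokenized = transcripts.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=256),
    batched=True, remove_columns=["text"])

# Randomly mask 15% of tokens on the fly; the model learns to fill them back in.
collator = DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15)
trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="bert-earnings-calls",
                           num_train_epochs=1, per_device_train_batch_size=16),
    train_dataset=tokenized,
    data_collator=collator,
)
trainer.train()
```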


BERT now understands both general English and financial jargon. However, it has never done sentiment analysis. We correct this with one final transfer learning step: we train BERT on the IMDb dataset from earlier. While movie reviews are quite different from earnings calls, the sentiment analysis task is highly relevant. Think of all these steps as cross-training for computers. Putting in thousands of reps in the pre-season allows BERT to perform on game day.


We find that each transfer learning step increases the performance of the model. With all three, we achieve 89% accuracy. This is a full 21 percentage points better than Loughran-McDonald. This result is kind of incredible. We spent an hour labeling and now have a model that can extract transcript sentiment automatically with much greater accuracy than the current industry standard.


Exhibit 14

The Power of Transfer Learning

Source: Sparkline


BERT and its successors are extremely large models. Thus, one might assume they are only useful for huge companies like Google or Facebook with their billions of search records and social interactions. The beauty of transfer learning is that it allows us to take advantage of the vast resources baked into pre-trained language models for use with small, specialized datasets.


The fundamental techniques demonstrated here can be used for many other NLP tasks besides sentiment analysis. Pre-trained language models are an incredibly powerful tool, and we encourage you to think about other ways they can be applied to improve how we utilize unstructured data in our industry.


Exhibit 15

Transfer Learning in the Matrix

Source: Sparkline, The Matrix

Part 2: Structured Data

We now address the question that is probably on your mind, “Why not just directly apply deep learning to stock price prediction?”


The Signal and the Noise

These days, you can’t speak to a quant without him bragging about the size of his data. However, when it comes to training machine learning models, size isn’t everything.


Financial markets are extremely noisy places, where millions of buyers and sellers converge in a chaotic, unstable equilibrium. Even the best investors are unable to forecast stock prices with a great degree of accuracy. Investing lies in a realm adjacent to pure noise, where even a 55% hit rate makes one a top investor.


Exhibit 16

Financial Markets are Noisy

Source: Sparkline, Papers With Code


Noise makes it more difficult to train machine learning models. Noise dilutes the signal within a dataset, making the model more likely to be fooled by randomness. Noise is the main reason deep learning has not gained traction in investing. Just consider the areas where deep learning is most widely used: computer vision and natural language processing. These datasets are much less noisy, as evidenced by their much higher obtainable accuracies.


Financial market data’s low signal-to-noise ratio is a huge problem for machine learning models. Noise greatly dilutes the number of effective observations in a dataset. As the old saying goes, “a cat image is worth a thousand EV/EBITDA ratios!” Size should not be measured in rows or bytes, but instead as the amount of signal we can hope to extract from the data.


Except in high frequency trading, where we may be able to compensate for noise with an extremely high quantity of data, we should be skeptical about applying deep learning directly to asset price prediction. As we will soon see, the problem of small data is compounded by deep learning models' need for extremely large datasets.


The Right Tool for the Job

Deep learning can theoretically uncover much more complex relationships in data than traditional statistical models. However, more powerful models also require more data to avoid overfitting. Overfitting occurs when a model has too many trainable parameters relative to training observations. It leads to models that look great in sample but fail to generalize in real life.


Exhibit 17

If Goldilocks Were a Statistician

Source: edpresso


Every dataset has an optimal level of model complexity. Overly simple models underfit, failing to capture all the nuances of the data. Overly complex models overfit, failing to work out of sample.


The point at which optimal complexity is achieved depends on the size of the dataset. Bigger datasets can sustain more complex models. The extremely stylized chart below illustrates this point.


Exhibit 18

Model Complexity Should Match Data Size

Source: Sparkline


We have already shown that financial datasets are quite small after taking into account the dilutive effect of their low signal-to-noise ratios. Thus, we should expect their optimal model complexity to be rather low based on theory alone.


Optimal Complexity

We illustrate this point empirically using our own data. Sparkline has a library of thousands of alphas. These range from standard quant factors like price-to-book ratios to proprietary signals derived from crawling the public internet. We use a random subset of 1,000 of these signals for the experiment below.


Neural networks can be viewed as linear regression with more layers. Conversely, linear regression can be viewed as a neural network with only one layer. Thus, we begin with linear regression and successively build more complex architectures. We use feedforward neural networks with batch normalization, ReLU, and dropout. Don't worry about the details -- the main takeaway is that these networks get more complex as we add depth.
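For the technically inclined, here is a stylized PyTorch sketch of these architectures. The hidden widths (100 and 50) and the dropout rate are assumptions; that particular width pattern happens to reproduce the 1,001- and 105,501-parameter counts quoted later, but it is not necessarily our exact configuration.

```python
# Feedforward networks of increasing depth with batch norm, ReLU, and dropout.
import torch.nn as nn

def make_model(n_features: int = 1000, hidden: tuple = ()) -> nn.Sequential:
    """hidden=() is plain linear regression; each extra width adds one hidden layer."""
    layers, d_in = [], n_features
    for width in hidden:
        layers += [nn.Linear(d_in, width), nn.BatchNorm1d(width),
                   nn.ReLU(), nn.Dropout(0.5)]  # dropout rate assumed
        d_in = width
    layers.append(nn.Linear(d_in, 1))  # predict next month's market-relative return
    return nn.Sequential(*layers)

linear_model = make_model()                   # 1 layer, 1,001 parameters
three_layer = make_model(hidden=(100, 50))    # 3 layers, 105,501 parameters
n_params = sum(p.numel() for p in three_layer.parameters())  # -> 105501
```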


Exhibit 19

Neural Networks of Increasing Complexity

Source: Sparkline, NN-SVG


Our 1,000 signals cover the Russell 3000 stock universe for the 18-year period ending 3/31/2020. We use 2002-2012 as the training period, 2012-2016 as the validation period, and 2016-2020 as the out-of-sample test period. We abstain from standard cross-validation in order to maintain the temporal order of the data.


We train our models with a standard regression objective: predict each stock's return next month relative to the market. We then build a market-neutral strategy from the model's predictions. Exhibit 20 shows the simulated returns of each strategy (without fees or transaction costs).
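To make the objective concrete, the sketch below trains the 3-layer network from the previous sketch with an ordinary mean-squared-error loss. Random data stands in for our proprietary signal panel, and the epoch count, batching, and optimizer settings are simplified assumptions rather than our actual setup.

```python
# Train on the 1,000 signals to predict next-month market-relative returns.
import torch
import torch.nn as nn

n_obs, n_signals = 10_000, 1_000
X = torch.randn(n_obs, n_signals)        # placeholder for the signal panel
y = torch.randn(n_obs, 1) * 0.05         # placeholder next-month excess returns

model = make_model(hidden=(100, 50))     # 3-layer network from the sketch above
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

for epoch in range(5):                   # real training would loop over minibatches
    opt.zero_grad()
    loss = loss_fn(model(X), y)
    loss.backward()
    opt.step()
```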


Exhibit 20

Simulated Strategy Returns

Source: Sparkline

The left panel shows the validation period. The more complex the model, the better the performance. The right panel shows the test period. The 3-layer model does the best out of sample, especially over the past couple of years, including the ongoing COVID-19 crisis.


The next exhibit summarizes the results using Sharpe Ratio (i.e., signal-to-noise ratio). The chart looks as if it were taken straight out of a machine learning textbook!


Exhibit 21

Sharpe Ratio and Model Complexity

Source: Sparkline

Sharpe Ratio is lower in the test period than the validation period. This is expected as the validation period has the benefit of hindsight and alphas should naturally decay as they are discovered in the latter period.


Optimal complexity in the test period is achieved at 3 layers. This implies that linear regression is too simple. It does not capture the full intricacies of the data. On the other hand, the 5-layer neural network is too complex. It overfits the data so badly that, despite an incredible backtest, it performs only a bit better than linear regression out of sample.


Our optimal model produced a Sharpe Ratio of 1.6. This is a meaningful improvement over linear regression, which delivered a Sharpe Ratio of 1.0. We can conclude there is room for improvement moving beyond the “simple” tier of model complexity but venturing too far into the “complex” zone leads to overfitting.


Shallow Deep Learning

Our optimal model has 3 layers and 105,501 parameters. This is a lot more than linear regression, with its measly 1 layer and 1,001 parameters. However, it pales in comparison to the deep learning architectures used on unstructured data. For example, here is ResNet-50, a popular computer vision model with 50 layers and 25 million parameters.


Exhibit 22

ResNet-50


We added our optimal model to the chart of modern NLP models from the prior section. We overfit our data at just 100,000 parameters. Yet this is 1,000 times smaller than ELMo and over 1M times smaller than GPT-3.


Exhibit 23

NLP🚀++

Source: Sparkline (Adapted from HuggingFace)


One might argue that our results are specific to our dataset and model setup. Of course, we could further optimize the hyperparameters and architecture. We could also go to daily frequency data and further expand the number of signals. However, this would not qualitatively change our conclusion.


Deep learning models can offer an improvement over linear regression. However, due to inherent limitations in financial data, the models start overfitting quickly, even with simple architectures. The whole point of deep learning models is that they are deep -- consisting of dozens of layers and millions of parameters. Being forced to resort to "shallow deep learning" means sacrificing most of the benefit of these models.


Explainability

In addition, deep learning is not without its tradeoffs. One significant weakness of deep learning models is that they are "black boxes". Unlike linear regression, they offer no intuitive interpretation of their coefficients. With great power comes great opacity. 🕷


Fortunately, this is an active branch of AI research. We will utilize a simple technique called a “global surrogate”. The idea is to train an interpretable model (in our case, linear regression) to predict the predictions of the deep learning model. To be clear, we are not trying to predict the market, only the output of the deep learning model.
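A minimal sketch of the surrogate procedure is below, with random data and an untrained network (make_model from the earlier sketch) standing in for the real signal panel and fitted model.

```python
# Fit a linear regression to the deep model's own predictions and measure
# how much of their variance it explains.
import numpy as np
import torch
from sklearn.linear_model import LinearRegression

X = np.random.randn(5_000, 1_000).astype("float32")  # placeholder signal panel
deep_model = make_model(hidden=(100, 50)).eval()     # stand-in for the trained network

with torch.no_grad():
    deep_preds = deep_model(torch.from_numpy(X)).numpy().ravel()

surrogate = LinearRegression().fit(X, deep_preds)
r_squared = surrogate.score(X, deep_preds)  # share of the deep model's variance explained
betas = surrogate.coef_                     # interpretable per-signal weights (Exhibit 24)
```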


The main advantage of the surrogate model is that its regression coefficients are interpretable. Weights (i.e., betas) range from -2.5% to +2.5%. We spot-checked a few standard quant factors to ensure they lined up with intuition. Value, momentum, reversal, quality and size work as expected. Phew!


Exhibit 24

Deep Learning Surrogate Coefficients

Source: Sparkline

One side benefit of this approach is that we can evaluate the R-squared, or the percent of variance explained by the linear regression. If the deep learning model were completely linear -- which might happen if the underlying features were truly linear -- the surrogate would capture 100% of its variance. The less variance explained, the more nonlinearities and interactions the deep learning model is picking up.


Exhibit 25

Financial Factors Are Mostly Linear

Source: Sparkline

Exhibit 25 shows the variance explained for each of our models. The 1-layer model is linear regression, so the surrogate explains 100% of the variance. As we add layers, the model begins finding interesting nonlinearities and interactions in the data. The R-squared falls gradually to 62% as we increase model complexity up to 5 layers.


We found the 5-layer model overfits, so let’s focus instead on the optimal 3-layer model. The linear surrogate captures 70% of the deep learning model’s variance, while 30% can be explained only by nonlinearities and interactions.


This 70/30 split is quite interesting. It implies that our data are mostly linear. While complex models can add value, the gains are limited. Furthermore, there are significant drawbacks to utilizing deep learning, including opacity, complexity and cost. There are plenty of machine learning algorithms occupying the "medium" complexity region between linear regression and deep learning that might be worth considering first.


Conclusion

Deep learning is extremely powerful but requires very large datasets to be effective. Traditional structured financial data is too small and too linear to truly benefit from deep learning. While "shallow deep learning" can be useful, researchers may be better served by first considering simpler techniques.


On the other hand, deep learning is highly effective on unstructured data. Transfer learning provides the key to unlocking its potential in niche domains such as investing. Transfer learning enables us to leverage the creations of large technology companies without having to gather the data or train the models ourselves.


Unstructured data is a critical input to the investment process. However, its unmitigated growth presents a significant challenge for the industry. Fortunately, the advances in natural language processing presented here can greatly improve how we consume this data. Given that these innovations are less than a few years old, we believe there is opportunity for entrepreneurial individuals and firms to profit from the impending transformation.

 

Disclaimer

This paper is solely for informational purposes and is not an offer or solicitation for the purchase or sale of any security, nor is it to be construed as legal or tax advice. References to securities and strategies are for illustrative purposes only and do not constitute buy or sell recommendations. The information in this report should not be used as the basis for any investment decisions.


We make no representation or warranty as to the accuracy or completeness of the information contained in this report, including third-party data sources. The views expressed are as of the publication date and subject to change at any time.


Hypothetical performance has many significant limitations and no representation is being made that such performance is achievable in the future. Past performance is no guarantee of future performance.

