Investment Management in the Machine Learning Age
The investment management industry is still in the process of figuring out how to incorporate recent advances in machine learning. We highlight three areas where machine learning can add value: unstructured data, data mining, and risk models. More importantly, we present detailed case studies for each topic. Our goal is to present practical insights without the buzzwords and jargon that have hamstrung the adoption of machine learning in our industry.
Machine learning, powered by advances in computing and data collection, has swept the world, transforming every facet of the economy. Computers can drive cars, comprehend speech, translate Chinese to German, and play computer games. And the pace of innovation does not seem to be slowing. As investors, we are rightly interested in whether our own jobs will be affected in the same way as the companies we cover.
Unfortunately, there is a lot of hype and misinformation. Some argue that it is only a matter of time before all capital allocation is performed by machines. Automation anxiety has led many firms to invest heavily in speculative AI research with no clear objective. Others espouse some sort of finance exceptionalism (we are doing god’s work, after all). These folks argue that markets are noisy, behavioral, and immune to nerds with cat classification algorithms.
But if we step back from the fray, machine learning has in reality only had limited penetration in the asset management industry. While quantitative managers are slowly gaining traction, the overwhelming majority of actively-managed assets are still run by discretionary investors. Furthermore, most quantitative managers still rely primarily on classical statistical techniques.
Ultimately, machine learning will have a transformative impact on our industry. However, financial markets do present unique challenges that prevent the wholesale adoption of techniques that have achieved success in other fields. Applying these new technologies to our field requires both machine learning expertise and deep domain-specific knowledge.
This paper outlines three areas in which our industry will be positively impacted by machine learning: unstructured data, data mining, and risk models.
In addition, it will provide real-world case studies in each area. Our hope is that by providing transparency into practical machine learning applications, we can redirect the conversation away from one that has been clouded by hype and hearsay.
Part 1 Unstructured Data
The Drunkard’s Search
There is a famous parable of a policeman who finds a drunk man searching for his keys under a streetlamp. After helping him look for several minutes, the cop asks the drunk man if he is sure he lost his keys there. The man replies that he actually lost his keys in the park. Puzzled, the policeman asks, “why then are you looking under the streetlamp?” To which the drunk man states, “well, that is where there is light!”
The lesson, of course, is that we tend to spend our time searching the region that is measurable, even if there is nothing to be found. This is an apt description of the current state of quantitative investing. Due to its availability, quants have tended to focus almost exclusively on structured data, which is the data you would find in an Excel spreadsheet or SQL database.
Examples of Structured Data in Finance
However, with so much capital chasing so few datasets, there isn’t much alpha left. Not surprisingly, we have witnessed standard quantitative factors, such as value and momentum, struggle in recent years.
How to Make a Quant 😢
Source: Sparkline, Ken French
Enter Unstructured Data
On the other hand, fundamental investors build insight by reading 10-Ks, dialing into earnings calls, and tracking their companies in the news. These are examples of unstructured data. However, quants have historically found it challenging to incorporate unstructured data into their processes.
Examples of Unstructured Data in Finance
In fact, it is hard to overemphasize the sheer volume of unstructured data. A common estimate is that 80% of data today is unstructured. Furthermore, unstructured data is being created at a faster rate than structured data, meaning that this ratio will only become more extreme over time. We risk being drowned in an overflowing ocean of incomprehensible information.
Unstructured Data Is Eating the World
Source: IDC, IBM
Fortunately, this explosion of unstructured data coincides with significant progress in machine learning research and computing power. These innovations are allowing us to finally harness unstructured data. For example, we can use object detection algorithms to count the number of cars in satellite images of parking lots, a classic application of channel checking. In another example, we can use natural language processing (NLP) techniques to nearly instantaneously classify the sentiment of breaking events in the news and social media.
Although machine learning cannot yet rival human-level understanding in most areas, we have witnessed many exciting advances in the past few years. But we need not even wait as we are already seeing some successful applications of these data to investing, in particular given computers’ natural advantages in speed and scale.
Incorporating unstructured data will have a great impact on the investment process. As investors, our goal is to explain the economy, but structured data provides only a very incomplete picture. Ultimately, we have no choice but to rely increasingly on machines to help us process the exponentially growing volume of data.
Of course, this is not to understate the challenge. It took decades for the literature to “discover” the value, growth and momentum factors. Finding alpha in the vast wilderness of unstructured data requires a disciplined research process, abundant computing power, and actual financial market intuition.
Case Study 1: NLP on 10-K Filings
We will now explore some text mining techniques. There are many interesting ways to extract meaning from textual data, but we will focus on two methods in this paper: topic modeling and word vectors.
We will use the business section of SEC 10-K filings, which contains detailed descriptions of companies’ products and services, as our sandbox. The data are freely available online via EDGAR, so we encourage you to take a look yourself.
The starting point for processing a textual document is called the bag-of-words method, which involves counting the occurrences of words appearing in each document to produce a matrix (i.e., document-term matrix).
Each document can be encoded as a row vector, with values representing the number of times each word appears in the document. Similarly, each word in the dictionary is represented by a column vector showing its count in each document. The intuition is that the frequency of certain terms (e.g., IPO, bankruptcy) conveys meaning, but that order doesn’t matter.
One limitation of the bag-of-words approach is that there are as many columns as words in your vocabulary. We can combine words with the same roots through processes called stemming or lemmatization (e.g., “investing”, “invested” and “invests” become the same word) and remove stopwords, which are extremely common words with little standalone meaning (e.g., “the”, “and”, “you”). However, even after this process, the 10-K filings still contain over 100,000 different words.
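The steps described so far (counting words, merging common roots, dropping stopwords) can be sketched in a few lines. This is a minimal illustration only: the stopword list, the suffix-stripping `crude_stem` helper, and the two sample sentences are assumptions made for the example, and a real pipeline would use a proper stemmer or lemmatizer.

```python
from collections import Counter

# Tiny illustrative stopword list (an assumption; real lists are much longer).
STOPWORDS = {"the", "and", "you", "a", "in", "of", "its", "for"}

def crude_stem(word):
    """Strip a few common suffixes. A stand-in for a real stemmer or lemmatizer."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def tokenize(text):
    """Lowercase, strip punctuation, drop stopwords, then stem."""
    words = (w.strip(".,;:()\"'").lower() for w in text.split())
    return [crude_stem(w) for w in words if w and w not in STOPWORDS]

def document_term_matrix(docs):
    """Return (vocabulary, matrix) where matrix[i][j] counts term j in document i."""
    counts = [Counter(tokenize(d)) for d in docs]
    vocab = sorted(set().union(*counts))
    matrix = [[c[term] for term in vocab] for c in counts]
    return vocab, matrix

# Made-up sentences standing in for 10-K text.
docs = [
    "The fund invests in cloud computing and invested in software.",
    "The company filed for bankruptcy after its IPO.",
]
vocab, dtm = document_term_matrix(docs)
for row in dtm:
    print(dict(zip(vocab, row)))
```

Each row of `dtm` is a document vector and each column is a word vector, matching the row and column interpretation above.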
We can address this problem using a technique called topic modeling, which is an unsupervised learning technique that reduces dimensionality by clustering similar words into abstract groupings. Even though this is a purely statistical process, these groupings can be interpreted as “topics”.
The procedure involves decomposing the document-term matrix into two matrices: a document-topic matrix and a topic-term matrix. The number of topics is predefined by the researcher (we will use 100 topics).
Topic Model Decomposition
This decomposition can be accomplished using standard matrix factorization techniques (e.g., SVD or NMF), but other more specialized algorithms exist such as LSA and LDA. We will spare you the technical details.
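For readers who do want a peek at the mechanics, here is a minimal sketch of the decomposition using a truncated SVD in NumPy. The tiny count matrix, the term labels, and the choice of two topics are all made up for illustration; the paper's own model uses 100 topics and does not state which algorithm produced its exhibits.

```python
import numpy as np

# Hypothetical document-term counts: 4 documents x 6 terms.
terms = ["drug", "clinical", "trial", "oil", "drilling", "pipeline"]
X = np.array([
    [5, 4, 3, 0, 0, 0],
    [4, 5, 2, 0, 0, 1],
    [0, 0, 0, 6, 4, 3],
    [0, 1, 0, 5, 5, 2],
], dtype=float)

n_topics = 2  # predefined by the researcher

# Truncated SVD: keep only the n_topics largest singular values.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
doc_topic = U[:, :n_topics] * s[:n_topics]   # documents x topics
topic_term = Vt[:n_topics, :]                # topics x terms

# The product of the two smaller matrices approximates the original.
X_approx = doc_topic @ topic_term
error = np.linalg.norm(X - X_approx)

for k in range(n_topics):
    top = np.argsort(-np.abs(topic_term[k]))[:3]
    print(f"topic {k}:", [terms[j] for j in top])
```

SVD loadings can be negative, which makes them harder to read as word mixtures. NMF constrains both matrices to be nonnegative, which is one reason it is often preferred when interpretability matters.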
Although this is a purely statistical process, the topics it produces can be quite intuitive. In the case of the 10-K business section, the topics arrange themselves by industry.
Top Topics in the Topic-Term Matrix
Source: Sparkline, SEC
Now that we have our topics, we can use the document-topic matrix to determine the mixture of topics in each document. For example, the companies’ filings that most heavily favor the “pharma” topic are unsurprisingly Acorda Therapeutics, Biogen, Alnylam Pharmaceuticals, Epizyme, and Trevena.
In 2013, Google researchers published another method that performs a task similar to topic modeling. Word embeddings map each word to an N-dimensional vector, where N is predefined (as in topic modeling). If we again use 100 dimensions, we have a matrix that looks similar to the Topic-Term Matrix from before.
Word Embedding Matrix
However, rather than being constructed using matrix factorization, the word embedding matrix is trained using a neural network. The original word2vec model has two architectures. Continuous bag-of-words (CBOW) attempts to predict each word using its neighbors while skip-grams does the reverse (predicts neighboring words from a given word). Facebook’s fastText and Stanford’s GloVe are two other popular ways to train word embeddings.
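To make the two architectures concrete, here is a minimal sketch of how training examples could be generated from a sentence. The function names, the window size, and the sample sentence are assumptions for illustration; the actual word2vec implementation adds refinements such as negative sampling that are omitted here.

```python
def cbow_examples(tokens, window=2):
    """(context words, target word) pairs: predict the word from its neighbors."""
    examples = []
    for i, target in enumerate(tokens):
        context = tokens[max(0, i - window): i] + tokens[i + 1: i + 1 + window]
        examples.append((context, target))
    return examples

def skipgram_examples(tokens, window=2):
    """(input word, neighbor) pairs: predict each neighbor from the word."""
    return [(target, neighbor)
            for context, target in cbow_examples(tokens, window)
            for neighbor in context]

# Made-up sentence standing in for 10-K text.
tokens = ["we", "provide", "cloud", "computing", "services"]

cbow = cbow_examples(tokens)
skipgram = skipgram_examples(tokens)
print(cbow[2])       # context and target for "cloud"
print(skipgram[:4])  # first few (word, neighbor) pairs
```

In both cases the neural network is only a means to an end: the embedding matrix is the set of weights learned while solving the prediction task.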
The idea is to train the model so that words that appear in similar contexts are nearby in vector space. Thus, if the word “cloud” is often found next to “computing” in the 10-Ks, the geometric distance between these word vectors will be small. Conversely, if “cloud” is rarely found near “hamburger”, their vector distance will be great.
Although we can download pre-trained embeddings, they do not do a very good job for our domain-specific task. Instead, we train our own word embeddings using the CBOW algorithm over 10-Ks from 2010 to 2014. Now, we can do fun stuff with our vectors. For example, if we take the vector for “CFO”, add “executive” and subtract “financial”, the best match is “CEO”.
Fun with Word Vectors
CFO + Executive - Financial = _____
Source: Sparkline, EDGAR
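The arithmetic behind this kind of exhibit is just vector addition followed by a nearest-neighbor search. The sketch below uses hand-made three-dimensional vectors that were invented so the example works; they are not the trained 100-dimensional embeddings, and the `analogy` helper is a hypothetical name.

```python
import math

# Hand-made 3-dimensional vectors, invented purely for illustration.
vectors = {
    "cfo":       [1.0, 1.0, 0.0],
    "executive": [1.0, 0.0, 0.2],
    "financial": [0.0, 1.0, 0.0],
    "ceo":       [2.0, 0.0, 0.3],
    "analyst":   [0.2, 0.8, 0.1],
    "hamburger": [0.0, 0.0, 1.0],
}

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def analogy(positive, negative, vectors):
    """Return the word closest to sum(positive) - sum(negative), excluding inputs."""
    dims = len(next(iter(vectors.values())))
    query = [0.0] * dims
    for word in positive:
        query = [q + v for q, v in zip(query, vectors[word])]
    for word in negative:
        query = [q - v for q, v in zip(query, vectors[word])]
    candidates = [w for w in vectors if w not in positive and w not in negative]
    return max(candidates, key=lambda w: cosine(query, vectors[w]))

best = analogy(["cfo", "executive"], ["financial"], vectors)
print(best)
```

Excluding the input words from the candidate set is a common convention, since the query vector usually sits very close to its own ingredients.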
If we squeeze down the 100-dimensional vector to two dimensions (using a machine learning algorithm called t-SNE), we can visualize the relationship among words. We arbitrarily chose seven words to seed the exhibit. “Software”, “petroleum”, and “China” form their own distinct clusters with similar words. Words related to “inflation” and “unemployment” group together as they are both related to the macroeconomy and words associated with corporate actions such as “IPO” and “acquisition” are found to be similar.
10-K Word Embeddings
Source: Sparkline, EDGAR
There are many other applications of word embeddings. For example, as with topic modeling, we can interpret documents as collections of word vectors. We can also use word embeddings as inputs into other supervised learning models, including deep learning models.
Part 2 Data Mining
In finance, the term “data mining” often has a negative connotation, as it implies repetitively searching the dataset until positive results (usually false positives) are found. However, in other domains, data mining simply refers to the process of uncovering patterns in large datasets without any implication of overfitting.
Linear models are central to the quantitative investment research process. Almost all articles published in finance journals use linear regressions (e.g., Fama-MacBeth) to empirically validate their findings. For example, the Fama-French model (1993) attempts to explain cross-sectional stock returns as a linear combination of three factors. Furthermore, portfolio construction is commonly accomplished using mean-variance optimization, another linear model.
Linear models, such as linear regression and mean-variance optimization, form the bedrock of the classical econometric toolkit. Linear models have many wonderful properties but also some serious drawbacks.
Nonlinearity and Interaction
By definition, linear models cannot capture nonlinear features of the data. This is a real problem as many financial factors are fundamentally nonlinear. For example, financial leverage can be fine in moderation but dangerous in the extreme. The next exhibit shows how linear regression fails to accurately model this relationship. We will soon introduce a machine learning model called random forest, which does a better job fitting this nonlinear relationship.
Linear models also struggle to incorporate interactions among variables. Again, this is a meaningful drawback as many financial variables are interconnected. To continue with our example, extreme financial leverage is dangerous, but should be considered differently for companies in the banking sector.
While we can in theory ameliorate these limitations through feature engineering (e.g., creating transformed features by manually defining thresholds and interaction terms), this can be time-consuming, subjective and prone to overfitting.
Modeling a Nonlinear Relationship
Dimensionality and Collinearity
In the era of modern computing, we often want to run analyses involving features (i.e., alphas, factors) numbering from 1,000 to 100,000,000. However, standard linear regression and optimization techniques scale poorly. From a technical standpoint, these methods rely on matrix inversion, which becomes unstable if there are more variables than observations (e.g., 100,000 features but only 1,000 assets).
Furthermore, highly correlated variables (e.g., 9M and 12M momentum) can destabilize these models. This is the phenomenon known as collinearity. For example, optimizers tend to put on extremely levered spread trades between highly correlated assets with only slightly different return forecasts.
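Both problems show up directly in the matrix that regression has to invert. The sketch below, using simulated data with arbitrary sizes and noise levels (all assumptions for illustration), checks the rank of X'X when features outnumber observations and compares condition numbers with and without a nearly duplicated feature.

```python
import numpy as np

rng = np.random.default_rng(0)

# Case 1: more features than observations, so X'X cannot have full rank.
n_obs, n_features = 50, 200
X_wide = rng.standard_normal((n_obs, n_features))
rank = np.linalg.matrix_rank(X_wide.T @ X_wide)   # at most n_obs

# Case 2: two nearly identical features (think 9M vs. 12M momentum).
base = rng.standard_normal(1000)
X_indep = np.column_stack([base, rng.standard_normal(1000)])
X_collinear = np.column_stack([base, base + 0.01 * rng.standard_normal(1000)])

# A large condition number means small data changes can swing the solution.
cond_indep = np.linalg.cond(X_indep.T @ X_indep)
cond_collinear = np.linalg.cond(X_collinear.T @ X_collinear)

print("rank of X'X:", rank, "out of", n_features)
print("condition number, independent features:", round(cond_indep, 1))
print("condition number, collinear features:  ", round(cond_collinear, 1))
```

A rank-deficient X'X has no inverse at all, while an ill-conditioned one has an inverse that amplifies noise, which is the mechanism behind the levered spread trades mentioned above.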
These problems can be solved by using feature selection (e.g., stepwise regression, principal component analysis) to preselect a subset of features, but again this can be subjective and lead to overfitting.
Machine Learning Models
Despite what readers of finance journals might be led to believe, linear regression is not the only data mining technique. The toolkit is much broader. Machine learning models exist for use cases ranging from anomaly detection to clustering to data visualization. Even within the relatively narrow task of regression, several alternatives exist, such as tree-based and neural network models.
These models can offer several advantages. First, many machine learning techniques can capture nonlinearity and interactions and are more robust to dimensionality and collinearity. As mentioned, there is a huge advantage to not having to rely on heavy feature engineering and selection.
Second, machine learning research has produced several well-studied regularization and validation techniques designed to explicitly address the risk of overfitting. These techniques penalize model complexity and help find models that are robust to perturbations in the data. Given the noise in financial markets, these techniques are extremely useful.
Finally, given that the quantitative community is still extremely focused on linear models, there is a significant advantage to be gained by using machine learning models. Even if the standalone results were no better, looking at markets in a different way should help produce uncorrelated returns and reduce the risk of getting trapped in overcrowded trades.
Case Study 2: Tree-Based Models
Decision trees are trained by funneling observations through a series of binary splits. At each branch, the algorithm selects the variable and split value that most reduces model error (e.g., mean squared error). While in theory we can grow trees until only one observation is left in each leaf, in practice we stop far earlier to avoid overfitting.
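The split search at a single branch can be sketched as follows. The data are made up (a hypothetical leverage-like variable whose payoff drops sharply past a threshold), and `best_split` is an illustrative helper rather than any library's API. For comparison, the sketch also fits a straight line to the same data.

```python
import numpy as np

def best_split(x, y):
    """Return (threshold, mse) of the single split on x that minimizes MSE."""
    best_threshold, best_mse = None, np.mean((y - y.mean()) ** 2)
    values = np.unique(x)
    # Candidate thresholds are midpoints between consecutive distinct values.
    for lo, hi in zip(values[:-1], values[1:]):
        threshold = (lo + hi) / 2
        left, right = y[x <= threshold], y[x > threshold]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        mse = sse / len(y)
        if mse < best_mse:
            best_threshold, best_mse = threshold, mse
    return best_threshold, best_mse

# Hypothetical data: the outcome is flat, then drops once x exceeds 8.
x = np.arange(0, 11)
y = np.where(x > 8, -1.0, 0.0)

threshold, split_mse = best_split(x, y)

# A straight line cannot reproduce the step, so its error stays positive.
slope, intercept = np.polyfit(x, y, 1)
linear_mse = np.mean((y - (slope * x + intercept)) ** 2)

print("best threshold:", threshold)
print("split mse:", split_mse, "linear mse:", round(linear_mse, 4))
```

A full tree repeats this search recursively inside each resulting bucket, across all available features, until a stopping rule such as maximum depth is reached.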
The diagrams below show how the process works. In each case, we start with 1,000 “samples” and continually subdivide the sample into finer groups at each binary split so that mean squared error (“mse”) declines. In this case, the “value” is the average Z-scored future return of all the samples in each bucket. Once the model is trained, we can predict a new observation by filtering it through the tree until it reaches a terminal node, at which point its forecast is the value for that group.
Trees and Nonlinear Features
Trees are effective at capturing nonlinear features. Let’s take the example above of the Altman Z-score, a well-known company default risk metric. As with leverage ratio, Altman Z-score is a nonlinear indicator. Companies with scores below 1.8 are bankruptcy risks, while those with scores above 3.0 are considered safe. However, the Z-score doesn’t provide meaningful information for companies with scores between these thresholds (the “gray zone”).
The next example shows how trees can also capture interaction effects. As discussed earlier, leverage ratio (debt/equity) should be dealt with differently for the banking sector. In this example, the model predicts companies will underperform if their Z-scored debt/equity ratio exceeds 2, but only if the industry is not banking.
Trees and Interaction Effects
Decision trees offer a flexible tool for fitting complex relationships. Unlike linear regression, tree models are nonparametric, meaning we are not required to specify the functional form (e.g., linear, exponential, logarithmic) upfront. Instead, the model learns the functional form from the data itself.
Single decision trees are actually weak predictors, as they only capture a single path. Therefore, we will turn to random forests, which are collections of many trees.
Random forests average across hundreds of individual decision trees, ironing out noise and producing a more generalizable model. The underlying principle is identical to the familiar law of large numbers, by which quant investors combine hundreds of weak signals to produce a high Sharpe Ratio portfolio. In machine learning terminology, this is called ensembling.
Ensembling is more powerful if the individual trees are uncorrelated with each other. This is analogous to how a portfolio of stocks, bonds and gold is more diversified than one consisting solely of large-cap US technology stocks.
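A standard result makes this precise. Assume, as a simplification that the text does not itself specify, that each of the $N$ trees produces a prediction $f_i$ with the same variance $\sigma^2$ and that every pair of trees has the same correlation $\rho$. Then the variance of the ensemble average is:

```latex
\operatorname{Var}\!\left(\frac{1}{N}\sum_{i=1}^{N} f_i\right)
  = \frac{1}{N^2}\left[N\sigma^2 + N(N-1)\rho\sigma^2\right]
  = \rho\,\sigma^2 + \frac{1-\rho}{N}\,\sigma^2
```

The second term shrinks as more trees are added, but the first does not. Lowering $\rho$ is therefore the only way to keep improving once the ensemble is large.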
We can reduce the correlation across trees using subsampling. This involves randomly choosing a subset of the observations and features available for training each tree, helping to ensure that all the trees don’t rely on the same few data points and features (subsampling observations with replacement is often called "bagging").
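A minimal sketch of the sampling step for a single tree is shown below. The function name `subsample`, the feature fraction, and the dataset dimensions are assumptions for illustration; libraries expose these choices under their own parameter names.

```python
import numpy as np

def subsample(n_obs, n_features, feature_fraction, rng):
    """Pick the rows and columns that one tree is allowed to see."""
    # Observations are drawn with replacement (bagging).
    rows = rng.integers(0, n_obs, size=n_obs)
    # Features are drawn without replacement.
    k = max(1, int(feature_fraction * n_features))
    cols = np.sort(rng.choice(n_features, size=k, replace=False))
    return rows, cols

rng = np.random.default_rng(0)
rows, cols = subsample(n_obs=1000, n_features=20, feature_fraction=0.5, rng=rng)

distinct = len(np.unique(rows)) / 1000
print("features used by this tree:", cols)
print("share of distinct observations:", round(distinct, 3))
```

Because sampling is done with replacement, each tree sees only about 63% of the distinct observations on average, so no two trees are trained on quite the same data.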
We can take this idea even further using boosting. Boosting models are similar to random forests in that they are ensembles of decision trees. The key difference is that boosting grows trees sequentially instead of in parallel. At each step, the model upweights errors from previous rounds, further emphasizing the selection of features that are uncorrelated with those that were selected beforehand.
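The sequential idea can be sketched with depth-one trees (stumps). Note that this sketch uses gradient boosting with squared error, in which each new stump is fit to the residuals left by earlier rounds, rather than explicit reweighting of observations; the text does not say which boosting variant it uses. The data, learning rate, and number of rounds are assumptions for illustration.

```python
import numpy as np

def fit_stump(X, residual):
    """Find the (feature, threshold) split that best fits the residual by SSE."""
    best = None
    for j in range(X.shape[1]):
        values = np.unique(X[:, j])
        for threshold in (values[:-1] + values[1:]) / 2:
            mask = X[:, j] <= threshold
            left, right = residual[mask].mean(), residual[~mask].mean()
            sse = (((residual[mask] - left) ** 2).sum()
                   + ((residual[~mask] - right) ** 2).sum())
            if best is None or sse < best[0]:
                best = (sse, j, threshold, left, right)
    return best[1:]

def predict_stump(stump, X):
    j, threshold, left, right = stump
    return np.where(X[:, j] <= threshold, left, right)

def boost(X, y, n_rounds=10, learning_rate=0.5):
    """Grow stumps one after another, each correcting the previous errors."""
    prediction = np.full(len(y), y.mean())
    stumps = []
    for _ in range(n_rounds):
        stump = fit_stump(X, y - prediction)   # fit what earlier rounds missed
        prediction = prediction + learning_rate * predict_stump(stump, X)
        stumps.append(stump)
    return stumps, prediction

# Simulated data with a threshold effect and a nonlinear effect.
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 3))
y = np.where(X[:, 0] > 0.5, 1.0, 0.0) + 0.5 * X[:, 1] ** 2

stumps, fitted = boost(X, y)
mse_start = np.mean((y - y.mean()) ** 2)
mse_end = np.mean((y - fitted) ** 2)
print("mse before boosting:", round(mse_start, 4))
print("mse after boosting: ", round(mse_end, 4))
```

Because each round targets only what remains unexplained, a feature that duplicates one already used has little left to contribute.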
We’ll demonstrate how this technique works on a small set of traditional quant factors. The boosting algorithm has many hyperparameters (parameters that control the learning process), which are generally tuned via cross-validation. In this case, we will make the important choice to constrain the max depth of the tree to 3 layers. Like subsampling, this is a regularization technique that helps prevent overfitting.
One of the most powerful tools for analyzing tree-based ensembles is feature importance, which allows us to understand the relative contribution of each feature. Below, we compute feature importance using mean decrease in accuracy (MDA), which involves randomly shuffling the values in one column at a time and measuring the resulting decrease in model accuracy.
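The shuffling procedure is model-agnostic, so it can be sketched with any fitted predictor. The example below uses a plain least-squares model on simulated data as a stand-in for the boosted trees, and measures accuracy loss as the increase in mean squared error; both choices, along with the helper name `permutation_importance`, are assumptions for illustration.

```python
import numpy as np

def permutation_importance(predict, X, y, rng, n_repeats=10):
    """Average increase in MSE when each column is shuffled in turn."""
    base_mse = np.mean((y - predict(X)) ** 2)
    importances = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        for _ in range(n_repeats):
            X_shuffled = X.copy()
            # Shuffling breaks the link between this feature and the target.
            X_shuffled[:, j] = rng.permutation(X[:, j])
            importances[j] += np.mean((y - predict(X_shuffled)) ** 2) - base_mse
    return importances / n_repeats

# Simulated data: feature 0 matters a lot, feature 1 a little, feature 2 not at all.
rng = np.random.default_rng(0)
X = rng.standard_normal((500, 3))
y = 2.0 * X[:, 0] + 0.5 * X[:, 1] + 0.1 * rng.standard_normal(500)

coef, *_ = np.linalg.lstsq(X, y, rcond=None)
importance = permutation_importance(lambda A: A @ coef, X, y, rng)
print(np.round(importance, 3))
```

In practice the importance is best measured on held-out data, so that a feature the model has merely memorized does not look useful.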
Source: Sparkline, S&P Global
We find that the previous month’s return is the most important for the model, while inventory turnover is the least important.
As mentioned, one key drawback of linear regression is that it struggles with dimensionality and collinearity. Let’s see how boosting performs under these conditions. First, let’s add 10 factors that are simply random noise. The model does a fairly good job ignoring them.
Source: Sparkline, S&P Global
We can test collinearity by adding 10 replicas of ret_1m, so that we now have 11 factors that are 100% correlated to each other.
Source: Sparkline, S&P Global
The model is very effective at ignoring the redundant factors. Note that this is a unique feature of the boosting model, as once it has already selected a factor, it is uninterested in redundant factors as they don’t help correct previous errors. Random forest would instead split the importance roughly evenly across the 11 identical features. While this does not significantly affect performance, it makes interpretation more challenging.
Let’s confirm that the addition of noisy and redundant features does not materially affect the performance of the model. After training the model from 1995 to 2010, we can evaluate its simulated performance in the validation period from 2011 to 2015. This simple backtest does not take into account transaction costs or realistic portfolio constraints. The goal is not to highlight the models’ absolute performance but instead the relative returns across the four models.
Backtest with Noisy and Redundant Features
Source: Sparkline, S&P Global
The simulation shows that even adding 1,000 noisy and 1,000 redundant features to the model doesn’t significantly affect its performance. This is because the redundant features are quickly ignored, and even if some random noise creeps into the model, it is not enough to materially deteriorate the results.
Robustness to noisy and redundant features is a powerful property of boosting. There is currently a Ioannidisean movement budding in the financial literature claiming that most published quantitative factors are false positives. If this is true, boosting can provide a compelling way to mitigate the damage.
Part 3 Risk Models
Risk modeling is an important component of the portfolio construction process. Well-constructed risk models enable investors to more efficiently allocate their risk budgets. They help us form more diversified portfolios by avoiding excessive concentration on single industries and factors and are fundamental to return attribution.
The simplest risk models are purely statistical, relying solely on historical return data. We first estimate the covariance matrix of historical returns. Next, we use principal component analysis (PCA) to decompose the covariance matrix into a set of eigenvectors. These eigenvectors are portfolios of assets that are designed to be uncorrelated with each other. One way to think of them is as the statistical “risk factors” driving historical returns. We can then explain asset returns based on their exposure to these statistical risk factors.
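These steps can be sketched in NumPy as follows. The returns are simulated (one common driver plus asset-specific noise), and the universe size and sample length are arbitrary assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_periods, n_assets = 500, 5

# Simulated returns: one common "market" driver plus asset-specific noise.
market = rng.standard_normal((n_periods, 1))
noise = rng.standard_normal((n_periods, n_assets))
returns = 0.01 * (market @ np.ones((1, n_assets)) + 0.5 * noise)

# Step 1: estimate the covariance matrix of historical returns.
cov = np.cov(returns, rowvar=False)                 # n_assets x n_assets

# Step 2: decompose it into eigenvectors (eigh returns ascending eigenvalues).
eigenvalues, eigenvectors = np.linalg.eigh(cov)
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

# Step 3: each eigenvector is a portfolio; its returns are a statistical factor.
factor_returns = returns @ eigenvectors
factor_cov = np.cov(factor_returns, rowvar=False)   # diagonal up to rounding

explained = eigenvalues / eigenvalues.sum()
print("share of variance by factor:", np.round(explained, 3))
```

In this simulation the first factor loads roughly equally on every asset and explains most of the variance, which is the statistical counterpart of a market factor.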
In practice, this approach has a few significant problems. First, estimating the covariance matrix requires significant computational resources. For example, a universe of 3,000 stocks requires estimating over 4.5 million covariances and variances. Second, PCA relies on the covariance matrix being invertible. However, this will only be the case if the number of observation periods exceeds the number of assets. Thus, a 3,000 stock universe requires roughly 12 years of daily returns. In practice, even if the matrix is invertible, we need significantly more data for the estimates to be stable out of sample.
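Both figures in this paragraph can be checked with a little arithmetic. A symmetric covariance matrix for $N$ assets has $N(N+1)/2$ distinct entries, and a sample covariance matrix needs at least as many observation periods $T$ as assets to have full rank. Assuming roughly 252 trading days per year (our assumption, not a figure from the text):

```latex
\frac{N(N+1)}{2} = \frac{3000 \times 3001}{2} = 4{,}501{,}500
\qquad\text{and}\qquad
T \ge N = 3000 \;\Rightarrow\; \frac{3000}{252} \approx 11.9 \text{ years}
```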
Factor risk models, the current industry standard, address these issues. Factor risk models are motivated by two objectives: reduce dimensionality and provide more interpretable attribution. Thus, rather than use PCA to come up with statistical risk factors, factor risk models begin with a predefined, smaller set of fundamental risk factors. For example, BARRA’s USE4 equity model has 60 industry factors and 12 style factors (e.g., size, value, momentum). Using 72 rather than 3,000 factors leads to a more stable model. In addition, fundamental factors are more intuitive and interpretable.
Factor risk models decompose a stock’s return into two components: factor return and specific return. The factor return is the common component that depends on the stock’s exposure to its risk factors, while the specific return is idiosyncratic to the stock and assumed to be uncorrelated with the specific returns of all other stocks.
Factor Risk Model Decomposition
The key to factor risk models is the exposure matrix, which is essentially a transformation from stocks to factors. Each row of the matrix contains a stock’s exposure to each of the risk factors. For example, Apple has an exposure of 1 to the tech industry factor and 0 to all other industry factors, along with positive exposures to the size and growth style factors.
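In matrix form, the standard factor model writes the stock covariance matrix as X F Xᵀ + D, where X is the exposure matrix, F is the factor covariance matrix, and D is a diagonal matrix of specific variances. The sketch below builds this for four hypothetical stocks and three factors; every number is invented for illustration.

```python
import numpy as np

# Hypothetical exposures: 4 stocks x 3 factors (tech industry, bank industry, size).
X = np.array([
    [1.0, 0.0,  1.2],
    [1.0, 0.0, -0.8],
    [0.0, 1.0,  0.9],
    [0.0, 1.0, -0.3],
])
# Made-up factor covariance matrix (3 x 3) and specific variances.
F = np.array([
    [0.040, 0.010, 0.002],
    [0.010, 0.030, 0.001],
    [0.002, 0.001, 0.010],
])
specific_var = np.array([0.020, 0.050, 0.015, 0.030])

# Stock-level covariance implied by the factor model.
cov = X @ F @ X.T + np.diag(specific_var)

# Portfolio risk splits cleanly into factor and specific components.
w = np.full(4, 0.25)
factor_exposure = X.T @ w
factor_var = factor_exposure @ F @ factor_exposure
specific = w @ (specific_var * w)
total_var = w @ cov @ w

# Parameters to estimate: full covariance vs. a 72-factor model on 3,000 stocks.
n_full = 3000 * 3001 // 2
n_factor = 72 * 73 // 2 + 3000
print("factor share of variance:", round(factor_var / total_var, 3))
print("parameters:", n_full, "vs", n_factor)
```

The split of `total_var` into `factor_var` and `specific` is what makes return and risk attribution possible, and the parameter count shows why the factor approach is so much more stable to estimate.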
Case Study 3: Company Embeddings
The factor risk model framework has been in place since the 1970s. Over time, the framework was gradually improved and more factors were added. However, the factors considered are still all traditional (e.g., value, momentum, growth).
It would be interesting to incorporate alternative data into factor risk models. However, doing so is non-trivial, as it requires that these unstructured data somehow be converted into a matrix form such that they can replace or augment the exposure matrix.
To do this, we will introduce the concept of “company embeddings”. As far as we can tell, this is the first time the idea has been written about (at least with such a 🔥 name). Company embeddings apply the concept of word embeddings to companies instead of words.
Recall that word embeddings map a vocabulary of thousands of words to a lower-dimensional vector space. Not only does this reduce dimensionality, but it also helps us capture the notion of word similarity. Word embeddings are trained as the byproduct of a supervised learning problem using neural networks. This technique is not specific to words. In fact, it can be applied to any categorical variable, such as companies.
Let’s build intuition by starting with the example of GICS sectors. We represent sector membership using a 10-dimensional vector, where values are either 1 to indicate membership or 0 for non-membership. This is called one-hot-encoding. Bag-of-words, the basic NLP model that word2vec improves upon, also represents words as one-hot-encoded vectors.
We can stack these vectors in rows to form the matrix labeled “one-hot-encoding”, which can now serve as the exposure matrix in a factor risk model. Note this is also the format we would use to add GICS membership as dummy variables in a linear regression.
GICS and Company Embeddings
One-hot-encoding has a couple major drawbacks. First, it requires as many columns as categories. Although this example only uses 10 GICS sectors, variables in the wild often have much greater cardinality. At even 1,000 columns, one-hot-encoding is quite inefficient from an information density and computational standpoint.
Second, one-hot-encoding views similarity in a crude, binary way. For example, Apple, Microsoft and Groupon are considered equally distant from each other as they are all technology companies. However, financial market intuition tells us that Apple and Microsoft, which are large-cap tech companies, should be closer to each other than to Groupon. Conversely, Apple and McDonald’s are equally far from Tesla because neither is in the consumer discretionary sector. However, in reality the high-tech Tesla has many more similarities to Apple than McDonald’s.
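The contrast can be seen by computing similarities under both encodings. In the sketch below, the sector assignments follow the examples in the text, while the three-dimensional dense vectors are invented for illustration and are not trained embeddings.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# One-hot encoding: one column per sector, a single 1 per company.
sectors = ["tech", "consumer_discretionary", "financials"]
company_sector = {
    "Apple": "tech",
    "Microsoft": "tech",
    "Groupon": "tech",
    "Tesla": "consumer_discretionary",
}
one_hot = {name: [1.0 if s == sector else 0.0 for s in sectors]
           for name, sector in company_sector.items()}

# Dense vectors: made-up numbers standing in for trained company embeddings.
embedding = {
    "Apple":     [0.90,  0.80, 0.10],
    "Microsoft": [0.85,  0.90, 0.05],
    "Groupon":   [0.60, -0.20, 0.50],
    "Tesla":     [0.70,  0.30, 0.60],
}

for a, b in [("Apple", "Microsoft"), ("Apple", "Groupon"), ("Apple", "Tesla")]:
    print(a, b,
          "one-hot:", cosine(one_hot[a], one_hot[b]),
          "embedding:", round(cosine(embedding[a], embedding[b]), 3))
```

Under one-hot encoding every pair is either identical or unrelated, whereas dense vectors allow for degrees of similarity both within and across sectors.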
Interestingly, these problems of dimensionality and similarity are identical to those we saw earlier with the bag-of-words model. In the latter case, we found that embeddings offered an elegant and effective solution. Thus, it is logical that they would also help in the current context.
The matrix on the right labeled “company embedding” represents the word vector solution applied to companies. If trained correctly, company embeddings allow us to reduce dimensionality and capture a more continuous notion of company similarity.
Embeddings are trained as a byproduct of a supervised learning problem. For example, word2vec trains embeddings by looking at how often words are mentioned together. How we choose to train our company embeddings will depend on the information we want to capture.
The power of company embeddings lies in their flexibility. They can be trained on almost any type of structured or unstructured data, but will always produce a matrix that is a usable input into a factor risk model. There is a vast array of potential data sources we can use to train our embeddings, but for now we’ll return to our initial example.
If you recall, the 10-K business section contains a detailed description of a company’s main products and services. We expect similar companies to describe their businesses in similar ways, resulting in company vectors that are close to each other in embedding space. We train a neural network to generate 100-dimensional company embeddings using 10-Ks from 2010 to 2014.
The results are quite intuitive. McDonald’s lies in a cluster with other restaurant chains, while Goldman Sachs sits with other banks and financials. Apple and Microsoft overlap heavily with each other and other technology companies. Tesla’s cluster includes both tech companies as well as traditional auto companies such as GM and Ford.
Thus, we find that the textual information in the 10-K business section allows us to classify companies without having to resort to subjective industry classification schemes. In addition, it provides a more continuous view of the economy that can automatically update through time as companies and the economy evolve.
10-K Company Embeddings
Source: Sparkline, EDGAR
One challenge with alternative data is that many new datasets will likely turn out to explain risk but not return. In other words, they may capture sources of market variance but have no alpha. However, if we can use techniques such as company embeddings to put them to work in risk models, they will still be able to add value to the investment process.
In general, the technique of using alternative data to train company embeddings that can serve as inputs into risk models is an interesting line of further exploration.
The investment management industry is still in the process of figuring out how to take advantage of recent advances in machine learning. Ultimately, it is extremely likely that the industry will end up in a better place as a result of these technologies. Machine learning will both help streamline existing processes and increase market efficiency as we are able to better incorporate alternative sources of information into prices.
We’ve outlined three areas where we believe machine learning has transformative potential: unstructured data, data mining, and risk models.
In addition, we’ve provided real-world examples of how machine learning can help in each area. Of course, there are many other potential research angles. Our hope is that by providing a few practical use cases for machine learning, we can demonstrate that machine learning is at least not all hype.
Of course, reshaping the investment industry will be a long process. In any period of flux, there will be both exciting breakthroughs and frustrating dead ends. Innovation truly moves in fits and starts. But transitional periods offer the greatest rewards for those who are able to lead the way. And at the very least it will be an interesting and exciting journey.
The author would like to thank Jonathan Siegel for his helpful comments.
This paper is solely for informational purposes and is not an offer or solicitation for the purchase or sale of any security, nor is it to be construed as legal or tax advice. References to securities and strategies are for illustrative purposes only and do not constitute buy or sell recommendations. The information in this report should not be used as the basis for any investment decisions.
We make no representation or warranty as to the accuracy or completeness of the information contained in this report, including third-party data sources. The views expressed are as of the publication date and subject to change at any time.
Hypothetical performance has many significant limitations and no representation is being made that such performance is achievable in the future. Past performance is no guarantee of future performance.