Since the financial crisis, forward guidance has become a key part of the monetary policy toolkit. Constrained by the effective lower bound, central banks globally have increasingly relied on public communications to guide policy expectations and influence financial markets. As a result, central bank statements, meeting minutes and even individual members’ speeches have become important data for monetary policy watchers to analyse. For example, market participants this year have been closely following what FOMC members are saying, looking for hints of a hawkish turn such as upcoming tapering.
However, most of the time deciphering central bank communications is more like an art than a science, with interpretations subject to human biases. Is there a way to systematically extract information from central bank communications and objectively quantify any policy signal? In our view, this task is possible with the help of natural language processing (NLP) – a sub-branch of artificial intelligence (AI). Using the example of Federal Reserve communications, we illustrate in this blog how we were able to use NLP to track the Fed’s opinions on different topics and construct the Algebris Dove-o-Meter – a quantitative indicator of the Fed’s dovishness/hawkishness over time. As we show in our analysis, the Fed has stayed dovish so far in the face of the post-pandemic recovery, including at the latest March FOMC meeting.
What Is NLP?
Yoav Goldberg, a prominent NLP researcher featured in the IEEE AI Top 10 to Watch list in 2018, defines NLP as a “collective term referring to automatic computational processing of human languages. This includes both algorithms that take human-produced text [and/or speech] as input, and algorithms that produce natural looking text [and/or speech] as outputs.” (2017)
While these algorithms used to exist within the confines of academia, they are now embedded within products and services used by people every day all around the world – some relatable examples include search engines (e.g., Google, Baidu) and personal digital assistants (e.g., Alexa, Siri).
But what exactly are these algorithms? NLP test benchmarks, which are used to measure the performance of NLP algorithms, can provide an illustrative answer. The Natural Language Decathlon (DecaNLP), one well-known test benchmark, breaks down what NLP algorithms do into ten tasks: “question answering, machine translation, summarisation, natural language inference, sentiment analysis, semantic role labeling, relation extraction, goal-oriented dialogue, semantic parsing, and commonsense pronoun resolution.” For example, the task of question answering involves picking out a sequence of words from a given paragraph to answer a given question. The Stanford Question Answering Dataset (SQuAD) serves as a gold-standard dataset for this task, and it includes paragraphs from the English Wikipedia, as well as questions and associated answers that can be found in those paragraphs.
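To make the question-answering task concrete, here is a minimal sketch using the open-source Hugging Face transformers library with a publicly available SQuAD-tuned checkpoint; the model name, question and paragraph are purely illustrative choices on our part, not components of the system described later in this blog.

```python
# A minimal question-answering sketch: the model picks an answer span out of the
# given paragraph, in the spirit of the SQuAD task described above.
from transformers import pipeline

qa = pipeline("question-answering", model="distilbert-base-cased-distilled-squad")

paragraph = (
    "The Federal Open Market Committee decided today to keep its target range "
    "for the federal funds rate at 0 to 1/4 percent."
)
result = qa(question="What is the target range for the federal funds rate?", context=paragraph)

print(result["answer"], result["score"])  # an answer span from the paragraph, plus a confidence score
```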
Viewed through this lens, each end-user product or service, such as a search engine or a personal digital assistant, can be seen as an agglomeration of NLP algorithms solving different tasks which, when put together in a well-engineered system and exposed through a well-designed user interface, seems to exhibit very real (albeit still artificial) intelligence.
Today’s “intelligent” algorithms, however, are a recent phenomenon and stand in stark contrast to older NLP algorithms. From the 1956 Dartmouth Conference (often considered the official birth of AI) until the early 2000s – a good 50 years or so – most NLP algorithms were designed in a “top-down” deductive manner: a system of linguistic rules was first theorised, and language as it appeared in real life was supposed to be a manifestation of these rules. A major shift happened in the 1990s, according to the late NLP research pioneer Karen Spärck Jones, with the rise of statistical approaches, which have come to dominate modern-day NLP, especially in the last 5 to 10 years. Unlike the previous paradigm of “top-down” deduction, many state-of-the-art NLP algorithms rely on “bottom-up” induction, i.e., inferring linguistic patterns (from syntax to semantics) from humongous language corpora such as the entire English Wikipedia or the public web (e.g., the Common Crawl dataset used to train GPT-3 constituted nearly a trillion words).
Indeed, such an approach is conceptually elegant. Instead of trying to force-fit real-life language data into some linguistic framework conjectured by humans – impossible to do comprehensively given the volume, velocity, and variety of modern-day language corpora and the malleability of language itself – why not let the data drive the linguistic framework? Beyond elegance, the proof is in the results: statistical NLP approaches have dominated most NLP test benchmarks, including the previously mentioned DecaNLP as well as other well-established multi-task benchmarks like SuperGLUE.
Given these recent and exciting NLP advancements, we were keen to apply some of the latest NLP algorithms to smaller, more focused language corpora such as news articles, social media text, speech transcripts, and official government publications – in this analysis, Fed communications.
Fedspeak: What Could We Learn Using NLP?
The ultimate aim of our analysis is to be able to quantify the policy stance embedded in any piece of Fed communication. We broke down this goal into two main NLP tasks: extracting relevant text passages with policy implications, and quantifying the sentiment expressed in those passages.
We started by building a curated text dataset consisting of the Fed’s official statements from 1994 to 2021. To fulfil the two tasks mentioned above, we designed a system that could retrieve relevant text passages from the policy statements given some input queries/phrases. Based on whether the input phrases are hawkish or dovish, we were able to give a score to the retrieved passages to quantify their policy tilt. As a concrete example, an input phrase such as “strong economy” should retrieve the text passage “economic activity has continued to strengthen” from the 27 Jan 2010 statement, and the passage would be labelled as having a hawkish tilt. To better understand the Fed’s views across different issues, we grouped our input queries under four topics: economic growth, inflation, labour market and monetary policy. For a given policy statement, once all the relevant passages were retrieved, we aggregated the individual passage scores for each topic by taking the spread of hawkish scores over dovish scores, and then calculated a final score by averaging the topic scores, as sketched in the example below. This final score provides a quantitative measure of the overall hawkishness/dovishness of the policy statement.
[infogram id=”675267ef-40a7-40b5-ae07-2f62bd5cf17f” prefix=”D2x” format=”interactive” title=”Fed Table”]
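As an illustrative sketch of the aggregation step just described (the retrieved passages, labels and scores below are hypothetical placeholders, not output from our actual system), the topic and final scores could be computed along these lines:

```python
# Hypothetical example of aggregating passage scores into topic scores and a final
# statement score: topic score = spread of hawkish over dovish scores,
# final score = average across topics.
from collections import defaultdict

# Each retrieved passage: (topic, tilt, score) -- placeholder values for illustration.
retrieved = [
    ("economic growth", "hawkish", 0.8),
    ("economic growth", "dovish", 0.3),
    ("inflation", "dovish", 0.7),
    ("labour market", "hawkish", 0.5),
    ("monetary policy", "dovish", 0.9),
]

by_topic = defaultdict(lambda: {"hawkish": 0.0, "dovish": 0.0})
for topic, tilt, score in retrieved:
    by_topic[topic][tilt] += score

# Spread of hawkish scores over dovish scores within each topic.
topic_scores = {topic: s["hawkish"] - s["dovish"] for topic, s in by_topic.items()}

# Final statement score: average of the topic scores (positive = hawkish tilt).
final_score = sum(topic_scores.values()) / len(topic_scores)
print(topic_scores, round(final_score, 3))
```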
As can be seen below, our final indicator tracks the economic cycles in the US quite well and shows the Fed’s drastic dovish shifts at the onset of each of the past three recessions (2001, 2008, 2020).
To better understand whether the Fed is being too hawkish or too dovish relative to economic fundamentals, we also ran a simple regression of our indicator on a list of macro variables and got a decent fit (adjusted R² of 0.52). Based on historical relationships, our model suggests that current economic fundamentals imply the Fed should already be more hawkish. However, the Fed stayed dovish in the latest March FOMC meeting, walking back some small hawkish shifts in inter-meeting communications by Chair Powell and FOMC member Brainard. This highlights the unusual policy dilemma faced by the Fed: a sharp economic rebound from a pandemic-induced recession, thanks to unprecedented levels of monetary and fiscal stimulus.
[infogram id=”f5aec1a3-e574-4d7c-86cd-39fcf7bed0aa” prefix=”FFz” format=”interactive” title=”Fed”]
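For readers curious about the mechanics, a regression of this kind takes only a few lines of Python; the file name and macro variables below are placeholders rather than our actual specification.

```python
# Sketch of regressing the hawkish/dovish indicator on macro fundamentals.
# "fed_indicator_and_macro.csv" and the column names are hypothetical placeholders.
import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("fed_indicator_and_macro.csv", parse_dates=["date"], index_col="date")

X = sm.add_constant(df[["core_inflation", "unemployment_rate", "gdp_growth"]])
y = df["dove_o_meter"]

model = sm.OLS(y, X, missing="drop").fit()
print(model.rsquared_adj)  # goodness of fit, analogous to the 0.52 quoted above

# The gap between the actual indicator and the fitted values shows whether the Fed
# is more hawkish or dovish than fundamentals alone would suggest.
residual = model.resid
```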
How Exactly Do Our NLP Models Work?
In more technical detail, we relied on an ensemble of three NLP models for passage retrieval and scoring – each given a nickname based on its role:
1. Scout: This model is good at retrieving passages that are topically relevant to the input phrase (i.e., it is a good “scout”). However, it does not work so well for determining the polarity of the retrieved passages (e.g., positive/negative, hawkish/dovish), and thus is insufficient on its own. The base model used is a RoBERTa model (Liu et al, 2019) that had been distilled (Sanh et al, 2020) and modified to “derive semantically meaningful sentence embeddings that can be compared using cosine-similarity” at fast speeds (Reimers and Gurevych, 2019). The base model had already been trained on the MSMARCO Passage Ranking dataset, which consists of 500k real queries from Bing search. We further fine-tuned it using labeled data from the policy statement corpus. (A rough sketch of this retrieval step, together with the re-scoring step described next, follows this list.)
2. Sniper: This model is extremely precise (i.e., it is a good “sniper”), which can sometimes be a double-edged sword because it may miss certain passages that are also somewhat relevant. However, it pairs well with the Scout model, helping to re-score all the passages the Scout model first retrieves. The base model used is an ELECTRA model (Clark et al, 2020), which was also trained on the MSMARCO Passage Ranking dataset. Like the Scout model, we further fine-tuned this base model using labeled data from the policy statement corpus.
3. Sweeper: This model is a sentiment analysis model that was trained on financial text. It can determine whether a passage is positive/neutral/negative, and its role is to “double confirm” the signals generated by the other two models where applicable, i.e., it helps to “sweep up loose ends”. The base model used is FinBERT (Araci, 2019). Since the input phrases we used were not all definitively positive or negative (e.g., “lowering interest rate” is neither), we selectively applied this model to phrases that were more clear-cut in polarity (e.g., “strong economy”). Unlike the other two models, we did not fine-tune this one.
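To give a flavour of how the Scout and Sniper steps fit together, here is a rough sketch using the open-source sentence-transformers library. The checkpoints named below are public MSMARCO-trained base models standing in for the fine-tuned models described above, and the passages are hypothetical examples.

```python
# Sketch of bi-encoder retrieval (Scout-style) followed by cross-encoder re-scoring
# (Sniper-style). Model names are public stand-ins, not our fine-tuned versions.
from sentence_transformers import SentenceTransformer, CrossEncoder, util

passages = [
    "Economic activity has continued to strengthen.",
    "The Committee decided to keep the target range for the federal funds rate at 0 to 1/4 percent.",
    "Inflation has declined in recent months.",
]
query = "strong economy"

# Scout: embed the query and passages, retrieve the closest passages by cosine similarity.
scout = SentenceTransformer("msmarco-distilroberta-base-v2")
hits = util.semantic_search(
    scout.encode(query, convert_to_tensor=True),
    scout.encode(passages, convert_to_tensor=True),
    top_k=2,
)[0]

# Sniper: re-score each retrieved (query, passage) pair with an MSMARCO cross-encoder
# (a MiniLM checkpoint here, standing in for the ELECTRA model described above).
sniper = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
pairs = [(query, passages[h["corpus_id"]]) for h in hits]
for (q, passage), score in zip(pairs, sniper.predict(pairs)):
    print(passage, float(score))
```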
As shown in the pipeline and example below, we passed each policy statement through all three models to retrieve the relevant passages and calculated the score for each passage as a weighted sum of the three model scores. We gave a 70% weight to the Sniper model score due to the model’s high precision, while the remaining 30% was shared equally among the other applicable model(s).
[infogram id=”3959ef55-309f-4f18-9bef-fe897b03e22d” prefix=”ful” format=”interactive” title=”Fed NLP Model”]
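As a back-of-the-envelope illustration of the weighting scheme (the individual model scores here are made up), the per-passage score combines the three models like this:

```python
# Hypothetical illustration of the ensemble weighting: 70% to the Sniper score,
# with the remaining 30% split equally among whichever other models apply.
def passage_score(sniper_score, scout_score, sweeper_score=None):
    others = [scout_score] if sweeper_score is None else [scout_score, sweeper_score]
    return 0.7 * sniper_score + sum((0.3 / len(others)) * s for s in others)

# All three models applicable (clear-cut phrase such as "strong economy"):
print(passage_score(sniper_score=0.9, scout_score=0.6, sweeper_score=0.8))  # 0.7*0.9 + 0.15*0.6 + 0.15*0.8 = 0.84
# Only Scout and Sniper (Sweeper skipped for ambiguous phrases):
print(passage_score(sniper_score=0.9, scout_score=0.6))  # 0.7*0.9 + 0.3*0.6 = 0.81
```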
Looking Forward
Our framework and application of NLP techniques provide us with an alternative tool for monitoring monetary policy in the US. Going forward, we could easily expand our analysis to other major central banks like the European Central Bank and the Bank of England for further insights. Most importantly, this example further confirms the relevance of NLP techniques to financial research, as we have demonstrated in past use cases such as sentiment tracking for last year’s US elections. With continued breakthroughs in NLP and the general field of AI, we look to keep exploring suitable applications of these tools to tackle problems in finance.