Using ChatGPT to Generate NLP-Driven Investment Strategies

The financial world thrives on timely insights, accurate analysis, and forward-looking strategies. Over the years, natural language processing (NLP) has emerged as a valuable tool for interpreting vast amounts of financial text, helping investors and analysts make informed decisions. From basic sentiment lexicons to advanced large language models (LLMs) like BERT and FinBERT, the field has made significant progress. However, domain-specific challenges in financial news analysis persist.

We homed in on a popular LLM, ChatGPT, to analyze Bloomberg Market Wrap news using a two-step method that first extracts global market headlines and then scores their sentiment. We converted the resulting sentiment score into an investment strategy and assessed its performance on the NASDAQ index. Our findings are promising, indicating potential for forecasting NASDAQ returns and for designing investable strategies.

This post outlines a two-step sentiment extraction process from financial summaries, a method for converting sentiment into actionable allocations, and an evaluation demonstrating outperformance against a passive investment strategy.

After a short review of related work, we detail our prompt engineering approach, describe the conversion to investment strategies, and present evaluation results.

An in-depth analysis of our study is available on SSRN: “Sentiment Score of Bloomberg Market Wraps with ChatGPT.”

Other Resources

Recent research has highlighted ChatGPT’s applications in finance and economics. Hansen and Kazinnik [8] showed its utility in interpreting Federal Reserve communications, and Lopez-Lira and Tang [16] demonstrated effective prompting for stock predictions. Cowen and Tabarrok [3] and Korinek [13] explored its use in economics education, while Noy and Zhang [20] focused on productivity benefits.

Yang and Menczer [31] examined its credibility assessments for news, though Xie et al. [30] noted that its numerical predictions align with linear regression, and Ko and Lee [12] faced challenges in portfolio selection.

Our study extends this literature by using a multi-step ChatGPT approach to predict NASDAQ trends, reducing noise and enhancing accuracy.

Prompt Engineering

The first step in prompt engineering is data collection. We collected daily summaries from Bloomberg Global Markets, known as Market Wraps, from 2010 to October 2023. We excluded summaries with fewer than 1200 characters or those that did not mention at least two of the following market types: equities, fixed income, foreign exchange, commodities, or credit. In addition, we included only summaries that had widespread online distribution to ensure significant public impact. This process yielded a dataset of over 70,000 articles, each averaging 1000 words and approximately 6000 characters.
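The length and market-coverage filters described above can be sketched in a few lines of Python. This is a minimal illustration, not our production pipeline: the keyword lists and the names `MARKET_KEYWORDS`, `markets_mentioned`, and `keep_summary` are assumptions made for the example, and the online-distribution filter is not covered.

```python
# Hypothetical keyword lists: crude substring matching, for illustration only.
MARKET_KEYWORDS = {
    "equities": ("equity", "equities", "stock"),
    "fixed income": ("bond", "treasur", "yield"),
    "foreign exchange": ("currency", "currencies", "dollar", "forex"),
    "commodities": ("commodit", "oil", "gold"),
    "credit": ("credit",),
}


def markets_mentioned(text):
    """Return the set of market types whose keywords appear in the text."""
    lowered = text.lower()
    return {
        market
        for market, keywords in MARKET_KEYWORDS.items()
        if any(keyword in lowered for keyword in keywords)
    }


def keep_summary(text, min_chars=1200, min_markets=2):
    """Keep a summary only if it is long enough and covers enough markets."""
    return len(text) >= min_chars and len(markets_mentioned(text)) >= min_markets


sample = "Stocks rallied while Treasury yields fell. " * 40
print(keep_summary(sample))          # long enough, mentions equities and fixed income
print(keep_summary("Stocks rose."))  # too short and only one market type
```

Substring matching of this kind produces false positives (for example, "oil" inside an unrelated word), so a real filter would need a more careful tagger.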

Naïve Approach

Initially, we prompted ChatGPT to provide a sentiment score directly from the full text of each summary.

This direct approach, similar in spirit to Romanko et al. [25] and Kim et al. [11], proved disappointing: its correlations with major stock indexes such as the NASDAQ and the S&P 500 were close to zero, most likely because of random model hallucinations.

Shift to Two-Step Approach

We then opted to decompose the instructions into simpler and more straightforward tasks. Following the recommendations in [16], we devised two prompts to refine the objectives for ChatGPT, focusing on tasks empirically shown to align well with ChatGPT’s capabilities. Our first prompt asked ChatGPT to summarize the text into titles or headlines.

Our second prompt consisted of determining a sentiment score on each headline.
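The two-step flow can be sketched as follows. The prompt wording here is hypothetical and is not the wording used in our study; `llm` stands for any callable that sends a prompt to a chat model and returns its text reply, and `fake_llm` is a stand-in so the example runs without network access.

```python
from typing import Callable, List

# Hypothetical prompt wording, for illustration only.
HEADLINE_PROMPT = (
    "Summarize the following market news as short headlines, one per line:\n\n{text}"
)
SENTIMENT_PROMPT = (
    "Is this headline positive, negative, or neutral for global equities? "
    "Answer with one word.\n\n{headline}"
)


def extract_headlines(text: str, llm: Callable[[str], str]) -> List[str]:
    """Step 1: ask the model for headlines and split the reply into lines."""
    reply = llm(HEADLINE_PROMPT.format(text=text))
    cleaned = (line.strip("-• \t") for line in reply.splitlines())
    return [line for line in cleaned if line]


def classify_headline(headline: str, llm: Callable[[str], str]) -> int:
    """Step 2: map the model's one-word answer to +1, -1, or 0."""
    reply = llm(SENTIMENT_PROMPT.format(headline=headline)).strip().lower()
    if reply.startswith("positive"):
        return 1
    if reply.startswith("negative"):
        return -1
    return 0


def two_step_labels(text: str, llm: Callable[[str], str]) -> List[int]:
    """Run both steps and return one sentiment label per headline."""
    return [classify_headline(h, llm) for h in extract_headlines(text, llm)]


def fake_llm(prompt: str) -> str:
    """Canned replies standing in for a real model call."""
    if prompt.startswith("Summarize"):
        return "- Stocks rally on earnings\n- Oil slumps on demand fears\n- Dollar steady"
    if "rally" in prompt:
        return "Positive"
    if "slumps" in prompt:
        return "Negative"
    return "Neutral"


print(two_step_labels("(market wrap text)", fake_llm))  # [1, -1, 0]
```

Passing the model as a callable keeps the parsing logic testable and independent of any particular client library.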

For both prompts, we used the gpt-3.5-turbo version of ChatGPT. The idea behind the two-step approach is to simplify ChatGPT’s task: first leverage its strong summarization ability, then ask it to find the tone or sentiment of each headline. We can now devise an enhanced and more pertinent “Global Equities Sentiment Indicator” as follows:

Definition 1. Daily Sentiment Score: Let h_i denote the ith headline extracted from the daily news, and let there be two consistent scoring functions: a positive one, p(h_i), which returns 1 if h_i is positive and 0 otherwise, and a negative one, n(h_i), which returns 1 if h_i is negative and 0 otherwise.

The sentiment score S for a day with N headlines is given by:

\[ S = \frac{\sum_{i=1}^{N} p(h_i) - \sum_{i=1}^{N} n(h_i)}{\sum_{i=1}^{N} p(h_i) + \sum_{i=1}^{N} n(h_i)} \]

The sentiment score S measures the relative dominance of positive versus negative sentiments in a day’s headlines. It satisfies several simple properties that are straightforward to prove.

Proposition 1. The sentiment score S satisfies some canonical properties:

Boundedness: S is bounded as −1 ≤ S ≤ 1.

Symmetry: If sentiments of all headlines are reversed, then S changes its sign.

Neutrality: S=0 if there are equal numbers of positive and negative headlines.

Monotonicity: S increases as the difference between positive and negative headlines increases.

Scale Invariance: S remains the same if we multiply the number of both positive and negative headlines by a constant.

Additivity: The combined S for two sets of headlines is the weighted average of the individual S values.
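These properties are easy to check numerically. The sketch below assumes the score is the number of positive headlines minus the number of negative ones, divided by their sum, and it adopts the convention (an assumption of this example) that a day with no positive or negative headline scores 0.

```python
def sentiment_score(labels):
    """Daily score from headline labels: +1 positive, -1 negative, 0 neutral."""
    pos = sum(1 for x in labels if x > 0)
    neg = sum(1 for x in labels if x < 0)
    if pos + neg == 0:
        return 0.0  # assumed convention when no headline is positive or negative
    return (pos - neg) / (pos + neg)


day = [1, 1, 1, -1, 0]            # three positive, one negative, one neutral
s = sentiment_score(day)          # (3 - 1) / (3 + 1) = 0.5

# Symmetry: reversing every sentiment flips the sign.
flipped = sentiment_score([-x for x in day])

# Scale invariance: tripling every headline leaves the score unchanged.
tripled = sentiment_score(day * 3)

# Additivity: the combined score is the average of the two scores,
# weighted by each set's count of positive and negative headlines.
a, b = [1, 1, -1], [1, -1, -1, -1]
combined = sentiment_score(a + b)
weighted = (3 * sentiment_score(a) + 4 * sentiment_score(b)) / 7

print(s, flipped, tripled)
print(abs(combined - weighted) < 1e-12)
```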

Figure 1 shows the raw signal and highlights how noisy it is. Used directly, the raw daily sentiment score gives noisy, hard-to-interpret results. To address this, we propose a cumulated sentiment score over a specified period. This score aggregates news sentiment over that period, offering a more comprehensive measure of the news impact.

Figure 1. Raw Signal: It Exhibits Significant Noise.

Definition 2. Cumulated Sentiment Score: We define a monthly (d = 20) cumulated score as follows. Given:

h_{i,t} as the ith headline on day t.

p(h_{i,t}) and n(h_{i,t}) as functions returning 1 for positive and negative sentiments of h_{i,t} respectively, 0 otherwise.

d as the duration (we use d = 20 business days, approximating a month).

The cumulated sentiment score S_d over period d is:

\[ S_d(t) = \frac{\sum_{\tau=t-d+1}^{t} \sum_{i} \left[ p(h_{i,\tau}) - n(h_{i,\tau}) \right]}{\sum_{\tau=t-d+1}^{t} \sum_{i} \left[ p(h_{i,\tau}) + n(h_{i,\tau}) \right]} \]

Figure 2. Cumulative Sentiment Score.

The mathematical properties listed above (boundedness, symmetry, neutrality, monotonicity, and scale invariance) still hold for the cumulated sentiment score. Figure 2 illustrates how cumulation reduces the noise in the signal.
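A rolling version of the score can be sketched with NumPy. The inputs `pos_counts` and `neg_counts` (daily counts of positive and negative headlines) are hypothetical names, the window is assumed to end on the current day, and a window with no positive or negative headline is assumed to score 0.

```python
import numpy as np


def cumulated_score(pos_counts, neg_counts, d=20):
    """Rolling sentiment score over the last d days (NaN until d days exist)."""
    pos = np.asarray(pos_counts, dtype=float)
    neg = np.asarray(neg_counts, dtype=float)
    out = np.full(pos.shape, np.nan)
    for t in range(d - 1, len(pos)):
        p = pos[t - d + 1 : t + 1].sum()   # positive headlines in the window
        n = neg[t - d + 1 : t + 1].sum()   # negative headlines in the window
        out[t] = (p - n) / (p + n) if p + n > 0 else 0.0
    return out


# Small example with a 2-day window instead of 20 days.
print(cumulated_score([3, 1, 2, 0], [1, 1, 0, 2], d=2))
```

Pooling the counts before taking the ratio, rather than averaging daily ratios, keeps days with many headlines from being weighted the same as days with few.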

Converting to an Investment Strategy

Removing noise is key. Given the cumulated sentiment score (see Definition 2), it is crucial to detrend this score to identify more actionable trading signals. We detrend it by taking the difference between the cumulated sentiment score and its average over a period d, which we again take to be a month.

Definition 3. Detrended Cumulated Sentiment Score: The detrended cumulated sentiment score DS(t) is the cumulated sentiment score minus its average over d periods:

\[ DS(t) = S_d(t) - \frac{1}{d} \sum_{k=0}^{d-1} S_d(t-k) \]

Splitting into long and short

From the de-trended score, we can derive two types of trading positions:

Long Position = max(DS(t), 0)

Short Position = min(DS(t), 0)
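The detrending and the long/short split can be sketched as below. The function names are illustrative, and the averaging window is assumed to cover the last d values including the current one.

```python
import numpy as np


def detrend(score, d=20):
    """Score minus its own average over the last d values (NaN until d exist)."""
    s = np.asarray(score, dtype=float)
    out = np.full(s.shape, np.nan)
    for t in range(d - 1, len(s)):
        out[t] = s[t] - s[t - d + 1 : t + 1].mean()
    return out


def long_short(ds):
    """Split the detrended score into long (>= 0) and short (<= 0) positions."""
    ds = np.asarray(ds, dtype=float)
    return np.maximum(ds, 0.0), np.minimum(ds, 0.0)


# Small example with a 2-day window instead of 20 days.
ds = detrend([0.1, 0.3, 0.2, 0.6], d=2)
long_pos, short_pos = long_short(ds)
print(ds)
print(long_pos)
print(short_pos)
```

The long and short legs sum back to the detrended score, so the combined strategy simply holds DS(t).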

A long (respectively short) position is the purchase (respectively sale) of an asset with the expectation that its value will rise (respectively decline) in the future. Hence, if our detrended score is positive (respectively negative), we take a long (respectively short) position. To backtest our strategy, we use the NASDAQ index, which is well known to be sensitive to overall market sentiment [2]. We calculate the value of the strategy, taking great care to account for transaction costs. We apply a linear transaction cost based on the weight difference between time t and t − 1.

The value of our strategy at time t is therefore given by the cumulated returns diminished by transaction costs, where the linear transaction cost b is taken to be two basis points for NASDAQ futures. It is essential to note the two-day lag in our weightings: for day t, we use the weights computed on t − 2. This lag allows the strategy to be executed the day after the signal is computed and ensures that our backtest does not suffer from data leakage.
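A backtest of this kind can be sketched as follows. This is one reading of the description, under stated assumptions: daily profit is the lagged weight times the index return, the cost is charged on the absolute change in the lagged weight, and daily results are summed rather than compounded. The function name `strategy_value` is illustrative.

```python
import numpy as np


def strategy_value(weights, returns, cost=0.0002, lag=2):
    """Cumulated strategy returns net of linear transaction costs.

    weights: signal computed each day (positive = long, negative = short)
    returns: daily index returns
    cost:    linear cost per unit of weight traded (2 basis points by default)
    lag:     the weight applied on day t is the one computed on day t - lag
    """
    w = np.asarray(weights, dtype=float)
    r = np.asarray(returns, dtype=float)
    w_lag = np.zeros_like(w)
    w_lag[lag:] = w[: len(w) - lag]          # shift weights forward by `lag` days
    w_lag = np.nan_to_num(w_lag)             # no position while the signal is undefined
    turnover = np.abs(np.diff(w_lag, prepend=0.0))
    daily = w_lag * r - cost * turnover      # assumed additive, not compounded
    return np.cumsum(daily)


# Toy example with an exaggerated cost of 10 basis points.
value = strategy_value([1, 1, 0, 0, 0], [0.01, 0.02, -0.01, 0.03, 0.01], cost=0.001)
print(value)
```

In the toy example the first two days earn nothing because no lagged weight exists yet, which is exactly the effect the two-day lag is meant to have.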

Figure 3. Short Strategy with Cumulated Sentiment (Blue) & Detrended Score (Orange).

Results: Descriptive Statistics

To evaluate the performance of our strategy against a benchmark, such as a simple holding of the NASDAQ index, we consider three key financial metrics: the Sharpe, Sortino, and Calmar ratios, described below.

Figure 4. Long Strategy with Cumulated Sentiment (Blue) & Detrended Score (Orange).

Figure 5. Final strategy (long and short) with Cumulated Sentiment (Blue).

Sharpe Ratio: The Sharpe ratio, introduced in [27], evaluates an investment strategy as the ratio of its excess return over the risk-free rate to its volatility. Essentially, it reflects how much additional return an investor receives per unit of increase in risk. A higher ratio suggests that the asset’s returns are better compensated for the risk taken.

Sortino Ratio and Calmar Ratio: The Sortino ratio [28] (respectively Calmar ratio) is a modification of the Sharpe ratio, defined as the excess return divided by the downside deviation (respectively by the maximum drawdown).
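The three ratios can be computed from a series of daily returns as sketched below. Conventions vary across practitioners, so the choices here are assumptions of the example: 252 trading days per year, downside deviation measured against zero, and drawdown measured on additively cumulated returns.

```python
import numpy as np


def performance_ratios(daily_returns, risk_free=0.0, periods=252):
    """Annualized Sharpe, Sortino, and Calmar ratios from daily returns."""
    r = np.asarray(daily_returns, dtype=float) - risk_free / periods
    ann_return = r.mean() * periods
    ann_vol = r.std(ddof=1) * np.sqrt(periods)
    # Downside deviation: root mean square of the negative excess returns.
    downside = np.sqrt(np.mean(np.minimum(r, 0.0) ** 2)) * np.sqrt(periods)
    # Maximum drawdown on the cumulated return curve, starting from zero.
    equity = np.concatenate([[0.0], np.cumsum(r)])
    max_drawdown = (np.maximum.accumulate(equity) - equity).max()
    return {
        "sharpe": ann_return / ann_vol,
        "sortino": ann_return / downside,
        "calmar": ann_return / max_drawdown,
    }


print(performance_ratios([0.01, -0.02, 0.03, -0.01]))
```

A series with no losing day has zero downside deviation and zero drawdown, so the Sortino and Calmar ratios are undefined in that case and a production version would need to guard against it.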

Comparative Analysis of Strategy Performance Metrics

Tables 1 and 2 detail the performance metrics of the strategies. In these tables, the best scores are prominently highlighted in bold for easy identification and comparison. Table 1 reveals that:

The Detrended Cumulated Score strategy (Detrended All) consistently outperforms the baseline across metrics: Sharpe (0.88 vs. 0.79), Sortino (1.06 vs. 1.02), and Calmar (0.52 vs. 0.45). This highlights the Detrended All strategy’s robustness and its Pareto dominance over the baseline on these three ratios.

In stark contrast, the naive cumulated score (Cumulated) strategies considerably underperform the baseline. This is particularly noticeable in the Sharpe ratios of the Cumulated All, Cumulated Long, and Cumulated Short strategies, which are the lowest in Table 1; their Sortino and Calmar ratios are also among the lowest.

Table 2 offers a granular insight into the performance by providing metrics like annual return, annual volatility, and a tail risk measure computed as the annual return divided by the worst 10% quantile DD. Mirroring our previous observations, we observe that:

The Detrended All strategy has the best “Return over Worst 10% DD” ratio: 1.71, compared with the baseline value of 1.03. This suggests that the Detrended All strategy carries lower downside risk per unit of return.

The Cumulated Sentiment Score strategies again seem less promising: Cumulated All has a “Return over Worst 10% DD” ratio of 0.72, further emphasizing the potential problems of a straightforward cumulated score strategy.

The ChatGPT-based strategies have considerably lower volatility than the baseline. This is expected, because they time the market and on average hold a reduced exposure to NASDAQ futures.

Table 1. Investment Statistics.

Strategy                | Sharpe Ratio | Sortino Ratio | Calmar Ratio
Detrended All           | 0.88         | 1.06          | 0.52
Buy and Hold (baseline) | 0.79         | 1.02          | 0.45
Detrended Short         | 0.75         | 0.76          | 0.32
Detrended Long          | 0.56         | 0.48          | 0.27
Cumulated All           | 0.45         | 0.50          | 0.17
Cumulated Short         | 0.45         | 0.27          | 0.21
Cumulated Long          | 0.38         | 0.36          | 0.14

Table 2. Descriptive Statistics.

Strategy                | Annual Return | Annual Vol | Return / Worst 10% DD
Detrended All           | 1.2%          | 1.4%       | 1.71
Buy and Hold (baseline) | 16.1%         | 20.4%      | 1.03
Detrended Short         | 0.6%          | 0.8%       | 1.12
Detrended Long          | 0.6%          | 1.1%       | 0.68
Cumulated All           | 1.9%          | 4.2%       | 0.72
Cumulated Short         | 0.3%          | 0.7%       | 0.28
Cumulated Long          | 1.6%          | 4.1%       | 0.60

Analysis of Weights

Analyzing the weights of ChatGPT-based investment strategies reveals differences in volatility and exposure. Table 3 provides the weights for four strategies: Cumulated Long, Detrended Long, Cumulated Short, and Detrended Short.

Detrended Sentiment weights display lower volatility than Cumulated Sentiment weights. Specifically, Detrended Long and Short weights have a volatility of 3.7%, while Cumulated Long and Short weights record higher volatilities of 4.9% and 11.1%, respectively.

In terms of average exposure:

The average market exposure is similar for both Detrended Long and Cumulated Long, around 2.5%.

In contrast, the Short strategies differ significantly, with Cumulated Short showing a mean exposure of 9.5%, compared to 2.7% for Detrended Short, indicating that detrending reduces short exposure.

The Detrended strategies, especially on the short side, are more controlled in weight distribution. Due to their low volatility, applying a volatility targeting approach could scale these strategies to a total volatility of 5-15%, aligning with investor risk tolerance.
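As a rough illustration of volatility targeting, the leverage needed is the target volatility divided by the realized volatility. The sketch below applies this to the Detrended All figures from Table 2 with a 10% target; the constant-leverage assumption and the function name are ours for this example, and financing costs are ignored.

```python
def vol_target_leverage(realized_vol, target_vol):
    """Constant scaling factor that brings realized volatility to the target."""
    if realized_vol <= 0:
        raise ValueError("realized volatility must be positive")
    return target_vol / realized_vol


# Detrended All: 1.4% annual volatility and 1.2% annual return (Table 2).
leverage = vol_target_leverage(0.014, 0.10)
scaled_return = 0.012 * leverage

print(round(leverage, 2))       # 7.14
print(round(scaled_return, 4))  # 0.0857
```

A constant scaling factor multiplies return and volatility alike, so it changes the level of risk taken without changing the ratio between the two.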

Table 3. Weights Descriptive Statistics

     | Long Detrended | Long Cumulated | Short Detrended | Short Cumulated
mean | 2.6%           | 2.4%           | 2.7%            | 9.5%

Key Takeaways

In this study, we explored ChatGPT’s potential for generating sentiment scores from Bloomberg’s daily finance news summaries. Using zero-shot prompting, we demonstrated the model’s ability to produce predictive sentiment scores without domain-specific fine-tuning.

Our findings are promising, with strong Sharpe, Calmar, and Sortino ratios in an NLP-driven strategy, indicating potential for forecasting NASDAQ returns. Key insights include the importance of using effective prompts; breaking sentiment analysis into summarization and single-sentence sentiment tasks; and reducing data noise through cumulative, detrended scores.

Future work could examine ChatGPT’s applicability in predicting trends across other stock markets, individual stocks, and over different time frames, as well as its integration with alternative data sources like social media.

References

[1] D. W. Arner, J. Barberis, and R. P. Buckley. The evolution of fintech: A new post-crisis paradigm. Geo. J. Int’l L., 47:1271, 2015.

[2] S. R. Baker, N. Bloom, S. J. Davis, and M. C. Sammon. What triggers stock market jumps? Technical report, National Bureau of Economic Research, 2021.

[3] T. Cowen and A. T. Tabarrok. How to Learn and Teach Economics with Large Language Models, Including GPT. SSRN Electronic Journal, 3 2023. ISSN 1556-5068. doi: 10.2139/SSRN.4391863. URL https://papers.ssrn.com/abstract=4391863.

[4] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.

[5] G. Fatouros, G. Makridis, D. Kotios, J. Soldatos, M. Filippakis, and D. Kyriazis. Deepvar: a framework for portfolio risk assessment leveraging probabilistic deep neural networks. Digital finance, 5(1):29–56, 2023.

[6] A. S. George and A. H. George. A review of chatgpt ai’s impact on several business sectors. Partners Universal International Innovation Journal, 1(1):9–23, 2023.

[7] A. Ghaddar and P. Langlais. Sedar: a large scale french-english financial domain parallel corpus. In Proceedings of the Twelfth Language Resources and Evaluation Conference (LREC), pages 3595–3602, 2020. URL http://www.lrec-conf.org/proceedings/lrec2020/index.html.

[8] A. L. Hansen and S. Kazinnik. Can ChatGPT Decipher Fedspeak? SSRN Electronic Journal, 3 2023. ISSN 1556-5068. doi: 10.2139/SSRN.4399406. URL https://papers.ssrn.com/abstract=4399406.

[9] I.-B. Iordache, A. S. Uban, C. Stoean, and L. P. Dinu. Investigating the relationship between romanian financial news and closing prices from the bucharest stock exchange. In Proceedings of the Thirteenth Language Resources and Evaluation Conference (LREC), pages 5130–5136, 2022. URL http://www.lrec-conf.org/proceedings/lrec2022/index.html.

[10] A. Jabbari, O. Sauvage, H. Zeine, and H. Chergui. A french corpus and annotation schema for named entity recognition and relation extraction of financial news. In Proceedings of the Twelfth Language Resources and Evaluation Conference (LREC), pages 2293–2299, 2020. URL http://www.lrec-conf.org/proceedings/lrec2020/index.html.

[11] A. Kim, M. Muhn, and V. Nikolaev. Bloated disclosures: Can chatgpt help investors process financial information? arXiv preprint arXiv:2306.10224, 2023.

[12] H. Ko and J. Lee. Can ChatGPT Improve Investment Decision? From a Portfolio Management Perspective. SSRN Electronic Journal, 2023. doi: 10.2139/SSRN.4390529. URL https://papers.ssrn.com/abstract=4390529.

[13] A. Korinek. Language Models and Cognitive Automation for Economic Research. Cambridge, MA, 2 2023. doi: 10.3386/W30957. URL https://www.nber.org/papers/w30957.

[14] C. Li, W. Ye, and Y. Zhao. Finmath: Injecting a tree-structured solver for question answering over financial reports. In Proceedings of the Thirteenth Language Resources and Evaluation Conference (LREC), pages 6147–6152, 2022. URL http://www.lrec-conf.org/proceedings/lrec2022/index.html.

[15] Z. Liu, D. Huang, K. Huang, Z. Li, and J. Zhao. Finbert: A pre-trained financial language representation model for financial text mining. In Proceedings of the twenty-ninth international conference on international joint conferences on artificial intelligence, pages 4513–4519, 2021.

[16] A. Lopez-Lira and Y. Tang. Can ChatGPT Forecast Stock Price Movements? Return Predictability and Large Language Models. SSRN Electronic Journal, 4 2023. ISSN 1556-5068. doi: 10.2139/SSRN.4412788. URL https://papers.ssrn.com/abstract=4412788.

[17] T. Loughran and B. McDonald. When is a liability not a liability? textual analysis, dictionaries, and 10-ks. The Journal of finance, 66(1):35–65, 2011.

[18] C. Masson and P. Paroubek. Nlp analytics in finance with dore: a french 250m tokens corpus of corporate annual reports. In Proceedings of the Twelfth Language Resources and Evaluation Conference (LREC), pages 2261–2267, 2020. URL http://www.lrec-conf.org/proceedings/lrec2020/index.html.

[19] A. Moreno-Ortiz, J. Fernández-Cruz, and C. P. C. Hernández. Design and evaluation of sentiecon: A fine-grained economic/financial sentiment lexicon from a corpus of business news. In Proceedings of the Twelfth Language Resources and Evaluation Conference (LREC), pages 5065–5072, 2020. URL http://www.lrec-conf.org/proceedings/lrec2020/index.html.

[20] S. Noy and W. Zhang. Experimental Evidence on the Productivity Effects of Generative Artificial Intelligence. SSRN Electronic Journal, 3 2023. doi: 10.2139/SSRN.4375283. URL https://papers.ssrn.com/abstract=4375283.

[21] J. Oksanen, A. Majumder, K. Saunack, F. Toni, and A. Dhondiyal. A graph-based method for unsupervised knowledge discovery from financial texts. In Proceedings of the Thirteenth Language Resources and Evaluation Conference (LREC), pages 5412–5417, 2022. URL http://www.lrec-conf.org/proceedings/lrec2022/index.html.

[22] OpenAI. Gpt-4 technical report, 2023.

[23] S. Poria, E. Cambria, and A. Gelbukh. Aspect extraction for opinion mining with a deep convolutional neural network. Knowledge-Based Systems, 108:42–49, 2016.

[24] S. Poria, E. Cambria, R. Bajpai, and A. Hussain. A review of affective computing: From unimodal analysis to multimodal fusion. Information fusion, 37:98–125, 2017.

[25] O. Romanko, A. Narayan, and R. H. Kwon. Chatgpt-based investment portfolio selection. arXiv preprint arXiv:2308.06260, 2023.

[26] R. P. Schumaker and H. Chen. Textual analysis of stock market prediction using breaking financial news: The azfin text system. ACM Transactions on Information Systems (TOIS), 27(2):1–19, 2009.

[27] W. F. Sharpe. Capital asset prices: A theory of market equilibrium under conditions of risk. Journal of Finance, 19:425–442, 1964.

[28] F. A. Sortino and L. N. Price. Performance measurement in a downside risk framework. The Journal of Investing, 3:59–64, 1994.

[29] P. C. Tetlock. Giving Content to Investor Sentiment: The Role of Media in the Stock Market. The Journal of Finance, 62(3):1139–1168, 6 2007. ISSN 1540-6261. doi: 10.1111/J.1540-6261.2007.01232.X. URL https://onlinelibrary.wiley.com/doi/10.1111/j.1540-6261.2007.01232.x.

[30] Q. Xie, W. Han, Y. Lai, M. Peng, and J. Huang. The Wall Street Neophyte: A Zero-Shot Analysis of ChatGPT Over MultiModal Stock Movement Prediction Challenges. arXiv preprint arXiv:2304.05351, 4 2023.

[31] K.-C. Yang and F. Menczer. Large language models can rate news outlet credibility. Technical report, arxiv, 4 2023. URL https://arxiv.org/abs/2304.00228v1.

[32] C. Yuan, Y. Liu, R. Yin, J. Zhang, Q. Zhu, R. Mao, and R. Xu. Target-based sentiment annotation in chinese financial news. In Proceedings of the Twelfth Language Resources and Evaluation Conference (LREC), pages 5040–5045, 2020. URL http://www.lrec-conf.org/proceedings/lrec2020/index.html.

[33] T. Yue, D. Au, C. C. Au, and K. Y. Iu. Democratizing financial knowledge with chatgpt by openai: Unleashing the power of technology. Available at SSRN 4346152, 2023.

[34] N. Zmandar, T. Daudert, S. Ahmadi, M. El-Haj, and P. Rayson. Cofif plus: A french financial narrative summarization corpus. In Proceedings of the Thirteenth Language Resources and Evaluation Conference (LREC), pages 1622–1639, 2022. URL http://www.lrec-conf.org/proceedings/lrec2022/index.html.