What if ChatGPT was trained on decades of financial news and data? BloombergGPT aims to be a domain-specific AI for business news

If you were going to predict which news company would be the first out with its own massive AI model, Bloomberg would’ve been a good bet. For all its success expanding into consumer-facing news over the past decade, Bloomberg is fundamentally a data company, driven by $30,000/year subscriptions to its terminals.
On Friday, the company announced it had built something called BloombergGPT. Think of it as a computer that aims to “know” everything the entire company “knows.”

Bloomberg today released a research paper detailing the development of BloombergGPT™, a new large-scale generative artificial intelligence (AI) model. This large language model (LLM) has been specifically trained on a wide range of financial data to support a diverse set of natural language processing (NLP) tasks within the financial industry.

Recent advances in Artificial Intelligence (AI) based on LLMs have already demonstrated exciting new applications for many domains. However, the complexity and unique terminology of the financial domain warrant a domain-specific model. BloombergGPT represents the first step in the development and application of this new technology for the financial industry. This model will assist Bloomberg in improving existing financial NLP tasks, such as sentiment analysis, named entity recognition, news classification, and question answering, among others. Furthermore, BloombergGPT will unlock new opportunities for marshalling the vast quantities of data available on the Bloomberg Terminal to better help the firm’s customers, while bringing the full potential of AI to the financial domain.

Meet #BloombergGPT 👋🏻
This 50-billion parameter #LargeLanguageModel was purpose-built from scratch for #finance using a unique mix of @Bloomberg's #data and public datasets to support financial #NLProc tasks.https://t.co/vehdOZtvu0#AI #ArtificialIntelligence #LLMs #ML #GPT
— Tech At Bloomberg (@TechAtBloomberg) March 31, 2023

The technical details are, as promised, in this research paper. It’s by Bloomberg’s Shijie Wu, Ozan İrsoy, Steven Lu, Vadim Dabravolski, Mark Dredze, Sebastian Gehrmann, Prabhanjan Kambadur, David Rosenberg, and Gideon Mann.
How big is BloombergGPT? Well, the company says it was trained on a corpus of more than 700 billion tokens (or word fragments). For context, GPT-3, released in 2020, was trained on about 500 billion. (OpenAI has declined to reveal any equivalent number for GPT-4, the successor released last month, citing “the competitive landscape.”)
What’s in all that training data? Of the 700 billion-plus tokens, 363 billion are taken from Bloomberg’s own financial data, the sort of information that powers its terminals — “the largest domain-specific dataset yet” constructed, it says. Another 345 billion tokens come from “general purpose datasets” obtained from elsewhere.
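Run those numbers and you get a roughly even split between financial and general data (and roughly the 52% proprietary share Ethan Mollick notes below). A quick back-of-the-envelope check:

```python
# Token counts as reported in the BloombergGPT paper.
finpile_tokens = 363e9  # Bloomberg's domain-specific FinPile corpus
general_tokens = 345e9  # public, general-purpose datasets (The Pile, Wikipedia, etc.)

total = finpile_tokens + general_tokens
print(f"total tokens:    {total / 1e9:.0f}B")            # -> 708B
print(f"financial share: {finpile_tokens / total:.1%}")  # -> 51.3%
print(f"general share:   {general_tokens / total:.1%}")  # -> 48.7%
```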

Rather than building a general-purpose LLM, or a small LLM exclusively on domain-specific data, we take a mixed approach. General models cover many domains, are able to perform at a high level across a wide variety of tasks, and obviate the need for specialization during training time. However, results from existing domain-specific models show that general models cannot replace them. At Bloomberg, we support a very large and diverse set of tasks, well served by a general model, but the vast majority of our applications are within the financial domain, better served by a specific model. For that reason, we set out to build a model that achieves best-in-class results on financial benchmarks, while also maintaining competitive performance on general-purpose LLM benchmarks.
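The paper doesn't publish its data pipeline, but the core idea of the mixed approach, drawing training examples from both corpora weighted roughly by size, is simple to sketch. A minimal illustration (the corpora and the weight here are stand-ins, not Bloomberg's actual setup):

```python
import random

# Minimal sketch of mixed-corpus sampling; NOT Bloomberg's actual pipeline.
# Each training document is drawn from the financial or the general corpus
# with probability proportional to that corpus's share of tokens.
finpile = ["<financial doc 1>", "<financial doc 2>"]  # stand-in for FinPile
general = ["<general doc 1>", "<general doc 2>"]      # stand-in for The Pile etc.

def sample_training_doc(p_financial: float = 0.513) -> str:
    """Pick one document, weighting the financial corpus at ~51.3% of tokens."""
    corpus = finpile if random.random() < p_financial else general
    return random.choice(corpus)

for _ in range(5):
    print(sample_training_doc())
```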

The new BloombergGPT AI may be harbinger of the next wave of corporate AI. Current AIs are trained on web data (though firms can add their own training)
BloombergGPT is 52% either proprietary data or cleaned financial data. And it shows signs of being better at financial tasks. pic.twitter.com/VmDKjg6ztX
— Ethan Mollick (@emollick) April 1, 2023

The company-specific data, named FinPile, consists of “a range of English financial documents including news, filings, press releases, web-scraped financial documents, and social media drawn from the Bloomberg archives.” So if you’ve read a Bloomberg Businessweek story in the past few years, it’s in there. So are SEC filings, Bloomberg TV transcripts, Fed data, and “other data relevant to the financial markets.” It’s also trained on non-Bloomberg news sources:

The News category includes all news sources excluding news articles written by Bloomberg journalists. Overall, there are hundreds of English news sources in FinPile. … Generally, the content in this dataset comes from reputable sources of news that are relevant to the financial community so as to maintain factuality and reduce bias.

The non-finance-specific data includes a massive corpus poetically known as The Pile; it includes everything from YouTube captions to Project Gutenberg to, yes, the cache of Enron emails that are always popping up in AI training. (It also has a complete copy of Wikipedia as of last July.)
But enough about the training data. What can BloombergGPT do? Because it shares a training base with other LLMs, BloombergGPT can do the sorts of things that we’ve come to expect from ChatGPT and similar models. But it can also perform tasks more tightly connected to Bloomberg’s needs. It can translate natural language requests (“apple and ibm market cap and eps”) into the Bloomberg Query Language terminal users love/hate (get(cur_mkt_cap,is_eps) for(['AAPL US Equity','IBM US Equity'])). It can also suggest Bloomberg-style headlines for news stories (sorry, copy editors):

Input: The US housing market shrank in value by $2.3 trillion, or 4.9%, in the second half of 2022, according to Redfin. That’s the largest drop in percentage terms since the 2008 housing crisis, when values slumped 5.8% during the same period.
Output: Home Prices See Biggest Drop in 15 Years

Input: The global economy is in a better place today than many predicted months ago, Janet Yellen said at the G20. At home, she pointed to a resilient US economy, where headline inflation has moderated and the labor market is strong. She also called for the IMF to move swiftly toward a fully-financed program for Ukraine.
Output: Yellen Sees Global Economy More Resilient Than Expected

Input: Google was sued by the US and eight states seeking the breakup of its ad-tech business for allegedly monopolizing the digital advertising market. The lawsuit is the Biden administration’s first big challenge to a tech titan and one of the rare times since 1982 that the DOJ has sought to cleave up a major company.
Output: Google Sued for Monopoly in Online Ad Market

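Bloomberg doesn't say exactly how it elicited these headlines, but the input/output pairs have the shape of classic few-shot prompting. A hypothetical sketch using the examples above as shots (the prompt is printed rather than sent, since the model call would be whatever endpoint you have on hand):

```python
# Hypothetical few-shot prompt for Bloomberg-style headline generation,
# built from the example pairs above. The prompt is printed rather than
# sent; swap in whatever LLM endpoint you actually use.
EXAMPLES = [
    ("The US housing market shrank in value by $2.3 trillion, or 4.9%, "
     "in the second half of 2022, according to Redfin.",
     "Home Prices See Biggest Drop in 15 Years"),
    ("Google was sued by the US and eight states seeking the breakup of "
     "its ad-tech business for allegedly monopolizing the digital "
     "advertising market.",
     "Google Sued for Monopoly in Online Ad Market"),
]

def build_headline_prompt(story: str) -> str:
    """Assemble a few-shot prompt ending where the model should complete."""
    shots = "\n\n".join(f"Input: {inp}\nOutput: {out}" for inp, out in EXAMPLES)
    return f"{shots}\n\nInput: {story}\nOutput:"

print(build_headline_prompt(
    "The global economy is in a better place today than many predicted "
    "months ago, Janet Yellen said at the G20."
))
```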
It’s also better tuned, they say, to answer specific business-related questions, whether they be sentiment analysis, categorization, data extraction, or something else entirely. (“For example, it performs well at identifying the CEO of a company.”)
The most interesting AI news of the week for me is Bloomberg’s 50B parameter model trained on financial data. Points to a polyglot future where a number of players can win in AI, as opposed to just Big Tech and OpenAI.
— Matt Turck (@mattturck) April 1, 2023

The paper includes a series of performance comparisons with GPT-3 and other LLMs and finds that BloombergGPT holds its own on general tasks — at least when facing off against similarly sized models — and outperforms on many finance-specific ones. (The internal testing battery includes such carnival-game-ready terms as “Penguins in a Table,” “Snarks,” “Web of Lies,” and the dreaded “Hyperbaton.”)

Across dozens of tasks in many benchmarks a clear picture emerges. Among the models with tens of billions of parameters that we compare to, BloombergGPT performs the best. Furthermore, in some cases, it is competitive or exceeds the performance of much larger models (hundreds of billions of parameters). While our goal for BloombergGPT was to be a best-in-class model for financial tasks, and we included general-purpose training data to support domain-specific training, the model has still attained abilities on general-purpose data that exceed similarly sized models, and in some cases match or outperform much larger models.

BloombergGPT is going to replace the Analyst
Analysts are fundamentally chat-based interfaces that senior finance folks use to gather, organize, and output data
Finance workflows are already very iterative and GPT doesnt care about protected Saturdays🧵https://t.co/3pwdX9boHT pic.twitter.com/2ayyWIhPMd
— Van Spina (🌴,🥷) (@palmtreeshinobi) March 31, 2023

Penguins aside, it’s not hard to imagine more specific use cases that go beyond benchmarking, either for Bloomberg’s journalists or its terminal customers. (The company’s announcement didn’t specify what it planned to do with what it has built.) A corpus of ~all of the world’s premium English-language business reporting — plus the universe of financial data, structured and otherwise, that underpins it — is just the sort of rich vein of information a generative AI is designed to mine. It’s institutional memory in a box.
BloombergGPT’s abilities excite us about the future. We have many questions. We’re exploring how to evaluate, improve, and use it. We’re also excited to see how the community can learn from our experience to build the next great model or application. https://t.co/4Or9l61d1u
— Mark Dredze (@mdredze) March 31, 2023

That said, all the usual caveats for LLMs apply. BloombergGPT can, I’m sure, hallucinate. All that training data comes with its own set of potential biases. (I’d wager BloombergGPT won’t call for the revolution of the proletariat anytime soon.)
As for how BloombergGPT might inspire other news organizations…well, Bloomberg’s in a pretty unique situation here, with the scale of data it’s assembled and the product it can be applied to. But I believe there will be, in the longer term, openings for smaller publishers here, especially those with large digitized archives. Imagine the Anytown Gazette training an AI on 100 years of its newspaper archives, plus a massive collection of city/county/state documents and whatever other sources of local data it can get its hands on. It’s a radically different scale than what Bloomberg can reach, of course, and it may be more useful as an internal tool than anything public-facing. But given the incredible pace of AI advances over the past year, it might be a worthy idea sooner than you think.

Image of Michael Bloomberg as a comic-book wizard generated by AI, of course.
