The Times Australia
The Times World News

Researchers warn we could run out of data to train AI by 2026. What then?

  • Written by Rita Matulionyte, Senior Lecturer in Law, Macquarie University
As artificial intelligence (AI) reaches the peak of its popularity[1], researchers have warned[2] the industry might be running out of training data – the fuel that runs powerful AI systems. This could slow down the growth of AI models, especially large language models, and may even alter the trajectory of the AI revolution.

But why is a potential lack of data an issue, considering how much is available on the web? And is there a way to address the risk?

Read more: AI to Z: all the terms you need to know to keep up in the AI hype age[3]

Why high-quality data are important for AI

We need a lot of data to train powerful, accurate and high-quality AI algorithms. For instance, ChatGPT was trained on 570 gigabytes of text data, or about 300 billion words[4].

Similarly, the Stable Diffusion algorithm (which is behind AI image-generating apps such as Lensa) was trained on the LAION-5B dataset[5], comprising 5.8 billion image-text pairs. If an algorithm is trained on an insufficient amount of data, it will produce inaccurate or low-quality outputs.

The quality of the training data is also important. Low-quality data such as social media posts or blurry photographs are easy to source, but aren’t sufficient to train high-performing AI models.

Text taken from social media platforms might be biased or prejudiced, or may include disinformation or illegal content which could be replicated by the model. For example, when Microsoft tried to train its AI bot using Twitter content, it learned to produce[6] racist and misogynistic outputs.

This is why AI developers seek out high-quality content such as text from books, online articles, scientific papers, Wikipedia, and certain filtered web content. The Google Assistant was trained[7] on 11,000 romance novels taken from self-publishing site Smashwords[8] to make it more conversational.

Do we have enough data?

The AI industry has been training AI systems on ever-larger datasets, which is why we now have high-performing models such as ChatGPT or DALL-E 3. At the same time, research shows online data stocks are growing much more slowly than the datasets used to train AI.

In a paper published last year, a group of researchers[9] predicted we will run out of high-quality text data before 2026 if the current AI training trends continue. They also estimated low-quality language data will be exhausted sometime between 2030 and 2050, and low-quality image data between 2030 and 2060.

AI could contribute up to[10] US$15.7 trillion (A$24.1 trillion) to the world economy by 2030, according to accounting and consulting group PwC. But running out of usable data could slow down its development.

Should we be worried?

While the above points might alarm some AI fans, the situation may not be as bad as it seems. There are many unknowns about how AI models will develop in the future, as well as a few ways to address the risk of data shortages.

One opportunity is for AI developers to improve algorithms so they use the data they already have more efficiently.

It is likely that in the coming years they will be able to train high-performing AI systems using less data, and possibly less computational power. This would also help reduce AI’s carbon footprint[11].

Another option is to use AI to create synthetic data[12] to train systems. In other words, developers can simply generate the data they need, curated to suit their particular AI model.

Several projects are already using synthetic content, often sourced from data-generating services such as Mostly AI[13]. This will become more common[14] in the future.

Developers are also searching for content outside the free online space, such as that held by large publishers and offline repositories. Think about the millions of texts published before the internet. Made available digitally, they could provide a new source of data for AI projects.

News Corp, one of the world’s largest news content owners (which has much of its content behind a paywall), recently said it was negotiating[15] content deals with AI developers. Such deals would force AI companies to pay for training data, which they have so far mostly scraped off the internet for free.

Content creators have protested against the unauthorised use of their content to train AI models, with some suing companies such as Microsoft, OpenAI[16] and Stability AI[17]. Being remunerated for their work may help redress some of the power imbalance that exists between creatives and AI companies.

Read more: No, the Lensa AI app technically isn’t stealing artists' work – but it will majorly shake up the art world[18]

References

  1. ^ peak of its popularity (trends.google.com)
  2. ^ have warned (www.technologyreview.com)
  3. ^ AI to Z: all the terms you need to know to keep up in the AI hype age (theconversation.com)
  4. ^ 300 billion words (www.sciencefocus.com)
  5. ^ LAION-5B dataset (laion.ai)
  6. ^ learned to produce (www.theverge.com)
  7. ^ trained (www.theguardian.com)
  8. ^ self-publishing site Smashwords (www.smashwords.com)
  9. ^ a group of researchers (arxiv.org)
  10. ^ could contribute up to (www.pwc.co.uk)
  11. ^ carbon footprint (earth.org)
  12. ^ synthetic data (www.forbes.com)
  13. ^ Mostly AI (mostly.ai)
  14. ^ become more common (www.wsj.com)
  15. ^ negotiating (www.reuters.com)
  16. ^ Microsoft, OpenAI (www.forbes.com)
  17. ^ Stability AI (stablediffusionlitigation.com)
  18. ^ No, the Lensa AI app technically isn’t stealing artists' work – but it will majorly shake up the art world (theconversation.com)

Read more https://theconversation.com/researchers-warn-we-could-run-out-of-data-to-train-ai-by-2026-what-then-216741
