Artificial intelligence companies are draining the internet's resources


Artificial intelligence is draining the internet's resources. As you and I log on to entertain ourselves (or not), learn, and connect, AI companies hoover up that content to train their large language models (LLMs) and improve their capabilities. This is how ChatGPT not only knows factual information but also knows how to string responses together: much of what it "knows" comes from a massive trove of internet content.

But as companies rely on the internet to train their LLMs, they run into a problem: the internet is finite, and AI companies expect their models to keep growing rapidly. As the Wall Street Journal reports, companies like OpenAI and Google are running up against this reality. By some industry estimates, they could run out of usable internet data in about two years, as high-quality data becomes scarce and some sites refuse to let AI companies get their hands on it.

Artificial intelligence requires huge amounts of data

Don't underestimate the amount of data these companies need now and will need in the future. Epoch researcher Pablo Villalobos told the Wall Street Journal that OpenAI trained GPT-4 on about 12 trillion tokens, the words and word fragments that text is broken into so an LLM can process it. (OpenAI says a token works out to roughly 0.75 words, so 12 trillion tokens is roughly 9 trillion words.) Villalobos believes OpenAI's next big model, GPT-5, would need 60 to 100 trillion tokens to keep up with expected growth. By OpenAI's conversion, that's 45 to 75 trillion words. The kicker? Villalobos says that even after exhausting every scrap of high-quality data on the internet, GPT-5 would still come up 10 to 20 trillion tokens short, or even more.
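To make the arithmetic above concrete, here is a minimal Python sketch that applies OpenAI's rough 0.75-words-per-token conversion to the figures Villalobos cites. The constant and function names are purely illustrative, and the ratio is an approximation, not an exact property of any tokenizer.

```python
# Rough sketch: convert token counts to approximate word counts using
# OpenAI's ~0.75 words-per-token rule of thumb mentioned above.
WORDS_PER_TOKEN = 0.75  # approximation; real tokenizers vary by language and text

def tokens_to_words(tokens: float) -> float:
    """Estimate how many English words a given token count represents."""
    return tokens * WORDS_PER_TOKEN

TRILLION = 1e12
for label, tokens in [("GPT-4 training set (reported)", 12 * TRILLION),
                      ("GPT-5 estimate, low", 60 * TRILLION),
                      ("GPT-5 estimate, high", 100 * TRILLION)]:
    print(f"{label}: ~{tokens_to_words(tokens) / TRILLION:.0f} trillion words")
```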

Even so, Villalobos thinks this data shortage won't really bite until around 2028, but others are less optimistic, especially the AI companies themselves. They see the writing on the wall and are already looking for alternatives to internet data for training their models.

Artificial intelligence's data problems

There are, of course, several issues tangled together here. The first is the aforementioned data shortage: an LLM can't be trained without data, and giant models like GPT and Gemini need enormous amounts of it. The second is data quality. Companies aren't going to scrape every imaginable corner of the internet, because so much of it is junk. OpenAI doesn't want to pump misinformation and badly written content into GPT when its goal is an LLM that responds accurately to user prompts. (We've seen plenty of examples of AI spewing misinformation anyway.) Filtering out the junk leaves these companies with even less to work with.

Finally, there's the ethics of scraping data from the internet in the first place. Whether you know it or not, AI companies have probably scraped your data and used it to train their LLMs. These companies aren't especially concerned with your privacy: they just want the data, and if they can get it, they'll take it. It's big business, too: Reddit is selling your content to AI companies, in case you didn't know. Some organizations are fighting back (The New York Times is suing OpenAI over this), but until there are real user protections on the books, your public internet data will keep flowing to an LLM near you.

So where do companies find new data? OpenAI is leading the charge. For GPT-5, the company is reportedly considering training on transcriptions of public videos, such as those scraped from YouTube, generated by its Whisper transcription tool. (The company appears to have used such videos for its AI video generator, Sora.) OpenAI is also exploring smaller models built for specific domains, as well as a system that would pay information providers based on the quality of their data.
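To illustrate what turning public video into training text might look like, here is a minimal sketch using the open-source whisper Python package. The file name, word-count threshold, and output file are hypothetical; this is a guess at the general approach, not OpenAI's actual pipeline.

```python
# Minimal sketch: transcribe an audio track into text that could, in principle,
# be appended to a training corpus. Requires the open-source `openai-whisper`
# package (pip install openai-whisper) and ffmpeg on the system path.
import whisper

model = whisper.load_model("base")        # small, CPU-friendly checkpoint
result = model.transcribe("talk.mp3")     # "talk.mp3" is a placeholder path
transcript = result["text"]

# Hypothetical quality gate: keep only reasonably substantial transcripts.
if len(transcript.split()) > 200:
    with open("corpus.txt", "a", encoding="utf-8") as f:
        f.write(transcript.strip() + "\n")
```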

Is synthetic data the answer?

But perhaps the most controversial next step some companies are considering is training models on synthetic data. Synthetic data is information generated from an existing data set: the idea is to produce a new data set that resembles the original but is entirely new. In theory, that masks the contents of the original data set while still giving the LLM something similar to train on.
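As a toy illustration of "information generated from an existing data set," here is a sketch that fits a simple statistical model to an original numeric data set and samples new, similar records from it. Real synthetic text generation for LLMs is far more involved; the distribution choice and numbers here are assumptions purely for illustration.

```python
# Toy sketch of synthetic data: fit a simple model to "real" data, then sample
# brand-new records with similar statistics instead of reusing the originals.
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for an original data set (say, document lengths in words).
original = rng.lognormal(mean=6.0, sigma=1.0, size=10_000)

# Fit a log-normal model to the original data...
log_mu, log_sigma = np.log(original).mean(), np.log(original).std()

# ...then generate a synthetic data set that mimics it without copying any row.
synthetic = rng.lognormal(mean=log_mu, sigma=log_sigma, size=10_000)

print(f"original:  mean={original.mean():.1f}, std={original.std():.1f}")
print(f"synthetic: mean={synthetic.mean():.1f}, std={synthetic.std():.1f}")
```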

In practice, however, training an LLM on synthetic data can lead to "model collapse." Synthetic data only recycles the patterns already present in the original data set, so an LLM trained on those same patterns over and over stops improving and may even forget important parts of the data. Over time, the model keeps returning the same kinds of results because it lacks the variety of training data needed to support distinctive responses. That would hobble tools like ChatGPT and defeat the purpose of using synthetic data in the first place.
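The collapse effect is easiest to see with a toy numerical analogy: repeatedly fitting a model to its own synthetic output and resampling tends to shrink diversity over generations. The sketch below uses a simple Gaussian as a stand-in for an LLM; it is only an analogy under that assumption, not a simulation of how language models are actually trained.

```python
# Toy analogy for model collapse: each "generation" fits a Gaussian "model" to
# the previous generation's samples, then generates the next generation's
# training data from that fit. With finite samples, the spread of the data
# tends to drift downward over many generations, i.e. diversity is lost.
import numpy as np

rng = np.random.default_rng(42)
data = rng.normal(loc=0.0, scale=1.0, size=50)   # generation 0: the "real" data

for generation in range(1, 101):
    mu, sigma = data.mean(), data.std()               # fit the current "model"
    data = rng.normal(loc=mu, scale=sigma, size=50)   # next gen trains on synthetic samples
    if generation % 20 == 0:
        print(f"generation {generation:3d}: fitted std = {sigma:.3f}")
```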

Still, AI companies are somewhat optimistic about synthetic data. Anthropic and OpenAI both see a place for it in their training sets. These are capable companies, so if they can find a way to fold synthetic data into their models without burning the house down, more power to them. Frankly, it would be nice to know my Facebook posts from 2010 weren't used to fuel the AI revolution.