Online data has long been a precious commodity. For years, Meta and Google have used data to target their online ads. Netflix and Spotify have used it to recommend more movies and music. Political candidates have turned to data to figure out which groups of voters to focus their attention on.
Over the past 18 months, it has become increasingly clear that digital data is also vital to the development of artificial intelligence. Here’s what you need to know.
The more data, the better.
AI's success depends on data, because models become more accurate and more humanlike the more data they are fed.
In the same way that a student learns by reading more books, essays, and other information, large language models—the systems that underlie chatbots—also become more accurate and more powerful if fed more data.
Some large language models, such as OpenAI's GPT-3, released in 2020, were trained on hundreds of billions of "tokens," which are essentially words or chunks of words. More recently, large language models have been trained on more than three trillion tokens.
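(The idea of a "token" can be made concrete with a small sketch. The snippet below is a minimal illustration, not something from the companies described here; it assumes the open-source tiktoken package, and the "cl100k_base" encoding name and the sample sentence are just examples.)

    # Minimal illustration of how text is split into tokens.
    # Assumes: pip install tiktoken
    import tiktoken

    # "cl100k_base" is one example encoding; models differ in how they split text.
    enc = tiktoken.get_encoding("cl100k_base")

    text = "Large language models learn from enormous amounts of text."
    token_ids = enc.encode(text)                    # text -> list of integer token ids
    pieces = [enc.decode([t]) for t in token_ids]   # each id back to its text chunk

    print(len(token_ids), "tokens")
    print(pieces)   # roughly one word or word fragment per token

Counting tokens this way is how the "hundreds of billions" and "three trillion" figures above are tallied across a training dataset.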
Online data is a valuable and finite resource.
Tech companies are consuming publicly available online data to build their AI models faster than new data is being produced. According to one prediction, high-quality digital data will be exhausted by 2026.
Tech companies are pushing hard to get more data.
In the race for more data, OpenAI, Google and Meta have turned to new tools, changed their terms of service and debated internally how far they can go to obtain it.
At OpenAI, researchers created a program in 2021 that converted the audio of YouTube videos to text and then fed the transcripts to one of its artificial intelligence models, in violation of YouTube’s terms of service, people with knowledge of the matter said.
(The New York Times sued OpenAI and Microsoft for using copyrighted news without permission to develop AI. OpenAI and Microsoft said they used news articles in transformative ways that did not violate copyright law.)
Google, which owns YouTube, also used YouTube data to develop its artificial intelligence models, wading into a legal gray area around copyright, people with knowledge of the matter said. And Google revised its privacy policy last year so that it can use publicly available material to develop more of its AI products.
At Meta, executives and lawyers last year discussed how to get more data for AI development, including buying a major publisher such as Simon & Schuster. In private meetings, they weighed the possibility of feeding copyrighted works into their AI models, even if that meant being sued later, according to recordings of the meetings obtained by The Times.
One solution may be “synthetic” data.
OpenAI, Google and other companies are exploring ways to use their AI to generate more data. The result would be what is known as "synthetic" data: text created by AI models that can then be used to train better AI models.
But synthetic data is risky. AI models make mistakes, and training new models on machine-generated text can compound those errors.