The race for AI leadership has become a desperate hunt for the digital data needed to advance the technology. To get that data, tech companies including OpenAI, Google and Meta have cut corners, ignored company policies and discussed bending the law, according to a New York Times investigation.
At Meta, which owns Facebook and Instagram, executives, lawyers and engineers last year discussed buying the publishing house Simon & Schuster to acquire major projects, according to tapes of internal meetings obtained by The Times. They also discussed collecting copyrighted data from all over the Internet, even if it meant facing lawsuits. Licensing negotiations with publishers, artists, musicians and the news industry would take too long, they said.
Like OpenAI, Google transcribed YouTube videos to collect text for its AI models, five people with knowledge of the company’s practices said. That potentially violated the copyrights to the videos, which are held by their creators.
Last year, Google also expanded its terms of service. One motivation for the change, according to members of the company’s privacy team and an internal message seen by The Times, was to allow Google to use publicly available Google Docs, restaurant reviews on Google Maps and other online material for more of its AI products.
The companies’ actions show how online information—news, fiction, message board posts, Wikipedia articles, computer programs, photos, podcasts, and movie clips—is increasingly becoming the lifeblood of the booming AI industry. Creating innovative systems depends on having enough data to teach technologies to instantly produce text, images, sounds, and videos that look like what a human creates.