Companies like OpenAI and Midjourney build chatbots, image generators, and other AI tools that work in the digital world.
Now, a start-up founded by three former OpenAI researchers is using the technology development methods behind chatbots to create AI technology that can navigate the physical world.
Covariant, an Emeryville, California-based robotics company, creates ways for robots to pick, move and sort items as the items move through warehouses and distribution centers. Its goal is to help robots understand what is happening around them and decide what to do next.
The technology also gives the robots a broad understanding of the English language, allowing people to chat with them as if they were chatting with ChatGPT.
The technology, still under development, is not perfect. But it’s a clear sign that the AI systems that drive online chatbots and image generators will also power machines in warehouses, on roads and in homes.
Like chatbots and image generators, this robotics technology learns its skills by analyzing vast amounts of digital data. This means engineers can improve the technology by feeding it more and more data.
Covariant, backed by $222 million in funding, doesn’t make robots. It makes the software that powers the robots. The company aims to develop its new technology with warehouse robots, providing a roadmap for others to do the same in manufacturing plants and perhaps even on roads with driverless cars.
The artificial intelligence systems that drive chatbots and image generators are called neural networks, named for the web of neurons in the brain.
By spotting patterns in vast amounts of data, these systems can learn to recognize words, sounds and images, or even generate them on their own. This is how OpenAI built ChatGPT, giving it the power to instantly answer questions, write term papers and create computer programs. It learned these skills from text collected from across the internet. (Several media outlets, including The New York Times, have sued OpenAI for copyright infringement.)
Companies are now building systems that can learn from different kinds of data simultaneously. For example, by analyzing a collection of photos and the captions that describe those photos, a system can understand the relationships between the two. It may learn, for instance, that the word “banana” describes a curved yellow fruit.
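The article does not describe the training objective behind such systems. One widely used approach, popularized by OpenAI’s CLIP, is a contrastive loss that pulls each image’s embedding toward its own caption and pushes it away from every other caption in the batch. Here is a minimal sketch in PyTorch, where image_emb and text_emb stand in for the outputs of hypothetical image and text encoders:

```python
import torch
import torch.nn.functional as F

def contrastive_loss(image_emb, text_emb, temperature=0.07):
    """CLIP-style objective for a batch of matching image-caption pairs."""
    # Normalize so the dot product below is cosine similarity.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    # Entry (i, j) scores image i against caption j.
    logits = image_emb @ text_emb.T / temperature
    # The true pairs sit on the diagonal; train in both directions.
    targets = torch.arange(len(logits), device=logits.device)
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.T, targets)) / 2

# Toy usage: random vectors stand in for encoder outputs on 4 pairs.
loss = contrastive_loss(torch.randn(4, 512), torch.randn(4, 512))
```

Trained this way, the two encoders place “banana” captions and banana photos near one another in the same embedding space, which is exactly the kind of association the article describes.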
OpenAI used this approach to create Sora, its new video generator. By analyzing thousands of captioned videos, the system learned to generate videos when given a brief description of a scene, such as “a beautifully rendered papier-mâché world of a coral reef, full of colorful fish and sea creatures.”
Covariant, founded by Pieter Abbeel, a professor at the University of California, Berkeley, and three of his former students, Peter Chen, Rocky Duan and Tianhao Zhang, used similar techniques in building a system that drives warehouse robots.
The company helps operate sorting robots in warehouses around the world. It has spent years collecting data, from cameras and other sensors, that shows how these robots work.
“It captures all kinds of data that matter to robots, data that can help them understand the physical world and interact with it,” Dr. Chen said.
By combining this data with the vast amounts of text used to train chatbots like ChatGPT, the company has built AI technology that gives its robots a much broader understanding of the world around them.
After spotting patterns in this stew of images, sensory data and text, the technology gives a robot the power to handle unexpected situations in the physical world. The robot knows how to pick up a banana, even if it has never seen a banana before.
It can also respond in plain English, much like a chatbot. If you tell it to “pick up a banana,” it knows what that means. If you tell it to “pick up a yellow fruit,” it understands that, too.
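Covariant has not said how RFM maps a phrase like this to a physical target. One plausible mechanism, sketched here purely as an assumption, is to embed the command and each visible item with jointly trained encoders like those above and pick the closest match:

```python
import torch
import torch.nn.functional as F

def choose_pick_target(command, item_images, text_encoder, image_encoder):
    """Return the index of the item whose image best matches the command.

    text_encoder and image_encoder are hypothetical stand-ins for
    encoders trained with the contrastive objective sketched earlier.
    """
    text_emb = F.normalize(text_encoder(command), dim=-1)        # shape (d,)
    item_embs = F.normalize(image_encoder(item_images), dim=-1)  # shape (n, d)
    scores = item_embs @ text_emb  # cosine similarity of each item to the command
    return int(scores.argmax())

# Toy usage: random stand-ins for trained encoders and five item crops.
proj = torch.randn(512, 64)
best = choose_pick_target(
    "pick up a yellow fruit",
    torch.randn(5, 512),                       # five candidate items
    text_encoder=lambda cmd: torch.randn(64),  # pretend text embedding
    image_encoder=lambda imgs: imgs @ proj,    # pretend image embeddings
)
```

With well-trained encoders, “a banana” and “a yellow fruit” would both score highest against the same item, which is the behavior the article describes.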
It can even generate videos that predict what is likely to happen as it tries to pick up a banana. These videos have no practical use in a warehouse, but they show the robot’s understanding of what is around it.
“If it can predict the next frames in a video, it can identify the right strategy to follow,” said Dr. Abbeel.
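One way to read that remark is that a frame predictor can act as a learned world model for planning: simulate each candidate action and keep the one whose predicted outcome looks best. A hypothetical sketch, where predict_next_frame and goal_score are assumed stand-ins for a learned predictor and a task objective, not Covariant’s actual internals:

```python
def choose_action(state, candidate_actions, predict_next_frame, goal_score):
    """Model-based action selection via one-step lookahead.

    predict_next_frame: learned model mapping (state, action) -> predicted frame.
    goal_score: task objective, e.g. how close the banana is to the bin.
    Both are hypothetical; the article does not describe RFM's internals.
    """
    best_action, best_score = None, float("-inf")
    for action in candidate_actions:
        predicted = predict_next_frame(state, action)  # imagine the outcome
        score = goal_score(predicted)                  # evaluate it
        if score > best_score:
            best_action, best_score = action, score
    return best_action
```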
The technology, called RFM, for robotics foundation model, makes mistakes, much as chatbots do. Although it often understands what people ask of it, there is always a chance that it will not. It drops items from time to time.
Gary Marcus, an AI entrepreneur and professor emeritus of psychology and neural science at New York University, said the technology could be useful in warehouses and other situations where mistakes are acceptable. However, he said it would be more difficult and more dangerous to deploy in manufacturing plants and other potentially dangerous situations.
“That depends on the cost of error,” he said. “If you have a 150-pound robot that can do something harmful, that cost can be high.”
As companies train this kind of system on increasingly large and diverse data collections, the researchers believe it will improve rapidly.
This is very different from the way robots have worked in the past. Typically, engineers programmed robots to perform the same precise motion over and over, such as lifting a box of a certain size or attaching a rivet to a specific spot on the rear bumper of a car. But such robots could not deal with unexpected or random situations.
By learning from digital data, hundreds of thousands of examples of what happens in the physical world, robots can begin to handle the unexpected. And when those examples are paired with language, robots can also respond to text and voice prompts, just as a chatbot would.
This means that, like chatbots and image generators, robots will become more agile.
“Whatever exists in digital data can be transferred to the real world,” said Dr. Chen.