A new artificial intelligence system can take still images and generate short videos that simulate what happens next, similar to how humans can visually imagine how a scene will evolve, according to a new study.

Humans intuitively understand how the world works, which makes it easier for people than for machines to envision how a scene will play out. Objects in a still image could move and interact in a multitude of different ways, which makes this feat very hard for machines, the researchers said. Even so, the new deep learning system was able to trick human viewers 20 per cent of the time when its clips were compared with real footage.

Researchers at the Massachusetts Institute of Technology (MIT) pitted two neural networks against each other: one tried to distinguish real videos from machine-generated ones, while the other tried to create videos realistic enough to fool the first.
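The adversarial setup described above can be sketched in miniature. The toy generator and discriminator below are hypothetical single-layer stand-ins (not the MIT model): the generator maps a random noise vector to a whole 32-frame, 64 x 64 "video" tensor in one pass, and the discriminator returns a probability that a clip is real footage. Training, which is omitted here, would push these two scores apart for the discriminator and back together for the generator.

```python
import numpy as np

rng = np.random.default_rng(0)

FRAMES, H, W = 32, 64, 64   # output size reported in the study
NOISE_DIM = 100             # hypothetical noise dimensionality

# Generator: a single linear layer from noise to a flattened video clip.
G_weights = rng.normal(0, 0.01, size=(NOISE_DIM, FRAMES * H * W))

def generate(noise):
    """Produce a fake video clip from a noise vector, all frames at once."""
    flat = np.tanh(noise @ G_weights)      # pixel values squashed into [-1, 1]
    return flat.reshape(FRAMES, H, W)

# Discriminator: a single linear layer producing a real/fake probability.
D_weights = rng.normal(0, 0.01, size=(FRAMES * H * W,))

def discriminate(video):
    """Return the probability that a clip is real footage."""
    score = video.reshape(-1) @ D_weights
    return 1.0 / (1.0 + np.exp(-score))    # sigmoid

# One adversarial round: score a real clip and a generated (fake) clip.
fake_clip = generate(rng.normal(size=NOISE_DIM))
real_clip = rng.normal(size=(FRAMES, H, W))
real_score = discriminate(real_clip)
fake_score = discriminate(fake_clip)
```

In a real adversarial system both networks would be deep and trained jointly; the sketch only shows the shape of the contest between them.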
Early Stages
Still, film directors probably don't need to be too concerned about machines taking over their jobs yet — the videos were only 1 to 1.5 seconds long and were made at a resolution of 64 x 64 pixels. But the researchers said that the approach could eventually help robots and self-driving cars navigate dynamic environments and interact with humans, or let Facebook automatically tag videos with labels describing what is happening.
The system is also able to learn unsupervised, the researchers said. This means that the two million videos — equivalent to about a year's worth of footage — that the system was trained on did not have to be labeled by a human, which dramatically reduces development time and makes it adaptable to new data.
AI Filmmakers
The MIT team is not the first to attempt to use artificial intelligence to generate video from scratch. But previous approaches have tended to build video up frame by frame, the researchers said, which allows errors to accumulate at each stage. The new method instead processes the entire scene at once, normally generating all 32 frames in one go.
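The difference between the two strategies can be illustrated with a toy numpy sketch (a simplification for intuition, not the MIT method): in frame-by-frame generation each predicted frame feeds the next prediction, so a small per-step error compounds across the clip, whereas a whole-clip generator derives every frame from the input in a single pass, so the error stays roughly constant.

```python
import numpy as np

rng = np.random.default_rng(1)
FRAMES = 32  # clip length, matching the study's 32-frame output

def frame_by_frame(first_frame, step_noise=0.05):
    """Autoregressive sketch: each frame is the previous frame plus a small error."""
    frames = [first_frame]
    for _ in range(FRAMES - 1):
        frames.append(frames[-1] + rng.normal(0, step_noise, first_frame.shape))
    return np.stack(frames)

def one_shot(first_frame, step_noise=0.05):
    """Whole-clip sketch: every frame derives directly from the input frame."""
    return np.stack([first_frame + rng.normal(0, step_noise, first_frame.shape)
                     for _ in range(FRAMES)])

start = np.zeros((8, 8))  # tiny stand-in for an input image
# Mean drift of the final frame away from the input:
drift_autoregressive = np.abs(frame_by_frame(start)[-1]).mean()  # grows with length
drift_one_shot = np.abs(one_shot(start)[-1]).mean()              # stays near step_noise
```

After 31 compounding steps the autoregressive clip has drifted far more than the one-shot clip, which is the error-accumulation problem the whole-scene approach sidesteps.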
The results are far from perfect, though. Often, objects in the foreground appear larger than they should, and humans can appear in the footage as blurry blobs, the researchers said. Objects can also disappear from a scene and others can appear out of nowhere, they added.
"The computer model starts off knowing nothing about the world. It has to learn what people look like, how objects move and what might happen," said study lead author Carl Vondrick. "The model hasn't completely learned these things yet. Expanding its ability to understand high-level concepts like objects will dramatically improve the generations."
Another big challenge moving forward will be to create longer videos, because that will require the system to track more relationships between objects in the scene and for a longer time, according to Vondrick.
"To overcome this, it might be good to add human input to help the system understand elements of the scene that would be difficult for it to learn on its own," he said.