Facebook’s AI extracts playable characters from real-world videos

Remember those FMV games from the ’90s, the ones that blended prerecorded clips with animated sprites and 3D models? Facebook is bringing the idea back in a far more sophisticated form. In a newly published preprint paper on arXiv.org (“Vid2Game: Controllable Characters Extracted from Real-World Videos”), scientists at Facebook AI Research describe a system capable of extracting controllable characters from real-world videos.

“Our method extracts a character from an uncontrolled video and enables us to control its motion,” the paper’s coauthors explain. “The model generates novel image sequences of that person … [and the] generated video can have an arbitrary background, and effectively capture both the dynamics and appearance of the person.”

The team’s approach relies on two neural networks, or layers of mathematical functions modeled after biological neurons: Pose2Pose, a framework that maps a current pose and a single-instance control signal to the next pose, and Pose2Frame, which places the current and new poses (along with a given background) onto an output frame. The reanimation can be controlled by any “low-dimensional” signal, such as one from a joystick or keyboard, and the researchers say the system is robust enough to position extracted characters in dynamic backgrounds.
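To make that two-stage split concrete, here is a minimal sketch of the control loop it implies. The class name, tensor shapes, and joystick encoding below are illustrative assumptions, not Facebook’s released code; the point is only that a low-dimensional control signal plus the current pose is enough to predict the next pose, which is fed back in on the following frame.

```python
import torch
import torch.nn as nn

POSE_DIM = 2 * 17     # assumption: 17 two-dimensional keypoints per character
CONTROL_DIM = 2       # assumption: joystick direction encoded as (x, y)

class Pose2PoseSketch(nn.Module):
    """Hypothetical stand-in for Pose2Pose: (current pose, control signal) -> next pose."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(POSE_DIM + CONTROL_DIM, 256),
            nn.ReLU(),
            nn.Linear(256, POSE_DIM),
        )

    def forward(self, pose, control):
        # Predict a delta so that small control inputs nudge the character's pose.
        return pose + self.net(torch.cat([pose, control], dim=-1))

pose2pose = Pose2PoseSketch()
pose = torch.zeros(1, POSE_DIM)               # starting pose
for _ in range(3):                            # one iteration per output frame
    control = torch.tensor([[1.0, 0.0]])      # e.g., "push right" on a joystick
    pose = pose2pose(pose, control)           # becomes the current pose for the next frame
```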

So how does it work? First, an input video containing one or more characters is fed into a Pose2Pose network trained for a specific domain (e.g., dancing), which isolates each character, estimates foreground spatial masks, and tracks motion as a trajectory of the character’s center of mass. (The masks determine which regions of the background are replaced by synthesized image information.) Using these outputs and the combined pose data, Pose2Frame separates character-dependent changes in the scene, such as shadows, held items, and reflections, from those that are character-independent, and returns a pair of outputs that are linearly blended with any desired background.
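The final compositing step amounts to a per-pixel linear blend driven by the estimated mask. The snippet below is an illustrative sketch with assumed shapes and placeholder data (the array names are not from the paper): only mask-covered regions of the background are replaced by the synthesized character layer, which is what lets the same character be dropped onto an arbitrary background.

```python
import numpy as np

H, W = 256, 256  # assumed frame size

# Placeholder stand-ins for Pose2Frame's two outputs on a single frame:
character_layer = np.random.rand(H, W, 3)   # synthesized character, shadows, reflections
blend_mask = np.zeros((H, W, 1))            # soft mask: 1 where the character affects the scene
blend_mask[64:192, 96:160] = 1.0            # pretend the character occupies this region

background = np.random.rand(H, W, 3)        # any desired background image

# Linear blend: masked regions come from the synthesized layer, the rest from the background.
frame = blend_mask * character_layer + (1.0 - blend_mask) * background
```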

To train the AI system, the researchers sourced three videos, each between five and eight minutes long, of a tennis player outdoors, a person swinging a sword indoors, and a person walking. Compared with a neural network model fed a three-minute video of a dancer, they report that their approach successfully handled dynamic elements, such as other people and differences in camera angle, as well as variations in character clothing.

“Each network addresses a computational problem not previously fully met, together paving the way for the generation of video games with realistic graphics,” they wrote. “In addition, controllable characters extracted from YouTube-like videos can find their place in the virtual worlds and augmented realities.”

Facebook isn’t the only company investigating AI systems that might aid in game design. Startup Promethean AI employs machine learning to help human artists create art for video games, and Nvidia researchers recently demonstrated a generative model that can create virtual environments using video snippets. Machine learning has also been used to upscale textures in retro titles like Final Fantasy VII and The Legend of Zelda: Twilight Princess, and to generate thousands of levels from scratch for games like Doom.