Microsoft VASA-1: Realistic Deepfakes from a Single Photo and an Audio Track

On Tuesday, Microsoft Research Asia unveiled VASA-1, an AI model capable of creating a synchronised animated video of a person talking or singing from a single photo and an existing audio track. This development could power virtual avatars that render locally and don’t require video feeds, or even allow anyone to take a photo of a person found online and make them appear to say whatever they want. According to the research paper “VASA-1: Lifelike Audio-Driven Talking Faces Generated in Real Time,” VASA-1 paves the way for real-time engagements with lifelike avatars that emulate human conversational behaviours. The framework uses machine learning to analyse a static image along with a speech audio clip, generating a realistic video with precise facial expressions, head movements, and lip-syncing to the audio. Unlike other Microsoft research efforts that clone or simulate voices, VASA-1 relies on an existing audio input that could be specially recorded or spoken for a particular purpose.
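Since Microsoft has not released any code for VASA-1, the short Python sketch below only illustrates the input/output contract the paper describes: a single face photo plus a speech clip in, a sequence of synthesised video frames out. All names, defaults, and the placeholder body are assumptions for illustration, not VASA-1's actual implementation.

```python
# Hypothetical sketch of the input/output contract of an audio-driven
# talking-face model like the one described above. Names are illustrative.
from dataclasses import dataclass

import numpy as np


@dataclass
class TalkingHeadRequest:
    face_image: np.ndarray      # H x W x 3 RGB photo of one person
    speech_audio: np.ndarray    # mono waveform samples
    sample_rate: int = 16_000   # audio sample rate in Hz


def generate_talking_head(request: TalkingHeadRequest, fps: int = 40) -> np.ndarray:
    """Return a video as an array of frames (T x 512 x 512 x 3).

    A real audio-driven model would extract audio features, predict facial
    dynamics and head motion in a latent face space, then render each frame
    using the identity and appearance taken from the single source photo.
    """
    n_frames = int(len(request.speech_audio) / request.sample_rate * fps)
    # Placeholder: emit blank frames so the sketch runs end to end.
    return np.zeros((n_frames, 512, 512, 3), dtype=np.uint8)


if __name__ == "__main__":
    dummy = TalkingHeadRequest(
        face_image=np.zeros((512, 512, 3), dtype=np.uint8),
        speech_audio=np.zeros(16_000 * 3, dtype=np.float32),  # 3 s of silence
    )
    video = generate_talking_head(dummy)
    print(video.shape)  # (120, 512, 512, 3) at 40 fps
```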

Microsoft claims the model significantly outperforms previous speech animation methods in terms of realism, expressiveness, and efficiency. AI research efforts to animate a single photo of a person or character extend back at least a few years, but recently researchers have been working on automatically synchronising the generated video to an audio track. In February, an AI model called EMO (Emote Portrait Alive) from Alibaba’s Institute for Intelligent Computing research group made waves with an approach similar to VASA-1’s that can automatically sync an animated photo to a provided audio track, a capability the group calls “Audio2Video.” Microsoft researchers trained VASA-1 on the VoxCeleb2 dataset, which contains over 1 million utterances from 6,112 celebrities, extracted from videos uploaded to YouTube. VASA-1 can reportedly generate videos at 512×512 pixel resolution and up to 40 frames per second with minimal latency, potentially allowing for real-time applications like video conferencing.
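The real-time figure implies a budget of roughly 1000 / 40 = 25 ms per generated frame. The sketch below is a generic streaming skeleton, written under the assumption of one output frame per incoming audio chunk, that simply checks each frame against that budget; it is not Microsoft’s pipeline, and the model call is a stand-in.

```python
# Back-of-the-envelope check of the real-time claim: at 40 frames per second
# the generator has at most 1000 / 40 = 25 ms to produce each 512x512 frame.
import time

TARGET_FPS = 40
FRAME_BUDGET_S = 1.0 / TARGET_FPS  # 0.025 s per frame


def synthesize_frame(audio_chunk: bytes) -> bytes:
    """Stand-in for the per-frame generation step (the model call would go here)."""
    return b"\x00" * (512 * 512 * 3)


def stream_frames(audio_chunks):
    """Yield one frame per audio chunk, flagging frames that miss the budget."""
    for chunk in audio_chunks:
        start = time.perf_counter()
        frame = synthesize_frame(chunk)
        elapsed = time.perf_counter() - start
        if elapsed > FRAME_BUDGET_S:
            print(f"frame took {elapsed * 1000:.1f} ms, over the 25 ms budget")
        yield frame


if __name__ == "__main__":
    # One second of audio split into 40 chunks, one per output frame.
    chunks = [b"audio"] * TARGET_FPS
    print(sum(1 for _ in stream_frames(chunks)), "frames generated")
```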

To demonstrate the model, Microsoft created a VASA-1 research page featuring many sample videos of the tool in action, including people singing and speaking in sync with pre-recorded audio tracks. The samples show how the model can be controlled to express different moods or change its eye gaze, and they also include some more fanciful generations, such as the Mona Lisa rapping to an audio track of Anne Hathaway performing a “Paparazzi” rap on Conan O’Brien’s show. For privacy reasons, each example photo on the page was AI-generated by StyleGAN2 or DALL-E 3, aside from the Mona Lisa. However, the technique could equally apply to photos of real people, although it is likely to work better when a person resembles a celebrity present in the training dataset. The researchers emphasise that deepfaking real humans is not their intention and that they are exploring visual affective skill generation for virtual, interactive characters.
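As a rough sketch of how such optional controls (mood, eye gaze) might be packaged as conditioning signals passed to a generator, the snippet below defines a small settings object; the parameter names and values are assumptions for illustration, not VASA-1’s actual interface.

```python
# Hypothetical control signals for steering a talking-face generator.
# Field names and allowed values are assumptions, not VASA-1's API.
from dataclasses import dataclass


@dataclass
class GenerationControls:
    gaze_direction: str = "camera"   # e.g. "camera", "left", "right", "up"
    emotion: str = "neutral"         # e.g. "neutral", "happy", "angry"
    head_distance: float = 1.0       # relative distance from the camera


def build_condition_vector(controls: GenerationControls) -> dict:
    """Pack the user-facing controls into the conditioning dict a generator might consume."""
    return {
        "gaze": controls.gaze_direction,
        "emotion": controls.emotion,
        "distance": controls.head_distance,
    }


if __name__ == "__main__":
    print(build_condition_vector(GenerationControls(gaze_direction="left", emotion="happy")))
```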

The potential positive applications of VASA-1 include enhancing educational equity, improving accessibility, and providing therapeutic companionship. However, the technology could also be misused, allowing people to fake video chats or make real people appear to say things they never actually said, especially when paired with a cloned voice track. The researchers are aware of this risk and have not openly released the code that powers the model. They say they are opposed to creating misleading or harmful content and are interested in applying their technique to advance forgery detection. For now, the videos generated by this method still contain identifiable artefacts and fall short of the authenticity of real videos.

VASA-1 is only a research demonstration, but Microsoft is far from the only group developing similar technology. If the recent history of generative AI is any guide, it is likely only a matter of time before comparable technology becomes open source and freely available, and it will probably continue to improve in realism. The team behind VASA-1, comprising Sicheng Xu, Guojun Chen, Yu-Xiao Guo, Jiaolong Yang, Chong Li, Zhenyu Zang, Yizhong Zhang, Xin Tong, and Baining Guo, emphasises the need for responsible AI development. They have no plans to release an online demo, API, product, additional implementation details, or any related offerings until they are certain that the technology will be used responsibly and in accordance with proper regulations.

In another example of AI-based image manipulation becoming increasingly mainstream, Deep Nostalgia, a service from the genealogy site MyHeritage that animates old family photos, has gone viral on social media. Launched in late February, the service uses an AI technique called deep learning to automatically animate faces in photos uploaded to the system. Because of its ease of use and free trial, it quickly gained popularity on Twitter, where users uploaded animated versions of old family photos, celebrity pictures, and even drawings and illustrations.

Deep Nostalgia, like most “deep fakes,” is exceptionally good at smoothly animating features and expressions but can struggle to invent the detail it cannot see in the source photo, which produces a sense of the uncanny. Some people love the Deep Nostalgia feature and consider it magical, while others find it creepy and dislike it. The service is intended for nostalgic use, to bring beloved ancestors back to life. The driver videos deliberately don’t include speech, to prevent abuse such as the creation of deep fake videos of living people.
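As a rough illustration of how driver-video reenactment works in general, the sketch below transfers per-frame keypoint motion from a driver clip onto a still photo. The helper functions are placeholders and the whole thing is a conceptual sketch of the general technique, not MyHeritage’s actual pipeline.

```python
# Conceptual sketch of driver-based photo reenactment: copy the motion of the
# driver video's facial keypoints onto the keypoints of a single still photo.
import numpy as np


def detect_keypoints(frame: np.ndarray) -> np.ndarray:
    """Placeholder face-landmark detector returning N x 2 keypoints."""
    return np.zeros((68, 2), dtype=np.float32)


def warp_to_keypoints(photo: np.ndarray, src_kp: np.ndarray, dst_kp: np.ndarray) -> np.ndarray:
    """Placeholder warp: move the photo's facial keypoints toward dst_kp."""
    return photo  # a real system would warp and then refine with a generator


def reenact(photo: np.ndarray, driver_frames: list[np.ndarray]) -> list[np.ndarray]:
    """Animate a single photo by copying per-frame keypoint motion from the driver."""
    photo_kp = detect_keypoints(photo)
    reference_kp = detect_keypoints(driver_frames[0])
    output = []
    for frame in driver_frames:
        driver_kp = detect_keypoints(frame)
        motion = driver_kp - reference_kp  # driver motion relative to its first frame
        output.append(warp_to_keypoints(photo, photo_kp, photo_kp + motion))
    return output


if __name__ == "__main__":
    still = np.zeros((256, 256, 3), dtype=np.uint8)
    driver = [np.zeros((256, 256, 3), dtype=np.uint8) for _ in range(5)]
    print(len(reenact(still, driver)), "animated frames")
```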

Not every video created with Deep Nostalgia is elegantly animated or even good enough to be unsettling. An animated version of the infamous bust of Cristiano Ronaldo, for instance, is exactly as distressing as the static version. Three years ago, artificially producing a 15-second face-swap of Theresa May and Margaret Thatcher took several hours on a powerful desktop computer. Now, a similar effect can be achieved on a mobile phone with apps such as Snapchat, or given away for free as a promotion for a genealogy website. While the automatically produced videos of Deep Nostalgia are unlikely to fool anyone into thinking they are real footage, more careful application of the same technology can be very hard to distinguish from reality.

Tom Cruise seems to be a particular subject of choice for deep fakes. In 2019, a video clip went viral of comedian Bill Hader being morphed into the Hollywood star as he performed an impression on David Letterman’s show. Last month, a new TikTok account named @deeptomcruise racked up millions of views with a series of videos that are, it claims, deep fake versions of the actor talking to the camera. The Cruise fakes are so accurate that many programs designed to recognise manipulated media are unable to spot them.

The rapid advancements in AI-generated video technology bring both exciting possibilities and significant ethical concerns. As models like VASA-1 and services like Deep Nostalgia continue to develop and improve, it is crucial to consider their potential applications and the responsibilities that come with them. While these technologies can enhance creativity, accessibility, and education, they also pose risks of misuse that must be carefully managed to ensure they are used for the greater good.

For all my daily news and tips on AI and emerging technologies, sign up for my FREE newsletter at www.robotpigeon.be