Alibaba’s Institute for Intelligent Computing has developed EMO, an AI system that animates a single portrait photo into a realistic talking or singing video. The system synthesizes video directly from audio, without intermediate 3D models or facial landmarks. EMO is built on a diffusion model and was trained on more than 250 hours of talking-head video. It outperforms existing methods in video quality, identity preservation, and expressiveness. For singing input, it produces matching mouth shapes and facial expressions. Because the output length follows the length of the input audio, EMO can produce videos of arbitrary duration. Ethical concerns remain, however, about potential misuse of the technology, and the researchers plan to explore methods for detecting synthetic video.
