Unsupervised Any-to-Many Audiovisual Synthesis via Exemplar Autoencoders
Kangle Deng
Aayush Bansal
Deva Ramanan
[Demo]
[GitHub]
[Paper]
[arXiv]


Our approach translates input audio from any speaker to that of many known speakers. The top half shows an example of audio translation. With little modification, our approach can also generate video alongside audio from an input audio signal. The bottom half shows a variety of facial examples created using our method for audio-video generation.


Abstract

We present an unsupervised approach that enables us to convert the speech input of any one individual to an output set of potentially infinitely many speakers, i.e., one can stand in front of a microphone and make their favorite celebrity say the same words. Our approach builds on simple autoencoders that project out-of-sample data to the distribution of the training set (motivated by PCA/linear autoencoders). We use an exemplar autoencoder to learn the voice and specific style (emotions and ambiance) of a target speaker. In contrast to existing methods, the proposed approach can be easily extended to an arbitrarily large number of speakers in very little time, using only two to three minutes of audio data from each speaker. We also exhibit the usefulness of our approach for generating video from audio signals and vice versa.
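The PCA/linear-autoencoder intuition above can be illustrated with a small numpy sketch (all data here is synthetic and purely illustrative, not the paper's actual spectrogram features or network): a linear autoencoder fit to one speaker's data is equivalent to PCA, and its encode-decode map projects any input, including out-of-sample audio from a different speaker, onto the target speaker's subspace.

```python
import numpy as np

rng = np.random.default_rng(0)

# "Target speaker" training data: points lying near a 2-D subspace of R^10.
# (Synthetic stand-in for the speaker's audio features.)
basis = rng.normal(size=(2, 10))
train = rng.normal(size=(500, 2)) @ basis + 0.01 * rng.normal(size=(500, 10))

# The optimal linear autoencoder is PCA: encode/decode with the top-k
# principal directions of the training set.
mean = train.mean(axis=0)
_, _, vt = np.linalg.svd(train - mean, full_matrices=False)
components = vt[:2]                       # (k, d) orthonormal rows

def project(x):
    """Encode then decode: the reconstruction lies in the speaker's subspace."""
    code = (x - mean) @ components.T      # encode to k dims
    return mean + code @ components       # decode back to d dims

# An out-of-sample point (a "different speaker") gets pulled onto the subspace.
query = rng.normal(size=10) * 5
recon = project(query)
# Projecting again changes nothing: recon already sits in the learned subspace.
assert np.allclose(project(recon), recon)
```

The paper's exemplar autoencoders replace this linear map with a deep network trained on a single speaker, but the projection property is the same: whatever audio goes in, the reconstruction lands in that speaker's distribution.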


K. Deng, A. Bansal, D. Ramanan
Unsupervised Any-to-Many Audiovisual Synthesis via Exemplar Autoencoders.
arXiv, 2020.

[Bibtex]

Summary Video




Demo Video




Voice Conversion


Target Speaker
&
Reference
Input-1 Input-2 Input-3
Barack Obama
Bill Clinton
Carl Sagan
Claude Shannon
Michelle Obama
Nelson Mandela
Oprah Winfrey
Richard Hamming
Stephen Hawking
Takeo Kanade
Theresa May

Audiovisual Synthesis

We show audiovisual results based on audio input from random speakers. We use publicly available videos of various public figures for the experiment, in particular from the VoxCeleb2 dataset.


Input
Output


Audio-to-Video Synthesis

If we constrain the input to the training speaker even at inference time, the network becomes capable of audio-to-video synthesis for a specific speaker. This application is useful for restoring video records of famous historical figures.


We take Winston Churchill's famous "end of the beginning" speech as an example. Only audio recordings of this speech exist; there is no video. With this technology, we can reconstruct a talking-head video of Churchill from the speech audio alone.




Acknowledgements

We thank the authors of AutoVC for their related work. We thank the larger community for collecting and uploading the videos on web. We thank members of Deva's Lab for helpful discussions. Finally, we thank the authors of Colorful Image Colorization for this webpage design.