Demo Page for "Image-to-Speech Synthesis without Text"

Abstract: In this paper we present the first model for directly synthesizing fluent, natural-sounding spoken audio captions for images that does not require natural language text as an intermediate representation or source of supervision. Instead, we connect the image captioning module and the speech synthesis module with a set of discrete, sub-word speech units that are discovered with a self-supervised visual grounding task. We conduct experiments on the Flickr8k spoken caption dataset in addition to a novel corpus of spoken audio captions collected for the popular MSCOCO dataset, demonstrating that our generated captions also capture diverse visual semantics of the images they describe. We investigate several different intermediate speech representations, and empirically find that the representation must satisfy several important properties in order to work well.

[Paper PDF]
[Appendix PDF]: Detailed dataset descriptions, experimental setups, and additional experimental results.

Samples from the SpokenCOCO dataset
- Samples from the collected SpokenCOCO dataset, including reference text, collected audio, and the VQ3, VQ2, and WVQ units inferred from the audio.
Comparing the Image-to-Speech model with traditional Image-to-Text models
- Beam search decoding results from three SAT-FT models that output words, characters, and VQ3 units, respectively (bottom 3 rows in Table 2).
Sampling-based evaluation for Image-to-Speech models
- Beam search decoding and sampling results from the SAT-FT model trained on VQ3 units. Five samples are drawn for each image, with temperature=0.4 and top-k=3 (Figure 5).
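
For readers unfamiliar with the two sampling parameters named above, the following is a minimal sketch of one decoding step with temperature scaling and top-k truncation. It is illustrative only and is not taken from the paper's code: the function name `sample_top_k`, the toy logits, and the use of Python's standard library instead of a deep learning framework are all assumptions.

```python
import math
import random


def sample_top_k(logits, temperature=0.4, k=3, rng=None):
    """Draw one unit index from `logits` (hypothetical helper, not the paper's code).

    Only the k highest-scoring candidates are kept; their scores are divided by
    `temperature` before the softmax. k=None keeps every candidate.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    if rng is None:
        rng = random.Random()
    k = len(logits) if k is None else min(k, len(logits))
    # Indices of the k largest logits.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Temperature-scaled softmax weights; subtract the max for numerical stability.
    scaled = [logits[i] / temperature for i in top]
    peak = max(scaled)
    weights = [math.exp(s - peak) for s in scaled]
    return rng.choices(top, weights=weights, k=1)[0]


# Toy scores over five hypothetical unit IDs (0..4).
toy_logits = [2.0, 1.0, 0.5, -1.0, -3.0]
draws = [sample_top_k(toy_logits, rng=random.Random(seed)) for seed in range(5)]
```

Lower temperatures concentrate probability on the highest-scoring unit, and smaller k removes low-probability units from consideration entirely, so with k=3 the draws above can only be indices 0, 1, or 2.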
Disentangled voice control for Image-to-Speech models
- Synthesizing the same unit sequences with different TTS models conditioned on different speakers to demonstrate disentangled voice control. Unit sequences are generated with beam search decoding using the SAT-FT model trained on VQ3 units (Table 4).
Comparing VQ3 with alternative learned units on beam search
- Beam search decoding results from the four SAT models that output VQ3, VQ2, WaveNet-VQ, and VQ3 without RLE, respectively. WaveNet-VQ and VQ3 without RLE consistently fail to produce reasonable captions (Figure 3, leftmost column).
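
The sketch below shows what "with RLE" versus "without RLE" could mean for a unit sequence. It assumes that RLE stands for run-length encoding and that it is used to merge consecutive repeats of the same unit; this page does not spell that out, so treat it as an assumption. The function names and the toy unit IDs are hypothetical.

```python
from itertools import groupby


def collapse_repeats(units):
    """Keep one copy of each run of identical consecutive units."""
    return [unit for unit, _ in groupby(units)]


def run_length_encode(units):
    """Return (unit, run_length) pairs, which also preserves duration information."""
    return [(unit, len(list(run))) for unit, run in groupby(units)]


# Toy frame-level unit sequence (made-up IDs).
frame_units = [5, 5, 5, 12, 12, 7, 5, 5]

print(collapse_repeats(frame_units))   # [5, 12, 7, 5]
print(run_length_encode(frame_units))  # [(5, 3), (12, 2), (7, 1), (5, 2)]
```

Under this reading, a model trained "without RLE" would have to predict the longer frame-level sequence, repeats included.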
Comparing VQ3 with alternative learned units on sampling
- Sampling results from the four SAT models that output VQ3, VQ2, WaveNet-VQ, and VQ3 without RLE, respectively. For each unit type, the (temperature, top-k) configuration that achieves the highest SPICE score is shown: (0.1, all) for VQ3 and VQ2, (1.0, 10) for WaveNet-VQ, and (1.0, 3) for VQ3 without RLE (Figure 3, right three columns).