SPTTS: Parallel Speech Synthesis without Extra Aligner Model

Zeqing Zhao*, Xi Chen*, Hui Liu, Xuyang Wang, Lin Yang, Junjie Wang

AI Lab, Lenovo Research, Beijing, China

* Equal contribution

Abstract

In this work, we develop a novel non-autoregressive TTS model that predicts all mel-spectrogram frames in parallel. Unlike previous non-autoregressive TTS methods, which typically require an external aligner implemented as an attention-based autoregressive model, our model can be optimized jointly without a sophisticated external aligner. Motivated by CTC-based speech recognition, which is a simple and effective way to obtain frame-level forced alignment between speech and text, our main idea is to treat aligner learning in TTS as a CTC-based, speech-recognition-like task. Specifically, our model learns the alignment generator with the CTC loss, which provides supervision for the duration predictor on the fly. In this way, we obtain a one-stage TTS system by optimizing the aligner jointly with the feed-forward transformer. At inference time, the aligner is removed and the duration predictor supplies the duration sequence used to synthesize speech. To evaluate our method, we conduct extensive experiments on an open-source standard Mandarin Chinese speech dataset. The results show that our method achieves competitive performance compared with counterpart models (FastSpeech, a well-known non-autoregressive model with an extra aligner) in terms of synthesized speech quality and robustness.
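
The idea of turning a frame-level alignment into duration targets can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a blank-free simplification of CTC forced alignment, in which a hypothetical matrix log_probs[t][n] scores frame t against phone n and the best monotonic path is found by dynamic programming. The function name forced_align_durations and the toy scores are illustrative assumptions.

```python
def forced_align_durations(log_probs):
    """Return per-phone frame counts for the best monotonic alignment.

    log_probs[t][n] is an assumed log-score of frame t belonging to phone n.
    Simplification: no CTC blank symbol; every phone gets at least one frame.
    """
    num_frames = len(log_probs)
    num_phones = len(log_probs[0])
    if num_frames < num_phones:
        raise ValueError("need at least one frame per phone")

    neg_inf = float("-inf")
    score = [[neg_inf] * num_phones for _ in range(num_frames)]
    # back[t][n] is 1 if the best path entered phone n at frame t, else 0.
    back = [[0] * num_phones for _ in range(num_frames)]
    score[0][0] = log_probs[0][0]

    for t in range(1, num_frames):
        for n in range(min(t + 1, num_phones)):
            stay = score[t - 1][n]
            move = score[t - 1][n - 1] if n > 0 else neg_inf
            if move > stay:
                score[t][n] = move + log_probs[t][n]
                back[t][n] = 1
            else:
                score[t][n] = stay + log_probs[t][n]

    # Backtrack from the last frame of the last phone, counting frames.
    durations = [0] * num_phones
    n = num_phones - 1
    for t in range(num_frames - 1, -1, -1):
        durations[n] += 1
        if back[t][n] == 1:
            n -= 1
    return durations


# Toy scores: 7 frames, 3 phones, frames clearly belong to phones 0,0,1,1,1,2,2.
HIGH, LOW = -0.1, -5.0
frame_phone = [0, 0, 1, 1, 1, 2, 2]
log_probs = [[HIGH if n == p else LOW for n in range(3)] for p in frame_phone]
durations = forced_align_durations(log_probs)
print(durations)  # [2, 3, 2]
```

In a setup like the one described above, such per-phone frame counts would serve as training targets for the duration predictor; the aligner that produced them is not needed at inference.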

Audio Samples

All audio samples use Multiband-MelGAN (MB-MelGAN) as the vocoder.

Audio Quality

吴云宝奶奶挑选枇杷。 (Grandma Wu Yunbao picks out loquats.)

GT

GT(MB-MelGAN)

Tacotron2

FastSpeech

AlignTTS

GlowTTS

SPTTS

鼓起的兴奋一下子消散在无垠夜空里。 (The excitement that had built up vanished at once into the boundless night sky.)

GT

GT(MB-MelGAN)

Tacotron2

FastSpeech

AlignTTS

GlowTTS

SPTTS

但是丁可儿一直沉默，对出走的事情只字不提。 (But Ding Ke'er stayed silent and said not a word about running away.)

GT

GT(MB-MelGAN)

Tacotron2

FastSpeech

AlignTTS

GlowTTS

SPTTS

成功是一种了不起的除臭剂。 (Success is a great deodorant.)

GT

GT(MB-MelGAN)

Tacotron2

FastSpeech

AlignTTS

GlowTTS

SPTTS

我一切都好，您大可不必操心。 (I am doing fine; you really need not worry.)

GT

GT(MB-MelGAN)

Tacotron2

FastSpeech

AlignTTS

GlowTTS

SPTTS

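Parallel vs Separate

考生有压力，状元也不例外。 (Exam candidates are under pressure, and top scorers are no exception.)

SPTTS

SPTTS-Separate

驴子越跑越快，越跑越疯狂。 (The donkey ran faster and faster, and more and more wildly.)

SPTTS

SPTTS-Separate

难道不是围巾、围裙和围嘴吗？ (Isn't it scarves, aprons, and bibs?)

SPTTS

SPTTS-Separate

鼓起的兴奋一下子消散在无垠夜空里。 (The excitement that had built up vanished at once into the boundless night sky.)

SPTTS

SPTTS-Separate

因为物价仍在缓慢上升，而且还没有出现拐点。 (Because prices are still rising slowly, and no turning point has appeared yet.)

SPTTS

SPTTS-Separate

积极开辟办证绿色通道。 (Actively open up a fast-track channel for processing certificates.)

SPTTS

SPTTS-Separate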

Voice Speed Control

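Here, ratio is the coefficient by which the predicted phone duration sequence is multiplied, so ratio > 1.0 lengthens every phone (slower speech) and ratio < 1.0 shortens it (faster speech).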
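
The effect of the ratio can be sketched in a few lines. This is an illustrative example rather than the actual SPTTS or FastSpeech code: the function names, the rounding rule, the one-frame minimum per phone, and the toy durations are all assumptions, and phone labels stand in for the encoder hidden states that a real model would expand.

```python
def scale_durations(durations, ratio):
    """Multiply each phone duration (in frames) by ratio and round.

    Assumption: every phone keeps at least one frame after scaling.
    """
    if ratio <= 0:
        raise ValueError("ratio must be positive")
    return [max(1, round(d * ratio)) for d in durations]


def length_regulate(phones, durations):
    """Repeat each phone entry according to its duration in frames."""
    expanded = []
    for phone, duration in zip(phones, durations):
        expanded.extend([phone] * duration)
    return expanded


# Hypothetical phones and predicted durations (in mel frames).
phones = ["n", "i", "h", "ao"]
durations = [4, 10, 6, 12]

for ratio in (0.8, 1.0, 1.2):
    scaled = scale_durations(durations, ratio)
    frames = length_regulate(phones, scaled)
    print(ratio, scaled, len(frames))
# 0.8 [3, 8, 5, 10] 26
# 1.0 [4, 10, 6, 12] 32
# 1.2 [5, 12, 7, 14] 38
```

Because the decoder produces one mel frame per expanded position, a larger ratio yields more frames and therefore slower speech for the same text.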

邓小平与撒切尔会晤。 (Deng Xiaoping met with Thatcher.)

SPTTS ratio=0.8

SPTTS ratio=0.9

SPTTS ratio=1.0

SPTTS ratio=1.1

SPTTS ratio=1.2

FastSpeech ratio=0.8

FastSpeech ratio=0.9

FastSpeech ratio=1.0

FastSpeech ratio=1.1

FastSpeech ratio=1.2

成荫挑选我演赵玉敏。 (Cheng Yin chose me to play Zhao Yumin.)

SPTTS ratio=0.8

SPTTS ratio=0.9

SPTTS ratio=1.0

SPTTS ratio=1.1

SPTTS ratio=1.2

FastSpeech ratio=0.8

FastSpeech ratio=0.9

FastSpeech ratio=1.0

FastSpeech ratio=1.1

FastSpeech ratio=1.2

小文杰枯瘦如柴。 (Little Wenjie is as thin as a stick.)

SPTTS ratio=0.8

SPTTS ratio=0.9

SPTTS ratio=1.0

SPTTS ratio=1.1

SPTTS ratio=1.2

FastSpeech ratio=0.8

FastSpeech ratio=0.9

FastSpeech ratio=1.0

FastSpeech ratio=1.1

FastSpeech ratio=1.2

这是北美地区有史以来最大数额的彩票总奖金。 (This is the largest total lottery jackpot in North American history.)

SPTTS ratio=0.8

SPTTS ratio=0.9

SPTTS ratio=1.0

SPTTS ratio=1.1

SPTTS ratio=1.2

FastSpeech ratio=0.8

FastSpeech ratio=0.9

FastSpeech ratio=1.0

FastSpeech ratio=1.1

FastSpeech ratio=1.2

村庄历史最早记载见于一四六一年。 (The earliest record of the village's history dates from 1461.)

SPTTS ratio=0.8

SPTTS ratio=0.9

SPTTS ratio=1.0

SPTTS ratio=1.1

SPTTS ratio=1.2

FastSpeech ratio=0.8

FastSpeech ratio=0.9

FastSpeech ratio=1.0

FastSpeech ratio=1.1

FastSpeech ratio=1.2