Text to Speech Synthesis in Celebrity’s Voice

  • Ajinkya P. Gaddime Electronics and Telecommunication Department, All India Shri Shivaji Memorial Society's College of Engineering, Pune, Maharashtra, India
  • Dhananjay P. Mane Electronics and Telecommunication Department, All India Shri Shivaji Memorial Society's College of Engineering, Pune, Maharashtra, India
  • Ruchita K. Vehale Electronics and Telecommunication Department, All India Shri Shivaji Memorial Society's College of Engineering, Pune, Maharashtra, India
  • Vaishnavi S. Khawale Electronics and Telecommunication Department, All India Shri Shivaji Memorial Society's College of Engineering, Pune, Maharashtra, India
  • D. G. Bhalke Electronics and Telecommunication Department, All India Shri Shivaji Memorial Society's College of Engineering, Pune, Maharashtra, India
Keywords: Acoustic distance measure (ADM), Artificial neural network (ANN), Convolutional neural networks (CNNs), Deep neural network (DNN), Recurrent neural networks (RNNs), Text-to-speech (TTS)

Abstract

This paper presents a text-to-speech synthesis system that uses a neural network architecture to generate speech directly from text in a celebrity’s voice. The system consists of a recurrent sequence-to-sequence feature prediction network that maps character embeddings to mel-scale spectrograms, followed by a modified WaveNet model that acts as a vocoder to produce time-domain waveforms from those spectrograms. The paper evaluates the impact of using mel spectrograms, rather than linguistic, duration, and F0 features, as the conditioning input to WaveNet. It also shows that this compact acoustic intermediate representation allows a significant reduction in the size of the WaveNet architecture. Using this technique, we aim to modulate the output of the vocoder according to the frequency and pitch of a specific celebrity. A database of prerecorded voice is collected for the unit selection method of concatenative synthesis. The work includes creating a database of an Indian celebrity’s voice, then clustering, indexing, and synthesizing from it to produce voice output for the input text. It also covers text normalization, which includes abbreviations, acronyms, and linguistic analysis. The system outputs phonemic features such as vowel length, vowel height, frontness, consonant voicing, consonant poi, and position in the syllable and word.
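
The text normalization step mentioned in the abstract could look roughly like the following minimal sketch. The abbreviation table, the rule that all-caps acronyms are read letter by letter, and the function name `normalize_text` are assumptions made for illustration; the abstract does not describe the authors' actual rules.

```python
import re

# Hypothetical abbreviation table; the paper does not list its entries.
ABBREVIATIONS = {
    "dr.": "doctor",
    "mr.": "mister",
    "st.": "street",
    "etc.": "et cetera",
}


def normalize_text(text: str) -> str:
    """Expand known abbreviations and spell out all-caps acronyms."""
    words = []
    for token in text.split():
        key = token.lower()
        if key in ABBREVIATIONS:
            # Replace the abbreviation with its spoken form.
            words.append(ABBREVIATIONS[key])
        elif re.fullmatch(r"[A-Z]{2,}", token):
            # Assumed rule: acronyms are read letter by letter.
            words.append(" ".join(token))
        else:
            words.append(token.lower())
    return " ".join(words)


print(normalize_text("Dr. Rao works at TTS labs"))
# doctor rao works at T T S labs
```

A real front end would also need to handle numbers, dates, and punctuation attached to tokens, which this sketch leaves out.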

Published
2020-11-30