
Dr. Frank Soong from Microsoft Asia visit CSLT

Search for the “Elementary Particles” in Human Speech

– Rendering a monolingual speaker’s speech in a different language?

In this talk, we first raise an intriguing question: can we find some “elementary particles” in a monolingual speaker’s speech and use them to render his/her voice in a different language? A positive answer, and the “elementary particles” found, can enable many useful applications, e.g., mixed-code TTS, language learning, and speech-to-speech translation. We try to answer the question by concentrating on how to train a TTS system in a different language with speech collected from a monolingual speaker. Additionally, a speech corpus in the targeted language is recorded by a reference speaker. We then use our “trajectory tiling algorithm,” invented for synthesizing high-quality unit-selection TTS, to “tile” the trajectories of all sentences in the reference speaker’s corpus with the most appropriate speech segments from the monolingual speaker’s data. To make the tiling proper across the two different speakers (reference and monolingual), the speaker difference first needs to be equalized with appropriate vocal tract length normalization, e.g., a bilinear warping function or “formant” mapping. “Elementary particles” of different durations, taken from the monolingual speaker’s recorded speech, are used to tile the warped trajectories of the reference speaker in the target language, and the tiled sentences are then used to train a new HMM-based TTS of the monolingual speaker in the reference speaker’s (target) language. It will be shown that the generated TTS speech in the target language is highly intelligible and of high quality, and it also sounds similar to the original speaker. Some preliminary results also show that training a speech recognizer with speech data from different languages can improve ASR performance in each individual language.
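The vocal tract length normalization mentioned above is often realized with a first-order bilinear (all-pass) frequency warping. As a minimal illustration, the sketch below implements the standard bilinear warping curve, where the single warping factor `alpha` (a hypothetical value here, normally estimated per speaker pair) controls how the frequency axis is stretched or compressed; the exact estimation procedure used in the talk is not specified.

```python
import numpy as np

def bilinear_warp(omega, alpha):
    """Bilinear (first-order all-pass) frequency warping.

    omega: normalized angular frequency in [0, pi]
    alpha: warping factor in (-1, 1); alpha = 0 is the identity warp
    Returns the warped frequency; endpoints 0 and pi are fixed points.
    """
    return omega + 2.0 * np.arctan(
        alpha * np.sin(omega) / (1.0 - alpha * np.cos(omega))
    )
```

With `alpha > 0` every interior frequency is shifted upward (and downward for `alpha < 0`), while 0 and pi stay fixed, which is why a single parameter suffices to compensate for a global vocal-tract-length difference between two speakers.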
Additionally, the mouth shapes of a monolingual speaker have been found adequate for rendering the lip movements of his/her talking head in different languages. Various demos will be shown to illustrate our findings.
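The core of the tiling step described in the abstract is matching each piece of the (warped) reference trajectory against the monolingual speaker’s segment inventory. The toy sketch below is only a stand-in for the actual trajectory tiling algorithm, which is not detailed in the abstract: it greedily replaces each reference frame with the nearest inventory frame by Euclidean distance, whereas the real algorithm operates on variable-length segments with richer costs.

```python
import numpy as np

def tile_trajectory(reference, inventory):
    """Greedy frame-level tiling (toy version).

    reference: (T, D) array of warped reference-speaker feature frames
    inventory: (N, D) array of candidate frames from the monolingual speaker
    Returns a (T, D) array where each reference frame is replaced by its
    nearest inventory frame, i.e., the "tiled" trajectory.
    """
    tiled = []
    for frame in reference:
        dists = np.linalg.norm(inventory - frame, axis=1)  # distance to every candidate
        tiled.append(inventory[np.argmin(dists)])          # keep the closest one
    return np.array(tiled)
```

The tiled trajectories, built entirely from the monolingual speaker’s own material, are what make the subsequently trained HMM-based TTS sound like that speaker even in the new language.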

Frank K. Soong (宋謌平)

Principal Researcher

Speech Group, Microsoft Research Asia (MSRA)

Frank K. Soong is a Principal Researcher/Research Manager in the Speech Group, Microsoft Research Asia (MSRA), Beijing, China, where he works on fundamental research on speech and its practical applications. His professional research career spans more than 30 years, first with Bell Labs, US, then with ATR, Japan, before joining MSRA in 2004. At Bell Labs, he worked on stochastic modeling of speech signals, optimal decoder algorithms, speech analysis and coding, and speech and speaker recognition. He was responsible for developing the recognition algorithm that was turned into voice-activated mobile phone products rated by Mobile Office Magazine (Apr. 1993) as “outstandingly the best”. He is a co-recipient of the Bell Labs President Gold Award for developing the Bell Labs Automatic Speech Recognition (BLASR) software package.

He has served as a member of the Speech and Language Technical Committee of the IEEE Signal Processing Society and in other society functions, including as Associate Editor of the IEEE Transactions on Speech and Audio Processing and as chair of IEEE workshops. He has published extensively, with more than 200 papers, and co-edited a widely used reference book, Automatic Speech and Speaker Recognition: Advanced Topics (Kluwer, 1996). He is a visiting professor at the Chinese University of Hong Kong (CUHK) and a few other top-rated universities in China, and is also the co-Director of the MSRA-CUHK Joint Research Lab. He received his BS, MS, and PhD degrees from National Taiwan University, the University of Rhode Island, and Stanford University, respectively, all in Electrical Engineering. He is an IEEE Fellow.