Text-to-speech conversion systems are used to convert text into speech that comes as close as possible to the speech of a native speaker of the language reading that text. High-quality text-to-speech conversion has been a focus of research for more than two decades, and the converters available today reflect that effort. The applications are numerous: telephone response systems, computer interfaces, reading aids for the blind and other assistive systems for the handicapped, as well as language-learning software and audiobooks.
A complementary part of text-to-speech conversion is automatic speech recognition, which converts audio to text and can play a large role in practical applications of such systems. Designing a text-to-speech conversion system, however, depends heavily on the linguistic structure of the target language, and the numerous dialects of most languages make the task even harder. Developing a text-to-speech converter therefore demands substantial research into the structure and linguistics of the language in question. In this article, we take a look at how text-to-speech conversion technology has been developing and at its major aspects.
Approaches to Text-to-Speech Conversion
The main approaches to text-to-speech synthesis are articulatory, concatenative, and formant synthesis. Although each takes a different route, all three have been equally important for the development of text-to-speech conversion systems. Here is a brief explanation of each:
- Articulatory synthesis models are based on the human vocal tract and the articulation processes occurring there. The shape of the vocal tract can be changed in a number of ways by changing the position of the tongue, jaw and lips. Speech is created by digitally simulating the flow of air through the vocal tract.
- Concatenative synthesis works by stringing together short samples of recorded sound. It generates user-specified sequences of sound from a database built from recordings of other sequences.
- Formant synthesis is a bit more difficult to explain. Part of what makes the timbre of a voice consistent over a wide range of frequencies is the presence of fixed frequency peaks called formants. These peaks stay at roughly the same frequencies, independent of the pitch being produced. Many other factors go into synthesizing a realistic-sounding timbre, but the use of formants is one way to get a reasonably accurate result; a minimal illustration follows this list.
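To make the formant idea concrete, here is a minimal sketch, assuming Python with NumPy and SciPy: a periodic glottal source is passed through a cascade of second-order resonators placed at illustrative formant frequencies for a vowel like /a/. The frequencies, bandwidths, and fundamental frequency are placeholder values chosen for illustration, not parameters of any particular synthesizer.

```python
import numpy as np
from scipy.signal import lfilter

fs = 16000        # sampling rate (Hz)
f0 = 120          # fundamental frequency of the glottal source (Hz)
dur = 0.5         # duration in seconds

# Glottal source approximated as an impulse train at the fundamental frequency
n = int(fs * dur)
source = np.zeros(n)
source[::fs // f0] = 1.0

# Illustrative formant frequencies and bandwidths (Hz) for a vowel like /a/
formants = [(730, 90), (1090, 110), (2440, 120)]

signal = source
for freq, bw in formants:
    # Second-order digital resonator centred on the formant frequency
    r = np.exp(-np.pi * bw / fs)
    theta = 2 * np.pi * freq / fs
    a = [1.0, -2.0 * r * np.cos(theta), r ** 2]
    b = [sum(a)]                       # normalise the DC gain to 1
    signal = lfilter(b, a, signal)

signal /= np.max(np.abs(signal))       # scale to [-1, 1] before playback or saving
```

Changing the formant table while keeping the same source changes the perceived vowel, which is exactly the property formant synthesizers exploit.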
Problems With Text-to-Speech Conversion
The problems with text-to-speech conversion are many, but two stand out. The first is text analysis: converting an input text into a linguistic representation. This representation is usually a complex structure that includes information on the grammatical categories of words, accentual or tonal properties of words, prosodic phrase information, and of course word pronunciation. The second is the synthesis proper, that is, generating a (digital) speech waveform from the analyzed text.
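As a rough illustration of the text-analysis step, the sketch below tokenizes a sentence, expands a few abbreviations and numerals, and looks up pronunciations in a toy lexicon. The expansion table, the lexicon, and the ARPAbet-style transcriptions are tiny hypothetical stand-ins; real systems use large lexica plus letter-to-sound rules and much richer linguistic annotation.

```python
import re

# Toy normalisation table and pronunciation lexicon (hypothetical, for illustration)
EXPANSIONS = {"dr.": "doctor", "st.": "street", "10": "ten"}
LEXICON = {
    "doctor": "D AA1 K T ER0",
    "smith":  "S M IH1 TH",
    "lives":  "L IH1 V Z",
    "at":     "AE1 T",
    "ten":    "T EH1 N",
    "main":   "M EY1 N",
    "street": "S T R IY1 T",
}

def analyse(text):
    """Tokenise, expand abbreviations and numerals, then look up pronunciations."""
    tokens = re.findall(r"[\w.]+", text.lower())
    words = [EXPANSIONS.get(t, t) for t in tokens]
    return [(w, LEXICON.get(w, "<unknown>")) for w in words]

print(analyse("Dr. Smith lives at 10 Main St."))
# [('doctor', 'D AA1 K T ER0'), ('smith', 'S M IH1 TH'), ...]
```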
Qualities of a High-Performing Text-to-Speech Conversion System
One important quality of a text-to-speech conversion system is multilingualism. As we already mentioned, to develop text-to-speech conversion for different languages, it is necessary to have a good understanding of the linguistic properties of each target language. In other words, language-specific knowledge is required in the entire development process of a text-to-speech conversion system.
Robustness is another quality to look for.
In an audio-to-text converter, it refers to the capability of maintaining reasonable performance for different users under different conditions. In practical applications, a speaker's voice characteristics and speaking style may change substantially, and channel distortion and environmental noise can further complicate speech-to-text conversion. A robust automatic speech recognition system is expected to anticipate and adapt to changes in these conditions. For a text-to-speech system, the robustness requirements concern its capability of processing text with unrestricted content and format, and the perceived naturalness of the output speech.
Speaking of naturalness
It is one of the major issues with synthetic speech. Speech is perceived as unnatural when there is an obvious lack of perceptual continuity, often because the transitions between syllables are missing; in natural human speech, adjacent syllables flow smoothly into one another. The naturalness of speech is determined by several aspects, among which prosody is generally regarded as predominant. Prosodic parameters refer to the overlaid linguistic functions of the inherent acoustic features of speech sound segments, including timing, tone, intonation, and stress. Timing is realized through duration, tone and intonation through the fundamental frequency, while stress is considered to be formed by the collective realization of duration, frequency, and intensity. Proper control of the prosodic parameters is the key to making synthetic speech sound natural.
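As a minimal sketch of what controlling prosodic parameters can mean in practice, the snippet below (assuming Python with NumPy) computes per-syllable pitch and duration targets with a gently falling contour and lengthening of the phrase-final syllable. The function name, base values, and scaling factors are made up for illustration and do not come from any particular system.

```python
import numpy as np

def prosody_targets(n_syllables, base_f0=120.0, base_dur=0.18):
    """Toy prosody model: a falling pitch contour plus phrase-final lengthening."""
    f0 = base_f0 * np.linspace(1.0, 0.85, n_syllables)   # target F0 (Hz) per syllable
    dur = np.full(n_syllables, base_dur)                  # target duration (s) per syllable
    dur[-1] *= 1.4                                        # lengthen the last syllable
    return f0, dur

f0, dur = prosody_targets(6)
print(np.round(f0, 1))   # [120.  116.4 112.8 109.2 105.6 102. ]
print(np.round(dur, 2))  # [0.18 0.18 0.18 0.18 0.18 0.25]
```

A synthesizer would then use such targets to modify the duration and pitch of the corresponding speech segments, for example with pitch-synchronous overlap-add (PSOLA) style techniques.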
And finally, the usability of the text-to-speech conversion system
Although existing spoken language translation systems are not perfect, they find many meaningful applications. After all, many hi-tech companies have a strong interest in using spoken language translation in their products or services. However, most of them do not specialize in speech signal processing and acoustics, let alone linguistics. It therefore becomes increasingly important to make these systems easy to use without requiring expert knowledge. It is also desirable to enable automatic system reconfiguration and adaptation.
Text-to-Speech Conversion Has Come A Long Way
There has been substantial progress in text-to-speech conversion systems over the last two decades. We have progressed from systems that could transform annotated phonetic transcriptions into barely intelligible speech to systems that can take written text in normal orthography and transform it into speech that is highly intelligible, though certainly still mechanical in quality. To a large extent, the problems that remain to be solved for Asian languages are the same as those faced by synthesizers for any language.