Speech Synthesis Presentation - A short Introduction and Historical Review (19.10.2021)

Page 1

Hello. In this short presentation I would like to introduce the topic of “speech synthesis”. The slides are in German. However, I am giving this talk in English, because I am a native English-speaking speech synthesiser! In fact, I am the standard Apple iOS 13 reader. First, an overview: after some use cases, definitions and historical background, I will touch on the challenges involved in transforming written thoughts into speech. Then I will describe the basic architecture of various systems and compare methods. The talk will conclude with ideas on infusing artificial speech with emotion and a singing-voice emulation culled from the Internet.

What is speech synthesis? Essentially, it is the imitation of natural speech by artificial means. It requires a functional model of the human voice organ, be it mechanical, analogue or digital. Target groups include blind people, who can have a text read out to them, and people with a speech impediment, who can learn to express themselves through an artificial voice. Other speech applications serve situations in which the able-bodied user's full visual attention is required, such as driving a car while navigating. In addition, speech synthesis and recognition are increasingly employed in human-robot interactions, as infotainment and as art. Many of these uses are non-trivial, since misunderstandings can have real negative social or even life-threatening consequences. Insight into the nature of human social interaction is required, in addition to access to appropriate technical resources.

Over the last few years, the market for speech synthesis has boomed thanks to leaps in affordable computing power. We have become accustomed to hearing a computer voice when we call a helpline, when we listen to announcements in public spaces, or when we interact with our personal digital assistants, home automation, media and navigation devices. We also increasingly make use of speech synthesis for e-learning, in automated simultaneous translation and in dialogue systems that do not require a display device. Ultimately, we as customers and consumers are the judges of how successful a particular implementation is.

At this point, I want to distinguish between text-to-speech, or TTS, and concept-to-speech, or CTS. The former type of system is used when pre-composed text, usually written by humans, is the input. A TTS system needs to apply correct pronunciation and enunciation to the script. Traditionally, this requires a two-step approach: the linguistic meaning of the input text is interpreted before it is rendered by a synthetic voice. Naturally, the result depends significantly on the weakest link in the component chain: errors early in the pipeline ripple through the system uncorrected and are amplified, usually making them easy for a native speaker to detect. In contrast, a concept-to-speech (CTS) system directly triggers a pre-recorded voice signal via a generation component that selects the expression on the basis of semantic, pragmatic and discourse knowledge. This approach has traditionally been employed in information systems for acoustic feedback or as the output component of an automatic program. In such cases, errors can initially be hard to detect from the acoustic signature alone, even when the phrase memory is actually quite limited.

A few words on the historical development of speech synthesis. Around 1003, Gerbert of Aurillac is said to have built a bronze “talking head” that could respond with “yes” or “no” to a question.
After a gap of several centuries, Christian Kratzenstein built the first artificial speech organ in 1779. Only a few years later, Wolfgang von Kempelen built a similar device, albeit of different mechanical construction. In the 1930s, Bell Labs in the US developed the keyboard-controlled Voder, a device that emulates the human voice electronically. In 1961 they presented the first true speech synthesiser, based on an IBM 704 computer.

Here we see the mechanical implementation of a modern voice organ, presented in 2011 by Prof. Sawada at Waseda University in Tokyo, Japan. We can see the articulated mouth tract and an imitation of a nose. Air is supplied by means of a blower. Using artificial intelligence,
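To make the two-step TTS pipeline mentioned earlier a little more concrete, here is a minimal Python sketch of a linguistic front end followed by an acoustic back end. The tiny pronunciation dictionary and the names analyse_text and render_phonemes are illustrative assumptions, not the API of any real synthesiser, and the back end only emits placeholder samples.

```python
# Minimal sketch of a two-stage TTS pipeline:
# stage 1 interprets the text linguistically, stage 2 renders it.
# The lexicon and function names are illustrative assumptions only.

# Toy grapheme-to-phoneme dictionary covering just a couple of words.
LEXICON = {
    "hello": ["HH", "AH", "L", "OW"],
    "world": ["W", "ER", "L", "D"],
}

def analyse_text(text: str) -> list[str]:
    """Stage 1: normalise the text and map each word to phonemes."""
    phonemes: list[str] = []
    for word in text.lower().split():
        word = word.strip(".,!?")
        # Unknown words fall back to being spelled out letter by letter.
        phonemes.extend(LEXICON.get(word, list(word.upper())))
        phonemes.append("|")  # word-boundary marker
    return phonemes

def render_phonemes(phonemes: list[str]) -> list[float]:
    """Stage 2 (stub): turn phonemes into a 'waveform'.

    A real back end would generate audio samples; here each phoneme
    simply contributes a fixed number of silent placeholder samples.
    """
    samples_per_phoneme = 80
    return [0.0] * (samples_per_phoneme * len(phonemes))

if __name__ == "__main__":
    phones = analyse_text("Hello, world!")
    audio = render_phonemes(phones)
    print(phones)   # ['HH', 'AH', 'L', 'OW', '|', 'W', 'ER', 'L', 'D', '|']
    print(f"{len(audio)} placeholder samples")
```

The point of the sketch is simply that whatever the front end gets wrong is rendered faithfully by the back end, which is why errors early in a TTS pipeline are so easy for a native speaker to hear.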

