Speech Processing

Programming Assignment: Speech Processing

Introduction

Traditional pitch-excited LPC vocoders use a fully parametric model to encode the important information in human speech efficiently. These vocoders can produce intelligible speech at low data rates (800 to 2400 bps), but they often sound synthetic and generate annoying artifacts such as buzzes, thumps, and tonal noises. These problems become much worse when acoustic background noise is present at the speech input (McCree, 1992, p. 137).

Because these problems stem from the inability of a simple pulse train to reproduce all kinds of voiced speech, vocoders have been proposed that use mixtures of pulse and noise excitation. We have previously developed a mixed excitation LPC vocoder that preserves the low bit rate of a fully parametric model but adds more free parameters to the excitation signal, so that the synthesizer can mimic more characteristics of natural human speech and of acoustic background noise.

This paper presents a brief overview of the new LPC vocoder model, the addition of adaptive spectral enhancement and Fourier series modeling, the implementation of a 2400 bps speech coder based on the model, and the results of informal and formal listening tests (McCree, 1992, p. 54). The new model is based on the traditional LPC vocoder, in which either a periodic impulse train or white noise excites an all-pole filter, but it contains four additional features, as shown in Figure 1: mixed pulse and noise excitation, periodic or aperiodic pulses, a pulse dispersion filter, and adaptive spectral enhancement.
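The traditional two-state synthesizer described above (impulse train or white noise driving an all-pole filter) can be sketched in a few lines. This is a minimal illustration, not the coder from the paper: the function names, the frame length of 180 samples, the pitch period of 80 samples, and the sign convention A(z) = 1 + a_1 z^-1 + ... + a_p z^-p are all assumptions made for the example.

```python
import random

def impulse_train(n_samples, pitch_period):
    """Periodic excitation: a unit pulse every `pitch_period` samples."""
    return [1.0 if n % pitch_period == 0 else 0.0 for n in range(n_samples)]

def white_noise(n_samples, seed=0):
    """Unvoiced excitation: zero-mean, unit-variance Gaussian noise."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(n_samples)]

def all_pole_filter(excitation, lpc_coeffs, gain=1.0):
    """Run the excitation through 1/A(z), assuming A(z) = 1 + sum_k a_k z^-k.

    Difference equation: s[n] = gain * e[n] - sum_{k=1..p} a_k * s[n-k].
    """
    out = []
    for n, e in enumerate(excitation):
        acc = gain * e
        for k, a in enumerate(lpc_coeffs, start=1):
            if n - k >= 0:
                acc -= a * out[n - k]
        out.append(acc)
    return out

def synthesize_frame(voiced, lpc_coeffs, n_samples=180, pitch_period=80, gain=1.0):
    """Classic binary decision: pulses for voiced frames, noise otherwise."""
    if voiced:
        excitation = impulse_train(n_samples, pitch_period)
    else:
        excitation = white_noise(n_samples)
    return all_pole_filter(excitation, lpc_coeffs, gain)
```

For example, `synthesize_frame(True, [-0.9])` produces a decaying response restarted at every pitch pulse. The hard voiced/unvoiced switch in `synthesize_frame` is the part that the four additional features are meant to relax.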

Each of these new capabilities is intended to remove a particular distortion from the synthetic speech. The mixed excitation eliminates the buzzy quality usually associated with LPC vocoders by allowing frequency-dependent voicing strength. A separate aperiodic voiced state is added so that the synthesizer can reproduce erratic glottal pulses without introducing tonal noises. The pulse dispersion filter improves the match between the bandpass-filtered waveforms of synthetic and natural speech away from the formant regions by introducing time-domain spread to the excitation signal. The fourth feature, adaptive spectral enhancement, has recently been added to the model and is described in the next section.
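The first two features, frequency-dependent voicing strength and aperiodic pulses, can be illustrated with a small sketch. Everything specific here is an assumption for illustration only: the two-band split built from a one-pole low-pass filter, the `voicing` weights, the `jitter` fraction, and the function names are hypothetical and are not taken from the paper. The pulse dispersion filter and adaptive spectral enhancement are not modeled.

```python
import random

def jittered_pulses(n_samples, pitch_period, jitter=0.0, seed=0):
    """Pulse train whose period varies by up to +/- jitter * pitch_period.

    jitter=0.0 gives a strictly periodic train; a larger value mimics
    erratic glottal pulses (the aperiodic voiced state).
    """
    rng = random.Random(seed)
    out = [0.0] * n_samples
    pos = 0
    while pos < n_samples:
        out[pos] = 1.0
        step = pitch_period * (1.0 + rng.uniform(-jitter, jitter))
        pos += max(1, round(step))
    return out

def split_bands(x, alpha=0.5):
    """Split x into a low band (one-pole low-pass) and the residual high band.

    The two bands are complementary: low[n] + high[n] == x[n].
    """
    low, state = [], 0.0
    for v in x:
        state = alpha * v + (1.0 - alpha) * state
        low.append(state)
    high = [v - l for v, l in zip(x, low)]
    return low, high

def mixed_excitation(n_samples, pitch_period, voicing=(1.0, 0.3), jitter=0.0, seed=0):
    """Per band, mix pulses and noise with weights v and (1 - v).

    voicing[0] applies to the low band, voicing[1] to the high band.
    Pulse and noise levels are not power-normalized in this sketch.
    """
    rng = random.Random(seed + 1)
    pulses = jittered_pulses(n_samples, pitch_period, jitter, seed)
    noise = [rng.gauss(0.0, 1.0) for _ in range(n_samples)]
    pulse_bands = split_bands(pulses)
    noise_bands = split_bands(noise)
    out = [0.0] * n_samples
    for v, pb, nb in zip(voicing, pulse_bands, noise_bands):
        for i in range(n_samples):
            out[i] += v * pb[i] + (1.0 - v) * nb[i]
    return out
```

With `voicing=(1.0, 1.0)` and `jitter=0.0` the result reduces to the plain pulse train of a traditional vocoder, while `voicing=(1.0, 0.3)` keeps the low band strongly voiced and lets noise dominate the high band. The resulting signal would then drive the all-pole synthesis filter in place of the pure pulse or pure noise excitation.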

Besides linguistic information, voice conveys rich paralinguistic information regarding the expressive, organic and perspective aspects of communication. Although these layers of information are intertwined in oral communication, it is desirable to be able to manipulate the corresponding qualities artificially (Chen & Gersho, 1987, p. 2188). The interest is significant: expressiveness can increase the naturalness of synthetic speech and thus make it more acceptable to listeners, while high-quality transformation allows new voices to be created inexpensively from a single corpus. Most of the expressivity of speech can be captured via prosodic modifications such as pitch and time scaling, usually referred to as speech modifications.

The term speech transformation, on the other hand, refers mainly to the modification of the organic nature of the speech production system. Most of the work in speech modification and transformation has been done using sinusoidal models, phase vocoders, and nonparametric techniques such as pitch-synchronous overlap-add (PSOLA).
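To see why a parametric representation such as a sinusoidal model lends itself to pitch and time scaling, consider a deliberately simplified sketch in which a stationary segment is described by a list of (amplitude, frequency, phase) triples. The names `synthesize_sinusoids` and `scale_pitch`, the 8000 Hz sample rate, and the stationarity assumption are all hypothetical choices for this example; practical systems track time-varying parameters frame by frame and keep phases coherent across frames.

```python
import math

def synthesize_sinusoids(partials, duration_s, sample_rate=8000):
    """Sum of stationary sinusoids.

    partials: list of (amplitude, frequency_hz, phase_rad) triples.
    Time scaling amounts to asking for a different duration_s.
    """
    n_samples = int(round(duration_s * sample_rate))
    return [
        sum(a * math.cos(2.0 * math.pi * f * n / sample_rate + ph)
            for a, f, ph in partials)
        for n in range(n_samples)
    ]

def scale_pitch(partials, pitch_factor):
    """Multiply every partial frequency by pitch_factor.

    Amplitudes are left untouched, so the spectral envelope moves with
    the partials; preserving the envelope would need an extra
    resampling step that this sketch omits.
    """
    return [(a, f * pitch_factor, ph) for a, f, ph in partials]
```

For instance, with `partials = [(1.0, 100.0, 0.0), (0.5, 200.0, 0.0)]`, calling `synthesize_sinusoids(scale_pitch(partials, 1.5), 0.02)` raises the pitch by a factor of 1.5 and doubles the duration relative to a 0.01 s original, with the two modifications controlled independently.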

The advantages and limitations of these approaches are well studied ...