Miller Puckette
Department of Music and Center for Research in Computing and the Arts, UCSD
A unified framework is developed in which to compare several techniques for synthesizing sounds with desired spectra, using AM, FM, waveshaping, and pulse width modulation.
Of the many approaches to specifying and synthesizing musical sounds, one of the oldest and best is to specify the sound's partial frequencies and spectral envelope. The frequencies of the partials might be chosen to lie on the harmonics of a desired fundamental frequency, and this gives a way of controlling the sound's (possibly time-varying) pitch. The spectral envelope is used to determine the amplitude of the individual partials, as a function of their frequencies, and is thought of as controlling the sound's (possibly time-varying) timbre. A simple example of this is synthesizing a plucked string as a sound with harmonically spaced partials in which the spectral envelope starts out rich but then dies away exponentially, with higher frequencies decaying faster than lower ones, so that the timbre mellows over time. In a similar vein, [Risset] and [Grey] proposed spectral-evolution models for various acoustic instruments. A more complicated example is the spoken or sung voice, in which vowels appear as spectral envelopes, diphthongs and many consonants appear as time variations in the spectral envelopes, and other consonants appear as spectrally shaped noise.
Partly because of the intrinsic interest of the human voice and partly because of Bell Laboratories' strong influence on the early development of computer music, synthetic vocal sounds are perennial features of both early and modern computer music repertory. The palette of synthesis techniques has offered much variety within the framework described above. Starting with Mathews's Daisy, and soon afterward in Dodge's Speech Songs, subtractive synthesis has been used. At first the filters were usually designed using analyses of recorded voices (first via vocoder, as in these two examples, and later using LPC, as in Lansky's Six Fantasies on a Poem by Thomas Campion).
The other main historical approach to vocal synthesis (and also to synthesis of other time-varying spectra) has been the direct computation of formants, or more exactly, of sounds containing a single formant that can be combined to create multi-formant spectra. In this class fall Kaegi's VOSIM generator [Tempelaars], Bennett and Rodet's FOF [Rodet], and Chowning's synthesis of formants using FM. A representative musical example of FOF synthesis is Barrière's Chréode I, and of Chowning's technique, his own piece Phoné.
In these direct synthesis techniques, analyzed time-varying spectral envelopes have mostly given way to what Bennett and Rodet call ``synthesis by rule,'' in which formantic placement is codified as a function of desired phonemes. (For that matter, synthesis by rule has also been applied to subtractive synthesis. On the other hand, analyses of continuous speech have not often been used to drive formant generators.)
Especially in Phoné, the listener is struck by the much greater sophistication of timbral control offered by the rule-based approach to speech synthesis. In addition, the direct synthesis of formants has, at least in the past, reached higher levels of sound quality than subtractive synthesis, which tends to sound machine-like and ``buzzy,'' especially when compared to the FM approach.
In light of this, it is worth pointing out an important practical difficulty in Chowning's method: there is no obvious way to make the formants slide upward or downward in frequency as they do in real speech and singing. This limitation, and some techniques for overcoming it, are considered in the sections that follow.
The FM algorithm is to calculate time-dependent values of the function,

x(t) = cos(ω_c t + r cos(ω_m t)),

where ω_c is the carrier frequency, ω_m is the modulating frequency, and r is the index of modulation. To analyse the resulting spectrum we can write,

x(t) = cos(ω_c t) cos(r cos(ω_m t)) − sin(ω_c t) sin(r cos(ω_m t)).
In Chowning's scheme for synthesizing formants, ω_c becomes the formant center frequency and the modulator cos(r cos(ω_m t)), which is aliased to a point in the spectrum centered at ω_c, determines the formant bandwidth and the placement of partial frequencies around the formant frequency. In particular, to get harmonically spaced partials we set the modulating frequency ω_m to the fundamental ω and the carrier frequency ω_c to the multiple of ω closest to the desired center frequency.

The bandwidth can then be controlled by changing the index of modulation r. The center frequency can't be changed continuously, however, since for harmonicity it must be an integer multiple of ω.
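To make the recipe concrete, here is a minimal sketch of the scheme in Python; all function and parameter names are ours, not the paper's, and frequencies are in Hz:

```python
import math

def fm_formant(f0, f_center, index, sr=44100, dur=0.01):
    """Chowning-style formant synthesis: modulator at the fundamental
    f0, carrier at the harmonic of f0 nearest the desired formant
    center frequency.  Illustrative names and defaults."""
    n = max(1, round(f_center / f0))      # carrier harmonic number
    wc = 2 * math.pi * n * f0             # carrier frequency, rad/sec
    wm = 2 * math.pi * f0                 # modulating frequency, rad/sec
    return [math.cos(wc * t + index * math.cos(wm * t))
            for t in (i / sr for i in range(int(sr * dur)))]
```

Because the carrier must sit on a harmonic of f0, sliding f_center continuously produces stepwise jumps between harmonics, which is exactly the limitation discussed here.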
The next section will describe two alternative forms of the carrier signal c(t), each of which allows changing the center frequency without losing harmonicity. In the following section we will consider alternative forms of the modulator m(t) which in some situations are preferable to the classic FM formulation.
Two workable strategies for producing ``glissable" carrier signals have
emerged, one simple, the other more complex but better at handling the case of
very small bandwidths. In both cases, we start by synthesizing a low-bandwidth
carrier signal with components clustered around the desired formant center
frequency. The spectrum can then be fattened to a desired bandwidth by using a
suitable modulator. We will now let ω_f denote a desired center
frequency, no longer necessarily an integer multiple of the fundamental
frequency ω. We will denote the desired waveform period by τ = 2π/ω.
The first technique is to let c(t) be a windowed ``sample'' of a cosine at the frequency ω_f, retriggered once per period:

c(t) = w(t − kτ) cos(ω_f (t − kτ)),  for kτ ≤ t < (k+1)τ,

where w is a smooth window function of length τ. Here the full 6-dB bandwidth of the signal is on the order of the fundamental frequency, which is reasonably small but not negligible. (It is tempting to try to reduce the bandwidth further by lengthening the ``samples'' and overlapping them, but this leads to seemingly insoluble phase cancellation problems.)
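This retriggered-cosine carrier might be sketched as follows, assuming a raised-cosine (Hanning) window over each period; the window choice and all names are our assumptions:

```python
import math

def sampled_cosine_carrier(f0, f_center, sr=44100, dur=0.01):
    """A windowed 'sample' of a cosine at f_center, retriggered once
    per fundamental period of f0.  The Hann window is an assumption."""
    tau = 1.0 / f0                        # fundamental period, seconds
    out = []
    for i in range(int(sr * dur)):
        t = (i / sr) % tau                # time since last period start
        w = 0.5 - 0.5 * math.cos(2 * math.pi * t / tau)   # Hann window
        out.append(w * math.cos(2 * math.pi * f_center * t))
    return out
```

Since the window vanishes at the period boundaries, the phase reset of the cosine causes no discontinuity, and f_center may glide freely.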
This method leads to an interesting generalization, which is to take a sequence of samples of length τ, align all their component phases to those of cosines, and use them in place of the cosine function in the formula for c(t) above. The phase alignment is necessary to allow coherent cross-fading between samples so that the spectral envelope can change smoothly. If, for example, we use successive snippets of a vocal sample as input, we get a strikingly effective vocoder.
The second technique, first described in [Puckette], is to synthesize a carrier signal,

c(t) = a cos(nωt) + b cos((n+1)ωt),

where n is an integer and a + b = 1, chosen so that the weighted average (n + b)ω of the two frequencies equals the desired center frequency ω_f. However, it is not appropriate simply to change a, b, and n as smooth control signals. The trick is to note that c(t) = a + b = 1 whenever t is a multiple of τ, regardless of the choice of a, b, and n, as long as a + b = 1. Hence, we may make discontinuous changes in a, b, and n once each period without causing discontinuities in c(t).
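A sketch of such a carrier, with the coefficients recomputed only at period boundaries (one block per period; names and defaults are ours):

```python
import math

def glissable_carrier(f0, centers, sr=44100):
    """Carrier c(t) = a*cos(n*w*t) + b*cos((n+1)*w*t) with a + b = 1,
    so that c = 1 at every period boundary.  'centers' gives one
    desired center frequency per fundamental period; n, a, b change
    only at period boundaries.  Assumes the period spans a whole
    number of samples (true for the test values used here)."""
    w = 2 * math.pi * f0
    spp = int(round(sr / f0))             # samples per period
    out = []
    for fc in centers:                    # one entry per period
        n = int(fc / f0)                  # lower harmonic number
        b = fc / f0 - n                   # weight of upper harmonic
        a = 1.0 - b
        for i in range(spp):
            t = i / sr                    # phase repeats each period
            out.append(a * math.cos(n * w * t)
                       + b * math.cos((n + 1) * w * t))
    return out
```

Even with a jump in center frequency between the two periods, the signal passes through 1 at the boundary, so no click results.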
In the specific case of FM, if we wish we can now go back and modify the original formulation:

x(t) = [a cos(nωt) + b cos((n+1)ωt)] cos(r cos(ω_m t)).
In the waveshaping formulation the shape of the formantic peak is determined by the modulator term cos(r cos(ω_m t)). In the case of FM this gives the famous Bessel ``J'' functions. At indices of modulation less than about 1.43 radians we get a proper bell-shaped spectrum, whose bandwidth (full width at −6 dB height) grows from zero with increasing index. Further increases in index give rise to the well-known sidelobes in the FM spectrum.
Although we might desire the sidelobe effect, we needn't be tied to it; other possibilities abound. The formula for the general modulation signal is:

m(t) = f(r cos(ω_m t)),

where f is a waveshaping function and r again acts as an index of modulation.
We have so far found two workable functions:

f(x) = e^(−x²)  and  f(x) = 1/(1 + x²).
Since both of these are even functions, we set ω_m to be half of the fundamental frequency, unlike the FM case where we set ω_m to the fundamental; this accounts for there being only one term to calculate here instead of the two in our analysis of FM.
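As an illustration, here is a waveshaped modulator using two even functions of the kind discussed, a Gaussian and a Cauchy-style bump; the function and parameter names are our own:

```python
import math

def waveshaped_modulator(f0, index, shape="gaussian", sr=44100, dur=0.01):
    """m(t) = f(r*cos(w*t/2)): an even waveshaping function f driven
    by a cosine at half the fundamental frequency.  Sketch only."""
    f = (lambda x: math.exp(-x * x)) if shape == "gaussian" \
        else (lambda x: 1.0 / (1.0 + x * x))      # Cauchy-style bump
    w = 2 * math.pi * f0
    return [f(index * math.cos(0.5 * w * i / sr))
            for i in range(int(sr * dur))]
```

Raising the index widens the resulting formant; unlike the FM modulator, these functions produce no sidelobes.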
Yet another approach is pulse width modulation, for instance:

m(t) = 1/2 − (1/2) cos(2πt/(dτ)) for 0 ≤ t < dτ,  m(t) = 0 for dτ ≤ t < τ,

repeated with period τ, where 0 < d ≤ 1 sets the pulse width. The spectrum is simply the Fourier transform of the Hanning window, which is approximately band-limited (actually only good to about −34 dB), as compared to the waveshaping solutions, which are non-band-limited. If we desire better stop-band rejection than −34 dB we can pass to Blackman-Harris windows; in this case we must allow an overlap of 3 before we can attain zero bandwidth.
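A pulse-width-modulated Hanning pulse of this kind can be sketched as follows (the duty-cycle parameter and all names are ours):

```python
import math

def hann_pulse_modulator(f0, duty, sr=44100, dur=0.01):
    """One Hanning-window pulse per fundamental period; 'duty' in
    (0, 1] is the pulse length as a fraction of the period.
    Narrower pulses give wider formant bandwidths."""
    tau = 1.0 / f0
    out = []
    for i in range(int(sr * dur)):
        t = (i / sr) % tau                # time since period start
        if t < duty * tau:
            out.append(0.5 - 0.5 * math.cos(2 * math.pi * t / (duty * tau)))
        else:
            out.append(0.0)
    return out
```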
Up to now we have only synthesized discrete spectra. It is also sometimes desirable to synthesize ``noisy'' sounds with desired spectral envelopes. One technique for doing this is described in [Puckette]. The idea is to multiply a discrete spectrum (perhaps computed in one of the ways described above) by a narrow-band noise signal. Each sinusoid is then modulated into a narrow noise band, and the overlapping noise bands fill out a continuous noisy spectrum.
However, the fact that each sinusoid is modulated by the same noise is problematic. To fix this we modulate four copies of the original signal, delayed by varying amounts of up to 10 milliseconds, by four independent band-limited noise streams. Each partial thus receives a different linear combination of the four noise signals, and so the partials ``move'' independently.
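The decorrelation step might be sketched like this, with a crude piecewise-linear stand-in for the band-limited noise streams (everything here is an illustrative assumption, not the paper's implementation):

```python
import math
import random

def decorrelate(signal, sr=44100, ncopies=4, maxdelay_ms=10.0,
                noise_hz=50.0, seed=0):
    """Sum several delayed copies of 'signal', each ring-modulated by
    an independent slowly varying noise stream (approximated here by
    linear interpolation between random values drawn every
    1/noise_hz seconds)."""
    rng = random.Random(seed)
    hop = max(1, int(sr / noise_hz))      # samples between noise values
    out = [0.0] * len(signal)
    for c in range(ncopies):
        delay = int(sr * maxdelay_ms / 1000.0) * c // max(1, ncopies - 1)
        pts = [rng.uniform(-1.0, 1.0)
               for _ in range(len(signal) // hop + 2)]
        for i in range(delay, len(signal)):
            j, frac = divmod(i, hop)
            n = pts[j] + (pts[j + 1] - pts[j]) * frac / hop
            out[i] += signal[i - delay] * n
    return out
```

Because each partial of the input reaches the output through a different combination of delays, its modulating noise differs from that of its neighbors.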
These techniques have been gradually refined over the last fourteen years,
using IRCAM's 4X and ISPW, and later the standard real-time interactive
graphical synthesis environments Max/MSP, jMax, and Pd. The community of
active users of the techniques has, however, remained quite small, at least
partly because nothing has so far been published about them in computer
music venues. Implementations in the form of external objects for Pd and
Max/MSP are available, with sources, from https://msp.ucsd.edu and
https://crca.ucsd.edu/~tapel.