Next: Bibliography Up: Phase bashing for sample-based Previous: Making synthetic formants

Using recorded sounds

The packet function $ p(t)$ may also be derived from a recorded sound. The derivation is arranged so that a recorded sinusoid gives a single formant, as described in the previous section, while a recorded vocal sound, for instance, imparts its spectral envelope to the resynthesized result. The technique retains the same frequency shifting and bandwidth control as in the purely synthetic case.

Figure 4: Analyzing an incoming sinusoid to make a wave packet. First the sound is Hann windowed and wrapped around two-for-one to give a waveform. Then the waveform's components are individually bashed into phase.
\begin{figure}\psfig{file=fig4.ps}\end{figure}

We make the initial assumption that the signal to be analyzed is a sum of (not necessarily harmonic) sinusoids. Assuming their frequencies are widely enough separated, we can treat them individually. We therefore assume that the signal to analyze is a pure sinusoid of frequency $ h\omega$ , where $ h$ need not be an integer and $ \omega$ is the fundamental frequency corresponding to an analysis period $ N$ :

$\displaystyle \omega = 2\pi/N.$

Under this assumption, we analyze a sinusoid:

$\displaystyle r[n] = e^{2 \pi h n i / N}$

by windowing it over a span of $ 2N$ samples and overlap-adding the result into a waveform of period $ N$ . Figure 4 shows the result for a sinusoid of frequency $ 2.4\omega$ . The analysis period is $ N=2\pi/\omega$ , so 2.4 periods of the analyzed sinusoid fit into one analysis period. The first step is to window a segment of the incoming sound spanning twice the analysis period. Next, the two halves of the windowed segment are overlapped and added to make one cycle to analyze (the OLA step). Finally, the phases of the signal components are set to zero (the phase bashing step).
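The three steps can be sketched in Python with NumPy. This is a minimal illustration under stated assumptions, not the author's implementation: the function name `phase_bash` is hypothetical, the Hann window is taken in its periodic form over the $ 2N$ samples, and "setting the phases to zero" is interpreted as replacing each DFT bin of the folded cycle by its magnitude, with zero phase referred to the start of the table.

```python
import numpy as np

def phase_bash(segment):
    """Turn a 2N-sample segment into an N-sample phase-bashed wavetable.

    Returns (folded, bashed): the overlap-added cycle and the same cycle
    with every component's phase set to zero.
    Assumption: periodic Hann window, zero phase referenced to sample 0.
    """
    segment = np.asarray(segment)
    two_n = len(segment)
    if two_n % 2 != 0:
        raise ValueError("segment length must be even (2N samples)")
    n_period = two_n // 2

    # Step 1: Hann window over twice the analysis period.
    window = 0.5 - 0.5 * np.cos(2.0 * np.pi * np.arange(two_n) / two_n)
    windowed = segment * window

    # Step 2: overlap-add the two halves into one cycle of N samples.
    folded = windowed[:n_period] + windowed[n_period:]

    # Step 3: keep each component's magnitude and set its phase to zero.
    bashed = np.fft.ifft(np.abs(np.fft.fft(folded)))
    return folded, bashed

# Usage: analyze a complex sinusoid at 2.4 times the analysis fundamental.
N = 64
n = np.arange(2 * N)
folded, bashed = phase_bash(np.exp(2j * np.pi * 2.4 * n / N))
```

For a real-valued input the magnitude spectrum is symmetric, so the imaginary part of `bashed` is zero up to rounding error and can be discarded.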

Exactly as in the synthesis step, the combination of a Hann window and a two-way overlap ensures that any sinusoidal component in the analyzed signal is represented in the resulting waveform in the most compact possible way, as a sum of two neighboring harmonics. If the analyzed sinusoid happens to lie on a harmonic of the analysis period, the resulting wavetable is again a pure sinusoid.
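The on-harmonic case can be checked directly. As a sketch, write the Hann window over the $ 2N$ -sample segment as $ w[n] = (1 - \cos(\pi n/N))/2$ ; this notation and the choice of the periodic form of the window are assumptions made here for illustration. The overlap-added waveform is then

$\displaystyle x[n] = w[n] r[n] + w[n+N] r[n+N], \qquad 0 \le n < N.$

Since $ \cos(\pi (n+N)/N) = -\cos(\pi n/N)$ , the two window values sum to one: $ w[n] + w[n+N] = 1$ . When $ h$ is an integer, $ r[n+N] = e^{2\pi h i} r[n] = r[n]$ , so

$\displaystyle x[n] = (w[n] + w[n+N]) r[n] = r[n],$

which is the original sinusoid, whose only nonzero component is harmonic $ h$ .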

Figure 5: Handling sinusoidal and noisy components separately: a. analysis; b. resynthesis.
\begin{figure}\psfig{file=fig5.ps}\end{figure}

Since all the partials of the output have a fixed phase, different wavetables may be cross-faded coherently. For example, it is straightforward to analyze successive frames in a recorded vocal sample and play them back, successively cross-fading the frames to mimic the time-varying spectral envelope of the original sound.
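The coherence claim can be illustrated with a short Python sketch. The helper names `zero_phase_table` and `crossfade_tables` are hypothetical, and the harmonic amplitudes are made-up test values rather than analysis data. Because every partial in both tables is a cosine of zero phase, a linear cross-fade interpolates each harmonic's amplitude linearly, with no cancellation between the two tables.

```python
import numpy as np

def zero_phase_table(amplitudes, n_samples):
    """Build a real wavetable as a sum of zero-phase cosines.

    amplitudes[k] is the amplitude of harmonic k (index 0 is DC).
    """
    amplitudes = np.asarray(amplitudes, dtype=float)
    n = np.arange(n_samples)
    k = np.arange(len(amplitudes))[:, None]
    partials = amplitudes[:, None] * np.cos(2.0 * np.pi * k * n / n_samples)
    return partials.sum(axis=0)

def crossfade_tables(table_a, table_b, mix):
    """Linear cross-fade: mix=0 gives table_a, mix=1 gives table_b."""
    return (1.0 - mix) * np.asarray(table_a) + mix * np.asarray(table_b)

# Two tables with different (illustrative) spectral envelopes.
a = zero_phase_table([0.0, 1.0, 0.5, 0.0], 64)
b = zero_phase_table([0.0, 0.0, 0.5, 1.0], 64)
halfway = crossfade_tables(a, b, 0.5)

# Recover harmonic amplitudes of the mixture from its spectrum.
amps = 2.0 * np.abs(np.fft.rfft(halfway)) / 64
```

Here harmonics 1 to 3 of the halfway mixture each have amplitude 0.5, the average of the corresponding amplitudes in the two tables. With arbitrary phases the mixture's amplitudes would depend on the phase differences and could dip during the fade.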

The phase-bashing technique can be combined with sinusoidal/stochastic decomposition [10,11]. Figure 5, part (a), shows the decomposition step, in which an incoming sound is separated (in real time or not) into sinusoidal and noisy parts. Each is separately divided into a succession of analysis frames and converted into phase-bashed wavetables.

Figure 6: A real analysis/synthesis example: a. the original sound; b. a reconstructed sound with $ T=1$ ; c. with increased bandwidth.
\begin{figure}\psfig{file=fig6.ps}\end{figure}

For the reconstruction step (Figure 5 part b), the sinusoidal and noisy tables are each used to reconstruct signals as in Figure 2. The noisy part is then de-pitched by modulating it with a band-limited noise signal. For best results several slightly time-shifted copies are modulated separately [4].
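The de-pitching step might be sketched as follows in Python. This is only an illustration under several assumptions that are not taken from the text: "modulating" is read as sample-by-sample multiplication, the band-limited noise is made by zeroing the FFT bins of white noise above a cutoff, and the function names, the cutoff, and the shift amounts are all hypothetical placeholders. The method of [4] may differ in its details.

```python
import numpy as np

def band_limited_noise(n_samples, cutoff_hz, sample_rate, rng):
    """White Gaussian noise, low-passed by zeroing FFT bins above cutoff_hz."""
    spectrum = np.fft.rfft(rng.standard_normal(n_samples))
    freqs = np.fft.rfftfreq(n_samples, d=1.0 / sample_rate)
    spectrum[freqs > cutoff_hz] = 0.0
    noise = np.fft.irfft(spectrum, n=n_samples)
    return noise / np.max(np.abs(noise))  # normalize to a peak of 1

def depitch(signal, sample_rate, cutoff_hz=100.0, shifts=(0, 7, 13), seed=0):
    """Smear the pitch of a periodic signal by noise modulation.

    Each copy is shifted by a few samples (circularly, as a stand-in for
    a short delay) and multiplied by its own band-limited noise signal.
    """
    rng = np.random.default_rng(seed)
    signal = np.asarray(signal, dtype=float)
    out = np.zeros_like(signal)
    for shift in shifts:
        shifted = np.roll(signal, shift)
        out += shifted * band_limited_noise(
            len(signal), cutoff_hz, sample_rate, rng)
    return out / len(shifts)

# Usage: smear a 1000 Hz tone into a noise band about 100 Hz wide.
sr = 8000
t = np.arange(sr) / sr
smeared = depitch(np.sin(2.0 * np.pi * 1000.0 * t), sr, cutoff_hz=50.0)
```

Multiplying by noise band-limited to `cutoff_hz` spreads each partial over a band of width twice the cutoff around its original frequency, so the cutoff sets how strongly the pitch is blurred.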

Figure 6 shows a real situation in which a sung vowel (/a/ as in ``la'') is analyzed and resynthesized. Part (a) shows an analyzed spectrum of the original voice, whose fundamental frequency $ f_0$ is about 600 Hz. Two `formants' have center frequencies of about $ 1.5 f_0$ and $ 5.5 f_0$ .

In part (b), the voice is resynthesized at a low pitch (about 170 Hz) to show a sharp reconstruction of the original spectrum. The result has a formant for each harmonic of the original sample.

Part (c) shows the result of adding bandwidth by increasing the $ T$ parameter in order to erase the visible quantization of the spectrum of part (b) around harmonics of the original fundamental.

Unfortunately (I am grateful to a reviewer who pointed this out!) the /a/ vowel should normally contain two formants at about 800 and 1000 Hz; since the fundamental frequency of the recorded sample is so absurdly high, the correct formants are not possible to infer from the spectrum. But I'm running out of space and must stop here.


Miller Puckette 2006-03-30