From the point of view of humans, the human voice is arguably the most important kind of sound, and the ability to make vocal sounds, and to perceive them, is an ability whose importance is at least on a par with walking and seeing.
Until very recently in human history, most people could neither read nor write, and although the visual arts were often employed to impart messages (for instance, through story-telling mosaics or paintings), spoken language was the only practical way to warn someone that there might be a snake in a nearby bush. Perhaps as a result, human hearing and speech have evolved together to the point that we have an extraordinarily well-developed ability to perceive even subtle nuances in the human voice, and to vocalize in ways that can exploit our hearing abilities.
In addition to speech as communication, the voice is almost certainly the original medium for making and transmitting music. The function of music in human life is a deep mystery, but the fact that no human culture is without music suggests that it somehow plays a fundamental role. It seems certain that the ways we make and listen to music are in some deep way (or ways) connected to the use of the voice as the primary carrier of human language, even if we don't (and perhaps never will) know how either music or language really works.
In the following sections we'll look at the natural production of vocal sounds (both speech and singing), and at what can be said about the nature of the sounds that are produced. We'll also describe a formal model of the human voice that underlies many speech analysis tools and synthesizers.
Here is a very simplified (and somewhat mis-proportioned) drawing of the parts of the body used for making vocal sounds:
To make a pitched sound, you push air out of your lungs (not shown) while narrowing your glottis, a constriction in your throat that forces the air to pass through a narrow slit between two vocal folds. These vibrate against each other, somewhat the way a trumpet player's two lips do. Each time they open, a short burst of air called a glottal pulse emerges. When things are running smoothly, these pulses emerge regularly, between about 50 and 600 times per second, and result in a periodic pulse train with an audible pitch.
This pitched sound then travels through the mouth and/or nose to become sound in the air. (There are also vibrations in the flesh, but for now we'll only worry about the vibrations in the air). The sound passes through larger or smaller cavities depending on the placement of the tongue, palate, lower jaw, and lips. The net effect is that the air passages filter the sound from the glottis on its way out to the air.
Depending on the placement of the tongue, etc., the filter's frequency response can change very quickly. This changing frequency response affects the timbre of the sound of the glottis as it appears in the outside air.
Unpitched sounds are made by relaxing the glottis (so that it no longer vibrates) while constricting one area or another to cause turbulence in the passing air: anywhere from the throat (for the English `h' sound) to the teeth and lips (for `f'). Depending on where these sounds originate, they are fully, partly, or barely filtered at all by the air passages through the throat, mouth, and nose.
Here is the measured spectrum of a vowel (the `a' in `cafeteria'):
and here is the measured spectrum of a consonant (the `t' in the same word):
It is important to remember that these are in no sense canonical measurements of the particular vowel and consonant - the spectra are constantly changing in time and if the same word were uttered again (even by the same speaker) the result would almost certainly be different.
The vowel example shows the characteristic form of a periodic signal (even though it is only approximately so). The fundamental frequency is about 85 Hz (the marker at 1000 Hz is between the 11th and 12th peaks). Certain peaks are higher than their neighbors--they are peaks among the peaks. These are the 1st, 7th, 19th, and 27th peaks, at frequencies of about 85, 600, 1700, and 2350 Hz. They may be thought of as corresponding to resonances (peaks in the frequency response) of the filter that is made by the throat, nose, and mouth.
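As a quick arithmetic check (using the approximate figures quoted above; the real fundamental is only roughly 85 Hz), the quoted peak numbers do land near the quoted frequencies:

```python
f0 = 85.0                      # approximate fundamental, as quoted above
peak_numbers = [1, 7, 19, 27]  # the "peaks among the peaks"
freqs = [n * f0 for n in peak_numbers]
print(freqs)  # → [85.0, 595.0, 1615.0, 2295.0], roughly the quoted 85, 600, 1700, 2350 Hz
```

The small discrepancies are expected, since the measured peaks are only approximately harmonic.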
Although it isn't accurate, a simple model for the situation would be that the glottis is putting out a signal in which all the harmonics have equal height. (Actually, we think they drop off gradually with frequency, but we can pretend they're all equal and that the differences are all because of the filtering.)
The peaks in the frequency response of this filter (or, almost equivalently, the peaks in the spectrum that are higher than their neighbors) are called formants. The ear appears to be very sensitive to them. Their placement (their frequencies, bandwidths, and heights relative to each other) is a property of the throat-mouth-nose filter, and it roughly characterizes the vowel that is being spoken. For instance, looking up the phoneme corresponding to the `a' in `cafeteria' on Wikipedia, we find that we should expect formants at 820 and 1530 Hz--a very poor match to what the picture above shows.
Consonants are more complicated. One class, the fricatives, is static in character (like vowels, fricatives can last as long as breath permits), but noisy and not periodic. Examples are `s', `f', and `sh'. Others are equally noisy but are generated by making short explosions that can't be sustained over time; these, including `t' (shown above) and `p', are called plosives. Still others are essentially parasites on vowels: they can't be made except at the start or end of a vowel, because they consist of rapid changes in filtering as one passage or another closes or opens. These are called voiced consonants. Examples are `b', `d', and `g'.
Together the vowels and consonants are called phonemes, and they can be considered the basic sonic building blocks of language.
Speech and singing can be thought of as centered on the production of vowels. Vowels tend to have much longer durations than consonants, and the consonants can be thought of as decorations at the beginnings and ends of, and in between, vowels.
The pitch of the voice is usually set by the frequency of glottal pulses during vowels or voiced consonants. Whether in speech or music, the pitch changes continuously (unlike that of a piano string, for instance, which is approximately constant). Here is an example from a spoken phrase (``a man, sitting in the cafeteria"):
SOUND EXAMPLE 1: a spoken phrase (``A man, sitting in the cafeteria").
And here is Johnny Cash singing the last line of ``Hurt":
SOUND EXAMPLE 2: a phrase sung by Johnny Cash at age 70.
The pitch is sometimes clearly defined and sometimes not. The voiced portions of speech are frequently nearly periodic and may be assigned a pitch experimentally. Unvoiced consonants, generated partly (and often entirely) by air turbulence, don't have a readily assignable pitch (and neither does silence).
What is the difference between singing and speech? A first attempt at an answer might be, ``in singing the pitch sticks to a scale, and individual notes are characterized by having steady pitch in time". But the above example doesn't seem to support that statement at all.
It is often claimed that singing has a systematically different timbre--and correspondingly different vowel spectra--from speech. On closer inspection, that seems in fact to be an artifact of Western art-music conservatory training, and although trained singing styles might have their own spectral idiosyncrasies, these seem to be more reflective of style (or stylization) than of the essence of singing itself.
There are some general trends (but there are exceptions to all of them!): singing is often slower than speech (in particular, vowels are elongated more than consonants); it often has a wider range of pitch variation; its pitch patterns are often repeated from one performance to another in a way we wouldn't expect of speech; and if there is an accompaniment with discernible rhythms and pitches, singing is more likely to follow them than is speech. But the most general answer is probably that nobody knows the real difference, even though it is usually immediately clear to the listener.
One phenomenon that's present in some (but not all) singing, and nearly absent from speech, is vibrato. Traces of vibrato can be seen in the singing example above, at the end of the words ``find" and ``way", where the pitch wavers up and down. Physiologically, vibrato is produced by (roughly cyclically) increasing and decreasing the air pressure underneath the glottis, which makes both the pitch and the power vary upward and downward, and also changes the timbre. Typically, vibrato in singing cycles between 4.5 and 7 Hz, and the variation in pitch may be on the order of a half tone. This is in agreement with the pitch trace of the singing example: in both areas the vibrato is about 5 Hz, and, while it's unclear what the depth might be in the first instance (``find"--because the pitch is simultaneously sliding downward 9 half tones!), at the end of ``way" (which is heard as the pitch A) the pitch trace varies between about G and A.
In this example the pressure variation is enough that the glottis apparently stops vibrating altogether during part of the vibrato cycle; in less extreme situations it will often be possible to trace the pitch all the way through the vibrato cycle.
In both areas of the trace the vibrato increases over time; this is quite common in the West, in both popular and classical idioms, and both in the voice and in other instruments.
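The figures quoted above are enough to synthesize a rough vibrato. Here is a sketch in Python; the 220 Hz center pitch, the half-tone depth, and the 20-percent power wobble are illustrative choices, not measurements:

```python
import numpy as np

sr = 44100                          # sample rate in Hz
t = np.arange(int(sr * 2.0)) / sr   # two seconds of time points

f_center = 220.0   # center pitch (an arbitrary A below middle C)
vib_rate = 5.0     # cycles per second, in the 4.5-7 Hz range quoted above
vib_depth = 0.5    # peak pitch deviation in semitones ("on the order of a half tone")

# pitch deviation in semitones, converted to an instantaneous frequency
semitones = vib_depth * np.sin(2 * np.pi * vib_rate * t)
inst_freq = f_center * 2.0 ** (semitones / 12.0)

# integrate the instantaneous frequency to get the oscillator's phase
phase = 2 * np.pi * np.cumsum(inst_freq) / sr

# the power varies along with the pitch, as described above
x = np.sin(phase) * (1.0 + 0.2 * np.sin(2 * np.pi * vib_rate * t))
```

Played back at the sample rate `sr`, `x` gives a steady tone with a clearly audible five-per-second waver in both pitch and loudness.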
Over years of voice research, a sort of de facto model has emerged that people frequently refer to. This model is not only useful for synthesizing vocal sounds, but also for understanding the voice in its natural habitat (by, for instance, recording the voice's output and trying to make the model fit it). For example, voice recognition is often done by applying this model to a real voice and seeing what parameters one would have to supply the model with to get the observed output.
The model follows this block diagram:
The pulse train models the glottis and a noise generator models turbulent noise. The mixer selects one or both of these sources depending on which source is active. A filter (or filters) models the vocal tract as it enhances some frequencies more than others.
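The whole diagram can be sketched in a few lines of Python. This is a sketch only: the rectangular pulse shape, the mixing weights, and the two formant frequencies and bandwidths are illustrative assumptions, not measurements of any real voice.

```python
import numpy as np

sr = 8000  # sample rate in Hz; kept low so the sketch stays small

def pulse_train(f0, dur, width=0.00025):
    """Rectangular stand-in for the glottal pulse generator."""
    n = int(sr * dur)
    period = int(round(sr / f0))         # samples between pulse onsets
    pw = max(1, int(round(sr * width)))  # samples per pulse
    x = np.zeros(n)
    for start in range(0, n, period):
        x[start:start + pw] = 1.0
    return x

def turbulence(dur):
    """Stand-in for the noise generator."""
    rng = np.random.default_rng(0)
    return rng.standard_normal(int(sr * dur))

def resonator(x, center, bw):
    """One two-pole resonant filter; a chain of these stands in for the vocal tract."""
    r = np.exp(-np.pi * bw / sr)                       # pole radius from bandwidth
    a1 = -2.0 * r * np.cos(2.0 * np.pi * center / sr)  # feedback coefficients
    a2 = r * r
    y = np.zeros_like(x)
    for i in range(len(x)):
        y[i] = x[i] - a1 * (y[i - 1] if i >= 1 else 0.0) - a2 * (y[i - 2] if i >= 2 else 0.0)
    return y

# the "mixer": mostly glottal pulses plus a little noise, as in a voiced sound
source = 0.9 * pulse_train(100.0, 0.5) + 0.1 * turbulence(0.5)

# the vocal-tract filter: two hypothetical formants, (center, bandwidth) in Hz
out = source
for center, bw in [(600.0, 100.0), (1700.0, 150.0)]:
    out = resonator(out, center, bw)
```

Changing the balance in the mixer line (all noise, no pulses) turns the same patch into a crude fricative generator.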
The glottal pulse generator should output a pulse train as shown here:
SOUND EXAMPLE 3: A synthetic pulse train.
In this example there are 100 pulses per second. The signal is periodic and so will have partials at 100, 200, 300, ... Hz. Their relative amplitudes are controlled by the width of the pulse: roughly speaking, the partials' amplitudes slope downward to zero over a frequency range that is inversely proportional to the width of each pulse. In this example the pulses are 1/4000 second in duration, so we would expect about 40 partials. Here is the measured power spectrum of the pulse train shown above:
In real speech, the time duration of glottal pulses varies, roughly as a function of lung pressure (the higher the pressure, the narrower the pulse) so that louder or more forceful speech is brighter in timbre than soft speech.
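The inverse relation between pulse width and spectral extent is easy to check numerically. The sketch below builds one second of a rectangular pulse train (an idealization of a real glottal pulse) with the numbers from the example: 100 pulses per second, each 1/4000 second wide. The envelope of the partial amplitudes first reaches zero at 1/width = 4000 Hz, i.e. after about 40 partials.

```python
import numpy as np

sr = 32000         # sample rate in Hz
f0 = 100           # pulses per second, as in the example
width = 1 / 4000   # pulse duration in seconds, as in the example

period = sr // f0                 # 320 samples between pulse onsets
pw = int(round(sr * width))       # 8 samples per pulse

x = np.zeros(sr)                  # one second of signal
for start in range(0, sr, period):
    x[start:start + pw] = 1.0

# one-second analysis window: bin k of the FFT is exactly k Hz,
# so the partials sit at bins 100, 200, 300, ...
spec = np.abs(np.fft.rfft(x)) / sr

# the 1st partial is strong, the 20th weaker, and the 40th (at 4000 Hz,
# the envelope's first zero) essentially vanishes:
print(spec[100], spec[2000], spec[4000])
```

Making `width` smaller stretches the envelope out over a wider frequency range, which is why forceful (narrow-pulse) speech sounds brighter.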
This glottal pulse train is mixed with noise. This is an oversimplification: it captures the way the vocal tract filters turbulence arising near the glottis itself, such as the air leaks one hears clearly in the singing example above, but other kinds of turbulence aren't really filtered the same way the glottal pulse train is.
The vocal tract is modeled using a filter. Filters, which were introduced in Chapter 3, can be thought of as frequency-dependent amplifiers, in that passing a sinusoid through a filter outputs a sinusoid of the same frequency but a different amplitude, depending on the filter's frequency response. Here, for example, is the frequency response of a filter intended to model the short 'a' vowel whose spectrum is shown above:
Here is the power spectrum of the result of filtering the pulse train with that filter:
SOUND EXAMPLE 4: The pulse train shown above, filtered.
This is far from a convincing human voice. The most glaring problem is that it is completely static; in natural voices, whether spoken or sung, the pitch, amplitude, and timbre are constantly varying with time. Arguably, it is the nature of the time-variations, rather than any given static snapshot, that gives the human voice its character.
By imitating these changes in time, speech synthesizers, which are usually constructed more or less along the lines described here, can often be made intelligible. But to make one sound natural, so that a listener would mistake one for a real voice, seems to be far from our present capabilities.
Speech recognition is done essentially by carrying this process out in reverse. Given a recorded sound as input, we measure its time-varying spectrum and try to fit a glottal pulse train (at some frequency and pulse width), amplitudes for the pitch and noise components, and a filter frequency response that best fit the measured spectrum. By looking for peaks in the filter frequency response we would find the frequencies, relative amplitudes, and bandwidths of any formants. Those (especially the formant frequencies, and how they are changing in time) can be matched to the known behaviors of phonemes in the language being spoken. This is compared to a huge dictionary of phonetic pronunciations of words and phrases, and the best fit becomes the output of the speech recognizer.
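The first of those analysis steps, fitting a pulse-train frequency to a recorded sound, can be sketched with a plain autocorrelation pitch estimator. This is a toy version for illustration; real recognizers use far more robust methods.

```python
import numpy as np

sr = 8000  # sample rate in Hz

def estimate_pitch(x, fmin=50.0, fmax=600.0):
    """Toy autocorrelation pitch estimator: find the lag, within the glottal
    range quoted earlier (about 50-600 pulses per second), at which the
    signal best matches a shifted copy of itself."""
    x = x - np.mean(x)
    ac = np.correlate(x, x, mode="full")[len(x) - 1:]   # lags 0, 1, 2, ...
    lo, hi = int(sr / fmax), int(sr / fmin)
    lag = lo + np.argmax(ac[lo:hi])
    return sr / lag

# check it on a synthetic "voiced" sound with a known fundamental
t = np.arange(sr // 4) / sr   # a quarter second
f0 = 120.0
x = sum(np.sin(2 * np.pi * k * f0 * t) / k for k in range(1, 6))
print(estimate_pitch(x))      # close to 120 Hz
```

On an unvoiced (noisy) segment the autocorrelation has no clear peak, which is one crude way of deciding whether the pulse train or the noise generator is active at each moment.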
Real voices don't really follow this model terribly well. All sorts of imperfections (raspiness, scratchiness, breathiness, etc.) result from variations in the way a real glottis vibrates compared to our theoretical one. There are at least two separate modes of vibration of the glottis in most healthy speakers (in singing they are called the low and high registers), which result in differently shaped pulses with different spectra.
Vowels aren't really characterized by formant location alone; people's differently shaped mouths and throats unavoidably introduce variations in formant structure.
In English, pitch variations in speech aren't considered part of the phonetic structure of the language, but in many other languages (called tonal languages), the pitch and/or the way the pitch changes in time may make the difference between one phoneme and another.
Moreover, phonemes aren't produced independently of each other; the way any phoneme is joined to the ones before and after it affects how it is produced. In practice, it often isn't even possible to say definitively at what point one phoneme ends and the following one starts. It frequently happens that, when trying to extract a phoneme in a sound editor, one hears a different phoneme entirely.
It's hard to assign exercises to this chapter, which is mostly descriptive. Instead, these exercises serve as a cumulative review of Chapters 1-5.
1. Suppose a signal is a sum of three sinusoids, each with peak amplitude 1, at 300, 400, and 500 Hz. What is the signal's average power?
2. What is the period of the signal of exercise 1?
3. How many barks wide is the spectrum of the signal of exercise 1?
4. What is the name of the interval from the lowest to the highest of the component sinusoids?
5. How many half-steps is the lowest of the component sinusoids above middle C?
6. By how many dB does this signal's power exceed the power of its lowest component (i.e., of a 300 Hz. sinusoid of peak amplitude 1)?
Project: multiplying sinusoids. This project is a demonstration of my Fundamental Law of Computer Music (last formula of Section 2.3).
First, make a "sinusoid" object and connect it both to an "output" (so you can hear it) and to a "spectrum" (so you can see it). To start with, tune the sinusoid to 500 Hz and turn the spectrum object on (turn on the "repeat" toggle and optionally increase the rate, say to 5 or 10 Hz). With the scale control at 100 you should see a peak at 500 Hz on the horizontal scale, just reaching the top of the graph (0 dB). (Really, I should be reporting the RMS power as -3 dB; I'll fix this in the next release, but no matter for now.)
Now bring out a second oscillator at 100 Hz and multiply it by the first one you have (using a "multiply" object). Disconnect the first oscillator from the spectrum object and connect the output of the multiply object instead; do the same with the output object so you hear the product of the two sinusoids. What are the frequencies and amplitudes of the resultant sinusoids? (To measure the amplitude I used the "scale" control to scale them back up to 0 dB; that told me the relative amplitude in dB compared to the original sinusoid.)
Now repeat the process to make a product of three sinusoids: disconnect the output of the multiplier, and introduce a third sinusoid and a second multiply object; multiply the new sinusoid by the product you already made of the first two. Set the third sinusoid to 50 Hz. and connect the output of the multiplier (the product of the three sinusoids) to the output and spectrum objects. What frequencies and amplitudes do you see now?
Now that you've done this you're welcome to change the frequencies of the three oscillators and see how the result behaves. If you find that interesting (I do!) try multiplying by a fourth sinusoid in the same way and enjoy.
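Outside the patching environment, the same behavior can be checked numerically. Each multiplication by a sinusoid splits every component into a sum and a difference frequency at half the amplitude, so the product of three sinusoids at 500, 100, and 50 Hz should show four components, each 12 dB down from a unit-amplitude sinusoid:

```python
import numpy as np

sr = 8000
t = np.arange(sr) / sr   # one second, so FFT bin k is exactly k Hz

# the three oscillator frequencies from the project
x = (np.cos(2 * np.pi * 500 * t)
     * np.cos(2 * np.pi * 100 * t)
     * np.cos(2 * np.pi * 50 * t))

spec = np.abs(np.fft.rfft(x)) / (sr / 2)   # peak amplitude at each 1-Hz bin
peaks = [f for f in range(1, 4001) if spec[f] > 1e-6]
print(peaks)        # → [350, 450, 550, 650]: sums and differences of 500, 100, 50
print(spec[350])    # each component is about 0.25, i.e. roughly -12 dB
```

Re-running this with other oscillator frequencies is a quick way to predict what you should see on the "spectrum" object in the patch.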