
More chord of nature fantasies, this time with mathematical varnish

xed@...

8/27/2001 7:44:19 PM

Jacky Ligon mentioned using an FFT to "analyze" some sounds. FFT stands for "Fast Fourier Transform." If we set aside the appellation "Fast," which describes only the speed of the computation and has no bearing on the actual output of the mathematical operations, the FFT boils down to a Fourier Transform jiggered with various computer code so that it runs quickly on modern computers.
Using the Fourier Transform to "analyze" (in quotes) sounds creates a lot of potential problems.
The first issue you must deal with is the hard fact that a Fourier Transform cannot ever produce anything but a harmonic-series output. If the input consists entirely of a waveform whose spectral components all fall very close to members of the harmonic series, then the Fourier Transform will produce an output which accurately reflects the spectral profile of the input, provided the sound does not change very much or very quickly over time.
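To see what "nothing but a harmonic series" means in practice, here's a minimal sketch in Python with numpy (my own illustration; the sampling rate and transform size are arbitrary choices, not anything from Jacky's analysis). The only frequencies an N-point transform can report are integer multiples of fs/N -- a harmonic series on that analysis fundamental:

    # The analysis frequencies of an N-point FFT form a harmonic series
    # on the "analysis fundamental" fs/N -- there is nothing in between.
    import numpy as np

    fs = 16000    # sampling rate in Hz (arbitrary for this sketch)
    N = 16        # FFT size
    print(np.fft.rfftfreq(N, d=1.0/fs))
    # -> [0. 1000. 2000. ... 8000.], i.e. integer multiples of fs/N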
However, you get serious problems when the spectral profile of the input does not consist entirely of perfectly harmonic series partials.
Let's take an example:
Suppose we analyze a noise burst. This sound has nothing to do with the harmonic series and cannot reasonably be described in terms of the harmonic series. Nonetheless, an FFT analysis will "analyze" the sound as having a large number of perfect harmonic overtones. In reality, the sound has none. The FFT has lied. The output is garbage, unrelated to the input.
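You can watch this happen in a couple of lines (again a sketch with numbers I picked myself): feed the transform pure noise and it dutifully reports a magnitude in every one of its harmonic bins:

    # FFT of a noise burst: the transform has no vocabulary except its
    # harmonic grid of bins, so the noise comes out as "overtones."
    import numpy as np

    rng = np.random.default_rng(0)
    noise = rng.standard_normal(1024)      # a 1024-sample noise burst
    mags = np.abs(np.fft.rfft(noise))
    print(np.count_nonzero(mags > 0), "of", mags.size, "bins lit")
    # -> essentially every bin reports energy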
This represents an extreme example. So let's consider a less extreme case: namely, an inharmonic tone, of the kind we get from a bell or a chime.
Let's say the bell has partials at 0.2, 1.1, 1.7, 3.4, 5.1, 7.3, and 13.7 times its reference pitch (which, let's say, clocks in at 500 Hz), and let's say we analyze it with a Fourier Transform whose analysis fundamental -- its bin spacing -- works out to 100 Hz.
(The reason for including the 0.2 partial is that bells tend to have "hum notes" which are not heard as belonging to the audible pitch of the bell.)
An FFT analysis of this bell note would yield something like this:
100Hz 600Hz 900Hz 1700Hz 2600Hz 3700Hz 6900Hz
So what's the problem?
The problem is that the actual partials have frequencies of
100Hz 550Hz 850Hz 1700Hz 2550Hz 3650Hz 6850Hz
As you can tell, most of these partial frequencies lie exactly halfway between integer multiples of the analysis fundamental (100 Hz). A Fourier Transform is mathematically incapable of producing anything but integer multiples of its fundamental -- as a result, the FFT "rounds off" the output. All frequency outputs of a Fourier Transform are quantized, limited to a restricted set of frequencies that are integer multiples of the fundamental.
Another way of putting it is that the FFT distorts the input, warping and twisting it into the nearest harmonic-series version. As we can see above, this means shifting the frequencies by up to 50 Hz. That ain't small potatoes, kiddies. That's a SERIOUS distortion. Sometimes, depending on the input sound, this tendency of the FFT to distort the input may have minimal effects (e.g., if the input is a moderate-amplitude clarinet tone in the middle register), while in other cases, like this bell example, the distortions generated by the FFT produce an output with little relation to the input.
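Here is a sketch of that bell in Python with numpy (my own toy reconstruction of the numbers above, with a 16000 Hz sampling rate chosen so a 160-point window gives 100 Hz bins):

    # Bell partials at 100, 550, 850, 1700, 2550, 3650, 6850 Hz,
    # analyzed with a transform whose bins fall every 100 Hz.
    import numpy as np

    fs, N = 16000, 160                 # fs/N = 100 Hz bin spacing
    t = np.arange(N) / fs
    partials = [100, 550, 850, 1700, 2550, 3650, 6850]
    bell = sum(np.sin(2 * np.pi * f * t) for f in partials)
    mags = np.abs(np.fft.rfft(bell))
    bins = np.fft.rfftfreq(N, 1.0 / fs)
    # The 550 Hz partial cannot be reported at 550 Hz; its energy gets
    # shoved into the neighboring 500 and 600 Hz bins instead.
    print(bins[5], mags[5])            # 500 Hz bin: substantial "energy"
    print(bins[6], mags[6])            # 600 Hz bin: substantial "energy"

There is no bin at 550 Hz for the transform to use, so the partial shows up rounded onto the 100 Hz grid, exactly as in the table above.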
The Fourier Transform suffers from other drawbacks as well.
For one thing, the Fourier Transform quantizes frequencies. This means that the frequencies you can represent in a Fourier Transform depend entirely on the total number of points you use for the Fourier Transform, which fixes the number of frequency bins.
Suppose you use a 16-point Fourier Transform -- you then break up the audible range into 8 frequency bands. (For reasons which I will not discuss here, the number of distinct frequency bins in an FFT is 1/2 the number of points over which you perform the FFT. For more info, see Bracewell's book on the Fourier Transform, or any graduate-level course on digital signal processing.) Any spectral energy anywhere in any of these frequency bands gets lumped together into that frequency band and misrepresented as a single spike in the frequency domain.
Let's take an example:
Suppose you analyze a sound from 0-16000 Hz using a sampling rate of 32000 Hz and you run a 16-point Fourier Transform on the sound. The 8 frequency bands which will be output from your FFT are: 0-2000 Hz, 2000-4000 Hz, 4000-6000 Hz, 6000-8000 Hz, 8000-10000 Hz, 10000-12000 Hz, 12000-14000 Hz, and 14000-16000 Hz.
Now suppose your input is a bell sound with frequency components at 1000 Hz, 5500 Hz, 7300 Hz, and 13400 Hz.
What will the Fourier Transform report as the frequency components?
One spike at 0-2000 Hz, one spike at 4000-6000 Hz, one spike at 6000-8000 Hz, and one spike at 12000-14000 Hz.
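The bucketing is just integer arithmetic, as this little sketch shows (my illustration; the numbers match the example above):

    # Which 2000 Hz-wide band of a 16-point FFT (fs = 32000 Hz)
    # each bell component lands in.
    fs, N = 32000, 16
    width = int(fs / N)                  # 2000 Hz per band
    for f in (1000, 5500, 7300, 13400):
        lo = (f // width) * width
        print(f, "Hz -> the", lo, "-", lo + width, "Hz band")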
Moreover, the height of the frequency spikes will be distorted by the FFT. The FFT not only misrepresents the frequency of a partial depending on which frequency bin the partial falls into; it also misrepresents the magnitude of the partial, depending on how far the actual frequency component sits from an integer multiple of the fundamental frequency.
Rule of thumb -- if a frequency component falls exactly halfway between integer multiples of the fundamental, its spike will be maximally broad and relatively low in magnitude (that is, much lower in magnitude than the actual partial in the real input sound), while if a frequency component falls right on top of an integer multiple of the fundamental, its spike will be maximally narrow and relatively high in magnitude (that is, very close to the actual magnitude of that partial in the input sound).
So the FFT, as a matter of mathematical necessity, distorts both the frequency and the magnitude of the sinusoidal components which it produces as output.
This means that the Fourier Transform misrepresents the input in this case not only in frequency but also in magnitude, since the 5500 Hz component will have a much lower magnitude in the Fourier Transform output than it actually has in the real acoustic sound (5500 Hz falls well away from 4000 and 6000 Hz, the nearest integer multiples of the fundamental).
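That rule of thumb is easy to check numerically. A sketch with my own toy numbers, reusing the 100 Hz-bin setup from the bell example: compare an on-bin tone with a tone sitting exactly halfway between bins:

    # On-bin vs. half-bin tone: same amplitude in, very different
    # reported magnitudes out.
    import numpy as np

    fs, N = 16000, 160                  # 100 Hz bins
    t = np.arange(N) / fs
    for f in (1700, 550):               # 1700 Hz on-bin, 550 Hz half-bin
        mags = np.abs(np.fft.rfft(np.sin(2 * np.pi * f * t)))
        print(f, "Hz: tallest bin magnitude =", round(float(mags.max()), 1))
    # The on-bin tone yields one tall narrow spike (about N/2 = 80);
    # the half-bin tone's tallest bin comes out much lower, with the
    # rest of its energy smeared broadly across neighboring bins.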
Again, for reasons which will not be discussed here, it's preferable to speak of the magnitude of a frequency spike rather than its amplitude. Mathematically, the magnitude is sqrt(A^2 + B^2) -- the square root of the complex coefficient A + iB multiplied by its complex conjugate -- where A and B are the respective cosine and sine Fourier coefficients. For technical reasons the amplitudes of the cosine and sine portions of the raw output of the Fourier Transform often do not correspond to physically measurable quantities, while the magnitude of the output of the Fourier Transform typically does. In some cases, however, the real and imaginary parts of the raw output from the Fourier Transform do have physical significance -- for instance, the Fourier Transform of the luminance function of a light beam travelling through a dispersive medium has an imaginary part which corresponds to the change in the index of refraction as the beam travels through the medium. If the index of refraction is a non-linear function, or a graded function as found in optical fibers, this can be important. In most cases, however, only the magnitude of the Fourier Transform corresponds to a physically measurable quantity. Again, if you want to know more about the physical distinction twixt magnitude and amplitude of a Fourier coefficient, consult "Signals and Systems" or any other graduate-level signal processing textbook.
To top it all off, you CAN get better frequency accuracy in a Fourier Transform -- but ONLY at the cost of worse time accuracy.
The reason for this is that a Fourier Transform does not give a representation of the magnitude of the spectral components of a sound at each instantaneous moment in time. Instead, the Fourier Transform can *only* give a vague general average over time of the magnitude of a changing power spectrum, and it does this by averaging the power in each frequency bin over the entire analysis window, from the beginning of the sampling period to the end. So whatever stretch of sound our Fourier Transform spans, its output yields only an arithmetic mean of the changing magnitude in each frequency bin over that entire stretch. In some portions of some sounds (the middle of a clarinet note with no vibrato and no tremolo, more than 1/10 of a second after the onset of the note) this produces little distortion. But at the very start of a note, when energy initially gets dumped into the acoustic system and a great deal of non-linear behavior produces complex acoustic events, this matters a great deal. As a result, the FFT produces rotten results at the very onset of a musical note.
Unfortunately, the human ear/brain system primarily recognizes musical timbre and the pitch of musical notes from the onset behavior of musical notes. So we've got a catch-22 here.
Let's take an example:
Suppose we have an input sound which begins at 1000 Hz, glides upward through 2000 Hz, and ends at 3000 Hz over the course of 1 second.
If we analyze the input sound over its entire length (32,768 data points at a sampling rate of, let's say, 32,768 Hz), the Fourier Transform will have exquisite accuracy in the frequency domain -- a full 16,384 frequency bins, each of which covers just about exactly 1 Hz. This means that if we use a 32768-point Fourier Transform to analyze our 1-second-long sound, we can get the frequency accuracy down to a nice 1 Hz (or thereabouts). That's pretty good accuracy compared to the previous example, where we had a wretched accuracy of only the nearest 2000 Hz!
But notice that we are now analyzing the *entire* sound by using
32,768 points. This means all the spectral energy throughout the entire length of the sound now gets lumped together and averaged over time. As a result, our supposedly "superbly accurate" Fourier Transform using 32,768 points reports that the sound had only one frequency component which never changed during its
entire 1-second length: a single spike at 2000Hz.
This is grossly inaccurate, since as we saw, the sound started at 1000 Hz, glided up through 2000 Hz, and ended at 3000 Hz. But the Fourier Transform does not report this, because it merely averaged the entire spectral content of the sound over the full 1-second length. As a result there's a little energy at 1000 Hz, a little energy at 3000 Hz, and by far most of the spectral energy gets reported centered around 2000 Hz.
Thus the FFT output from such a 16,384-bin Fourier Transform will be a very broad spike centered at 2000 Hz, with large "tails" extending down to 1000 Hz and up to 3000 Hz.
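A sketch of that glide in numpy (my own construction, with the 32768 Hz sampling rate chosen so the numbers above work out):

    # A 1-second glide from 1000 Hz to 3000 Hz, analyzed in a single
    # 32768-point transform: all the energy gets averaged over the
    # full second, and the *when* of each frequency is lost entirely.
    import numpy as np

    fs = 32768
    t = np.arange(fs) / fs                        # 1 second of samples
    phase = 2 * np.pi * (1000 * t + 1000 * t**2)  # inst. freq: 1000 + 2000*t Hz
    glide = np.sin(phase)
    mags = np.abs(np.fft.rfft(glide))
    freqs = np.fft.rfftfreq(fs, 1.0 / fs)         # 1 Hz bins
    band = (freqs >= 1000) & (freqs <= 3000)
    print((mags[band]**2).sum() / (mags**2).sum())
    # -> nearly all the energy sits smeared across 1000-3000 Hz,
    #    with no indication of when each frequency sounded.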
In this case, as so often before, the Fourier Transform has lied to us and bamboozled us.
(Lest those of you technically minded start hammering on me here, let me point out that the term "energy" is a malapropism here. Technically speaking, the power spectrum is the squared magnitude of the Fourier Transform -- equivalently, the Fourier Transform of the signal's autocorrelation. However, in colloquial terms it's generally correct to speak of the "energy" in a frequency bin, since the magnitude of the Fourier coefficient in that frequency bin is related to the total energy in that bin, albeit slightly indirectly. Therefore it is substantively correct to speak of the magnitude of the Fourier coefficient in a particular frequency bin as representing a certain energy.)
Lastly, let us consider the case of a sound with 2 frequency components close together. If the components are closer together than the width of the frequency bins, the 2 spikes will blob together and show up as one frequency component much broader than either.
Taking our previous example of the 16-point FFT, suppose we have a sound with frequency components at 2000 Hz, 5500 Hz, 5800 Hz, and 8000 Hz. The output will be a single tall narrow frequency spike at 0-2000 Hz, a single broad relatively low-amplitude spike at 4000-6000 Hz, and a single tall narrow spike at 6000-8000 Hz.
As you can see, this is grossly inaccurate. The Fourier Transform has once again lied to you, telling you something that just ain't so.
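Once more as a sketch in numpy (my toy numbers from the example above):

    # Components at 2000, 5500, 5800, 8000 Hz through a 16-point FFT:
    # 5500 and 5800 Hz are closer together than the 2000 Hz bin width,
    # so they merge into one indistinct lump.
    import numpy as np

    fs, N = 32000, 16
    t = np.arange(N) / fs
    x = sum(np.sin(2 * np.pi * f * t) for f in (2000, 5500, 5800, 8000))
    mags = np.abs(np.fft.rfft(x))
    for k, m in enumerate(mags):
        print(k * fs // N, "Hz bin:", round(float(m), 2))
    # The 2000 and 8000 Hz components each yield one tall narrow spike;
    # 5500 and 5800 Hz show up only as a shared blob around 4000-6000 Hz,
    # with nothing to indicate that there were two partials.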
Bottom line?
The Fourier Transform proves perpetually popular in music because it reinforces the same faulty delusions which lie at the base of 18th-, 19th-, and 20th-century music "theory" (so-called). These faulty delusions boil down to the fantasy that "sounds are periodic waves, and any periodic wave can be represented as the sum of periodic sinusoids." This is a delusion because in the real world, sounds are not periodic. They are at best quasi-periodic, with a large noise burst (the scrape of the violin bow, the breath puff on a wind instrument, etc.) at the start of the sound and increasing aperiodicity as the sound dies out. (Examine the period of a pizzicato cello pluck at the end of the sound as compared to the start. You'll find a large discrepancy.) Moreover, only a small class of possible sounds adheres to even these limitations -- namely, the tiny class of sounds produced by 18th- and 19th-century European orchestral instruments.
When we move outside Western music and begin to examine the Lobi balafon (inharmonic partials) and the Balinese gamelan (inharmonic partials) and the Thai concert orchestra (mostly inharmonic partials) and the zils and dumbeks used in Afghan zikr ceremonies (inharmonic partials) and the xylophones and drums used by Central and South American Indians (inharmonic partials), we find that in the overwhelming majority of cases the sound waves are aperiodic, since all inharmonic series produce continuously NON-repeating waveforms which constantly change, but which nonetheless have an audibly distinct musical pitch.
Like the discredited and pervasively disproven "chord of nature" fantasy that "the harmonic series is the basis of music," the Fourier Transform is apt to lead you astray all too easily unless you have a deep knowledge of all the ways in which the outdated 18th-century conceptual model embodied in the FFT fails to accurately represent reality.
Those of you enamoured of the Fourier Transform most likely do not realize that Jean-Baptiste Joseph Fourier, Baron Fourier, did not originally invent his eponymous transform to analyze acoustic phenomena. In actual fact, Fourier concocted his transform in order to solve a very knotty differential equation which had resisted the efforts of even the great d'Alembert -- namely, the heat equation. This diabolical differential equation yields the rate and spatial distribution with which heat diffuses through a solid object. The heat diffusion equation proved so intractable that no one was able to make headway with it until Fourier came up with the clever idea of transforming the coordinates and variables of the original equation. Once again, this historical fact points up
the hard reality that the Fourier Transform does not introduce or elicit new information from a function -- the Fourier Transform merely transforms the variables in such a way that previously non-obvious information already present in the original function gets represented in a new way which becomes easier to discern. Unlike a function of the kind Benoit Mandelbrot created, the Fourier Transform adds no information to a function to which it is applied. The Fourier Transform does not create information; it merely changes the way existing information is represented -- very much in the way that transforming rectangular to polar coordinates merely changes the way a phase plot is displayed (in electrical engineering), or in the way a Mercator projection, as opposed to Bucky Fuller's icosahedral Dymaxion projection of the earth's surface, merely changes the representation without adding (say) new continents or new oceans to the world.
The Fourier Transform found its first and most productive use in changing the variables in differential equations to make them easier to solve. In this regard, the Fourier Transform qualifies as no more than a fancy version of integration by parts or L'Hopital's Rule -- these are all merely mathematical tricks which render various equations more tractable by changing their overall form without changing any essential aspect of the equation itself. For those of you not familiar with differential equations, you may prefer to consider such elementary operations as "completing the square" in a simple quadratic equation. In all cases, nothing new is added and the equation is in no way augmented; it is merely changed in form in such a way as to render it easy to solve with conventional crank-and-grind by-the-numbers methods.
This is the original and still the primary use of the Fourier Transform. As a result, applying the Fourier Transform to the analysis of sound is a Johnny-come-lately kludge, one that has enjoyed a few successes and many failures. For more details on the limitations of all the various transforms -- including the Walsh transform, the Lebesgue transform, non-linear transforms like the number-theoretic transform and Maragos' slope-theoretic transform, as well as the cosine transform, the Hadamard transform, and the good ole original Fourier Transform -- see "Analysis by Synthesis" by Jean-Claude Risset in the new updated edition of "The Psychology of Music," ed. Diana Deutsch, 1998.
Over the last 20 years, computer listening systems -- which seek to accurately model the real-world behavior of the human auditory system in analyzing and identifying real-world sounds -- have systematically abandoned frequency-domain parametric methods of analysis like the Fourier Transform. These real-world computer listening systems represent the cutting edge of the current effort to model the human ear/brain system. The two most recent and most spectacularly successful computer listening systems are described in the unpublished 1996 MIT thesis by D. P. W. Ellis, "Prediction-Driven Computational Auditory Scene Analysis," and in "Computer Listening Systems," an unpublished 1999 MIT PhD thesis by Eric Scheirer.
(Much more could be said about the crucial difference twixt parametric and non-parametric methods of mathematical analysis, but there's no time for that here.)
In each case these researchers found themselves forced to abandon frequency-domain methods of analysis as models of the human ear/brain system in the real world because of the unacceptable degree to which real-world transfer functions and real-world multipath distortion generate artifacts in Fourier representations of sounds. For those of you not familiar with this technical terminology, a "real-world transfer function" means nothing more than the resonant frequencies of a room. All real rooms in the real world have acoustically reflective surfaces, and these acoustically reflective surfaces create a tuned cavity. Because of this inevitable fact (i.e., no one listens to music inside an anechoic chamber), all rooms in the real world emphasize some frequencies and null others. This produces a room transfer function with peaks and zeroes which gets convolved with the impulse response of the sound source.

The problem arises when a computer listening system tries to disentangle the transfer function of the room from the original impulse response of the sound source -- in the real world, this is not possible using a pure spectral-analysis method like the Fourier Transform. To put it bluntly, you can't unscramble an egg. Once a transfer function has been convolved with an impulse response, there is no way to un-convolve it reliably without knowing either the impulse response or the transfer function. But of course in the real world we have no way of knowing a priori what the impulse response of a sound will be before we hear it, nor do we magically (perhaps via ESP) become cognizant of the transfer function of a concert hall the instant we walk into it. (The entire idea remains ludicrous -- imagine walking into a concert hall and instantly saying to yourself: "Aha, spectral peaks at 2253 Hz and 1870 Hz and 1256 Hz and 824 Hz, and nulls at 227 Hz and 1374 Hz and 3975 Hz!" This one doesn't even pass the straight-face test.)
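The egg-unscrambling point can be made concrete in a few lines (a sketch with made-up signals; any decaying random sequence will do for the toy room response):

    # What reaches the ear is the convolution of source and room, and
    # in the frequency domain their spectra simply multiply. Given only
    # the product, neither factor can be recovered.
    import numpy as np

    rng = np.random.default_rng(1)
    source = rng.standard_normal(256)                 # unknown source
    room = rng.standard_normal(64) * np.exp(-np.arange(64) / 8.0)
    heard = np.convolve(source, room)                 # what the ear gets

    S = np.fft.rfft(source, 512)
    H = np.fft.rfft(room, 512)
    Y = np.fft.rfft(heard, 512)
    print(np.allclose(Y, S * H))                      # True: spectra multiply
    # Infinitely many (source, room) pairs yield this same product, so a
    # pure spectral analysis cannot tell you which pair produced it.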
The other problem, multipath distortion, more commonly goes by the rubric "reverberation." To the musician, moderate reverberation proves a blessing, while to the engineer and the signal processing specialist, reverberation is a particularly irksome mathematical curse. Multipath distortion (AKA reverb) not only smears out the impulse response of a sound by in effect convolving it with a comb filter, it introduces that dreaded artifact known as group delay. In short, reverberation plays hob with the Fourier Transform of an input sound, as does room resonance. This should not prove surprising. After all, if you think about it, reverb basically takes a sound and stretches a vague, mushy, spectrally-distorted version of it out in time so that it overlaps with the next sound to occur. If you play a sequence of organ notes in a reverberant chamber, the spectrum of each organ note gets smushed out and overlapped in time with the spectrum of the note which occurs immediately afterwards. To us, as listeners, this does not present a problem -- but to a Fourier Transform, which takes discrete snapshots of isolated instants in time and relies on each snapshot containing only the spectrum of the note which currently sounds, this kind of reverberant distortion produces unacceptable and insuperable problems. Ditto the room resonance -- once again, the human ear/brain system quickly adjusts to the room resonance and discounts it, while the Fourier Transform remains forced to deal with a series of isolated spectral snapshots. How does the Fourier Transform know whether the next time-slice represents a change in the overall spectrum of the sound source, or merely an unchanging sound source convolved with a room resonance which happens to have a zero or a peak at that frequency? Well, it doesn't, and there's no way to derive such information from within the Fourier Transform.
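To see the smearing directly, here's one last sketch (toy numbers of my own choosing: two 1/4-second notes and a simple exponential "reverb tail"):

    # Reverb smears the first note's spectrum into the analysis window
    # of the second note.
    import numpy as np

    fs = 16000
    t = np.arange(fs // 4) / fs                       # 0.25 s per note
    notes = np.concatenate([np.sin(2 * np.pi * 440 * t),
                            np.sin(2 * np.pi * 660 * t)])
    tail = np.exp(-np.arange(fs // 2) / (fs / 8.0))   # decaying reverb tail
    wet = np.convolve(notes, tail)[:notes.size]

    win = wet[fs // 4 : fs // 2]                      # window during 2nd note
    mags = np.abs(np.fft.rfft(win))
    freqs = np.fft.rfftfreq(win.size, 1.0 / fs)
    for f in (440, 660):
        k = int(np.argmin(np.abs(freqs - f)))
        print(f, "Hz magnitude in the second note's window:",
              round(float(mags[k]), 1))
    # The first note's 440 Hz shows up prominently inside the second
    # note's snapshot -- the transform cannot tell tail from tone.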
The simple fact that humans have no difficulty recognizing sounds subjected to even extreme amounts of reverb and room resonance, while Fourier Transforms simply go wacko and yield outputs with no recognizable point of similarity to the acoustic input when analyzing sounds with such artifacts, should tell us that a pure
spectral analysis method like the Fourier Transform does not provide an adequate model for the real-world operation of the human ear/brain system.
For an early and still definitive rundown of the various problems with frequency-domain methods of analysis as models of human hearing, see "On the Analysis and Segmentation of Real-Time Signals" by James Moorer, unpublished PhD thesis, Stanford, 1975.
-------------
--mclaren