FROM: mclaren

TO: new practical microtonality groups

SUBJECT: Jeff's points

In discussing the FFT, Jeff Scott mentioned that

"When you do an FFT, half the data in the frequency domain you get is phase information. Brian's analysis I think assumes that one is just tossing out the phase info, which is what is actually done during some FFT processing."

Actually, the FFT produces output in two components -- the real and the imaginary parts. The real parts, the A coefficients, are associated with the cosine half of the transform, while the imaginary parts, the B coefficients, come from the i*sin half of the transform. As Robert mentioned, some applications require only the cosine transform -- in fact, several companies have built ASIC chips which perform cosine transforms in hardware. Applications which make use of only the cosine transform include image processing, as for instance the images you've seen from the Hubble Telescope.

Real + imaginary coefficients represent the raw output of the Fourier Transform, but if you take sqrt((A + i*B)(A - i*B)) = sqrt(A^2 + B^2) you get the magnitude, and if you take the arctangent of (imaginary/real), i.e. arctan(B/A), you get the phase angle in radians.

So these are actually just different ways of looking at the same data -- real + imaginary coefficients as opposed to magnitude, sqrt(X^2 + Y^2), and phase, arctan(Y/X).
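As a sketch of that conversion (a deliberately naive pure-Python DFT for illustration -- any real application would use an FFT library):

```python
import cmath
import math

def dft(x):
    """Naive discrete Fourier transform -- O(N^2), for illustration only."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * math.pi * k * n / N) for n in range(N))
            for k in range(N)]

# One cycle of a cosine in 8 samples: all the energy lands in bins 1 and 7.
N = 8
signal = [math.cos(2 * math.pi * n / N) for n in range(N)]
spectrum = dft(signal)

for k, X in enumerate(spectrum):
    A, B = X.real, X.imag          # real (cosine) and imaginary (sine) parts
    magnitude = math.hypot(A, B)   # sqrt(A^2 + B^2)
    phase = math.atan2(B, A)       # phase angle in radians
    print(k, round(magnitude, 3), round(phase, 3))
```

The magnitude/phase pair carries exactly the same information as the real/imaginary pair; nothing is gained or lost until you choose to throw one of them away.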

The reason for using magnitude and phase for sounds is that audio signals do not come in real and imaginary parts. An audio pressure wave is a real-valued function of time. There is no such thing as an "imaginary sound pressure level." Just as the number of streets in a city can never be an imaginary number, the sound pressure level measured by a microphone can never be an imaginary number.

Phase typically gets thrown away in audio work because it is usually intolerably distorted by our reproduction systems. A typical loudspeaker grunges the phase of the audio signal so badly that the original phase is unrecoverable. No surprise -- as everyone knows, we have not yet discovered how to reproduce the spherical acoustic wavefront of a point sound source using either one or two acoustic loudspeakers. The original point source radiates sound spherically and preserves phase and magnitude of sound pressure level across all the steradians (solid-angle increments) of its surface area.

A loudspeaker emits a pressure wave from a frustum-shaped (truncated cone) plate embedded in a box, and as a result you get the distortion induced by a truncated cone convolved with the distortion induced by the ring around the loudspeaker convolved with the added distortion which results from being inside a hole cut in a box.

(The phase and magnitude distortion induced by the circular ring around a loudspeaker is similar mathematically to the Airy-function distortion induced in telescopes by the telescope's circular aperture. In the case of the telescope, stars in effect turn into tiny Fresnel diffraction patterns whose resolution is diffraction-limited. Likewise, sounds emitted by a loudspeaker turn into tiny Fresnel diffraction patterns whose time/frequency resolution is diffraction-limited by the size of the loudspeaker cone.)

Net result?

No one can mistake the sound of a recording played through loudspeakers for the original sound wave. We instantly know the difference twixt live and Memorex -- primarily because our crude sound reproduction methods munge the phase up so badly. Phase primarily relates to soundstage -- it's phase that allows us to precisely locate a sound (relative phase at either ear, to be exact), but phase (as embodied in group delay) also causes some cancellation twixt frequency components and messes up the clarity of the sound itself.

Loudspeakers at the very best produce a "hole in the wall" effect, and then only if you keep your head pretty nearly fixed in the "sweet spot" twixt the two loudspeakers. Move your head by more than a couple of inches, and the illusion goes away. You don't encounter this with a real sound source like a 'cello or a flute in live performance.

Binaural recordings played through headphones do a better job of creating the illusion that "you are there," but the result is still distinguishable from the original sound source. Once again, that's because the headphone output is diffraction-limited due to the limited size of the headphone drivers.

This is why most audio work throws out phase. It gets wrecked by our wretched sound reproduction equipment anyway.

"But if you keep the phase data you can also analyze it to improve your frequency resolution and also do neat tricks like Settel & Lippe have done to oh-so-cleverly separate harmonic and inharmonic information. The improved frequency info you get by looking at phase info is part of the selling point of the phase vocoder, which works this way using the FFT."

In actual fact, unless I'm mistaken, the phase vocoder merely uses phase unwrapping. This in no way improves frequency resolution -- it merely tracks changes in the frequency of each partial by unwrapping the phase whenever it wraps back around through zero.

But Jeff, this does not "improve" frequency resolution. That is entirely fixed and unavoidably limited by the number of frequency bins used. What the phase vocoder does is to assume (often correctly) that if phase advances continuously and wraps around to zero, it must be unwrapped to get the change in frequency.

That allows the phase vocoder to track changing frequency components throughout time, but once again subject to the time-frequency tradeoff inherent in ALL Fourier Transform-based analysis. If the partial changes too rapidly in frequency, not even a phase vocoder can keep track of it.

I have the original McAulay-Quatieri paper on the phase vocoder in my files. Re-read it. You'll find that while unwrapping the phase does help in tracking changing frequency components, it's no magic bullet. If the frequency components change in frequency too rapidly, or if the frequency of the partials changes at the same time that the magnitude of those partials also changes, you hit the brick wall of the time-frequency resolution tradeoff.
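A minimal sketch of that phase-unwrapping step (the function and parameter names here are hypothetical; a real phase vocoder would measure the two phases from successive windowed FFT frames):

```python
import math

def refine_frequency(k, phase1, phase2, fft_size, hop, fs):
    """Estimate a partial's true frequency from the phase advance between
    two analysis frames hop samples apart, unwrapping modulo 2*pi."""
    expected = 2 * math.pi * hop * k / fft_size          # advance if exactly at bin center
    delta = (phase2 - phase1) - expected                 # deviation from the bin center
    delta -= 2 * math.pi * round(delta / (2 * math.pi))  # unwrap into (-pi, pi]
    return (k / fft_size + delta / (2 * math.pi * hop)) * fs

# A 445 Hz partial analyzed at fs = 44100, N = 1024 falls in bin 10, whose
# center is only 430.66 Hz -- the bin spacing is a coarse 43.07 Hz.
fs, N, H = 44100, 1024, 256
f_true = 445.0
ph1 = 0.0                              # phase measured in the first frame
ph2 = 2 * math.pi * f_true * H / fs    # phase measured one hop later
print(refine_frequency(10, ph1, ph2, N, H, fs))  # recovers about 445.0 Hz
```

The refined estimate is far finer than the 43 Hz bin spacing -- but only because we assumed a single, stable partial sitting in that bin, which is exactly the assumption that fails when the partial moves too fast between frames.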

Nothing can be done about that tradeoff. It is mathematically unavoidable. In fact, the time/frequency tradeoff isn't even a feature of the Fourier Transform -- it's a basic limitation of _any_ wave representation, and it forms the basis of Heisenberg's Uncertainty Principle as well as the less-well-known radio engineering conundrum that the more precisely you pin down the timing of a pulse, the less precisely you can measure its frequency, and vice versa.

Another way of putting that is that the heterodyne carrier reinsertion has a bandwidth with unit area in wavenumber-frequency space. The higher the gain, the worse the heterodyne carrier reinsertion, and conversely, the better the heterodyne carrier reinsertion, the lower the gain of the radio receiver.

The mechanical equivalent of this unavoidable tradeoff is found in the human inner ear. The basilar membrane is a mechanically damped resonator. And the narrower the half-power resonant bandwidth of such a damped resonator (thus the better its frequency resolution), the lighter its mechanical damping must be -- and thus the longer it rings after excitation, and the poorer its time resolution. Conversely, the heavier the mechanical damping, the faster the resonator settles, but the wider the half-power resonant bandwidth (and the more poorly frequency-selective) the mechanical resonant system becomes.
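A back-of-the-envelope sketch of that resonator tradeoff (the numbers are illustrative, not cochlear measurements): for a resonator with center frequency f0 and quality factor Q, the half-power bandwidth is f0/Q while the ring-down time constant is roughly Q/(pi*f0), so their product is fixed.

```python
import math

f0 = 1000.0                          # resonator center frequency, Hz (illustrative)
for Q in (5, 50, 500):               # quality factor: higher Q = lighter damping
    bandwidth = f0 / Q               # half-power bandwidth, Hz
    ring_time = Q / (math.pi * f0)   # 1/e amplitude decay time, seconds
    print(f"Q={Q:3d}  bandwidth={bandwidth:6.1f} Hz  "
          f"ring time={ring_time * 1000:6.2f} ms  product={bandwidth * ring_time:.4f}")
# The bandwidth x ring-time product is always 1/pi: sharpening the filter
# necessarily slows its response. No damped resonator escapes this.
```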

This tradeoff is unavoidable, like the second law of thermodynamics or the maximum 50% efficiency of an audio amplifier which uses feedback to lower distortion, or the tradeoff twixt the ability to measure energy and the time during which the particle has that energy in quantum theory.

Incidentally, this unavoidable basic limitation on the mechanics of the basilar membrane explains many of the properties of the human ear/brain system, which will be discussed in an upcoming post.

Why does the human ear/brain system use two different systems of auditory analysis (a time-based autocorrelation method of analysis as well as a spectral system) rather than entirely using a spectral method of analysis? Because the human cochlea, as a mechanically damped resonant system, suffers from the inevitable tradeoffs of all damped resonant systems. Without two different simultaneous mechanisms of auditory processing, the human ear could not exhibit the remarkable qualities it does -- to wit, an extraordinary dynamic range of 0-130 dB and simultaneously extraordinarily fine frequency discrimination, around 0.3% in the 1000-2000 Hz region.

No purely spectral-based mechanical resonant system can exhibit such simultaneously large dynamic range (i.e., high sensitivity) and high frequency selectivity. This explains why nature found it necessary to equip us with two different mechanisms for perceiving sound: namely, spectral time-independent analysis and time-based spectrum-independent analysis. The micromechanics of the cochlea demand it.

---------

Jeff went on to mention:

"Also the proof is in the pudding since FFT analysis and its variants are productively used by many to get great results."

The FFT has indeed been used productively in many domains -- for example, in wind tunnels, in NMR spectroscopy, in the analysis of non-linear gradient optics, in image processing (check out those 1970s images of the moons of Saturn and Jupiter -- all FFT-processed to eliminate scanning artifacts), etc.

However, this is very different from saying that the FFT has been used productively to analyze *sounds.*

Merely because a technique works well in one field, that does not mean it necessarily works well in another field. As an example, consider the application of orbital mechanics to the moon program as opposed to the application of orbital mechanics to astrology computer programs to predict the exact house you were in at the time of your birth. Orbital mechanics produces worthwhile results in one field (the space program) and worthless crap in the other field (astrology).

The application of the FFT to many different fields has been a spectacular success. As just one concrete example, the CAT scans used at hospitals depend intimately on Fourier Transforms -- they could not produce images without 'em.

However, the FFT's track record in analyzing sounds is much poorer.

Want proof?

Fine -- here's proof: listen to a minidisk recording.

Minidisks use massive FFT processing and the finest psychoacoustic knowledge of the 1970s and early 1980s. As Carl Lumma pointed out to me at the recent microtonal conference, "Minidisks sound like s**t." I can't argue with that assessment. Minidisks do not sound impressive. And they sound as poor as they do because minidisks depend entirely on spectral processing to try to eliminate all frequency components judged "unimportant" from a sound and then store the compressed spectral residue. In fact, a minidisk dumps out 80% of the incoming bandwidth and only stores 20%.

If the human ear/brain system actually operated as a pure spectral processor, minidisk recordings would sound indistinguishable from the original linear PCM recordings. They don't. Minidisk recordings suffer from a wide variety of artifacts, including sizzle, grunge, weird low-level modulated noise, and other bizarreness and crud induced by the faulty and failed effort to represent all types of acoustic pressure waves as sums of sinusoids.

Alas, very few types of acoustic pressure waves can be accurately represented as sums of sinusoids, and the proof (as Jeff is wont to say) is in the pudding -- just listen to a minidisk recording of a CD and then listen to the original CD. You'll hear quite a difference.

Sadly, when Jeff stated "Also, there are ways of getting around (eliminating) the trade-off between frequency resolution and time resolution," that statement is just plain incorrect.

It is no more possible to get around the frequency/time resolution tradeoff built into the FFT than it is possible to get around the second law of thermodynamics or wave-particle duality.

The reason, incidentally, is simple and self-evident -- in the discrete Fourier Transform, frequency is quantized in steps of 1/(N * sampling period) (that is, the bin spacing is the sampling rate divided by N), while the time span of the analysis window is N * (sampling period). Thus both frequency and time are inherently quantized by the acoustic version of the discrete Fourier Transform, and there is absolutely no way around this in any way, shape or form.
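The arithmetic is worth spelling out (a quick sketch; 44100 Hz is just an illustrative sample rate):

```python
fs = 44100.0                    # sample rate, Hz
for N in (256, 1024, 4096):     # DFT sizes
    bin_spacing = fs / N        # frequency quantum, Hz
    window = N / fs             # time span of the analysis window, seconds
    print(f"N={N:5d}  bin spacing={bin_spacing:8.2f} Hz  "
          f"window={window * 1000:6.2f} ms  product={bin_spacing * window:.6f}")
# The product (bin spacing) x (window length) is always 1: halving the
# frequency quantum doubles the time quantum, and vice versa.
```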

Various folks will predictably deny this basic fact, and when they do they will be promulgating misinformation. As always, I shall post quotes from various textbooks to prove it.

Surprisingly, no one has yet mentioned the wavelet transform. The wavelet transform does more efficiently represent frequency and phase information. As Robert pointed out, the relative frequency resolution of the FFT becomes ridiculously high at high frequencies and successively poorer at low frequencies. This is wasteful, because the Fourier Transform gives us much more frequency resolution than we want or need at one end of the frequency spectrum, and much less resolution than we need or want at the other end. Unlike the Fourier Transform, the wavelet transform changes the size of the frequency bins as a function of frequency, so that high frequencies have relatively the same frequency resolution as low frequencies. However, as Moorer and others have shown mathematically, to obtain frequency and time resolution equal to the original linear PCM recording, the wavelet transform requires twice the number of bits per second as the equivalent PCM recording.
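A sketch of that constant-relative-resolution idea (a constant-Q filter bank in the wavelet spirit; the frequency range and the 12-bins-per-octave spacing are arbitrary choices for illustration):

```python
import math

f_min, bins_per_octave, octaves = 55.0, 12, 5
ratio = 2 ** (1 / bins_per_octave)   # geometric spacing between bin centers
Q = 1 / (ratio - 1)                  # the quality factor this spacing implies

f = f_min
for _ in range(octaves * bins_per_octave):
    bandwidth = f / Q                # grows in proportion to frequency...
    print(f"{f:8.2f} Hz  bandwidth {bandwidth:6.2f} Hz  relative {bandwidth / f:.4f}")
    f *= ratio                       # ...so relative resolution stays constant
```

Contrast this with the FFT, whose bins are the same width in Hz everywhere: here every bin is the same width in *octaves*, which is much closer to how pitch perception works.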

Once again, alas, there ain't no such thing as a free lunch. (Sigh.) Wouldn't life be peachy if we didn't have to worry about the laws of thermodynamics, the frequency/time resolution tradeoff of the Fourier Transform, or the catch-22 relationship twixt audio amplifier power and linearity in transistor-based audio amplifiers using Shockley's diode curve?

Incidentally, no one has mentioned either that most PCM digital audio recording systems today use 1-bit converters running at ultrasonic frequencies. Once again, the time/frequency resolution tradeoff rules with an iron hand -- you get the same raw bit budget encoding 44100 samples per second at 16 bits as you get encoding 1 bit at 16*44100 samples per second and then downsampling and reclocking. Alas, audiophiles have recently discovered an unsuspected side effect of this 1-bit digital audio recording process: recording 1-bit samples at such a super-high sampling rate renders the digital audio recorder 16 times more vulnerable to jitter in the master clock crystal. In effect, this convolves the digital audio recording with thermal noise, munging up the soundstage and crudding up the clarity of the recorded audio. Such devices as the Genesis Time Lens have now been developed to feed the 1-bit PCM-encoded data into a memory buffer and reclock it so as to deconvolve the thermal noise induced by jitter in the master crystal oscillator from the recorded digital audio stream.
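The bit-budget arithmetic behind that comparison is trivial (this counts raw bits only; it says nothing about the noise shaping a real 1-bit converter relies on to reach 16-bit quality):

```python
pcm_bits_per_sec = 44100 * 16              # 16-bit samples at 44.1 kHz
one_bit_bits_per_sec = (16 * 44100) * 1    # 1-bit samples at 705.6 kHz
print(pcm_bits_per_sec, one_bit_bits_per_sec)  # both 705600 bits per second
```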

--------------

--mclaren

<snip>

> Surprisingly, no one has yet mentioned the wavelet transform. The wavelet transform does more efficiently represent frequency and phase information. <snip>
> --mclaren

You just did! If you have Matlab, you can use Stanford's free wavelet package Wavelab, available here:

http://www-stat.stanford.edu/~wavelab/

I used this to good effect to bring out detail in washed-out areas (hotspots) of aerial photographs for topographic maps. I never tried any musical application. Brian, have you used wavelets for sound analysis yet? It would be interesting to see how they do, since you can custom-design your set of wavelets to the signal you want to analyze.

John Starrett