
Interesting idea - Musical Tuning and Human Biology - today's NewScientist - London

🔗Charles Lucy <lucy@harmonics.com>

8/7/2003 5:28:56 AM

http://www.newscientist.com/news/news.jsp?id=ns99994031

Charles Lucy - lucy@harmonics.com (LucyScaleDevelopments)
------------ Promoting global harmony through LucyTuning -------
for information on LucyTuning go to:
http://www.harmonics.com/lucy/
for LucyTuned Lullabies go to:
http://www.lucytune.com http://www.lucytune.co.uk or
http://www.lullabies.co.uk

🔗Haresh BAKSHI <hareshbakshi@hotmail.com>

8/7/2003 10:55:20 AM

--- In tuning@yahoogroups.com, Charles Lucy <lucy@h...> wrote:
> http://www.newscientist.com/news/news.jsp?id=ns99994031
.............

Hello ALL, please also read
http://www.newscientist.com/news/news.jsp?id=ns9999994
where a study shows that babies' musical memories are formed in womb.

This reminds me of Abhimanyu, mentioned in the Mahabharata (the
88,000-verse Sanskrit epic) as the 16-year-old son of Arjuna and
Subhadra. Abhimanyu is reported to have learnt some very complex war
strategies while still in the womb -- his father Arjuna taught him
war planning even before Abhimanyu was born.

Regards,
Haresh.

🔗Graham Breed <graham@microtonal.co.uk>

8/7/2003 2:01:22 PM

Charles Lucy wrote:
> http://www.newscientist.com/news/news.jsp?id=ns99994031

Great! Looks like we can all give up and go back to 12-equal, because it's closest to speech patterns, and all world music uses it anyway!

The original abstract's here:

http://www.jneurosci.org/cgi/content/abstract/23/18/7160

and doesn't have any mention of tuning. Although it does say "probability distribution of amplitude-frequency combinations in human utterances predicts both the structure of the chromatic scale and consonance ordering." Does anybody have access to this journal to check the details?

Graham

🔗Gene Ward Smith <gwsmith@svpal.org>

8/7/2003 3:12:17 PM

--- In tuning@yahoogroups.com, Charles Lucy <lucy@h...> wrote:
> http://www.newscientist.com/news/news.jsp?id=ns99994031

This is a really bad article. Does anyone know of a better on-line
reference?

🔗Graham Breed <graham@microtonal.co.uk>

8/7/2003 7:52:04 PM

Thanks to my informant for slipping me a copy of the original J Neurosci article :)

It turns out that speech samples mostly use the first 6 partials of a harmonic series, with the emphasis on partials 3 and 4. Everything else follows from that.

They mention pelog, but otherwise think the whole world uses subsets of the chromatic scale.

Graham

🔗Paul Erlich <perlich@aya.yale.edu>

8/7/2003 11:54:29 PM

--- In tuning@yahoogroups.com, "Gene Ward Smith" <gwsmith@s...>
wrote:
> --- In tuning@yahoogroups.com, Charles Lucy <lucy@h...> wrote:
> > http://www.newscientist.com/news/news.jsp?id=ns99994031
>
> This is a really bad article.

looks like a D+ high school paper.

> Does anyone know of a better on-line
> reference?

i hope so!

🔗Martin Braun <nombraun@telia.com>

8/10/2003 11:50:25 AM

Graham Breed wrote (Aug 7):

> Charles Lucy wrote:
>> http://www.newscientist.com/news/news.jsp?id=ns99994031

> Great! Looks like we can all give up and go back to 12-equal, because
> it's closest to speech patterns, and all world music uses it anyway!

> The original abstract's here:

> http://www.jneurosci.org/cgi/content/abstract/23/18/7160

The paper does not make any suggestion as to what anybody might like to give
up or not give up. Those, however, who wonder why the distribution of tuning
systems in the world is as it is might greatly benefit from the paper.

The study proves for the first time that the spectral content of human
speech sounds is universally biased towards the frequency ratios that occur
in the consonant intervals of the common 12-tone scale (irrespective of
tuning system).

It had previously been known that the auditory system of humans, and of
other mammals, is biased towards these frequency ratios, as seen in the
anatomy of the apparatus of pitch extraction and in psychoacoustic
experiments.

That the bias in hearing now turns out to be an adaptation to the bias in
vocalization could have been expected. The new results are therefore not
revolutionary, but they supply an important missing link.

By the way, the press release from Duke is here:

http://www.dukenews.duke.edu/news/newsrelease.asp?id=2653&catid=2,46&cpg=newsrelease.asp

An interesting side result of the study is that male speech, but not female
speech, includes a bias towards the minor third, whereas female speech
includes a stronger bias towards the major third (Fig. 2D). The reason for
this sex difference is the lower fundamental in male voices, which means
that different partials are favored by the resonance of the vocal tract.

I am very pleased about this side result, because I found a corresponding
sex difference in the distribution of frequency ratios in >5000 pairs of
"ear tones" (spontaneous otoacoustic emissions) in the mid-90s. In those
days I could only publish the finding, but not discuss it, because I did not
have a clue as to possible explanations:

http://w1.570.telia.com/~u57011259/Braun%201997%20abstract.htm

Later, when the close relation between hearing and speech became more and
more compelling in other data, I guessed that the sex difference in partial
resonance was probably related to this odd finding. I am glad that there is
now an empirical confirmation of this detail in the speech-hearing
coadaptation.

Martin

🔗Gene Ward Smith <gwsmith@svpal.org>

8/10/2003 12:33:23 PM

--- In tuning@yahoogroups.com, "Martin Braun" <nombraun@t...> wrote:

It apparently confuses 5-limit consonances with 12-equal.

> That it now turns out that the bias in hearing apparently is an
adaptation
> to the bias in vocalization could be expected.

How do you know which is cause and which is effect?

🔗Martin Braun <nombraun@telia.com>

8/10/2003 2:06:54 PM

--- In tuning@yahoogroups.com, "Gene Ward Smith" <gwsmith@s...> wrote:
> --- In tuning@yahoogroups.com, "Martin Braun" <nombraun@t...> wrote:
>
> It apparently confuses 5-limit consonances with 12-equal.

You are right, if you replace "12-equal" by "chromatic 12-tone
scale". That's the term the authors used.

"5-limit consonances" would not have been a term the authors could
have used. Not many readers would understand it.

> > That it now turns out that the bias in hearing apparently is an
> adaptation
> > to the bias in vocalization could be expected.
>
> How do you know which is cause and which is effect?

The speech side has been the "harder" one in evolution, because here
the parameters are mainly determined by the anatomy of the breathing
and eating organs. The auditory neural system has been the "softer"
side.

Martin

🔗Graham Breed <graham@microtonal.co.uk>

8/10/2003 3:22:24 PM

Gene Ward Smith wrote:

> It apparently confuses 5-limit consonances with 12-equal.

They aren't all 5-limit. The minor 7th is 7:4 and the tritone seems to be 31:22.

Graham

🔗Graham Breed <graham@microtonal.co.uk>

8/10/2003 3:44:15 PM

Martin Braun wrote:

> The paper does not make any suggestion what anybody might like to give up or
> not give up. Those, however, who wonder why the distribution of tuning
> systems in the world is as it is might greatly benefit from the paper.

The paper itself isn't as bad as the writeup suggests (as is usually the case). But it still isn't that valuable. It gives a very misleading impression of the range of tuning systems -- everybody uses a 12 note scale or subsets, except for "some interesting variations".

I don't see it being much use at all to a general tuning enthusiast. There are plenty of instruments that emphasize the lower harmonics, and plenty of commentaries on this.

> The study proves for the first time that the spectral content of human
> speech sounds is universally biased towards the frequency ratios that occur
> in the consonant intervals of the common 12-tone scale (irrespective of
> tuning system).

But *any* intervals can occur in any scale "irrespective of tuning system". They haven't shown any significant correlation between the intervals they get and either 12-note scale (equal-tempered or Pythagorean). They don't consider any other scales as controls. All they show is that speech emphasizes the lower partials.

> That it now turns out that the bias in hearing apparently is an adaptation
> to the bias in vocalization could be expected. Therefore the new results are
> not revolutionary, but they present an important missing link.

The experimental results may be important, but not of widespread interest. It's a shame they didn't stick with reporting those results -- a lot of the paper isn't that good at all.

> By the way, the press release from Duke is here:
> http://www.dukenews.duke.edu/news/newsrelease.asp?id=2653&catid=2,46&cpg=newsrelease.asp

That explains where New Scientist got all the wild ideas from. They claim causation when they only have correlation, don't mention that the same results can be derived from musical instruments, and claim to predict the chromatic scale, instead of explaining it post hoc.

Graham

🔗Gene Ward Smith <gwsmith@svpal.org>

8/10/2003 4:10:04 PM

--- In tuning@yahoogroups.com, Graham Breed <graham@m...> wrote:
> Gene Ward Smith wrote:
>
> > It apparently confuses 5-limit consonances with 12-equal.
>
> They aren't all 5-limit. The minor 7th is 7:4 and the tritone seems to
> be 31:22.

Then why in hell did they drag 12-et into it?

I find it hard to buy that we need to go to the 31-limit for this
stuff.

🔗Jeff Olliff <jolliff@dslnorthwest.net>

8/11/2003 12:16:00 AM

>Abstract:
The similarity of musical scales and consonance judgments across
human populations has no generally accepted explanation. Here we
present evidence that these aspects of auditory perception arise
from the statistical structure of naturally occurring periodic sound
stimuli. An analysis of speech sounds, the principal source of
periodic sound stimuli in the human acoustical environment, shows
that the probability distribution of amplitude-frequency
combinations in human utterances predicts both the structure of the
chromatic scale and consonance ordering. These observations suggest
that what we hear is determined by the statistical relationship
between acoustical stimuli and their naturally occurring sources,
rather than by the physical parameters of the stimulus per se.

>Martin Braun:
The speech side has been the "harder" one in evolution, because here
the parameters are mainly determined by the anatomy of the breathing
and eating organs. The auditory neural system has been the "softer"
side.

The amplitude-frequency combinations they are measuring seem to be
an aggregate of the same thing phonologists measure at a finer
grain: the formant distribution in vowel sounds. Vowels are
recognized by the distinct placement of at least three bands of
frequencies at which the vocal tract resonates. These resonances
are controlled by movements of the tongue and lips, are independent
of the excitation frequency from the vocal cords, but serve to
emphasize harmonics falling within the formant regions. Each
language divides this available vowel space in a slightly different
way, but the formant resonator is by nature a continuously variable
device: start perhaps with the tongue bunched at the roof and the
lips either drawn or pursed for an ee or umlaut, and then, while
continuing to vocalize, draw the tongue slowly back, down, and
forward, adjusting the lips as one may please. This can produce a
continuous series of recognizable vowel sounds, something like
ee-e-a-aw-o-oo-uu, according to one's linguistic heritage. I view
this recently evolved flexibility as catering to the sensitivities
of the ear, which has been fine tuned (shall we say) during an
enormously longer period.
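As a minimal sketch of that filtering (my own toy numbers: the bands
below are rough textbook-ish values for an "ah"-like vowel, not anything
from the paper):

    # which harmonics of a given fundamental fall inside assumed formant bands?
    F0 = 120.0  # fundamental (Hz), typical of male speech
    formant_bands = [(600, 900), (1000, 1300), (2400, 2800)]  # (low, high) in Hz

    for low, high in formant_bands:
        emphasized = [n for n in range(1, 30) if low <= n * F0 <= high]
        print(f"band {low}-{high} Hz emphasizes harmonics {emphasized}")

Each band passes a handful of adjacent harmonics of the one fundamental,
which is the selective emphasis described above.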

The investigators' result suggests that an analysis of vowel
formants within a language, and evidently across languages, will
also show interval relationships among the three formants of the
approximately ten vowels in a typical language. This is
interesting. I did not notice it in my sophomoric studies. These
resonances are the same sort as those of wind instruments, which
influence tone color and even perception of the fundamental, but are
much broader than the acute definition with which we perceive pitch.

People who tune instruments by beats and measurement devices know
that perceived consonance is directly related to ratios of physical
frequencies. The interesting neurophysiological question is why the
ear cares about these physical relationships. It is almost
certainly connected with language, but has been hard to
demonstrate. The ear needs to recognize the patterns of formant
frequencies, as well as non-vowel transients, and has just the
coiled up continuous frequency sensitive organ necessary to do it.
Since what we actually hear inside of the formant bands is selected
harmonics filtered from the sawtooth vocal cord fundamental, we can
perhaps understand why the ear-brain is sensitive to harmony. It
can intuit a fundamental from the harmonics in the formants. For
any single vowel, the three formants will contain perhaps three
related harmonics of the fundamental. The statistical analysis may
see these harmonic relationships within the vowels as more than a
random correlation. Then any other harmonic occurring within the
same formant may be statistically associated with the first (even if
a different vowel, different speaker, different language, or
different fundamental). So given that the frequency structure of a
human vowel is harmonically based, and that the lowest order
intervals 2/1, 3/2, 4/3, 5/4, 6/5 expand upon each other to produce
a first order approximation of a twelvish tone scale, we should not
find the result unbelievable.

It still seems curious that the frequencies of the formant bands
themselves are arranged in some kind of twelve-tone row, but worth a
try. The patterns vary by language and dialect, but should follow
the rule of optimizing signal to noise by being distinct. The
suggestion that we hear statistically is uninformed by either
musical science or linguistics.

All in fun,
Jeff

🔗Graham Breed <graham@microtonal.co.uk>

8/11/2003 12:35:29 AM

Me:
>> They aren't all 5-limit. The minor 7th is 7:4 and the tritone seems to
>> be 31:22.

Gene:
> Then why in hell did they drag 12-et into it?

Not all intervals are of equal importance. It's reasonable to assume that 12 would have a unique relationship with the lower partials, but they don't show it.

> I find it hard to buy that we need to go to the 31-limit for this
> stuff.

No, sorry, I calculated that wrong. It's a frequency ratio of 1.406, which is probably 45:32.
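For what it's worth, a few lines of Python (a quick sketch, not from the
paper) show which simple fractions sit near 1.406 at rising denominator
limits:

    from fractions import Fraction

    # closest fraction to the measured tritone ratio, at rising denominator limits
    for max_den in (5, 22, 32, 50):
        f = Fraction(1.406).limit_denominator(max_den)
        print(max_den, f, float(f))

The 31/22 that appears at a denominator limit of 22 is presumably where my
earlier figure came from; by a limit of 32 it locks onto 45/32 = 1.40625.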

Graham

🔗Gene Ward Smith <gwsmith@svpal.org>

8/11/2003 3:42:10 AM

--- In tuning@yahoogroups.com, Graham Breed <graham@m...> wrote:

> Not all intervals are of equal importance. It's reasonable to assume
> that 12 would have a unique relationship with the lower partials, but
> they don't show it.

What makes it reasonable to assume that 12 or any other equal
temperament has anything to do with the question?

🔗Graham Breed <graham@microtonal.co.uk>

8/11/2003 5:28:31 AM

Gene Ward Smith wrote:

> What makes it reasonable to assume that 12 or any other equal
> temperament has anything to do with the question?

Who cares? I didn't mention equal temperament and the paper only uses it as an example.

Graham

🔗Martin Braun <nombraun@telia.com>

8/11/2003 6:01:55 AM

Jeff Olliff wrote (46363):

> The suggestion that we hear statistically is uninformed by either
> musical science or linguistics.

The authors based it on results in vision research. But they could easily
have based it on hearing research as well. We have a top-down neural
signaling from cortex to inner ear. And data show that even the pick-up
cells in the inner ear have increased sensitivity to the frequencies they
have been most exposed to. (Not in the disco, of course ;-))

Martin

🔗Martin Braun <nombraun@telia.com>

8/11/2003 5:44:02 AM

Graham Breed wrote (46359):

> It gives a very misleading impression of the range of tuning systems --
> everybody uses a 12 note scale or subsets, except for "some interesting
> variations".

They never say "everybody", but things like "preferential use" (p. 7160).
This is perfectly correct and in no way "a very misleading impression".

> I don't see it being much use at all to a general tuning enthusiast.
> There are plenty of instruments that emphasize the lower harmonics, and
> plenty of commentaries on this.

Yes, there is a widespread view that humans have their harmony templates
from hearing musical instruments. The data now show that this is wrong. They
are, of course, no surprise for those who think in terms of biological
evolution and development of individual human brains. The vocalization of
mammals is quite a bit older than musical instruments, and most infants hear
quite a bit more of speech than of musical instruments.

> They ... claim to predict the chromatic scale, instead of explaining it
> post hoc.

It is the same in this case. The data (not the authors) predict a bias
towards the chromatic 12-tone scale. The authors thus explain the dominance
of this scale post hoc.

Martin

🔗Carl Lumma <ekin@lumma.org>

8/11/2003 11:57:40 AM

>It is the same in this case. The data (not the authors) predict a
>bias towards the chromatic 12-tone scale. The authors thus explain
>the dominance of this scale post hoc.

Martin,

Our gripe is that they are really explaining the dominance of just
intonation. The 12-tone scale happens to be exceptionally good at
representing JI among small equal temperaments, but there are many
other scales that would fit the data. The study seems fine, but
the authors should have done more research to find the right music-
theory terminology for their paper. The problem is that people like
Eytan might now use this as ammo to justify their insane revisionist
accounts of history.
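To put rough numbers on that ET claim (a back-of-envelope sketch of mine,
not anything from the paper; scoring each ET by its worst error against a
few just ratios is just one reasonable choice):

    from math import log2

    just_ratios = (3/2, 4/3, 5/4, 6/5)    # the 5-limit consonances at issue

    for n in range(5, 23):
        worst = max(abs(round(n * log2(r)) * 1200 / n - 1200 * log2(r))
                    for r in just_ratios)
        print(f"{n:2d}-ET worst error: {worst:5.1f} cents")

12 comes out best of everything below 19, but 19 and 22 beat it -- which
is the point: the data pick out JI, and more than one scale fits it.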

-Carl

🔗Graham Breed <graham@microtonal.co.uk>

8/11/2003 2:37:25 PM

Martin Braun wrote:

> They never say "everybody", but things like "preferential use" (p. 7160).
> This is perfectly correct and in no way "a very misleading impression".

They say "all cultures over the centuries" (p.7164). That looks like "everybody" to me. Are you saying there are people who don't belong to a culture, or lived outside the centuries?

They also say on p.7164 "All musical traditions ... employ a relatively small set of tonal intervals ... each interval being defined by its relationship to the lowest tone of the set." It's not important to the matter at hand, but what a howler! Ancient Greeks, who measured from the *highest* tone in the set, don't constitute a musical tradition!

On the very p.7160, they say "`musical universals' include: (1) a division of the continuous dimension of pitch into iterated sets of 12 intervals that define the chromatic scale." "Universal" includes everything. Something without a "musical universal" is not music. Hence anybody making music is obliged to use a 12 note scale, or subsets. That's absurd. They do say it.

The phrase "preferential use" refers not to use of the chromatic scale itself, but to subsets of it. The universality of the chromatic scale itself has already been asserted at that point.

> Yes, there is a wide-spread view that humans have their harmony templates
> from hearing musical instruments. The data now show that this is wrong. They
> are, of course, no surprise for those who think in terms of biological
> evolution and development of individual human brains. The vocalization of
> mammals is quite a bit older than musical instruments, and most infants hear
> quite a bit more of speech than of musical instruments.

The data do not show that! All they do is not contradict it. If you start out thinking that harmonic templates derive from speech sounds, you can still think so. If you think they're irrelevant, you can say the results are a coincidence. The only way of deciding between the theories is to test people who have been exposed to music that predominantly uses inharmonic timbres, and see if their templates are different. Even then, harmonic templates are only one facet of a musical scale.

From what I can make out, the data show a "diatonic" 7-note scale for female English speakers, and Mandarin and Tamil speakers in general. I think you'll find that infants have heard quite a bit more female than male speech!

> It is the same in this case. The data (not the authors) predict a bias
> towards the chromatic 12-tone scale. The authors thus explain the dominance
> of this scale post hoc.

The data predict small integer ratios being consonant -- the same as every other theory of consonance with harmonic timbres.

Graham

🔗Gene Ward Smith <gwsmith@svpal.org>

8/11/2003 3:07:44 PM

--- In tuning@yahoogroups.com, "Martin Braun" <nombraun@t...> wrote:

> It is the same in this case. The data (not the authors) predict a bias
> towards the chromatic 12-tone scale. The authors thus explain the
> dominance of this scale post hoc.

If the data truly does point to 12-equal, that would be a strong
argument that the thesis is wrong, and it is exposure to music, not
speech, which is determinative.

🔗hstraub64 <straub@datacomm.ch>

8/18/2003 3:58:53 AM

--- In tuning@yahoogroups.com, "Martin Braun" <nombraun@t...> wrote:
> --- In tuning@yahoogroups.com, "Gene Ward Smith" <gwsmith@s...> >
> How do you know which is cause and which is effect?
>
> The speech side has been the "harder" one in evolution, because
> here the parameters are mainly determined by the anatomy of the
>breathing and eating organs. The auditory neural system has been
>the "softer" side.
>

I disagree. You cannot say the "hardware" was there before
the "software". Long before the development of speech "harder"
and "softer" sides had already been evolving together, influencing
each other in both directions.
--
Hans Straub

🔗Martin Braun <nombraun@telia.com>

8/19/2003 5:45:23 AM

Hans Straub wrote:

> > The speech side has been the "harder" one in evolution, because
> > here the parameters are mainly determined by the anatomy of the
> >breathing and eating organs. The auditory neural system has been
> >the "softer" side.
> >
>
> I disagree. You cannot say the "hardware" was there before
> the "software". Long before the development of speech "harder"
> and "softer" sides had already been evolving together, influencing
> each other in both directions.

The issue is not related to hardware and software. I just said that the
parameters of the air spaces in the vocal tract are mainly determined by the
needs of breathing and eating. Therefore they have very little room for
adaptations in vocalization. The auditory neural system has no such
limitations. Therefore I called it "evolutionary softer".

Martin

🔗Paul Erlich <perlich@aya.yale.edu>

8/19/2003 2:31:22 PM

--- In tuning@yahoogroups.com, "hstraub64" <straub@d...> wrote:
> --- In tuning@yahoogroups.com, "Martin Braun" <nombraun@t...> wrote:
> > --- In tuning@yahoogroups.com, "Gene Ward Smith" <gwsmith@s...> >
> > How do you know which is cause and which is effect?
> >
> > The speech side has been the "harder" one in evolution, because
> > here the parameters are mainly determined by the anatomy of the
> >breathing and eating organs. The auditory neural system has been
> >the "softer" side.
> >
>
> I disagree. You cannot say the "hardware" was there before
> the "software". Long before the development of speech "harder"
> and "softer" sides had already been evolving together, influencing
> each other in both directions.
> --
> Hans Straub

let's get a grip. what we're talking about is the harmonic spectrum
of the human voice. every known mechanism of making an indefinitely
sustainable acoustic sound with discrete spectrum (i.e., not noise or
chaos) produces a harmonic spectrum, as it involves periodic
vibration. there is not much chance that the human voice could have
evolved to produce an inharmonic spectrum! the anatomy of the
breathing and eating organs could be wildly different and still the
same patterns observed in the paper would show up.
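a quick numerical illustration (my own sketch, nothing to do with the
paper's protocol): sample a sawtooth -- the usual idealization of the
glottal source -- and its spectrum has energy only at integer multiples
of the fundamental.

    import numpy as np

    fs, f0 = 8000, 200.0                  # sample rate (Hz), fundamental (Hz)
    t = np.arange(fs) / fs                # one second of samples
    saw = 2 * (t * f0 % 1.0) - 1          # periodic sawtooth, period 1/f0
    saw -= saw.mean()                     # remove the small DC offset

    spec = np.abs(np.fft.rfft(saw))
    freqs = np.fft.rfftfreq(len(saw), 1 / fs)
    print(freqs[spec > 0.05 * spec.max()][:8])   # [200. 400. 600. ... 1600.]

any periodic waveform gives the same kind of picture: harmonics at n times
the fundamental and nothing in between.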

🔗Martin Braun <nombraun@telia.com>

8/20/2003 3:56:20 AM

Paul:
> let's get a grip. what we're talking about is the harmonic spectrum
> of the human voice. every known mechanism of making an indefinitely
> sustainable acoustic sound with discrete spectrum (i.e., not noise or
> chaos) produces a harmonic spectrum,

many idiophones don't. The replicas of the Zeng bells were a drastic
example. Discrete spectrum, yes, but only occasional low-order frequency
ratios, and never a series of partials that could cause a pitch percept, or
even interfere with the pitch percept based on the fundamental. The
spreadsheets are still on my website:

http://w1.570.telia.com/~u57011259/PartialsA-tone.htm

http://w1.570.telia.com/~u57011259/PartialsB-tone.htm

http://w1.570.telia.com/~u57011259/PartialsA+B-tone.htm

> as it involves periodic
> vibration. there is not much chance that the human voice could have
> evolved to produce an inharmonic spectrum! the anatomy of the
> breathing and eating organs could be wildly different and still the
> same patterns observed in the paper would show up.

No. All throats produce harmonic patterns, but different throat sizes
produce different harmonic patterns. Further, the pattern is f0 dependent,
as the sex differences show.

The data of the study confirm that resonance in air spaces produces harmonic
patterns. This alone, of course, would be the most boring result. The
interesting result is the SPECIFIC pattern of harmonicity that comes out of
human mouths.

Nobody could have predicted, based on the finest possible tuning maths, that
the sex difference concerning the thirds is as it is.

And nobody could have predicted that the limit between visible peaks and
noise in the spectrum statistics is just between 7:5 (still on the visible
peaks' side) and 7:6 (already on the lost-in-noise side).

These two findings were unexpected, and they had to remain unexpected, until
"some idiots" started to collect data.

Martin

🔗hstraub64 <straub@datacomm.ch>

8/20/2003 4:10:04 AM

--- In tuning@yahoogroups.com, "Martin Braun" <nombraun@t...> wrote:
>
> The issue is not related to hardware and software. I just said that the
> parameters of the air spaces in the vocal tract are mainly determined by
> the needs of breathing and eating. Therefore they have very little room
> for adaptations in vocalization. The auditory neural system has no such
> limitations. Therefore I called it "evolutionary softer".
>

Aha. So this would mean that the mentioned characteristics are not
specific to humans, but would also be found in the voices of apes or
_any_ air-breathing animal. Is this so? I am not qualified, but it
appears to be a rather bold statement.
What I know is that apes, e.g., can never learn to speak properly,
for anatomic reasons. (Some can learn sign language, so it is not a
question of "software"...)

Hans Straub

🔗Martin Braun <nombraun@telia.com>

8/20/2003 10:57:38 AM

Hans:

> Aha. So this would mean that the mentioned characteristics are not
> specific to humans, but would also be found in the voices of apes or
> _any_ air-breathing animal. Is this so? I am not qualified, but it
> appears to be a rather bold statement.

It is so. In general. There are some differences in the details, similar to
those found between men, women, and children.

> What I know is that apes, e.g., can never learn to speak properly,
> for anatomic reasons. (Some can learn sign language, so it is not a
> question of "software"...)

There is a theory that they do not have enough space between the throat and
the teeth to shape vowels as we do. But even if this is true, it would be
irrelevant for speech. Humans manage speech with anywhere from 5 to 20
vowels. Apes would surely do fine with three as well. For speech, one vowel
would also be enough.

Martin

🔗Paul Erlich <perlich@aya.yale.edu>

8/20/2003 3:51:20 PM

--- In tuning@yahoogroups.com, "Martin Braun" <nombraun@t...> wrote:
> Paul:
> > let's get a grip. what we're talking about is the harmonic spectrum
> > of the human voice. every known mechanism of making an indefinitely
> > sustainable acoustic sound with discrete spectrum (i.e., not noise or
> > chaos) produces a harmonic spectrum,
>
> many idiophones don't.

those are not indefinitely sustainable when played in the manner in
which the measurements were made.

> > as it involves periodic
> > vibration. there is not much chance that the human voice could have
> > evolved to produce an inharmonic spectrum! the anatomy of the
> > breathing and eating organs could be wildly different and still the
> > same patterns observed in the paper would show up.
>
> No. All throats produce harmonic patterns, but different throat sizes
> produce different harmonic patterns.

right, but as long as the pattern is harmonic, the same general
conclusion is reached.

> The data of the study confirm that resonance in air spaces produces
> harmonic patterns.

how could they not?

> Nobody could have predicted, based on the finest possible tuning maths,
> that the sex difference concerning the thirds is as it is.

ok, but that's not the conclusion i was referring to, and wasn't even
mentioned in the summary that was linked to here.

🔗Joseph Pehrson <jpehrson@rcn.com>

8/22/2003 8:37:24 PM

--- In tuning@yahoogroups.com, Carl Lumma <ekin@l...> wrote:

/tuning/topicId_46259.html#46377

> >It is the same in this case. The data (not the authors) predict a
> >bias towards the chromatic 12-tone scale. The authors thus explain
> >the dominance of this scale post hoc.
>
> Martin,
>
> Our gripe is that they are really explaining the dominance of just
> intonation. The 12-tone scale happens to be exceptionally good at
> representing JI among small equal temperaments, but there are many
> other scales that would fit the data.

***This is exactly what I was thinking upon reading the sophomoric
abstract. Paul Erlich's chart of the ETs clearly points this out.
Sure, 12 is a winner, but there are many others, and the musical
resources of 12 have been much exhausted...

J. Pehrson

🔗Joseph Pehrson <jpehrson@rcn.com>

8/22/2003 8:52:58 PM

--- In tuning@yahoogroups.com, "Paul Erlich" <perlich@a...> wrote:

/tuning/topicId_46259.html#46427

> --- In tuning@yahoogroups.com, "hstraub64" <straub@d...> wrote:
> > --- In tuning@yahoogroups.com, "Martin Braun" <nombraun@t...>
wrote:
> > > --- In tuning@yahoogroups.com, "Gene Ward Smith" <gwsmith@s...>
>
> > > How do you know which is cause and which is effect?
> > >
> > > The speech side has been the "harder" one in evolution, because
> > > here the parameters are mainly determined by the anatomy of the
> > >breathing and eating organs. The auditory neural system has been
> > >the "softer" side.
> > >
> >
> > I disagree. You cannot say the "hardware" was there before
> > the "software". Long before the development of speech "harder"
> > and "softer" sides had already been evolving together,
influencing
> > each other in both directions.
> > --
> > Hans Straub
>
> let's get a grip. what we're talking about is the harmonic spectrum
> of the human voice. every known mechanism of making an indefinitely
> sustainable acoustic sound with discrete spectrum (i.e., not noise or
> chaos) produces a harmonic spectrum, as it involves periodic
> vibration. there is not much chance that the human voice could have
> evolved to produce an inharmonic spectrum! the anatomy of the
> breathing and eating organs could be wildly different and still the
> same patterns observed in the paper would show up.

***This paper really seems dumber and dumber all the time... (A movie
title??)

J. Pehrson

🔗Joseph Pehrson <jpehrson@rcn.com>

8/22/2003 9:12:15 PM

--- In tuning@yahoogroups.com, "Martin Braun" <nombraun@t...> wrote:

/tuning/topicId_46259.html#46467

> Hans:
>
> > Aha. So this would mean that the mentioned characteristics are not
> > specific for humans, but would also be found in voices of apes or
> > _any_ air-breathing animal. Is this so? I am not qualified, but it
> > appears to be a rather bold statement.
>
> It is so. In general. There are some differences in the details, similar
> to those found between men, women, and children.
>
> > What I know is that apes, e.g., can never learn to speak properly,
> > for anatomic reasons. (Some can learn sign language, so it is not a
> > question of "software"...)
>
> There is a theory that they do not have enough space between the throat
> and the teeth to shape vowels as we do. But even if this is true, it
> would be irrelevant for speech. Humans manage speech with anywhere from
> 5 to 20 vowels. Apes would surely do fine with three as well. For
> speech, one vowel would also be enough.
>
> Martin

***Maybe they just don't concentrate enough to do it...

J. Pehrson

🔗francois_laferriere <francois.laferriere@oxymel.com>

8/26/2003 5:43:49 AM

Hello Paul and Martin,

I am just back from vacation. I read this thread with much interest,
but it seems that you had access to the original article. Would it be
possible to provide me with it?

Thanks in advance

François Laferrière

🔗Martin Braun <nombraun@telia.com>

8/26/2003 6:25:42 AM

OK, it's in your mail. Martin

--- In tuning@yahoogroups.com, "francois_laferriere"
<francois.laferriere@o...> wrote:
> Hello Paul and Martin,
>
> I am just back from vacations. I read this tread with much interest
> but it seem that you had access to the original article. Is it
> possible to provide me with it.
>
> Thanks in advance
>
> François Laferrière

🔗francois_laferriere <francois.laferriere@oxymel.com>

8/26/2003 9:03:32 AM

Thanks, Martin, for the paper.

I didn't have to read the entire paper to discover a major flaw in the
methodology that makes the results void of any value.

With this methodology, you can take a harmonic spectrum with random f0
and a random spectral envelope (say with an average decay of a few
dB per octave), and you will get peaks corresponding to the simpler
integer ratios (as in the article) and nothing else.

Let's see how it works.

The normalized value is defined (in the paper) as

Fn = F/Fm

As the signal is harmonic, F and Fm are restricted to values in the
series f0, 2f0, 3f0 ...

Thus,

Fn = ( a * f0 ) / ( b * f0 ) = a/b , with a and b integers.

It should be noted that Fm is not the interpolated value of the first
formant frequency (F1) but the frequency of the harmonic nearest to the
first formant.

Since a male voice is unlikely to go below 100 Hz (in normal speech) and
the first formant rarely goes beyond 700 Hz, we can state that the
maximum value of "b" is 7.

Since the graph is presented between 1 and 2, the value of "a" is also
constrained to b < a <= 14. That can only lead to a preeminence of the
simple ratios corresponding to the "natural" scale.

In fact, the normalization of frequency forces individual sample
values to be distributed as they are. For instance, the peak at 2
corresponds to a/b = 2, the peak at 1.5 corresponds to a/b = 3/2, and
so on. The spreading of the peaks corresponds only to noise, measurement
error, and roundoff error.

In this scheme 2 has a good chance (2/1, 4/2, 6/3), 1.5 is not unlikely
(3/2, 6/4), 5/4 and 6/5 do exist, while 11/9 or 81/80 don't have any chance.
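If you doubt it, here is a small simulation of exactly this thought
experiment (my own sketch; f0 of 100-250 Hz and F1 of 500-700 Hz are
assumptions consistent with the constraints above):

    import random
    from collections import Counter

    random.seed(0)
    hist = Counter()
    for _ in range(100_000):
        f0 = random.uniform(100, 250)         # random fundamental (Hz)
        F1 = random.uniform(500, 700)         # random first formant (Hz)
        b = max(1, round(F1 / f0))            # harmonic nearest F1, so Fm = b*f0
        for a in range(b, 2 * b + 1):         # harmonics with a/b in [1, 2]
            hist[round(a / b, 3)] += 1.0 / a  # 1/a mimics a decaying envelope

    for ratio, weight in hist.most_common(8):
        print(ratio, round(weight, 1))

The top of the histogram comes out as 1, 2, 3/2, 4/3, 5/3, 5/4..., with
nothing at ratios like 11/9 -- the "consonances" emerge with no speech
and no music involved.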

I didn't dare to read the conclusions carefully, as they are based on
totally flawed science.

Had I been the reviewer, I would never have accepted such a paper.

yours truly

François Laferrière

--- In tuning@yahoogroups.com, "Martin Braun" <nombraun@t...> wrote:
> OK, it's in your mail. Martin
>
> --- In tuning@yahoogroups.com, "francois_laferriere"
> <francois.laferriere@o...> wrote:
> > Hello Paul and Martin,
> >
> > I am just back from vacations. I read this tread with much
interest
> > but it seem that you had access to the original article. Is it
> > possible to provide me with it.
> >
> > Thanks in advance
> >
> > François Laferrière

🔗Martin Braun <nombraun@telia.com>

8/26/2003 11:07:44 AM

Hi François,

you can't simply state that the paper is flawed by writing an unclear
criticism on a methods issue. You have to quote a finding, and after
that you have to show why you think that this finding is false,
uninteresting, or whatever.

Martin

--- In tuning@yahoogroups.com, "francois_laferriere"
<francois.laferriere@o...> wrote:
> Thank Martin for the paper
>
> I didn't had to read the entire paper to discover a major flaw in
the
> methodology that makes all the results void of any value.
>
> With this methodology, you can take harmonic spectrum with random
F0
> and random spectral enveloppe (let say with an average decay of a
few
> dB per octave), you will get peaks corresponding to the simpler
> integer ratios (as in the article) and nothing else.
>
> Let see how it work.
>
> The normalized value is defined (in the paper) as
>
> Fn = F/Fm
>
> As the signal is harmonic F and Fm are restricted to values in the
> serie f0, 2f0, 3f0 ...
>
> Thus,
>
> Fn = ( a * f0 ) / ( b * f0 ) = a/b , a an b integer.
>
> It should be noted that Fm is not the interpolated value of the
first
> formant frequency (F1) but the frequency of the harmonic nearest to
> first formant.
>
> As a male voice is unlikely to go below 100 Hz (in normal speech)
and
> that the first formant rarely goes beyond 700 Hz, we can state that
> the maximum value for "b" is 7.
>
> As the graph is presented between 1 and 2, the value of "a" is also
> constrained to be b < a < 14. That can only lead to a preeminence
of
> the simple ratio corresponding to the "natural" scale.
>
> In fact, the normalisation of frequency forces individual sample
> values to be distributed as they are. For instance, the peak at 2
> correspond to a/b = 2, the peak at 1.5 correspond at a/b 3/2 and so
> on. The spreading of the peaks correspond only to noise,
measurement
> error and roundoff error.
>
> In this scheme 2 has good chance (2/1, 4/2, 6/3), 1.5 is not
unlikely
> (3/2 6/4), 5/4 and 6/5 do exists, 11/9 or 81/80 dont have any
chance.
>
> I didn't dare to read carefully the conclusions as long as it is
> based on a totally flawed science.
>
> Would I have been the reviewer, I would never have accepted such a
> paper.
>
> yours truly
>
> François Laferrière
>
> --- In tuning@yahoogroups.com, "Martin Braun" <nombraun@t...> wrote:
> > OK, it's in your mail. Martin
> >
> > --- In tuning@yahoogroups.com, "francois_laferriere"
> > <francois.laferriere@o...> wrote:
> > > Hello Paul and Martin,
> > >
> > > I am just back from vacations. I read this tread with much
> interest
> > > but it seem that you had access to the original article. Is it
> > > possible to provide me with it.
> > >
> > > Thanks in advance
> > >
> > > François Laferrière

🔗francois_laferriere <francois.laferriere@oxymel.com>

8/27/2003 6:33:14 AM

Hello Martin,

> Martin wrote:
> you can't simply state that the paper is flawed by writing an unclear
> criticism on a methods issue. You have to quote a finding, and after
> that you have to show why you think that this finding is false,
> uninteresting, or whatever.

OK, but may I state that the paper is flawed by writing a clearer
criticism :)?
Also, it is not "a methods issue" that I criticize; it is the very
basis of the whole protocol that is flawed. It cannot make values
appear anywhere other than where they do appear, so it signifies nothing.

I thought I was clear enough in explaining the flaw in the protocol.
Either you didn't read carefully what I wrote, or I was not good
enough at explaining it. But once you get it, I can assure you that
the flaw is pretty obvious.

If you need any specific clarification about this post or the
previous one, just ask; I will be pleased to try to be more
understandable.

By the way, the paper is certainly honest, in that the protocol is
described in great detail; there is no evidence of any attempt to hide
the flaw I discovered (I must say that only a minority of scientific
papers are that clear).

My first thought was that the graphs were only a flattened version of
the classical "vowel triangle", which represents the vowels of a
phonetic system on a graph whose axes are F1 and F2 (the first and
second formant frequencies). But in fact it is something else, and it
is, unfortunately, much less significant... it makes peaks appear not
where formants appear but where harmonics appear.

Let's consider the "typical" spectrum on page 7161. Fm is the frequency
of the 4th harmonic, so on the normalized spectrum of this sample the
5th harmonic will contribute an intense value centered on 5/4, the 6th
harmonic on 3/2, and so on. We shall also have an intense contribution
at 2 due to the second formant around the 8th harmonic.

From this example, we can see that such a sample contributes intense
values at the integer ratios 5/4, 3/2, 7/4. With Fm on the 3rd harmonic,
we would have contributions at 4/3, 5/3, and 2; with Fm on the 5th
harmonic, contributions at 6/5, 7/5, 8/5, 9/5; and so on for other
locations of Fm.

Further, by the constraints explained in my previous mail, those ratios
a/b are limited to values of b less than or equal to 7 (with only a
small contribution from 7, as it is an extreme value).

The sum of all contributions must have peaks at simple integer ratios,
and as a matter of fact the simpler ones are more intense because they
sum up more contributions: for instance, 3/2 includes 3:2 but also 6:4
and 9:6; 2/1 includes 2:1, 4:2, 6:3, 8:4, 10:5, and 12:6, and is thus
relatively intense.

Further, this value of b is practically limited to 4 for women's voices.
As Fm can hardly correspond to the 5th harmonic, the contribution of 6/5
is negligible for women's voices. This has nothing to do with
major/minor third stuff, and it explains, in general, the simpler
normalized spectrum for women.
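In numbers (my own back-of-envelope; the male and female f0 ranges are
assumptions of mine, the F1 band is as above):

    # range of b = round(F1/f0) for assumed male and female f0 ranges
    def b_range(f0_lo, f0_hi, F1_lo=500.0, F1_hi=700.0):
        return round(F1_lo / f0_hi), round(F1_hi / f0_lo)

    print("male  ", b_range(100, 150))   # -> (3, 7): denominators up to 7
    print("female", b_range(180, 250))   # -> (2, 4): 5 and 6 hardly occur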

OK, I forgot one thing. When I wrote

> The spreading of the peaks corresponds only to noise, measurement
> error, and roundoff error.

I should have added that the spreading of the peaks is also (and mostly)
due to the FFT harmonic peak spreading caused by the shortness of the
analysis window (0.1 sec), and to the possible instability of the pitch
within the window.
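To put a number on the window effect (my own arithmetic, using only the
0.1 s window length):

    # a 0.1 s analysis window gives ~10 Hz frequency resolution,
    # which smears each normalized ratio by a few percent at typical Fm
    window = 0.1                 # seconds
    df = 1.0 / window            # ~10 Hz bin width
    for Fm in (400.0, 600.0, 800.0):
        print(f"Fm = {Fm:.0f} Hz -> ratio smear ~ +/-{df / Fm:.1%}")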

For me, all this results from a numerical artifact due to

- the normalization protocol
- the harmonicity of the signal
- the fact that Fm is constrained to correspond to low harmonics

However attractive the conclusion is, this has absolutely no
direct relation to musical scales.

Paul, any thoughts about this?

yours truly

François Laferrière

🔗Paul Erlich <perlich@aya.yale.edu>

8/27/2003 2:02:02 PM

--- In tuning@yahoogroups.com, "francois_laferriere"
<francois.laferriere@o...> wrote:
> Hello Martin,
>
> > Martin wrote:
> > you can't simply state that the paper is flawed by writing an
> unclear
> > criticism on a methods issue. You have to quote a finding, and
> after
> > that you have to show why you think that this finding is false,
> > uninteresting, or whatever.
>
> OK, but may I state that the paper is flawed by writing a clearer
> criticism :)?
> Also this is not "a method issue" taht I criticize, it is the very
> basis of the whole protocol that is flawed. It cannot make appear
> values elsewhere than where they appears, so it signify nothing.
>
> I tough I was clear enough in explaining the flaw in the protocol.
> Either you didn't read carefully what I wote, or I was not good
> enough at explaining it. But once you get it, I can assure you that
> the flaw is pretty obvious.
> [...]
> Paul, any tought about this?

i thought you explained it pretty well, and it agreed with what i
said earlier.

🔗Martin Braun <nombraun@telia.com>

8/28/2003 6:56:50 AM

François:

> Even though, the conclusion is attractive, this has absolutely no
> direct relation to musical scales.

As you can see in the figures of the paper, they only found peaks at ratios
as simple as or simpler than 7:5. The same applies to the majority of
intervals within the chromatic 12-tone scale, which is based on the
repetition of 5ths and 3rds.

Is this a "direct relation", or is it not?

Is there anything in the methods of this study that made the borderline
between the inclusion of ratios (7:5 and lower) and the exclusion of ratios
(7:6 and higher) lie where it lies?

If yes, please show us this "trick" in their methods. If no, I would suggest
that you withdraw your criticism and apologize to the list.

Martin

🔗francois_laferriere <francois.laferriere@oxymel.com>

8/28/2003 9:29:39 AM

François:
> Even though the conclusion is attractive, this has absolutely no
> direct relation to musical scales.

Martin:
> As you can see in the figures of the paper, they only found peaks at
> ratios as simple as or simpler than 7:5. The same applies to the
> majority of intervals within the chromatic 12-tone scale, which is
> based on the repetition of 5ths and 3rds.

Hello Martin,

seemingly, there is something you missed in my discussion, and it is
unfortunate that you do not pinpoint exactly where in my post I have
not been clear enough.

Here we go again, but this is the last time I explain at length;
nobody but you and me seems to have any interest in this thread.

stop me whenever you disagree :).

The normalized spectrum figures should theoretically be a spectrum of
discrete peaks corresponding to integer ratios. If the measurements were
done with a much longer FFT analysis on spectrally stable segments of
speech, this would be more obvious, but we can nevertheless make a
thought experiment.

Fm as defined in the paper represents the frequency of a harmonic near
F1, which is related to the effective length of the vocal tract: the
effective vocal tract length is a quarter of a wavelength. This
effective length is typically around 15 cm, so that F1 typically varies
between 500 and 750 Hz. It varies depending on the vowel (mouth and
tongue position change the effective vocal tract length). F1 values are
higher for children, and only slightly higher for women, whose vocal
tracts are shorter than men's.

The peaks of our thought-experiment graph appear at integer ratios N/Nm,
where N is the harmonic number of a given harmonic and Nm is the
harmonic number of Fm. The most common values for Nm are 4, 5, 6. 7 can
appear only for low-pitched men (around 100 Hz) and for some very open
vowels. So there is, overall, little contribution from a denominator of
7, and nearly no occurrences above 7.

As the figures display only values between 1 and 2, N can only lie
between Nm and 2*Nm (at most 14), and the simpler ratios are redundant,
more frequent, and thus more intense. The exact shape of the real-life
curve depends on a lot of statistical properties, but predictably it is
more intense at the simpler ratios.

Make another thought experiment in which the subjects are monsters with
huge vocal folds who speak at a pitch around 30 Hz (but with a normal
vocal tract). For them, Nm is allowed to go up to 25, so we would have a
much denser normalized spectrum.

But for us normal humans, we are not allowed a complex normalized
spectrum, because

pitch > 100 Hz
F1 < 750 Hz

That's it: the pitch is too high, and the vocal tract too long, for 7 to
occur often as Nm. Too bad for this theory.

Is this a "direct relation", or is it not?
> Is there anything in the methods of this study that made the
borderline
between the inclusion of ratios (7:5 and lower) and the exclusion of
ratios
(7:6 and higher) lie where it lies?

It is in the paper: check figure 3A; 5 is nearly twice as likely to
occur as a denominator as 6.

> If yes, please show us this "trick" in their methods.

As I stated in a previous post, I think that the paper is honest, and
it is not a "trick" or a hoax, only a misleading error in the way the
data was gathered.

> If no, I would suggest that you withdraw your criticism and apologize
> to the list.

Did I offend anybody?

Should someone apologize to the list whenever they emit an opinion you
disagree with?

Come on! ;-) It is a discussion group!

yours truly

François Laferrière

🔗Carl Lumma <ekin@lumma.org>

8/28/2003 11:38:46 AM

>Here we go again, but this is the last time I explain at length;
>nobody but you and me seems to have any interest in this thread.
>
>stop me whenever you disagree :).

I'm interested in it.

>But for us normal humans, we are not allowed a complex normalized
>spectrum, because
>
>pitch > 100 Hz
>F1 < 750 Hz
>
>That's it: the pitch is too high, and the vocal tract too long, for 7
>to occur often as Nm. Too bad for this theory.

Ok, but this is not a methods objection. It shows instead that
the result is trivial.

I will re-read your earlier post, where I believe you describe
the methods objection. I didn't follow you the first time I
read it.

>Did I offend anybody?

Not me. I'm quite glad you spoke up!

-Carl

🔗hstraub64 <straub@datacomm.ch>

8/29/2003 5:02:56 AM

--- In tuning@yahoogroups.com, Carl Lumma <ekin@l...> wrote:
> >Here we go again, but this is the last time I explain at length;
> >nobody but you and me seems to have any interest in this thread.
> >
> >stop me whenever you disagree :).
>
> I'm interested in it.
>

I am interested, too. Just not contributing when I have nothing to
say...

🔗Martin Braun <nombraun@telia.com>

8/29/2003 10:35:21 AM

François:

>> Is there anything in the methods of this study that made the borderline
>> between the inclusion of ratios (7:5 and lower) and the exclusion of
>> ratios (7:6 and higher) lie where it lies?

> It is in the paper: check figure 3A; 5 is nearly twice as likely to
> occur as a denominator as 6.

Of course, it's in the paper. My question was whether they put it into the
paper by choosing a special trick to get it in. Fig. 3A is an empirical
figure. It reflects the nature of human speech, not a methods decision by
the authors!!!

> As I stated in a previous post, I think that the paper is honest, and
> it is not a "trick", or a hoax, only a misleading error in the way
> data was gathered.

What then was this error? I can see no error in the FFT analysis.

>> If no, I would suggest that you withdraw your criticism and apologize
>> to the list.

> Did I offend anybody?

That was not at issue. At issue was whether you misinformed the group by
stating false facts. If you did, an apology would be in order. This has
nothing to do with having an opinion.

You claimed that the authors made a methodological error that led to the
results they received. But you did not say what their error was. If you
cannot say this in three simple sentences, referring to a specific line of
text in the paper, we must assume that there is no such error.

Martin

🔗francois_laferriere <francois.laferriere@oxymel.com>

9/1/2003 2:59:02 AM

Martin

I have explained the flaw in the method three times, with different
wording and/or more detail, asking you to pinpoint exactly where you
either disagree or do not understand (because seemingly I am not that
good at being clear).

Seemingly, you are unwilling to enter a fair discussion, but just
repeatedly state that I misinform the group (how?), that I state false
facts (which ones, please?), and that my writing is unclear (in spite of
the fact that other group members like Paul tell me that I am not
totally obscure).

After that carload of amiabilities, you state that "I" should apologize!!!!

OK, I take it as humour...

> My question was, did they put it into the
> paper by choosing a special trick to get it in.
> Fig 3A is an empirical one.
> It does reflect the nature of human speech,
> not a methods decision by the authors!!!

In this figure 3A there is indeed a methods decision, though probably
not a malicious one. It is generally accepted (sorry, no bibliography)
that there is no significant physical coupling between the vocal folds
and the vocal tract, so there is no significant correlation between f0
(the pitch) and F1, F2, ... (the formant frequencies). As random
variables, then, f0 and F1 can be considered continuous and independent
(or slightly correlated, which is not important for what follows).

The value of Fm does not correspond to a physical property of the vocal
tract, unlike F1 (the effective vocal tract resonance). F1 can be
computed from the signal by various interpolation methods; it is
continuously distributed and is in no way forced to be an integer
multiple of f0.

On the other hand, Fm is by definition a multiple of f0; it is thus a
roundoff of the real physical value (F1). Fm has no physical meaning
(numerology excluded). By this simple process of rounding Fm off to an
integer multiple of f0, a continuous spectrum is artificially
transformed into a discrete spectrum with a very limited number of
significant values.

The harmonic number of Fm is not a "natural" measure; it is the result
of a computation that creates an artificial set of discrete random
variables with a conveniently very limited range, which are afterwards
"shaken and baked" to produce a very poor continuous "normalized"
spectrum that is just a discrete spectrum in disguise.

So no, figure 3A reflects nothing in the nature of human speech. Taking
Fm as

round(F1 / speaker height in centimeters)

would have given very similar results.

That is another way to explain the error in the method.

Again, as soon as you pinpoint where I am wrong or unclear, I am
absolutely willing to admit my error (it would not be the first time I
have made a fool of myself on this list) or to explain until
crystal-like clarity is reached (I am not that good at pedagogy, but I
can try again). You are my guest, Martin.

> Martin
> What then was this error? I can see no error in the FFT analysis.

No error in the FFT as presented; as I explained, things go bad just
afterward.

With even more good will, I recap below in three sentences, as you asked
(even though it is just a summary, not a clarification):

1- The theoretical "normalized" spectrum is a spectrum of discrete
values at certain integer ratios. No surprise that the actual normalized
spectrum has peaks.
2- The physical relationship between the f0 range and the F1 range
limits the number of significant peaks in the interval [1, 2].
3- This leads to the trivial result: peaks that are physically unlikely
do not appear, while those that are more likely, given the distribution
of the harmonic number of Fm, such as N/2, N/3, N/4, N/5, are prominent.

All this is just acoustics, elementary algebra, and fairly simple
statistics; it is totally neuroscience-free, otherwise I would not have
permitted myself to challenge you, Martin :).

All the details are in my previous posts.

yours truly

François Laferrière

🔗Martin Braun <nombraun@telia.com>

9/1/2003 12:46:36 PM

François:

>> My question was, did they put it into the
>> paper by choosing a special trick to get it in.
>> Fig 3A is an empirical one.
>> It does reflect the nature of human speech,
>> not a methods decision by the authors!!!

> In this figure 3A there is indeed a methods decision, though probably
> not a malicious one. It is generally accepted (sorry, no bibliography)
> that there is no significant physical coupling between the vocal folds
> and the vocal tract, so there is no significant correlation between f0
> (the pitch) and F1, F2, ... (the formant frequencies). As random
> variables, then, f0 and F1 can be considered continuous and independent
> (or slightly correlated, which is not important for what follows).

OK, now I can make a cautious guess as to what your misunderstanding may be.
You seem to think that the study is about vowel formants. This, however, is
not the case. The study deals with the ratios between the peaks in the voice
spectrum, irrespective of any qualities that are related to vowels. Please
have a look at that part of the paper where the sampling of the speech
material is described.

> On the other hand, Fm is, by definition, a multiple of f0; it is thus
> a round-off of the real physical value (F1). Fm has no physical
> meaning (numerology excluded).

This is wrong. The amplitudes of all peaks in the investigated sound spectra
reflect the PHYSICAL resonance properties of the vocal tracts of the tested
subjects.

Martin

🔗Jeff Olliff <jolliff@dslnorthwest.net>

9/1/2003 10:15:36 PM

Martin, in every particular sample the energy peaks in the voice
spectrum match those harmonics of the fundamental favored by the
vowel formants. Thank you for clarifying for us non-subscribers to
the journal that the study deals with the ratios between the peaks
in the voice spectrum. The study does not find peaks at particular
frequencies presumably (that would be so hard to believe), but
rather the spectral peaks within every sample are taken as a data
set of ratios and energies, and these are accumulated to get a
picture of total energies at various ratios. A simple pattern of
harmonic ratios, within the limits Francois has described, seems
like an expected result of measuring many harmonic ratios. How can
the researchers leap from this apparently trivial result to a
speculation that habituation to spectral emphasis patterns in speech
[by the grace of God, harmonic] causes or biases melodic scale
preferences? Tuners of instruments use harmonic relationships to
make scales, and probably have since gut harps, skin drums and bone
flutes. With no more basis at this time than the researchers, but
neither any less, I suggest that our flair for perceiving vowel
formants out of sketchy but harmonically related sensory input,
underlies our interest in melody and harmony. We judge harmonic
relationships with extraordinary physical accuracy, down to
multisecond beats, and other artifacts. We seem to have generalized
harmonic handlers in our wetware, of the same order of physical
accuracy as those underlying our spatial interpretations. It is
counterintuitive to suppose we need to have this facility beaten
into us, or that some nonharmonic arrangement could have evolved, or
could still be substituted.

Besides, if the peaks drop off at the 7-limit, so that 7/6 is not
statistically significant, and so not reinforced in experience, how
come I hear that harmony no sweat? My analysis of Bach is that he
heard it and used it. Others on this list can hear higher limits than
that.

Perhaps you can elucidate the methods and implications of the
results, bearing in mind both the knowledge and limitations of your
audience of tuning gurus and enthusiasts. I appreciate Francois
taking the trouble to educate us in these matters, not that I
couldn't use further instruction. I apologize in advance for any
misrepresentations.

--- In tuning@yahoogroups.com, "Martin Braun" <nombraun@t...> wrote:
> OK, now I can make a cautious guess as to what your misunderstanding
> may be. You seem to think that the study is about vowel formants.
> This, however, is not the case. The study deals with the ratios
> between the peaks in the voice spectrum, irrespective of any
> qualities that are related to vowels. Please have a look at that
> part of the paper where the sampling of the speech material is
> described.
>
>
> > On the other hand, Fm is, by definition, a multiple of f0; it is
> > thus a round-off of the real physical value (F1). Fm has no
> > physical meaning (numerology excluded).
>
> This is wrong. The amplitudes of all peaks in the investigated sound
> spectra reflect the PHYSICAL resonance properties of the vocal
> tracts of the tested subjects.

Jeff says: I hate to be picky, but this second note appears to
contradict the first, in that the PHYSICAL resonance properties of
the vocal tracts are most accurately described by the vowel formants.
>
> Martin

🔗Bill Sethares <sethares@ece.wisc.edu>

9/2/2003 7:17:37 AM

Here is an email I wrote when this article first appeared, which
explains how they process the data to arrive at their results -- and
why the method is problematic. I believe that this is similar to what
Dominique was saying earlier.

I've just finished reading the paper that Graham and Gene and
others have been discussing:

The Statistical Structure of Human Speech Sounds Predicts
Musical Universals

by David A. Schwartz, Catherine Q. Howe, and Dale Purves

The graphs are striking, and seem to present a new and powerful
argument for JI (even though they claim it's a case for 12-tet!).
Before drawing conclusions on the importance of this work, it's
maybe worth a look at what they actually did - and what assumptions
they made, perhaps without even realizing it.

Basically they measure the spectrum of many speakers in a variety of
languages, process the data, and come up with a graph that looks a lot like
a consonance curve such as Plomp and Levelt, or the one-footed bride from
Partch, or Helmholtz's graph of comparative consonance - with nice consonant
peaks at the simple integer ratios and dissonant valleys between. What's
new is that, unlike all the above, who consider the interaction between pairs
of harmonic sounds, they arrive at this curve by a statistical analysis
that processes only one (harmonic) sound at a time - the isolated voice.
Hence their claim about universality, etc.

Let's look carefully at where this curve comes from - at their processing
of the data. Here's what they do: for each spectrum (graph of frequency
vs magnitude) normalize both the magnitude (by the largest magnitude) and
normalize the frequency (by the frequency of the partial with the greatest
magnitude). The result in every case is a spectrum for which the frequency
of the largest partial is "1" and the magnitude of the largest partial is
"1" (obviously not physical units, but normalized units). The voice has
the property that usually the largest partial is not the fundamental - for
example, maybe the voice has partials at

150 300 450 600 750 900 Hz
with amplitudes normalized to
0.1 0.2 1.0 0.4 0.3 0.2

Doing the frequency normalization then gives the processed spectrum as

freq = .33 .66 1 1.33 1.66 2.0
mag = 0.1 0.2 1.0 0.4 0.3 0.2

Another example: a voice might have partials at:

250 500 750 1000 Hz
.3 1 .6 .4

Normalized this becomes:

freq = .5 1 1.5 2
mag = .3 1 .6 .4

So - now you take and average a few thousand of these together.
What do you get? Something with a lot of energy at 1 (of course)
and smaller amounts of energy at - you guessed it - small integer
ratios like 1.33, 1.5, 1.66, and 2!

In other words, their curve is a result of the way that they have
processed/normalized (the frequencies of) the data!
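
To make this concrete, here is a quick simulation of the procedure as
I have described it (my reconstruction, not the authors' code; the f0
range and the random six-partial spectra are assumptions):

import numpy as np

rng = np.random.default_rng(1)
edges = np.linspace(0.25, 2.25, 401)     # ratio axis, 0.005-wide bins
acc = np.zeros(len(edges) - 1)

for _ in range(20_000):
    f0 = rng.uniform(100.0, 300.0)       # assumed fundamental range
    freqs = np.arange(1, 7) * f0         # six harmonic partials
    mags = rng.random(6)                 # arbitrary amplitude pattern
    k = np.argmax(mags)                  # index of the largest partial
    r = freqs / freqs[k]                 # frequency normalization
    m = mags / mags[k]                   # magnitude normalization
    idx = np.searchsorted(edges, r) - 1
    ok = (idx >= 0) & (idx < len(acc))
    np.add.at(acc, idx[ok], m[ok])       # accumulate the "ratio spectrum"

# The average piles up at 1/2, 2/3, 3/4, 1, 4/3, 3/2, 5/3, 2 ...
for i in sorted(np.argsort(acc)[-10:]):
    print(f"ratio ~{(edges[i] + edges[i+1]) / 2:5.3f}  energy {acc[i]:9.1f}")

Every sample can contribute only exact small-integer ratios, so the
average cannot help but peak there.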

So, why might they have chosen to normalize the data in just this way?

Well, if you believe that the ear focuses attention on the loudest partial
in a sound, then this is a natural thing to do. Indeed, these people are
proud of the fact that they developed this technique for visual processes.
In vision, it is quite a good assumption that (say) the brightest spot is
the most salient. But in audition there is no reason to think that the ear
pays much attention to the loudest partial in a cluster - indeed, virtual
pitch theory tells us that the ear focuses instead on a harmonic template
and not on the individual partials themselves.

It appears to me that this argument is fundamentally based on a fallacious
assumption about auditory perception.

--Bill Sethares

🔗Martin Braun <nombraun@telia.com>

9/2/2003 8:30:39 AM

Jeff:

> Martin, in every particular sample the energy peaks in the voice
> spectrum match those harmonics of the fundamental favored by the
> vowel formants.

If you sample an ideal, steady-state vowel from a textbook, yes. But in real
speech very little is ideal, and almost nothing is steady-state. Further,
the authors of the study did not even focus on vowels. Their samples of 0.1
sec were randomly selected. They only wrote an algorithm to avoid silent
samples.

So what they sampled was actually a little bit of harmony with plenty of
noise. But it is exactly that type of sound that is with us during much of
our lives. The interesting result was WHICH of all the theoretically
possible harmonies actually stick out of this noise.

> A simple pattern of harmonic ratios, within the limits Francois has
> described, seems like an expected result of measuring many harmonic
> ratios.

Yes, but nobody could have guessed beforehand WHICH harmonic ratios would be
sticking out of the noise.

> How can the researchers leap from this apparently trivial result to a
> speculation that habituation to spectral emphasis patterns in speech
> [by the grace of God, harmonic] causes or biases melodic scale
> preferences?

An important part of the paper is the auditory side of it. Evolution adapted
our ears to what is around them. We knew before this study was made
that our ears are very well adapted to the ratios 4:3, 5:4, and 6:5, but
less well to the ratios 7:6 and 9:7. The new speech data give us a strong
clue why things might have gone that way, and not another way that was
biologically possible.

> Besides, if the peaks drop off at the 7-limit, so that 7/6 is not
> statistically significant,

It is not only statistically insignificant in the sampled data, it is not
even visible as being different from the noise.

> and so not reinforced in experience, how come I hear that harmony
> no sweat?

If you sample fine vowels from fine male voices, I would expect that 7:6
clearly pops out of the noise.

> My analysis of Bach is that he heard it and used it. Others on this
> list can hear higher limits than that.

Of course, but it's getting more difficult and you might need some practice.
Much in the world is probabilistic, but there are also kinds of breakpoints.
In hearing there may well be one between 6:5 and 7:6.

> .....the PHYSICAL resonance properties of
> the vocal tracts are most accurately described by the vowel formants.

Just to be sure, vowel formants are neither certain frequencies nor certain
harmonics. They are frequency bands that cover several adjacent harmonics.
Also, these bands are not fixed. Not even in one speaker. They vary so much
that the brain's speech analysis cannot use absolute frequency ranges, but
only the relations between different formants, that is, between different
frequency ranges of the same speaker at one point in time.

Martin

🔗francois_laferriere <francois.laferriere@oxymel.com>

9/2/2003 8:33:35 AM

Hello Martin

When you write:

> Martin:
> This is wrong. The amplitudes of all peaks in the investigated
> sound spectra reflect the PHYSICAL resonance properties of the
> vocal tracts of the tested subjects.

I suppose that you mean the peaks that appear in the spectra of
figure 1C around 210, 415, 625, 830 Hz etc. (rounded to the precision
I got from the PDF file) and that are then normalised to contribute
to the normalized spectrum of figure 2. If you think that those peaks
are (or reflect) the physical resonance of the vocal tract, you are
wrong, sorry. Those peaks are the harmonics of the glottal source.

Back to basics: f0 and Fn are independent
----------------------------------------------

The human voice is made of two fairly independent components, namely the
glottal source (vocal folds) and the vocal tract (the flexible fleshy
tube above the glottis).

As discussed before in this group (with you, as far as I remember),
the frequency of the glottal source depends on the vocal folds'
mechanical properties (thickness, tension, etc.) and on the subglottal
pressure. This mechanism can be described as phase locking. As I have
measured myself, the human voice is harmonic (i.e. not measurably
inharmonic on a 16-bit PCM recording).

The vocal tract plays the role of a passive resonator, or a set of
bandpass filters that amplify some broad bands named formants. The
lowest of those bands, F1, corresponds to a wavelength which is 4
times the effective length of the vocal tract.

f0 and F1 are fairly independent. It is possible to change the pitch
without (significantly) moving the formants (glissando on a vowel)
and vice versa (one pitch, many vowels). Anybody can try it.

In summary, the glottal source is responsible for the harmonic comb-
like structure of the spectrum, while the vocal tract is responsible
for the bumpy spectral envelope with the formant structure.
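
A toy source-filter sketch of this summary (my own illustration; the
formant frequency, bandwidth and sample rate are assumed values): an
impulse train from the "glottis" is passed through a single two-pole
resonator standing in for F1.

import numpy as np

fs = 16_000                           # sample rate (Hz), assumed

def voice(f0, F1=600.0, B=80.0, dur=0.5):
    """Impulse train at f0 (harmonic comb) filtered by one two-pole
    resonator standing in for the first formant F1 (bandwidth B)."""
    n = int(fs * dur)
    src = np.zeros(n)
    src[::int(fs / f0)] = 1.0         # glottal source: harmonics of f0
    r = np.exp(-np.pi * B / fs)       # pole radius from the bandwidth
    a1 = -2.0 * r * np.cos(2.0 * np.pi * F1 / fs)
    a2 = r * r
    out = np.zeros(n)
    for i in range(n):                # y[i] = x[i] - a1*y[i-1] - a2*y[i-2]
        out[i] = src[i] - a1 * out[i - 1] - a2 * out[i - 2]
    return out

# Same F1, two pitches: the strongest spectral peak is the harmonic
# of f0 that falls nearest the formant - which is exactly the paper's
# "Fm". The comb moves with f0; the envelope peak stays near F1.
for f0 in (110, 220):
    spec = np.abs(np.fft.rfft(voice(f0)))
    spec[0] = 0.0                     # ignore the DC component
    peak_hz = np.argmax(spec) * fs / (2 * (len(spec) - 1))
    print(f"f0 = {f0} Hz -> largest peak near {peak_hz:.0f} Hz")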

More advanced topics: f0 and formants can be related, but not much
------------------------------------------------------------------

In fact, it is possible, though not easy, to change the pitch without
moving the formants at all, simply because the whole glottis tends to
move a little up and down as the pitch goes up and down. This
movement is very limited compared to the length of the vocal tract,
say a variation of some 5%, while the pitch can move by more than two
octaves: 400%. 5% vs. 400+%: that is not much coupling.

Trained singers learn to control glottis movement. Further, trained
singers, especially sopranos, are capable of shaping formants so that
the largest harmonics fall very near the center of a formant, in
order to ensure the most efficient energy transfer (but often at the
expense of vowel intelligibility). That is certainly not natural; it
requires education, training and practice.

J. Sundberg has explored these fascinating topics in depth.

Then back to the JNS paper
--------------------------
It should be clear, now, that Fm as defined in the JNS paper is not a
physical resonance value. It is a value that lies within plus or
minus f0/2 of F1.

If Fm corresponded to the vocal tract resonance (a physically
meaningful value), that would mean (among other more or less silly
things) that the effective vocal tract length is constrained to take
values in the set

            c
    L = ----------     n = 1, 2, 3 ...
        4 * n * f0

In the case of figure 1C, for a pitch of 210 Hz, the length of the
vocal tract would be limited to the values

        38 cm
    L = -----   =>   38, 19, 12.7, 9.5 .... cm
          n

As common sense indicates, the effective length of a human vocal
tract is not limited to a discrete set of values (as it is for a
valve trumpet): it is a continuous variable.

mmmmm................

OK, let us say I am wrong in the proof by contradiction above.

Let us say that F1 is F1, and that Fm is a genuine, new,
not-until-now-discovered property of the vocal tract. As Fm is an
acoustic frequency, NECESSARILY there is a physical length associated
with Fm. But there is no such thing as a length that is limited to a
discrete set of values in the vocal tract. A way to get a discrete
set of lengths is by an artificial integer rounding-off process, as
described in the JNS paper.

Conclusion: the way Fm is derived is a computational artifact, void
of any meaning (numerology still excluded).

The more I get into it, the more deeply I am convinced that this
whole spectrum normalisation procedure is fallacious, to say the
least.

But if you are not yet convinced.... I think that I give up.

yours truly

François Laferrière

🔗Martin Braun <nombraun@telia.com>

9/2/2003 10:14:43 AM

Bill:

> The voice has the property that usually the largest partial is not the
> fundamental - for example, maybe the voice has partials at

> 150 300 450 600 750 900 Hz
> with amplitudes normalized to
> 0.1 0.2 1.0 0.4 0.3 0.2

> Doing the frequency normalization then gives the processed spectrum as

> freq = .33 .66 1 1.33 1.66 2.0
> mag = 0.1 0.2 1.0 0.4 0.3 0.2

> Another example: a voice might have partials at:

> 250 500 750 1000 Hz
> .3 1 .6 .4

> Normalized this becomes:

> freq = .5 1 1.5 2
> mag = .3 1 .6 .4

Thank you. This will certainly help many readers of the list.

> So - now you take and average a few thousand of these together.
> What do you get? Something with a lot of energy at 1 (of course)
> and smaller amounts of energy at - you guessed it - small integer
> ratios like 1.33, 1.5, 1.66, and 2!

> In other words, their curve is a result of the way that they have
> processed/normalized (the frequencies of) the data!

This, however, is not true. The authors did NOT normalize data from a
theory textbook, but real speech data that were randomly sampled. Your examples
suggest that they should have found an averaged ratio spectrum with vertical
lines at the low-order ratios. In fact they found one in which most of the
power is BETWEEN these ratios. See my previous post on this issue.

> So, why they might have chosen to normalize the data in just this way?

> Well, if you believe that the ear focuses attention on the loudest partial
> in a sound, then this is a natural thing to do. Indeed, these people are
> proud of the fact that they developed this technique for visual processes.
> In vision, it is quite a good assumption that (say) the brightest spot is
> the most salient. But in audition there is no reason to think that the ear
> pays much attention to the loudest partial in a cluster - indeed, virtual
> pitch theory tells us that the ear focuses instead on a harmonic template
> and not on the individual partials themselves.

Well ?? !! Well, this is surprising. You based your criticism on a guess
about the possible motives of the authors, even though the authors
explicitly explained why they chose precisely this normalization and
nothing else. On p. 7161 they write:

"This method of normalization avoids any assumptions about the structure of
human speech sounds, e.g., that such sounds should be conceptualized in
terms of ideal harmonic series."

If you think about this, I bet you will admit that this is the best possible
choice they could make to tackle the "problem" that the order of amplitudes
of the harmonics in the real spectra does not fit any norm.

[By the way, the comparison with vision, which the authors discussed, deals
with something different. They suggested that the "statistical" (others say
"probabilistic") learning in hearing may be similar to that in vision. The
harmonic bias ("harmonic template") you mentioned has now been shown to be a
possible result of statistical learning from speech sounds during evolution
and development.
This had been suggested earlier, also by Terhardt. But nobody could have
known beforehand how this would look in detail with REAL, randomly
sampled speech sounds.]

> It appears to me that this argument is fundamentally based on a fallacious
> assumption about auditory perception.

It would not have appeared like that, had you read the methods section
of the paper.

Martin

🔗Martin Braun <nombraun@telia.com>

9/2/2003 10:30:13 AM

François:

> If you think that those peaks
> are (or reflect) the physical resonance of the vocal tract you are
> wrong, sorry. Those peaks are the harmonics of the glottal source.

"Those peaks" each have a frequency AND an amplitude. The frequencies in
ideal, steady-state vowels reflect nothing, if they are harmonics of the
glottal one.

But the amplitudes reflect the physical resonance of the vocal tract. That's
what the paper's data are all about. Perhaps Bill's examples could now make
this clear to some more readers.

Martin

🔗Bill Sethares <sethares@ece.wisc.edu>

9/2/2003 11:58:46 AM

> It would not have appeared like that, had you read the methods
> section of the paper.

I read every word of the paper...

The authors said:

"This method of normalization avoids any assumptions about the structure of
human speech sounds, e.g., that such sounds should be conceptualized in
terms of ideal harmonic series."

True. However, what they DID was to normalize the sounds by the
frequency of the largest partial.
What does such a normalization mean? It means that
in some way, the largest partial must be a meaningful entity.
I know of no psychoacoustic study that has ever indicated that
"the largest partial" in a sound is a particularly salient feature
of the sound, human voice or other.
Hence my criticism is of their method,
not on my guess as to their motives.

While the authors avoid assumptions about the
human voice, they are *implicitly* assuming something
about the importance of a particular aspect of the sound
(the largest partial) to the
auditory system. What I find disturbing about the article is that
the authors sweep their actual assumptions under the rug,
while proclaiming that they were avoiding (other) assumptions.

🔗Gene Ward Smith <gwsmith@svpal.org>

9/2/2003 3:51:35 PM

--- In tuning@yahoogroups.com, "Martin Braun" <nombraun@t...> wrote:

> Much in the world is probabilistic, but there are also kinds of
> breakpoints. In hearing there may well be one between 6:5 and 7:6.

I hear one between 7/6 and 8/7, but not between 6/5 and 7/6.

🔗Paul Erlich <perlich@aya.yale.edu>

9/2/2003 4:25:48 PM

--- In tuning@yahoogroups.com, "francois_laferriere"
<francois.laferriere@o...> wrote:
> Hello Martin
>
> When you write:
>
> > Martin:
> > This is wrong. The amplitudes of all peaks in the investigated
> sound spectra
> > reflect the PHYSICAL resonance properties of the vocal tracts of
> the tested
> > subjects.
>
> I suppose that you mean the peaks that appear in the spectra of
> figure 1C around 210, 415, 625, 830 Hz etc. (rounded to the
> precision I got from the PDF file) and that are then normalised to
> contribute to the normalized spectrum of figure 2. If you think that
> those peaks are (or reflect) the physical resonance of the vocal
> tract, you are wrong, sorry.

forgive me, francois, but who cares about the physical resonance of
the vocal tract when there's no sound in there resonating? what's
relevant to this paper are the properties of the sounds actually
produced by the vocal tract. you correctly pointed out that the
result of only finding peaks at integer ratios was completely
predictable and fairly trivial. you also pointed out that the
resonance peaks of the vocal tract are completely continuous, rather
than discrete as a function of the glottal frequency. but had the
authors had access to (or estimated, say by curve-fitting) the latter
quantities, how should they have incorporated them into their study?
one only hears the frequency components that are actually present, as
shaped by the resonances of the vocal tract (which is what the
authors measured), not the resonant frequencies themselves. so the
latter are relevant how? this is what mystifies me about your latest
post. especially considering the length you went to to make this
seemingly irrelevant point.

🔗francois_laferriere <francois.laferriere@oxymel.com>

9/3/2003 2:58:59 AM

Hello Bill

I am quite happy to read that you reached the same conclusion as
mine, in certainly much clearer phrasing.

> Bill:
> Indeed, these people are proud of the fact that they developed this
> technique for visual processes. In vision, it is quite a good
> assumption that (say) the brightest spot is the most salient. But in
> audition there is no reason to think that the ear pays much
> attention to the loudest partial in a cluster - indeed, virtual
> pitch theory tells us that the ear focuses instead on a harmonic
> template and not on the individual partials themselves.

You pinpoint that those researchers come from vision analysis, and it
is reasonable to think that this kind of data-gathering procedure may
work on spatial frequencies of pictures, because normal pictures (for
instance a natural landscape, but not a checkerboard) do not feature
periodic patterns that would lead to two-dimensional harmonic
patterns in frequency space. So it is probably a too-direct
transposition of image methodology to sound that creates this
involuntary bias in the data gathering.

> Bill:
> It appears to me that this argument is fundamentally based on a
> fallacious assumption about auditory perception.

I just go a little farther than you: I think the flaw is due to the
probability distribution of the harmonic number of Fm, which cannot
be higher than 7 for quite dumb physical reasons.

> Jeff:
> A simple pattern of harmonic ratios, within the limits Francois has
> described, seems like an expected result of measuring many harmonic
> ratios.

Correct: the physical limits and probability distributions of f0 and
F1 are bound to lead to a selection of simple ratios.

> Martin:
> Yes, but nobody could have guessed beforehand WHICH harmonic ratios
> would be sticking out of the noise.

The exact amplitudes could not be guessed, but I explained clearly why
only a very limited number of ratios are likely, some others are
possible but less likely, and many of them are clearly forbidden. I
also explained the discrepancies between the men's and the women's
results. So, all in all, the results remain quite trivial.

> Martin:
> Just to be sure, vowel formants are neither certain frequencies nor
> certain harmonics.

Good idea to ask.

Vowel formants can be modeled from a physical model of the vocal
tract, and then computed from the actual speech signal. A formant is
defined by a (certain, precise) central frequency and a bandwidth. Any
basic textbook on speech processing describes that. The central
frequency is unrelated to pitch, so it cannot be tied to certain
harmonics of the voice signal.

> Martin:
> They are frequency bands that cover several adjacent harmonics.
> Also, these bands are not fixed. Not even in one speaker.

Correct, these bands move over time, but they can be computed
(central frequency and bandwidth) with some accuracy at any given
time.

> Martin:
> They vary so much that the brain's speech analysis cannot use
> absolute frequency ranges, but only the relations between different
> formants, that is, between different frequency ranges of the same
> speaker at one point in time.

Correct, they vary a lot; F1 is "typically" between 500 and 750 Hz for
adults. It is admitted that F1 and F2 are sufficient for vowel
recognition. I do not see your point.

> Paul:
> forgive me, francois, but who cares about the physical resonance of
> the vocal tract when there's no sound in there resonating? what's
> relevant to this paper are the properties of the sounds actually
> produced by the vocal tract. you correctly pointed out that the
> result of only finding peaks at integer ratios was completely
> predictable and fairly trivial. you also pointed out that the
> resonance peaks of the vocal tract are completely continuous, rather
> than discrete as a function of the glottal frequency. but had the
> authors had access to (or estimated, say by curve-fitting) the
> latter quantities, how should they have incorporated them into their
> study? one only hears the frequency components that are actually
> present, as shaped by the resonances of the vocal tract (which is
> what the authors measured), not the resonant frequencies themselves.
> so the latter are relevant how? this is what mystifies me about your
> latest post. especially considering the length you went to to make
> this seemingly irrelevant point.

Thanks, Paul, for making me realise that, as I tried to accumulate
demonstrations of what is, in my mind, an obvious flaw, I became
more and more terse and obscure (while I hoped to progress the other
way). So I should have given up before :-).

I used the technique of "dimensional analysis" to try to
show "clearly" that Fm is not a physical quantity but just a
computational artifact, but failed miserably at being any clearer.

The fact remains that Fm, the foundation of the paper, is
computed and used in such a way that it generates artificial peaks in
the "normalized" spectrum.

yours truly

François Laferrière

🔗Martin Braun <nombraun@telia.com>

9/3/2003 3:15:53 AM

Gene:

>> Much in the world is probabilistic, but there are also kinds of
>> breakpoints. In hearing there may well be one between 6:5 and 7:6.

> I hear one between 7/6 and 8/7, but not between 6/5 and 7/6.

Also this is plausible from what we know about hearing. Psychoacoustic
experiments showed that for pure tones the borderline between consonance and
dissonance lies around 260 cents for frequencies in the music-relevant
range. So, some may hear 267 cents (7:6) as consonant, while others
less so.

There is another component that may also vary considerably between
listeners. 7:6 deviates more from the ubiquitous 12-tone scales than 6:5
does. (Probably you discussed such things at length in earlier years.)

Martin

🔗Martin Braun <nombraun@telia.com>

9/3/2003 2:58:29 AM

Bill:

> True. However, what they DID was to normalize the sounds by the
> frequency of the largest partial.
> What does such an normalization mean? It means that
> in some way, the largest partial must be a meaningful entity.
> I know of no psychoacoustic study that has ever indicated that
> "the largest partial" in a sound is a particularly salient feature
> of the sound, human voice or other.
> Hence my criticism is of their method,
> not on my guess as to their motives.

Again, I bet that all who think over this methods decision of the authors
will come to the conclusion that they chose the best possible option.

When sampling data you have to make decisions. And the best decisions are
those which rest on the fewest assumptions.

So, what could they do to collect data on the probability of frequency
ratios between the peaks in randomly selected speech samples?

1) They could have taken the 10 peaks with the highest amplitudes. But why
10? Why not 5 or 20? You would need masses of theory to "justify" such a
decision.

2) They could have taken all peaks in a frequency range, say 100-2000 Hz.
But why not 50-4000 Hz? You would need masses of theory to "justify" such a
decision. If the range is too large, you also lose most possible information
in noise.

3) They could have taken the peak at f0, instead of the highest peak, as a
"normalized center peak". But how to calculate f0? This has been a hot
issue in speech research for decades. There are competing algorithms, and
the discussion is still far from being settled.

I see no better way than that taken by the authors. If somebody else sees
one, please let us know.

> While the authors avoid assumptions about the
> human voice, they are *implicitly* assuming something
> about the importance of a particular aspect of the sound
> (the largest partial) to the auditory system.

This is a wrong conclusion. As shown above, their decision was the one that
assumed least, compared to all other possible decisions. The extremely
trivial assumption they may have made is that the highest peak and its
neighbors (see Fig. 1C) have a good chance to enter the auditory system.
This assumption is so self-evident that it cannot be a surprise that they
did not mention it explicitly.

> What I find disturbing about the article is that
> the authors sweep their actual assumptions under the rug,
> while proclaiming that they were avoiding (other) assumptions.

I hope you can now see what they "swept under the rug". Perhaps you can now
also see that something else may be considered as "disturbing".

Martin

🔗Martin Braun <nombraun@telia.com>

9/3/2003 12:12:12 PM

François:

> The fact remains that Fm, the foundation of the paper, is
> computed and used in such a way that it generates artificial peaks in
> the "normalized" spectrum.

1) Fm, the frequency of the maximum peak in the spectra of the sound
samples, is NOT the foundation of the paper. It is the frequency to
which all other frequencies and amplitudes in a given sample are
related, in order to obtain a "power spectrum" of ratios.

2) Fm is NOT computed in any way. It is simply read out from the FFT.

3) Artificial peaks are NOT generated. The peaks that appear in the grand
average spectra appear due to an adequate methods decision of the authors.
See my previous message.

François, had you spent a small part of the time that went into writing your
long messages in this thread on simply reading the methods section of the
study, you would have done yourself and all other list members a great
favor. Perhaps next time ;-)

Martin

🔗francois_laferriere <francois.laferriere@oxymel.com>

9/3/2003 2:02:39 PM

--- In tuning@yahoogroups.com, "Martin Braun" <nombraun@t...> wrote:
> François:
>
> > The fact remains that Fm, the foundation of the paper, is
> > computed and used in such a way that it generates artificial peaks in
> > the "normalized" spectrum.
>
> 1) Fm, the frequency of the maximum peak in the spectra of the sound
> samples, is NOT the foundation of the paper. It is the frequency to
> which all other frequencies and amplitudes in a given sample are
> related, in order to obtain a "power spectrum" of ratios.
>
> 2) Fm is NOT computed in any way. It is simply read out from the FFT.
>
> 3) Artificial peaks are NOT generated. The peaks that appear in the
> grand average spectra appear due to an adequate methods decision of
> the authors. See my previous message.

Fm is not computed; my mistake in that specific sentence.

> François, had you spent a small part of the time that went into
> writing your long messages in this thread on simply reading the
> methods section of the study, you would have done yourself and all
> other list members a great favor. Perhaps next time ;-)
>
> Martin

Martin, you didn't read or understand a word of my very first post on
this topic (46560), which was clear enough for anybody with a minimal
math/science culture. You were unable to produce a single sensible
counter-argument to my demonstration.

The conclusion of the paper may be correct or interesting or worth
discussing; at this point, I really do not care. The conclusion is
just based on false premises, and that is what bothers me.

What is starting to bother me as well is your arrogant attitude, not
especially toward me, but toward anybody who disagrees with you.

Have you ever admitted you were wrong, once in your life?

François Laferrière

🔗Bill Sethares <sethares@ece.wisc.edu>

9/3/2003 8:51:22 PM

Martin and Francois,

Despite appearances, it does seem that we have reached
agreement about the crucial nature of the normalization-by-largest-
partial method used in the paper.

Where we disagree is in whether this normalization is sensible.

My view (and Francois's as well, assuming I read you right)
is that this procedure is unjustified -
while Martin feels that it is sensible.

My reasoning is as follows:
the normalization places a very heavy weight
on the perception of the largest partial in a sound -
indeed, the largest partial plays a pivotal role.
There are many things about sounds that are commonly considered
perceptually relevant - amplitude/pitch/virtual pitch/modulations/
etc., but a perception of "the largest partial" is not one of them.
Hence, a computational method in which the largest partial plays
a pivotal role needs to be looked at carefully.

Martin queried: I see no better way than that
taken by the authors. If somebody else sees
one, please let us know.

There are many possible paths the authors could have taken.
They could have normalized by the spectral center
(the center of mass of the power spectrum).
They could have normalized by the expectation of the
power spectrum.
They could have normalized by the spectral center of the
log of the power spectrum (since pitch is generally
perceived in a log (dB) scale).
I see no a priori reason to prefer any one of these methods over
any other one - is the largest partial more perceptually relevant
than the center of the spectrum? Than the expectation?
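
For concreteness, here is how three of these candidate references
differ on one toy spectrum (my illustration; the spectrum is the
made-up example from my earlier message, and the log weighting is a
rough stand-in for a dB-based centroid):

import numpy as np

freqs = np.array([150., 300., 450., 600., 750., 900.])  # Hz
mags  = np.array([0.1, 0.2, 1.0, 0.4, 0.3, 0.2])        # toy amplitudes
power = mags ** 2

ref_peak     = freqs[np.argmax(mags)]                   # the paper's choice
ref_centroid = np.sum(freqs * power) / np.sum(power)    # spectral center
w = np.log10(1.0 + power)                               # rough log weighting
ref_logcent  = np.sum(freqs * w) / np.sum(w)

for name, ref in [("largest peak", ref_peak),
                  ("power centroid", ref_centroid),
                  ("log-power centroid", ref_logcent)]:
    print(f"{name:18s} ref = {ref:5.1f} Hz  ratios = {np.round(freqs / ref, 2)}")

Only the largest-peak reference lands the ratio axis on exact
small-integer values; the centroids put it somewhere in between. Each
choice builds a different assumption into the averaged curve.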

This is why I do not find the argument "what else could
they do?" convincing, because they had many choices.
Moreover, they make no argument in favor of the choice they made.

In any case, I hope it is clearer now why I find the normalization-by-
largest partial an unconvincing method.

--Bill Sethares

🔗Martin Braun <nombraun@telia.com>

9/4/2003 6:36:45 AM

François:

> Martin, you didn't read or understand a word of my very first post on
> this topic (46560), which was clear enough for anybody with a minimal
> math/science culture. You were unable to produce a single sensible
> counter-argument to my demonstration.

This was hard to read. If you still want a public criticism, here it is:

Message 46560:

> Let's see how it works.
> The normalized value is defined (in the paper) as
> Fn = F/Fm
> As the signal is harmonic, F and Fm are restricted to values in the
> series f0, 2f0, 3f0 ...

1) Only some of the selected samples are harmonic (see my previous post).

2) Even when a sample was harmonic, ONLY "Fm" was a multiple of f0. "F" was
taken from ALL frequencies of the FFT spectrum, not just from the multiples
of f0. So all relevant ratios "had a chance" (in your words), even 11/9 or
17/16.

[By the way, all this is written down in the methods section of the paper.]

Martin

🔗Martin Braun <nombraun@telia.com>

9/4/2003 6:15:47 AM

Bill:

> Despite appearences, it does seem that we have reached
> agreement about the crucial nature of the normalization-by-largest
> partial method used in the paper.

The normalization was to the largest "peak" in the spectra of the samples,
not to the largest "partial". As I said before, the authors did NOT filter
their speech material for quasi-steady-state vowels. As they selected the
samples randomly, the amount of periodicity in the samples varied greatly.
Not all spectra had partials, but all had a highest peak.

> My reasoning is as follows:
> the normalization places a very heavy weight
> on the perception of the largest partial in a sound -
> indeed, the largest partial plays a pivotal role.
> There are many things about sounds that are commonly considered
> perceptually relevant - amplitude/pitch/virtual pitch/modulations/
> etc, but a perception of "the largest partial" is not one of them.
> Hence, a computational method in which the largest partial plays
> a pivotal role needs to be looked at carefully.

OK. Looking at things carefully is good. The question here is, could this
decision have done any damage to the study? Could it have biased the results
in an inappropriate way?

So, let's assume the sampling algorithm has hit a fairly clean vowel and
the highest peak in the FFT spectrum is the largest partial. I fully agree
that the largest partial per se has no particular importance in hearing. In
the frequency range of speech and music the partials that are given the
most weight by the brain for pitch extraction are the partials 3 to 6.
Partials 1 and 2 are also weighted, in particular if they have a high
amplitude. Partials 7 and higher have less or no importance, because their
frequencies are poorly, or not at all, resolved in ear and brain
(neighboring partials that are too close to each other mask each other).
Now, in vowels the largest partial usually is among numbers 1-6. So it is
among those that are weighted for pitch extraction, and it is weighted more
strongly than its neighbors, because it has a higher amplitude.

In conclusion, the decision of the authors agrees well with what the ear
does with the largest partials in vowel sounds.

> Martin queried: I see no better way than that
> taken by the authors. If somebody else sees
> one, please let us know.

> There are many possible paths the authors could have taken.
> They could have normalized by the spectral center
> (the center of mass of the power spectrum).

This would have been meaningless considering the research aims. The aim of
the study was to see how the ratios between outstanding spectral lines were
distributed. For a ratio you need two points in the spectrum, and at
least one of them has to be a peak. The "center of mass of the power
spectrum" could fall anywhere, even in a valley of the spectrum. Therefore
it cannot be used for the purpose of the study.

> They could have normalized by the expectation of the power spectrum.

Same as above, plus the problem of justifying the "expectation".

> They could have normalized by the spectral center of the
> log of the power spectrum (since pitch is generally
> perceived in a log (dB) scale).

Same as first suggestion.

> In any case, I hope it is clearer now why I find the normalization-by-
> largest partial an unconvincing method.

I hope you can now reconsider this view.

Martin

🔗francois_laferriere <francois.laferriere@oxymel.com>

9/4/2003 8:20:38 AM

Hello

>
> Message 46560:
>
> > Let's see how it works.
> > The normalized value is defined (in the paper) as
> > Fn = F/Fm
> > As the signal is harmonic, F and Fm are restricted to values in the
> > series f0, 2f0, 3f0 ...
>
> 1) Only some of the selected samples are harmonic (see my previous
> post).
>
> 2) Even when a sample was harmonic, ONLY "Fm" was a multiple of f0.
> "F" was taken from ALL frequencies of the FFT spectrum, not just
> from the multiples of f0. So all relevant ratios "had a chance" (in
> your words), even 11/9 or 17/16.

We need to go back to articulatory phonetics again.

A phoneme can be voiced (with a harmonic spectrum due to vocal fold
vibration) or unvoiced. When voiced, the harmonic spectrum carries the
bulk of the energy, and then Fm is a multiple of f0. This is true for
all vowels and most consonants. What does that leave? Two categories
of consonants: the unvoiced plosives (P, T, K) and the unvoiced
fricatives (in English, F as in Fourier, S as in spectrum, SH as in
sheep). Unvoiced plosives are basically... silence, so they are
rightfully discarded by the protocol.

That leaves us with the unvoiced fricatives, which are basically
broadband noise that will contribute to the noise in the averaged
spectrum. Further, what is the probability weight of F, S, SH against
the rest of the phonological system (all the other vowels and
consonants)? Not much, certainly.

The contribution of inharmonic segments is altogether irrelevant (it
is noise!) and insignificant (there are very few of them).

Unless the database contains a huge amount (more than 50%) of
whispered voice (which is very unlikely, from what I know of speech
databases), the data are clearly dominated by voiced, harmonic
segments, so my demonstration is still correct until otherwise
debunked.

yours truly

François Laferrière

🔗Martin Braun <nombraun@telia.com>

9/4/2003 12:13:46 PM

François:

> ....the data are clearly dominated by voiced, harmonic segments,
> so my demonstration is still correct until otherwise debunked.

Then I would like to invite you to debunk it yourself. If your statement
were correct, we would see spectral lines, or at least very steep peaks, in
Fig. 1C. But we don't. The power distribution is mainly flat with some small
peaks on top of it. Most of the power in this ratio "spectrogram" is OUTSIDE
the low-order ratios. This is due to the inharmonicity of the speech
samples, and nothing else. You can also see that the ratio 11/9 is well
represented in the distribution. The same applies to 11.5/9.5. These ratios
have only slightly less power than 5/4 and 6/5. [But they are irrelevant, of
course, because they do not stick out of the inharmonicity noise.]

Martin

🔗Paul Erlich <perlich@aya.yale.edu>

9/4/2003 12:46:09 PM

--- In tuning@yahoogroups.com, "Martin Braun" <nombraun@t...> wrote:
> François:
>
> > ....the data are clearly dominated by voiced, harmonic segments,
> > so my demonstration is still correct until otherwise debunked.
>
> Then I would like to invite you to debunk it yourself. If your
> statement were correct, we would see spectral lines, or at least
> very steep peaks, in Fig. 1C. But we don't. The power distribution
> is mainly flat with some small peaks on top of it. Most of the power
> in this ratio "spectrogram" is OUTSIDE the low-order ratios. This is
> due to the inharmonicity of the speech samples, and nothing else.
> samples, and nothing else.

is inharmonicity really the right word, or should it be
merely "noise"? to me, inharmonicity implies a discrete spectrum that
deviates from the harmonic pattern, and i challenge anyone to show me
evidence of a human voice producing *that*.

> inharmonicity noise.]

now i'm extra confused -- i thought i knew what "inharmonicity"
and "noise" were, but what is "inharmonicity noise"??

🔗Paul Erlich <perlich@aya.yale.edu>

9/4/2003 1:04:36 PM

--- In tuning@yahoogroups.com, "Jeff Olliff" <jolliff@d...> wrote:

> Besides, if the peaks drop off at the 7-limit, so that 7/6 is not
> statistically significant,

but 7/4 and 7/5 are.

> and so not reinforced in experience, how
> come I hear that harmony no sweat? My analysis of Bach is that he
> heard it and used it.

well, i've spent enough words on this list disputing that, i'll hold
off on going back into that unless there's a strong desire on
someone's part . . . for now, interested parties can listen to john
delaubenfels' adaptively tuned versions of the bach chaconne as
arranged by busoni, with versions which target 5-limit vertical
sonorities as well as versions which target 7-limit vertical
sonorities (for dominant seventh chords and the like):

http://bellsouthpwp.net/j/d/jdelaub/jstudio.htm

download b-b-bj.zip

🔗Martin Braun <nombraun@telia.com>

9/5/2003 1:58:39 AM

I must apologize for the typo in my previous message. Instead of "Fig. 1C",
it must be "Fig. 2C", of course. Sorry.

Martin

🔗Martin Braun <nombraun@telia.com>

9/5/2003 2:58:01 AM

Paul:

>> The power distribution is mainly flat with some small
>> peaks on top of it. Most of the power in this ratio "spectrogram" is
>> OUTSIDE the low-order ratios. This is due to the inharmonicity of the
>> speech samples, and nothing else.

> is inharmonicity really the right word, or should it be merely "noise"?

Then you would have to call the vast majority of speech signals "noise".
This would be against all conventions. You might be right perhaps when
considering the contents of much speech. But that would be another question
;-)

"Inharmonicity" would be the right word, when dealing with the difference
between harmonic (low-order) ratios and other (non-harmonic) ratios in the
distribution of ratios in speech spectra. And this is what the study was
about.

> to me, inharmonicity implies a discrete spectrum that
> deviates from the harmonic pattern, and i challenge anyone to show me
> evidence of a human voice producing *that*.

You mean discrete spectral lines with non-harmonic frequency ratios? These,
of course, do not exist. But non-harmonic ratios in the spectrum represent
the greater part of spectral power in human speech, as seen in Fig. 2C of
the study.

>> inharmonicity noise.]

> now i'm extra confused -- i thought i knew what "inharmonicity"
> and "noise" were, but what is "inharmonicity noise"??

"Noise" in science means the opposite of signal or information. In the study
the information at issue was harmonic (low-order) ratios. So other ratios
were "noise" compared to this "signal" background. Because these other
ratios are inharmonic (high-order) ratios, the results, as in Fig. 2C, were
"harmonicity information" plus "inharmonicity noise".

Martin

🔗francois_laferriere <francois.laferriere@oxymel.com>

9/5/2003 4:08:54 AM

Hello Martin and Paul
> Martin:

> >> The power distribution is mainly flat with some small
> >> peaks ontop of it. Most of the power in this ratio "spectrogram" is
> >> OUTSIDE the low-order ratios. This is due to the inharmonicity of the
> >> speech samples, and nothing else.

Paul answered:
> is inharmonicity really the right word, or should it be merely "noise"?

The flat background is due to
- harmonic peaks spreading due to the FFT window width (0.1 sec is
quite short in this respect)
- pitch movement within each sample that blurs the peaks (0.1 sec is
quite long, then....)
- a possible contribution of turbulent noise (frication noise)

All this introduces "noise" (in the information-theory sense of the
word) into the results; I understand Paul's answer this way. This
noise masks the basic discrete nature of the spectrum by filling the
"forbidden bands" with noise. The first two effects are sketched in
the code below.
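
Here is a sketch of the first two effects (my own illustration; the
sample rate, harmonic count and glide depth are assumed):

import numpy as np

fs, dur = 16_000, 0.1                     # 0.1 sec window, as in the paper
t = np.arange(int(fs * dur)) / fs

def spectrum(f_start, f_end):
    """|FFT| of four harmonics of a fundamental gliding f_start -> f_end."""
    inst_f0 = np.linspace(f_start, f_end, len(t))
    phase = 2.0 * np.pi * np.cumsum(inst_f0) / fs
    x = sum(np.sin(n * phase) for n in range(1, 5))
    return np.abs(np.fft.rfft(x * np.hanning(len(t))))

df = fs / len(t)                          # 10 Hz bins for a 0.1 sec window
for f0a, f0b, label in [(200, 200, "steady"), (200, 230, "glide ")]:
    s = spectrum(f0a, f0b)
    spread = np.sum(s > 0.1 * s.max()) * df
    print(f"{label}: energy above 10% of max spans ~{spread:.0f} Hz")

Even the steady case has finite-width peaks (the window effect), and
a modest 30 Hz glide smears them much further, the upper harmonics
most of all.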

Inharmonicity is definitely not the right word.

> Then you would have to call the vast majority of speech signals "noise".
> This would be against all conventions. You might be right perhaps when
> considering the contents of much speech. But that would be another
> question ;-)

> Martin:
> "Inharmonicity" would be the right word, when dealing with the
> difference between harmonic (low-order) ratios and other
> (non-harmonic) ratios in the distribution of ratios in speech
> spectra. And this is what the study was about.
>
>
> > to me, inharmonicity implies a discrete spectrum that
> > deviates from the harmonic pattern, and i challenge anyone to show me
> > evidence of a human voice producing *that*.
>
> You mean discrete spectral lines with non-harmonic frequency ratios?
> These, of course, do not exist. But non-harmonic ratios in the
> spectrum represent the greater part of spectral power in human
> speech, as seen in Fig. 2C of the study.
>
>
> >> inharmonicity noise.]
>
> > now i'm extra confused -- i thought i knew what "inharmonicity"
> > and "noise" were, but what is "inharmonicity noise"??
>
> "Noise" in science means the opposite of signal or information. In
the study
> the information at issue was harmonic (low-order) ratios. So other
ratios
> were "noise" compared to this "signal" background. Because these other
> ratios are inharmonic (high-order) ratios, the results, as in Fig.
2C, were
> "harmonicity information" plus "inharmonicity noise".
>

You seem to have come to an agreement on that: there is no such thing
as an inharmonic speech signal. So the background is due to
measurement dispersion, or "noise" in the information-theory sense of
the word, not in the signal-theory sense.

I hope that this clarifies the point.

yours truly

François Laferrière

🔗Paul Erlich <perlich@aya.yale.edu>

9/5/2003 3:58:03 PM

--- In tuning@yahoogroups.com, "Martin Braun" <nombraun@t...> wrote:
> Paul:
>
> >> The power distribution is mainly flat with some small
> >> peaks on top of it. Most of the power in this ratio "spectrogram"
> >> is OUTSIDE the low-order ratios. This is due to the inharmonicity
> >> of the speech samples, and nothing else.
>
> > is inharmonicity really the right word, or should it be
> > merely "noise"?
>
> Then you would have to call the vast majority of speech
> signals "noise".

not necessarily. a finite fft will show a continuous spectrum, but
part of that is merely the classical uncertainty principle at work.
fourier transforms are only valid up to a certain level of
uncertainty, if you allow for full generality in the sample in terms
of both frequency and time.
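
a quick sketch of that point (my own toy numbers): the same strictly
periodic signal, fft'd with windows of different lengths -- the line
width shrinks roughly as 1/T, so nothing looks perfectly discrete at
0.1 sec:

import numpy as np

fs, f0 = 16_000, 210.0                    # assumed values

for dur in (0.05, 0.1, 0.4, 1.6):         # window length T in seconds
    t = np.arange(int(fs * dur)) / fs
    x = np.sin(2.0 * np.pi * f0 * t)      # strictly periodic signal
    s = np.abs(np.fft.rfft(x * np.hanning(len(t))))
    df = fs / len(t)                      # bin width = 1/T Hz
    width = np.sum(s > 0.5 * s.max()) * df
    print(f"T = {dur:4.2f} s -> spectral line width ~ {width:4.1f} Hz")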

> This would be against all conventions. You might be right perhaps
> when considering the contents of much speech. But that would be
> another question ;-)

well, what i had in mind was that most speech sounds have a noise
(continuous) component that one could in theory subtract out, and
then it is the discrete spectrum that remains which could be
described as either harmonic or inharmonic. my claim is that
inharmonic it would never be, though things would get hairy when the
vocal dynamics become chaotic (basically vocal multiphonics).

> "Inharmonicity" would be the right word, when dealing with the
difference
> between harmonic (low-order) ratios and other (non-harmonic) ratios
in the
> distribution of ratios in speech spectra. And this is what the
study was
> about.

hmm . . .

> > to me, inharmonicity implies a discrete spectrum that
> > deviates from the harmonic pattern, and i challenge anyone to
> > show me evidence of a human voice producing *that*.
>
> You mean discrete spectral lines with non-harmonic frequency
> ratios? These, of course, do not exist.

well, if you synthesize a bell sound, and make it sustain
indefinitely, the longer the sample you fft, the more closely the
result will approach this condition.

> But non-harmonic ratios in the spectrum represent
> the greater part of spectral power in human speech, as seen in
> Fig. 2C of the study.

in the form of noise, correct?

> >> inharmonicity noise.]
>
> > now i'm extra confused -- i thought i knew what "inharmonicity"
> > and "noise" were, but what is "inharmonicity noise"??
>
> "Noise" in science means the opposite of signal or information. In
the study
> the information at issue was harmonic (low-order) ratios. So other
ratios
> were "noise" compared to this "signal" background. Because these
other
> ratios are inharmonic (high-order) ratios, the results, as in Fig.
2C, were
> "harmonicity information" plus "inharmonicity noise".
>
> Martin

so "inharmonicity" means nothing more than to distinguish the noise
to which it applies from the harmonic-spectum component, yes?

🔗Martin Braun <nombraun@telia.com>

9/7/2003 6:00:02 AM

Paul:

>> "Noise" in science means the opposite of signal or information. In
>> the study the information at issue was harmonic (low-order) ratios. So
>> other ratios were "noise" compared to this "signal" background. Because
these
>> other ratios are inharmonic (high-order) ratios, the results, as in
Fig.2C,
>> were "harmonicity information" plus "inharmonicity noise".

> so "inharmonicity" means nothing more than to distinguish the noise
> to which it applies from the harmonic-spectum component, yes?

Yes, in this particular case. I also agree that in the vocal tract, as
opposed to bells, there are no inharmonic vibration modes. The difference
between us may be that we have different views on the transient status of
all speech sounds. Well over 90 % of all sounds of phoneme length are
glissandi or portamenti. In such transient sounds the harmonic spectrum is
stretched or compressed, that is, "disharmonized". The reason is that the
various vibration modes have different latencies. This in turn is due to the
different masses that vibrate. Usually, smaller masses (faster vibrations,
higher frequencies) have shorter latencies.

So, the reason why 7:5 sticks out in the data, but 7:6 doesn't, may be that
the "signal" at 7:6 is so small that it is blurred out by the ubiquitous
transient disharmonization. For me at least this would be an interesting
finding. We knew about these blurring effects, but we did not know their
extent in the statistics of a large speech corpus.

Martin

🔗Martin Braun <nombraun@telia.com>

9/7/2003 6:27:26 AM

François:

> >> The power distribution is mainly flat with some small
> >> peaks ontop of it. Most of the power in this ratio "spectrogram" is
> >> OUTSIDE the low-order ratios. This is due to the inharmonicity of the
> >> speech samples, and nothing else.

> The flat background is due to
> - harmonic peaks spreading due to FFT window width (0.1 sec is quite
> short in this respect)
> - pitch movement within each sample that blurs the peaks (0.1 sec is
> quite long, then....)

Well, if the sampling window was too short in one respect, and too long in
another respect, perhaps the authors chose an appropriate length in the
middle. This is what we all have to do. Washing the hands too little does
not clean them. Washing them too much is bad for the skin.

Now that we have got this far in the discussion, what about telling us what
the authors should have done instead? If you think that their methods
predetermined their results, you can perhaps say which methods would have
led to "meaningful" results (in your view). For example, which method would
have given the ratio 11:9 "a chance" (as you expressed it)?

Martin

🔗francois_laferriere <francois.laferriere@oxymel.com>

9/9/2003 12:53:37 AM

Hello,

> Martin (referring to Paul's comment):
> Yes, in this particular case. I also agree that in the vocal tract, as
> opposed to bells, there are no inharmonic vibration modes. The
> difference between us may be that we have different views on the
> transient status of all speech sounds. Well over 90 % of all sounds
> of phoneme length are glissandi or portamenti. In such transient
> sounds the harmonic spectrum is stretched or compressed, that is,
> "disharmonized". The reason is that the various vibration modes have
> different latencies. This in turn is due to the different masses
> that vibrate. Usually, smaller masses (faster vibrations, higher
> frequencies) have shorter latencies.

Once and for all: the vocal tract has NOTHING to do with the harmonic
structure of the human voice! The harmonic structure of the spectrum
depends on the glottal phase-locking mechanism. The vocal tract has
broadband resonances, which are only very remotely like a set of
harmonics.

Then, in "transient parts" of the signal, the harmonic spectrum is
evenly compressed or stretched AS FAR AS CAN BE MEASURED. In transient
parts, infinite time window approximation (i.e. the signal is stable
in a long window) is less and less true and the frequency analysis is
less and less reliable. Thas does no means that the underlying
physical mechanisms that produces harmonic structure ceases to exist:
that just means that its effect cannot be reliably measured from
frequency analysis (uncertainty principle is back). No such thing as
physical "disharmonisation" can be measured from speech signal.
Nevertheless, we have to admit that when the signal is transient, the
precision to which harmonicity (or inharmonicity) can be assessed is
limited by the uncertainty principle. This limitation may sometime be
erroneously interpreted as "disharmonisation" but in fact, it just a
measurement artifact, not a physical phenomena.
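
A small Python demonstration of this point (all numbers illustrative): a
chirp whose partials are locked at exactly 4*f0 and 5*f0 at every instant
still shows broad, smeared FFT peaks over a 0.1 sec window, which could be
misread as inharmonicity.

import numpy as np

fs, T = 8000, 0.1
t = np.arange(int(fs * T)) / fs
phase = 2 * np.pi * (200 * t + 0.5 * 400 * t ** 2)   # f0 glides 200 -> 240 Hz
x = np.sin(4 * phase) + np.sin(5 * phase)            # partials 4 and 5 only

spec = np.abs(np.fft.rfft(x * np.hanning(len(x))))
freqs = np.fft.rfftfreq(len(x), 1 / fs)
for lo, hi in ((750, 980), (980, 1250)):             # bands around partials 4, 5
    band = (freqs >= lo) & (freqs < hi)
    peak = freqs[band][np.argmax(spec[band])]
    width = np.sum(spec[band] > spec[band].max() / 2) * fs / len(x)
    print(f"peak ~{peak:.0f} Hz, half-height width ~{width:.0f} Hz")

# Both peaks come out on the order of 100 Hz wide although the ratio
# f5/f4 is exactly 5/4 at every instant: window blur, not physical
# disharmonisation.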

> So, the reason why 7:5 sticks out in the data, but 7:6 doesn't, may be that
> the "signal" at 7:6 is so small that it is blurred out by the ubiquitous
> transient disharmonization. For me at least this would be an interesting
> finding. We knew about these blurring effects, but we did not know their
> extent in the statistics of a large speech corpus.

François:

> The flat background is due to
> - harmonic peaks spreading due to FFT window width (0.1 sec is quite
> short in this respect)
> - pitch movement within each sample that blurs the peaks (0.1 sec is
> quite long then....)

> Martin
> Well, if the sampling window was too short in one respect, and too long in
> another respect, perhaps the authors chose an appropriate length in the
> middle. This is what we all have to do. Washing the hands too little does
> not clean them. Washing them too much is bad for the skin.
>
> Now that we have come this far in the discussion, what about telling us
> what the authors should have done instead?

Have a good walk :-).

Seriously: the way the data is gathered is not the issue, not at all. My
digression was just to explain why there is a "floor" to the
normalized spectrum (and not only discrete peaks at simple ratio values).

The problem is not in the data gathering (that seems fair enough), but
in the spectrum normalisation process. I have explained this with
sufficient detail already.

> If you think that their methods
> predetermined their results, you can perhaps say which methods would have
> led to "meaningful" results (in your view). For example, which method would
> have given the ratio 11:9 "a chance" (as you expressed it)?

Select extraterrestrial speakers with vocal tract of 5cm and an
average pitch of 20 Hz.
Select those who speak veeeeeeeeeeeeeryyyyyyyyyyyyy
sloooooooooooooooolyyyyyyyyyyyyyy and monotonously (to avoid pitch and
spectrum instability).
Fiddle with the sample length until the peaks appear as clearly as possible
(that was probably already done for the paper, to arrive at 0.1 sec as a
good compromise).

yours truly

François Laferrière

🔗Martin Braun <nombraun@telia.com>

9/9/2003 6:48:53 AM

François:

> First of all, the vocal tract has NOTHING to do with the harmonic structure
> of the human voice!

Not with the frequencies of the partials, but with the amplitudes at these
frequencies!!!!!!
That's what the study was all about.

> No such thing as
> physical "disharmonisation" can be measured from the speech signal.

If partial 5 leads partial 4 in a glissando you have a physical
disharmonization!!!!!!

> Nevertheless, we have to admit that when the signal is transient, the
> precision to which harmonicity (or inharmonicity) can be assessed is
> limited by the uncertainty principle. This limitation may sometimes be
> erroneously interpreted as "disharmonisation", but in fact it is just a
> measurement artifact, not a physical phenomenon.

This view is in error. The physical disharmonization, as described above, is
real. It can also be measured. For example, take a time window in which
partial 5 has reached 60 % of its frequency shift whereas partial 4 has only
reached 30 % of its frequency shift.
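
As a worked example (invented numbers): let the voice glide from f0 = 200 Hz
to f0 = 220 Hz, and take a window in which partial 5 has completed 60 % of
its shift while partial 4 has completed only 30 %:

f0_a, f0_b = 200.0, 220.0
p4 = 4 * (f0_a + 0.30 * (f0_b - f0_a))   # 4 * 206 = 824 Hz
p5 = 5 * (f0_a + 0.60 * (f0_b - f0_a))   # 5 * 212 = 1060 Hz
print(p5 / p4)                           # ~1.286 in that window, not 1.25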

>> Now that we have come this far in the discussion, what about telling us
>> what the authors should have done instead?

> The problem is not in the data gathering (that seems fair enough), but
> in the spectrum normalisation process. I have explained this with
> sufficient detail already.

What then should the authors have done to get the "problem" out of the
"spectrum normalization process"?

>> If you think that their methods
>> predetermined their results, you can perhaps say which methods would
>> have led to "meaningful" results (in your view). For example, which
>> method would have given the ratio 11:9 "a chance" (as you expressed it)?

> Select extraterrestrial speakers with vocal tract of 5cm and an
> average pitch of 20 Hz.
> Select those who speak veeeeeeeeeeeeeryyyyyyyyyyyyy
> sloooooooooooooooolyyyyyyyyyyyyyy and monotonously (to avoid pitch and
> spectrum instability).

Are you saying now that the results of the study reflect the nature of human
speech?

Martin

🔗francois_laferriere <francois.laferriere@oxymel.com>

9/10/2003 12:38:59 AM

Hello Martin

> François:
> No such thing as
> physical "disharmonisation" can be measured from speech signal.

> Martin
> If partial 5 leads partial 4 in a glissando you have a physical
> disharmonization!!!!!!

If... but this does not exist.

1- There is no such thing as "physical disharmonization" of the human voice.
2- If such a thing exists, it is not measurable.
3- Even if it exists and is measurable, it would be insignificant in a
random speech sample of 0.1 sec.

In other words, all harmonics move together when the pitch moves
reasonably slowly. When the pitch moves very quickly, the precision that
would give a clue to disharmonisation cannot be reached. Finally, 0.1
sec of speech has a good chance of being fairly stable in terms of pitch
and formants (there are no huge pitch jumps as in the singing voice).

I have manually checked thousands of speech and singing-voice samples using
FFTs and other analysis methods and never observed such a thing.

If you can provide me with a single sample of speech (or even singing
voice) that features disharmonisation, or even provide me with a
single serious paper that describes this phenomenon, I will be only too
happy to get educated.

Nevertheless, let us investigate the consequences of "disharmonisation"
(even though it does not exist).

Let us suppose that f5 changes ahead of f4 in an upward pitch movement
(this is exaggerated due to the limitations of ASCII art):

............................
...600.|................----
.......|............----....
.......|........----........
.......|....----............
f5 500.|----................
.......|....................
.......|....................
.......|....................
.......|....................
f4 400 |--------------------
............................

In this window, the average value of f4 is 400 while the average value of f5
is 550. So instead of contributing at 5:4 = 1.2500 (if Fm is f4), it
contributes at a "slightly" higher value, 1.3750.

It is reasonable to think that if higher frequencies move faster
upward, they move faster downward as well. For instance, the contrary
movement of the above would be:

............................
f5.600.|----................
.......|....----.............
.......|........---- ........
.......|............---- ....
...500.|................----.
f4 480.|--------------------
.......|....................
.......|....................
.......|....................
...400 |--------------------

So instead of contributing at 1.250 (if Fm is f4), this sample
contributes at a "slightly" lower value, 1.145.
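
The same two windows in a few lines of Python (the same made-up numbers as
in the diagrams; f5/f4 = 5/4 at every instant, yet the window-averaged
frequencies give a biased ratio):

import numpy as np

t = np.linspace(0.0, 1.0, 101)            # one analysis window

# upward case: f4 still at 400 Hz, f5 ramps 500 -> 600 Hz
print(np.mean(500 + 100 * t) / 400)       # 550/400 = 1.375, above 5/4

# contrary case: f4 already at 480 Hz, f5 ramps 600 -> 500 Hz
print(np.mean(600 - 100 * t) / 480)       # 550/480 ~ 1.146, below 5/4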

On average, there should be roughly the same (small) amount of
upward and downward pitch movement. So there should be roughly the same
amount of contribution to the right and to the left of each peak of the
average normalized spectrum.

If that is the case (time symmetry of "disharmonisation"),
disharmonisation does not change peak locations in the average normalized
spectrum; it just spreads them out a little bit.

so

4 - if physical disharmonization exists (it doesn't), is measurable
(it is not), and is significant in a 0.1 sec sample (unlikely), its
upward and downward contributions shall spread out peak locations but
not change peak centers.

OK, let us suppose that I am wrong all the way: disharmonisation occurs,
and occurs more strongly on upward (downward) pitch movements. This
should raise (lower) every peak above 1 in the average normalized
spectrum. So the 1.5 peak should be slightly higher (lower) than 1.5, the
2 peak slightly higher (lower) than 2.0, and so on.

so finally.

5 - existence of significant physical disharmonisation would
contradict the average spectrum presented in the paper.

> François:
> The problem is not in the data gathering (that seems fair enough), but
> in the spectrum normalisation process. I have explained this with
> sufficient detail already.
> Martin:
> What then should the authors have done to get the "problem" out of the
> "spectrum normalization process"?

The variable analysed is roughly F/Fm ~= F/(f0 * round(F1/f0)), where F
has prominent values at f0, 2f0, 3f0, etc.

Suppose that I take a population, I define a variable as ShoeSize /
round (IQ / ShoeSize) and analyse stats on this variable. I discover a
lot of interesting properties of this variable. What is the problem?
How can I get it out?
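
For concreteness, a sketch (entirely synthetic input, with invented
distributions) of how the Fm normalisation by itself manufactures peaks at
simple ratios: feed it random "voices" whose partials sit at n*f0 and whose
spectral maximum falls on a random low harmonic N, and the low-order ratios
appear although nothing musical was put in.

import numpy as np
from collections import Counter

rng = np.random.default_rng(0)
counts = Counter()
for _ in range(20000):
    f0 = rng.uniform(100, 250)                             # random fundamental
    N = rng.choice([2, 3, 4, 5], p=[0.2, 0.3, 0.3, 0.2])   # wavenumber of Fm
    Fm = N * f0                                            # spectral maximum
    for n in range(N, 2 * N + 1):                          # ratios in [1, 2]
        counts[round(n * f0 / Fm, 2)] += 1

print(counts.most_common(7))
# 1.0 and 2.0 dominate, then 1.5, then 1.25, 1.33, 1.67, 1.75:
# simple ratios, by construction.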

yours truly

François Laferrière

🔗Paul Erlich <perlich@aya.yale.edu>

9/10/2003 10:32:19 AM

--- In tuning@yahoogroups.com, "francois_laferriere"
<francois.laferriere@o...> wrote:

> 1- There is no such thing as "physical disharmonization" of the human
> voice.

if i recall correctly, the only instrument on which judith brown
found "disharmonization" during pitch changes was the violin. here's
her paper again:

http://www.wellesley.edu/Physics/brown/pubs/freqRatV99P1210-P1218.djvu

unfortunately i have no idea how i was able to read this paper in the
first place, my system certainly doesn't know how to open it now!

> The variable analysed is roughly F/Fm ~= F/(f0 * round(F1/f0)), where F
> has prominent values at f0, 2f0, 3f0, etc.
>
> Suppose that I take a population, I define a variable as ShoeSize /
> round (IQ / ShoeSize)

francois, i think we may be getting back to the objection i already
debunked (or so i thought). it seems you're putting "round" in the
first formula because you'd like the formula to begin with the vocal
tract (simplified as a single-resonator system?) resonant frequency.
if so, i'll say again, this is unfair. the authors are only concerned
with the frequencies actually present in the sound, since this is the
only aspect of vocalization relevant to their hypothesis. therefore,
this "rounding" occurs naturally -- or not at all in the case of
unvoiced phonemes. either way, the authors simply use the frequencies
present in the signal. am i misinterpreting you?

🔗Martin Braun <nombraun@telia.com>

9/10/2003 1:19:28 PM

Paul:

> francois wrote:
> 1- There is no such thing as "physical disharmonization" of human
> voice.

> if i recall correctly, the only instrument on which judith brown
> found "disharmonization" during pitch changes was the violin. here's
> her paper again:

http://www.wellesley.edu/Physics/brown/pubs/freqRatV99P1210-P1218.djvu

This is correct. But she only considered vibratos, not glissandi. For
vibratos the frequency shifts are very small in string instruments and
extremely small in wind instruments. So the disharmonization due to latency
differences between the partials is negligible in her experiments.

> unfortunately i have no idea how i was able to read this paper in the
> first place, my system certainly doesn't know how to open it now!

You need to download a plug-in for djvu files. These are fairly short and
freely available on the web. Just google for "djvu" and pick the most
convenient download offer.

Martin

🔗Martin Braun <nombraun@telia.com>

9/10/2003 1:05:34 PM

François:

>> François:
>> No such thing as
>> physical "disharmonisation" can be measured from the speech signal.

>> Martin:
>> If partial 5 leads partial 4 in a glissando you have a physical
>> disharmonization!!!!!!

> If... but this does not exists

In wind instruments, low tones have longer latencies than high tones. And
within the spectrum of one tone of these instruments, the latencies vary
between the partials. There is no reason to assume that this can be
different with the human voice.

> If you can provide me with a single sample of speech (or even singing
> voice) that features disharmonisation, or even provide me with a
> single serious paper that describes this phenomenon, I will be only too
> happy to get educated.

The latency behavior of partials in wind instruments is described in good
books on the acoustics of wind instruments.

> If that is the case (time symmetry of "disharmonisation"),
> disharmonisation does not change peak locations in the average normalized
> spectrum; it just spreads them out a little bit.

Well, "a little bit" is quite a joke here. In your example the ratio
fluctuations around 5/4 (1.25) went down below 6:5 (1.2) and up above 4:3
(1.33) !!!!!!! This is indeed a terrible "spreading". It would be blurring
the whole range between the minor third and the fourth !!!!! But thanks for
demonstrating to the list, then, WHY the ratio distribution curve has such a
strong flat component (high plateau) ;-))

>> François:
>> The problem is not in the data gathering (that seems fair enough), but
>> in the spectrum normalisation process. I have explained this with
>> sufficient detail already.
>> Martin:
>> What then should the authors have done to get the "problem" out of the
>> "spectrum normalization process"?

>The variable analysed is roughly F/Fm ~= F/(f0 * round(F1/f0)), where F
>has prominent values at f0, 2f0, 3f0, etc.

>Suppose that I take a population, I define a variable as ShoeSize /
>round (IQ / ShoeSize) and analyse stats on this variable. I discover a
>lot of interesting properties of this variable. What is the problem?
>How can I get it out?

Could you please say what the authors should have done to avoid the problem
which you claim they have in their methods?

Martin

🔗francois_laferriere <francois.laferriere@oxymel.com>

9/11/2003 3:01:38 AM

Hello,

> Francois:
> The variable analysed is roughly F/Fm ~= F/(f0 * round(F1/f0)), where F
> has prominent values at f0, 2f0, 3f0, etc.
>
> Suppose that I take a population, I define a variable as ShoeSize /
> round (IQ / ShoeSize)

> Paul:
> francois, i think we may be getting back to the objection i already
> debunked (or so i thought). it seems you're putting "round" in the
> first formula because you'd like the formula to begin with the vocal
> tract (simplified as a single-resonator system?) resonant frequency.
> if so, i'll say again, this is unfair. the authors are only concerned
> with the frequencies actually present in the sound, since this is the
> only aspect of vocalization relevant to their hypothesis. therefore,
> this "rounding" occurs naturally -- or not at all in the case of
> unvoiced phonemes. either way, the authors simply use the frequencies
> present in the signal. am i misinterpreting you?

Okay, I probably skipped an important step, in order to get quickly to
the explanation of the physical limits constraining N (the wavenumber
of Fm) and thus indirectly constraining the values of (a/b).

In fact it is not necessary to introduce F1 or any rounding to
understand why the spectrum is discrete.
It is not necessary to introduce any knowledge of acoustics. A simple
phenomenological viewpoint is sufficient.

Let us assume that the actual FFT spectrum of Fig. 1-B is representative of
human speech (and indeed it is): a spectrum made of evenly spaced peaks at
integer multiples of a frequency f0. This comb-like spectrum is modulated
by a spectral envelope about which we know nothing except that it has some
maximum. We can nevertheless deduce that this maximum very likely occurs at
a multiple of f0, so instead of expressing Fm in terms of a frequency
distribution, it is possible to express it in terms of a wavenumber
distribution, as is indeed done in figure 3A.

Even knowing nothing else of the spectra, it is possible, from the
probability distribution of figure 3A (as far as I can read it):

p(1) = 0.035
p(2) = 0.18
p(3) = 0.24
p(4) = 0.26
p(5) = 0.17
p(6) = 0.08
p(7) = 0.03
p(8) = 0.01
p(9) = 0.003
p(10) = 0.002
p(n) = 0 for n > 10

to deduce some "results".

The sum is not exactly 1 (1.01), but I am not seeking high precision, just
rough figures.

So, from those values, a simple brute-force computation using Excel:
I created a complete version of table 3B, with all denominators up to 10,
and weighted each ratio by p(n), n being the denominator, to compute a
rough spectral value.

For instance, A(1.33333) = p(4:3) + p(8:6) + p(12:9)
(assuming that we know nothing of the spectrum but the wavenumber of
Fm, let us say that the numerators are equiprobable).

That leads to the extremely raw approximation
A(1.33333) = p(3) + p(6) + p(9) = 0.323.

I computed all the ratio weights like this and then converted the linear
results to dB (so that A(1) = 0 dB).

I got

F/Fm A(F/Fm)
1.000000000 0
1.100000000 -27
1.111111111 -25
1.125000000 -20
1.142857143 -15
1.166666667 -11
1.200000000 -8
1.250000000 -6
1.300000000 -27
1.333333333 -5
1.375000000 -20
1.400000000 -8
1.428571429 -15
1.444444444 -25
1.500000000 -3
1.555555556 -25
1.571428571 -15
1.600000000 -8
1.666666667 -5
1.700000000 -26
1.714285714 -15
1.750000000 -6
1.777777778 -25
1.800000000 -8
1.833333333 -11
1.857142857 -15
1.875000000 -20
1.888888889 -25
1.900000000 -27
2.000000000 0

Obviously, this is raw; it does not feature the spectral decay of the
actual data, but it shows the most salient features of the normalised
spectrum:
- high amplitudes at simple ratios, with large peaks at 1 and 2
- rough symmetry around 1.5 (I didn't notice that before!)

AND THIS IS COMPUTED FROM THE DATA OF FIGURE 3A ALONE.

It misses a few things that can be explained fairly well, but some
phonation acoustics is needed (formant structure etc.); I won't get
into that.
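
For anyone who wants to check the table, the same brute-force computation
in Python instead of Excel (the p(n) values are those read off figure 3A
above; 10*log10 reproduces the dB column to within rounding):

import math
from fractions import Fraction

p = {1: 0.035, 2: 0.18, 3: 0.24, 4: 0.26, 5: 0.17,
     6: 0.08, 7: 0.03, 8: 0.01, 9: 0.003, 10: 0.002}

A = {}
for b in range(1, 11):                  # b = wavenumber of Fm
    for a in range(b, 2 * b + 1):       # ratios between 1 and 2
        r = Fraction(a, b)              # 8:6 collapses onto 4:3, etc.
        A[r] = A.get(r, 0.0) + p[b]     # numerators taken as equiprobable

ref = A[Fraction(1, 1)]                 # A(1) is the 0 dB reference
for r in sorted(A):
    print(f"{float(r):.9f}  {10 * math.log10(A[r] / ref):6.1f} dB")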

>> François:
>> No such thing as
>> physical "disharmonisation" can be measured from the speech signal.

>> Martin:
>> If partial 5 leads partial 4 in a glissando you have a physical
>> disharmonization!!!!!!

> If... but this does not exists

I should have added "in the speech signal".

> Martin
> In wind instruments, low tones have longer latencies than high tones. And
> within the spectrum of one tone of these instruments, the latencies vary
> between the partials. There is no reason to assume that this can be
> different with the human voice.

Yes, there is reason, but that will send us back to a discussion we had
earlier on this forum about the acoustical coupling between the vocal folds
and the vocal tract.

You are still confusing this with the sound production of woodwinds (and
brass), which relies on a strong coupling between the reed (or whistle, or
lips) and the bore. In a woodwind, when a hole is opened, a high-frequency
mode having a high pressure near the hole may die out almost instantaneously,
before the reed is "aware" of it (due to the bore length and the finite
value of the sound velocity, or something like that).

There is no significant coupling between the vocal folds and the rest of the
vocal tract. Try to make a woodwind out of a rolled slice of steak :).
Vocal fold frequency depends on the folds' own physical properties (mass
distribution, tension) and on subglottic pressure. Any change of these
parameters propagates "instantaneously" through the vocal folds (which are
much smaller than a wavelength, in particular because the sound velocity is
much higher in "steak" than in air; ask your butcher :-)). So

- there is no reason to think that disharmonisation exists in the human voice
- it has, until proven otherwise, never been observed or documented

> Martin:
> The latency behavior of partials in wind instruments is described in good
> books on the acoustics of wind instruments.

No doubt. I will try to get some more info, thanks.

> François:
> If that is the case (time symmetry of "disharmonisation"),
> disharmonisation does not change peak locations in the average normalized
> spectrum; it just spreads them out a little bit.

> Martin:
> Well, "a little bit" is quite a joke here. In your example the ratio
> fluctuations around 5/4 (1.25) went down below 6:5 (1.2) and up above 4:3
> (1.33) !!!!!!! This is indeed a terrible "spreading". It would be blurring
> the whole range between the minor third and the fourth !!!!! But thanks for
> demonstrating to the list, then, WHY the ratio distribution curve has such a
> strong flat component (high plateau) ;-))

Obviously, those numbers are ridiculous!!! It is just a thought
experiment to show that even for extreme values of disharmonisation, the
observed peaks shall not move (exactly as observed).

> Martin
> Could you please say what the authors should have done to avoid the
> problem which you claim they have in their methods?

Plenty of things, in fact, but I think that none of them would produce
"interesting" results. Examples:

1 - use only whispered voice
Using only whispered voice would suppress the bias due to the harmonic
structure.
Fm would not be forced to be (most of the time) a small integer
multiple of f0.
It would give an idea of the average spectral decay after Fm.

2 - select f0 instead of Fm
Pitch extraction algorithms are fairly reliable, after all. That would
lead to peaks only at integer values (instead of simple ratios) and
background noise.

3 - compute and use F1 instead of Fm
That would also suppress the bias due to the harmonic structure. There are
robust algorithms to do so. Same result as 1.

4 - select longer, spectrally stable samples
Would give the same results, with sharper peaks and much less noise.
Much less trouble to debunk :)

etc..

yours truly

François Laferrière

🔗Martin Braun <nombraun@telia.com>

9/11/2003 7:32:29 AM

Francois:

>> Martin
>> In wind instruments, low tones have longer latencies than high tones.
>> And within the spectrum of one tone of these instruments, the latencies
>> vary between the partials. There is no reason to assume that this can be
>> different with the human voice.

> Yes, there is reason, but that will send us back to a discussion we had
> earlier on this forum about the acoustical coupling between the vocal folds
> and the vocal tract.
........

> Vocal fold frequency depends on the folds' own physical properties (mass
> distribution, tension) and on subglottic pressure.

You mean "vocal fold frequenciES" (f0 plus partials) change in synchrony,
without relevant latencies between the partials. This is plausible. But
again, it is not the FREQUENCIES at the source (vocal folds) that determine
the power spectra of speech. The AMPLITUDES at each of these frequencies do.
And the amplitudes at the various partial frequencies do NOT change in
synchrony. This is due to the different sizes of the vibrating air masses
that are involved in these amplitudes. In pitch shifts, phase leads across
the amplitude changes of partials have to occur, on simple physical grounds.

> Obviously, those numbers are ridiculous!!! It is just a thought
> experiment to show that even for extreme values of disharmonisation, the
> observed peaks shall not move (exactly as observed).

The peaks are not shifted, of course, but the majority of ratio probability
is FLAT, that is BETWEEN the low-order-ratio peaks. This is what the figures
of the study show.

>> Martin
>> Could you please say what the authors should have done to avoid the
>> problem which you claim they have in their methods?

> Plenty of things, in fact, but I think that none of them would produce
> "interesting" results. Examples:

> 1 - use only whispered voice.

Not relevant for an influence of speech on the evolution of hearing.

> 2 - select f0 instead of Fm
> Pitch extraction algorithms are fairly reliable, after all. That would
> lead to peaks only at integer values (instead of simple ratios) and
> background noise.

This would not change the results in the least. Again the ratios 1:2:3:4:5:6
would be standing out !!!!!!!!!!

> 3 - compute and use F1 instead of Fm
> That would also suppress the bias due to the harmonic structure. There are
> robust algorithms to do so. Same result as 1.

You would need a PEAK as a point of reference in each of the sample spectra,
if you want to show the distribution of ratios between peaks. F1 can be
anywhere, even in a dip of a sample spectrum.

> 4 - select longer, spectrally stable samples
> Would give the same results, with sharper peaks and much less noise.
> Much less trouble to debunk :)

In natural speech, longer samples are in NO WAY spectrally more stable.

François, none of your suggestions would be of help in finding answers to
the questions of the research project. It seems to me you misunderstood
what the authors tried to investigate.

Martin

🔗Paul Erlich <perlich@aya.yale.edu>

9/11/2003 11:07:26 AM

--- In tuning@yahoogroups.com, "Martin Braun" <nombraun@t...> wrote:
> Francois:

> > 2 - select f0 instead of Fm
> > Pitch extraction algorithms are fairly reliable, after all. That would
> > lead to peaks only at integer values (instead of simple ratios) and
> > background noise.
>
> This would not change the results in the least.

obviously it would change the results, because you'd only see peaks
at 1/1, 2/1, 3/1, 4/1, 5/1 . . . but no longer at 3/2, 4/3, 5/3,
5/4 . . .

🔗Martin Braun <nombraun@telia.com>

9/11/2003 12:31:58 PM

Paul:

> > > 2 - select f0 instead of Fm
> > > Pitch extraction algorithms are fairly reliable, after all. That
> > > would lead to peaks only at integer values (instead of simple ratios)
> > > and background noise.
> >
> > This would not change the results in the least.
>
> obviously it would change the results, because you'd only see peaks
> at 1/1, 2/1, 3/1, 4/1, 5/1 . . . but no longer at 3/2, 4/3, 5/3,
> 5/4 . . .

But Paul, what would the ratios between these peaks be?
Between 2/1 and 3/1 we would get 2/3,
Between 3/1 and 4/1 we would get 3/4,
Between 4/1 and 5/1 we would get 4/5,
etc.
We would end up with the same increased probability of the low-order
ratios, as when using the authors' methods. And we would end up with
the same limit of visible ratios, because this limit is determined by
the relation of f0 to the first formant range of the human vocal
tract.

Martin

🔗Paul Erlich <perlich@aya.yale.edu>

9/11/2003 12:54:11 PM

--- In tuning@yahoogroups.com, "Martin Braun" <nombraun@t...> wrote:
> Paul:
>
> > > > 2 - select f0 instead of Fm
> > > > Pitch extraction algorithms are fairly reliable, after all. That
> > > > would lead to peaks only at integer values (instead of simple
> > > > ratios) and background noise.
> > >
> > > This would not change the results in the least.
> >
> > obviously it would change the results, because you'd only see peaks
> > at 1/1, 2/1, 3/1, 4/1, 5/1 . . . but no longer at 3/2, 4/3, 5/3,
> > 5/4 . . .
> But Paul, what would the ratios between these peaks be?
> Between 2/1 and 3/1 we would get 2/3,
> Between 3/1 and 4/1 we would get 3/4,
> Between 4/1 and 5/1 we would get 4/5,

right, but instead of seeing these ratios among the original peaks,
you're having to take ratios of peaks to get them. that's a change.

🔗francois_laferriere <francois.laferriere@oxymel.com>

9/12/2003 2:06:17 AM

> François
> Vocal fold frequency depends on the folds' own physical properties (mass
> distribution, tension) and on subglottic pressure.

> Martin:
> You mean "vocal fold frequenciES" (f0 plus partials) change in synchrony,
> without relevant latencies between the partials. This is plausible. But
> again, it is not the FREQUENCIES at the source (vocal folds) that determine
> the power spectra of speech. The AMPLITUDES at each of these frequencies
> do. And the amplitudes at the various partial frequencies do NOT change in
> synchrony.

Up to that point, we are in total agreement.

> This is due to the different sizes of the vibrating air masses that
> are involved in these amplitudes. In pitch shifts, phase leads across the
> amplitude changes of partials have to occur, on simple physical grounds.

Whether your physical interpretation is right or wrong, there is no
such thing as a measurable phase lead (except possibly one due to poor
recording hardware, as with, for instance, early digital AD/DA recording)
or "disharmonisation".

Nevertheless, I was puzzled in my early attempts at "high precision
pitch analysis" by the fact that in a pitch movement, the amplitude ratio
between partials may change. For instance (ASCII art again):

............................
...600.|................====
.......|............====....
.......|........----........
.......|....----............
f5 500.|----................
.......|.................---
.......|.............----...
.......|.........----.......
.......|.....====...........
f4 400 |=====---------------
............................

where === denotes a high-intensity peak, while --- is lower intensity
(again, this is exaggerated due to the limitations of ASCII art). f4 moves
from 400 Hz to 480 Hz while f5 moves from 500 Hz to 600 Hz, so that at
any given time f5/f4 = 5/4. But as f4 is more intense around 410 Hz,
its average over the whole window will show a peak around 410 and not
around 440. For the same reason f5 will not average around 550 but
around 585 Hz. So, on the window average, f5/f4 seems to be inharmonic
(about 1.43 instead of 1.25).

To get rid of this annoying effect, you have to use a shorter window,
but for short windows the uncertainty principle kicks in. So a
compromise must be made on window size to get the best spectral estimate.

This computational inharmonicity does exist (and in fact occurs all the
time for a 0.1 sec window).
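
A numerical version of the diagram (the envelope shapes are invented): the
instantaneous ratio is exactly 5/4 throughout, but because the two
amplitude envelopes peak at different times, the window-level frequency
estimates are pulled apart.

import numpy as np

t = np.linspace(0.0, 1.0, 101)
f4, f5 = 400 + 80 * t, 500 + 100 * t     # exactly 5/4 at every instant
a4 = np.where(t < 0.25, 4.0, 1.0)        # f4 loud early in the window
a5 = np.where(t > 0.60, 4.0, 1.0)        # f5 loud late in the window

f4_hat = np.sum(a4 * f4) / np.sum(a4)    # amplitude-weighted means
f5_hat = np.sum(a5 * f5) / np.sum(a5)
print(f4_hat, f5_hat, f5_hat / f4_hat)   # ~427, ~567, ratio ~1.33, not 1.25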

But again:

- This inharmonicity is not physical
- It shall spread the peaks but not shift them on average, as they
are as likely to contribute to the right as to the left of each peak.

again we seem to agree

> The peaks are not shifted, of course, but the majority of ratio
> probability is FLAT, that is BETWEEN the low-order-ratio peaks. This is
> what the figures of the study show.

The paper rightfully focuses on the peaks, not on the background.
Secondly, I see no flat floor, but a gentle slope on each side of each
peak that eventually merges with the neighbouring peaks. I have offered
plenty of plausible causes for peak spreading and background noise:

- pitch instability
- uncertainty principle
- turbulent white noise
- computational inharmonicity

I do not understand why you focus on what occurs BETWEEN the peaks;
this is very secondary.

> Francois:
> 2 - select f0 instead of Fm
> Pitch extraction algorithms are fairly reliable, after all. That would
> lead to peaks only at integer values (instead of simple ratios) and
> background noise.
>
> This would not change the results in the least.

> Paul:
> obviously it would change the results, because you'd only see peaks
> at 1/1, 2/1, 3/1, 4/1, 5/1 . . . but no longer at 3/2, 4/3, 5/3,
> 5/4 . . .

absolutely exact

> Martin:
> But Paul, what would the ratios between these peaks be?
> Between 2/1 and 3/1 we would get 2/3,
> Between 3/1 and 4/1 we would get 3/4,
> Between 4/1 and 5/1 we would get 4/5,
> etc.
> We would end up with the same increased probability of the low-order
> ratios, as when using the authors' methods. And we would end up with
> the same limit of visible ratios, because this limit is determined by
> the relation of f0 to the first formant range of the human vocal
> tract.

Not at all; think about it.
Using pitch instead of Fm to normalise, and taking only the harmonic
contributions (which are the most significant), the sample contributions are:

A(1) = A(f1)/A(f1) = 1 by definition
A(2) = sum (A(f2)/A(f1))
A(3) = sum (A(f3)/A(f1))
...

with noise and error in between.

In fact, this would untangle the current presentation, where the results
are highly convoluted:

A(1) = 1 ;
A(a/b) = sum(A(fa)/A(fb) when N = b) + sum(A(f2a)/A(f2b) when N = 2b) + ...

In other words, taking the pitch (N=1 always) instead of the highest Fm
(prob(N) distributed around 4) would not produce peaks anywhere other
than at integer values.
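
A sketch of what that presentation would look like (synthetic harmonic
samples plus a little invented measurement jitter; nothing else assumed):

import numpy as np
from collections import Counter

rng = np.random.default_rng(1)
counts = Counter()
for _ in range(5000):
    f0 = rng.uniform(100, 250)                   # random fundamental
    partials = np.arange(1, 11) * f0             # exact harmonics
    jitter = rng.normal(0.0, 2.0, 10)            # measurement noise, Hz
    for v in (partials + jitter) / f0:           # normalise by the pitch itself
        counts[round(v, 1)] += 1

print(counts.most_common(10))
# Only the integers 1.0 ... 10.0 stand out; the 3/2, 4/3, 5/4 peaks of
# the Fm-normalised presentation are gone.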

But that is not my point at all.

My point is not to advocate any of my dubious ;-) alternate protocols.

*****************
** MY POINT IS **
*****************

If one is not very careful about the analysis process, one can fiddle
with unrelated data until one gets results fitting one's preconceptions.
To my mind, this paper is honest (otherwise the details needed to understand
the mistake would not have been presented so clearly), but basically flawed.

** END OF MY POINT **

So Martin, when you write:

> François, none of your suggestions would be of help in finding answers to
> the questions of the research project. It seems to me you misunderstood
> what the authors tried to investigate.

you are absolutely correct all the way:

- none of my suggestions would produce valuable results
- I do not understand what they are trying to investigate
- I see, but only barely, what the questions of the research project are

But in some ways, I don't care. I never questioned their conclusions;
perhaps they are correct by some chance. The idea that there are
traces of the musical scale in spoken speech is worth discussing (I do
like it).

I just say that their protocol is wrong from the beginning, and
produces trivial, predictable and otherwise uninteresting results,
that's all.

François Laferrière

🔗Martin Braun <nombraun@telia.com>

9/12/2003 4:58:30 AM

--- In tuning@yahoogroups.com, "Paul Erlich" <perlich@a...> wrote:
> --- In tuning@yahoogroups.com, "Martin Braun" <nombraun@t...> wrote:
> > Paul:
> >
> > > > > 2 - select f0 instead of Fm
> > > > > Pitch extraction algorithms are fairly reliable, after all. That
> > > > > would lead to peaks only at integer values (instead of simple
> > > > > ratios) and background noise.
> > > >
> > > > This would not change the results in the least.
> > >
> > > obviously it would change the results, because you'd only see peaks
> > > at 1/1, 2/1, 3/1, 4/1, 5/1 . . . but no longer at 3/2, 4/3, 5/3,
> > > 5/4 . . .
> >
> > But Paul, what would the ratios between these peaks be?
> > Between 2/1 and 3/1 we would get 2/3,
> > Between 3/1 and 4/1 we would get 3/4,
> > Between 4/1 and 5/1 we would get 4/5,
>
> right, but instead of seeing these ratios among the original peaks,
> you're having to take ratios of peaks to get them. that's a change.

OK. But it's a change in the looking, not a change in the results.
And the change in the looking is not bigger than the change between
opening your eyes once and opening your eyes twice.

Have a good day!

Martin

🔗Martin Braun <nombraun@telia.com>

9/12/2003 6:41:02 AM

François:

> So, on the window average, f5/f4 seems to be inharmonic
> (about 1.43 instead of 1.25).

> To get rid of this annoying effect, you have to use a shorter window,
> but for short windows the uncertainty principle kicks in. So a
> compromise must be made on window size to get the best spectral estimate.

> This computational inharmonicity does exist (and in fact occurs all the
> time for a 0.1 sec window).

I am glad that we agree on the description of the phenomenon of
disharmonization in pitch shifts. It's also fine that you saw in empirical
data what I expected by only thinking through the physics.

> But again:
> - This inharmonicity is not physical

Here we still disagree. You are right that the ratio 5/4 (of your example)
does not disappear. But what happens if your time windows are so short that
the "errors" through averaging disappear? You'll see this: in some windows
there is no power at the 4th partial, and in other windows there is no power
at the 5th partial. So you have a 5/4 ratio all the time, but one that is
not real. What you have in reality is a ratio between real peaks (those that
have power) which deviates from 5/4.

> - It shall spread the peaks but not shift them on average, as they
> are as likely to contribute to the right as to the left of each peak.

> again we seem to agree

exactly

>> The peaks are not shifted, of course, but the majority of ratio
>> probability is FLAT, that is BETWEEN the low-order-ratio peaks. This is
>> what the figures of the study show.

> The paper rightfully focuses on the peaks, not on the background.

The interpretation of the authors focuses on the peaks. But they displayed
the complete spectra for anybody to see.

> Secondly, I see no flat floor, but a gentle slope on each side of each
> peak that eventually merges with the neighbouring peaks.

François, below the slopes - in fact: below the dips - there is a HIGH
plateau of noise !!!
For example in Fig.2C the noise floor is 20 times (!) as high as the
difference between the peak at 5/4 and the valley between 5/4 and 6/5.

> I do not understand why you focus on what occurs BETWEEN the peaks;
> this is very secondary.

Well, in the example above there is much more between the peaks than at the
peaks. This is important to note, because it shows the big difference
between "clean" theoretically derived textbook spectra and real speech
spectra. The value of the study is not to have replicated simple general
wisdom on the general harmonicity of speech. Its value is to have shown
what the harmonicity looks like in REAL data.

> So Martin, when you write:

>> François, none of your suggestions would be of help in finding answers to
>> the questions of the research project. It seems to me you misunderstood
>> what the authors tried to investigate.

> you are absolutely correct all the way:

> - none of my suggestions would produce valuable results
> - I do not understand what they are trying to investigate
> - I see, but only barely, what the questions of the research project are

You could have read what you did not understand or see in the first
section of the paper, which is called "Introduction".

> But in some ways, I don't care. I never questioned their conclusions;
> perhaps they are correct by some chance. The idea that there are
> traces of the musical scale in spoken speech is worth discussing (I do
> like it).

If the idea is worth discussing, and you even like it, then the same should
also apply to the details of the results. That is, which ratios stick out of
the noise and which don't.

> I just say that their protocol is wrong from the beginning, and
> produces trivial, predictable and otherwise uninteresting results,
> that's all.

Had the authors asked you before starting this study, you would have grossly
mispredicted the amount of noise between the peaks. And you also would not
have been able to predict the limit beyond which the simple ratios disappear
in the noise. You might have predicted sex differences, but not their exact
values.

Martin

🔗francois_laferriere <francois.laferriere@oxymel.com>

9/15/2003 1:44:26 AM

Hello Martin,

A final post (as far as I am concerned) in this discussion, which is
becoming very technical, more and more distant from the list's concerns,
and somewhat tedious.

> François:
> So, on the window average, f5/f4 seems to be inharmonic
> (about 1.43 instead of 1.25).

> To get rid of this annoying effect, you have to use a shorter window,
> but for short windows the uncertainty principle kicks in. So a
> compromise must be made on window size to get the best spectral estimate.

> This computational inharmonicity does exist (and in fact occurs all the
> time for a 0.1 sec window).

> Martin:
> I am glad that we agree on the description of the phenomenon of
> disharmonization in pitch shifts. It's also fine that you saw in empirical
> data what I expected by only thinking through the physics.

> François:
> But again:
> - This inharmonicity is not physical

> Martin:
> Here we still disagree. You are right that the ratio 5/4 (of your example)
> does not disappear. But what happens if your time windows are so short
> that the "errors" through averaging disappear? You'll see this: in some
> windows there is no power at the 4th partial, and in other windows there
> is no power at the 5th partial.

Not at all. What appears is:
- as the window size gets smaller, the harmonic peaks get broader, until
they merge (and no peak extraction algorithm is of any help)
- as they get broader, the assessment of harmonicity is less and less
accurate

> So you have a 5/4 ratio all the time, but one that is
> not real. What you have in reality is a ratio between real peaks (those
> that have power) which deviates from 5/4.

Even if that occurs (a harmonic is so small that a given peak vanishes
in the noise), that does not mean that it ceases to exist, any more than
vanishing stars cease to exist in daylight. In other words, that some
peaks cease to be measurable is more a signal-to-noise problem than
anything related to a modification of the underlying physical process.

> François
> - It shall spread the peaks but not shift them on average, as they
> are as likely to contribute to the right as to the left of each peak.

> again we seem to agree

> Martin:
> exactly

We have never agreed on so much before. Great!

> Martin
> The peaks are not shifted, of course, but the majority of ratio
> probability is FLAT, that is BETWEEN the low-order-ratio peaks. This is
> what the figures of the study show.

> François:
> The paper rightfully focuses on the peaks, not on the background.

> Martin:
> The interpretation of the authors focuses on the peaks. But they
> displayed the complete spectra for anybody to see.

> François:
> Secondly, I see no flat floor, but a gentle slope on each side of each
> peak that eventually merges with the neighbouring peaks.

> Martin:
> François, below the slopes - in fact: below the dips - there is a HIGH
> plateau of noise !!!
> For example in Fig.2C the noise floor is 20 times (!) as high as the
> difference between the peak at 5/4 and the valley between 5/4 and 6/5.

20 times? What do you mean, 13 dB? Where do you see a "floor" (at which
ratio value)?

> François:
> I do not understand why you focus on what occurs BETWEEN the peaks;
> this is very secondary.

There is no floor, but a steep slope of around -20 dB per octave
between 1 and 2.
To get a clearer picture, we should remove or "normalise" this slope.

But to do so, we ought to know where it comes from.

Where does it come from? In normal speech, there is around -6 dB from
the glottal source (this may vary a lot) and -6 dB from the lip radiation
characteristics, so roughly -12 dB per octave. The missing -8 dB comes from
the fact that most data gathered for ratios between 1 and 2 come from
the right-hand side of a formant. But it is certainly not a straight,
simple -8 dB/octave, because near 1 the more complex (N+1):N ratios
such as 7:6 (compared to the less complex 3:2) contribute more, due to
their average proximity to the formant top. I see no way to "predict" this
value of -20 dB, because it mixes up contributions of the formant bandwidth,
made even more complex by the normalisation process, and of the glottis and
lip radiation. Instead of isolating the variables contributing to the
background, the normalisation makes them impossible to disentangle.

Not being able to model this high-frequency decay properly hampers
any attempt to interpret the background. But that is not the topic of
the paper anyway.

> Martin:
> Well, in the example above there is much more between the peaks than at
> the peaks. This is important to note, because it shows the big difference
> between "clean" theoretically derived textbook spectra and real speech
> spectra. The value of the study is not to have replicated simple general
> wisdom on the general harmonicity of speech. Its value is to have shown
> what the harmonicity looks like in REAL data.

And this has nothing to do with harmonicity/inharmonicity anyway, and
inharmonicity has nothing to do with the topic of the paper.

> So Martin, when you write:

>> François, none of your suggestions would be of help in finding answers to
>> the questions of the research project. It seems to me you misunderstood
>> what the authors tried to investigate.

> Francois:
> you are absolutely correct all the way:

> - none of my suggestions would produce valuable results
> - I do not understand what they are trying to investigate
> - I see, but only barely, what the questions of the research project are

> Martin
> You could have read what you did not understand or see in the first
> section of the paper, which is called "Introduction".

I was only half kidding. The paper goes from a rather broad hypothesis
to a no less broad conclusion through the very small bottleneck of a
debatable (to say the least) process.

> Martin:
> If the idea is worth discussing, and you even like it, then the same
> should also apply to the details of the results. That is, which ratios
> stick out of the noise and which don't.

I have explained clearly and quantitatively enough which ratios stick out,
and I maintain what I said:

> I just say that their protocol is wrong from the beginning, and
> produces trivial, predictable and otherwise uninteresting results,
> that's all.

> Martin:
> Had the authors asked you before starting this study, you would have
> grossly mispredicted the amount of noise between the peaks. And you also
> would not have been able to predict the limit beyond which the simple
> ratios disappear in the noise.

Well, in experimental science, who cares about "predicting" the noise?
Understanding the source of the noise, in order to reduce it in the
data-gathering process, is the only useful issue.

> Martin:
> You might have predicted sex differences, but not their exact
> values.

I have been able to "predict" :-) the peak locations and rough relative
amplitudes from the harmonic-number distribution alone, to explain the sex
difference, and to describe roughly the different contributions to the
background noise and the spectral decay.

I think that it is not bad for a dilettante.

I must go back to the work I am paid for.

yours truly

François Laferrière