A duplex theory of pitch perception


  • A duplex theory of pitch perception

    I came across a very interesting article (attached) on how mammals perceive pitch. The article is described as seminal in the psychoacoustics field.

    The short form is that both the frequency spectrum and the autocorrelation function (over time) are used, which allows the missing fundamental to be recreated.
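
    To make the missing-fundamental point concrete, here is a minimal sketch (mine, not from the paper) in Python/NumPy: a complex of harmonics at 400, 600, and 800 Hz contains no energy at 200 Hz, yet its autocorrelation peaks at the 5 ms period of the absent 200 Hz fundamental.

```python
# Minimal sketch (not from Licklider's paper): the autocorrelation of a
# harmonic complex peaks at the period of the missing fundamental.
import numpy as np

fs = 16000                        # sample rate, Hz
t = np.arange(0, 0.1, 1 / fs)     # 100 ms of signal

# Harmonics of a 200 Hz fundamental, with the fundamental itself absent.
x = sum(np.sin(2 * np.pi * f * t) for f in (400, 600, 800))

# Unnormalized autocorrelation over positive lags.
ac = np.correlate(x, x, mode="full")[len(x) - 1:]

# Strongest peak after lag 0, ignoring very short lags (< 1 ms).
min_lag = int(fs / 1000)
peak = min_lag + np.argmax(ac[min_lag:])
print(f"autocorrelation peak at {1000 * peak / fs:.2f} ms "
      f"-> {fs / peak:.0f} Hz")   # ~5.00 ms, i.e. ~200 Hz
```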

    Transients have a distinctive autocorrelation, and a boring spectrum.

    The author is famous in that field: J. C. R. Licklider - Wikipedia, the free encyclopedia.

    Licklider1951.pdf


    Autocorrelation - Wikipedia, the free encyclopedia
    Last edited by Joe Gwinn; 03-09-2015, 01:36 PM.

  • #2
    Key passage from the paper:

    The essence of the duplex theory of pitch perception is that the auditory system employs both frequency and autocorrelational analysis. The frequency analysis is performed by the cochlea, the autocorrelational analysis by the neural part of the system. The latter is therefore an analysis not of the acoustic stimulus itself but of the trains of nerve impulses into which the action of the cochlea transforms the stimulus. This point is important because the highly nonlinear process of neural excitation intervenes between the two analyses.

    Interesting that this realization came so long ago! We have discussed here before that the process is the time domain analysis of the outputs of the rather broad cochlear filters. I doubt that the process is as simple as autocorrelation, but that certainly captures the essence of part of it.
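
    As a toy version of the duplex idea - and only that: the paper doesn't prescribe an implementation, and the second-order Butterworth filterbank below is my crude stand-in for the broad cochlear filters - one can autocorrelate each channel's output and sum across channels. The main peak of this "summary autocorrelogram" tracks the pitch, missing fundamental included.

```python
# Toy duplex model: bandpass filterbank ("cochlea") followed by
# per-channel autocorrelation ("neural part"), summed across channels.
import numpy as np
from scipy.signal import butter, lfilter

fs = 16000
t = np.arange(0, 0.1, 1 / fs)
x = sum(np.sin(2 * np.pi * f * t) for f in (400, 600, 800))  # no 200 Hz

def channel(x, center, fs):
    """One broad 'cochlear' channel: 2nd-order Butterworth bandpass."""
    lo, hi = 0.7 * center / (fs / 2), 1.3 * center / (fs / 2)
    b, a = butter(2, [lo, hi], btype="band")
    return lfilter(b, a, x)

bands = [channel(x, c, fs) for c in (300, 500, 700, 900)]

max_lag = int(0.02 * fs)                 # lags up to 20 ms
summary = np.zeros(max_lag)
for y in bands:
    ac = np.correlate(y, y, mode="full")[len(y) - 1:len(y) - 1 + max_lag]
    summary += ac / ac[0]                # normalize each channel

min_lag = int(fs / 1000)                 # skip lags < 1 ms
peak = min_lag + np.argmax(summary[min_lag:])
print(f"summary autocorrelogram peak: {fs / peak:.0f} Hz")   # ~200 Hz
```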



    • #3
      Originally posted by Mike Sulzer View Post
      Interesting that this realization came so long ago! We have discussed here before that the process is the time domain analysis of the outputs of the rather broad cochlear filters. I doubt that the process is as simple as autocorrelation, but that certainly captures the essence of part of it.
      Yes. One likely addition is a special section for handling transients, like the crack of a twig breaking under the foot of a predator. The autocorrelation process may not be fast enough, and such transients are loud enough to be handled directly.
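
      A back-of-envelope sketch of why a separate fast path is plausible (my own illustration, with made-up numbers): a bare energy threshold flags a click essentially instantly, while an autocorrelation-based pitch estimate cannot even form until the analysis window spans a couple of periods.

```python
# Latency comparison, as a toy: threshold detection vs. autocorrelation.
import numpy as np

fs = 16000
x = np.zeros(int(0.05 * fs))
onset = int(0.02 * fs)
x[onset:onset + 8] = 1.0                 # a ~0.5 ms click at t = 20 ms

# Fast path: the first sample whose magnitude crosses a threshold.
detect = np.argmax(np.abs(x) > 0.5)
print(f"energy detector fires {1000 * (detect - onset) / fs:.2f} ms after onset")

# Slow path: a 100 Hz pitch has a 10 ms period, so an autocorrelation
# window must span at least ~2 periods before a lag peak can form.
period = fs // 100
print(f"autocorrelation needs >= {1000 * 2 * period / fs:.0f} ms of signal")
```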



      • #4
        The part that folks often have mistaken assumptions about is that, while sound IS a wave, and waves do apply recurring pressure fronts to the eardrum and the entrance to the cochlea, the frequency range of sound - more specifically, the range of frequencies that are detected as sound - is well in excess of the conduction speed and repolarization rate of neurons. Vibration-sensing cells can be made to fire by poking them mechanically in quick succession, but once you hit frequencies as "high" as maybe 40 Hz, the detector cells have not had enough time for the precarious chemical balance leading up to neuron firing to be re-established. All those ions that leaked into and out of the neuron have to find their way back to where they came from, inside or outside the cell, before that neuron can fire again, and that takes time.

        The upshot is that sensing sound requires a more sophisticated combination of systems and elaborate "coding" in order for us to hear/sense frequencies above the capacity of individual neurons.
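
        The standard workaround for this ceiling is the "volley principle": individual neurons take turns, each phase-locking to some cycles of the stimulus, and the pooled spike train carries the full periodicity. A toy simulation of it (mine, with arbitrary parameters): each model neuron is capped near 500 spikes/s by a 2 ms refractory period, yet the pooled activity of 200 of them still shows the 0.5 ms period of a 2 kHz tone.

```python
# Volley principle, as a toy: refractory-limited neurons, pooled output.
import numpy as np

rng = np.random.default_rng(0)
fs = 100_000                           # simulation rate, Hz
f0 = 2000.0                            # stimulus frequency, Hz
t = np.arange(0, 0.05, 1 / fs)
drive = np.sin(2 * np.pi * f0 * t)
near_peak = np.flatnonzero(drive > 0.95)   # ~10% of each cycle

refractory = int(0.002 * fs)           # 2 ms dead time per neuron
counts = np.zeros(len(t))              # pooled spike histogram

for _ in range(200):                   # 200 identical noisy neurons
    last = -refractory
    for i in near_peak:
        if i - last >= refractory and rng.random() < 0.3:
            counts[i] += 1             # a spike, phase-locked to the peak
            last = i

# Periodicity of the pooled train via autocorrelation.
c = counts - counts.mean()
ac = np.correlate(c, c, mode="full")[len(c) - 1:]
min_lag = int(0.0002 * fs)             # ignore lags < 0.2 ms
peak = min_lag + np.argmax(ac[min_lag:])
print(f"pooled periodicity: {1000 * peak / fs:.2f} ms "
      f"(stimulus period {1000 / f0:.2f} ms)")
```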

        The other rather miraculous aspect is that we only have two ears, and the cochlea itself does not provide a spatial representation of sound, yet we experience sound as coming from multiple places. Our skin and eyes allow sensory information to be tied to specific locations, whether on the skin or on the retina. The ear has no such correspondence between source and "landing spot". The extraction of spatial information from those two puny ears is a very complex affair. One of my old profs - Albert Bregman - is one of the godfathers of what came to be called "auditory scene analysis" ( Auditory scene analysis - Wikipedia, the free encyclopedia ). I never really followed it enough to know how it was the same as or different from Licklider's stuff.



        • #5
          Originally posted by Mark Hammer View Post
          The part that folks often have mistaken assumptions about is that, while sound IS a wave, and waves do apply recurring pressure fronts to the eardrum and the entrance to the cochlea, the frequency range of sound - more specifically, the range of frequencies that are detected as sound - is well in excess of the conduction speed and repolarization rate of neurons. Vibration-sensing cells can be made to fire by poking them mechanically in quick succession, but once you hit frequencies as "high" as maybe 40 Hz, the detector cells have not had enough time for the precarious chemical balance leading up to neuron firing to be re-established. All those ions that leaked into and out of the neuron have to find their way back to where they came from, inside or outside the cell, before that neuron can fire again, and that takes time.
          While any one neuron is thus restricted, it isn't clear that a population of 10,000 neurons is similarly limited. Population codes can be very precise, no matter how flaky the individual neurons may be.

          Licklider shows a way to represent sounds in 2D, where they can be analyzed using something like the mechanisms used for vision.

          And it can easily be saved and associatively retrieved using Sparse Distributed Memory: Sparse distributed memory - Wikipedia, the free encyclopedia
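
          For the curious, a minimal toy of Kanerva's Sparse Distributed Memory (my own sketch, not code from the linked article; all sizes and radii are arbitrary): a pattern is written into every "hard location" within a Hamming radius of its address, and read back by majority vote over the same neighborhood, so even a noisy cue retrieves the stored pattern.

```python
# Toy Sparse Distributed Memory (after Kanerva).
import numpy as np

rng = np.random.default_rng(1)
N = 256             # address/word length, bits
M = 2000            # number of hard locations
R = 112             # activation radius (Hamming distance)

hard_addrs = rng.integers(0, 2, size=(M, N), dtype=np.int8)
counters = np.zeros((M, N), dtype=np.int32)

def active(addr):
    """Indices of hard locations within Hamming distance R of addr."""
    return np.flatnonzero(np.sum(hard_addrs != addr, axis=1) <= R)

def write(addr, word):
    """Add word (encoded as +/-1) to every active location's counters."""
    counters[active(addr)] += np.where(word == 1, 1, -1).astype(np.int32)

def read(addr):
    """Majority vote across the active locations."""
    return (counters[active(addr)].sum(axis=0) > 0).astype(np.int8)

# Store one pattern at its own address, plus some unrelated distractors.
pattern = rng.integers(0, 2, size=N, dtype=np.int8)
write(pattern, pattern)
for _ in range(10):
    p = rng.integers(0, 2, size=N, dtype=np.int8)
    write(p, p)

# Retrieve from a corrupted cue: flip 20 of the 256 bits.
cue = pattern.copy()
cue[rng.choice(N, size=20, replace=False)] ^= 1
print(f"bits wrong after recall: {np.sum(read(cue) != pattern)}")  # small/zero
```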


          The upshot is that sensing sound requires a more sophisticated combination of systems and elaborate "coding", in order for us to hear/sense frequencies above the capacity of individual neurons.
          Yes, probably some kind of population code.


          The other rather miraculous aspect is that we only have two ears, and the cochlea itself does not provide a spatial representation of sound, yet we experience sound as coming from multiple places. Our skin and eyes allow sensory information to be tied to specific locations, whether on the skin or on the retina. The ear has no such correspondence between source and "landing spot". The extraction of spatial information from those two puny ears is a very complex affair. One of my old profs - Albert Bregman - is one of the godfathers of what came to be called "auditory scene analysis" ( Auditory scene analysis - Wikipedia, the free encyclopedia ). I never really followed it enough to know how it was the same as or different from Licklider's stuff.
          The parallel may be more apt than Bregman ever knew - it may be that the same mechanism is used for both, at least initially. By now they will have evolved away from one another, but the common ancestry will still shine through.



          • #6
            Well, if I were given a stretched membrane with a continuously varying low-Q resonant frequency along its length, densely populated by nerves capable of sensing the amplitude of motion with a time resolution much slower than the resonant frequencies, but fast enough to sense useful changes in the amplitude of oscillation, I would encode the amplitude of vibration at each nerve location, probably by varying the firing rate of each nerve as a function of amplitude. Then I would construct a cross-comparison nerve network to compare the firing rates of groups of neighbors. The excitation caused by some frequency f would be associated with a specific neural pattern of firing rates, and pitch detection would then be accomplished by detecting a particular pattern. Does it work something like this, Mark?
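
            Taking that description literally, here is a toy of the rate-place scheme just outlined (my construction; the Butterworth channels, their spacing, and the template matching are all assumptions): each "place" reports a slow firing rate proportional to local vibration amplitude, and a tone is recognized by cross-comparing the across-channel rate pattern against stored patterns.

```python
# Rate-place toy: low-Q resonant places -> firing-rate pattern -> match.
import numpy as np
from scipy.signal import butter, lfilter

fs = 16000
t = np.arange(0, 0.1, 1 / fs)
centers = np.geomspace(200, 4000, 24)    # resonant places along the membrane

def rate_pattern(x):
    """Firing rate per place: bandpass, rectify, average (a slow readout)."""
    rates = []
    for c in centers:
        b, a = butter(2, [0.8 * c / (fs / 2), 1.25 * c / (fs / 2)], btype="band")
        rates.append(np.mean(np.abs(lfilter(b, a, x))))   # amplitude -> rate
    r = np.array(rates)
    return r / (np.linalg.norm(r) + 1e-12)

# Store rate patterns for known tones, then classify an unknown tone by
# cross-comparison (here: the template with the largest dot product wins).
templates = {f: rate_pattern(np.sin(2 * np.pi * f * t)) for f in (300, 800, 2000)}
probe = rate_pattern(np.sin(2 * np.pi * 800 * t))
best = max(templates, key=lambda f: float(templates[f] @ probe))
print(f"detected pitch: {best} Hz")      # matches the 800 Hz template
```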

            Originally posted by Mark Hammer View Post
            The part that folks often have mistaken assumptions about is that, while sound IS a wave, and waves do apply recurring pressure fronts to the eardrum and the entrance to the cochlea, the frequency range of sound - more specifically, the range of frequencies that are detected as sound - is well in excess of the conduction speed and repolarization rate of neurons. Vibration-sensing cells can be made to fire by poking them mechanically in quick succession, but once you hit frequencies as "high" as maybe 40 Hz, the detector cells have not had enough time for the precarious chemical balance leading up to neuron firing to be re-established. All those ions that leaked into and out of the neuron have to find their way back to where they came from, inside or outside the cell, before that neuron can fire again, and that takes time.

            The upshot is that sensing sound requires a more sophisticated combination of systems and elaborate "coding" in order for us to hear/sense frequencies above the capacity of individual neurons.

            The other rather miraculous aspect is that we only have two ears, and the cochlea itself does not provide a spatial representation of sound, yet we experience sound as coming from multiple places. Our skin and eyes allow sensory information to be tied to specific locations, whether on the skin or on the retina. The ear has no such correspondence between source and "landing spot". The extraction of spatial information from those two puny ears is a very complex affair. One of my old profs - Albert Bregman - is one of the godfathers of what came to be called "auditory scene analysis" ( Auditory scene analysis - Wikipedia, the free encyclopedia ). I never really followed it enough to know how it was the same as or different from Licklider's stuff.



            • #7
              One of the things I regularly draw attention to for less-informed (than yourself) folks is the role of correlations between harmonic content and the fundamentals it arises from in establishing recognizable and differentiable sound sources in the audio landscape. And certainly, the busier and buzzier the audio landscape is, the harder it becomes to "assign" harmonic content to differentiable sources. This barely-there correlation can be further degraded by things like group delay, which can introduce asynchrony in portions of the harmonic content vis-a-vis the source fundamental, and of course by the distorting effect that SPL so often has on pitch-sensing and perception.

              If the world were all sine waves, that would be one thing. But there is so much harmonic content that requires linking up "in time" with fundamentals in order to produce a coherent aural landscape. Sorting all that out, such that the listener knows/perceives that "this goes with that", depends on the detected correlation between sounds.

              And it goes without saying that, in a binaural world, much of what one ear hears is - unless the head is held rigid, the surrounding surfaces are identically reflective, and the sound source is perfectly on-axis - a slightly delayed version of what the ear closer to the source hears. The need to correlate what is the same and what is different between the ears, such that they are perceived as "the same thing/source", is no different from the need to correlate what is similar and different between the two retinas (looking in the same direction but sensing ever-so-slightly different patterns of visual stimulation).
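
              The binaural half of this is easy to demonstrate (a toy illustration of the idea, not anyone's actual model): cross-correlating the two ear signals and taking the best-aligned lag recovers the interaural time difference, which is exactly the kind of cue a localization system can use.

```python
# Toy interaural cross-correlation: recover a known interaural delay.
import numpy as np

rng = np.random.default_rng(2)
fs = 48000
left = rng.standard_normal(4800)        # 100 ms of broadband sound
delay = 24                              # 0.5 ms interaural delay, in samples
right = np.concatenate([np.zeros(delay), left[:-delay]])

# Cross-correlate the ear signals and find the best-aligned lag.
xc = np.correlate(right, left, mode="full")
lag = np.argmax(xc) - (len(left) - 1)
print(f"estimated ITD: {1e6 * lag / fs:.0f} us")   # ~500 us
```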

              So the correlation between multiple sources of the same information is essential to perception. If sound were impeccably focused, like a laser beam, and landed in one spot only, that would be one thing. But the same sound source has multiple components, occurs in a reflective/absorptive environment chock-a-block with other competing-but-irrelevant sounds, and is ultimately detected by sensory organs that impose their own natural timing differences. The only thing that could possibly make sense of all that chaos and allow us to perceive what seems like an orderly sound field/"scene" IS correlation of that sensory info.
              Last edited by Mark Hammer; 03-10-2015, 02:26 PM.



              • #8
                References from an MIT course on the perception of pitch

                An added set of references from an MIT course on the perception of pitch:

                Theme 7 Papers

