  “What they were looking at were these absolutely gorgeous pictures of cells with these long protrusions, and they had the best microscopy that was available,” Poeppel says. “They said that must be the unit we care about.” Fifty years later, scientists discovered the synapse and that became the focus. Then another thirty years passed. “Now, suddenly, what turns out to be the unit of modification for synaptic plasticity are dendritic spines, which are a fraction of a fraction of a cell,” says Poeppel. No one could see them before. “It’s not that we don’t know; we do know more. The problem is: Every time we go to higher resolutions, we’re discovering in part new phenomena and in part how we have to reinterpret old phenomena.”

  The explosion in brain-imaging technology—the “supercool” MEGs and fMRIs and PETs—made this new era of granularity possible by allowing scientists to record the activity of tens of thousands of cells firing at the same time. Until the mid-1990s, most data on the brain came from injuries, deficits, lesions, and so on. Data from deficits, though, would never have given you what Poeppel and Walker produced in just five minutes of watching my brain on sound, or what they could get if we had carried on with speech sounds. “I can stick you in there right now and we can do ‘ah’s,’ ‘oo’s,’ and ‘ee’s’ all day long until we map your space,” says Poeppel. “We can map the timing, we can do the spatial analysis. We can do a 3-D reconstruction. Everyone’s a little different, but there’s a high degree of consistency. Any auditory stimulus will have this cascade of responses.”

  In other words, everything from a beep to a recitation of Macbeth sends waves of electrical pulses rippling through the brain along complex though predictable routes on a schedule tracked in milliseconds. Along the way, the response to a beep gets less complicated because a beep is, well, less complicated than Macbeth’s guilty conscience. But any word from “apple” to “zipper” takes less than half a second to visit all the lower and higher processing centers of the brain, and in that fraction of a second neuroscientists have a pretty good idea of the several stops the word makes and the work done by the brain along the way. I now understood how the P1 and N1 got their names. Each point where the response is distinctive and concentrated has been labeled with an N or a P depending on whether the wave is usually negative-going or positive-going at that point (counterintuitively to me, convention dictates that negativities go up and positivities go down), and then the N or P is given a number to indicate either how many milliseconds were required to get there or how many major peaks have preceded it. Taken together, these responses have helped us understand far more about the steps in the intricate dance the brain performs as it converts sound to language.
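
  To make that naming convention concrete, here is a minimal sketch (mine, not from the book) of how an averaged brain response could be labeled automatically: find each prominent peak, mark it P or N by polarity, and append its approximate latency in milliseconds. The function name and the toy waveform are illustrative only.

```python
# Minimal sketch (not from the book): label ERP peaks the way the text describes,
# by polarity (P or N) plus approximate latency in milliseconds.
import numpy as np
from scipy.signal import find_peaks

def label_erp_peaks(erp, fs, prominence=1.0):
    """Return labels such as 'P60' or 'N100' for prominent deflections.

    erp : 1-D array of voltages, time-locked to stimulus onset
    fs  : sampling rate in Hz
    """
    labeled = []
    for sign, name in ((1, "P"), (-1, "N")):        # positive- and negative-going peaks
        idx, _ = find_peaks(sign * erp, prominence=prominence)
        for i in idx:
            latency_ms = int(round(1000 * i / fs))  # samples -> milliseconds
            labeled.append((latency_ms, f"{name}{latency_ms}"))
    return [label for _, label in sorted(labeled)]

# Toy waveform: a positive bump near 60 ms and a larger negative one near 100 ms.
fs = 1000                                           # one sample per millisecond
t = np.arange(0, 0.4, 1 / fs)
erp = 2 * np.exp(-((t - 0.06) / 0.010) ** 2) - 3 * np.exp(-((t - 0.10) / 0.012) ** 2)
print(label_erp_peaks(erp, fs))                     # prints ['P60', 'N100']
```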

  In a hearing adult, in the first fifteen milliseconds, the signal, or sound, is in the brain stem. In this early and relatively small section of the brain, the sound will be relayed through the cochlear nuclei, the superior olive, the lateral lemniscus, and the inferior colliculus in the midbrain, subdividing and branching each time onto one of several possible parallel pathways, until it reaches the early auditory cortex. “What has been accomplished so far?” asks Poeppel. “All these nuclei have done an enormous amount of sophisticated analysis and computation.” The superior olive, for instance, is the first place where cells are sensitive to information coming from both ears and begin to integrate and compare the two, the better to figure out where the sound originated. “Doing things like localizing … is not necessarily accomplished, but the critical agreements are already calculated for you at step two,” says Poeppel. “It’s pretty impressive.” It was these early distinctive responses that Jessica O’Gara was looking for in Alex when she gave him an auditory brain stem response test, designed to see if the auditory path to the brain is intact.

  If the route is clear, as it is in a typically hearing person, the sound reaches the primary auditory cortex in the temporal lobe somewhere around twenty milliseconds and spends another few milliseconds activating auditory areas such as Heschl’s gyrus. The P1 is usually at about sixty milliseconds, and the N1, as Poeppel and Walker just recorded in me, peaks around one hundred milliseconds. It’s still mainly an auditory response, as the brain continues to identify and analyze what it just heard, but the brain is also starting to take visual information into account. Already, at one hundred milliseconds, higher-order processing is under way. The more unpredictable the sound—a word you rarely hear, for instance—the bigger the amplitude of the N1 because the harder you’re working to make sense of it. Conversely, the response is more muted for words we hear all the time or sounds we expect, such as our own speech.

  Around two hundred milliseconds, where the P2 or P200 occurs, the brain starts to look things up. It is beginning to compare the arriving sound with what it already knows by digging into stored memories and consulting its mental dictionary. As befits this more complex task, the signals are now firmly in the higher regions of the brain, and the neural activity is far more widespread. This is also the point where information arriving from other systems truly converges. However a word is perceived—whether you hear it, read it, see it, or touch it—you will begin processing it fully at this same point.

  The ability to recognize words, to acknowledge the meaning found in the dictionary, is known as lexical processing and happens between two hundred and four hundred milliseconds, which is very late in brain time. As was true earlier, the amplitude of the N400 reflects the amount of work the brain is being asked to do: It is larger for more infrequent or unfamiliar words and for words that vary from many others by only one letter, such as “hat,” “hit,” “hot,” and “hut.” By six hundred milliseconds, the brain is processing entire sentences, and grammar kicks in. The P600 is thought to reflect what neuroscientists call “repair and reanalysis,” because it is elicited by grammatical errors and by “garden path sentences” (the kind that meander and dangle their modifiers: “The broker persuaded to sell the stock was tall”) and—in a non-linguistic example—by musical chords played out of key.

  But all of that is only half the story. The cascade of responses that begins with the ear and leads all the way to the ability to follow a poorly constructed sentence—to know that it’s the broker who is tall—is known as bottom-up processing. It starts with the basic input to any sense—raw data—and ends with such higher-level skills as reasoning and judgment and critical thinking—in other words, our expectations and knowledge. Neuroscientists now believe that the process is also happening in reverse, that the cascade flows both ways, with information being prepared, treated, and converted in both directions simultaneously, from the bottom up and from the top down.

  This idea amounts to a radical rethinking of the very nature of perception. “Historically, the way we intuitively think about all perception is that we’re like a passive recording device with detectors that are specialized for certain things, like a retina for seeing, a cochlea for hearing, and so forth,” says Poeppel. “We’re kind of a camera or microphone that gets encoded somehow and then magically makes contact with the stuff in your head.” At the same time, many of the big thinkers who pondered perception, beginning with Helmholtz (him again), knew that couldn’t be quite right. If we reached for a glass or listened to a sentence, didn’t it help to be able to anticipate what might come next? In the mid-to-late twentieth century, a handful of prominent researchers proposed models of perception that suggested instead that we engaged in “active sensing,” seeking out what was possible as we went along. Most important among these was Alvin Liberman at Yale University, whose influential motor theory of speech perception fell into this category. He proposed that as we listen to speech, the brain essentially imagines producing the words itself. Liberman’s elegant idea and other such ideas did not gain much traction until the past decade, when they suddenly became a hot topic of conversation in the study of cognition. What everyone is talking about today is the brain’s power of prediction.

  That power is not mystical but mathematical. It reflects the data-driven, statistical approach that informs contemporary cognitive science and defines the workings of the brain in two ways: representations and computations. Representations are the equivalent of a series of thumbnail images of the things and ideas we have experienced; everything in our mental hard drive, like the family photographs stored in your computer. How exactly they are stored remains an open question—probably not as pictures, though, because that would be too easy. Computations are what they sound like: the addition, subtraction, multiplication, and division we perform on the representations, as if the brain begins cropping, rotating, and eliminating red-eye. They are how we react to the world and, crucially, how we learn. “The statistical approach makes strong assumptions about what kinds of things a learner can take in, process, ‘chunk’ in the right way, and then use for counting or for deriving higher-order representations,” says Poeppel.

  On one level, prediction is just common sense, which may be one reason it didn’t get much scientific respect for so long. If you see your doctor in her office, you recognize her quickly. If you see her in the grocery store dressed in jeans, you’ll be slower to realize you know her. Predictable events are easy for the brain; unpredictable events require more effort. “Our expectations for what we’re going to perceive seem to be a critical part of the process,” says Greg Hickok, a neuroscientist at the University of California, Irvine, who studies predictive coding, among other things, and regularly collaborates with Poeppel. “It allows the system to make guesses as to what it might be seeing and to use computational shortcuts. Perception is very much a top-down process, a very active process of constructing a reality. A lot of that comes from prediction.”

  Predictive coding has real implications for Alex, Hickok points out. “Someone with a degraded input system has to rely a lot more on top-down information,” he says. “If you analyze sensory input roughly, you test against more information as it’s coming in. Let me look and see if it matches.” Anyone who reads speech is using prediction, guessing at context from the roughly one-third of what is said that can be seen on the mouth and using any other visual cues he can find. Those who use hearing aids and implants still have to fill in gaps as well. No wonder so many deaf and hard-of-hearing children are exhausted at the end of the day.

  But top-down processing can be simple, too. If a sound is uncomfortably loud, for instance, it is the cortex that registers that fact and sends a message all the way back to the cochlea to stiffen hair cells as a protective measure. The same is true of the retina, adjusting for the amount of light available. “It’s not your eye doing that,” says Poeppel. “It’s your brain.” Then he beats rhythmically on the desk with a pencil: tap, tap, tap, tap. “By beat three, you’ve anticipated the time. By beat four, we can show you neurophysiologically exactly how that prediction is encoded.”

  “Helmholtz couldn’t do that,” I point out.

  “We have pretty good theories about each separate level,” Poeppel agrees, citing the details of each brain response as an excellent example of fine-grained knowledge at work, but “how is it that you go from a very elementary stimulation at the periphery to understanding in your head ‘cat’? We don’t know. We’re looking for the linking hypothesis.”

  I realize it’s the same question that Blair Simmons asked at the cochlear implant workshop in 1967: How do we make sense of what we hear? And I have to acknowledge that, as Eric Kandel said, mysteries remain.

  • • •

  What kind of information does a sound have to carry in order to set this auditory chain in motion? That was another question Simmons asked. With Alex in mind, I wonder what happens if Macbeth is too quiet or garbled or otherwise distorted. I put this basic question to Andrew Oxenham, an auditory scientist at the University of Minnesota, who studies the auditory system in people both with and without hearing loss.

  “What cochlear implants have shown us,” he tells me, is just how well we can understand speech in “highly, highly degraded situations. Think about the normal ear with its ultrafine-frequency tuning and basilar membrane that [has] thousands of hair cells, and you think about replacing that with just six or even four electrodes—so we’re going from hundreds, possibly thousands, of independent frequency channels down to four or six. You’d say: Wow, that’s such a loss of information. How is anyone ever going to perceive anything with that? And yet people can understand speech. I guess that’s shown us first of all how adaptable the brain is in terms of interpreting whatever information it can get hold of. And secondly, what a robust signal speech is that you can degrade it to that extent and people can still extract the meaning.”

  This makes sense, Oxenham adds, from an evolutionary point of view. “You want something that can survive even in very challenging acoustic environments. You want to be able to get your message across.” Engineers often think about redundancy, he points out, and like to build things with a belt-and-suspenders approach, the better to ensure success. “What we have learned is that speech has an incredible amount of redundancy,” says Oxenham. “You can distort it, you can damage it, you can take out parts of it, and yet a lot of the message survives.”

  • • •

  Just how true that is has been shown in a series of experiments designed to test exactly how much distortion and damage speech can sustain and still be intelligible. Back in David Poeppel’s office at NYU, he walks me through a fun house of auditory perception to show me what we’ve learned about the minimal amount of auditory information necessary for comprehension.

  He begins with sine waves, man-made narrow-band acoustic signals that lack the richness and texture of speech—beeps, really. On the whiteboard in his office, Poeppel sketches out a spectrogram, a graph depicting the range of frequencies contained in a sound over time, with squiggly lines representing bands of energy beginning at 100, 900, 1,100, and 2,200 Hz and then shifting.

  “This is me saying ‘ibex,’” he says.

  “If you say so,” I answer. I wouldn’t have known the details, but by now I know those bands of energy are the defining formants—the same features that Graeme Clark and his team used to design their first speech processing program.

  “Do you really need all that spectral information to extract something intelligible?” Poeppel asks.

  The answer is no. Some of the spectral information, the combination of frequencies that makes up each sound, turned out to be icing on the acoustic cake. In the early 1980s, several researchers, notably Robert Remez at Barnard College and Philip Rubin of Haskins Laboratories, created spectrograms of spoken sentences, got rid of the fundamental frequency (the band at 100 Hz in “ibex”), extracted the formants (the remaining bands), and replaced them with sine waves. With sine waves standing in for the formants, stacked up at the appropriate frequencies, the sentence is intelligible. If you practice, that is. Poeppel plays me some samples from the Haskins Laboratories website. At first, I just hear high-pitched whistles like the sounds Jodie Foster interpreted as alien communication in Contact. But after Poeppel plays the natural sentence and then replays the sine-wave sentence, I hear it: “The steady drip is worse than the drenching rain.” Do this for twenty minutes, Poeppel tells me, and you’ll get really good at it.
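
  The underlying manipulation is simple enough to sketch. What follows is an illustrative reconstruction of the sine-wave-speech idea, not the Haskins procedure itself: each formant is replaced by a single sinusoid that follows a time-varying frequency track, and the sinusoids are summed. The formant tracks and file name are invented for the example.

```python
# Illustrative sketch of sine-wave speech (invented formant tracks, not the Haskins
# stimuli): each formant becomes a single frequency-modulated sinusoid, and the
# sinusoids are summed into one waveform.
import numpy as np
from scipy.io import wavfile

fs = 16000
t = np.arange(int(fs * 1.0)) / fs                 # one second of signal

# Hypothetical formant tracks in Hz, interpolated linearly over the utterance.
formant_tracks = [
    np.linspace(300, 700, t.size),                # F1 drifting upward
    np.linspace(2200, 1100, t.size),              # F2 drifting downward
    np.linspace(2800, 2500, t.size),              # F3 nearly steady
]

signal = np.zeros_like(t)
for track in formant_tracks:
    phase = 2 * np.pi * np.cumsum(track) / fs     # integrate frequency to get phase
    signal += np.sin(phase)

signal /= np.max(np.abs(signal))                  # normalize to avoid clipping
wavfile.write("sine_wave_speech_demo.wav", fs, (signal * 32767).astype(np.int16))
```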

  At the House Research Institute in Los Angeles, an auditory neuroscientist named Bob Shannon and his colleagues took a different approach. They manipulated sentences not by minimizing the spectral information but by degrading it. The ear groups similar frequencies into critical bands, or channels, along the basilar membrane; a healthy ear has roughly thirty of them. In an influential experiment, Shannon and his team took a set of sentences and reduced the number of channels used to convey them, as if thirty radio stations were trying to broadcast from just a few spots on the dial. Condensed into one big channel, the sentences are basically unintelligible. With two channels, there’s a bit more information, but still not enough to understand much. With three channels, the sentences coalesce into words. “The difference between two and three is huge,” says Poeppel. “At four, you’re almost at ceiling.” In other words, when asked to repeat what you heard, you would get nearly every sentence even without the remaining twenty-six channels of information about sound you normally use.
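
  Shannon’s manipulation is now usually called noise vocoding, and a simplified version of it can be sketched as follows (my own rough approximation, not his exact signal processing): split the speech into a handful of frequency channels, keep only each channel’s slowly varying amplitude envelope, and use those envelopes to modulate noise confined to the same bands. The file names and parameter choices are illustrative.

```python
# Rough sketch of noise vocoding in the spirit of Shannon's experiment (a simplified
# approximation, not his exact processing): band-pass the speech into a few channels,
# take each channel's amplitude envelope, and use it to modulate band-limited noise.
import numpy as np
from scipy.io import wavfile
from scipy.signal import butter, sosfiltfilt, hilbert

def vocode(speech, fs, n_channels=4, lo=100.0, hi=6000.0):
    edges = np.logspace(np.log10(lo), np.log10(hi), n_channels + 1)  # log-spaced band edges
    out = np.zeros(speech.size)
    for low, high in zip(edges[:-1], edges[1:]):
        sos = butter(4, [low, high], btype="bandpass", fs=fs, output="sos")
        band = sosfiltfilt(sos, speech)                          # speech energy in this channel
        envelope = np.abs(hilbert(band))                         # its slowly varying envelope
        carrier = sosfiltfilt(sos, np.random.randn(speech.size)) # noise limited to the same band
        out += envelope * carrier
    return out / np.max(np.abs(out))

fs, speech = wavfile.read("sentence.wav")                        # hypothetical recording
speech = speech.astype(float)
if speech.ndim > 1:
    speech = speech.mean(axis=1)                                 # fold stereo to mono

vocoded = vocode(speech, fs, n_channels=4)
wavfile.write("vocoded_4_channels.wav", fs, (vocoded * 32767).astype(np.int16))
```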

  What if you leave the spectral composition of sentences alone and change only the timing, what scientists call the temporal information, instead? One of Poeppel’s frequent collaborators, Oded Ghitza, a former Bell Labs scientist and now a biomedical engineer at Boston University, worked with auditory neuroscientist Steven Greenberg to play with the rate of sentences. Consistently across languages, normal conversation runs at about four to six hertz, roughly four to six syllables per second. Ghitza and Greenberg compressed sentences so that what originally took four seconds to speak now took two. Listeners could still understand them. But when they compressed the sentence again, by a factor of three, it got much harder. Poeppel plays me an example and it sounds like a garbled mouthful: badazump.

  But Ghitza and Greenberg weren’t done. “They take the compressed waveform,” says Poeppel, “which sounds for shit, and they just slash it into little slices and they add silence in between.” Now it sounds like “bit … tut … but … tit.” Poeppel explains what I’m hearing: “The acoustic signal is very small and not informative, totally smeared and gross, but you’ve added a little bit of empty time in there. You’ve squished everything and now you insert little silent intervals.” Then Ghitza and Greenberg measured how many mistakes listeners made when trying to repeat the sentences. With the speech totally compressed, the error rate was 50 percent. As they increased the intervals of silence, the error rate went down by 30 percent. But then it got worse again. “That’s weird, right?” says Poeppel. “There’s this moment where you do better, even though the acoustic signal is still totally crappy.” That sweet spot came with silent gaps of eighty milliseconds. What Ghitza and Greenberg concluded was that from a brain’s-eye view, a listener needs two things. The first is what they term an “acoustic glimpse”—the pitch, loudness, and timbre carried on the sound wave that tell the system to listen in. Then the brain needs what they call a “cortical glimpse,” which is simply time—enough time to decode the information that has just arrived.
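
  A rough reconstruction of that manipulation (mine, and much cruder than what Ghitza and Greenberg actually did) looks like this: compress the waveform, slice it into short chunks, and insert silence between the chunks. Their experiment used pitch-preserving time compression; dropping samples, as below, is only a stand-in, and the file name and slice lengths are illustrative.

```python
# Crude reconstruction of the Ghitza-Greenberg manipulation (mine, not theirs):
# compress the waveform, chop it into short slices, and insert silence between them.
# Dropping every third sample is only a stand-in for real pitch-preserving compression.
import numpy as np
from scipy.io import wavfile

fs, speech = wavfile.read("sentence.wav")          # hypothetical recording
speech = speech.astype(float)
if speech.ndim > 1:
    speech = speech.mean(axis=1)                   # fold stereo to mono

compressed = speech[::3]                           # ~3x shorter (and, crudely, higher-pitched)

slice_ms, gap_ms = 40, 80                          # short bursts of speech, ~80 ms silent gaps
slice_len = int(fs * slice_ms / 1000)
gap = np.zeros(int(fs * gap_ms / 1000))

pieces = []
for start in range(0, compressed.size, slice_len):
    pieces.append(compressed[start:start + slice_len])
    pieces.append(gap)                             # "a little bit of empty time"

repackaged = np.concatenate(pieces)
repackaged /= np.max(np.abs(repackaged))
wavfile.write("compressed_with_gaps.wav", fs, (repackaged * 32767).astype(np.int16))
```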