Posts Tagged ‘fiction’

babygate blues: a neuromarketing tale

Monday, July 26th, 2010

Cory Doctorow has a new short story (“Ghosts in my Head”) about the undesirable consequences of neuromarketing run amok up on the Subterranean Press website. I liked the story, but thought the premise was pretty unrealistic (and, yes, I do know it’s called science fiction for a reason–I’m just sayin’). So as a counterpoint, here’s an alternative neuromarketing future that I personally find much more plausible.

Deborah Stojko didn’t care much for Pockter and Gramble’s corporate headquarters. The building smelled of disinfectant and organization; the halogen corridors all blended together into one giant dimly-lit maze. Stojko had been visiting P&G regularly for several years now; it was never a pleasant experience, but it couldn’t be avoided. Communicating with major stakeholders was a large part of her job as director of the International Consortium for Neuromarketing Research. And P&G was by far the largest stakeholder, contributing over 70% of the money that supported the consortium’s work.

For several years now, ICNR had been pumping out first-class scientific research on the neural mechanisms of economic decision-making. The Richelieu effect, Preinforcement Learning, the neurometric satisficing theorem… ICNR was behind any number of recent discoveries; its members were continually in the news. And all of it was made possible only through the generosity of the marketing and R&D wings of P&G.

The generosity, or the naivete? Stojko asked herself as she reached her destination and knocked softly on an office door. Somehow, the executives at Pockter and Gramble had managed to convince themselves that the survival of P&G rested on their ability to mine the deep secrets of the brain. For years now, they’d been throwing sums of money at cognitive neuroscientists that would make European royalty blush. That streak of good fortune, Stojko suspected, was now about to end. Recent events had rendered P&G’s massive investment in ICNR something of a political liability; she had the feeling this was the last time she’d be making the trip to P&G headquarters.

And not a moment too soon, she thought, as the door opened in front of her.

*    *    *

“How long has Pockter and Gramble been funding you, Deborah,” Bob Ramsey, Chief Executive Officer, asked, once Stojko was seated and they’d gotten the standard pleasantries out of the way.

Stojko did the arithmetic in her head. The International Consortium for Neuromarketing Research had formed in 2013, following a massive infusion of P&G cash, so…

“Six years,” she said.

“Right. And do you know how much money Pockter and Gramble has given your consortium in those six years?”

“I’d put it somewhere between 251.8 and 251.9 million dollars.”

“Very clever. A quarter of a billion dollars. We’ve given you a quarter. Of a billion. Dollars.”

“Well, to be fair, that amount is spread out over 8 sites and 30 other investigators,” Stojko pointed out. “It’s not like you wrote me a check for 250 million. My institution only got about forty-five million.”

Ramsey didn’t say anything, but his expression bespoke a thinly-veiled irritation. He picked up a remote control on the desk and pushed a button. Behind Stojko, the wall turned translucent as the embedded display lit up.

“No doubt you’ll recognize this clip,” Ramsey said.

Stojko swiveled around to watch the giant screen. The camera faded in on a bright and comfortable-looking living room somewhere in America. Almost immediately, six or seven babies in diapers filed into the room and began dancing synchronously in a circle. After a few seconds of dancing, the babies started babbling an Eastern-sounding melody in a totally incomprehensible–and, Stojko suspected, nonexistent–language. And a few seconds after that, they started banging spoons on the tabletop in perfect unison, all the while still dancing and singing in tongues. The whole thing lasted exactly thirty seconds, and occupied a very narrow emotional niche between really adorable and utterly creepy.

Stojko did recognize the clip, of course; it was an ad for Dampers, a P&G-owned diaper brand. The consortium had selected the ad from over two dozen candidates that P&G had asked them to test. For reasons that remained unclear to Stojko–and to pretty much everyone else–singing, dancing, spoon-banging babies lit the brain up like a Christmas tree.

Stojko had had her reservations about declaring a ‘winner’; she’d written several long emails to the P&G marketing brain trust explaining that, brain activation notwithstanding, there really wasn’t any evidence yet that this particular ad was going to help sell more diapers, and many more studies were needed before the consortium could confidently interpret its own results. But marketing wasn’t into the whole waiting thing, and the ad was on the air within three months of the consortium’s initial report.

As it turned out, it didn’t do so well.

“That ad bombed,” Ramsey said, wagging his finger in the general direction of the screen. “According to you people, it was supposed to push all of the brain’s buttons at once. You spent three million dollars of our money just on that one testing program. Two dozen ads to choose from, and the one you pick completely tanked. It was an epic failure. At this very moment, people in living rooms all over America are laughing at Pockter and Gramble because of that ad.”

“I’m sure it’s not that bad,” said Stojko, smirking almost imperceptibly. She was well aware of the PR disaster P&G had on its hands, of course. But she couldn’t deny the warm feeling of schadenfreude that accompanied the knowledge that P&G was now paying many times over for disregarding just about every recommendation the consortium had made in its 480-page report. She was pretty sure the suits had never made it past the fifth or sixth page.

“It is that bad,” Ramsey shot back. “We blew half of our network budget for the year on this ad. Our initial focus groups were already pretty positive, and then we received your report saying things like–and I quote–“of all the ads tested, number seventeen elicited the largest response in brain areas associated with reward.” So we figured it was a sure thing, and started airing the ad in all the major markets. And then, out of nowhere, we get this massive backlash. Thousands of angry emails from people complaining that the ad was trite and we were shamefully “exploiting babies”. People saying they would never buy Dampers diapers again; that the CEO–that’s me, mind you–should resign; that someone should “just torch Pockter and Gramble headquarters”. And those were just the serious complaints. There were also the people who apparently thought the whole thing was just a big joke that gave them an opening to do their own thing. We had forty YouTube videos a day uploaded by people spoofing the ad. There was one clip of six guys in giraffe suits singing and doing our baby dance. Sixteen million hits.”

“All publicity is good publicity, right?”

“No. Not even close.”

Stojko chuckled just loudly enough for Ramsey to hear.

“Is this funny to you?” Ramsey asked. “We give you a quarter of a billion dollars for commercials designed to push the brain’s reward buttons, and we get grown men in giraffe suits?”

“Well, let me put it this way, Bob. If your goal was really to make commercials that light up the brain’s reward circuitry, you wouldn’t have needed to do any serious research in the first place; you could have just run 30-second clips of semi-nude women making out with each other, or couples giggling and cuddling in bed. That’d cover most of the bases. You’d have all the reward-related activation you could want. But how many deodorant sticks do you think commercials like that would sell?”

Ramsey stared at Stojko blankly.

“Porn, flashing lights, pictures of hundred-dollar bills, a basket of shiny fresh fruit… lots of things activate the brain’s reward centers,” Stojko continued. “What makes you think a commercial that tangentially elicits reward-related activation is going to make people buy any more of a product?”

“Well, can’t you tell that?”

“Can we?” asked Stojko rhetorically. “I don’t know. Can you tell that? You guys probably have labs full of people trying to figure out whether the fact that people tell you they like a commercial means they’re going to buy more of the product featured in that commercial. And what’s the answer?”

“I don’t know that myself,” Ramsey replied abruptly. “It’s not my job to know that. I can have marketing come up here and tell you the answer if you like.”

Stojko shook her head.

“Doesn’t matter. I mean, it can only go one of two ways. If marketing doesn’t know what makes a commercial good or bad, you can’t really expect us to tell you what it is about the brain that makes people buy things. We don’t track how well your products sell after different ads go into circulation; how the hell would we know which commercials have the largest impact on sales? I can tell you which commercials activate the nucleus accumbens more than others, but so what? How am I supposed to know if nucleus accumbens activation is a good predictor of actual purchases without actually knowing anything about real-world purchases?”

Ramsey had nothing to say to that; he stared down at his shoes.

“So clearly, that’s not going to help us,” Stojko continued. “But suppose instead we pretend that the people in your marketing department are smart cookies, and they do know what it is about commercials that makes people buy your products. Well, in that case, what the hell would you need us for? If you’ve figured out that people are more likely to buy your anti-dandruff shampoo after watching ads they rate ‘extremely interesting’, what is peering into the brain going to tell you?”

“Well, I guess you could use brain imaging to figure out what it is that people find extremely interesting, right?”

“Sure, Bob, we could do that. And you know how we’d do that? By asking people which commercials they found interesting, and then correlating their verbal responses with what their brains were doing while they watched those commercials. And you know what that means? It means we can never do any better than your people can do with your focus groups and spreadsheets. Because basically, we’re stuck trying to predict the same variables that you guys are using to predict people’s buying behavior. We’re just one step further removed.”

Ramsey listened quietly, but anger visibly colored his face as Stojko spoke.

“This is the kind of thing that might have been good to bring up, oh, say, five years ago,” he said.

“Oh, believe me, we did bring it up,” Stojko smiled bitterly. “Or at least, we tried to.”

She tapped a few keys on her holoboard.

“Here’s an email dated June 18th, 2014: “Dear Mr. Chauahan–I believe that’s your VP of marketing, right?–senior members of the consortium continue to express their frustration at Pockter and Gramble’s failure to provide us with the sales data we requested. As we indicated in our letter dated April 21st, it is not possible for us to properly evaluate the efficacy of our program without the use of real-world performance metrics. We understand your concerns about sharing private data with outside contractors; however…”

Stojko shot Ramsey a pointed look.

“I’ll spare you the rest; it goes on like that for three pages. See, we’ve been asking for the data we need for six years now–pretty much since we started. And every time we ask, you throw more money at us and tell us to go back to work, that you’re not going to share your numbers with us because they’re confidential and we shouldn’t need that information anyway.”

She tapped a few more keys.

“Here’s another similar one. September 30th: Dear Mr. Chauahan, the consortium is at a loss to understand…”

“Enough!” yelled Ramsey, slamming his fist down on the desk. “I get the point! We’ve spent a quarter of a billion buying you new toys to play with, and all the while you’ve been playing us for idiots. Well, you know what–enjoy your toys while they last, because we’re going to have Legal look at our options for recovering that money first thing Monday morning. Those fancy new scanners of yours are going away.”

He wheeled his chair away from Stojko and sat there fuming. Stojko took it as a sign the meeting was over; she shrugged and got up to leave.

The falling out was unfortunate, she thought as she walked down the long sterile corridor towards the elevator. But it had been a long time coming, and after the whole Babygate episode (as the scientists at ICNR had started calling it), no one at ICNR would be surprised to hear that P&G was pulling the plug.

Nor would most of them mind terribly much. Stojko had always planned for a six- or seven-year run, and had stopped hiring people on short-term contracts a couple of years ago. There would be no massive lay-offs, no collective plunge into obscurity for the many researchers invested in the project. The data was already collected, and she and her colleagues would be kept busy analyzing and publishing the results for years to come.

As for Ramsey’s legal threats, Stojko wasn’t the least bit worried. Universities had lawyers too, and there wasn’t a judge in the country who’d award P&G a single nickel for breach of contract; not after reading the long series of emails from the consortium that already explained in excruciating detail exactly why P&G was never going to recoup its financial investment unless it fundamentally changed the way it did things. Which, of course, hadn’t happened–and probably never would.

Stojko left Pockter and Gramble headquarters with a clear conscience. At the end of the day, she thought as she walked to her car, all you could do was represent yourself honestly to the other party and let the chips fall where they may. And that was what she’d done. She’d told P&G all along exactly how the consortium was going to spend the money they received; the service agreements she signed were very clearly delineated in legalese that several lawyers on the institutional payroll had contributed to and pored over. Stojko and her colleagues had worked hard to ensure that no one at P&G was laboring under false pretenses about the likely outcome of ICNR’s work. As she’d once put it to a mid-level P&G executive over dinner, neuromarketing research was great for science, and (in her estimation) utterly useless for advertising. But if the suits were willing to pay for it, she was willing to do the research. That, after all, was her job; it was what she’d be doing with her time anyway, ICNR or no ICNR.

No, she thought, turning the key in the ignition. She’d been right to take the industry money; ICNR had conducted itself impeccably over the past six years. If someone insisted on filling your cup up with change even after you very carefully explained to them that you were only going to buy beer with it, who could blame you for paying a visit to the bar once panhandling hours were over?

the perils of digging too deep

Wednesday, June 2nd, 2010

Another in a series of posts supposedly at the intersection of fiction and research methods, but mostly just an excuse to write ridiculous stories and pretend they have some sort of moral.


Dr. Rickles the postdoc looked a bit startled when I walked into his office. He was eating a cheese sandwich and watching a chimp on a motorbike on his laptop screen.

“YouTube again?” I asked.

“Yes,” he said. “It’s lunch.”

“It’s 2:30 pm,” I said, pointing to my watch.

“Still my lunch hours.”

Lunch hours for Rickles were anywhere from 11 am to 4 pm. It depended on exactly when you walked in on him doing something he wasn’t supposed to; that was the event that marked the onset of Lunch.

“Fair enough,” I said. “I just stopped by to see how things were going.”

“Oh, quite well,” said Rickles. “Things are going well. I just found a video of a chimp and a squirrel riding a motorbike together. They aren’t even wearing helmets! I’ll send you the link.”

“Please don’t. I don’t like squirrels. But I meant with work. How’s the data looking.”

He shot me a pained look, like I’d just caught him stealing video game money from his grandmother.

“The data are TERRIBLE,” he said in all capital letters.

I wasn’t terribly surprised at the revelation; I’d handed Rickles the dataset only three days prior, taking care not to tell him it was the dataset from hell. Rickles was the fourth or fifth person in the line of succession; the data had been handed down from postdoc to graduate student to postdoc for several years now. Everyone in the lab wanted to take a crack at it when they first heard about it, and no one in the lab wanted anything to do with it once they’d taken a peek. I’d given it to Rickles in part to teach him a lesson; he’d been in the lab for several weeks now and somehow still seemed happy and self-assured.

“Haven’t found anything interesting yet?” I asked. “I thought maybe if you ran the Flimflan test on the A-trax, you might get an effect. Or maybe if you jimmied the cryptos on the Borgatron…”

“No, no,” Rickles interrupted, waving me off. “The problem isn’t that there’s nothing interesting in the data; it’s that there’s too MUCH stuff. There are too MANY results. The story is too COMPLEX.”

That didn’t compute for me, so I just stared at him blankly. No one ever found COMPLEX effects in my lab. We usually stopped once we found SIMPLE effects.

Rickles was unimpressed.

“You follow what I’m saying, Guy? There are TOO-MANY-EFFECTS. There’s too much going on in the data.”

“I don’t see how that’s possible,” I said. “Keith, Maria, and Lakshmi each spent weeks on this data and found nothing.”

“That,” said Rickles, “is because Keith, Maria, and Lakshmi never thought to apply the Epistocene Zulu transform to the data.”

The Epistocene Zulu transform! It made perfect sense when you thought about it; so why hadn’t I ever thought about it? Who was Rickles cribbing analysis notes from?

“Pull up the data,” I said excitedly. “I want to see what you’re talking about.”

“Alright, alright. Lunch hours are over now anyway.”

He grudgingly clicked on the little X on his browser. Then he pulled up a spreadsheet that must have had a million columns in it. I don’t know where they’d all come from; it had only had sixteen thousand or so when I’d had the hard drives delivered to his office.

“Here,” said Rickles, showing me the output of the Pear-sampled Tea test. “There’s the A-trax, and there’s its Nuffton index, and there’s the Zimming Range. Look at that effect. It’s bigger than the zifflon correlation Yehudah’s group reported in Nature last year.”

“Impressive,” I said, trying to look calm and collected. But in my head, I was already trying to figure out how I’d ask the department chair for a raise once this finding was published. Each point on that Zimming Range is worth at least $500, I thought.

“Are there any secondary analyses we could publish alongside that,” I asked.

“Oh, I don’t think you want to publish that,” Rickles laughed.

“Why the hell not? It could be big! You just said yourself it was a giant effect!”

“Oh sure. It’s a big effect. But I don’t believe it for one second.”

“Why not? What’s not to like? This finding makes Yehudah’s paper look like a corn dog!”

I recognized, in the course of uttering those words, that they did not constitute the finest simile ever produced.

“Well, there are two massive outliers, for one. If you eliminate them, the effect is much smaller. And if you take into consideration the Gupta skew because the data were collected with the old reverberator, there’s nothing left at all.”

“Okay, fine,” I muttered. “Is there anything else in the data?”

“Sure, tons of things. Like, for example, there’s a statistically significant gamma reduction.”

“A gamma reduction? Are you sure? Or do you mean beta,” I asked.

“Definitely gamma,” said Rickles. “There’s nothing in the betas, deltas, or thetas. I checked.”

“Okay. That sounds potentially interesting and publishable. But I bet you’re going to tell me why we shouldn’t believe that result, either, right?”

“Well,” said Rickles, looking a bit self-conscious, “it’s just that it’s a pretty fine-grained analysis; you’re not really leaving a lot of observations when you slice it up that thin. And the weird thing about the gamma reduction is that it is essentially tantamount to accepting a null effect; this was Jayaraman’s point in that article in Statistica Splenda last month.”

“Sure, the Gerryman article, right. I read that. Forget the gamma reduction. What else?”

“There are quite a few schweizels,” Rickles offered, twisting the cap off a beer that had appeared out of the minibar under his desk.

I looked at him suspiciously. I suspected it was a trap; Rickles knew how much I loved Schweizel units. But I still couldn’t resist. I had to know.

“How many schweizels are there,” I asked, my hand clutching at the back of a nearby chair to help keep me steady.

“Fourteen,” Rickles said matter-of-factly.

“Fourteen!” I gasped. “That’s a lot of schweizels!”

“It’s not bad,” said Rickles. “But the problem is, if you look at the B-trax, they also have a lot of schweizels. Seventeen of them, actually.”

“Seventeen schweizels!” I exclaimed. “That’s impossible! How can there be so many Schweizel units in one dataset!”

“I’m not sure. But… I can tell you that if you normalize the variables based on the Smith-Gill ratio, the effect goes away completely.”

There it was; the sound of the other shoe dropping. My heart gave a little cough–not unlike the sound your car engine makes in the morning when it’s cold and it wants you to stop provoking it and go back to bed. It was aggravating, but I understood what Rickles was saying. You couldn’t really say much about the Zimming Range unless your schweizel count was properly weighted. Still, I didn’t want to just give up on the schweizels entirely. I’d spent too much of my career delicately massaging schweizels to give up without one last tug.

“Maybe we can just say that the A-trax/Nuffton relationship is non-linear?” I suggested.

“Non-linear?” Rickles snorted. “Only if by non-linear you mean non-real! If it doesn’t survive Smith-Gill, it’s not worth reporting!”

I grudgingly conceded the point.

“What about the zifflons? Have you looked at them at all? It wouldn’t be so novel given Yehudah’s work, but we might still be able to get it into some place like Acta Ziffletica if there was an effect…”

“Tried it. There isn’t really any A-trax influence on zifflons. Or a B-trax effect, for that matter. There is a modest effect if you generate the Mish component for all the trax combined and look only at that. But that’s a lot of trax, and we’re not correcting for multiple Mishing, so I don’t really trust it…”

I saw that point too, and was now nearing despondency. Rickles had shot down all my best ideas one after the other. I wondered how I’d convince the department chair to let me keep my job.

Then it came to me in a near-blinding flash of insight. Near blinding, because I smashed my forehead on the overhead chandelier jumping out of my chair. An inch lower, and I’d have lost both eyes.

“We need to get that chandelier replaced,” I said, clutching my head in my hands. “It has no business hanging around in an office like this.”

“We need to get it replaced,” Rickles agreed. “I’ll do it tomorrow during my lunch hours.”

I knew that meant the chandelier would be there forever–or at least as long as Rickles inhabited the office.

“Have you tried counting the Dunams,” I suggested, rubbing my forehead delicately and getting back to my brilliant idea.

“No,” he said, leaning forward in his chair slightly. “I didn’t count Dunams.”

Ah-hah! I thought to myself. Not so smart are we now! The old boy’s still got some tricks up his sleeve.

“I think you should count the Dunams,” I offered sagely. “That always works for me. I do believe it might shed some light on this problem.”

“Well…” said Rickles, shaking his head slightly, “maaaaaybe. But Li published a paper in Psykometrika last year showing that Dunam counting is just a special case of Klein’s occidental protrusion method. And Klein’s method is more robust to violations of normality. So I used that. But I don’t really know how to interpret the results, because the residual is negative.”

I really had no idea either. I’d never come across a negative Dunam residual, and I’d never even heard of occidental protrusion. As far as I was concerned, it sounded like a made-up method.

“Okay,” I said, sinking back into my chair, ready to give up. “You’re right. This data… I don’t know. I don’t know what it means.”

I should have expected it, really; it was, after all, the dataset from hell. I was pretty sure my old RA had taken a quick jaunt through purgatory every morning before settling into the bench to run some experiments.

“I told you so,” said Rickles, putting his feet up on the desk and handing me a beer I didn’t ask for. “But don’t worry about it too much. I’m sure we’ll figure it out eventually. We probably just haven’t picked the right transformation yet. There’s Nordstrom, El-Kabir, inverse Zulu…”

He turned to his laptop and double-clicked an icon on the desktop that said “YouTube”.

“…or maybe you can just give the data to your new graduate student when she starts in a couple of weeks,” he said as an afterthought.

In the background, a video of a chimp and a puppy driving a Jeep started playing on a discolored laptop screen.

I mulled it over. Should I give the data to Josephine? Well, why not? She couldn’t really do any worse with it, and it would be a good way to break her will quickly.

“That’s not a bad idea, Rickles,” I said. “In fact, I think it might be the best idea you’ve had all week. Boy, that chimp is a really aggressive driver. Don’t drive angry, chimp! You’ll have an accid–ouch, that can’t be good.”

“Have you tried counting the Dunams,” I suggested, rubbing my forehead delicately and getting back to my brilliant idea.
“No,” he said, leaning forward in his chair slightly. “I didn’t count Dunams.”
Ah-hah! I thought to myself. Not so smart are we now! The old boy’s still got some tricks up his sleeve.
“I think you should count the Dunams,” I offered sagely. “That always works for me. I do believe it might shed some light on this problem.”
“Well…” said Rickles, shaking his head slightly, “maaaaaybe. But Li published a paper in Psychometrika last year showing that Dunam counting is just a special case of Klein’s occidental protrusion method. And Klein’s method is more robust to violations of normality. So I used that. But I don’t really know how to interpret the results, because the residual is *negative*.”
I really had no idea either. I’d never come across a negative Dunam residual, and I’d never even heard of occidental protrusion. As far as I was concerned, it sounded like a made-up method.
“Okay,” I said, sinking back into my chair, ready to give up. “You’re right. This data… I don’t know. I don’t know what it means.” I should have expected it, really; it was, after all, the dataset from hell. I was pretty sure my old RA had collected it after taking a quick jaunt through purgatory every morning.
“I told you so,” said Rickles, putting his feet up on the desk and handing me a beer I didn’t ask for. “But don’t worry about it too much. I’m sure we’ll figure it out eventually. We probably just haven’t picked the right transformation yet.”
He turned to his laptop and double-clicked an icon on the desktop that said “YouTube”.
“Maybe you can give the data to your new graduate student when she starts in a couple of weeks,” he said as an afterthought.
In the background, a video of a chimp and a puppy driving a Jeep started playing on a discolored laptop screen.
I mulled it over. Should I give the data to Josephine? Well, why not? She couldn’t really do any *worse* with it, and it *would* be a good way to break her will in a hurry.
“That’s not a bad idea, Rickles,” I said. “In fact, I think it might be the best idea you’ve had all week. Boy, that chimp is a really aggressive driver. Don’t drive angry, chimp! You’ll have an accid–ouch, that can’t be good.”

the capricious nature of p < .05, or why data peeking is evil

Thursday, May 6th, 2010

There’s a time-honored tradition in the social sciences–or at least psychology–that goes something like this. You decide on some provisional number of subjects you’d like to run in your study; usually it’s a nice round number like twenty or sixty, or some number that just happens to coincide with the sample size of the last successful study you ran. Or maybe it just happens to be your favorite number (which of course is forty-four). You get your graduate student to start running the study, and promptly forget about it for a couple of weeks while you go about writing up journal reviews that are three weeks overdue and chapters that are six months overdue.

A few weeks later, you decide you’d like to know how that Amazing New Experiment you’re running is going. You summon your RA and ask him, in magisterial tones, “how’s that Amazing New Experiment we’re running going?” To which he falteringly replies that he’s been very busy with all the other data entry and analysis chores you assigned him, so he’s only managed to collect data from eighteen subjects so far. But he promises to have the other eighty-two subjects done any day now.

“Not to worry,” you say. “We’ll just take a peek at the data now and see what it looks like; with any luck, you won’t even need to run any more subjects! By the way, here are my car keys; see if you can’t have it washed by 5 pm. Your job depends on it. Ha ha.”

Once your RA’s gone to soil himself somewhere, you gleefully plunge into the task of peeking at your data. You pivot your tables, plyr your data frame, and bravely sort your columns. Then you extract two of the more juicy variables for analysis, and after some careful surgery and a t-test or six, you arrive at the conclusion that your hypothesis is… “marginally” supported. Which is to say, the magical p value is somewhere north of .05 and somewhere south of .10, and now it’s just parked by the curb waiting for you to give it better directions.

You briefly contemplate reporting your result as a one-tailed test–since it’s in the direction you predicted, right?–but ultimately decide against that. You recall the way your old Research Methods professor used to rail at length against the evils of one-tailed tests, and even if you don’t remember exactly why they’re so evil, you’re not willing to take any chances. So you decide it can’t be helped; you need to collect some more data.

You summon your RA again. “Is my car washed yet?” you ask.

“No,” says your RA in a squeaky voice. “You just asked me to do that fifteen minutes ago.”

“Right, right,” you say. “I knew that.”

You then explain to your RA that he should suspend all other assigned duties for the next few days and prioritize running subjects in the Amazing New Experiment. “Abandon all other tasks!” you decree. “If it doesn’t involve collecting new data, it’s unimportant! Your job is to eat, sleep, and breathe new subjects! But not literally!”

Being quite clever, your RA sees an opening. “I guess you’ll want your car keys back, then,” he suggests.

“Nice try, Poindexter,” you say. “Abandon all other tasks… starting tomorrow.”

You also give your RA very careful instructions to email you the new data after every single subject, so that you can toss it into your spreadsheet and inspect the p value at every step. After all, there’s no sense in wasting perfectly good data; once your p value is below .05, you can just funnel the rest of the participants over to the Equally Amazing And Even Newer Experiment you’ve been planning to run as a follow-up. It’s a win-win proposition for everyone involved. Except maybe your RA, who’s still expected to return triumphant with a squeaky clean vehicle by 5 pm.

Twenty-six months and four rounds of review later, you publish the results of the Amazing New Experiment as Study 2 in a six-study paper in the Journal of Ambiguous Results. The reviewers raked you over the coals for everything from the suggested running head of the paper to the ratio between the abscissa and the ordinate in Figure 3. But what they couldn’t argue with was the p value in Study 2, which clocked in at just under p = .05, with only 21 subjects’ worth of data (compare that to the 80 you had to run in Study 4 to get a statistically significant result!). Suck on that, Reviewers!, you think to yourself pleasantly while driving yourself home from work in your shiny, shiny Honda Civic.

So ends our short parable, which has at least two subtle points to teach us. One is that it takes a really long time to publish anything; who has time to wait twenty-six months and go through four rounds of review?

The other, more important point, is that the desire to peek at one’s data, which often seems innocuous enough–and possibly even advisable (quality control is important, right?)–can actually be quite harmful. At least if you believe that the goal of doing research is to arrive at the truth, and not necessarily to publish statistically significant results.

The basic problem is that peeking at your data is rarely a passive process; most often, it’s done in the context of a decision-making process, where the goal is to determine whether or not you need to keep collecting data. There are two possible peeking outcomes that might lead you to decide to halt data collection: a very low p value (i.e., p < .05), in which case your hypothesis is supported and you may as well stop gathering evidence; or a very high p value, in which case you might decide that it’s unlikely you’re ever going to successfully reject the null, so you may as well throw in the towel. Either way, you’re making the decision to terminate the study based on the results you find in a provisional sample.

A complementary situation, which also happens not infrequently, occurs when you collect data from exactly as many participants as you decided ahead of time, only to find that your results aren’t quite what you’d like them to be (e.g., a marginally significant hypothesis test). In that case, it may be quite tempting to keep collecting data even though you’ve already hit your predetermined target. I can count on more than one hand the number of times I’ve overheard people say (often without any hint of guilt) something to the effect of “my p value’s at .06 right now, so I just need to collect data from a few more subjects.”

Here’s the problem with either (a) collecting more data in an effort to turn p < .06 into p < .05, or (b) ceasing data collection because you’ve already hit p < .05: any time you add another subject to your sample, there’s a fairly large probability the p value will go down purely by chance, even if there’s no effect. So there you are sitting at p < .06 with twenty-four subjects, and you decide to run a twenty-fifth subject. Well, let’s suppose that there actually isn’t a meaningful effect in the population, and that p < .06 value you’ve got is a (near) false positive. Adding that twenty-fifth subject can only do one of two things: it can raise your p value, or it can lower it. The exact probabilities of these two outcomes depend on the current effect size in your sample before adding the new subject; but generally speaking, they’ll rarely be very far from 50-50. So now you can see the problem: if you stop collecting data as soon as you get a significant result, you may well be capitalizing on chance. It could be that if you’d collected data from a twenty-sixth and twenty-seventh subject, the p value would reverse its trajectory and start rising. It could even be that if you’d collected data from two hundred subjects, the effect size would stabilize near zero. But you’d never know that if you stopped the study as soon as you got the results you were looking for.
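
To put a rough number on that near-coin-flip, here is a minimal sketch. It leans on a simplification that is not part of the scenario above: a one-sample z-test on scores with a known standard deviation of 1 and a true mean of zero, rather than whatever t-test you actually ran. The function name `prob_p_decreases` is made up for the illustration.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal, mean 0 and sd 1


def prob_p_decreases(p_now, n):
    """Chance that subject n + 1 lowers a two-sided p value when the null is true.

    Assumes a one-sample z-test with known sd = 1 (an illustrative
    simplification, not the analysis described in the post).
    """
    z_now = STD_NORMAL.inv_cdf(1 - p_now / 2)  # |z| implied by the current p value
    total = z_now * n ** 0.5                    # current sum of scores (taken as positive)
    needed = z_now * (n + 1) ** 0.5             # |new sum| must exceed this for p to drop
    # The new score x lowers p if total + x > needed or total + x < -needed.
    return (1 - STD_NORMAL.cdf(needed - total)) + STD_NORMAL.cdf(-needed - total)


if __name__ == "__main__":
    print(round(prob_p_decreases(0.06, 24), 2))
```

Under those assumptions, sitting at p = .06 with twenty-four subjects gives you about a 0.42 chance that the twenty-fifth subject nudges the p value down even though nothing is going on. Not quite a coin flip, but close enough to be dangerous if you stop the moment it lands your way.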

Lest you think I’m exaggerating, and think that this problem falls into the famous class of things-statisticians-and-methodologists-get-all-anal-about-but-that-don’t-really-matter-in-the-real-world, here’s a sobering figure (taken from this chapter):

[Figure: data_peeking]

The figure shows the results of a simulation quantifying the increase in false positives associated with data peeking. The assumptions here are that (a) data peeking begins after about 10 subjects (starting earlier would further increase false positives, and starting later would decrease false positives somewhat), (b) the researcher stops as soon as a peek at the data reveals a result significant at p < .05, and (c) data peeking occurs at incremental steps of either 1 or 5 subjects. Given these assumptions, you can see that there’s a fairly monstrous rise in the actual Type I error rate (relative to the nominal rate of 5%). For instance, if the researcher initially plans to collect 60 subjects, but peeks at the data after every 5 subjects, there’s approximately a 17% chance that the threshold of p < .05 will be reached before the full sample of 60 subjects is collected. When data peeking occurs even more frequently (as might happen if a researcher is actively trying to turn p < .07 into p < .05, and is monitoring the results after each incremental participant), Type I error inflation is even worse. So unless you think there’s no practical difference between a 5% false positive rate and a 15 – 20% false positive rate, you should be concerned about data peeking; it’s not the kind of thing you just brush off as needless pedantry.
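
If you’d like to see the inflation for yourself, the following is a rough sketch of that kind of simulation. It is not the code behind the figure: to stay within the standard library it swaps in a one-sample z-test with a known standard deviation of 1 (my assumption) for a t-test, so the exact numbers will differ somewhat from the chapter’s. The names `two_sided_p` and `false_positive_rate` are hypothetical.

```python
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()


def two_sided_p(total, n):
    """Two-sided p value for a one-sample z-test (true sd assumed to be 1)."""
    z = abs(total) / n ** 0.5
    return 2 * (1 - STD_NORMAL.cdf(z))


def false_positive_rate(max_n, first_peek, step, n_sims=5000, alpha=0.05, seed=1):
    """Fraction of null-effect studies that show p < alpha at some peek."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sims):
        total = 0.0
        for n in range(1, max_n + 1):
            total += rng.gauss(0, 1)  # one new subject; the true effect is zero
            # Peek at first_peek, then every `step` subjects, and always at max_n.
            at_peek = n == max_n or (n >= first_peek and (n - first_peek) % step == 0)
            if at_peek and two_sided_p(total, n) < alpha:
                hits += 1  # the researcher stops here and declares success
                break
    return hits / n_sims


if __name__ == "__main__":
    print("no peeking:  ", false_positive_rate(60, 60, 1))
    print("peek every 5:", false_positive_rate(60, 10, 5))
    print("peek every 1:", false_positive_rate(60, 10, 1))
```

The first call only tests once, at the planned sample size, so it should land near the nominal 5%. The other two let the simulated researcher bail out at the first significant peek, which is where the extra false positives come from.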

How do we stop ourselves from capitalizing on chance by looking at the data? Broadly speaking, there are two reasonable solutions. One is to just pick a number up front and stick with it. If you commit yourself to collecting data from exactly as many subjects as you said you would (you can proclaim the exact number loudly to anyone who’ll listen, if you find it helps), you’re then free to peek at the data all you want. After all, it’s not the act of observing the data that creates the problem; it’s the decision to terminate data collection based on your observation that matters.

The other alternative is to explicitly correct for data peeking. This is a common approach in large clinical trials, where data peeking is often ethically mandated, because you don’t want to either (a) harm people in the treatment group if the treatment turns out to have clear and dangerous side effects, or (b) prevent the control group from capitalizing on the treatment too if it seems very efficacious. In either event, you’d want to terminate the trial early. What researchers often do, then, is pick predetermined intervals at which to peek at the data, and then apply a correction to the p values that takes into account the number of, and interval between, peeking occasions. Provided you do things systematically in that way, peeking then becomes perfectly legitimate. Of course, the downside is that having to account for those extra inspections of the data makes your statistical tests more conservative. So if there aren’t any ethical issues that necessitate peeking, and you’re not worried about quality control issues that might be revealed by eyeballing the data, your best bet is usually to just pick a reasonable sample size (ideally, one based on power calculations) and stick with it.
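
As a sketch of what “applying a correction” can look like, the code below uses simulation to find a single, stricter cutoff to apply at every planned peek so that the overall false positive rate stays near 5%. This is similar in spirit to the constant-boundary rules used in group-sequential designs, but it is only an illustration under my own assumptions (a one-sample z-test with a known standard deviation of 1, and hypothetical function names); real trials use established boundaries worked out by statisticians, not a quick script.

```python
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()


def max_abs_z_per_study(looks, n_sims=5000, seed=1):
    """For each simulated null study, the largest |z| seen at any planned look."""
    rng = random.Random(seed)
    look_set = set(looks)
    last_look = max(looks)
    maxima = []
    for _ in range(n_sims):
        total, worst = 0.0, 0.0
        for n in range(1, last_look + 1):
            total += rng.gauss(0, 1)  # one new subject; the true effect is zero
            if n in look_set:
                worst = max(worst, abs(total) / n ** 0.5)
        maxima.append(worst)
    return maxima


def corrected_threshold(looks, alpha=0.05, n_sims=5000, seed=1):
    """Critical |z| (and matching per-look p cutoff) that keeps overall error near alpha."""
    maxima = sorted(max_abs_z_per_study(looks, n_sims=n_sims, seed=seed))
    # A study is a false positive if its largest |z| exceeds the cutoff, so the
    # cutoff we want is the (1 - alpha) quantile of those maxima.
    z_crit = maxima[int((1 - alpha) * len(maxima))]
    per_look_p = 2 * (1 - STD_NORMAL.cdf(z_crit))
    return z_crit, per_look_p


if __name__ == "__main__":
    z_crit, per_look_p = corrected_threshold(list(range(10, 61, 5)))
    print("critical |z| at each look:", round(z_crit, 2))
    print("per-look p cutoff:", round(per_look_p, 3))
```

The per-look cutoff comes out well below .05, which is the conservatism mentioned above in concrete form: every extra planned peek is paid for with a stricter threshold at each one.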

Oh, and also, don’t make your RAs wash your car for you; that’s not their job.

the fifty percent sleeper

Thursday, February 4th, 2010

That’s the title of a short fiction piece I have up at lablit.com today; it’s about brain scanning and beef jerky, among other things. It starts like this:

Day 1, 6 a.m.

Ok, I’m locked into this place now. I’ve got ten pounds of beef jerky, fifty dollars for the vending machine, and a flash drive full of experiments to run. If I can get eighteen usable subjects’ worth of data in five days, Yezerski mows my lawn, does my dishes for a week, and walks my dog three times a week for two months. If I don’t get eighteen subjects done, I mow his lawn, do his dishes, and drive his disabled grandmother to physiotherapy once a week for six months. Also: if I don’t get any subjects scanned, I have to tattoo Yezerski’s grandmother’s name on my back in 50-point font. We both know it’s not going to come to that, but Yezerski insisted we make it a part of the bet anyway.

And then goes on in a similar vein. You might enjoy it if you like MRI machines and cerebellums. If you don’t care for brains, you’ll probably just find it silly.

the parable of zoltan and his twelve sheep, or why a little skepticism goes a long way

Tuesday, December 22nd, 2009

What follows is a fictional piece about sheep and statistics. I wrote it about two years ago, intending it to serve as a preface to an article on the dangers of inadvertent data fudging. But then I decided that no journal editor in his or her right mind would accept an article that started out talking about thinking sheep. And anyway, the rest of the article wasn’t very good. So instead, I post this parable here for your ovine amusement. There’s a moral to the story, but I’m too lazy to write about it at the moment.

A shepherd named Zoltan lived in a small village in the foothills of the Carpathian Mountains. He tended to a flock of twelve sheep: Soffia, Krystyna, Anastasia, Orsolya, Marianna, Zigana, Julinka, Rozalia, Zsa Zsa, Franciska, Erzsebet, and Agi. Zoltan was a keen observer of animal nature, and would often point out the idiosyncrasies of his sheep’s behavior to other shepherds whenever they got together.

“Anastasia and Orsolya are BFFs. Whatever one does, the other one does too. If Anastasia starts licking her face, Orsolya will too; if Orsolya starts bleating, Anastasia will start harmonizing along with her.”

“Julinka has a limp in her left leg that makes her ornery. She doesn’t want your pity, only your delicious clovers.”

“Agi is stubborn but logical. You know that old saying, spare the rod and spoil the sheep? Well, it doesn’t work for Agi. You need calculus and rhetoric with Agi.”

Zoltan’s colleagues were so impressed by these insights that they began to encourage him to record his observations for posterity.

“Just think, Zoltan,” young Gergely once confided. “If something bad happened to you, the world would lose all of your knowledge. You should write a book about sheep and give it to the rest of us. I hear you only need to know six or seven related things to publish a book.”

On such occasions, Zoltan would hem and haw solemnly, mumbling that he didn’t know enough to write a book, and that anyway, nothing he said was really very important. It was false modesty of course; in reality, he was deeply flattered, and very much concerned that his vast body of sheep knowledge would disappear along with him one day. So one day, Zoltan packed up his knapsack, asked Gergely to look after his sheep for the day, and went off to consult with the wise old woman who lived in the next village.

The old woman listened to Zoltan’s story with a good deal of interest, nodding sagely at all the right moments. When Zoltan was done, the old woman mulled her thoughts over for a while.

“If you want to be taken seriously, you must publish your findings in a peer-reviewed journal,” she said finally.

“What’s Pier Evew?” asked Zoltan.

“One moment,” said the old woman, disappearing into her bedroom. She returned clutching a dusty magazine. “Here,” she said, handing the magazine to Zoltan. “This is peer review.”

That night, after his sheep had gone to bed, Zoltan stayed up late poring over Vol. IV, Issue 5 of Domesticated Animal Behavior Quarterly. Since he couldn’t understand the figures in the magazine, he read it purely for the articles. By the time he put the magazine down and leaned over to turn off the light, the first glimmerings of an empirical research program had begun to dance around in his head. Just like fireflies, he thought. No, wait, those really were fireflies. He swatted them away.

“I like this… science,” he mumbled to himself as he fell asleep.

In the morning, Zoltan went down to the local library to find a book or two about science. He checked out a volume entitled Principia Scientifica Buccolica—a masterful derivation from first principles of all of the most common research methods, with special applications to animal behavior. By lunchtime, Zoltan had covered t-tests, and by bedtime, he had mastered Mordenkainen’s correction for inestimable herds.

In the morning, Zoltan made his first real scientific decision.

“Today I’ll collect some pilot data,” he thought to himself, “and tomorrow I’ll apply for an R01.”

His first set of studies tested the provocative hypothesis that sheep communicate with one another by moving their ears back and forth in Morse code. Study 1 tested the idea observationally. Zoltan and two other raters (his younger cousins), both blind to the hypothesis, studied sheep in pairs, coding one sheep’s ear movements and the other sheep’s behavioral responses. Studies 2 through 4 manipulated the sheep’s behavior experimentally. In Study 2, Zoltan taped the sheep’s ears to their heads; in Study 3, he covered their eyes with opaque goggles so that they couldn’t see each other’s ears moving. In Study 4, he split the twelve sheep into three groups of four in order to determine whether smaller groups might promote increased sociability.

That night, Zoltan minded the data. “It’s a lot like minding sheep,” Zoltan explained to his cousin Griga the next day. “You need to always be vigilant, so that a significant result doesn’t get away from you.”

Zoltan had been vigilant, and the first 4 studies produced a number of significant results. In Study 1, Zoltan found that sheep appeared to coordinate ear twitches: if one sheep twitched an ear several times in a row, it was a safe bet that other sheep would start to do the same shortly thereafter (p < .01). There was, however, no coordination of licking, headbutting, stamping, or bleating behaviors, no matter how you sliced and diced it. “It’s a highly selective effect,” Zoltan concluded happily. After all, when you thought about it, it made sense. If you were going to pick just one channel for sheep to communicate through, ear twitching was surely a good one. One could make a very good evolutionary argument that more obvious methods of communication (e.g., bleating loudly) would have been detected by humans long ago, and that would be no good at all for the sheep.

Studies 2 and 3 further supported Zoltan’s story. Study 2 demonstrated that when you taped sheep’s ears to their heads, they ceased to communicate entirely. You could put Rozalia and Erzsebet in adjacent enclosures and show Rozalia the Jack of Spades for three or four minutes at a time, and when you went to test Erzsebet, she still wouldn’t know the Jack of Spades from the Three of Diamonds. It was as if the sheep were blind! Except they weren’t blind, they were dumb. Zoltan knew; he had made them that way by taping their ears to their heads.

In Study 3, Zoltan found that when the sheep’s eyes were covered, they no longer coordinated ear twitching. Instead, they now coordinated their bleating—but only if you excluded bleats that were produced when the sheep’s heads were oriented downwards. “Fantastic,” he thought. “When you cover their eyes, they can’t see each other’s ears any more. So they use a vocal channel. This, again, makes good adaptive sense: communication is too important to eliminate entirely just because your eyes happen to be covered. Much better to incur a small risk of being detected and make yourself known in other, less subtle, ways.”

But the real clincher was Study 4, which confirmed that ear twitching occurred at a higher rate in smaller groups than larger groups, and was particularly common in dyads of well-adjusted sheep (like Anastasia and Orsolya, and definitely not like Zsa Zsa and Marianna).

“Sheep are like everyday people,” Zoltan told his sister on the phone. “They won’t say anything to your face in public, but get them one-on-one, and they won’t stop gossiping about each other.”

It was a compelling story, Zoltan conceded to himself. The only problem was the F test. The difference in twitch rates as a function of group size wasn’t quite statistically significant. Instead, it hovered around p = .07, which the textbooks told Zoltan meant that he was almost right. Almost right was the same thing as potentially wrong, which wasn’t good enough. So the next morning, Zoltan asked Gergely to lend him four sheep so he could increase his sample size.

“Absolutely not,” said Gergely. “I don’t want your sheep filling my sheep’s heads with all of your crazy new ideas.”

“Look,” said Zoltan. “If you lend me four sheep, I’ll let you drive my Cadillac down to the village on weekends after I get famous.”

“Deal,” said Gergely.

So Zoltan borrowed the sheep. But it turned out that four sheep weren’t quite enough; after adding Gergely’s sheep to the sample, the effect only went from p < .07 to p < .06. So Zoltan cut a deal with his other neighbor, Yuri: four of Yuri’s sheep for two days, in return for three days with Zoltan’s new Lexus (once he bought it). That did the trick. Once Zoltan repeated the experiment with Yuri’s sheep, the p-value for Study 4 now came to .046, which the textbooks assured Zoltan meant he was going to be famous.

Data in hand, Zoltan spent the next two weeks writing up his very first journal article. He titled it “Baa baa baa, or not: Sheep communicate via non-verbal channels”—a decidedly modest title for the first empirical work to demonstrate that sheep are capable of sophisticated propositional thought. The article was published to widespread media attention and scientific acclaim, and Zoltan went on to have a productive few years in animal behavioral research, studying topics as interesting and varied as giraffe calisthenics and displays of affection in the common leech.

Much later, it turned out that no one was able to directly replicate his original findings with sheep (though some other researchers did manage to come up with conceptual replications). But that didn’t really matter to Zoltan, because by then he’d decided science was too demanding a career anyway; it was way more fun to lie under trees counting his sheep. Counting sheep, and occasionally, on Saturdays, driving down to the village in his new Lexus, just to impress all the young cowgirls.