Vocalabs Newsletter: Quality Times

Issue 114

You May Be P-Hacking and Don't Even Know It

In This Issue

  • You May Be P-Hacking and Don't Even Know It
  • Designing Hybrid Surveys


You May Be P-Hacking and Don't Even Know It

P-Hacking is a big problem. It can lead to bad decisions, wasted effort, and misplaced confidence in your understanding of how your business works.

P-Hacking sounds like something you do to pass a drug test. Actually, it's something you do to pass a statistical test. "P" refers to the P value, the probability that you would see the observed result by random chance alone if there were nothing real going on. "Hacking" in this case means manipulating, so P-Hacking is manipulating an experiment to make the P value look more significant than it really is, so that it looks like you discovered a real effect when in fact there may be nothing there.

It's the equivalent of smoke and mirrors for statistics nerds. And it's really, really common. So common that some of the foundational research in the social sciences has turned out to not be true. It's led to a "Replication Crisis" in some fields, forcing a fresh look at many important experiments.

And as scientific techniques like A/B testing have become more common in the business world, P-Hacking has followed. A recent analysis of thousands of A/B tests run through a commercial platform found convincing evidence of P-Hacking in over half the tests where a little manipulation could make the difference between a result that's declared "significant" and one that's just noise.

The problem is that P-Hacking is subtle: it's easy to do without realizing it, hard to detect, and extremely tempting when there's an incentive to produce results.

One common form of P-Hacking, and the one observed in the recent analysis, is stopping an A/B test early when it shows a positive result. This may seem innocuous, but in reality it distorts the P value and gives you a better chance of hitting your threshold for statistical significance.

Think of it this way: If you consider a P value of less than 0.05 to be "significant" (a common threshold), that means there's supposed to be only a 5% chance you would have gotten the same result by random chance if there were actually no difference between your A and B test cases. It's the equivalent of rolling one of those 20-sided Dungeons and Dragons dice and declaring that "20" means you found something real.

But if you peek at the results of your A/B test early, that's a little like giving yourself extra rolls of the dice. So Monday you roll 8 and keep the experiment running. Tuesday you roll 12 and keep running. Wednesday you roll 20 and declare that you found something significant and stop. Maybe if you had continued the experiment, Thursday and Friday would have come up 20 as well, but maybe not. You don't know, because you stopped the experiment early.

The point is that by taking an early look at the results and deciding to end the test as soon as the results crossed your significance threshold, you're getting to roll the dice a few more times and increase the odds of showing a "significant" result when in fact there was no effect.
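
To see how much those extra rolls matter, here's a quick simulation sketch in Python. This is my own illustration, not data from any real A/B platform, and the traffic numbers, test length, and choice of a simple t-test are all arbitrary assumptions. Both groups are pure noise, so an honest single test should declare a bogus "win" only about 5% of the time.

    # Sketch: how "peeking" at an A/B test inflates false positives.
    # Both groups are pure noise, so any "significant" result is bogus.
    import numpy as np
    from scipy.stats import ttest_ind

    rng = np.random.default_rng(0)
    n_experiments = 2000                 # simulated A/B tests
    days, visitors_per_day = 20, 50      # made-up test length and traffic

    peeking_hits = fixed_hits = 0
    for _ in range(n_experiments):
        a = rng.normal(size=days * visitors_per_day)   # group A: no real effect
        b = rng.normal(size=days * visitors_per_day)   # group B: no real effect

        # Peeking: run a test at the end of each day, stop at the first p < 0.05
        if any(ttest_ind(a[:d * visitors_per_day], b[:d * visitors_per_day]).pvalue < 0.05
               for d in range(1, days + 1)):
            peeking_hits += 1

        # Fixed horizon: one test after all the data is in
        if ttest_ind(a, b).pvalue < 0.05:
            fixed_hits += 1

    print(f"Bogus 'wins' with daily peeking: {peeking_hits / n_experiments:.1%}")
    print(f"Bogus 'wins' with a single test: {fixed_hits / n_experiments:.1%}")

Run it and the daily-peeking strategy "finds" an effect several times more often than the roughly 5% you signed up for, even though there is nothing to find.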

If there is a real effect, we expect the P value to keep dropping (showing more and more significance) as we collect more data. But the P value can bounce around, and even when the experiment is run perfectly with no P-Hacking there's still a one-in-20 chance that you'll see a "significant" result that's completely bogus. If you're P-Hacking, the odds of a bogus result can increase a lot.
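
Here's another quick sketch (again, made-up numbers rather than real survey or A/B data) showing that behavior: the p value from a simple t-test, checked at a few sample sizes, with and without a genuine difference between the groups.

    # Sketch: p values as data accumulates, with and without a real effect.
    import numpy as np
    from scipy.stats import ttest_ind

    rng = np.random.default_rng(1)
    checkpoints = [100, 500, 2000, 10000]     # cumulative sample size per group
    for label, lift in [("real effect", 0.1), ("no effect", 0.0)]:
        a = rng.normal(0.0, 1.0, size=max(checkpoints))
        b = rng.normal(lift, 1.0, size=max(checkpoints))   # lift is the true difference
        trail = ", ".join(f"n={n}: p={ttest_ind(a[:n], b[:n]).pvalue:.3f}"
                          for n in checkpoints)
        print(f"{label:12s} {trail}")

With a real (if modest) difference, the p value tends to shrink steadily as the sample grows; with no difference it just wanders, and on any single peek it can dip below 0.05 purely by luck.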

What makes this so insidious is that we are all wired to want to find something. Null results--finding the things that don't have any effect--are boring. Positive results are much more interesting. We all want to go to our boss or client and talk about what we discovered, not what we didn't discover.

How can you avoid P-Hacking? It's hard. You need to be very aware of what your statistical tests mean and how they relate to the way you designed your study. Here are some tips:

  • Be aware that every decision you make while an A/B test is underway could be another roll of the dice. Don't change anything about your study design once data collection has started.
  • Be aware that every relationship you analyze is another roll of the dice, too. If you look at 20 different metrics that are just random noise, you should expect that one of them will show a statistically significant trend with p < 0.05. So testing a lot of different relationships means you should use a lower P value threshold to declare a significant result (see the sketch after this list).
  • When in doubt, collect more data. When there's a real effect or trend, the statistical significance should improve as you collect more data. Bogus effects tend to go away.
  • Don't think of statistical significance as some hard threshold. In reality, it's just a tool for estimating whether the results of an analysis are real or bogus, and there's nothing magical about crossing p < 0.05, p < 0.01, or any other threshold.
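
The second tip about testing many relationships is easy to demonstrate. The sketch below (again my own illustration, with arbitrary sample sizes) generates 20 metrics that are nothing but random noise, tests each one at p < 0.05, and counts the false alarms.

    # Sketch: the multiple-comparisons problem from the second tip.
    # 20 pure-noise metrics, tested at p < 0.05, repeated over many trials.
    import numpy as np
    from scipy.stats import ttest_ind

    rng = np.random.default_rng(2)
    n_metrics, n_per_group, trials = 20, 500, 1000

    false_alarms = []
    for _ in range(trials):
        pvals = [ttest_ind(rng.normal(size=n_per_group),
                           rng.normal(size=n_per_group)).pvalue
                 for _ in range(n_metrics)]
        false_alarms.append(sum(p < 0.05 for p in pvals))

    print(f"Average 'significant' noise metrics per study: {np.mean(false_alarms):.2f}")
    # A simple (if conservative) correction: require p < 0.05 / 20 = 0.0025 instead.

The average comes out right around one bogus "discovery" per study, which is exactly the one-in-20 you'd expect. Dividing the threshold by the number of comparisons (a Bonferroni-style correction) is a blunt but effective way to keep the overall false-alarm rate near 5%.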

Another useful tip is to change the way you think and speak about statistical significance. When I discuss data with clients, I prefer to avoid the phrase "statistically significant" entirely: I'll use descriptive phrases like, "there's probably something real" when the P value is close to the significance threshold, and "there's almost certainly a real effect" when the P value is well below the significance threshold.

I find this gives my clients a much better understanding of what the data really means. All statistics are inherently fuzzy, and anointing some results as "statistically significant" tends to give a false impression of Scientific Truth.


Designing Hybrid Surveys

There are two elements to designing a hybrid survey program, which combines the depth of actionable feedback from live-person phone interviews with the ability to cost-effectively collect huge sample sizes through an online survey. In this article I'll explore how to design the survey questions and how the two feedback channels relate to each other. In a future article I'll write about designing the survey process itself, and some of the considerations that go into sampling and channel selection.

To get the most benefit from a hybrid survey we want to play to the strengths of each feedback channel. Online surveys are cost-effective for collecting tens of thousands or even millions of responses, while phone interviews let you collect a lot of detail about individual customers. Online surveys are good for calculating metrics, and phone interviews give you insight into individual customers' stories.

Keep The Online Survey Short and Sweet

The online survey is where you get to cast a very wide net, including large numbers of customers in the survey process. This is also where most of your tracking metrics will come from. But it's not the place to try to collect lots of detailed feedback from each customer: long survey forms often don't get a good response rate.

I recommend limiting the online survey to a handful of key metrics plus one box for customers to enter any other comments or suggestions they may have. The particular metrics you choose will depend on your survey goals, but I tend to think that one metric is too few, and more than five will just make the survey longer without yielding much (if any) new information.

It's also good practice to give customers a Service Recovery option, usually as a question at the end of the survey along the lines of, "Do you want a representative to contact you to resolve any outstanding issues?" Just make sure that those requests get routed to the right department and promptly handled.

And please please please please don't make any of your questions mandatory. Required questions serve no purpose other than frustrating customers and should be stricken from the survey toolbox.

Go Deep In Phone Interviews

You can ask a surprising number of questions in a typical five-minute phone interview. This is the place to ask follow-up questions, maybe include some metrics that had to be eliminated from the online survey due to length (you did keep it short, right?), and most importantly, give the customer a chance to really tell her story.

I usually start with the questions from the online survey and add to them. We may need to adjust the wording of some of the questions--not every question that looks good written will sound good when read aloud--but we want to cover the same ground. One purpose of this overlap is to compare the results from the online survey to the interviews, since we normally expect the interview to give us a truer reading of the survey metrics. If the metrics for the interviews and the online survey diverge, that's an indication that something may be going wrong in the survey process.

It's a good idea to keep the interview questions flexible. Unlike the core metrics in the online survey, which need to stay consistent over time, the interview questions may need to be updated frequently depending on changing business needs or the particular reason a customer was selected for an interview rather than an automated survey.

I also bias heavily towards open-ended questions on the interview. This gives the customer a chance to use their own words and will often surface unexpected feedback. If needed, the interviewer can code the responses (along with providing a written summary) to allow for tracking of the types of feedback you're getting.

The end result is going to be a handful of metrics, with a healthy dollop of open-ended questions to explore the reasons behind the ratings. The metrics should be comparable to the online survey, so they can serve as a check on the validity of the high-volume feedback process, but the true value will be in understanding individual customer stories.
