The Customer Service Survey

Analysis

You May Be P-Hacking and Don't Even Know It

by Peter Leppik on Fri, 2018-07-27 10:42

P-Hacking is a big problem. It can lead to bad decisions, wasted effort, and misplaced confidence in how your business works.

P-Hacking sounds like something you do to pass a drug test. Actually, it's something you do to pass a statistical test. "P" refers to the "P" value, the probability that you would see a result like the one you observed purely by random chance, if there were actually no real effect. "Hacking" in this case means manipulating, so P-Hacking is manipulating an experiment to make the P value look more significant than it really is, so that it looks like you discovered a real effect when in fact there may be nothing there.

It's the equivalent of smoke and mirrors for statistics nerds. And it's really, really common. So common that some of the foundational research in the social sciences has turned out not to be true. It's led to a "Replication Crisis" in some fields, forcing a fresh look at many important experiments.

And as scientific techniques like A/B testing have become more common in the business world, P-Hacking has followed. A recent analysis of thousands of A/B tests run on a commercial platform found convincing evidence of P-Hacking in over half of the tests where a little P-Hacking could make the difference between a result that's declared "significant" and one that's just noise.

The problem is that P-Hacking is subtle: it's easy to do without realizing it, hard to detect, and extremely tempting when there's an incentive to produce results.

One common form of P-Hacking, and the one observed in the recent analysis, is stopping an A/B test early when it shows a positive result. This may seem innocuous, but in reality it distorts the P value and gives you a better chance of hitting your threshold for statistical significance.

Think of it this way: If you consider a P value of less than 0.05 to be "significant" (a common threshold), that means that there's supposed to be a 5% chance that you would have gotten the same result by random chance if there was actually no difference between your A and B test cases. It's the equivalent of rolling one of those 20-sided Dungeons and Dragons dice and declaring that "20" means you found something real.

But if you peek at the results of your A/B test early, that's a little like giving yourself extra rolls of the dice. So Monday you roll 8 and keep the experiment running. Tuesday you roll 12 and keep running. Wednesday you roll 20 and declare that you found something significant and stop. Maybe if you had continued the experiment you would have kept rolling 20 on Thursday and Friday, but maybe not. You don't know because you stopped the experiment early.

The point is that by taking an early look at the results and deciding to end the test as soon as the results crossed your significance threshold, you're getting to roll the dice a few more times and increase the odds of showing a "significant" result when in fact there was no effect.

If there is a real effect, we expect the P value to keep dropping (showing more and more significance) as we collect more data. But the P value can bounce around, and even when the experiment is run perfectly with no P-Hacking there's still a one-in-20 chance that you'll see a "significant" result that's completely bogus. If you're P-Hacking, the odds of a bogus result can increase a lot.
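
To make the dice-rolling analogy concrete, here's a minimal simulation sketch in Python (my own illustration, not taken from the analysis mentioned above; the daily sample sizes and number of looks are made-up numbers). It runs a large batch of A/B tests in which there is genuinely no difference between A and B, then compares how often you declare "significance" if you analyze the data once at the planned end versus if you peek every day and stop the moment p drops below 0.05.

```python
# Sketch: how "peeking" at an A/B test inflates false positives.
# Both groups are drawn from the SAME distribution, so every "significant" result is bogus.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

N_EXPERIMENTS = 2000   # simulated A/B tests (illustrative)
DAYS = 14              # daily looks at the data (illustrative)
PER_DAY = 100          # new observations per group per day (illustrative)
ALPHA = 0.05

def false_positive_rate(peek: bool) -> float:
    hits = 0
    for _ in range(N_EXPERIMENTS):
        a = np.empty(0)
        b = np.empty(0)
        significant = False
        for _ in range(DAYS):
            a = np.concatenate([a, rng.normal(size=PER_DAY)])
            b = np.concatenate([b, rng.normal(size=PER_DAY)])
            p = stats.ttest_ind(a, b).pvalue
            if peek and p < ALPHA:   # stop early as soon as it looks "significant"
                significant = True
                break
        if not peek:                 # only check once, at the planned end of the test
            significant = p < ALPHA
        hits += significant
    return hits / N_EXPERIMENTS

print("No peeking:   ", false_positive_rate(peek=False))  # close to the advertised 5%
print("Peeking daily:", false_positive_rate(peek=True))   # noticeably higher than 5%
```

With settings like these, the peeking strategy typically produces false positives at several times the nominal 5% rate, even though nothing about the underlying data changed.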

What makes this so insidious is that we are all wired to want to find something. Null results--finding the things that don't have any effect--are boring. Positive results are much more interesting. We all want to go to our boss or client and talk about what we discovered, not what we didn't discover.

How can you avoid P-Hacking? It's hard. You need to be very aware of what your statistical tests mean and how they relate to the way you designed your study. Here are some tips:

  • Be aware that every decision you make while an A/B test is underway could be another roll of the dice. Don't change anything about your study design once data collection has started.
  • Every relationship you analyze is also another roll of the dice. If you look at 20 different metrics that are just random noise, you should expect, on average, one of them to show a statistically significant trend with p < 0.05 (see the sketch just after this list).
  • When in doubt, collect more data. When there's a real effect or trend, the statistical significance should improve as you collect more data. Bogus effects tend to go away.
  • Don't think of statistical significance as some hard threshold. In reality, it's just a tool for estimating whether the results of an analysis are real or bogus, and there's nothing magical about crossing p < 0.05, p < 0.01, or any other threshold.
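
As a quick illustration of the second bullet, here's a small sketch (same caveat as before: purely synthetic data with made-up sample sizes). It generates 20 "metrics" that are nothing but random noise, tests each one, and reports which ones cross p < 0.05 anyway.

```python
# Sketch: test 20 metrics of pure noise and see which ones look "significant".
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
N = 500        # respondents per group (illustrative)
METRICS = 20   # unrelated metrics, all just random noise

false_alarms = 0
for m in range(METRICS):
    group_a = rng.normal(size=N)   # scores under condition A
    group_b = rng.normal(size=N)   # same distribution, so there is no real difference
    p = stats.ttest_ind(group_a, group_b).pvalue
    if p < 0.05:
        false_alarms += 1
        print(f"Metric {m}: p = {p:.3f}  <-- looks 'significant', but it's noise")

print(f"{false_alarms} of {METRICS} pure-noise metrics crossed p < 0.05")
```

Run it a few times with different seeds and, on average, about one metric per run will clear the threshold, which is exactly the point of the bullet above.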

Another useful tip is to change the way you think and speak about statistical significance. When I discuss data with clients, I prefer to avoid the phrase "statistically significant" entirely: I'll use descriptive phrases like, "there's probably something real" when the P value is close to the significance threshold, and "there's almost certainly a real effect" when the P value is well below the significance threshold.

I find this gives my clients a much better understanding of what the data really means. All statistics are inherently fuzzy, and anointing some results as "statistically significant" tends to give a false impression of Scientific Truth.

Circling the Drain with Happy Customers

by vocalabs on Fri, 2017-03-03 14:38

American Customer Satisfaction Index (ACSI) data for the retail industry was released this week, and the folks over at Consumerist noticed something, well, odd. Scores for Sears, JCPenney, and Macy's took huge leaps in 2016--despite the fact that those traditional department stores have been closing stores in the face of sustained sales declines and changing consumer tastes. In addition, Abercrombie & Fitch, a specialty clothing retailer, also posted a big ACSI gain despite struggling to actually sell stuff.

The idea that customer satisfaction scores would jump as the companies are losing customers seems counterintuitive to say the least. The ACSI analyst gamely suggested that shorter lines and less-crowded stores are leading to higher customer satisfaction and, yeah, I'm not buying it.

I'd like to suggest some alternate hypotheses to explain why these failing retailers are posting improved ACSI scores:

Theory 1: There's No There There

Before trying to explain why ACSI scores might be up, it's worth asking whether these companies' scores are actually improving, or whether it might just be a statistical blip.

Unfortunately, ACSI doesn't provide much help in trying to answer this question. In their report (at least the one you can download for free) there's no indication of the margin of error or the statistical significance of any changes. They do disclose that a total of about 12,000 consumers completed their survey, but that's not helpful given that we don't know how many of them answered the questions about Sears, JCPenney, etc.

With this kind of research there's always a temptation to exaggerate the statistical significance of any findings--after all, you don't want to go through all the effort just to publish a report that says, "nothing changed since last year." So I'm always skeptical when the report doesn't even make a passing reference to whether a change is meaningful or not.

It could be that these four companies saw big fluctuations in their scores simply because they don't have many customers anymore and the sample size for those companies is very small. There's nothing in the report to rule this possibility out.
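
To see why per-company sample size matters so much, here's a back-of-the-envelope sketch (my own illustration; ACSI doesn't publish per-company sample sizes or standard deviations, so the spread below is an assumption). It shows how the margin of error on a 0-100 index score grows as the number of respondents for any single company shrinks.

```python
# Back-of-the-envelope margin of error for a 0-100 index score (illustrative assumptions only).
import math

ASSUMED_STDDEV = 20.0   # assumed spread of individual 0-100 scores; not ACSI's actual figure
Z_95 = 1.96             # multiplier for an approximate 95% confidence interval

def margin_of_error(n: int) -> float:
    """Half-width of an approximate 95% confidence interval for the mean score."""
    return Z_95 * ASSUMED_STDDEV / math.sqrt(n)

for n in (12000, 1000, 250, 100, 50):
    print(f"n = {n:>5}: score +/- {margin_of_error(n):.1f} points")
```

Under those assumptions, a company with only a hundred or so respondents could easily see its score swing by several points on sampling noise alone--which is exactly the "statistical blip" possibility the report doesn't rule out.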

Theory 2: Die-Hards Exit Last

We know that even though surveys like ACSI purport to collect opinions about specific aspects of customer satisfaction, consumer responses are strongly colored by their overall brand loyalty and affinity.

So as these shrinking brands lose customers, we expect the least-loyal customers to leave first. That means the remaining customers will, on the whole, be more loyal and more likely to give higher customer satisfaction scores than the ones who left.

In other words, these companies' survey scores are going up not because the customer experience is any better, but because only the true die-hard fans are still shopping there.

If this is the case, then the improved ACSI scores are real but not very helpful to the companies. They are circling the drain with an ever-smaller core group of more and more loyal customers.

This is a hard theory to test. If ACSI has longitudinal data (i.e. they survey some of the same customers each year) then it might be possible to tease out changes in customer populations from changes in customer experience.

Theory 3: ACSI Has Outlived its Usefulness

Finally, it's worth asking whether the ACSI is simply no longer relevant. The theory behind ACSI is that more-satisfied customers will lead to more customer loyalty and higher sales, all else being equal. But the details are important, and the specific methodology of ACSI was developed over 20 years ago, based on research originally done in the 1980's.

I know that ACSI has made some changes over the years (for example, they now collect data through email), but I don't know if they've evolved the survey questions and scoring to keep up with changes in customer expectations and technology. Back in 1994 when ACSI launched, not only did we not have Facebook and Twitter, but Amazon.com had only just been founded, and most people didn't even have access to the Internet.

So if the index hasn't kept up enough, it's possible that ACSI is putting too much weight on things that don't matter to a 21st century consumer, and missing new things that are important.

Interpreting Survey Data Is Hard

I'm only picking on ACSI because their report is fresh. The fact is that interpreting survey data is hard, and it's important to explore alternate explanations for the results. Even when the data perfectly fits your prior assumptions, you may be missing something important if you don't look at competing theories.

It's entirely possible that ACSI did exactly that, tested all three of my alternate theories and others, and they have some internal data that supports their explanation that, "Fewer customers can lead to shorter lines, faster checkout, and more attention from the sales staff." But if they went through that analysis there's no evidence of it in their published report.

When the survey results are unexpected, you really need to explore what's going on and not just reach for the first explanation that's remotely plausible.

Goldilocks Data

by Peter Leppik on Wed, 2017-01-11 15:08

Apple's new laptops have been generating complaints about the battery meter. The "time remaining" display has a bad habit of jumping all around and not giving the user meaningful information about how much time they can actually keep using the computer.

Getting this display right is a tricky problem, and it's a nice, simple example of a situation that's common to a lot of dashboards and data visualization. The challenge is that you are trying to communicate a relatively simple and actionable message about a very complicated underlying system, where the person receiving the message isn't an expert and can't be expected to become one.

In the case of Apple's battery meter, the user wants to know roughly how long he can keep using the laptop before plugging in. But the complicated reality is that the laptop's power usage can vary second-to-second, and it's not always obvious what's driving the changes. You may be happily surfing the web and barely sipping the battery, but should you visit a page with a lot of animations (or worse--scripts to track your web viewing and serve you ads) that suck up the CPU, your battery usage will spike and time available will plummet.
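
To see why a literal readout bounces around, here's a toy sketch (my own illustration, with made-up numbers; not Apple's actual algorithm). It compares the naive estimate--remaining charge divided by whatever the power draw happens to be right now--against an estimate based on an exponentially weighted average of recent draw, one simple way to smooth out short spikes.

```python
# Toy sketch: naive vs. smoothed "time remaining" estimates (not Apple's actual algorithm).

def time_remaining_hours(charge_wh: float, draw_w: float) -> float:
    """Hours left if the power draw stays exactly at this level."""
    return charge_wh / draw_w

def smoothed_draw(samples_w, alpha: float = 0.05) -> float:
    """Exponentially weighted average of power draw: reacts slowly to short spikes."""
    avg = samples_w[0]
    for w in samples_w[1:]:
        avg = alpha * w + (1 - alpha) * avg
    return avg

charge_wh = 40.0                            # battery charge remaining (made-up)
draw_history = [6.0] * 50 + [25.0] * 3      # light web browsing, then a brief CPU spike (made-up)

naive = time_remaining_hours(charge_wh, draw_history[-1])
smooth = time_remaining_hours(charge_wh, smoothed_draw(draw_history))
print(f"Naive estimate:    {naive:.1f} hours")   # plunges the instant the spike hits
print(f"Smoothed estimate: {smooth:.1f} hours")  # responds much more gradually
```

The smoothed number is steadier, but it also lags behind a genuine change in usage--the same tradeoff any dashboard faces between being responsive and being meaningful.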

Juice Analytics took a look at this problem recently, and provided some different ways to better communicate the nuances of laptop battery life. In all likelihood, none of the options will be completely satisfying to the typical user who just wants to know if he has enough battery to watch The Matrix to the end.

But just like in the business world, where leadership may want simple answers to complex questions, sometimes it does a real disservice to give people the data they think they want. The challenge is to find a simple way to communicate the data they actually need.

Stop And Think Before Collecting Useless Data

by Peter Leppik on Wed, 2016-12-07 14:52

People are naturally attracted to shiny new things, and that's just as true in the world of business intelligence as in a shopping mall. So when offered an interesting new piece of data, the natural inclination is to chew on it for a while and ask for more.

But not all this data turns out to be particularly useful, and the result is often an accumulation of unread reports. I've known companies where whole departments were dedicated to collecting, analyzing, and distributing data that nobody (outside the department) used for any identifiable purpose.

Before gathering data and creating reports, it's worth taking a moment to consider what the data will be used for. There are a few broad categories, ranging from the most useful to the most useless:

  1. The most useful data is essential for business processes. For example, sales and accounting data is essential for running any kind of business. An employee coaching program built around customer feedback requires customer feedback data to operate. If the data is a required input into some day-to-day business process, then it falls into this category and there's no question of its usefulness as long as the underlying process is operating.
  2. Less essential but still very useful is data to help make specific decisions. Without this information the company might make a different (and probably worse) decision about something, but a decision could still be made. For example, before deciding whether to invest in online chat for customer service it's helpful to have some data about whether customers will actually use it.
  3. Data to validate a decision that's already been made may seem useless, since the data won't change the decision. But in a well-run organization, it can be valuable to take the time and effort to review whether specific decisions turned out in hindsight to be the right thing to do. Ideally this will lead to self-reflection and better decision-making in the future, though in practice most organizations aren't very good at this kind of follow-through.
  4. Occasionally data to monitor the health of the business will have value, though most of the time--when things are going well--these reports won't make any practical difference. Most tracking and trending data falls into this category (assuming that the underlying data isn't also being used for some other purpose). The value of this type of data is that it can warn of problems that might not be visible elsewhere; but the risk is that red flags will be ignored. Lots of companies track customer satisfaction, but might not take action if customer satisfaction plummets while sales and profitability remain high.
  5. Data that might be useful someday is the most useless category, since in practice "someday" rarely arrives. Information that's "nice to have" but doesn't drive any business activity or decision-making is probably information you can do without.

It may seem that there's little harm in collecting useless data, but the reality is that it comes with a cost. Someone has to collect the data, compile the reports, and distribute the results. Worse, recipients who get too many useless reports are more likely to miss the important bits for all the noise.

So before collecting data, take a moment to think about how--and whether--it's going to be used.

The Intersection of 65,000 and 115,000 Is Not "&"

by Peter Leppik on Tue, 2016-11-08 10:30

Good data visualization is a balancing act. Communicating facts and statistics in a way that's both pleasing to the eye and conveys meaning intuitively requires skills that are not always easy to find.

It's not surprising when charts and graphs sometimes misfire, especially when the designer tries too hard to be clever and just winds up being confusing.

Just as shipwrecks are sometimes useful ways to spot where the rocks are, really bad data visualizations can help us avoid the mistakes of others.

WTF Visualizations is like a roadmap of how not to communicate data. I highly recommend you spend some time browsing their examples of really terrible data visualization.

First you will laugh. Then you will think. And then, I hope, you will resolve never to venture into those same waters.

Insights Aren't Enough

by Peter Leppik on Fri, 2014-08-29 17:08

Anyone who has done any sort of data collection or analysis in the business world has almost certainly been asked to produce insights. "We're looking for insightful data," is a typical statement I hear from clients on a regular basis.

But for some reason, people don't talk much about getting useful data. There's an implicit assumption that "insightful data" and "useful data" are the same thing.

They aren't, and it's important to understand why.

  • "Insightful" data yields new knowledge or understanding about something. It tells you something you didn't already know.
  • "Useful" data can be applied towards achieving some goal. It moves you closer to your business objective.

Data can be either "insightful" or "useful," or both, or neither. Insightfulness and usefulness are completely different things.

For example, if you discover as part of your customer research that a surprisingly high percentage of your customers are left-handed, that may be insightful but it's probably not useful (unless you're planning to market specifically to southpaws).

Or if your survey data shows that some of your customer service reps have consistently higher customer satisfaction than others, that's very useful information, but it's probably not insightful (you probably expected some reps to score higher than others).

The best data is both insightful and useful, but that's rare. Most companies have enough of an understanding of how their business works that true insights are unusual, and true insights which can be immediately applied towards a business goal are even less common.

And of course data which is neither useful nor insightful serves no purpose. Nevertheless, this sort of research is distressingly common.

When it comes down to useful data vs. insightful data, I tend to prefer usefulness over insightfulness. Data which is useful, even if it doesn't reveal any new insights, still helps advance the goals of the company. That's not to imply that insights have no value: even a useless insight can be filed away in case it becomes important in the future.

But whether you're looking for insights or usefulness, remember that they are not the same thing.

Uncorrelated Data

by Peter Leppik on Wed, 2014-07-23 15:22

A few months ago I wrote about the Spurious Correlation Generator, a fun web page where you could discover pointless facts like the divorce rate in Maine is correlated to per-capita margarine consumption (who knew!).

The other side of the correlation coin is when there's a complete lack of any correlation whatsoever. Today, for example, I learned that in a sample of 200 large corporations, there is zero correlation between the relative CEO pay and the relative stock market return. None, nada, zippo.

(The statistician in me insists that I restate that as, "any correlation in the data is much smaller than the margin of error and is statistically indistinguishable from zero." But that's why I don't let my inner statistician go to any of the fun parties.)

Presumably, though, the boards of directors of these companies must believe there's some relationship between stock performance and CEO pay. Otherwise why on Earth would they pay, for example, Larry Ellison of Oracle $78 million? Or $12 million to Ken Frazier, CEO of Merck? What's more, since CEOs are often paid mostly in stock, the lack of any correlation between stock price and pay is surprising.

It's easy to conclude that these big companies are being very foolish and paying huge amounts of money to get no more value than they would have gotten had they hired a competent chief executive who didn't happen to be a rock star. And this explanation could well be right.

On the other hand, the data doesn't prove it. Just as a strong correlation doesn't prove that two things are related to each other, the lack of a correlation doesn't prove they aren't related.

It's also possible that the analysis was flawed. Or that the two are related, but in some more complicated way than a simple correlation can capture.

In this case, here are a few things I'd examine about the data and the analysis before concluding that CEO pay isn't related to stock performance:

  1. Sample Bias: The data for this analysis consists of 200 large public companies in the U.S. Since there are thousands of public companies, and easily 500 which could be considered "large," it's important to ask how these 200 companies were chosen and what happens if you include a larger sample. It appears that the people who did the analysis chose the 200 companies with the highest CEO pay, which is a clearly biased sample. So the analysis needs to be re-done with a larger sample including companies with low CEO pay, or ideally, all public companies above some size (for example, all companies in the S&P 500).
  2. Analysis Choices: In addition to choosing a biased sample, the people who did the analysis also chose a weird way to try to correlate the variables. Rather than the obvious analysis correlating CEO pay in dollars against stock performance in percent, this analysis was done using the relative rank in CEO pay (i.e. 1 to 200) and the relative rank in stock performance. That flattens any bell-curve distribution and erases any clustering, which, depending on the details of the source data, could either wash out or exaggerate a linear correlation (see the sketch just after this list).
  3. Input Data: Finally there's the question of what input data is being used for the analysis. Big public companies usually pay their CEOs mostly in stock, so you would normally expect a very strong relationship between stock price and CEO pay. But there's a quirk in how CEO compensation is reported to shareholders: in any given year, the reported CEO pay includes only what the CEO got for actually selling shares in that year. A chief executive could hang on to his (or too rarely, her) stock for many years and then sell it all in one big block. So in reality the CEO is collecting many years' worth of pay all at once, but the stock performance data used in this analysis probably only includes the last year. The analysis really should include CEO pay and stock performance for multiple years, possibly the CEO's entire tenure.

So the lack of correlation in a data analysis doesn't mean there's no relationship in the data. It might just mean you need to look harder or in a different place.

My Dashboard Pet Peeve

by Peter Leppik on Fri, 2014-07-18 18:10

I have a pet peeve about business dashboards.

Dashboards are great in theory. The idea is to present the most important information about the business in a single display so you can see at a glance how it's performing and whether action is required. Besides, jet planes and sports cars have dashboards, and those things are fast and cool. Everyone wants to be fast and cool!

In reality, though, most business dashboards are a mess. A quick Google search for business dashboard designs reveals very few which clearly communicate critical information at a glance.

Instead, you find example after example after example after example after example which is too cluttered, fails to communicate useful information, and doesn't differentiate between urgent, important, and irrelevant information. I didn't have to look far for those bad examples, either: I literally just took the top search results.

Based on what I've seen, the typical business dashboard looks like the company's Access database got drunk and vomited PowerPoint all over the screen.

As I see it, there are two key problems with the way business dashboards are implemented in practice:

First, there's not enough attention given to what's most important. As a result, most dashboards have too much information displayed and it becomes difficult to figure out what to pay attention to.

This data-minimization problem is hard. Even a modest size company has dozens, perhaps hundreds, of pieces of information which are important to the day-to-day management of the business. While not everyone cares about everything, everything is important to someone. So the impulse to consolidate everything into a single view inevitably leads to a display which includes a dizzying array of numbers, charts, and graphical blobs.

Second, the concept of a "dashboard" isn't actually all that relevant to most parts of a business. The whole purpose of making critical information available at a glance is to enable immediate action, meaning within a few seconds. In the business world, "extremely urgent" usually means a decision is needed within a few hours, not seconds. You have time to pause and digest the information before taking action.

That said, there are a few places where immediate action is required. For example, a contact center has to ensure enough people are on the phones at all times to keep the wait time down. In these situations, a dashboard is entirely appropriate.

But the idea of an executive watching every tick of a company dashboard and steering the company second-by-second is absurd. I get that driving a sports car or flying a jet is fun and work is, well, work. But you will never manage a company the way you drive a car. Not going to happen.

But for better or worse, the idea of a business dashboard has resonance and dashboards are likely to be around for a while.

To make a dashboard useful and effective, probably the most important thing is to severely restrict what's included. Think about your car. Your car's dashboard probably displays just a few pieces of information: speed, fuel, the time, miles traveled, and maybe temperature and oil pressure. Plus there's a bunch of lights which turn on if something goes wrong. A business dashboard should be limited to just a handful (3-4) of the most important pieces of information, and maybe some alerts for other things which need urgent attention. This might require having different dashboards for different functions within the company--then again, it would be silly to give the pilot and the flight attendants the same flight instruments.

The other element in a useful dashboard is timing. If the data doesn't require minute-by-minute action, then having real-time displays serves little purpose. In fact, it might become a distraction if people get too focused on every little blip and wobble. Instead, match the pace of data delivery to the actions required: for example, a daily dashboard pushed out via e-mail, with alerts and notifications if something needs attention during the day.

More on Correlation

by Peter Leppik on Wed, 2014-05-14 14:39

Just in case you need more convincing that correlation is not causation, spend some time browsing the Spurious Correlation Generator.

Every day it finds a new (but almost invariably bogus) statistical correlation between two unrelated data sets. Today, for example, we learn that U.S. spending on science and technology is very strongly correlated (0.992) with suicides by hanging.

In the past we've learned that the divorce rate in Maine is correlated to per-capita margarine consumption, and that the number of beehives is inversely correlated with arrests for juvenile pot possession.

These correlations are completely bogus, of course. The point is to illustrate the fact that if you look at enough different data points you will find lots of spurious statistical relationships. With computers and big data, it's trivially simple to generate thousands of correlations with very high statistical significance which also happen to be utterly meaningless.
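
It's easy to demonstrate this for yourself. Here's a short sketch (purely synthetic data) that generates a couple hundred unrelated random "yearly statistics" and then dredges through every pair looking for impressive correlations.

```python
# Sketch: dredge a pile of pure-noise time series for impressive-looking correlations.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

N_SERIES = 200   # imagine 200 unrelated yearly statistics
YEARS = 10       # each observed for 10 years (short series correlate easily)
series = rng.normal(size=(N_SERIES, YEARS))

strong = []
for i in range(N_SERIES):
    for j in range(i + 1, N_SERIES):
        r, p = stats.pearsonr(series[i], series[j])
        if abs(r) > 0.85 and p < 0.01:   # "very strong" and "highly significant"
            strong.append((i, j, r))

print(f"{len(strong)} 'very strong' correlations found among {N_SERIES} pure-noise series")
for i, j, r in strong[:5]:
    print(f"  series {i} vs. series {j}: r = {r:+.3f}")
```

Every one of those hits is meaningless by construction, which is exactly the trick the Spurious Correlation Generator plays at scale.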

Getting It Wrong

by Peter Leppik on Wed, 2014-05-07 15:30

A little over a year ago, Forrester Research issued a report called 2013 Mobile Workforce Adoption Trends. In this report they did a survey of a few thousand IT workers worldwide and asked a bunch of questions about what kinds of gadgets they wanted. Based on those survey answers they tried to make some predictions about what sorts of gadgets people would be buying.

One of the much-hyped predictions was that worldwide about 200 million technology workers wanted a Microsoft Surface tablet.

Since then, Microsoft went on to sell a whopping 1-2 million tablets in the holiday selling season (the seasonal peak for tablet sales), capturing just a few percent of the market.

At first blush, one would be tempted to conclude that Forrester blew it.

Upon further reflection, it becomes clearer that Forrester blew it.

So what happened? With the strong disclaimer that I have not read the actual Forrester report (I'm not going to spend money to buy the full report), here are a few mistakes I think Forrester made:

  1. Forrester was motivated to generate attention-grabbing headlines. It worked, too. Ignoring the fact that there have been Windows-based tablets since 2002 and none of them set the world on fire, Forrester's seeming discovery of a vast unmet demand for Windows tablets generated a huge amount of publicity. Forrester might also have been trying to win business from Microsoft, creating a gigantic conflict of interest.
  2. Forrester oversold the conclusions. The survey (as far as I can tell) only asked IT workers what sort of tablet they would prefer to use, at a time when the Microsoft Surface had only recently entered the market and almost nobody had actually used one. That right there makes the extrapolation to "what people want to buy" highly suspect, since answers will be based more on marketing and brand name than on the actual product. Furthermore, since this was a "global" survey, a substantial fraction of the respondents were probably outside the U.S., Canada, and the E.U., and unlikely to buy (or be issued) a tablet of any sort in the near future.
  3. Forrester let the hype cycle get carried away. I found many, many articles quoting the "200 million Microsoft Surface Tablets" headline without any indication that Forrester did anything to tamp this down. Forrester's actual data basically showed that about a third of the IT workers surveyed said they would prefer a Microsoft-based tablet rather than Android, Apple, or some other brand, and if you believe there are about 600 million information workers worldwide (which Forrester apparently does), that's 200 million people. When that morphed into "Forrester predicts sales of 200 million Surface tablets," they did nothing to bring it back to reality.

All this is assuming that Forrester actually did the survey right, and they got a random sample, asked properly designed questions, and so forth.

At the end of the day, anyone who built a business plan and spent money on the assumption that Microsoft would sell 200 million Surface tablets any time in the next decade has probably realized by now that they made a huge mistake.

As the old saw goes, making predictions is hard, especially about the future.