VocaLabs Newsletter: Quality Times


In This Issue:

  • Meeting the Challenge

Meeting the Challenge

By Peter Leppik

SpeechTEK, the speech recognition industry's largest trade show, this year set out to do something never before attempted (or even dared) in the industry: the SpeechTEK Solutions Challenge had seven different teams compete to produce a functional speech-recognition application in under a day.

VocaLabs was an official tester for the Challenge, and as such, we took on a challenge of our own. We set out to evaluate all seven applications before the SpeechTEK show ended, by having 500 different people try each one.

That gave us a little over 48 hours to do seven complete studies from start to finish. Normally, we promise clients turnaround of ten business days. In practice, we often deliver faster, but this would be far and away the most challenging project we've ever undertaken.

Just to make it more interesting, we had a two-hour tutorial scheduled for Tuesday afternoon--less than 24 hours after our study began--where we had promised to review live data from our study of the challenge applications.

While there were the expected bumps along the way, we were up to the task. In the end, study participants made over 4,000 calls in 48 hours, generating gigabytes of meaningful data.

The data will continue to be available on our web site for a few more weeks at http://www.vocalabs.com/challenge.html

The Set-Up

Monday morning, the seven participating teams were given a specification for a speech recognition application for scheduling car-repair appointments. None of the teams knew in advance what the application would be. Using whatever tools they wanted, each team had until Monday afternoon to implement the specification. Monday evening, VocaLabs began testing.

The first hurdle for us was ensuring that we had the complete specification far enough in advance to create our study, and then enough time to complete our internal quality assurance tests. This had to be done under a strict NDA to ensure that the specification didn't leak.

We were also concerned about the strain this test would place on our internal systems. Not only was this the most ambitious study we'd ever attempted, but we also expected significant web traffic from show-goers looking at the results and downloading audio files.

In addition, we didn't know what kinds of problems the study participants would face, and how much help they would need.

Application Metrics

  • Completion: The percentage of participants able to successfully complete a call.
  • Single Call Completion: The percentage of participants whose first call was successful.
  • "Very Satisfied:" The percentage of participants who were very satisfied with their experience.
  • Understanding: The percentage of participants who found the application easy or very easy to understand.
  • Average Time: The average time each participant took to get through the application, including multiple calls (in minutes).
  • Average Call: The length of an average call (in minutes).
  • Calls/Panelist: The average number of calls it took each person to get through the application.
  • Error Recovery: The likelihood that a participant who encountered a speech recognition error would be able to finish the call.
  • "Friendly:" The percentage of participants who viewed the application as friendly.
  • "Helpful:" The percentage of participants who viewed the application as helpful.

Running the Studies

Monday evening, we launched. The technical infrastructure performed well, but we quickly discovered that the Challenge applications were not perfectly debugged. As a result, calls took longer to complete, and some study participants had to make multiple calls.

By carefully inviting only as many participants as we could handle at a time, we managed to keep our T1s running at close to full capacity through the entire study, while still minimizing busy signals. As a result, by Tuesday afternoon--less than 24 hours into our study--we were able to present a meaningful amount of data from each of the seven applications.

Another major obstacle was that we received far more e-mails from study participants than normal. This was partly because the applications were still pretty rough, and many of the participants needed to know what to do after encountering a problem.

Some of the problems were due to the application designs themselves. For example, one of the applications implemented a sassy, street-wise persona. The design team included a cute surprise for show-goers: if the caller was taking too long to answer, the recording would snap, "I gotta life to live, so maybe you can call back later," and hang up. While many of the show attendees appreciated the joke, many of the panelists in our study complained about how rude the system was.

We compared the performance of the seven applications on ten different parameters. Designing a customer service application always involves tradeoffs, and each of the seven teams made different design choices. With seven different ways to schedule a car repair appointment, we wanted to see how each approach performed on a variety of satisfaction, usability, and cost metrics.

We highlighted the top two scores in each metric. There was no "clean sweep," with one team getting top scores across the board, though some did better than others.


Going into the Challenge, there were a lot of questions about whether creating a speech recognition interface from scratch in one day was even possible. Clearly it is, at least in a raw form.

Of course, all seven applications had problems, as there was no time to debug.

Beyond that, there are some interesting lessons in the data:

Be Careful With Your Persona

Whimsical personae can make for a fun demo, but need to be handled with care in a finished application. One recording used the word "darn," which many callers heard as something stronger--so a strong persona must be evaluated very carefully, and with a broad demographic group, since some people are more easily offended than others.

Some Design Choices Are More Important than Others

For example, having a design which is easy to navigate and minimizes errors appears to increase satisfaction and completion rates more than the choice of voice talent or the overall call length. This suggests that creating a simple, understandable callflow, and testing it against a large and diverse set of callers, should be a higher priority than getting the perfect voice talent, or making the call as efficient as possible.

Get It Right The First Time

Higher satisfaction scores go hand-in-hand with higher completion rates. The team with the highest satisfaction score was also the team with the lowest average number of calls per participant.

SpeechTEK Solutions Challenge Application Results
Metric                    Team 1  Team 2  Team 3  Team 4  Team 5  Team 6  Team 7
Completion                   91%     79%     77%     95%     95%     92%     94%
Single Call Completion       74%     66%     46%     81%     79%     71%     84%
Very Satisfied               41%     18%     20%     31%     27%     20%     25%
Understanding                92%     74%     75%     88%     82%     76%     81%
Average Time (min)          4.53    3.95    5.83    3.33    2.83    4.00    5.02
Average Call (min)          4.02    2.73    2.92    2.67    2.22    2.95    4.27
Calls/Panelist              1.13    1.45    2.00    1.25    1.28    1.36    1.18
Error Recovery               76%     49%     44%     78%     72%     73%     83%
Friendly                     60%     52%     50%     51%     62%     43%     45%
Helpful                      34%     21%     25%     25%     30%     26%     27%
