Speech Technology - June 2008 - (Page 39)
PART 4 OF A 4-PART SERIES
GUEST COLUMN

Caller: Excuse me. (Cough, cough, breath).
Agent: Wow, sounds like a nasty cough you have.
Caller: Yeah, I got it from my daughter in daycare. Anyway, my account number is (Side speech: “Hey Mom! Where did you hide the peanut butter?” “Can’t you see I’m on the phone?”)
Agent: Kids have a great sense of timing, don’t they?
Caller: Tell me about it. OK, that number is one, eight, five (Doorbell, sigh). Do you mind if I get the door?
Agent: Sure, take your time…

Despite the fact that the caller had the same troubles as in the first example, this conversation went much more smoothly because humans have the ability to understand speech that’s not directly applicable to the question at hand, identify a broad array of environmental and nonspeech audio, and react accordingly.

The designer can know more.

True, there will always be words and phrases a recognizer is not programmed to understand. Though it might not be easy, speech recognition engines can, or should, do several things that designers might want to start leveraging. We’re not suggesting that voice systems begin offering health advice or child-rearing tips. However, we are suggesting that using certain pieces of information beyond the basics could result in better error handling. A few ideas include:

• Modeling common nonspeech sounds. There are already acoustic models of filler utterances, like “um” and “uh,” just so we can identify and ignore them. Why not model other common sounds, like coughs and sneezes, barking dogs, doorbells, or honking horns, so that our designs react appropriately when they happen, too?
• Leveraging signal data. If another person (or a radio or television) is making sounds that can be picked up by the system, those sounds are likely to have less acoustic energy than the caller’s voice. And the caller herself may turn her head or cover her mouthpiece when she’s engaging in side speech. Why not compare the caller’s volume from one utterance to the next and flag any significant changes?
• Using prosody. Pitch detection is already used for speech recognition of tonal languages, like Mandarin Chinese. Why not leverage prosodic information in other ways as well, such as to identify the slow hesitations of thought, the rate and volume of anger, or the rising intonation of questions like “Do you mind if I get the door?”
• Turning the timer on. Since a recognizer may already be logging data about utterance duration, why not compare that with the expected time it might take for a caller to answer a given question? It is likely that a too-short or too-long utterance from the caller either wasn’t meant for the system to hear or wasn’t really an answer to the question.

An innate human ability to do all four of the above tasks clued the agent into what was going on in our caller’s environment. If similar information was easy to extract from the recognizer, speech system designers could be equally empowered. For example:

System: Can I get your account number?
Caller: My account number is (Cough, cough, breath).
System: Whenever you’re ready, I’d like to get that account number.
Caller: My account number is (Side speech: “Hey Mom! Where did you hide the peanut butter?” “Can’t you see I’m on the phone?”)
System: Hmmm, I’m not sure that was meant for me. I still need to get your nine-digit account number.
Caller: That number is one, eight, five (Doorbell, sigh). Do you mind if I get the door?
System: If you need me to wait a moment, say “Hold on.” You could also try calling back later, or hang on for the next available agent…

In this last example we strike a balance between human ability and what technology can arguably do. First, the system recognized the cough and didn’t assume the caller tried to say the account number. Next, it detected the quieter volume and unexpected length of the side speech, and didn’t assume the utterance was relevant to the conversation. Finally, it recognized the doorbell and anticipated the caller’s need to step away. We didn’t engage in small talk or understand exactly what was going wrong, but each response was much more relevant to the cause of the error and, therefore, made much more sense.

Two more things to consider are how each of these strategies complements the others and the intelligence they can provide about the overall interaction. Today’s typical system is likely configured to count the number of errors allowed before triggering some kind of max-error condition (such as a transfer or disconnect). But imagine, for example, that once a system detects coughing at the beginning of the call, the design anticipates coughs for the remainder of the conversation. Certain related events, such as false barge-in or a lack of speech, may be planned for and not even be treated like an error at all.

These are just a few ideas for dealing with one scenario. Many more types of errors exist, as do many more tools to address them. Some are available now, while others, like emotion detection, are still works in progress.

Of course, there will always be things humans can do that speech systems cannot. That’s the way it should be. But why not let our abilities to communicate with other humans in all sorts of contexts and environments also inspire us to push our designs to do better in some situations as well? In the end, we’re all just trying to avoid the following scenario:

System: I’m sorry, I didn’t get that.
Caller: Well, duh! There was nothing to get. ■

EDITOR’S NOTES: Thanks to Lizanne Kaiser and Jim Larson for organizing the workshop that generated the content for these four articles on error handling. Kaiser coordinated the writing and submission of the articles. VUI designers will organize their own professional society during SpeechTEK 2008. If you are interested in participating, please contact Susan Hura at susan@speechusability.com.
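The “Leveraging signal data” idea in the list above, comparing the caller’s volume from one utterance to the next, might look something like this minimal Python sketch. It assumes the application can get each utterance as 16-bit PCM samples. The class name, the median baseline, and the 6 dB threshold are our own illustrative assumptions, not features of any particular recognizer.

```python
import math
import statistics
from typing import List, Sequence


def rms_dbfs(samples: Sequence[int], full_scale: float = 32768.0) -> float:
    """RMS level of 16-bit PCM samples, in dB relative to full scale (0 dBFS = loudest)."""
    if not samples:
        return float("-inf")
    mean_square = sum(s * s for s in samples) / len(samples)
    if mean_square == 0:
        return float("-inf")
    return 10.0 * math.log10(mean_square / (full_scale * full_scale))


class VolumeTracker:
    """Hypothetical helper: flags utterances whose level departs sharply from the caller's norm."""

    def __init__(self, threshold_db: float = 6.0) -> None:
        self.threshold_db = threshold_db  # illustrative value, not a validated setting
        self.history: List[float] = []    # levels of utterances judged to be the caller's

    def check(self, samples: Sequence[int]) -> str:
        """Return "normal", "quieter", or "louder" relative to the caller's median level."""
        level = rms_dbfs(samples)
        if not self.history:
            # No baseline yet: treat the first audible utterance as the caller's normal level.
            if level != float("-inf"):
                self.history.append(level)
            return "normal"
        change = level - statistics.median(self.history)
        if change <= -self.threshold_db:
            return "quieter"  # e.g. side speech, a TV in the background, or a turned head
        if change >= self.threshold_db:
            return "louder"
        # Only utterances that look like the caller update the baseline,
        # so flagged side speech does not drag it down.
        self.history.append(level)
        return "normal"
```

A deployed system would probably read these levels from the recognizer’s own logs rather than recompute them, and it would need to tune the threshold against recorded calls.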
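The “Turning the timer on” suggestion can be sketched in the same way. This version assumes the recognizer logs start and end times for each utterance in seconds, and that the designer supplies an expected duration range for each prompt. The prompt names and ranges below are hypothetical placeholders, not measured values.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class ExpectedDuration:
    """Designer-supplied guess at how long a genuine answer to a prompt should take."""
    min_s: float
    max_s: float


# Hypothetical prompt names and ranges, for illustration only.
EXPECTED: Dict[str, ExpectedDuration] = {
    "account_number": ExpectedDuration(min_s=2.0, max_s=8.0),  # nine spoken digits
    "yes_no": ExpectedDuration(min_s=0.2, max_s=2.0),
}


def classify_duration(prompt: str, start_s: float, end_s: float) -> str:
    """Compare a logged utterance length with the expected range for this prompt."""
    if end_s < start_s:
        raise ValueError("end time precedes start time")
    expected = EXPECTED[prompt]
    duration = end_s - start_s
    if duration < expected.min_s:
        return "too_short"  # e.g. a cough or a fragment; maybe not meant for the system
    if duration > expected.max_s:
        return "too_long"   # e.g. side speech; maybe not an answer to the question
    return "plausible"
```

For example, a 0.6-second utterance after the account-number prompt would come back as `"too_short"`. The design could then treat it as a probable cough or false start instead of a misrecognized number.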
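Prosody is harder to pin down. A crude check for the rising intonation of a question like “Do you mind if I get the door?” might compare the pitch at the end of an utterance with the pitch before it. This sketch assumes that some pitch tracker has already produced per-frame F0 estimates in Hz, with 0 for unvoiced frames. The 20 percent tail and the two-semitone rise are arbitrary, hypothetical settings, and a production system would need something far more robust.

```python
import math
from typing import Sequence


def semitones(f0_hz: float, ref_hz: float) -> float:
    """Interval from ref_hz to f0_hz in semitones (12 per octave)."""
    return 12.0 * math.log2(f0_hz / ref_hz)


def has_rising_ending(f0_track: Sequence[float], tail_fraction: float = 0.2,
                      min_rise_st: float = 2.0) -> bool:
    """Crude heuristic for question-like rising intonation.

    f0_track: per-frame pitch estimates in Hz from an assumed pitch tracker,
    with 0.0 marking unvoiced frames.
    """
    voiced = [f for f in f0_track if f > 0]
    if len(voiced) < 5:
        return False  # too little voiced speech to judge
    n_tail = max(1, int(len(voiced) * tail_fraction))
    body, tail = voiced[:-n_tail], voiced[-n_tail:]
    body_mean = sum(body) / len(body)
    tail_mean = sum(tail) / len(tail)
    # Rising if the final stretch sits well above the rest of the utterance.
    return semitones(tail_mean, body_mean) >= min_rise_st
```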
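The article also suggests that a system which detects coughing early in the call could anticipate coughs for the rest of it. That idea could replace a flat error count with something like the sketch below. The event labels (“cough,” “false_barge_in,” “no_speech”) are assumed outputs of a hypothetical nonspeech-sound detector. The mapping of which events to expect is a design choice invented for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, FrozenSet, Optional, Set

# Hypothetical mapping: once a condition is detected, these later events are expected.
ANTICIPATED_AFTER: Dict[str, FrozenSet[str]] = {
    "cough": frozenset({"cough", "false_barge_in", "no_speech"}),
}


@dataclass
class ErrorCounter:
    """Sketch of a max-error counter that ignores events the design already expects."""
    max_errors: int = 3
    errors: int = 0
    conditions: Set[str] = field(default_factory=set)

    def anticipated(self, event: str) -> bool:
        return any(event in ANTICIPATED_AFTER.get(c, frozenset()) for c in self.conditions)

    def on_event(self, event: Optional[str]) -> str:
        """event: label from an assumed detector, or None for an unexplained failure."""
        if event is not None and self.anticipated(event):
            return "wait"  # expected given what we know about this caller; not an error
        if event in ANTICIPATED_AFTER:
            self.conditions.add(event)  # e.g. first cough: remember it for later turns
        self.errors += 1
        return "max_error" if self.errors >= self.max_errors else "reprompt"
```

In this sketch the first cough still counts as an error. Later coughs, false barge-ins, and silences do not move the caller any closer to the max-error condition.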