We probably all remember the humorous scene in Star Trek IV: The Voyage Home where Scotty attempts to give voice commands to a twentieth-century computer, even trying to use the mouse as a microphone. Or what about the much creepier human-computer interactions with HAL in 2001: A Space Odyssey? Clearly, the dream of voice-operated computer systems has been with us for a long time. With a new partnership announced this week between Microsoft and the International Computer Science Institute (ICSI), that dream is perhaps coming closer to reality.
This announcement comes at a time when voice recognition technologies are already becoming more prevalent. Who hasn't dialed into a phone system that asked for voice input rather than (or in addition to) key presses? And I'm sure just about everyone has seen Apple's iPhone commercials with various celebrities putting Siri to the test, even if they haven't used Siri on an iPhone personally. However, these voice implementations are far from what we see in HAL or the Enterprise.
The reason the current implementations fall short comes down to the concept of prosody, and it is in this linguistic area that the ICSI/Microsoft partnership will begin its research efforts. ICSI is an independent computer science research institute, with affiliations to the University of California at Berkeley. Although ICSI studies a range of computer technologies, director Roberto Pieraccini's background is in speech technology; in fact, he's recently published The Voice in the Machine: Building Computers That Understand Speech (MIT Press), which examines the history of computers and voice technology stretching back six decades.
"We get the benefit of working with the world-class people at Microsoft, but also get to work on real problems," Pieraccini said of the partnership. "It's very important for us to work on real problems, on real data, which we don't have and Microsoft has." As far as what this partnership could achieve, Pieraccini said, "Eventually, we would like to have speech substituting keyboards and mice. We would like to be able to give commands and to interact with machines, not only at the consumer level like we do with Siri or Google Voice Search, but also at the level of doing more important things."
I don't use an iPhone, and if I did, I doubt that I'd use Siri. I've always felt it acts more like a novelty -- best for a little humor rather than getting anything significant done. Even on my Android smartphone, I use the voice capabilities rarely. I've used voice-to-text for hands-free email responses while driving once or twice. When I've used Google Voice Search, it's pretty hit-or-miss whether I get what I intended right away. And in any case, these aren't the sort of tasks a Microsoft Exchange Server administrator, for instance, is greatly concerned with.
So, we come back to prosody. "Speech conveys much more than just the words," said Andreas Stolcke, principal scientist with the Conversational Systems Lab (CSL) at Microsoft, and a key member of this partnership. "Things like the emotional and maybe even physical space of the speaker, the nuances of meaning that would be ambiguous if you didn't have the actual intonation and the timing of what is being said. This is a group of phenomena that linguists call prosody." As a music-lover (and wannabe musician), I like to think of prosody as the natural music of language.
Current voice technologies are based on decoding speech into literal transcriptions of the words, then turning those words into commands. But how does the computer tell if you're making a statement or asking a question? Or how does it deal with sarcasm (of which I'm all too often guilty)? "Our speech interfaces right now ignore this type of information," Stolcke said. "One of the big goals of our collaboration is to look into ways of extracting prosodic information and other 'beyond the word' information about speaker state and so forth, and make that available to computers that people interact with."
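To make the statement-versus-question problem concrete, here is a toy Python sketch. It is purely my own illustration, not anything the ICSI or Microsoft researchers described. It assumes we already have per-frame pitch estimates (in Hz) for an utterance, and it treats a rise in pitch at the end as a hint that the speaker is asking a question. The function name, the thresholds, and the pitch values are all hypothetical; real prosody modeling draws on far richer cues such as timing, energy, and pauses.

```python
from statistics import mean


def classify_by_pitch(pitch_hz, tail_fraction=0.25, rise_ratio=1.10):
    """Toy heuristic: call an utterance a question if its pitch rises at the end.

    pitch_hz      -- per-frame pitch estimates in Hz (hypothetical input)
    tail_fraction -- share of frames treated as the end of the utterance
    rise_ratio    -- how much higher the ending must be to count as a rise
    Both thresholds are illustrative assumptions, not tuned values.
    """
    if len(pitch_hz) < 4:
        return "unknown"  # too few frames to compare body and ending
    tail_len = max(1, int(len(pitch_hz) * tail_fraction))
    body = pitch_hz[:-tail_len]
    tail = pitch_hz[-tail_len:]
    if mean(tail) >= rise_ratio * mean(body):
        return "question"
    return "statement"


# The same transcript, spoken with two different (made-up) pitch contours.
words = "the server is down"
falling = [180, 175, 170, 168, 165, 160, 150, 140]
rising = [180, 175, 170, 168, 165, 170, 200, 230]

for contour in (falling, rising):
    print(f"{words!r} -> {classify_by_pitch(contour)}")
```

The words are identical in both cases, so a system that only sees the transcript has nothing to go on; only the pitch contour separates the two readings.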
This notion of improving spoken communication with computers fits well with Microsoft Research's focus on Natural User Interface, a project that led to the gesture-based Kinect for Windows. Elizabeth Shriberg is another principal scientist at Microsoft's CSL involved with the ICSI/Microsoft partnership. "One of the big challenges that we're actually focusing on is to develop a common framework to a number of these types of capabilities," Shriberg said, "where prosodic cues are used to do something -- some task. We've started doing this already; it's been implemented in a prototype in a lab at Microsoft."
This is one of the most exciting aspects of this project: Although carried out in a lab environment, it's clear that the intent is to find real-world applications for the technology. Shriberg said, "We don't want to be the type of researchers, or this should not be the type of project, where it's sort of Ivory Tower and it stays out there forever. We took problems where we know there's a need, we know that the systems right now don't perform perfectly, and we said, hey, prosody could probably help on this particular problem."
The researchers couldn't specify anything about the prototype they're currently working with, nor could they predict when the research would result in something that would go into a marketable product. Nonetheless, it's good to know that is their aim. It's not hard to imagine the ways a truly intelligent voice technology could be used for IT management. Exchange Server 2013 Preview has simplified the management systems into one web-based console, the Exchange Administration Center (EAC) -- but wouldn't giving voice commands be even easier than point-and-click?
So watch out, Siri; watch out, Google Voice Search -- or better yet, step it up! True voice management of computer systems could be all that much closer due to this ICSI/Microsoft partnership. Now I'd like to go to the coffee machine and give it a command such as, "Tea, Earl Grey, hot," to get a nice beverage -- but instead I'll be forced to punch buttons like some kind of sucker. Oh well.