In Voice Portals and VoiceXML, Part 1, I introduced voice portals and VoiceXML, a way to deliver Internet-hosted content to anyone who uses a regular phone. Now, let's expand on these topics and examine the components that make up a VoiceXML system and application.

VoiceXML's architecture resembles that of Wireless Application Protocol (WAP). The primary architectural difference is that VoiceXML renders content as voice prompts and audio output rather than on a device's display. To access a VoiceXML application, the user dials a regular phone number (local or toll-free). The call connects to the server that hosts the voice browser software. The voice browser answers the call and uses HTTP to request content from the originating Web server in the form of VoiceXML documents, .wav files, and grammar files. The voice browser then loads the application, and the user can interact with it by voice over the phone.
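
To make the fetch step concrete, here is a minimal Python sketch of how a voice browser might work out which .wav and grammar files to request after it downloads a VoiceXML document. The sample document, the file names, the example.com URL, and the resources_to_fetch function are all hypothetical, and the sketch assumes a document without XML namespaces. It isn't how any particular voice browser is implemented.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

# Hypothetical VoiceXML document, as the originating Web server might return it.
SAMPLE_VXML = """<vxml version="2.0">
  <form id="main">
    <field name="command">
      <grammar src="commands.grxml"/>
      <prompt>
        <audio src="welcome.wav">Welcome. Say help or operator.</audio>
      </prompt>
    </field>
  </form>
</vxml>"""


def resources_to_fetch(vxml_text, document_url):
    """Return absolute URLs of the audio and grammar files the document references."""
    root = ET.fromstring(vxml_text)
    urls = []
    for element in root.iter():  # walks the tree in document order
        src = element.get("src")
        if element.tag in ("audio", "grammar") and src:
            # Relative references resolve against the document's own URL.
            urls.append(urljoin(document_url, src))
    return urls


for url in resources_to_fetch(SAMPLE_VXML, "http://www.example.com/app/main.vxml"):
    print("GET", url)
```

Each printed line stands for one further HTTP request that the voice browser would make before it starts the dialog.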

The voice browser executes voice prompts by playing .wav files or text-to-speech (TTS) output, then listens for spoken or dual-tone multifrequency (DTMF) input. The browser's automatic speech recognition (ASR) software detects verbal input and attempts to match it to the grammar definitions in the VoiceXML application. If the software detects no input or finds no match, the browser prompts the user again.
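
The prompt-and-retry behavior can be simulated in a few lines of Python. This is only a sketch: the collect_input function, the list of canned utterances that stands in for the caller, and the three-attempt limit are my assumptions for illustration, not something VoiceXML or a given ASR engine prescribes.

```python
def collect_input(utterances, grammar, max_attempts=3):
    """Simulate the recognize-and-reprompt loop.

    utterances: what the caller "says" on each attempt (None means silence).
    grammar: the set of phrases the application accepts.
    Returns (matched_phrase_or_None, list_of_events).
    """
    events = []
    for _, heard in zip(range(max_attempts), utterances):
        if heard is None or not heard.strip():
            events.append("noinput")   # nothing detected: prompt again
            continue
        phrase = heard.strip().lower()
        if phrase in grammar:
            events.append("match")
            return phrase, events
        events.append("nomatch")       # input detected but not in the grammar
    return None, events                # attempts (or utterances) ran out


result, events = collect_input([None, "banana", "Help"], {"help", "operator"})
print(result, events)
```

In this run the caller stays silent, then says something outside the grammar, then says "Help", so the loop records a noinput event, a nomatch event, and finally a match.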

A VoiceXML grammar defines the values that the software uses to interpret user input. For example, a grammar can define the verbal input that transfers the user to Help menus. Examples might include the words "help" and "operator," but you can also add words and phrases that reflect user frustration, such as "hate" or "I hate this." (I'm sure you can think of a few more.) You define grammars at the field, form, or application level, and you can either embed the definitions in the VoiceXML document or link to an external grammar file. For links to some good grammar examples, see my June 21, 2001, column.
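
One way to picture grammars defined at different levels is as a set of phrase tables that are consulted from the narrowest scope outward. The Python sketch below assumes that narrowest-scope-first order. The phrase tables, the action names, and the resolve function are invented for illustration; only the Help phrases come from the examples above.

```python
# Hypothetical phrase tables, one per scope. Values are made-up action names.
GRAMMARS = {
    "field": {"checking": "account_checking", "savings": "account_savings"},
    "form": {"start over": "restart_form"},
    "application": {
        "help": "help_menu",
        "operator": "help_menu",
        "hate": "help_menu",
        "i hate this": "help_menu",
    },
}

# Assumption: the narrowest scope is checked first.
SCOPE_ORDER = ("field", "form", "application")


def resolve(utterance):
    """Return (scope, action) for the first grammar that matches, else None."""
    phrase = " ".join(utterance.lower().split())  # normalize case and spacing
    for scope in SCOPE_ORDER:
        action = GRAMMARS[scope].get(phrase)
        if action is not None:
            return scope, action
    return None


print(resolve("Savings"))
print(resolve("I hate  this"))
print(resolve("pizza"))
```

Because the Help phrases sit at the application level in this sketch, a frustrated caller reaches the Help menu from any form or field without each one repeating the definitions.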

TTS provides verbal output of application text. You can use .wav files for voice prompts, but if the file is missing or the text stream is dynamically generated, the TTS engine synthesizes verbal output in an acceptable computer-generated voice. TTS is also the practical choice for dynamic data, for which you can't record a voice prompt for every possible value. Depending on the voice browser, TTS can speak in either a male or a female voice.
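
The fallback from a recorded prompt to TTS boils down to a simple decision, sketched here in Python. The choose_prompt function, its return values, and the file path are hypothetical, and a real voice browser makes this choice internally rather than through code like this.

```python
from pathlib import Path


def choose_prompt(text, recording=None):
    """Decide how to render a prompt.

    Returns ("audio", path) when a recorded file is available,
    otherwise ("tts", text) so the TTS engine speaks the text.
    """
    if recording is not None and Path(recording).is_file():
        return ("audio", str(recording))
    # No recording was given, or the file is missing: fall back to TTS.
    return ("tts", text)


# Dynamic data has no recording, so it always goes to TTS.
print(choose_prompt("Your balance is 42 dollars."))

# A missing .wav file also falls back to TTS (this path is made up).
print(choose_prompt("Welcome.", "/no/such/dir/welcome.wav"))
```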

Voice browsers are expensive, whether an organization hosts its own or uses one of the increasingly popular voice application service providers (Voice ASPs). The main options and costs for hosting a voice browser are as follows.

Option A: Self-hosting

  • $50,000 to $80,000 for a single T1 line that handles 24 concurrent calls and includes software licensing fees for the voice browser, TTS, and ASR

Option B: Outsourced hosting (Voice ASP)

  • Per-port charges, 4 to 8 cents per minute
  • Per-call charges, 7 to 14 cents per minute
  • Setup fees, as high as $3,000
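
For a rough sense of how these figures compare, the Python sketch below estimates how many billed minutes an outsourced service would take to cost as much as the self-hosting outlay. It uses only the numbers listed above and rests on simplifying assumptions: it applies the per-call rate range, charges the maximum setup fee, and ignores any ongoing costs of running your own system, which the figures above don't cover. The break_even_minutes function is my own illustration.

```python
def break_even_minutes(self_host_dollars, setup_fee_dollars, cents_per_minute):
    """Minutes of outsourced usage at which total fees reach the self-hosting cost."""
    gap_cents = (self_host_dollars - setup_fee_dollars) * 100
    return -(-gap_cents // cents_per_minute)  # integer ceiling division


SETUP_FEE = 3_000  # upper end of the setup-fee range

for self_host in (50_000, 80_000):      # low and high self-hosting estimates
    for rate in (7, 14):                # low and high per-call rates, cents per minute
        minutes = break_even_minutes(self_host, SETUP_FEE, rate)
        print(f"${self_host:,} self-hosted vs. {rate} cents/min: {minutes:,} minutes")
```

For scale, a T1 line that carries 24 concurrent calls can handle at most 24 x 60 = 1,440 call-minutes per hour.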