Voice Portals and Multimodal Technology
Innovative speech applications benefit companies and their customers
DAVID SHADBOLT
The evolution of speech technology and established open voice standards have led to the development of voice portals, voice-enabled user interfaces on desktops and mobile devices, and the use of speech technology in applications as diverse as mobile workforce support and air traffic control simulation training programs.
In customer support, speech technology is the latest driving force for improvement: multichannel marketing may have increased point-of-sale opportunities, but a company still has to keep the customer smiling. In a recent global survey by Genesys Telecommunications Laboratories, 56% of respondents rated customer service as more important than the product itself, which ranked second in importance at 28%. After a bad customer experience, 63% of consumers will stop using a company's product or service, and among consumers between 18 and 25, where brand loyalty seems nonexistent, the figure rises to 100%. Conversely, 76% of customers say they would buy from a company based on a positive call-center experience.
Speech technology gives customers the opportunity to interrupt voice menus by speaking naturally and go directly to the service required. This raises satisfaction levels among consumers, who often find interactive voice response (IVR) systems confusing, frustrating and time-consuming. According to speech technology vendor Nuance, a survey conducted in 2000 showed that 80% of customers preferred speech to touch-tone keypads, and 84% rated speech interaction equal to or better than Web-based service.
The voice recognition process works as follows: the speaker speaks into the phone, which captures the analog signal and converts it to a digital one. The speech recognition engine converts the digital signal into phonemes (the smallest segments of speech), and the application then matches those phonemes against the words defined in its grammar.
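In the final step, the grammar is typically a declarative list of the words and phrases the application will accept. The following minimal sketch, an illustration rather than any vendor's actual file, uses the W3C Speech Recognition Grammar Specification (SRGS) XML format with hypothetical service names:

<?xml version="1.0" encoding="UTF-8"?>
<grammar xmlns="http://www.w3.org/2001/06/grammar"
         version="1.0" xml:lang="en-US" root="service">
  <!-- The root rule lists the utterances the recognizer will accept -->
  <rule id="service" scope="public">
    <one-of>
      <item>billing</item>
      <item>repairs</item>
      <item>new orders</item>
    </one-of>
  </rule>
</grammar>

Because the engine only has to match phoneme sequences against this short list rather than an open vocabulary, constrained grammars of this kind keep recognition fast and accurate.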
Developing a voice portal gives corporate management the unusual experience of not only improving customer satisfaction but also reducing call center costs. IBM, which offers WebSphere Voice Server, estimates that voice recognition reduces customer service costs at call centers by 10 to 30 cents per call against typical per-call costs of $5 to $10, while Genesys claims its Voice Portal costs as much as 40% less to buy than IVR systems. Genesys also claims that a standard implementation of 200 ports in a 200-agent corporate contact center can expect a return on investment in as little as eight months.
Verizon set out to provide a voice portal for its 230,000 employees, many of them mobile
or without access at work to the intranet (eWeb.verizon.com) containing the corporate
directory and other information. The eWeb Voice Portal has proven very successful, says
project lead Lu Shen. "We wanted to give employees anytime, anywhere access without
incurring the cost of additional operators. It's now receiving 114,000 calls per month,
with a 5.4 minute average call length. It has also won Verizon's most prestigious award,
the Verizon Excellence Award."
Following its success with its intranet, Verizon is developing a nationwide customer-facing application to serve its 38 million customers, millions of whom do not have
Internet access. "By giving them convenient access to their Verizon accounts," says Shen,
"we will reduce the number of person-to-person calls, which has a huge cost benefit.
Fifteen features are either being supported or will soon be supported. These include bill
summary, payment, bill history, popular services and products, and pricing information."
One leading European technical and IT outsourcer, twenty4help, has implemented Genesys' Voice Portal. The company employs 1,700 people who handle more than 850,000 phone and Web-based interactions per month in 15 languages at five European customer contact centers. It provides technical support for major software manufacturers, with 85%
of the support conducted by phone and the balance via fax or by Web-based contact such
as e-mail, text chat and Web collaboration. Providing comprehensive technical support
24 hours a day, seven days a week demands a successful integration of hardware and
software solutions. Ralf Rottman, the European IT manager for twenty4help, says,
"Because we are acting as outsourcing partners, we have to fight to decrease support cost
but increase service availability and service quality. Speech-driven self-service
applications offer customers the flexibility of moving from a self-service to an agent-assisted transaction as needed. Voice Portal is not only fast and easy to install, but also
easy to configure. In addition, it can be combined with existing Web databases and
applications through VoiceXML." Recently expanded language support for French,
Italian, German and Spanish further enables global capabilities for voice self service.
Programming and management of voice response applications are aided by Voice Extensible Markup Language (VoiceXML), which was released publicly by the VoiceXML Forum in 2000. The markup language enables audio dialogs that feature synthesized speech and digitized audio, recognition of spoken and dual-tone multifrequency (DTMF) key input, mixed-initiative conversations and telephony features such as call transfer. As an application of XML, VoiceXML supports Unicode, and its "xml:lang" attribute provides a mechanism for precise control of the input and output languages and for interpreting input in a language different from the output language(s).
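To give a flavor of the markup, the following is a minimal, hypothetical VoiceXML 2.0 dialog; the grammar file and submit URL are placeholders, and the xml:lang attribute marks the language the document speaks and listens for:

<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml" xml:lang="en-US">
  <form id="service_menu">
    <field name="service">
      <!-- Synthesized speech prompt -->
      <prompt>Would you like billing, repairs or new orders?</prompt>
      <!-- Spoken input is matched against this grammar (placeholder file) -->
      <grammar src="services.grxml" type="application/srgs+xml"/>
      <filled>
        <!-- Hand the recognized value to a server-side handler (placeholder URL) -->
        <submit next="http://example.com/route" namelist="service"/>
      </filled>
    </field>
  </form>
</vxml>

A voice browser interprets such a document much as a Web browser interprets HTML: it plays the prompt, listens for input that matches the grammar and posts the result back to the application.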
Multimodality
Consumers want feature-rich handheld devices that incorporate computer screens, advanced data and messaging applications, and "anytime, anywhere" access, but they have found the small graphical user interface frustrating, and it has hindered market growth. VoiceXML offers manufacturers the possibility of partially overcoming size limitations by adding speech input, alongside the keyboard, keypad, mouse and/or stylus, as a way of accessing applications and Web services, and by adding output in the form of synthesized speech as well as audio, plain text, motion video and/or graphics. Access through this more versatile user interface via PCs, telephones, tablet PCs and wireless personal digital assistants is termed "multimodal." Making it viable required a new set of open application standards.
One answer is Speech Application Language Tags (SALT). The SALT Forum — a standards-setting body that counts Cisco, Intel and Microsoft among its founding members — describes the standard on its Web site as "a lightweight set of extensions to existing markup languages, in particular HTML and XHTML." IBM, Motorola, Opera Software ASA and others developed another standard. IBM Pervasive Computing's multimodal product manager Igor Jablakov says, "We had known from looking at our customers' investments in various technologies that data and voice would merge at some point, but it needed a computing language that would allow programmers to write code for these multimodal applications. Our working group arrived at XHTML+Voice (X+V), a standards-compliant multimodal markup language that uses combinations of XHTML and VoiceXML."
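As a rough illustration of the approach Jablakov describes, the sketch below combines an XHTML text field with a VoiceXML form in a single X+V page. It is a simplified assumption of how such a page can be structured, not code from IBM's products; the element names, grammar file and event binding are illustrative.

<?xml version="1.0" encoding="UTF-8"?>
<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:vxml="http://www.w3.org/2001/vxml"
      xmlns:ev="http://www.w3.org/2001/xml-events">
  <head>
    <title>X+V city lookup (illustrative)</title>
    <!-- VoiceXML fragment: prompts for a city and listens against a grammar -->
    <vxml:form id="ask_city">
      <vxml:field name="city">
        <vxml:prompt>Which city would you like?</vxml:prompt>
        <vxml:grammar src="cities.grxml" type="application/srgs+xml"/>
      </vxml:field>
    </vxml:form>
  </head>
  <body>
    <!-- XML Events binds the voice dialog to the visual field: giving the
         input focus activates the spoken prompt and recognition -->
    <p>City: <input type="text" name="city" ev:event="focus" ev:handler="#ask_city"/></p>
  </body>
</html>

The same page can thus be driven by keyboard, stylus or voice, which is the essence of the multimodal interface described above.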
Concerns over one standard eventually dominating the market may lead some companies
to hedge their bets by building applications to support both SALT and X+V; but both
standards enable developers to use existing skills, thereby reducing the time required to
build Web applications for voice, browser and new multimodal devices. Speech
technology vendors offer developer toolkits. IBM provides its IBM Multimodal Toolkit,
an integrated development environment built on the Eclipse framework. The toolkit
includes a multimodal editor in which developers can write both XHTML and
VoiceXML in the same application, reusable blocks of X+V code and a simulator to test
the applications.
Royal Philips Electronics in the Netherlands has released Speech SDK 3.1, a
customizable software development kit for creating professional applications that benefit
from the features of the Philips speech recognition technology. Armin Scheuer, Philips
media relations manager, says, "The new version includes the world's first VoiceXML
engine to support STT functionality in addition to the existing dialog capabilities. With
SDK 3.1, developers can implement the full capabilities of the recognition engine directly into their application software using any standard software development environment through the PSP C/C++ API."
Embedded Solutions
The small footprint of current speech technology has made it more practical for
manufacturers to embed voice into their own products, particularly for those consumer
products, services and supporting systems that deliver information, communications and
entertainment to on-vehicle and mobile devices. In-vehicle services called telematics
allow users to obtain customized services such as driving directions, emergency roadside
assistance, personalized news, sports and weather information as well as access to e-mail
and other productivity tools. An organization such as Starbucks, McDonald's or the 24-hour copy service Kinko's can link the in-car experience with a channel for selling its
products and services. The Kelsey Group predicts US and European spending in this area
will exceed $6.4 billion by 2006. Voice interaction adds a new customer service
dimension to this sector.
Automobile manufacturers have already begun incorporating speech technology into new
models. Honda is using IBM's Embedded ViaVoice in its navigation technology in select
2003 Honda Accord models. The application has a vocabulary of 150 English-language
commands and recognizes a range of accents. Drivers can ask for directions and hear
responses over their existing audio car systems. To get directions, the driver uses the
"talk" button located on the steering wheel. Jablakov points to another example. "Daimler
is looking at integrating speech within the 2004 minivan. It will allow customers to use
mobile handsets that use Bluetooth technology (an open technology specification for
short-range radio links between mobile PCs, "smart" devices and other portable
machines). A user can simply bring a handset into the vehicle, and it will integrate with
the on-vehicle navigation. Voice access gives the driver a hands-free, eyes-free way of obtaining e-mail and having it read, saved or deleted, booking travel and even obtaining information from his or her 401(k)."
Global Languages & Cultures, a multiservice translation company, has translated voice
messages in on-vehicle global positioning devices as well as voice for general telephone
services. Project manager Francesco Carbonari explains, "We translate material from the
client's script files, record into the target languages at the studio and then send the files
back to the client. The client creates a voice database based on the splicing of the
material. To avoid problems later, we need to know in advance what voice talent is
required. Is it a male or female? And if male, what for them is a good male voice? This
can change according to the culture. In Japanese culture, a good female voice sounds
light and somewhat passive, with a low, soft tone, which wouldn't fit for an American
audience or other countries."
Carbonari continues: "If there is a problem with the final recording, it's a serious issue.
We need to fix the translation, find out where the problem began, and if it's truly a
problem, locate the voice talent, book the studio (it could be in Europe) and record again.
The same costly process results if we have saved files in the wrong format, as WAV, for example, rather than MP3. We try to lower client costs as much as possible, for example, with enumeration, which in the case of driving directions can involve thousands of numbers. To avoid recording all the numbers from zero to a thousand and everything in between, we buy prerecorded language number sets from 1 to 1,000 so that the client avoids paying for studio time."
Server-based Offering
Companies can voice-enable their Web sites, intranets, databases and other applications.
Baltimore-based investment firm T. Rowe Price sought to handle more calls from
retirement plan participants and improve customer satisfaction. The company first had
positive results from the use of IBM WebSphere Voice Response for touch-tone IVR
systems, followed by IBM WebSphere Voice Server (available in 14 languages) with
speech recognition allowing callers to speak selections by name. The company took the
next step when IBM introduced Natural Language Understanding technology.
Participants can use phrases such as "I'd like my balance, please" or "What funds are in
my plan?" instead of being limited to specific numeric keypad functions. In addition, plan
participants can interrupt the system, change their minds, ask for different information or
execute tasks in any order they choose. The company's management finds that the
system's more natural dialogues allow it to handle more transactions per hour and even
support customers who are still using rotary dial "pulse" telephones and are unable to
access IVR touch-tone systems.
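The flexibility described here maps onto VoiceXML's mixed-initiative forms, in which a form-level grammar lets a caller fill any slot in any order. The sketch below is a generic, hypothetical illustration of that mechanism, not T. Rowe Price's or IBM's actual implementation; the prompts, grammar file and URL are assumptions.

<form id="plan_services">
  <!-- A form-level grammar (placeholder file) lets one utterance such as
       "I'd like my balance, please" fill the request directly -->
  <grammar src="plan_requests.grxml" type="application/srgs+xml"/>
  <initial name="start">
    <prompt>How can I help you with your retirement plan?</prompt>
  </initial>
  <field name="request">
    <prompt>You can ask for your balance, your funds or a recent transaction.</prompt>
  </field>
  <filled mode="any">
    <submit next="http://example.com/plan" namelist="request"/>
  </filled>
</form>

Because callers can also barge in over prompts, they are free to interrupt, change their minds or ask for something else at any point in the dialog.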
Mobile Workforce
The convenience of voice can save time for mobile workforces. Newport Wireless, based
in Irvine, California, has a product line called NewportWorks that includes integrated,
wireless solutions for residential real estate agents designed to increase the productivity
and the revenue of mobile workforces. Within this line is anytimeMLS, a solution based on IBM WebSphere Voice Server technology whose voice interface gives residential real estate agents a tool for requesting and retrieving information via mobile phone. During the initial market trial, the test group estimated that they could sell an additional three to five properties per year if they had wireless access to the multiple listing service (MLS). In addition to accessing MLS databases, features such as short text messages and voice call alerts to mobile phones enable users to forward MLS information to customers by e-mail directly from their phones. In the near future, personal contact management tools (address book and calendar) will become available.
Other Applications
The evolution in speech technology seems like good news for training environments and defense departments. Adacel Inc. provides professional training products and services
to the air transportation industries. It has licensed third-party voice recognition and
rVoice text-to-speech technology from Rhetorical for integration into its MaxSim
software used in air traffic control training programs. Simulated scenarios will cover a
wide range of situations from equipment failures to emergency landings.
The United States Defense Advanced Research Projects Agency, together with the
Defense Department's Language and Speech Exploitation Resources initiative, "launched
two speech programs in 2002: Babylon, which aims to develop a portable speech-to-speech translation device, and the Effective, Affordable, Reusable Speech-to-Text (EARS) project, for turning speech recordings into searchable digital text. The programs aim to
develop and deliver improved speech transcription and translation capabilities to
intelligence analysts and the military." (Ed McKenna, "Listen Up," Federal Computer
Week, September 30, 2002.)
One of the goals of EARS is to reduce the word error rate to 10% and to improve foreign-language speech transcription. "Another program requirement," according to Elizabeth
Shriberg, one of the researchers in the program, "is the ability to extend this technology
to other languages, starting with Arabic and Mandarin Chinese."
Babylon's goals, according to program director Kristin Precoda, include building a two-way translator, "a sort of bilingual phraselator. Questions can be asked in English, while
the foreign speakers will have a limited range of answers they can say in their language
for translation into English."
Two-way translation is a capability that IBM has also moved toward. Brian Garr, program manager of Voice and Translation Servers for IBM Pervasive Computing, points out that by combining IBM's machine translation technology with the ViaVoice desktop product,
"ViaVoice Translator allows users to enter text in one language and have it returned in
another language (in English, French, Italian, German and Spanish) either as text or read
out as speech using a Compaq/HP iPAQ PocketPC handheld device."
Innovative uses for speech applications will continue as more companies see the
advantages both for internal use and for their customers. Jablakov says, "We see Web
browsers that support X+V, as well as device manufacturers adding special features to
hardware as their developers create great applications that are both visual and voice
oriented. Basically it's following the natural evolution when different opportunities arise
as a function of approved open standards."
David Shadbolt is a research editor for MultiLingual Computing & Technology. He can
be reached at david@multilingual.com.