In the belief that speech technology is a big, fast-moving, and diverse market, IBM will put its stake in the ground over the next two quarters by giving its portfolio of voice recognition systems a new umbrella term -- Conversational Services.
The company will roll out products that will include speech translation, multimodal interfaces, middleware, natural-language understanding (NLU), text-to-speech, and biometrics.
IBM will soon introduce one of the first products to use visual cues -- such as the movements of the lips and mouth -- to aid in interpreting the spoken word, according to Dr. David Nahamoo, senior manager of the Human Language Technologies Department at IBM's Thomas J. Watson Research Center.
Nahamoo said the product is already in beta with a number of enterprises and will be available in about two years.
Longer term, the visual recognition system could assist in fixed-place environments where gestures add value. In customer relationship management applications, for example, call center personnel could gauge the unspoken mood of a customer by interpreting body language, Nahamoo said.
"The face is sending a message, happiness, sadness, anger. The challenge is how do you model that and integrate it on top of the other [speech] technologies," Nahamoo said.
In the short term, IBM's visual recognition system -- now in beta -- uses a microphone, a camera to monitor lip and mouth movement, and a set of business rules built into the recognition system.
"It might have a policy that if the face is not looking at the camera, the system understands that the person is not talking to me and so the computer can eliminate the sounds as noise," Nahamoo said.
Also, if the lips are not moving but the system is picking up words or sounds, that information is filtered out as extraneous, Nahamoo said.
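The policy Nahamoo describes -- discard audio when the face is turned away or the lips are not moving -- can be sketched as a small rule set. The following Python sketch is purely illustrative; the cue names and the function are assumptions, not IBM's actual system.

```python
# Illustrative sketch of rule-based audio filtering driven by visual cues.
# Field and function names are invented for this example.

from dataclasses import dataclass

@dataclass
class VisualCues:
    face_toward_camera: bool   # is the speaker looking at the camera?
    lips_moving: bool          # did the camera detect lip motion?

def accept_audio_frame(cues: VisualCues, audio_has_speech: bool) -> bool:
    """Return True if this audio frame should be passed to the recognizer."""
    if not cues.face_toward_camera:
        # The person is not talking to the system: treat the sound as noise.
        return False
    if audio_has_speech and not cues.lips_moving:
        # Speech-like sound without lip motion is filtered out as extraneous.
        return False
    return audio_has_speech
```

In a real deployment these rules would be evaluated per audio frame, alongside acoustic noise suppression.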
Some of these technologies will be especially useful in noisy environments, such as a moving car or a stock exchange trading floor, noted Nigel Beck, IBM's director of Voice Systems.
"If the vocabulary in the system is small enough it can recognize some words even in noise, and can especially be trained for digits in something as noisy as a 10-decibel environment," Beck said.
The system builds time-based templates for each movement of the lips and converts that information into the ones and zeros a computer understands.
The unit of visual analysis is called a "viseme," by analogy with the phoneme, the smallest intelligible segment of sound in a word. A viseme is the smallest intelligible segment of a lip gesture; strung together with other visemes, it allows the system to recognize the movements in aggregate as a word.
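The idea of aggregating visemes into words can be illustrated with a toy lookup. The viseme labels and word inventory below are invented for illustration; real recognizers use statistical models over time-based templates, not a table.

```python
# Toy example: mapping observed viseme sequences to words.
# The labels and vocabulary here are made up for illustration.

VISEME_WORDS = {
    ("t", "uw"): "two",
    ("th", "r", "iy"): "three",
    ("p", "ah", "m"): "palm",
}

def recognize(viseme_sequence):
    """Match an observed sequence of visemes against known words."""
    return VISEME_WORDS.get(tuple(viseme_sequence), "<unknown>")
```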
In other recent developments, last week, IBM officials displayed a prototype add-on sled that will fit onto the back of a Palm Inc. handheld. The speech sled contains a DSP (digital signal processing) chip and memory for translating speech to text, and can be used for executing commands to a contact database or appointments calendar, as well as for voice-activated phone dialing.
For dialing out on a handheld device, the system would interact with a Bluetooth-enabled cell phone. Plans also call for a PC card version with a speech DSP chip as well.
IBM is also focusing efforts on its WebSphere middleware products. Last month the company introduced its WebSphere Translation Server that can translate about 500 words per second and supports translation bi-directionally between various languages including English and French, German, Spanish, and Italian, and uni-directionally from English to Chinese, Japanese, and Korean.
Deutsche Bank AG will be one of IBM's first customers for the translation system, an official said.
In May, the company also announced that the latest version of its server-based Voice Server technology is being deployed by companies such as T. Rowe Price, which is using the technology to answer questions from investors in its Retirement Plan Services.
Because it is impractical to ask a customer to push buttons on a phone keypad when there are more than a few choices, IBM created an NLU call center with a two-way question-and-answer voice system. NLU lets the customer drill down for answers without repeating basic information each time. The system also infers which financial product the customer is talking about from the context of the speech, and covers 30,000 different phrases in this one vertical domain.
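The context-carrying behavior described above -- the caller names a product once and follow-up questions inherit it -- can be sketched minimally. This is an assumption-laden toy, with naive keyword spotting standing in for real natural-language understanding; the product names and replies are invented.

```python
# Minimal sketch of dialog context tracking: the system remembers the
# financial product so the caller need not repeat it. Illustrative only.

class DialogContext:
    def __init__(self):
        self.product = None  # last financial product the caller mentioned

    def answer(self, utterance: str) -> str:
        text = utterance.lower()
        # Naive keyword spotting stands in for genuine NLU.
        for product in ("401(k)", "IRA", "mutual fund"):
            if product.lower() in text:
                self.product = product
        if self.product is None:
            return "Which account are you asking about?"
        if "balance" in text:
            return f"Here is the balance of your {self.product}."
        return f"What would you like to know about your {self.product}?"
```

A second question such as "And what else?" is answered against the remembered product, which is the drill-down behavior the article describes.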
Finally, Big Blue is using something called phrase slicing so that users will not have to listen to a mechanized voice that is at times difficult to understand.
Voice models, or actors, spend five to six hours reading phrases in a specific domain. The words and phrases are recorded and chopped into pieces. Phrase slicing takes the sounds from those pieces and reassembles them into new words. Although a word built this way was never spoken by a human being, it sounds as if a human said it.
Natural Language Understanding, along with interpretation of visual cues and computers that sound human, may mean that one day, sooner rather than later, computers will pass the test invented by Alan M. Turing in the early 1950s to determine machine intelligence.
Turing said that a machine can be considered "intelligent" when a person observing a question and answer session between a computer and a human, without seeing the participants, cannot decide which is the person and which is the machine. For better or worse, we appear to be on that indecisive threshold.