The scene: a local museum or gallery.
An art patron strolls into a gallery or museum. She is informed that this location supports a wireless network that will enhance her visit, so she pulls out her PDA and browses to the given website. As she works her way from gallery to gallery, and, indeed, from work to work, her PDA immediately senses the closest work, retrieves any relevant information, and speaks it to her through headphones. She is prompted to ask questions regarding this particular work, so, using speech technologies, she speaks a plain English question to the PDA, which immediately answers out loud. The system keeps track of where she paused the longest during her visit and each kind of interaction she had, and it tunes her experience accordingly.
Over the past 25 years or so, information technology has improved in geometric fashion, pushing man-made devices to the limits of known physics. Pure technological advances are the enablers, but real progress is made when the average person can use these advances directly to improve the quality of everyday living and working. The history of applied technology is a series of "jumps" best described as "killer applications." Killer apps materialize infrequently, when maturing underlying technologies converge. Some good examples are the electronic spreadsheet, email and the web browser.
It's been quite a while since the last killer app appeared on the scene. The next one will be more of a "killer genre": widely used voice-enabled human-machine interfaces.
Some may point out that this is old hat, that we've had this capability for some time. However, a true voice-enabled human-machine interface encompasses more than just a machine's ability to convert speech to text and text to speech: the machine must also be able to understand the speech, relate it to a stored knowledgebase, then respond in a useful manner. Until now, the convergence of speech recognition and natural language processing (also known as natural language understanding) has been seen rarely outside the laboratory and has remained the province of highly funded private research endeavors. These research projects are just now maturing into off-the-shelf tools that can be integrated with existing proven technologies to finally foment the next killer app.
The infrastructure required to support the next killer app has also matured, almost to the point of commodity. Input and output devices (PDAs, microphones, speakers, PCs) are reliable, cheap and ubiquitous and are supported by equally reliable, cheap and ubiquitous high-speed networks, wired or wireless, over short and long distances. Problems such as understanding speech in very noisy environments are being solved through the use of clever techniques such as bone-conduction microphones.
The operational tool set components fall into several categories, mirroring the research fields from which they've sprung: speech processing, speech synthesis and natural-language processing (NLP). Speech processing converts speech into text. Speech synthesis converts text into speech. NLP understands grammar: how words connect and how their definitions relate to one another.
Tim Berners-Lee, the inventor of the World Wide Web, said: “Speech technologies bridge the gap between computer language and human language; it helps computers to figure out what people are thinking, and people to figure out what computers are thinking.”
The goal of natural language interaction is to communicate concepts, not words: "It's not how you say it, it's what you mean."
Building machines that communicate using human languages has proven tricky. Now, decades of research and development are finally paying off, delivering a suite of tools and products that enable the construction of effective broad-based applications. The constituent technology categories are described below:
Speech processing empowers computers to recognize - and, to some extent, understand - spoken language. Speech is "eyes free" and "hands free," allowing a device to be used truly anywhere. Effective natural language understanding relies on "context-free" grammars, which allow a speech recognition engine to restrict the words it will recognize to a predefined list, enabling a high recognition level in a speaker-independent environment. Context-free grammars work well with no voice training, cheap microphones, and average CPUs.
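To make the idea of a predefined phrase list concrete, here is a minimal sketch in Python. The grammar, the phrase names and the `constrain` helper are all hypothetical, and the approach is a simplification: a real recognition engine applies its grammar during decoding, whereas this sketch only snaps an already-recognized (possibly garbled) string to the closest phrase the grammar allows.

```python
import difflib
from itertools import product

# Hypothetical toy grammar for a gallery guide: COMMAND -> VERB OBJECT
GRAMMAR = {
    "VERB": ["describe", "repeat", "skip"],
    "OBJECT": ["this painting", "the artist", "the next gallery"],
}

def expand(grammar):
    """Enumerate every phrase the toy grammar can produce."""
    return [f"{verb} {obj}" for verb, obj in product(grammar["VERB"], grammar["OBJECT"])]

def constrain(hypothesis, phrases, cutoff=0.6):
    """Snap a noisy hypothesis to the closest in-grammar phrase, or return None."""
    matches = difflib.get_close_matches(hypothesis.lower(), phrases, n=1, cutoff=cutoff)
    return matches[0] if matches else None

PHRASES = expand(GRAMMAR)
```

With only nine legal phrases, a slightly misheard input such as "discribe this paintin" still resolves to "describe this painting", while input that resembles nothing in the grammar is rejected rather than guessed at.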
Although speech recognition technologies are not new, accuracy rates are just now becoming acceptable for natural language discourse. According to industry experts, speech recognition accuracy is improving 10% per year and, currently (2006), the error rate sits at about 5%, with error rates similar to human error rates achievable by 2011.
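The arithmetic behind a projection like this can be sketched in a few lines. The sketch assumes one particular reading of the figures above, namely that the error rate shrinks by 10% of its current value each year starting from 5% in 2006; the function name is hypothetical, and other readings of "improving 10% per year" would give different numbers.

```python
def project_error_rate(start_rate, annual_reduction, years):
    """Compound a relative yearly reduction in error rate (assumed model)."""
    return start_rate * (1 - annual_reduction) ** years

# Print the assumed trajectory from 2006 through 2011.
for year in range(2006, 2012):
    rate = project_error_rate(0.05, 0.10, year - 2006)
    print(f"{year}: {rate:.2%}")
```

Under that assumption the error rate falls from 5% in 2006 to just under 3% by 2011.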
The ability to synthesize the sound of speech is useful for applications that require spontaneous interaction, or in situations where reading isn't practical (giving instructions to a driver, for example). In products aimed at the general public, it's critical that the output sounds pleasant and human enough to encourage regular use. Lab projects have even demonstrated the ability for a machine to detect the “emotional” state of the speaker and respond in kind or in an appropriate manner.
NLP systems interpret written rather than spoken language. In fact, NLP modules can be found in speech-processing systems that start by converting spoken input into text. Using lexicons and grammar rules, NLP parses sentences, determines underlying meanings, and retrieves or constructs responses. This technology's main use is to enable databases to answer queries entered in the form of a question. A newer application is handling high-volume email. NLP performance can be improved by incorporating a common sense knowledge base - that is, an encyclopedia of real-world rules.
Almost all database query languages tend to be rigid and difficult to learn, not to mention that it is often difficult even for the experienced user to get the desired information out of databases. A natural language interface to the SQL language overcomes the need for users to master the complexities of SQL. Robust software engines exist that provide the ability for users to query databases using plain English by translating utterances into executable database queries. These engines rely on marrying a “semantic model” of the data to the data schema itself.
For example, the user needs to create a verb relationship between the salespeople table and a products table by indicating that "salespeople sell products." The NLP engine uses these relationships to perform natural language parsing of users' questions, which provides better search results than you would get using keyword-based technology. Although your initial goal might be to answer the most common questions your users will ask, the ultimate goal is to identify and model all the relationships between entities in your database. You want to have a semantic model that defines the knowledge domain of your application, thus enabling the NLP engine to provide answers to a wide range of questions without having to identify those questions ahead of time.
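The "salespeople sell products" relationship can be illustrated with a small Python sketch. Everything here is an assumption made for illustration: the table and column names, the `sales` join table, the single question pattern it accepts, and the `to_sql` helper. Commercial engines use far richer parsing; the point is only to show how a verb relationship in a semantic model lets a plain English question be turned into an executable query.

```python
import re
import sqlite3

# Hypothetical semantic model: entity nouns map to tables, and a verb
# relationship ("salespeople sell products") maps to a join table.
ENTITIES = {"salespeople": "salespeople", "products": "products"}
RELATIONSHIPS = {
    ("salespeople", "sell", "products"): {
        "join_table": "sales",
        "subject_key": "salesperson_id",
        "object_key": "product_id",
    }
}

def to_sql(question):
    """Translate 'which <subject> <verb> <object name>' into (sql, params)."""
    match = re.fullmatch(r"which (\w+) (\w+) (.+?)\??", question.strip().lower())
    if not match:
        raise ValueError("question is outside this toy grammar")
    subject, verb, object_name = match.groups()
    for (subj, rel_verb, obj), rel in RELATIONSHIPS.items():
        if subj == subject and rel_verb == verb:
            sql = (
                f"SELECT DISTINCT s.name FROM {ENTITIES[subj]} AS s "
                f"JOIN {rel['join_table']} AS j ON j.{rel['subject_key']} = s.id "
                f"JOIN {ENTITIES[obj]} AS o ON o.id = j.{rel['object_key']} "
                "WHERE lower(o.name) = ? ORDER BY s.name"
            )
            return sql, (object_name,)
    raise ValueError(f"no modeled relationship for '{subject} {verb}'")

def demo():
    """Run one question against a tiny in-memory database."""
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE salespeople (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE products (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE sales (salesperson_id INTEGER, product_id INTEGER);
        INSERT INTO salespeople VALUES (1, 'Ana'), (2, 'Ben');
        INSERT INTO products VALUES (1, 'Widgets'), (2, 'Gadgets');
        INSERT INTO sales VALUES (1, 1), (2, 2), (2, 1);
    """)
    sql, params = to_sql("Which salespeople sell widgets?")
    return [row[0] for row in db.execute(sql, params)]
```

Because the question is resolved through the modeled relationship rather than through keywords, adding a new relationship to the model extends the range of answerable questions without listing those questions in advance.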
Multimodality seamlessly combines graphics, text, audio and avatar output with speech, text, ink, body attitude, gaze, RFID, GPS and touch input to enable a greatly enhanced user experience. It is driven by the convergence of voice, data and content and is enabled by multimedia, IP, speech and wireless technologies hosted on a diversity of devices and device combinations. When compared to single-mode voice and visual applications, multimodal applications are easier and more intuitive to use. The user can pick how best to interact with an application, which is especially helpful with newer, small-form-factor devices. When modalities are used contemporaneously, Mutual Disambiguation (MD), in which each input mode helps resolve ambiguity in the others, lowers input error rates and improves accuracy, performance and robustness.
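A rough sketch of how two input modes can disambiguate each other follows. It assumes each mode reports a score per candidate interpretation and that the modes are independent, so scores can simply be multiplied and renormalized; the `fuse` function, the scores and the artwork names are invented for illustration and are not taken from any particular multimodal system.

```python
def fuse(*modalities):
    """Multiply per-modality scores for shared candidates and renormalize."""
    candidates = set.intersection(*(set(m) for m in modalities))
    joint = {}
    for candidate in candidates:
        score = 1.0
        for modality in modalities:
            score *= modality[candidate]
        joint[candidate] = score
    total = sum(joint.values())
    return {c: s / total for c, s in joint.items()} if total else {}

# Speech alone slightly prefers the wrong artist name.
speech = {"Manet": 0.52, "Monet": 0.48}
# A location sensor (assumed) says the visitor is standing by the Monet.
location = {"Monet": 0.90, "Manet": 0.10}

fused = fuse(speech, location)
best = max(fused, key=fused.get)
```

Here speech on its own would pick "Manet", but once the location evidence is combined with it, "Monet" becomes the clear winner.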
For users who work in an environment where speed and ease of access are critical, speech and natural language understanding technologies hold enormous promise for future applications. As hardware continues to become more powerful and cheaper, human speech understanding should continue to become more accurate and useful to increasingly wider audiences.
Dave Bernard (MCSD, MCDBA) is co-founder, Vice President and Chief Technologist of The Intellection Group, Inc., a technology consulting company that specializes in extranet development and natural language-related technologies. For almost 30 years, Dave has held developer, managerial and executive positions in a variety of industries, including supply chain logistics, healthcare, automotive, property and casualty insurance, retail point of sale, construction, hospitality, real estate, church and academia. He leads the company's development of a technology architecture that unifies web development capabilities with voice recognition, text-to-speech, natural language, RFID and GPS technologies, deliverable to wireless handheld and desktop devices.
http://www.IntellectionGroup.com/SpeechWhitePaper.asp