Notes From the Workshop: A Look at the Sphinx

The life of a consultant has many twists and turns, and one often does not know what is around the corner. Today, I am examining speech recognition technology. For writing books and articles, I have used Nuance's Dragon NaturallySpeaking software (http://www.nuance.com/naturallyspeaking/). However, if I wanted to learn by designing and implementing my own system, I could turn to open source applications, specifically speech technology like Carnegie Mellon University's Sphinx Speech Recognition Engine. The architecture of Sphinx-4 is based on the Java Platform and relies on digitizing sound waves, converting them to phonemes, and then constructing words from those phonemes for analysis. As you might have guessed from my previous posts, this requires statistical modeling to map and match phoneme representations. The matching involves both a grammar and a dictionary, which are discussed in more detail, along with the architecture of Sphinx, here: http://www.try.idv.tw/try/talks/g9104_present.ppt. The home page for the project is at http://cmusphinx.sourceforge.net/sphinx4/.
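
To make the dictionary-and-grammar step concrete, here is a minimal sketch in Python. It is not how Sphinx implements matching. The dictionary entries, the grammar, and the function names `segment` and `recognize` are all hypothetical, and the sketch assumes the phonemes have already been decoded without error (a real recognizer scores many uncertain hypotheses statistically).

```python
# Toy pronunciation dictionary: word -> phoneme sequence.
# These entries are illustrative assumptions, not data taken from Sphinx.
DICTIONARY = {
    "open": ("OW", "P", "AH", "N"),
    "close": ("K", "L", "OW", "Z"),
    "file": ("F", "AY", "L"),
    "door": ("D", "AO", "R"),
}

# Toy grammar (also an assumption): a command is one verb followed by one noun.
GRAMMAR = {
    ("open", "file"),
    ("close", "file"),
    ("open", "door"),
    ("close", "door"),
}


def segment(phonemes, dictionary=DICTIONARY):
    """Return every way to split the phoneme sequence into dictionary words."""
    phonemes = tuple(phonemes)
    if not phonemes:
        return [[]]  # one valid segmentation of nothing: the empty word list
    results = []
    for word, pronunciation in dictionary.items():
        if phonemes[:len(pronunciation)] == pronunciation:
            for rest in segment(phonemes[len(pronunciation):], dictionary):
                results.append([word] + rest)
    return results


def recognize(phonemes):
    """Keep only the segmentations that the grammar accepts."""
    return [words for words in segment(phonemes) if tuple(words) in GRAMMAR]


print(recognize(["OW", "P", "AH", "N", "F", "AY", "L"]))  # [['open', 'file']]
```

The dictionary alone would also accept the phonemes for "file open"; it is the grammar that rejects that word order, which is why both pieces are needed.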

In the presentation, notice the use of Hidden Markov Models (see, e.g., http://htk.eng.cam.ac.uk./) with 3-5 states to model a phoneme, where the output of each state is described by a Gaussian mixture density function. As stated in the PowerPoint, transitions between words are modeled with transition probabilities. For more information, see the white paper at http://cmusphinx.sourceforge.net/sphinx4/doc/Sphinx4Whitepaper.pdf. Another good PowerPoint presentation on speech recognition and HMMs is at http://www.cis.upenn.edu/~cis530/slides-2008/530-speech-rec-2008.ppt. Abate (2007) has a good review article on speech recognition at http://nlp.amharic.org/Members/solomon/papers/ies16conference.pdf.
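
As a rough illustration of the HMM idea, the sketch below scores a short observation sequence against two hypothetical 3-state, left-to-right phoneme models using the standard forward algorithm. Every number in it is invented for illustration. The features are one-dimensional to keep the arithmetic readable, whereas real recognizers use multi-dimensional feature vectors and work with log probabilities to avoid underflow. None of the names come from Sphinx or HTK.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a one-dimensional Gaussian at x."""
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)


def mixture_pdf(x, components):
    """Gaussian mixture density; components is a list of (weight, mean, variance)."""
    return sum(w * gaussian_pdf(x, m, v) for w, m, v in components)


def forward_likelihood(observations, start, trans, emissions):
    """Forward algorithm: P(observations | model), summed over all state paths."""
    n = len(start)
    alpha = [start[s] * mixture_pdf(observations[0], emissions[s]) for s in range(n)]
    for x in observations[1:]:
        alpha = [
            sum(alpha[p] * trans[p][s] for p in range(n)) * mixture_pdf(x, emissions[s])
            for s in range(n)
        ]
    return sum(alpha)


# Hypothetical left-to-right topology: stay in a state or move to the next one.
START = [1.0, 0.0, 0.0]
TRANS = [
    [0.6, 0.4, 0.0],
    [0.0, 0.6, 0.4],
    [0.0, 0.0, 1.0],
]

# Two made-up phoneme models; each state has a two-component mixture.
PHONEME_A = [
    [(0.5, 0.0, 1.0), (0.5, 1.0, 1.0)],
    [(0.5, 3.0, 1.0), (0.5, 4.0, 1.0)],
    [(0.5, 6.0, 1.0), (0.5, 7.0, 1.0)],
]
PHONEME_B = [
    [(0.5, 6.0, 1.0), (0.5, 7.0, 1.0)],
    [(0.5, 3.0, 1.0), (0.5, 4.0, 1.0)],
    [(0.5, 0.0, 1.0), (0.5, 1.0, 1.0)],
]

observations = [0.2, 0.8, 3.5, 3.9, 6.4, 6.8]  # invented rising feature values
score_a = forward_likelihood(observations, START, TRANS, PHONEME_A)
score_b = forward_likelihood(observations, START, TRANS, PHONEME_B)
print("best match:", "A" if score_a > score_b else "B")
```

Recognition then amounts to picking the model (or chain of models) with the highest likelihood; chaining phoneme models into words and words into sentences uses the same transition-probability machinery.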

The Java Speech API lets Java applications incorporate speech technology into their user interfaces. Documentation on the API can be found at http://java.sun.com/products/java-media/speech/reference/api/index.html. There is a good article on "Processing Speech with Java" at http://www.developer.com/java/other/article.php/1471001 and a tutorial at http://javaboutique.internet.com/tutorial/speechapi/.

As far as natural language processing for speech recognition goes, MIT has a good advanced course on NLP at http://ocw.mit.edu/OcwWeb/Electrical-Engineering-and-Computer-Science/6-864Fall-2005/Syllabus/index.htm. NLP works across the linguistic layers: pragmatic, semantic, syntactic, lexical, morphological, and phonetic. For a more complete description, see http://courses.mbl.edu/Medical_Informatics/99.2/Outlines/Starren/jbs_acquisition1.ppt. An interesting application is low-cost note-taking, e.g. http://www.csc.villanova.edu/~tway/publications/wayAT08.pdf. Another example is speech recognition for Family Medicine at https://idealhealth.wikispaces.com/Speech+recognition, which provides information, resources, and links.

Speech recognition is a fascinating field because it translates analog waves into digital signatures that carry semantic content. In the same way, real-time data flows, e.g. streams like stock prices or EEG signals, can be mapped into symbols that are then arranged into clusters or groups to yield more semantic content. This provides multiple layers of abstraction that can be used in inter- and intra-disciplinary ways. Again, this seems like a different path, but it is not.
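
The stream-to-symbol idea can be sketched in a few lines of Python. This is only one simple way to do it, under assumptions of my own: the threshold, the three-symbol alphabet (U, D, F), and the sample prices are all invented for illustration.

```python
from itertools import groupby


def to_symbols(samples, threshold=0.5):
    """Map each step between consecutive samples to 'U' (up), 'D' (down), or 'F' (flat)."""
    symbols = []
    for prev, cur in zip(samples, samples[1:]):
        delta = cur - prev
        if delta > threshold:
            symbols.append("U")
        elif delta < -threshold:
            symbols.append("D")
        else:
            symbols.append("F")
    return symbols


def group_runs(symbols):
    """Collapse consecutive identical symbols into (symbol, run_length) pairs."""
    return [(symbol, len(list(run))) for symbol, run in groupby(symbols)]


prices = [100.0, 101.0, 102.5, 102.6, 102.4, 101.0, 99.5, 99.6]  # made-up stream
symbols = to_symbols(prices)
print("".join(symbols))      # UUFFDDF
print(group_runs(symbols))   # [('U', 2), ('F', 2), ('D', 2), ('F', 1)]
```

The raw numbers become a short symbol string, and the runs form a second, coarser layer (a rise, a plateau, a fall), which is the kind of stacked abstraction described above.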

Monday, January 12, 2009
