Spoken Language Recognition. Based on: Haizhou Li, Bin Ma, Kong Aik Lee, "Spoken Language Recognition: From Fundamentals to Practice" [1]. Stanisław Kacprzak, 27.03.2014, Kraków, DSP Seminar
Problem definition Given a spoken utterance O and a set L of N target languages, we have to decide: Language Recognition / Language Identification (LID): Which of the N languages in L does O belong to? Language Verification: Does O belong to language Li or to one of the other languages?
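The two tasks reduce to two different decision rules over per-language model scores. A minimal sketch, with made-up log-likelihood values standing in for the output of per-language models (the language names and numbers are illustrative, not from the paper):

```python
# Hypothetical per-language log-likelihoods log p(O | L_i) for one utterance O,
# e.g. produced by per-language acoustic or phonotactic models (values are made up).
log_likelihoods = {"english": -1042.7, "mandarin": -1038.2, "spanish": -1051.9}

def identify(scores):
    """LID: pick the language whose model best explains the utterance."""
    return max(scores, key=scores.get)

def verify(scores, target, threshold=0.0):
    """Verification: log-likelihood ratio of the target model against the best
    competing (impostor) model, compared against a decision threshold."""
    impostor = max(v, for_k) if False else max(v for k, v in scores.items() if k != target)
    llr = scores[target] - impostor
    return llr > threshold, llr

print(identify(log_likelihoods))           # best-scoring language
print(verify(log_likelihoods, "english"))  # (accept?, log-likelihood ratio)
```

Identification is an argmax over all N languages; verification is a binary accept/reject decision for one target language, which is why the two tasks are evaluated differently.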
Why do we need language recognition? Multilanguage spoken dialog systems (e.g., informational terminals) Database, archive-search and retrieval systems Human-human communication systems (call routing, automatic translation, emergency calls) [3]
Why do we need language recognition? 1973: Australia introduces telephone interpretation as a fee-free service to respond to its growing immigrant communities. 1981: The first Over-the-Phone Interpretation (OPI) service is offered in the United States. 1981-1990: Telephone interpretation enters major U.S. industries, including financial services, telecommunications, healthcare, and public safety. 1990s: The demand for telephone interpretation grows significantly; contributing factors include lower long-distance call prices, toll-free number access, and immigration trends. 1995: Language services company Kevmark, later known as CyraCom, patents a multiple-handset phone adapted for telephone interpreting. 1999: AT&T sells language services company Language Line Services. 2000s: Telephone interpretation becomes more sophisticated; quality of interpretation, faster connection speeds, and customer service become important to consumers. 2005: The U.S. telephone interpreting market is estimated at approximately $200 million. 2013: Language Line Services acquires Pacific Interpreters.
Real-life example! The company employs approximately 5,000 interpreters and support staff globally, who answer 40 million calls each year. [2]
How do people do it? It was concluded that human beings, with adequate training, are the most accurate language recognizers. This observation still held when confirmed again 15 years later, provided that the human listeners speak the languages. For languages they are not familiar with, human listeners can often make subjective judgments with reference to the languages they know, e.g., "it sounds like German." These judgments are less precise, but they show how people apply linguistic knowledge at different levels to distinguish between certain broad language groups. Given only a little previous exposure, human listeners can effectively identify a language without much lexical knowledge. In this case, human listeners rely on prominent phonetic, phonotactic, and prosodic cues to characterize the languages.
Perceptual cues used for language recognition
Perceptual cues used for language recognition The use of phonetic and phonotactic cues is based on the assumption that languages possess partially overlapping sets of phonemes. (Though there are over 6,000 languages in the world, the total number of phones required to represent all the sounds of these languages ranges only from 200 to 300.)
Phonotactic cues We can study the phonotactic differences between languages by examining how well a phone n-gram model of one language predicts the phone sequences of different languages, in terms of perplexity. A lower perplexity shows that the phone n-gram model better matches the phone sequence; in other words, the phone sequence is more predictable.
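The perplexity comparison above can be sketched with a toy smoothed phone bigram model. The two "phone sequences" below are invented stand-ins for decoded phone strings of two languages; the point is only that a model trained on language A assigns A's phonotactics a lower perplexity than B's:

```python
import math
from collections import Counter

def bigram_model(phones, alpha=1.0):
    """Train an add-alpha smoothed phone bigram model from a phone sequence."""
    bigrams = Counter(zip(phones, phones[1:]))
    unigrams = Counter(phones[:-1])
    vocab_size = len(set(phones))
    def prob(prev, cur):
        return (bigrams[(prev, cur)] + alpha) / (unigrams[prev] + alpha * vocab_size)
    return prob

def perplexity(prob, phones):
    """Perplexity of a phone sequence under a bigram model: 2^(-mean log2 prob)."""
    logp = sum(math.log2(prob(p, c)) for p, c in zip(phones, phones[1:]))
    return 2 ** (-logp / (len(phones) - 1))

# Toy phone strings standing in for the output of a phone recognizer.
lang_a = "s t a s t a s t a s t a".split()
lang_b = "k o k o k o k o k o k o".split()
model_a = bigram_model(lang_a)

# Language A's own phonotactics are far more predictable under model A.
print(perplexity(model_a, lang_a) < perplexity(model_a, lang_b))  # -> True
```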
General scheme of acoustic-phonetic and phonotactic approaches Phonotactic approach example: PRLM (Phone Recognition followed by Language Modeling) Acoustic-phonetic approach example: SDC (Shifted Delta Cepstral coefficients)
Shifted Delta Cepstral coefficients (SDC)
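A minimal sketch of the standard SDC computation under the usual N-d-P-k parameterization: at each frame, k delta-cepstral vectors (each spanning +/- d frames) are taken at shifts of P frames and stacked. The input array is synthetic; only the shape arithmetic is the point here:

```python
import numpy as np

def sdc(cepstra, d=1, P=3, k=7):
    """Shifted Delta Cepstra: at each frame t, stack k delta vectors taken at
    shifts of P frames, where each delta spans +/- d frames.
    cepstra: (T, N) array of N cepstral coefficients per frame.
    Returns a (T', k*N) array over the frames where all k deltas fit."""
    T, N = cepstra.shape
    out = []
    for t in range(d, T - d - (k - 1) * P):
        blocks = [cepstra[t + i * P + d] - cepstra[t + i * P - d] for i in range(k)]
        out.append(np.concatenate(blocks))
    return np.array(out)

# Toy cepstra: 40 frames of 7 coefficients (the common 7-1-3-7 configuration),
# giving 49-dimensional SDC feature vectors.
feats = np.random.default_rng(0).normal(size=(40, 7))
print(sdc(feats).shape)  # -> (20, 49)
```

The stacked deltas capture spectral dynamics over a window of roughly d + (k-1)P frames, which is why SDC features work well for language cues that unfold over longer time spans than a single frame.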
Parallel PRLM (PPRLM)
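Parallel PRLM runs several PRLM subsystems, each built on a phone recognizer of a different language, and fuses their per-language scores. A minimal sketch of equal-weight log-likelihood fusion; the recognizer languages and score values are illustrative, not from the paper:

```python
import numpy as np

# Hypothetical log-likelihood scores from three PRLM subsystems (rows), each
# built on a different phone recognizer, scoring the same three target
# languages (columns). All numbers are made up for illustration.
languages = ["english", "mandarin", "spanish"]
scores = np.array([
    [-310.2, -305.9, -312.4],   # e.g. English phone recognizer front-end
    [-298.7, -290.1, -301.3],   # e.g. Czech phone recognizer front-end
    [-305.5, -299.8, -304.0],   # e.g. Hungarian phone recognizer front-end
])

# Simple fusion rule: sum log-likelihoods across subsystems (equal-weight
# product-of-likelihoods fusion), then pick the best-scoring language.
fused = scores.sum(axis=0)
print(languages[int(np.argmax(fused))])  # -> "mandarin"
```

In practice the fusion weights are usually trained (e.g. by logistic regression) rather than equal, but the structure is the same: several noisy tokenizations of the utterance vote on the language.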
Vector Space Modeling (VSM)
Vector Space Modeling in acoustic-phonetic approaches Creation of a supervector m by stacking the mean vectors of all adapted mixture components derived from a GMM-UBM. Kullback-Leibler (KL) divergence (approximation) KL kernel function Bhattacharyya kernel
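A sketch of the supervector construction and the approximate-KL kernel: scaling each adapted mean by sqrt(w_c) * Sigma_c^(-1/2) before stacking makes a plain inner product between two supervectors equal the KL-based kernel between the two adapted GMMs. All model parameters below are random toy values; only the algebraic identity is the point:

```python
import numpy as np

rng = np.random.default_rng(1)
C, D = 4, 3  # mixture components and feature dimension (tiny, for illustration)

# UBM parameters shared by all utterances: weights and diagonal covariances.
w = np.array([0.1, 0.2, 0.3, 0.4])
sigma = rng.uniform(0.5, 2.0, size=(C, D))  # diagonal covariances

def supervector(means):
    """Stack the C adapted mean vectors, each scaled by sqrt(w_c) * Sigma_c^(-1/2),
    so that a plain dot product realizes the approximate-KL kernel."""
    scaled = np.sqrt(w)[:, None] * means / np.sqrt(sigma)
    return scaled.ravel()

# Two utterances, represented by their MAP-adapted component means (toy values).
m_a = rng.normal(size=(C, D))
m_b = rng.normal(size=(C, D))

# The KL kernel computed two ways: directly, and via the supervector dot product.
k_direct = sum(w[c] * m_a[c] @ (m_b[c] / sigma[c]) for c in range(C))
k_super = supervector(m_a) @ supervector(m_b)
print(np.isclose(k_direct, k_super))  # -> True
```

This linearization is what lets a standard linear SVM operate directly on GMM supervectors.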
Intersession Variability Vocal Tract Length Normalization (VTLN) Feature-level latent factor analysis (fLFA) U: session variability matrix Feature compensations i-vector paradigm T: total variability matrix
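In the i-vector paradigm, each utterance supervector is modeled as M = m + Tw, where m is the UBM mean supervector, T the low-rank total variability matrix, and w the i-vector (standard normal prior). A toy sketch of the generative model; recovering w by least squares stands in for the real extractor, which computes a Gaussian posterior mean weighted by per-component occupation counts:

```python
import numpy as np

rng = np.random.default_rng(2)
SV_DIM, IV_DIM = 12, 4  # supervector and i-vector dimensions (tiny, for illustration)

m = rng.normal(size=SV_DIM)            # UBM mean supervector
T = rng.normal(size=(SV_DIM, IV_DIM))  # total variability matrix (tall, low-rank)
w = rng.normal(size=IV_DIM)            # i-vector for one utterance

# Generative model: the utterance supervector shifts from the UBM mean
# along the low-dimensional total variability subspace.
M = m + T @ w

# With T known and no noise, w is recovered by least squares; the real
# i-vector extractor instead computes a posterior mean from Baum-Welch stats.
w_hat = np.linalg.pinv(T) @ (M - m)
print(np.allclose(w, w_hat))  # -> True
```

The payoff is that each utterance, whatever its length, is summarized by one fixed-length low-dimensional vector w, on which session compensation (e.g. WCCN, LDA) and scoring are cheap.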
Corpora The availability of sufficiently large corpora has been the major driving factor in the development of speech technology in recent decades. 1990s: OGI telephone speech database (OGI-11L, OGI-22L) Conversational corpora: CallHome (6 languages), CallFriend (12 languages) NIST LREs: 1996, 2003, 2005, 2007, 2009, and 2011.
NIST LREs
Results The MITLL NIST LRE 2011 Language Recognition System [4]
Future directions We have not been able to effectively venture beyond acoustic-phonetic and phonotactic knowledge, despite strong evidence from human listening experiments that prosodic information, syllable structure, and morphology are useful knowledge sources.
References 1. Haizhou Li, Bin Ma, Kong Aik Lee, "Spoken Language Recognition: From Fundamentals to Practice," Proceedings of the IEEE, vol. 101, no. 5, pp. 1136-1159, May 2013, doi: 10.1109/JPROC.2012.2237151. 2. http://www.languageline.com/ 3. Navratil, Jiri, "Spoken language recognition: a step toward multilinguality in speech processing," IEEE Transactions on Speech and Audio Processing, vol. 9, no. 6, pp. 678-685, 2001. 4. Singer, Elliot, et al., "The MITLL NIST LRE 2011 language recognition system," Proc. IEEE ICASSP, 2012.