
The Unsolved Problem of Language Identification: A GMM-based Approach

In a world inundated with data, the ability to systematically and accurately classify large bodies of natural language data is invaluable for natural language processing (NLP) and speech technology applications. One such application is language identification (LID), which attempts to identify a language from a series of randomly spoken utterances (Das & Roy, 2019). LID systems provide the foundations of multimedia mining systems, spoken-document retrieval, and multilingual spoken dialogue systems (Navratil, 2006). At present, however, the LID task remains very much an unsolved problem, with the equal error rate (EER) typically increasing as the duration and quality of the test data decrease (Ambikairajah, Li, Wang, Yin, & Sethu, 2011).

The idiosyncratic nature of natural languages means “rule-based” systems are insufficient for modelling languages, so a central challenge is structuring what appear to be highly unstructured datasets. Probability plays a significant role in natural language processing, as quantitative techniques can account for such idiosyncrasies. Previous research in this field has trained and tested LID systems extensively on telephone speech datasets (e.g., Manchala, Prasad, & Janaki, 2014; Torres-Carrasquillo et al., 2002) and television broadcasts (Madhu, George, & Mary, 2017). However, little research has examined the effect of other data groupings on system performance, including alterations of experimental parameters such as the distance of the speaker from the microphone.

The approach taken in this paper involves building an acoustic model that uses probabilistic representations of speech datasets across 10 languages (Dutch, Russian, Italian, Spanish, Portuguese, German, English, French, Turkish, and Greek). Each language is modelled probabilistically using Gaussian Mixture Models (GMMs).
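Per-language GMM training of this kind can be sketched as follows. This is a minimal illustration, not the authors' implementation: the feature matrices are simulated with random numbers standing in for extracted cepstral features (in practice a front end would supply MFCCs), and the language names, matrix shapes, and choice of 8 diagonal-covariance components are illustrative assumptions.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Hypothetical training data: one (n_frames, n_cepstra) feature matrix
# per language. Real systems would use cepstral features extracted
# from recorded speech; random data stands in here for illustration.
train_features = {
    "english": rng.normal(0.0, 1.0, size=(500, 13)),
    "german": rng.normal(0.5, 1.2, size=(500, 13)),
}

# Fit one GMM per language; 8 diagonal-covariance components is an
# illustrative choice, not a value taken from the paper.
models = {
    lang: GaussianMixture(
        n_components=8, covariance_type="diag", random_state=0
    ).fit(feats)
    for lang, feats in train_features.items()
}
```

Each fitted model is then a compact probabilistic summary of one language's feature distribution, against which unseen speech can later be scored.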
During recognition, quantitative representations of the test data are computed and compared to the cepstral features of the training data. The language of the test speech is hypothesised as the one whose training spectra best match the spectra of the test speech (Zissman & Berkling, 2001). Exploring the performance of such GMMs on different groupings of datasets thereby reveals areas of weakness and corresponding means of improvement.
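The recognition step described above can be sketched as a maximum-likelihood decision: score the test utterance's features under each language model and hypothesise the language with the highest average log-likelihood. Again, this is an assumed illustration with simulated features; the two stand-in language models and their distributions are not from the paper.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)

# Stand-in language models trained on simulated cepstral features with
# deliberately different means so the decision is visible.
models = {
    "english": GaussianMixture(n_components=4, random_state=0).fit(
        rng.normal(0.0, 1.0, size=(400, 13))
    ),
    "german": GaussianMixture(n_components=4, random_state=0).fit(
        rng.normal(2.0, 1.0, size=(400, 13))
    ),
}

# Simulated test utterance drawn near the "german" distribution.
test_feats = rng.normal(2.0, 1.0, size=(100, 13))

# score() returns the mean per-frame log-likelihood of the features
# under each model; the best-matching model gives the hypothesis.
scores = {lang: gmm.score(test_feats) for lang, gmm in models.items()}
hypothesis = max(scores, key=scores.get)
print(hypothesis)  # → german
```

The maximum-likelihood decision rule is what makes a per-language GMM an identifier rather than just a density estimate: no discriminative training is needed, only one generative model per language.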
