YorubaAI: Bridging Language Barrier with Advanced Language Models
Keywords:
Yoruba language, speech-to-text, text-to-speech, natural language processing, low-resource language

Abstract
YorubaAI addresses the digital divide caused by language barriers, particularly for speakers of Yoruba who struggle to interact with advanced large language models (LLMs) such as GPT-4, which primarily support high-resource languages. This study develops YorubaAI, a system for seamless communication with LLMs in the Yoruba language. YorubaAI enables users to submit queries and receive responses in Yoruba, in both text and audio formats. To achieve this, a speech-to-text (STT) model is fine-tuned for automatic Yoruba speech recognition, while a text-to-speech (TTS) model converts Yoruba text into its speech equivalent. Direct interaction with an LLM in a low-resource language like Yoruba typically yields poor results. To mitigate this, retrieval-augmented generation (RAG) is used to supplement the LLM's existing knowledge with additional information. The RAG pipeline is built on a database of Yoruba-language questions and answers, which serves as the primary knowledge base from which YorubaAI retrieves information relevant to the question asked. The content of this database is converted into vector embeddings using Google's Language-Agnostic BERT Sentence Embedding (LaBSE) model, yielding numerical representations that capture the semantic meaning of the texts. The embeddings are stored in a vector store database, enabling efficient search and retrieval. The STT and TTS models were integrated with an LLM through a user-friendly interface built with the Gradio framework. The STT model achieved a word error rate of 13.06%, while the TTS model generated natural-sounding Yoruba speech.
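The embedding-and-retrieval step described above can be sketched as follows. This is a minimal illustration only: the toy `embed` function stands in for the LaBSE model (which in practice would be loaded, e.g., via the `sentence-transformers` library), the in-memory list stands in for a real vector store, and the Yoruba Q&A pairs are invented examples, not entries from the actual database.

```python
import hashlib
import numpy as np

# Toy Yoruba Q&A knowledge base (invented examples, not the real database).
qa_pairs = [
    ("Kini oruko re?", "Oruko mi ni YorubaAI."),
    ("Bawo ni o se n sise?", "Mo n gba idahun lati inu awoṣe ede nla."),
]

def embed(text: str, dim: int = 8) -> np.ndarray:
    """Stand-in for LaBSE: a deterministic unit vector per input text.
    In the real system this would call the LaBSE sentence encoder."""
    seed = int.from_bytes(hashlib.sha256(text.encode("utf-8")).digest()[:4], "big")
    v = np.random.default_rng(seed).standard_normal(dim)
    return v / np.linalg.norm(v)

# Pre-compute embeddings for every stored question (the "vector store").
store = [(embed(q), a) for q, a in qa_pairs]

def retrieve(query: str) -> str:
    """Return the answer whose question embedding is most similar (cosine) to the query."""
    qv = embed(query)
    sims = [float(qv @ ev) for ev, _ in store]
    return store[int(np.argmax(sims))][1]
```

The retrieved answer would then be passed to the LLM as additional context, which is the augmentation step of RAG.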
YorubaAI correctly responded to various queries in pure Yoruba language syntax and thus successfully bridges the AI accessibility gap for Yoruba language speakers.
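The end-to-end flow behind the interface (audio in, STT, retrieval-augmented prompt, LLM, TTS, audio out) can be sketched as below. All four components are stubs standing in for the fine-tuned STT model, the LaBSE-backed retriever, the LLM, and the TTS model; the function names and the prompt format are illustrative assumptions, not the paper's actual implementation.

```python
def transcribe(audio: bytes) -> str:
    """Stub for the fine-tuned Yoruba STT model."""
    return "Kini oruko re?"

def retrieve_context(question: str) -> str:
    """Stub for the LaBSE-based retrieval over the Yoruba Q&A store."""
    return "Oruko eto yii ni YorubaAI."

def ask_llm(prompt: str) -> str:
    """Stub for the LLM call; echoes the question part of the prompt."""
    return prompt.split("Ibeere: ")[-1]

def synthesize(text: str) -> bytes:
    """Stub for the Yoruba TTS model."""
    return text.encode("utf-8")

def answer(audio: bytes) -> tuple[str, bytes]:
    """Full pipeline: returns the response as both text and audio."""
    question = transcribe(audio)
    context = retrieve_context(question)
    prompt = f"Oro amulo: {context}\nIbeere: {question}"
    reply = ask_llm(prompt)
    return reply, synthesize(reply)
```

In a Gradio deployment, a function shaped like `answer` would be wrapped in a `gradio.Interface` with audio input and text/audio outputs.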
Copyright (c) 2025 Kamoli Akinwale Amusa, Tolulope Christiana Erinosho, Olufunke Olubusola Nuga, Abdulmatin Olalekan Omotoso

This work is licensed under a Creative Commons Attribution 4.0 International License.
