National Multimodal Large Language Model Programme (PQ Reply by Minister Josephine Teo)
Use of Non-textual Training Data for Singapore's Common Languages in National Multimodal Large Language Model Programme
Fourteenth Parliament of Singapore – Second Session, Sitting on 10 January 2024
Question
Mr Dennis Tan Lip Fong asked the Prime Minister whether the National Multimodal Large Language Model Programme will incorporate non-textual training data for commonly used languages in Singapore such as Singlish, Malay, Tamil and Chinese dialects to enhance its voice recognition capabilities and widen the accessibility of this programme to Singaporeans.
Answer
Written answer by Mrs Josephine Teo, Minister for Communications and Information and Minister-in-charge of Smart Nation and Cybersecurity (for the Prime Minister)
The National Multimodal LLM Programme aims to develop Large Language Models (LLMs) that are better suited to our local context. The recently released Southeast Asian Languages in One Network (SEA-LION) model was trained on a dataset covering more than 10 languages, including colloquial English (or Singlish), Chinese, Malay and Tamil.
In the next phase, the Programme will look at techniques to incorporate speech data containing non-verbal cues, such as tone and pitch, to augment SEA-LION. To do so, it will first evaluate model performance when non-textual data in standard and colloquial English is added, before moving on to other languages. As we build local expertise in developing and training regional LLMs through this effort, we will closely monitor ongoing developments and adapt our plans as technologies in the field evolve.