India's Road Map Towards Building Foundation LLMs
Building LLMs presents several challenges: the need for extensive computing power, large datasets, and specialized expertise
You're reading Entrepreneur India, an international franchise of Entrepreneur Media.
India has recorded significant growth in artificial intelligence over the past two years, after the Large Language Model (LLM) ChatGPT became the talk of the town and set off a global AI race. However, India is a unique country, home to many cultures, thousands of dialects, and several distinct languages spoken across its regions. This diversity, for all its richness, sometimes makes implementing innovative new policies challenging, and the process consumes considerable resources and time.
Leveraging Open Source for Building Scalable Foundation Models in India
While sharing insights on how India can leverage best practices of open source to build foundational models and LLMs, Pratyush Kumar, co-founder of Sarvam AI, said that this technology is very new and will take time to mature and become more useful. Building LLMs presents several challenges: the need for extensive computing power, large datasets, and specialized expertise. Open source is essential to democratizing AI technology, and it presents both unique challenges and opportunities. India has done well, but it still has a long journey ahead before these technologies mature and become more universally useful.
"And that's where open source as a movement in computing has been very strong. And kudos to the government of India. I think the Bhashini project has been a big success in demonstrating how to do open-source Indian language AI at scale," said Kumar.
The Complexity of Indian Languages
While speaking on what innovative approaches and technologies India can use to address the challenges of building a foundation model, Mohit Sewak, AI researcher and developer relations, South Asia, NVIDIA, said that India's linguistic diversity is immense, with 23 official languages, over 10,500 unique dialects, and 123 unique languages. LLMs like GPT currently support only up to 100 languages, with a tokenizer vocabulary size of 254,000. India's diverse linguistic landscape, however, requires models with an even larger tokenizer vocabulary to handle this multitude of languages and dialects effectively.
"That means we are talking about tens of trillions of tokens of data across these languages if we want a real Indian LLM that can actually do the type of tasks that we expect it to do," said Sewak.
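The tokenizer point above can be illustrated with a minimal sketch (not any specific LLM's tokenizer, and the function name here is purely illustrative). Byte-level BPE tokenizers fall back to raw UTF-8 bytes for text outside their learned vocabulary, and Devanagari characters take three UTF-8 bytes each versus one for Latin letters, so under-represented Indian-language text fragments into far more tokens:

```python
# Illustrative worst case: a word absent from the tokenizer's learned
# vocabulary is split into its individual UTF-8 bytes.

def byte_fallback_token_count(text: str) -> int:
    """Worst-case token count if every character falls back to raw bytes."""
    return len(text.encode("utf-8"))

english = "Hello"   # 5 Latin letters, 1 UTF-8 byte each
hindi = "नमस्ते"      # 6 Devanagari code points, 3 UTF-8 bytes each

print(byte_fallback_token_count(english))  # 5
print(byte_fallback_token_count(hindi))    # 18
```

In practice tokenizers learn multi-byte merges, but such merges are sparse when Indic-language data is scarce in training, which is why both a larger vocabulary and large Indic corpora are needed.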
Alignment with Cultural Sensitivities: The Need for an Insider's View
While addressing whether present language models accurately represent Indian cultural nuances and traditions beyond English, Dr. Kalika Bali, principal researcher at Microsoft, said that LLMs currently possess what can be described as an "outsider's view" of culture. While these models are not entirely ignorant of cultural contexts, their understanding can be superficial.
Indian culture and sensitivities are vastly different from those of the Western world, where most current models are trained. To create effective models for India, it is crucial to incorporate alignment techniques that make models more attuned to Indian cultural nuances.
"I do not think that we can ever have a bias-free system. We can only hope to mitigate the bias as far as possible," Bali further added.
Public-Private Partnerships Are Needed
While speaking on how various stakeholders can collaborate in the journey of making Bharat GPT, Professor Ganesh Ramakrishnan began by highlighting the role of public-private partnerships in the Bharat GPT initiative. The project is supported by the Department of Science and Technology under the NM-ICPS programme, with several IITs and IIMs participating.
He feels that India needs more skilled people to facilitate an open-source culture. He also stressed the importance of algorithmic innovation, particularly in resource-constrained settings: given the limited availability of data across Indian languages, innovative algorithms can play a crucial role in optimizing the use of available data. "A lot more can be done there, and that's where, through this academic-industrial collaboration," said Ramakrishnan.
Measuring the Impact of AI Solutions in India
While addressing the impact of AI solutions and their measurement parameters, Shalini Kapoor, chief technologist APJ, AWS, said that impact will be defined largely by Indian citizens, because they are the ones who will use these solutions, and usage comes only when there is a need. One of the primary metrics is the business value derived from them, covering both immediate benefits and long-term sustainability.
Another metric is cost-effectiveness, which includes not only the initial investment but also ongoing operational costs. "People don't have that much time and energy, cost, effort to waste," said Kapoor. A successful AI solution should also integrate multiple components rather than relying solely on LLMs. Mitigating bias and ensuring the ethical use of AI are equally essential metrics.
All the speakers shared their views at the Global India AI Summit 2024.