Making AI work for all: Bridging the language gap in AI by focusing on languages that are underrepresented in the digital world 

By Oleksii Sharavar, CEO, QazCode 

The rapid evolution of artificial intelligence (AI) has highlighted a significant disparity: the most transformative technology of our time is built on a small number of languages that are already dominant in the digital sphere, potentially exacerbating the exclusion of thousands of other languages from the world of AI. 

For the billions of speakers of low-resource languages, the problem is not merely theoretical – it is very real, affecting the quality of the services they receive today and, more importantly, the services they will be able to develop over the decades to come. 

When Faisal Zia of the GSMA Foundry and Mariona Sanz Ausas and Albert Canigueral Bago of the Barcelona Supercomputing Center walked into the room for our VEON Leadership Team meeting on a sunny afternoon in Barcelona in March 2024, I knew we were looking at a very similar problem for populations that are almost 8,000 kilometers apart: from Catalunya, Spain to Astana, Kazakhstan and beyond, we have to make AI work for the speakers of low-resource languages. 

We could learn from each other to develop a global roadmap for creating LLMs for these languages and, in the case of QazCode and Beeline Kazakhstan, apply this know-how to a dream project: the Kaz-LLM. 

Digital Divide 

Out of nearly 7,000 languages spoken worldwide, only a handful, including Mandarin, English and Spanish, are considered high-resource languages in the digital domain. These languages dominate AI development, while the rest remain underrepresented. This disparity compounds the digital divide, restricting access to AI-powered services and marginalizing communities that speak local, under-resourced languages. 

UNESCO’s warning that a language disappears every two weeks highlights the urgent need to combat this issue. Without proactive measures from businesses, AI risks pushing smaller languages close to extinction rather than championing them. 

Kazakh, rarely spoken outside Kazakhstan's population of 20 million, is a prime example of a language that would benefit from such a focus. It is spoken by relatively few people; it carries the additional complexity that its speakers frequently mix in languages such as Turkish, Russian and English; and, importantly, the government, public sector players such as universities, and the private sector are all focused on investing in the country's digital infrastructure. There was a problem, and a shared will and support to address it. 

Creating an LLM for the Kazakh Language: Kaz-LLM unveiled 

At QazCode, the enterprise and IT services subsidiary of Beeline Kazakhstan, Kazakhstan's largest mobile operator, we started exploring a few years ago how to augment the capabilities of Kazakh speakers with a locally developed, Kazakh-language LLM. In May 2023 we launched KazRoberta, an earlier, 2-billion-parameter Kazakh-oriented language model. 

Developed and published on Hugging Face in 2023, the model was downloaded more than 6,000 times and demonstrated that there is significant appetite for models designed with local linguistic nuances in mind. Its success emboldened QazCode and its partners – the Ministry of Digital Development, Innovations and Aerospace Industry of Kazakhstan, the Institute of Smart Systems and Artificial Intelligence at Nazarbayev University (ISSAI NU), and the Astana Hub – to think bigger. 
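
For developers who want to experiment with a model like this, the standard Hugging Face transformers library is usually all that is needed. The sketch below is illustrative only: the repository name is a placeholder rather than the actual KazRoberta identifier, and it assumes a RoBERTa-style masked language model.

```python
# A minimal sketch of querying a Kazakh-oriented masked language model
# published on Hugging Face. The repository name is a placeholder, not the
# actual KazRoberta identifier.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="example-org/kazroberta-base")  # hypothetical repo id

# RoBERTa-style tokenizers typically use "<mask>" as the mask token.
mask = fill_mask.tokenizer.mask_token

# "Қазақстанның астанасы <mask> деп аталады." – "The capital of Kazakhstan is called <mask>."
for prediction in fill_mask(f"Қазақстанның астанасы {mask} деп аталады."):
    print(prediction["token_str"], round(prediction["score"], 3))
```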

On December 11th, 2024, just days ahead of Kazakhstan's Independence Day and following a close collaboration between many groups of experts, from linguists to developers, the partners unveiled the Kaz-LLM. Trained on over 150 billion tokens that were collected, curated, synthesized and translated, Kaz-LLM was released in 8-billion and 70-billion parameter versions, capable of interacting in Kazakh as well as in Turkish, English and Russian. 
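
To give a sense of how such a model can be put to work, the sketch below shows one way the 8-billion-parameter version could be queried locally with the transformers library. The model identifier is a placeholder rather than the official release artifact, and the chat-template call assumes an instruction-tuned checkpoint.

```python
# A minimal sketch of querying an 8-billion-parameter instruction-tuned model
# such as Kaz-LLM on a local GPU. The model identifier and chat template are
# assumptions for illustration, not the official release artifacts.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "example-org/kaz-llm-8b-instruct"  # hypothetical repo id

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"  # device_map needs accelerate
)

# Ask the model, in Kazakh: "What is artificial intelligence? Explain briefly."
messages = [{"role": "user", "content": "Жасанды интеллект дегеніміз не? Қысқаша түсіндіріп бер."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=200, do_sample=False)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```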

Why a Kazakh LLM matters and where we are going next 

At QazCode we are now focusing on using this sophisticated model in our AI-based products and services. Among the use cases I find most exciting is our upcoming AI tutor, with which we will support the education curriculum, equipping every student (and their family) with a service that augments the learning experience. 

VEON’s digital operator strategy, which we have been implementing since 2021, gives us the consumer interfaces that enable the faster roll-out of these types of services. We have been working on developing the digital app portfolio of our parent company, Beeline Kazakhstan; and now we have the privilege of integrating AI capabilities into these applications to provide services that augment the capabilities of each of our customers 1,440 minutes a day – what we call AI1440 at VEON. 

We were very excited to bring the Kaz-LLM project to fruition in partnership with our counterparts. Now we look forward to taking it to every smartphone in Kazakhstan and to every remote village, enabling millions to get second-to-none services in their native language through AI while also empowering the developer landscape of Kazakhstan. 

Equally importantly, we are excited to continue working with the GSMA Foundry, the Barcelona Supercomputing Center and the international AI community to share our experience and know-how, supporting the wider ecosystem in addressing the AI linguistic gap. Just as we were deeply inspired by BSC's work with Catalan, we hope that the lessons learned from Kaz-LLM will one day serve as a blueprint for other regions and countries, encouraging them to invest in their linguistic heritage and harness AI to preserve, enhance, and celebrate their cultural roots.