BharatGen: IIT Bombay’s AI That Thinks in India’s Own Languages (2026)

Most of the world’s major AI models were built in San Francisco, trained on English text scraped from the Western internet, and designed for people who speak the languages of Silicon Valley. That works fine if you happen to be one of them. It works considerably less well if you think, speak, work, and live in Hindi, Tamil, Telugu, Bengali, Marathi, or any of the other languages spoken daily by over a billion people.

That gap is exactly what BharatGen is trying to close. And this week, the initiative got its biggest public moment yet: a global showcase at Bharat Innovates 2026 in Nice, France, as part of the India-France Year of Innovation.

What is BharatGen, exactly?

BharatGen is India’s government-backed sovereign AI initiative, built out of IIT Bombay’s Department of Computer Science and Engineering and led by Professor Ganesh Ramakrishnan. The consortium behind it brings together nine premier academic institutions, including IIT Madras, IIT Kanpur, IIT Hyderabad, IIIT Hyderabad, IIT Mandi, IIT Kharagpur, IIIT Delhi, and IIM Indore. A team of 60+ researchers, engineers, and linguists is doing the actual building.

The mandate is straightforward: build foundational AI that works natively across all 22 scheduled Indian languages, covering text, speech, and documents. Not translated. Not adapted. Trained from scratch on Indian data, for Indian contexts.

It is backed by the Department of Science and Technology (DST) and the IndiaAI Mission, with total funding commitments exceeding ₹1,200 crore across DST and MeitY, making it the largest beneficiary of India’s national AI budget so far.

The four models that actually matter

Here is what BharatGen has built and released. These are not prototypes or research papers. They are live models on Hugging Face and the government’s AIKosha repository.

Param2: the text brain

Param2 is BharatGen’s flagship language model. It is a 17-billion-parameter system built on a Mixture of Experts (MoE) architecture. Here is what that means in plain terms: the model has 17 billion total parameters, but it only activates 2.4 billion of them per token during inference. You get high-capacity performance at far lower compute cost. Think of it as a team of specialists where only the relevant experts get called in for each task, instead of everyone showing up to every meeting.
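To put rough numbers on that trade-off, here is a back-of-the-envelope sketch. The parameter counts come from the paragraph above; the "about 2 FLOPs per active parameter per token" figure is a common rule of thumb for a forward pass, used here as an assumption rather than a measured property of Param2.

```python
# Parameter counts quoted in the article.
TOTAL_PARAMS = 17e9    # all experts combined
ACTIVE_PARAMS = 2.4e9  # parameters actually used for one token

# Assumption: a forward pass costs roughly 2 FLOPs per active parameter per token.
FLOPS_PER_PARAM = 2

active_fraction = ACTIVE_PARAMS / TOTAL_PARAMS
moe_flops_per_token = FLOPS_PER_PARAM * ACTIVE_PARAMS
dense_flops_per_token = FLOPS_PER_PARAM * TOTAL_PARAMS
dense_to_moe_ratio = dense_flops_per_token / moe_flops_per_token

print(f"Active share of parameters per token: {active_fraction:.1%}")
print(f"Estimated compute per token vs. a dense 17B model: {dense_to_moe_ratio:.1f}x lower")
```

Under these assumptions, roughly one parameter in seven does work for any given token. The saving is in compute, not storage: all 17 billion parameters still have to be held in memory, because any expert may be needed for the next token.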

Param2 was trained on approximately 22 trillion tokens across two pretraining phases, covering English, Hindi, and 21 other Indian languages. It supports reasoning, coding, tool calling, and mathematical tasks. A router sends each token to a small subset of the model’s 64 specialized experts, while two always-active “shared” experts are designed to maintain stable cross-lingual understanding, including the kind of code-switching common in Indian daily conversation (Hindi-English, Tamil-English, and so on).
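The routing idea can be sketched in a few lines of plain Python. This is a toy illustration of top-k routing with shared experts, not BharatGen's implementation. The expert counts come from the paragraph above, but three things are assumptions: `TOP_K` (the article does not say how many routed experts fire per token), the reading that the two shared experts sit alongside the 64 routed ones, and the experts themselves, which are stand-in scaling functions rather than neural networks.

```python
import math

NUM_ROUTED = 64  # specialized experts, from the article
NUM_SHARED = 2   # always-active shared experts, from the article
TOP_K = 4        # hypothetical: routed experts chosen per token (not stated in the article)

def softmax(logits):
    """Turn router scores into probabilities (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def route(router_logits, top_k=TOP_K):
    """Pick the top_k routed experts and renormalize their weights to sum to 1."""
    probs = softmax(router_logits)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    norm = sum(probs[i] for i in chosen)
    return [(i, probs[i] / norm) for i in chosen]

def moe_layer(x, router_logits, routed_experts, shared_experts, top_k=TOP_K):
    """Shared experts always run; routed experts run only if the router selects them."""
    out = sum(expert(x) for expert in shared_experts)
    for index, weight in route(router_logits, top_k):
        out += weight * routed_experts[index](x)
    return out

def make_expert(scale):
    """Toy expert: multiplies its input by a fixed scale."""
    return lambda x: scale * x

routed_experts = [make_expert(float(i)) for i in range(NUM_ROUTED)]
shared_experts = [make_expert(1.0) for _ in range(NUM_SHARED)]

# A router that strongly prefers experts 3 and 10 for this token.
logits = [0.0] * NUM_ROUTED
logits[3], logits[10] = 10.0, 9.0

print(route(logits, top_k=2))
print(moe_layer(1.0, logits, routed_experts, shared_experts, top_k=2))
```

Only the selected experts are ever called, which is where the compute saving comes from. The shared experts see every token regardless of language, which is one plausible reason they help with mixed Hindi-English input.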

The model is available on Hugging Face under BharatGen’s non-commercial licence, along with post-training workflows and documentation. Developers and researchers can download, fine-tune, and deploy it for their own use cases right now.

Shrutam2: the ears

Shrutam2 is an LLM-powered automatic speech recognition (ASR) model that transcribes speech in 12 Indian languages: Hindi, Marathi, Tamil, Telugu, Malayalam, Kannada, Odia, Bengali, Urdu, Assamese, Gujarati, and Punjabi. One thing worth calling out is its support for code-mixed speech, such as Hindi-English sentences, which is how a very large number of Indians actually talk. Most ASR systems stumble on this. Shrutam2 was built with it in mind.

Sooktam2: the voice

Sooktam2 is BharatGen’s text-to-speech model. It generates natural speech across 12 Indian languages using reference-guided voice conditioning: give it a short audio clip of a speaker’s voice and it will generate new speech that preserves that person’s accent, tone, and cadence. This is what voice cloning looks like when it works well, and it runs without needing large speaker datasets for each new voice.

The practical applications are significant. Think of government public health announcements that sound like a familiar local voice. Or educational content that speaks in a teacher’s own dialect. Or customer service that doesn’t feel foreign.

Patram: the reader of Indian documents

Patram is a 7-billion-parameter vision-language model built specifically to understand Indian documents. Not generic PDFs. Indian government forms, legal notices, land records, insurance documents, medical reports, and similar materials that have their own layouts, scripts, mixed-language structures, and scanning artifacts that global document models were never trained on. Patram was trained on a curated dataset of real Indic documents and can handle natural language queries: summarize this, extract this field, answer a question about this page.

If you have ever tried to feed an Indian government document into a standard AI model and watched it hallucinate or fail on the formatting, Patram is the answer to that problem.

Why this is bigger than it looks

Here is the thing about language in AI. When a model doesn’t know your language, it doesn’t just translate badly. It fails to understand cultural context, misreads formal versus informal register, gets idioms wrong, struggles with mixed-script inputs, and generally produces outputs that feel like they were written by someone who read a lot of textbooks about your culture but never actually lived in it.

For a country where 22 languages are officially scheduled and hundreds of dialects are spoken daily, this is not a niche problem. It is the problem. The vast majority of India’s population does not communicate primarily in English. AI that only works well in English is AI that works for a small slice of India.

BharatGen is also notable for what it chose to do with its models: release them openly. The models, training workflows, and documentation are on Hugging Face. Any developer, researcher, startup, or enterprise can download Param2, fine-tune it on their domain data, and build applications on top of it. This is meaningfully different from sovereign AI that sits behind a government portal and is never seen again.

The BharatGen Technology Foundation, registered as a Section 8 (not-for-profit) company, is the entity actually running the initiative as a long-term operation, not a one-time research project.

Where BharatGen is already being put to work

The sectors BharatGen targets are not abstract. They map to the areas where the language gap causes real harm.

Healthcare: A farmer in a rural district who needs medical information and speaks Odia or Kannada has very few AI-assisted options today. A model that understands his language and the local context changes that equation.

Governance: Public services, citizen grievance systems, and government document processing are areas where language barriers create enormous friction. Patram’s document understanding, combined with Param2 and Shrutam2, points toward interfaces where a citizen can speak to a system in their language and get something done.

Education: Content in regional languages at scale is still thin. AI that can generate, translate, and vocalize educational material in 22 languages is a significant enabler for the schools and learners that have been underserved by English-first digital content.

Financial services and insurance: Explaining a policy document or a loan agreement in the customer’s own language, in their own accent, is not a luxury. It is the difference between informed consent and confusion.

Cultural preservation: Languages with smaller digital footprints risk being left out of the AI era entirely. BharatGen’s approach to data-efficient learning for low-resource languages addresses this directly.

The honest picture: what we don’t know yet

BharatGen is an ambitious initiative, and the technology is real. But a few things are worth being clear about.

The models are released under a non-commercial licence for now. Commercial use requires separate arrangements, which limits what enterprises can do with the current public versions. The Hugging Face release is genuinely open in terms of access, but the licence terms matter depending on what you want to build.

Independent benchmark comparisons against global models like GPT-4 variants or Llama 3 on Indic tasks are still limited in the public domain. BharatGen competes with other India-focused efforts like Sarvam AI’s models, and how they stack up head-to-head across different languages and tasks is an open question worth watching.

The Bharat Data Sagar dataset, which underpins much of the training, was assembled through collaboration with local publishers, radio stations, and volunteers. That is the right approach, but the depth and quality of coverage will vary across languages. The 22-language target is ambitious; the quality across all 22 will not be uniform.

None of this undercuts what has been built. It just means the work is ongoing, as it should be with foundational infrastructure of this scale.

How to access BharatGen’s models right now

If you want to try the models, here is how:

1. Go to Hugging Face. Visit huggingface.co/bharatgenai and browse the available models. Param2, Sooktam2, Patram, and earlier versions of the Shrutam ASR models are all there.
2. Choose what you need. For text and reasoning, start with Param2-17B-A2.4B-Thinking. For speech-to-text, look at the Shrutam series. For text-to-speech with voice conditioning, that is Sooktam2. For document understanding, Patram-7B-Instruct.
3. Follow the model cards. Each model page has inference code, language lists, and usage notes. For Sooktam2, you will need a short reference audio file of the speaker you want to clone.
4. Check AIKosha too. The India AI Kosha repository also hosts BharatGen models with additional documentation from the IndiaAI Mission side.
5. Read the licence before building. The non-commercial restriction is important. If you are building something for commercial deployment, check with the BharatGen team through the contact form on bharatgen.com.
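
For the text model, the first two steps might look like the sketch below. Several things here are assumptions, not confirmed details: that the repository ID is simply the organisation name plus the model name quoted above, that the model loads through the standard `transformers` auto classes without extra flags, and that `transformers` is installed. The model card is the authority on all three. The speech entries are left empty because the article names only the model series, not an exact repository.

```python
HF_ORG = "bharatgenai"  # from huggingface.co/bharatgenai

# Task -> model name as quoted in the article. None means the article gives
# only a series name, so no exact repository is guessed here.
MODEL_FOR_TASK = {
    "text": "Param2-17B-A2.4B-Thinking",
    "document": "Patram-7B-Instruct",
    "speech-to-text": None,
    "text-to-speech": None,
}

def repo_id(task):
    """Build an assumed Hugging Face repository ID of the form org/model-name."""
    name = MODEL_FOR_TASK[task]
    if name is None:
        raise LookupError(f"No exact model name known for {task!r}; check the org page.")
    return f"{HF_ORG}/{name}"

def load_param2():
    """Download and load the text model. Assumes the standard auto classes work;
    the model card may require different classes or arguments."""
    # Imported lazily so the helpers above work without transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = repo_id("text")
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id)
    return tokenizer, model

print(repo_id("text"))
print(repo_id("document"))
```

Calling `load_param2()` downloads the full 17-billion-parameter checkpoint, so expect a large download and substantial memory use. The licence check in the last step applies before any of this goes into a product.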

The global context

India is not alone in pursuing sovereign AI. The EU has its own AI Act and associated model efforts. France, Japan, South Korea, and the UAE have all put national resources into homegrown foundation models. What makes BharatGen distinctive is the scale of the linguistic problem it is solving and the fact that it is doing it openly. Most national AI initiatives produce models that stay inside government systems. BharatGen is putting its work on Hugging Face and inviting the developer community to use it.

For any country or community grappling with the fact that the dominant AI systems were not trained on their languages, BharatGen is worth paying attention to as a model for how this can be approached. The technical decisions (MoE architecture, data localization, domain-specific models rather than one giant general model) are applicable far beyond India.

The most interesting thing about BharatGen is not that India built an AI. It is that India built it in a way that lets 1.4 billion people use it in the language they think in.