Issuu

International Research Journal of Engineering and Technology (IRJET)

e-ISSN: 2395-0056

Volume: 12 Issue: 11 | Nov 2025

p-ISSN: 2395-0072

www.irjet.net

Marathi Dialect to Pure Marathi using RAG & AI Gaurav Gaikwad1, Deepak Panchal2, Omprasad Deshmukh3, Sonali Patil4 *1,2,3 Department of Computer Science & Engineering, Yadrav (Ichalkaranji) Maharashtra, India. *4 Assistant Professor, Department of Computer Science & Engineering, Yadrav (Ichalkaranji) Maharashtra, India

---------------------------------------------------------------------***---------------------------------------------------------------------

Abstract - The proliferation of Natural Language

rural and regional Marathi-speaking communities, making knowledge access easier and more inclusive.

Processing (NLP) technologies has revolutionized human- computer interaction, yet linguistic diversity remains underserved, particularly for regional languages and their dialects. This project addresses this gap by developing localized NLP tools tailored for Marathi dialects, aiming to bridge the digital divide and promote equitable access to language technologies for Marathi speakers. Marathi, an Indo-Aryan language spoken by over 83 million people in India, exhibits significant dialectal variation across regions such as Deshi, Varhadi, Konkani, and Ahirani. These dialects differ phonologically, lexically, and syntactically from Standard Marathi, rendering generic NLP models ineffective for dialect-specific tasks. This project focuses on creating robust, inclusive tools that account for these variations, enabling accurate dialect identification, machine translation, sentiment analysis, and speech recognition for underrepresented Marathi-speaking communities.

1.1 PROBLEM STATEMENT & OBJECTIVES Marathi has many regional dialects such as Konkani (Devanagari), Varhadi, Ahirani, and Deshi, which differ significantly from Standard Marathi in vocabulary, grammar, and pronunciation. Existing NLP and AI models are trained mostly on standard language forms and therefore fail to understand or translate these dialects accurately. This creates barriers for users who speak in local dialects, especially in rural areas, when interacting with digital systems. To address this issue, there is a need for a system that can: 

accurately translate Marathi dialects into Standard Marathi,



retrieve relevant information from a knowledge base, and



generate correct, context-aware responses using AI.

Key Words: Marathi Dialects, Konkani Translation, Standard Marathi, NLP, RAG Model, LLM, Gemini API, Embeddings, FAISS, Sentence Transformers, LangChain, Streamlit, Vector Database, Document Chunking.

The problem is to develop an integrated, dialect-aware RAG system that bridges the linguistic gap and enables smooth communication between Marathi dialect speakers and AIbased platforms.

1.INTRODUCTION India is a country rich in linguistic diversity, and Marathi is one of its major languages spoken across regions like Konkan, Vidarbha, Marathwada and Western Maharashtra. Each region has its own dialect, creating variations in pronunciation, vocabulary and grammar. These dialectal differences make it difficult for standard NLP systems to correctly understand or process user queries.

The main objectives of this project are: 1.

To develop a dialect-to-Marathi translation system that accurately converts Konkani (Devanagari) and other Marathi dialect inputs into Standard Marathi.

With the growth of Natural Language Processing (NLP) and Large Language Models (LLMs), intelligent systems can now understand human language better—but most of these models are trained only on standard forms of languages. As a result, they struggle when users speak in regional dialects such as Konkani (Devanagari) or other Marathi variants.

To integrate a Retrieval-Augmented Generation (RAG) model for providing context-based, accurate answers from uploaded documents.

To use the Gemini API for high-quality translation and AI-generated responses in Marathi.

To address this challenge, this project develops a RetrievalAugmented Generation (RAG)-based system that translates Marathi dialects into standard Marathi and provides accurate answers using AI. The system translates dialectal input, retrieves relevant information from uploaded documents, and generates context-aware responses in clear Marathi. This approach helps bridge the linguistic gap for

To preprocess and vectorize textual data using Sentence Transformers and store it efficiently using FAISS for fast and relevant information retrieval.

To design a user-friendly Streamlit interface that supports document upload, dialect translation, contextual search, and answer generation in Marathi.

Impact Factor value: 8.315

ISO 9001:2008 Certified Journal

Page 787