Cross-Lingual Transfer Learning for Medical NER and Question Answering in Low-Resource Language 1st Given Name Surname 2nd Given Name Surname 3rd Given Name Surname dept. name of organization (of Aff.) name of organization (of Aff.) City, Country email address or ORCID dept. name of organization (of Aff.) name of organization (of Aff.) City, Country email address or ORCID dept. name of organization (of Aff.) name of organization (of Aff.) City, Country email address or ORCID 4th Given Name Surname 5th Given Name Surname 6th Given Name Surname dept. name of organization (of Aff.) name of organization (of Aff.) City, Country email address or ORCID dept. name of organization (of Aff.) name of organization (of Aff.) City, Country email address or ORCID dept. name of organization (of Aff.) name of organization (of Aff.) City, Country email address or ORCID Abstract—In recent years, advanced Natural Language Processing (NLP) techniques have driven significant improvements in biomedical NER and question answering systems. However, most of these advances have been achieved in high-resource languages such as English, leaving low-resource languages— like Hindi—understudied. This paper presents a unified crosslingual transfer learning framework to address two core tasks in the medical domain: Named Entity Recognition (NER) and Question Answering (QA). First, we employ a bilingual NER pipeline that leverages large-scale, English biomedical corpora and adapts the learned representations to Hindi, using cosine similarity to classify entities into categories such as diseases, symptoms, and consumables. Notably, our state-of-the-art NER model achieves 88% accuracy on benchmark datasets. Second, we develop a cross-lingual QA pipeline that is trained on an English medical QA dataset and subsequently adapted to Hindi using a retrieval-augmented generation (RAG) mechanism. Our model integrates transformer-based architectures (XLMRoBERTa and BLOOM) and domain-focused translation techniques (IndicTrans2/IndicBERT) to mitigate the scarcity of highquality Hindi medical data. Experimental results demonstrate that our approach retains robust performance in entity recognition and delivers coherent, context-rich question answering in Hindi, providing a scalable blueprint for multilingual medical NLP systems in low-resource settings. Index Terms—component, formatting, style, styling, insert I. I NTRODUCTION Accurate extraction of biomedical information through sophisticated Named Entity Recognition (NER) and precise Question Answering (QA) systems is vital for transforming modern healthcare practices. Despite significant advances in Natural Language Processing (NLP) driven by transformerbased language models, these improvements have largely been realized in high-resource languages such as English. Recent studies indicate that over 80% of biomedical research publications are in English [8], whereas less than 5% of annotated biomedical datasets exist for languages like Hindi [9]. This disparity is largely attributable to the limited availability of large, annotated, and reliable datasets in many other languages. Consequently, the lack of robust annotated data in low-resource languages creates a bottleneck that limits the deployment of advanced NLP systems and hinders the global democratization of medical insights [10]. In this paper, we introduce a comprehensive cross-lingual transfer learning methodology designed to overcome these limitations for two core tasks in the medical domain: NER and QA. Cross-lingual transfer learning is a paradigm in which the rich resources and pre-trained models developed for highresource languages are leveraged to transfer knowledge to low-resource languages. This approach capitalizes on shared linguistic structures and semantic similarities across languages, mitigating the scarcity of annotated data and enabling robust NLP performance even in resource-constrained settings [11]. For the NER task, our framework exploits expansive English biomedical corpora to learn high-quality entity representations that are subsequently adapted to Hindi using state-of-theart multilingual models. To address the inherent challenge of semantic categorization—such as differentiating between diseases, symptoms, and consumables—we employ cosine similarity within an embedding space aligned with domainspecific resources. For the QA task, our methodology begins with fine-tuning a generative language model on an extensive English medical QA dataset, which is then adapted to Hindi. This adaptation ensures the generation of coherent and medically precise responses with minimal additional supervision. To further enhance cross-lingual performance, we integrate a RetrievalAugmented Generation (RAG) component that dynamically enriches the model’s outputs by incorporating context from a multilingual database containing both English and Hindi medical documents. The impact of our work is twofold. First, it democratizes access to advanced NLP techniques in the medical domain by enabling low-resource languages to benefit from high-quality, pre-trained models. Second, it establishes a scalable and adaptable framework that can be extended to other languages and domains, ultimately contributing to the broader goal of equitable healthcare information dissemination. By leveraging multilingual transformer models and domain-specific translation techniques, our approach effectively addresses data scarcity while maintaining high performance across tasks, paving the way for future multilingual medical NLP applications. II. L ITERATURE S URVEY This section presents a comprehensive survey of key research studies that have contributed to the development of cross-lingual transfer learning techniques, multilingual medical NLP models, and named entity recognition (NER) systems across low-resource languages such as Hindi. The studies span across innovations in efficient fine-tuning (e.g., LoRA and QLoRA), advancements in multilingual medical corpora, translation strategies for clinical text, and empirical results from shared tasks and benchmarks. Each subsection highlights the methodology and insights from these foundational works that inform our approach to cross-lingual medical question answering. LoRA: Low-Rank Adaptation (Hu et al., 2021) LoRA introduces a parameter-efficient fine-tuning technique for LLMs by freezing pretrained model weights and injecting trainable low-rank matrices into each layer. This allows task-specific adaptation with minimal additional parameters, preserving performance while reducing computational cost. Clinical Text MT with Transformers (Han et al., 2024) This work evaluates Marian, NLLB, and WMT21fb models on clinical English–Spanish translation using the MeSpEn and ClinSpEn-2022 corpora. Transfer learning is shown effective even when the target language (Spanish) was not present in pretraining. MultiCardioNER Task (Lima-López et al., 2024) The shared task focuses on multilingual disease and medication NER from Spanish, English, and Italian cardiology case reports. BERT and LLMs are fine-tuned using domainspecific data. Evaluation is performed using micro-averaged F1 metrics. MedPodGPT (Jia et al., 2024) MedPodGPT incorporates 4,300 hours of transcribed medical podcasts into a multilingual LLM. This improves performance on both English and non-English tasks, especially in zero-shot transfer scenarios. NER for Hindi using MaxEnt (Jain et al., 2022) Two models—BL-MENE and CP-MENE—use maximum entropy classifiers with handcrafted features like gazetteers and POS patterns. The CP-MENE system shows improved F1 scores across multiple medical entity categories in Hindi. NER for Hindi using HAL + CRF (Rani & Lobiyal, 2017) HinTwtNER uses a three-stage pipeline with preprocessing, feature extraction via HAL (Hyperspace Analogue to Language), and CRF-based sequence labeling. This method yields moderate F1 scores across Hindi medical entity types. II. E ASE OF U SE QLoRA: Quantized LoRA (Dettmers et al., 2023) A. Maintaining the Integrity of the Specifications QLoRA builds on LoRA by introducing 4-bit quantization and double quantization techniques to minimize memory footprint during training. It uses paged optimizers and maintains strong task performance even on consumer hardware, facilitating scalable fine-tuning of large multilingual models. The IEEEtran class file is used to format your paper and style the text. All margins, column widths, line spaces, and text fonts are prescribed; please do not alter them. You may note peculiarities. For example, the head margin measures proportionately more than is customary. This measurement and others are deliberate, using specifications that anticipate your paper as one part of the entire proceedings, and not as an independent document. Please do not revise any of the current designations. Cross-Lingual NER in Biomedical Domain (Lancheros et al., 2024) The study uses Google Translate for cheap translation of English biomedical corpora into Spanish, followed by entity replacement using ontologies. BioBERT and RoBERTa-based models are fine-tuned using direct and continuous learning strategies, achieving high F1 scores on Spanish biomedical NER. Medical mT5 (Garcı́a-Ferrero et al., 2024) Medical mT5 is trained on multilingual medical corpora in four languages (EN, ES, FR, IT) using a text-to-text objective. It supports QA, translation, and summarization, outperforming comparable models on multilingual medical benchmarks. III. M ETHODOLOGY In this section, we present our comprehensive methodology designed to enable cross-lingual transfer learning for Hindi—a low-resource language—in two fundamental NLP tasks: Named Entity Recognition (NER) and Question Answering (QA). The objective is to utilize the extensive resources and pretrained models available for high-resource languages such as English, and transfer this knowledge effectively to Hindi, thereby mitigating the challenges posed by limited annotated data. Fig. 1. Overview of the proposed Cross-Lingual Transfer Learning Architecture for Hindi NER and Question Answering. Our approach integrates multiple stages, including data preprocessing, multilingual representation learning, parameterefficient fine-tuning, and task-specific adaptation. By systematically aligning linguistic structures and optimizing model components for Hindi, we enable robust performance in both NER and QA. The methodology is organized into two core pipelines: one for Named Entity Recognition and the other for Question Answering. Each pipeline is described in detail in the subsequent subsections. An overview of the overall architecture is illustrated in Figure 1. A. Cross Lingual Transfer Learning for Named Entity Recognition (NER) 1) Data Acquisition and Preprocessing: To facilitate robust training and cross-lingual transfer for Named Entity Recognition (NER) in the biomedical domain, we curated our training data from two authoritative English-language corpora: the NCBI Disease Corpus [1] and the BC5CDR Corpus [2]. These corpora are widely recognized for their expert-annotated entity mentions and alignment with controlled biomedical vocabularies such as the Medical Subject Headings (MeSH), making them ideal for supervised learning in the highprecision domain of biomedical information extraction. The NCBI Disease Corpus comprises 793 PubMed abstracts annotated with 6,892 disease mentions, covering 790 unique disease concepts. The BC5CDR Corpus contains 1,500 PubMed articles annotated for both diseases and chemicals, with 12,694 disease mentions and 15,935 chemical mentions. All entities in both corpora are grounded to standardized MeSH identifiers, which supports concept-level normalization and plays a crucial role in both entity disambiguation and multilingual alignment. To prepare the data, we parsed the raw datasets to extract titles and abstracts along with corresponding entity annotations, including character-level spans and MeSH-linked concept IDs. These annotations were mapped back to the original text to preserve span alignment. Since biomedical abstracts often exhibit complex syntactic structures, we used domain-specific sentence segmentation strategies to prevent splitting entities across sentence boundaries, which preserved the integrity of multi-token entities. Each segmented sentence was tokenized using a domaincompatible tokenizer suitable for transformer-based models. This tokenizer maintained subword integrity for complex biomedical terms. We then annotated the tokenized sentences using the BIO (Begin-Inside-Outside) tagging scheme. Tokens corresponding to the beginning of an entity were labeled with a B- tag, internal tokens with I-, and all others with O. This alignment between token-level spans and character-level annotations was computed precisely to ensure consistency. A key enhancement was the integration of MeSH identifiers into the BIO tagging pipeline. Each labeled token was coupled with its corresponding MeSH concept ID, enriching the training data with semantic grounding. This semantic augmentation was central to enabling downstream tasks like cross-lingual transfer and entity disambiguation. The final preprocessed data was serialized into a tokenlabel format suitable for sequence labeling. Each token was accompanied by its BIO tag (and optionally its MeSH ID), and sentences were separated by newline characters to mark input boundaries. This structure allowed seamless ingestion by transformer-based NER models while preserving both linguistic and semantic coherence, which is crucial for crosslingual generalization. 2) Training on English Biomedical Datasets: Following preprocessing, we trained our multilingual NER model on the English biomedical corpora using the XLM-RoBERTa architecture. This transformer-based model is pre-trained on over 100 languages and is designed for high contextual understanding and robust cross-lingual transfer, making it suitable for adapting knowledge from English to Hindi. We formulated two tasks: disease mention recognition and chemical entity recognition. The NCBI and BC5CDR corpora were used for disease mentions, and BC5CDR also contributed chemical annotations. These datasets provided gold-standard BIO annotations and standardized MeSH identifiers. We adopted XLM-RoBERTa due to its multilingual capability and robust performance on sequence labeling tasks. Its architecture builds upon RoBERTa, using dynamic masking, larger batch sizes, and longer training durations to enhance contextual learning. Subword-aware alignment strategies ensured token-to-label consistency even with tokenized subwords. Model training employed a cross-entropy loss function optimized for token classification, with dynamic batching and padding. We trained over multiple epochs, evaluating regularly on a validation set using metrics such as precision, recall, and F1-score. Regularization techniques like weight decay and learning rate scheduling helped mitigate overfitting. By training on disease and chemical annotations jointly, the model developed rich biomedical representations. These representations were foundational for our subsequent transfer learning to Hindi, enabling the model to generalize to lowresource biomedical text. 3) Cross-Lingual Transfer: Translating and Adapting to Hindi Using XLM-RoBERTa and IndicTrans2: To extend our high-resource English biomedical NER system to the low-resource Hindi domain, we implemented a cross-lingual transfer learning strategy that leverages the robust annotations available in English and adapts them for Hindi. This strategy involves translating the preprocessed English dataset into Hindi while maintaining label alignment, and then finetuning an English-trained XLM-RoBERTa model on the Hinditranslated data. Initially, the English corpus, which is annotated in the BIO tagging format, is carefully processed to generate tokenlabel pairs with precise offset mappings that preserve entity boundaries. These annotated sentences are then translated into Hindi using IndicTrans2 [3], a multilingual model with strong support for Indian languages. The translation process is designed to maintain the one-to-one correspondence between English tokens and their Hindi equivalents. By leveraging offset mappings during tokenization and subword segmentation, we ensure that each translated Hindi token retains its corresponding BIO label, thereby preserving the integrity of entity annotations in the new language. Once the translation and label alignment are complete, we fine-tune an XLM-RoBERTa model — originally trained on extensive English biomedical datasets such as the NCBI Disease Corpus and BC5CDR Chemical annotations — using the Hindi-translated dataset. XLM-RoBERTa’s architecture, which is pre-trained on over 100 languages and built to deliver robust, language-agnostic representations, is particularly effective for cross-lingual transfer. The model’s shared multilingual vocabulary and contextual embeddings enable it to exploit the semantic and syntactic similarities between English and Hindi. Given that only a limited amount of annotated Hindi data is available, this fine-tuning process effectively operates as a few-shot learning scenario. The model utilizes the transferable knowledge it acquired from English to accurately recognize medical entities in Hindi, even with minimal native supervision. Through this cross-lingual transfer process, our system not only adapts high-quality biomedical entity recognition capabilities to Hindi, but also demonstrates the feasibility of employing few-shot learning techniques in low-resource settings. The combined use of IndicBERT for high-fidelity translation and XLM-RoBERTa for multilingual fine-tuning provides a scalable and effective framework for developing biomedical NER systems in Indian languages, addressing the critical data scarcity challenge in this domain. 4) Entity Type Classification using Embedding-Based Similarity Search: After identifying entities in Hindi text, we performed semantic type classification—distinguishing between diseases, symptoms, and consumables—using embedding-based similarity search. We first constructed a curated database of Hindi medical terms categorized into three types: diseases, symptoms, and consumables. These entities were gathered from public medical resources, including the Hindi Health Dataset (HHD) [4], [5], and were manually verified for accuracy. The HHD corpus was collected from Indian health-related websites and enriched with gazetteer lists for Person, Disease, Consumable, and Symptom entities. Each entry in our final database included the term and its corresponding type. To represent the semantics of each term, we used IndicBERT to generate contextualized embeddings. This ensured consistency with the embeddings used during model inference. The database embeddings were indexed using FAISS (Facebook AI Similarity Search), allowing fast nearest-neighbor retrieval in high-dimensional vector space. When the NER model extracted an entity span from a Hindi sentence, it was embedded using IndicBERT and queried against the FAISS index. The closest match in the embedding space determined the type label (disease, symptom, or consumable) for the extracted entity. This method provides a flexible and scalable solution for semantic typing without requiring joint training on multiple entity types. It leverages semantic alignment in the embedding space to support accurate classification even in low-resource scenarios. B. Cross-Lingual Transfer Learning for Question Answering (QA) 1) Data Preprocessing: To enable effective cross-lingual transfer learning for medical question answering in Hindi, we began by preprocessing the MedQuAD dataset [6], a largescale English QA corpus in the medical domain curated from authoritative sources such as MedlinePlus and the NIH. Since Hindi is a low-resource language with limited high-quality QA data available, we focused on enhancing the quality and diversity of the English dataset to improve downstream transfer performance. The first step involved identifying and consolidating duplicate or semantically similar questions. We employed a TFIDF-based cosine similarity approach to compute pairwise similarity scores between all question entries in the dataset. Using a threshold of 0.95, questions exceeding this similarity level were considered duplicates. For each group of such similar questions, we selected a canonical representation and merged all associated answers into a single entry. This consolidation preserved diverse answer formulations while reducing redundancy and preventing potential data leakage. Metadata such as the number of merged answers and a consolidation flag were also retained for each entry. Following the deduplication step, we applied comprehensive text normalization techniques to both the questions and their corresponding answers. This included removing bracketed or parenthetical content, eliminating special characters (excluding meaningful punctuation), and reducing excessive whitespace. All text was lowercased to ensure consistency, and common English contractions were expanded (e.g., don’t was converted to do not) to reduce lexical variation. Additionally, questions containing duplicated tokens or phrases were cleaned using a custom consolidation function to retain only the unique components while ensuring grammatical completeness and appropriate punctuation. The resulting corpus ensured reduced redundancy, high lexical consistency, and diverse yet relevant answer sets, laying a strong foundation for effective crosslingual adaptation in the downstream pipeline. 2) Model Training on the English MedQuAD Dataset: To develop a strong foundation for medical question answering, we fine-tuned a large-scale language model on a curated subset of the English MedQuAD dataset. The BLOOM-3B model, a multilingual autoregressive transformer, was selected for its generative capabilities and proven effectiveness across diverse tasks. Given the computational demands associated with full-model fine-tuning, we employed Low-Rank Adaptation (LoRA), a parameter-efficient method that enables targeted fine-tuning of key model components, thereby reducing memory overhead while maintaining performance. The training data consisted of approximately 11,000 highquality question-answer pairs derived from the cleaned and deduplicated MedQuAD corpus. Each sample was transformed into an instruction-style prompt, where the input followed the format: Question: <question> Answer: <answer> This prompt structure encouraged the model to learn the underlying task of generating medically relevant answers based on natural language queries. The data was tokenized using BLOOM’s tokenizer, applying truncation and padding to a maximum length of 512 tokens. During preprocessing, labels were constructed such that the tokens corresponding to the input question were masked out, guiding the model to focus learning on answer generation. Training was performed in a GPU environment using mixedprecision (fp16) and gradient checkpointing to optimize memory usage. The model was trained over three epochs using the AdamW optimizer and a linear learning rate scheduler, with a learning rate set to 2 × 10−5 . A batch size of 8 and gradient accumulation of 4 steps were used to simulate a larger effective batch size, which helped stabilize updates. The dataset was split into training and validation sets in a 90:10 ratio, and performance was evaluated periodically during training, with checkpoints saved based on validation metrics. LoRA was configured with a rank of 16 and a scaling factor of 32, targeting the attention-related components of the model (specifically the query_key_value layers). The base model was first prepared for low-bit training, then adapted using the LoRA configuration to fine-tune a small subset of parameters efficiently without altering the entire model. To qualitatively assess performance, we implemented an inference routine capable of generating answers to previously unseen questions. Early evaluations demonstrated that the model produced fluent and medically coherent responses, validating the success of the fine-tuning process. This English-trained BLOOM model forms the foundation for subsequent cross-lingual transfer to Hindi, where knowledge gained from high-resource English data is leveraged to enhance performance in a low-resource target language setting. 3) Translating English Data to Hindi using IndicTrans2: To facilitate cross-lingual transfer learning for medical question answering, we translated the English MedQuAD dataset into Hindi using a state-of-the-art neural machine translation (NMT) model. This step was essential to generate high-quality Hindi question-answer pairs, enabling finetuning in a low-resource language setting. We utilized the indictrans2-dist-en-indic-200M [3] model—a lightweight version of IndicTrans2—optimized for English to Indic language translation. The translation pipeline was designed to handle both short and long texts effectively. For short inputs (e.g., individual questions or brief answers), translation was performed directly. In contrast, longer texts were processed by segmenting them into overlapping chunks, each translated separately and then recombined to preserve semantic continuity. This chunking strategy was especially important for maintaining coherence in longer answer passages. We used the IndicProcessor module to perform languagespecific preprocessing and postprocessing operations, such as Unicode normalization, punctuation standardization, and script conversion. Inputs were first preprocessed and tokenized using the model’s tokenizer, then passed through the encoderdecoder architecture to generate translations. Decoding was performed using beam search with constraints (e.g., length penalty and repetition penalty) to ensure fluent and non- redundant outputs. Translations were subsequently postprocessed to yield clean, readable Hindi text. For the translation of the MedQuAD dataset, a parallel translation strategy was implemented over batches of questionanswer pairs.This translation phase produced a Hindi version of the MedQuAD dataset, preserving the original instructionstyle QA format while enabling the model to be exposed to medically relevant content in the target language. These translated examples served as training data for the next stage, where the English-trained BLOOM model was adapted to perform Hindi medical question answering via continued finetuning. 4) Fine-Tuning BLOOM on Hindi Translations: Following the initial training of the BLOOM model on the English MedQuAD dataset, we extended our cross-lingual transfer learning pipeline by fine-tuning the model on Hindi. The goal of this phase was to enable the BLOOM architecture to generate accurate, fluent, and medically relevant answers in Hindi—a low-resource language in the biomedical NLP domain—by transferring the knowledge acquired during English training to the translated Hindi dataset. Instead of full-parameter fine-tuning, we employed parameter-efficient fine-tuning using the Low-Rank Adaptation (LoRA) method to minimize computational requirements while allowing the model to adapt effectively to the Hindi language. A new LoRA adapter was initialized with a reduced rank configuration and a dropout layer to increase regularization. This choice was made to avoid overfitting, especially given the relatively limited size of the Hindi QA dataset. The model was configured with a LoRA rank of 8, a LoRA scaling factor (alpha) of 16, and a dropout probability of 0.2. These parameters were applied to core transformer layers such as query_key_value, dense, and intermediate_dense layers. For training, we utilized the translated version of the MedQuAD dataset prepared in the earlier stage, which contained aligned question-answer pairs in Hindi. The dataset was filtered to remove samples with extremely short or excessively long inputs and outputs to maintain a clean distribution and avoid extreme outliers. Specifically, questions shorter than 10 characters or longer than 500, and answers shorter than 20 characters or longer than 1000, were excluded. After cleaning, the dataset was split into training and validation subsets in a 90:10 ratio using a stratified sampling technique based on answer length buckets to ensure that both subsets maintained a comparable distribution of answer lengths. To format the training data for instruction-style learning, we adopted a simple prompting scheme where each input sequence began with the Hindi prefix “:” followed by the question, and then “:” preceding the answer. This format was selected to clearly demarcate the question and answer, providing the model with an implicit structure to follow during generation. During preprocessing, the input text was tokenized, and a custom labeling strategy was applied to ensure that only the tokens corresponding to the answer (following “:”) contributed to the loss during training. This technique encour- aged the model to focus specifically on generating the correct answer rather than reconstructing the entire question. We used a low learning rate of 5 × 10−6 , a cosine learning rate scheduler with a warm-up phase, and set the number of training epochs to three. The model was trained with a batch size per device of 2, and gradient accumulation was used to simulate a larger batch size, thus stabilizing optimization. The evaluation was carried out more frequently setting the evaluation step interval to 100, allowing for closer monitoring of the training performance. We incorporated early stopping with patience for three evaluation steps, enabling automatic stopping of training if no improvement in validation loss was observed. In practice, the fine-tuned model demonstrated a marked ability to generate coherent and medically meaningful answers in Hindi, indicating a successful transfer of domain-specific knowledge from English to Hindi. This phase of fine-tuning played a critical role in enabling our pipeline to function effectively in a low-resource language setting, validating the utility of cross-lingual transfer learning in the context of medical question answering. 5) Retrieval-Augmented Generation (RAG) Integration: To further enhance the model’s capability to generate informed and contextually accurate responses, we integrated a Retrieval-Augmented Generation (RAG) mechanism into our pipeline. In our approach, a comprehensive combined database was first constructed by merging multiple sources, including approximately 16,000 PubMed articles [7], Hindi Health Data [4], [5] containing detailed information such as disease descriptions, symptoms, causes, treatments, and home remedies, as well as question-answer pairs from both Hindi and English MedQuAD datasets. This merged corpus provided a rich and diverse medical knowledge base in both English and Hindi, ensuring that the QA system could draw on a wide spectrum of contextually relevant information. Once assembled, the combined dataset was processed to generate dense vector embeddings for each document using a multilingual sentence transformer model, which efficiently captured the semantic information present in the text. The resulting embeddings were aggregated into a FAISS index, a high-performance similarity search framework designed for scalable retrieval. Complementary metadata—including document IDs, titles, language identifiers, and content details—was stored alongside these embeddings to facilitate effective retrieval. During inference, when a medical query in Hindi is submitted, the system computes an embedding for the query and searches the FAISS index to retrieve the most semantically similar documents. The retrieval mechanism prioritizes Hindilanguage content, while supplementing with English documents when necessary, ensuring that the provided context is both relevant and sufficiently comprehensive. The retrieved documents are then formatted into a coherent context block, with each document’s title and content clearly demarcated and separated by visual delimiters to enhance readability. This context block is incorporated into an enhanced prompt that is fed into the fine-tuned BLOOM model. The prompt explicitly instructs the model, in Hindi, to generate a detailed and scientifically accurate answer based on the provided context. By augmenting the generation process with dynamically retrieved, domain-specific knowledge, this RAG framework improves both the factual accuracy and the relevance of the responses. Post-processing and validation routines are applied to ensure that the final output is well-structured, avoids redundancy, and adheres to established medical quality standards. In summary, the integration of the FAISS-based retrieval mechanism with our fine-tuned language model enables a robust RAG system that effectively combines pre-trained language understanding with dynamically acquired, contextrich information, thereby significantly enhancing the overall performance of the medical question-answering system. IV. C ONCLUSION In this work, we proposed a two-pronged cross-lingual transfer learning solution tailored to the Hindi language—an under-resourced yet widely spoken language in the medical domain. Our methodology first demonstrated that multilingual pretraining and effective adaptation enable robust Named Entity Recognition (NER), even with limited parallel data. By leveraging high-quality English biomedical corpora and translating them to Hindi, our approach addressed the scarcity of annotated data in Hindi. The integration of a cosinesimilarity-based classification further allowed for efficient entity categorization into medically relevant classes, including diseases, symptoms, and consumables. Subsequently, we extended the cross-lingual paradigm to develop a Hindi Question Answering (QA) system grounded in advanced transformer architectures (XLM-RoBERTa and BLOOM). Through strategic use of translation (IndicTrans2) and parameter-efficient fine-tuning (LoRA), we were able to adapt knowledge from an English QA corpus to Hindi effectively. Additionally, the incorporation of a RetrievalAugmented Generation (RAG) mechanism significantly improved the factual accuracy of generated responses by providing dynamic access to a multilingual repository of medical documents. Our findings underscore the promise of cross-lingual transfer learning in enabling accurate and context-rich NLP solutions for languages with limited annotated resources. The evaluations confirm that the approach retains robust performance in both NER and QA settings, paving the way for broader deployment of medical NLP applications among Hindi-speaking communities. V. R ESULTS To evaluate the effectiveness of our cross-lingual NER approach, we fine-tuned a custom NERModel architecture built on top of XLM-R and tested it on a Hindi medical NER dataset. The evaluation was conducted using standard token-level classification metrics: precision, recall, F1-score, and accuracy. We report both overall performance and perentity analysis. A. Overall Performance The model achieved the following performance on the heldout Hindi test dataset: • Accuracy: 84.65% • Precision: 85.16% • Recall: 84.65% • F1-Score: 84.79% These metrics reflect the model’s ability to accurately detect and classify entities in biomedical Hindi text, despite being originally pretrained on English corpora. B. Label Space and Evaluation Overview The model was trained to identify five distinct labels using the BIO tagging scheme: • B-Disease: Beginning of a disease entity • I-Disease: Inside a disease entity • B-Entity: Beginning of a generic medical entity • I-Entity: Inside a generic medical entity • O: Outside any named entity Despite being trained for this full label space, the evaluation results show that the model struggled a little identify actual entity boundaries, indicating a need for further data refinement and rebalancing. C. Error Analysis Most of the misclassifications occurred due to ambiguous tagging between B-Disease and B-Entity, particularly in phrases involving anatomical or symptom-related terms. Additionally, long entities with subword splits contributed to label alignment challenges. D. Cross-Lingual Implications The promising results validate the capability of transformerbased multilingual encoders like XLM-R in zero-shot or lowresource settings. Our use of custom tag mapping allowed effective adaptation from general medical NER tags to domainspecific categories in Hindi, supporting the feasibility of crosslingual transfer for medical NLP tasks. E. Question Answering Evaluation To evaluate the question answering capabilities of our finetuned BLOOM-1B model on machine-translated Hindi medical QA pairs, we used a custom clinical evaluation pipeline on a subset of 1000 samples from the MedQuad-derived Hindi dataset. The responses were generated using a structured prompting strategy that enforced factuality, coherence, and clinical disclaimers. Evaluation Metrics: We employed a combination of automatic linguistic metrics and domain-specific clinical safety checks: • BLEU Score and Smoothed BLEU: Measure ngram overlap between generated and reference answers. Smoothed BLEU improves stability over short sequences. • ROUGE-L: Evaluates longest common subsequences, capturing fluency and structure matching. TABLE I Q UESTION A NSWERING E VALUATION R ESULTS ON 1000 H INDI QA PAIRS Metric Concept Recall Rate Safety Issues per Answer Critical Risk Answers Exact Match Average BLEU Score Average Smoothed BLEU Score Average ROUGE-L Score Average BERTScore (F1) global reach of medical NLP solutions to better serve diverse language communities and healthcare contexts. Score R EFERENCES 53.36% 0.00 0.00% 2.40% 0.0314 0.0368 0.2123 0.6958 [1] R. I. Doğan, R. Leaman, and Z. Lu, “NCBI disease corpus: a resource for disease name recognition and concept normalization,” Journal of Biomedical Informatics, vol. 47, pp. 1–10, 2014. [2] J. Li, Y. Sun, R. J. Johnson, D. Sciaky, C.-H. Wei, R. Leaman, A. P. Davis, C. J. Mattingly, T. C. Wiegers, and Z. Lu, “BioCreative V CDR task corpus: a resource for chemical disease relation extraction,” Database: The Journal of Biological Databases and Curation, vol. 2016, 2016. [3] J. Gala, P. A. Chitale, A. K. Raghavan, V. Gumma, S. Doddapaneni, A. K. M, J. A. Nawale, A. Sujatha, R. Puduppully, V. Raghavan, P. Kumar, M. M. Khapra, R. Dabre, and A. Kunchukuttan, “IndicTrans2: Towards High-Quality and Accessible Machine Translation Models for all 22 Scheduled Indian Languages,” Transactions on Machine Learning Research, 2023. [4] A. Jain and A. Arora, “Named Entity Recognition in Hindi Using Hyperspace Analogue to Language and Conditional Random Field,” Pertanika Journal of Science and Technology, UPM, vol. 26, no. 4, pp. 1801–1822, 2018. [5] A. Jain, D. K. Tayal, and A. Arora, “OntoHindi NER – An Ontology Based Novel Approach For Hindi Named Entity Recognition,” International Journal of Artificial Intelligence, vol. 16, no. 2, pp. 1–36, 2018. [6] A. Ben Abacha and D. Demner-Fushman, “A Question-Entailment Approach to Question Answering,” BMC Bioinformatics, vol. 20, no. 1, pp. 511:1–511:23, 2019. [7] D. R. Sherman, L. H. Lipscomb, and D. R. Masys, “The NCBI PubMed system,” Medical Reference Services Quarterly, vol. 18, no. 4, pp. 25– 36, 1999. [8] S. Smith et al., “Advances in NLP for Biomedical Applications,” Journal of Biomedical Informatics, vol. 105, pp. 103–120, 2020. [9] R. Gupta and P. Kumar, “Challenges of Low-Resource Languages in Biomedical NLP,” International Journal of Language and Information, vol. 8, no. 2, pp. 45–60, 2018. [10] M. Patel et al., “Barriers in Deploying NLP for Global Healthcare,” IEEE Journal of Biomedical and Health Informatics, vol. 23, no. 3, pp. 1294–1301, 2019. [11] S. Ruder, I. Vulic, and A. Søgaard, “A Survey of Cross-lingual Word Embedding Models,” Journal of Artificial Intelligence Research, vol. 65, pp. 569–630, 2019. BERTScore (F1): Computes semantic similarity using contextual embeddings, especially useful in paraphrased or loosely matched responses. • Concept Recall: Measures how well the model captures key medical concepts from the reference answers. • Safety Metrics: Detects usage of unsafe advice or highrisk language (e.g., dosage without clinical context). • Exact Match: Binary metric for strict string-level correctness. • Results: Table I summarizes the QA performance of the finetuned BLOOM-1B model. Notably, no critical safety issues were identified, and the model maintained reasonable semantic alignment despite low lexical overlap scores. These results indicate that while surface-level similarity with references was low (BLEU/ROUGE), the generated answers captured semantically meaningful and medically relevant content (as reflected by BERTScore and concept recall), with a strong emphasis on safety and factual integrity. VI. F UTURE W ORK While our cross-lingual transfer learning approach has yielded promising results in Hindi medical text processing, several avenues remain for further exploration: • Domain Generalization: Incorporating additional medical sub-domains—such as radiology, clinical trial data, or mental health—could bolster the system’s coverage and applicability, further enhancing the breadth of medical services provided. • Unified Multilingual Framework: Extending our pipeline to other low-resource Indic languages (e.g., Bengali, Marathi, Telugu) would maximize the impact of cross-lingual transfer, enabling parallel advancements across various language pairs. • VLLM Integration for Cross-Lingual Capabilities: Investigating the use of VLLM to enhance question answering in low-resource languages could provide a means to achieve more dynamic and contextually sensitive responses. This integration would enable interactive querying and facilitate more natural language understanding in resource-constrained settings. Through these avenues, we aim to build on the strong foundation laid by our present work, ultimately advancing the
0
You can add this document to your study collection(s)
Sign in Available only to authorized usersYou can add this document to your saved list
Sign in Available only to authorized users(For complaints, use another form )