Multilingual Extractive Question Answering With ConfliBERT for Political and Social Science Studies
Political conflict and violence have emerged as prominent concerns for political scientists in both academia and policy circles. The overwhelming influx of complex and dense news makes it increasingly difficult to monitor and analyze political events effectively. To address this challenge and advance conflict research, we introduce ConfliBERT English and ConfliBERT Spanish. These two domain-specific pre-trained language models are designed for the analysis of political conflict and violence and are fine-tuned for extractive question answering, a task that is not susceptible to hallucination because every answer is a span taken from the source text. We pre-trained the ConfliBERT models on our comprehensive conflict-specific corpus, drawn from diverse sources. To evaluate ConfliBERT on extractive question answering, we fine-tuned it on SQuAD v1.1 and NewsQA, two large question-answering datasets. Additionally, we created ConfliQA English and ConfliQA Spanish, two crowd-sourced evaluation datasets for conflict-domain extractive QA. Through extensive experimentation and evaluation on all versions of ConfliBERT English and Spanish, we show that ConfliBERT English outperforms the BERT English baseline models on political texts, and we provide detailed insight into further developing ConfliBERT for low-resource languages.