Arabic machine reading comprehension on the Holy Qur'an using CL-AraBERT

Malhas, Rana; Elsayed, Tamer

Author	Malhas, Rana
Author	Elsayed, Tamer
Date available	2024-11-05T06:05:19Z
Publication date	2022
Publication name	Information Processing and Management
Source	Scopus
Identifier	http://dx.doi.org/10.1016/j.ipm.2022.103068
ISSN	0306-4573
URI	http://hdl.handle.net/10576/60875
Abstract	In this work, we tackle the problem of machine reading comprehension (MRC) on the Holy Qur'an to address the lack of Arabic datasets and systems for this important task. We construct QRCD as the first Qur'anic Reading Comprehension Dataset, composed of 1,337 question-passage-answer triplets for 1,093 question-passage pairs, of which 14% are multi-answer questions. We then introduce CLassical-AraBERT (CL-AraBERT for short), a new AraBERT-based pre-trained model, which is further pre-trained on a Classical Arabic (CA) dataset of about 1.0B words, to complement the Modern Standard Arabic (MSA) resources used in pre-training the initial model and make it a better fit for the task. Finally, we leverage cross-lingual transfer learning from MSA to CA, and fine-tune CL-AraBERT as a reader using two MSA-based MRC datasets followed by our QRCD dataset to constitute the first (to the best of our knowledge) MRC system on the Holy Qur'an. To evaluate our system, we introduce Partial Average Precision (pAP) as an adapted version of the traditional rank-based Average Precision measure, which integrates partial matching in the evaluation over multi-answer and single-answer MSA questions. Adopting two experimental evaluation setups (hold-out and cross-validation (CV)), we empirically show that the fine-tuned CL-AraBERT reader model significantly outperforms the baseline fine-tuned AraBERT reader model by 6.12 and 3.75 points in pAP scores in the hold-out and CV setups, respectively. To promote further research on this task and other related tasks on Qur'an and Classical Arabic text, we make both the QRCD dataset and the pre-trained CL-AraBERT model publicly available.
Sponsor	We would like to thank TensorFlow Research Cloud (TFRC) for their valuable support in providing us with free access to their cloud TPUs. This research work was partially funded by Qatar University through grant number QUST-2-CENG-2020-20. Open Access funding is provided by the Qatar National Library.
Language	en
Publisher	Elsevier
Subject	Answer extraction; Classical Arabic; Cross-lingual transfer learning; Partial matching evaluation; Pre-trained language models; Reading comprehension
Title	Arabic machine reading comprehension on the Holy Qur'an using CL-AraBERT
Type	Article
Issue number	6
Volume number	59
dc.accessType	Open Access
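
The abstract describes Partial Average Precision (pAP) only as a rank-based Average Precision variant that integrates partial matching; it does not give the formula. The following is a minimal sketch of that idea under stated assumptions, not the authors' definition: the partial-match score is assumed to be whitespace-token F1 against the best-matching gold answer, and the normalizer is assumed to be the larger of the number of gold answers and the number of matching predictions. The function names `token_f1` and `partial_average_precision` are hypothetical.

```python
from collections import Counter


def token_f1(pred: str, gold: str) -> float:
    """Token-overlap F1 between two answers (assumed partial-match score)."""
    p, g = pred.split(), gold.split()
    if not p or not g:
        return 0.0
    overlap = sum((Counter(p) & Counter(g)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(p)
    recall = overlap / len(g)
    return 2 * precision * recall / (precision + recall)


def partial_average_precision(ranked_preds, gold_answers):
    """Rank-based AP in which each prediction earns partial credit.

    Assumption: a prediction's credit is its best token F1 over all gold
    answers, and partial precision at rank k is the mean credit of the
    top-k predictions. This is an illustrative reading of the abstract.
    """
    if not ranked_preds or not gold_answers:
        return 0.0
    scores = [max(token_f1(p, g) for g in gold_answers) for p in ranked_preds]
    total, running, hits = 0.0, 0.0, 0
    for k, m in enumerate(scores, start=1):
        running += m
        if m > 0:  # only ranks with some match contribute, as in plain AP
            total += running / k
            hits += 1
    # Assumed normalizer: keeps the score in [0, 1] and penalizes
    # multi-answer questions whose gold answers are not all retrieved.
    return total / max(hits, len(gold_answers))
```

With exact matches only, this reduces to ordinary Average Precision: a single correct answer at rank 2 scores 0.5, and the same answer at rank 1 scores 1.0.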

Files in this record

Name: 1-s2.0-S0306457322001704-main.pdf
Size: 6.822 MB
Format: PDF


This record appears in the following collections

Computer Science and Engineering [2522 items]
