DART: A large dataset of dialectal Arabic tweets

Alsarsour, Israa; Mohamed, Esraa; Suwaileh, Reem; Elsayed, Tamer

Author	Alsarsour, Israa
Author	Mohamed, Esraa
Author	Suwaileh, Reem
Author	Elsayed, Tamer
Date available	2020-07-16T20:11:04Z
Publication date	2019
Publication name	LREC 2018 - 11th International Conference on Language Resources and Evaluation
Source	Scopus
URI	https://www.scopus.com/inward/record.uri?eid=2-s2.0-85059884453&partnerID=40&md5=b3a515e144e74a7cb819868d62d1b814
URI	http://hdl.handle.net/10576/15265
Abstract	In this paper, we present a new, large, manually annotated multi-dialect dataset of Arabic tweets that is publicly available. The Dialectal ARabic Tweets (DART) dataset has about 25K tweets that were annotated via crowdsourcing, and it is well balanced over five main groups of Arabic dialects: Egyptian, Maghrebi, Levantine, Gulf, and Iraqi. The paper outlines the pipeline for constructing the dataset, from crawling tweets that match a list of dialect phrases to having the crowd annotate the tweets. We also touch on some challenges that we faced during the process. We evaluate the quality of the dataset from two perspectives: the inter-annotator agreement and the accuracy of the final labels. Results show that both measures were substantially high for the Egyptian, Gulf, and Levantine dialect groups, but lower for the Iraqi and Maghrebi dialects, which indicates the difficulty of identifying those two dialects manually and hence automatically.
Sponsor	This work was made possible by NPRP grant# NPRP 7-1313-1-245 from the Qatar National Research Fund (a member of Qatar Foundation). The statements made herein are solely the responsibility of the authors. The work was also supported by grant QUST-CENG-SPR-2017-21 from the College of Engineering at Qatar University.
Language	en
Publisher	European Language Resources Association (ELRA)
Subject	Annotations Arabic Corpus Crowdsourcing Multi-Dialect Twitter
Title	DART: A large dataset of dialectal Arabic tweets
Type	Conference
dc.accessType	Abstract Only

Files in this record

Files	Size	Format	View
There are no files associated with this record.

This record appears in the following collections

Computer Science and Engineering [2426 items]
