ENABLING EFFECTIVE ARABIC INFORMATION RETRIEVAL ON THE WEB AND SOCIAL MEDIA
Arabic is one of the most dominant languages on the Web and social media. The huge and ever-growing volume of Arabic user-generated content, further fueled by the ongoing political unrest in the region, has created an immense need for Information Retrieval (IR) systems that support users in consuming and analyzing Arabic content at such scale. In the past decade, tasks like ad hoc retrieval, event detection, document summarization, and fake news detection have become of great importance to Arab users. However, research on developing IR systems for these tasks over Arabic content is severely lacking compared to higher-resource languages like English. This dissertation argues that the main reason behind the slow progress in the development of Arabic IR systems is the lack of language resources. In particular, there is a severe shortage of the standardized, large-scale, and representative test collections and annotated datasets needed for system training and evaluation. The main goal of this dissertation is to motivate research on Arabic IR by providing necessary evaluation resources, baseline systems, and alternative approaches to training and evaluation of IR systems. To that end, two IR tasks were identified as important and underdeveloped for Arabic content, namely ad hoc retrieval and misinformation detection. Each task was investigated over two domains: the Web and social media (Twitter in particular). For the ad hoc retrieval task, an approach for constructing test collections without the need for a shared-task evaluation campaign is proposed. As a result, two large-scale, manually annotated test collections were constructed starting from recent snapshots of the Arabic Web and the Arabic Twittersphere. Moreover, state-of-the-art retrieval models that were previously tested over English content were benchmarked over the new test collections, providing baseline performance for future systems.
The constructed test collections were shown to include high-quality annotations, motivating the creation of similar test collections for other problems and domains at relatively low cost. As for the misinformation detection problem, I focus on two components that are usually part of the claim verification pipeline followed to address it: (1) claim check-worthiness identification, and (2) evidence retrieval for verification. Claim check-worthiness detection is the problem of identifying claims that should be prioritized for verification. Once a claim is selected for verification, evidence retrieval involves searching for documents that contain information supporting or refuting the claim. This thesis describes the process of creating the first Arabic annotated datasets for the two tasks. Furthermore, for claim check-worthiness detection, studied within the social media domain, I extensively study whether an effective system for the task can be trained without creating a dedicated Arabic training dataset. To that end, I consider cross-lingual transfer learning, where a supervised model trained on non-Arabic data is applied to an Arabic test set. The study demonstrates that cross-lingual transfer learning from some languages to Arabic achieves performance comparable to that of monolingual models trained exclusively on Arabic. For evidence retrieval, I study the suitability of relying on topical relevance as the main approach to evaluating the task in the Web domain. Moreover, I run an extended study on the effectiveness of Web search systems in retrieving documents that contain evidence, as opposed to documents that are merely topically relevant to a claim. My study shows that pages (retrieved by a commercial search engine) that are topically relevant to a claim are not always useful for verifying it. Given this finding, I investigate and identify characteristics or features specific to evidential pages.
Furthermore, preliminary experiments show that a supervised evidential-page retrieval model that employs these features achieves a 5.3% increase in recall of evidential pages over the search engine.