Intelligent topic selection for low-cost information retrieval evaluation: A New perspective on deep vs. shallow judging

Kutlu M.; Elsayed T.; Lease M.

Author	Kutlu M.
Author	Elsayed T.
Author	Lease M.
Date available	2020-04-07T11:46:17Z
Publication date	2018
Publication name	Information Processing and Management
Source	Scopus
ISSN	0306-4573
URI	http://dx.doi.org/10.1016/j.ipm.2017.09.002
URI	http://hdl.handle.net/10576/13892
Abstract	While test collections provide the cornerstone for Cranfield-based evaluation of information retrieval (IR) systems, it has become practically infeasible to rely on traditional pooling techniques to construct test collections at the scale of today's massive document collections (e.g., ClueWeb12's 700M+ Webpages). This has motivated a flurry of studies proposing more cost-effective yet reliable IR evaluation methods. In this paper, we propose a new intelligent topic selection method which reduces the number of search topics (and thereby costly human relevance judgments) needed for reliable IR evaluation. To rigorously assess our method, we integrate previously disparate lines of research on intelligent topic selection and deep vs. shallow judging (i.e., whether it is more cost-effective to collect many relevance judgments for a few topics or a few judgments for many topics). While prior work on intelligent topic selection has never been evaluated against shallow judging baselines, prior work on deep vs. shallow judging has largely argued for shallow judging, but assuming random topic selection. We argue that for evaluating any topic selection method, ultimately one must ask whether it is actually useful to select topics, or whether one should simply perform shallow judging over many topics. In seeking a rigorous answer to this over-arching question, we conduct a comprehensive investigation over a set of relevant factors never previously studied together: 1) method of topic selection; 2) the effect of topic familiarity on human judging speed; and 3) how different topic generation processes (requiring varying human effort) impact (i) budget utilization and (ii) the resultant quality of judgments.
Experiments on NIST TREC Robust 2003 and Robust 2004 test collections show that not only can we reliably evaluate IR systems with fewer topics, but also that: 1) when topics are intelligently selected, deep judging is often more cost-effective than shallow judging in evaluation reliability; and 2) topic familiarity and topic generation costs greatly impact the evaluation cost vs. reliability trade-off. Our findings challenge conventional wisdom in showing that deep judging is often preferable to shallow judging when topics are selected intelligently.
Sponsor	This work was made possible by NPRP grant # NPRP 7-1313-1-245 from the Qatar National Research Fund (a member of Qatar Foundation). We thank the Texas Advanced Computing Center (TACC) at the University of Texas at Austin for computing resources enabling this research. Qatar Foundation
Language	en
Publisher	Elsevier Ltd
Subject	information retrieval (IR)
Title	Intelligent topic selection for low-cost information retrieval evaluation: A New perspective on deep vs. shallow judging
Type	Article
Pages	37-59
Issue number	1
Volume number	54
dc.accessType	Abstract Only

Files in this item

Files	Size	Format	View
There are no files associated with this item.

This item appears in the following collections

Computer Science and Engineering [2485 items]
