SparkIR: a Scalable Distributed Information Retrieval Engine over Spark

Al-Rasbi, Sara Yaqoob

Advisor	Elsayed, Tamer
Author	Al-Rasbi, Sara Yaqoob
Date available	2020-02-04T10:31:52Z
Publication date	2020-01
URI	http://hdl.handle.net/10576/12667
Abstract	Search engines must handle huge amounts of data (e.g., billions of documents in the case of the Web) and find scalable and efficient ways to produce effective search results. In this thesis, we propose to use the Spark framework, an in-memory distributed big data processing framework, and leverage its capabilities for handling large amounts of data to build an efficient and scalable experimental search engine over textual documents. The proposed system, SparkIR, can serve as a research framework for conducting information retrieval (IR) experiments. SparkIR supports two indexing schemes, document-based partitioning and term-based partitioning, to adopt document-at-a-time (DAAT) and term-at-a-time (TAAT) query evaluation methods. Moreover, it offers static and dynamic pruning to improve retrieval efficiency. For static pruning, it employs champion lists and tiering, while for dynamic pruning, it uses MaxScore top-k retrieval. We evaluated the performance of SparkIR using the ClueWeb12-B13 collection, which contains about 50M English Web pages. Experiments over different subsets of the collection, compared against an Elasticsearch baseline, show that SparkIR exhibits reasonable efficiency and scalability overall for both indexing and retrieval. Because SparkIR is implemented as an open-source library over Spark, its users can also benefit from other Spark libraries (e.g., MLlib and GraphX), which, therefore, eliminates the need of using
Language	en
Subject	information retrieval (IR); document-at-a-time (DAAT); term-at-a-time (TAAT)
Title	SparkIR: a Scalable Distributed Information Retrieval Engine over Spark
Type	Master Thesis
Discipline	Computing
dc.accessType	Open Access

Files in this item

Name: Sara Al-Rasbi_OGS Approved ...
Size: 1.846Mb
Format: PDF

This item appears in the following collections

Computing [103 items]
